Sample records for large statistical sample

  1. [Effect sizes, statistical power and sample sizes in "the Japanese Journal of Psychology"].

    PubMed

    Suzukawa, Yumi; Toyoda, Hideki

    2012-04-01

    This study analyzed the statistical power of research studies published in the "Japanese Journal of Psychology" in 2008 and 2009. Sample effect sizes and sample statistical powers were calculated for each statistical test and analyzed with respect to the analytical methods and the fields of the studies. The results show that in fields such as perception, cognition, and learning, effect sizes were relatively large even though sample sizes were small; at the same time, because of the small sample sizes, some meaningful effects could not be detected. In other fields, because of the large sample sizes, even trivial effects were detected as significant. This suggests that researchers who cannot obtain large effect sizes use larger samples to reach statistical significance.

  2. The large sample size fallacy.

    PubMed

    Lantz, Björn

    2013-06-01

    Significance in the statistical sense has little to do with significance in the common practical sense. Statistical significance is a necessary but not a sufficient condition for practical significance. Hence, results that are extremely statistically significant may be highly nonsignificant in practice. The degree of practical significance is generally determined by the size of the observed effect, not the p-value. The results of studies based on large samples are often characterized by extreme statistical significance despite small or even trivial effect sizes. Interpreting such results as significant in practice without further analysis is referred to as the large sample size fallacy in this article. The aim of this article is to explore the relevance of the large sample size fallacy in contemporary nursing research. Relatively few nursing articles display explicit measures of observed effect sizes or include a qualitative discussion of observed effect sizes. Statistical significance is often treated as an end in itself. Effect sizes should generally be calculated and presented along with p-values for statistically significant results, and observed effect sizes should be discussed qualitatively through direct and explicit comparisons with the effects in related literature. © 2012 Nordic College of Caring Science.
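
    The fallacy is easy to demonstrate numerically. The sketch below (ours, not from the article; the group sizes and the trivial true effect of d = 0.02 are illustrative) shows how a practically meaningless effect becomes "extremely statistically significant" once the sample is large enough, which is why the author argues that effect sizes should accompany p-values.

    ```python
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    def cohens_d(a, b):
        """Standardized mean difference using the pooled standard deviation."""
        pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        return (b.mean() - a.mean()) / pooled_sd

    for n in (100, 10_000, 1_000_000):
        a = rng.normal(0.00, 1.0, n)  # control group
        b = rng.normal(0.02, 1.0, n)  # trivial true effect, d = 0.02
        t, p = stats.ttest_ind(a, b)
        print(f"n={n:>9,}  p={p:.2e}  observed d={cohens_d(a, b):+.3f}")
    ```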

  3. A simulative comparison of respondent driven sampling with incentivized snowball sampling--the "strudel effect".

    PubMed

    Gyarmathy, V Anna; Johnston, Lisa G; Caplinskiene, Irma; Caplinskas, Saulius; Latkin, Carl A

    2014-02-01

    Respondent driven sampling (RDS) and incentivized snowball sampling (ISS) are two sampling methods that are commonly used to reach people who inject drugs (PWID). We generated a set of simulated RDS samples on an actual sociometric ISS sample of PWID in Vilnius, Lithuania ("original sample") to assess if the simulated RDS estimates were statistically significantly different from the original ISS sample prevalences for HIV (9.8%), Hepatitis A (43.6%), Hepatitis B (Anti-HBc 43.9% and HBsAg 3.4%), Hepatitis C (87.5%), syphilis (6.8%) and Chlamydia (8.8%) infections and for selected behavioral risk characteristics. The original sample consisted of a large component of 249 people (83% of the sample) and 13 smaller components with 1-12 individuals. Generally, as long as all seeds were recruited from the large component of the original sample, the simulation samples simply recreated the large component. There were no significant differences between the large component and the entire original sample for the characteristics of interest. Altogether 99.2% of 360 simulation sample point estimates were within the confidence interval of the original prevalence values for the characteristics of interest. When population characteristics are reflected in large network components that dominate the population, RDS and ISS may produce samples that have statistically non-different prevalence values, even though some isolated network components may be under-sampled and/or statistically significantly different from the main groups. This so-called "strudel effect" is discussed in the paper. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.

  4. A simulative comparison of respondent driven sampling with incentivized snowball sampling – the “strudel effect”

    PubMed Central

    Gyarmathy, V. Anna; Johnston, Lisa G.; Caplinskiene, Irma; Caplinskas, Saulius; Latkin, Carl A.

    2014-01-01

    Background: Respondent driven sampling (RDS) and Incentivized Snowball Sampling (ISS) are two sampling methods that are commonly used to reach people who inject drugs (PWID). Methods: We generated a set of simulated RDS samples on an actual sociometric ISS sample of PWID in Vilnius, Lithuania (“original sample”) to assess if the simulated RDS estimates were statistically significantly different from the original ISS sample prevalences for HIV (9.8%), Hepatitis A (43.6%), Hepatitis B (Anti-HBc 43.9% and HBsAg 3.4%), Hepatitis C (87.5%), syphilis (6.8%) and Chlamydia (8.8%) infections and for selected behavioral risk characteristics. Results: The original sample consisted of a large component of 249 people (83% of the sample) and 13 smaller components with 1 to 12 individuals. Generally, as long as all seeds were recruited from the large component of the original sample, the simulation samples simply recreated the large component. There were no significant differences between the large component and the entire original sample for the characteristics of interest. Altogether 99.2% of 360 simulation sample point estimates were within the confidence interval of the original prevalence values for the characteristics of interest. Conclusions: When population characteristics are reflected in large network components that dominate the population, RDS and ISS may produce samples that have statistically non-different prevalence values, even though some isolated network components may be under-sampled and/or statistically significantly different from the main groups. This so-called “strudel effect” is discussed in the paper. PMID:24360650

  5. Qualitative Meta-Analysis on the Hospital Task: Implications for Research

    ERIC Educational Resources Information Center

    Noll, Jennifer; Sharma, Sashi

    2014-01-01

    The "law of large numbers" indicates that as sample size increases, sample statistics become less variable and more closely estimate their corresponding population parameters. Different research studies investigating how people consider sample size when evaluating the reliability of a sample statistic have found a wide range of…

  6. How Large Should a Statistical Sample Be?

    ERIC Educational Resources Information Center

    Menil, Violeta C.; Ye, Ruili

    2012-01-01

    This study serves as a teaching aid for teachers of introductory statistics. The aim of this study was limited to determining various sample sizes when estimating population proportion. Tables on sample sizes were generated using a C++ program, which depends on population size, degree of precision or error level, and confidence…
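
    The tables described can be reproduced with the standard sample-size formula for estimating a proportion; the sketch below is our reconstruction under that assumption (the study's actual C++ program is not shown), including a finite population correction since the abstract says the result depends on population size.

    ```python
    from math import ceil
    from statistics import NormalDist

    def sample_size(N, e, confidence=0.95, p=0.5):
        """Sample size to estimate a proportion within +/-e at the given confidence.

        p=0.5 is the conservative worst case; the return line applies the
        finite population correction for a population of size N.
        """
        z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
        n0 = z**2 * p * (1 - p) / e**2
        return ceil(n0 / (1 + (n0 - 1) / N))

    print(sample_size(N=10_000, e=0.05))  # -> about 370
    print(sample_size(N=10_000, e=0.03))  # -> about 965
    ```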

  7. Does size matter? Statistical limits of paleomagnetic field reconstruction from small rock specimens

    NASA Astrophysics Data System (ADS)

    Berndt, Thomas; Muxworthy, Adrian R.; Fabian, Karl

    2016-01-01

    As samples of ever-decreasing size are studied paleomagnetically, care has to be taken that the underlying assumptions of statistical thermodynamics (Maxwell-Boltzmann statistics) are met. Here we determine how many grains, and how large a magnetic moment, a sample needs in order to accurately record an ambient field. It is found that for samples with a thermoremanent magnetic moment larger than 10⁻¹¹ Am², the assumption of a sufficiently large number of grains is usually satisfied. Standard 25 mm diameter paleomagnetic samples usually contain enough magnetic grains for statistical errors to be negligible, but "single silicate crystal" work on, for example, zircon, plagioclase, and olivine crystals is approaching the limits of what is physically possible, leading to statistical errors in both the angular deviation and paleointensity that are comparable to other sources of error. The reliability of nanopaleomagnetic imaging techniques capable of resolving individual grains (used, for example, to study the cloudy zone in meteorites), however, is questionable due to the limited area of material covered.

  8. Chi-squared and C statistic minimization for low count per bin data [sampling in X-ray astronomy]

    NASA Technical Reports Server (NTRS)

    Nousek, John A.; Shue, David R.

    1989-01-01

    Results are presented from a computer simulation comparing two statistical fitting techniques on data samples with large and small counts per bin; the results are then related specifically to X-ray astronomy. The Marquardt and Powell minimization techniques are compared by using both to minimize the chi-squared statistic. In addition, Cash's C statistic is applied, with Powell's method, and it is shown that the C statistic produces better fits in the low-count regime than chi-squared.
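
    For reference, the two statistics being minimized can be written down directly. This Python sketch uses the textbook definitions (Pearson chi-squared and the likelihood-ratio form of Cash's C statistic), not the paper's simulation code; the bin counts are illustrative.

    ```python
    import numpy as np

    def chi_squared(n, m):
        """Pearson chi-squared; unreliable when counts per bin are small."""
        return np.sum((n - m) ** 2 / m)

    def cash_c(n, m):
        """Cash's C statistic (likelihood-ratio form), valid at low counts."""
        n = np.asarray(n, dtype=float)
        term = np.zeros_like(m)
        mask = n > 0
        term[mask] = n[mask] * np.log(n[mask] / m[mask])
        return 2.0 * np.sum(m - n + term)

    rng = np.random.default_rng(2)
    m = np.full(50, 2.0)        # low-count regime: 2 expected counts per bin
    n = rng.poisson(m)
    print(chi_squared(n, m), cash_c(n, m))
    ```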

  9. Got power? A systematic review of sample size adequacy in health professions education research.

    PubMed

    Cook, David A; Hatala, Rose

    2015-03-01

    Many education research studies employ small samples, which in turn lowers statistical power. We re-analyzed the results of a meta-analysis of simulation-based education to determine study power across a range of effect sizes, and the smallest effect that could be plausibly excluded. We systematically searched multiple databases through May 2011, and included all studies evaluating simulation-based education for health professionals in comparison with no intervention or another simulation intervention. Reviewers working in duplicate abstracted information to calculate standardized mean differences (SMDs). We included 897 original research studies. Among the 627 no-intervention-comparison studies, the median sample size was 25. Only two studies (0.3%) had ≥80% power to detect a small difference (SMD > 0.2 standard deviations) and 136 (22%) had power to detect a large difference (SMD > 0.8). Of the 110 no-intervention-comparison studies that failed to find a statistically significant difference, none excluded a small difference and only 47 (43%) excluded a large difference. Among the 297 studies comparing alternate simulation approaches, the median sample size was 30. Only one study (0.3%) had ≥80% power to detect a small difference and 79 (27%) had power to detect a large difference. Of the 128 studies that did not detect a statistically significant effect, 4 (3%) excluded a small difference and 91 (71%) excluded a large difference. In conclusion, most education research studies are powered only to detect effects of large magnitude. For most studies that do not reach statistical significance, the possibility of large and important differences still exists.
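
    The power figures quoted can be approximated with a normal-approximation power formula for a two-group comparison. The sketch below is illustrative (ours, not the authors' re-analysis) and uses the review's median sample size, roughly 12 per group.

    ```python
    from statistics import NormalDist

    def power_two_sample(n_per_group, smd, alpha=0.05):
        """Approximate power of a two-sided, two-group z-test to detect an SMD."""
        z = NormalDist()
        z_crit = z.inv_cdf(1 - alpha / 2)
        ncp = smd * (n_per_group / 2) ** 0.5   # noncentrality, equal group sizes
        return 1 - z.cdf(z_crit - ncp) + z.cdf(-z_crit - ncp)

    # median total sample of 25 in the review -> roughly 12 per group
    print(power_two_sample(12, 0.2))   # small effect: power ~0.07
    print(power_two_sample(12, 0.8))   # large effect: power ~0.50
    ```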

  10. Some limit theorems for ratios of order statistics from uniform random variables.

    PubMed

    Xu, Shou-Fang; Miao, Yu

    2017-01-01

    In this paper, we study the ratios of order statistics based on samples drawn from a uniform distribution and establish some limit properties, such as the almost sure central limit theorem, the large deviation principle, the Marcinkiewicz-Zygmund law of large numbers, and complete convergence.

  11. Testing for independence in J×K contingency tables with complex sample survey data.

    PubMed

    Lipsitz, Stuart R; Fitzmaurice, Garrett M; Sinha, Debajyoti; Hevelone, Nathanael; Giovannucci, Edward; Hu, Jim C

    2015-09-01

    The test of independence of row and column variables in a (J×K) contingency table is a widely used statistical test in many areas of application. For complex survey samples, use of the standard Pearson chi-squared test is inappropriate due to correlation among units within the same cluster. Rao and Scott (1981, Journal of the American Statistical Association 76, 221-230) proposed an approach in which the standard Pearson chi-squared statistic is multiplied by a design effect to adjust for the complex survey design. Unfortunately, this test fails to exist when one of the observed cell counts equals zero. Even with the large samples typical of many complex surveys, zero cell counts can occur for rare events, small domains, or contingency tables with a large number of cells. Here, we propose Wald and score test statistics for independence based on weighted least squares estimating equations. In contrast to the Rao-Scott test statistic, the proposed Wald and score test statistics always exist. In simulations, the score test is found to perform best with respect to type I error. The proposed method is motivated by, and applied to, post surgical complications data from the United States' Nationwide Inpatient Sample (NIS) complex survey of hospitals in 2008. © 2015, The International Biometric Society.
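
    For context, here is a minimal sketch of the first-order Rao-Scott correction mentioned: the Pearson statistic is rescaled by an estimated mean design effect before comparison with its chi-squared reference distribution. The table and design effect value are illustrative, and this is not the authors' proposed Wald/score procedure.

    ```python
    import numpy as np
    from scipy import stats

    def rao_scott_test(table, mean_design_effect):
        """Pearson chi-squared rescaled by a mean design effect (first order)."""
        table = np.asarray(table, dtype=float)
        expected = np.outer(table.sum(1), table.sum(0)) / table.sum()
        x2 = np.sum((table - expected) ** 2 / expected)
        x2_adj = x2 / mean_design_effect
        df = (table.shape[0] - 1) * (table.shape[1] - 1)
        return x2_adj, stats.chi2.sf(x2_adj, df)

    # illustrative 2x2 table from a clustered survey with design effect 1.8
    print(rao_scott_test([[40, 10], [25, 25]], mean_design_effect=1.8))
    ```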

  12. Modified Distribution-Free Goodness-of-Fit Test Statistic.

    PubMed

    Chun, So Yeon; Browne, Michael W; Shapiro, Alexander

    2018-03-01

    Covariance structure analysis and its structural equation modeling extensions have become one of the most widely used methodologies in social sciences such as psychology, education, and economics. An important issue in such analysis is to assess the goodness of fit of a model under analysis. One of the most popular test statistics used in covariance structure analysis is the asymptotically distribution-free (ADF) test statistic introduced by Browne (Br J Math Stat Psychol 37:62-83, 1984). The ADF statistic can be used to test models without any specific distribution assumption (e.g., multivariate normal distribution) of the observed data. Despite its advantage, it has been shown in various empirical studies that unless sample sizes are extremely large, this ADF statistic could perform very poorly in practice. In this paper, we provide a theoretical explanation for this phenomenon and further propose a modified test statistic that improves the performance in samples of realistic size. The proposed statistic deals with the possible ill-conditioning of the involved large-scale covariance matrices.

  13. A review of empirical research related to the use of small quantitative samples in clinical outcome scale development.

    PubMed

    Houts, Carrie R; Edwards, Michael C; Wirth, R J; Deal, Linda S

    2016-11-01

    There has been a notable increase in the advocacy of using small-sample designs as an initial quantitative assessment of item and scale performance during the scale development process. This is particularly true in the development of clinical outcome assessments (COAs), where Rasch analysis has been advanced as an appropriate statistical tool for evaluating the developing COAs using a small sample. We review the benefits such methods are purported to offer from both a practical and statistical standpoint and detail several problematic areas, including both practical and statistical theory concerns, with respect to the use of quantitative methods, including Rasch-consistent methods, with small samples. The feasibility of obtaining accurate information and the potential negative impacts of misusing large-sample statistical methods with small samples during COA development are discussed.

  14. 48 CFR 31.201-6 - Accounting for unallowable costs.

    Code of Federal Regulations, 2012 CFR

    2012-10-01

    ... unbiased sample that is a reasonable representation of the sampling universe. (ii) Any large dollar value... universe from that sampled cost is also subject to the same penalty provisions. (4) Use of statistical...

  15. 48 CFR 31.201-6 - Accounting for unallowable costs.

    Code of Federal Regulations, 2013 CFR

    2013-10-01

    ... unbiased sample that is a reasonable representation of the sampling universe. (ii) Any large dollar value... universe from that sampled cost is also subject to the same penalty provisions. (4) Use of statistical...

  16. 48 CFR 31.201-6 - Accounting for unallowable costs.

    Code of Federal Regulations, 2014 CFR

    2014-10-01

    ... unbiased sample that is a reasonable representation of the sampling universe. (ii) Any large dollar value... universe from that sampled cost is also subject to the same penalty provisions. (4) Use of statistical...

  17. 48 CFR 31.201-6 - Accounting for unallowable costs.

    Code of Federal Regulations, 2011 CFR

    2011-10-01

    ... unbiased sample that is a reasonable representation of the sampling universe. (ii) Any large dollar value... universe from that sampled cost is also subject to the same penalty provisions. (4) Use of statistical...

  18. Evaluation of the statistical evidence for Characteristic Earthquakes in the frequency-magnitude distributions of Sumatra and other subduction zone regions

    NASA Astrophysics Data System (ADS)

    Naylor, M.; Main, I. G.; Greenhough, J.; Bell, A. F.; McCloskey, J.

    2009-04-01

    The Sumatran Boxing Day earthquake and subsequent large events provide an opportunity to re-evaluate the statistical evidence for characteristic earthquake events in frequency-magnitude distributions. Our aims are to (i) improve intuition regarding the properties of samples drawn from power laws, (ii) illustrate using random samples how appropriate Poisson confidence intervals can both aid the eye and provide an appropriate statistical evaluation of data drawn from power-law distributions, and (iii) apply these confidence intervals to test for evidence of characteristic earthquakes in subduction-zone frequency-magnitude distributions. We find no need for a characteristic model to describe frequency-magnitude distributions in any of the investigated subduction zones, including Sumatra, due to an emergent skew in the residuals of power-law count data at high magnitudes, combined with a sample bias toward examining large earthquakes as candidate characteristic events.
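
    Aid (ii) can be sketched directly: place Poisson confidence bounds around expected Gutenberg-Richter bin counts and flag bins that fall outside them. The b-value and counts below are illustrative, not the Sumatra data.

    ```python
    import numpy as np
    from scipy import stats

    b = 1.0                                            # illustrative b-value
    mags = np.arange(4.0, 8.1, 0.5)                    # half-magnitude bins
    expected = 10_000 * 10 ** (-b * (mags - mags[0]))  # GR-like decay in counts
    observed = np.random.default_rng(3).poisson(expected)
    lo = stats.poisson.ppf(0.025, expected)
    hi = stats.poisson.ppf(0.975, expected)

    for m, e, o, l, h in zip(mags, expected, observed, lo, hi):
        flag = "" if l <= o <= h else "  <-- outside 95% band"
        print(f"M {m:.1f}: expected {e:8.1f}  observed {o:6d}  "
              f"95% band [{l:.0f}, {h:.0f}]{flag}")
    ```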

  19. A robust clustering algorithm for identifying problematic samples in genome-wide association studies.

    PubMed

    Bellenguez, Céline; Strange, Amy; Freeman, Colin; Donnelly, Peter; Spencer, Chris C A

    2012-01-01

    High-throughput genotyping arrays provide an efficient way to survey single nucleotide polymorphisms (SNPs) across the genome in large numbers of individuals. Downstream analysis of the data, for example in genome-wide association studies (GWAS), often involves statistical models of genotype frequencies across individuals. The complexities of the sample collection process and the potential for errors in the experimental assay can lead to biases and artefacts in an individual's inferred genotypes. Rather than attempting to model these complications, it has become a standard practice to remove individuals whose genome-wide data differ from the sample at large. Here we describe a simple, but robust, statistical algorithm to identify samples with atypical summaries of genome-wide variation. Its use as a semi-automated quality control tool is demonstrated using several summary statistics, selected to identify different potential problems, and it is applied to two different genotyping platforms and sample collections. The algorithm is written in R and is freely available at www.well.ox.ac.uk/chris-spencer. Contact: chris.spencer@well.ox.ac.uk. Supplementary data are available at Bioinformatics online.

  20. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Zhang, Fangyan; Zhang, Song; Chung Wong, Pak

    Effectively visualizing large graphs and capturing the statistical properties are two challenging tasks. To aid in these two tasks, many sampling approaches for graph simplification have been proposed, falling into three categories: node sampling, edge sampling, and traversal-based sampling. It is still unknown which approach is the best. We evaluate commonly used graph sampling methods through a combined visual and statistical comparison of graphs sampled at various rates. We conduct our evaluation on three graph models: random graphs, small-world graphs, and scale-free graphs. Initial results indicate that the effectiveness of a sampling method is dependent on the graph model, the size of the graph, and the desired statistical property. This benchmark study can be used as a guideline in choosing the appropriate method for a particular graph sampling task, and the results presented can be incorporated into graph visualization and analysis tools.
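
    A minimal sketch of the three sampling families named, using networkx (ours, not the evaluation code from this study; the graph and sample sizes are illustrative):

    ```python
    import random
    import networkx as nx

    def node_sample(G, k):
        """Induced subgraph on k uniformly chosen nodes."""
        return G.subgraph(random.sample(list(G.nodes), k)).copy()

    def edge_sample(G, k):
        """Graph built from k uniformly chosen edges."""
        return nx.Graph(random.sample(list(G.edges), k))

    def random_walk_sample(G, k):
        """Traversal-based sampling: walk until k distinct nodes are visited."""
        node = random.choice(list(G.nodes))
        visited = {node}
        while len(visited) < k:
            node = random.choice(list(G.neighbors(node)))
            visited.add(node)
        return G.subgraph(visited).copy()

    G = nx.barabasi_albert_graph(10_000, 3)   # scale-free test graph
    for sampler in (node_sample, edge_sample, random_walk_sample):
        S = sampler(G, 500)
        print(f"{sampler.__name__}: {S.number_of_nodes()} nodes, "
              f"{S.number_of_edges()} edges")
    ```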

  1. Statistical characterization of a large geochemical database and effect of sample size

    USGS Publications Warehouse

    Zhang, C.; Manheim, F.T.; Hinde, J.; Grossman, J.N.

    2005-01-01

    The authors investigated statistical distributions for concentrations of chemical elements from the National Geochemical Survey (NGS) database of the U.S. Geological Survey. At the time of this study, the NGS data set encompassed 48,544 stream sediment and soil samples from the conterminous United States analyzed by ICP-AES following a 4-acid near-total digestion. This report includes 27 elements: Al, Ca, Fe, K, Mg, Na, P, Ti, Ba, Ce, Co, Cr, Cu, Ga, La, Li, Mn, Nb, Nd, Ni, Pb, Sc, Sr, Th, V, Y and Zn. The goal and challenge for the statistical overview was to delineate chemical distributions in a complex, heterogeneous data set spanning a large geographic range (the conterminous United States), and many different geological provinces and rock types. After declustering to create a uniform spatial sample distribution with 16,511 samples, histograms and quantile-quantile (Q-Q) plots were employed to delineate subpopulations that have coherent chemical and mineral affinities. Probability groupings are discerned by changes in slope (kinks) on the plots. Major rock-forming elements, e.g., Al, Ca, K and Na, tend to display linear segments on normal Q-Q plots. These segments can commonly be linked to petrologic or mineralogical associations. For example, linear segments on K and Na plots reflect dilution of clay minerals by quartz sand (low in K and Na). Minor and trace element relationships are best displayed on lognormal Q-Q plots. These sensitively reflect discrete relationships in subpopulations within the wide range of the data. For example, small but distinctly log-linear subpopulations for Pb, Cu, Zn and Ag are interpreted to represent ore-grade enrichment of naturally occurring minerals such as sulfides. None of the 27 chemical elements could pass the test for either normal or lognormal distribution on the declustered data set. Part of the reason relates to the presence of mixtures of subpopulations and outliers. Random samples of the data set with successively smaller numbers of data points showed that few elements passed standard statistical tests for normality or log-normality until sample size decreased to a few hundred data points. Large sample size enhances the power of statistical tests, and leads to rejection of most statistical hypotheses for real data sets. For large sample sizes (e.g., n > 1000), graphical methods such as histograms, stem-and-leaf displays, and probability plots are recommended for rough judgement of probability distribution if needed. © 2005 Elsevier Ltd. All rights reserved.
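
    The sample-size effect reported here is easy to reproduce: with tens of thousands of observations, a normality test typically rejects even mild departures that a few hundred points cannot detect. The data below are synthetic stand-ins, not NGS values.

    ```python
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    data = np.exp(rng.normal(0, 1, 16_511))  # lognormal stand-in for an element
    data[:200] *= 50                         # small ore-grade-like subpopulation

    for n in (16_511, 1_000, 200):
        sub = rng.choice(data, n, replace=False)
        _, p = stats.normaltest(np.log(sub))  # D'Agostino-Pearson test on logs
        verdict = "reject" if p < 0.05 else "cannot reject"
        print(f"n={n:>6}  p={p:.3g}  {verdict} log-normality")
    ```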

  2. Transport Coefficients from Large Deviation Functions

    NASA Astrophysics Data System (ADS)

    Gao, Chloe; Limmer, David

    2017-10-01

    We describe a method for computing transport coefficients from the direct evaluation of large deviation functions. This method is general, relying only on equilibrium fluctuations, and is statistically efficient, employing trajectory-based importance sampling. Equilibrium fluctuations of molecular currents are characterized by their large deviation functions, which are scaled cumulant generating functions analogous to the free energy. A diffusion Monte Carlo algorithm is used to evaluate the large deviation functions, from which arbitrary transport coefficients are derivable. We find significant statistical improvement over traditional Green-Kubo based calculations. The systematic and statistical errors of this method are analyzed in the context of specific transport coefficient calculations, including the shear viscosity, interfacial friction coefficient, and thermal conductivity.

  3. Impact of Design Effects in Large-Scale District and State Assessments

    ERIC Educational Resources Information Center

    Phillips, Gary W.

    2015-01-01

    This article proposes that sampling design effects have potentially huge unrecognized impacts on the results reported by large-scale district and state assessments in the United States. When design effects are unrecognized and unaccounted for they lead to underestimating the sampling error in item and test statistics. Underestimating the sampling…

  4. A statistically rigorous sampling design to integrate avian monitoring and management within Bird Conservation Regions.

    PubMed

    Pavlacky, David C; Lukacs, Paul M; Blakesley, Jennifer A; Skorkowsky, Robert C; Klute, David S; Hahn, Beth A; Dreitz, Victoria J; George, T Luke; Hanni, David J

    2017-01-01

    Monitoring is an essential component of wildlife management and conservation. However, the usefulness of monitoring data is often undermined by the lack of 1) coordination across organizations and regions, 2) meaningful management and conservation objectives, and 3) rigorous sampling designs. Although many improvements to avian monitoring have been discussed, the recommendations have been slow to emerge in large-scale programs. We introduce the Integrated Monitoring in Bird Conservation Regions (IMBCR) program designed to overcome the above limitations. Our objectives are to outline the development of a statistically defensible sampling design to increase the value of large-scale monitoring data and provide example applications to demonstrate the ability of the design to meet multiple conservation and management objectives. We outline the sampling process for the IMBCR program with a focus on the Badlands and Prairies Bird Conservation Region (BCR 17). We provide two examples for the Brewer's sparrow (Spizella breweri) in BCR 17 demonstrating the ability of the design to 1) determine hierarchical population responses to landscape change and 2) estimate hierarchical habitat relationships to predict the response of the Brewer's sparrow to conservation efforts at multiple spatial scales. The collaboration across organizations and regions provided economy of scale by leveraging a common data platform over large spatial scales to promote the efficient use of monitoring resources. We designed the IMBCR program to address the information needs and core conservation and management objectives of the participating partner organizations. Although it has been argued that probabilistic sampling designs are not practical for large-scale monitoring, the IMBCR program provides a precedent for implementing a statistically defensible sampling design from local to bioregional scales. We demonstrate that integrating conservation and management objectives with rigorous statistical design and analyses ensures reliable knowledge about bird populations that is relevant and integral to bird conservation at multiple scales.

  5. A statistically rigorous sampling design to integrate avian monitoring and management within Bird Conservation Regions

    PubMed Central

    Hahn, Beth A.; Dreitz, Victoria J.; George, T. Luke

    2017-01-01

    Monitoring is an essential component of wildlife management and conservation. However, the usefulness of monitoring data is often undermined by the lack of 1) coordination across organizations and regions, 2) meaningful management and conservation objectives, and 3) rigorous sampling designs. Although many improvements to avian monitoring have been discussed, the recommendations have been slow to emerge in large-scale programs. We introduce the Integrated Monitoring in Bird Conservation Regions (IMBCR) program designed to overcome the above limitations. Our objectives are to outline the development of a statistically defensible sampling design to increase the value of large-scale monitoring data and provide example applications to demonstrate the ability of the design to meet multiple conservation and management objectives. We outline the sampling process for the IMBCR program with a focus on the Badlands and Prairies Bird Conservation Region (BCR 17). We provide two examples for the Brewer’s sparrow (Spizella breweri) in BCR 17 demonstrating the ability of the design to 1) determine hierarchical population responses to landscape change and 2) estimate hierarchical habitat relationships to predict the response of the Brewer’s sparrow to conservation efforts at multiple spatial scales. The collaboration across organizations and regions provided economy of scale by leveraging a common data platform over large spatial scales to promote the efficient use of monitoring resources. We designed the IMBCR program to address the information needs and core conservation and management objectives of the participating partner organizations. Although it has been argued that probabilistic sampling designs are not practical for large-scale monitoring, the IMBCR program provides a precedent for implementing a statistically defensible sampling design from local to bioregional scales. We demonstrate that integrating conservation and management objectives with rigorous statistical design and analyses ensures reliable knowledge about bird populations that is relevant and integral to bird conservation at multiple scales. PMID:29065128

  6. Differential gene expression detection and sample classification using penalized linear regression models.

    PubMed

    Wu, Baolin

    2006-02-15

    Differential gene expression detection and sample classification using microarray data have received much research interest recently. Owing to the large number of genes p and small number of samples n (p ≫ n), microarray data analysis poses big challenges for statistical analysis. An obvious problem owing to the 'large p, small n' setting is over-fitting: just by chance, we are likely to find some non-differentially expressed genes that can classify the samples very well. The idea of shrinkage is to regularize the model parameters to reduce the effects of noise and produce reliable inferences. Shrinkage has been successfully applied in microarray data analysis. The SAM statistics proposed by Tusher et al. and the 'nearest shrunken centroid' proposed by Tibshirani et al. are ad hoc shrinkage methods. Both methods are simple, intuitive and have proved useful in empirical studies. Recently Wu proposed penalized t/F-statistics with shrinkage by formally using L1-penalized linear regression models for two-class microarray data, showing good performance. In this paper we systematically discuss the use of penalized regression models for analyzing microarray data. We generalize the two-class penalized t/F-statistics proposed by Wu to multi-class microarray data. We formally derive the ad hoc shrunken centroid used by Tibshirani et al. using the L1-penalized regression models. And we show that the penalized linear regression models provide a rigorous and unified statistical framework for sample classification and differential gene expression detection.
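
    The shrinkage idea credited to the SAM statistic can be sketched as a moderated t-statistic with a small constant added to each gene's standard error. This is a generic illustration (the s0 percentile heuristic is our assumption), not the paper's penalized-regression derivation.

    ```python
    import numpy as np

    def moderated_t(x, y, s0=None):
        """SAM-style t: x, y are genes-by-samples matrices for the two classes."""
        nx_, ny_ = x.shape[1], y.shape[1]
        diff = x.mean(1) - y.mean(1)
        pooled = ((nx_ - 1) * x.var(1, ddof=1) + (ny_ - 1) * y.var(1, ddof=1)) \
                 / (nx_ + ny_ - 2)
        se = np.sqrt(pooled * (1 / nx_ + 1 / ny_))
        if s0 is None:
            s0 = np.percentile(se, 5)   # heuristic fudge factor (our assumption)
        return diff / (se + s0)         # shrinkage: tiny variances can't dominate

    rng = np.random.default_rng(5)
    x, y = rng.normal(size=(2000, 8)), rng.normal(size=(2000, 8))
    print(moderated_t(x, y)[:5])
    ```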

  7. Scaling up to address data science challenges

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wendelberger, Joanne R.

    Statistics and Data Science provide a variety of perspectives and technical approaches for exploring and understanding Big Data. Partnerships between scientists from different fields such as statistics, machine learning, computer science, and applied mathematics can lead to innovative approaches for addressing problems involving increasingly large amounts of data in a rigorous and effective manner that takes advantage of advances in computing. Here, this article will explore various challenges in Data Science and will highlight statistical approaches that can facilitate analysis of large-scale data, including sampling and data reduction methods, techniques for effective analysis and visualization of large-scale simulations, and algorithms and procedures for efficient processing.

  8. Scaling up to address data science challenges

    DOE PAGES

    Wendelberger, Joanne R.

    2017-04-27

    Statistics and Data Science provide a variety of perspectives and technical approaches for exploring and understanding Big Data. Partnerships between scientists from different fields such as statistics, machine learning, computer science, and applied mathematics can lead to innovative approaches for addressing problems involving increasingly large amounts of data in a rigorous and effective manner that takes advantage of advances in computing. Here, this article will explore various challenges in Data Science and will highlight statistical approaches that can facilitate analysis of large-scale data, including sampling and data reduction methods, techniques for effective analysis and visualization of large-scale simulations, and algorithms and procedures for efficient processing.

  9. Some connections between importance sampling and enhanced sampling methods in molecular dynamics.

    PubMed

    Lie, H C; Quer, J

    2017-11-21

    In molecular dynamics, enhanced sampling methods enable the collection of better statistics of rare events from a reference or target distribution. We show that a large class of these methods is based on the idea of importance sampling from mathematical statistics. We illustrate this connection by comparing the Hartmann-Schütte method for rare event simulation (J. Stat. Mech. Theor. Exp. 2012, P11004) and the Valsson-Parrinello method of variationally enhanced sampling [Phys. Rev. Lett. 113, 090601 (2014)]. We use this connection in order to discuss how recent results from the Monte Carlo methods literature can guide the development of enhanced sampling methods.
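
    A toy importance-sampling example (ours, unrelated to the molecular systems in the paper) shows the underlying idea: sample from a proposal concentrated on the rare event and reweight by the density ratio, which dramatically reduces variance relative to naive sampling.

    ```python
    import numpy as np

    rng = np.random.default_rng(6)
    n, a = 100_000, 4.0                 # estimate P(X > 4) for X ~ N(0, 1)

    naive = (rng.normal(0.0, 1.0, n) > a).mean()

    y = rng.normal(a, 1.0, n)                         # proposal centered on the event
    weights = np.exp(-0.5 * y**2 + 0.5 * (y - a)**2)  # density ratio N(0,1)/N(a,1)
    importance = ((y > a) * weights).mean()

    print(naive, importance)            # true value is about 3.17e-05
    ```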

  10. Some connections between importance sampling and enhanced sampling methods in molecular dynamics

    NASA Astrophysics Data System (ADS)

    Lie, H. C.; Quer, J.

    2017-11-01

    In molecular dynamics, enhanced sampling methods enable the collection of better statistics of rare events from a reference or target distribution. We show that a large class of these methods is based on the idea of importance sampling from mathematical statistics. We illustrate this connection by comparing the Hartmann-Schütte method for rare event simulation (J. Stat. Mech. Theor. Exp. 2012, P11004) and the Valsson-Parrinello method of variationally enhanced sampling [Phys. Rev. Lett. 113, 090601 (2014)]. We use this connection in order to discuss how recent results from the Monte Carlo methods literature can guide the development of enhanced sampling methods.

  11. Information Entropy Production of Maximum Entropy Markov Chains from Spike Trains

    NASA Astrophysics Data System (ADS)

    Cofré, Rodrigo; Maldonado, Cesar

    2018-01-01

    We consider the maximum entropy Markov chain inference approach to characterize the collective statistics of neuronal spike trains, focusing on the statistical properties of the inferred model. We review large deviations techniques useful in this context to describe properties of accuracy and convergence in terms of sampling size. We use these results to study the statistical fluctuation of correlations, distinguishability and irreversibility of maximum entropy Markov chains. We illustrate these applications using simple examples where the large deviation rate function is explicitly obtained for maximum entropy models of relevance in this field.

  12. Historical changes in large river fish assemblages of the Americas: A synthesis

    EPA Science Inventory

    The objective of this synthesis is to summarize patterns in historical changes in the fish assemblages of selected large American rivers, to document causes for those changes, and to suggest rehabilitation measures. Although not a statistically representative sample of large riv...

  13. Statistical Analysis of a Large Sample Size Pyroshock Test Data Set Including Post Flight Data Assessment. Revision 1

    NASA Technical Reports Server (NTRS)

    Hughes, William O.; McNelis, Anne M.

    2010-01-01

    The Earth Observing System (EOS) Terra spacecraft was launched on an Atlas IIAS launch vehicle on its mission to observe planet Earth in late 1999. Prior to launch, the new design of the spacecraft's pyroshock separation system was characterized by a series of 13 separation ground tests. The analysis methods used to evaluate this unusually large amount of shock data will be discussed in this paper, with particular emphasis on population distributions and finding statistically significant families of data, leading to an overall shock separation interface level. The wealth of ground test data also allowed a derivation of a Mission Assurance level for the flight. All of the flight shock measurements were below the EOS Terra Mission Assurance level thus contributing to the overall success of the EOS Terra mission. The effectiveness of the statistical methodology for characterizing the shock interface level and for developing a flight Mission Assurance level from a large sample size of shock data is demonstrated in this paper.

  14. A Hybrid Semi-supervised Classification Scheme for Mining Multisource Geospatial Data

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Vatsavai, Raju; Bhaduri, Budhendra L

    2011-01-01

    Supervised learning methods such as Maximum Likelihood (ML) are often used in land cover (thematic) classification of remote sensing imagery. The ML classifier relies exclusively on spectral characteristics of thematic classes whose statistical distributions (class conditional probability densities) are often overlapping. The spectral response distributions of thematic classes are dependent on many factors including elevation, soil types, and ecological zones. A second problem with statistical classifiers is the requirement of a large number of accurate training samples (10 to 30 × |dimensions|), which are often costly and time consuming to acquire over large geographic regions. With the increasing availability of geospatial databases, it is possible to exploit the knowledge derived from these ancillary datasets to improve classification accuracies even when the class distributions are highly overlapping. Likewise, newer semi-supervised techniques can be adopted to improve the parameter estimates of the statistical model by utilizing a large number of easily available unlabeled training samples. Unfortunately there is no convenient multivariate statistical model that can be employed for multisource geospatial databases. In this paper we present a hybrid semi-supervised learning algorithm that effectively exploits freely available unlabeled training samples from multispectral remote sensing images and also incorporates ancillary geospatial databases. We have conducted several experiments on real datasets, and our new hybrid approach shows over 25 to 35% improvement in overall classification accuracy over conventional classification schemes.

  15. Estimating statistical uncertainty of Monte Carlo efficiency-gain in the context of a correlated sampling Monte Carlo code for brachytherapy treatment planning with non-normal dose distribution.

    PubMed

    Mukhopadhyay, Nitai D; Sampson, Andrew J; Deniz, Daniel; Alm Carlsson, Gudrun; Williamson, Jeffrey; Malusek, Alexandr

    2012-01-01

    Correlated sampling Monte Carlo methods can shorten computing times in brachytherapy treatment planning. Monte Carlo efficiency is typically estimated via efficiency gain, defined as the reduction in computing time by correlated sampling relative to conventional Monte Carlo methods when equal statistical uncertainties have been achieved. The determination of the efficiency gain uncertainty arising from random effects, however, is not a straightforward task, especially when the error distribution is non-normal. The purpose of this study is to evaluate the applicability of the F distribution and standardized uncertainty propagation methods (widely used in metrology to estimate uncertainty of physical measurements) for predicting confidence intervals about efficiency gain estimates derived from single Monte Carlo runs using fixed-collision correlated sampling in a simplified brachytherapy geometry. A bootstrap based algorithm was used to simulate the probability distribution of the efficiency gain estimates and the shortest 95% confidence interval was estimated from this distribution. It was found that the corresponding relative uncertainty was as large as 37% for this particular problem. The uncertainty propagation framework predicted confidence intervals reasonably well; however its main disadvantage was that uncertainties of input quantities had to be calculated in a separate run via a Monte Carlo method. The F distribution noticeably underestimated the confidence interval. These discrepancies were influenced by several photons with large statistical weights which made extremely large contributions to the scored absorbed dose difference. The mechanism of acquiring high statistical weights in the fixed-collision correlated sampling method was explained and a mitigation strategy was proposed. Copyright © 2011 Elsevier Ltd. All rights reserved.
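
    The bootstrap step described can be sketched generically: resample the per-history scores, recompute the efficiency-gain estimate each time, and read a percentile interval off the resulting distribution. The score distributions below are stand-ins, not brachytherapy tallies.

    ```python
    import numpy as np

    rng = np.random.default_rng(7)
    scores_corr = rng.exponential(1.0, 5_000)  # stand-in per-history tallies
    scores_conv = rng.exponential(1.0, 5_000)

    def gain(a, b):
        # at equal cost, efficiency ~ 1/variance, so gain ~ var(b)/var(a)
        return b.var(ddof=1) / a.var(ddof=1)

    boot = np.array([
        gain(rng.choice(scores_corr, scores_corr.size),
             rng.choice(scores_conv, scores_conv.size))
        for _ in range(2_000)
    ])
    print(np.percentile(boot, [2.5, 97.5]))    # 95% percentile interval
    ```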

  16. Multi-Reader ROC studies with Split-Plot Designs: A Comparison of Statistical Methods

    PubMed Central

    Obuchowski, Nancy A.; Gallas, Brandon D.; Hillis, Stephen L.

    2012-01-01

    Rationale and Objectives: Multi-reader imaging trials often use a factorial design, where study patients undergo testing with all imaging modalities and readers interpret the results of all tests for all patients. A drawback of the design is the large number of interpretations required of each reader. Split-plot designs have been proposed as an alternative, in which one or a subset of readers interprets all images of a sample of patients, while other readers interpret the images of other samples of patients. In this paper we compare three methods of analysis for the split-plot design. Materials and Methods: Three statistical methods are presented: the Obuchowski-Rockette method modified for the split-plot design, a newly proposed marginal-mean ANOVA approach, and an extension of the three-sample U-statistic method. A simulation study using the Roe-Metz model was performed to compare the type I error rate, power and confidence interval coverage of the three test statistics. Results: The type I error rates for all three methods are close to the nominal level but tend to be slightly conservative. The statistical power is nearly identical for the three methods. The coverage of 95% CIs falls close to the nominal coverage for small and large sample sizes. Conclusions: The split-plot MRMC study design can be statistically efficient compared with the factorial design, reducing the number of interpretations required per reader. Three methods of analysis, shown to have nominal type I error rate, similar power, and nominal CI coverage, are available for this study design. PMID:23122570

  17. Multi-reader ROC studies with split-plot designs: a comparison of statistical methods.

    PubMed

    Obuchowski, Nancy A; Gallas, Brandon D; Hillis, Stephen L

    2012-12-01

    Multireader imaging trials often use a factorial design, in which study patients undergo testing with all imaging modalities and readers interpret the results of all tests for all patients. A drawback of this design is the large number of interpretations required of each reader. Split-plot designs have been proposed as an alternative, in which one or a subset of readers interprets all images of a sample of patients, while other readers interpret the images of other samples of patients. In this paper, the authors compare three methods of analysis for the split-plot design. Three statistical methods are presented: the Obuchowski-Rockette method modified for the split-plot design, a newly proposed marginal-mean analysis-of-variance approach, and an extension of the three-sample U-statistic method. A simulation study using the Roe-Metz model was performed to compare the type I error rate, power, and confidence interval coverage of the three test statistics. The type I error rates for all three methods are close to the nominal level but tend to be slightly conservative. The statistical power is nearly identical for the three methods. The coverage of 95% confidence intervals falls close to the nominal coverage for small and large sample sizes. The split-plot multireader, multicase study design can be statistically efficient compared to the factorial design, reducing the number of interpretations required per reader. Three methods of analysis, shown to have nominal type I error rates, similar power, and nominal confidence interval coverage, are available for this study design. Copyright © 2012 AUR. All rights reserved.

  18. THE RADIO/GAMMA-RAY CONNECTION IN ACTIVE GALACTIC NUCLEI IN THE ERA OF THE FERMI LARGE AREA TELESCOPE

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ackermann, M.; Ajello, M.; Allafort, A.

    We present a detailed statistical analysis of the correlation between radio and gamma-ray emission of the active galactic nuclei (AGNs) detected by Fermi during its first year of operation, with the largest data sets ever used for this purpose. We use both archival interferometric 8.4 GHz data (from the Very Large Array and ATCA, for the full sample of 599 sources) and concurrent single-dish 15 GHz measurements from the Owens Valley Radio Observatory (OVRO, for a subsample of 199 objects). Our unprecedentedly large sample permits us to assess with high accuracy the statistical significance of the correlation, using a surrogate data method designed to simultaneously account for common-distance bias and the effect of a limited dynamical range in the observed quantities. We find that the statistical significance of a positive correlation between the centimeter radio and the broadband (E > 100 MeV) gamma-ray energy flux is very high for the whole AGN sample, with a probability of <10⁻⁷ for the correlation appearing by chance. Using the OVRO data, we find that concurrent data improve the significance of the correlation from 1.6 × 10⁻⁶ to 9.0 × 10⁻⁸. Our large sample size allows us to study the dependence of correlation strength and significance on specific source types and gamma-ray energy band. We find that the correlation is very significant (chance probability < 10⁻⁷) for both flat spectrum radio quasars and BL Lac objects separately; a dependence of the correlation strength on the considered gamma-ray energy band is also present, but additional data will be necessary to constrain its significance.

  19. The radio/gamma-ray connection in active galactic nuclei in the era of the Fermi Large Area Telescope

    DOE PAGES

    Ackermann, M.; Ajello, M.; Allafort, A.; ...

    2011-10-12

    We present a detailed statistical analysis of the correlation between radio and gamma-ray emission of the active galactic nuclei (AGNs) detected by Fermi during its first year of operation, with the largest data sets ever used for this purpose. We use both archival interferometric 8.4 GHz data (from the Very Large Array and ATCA, for the full sample of 599 sources) and concurrent single-dish 15 GHz measurements from the Owens Valley Radio Observatory (OVRO, for a subsample of 199 objects). Our unprecedentedly large sample permits us to assess with high accuracy the statistical significance of the correlation, using a surrogate data method designed to simultaneously account for common-distance bias and the effect of a limited dynamical range in the observed quantities. We find that the statistical significance of a positive correlation between the centimeter radio and the broadband (E > 100 MeV) gamma-ray energy flux is very high for the whole AGN sample, with a probability of <10⁻⁷ for the correlation appearing by chance. Using the OVRO data, we find that concurrent data improve the significance of the correlation from 1.6 × 10⁻⁶ to 9.0 × 10⁻⁸. Our large sample size allows us to study the dependence of correlation strength and significance on specific source types and gamma-ray energy band. As a result, we find that the correlation is very significant (chance probability < 10⁻⁷) for both flat spectrum radio quasars and BL Lac objects separately; a dependence of the correlation strength on the considered gamma-ray energy band is also present, but additional data will be necessary to constrain its significance.

  20. The Radio/Gamma-Ray Connection in Active Galactic Nuclei in the Era of the Fermi Large Area Telescope

    NASA Technical Reports Server (NTRS)

    Ackermann, M.; Ajello, M.; Allafort, A.; Angelakis, E.; Axelsson, M.; Baldini, L.; Ballet, J.; Barbiellini, G.; Bastieri, D.; Bellazzini, R.; ...

    2011-01-01

    We present a detailed statistical analysis of the correlation between radio and gamma-ray emission of the active galactic nuclei (AGNs) detected by Fermi during its first year of operation, with the largest data sets ever used for this purpose. We use both archival interferometric 8.4 GHz data (from the Very Large Array and ATCA, for the full sample of 599 sources) and concurrent single-dish 15 GHz measurements from the Owens Valley Radio Observatory (OVRO, for a subsample of 199 objects). Our unprecedentedly large sample permits us to assess with high accuracy the statistical significance of the correlation, using a surrogate data method designed to simultaneously account for common-distance bias and the effect of a limited dynamical range in the observed quantities. We find that the statistical significance of a positive correlation between the centimeter radio and the broadband (E > 100 MeV) gamma-ray energy flux is very high for the whole AGN sample, with a probability of <10⁻⁷ for the correlation appearing by chance. Using the OVRO data, we find that concurrent data improve the significance of the correlation from 1.6 × 10⁻⁶ to 9.0 × 10⁻⁸. Our large sample size allows us to study the dependence of correlation strength and significance on specific source types and gamma-ray energy band. We find that the correlation is very significant (chance probability < 10⁻⁷) for both flat spectrum radio quasars and BL Lac objects separately; a dependence of the correlation strength on the considered gamma-ray energy band is also present, but additional data will be necessary to constrain its significance.

  1. Chi-squared and C statistic minimization for low count per bin data

    NASA Astrophysics Data System (ADS)

    Nousek, John A.; Shue, David R.

    1989-07-01

    Results are presented from a computer simulation comparing two statistical fitting techniques on data samples with large and small counts per bin; the results are then related specifically to X-ray astronomy. The Marquardt and Powell minimization techniques are compared by using both to minimize the chi-squared statistic. In addition, Cash's C statistic is applied, with Powell's method, and it is shown that the C statistic produces better fits in the low-count regime than chi-squared.

  2. Statistical inference involving binomial and negative binomial parameters.

    PubMed

    García-Pérez, Miguel A; Núñez-Antón, Vicente

    2009-05-01

    Statistical inference about two binomial parameters implies that they are both estimated by binomial sampling. There are occasions in which one aims at testing the equality of two binomial parameters before and after the occurrence of the first success along a sequence of Bernoulli trials. In these cases, the binomial parameter before the first success is estimated by negative binomial sampling whereas that after the first success is estimated by binomial sampling, and both estimates are related. This paper derives statistical tools to test two hypotheses, namely, that both binomial parameters equal some specified value and that both parameters are equal though unknown. Simulation studies are used to show that in small samples both tests are accurate in keeping the nominal Type-I error rates, and also to determine sample size requirements to detect large, medium, and small effects with adequate power. Additional simulations also show that the tests are sufficiently robust to certain violations of their assumptions.

  3. Sampling errors in the estimation of empirical orthogonal functions [for climatology studies]

    NASA Technical Reports Server (NTRS)

    North, G. R.; Bell, T. L.; Cahalan, R. F.; Moeng, F. J.

    1982-01-01

    Empirical Orthogonal Functions (EOFs), eigenvectors of the spatial cross-covariance matrix of a meteorological field, are reviewed with special attention given to the necessary weighting factors for gridded data and the sampling errors incurred when too small a sample is available. The geographical shape of an EOF shows large intersample variability when its associated eigenvalue is 'close' to a neighboring one. A rule of thumb indicating when an EOF is likely to be subject to large sampling fluctuations is presented. An explicit example, based on the statistics of the 500 mb geopotential height field, displays large intersample variability in the EOFs for sample sizes of a few hundred independent realizations, a size seldom exceeded by meteorological data sets.
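
    The quantities discussed translate directly to code: EOFs are eigenvectors of the sample covariance matrix, and the rule of thumb from this paper (North et al. 1982) flags an EOF as unreliable when its eigenvalue's sampling error, roughly λ√(2/N), is comparable to the gap to a neighboring eigenvalue. The synthetic field below is illustrative.

    ```python
    import numpy as np

    rng = np.random.default_rng(8)
    N, p = 300, 20                        # realizations x grid points
    field = rng.normal(size=(N, p))       # synthetic anomaly field

    eigvals, eofs = np.linalg.eigh(np.cov(field, rowvar=False))
    eigvals, eofs = eigvals[::-1], eofs[:, ::-1]   # sort descending

    err = eigvals * np.sqrt(2.0 / N)      # North et al. eigenvalue sampling error
    gaps = -np.diff(eigvals)
    for k in range(3):
        status = "degenerate?" if err[k] >= gaps[k] else "well separated"
        print(f"EOF {k+1}: lambda = {eigvals[k]:.2f} +/- {err[k]:.2f}  ({status})")
    ```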

  4. Valid statistical inference methods for a case-control study with missing data.

    PubMed

    Tian, Guo-Liang; Zhang, Chi; Jiang, Xuejun

    2018-04-01

    The main objective of this paper is to derive the valid sampling distribution of the observed counts in a case-control study with missing data under the assumption of missing at random, by employing the conditional sampling method and the mechanism augmentation method. The proposed sampling distribution, called the case-control sampling distribution, can be used to calculate the standard errors of the maximum likelihood estimates of parameters via the Fisher information matrix and to generate independent samples for constructing small-sample bootstrap confidence intervals. Theoretical comparisons of the new case-control sampling distribution with two existing sampling distributions exhibit a large difference. Simulations are conducted to investigate the influence of the three different sampling distributions on statistical inferences. One finding is that the conclusion of the Wald test for testing independence under the two existing sampling distributions could be completely different (even contradictory) from that of the Wald test for testing the equality of the success probabilities in control/case groups under the proposed distribution. A real cervical cancer data set is used to illustrate the proposed statistical methods.

  5. Technology Tips: Sample Too Small? Probably Not!

    ERIC Educational Resources Information Center

    Strayer, Jeremy F.

    2013-01-01

    Statistical studies are referenced in the news every day, so frequently that people are sometimes skeptical of reported results. Often, no matter how large a sample size researchers use in their studies, people believe that the sample size is too small to make broad generalizations. The tasks presented in this article use simulations of repeated…

  6. Probing the statistics of primordial fluctuations and their evolution

    NASA Technical Reports Server (NTRS)

    Gaztanaga, Enrique; Yokoyama, Jun'ichi

    1993-01-01

    The statistical distribution of fluctuations on various scales is analyzed in terms of the counts in cells of smoothed density fields, using volume-limited samples of galaxy redshift catalogs. It is shown that the distribution on large scales, with volume average of the two-point correlation function of the smoothed field less than about 0.05, is consistent with Gaussian. Statistics are shown to agree remarkably well with the negative binomial distribution, which has hierarchical correlations and a Gaussian behavior at large scales. If these observed properties correspond to the matter distribution, they suggest that our universe started with Gaussian fluctuations and evolved keeping the hierarchical form.

  7. Gene coexpression measures in large heterogeneous samples using count statistics.

    PubMed

    Wang, Y X Rachel; Waterman, Michael S; Huang, Haiyan

    2014-11-18

    With the advent of high-throughput technologies making large-scale gene expression data readily available, developing appropriate computational tools to process these data and distill insights into systems biology has been an important part of the "big data" challenge. Gene coexpression is one of the earliest techniques developed that is still widely in use for functional annotation, pathway analysis, and, most importantly, the reconstruction of gene regulatory networks, based on gene expression data. However, most coexpression measures do not specifically account for local features in expression profiles. For example, it is very likely that the patterns of gene association may change or only exist in a subset of the samples, especially when the samples are pooled from a range of experiments. We propose two new gene coexpression statistics based on counting local patterns of gene expression ranks to take into account the potentially diverse nature of gene interactions. In particular, one of our statistics is designed for time-course data with local dependence structures, such as time series coupled over a subregion of the time domain. We provide asymptotic analysis of their distributions and power, and evaluate their performance against a wide range of existing coexpression measures on simulated and real data. Our new statistics are fast to compute, robust against outliers, and show comparable and often better general performance.

  8. Identifying Microlensing Events in Large, Non-Uniformly Sampled Surveys: The Case of the Palomar Transient Factory

    NASA Astrophysics Data System (ADS)

    Price-Whelan, Adrian M.; Agueros, M. A.; Fournier, A.; Street, R.; Ofek, E.; Levitan, D. B.; PTF Collaboration

    2013-01-01

    Many current photometric, time-domain surveys are driven by specific goals such as searches for supernovae or transiting exoplanets, or studies of stellar variability. These goals in turn set the cadence with which individual fields are re-imaged. In the case of the Palomar Transient Factory (PTF), several such sub-surveys are being conducted in parallel, leading to extremely non-uniform sampling over the survey's nearly 20,000 sq. deg. footprint. While the typical 7.26 sq. deg. PTF field has been imaged 20 times in R-band, ~2300 sq. deg. have been observed more than 100 times. We use the existing PTF data (6.4 × 10⁷ light curves) to study the trade-off that occurs when searching for microlensing events when one has access to a large survey footprint with irregular sampling. To examine the probability that microlensing events can be recovered in these data, we also test previous statistics used on uniformly sampled data to identify variables and transients. We find that one such statistic, the von Neumann ratio, performs best for identifying simulated microlensing events. We develop a selection method using this statistic and apply it to data from all PTF fields with >100 observations to uncover a number of interesting candidate events. This work can help constrain all-sky event rate predictions and tests microlensing signal recovery in large datasets, both of which will be useful to future wide-field, time-domain surveys such as the LSST.
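
    A minimal sketch of the von Neumann ratio on simulated light curves (magnitudes and cadence are illustrative): values near 2 indicate uncorrelated noise, while a smooth microlensing-like brightening drives the ratio well below 2.

        import numpy as np

        def von_neumann_ratio(mag):
            """Mean square successive difference divided by the variance:
            ~2 for uncorrelated noise, well below 2 for smooth trends."""
            mag = np.asarray(mag, dtype=float)
            msd = np.mean(np.diff(mag) ** 2)
            return msd / np.var(mag, ddof=1)

        rng = np.random.default_rng(2)
        t = np.linspace(-2, 2, 200)
        noise = rng.normal(0, 0.05, t.size)
        flat = 18.0 + noise                                  # constant source
        bump = 18.0 - 1.0 * np.exp(-t**2 / 0.1) + noise      # smooth brightening
        print("flat light curve :", round(von_neumann_ratio(flat), 2))   # ~2
        print("microlensing-like:", round(von_neumann_ratio(bump), 2))   # << 2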

  9. Efficient bootstrap estimates for tail statistics

    NASA Astrophysics Data System (ADS)

    Breivik, Øyvind; Aarnes, Ole Johan

    2017-03-01

    Bootstrap resamples can be used to investigate the tail of empirical distributions as well as return value estimates from the extremal behaviour of the sample. Specifically, the confidence intervals on return value estimates or bounds on in-sample tail statistics can be obtained using bootstrap techniques. However, non-parametric bootstrapping from the entire sample is expensive. It is shown here that it suffices to bootstrap from a small subset consisting of the highest entries in the sequence to make estimates that are essentially identical to bootstraps from the entire sample. Similarly, bootstrap estimates of confidence intervals of threshold return estimates are found to be well approximated by using a subset consisting of the highest entries. This has practical consequences in fields such as meteorology, oceanography and hydrology where return values are calculated from very large gridded model integrations spanning decades at high temporal resolution or from large ensembles of independent and identically distributed model fields. In such cases the computational savings are substantial.
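
    A small sketch of the idea, with illustrative sizes: the bootstrap distribution of a high percentile estimated from only the largest 5% of entries approximates the full-sample bootstrap at a fraction of the cost.

        import numpy as np

        rng = np.random.default_rng(3)
        x = rng.gumbel(size=20_000)     # stand-in for a long model time series
        B, q = 500, 99.0                # bootstrap replications, target percentile

        # Full-sample bootstrap (costly when the sample is very large).
        full = np.array([np.percentile(rng.choice(x, x.size), q) for _ in range(B)])

        # Subset bootstrap: the 99th percentile depends only on the upper tail,
        # so resampling just the top 5% of entries suffices.
        k = int(0.05 * x.size)
        top = np.sort(x)[-k:]
        # 99th percentile overall corresponds to the 80th within the subset:
        q_sub = 100.0 * (1.0 - (100.0 - q) / 100.0 * x.size / k)
        sub = np.percentile(rng.choice(top, size=(B, k)), q_sub, axis=1)

        print("full-sample bootstrap 95% CI:", np.percentile(full, [2.5, 97.5]).round(3))
        print("subset bootstrap 95% CI:    ", np.percentile(sub, [2.5, 97.5]).round(3))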

  10. A random-sum Wilcoxon statistic and its application to analysis of ROC and LROC data.

    PubMed

    Tang, Liansheng Larry; Balakrishnan, N

    2011-01-01

    The Wilcoxon-Mann-Whitney statistic is commonly used for a distribution-free comparison of two groups. One requirement for its use is that the sample sizes of the two groups are fixed. This is violated in some of the applications such as medical imaging studies and diagnostic marker studies; in the former, the violation occurs since the number of correctly localized abnormal images is random, while in the latter the violation is due to some subjects not having observable measurements. For this reason, we propose here a random-sum Wilcoxon statistic for comparing two groups in the presence of ties, and derive its variance as well as its asymptotic distribution for large sample sizes. The proposed statistic includes the regular Wilcoxon rank-sum statistic as a special case. Finally, we apply the proposed statistic for summarizing location response operating characteristic data from a liver computed tomography study, and also for summarizing diagnostic accuracy of biomarker data.
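
    For reference, the regular fixed-sample-size Wilcoxon-Mann-Whitney comparison that the proposed statistic generalizes, on hypothetical diagnostic scores that include ties:

        import numpy as np
        from scipy.stats import mannwhitneyu

        healthy = np.array([1.2, 1.4, 1.4, 2.0, 2.1, 2.3, 2.3, 2.8])
        diseased = np.array([1.9, 2.3, 2.6, 2.6, 3.1, 3.4, 3.9])

        u, p = mannwhitneyu(diseased, healthy, alternative="greater")
        auc = u / (len(healthy) * len(diseased))  # U/(mn) estimates the ROC area
        print(f"U = {u}, one-sided p = {p:.4f}, empirical AUC = {auc:.3f}")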

  11. IGESS: a statistical approach to integrating individual-level genotype data and summary statistics in genome-wide association studies.

    PubMed

    Dai, Mingwei; Ming, Jingsi; Cai, Mingxuan; Liu, Jin; Yang, Can; Wan, Xiang; Xu, Zongben

    2017-09-15

    Results from genome-wide association studies (GWAS) suggest that a complex phenotype is often affected by many variants with small effects, known as 'polygenicity'. Tens of thousands of samples are often required to ensure statistical power of identifying these variants with small effects. However, it is often the case that a research group can only get approval for the access to individual-level genotype data with a limited sample size (e.g., a few hundred or thousand). Meanwhile, summary statistics generated using single-variant-based analysis are becoming publicly available. The sample sizes associated with the summary statistics datasets are usually quite large. How to make the most efficient use of existing abundant data resources largely remains an open question. In this study, we propose a statistical approach, IGESS, to increasing statistical power of identifying risk variants and improving accuracy of risk prediction by integrating individual-level genotype data and summary statistics. An efficient algorithm based on variational inference is developed to handle the genome-wide analysis. Through comprehensive simulation studies, we demonstrated the advantages of IGESS over the methods which take either individual-level data or summary statistics data as input. We applied IGESS to perform integrative analysis of Crohn's disease from WTCCC and summary statistics from other studies. IGESS was able to significantly increase the statistical power of identifying risk variants and improve the risk prediction accuracy from 63.2% (±0.4%) to 69.4% (±0.1%) using about 240,000 variants. The IGESS software is available at https://github.com/daviddaigithub/IGESS . zbxu@xjtu.edu.cn or xwan@comp.hkbu.edu.hk or eeyang@hkbu.edu.hk. Supplementary data are available at Bioinformatics online. © The Author (2017). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com

  12. Sampling and Data Analysis for Environmental Microbiology

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Murray, Christopher J.

    2001-06-01

    A brief review of the literature indicates the importance of statistical analysis in applied and environmental microbiology. Sampling designs are particularly important for successful studies, and it is highly recommended that researchers review their sampling design before heading to the laboratory or the field. Most statisticians have numerous stories of scientists who approached them after their study was complete only to have to tell them that the data they gathered could not be used to test the hypothesis they wanted to address. Once the data are gathered, a large and complex body of statistical techniques are available for analysis of the data. Those methods include both numerical and graphical techniques for exploratory characterization of the data. Hypothesis testing and analysis of variance (ANOVA) are techniques that can be used to compare the mean and variance of two or more groups of samples. Regression can be used to examine the relationships between sets of variables and is often used to examine the dependence of microbiological populations on microbiological parameters. Multivariate statistics provides several methods that can be used for interpretation of datasets with a large number of variables and to partition samples into similar groups, a task that is very common in taxonomy, but also has applications in other fields of microbiology. Geostatistics and other techniques have been used to examine the spatial distribution of microorganisms. The objectives of this chapter are to provide a brief survey of some of the statistical techniques that can be used for sample design and data analysis of microbiological data in environmental studies, and to provide some examples of their use from the literature.

  13. Investigating the Randomness of Numbers

    ERIC Educational Resources Information Center

    Pendleton, Kenn L.

    2009-01-01

    The use of random numbers is pervasive in today's world. Random numbers have practical applications in such far-flung arenas as computer simulations, cryptography, gambling, the legal system, statistical sampling, and even the war on terrorism. Evaluating the randomness of extremely large samples is a complex, intricate process. However, the…

  14. The relation between statistical power and inference in fMRI

    PubMed Central

    Wager, Tor D.; Yarkoni, Tal

    2017-01-01

    Statistically underpowered studies can result in experimental failure even when all other experimental considerations have been addressed impeccably. In fMRI the combination of a large number of dependent variables, a relatively small number of observations (subjects), and a need to correct for multiple comparisons can decrease statistical power dramatically. This problem has been clearly addressed yet remains controversial, especially with regard to the expected effect sizes in fMRI, and particularly for between-subjects effects such as group comparisons and brain-behavior correlations. We aimed to clarify the power problem by considering and contrasting two simulated scenarios of such possible brain-behavior correlations: weak diffuse effects and strong localized effects. Sampling from these scenarios shows that, particularly in the weak diffuse scenario, common sample sizes (n = 20–30) display extremely low statistical power, poorly represent the actual effects in the full sample, and show large variation on subsequent replications. Empirical data from the Human Connectome Project resembles the weak diffuse scenario much more than the localized strong scenario, which underscores the extent of the power problem for many studies. Possible solutions to the power problem include increasing the sample size, using less stringent thresholds, or focusing on a region-of-interest. However, these approaches are not always feasible and some have major drawbacks. The most prominent solutions that may help address the power problem include model-based (multivariate) prediction methods and meta-analyses with related synthesis-oriented approaches. PMID:29155843
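
    A back-of-the-envelope sketch (the standard Fisher z approximation, not the authors' simulations; the effect size r = 0.2 is hypothetical) of why n = 20-30 is underpowered for modest brain-behavior correlations:

        import numpy as np
        from scipy.stats import norm

        def corr_power(r, n, alpha=0.05):
            """Approximate two-sided power for H0: rho = 0 via Fisher's z."""
            z_crit = norm.ppf(1 - alpha / 2)
            ncp = np.arctanh(r) * np.sqrt(n - 3)
            return norm.sf(z_crit - ncp) + norm.cdf(-z_crit - ncp)

        for n in (20, 30, 100, 500):
            print(f"n={n:4d}  power to detect r=0.2: {corr_power(0.2, n):.2f}")
        # Typical samples (n = 20-30) leave power under ~0.2 for r = 0.2.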

  15. Reliability and statistical power analysis of cortical and subcortical FreeSurfer metrics in a large sample of healthy elderly.

    PubMed

    Liem, Franziskus; Mérillat, Susan; Bezzola, Ladina; Hirsiger, Sarah; Philipp, Michel; Madhyastha, Tara; Jäncke, Lutz

    2015-03-01

    FreeSurfer is a tool to quantify cortical and subcortical brain anatomy automatically and noninvasively. Previous studies have reported reliability and statistical power analyses in relatively small samples or only selected one aspect of brain anatomy. Here, we investigated reliability and statistical power of cortical thickness, surface area, volume, and the volume of subcortical structures in a large sample (N=189) of healthy elderly subjects (64+ years). Reliability (intraclass correlation coefficient) of cortical and subcortical parameters is generally high (cortical: ICCs>0.87, subcortical: ICCs>0.95). Surface-based smoothing increases reliability of cortical thickness maps, while it decreases reliability of cortical surface area and volume. Nevertheless, statistical power of all measures benefits from smoothing. When aiming to detect a 10% difference between groups, the number of subjects required to test effects with sufficient power over the entire cortex varies between cortical measures (cortical thickness: N=39, surface area: N=21, volume: N=81; 10mm smoothing, power=0.8, α=0.05). For subcortical regions this number is between 16 and 76 subjects, depending on the region. We also demonstrate the advantage of within-subject designs over between-subject designs. Furthermore, we publicly provide a tool that allows researchers to perform a priori power analysis and sensitivity analysis to help evaluate previously published studies and to design future studies with sufficient statistical power. Copyright © 2014 Elsevier Inc. All rights reserved.
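
    A hedged sketch of the kind of a priori calculation the released tool supports (a normal approximation to the two-sample t test, not the authors' tool): the per-group n needed to detect a 10% mean difference for a hypothetical coefficient of variation.

        import numpy as np
        from scipy.stats import norm

        def n_per_group(cv, rel_diff=0.10, alpha=0.05, power=0.80):
            """Per-group sample size to detect a relative mean difference
            rel_diff when the measure's coefficient of variation is cv."""
            z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
            return int(np.ceil(2 * (z * cv / rel_diff) ** 2))

        for cv in (0.05, 0.10, 0.20):   # hypothetical low-to-high variability
            print(f"CV={cv:.2f}: n = {n_per_group(cv)} per group")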

  16. Optimizing liquid effluent monitoring at a large nuclear complex.

    PubMed

    Chou, Charissa J; Barnett, D Brent; Johnson, Vernon G; Olson, Phil M

    2003-12-01

    Effluent monitoring typically requires a large number of analytes and samples during the initial or startup phase of a facility. Once a baseline is established, the analyte list and sampling frequency may be reduced. Although there is a large body of literature relevant to the initial design, few, if any, published papers exist on updating established effluent monitoring programs. This paper statistically evaluates four years of baseline data to optimize the liquid effluent monitoring efficiency of a centralized waste treatment and disposal facility at a large defense nuclear complex. Specific objectives were to: (1) assess temporal variability in analyte concentrations, (2) determine operational factors contributing to waste stream variability, (3) assess the probability of exceeding permit limits, and (4) streamline the sampling and analysis regime. Results indicated that the probability of exceeding permit limits was one in a million under normal facility operating conditions, sampling frequency could be reduced, and several analytes could be eliminated. Furthermore, indicators such as gross alpha and gross beta measurements could be used in lieu of more expensive specific isotopic analyses (radium, cesium-137, and strontium-90) for routine monitoring. Study results were used by the state regulatory agency to modify monitoring requirements for a new discharge permit, resulting in an annual cost savings of US dollars 223,000. This case study demonstrates that statistical evaluation of effluent contaminant variability coupled with process knowledge can help plant managers and regulators streamline analyte lists and sampling frequencies based on detection history and environmental risk.

  17. Statistical description of non-Gaussian samples in the F2 layer of the ionosphere during heliogeophysical disturbances

    NASA Astrophysics Data System (ADS)

    Sergeenko, N. P.

    2017-11-01

    An adequate statistical method should be developed in order to predict probabilistically the range of ionospheric parameters. This problem is solved in this paper. The time series of the critical frequency of the F2 layer, foF2(t), were subjected to statistical processing. For the obtained samples {δfoF2}, statistical distributions and invariants up to the fourth order are calculated. The analysis shows that the distributions differ from the Gaussian law during disturbances. At sufficiently small probability levels, there are arbitrarily large deviations from the model of the normal process. Therefore, an attempt is made to describe the statistical samples {δfoF2} with a Poisson model. For the studied samples, an exponential characteristic function is selected under the assumption that the time series are a superposition of deterministic and random processes. Using the Fourier transform, the characteristic function is transformed into a nonholomorphic, excessive-asymmetric probability density function. The statistical distributions of the samples {δfoF2} calculated for the disturbed periods are compared with the obtained model distribution function. According to the Kolmogorov criterion, the probabilities of the coincidence of a posteriori distributions with the theoretical ones are P ≈ 0.7-0.9. This analysis supports the applicability of a model based on the Poisson random process for the statistical description of the variations {δfoF2} and for probabilistic estimates during heliogeophysical disturbances.

  18. Simulation of Wind Profile Perturbations for Launch Vehicle Design

    NASA Technical Reports Server (NTRS)

    Adelfang, S. I.

    2004-01-01

    Ideally, a statistically representative sample of measured high-resolution wind profiles with wavelengths as small as tens of meters is required in design studies to establish aerodynamic load indicator dispersions and vehicle control system capability. At most potential launch sites, high-resolution wind profiles may not exist. Representative samples of Rawinsonde wind profiles to altitudes of 30 km are more likely to be available from the extensive network of measurement sites established for routine sampling in support of weather observing and forecasting activity. Such a sample, large enough to be statistically representative of relatively large wavelength perturbations, would be inadequate for launch vehicle design assessments because the Rawinsonde system accurately measures wind perturbations with wavelengths no smaller than 2000 m (1000 m altitude increment). The Kennedy Space Center (KSC) Jimsphere wind profiles (150/month and seasonal 2 and 3.5-hr pairs) are the only adequate samples of high resolution profiles (approx. 150 to 300 m effective resolution, but over-sampled at 25 m intervals) that have been used extensively for launch vehicle design assessments. Therefore, a simulation process has been developed for enhancement of measured low-resolution Rawinsonde profiles that would be applicable in preliminary launch vehicle design studies at launch sites other than KSC.

  19. Computing physical properties with quantum Monte Carlo methods with statistical fluctuations independent of system size.

    PubMed

    Assaraf, Roland

    2014-12-01

    We show that the recently proposed correlated sampling without reweighting procedure extends the locality (asymptotic independence of the system size) of a physical property to the statistical fluctuations of its estimator. This makes the approach potentially vastly more efficient for computing space-localized properties in large systems compared with standard correlated methods. A proof is given for a large collection of noninteracting fragments. Calculations on hydrogen chains suggest that this behavior holds not only for systems displaying short-range correlations, but also for systems with long-range correlations.

  20. Accurate low-cost methods for performance evaluation of cache memory systems

    NASA Technical Reports Server (NTRS)

    Laha, Subhasis; Patel, Janak H.; Iyer, Ravishankar K.

    1988-01-01

    Methods of simulation based on statistical techniques are proposed to decrease the need for large trace measurements and for predicting true program behavior. Sampling techniques are applied while the address trace is collected from a workload. This drastically reduces the space and time needed to collect the trace. Simulation techniques are developed to use the sampled data not only to predict the mean miss rate of the cache, but also to provide an empirical estimate of its actual distribution. Finally, a concept of primed cache is introduced to simulate large caches by the sampling-based method.

  1. Genome-wide association analysis of secondary imaging phenotypes from the Alzheimer's disease neuroimaging initiative study.

    PubMed

    Zhu, Wensheng; Yuan, Ying; Zhang, Jingwen; Zhou, Fan; Knickmeyer, Rebecca C; Zhu, Hongtu

    2017-02-01

    The aim of this paper is to systematically evaluate a biased sampling issue associated with genome-wide association analysis (GWAS) of imaging phenotypes for most imaging genetic studies, including the Alzheimer's Disease Neuroimaging Initiative (ADNI). Specifically, the original sampling scheme of these imaging genetic studies is primarily the retrospective case-control design, whereas most existing statistical analyses of these studies ignore such sampling scheme by directly correlating imaging phenotypes (called the secondary traits) with genotype. Although it has been well documented in genetic epidemiology that ignoring the case-control sampling scheme can produce highly biased estimates, and subsequently lead to misleading results and suspicious associations, such findings are not well documented in imaging genetics. We use extensive simulations and a large-scale imaging genetic data analysis of the Alzheimer's Disease Neuroimaging Initiative (ADNI) data to evaluate the effects of the case-control sampling scheme on GWAS results based on some standard statistical methods, such as linear regression methods, while comparing it with several advanced statistical methods that appropriately adjust for the case-control sampling scheme. Copyright © 2016 Elsevier Inc. All rights reserved.

  2. Statistical Analyses of Scatterplots to Identify Important Factors in Large-Scale Simulations

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kleijnen, J.P.C.; Helton, J.C.

    1999-04-01

    The robustness of procedures for identifying patterns in scatterplots generated in Monte Carlo sensitivity analyses is investigated. These procedures are based on attempts to detect increasingly complex patterns in the scatterplots under consideration and involve the identification of (1) linear relationships with correlation coefficients, (2) monotonic relationships with rank correlation coefficients, (3) trends in central tendency as defined by means, medians and the Kruskal-Wallis statistic, (4) trends in variability as defined by variances and interquartile ranges, and (5) deviations from randomness as defined by the chi-square statistic. The following two topics related to the robustness of these procedures are considered for a sequence of example analyses with a large model for two-phase fluid flow: the presence of Type I and Type II errors, and the stability of results obtained with independent Latin hypercube samples. Observations from analysis include: (1) Type I errors are unavoidable, (2) Type II errors can occur when inappropriate analysis procedures are used, (3) physical explanations should always be sought for why statistical procedures identify variables as being important, and (4) the identification of important variables tends to be stable for independent Latin hypercube samples.
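
    A minimal illustration (not the report's code) of the graded battery described above, applied to one simulated scatterplot with progressively weaker assumptions:

        import numpy as np
        from scipy.stats import pearsonr, spearmanr, kruskal

        rng = np.random.default_rng(4)
        x = rng.uniform(0, 1, 500)
        y = np.exp(2 * x) + rng.normal(0, 0.5, x.size)  # nonlinear, monotonic

        r, p_lin = pearsonr(x, y)            # (1) linear relationship
        rho, p_mono = spearmanr(x, y)        # (2) monotonic relationship
        # (3) trend in central tendency: Kruskal-Wallis across five bins of x
        bins = np.array_split(y[np.argsort(x)], 5)
        h, p_loc = kruskal(*bins)
        print(f"Pearson r={r:.2f} (p={p_lin:.1e}); Spearman rho={rho:.2f} "
              f"(p={p_mono:.1e}); Kruskal-Wallis H={h:.1f} (p={p_loc:.1e})")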

  3. An introduction to medical statistics for health care professionals: Hypothesis tests and estimation.

    PubMed

    Thomas, Elaine

    2005-01-01

    This article is the second in a series of three that will give health care professionals (HCPs) a sound introduction to medical statistics (Thomas, 2004). The objective of research is to find out about the population at large. However, it is generally not possible to study the whole of the population and research questions are addressed in an appropriate study sample. The next crucial step is then to use the information from the sample of individuals to make statements about the wider population of like individuals. This procedure of drawing conclusions about the population, based on study data, is known as inferential statistics. The findings from the study give us the best estimate of what is true for the relevant population, given the sample is representative of the population. It is important to consider how accurate this best estimate is, based on a single sample, when compared to the unknown population figure. Any difference between the observed sample result and the population characteristic is termed the sampling error. This article will cover the two main forms of statistical inference (hypothesis tests and estimation) along with issues that need to be addressed when considering the implications of the study results. Copyright (c) 2005 Whurr Publishers Ltd.

  4. ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition.

    PubMed

    Koslicki, David; Chatterjee, Saikat; Shahrivar, Damon; Walker, Alan W; Francis, Suzanna C; Fraser, Louise J; Vehkaperä, Mikko; Lan, Yueheng; Corander, Jukka

    2015-01-01

    Estimation of bacterial community composition from high-throughput sequenced 16S rRNA gene amplicons is a key task in microbial ecology. Since the sequence data from each sample typically consist of a large number of reads and are adversely impacted by different levels of biological and technical noise, accurate analysis of such large datasets is challenging. There has been a recent surge of interest in using compressed-sensing-inspired and convex-optimization-based methods to solve the estimation problem for bacterial community composition. These methods typically rely on summarizing the sequence data by frequencies of low-order k-mers and matching this information statistically with a taxonomically structured database. Here we show that the accuracy of the resulting community composition estimates can be substantially improved by aggregating the reads from a sample with an unsupervised machine learning approach prior to the estimation phase. The aggregation of reads is a pre-processing step in which a standard K-means clustering algorithm partitions a large set of reads into subsets at reasonable computational cost, providing several vectors of first-order statistics instead of a single statistical summary in terms of k-mer frequencies. The output of the clustering is then processed further to obtain the final estimate for each sample. The resulting method is called Aggregation of Reads by K-means (ARK), and it is based on a statistical argument via mixture density formulation. ARK is found to improve the fidelity and robustness of several recently introduced methods, with only a modest increase in computational complexity. An open source, platform-independent implementation of the method in the Julia programming language is freely available at https://github.com/dkoslicki/ARK. A Matlab implementation is available at http://www.ee.kth.se/ctsoftware.
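
    A simplified sketch of the pre-processing idea (the released ARK implementations are in Julia and Matlab; this is an illustrative Python analogue on simulated reads, not the package itself):

        from itertools import product

        import numpy as np
        from sklearn.cluster import KMeans

        K = 3                                        # k-mer length
        KMERS = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=K))}

        def kmer_freqs(read):
            v = np.zeros(len(KMERS))
            for i in range(len(read) - K + 1):
                v[KMERS[read[i:i + K]]] += 1
            return v / max(v.sum(), 1)

        rng = np.random.default_rng(5)
        reads = ["".join(rng.choice(list("ACGT"), 100)) for _ in range(1000)]
        X = np.array([kmer_freqs(r) for r in reads])

        # Partition reads into clusters and keep one k-mer frequency vector per
        # cluster (plus its weight) instead of a single whole-sample summary.
        km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)
        weights = np.bincount(km.labels_, minlength=8) / len(reads)
        print(weights.round(3), km.cluster_centers_.shape)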

  5. State Estimates of Disability in America. Disability Statistics Report 3.

    ERIC Educational Resources Information Center

    LaPlante, Mitchell P.

    This study presents and discusses existing data on disability by state, from the 1980 and 1990 censuses, the Current Population Survey (CPS), and the National Health Interview Survey (NHIS). The study used direct methods for states with large sample sizes and synthetic estimates for states with low sample sizes. The study's highlighted findings…

  6. Statistical Aspects of Point Count Sampling

    Treesearch

    Richard J. Barker; John R. Sauer

    1995-01-01

    The dominant feature of point counts is that they do not census birds, but instead provide incomplete counts of individuals present within a survey plot. Considering a simple model for point count sampling, we demonstrate that use of these incomplete counts can bias estimators and testing procedures, leading to inappropriate conclusions. A large portion of the...

  7. Foundational Principles for Large-Scale Inference: Illustrations Through Correlation Mining.

    PubMed

    Hero, Alfred O; Rajaratnam, Bala

    2016-01-01

    When can reliable inference be drawn in the "Big Data" context? This paper presents a framework for answering this fundamental question in the context of correlation mining, with implications for general large-scale inference. In large-scale data applications like genomics, connectomics, and eco-informatics the dataset is often variable-rich but sample-starved: a regime where the number n of acquired samples (statistical replicates) is far fewer than the number p of observed variables (genes, neurons, voxels, or chemical constituents). Much of recent work has focused on understanding the computational complexity of proposed methods for "Big Data". Sample complexity however has received relatively less attention, especially in the setting when the sample size n is fixed, and the dimension p grows without bound. To address this gap, we develop a unified statistical framework that explicitly quantifies the sample complexity of various inferential tasks. Sampling regimes can be divided into several categories: 1) the classical asymptotic regime where the variable dimension is fixed and the sample size goes to infinity; 2) the mixed asymptotic regime where both variable dimension and sample size go to infinity at comparable rates; 3) the purely high dimensional asymptotic regime where the variable dimension goes to infinity and the sample size is fixed. Each regime has its niche but only the latter regime applies to exascale data dimension. We illustrate this high dimensional framework for the problem of correlation mining, where it is the matrix of pairwise and partial correlations among the variables that are of interest. Correlation mining arises in numerous applications and subsumes the regression context as a special case. We demonstrate various regimes of correlation mining based on the unifying perspective of high dimensional learning rates and sample complexity for different structured covariance models and different inference tasks.

  8. Speeding Up Non-Parametric Bootstrap Computations for Statistics Based on Sample Moments in Small/Moderate Sample Size Applications

    PubMed Central

    Chaibub Neto, Elias

    2015-01-01

    In this paper we propose a vectorized implementation of the non-parametric bootstrap for statistics based on sample moments. Basically, we adopt the multinomial sampling formulation of the non-parametric bootstrap, and compute bootstrap replications of sample moment statistics by simply weighting the observed data according to multinomial counts instead of evaluating the statistic on a resampled version of the observed data. Using this formulation we can generate a matrix of bootstrap weights and compute the entire vector of bootstrap replications with a few matrix multiplications. Vectorization is particularly important for matrix-oriented programming languages such as R, where matrix/vector calculations tend to be faster than scalar operations implemented in a loop. We illustrate the application of the vectorized implementation in real and simulated data sets, when bootstrapping Pearson’s sample correlation coefficient, and compared its performance against two state-of-the-art R implementations of the non-parametric bootstrap, as well as a straightforward one based on a for loop. Our investigations spanned varying sample sizes and number of bootstrap replications. The vectorized bootstrap compared favorably against the state-of-the-art implementations in all cases tested, and was remarkably/considerably faster for small/moderate sample sizes. The same results were observed in the comparison with the straightforward implementation, except for large sample sizes, where the vectorized bootstrap was slightly slower than the straightforward implementation due to increased time expenditures in the generation of weight matrices via multinomial sampling. PMID:26125965
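
    A compact sketch of the multinomial-weights formulation on simulated data: bootstrap replications of moment-based statistics become matrix products with a weight matrix, with no explicit resampling loop.

        import numpy as np

        rng = np.random.default_rng(6)
        x = rng.normal(size=50)        # observed sample (small n, as in the paper)
        n, B = x.size, 10_000

        # Each row of W is a multinomial draw over the n observations, scaled
        # to sum to 1: the resampling weights of one bootstrap replication.
        W = rng.multinomial(n, np.full(n, 1.0 / n), size=B) / n

        boot_means = W @ x                                  # first moment
        boot_vars = W @ x**2 - boot_means**2                # plug-in variance
        print("95% CI for the mean:", np.percentile(boot_means, [2.5, 97.5]))
        print("95% CI for the variance:", np.percentile(boot_vars, [2.5, 97.5]))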

  9. Overweight and Obesity: Prevalence and Correlates in a Large Clinical Sample of Children with Autism Spectrum Disorder

    ERIC Educational Resources Information Center

    Zuckerman, Katharine E.; Hill, Alison P.; Guion, Kimberly; Voltolina, Lisa; Fombonne, Eric

    2014-01-01

    Autism Spectrum Disorders (ASDs) and childhood obesity (OBY) are rising public health concerns. This study aimed to evaluate the prevalence of overweight (OWT) and OBY in a sample of 376 Oregon children with ASD, and to assess correlates of OWT and OBY in this sample. We used descriptive statistics, bivariate, and focused multivariate analyses to…

  10. Partitioning heritability by functional annotation using genome-wide association summary statistics.

    PubMed

    Finucane, Hilary K; Bulik-Sullivan, Brendan; Gusev, Alexander; Trynka, Gosia; Reshef, Yakir; Loh, Po-Ru; Anttila, Verneri; Xu, Han; Zang, Chongzhi; Farh, Kyle; Ripke, Stephan; Day, Felix R; Purcell, Shaun; Stahl, Eli; Lindstrom, Sara; Perry, John R B; Okada, Yukinori; Raychaudhuri, Soumya; Daly, Mark J; Patterson, Nick; Neale, Benjamin M; Price, Alkes L

    2015-11-01

    Recent work has demonstrated that some functional categories of the genome contribute disproportionately to the heritability of complex diseases. Here we analyze a broad set of functional elements, including cell type-specific elements, to estimate their polygenic contributions to heritability in genome-wide association studies (GWAS) of 17 complex diseases and traits with an average sample size of 73,599. To enable this analysis, we introduce a new method, stratified LD score regression, for partitioning heritability from GWAS summary statistics while accounting for linked markers. This new method is computationally tractable at very large sample sizes and leverages genome-wide information. Our findings include a large enrichment of heritability in conserved regions across many traits, a very large immunological disease-specific enrichment of heritability in FANTOM5 enhancers and many cell type-specific enrichments, including significant enrichment of central nervous system cell types in the heritability of body mass index, age at menarche, educational attainment and smoking behavior.

  11. Steganalysis of recorded speech

    NASA Astrophysics Data System (ADS)

    Johnson, Micah K.; Lyu, Siwei; Farid, Hany

    2005-03-01

    Digital audio provides a suitable cover for high-throughput steganography. At 16 bits per sample and sampled at a rate of 44,100 Hz, digital audio has the bit-rate to support large messages. In addition, audio is often transient and unpredictable, facilitating the hiding of messages. Using an approach similar to our universal image steganalysis, we show that hidden messages alter the underlying statistics of audio signals. Our statistical model begins by building a linear basis that captures certain statistical properties of audio signals. A low-dimensional statistical feature vector is extracted from this basis representation and used by a non-linear support vector machine for classification. We show the efficacy of this approach on LSB embedding and Hide4PGP. While no explicit assumptions about the content of the audio are made, our technique has been developed and tested on high-quality recorded speech.

  12. Efficient statistical tests to compare Youden index: accounting for contingency correlation.

    PubMed

    Chen, Fangyao; Xue, Yuqiang; Tan, Ming T; Chen, Pingyan

    2015-04-30

    Youden index is widely utilized in studies evaluating accuracy of diagnostic tests and performance of predictive, prognostic, or risk models. However, both one-sample and two-independent-sample tests on the Youden index have been derived ignoring the dependence (association) between sensitivity and specificity, resulting in potentially misleading findings. Moreover, a paired-sample test on the Youden index has been unavailable. This article develops efficient statistical inference procedures for one-sample, independent, and paired-sample tests on the Youden index by accounting for contingency correlation, namely associations between sensitivity and specificity and paired samples typically represented in contingency tables. For the one-sample and two-independent-sample tests, the variances are estimated by the delta method, and the statistical inference is based on the central limit theory, which are then verified by bootstrap estimates. For the paired-sample test, we show that the estimated covariance of the two sensitivities and specificities can be represented as a function of the kappa statistic so the test can be readily carried out. We then show the remarkable accuracy of the estimated variance using a constrained optimization approach. Simulation is performed to evaluate the statistical properties of the derived tests. The proposed approaches yield more stable type I errors at the nominal level and substantially higher power (efficiency) than does the original Youden's approach. Therefore, the simple explicit large sample solution performs very well. Because we can readily implement the asymptotic and exact bootstrap computation with common software like R, the method is broadly applicable to the evaluation of diagnostic tests and model performance. Copyright © 2015 John Wiley & Sons, Ltd.
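
    An illustrative baseline sketch with hypothetical counts: the Youden index from a 2x2 table with a parametric bootstrap that treats the case and control rows as independent binomials, i.e., without the correlation adjustment that is the paper's contribution.

        import numpy as np

        tp, fn, tn, fp = 80, 20, 90, 10      # hypothetical diagnostic counts
        rng = np.random.default_rng(7)

        def youden(tp, fn, tn, fp):
            return tp / (tp + fn) + tn / (tn + fp) - 1

        B = 10_000
        tp_b = rng.binomial(tp + fn, tp / (tp + fn), size=B)   # resample cases
        tn_b = rng.binomial(tn + fp, tn / (tn + fp), size=B)   # resample controls
        J_b = youden(tp_b, (tp + fn) - tp_b, tn_b, (tn + fp) - tn_b)
        print(f"J = {youden(tp, fn, tn, fp):.2f}, "
              f"95% CI = {np.percentile(J_b, [2.5, 97.5]).round(2)}")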

  13. Differences in results of analyses of concurrent and split stream-water samples collected and analyzed by the US Geological Survey and the Illinois Environmental Protection Agency, 1985-91

    USGS Publications Warehouse

    Melching, C.S.; Coupe, R.H.

    1995-01-01

    During water years 1985-91, the U.S. Geological Survey (USGS) and the Illinois Environmental Protection Agency (IEPA) cooperated in the collection and analysis of concurrent and split stream-water samples from selected sites in Illinois. Concurrent samples were collected independently by field personnel from each agency at the same time and sent to the IEPA laboratory, whereas the split samples were collected by USGS field personnel and divided into aliquots that were sent to each agency's laboratory for analysis. The water-quality data from these programs were examined by means of the Wilcoxon signed ranks test to identify statistically significant differences between results of the USGS and IEPA analyses. The data sets for constituents and properties identified by the Wilcoxon test as having significant differences were further examined by use of the paired t-test, mean relative percentage difference, and scattergrams to determine if the differences were important. Of the 63 constituents and properties in the concurrent-sample analysis, differences in only 2 (pH and ammonia) were statistically significant and large enough to concern water-quality engineers and planners. Of the 27 constituents and properties in the split-sample analysis, differences in 9 (turbidity, dissolved potassium, ammonia, total phosphorus, dissolved aluminum, dissolved barium, dissolved iron, dissolved manganese, and dissolved nickel) were statistically significant and large enough to concern water-quality engineers and planners. The differences in concentration between pairs of the concurrent samples were compared to the precision of the laboratory or field method used. The differences in concentration between pairs of split samples were compared to the precision of the laboratory method used and the interlaboratory precision of measuring a given concentration or property. Consideration of method precision indicated that differences between concurrent samples were insignificant for all concentrations and properties except pH, and that differences between split samples were significant for all concentrations and properties. Consideration of interlaboratory precision indicated that the differences between the split samples were not unusually large. The results for the split samples illustrate the difficulty in obtaining comparable and accurate water-quality data.

  14. Static versus dynamic sampling for data mining

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    John, G.H.; Langley, P.

    1996-12-31

    As data warehouses grow to the point where one hundred gigabytes is considered small, the computational efficiency of data-mining algorithms on large databases becomes increasingly important. Using a sample from the database can speed up the data-mining process, but this is only acceptable if it does not reduce the quality of the mined knowledge. To this end, we introduce the "Probably Close Enough" criterion to describe the desired properties of a sample. Sampling usually refers to the use of static statistical tests to decide whether a sample is sufficiently similar to the large database, in the absence of any knowledge of the tools the data miner intends to use. We discuss dynamic sampling methods, which take into account the mining tool being used and can thus give better samples. We describe dynamic schemes that observe a mining tool's performance on training samples of increasing size and use these results to determine when a sample is sufficiently large. We evaluate these sampling methods on data from the UCI repository and conclude that dynamic sampling is preferable.

  15. Experimental Study of the Effect of the Initial Spectrum Width on the Statistics of Random Wave Groups

    NASA Astrophysics Data System (ADS)

    Shemer, L.; Sergeeva, A.

    2009-12-01

    The statistics of a random water wave field determine the probability of appearance of extremely high (freak) waves. This probability is strongly related to the spectral wave field characteristics. Laboratory investigation of the spatial variation of the random wave-field statistics for various initial conditions is thus of substantial practical importance. Unidirectional nonlinear random wave groups are investigated experimentally in the 300 m long Large Wave Channel (GWK) in Hannover, Germany, which is the biggest facility of its kind in Europe. Numerous realizations of a wave field with the prescribed frequency power spectrum, yet randomly-distributed initial phases of each harmonic, were generated by a computer-controlled piston-type wavemaker. Several initial spectral shapes with identical dominant wave length but different width were considered. For each spectral shape, the total duration of sampling in all realizations was long enough to yield sufficient sample size for reliable statistics. Through all experiments, an effort had been made to retain the characteristic wave height value and thus the degree of nonlinearity of the wave field. Spatial evolution of numerous statistical wave field parameters (skewness, kurtosis and probability distributions) is studied using about 25 wave gauges distributed along the tank. It is found that, depending on the initial spectral shape, the frequency spectrum of the wave field may undergo significant modification in the course of its evolution along the tank; the values of all statistical wave parameters are strongly related to the local spectral width. A sample of the measured wave height probability functions (scaled by the variance of surface elevation) is plotted in Fig. 1 for the initially narrow rectangular spectrum. The results in Fig. 1 resemble findings obtained in [1] for the initial Gaussian spectral shape. The probability of large waves notably surpasses that predicted by the Rayleigh distribution and is the highest at the distance of about 100 m. Acknowledgement: This study is carried out in the framework of the EC-supported project "Transnational access to large-scale tests in the Large Wave Channel (GWK) of Forschungszentrum Küste" (Contract HYDRALAB III - No. 022441). [1] L. Shemer and A. Sergeeva, J. Geophys. Res. Oceans 114, C01015 (2009). Figure 1. Variation along the tank of the measured wave height distribution for rectangular initial spectral shape; the carrier wave period T0 = 1.5 s.

  16. Latent spatial models and sampling design for landscape genetics

    USGS Publications Warehouse

    Hanks, Ephraim M.; Hooten, Mevin B.; Knick, Steven T.; Oyler-McCance, Sara J.; Fike, Jennifer A.; Cross, Todd B.; Schwartz, Michael K.

    2016-01-01

    We propose a spatially-explicit approach for modeling genetic variation across space and illustrate how this approach can be used to optimize spatial prediction and sampling design for landscape genetic data. We propose a multinomial data model for categorical microsatellite allele data commonly used in landscape genetic studies, and introduce a latent spatial random effect to allow for spatial correlation between genetic observations. We illustrate how modern dimension reduction approaches to spatial statistics can allow for efficient computation in landscape genetic statistical models covering large spatial domains. We apply our approach to propose a retrospective spatial sampling design for greater sage-grouse (Centrocercus urophasianus) population genetics in the western United States.

  17. The coverage of a random sample from a biological community.

    PubMed

    Engen, S

    1975-03-01

    A taxonomic group will frequently have a large number of species with small abundances. When a sample is drawn at random from this group, one is therefore faced with the problem that a large proportion of the species will not be discovered. A general definition of quantitative measures of "sample coverage" is proposed, and the problem of statistical inference is considered for two special cases, (1) the actual total relative abundance of those species that are represented in the sample, and (2) their relative contribution to the information index of diversity. The analysis is based on an extended version of the negative binomial species frequency model. The results are tabulated.
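
    For orientation, the classical Turing/Good estimate of sample coverage, 1 - f1/n with f1 the number of singleton species, illustrates the quantity in case (1); it is not Engen's negative-binomial-model estimator, and the sample below is hypothetical.

        from collections import Counter

        # Hypothetical sample: species labels of n captured individuals.
        sample = list("AAAABBBCCDDEFG")   # species A..G; E, F, G are singletons
        counts = Counter(sample)
        n = len(sample)
        f1 = sum(1 for c in counts.values() if c == 1)   # number of singletons

        coverage = 1 - f1 / n             # Turing/Good coverage estimate
        print(f"species observed: {len(counts)}, singletons: {f1}, "
              f"estimated coverage: {coverage:.2f}")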

  18. The multiple imputation method: a case study involving secondary data analysis.

    PubMed

    Walani, Salimah R; Cleland, Charles M

    2015-05-01

    To illustrate with the example of a secondary data analysis study the use of the multiple imputation method to replace missing data. Most large public datasets have missing data, which need to be handled by researchers conducting secondary data analysis studies. Multiple imputation is a technique widely used to replace missing values while preserving the sample size and sampling variability of the data. The 2004 National Sample Survey of Registered Nurses. The authors created a model to impute missing values using the chained equation method. They used imputation diagnostics procedures and conducted regression analysis of imputed data to determine the differences between the log hourly wages of internationally educated and US-educated registered nurses. The authors used multiple imputation procedures to replace missing values in a large dataset with 29,059 observations. Five multiple imputed datasets were created. Imputation diagnostics using time series and density plots showed that imputation was successful. The authors also present an example of the use of multiple imputed datasets to conduct regression analysis to answer a substantive research question. Multiple imputation is a powerful technique for imputing missing values in large datasets while preserving the sample size and variance of the data. Even though the chained equation method involves complex statistical computations, recent innovations in software and computation have made it possible for researchers to conduct this technique on large datasets. The authors recommend nurse researchers use multiple imputation methods for handling missing data to improve the statistical power and external validity of their studies.
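
    A minimal chained-equations sketch using statsmodels' MICE on simulated data; the variable names are illustrative stand-ins, not the NSSRN survey, and the analysis model is a simple OLS wage regression.

        import numpy as np
        import pandas as pd
        import statsmodels.api as sm
        from statsmodels.imputation.mice import MICE, MICEData

        rng = np.random.default_rng(8)
        n = 1000
        educ = rng.normal(size=n)
        intl = rng.binomial(1, 0.1, n)              # internationally educated flag
        logwage = 3.0 + 0.1 * educ - 0.05 * intl + rng.normal(0, 0.2, n)
        df = pd.DataFrame({"logwage": logwage, "educ": educ, "intl": intl})
        df.loc[rng.random(n) < 0.2, "logwage"] = np.nan   # 20% missing at random

        imp = MICEData(df)                    # chained-equation imputation model
        model = MICE("logwage ~ educ + intl", sm.OLS, imp)
        results = model.fit(n_burnin=10, n_imputations=5)  # pooled estimates
        print(results.summary())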

  19. When the Test of Mediation is More Powerful than the Test of the Total Effect

    PubMed Central

    O'Rourke, Holly P.; MacKinnon, David P.

    2014-01-01

    Although previous research has studied power in mediation models, the extent to which the inclusion of a mediator will increase power has not been investigated. First, a study compared analytical power of the mediated effect to the total effect in a single mediator model to identify the situations in which the inclusion of one mediator increased statistical power. Results from the first study indicated that including a mediator increased statistical power in small samples with large coefficients and in large samples with small coefficients, and when coefficients were non-zero and equal across models. Next, a study identified conditions where power was greater for the test of the total mediated effect compared to the test of the total effect in the parallel two mediator model. Results indicated that including two mediators increased power in small samples with large coefficients and in large samples with small coefficients, the same pattern of results found in the first study. Finally, a study assessed analytical power for a sequential (three-path) two mediator model and compared power to detect the three-path mediated effect to power to detect both the test of the total effect and the test of the mediated effect for the single mediator model. Results indicated that the three-path mediated effect had more power than the mediated effect from the single mediator model and the test of the total effect. Practical implications of these results for researchers are then discussed. PMID:24903690
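
    A rough simulation sketch of the comparison described above: power of the joint-significance test of the mediated effect (a and b paths both significant) versus the test of the total effect, for one hypothetical small-sample condition with no direct effect of X on Y.

        import numpy as np
        from scipy import stats

        def sim_once(n, a, b, rng):
            x = rng.normal(size=n)
            m = a * x + rng.normal(size=n)
            y = b * m + rng.normal(size=n)       # fully mediated model
            p_total = stats.linregress(x, y).pvalue   # test of total effect c
            p_a = stats.linregress(x, m).pvalue       # a path
            # b path via residualization (Frisch-Waugh); the one-df error in
            # the p-value is negligible for this sketch.
            m_r = m - stats.linregress(x, m).slope * x
            y_r = y - stats.linregress(x, y).slope * x
            p_b = stats.linregress(m_r, y_r).pvalue
            return (p_a < 0.05 and p_b < 0.05, p_total < 0.05)

        rng = np.random.default_rng(9)
        n, a, b, reps = 50, 0.5, 0.5, 2000
        hits = np.array([sim_once(n, a, b, rng) for _ in range(reps)])
        print(f"power of mediated-effect test: {hits[:, 0].mean():.2f}; "
              f"power of total-effect test: {hits[:, 1].mean():.2f}")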

  20. Sampling benthic macroinvertebrates in a large flood-plain river: Considerations of study design, sample size, and cost

    USGS Publications Warehouse

    Bartsch, L.A.; Richardson, W.B.; Naimo, T.J.

    1998-01-01

    Estimation of benthic macroinvertebrate populations over large spatial scales is difficult due to the high variability in abundance and the cost of sample processing and taxonomic analysis. To determine a cost-effective, statistically powerful sample design, we conducted an exploratory study of the spatial variation of benthic macroinvertebrates in a 37 km reach of the Upper Mississippi River. We sampled benthos at 36 sites within each of two strata, contiguous backwater and channel border. Three standard ponar (525 cm²) grab samples were obtained at each site ('Original Design'). Analysis of variance and sampling cost of strata-wide estimates for abundance of Oligochaeta, Chironomidae, and total invertebrates showed that only one ponar sample per site ('Reduced Design') yielded essentially the same abundance estimates as the Original Design, while reducing the overall cost by 63%. A posteriori statistical power analysis (α = 0.05, β = 0.20) on the Reduced Design estimated that at least 18 sites per stratum were needed to detect differences in mean abundance between contiguous backwater and channel border areas for Oligochaeta, Chironomidae, and total invertebrates. Statistical power was nearly identical for the three taxonomic groups. The abundances of several taxa of concern (e.g., Hexagenia mayflies and Musculium fingernail clams) were too spatially variable to estimate power with our method. Resampling simulations indicated that to achieve adequate sampling precision for Oligochaeta, at least 36 sample sites per stratum would be required, whereas a sampling precision of 0.2 would not be attained with any sample size for Hexagenia in channel border areas, or Chironomidae and Musculium in both strata given the variance structure of the original samples. Community-wide diversity indices (Brillouin and 1 − Simpson's) increased as sample area per site increased. The backwater area had higher diversity than the channel border area. The number of sampling sites required to sample benthic macroinvertebrates during our sampling period depended on the study objective and ranged from 18 to more than 40 sites per stratum. No single sampling regime would efficiently and adequately sample all components of the macroinvertebrate community.

  1. Methodological Issues Related to the Use of P Less than 0.05 in Health Behavior Research

    ERIC Educational Resources Information Center

    Duryea, Elias; Graner, Stephen P.; Becker, Jeremy

    2009-01-01

    This paper reviews methodological issues related to the use of P less than 0.05 in health behavior research and suggests how application and presentation of statistical significance may be improved. Assessment of sample size and P less than 0.05, the file drawer problem, the Law of Large Numbers and the statistical significance arguments in…

  2. Estimating migratory fish distribution from altitude and basin area: a case study in a large Neotropical river

    Treesearch

    Jose Ricardo Barradas; Lucas G. Silva; Bret C. Harvey; Nelson F. Fontoura

    2012-01-01

    1. The objective of this study was to identify longitudinal distribution patterns of large migratory fish species in the Uruguay River basin, southern Brazil, and construct statistical distribution models for Salminus brasiliensis, Prochilodus lineatus, Leporinus obtusidens and Pseudoplatystoma corruscans. 2. The sampling programme resulted in 202 interviews with old...

  3. Understanding Loan Use and Debt Burden among Low-Income and Minority Students at a Large Urban Community College

    ERIC Educational Resources Information Center

    Luna-Torres, Maria; McKinney, Lyle; Horn, Catherine; Jones, Sara

    2018-01-01

    This study examined a sample of community college students from a diverse, large urban community college system in Texas. To gain a deeper understanding about the effects of background characteristics on student borrowing behaviors and enrollment outcomes, the study employed descriptive statistics and regression techniques to examine two separate…

  4. Statistical Searches for Microlensing Events in Large, Non-uniformly Sampled Time-Domain Surveys: A Test Using Palomar Transient Factory Data

    NASA Astrophysics Data System (ADS)

    Price-Whelan, Adrian M.; Agüeros, Marcel A.; Fournier, Amanda P.; Street, Rachel; Ofek, Eran O.; Covey, Kevin R.; Levitan, David; Laher, Russ R.; Sesar, Branimir; Surace, Jason

    2014-01-01

    Many photometric time-domain surveys are driven by specific goals, such as searches for supernovae or transiting exoplanets, which set the cadence with which fields are re-imaged. In the case of the Palomar Transient Factory (PTF), several sub-surveys are conducted in parallel, leading to non-uniform sampling over its ~20,000 deg² footprint. While the median 7.26 deg² PTF field has been imaged ~40 times in the R band, ~2300 deg² have been observed >100 times. We use PTF data to study the trade-off between searching for microlensing events in a survey whose footprint is much larger than that of typical microlensing searches, but with far-from-optimal time sampling. To examine the probability that microlensing events can be recovered in these data, we test statistics used on uniformly sampled data to identify variables and transients. We find that the von Neumann ratio performs best for identifying simulated microlensing events in our data. We develop a selection method using this statistic and apply it to data from fields with >10 R-band observations, 1.1 × 10⁹ light curves, uncovering three candidate microlensing events. We lack simultaneous, multi-color photometry to confirm these as microlensing events. However, their number is consistent with predictions for the event rate in the PTF footprint over the survey's three years of operations, as estimated from near-field microlensing models. This work can help constrain all-sky event rate predictions and tests microlensing signal recovery in large data sets, which will be useful to future time-domain surveys, such as that planned with the Large Synoptic Survey Telescope.

  5. Four hundred or more participants needed for stable contingency table estimates of clinical prediction rule performance.

    PubMed

    Kent, Peter; Boyle, Eleanor; Keating, Jennifer L; Albert, Hanne B; Hartvigsen, Jan

    2017-02-01

    To quantify variability in the results of statistical analyses based on contingency tables and discuss the implications for the choice of sample size for studies that derive clinical prediction rules. An analysis of three pre-existing sets of large cohort data (n = 4,062-8,674) was performed. In each data set, repeated random sampling of various sample sizes, from n = 100 up to n = 2,000, was performed 100 times at each sample size and the variability in estimates of sensitivity, specificity, positive and negative likelihood ratios, posttest probabilities, odds ratios, and risk/prevalence ratios for each sample size was calculated. There were very wide, and statistically significant, differences in estimates derived from contingency tables from the same data set when calculated in sample sizes below 400 people, and typically, this variability stabilized in samples of 400-600 people. Although estimates of prevalence also varied significantly in samples below 600 people, that relationship only explains a small component of the variability in these statistical parameters. To reduce sample-specific variability, contingency tables should consist of 400 participants or more when used to derive clinical prediction rules or test their performance. Copyright © 2016 Elsevier Inc. All rights reserved.
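
    A sketch of the resampling experiment described above, on a synthetic cohort with hypothetical test characteristics: repeated subsamples of increasing size show the spread of the sensitivity estimate shrinking and stabilizing as n approaches 400-600.

        import numpy as np

        rng = np.random.default_rng(10)
        N = 8000
        condition = rng.binomial(1, 0.3, N)          # true condition present
        # A predictor with roughly 75% sensitivity and 70% specificity:
        test_pos = np.where(condition == 1,
                            rng.binomial(1, 0.75, N),
                            rng.binomial(1, 0.30, N))

        for n in (100, 200, 400, 600, 1000, 2000):
            sens = []
            for _ in range(100):
                idx = rng.choice(N, size=n, replace=False)
                c, t = condition[idx], test_pos[idx]
                sens.append(t[c == 1].mean())
            print(f"n={n:5d}: sensitivity over 100 draws ranges from "
                  f"{min(sens):.2f} to {max(sens):.2f}")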

  6. Chlorophyll a and inorganic suspended solids in backwaters of the upper Mississippi River system: Backwater lake effects and their associations with selected environmental predictors

    USGS Publications Warehouse

    Rogala, James T.; Gray, Brian R.

    2006-01-01

    The Long Term Resource Monitoring Program (LTRMP) uses a stratified random sampling design to obtain water quality statistics within selected study reaches of the Upper Mississippi River System (UMRS). LTRMP sampling strata are based on aquatic area types generally found in large rivers (e.g., main channel, side channel, backwater, and impounded areas). For hydrologically well-mixed strata (i.e., main channel), variance associated with spatial scales smaller than the strata scale is a relatively minor issue for many water quality parameters. However, analysis of LTRMP water quality data has shown that within-strata variability at the strata scale is high in off-channel areas (i.e., backwaters). A portion of that variability may be associated with differences among individual backwater lakes (i.e., small and large backwater regions separated by channels) that cumulatively make up the backwater stratum. The objective of the statistical modeling presented here is to determine if differences among backwater lakes account for a large portion of the variance observed in the backwater stratum for selected parameters. If variance associated with backwater lakes is high, then inclusion of backwater lake effects within statistical models is warranted. Further, lakes themselves may represent natural experimental units where associations of interest to management may be estimated.

  7. Reliability Sampling Plans: A Review and Some New Results

    ERIC Educational Resources Information Center

    Isaic-Maniu, Alexandru; Voda, Viorel Gh.

    2009-01-01

    In this work we present a broad range of aspects related to the problem of sampling inspection in the case of reliability. First we discuss the current status of this domain, mentioning the newest approaches (from a technical viewpoint) such as HALT and HASS and the statistical perspective. After a brief description of the general procedure in…

  8. Wilderness adventure therapy effects on the mental health of youth participants.

    PubMed

    Bowen, Daniel J; Neill, James T; Crisp, Simon J R

    2016-10-01

    Adventure therapy offers a prevention, early intervention, and treatment modality for people with behavioural, psychological, and psychosocial issues. It can appeal to youth-at-risk who are often less responsive to traditional psychotherapeutic interventions. This study evaluated Wilderness Adventure Therapy (WAT) outcomes based on participants' pre-program, post-program, and follow-up responses to self-report questionnaires. The sample consisted of 36 adolescent out-patients with mixed mental health issues who completed a 10-week, manualised WAT intervention. The overall short-term standardised mean effect size was small, positive, and statistically significant (0.26), with moderate, statistically significant improvements in psychological resilience and social self-esteem. Total short-term effects were within age-based adventure therapy meta-analytic benchmark 90% confidence intervals, except for the change in suicidality, which was lower than the comparable benchmark. The short-term changes were retained at the three-month follow-up, except for family functioning (significant reduction) and suicidality (significant improvement). For participants in clinical ranges pre-program, there was a large, statistically significant reduction in depressive symptomology, and large to very large, statistically significant improvements in behavioural and emotional functioning. These changes were retained at the three-month follow-up. These findings indicate that WAT is as effective as traditional psychotherapy techniques for clinically symptomatic people. Future research utilising a comparison or wait-list control group, multiple sources of data, and a larger sample could help to qualify and extend these findings. Copyright © 2016 The Authors. Published by Elsevier Ltd. All rights reserved.

  9. Statistical Learning Theory for High Dimensional Prediction: Application to Criterion-Keyed Scale Development

    PubMed Central

    Chapman, Benjamin P.; Weiss, Alexander; Duberstein, Paul

    2016-01-01

    Statistical learning theory (SLT) is the statistical formulation of machine learning theory, a body of analytic methods common in “big data” problems. Regression-based SLT algorithms seek to maximize predictive accuracy for some outcome, given a large pool of potential predictors, without overfitting the sample. Research goals in psychology may sometimes call for high dimensional regression. One example is criterion-keyed scale construction, where a scale with maximal predictive validity must be built from a large item pool. Using this as a working example, we first introduce a core principle of SLT methods: minimization of expected prediction error (EPE). Minimizing EPE is fundamentally different than maximizing the within-sample likelihood, and hinges on building a predictive model of sufficient complexity to predict the outcome well, without undue complexity leading to overfitting. We describe how such models are built and refined via cross-validation. We then illustrate how three common SLT algorithms–Supervised Principal Components, Regularization, and Boosting—can be used to construct a criterion-keyed scale predicting all-cause mortality, using a large personality item pool within a population cohort. Each algorithm illustrates a different approach to minimizing EPE. Finally, we consider broader applications of SLT predictive algorithms, both as supportive analytic tools for conventional methods, and as primary analytic tools in discovery phase research. We conclude that despite their differences from the classic null-hypothesis testing approach—or perhaps because of them–SLT methods may hold value as a statistically rigorous approach to exploratory regression. PMID:27454257
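
    A sketch of the cross-validation idea on a synthetic item pool illustrates EPE-driven model selection; the pool size, signal structure, use of scikit-learn, and the choice of the lasso (one regularization scheme among those discussed) are assumptions:

      import numpy as np
      from sklearn.linear_model import LassoCV

      rng = np.random.default_rng(2)
      n_people, n_items = 500, 300          # respondents x item pool
      X = rng.normal(size=(n_people, n_items))
      beta = np.zeros(n_items)
      beta[:10] = 0.5                       # only 10 items truly predict the criterion
      y = X @ beta + rng.normal(size=n_people)

      # Cross-validation chooses the penalty that minimizes estimated
      # prediction error (EPE), shrinking most item weights exactly to zero.
      model = LassoCV(cv=5, random_state=0).fit(X, y)
      print(np.count_nonzero(model.coef_), "items retained for the keyed scale")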

  10. Identifying currents in the gene pool for bacterial populations using an integrative approach.

    PubMed

    Tang, Jing; Hanage, William P; Fraser, Christophe; Corander, Jukka

    2009-08-01

    The evolution of bacterial populations has recently become considerably better understood due to large-scale sequencing of population samples. It has become clear that DNA sequences from a multitude of genes, as well as a broad sample coverage of a target population, are needed to obtain a relatively unbiased view of its genetic structure and the patterns of ancestry connected to the strains. However, the traditional statistical methods for evolutionary inference, such as phylogenetic analysis, are associated with several difficulties under such an extensive sampling scenario, in particular when a considerable amount of recombination is anticipated to have taken place. To meet the needs of large-scale analyses of population structure for bacteria, we introduce here several statistical tools for the detection and representation of recombination between populations. Also, we introduce a model-based description of the shape of a population in sequence space, in terms of its molecular variability and affinity towards other populations. Extensive real data from the genus Neisseria are utilized to demonstrate the potential of an approach where these population genetic tools are combined with a phylogenetic analysis. The statistical tools introduced here are freely available in BAPS 5.2 software, which can be downloaded from http://web.abo.fi/fak/mnf/mate/jc/software/baps.html.

  11. Evaluation of respondent-driven sampling.

    PubMed

    McCreesh, Nicky; Frost, Simon D W; Seeley, Janet; Katongole, Joseph; Tarsh, Matilda N; Ndunguse, Richard; Jichi, Fatima; Lunel, Natasha L; Maher, Dermot; Johnston, Lisa G; Sonnenberg, Pam; Copas, Andrew J; Hayes, Richard J; White, Richard G

    2012-01-01

    Respondent-driven sampling is a novel variant of link-tracing sampling for estimating the characteristics of hard-to-reach groups, such as HIV prevalence in sex workers. Despite its use by leading health organizations, the performance of this method in realistic situations is still largely unknown. We evaluated respondent-driven sampling by comparing estimates from a respondent-driven sampling survey with total population data. Total population data on age, tribe, religion, socioeconomic status, sexual activity, and HIV status were available on a population of 2402 male household heads from an open cohort in rural Uganda. A respondent-driven sampling (RDS) survey was carried out in this population, using current methods of sampling (RDS sample) and statistical inference (RDS estimates). Analyses were carried out for the full RDS sample and then repeated for the first 250 recruits (small sample). We recruited 927 household heads. Full and small RDS samples were largely representative of the total population, but both samples underrepresented men who were younger, of higher socioeconomic status, and with unknown sexual activity and HIV status. Respondent-driven sampling statistical inference methods failed to reduce these biases. Only 31%-37% (depending on method and sample size) of RDS estimates were closer to the true population proportions than the RDS sample proportions. Only 50%-74% of respondent-driven sampling bootstrap 95% confidence intervals included the population proportion. Respondent-driven sampling produced a generally representative sample of this well-connected nonhidden population. However, current respondent-driven sampling inference methods failed to reduce bias when it occurred. Whether the data required to remove bias and measure precision can be collected in a respondent-driven sampling survey is unresolved. Respondent-driven sampling should be regarded as a (potentially superior) form of convenience sampling, and caution is required when interpreting findings based on this method.
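
    For reference, the inverse-degree weighting behind the standard RDS-II (Volz-Heckathorn) estimator that such inference methods build on can be sketched as follows; the toy recruit data are assumptions:

      import numpy as np

      def rds_ii(trait, degree):
          # Volz-Heckathorn (RDS-II) estimator: an inverse-degree-weighted
          # mean, meant to correct chain referral's over-sampling of
          # well-connected people.
          w = 1.0 / np.asarray(degree, dtype=float)
          return np.sum(w * np.asarray(trait)) / np.sum(w)

      trait = np.array([1, 0, 1, 1, 0, 0, 1, 0])        # e.g., HIV-positive indicator
      degree = np.array([20, 5, 15, 30, 4, 6, 25, 5])   # reported network sizes
      print(f"naive: {trait.mean():.2f}  RDS-II: {rds_ii(trait, degree):.2f}")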

  12. Statistical analysis of early failures in electromigration

    NASA Astrophysics Data System (ADS)

    Gall, M.; Capasso, C.; Jawarani, D.; Hernandez, R.; Kawasaki, H.; Ho, P. S.

    2001-07-01

    The detection of early failures in electromigration (EM) and the complicated statistical nature of this important reliability phenomenon have been difficult issues to treat in the past. A satisfactory experimental approach for the detection and the statistical analysis of early failures has not yet been established. This is mainly due to the rare occurrence of early failures and difficulties in testing of large sample populations. Furthermore, experimental data on the EM behavior as a function of varying number of failure links are scarce. In this study, a technique utilizing large interconnect arrays in conjunction with the well-known Wheatstone Bridge is presented. Three types of structures with a varying number of Ti/TiN/Al(Cu)/TiN-based interconnects were used, starting from a small unit of five lines in parallel. A serial arrangement of this unit enabled testing of interconnect arrays encompassing 480 possible failure links. In addition, a Wheatstone Bridge-type wiring using four large arrays in each device enabled simultaneous testing of 1920 interconnects. In conjunction with a statistical deconvolution to the single interconnect level, the results indicate that the electromigration failure mechanism studied here follows perfect lognormal behavior down to the four sigma level. The statistical deconvolution procedure is described in detail. Over a temperature range from 155 to 200 °C, a total of more than 75 000 interconnects were tested. None of the samples have shown an indication of early, or alternate, failure mechanisms. The activation energy of the EM mechanism studied here, namely the Cu incubation time, was determined to be Q=1.08±0.05 eV. We surmise that interface diffusion of Cu along the Al(Cu) sidewalls and along the top and bottom refractory layers, coupled with grain boundary diffusion within the interconnects, constitutes the Cu incubation mechanism.
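
    The deconvolution to the single-interconnect level rests on weakest-link statistics: an array of N independent links fails when its first link fails, so F_link(t) = 1 - (1 - F_array(t))^(1/N). A simulation sketch with synthetic lognormal link lifetimes (not the measured Al(Cu) data):

      import numpy as np
      from scipy import stats

      N_LINKS = 480                         # failure links per array, as above

      # Simulate true single-link lifetimes, then form each array's lifetime
      # as the minimum over its links (weakest-link failure).
      t_link = stats.lognorm.rvs(s=0.4, scale=1000.0,
                                 size=(60, N_LINKS), random_state=3)
      t_array = np.sort(t_link.min(axis=1))

      # Empirical array CDF (median plotting positions), then deconvolve:
      F_array = (np.arange(1, t_array.size + 1) - 0.3) / (t_array.size + 0.4)
      F_link = 1.0 - (1.0 - F_array) ** (1.0 / N_LINKS)

      # If the single-link distribution is lognormal, the deconvolved points
      # are linear on a probit-vs-log(t) scale:
      z = stats.norm.ppf(F_link)
      r = np.corrcoef(np.log(t_array), z)[0, 1]
      print(f"probit vs log(t) correlation: {r:.4f}")   # ~1 for lognormal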

  13. Spatial Statistical Models and Optimal Survey Design for Rapid Geophysical characterization of UXO Sites

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    G. Ostrouchov; W. E. Doll; D. A. Wolf

    2003-07-01

    Unexploded ordnance (UXO) surveys encompass large areas, and the cost of surveying these areas can be high. Enactment of earlier protocols for sampling UXO sites has shown the shortcomings of these procedures and led to a call for development of scientifically defensible statistical procedures for survey design and analysis. This project is one of three funded by SERDP to address this need.

  14. VHSIC Electronics and the Cost of Air Force Avionics in the 1990s

    DTIC Science & Technology

    1990-11-01

    circuit. LRM Line replaceable module. LRU Line replaceable unit. LSI Large-scale integration. LSTTL Low-power Schottky Transistor-to-Transistor Logic...displays, communications/navigation/identification, electronic combat equipment, dispensers, and computers. These CERs, which statistically relate the...some of the reliability numbers, and adding the F-15 and F-16 to obtain the data sample shown in Table 6. Both suite costs and reliability statistics

  15. Small studies may overestimate the effect sizes in critical care meta-analyses: a meta-epidemiological study

    PubMed Central

    2013-01-01

    Introduction Small-study effects refer to the fact that trials with limited sample sizes are more likely to report larger beneficial effects than large trials. However, this has never been investigated in critical care medicine. Thus, the present study aimed to examine the presence and extent of small-study effects in critical care medicine. Methods Critical care meta-analyses that involved randomized controlled trials and reported mortality as an outcome measure were considered eligible for the study. Component trials were classified as large (≥100 patients per arm) and small (<100 patients per arm) according to their sample sizes. The ratio of odds ratios (ROR) was calculated for each meta-analysis and then RORs were combined using a meta-analytic approach. An ROR < 1 indicated a larger beneficial effect in small trials. Small and large trials were compared in methodological qualities including sequence generation, blinding, allocation concealment, intention to treat and sample size calculation. Results A total of 27 critical care meta-analyses involving 317 trials were included. Of them, five meta-analyses showed statistically significant RORs < 1, and the other meta-analyses did not reach statistical significance. Overall, the pooled ROR was 0.60 (95% CI: 0.53 to 0.68); the heterogeneity was moderate with an I² of 50.3% (chi-squared = 52.30; P = 0.002). Large trials showed significantly better reporting quality than small trials in terms of sequence generation, allocation concealment, blinding, intention to treat, sample size calculation and incomplete follow-up data. Conclusions Small trials are more likely to report larger beneficial effects than large trials in critical care medicine, which could be partly explained by the lower methodological quality in small trials. Caution should be practiced in the interpretation of meta-analyses involving small trials. PMID:23302257
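
    The ROR calculation and its inverse-variance pooling can be sketched directly; the per-meta-analysis summaries below are invented for illustration:

      import numpy as np

      def log_ror(or_small, se_small, or_large, se_large):
          # log ratio of odds ratios (small-trial OR / large-trial OR) and its
          # standard error; log ROR < 0 means small trials report a larger
          # beneficial (protective) effect.
          return np.log(or_small / or_large), np.hypot(se_small, se_large)

      # Invented per-meta-analysis summaries (OR_small, SE, OR_large, SE):
      summaries = [(0.55, 0.20, 0.90, 0.10),
                   (0.70, 0.25, 0.95, 0.08),
                   (0.80, 0.15, 0.85, 0.12)]
      logs, ses = zip(*(log_ror(*s) for s in summaries))
      w = 1.0 / np.square(ses)
      pooled = np.exp(np.sum(w * np.array(logs)) / np.sum(w))
      print(f"pooled ROR = {pooled:.2f}")   # < 1 indicates small-study effects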

  16. Biostatistics primer: part I.

    PubMed

    Overholser, Brian R; Sowinski, Kevin M

    2007-12-01

    Biostatistics is the application of statistics to biologic data. The field of statistics can be broken down into 2 fundamental parts: descriptive and inferential. Descriptive statistics are commonly used to categorize, display, and summarize data. Inferential statistics can be used to make predictions based on a sample obtained from a population or some large body of information. It is these inferences that are used to test specific research hypotheses. This 2-part review will outline important features of descriptive and inferential statistics as they apply to commonly conducted research studies in the biomedical literature. Part 1 in this issue will discuss fundamental topics of statistics and data analysis. Additionally, some of the most commonly used statistical tests found in the biomedical literature will be reviewed in Part 2 in the February 2008 issue.

  17. Across-cohort QC analyses of GWAS summary statistics from complex traits.

    PubMed

    Chen, Guo-Bo; Lee, Sang Hong; Robinson, Matthew R; Trzaskowski, Maciej; Zhu, Zhi-Xiang; Winkler, Thomas W; Day, Felix R; Croteau-Chonka, Damien C; Wood, Andrew R; Locke, Adam E; Kutalik, Zoltán; Loos, Ruth J F; Frayling, Timothy M; Hirschhorn, Joel N; Yang, Jian; Wray, Naomi R; Visscher, Peter M

    2016-01-01

    Genome-wide association studies (GWASs) have been successful in discovering SNP trait associations for many quantitative traits and common diseases. Typically, the effect sizes of SNP alleles are very small and this requires large genome-wide association meta-analyses (GWAMAs) to maximize statistical power. A trend towards ever-larger GWAMA is likely to continue, yet dealing with summary statistics from hundreds of cohorts increases logistical and quality control problems, including unknown sample overlap, and these can lead to both false positive and false negative findings. In this study, we propose four metrics and visualization tools for GWAMA, using summary statistics from cohort-level GWASs. We propose methods to examine the concordance between demographic information, and summary statistics and methods to investigate sample overlap. (I) We use the population genetics Fst statistic to verify the genetic origin of each cohort and their geographic location, and demonstrate using GWAMA data from the GIANT Consortium that geographic locations of cohorts can be recovered and outlier cohorts can be detected. (II) We conduct principal component analysis based on reported allele frequencies, and are able to recover the ancestral information for each cohort. (III) We propose a new statistic that uses the reported allelic effect sizes and their standard errors to identify significant sample overlap or heterogeneity between pairs of cohorts. (IV) To quantify unknown sample overlap across all pairs of cohorts, we propose a method that uses randomly generated genetic predictors that does not require the sharing of individual-level genotype data and does not breach individual privacy.
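
    The paper's exact overlap statistic is not reproduced here, but the core idea behind metric (III) can be sketched: at null SNPs, z-scores from independent cohorts are uncorrelated, while shared samples induce a correlation of roughly n_overlap/sqrt(n1*n2). A toy demonstration under that assumption:

      import numpy as np

      rng = np.random.default_rng(4)

      # Per-SNP z-scores (beta / SE) at null SNPs for two cohorts sharing
      # ~30% of their effective samples via a common component.
      m = 20000
      shared = rng.normal(size=m)
      z1 = np.sqrt(0.7) * rng.normal(size=m) + np.sqrt(0.3) * shared
      z2 = np.sqrt(0.7) * rng.normal(size=m) + np.sqrt(0.3) * shared

      r = np.corrcoef(z1, z2)[0, 1]
      print(f"z-score correlation: {r:.2f}")   # ~0.3; near 0 for disjoint cohorts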

  18. Across-cohort QC analyses of GWAS summary statistics from complex traits

    PubMed Central

    Chen, Guo-Bo; Lee, Sang Hong; Robinson, Matthew R; Trzaskowski, Maciej; Zhu, Zhi-Xiang; Winkler, Thomas W; Day, Felix R; Croteau-Chonka, Damien C; Wood, Andrew R; Locke, Adam E; Kutalik, Zoltán; Loos, Ruth J F; Frayling, Timothy M; Hirschhorn, Joel N; Yang, Jian; Wray, Naomi R; Visscher, Peter M

    2017-01-01

    Genome-wide association studies (GWASs) have been successful in discovering SNP trait associations for many quantitative traits and common diseases. Typically, the effect sizes of SNP alleles are very small and this requires large genome-wide association meta-analyses (GWAMAs) to maximize statistical power. A trend towards ever-larger GWAMA is likely to continue, yet dealing with summary statistics from hundreds of cohorts increases logistical and quality control problems, including unknown sample overlap, and these can lead to both false positive and false negative findings. In this study, we propose four metrics and visualization tools for GWAMA, using summary statistics from cohort-level GWASs. We propose methods to examine the concordance between demographic information, and summary statistics and methods to investigate sample overlap. (I) We use the population genetics Fst statistic to verify the genetic origin of each cohort and their geographic location, and demonstrate using GWAMA data from the GIANT Consortium that geographic locations of cohorts can be recovered and outlier cohorts can be detected. (II) We conduct principal component analysis based on reported allele frequencies, and are able to recover the ancestral information for each cohort. (III) We propose a new statistic that uses the reported allelic effect sizes and their standard errors to identify significant sample overlap or heterogeneity between pairs of cohorts. (IV) To quantify unknown sample overlap across all pairs of cohorts, we propose a method that uses randomly generated genetic predictors that does not require the sharing of individual-level genotype data and does not breach individual privacy. PMID:27552965

  19. When the test of mediation is more powerful than the test of the total effect.

    PubMed

    O'Rourke, Holly P; MacKinnon, David P

    2015-06-01

    Although previous research has studied power in mediation models, the extent to which the inclusion of a mediator will increase power has not been investigated. To address this deficit, in a first study we compared the analytical power values of the mediated effect and the total effect in a single-mediator model, to identify the situations in which the inclusion of one mediator increased statistical power. The results from this first study indicated that including a mediator increased statistical power in small samples with large coefficients and in large samples with small coefficients, and when coefficients were nonzero and equal across models. Next, we identified conditions under which power was greater for the test of the total mediated effect than for the test of the total effect in the parallel two-mediator model. These results indicated that including two mediators increased power in small samples with large coefficients and in large samples with small coefficients, the same pattern of results that had been found in the first study. Finally, we assessed the analytical power for a sequential (three-path) two-mediator model and compared the power to detect the three-path mediated effect to the power to detect both the test of the total effect and the test of the mediated effect for the single-mediator model. The results indicated that the three-path mediated effect had more power than the mediated effect from the single-mediator model and the test of the total effect. Practical implications of these results for researchers are then discussed.
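
    A Monte Carlo counterpart of this comparison is straightforward. The sketch below uses a joint-significance test for the mediated effect and assumed standardized coefficients (the paper reports analytical power values):

      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(5)

      def ols_pvalue(X, y, col):
          # Two-sided p-value for one slope in an OLS fit with intercept.
          n = len(y)
          Xd = np.column_stack([np.ones(n), X])
          beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
          resid = y - Xd @ beta
          cov = (resid @ resid / (n - Xd.shape[1])) * np.linalg.inv(Xd.T @ Xd)
          t = beta[col + 1] / np.sqrt(cov[col + 1, col + 1])
          return 2 * stats.t.sf(abs(t), df=n - Xd.shape[1])

      def power(n, a, b, c_prime, reps=1000, alpha=0.05):
          hits_total = hits_mediated = 0
          for _ in range(reps):
              x = rng.normal(size=n)
              m = a * x + rng.normal(size=n)
              y = c_prime * x + b * m + rng.normal(size=n)
              hits_total += ols_pvalue(x, y, 0) < alpha              # total effect
              p_a = ols_pvalue(x, m, 0)                              # a path
              p_b = ols_pvalue(np.column_stack([x, m]), y, 1)        # b path given x
              hits_mediated += max(p_a, p_b) < alpha                 # joint significance
          return hits_total / reps, hits_mediated / reps

      # Large sample with small coefficients: the mediation test wins, as above.
      print(power(n=500, a=0.14, b=0.14, c_prime=0.0))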

  20. Estimating and comparing microbial diversity in the presence of sequencing errors

    PubMed Central

    Chiu, Chun-Huo

    2016-01-01

    Estimating and comparing microbial diversity are statistically challenging due to limited sampling and possible sequencing errors for low-frequency counts, producing spurious singletons. The inflated singleton count seriously affects statistical analysis and inferences about microbial diversity. Previous statistical approaches to tackle the sequencing errors generally require different parametric assumptions about the sampling model or about the functional form of frequency counts. Different parametric assumptions may lead to drastically different diversity estimates. We focus on nonparametric methods which are universally valid for all parametric assumptions and can be used to compare diversity across communities. We develop here a nonparametric estimator of the true singleton count to replace the spurious singleton count in all methods/approaches. Our estimator of the true singleton count is in terms of the frequency counts of doubletons, tripletons and quadrupletons, provided these three frequency counts are reliable. To quantify microbial alpha diversity for an individual community, we adopt the measure of Hill numbers (effective number of taxa) under a nonparametric framework. Hill numbers, parameterized by an order q that determines the measures’ emphasis on rare or common species, include taxa richness (q = 0), Shannon diversity (q = 1, the exponential of Shannon entropy), and Simpson diversity (q = 2, the inverse of Simpson index). A diversity profile which depicts the Hill number as a function of order q conveys all information contained in a taxa abundance distribution. Based on the estimated singleton count and the original non-singleton frequency counts, two statistical approaches (non-asymptotic and asymptotic) are developed to compare microbial diversity for multiple communities. (1) A non-asymptotic approach refers to the comparison of estimated diversities of standardized samples with a common finite sample size or sample completeness. This approach aims to compare diversity estimates for equally-large or equally-complete samples; it is based on the seamless rarefaction and extrapolation sampling curves of Hill numbers, specifically for q = 0, 1 and 2. (2) An asymptotic approach refers to the comparison of the estimated asymptotic diversity profiles. That is, this approach compares the estimated profiles for complete samples or samples whose size tends to be sufficiently large. It is based on statistical estimation of the true Hill number of any order q ≥ 0. In the two approaches, replacing the spurious singleton count by our estimated count, we can greatly remove the positive biases associated with diversity estimates due to spurious singletons and also make fair comparisons across microbial communities, as illustrated in our simulation results and in applying our method to analyze sequencing data from viral metagenomes. PMID:26855872
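
    The Hill-number profile itself reduces to one formula per order q. A sketch with toy abundance counts and no singleton correction applied:

      import numpy as np

      def hill_number(counts, q):
          # Effective number of taxa of order q: q = 0 gives richness,
          # q -> 1 gives exp(Shannon entropy), q = 2 the inverse Simpson index.
          p = np.asarray(counts, dtype=float)
          p = p[p > 0] / p.sum()
          if np.isclose(q, 1.0):
              return np.exp(-np.sum(p * np.log(p)))     # limit as q -> 1
          return np.sum(p ** q) ** (1.0 / (1.0 - q))

      counts = [120, 60, 30, 10, 5, 1, 1, 1]            # toy OTU abundances
      print([round(hill_number(counts, q), 2) for q in (0, 1, 2)])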

  1. Electric Field Fluctuations in Water

    NASA Astrophysics Data System (ADS)

    Thorpe, Dayton; Limmer, David; Chandler, David

    2013-03-01

    Charge transfer in solution, such as autoionization and ion pair dissociation in water, is governed by rare electric field fluctuations of the solvent. Knowing the statistics of such fluctuations can help explain the dynamics of these rare events. Trajectories short enough to be tractable by computer simulation are virtually certain not to sample the large fluctuations that promote rare events. Here, we employ importance sampling techniques with classical molecular dynamics simulations of liquid water to study statistics of electric field fluctuations far from their means. We find that the distributions of electric fields located on individual water molecules are not in general Gaussian. Near the mean this non-Gaussianity is due to the internal charge distribution of the water molecule. Further from the mean, however, there is a previously unreported Bjerrum-like defect that stabilizes certain large fluctuations out of equilibrium. As expected, differences in electric fields acting between molecules are Gaussian to a remarkable degree. By studying these differences, though, we are able to determine what configurations result not only in large electric fields, but also in electric fields with long spatial correlations that may be needed to promote charge separation.

  2. Dependence of the clustering properties of galaxies on stellar velocity dispersion in the Main galaxy sample of SDSS DR10

    NASA Astrophysics Data System (ADS)

    Deng, Xin-Fa; Song, Jun; Chen, Yi-Qing; Jiang, Peng; Ding, Ying-Ping

    2014-08-01

    Using two volume-limited Main galaxy samples of the Sloan Digital Sky Survey Data Release 10 (SDSS DR10), we use cluster analysis to investigate the dependence of the clustering properties of galaxies on stellar velocity dispersion. It is found that in the luminous volume-limited Main galaxy sample, except at r = 1.2, richer and larger systems form more readily in the large stellar velocity dispersion subsample, while in the faint volume-limited Main galaxy sample an opposite trend is observed at r ≥ 0.9. From statistical analyses of the multiplicity functions, we conclude that in both volume-limited Main galaxy samples, small stellar velocity dispersion galaxies preferentially form isolated galaxies, close pairs, and small groups, while large stellar velocity dispersion galaxies preferentially inhabit dense groups and clusters. However, the two samples differ in one respect: in the faint volume-limited Main galaxy sample, at r ≥ 0.9, the small stellar velocity dispersion subsample has a higher proportion of galaxies in superclusters (n ≥ 200) than the large stellar velocity dispersion subsample.

  3. Ichthyoplankton abundance and variance in a large river system: concerns for long-term monitoring

    USGS Publications Warehouse

    Holland-Bartels, Leslie E.; Dewey, Michael R.; Zigler, Steven J.

    1995-01-01

    System-wide spatial patterns of ichthyoplankton abundance and variability were assessed in the upper Mississippi and lower Illinois rivers to address the experimental design and statistical confidence in density estimates. Ichthyoplankton was sampled from June to August 1989 in primary milieus (vegetated and non-vegetated backwaters and impounded areas, main channels and main channel borders) in three navigation pools (8, 13 and 26) of the upper Mississippi River and in a downstream reach of the Illinois River. Ichthyoplankton densities varied among stations of similar aquatic landscapes (milieus) more than among subsamples within a station. An analysis of sampling effort indicated that the collection of single samples at many stations in a given milieu type is statistically and economically preferable to the collection of multiple subsamples at fewer stations. Cluster analyses also revealed that stations grouped only loosely by their preassigned milieu types. Pilot studies such as this can define station groupings and sources of variation beyond an a priori habitat classification. Thus the minimum intensity of sampling required to achieve a desired statistical confidence can be identified before implementing monitoring efforts.

  4. Extreme Mean and Its Applications

    NASA Technical Reports Server (NTRS)

    Swaroop, R.; Brownlow, J. D.

    1979-01-01

    Extreme value statistics obtained from normally distributed data are considered. An extreme mean is defined as the mean of p-th probability truncated normal distribution. An unbiased estimate of this extreme mean and its large sample distribution are derived. The distribution of this estimate even for very large samples is found to be nonnormal. Further, as the sample size increases, the variance of the unbiased estimate converges to the Cramer-Rao lower bound. The computer program used to obtain the density and distribution functions of the standardized unbiased estimate, and the confidence intervals of the extreme mean for any data are included for ready application. An example is included to demonstrate the usefulness of extreme mean application.
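
    Under the standard convention of truncating a normal distribution at its p-th quantile, the extreme mean has a closed form; the sketch below evaluates it (scipy is assumed, and the report's exact conventions may differ):

      from scipy import stats

      def extreme_mean(mu, sigma, p):
          # Mean of a normal truncated at its p-th quantile:
          # E[X | X > x_p] = mu + sigma * pdf(z_p) / (1 - p).
          z = stats.norm.ppf(p)
          return mu + sigma * stats.norm.pdf(z) / (1.0 - p)

      print(round(extreme_mean(0.0, 1.0, 0.95), 3))   # ~2.063, mean of the top 5%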

  5. Statistical detection of systematic election irregularities

    PubMed Central

    Klimek, Peter; Yegorov, Yuri; Hanel, Rudolf; Thurner, Stefan

    2012-01-01

    Democratic societies are built around the principle of free and fair elections, and that each citizen’s vote should count equally. National elections can be regarded as large-scale social experiments, where people are grouped into usually large numbers of electoral districts and vote according to their preferences. The large number of samples implies statistical consequences for the polling results, which can be used to identify election irregularities. Using a suitable data representation, we find that vote distributions of elections with alleged fraud show a kurtosis substantially exceeding the kurtosis of normal elections, depending on the level of data aggregation. As an example, we show that reported irregularities in recent Russian elections are, indeed, well-explained by systematic ballot stuffing. We develop a parametric model quantifying the extent to which fraudulent mechanisms are present. We formulate a parametric test detecting these statistical properties in election results. Remarkably, this technique produces robust outcomes with respect to the resolution of the data and therefore, allows for cross-country comparisons. PMID:23010929
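
    The kurtosis signature is easy to reproduce on synthetic district-level vote shares; the distributions below are illustrative assumptions, not modeled on any real election:

      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(6)

      # District-level winner vote shares: a clean election vs. one where a
      # tenth of the districts are pushed toward ~100% by ballot stuffing.
      clean = np.clip(rng.normal(0.55, 0.10, size=5000), 0.0, 1.0)
      stuffed = clean.copy()
      stuffed[:500] = np.clip(rng.normal(0.96, 0.02, size=500), 0.0, 1.0)

      for name, shares in (("clean", clean), ("stuffed", stuffed)):
          print(name, round(stats.kurtosis(shares, fisher=False), 2))
      # The stuffed distribution shows clearly elevated kurtosis relative
      # to the roughly normal clean case.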

  6. Statistical detection of systematic election irregularities.

    PubMed

    Klimek, Peter; Yegorov, Yuri; Hanel, Rudolf; Thurner, Stefan

    2012-10-09

    Democratic societies are built around the principle of free and fair elections, and that each citizen's vote should count equally. National elections can be regarded as large-scale social experiments, where people are grouped into usually large numbers of electoral districts and vote according to their preferences. The large number of samples implies statistical consequences for the polling results, which can be used to identify election irregularities. Using a suitable data representation, we find that vote distributions of elections with alleged fraud show a kurtosis substantially exceeding the kurtosis of normal elections, depending on the level of data aggregation. As an example, we show that reported irregularities in recent Russian elections are, indeed, well-explained by systematic ballot stuffing. We develop a parametric model quantifying the extent to which fraudulent mechanisms are present. We formulate a parametric test detecting these statistical properties in election results. Remarkably, this technique produces robust outcomes with respect to the resolution of the data and therefore, allows for cross-country comparisons.

  7. Inferring general relations between network characteristics from specific network ensembles.

    PubMed

    Cardanobile, Stefano; Pernice, Volker; Deger, Moritz; Rotter, Stefan

    2012-01-01

    Different network models have been suggested for the topology underlying complex interactions in natural systems. These models are aimed at replicating specific statistical features encountered in real-world networks. However, it is rarely considered to what degree the results obtained for one particular network class can be extrapolated to real-world networks. We address this issue by comparing different classical and more recently developed network models with respect to their ability to generate networks with large structural variability. In particular, we consider the statistical constraints which the respective construction scheme imposes on the generated networks. After having identified the most variable networks, we address the issue of which constraints are common to all network classes and are thus suitable candidates for being generic statistical laws of complex networks. In fact, we find that generic, not model-related dependencies between different network characteristics do exist. This makes it possible to infer global features from local ones using regression models trained on networks with high generalization power. Our results confirm and extend previous findings regarding the synchronization properties of neural networks. Our method seems especially relevant for large networks, which are difficult to map completely, like the neural networks in the brain. The structure of such large networks cannot be fully sampled with the present technology. Our approach provides a method to estimate global properties of under-sampled networks to a good approximation. Finally, we demonstrate on three different data sets (C. elegans neuronal network, R. prowazekii metabolic network, and a network of synonyms extracted from Roget's Thesaurus) that real-world networks have statistical relations compatible with those obtained using regression models.

  8. Statistical learning theory for high dimensional prediction: Application to criterion-keyed scale development.

    PubMed

    Chapman, Benjamin P; Weiss, Alexander; Duberstein, Paul R

    2016-12-01

    Statistical learning theory (SLT) is the statistical formulation of machine learning theory, a body of analytic methods common in "big data" problems. Regression-based SLT algorithms seek to maximize predictive accuracy for some outcome, given a large pool of potential predictors, without overfitting the sample. Research goals in psychology may sometimes call for high dimensional regression. One example is criterion-keyed scale construction, where a scale with maximal predictive validity must be built from a large item pool. Using this as a working example, we first introduce a core principle of SLT methods: minimization of expected prediction error (EPE). Minimizing EPE is fundamentally different than maximizing the within-sample likelihood, and hinges on building a predictive model of sufficient complexity to predict the outcome well, without undue complexity leading to overfitting. We describe how such models are built and refined via cross-validation. We then illustrate how 3 common SLT algorithms-supervised principal components, regularization, and boosting-can be used to construct a criterion-keyed scale predicting all-cause mortality, using a large personality item pool within a population cohort. Each algorithm illustrates a different approach to minimizing EPE. Finally, we consider broader applications of SLT predictive algorithms, both as supportive analytic tools for conventional methods, and as primary analytic tools in discovery phase research. We conclude that despite their differences from the classic null-hypothesis testing approach-or perhaps because of them-SLT methods may hold value as a statistically rigorous approach to exploratory regression. (PsycINFO Database Record (c) 2016 APA, all rights reserved).

  9. Foundational Principles for Large-Scale Inference: Illustrations Through Correlation Mining

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hero, Alfred O.; Rajaratnam, Bala

    When can reliable inference be drawn in the "Big Data" context? This article presents a framework for answering this fundamental question in the context of correlation mining, with implications for general large-scale inference. In large-scale data applications like genomics, connectomics, and eco-informatics, the data set is often variable rich but sample starved: a regime where the number n of acquired samples (statistical replicates) is far fewer than the number p of observed variables (genes, neurons, voxels, or chemical constituents). Much of recent work has focused on understanding the computational complexity of proposed methods for "Big Data." Sample complexity, however, has received relatively less attention, especially in the setting when the sample size n is fixed, and the dimension p grows without bound. To address this gap, we develop a unified statistical framework that explicitly quantifies the sample complexity of various inferential tasks. Sampling regimes can be divided into several categories: 1) the classical asymptotic regime where the variable dimension is fixed and the sample size goes to infinity; 2) the mixed asymptotic regime where both variable dimension and sample size go to infinity at comparable rates; and 3) the purely high-dimensional asymptotic regime where the variable dimension goes to infinity and the sample size is fixed. Each regime has its niche but only the latter regime applies to exa-scale data dimension. We illustrate this high-dimensional framework for the problem of correlation mining, where it is the matrix of pairwise and partial correlations among the variables that are of interest. Correlation mining arises in numerous applications and subsumes the regression context as a special case. We demonstrate various regimes of correlation mining based on the unifying perspective of high-dimensional learning rates and sample complexity for different structured covariance models and different inference tasks.
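
    The sample-starved regime is simple to demonstrate: with n fixed and p large, even fully independent variables produce some very large sample correlations, which is exactly what the sample-complexity framework must control. Dimensions below are arbitrary:

      import numpy as np

      rng = np.random.default_rng(7)
      n, p = 20, 2000                       # samples << variables
      X = rng.normal(size=(n, p))           # all variables truly independent

      R = np.corrcoef(X, rowvar=False)
      off = np.abs(R[np.triu_indices(p, k=1)])
      print(f"largest null |correlation| among {off.size} pairs: {off.max():.2f}")
      # With n = 20 this maximum routinely exceeds 0.8, so any fixed
      # correlation threshold is eventually swamped by false discoveries
      # as p grows with n held constant.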

  10. Foundational Principles for Large-Scale Inference: Illustrations Through Correlation Mining

    PubMed Central

    Hero, Alfred O.; Rajaratnam, Bala

    2015-01-01

    When can reliable inference be drawn in the “Big Data” context? This paper presents a framework for answering this fundamental question in the context of correlation mining, with implications for general large scale inference. In large scale data applications like genomics, connectomics, and eco-informatics the dataset is often variable-rich but sample-starved: a regime where the number n of acquired samples (statistical replicates) is far fewer than the number p of observed variables (genes, neurons, voxels, or chemical constituents). Much of recent work has focused on understanding the computational complexity of proposed methods for “Big Data”. Sample complexity however has received relatively less attention, especially in the setting when the sample size n is fixed, and the dimension p grows without bound. To address this gap, we develop a unified statistical framework that explicitly quantifies the sample complexity of various inferential tasks. Sampling regimes can be divided into several categories: 1) the classical asymptotic regime where the variable dimension is fixed and the sample size goes to infinity; 2) the mixed asymptotic regime where both variable dimension and sample size go to infinity at comparable rates; 3) the purely high dimensional asymptotic regime where the variable dimension goes to infinity and the sample size is fixed. Each regime has its niche but only the latter regime applies to exa-scale data dimension. We illustrate this high dimensional framework for the problem of correlation mining, where it is the matrix of pairwise and partial correlations among the variables that are of interest. Correlation mining arises in numerous applications and subsumes the regression context as a special case. We demonstrate various regimes of correlation mining based on the unifying perspective of high dimensional learning rates and sample complexity for different structured covariance models and different inference tasks. PMID:27087700

  11. Foundational Principles for Large-Scale Inference: Illustrations Through Correlation Mining

    DOE PAGES

    Hero, Alfred O.; Rajaratnam, Bala

    2015-12-09

    When can reliable inference be drawn in the "Big Data" context? This article presents a framework for answering this fundamental question in the context of correlation mining, with implications for general large-scale inference. In large-scale data applications like genomics, connectomics, and eco-informatics, the data set is often variable rich but sample starved: a regime where the number n of acquired samples (statistical replicates) is far fewer than the number p of observed variables (genes, neurons, voxels, or chemical constituents). Much of recent work has focused on understanding the computational complexity of proposed methods for "Big Data." Sample complexity, however, has received relatively less attention, especially in the setting when the sample size n is fixed, and the dimension p grows without bound. To address this gap, we develop a unified statistical framework that explicitly quantifies the sample complexity of various inferential tasks. Sampling regimes can be divided into several categories: 1) the classical asymptotic regime where the variable dimension is fixed and the sample size goes to infinity; 2) the mixed asymptotic regime where both variable dimension and sample size go to infinity at comparable rates; and 3) the purely high-dimensional asymptotic regime where the variable dimension goes to infinity and the sample size is fixed. Each regime has its niche but only the latter regime applies to exa-scale data dimension. We illustrate this high-dimensional framework for the problem of correlation mining, where it is the matrix of pairwise and partial correlations among the variables that are of interest. Correlation mining arises in numerous applications and subsumes the regression context as a special case. We demonstrate various regimes of correlation mining based on the unifying perspective of high-dimensional learning rates and sample complexity for different structured covariance models and different inference tasks.

  12. Some challenges with statistical inference in adaptive designs.

    PubMed

    Hung, H M James; Wang, Sue-Jane; Yang, Peiling

    2014-01-01

    Adaptive designs have generated a great deal of attention in clinical trial communities. The literature contains many statistical methods to deal with added statistical uncertainties concerning the adaptations. Increasingly encountered in regulatory applications are adaptive statistical information designs that allow modification of sample size or related statistical information and adaptive selection designs that allow selection of doses or patient populations during the course of a clinical trial. For adaptive statistical information designs, a few statistical testing methods are mathematically equivalent, as a number of articles have stipulated, but arguably there are large differences in their practical ramifications. We pinpoint some undesirable features of these methods in this work. For adaptive selection designs, the selection based on biomarker data for testing the correlated clinical endpoints may increase statistical uncertainty in terms of type I error probability, and most importantly the increased statistical uncertainty may be impossible to assess.

  13. The t-test: An Influential Inferential Tool in Chaplaincy and Other Healthcare Research.

    PubMed

    Jankowski, Katherine R B; Flannelly, Kevin J; Flannelly, Laura T

    2018-01-01

    The t-test developed by William S. Gosset (also known as Student's t-test and the two-sample t-test) is commonly used to compare one sample mean on a measure with another sample mean on the same measure. The outcome of the t-test is used to draw inferences about how different the samples are from each other. It is probably one of the most frequently relied upon statistics in inferential research. It is easy to use: a researcher can calculate the statistic with three simple tools: paper, pen, and a calculator. A computer program can quickly calculate the t-test for large samples. The ease of use can result in the misuse of the t-test. This article discusses the development of the original t-test, basic principles of the t-test, two additional types of t-tests (the one-sample t-test and the paired t-test), and recommendations about what to consider when using the t-test to draw inferences in research.
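
    The "three simple tools" point can be made concrete by computing the pooled-variance two-sample t statistic by hand and checking it against a library call (synthetic samples assumed):

      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(8)
      a = rng.normal(50.0, 10.0, size=40)   # sample 1 on some measure
      b = rng.normal(55.0, 10.0, size=40)   # sample 2 on the same measure

      # Gosset's two-sample t statistic from pencil-and-paper quantities:
      n1, n2 = len(a), len(b)
      sp2 = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
      t_hand = (a.mean() - b.mean()) / np.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))

      t_lib, p = stats.ttest_ind(a, b)      # equal-variance two-sample t-test
      print(round(t_hand, 4), round(t_lib, 4), round(p, 4))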

  14. Individualized statistical learning from medical image databases: application to identification of brain lesions.

    PubMed

    Erus, Guray; Zacharaki, Evangelia I; Davatzikos, Christos

    2014-04-01

    This paper presents a method for capturing statistical variation of normal imaging phenotypes, with emphasis on brain structure. The method aims to estimate the statistical variation of a normative set of images from healthy individuals, and identify abnormalities as deviations from normality. A direct estimation of the statistical variation of the entire volumetric image is challenged by the high-dimensionality of images relative to smaller sample sizes. To overcome this limitation, we iteratively sample a large number of lower dimensional subspaces that capture image characteristics ranging from fine and localized to coarser and more global. Within each subspace, a "target-specific" feature selection strategy is applied to further reduce the dimensionality, by considering only imaging characteristics present in a test subject's images. Marginal probability density functions of selected features are estimated through PCA models, in conjunction with an "estimability" criterion that limits the dimensionality of estimated probability densities according to available sample size and underlying anatomy variation. A test sample is iteratively projected to the subspaces of these marginals as determined by PCA models, and its trajectory delineates potential abnormalities. The method is applied to segmentation of various brain lesion types, and to simulated data on which superiority of the iterative method over straight PCA is demonstrated. Copyright © 2014 Elsevier B.V. All rights reserved.
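
    For orientation, the straight-PCA baseline that the iterative method is compared against can be sketched in a few lines; synthetic feature vectors stand in for the imaging data:

      import numpy as np

      rng = np.random.default_rng(9)

      train = rng.normal(size=(200, 50))    # normative set (healthy features)
      test_normal = rng.normal(size=50)
      test_lesion = rng.normal(size=50)
      test_lesion[:5] += 6.0                # localized deviation from normality

      mu = train.mean(axis=0)
      _, _, Vt = np.linalg.svd(train - mu, full_matrices=False)
      P = Vt[:10]                           # retained principal subspace

      def abnormality(x):
          # Residual norm after projecting onto the normative PCA subspace;
          # deviations from normality leave large residuals.
          c = x - mu
          return np.linalg.norm(c - P.T @ (P @ c))

      print(round(abnormality(test_normal), 1), round(abnormality(test_lesion), 1))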

  15. Individualized Statistical Learning from Medical Image Databases: Application to Identification of Brain Lesions

    PubMed Central

    Erus, Guray; Zacharaki, Evangelia I.; Davatzikos, Christos

    2014-01-01

    This paper presents a method for capturing statistical variation of normal imaging phenotypes, with emphasis on brain structure. The method aims to estimate the statistical variation of a normative set of images from healthy individuals, and identify abnormalities as deviations from normality. A direct estimation of the statistical variation of the entire volumetric image is challenged by the high-dimensionality of images relative to smaller sample sizes. To overcome this limitation, we iteratively sample a large number of lower dimensional subspaces that capture image characteristics ranging from fine and localized to coarser and more global. Within each subspace, a “target-specific” feature selection strategy is applied to further reduce the dimensionality, by considering only imaging characteristics present in a test subject’s images. Marginal probability density functions of selected features are estimated through PCA models, in conjunction with an “estimability” criterion that limits the dimensionality of estimated probability densities according to available sample size and underlying anatomy variation. A test sample is iteratively projected to the subspaces of these marginals as determined by PCA models, and its trajectory delineates potential abnormalities. The method is applied to segmentation of various brain lesion types, and to simulated data on which superiority of the iterative method over straight PCA is demonstrated. PMID:24607564

  16. Assessing signal-to-noise in quantitative proteomics: multivariate statistical analysis in DIGE experiments.

    PubMed

    Friedman, David B

    2012-01-01

    All quantitative proteomics experiments measure variation between samples. When performing large-scale experiments that involve multiple conditions or treatments, the experimental design should include the appropriate number of individual biological replicates from each condition to enable the distinction between a relevant biological signal from technical noise. Multivariate statistical analyses, such as principal component analysis (PCA), provide a global perspective on experimental variation, thereby enabling the assessment of whether the variation describes the expected biological signal or the unanticipated technical/biological noise inherent in the system. Examples will be shown from high-resolution multivariable DIGE experiments where PCA was instrumental in demonstrating biologically significant variation as well as sample outliers, fouled samples, and overriding technical variation that would not be readily observed using standard univariate tests.

  17. The Relationship between SAT Scores and Retention to the Second Year: 2007 SAT Validity Sample. Statistical Report No. 2011-4

    ERIC Educational Resources Information Center

    Mattern, Krista D.; Patterson, Brian F.

    2011-01-01

    This report presents the findings from a replication of the analyses from the report, "Is Performance on the SAT Related to College Retention?" (Mattern & Patterson, 2009). The tables presented herein are based on the 2007 sample and the findings are largely the same as those presented in the original report, and show SAT scores are…

  18. CALIPSO Observations of Near-Cloud Aerosol Properties as a Function of Cloud Fraction

    NASA Technical Reports Server (NTRS)

    Yang, Weidong; Marshak, Alexander; Varnai, Tamas; Wood, Robert

    2015-01-01

    This paper uses spaceborne lidar data to study how near-cloud aerosol statistics of attenuated backscatter depend on cloud fraction. The results for a large region around the Azores show that: (1) far-from-cloud aerosol statistics are dominated by samples from scenes with lower cloud fractions, while near-cloud aerosol statistics are dominated by samples from scenes with higher cloud fractions; (2) near-cloud enhancements of attenuated backscatter occur for any cloud fraction but are most pronounced for higher cloud fractions; (3) the difference in the enhancements for different cloud fractions is most significant within 5 km from clouds; (4) near-cloud enhancements can be well approximated by logarithmic functions of cloud fraction and distance to clouds. These findings demonstrate that if variability in cloud fraction across the scenes used to composite aerosol statistics is not considered, a sampling artifact will affect these statistics calculated as a function of distance to clouds. For the Azores-region dataset examined here, this artifact occurs mostly within 5 km from clouds, and exaggerates the near-cloud enhancements of lidar backscatter and color ratio by about 30%. This shows that for accurate characterization of the changes in aerosol properties with distance to clouds, it is important to account for the impact of changes in cloud fraction.

  19. From sexless to sexy: Why it is time for human genetics to consider and report analyses of sex.

    PubMed

    Powers, Matthew S; Smith, Phillip H; McKee, Sherry A; Ehringer, Marissa A

    2017-01-01

    Science has come a long way with regard to the consideration of sex differences in clinical and preclinical research, but one field remains behind the curve: human statistical genetics. The goal of this commentary is to raise awareness and discussion about how to best consider and evaluate possible sex effects in the context of large-scale human genetic studies. Over the course of this commentary, we reinforce the importance of interpreting genetic results in the context of biological sex, establish evidence that sex differences are not being considered in human statistical genetics, and discuss how best to conduct and report such analyses. Our recommendation is to run stratified analyses by sex no matter the sample size or the result and report the findings. Summary statistics from stratified analyses are helpful for meta-analyses, and patterns of sex-dependent associations may be hidden in a combined dataset. In the age of declining sequencing costs, large consortia efforts, and a number of useful control samples, it is now time for the field of human genetics to appropriately include sex in the design, analysis, and reporting of results.

  20. A statistical model and national data set for partitioning fish-tissue mercury concentration variation between spatiotemporal and sample characteristic effects

    USGS Publications Warehouse

    Wente, Stephen P.

    2004-01-01

    Many Federal, Tribal, State, and local agencies monitor mercury in fish-tissue samples to identify sites with elevated fish-tissue mercury (fish-mercury) concentrations, track changes in fish-mercury concentrations over time, and produce fish-consumption advisories. Interpretation of such monitoring data commonly is impeded by difficulties in separating the effects of sample characteristics (species, tissues sampled, and sizes of fish) from the effects of spatial and temporal trends on fish-mercury concentrations. Without such a separation, variation in fish-mercury concentrations due to differences in the characteristics of samples collected over time or across space can be misattributed to temporal or spatial trends; and/or actual trends in fish-mercury concentration can be misattributed to differences in sample characteristics. This report describes a statistical model that can separate spatiotemporal and sample characteristic effects in fish-mercury concentration data, and a national data set (31,813 samples) for calibrating that model. This model could be useful for evaluating spatial and temporal trends in fish-mercury concentrations and developing fish-consumption advisories. The observed fish-mercury concentration data and model predictions can be accessed, displayed geospatially, and downloaded via the World Wide Web (http://emmma.usgs.gov). This report and the associated web site may assist in the interpretation of large amounts of data from widespread fish-mercury monitoring efforts.

  1. DESCARTES' RULE OF SIGNS AND THE IDENTIFIABILITY OF POPULATION DEMOGRAPHIC MODELS FROM GENOMIC VARIATION DATA.

    PubMed

    Bhaskar, Anand; Song, Yun S

    2014-01-01

    The sample frequency spectrum (SFS) is a widely-used summary statistic of genomic variation in a sample of homologous DNA sequences. It provides a highly efficient dimensional reduction of large-scale population genomic data and its mathematical dependence on the underlying population demography is well understood, thus enabling the development of efficient inference algorithms. However, it has been recently shown that very different population demographies can actually generate the same SFS for arbitrarily large sample sizes. Although in principle this nonidentifiability issue poses a thorny challenge to statistical inference, the population size functions involved in the counterexamples are arguably not so biologically realistic. Here, we revisit this problem and examine the identifiability of demographic models under the restriction that the population sizes are piecewise-defined where each piece belongs to some family of biologically-motivated functions. Under this assumption, we prove that the expected SFS of a sample uniquely determines the underlying demographic model, provided that the sample is sufficiently large. We obtain a general bound on the sample size sufficient for identifiability; the bound depends on the number of pieces in the demographic model and also on the type of population size function in each piece. In the cases of piecewise-constant, piecewise-exponential and piecewise-generalized-exponential models, which are often assumed in population genomic inferences, we provide explicit formulas for the bounds as simple functions of the number of pieces. Lastly, we obtain analogous results for the "folded" SFS, which is often used when there is ambiguity as to which allelic type is ancestral. Our results are proved using a generalization of Descartes' rule of signs for polynomials to the Laplace transform of piecewise continuous functions.

  2. DESCARTES’ RULE OF SIGNS AND THE IDENTIFIABILITY OF POPULATION DEMOGRAPHIC MODELS FROM GENOMIC VARIATION DATA

    PubMed Central

    Bhaskar, Anand; Song, Yun S.

    2016-01-01

    The sample frequency spectrum (SFS) is a widely-used summary statistic of genomic variation in a sample of homologous DNA sequences. It provides a highly efficient dimensional reduction of large-scale population genomic data and its mathematical dependence on the underlying population demography is well understood, thus enabling the development of efficient inference algorithms. However, it has been recently shown that very different population demographies can actually generate the same SFS for arbitrarily large sample sizes. Although in principle this nonidentifiability issue poses a thorny challenge to statistical inference, the population size functions involved in the counterexamples are arguably not so biologically realistic. Here, we revisit this problem and examine the identifiability of demographic models under the restriction that the population sizes are piecewise-defined where each piece belongs to some family of biologically-motivated functions. Under this assumption, we prove that the expected SFS of a sample uniquely determines the underlying demographic model, provided that the sample is sufficiently large. We obtain a general bound on the sample size sufficient for identifiability; the bound depends on the number of pieces in the demographic model and also on the type of population size function in each piece. In the cases of piecewise-constant, piecewise-exponential and piecewise-generalized-exponential models, which are often assumed in population genomic inferences, we provide explicit formulas for the bounds as simple functions of the number of pieces. Lastly, we obtain analogous results for the “folded” SFS, which is often used when there is ambiguity as to which allelic type is ancestral. Our results are proved using a generalization of Descartes’ rule of signs for polynomials to the Laplace transform of piecewise continuous functions. PMID:28018011

  3. Possession experiences in dissociative identity disorder: a preliminary study.

    PubMed

    Ross, Colin A

    2011-01-01

    Dissociative trance disorder, which includes possession experiences, was introduced as a provisional diagnosis requiring further study in the Diagnostic and Statistical Manual of Mental Disorders (4th ed.). Consideration is now being given to including possession experiences within dissociative identity disorder (DID) in the Diagnostic and Statistical Manual of Mental Disorders (5th ed.), which is due to be published in 2013. In order to provide empirical data relevant to the relationship between DID and possession states, I analyzed data on the prevalence of trance, possession states, sleepwalking, and paranormal experiences in 3 large samples: patients with DID from North America; psychiatric outpatients from Shanghai, China; and a general population sample from Winnipeg, Canada. Trance, sleepwalking, paranormal, and possession experiences were much more common in the DID patients than in the 2 comparison samples. The study is preliminary and exploratory in nature because the samples were not matched in any way.

  4. A proposed method to minimize waste from institutional radiation safety surveillance programs through the application of expected value statistics.

    PubMed

    Emery, R J

    1997-03-01

    Institutional radiation safety programs routinely use wipe test sampling and liquid scintillation counting analysis to indicate the presence of removable radioactive contamination. Significant volumes of liquid waste can be generated by such surveillance activities, and the subsequent disposal of these materials can sometimes be difficult and costly. In settings where large numbers of negative results are regularly obtained, the limited grouping of samples for analysis based on expected value statistical techniques is possible. To demonstrate the plausibility of the approach, single wipe samples exposed to varying amounts of contamination were analyzed concurrently with nine non-contaminated samples. Although the sample grouping inevitably leads to increased quenching with liquid scintillation counting systems, the effect did not impact the ability to detect removable contamination in amounts well below recommended action levels. Opportunities to further improve this cost effective semi-quantitative screening procedure are described, including improvements in sample collection procedures, enhancing sample-counting media contact through mixing and extending elution periods, increasing sample counting times, and adjusting institutional action levels.
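
    The abstract does not give the authors' exact grouping rule, but the arithmetic of two-stage pooling is standard (Dorfman group testing). A hedged sketch of the expected number of counting analyses per wipe sample, assuming each wipe is independently contaminated with probability p:

    ```python
    def expected_tests_per_sample(p, k):
        """Expected LSC analyses per wipe under two-stage pooling: one pooled
        count per group of k wipes, plus k individual recounts whenever the
        pooled count is positive (per-wipe contamination probability p)."""
        prob_pool_positive = 1.0 - (1.0 - p) ** k
        return 1.0 / k + prob_pool_positive

    # With mostly negative wipes (p = 1%), pooling 10 wipes cuts the expected
    # number of analyses (and the scintillation-fluid waste) by roughly 80%.
    for k in (5, 10, 20):
        print(k, round(expected_tests_per_sample(0.01, k), 3))
    ```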

  5. Comparative AMS radiocarbon dating of pretreated versus non-pretreated tropical wood samples

    NASA Astrophysics Data System (ADS)

    Patrut, Adrian; von Reden, Karl F.; Lowy, Daniel A.; Mayne, Diana H.; Elder, Kathryn E.; Roberts, Mark L.; McNichol, Ann P.

    2010-04-01

    Several wood samples collected from Dorslandboom, a large iconic African baobab (Adansonia digitata L.) from Namibia, were investigated by AMS radiocarbon dating subsequent to pretreatment and, alternatively, without pretreatment. The comparative statistical evaluation of results showed that there were no significant differences between fraction modern values and radiocarbon dates of the samples analyzed after pretreatment and without pretreatment, respectively. The radiocarbon date of the oldest sample was 993 ± 20 BP. Dating results also revealed that Dorslandboom is a multi-generation tree, with several stems showing different ages.

  6. Sanov and central limit theorems for output statistics of quantum Markov chains

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Horssen, Merlijn van, E-mail: merlijn.vanhorssen@nottingham.ac.uk; Guţă, Mădălin, E-mail: madalin.guta@nottingham.ac.uk

    2015-02-15

    In this paper, we consider the statistics of repeated measurements on the output of a quantum Markov chain. We establish a large deviations result analogous to Sanov’s theorem for the multi-site empirical measure associated to finite sequences of consecutive outcomes of a classical stochastic process. Our result relies on the construction of an extended quantum transition operator (which keeps track of previous outcomes) in terms of which we compute moment generating functions, and whose spectral radius is related to the large deviations rate function. As a corollary to this, we obtain a central limit theorem for the empirical measure. Such higher level statistics may be used to uncover critical behaviour such as dynamical phase transitions, which are not captured by lower level statistics such as the sample mean. As a step in this direction, we give an example of a finite system whose level-1 (empirical mean) rate function is independent of a model parameter while the level-2 (empirical measure) rate is not.
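
    For orientation, the classical (commutative) counterpart of the quoted result is Sanov's theorem for i.i.d. outcomes, which identifies the level-2 rate function with relative entropy:

    ```latex
    % Classical Sanov's theorem: for i.i.d. outcomes X_1,\dots,X_n with law
    % \pi on a finite alphabet, the empirical measure
    % L_n = \frac{1}{n}\sum_{i=1}^{n}\delta_{X_i} satisfies a large
    % deviations principle with rate given by the relative entropy:
    \[
      \Pr\bigl(L_n \approx \mu\bigr) \asymp e^{-n\,D(\mu\,\|\,\pi)},
      \qquad
      D(\mu\,\|\,\pi) = \sum_{x}\mu(x)\,\log\frac{\mu(x)}{\pi(x)}.
    \]
    ```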

  7. CALIFA: a diameter-selected sample for an integral field spectroscopy galaxy survey

    NASA Astrophysics Data System (ADS)

    Walcher, C. J.; Wisotzki, L.; Bekeraité, S.; Husemann, B.; Iglesias-Páramo, J.; Backsmann, N.; Barrera Ballesteros, J.; Catalán-Torrecilla, C.; Cortijo, C.; del Olmo, A.; Garcia Lorenzo, B.; Falcón-Barroso, J.; Jilkova, L.; Kalinova, V.; Mast, D.; Marino, R. A.; Méndez-Abreu, J.; Pasquali, A.; Sánchez, S. F.; Trager, S.; Zibetti, S.; Aguerri, J. A. L.; Alves, J.; Bland-Hawthorn, J.; Boselli, A.; Castillo Morales, A.; Cid Fernandes, R.; Flores, H.; Galbany, L.; Gallazzi, A.; García-Benito, R.; Gil de Paz, A.; González-Delgado, R. M.; Jahnke, K.; Jungwiert, B.; Kehrig, C.; Lyubenova, M.; Márquez Perez, I.; Masegosa, J.; Monreal Ibero, A.; Pérez, E.; Quirrenbach, A.; Rosales-Ortega, F. F.; Roth, M. M.; Sanchez-Blazquez, P.; Spekkens, K.; Tundo, E.; van de Ven, G.; Verheijen, M. A. W.; Vilchez, J. V.; Ziegler, B.

    2014-09-01

    We describe and discuss the selection procedure and statistical properties of the galaxy sample used by the Calar Alto Legacy Integral Field Area (CALIFA) survey, a public legacy survey of 600 galaxies using integral field spectroscopy. The CALIFA "mother sample" was selected from the Sloan Digital Sky Survey (SDSS) DR7 photometric catalogue to include all galaxies with an r-band isophotal major axis between 45'' and 79.2'' and with a redshift 0.005 < z < 0.03. The mother sample contains 939 objects, 600 of which will be observed in the course of the CALIFA survey. The selection of targets for observations is based solely on visibility and thus keeps the statistical properties of the mother sample. By comparison with a large set of SDSS galaxies, we find that the CALIFA sample is representative of galaxies over a luminosity range of -19 > Mr > -23.1 and over a stellar mass range between 10^9.7 and 10^11.4 M⊙. In particular, within these ranges, the diameter selection does not lead to any significant bias against - or in favour of - intrinsically large or small galaxies. Only below luminosities of Mr = -19 (or stellar masses <10^9.7 M⊙) is there a prevalence of galaxies with larger isophotal sizes, especially of nearly edge-on late-type galaxies, but such galaxies form <10% of the full sample. We estimate volume-corrected distribution functions in luminosities and sizes and show that these are statistically fully compatible with estimates from the full SDSS when accounting for large-scale structure. For full characterization of the sample, we also present a number of value-added quantities determined for the galaxies in the CALIFA sample. These include consistent multi-band photometry based on growth curve analyses; stellar masses; distances and quantities derived from these; morphological classifications; and an overview of available multi-wavelength photometric measurements. We also explore different ways of characterizing the environments of CALIFA galaxies, finding that the sample covers environmental conditions from the field to genuine clusters. We finally consider the expected incidence of active galactic nuclei among CALIFA galaxies given the existing pre-CALIFA data, finding that the final observed CALIFA sample will contain approximately 30 Sey2 galaxies. Based on observations collected at the Centro Astronómico Hispano Alemán (CAHA) at Calar Alto, operated jointly by the Max Planck Institute for Astronomy and the Instituto de Astrofísica de Andalucía (CSIC). Publicly released data products from CALIFA are made available on the webpage http://www.caha.es/CALIFA

  8. On the Study of Statistical Intuitions.

    DTIC Science & Technology

    1981-05-15

    teller; 3. Linda is a bank teller who is active in the feminist movement. In a large sample of statistically naive undergraduates, 86% judged...Linda is both a bank teller and an active feminist must be smaller than the probability that she is a bank teller. (ii) B is more probable than A...because Linda resembles a bank teller who is active in the feminist movement more than she resembles a bank teller. Argument (i) favoring the conjunction

  9. Area estimation using multiyear designs and partial crop identification

    NASA Technical Reports Server (NTRS)

    Sielken, R. L., Jr.

    1984-01-01

    Statistical procedures were developed for large area assessments using both satellite and conventional data. Crop acreages, other ground cover indices, and measures of change were the principal characteristics of interest. These characteristics are capable of being estimated from samples collected possibly from several sources at varying times, with different levels of identification. Multiyear analysis techniques were extended to include partially identified samples; the best current-year sampling design corresponding to a given sampling history was determined; weights reflecting the precision or confidence in each observation were identified and utilized; and the variation in estimates incorporating partially identified samples was quantified.

  10. Geostatistics - a tool applied to the distribution of Legionella pneumophila in a hospital water system.

    PubMed

    Laganà, Pasqualina; Moscato, Umberto; Poscia, Andrea; La Milia, Daniele Ignazio; Boccia, Stefania; Avventuroso, Emanuela; Delia, Santi

    2015-01-01

    Legionnaires' disease is normally acquired by inhalation of legionellae from a contaminated environmental source. Water systems of large buildings, such as hospitals, are often contaminated with legionellae and therefore represent a potential risk for the hospital population. The aim of this study was to evaluate the potential contamination of Legionella pneumophila (LP) in a large hospital in Italy through georeferential statistical analysis, to assess the possible sources of dispersion and, consequently, the risk of exposure for both health care staff and patients. The distribution of LP serogroups 1 and 2-14 was considered in the wards housed on two consecutive floors of the hospital building. On the basis of information provided by 53 bacteriological analyses, a 'random' grid of points was chosen and spatial geostatistics (kriging) was applied and compared with the results of classical statistical analysis. Over 50% of the examined samples were positive for Legionella pneumophila. LP 1 was isolated in 69% of samples from the ground floor and in 60% of samples from the first floor; LP 2-14 in 36% of samples from the ground floor and 24% from the first. The iso-estimation maps clearly show the most contaminated pipe and the difference in the diffusion of the different L. pneumophila serogroups. This experimental work demonstrates that geostatistical methods applied to the microbiological analysis of water matrices allow better modeling of the phenomenon under study, greater potential for risk management, and a wider choice of prevention and environmental recovery methods than classical statistical analysis.
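
    The paper's variogram model and sampling grid are not reproduced in the abstract; the sketch below shows ordinary kriging at a single target location under an assumed exponential variogram, with coordinates and counts purely illustrative:

    ```python
    import numpy as np

    def exp_variogram(h, sill=1.0, rng_=50.0, nugget=0.0):
        """Exponential variogram model gamma(h) (assumed form)."""
        return nugget + sill * (1.0 - np.exp(-h / rng_))

    def ordinary_kriging(coords, values, target, **vg):
        """Ordinary kriging at one target point: solve the kriging system
        built from pairwise variogram values, with a Lagrange multiplier
        forcing the weights to sum to one."""
        n = len(values)
        d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
        A = np.ones((n + 1, n + 1))
        A[:n, :n] = exp_variogram(d, **vg)
        A[n, n] = 0.0
        b = np.ones(n + 1)
        b[:n] = exp_variogram(np.linalg.norm(coords - target, axis=1), **vg)
        w = np.linalg.solve(A, b)
        return w[:n] @ values      # interpolated concentration estimate

    # Illustrative sampling points (e.g., CFU counts at taps, map units).
    coords = np.array([[0., 0.], [10., 0.], [0., 10.], [10., 10.]])
    values = np.array([120., 30., 80., 10.])
    print(ordinary_kriging(coords, values, np.array([5., 5.]),
                           sill=1.0, rng_=20.0))
    ```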

  11. Evaluation of Respondent-Driven Sampling

    PubMed Central

    McCreesh, Nicky; Frost, Simon; Seeley, Janet; Katongole, Joseph; Tarsh, Matilda Ndagire; Ndunguse, Richard; Jichi, Fatima; Lunel, Natasha L; Maher, Dermot; Johnston, Lisa G; Sonnenberg, Pam; Copas, Andrew J; Hayes, Richard J; White, Richard G

    2012-01-01

    Background Respondent-driven sampling is a novel variant of link-tracing sampling for estimating the characteristics of hard-to-reach groups, such as HIV prevalence in sex-workers. Despite its use by leading health organizations, the performance of this method in realistic situations is still largely unknown. We evaluated respondent-driven sampling by comparing estimates from a respondent-driven sampling survey with total-population data. Methods Total-population data on age, tribe, religion, socioeconomic status, sexual activity and HIV status were available on a population of 2402 male household-heads from an open cohort in rural Uganda. A respondent-driven sampling (RDS) survey was carried out in this population, employing current methods of sampling (RDS sample) and statistical inference (RDS estimates). Analyses were carried out for the full RDS sample and then repeated for the first 250 recruits (small sample). Results We recruited 927 household-heads. Full and small RDS samples were largely representative of the total population, but both samples under-represented men who were younger, of higher socioeconomic status, and with unknown sexual activity and HIV status. Respondent-driven-sampling statistical-inference methods failed to reduce these biases. Only 31%-37% (depending on method and sample size) of RDS estimates were closer to the true population proportions than the RDS sample proportions. Only 50%-74% of respondent-driven-sampling bootstrap 95% confidence intervals included the population proportion. Conclusions Respondent-driven sampling produced a generally representative sample of this well-connected non-hidden population. However, current respondent-driven-sampling inference methods failed to reduce bias when it occurred. Whether the data required to remove bias and measure precision can be collected in a respondent-driven sampling survey is unresolved. Respondent-driven sampling should be regarded as a (potentially superior) form of convenience-sampling method, and caution is required when interpreting findings based on the sampling method. PMID:22157309
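
    The "current methods of statistical inference" for RDS typically include the RDS-II (Volz-Heckathorn) estimator, which weights each recruit by the inverse of reported network degree; a minimal sketch with illustrative data (not the Uganda survey's):

    ```python
    import numpy as np

    def rds_ii_estimate(y, degree):
        """RDS-II (Volz-Heckathorn) prevalence estimate: inverse-degree
        weighted mean of a binary trait y, since inclusion probability is
        taken to be proportional to network degree."""
        y = np.asarray(y, dtype=float)
        w = 1.0 / np.asarray(degree, dtype=float)
        return np.sum(w * y) / np.sum(w)

    y = [1, 0, 0, 1, 0, 1]              # trait indicators (illustrative)
    degree = [20, 5, 8, 40, 10, 25]     # self-reported network sizes
    print(rds_ii_estimate(y, degree))   # down-weights high-degree recruits
    ```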

  12. Expected values and variances of Bragg peak intensities measured in a nanocrystalline powder diffraction experiment

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Öztürk, Hande; Noyan, I. Cevdet

    A rigorous study of sampling and intensity statistics applicable for a powder diffraction experiment as a function of crystallite size is presented. Our analysis yields approximate equations for the expected value, variance and standard deviations for both the number of diffracting grains and the corresponding diffracted intensity for a given Bragg peak. The classical formalism published in 1948 by Alexander, Klug & Kummer [J. Appl. Phys. (1948), 19, 742–753] appears as a special case, limited to large crystallite sizes, here. It is observed that both the Lorentz probability expression and the statistics equations used in the classical formalism are inapplicable for nanocrystalline powder samples.

  13. VizieR Online Data Catalog: The ESO DIBs Large Exploration Survey (Cox+, 2017)

    NASA Astrophysics Data System (ADS)

    Cox, N. L. J.; Cami, J.; Farhang, A.; Smoker, J.; Monreal-Ibero, A.; Lallement, R.; Sarre, P. J.; Marshall, C. C. M.; Smith, K. T.; Evans, C. J.; Royer, P.; Linnartz, H.; Cordiner, M. A.; Joblin, C.; van Loon, J. T.; Foing, B. H.; Bhatt, N. H.; Bron, E.; Elyajouri, M.; de Koter, A.; Ehrenfreund, P.; Javadi, A.; Kaper, L.; Khosroshadi, H. G.; Laverick, M.; Le Petit, F.; Mulas, G.; Roueff, E.; Salama, F.; Spaans, M.

    2018-01-01

    We constructed a statistically representative survey sample that probes a wide range of interstellar environment parameters including reddening E(B-V), visual extinction AV, total-to-selective extinction ratio RV, and molecular hydrogen fraction fH2. EDIBLES provides the community with optical (~305-1042nm) spectra at high spectral resolution (R~70000 in the blue arm and 100000 in the red arm) and high signal-to-noise (S/N; median value ~500-1000), for a statistically significant sample of interstellar sightlines. Many of the >100 sightlines included in the survey already have available auxiliary ultraviolet, infrared and/or polarisation data on the dust and gas components. (2 data files).

  14. Expected values and variances of Bragg peak intensities measured in a nanocrystalline powder diffraction experiment

    DOE PAGES

    Öztürk, Hande; Noyan, I. Cevdet

    2017-08-24

    A rigorous study of sampling and intensity statistics applicable for a powder diffraction experiment as a function of crystallite size is presented. Our analysis yields approximate equations for the expected value, variance and standard deviations for both the number of diffracting grains and the corresponding diffracted intensity for a given Bragg peak. The classical formalism published in 1948 by Alexander, Klug & Kummer [J. Appl. Phys. (1948), 19, 742–753] appears as a special case, limited to large crystallite sizes, here. It is observed that both the Lorentz probability expression and the statistics equations used in the classical formalism are inapplicable for nanocrystalline powder samples.
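
    As a baseline for the intensity statistics discussed (not the paper's refined equations), the number of diffracting grains can be modeled as binomial, and its expected value, variance, and relative spread follow directly; p_bragg below is an assumed per-grain probability of satisfying the Bragg condition:

    ```python
    from math import sqrt

    def diffracting_grain_stats(n_grains, p_bragg):
        """Mean, variance, and relative spread of the number of diffracting
        grains when each of n_grains independently satisfies the Bragg
        condition with probability p_bragg (binomial baseline model)."""
        mean = n_grains * p_bragg
        var = n_grains * p_bragg * (1.0 - p_bragg)
        return mean, var, sqrt(var) / mean

    # Fewer (larger) crystallites mean fewer diffracting grains and a larger
    # relative spread in the measured Bragg intensities.
    for n in (10**8, 10**5, 10**3):
        mean, _, rel = diffracting_grain_stats(n, 5e-4)
        print(n, mean, round(rel, 3))
    ```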

  15. MAFsnp: A Multi-Sample Accurate and Flexible SNP Caller Using Next-Generation Sequencing Data

    PubMed Central

    Hu, Jiyuan; Li, Tengfei; Xiu, Zidi; Zhang, Hong

    2015-01-01

    Most existing statistical methods developed for calling single nucleotide polymorphisms (SNPs) using next-generation sequencing (NGS) data are based on Bayesian frameworks, and no existing SNP caller produces p-values for calling SNPs in a frequentist framework. To fill this gap, we develop a new method, MAFsnp, a Multiple-sample based Accurate and Flexible algorithm for calling SNPs with NGS data. MAFsnp is based on an estimated likelihood ratio test (eLRT) statistic. In practical situations, the involved parameter is very close to the boundary of the parameter space, so standard large-sample theory is not suitable for evaluating the finite-sample distribution of the eLRT statistic. Observing that the distribution of the test statistic is a mixture of a point mass at zero and a continuous part, we propose to model the test statistic with a novel two-parameter mixture distribution. Once the parameters in the mixture distribution are estimated, p-values can be easily calculated for detecting SNPs, and the multiple-testing corrected p-values can be used to control the false discovery rate (FDR) at any pre-specified level. With simulated data, MAFsnp is shown to have much better control of FDR than the existing SNP callers. Through application to two real datasets, MAFsnp is also shown to outperform the existing SNP callers in terms of calling accuracy. An R package “MAFsnp” implementing the new SNP caller is freely available at http://homepage.fudan.edu.cn/zhangh/softwares/. PMID:26309201
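
    MAFsnp estimates its two mixture parameters from data; the classical boundary-case result (Self & Liang, 1987) instead fixes equal weights on a point mass at zero and a chi-square with one degree of freedom, and it shows the form such a mixture-based p-value takes. A sketch under that classical assumption, not the paper's estimated mixture:

    ```python
    from scipy.stats import chi2

    def boundary_lrt_pvalue(t, w=0.5, df=1):
        """P-value for an LRT statistic t whose null distribution is a
        mixture of a point mass at zero (weight 1 - w) and a chi-square
        with df degrees of freedom (weight w); w = 0.5, df = 1 is the
        classical boundary case."""
        if t <= 0.0:
            return 1.0
        return w * chi2.sf(t, df)

    print(boundary_lrt_pvalue(3.84))  # ~0.025 rather than the naive 0.05
    ```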

  16. The SDSS-III Multi-object Apo Radial-velocity Exoplanet Large-area Survey

    NASA Astrophysics Data System (ADS)

    Ge, Jian; Mahadevan, S.; Lee, B.; Wan, X.; Zhao, B.; van Eyken, J.; Kane, S.; Guo, P.; Ford, E. B.; Agol, E.; Gaudi, S.; Fleming, S.; Crepp, J.; Cohen, R.; Groot, J.; Galvez, M.; Liu, J.; Ford, H.; Schneider, D.; Seager, S.; Hawley, S. L.; Weinberg, D.; Eisenstein, D.

    2007-12-01

    As part of the SDSS-III survey in 2008-2014, the Multi-object APO Radial-Velocity Exoplanet Large-area Survey (MARVELS) will conduct the largest ground-based Doppler planet survey to date, using the SDSS telescope and new-generation multi-object Doppler instruments with 120-object capability and 10-20 m/s Doppler precision. The baseline survey plan is to monitor a total of 11,000 V=8-12 stars (~10,000 main-sequence stars and ~1000 giant stars) over 800 square degrees during the 6 years. The primary goal is to produce a large, statistically well-defined sample of giant planets (~200) with a wide range of masses (~0.2-10 Jupiter masses) and orbits (1 day-2 years), drawn from a large number of host stars with a diverse set of masses, compositions, and ages, for studying the diversity of extrasolar planets and constraining the formation, migration, and dynamical evolution of planetary systems. The survey data will also provide a statistical sample for theoretical comparison, and will be used to discover rare systems and identify signposts of lower-mass or more distant planets. Early science results from the pilot program will be reported. We would like to thank the SDSS MC for allocation of the telescope time and the W.M. Keck Foundation, NSF, NASA and UF for support.

  17. Impact of Satellite Viewing-Swath Width on Global and Regional Aerosol Optical Thickness Statistics and Trends

    NASA Technical Reports Server (NTRS)

    Colarco, P. R.; Kahn, R. A.; Remer, L. A.; Levy, R. C.

    2014-01-01

    We use the Moderate Resolution Imaging Spectroradiometer (MODIS) satellite aerosol optical thickness (AOT) product to assess the impact of reduced swath width on global and regional AOT statistics and trends. Along-track and across-track sampling strategies are employed, in which the full MODIS data set is sub-sampled with various narrow-swath (approximately 400-800 km) and single pixel width (approximately 10 km) configurations. Although view-angle artifacts in the MODIS AOT retrieval confound direct comparisons between averages derived from different sub-samples, careful analysis shows that with many portions of the Earth essentially unobserved, spatial sampling introduces uncertainty in the derived seasonal-regional mean AOT. These AOT spatial sampling artifacts comprise up to 60% of the full-swath AOT value under moderate aerosol loading, and can be as large as 0.1 in some regions under high aerosol loading. Compared to full-swath observations, narrower swath and single pixel width sampling exhibits a reduced ability to detect AOT trends with statistical significance. On the other hand, estimates of the global, annual mean AOT do not vary significantly from the full-swath values as spatial sampling is reduced. Aggregation of the MODIS data at coarse grid scales (10 deg) shows consistency in the aerosol trends across sampling strategies, with increased statistical confidence, but quantitative errors in the derived trends are found even for the full-swath data when compared to high spatial resolution (0.5 deg) aggregations. Using results of a model-derived aerosol reanalysis, we find consistency in our conclusions about a seasonal-regional spatial sampling artifact in AOT. Furthermore, the model shows that reduced spatial sampling can amount to uncertainty in computed shortwave top-of-atmosphere aerosol radiative forcing of 2-3 W m^-2. These artifacts are lower bounds, as possibly other unconsidered sampling strategies would perform less well. These results suggest that future aerosol satellite missions having significantly less than full-swath viewing are unlikely to sample the true AOT distribution well enough to obtain the statistics needed to reduce uncertainty in aerosol direct forcing of climate.

  18. Evaluating Site-Specific and Generic Spatial Models of Aboveground Forest Biomass Based on Landsat Time-Series and LiDAR Strip Samples in the Eastern USA

    Treesearch

    Ram Deo; Matthew Russell; Grant Domke; Hans-Erik Andersen; Warren Cohen; Christopher Woodall

    2017-01-01

    Large-area assessment of aboveground tree biomass (AGB) to inform regional or national forest monitoring programs can be efficiently carried out by combining remotely sensed data and field sample measurements through a generic statistical model, in contrast to site-specific models. We integrated forest inventory plot data with spatial predictors from Landsat time-...

  19. Cloud-based solution to identify statistically significant MS peaks differentiating sample categories.

    PubMed

    Ji, Jun; Ling, Jeffrey; Jiang, Helen; Wen, Qiaojun; Whitin, John C; Tian, Lu; Cohen, Harvey J; Ling, Xuefeng B

    2013-03-23

    Mass spectrometry (MS) has evolved to become the primary high throughput tool for proteomics based biomarker discovery. To date, multiple challenges in protein MS data analysis remain: management of large-scale and complex data sets; MS peak identification and indexing; and high-dimensional differential analysis of peaks with concurrent statistical testing and false discovery rate (FDR) control. "Turnkey" solutions are needed for biomarker investigations to rapidly process MS data sets to identify statistically significant peaks for subsequent validation. Here we present an efficient and effective solution, which provides experimental biologists easy access to "cloud" computing capabilities to analyze MS data. The web portal can be accessed at http://transmed.stanford.edu/ssa/. The web application supports online uploading and analysis of large-scale MS data through a simple user interface. This bioinformatic tool will facilitate the discovery of potential protein biomarkers using MS.
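
    The FDR control step mentioned is commonly implemented with the Benjamini-Hochberg step-up procedure; the portal's exact method is not stated, so the following is a generic sketch:

    ```python
    import numpy as np

    def benjamini_hochberg(pvals, alpha=0.05):
        """Benjamini-Hochberg step-up procedure: returns a boolean mask of
        hypotheses rejected at FDR level alpha."""
        p = np.asarray(pvals)
        m = len(p)
        order = np.argsort(p)
        below = p[order] <= alpha * np.arange(1, m + 1) / m
        reject = np.zeros(m, dtype=bool)
        if below.any():
            k = np.max(np.nonzero(below)[0])  # largest i with p_(i) <= i*alpha/m
            reject[order[:k + 1]] = True
        return reject

    pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
    print(benjamini_hochberg(pvals))   # rejects the two smallest p-values
    ```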

  20. Results of the NaCo Large Program: probing the occurrence of exoplanets and brown dwarfs at wide orbit

    NASA Astrophysics Data System (ADS)

    Vigan, A.; Chauvin, G.; Bonavita, M.; Desidera, S.; Bonnefoy, M.; Mesa, D.; Beuzit, J.-L.; Augereau, J.-C.; Biller, B.; Boccaletti, A.; Brugaletta, E.; Buenzli, E.; Carson, J.; Covino, E.; Delorme, P.; Eggenberger, A.; Feldt, M.; Hagelberg, J.; Henning, T.; Lagrange, A.-M.; Lanzafame, A.; Ménard, F.; Messina, S.; Meyer, M.; Montagnier, G.; Mordasini, C.; Mouillet, D.; Moutou, C.; Mugnier, L.; Quanz, S. P.; Reggiani, M.; Ségransan, D.; Thalmann, C.; Waters, R.; Zurlo, A.

    2014-01-01

    Over the past decade, a growing number of deep imaging surveys have started to provide meaningful constraints on the population of extrasolar giant planets at large orbital separation. Primary targets for these surveys have been carefully selected based on their age, distance and spectral type, and often on their membership to young nearby associations where all stars share common kinematics, photometric and spectroscopic properties. The next step is a wider statistical analysis of the frequency and properties of low mass companions as a function of stellar mass and orbital separation. In late 2009, we initiated a coordinated European Large Program using angular differential imaging in the H band (1.66 μm) with NaCo at the VLT. Our aim is to provide a comprehensive and statistically significant study of the occurrence of extrasolar giant planets and brown dwarfs at large (5-500 AU) orbital separation around ~150 young, nearby stars, a large fraction of which have never been observed at very deep contrast. The survey has now been completed and we present the data analysis and detection limits for the observed sample, for which we reach the planetary-mass domain at separations of >~50 AU on average. We also present the results of the statistical analysis that has been performed over the 75 targets newly observed at high-contrast. We discuss the details of the statistical analysis and the physical constraints that our survey provides for the frequency and formation scenario of planetary mass companions at large separation.

  1. Singular spectrum analysis in nonlinear dynamics, with applications to paleoclimatic time series

    NASA Technical Reports Server (NTRS)

    Vautard, R.; Ghil, M.

    1989-01-01

    Two dimensions of a dynamical system given by experimental time series are distinguished. Statistical dimension gives a theoretical upper bound for the minimal number of degrees of freedom required to describe the attractor up to the accuracy of the data, taking into account sampling and noise problems. The dynamical dimension is the intrinsic dimension of the attractor and does not depend on the quality of the data. Singular Spectrum Analysis (SSA) provides estimates of the statistical dimension. SSA also describes the main physical phenomena reflected by the data. It gives adaptive spectral filters associated with the dominant oscillations of the system and clarifies the noise characteristics of the data. SSA is applied to four paleoclimatic records. The principal climatic oscillations and the regime changes in their amplitude are detected. About 10 degrees of freedom are statistically significant in the data. Large noise and insufficient sample length do not allow reliable estimates of the dynamical dimension.

  2. Statistical genetics concepts and approaches in schizophrenia and related neuropsychiatric research.

    PubMed

    Schork, Nicholas J; Greenwood, Tiffany A; Braff, David L

    2007-01-01

    Statistical genetics is a research field that focuses on mathematical models and statistical inference methodologies that relate genetic variations (ie, naturally occurring human DNA sequence variations or "polymorphisms") to particular traits or diseases (phenotypes) usually from data collected on large samples of families or individuals. The ultimate goal of such analysis is the identification of genes and genetic variations that influence disease susceptibility. Although of extreme interest and importance, the fact that many genes and environmental factors contribute to neuropsychiatric diseases of public health importance (eg, schizophrenia, bipolar disorder, and depression) complicates relevant studies and suggests that very sophisticated mathematical and statistical modeling may be required. In addition, large-scale contemporary human DNA sequencing and related projects, such as the Human Genome Project and the International HapMap Project, as well as the development of high-throughput DNA sequencing and genotyping technologies have provided statistical geneticists with a great deal of very relevant and appropriate information and resources. Unfortunately, the use of these resources and their interpretation are not straightforward when applied to complex, multifactorial diseases such as schizophrenia. In this brief and largely nonmathematical review of the field of statistical genetics, we describe many of the main concepts, definitions, and issues that motivate contemporary research. We also provide a discussion of the most pressing contemporary problems that demand further research if progress is to be made in the identification of genes and genetic variations that predispose to complex neuropsychiatric diseases.

  3. Neurons from the adult human dentate nucleus: neural networks in the neuron classification.

    PubMed

    Grbatinić, Ivan; Marić, Dušica L; Milošević, Nebojša T

    2015-04-07

    We performed topological (central vs. border neuron type) and morphological classification of adult human dentate nucleus neurons according to their quantified histomorphological properties, using neural networks on real and virtual neuron samples. In the real sample, 53.1% of central and 14.1% of border neurons are classified correctly, with a total of 32.8% of neurons misclassified. The most important result is the 62.2% of misclassified neurons in the border-neuron group, which is even greater than the number of correctly classified neurons (37.8%) in that group, showing an obvious failure of the network to classify neurons correctly based on the computational parameters used in our study. On the virtual sample, the 97.3% of misclassified neurons in the border-neuron group, far exceeding the correctly classified neurons (2.7%) in that group, again confirms this failure. Statistical analysis shows no statistically significant difference between central and border neurons for any measured parameter (p>0.05). A total of 96.74% of neurons are morphologically classified correctly by the neural networks, and each belongs to one of four histomorphological types: (a) neurons with small soma and short dendrites, (b) neurons with small soma and long dendrites, (c) neurons with large soma and short dendrites, and (d) neurons with large soma and long dendrites. Statistical analysis supports these results (p<0.05). Human dentate nucleus neurons can thus be classified into four neuron types according to their quantitative histomorphological properties. These types consist of two sets, small and large with respect to their perikarya, with subtypes differing in dendrite length (short vs. long dendrites). Besides confirming the classification into small and large neurons already shown in the literature, we found two new subtypes, i.e. neurons with small soma and long dendrites and neurons with large soma and short dendrites. These neurons are most probably equally distributed throughout the dentate nucleus, as no significant difference in their topological distribution is observed. Copyright © 2015 Elsevier Ltd. All rights reserved.

  4. No association of dynamin binding protein (DNMBP) gene SNPs and Alzheimer's disease.

    PubMed

    Minster, Ryan L; DeKosky, Steven T; Kamboh, M Ilyas

    2008-10-01

    A recent scan of single nucleotide polymorphisms (SNPs) on chromosome 10q found significant association of six correlated SNPs with late-onset Alzheimer's disease (AD) among Japanese. We examined the SNP with the highest statistical significance (rs3740058) in a large Caucasian American case-control cohort and the remaining five SNPs in a smaller subset of cases and controls. We observed no association of statistical significance in either the total sample or the APOE*4 non-carriers for any of the SNPs.

  5. Induced earthquake magnitudes are as large as (statistically) expected

    USGS Publications Warehouse

    Van Der Elst, Nicholas; Page, Morgan T.; Weiser, Deborah A.; Goebel, Thomas; Hosseini, S. Mehran

    2016-01-01

    A major question for the hazard posed by injection-induced seismicity is how large induced earthquakes can be. Are their maximum magnitudes determined by injection parameters or by tectonics? Deterministic limits on induced earthquake magnitudes have been proposed based on the size of the reservoir or the volume of fluid injected. However, if induced earthquakes occur on tectonic faults oriented favorably with respect to the tectonic stress field, then they may be limited only by the regional tectonics and connectivity of the fault network. In this study, we show that the largest magnitudes observed at fluid injection sites are consistent with the sampling statistics of the Gutenberg-Richter distribution for tectonic earthquakes, assuming no upper magnitude bound. The data pass three specific tests: (1) the largest observed earthquake at each site scales with the log of the total number of induced earthquakes, (2) the order of occurrence of the largest event is random within the induced sequence, and (3) the injected volume controls the total number of earthquakes rather than the total seismic moment. All three tests point to an injection control on earthquake nucleation but a tectonic control on earthquake magnitude. Given that the largest observed earthquakes are exactly as large as expected from the sampling statistics, we should not conclude that these are the largest earthquakes possible. Instead, the results imply that induced earthquake magnitudes should be treated with the same maximum magnitude bound that is currently used to treat seismic hazard from tectonic earthquakes.
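
    Test (1) reflects a standard property of Gutenberg-Richter sampling: above a completeness magnitude m0, magnitudes are exponentially distributed, so the expected maximum of N events grows roughly like m0 + log10(N)/b. A quick simulation with illustrative parameters:

    ```python
    import numpy as np

    rng = np.random.default_rng(42)
    b, m0 = 1.0, 0.0   # illustrative b-value and completeness magnitude

    def max_magnitude(n_events):
        """Largest magnitude among n_events drawn from an unbounded
        Gutenberg-Richter law (exponential with rate b*ln(10) above m0)."""
        mags = m0 + rng.exponential(scale=1.0 / (b * np.log(10)), size=n_events)
        return mags.max()

    # The observed maximum tracks m0 + log10(N)/b: it scales with the log of
    # the number of induced events, not with any physical cap.
    for n in (10, 100, 1000, 10000):
        avg = np.mean([max_magnitude(n) for _ in range(200)])
        print(n, round(avg, 2), "vs log10(N)/b =", round(np.log10(n) / b, 2))
    ```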

  6. The impact of study design and diagnostic approach in a large multi-centre ADHD study: Part 2: Dimensional measures of psychopathology and intelligence.

    PubMed

    Müller, Ueli C; Asherson, Philip; Banaschewski, Tobias; Buitelaar, Jan K; Ebstein, Richard P; Eisenberg, Jaques; Gill, Michael; Manor, Iris; Miranda, Ana; Oades, Robert D; Roeyers, Herbert; Rothenberger, Aribert; Sergeant, Joseph A; Sonuga-Barke, Edmund Js; Thompson, Margaret; Faraone, Stephen V; Steinhausen, Hans-Christoph

    2011-04-07

    The International Multi-centre ADHD Genetics (IMAGE) project, with 11 participating centres from 7 European countries and Israel, has collected a large behavioural and genetic database for present and future research. Behavioural data were collected from 1068 probands with ADHD and 1446 unselected siblings. The aim was to describe and analyse questionnaire data and IQ measures from all probands and siblings, and in particular to investigate the influence of age, gender, family status (proband vs. sibling), informant, and centre on sample homogeneity in psychopathological measures. Conners' Questionnaires, Strengths and Difficulties Questionnaires, and Wechsler Intelligence Scores were used to describe the phenotype of the sample. Data were analysed by use of robust statistical multi-way procedures. Besides main effects of age, gender, informant, and centre, there were considerable interaction effects on questionnaire data. The larger differences between probands and siblings at home than at school may reflect contrast effects in the parents. Furthermore, there were marked gender by status effects on the ADHD symptom ratings, with girls scoring one standard deviation higher than boys in the proband sample but lower than boys in the sibling sample. The multi-centre design is another important source of heterogeneity, particularly in its interaction with family status. To a large extent the centres differed from each other with regard to differences between proband and sibling scores. When ADHD probands are diagnosed by use of fixed symptom counts, the severity of the disorder in the proband sample may markedly differ between boys and girls and across age, particularly in samples with a large age range. A multi-centre design carries the risk of considerable phenotypic differences between centres and, consequently, of additional heterogeneity of the sample even if standardized diagnostic procedures are used. These possible sources of variance should be counteracted in genetic analyses either by using age- and gender-adjusted diagnostic procedures and regional normative data, or by adjusting for design artefacts by use of covariate statistics, by eliminating outliers, or by other methods suitable for reducing heterogeneity.

  7. Power of tests for comparing trend curves with application to national immunization survey (NIS).

    PubMed

    Zhao, Zhen

    2011-02-28

    Three statistical tests were proposed for comparing trend curves of study outcomes between two socio-demographic strata across consecutive time points, and their statistical power was compared under different trend-curve configurations. For large sample sizes, with an independent normal assumption among strata and across consecutive time points, Z and Chi-square test statistics were developed; these are functions of the outcome estimates and standard errors at each study time point for the two strata. For small sample sizes under the same assumption, an F-test statistic was derived as a function of the sample sizes of the two strata and the parameters estimated across the study period. If the two trend curves are approximately parallel, the power of the Z-test is consistently higher than that of both the Chi-square and F-tests. If the curves cross with low interaction, the power of the Z-test is higher than or equal to that of the Chi-square and F-tests; at high interaction, however, the Chi-square and F-tests are more powerful than the Z-test. A measure of the interaction of two trend curves was defined. The tests were applied to comparing trend curves of vaccination coverage estimates for the standard vaccine series using National Immunization Survey (NIS) 2000-2007 data. Copyright © 2011 John Wiley & Sons, Ltd.
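
    The abstract does not reproduce the test formulas; the sketch below shows one generic way pointwise Z statistics for two strata can be aggregated under the stated independent-normal assumption, with a sum-of-squares (chi-square) form that is sensitive to crossing curves. The specific forms are assumptions, not the paper's exact statistics:

    ```python
    import numpy as np
    from scipy.stats import norm, chi2

    def pointwise_z(est1, se1, est2, se2):
        """Z statistic comparing the two strata at each time point,
        assuming independent normal estimates."""
        est1, se1, est2, se2 = map(np.asarray, (est1, se1, est2, se2))
        return (est1 - est2) / np.sqrt(se1**2 + se2**2)

    def overall_tests(z):
        """Aggregate the pointwise Zs two ways: a Z-test on the average
        difference (powerful for parallel curves) and a chi-square on the
        sum of squares (sensitive to crossing curves)."""
        t = len(z)
        z_overall = z.sum() / np.sqrt(t)   # ~ N(0, 1) under H0
        x2 = float((z**2).sum())           # ~ chi-square(t) under H0
        return 2 * norm.sf(abs(z_overall)), chi2.sf(x2, t)

    z = pointwise_z([0.72, 0.75, 0.78], [0.010, 0.010, 0.010],
                    [0.70, 0.72, 0.74], [0.012, 0.012, 0.012])
    print(overall_tests(z))
    ```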

  8. Validity of the SAT for Predicting First-Year Grades: 2008 SAT Validity Sample. Statistical Report No. 2011-5

    ERIC Educational Resources Information Center

    Patterson, Brian F.; Mattern, Krista D.

    2011-01-01

    The findings for the 2008 sample are largely consistent with the previous reports. SAT scores were found to be correlated with FYGPA (r = 0.54), with a magnitude similar to HSGPA (r = 0.56). The best set of predictors of FYGPA remains SAT scores and HSGPA (r = 0.63), as the addition of the SAT sections to the correlation of HSGPA alone with FYGPA…

  9. 1979 Reserve Force Studies Surveys: Survey Design, Sample Design and Administrative Procedures,

    DTIC Science & Technology

    1981-08-01

    three factors: the need for a statistically significant number of usable questionnaires from different groups within the random samples and from... Because of the multipurpose nature of these surveys and the large number of questions needed to fully address some of the topics covered, we... varies. Collection of data at the unit level is needed to accurately estimate actual reserve compensation and benefits and their possible role in both

  10. A video multitracking system for quantification of individual behavior in a large fish shoal: advantages and limits.

    PubMed

    Delcourt, Johann; Becco, Christophe; Vandewalle, Nicolas; Poncin, Pascal

    2009-02-01

    The capability of a new multitracking system to track a large number of unmarked fish (up to 100) is evaluated. This system extrapolates a trajectory from each individual and analyzes recorded sequences that are several minutes long. This system is very efficient in statistical individual tracking, where the individual's identity is important for a short period of time in comparison with the duration of the track. Individual identification is typically greater than 99%. Identification is largely efficient (more than 99%) when the fish images do not cross the image of a neighbor fish. When the images of two fish merge (occlusion), we consider that the spot on the screen has a double identity. Consequently, there are no identification errors during occlusions, even though the measurement of the positions of each individual is imprecise. When the images of these two merged fish separate (separation), individual identification errors are more frequent, but their effect is very low in statistical individual tracking. On the other hand, in complete individual tracking, where individual fish identity is important for the entire trajectory, each identification error invalidates the results. In such cases, the experimenter must observe whether the program assigns the correct identification, and, when an error is made, must edit the results. This work is not too costly in time because it is limited to the separation events, accounting for fewer than 0.1% of individual identifications. Consequently, in both statistical and rigorous individual tracking, this system allows the experimenter to gain time by measuring the individual position automatically. It can also analyze the structural and dynamic properties of an animal group with a very large sample, with precision and sampling that are impossible to obtain with manual measures.

  11. Replicability of time-varying connectivity patterns in large resting state fMRI samples.

    PubMed

    Abrol, Anees; Damaraju, Eswar; Miller, Robyn L; Stephen, Julia M; Claus, Eric D; Mayer, Andrew R; Calhoun, Vince D

    2017-12-01

    The past few years have seen an emergence of approaches that leverage temporal changes in whole-brain patterns of functional connectivity (the chronnectome). In this chronnectome study, we investigate the replicability of the human brain's inter-regional coupling dynamics during rest by evaluating two different dynamic functional network connectivity (dFNC) analysis frameworks using 7 500 functional magnetic resonance imaging (fMRI) datasets. To quantify the extent to which the emergent functional connectivity (FC) patterns are reproducible, we characterize the temporal dynamics by deriving several summary measures across multiple large, independent age-matched samples. Reproducibility was demonstrated through the existence of basic connectivity patterns (FC states) amidst an ensemble of inter-regional connections. Furthermore, application of the methods to conservatively configured (statistically stationary, linear and Gaussian) surrogate datasets revealed that some of the studied state summary measures were indeed statistically significant and also suggested that this class of null model did not explain the fMRI data fully. This extensive testing of reproducibility of similarity statistics also suggests that the estimated FC states are robust against variation in data quality, analysis, grouping, and decomposition methods. We conclude that future investigations probing the functional and neurophysiological relevance of time-varying connectivity assume critical importance. Copyright © 2017 The Authors. Published by Elsevier Inc. All rights reserved.

  12. Replicability of time-varying connectivity patterns in large resting state fMRI samples

    PubMed Central

    Abrol, Anees; Damaraju, Eswar; Miller, Robyn L.; Stephen, Julia M.; Claus, Eric D.; Mayer, Andrew R.; Calhoun, Vince D.

    2018-01-01

    The past few years have seen an emergence of approaches that leverage temporal changes in whole-brain patterns of functional connectivity (the chronnectome). In this chronnectome study, we investigate the replicability of the human brain’s inter-regional coupling dynamics during rest by evaluating two different dynamic functional network connectivity (dFNC) analysis frameworks using 7 500 functional magnetic resonance imaging (fMRI) datasets. To quantify the extent to which the emergent functional connectivity (FC) patterns are reproducible, we characterize the temporal dynamics by deriving several summary measures across multiple large, independent age-matched samples. Reproducibility was demonstrated through the existence of basic connectivity patterns (FC states) amidst an ensemble of inter-regional connections. Furthermore, application of the methods to conservatively configured (statistically stationary, linear and Gaussian) surrogate datasets revealed that some of the studied state summary measures were indeed statistically significant and also suggested that this class of null model did not explain the fMRI data fully. This extensive testing of reproducibility of similarity statistics also suggests that the estimated FC states are robust against variation in data quality, analysis, grouping, and decomposition methods. We conclude that future investigations probing the functional and neurophysiological relevance of time-varying connectivity assume critical importance. PMID:28916181

  13. Descriptive Statistical Attributes of Special Education Data Sets

    ERIC Educational Resources Information Center

    Felder, Valerie

    2013-01-01

    Micceri (1989) examined the distributional characteristics of 440 large-sample achievement and psychometric measures. All the distributions were found to be nonnormal at alpha = 0.01. Micceri indicated three factors that might contribute to a non-Gaussian error distribution in the population. The first factor is subpopulations within a target…

  14. Mental Representation of Circuit Diagrams: Individual Differences in Procedural Knowledge.

    DTIC Science & Technology

    1983-12-01

    operation. One may know, for example, that a transformer serves to change the voltage of an AC supply, that a particular combination of transistors acts as a...and error measures with respect to overall performance. Even if a large sample could provide statistically significant differences between skill

  15. Counting at low concentrations: the statistical challenges of verifying ballast water discharge standards

    USGS Publications Warehouse

    Frazier, Melanie; Miller, A. Whitman; Lee, Henry; Reusser, Deborah A.

    2013-01-01

    Discharge from the ballast tanks of ships is one of the primary vectors of nonindigenous species in marine environments. To mitigate this environmental and economic threat, international, national, and state entities are establishing regulations to limit the concentration of living organisms that may be discharged from the ballast tanks of ships. The proposed discharge standards have ranged from zero detectable organisms to 3. If standard sampling methods are used, verifying whether ballast discharge complies with these stringent standards will be challenging due to the inherent stochasticity of sampling. Furthermore, at low concentrations, very large volumes of water must be sampled to find enough organisms to accurately estimate concentration. Despite these challenges, adequate sampling protocols comprise a critical aspect of establishing standards because they help define the actual risk level associated with a standard. A standard that appears very stringent may be effectively lax if it is paired with an inadequate sampling protocol. We describe some of the statistical issues associated with sampling at low concentrations to help regulators understand the uncertainties of sampling as well as to inform the development of sampling protocols that ensure discharge standards are adequately implemented.
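
    The volume problem follows from Poisson counting statistics: if organisms are well mixed at concentration c, a sampled volume V contains none of them with probability exp(-cV). A sketch, with the concentration and units chosen purely for illustration:

    ```python
    import math

    def detection_probability(conc, volume):
        """Probability of finding at least one organism in a sampled volume,
        for well-mixed discharge with mean concentration conc (Poisson)."""
        return 1.0 - math.exp(-conc * volume)

    def volume_for_confidence(conc, confidence=0.95):
        """Volume needed to detect organisms at concentration conc with the
        given probability: V = -ln(1 - confidence) / conc."""
        return -math.log(1.0 - confidence) / conc

    # At a hypothetical 0.01 organisms per cubic metre, roughly 300 m^3 must
    # be sampled for a 95% chance of seeing even one organism.
    print(round(volume_for_confidence(0.01), 1))
    ```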

  16. The HI Content of Galaxies as a Function of Local Density and Large-Scale Environment

    NASA Astrophysics Data System (ADS)

    Thoreen, Henry; Cantwell, Kelly; Maloney, Erin; Cane, Thomas; Brough Morris, Theodore; Flory, Oscar; Raskin, Mark; Crone-Odekon, Mary; ALFALFA Team

    2017-01-01

    We examine the HI content of galaxies as a function of environment, based on a catalogue of 41527 galaxies that are part of the 70% complete Arecibo Legacy Fast-ALFA (ALFALFA) survey. We use nearest-neighbor methods to characterize local environment, and a modified version of the algorithm developed for the Galaxy and Mass Assembly (GAMA) survey to classify large-scale environment as group, filament, tendril, or void. We compare the HI content in these environments using statistics that include both HI detections and the upper limits on detections from ALFALFA. The large size of the sample allows us to statistically compare the HI content in different environments for early-type as well as late-type galaxies. This work is supported by NSF grants AST-1211005 and AST-1637339, the Skidmore Faculty-Student Summer Research program, and the Schupf Scholars program.

  17. Weighted statistical parameters for irregularly sampled time series

    NASA Astrophysics Data System (ADS)

    Rimoldini, Lorenzo

    2014-01-01

    Unevenly spaced time series are common in astronomy because of the day-night cycle, weather conditions, dependence on the source position in the sky, allocated telescope time and corrupt measurements, for example, or inherent to the scanning law of satellites like Hipparcos and the forthcoming Gaia. Irregular sampling often causes clumps of measurements and gaps with no data which can severely disrupt the values of estimators. This paper aims at improving the accuracy of common statistical parameters when linear interpolation (in time or phase) can be considered an acceptable approximation of a deterministic signal. A pragmatic solution is formulated in terms of a simple weighting scheme, adapting to the sampling density and noise level, applicable to large data volumes at minimal computational cost. Tests on time series from the Hipparcos periodic catalogue led to significant improvements in the overall accuracy and precision of the estimators with respect to the unweighted counterparts and those weighted by inverse-squared uncertainties. Automated classification procedures employing statistical parameters weighted by the suggested scheme confirmed the benefits of the improved input attributes. The classification of eclipsing binaries, Mira, RR Lyrae, Delta Cephei and Alpha2 Canum Venaticorum stars employing exclusively weighted descriptive statistics achieved an overall accuracy of 92 per cent, about 6 per cent higher than with unweighted estimators.
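
    The paper's weighting scheme is not reproduced in the abstract; one simple density-adapted choice, assumed here for illustration, weights each point by half the span between its temporal neighbours, so clumped measurements share weight and isolated ones count more:

    ```python
    import numpy as np

    def gap_weights(t):
        """Trapezoidal weights for an irregularly sampled series: each point
        gets half the span between its neighbours, down-weighting clumps."""
        t = np.asarray(t, dtype=float)
        left = np.diff(t, prepend=t[0])
        right = np.diff(t, append=t[-1])
        return (left + right) / 2.0

    def weighted_mean(y, w):
        return np.sum(w * y) / np.sum(w)

    t = np.array([0.0, 0.1, 0.15, 0.2, 5.0, 10.0])  # a clump, two isolated points
    y = np.array([1.0, 1.1, 0.9, 1.0, 3.0, 2.8])
    print(weighted_mean(y, gap_weights(t)))          # compare with y.mean()
    ```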

  18. Non-statistical effects in bond fission reactions of 1,2-difluoroethane

    NASA Astrophysics Data System (ADS)

    Schranz, Harold W.; Raff, Lionel M.; Thompson, Donald L.

    1991-08-01

    A microcanonical, classical variational transition-state theory based on the use of the efficient microcanonical sampling (EMS) procedure is applied to simple bond fission in 1,2-difluoroethane. Comparison is made with results of trajectory calculations performed on the same global potential-energy surface. Agreement between the statistical theory and trajectory results for C-C, C-F and C-H bond fissions is poor, with differences as large as a factor of 125. Most importantly, at the lower energy studied, 6.0 eV, the statistical calculations predict considerably slower rates than those computed from trajectories. We conclude from these results that the statistical assumptions inherent in the transition-state theory method are not valid for 1,2-difluoroethane, in spite of the fact that the total intramolecular energy transfer rate out of C-H and C-C normal and local modes is large relative to the bond fission rates. The IVR rate is not globally rapid and the trajectories do not access all of the energetically available phase space uniformly on the timescale of the reactions.

  19. Proteomic Workflows for Biomarker Identification Using Mass Spectrometry — Technical and Statistical Considerations during Initial Discovery

    PubMed Central

    Orton, Dennis J.; Doucette, Alan A.

    2013-01-01

    Identification of biomarkers capable of differentiating between pathophysiological states of an individual is a laudable goal in the field of proteomics. Protein biomarker discovery generally employs high throughput sample characterization by mass spectrometry (MS), being capable of identifying and quantifying thousands of proteins per sample. While MS-based technologies have rapidly matured, the identification of truly informative biomarkers remains elusive, with only a handful of clinically applicable tests stemming from proteomic workflows. This underlying lack of progress is attributed in large part to erroneous experimental design, biased sample handling, as well as improper statistical analysis of the resulting data. This review will discuss in detail the importance of experimental design and provide some insight into the overall workflow required for biomarker identification experiments. Proper balance between the degree of biological vs. technical replication is required for confident biomarker identification. PMID:28250400

  20. Implementation of Dynamic Extensible Adaptive Locally Exchangeable Measures (IDEALEM) v 0.1

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sim, Alex; Lee, Dongeun; Wu, K. John

    2016-03-04

    Handling large streaming data is essential for various applications such as network traffic analysis, social networks, energy cost trends, and environment modeling. However, it is in general intractable to store, compute, search, and retrieve large streaming data. This software addresses a fundamental issue, which is to reduce the size of large streaming data and still obtain accurate statistical analysis. As an example, when a high-speed network such as a 100 Gbps network is monitored, the collected measurement data grow so rapidly that polynomial-time algorithms (e.g., Gaussian processes) become intractable. One possible solution to reduce the storage of vast amounts of measured data is to store a random sample, such as one out of every 1000 network packets. However, such static sampling methods (linear sampling) have drawbacks: (1) they do not scale to high-rate streaming data, and (2) they offer no guarantee of reflecting the underlying distribution. In this software, we implemented a dynamic sampling algorithm, based on recent work on relational dynamic Bayesian online locally exchangeable measures, that reduces the storage of data records on a large scale and still provides accurate analysis of large streaming data. The software can be used for both online and offline data records.
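
    The record does not spell out the sampling algorithm. The following rough sketch captures the general idea, storing a new block of streaming records only when its distribution differs from what is already stored; the two-sample KS acceptance rule is an assumption standing in for the exchangeability test:

    ```python
    import numpy as np
    from scipy.stats import ks_2samp

    def reduce_stream(blocks, alpha=0.05):
        """Keep a block of records only if a two-sample KS test says it is
        distributionally distinct from the most recently stored block."""
        stored = [blocks[0]]
        for block in blocks[1:]:
            if ks_2samp(stored[-1], block).pvalue < alpha:
                stored.append(block)      # new behaviour observed: store it
            # otherwise treat the block as exchangeable with what we have
        return stored

    rng = np.random.default_rng(0)
    blocks = [rng.normal(0, 1, 256) for _ in range(50)] + [rng.normal(3, 1, 256)]
    print(len(reduce_stream(blocks)))     # most steady-state blocks are discarded
    ```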

  1. Testing homogeneity of proportion ratios for stratified correlated bilateral data in two-arm randomized clinical trials.

    PubMed

    Pei, Yanbo; Tian, Guo-Liang; Tang, Man-Lai

    2014-11-10

    Stratified data analysis is an important research topic in many biomedical studies and clinical trials. In this article, we develop five test statistics for testing the homogeneity of proportion ratios for stratified correlated bilateral binary data under an equal-correlation model assumption. Bootstrap procedures based on these test statistics are also considered. To evaluate the performance of these statistics and procedures, we conduct Monte Carlo simulations to study their empirical sizes and powers under various scenarios. Our results suggest that the procedure based on the score statistic generally performs well and is highly recommended. When the sample size is large, procedures based on the commonly used weighted least-squares estimate and on the logarithmic transformation with the Mantel-Haenszel estimate are recommended, as they do not involve computing maximum likelihood estimates that require iterative algorithms. We also derive approximate sample size formulas based on the recommended test procedures. Finally, we apply the proposed methods to analyze a multi-center randomized clinical trial of scleroderma patients. Copyright © 2014 John Wiley & Sons, Ltd.
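
    The five test statistics and their bootstrap versions are not given in the abstract. A parametric-bootstrap sketch of a homogeneity test for stratified proportion ratios, assuming independent binomial sampling and ignoring the within-patient (bilateral) correlation the paper actually models, might look like this:

    ```python
    import numpy as np

    def log_ratio_Q(x1, n1, x2, n2):
        """Inverse-variance-weighted homogeneity statistic for the per-stratum
        log proportion ratios log(p1/p2); a continuity correction guards
        against zero counts."""
        p1, p2 = (x1 + 0.5) / (n1 + 1), (x2 + 0.5) / (n2 + 1)
        lr = np.log(p1 / p2)
        var = (1 - p1) / (n1 * p1) + (1 - p2) / (n2 * p2)   # delta-method variance
        w = 1.0 / var
        lr_bar = np.sum(w * lr) / np.sum(w)
        return np.sum(w * (lr - lr_bar) ** 2)

    def bootstrap_p(x1, n1, x2, n2, B=2000, seed=1):
        """Parametric bootstrap under H0: a common ratio across strata."""
        rng = np.random.default_rng(seed)
        q_obs = log_ratio_Q(x1, n1, x2, n2)
        p2 = x2 / n2
        delta = np.exp(np.mean(np.log((x1 / n1) / p2)))     # crude common-ratio estimate
        p1 = np.clip(delta * p2, 1e-6, 1 - 1e-6)
        q_sim = np.array([log_ratio_Q(rng.binomial(n1, p1), n1,
                                      rng.binomial(n2, p2), n2) for _ in range(B)])
        return np.mean(q_sim >= q_obs)

    x1, n1 = np.array([30, 45, 28]), np.array([100, 120, 90])
    x2, n2 = np.array([20, 35, 30]), np.array([100, 120, 90])
    print(bootstrap_p(x1, n1, x2, n2))
    ```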

  2. Phenotypic Association Analyses With Copy Number Variation in Recurrent Depressive Disorder.

    PubMed

    Rucker, James J H; Tansey, Katherine E; Rivera, Margarita; Pinto, Dalila; Cohen-Woods, Sarah; Uher, Rudolf; Aitchison, Katherine J; Craddock, Nick; Owen, Michael J; Jones, Lisa; Jones, Ian; Korszun, Ania; Barnes, Michael R; Preisig, Martin; Mors, Ole; Maier, Wolfgang; Rice, John; Rietschel, Marcella; Holsboer, Florian; Farmer, Anne E; Craig, Ian W; Scherer, Stephen W; McGuffin, Peter; Breen, Gerome

    2016-02-15

    Defining the molecular genomic basis of the likelihood of developing depressive disorder is a considerable challenge. We previously associated rare, exonic deletion copy number variants (CNV) with recurrent depressive disorder (RDD). Sex chromosome abnormalities also have been observed to co-occur with RDD. In this reanalysis of our RDD dataset (N = 3106 cases; 459 screened control samples and 2699 population control samples), we further investigated the role of larger CNVs and chromosomal abnormalities in RDD and performed association analyses with clinical data derived from this dataset. We found an enrichment of Turner's syndrome among cases of depression compared with the frequency observed in a large population sample (N = 34,910) of live-born infants collected in Denmark (two-sided p = .023, odds ratio = 7.76 [95% confidence interval = 1.79-33.6]), a case of diploid/triploid mosaicism, and several cases of uniparental isodisomy. In contrast to our previous analysis, large deletion CNVs were no more frequent in cases than control samples, although deletion CNVs in cases contained more genes than control samples (two-sided p = .0002). After statistical correction for multiple comparisons, our data do not support a substantial role for CNVs in RDD, although (as has been observed in similar samples) occasional cases may harbor large variants with etiological significance. Genetic pleiotropy and sample heterogeneity suggest that very large sample sizes are required to study conclusively the role of genetic variation in mood disorders. Copyright © 2016 Society of Biological Psychiatry. Published by Elsevier Inc. All rights reserved.

  3. Remote sensing data with the conditional latin hypercube sampling and geostatistical approach to delineate landscape changes induced by large chronological physical disturbances.

    PubMed

    Lin, Yu-Pin; Chu, Hone-Jay; Wang, Cheng-Long; Yu, Hsiao-Hsuan; Wang, Yung-Chieh

    2009-01-01

    This study applies variogram analyses of normalized difference vegetation index (NDVI) images derived from SPOT HRV images obtained before and after the Chi-Chi earthquake in the Chenyulan watershed, Taiwan, as well as images after four large typhoons, to delineate the spatial patterns, spatial structures and spatial variability of landscapes caused by these large disturbances. The conditional Latin hypercube sampling approach was applied to select samples from multiple NDVI images. Kriging and sequential Gaussian simulation with sufficient samples were then used to generate maps of NDVI images. The variography of the NDVI images demonstrates that the spatial patterns of disturbed landscapes were successfully delineated by variogram analysis in the study areas. The high-magnitude Chi-Chi earthquake created spatial landscape variations in the study area. After the earthquake, the cumulative impacts of typhoons on landscape patterns depended on the magnitudes and paths of the typhoons, but were not always evident in the spatiotemporal variability of landscapes in the study area. The statistics and spatial structures of multiple NDVI images were captured by 3,000 samples from the 62,500 grid cells of the NDVI images. Kriging and sequential Gaussian simulation with the 3,000 samples effectively reproduced the spatial patterns of the NDVI images. Overall, the proposed approach, which integrates conditional Latin hypercube sampling, variograms, kriging and sequential Gaussian simulation for remotely sensed images, efficiently monitors, samples and maps the effects of large chronological disturbances on the spatial characteristics of landscape changes, including spatial variability and heterogeneity.
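
    As a hedged illustration of the sampling step, the sketch below draws a Latin hypercube in a stand-in NDVI feature space and then conditions it on the data by greedy nearest-neighbour matching; the published cLHS method uses a simulated-annealing search instead, so this is an approximation of the idea only:

    ```python
    import numpy as np
    from scipy.stats import qmc

    rng = np.random.default_rng(0)
    ndvi = rng.random((10_000, 4))   # stand-in for normalised multi-date NDVI features
    n_samples = 500                  # the study drew 3,000 of 62,500 grid cells

    # Draw a Latin hypercube over the unit feature space...
    lhs = qmc.LatinHypercube(d=ndvi.shape[1], seed=0).random(n=n_samples)

    # ...then "condition" on the data: take, for each target point, the
    # nearest actual grid cell, each cell used at most once (greedy matching).
    available = np.ones(len(ndvi), dtype=bool)
    chosen = []
    for t in lhs:
        d2 = np.where(available, np.sum((ndvi - t) ** 2, axis=1), np.inf)
        idx = int(np.argmin(d2))
        chosen.append(idx)
        available[idx] = False
    samples = ndvi[chosen]           # roughly preserves LHS marginal coverage
    ```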

  4. Statistical Analysis of Large Scale Structure by the Discrete Wavelet Transform

    NASA Astrophysics Data System (ADS)

    Pando, Jesus

    1997-10-01

    The discrete wavelet transform (DWT) is developed as a general statistical tool for the study of large-scale structure (LSS) in astrophysics. The DWT is used in all aspects of structure identification, including cluster analysis, spectrum and two-point correlation studies, scale-scale correlation analysis, and measuring deviations from Gaussian behavior. The techniques developed are demonstrated on 'academic' signals, on simulated models of the Lyman-α (Lyα) forests, and on observational data of the Lyα forests. This technique can detect clustering in the Lyα clouds where traditional techniques such as the two-point correlation function have failed. The position and strength of these clusters in both real and simulated data are determined, and it is shown that clusters exist on scales as large as at least 20 h^{-1} Mpc at significance levels of 2-4 σ. Furthermore, it is found that the strength distribution of the clusters can be used to distinguish between real data and simulated samples even where other traditional methods have failed to detect differences. Second, a method for measuring the power spectrum of a density field using the DWT is developed. All common features determined by the usual Fourier power spectrum can be calculated by the DWT. These features, such as the index of a power law or typical scales, can be detected even when the samples are geometrically complex, the samples are incomplete, or the mean density on larger scales is not known (the infrared uncertainty). Using this method, the spectra of Lyα forests in both simulated and real samples are calculated. Third, a method for measuring hierarchical clustering is introduced. Because hierarchical evolution is characterized by a set of rules for how larger dark matter halos are formed by the merging of smaller halos, scale-scale correlations of the density field should be one of the most sensitive quantities for determining the merging history. We show that these correlations can be completely determined by the correlations between discrete wavelet coefficients on adjacent scales and at nearly the same spatial position, C_{j,j+1}^{2·2}. Scale-scale correlations on two samples of the QSO Lyα forest absorption spectra are computed. Lastly, higher-order statistics are developed to detect deviations from Gaussian behavior. These higher-order statistics are necessary to fully characterize the Lyα forests because the usual second-order statistics, such as the two-point correlation function or power spectrum, give inconclusive results. It is shown how this technique takes advantage of the locality of the DWT to circumvent the central limit theorem. A non-Gaussian spectrum is defined, and this spectrum reveals not only the magnitude but also the scales of non-Gaussianity. When applied to simulated and observational samples of the Lyα clouds, it is found that different popular models of structure formation have different spectra, while two independent observational data sets have the same spectrum. Moreover, the non-Gaussian spectra of real data sets are significantly different from the spectra of various possible random samples. (Abstract shortened by UMI.)
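
    A minimal sketch of the scale-scale correlation idea, using a hand-rolled Haar transform; the normalisation and alignment conventions here are assumptions and differ from the thesis's C_{j,j+1}^{2·2} estimator:

    ```python
    import numpy as np

    def haar_details(x):
        """Full Haar DWT of a length-2^k signal; returns detail coefficients
        per scale, finest scale first."""
        x = np.asarray(x, dtype=float)
        details = []
        while x.size > 1:
            a = (x[0::2] + x[1::2]) / np.sqrt(2.0)   # approximations
            d = (x[0::2] - x[1::2]) / np.sqrt(2.0)   # details at this scale
            details.append(d)
            x = a
        return details

    def scale_scale_corr(x, min_coeffs=8):
        """Correlate squared wavelet coefficients on adjacent scales at
        (nearly) the same spatial position."""
        det = haar_details(x)
        corrs = []
        for dj, dj1 in zip(det[:-1], det[1:]):
            if dj1.size < min_coeffs:                # too few points to correlate
                break
            a = dj.reshape(-1, 2).mean(axis=1) ** 2  # align fine scale to coarse
            corrs.append(np.corrcoef(a, dj1 ** 2)[0, 1])
        return corrs

    x = np.random.default_rng(0).standard_normal(1024)
    print(scale_scale_corr(x))   # near zero for an uncorrelated Gaussian signal
    ```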

  5. Assessment of levels of bacterial contamination of large wild game meat in Europe.

    PubMed

    Membré, Jeanne-Marie; Laroche, Michel; Magras, Catherine

    2011-08-01

    The variations in prevalence and levels of pathogens and fecal contamination indicators in large wild game meat were studied to assess their potential impact on consumers. This analysis was based on hazard analysis, data generation and statistical analysis. A total of 2919 meat samples from three species (red deer, roe deer, wild boar) were collected at French game meat traders' facilities using two sampling protocols. Information was gathered on the types of meat cuts (forequarter or haunch; first sampling protocol) or type of retail-ready meat (stewing meat or roasting meat; second protocol), and also on the meat storage conditions (frozen or chilled), country of origin (eight countries) and shooting season (autumn, winter, spring). The samples were analyzed in both protocols for detection and enumeration of Escherichia coli, coagulase+staphylococci and Clostridium perfringens. In addition, detection and enumeration of thermotolerant coliforms and Listeria monocytogenes were performed for samples collected in the first and second protocols, respectively. The levels of bacterial contamination of the raw meat were determined by performing statistical analysis involving probabilistic techniques and Bayesian inference. C. perfringens was found in the highest numbers for the three indicators of microbial quality, hygiene and good handling, and L. monocytogenes in the lowest. Differences in contamination levels between game species and between meats distributed as chilled or frozen products were not significant. These results might be included in quantitative exposure assessments. Copyright © 2011 Elsevier Ltd. All rights reserved.

  6. A Fast Framework for Abrupt Change Detection Based on Binary Search Trees and Kolmogorov Statistic

    PubMed Central

    Qi, Jin-Peng; Qi, Jie; Zhang, Qing

    2016-01-01

    Change-point (CP) detection has attracted considerable attention in the fields of data mining and statistics; it is of real interest to discuss how to quickly and efficiently detect abrupt change in large-scale bioelectric signals. Currently, most existing methods, such as the Kolmogorov-Smirnov (KS) statistic, are time-consuming, especially for large-scale datasets. In this paper, we propose a fast framework for abrupt change detection based on binary search trees (BSTs) and a modified KS statistic, named BSTKS (binary search trees and Kolmogorov statistic). In this method, first, two binary search trees, termed BSTcA and BSTcD, are constructed by a multilevel Haar wavelet transform (HWT); second, three search criteria are introduced in terms of the statistic and variance fluctuations in the diagnosed time series; last, an optimal search path is detected from the root to the leaf nodes of the two BSTs. Studies on both synthetic time series samples and real electroencephalograph (EEG) recordings indicate that the proposed BSTKS can detect abrupt change more quickly and efficiently than the KS, t-statistic (t), and singular-spectrum analysis (SSA) methods, with the shortest computation time, the highest hit rate, the smallest error, and the highest accuracy of the four methods. This study suggests that the proposed BSTKS is very helpful for inspecting all kinds of bioelectric time series signals. PMID:27413364
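
    The BST construction itself is not reproduced in this record. The sketch below shows the quadratic-time KS baseline that BSTKS is designed to accelerate, scanning every admissible split point of a series:

    ```python
    import numpy as np
    from scipy.stats import ks_2samp

    def ks_changepoint(x, min_seg=30):
        """Brute-force scan: return the split index and KS statistic where
        the two segments of the series differ most."""
        best_k, best_stat = None, 0.0
        for k in range(min_seg, len(x) - min_seg):
            stat = ks_2samp(x[:k], x[k:]).statistic
            if stat > best_stat:
                best_k, best_stat = k, stat
        return best_k, best_stat

    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(0, 1, 500), rng.normal(1.5, 1, 500)])
    print(ks_changepoint(x))   # split index near 500
    ```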

  8. Quantifying in situ Zooplankton Movement and Trophic Impacts on Thin Layers in East Sound, Washington

    DTIC Science & Technology

    2006-09-30

    strength of the combination is that the tracking system quantifies swimming behaviors of protists in natural seawater samples with large numbers of motile ... Sound was to link observations of thin layers to behavioral analysis of protists resident above, within, and below these features. Analysis of our ... cells and diatom chains. We are not yet able to make statistical statements about swimming characteristics of the motile protists in our video samples

  9. Speckle in the diffraction patterns of Hendricks-Teller and icosahedral glass models

    NASA Technical Reports Server (NTRS)

    Garg, Anupam; Levine, Dov

    1988-01-01

    It is shown that the X-ray diffraction patterns from the Hendricks-Teller model for layered systems and the icosahedral glass models for the icosahedral phases show large fluctuations between nearby scattering wave vectors and from sample to sample, that are quite analogous to laser speckle. The statistics of these fluctuations are studied analytically for the first model and via computer simulations for the second. The observability of these effects is discussed briefly.

  10. [The research protocol III. Study population].

    PubMed

    Arias-Gómez, Jesús; Villasís-Keever, Miguel Ángel; Miranda-Novales, María Guadalupe

    2016-01-01

    The study population is defined as a set of cases that is determined, limited, and accessible, and that will constitute the subjects for the selection of the sample; it must fulfill several characteristics and distinct criteria. The objectives of this manuscript are to specify each of the elements required to select the participants of a research project during the elaboration of the protocol, including the concepts of study population, sample, selection criteria and sampling methods. After delineating the study population, the researcher must specify the criteria that each participant has to meet. The criteria that capture these specific characteristics are termed selection or eligibility criteria: inclusion, exclusion and elimination criteria, which together delineate the eligible population. Sampling methods are divided into two large groups: (1) probabilistic or random sampling and (2) non-probabilistic sampling. The difference lies in the use of statistical methods to select the subjects. In every study, it is necessary to establish at the outset the specific number of participants to be included to achieve the objectives of the study. This number is the sample size, and it can be calculated or estimated with mathematical formulas and statistical software.
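
    Where a formula-based estimate is appropriate, the classic sample size for estimating a proportion within a margin d is n = z^2 p(1-p)/d^2. A small sketch (the default values are illustrative):

    ```python
    import math
    from scipy.stats import norm

    def n_for_proportion(p=0.5, d=0.05, conf=0.95):
        """Minimum n to estimate a proportion p within +/- d at the given
        confidence level: n = z^2 * p * (1 - p) / d^2, rounded up."""
        z = norm.ppf(1 - (1 - conf) / 2)
        return math.ceil(z ** 2 * p * (1 - p) / d ** 2)

    print(n_for_proportion())   # 385, the textbook answer for p=0.5, d=0.05
    ```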

  11. Robust regression for large-scale neuroimaging studies.

    PubMed

    Fritsch, Virgile; Da Mota, Benoit; Loth, Eva; Varoquaux, Gaël; Banaschewski, Tobias; Barker, Gareth J; Bokde, Arun L W; Brühl, Rüdiger; Butzek, Brigitte; Conrod, Patricia; Flor, Herta; Garavan, Hugh; Lemaitre, Hervé; Mann, Karl; Nees, Frauke; Paus, Tomas; Schad, Daniel J; Schümann, Gunter; Frouin, Vincent; Poline, Jean-Baptiste; Thirion, Bertrand

    2015-05-01

    Multi-subject datasets used in neuroimaging group studies have a complex structure, as they exhibit non-stationary statistical properties across regions and display various artifacts. While studies with small sample sizes can rarely be shown to deviate from standard hypotheses (such as the normality of the residuals) due to the poor sensitivity of normality tests with low degrees of freedom, large-scale studies (e.g. >100 subjects) exhibit more obvious deviations from these hypotheses and call for more refined models for statistical inference. Here, we demonstrate the benefits of robust regression as a tool for analyzing large neuroimaging cohorts. First, we use an analytic test based on robust parameter estimates; based on simulations, this procedure is shown to provide an accurate statistical control without resorting to permutations. Second, we show that robust regression yields more detections than standard algorithms using as an example an imaging genetics study with 392 subjects. Third, we show that robust regression can avoid false positives in a large-scale analysis of brain-behavior relationships with over 1500 subjects. Finally we embed robust regression in the Randomized Parcellation Based Inference (RPBI) method and demonstrate that this combination further improves the sensitivity of tests carried out across the whole brain. Altogether, our results show that robust procedures provide important advantages in large-scale neuroimaging group studies. Copyright © 2015 Elsevier Inc. All rights reserved.
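
    As a sketch of the core tool (not the authors' full pipeline, which adds an analytic test and the RPBI combination), a Huber M-estimator fit with statsmodels shows how a few artifact-like outliers distort OLS but barely move the robust fit:

    ```python
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 400                                  # "large-cohort" setting
    X = sm.add_constant(rng.standard_normal((n, 1)))
    y = X @ np.array([1.0, 0.5]) + rng.standard_normal(n)
    y[:10] += 8                              # a few artifact-like outliers

    ols = sm.OLS(y, X).fit()
    rlm = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()  # Huber M-estimator
    print(ols.params, rlm.params)            # the RLM slope is far less distorted
    ```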

  12. Software engineering the mixed model for genome-wide association studies on large samples.

    PubMed

    Zhang, Zhiwu; Buckler, Edward S; Casstevens, Terry M; Bradbury, Peter J

    2009-11-01

    Mixed models improve the ability to detect phenotype-genotype associations in the presence of population stratification and multiple levels of relatedness in genome-wide association studies (GWAS), but for large data sets the resource consumption becomes impractical. At the same time, the sample size and number of markers used for GWAS is increasing dramatically, resulting in greater statistical power to detect those associations. The use of mixed models with increasingly large data sets depends on the availability of software for analyzing those models. While multiple software packages implement the mixed model method, no single package provides the best combination of fast computation, ability to handle large samples, flexible modeling and ease of use. Key elements of association analysis with mixed models are reviewed, including modeling phenotype-genotype associations using mixed models, population stratification, kinship and its estimation, variance component estimation, use of best linear unbiased predictors or residuals in place of raw phenotype, improving efficiency and software-user interaction. The available software packages are evaluated, and suggestions made for future software development.
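
    One ingredient such packages share is a kinship (genomic relationship) matrix plugged into the mixed model y = Xb + Zu + e with u ~ N(0, sigma_g^2 K). A common VanRaden-style estimator is sketched below as an illustration, not as any specific package's code:

    ```python
    import numpy as np

    def kinship(G):
        """VanRaden-style genomic relationship matrix from an
        (individuals x markers) genotype matrix coded 0/1/2."""
        p = G.mean(axis=0) / 2.0                 # sample allele frequencies
        Z = G - 2.0 * p                          # centre each marker
        return (Z @ Z.T) / (2.0 * np.sum(p * (1.0 - p)))

    rng = np.random.default_rng(0)
    G = rng.integers(0, 3, size=(500, 10_000)).astype(float)
    K = kinship(G)   # plugs into y = Xb + Zu + e with u ~ N(0, sigma_g^2 * K)
    ```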

  13. Large-Scale Optimization for Bayesian Inference in Complex Systems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Willcox, Karen; Marzouk, Youssef

    2013-11-12

    The SAGUARO (Scalable Algorithms for Groundwater Uncertainty Analysis and Robust Optimization) Project focused on the development of scalable numerical algorithms for large-scale Bayesian inversion in complex systems that capitalize on advances in large-scale simulation-based optimization and inversion methods. The project was a collaborative effort among MIT, the University of Texas at Austin, Georgia Institute of Technology, and Sandia National Laboratories. The research was directed in three complementary areas: efficient approximations of the Hessian operator, reductions in the complexity of forward simulations via stochastic spectral approximations and model reduction, and employing large-scale optimization concepts to accelerate sampling. The MIT-Sandia component of the SAGUARO Project addressed the intractability of conventional sampling methods for large-scale statistical inverse problems by devising reduced-order models that are faithful to the full-order model over a wide range of parameter values; sampling then employs the reduced model rather than the full model, resulting in very large computational savings. Results indicate little effect on the computed posterior distribution. On the other hand, in the Texas-Georgia Tech component of the project, we retain the full-order model but exploit inverse problem structure (adjoint-based gradients and partial Hessian information of the parameter-to-observation map) to implicitly extract lower-dimensional information on the posterior distribution; this greatly speeds up sampling methods, so that fewer sampling points are needed. We can think of these two approaches as "reduce then sample" and "sample then reduce." In fact, these two approaches are complementary and can be used in conjunction with each other. Moreover, they both exploit deterministic inverse problem structure, in the form of adjoint-based gradient and Hessian information of the underlying parameter-to-observation map, to achieve their speedups.

  14. Final Report: Large-Scale Optimization for Bayesian Inference in Complex Systems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ghattas, Omar

    2013-10-15

    The SAGUARO (Scalable Algorithms for Groundwater Uncertainty Analysis and Robust Optimization) Project focuses on the development of scalable numerical algorithms for large-scale Bayesian inversion in complex systems that capitalize on advances in large-scale simulation-based optimization and inversion methods. Our research is directed in three complementary areas: efficient approximations of the Hessian operator, reductions in the complexity of forward simulations via stochastic spectral approximations and model reduction, and employing large-scale optimization concepts to accelerate sampling. Our efforts are integrated in the context of a challenging testbed problem that considers subsurface reacting flow and transport. The MIT component of the SAGUARO Project addresses the intractability of conventional sampling methods for large-scale statistical inverse problems by devising reduced-order models that are faithful to the full-order model over a wide range of parameter values; sampling then employs the reduced model rather than the full model, resulting in very large computational savings. Results indicate little effect on the computed posterior distribution. On the other hand, in the Texas-Georgia Tech component of the project, we retain the full-order model but exploit inverse problem structure (adjoint-based gradients and partial Hessian information of the parameter-to-observation map) to implicitly extract lower-dimensional information on the posterior distribution; this greatly speeds up sampling methods, so that fewer sampling points are needed. We can think of these two approaches as "reduce then sample" and "sample then reduce." In fact, these two approaches are complementary and can be used in conjunction with each other. Moreover, they both exploit deterministic inverse problem structure, in the form of adjoint-based gradient and Hessian information of the underlying parameter-to-observation map, to achieve their speedups.

  15. Open star clusters and Galactic structure

    NASA Astrophysics Data System (ADS)

    Joshi, Yogesh C.

    2018-04-01

    In order to understand Galactic structure, we perform a statistical analysis of the distribution of various cluster parameters based on an almost complete sample of the Galactic open clusters available to date. The geometrical and physical characteristics of a large number of open clusters given in the MWSC catalogue are used to study the spatial distribution of clusters in the Galaxy and to determine the scale height, solar offset, local mass density and distribution of reddening material in the solar neighbourhood. We also explore the mass-radius and mass-age relations in Galactic open star clusters. We find that the estimated parameters of the Galactic disk are largely influenced by the choice of cluster sample.

  16. Inferring Population Size History from Large Samples of Genome-Wide Molecular Data - An Approximate Bayesian Computation Approach

    PubMed Central

    Boitard, Simon; Rodríguez, Willy; Jay, Flora; Mona, Stefano; Austerlitz, Frédéric

    2016-01-01

    Inferring the ancestral dynamics of effective population size is a long-standing question in population genetics, which can now be tackled much more accurately thanks to the massive genomic data available in many species. Several promising methods that take advantage of whole-genome sequences have been recently developed in this context. However, they can only be applied to rather small samples, which limits their ability to estimate recent population size history. Besides, they can be very sensitive to sequencing or phasing errors. Here we introduce a new approximate Bayesian computation approach named PopSizeABC that allows estimating the evolution of the effective population size through time, using a large sample of complete genomes. This sample is summarized using the folded allele frequency spectrum and the average zygotic linkage disequilibrium at different bins of physical distance, two classes of statistics that are widely used in population genetics and can be easily computed from unphased and unpolarized SNP data. Our approach provides accurate estimations of past population sizes, from the very first generations before present back to the expected time to the most recent common ancestor of the sample, as shown by simulations under a wide range of demographic scenarios. When applied to samples of 15 or 25 complete genomes in four cattle breeds (Angus, Fleckvieh, Holstein and Jersey), PopSizeABC revealed a series of population declines, related to historical events such as domestication or modern breed creation. We further highlight that our approach is robust to sequencing errors, provided summary statistics are computed from SNPs with common alleles. PMID:26943927
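
    PopSizeABC's coalescent simulator and AFS/LD summaries are not reproduced here. The skeleton below shows plain rejection ABC with toy stand-ins for both, to make the accept-the-closest-simulations logic concrete:

    ```python
    import numpy as np

    def abc_rejection(obs_stats, simulate, prior_draw, n_sims=10_000, keep=100):
        """Plain rejection ABC: keep the parameter draws whose simulated
        summary statistics land closest to the observed ones."""
        thetas = np.array([prior_draw() for _ in range(n_sims)])
        stats = np.array([simulate(t) for t in thetas])
        scale = stats.std(axis=0) + 1e-12           # put summaries on a common scale
        dist = np.linalg.norm((stats - obs_stats) / scale, axis=1)
        return thetas[np.argsort(dist)[:keep]]      # approximate posterior sample

    # Toy stand-ins for the coalescent simulator and the AFS/LD summaries:
    rng = np.random.default_rng(0)
    prior_draw = lambda: rng.uniform(100, 10_000)   # effective population size N
    simulate = lambda N: np.array([rng.normal(np.log10(N), 0.1),
                                   rng.normal(1 / np.sqrt(N), 0.005)])
    posterior = abc_rejection(simulate(1_000), simulate, prior_draw)
    print(posterior.mean())                         # sits roughly near 1,000
    ```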

  17. Fast and accurate imputation of summary statistics enhances evidence of functional enrichment

    PubMed Central

    Pasaniuc, Bogdan; Zaitlen, Noah; Shi, Huwenbo; Bhatia, Gaurav; Gusev, Alexander; Pickrell, Joseph; Hirschhorn, Joel; Strachan, David P.; Patterson, Nick; Price, Alkes L.

    2014-01-01

    Motivation: Imputation using external reference panels (e.g. 1000 Genomes) is a widely used approach for increasing power in genome-wide association studies and meta-analysis. Existing hidden Markov models (HMM)-based imputation approaches require individual-level genotypes. Here, we develop a new method for Gaussian imputation from summary association statistics, a type of data that is becoming widely available. Results: In simulations using 1000 Genomes (1000G) data, this method recovers 84% (54%) of the effective sample size for common (>5%) and low-frequency (1–5%) variants [increasing to 87% (60%) when summary linkage disequilibrium information is available from target samples] versus the gold standard of 89% (67%) for HMM-based imputation, which cannot be applied to summary statistics. Our approach accounts for the limited sample size of the reference panel, a crucial step to eliminate false-positive associations, and it is computationally very fast. As an empirical demonstration, we apply our method to seven case–control phenotypes from the Wellcome Trust Case Control Consortium (WTCCC) data and a study of height in the British 1958 birth cohort (1958BC). Gaussian imputation from summary statistics recovers 95% (105%) of the effective sample size (as quantified by the ratio of χ2 association statistics) compared with HMM-based imputation from individual-level genotypes at the 227 (176) published single nucleotide polymorphisms (SNPs) in the WTCCC (1958BC height) data. In addition, for publicly available summary statistics from large meta-analyses of four lipid traits, we publicly release imputed summary statistics at 1000G SNPs, which could not have been obtained using previously published methods, and demonstrate their accuracy by masking subsets of the data. We show that 1000G imputation using our approach increases the magnitude and statistical evidence of enrichment at genic versus non-genic loci for these traits, as compared with an analysis without 1000G imputation. Thus, imputation of summary statistics will be a valuable tool in future functional enrichment analyses. Availability and implementation: Publicly available software package available at http://bogdan.bioinformatics.ucla.edu/software/. Contact: bpasaniuc@mednet.ucla.edu or aprice@hsph.harvard.edu Supplementary information: Supplementary materials are available at Bioinformatics online. PMID:24990607
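
    The core of Gaussian imputation from summary statistics is a conditional-normal formula, z_t = R_to (R_oo + lambda*I)^{-1} z_o, where R is reference-panel LD and the ridge term accounts for the panel's finite size. A minimal numpy sketch (the lambda value is illustrative):

    ```python
    import numpy as np

    def impute_z(z_obs, R_oo, R_to, lam=0.1):
        """z_t = R_to @ (R_oo + lam*I)^{-1} @ z_o: best linear prediction of
        untyped-SNP z-scores given typed ones and reference-panel LD."""
        w = np.linalg.solve(R_oo + lam * np.eye(R_oo.shape[0]), z_obs)
        return R_to @ w

    R_oo = np.array([[1.0, 0.6],    # LD among the observed SNPs
                     [0.6, 1.0]])
    R_to = np.array([[0.5, 0.4]])   # LD between the target SNP and observed SNPs
    print(impute_z(np.array([3.2, 2.8]), R_oo, R_to))
    ```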

  18. Irradiation-hyperthermia in canine hemangiopericytomas: large-animal model for therapeutic response.

    PubMed

    Richardson, R C; Anderson, V L; Voorhees, W D; Blevins, W E; Inskeep, T K; Janas, W; Shupe, R E; Babbs, C F

    1984-11-01

    Results of irradiation-hyperthermia treatment in 11 dogs with naturally occurring hemangiopericytoma were reported. Similarities of canine and human hemangiopericytomas were described. Orthovoltage X-irradiation followed by microwave-induced hyperthermia resulted in a 91% objective response rate. A statistical procedure was given to evaluate quantitatively the clinical behavior of locally invasive, nonmetastatic tumors in dogs that were undergoing therapy for control of local disease. The procedure used a small sample size and demonstrated distribution of the data on a scaled response as well as transformation of the data through classical parametric and nonparametric statistical methods. These statistical methods set confidence limits on the population mean and placed tolerance limits on a population percentage. Application of the statistical methods to human and animal clinical trials was apparent.

  19. Calculating p-values and their significances with the Energy Test for large datasets

    NASA Astrophysics Data System (ADS)

    Barter, W.; Burr, C.; Parkes, C.

    2018-04-01

    The energy test method is a multi-dimensional test of whether two samples are consistent with arising from the same underlying population, through the calculation of a single test statistic (called the T-value). The method has recently been used in particle physics to search for samples that differ due to CP violation. The generalised extreme value function has previously been used to describe the distribution of T-values under the null hypothesis that the two samples are drawn from the same underlying population. We show that, in a simple test case, the distribution is not sufficiently well described by the generalised extreme value function. We present a new method, where the distribution of T-values under the null hypothesis when comparing two large samples can be found by scaling the distribution found when comparing small samples drawn from the same population. This method can then be used to quickly calculate the p-values associated with the results of the test.
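
    A compact sketch of the T-value computation with a Gaussian weighting function psi(d) = exp(-d^2/2sigma^2); this uses one common normalisation, and conventions for the pair sums differ across papers:

    ```python
    import numpy as np
    from scipy.spatial.distance import cdist

    def energy_T(x, y, sigma=1.0):
        """Energy-test T-value between two multivariate samples."""
        psi = lambda d: np.exp(-d ** 2 / (2.0 * sigma ** 2))
        n, m = len(x), len(y)
        txx = (psi(cdist(x, x)).sum() - n) / (n * (n - 1))   # drop psi(0) self-pairs
        tyy = (psi(cdist(y, y)).sum() - m) / (m * (m - 1))
        txy = psi(cdist(x, y)).sum() / (n * m)
        return 0.5 * txx + 0.5 * tyy - txy

    rng = np.random.default_rng(0)
    x, y = rng.normal(0.0, 1, (500, 2)), rng.normal(0.2, 1, (500, 2))
    print(energy_T(x, y))   # larger T => samples less consistent with each other
    ```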

  20. Just the right age: well-clustered exposure ages from a global glacial 10Be compilation

    NASA Astrophysics Data System (ADS)

    Heyman, Jakob; Margold, Martin

    2017-04-01

    Cosmogenic exposure dating has been used extensively for defining glacial chronologies, both in ice sheet and alpine settings, and the global set of published ages today reaches well beyond 10,000 samples. Over the last few years, a number of important developments have improved the measurements (with well-defined AMS standards) and exposure age calculations (with updated data and methods for calculating production rates), in the best case enabling high precision dating of past glacial events. A remaining problem, however, is the fact that a large portion of all dated samples have been affected by prior and/or incomplete exposure, yielding erroneous exposure ages under the standard assumptions. One way to address this issue is to only use exposure ages that can be confidently considered as unaffected by prior/incomplete exposure, such as groups of samples with statistically identical ages. Here we use objective statistical criteria to identify groups of well-clustered exposure ages from the global glacial "expage" 10Be compilation. Out of ˜1700 groups with at least 3 individual samples ˜30% are well-clustered, increasing to ˜45% if allowing outlier rejection of a maximum of 1/3 of the samples (still requiring a minimum of 3 well-clustered ages). The dataset of well-clustered ages is heavily dominated by ages <30 ka, showing that well-defined cosmogenic chronologies primarily exist for the last glaciation. We observe a large-scale global synchronicity in the timing of the last deglaciation from ˜20 to 10 ka. There is also a general correlation between the timing of deglaciation and latitude (or size of the individual ice mass), with earlier deglaciation in lower latitudes and later deglaciation towards the poles. Grouping the data into regions and comparing with available paleoclimate data we can start to untangle regional differences in the last deglaciation and the climate events controlling the ice mass loss. The extensive dataset and the statistical analysis enables an unprecedented global view on the last deglaciation.
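
    The compilation's exact clustering criteria are not given in this record. A standard check of whether a group of ages is statistically identical, comparing the chi-square about the error-weighted mean with its critical value, might look like this:

    ```python
    import numpy as np
    from scipy.stats import chi2

    def well_clustered(ages, sigmas, alpha=0.05):
        """Are these exposure ages statistically identical? Compare the
        chi-square of deviations from the error-weighted mean age with the
        critical value for n-1 degrees of freedom."""
        ages, sigmas = np.asarray(ages, float), np.asarray(sigmas, float)
        w = 1.0 / sigmas ** 2
        mean = np.sum(w * ages) / np.sum(w)
        chisq = np.sum(((ages - mean) / sigmas) ** 2)
        return chisq < chi2.ppf(1 - alpha, df=len(ages) - 1)

    print(well_clustered([14.1, 14.6, 13.9], [0.5, 0.6, 0.5]))   # ka; True
    ```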

  1. The Kindness of Strangers Revisited: A Comparison of 24 US Cities

    ERIC Educational Resources Information Center

    Levine, Robert V.; Reysen, Stephen; Ganz, Ellen

    2008-01-01

    Three field studies compared helping behavior across a sample of 24 small, medium and large cities across the United States. The relationship of helping to statistics reflecting the demographic, social, and economic characteristics of these communities was then examined. The strongest predictors of city differences in helping were population size,…

  2. It's a Girl! Random Numbers, Simulations, and the Law of Large Numbers

    ERIC Educational Resources Information Center

    Goodwin, Chris; Ortiz, Enrique

    2015-01-01

    Modeling using mathematics and making inferences about mathematical situations are becoming more prevalent in most fields of study. Descriptive statistics cannot be used to generalize about a population or make predictions of what can occur. Instead, inference must be used. Simulation and sampling are essential in building a foundation for…
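
    A few lines of simulation make the law-of-large-numbers point concrete (coin-flip births with probability 0.5 are an illustrative assumption):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    births = rng.integers(0, 2, size=10_000)            # 1 = girl, 0 = boy
    running = births.cumsum() / np.arange(1, births.size + 1)
    print(running[[9, 99, 999, 9_999]])                 # proportion drifts toward 0.5
    ```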

  3. Comparing and combining process-based crop models and statistical models with some implications for climate change

    NASA Astrophysics Data System (ADS)

    Roberts, Michael J.; Braun, Noah O.; Sinclair, Thomas R.; Lobell, David B.; Schlenker, Wolfram

    2017-09-01

    We compare predictions of a simple process-based crop model (Soltani and Sinclair 2012), a simple statistical model (Schlenker and Roberts 2009), and a combination of both models to actual maize yields on a large, representative sample of farmer-managed fields in the Corn Belt region of the United States. After statistical post-model calibration, the process model (Simple Simulation Model, or SSM) predicts actual outcomes slightly better than the statistical model, but the combined model performs significantly better than either model. The SSM, statistical model and combined model all show similar relationships with precipitation, while the SSM better accounts for temporal patterns of precipitation, vapor pressure deficit and solar radiation. The statistical and combined models show a more negative impact associated with extreme heat for which the process model does not account. Due to the extreme heat effect, predicted impacts under uniform climate change scenarios are considerably more severe for the statistical and combined models than for the process-based model.

  4. First Detected Arrival of a Quantum Walker on an Infinite Line

    NASA Astrophysics Data System (ADS)

    Thiel, Felix; Barkai, Eli; Kessler, David A.

    2018-01-01

    The first detection of a quantum particle on a graph is shown to depend sensitively on the distance ξ between the detector and the initial location of the particle, and on the sampling time τ. Here, we use the recently introduced quantum renewal equation to investigate the statistics of first detection on an infinite line, using a tight-binding lattice Hamiltonian with nearest-neighbor hops. Universal features of the first detection probability are uncovered and simple limiting cases are analyzed. These include the large-ξ limit, the small-τ limit, and the power-law decay of the detection probability with the attempt number, over which quantum oscillations are superimposed. For large ξ the first detection probability assumes a scaling form, and when the sampling time equals the inverse of the energy bandwidth, nonanalytical behaviors arise, accompanied by a transition in the statistics. The maximum total detection probability is found to occur for τ close to this transition point. When the initial location of the particle is far from the detection node, we find that the total detection probability attains a finite, distance-independent value.

  5. The Structure of Diagnostic and Statistical Manual of Mental Disorders (4th Edition, Text Revision) Personality Disorder Symptoms in a Large National Sample

    PubMed Central

    Trull, Timothy J.; Vergés, Alvaro; Wood, Phillip K.; Jahng, Seungmin; Sher, Kenneth J.

    2013-01-01

    We examined the latent structure underlying the criteria for DSM–IV–TR (American Psychiatric Association, 2000, Diagnostic and statistical manual of mental disorders (4th ed., text revision). Washington, DC: Author.) personality disorders in a large nationally representative sample of U.S. adults. Personality disorder symptom data were collected using a structured diagnostic interview from approximately 35,000 adults assessed over two waves of data collection in the National Epidemiologic Survey on Alcohol and Related Conditions. Our analyses suggested that a seven-factor solution provided the best fit for the data, and these factors were marked primarily by one or at most two personality disorder criteria sets. A series of regression analyses that used external validators tapping Axis I psychopathology, treatment for mental health problems, functioning scores, interpersonal conflict, and suicidal ideation and behavior provided support for the seven-factor solution. We discuss these findings in the context of previous studies that have examined the structure underlying the personality disorder criteria as well as the current proposals for DSM-5 personality disorders. PMID:22506626

  6. Internal pilots for a class of linear mixed models with Gaussian and compound symmetric data

    PubMed Central

    Gurka, Matthew J.; Coffey, Christopher S.; Muller, Keith E.

    2015-01-01

    An internal pilot design uses interim sample size analysis, without interim data analysis, to adjust the final number of observations. The approach helps to choose a sample size sufficiently large (to achieve the statistical power desired) but not too large (which would waste money and time). We report on recent research in cerebral vascular tortuosity (curvature in three dimensions) that would benefit greatly from internal pilots due to uncertainty in the parameters of the covariance matrix used for study planning. Unfortunately, observations correlated across the four regions of the brain and small sample sizes preclude using existing methods. However, as in a wide range of medical imaging studies, tortuosity data have no missing or mistimed data, a factorial within-subject design, the same between-subject design for all responses, and a Gaussian distribution with compound symmetry. For such restricted models, we extend exact, small-sample univariate methods for internal pilots to linear mixed models with any between-subject design (not just two groups). Planning a new tortuosity study illustrates how the new methods help to avoid sample sizes that are too small or too large while still controlling the type I error rate. PMID:17318914

  7. LD Score Regression Distinguishes Confounding from Polygenicity in Genome-Wide Association Studies

    PubMed Central

    Bulik-Sullivan, Brendan K.; Loh, Po-Ru; Finucane, Hilary; Ripke, Stephan; Yang, Jian; Patterson, Nick; Daly, Mark J.; Price, Alkes L.; Neale, Benjamin M.

    2015-01-01

    Both polygenicity (i.e., many small genetic effects) and confounding biases, such as cryptic relatedness and population stratification, can yield an inflated distribution of test statistics in genome-wide association studies (GWAS). However, current methods cannot distinguish between inflation from true polygenic signal and bias. We have developed an approach, LD Score regression, that quantifies the contribution of each by examining the relationship between test statistics and linkage disequilibrium (LD). The LD Score regression intercept can be used to estimate a more powerful and accurate correction factor than genomic control. We find strong evidence that polygenicity accounts for the majority of test statistic inflation in many GWAS of large sample size. PMID:25642630
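
    As a hedged sketch of the regression at the heart of the method (real LD Score regression uses a weighted regression with block-jackknife standard errors, not a plain least-squares fit), one can recover the intercept and a heritability estimate from simulated statistics:

    ```python
    import numpy as np

    def ldsc_fit(chi2_stats, ldscores, N, M):
        """Fit E[chi2_j] = intercept + (N * h2 / M) * l_j. An intercept near 1
        suggests polygenicity rather than confounding drives the inflation."""
        slope, intercept = np.polyfit(ldscores, chi2_stats, 1)
        return intercept, slope * M / N   # (intercept, heritability estimate)

    rng = np.random.default_rng(0)
    M, N, h2 = 100_000, 50_000, 0.4
    l = rng.gamma(4.0, 25.0, M)                          # toy LD scores
    chi2_stats = 1.0 + (N * h2 / M) * l + rng.normal(0, 1.0, M)
    print(ldsc_fit(chi2_stats, l, N, M))                 # approximately (1.0, 0.4)
    ```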

  8. High-Dimensional Multivariate Repeated Measures Analysis with Unequal Covariance Matrices.

    PubMed

    Harrar, Solomon W; Kong, Xiaoli

    2015-03-01

    In this paper, test statistics for repeated measures designs are introduced for the case where the dimension is large, meaning that the number of repeated measures and the total sample size grow together, with either one possibly larger than the other. Asymptotic distributions of the statistics are derived for the equal as well as unequal covariance cases, in both the balanced and unbalanced settings. The asymptotic framework considered requires proportional growth of the sample sizes and the dimension of the repeated measures in the unequal covariance case; in the equal covariance case, one can grow at a much faster rate than the other. The derivations of the asymptotic distributions mimic that of the central limit theorem, with some important peculiarities addressed with sufficient rigor. Consistent and unbiased estimators of the asymptotic variances, which make efficient use of all the observations, are also derived. A simulation study provides favorable evidence for the accuracy of the asymptotic approximation under the null hypothesis. Power simulations show that the new methods have power comparable to a popular method known to work well in low-dimensional situations, but show an enormous advantage when the dimension is large. Data from an electroencephalograph (EEG) experiment are analyzed to illustrate the application of the results. PMID:26778861

  10. Statistical process control charts for attribute data involving very large sample sizes: a review of problems and solutions.

    PubMed

    Mohammed, Mohammed A; Panesar, Jagdeep S; Laney, David B; Wilson, Richard

    2013-04-01

    The use of statistical process control (SPC) charts in healthcare is increasing. The primary purpose of SPC is to distinguish between common-cause variation, which is attributable to the underlying process, and special-cause variation, which is extrinsic to the underlying process. This is important because improvement under common-cause variation requires action on the process, whereas special-cause variation merits an investigation to first find the cause. Nonetheless, when dealing with attribute or count data (e.g., number of emergency admissions) involving very large sample sizes, traditional SPC charts often produce tight control limits with most of the data points appearing outside the control limits. This can give a false impression of common- and special-cause variation, and potentially misguide the user into taking the wrong actions. Given the growing availability of large datasets from routinely collected databases in healthcare, there is a need to review this problem (which arises because traditional attribute charts consider only within-subgroup variation) and its solutions (which consider within- and between-subgroup variation), involving the use of the well-established measurements chart and the more recently developed attribute charts based on Laney's innovative approach. We close by making some suggestions for practice.
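
    A sketch of Laney's p' correction, which rescales the classic p-chart sigma by sigma_z, the between-subgroup variation of the z-scores, estimated here from the average moving range:

    ```python
    import numpy as np

    def laney_limits(counts, sizes):
        """Laney p'-chart limits: the classic p-chart sigma inflated by
        sigma_z, the between-subgroup variation of the z-scores."""
        counts, sizes = np.asarray(counts, float), np.asarray(sizes, float)
        p = counts / sizes
        pbar = counts.sum() / sizes.sum()
        sigma_p = np.sqrt(pbar * (1.0 - pbar) / sizes)   # within-subgroup sigma
        z = (p - pbar) / sigma_p
        sigma_z = np.mean(np.abs(np.diff(z))) / 1.128    # average moving range / d2
        return pbar - 3 * sigma_p * sigma_z, pbar + 3 * sigma_p * sigma_z

    counts = [4850, 5120, 4990, 5300, 5010, 4760]
    sizes = [50_000] * 6
    lcl, ucl = laney_limits(counts, sizes)   # far wider than naive p-chart limits
    print(lcl, ucl)
    ```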

  11. Samples in applied psychology: over a decade of research in review.

    PubMed

    Shen, Winny; Kiger, Thomas B; Davies, Stacy E; Rasch, Rena L; Simon, Kara M; Ones, Deniz S

    2011-09-01

    This study examines sample characteristics of articles published in Journal of Applied Psychology (JAP) from 1995 to 2008. At the individual level, the overall median sample size over the period examined was approximately 173, which is generally adequate for detecting the average magnitude of effects of primary interest to researchers who publish in JAP. Samples using higher units of analyses (e.g., teams, departments/work units, and organizations) had lower median sample sizes (Mdn ≈ 65), yet were arguably robust given typical multilevel design choices of JAP authors despite the practical constraints of collecting data at higher units of analysis. A substantial proportion of studies used student samples (~40%); surprisingly, median sample sizes for student samples were smaller than working adult samples. Samples were more commonly occupationally homogeneous (~70%) than occupationally heterogeneous. U.S. and English-speaking participants made up the vast majority of samples, whereas Middle Eastern, African, and Latin American samples were largely unrepresented. On the basis of study results, recommendations are provided for authors, editors, and readers, which converge on 3 themes: (a) appropriateness and match between sample characteristics and research questions, (b) careful consideration of statistical power, and (c) the increased popularity of quantitative synthesis. Implications are discussed in terms of theory building, generalizability of research findings, and statistical power to detect effects. PsycINFO Database Record (c) 2011 APA, all rights reserved

  12. Using the bootstrap to establish statistical significance for relative validity comparisons among patient-reported outcome measures

    PubMed Central

    2013-01-01

    Background Relative validity (RV), a ratio of ANOVA F-statistics, is often used to compare the validity of patient-reported outcome (PRO) measures. We used the bootstrap to establish the statistical significance of the RV and to identify key factors affecting its significance. Methods Based on responses from 453 chronic kidney disease (CKD) patients to 16 CKD-specific and generic PRO measures, RVs were computed to determine how well each measure discriminated across clinically-defined groups of patients compared to the most discriminating (reference) measure. Statistical significance of RV was quantified by the 95% bootstrap confidence interval. Simulations examined the effects of sample size, denominator F-statistic, correlation between comparator and reference measures, and number of bootstrap replicates. Results The statistical significance of the RV increased as the magnitude of denominator F-statistic increased or as the correlation between comparator and reference measures increased. A denominator F-statistic of 57 conveyed sufficient power (80%) to detect an RV of 0.6 for two measures correlated at r = 0.7. Larger denominator F-statistics or higher correlations provided greater power. Larger sample size with a fixed denominator F-statistic or more bootstrap replicates (beyond 500) had minimal impact. Conclusions The bootstrap is valuable for establishing the statistical significance of RV estimates. A reasonably large denominator F-statistic (F > 57) is required for adequate power when using the RV to compare the validity of measures with small or moderate correlations (r < 0.7). Substantially greater power can be achieved when comparing measures of a very high correlation (r > 0.9). PMID:23721463
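
    A minimal sketch of the procedure, a percentile bootstrap of the RV that resamples patients with replacement, under the assumption that every clinical group stays populated in each resample:

    ```python
    import numpy as np

    def anova_F(y, groups):
        """One-way ANOVA F-statistic for y across integer group labels."""
        labels, grand = np.unique(groups), y.mean()
        k = labels.size
        ssb = sum(np.sum(groups == g) * (y[groups == g].mean() - grand) ** 2
                  for g in labels)
        ssw = sum(np.sum((y[groups == g] - y[groups == g].mean()) ** 2)
                  for g in labels)
        return (ssb / (k - 1)) / (ssw / (len(y) - k))

    def rv_ci(y_comp, y_ref, groups, B=1000, seed=0):
        """Percentile bootstrap CI for RV = F_comparator / F_reference."""
        rng, n, rvs = np.random.default_rng(seed), len(groups), []
        for _ in range(B):
            idx = rng.integers(0, n, n)              # resample patients
            rvs.append(anova_F(y_comp[idx], groups[idx]) /
                       anova_F(y_ref[idx], groups[idx]))
        return np.percentile(rvs, [2.5, 97.5])

    rng = np.random.default_rng(1)
    groups = np.repeat([0, 1, 2], 150)               # clinically defined groups
    y_ref = groups + rng.standard_normal(450)        # reference (most discriminating)
    y_comp = 0.7 * groups + rng.standard_normal(450) # comparator measure
    print(rv_ci(y_comp, y_ref, groups))              # CI below 1 => less valid
    ```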

  13. Statistically Controlling for Confounding Constructs Is Harder than You Think

    PubMed Central

    Westfall, Jacob; Yarkoni, Tal

    2016-01-01

    Social scientists often seek to demonstrate that a construct has incremental validity over and above other related constructs. However, these claims are typically supported by measurement-level models that fail to consider the effects of measurement (un)reliability. We use intuitive examples, Monte Carlo simulations, and a novel analytical framework to demonstrate that common strategies for establishing incremental construct validity using multiple regression analysis exhibit extremely high Type I error rates under parameter regimes common in many psychological domains. Counterintuitively, we find that error rates are highest—in some cases approaching 100%—when sample sizes are large and reliability is moderate. Our findings suggest that a potentially large proportion of incremental validity claims made in the literature are spurious. We present a web application (http://jakewestfall.org/ivy/) that readers can use to explore the statistical properties of these and other incremental validity arguments. We conclude by reviewing SEM-based statistical approaches that appropriately control the Type I error rate when attempting to establish incremental validity. PMID:27031707
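
    The effect is easy to reproduce by simulation: two unreliable measures of the same construct, an outcome driven by the construct alone, and an "incremental validity" test on the second measure. This is a sketch of the paper's argument, not its exact simulation design:

    ```python
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)

    def one_trial(n=2000, reliability=0.7):
        """Two error-laden measures of one construct; the outcome depends on
        the construct only, so x2 has no true incremental validity."""
        c = rng.standard_normal(n)
        noise = np.sqrt(1 - reliability)
        x1 = np.sqrt(reliability) * c + noise * rng.standard_normal(n)
        x2 = np.sqrt(reliability) * c + noise * rng.standard_normal(n)
        y = c + rng.standard_normal(n)
        fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
        return fit.pvalues[2] < 0.05                 # x2 declared "significant"?

    print(np.mean([one_trial() for _ in range(200)]))  # far above the nominal 0.05
    ```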

  14. Long-term evolution of a planetesimal swarm in the vicinity of a protoplanet

    NASA Technical Reports Server (NTRS)

    Kary, David M.; Lissauer, Jack J.

    1991-01-01

    Many models of planet formation involve scenarios in which one or a few large protoplanets interact with a swarm of much smaller planetesimals. In such scenarios, three-body perturbations by the protoplanet as well as mutual collisions and gravitational interactions between the swarm bodies are important in determining the velocity distribution of the swarm. We are developing a model to examine the effects of these processes on the evolution of a planetesimal swarm. The model consists of a combination of numerical integrations of the gravitational influence of one (or a few) massive protoplanets on swarm bodies together with a statistical treatment of the interactions between the planetesimals. Integrating the planetesimal orbits allows us to take into account effects that are difficult to model analytically or statistically, such as three-body collision cross-sections and resonant perturbations by the protoplanet, while using a statistical treatment for the particle-particle interactions allows us to use a large enough sample to obtain meaningful results.

  15. Classical boson sampling algorithms with superior performance to near-term experiments

    NASA Astrophysics Data System (ADS)

    Neville, Alex; Sparrow, Chris; Clifford, Raphaël; Johnston, Eric; Birchall, Patrick M.; Montanaro, Ashley; Laing, Anthony

    2017-12-01

    It is predicted that quantum computers will dramatically outperform their conventional counterparts. However, large-scale universal quantum computers are yet to be built. Boson sampling is a rudimentary quantum algorithm tailored to the platform of linear optics, which has sparked interest as a rapid way to demonstrate such quantum supremacy. Photon statistics are governed by intractable matrix functions, which suggests that sampling from the distribution obtained by injecting photons into a linear optical network could be solved more quickly by a photonic experiment than by a classical computer. The apparently low resource requirements for large boson sampling experiments have raised expectations of a near-term demonstration of quantum supremacy by boson sampling. Here we present classical boson sampling algorithms and theoretical analyses of prospects for scaling boson sampling experiments, showing that near-term quantum supremacy via boson sampling is unlikely. Our classical algorithm, based on Metropolised independence sampling, allowed the boson sampling problem to be solved for 30 photons with standard computing hardware. Compared to current experiments, a demonstration of quantum supremacy over a successful implementation of these classical methods on a supercomputer would require the number of photons and experimental components to increase by orders of magnitude, while tackling exponentially scaling photon loss.
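
    The "intractable matrix function" in question is the permanent. Ryser's inclusion-exclusion formula, sketched below, computes it exactly but at exponential cost, which is why the photon number drives classical hardness:

    ```python
    import numpy as np
    from itertools import combinations

    def permanent(A):
        """Ryser's inclusion-exclusion formula; exact but exponential in n."""
        n = A.shape[0]
        total = 0.0
        for k in range(1, n + 1):
            sign = (-1.0) ** (n - k)
            for cols in combinations(range(n), k):
                total += sign * np.prod(A[:, list(cols)].sum(axis=1))
        return total

    U = np.random.default_rng(0).random((6, 6))
    print(permanent(U))   # feasible for small n; cost roughly doubles per photon
    ```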

  16. Sizing for the apparel industry using statistical analysis - a Brazilian case study

    NASA Astrophysics Data System (ADS)

    Capelassi, C. H.; Carvalho, M. A.; El Kattel, C.; Xu, B.

    2017-10-01

The study of the body measurements of Brazilian women used the Kinect Body Imaging system for 3D body scanning. The study aims to meet the apparel industry's need for accurate measurements. Data were statistically treated using the IBM SPSS 23 system, with 95% confidence (P < 0.05) for the inferential analysis, with the purpose of grouping the measurements into sizes, so that a smaller number of sizes can cover a greater number of people. The sample consisted of 101 volunteers aged between 19 and 62 years. A cluster analysis was performed to identify the main body shapes of the sample. The results were divided between the top and bottom body portions: for the top portion, the abdomen, waist, and bust circumferences were used, together with height; for the bottom portion, the hip circumference and height were used. Three sizing systems were developed for the researched sample from the Abdomen-to-Height Ratio, AHR (top portion): Small (AHR < 0.52), Medium (AHR 0.52-0.58), Large (AHR > 0.58); and from the Hip-to-Height Ratio, HHR (bottom portion): Small (HHR < 0.62), Medium (HHR 0.62-0.68), Large (HHR > 0.68).
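
    The published thresholds translate directly into a classification rule. A small sketch (the function names and centimetre units are assumptions for illustration):

        def top_size(abdomen_cm: float, height_cm: float) -> str:
            """Top-portion size from the Abdomen-to-Height Ratio (AHR)."""
            ahr = abdomen_cm / height_cm
            if ahr < 0.52:
                return "Small"
            return "Medium" if ahr <= 0.58 else "Large"

        def bottom_size(hip_cm: float, height_cm: float) -> str:
            """Bottom-portion size from the Hip-to-Height Ratio (HHR)."""
            hhr = hip_cm / height_cm
            if hhr < 0.62:
                return "Small"
            return "Medium" if hhr <= 0.68 else "Large"

        print(top_size(88.0, 165.0), bottom_size(104.0, 165.0))  # Medium Medium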

  17. A Large Scale (N=400) Investigation of Gray Matter Differences in Schizophrenia Using Optimized Voxel-based Morphometry

    PubMed Central

    Meda, Shashwath A.; Giuliani, Nicole R.; Calhoun, Vince D.; Jagannathan, Kanchana; Schretlen, David J.; Pulver, Anne; Cascella, Nicola; Keshavan, Matcheri; Kates, Wendy; Buchanan, Robert; Sharma, Tonmoy; Pearlson, Godfrey D.

    2008-01-01

Background: Many studies have employed voxel-based morphometry (VBM) of MRI images as an automated method of investigating cortical gray matter differences in schizophrenia. However, results from these studies vary widely, likely due to different methodological or statistical approaches. Objective: To use VBM to investigate gray matter differences in schizophrenia in a sample significantly larger than any published to date, and to increase statistical power sufficiently to reveal differences missed in smaller analyses. Methods: Magnetic resonance whole brain images were acquired from four geographic sites, all using the same model 1.5T scanner and software version, and combined to form a sample of 200 patients with both first episode and chronic schizophrenia and 200 healthy controls, matched for age, gender and scanner location. Gray matter concentration was assessed and compared using optimized VBM. Results: Compared to the healthy controls, schizophrenia patients showed significantly less gray matter concentration in multiple cortical and subcortical regions, some previously unreported. Overall, we found lower concentrations of gray matter in regions identified in prior studies, most of which reported only subsets of the affected areas. Conclusions: Gray matter differences in schizophrenia are most comprehensively elucidated using a large, diverse and representative sample. PMID:18378428

  18. QNB: differential RNA methylation analysis for count-based small-sample sequencing data with a quad-negative binomial model.

    PubMed

    Liu, Lian; Zhang, Shao-Wu; Huang, Yufei; Meng, Jia

    2017-08-31

As a newly emerged research area, RNA epigenetics has drawn increasing attention recently for the participation of RNA methylation and other modifications in a number of crucial biological processes. Thanks to high-throughput sequencing techniques such as MeRIP-Seq, transcriptome-wide RNA methylation profiles are now available in the form of count-based data, with which it is often of interest to study the dynamics of the epitranscriptomic layer. However, the sample size of an RNA methylation experiment is usually very small due to its cost; additionally, there usually exist a large number of genes whose methylation level cannot be accurately estimated due to their low expression level, making differential RNA methylation analysis a difficult task. We present QNB, a statistical approach for differential RNA methylation analysis with count-based small-sample sequencing data. Compared with previous approaches such as the DRME model, which is based on a statistical test covering the IP samples only with two negative binomial distributions, QNB is based on four independent negative binomial distributions with their variances and means linked by local regressions, so that the input control samples are also properly taken into account. In addition, unlike the DRME approach, which relies on the input control samples alone to estimate the background, QNB uses a more robust estimator of gene expression that combines information from both input and IP samples, which can largely improve the testing performance for very lowly expressed genes. QNB showed improved performance on both simulated and real MeRIP-Seq datasets when compared with competing algorithms. The QNB model is also applicable to other datasets related to RNA modifications, including but not limited to RNA bisulfite sequencing, m1A-Seq, Par-CLIP, RIP-Seq, etc.

  19. X-ray studies of quasars with the Einstein Observatory. IV - X-ray dependence on radio emission

    NASA Technical Reports Server (NTRS)

    Worrall, D. M.; Tananbaum, H.; Giommi, P.; Zamorani, G.

    1987-01-01

The X-ray properties of a sample of 114 radio-loud quasars observed with the Einstein Observatory are examined, and the results are compared with those obtained from a large sample of radio-quiet quasars. Statistical analysis of the dependence of X-ray luminosity on combined functions of optical and radio luminosity shows that the dependence on both luminosities is important. However, statistically significant differences are found between subsamples of quasars with flat radio spectra and those with steep radio spectra with regard to the dependence of X-ray luminosity on radio luminosity alone. The data are consistent with radio-loud quasars having a physical component, not directly related to the optical luminosity, which produces the core radio luminosity plus 'extra' X-ray emission.

  20. Rare Event Simulation for T-cell Activation

    NASA Astrophysics Data System (ADS)

    Lipsmeier, Florian; Baake, Ellen

    2009-02-01

    The problem of statistical recognition is considered, as it arises in immunobiology, namely, the discrimination of foreign antigens against a background of the body's own molecules. The precise mechanism of this foreign-self-distinction, though one of the major tasks of the immune system, continues to be a fundamental puzzle. Recent progress has been made by van den Berg, Rand, and Burroughs (J. Theor. Biol. 209:465-486, 2001), who modelled the probabilistic nature of the interaction between the relevant cell types, namely, T-cells and antigen-presenting cells (APCs). Here, the stochasticity is due to the random sample of antigens present on the surface of every APC, and to the random receptor type that characterises individual T-cells. It has been shown previously (van den Berg et al. in J. Theor. Biol. 209:465-486, 2001; Zint et al. in J. Math. Biol. 57:841-861, 2008) that this model, though highly idealised, is capable of reproducing important aspects of the recognition phenomenon, and of explaining them on the basis of stochastic rare events. These results were obtained with the help of a refined large deviation theorem and were thus asymptotic in nature. Simulations have, so far, been restricted to the straightforward simple sampling approach, which does not allow for sample sizes large enough to address more detailed questions. Building on the available large deviation results, we develop an importance sampling technique that allows for a convenient exploration of the relevant tail events by means of simulation. With its help, we investigate the mechanism of statistical recognition in some depth. In particular, we illustrate how a foreign antigen can stand out against the self background if it is present in sufficiently many copies, although no a priori difference between self and nonself is built into the model.

  1. Testing a level of response to alcohol-based model of heavy drinking and alcohol problems in 1,905 17-year-olds.

    PubMed

    Schuckit, Marc A; Smith, Tom L; Heron, Jon; Hickman, Matthew; Macleod, John; Lewis, Glyn; Davis, John M; Hibbeln, Joseph R; Brown, Sandra; Zuccolo, Luisa; Miller, Laura L; Davey-Smith, George

    2011-10-01

    The low level of response (LR) to alcohol is one of several genetically influenced characteristics that increase the risk for heavy drinking and alcohol problems. Efforts to understand how LR operates through additional life influences have been carried out primarily in modest-sized U.S.-based samples with limited statistical power, raising questions about generalizability and about the importance of components with smaller effects. This study evaluates a full LR-based model of risk in a large sample of adolescents from the United Kingdom. Cross-sectional structural equation models were used for the approximate first half of the age 17 subjects assessed by the Avon Longitudinal Study of Parents and Children, generating data on 1,905 adolescents (mean age 17.8 years, 44.2% boys). LR was measured with the Self-Rating of the Effects of Alcohol Questionnaire, outcomes were based on drinking quantities and problems, and standardized questionnaires were used to evaluate peer substance use, alcohol expectancies, and using alcohol to cope with stress. In this young and large U.K. sample, a low LR related to more adverse alcohol outcomes both directly and through partial mediation by all 3 additional key variables (peer substance use, expectancies, and coping). The models were similar in boys and girls. These results confirm key elements of the hypothesized LR-based model in a large U.K. sample, supporting some generalizability beyond U.S. groups. They also indicate that with enough statistical power, multiple elements contribute to how LR relates to alcohol outcomes and reinforce the applicability of the model to both genders. Copyright © 2011 by the Research Society on Alcoholism.

  2. Statistical Measures of Large-Scale Structure

    NASA Astrophysics Data System (ADS)

    Vogeley, Michael; Geller, Margaret; Huchra, John; Park, Changbom; Gott, J. Richard

    1993-12-01

To quantify clustering in the large-scale distribution of galaxies and to test theories for the formation of structure in the universe, we apply statistical measures to the CfA Redshift Survey. This survey is complete to m_B(0) = 15.5 over two contiguous regions which cover one-quarter of the sky and include ~11,000 galaxies. The salient features of these data are voids with diameter 30-50 h^-1 Mpc and coherent dense structures with a scale ~100 h^-1 Mpc. Comparison with N-body simulations rules out the "standard" CDM model (Omega = 1, b = 1.5, sigma_8 = 1) at the 99% confidence level because this model has insufficient power on scales lambda > 30 h^-1 Mpc. An unbiased open-universe CDM model (Omega h = 0.2) and a biased CDM model with non-zero cosmological constant (Omega h = 0.24, lambda_0 = 0.6) match the observed power spectrum. The amplitude of the power spectrum depends on the luminosity of galaxies in the sample; bright (L > L*) galaxies are more strongly clustered than faint galaxies. The paucity of bright galaxies in low-density regions may explain this dependence. To measure the topology of large-scale structure, we compute the genus of isodensity surfaces of the smoothed density field. On scales in the "non-linear" regime, <= 10 h^-1 Mpc, the high- and low-density regions are multiply-connected over a broad range of density threshold, as in a filamentary net. On smoothing scales > 10 h^-1 Mpc, the topology is consistent with statistics of a Gaussian random field. Simulations of CDM models fail to produce the observed coherence of structure on non-linear scales (>95% confidence level). The underdensity probability (the frequency of regions with density contrast delta rho / rho = -0.8) depends strongly on the luminosity of galaxies; underdense regions are significantly more common (>2 sigma) in bright (L > L*) galaxy samples than in samples which include fainter galaxies.

  3. Charm dimuon production in neutrino-nucleon interactions in the NOMAD experiment

    NASA Astrophysics Data System (ADS)

    Petti, Roberto; Samoylov, Oleg

    2012-09-01

    We present our new measurement of charm dimuon production in neutrino-iron interactions based upon the full statistics collected by the NOMAD experiment. After background subtraction we observe 15,340 charm dimuon events, providing the largest sample currently available. The analysis exploits the large inclusive charged current sample (about 9 million events after all analysis cuts) to constrain the total systematic uncertainty to about 2%. The extraction of strange sea and charm production parameters is also discussed.

  4. Charm dimuon production in neutrino-nucleon interactions in the NOMAD experiment

    NASA Astrophysics Data System (ADS)

    Petti, R.; Samoylov, O. B.

    2011-12-01

    We present our new measurement of charm dimuon production in neutrino-iron interactions based upon the full statistics collected by the NOMAD experiment. After background subtraction we observe 15,340 charm dimuon events, providing the largest sample currently available. The analysis exploits the large inclusive charged current sample (about 9 million events after all analysis cuts) to constrain the total systematic uncertainty to ˜2%. The extraction of strange sea and charm production parameters is also discussed.

  5. Monte Carlo and Molecular Dynamics in the Multicanonical Ensemble: Connections between Wang-Landau Sampling and Metadynamics

    NASA Astrophysics Data System (ADS)

    Vogel, Thomas; Perez, Danny; Junghans, Christoph

    2014-03-01

We show direct formal relationships between the Wang-Landau iteration [PRL 86, 2050 (2001)], metadynamics [PNAS 99, 12562 (2002)] and statistical temperature molecular dynamics [PRL 97, 050601 (2006)], the major Monte Carlo and molecular dynamics workhorses for sampling from a generalized, multicanonical ensemble. We aim to help consolidate the developments in the different areas by indicating how methodological advancements can be transferred in a straightforward way, avoiding the parallel, largely independent, development tracks observed in the past.

  6. Asymptotic Linear Spectral Statistics for Spiked Hermitian Random Matrices

    NASA Astrophysics Data System (ADS)

    Passemier, Damien; McKay, Matthew R.; Chen, Yang

    2015-07-01

Using the Coulomb Fluid method, this paper derives central limit theorems (CLTs) for linear spectral statistics of three "spiked" Hermitian random matrix ensembles. These include Johnstone's spiked model (i.e., central Wishart with spiked correlation), non-central Wishart with rank-one non-centrality, and a related class of non-central matrices. For a generic linear statistic, we derive simple and explicit CLT expressions as the matrix dimensions grow large. For all three ensembles under consideration, we find that the primary effect of the spike is to introduce a correction term to the asymptotic mean of the linear spectral statistic, which we characterize with simple formulas. The utility of our proposed framework is demonstrated through application to three different linear statistics problems: the classical likelihood ratio test for a population covariance, the capacity analysis of multi-antenna wireless communication systems with a line-of-sight transmission path, and a classical multiple sample significance testing problem.

  7. A nonparametric method to generate synthetic populations to adjust for complex sampling design features.

    PubMed

    Dong, Qi; Elliott, Michael R; Raghunathan, Trivellore E

    2014-06-01

Outside of the survey sampling literature, samples are often assumed to be generated by a simple random sampling process that produces independent and identically distributed (IID) samples. Many statistical methods are developed largely in this IID world. Application of these methods to data from complex sample surveys without making allowance for the survey design features can lead to erroneous inferences. Hence, much time and effort have been devoted to developing statistical methods that analyze complex survey data and account for the sample design. This issue is particularly important when generating synthetic populations using finite population Bayesian inference, as is often done in missing data or disclosure risk settings, or when combining data from multiple surveys. By extending previous work in the finite population Bayesian bootstrap literature, we propose a method to generate synthetic populations from a posterior predictive distribution in a fashion that inverts the complex sampling design features and generates simple random samples from a superpopulation point of view, adjusting the complex data so that they can be analyzed as simple random samples. We consider a simulation study with a stratified, clustered unequal-probability of selection sample design, and use the proposed nonparametric method to generate synthetic populations for the 2006 National Health Interview Survey (NHIS) and the Medical Expenditure Panel Survey (MEPS), which are stratified, clustered unequal-probability of selection sample designs.

  8. A nonparametric method to generate synthetic populations to adjust for complex sampling design features

    PubMed Central

    Dong, Qi; Elliott, Michael R.; Raghunathan, Trivellore E.

    2017-01-01

Outside of the survey sampling literature, samples are often assumed to be generated by a simple random sampling process that produces independent and identically distributed (IID) samples. Many statistical methods are developed largely in this IID world. Application of these methods to data from complex sample surveys without making allowance for the survey design features can lead to erroneous inferences. Hence, much time and effort have been devoted to developing statistical methods that analyze complex survey data and account for the sample design. This issue is particularly important when generating synthetic populations using finite population Bayesian inference, as is often done in missing data or disclosure risk settings, or when combining data from multiple surveys. By extending previous work in the finite population Bayesian bootstrap literature, we propose a method to generate synthetic populations from a posterior predictive distribution in a fashion that inverts the complex sampling design features and generates simple random samples from a superpopulation point of view, adjusting the complex data so that they can be analyzed as simple random samples. We consider a simulation study with a stratified, clustered unequal-probability of selection sample design, and use the proposed nonparametric method to generate synthetic populations for the 2006 National Health Interview Survey (NHIS) and the Medical Expenditure Panel Survey (MEPS), which are stratified, clustered unequal-probability of selection sample designs. PMID:29200608
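
    A heavily simplified, illustrative version of the weighted finite population Bayesian bootstrap idea (not the authors' exact algorithm; all data below are toy placeholders): units are redrawn into a synthetic population with probabilities that combine their survey weights with Dirichlet uncertainty, so the synthetic population can then be analyzed as a simple random sample.

        import numpy as np

        rng = np.random.default_rng(2)

        def synthetic_population(y, weights, pop_size):
            """y: sampled values; weights: survey weights (sum ~ population size)."""
            probs = weights * rng.dirichlet(np.ones(len(y)))  # weighted Bayesian bootstrap draw
            probs /= probs.sum()
            idx = rng.choice(len(y), size=pop_size, replace=True, p=probs)
            return y[idx]

        # toy stratified sample: stratum A oversampled (small weights)
        y = np.concatenate([rng.normal(0, 1, 300), rng.normal(3, 1, 100)])
        w = np.concatenate([np.full(300, 10.0), np.full(100, 90.0)])
        pops = [synthetic_population(y, w, pop_size=12000) for _ in range(50)]
        print(np.mean([p.mean() for p in pops]))  # near the weighted mean, ~2.25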

  9. Assessment of sampling stability in ecological applications of discriminant analysis

    USGS Publications Warehouse

    Williams, B.K.; Titus, K.

    1988-01-01

    A simulation study was undertaken to assess the sampling stability of the variable loadings in linear discriminant function analysis. A factorial design was used for the factors of multivariate dimensionality, dispersion structure, configuration of group means, and sample size. A total of 32,400 discriminant analyses were conducted, based on data from simulated populations with appropriate underlying statistical distributions. A review of 60 published studies and 142 individual analyses indicated that sample sizes in ecological studies often have met that requirement. However, individual group sample sizes frequently were very unequal, and checks of assumptions usually were not reported. The authors recommend that ecologists obtain group sample sizes that are at least three times as large as the number of variables measured.

  10. Relations between Brain Structure and Attentional Function in Spina Bifida: Utilization of Robust Statistical Approaches

    PubMed Central

    Kulesz, Paulina A.; Tian, Siva; Juranek, Jenifer; Fletcher, Jack M.; Francis, David J.

    2015-01-01

Objective: Weak structure-function relations for brain and behavior may stem from problems in estimating these relations in small clinical samples with frequently occurring outliers. In the current project, we focused on the utility of using alternative statistics to estimate these relations. Method: Fifty-four children with spina bifida meningomyelocele performed attention tasks and received MRI of the brain. Using a bootstrap sampling process, the Pearson product moment correlation was compared with four robust correlations: the percentage bend correlation, the Winsorized correlation, the skipped correlation using the Donoho-Gasko median, and the skipped correlation using the minimum volume ellipsoid estimator. Results: All methods yielded similar estimates of the relations between measures of brain volume and attention performance. The similarity of estimates across correlation methods suggested that the weak structure-function relations previously found in many studies are not readily attributable to the presence of outlying observations and other factors that violate the assumptions behind the Pearson correlation. Conclusions: Given the difficulty of assembling large samples for brain-behavior studies, estimating correlations using multiple, robust methods may enhance the statistical conclusion validity of studies yielding small, but often clinically significant, correlations. PMID:25495830

  11. Relations between volumetric measures of brain structure and attentional function in spina bifida: utilization of robust statistical approaches.

    PubMed

    Kulesz, Paulina A; Tian, Siva; Juranek, Jenifer; Fletcher, Jack M; Francis, David J

    2015-03-01

    Weak structure-function relations for brain and behavior may stem from problems in estimating these relations in small clinical samples with frequently occurring outliers. In the current project, we focused on the utility of using alternative statistics to estimate these relations. Fifty-four children with spina bifida meningomyelocele performed attention tasks and received MRI of the brain. Using a bootstrap sampling process, the Pearson product-moment correlation was compared with 4 robust correlations: the percentage bend correlation, the Winsorized correlation, the skipped correlation using the Donoho-Gasko median, and the skipped correlation using the minimum volume ellipsoid estimator. All methods yielded similar estimates of the relations between measures of brain volume and attention performance. The similarity of estimates across correlation methods suggested that the weak structure-function relations previously found in many studies are not readily attributable to the presence of outlying observations and other factors that violate the assumptions behind the Pearson correlation. Given the difficulty of assembling large samples for brain-behavior studies, estimating correlations using multiple, robust methods may enhance the statistical conclusion validity of studies yielding small, but often clinically significant, correlations. PsycINFO Database Record (c) 2015 APA, all rights reserved.
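
    Of the four robust estimators named, the Winsorized correlation is the simplest to sketch: extreme values in each variable are pulled in to the nearest retained quantile before an ordinary Pearson correlation is computed. A minimal sketch (the 20% Winsorizing fraction is an assumed, conventional choice, and the data are simulated stand-ins, not the study's measurements):

        import numpy as np
        from scipy.stats import mstats, pearsonr

        def winsorized_correlation(x, y, trim=0.20):
            """Winsorize each variable, then take the Pearson correlation."""
            xw = np.asarray(mstats.winsorize(np.asarray(x, float), limits=(trim, trim)))
            yw = np.asarray(mstats.winsorize(np.asarray(y, float), limits=(trim, trim)))
            return pearsonr(xw, yw)[0]

        rng = np.random.default_rng(3)
        brain = rng.normal(size=54)                    # stand-in volume measure
        attention = 0.3 * brain + rng.normal(size=54)  # stand-in attention score
        attention[:3] += 8                             # a few extreme outliers
        print(pearsonr(brain, attention)[0])           # distorted by the outliers
        print(winsorized_correlation(brain, attention))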

  12. Water quality analysis of the Rapur area, Andhra Pradesh, South India using multivariate techniques

    NASA Astrophysics Data System (ADS)

    Nagaraju, A.; Sreedhar, Y.; Thejaswi, A.; Sayadi, Mohammad Hossein

    2017-10-01

The groundwater samples from the Rapur area were collected from different sites to evaluate the major ion chemistry. The large volume of data can lead to difficulties in the integration, interpretation, and representation of the results. Two multivariate statistical methods, hierarchical cluster analysis (HCA) and factor analysis (FA), were applied to evaluate their usefulness for classifying and identifying the geochemical processes controlling groundwater geochemistry. Four statistically significant clusters were obtained from 30 sampling stations. This resulted in two important clusters, viz. cluster 1 (pH, Si, CO3, Mg, SO4, Ca, K, HCO3, alkalinity, Na, Na + K, Cl, and hardness) and cluster 2 (EC and TDS), which are released to the study area from different sources. The application of different multivariate statistical techniques, such as principal component analysis (PCA), assists in the interpretation of complex data matrices for a better understanding of the water quality of a study area. From PCA, it is clear that the first factor (factor 1), which accounted for 36.2% of the total variance, had high positive loadings on EC, Mg, Cl, TDS, and hardness. Based on the PCA scores, four significant cluster groups of sampling locations were detected on the basis of the similarity of their water quality.
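
    The two multivariate steps described, hierarchical clustering of the standardized hydrochemical variables followed by PCA, might be sketched as follows (the column names and data are hypothetical placeholders, not the Rapur measurements):

        import numpy as np
        import pandas as pd
        from scipy.cluster.hierarchy import fcluster, linkage
        from sklearn.decomposition import PCA
        from sklearn.preprocessing import StandardScaler

        rng = np.random.default_rng(4)
        cols = ["pH", "EC", "TDS", "Ca", "Mg", "Na", "K", "Cl", "SO4", "HCO3"]
        data = pd.DataFrame(rng.normal(size=(30, len(cols))), columns=cols)  # 30 stations

        Z = StandardScaler().fit_transform(data)

        # HCA on the variables (columns), analogous to the parameter clusters above
        var_linkage = linkage(Z.T, method="ward")
        print(dict(zip(cols, fcluster(var_linkage, t=2, criterion="maxclust"))))

        # PCA on the standardized station-by-variable matrix
        pca = PCA(n_components=3).fit(Z)
        print(pca.explained_variance_ratio_)  # share of variance for factor 1, etc.
        print(pd.DataFrame(pca.components_.T, index=cols).round(2))  # loadings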

  13. SparRec: An effective matrix completion framework of missing data imputation for GWAS

    NASA Astrophysics Data System (ADS)

    Jiang, Bo; Ma, Shiqian; Causey, Jason; Qiao, Linbo; Hardin, Matthew Price; Bitts, Ian; Johnson, Daniel; Zhang, Shuzhong; Huang, Xiuzhen

    2016-10-01

Genome-wide association studies present computational challenges for missing data imputation, while advances in genotyping technologies are generating datasets of large sample sizes with sample sets genotyped on multiple SNP chips. We present a new framework, SparRec (Sparse Recovery), for imputation, with the following properties: (1) The optimization models of SparRec, based on low rank and a low number of co-clusters of matrices, differ from current statistical methods. While our low-rank matrix completion (LRMC) model is similar to Mendel-Impute, our matrix co-clustering factorization (MCCF) model is completely new. (2) SparRec, like other matrix completion methods, can flexibly be applied to missing data imputation for large meta-analyses with different cohorts genotyped on different sets of SNPs, even when there is no reference panel. This kind of meta-analysis is very challenging for current statistics-based methods. (3) SparRec has consistent performance and achieves high recovery accuracy even when the missing data rate is as high as 90%. Compared with Mendel-Impute, our low-rank based method achieves similar accuracy and efficiency, while the co-clustering based method has advantages in running time. The testing results show that SparRec has significant advantages and competitive performance over other state-of-the-art statistical methods, including Beagle and fastPhase.
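
    The low-rank idea behind the LRMC model can be illustrated with a generic soft-thresholded-SVD completion loop (a SoftImpute-style sketch under toy assumptions, not the SparRec algorithm itself):

        import numpy as np

        def low_rank_complete(M, observed, lam=1.0, n_iter=200):
            """observed: boolean mask of known entries of M."""
            X = np.where(observed, M, 0.0)
            for _ in range(n_iter):
                U, s, Vt = np.linalg.svd(X, full_matrices=False)
                X = (U * np.maximum(s - lam, 0.0)) @ Vt  # soft-threshold singular values
                X = np.where(observed, M, X)             # keep observed entries fixed
            return X

        rng = np.random.default_rng(5)
        truth = rng.normal(size=(60, 3)) @ rng.normal(size=(3, 40))  # rank-3 matrix
        observed = rng.uniform(size=truth.shape) < 0.3               # 70% missing
        estimate = low_rank_complete(truth, observed, lam=2.0)
        rmse = np.sqrt(np.mean((estimate[~observed] - truth[~observed]) ** 2))
        print(rmse)  # small relative to the entry scale when recovery succeeds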

  14. Evidence for plant-derived xenomiRs based on a large-scale analysis of public small RNA sequencing data from human samples.

    PubMed

    Zhao, Qi; Liu, Yuanning; Zhang, Ning; Hu, Menghan; Zhang, Hao; Joshi, Trupti; Xu, Dong

    2018-01-01

    In recent years, an increasing number of studies have reported the presence of plant miRNAs in human samples, which resulted in a hypothesis asserting the existence of plant-derived exogenous microRNA (xenomiR). However, this hypothesis is not widely accepted in the scientific community due to possible sample contamination and the small sample size with lack of rigorous statistical analysis. This study provides a systematic statistical test that can validate (or invalidate) the plant-derived xenomiR hypothesis by analyzing 388 small RNA sequencing data from human samples in 11 types of body fluids/tissues. A total of 166 types of plant miRNAs were found in at least one human sample, of which 14 plant miRNAs represented more than 80% of the total plant miRNAs abundance in human samples. Plant miRNA profiles were characterized to be tissue-specific in different human samples. Meanwhile, the plant miRNAs identified from microbiome have an insignificant abundance compared to those from humans, while plant miRNA profiles in human samples were significantly different from those in plants, suggesting that sample contamination is an unlikely reason for all the plant miRNAs detected in human samples. This study also provides a set of testable synthetic miRNAs with isotopes that can be detected in situ after being fed to animals.

  15. Soil carbon inventories under a bioenergy crop (switchgrass): Measurement limitations

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Garten, C.T. Jr.; Wullschleger, S.D.

Approximately 5 yr after planting, coarse root carbon (C) and soil organic C (SOC) inventories were compared under different types of plant cover at four switchgrass (Panicum virgatum L.) production field trials in the southeastern USA. There was significantly more coarse root C under switchgrass (Alamo variety) and forest cover than tall fescue (Festuca arundinacea Schreb.), corn (Zea mays L.), or native pastures of mixed grasses. Inventories of SOC under switchgrass were not significantly greater than SOC inventories under other plant covers. At some locations the statistical power associated with ANOVA of SOC inventories was low, which raised questions about whether differences in SOC could be detected statistically. A minimum detectable difference (MDD) for SOC inventories was calculated. The MDD is the smallest detectable difference between treatment means once the variation, significance level, statistical power, and sample size are specified. The analysis indicated that a difference of ~50 mg SOC/cm^2 or 5 Mg SOC/ha, which is ~10 to 15% of existing SOC, could be detected with reasonable sample sizes and good statistical power. The smallest difference in SOC inventories that can be detected, and only with exceedingly large sample sizes, is ~2 to 3%. These measurement limitations have implications for monitoring and verification of proposals to ameliorate increasing global atmospheric CO2 concentrations by sequestering C in soils.
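
    The MDD referred to above has a standard two-sample form, MDD = (t_alpha + t_beta) * sqrt(2 * MSE / n). A short sketch (the MSE value is a made-up placeholder, not the study's data) shows how the detectable difference shrinks with sample size:

        import numpy as np
        from scipy import stats

        def mdd(mse, n, alpha=0.05, power=0.80):
            """Two-sample minimum detectable difference, n replicates per treatment."""
            dof = 2 * (n - 1)
            t_alpha = stats.t.ppf(1 - alpha / 2, dof)  # two-sided significance level
            t_beta = stats.t.ppf(power, dof)           # power term
            return (t_alpha + t_beta) * np.sqrt(2 * mse / n)

        for n in (3, 5, 10, 20, 50):
            print(n, round(mdd(mse=900.0, n=n), 1))  # placeholder MSE, SOC units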

  16. Assessment of statistical methods used in library-based approaches to microbial source tracking.

    PubMed

    Ritter, Kerry J; Carruthers, Ethan; Carson, C Andrew; Ellender, R D; Harwood, Valerie J; Kingsley, Kyle; Nakatsu, Cindy; Sadowsky, Michael; Shear, Brian; West, Brian; Whitlock, John E; Wiggins, Bruce A; Wilbur, Jayson D

    2003-12-01

    Several commonly used statistical methods for fingerprint identification in microbial source tracking (MST) were examined to assess the effectiveness of pattern-matching algorithms to correctly identify sources. Although numerous statistical methods have been employed for source identification, no widespread consensus exists as to which is most appropriate. A large-scale comparison of several MST methods, using identical fecal sources, presented a unique opportunity to assess the utility of several popular statistical methods. These included discriminant analysis, nearest neighbour analysis, maximum similarity and average similarity, along with several measures of distance or similarity. Threshold criteria for excluding uncertain or poorly matched isolates from final analysis were also examined for their ability to reduce false positives and increase prediction success. Six independent libraries used in the study were constructed from indicator bacteria isolated from fecal materials of humans, seagulls, cows and dogs. Three of these libraries were constructed using the rep-PCR technique and three relied on antibiotic resistance analysis (ARA). Five of the libraries were constructed using Escherichia coli and one using Enterococcus spp. (ARA). Overall, the outcome of this study suggests a high degree of variability across statistical methods. Despite large differences in correct classification rates among the statistical methods, no single statistical approach emerged as superior. Thresholds failed to consistently increase rates of correct classification and improvement was often associated with substantial effective sample size reduction. Recommendations are provided to aid in selecting appropriate analyses for these types of data.

  17. A Sorting Statistic with Application in Neurological Magnetic Resonance Imaging of Autism.

    PubMed

    Levman, Jacob; Takahashi, Emi; Forgeron, Cynthia; MacDonald, Patrick; Stewart, Natalie; Lim, Ashley; Martel, Anne

    2018-01-01

Effect size refers to the assessment of the extent of differences between two groups of samples on a single measurement. Assessing effect size in medical research is typically accomplished with Cohen's d statistic. Cohen's d statistic assumes that average values are good estimators of the position of a distribution of numbers and also assumes Gaussian (or bell-shaped) underlying data distributions. In this paper, we present an alternative evaluative statistic that can quantify differences between two data distributions in a manner that is similar to traditional effect size calculations; however, the proposed approach avoids making assumptions regarding the shape of the underlying data distribution. The proposed sorting statistic is compared with Cohen's d statistic and is demonstrated to be capable of identifying feature measurements of potential interest for which Cohen's d statistic implies the measurement would be of little use. This proposed sorting statistic has been evaluated on a large clinical autism dataset from Boston Children's Hospital, Harvard Medical School, demonstrating that it can potentially play a constructive role in future healthcare technologies.

  18. A Sorting Statistic with Application in Neurological Magnetic Resonance Imaging of Autism

    PubMed Central

Levman, Jacob; Takahashi, Emi; Forgeron, Cynthia; MacDonald, Patrick; Stewart, Natalie; Lim, Ashley; Martel, Anne

    2018-01-01

    Effect size refers to the assessment of the extent of differences between two groups of samples on a single measurement. Assessing effect size in medical research is typically accomplished with Cohen's d statistic. Cohen's d statistic assumes that average values are good estimators of the position of a distribution of numbers and also assumes Gaussian (or bell-shaped) underlying data distributions. In this paper, we present an alternative evaluative statistic that can quantify differences between two data distributions in a manner that is similar to traditional effect size calculations; however, the proposed approach avoids making assumptions regarding the shape of the underlying data distribution. The proposed sorting statistic is compared with Cohen's d statistic and is demonstrated to be capable of identifying feature measurements of potential interest for which Cohen's d statistic implies the measurement would be of little use. This proposed sorting statistic has been evaluated on a large clinical autism dataset from Boston Children's Hospital, Harvard Medical School, demonstrating that it can potentially play a constructive role in future healthcare technologies. PMID:29796236
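
    For reference, Cohen's d in its pooled-standard-deviation form, alongside a simple rank-based effect size that avoids the Gaussian assumption criticized above (this is illustrative; it is not the paper's proposed sorting statistic, and the data are simulated placeholders):

        import numpy as np
        from scipy import stats

        def cohens_d(a, b):
            """Pooled-standard-deviation form of Cohen's d."""
            a, b = np.asarray(a, float), np.asarray(b, float)
            na, nb = len(a), len(b)
            pooled = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
            return (a.mean() - b.mean()) / np.sqrt(pooled)

        rng = np.random.default_rng(6)
        grp1 = rng.lognormal(0.0, 1.0, 200)  # skewed, non-Gaussian measurements
        grp2 = rng.lognormal(0.4, 1.0, 200)
        print(cohens_d(grp1, grp2))

        # rank-based alternative: estimate P(a grp2 value exceeds a grp1 value)
        u, _ = stats.mannwhitneyu(grp2, grp1, alternative="two-sided")
        print(u / (len(grp1) * len(grp2)))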

  19. Climate and Edaphic Controls on Humid Tropical Forest Tree Height

    NASA Astrophysics Data System (ADS)

    Yang, Y.; Saatchi, S. S.; Xu, L.

    2014-12-01

Uncertainty in the magnitude and spatial variation of forest carbon density in tropical regions is due to undersampling of forest structure by inventory plots and the lack of regional allometry to estimate carbon density from structure. Here we quantify the variation of tropical forest structure by using more than 2.5 million measurements of canopy height from systematic sampling of Geoscience Laser Altimeter System (GLAS) satellite observations between 2004 and 2008, and examine the climate and edaphic variables influencing the variations. We used the top canopy height of GLAS footprints (~0.25 ha) to grid the statistical mean and 90th percentile of samples at 0.5 degrees to capture the regional variability of large trees in the tropics. GLAS heights were also aggregated based on a stratification of tropical regions using soil, elevation, and forest types. Both approaches provided consistent patterns of statistically dominant large trees and of the least heterogeneity, both strong drivers of the distribution of high-biomass forests. Statistical models accounting for spatial autocorrelation suggest that climate, soil and spatial features together can explain more than 60% of the variation in observed tree height, while climate-only variables explain about one third of the first-order changes in tree height. Soil properties, including physical composition such as clay and sand content, chemical properties such as pH values and cation-exchange capacity, as well as biological variables such as organic matter, all present independent but statistically significant relationships to tree height variations. The results confirm other landscape and regional studies suggesting that soil fertility, geology and climate may jointly control a majority of the regional variation in forest structure in the pan-tropics, influencing both biomass stocks and dynamics. Consequently, other factors such as biotic and disturbance regimes, not included in this study, may have less influence on regional variations but strongly mediate landscape and small-scale forest structure and dynamics.

  20. 42 CFR 402.109 - Statistical sampling.

    Code of Federal Regulations, 2011 CFR

    2011-10-01

    ... 42 Public Health 2 2011-10-01 2011-10-01 false Statistical sampling. 402.109 Section 402.109... Statistical sampling. (a) Purpose. CMS or OIG may introduce the results of a statistical sampling study to... or caused to be presented. (b) Prima facie evidence. The results of the statistical sampling study...

  1. 42 CFR 402.109 - Statistical sampling.

    Code of Federal Regulations, 2010 CFR

    2010-10-01

    ... 42 Public Health 2 2010-10-01 2010-10-01 false Statistical sampling. 402.109 Section 402.109... Statistical sampling. (a) Purpose. CMS or OIG may introduce the results of a statistical sampling study to... or caused to be presented. (b) Prima facie evidence. The results of the statistical sampling study...

  2. Accuracy assessment of percent canopy cover, cover type, and size class

    Treesearch

    H. T. Schreuder; S. Bain; R. C. Czaplewski

    2003-01-01

    Truth for vegetation cover percent and type is obtained from very large-scale photography (VLSP), stand structure as measured by size classes, and vegetation types from a combination of VLSP and ground sampling. We recommend using the Kappa statistic with bootstrap confidence intervals for overall accuracy, and similarly bootstrap confidence intervals for percent...

  3. A Statistical Analysis of Data Used in Critical Decision Making by Secondary School Personnel.

    ERIC Educational Resources Information Center

    Dunn, Charleta J.; Kowitz, Gerald T.

    Guidance decisions depend on the validity of standardized tests and teacher judgment records as measures of student achievement. To test this validity, a sample of 400 high school juniors, randomly selected from two large Gulf Coas t area schools, were administered the Iowa Tests of Educational Development. The nine subtest scores and each…

  4. Limits on the Accuracy of Linking. Research Report. ETS RR-10-22

    ERIC Educational Resources Information Center

    Haberman, Shelby J.

    2010-01-01

Sampling errors limit the accuracy with which forms can be linked. Limitations on accuracy are especially important in testing programs in which a very large number of forms are employed. Standard inequalities in mathematical statistics may be used to establish lower bounds on the achievable linking accuracy. To illustrate results, a variety of…

  5. Automated Hypothesis Tests and Standard Errors for Nonstandard Problems with Description of Computer Package: A Draft.

    ERIC Educational Resources Information Center

    Lord, Frederic M.; Stocking, Martha

A general computer program is described that will compute asymptotic standard errors and carry out significance tests for an endless variety of (standard and) nonstandard large-sample statistical problems, without requiring the statistician to derive asymptotic standard error formulas. The program assumes that the observations have a multinormal…

  6. Intermittent electron density and temperature fluctuations and associated fluxes in the Alcator C-Mod scrape-off layer

    NASA Astrophysics Data System (ADS)

    Kube, R.; Garcia, O. E.; Theodorsen, A.; Brunner, D.; Kuang, A. Q.; LaBombard, B.; Terry, J. L.

    2018-06-01

The Alcator C-Mod mirror Langmuir probe system has been used to sample data time series of fluctuating plasma parameters in the outboard mid-plane far scrape-off layer. We present a statistical analysis of one-second-long time series of electron density, temperature, radial electric drift velocity and the corresponding particle and electron heat fluxes. These are sampled during stationary plasma conditions in an ohmically heated, lower single null diverted discharge. The electron density and temperature are strongly correlated and feature fluctuation statistics similar to the ion saturation current. Both electron density and temperature time series are dominated by intermittent, large-amplitude bursts with an exponential distribution of both burst amplitudes and waiting times between them. The characteristic time scale of the large-amplitude bursts is approximately 15 μs. Large-amplitude velocity fluctuations feature a slightly faster characteristic time scale and appear at a faster rate than electron density and temperature fluctuations. Describing these time series as a superposition of uncorrelated exponential pulses, we find that probability distribution functions, power spectral densities as well as auto-correlation functions of the data time series agree well with predictions from the stochastic model. The electron particle and heat fluxes present large-amplitude fluctuations. For this low-density plasma, the radial electron heat flux is dominated by convection, that is, by correlations of fluctuations in the electron density and radial velocity. Hot and dense blobs contribute only a minute fraction of the total fluctuation-driven heat flux.

  7. Statistical Models for the Analysis of Zero-Inflated Pain Intensity Numeric Rating Scale Data.

    PubMed

    Goulet, Joseph L; Buta, Eugenia; Bathulapalli, Harini; Gueorguieva, Ralitza; Brandt, Cynthia A

    2017-03-01

    Pain intensity is often measured in clinical and research settings using the 0 to 10 numeric rating scale (NRS). NRS scores are recorded as discrete values, and in some samples they may display a high proportion of zeroes and a right-skewed distribution. Despite this, statistical methods for normally distributed data are frequently used in the analysis of NRS data. We present results from an observational cross-sectional study examining the association of NRS scores with patient characteristics using data collected from a large cohort of 18,935 veterans in Department of Veterans Affairs care diagnosed with a potentially painful musculoskeletal disorder. The mean (variance) NRS pain was 3.0 (7.5), and 34% of patients reported no pain (NRS = 0). We compared the following statistical models for analyzing NRS scores: linear regression, generalized linear models (Poisson and negative binomial), zero-inflated and hurdle models for data with an excess of zeroes, and a cumulative logit model for ordinal data. We examined model fit, interpretability of results, and whether conclusions about the predictor effects changed across models. In this study, models that accommodate zero inflation provided a better fit than the other models. These models should be considered for the analysis of NRS data with a large proportion of zeroes. We examined and analyzed pain data from a large cohort of veterans with musculoskeletal disorders. We found that many reported no current pain on the NRS on the diagnosis date. We present several alternative statistical methods for the analysis of pain intensity data with a large proportion of zeroes. Published by Elsevier Inc.
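
    Zero-inflated count models of the kind compared in this study can be fit directly in statsmodels. A sketch on simulated NRS-like scores (the covariate and parameter values are invented placeholders, not the VA cohort's data):

        import numpy as np
        import statsmodels.api as sm
        from statsmodels.discrete.count_model import ZeroInflatedPoisson

        rng = np.random.default_rng(7)
        n = 2000
        age_z = rng.normal(size=n)   # standardized stand-in covariate
        X = sm.add_constant(age_z)

        # simulate NRS-like scores: a point mass at zero plus skewed counts
        structural_zero = rng.uniform(size=n) < 0.34
        counts = rng.poisson(np.exp(1.0 + 0.2 * age_z))
        nrs = np.where(structural_zero, 0, np.clip(counts, 0, 10))

        # zero-inflation and count parts both depend on the covariate
        fit = ZeroInflatedPoisson(nrs, X, exog_infl=X).fit(disp=False)
        print(fit.params)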

  8. Comparison of Two Methods for Estimating the Sampling-Related Uncertainty of Satellite Rainfall Averages Based on a Large Radar Data Set

    NASA Technical Reports Server (NTRS)

    Lau, William K. M. (Technical Monitor); Bell, Thomas L.; Steiner, Matthias; Zhang, Yu; Wood, Eric F.

    2002-01-01

    The uncertainty of rainfall estimated from averages of discrete samples collected by a satellite is assessed using a multi-year radar data set covering a large portion of the United States. The sampling-related uncertainty of rainfall estimates is evaluated for all combinations of 100 km, 200 km, and 500 km space domains, 1 day, 5 day, and 30 day rainfall accumulations, and regular sampling time intervals of 1 h, 3 h, 6 h, 8 h, and 12 h. These extensive analyses are combined to characterize the sampling uncertainty as a function of space and time domain, sampling frequency, and rainfall characteristics by means of a simple scaling law. Moreover, it is shown that both parametric and non-parametric statistical techniques of estimating the sampling uncertainty produce comparable results. Sampling uncertainty estimates, however, do depend on the choice of technique for obtaining them. They can also vary considerably from case to case, reflecting the great variability of natural rainfall, and should therefore be expressed in probabilistic terms. Rainfall calibration errors are shown to affect comparison of results obtained by studies based on data from different climate regions and/or observation platforms.

  9. Unbiased, scalable sampling of protein loop conformations from probabilistic priors.

    PubMed

    Zhang, Yajia; Hauser, Kris

    2013-01-01

Protein loops are flexible structures that are intimately tied to function, but understanding loop motion and generating loop conformation ensembles remain significant computational challenges. Discrete search techniques scale poorly to large loops, optimization and molecular dynamics techniques are prone to local minima, and inverse kinematics techniques can only incorporate structural preferences in an ad hoc fashion. This paper presents Sub-Loop Inverse Kinematics Monte Carlo (SLIKMC), a new Markov chain Monte Carlo algorithm for generating conformations of closed loops according to experimentally available, heterogeneous structural preferences. Our simulation experiments demonstrate that the method computes high-scoring conformations of large loops (>10 residues) orders of magnitude faster than standard Monte Carlo and discrete search techniques. Two new developments contribute to the scalability of the new method. First, structural preferences are specified via a probabilistic graphical model (PGM) that links conformation variables, spatial variables (e.g., atom positions), constraints and prior information in a unified framework. The method uses a sparse PGM that exploits locality of interactions between atoms and residues. Second, a novel method for sampling sub-loops is developed to generate statistically unbiased samples of probability densities restricted by loop-closure constraints. Numerical experiments confirm that SLIKMC generates conformation ensembles that are statistically consistent with specified structural preferences. Protein conformations with 100+ residues are sampled on standard PC hardware in seconds. Application to proteins involved in ion-binding demonstrates its potential as a tool for loop ensemble generation and missing structure completion.

  10. Unbiased, scalable sampling of protein loop conformations from probabilistic priors

    PubMed Central

    2013-01-01

Background: Protein loops are flexible structures that are intimately tied to function, but understanding loop motion and generating loop conformation ensembles remain significant computational challenges. Discrete search techniques scale poorly to large loops, optimization and molecular dynamics techniques are prone to local minima, and inverse kinematics techniques can only incorporate structural preferences in an ad hoc fashion. This paper presents Sub-Loop Inverse Kinematics Monte Carlo (SLIKMC), a new Markov chain Monte Carlo algorithm for generating conformations of closed loops according to experimentally available, heterogeneous structural preferences. Results: Our simulation experiments demonstrate that the method computes high-scoring conformations of large loops (>10 residues) orders of magnitude faster than standard Monte Carlo and discrete search techniques. Two new developments contribute to the scalability of the new method. First, structural preferences are specified via a probabilistic graphical model (PGM) that links conformation variables, spatial variables (e.g., atom positions), constraints and prior information in a unified framework. The method uses a sparse PGM that exploits locality of interactions between atoms and residues. Second, a novel method for sampling sub-loops is developed to generate statistically unbiased samples of probability densities restricted by loop-closure constraints. Conclusion: Numerical experiments confirm that SLIKMC generates conformation ensembles that are statistically consistent with specified structural preferences. Protein conformations with 100+ residues are sampled on standard PC hardware in seconds. Application to proteins involved in ion-binding demonstrates its potential as a tool for loop ensemble generation and missing structure completion. PMID:24565175

  11. ON THE STAR FORMATION PROPERTIES OF VOID GALAXIES

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Moorman, Crystal M.; Moreno, Jackeline; White, Amanda

    2016-11-10

We measure the star formation properties of two large samples of galaxies from the SDSS in large-scale cosmic voids on timescales of 10 and 100 Myr, using H α emission line strengths and GALEX FUV fluxes, respectively. The first sample consists of 109,818 optically selected galaxies. We find that void galaxies in this sample have higher specific star formation rates (SSFRs; star formation rates per unit stellar mass) than similar stellar mass galaxies in denser regions. The second sample is a subset of the optically selected sample containing 8070 galaxies with reliable H i detections from ALFALFA. For the full H i detected sample, SSFRs do not vary systematically with large-scale environment. However, investigating only the H i detected dwarf galaxies reveals a trend toward higher SSFRs in voids. Furthermore, we estimate the star formation rate per unit H i mass (known as the star formation efficiency; SFE) of a galaxy, as a function of environment. For the overall H i detected population, we notice no environmental dependence. Limiting the sample to dwarf galaxies still does not reveal a statistically significant difference between SFEs in voids versus walls. These results suggest that void environments, on average, provide a nurturing environment for dwarf galaxy evolution, allowing for higher specific star formation rates while forming stars with similar efficiencies to those in walls.

  12. Combining censored and uncensored data in a U-statistic: design and sample size implications for cell therapy research.

    PubMed

    Moyé, Lemuel A; Lai, Dejian; Jing, Kaiyan; Baraniuk, Mary Sarah; Kwak, Minjung; Penn, Marc S; Wu, Colon O

    2011-01-01

The assumptions that anchor large clinical trials are rooted in smaller, Phase II studies. In addition to specifying the target population, intervention delivery, and patient follow-up duration, physician-scientists who design these Phase II studies must select the appropriate response variables (endpoints). However, endpoint measures can be problematic. If the endpoint assesses the change in a continuous measure over time, then the occurrence of an intervening significant clinical event (SCE), such as death, can preclude the follow-up measurement. Additionally, the ideal continuous endpoint measurement may be contraindicated in a fraction of the study patients, a circumstance that requires a less precise substitution in this subset of participants. A score function that is based on the U-statistic can address both of these issues: (1) intercurrent SCEs, and (2) response variable ascertainments that use different measurements of different precision. The scoring statistic is easy to apply, clinically relevant, and provides flexibility for the investigators' prospective design decisions. Sample size and power formulations for this statistic are provided as functions of clinical event rates and effect size estimates that are easy for investigators to identify and discuss. Examples are provided from current cardiovascular cell therapy research.
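
    The flavor of such a pairwise score can be illustrated with a Finkelstein-Schoenfeld-style hierarchy — used here as an assumed stand-in, not necessarily the authors' exact score function — in which each treatment-control pair is compared first on the clinical event and only then on the continuous change:

        def pair_score(t, c):
            """t, c: (died, change) tuples; change is None if unmeasured."""
            t_died, t_chg = t
            c_died, c_chg = c
            if t_died != c_died:                # the clinical event dominates
                return -1 if t_died else 1
            if t_chg is None or c_chg is None:  # no usable continuous comparison
                return 0
            if t_chg == c_chg:
                return 0
            return 1 if t_chg > c_chg else -1

        def u_score(treated, control):
            """Sum of pairwise scores over all treatment-control pairs."""
            return sum(pair_score(t, c) for t in treated for c in control)

        treated = [(False, 5.2), (False, 1.0), (True, None), (False, 3.3)]
        control = [(False, 0.4), (True, None), (False, -2.0)]
        print(u_score(treated, control))  # positive totals favor treatment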

  13. An introduction to Bayesian statistics in health psychology.

    PubMed

    Depaoli, Sarah; Rus, Holly M; Clifton, James P; van de Schoot, Rens; Tiemensma, Jitske

    2017-09-01

    The aim of the current article is to provide a brief introduction to Bayesian statistics within the field of health psychology. Bayesian methods are increasing in prevalence in applied fields, and they have been shown in simulation research to improve the estimation accuracy of structural equation models, latent growth curve (and mixture) models, and hierarchical linear models. Likewise, Bayesian methods can be used with small sample sizes since they do not rely on large sample theory. In this article, we discuss several important components of Bayesian statistics as they relate to health-based inquiries. We discuss the incorporation and impact of prior knowledge into the estimation process and the different components of the analysis that should be reported in an article. We present an example implementing Bayesian estimation in the context of blood pressure changes after participants experienced an acute stressor. We conclude with final thoughts on the implementation of Bayesian statistics in health psychology, including suggestions for reviewing Bayesian manuscripts and grant proposals. We have also included an extensive amount of online supplementary material to complement the content presented here, including Bayesian examples using many different software programmes and an extensive sensitivity analysis examining the impact of priors.

  14. A Bayesian nonparametric method for prediction in EST analysis

    PubMed Central

    Lijoi, Antonio; Mena, Ramsés H; Prünster, Igor

    2007-01-01

Background: Expressed sequence tag (EST) analyses are a fundamental tool for gene identification in organisms. Given a preliminary EST sample from a certain library, several statistical prediction problems arise. In particular, it is of interest to estimate how many new genes can be detected in a future EST sample of given size and also to determine the gene discovery rate: these estimates represent the basis for deciding whether to proceed with sequencing the library and, in case of a positive decision, a guideline for selecting the size of the new sample. Such information is also useful for establishing sequencing efficiency in experimental design and for measuring the degree of redundancy of an EST library. Results: In this work we propose a Bayesian nonparametric approach for tackling statistical problems related to EST surveys. In particular, we provide estimates for: a) the coverage, defined as the proportion of unique genes in the library represented in the given sample of reads; b) the number of new unique genes to be observed in a future sample; c) the discovery rate of new genes as a function of the future sample size. The Bayesian nonparametric model we adopt conveys, in a statistically rigorous way, the available information into prediction. Our proposal has appealing properties over frequentist nonparametric methods, which become unstable when prediction is required for large future samples. EST libraries, previously studied with frequentist methods, are analyzed in detail. Conclusion: The Bayesian nonparametric approach we undertake yields valuable tools for gene capture and prediction in EST libraries. The estimators we obtain do not feature the kind of drawbacks associated with frequentist estimators and are reliable for any size of the additional sample. PMID:17868445

  15. Assessing the sustainable construction of large construction companies in Malaysia

    NASA Astrophysics Data System (ADS)

    Adewale, Bamgbade Jibril; Mohammed, Kamaruddeen Ahmed; Nasrun, Mohd Nawi Mohd

    2016-08-01

Considering the increasing concern for the consideration of sustainability issues in construction project delivery within the construction industry, this paper assesses the extent of sustainable construction among Malaysian large contractors, in order to ascertain the level of the industry's impacts on both the environment and society. Sustainable construction describes the construction industry's responsibility to efficiently utilise finite resources while also reducing construction impacts on both humans and the environment throughout the phases of construction. This study used proportionate stratified random sampling to conduct a field study with a sample of 172 contractors out of the 708 administered questionnaires. Data were collected from large contractors in the eleven states of peninsular Malaysia. Using a five-level rating scale (1 = Very Low; 2 = Low; 3 = Moderate; 4 = High; 5 = Very High) to describe the level of sustainable construction of Malaysian contractors based on previous studies, statistical analysis reveals that the environmental, social and economic sustainability of Malaysian large contractors is high.

  16. A measure of the signal-to-noise ratio of microarray samples and studies using gene correlations.

    PubMed

    Venet, David; Detours, Vincent; Bersini, Hugues

    2012-01-01

    The quality of gene expression data can vary dramatically from platform to platform, study to study, and sample to sample. As reliable statistical analysis rests on reliable data, determining such quality is of the utmost importance. Quality measures to spot problematic samples exist, but they are platform-specific, and cannot be used to compare studies. As a proxy for quality, we propose a signal-to-noise ratio for microarray data, the "Signal-to-Noise Applied to Gene Expression Experiments", or SNAGEE. SNAGEE is based on the consistency of gene-gene correlations. We applied SNAGEE to a compendium of 80 large datasets on 37 platforms, for a total of 24,380 samples, and assessed the signal-to-noise ratio of studies and samples. This allowed us to discover serious issues with three studies. We show that signal-to-noise ratios of both studies and samples are linked to the statistical significance of the biological results. SNAGEE proves an effective way to measure data quality for most types of gene expression studies, and it often outperforms existing techniques. Furthermore, SNAGEE is platform-independent and does not require raw data files. The SNAGEE R package is available in BioConductor.
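    SNAGEE itself ships as an R/BioConductor package; the Python fragment below is only a schematic re-expression of the stated idea -- scoring a dataset by the consistency of its gene-gene correlations with a reference -- and is not the package's algorithm or API.

```python
import numpy as np

def correlation_consistency(expr, ref_corr):
    """Crude signal-to-noise proxy: correlate a study's gene-gene
    correlations with a reference correlation matrix.
    expr: samples x genes matrix; ref_corr: genes x genes reference."""
    study_corr = np.corrcoef(expr, rowvar=False)
    iu = np.triu_indices_from(study_corr, k=1)
    return np.corrcoef(study_corr[iu], ref_corr[iu])[0, 1]

rng = np.random.default_rng(0)
ref = np.corrcoef(rng.normal(size=(200, 50)), rowvar=False)  # fake reference
noisy = rng.normal(size=(30, 50))                            # pure-noise study
print(correlation_consistency(noisy, ref))                   # near 0 for noise
```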

  17. Large ensemble modeling of the last deglacial retreat of the West Antarctic Ice Sheet: comparison of simple and advanced statistical techniques

    NASA Astrophysics Data System (ADS)

    Pollard, David; Chang, Won; Haran, Murali; Applegate, Patrick; DeConto, Robert

    2016-05-01

    A 3-D hybrid ice-sheet model is applied to the last deglacial retreat of the West Antarctic Ice Sheet over the last ˜ 20 000 yr. A large ensemble of 625 model runs is used to calibrate the model to modern and geologic data, including reconstructed grounding lines, relative sea-level records, elevation-age data and uplift rates, with an aggregate score computed for each run that measures overall model-data misfit. Two types of statistical methods are used to analyze the large-ensemble results: simple averaging weighted by the aggregate score, and more advanced Bayesian techniques involving Gaussian process-based emulation and calibration, and Markov chain Monte Carlo. The analyses provide sea-level-rise envelopes with well-defined parametric uncertainty bounds, but the simple averaging method only provides robust results with full-factorial parameter sampling in the large ensemble. Results for best-fit parameter ranges and envelopes of equivalent sea-level rise with the simple averaging method agree well with the more advanced techniques. Best-fit parameter ranges confirm earlier values expected from prior model tuning, including large basal sliding coefficients on modern ocean beds.
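    A hedged sketch of the simpler of the two analysis routes, score-weighted ensemble averaging, is given below; the exponential form of the weight and all numbers are illustrative assumptions, not the paper's calibration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_runs = 625
sea_level = rng.normal(3.3, 1.0, n_runs)   # equivalent sea-level rise (m), fake
misfit = rng.gamma(2.0, 1.0, n_runs)       # aggregate model-data misfit, fake

w = np.exp(-misfit)                        # smaller misfit -> larger weight
w /= w.sum()

mean = np.sum(w * sea_level)
# Weighted 5-95% envelope via the weighted empirical CDF.
order = np.argsort(sea_level)
cdf = np.cumsum(w[order])
lo, hi = np.interp([0.05, 0.95], cdf, sea_level[order])
print(f"weighted mean {mean:.2f} m, 5-95% envelope [{lo:.2f}, {hi:.2f}] m")
```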

  18. A note about Gaussian statistics on a sphere

    NASA Astrophysics Data System (ADS)

    Chave, Alan D.

    2015-11-01

    The statistics of directional data on a sphere can be modelled either using the Fisher distribution that is conditioned on the magnitude being unity, in which case the sample space is confined to the unit sphere, or using the latitude-longitude marginal distribution derived from a trivariate Gaussian model that places no constraint on the magnitude. These two distributions are derived from first principles and compared. The Fisher distribution more closely approximates the uniform distribution on a sphere for a given small value of the concentration parameter, while the latitude-longitude marginal distribution is always slightly larger than the Fisher distribution at small off-axis angles for large values of the concentration parameter. Asymptotic analysis shows that the two distributions only become equivalent in the limit of large concentration parameter and very small off-axis angle.
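    For reference, the standard form of the Fisher density in off-axis angle θ, and its large-concentration small-angle limit (written from the usual definitions, not transcribed from the note), are:

```latex
\[
  f(\theta) \;=\; \frac{\kappa}{2\sinh\kappa}\, e^{\kappa\cos\theta}\,\sin\theta ,
  \qquad 0 \le \theta \le \pi ,
\]
\[
  f(\theta) \;\approx\; \kappa\,\theta\, e^{-\kappa\theta^{2}/2}
  \qquad \text{for } \kappa \gg 1,\ \theta \ll 1 .
\]
```

    The limiting form is a Rayleigh density with variance 1/κ, consistent with the note's conclusion that the two distributions agree only at large concentration parameter and small off-axis angle.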

  19. Selection effects and binary galaxy velocity differences

    NASA Technical Reports Server (NTRS)

    Schneider, Stephen E.; Salpeter, Edwin E.

    1990-01-01

    Measurements of the velocity differences (delta v's) in pairs of galaxies from large statistical samples have often been used to estimate the average masses of binary galaxies. A basic prediction of these models is that the delta v distribution ought to decline monotonically. However, some peculiar aspects of the kinematics have been uncovered, with an anomalous preference for delta v ≈ 72 km s^-1 appearing to be present in the data. The authors examine a large sample of binary galaxies with accurate redshift measurements and confirm that the distribution of delta v's appears to be non-monotonic with peaks at 0 and ≈ 72 km s^-1. The authors suggest that the non-zero peak results from the isolation criteria employed in defining samples of binaries and that it indicates there are two populations of binary orbits contributing to the observed delta v distribution.

  20. Anxiety and depressive symptoms related to parenthood in a large Norwegian community sample: the HUNT2 study.

    PubMed

    Rimehaug, Tormod; Wallander, Jan

    2010-07-01

    The study compared anxiety and depression prevalence between parents and non-parents in a society with family- and parenthood-friendly social politics, controlling for family status and family history, age, gender, education and social class. All participants aged 30-49 (N = 24,040) in the large, non-sampled Norwegian HUNT2 community health study completed the Hospital Anxiety and Depression Scales. The slightly elevated anxiety and depression among non-parents compared to parents in the complete sample was not confirmed as statistically significant within any subgroups. Married parents and (previously unmarried) cohabiting parents did not differ, both showing low anxiety and depression prevalence. Anxiety was associated with single parenthood, living alone or being divorced, while elevated depression was found only among those living alone. Burdening selection and cultural/political context are suggested as interpretative perspectives on the contextual and personal influences on the complex relationship between parenthood and mental health.

  1. Disentangling Vulnerabilities from Outcomes: Distinctions between Trait Affect and Depressive Symptoms in Adolescent and Adult Samples

    PubMed Central

    Harding, Kaitlin A.; Willey, Brittany; Ahles, Joshua; Mezulis, Amy

    2016-01-01

    Background Trait negative affect and trait positive affect are affective vulnerabilities to depressive symptoms in adolescence and adulthood. While trait affect and the state affect characteristic of depressive symptoms are proposed to be theoretically distinct, no studies have established that these constructs are statistically distinct. Therefore, the purpose of the current study was to determine whether the trait affect (e.g. temperament dimensions) that predicts depressive symptoms and the state affect characteristic of depressive symptoms are statistically distinct among early adolescents and adults. We hypothesized that trait negative affect, trait positive affect, and depressive symptoms would represent largely distinct factors in both samples. Method Participants were 268 early adolescents (53.73% female) and 321 young adults (70.09% female) who completed self-report measures of demographic information, trait affect, and depressive symptoms. Results Principal axis factoring with oblique rotation for both samples indicated distinct adolescent factor loadings and overlapping adult factor loadings. Confirmatory factor analyses in both samples supported distinct but related relationships between trait NA, trait PA, and depressive symptoms. Limitations Study limitations include our cross-sectional design that prevented examination of self-reported fluctuations in trait affect and depressive symptoms and the unknown potential effects of self-report biases among adolescents and adults. Conclusions Findings support existing theoretical distinctions between adolescent constructs but highlight a need to revise or remove items to distinguish measurements of adult trait affect and depressive symptoms. Adolescent trait affect and depressive symptoms are statistically distinct, but adult trait affect and depressive symptoms statistically overlap and warrant further consideration. PMID:27085163

  2. Environmental Monitoring and Assessment Program Western Pilot Project - Information about selected fish and macroinvertebrates sampled from North Dakota perennial streams, 2000-2003

    USGS Publications Warehouse

    Vining, Kevin C.; Lundgren, Robert F.

    2008-01-01

    Sixty-five sampling sites, selected by a statistical design to represent lengths of perennial streams in North Dakota, were chosen to be sampled for fish and aquatic insects (macroinvertebrates) to establish unbiased baseline data. Channel catfish and common carp were the most abundant game and large fish species in the Cultivated Plains and Rangeland Plains, respectively. Blackflies were present in more than 50 percent of stream lengths sampled in the State; mayflies and caddisflies were present in more than 80 percent. Dragonflies were present in a greater percentage of stream lengths in the Rangeland Plains than in the Cultivated Plains.

  3. Approximate sample sizes required to estimate length distributions

    USGS Publications Warehouse

    Miranda, L.E.

    2007-01-01

    The sample sizes required to estimate fish length were determined by bootstrapping from reference length distributions. Depending on population characteristics and species-specific maximum lengths, 1-cm length-frequency histograms required 375-1,200 fish to estimate within 10% with 80% confidence, 2.5-cm histograms required 150-425 fish, proportional stock density required 75-140 fish, and mean length required 75-160 fish. In general, smaller species, smaller populations, populations with higher mortality, and simpler length statistics required fewer samples. Indices that require low sample sizes may be suitable for monitoring population status, and when large changes in length are evident, additional sampling effort may be allocated to more precisely define length status with more informative estimators. ?? Copyright by the American Fisheries Society 2007.
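    The bootstrap logic can be sketched in a few lines; the reference length distribution, the candidate sample sizes, and the use of mean length as the statistic are illustrative assumptions rather than the paper's reference populations.

```python
import numpy as np

def required_n(ref_lengths, candidate_ns, tol=0.10, conf=0.80, B=2000, seed=0):
    """Smallest n whose bootstrap estimates of mean length fall within
    `tol` of the reference mean in at least `conf` of B replicates."""
    rng = np.random.default_rng(seed)
    target = ref_lengths.mean()
    for n in candidate_ns:
        est = rng.choice(ref_lengths, size=(B, n), replace=True).mean(axis=1)
        if np.mean(np.abs(est - target) / target <= tol) >= conf:
            return n
    return None

ref = np.random.default_rng(42).gamma(4.0, 8.0, 5000)  # fake length data (cm)
print(required_n(ref, [25, 50, 75, 100, 150, 200]))
```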

  4. Statistical Thermodynamic Approach to Vibrational Solitary Waves in Acetanilide

    NASA Astrophysics Data System (ADS)

    Vasconcellos, Áurea R.; Mesquita, Marcus V.; Luzzi, Roberto

    1998-03-01

    We analyze the behavior of the macroscopic thermodynamic state of polymers, centering on acetanilide. The nonlinear equations of evolution for the populations and the statistically averaged field amplitudes of CO-stretching modes are derived. The existence of excitations of the solitary wave type is evidenced. The infrared spectrum is calculated and compared with the experimental data of Careri et al. [Phys. Rev. Lett. 51, 104 (1983)], resulting in a good agreement. We also consider the situation of a nonthermally highly excited sample, predicting the occurrence of a large increase in the lifetime of the solitary wave excitation.

  5. Pearson's chi-square test and rank correlation inferences for clustered data.

    PubMed

    Shih, Joanna H; Fay, Michael P

    2017-09-01

    Pearson's chi-square test has been widely used in testing for association between two categorical responses. Spearman rank correlation and Kendall's tau are often used for measuring and testing association between two continuous or ordered categorical responses. However, the established statistical properties of these tests are only valid when each pair of responses is independent, that is, when each sampling unit has only one pair of responses. When each sampling unit consists of a cluster of paired responses, the assumption of independent pairs is violated. In this article, we apply the within-cluster resampling technique to U-statistics to form new tests and rank-based correlation estimators for possibly tied clustered data. We develop large sample properties of the new proposed tests and estimators and evaluate their performance by simulations. The proposed methods are applied to a data set collected from a PET/CT imaging study for illustration. Published 2017. This article is a U.S. Government work and is in the public domain in the USA.
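    A minimal sketch of the within-cluster resampling idea for Spearman's rho follows (point estimation only; the paper's variance estimator and tie handling are not reproduced). The simulated clusters are invented.

```python
import numpy as np
from scipy.stats import spearmanr

def wcr_spearman(clusters, Q=1000, seed=0):
    """Within-cluster resampling: draw one (x, y) pair per cluster,
    compute Spearman's rho, and average the estimates over Q resamples.
    clusters: list of (n_i, 2) arrays of paired responses."""
    rng = np.random.default_rng(seed)
    rhos = []
    for _ in range(Q):
        picks = np.array([c[rng.integers(len(c))] for c in clusters])
        rho, _ = spearmanr(picks[:, 0], picks[:, 1])
        rhos.append(rho)
    return float(np.mean(rhos))

rng = np.random.default_rng(1)
clusters = [rng.normal(size=(rng.integers(2, 6), 2)) + rng.normal()
            for _ in range(40)]   # shared cluster effect induces correlation
print(wcr_spearman(clusters))
```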

  6. Statistical considerations in monitoring birds over large areas

    USGS Publications Warehouse

    Johnson, D.H.

    2000-01-01

    The proper design of a monitoring effort depends primarily on the objectives desired, constrained by the resources available to conduct the work. Typically, managers have numerous objectives, such as determining abundance of the species, detecting changes in population size, evaluating responses to management activities, and assessing habitat associations. A design that is optimal for one objective will likely not be optimal for others. Careful consideration of the importance of the competing objectives may lead to a design that adequately addresses the priority concerns, although it may not be optimal for any individual objective. Poor design or inadequate sample sizes may result in such weak conclusions that the effort is wasted. Statistical expertise can be used at several stages, such as estimating power of certain hypothesis tests, but is perhaps most useful in fundamental considerations of describing objectives and designing sampling plans.
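    One concrete way statistical expertise enters such designs is simulation-based power estimation. The sketch below estimates, under invented assumptions (log-linear decline, lognormal sampling noise), the power of a simple trend test for a monitoring series; it is illustrative only and not from the cited work.

```python
import numpy as np
from scipy.stats import linregress

def trend_power(n_years=10, annual_decline=0.03, cv=0.25,
                alpha=0.05, sims=2000, seed=0):
    """Monte-Carlo power of a log-linear regression to detect a decline."""
    rng = np.random.default_rng(seed)
    years = np.arange(n_years)
    detected = 0
    for _ in range(sims):
        index = 100 * (1 - annual_decline) ** years   # true population index
        obs = index * rng.lognormal(0.0, cv, n_years) # sampling noise
        fit = linregress(years, np.log(obs))
        detected += (fit.pvalue < alpha) and (fit.slope < 0)
    return detected / sims

print(trend_power())   # power to detect a 3%/yr decline over 10 years
```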

  7. Bayesian evaluation of effect size after replicating an original study

    PubMed Central

    van Aert, Robbie C. M.; van Assen, Marcel A. L. M.

    2017-01-01

    The vast majority of published results in the literature are statistically significant, which raises concerns about their reliability. The Reproducibility Project Psychology (RPP) and Experimental Economics Replication Project (EE-RP) both replicated a large number of published studies in psychology and economics. Both the original study and its replication were statistically significant for 36.1% of study pairs in RPP and 68.8% in EE-RP, suggesting many null effects among the replicated studies. However, evidence in favor of the null hypothesis cannot be examined with null hypothesis significance testing. We developed a Bayesian meta-analysis method called snapshot hybrid that is easy to use and understand and quantifies the amount of evidence in favor of a zero, small, medium and large effect. The method computes posterior model probabilities for a zero, small, medium, and large effect and adjusts for publication bias by taking into account that the original study is statistically significant. We first analytically approximate the method's performance, and demonstrate the necessity to control for the original study's significance to enable the accumulation of evidence for a true zero effect. Then we applied the method to the data of RPP and EE-RP, showing that the underlying effect sizes of the included studies in EE-RP are generally larger than in RPP, but that the sample sizes of especially the included studies in RPP are often too small to draw definite conclusions about the true effect size. We also illustrate how snapshot hybrid can be used to determine the required sample size of the replication akin to power analysis in null hypothesis significance testing and present an easy-to-use web application (https://rvanaert.shinyapps.io/snapshot/) and R code for applying the method. PMID:28388646
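    A simplified re-expression of the core computation -- posterior model probabilities over four candidate effects, with the original study's likelihood truncated at its significance threshold -- is sketched below. The effect sizes, standard errors, and equal-prior choice are illustrative; this is not the authors' code or their exact model.

```python
import numpy as np
from scipy.stats import norm

def snapshot_probs(y_o, se_o, y_r, se_r, effects=(0.0, 0.2, 0.5, 0.8)):
    """Posterior model probabilities for zero/small/medium/large effects.
    The original estimate y_o is conditioned on being significant
    (its density is truncated below at 1.96 * se_o); the replication
    estimate y_r enters untruncated. Equal prior model probabilities."""
    crit = 1.96 * se_o
    probs = []
    for mu in effects:
        lik_o = norm.pdf(y_o, mu, se_o) / norm.sf(crit, mu, se_o)
        lik_r = norm.pdf(y_r, mu, se_r)
        probs.append(lik_o * lik_r)
    probs = np.array(probs)
    return probs / probs.sum()

# Significant original, small replication estimate: evidence shifts to zero/small.
print(snapshot_probs(y_o=0.45, se_o=0.15, y_r=0.10, se_r=0.12))
```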

  8. Immunochip Analyses of Epistasis in Rheumatoid Arthritis Confirm Multiple Interactions within MHC and Suggest Novel Non-MHC Epistatic Signals.

    PubMed

    Wei, Wen-Hua; Loh, Chia-Yin; Worthington, Jane; Eyre, Stephen

    2016-05-01

    Studies of statistical gene-gene interactions (epistasis) have been limited by the statistical and computational difficulty of analyzing sample sizes large enough to provide sufficient power. Three large Immunochip datasets from cohort samples recruited in the United Kingdom, United States, and Sweden with European ancestry were used to examine epistasis in rheumatoid arthritis (RA). A full pairwise search was conducted in the UK cohort using a high-throughput tool and the resultant significant epistatic signals were tested for replication in the United States and Swedish cohorts. A forward selection approach was applied to remove redundant signals, while conditioning on the preidentified additive effects. We detected abundant genome-wide significant (p < 1.0e-13) epistatic signals, all within the MHC region. These signals were reduced substantially, but a proportion remained significant (p < 1.0e-03) in conditional tests. We identified 11 independent epistatic interactions across the entire MHC, each explaining on average 0.12% of the phenotypic variance, nearly all replicated in both replication cohorts. We also identified non-MHC epistatic interactions between the RA susceptibility loci LOC100506023 and IRF5 with Immunochip-wide significance (p < 1.1e-08) and between two neighboring single-nucleotide polymorphisms near PTPN22 that were in low linkage disequilibrium and showed an independent interaction (p < 1.0e-05). Both non-MHC epistatic interactions were statistically replicated with a similar interaction pattern in the US cohort only. There are multiple but relatively weak interactions independent of the additive effects in RA, and a larger sample is required to confidently assign additional non-MHC epistasis.
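    The abstract does not specify the test statistic; a generic 1-degree-of-freedom likelihood-ratio test for a SNP-SNP interaction in a logistic model is one common formulation and is sketched below on simulated genotypes (this is not the high-throughput tool used in the study).

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

def epistasis_lrt(y, g1, g2):
    """1-df likelihood-ratio test for SNP-SNP interaction on disease
    status y (0/1), with additive genotype codings g1, g2 in {0, 1, 2}."""
    base = np.column_stack([np.ones_like(g1, dtype=float), g1, g2])
    full = np.column_stack([base, g1 * g2])
    ll0 = sm.Logit(y, base).fit(disp=0).llf
    ll1 = sm.Logit(y, full).fit(disp=0).llf
    return chi2.sf(2 * (ll1 - ll0), df=1)

rng = np.random.default_rng(0)
g1, g2 = rng.integers(0, 3, 2000), rng.integers(0, 3, 2000)
y = rng.binomial(1, 1 / (1 + np.exp(-(-1 + 0.3 * g1 * g2))))  # true interaction
print(epistasis_lrt(y, g1, g2))
```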

  9. FIR statistics of paired galaxies

    NASA Technical Reports Server (NTRS)

    Sulentic, Jack W.

    1990-01-01

    Much progress has been made in understanding the effects of interaction on galaxies (see reviews in this volume by Heckman and Kennicutt). Evidence for enhanced emission from galaxies in pairs first emerged in the radio (Sulentic 1976) and optical (Larson and Tinsley 1978) domains. Results in the far infrared (FIR) lagged behind until the advent of the Infrared Astronomy Satellite (IRAS). The last five years have seen numerous FIR studies of optical and IR selected samples of interacting galaxies (e.g., Cutri and McAlary 1985; Joseph and Wright 1985; Kennicutt et al. 1987; Haynes and Herter 1988). Despite all of this work, there are still contradictory ideas about the level and, even, the reality of an FIR enhancement in interacting galaxies. Much of the confusion originates in differences between the galaxy samples that were studied (i.e., optical morphology and redshift coverage). Here, the authors report on a study of the FIR detection properties for a large sample of interacting galaxies and a matching control sample. They focus on the distance independent detection fraction (DF) statistics of the sample. The results prove useful in interpreting the previously published work. A clarification of the phenomenology provides valuable clues about the physics of the FIR enhancement in galaxies.

  10. Galaxy Evolution Studies with the SPace IR Telescope for Cosmology and Astrophysics (SPICA): The Power of IR Spectroscopy

    NASA Astrophysics Data System (ADS)

    Spinoglio, L.; Alonso-Herrero, A.; Armus, L.; Baes, M.; Bernard-Salas, J.; Bianchi, S.; Bocchio, M.; Bolatto, A.; Bradford, C.; Braine, J.; Carrera, F. J.; Ciesla, L.; Clements, D. L.; Dannerbauer, H.; Doi, Y.; Efstathiou, A.; Egami, E.; Fernández-Ontiveros, J. A.; Ferrara, A.; Fischer, J.; Franceschini, A.; Gallerani, S.; Giard, M.; González-Alfonso, E.; Gruppioni, C.; Guillard, P.; Hatziminaoglou, E.; Imanishi, M.; Ishihara, D.; Isobe, N.; Kaneda, H.; Kawada, M.; Kohno, K.; Kwon, J.; Madden, S.; Malkan, M. A.; Marassi, S.; Matsuhara, H.; Matsuura, M.; Miniutti, G.; Nagamine, K.; Nagao, T.; Najarro, F.; Nakagawa, T.; Onaka, T.; Oyabu, S.; Pallottini, A.; Piro, L.; Pozzi, F.; Rodighiero, G.; Roelfsema, P.; Sakon, I.; Santini, P.; Schaerer, D.; Schneider, R.; Scott, D.; Serjeant, S.; Shibai, H.; Smith, J.-D. T.; Sobacchi, E.; Sturm, E.; Suzuki, T.; Vallini, L.; van der Tak, F.; Vignali, C.; Yamada, T.; Wada, T.; Wang, L.

    2017-11-01

    IR spectroscopy in the range 12-230 μm with the SPace IR telescope for Cosmology and Astrophysics (SPICA) will reveal the physical processes governing the formation and evolution of galaxies and black holes through cosmic time, bridging the gap between the James Webb Space Telescope and the upcoming Extremely Large Telescopes at shorter wavelengths and the Atacama Large Millimeter Array at longer wavelengths. The SPICA, with its 2.5-m telescope actively cooled to below 8 K, will obtain the first spectroscopic determination, in the mid-IR rest-frame, of both the star-formation rate and black hole accretion rate histories of galaxies, reaching lookback times of 12 Gyr, for large statistically significant samples. Densities, temperatures, radiation fields, and gas-phase metallicities will be measured in dust-obscured galaxies and active galactic nuclei, sampling a large range in mass and luminosity, from faint local dwarf galaxies to luminous quasars in the distant Universe. Active galactic nuclei and starburst feedback and feeding mechanisms in distant galaxies will be uncovered through detailed measurements of molecular and atomic line profiles. The SPICA's large-area deep spectrophotometric surveys will provide mid-IR spectra and continuum fluxes for unbiased samples of tens of thousands of galaxies, out to redshifts of z ∼ 6.

  11. Chemometric and multivariate statistical analysis of time-of-flight secondary ion mass spectrometry spectra from complex Cu-Fe sulfides.

    PubMed

    Kalegowda, Yogesh; Harmer, Sarah L

    2012-03-20

    Time-of-flight secondary ion mass spectrometry (TOF-SIMS) spectra of mineral samples are complex, comprised of large mass ranges and many peaks. Consequently, characterization and classification analysis of these systems is challenging. In this study, different chemometric and statistical data evaluation methods, based on monolayer sensitive TOF-SIMS data, have been tested for the characterization and classification of copper-iron sulfide minerals (chalcopyrite, chalcocite, bornite, and pyrite) at different flotation pulp conditions (feed, conditioned feed, and Eh modified). The complex mass spectral data sets were analyzed using the following chemometric and statistical techniques: principal component analysis (PCA); principal component-discriminant functional analysis (PC-DFA); soft independent modeling of class analogy (SIMCA); and k-Nearest Neighbor (k-NN) classification. PCA was found to be an important first step in multivariate analysis, providing insight into both the relative grouping of samples and the elemental/molecular basis for those groupings. For samples exposed to oxidative conditions (at Eh ~430 mV), each technique (PCA, PC-DFA, SIMCA, and k-NN) was found to produce excellent classification. For samples at reductive conditions (at Eh ~ -200 mV SHE), k-NN and SIMCA produced the most accurate classification. Phase identification of particles that contain the same elements but a different crystal structure in a mixed multimetal mineral system has been achieved.
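    A hedged sketch of one of the named pipelines -- PCA for dimension reduction followed by k-NN classification -- is shown below on random stand-in spectra; the component count, neighbor count, and data are invented, and the study's preprocessing is not reproduced.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Fake stand-ins for TOF-SIMS spectra: rows = samples, columns = peak
# intensities; labels = mineral phases.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 500))
y = np.repeat(["chalcopyrite", "chalcocite", "bornite", "pyrite"], 10)

clf = make_pipeline(StandardScaler(), PCA(n_components=10),
                    KNeighborsClassifier(n_neighbors=3))
clf.fit(X, y)
print(clf.score(X, y))   # training accuracy; use held-out spectra in practice
```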

  12. A Future Moon Mission: Curatorial Statistics on Regolith Fragments Applicable to Sample Collection by Raking

    NASA Technical Reports Server (NTRS)

    Allton, J. H.; Bevill, T. J.

    2003-01-01

    The strategy of raking rock fragments from the lunar regolith as a means of acquiring representative samples has wide support due to science return, spacecraft simplicity (reliability) and economy [3, 4, 5]. While there exists widespread agreement that raking or sieving the bulk regolith is good strategy, there is lively discussion about the minimum sample size. Advocates of consor-tium studies desire fragments large enough to support petrologic and isotopic studies. Fragments from 5 to 10 mm are thought adequate [4, 5]. Yet, Jolliff et al. [6] demonstrated use of 2-4 mm fragments as repre-sentative of larger rocks. Here we make use of cura-torial records and sample catalogs to give a different perspective on minimum sample size for a robotic sample collector.

  13. An automated baseline correction protocol for infrared spectra of atmospheric aerosols collected on polytetrafluoroethylene (Teflon) filters

    NASA Astrophysics Data System (ADS)

    Kuzmiakova, Adele; Dillner, Ann M.; Takahama, Satoshi

    2016-06-01

    A growing body of research on statistical applications for characterization of atmospheric aerosol Fourier transform infrared (FT-IR) samples collected on polytetrafluoroethylene (PTFE) filters (e.g., Russell et al., 2011; Ruthenburg et al., 2014) and a rising interest in analyzing FT-IR samples collected by air quality monitoring networks call for an automated PTFE baseline correction solution. The existing polynomial technique (Takahama et al., 2013) is not scalable to a project with a large number of aerosol samples because it contains many parameters and requires expert intervention. Therefore, the question of how to develop an automated method for baseline correcting hundreds to thousands of ambient aerosol spectra given the variability in both environmental mixture composition and PTFE baselines remains. This study approaches the question by detailing the statistical protocol, which allows for the precise definition of analyte and background subregions, applies nonparametric smoothing splines to reproduce sample-specific PTFE variations, and integrates performance metrics from atmospheric aerosol and blank samples alike in the smoothing parameter selection. Referencing 794 atmospheric aerosol samples from seven Interagency Monitoring of PROtected Visual Environment (IMPROVE) sites collected during 2011, we start by identifying key FT-IR signal characteristics, such as non-negative absorbance or analyte segment transformation, to capture sample-specific transitions between background and analyte. While referring to qualitative properties of PTFE background, the goal of smoothing splines interpolation is to learn the baseline structure in the background region to predict the baseline structure in the analyte region. We then validate the model by comparing smoothing splines baseline-corrected spectra with uncorrected and polynomial baseline (PB)-corrected equivalents via three statistical applications: (1) clustering analysis, (2) functional group quantification, and (3) thermal optical reflectance (TOR) organic carbon (OC) and elemental carbon (EC) predictions. The discrepancy rate for a four-cluster solution is 10 %. For all functional groups but carboxylic COH the discrepancy is ≤ 10 %. Performance metrics obtained from TOR OC and EC predictions (R2 ≥ 0.94, bias ≤ 0.01 µg m-3, and error ≤ 0.04 µg m-3) are on a par with those obtained from uncorrected and PB-corrected spectra. The proposed protocol leads to visually and analytically similar estimates to those generated by the polynomial method. More importantly, the automated solution allows us and future users to evaluate its analytical reproducibility while minimizing reducible user bias. We anticipate the protocol will enable FT-IR researchers and data analysts to quickly and reliably analyze a large amount of data and connect them to a variety of available statistical learning methods to be applied to analyte absorbances isolated in atmospheric aerosol samples.
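    The central idea -- fit a smoothing spline to analyte-free subregions and predict the baseline under the analyte bands -- can be sketched with scipy; the window boundaries, smoothing factor, and toy spectrum below are assumptions, not the protocol's tuned values.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def spline_baseline_correct(wavenumber, absorbance, background_mask, s=0.01):
    """Fit a smoothing spline to background-only points, predict the
    baseline across the whole spectrum, and subtract it."""
    spl = UnivariateSpline(wavenumber[background_mask],
                           absorbance[background_mask], s=s)
    return absorbance - spl(wavenumber)

# Toy spectrum: curved PTFE-like baseline plus one analyte band near 2900 1/cm.
wn = np.linspace(1500, 4000, 1200)
baseline = 1e-7 * (wn - 2700) ** 2 + 0.02
spec = baseline + 0.1 * np.exp(-((wn - 2900) / 40) ** 2)
bg = (wn < 2700) | (wn > 3100)          # assumed analyte-free subregions
corrected = spline_baseline_correct(wn, spec, bg)
print(corrected.max())                   # band height survives, baseline removed
```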

  14. Predictability of Circulation Transitions (Observed and Modeled): Non-diffusive Dynamics, Markov Chains and Error Growth.

    NASA Astrophysics Data System (ADS)

    Straus, D. M.

    2006-12-01

    The transitions between portions of the state space of the large-scale flow are studied from daily wintertime data over the Pacific North America region using the NCEP reanalysis data set (54 winters) and very large suites of hindcasts made with the COLA atmospheric GCM with observed SST (55 members for each of 18 winters). The partition of the large-scale state space is guided by cluster analysis, whose statistical significance and relationship to SST are reviewed (Straus and Molteni, 2004; Straus, Corti and Molteni, 2006). The global nature of the flow through state space is studied using Markov chains (Crommelin, 2004). In particular, the non-diffusive part of the flow is contrasted in nature (small data sample) and the AGCM (large data sample). The intrinsic error growth associated with different portions of the state space is studied through sets of identical-twin AGCM simulations. The goal is to obtain realistic estimates of predictability times for large-scale transitions that should be useful in long-range forecasting.
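    In the spirit of the cited Markov-chain analysis (Crommelin, 2004), one can estimate a transition matrix from a sequence of cluster labels and isolate the antisymmetric (non-diffusive, cyclic) part of the probability flux. The sketch below uses random labels as a stand-in for the real cluster sequence.

```python
import numpy as np

def transition_matrix(labels, k):
    """Row-stochastic transition matrix estimated from a sequence of
    cluster labels in {0, ..., k-1}."""
    C = np.zeros((k, k))
    for a, b in zip(labels[:-1], labels[1:]):
        C[a, b] += 1
    return C / C.sum(axis=1, keepdims=True)

labels = np.random.default_rng(0).integers(0, 4, 5000)  # stand-in cluster states
P = transition_matrix(labels, 4)
pi = np.linalg.matrix_power(P, 200)[0]   # approximate stationary distribution
F = pi[:, None] * P                      # probability flux i -> j
nondiffusive = 0.5 * (F - F.T)           # antisymmetric (cyclic) component
print(np.abs(nondiffusive).max())        # ~0 for a reversible (diffusive) chain
```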

  15. 40Ar/39Ar technique of K-Ar dating: a comparison with the conventional technique

    USGS Publications Warehouse

    Brent, Dalrymple G.; Lanphere, M.A.

    1971-01-01

    K-Ar ages have been determined by the 40Ar/39Ar total fusion technique on 19 terrestrial samples whose conventional K-Ar ages range from 3.4 m.y. to nearly 1700 m.y. Sample materials included biotite, muscovite, sanidine, adularia, plagioclase, hornblende, actinolite, alunite, dacite, and basalt. For 18 samples there are no significant differences at the 95% confidence level between the K-Ar ages obtained by these two techniques; for one sample the difference is 4.3% and is statistically significant. For the neutron doses used in these experiments (≈4 × 10^18 nvt) it appears that corrections for interfering Ca- and K-derived Ar isotopes can be made without significant loss of precision for samples with K/Ca > 1 as young as about 5 × 10^5 yr, and for samples with K/Ca < 1 as young as about 10^7 yr. For younger samples the combination of large atmospheric Ar corrections and large corrections for Ca- and K-derived Ar may make the precision of the 40Ar/39Ar technique less than that of the conventional technique unless the irradiation parameters are adjusted to minimize these corrections. © 1971.

  16. Fast and accurate imputation of summary statistics enhances evidence of functional enrichment.

    PubMed

    Pasaniuc, Bogdan; Zaitlen, Noah; Shi, Huwenbo; Bhatia, Gaurav; Gusev, Alexander; Pickrell, Joseph; Hirschhorn, Joel; Strachan, David P; Patterson, Nick; Price, Alkes L

    2014-10-15

    Imputation using external reference panels (e.g. 1000 Genomes) is a widely used approach for increasing power in genome-wide association studies and meta-analysis. Existing hidden Markov models (HMM)-based imputation approaches require individual-level genotypes. Here, we develop a new method for Gaussian imputation from summary association statistics, a type of data that is becoming widely available. In simulations using 1000 Genomes (1000G) data, this method recovers 84% (54%) of the effective sample size for common (>5%) and low-frequency (1-5%) variants [increasing to 87% (60%) when summary linkage disequilibrium information is available from target samples] versus the gold standard of 89% (67%) for HMM-based imputation, which cannot be applied to summary statistics. Our approach accounts for the limited sample size of the reference panel, a crucial step to eliminate false-positive associations, and it is computationally very fast. As an empirical demonstration, we apply our method to seven case-control phenotypes from the Wellcome Trust Case Control Consortium (WTCCC) data and a study of height in the British 1958 birth cohort (1958BC). Gaussian imputation from summary statistics recovers 95% (105%) of the effective sample size (as quantified by the ratio of χ² association statistics) compared with HMM-based imputation from individual-level genotypes at the 227 (176) published single nucleotide polymorphisms (SNPs) in the WTCCC (1958BC height) data. In addition, for publicly available summary statistics from large meta-analyses of four lipid traits, we publicly release imputed summary statistics at 1000G SNPs, which could not have been obtained using previously published methods, and demonstrate their accuracy by masking subsets of the data. We show that 1000G imputation using our approach increases the magnitude and statistical evidence of enrichment at genic versus non-genic loci for these traits, as compared with an analysis without 1000G imputation. Thus, imputation of summary statistics will be a valuable tool in future functional enrichment analyses. A publicly available software package is available at http://bogdan.bioinformatics.ucla.edu/software/. Contact: bpasaniuc@mednet.ucla.edu or aprice@hsph.harvard.edu. Supplementary materials are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
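    The underlying conditional-Gaussian formula (impute untyped z-scores as a linear combination of typed ones, with LD-based weights and a ridge term for reference-panel noise) can be sketched as follows; this is the generic textbook form, not the authors' released software, and the regularization value is an assumption.

```python
import numpy as np

def impute_z(z_typed, R_tt, R_ut, lam=0.1):
    """Impute z-scores at untyped SNPs from typed ones.
    R_tt: LD among typed SNPs; R_ut: LD of untyped vs typed SNPs.
    The ridge term lam crudely accounts for finite reference-panel size."""
    k = R_tt.shape[0]
    W = R_ut @ np.linalg.solve(R_tt + lam * np.eye(k), np.eye(k))
    z_imp = W @ z_typed
    # Per-SNP imputation quality, an analogue of an r^2 metric.
    info = np.einsum("ij,ij->i", W, R_ut)
    return z_imp, info

rng = np.random.default_rng(0)
R_tt = np.array([[1.0, 0.8], [0.8, 1.0]])      # LD among two typed SNPs
R_ut = np.array([[0.6, 0.5]])                  # LD of one untyped SNP vs typed
print(impute_z(rng.normal(size=2), R_tt, R_ut))
```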

  17. Sampling-based approaches to improve estimation of mortality among patient dropouts: experience from a large PEPFAR-funded program in Western Kenya.

    PubMed

    Yiannoutsos, Constantin T; An, Ming-Wen; Frangakis, Constantine E; Musick, Beverly S; Braitstein, Paula; Wools-Kaloustian, Kara; Ochieng, Daniel; Martin, Jeffrey N; Bacon, Melanie C; Ochieng, Vincent; Kimaiyo, Sylvester

    2008-01-01

    Monitoring and evaluation (M&E) of HIV care and treatment programs is impacted by losses to follow-up (LTFU) in the patient population. The severity of this effect is undeniable but its extent unknown. Tracing all lost patients addresses this, but census methods are not feasible in programs involving rapid scale-up of HIV treatment in the developing world. Sampling-based approaches and statistical adjustment are the only scalable methods permitting accurate estimation of M&E indices. In a large antiretroviral therapy (ART) program in western Kenya, we assessed the impact of LTFU on estimating patient mortality among 8,977 adult clients, of whom 3,624 were LTFU. Overall, dropouts were more likely to be male (36.8% versus 33.7%; p = 0.003) and younger than non-dropouts (35.3 versus 35.7 years old; p = 0.020), with lower median CD4 count at enrollment (160 versus 189 cells/ml; p<0.001), and more often had WHO stage 3-4 disease (47.5% versus 41.1%; p<0.001). Urban clinic clients were 75.0% of non-dropouts but 70.3% of dropouts (p<0.001). Of the 3,624 dropouts, 1,143 were sought and 621 had their vital status ascertained. Statistical techniques were used to adjust mortality estimates based on information obtained from located LTFU patients. Observed mortality estimates one year after enrollment were 1.7% (95% CI 1.3%-2.0%), revised to 2.8% (2.3%-3.1%) when deaths discovered through outreach were added, and adjusted to 9.2% (7.8%-10.6%) and 9.9% (8.4%-11.5%) through statistical modeling, depending on the method used. The estimates 12 months after ART initiation were 1.7% (1.3%-2.2%), 3.4% (2.9%-4.0%), 10.5% (8.7%-12.3%) and 10.7% (8.9%-12.6%) respectively. Conclusions/Significance: Assessment of the impact of LTFU is critical in program M&E, as estimated mortality based on passive monitoring may underestimate true mortality by up to 80%. This bias can be ameliorated by tracing a sample of dropouts and statistically adjusting the mortality estimates to properly evaluate and guide large HIV care and treatment programs.
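    The simplest version of the sampling-based correction -- extrapolating the death rate found among traced dropouts to all dropouts -- can be written in a few lines. The numbers below are loosely shaped like those in the abstract but are illustrative only; the paper's model-based adjustments are more refined.

```python
def adjusted_mortality(deaths_observed, n_active, n_ltfu,
                       traced, deaths_among_traced):
    """Naive sampling-based correction: assume the death rate seen in the
    traced dropouts applies to all dropouts."""
    est_ltfu_deaths = n_ltfu * deaths_among_traced / traced
    return (deaths_observed + est_ltfu_deaths) / (n_active + n_ltfu)

# Illustrative numbers only (not the study's actual counts of deaths).
print(adjusted_mortality(deaths_observed=90, n_active=5353, n_ltfu=3624,
                         traced=621, deaths_among_traced=60))
```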

  18. Maximizing Macromolecule Crystal Size for Neutron Diffraction Experiments

    NASA Technical Reports Server (NTRS)

    Judge, R. A.; Kephart, R.; Leardi, R.; Myles, D. A.; Snell, E. H.; vanderWoerd, M.; Curreri, Peter A. (Technical Monitor)

    2002-01-01

    A challenge in neutron diffraction experiments is growing large (greater than 1 cu mm) macromolecule crystals. In taking up this challenge we have used statistical experiment design techniques to quickly identify crystallization conditions under which the largest crystals grow. These techniques provide the maximum information for minimal experimental effort, allowing optimal screening of crystallization variables in a simple experimental matrix, using the minimum amount of sample. Analysis of the results quickly tells the investigator what conditions are the most important for the crystallization. These can then be used to maximize the crystallization results in terms of reducing crystal numbers and providing large crystals of suitable habit. We have used these techniques to grow large crystals of Glucose isomerase. Glucose isomerase is an industrial enzyme used extensively in the food industry for the conversion of glucose to fructose. The aim of this study is the elucidation of the enzymatic mechanism at the molecular level. The accurate determination of hydrogen positions, which is critical for this, is a requirement that neutron diffraction is uniquely suited for. Preliminary neutron diffraction experiments with these crystals conducted at the Institute Laue-Langevin (Grenoble, France) reveal diffraction to beyond 2.5 angstrom. Macromolecular crystal growth is a process involving many parameters, and statistical experimental design is naturally suited to this field. These techniques are sample independent and provide an experimental strategy to maximize crystal volume and habit for neutron diffraction studies.
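    As a hedged illustration of the design approach (not the authors' actual screening matrix), a two-level full-factorial design over a handful of invented crystallization factors looks like this:

```python
from itertools import product

# Hypothetical crystallization factors at low/high levels.
factors = {
    "protein_mg_ml": (10, 40),
    "precipitant_pct": (5, 15),
    "pH": (6.5, 8.0),
    "temperature_C": (4, 20),
}
design = [dict(zip(factors, combo)) for combo in product(*factors.values())]
print(len(design))    # 2**4 = 16 trials screen all main effects
print(design[0])      # first run of the experimental matrix
```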

  19. Spatial inventory integrating raster databases and point sample data. [Geographic Information System for timber inventory]

    NASA Technical Reports Server (NTRS)

    Strahler, A. H.; Woodcock, C. E.; Logan, T. L.

    1983-01-01

    A timber inventory of the Eldorado National Forest, located in east-central California, provides an example of the use of a Geographic Information System (GIS) to stratify large areas of land for sampling and the collection of statistical data. The raster-based GIS format of the VICAR/IBIS software system allows simple and rapid tabulation of areas, and facilitates the selection of random locations for ground sampling. Algorithms that simplify the complex spatial pattern of raster-based information, and convert raster format data to strings of coordinate vectors, provide a link to conventional vector-based geographic information systems.

  20. Virtual unfolding of light sheet fluorescence microscopy dataset for quantitative analysis of the mouse intestine

    NASA Astrophysics Data System (ADS)

    Candeo, Alessia; Sana, Ilenia; Ferrari, Eleonora; Maiuri, Luigi; D'Andrea, Cosimo; Valentini, Gianluca; Bassi, Andrea

    2016-05-01

    Light sheet fluorescence microscopy has proven to be a powerful tool to image fixed and chemically cleared samples, providing in-depth and high-resolution reconstructions of intact mouse organs. We applied light sheet microscopy to image the mouse intestine. We found that large portions of the sample can be readily visualized, assessing the organ status and highlighting the presence of regions with impaired morphology. Yet, three-dimensional (3-D) sectioning of the intestine leads to a large dataset that produces unnecessary storage and processing overload. We developed a routine that extracts the relevant information from a large image stack and provides quantitative analysis of the intestine morphology. This result was achieved by a three-step procedure consisting of: (1) virtually unfolding the 3-D reconstruction of the intestine; (2) observing it layer-by-layer; and (3) identifying distinct villi and statistically analyzing multiple samples belonging to different intestinal regions. Although the procedure was developed for the murine intestine, most of the underlying concepts are generally applicable.

  1. The Abundance of Large Arcs From CLASH

    NASA Astrophysics Data System (ADS)

    Xu, Bingxiao; Postman, Marc; Meneghetti, Massimo; Coe, Dan A.; Clash Team

    2015-01-01

    We have developed an automated arc-finding algorithm to perform a rigorous comparison of the observed and simulated abundance of large lensed background galaxies (a.k.a. arcs). We use images from the CLASH program to derive our observed arc abundance. Simulated CLASH images are created by performing ray tracing through mock clusters generated by the N-body simulation calibrated tool -- MOKA, and N-body/hydrodynamic simulations -- MUSIC, over the same mass and redshift range as the CLASH X-ray selected sample. We derive a lensing efficiency of 15 ± 3 arcs per cluster for the X-ray selected CLASH sample and 4 ± 2 arcs per cluster for the simulated sample. The marginally significant difference (3.0 σ) between the results for the observations and the simulations can be explained by the systematically smaller area with magnification larger than 3 (by a factor of ˜4) in both MOKA and MUSIC mass models relative to those derived from the CLASH data. Accounting for this difference brings the observed and simulated arc statistics into full agreement. We find that the source redshift distribution does not have a big impact on the arc abundance, but the arc abundance is very sensitive to the concentration of the dark matter halos. Our results suggest that the solution to the "arc statistics problem" lies primarily in matching the cluster dark matter distribution.

  2. Ground-water quality beneath irrigated agriculture in the central High Plains aquifer, 1999-2000

    USGS Publications Warehouse

    Bruce, Breton W.; Becker, Mark F.; Pope, Larry M.; Gurdak, Jason J.

    2003-01-01

    In 1999 and 2000, 30 water-quality monitoring wells were installed in the central High Plains aquifer to evaluate the quality of recently recharged ground water in areas of irrigated agriculture and to identify the factors affecting ground-water quality. Wells were installed adjacent to irrigated agricultural fields with 10- or 20-foot screened intervals placed near the water table. Each well was sampled once for about 100 water-quality constituents associated with agricultural practices. Water samples from 70 percent of the wells (21 of 30 sites) contained nitrate concentrations larger than expected background concentrations (about 3 mg/L as N) and detectable pesticides. Atrazine or its metabolite, deethylatrazine, were detected with greater frequency than other pesticides and were present in all 21 samples where pesticides were detected. The 21 samples with detectable pesticides also contained tritium concentrations large enough to indicate that at least some part of the water sample had been recharged within about the last 50 years. These 21 ground-water samples are considered to show water-quality effects related to irrigated agriculture. The remaining 9 ground-water samples contained no pesticides, small tritium concentrations, and nitrate concentrations less than 3.45 milligrams per liter as nitrogen. These samples are considered unaffected by the irrigated agricultural land-use setting. Nitrogen isotope ratios indicate that commercial fertilizer was the dominant source of nitrate in 13 of the 21 samples affected by irrigated agriculture. Nitrogen isotope ratios for 4 of these 21 samples were indicative of an animal waste source. Dissolved-solids concentrations were larger in samples affected by irrigated agriculture, with large sulfate concentrations having a strong correlation with large dissolved-solids concentrations in these samples. A strong statistical correlation is shown between samples affected by irrigated agriculture and sites with large rates of pesticide and nitrogen applications and shallow depths to ground water.

  3. Notes for Brazil sampling frame evaluation trip

    NASA Technical Reports Server (NTRS)

    Horvath, R. (Principal Investigator); Hicks, D. R. (Compiler)

    1981-01-01

    Field notes describing a trip conducted in Brazil are presented. The trip evaluated a sample frame, developed from LANDSAT full-frame images by the USDA Economics and Statistics Service, for eventual use in cropland production estimation with LANDSAT by the Foreign Commodity Production Forecasting Project of the AgRISTARS program. Six areas were analyzed on the basis of land use, crop land in corn and soybean, field size and soil type. The analysis indicated generally successful use of LANDSAT images for large-area land-use stratification.

  4. Nowcasting and Forecasting the Monthly Food Stamps Data in the US Using Online Search Data

    PubMed Central

    Fantazzini, Dean

    2014-01-01

    We propose the use of Google online search data for nowcasting and forecasting the number of food stamps recipients. We perform a large out-of-sample forecasting exercise with almost 3000 competing models with forecast horizons up to 2 years ahead, and we show that models including Google search data statistically outperform the competing models at all considered horizons. These results also hold under several robustness checks, including alternative keywords, a falsification test, different out-of-sample periods, directional accuracy, and forecasts at the state level. PMID:25369315
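    The flavor of such an out-of-sample comparison can be sketched with rolling one-step forecasts from an autoregression with and without a search-volume regressor; the data below are simulated, and the model is far simpler than the nearly 3000 specifications the paper considers.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 120
search = rng.normal(size=T)                       # stand-in for a Google index
y = np.zeros(T)
for t in range(1, T):                             # target truly depends on search
    y[t] = 0.6 * y[t - 1] + 0.5 * search[t - 1] + rng.normal(scale=0.5)

def rolling_mse(use_search, start=60):
    """Re-fit an AR(1) (optionally with the lagged search index) at each
    step and score the one-step-ahead forecast."""
    errs = []
    for t in range(start, T):
        X = np.column_stack([np.ones(t - 1), y[:t - 1]] +
                            ([search[:t - 1]] if use_search else []))
        beta, *_ = np.linalg.lstsq(X, y[1:t], rcond=None)
        x_new = [1.0, y[t - 1]] + ([search[t - 1]] if use_search else [])
        errs.append((y[t] - np.dot(x_new, beta)) ** 2)
    return np.mean(errs)

print(rolling_mse(False), rolling_mse(True))      # the search model should win
```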

  5. Children's Knowledge of the Earth: A New Methodological and Statistical Approach

    ERIC Educational Resources Information Center

    Straatemeier, Marthe; van der Maas, Han L. J.; Jansen, Brenda R. J.

    2008-01-01

    In the field of children's knowledge of the earth, much debate has concerned the question of whether children's naive knowledge--that is, their knowledge before they acquire the standard scientific theory--is coherent (i.e., theory-like) or fragmented. We conducted two studies with large samples (N = 328 and N = 381) using a new paper-and-pencil…

  6. New Resources for Computer-Aided Legal Research: An Assessment of the Usefulness of the DIALOG System in Securities Regulation Studies.

    ERIC Educational Resources Information Center

    Gruner, Richard; Heron, Carol E.

    1984-01-01

    Examines usefulness of DIALOG as legal research tool through use of DIALOG's DIALINDEX database to identify those databases among almost 200 available that contain large numbers of records related to federal securities regulation. Eight databases selected for further study are detailed. Twenty-six footnotes, database statistics, and samples are…

  7. Comparison of Efficiency of Jackknife and Variance Component Estimators of Standard Errors. Program Statistics Research. Technical Report.

    ERIC Educational Resources Information Center

    Longford, Nicholas T.

    Large scale surveys usually employ a complex sampling design and as a consequence, no standard methods for estimation of the standard errors associated with the estimates of population means are available. Resampling methods, such as jackknife or bootstrap, are often used, with reference to their properties of robustness and reduction of bias. A…

  8. VizieR Online Data Catalog: REFLEX Galaxy Cluster Survey catalogue (Boehringer+, 2004)

    NASA Astrophysics Data System (ADS)

    Boehringer, H.; Schuecker, P.; Guzzo, L.; Collins, C. A.; Voges, W.; Cruddace, R. G.; Ortiz-Gil, A.; Chincarini, G.; de Grandi, S.; Edge, A. C.; MacGillivray, H. T.; Neumann, D. M.; Schindler, S.; Shaver, P.

    2004-05-01

    The following tables provide the catalogue as well as several data files necessary to reproduce the sample preparation. These files are also required for the cosmological modeling of these observations in e.g. the study of the statistics of the large-scale structure of the matter distribution in the Universe and related cosmological tests. (13 data files).

  9. Intervention for First Graders with Limited Number Knowledge: Large-Scale Replication of a Randomized Controlled Trial

    ERIC Educational Resources Information Center

    Gersten, Russell; Rolfhus, Eric; Clarke, Ben; Decker, Lauren E.; Wilkins, Chuck; Dimino, Joseph

    2015-01-01

    Replication studies are extremely rare in education. This randomized controlled trial (RCT) is a scale-up replication of Fuchs et al., which in a sample of 139 found a statistically significant positive impact for Number Rockets, a small-group intervention for at-risk first graders that focused on building understanding of number operations. The…

  10. Internet cognitive testing of large samples needed in genetic research.

    PubMed

    Haworth, Claire M A; Harlaar, Nicole; Kovas, Yulia; Davis, Oliver S P; Oliver, Bonamy R; Hayiou-Thomas, Marianna E; Frances, Jane; Busfield, Patricia; McMillan, Andrew; Dale, Philip S; Plomin, Robert

    2007-08-01

    Quantitative and molecular genetic research requires large samples to provide adequate statistical power, but it is expensive to test large samples in person, especially when the participants are widely distributed geographically. Increasing access to inexpensive and fast Internet connections makes it possible to test large samples efficiently and economically online. Reliability and validity of Internet testing for cognitive ability have not been previously reported; these issues are especially pertinent for testing children. We developed Internet versions of reading, language, mathematics and general cognitive ability tests and investigated their reliability and validity for 10- and 12-year-old children. We tested online more than 2500 pairs of 10-year-old twins and compared their scores to similar internet-based measures administered online to a subsample of the children when they were 12 years old (> 759 pairs). Within 3 months of the online testing at 12 years, we administered standard paper and pencil versions of the reading and mathematics tests in person to 30 children (15 pairs of twins). Scores on Internet-based measures at 10 and 12 years correlated .63 on average across the two years, suggesting substantial stability and high reliability. Correlations of about .80 between Internet measures and in-person testing suggest excellent validity. In addition, the comparison of the internet-based measures to ratings from teachers based on criteria from the UK National Curriculum suggests good concurrent validity for these tests. We conclude that Internet testing can be reliable and valid for collecting cognitive test data on large samples even for children as young as 10 years.

  11. The structure of Diagnostic and Statistical Manual of Mental Disorders (4th edition, text revision) personality disorder symptoms in a large national sample.

    PubMed

    Trull, Timothy J; Vergés, Alvaro; Wood, Phillip K; Jahng, Seungmin; Sher, Kenneth J

    2012-10-01

    We examined the latent structure underlying the criteria for DSM-IV-TR (American Psychiatric Association, 2000, Diagnostic and statistical manual of mental disorders (4th ed., text revision). Washington, DC: Author.) personality disorders in a large nationally representative sample of U.S. adults. Personality disorder symptom data were collected using a structured diagnostic interview from approximately 35,000 adults assessed over two waves of data collection in the National Epidemiologic Survey on Alcohol and Related Conditions. Our analyses suggested that a seven-factor solution provided the best fit for the data, and these factors were marked primarily by one or at most two personality disorder criteria sets. A series of regression analyses that used external validators tapping Axis I psychopathology, treatment for mental health problems, functioning scores, interpersonal conflict, and suicidal ideation and behavior provided support for the seven-factor solution. We discuss these findings in the context of previous studies that have examined the structure underlying the personality disorder criteria as well as the current proposals for DSM-5 personality disorders. (PsycINFO Database Record (c) 2012 APA, all rights reserved).

  12. Connecting optical and X-ray tracers of galaxy cluster relaxation

    NASA Astrophysics Data System (ADS)

    Roberts, Ian D.; Parker, Laura C.; Hlavacek-Larrondo, Julie

    2018-04-01

    Substantial effort has been devoted to determining the ideal proxy for quantifying the morphology of the hot intracluster medium in clusters of galaxies. These proxies, based on X-ray emission, typically require expensive, high-quality X-ray observations, making them difficult to apply to large surveys of groups and clusters. Here, we compare optical relaxation proxies with X-ray asymmetries and centroid shifts for a sample of Sloan Digital Sky Survey clusters with high-quality, archival X-ray data from Chandra and XMM-Newton. The three optical relaxation measures considered are the shape of the member-galaxy projected velocity distribution - measured by the Anderson-Darling (AD) statistic, the stellar mass gap between the most-massive and second-most-massive cluster galaxy, and the offset between the most-massive galaxy (MMG) position and the luminosity-weighted cluster centre. The AD statistic and stellar mass gap correlate significantly with X-ray relaxation proxies, with the AD statistic being the stronger correlator. Conversely, we find no evidence for a correlation between X-ray asymmetry or centroid shift and the MMG offset. High-mass clusters (Mhalo > 10^14.5 M⊙) in this sample have X-ray asymmetries, centroid shifts, and Anderson-Darling statistics which are systematically larger than for low-mass systems. Finally, considering the dichotomy of Gaussian and non-Gaussian clusters (measured by the AD test), we show that the probability of being a non-Gaussian cluster correlates significantly with X-ray asymmetry but only shows a marginal correlation with centroid shift. These results confirm the shape of the radial velocity distribution as a useful proxy for cluster relaxation, which can then be applied to large redshift surveys lacking extensive X-ray coverage.
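    Applying the AD normality test to a cluster's line-of-sight velocities takes only a few lines with scipy; the velocity sample below is simulated (two superposed subclusters), and the thresholds are scipy's tabulated critical values, not the calibration used in the paper.

```python
import numpy as np
from scipy.stats import anderson

rng = np.random.default_rng(0)
# Stand-in line-of-sight velocities (km/s): two superposed subclusters.
v = np.concatenate([rng.normal(-300, 400, 60), rng.normal(600, 300, 40)])

res = anderson(v, dist='norm')
print(res.statistic)
for crit, sig in zip(res.critical_values, res.significance_level):
    print(f"reject Gaussianity at {sig}% level: {res.statistic > crit}")
```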

  13. A modified weighted function method for parameter estimation of Pearson type three distribution

    NASA Astrophysics Data System (ADS)

    Liang, Zhongmin; Hu, Yiming; Li, Binquan; Yu, Zhongbo

    2014-04-01

    In this paper, an unconventional method called Modified Weighted Function (MWF) is presented for the conventional moment estimation of a probability distribution function. The aim of MWF is to shift estimation of the coefficient of variation (CV) and coefficient of skewness (CS) from higher-order moment computations to first-order moment calculations. The estimators for CV and CS of the Pearson type three distribution function (PE3) were derived by weighting the moments of the distribution with two weight functions, which were constructed by combining two negative exponential-type functions. The selection of these weight functions was based on two considerations: (1) to relate the weight functions to sample size, in order to reflect the relationship between the quantity of sample information and the role of the weight function, and (2) to allocate more weight to data close to medium-tail positions in a sample series ranked in ascending order. A Monte-Carlo experiment was conducted to simulate a large number of samples upon which the statistical properties of MWF were investigated. For the PE3 parent distribution, results of MWF were compared to those of the original Weighted Function (WF) and Linear Moments (L-M). The results indicate that MWF was superior to WF and slightly better than L-M, in terms of statistical unbiasedness and effectiveness. In addition, the robustness of MWF, WF, and L-M was compared in a Monte-Carlo experiment in which samples were drawn from the Log-Pearson type three distribution (LPE3), the three-parameter Log-Normal distribution (LN3), and the Generalized Extreme Value distribution (GEV), respectively, but all were treated as samples from the PE3 distribution. The results show that, in terms of statistical unbiasedness, no method possesses an overwhelming advantage among MWF, WF, and L-M, while in terms of statistical effectiveness, MWF is superior to WF and L-M.
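    The abstract does not give the weight functions, so the MWF estimator itself cannot be reproduced here; the skeleton below instead shows the kind of Monte-Carlo bias experiment described, using ordinary sample CV and skewness on shifted-gamma (PE3) samples with invented parameters.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
shape, scale, loc = 4.0, 10.0, 5.0           # PE3 = shifted (three-parameter) gamma
true_mean = loc + shape * scale
true_cv = (shape ** 0.5 * scale) / true_mean # sd / mean
true_cs = 2.0 / shape ** 0.5                 # gamma skewness, unchanged by shift

n, reps = 50, 5000
cvs, css = [], []
for _ in range(reps):
    x = loc + rng.gamma(shape, scale, n)
    cvs.append(x.std(ddof=1) / x.mean())
    css.append(skew(x, bias=False))
print(f"CV bias {np.mean(cvs) - true_cv:+.4f}, "
      f"CS bias {np.mean(css) - true_cs:+.4f}")
```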

  14. Dynamic heterogeneity and non-Gaussian statistics for acetylcholine receptors on live cell membrane

    NASA Astrophysics Data System (ADS)

    He, W.; Song, H.; Su, Y.; Geng, L.; Ackerson, B. J.; Peng, H. B.; Tong, P.

    2016-05-01

    The Brownian motion of molecules at thermal equilibrium usually has a finite correlation time and will eventually be randomized after a long delay time, so that their displacement follows Gaussian statistics. This is true even when the molecules have experienced a complex environment with a finite correlation time. Here, we report that the lateral motion of the acetylcholine receptors on live muscle cell membranes does not follow the Gaussian statistics of normal Brownian diffusion. From a careful analysis of a large volume of protein trajectories obtained over a wide range of sampling rates and long durations, we find that the normalized histogram of the protein displacements shows an exponential tail, which is robust and universal for cells under different conditions. The experiment indicates that the observed non-Gaussian statistics and dynamic heterogeneity are inherently linked to the slow, active remodelling of the underlying cortical actin network.
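    A hedged sketch of the tail diagnostic: build the normalized displacement histogram and fit the log-density beyond a cutoff, where an approximately straight line indicates an exponential tail. The two-diffusivity mixture below is a crude stand-in for dynamic heterogeneity, not the receptor data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated 1-D displacements: a Gaussian mixture with two diffusivities
# produces heavier-than-Gaussian tails, mimicking dynamic heterogeneity.
dx = np.where(rng.random(100_000) < 0.7,
              rng.normal(0, 1.0, 100_000), rng.normal(0, 3.0, 100_000))
r = np.abs(dx) / dx.std()                    # normalized displacement

counts, edges = np.histogram(r, bins=60, range=(0, 8), density=True)
centers = 0.5 * (edges[1:] + edges[:-1])
tail = centers > 2.5
# A straight line in log space corresponds to an exp(-r / r0) tail.
slope, _ = np.polyfit(centers[tail], np.log(counts[tail] + 1e-12), 1)
print(f"tail decay length r0 = {-1 / slope:.2f}")
```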

  15. An astronomer's guide to period searching

    NASA Astrophysics Data System (ADS)

    Schwarzenberg-Czerny, A.

    2003-03-01

    We concentrate on the analysis of unevenly sampled time series, interrupted by periodic gaps, as often encountered in astronomy. While some of our conclusions may appear surprising, all are based on the classical statistical principles of Fisher and his successors. Except for the discussion of resolution issues, it is best for the reader to forget temporarily about Fourier transforms and to concentrate on the problem of fitting a time series with a model curve. According to their statistical content, we divide the issues into several sections: (ii) statistical and numerical aspects of model fitting; (iii) evaluation of fitted models as hypothesis testing; (iv) the role of orthogonal models in signal detection; (v) conditions for the equivalence of periodograms; and (vi) rating sensitivity by test power. An experienced observer working with individual objects would benefit little from a formalized statistical approach. However, we demonstrate the usefulness of this approach in evaluating the performance of periodograms and in the quantitative design of large variability surveys.
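
    As a concrete example of period searching on unevenly sampled data, here is a minimal Lomb-Scargle periodogram sketch on a simulated light curve (one member of the periodogram families whose statistical evaluation is at issue here):

      import numpy as np
      from scipy.signal import lombscargle

      rng = np.random.default_rng(3)

      # Unevenly sampled sinusoid with a 2.5-day period plus noise.
      t = np.sort(rng.uniform(0.0, 100.0, size=200))   # observation times (days)
      true_period = 2.5
      y = np.sin(2 * np.pi * t / true_period) + 0.5 * rng.normal(size=200)

      periods = np.linspace(1.0, 10.0, 5000)
      omega = 2 * np.pi / periods                      # angular frequencies
      power = lombscargle(t, y - y.mean(), omega, normalize=True)

      print(f"Best period: {periods[np.argmax(power)]:.3f} d "
            f"(true: {true_period} d)")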

  16. Statistically significant performance results of a mine detector and fusion algorithm from an x-band high-resolution SAR

    NASA Astrophysics Data System (ADS)

    Williams, Arnold C.; Pachowicz, Peter W.

    2004-09-01

    Current mine detection research indicates that no single sensor or single look from a sensor will detect mines/minefields in real time at a performance level suitable for a forward maneuver unit. Hence, the integrated development of detectors and fusion algorithms is of primary importance. A problem in this development process has been the evaluation of these algorithms with relatively small data sets, leading to anecdotal and frequently overtrained results. These anecdotal results are often unreliable and conflicting among various sensors and algorithms. Consequently, the physical phenomena that ought to be exploited and the performance benefits of this exploitation are often ambiguous. The Army RDECOM CERDEC Night Vision and Electronic Sensors Directorate has collected large amounts of multisensor data such that statistically significant evaluations of detection and fusion algorithms can be obtained. Even with these large data sets, care must be taken in algorithm design and data processing to achieve statistically significant performance results for combined detectors and fusion algorithms. This paper discusses statistically significant detection and combined multilook fusion results for the Ellipse Detector (ED) and the Piecewise Level Fusion Algorithm (PLFA). These performance results are characterized by ROC curves obtained by processing the multilook, high-resolution SAR data of the Veridian X-band radar. We discuss the implications of these results for mine detection and the importance of statistical significance, sample size, ground truth, and algorithm design in performance evaluation.
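
    For illustration, a generic sketch of characterizing detector performance by a ROC curve built from confidence scores and ground-truth labels; the scores below are simulated placeholders, not SAR data:

      import numpy as np
      from sklearn.metrics import roc_curve, auc

      rng = np.random.default_rng(7)

      # Hypothetical detector scores: targets score higher than clutter.
      labels = np.concatenate([np.ones(500), np.zeros(5000)])
      scores = np.concatenate([rng.normal(1.5, 1.0, 500),
                               rng.normal(0.0, 1.0, 5000)])

      fpr, tpr, thresholds = roc_curve(labels, scores)
      print(f"Area under ROC curve: {auc(fpr, tpr):.3f}")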

  17. BCM: toolkit for Bayesian analysis of Computational Models using samplers.

    PubMed

    Thijssen, Bram; Dijkstra, Tjeerd M H; Heskes, Tom; Wessels, Lodewyk F A

    2016-10-21

    Computational models in biology are characterized by a large degree of uncertainty. This uncertainty can be analyzed with Bayesian statistics; however, the sampling algorithms frequently used for calculating Bayesian statistical estimates are computationally demanding, and each algorithm has unique advantages and disadvantages. It is typically unclear, before starting an analysis, which algorithm will perform well on a given computational model. We present BCM, a toolkit for the Bayesian analysis of Computational Models using samplers. It provides efficient, multithreaded implementations of eleven algorithms for sampling from posterior probability distributions and for calculating marginal likelihoods. BCM includes tools to simplify the process of model specification and scripts for visualizing the results. The flexible architecture allows it to be used on diverse types of biological computational models. In an example inference task using a model of the cell cycle based on ordinary differential equations, BCM is significantly more efficient than existing software packages, allowing more challenging inference problems to be solved. BCM represents an efficient one-stop shop for computational modelers wishing to use sampler-based Bayesian statistics.
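
    As a minimal, generic example of the sampler family such a toolkit implements (this is a sketch of the idea, not BCM's API), a random-walk Metropolis sampler on a toy posterior:

      import numpy as np

      rng = np.random.default_rng(0)

      def log_posterior(theta):
          # Toy target: standard normal log-density (up to a constant).
          return -0.5 * np.sum(theta ** 2)

      def metropolis(log_post, theta0, n_steps=10_000, step=0.5):
          theta = np.asarray(theta0, dtype=float)
          lp = log_post(theta)
          chain = np.empty((n_steps, theta.size))
          for i in range(n_steps):
              proposal = theta + step * rng.normal(size=theta.size)
              lp_prop = log_post(proposal)
              if np.log(rng.uniform()) < lp_prop - lp:   # accept/reject
                  theta, lp = proposal, lp_prop
              chain[i] = theta
          return chain

      chain = metropolis(log_posterior, theta0=[3.0, -3.0])
      print("Posterior mean estimate:", chain[5000:].mean(axis=0))  # ~ [0, 0]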

  18. Secular Extragalactic Parallax and Geometric Distances with Gaia Proper Motions

    NASA Astrophysics Data System (ADS)

    Paine, Jennie; Darling, Jeremiah K.

    2018-06-01

    The motion of the Solar System with respect to the cosmic microwave background (CMB) rest frame creates a well-measured dipole in the CMB, corresponding to a linear solar velocity of about 78 AU/yr. This motion causes relatively nearby extragalactic objects to appear to move compared to more distant objects, an effect that can be measured in the proper motions of nearby galaxies. An object at 1 Mpc lying perpendicular to the CMB apex will exhibit a secular parallax, observed as a proper motion, of 78 µas/yr. The relatively large peculiar motions of galaxies make the detection of secular parallax challenging for individual objects. Instead, a statistical parallax measurement can be made for a sample of objects with proper motions, where the global parallax signal is modeled as an E-mode dipole that diminishes linearly with distance. We present preliminary results of applying this model to a sample of nearby galaxies with Gaia proper motions to detect the statistical secular parallax signal. The statistical measurement can be used to calibrate the canonical cosmological “distance ladder.”
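
    The quoted numbers are easy to verify directly; a short sketch with astropy units, assuming the measured CMB dipole velocity of about 369 km/s:

      import astropy.units as u

      v_sun = 369.0 * u.km / u.s                 # Solar velocity w.r.t. the CMB
      print(v_sun.to(u.AU / u.yr))               # ~ 78 AU / yr

      d = 1.0 * u.Mpc
      baseline = v_sun * 1.0 * u.yr              # distance travelled in one year
      # Small-angle proper motion: yearly baseline divided by distance.
      mu = ((baseline / d).decompose() * u.rad).to(u.uas)
      print(mu, "per year")                      # ~ 78 uas per year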

  19. Conditional Random Fields for Fast, Large-Scale Genome-Wide Association Studies

    PubMed Central

    Huang, Jim C.; Meek, Christopher; Kadie, Carl; Heckerman, David

    2011-01-01

    Understanding the role of genetic variation in human diseases remains an important problem in genomics. An important component of such variation consists of variation at single sites in DNA, or single nucleotide polymorphisms (SNPs). Typically, the problem of associating particular SNPs with phenotypes has been confounded by hidden factors such as population structure, family structure, or cryptic relatedness in the sample of individuals being analyzed. Such confounding factors lead to large numbers of spurious associations and missed associations. Various statistical methods have been proposed to account for these factors, such as linear mixed-effect models (LMMs) or methods that adjust data based on a principal components analysis (PCA), but these methods either suffer from low power or cease to be tractable for larger numbers of individuals in the sample. Here we present a statistical model for conducting genome-wide association studies (GWAS) that accounts for such confounding factors. Our method's runtime scales quadratically in the number of individuals being studied, with only a modest loss in statistical power compared to LMM-based and PCA-based methods when testing on synthetic data generated from a generalized LMM. Applying our method to both real and synthetic human genotype/phenotype data, we demonstrate its ability to correct for confounding factors while requiring significantly less runtime than LMMs. We have implemented methods for fitting these models, which are available at http://www.microsoft.com/science. PMID:21765897

  20. Confirmatory factor analytic structure and measurement invariance of quantitative autistic traits measured by the social responsiveness scale-2.

    PubMed

    Frazier, Thomas W; Ratliff, Kristin R; Gruber, Chris; Zhang, Yi; Law, Paul A; Constantino, John N

    2014-01-01

    Understanding the factor structure of autistic symptomatology is critical to the discovery and interpretation of causal mechanisms in autism spectrum disorder. We applied confirmatory factor analysis and assessment of measurement invariance to a large (N = 9635) accumulated collection of reports on quantitative autistic traits using the Social Responsiveness Scale, representing a broad diversity of age, severity, and reporter type. A two-factor structure (corresponding to social communication impairment and restricted, repetitive behavior), as elaborated in the updated Diagnostic and Statistical Manual of Mental Disorders (5th ed.; DSM-5) criteria for autism spectrum disorder, exhibited acceptable model fit in confirmatory factor analysis. Measurement invariance was appreciable across age, sex, and reporter (self vs other), but somewhat less apparent between clinical and nonclinical populations in this sample comprised of both familial and sporadic autism spectrum disorders. The statistical power afforded by this large sample allowed relative differentiation of three factors among items encompassing social communication impairment (emotion recognition, social avoidance, and interpersonal relatedness) and two factors among items encompassing restricted, repetitive behavior (insistence on sameness and repetitive mannerisms). Cross-trait correlations remained extremely high, on the order of 0.66-0.92. These data clarify domains of statistically significant factorial separation that may relate to partially, but not completely, overlapping biological mechanisms contributing to variation in human social competency. Given such robust intercorrelations among symptom domains, understanding their co-emergence remains a high priority in conceptualizing common neural mechanisms underlying autistic syndromes.

  1. Statistical Hypothesis Testing in Intraspecific Phylogeography: NCPA versus ABC

    PubMed Central

    Templeton, Alan R.

    2009-01-01

    Nested clade phylogeographic analysis (NCPA) and approximate Bayesian computation (ABC) have been used to test phylogeographic hypotheses. Multilocus NCPA tests null hypotheses, whereas ABC discriminates among a finite set of alternatives. The interpretive criteria of NCPA are explicit and allow complex models to be built from simple components. The interpretive criteria of ABC are ad hoc and require the specification of a complete phylogeographic model. The conclusions from ABC are often influenced by implicit assumptions arising from the many parameters needed to specify a complex model. These complex models confound many assumptions, so that biological interpretations are difficult. Sampling error is accounted for in NCPA, but ABC ignores important sources of sampling error that create pseudo-statistical power. NCPA generates the full sampling distribution of its statistics, but ABC only yields local probabilities, which in turn make it impossible to distinguish between a well-fitting model, a non-informative model, and an over-determined model. Both NCPA and ABC use approximations, but the convergence of the approximations used in NCPA is well defined whereas that of ABC is not. NCPA can analyze a large number of locations, but ABC cannot. Finally, the dimensionality of the tested hypothesis is known in NCPA but not in ABC. As a consequence, the “probabilities” generated by ABC are not true probabilities and are statistically non-interpretable. Accordingly, ABC should not be used for hypothesis testing, but simulation approaches are valuable when used in conjunction with NCPA or other methods that do not rely on highly parameterized models. PMID:19192182

  2. Significance levels for studies with correlated test statistics.

    PubMed

    Shi, Jianxin; Levinson, Douglas F; Whittemore, Alice S

    2008-07-01

    When testing large numbers of null hypotheses, one needs to assess the evidence against the global null hypothesis that none of the hypotheses is false. Such evidence typically is based on the test statistic of the largest magnitude, whose statistical significance is evaluated by permuting the sample units to simulate its null distribution. Efron (2007) has noted that correlation among the test statistics can induce substantial interstudy variation in the shapes of their histograms, which may cause misleading tail counts. Here, we show that permutation-based estimates of the overall significance level also can be misleading when the test statistics are correlated. We propose that such estimates be conditioned on a simple measure of the spread of the observed histogram, and we provide a method for obtaining conditional significance levels. We justify this conditioning using the conditionality principle described by Cox and Hinkley (1974). Application of the method to gene expression data illustrates the circumstances when conditional significance levels are needed.
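
    A generic sketch of the permutation approach described above, using the largest absolute two-sample t statistic over many simulated correlated tests (the conditioning refinement the authors propose is not reproduced here):

      import numpy as np

      rng = np.random.default_rng(5)

      # Simulated expression matrix: 1000 genes x 40 samples, 20 cases/controls.
      X = rng.normal(size=(1000, 40))
      labels = np.array([1] * 20 + [0] * 20)

      def max_abs_t(X, labels):
          a, b = X[:, labels == 1], X[:, labels == 0]
          se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1] +
                       b.var(axis=1, ddof=1) / b.shape[1])
          return np.max(np.abs((a.mean(axis=1) - b.mean(axis=1)) / se))

      observed = max_abs_t(X, labels)
      null = np.array([max_abs_t(X, rng.permutation(labels))
                       for _ in range(1000)])
      print(f"Global permutation p-value: {(null >= observed).mean():.3f}")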

  3. Statistical Design for Biospecimen Cohort Size in Proteomics-based Biomarker Discovery and Verification Studies

    PubMed Central

    Skates, Steven J.; Gillette, Michael A.; LaBaer, Joshua; Carr, Steven A.; Anderson, N. Leigh; Liebler, Daniel C.; Ransohoff, David; Rifai, Nader; Kondratovich, Marina; Težak, Živana; Mansfield, Elizabeth; Oberg, Ann L.; Wright, Ian; Barnes, Grady; Gail, Mitchell; Mesri, Mehdi; Kinsinger, Christopher R.; Rodriguez, Henry; Boja, Emily S.

    2014-01-01

    Protein biomarkers are needed to deepen our understanding of cancer biology and to improve our ability to diagnose, monitor and treat cancers. Important analytical and clinical hurdles must be overcome to allow the most promising protein biomarker candidates to advance into clinical validation studies. Although contemporary proteomics technologies support the measurement of large numbers of proteins in individual clinical specimens, sample throughput remains comparatively low. This problem is amplified in typical clinical proteomics research studies, which routinely suffer from a lack of proper experimental design, resulting in analysis of too few biospecimens to achieve adequate statistical power at each stage of a biomarker pipeline. To address this critical shortcoming, a joint workshop was held by the National Cancer Institute (NCI), National Heart, Lung and Blood Institute (NHLBI), and American Association for Clinical Chemistry (AACC), with participation from the U.S. Food and Drug Administration (FDA). An important output from the workshop was a statistical framework for the design of biomarker discovery and verification studies. Herein, we describe the use of quantitative clinical judgments to set statistical criteria for clinical relevance, and the development of an approach to calculate biospecimen sample size for proteomic studies in discovery and verification stages prior to the clinical validation stage. This represents a first step towards building a consensus on quantitative criteria for statistical design of proteomics biomarker discovery and verification research. PMID:24063748

  4. Statistical design for biospecimen cohort size in proteomics-based biomarker discovery and verification studies.

    PubMed

    Skates, Steven J; Gillette, Michael A; LaBaer, Joshua; Carr, Steven A; Anderson, Leigh; Liebler, Daniel C; Ransohoff, David; Rifai, Nader; Kondratovich, Marina; Težak, Živana; Mansfield, Elizabeth; Oberg, Ann L; Wright, Ian; Barnes, Grady; Gail, Mitchell; Mesri, Mehdi; Kinsinger, Christopher R; Rodriguez, Henry; Boja, Emily S

    2013-12-06

    Protein biomarkers are needed to deepen our understanding of cancer biology and to improve our ability to diagnose, monitor, and treat cancers. Important analytical and clinical hurdles must be overcome to allow the most promising protein biomarker candidates to advance into clinical validation studies. Although contemporary proteomics technologies support the measurement of large numbers of proteins in individual clinical specimens, sample throughput remains comparatively low. This problem is amplified in typical clinical proteomics research studies, which routinely suffer from a lack of proper experimental design, resulting in analysis of too few biospecimens to achieve adequate statistical power at each stage of a biomarker pipeline. To address this critical shortcoming, a joint workshop was held by the National Cancer Institute (NCI), National Heart, Lung, and Blood Institute (NHLBI), and American Association for Clinical Chemistry (AACC) with participation from the U.S. Food and Drug Administration (FDA). An important output from the workshop was a statistical framework for the design of biomarker discovery and verification studies. Herein, we describe the use of quantitative clinical judgments to set statistical criteria for clinical relevance and the development of an approach to calculate biospecimen sample size for proteomic studies in discovery and verification stages prior to the clinical validation stage. This represents a first step toward building a consensus on quantitative criteria for statistical design of proteomics biomarker discovery and verification research.
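
    As a hedged illustration of one ingredient of such a framework, a per-arm biospecimen count for detecting a candidate biomarker of a given standardized effect size with a two-sample t-test; the effect size, alpha, and power values below are illustrative choices, not the workshop's numbers:

      from statsmodels.stats.power import TTestIndPower

      analysis = TTestIndPower()
      n_per_arm = analysis.solve_power(effect_size=0.5,   # Cohen's d
                                       alpha=0.001,       # strict: many candidates
                                       power=0.9)
      print(f"Required biospecimens per arm: {n_per_arm:.0f}")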

  5. Non-linear matter power spectrum covariance matrix errors and cosmological parameter uncertainties

    NASA Astrophysics Data System (ADS)

    Blot, L.; Corasaniti, P. S.; Amendola, L.; Kitching, T. D.

    2016-06-01

    The covariance of the matter power spectrum is a key element of the analysis of galaxy clustering data. Independent realizations of observational measurements can be used to sample the covariance; nevertheless, statistical sampling errors will propagate into the cosmological parameter inference, potentially limiting the capabilities of the upcoming generation of galaxy surveys. The impact of these errors as a function of the number of realizations has previously been evaluated for Gaussian distributed data. However, non-linearities in the late-time clustering of matter cause departures from Gaussian statistics. Here, we address the impact of non-Gaussian errors on the sample covariance and precision matrix errors using a large ensemble of N-body simulations. In the range of modes where finite volume effects are negligible (0.1 ≲ k [h Mpc-1] ≲ 1.2), we find deviations of the variance of the sample covariance with respect to Gaussian predictions above ˜10 per cent at k > 0.3 h Mpc-1. Over the entire range these reduce to about ˜5 per cent for the precision matrix. Finally, we perform a Fisher analysis to estimate the effect of covariance errors on the cosmological parameter constraints. In particular, assuming Euclid-like survey characteristics, we find that more than 5000 independent realizations are necessary to reduce the contribution of sampling errors to the cosmological parameter uncertainties to the subpercent level. We also show that restricting the analysis to large scales k ≲ 0.2 h Mpc-1 results in a considerable loss in constraining power, while using the linear covariance to include smaller scales leads to an underestimation of the errors on the cosmological parameters.
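
    A small numerical illustration of why the number of realizations matters even in the Gaussian case: the inverse of a sample covariance is a biased precision-matrix estimate, with the known (Hartlap) factor (n - 1)/(n - p - 2) for Gaussian data. This toy uses a p-dimensional identity covariance:

      import numpy as np

      rng = np.random.default_rng(11)
      p, n, n_trials = 20, 50, 2000   # bins, realizations per covariance, repeats

      inflation = np.empty(n_trials)
      for i in range(n_trials):
          data = rng.normal(size=(n, p))           # true covariance = identity
          cov_hat = np.cov(data, rowvar=False)
          inflation[i] = np.mean(np.diag(np.linalg.inv(cov_hat)))

      print(f"Mean diagonal of precision estimate: {inflation.mean():.3f}")
      print(f"Hartlap prediction (n-1)/(n-p-2):    {(n - 1) / (n - p - 2):.3f}")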

  6. An atlas of L-T transition brown dwarfs with VLT/XShooter

    NASA Astrophysics Data System (ADS)

    Marocco, F.; Day-Jones, A. C.; Jones, H. R. A.; Pinfield, D. J.

    In this contribution we present the first results from a large observing campaign we are carrying out with VLT/XShooter to obtain spectra of a large sample (˜250 objects) of L-T transition brown dwarfs. Here we report results based on the first ˜120 spectra already obtained. The large sample, and the wide spectral coverage (300-2480 nm) given by XShooter, will allow us to perform a powerful new analysis at an unprecedented level. By fitting the absorption lines of a given element (e.g. Na) at different wavelengths, we can test ultracool atmospheric models and draw for the first time a 3D picture of stellar atmospheres at temperatures down to 1000 K. Determining the atmospheric parameters (e.g. temperature, surface gravity and metallicity) of a large sample of brown dwarfs will allow us to understand the role of these parameters in the formation of their spectra. The large number of objects in our sample will also allow a statistically significant test of the birth rate and initial mass function predictions for brown dwarfs. Determining the shape of the initial mass function for very low mass objects is a fundamental task for improving galaxy models, as recent studies (van Dokkum & Conroy 2010, Nature, 468, 940) have shown that low-mass objects dominate in massive elliptical galaxies.

  7. Nearest neighbor density ratio estimation for large-scale applications in astronomy

    NASA Astrophysics Data System (ADS)

    Kremer, J.; Gieseke, F.; Steenstrup Pedersen, K.; Igel, C.

    2015-09-01

    In astronomical applications of machine learning, the distribution of objects used for building a model is often different from the distribution of the objects the model is later applied to. This is known as sample selection bias, a major challenge for statistical inference as one can no longer assume that the labeled training data are representative. To address this issue, one can re-weight the labeled training patterns to match the distribution of unlabeled data that are already available in the training phase. There are many examples in practice where this strategy has yielded good results, but estimating the weights reliably from a finite sample is challenging. We consider an efficient nearest neighbor density ratio estimator that can exploit large samples to increase the accuracy of the weight estimates. To solve the problem of choosing the right neighborhood size, we propose to use cross-validation on a model selection criterion that is unbiased under covariate shift. The resulting algorithm is our method of choice for density ratio estimation when the feature space dimensionality is small and sample sizes are large. The approach is simple and, because of the model selection, robust. We empirically find that it is on a par with established kernel-based methods on relatively small regression benchmark datasets. However, when applied to large-scale photometric redshift estimation, our approach outperforms the state of the art.
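
    A minimal sketch of a k-nearest-neighbor density ratio estimate for covariate-shift weights: each density is estimated from the volume of the k-NN ball, so the ratio reduces to a ratio of k-NN radii. This is a generic construction under these assumptions, not necessarily the paper's exact estimator:

      import numpy as np
      from sklearn.neighbors import NearestNeighbors

      rng = np.random.default_rng(2)
      d, k = 2, 10

      train = rng.normal(0.0, 1.0, size=(2000, d))   # labeled (source) sample
      test = rng.normal(0.5, 1.0, size=(2000, d))    # unlabeled (target) sample

      nn_train = NearestNeighbors(n_neighbors=k + 1).fit(train)  # +1 skips self
      nn_test = NearestNeighbors(n_neighbors=k).fit(test)

      r_train = nn_train.kneighbors(train)[0][:, -1]  # k-NN radius within train
      r_test = nn_test.kneighbors(train)[0][:, -1]    # k-NN radius within test

      # p_test(x)/p_train(x) ~ (N_train / N_test) * (r_train / r_test)^d
      weights = (len(train) / len(test)) * (r_train / r_test) ** d
      print(f"Weight range: {weights.min():.2f} - {weights.max():.2f}")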

  8. Reply: Comparison of slope instability screening tools following a large storm event and application to forest management and policy

    NASA Astrophysics Data System (ADS)

    Whittaker, Kara A.; McShane, Dan

    2013-02-01

    A large storm event in southwest Washington State triggered over 2500 landslides and provided an opportunity to assess two slope stability screening tools. The statistical analysis conducted demonstrated that both screening tools are effective at predicting where landslides were likely to take place (Whittaker and McShane, 2012). Here we reply to two discussions of this article related to the development of the slope stability screening tools and the accuracy and scale of the spatial data used. Neither of the discussions addresses our statistical analysis or results. We provide greater detail on our sampling criteria and also elaborate on the policy and management implications of our findings and how they complement those of a separate investigation of landslides resulting from the same storm. The conclusions made in Whittaker and McShane (2012) stand as originally published unless future analysis indicates otherwise.

  9. Analytic score distributions for a spatially continuous tridirectional Monte Carlo transport problem

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Booth, T.E.

    1996-01-01

    The interpretation of the statistical error estimates produced by Monte Carlo transport codes is still somewhat of an art. Empirically, there are variance reduction techniques whose error estimates are almost always reliable, and there are variance reduction techniques whose error estimates are often unreliable. Unreliable error estimates usually result from inadequate large-score sampling from the score distribution's tail. Statisticians believe that more accurate confidence interval statements are possible if the general nature of the score distribution can be characterized. Here, the analytic score distribution for the exponential transform applied to a simple, spatially continuous Monte Carlo transport problem is provided. Anisotropic scattering and implicit capture are included in the theory. In large part, the analytic score distributions that are derived provide the basis for the ten new statistical quality checks in MCNP.

  10. Cloud encounter statistics in the 28.5-43.5 KFT altitude region from four years of GASP observations

    NASA Technical Reports Server (NTRS)

    Jasperson, W. H.; Nastrom, G. D.; Davis, R. E.; Holdeman, J. D.

    1983-01-01

    The results of an analysis of cloud encounter measurements taken at aircraft flight altitudes as part of the Global Atmospheric Sampling Program are summarized. The results can be used in estimating the probability of cloud encounter and in assessing the economic feasibility of laminar flow control aircraft along particular routes. The data presented clearly show the tropical circulation and its seasonal migration; characteristics of the mid-latitude regime, such as the large-scale traveling cyclones in the winter and increased convective activity in the summer, can be isolated in the data. The cloud encounter statistics are shown to be consistent with the mid-latitude cyclone model. A model for TIC (time-in-clouds), a cloud encounter statistic, is presented for several common airline routes.

  11. Sampling studies to estimate the HIV prevalence rate in female commercial sex workers.

    PubMed

    Pascom, Ana Roberta Pati; Szwarcwald, Célia Landmann; Barbosa Júnior, Aristides

    2010-01-01

    We investigated the sampling methods being used to estimate the HIV prevalence rate among female commercial sex workers. The studies were classified according to whether the sample size was adequate for estimating the HIV prevalence rate and according to the sampling method (probabilistic or convenience). We identified 75 studies that estimated the HIV prevalence rate among female sex workers. Most of the studies employed convenience samples. The sample size was not adequate to estimate the HIV prevalence rate in 35 studies. The use of convenience samples limits statistical inference about the whole group. We observed an increase in the number of published studies since 2005, as well as in the number of studies that used probabilistic samples. This represents a large advance in the monitoring of risk behavior practices and the HIV prevalence rate in this group.

  12. A new u-statistic with superior design sensitivity in matched observational studies.

    PubMed

    Rosenbaum, Paul R

    2011-09-01

    In an observational or nonrandomized study of treatment effects, a sensitivity analysis indicates the magnitude of bias from unmeasured covariates that would need to be present to alter the conclusions of a naïve analysis that presumes adjustments for observed covariates suffice to remove all bias. The power of a sensitivity analysis is the probability that it will reject a false hypothesis about treatment effects allowing for a departure from random assignment of a specified magnitude; in particular, if this specified magnitude is "no departure", then this is the same as the power of a randomization test in a randomized experiment. A new family of u-statistics is proposed that includes Wilcoxon's signed rank statistic but also other statistics with substantially higher power when a sensitivity analysis is performed in an observational study. Wilcoxon's statistic has high power to detect small effects in large randomized experiments (that is, it often has good Pitman efficiency), but small effects are invariably sensitive to small unobserved biases. Members of this family of u-statistics that emphasize medium to large effects can have substantially higher power in a sensitivity analysis. For example, in one situation with 250 pair differences that are Normal with expectation 1/2 and variance 1, the power of a sensitivity analysis that uses Wilcoxon's statistic is 0.08 while the power of another member of the family of u-statistics is 0.66. The topic is examined by performing a sensitivity analysis in three observational studies, using an asymptotic measure called the design sensitivity, and by simulating power in finite samples. The three examples are drawn from epidemiology, clinical medicine, and genetic toxicology. © 2010, The International Biometric Society.
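
    The randomized-experiment end of the paper's example (no hidden bias) is easy to simulate; the sensitivity-analysis power under a hypothesized departure from random assignment, where the proposed u-statistics shine, requires additional machinery not sketched here:

      import numpy as np
      from scipy.stats import wilcoxon

      rng = np.random.default_rng(8)
      n_pairs, n_sims, alpha = 250, 1000, 0.05

      rejections = 0
      for _ in range(n_sims):
          diffs = rng.normal(0.5, 1.0, size=n_pairs)   # N(1/2, 1) pair differences
          if wilcoxon(diffs).pvalue < alpha:
              rejections += 1
      print(f"Randomization-test power, no hidden bias: {rejections / n_sims:.2f}")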

  13. How allele frequency and study design affect association test statistics with misrepresentation errors.

    PubMed

    Escott-Price, Valentina; Ghodsi, Mansoureh; Schmidt, Karl Michael

    2014-04-01

    We evaluate the effect of genotyping errors on the type-I error of a general association test based on genotypes, showing that, in the presence of errors in the case and control samples, the test statistic asymptotically follows a scaled non-central $\chi^2$ distribution. We give explicit formulae for the scaling factor and non-centrality parameter for the symmetric allele-based genotyping error model and for additive and recessive disease models. They show how genotyping errors can lead to a significantly higher false-positive rate, growing with sample size, compared with the nominal significance levels. The strength of this effect depends very strongly on the population distribution of the genotype, with a pronounced effect in the case of rare alleles, and a great robustness against error in the case of large minor allele frequency. We also show how these results can be used to correct $p$-values.
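
    The structure of the result can be illustrated numerically: with placeholder scale and non-centrality values (arbitrary, not the paper's formulae), the realized type-I error under a scaled non-central chi-square exceeds the nominal level:

      from scipy.stats import chi2, ncx2

      df = 1
      threshold = chi2.ppf(0.95, df)            # nominal 5% critical value

      for scale, nc in [(1.0, 0.0), (1.05, 0.2), (1.1, 0.5)]:
          # Rejection rate when the statistic is scale * ncx2(df, nc).
          rate = ncx2.sf(threshold / scale, df, nc)
          print(f"scale={scale:.2f}, nc={nc:.1f}: type-I error = {rate:.3f}")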

  14. The slip resistance of common footwear materials measured with two slipmeters.

    PubMed

    Chang, W R; Matz, S

    2001-12-01

    The slip resistance of 16 commonly used footwear materials was measured with the Brungraber Mark II and the English XL on 3 floor surfaces under dry, wet, oily, and oily wet surface conditions. Three samples were used for each material combination and surface condition. The results of a one-way ANOVA indicated that the differences among samples were statistically significant for a large number of material combinations and surface conditions. The results indicated that the ranking of materials based on their slip resistance values depends strongly on the slipmeter, floor surface, and surface condition. For contaminated surfaces, including wet, oily, and oily wet surfaces, the slip resistance obtained with the English XL was usually higher than that measured with the Brungraber Mark II. The correlation coefficients between the slip resistance values obtained with the two slipmeters, calculated for the different surface conditions, indicated a strong, statistically significant correlation.

  15. Genetic Programming as Alternative for Predicting Development Effort of Individual Software Projects

    PubMed Central

    Chavoya, Arturo; Lopez-Martin, Cuauhtemoc; Andalon-Garcia, Irma R.; Meda-Campaña, M. E.

    2012-01-01

    Statistical and genetic programming techniques have been used to predict the software development effort of large software projects. In this paper, a genetic programming model was used for predicting the effort required in individually developed projects. Accuracy obtained from a genetic programming model was compared against one generated from the application of a statistical regression model. A sample of 219 projects developed by 71 practitioners was used for generating the two models, whereas another sample of 130 projects developed by 38 practitioners was used for validating them. The models used two kinds of lines of code as well as programming language experience as independent variables. Accuracy results from the model obtained with genetic programming suggest that it could be used to predict the software development effort of individual projects when these projects have been developed in a disciplined manner within a development-controlled environment. PMID:23226305

  16. The Marshall Islands Data Management Program

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Stoker, A.C.; Conrado, C.L.

    1995-09-01

    This report is a resource document of the methods and procedures currently used in the Data Management Program of the Marshall Islands Dose Assessment and Radioecology Project. Since 1973, over 60,000 environmental samples have been collected. Our program includes relational database design, programming and maintenance; sample and information management; sample tracking; quality control; and data entry, evaluation and reduction. The usefulness of scientific databases involves careful planning in order to fulfill the requirements of any large research program. Compilation of scientific results requires consolidation of information from several databases, and incorporation of new information as it is generated. The success in combining and organizing all radionuclide analysis, sample information and statistical results into a readily accessible form is critical to our project.

  17. Measurement of the integrated luminosities of cross-section scan data samples around the ψ(3770) mass region

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ablikim, M.; Achasov, M. N.; Ahmed, S.

    To investigate the nature of the ψ(3770) resonance and to measure the cross section for e⁺e⁻ → DD̄, a cross-section scan data sample, distributed among 41 center-of-mass energy points from 3.73 to 3.89 GeV, was taken with the BESIII detector operated at the BEPCII collider in the year 2010. By analyzing the large-angle Bhabha scattering events, we measure the integrated luminosity of the data sample at each center-of-mass energy point. The total integrated luminosity of the data sample is 76.16±0.04±0.61 pb⁻¹, where the first uncertainty is statistical and the second systematic.

  18. Measurement of the integrated luminosities of cross-section scan data samples around the ψ(3770) mass region

    DOE PAGES

    Ablikim, M.; Achasov, M. N.; Ahmed, S.; ...

    2018-05-01

    To investigate the nature of the ψ(3770) resonance and to measure the cross section for e⁺e⁻ → DD̄, a cross-section scan data sample, distributed among 41 center-of-mass energy points from 3.73 to 3.89 GeV, was taken with the BESIII detector operated at the BEPCII collider in the year 2010. By analyzing the large-angle Bhabha scattering events, we measure the integrated luminosity of the data sample at each center-of-mass energy point. The total integrated luminosity of the data sample is 76.16±0.04±0.61 pb⁻¹, where the first uncertainty is statistical and the second systematic.

  19. Determination of element composition and extraterrestrial material occurrence in moss and lichen samples from King George Island (Antarctica) using reactor neutron activation analysis and SEM microscopy.

    PubMed

    Mróz, Tomasz; Szufa, Katarzyna; Frontasyeva, Marina V; Tselmovich, Vladimir; Ostrovnaya, Tatiana; Kornaś, Andrzej; Olech, Maria A; Mietelski, Jerzy W; Brudecki, Kamil

    2018-01-01

    Seven lichen (Usnea antarctica and U. aurantiacoatra) and nine moss (Sanionia uncinata) samples collected on King George Island were analyzed using instrumental neutron activation analysis, and the concentrations of major and trace elements were calculated. For some elements, the concentrations observed in the moss samples were higher than corresponding values reported from other sites in Antarctica, but those in the lichens were in the same range. Scanning electron microscopy (SEM) and statistical analysis showed a large influence of particles of volcanic origin. Interplanetary cosmic particles (ICP) were also observed in the investigated samples, as mosses and lichens are good collectors of ICP and micrometeorites.

  20. Imaging Extended Emission-Line Regions of Obscured AGN with the Subaru Hyper Suprime-Cam Survey

    NASA Astrophysics Data System (ADS)

    Sun, Ai-Lei; Greene, Jenny E.; Zakamska, Nadia L.; Goulding, Andy; Strauss, Michael A.; Huang, Song; Johnson, Sean; Kawaguchi, Toshihiro; Matsuoka, Yoshiki; Marsteller, Alisabeth A.; Nagao, Tohru; Toba, Yoshiki

    2018-05-01

    Narrow-line regions excited by active galactic nuclei (AGN) are important for studying AGN photoionization and feedback. Their strong [O III] lines can be detected with broadband images, allowing morphological studies of these systems with large-area imaging surveys. We develop a new broad-band imaging technique to reconstruct the images of the [O III] line using the Subaru Hyper Suprime-Cam (HSC) Survey aided with spectra from the Sloan Digital Sky Survey (SDSS). The technique involves a careful subtraction of the galactic continuum to isolate emission from the [O III]λ5007 and [O III]λ4959 lines. Compared to traditional targeted observations, this technique is more efficient at covering larger samples without dedicated observational resources. We apply this technique to an SDSS spectroscopically selected sample of 300 obscured AGN at redshifts 0.1-0.7, uncovering extended emission-line region candidates with sizes up to tens of kpc. With the largest sample of uniformly derived narrow-line region sizes, we revisit the narrow-line region size-luminosity relation. The area and radii of the [O III] emission-line regions are strongly correlated with the AGN luminosity inferred from the mid-infrared (15 μm rest-frame) with a power-law slope of 0.62^{+0.05}_{-0.06} ± 0.10 (statistical and systematic errors), consistent with previous spectroscopic findings. We discuss the implications for the physics of AGN emission-line regions and future applications of this technique, which should be useful for current and next-generation imaging surveys to study AGN photoionization and feedback with large statistical samples.

  1. Multi-Genetic Marker Approach and Spatio-Temporal Analysis Suggest There Is a Single Panmictic Population of Swordfish Xiphias gladius in the Indian Ocean

    PubMed Central

    Muths, Delphine; Le Couls, Sarah; Evano, Hugues; Grewe, Peter; Bourjea, Jerome

    2013-01-01

    Genetic population structure of the swordfish Xiphias gladius was examined based on 2231 individual samples, collected mainly between 2009 and 2010 among three major sampling areas within the Indian Ocean (IO; twelve distinct sites), Atlantic (two sites) and Pacific (one site) Oceans, using nineteen microsatellite loci (n = 2146) and mitochondrial ND2 sequence (n = 2001) data. Sample collection was stratified in time and space in order to investigate the stability of the genetic structure observed, with a special focus on the South West Indian Ocean. Significant AMOVA variance was observed for both markers, indicating genetic population subdivision between oceans. The overall F-statistics for the ND2 sequences confirmed that Atlantic and Indian Ocean swordfish represent two distinct genetic stocks. Indo-Pacific differentiation was also significant but lower than that observed between the Atlantic and Indian Oceans. However, microsatellite F-statistics failed to reveal structure even at the inter-oceanic scale, indicating that the resolving power of our microsatellite loci was insufficient for detecting population subdivision. At the scale of the Indian Ocean, results obtained from both markers are consistent with swordfish belonging to a single panmictic population. Analyses partitioned by sampling area, season, or sex also failed to identify any clear structure within this ocean. Such large spatial and temporal homogeneity of genetic structure, observed for such a large, highly mobile pelagic species, suggests it is satisfactory to treat swordfish as a single panmictic population in the Indian Ocean. PMID:23717447

  2. Segment-Wise Genome-Wide Association Analysis Identifies a Candidate Region Associated with Schizophrenia in Three Independent Samples

    PubMed Central

    Rietschel, Marcella; Mattheisen, Manuel; Breuer, René; Schulze, Thomas G.; Nöthen, Markus M.; Levinson, Douglas; Shi, Jianxin; Gejman, Pablo V.; Cichon, Sven; Ophoff, Roel A.

    2012-01-01

    Recent studies suggest that variation in complex disorders (e.g., schizophrenia) is explained by a large number of genetic variants with small effect sizes (odds ratio ∼1.05–1.1). The statistical power to detect these genetic variants in Genome-Wide Association (GWA) studies with large numbers of cases and controls (∼15,000) is still low. As it will be difficult to further increase sample size, we decided to explore an alternative method for analyzing GWA data in a study of schizophrenia, dramatically reducing the number of statistical tests. The underlying hypothesis was that at least some of the genetic variants related to a common outcome are collocated in segments of chromosomes at a wider scale than single genes. Our approach was therefore to study the association between relatively large segments of DNA and disease status. An association test was performed for each SNP, and the number of nominally significant tests in a segment was counted. We then performed a permutation-based binomial test to determine whether this region contained significantly more nominally significant SNPs than expected under the null hypothesis of no association, taking linkage into account. Genome-Wide Association data from three independent schizophrenia case/control cohorts of European ancestry (Dutch, German, and US) were analyzed using segments of DNA of variable length (2 to 32 Mbp). Using this approach we identified a region at chromosome 5q23.3-q31.3 (128-160 Mbp) that was significantly enriched with nominally associated SNPs in three independent case-control samples. We conclude that considering relatively wide segments of chromosomes may reveal reliable relationships between the genome and schizophrenia, suggesting novel methodological possibilities as well as raising theoretical questions. PMID:22723893
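
    A sketch of the per-segment counting step with illustrative numbers (the permutation step that accounts for linkage disequilibrium is omitted here):

      from scipy.stats import binomtest

      n_snps_in_segment = 400     # hypothetical SNP count in one segment
      n_nominal_hits = 35         # SNPs with p < 0.05 in the segment
      alpha_nominal = 0.05

      result = binomtest(n_nominal_hits, n_snps_in_segment,
                         p=alpha_nominal, alternative="greater")
      print(f"Segment enrichment p-value: {result.pvalue:.4f}")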

  3. A new exact and more powerful unconditional test of no treatment effect from binary matched pairs.

    PubMed

    Lloyd, Chris J

    2008-09-01

    We consider the problem of testing for a difference in the probability of success from matched binary pairs. Starting with three standard inexact tests, the nuisance parameter is first estimated and then the residual dependence is eliminated by maximization, producing what I call an E+M P-value. The E+M P-value based on McNemar's statistic is shown numerically to dominate previous suggestions, including partially maximized P-values as described in Berger and Sidik (2003, Statistical Methods in Medical Research 12, 91-108). The latter method, however, may have computational advantages for large samples.

  4. Methodological Issues in the Study of Teachers' Careers: Critical Features of a Truly Longitudinal Study. Working Paper Series.

    ERIC Educational Resources Information Center

    Singer, Judith D.; Willett, John B.

    The National Center for Education Statistics (NCES) is exploring the possibility of conducting a large-scale multi-year study of teachers' careers. The proposed new study is intended to follow a national probability sample of teachers over an extended period of time. A number of methodological issues need to be addressed before the study can be…

  5. A Theory-Based Model for Understanding Faculty Intention to Use Students Ratings to Improve Teaching in a Health Sciences Institution in Puerto Rico

    ERIC Educational Resources Information Center

    Collazo, Andrés A.

    2018-01-01

    A model derived from the theory of planned behavior was empirically assessed for understanding faculty intention to use student ratings for teaching improvement. A sample of 175 professors participated in the study. The model was statistically significant and had a very large explanatory power. Instrumental attitude, affective attitude, perceived…

  6. A Taxometric Investigation of "DSM-IV" Major Depression in a Large Outpatient Sample: Interpretable Structural Results Depend on the Mode of Assessment

    ERIC Educational Resources Information Center

    Ruscio, John; Brown, Timothy A.; Ruscio, Ayelet Meron

    2009-01-01

    Most taxometric studies of depressive constructs have drawn indicators from self-report instruments that do not bear directly on the "Diagnostic and Statistical Manual of Mental Disorders, 4th edition (DSM-IV)" diagnostic construct of major depressive disorder (MDD). The present study examined the latent structure of MDD using indicator sets…

  7. Using Modern Statistical Methods to Analyze Demographics of Kansas ABE/GED Students Who Transition to Community or Technical College Programs

    ERIC Educational Resources Information Center

    Zacharakis, Jeff; Wang, Haiyan; Patterson, Margaret Becker; Andersen, Lori

    2015-01-01

    This research analyzed linked high-quality state data from K-12, adult education, and postsecondary state datasets in order to better understand the association between student demographics and successful completion of a postsecondary program. Due to the relatively small sample size compared to the large number of features, we analyzed the data…

  8. Feelings and Performance in the First Year at University: Learning-Related Emotions as Predictors of Achievement Outcomes in Mathematics and Statistics

    ERIC Educational Resources Information Center

    Niculescu, Alexandra C.; Templelaar, Dirk; Leppink, Jimmie; Dailey-Hebert, Amber; Segers, Mien; Gijselaers, Wim

    2015-01-01

    Introduction: This study examined the predictive value of four learning-related emotions--Enjoyment, Anxiety, Boredom and Hopelessness for achievement outcomes in the first year of study at university. Method: We used a large sample (N = 2337) of first year university students enrolled over three consecutive academic years in a mathematics and…

  9. Selecting the optimum plot size for a California design-based stream and wetland mapping program.

    PubMed

    Lackey, Leila G; Stein, Eric D

    2014-04-01

    Accurate estimates of the extent and distribution of wetlands and streams are the foundation of wetland monitoring, management, restoration, and regulatory programs. Traditionally, these estimates have relied on comprehensive mapping. However, this approach is prohibitively resource-intensive over large areas, making it both impractical and statistically unreliable. Probabilistic (design-based) approaches to evaluating status and trends provide a more cost-effective alternative because, compared with comprehensive mapping, overall extent is inferred from mapping a statistically representative, randomly selected subset of the target area. In this type of design, the size of sample plots has a significant impact on program costs and on statistical precision and accuracy; however, no consensus exists on the appropriate plot size for remote monitoring of stream and wetland extent. This study utilized simulated sampling to assess the performance of four plot sizes (1, 4, 9, and 16 km²) for three geographic regions of California. Simulation results showed smaller plot sizes (1 and 4 km²) were most efficient for achieving desired levels of statistical accuracy and precision. However, larger plot sizes were more likely to contain rare and spatially limited wetland subtypes. Balancing these considerations led to selection of 4 km² for the California status and trends program.

  10. Reliability and One-Year Stability of the PIN3 Neighborhood Environmental Audit in Urban and Rural Neighborhoods.

    PubMed

    Porter, Anna K; Wen, Fang; Herring, Amy H; Rodríguez, Daniel A; Messer, Lynne C; Laraia, Barbara A; Evenson, Kelly R

    2018-06-01

    Reliable and stable environmental audit instruments are needed to successfully identify the physical and social attributes that may influence physical activity. This study described the reliability and stability of the PIN3 environmental audit instrument in both urban and rural neighborhoods. Four randomly sampled road segments in and around a one-quarter mile buffer of participants' residences from the Pregnancy, Infection, and Nutrition (PIN3) study were rated twice, approximately 2 weeks apart. One year later, 253 of the year 1 sampled roads were re-audited. The instrument included 43 measures that resulted in 73 item scores for calculation of percent overall agreement, kappa statistics, and log-linear models. For same-day reliability, 81% of items had moderate to outstanding kappa statistics (kappas ≥ 0.4). Two-week reliability was slightly lower, with 77% of items having moderate to outstanding agreement using kappa statistics. One-year stability had 68% of items showing moderate to outstanding agreement using kappa statistics. The reliability of the audit measures was largely consistent when comparing urban to rural locations, with only 8% of items exhibiting significant differences (α < 0.05) by urbanicity. The PIN3 instrument is a reliable and stable audit tool for studies assessing neighborhood attributes in urban and rural environments.
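
    For reference, computing the agreement statistic used throughout the study is straightforward; the ratings below are made-up categorical codes for two raters scoring the same road segments:

      from sklearn.metrics import cohen_kappa_score

      rater1 = [0, 1, 1, 2, 0, 1, 2, 2, 0, 1]
      rater2 = [0, 1, 1, 2, 1, 1, 2, 0, 0, 1]
      print(f"kappa = {cohen_kappa_score(rater1, rater2):.2f}")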

  11. Prevalence of diseases and statistical power of the Japan Nurses' Health Study.

    PubMed

    Fujita, Toshiharu; Hayashi, Kunihiko; Katanoda, Kota; Matsumura, Yasuhiro; Lee, Jung Su; Takagi, Hirofumi; Suzuki, Shosuke; Mizunuma, Hideki; Aso, Takeshi

    2007-10-01

    The Japan Nurses' Health Study (JNHS) is a long-term, large-scale cohort study investigating the effects of various lifestyle factors and healthcare habits on the health of Japanese women. Based on the limited statistical data then available regarding the incidence of disease among Japanese women, our initial sample size was tentatively set at 50,000 during the design phase. The actual number of women who agreed to participate in follow-up surveys was approximately 18,000. Taking into account the actual sample size and new information on disease frequency obtained during the baseline component, we established the prevalence of past diagnoses of target diseases, predicted their incidence, and calculated the statistical power for JNHS follow-up surveys. For all diseases except ovarian cancer, the prevalence of a past diagnosis increased markedly with age, and incidence rates could be predicted based on the degree of increase in prevalence between two adjacent 5-yr age groups. The predicted incidence rate for uterine myoma, hypercholesterolemia, and hypertension was ≥3.0 (per 1,000 women per year), while the rate for thyroid disease, hepatitis, gallstone disease, and benign breast tumor was predicted to be ≥1.0. For these diseases, the statistical power to detect risk factors with a relative risk of 1.5 or more within ten years was 70% or higher.

  12. Groups of galaxies in the Center for Astrophysics redshift survey

    NASA Technical Reports Server (NTRS)

    Ramella, Massimo; Geller, Margaret J.; Huchra, John P.

    1989-01-01

    By applying the Huchra and Geller (1982) objective group identification algorithm to the Center for Astrophysics' redshift survey, a catalog of 128 groups with three or more members is extracted, and 92 of these are used as a statistical sample. A comparison of the distribution of group centers with the distribution of all galaxies in the survey indicates qualitatively that groups trace the large-scale structure of the region. The physical properties of groups may be related to the details of large-scale structure, and it is concluded that differences among group catalogs may be due to the properties of large-scale structures and their location relative to the survey limits.

  13. 45 CFR 160.536 - Statistical sampling.

    Code of Federal Regulations, 2010 CFR

    2010-10-01

    ... 45 Public Welfare 1 2010-10-01 2010-10-01 false Statistical sampling. 160.536 Section 160.536... REQUIREMENTS GENERAL ADMINISTRATIVE REQUIREMENTS Procedures for Hearings § 160.536 Statistical sampling. (a) In... statistical sampling study as evidence of the number of violations under § 160.406 of this part, or the...

  14. 42 CFR 1003.133 - Statistical sampling.

    Code of Federal Regulations, 2011 CFR

    2011-10-01

    ... 42 Public Health 5 2011-10-01 2011-10-01 false Statistical sampling. 1003.133 Section 1003.133... AUTHORITIES CIVIL MONEY PENALTIES, ASSESSMENTS AND EXCLUSIONS § 1003.133 Statistical sampling. (a) In meeting... statistical sampling study as evidence of the number and amount of claims and/or requests for payment as...

  15. 45 CFR 160.536 - Statistical sampling.

    Code of Federal Regulations, 2011 CFR

    2011-10-01

    ... 45 Public Welfare 1 2011-10-01 2011-10-01 false Statistical sampling. 160.536 Section 160.536... REQUIREMENTS GENERAL ADMINISTRATIVE REQUIREMENTS Procedures for Hearings § 160.536 Statistical sampling. (a) In... statistical sampling study as evidence of the number of violations under § 160.406 of this part, or the...

  16. 42 CFR 1003.133 - Statistical sampling.

    Code of Federal Regulations, 2010 CFR

    2010-10-01

    ... 42 Public Health 5 2010-10-01 2010-10-01 false Statistical sampling. 1003.133 Section 1003.133... AUTHORITIES CIVIL MONEY PENALTIES, ASSESSMENTS AND EXCLUSIONS § 1003.133 Statistical sampling. (a) In meeting... statistical sampling study as evidence of the number and amount of claims and/or requests for payment as...

  17. 42 CFR 405.1064 - ALJ decisions involving statistical samples.

    Code of Federal Regulations, 2011 CFR

    2011-10-01

    ... 42 Public Health 2 2011-10-01 2011-10-01 false ALJ decisions involving statistical samples. 405... Medicare Coverage Policies § 405.1064 ALJ decisions involving statistical samples. When an appeal from the QIC involves an overpayment issue and the QIC used a statistical sample in reaching its...

  18. 42 CFR 405.1064 - ALJ decisions involving statistical samples.

    Code of Federal Regulations, 2010 CFR

    2010-10-01

    ... 42 Public Health 2 2010-10-01 2010-10-01 false ALJ decisions involving statistical samples. 405... Medicare Coverage Policies § 405.1064 ALJ decisions involving statistical samples. When an appeal from the QIC involves an overpayment issue and the QIC used a statistical sample in reaching its...

  19. Anchored LH2 complexes in 2D polarization imaging.

    PubMed

    Tubasum, Sumera; Sakai, Shunsuke; Dewa, Takehisa; Sundström, Villy; Scheblykin, Ivan G; Nango, Mamoru; Pullerits, Tõnu

    2013-09-26

    Protein is a soft material with inherently large structural disorder. Consequently, bulk spectroscopies of photosynthetic pigment-protein complexes provide averaged information in which many details are lost. Here we report spectroscopy of single light-harvesting complexes in which the fluorescence excitation and detection polarizations are both independently rotated. Two samples of peripheral antenna (LH2) complexes from Rhodopseudomonas acidophila were studied. In one, the complexes were embedded in polyvinyl alcohol (PVA) film; in the other, they were anchored on the glass surface and covered by the PVA film. LH2 contains two rings of pigment molecules: B800 and B850. The B800 excitation polarization properties of the two samples were found to be very similar, indicating that the orientation statistics of the LH2s are the same in these two very different preparations. At the same time, we found a significant difference in the B850 emission polarization statistics. We conclude that the B850 band of the anchored sample is substantially more disordered. We argue that both the B800 excitation and B850 emission polarization properties can be explained by a tilt of the anchored LH2s due to the spin-casting of the PVA film on top of the complexes and the related shear forces. Due to the tilt, the orientation statistics of the two samples become similar. Anchoring is expected to orient the LH2s so that B850 is closer to the substrate. Consequently, the tilt-related strain leads to larger deformation and disorder in B850 than in B800.

  20. SERE: single-parameter quality control and sample comparison for RNA-Seq.

    PubMed

    Schulze, Stefan K; Kanwar, Rahul; Gölzenleuchter, Meike; Therneau, Terry M; Beutler, Andreas S

    2012-10-03

    Assessing the reliability of experimental replicates (or global alterations corresponding to different experimental conditions) is a critical step in analyzing RNA-Seq data. Pearson's correlation coefficient r has been widely used in the RNA-Seq field even though its statistical characteristics may be poorly suited to the task. Here we present a single-parameter test procedure for count data, the Simple Error Ratio Estimate (SERE), that can determine whether two RNA-Seq libraries are faithful replicates or globally different. Benchmarking shows that the interpretation of SERE is unambiguous regardless of the total read count or the range of expression differences among bins (exons or genes): a score of 1 indicates faithful replication (i.e., samples are affected only by Poisson variation of individual counts), a score of 0 indicates data duplication, and scores >1 correspond to true global differences between RNA-Seq libraries. By contrast, the interpretation of Pearson's r is generally ambiguous and highly dependent on sequencing depth and the range of expression levels inherent to the sample (difference between lowest and highest bin count). Cohen's simple kappa results are also ambiguous and highly dependent on the choice of bins. For quantifying global sample differences SERE performs similarly to a measure based on the negative binomial distribution yet is simpler to compute. SERE can therefore serve as a straightforward and reliable statistical procedure for the global assessment of pairs or large groups of RNA-Seq datasets by a single statistical parameter.
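
    A hedged reconstruction of the SERE idea for two libraries: a chi-square-type dispersion of observed counts around the expectation under shared proportions, normalized so that pure Poisson replication gives about 1 (consult the paper for the exact definition; the counts below are simulated):

      import numpy as np

      rng = np.random.default_rng(4)

      def sere(y1, y2):
          n1, n2 = y1.sum(), y2.sum()
          total = y1 + y2
          e1 = total * n1 / (n1 + n2)          # expected counts, library 1
          e2 = total * n2 / (n1 + n2)
          keep = total > 0
          chi2 = np.sum((y1[keep] - e1[keep]) ** 2 / e1[keep] +
                        (y2[keep] - e2[keep]) ** 2 / e2[keep])
          return np.sqrt(chi2 / keep.sum())    # ~1 for faithful replicates

      lam = rng.gamma(2.0, 50.0, size=5000)    # per-gene expression levels
      rep1, rep2 = rng.poisson(lam), rng.poisson(lam)
      print(f"SERE (faithful replicates): {sere(rep1, rep2):.2f}")        # ~ 1
      print(f"SERE (duplicated library):  {sere(rep1, rep1.copy()):.2f}")  # 0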

  1. SERE: Single-parameter quality control and sample comparison for RNA-Seq

    PubMed Central

    2012-01-01

    Background Assessing the reliability of experimental replicates (or global alterations corresponding to different experimental conditions) is a critical step in analyzing RNA-Seq data. Pearson’s correlation coefficient r has been widely used in the RNA-Seq field even though its statistical characteristics may be poorly suited to the task. Results Here we present a single-parameter test procedure for count data, the Simple Error Ratio Estimate (SERE), that can determine whether two RNA-Seq libraries are faithful replicates or globally different. Benchmarking shows that the interpretation of SERE is unambiguous regardless of the total read count or the range of expression differences among bins (exons or genes), a score of 1 indicating faithful replication (i.e., samples are affected only by Poisson variation of individual counts), a score of 0 indicating data duplication, and scores >1 corresponding to true global differences between RNA-Seq libraries. In contrast, the interpretation of Pearson’s r is generally ambiguous and highly dependent on sequencing depth and the range of expression levels inherent to the sample (difference between lowest and highest bin count). Cohen’s simple Kappa results are also ambiguous and are highly dependent on the choice of bins. For quantifying global sample differences SERE performs similarly to a measure based on the negative binomial distribution yet is simpler to compute. Conclusions SERE can therefore serve as a straightforward and reliable statistical procedure for the global assessment of pairs or large groups of RNA-Seq datasets by a single statistical parameter. PMID:23033915

  2. Testing AGN unification via inference from large catalogs

    NASA Astrophysics Data System (ADS)

    Nikutta, Robert; Ivezic, Zeljko; Elitzur, Moshe; Nenkova, Maia

    2018-01-01

    Source orientation and clumpiness of the central dust are the main factors in AGN classification. Type-1 QSOs are easy to observe and large samples are available (e.g., in SDSS), but obscured type-2 AGN are dimmer and redder as our line of sight is more obscured, making it difficult to obtain a complete sample. WISE has found up to a million QSOs. With only 4 bands and a relatively small aperture, the analysis of individual sources is challenging, but the large sample allows inference of bulk properties at a very significant level. CLUMPY (www.clumpy.org) is arguably the most popular database of AGN torus SEDs. We model the ensemble properties of the entire WISE AGN content using regularized linear regression, with orientation-dependent CLUMPY color-color-magnitude (CCM) tracks as basis functions. We can reproduce the observed number counts per CCM bin with percent-level accuracy, and simultaneously infer the probability distributions of all torus parameters, redshifts, additional SED components, and identify type-1/2 AGN populations through their IR properties alone. We increase the statistical power of our AGN unification tests even further by adding other datasets as axes in the regression problem. To this end, we make use of the NOAO Data Lab (datalab.noao.edu), which hosts several high-level large datasets and provides very powerful tools for handling large data, e.g., cross-matched catalogs, fast remote queries, etc.
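
    A toy version of the regression setup sketched above, with random matrices standing in for the CLUMPY color-color-magnitude tracks and plain non-negative least squares standing in for the paper's regularized regression; all names and numbers are illustrative.

        import numpy as np
        from scipy.optimize import nnls

        rng = np.random.default_rng(1)
        n_bins, n_tracks = 200, 30
        A = rng.random((n_bins, n_tracks))        # stand-in torus-track basis
        w_true = np.zeros(n_tracks)
        w_true[[3, 7, 19]] = [50.0, 120.0, 80.0]  # a few configurations dominate
        counts = rng.poisson(A @ w_true).astype(float)   # observed bin counts

        w_fit, _ = nnls(A, counts)                # non-negative least squares
        print(np.round(w_fit[[3, 7, 19]], 1))     # approximately recovers the weights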

  3. Biases in the OSSOS Detection of Large Semimajor Axis Trans-Neptunian Objects

    NASA Astrophysics Data System (ADS)

    Gladman, Brett; Shankman, Cory; OSSOS Collaboration

    2017-10-01

    The accumulating but small set of large semimajor axis trans-Neptunian objects (TNOs) shows an apparent clustering in the orientations of their orbits. This clustering must either be representative of the intrinsic distribution of these TNOs, or else have arisen as a result of observation biases and/or statistically expected variations for such a small set of detected objects. The clustered TNOs were detected across different and independent surveys, which has led to claims that the detections are therefore free of observational bias. This apparent clustering has led to the so-called “Planet 9” hypothesis that a super-Earth currently resides in the distant solar system and causes this clustering. The Outer Solar System Origins Survey (OSSOS) is a large program that ran on the Canada-France-Hawaii Telescope from 2013 to 2017, discovering more than 800 new TNOs. One of the primary design goals of OSSOS was the careful determination of observational biases that would manifest within the detected sample. We demonstrate the striking and non-intuitive biases that exist for the detection of TNOs with large semimajor axes. The eight large semimajor axis OSSOS detections are an independent data set, of comparable size to the conglomerate samples used in previous studies. We conclude that the orbital distribution of the OSSOS sample is consistent with being detected from a uniform underlying angular distribution.
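
    A hedged illustration of the null test such a claim rests on: how often does a uniform angular distribution produce, in a sample of only eight objects, clustering (mean resultant length) at least as strong as some observed value? The observed value below is invented.

        import numpy as np

        rng = np.random.default_rng(2)

        def resultant_length(angles):
            # 1 for identical angles, ~0 for a uniform spread
            return np.hypot(np.cos(angles).sum(), np.sin(angles).sum()) / angles.size

        n, trials = 8, 100_000                       # eight detections, as in OSSOS
        r_sim = np.array([resultant_length(rng.uniform(0, 2 * np.pi, n))
                          for _ in range(trials)])
        r_obs = 0.55                                 # hypothetical observed clustering
        print((r_sim >= r_obs).mean())               # chance of this under uniformity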

  4. OSSOS. VI. Striking Biases in the Detection of Large Semimajor Axis Trans-Neptunian Objects

    NASA Astrophysics Data System (ADS)

    Shankman, Cory; Kavelaars, J. J.; Bannister, Michele T.; Gladman, Brett J.; Lawler, Samantha M.; Chen, Ying-Tung; Jakubik, Marian; Kaib, Nathan; Alexandersen, Mike; Gwyn, Stephen D. J.; Petit, Jean-Marc; Volk, Kathryn

    2017-08-01

    The accumulating but small set of large semimajor axis trans-Neptunian objects (TNOs) shows an apparent clustering in the orientations of their orbits. This clustering must either be representative of the intrinsic distribution of these TNOs, or else have arisen as a result of observation biases and/or statistically expected variations for such a small set of detected objects. The clustered TNOs were detected across different and independent surveys, which has led to claims that the detections are therefore free of observational bias. This apparent clustering has led to the so-called “Planet 9” hypothesis that a super-Earth currently resides in the distant solar system and causes this clustering. The Outer Solar System Origins Survey (OSSOS) is a large program that ran on the Canada–France–Hawaii Telescope from 2013 to 2017, discovering more than 800 new TNOs. One of the primary design goals of OSSOS was the careful determination of observational biases that would manifest within the detected sample. We demonstrate the striking and non-intuitive biases that exist for the detection of TNOs with large semimajor axes. The eight large semimajor axis OSSOS detections are an independent data set, of comparable size to the conglomerate samples used in previous studies. We conclude that the orbital distribution of the OSSOS sample is consistent with being detected from a uniform underlying angular distribution.

  5. Voids and constraints on nonlinear clustering of galaxies

    NASA Technical Reports Server (NTRS)

    Vogeley, Michael S.; Geller, Margaret J.; Park, Changbom; Huchra, John P.

    1994-01-01

    Void statistics of the galaxy distribution in the Center for Astrophysics (CfA) Redshift Survey provide strong constraints on galaxy clustering in the nonlinear regime, i.e., on scales R ≤ 10 h^-1 Mpc. Computation of high-order moments of the galaxy distribution requires a sample that (1) densely traces the large-scale structure and (2) covers sufficient volume to obtain good statistics. The CfA redshift survey densely samples structure on scales ≤ 10 h^-1 Mpc and has sufficient depth and angular coverage to approach a fair sample on these scales. In the nonlinear regime, the void probability function (VPF) for CfA samples exhibits apparent agreement with hierarchical scaling; such scaling implies that the N-point correlation functions for N > 2 depend only on pairwise products of the two-point function xi(r). However, simulations of cosmological models show that this scaling in redshift space does not necessarily imply such scaling in real space, even in the nonlinear regime; peculiar velocities cause distortions which can yield erroneous agreement with hierarchical scaling. The underdensity probability measures the frequency of 'voids' with density rho < 0.2 rho_mean. This statistic reveals a paucity of very bright (L > L*) galaxies in the 'voids.' Underdensities are ≥ 2 sigma more frequent in bright galaxy samples than in samples that include fainter galaxies. Comparison of void statistics of CfA samples with simulations of a range of cosmological models favors models with Gaussian primordial fluctuations and Cold Dark Matter (CDM)-like initial power spectra; biased models tend to produce voids that are too empty. We also compare these data with three specific models of the CDM cosmogony: an unbiased, open-universe CDM model (omega = 0.4, h = 0.5) provides a good match to the VPF of the CfA samples, whereas biasing of the galaxy distribution in the 'standard' CDM model (omega = 1, b = 1.5) and in a CDM model with nonzero cosmological constant (omega = 0.4, h = 0.6, lambda_0 = 0.6, b = 1.3) produces voids that are too empty. All three simulations match the observed VPF and underdensity probability for samples of very bright (M < M* = -19.2) galaxies, but produce voids that are too empty when compared with samples that include fainter galaxies.
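
    A minimal sketch of the void probability function itself: throw random spheres into a mock catalog and count the fraction that are empty. For a Poisson catalog the expectation is P0(R) = exp(-n (4/3) pi R^3), which the estimate can be checked against; box size and counts are arbitrary.

        import numpy as np

        rng = np.random.default_rng(3)
        box, n_gal = 100.0, 2000
        galaxies = rng.random((n_gal, 3)) * box        # mock (Poisson) catalog

        def vpf(points, radius, n_spheres=5000):
            empty = 0
            for c in rng.random((n_spheres, 3)) * box:
                d = points - c
                d -= box * np.round(d / box)           # periodic boundaries
                if np.einsum('ij,ij->i', d, d).min() > radius ** 2:
                    empty += 1
            return empty / n_spheres

        n_density = n_gal / box ** 3
        for R in (2.0, 5.0, 8.0):
            poisson = np.exp(-n_density * 4.0 / 3.0 * np.pi * R ** 3)
            print(f"R={R}: P0={vpf(galaxies, R):.3f} (Poisson expectation {poisson:.3f})")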

  6. Mosaic construction, processing, and review of very large electron micrograph composites

    NASA Astrophysics Data System (ADS)

    Vogt, Robert C., III; Trenkle, John M.; Harmon, Laurel A.

    1996-11-01

    A system of programs is described for acquisition, mosaicking, cueing, and interactive review of large-scale transmission electron micrograph composite images. This work was carried out as part of a final-phase clinical analysis study of a drug for the treatment of diabetic peripheral neuropathy. More than 500 nerve biopsy samples were prepared, digitally imaged, processed, and reviewed. For a given sample, typically 1000 or more 1.5-megabyte frames were acquired, for a total of between 1 and 2 gigabytes of data per sample. These frames were then automatically registered and mosaicked together into a single virtual image composite, which was subsequently used to perform automatic cueing of axons and axon clusters, as well as review and marking by qualified neuroanatomists. Statistics derived from the review process were used to evaluate the efficacy of the drug in promoting regeneration of myelinated nerve fibers. This effort demonstrates a new, entirely digital capability for doing large-scale electron micrograph studies, in which all of the relevant specimen data can be included at high magnification, as opposed to simply taking a random sample of discrete locations. It opens up the possibility of a new era in electron microscopy, one that broadens the scope of questions this imaging modality can be used to answer.
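
    The registration step at the core of such a pipeline can be illustrated with FFT-based phase correlation; this generic sketch recovers a known integer-pixel offset and is not the production system described.

        import numpy as np

        def phase_correlation_shift(a, b):
            # Returns (dy, dx) such that a == np.roll(b, (dy, dx), axis=(0, 1))
            R = np.fft.fft2(a) * np.conj(np.fft.fft2(b))
            R /= np.abs(R) + 1e-12                   # keep phase only
            corr = np.fft.ifft2(R).real
            dy, dx = np.unravel_index(corr.argmax(), corr.shape)
            dy -= corr.shape[0] if dy > corr.shape[0] // 2 else 0
            dx -= corr.shape[1] if dx > corr.shape[1] // 2 else 0
            return dy, dx

        rng = np.random.default_rng(4)
        frame = rng.random((256, 256))
        shifted = np.roll(frame, (17, -23), axis=(0, 1))   # known offset
        print(phase_correlation_shift(shifted, frame))     # -> (17, -23)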

  7. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kreuzer-Martin, Helen W.; Hegg, Eric L.

    The use of isotopic signatures for forensic analysis of biological materials is well-established, and the same general principles that apply to interpretation of stable isotope content of C, N, O, and H apply to the analysis of microorganisms. Heterotrophic microorganisms derive their isotopic content from their growth substrates, which are largely plant and animal products, and the water in their culture medium. Thus the isotope signatures of microbes are tied to their growth environment. The C, N, O, and H isotope ratios of spores have been demonstrated to constitute highly discriminating signatures for sample matching. They can rule out specific samples of media and/or water as possible production media, and can predict isotope ratio ranges of the culture media and water used to produce a given sample. These applications have been developed and tested through analyses of approximately 250 samples of Bacillus subtilis spores and over 500 samples of culture media, providing a strong statistical basis for data interpretation. A Bayesian statistical framework for integrating stable isotope data with other types of signatures derived from microorganisms has been able to characterize the culture medium used to produce spores of various Bacillus species, leveraging isotopic differences in different medium types and demonstrating the power of data integration for forensic investigations.
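
    A minimal sketch of the classification idea, assuming independent Gaussian likelihoods per isotope ratio; the medium names and values are invented for illustration.

        import numpy as np

        media = {                                    # per-medium training data,
            "medium_A": np.array([[-24.1, 3.2],      # columns: delta-13C, delta-15N
                                  [-23.8, 3.5],
                                  [-24.5, 2.9]]),
            "medium_B": np.array([[-12.3, 7.1],
                                  [-11.9, 6.8],
                                  [-12.6, 7.4]]),
        }
        questioned = np.array([-12.1, 7.0])          # spore sample of unknown origin

        def log_likelihood(x, train):                # independent Gaussian model
            mu, sd = train.mean(axis=0), train.std(axis=0, ddof=1)
            return float((-0.5 * ((x - mu) / sd) ** 2 - np.log(sd)).sum())

        scores = {m: log_likelihood(questioned, X) for m, X in media.items()}
        print(max(scores, key=scores.get))           # -> medium_B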

  8. Plasma creatinine in dogs: intra- and inter-laboratory variation in 10 European veterinary laboratories

    PubMed Central

    2011-01-01

    Background There is substantial variation in reported reference intervals for canine plasma creatinine among veterinary laboratories, thereby influencing the clinical assessment of analytical results. The aims of the study were to determine the inter- and intra-laboratory variation in plasma creatinine among 10 veterinary laboratories, and to compare results from each laboratory with the upper limit of its reference interval. Methods Samples were collected from 10 healthy dogs, 10 dogs with expected intermediate plasma creatinine concentrations, and 10 dogs with azotemia; overlap was observed between the first two groups. The 30 samples were divided into 3 batches and shipped in random order by postal delivery for plasma creatinine determination. Statistical testing was performed in accordance with ISO standard methodology. Results Inter- and intra-laboratory variation was clinically acceptable, as plasma creatinine values for most samples were usually of the same magnitude. A few extreme outliers caused three laboratories to fail statistical testing for consistency. Laboratory sample means above or below the overall sample mean did not unequivocally reflect high or low reference intervals in that laboratory. Conclusions In spite of close analytical results, further standardization among laboratories is warranted. The discrepant reference intervals seem largely to reflect different populations used in establishing the reference intervals, rather than analytical variation due to different laboratory methods. PMID:21477356
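
    The intra-/inter-laboratory split described here is, at its core, a one-way variance decomposition in the spirit of ISO 5725; a sketch with invented creatinine values (umol/L, three shipments of one sample to five labs):

        import numpy as np

        data = np.array([
            [98, 102, 100],    # lab 1
            [105, 107, 104],   # lab 2
            [97, 99, 98],      # lab 3
            [110, 108, 111],   # lab 4
            [101, 100, 103],   # lab 5
        ], dtype=float)

        n_labs, n_rep = data.shape
        grand = data.mean()
        ms_within = ((data - data.mean(axis=1, keepdims=True)) ** 2).sum() \
                    / (n_labs * (n_rep - 1))
        ms_between = n_rep * ((data.mean(axis=1) - grand) ** 2).sum() / (n_labs - 1)
        s2_repeat = ms_within                               # within-lab component
        s2_lab = max((ms_between - ms_within) / n_rep, 0.0) # between-lab component
        print(f"repeatability sd = {np.sqrt(s2_repeat):.2f}, "
              f"between-lab sd = {np.sqrt(s2_lab):.2f}")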

  9. Refractory Sampling Links Efficiency and Costs of Sensory Encoding to Stimulus Statistics

    PubMed Central

    Song, Zhuoyi

    2014-01-01

    Sensory neurons integrate information about the world, adapting their sampling to its changes. However, little is understood mechanistically how this primary encoding process, which ultimately limits perception, depends upon stimulus statistics. Here, we analyze this open question systematically by using intracellular recordings from fly (Drosophila melanogaster and Coenosia attenuata) photoreceptors and corresponding stochastic simulations from biophysically realistic photoreceptor models. Recordings show that photoreceptors can sample more information from naturalistic light intensity time series (NS) than from Gaussian white-noise (GWN), shuffled-NS or Gaussian-1/f stimuli; integrating larger responses with higher signal-to-noise ratio and encoding efficiency to large bursty contrast changes. Simulations reveal how a photoreceptor's information capture depends critically upon the stochastic refractoriness of its 30,000 sampling units (microvilli). In daylight, refractoriness sacrifices sensitivity to enhance intensity changes in neural image representations, with more and faster microvilli improving encoding. But for GWN and other stimuli, which lack longer dark contrasts of real-world intensity changes that reduce microvilli refractoriness, these performance gains are submaximal and energetically costly. These results provide mechanistic reasons why information sampling is more efficient for natural/naturalistic stimulation and novel insight into the operation, design, and evolution of signaling and code in sensory neurons. PMID:24849356
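
    A toy version of the sampling scheme the abstract describes: a pool of refractory sampling units, where photons landing on a busy unit are lost, so gain compresses as the input brightens. Parameters are illustrative, not fitted to fly photoreceptors.

        import numpy as np

        rng = np.random.default_rng(5)
        N_UNITS, REFRACTORY = 30_000, 100            # microvilli, dead time in ms

        def mean_response(photons_per_ms, t_ms=2000):
            busy_until = np.zeros(N_UNITS)
            counted = 0
            for t in range(t_ms):
                hits = rng.integers(0, N_UNITS, rng.poisson(photons_per_ms))
                free = busy_until[hits] <= t         # refractory units miss photons
                counted += free.sum()
                busy_until[hits[free]] = t + REFRACTORY
            return counted / t_ms

        for rate in (10, 100, 1000, 10_000):
            print(rate, round(float(mean_response(rate)), 1))  # compressive gain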

  10. Chromium isotopic anomalies in the Allende meteorite

    NASA Technical Reports Server (NTRS)

    Papanastassiou, D. A.

    1986-01-01

    Abundances of the chromium isotopes in terrestrial and bulk meteorite samples are identical to 0.01 percent. However, Ca-Al-rich inclusions from the Allende meteorite show endemic isotopic anomalies in chromium which require at least three nucleosynthetic components. Large anomalies at Cr-54 in a special class of inclusions are correlated with large anomalies at Ca-48 and Ti-50 and provide strong support for a component reflecting neutron-rich nucleosynthesis at nuclear statistical equilibrium. This correlation suggests that materials from very near the core of an exploding massive star may be injected into the interstellar medium.

  11. Comparative Financial Statistics for Public Two-Year Colleges: FY 1993 National Sample.

    ERIC Educational Resources Information Center

    Dickmeyer, Nathan; Meeker, Bradley

    This report provides comparative information derived from a national sample of 516 public two-year colleges, highlighting financial statistics for fiscal year 1992-93. Space is provided for colleges to compare their institutional statistics with national sample medians, quartile data for the national sample, and statistics presented in a…

  12. 7 CFR 52.38a - Definitions of terms applicable to statistical sampling.

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    Title 7 (Agriculture), Part 52, Statistical Sampling, § 52.38a: Definitions of terms applicable to statistical sampling. (a) Terms applicable to both on... acceptable as a process average. At the AQLs contained in the statistical sampling plans of this subpart...

  13. 7 CFR 52.38a - Definitions of terms applicable to statistical sampling.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    Title 7 (Agriculture), Part 52, Statistical Sampling, § 52.38a: Definitions of terms applicable to statistical sampling. (a) Terms applicable to both on... acceptable as a process average. At the AQLs contained in the statistical sampling plans of this subpart...

  14. Evidence for a Global Sampling Process in Extraction of Summary Statistics of Item Sizes in a Set.

    PubMed

    Tokita, Midori; Ueda, Sachiyo; Ishiguchi, Akira

    2016-01-01

    Several studies have shown that our visual system may construct a "summary statistical representation" over groups of visual objects. Although there is a general understanding that human observers can accurately represent sets of a variety of features, many questions on how summary statistics, such as an average, are computed remain unanswered. This study investigated sampling properties of visual information used by human observers to extract two types of summary statistics of item sets, average and variance. We presented three models of ideal observers to extract the summary statistics: a global sampling model without sampling noise, a global sampling model with sampling noise, and a limited sampling model. We compared the performance of an ideal observer of each model with that of human observers using statistical efficiency analysis. Results suggest that summary statistics of items in a set may be computed without representing individual items, which makes it possible to discard the limited sampling account. Moreover, the extraction of summary statistics may not necessarily require the representation of individual objects with focused attention when the sets contain more than four items.
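
    A sketch of two of the ideal observers named above, compared on the averaging task by their trial-to-trial error; the item statistics, noise level, and subset size are illustrative.

        import numpy as np

        rng = np.random.default_rng(12)
        trials, set_size = 20_000, 12
        items = rng.normal(50, 10, (trials, set_size))      # item sizes per trial
        true_mean = items.mean(axis=1)

        # Observer 1: samples every item, but each with internal noise.
        global_noisy = (items + rng.normal(0, 6, items.shape)).mean(axis=1)
        # Observer 2: averages only 4 randomly attended items, noise-free.
        subset = rng.integers(0, set_size, (trials, 4))
        limited = np.take_along_axis(items, subset, axis=1).mean(axis=1)

        for name, est in (("global + noise", global_noisy),
                          ("limited, 4 items", limited)):
            print(name, round(float(np.std(est - true_mean)), 2))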

  15. Efficient Bayesian mixed model analysis increases association power in large cohorts

    PubMed Central

    Loh, Po-Ru; Tucker, George; Bulik-Sullivan, Brendan K; Vilhjálmsson, Bjarni J; Finucane, Hilary K; Salem, Rany M; Chasman, Daniel I; Ridker, Paul M; Neale, Benjamin M; Berger, Bonnie; Patterson, Nick; Price, Alkes L

    2014-01-01

    Linear mixed models are a powerful statistical tool for identifying genetic associations and avoiding confounding. However, existing methods are computationally intractable in large cohorts, and may not optimize power. All existing methods require time cost O(MN²) (where N = #samples and M = #SNPs) and implicitly assume an infinitesimal genetic architecture in which effect sizes are normally distributed, which can limit power. Here, we present a far more efficient mixed model association method, BOLT-LMM, which requires only a small number of O(MN)-time iterations and increases power by modeling more realistic, non-infinitesimal genetic architectures via a Bayesian mixture prior on marker effect sizes. We applied BOLT-LMM to nine quantitative traits in 23,294 samples from the Women’s Genome Health Study (WGHS) and observed significant increases in power, consistent with simulations. Theory and simulations show that the boost in power increases with cohort size, making BOLT-LMM appealing for GWAS in large cohorts. PMID:25642633

  16. Protecting High Energy Barriers: A New Equation to Regulate Boost Energy in Accelerated Molecular Dynamics Simulations.

    PubMed

    Sinko, William; de Oliveira, César Augusto F; Pierce, Levi C T; McCammon, J Andrew

    2012-01-10

    Molecular dynamics (MD) is one of the most common tools in computational chemistry. Recently, our group has employed accelerated molecular dynamics (aMD) to improve the conformational sampling over conventional molecular dynamics techniques. In the original aMD implementation, sampling is greatly improved by raising energy wells below a predefined energy level. Recently, our group presented an alternative aMD implementation where simulations are accelerated by lowering energy barriers of the potential energy surface. When coupled with thermodynamic integration simulations, this implementation showed very promising results. However, when applied to large systems, such as proteins, the simulation tends to be biased to high energy regions of the potential landscape. The reason for this behavior lies in the boost equation used since the highest energy barriers are dramatically more affected than the lower ones. To address this issue, in this work, we present a new boost equation that prevents oversampling of unfavorable high energy conformational states. The new boost potential provides not only better recovery of statistics throughout the simulation but also enhanced sampling of statistically relevant regions in explicit solvent MD simulations.
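
    For concreteness, the widely used original aMD boost (the "raising energy wells below a predefined energy level" the text refers to) has the form dV = (E - V)^2 / (alpha + E - V) for V < E; the paper's new equation differs, and this sketch only makes the boost idea concrete.

        import numpy as np

        def amd_boost(V, E, alpha):
            # dV = (E - V)^2 / (alpha + E - V) wherever V < E, else 0
            V = np.asarray(V, dtype=float)
            dV = np.where(V < E, (E - V) ** 2 / (alpha + E - V), 0.0)
            return V + dV

        V = np.linspace(-10.0, 5.0, 7)               # potential energies, kcal/mol
        print(np.round(amd_boost(V, E=0.0, alpha=4.0), 2))
        # deep wells are raised toward E, flattening the landscape; states above
        # E (including high barriers) are left untouched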

  17. Bacterial diversity of surface sand samples from the Gobi and Taklamaken deserts.

    PubMed

    An, Shu; Couteau, Cécile; Luo, Fan; Neveu, Julie; DuBow, Michael S

    2013-11-01

    Arid regions represent nearly 30 % of the Earth's terrestrial surface, but their microbial biodiversity is not yet well characterized. The surface sands of deserts, a subset of arid regions, are generally subjected to large temperature fluctuations plus high UV light exposure and are low in organic matter. We examined surface sand samples from the Taklamaken (China, three samples) and Gobi (Mongolia, two samples) deserts, using pyrosequencing of PCR-amplified 16S V1/V2 rDNA sequences from total extracted DNA to assess the bacterial population diversity. In total, 4,088 OTUs (using ≥97 % sequence similarity levels), with Chao1 estimates varying from 1,172 to 2,425 OTUs per sample, were discernable. These could be grouped into 102 families belonging to 15 phyla, with OTUs belonging to the Firmicutes, Proteobacteria, Bacteroidetes, and Actinobacteria phyla being the most abundant. The bacterial population composition was statistically different among the samples, though members from 30 genera were found to be common among the five samples. An increase in phylotype numbers with increasing C/N ratio was noted, suggesting a possible role in the bacterial richness of these desert sand environments. Our results imply an unexpectedly large bacterial diversity residing in the harsh environment of these two Asian deserts, worthy of further investigation.
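
    The Chao1 estimator quoted above has a simple closed form, S_obs + F1^2 / (2 F2), with F1 and F2 the numbers of OTUs seen exactly once and exactly twice; a sketch with invented counts:

        from collections import Counter

        def chao1(otu_counts):
            nonzero = [c for c in otu_counts if c > 0]
            s_obs = len(nonzero)
            f = Counter(nonzero)
            f1, f2 = f[1], f[2]
            if f2 == 0:                            # bias-corrected fallback
                return s_obs + f1 * (f1 - 1) / 2
            return s_obs + f1 ** 2 / (2 * f2)

        print(chao1([5, 1, 1, 2, 9, 1, 2, 30]))    # observed 8, estimated 10.25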

  18. AgRISTARS: Supporting research. US crop calendars in support of the early warning project

    NASA Technical Reports Server (NTRS)

    Hodges, T. (Principal Investigator)

    1981-01-01

    The crop calendars produced for the Large Area Crop Inventory Experiment (LACIE) and crop calendar samples for Colorado, Iowa, Kansas, Minnesota, Montana, Nebraska, North Dakota, South Dakota, and Texas are presented. These calendars are based on weekly crop reporting district level observations of the percentage of various crops at several growth stages. A sample of the statistical treatments of the weekly data is provided. Four to five years of 50-percent dates for stages on a crop reporting district level for Arkansas, Iowa, Kentucky, Louisiana, Michigan, Mississippi, Ohio and Wisconsin are also given.

  19. Crop identification and area estimation over large geographic areas using LANDSAT MSS data

    NASA Technical Reports Server (NTRS)

    Bauer, M. E. (Principal Investigator)

    1977-01-01

    The author has identified the following significant results. LANDSAT MSS data were adequate to accurately identify wheat in Kansas; corn and soybean estimates in Indiana were less accurate. Computer-aided analysis techniques were effectively used to extract crop identification information from LANDSAT data. Systematic sampling of entire counties made possible by computer classification methods resulted in very precise area estimates at county, district, and state levels. Training statistics were successfully extended from one county to other counties having similar crops and soils if the training areas sampled the total variation of the area to be classified.

  20. Evaluating the effect of disturbed ensemble distributions on SCFG based statistical sampling of RNA secondary structures.

    PubMed

    Scheid, Anika; Nebel, Markus E

    2012-07-09

    Over the past years, statistical and Bayesian approaches have become increasingly appreciated to address the long-standing problem of computational RNA structure prediction. Recently, a novel probabilistic method for the prediction of RNA secondary structures from a single sequence has been studied which is based on generating statistically representative and reproducible samples of the entire ensemble of feasible structures for a particular input sequence. This method samples the possible foldings from a distribution implied by a sophisticated (traditional or length-dependent) stochastic context-free grammar (SCFG) that mirrors the standard thermodynamic model applied in modern physics-based prediction algorithms. Specifically, that grammar represents an exact probabilistic counterpart to the energy model underlying the Sfold software, which employs a sampling extension of the partition function (PF) approach to produce statistically representative subsets of the Boltzmann-weighted ensemble. Although both sampling approaches have the same worst-case time and space complexities, it has been indicated that they differ in performance (both with respect to prediction accuracy and quality of generated samples), where neither of these two competing approaches generally outperforms the other. In this work, we will consider the SCFG based approach in order to perform an analysis on how the quality of generated sample sets and the corresponding prediction accuracy changes when different degrees of disturbances are incorporated into the needed sampling probabilities. This is motivated by the fact that if the results prove to be resistant to large errors on the distinct sampling probabilities (compared to the exact ones), then it will be an indication that these probabilities do not need to be computed exactly, but it may be sufficient and more efficient to approximate them. Thus, it might then be possible to decrease the worst-case time requirements of such an SCFG based sampling method without significant accuracy losses. If, on the other hand, the quality of sampled structures can be observed to strongly react to slight disturbances, there is little hope for improving the complexity by heuristic procedures. We hence provide a reliable test for the hypothesis that a heuristic method could be implemented to improve the time scaling of RNA secondary structure prediction in the worst case, without sacrificing much of the accuracy of the results. Our experiments indicate that absolute errors generally lead to the generation of useless sample sets, whereas relative errors seem to have only small negative impact on both the predictive accuracy and the overall quality of resulting structure samples. Based on these observations, we present some useful ideas for developing a time-reduced sampling method guaranteeing an acceptable predictive accuracy. We also discuss some inherent drawbacks that arise in the context of approximation. The key results of this paper are crucial for the design of an efficient and competitive heuristic prediction method based on the increasingly accepted and attractive statistical sampling approach. This has indeed been indicated by the construction of prototype algorithms.
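
    A minimal sketch of the perturbation experiment itself: disturb a set of exact sampling probabilities with relative or absolute errors of the same nominal size, renormalize, and measure how far the sampling distribution moves. Consistent with the abstract, absolute errors distort the small probabilities far more; the probabilities here are random stand-ins, not SCFG rule probabilities.

        import numpy as np

        rng = np.random.default_rng(6)
        p = rng.dirichlet(np.ones(20))             # stand-in exact probabilities

        def disturb(p, eps, mode):
            if mode == "relative":
                q = p * (1 + rng.uniform(-eps, eps, p.size))
            else:                                  # absolute errors swamp small p
                q = p + rng.uniform(-eps, eps, p.size)
            q = np.clip(q, 1e-12, None)
            return q / q.sum()

        def total_variation(p, q):
            return 0.5 * np.abs(p - q).sum()

        for mode in ("relative", "absolute"):
            print(mode, round(float(total_variation(p, disturb(p, 0.1, mode))), 3))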

  1. Evaluating the effect of disturbed ensemble distributions on SCFG based statistical sampling of RNA secondary structures

    PubMed Central

    2012-01-01

    Background Over the past years, statistical and Bayesian approaches have become increasingly appreciated to address the long-standing problem of computational RNA structure prediction. Recently, a novel probabilistic method for the prediction of RNA secondary structures from a single sequence has been studied which is based on generating statistically representative and reproducible samples of the entire ensemble of feasible structures for a particular input sequence. This method samples the possible foldings from a distribution implied by a sophisticated (traditional or length-dependent) stochastic context-free grammar (SCFG) that mirrors the standard thermodynamic model applied in modern physics-based prediction algorithms. Specifically, that grammar represents an exact probabilistic counterpart to the energy model underlying the Sfold software, which employs a sampling extension of the partition function (PF) approach to produce statistically representative subsets of the Boltzmann-weighted ensemble. Although both sampling approaches have the same worst-case time and space complexities, it has been indicated that they differ in performance (both with respect to prediction accuracy and quality of generated samples), where neither of these two competing approaches generally outperforms the other. Results In this work, we will consider the SCFG based approach in order to perform an analysis on how the quality of generated sample sets and the corresponding prediction accuracy changes when different degrees of disturbances are incorporated into the needed sampling probabilities. This is motivated by the fact that if the results prove to be resistant to large errors on the distinct sampling probabilities (compared to the exact ones), then it will be an indication that these probabilities do not need to be computed exactly, but it may be sufficient and more efficient to approximate them. Thus, it might then be possible to decrease the worst-case time requirements of such an SCFG based sampling method without significant accuracy losses. If, on the other hand, the quality of sampled structures can be observed to strongly react to slight disturbances, there is little hope for improving the complexity by heuristic procedures. We hence provide a reliable test for the hypothesis that a heuristic method could be implemented to improve the time scaling of RNA secondary structure prediction in the worst-case – without sacrificing much of the accuracy of the results. Conclusions Our experiments indicate that absolute errors generally lead to the generation of useless sample sets, whereas relative errors seem to have only small negative impact on both the predictive accuracy and the overall quality of resulting structure samples. Based on these observations, we present some useful ideas for developing a time-reduced sampling method guaranteeing an acceptable predictive accuracy. We also discuss some inherent drawbacks that arise in the context of approximation. The key results of this paper are crucial for the design of an efficient and competitive heuristic prediction method based on the increasingly accepted and attractive statistical sampling approach. This has indeed been indicated by the construction of prototype algorithms. PMID:22776037

  2. Deficiency of ''Thin'' Stellar Bars in Seyfert Host Galaxies

    NASA Technical Reports Server (NTRS)

    Shlosman, Isaac; Peletier, Reynier F.; Knapen, Johan

    1999-01-01

    Using all available major samples of Seyfert galaxies and their corresponding control samples of closely matched non-active galaxies, we find that the bar ellipticities (or axial ratios) in Seyfert galaxies are systematically different from those in non-active galaxies. Overall, there is a deficiency of bars with large ellipticities (i.e., 'thin' bars) in Seyferts compared to non-active galaxies. Although accompanied by a large dispersion due to small-number statistics, this effect is, strictly speaking, at the 2 sigma level. To obtain this result, the active galaxy samples of near-infrared surface photometry were matched to those of normal galaxies in type, host galaxy ellipticity, absolute magnitude, and, to some extent, in redshift. We discuss possible theoretical explanations of this phenomenon within the framework of galactic evolution and, in particular, of radial gas redistribution in barred galaxies. Our conclusions provide further evidence that Seyfert hosts differ systematically from their non-active counterparts on scales of a few kpc.
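
    The kind of two-sample distribution comparison behind such a claim can be sketched with a Kolmogorov-Smirnov test; the ellipticity samples below are invented, and a 2 sigma effect corresponds roughly to p near 0.05.

        import numpy as np
        from scipy.stats import ks_2samp

        rng = np.random.default_rng(13)
        seyfert_ellip = rng.beta(4, 3, 30)         # fewer very elongated bars
        control_ellip = rng.beta(3, 2, 30)
        stat, p = ks_2samp(seyfert_ellip, control_ellip)
        print(f"KS D = {stat:.2f}, p = {p:.3f}")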

  3. Ketorolac therapy for the prevention of acute pseudophakic cystoid macular edema: a systematic review

    PubMed Central

    Yilmaz, T; Cordero-Coma, M; Gallagher, M J

    2012-01-01

    To assess the effectiveness of ketorolac vs control for prevention of acute pseudophakic cystoid macular edema (CME). The following databases were searched: Medline (1950–June 11, 2011), The Cochrane Library (Issue 2, 2011), and the TRIP Database (up to 11 June 2011), using no language or other limits. Randomized controlled clinical trials (RCTs) were included that consisted of patients with acute pseudophakic cystoid macular edema, those comparing ketorolac with control, and those having at least a minimum follow-up of 28 days. In the four RCTs evaluating ketorolac vs control, treatment with ketorolac significantly reduced the risk of CME development at the end of treatment (∼4 weeks) compared to control (P=0.008; 95% confidence interval (0.03–0.58)). When analyzed individually, all but one study was statistically nonsignificant in its findings. When the pooled relative risk was calculated, however, overall statistical significance emerged; this is attributable to the review's large combined sample size and not to the individual studies themselves. In this systematic review of four RCTs, two of which compared ketorolac with no treatment and two of which evaluated ketorolac vs placebo drops, treatment with ketorolac significantly reduced the risk of developing CME at the end of ∼4 weeks of treatment compared with controls. These results, however, should be interpreted with caution considering the paucity of large randomized clinical trials in the literature. PMID:22094296
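
    The pooling step can be illustrated with the Mantel-Haenszel fixed-effect relative risk; the 2x2 tables below are invented placeholders, not the review's data.

        # Each tuple: (events_treated, n_treated, events_control, n_control)
        tables = [
            (2, 50, 8, 50),
            (1, 40, 5, 42),
            (3, 60, 9, 58),
            (0, 30, 4, 31),
        ]

        # Mantel-Haenszel pooled RR = sum(a * n0 / T) / sum(c * n1 / T)
        num = den = 0.0
        for e1, n1, e0, n0 in tables:
            total = n1 + n0
            num += e1 * n0 / total
            den += e0 * n1 / total
        print(round(num / den, 2))                 # pooled RR < 1 favors treatment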

  4. Sample and population exponents of generalized Taylor's law.

    PubMed

    Giometto, Andrea; Formentin, Marco; Rinaldo, Andrea; Cohen, Joel E; Maritan, Amos

    2015-06-23

    Taylor's law (TL) states that the variance V of a nonnegative random variable is a power function of its mean M; i.e., V = aM^b. TL has been verified extensively in ecology, where it applies to population abundance, in physics, and in other natural sciences. Its ubiquitous empirical verification suggests a context-independent mechanism. Sample exponents b measured empirically via the scaling of sample mean and variance typically cluster around the value b = 2. Some theoretical models of population growth, however, predict a broad range of values for the population exponent b pertaining to the mean and variance of population density, depending on details of the growth process. Is the widely reported sample exponent b ≃ 2 the result of ecological processes or could it be a statistical artifact? Here, we apply large deviations theory and finite-sample arguments to show exactly that in a broad class of growth models the sample exponent is b ≃ 2 regardless of the underlying population exponent. We derive a generalized TL in terms of sample and population exponents b_jk for the scaling of the kth vs. the jth cumulants. The sample exponent b_jk depends predictably on the number of samples, and for finite samples we obtain b_jk ≃ k/j asymptotically in time, a prediction that we verify in two empirical examples. Thus, the sample exponent b ≃ 2 may indeed be a statistical artifact and not dependent on population dynamics under conditions that we specify exactly. Given the broad class of models investigated, our results apply to many fields where TL is used although inadequately understood.
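
    The b ≃ 2 sample exponent is easy to reproduce: simulate multiplicative growth, then regress log sample variance on log sample mean across replicate populations. A sketch with arbitrary drift and volatility, fitting the late (asymptotic-in-time) regime:

        import numpy as np

        rng = np.random.default_rng(7)
        n_reps, n_steps = 50, 200
        log_x = np.cumsum(rng.normal(0.02, 0.3, (n_reps, n_steps)), axis=1)
        x = np.exp(log_x)                          # multiplicative growth paths

        m = x.mean(axis=0)                         # sample mean across replicates
        v = x.var(axis=0, ddof=1)                  # sample variance across replicates
        late = slice(n_steps // 2, None)
        b, _ = np.polyfit(np.log(m[late]), np.log(v[late]), 1)
        print(round(float(b), 2))                  # close to 2, regardless of drift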

  5. Evaluation and application of summary statistic imputation to discover new height-associated loci.

    PubMed

    Rüeger, Sina; McDaid, Aaron; Kutalik, Zoltán

    2018-05-01

    As most of the heritability of complex traits is attributed to common and low frequency genetic variants, imputing them by combining genotyping chips and large sequenced reference panels is the most cost-effective approach to discover the genetic basis of these traits. Association summary statistics from genome-wide meta-analyses are available for hundreds of traits. Updating these to ever-increasing reference panels is very cumbersome as it requires reimputation of the genetic data, rerunning the association scan, and meta-analysing the results. A much more efficient method is to directly impute the summary statistics, termed summary statistics imputation, which we improved to accommodate variable sample size across SNVs. Its performance relative to genotype imputation and practical utility has not yet been fully investigated. To this end, we compared the two approaches on real (genotyped and imputed) data from 120K samples from the UK Biobank and show that genotype imputation boasts a 3- to 5-fold lower root-mean-square error, and better distinguishes true associations from null ones: we observed the largest differences in power for variants with low minor allele frequency and low imputation quality. For fixed false positive rates of 0.001, 0.01, 0.05, using summary statistics imputation yielded a decrease in statistical power by 9, 43 and 35%, respectively. To test its capacity to discover novel associations, we applied summary statistics imputation to the GIANT height meta-analysis summary statistics covering HapMap variants, and identified 34 novel loci, 19 of which replicated using data in the UK Biobank. Additionally, we successfully replicated 55 out of the 111 variants published in an exome chip study. Our study demonstrates that summary statistics imputation is a very efficient and cost-effective way to identify and fine-map trait-associated loci. Moreover, the ability to impute summary statistics is important for follow-up analyses, such as Mendelian randomisation or LD-score regression.
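
    The core of summary statistics imputation is the conditional-expectation formula z_untyped = C_ut · inv(C_tt) · z_typed, with C the LD (correlation) matrix from a reference panel; a sketch with simulated genotypes and invented z-scores, plus a small ridge term for numerical stability:

        import numpy as np

        rng = np.random.default_rng(8)
        G = rng.binomial(2, 0.3, (500, 6)).astype(float)   # reference genotypes
        G[:, 5] = G[:, 0] + rng.normal(0, 0.4, 500)        # untyped SNP in LD with SNP 0
        C = np.corrcoef(G, rowvar=False)                   # LD matrix

        z_typed = np.array([4.2, 0.3, -0.5, 1.1, 0.2])     # observed z, SNPs 0-4
        typed, untyped = slice(0, 5), 5
        Ctt = C[typed, typed] + 1e-3 * np.eye(5)           # ridge regularization
        z_imp = C[untyped, typed] @ np.linalg.solve(Ctt, z_typed)
        print(round(float(z_imp), 2))                      # large, driven by LD with SNP 0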

  6. Evaluation and application of summary statistic imputation to discover new height-associated loci

    PubMed Central

    2018-01-01

    As most of the heritability of complex traits is attributed to common and low frequency genetic variants, imputing them by combining genotyping chips and large sequenced reference panels is the most cost-effective approach to discover the genetic basis of these traits. Association summary statistics from genome-wide meta-analyses are available for hundreds of traits. Updating these to ever-increasing reference panels is very cumbersome as it requires reimputation of the genetic data, rerunning the association scan, and meta-analysing the results. A much more efficient method is to directly impute the summary statistics, termed summary statistics imputation, which we improved to accommodate variable sample size across SNVs. Its performance relative to genotype imputation and practical utility has not yet been fully investigated. To this end, we compared the two approaches on real (genotyped and imputed) data from 120K samples from the UK Biobank and show that genotype imputation boasts a 3- to 5-fold lower root-mean-square error, and better distinguishes true associations from null ones: we observed the largest differences in power for variants with low minor allele frequency and low imputation quality. For fixed false positive rates of 0.001, 0.01, 0.05, using summary statistics imputation yielded a decrease in statistical power by 9, 43 and 35%, respectively. To test its capacity to discover novel associations, we applied summary statistics imputation to the GIANT height meta-analysis summary statistics covering HapMap variants, and identified 34 novel loci, 19 of which replicated using data in the UK Biobank. Additionally, we successfully replicated 55 out of the 111 variants published in an exome chip study. Our study demonstrates that summary statistics imputation is a very efficient and cost-effective way to identify and fine-map trait-associated loci. Moreover, the ability to impute summary statistics is important for follow-up analyses, such as Mendelian randomisation or LD-score regression. PMID:29782485

  7. A Simple Sampling Method for Estimating the Accuracy of Large Scale Record Linkage Projects.

    PubMed

    Boyd, James H; Guiver, Tenniel; Randall, Sean M; Ferrante, Anna M; Semmens, James B; Anderson, Phil; Dickinson, Teresa

    2016-05-17

    Record linkage techniques allow different data collections to be brought together to provide a wider picture of the health status of individuals. Ensuring high linkage quality is important to guarantee the quality and integrity of research. Current methods for measuring linkage quality typically focus on precision (the proportion of accepted links that are true matches), given the difficulty of measuring the proportion of false negatives. The aim of this work is to introduce and evaluate a sampling-based method to estimate both precision and recall following record linkage. In the sampling-based method, record pairs from each threshold band (including those below the identified cut-off for acceptance) are sampled and clerically reviewed. These results are then applied to the entire set of record pairs, providing estimates of false positives and false negatives. This method was evaluated on a synthetically generated dataset, where the true match status (which records belonged to the same person) was known. The sampled estimates of linkage quality were relatively close to the actual linkage quality metrics calculated for the whole synthetic dataset. The precision and recall measures for seven reviewers were very consistent, with little variation in the clerical assessment results (overall agreement using the Fleiss kappa statistic was 0.601). This method offers a possible means of accurately estimating matching quality and refining linkages in population-level linkage studies. The sampling approach is especially important for large linkage projects, where the number of record pairs produced may be very large, often running into millions.
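
    A sketch of the estimator: review a random sample of record pairs within each score band, including bands below the acceptance cut-off, then scale the observed match rates back up; all counts are invented.

        # Each tuple: (pairs_in_band, sampled, true_matches_in_sample, accepted?)
        bands = [
            (100_000, 200, 198, True),    # high-score band
            (20_000, 200, 150, True),     # near the cut-off
            (50_000, 200, 30, False),     # below cut-off: missed matches live here
        ]

        tp = fp = fn = 0.0
        for n_pairs, n_sampled, n_match, accepted in bands:
            est_matches = n_pairs * n_match / n_sampled   # scale sample to band
            if accepted:
                tp += est_matches
                fp += n_pairs - est_matches
            else:
                fn += est_matches

        print(f"precision ~ {tp / (tp + fp):.3f}, recall ~ {tp / (tp + fn):.3f}")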

  8. Probabilistic Graphical Model Representation in Phylogenetics

    PubMed Central

    Höhna, Sebastian; Heath, Tracy A.; Boussau, Bastien; Landis, Michael J.; Ronquist, Fredrik; Huelsenbeck, John P.

    2014-01-01

    Recent years have seen a rapid expansion of the model space explored in statistical phylogenetics, emphasizing the need for new approaches to statistical model representation and software development. Clear communication and representation of the chosen model is crucial for: (i) reproducibility of an analysis, (ii) model development, and (iii) software design. Moreover, a unified, clear and understandable framework for model representation lowers the barrier for beginners and nonspecialists to grasp complex phylogenetic models, including their assumptions and parameter/variable dependencies. Graphical modeling is a unifying framework that has gained in popularity in the statistical literature in recent years. The core idea is to break complex models into conditionally independent distributions. The strength lies in the comprehensibility, flexibility, and adaptability of this formalism, and the large body of computational work based on it. Graphical models are well-suited to teach statistical models, to facilitate communication among phylogeneticists and in the development of generic software for simulation and statistical inference. Here, we provide an introduction to graphical models for phylogeneticists and extend the standard graphical model representation to the realm of phylogenetics. We introduce a new graphical model component, tree plates, to capture the changing structure of the subgraph corresponding to a phylogenetic tree. We describe a range of phylogenetic models using the graphical model framework and introduce modules to simplify the representation of standard components in large and complex models. Phylogenetic model graphs can be readily used in simulation, maximum likelihood inference, and Bayesian inference using, for example, Metropolis–Hastings or Gibbs sampling of the posterior distribution. [Computation; graphical models; inference; modularization; statistical phylogenetics; tree plate.] PMID:24951559

  9. Errors in patient specimen collection: application of statistical process control.

    PubMed

    Dzik, Walter Sunny; Beckman, Neil; Selleng, Kathleen; Heddle, Nancy; Szczepiorkowski, Zbigniew; Wendel, Silvano; Murphy, Michael

    2008-10-01

    Errors in the collection and labeling of blood samples for pretransfusion testing increase the risk of transfusion-associated patient morbidity and mortality. Statistical process control (SPC) is a recognized method to monitor the performance of a critical process. An easy-to-use SPC method was tested to determine its feasibility as a tool for monitoring quality in transfusion medicine. SPC control charts were adapted to a spreadsheet presentation. Data tabulating the frequency of mislabeled and miscollected blood samples from 10 hospitals in five countries from 2004 to 2006 were used to demonstrate the method. Control charts were produced to monitor process stability. The participating hospitals found the SPC spreadsheet very suitable to monitor the performance of the sample labeling and collection and applied SPC charts to suit their specific needs. One hospital monitored subcategories of sample error in detail. A large hospital monitored the number of wrong-blood-in-tube (WBIT) events. Four smaller-sized facilities, each following the same policy for sample collection, combined their data on WBIT samples into a single control chart. One hospital used the control chart to monitor the effect of an educational intervention. A simple SPC method is described that can monitor the process of sample collection and labeling in any hospital. SPC could be applied to other critical steps in the transfusion processes as a tool for biovigilance and could be used to develop regional or national performance standards for pretransfusion sample collection. A link is provided to download the spreadsheet for free.
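
    A p-chart in the spirit described, with 3-sigma control limits around the pooled error rate; the monthly counts are invented.

        import numpy as np

        errors  = np.array([12, 9, 15, 11, 8, 14, 30, 10])   # mislabeled samples/month
        samples = np.array([4000] * 8)                       # samples drawn per month

        p = errors / samples
        p_bar = errors.sum() / samples.sum()                 # pooled error rate
        sigma = np.sqrt(p_bar * (1 - p_bar) / samples)
        ucl = p_bar + 3 * sigma                              # upper control limit
        for month, (pi, u) in enumerate(zip(p, ucl), 1):
            flag = "OUT" if pi > u else "ok"
            print(f"month {month}: p={pi:.4f} UCL={u:.4f} {flag}")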

  10. An information-theoretic approach to the modeling and analysis of whole-genome bisulfite sequencing data.

    PubMed

    Jenkinson, Garrett; Abante, Jordi; Feinberg, Andrew P; Goutsias, John

    2018-03-07

    DNA methylation is a stable form of epigenetic memory used by cells to control gene expression. Whole genome bisulfite sequencing (WGBS) has emerged as a gold-standard experimental technique for studying DNA methylation by producing high resolution genome-wide methylation profiles. Statistical modeling and analysis is employed to computationally extract and quantify information from these profiles in an effort to identify regions of the genome that demonstrate crucial or aberrant epigenetic behavior. However, the performance of most currently available methods for methylation analysis is hampered by their inability to directly account for statistical dependencies between neighboring methylation sites, thus ignoring significant information available in WGBS reads. We present a powerful information-theoretic approach for genome-wide modeling and analysis of WGBS data based on the 1D Ising model of statistical physics. This approach takes into account correlations in methylation by utilizing a joint probability model that encapsulates all information available in WGBS methylation reads and produces accurate results even when applied on single WGBS samples with low coverage. Using the Shannon entropy, our approach provides a rigorous quantification of methylation stochasticity in individual WGBS samples genome-wide. Furthermore, it utilizes the Jensen-Shannon distance to evaluate differences in methylation distributions between a test and a reference sample. Differential performance assessment using simulated and real human lung normal/cancer data demonstrates a clear superiority of our approach over DSS, a recently proposed method for WGBS data analysis. Critically, these results demonstrate that marginal methods become statistically invalid when correlations are present in the data. This contribution demonstrates clear benefits and the necessity of modeling joint probability distributions of methylation using the 1D Ising model of statistical physics and of quantifying methylation stochasticity using concepts from information theory. By employing this methodology, substantial improvement of DNA methylation analysis can be achieved by effectively taking into account the massive amount of statistical information available in WGBS data, which is largely ignored by existing methods.
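
    The Jensen-Shannon distance used above to score differences between methylation distributions can be sketched directly; the two histograms below are toy stand-ins for per-region methylation-level distributions.

        import numpy as np

        def jsd(p, q):
            # Jensen-Shannon distance in [0, 1] (base-2 logs)
            p, q = np.asarray(p, float), np.asarray(q, float)
            p, q = p / p.sum(), q / q.sum()
            m = 0.5 * (p + q)
            def kl(a, b):
                mask = a > 0
                return (a[mask] * np.log2(a[mask] / b[mask])).sum()
            return np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

        normal = [0.70, 0.15, 0.05, 0.04, 0.06]   # mostly unmethylated
        tumor  = [0.30, 0.20, 0.15, 0.15, 0.20]   # more stochastic methylation
        print(round(float(jsd(normal, tumor)), 3))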

  11. Efficient computation of the joint sample frequency spectra for multiple populations.

    PubMed

    Kamm, John A; Terhorst, Jonathan; Song, Yun S

    2017-01-01

    A wide range of studies in population genetics have employed the sample frequency spectrum (SFS), a summary statistic which describes the distribution of mutant alleles at a polymorphic site in a sample of DNA sequences and provides a highly efficient dimensional reduction of large-scale population genomic variation data. Recently, there has been much interest in analyzing the joint SFS data from multiple populations to infer parameters of complex demographic histories, including variable population sizes, population split times, migration rates, admixture proportions, and so on. SFS-based inference methods require accurate computation of the expected SFS under a given demographic model. Although much methodological progress has been made, existing methods suffer from numerical instability and high computational complexity when multiple populations are involved and the sample size is large. In this paper, we present new analytic formulas and algorithms that enable accurate, efficient computation of the expected joint SFS for thousands of individuals sampled from hundreds of populations related by a complex demographic model with arbitrary population size histories (including piecewise-exponential growth). Our results are implemented in a new software package called momi (MOran Models for Inference). Through an empirical study we demonstrate our improvements to numerical stability and computational complexity.
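
    A sanity check of the simplest case any expected-SFS machinery must reproduce: for one constant-size population, E[xi_i] is proportional to 1/i, which a small coalescent simulation recovers.

        import numpy as np

        rng = np.random.default_rng(9)

        def mean_branch_lengths(n, n_trees=5000):
            # L[i-1] accumulates branch length subtending i of the n leaves
            L = np.zeros(n - 1)
            for _ in range(n_trees):
                sizes = [1] * n                      # leaves beneath each lineage
                while len(sizes) > 1:
                    k = len(sizes)
                    t = rng.exponential(2.0 / (k * (k - 1)))   # coalescent waiting time
                    for s in sizes:
                        L[s - 1] += t
                    i, j = sorted(rng.choice(k, 2, replace=False))
                    sizes[i] += sizes[j]             # merge two random lineages
                    sizes.pop(j)
            return L / n_trees

        L = mean_branch_lengths(6)
        print(np.round(L / L[0], 2))                 # ~ [1, 0.5, 0.33, 0.25, 0.2]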

  12. Efficient computation of the joint sample frequency spectra for multiple populations

    PubMed Central

    Kamm, John A.; Terhorst, Jonathan; Song, Yun S.

    2016-01-01

    A wide range of studies in population genetics have employed the sample frequency spectrum (SFS), a summary statistic which describes the distribution of mutant alleles at a polymorphic site in a sample of DNA sequences and provides a highly efficient dimensional reduction of large-scale population genomic variation data. Recently, there has been much interest in analyzing the joint SFS data from multiple populations to infer parameters of complex demographic histories, including variable population sizes, population split times, migration rates, admixture proportions, and so on. SFS-based inference methods require accurate computation of the expected SFS under a given demographic model. Although much methodological progress has been made, existing methods suffer from numerical instability and high computational complexity when multiple populations are involved and the sample size is large. In this paper, we present new analytic formulas and algorithms that enable accurate, efficient computation of the expected joint SFS for thousands of individuals sampled from hundreds of populations related by a complex demographic model with arbitrary population size histories (including piecewise-exponential growth). Our results are implemented in a new software package called momi (MOran Models for Inference). Through an empirical study we demonstrate our improvements to numerical stability and computational complexity. PMID:28239248

  13. Inverse statistical physics of protein sequences: a key issues review.

    PubMed

    Cocco, Simona; Feinauer, Christoph; Figliuzzi, Matteo; Monasson, Rémi; Weigt, Martin

    2018-03-01

    In the course of evolution, proteins undergo important changes in their amino acid sequences, while their three-dimensional folded structure and their biological function remain remarkably conserved. Thanks to modern sequencing techniques, sequence data accumulate at an unprecedented pace. This provides large sets of so-called homologous, i.e. evolutionarily related, protein sequences, to which methods of inverse statistical physics can be applied. Using sequence data as the basis for the inference of Boltzmann distributions from samples of microscopic configurations or observables, it is possible to extract information about evolutionary constraints and thus protein function and structure. Here we give an overview of some biologically important questions and of how statistical-mechanics-inspired modeling approaches can help to answer them. Finally, we discuss some open questions, which we expect to be addressed over the next years.
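
    One workhorse of this program is the mean-field inverse Ising (DCA-style) approximation, where couplings are estimated from the inverse of the empirical covariance matrix; a sketch on a toy two-letter alignment with one covarying pair of positions (real applications use 21-state amino-acid alphabets and reweighting).

        import numpy as np

        rng = np.random.default_rng(14)
        n_seq, L = 2000, 8
        seqs = rng.integers(0, 2, (n_seq, L)).astype(float)   # toy binary alignment
        seqs[:, 3] = seqs[:, 1]                               # covarying positions
        flip = rng.random(n_seq) < 0.1
        seqs[flip, 3] = 1 - seqs[flip, 3]                     # plus some noise

        C = np.cov(seqs, rowvar=False) + 0.01 * np.eye(L)     # regularized covariance
        J = -np.linalg.inv(C)                                 # mean-field couplings
        np.fill_diagonal(J, 0)
        i, j = np.unravel_index(np.abs(J).argmax(), J.shape)
        print(i, j)                                           # -> the coupled pair (1, 3)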

  14. Inverse statistical physics of protein sequences: a key issues review

    NASA Astrophysics Data System (ADS)

    Cocco, Simona; Feinauer, Christoph; Figliuzzi, Matteo; Monasson, Rémi; Weigt, Martin

    2018-03-01

    In the course of evolution, proteins undergo important changes in their amino acid sequences, while their three-dimensional folded structure and their biological function remain remarkably conserved. Thanks to modern sequencing techniques, sequence data accumulate at an unprecedented pace. This provides large sets of so-called homologous, i.e. evolutionarily related, protein sequences, to which methods of inverse statistical physics can be applied. Using sequence data as the basis for the inference of Boltzmann distributions from samples of microscopic configurations or observables, it is possible to extract information about evolutionary constraints and thus protein function and structure. Here we give an overview of some biologically important questions and of how statistical-mechanics-inspired modeling approaches can help to answer them. Finally, we discuss some open questions, which we expect to be addressed over the next years.

  15. Using lod scores to detect sex differences in male-female recombination fractions.

    PubMed

    Feenstra, B; Greenberg, D A; Hodge, S E

    2004-01-01

    Human recombination fraction (RF) can differ between males and females, but investigators do not always know which disease genes are located in genomic areas of large RF sex differences. Knowledge of RF sex differences contributes to our understanding of basic biology and can increase the power of a linkage study, improve gene localization, and provide clues to possible imprinting. One way to detect these differences is to use lod scores. In this study we focused on detecting RF sex differences and answered the following questions, in both phase-known and phase-unknown matings: (1) How large a sample size is needed to detect a RF sex difference? (2) What are "optimal" proportions of paternally vs. maternally informative matings? (3) Does ascertaining nonoptimal proportions of paternally or maternally informative matings lead to ascertainment bias? Our results were as follows: (1) We calculated expected lod scores (ELODs) under two different conditions: "unconstrained," allowing sex-specific RF parameters (theta_female, theta_male); and "constrained," requiring theta_female = theta_male. We then examined DeltaELOD (defined as the difference between the maximized constrained and unconstrained ELODs) and calculated minimum sample sizes required to achieve statistically significant DeltaELODs. For large RF sex differences, samples as small as 10 to 20 fully informative matings can achieve statistical significance. We give general sample size guidelines for detecting RF differences in informative phase-known and phase-unknown matings. (2) We defined p as the proportion of paternally informative matings in the dataset, and the optimal proportion p° as that value of p that maximizes DeltaELOD. We determined that, surprisingly, p° does not necessarily equal 1/2, although it does fall between approximately 0.4 and 0.6 in most situations. (3) We showed that if p in a sample deviates from its optimal value, no bias is introduced (asymptotically) to the maximum likelihood estimates of theta_female and theta_male, even though the ELOD is reduced (see point 2). This fact is important because often investigators cannot control the proportions of paternally and maternally informative families. In conclusion, it is possible to reliably detect sex differences in recombination fraction. Copyright 2004 S. Karger AG, Basel
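
    The constrained-versus-unconstrained comparison can be sketched as a likelihood-ratio test on phase-known recombinant counts; the counts below are invented, and the chi-square tail gives the significance.

        import numpy as np
        from scipy.stats import chi2

        rf, mf = 18, 100                             # female recombinants / meioses
        rm, mm = 5, 100                              # male recombinants / meioses

        def max_loglik(r, m):                        # binomial log-likelihood at MLE r/m
            th = r / m
            return r * np.log(th) + (m - r) * np.log(1 - th)

        unconstrained = max_loglik(rf, mf) + max_loglik(rm, mm)
        constrained = max_loglik(rf + rm, mf + mm)   # forces theta_female = theta_male
        lrt = 2 * (unconstrained - constrained)
        print(f"chi-square = {lrt:.2f}, p = {chi2.sf(lrt, df=1):.4f}")
        print(f"lod-scale difference = {lrt / (2 * np.log(10)):.2f}")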

  16. A statistical evaluation of non-ergodic variogram estimators

    USGS Publications Warehouse

    Curriero, F.C.; Hohn, M.E.; Liebhold, A.M.; Lele, S.R.

    2002-01-01

    Geostatistics is a set of statistical techniques that is increasingly used to characterize spatial dependence in spatially referenced ecological data. A common feature of geostatistics is predicting values at unsampled locations from nearby samples using the kriging algorithm. Modeling spatial dependence in sampled data is necessary before kriging and is usually accomplished with the variogram and its traditional estimator. Other types of estimators, known as non-ergodic estimators, have been used in ecological applications. Non-ergodic estimators were originally suggested as a method of choice when sampled data are preferentially located and exhibit a skewed frequency distribution. Preferentially located samples can occur, for example, when areas with high values are sampled more intensely than other areas. In earlier studies the visual appearance of variograms from traditional and non-ergodic estimators were compared. Here we evaluate the estimators' relative performance in prediction. We also show algebraically that a non-ergodic version of the variogram is equivalent to the traditional variogram estimator. Simulations, designed to investigate the effects of data skewness and preferential sampling on variogram estimation and kriging, showed the traditional variogram estimator outperforms the non-ergodic estimators under these conditions. We also analyzed data on carabid beetle abundance, which exhibited large-scale spatial variability (trend) and a skewed frequency distribution. Detrending data followed by robust estimation of the residual variogram is demonstrated to be a successful alternative to the non-ergodic approach.
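
    For concreteness, the "traditional estimator" referred to above is the Matheron moment estimator; a minimal sketch on toy, deliberately skewed data (not the study's beetle data) is:

      import numpy as np

      def empirical_variogram(coords, values, bin_edges):
          # Traditional (Matheron) estimator: gamma(h) is the average of
          # 0.5*(z_i - z_j)^2 over point pairs whose separation falls in bin h.
          d = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1))
          sq = 0.5 * (values[:, None] - values[None, :]) ** 2
          iu = np.triu_indices(len(values), k=1)
          d, sq = d[iu], sq[iu]
          centers, gamma = [], []
          for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
              m = (d >= lo) & (d < hi)
              if m.any():
                  centers.append(0.5 * (lo + hi))
                  gamma.append(sq[m].mean())
          return np.array(centers), np.array(gamma)

      rng = np.random.default_rng(0)
      coords = rng.uniform(0, 100, size=(300, 2))  # random sampling locations
      values = rng.lognormal(0.0, 1.0, size=300)   # skewed frequency distribution
      h, g = empirical_variogram(coords, values, np.linspace(0, 50, 11))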

  17. Type I error probabilities based on design-stage strategies with applications to noninferiority trials.

    PubMed

    Rothmann, Mark

    2005-01-01

    When testing the equality of means from two different populations, a t-test or large-sample normal test is typically performed. For these tests, when the sample size or design for the second sample depends on the results of the first sample, the type I error probability is altered for each specific possibility in the null hypothesis. We examine the impact on the type I error probabilities for two confidence interval procedures and for procedures using test statistics when the design for the second sample or experiment depends on the results from the first sample or experiment (or series of experiments). Ways of controlling a desired maximum type I error probability or a desired type I error rate are discussed. Results are applied to the setting of noninferiority comparisons in active controlled trials, where the use of a placebo is unethical.
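
    A Monte Carlo sketch of the phenomenon (the two-stage decision rule below is hypothetical, chosen only to make the effect visible): both stages are drawn under the null, yet letting the second sample size depend on the first stage distorts the nominal 5% level of a naive pooled z-test.

      import numpy as np

      def adaptive_type1(n1=50, n_sims=20_000, seed=1):
          # Both samples are drawn under H0: mean 0, known sigma = 1.
          rng = np.random.default_rng(seed)
          rejections = 0
          for _ in range(n_sims):
              x1 = rng.normal(0, 1, n1)
              # data-dependent design: enlarge stage 2 when stage 1 looks null
              n2 = 200 if abs(x1.mean()) < 0.1 else 50
              pooled = np.concatenate([x1, rng.normal(0, 1, n2)])
              z = pooled.mean() * np.sqrt(len(pooled))
              rejections += abs(z) > 1.96
          return rejections / n_sims

      print(adaptive_type1())   # noticeably above the nominal 0.05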

  18. Rapid Classification and Identification of Multiple Microorganisms with Accurate Statistical Significance via High-Resolution Tandem Mass Spectrometry

    NASA Astrophysics Data System (ADS)

    Alves, Gelio; Wang, Guanghui; Ogurtsov, Aleksey Y.; Drake, Steven K.; Gucek, Marjan; Sacks, David B.; Yu, Yi-Kuo

    2018-06-01

    Rapid and accurate identification and classification of microorganisms is of paramount importance to public health and safety. With the advance of mass spectrometry (MS) technology, the speed of identification can be greatly improved. However, the increasing number of sequenced microbes is complicating correct microbial identification even in a simple sample, due to the large number of candidates present. To properly disentangle candidate microbes in samples containing one or more microbes, one needs to go beyond apparent morphology or simple "fingerprinting"; to correctly prioritize the candidate microbes, one needs accurate statistical significance in microbial identification. We meet these challenges by using peptide-centric representations of microbes to better separate them and by augmenting our earlier analysis method that yields accurate statistical significance. Here, we present an updated analysis workflow that uses tandem MS (MS/MS) spectra for microbial identification or classification. We have demonstrated, using 226 publicly available MS/MS data files (each containing from 2500 to nearly 100,000 MS/MS spectra) and 4000 additional MS/MS data files, that the updated workflow can correctly identify multiple microbes at the genus and often the species level for samples containing more than one microbe. We have also shown that the proposed workflow computes accurate statistical significances, i.e., E values for identified peptides and unified E values for identified microbes. Our updated analysis workflow MiCId, a freely available software package for Microorganism Classification and Identification, is available for download at https://www.ncbi.nlm.nih.gov/CBBresearch/Yu/downloads.html.

  19. Rapid Classification and Identification of Multiple Microorganisms with Accurate Statistical Significance via High-Resolution Tandem Mass Spectrometry.

    PubMed

    Alves, Gelio; Wang, Guanghui; Ogurtsov, Aleksey Y; Drake, Steven K; Gucek, Marjan; Sacks, David B; Yu, Yi-Kuo

    2018-06-05

    Rapid and accurate identification and classification of microorganisms is of paramount importance to public health and safety. With the advance of mass spectrometry (MS) technology, the speed of identification can be greatly improved. However, the increasing number of sequenced microbes is complicating correct microbial identification even in a simple sample, due to the large number of candidates present. To properly disentangle candidate microbes in samples containing one or more microbes, one needs to go beyond apparent morphology or simple "fingerprinting"; to correctly prioritize the candidate microbes, one needs accurate statistical significance in microbial identification. We meet these challenges by using peptide-centric representations of microbes to better separate them and by augmenting our earlier analysis method that yields accurate statistical significance. Here, we present an updated analysis workflow that uses tandem MS (MS/MS) spectra for microbial identification or classification. We have demonstrated, using 226 publicly available MS/MS data files (each containing from 2500 to nearly 100,000 MS/MS spectra) and 4000 additional MS/MS data files, that the updated workflow can correctly identify multiple microbes at the genus and often the species level for samples containing more than one microbe. We have also shown that the proposed workflow computes accurate statistical significances, i.e., E values for identified peptides and unified E values for identified microbes. Our updated analysis workflow MiCId, a freely available software package for Microorganism Classification and Identification, is available for download at https://www.ncbi.nlm.nih.gov/CBBresearch/Yu/downloads.html.

  20. Growth, Characterization and Applications of Beta-Barium Borate and Related Crystals

    DTIC Science & Technology

    1993-10-31

    Crystal symmetry determines the form of the second-order polarization tensor. The second-order polarizability tensor is defined by the piezoelectric... cold finger. A temperature oscillation technique was used to limit the number of nuclei formed. These experiments typically yielded thin crystal... statistically sampled to determine the optimal seeding orientation. It was reasoned that the large crystal plates were formed from nuclei which had a favorable

  1. Nondestructive ultrasonic testing of materials

    DOEpatents

    Hildebrand, Bernard P.

    1994-01-01

    Reflection wave forms obtained from aged and unaged material samples can be compared in order to indicate trends toward age-related flaws. Statistical comparison of a large number of data points from such wave forms can indicate changes in the microstructure of the material due to aging. The process is useful for predicting when flaws may occur in structural elements of high-risk structures such as nuclear power plants, airplanes, and bridges.

  2. Nondestructive ultrasonic testing of materials

    DOEpatents

    Hildebrand, B.P.

    1994-08-02

    Reflection wave forms obtained from aged and unaged material samples can be compared in order to indicate trends toward age-related flaws. Statistical comparison of a large number of data points from such wave forms can indicate changes in the microstructure of the material due to aging. The process is useful for predicting when flaws may occur in structural elements of high-risk structures such as nuclear power plants, airplanes, and bridges. 4 figs.

  3. Outdoor module testing and comparison of photovoltaic technologies

    NASA Astrophysics Data System (ADS)

    Fabick, L. B.; Rifai, R.; Mitchell, K.; Woolston, T.; Canale, J.

    A comparison of outdoor test results for several module technologies is presented. The technologies include thin-film silicon:hydrogen alloys (TFS), TFS modules with semitransparent conductor back contacts, and CuInSe2 module prototypes. A method for calculating open-circuit voltage and fill-factor temperature coefficients is proposed. The method relies on the acquisition of large statistical data samples to average effects due to varying insolation level.
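
    One plausible concrete reading of the proposed method (a sketch under our own assumptions, not the authors' exact procedure): with a large outdoor dataset, restrict to a narrow irradiance band so insolation effects average out, then regress open-circuit voltage on module temperature.

      import numpy as np

      def temp_coefficient(temp_c, voc, irradiance, band=(900.0, 1100.0)):
          # Least-squares slope dVoc/dT over samples inside a narrow
          # irradiance band; the band restriction averages out insolation.
          m = (irradiance > band[0]) & (irradiance < band[1])
          slope, _ = np.polyfit(temp_c[m], voc[m], 1)
          return slope

      rng = np.random.default_rng(2)
      temp_c = rng.uniform(10, 60, 2000)
      irradiance = rng.normal(1000, 80, 2000)
      voc = 40.0 - 0.16 * temp_c + rng.normal(0, 0.2, 2000)  # synthetic module
      print(temp_coefficient(temp_c, voc, irradiance))       # close to -0.16 V/K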

  4. Spiked Models of Large Dimensional Random Matrices Applied to Wireless Communications and Array Signal Processing

    DTIC Science & Technology

    2013-12-14

    population covariance matrix with application to array signal processing; and 5) a sample covariance matrix for which a CLT is studied on linear... Applications, (01 2012): 1150004. doi: Walid Hachem, Malika Kharouf, Jamal Najim, Jack W. Silverstein. A CLT FOR INFORMATION-THEORETIC STATISTICS... for Multi-source Power Estimation, (04 2010) Malika Kharouf, Jamal Najim, Jack W. Silverstein, Walid Hachem. A CLT FOR INFORMATION-THEORETIC

  5. Evaluating markers for the early detection of cancer: overview of study designs and methods.

    PubMed

    Baker, Stuart G; Kramer, Barnett S; McIntosh, Martin; Patterson, Blossom H; Shyr, Yu; Skates, Steven

    2006-01-01

    The field of cancer biomarker development has been evolving rapidly. New developments in both the biologic and statistical realms are providing increasing opportunities for evaluation of markers for both early detection and diagnosis of cancer. Here we review the major conceptual and methodological issues in cancer biomarker evaluation, with an emphasis on recent developments in statistical methods, together with practical recommendations. We organized this review by type of study: preliminary performance, retrospective performance, prospective performance, and cancer screening evaluation. For each type of study, we discuss methodologic issues, provide examples, and discuss strengths and limitations. Preliminary performance studies are useful for quickly winnowing down the number of candidate markers; however, their results may not apply to the ultimate target population, asymptomatic subjects. If stored specimens from cohort studies with clinical cancer endpoints are available, retrospective studies provide a quick and valid way to evaluate the performance of the markers, or of changes in the markers, prior to the onset of clinical symptoms. Prospective studies have a restricted role because they require large sample sizes and, if the endpoint is cancer on biopsy, there may be bias due to overdiagnosis. Cancer screening studies require very large sample sizes and long follow-up, but are necessary for evaluating the marker as a trigger of early intervention.

  6. Microbial secondary metabolites in school buildings inspected for moisture damage in Finland, The Netherlands and Spain.

    PubMed

    Peitzsch, Mirko; Sulyok, Michael; Täubel, Martin; Vishwanath, Vinay; Krop, Esmeralda; Borràs-Santos, Alicia; Hyvärinen, Anne; Nevalainen, Aino; Krska, Rudolf; Larsson, Lennart

    2012-08-01

    Secondary metabolites produced by fungi and bacteria are among the potential agents that contribute to adverse health effects observed in occupants of buildings affected by moisture damage, dampness and associated microbial growth. However, few attempts have been made to assess the occurrence of these compounds in relation to moisture damage and dampness in buildings. This study, conducted in the context of the HITEA project (Health Effects of Indoor Pollutants: Integrating microbial, toxicological and epidemiological approaches), aimed at providing systematic information on the prevalence of microbial secondary metabolites in a large number of school buildings in three European countries, considering buildings both with and without moisture damage and/or dampness observations. In order to address the multitude and diversity of secondary metabolites, more than 180 analytes were targeted in settled dust and surface swab samples using liquid chromatography/mass spectrometry (LC/MS) based methodology. While 42%, 58% and 44% of all samples collected in Spanish, Dutch and Finnish schools, respectively, were positive for at least one of the metabolites analyzed, the frequency of detection for individual microbial secondary metabolites - with the exceptions of emodin, certain enniatins and physcion - was low, typically at or below 10% of samples. In total, 30 different fungal and bacterial secondary metabolites were found in the samples. Some differences in the metabolite profiles were observed between countries and between index and reference school buildings. A major finding in this study was that settled dust derived from moisture-damaged, damp schools contained larger numbers of microbial secondary metabolites at higher levels compared to respective dust samples from schools not affected by moisture damage and dampness. This observation was true for schools in each of the three countries, but became statistically significant only when combining schools from all countries and thus increasing the sample number in the statistical analyses.

  7. The Kaiser Permanente Northern California Adult Member Health Survey.

    PubMed

    Gordon, Nancy; Lin, Teresa

    2016-01-01

    The Kaiser Permanente Northern California (KPNC) Member Health Survey (MHS) is used to describe sociodemographic and health-related characteristics of the adult membership of this large, integrated health care delivery system to monitor trends over time, identify health disparities, and conduct research. Here we provide an overview of the KPNC MHS and share findings that illustrate how survey statistics and data have been, and can be, used for research and programmatic purposes. The MHS is a large-scale, institutional review board-approved survey of English-speaking KPNC adult members. The confidential survey has been conducted by mail triennially starting in 1993 with independent age-sex and geographically stratified random samples, with an option for online completion starting in 2005. The full survey sample and survey data are linkable at the individual level to Health Plan and geocoded data. Respondents are assigned weighting factors for their survey year and additional weighting factors for analysis of pooled survey data. Statistics from the 1999, 2002, 2005, 2008, and 2011 surveys show trends in sociodemographic and health-related characteristics and access to the Internet and e-mail for the adult membership aged 25 to 79 years and for 6 age-sex subgroups. Pooled data from the 2008 and 2011 surveys show many significant differences in these characteristics across the 5 largest race/ethnic groups in KPNC (non-Hispanic whites, blacks, Latinos, Filipinos, and Chinese). The KPNC MHS has yielded unique insights and provides an opportunity for researchers and public health organizations outside of KPNC to leverage our survey-generated statistics and collaborate on epidemiologic and health services research studies.

  8. Presolar Materials in a Giant Cluster IDP of Probable Cometary Origin

    NASA Technical Reports Server (NTRS)

    Messenger, S.; Brownlee, D. E.; Joswiak, D. J.; Nguyen, A. N.

    2015-01-01

    Chondritic porous interplanetary dust particles (CP-IDPs) have been linked to comets by their fragile structure, primitive mineralogy, dynamics, and abundant interstellar materials. But differences have emerged between 'cometary' CP-IDPs and comet 81P/Wild 2 Stardust Mission samples. Particles resembling Ca-Al-rich inclusions (CAIs), chondrules, and amoeboid olivine aggregates (AOAs) in Wild 2 samples are rare in CP-IDPs. Unlike IDPs, presolar materials are scarce in Wild 2 samples. These differences may be due to selection effects, such as destruction of fine-grained (presolar) components during the 6 km/s aerogel impact collection of Wild 2 samples. Large refractory grains observed in Wild 2 samples are also unlikely to be found in most (less than 30 micrometer) IDPs. Presolar materials provide a measure of the primitiveness of meteorites and IDPs. Organic matter in IDPs and chondrites shows H and N isotopic anomalies attributed to low-T interstellar or protosolar disk chemistry, where the largest anomalies occur in the most primitive samples. Presolar silicates are abundant in meteorites with low levels of aqueous alteration (Acfer 094, approximately 200 ppm) and scarce in altered chondrites (e.g., Semarkona, approximately 20 ppm). Presolar silicates in minimally altered CP-IDPs range from approximately 400 ppm to 15,000 ppm, possibly reflecting variable levels of destruction in the solar nebula or statistical variations due to small sample sizes. Here we present preliminary isotopic and mineralogical studies of a very large CP-IDP. The goals of this study are to more accurately determine the abundances of presolar components of CP-IDP material for comparison with comet Wild 2 samples and meteorites. The large mass of this IDP presents a unique opportunity to accurately determine the abundance of presolar grains in a likely cometary sample.

  9. Cosmological Constraints from Fourier Phase Statistics

    NASA Astrophysics Data System (ADS)

    Ali, Kamran; Obreschkow, Danail; Howlett, Cullan; Bonvin, Camille; Llinares, Claudio; Oliveira Franco, Felipe; Power, Chris

    2018-06-01

    Most statistical inference from cosmic large-scale structure relies on two-point statistics, i.e. on the galaxy-galaxy correlation function (2PCF) or the power spectrum. These statistics capture the full information encoded in the Fourier amplitudes of the galaxy density field but do not describe the Fourier phases of the field. Here, we quantify the information contained in the line correlation function (LCF), a three-point Fourier phase correlation function. Using cosmological simulations, we estimate the Fisher information (at redshift z = 0) of the 2PCF, the LCF and their combination, regarding the cosmological parameters of the standard ΛCDM model, as well as a Warm Dark Matter (WDM) model and the f(R) and Symmetron modified gravity models. The galaxy bias is accounted for at the level of a linear bias. The relative information of the 2PCF and the LCF depends on the survey volume, sampling density (shot noise) and the bias uncertainty. For a volume of 1 h^-3 Gpc^3, sampled with points of mean density n̄ = 2×10^-3 h^3 Mpc^-3 and a bias uncertainty of 13%, the LCF improves the parameter constraints by about 20% in the ΛCDM cosmology and potentially even more in alternative models. Finally, since a linear bias only affects the Fourier amplitudes (2PCF), but not the phases (LCF), the combination of the 2PCF and the LCF can be used to break the degeneracy between the linear bias and σ8 present in 2-point statistics.
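
    A toy Fisher-matrix illustration of the degeneracy-breaking argument (all numbers invented for the example, not taken from the paper): adding the Fisher information of a second, independent statistic tightens the marginalized error on the first parameter.

      import numpy as np

      # Hypothetical 2x2 Fisher matrices for parameters (sigma8, linear bias).
      F_2pcf = np.array([[40.0, 5.0], [5.0, 2.0]])    # amplitude information
      F_lcf = np.array([[10.0, -4.0], [-4.0, 3.0]])   # phase information
      for F, name in ((F_2pcf, "2PCF"), (F_2pcf + F_lcf, "2PCF + LCF")):
          sigma8_err = np.sqrt(np.linalg.inv(F)[0, 0])  # marginalized error
          print(name, round(float(sigma8_err), 3))      # combination is tighter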

  10. A note on the kappa statistic for clustered dichotomous data.

    PubMed

    Zhou, Ming; Yang, Zhao

    2014-06-30

    The kappa statistic is widely used to assess the agreement between two raters. Motivated by a simulation-based cluster bootstrap method for calculating the variance of the kappa statistic for clustered physician-patient dichotomous data, we investigate its special correlation structure and develop a new simple and efficient data generation algorithm. For clustered physician-patient dichotomous data, based on the delta method and the data's special covariance structure, we propose a semi-parametric variance estimator for the kappa statistic. An extensive Monte Carlo simulation study is performed to evaluate the performance of the new proposal and five existing methods with respect to the empirical coverage probability, root-mean-square error, and average width of the 95% confidence interval for the kappa statistic. The variance estimator ignoring the dependence within a cluster is generally inappropriate, and the variance estimators from the new proposal, bootstrap-based methods, and the sampling-based delta method perform reasonably well for at least a moderately large number of clusters (e.g., the number of clusters K ⩾ 50). The new proposal and the sampling-based delta method provide convenient tools for efficient computation and non-simulation-based alternatives to the existing bootstrap-based methods. Moreover, the new proposal has acceptable performance even when the number of clusters is as small as K = 25. To illustrate the practical application of all the methods, one psychiatric research dataset and two simulated clustered physician-patient dichotomous datasets are analyzed. Copyright © 2014 John Wiley & Sons, Ltd.
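
    The semi-parametric delta-method estimator itself is not reproduced here, but the cluster bootstrap the authors use as a comparator is easy to sketch (hypothetical data layout: one (n_k x 2) array of binary ratings per physician cluster):

      import numpy as np

      def kappa(tab):
          # Cohen's kappa from a 2x2 cross-tabulation of two raters.
          n = tab.sum()
          p_obs = np.trace(tab) / n
          p_exp = (tab.sum(axis=0) @ tab.sum(axis=1)) / n ** 2
          return (p_obs - p_exp) / (1 - p_exp)

      def cluster_bootstrap_var(clusters, n_boot=2000, seed=4):
          # Resample whole clusters with replacement and recompute kappa,
          # preserving the within-cluster correlation structure.
          rng = np.random.default_rng(seed)
          K, stats = len(clusters), []
          for _ in range(n_boot):
              data = np.vstack([clusters[i] for i in rng.integers(0, K, K)])
              tab = np.zeros((2, 2))
              for a, b in data:
                  tab[a, b] += 1
              stats.append(kappa(tab))
          return float(np.var(stats, ddof=1))

      rng = np.random.default_rng(4)
      clusters = [rng.integers(0, 2, size=(rng.integers(2, 9), 2))
                  for _ in range(50)]
      print(cluster_bootstrap_var(clusters))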

  11. From medium heterogeneity to flow and transport: A time-domain random walk approach

    NASA Astrophysics Data System (ADS)

    Hakoun, V.; Comolli, A.; Dentz, M.

    2017-12-01

    The prediction of flow and transport processes in heterogeneous porous media is based on a qualitative and quantitative understanding of the interplay between 1) spatial variability of hydraulic conductivity, 2) groundwater flow, and 3) solute transport. Using a stochastic modeling approach, we study this interplay through direct numerical simulations of Darcy flow and advective transport in heterogeneous media. First, we study flow in correlated hydraulic permeability fields and shed light on the relationship between the statistics of log-hydraulic conductivity, a medium attribute, and the flow statistics. Second, we determine relationships between Eulerian and Lagrangian velocity statistics, that is, between flow and transport attributes. We show how Lagrangian statistics, and thus transport behaviors such as late particle arrival times, are influenced by the medium heterogeneity on the one hand and the initial particle velocities on the other. We find that equidistantly sampled Lagrangian velocities can be described by a Markov process that evolves on the characteristic heterogeneity length scale. We employ a stochastic relaxation model for the equidistantly sampled particle velocities, which is parametrized by the velocity correlation length. This description results in a time-domain random walk model for the particle motion, whose spatial transitions are characterized by the velocity correlation length and temporal transitions by the particle velocities. This approach relates the statistical medium and flow properties to large-scale transport, and allows for conditioning on the initial particle velocities and thus on the medium properties in the injection region. The approach is tested against direct numerical simulations.
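
    A minimal sketch of the time-domain random walk idea described above (our simplification, not the authors' model: lognormal velocities whose logs follow an AR(1) Markov chain in travel distance, with steps of one correlation length):

      import numpy as np

      def tdrw_arrival_times(n_particles=10_000, domain=100.0, ell=1.0,
                             rho=0.8, seed=17):
          # Particles advance by the correlation length ell per step; arrival
          # time accumulates ell / v, and log-velocities relax as an AR(1)
          # Markov chain sampled equidistantly in space.
          rng = np.random.default_rng(seed)
          logv = rng.normal(0.0, 1.0, n_particles)  # stationary initial values
          t = np.zeros(n_particles)
          for _ in range(int(domain / ell)):
              t += ell / np.exp(logv)
              logv = (rho * logv
                      + np.sqrt(1 - rho ** 2) * rng.normal(0, 1, n_particles))
          return t

      t = tdrw_arrival_times()
      print(np.median(t), np.quantile(t, 0.99))  # broad late-arrival tail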

  12. Rasch fit statistics and sample size considerations for polytomous data.

    PubMed

    Smith, Adam B; Rush, Robert; Fallowfield, Lesley J; Velikova, Galina; Sharpe, Michael

    2008-05-29

    Previous research on educational data has demonstrated that Rasch fit statistics (mean squares and t-statistics) are highly susceptible to sample size variation for dichotomously scored rating data, although little is known about this relationship for polytomous data. These statistics help inform researchers about how well items fit to a unidimensional latent trait, and are an important adjunct to modern psychometrics. Given the increasing use of Rasch models in health research the purpose of this study was therefore to explore the relationship between fit statistics and sample size for polytomous data. Data were collated from a heterogeneous sample of cancer patients (n = 4072) who had completed both the Patient Health Questionnaire - 9 and the Hospital Anxiety and Depression Scale. Ten samples were drawn with replacement for each of eight sample sizes (n = 25 to n = 3200). The Rating and Partial Credit Models were applied and the mean square and t-fit statistics (infit/outfit) derived for each model. The results demonstrated that t-statistics were highly sensitive to sample size, whereas mean square statistics remained relatively stable for polytomous data. It was concluded that mean square statistics were relatively independent of sample size for polytomous data and that misfit to the model could be identified using published recommended ranges.

  13. Rasch fit statistics and sample size considerations for polytomous data

    PubMed Central

    Smith, Adam B; Rush, Robert; Fallowfield, Lesley J; Velikova, Galina; Sharpe, Michael

    2008-01-01

    Background Previous research on educational data has demonstrated that Rasch fit statistics (mean squares and t-statistics) are highly susceptible to sample size variation for dichotomously scored rating data, although little is known about this relationship for polytomous data. These statistics help inform researchers about how well items fit to a unidimensional latent trait, and are an important adjunct to modern psychometrics. Given the increasing use of Rasch models in health research the purpose of this study was therefore to explore the relationship between fit statistics and sample size for polytomous data. Methods Data were collated from a heterogeneous sample of cancer patients (n = 4072) who had completed both the Patient Health Questionnaire – 9 and the Hospital Anxiety and Depression Scale. Ten samples were drawn with replacement for each of eight sample sizes (n = 25 to n = 3200). The Rating and Partial Credit Models were applied and the mean square and t-fit statistics (infit/outfit) derived for each model. Results The results demonstrated that t-statistics were highly sensitive to sample size, whereas mean square statistics remained relatively stable for polytomous data. Conclusion It was concluded that mean square statistics were relatively independent of sample size for polytomous data and that misfit to the model could be identified using published recommended ranges. PMID:18510722

  14. Understanding the Sampling Distribution and the Central Limit Theorem.

    ERIC Educational Resources Information Center

    Lewis, Charla P.

    The sampling distribution is a common source of misuse and misunderstanding in the study of statistics. The sampling distribution, underlying distribution, and the Central Limit Theorem are all interconnected in defining and explaining the proper use of the sampling distribution of various statistics. The sampling distribution of a statistic is…
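
    Since the entry is about teaching the concept, a compact demonstration may help: the sampling distribution of the mean of a skewed population narrows as sigma/sqrt(n), per the Central Limit Theorem (synthetic data, our illustration).

      import numpy as np

      rng = np.random.default_rng(42)
      population = rng.exponential(scale=2.0, size=1_000_000)  # skewed; mu = sd = 2
      for n in (2, 10, 50):
          means = rng.choice(population, size=(10_000, n)).mean(axis=1)
          print(n, round(float(means.std(ddof=1)), 3),
                round(float(population.std() / np.sqrt(n)), 3))  # empirical vs. theory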

  15. The cost of large numbers of hypothesis tests on power, effect size and sample size.

    PubMed

    Lazzeroni, L C; Ray, A

    2012-01-01

    Advances in high-throughput biology and computer science are driving an exponential increase in the number of hypothesis tests in genomics and other scientific disciplines. Studies using current genotyping platforms frequently include a million or more tests. In addition to the monetary cost, this increase imposes a statistical cost owing to the multiple testing corrections needed to avoid large numbers of false-positive results. To safeguard against the resulting loss of power, some have suggested sample sizes on the order of tens of thousands that can be impractical for many diseases or may lower the quality of phenotypic measurements. This study examines the relationship between the number of tests on the one hand and power, detectable effect size or required sample size on the other. We show that once the number of tests is large, power can be maintained at a constant level, with comparatively small increases in the effect size or sample size. For example at the 0.05 significance level, a 13% increase in sample size is needed to maintain 80% power for ten million tests compared with one million tests, whereas a 70% increase in sample size is needed for 10 tests compared with a single test. Relative costs are less when measured by increases in the detectable effect size. We provide an interactive Excel calculator to compute power, effect size or sample size when comparing study designs or genome platforms involving different numbers of hypothesis tests. The results are reassuring in an era of extreme multiple testing.
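
    The quoted 13% and 70% figures can be reproduced under a two-sided z-test approximation with a Bonferroni-corrected significance level (our calculator sketch; the authors themselves provide an Excel tool):

      from scipy.stats import norm

      def n_ratio(m_tests, m_ref, alpha=0.05, power=0.80):
          # Required n scales with (z_{alpha/(2m)} + z_power)^2 for a
          # two-sided z-test at a Bonferroni-corrected level alpha/m.
          z_power = norm.ppf(power)
          need = lambda m: (norm.isf(alpha / (2 * m)) + z_power) ** 2
          return need(m_tests) / need(m_ref)

      print(round(n_ratio(10, 1), 2))      # ~1.70: 70% more for 10 vs. 1 test
      print(round(n_ratio(1e7, 1e6), 2))   # ~1.13: 13% more for 10M vs. 1M tests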

  16. Scaling laws and fluctuations in the statistics of word frequencies

    NASA Astrophysics Data System (ADS)

    Gerlach, Martin; Altmann, Eduardo G.

    2014-11-01

    In this paper, we combine statistical analysis of written texts and simple stochastic models to explain the appearance of scaling laws in the statistics of word frequencies. The average vocabulary of an ensemble of fixed-length texts is known to scale sublinearly with the total number of words (Heaps’ law). Analyzing the fluctuations around this average in three large databases (Google-ngram, English Wikipedia, and a collection of scientific articles), we find that the standard deviation scales linearly with the average (Taylor’s law), in contrast to the prediction of decaying fluctuations obtained using simple sampling arguments. We explain both scaling laws (Heaps’ and Taylor’s) by modeling the usage of words using a Poisson process with a fat-tailed distribution of word frequencies (Zipf’s law) and topic-dependent frequencies of individual words (as in topic models). Considering topical variations leads to quenched averages, turns the vocabulary size into a non-self-averaging quantity, and explains the empirical observations. For the numerous practical applications relying on estimations of vocabulary size, our results show that uncertainties remain large even for long texts. We show how to account for these uncertainties in measurements of the lexical richness of texts with different lengths.
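
    A baseline sketch of the Poisson sampling model discussed above, with Zipf-distributed word frequencies: it reproduces sublinear Heaps-type vocabulary growth, while (per the paper's argument) the linear Taylor scaling of the standard deviation additionally requires topic-dependent frequencies, which this sketch deliberately omits.

      import numpy as np

      rng = np.random.default_rng(7)
      W = 50_000
      p = 1.0 / np.arange(1, W + 1)   # Zipf's law with exponent 1
      p /= p.sum()

      def vocabulary(n_words):
          # Poisson usage model: type w appears Poisson(n_words * p_w) times.
          return int((rng.poisson(n_words * p) > 0).sum())

      for N in (10 ** 3, 10 ** 4, 10 ** 5):
          v = np.array([vocabulary(N) for _ in range(200)])
          print(N, round(float(v.mean()), 1), round(float(v.std(ddof=1)), 2))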

  17. Developing Students' Reasoning about Samples and Sampling Variability as a Path to Expert Statistical Thinking

    ERIC Educational Resources Information Center

    Garfield, Joan; Le, Laura; Zieffler, Andrew; Ben-Zvi, Dani

    2015-01-01

    This paper describes the importance of developing students' reasoning about samples and sampling variability as a foundation for statistical thinking. Research on expert-novice thinking as well as statistical thinking is reviewed and compared. A case is made that statistical thinking is a type of expert thinking, and as such, research…

  18. Upper-tropospheric environment-tropical cyclone interactions over the western North Pacific: A statistical study

    NASA Astrophysics Data System (ADS)

    Qian, Yu-Kun; Liang, Chang-Xia; Yuan, Zhuojian; Peng, Shiqiu; Wu, Junjie; Wang, Sihua

    2016-05-01

    Based on 25-year (1987-2011) tropical cyclone (TC) best track data, a statistical study was carried out to investigate the basic features of upper-tropospheric TC-environment interactions over the western North Pacific. Interaction was defined as the absolute value of eddy momentum flux convergence (EFC) exceeding 10 m s-1 d-1. Based on this definition, it was found that 18% of all six-hourly TC samples experienced interaction. Extreme interaction cases showed that EFC can reach ~120 m s-1 d-1 during the extratropical-cyclone (EC) stage, an order of magnitude larger than reported in previous studies. Composite analysis showed that positive interactions are characterized by a double-jet flow pattern, rather than the traditional trough pattern, because it is the jets that bring in large EFC from the upper-level environment to the TC center. The role of the outflow jet is also enhanced by relatively low inertial stability, as compared to the inflow jet. Among several environmental factors, it was found that extremely large EFC is usually accompanied by high inertial stability, low SST and strong vertical wind shear (VWS). Thus, the positive effect of EFC is cancelled by their negative effects. Only those samples during the EC stage, whose intensities were less dependent on VWS and the underlying SST, could survive in extremely large EFC environments, or even re-intensify. For classical TCs (not in the EC stage), it was found that environments with a moderate EFC value generally below ~25 m s-1 d-1 are more favorable for a TC's intensification than those with extremely large EFC.

  19. Statistical inference of seabed sound-speed structure in the Gulf of Oman Basin.

    PubMed

    Sagers, Jason D; Knobles, David P

    2014-06-01

    Addressed is the statistical inference of the sound-speed depth profile of a thick soft seabed from broadband sound propagation data recorded in the Gulf of Oman Basin in 1977. The acoustic data are in the form of time series signals recorded on a sparse vertical line array and generated by explosive sources deployed along a 280 km track. The acoustic data offer a unique opportunity to study a deep-water bottom-limited thickly sedimented environment because of the large number of time series measurements, very low seabed attenuation, and auxiliary measurements. A maximum entropy method is employed to obtain a conditional posterior probability distribution (PPD) for the sound-speed ratio and the near-surface sound-speed gradient. The multiple data samples allow for a determination of the average error constraint value required to uniquely specify the PPD for each data sample. Two complicating features of the statistical inference study are addressed: (1) the need to develop an error function that can both utilize the measured multipath arrival structure and mitigate the effects of data errors and (2) the effect of small bathymetric slopes on the structure of the bottom interacting arrivals.

  20. A Monte Carlo investigation of thrust imbalance of solid rocket motor pairs

    NASA Technical Reports Server (NTRS)

    Sforzini, R. H.; Foster, W. A., Jr.; Johnson, J. S., Jr.

    1974-01-01

    A technique is described for the theoretical, statistical evaluation of the thrust imbalance of pairs of solid-propellant rocket motors (SRMs) firing in parallel. Sets of the significant variables, determined as a part of the research, are selected using a random sampling technique, and the imbalance is calculated for a large number of motor pairs. The performance model is upgraded to include the effects of statistical variations in the ovality and alignment of the motor case and mandrel. Effects of cross-correlations of variables are minimized by selecting, for the most part, completely independent input variables, over forty in number. The imbalance is evaluated in terms of six time-varying parameters as well as eleven single-valued ones, which are themselves subject to statistical analysis. A sample study of the thrust imbalance of 50 pairs of 146 in. dia. SRMs of the type to be used on the space shuttle is presented. The FORTRAN IV computer program of the analysis and complete instructions for its use are included. Performance computation time for one pair of SRMs is approximately 35 seconds on the IBM 370/155 using the FORTRAN H compiler.
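
    The flavor of the technique can be conveyed with a toy Monte Carlo (names and distributions hypothetical; the actual report integrates a full internal-ballistics model over more than forty input variables):

      import numpy as np

      rng = np.random.default_rng(9)

      def toy_thrust():
          # Stand-in for one motor's thrust: product of independently
          # perturbed burn rate, throat area, and characteristic velocity.
          return (rng.normal(1.0, 0.010) * rng.normal(1.0, 0.005)
                  * rng.normal(1.0, 0.008))

      pairs = np.array([[toy_thrust(), toy_thrust()] for _ in range(50)])
      imbalance = np.abs(pairs[:, 0] - pairs[:, 1]) / pairs.mean(axis=1)
      print(float(imbalance.mean()), float(imbalance.max()))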

  1. A study of correlations between crude oil spot and futures markets: A rolling sample test

    NASA Astrophysics Data System (ADS)

    Liu, Li; Wan, Jieqiu

    2011-10-01

    In this article, we investigate the asymmetries of exceedance correlations and cross-correlations between West Texas Intermediate (WTI) spot and futures markets. First, employing the test statistic proposed by Hong et al. [Asymmetries in stock returns: statistical tests and economic evaluation, Review of Financial Studies 20 (2007) 1547-1581], we find that the exceedance correlations were overall symmetric. However, the results from rolling windows show that some occasional events could induce the significant asymmetries of the exceedance correlations. Second, employing the test statistic proposed by Podobnik et al. [Quantifying cross-correlations using local and global detrending approaches, European Physics Journal B 71 (2009) 243-250], we find that the cross-correlations were significant even for large lagged orders. Using the detrended cross-correlation analysis proposed by Podobnik and Stanley [Detrended cross-correlation analysis: a new method for analyzing two nonstationary time series, Physics Review Letters 100 (2008) 084102], we find that the cross-correlations were weakly persistent and were stronger between spot and futures contract with larger maturity. Our results from rolling sample test also show the apparent effects of the exogenous events. Additionally, we have some relevant discussions on the obtained evidence.
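
    A compact sketch of detrended cross-correlation analysis as cited above (Podobnik & Stanley 2008), applied here to synthetic correlated noise rather than the WTI data:

      import numpy as np

      def dcca(x, y, scales, deg=1):
          # Integrate both (demeaned) series, detrend in non-overlapping boxes
          # of size s, and average the cross-covariance of the residuals.
          X, Y = np.cumsum(x - x.mean()), np.cumsum(y - y.mean())
          out = []
          for s in scales:
              t, cov, n_box = np.arange(s), 0.0, len(X) // s
              for k in range(n_box):
                  seg = slice(k * s, (k + 1) * s)
                  rx = X[seg] - np.polyval(np.polyfit(t, X[seg], deg), t)
                  ry = Y[seg] - np.polyval(np.polyfit(t, Y[seg], deg), t)
                  cov += (rx * ry).mean()
              out.append(cov / n_box)
          return np.array(out)  # slope of log|F2| vs log s: joint scaling exponent

      rng = np.random.default_rng(6)
      e = rng.normal(size=(2, 4096))
      x, y = e[0], 0.7 * e[0] + 0.3 * e[1]   # cross-correlated innovations
      print(dcca(x, y, scales=[16, 32, 64, 128]))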

  2. Two-dimensional random surface model for asperity-contact in elastohydrodynamic lubrication

    NASA Technical Reports Server (NTRS)

    Coy, J. J.; Sidik, S. M.

    1979-01-01

    Relations for the asperity-contact time function during elastohydrodynamic lubrication of a ball bearing are presented. The analysis is based on a two-dimensional random surface model, and actual profile traces of the bearing surfaces are used as statistical sample records. The results of the analysis show that the transition from 90 percent contact to 1 percent contact occurs within a dimensionless film thickness range of approximately four to five. This thickness ratio is several times larger than reported in the literature, where one-dimensional random surface models were used. It is shown that low-pass filtering of the statistical records will bring agreement between the present results and those in the literature.

  3. Why am I not disabled? Making state subjects, making statistics in post-Mao China.

    PubMed

    Kohrman, Matthew

    2003-03-01

    In this article I examine how and why disability was defined and statistically quantified by China's party-state in the late 1980s. I describe the unfolding of a particular epidemiological undertaking--China's 1987 National Sample Survey of Disabled Persons--as well as the ways the survey was an extension of what Ian Hacking has called modernity's "avalanche of numbers." I argue that, to a large degree, what fueled and shaped the 1987 survey's codification and quantification of disability was how Chinese officials were incited to shape their own identities as they negotiated an array of social, political, and ethical forces, which were at once national and transnational in orientation.

  4. SYN-JEM: A Quantitative Job-Exposure Matrix for Five Lung Carcinogens.

    PubMed

    Peters, Susan; Vermeulen, Roel; Portengen, Lützen; Olsson, Ann; Kendzia, Benjamin; Vincent, Raymond; Savary, Barbara; Lavoué, Jérôme; Cavallo, Domenico; Cattaneo, Andrea; Mirabelli, Dario; Plato, Nils; Fevotte, Joelle; Pesch, Beate; Brüning, Thomas; Straif, Kurt; Kromhout, Hans

    2016-08-01

    The use of measurement data in occupational exposure assessment allows more quantitative analyses of possible exposure-response relations. We describe a quantitative exposure assessment approach for five lung carcinogens (i.e., asbestos, chromium-VI, nickel, polycyclic aromatic hydrocarbons (via their proxy benzo(a)pyrene (BaP)), and respirable crystalline silica (RCS)). A quantitative job-exposure matrix (JEM) was developed based on statistical modeling of large quantities of personal measurements. Empirical linear models were developed using personal occupational exposure measurements (n = 102,306) from Europe and Canada, as well as auxiliary information like job (industry), year of sampling, region, an a priori exposure rating of each job (none, low, and high exposed), sampling and analytical methods, and sampling duration. The model outcomes were used to create a JEM with a quantitative estimate of the level of exposure by job, year, and region. Decreasing time trends were observed for all agents between the 1970s and 2009, ranging from -1.2% per year for personal BaP and nickel exposures to -10.7% for asbestos (in the time period before an asbestos ban was implemented). Regional differences in exposure concentrations (adjusted for measured jobs, years of measurement, and sampling method and duration) varied by agent, ranging from a factor of 3.3 for chromium-VI up to a factor of 10.5 for asbestos. We estimated time-, job-, and region-specific exposure levels for four (asbestos, chromium-VI, nickel, and RCS) out of the five considered lung carcinogens. Through statistical modeling of large amounts of personal occupational exposure measurement data, we were able to derive a quantitative JEM to be used in community-based studies. © The Author 2016. Published by Oxford University Press on behalf of the British Occupational Hygiene Society.

  5. Statistical methods for investigating quiescence and other temporal seismicity patterns

    USGS Publications Warehouse

    Matthews, M.V.; Reasenberg, P.A.

    1988-01-01

    We propose a statistical model and a technique for objective recognition of one of the most commonly cited seismicity patterns: microearthquake quiescence. We use a Poisson process model for seismicity and define a process with quiescence as one with a particular type of piece-wise constant intensity function. From this model, we derive a statistic for testing stationarity against a 'quiescence' alternative. The large-sample null distribution of this statistic is approximated from simulated distributions of appropriate functionals applied to Brownian bridge processes. We point out the restrictiveness of the particular model we propose and of the quiescence idea in general. The fact that there are many point processes which have neither constant nor quiescent rate functions underscores the need to test for and describe nonuniformity thoroughly. We advocate the use of the quiescence test in conjunction with various other tests for nonuniformity and with graphical methods such as density estimation. Ideally these methods may promote accurate description of temporal seismicity distributions and useful characterizations of interesting patterns. © 1988 Birkhäuser Verlag.
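
    The Brownian-bridge calibration mentioned above can be sketched directly: simulate bridges and read off critical values of a functional (here sup |B(t)|, one plausible choice; the paper's actual functional may differ).

      import numpy as np

      rng = np.random.default_rng(3)

      def bridge_sup_null(n_steps=1000, n_sims=20_000):
          # Build Wiener paths on [0, 1], pin the endpoint to get bridges,
          # and return the sup-norm functional for each simulated path.
          dW = rng.normal(0.0, 1.0 / np.sqrt(n_steps), (n_sims, n_steps))
          W = dW.cumsum(axis=1)
          tt = np.arange(1, n_steps + 1) / n_steps
          B = W - tt * W[:, -1:]
          return np.abs(B).max(axis=1)

      null = bridge_sup_null()
      print(np.quantile(null, 0.95))   # ~1.36 (the Kolmogorov critical value)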

  6. Kappa statistic for clustered matched-pair data.

    PubMed

    Yang, Zhao; Zhou, Ming

    2014-07-10

    The kappa statistic is widely used to assess the agreement between two procedures for independent matched-pair data. For matched-pair data collected in clusters, on the basis of the delta method and sampling techniques, we propose a nonparametric variance estimator for the kappa statistic without within-cluster correlation structure or distributional assumptions. The results of an extensive Monte Carlo simulation study demonstrate that the proposed kappa statistic provides consistent estimation, and the proposed variance estimator behaves reasonably well for at least a moderately large number of clusters (e.g., K ≥ 50). Compared with the variance estimator ignoring dependence within a cluster, the proposed variance estimator performs better in maintaining the nominal coverage probability when the intra-cluster correlation is fair (ρ ≥ 0.3), with more pronounced improvement when ρ is further increased. To illustrate the practical application of the proposed estimator, we analyze two real data examples of clustered matched-pair data. Copyright © 2014 John Wiley & Sons, Ltd.

  7. Inherent Variability in Short-time Wind Turbine Statistics from Turbulence Structure in the Atmospheric Surface Layer

    NASA Astrophysics Data System (ADS)

    Lavely, Adam; Vijayakumar, Ganesh; Brasseur, James; Paterson, Eric; Kinzel, Michael

    2011-11-01

    Using large-eddy simulation (LES) of the neutral and moderately convective atmospheric boundary layers (NBL, MCBL), we analyze the impact of coherent turbulence structure of the atmospheric surface layer on the short-time statistics that are commonly collected from wind turbines. The incoming winds are conditionally sampled with a filtering and thresholding algorithm into high/low horizontal and vertical velocity fluctuation coherent events. The time scales of these events are ~5-20 blade rotations and are roughly twice as long in the MCBL as in the NBL. Horizontal velocity events are associated with greater variability in rotor power, lift, and blade-bending moment than vertical velocity events. The variability in the industry-standard 10-minute average for rotor power, sectional lift, and wind velocity had a standard deviation of ~5% relative to the ``infinite time'' statistics for the NBL and ~10% for the MCBL. We conclude that turbulence structure associated with atmospheric stability state contributes considerable, quantifiable variability to wind turbine statistics. Supported by NSF and DOE.

  8. Propensity score to detect baseline imbalance in cluster randomized trials: the role of the c-statistic.

    PubMed

    Leyrat, Clémence; Caille, Agnès; Foucher, Yohann; Giraudeau, Bruno

    2016-01-22

    Despite randomization, baseline imbalance and confounding bias may occur in cluster randomized trials (CRTs). Covariate imbalance may jeopardize the validity of statistical inferences if it occurs on prognostic factors. Thus, diagnosing such an imbalance is essential in order to adjust the statistical analysis if required. We developed a tool based on the c-statistic of the propensity score (PS) model to detect global baseline covariate imbalance in CRTs and assess the risk of confounding bias. We performed a simulation study to assess the performance of the proposed tool and applied this method to analyze the data from 2 published CRTs. The proposed method had good performance for large sample sizes (n = 500 per arm) and when the number of unbalanced covariates was not too small compared with the total number of baseline covariates (≥40% of unbalanced covariates). We also provide a strategy for preselection of the covariates that need to be included in the PS model to enhance imbalance detection. The proposed tool could be useful in deciding whether covariate adjustment is required before performing statistical analyses of CRTs.
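
    The diagnostic itself is easy to sketch: fit a propensity model for arm membership on baseline covariates and compute the c-statistic (equivalently the ROC AUC); values near 0.5 indicate balance. The data below are synthetic, and the single-covariate shift is our own toy choice.

      import numpy as np
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import roc_auc_score

      rng = np.random.default_rng(5)
      n = 1000
      X = rng.normal(size=(n, 5))        # baseline covariates
      arm = rng.integers(0, 2, n)        # randomized allocation
      X[arm == 1, 0] += 0.4              # inject imbalance on one covariate

      ps = LogisticRegression().fit(X, arm).predict_proba(X)[:, 1]
      print(round(roc_auc_score(arm, ps), 3))  # ~0.5 if balanced; higher flags imbalance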

  9. Statistical and Machine Learning forecasting methods: Concerns and ways forward

    PubMed Central

    Makridakis, Spyros; Assimakopoulos, Vassilios

    2018-01-01

    Machine Learning (ML) methods have been proposed in the academic literature as alternatives to statistical ones for time series forecasting. Yet, scant evidence is available about their relative performance in terms of accuracy and computational requirements. The purpose of this paper is to evaluate such performance across multiple forecasting horizons using a large subset of 1045 monthly time series used in the M3 Competition. After comparing the post-sample accuracy of popular ML methods with that of eight traditional statistical ones, we found that the former are dominated across both accuracy measures used and for all forecasting horizons examined. Moreover, we observed that their computational requirements are considerably greater than those of statistical methods. The paper discusses the results, explains why the accuracy of ML models is below that of statistical ones and proposes some possible ways forward. The empirical results found in our research stress the need for objective and unbiased ways to test the performance of forecasting methods that can be achieved through sizable and open competitions allowing meaningful comparisons and definite conclusions. PMID:29584784

  10. Sampling procedures for throughfall monitoring: A simulation study

    NASA Astrophysics Data System (ADS)

    Zimmermann, Beate; Zimmermann, Alexander; Lark, Richard Murray; Elsenbeer, Helmut

    2010-01-01

    What is the most appropriate sampling scheme to estimate event-based average throughfall? A satisfactory answer to this seemingly simple question has yet to be found, a failure which we attribute to previous efforts' dependence on empirical studies. Here we try to answer this question by simulating stochastic throughfall fields based on parameters for statistical models of large monitoring data sets. We subsequently sampled these fields with different sampling designs and variable sample supports. We evaluated the performance of a particular sampling scheme with respect to the uncertainty of possible estimated means of throughfall volumes. Even for a relative error limit of 20%, an impractically large number of small, funnel-type collectors would be required to estimate mean throughfall, particularly for small events. While stratification of the target area is not superior to simple random sampling, cluster random sampling involves the risk of being less efficient. A larger sample support, e.g., the use of trough-type collectors, considerably reduces the necessary sample sizes and eliminates the sensitivity of the mean to outliers. Since the gain in time associated with the manual handling of troughs versus funnels depends on the local precipitation regime, the employment of automatically recording clusters of long troughs emerges as the most promising sampling scheme. Even so, a relative error of less than 5% appears out of reach for throughfall under heterogeneous canopies. We therefore suspect a considerable uncertainty of input parameters for interception models derived from measured throughfall, in particular, for those requiring data of small throughfall events.
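
    The support effect can be illustrated with a stripped-down simulation (a spatially uncorrelated lognormal stand-in for a throughfall field, so the troughs' advantage here comes purely from averaging over a larger support; the study's simulated fields were spatially correlated):

      import numpy as np

      rng = np.random.default_rng(11)
      field = rng.lognormal(1.0, 0.8, size=(200, 200))  # stand-in event field
      true_mean = field.mean()

      def funnels(n):   # n point-support collectors at random cells
          return field[rng.integers(0, 200, n), rng.integers(0, 200, n)].mean()

      def troughs(n, length=20):   # n collectors averaging `length` cells each
          i = rng.integers(0, 200, n)
          j = rng.integers(0, 200 - length, n)
          return np.mean([field[i[k], j[k]:j[k] + length].mean()
                          for k in range(n)])

      for est, label in ((funnels, "funnels"), (troughs, "troughs")):
          rel_err = [abs(est(20) - true_mean) / true_mean for _ in range(2000)]
          print(label, round(float(np.mean(rel_err)), 4))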

  11. Testing the Isotropic Universe Using the Gamma-Ray Burst Data of Fermi/GBM

    NASA Astrophysics Data System (ADS)

    Řípa, Jakub; Shafieloo, Arman

    2017-12-01

    The sky distribution of gamma-ray bursts (GRBs) has been intensively studied by various groups for more than two decades. Most of these studies test the isotropy of GRBs based on their sky number density distribution. In this work, we propose an approach to test the isotropy of the universe by inspecting the isotropy of the properties of GRBs, such as their durations, fluences, and peak fluxes at various energy bands and different timescales. We apply this method to the Fermi/Gamma-ray Burst Monitor (GBM) data sample containing 1591 GRBs. The most noticeable feature we found is near the Galactic coordinates l ≈ 30°, b ≈ 15°, with radius r ≈ 20°-40°. The inferred probability for the occurrence of such an anisotropic signal (in a random isotropic sample) is derived to be less than a percent in some of the tests, while the other tests give results consistent with isotropy. These are based on the comparison of the results from the real data with randomly shuffled data samples. Considering the large number of statistics we used in this work (some of which are correlated with each other), we anticipate that the detected feature could be a result of statistical fluctuations. Moreover, we noticed a considerably low number of GRBs in this particular patch, which might be due to some instrumentation or observational effects that can consequently affect our statistics through some systematics. Further investigation is highly desirable in order to clarify this result, e.g., utilizing a larger future Fermi/GBM data sample as well as data samples of other GRB missions, and also looking for possible systematics.
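
    The shuffling comparison described above reduces to a permutation p-value; a sketch on synthetic burst properties (patch membership and the median-difference statistic are our own illustrative choices):

      import numpy as np

      rng = np.random.default_rng(13)

      def patch_pvalue(props, in_patch, n_shuffles=10_000):
          # Compare the in-patch vs. out-of-patch median difference with the
          # same statistic after shuffling properties over sky positions.
          obs = abs(np.median(props[in_patch]) - np.median(props[~in_patch]))
          hits = 0
          for _ in range(n_shuffles):
              perm = rng.permutation(props)
              hits += abs(np.median(perm[in_patch])
                          - np.median(perm[~in_patch])) >= obs
          return hits / n_shuffles

      props = rng.lognormal(0.0, 1.0, 1591)   # e.g., fluences
      in_patch = rng.random(1591) < 0.05      # hypothetical sky patch
      print(patch_pvalue(props, in_patch, 2000))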

  12. Testing for X-Ray–SZ Differences and Redshift Evolution in the X-Ray Morphology of Galaxy Clusters

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Nurgaliev, D.; McDonald, M.; Benson, B. A.

    We present a quantitative study of the X-ray morphology of galaxy clusters, as a function of their detection method and redshift. We analyze two separate samples of galaxy clusters: a sample of 36 clusters at 0.35 < z < 0.9 selected in the X-ray with the ROSAT PSPC 400 deg(2) survey, and a sample of 90 clusters at 0.25 < z < 1.2 selected via the Sunyaev–Zel’dovich (SZ) effect with the South Pole Telescope. Clusters from both samples have similar-quality Chandra observations, which allow us to quantify their X-ray morphologies via two distinct methods: centroid shifts (w) and photon asymmetry (A_phot). The latter technique provides nearly unbiased morphology estimates for clusters spanning a broad range of redshift and data quality. We further compare the X-ray morphologies of X-ray- and SZ-selected clusters with those of simulated clusters. We do not find a statistically significant difference in the measured X-ray morphology of X-ray and SZ-selected clusters over the redshift range probed by these samples, suggesting that the two are probing similar populations of clusters. We find that the X-ray morphologies of simulated clusters are statistically indistinguishable from those of X-ray- or SZ-selected clusters, implying that the most important physics for dictating the large-scale gas morphology (outside of the core) is well-approximated in these simulations. Finally, we find no statistically significant redshift evolution in the X-ray morphology (both for observed and simulated clusters) over the range of z ~ 0.3 to z ~ 1, seemingly in contradiction with the redshift-dependent halo merger rate predicted by simulations.

  13. Outlier Removal and the Relation with Reporting Errors and Quality of Psychological Research

    PubMed Central

    Bakker, Marjan; Wicherts, Jelte M.

    2014-01-01

    Background The removal of outliers to acquire a significant result is a questionable research practice that appears to be commonly used in psychology. In this study, we investigated whether the removal of outliers in psychology papers is related to weaker evidence (against the null hypothesis of no effect), a higher prevalence of reporting errors, and smaller sample sizes in these papers compared to papers in the same journals that did not report the exclusion of outliers from the analyses. Methods and Findings We retrieved a total of 2667 statistical results of null hypothesis significance tests from 153 articles in main psychology journals, and compared results from articles in which outliers were removed (N = 92) with results from articles that reported no exclusion of outliers (N = 61). We preregistered our hypotheses and methods and analyzed the data at the level of articles. Results show no significant difference between the two types of articles in median p value, sample sizes, or prevalence of all reporting errors, large reporting errors, and reporting errors that concerned the statistical significance. However, we did find a discrepancy between the reported degrees of freedom of t tests and the reported sample size in 41% of articles that did not report removal of any data values. This suggests common failure to report data exclusions (or missingness) in psychological articles. Conclusions We failed to find that the removal of outliers from the analysis in psychological articles was related to weaker evidence (against the null hypothesis of no effect), sample size, or the prevalence of errors. However, our control sample might be contaminated due to nondisclosure of excluded values in articles that did not report exclusion of outliers. Results therefore highlight the importance of more transparent reporting of statistical analyses. PMID:25072606

  14. How Big of a Problem is Analytic Error in Secondary Analyses of Survey Data?

    PubMed

    West, Brady T; Sakshaug, Joseph W; Aurelien, Guy Alain S

    2016-01-01

    Secondary analyses of survey data collected from large probability samples of persons or establishments further scientific progress in many fields. The complex design features of these samples improve data collection efficiency, but also require analysts to account for these features when conducting analysis. Unfortunately, many secondary analysts from fields outside of statistics, biostatistics, and survey methodology do not have adequate training in this area, and as a result may apply incorrect statistical methods when analyzing these survey data sets. This in turn could lead to the publication of incorrect inferences based on the survey data that effectively negate the resources dedicated to these surveys. In this article, we build on the results of a preliminary meta-analysis of 100 peer-reviewed journal articles presenting analyses of data from a variety of national health surveys, which suggested that analytic errors may be extremely prevalent in these types of investigations. We first perform a meta-analysis of a stratified random sample of 145 additional research products analyzing survey data from the Scientists and Engineers Statistical Data System (SESTAT), which describes features of the U.S. Science and Engineering workforce, and examine trends in the prevalence of analytic error across the decades used to stratify the sample. We once again find that analytic errors appear to be quite prevalent in these studies. Next, we present several example analyses of real SESTAT data, and demonstrate that a failure to perform these analyses correctly can result in substantially biased estimates with standard errors that do not adequately reflect complex sample design features. Collectively, the results of this investigation suggest that reviewers of this type of research need to pay much closer attention to the analytic methods employed by researchers attempting to publish or present secondary analyses of survey data.
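
    One of the most common analytic errors the article targets, ignoring unequal selection probabilities, is easy to demonstrate (synthetic two-stratum population of our own construction; design weights are the inverse selection probabilities):

      import numpy as np

      rng = np.random.default_rng(21)
      strata_mean = {0: 10.0, 1: 20.0}   # equal-sized strata in the population
      p_select = {0: 0.9, 1: 0.1}        # stratum 1 heavily undersampled
      y, w = [], []
      for s in (0, 1):
          n_s = int(5000 * p_select[s])
          y.append(rng.normal(strata_mean[s], 1.0, n_s))
          w.append(np.full(n_s, 1.0 / p_select[s]))   # design weights
      y, w = np.concatenate(y), np.concatenate(w)
      print("unweighted:", round(float(y.mean()), 2))                 # biased toward 10
      print("weighted:  ", round(float(np.average(y, weights=w)), 2)) # near the truth, 15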

  15. Testing for X-Ray–SZ Differences and Redshift Evolution in the X-Ray Morphology of Galaxy Clusters

    DOE PAGES

    Nurgaliev, D.; McDonald, M.; Benson, B. A.; ...

    2017-05-16

    We present a quantitative study of the X-ray morphology of galaxy clusters, as a function of their detection method and redshift. We analyze two separate samples of galaxy clusters: a sample of 36 clusters at 0.35 < z < 0.9 selected in the X-ray with the ROSAT PSPC 400 deg² survey, and a sample of 90 clusters at 0.25 < z < 1.2 selected via the Sunyaev–Zel’dovich (SZ) effect with the South Pole Telescope. Clusters from both samples have similar-quality Chandra observations, which allow us to quantify their X-ray morphologies via two distinct methods: centroid shifts (w) and photon asymmetry (A_phot). The latter technique provides nearly unbiased morphology estimates for clusters spanning a broad range of redshift and data quality. We further compare the X-ray morphologies of X-ray- and SZ-selected clusters with those of simulated clusters. We do not find a statistically significant difference in the measured X-ray morphology of X-ray- and SZ-selected clusters over the redshift range probed by these samples, suggesting that the two are probing similar populations of clusters. We find that the X-ray morphologies of simulated clusters are statistically indistinguishable from those of X-ray- or SZ-selected clusters, implying that the most important physics for dictating the large-scale gas morphology (outside of the core) is well-approximated in these simulations. Finally, we find no statistically significant redshift evolution in the X-ray morphology (both for observed and simulated clusters) over the range of z ~ 0.3 to z ~ 1, seemingly in contradiction with the redshift-dependent halo merger rate predicted by simulations.

  17. Occurrence of Radio Minihalos in a Mass-Limited Sample of Galaxy Clusters

    NASA Technical Reports Server (NTRS)

    Giacintucci, Simona; Markevitch, Maxim; Cassano, Rossella; Venturi, Tiziana; Clarke, Tracy E.; Brunetti, Gianfranco

    2017-01-01

    We investigate the occurrence of radio minihalos (diffuse radio sources of unknown origin observed in the cores of some galaxy clusters) in a statistical sample of 58 clusters drawn from the Planck Sunyaev-Zeldovich cluster catalog using a mass cut (M_500 > 6 × 10^14 solar masses). We supplement our statistical sample with a similarly sized nonstatistical sample mostly consisting of clusters in the ACCEPT X-ray catalog with suitable X-ray and radio data, which includes lower-mass clusters. Where necessary (for nine clusters), we reanalyzed the Very Large Array archival radio data to determine whether a minihalo is present. Our total sample includes all 28 currently known and recently discovered radio minihalos, including six candidates. We classify clusters as cool-core or non-cool-core according to the value of the specific entropy floor in the cluster center, rederived or newly derived from the Chandra X-ray density and temperature profiles where necessary (for 27 clusters). Contrary to the common wisdom that minihalos are rare, we find that almost all cool cores (at least 12 out of 15, or 80%) in our complete sample of massive clusters exhibit minihalos. The supplementary sample shows that the occurrence of minihalos may be lower in lower-mass cool-core clusters. No minihalos are found in non-cool cores or "warm cores." These findings will help test theories of the origin of minihalos and provide information on the physical processes and energetics of the cluster cores.

  18. Water-quality trends and constituent-transport analysis for selected sampling sites in the Milltown Reservoir/Clark Fork River Superfund Site in the upper Clark Fork Basin, Montana, water years 1996–2015

    USGS Publications Warehouse

    Sando, Steven K.; Vecchia, Aldo V.

    2016-07-20

    During the extended history of mining in the upper Clark Fork Basin in Montana, large amounts of waste materials enriched with metallic contaminants (cadmium, copper, lead, and zinc) and the metalloid trace element arsenic were generated from mining operations near Butte and milling and smelting operations near Anaconda. Extensive deposition of mining wastes in the Silver Bow Creek and Clark Fork channels and flood plains had substantial effects on water quality. Federal Superfund remediation activities in the upper Clark Fork Basin began in 1983 and have included substantial remediation near Butte and removal of the former Milltown Dam near Missoula. To aid in evaluating the effects of remediation activities on water quality, the U.S. Geological Survey began collecting streamflow and water-quality data in the upper Clark Fork Basin in the 1980s. Trend analysis was done on specific conductance, selected trace elements (arsenic, copper, and zinc), and suspended sediment for seven sampling sites in the Milltown Reservoir/Clark Fork River Superfund Site for water years 1996–2015. The most upstream site included in trend analysis is Silver Bow Creek at Warm Springs, Montana (sampling site 8), and the most downstream site is Clark Fork above Missoula, Montana (sampling site 22), which is just downstream from the former Milltown Dam. Water year is the 12-month period from October 1 through September 30 and is designated by the year in which it ends. Trend analysis was done by using a joint time-series model for concentration and streamflow. To provide temporal resolution of changes in water quality, trend analysis was conducted for four sequential 5-year periods: period 1 (water years 1996–2000), period 2 (water years 2001–05), period 3 (water years 2006–10), and period 4 (water years 2011–15). Because of the substantial effect of the intentional breach of Milltown Dam on March 28, 2008, period 3 was subdivided into period 3A (October 1, 2005–March 27, 2008) and period 3B (March 28, 2008–September 30, 2010) for the Clark Fork above Missoula (sampling site 22). Trend results were considered statistically significant when the statistical probability level was less than 0.01. In conjunction with the trend analysis, estimated normalized constituent loads (hereinafter referred to as “loads”) were calculated and presented within the framework of a constituent-transport analysis to assess the temporal trends in flow-adjusted concentrations (FACs) in the context of sources and transport. The transport analysis allows assessment of temporal changes in relative contributions from upstream source areas to loads transported past each reach outflow. Trend results indicate that FACs of unfiltered-recoverable copper decreased at the sampling sites from the start of period 1 through the end of period 4; the decreases ranged from large for one sampling site (Silver Bow Creek at Warm Springs [sampling site 8]) to moderate for two sampling sites (Clark Fork near Galen, Montana [sampling site 11] and Clark Fork above Missoula [sampling site 22]) to small for four sampling sites (Clark Fork at Deer Lodge, Montana [sampling site 14], Clark Fork at Goldcreek, Montana [sampling site 16], Clark Fork near Drummond, Montana [sampling site 18], and Clark Fork at Turah Bridge near Bonner, Montana [sampling site 20]).
    For period 4 (water years 2011–15), the most notable changes indicated for the Milltown Reservoir/Clark Fork River Superfund Site were statistically significant decreases in FACs and loads of unfiltered-recoverable copper for sampling sites 8 and 22. The period 4 changes in FACs of unfiltered-recoverable copper for all other sampling sites were not statistically significant. Trend results indicate that FACs of unfiltered-recoverable arsenic decreased at the sampling sites from period 1 through period 4 (water years 1996–2015); the decreases ranged from minor (sampling sites 8–20) to small (sampling site 22). For period 4 (water years 2011–15), the most notable changes indicated for the Milltown Reservoir/Clark Fork River Superfund Site were statistically significant decreases in FACs and loads of unfiltered-recoverable arsenic for sampling site 8 and nearly statistically significant decreases for sampling site 22. The period 4 changes in FACs of unfiltered-recoverable arsenic for all other sampling sites were not statistically significant. Trend results indicate that FACs of suspended sediment decreased at the sampling sites from period 1 through period 4 (water years 1996–2015); the decreases ranged from moderate (sampling site 8) to small (sampling sites 11–22). For period 4 (water years 2011–15), the changes in FACs of suspended sediment were not statistically significant for any sampling site. The reach of the Clark Fork from Galen to Deer Lodge is a large source of metallic contaminants and suspended sediment, which strongly affects downstream transport of those constituents. Mobilization of copper and suspended sediment from flood-plain tailings and the streambed of the Clark Fork and its tributaries within the reach results in a contribution of those constituents that is proportionally much larger than the contribution of streamflow from within the reach. Within the reach from Galen to Deer Lodge, unfiltered-recoverable copper loads increased by a factor of about 4 and suspended-sediment loads increased by a factor of about 5, whereas streamflow increased by a factor of slightly less than 2. For period 4 (water years 2011–15), unfiltered-recoverable copper and suspended-sediment loads sourced from within the reach accounted for about 41 and 14 percent, respectively, of the loads at Clark Fork above Missoula (sampling site 22), whereas streamflow sourced from within the reach accounted for about 4 percent of the streamflow at sampling site 22. During water years 1996–2015, decreases in FACs and loads of unfiltered-recoverable copper and suspended sediment for the reach generally were proportionally smaller than for most other reaches. Unfiltered-recoverable copper loads sourced within the reaches of the Clark Fork between Deer Lodge and Turah Bridge near Bonner (just upstream from the former Milltown Dam) were proportionally smaller than contributions of streamflow sourced from within the reaches; these reaches contributed proportionally much less to copper loading in the Clark Fork than the reach between Galen and Deer Lodge. Although substantial decreases in FACs and loads of unfiltered-recoverable copper and suspended sediment were indicated for Silver Bow Creek at Warm Springs (sampling site 8), those substantial decreases were not translated to downstream reaches between Deer Lodge and Turah Bridge near Bonner. The effect of the reach of the Clark Fork from Galen to Deer Lodge as a large source of copper and suspended sediment, in combination with little temporal change in those constituents for the reach, contributes to this pattern. With the removal of the former Milltown Dam in 2008, substantial amounts of contaminated sediments that remained in the Clark Fork channel and flood plain in reach 9 (downstream from Turah Bridge near Bonner) became more available for mobilization and transport than before the dam removal. After the removal of the former Milltown Dam, the Clark Fork above Missoula (sampling site 22) had statistically significant decreases in FACs of unfiltered-recoverable copper in period 3B (March 28, 2008, through water year 2010) that continued in period 4 (water years 2011–15). Also, decreases in FACs of unfiltered-recoverable arsenic and suspended sediment were indicated for period 4 at this site. The decrease in FACs of unfiltered-recoverable copper for sampling site 22 during period 4 was proportionally much larger than the decrease for the Clark Fork at Turah Bridge near Bonner (sampling site 20). Net mobilization of unfiltered-recoverable copper and arsenic from sources within reach 9 is smaller for period 4 than for period 1, when the former Milltown Dam was in place, providing evidence that contaminant source materials have been substantially reduced in reach 9.
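
    As a simplified illustration of flow adjustment (the report itself uses a joint time-series model for concentration and streamflow, which this sketch does not reproduce), one common approximation regresses log concentration on log streamflow and examines the residuals; all data below are hypothetical:

    ```python
    import numpy as np

    def flow_adjusted(conc, flow):
        """Crude flow adjustment: regress log concentration on log
        streamflow and return the residuals, i.e., the variation in
        concentration not explained by flow. Trends in these residuals
        approximate trends in flow-adjusted concentration (FACs)."""
        log_q, log_c = np.log(flow), np.log(conc)
        slope, intercept = np.polyfit(log_q, log_c, 1)
        return log_c - (intercept + slope * log_q)

    # Hypothetical copper concentrations (ug/L) and streamflow (cfs)
    rng = np.random.default_rng(1)
    flow = rng.lognormal(5.0, 0.6, 200)
    conc = 0.5 * flow**0.4 * rng.lognormal(0.0, 0.2, 200)
    fac = flow_adjusted(conc, flow)
    # A trend test would then be applied to fac as a function of time.
    print(fac[:5])
    ```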

  19. The effects of sample preparation on measured concentrations of eight elements in edible tissues of fish from streams contaminated by lead mining

    USGS Publications Warehouse

    Schmitt, Christopher J.; Finger, Susan E.

    1987-01-01

    The influence of sample preparation on measured concentrations of eight elements in the edible tissues of two black basses (Centrarchidae), two catfishes (Ictaluridae), and the black redhorse, Moxostoma duquesnei (Catostomidae), from two rivers in southeastern Missouri contaminated by mining and related activities was investigated. Concentrations of Pb, Cd, Cu, Zn, Fe, Mn, Ba, and Ca were measured in two skinless, boneless samples of axial muscle from individual fish prepared in a clean room. One sample (normally-processed) was removed from each fish with a knife in a manner typically used by investigators to process fish for elemental analysis and presumably representative of methods employed by anglers when preparing fish for home consumption. A second sample (clean-processed) was then prepared from each normally-processed sample by cutting away all surface material with acid-cleaned instruments under ultraclean conditions. The samples were analyzed as a single group by atomic absorption spectrophotometry. Of the elements studied, only Pb regularly exceeded current guidelines for elemental contaminants in foods. Concentrations were high in black redhorse from contaminated sites, regardless of preparation method; for the other fishes, whether or not Pb guidelines were exceeded depended on preparation technique. Except for Mn and Ca, concentrations of all elements measured were significantly lower in clean- than in normally-processed tissue samples. Absolute differences in measured concentrations between clean- and normally-processed samples were most evident for Pb and Ba in bass and catfish and for Cd and Zn in redhorse. Regardless of preparation method, concentrations of Pb, Ca, Mn, and Ba in individual fish were closely correlated; samples that were high or low in one of these four elements were correspondingly high or low in the other three. In contrast, correlations between Zn, Fe, and Cd occurred only in normally-processed samples, suggesting that these correlations resulted from high concentrations on the surfaces of some samples. Concentrations of Pb and Ba in edible tissues of fish from contaminated sites were highly correlated with Ca content, which was probably determined largely by the amount of tissue other than muscle in the sample, because fish muscle contains relatively little Ca. Accordingly, variation within a group of similar samples can be reduced by normalizing Pb and Ba concentrations to a standard Ca concentration. When sample size (N) is large, this can be accomplished statistically by analysis of covariance; when N is small, molar ratios of [Pb]/[Ca] and [Ba]/[Ca] can be computed. Without such adjustments, unrealistically large Ns are required to yield statistically reliable estimates of Pb concentrations in edible tissues. Investigators should acknowledge that reported concentrations of certain elements are only estimates, and that regardless of the care exercised during the collection, preparation, and analysis of samples, results should be interpreted with the awareness that contamination from external sources may have occurred.
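
    A minimal sketch of the small-N molar-ratio normalization mentioned above; the atomic weights are standard values and the sample concentrations are hypothetical:

    ```python
    # Standard atomic weights (g/mol)
    PB, BA, CA = 207.2, 137.33, 40.078

    def molar_ratio(metal_conc, ca_conc, metal_weight):
        """Molar ratio of a metal to calcium, with both concentrations
        in the same mass units (e.g., ug/g wet weight)."""
        return (metal_conc / metal_weight) / (ca_conc / CA)

    # Hypothetical fillet sample: 0.8 ug/g Pb, 350 ug/g Ca
    print(molar_ratio(0.8, 350.0, PB))  # [Pb]/[Ca], dimensionless
    ```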

  20. Efficient Robust Regression via Two-Stage Generalized Empirical Likelihood

    PubMed Central

    Bondell, Howard D.; Stefanski, Leonard A.

    2013-01-01

    Large- and finite-sample efficiency and resistance to outliers are the key goals of robust statistics. Although often not simultaneously attainable, we develop and study a linear regression estimator that comes close. Efficiency obtains from the estimator’s close connection to generalized empirical likelihood, and its favorable robustness properties are obtained by constraining the associated sum of (weighted) squared residuals. We prove maximum attainable finite-sample replacement breakdown point, and full asymptotic efficiency for normal errors. Simulation evidence shows that compared to existing robust regression estimators, the new estimator has relatively high efficiency for small sample sizes, and comparable outlier resistance. The estimator is further illustrated and compared to existing methods via application to a real data set with purported outliers. PMID:23976805

  1. Potential, velocity, and density fields from sparse and noisy redshift-distance samples - Method

    NASA Technical Reports Server (NTRS)

    Dekel, Avishai; Bertschinger, Edmund; Faber, Sandra M.

    1990-01-01

    A method for recovering the three-dimensional potential, velocity, and density fields from large-scale redshift-distance samples is described. Galaxies are taken as tracers of the velocity field, not of the mass. The density field and the initial conditions are calculated using an iterative procedure that applies the no-vorticity assumption at an initial time and uses the Zel'dovich approximation to relate initial and final positions of particles on a grid. The method is tested using a cosmological N-body simulation 'observed' at the positions of real galaxies in a redshift-distance sample, taking into account their distance measurement errors. Malmquist bias and other systematic and statistical errors are extensively explored using both analytical techniques and Monte Carlo simulations.
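
    For reference, a standard statement of the Zel'dovich approximation invoked above; the notation here is assumed, not quoted from the paper:

    ```latex
    % A particle's final comoving (Eulerian) position x is its initial
    % (Lagrangian) position q plus a displacement that grows with the
    % linear growth factor D(t):
    \[
      \mathbf{x}(\mathbf{q},t) = \mathbf{q} + D(t)\,\boldsymbol{\psi}(\mathbf{q}),
      \qquad
      \mathbf{v}(\mathbf{q},t) = a(t)\,\dot{D}(t)\,\boldsymbol{\psi}(\mathbf{q}),
    \]
    % so, under the no-vorticity assumption (psi the gradient of a
    % displacement potential), initial positions can be recovered
    % iteratively from the observed peculiar velocities.
    ```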

  2. An Independent Filter for Gene Set Testing Based on Spectral Enrichment.

    PubMed

    Frost, H Robert; Li, Zhigang; Asselbergs, Folkert W; Moore, Jason H

    2015-01-01

    Gene set testing has become an indispensable tool for the analysis of high-dimensional genomic data. An important motivation for testing gene sets, rather than individual genomic variables, is to improve statistical power by reducing the number of tested hypotheses. Given the dramatic growth in common gene set collections, however, testing is often performed with nearly as many gene sets as underlying genomic variables. To address the challenge to statistical power posed by large gene set collections, we have developed spectral gene set filtering (SGSF), a novel technique for independent filtering of gene set collections prior to gene set testing. The SGSF method uses as a filter statistic the p-value measuring the statistical significance of the association between each gene set and the sample principal components (PCs), taking into account the significance of the associated eigenvalues. Because this filter statistic is independent of standard gene set test statistics under the null hypothesis but dependent under the alternative, the proportion of enriched gene sets is increased without impacting the type I error rate. As shown using simulated and real gene expression data, the SGSF algorithm accurately filters gene sets unrelated to the experimental outcome resulting in significantly increased gene set testing power.
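
    A simplified sketch of the filtering idea: it tests the association between each set's mean expression and the top sample PCs, but omits the eigenvalue-significance weighting of the actual SGSF statistic; the data and set definitions are hypothetical:

    ```python
    import numpy as np
    from scipy import stats

    def pc_filter_pvalues(expr, gene_sets, n_pcs=5):
        """For each gene set, test the association between the set's mean
        expression per sample and each of the top n_pcs sample principal
        components; keep the smallest Bonferroni-adjusted p-value as the
        filter statistic. Sets with large p-values can be filtered out
        before gene set testing."""
        centered = expr - expr.mean(axis=0)                  # samples x genes
        u, s, _ = np.linalg.svd(centered, full_matrices=False)
        pc_scores = u[:, :n_pcs] * s[:n_pcs]                 # samples x n_pcs
        pvals = {}
        for name, genes in gene_sets.items():
            set_score = expr[:, genes].mean(axis=1)          # one value/sample
            p = min(stats.pearsonr(set_score, pc_scores[:, k])[1]
                    for k in range(n_pcs))
            pvals[name] = min(1.0, p * n_pcs)                # Bonferroni
        return pvals

    rng = np.random.default_rng(0)
    expr = rng.normal(size=(40, 200))                        # 40 samples, 200 genes
    gene_sets = {"setA": [0, 1, 2, 3], "setB": [10, 50, 90]}
    print(pc_filter_pvalues(expr, gene_sets))
    ```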

  3. Environmental monitoring study of linear alkylbenzene sulfonates and insoluble soap in Spanish sewage sludge samples.

    PubMed

    Cantarero, Samuel; Zafra-Gómez, Alberto; Ballesteros, Oscar; Navalón, Alberto; Reis, Marco S; Saraiva, Pedro M; Vílchez, José L

    2011-01-01

    In this work we present a monitoring study of linear alkylbenzene sulfonates (LAS) and insoluble soap performed on Spanish sewage sludge samples. This work focuses on finding statistical relations between LAS and insoluble-soap concentrations in sewage sludge samples and variables related to wastewater treatment plants, such as water hardness, population and treatment type. It is worth mentioning that 38 samples, collected from different Spanish regions, were studied. Principal component analysis (PCA) was used to reduce the number of response variables. An analysis of variance (ANOVA) and a non-parametric alternative, the Kruskal-Wallis test, were also applied, with the p-value (the probability of obtaining a test statistic at least as extreme as the one actually observed, assuming that the null hypothesis is true) used to assess possible relations between the concentrations of both analytes and the remaining variables. We also compared the behaviors of LAS and insoluble soap. In addition, the mean result obtained for LAS was compared with the limit value proposed by the future Directive entitled "Working Document on Sludge". According to the results, the means obtained for soap and LAS were 26.49 g kg⁻¹ and 6.15 g kg⁻¹, respectively. It is worth noting that the LAS mean was significantly higher than the limit value (2.6 g kg⁻¹). In addition, LAS and soap concentrations depend largely on water hardness. However, only LAS concentration depends on treatment type.
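
    For readers unfamiliar with the two tests, a minimal sketch of the ANOVA and Kruskal-Wallis comparison on hypothetical concentration groups (not the study's data):

    ```python
    from scipy import stats

    # Hypothetical LAS concentrations (g/kg) grouped by water hardness
    soft = [4.1, 5.3, 6.8, 5.9]
    medium = [6.2, 7.4, 8.1, 6.9]
    hard = [8.8, 9.5, 7.9, 10.2]

    h, p_kw = stats.kruskal(soft, medium, hard)        # rank-based test
    f, p_anova = stats.f_oneway(soft, medium, hard)    # parametric ANOVA
    print(f"Kruskal-Wallis H={h:.2f}, p={p_kw:.3f}; "
          f"ANOVA F={f:.2f}, p={p_anova:.3f}")
    ```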

  4. Large ensemble modeling of last deglacial retreat of the West Antarctic Ice Sheet: comparison of simple and advanced statistical techniques

    NASA Astrophysics Data System (ADS)

    Pollard, D.; Chang, W.; Haran, M.; Applegate, P.; DeConto, R.

    2015-11-01

    A 3-D hybrid ice-sheet model is applied to the last deglacial retreat of the West Antarctic Ice Sheet over the last ~ 20 000 years. A large ensemble of 625 model runs is used to calibrate the model to modern and geologic data, including reconstructed grounding lines, relative sea-level records, elevation-age data and uplift rates, with an aggregate score computed for each run that measures overall model-data misfit. Two types of statistical methods are used to analyze the large-ensemble results: simple averaging weighted by the aggregate score, and more advanced Bayesian techniques involving Gaussian process-based emulation and calibration, and Markov chain Monte Carlo. Results for best-fit parameter ranges and envelopes of equivalent sea-level rise with the simple averaging method agree quite well with the more advanced techniques, but only for a large ensemble with full factorial parameter sampling. Best-fit parameter ranges confirm earlier values expected from prior model tuning, including large basal sliding coefficients on modern ocean beds. Each run is extended 5000 years into the "future" with idealized ramped climate warming. In the majority of runs with reasonable scores, this produces grounding-line retreat deep into the West Antarctic interior, and the analysis provides sea-level-rise envelopes with well defined parametric uncertainty bounds.
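
    A minimal sketch of the simple score-weighted averaging idea; the exponential weight form and all numbers below are assumptions for illustration, not the paper's calibration:

    ```python
    import numpy as np

    def score_weighted_summary(values, misfits, quantiles=(0.05, 0.5, 0.95)):
        """Weight each ensemble member by its model-data misfit score
        (one common choice: w = exp(-misfit), renormalized) and return
        the weighted mean and weighted quantiles of a scalar output,
        e.g., equivalent sea-level rise."""
        w = np.exp(-np.asarray(misfits))
        w /= w.sum()
        order = np.argsort(values)
        v, w = np.asarray(values)[order], w[order]
        cdf = np.cumsum(w)
        qs = [v[np.searchsorted(cdf, q)] for q in quantiles]
        return np.sum(v * w), qs

    rng = np.random.default_rng(3)
    slr = rng.normal(3.0, 1.0, 625)      # hypothetical sea-level outputs (m)
    misfit = rng.gamma(2.0, 1.0, 625)    # hypothetical aggregate misfit scores
    print(score_weighted_summary(slr, misfit))
    ```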

  5. Calculating Confidence, Uncertainty, and Numbers of Samples When Using Statistical Sampling Approaches to Characterize and Clear Contaminated Areas

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Piepel, Gregory F.; Matzke, Brett D.; Sego, Landon H.

    2013-04-27

    This report discusses the methodology, formulas, and inputs needed to make characterization and clearance decisions for Bacillus anthracis-contaminated and uncontaminated (or decontaminated) areas using a statistical sampling approach. Specifically, the report includes the methods and formulas for calculating (1) the number of samples required to achieve a specified confidence in characterization and clearance decisions, and (2) the confidence in making characterization and clearance decisions for a specified number of samples, for two common statistically based environmental sampling approaches. In particular, the report addresses an issue raised by the Government Accountability Office by providing methods and formulas to calculate the confidence that a decision area is uncontaminated (or successfully decontaminated) if all samples collected according to a statistical sampling approach have negative results. Key to addressing this topic is the probability that an individual sample result is a false negative, which is commonly referred to as the false negative rate (FNR). The two statistical sampling approaches currently discussed in this report are (1) hotspot sampling to detect small isolated contaminated locations during the characterization phase, and (2) combined judgment and random (CJR) sampling during the clearance phase. Typically, if contamination is widely distributed in a decision area, it will be detectable via judgment sampling during the characterization phase. Hotspot sampling is appropriate for characterization situations where contamination is not widely distributed and may not be detected by judgment sampling. CJR sampling is appropriate during the clearance phase when it is desired to augment judgment samples with statistical (random) samples. The hotspot and CJR statistical sampling approaches are discussed in the report for four situations: (1) qualitative data (detect and non-detect) when the FNR = 0 or when using statistical sampling methods that account for FNR > 0; (2) qualitative data when the FNR > 0 but statistical sampling methods are used that assume the FNR = 0; (3) quantitative data (e.g., contaminant concentrations expressed as CFU/cm²) when the FNR = 0 or when using statistical sampling methods that account for FNR > 0; and (4) quantitative data when the FNR > 0 but statistical sampling methods are used that assume the FNR = 0. For Situation 2, the hotspot sampling approach provides for stating with Z% confidence that a hotspot of specified shape and size with detectable contamination will be found. Also for Situation 2, the CJR approach provides for stating with X% confidence that at least Y% of the decision area does not contain detectable contamination. Forms of these statements for the other three situations are discussed in Section 2.2. Statistical methods that account for FNR > 0 currently exist only for the hotspot sampling approach with qualitative data (or quantitative data converted to qualitative data). This report documents the current status of methods and formulas for the hotspot and CJR sampling approaches. Limitations of these methods are identified. Extensions of the methods that are applicable when FNR = 0 to account for FNR > 0, or to address other limitations, will be documented in future revisions of this report if future funding supports the development of such extensions. For quantitative data, this report also presents statistical methods and formulas for (1) quantifying the uncertainty in measured sample results, (2) estimating the true surface concentration corresponding to a surface sample, and (3) quantifying the uncertainty of the estimate of the true surface concentration. All of the methods and formulas discussed in the report were applied to example situations to illustrate application of the methods and interpretation of the results.
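
    As an illustration of the clearance-confidence calculation under the simplest possible assumptions (simple random sampling, FNR = 0; this is not the report's CJR formula, which also accounts for judgment samples):

    ```python
    import math

    def n_for_clearance(confidence, clean_fraction):
        """Number of all-negative random samples needed to state with
        `confidence` that at least `clean_fraction` of the decision area
        is free of detectable contamination. Assumes simple random
        sampling and a false negative rate of zero."""
        return math.ceil(math.log(1 - confidence) / math.log(clean_fraction))

    # 95% confidence that at least 99% of the area is uncontaminated:
    print(n_for_clearance(0.95, 0.99))  # 299 samples
    ```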

  6. Quantifying Mesoscale Neuroanatomy Using X-Ray Microtomography

    PubMed Central

    Gray Roncal, William; Prasad, Judy A.; Fernandes, Hugo L.; Gürsoy, Doga; De Andrade, Vincent; Fezzaa, Kamel; Xiao, Xianghui; Vogelstein, Joshua T.; Jacobsen, Chris; Körding, Konrad P.

    2017-01-01

    Methods for resolving the three-dimensional (3D) microstructure of the brain typically start by thinly slicing and staining the brain, followed by imaging numerous individual sections with visible light photons or electrons. In contrast, X-rays can be used to image thick samples, providing a rapid approach for producing large 3D brain maps without sectioning. Here we demonstrate the use of synchrotron X-ray microtomography (µCT) for producing mesoscale (∼1 µm³ resolution) brain maps from millimeter-scale volumes of mouse brain. We introduce a pipeline for µCT-based brain mapping that develops and integrates methods for sample preparation, imaging, and automated segmentation of cells, blood vessels, and myelinated axons, in addition to statistical analyses of these brain structures. Our results demonstrate that X-ray tomography achieves rapid quantification of large brain volumes, complementing other brain mapping and connectomics efforts. PMID:29085899

  7. NO RELATIONSHIP BETWEEN INTELLIGENCE AND FACIAL ATTRACTIVENESS IN A LARGE, GENETICALLY INFORMATIVE SAMPLE

    PubMed Central

    Mitchem, Dorian G.; Zietsch, Brendan P.; Wright, Margaret J.; Martin, Nicholas G.; Hewitt, John K.; Keller, Matthew C.

    2015-01-01

    Theories in both evolutionary and social psychology suggest that a positive correlation should exist between facial attractiveness and general intelligence, and several empirical observations appear to corroborate this expectation. Using highly reliable measures of facial attractiveness and IQ in a large sample of identical and fraternal twins and their siblings, we found no evidence for a phenotypic correlation between these traits. Likewise, neither the genetic nor the environmental latent factor correlations were statistically significant. We supplemented our analyses of new data with a simple meta-analysis that found evidence of publication bias among past studies of the relationship between facial attractiveness and intelligence. In view of these results, we suggest that previously published reports may have overestimated the strength of the relationship and that the theoretical bases for the predicted attractiveness-intelligence correlation may need to be reconsidered. PMID:25937789

  8. A Review of Enhanced Sampling Approaches for Accelerated Molecular Dynamics

    NASA Astrophysics Data System (ADS)

    Tiwary, Pratyush; van de Walle, Axel

    Molecular dynamics (MD) simulations have become a tool of immense use and popularity for simulating a variety of systems. With the advent of massively parallel computer resources, one now routinely sees applications of MD to systems as large as hundreds of thousands to even several million atoms, which is almost the size of most nanomaterials. However, it is not yet possible to reach laboratory timescales of milliseconds and beyond with MD simulations. Due to the essentially sequential nature of time, parallel computers have been of limited use in solving this so-called timescale problem. Instead, over the years a large range of statistical mechanics based enhanced sampling approaches have been proposed for accelerating molecular dynamics, and accessing timescales that are well beyond the reach of the fastest computers. In this review we provide an overview of these approaches, including the underlying theory, typical applications, and publicly available software resources to implement them.

  9. Detecting a Weak Association by Testing its Multiple Perturbations: a Data Mining Approach

    NASA Astrophysics Data System (ADS)

    Lo, Min-Tzu; Lee, Wen-Chung

    2014-05-01

    Many risk factors/interventions in epidemiologic/biomedical studies are of minuscule effects. To detect such weak associations, one needs a study with a very large sample size (the number of subjects, n). The n of a study can be increased, but unfortunately only to an extent. Here, we propose a novel method which hinges on increasing sample size in a different direction: the total number of variables (p). We construct a p-based "multiple perturbation test", and conduct power calculations and computer simulations to show that it can achieve a very high power to detect weak associations when p can be made very large. As a demonstration, we apply the method to analyze a genome-wide association study on age-related macular degeneration and identify two novel genetic variants that are significantly associated with the disease. The p-based method may set a stage for a new paradigm of statistical tests.

  10. Circum-Arctic petroleum systems identified using decision-tree chemometrics

    USGS Publications Warehouse

    Peters, K.E.; Ramos, L.S.; Zumberge, J.E.; Valin, Z.C.; Scotese, C.R.; Gautier, D.L.

    2007-01-01

    Source- and age-related biomarker and isotopic data were measured for more than 1000 crude oil samples from wells and seeps collected above approximately 55°N latitude. A unique, multitiered chemometric (multivariate statistical) decision tree was created that allowed automated classification of 31 genetically distinct circum-Arctic oil families based on a training set of 622 oil samples. The method, which we call decision-tree chemometrics, uses principal components analysis and multiple tiers of K-nearest neighbor and SIMCA (soft independent modeling of class analogy) models to classify and assign confidence limits for newly acquired oil samples and source rock extracts. Geochemical data for each oil sample were also used to infer the age, lithology, organic matter input, depositional environment, and identity of its source rock. These results demonstrate the value of large petroleum databases where all samples were analyzed using the same procedures and instrumentation. Copyright © 2007. The American Association of Petroleum Geologists. All rights reserved.
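
    A stripped-down sketch of the PCA-plus-KNN tier of such a decision tree, using scikit-learn (the SIMCA models, the multitiered structure, and the confidence limits of the actual method are omitted, and all data here are hypothetical):

    ```python
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Hypothetical training set: rows are oils, columns are biomarker
    # ratios and isotopic values; y holds oil-family labels (0-30).
    rng = np.random.default_rng(7)
    X = rng.normal(size=(622, 20))
    y = rng.integers(0, 31, size=622)

    # Standardize, project onto principal components, classify by KNN
    clf = make_pipeline(StandardScaler(), PCA(n_components=10),
                        KNeighborsClassifier(n_neighbors=5))
    clf.fit(X, y)

    new_oils = rng.normal(size=(3, 20))
    print(clf.predict(new_oils))  # family assignments for new samples
    ```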

  11. The New Maia Detector System: Methods For High Definition Trace Element Imaging Of Natural Material

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ryan, C. G.; School of Physics, University of Melbourne, Parkville VIC; CODES Centre of Excellence, University of Tasmania, Hobart TAS

    2010-04-06

    Motivated by the need for megapixel high definition trace element imaging to capture intricate detail in natural material, together with faster acquisition and improved counting statistics in elemental imaging, a large energy-dispersive detector array called Maia has been developed by CSIRO and BNL for SXRF imaging on the XFM beamline at the Australian Synchrotron. A 96 detector prototype demonstrated the capacity of the system for real-time deconvolution of complex spectral data using an embedded implementation of the Dynamic Analysis method and acquiring highly detailed images up to 77 M pixels spanning large areas of complex mineral sample sections.

  13. An audit of the statistics and the comparison with the parameter in the population

    NASA Astrophysics Data System (ADS)

    Bujang, Mohamad Adam; Sa'at, Nadiah; Joys, A. Reena; Ali, Mariana Mohamad

    2015-10-01

    The sample size needed for a sample statistic to closely estimate a population parameter has long been an issue. Although sample size may be calculated according to the objective of a study, it is difficult to confirm whether the resulting statistics are close to the parameters of a particular population. All the while, a p-value of less than 0.05 has been widely used as inferential evidence. This study therefore audited results analyzed from various subsamples and statistical analyses and compared the results with the parameters in three different populations. Eight types of statistical analysis, with eight subsamples for each analysis, were examined. The statistics were consistent and close to the parameters when the sample covered at least 15% to 35% of the population. Larger sample sizes are needed to estimate parameters that involve categorical variables than parameters that involve numerical variables. Sample sizes of 300 to 500 are sufficient to estimate the parameters of a medium-sized population.

  14. Negative symptoms in schizophrenia: a study in a large clinical sample of patients using a novel automated method.

    PubMed

    Patel, Rashmi; Jayatilleke, Nishamali; Broadbent, Matthew; Chang, Chin-Kuo; Foskett, Nadia; Gorrell, Genevieve; Hayes, Richard D; Jackson, Richard; Johnston, Caroline; Shetty, Hitesh; Roberts, Angus; McGuire, Philip; Stewart, Robert

    2015-09-07

    To identify negative symptoms in the clinical records of a large sample of patients with schizophrenia using natural language processing, and to assess their relationship with clinical outcomes. Observational study using an anonymised electronic health record case register from the South London and Maudsley NHS Trust (SLaM), a large provider of inpatient and community mental healthcare in the UK. The sample comprised 7678 patients with schizophrenia receiving care during 2011; outcome measures were hospital admission, readmission and duration of admission. 10 different negative symptoms were ascertained with precision statistics above 0.80. 41% of patients had 2 or more negative symptoms. Negative symptoms were associated with younger age, male gender and single marital status, and with increased likelihood of hospital admission (OR 1.24, 95% CI 1.10 to 1.39), longer duration of admission (β-coefficient 20.5 days, 7.6-33.5), and increased likelihood of readmission following discharge (OR 1.58, 1.28 to 1.95). Negative symptoms were common and associated with adverse clinical outcomes, consistent with evidence that these symptoms account for much of the disability associated with schizophrenia. Natural language processing provides a means of conducting research in large representative samples of patients, using data recorded during routine clinical practice. Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions.

  15. Are There Spatial or Temporal Patterns to Holocene Explosive Eruptions in the Aleutian Archipelago? A Work in Progress

    NASA Astrophysics Data System (ADS)

    Martin, C.; Nicolaysen, K. P.; McConville, K.; Hatfield, V.; West, D.

    2013-12-01

    By examining the existing geological and archeological record of radiocarbon-dated Aleutian tephras of the last 12,000 years, this study sought to determine whether there were spatial or temporal patterns of explosive eruptive activity. The Holocene tephra record has important implications because two episodes of migration and colonization by humans of distinct cultures established the Unangan/Aleut peoples of the Aleutian Islands concurrently with the volcanic activity. From Aniakchak Volcano on the Alaska Peninsula to the Andreanof Islands (158 to 178° W longitude), 55 distinct tephras represent significant explosive eruptions of the last 12,000 years. Initial results suggest that the Andreanof and Fox Island regions of the archipelago have had frequent explosive eruptions, whereas the Islands of Four Mountains, Rat, and Near Island regions have apparently had little or no eruptive activity. However, one clear result of the investigation is that sampling bias strongly influences the apparent spatial patterns. For example, field reconnaissance in the Islands of Four Mountains documents two Holocene calderas and a minimum of 20 undated tephras in addition to the large ignimbrites. Only the lack of significant explosive activity in the Near Islands seems a valid spatial result, as archeological excavations and geologic reports failed to document Holocene tephras there. An intriguing preliminary temporal pattern is the apparent absence of large explosive eruptions across the archipelago from ca. 4,800 to 6,000 yBP. To test the validity of apparent patterns, a statistical treatment of the compiled data grappled with the sampling bias by considering three confounding variables: larger island size allows more opportunity for geologic preservation of tephras; a larger-magnitude eruption promotes tephra preservation by creating thicker and more widespread deposits; and the comprehensiveness of the tephra sampling of each volcano and island varies widely because of logistical and financial limitations. This initial statistical investigation proposes variables to mitigate the effects of sampling bias and makes recommendations for sampling strategies to enable statistically valid examination of research questions. Further, though caldera-forming eruptions occurred throughout the Holocene (and several remain undated), four of six dated eruptions occurred throughout the archipelago between 8,000-9,100 yBP, a period coinciding with some of the earliest human occupation (Early Anangula Phase) of the eastern Aleutians.

  16. Statistical distribution of mechanical properties for three graphite-epoxy material systems

    NASA Technical Reports Server (NTRS)

    Reese, C.; Sorem, J., Jr.

    1981-01-01

    Graphite-epoxy composites are playing an increasing role as viable alternative materials in structural applications, necessitating thorough investigation into the predictability and reproducibility of their material strength properties. This investigation was concerned with tension, compression, and short-beam shear coupon testing of large samples from three different material suppliers to determine their statistical strength behavior. Statistical results indicate that a two-parameter Weibull distribution model provides better overall characterization of material behavior for the graphite-epoxy systems tested than does the standard Normal distribution model that is employed for most design work. While either a Weibull or Normal distribution model provides adequate predictions for average strength values, the Weibull model provides better characterization in the lower tail region, where the predictions are of maximum design interest. The two sets of the same material were found to have essentially the same material properties, indicating that repeatability can be achieved.
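
    A minimal sketch of the comparison described, fitting a two-parameter Weibull (location fixed at zero) and a Normal model to hypothetical strength data and comparing lower-tail predictions:

    ```python
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(11)
    strength = rng.weibull(8.0, 200) * 600.0  # hypothetical tensile strengths (MPa)

    # Two-parameter Weibull (location fixed at zero) vs Normal fit
    c, _, scale = stats.weibull_min.fit(strength, floc=0)
    mu, sigma = stats.norm.fit(strength)

    # Compare lower-tail design values (1st percentile), where the
    # two models differ most and design interest is greatest.
    print("Weibull 1st percentile:", stats.weibull_min.ppf(0.01, c, scale=scale))
    print("Normal  1st percentile:", stats.norm.ppf(0.01, mu, sigma))
    ```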

  17. Statistical analysis of effective singular values in matrix rank determination

    NASA Technical Reports Server (NTRS)

    Konstantinides, Konstantinos; Yao, Kung

    1988-01-01

    A major problem in using SVD (singular-value decomposition) as a tool in determining the effective rank of a perturbed matrix is that of distinguishing between significantly small and significantly large singular values. To this end, confidence regions are derived for the perturbed singular values of matrices with noisy observation data. The analysis is based on the theory of perturbations of singular values and on statistical significance testing. Threshold bounds for perturbation due to finite-precision and i.i.d. random models are evaluated. In random models, the threshold bounds depend on the dimension of the matrix, the noise variance, and a predefined statistical level of significance. The results are applied to the problem of determining the effective order of a linear autoregressive system from the approximate rank of a sample autocorrelation matrix. Various numerical examples illustrating the usefulness of these bounds and comparisons to other previously known approaches are given.
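
    A simplified numerical illustration of the idea: threshold the singular values at the approximate noise floor σ(√m + √n), the expected largest singular value of an i.i.d. noise matrix. This heuristic is a stand-in for the paper's confidence-region thresholds, not a reproduction of them:

    ```python
    import numpy as np

    def effective_rank(A, noise_sigma):
        """Estimate the effective rank of a noisy matrix by counting
        singular values above the approximate noise floor
        sigma * (sqrt(m) + sqrt(n))."""
        m, n = A.shape
        s = np.linalg.svd(A, compute_uv=False)
        threshold = noise_sigma * (np.sqrt(m) + np.sqrt(n))
        return int(np.sum(s > threshold))

    rng = np.random.default_rng(5)
    low_rank = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 40))  # true rank 3
    noisy = low_rank + rng.normal(scale=0.1, size=(50, 40))
    print(effective_rank(noisy, noise_sigma=0.1))  # recovers 3
    ```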

  18. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sil’chenko, Olga K., E-mail: olga@sai.msu.su; Isaac Newton Institute, Chile, Moscow Branch

    I analyze statistics of the stellar population properties for stellar nuclei and bulges of nearby lenticular galaxies in different environments by using panoramic spectral data of the integral-field spectrograph SAURON retrieved from the open archive of the Isaac Newton Group. I also estimate the fraction of nearby lenticular galaxies having inner polar gaseous disks by exploring the volume-limited sample of early-type galaxies of the ATLAS-3D survey. By inspecting the two-dimensional velocity fields of the stellar and gaseous components with the running tilted-ring technique, I have found seven new cases of inner polar disks. Together with those, the frequency of inner polar disks in nearby S0 galaxies reaches 10%, which is much higher than the frequency of large-scale polar rings. Interestingly, the properties of the nuclear stellar populations in the inner polar ring hosts are statistically the same as those in the whole S0 sample, implying similar histories of multiple gas-accretion events from various directions.

  19. Cannabis, motivation, and life satisfaction in an internet sample

    PubMed Central

    Barnwell, Sara Smucker; Earleywine, Mitch; Wilcox, Rand

    2006-01-01

    Although little evidence supports cannabis-induced amotivational syndrome, sources continue to assert that the drug saps motivation [1], which may guide current prohibitions. A few studies report low motivation in chronic users; another reveals that they have higher subjective wellbeing. To assess differences in motivation and subjective wellbeing, we used a large sample (N = 487) and strict definitions of cannabis use (7 days/week) and abstinence (never). Standard statistical techniques showed no differences. Robust statistical methods controlling for heteroscedasticity, non-normality and extreme values found no differences in motivation but a small difference in subjective wellbeing. Medical users of cannabis reporting health problems tended to account for a significant portion of subjective wellbeing differences, suggesting that illness decreased wellbeing. All p-values were above .05. Thus, daily use of cannabis does not impair motivation. Its impact on subjective wellbeing is small and may actually reflect lower wellbeing due to medical symptoms rather than actual consumption of the plant. PMID:16722561
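
    A sketch of one such robust comparison in the spirit of Wilcox's methods: a percentile-bootstrap interval for the difference in 20% trimmed means, which tolerates non-normality, heteroscedasticity and extreme values. The data below are simulated, not the study's:

    ```python
    import numpy as np
    from scipy import stats

    def trimmed_mean_diff_ci(x, y, trim=0.2, n_boot=5000, seed=0):
        """Percentile-bootstrap 95% CI for the difference in trimmed
        means between two independent groups."""
        rng = np.random.default_rng(seed)
        diffs = [stats.trim_mean(rng.choice(x, len(x)), trim) -
                 stats.trim_mean(rng.choice(y, len(y)), trim)
                 for _ in range(n_boot)]
        return np.percentile(diffs, [2.5, 97.5])

    rng = np.random.default_rng(4)
    users = rng.normal(7.0, 1.5, 50)       # hypothetical wellbeing scores
    abstainers = rng.normal(7.1, 1.5, 60)
    print(trimmed_mean_diff_ci(users, abstainers))  # CI covering 0 => no difference
    ```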

  20. Feature recognition and detection for ancient architecture based on machine vision

    NASA Astrophysics Data System (ADS)

    Zou, Zheng; Wang, Niannian; Zhao, Peng; Zhao, Xuefeng

    2018-03-01

    Ancient architecture has very high historical and artistic value. Ancient buildings feature a wide variety of textures and decorative paintings that carry substantial historical meaning, so recognizing and cataloging these compositional and decorative features plays an important role in subsequent research. Until recently, however, such components have mainly been cataloged manually, which consumes a great deal of labor and time. At present, supported by big data and GPU-accelerated training, machine vision with deep learning at its core has developed rapidly and is widely used in many fields. This paper proposes an approach for recognizing and detecting the textures, decorations and other features of ancient buildings based on machine vision. First, a large number of images of the surface textures of ancient building components are classified manually to form a training set. Then, a convolutional neural network is trained on these samples to obtain a classification detector. Finally, its precision is verified.
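
    A minimal transfer-learning sketch of such a texture classifier; PyTorch with a recent torchvision is assumed, and the folder name, hyperparameters, and choice of ResNet-18 are illustrative, not taken from the paper:

    ```python
    import torch
    import torch.nn as nn
    from torchvision import datasets, models, transforms

    # Hypothetical folder layout: texture_images/<class_name>/*.jpg
    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])
    dataset = datasets.ImageFolder("texture_images", transform=transform)
    loader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)

    # Fine-tune a small pretrained CNN as the texture classification detector
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    model.fc = nn.Linear(model.fc.in_features, len(dataset.classes))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()

    model.train()
    for epoch in range(5):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()
    ```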

  1. Measurements of Turbulence at Two Tidal Energy Sites in Puget Sound, WA

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Thomson, Jim; Polagye, Brian; Durgesh, Vibhav

    2012-06-05

    Field measurements of turbulence are presented from two sites in Puget Sound, WA (USA) that are proposed for electrical power generation using tidal current turbines. Rapidly sampled data from multiple acoustic Doppler instruments are analyzed to obtain statistical measures of fluctuations in both the magnitude and direction of the tidal currents. The resulting turbulence intensities (i.e., the turbulent velocity fluctuations normalized by the harmonic tidal currents) are typically 10% at the hub-heights (i.e., the relevant depth bin) of the proposed turbines. Length and time scales of the turbulence are also analyzed. Large-scale, anisotropic eddies dominate the energy spectra, which may be the result of proximity to headlands at each site. At small scales, an isotropic turbulent cascade is observed and used to estimate the dissipation rate of turbulent kinetic energy. Data quality and sampling parameters are discussed, with an emphasis on the removal of Doppler noise from turbulence statistics.
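
    A minimal sketch of the turbulence-intensity calculation: the standard deviation of the velocity fluctuations normalized by the mean current, here estimated per window rather than from a harmonic fit as in the paper. The velocity series below is synthetic:

    ```python
    import numpy as np

    def turbulence_intensity(u, window=512):
        """Turbulence intensity per window: std of the velocity
        fluctuations divided by the window-mean speed (a simplified
        stand-in for normalization by harmonic tidal currents)."""
        u = np.asarray(u)
        n = len(u) // window * window
        blocks = u[:n].reshape(-1, window)
        return blocks.std(axis=1, ddof=1) / np.abs(blocks.mean(axis=1))

    # Synthetic hub-height velocity: a tidal signal plus turbulent noise
    rng = np.random.default_rng(9)
    tide = 1.5 * np.sin(np.linspace(0, np.pi, 4096))  # m/s, half a tidal cycle
    u = tide + 0.15 * rng.normal(size=4096)
    ti = turbulence_intensity(u)
    print(ti[len(ti) // 2])  # ~0.10 near peak flow (0.15 / 1.5)
    ```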

  2. Evaluating optimal therapy robustness by virtual expansion of a sample population, with a case study in cancer immunotherapy

    PubMed Central

    Barish, Syndi; Ochs, Michael F.; Sontag, Eduardo D.; Gevertz, Jana L.

    2017-01-01

    Cancer is a highly heterogeneous disease, exhibiting spatial and temporal variations that pose challenges for designing robust therapies. Here, we propose the VEPART (Virtual Expansion of Populations for Analyzing Robustness of Therapies) technique as a platform that integrates experimental data, mathematical modeling, and statistical analyses for identifying robust optimal treatment protocols. VEPART begins with time course experimental data for a sample population, and a mathematical model fit to aggregate data from that sample population. Using nonparametric statistics, the sample population is amplified and used to create a large number of virtual populations. At the final step of VEPART, robustness is assessed by identifying and analyzing the optimal therapy (perhaps restricted to a set of clinically realizable protocols) across each virtual population. As proof of concept, we have applied the VEPART method to study the robustness of treatment response in a mouse model of melanoma subject to treatment with immunostimulatory oncolytic viruses and dendritic cell vaccines. Our analysis (i) showed that every scheduling variant of the experimentally used treatment protocol is fragile (nonrobust) and (ii) discovered an alternative region of dosing space (lower oncolytic virus dose, higher dendritic cell dose) for which a robust optimal protocol exists. PMID:28716945
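
    A minimal sketch of the virtual-expansion step (a nonparametric bootstrap over subjects); the growth-rate data and summary metric below are hypothetical stand-ins for the paper's model-based analysis:

    ```python
    import numpy as np

    def virtual_populations(sample, n_virtual=1000, rng=None):
        """Resample subjects with replacement to build many virtual
        populations, in the spirit of VEPART, so a treatment metric can
        be checked for robustness to who happened to be sampled."""
        rng = rng or np.random.default_rng()
        n = len(sample)
        return [sample[rng.integers(0, n, n)] for _ in range(n_virtual)]

    # Hypothetical per-mouse tumor growth rates under one protocol
    rng = np.random.default_rng(2)
    growth = rng.normal(0.3, 0.12, size=10)
    pops = virtual_populations(growth, rng=rng)
    mean_growth = np.array([p.mean() for p in pops])
    # Spread of the metric across virtual populations indicates robustness
    print(np.percentile(mean_growth, [2.5, 50, 97.5]))
    ```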

  3. Assessment and Implication of Prognostic Imbalance in Randomized Controlled Trials with a Binary Outcome – A Simulation Study

    PubMed Central

    Chu, Rong; Walter, Stephen D.; Guyatt, Gordon; Devereaux, P. J.; Walsh, Michael; Thorlund, Kristian; Thabane, Lehana

    2012-01-01

    Background Chance imbalance in baseline prognosis of a randomized controlled trial can lead to over- or underestimation of treatment effects, particularly in trials with small sample sizes. Our study aimed to (1) evaluate the probability of imbalance in a binary prognostic factor (PF) between two treatment arms, (2) investigate the impact of prognostic imbalance on the estimation of a treatment effect, and (3) examine the effect of sample size (n) in relation to the first two objectives. Methods We simulated data from parallel-group trials evaluating a binary outcome by varying the risk of the outcome, effect of the treatment, power and prevalence of the PF, and n. Logistic regression models with and without adjustment for the PF were compared in terms of bias, standard error, coverage of the confidence interval and statistical power. Results For a PF with a prevalence of 0.5, the probability of a difference in the frequency of the PF ≥ 5% reaches 0.42 with 125/arm. Ignoring a strong PF (relative risk = 5) leads to underestimating the strength of a moderate treatment effect, and the underestimate is independent of n when n is >50/arm. Adjusting for such a PF increases statistical power. If the PF is weak (RR = 2), adjustment makes little difference in statistical inference. Conditional on a 5% imbalance of a powerful PF, adjustment reduces the likelihood of large bias. If an absolute measure of imbalance ≥ 5% is deemed important, including 1000 patients/arm provides sufficient protection against such an imbalance. Two thousand patients/arm may provide an adequate control against large random deviations in treatment effect estimation in the presence of a powerful PF. Conclusions The probability of prognostic imbalance in small trials can be substantial. Covariate adjustment improves estimation accuracy and statistical power, and hence should be performed when strong PFs are observed. PMID:22629322
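
    The first result quoted above is easy to reproduce by simulation; a minimal sketch (the ≈0.42 figure falls out directly):

    ```python
    import numpy as np

    def prob_imbalance(n_per_arm, prevalence=0.5, threshold=0.05, sims=100_000):
        """Monte Carlo probability that the observed prevalence of a
        binary prognostic factor differs between two randomized arms
        by at least `threshold`."""
        rng = np.random.default_rng(0)
        p1 = rng.binomial(n_per_arm, prevalence, sims) / n_per_arm
        p2 = rng.binomial(n_per_arm, prevalence, sims) / n_per_arm
        return np.mean(np.abs(p1 - p2) >= threshold)

    print(prob_imbalance(125))   # ~0.42, matching the abstract
    print(prob_imbalance(1000))  # far smaller with 1000 patients/arm
    ```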

  4. Methods for evaluating temporal groundwater quality data and results of decadal-scale changes in chloride, dissolved solids, and nitrate concentrations in groundwater in the United States, 1988-2010

    USGS Publications Warehouse

    Lindsey, Bruce D.; Rupert, Michael G.

    2012-01-01

    Decadal-scale changes in groundwater quality were evaluated by the U.S. Geological Survey National Water-Quality Assessment (NAWQA) Program. Samples of groundwater collected from wells during 1988-2000 - a first sampling event representing the decade ending the 20th century - were compared on a pair-wise basis to samples from the same wells collected during 2001-2010 - a second sampling event representing the decade beginning the 21st century. The data set consists of samples from 1,236 wells in 56 well networks, representing major aquifers and urban and agricultural land-use areas, with analytical results for chloride, dissolved solids, and nitrate. Statistical analysis was done on a network basis rather than by individual wells. Although spanning slightly more or less than a 10-year period, the two-sample comparison between the first and second sampling events is referred to as an analysis of decadal-scale change based on a step-trend analysis. The 22 principal aquifers represented by these 56 networks account for nearly 80 percent of the estimated withdrawals of groundwater used for drinking-water supply in the Nation. Well networks where decadal-scale changes in concentrations were statistically significant were identified using the Wilcoxon-Pratt signed-rank test. For the statistical analysis of chloride, dissolved solids, and nitrate concentrations at the network level, more than half revealed no statistically significant change over the decadal period. However, for networks that had statistically significant changes, increased concentrations outnumbered decreased concentrations by a large margin. Statistically significant increases of chloride concentrations were identified for 43 percent of 56 networks. Dissolved solids concentrations increased significantly in 41 percent of the 54 networks with dissolved solids data, and nitrate concentrations increased significantly in 23 percent of 56 networks. At least one of the three - chloride, dissolved solids, or nitrate - had a statistically significant increase in concentration in 66 percent of the networks. Statistically significant decreases in concentrations were identified in 4 percent of the networks for chloride, 2 percent of the networks for dissolved solids, and 9 percent of the networks for nitrate. A larger percentage of urban land-use networks had statistically significant increases in chloride, dissolved solids, and nitrate concentrations than agricultural land-use networks. In order to assess the magnitude of statistically significant changes, the median of the differences between constituent concentrations from the first full-network sampling event and those from the second full-network sampling event was calculated using the Turnbull method. The largest median decadal increases in chloride concentrations were in networks in the Upper Illinois River Basin (67 mg/L) and in the New England Coastal Basins (34 mg/L), whereas the largest median decadal decrease in chloride concentrations was in the Upper Snake River Basin (1 mg/L). The largest median decadal increases in dissolved solids concentrations were in networks in the Rio Grande Valley (260 mg/L) and the Upper Illinois River Basin (160 mg/L). The largest median decadal decrease in dissolved solids concentrations was in the Apalachicola-Chattahoochee-Flint River Basin (6.0 mg/L). 
The largest median decadal increases in nitrate as nitrogen (N) concentrations were in networks in the South Platte River Basin (2.0 mg/L as N) and the San Joaquin-Tulare Basins (1.0 mg/L as N). The largest median decadal decrease in nitrate concentrations was in the Santee River Basin and Coastal Drainages (0.63 mg/L). The magnitude of change in networks with statistically significant increases typically was much larger than the magnitude of change in networks with statistically significant decreases. The magnitude of change was greatest for chloride in the urban land-use networks and greatest for dissolved solids and nitrate in the agricultural land-use networks. Analysis of data from all networks combined indicated statistically significant increases for chloride, dissolved solids, and nitrate. Although chloride, dissolved solids, and nitrate concentrations were typically less than the drinking-water standards and guidelines, a statistical test was used to determine whether or not the proportion of samples exceeding the drinking-water standard or guideline changed significantly between the first and second full-network sampling events. The proportion of samples exceeding the U.S. Environmental Protection Agency (USEPA) Secondary Maximum Contaminant Level for dissolved solids (500 milligrams per liter) increased significantly between the first and second full-network sampling events when evaluating all networks combined at the national level. Also, for all networks combined, the proportion of samples exceeding the USEPA Maximum Contaminant Level (MCL) of 10 mg/L as N for nitrate increased significantly. One network in the Delmarva Peninsula had a significant increase in the proportion of samples exceeding the MCL for nitrate. A subset of 261 wells was sampled every other year (biennially) to evaluate decadal-scale changes using a time-series analysis. The analysis of the biennial data set showed that changes were generally similar to the findings from the analysis of decadal-scale change that was based on a step-trend analysis. Because of the small number of wells in a network with biennial data (typically 4-5 wells), the time-series analysis is more useful for understanding water-quality responses to changes in site-specific conditions rather than as an indicator of the change for the entire network.
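
    A minimal sketch of the network-level step-trend test described above, using hypothetical paired chloride concentrations for one 12-well network; the well count, values, and 0.05 threshold are illustrative, and a plain sample median stands in for the Turnbull estimate of the median difference.

```python
import numpy as np
from scipy.stats import wilcoxon

# One row per well: first (1988-2000) and second (2001-2010) sampling events;
# hypothetical chloride concentrations in mg/L for a 12-well network
first_event = np.array([12.0, 30.5, 8.2, 55.1, 20.0, 41.3,
                        18.7, 25.4, 9.9, 60.2, 33.3, 14.8])
second_event = np.array([15.5, 33.0, 8.2, 70.2, 26.1, 44.0,
                         19.5, 31.8, 9.5, 77.4, 35.0, 21.1])

# zero_method="pratt" keeps zero differences in the ranking, as in the
# Wilcoxon-Pratt signed-rank test applied at the network level
stat, p_value = wilcoxon(second_event, first_event, zero_method="pratt")
median_change = np.median(second_event - first_event)  # stand-in for Turnbull

print(f"p = {p_value:.3f}, median decadal change = {median_change:.1f} mg/L")
if p_value < 0.05:
    print("statistically significant decadal-scale change for this network")
```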

  5. COMPARISON OF TWO METHODS FOR THE ISOLATION OF SALMONELLAE FROM IMPORTED FOODS.

    PubMed

    TAYLOR, W I; HOBBS, B C; SMITH, M E

    1964-01-01

    Two methods for the detection of salmonellae in foods were compared in 179 imported meat and egg samples. The numbers of positive samples and replications, and the numbers of strains and kinds of serotypes, were statistically comparable between the direct enrichment method of the Food Hygiene Laboratory in England and the pre-enrichment method devised for processed foods in the United States. Boneless frozen beef, veal, and horsemeat imported from five countries for consumption in England were found to have salmonellae present in 48 of 116 (41%) samples. Dried egg products imported from three countries were observed to have salmonellae in 10 of 63 (16%) samples. The high incidence of salmonellae isolated from imported foods illustrates the existence of an international health hazard resulting from the continuous introduction of exogenous strains of pathogenic microorganisms on a large scale.
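
    A hedged illustration of how two isolation methods applied to the same samples could be compared: McNemar's exact test on the discordant pairs. The 2x2 counts below are hypothetical, not the paper's data, and the statsmodels call is one reasonable implementation choice.

```python
from statsmodels.stats.contingency_tables import mcnemar

# 2x2 paired-detection table; rows: direct enrichment +/-,
# columns: pre-enrichment +/- (hypothetical counts)
table = [[45, 3],    # both positive / only direct enrichment positive
         [5, 126]]   # only pre-enrichment positive / both negative

result = mcnemar(table, exact=True)  # exact binomial test on the discordant pairs
print(f"statistic = {result.statistic}, p = {result.pvalue:.3f}")
```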

  6. Statistical theory and methodology for remote sensing data analysis

    NASA Technical Reports Server (NTRS)

    Odell, P. L.

    1974-01-01

    A model is developed for evaluating the acreages (proportions) of different crop types over a geographical area using a classification approach, and methods for estimating the crop acreages are given. In estimating the acreage of a specific crop type such as wheat, it is suggested that the problem be treated as a two-crop problem, wheat vs. nonwheat, since this simplifies the estimation considerably. The error analysis and the sample size problem are investigated for the two-crop approach. Numerical results for sample sizes are given for a JSC-ERTS-1 data example on wheat identification performance in Hill County, Montana and Burke County, North Dakota. Lastly, for a large-area crop acreage inventory, a sampling scheme is suggested for acquiring sample data, and the problems of crop acreage estimation and error analysis are discussed.
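
    A small sketch of the two-crop simplification under assumed classifier accuracies: invert the misclassification relation q = p·Se + (1 − p)(1 − Sp) to correct the raw classified wheat proportion. All numbers are invented for illustration.

```python
def corrected_proportion(q_obs, sensitivity, specificity):
    """Invert q = p*Se + (1 - p)*(1 - Sp) for the true wheat proportion p."""
    return (q_obs - (1.0 - specificity)) / (sensitivity + specificity - 1.0)

q_obs = 0.34        # fraction of pixels the classifier labels "wheat"
sensitivity = 0.90  # P(classified wheat | truly wheat), assumed
specificity = 0.85  # P(classified nonwheat | truly nonwheat), assumed

print(f"corrected wheat proportion: "
      f"{corrected_proportion(q_obs, sensitivity, specificity):.3f}")
```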

  7. One should avoid retro-orbital pharmacokinetic sample collections for intranasal dosing in rats: Illustration of spurious pharmacokinetics generated for anti-migraine drugs zolmitriptan and eletriptan.

    PubMed

    Patel, Harilal; Patel, Prakash; Modi, Nirav; Shah, Shaival; Ghoghari, Ashok; Variya, Bhavesh; Laddha, Ritu; Baradia, Dipesh; Dobaria, Nitin; Mehta, Pavak; Srinivas, Nuggehally R

    2017-08-30

    Because it avoids first-pass metabolism through direct and rapid absorption with improved permeability, the intranasal route represents a good alternative for extravascular drug administration. The aim of the study was to investigate the intranasal pharmacokinetics of two anti-migraine drugs (zolmitriptan and eletriptan) using retro-orbital sinus and jugular vein sampling sites. In a parallel study design, healthy male Sprague-Dawley (SD) rats aged between 8 and 12 weeks were divided into groups (n=4 or 5/group). The animals in each group received either intranasal (~1.0 mg/kg) or oral (2.1 mg/kg) doses of zolmitriptan or eletriptan. Serial blood sampling was performed from the jugular vein or retro-orbital site, and plasma samples were analyzed for drug concentrations using an LC-MS/MS assay. Standard pharmacokinetic parameters such as Tmax, Cmax, AUClast, AUC0-inf, and T1/2 were calculated, and the derived parameters were compared statistically using an unpaired t-test. After intranasal dosing, the mean pharmacokinetic parameters Cmax and AUCinf of zolmitriptan/eletriptan were about 17-fold and 3-5-fold higher, respectively, for retro-orbital sampling than for the jugular vein sampling site, whereas after oral administration the parameters derived for both drugs were largely comparable between the two sampling sites and the differences were statistically non-significant. In conclusion, the assessment of plasma levels after intranasal administration with retro-orbital sampling would result in spurious and misleading pharmacokinetics. Copyright © 2017 Elsevier B.V. All rights reserved.
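
    A rough sketch of the pharmacokinetic workflow described above, with invented concentration-time profiles: AUC(0-last) by the trapezoidal rule per animal, then an unpaired t-test between sampling sites. The ~17-fold inflation is built into the synthetic data to mirror the reported effect.

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import ttest_ind

times = np.array([0.25, 0.5, 1, 2, 4, 8, 24])  # h

# ng/mL concentration-time profiles, one row per rat (n=4/group), invented
jugular = np.array([[5, 9, 12, 10, 6, 3, 0.5],
                    [4, 8, 11, 9, 5, 2, 0.4],
                    [6, 10, 13, 11, 7, 3, 0.6],
                    [5, 9, 12, 10, 6, 2, 0.5]], dtype=float)
retro_orbital = jugular * 17  # mimic the ~17-fold inflation reported

auc_jugular = trapezoid(jugular, times, axis=1)   # AUC(0-last) per animal
auc_retro = trapezoid(retro_orbital, times, axis=1)

t_stat, p_value = ttest_ind(auc_retro, auc_jugular, equal_var=False)
print(f"AUC ratio ~{auc_retro.mean() / auc_jugular.mean():.0f}-fold, p = {p_value:.4f}")
```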

  8. The High AV Quasar Survey: Reddened Quasi-Stellar Objects Selected from Optical/Near-Infrared Photometry—II

    NASA Astrophysics Data System (ADS)

    Krogager, J.-K.; Geier, S.; Fynbo, J. P. U.; Venemans, B. P.; Ledoux, C.; Møller, P.; Noterdaeme, P.; Vestergaard, M.; Kangas, T.; Pursimo, T.; Saturni, F. G.; Smirnova, O.

    2015-03-01

    Quasi-stellar objects (QSOs) whose spectral energy distributions (SEDs) are reddened by dust either in their host galaxies or in intervening absorber galaxies are to a large degree missed by optical color selection criteria like the ones used by the Sloan Digital Sky Survey (SDSS). To overcome this bias against red QSOs, we employ a combined optical and near-infrared (near-IR) color selection. In this paper, we present a spectroscopic follow-up campaign of a sample of red candidate QSOs which were selected from the SDSS and the UKIRT Infrared Deep Sky Survey (UKIDSS). The spectroscopic data and SDSS/UKIDSS photometry are supplemented by mid-infrared photometry from the Wide-field Infrared Survey Explorer. In our sample of 159 candidates, 154 (97%) are confirmed to be QSOs. We use a statistical algorithm to identify sightlines with plausible intervening absorption systems and identify nine such cases assuming dust in the absorber similar to Large Magellanic Cloud sightlines. We find absorption systems toward 30 QSOs, 2 of which are consistent with the best-fit absorber redshift from the statistical modeling. Furthermore, we observe a broad range in SED properties of the QSOs as probed by the rest-frame 2 μm flux. We find QSOs with a strong excess as well as QSOs with a large deficit at rest-frame 2 μm relative to a QSO template. Potential solutions to these discrepancies are discussed. Overall, our study demonstrates the high efficiency of the optical/near-IR selection of red QSOs.

  9. Big Data in the SHELA Field: Investigating Galaxy Quenching at High Redshifts

    NASA Astrophysics Data System (ADS)

    Stevans, Matthew L.; Finkelstein, Steven L.; Wold, Isak; Kawinwanichakij, Lalitwadee; Sherman, Sydney; Gebhardt, Karl; Jogee, Shardha; Papovich, Casey J.; Ciardullo, Robin; Gronwall, Caryl; Gawiser, Eric J.; Acquaviva, Viviana; Casey, Caitlin; Florez, Jonathan; HETDEX Team

    2017-06-01

    We present a measurement of the z ~ 4 Lyman break galaxy (LBG) rest-frame UV luminosity function to investigate the onset of quenching in the early universe. The bright-end of the galaxy luminosity function typically shows an exponential decline far steeper than that of the underlying halo mass function. This is typically attributed to negative feedback from past active galactic nuclei (AGN) activity as well as dust attenuation. Constraining the abundance of bright galaxies at early times (z > 3) can provide a key insight into the mechanisms regulating star formation in galaxies. However, existing studies suffer from low number statistics and/or the inability to robustly remove stellar and AGN contaminants. In this study we take advantage of the unprecedentedly large (24 deg^2) Spitzer/HETDEX Exploratory Large Area (SHELA) field and its deep multi-wavelength photometry, which includes DECam ugriz, NEWFIRM K-band, Spitzer/IRAC, Herschel/SPIRE, and X-ray from XMM-Newton and Chandra. With SHELA’s deep imaging over a large area we are uniquely positioned to study statistically significant samples of massive galaxies at high redshifts (z > 3) when the first massive galaxies began quenching. We select our sample using photometric redshifts from the EAZY software package (Brammer et al. 2008) based on the optical and far-infrared imaging. We directly identify and remove stellar contaminants and AGN with IRAC colors and X-ray detections, respectively. By pinning down the exact shape of the bright-end of the z ~ 4 LBG luminosity function, we provide the deepest probe yet into the baryonic physics dominating star formation and quenching in the early universe.

  10. Statistical properties of Faraday rotation measure in external galaxies - I. Intervening disc galaxies

    NASA Astrophysics Data System (ADS)

    Basu, Aritra; Mao, S. A.; Fletcher, Andrew; Kanekar, Nissim; Shukurov, Anvar; Schnitzeler, Dominic; Vacca, Valentina; Junklewitz, Henrik

    2018-06-01

    Deriving the Faraday rotation measure (RM) of quasar absorption line systems, which trace high-redshift galaxies lying along the sight lines to background quasars, is a powerful tool for probing magnetic fields in distant galaxies. Statistically comparing the RM distributions of two quasar samples, with and without absorption line systems, allows one to infer magnetic field properties of the intervening galaxy population. Here, we have derived the analytical form of the probability distribution function (PDF) of RM produced by a single galaxy with an axisymmetric large-scale magnetic field. We then further determine the PDF of RM for one random sight line traversing each galaxy in a population with a large-scale magnetic field prescription. We find that the resulting PDF of RM is dominated by a Lorentzian with a width that is directly related to the mean axisymmetric large-scale field strength ⟨B0⟩ of the galaxy population if the dispersion of B0 within the population is smaller than ⟨B0⟩. Provided that RMs produced by the intervening galaxies have been successfully isolated from other RM contributions along the line of sight, our simple model suggests that ⟨B0⟩ in galaxies probed by quasar absorption line systems can be measured within ≈50 per cent accuracy without additional constraints on the magneto-ionic medium properties of the galaxies. Finally, we discuss quasar sample selection criteria that are crucial to reliably interpret observations, and argue that, within the limitations of the current database of absorption line systems, high-metallicity damped Lyman-α absorbers are best suited to study galactic dynamo action in distant disc galaxies.
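
    A minimal numerical sketch of the headline result, assuming synthetic residual RMs: fit a Cauchy (Lorentzian) profile and read off its width. The location, scale, and sample size are invented; real analyses must first isolate the intervening-galaxy contribution.

```python
import numpy as np
from scipy.stats import cauchy

rng = np.random.default_rng(42)
rm_sample = cauchy.rvs(loc=0.0, scale=25.0, size=500, random_state=rng)  # rad/m^2

loc_hat, scale_hat = cauchy.fit(rm_sample)  # maximum-likelihood Lorentzian fit
print(f"Lorentzian centre = {loc_hat:.1f} rad/m^2, width = {scale_hat:.1f} rad/m^2")
```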

  11. Variance Estimation, Design Effects, and Sample Size Calculations for Respondent-Driven Sampling

    PubMed Central

    2006-01-01

    Hidden populations, such as injection drug users and sex workers, are central to a number of public health problems. However, because of the nature of these groups, it is difficult to collect accurate information about them, and this difficulty complicates disease prevention efforts. A recently developed statistical approach called respondent-driven sampling improves our ability to study hidden populations by allowing researchers to make unbiased estimates of the prevalence of certain traits in these populations. Yet, not enough is known about the sample-to-sample variability of these prevalence estimates. In this paper, we present a bootstrap method for constructing confidence intervals around respondent-driven sampling estimates and demonstrate in simulations that it outperforms the naive method currently in use. We also use simulations and real data to estimate the design effects for respondent-driven sampling in a number of situations. We conclude with practical advice about the power calculations that are needed to determine the appropriate sample size for a study using respondent-driven sampling. In general, we recommend a sample size twice as large as would be needed under simple random sampling. PMID:16937083
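
    A simplified percentile bootstrap for a prevalence estimate, for orientation only: the paper's method resamples along recruitment chains, whereas this sketch resamples respondents independently; data are simulated.

```python
import numpy as np

rng = np.random.default_rng(0)
outcomes = rng.binomial(1, 0.15, size=400)  # hypothetical 0/1 trait indicators

# Percentile bootstrap: resample respondents with replacement many times
boot_means = np.array([
    rng.choice(outcomes, size=outcomes.size, replace=True).mean()
    for _ in range(2000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"prevalence = {outcomes.mean():.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```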

  12. Large-scale correlations in gas traced by Mg II absorbers around low-mass galaxies

    NASA Astrophysics Data System (ADS)

    Kauffmann, Guinevere

    2018-03-01

    The physical origin of the large-scale conformity in the colours and specific star formation rates of isolated low-mass central galaxies and their neighbours on scales in excess of 1 Mpc is still under debate. One possible scenario is that gas is heated over large scales by feedback from active galactic nuclei (AGNs), leading to coherent modulation of cooling and star formation between well-separated galaxies. In this Letter, the metal line absorption catalogue of Zhu & Ménard is used to probe gas out to large projected radii around a sample of a million galaxies with stellar masses ~10^10 M⊙ and photometric redshifts in the range 0.4 < z < 0.8 selected from Sloan Digital Sky Survey imaging data. This galaxy sample covers an effective volume of 2.2 Gpc^3. A statistically significant excess of Mg II absorbers is present around the red low-mass galaxies compared to their blue counterparts out to projected radii of 10 Mpc. In addition, the equivalent width distribution function of Mg II absorbers around low-mass galaxies is shown to be strongly affected by the presence of a nearby (Rp < 2 Mpc) radio-loud AGN out to projected radii of 5 Mpc.

  13. Dynamic triggering of low magnitude earthquakes in the Middle American Subduction Zone

    NASA Astrophysics Data System (ADS)

    Escudero, C. R.; Velasco, A. A.

    2010-12-01

    We analyze global and Middle American Subduction Zone (MASZ) seismicity from 1998 to 2008 to quantify transient stress effects at teleseismic distances. We use the Bulletin of the International Seismological Centre Catalog (ISCCD) published by the Incorporated Research Institutions for Seismology (IRIS). To identify MASZ seismicity changes due to distant, large (Mw > 7) earthquakes, we first identify local earthquakes that occurred before and after the mainshocks. We then group the local earthquakes within a cluster radius of 75 to 200 km. We obtain statistics based on characteristics of both the mainshocks and the local earthquake clusters, such as local cluster-mainshock azimuth, mainshock focal mechanism, and the position of local earthquake clusters within the MASZ. Because of lateral variations in dip along the subducted oceanic plate, we divide the Mexican subduction zone into four segments. We then apply the Paired Samples Statistical Test (PSST) to the sorted data to identify increases, decreases, or neither in the local seismicity associated with distant large earthquakes. We identify dynamic triggering in all MASZ segments produced by large earthquakes arriving from specific azimuths, as well as a decrease in seismicity in some cases. We find no dependence of seismicity changes on mainshock focal mechanism.

  14. Towards a routine application of Top-Down approaches for label-free discovery workflows.

    PubMed

    Schmit, Pierre-Olivier; Vialaret, Jerome; Wessels, Hans J C T; van Gool, Alain J; Lehmann, Sylvain; Gabelle, Audrey; Wood, Jason; Bern, Marshall; Paape, Rainer; Suckau, Detlev; Kruppa, Gary; Hirtz, Christophe

    2018-03-20

    Thanks to proteomics investigations, our vision of the role of different protein isoforms in the pathophysiology of diseases has evolved considerably. The idea that protein biomarkers like tau, amyloid peptides, ApoE, cystatin, or neurogranin are represented in body fluids as single species is clearly over-simplified, as most proteins are present in different isoforms and are subject to numerous processing steps and post-translational modifications. Measuring the intact mass of proteins by MS has the advantage of providing information on the presence and relative amounts of the different proteoforms. Such Top-Down approaches typically require a high degree of sample pre-fractionation to allow the MS system to deliver optimal performance in terms of dynamic range, mass accuracy, and resolution. In clinical studies, however, the requirements for pre-analytical robustness and for sample sizes large enough to provide statistical power restrict the routine use of extensive sample pre-fractionation. In this study, we have investigated the capacity of current-generation Ultra-High Resolution Q-Tof systems to deal with highly complex intact protein samples and have evaluated the approach on a cohort of patients suffering from neurodegenerative disease. Statistical analysis showed that several proteoforms can be used to distinguish Alzheimer disease patients from patients suffering from other neurodegenerative diseases. Top-Down approaches have an extremely high biological relevance, especially when it comes to biomarker discovery, but the necessary pre-fractionation constraints are not easily compatible with the robustness requirements and the size of clinical sample cohorts. We have demonstrated that intact protein profiling studies can be run on UHR-Q-ToF systems with limited pre-fractionation. The proteoforms identified as candidate biomarkers in the proof-of-concept study are derived from proteins known to play a role in the pathophysiology of Alzheimer disease. Copyright © 2017 Elsevier B.V. All rights reserved.

  15. Serration Behavior of a Zr-Based Metallic Glass Under Different Constrained Loading Conditions

    NASA Astrophysics Data System (ADS)

    Yang, G. N.; Gu, J. L.; Chen, S. Q.; Shao, Y.; Wang, H.; Yao, K. F.

    2016-11-01

    To understand how external constraint tunes the plastic behavior and shear band dynamics of metallic glasses (MGs), uniaxial compression tests were performed on Zr41.2Ti13.8Cu12.5Ni10.0Be22.5 MG samples with aspect ratios of 0.5:1, 1:1, 1.5:1, 2:1, 2.5:1, and 3:1. Better plasticity was observed for the samples with smaller aspect ratios (under a higher degree of constraint). At the onset of yielding, increasing serration (jerky stress drop) size was observed on the loading curves of all samples. Statistical analysis of the serration patterns indicated that the small and large stress-drop serrations follow self-organized critical and chaotic dynamics, respectively. Under constrained loading, the large stress-drop serrations are suppressed, while the small stress-drop serrations are less affected. When the external constraint level is changed by varying the sample aspect ratio, the serration pattern, shear band dynamics, and plastic behavior change accordingly. This study offers a perspective on understanding the plastic behavior of MGs under different external constraints by tuning shear band dynamics.

  16. Design-based Sample and Probability Law-Assumed Sample: Their Role in Scientific Investigation.

    ERIC Educational Resources Information Center

    Ojeda, Mario Miguel; Sahai, Hardeo

    2002-01-01

    Discusses some key statistical concepts in probabilistic and non-probabilistic sampling to provide an overview for understanding the inference process. Suggests a statistical model constituting the basis of statistical inference and provides a brief review of the finite population descriptive inference and a quota sampling inferential theory.…

  17. Comparative Financial Statistics for Public Two-Year Colleges: FY 1991 National Sample.

    ERIC Educational Resources Information Center

    Dickmeyer, Nathan; Cirino, Anna Marie

    This report provides comparative financial information derived from a national sample of 503 public two-year colleges. The report includes space for colleges to compare their institutional statistics with data provided on national sample medians; quartile data for the national sample; and statistics presented in various formats, including tables,…

  18. On the Use of Biomineral Oxygen Isotope Data to Identify Human Migrants in the Archaeological Record: Intra-Sample Variation, Statistical Methods and Geographical Considerations

    PubMed Central

    Lightfoot, Emma; O’Connell, Tamsin C.

    2016-01-01

    Oxygen isotope analysis of archaeological skeletal remains is an increasingly popular tool to study past human migrations. It is based on the assumption that human body chemistry preserves the δ18O of precipitation in such a way as to be a useful technique for identifying migrants and, potentially, their homelands. In this study, the first such global survey, we draw on published human tooth enamel and bone bioapatite data to explore the validity of using oxygen isotope analyses to identify migrants in the archaeological record. We use human δ18O results to show that there are large variations in human oxygen isotope values within a population sample. This may relate to physiological factors influencing the preservation of the primary isotope signal, or to human activities (such as brewing, boiling, stewing, differential access to water sources and so on) causing variation in ingested water and food isotope values. We compare the number of outliers identified using various statistical methods. We determine that the most appropriate method for identifying migrants is dependent on the data but is likely to be the IQR or median absolute deviation from the median under most archaeological circumstances. Finally, through a spatial assessment of the dataset, we show that the degree of overlap in human isotope values from different locations across Europe is such that identifying individuals’ homelands on the basis of oxygen isotope analysis alone is not possible for the regions analysed to date. Oxygen isotope analysis is a valid method for identifying first-generation migrants from an archaeological site when used appropriately; however, it is difficult to identify migrants using statistical methods for a sample size of less than c. 25 individuals. In the absence of local previous analyses, each sample should be treated as an individual dataset and statistical techniques can be used to identify migrants, but in most cases pinpointing a specific homeland should not be attempted. PMID:27124001
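
    A short sketch of the outlier screens discussed above, applied to hypothetical δ18O enamel values for one site: the IQR rule and the median-absolute-deviation (MAD) rule.

```python
import numpy as np

d18O = np.array([26.1, 26.4, 25.9, 26.8, 26.2, 26.5, 25.7, 26.3, 28.9, 26.0])

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(d18O, [25, 75])
iqr_flags = (d18O < q1 - 1.5 * (q3 - q1)) | (d18O > q3 + 1.5 * (q3 - q1))

# MAD rule: robust z-score above 3 (0.6745 rescales the MAD to a normal SD)
med = np.median(d18O)
mad = np.median(np.abs(d18O - med))
mad_flags = np.abs(d18O - med) / (mad / 0.6745) > 3.0

print("IQR flags:", np.where(iqr_flags)[0], "MAD flags:", np.where(mad_flags)[0])
```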

  19. Statistical analyses to support guidelines for marine avian sampling. Final report

    USGS Publications Warehouse

    Kinlan, Brian P.; Zipkin, Elise; O'Connell, Allan F.; Caldow, Chris

    2012-01-01

    Interest in development of offshore renewable energy facilities has led to a need for high-quality, statistically robust information on marine wildlife distributions. A practical approach is described to estimate the amount of sampling effort required to have sufficient statistical power to identify species-specific “hotspots” and “coldspots” of marine bird abundance and occurrence in an offshore environment divided into discrete spatial units (e.g., lease blocks), where “hotspots” and “coldspots” are defined relative to a reference (e.g., regional) mean abundance and/or occurrence probability for each species of interest. For example, a location with average abundance or occurrence that is three times larger than the mean (3x effect size) could be defined as a “hotspot,” and a location that is three times smaller than the mean (1/3x effect size) as a “coldspot.” The choice of the effect size used to define hot and coldspots will generally depend on a combination of ecological and regulatory considerations. A method is also developed for testing the statistical significance of possible hotspots and coldspots. Both methods are illustrated with historical seabird survey data from the USGS Avian Compendium Database. Our approach consists of five main components: 1. A review of the primary scientific literature on statistical modeling of animal group size and avian count data to develop a candidate set of statistical distributions that have been used or may be useful to model seabird counts. 2. Statistical power curves for one-sample, one-tailed Monte Carlo significance tests of differences of observed small-sample means from a specified reference distribution. These curves show the power to detect "hotspots" or "coldspots" of occurrence and abundance at a range of effect sizes, given assumptions which we discuss. 3. A model selection procedure, based on maximum likelihood fits of models in the candidate set, to determine an appropriate statistical distribution to describe counts of a given species in a particular region and season. 4. Using a large database of historical at-sea seabird survey data, we applied this technique to identify appropriate statistical distributions for modeling a variety of species, allowing the distribution to vary by season. For each species and season, we used the selected distribution to calculate and map retrospective statistical power to detect hotspots and coldspots, and map p-values from Monte Carlo significance tests of hotspots and coldspots, in discrete lease blocks designated by the U.S. Department of Interior, Bureau of Ocean Energy Management (BOEM). 5. Because our definition of hotspots and coldspots does not explicitly include variability over time, we examine the relationship between the temporal scale of sampling and the proportion of variance captured in time series of key environmental correlates of marine bird abundance, as well as available marine bird abundance time series, and use these analyses to develop recommendations for the temporal distribution of sampling to adequately represent both short-term and long-term variability. We conclude by presenting a schematic “decision tree” showing how this power analysis approach would fit in a general framework for avian survey design, and discuss implications of model assumptions and results. We discuss avenues for future development of this work, and recommendations for practical implementation in the context of siting and wildlife assessment for offshore renewable energy development projects.
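
    A compact sketch of the power calculation in component 2, under assumed parameters: the power of a one-sample, one-tailed Monte Carlo test to detect a 3x hotspot when counts follow a negative binomial reference distribution. The mean, dispersion, and sample size are placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, k = 2.0, 0.5                  # reference mean and dispersion (assumed)
effect, n, alpha = 3.0, 10, 0.05  # 3x hotspot, 10 counts per block

def nb_draw(mean, size):
    # numpy parametrizes the negative binomial by (n, p); with n = k and
    # p = k / (k + mean) the draw has the desired mean and dispersion
    return rng.negative_binomial(k, k / (k + mean), size=size)

# Null distribution of the sample mean under the reference distribution
null_means = np.array([nb_draw(mu, n).mean() for _ in range(5000)])
crit = np.quantile(null_means, 1 - alpha)  # one-tailed critical value

# Power: how often a true hotspot (mean = 3*mu) exceeds the critical value
hot_means = np.array([nb_draw(effect * mu, n).mean() for _ in range(5000)])
print(f"estimated power: {(hot_means > crit).mean():.2f}")
```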

  20. Chemometrics-based process analytical technology (PAT) tools: applications and adaptation in pharmaceutical and biopharmaceutical industries.

    PubMed

    Challa, Shruthi; Potumarthi, Ravichandra

    2013-01-01

    Process analytical technology (PAT) is used to monitor and control critical process parameters in raw materials and in-process products to maintain critical quality attributes and build quality into the product. PAT can be successfully implemented in the pharmaceutical and biopharmaceutical industries not only to impart quality into products but also to prevent out-of-specification results and improve productivity. PAT implementation eliminates the drawbacks of traditional methods, which involve excessive sampling, and facilitates rapid testing through direct sampling without destruction of the sample. However, to successfully adapt PAT tools to pharmaceutical and biopharmaceutical environments, a thorough understanding of the process is needed, along with mathematical and statistical tools to analyze the large multidimensional spectral data that PAT tools generate. Chemometrics is a chemical discipline that incorporates both statistical and mathematical methods to obtain and analyze relevant information from PAT spectral tools. Applications of commonly used PAT tools in combination with appropriate chemometric methods, along with their advantages and working principles, are discussed. Finally, the systematic application of PAT tools in a biopharmaceutical environment to control critical process parameters for achieving product quality is diagrammatically represented.

  1. Alteration of histological gastritis after cure of Helicobacter pylori infection.

    PubMed

    Hojo, M; Miwa, H; Ohkusa, T; Ohkura, R; Kurosawa, A; Sato, N

    2002-11-01

    It is still disputed whether gastric atrophy or intestinal metaplasia improves after the cure of Helicobacter pylori infection. To clarify the histological changes after the cure of H. pylori infection through a literature survey. Fifty-one selected reports from 1066 relevant articles were reviewed. The extracted data were pooled according to histological parameters of gastritis based on the (updated) Sydney system. Activity improved more rapidly than inflammation. Eleven of 25 reports described significant improvement of atrophy. Atrophy was not improved in one of four studies with a large sample size (> 100 samples) and in two of five studies with a long follow-up period (> 12 months), suggesting that disagreement between the studies was not totally due to sample size or follow-up period. Methodological flaws, such as patient selection, and statistical analysis based on the assumption that atrophy improves continuously and generally in all patients might be responsible for the inconsistent results. Four of 28 studies described significant improvement of intestinal metaplasia [corrected]. Activity and inflammation were improved after the cure of H. pylori infection. Atrophy did not improve generally among all patients, but improved in certain patients. Improvement of intestinal metaplasia was difficult to analyse due to methodological problems including statistical power.

  2. Hierarchical distance-sampling models to estimate population size and habitat-specific abundance of an island endemic

    USGS Publications Warehouse

    Sillett, Scott T.; Chandler, Richard B.; Royle, J. Andrew; Kéry, Marc; Morrison, Scott A.

    2012-01-01

    Population size and habitat-specific abundance estimates are essential for conservation management. A major impediment to obtaining such estimates is that few statistical models are able to simultaneously account for both spatial variation in abundance and heterogeneity in detection probability, and still be amenable to large-scale applications. The hierarchical distance-sampling model of J. A. Royle, D. K. Dawson, and S. Bates provides a practical solution. Here, we extend this model to estimate habitat-specific abundance and rangewide population size of a bird species of management concern, the Island Scrub-Jay (Aphelocoma insularis), which occurs solely on Santa Cruz Island, California, USA. We surveyed 307 randomly selected, 300 m diameter, point locations throughout the 250-km^2 island during October 2008 and April 2009. Population size was estimated to be 2267 (95% CI 1613-3007) and 1705 (1212-2369) during the fall and spring respectively, considerably lower than a previously published but statistically problematic estimate of 12 500. This large discrepancy emphasizes the importance of proper survey design and analysis for obtaining reliable information for management decisions. Jays were most abundant in low-elevation chaparral habitat; the detection function depended primarily on the percent cover of chaparral and forest within count circles. Vegetation change on the island has been dramatic in recent decades, due to release from herbivory following the eradication of feral sheep (Ovis aries) from the majority of the island in the mid-1980s. We applied best-fit fall and spring models of habitat-specific jay abundance to a vegetation map from 1985, and estimated the population size of A. insularis was 1400-1500 at that time. The 20-30% increase in the jay population suggests that the species has benefited from the recovery of native vegetation since sheep removal. Nevertheless, this jay's tiny range and small population size make it vulnerable to natural disasters and to habitat alteration related to climate change. Our results demonstrate that hierarchical distance-sampling models hold promise for estimating population size and spatial density variation at large scales. Our statistical methods have been incorporated into the R package unmarked to facilitate their use by animal ecologists, and we provide annotated code in the Supplement.
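
    A loose, single-survey sketch of the distance-sampling machinery underlying the model (the actual hierarchical model jointly fits abundance covariates and detection across many points): a half-normal detection function fitted to point-count radial distances by maximum likelihood, with the raw count inflated by the estimated detection probability. The distances and the 150 m truncation radius are hypothetical.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
W = 150.0  # truncation radius (m), assumed
distances = rng.uniform(0, W, 60) * rng.uniform(0.4, 1.0, 60)  # fake detections

def neg_log_lik(sigma):
    # point-transect density of detection distances: f(r) proportional to r*g(r)
    g = lambda r: np.exp(-r**2 / (2 * sigma**2))
    norm, _ = quad(lambda r: r * g(r), 0, W)
    return -np.sum(np.log(distances * g(distances) / norm))

sigma_hat = minimize_scalar(neg_log_lik, bounds=(1.0, 500.0), method="bounded").x

# Average detection probability over the circle, then inflate the raw count
p_bar, _ = quad(lambda r: (2 * r / W**2) * np.exp(-r**2 / (2 * sigma_hat**2)), 0, W)
print(f"sigma = {sigma_hat:.0f} m, p = {p_bar:.2f}, N_hat = {distances.size / p_bar:.0f}")
```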

  3. Ion Beam Analyses Of Bark And Wood In Environmental Studies

    NASA Astrophysics Data System (ADS)

    Lill, J.-O.; Saarela, K.-E.; Harju, L.; Rajander, J.; Lindroos, A.; Heselius, S.-J.

    2011-06-01

    A large number of wood and bark samples have been analysed utilizing particle-induced X-ray emission (PIXE) and particle-induced gamma-ray emission (PIGE) techniques. Samples of common tree species like Scots Pine, Norway Spruce and birch were collected from a large number of sites in Southern and Southwestern Finland. Some of the samples were from a heavily polluted area in the vicinity of a copper-nickel smelter. The samples were dry ashed at 550 °C for the removal of the organic matrix in order to increase the analytical sensitivity of the method. The sensitivity was enhanced by a factor of 50 for wood and slightly less for bark. The ashed samples were pressed into pellets and irradiated as thick targets with a millimetre-sized proton beam. By including the ashing procedure in the method, the statistical dispersion due to elemental heterogeneities in wood material could be reduced. As a by-product, information about the elemental composition of ashes was obtained. By comparing the concentration of an element in bark ash with its concentration in wood ash of the same tree, information useful from an environmental point of view was obtained. This bark-to-wood ash ratio was used to distinguish between elemental contributions from anthropogenic atmospheric sources and natural geochemical sources, such as soil and bedrock.

  4. Construct validity of the Groningen Frailty Indicator established in a large sample of home-dwelling elderly persons: Evidence of stability across age and gender.

    PubMed

    Peters, L L; Boter, H; Burgerhof, J G M; Slaets, J P J; Buskens, E

    2015-09-01

    The primary objective of the present study was to evaluate the validity of the Groningen Frailty Indicator (GFI) in a sample of Dutch elderly persons participating in LifeLines, a large population-based cohort study. Additional aims were to assess differences between frail and non-frail elderly and examine which individual characteristics were associated with frailty. By December 2012, 5712 elderly persons were enrolled in LifeLines and complied with the inclusion criteria of the present study. Mann-Whitney U or Kruskal-Wallis tests were used to assess the variability of GFI scores among elderly subgroups that differed in demographic characteristics, morbidity, obesity, and healthcare utilization. Within subgroups, Kruskal-Wallis tests were also used to examine differences in GFI scores across age groups. Multivariate logistic regression analyses were performed to assess associations between individual characteristics and frailty. The GFI discriminated between subgroups: statistically significantly higher median GFI scores (interquartile range) were found, e.g., in males (1 [0-2]), the oldest old (2 [1-3]), in elderly who were single (1 [0-2]), with lower socio-economic status (1 [0-3]), with increasing co-morbidity (2 [1-3]), who were obese (2 [1-3]), and who used more healthcare (2 [1-4]). Overall, age had an independent and statistically significant association with GFI scores. Compared with the non-frail, frail elderly persons experienced statistically significantly more chronic stress and more social and psychological problems. In the multivariate logistic regression model, psychological morbidity had the strongest association with frailty. The present study supports the construct validity of the GFI and provides insight into the characteristics of (non)frail community-dwelling elderly persons participating in LifeLines. Copyright © 2015 Elsevier Inc. All rights reserved.

  5. Negative affect is associated with alcohol, but not cigarette use in heavy drinking smokers.

    PubMed

    Bujarski, Spencer; Ray, Lara A

    2014-12-01

    Co-use of alcohol and cigarettes is highly prevalent, and heavy drinking smokers represent a large and difficult-to-treat subgroup of smokers. Negative affect, including anxiety and depressive symptomatology, has been associated with both cigarette and alcohol use independently, but less is known about the role of negative affect in heavy drinking smokers. Furthermore, while some studies have shown negative affect to precede substance use, a precise biobehavioral mechanism has not been established. The aims of the present study were twofold: first, to test whether negative affect is associated with alcohol and cigarette use in a large community sample of heavy drinking smokers (n=461); and second, to examine craving as a plausible statistical mediator of the association between negative affect and alcohol and/or cigarette use. Hypothesis testing was conducted using a structural equation modeling approach with cross-sectional data. Analysis revealed a significant main effect of negative affect on alcohol use (β=0.210, p<0.05), but not cigarette use (β=0.131, p>0.10) in this sample. Mediational analysis revealed that alcohol craving was a full statistical mediator of this association (p<0.05), such that there was no direct association between negative affect and alcohol use after accounting for alcohol craving. These results are consistent with negative reinforcement and relief-craving models of alcohol use insofar as the experience of negative affect was associated with increased alcohol use, and the relationship was statistically mediated by alcohol craving, presumably reflecting drinking to alleviate negative affect. Further longitudinal or experimental studies are warranted to enhance the causal inferences of this mediated effect. Copyright © 2014 Elsevier Ltd. All rights reserved.
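
    An illustrative product-of-coefficients mediation check with simulated data; the study used full structural equation modeling, so this simplified OLS version only conveys the logic of craving mediating the negative affect-drinking association. All variable names and effect sizes are invented.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 461
neg_affect = rng.normal(size=n)
craving = 0.5 * neg_affect + rng.normal(size=n)  # "a" path
drinks = 0.4 * craving + rng.normal(size=n)      # "b" path; no direct effect

# a path: negative affect -> craving
a = sm.OLS(craving, sm.add_constant(neg_affect)).fit().params[1]
# b path and direct effect c': drinks ~ craving + negative affect
model_b = sm.OLS(drinks, sm.add_constant(np.column_stack([craving, neg_affect]))).fit()
b, c_prime = model_b.params[1], model_b.params[2]

print(f"indirect effect a*b = {a * b:.3f}, direct effect c' = {c_prime:.3f}")
```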

  6. Appraising into the Sun: Six-State Solar Home Paired-Sale Analysis

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lawrence Berkeley National Laboratory

    Although residential solar photovoltaic (PV) installations have proliferated, PV systems on some U.S. homes still receive no value during an appraisal because comparable home sales are lacking. To value residential PV, some previous studies have employed paired-sales appraisal methods to analyze small PV home samples in depth, while others have used statistical methods to analyze large samples. Our first-of-its-kind study connects the two approaches. It uses appraisal methods to evaluate sales price premiums for owned PV systems on single-unit detached houses that were also evaluated in a large statistical study. Independent appraisers evaluated 43 recent home sales pairs in six states: California, Oregon, Florida, Maryland, North Carolina, and Pennsylvania. We compare these results with contributory-value estimates—based on income (using the PV Value® tool), gross cost, and net cost—as well as hedonic modeling results from the recent statistical study. The results provide strong, appraisal-based evidence of PV premiums in all states. More importantly, the results support the use of cost- and income-based PV premium estimates when paired-sales analysis is impossible. PV premiums from the paired-sales analysis are most similar to net PV cost estimates. PV Value® income results generally track the appraised premiums, although conservatively. The appraised premiums are in agreement with the hedonic modeling results as well, which bolsters the suitability of both approaches for estimating PV home premiums. Therefore, these results will benefit valuation professionals and mortgage lenders who increasingly are encountering homes equipped with PV and need to understand the factors that can both contribute to and detract from market value.

  7. Multiple regression and Artificial Neural Network for long-term rainfall forecasting using large scale climate modes

    NASA Astrophysics Data System (ADS)

    Mekanik, F.; Imteaz, M. A.; Gato-Trinidad, S.; Elmahdi, A.

    2013-10-01

    In this study, the application of Artificial Neural Networks (ANN) and Multiple regression analysis (MR) to forecast long-term seasonal spring rainfall in Victoria, Australia was investigated using lagged El Nino Southern Oscillation (ENSO) and Indian Ocean Dipole (IOD) as potential predictors. The use of dual (combined lagged ENSO-IOD) input sets for calibrating and validating ANN and MR Models is proposed to investigate the simultaneous effect of past values of these two major climate modes on long-term spring rainfall prediction. The MR models that did not violate the limits of statistical significance and multicollinearity were selected for future spring rainfall forecast. The ANN was developed in the form of multilayer perceptron using Levenberg-Marquardt algorithm. Both MR and ANN modelling were assessed statistically using mean square error (MSE), mean absolute error (MAE), Pearson correlation (r) and Willmott index of agreement (d). The developed MR and ANN models were tested on out-of-sample test sets; the MR models showed very poor generalisation ability for east Victoria with correlation coefficients of -0.99 to -0.90 compared to ANN with correlation coefficients of 0.42-0.93; ANN models also showed better generalisation ability for central and west Victoria with correlation coefficients of 0.68-0.85 and 0.58-0.97 respectively. The ability of multiple regression models to forecast out-of-sample sets is comparable to that of ANN for Daylesford in central Victoria and Kaniva in west Victoria (r = 0.92 and 0.67 respectively). The errors of the testing sets for ANN models are generally lower compared to multiple regression models. The statistical analysis suggests the potential of ANN over MR models for rainfall forecasting using large scale climate modes.
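
    A toy contrast of the two approaches with synthetic data: multiple regression vs. a small multilayer perceptron, both driven by lagged climate indices. scikit-learn's MLPRegressor stands in for the Levenberg-Marquardt network used by the authors, and the nonlinear rainfall relation is invented.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(11)
n = 120
enso_lag, iod_lag = rng.normal(size=n), rng.normal(size=n)
rain = 40 + 8 * np.tanh(enso_lag) - 5 * iod_lag + rng.normal(0, 3, n)  # nonlinear truth

X = np.column_stack([enso_lag, iod_lag])
train, test = slice(0, 90), slice(90, None)  # hold out the last 30 seasons

mr = LinearRegression().fit(X[train], rain[train])
ann = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000,
                   random_state=0).fit(X[train], rain[train])

for name, model in [("MR", mr), ("ANN", ann)]:
    r = np.corrcoef(model.predict(X[test]), rain[test])[0, 1]
    print(f"{name}: out-of-sample r = {r:.2f}")
```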

  8. H2O Megamasers toward Radio-bright Seyfert 2 Nuclei

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Zhang, J. S.; Liu, Z. W.; Henkel, C.

    2017-02-20

    Using the Effelsberg 100-m telescope, we perform a successful pilot survey of H2O maser emission toward a small sample of radio-bright Seyfert 2 galaxies with redshifts larger than 0.04. The targets were selected from a large Seyfert 2 sample derived from the spectroscopic Sloan Digital Sky Survey Data Release 7 (SDSS-DR7). One source, SDSS J102802.9+104630.4 (z ∼ 0.0448), was detected four times during our observations, with a typical maser flux density of ∼30 mJy and a corresponding (very large) luminosity of ∼1135 L⊙. The successful detection of this radio-bright Seyfert 2 and an additional tentative detection support our previous statistical results that H2O megamasers tend to arise from Seyfert 2 galaxies with large radio luminosity. The finding provides further motivation for an upcoming larger H2O megamaser survey toward Seyfert 2s with particularly radio-bright nuclei, with the basic goal of improving our understanding of the nuclear environment of active megamaser host galaxies.

  9. Partial Least Squares Regression Can Aid in Detecting Differential Abundance of Multiple Features in Sets of Metagenomic Samples

    PubMed Central

    Libiger, Ondrej; Schork, Nicholas J.

    2015-01-01

    It is now feasible to examine the composition and diversity of microbial communities (i.e., “microbiomes”) that populate different human organs and orifices using DNA sequencing and related technologies. To explore the potential links between changes in microbial communities and various diseases in the human body, it is essential to test associations involving different species within and across microbiomes, environmental settings and disease states. Although a number of statistical techniques exist for carrying out relevant analyses, it is unclear which of these techniques exhibit the greatest statistical power to detect associations given the complexity of most microbiome datasets. We compared the statistical power of principal component regression, partial least squares regression, regularized regression, distance-based regression, Hill's diversity measures, and a modified test implemented in the popular and widely used microbiome analysis methodology “Metastats” across a wide range of simulated scenarios involving changes in feature abundance between two sets of metagenomic samples. For this purpose, simulation studies were used to change the abundance of microbial species in a real dataset from a published study examining human hands. Each technique was applied to the same data, and its ability to detect the simulated change in abundance was assessed. We hypothesized that a small subset of methods would outperform the rest in terms of the statistical power. Indeed, we found that the Metastats technique modified to accommodate multivariate analysis and partial least squares regression yielded high power under the models and data sets we studied. The statistical power of diversity measure-based tests, distance-based regression and regularized regression was significantly lower. Our results provide insight into powerful analysis strategies that utilize information on species counts from large microbiome data sets exhibiting skewed frequency distributions obtained on a small to moderate number of samples. PMID:26734061
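
    A bare-bones version of the PLS idea evaluated above, on simulated counts with one truly shifted species: regress the group label on species abundances and rank features by the magnitude of their PLS coefficients. Group sizes, species counts, and the effect are all assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(5)
n_per_group, n_species = 20, 50
counts = rng.poisson(10, size=(2 * n_per_group, n_species)).astype(float)
counts[n_per_group:, 3] *= 2.5          # species 3 is truly shifted in group 2
y = np.repeat([0.0, 1.0], n_per_group)  # group labels

pls = PLSRegression(n_components=2).fit(counts, y)
top = np.argsort(np.abs(pls.coef_.ravel()))[::-1][:5]
print("species ranked by |PLS coefficient|:", top)
```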

  10. Nanoengineered capsules for selective SERS analysis of biological samples

    NASA Astrophysics Data System (ADS)

    You, Yil-Hwan; Schechinger, Monika; Locke, Andrea; Coté, Gerard; McShane, Mike

    2018-02-01

    Metal nanoparticles conjugated with DNA oligomers have been intensively studied for a variety of applications, including optical diagnostics. Assays based on aggregation of DNA-coated particles in proportion to the concentration of target analyte have not been widely adopted for clinical analysis, however, largely due to the nonspecific responses observed in complex biofluids. While sample pre-treatment such as dialysis is helpful for enabling selective sensing, here we sought to prove that assay encapsulation in hollow microcapsules could remove this requirement and thereby facilitate more rapid analysis of complex samples. Gold nanoparticle-based assays were incorporated into capsules composed of polyelectrolyte multilayers (PEMs), and the responses to small-molecule targets and larger proteins were compared. Gold nanoparticles were able to selectively sense small Raman dyes (Rhodamine 6G) in the presence of large protein molecules (BSA) when encapsulated. A ratiometric microRNA-17 sensing assay exhibited a drastic reduction in response after encapsulation, with statistically significant relative Raman intensity changes only at a microRNA-17 concentration of 10 nM, compared to a range of 0-500 nM for the corresponding solution-phase response.

  11. A large-scale study of the ultrawideband microwave dielectric properties of normal breast tissue obtained from reduction surgeries.

    PubMed

    Lazebnik, Mariya; McCartney, Leah; Popovic, Dijana; Watkins, Cynthia B; Lindstrom, Mary J; Harter, Josephine; Sewall, Sarah; Magliocco, Anthony; Booske, John H; Okoniewski, Michal; Hagness, Susan C

    2007-05-21

    The efficacy of emerging microwave breast cancer detection and treatment techniques will depend, in part, on the dielectric properties of normal breast tissue. However, knowledge of these properties at microwave frequencies has been limited due to gaps and discrepancies in previously reported small-scale studies. To address these issues, we experimentally characterized the wideband microwave-frequency dielectric properties of a large number of normal breast tissue samples obtained from breast reduction surgeries at the University of Wisconsin and University of Calgary hospitals. The dielectric spectroscopy measurements were conducted from 0.5 to 20 GHz using a precision open-ended coaxial probe. The tissue composition within the probe's sensing region was quantified in terms of percentages of adipose, fibroconnective and glandular tissues. We fit a one-pole Cole-Cole model to the complex permittivity data set obtained for each sample and determined median Cole-Cole parameters for three groups of normal breast tissues, categorized by adipose tissue content (0-30%, 31-84% and 85-100%). Our analysis of the dielectric properties data for 354 tissue samples reveals that there is a large variation in the dielectric properties of normal breast tissue due to substantial tissue heterogeneity. We observed no statistically significant difference between the within-patient and between-patient variability in the dielectric properties.
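
    A sketch of the one-pole Cole-Cole summary fitted to each sample's complex permittivity, with a least-squares fit to synthetic data; the parameter values below are placeholders, not the paper's medians.

```python
import numpy as np
from scipy.optimize import least_squares

EPS0 = 8.854e-12  # vacuum permittivity, F/m

def cole_cole(f, eps_inf, d_eps, tau, alpha, sigma_s):
    # one-pole Cole-Cole model with a static-conductivity term
    w = 2 * np.pi * f
    return (eps_inf + d_eps / (1 + (1j * w * tau) ** (1 - alpha))
            + sigma_s / (1j * w * EPS0))

f = np.logspace(np.log10(0.5e9), np.log10(20e9), 50)  # 0.5-20 GHz
true_params = (7.0, 43.0, 10e-12, 0.1, 0.7)           # placeholder values
data = cole_cole(f, *true_params) + 0.2 * np.random.default_rng(2).normal(size=f.size)

def residuals(p):
    diff = cole_cole(f, *p) - data
    return np.concatenate([diff.real, diff.imag])  # least_squares needs real values

fit = least_squares(residuals, x0=(5.0, 30.0, 15e-12, 0.05, 0.5),
                    bounds=([1, 0, 1e-13, 0, 0], [20, 100, 1e-10, 1, 5]),
                    x_scale=[1.0, 10.0, 1e-12, 0.1, 0.5])
print("fitted (eps_inf, d_eps, tau, alpha, sigma_s):", fit.x)
```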

  12. Estimating Animal Abundance in Ground Beef Batches Assayed with Molecular Markers

    PubMed Central

    Hu, Xin-Sheng; Simila, Janika; Platz, Sindey Schueler; Moore, Stephen S.; Plastow, Graham; Meghen, Ciaran N.

    2012-01-01

    Estimating animal abundance in industrial scale batches of ground meat is important for mapping meat products through the manufacturing process and for effectively tracing the finished product during a food safety recall. The processing of ground beef involves a potentially large number of animals from diverse sources in a single product batch, which produces a high heterogeneity in capture probability. In order to estimate animal abundance through DNA profiling of ground beef constituents, two parameter-based statistical models were developed for incidence data. Simulations were applied to evaluate the maximum likelihood estimate (MLE) of a joint likelihood function from multiple surveys, showing superiority in the presence of high capture heterogeneity with small sample sizes, or comparable estimation in the presence of low capture heterogeneity with a large sample size, when compared to other existing models. Our model employs the full information on the pattern of the capture-recapture frequencies from multiple samples. We applied the proposed models to estimate animal abundance in six manufacturing beef batches, genotyped using 30 single nucleotide polymorphism (SNP) markers, from a large scale beef grinding facility. Results show that between 411 and 1367 animals were present in the six manufacturing beef batches. These estimates are informative as a reference for improving recall processes and tracing finished meat products back to source. PMID:22479559
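
    A greatly simplified two-survey illustration of the capture-recapture logic, using the Chapman-corrected Lincoln-Petersen estimator; the paper's models pool many surveys and allow heterogeneous capture probabilities, and the counts below are hypothetical.

```python
def chapman_estimate(n1, n2, m):
    """n1, n2: animals genotyped in surveys 1 and 2; m: animals seen in both."""
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1

# Hypothetical genotyping counts for one manufacturing batch
print(f"estimated animals in batch: {chapman_estimate(180, 200, 45):.0f}")
```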

  13. Learning maximum entropy models from finite-size data sets: A fast data-driven algorithm allows sampling from the posterior distribution.

    PubMed

    Ferrari, Ulisse

    2016-08-01

    Maximum entropy models provide the least constrained probability distributions that reproduce statistical properties of experimental datasets. In this work we characterize the learning dynamics that maximizes the log-likelihood in the case of large but finite datasets. We first show how the steepest descent dynamics is not optimal as it is slowed down by the inhomogeneous curvature of the model parameters' space. We then provide a way for rectifying this space which relies only on dataset properties and does not require large computational efforts. We conclude by solving the long-time limit of the parameters' dynamics including the randomness generated by the systematic use of Gibbs sampling. In this stochastic framework, rather than converging to a fixed point, the dynamics reaches a stationary distribution, which for the rectified dynamics reproduces the posterior distribution of the parameters. We sum up all these insights in a "rectified" data-driven algorithm that is fast and by sampling from the parameters' posterior avoids both under- and overfitting along all the directions of the parameters' space. Through the learning of pairwise Ising models from the recording of a large population of retina neurons, we show how our algorithm outperforms the steepest descent method.
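
    A tiny exact-enumeration version of the learning problem, for intuition only: gradient ascent on the log-likelihood of a pairwise maximum entropy (Ising) model over five binary units, matching model moments to data moments. Real applications replace the exact sums with Gibbs sampling, which is where the rectification and posterior-sampling issues discussed above arise; the data here are random placeholders.

```python
import itertools
import numpy as np

rng = np.random.default_rng(8)
n = 5
data = rng.integers(0, 2, size=(1000, n)).astype(float)  # stand-in binary words
emp_mean = data.mean(axis=0)
emp_corr = data.T @ data / len(data)

states = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
h = np.zeros(n)
J = np.zeros((n, n))

for _ in range(2000):
    energy = states @ h + 0.5 * np.einsum("si,ij,sj->s", states, J, states)
    p = np.exp(energy - energy.max())
    p /= p.sum()
    model_mean = p @ states
    model_corr = (states * p[:, None]).T @ states
    h += 0.1 * (emp_mean - model_mean)  # steepest-ascent moment matching
    J += 0.1 * (emp_corr - model_corr)
    np.fill_diagonal(J, 0.0)            # keep single-unit terms in h only

print("max moment mismatch:", np.abs(emp_mean - model_mean).max())
```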

  14. Identifying sources of groundwater nitrate contamination in a large alluvial groundwater basin with highly diversified intensive agricultural production

    NASA Astrophysics Data System (ADS)

    Lockhart, K. M.; King, A. M.; Harter, T.

    2013-08-01

    Groundwater quality is a concern in alluvial aquifers underlying agricultural areas worldwide. Nitrate from land-applied fertilizers or from animal waste can leach to groundwater and contaminate drinking water resources. The San Joaquin Valley, California, is an example of an agricultural landscape with a large diversity of field, vegetable, tree, nut, and citrus crops, but also confined animal feeding operations (CAFOs, here mostly dairies) that generate, store, and land apply large amounts of liquid manure. As in other such regions around the world, the rural population in the San Joaquin Valley relies almost exclusively on shallow domestic wells (≤ 150 m deep), of which many have been affected by nitrate. Variability in crops, soil type, and depth to groundwater contributes to large variability in nitrate occurrence across the underlying aquifer system. The role of these factors in controlling groundwater nitrate contamination levels is examined. Two hundred domestic wells were sampled in two sub-regions of the San Joaquin Valley, Stanislaus and Merced (Stan/Mer) and Tulare and Kings (Tul/Kings) Counties. Forty-six percent of well water samples in Tul/Kings and 42% of well water samples in Stan/Mer exceeded the MCL for nitrate (10 mg/L NO3-N). For statistical analysis of nitrate contamination, 78 crop and landuse types were considered by grouping them into ten categories (CAFO, citrus, deciduous fruits and nuts, field crops, forage, native, pasture, truck crops, urban, and vineyards). Vadose zone thickness, soil type, well construction information, well proximity to dairies, and dominant landuse near the well were considered. In the Stan/Mer area, elevated nitrate levels in domestic wells most strongly correlate with the combination of a very shallow (≤ 21 m) water table and the presence of either CAFO-derived animal waste applications or deciduous fruit and nut crops (synthetic fertilizer applications). In Tulare County, statistical data indicate that elevated nitrate levels in domestic well water are most strongly associated with citrus orchards when located in areas with a very shallow (≤ 21 m) water table. Kings County had relatively few nitrate MCL exceedances in domestic wells, probably due to the deeper water table there.

  15. Phylogenetic signal in the acoustic parameters of the advertisement calls of four clades of anurans.

    PubMed

    Gingras, Bruno; Mohandesan, Elmira; Boko, Drasko; Fitch, W Tecumseh

    2013-07-01

    Anuran vocalizations, especially their advertisement calls, are largely species-specific and can be used to identify taxonomic affiliations. Because anurans are not vocal learners, their vocalizations are generally assumed to have a strong genetic component. This suggests that the degree of similarity between advertisement calls may be related to large-scale phylogenetic relationships. To test this hypothesis, advertisement calls from 90 species belonging to four large clades (Bufo, Hylinae, Leptodactylus, and Rana) were analyzed. Phylogenetic distances were estimated based on the DNA sequences of the 12S mitochondrial ribosomal RNA gene, and, for a subset of 49 species, on the rhodopsin gene. Mean values for five acoustic parameters (coefficient of variation of root-mean-square amplitude, dominant frequency, spectral flux, spectral irregularity, and spectral flatness) were computed for each species. We then tested for phylogenetic signal on the body-size-corrected residuals of these five parameters, using three statistical tests (Moran's I, Mantel, and Blomberg's K) and three models of genetic distance (pairwise distances, Abouheif's proximities, and the variance-covariance matrix derived from the phylogenetic tree). A significant phylogenetic signal was detected for most acoustic parameters on the 12S dataset, across statistical tests and genetic distance models, both for the entire sample of 90 species and within clades in several cases. A further analysis on a subset of 49 species using genetic distances derived from rhodopsin and from 12S broadly confirmed the results obtained on the larger sample, indicating that the phylogenetic signals observed in these acoustic parameters can be detected using a variety of genetic distance models derived either from a variable mitochondrial sequence or from a conserved nuclear gene. We found a robust relationship, in a large number of species, between anuran phylogenetic relatedness and acoustic similarity in the advertisement calls in a taxon with no evidence for vocal learning, even after correcting for the effect of body size. This finding, covering a broad sample of species whose vocalizations are fairly diverse, indicates that the intense selection on certain call characteristics observed in many anurans does not eliminate all acoustic indicators of relatedness. Our approach could potentially be applied to other vocal taxa.
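
    A compact permutation version of one of the tests used above (the Mantel test), on simulated distance matrices: correlate acoustic and genetic distances and build the null by jointly permuting the rows and columns of one matrix. All distances are synthetic.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 30
genetic = np.abs(rng.normal(size=(n, 1)) - rng.normal(size=(1, n)))
acoustic = genetic + 0.5 * np.abs(rng.normal(size=(n, n)))
genetic = (genetic + genetic.T) / 2    # symmetrize both distance matrices
acoustic = (acoustic + acoustic.T) / 2

iu = np.triu_indices(n, k=1)  # off-diagonal upper-triangle entries only

def mantel_r(a, b):
    return np.corrcoef(a[iu], b[iu])[0, 1]

observed = mantel_r(genetic, acoustic)
perm_r = []
for _ in range(999):
    order = rng.permutation(n)
    perm_r.append(mantel_r(genetic[order][:, order], acoustic))
p = (1 + sum(r >= observed for r in perm_r)) / 1000
print(f"Mantel r = {observed:.2f}, p = {p:.3f}")
```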

  16. Comparison of statistical models for writer verification

    NASA Astrophysics Data System (ADS)

    Srihari, Sargur; Ball, Gregory R.

    2009-01-01

    A novel statistical model is proposed for determining whether a pair of documents, a known and a questioned, were written by the same individual. The goal of this formulation is to learn the specific uniqueness of style in a particular author's writing, given the known document. Since there are often insufficient samples to extrapolate a generalized model of a writer's handwriting based solely on the document, we instead generalize over the differences between the author and a large population of known different writers. This contrasts with an earlier model in which probability distributions were specified a priori, without learning. We show the performance of the model and compare it to the older, non-learning model; the new model shows significant improvement.

  17. Methods for planning a statistical POD study

    NASA Astrophysics Data System (ADS)

    Koh, Y.-M.; Meeker, W. Q.

    2013-01-01

    The most common question asked of a statistician is "How large should my sample be?" In NDE applications, the most common questions asked of a statistician are "How many specimens do I need and what should the distribution of flaw sizes be?" Although some useful general guidelines exist (e.g., in MIL-HDBK-1823), it is possible to use statistical tools to provide more definitive guidelines and to allow comparison among different proposed study plans. One can assess the performance of a proposed POD study plan by obtaining computable expressions for estimation precision. This allows for a quick and easy assessment of tradeoffs and comparison of various alternative plans. We use a signal-response dataset obtained from MIL-HDBK-1823 to illustrate the ideas.
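
    One way to obtain such precision estimates is Monte Carlo simulation of the proposed plan. The hedged Python sketch below assumes a true logistic POD curve, simulates hit/miss outcomes for a candidate set of flaw sizes, and summarizes the spread of the estimated a90 (the flaw size detected with 90% probability); all parameter values are illustrative assumptions, not taken from the handbook.

        import numpy as np

        rng = np.random.default_rng(1)
        b0, b1 = -6.0, 4.0                      # assumed "true" logistic POD parameters
        flaw_sizes = np.linspace(0.5, 3.0, 60)  # candidate specimen plan: 60 flaws

        def fit_logistic(x, y, iters=300, lr=0.5):
            """Crude logistic regression via gradient ascent (didactic, not production)."""
            w = np.zeros(2)
            X = np.column_stack([np.ones_like(x), x])
            for _ in range(iters):
                p = 1.0 / (1.0 + np.exp(-X @ w))
                w += lr * X.T @ (y - p) / len(x)
            return w

        a90 = []
        for _ in range(500):                    # Monte Carlo over hypothetical studies
            p_true = 1.0 / (1.0 + np.exp(-(b0 + b1 * flaw_sizes)))
            hits = (rng.random(60) < p_true).astype(float)
            w0, w1 = fit_logistic(flaw_sizes, hits)
            a90.append((np.log(9.0) - w0) / w1) # logit(0.9) = log(9)
        print("a90 standard deviation:", np.std(a90))  # precision of this plan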

  18. The Abnormal vs. Normal ECG Classification Based on Key Features and Statistical Learning

    NASA Astrophysics Data System (ADS)

    Dong, Jun; Tong, Jia-Fei; Liu, Xia

    As cardiovascular diseases have become frequent in modern society, medical and health systems must be adjusted to meet the new requirements. The Chinese government has planned to establish a basic community medical insurance system (BCMIS) before 2020, in which remote medical service is one of the core issues. We have therefore developed a "remote network hospital system" comprising a data server and diagnosis terminals, with a wireless detector that samples the ECG. To improve the efficiency of ECG processing, this paper presents an abnormal vs. normal ECG classification approach based on key features and statistical learning, and analyzes the results. Large numbers of normal ECGs can be filtered out automatically by computer, leaving the abnormal ECGs to be diagnosed by physicians.

  19. A critique of Rasch residual fit statistics.

    PubMed

    Karabatsos, G

    2000-01-01

    In test analysis involving the Rasch model, a large degree of importance is placed on the "objective" measurement of individual abilities and item difficulties. The degree to which the objectivity properties are attained, of course, depends on the degree to which the data fit the Rasch model. It is therefore important to utilize fit statistics that accurately and reliably detect the person-item response inconsistencies that threaten the measurement objectivity of persons and items. Given this argument, it is somewhat surprising that far more emphasis is placed on the objective measurement of persons and items than on the measurement quality of Rasch fit statistics. This paper provides a critical analysis of the residual fit statistics of the Rasch model, arguably the most often used fit statistics, in an effort to illustrate that the task of Rasch fit analysis is not as simple and straightforward as it appears to be. The faulty statistical properties of the residual fit statistics allow neither a convenient nor a straightforward approach to Rasch fit analysis. For instance, given a residual fit statistic, the use of a single minimum critical value for misfit diagnosis across different testing situations, where the situations vary in sample and test properties, leads to both the overdetection and the underdetection of misfit. To improve this situation, it is argued that psychometricians need to implement residual-free Rasch fit statistics that are based on the number of Guttman response errors, or use indices that are statistically optimal in detecting measurement disturbances.
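
    For reference, the residual fit statistics under critique are simple to compute for a dichotomous Rasch model. The sketch below computes the standard unweighted (outfit) and information-weighted (infit) mean-square statistics per item from simulated data; the abilities and difficulties are hypothetical.

        import numpy as np

        rng = np.random.default_rng(2)
        theta = rng.normal(0, 1, 300)           # hypothetical person abilities
        beta = np.linspace(-2, 2, 10)           # hypothetical item difficulties
        P = 1.0 / (1.0 + np.exp(-(theta[:, None] - beta[None, :])))  # Rasch P(x=1)
        X = (rng.random(P.shape) < P).astype(float)                  # responses

        W = P * (1 - P)                         # binomial response variances
        Z2 = (X - P) ** 2 / W                   # squared standardized residuals
        outfit = Z2.mean(axis=0)                       # unweighted mean square
        infit = (W * Z2).sum(axis=0) / W.sum(axis=0)   # weighted mean square
        print(np.round(outfit, 2)); print(np.round(infit, 2))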

  20. Simulating the complex output of rainfall and hydrological processes using the information contained in large data sets: the Direct Sampling approach.

    NASA Astrophysics Data System (ADS)

    Oriani, Fabio

    2017-04-01

    The unpredictable nature of rainfall makes its estimation as difficult as it is essential to hydrological applications. Stochastic simulation is often considered a convenient approach to assess the uncertainty of rainfall processes, but preserving their irregular behavior and variability at multiple scales is a challenge even for the most advanced techniques. In this presentation, an overview of the Direct Sampling technique [1] and its recent application to rainfall and hydrological data simulation [2, 3] is given. The algorithm, having its roots in multiple-point statistics, makes use of a training data set to simulate the outcome of a process without inferring any explicit probability measure: the data are simulated in time or space by sampling the training data set where a sufficiently similar group of neighbor data exists. This approach allows preserving complex statistical dependencies at different scales with a good approximation, while reducing the parameterization to the minimum. The strengths and weaknesses of the Direct Sampling approach are shown through a series of applications to rainfall and hydrological data: from time-series simulation to spatial rainfall fields conditioned by elevation or a climate scenario. In the era of vast databases, is this data-driven approach a valid alternative to parametric simulation techniques? [1] Mariethoz G., Renard P., and Straubhaar J. (2010), The Direct Sampling method to perform multiple-point geostatistical simulations, Water Resour. Res., 46(11), http://dx.doi.org/10.1029/2008WR007621 [2] Oriani F., Straubhaar J., Renard P., and Mariethoz G. (2014), Simulation of rainfall time series from different climatic regions using the direct sampling technique, Hydrol. Earth Syst. Sci., 18, 3015-3031, http://dx.doi.org/10.5194/hess-18-3015-2014 [3] Oriani F., Borghi A., Straubhaar J., Mariethoz G., Renard P. (2016), Missing data simulation inside flow rate time-series using multiple-point statistics, Environ. Model. Softw., vol. 86, pp. 264-276, http://dx.doi.org/10.1016/j.envsoft.2016.10.002
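
    The core idea can be caricatured in a few lines. The sketch below is a didactic, univariate reading of Direct Sampling, not the published algorithm: each new value is obtained by scanning a training series in random order for a location whose preceding neighborhood resembles the simulation's recent past, and copying the value that follows the first sufficiently close match.

        import numpy as np

        def direct_sampling_1d(train, n_sim, n_neigh=5, threshold=0.05, seed=None):
            """Didactic Direct Sampling sketch for a 1-D time series."""
            rng = np.random.default_rng(seed)
            sim = list(train[:n_neigh])              # seed with start of training data
            for _ in range(n_sim):
                pattern = np.array(sim[-n_neigh:])   # recent past of the simulation
                best = (np.inf, None)
                for i in rng.permutation(len(train) - n_neigh):
                    d = np.mean(np.abs(train[i:i + n_neigh] - pattern))
                    if d < best[0]:
                        best = (d, train[i + n_neigh])  # value following the match
                    if d <= threshold:               # accept first close-enough match
                        break
                sim.append(best[1])
            return np.array(sim)

        train = np.sin(np.linspace(0, 20 * np.pi, 2000))  # hypothetical training series
        train += 0.1 * np.random.default_rng(3).normal(size=2000)
        print(direct_sampling_1d(train, 100, seed=4)[:10])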

  1. 7 CFR 52.38c - Statistical sampling procedures for lot inspection of processed fruits and vegetables by attributes.

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... 7 Agriculture 2 2011-01-01 2011-01-01 false Statistical sampling procedures for lot inspection of processed fruits and vegetables by attributes. 52.38c Section 52.38c Agriculture Regulations of the... Regulations Governing Inspection and Certification Sampling § 52.38c Statistical sampling procedures for lot...

  2. 7 CFR 52.38b - Statistical sampling procedures for on-line inspection by attributes of processed fruits and...

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... 7 Agriculture 2 2011-01-01 2011-01-01 false Statistical sampling procedures for on-line inspection by attributes of processed fruits and vegetables. 52.38b Section 52.38b Agriculture Regulations of... Regulations Governing Inspection and Certification Sampling § 52.38b Statistical sampling procedures for on...

  3. 7 CFR 52.38b - Statistical sampling procedures for on-line inspection by attributes of processed fruits and...

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... 7 Agriculture 2 2010-01-01 2010-01-01 false Statistical sampling procedures for on-line inspection by attributes of processed fruits and vegetables. 52.38b Section 52.38b Agriculture Regulations of... Regulations Governing Inspection and Certification Sampling § 52.38b Statistical sampling procedures for on...

  4. 7 CFR 52.38c - Statistical sampling procedures for lot inspection of processed fruits and vegetables by attributes.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... 7 Agriculture 2 2010-01-01 2010-01-01 false Statistical sampling procedures for lot inspection of processed fruits and vegetables by attributes. 52.38c Section 52.38c Agriculture Regulations of the... Regulations Governing Inspection and Certification Sampling § 52.38c Statistical sampling procedures for lot...

  5. 75 FR 38871 - Proposed Collection; Comment Request for Revenue Procedure 2004-29

    Federal Register 2010, 2011, 2012, 2013, 2014

    2010-07-06

    ... comments concerning Revenue Procedure 2004-29, Statistical Sampling in Sec. 274 Context. DATES: Written... Internet, at [email protected] . SUPPLEMENTARY INFORMATION: Title: Statistical Sampling in Sec...: Revenue Procedure 2004-29 prescribes the statistical sampling methodology by which taxpayers under...

  6. Generic Difference Between Early and Late Stages of BATSE Gamma-Ray Bursts

    NASA Technical Reports Server (NTRS)

    Mitrofanov, Igor G.; Litvak, Maxim L.; Anfimov, Dimitrij S.; Sanin, Anton B.; Briggs, Michael S.; Paciesas, William S.; Pendleton, Geoffrey N.; Preece, Robert D.; Meegan, Charles A.

    2001-01-01

    The early and late stages of gamma-ray bursts are studied in a statistical analysis of the large sample of long BATSE events. The primary peak is used as the boundary between the early and late stages of emission. Significant differences are found between the stages: the early stage is shorter, it has harder emission, and it becomes a smaller fraction of the total burst duration for burst groups of decreasing intensity.

  7. General Differences between Early and Late Stages of BATSE Gamma-Ray Bursts

    NASA Technical Reports Server (NTRS)

    Mitrofanov, I. G.; Litvak, M. L.; Anfimov, D. S.; Sanin, A. B.; Briggs, M. S.; Paciesas, W. S.; Pendleton, G. N.; Preece, R. D.; Meegan, C. A.; Rose, M. Franklin (Technical Monitor)

    2000-01-01

    The early and late stages of gamma-ray bursts are studied in a statistical analysis of the large sample of long BATSE events. The primary peak is used as the boundary between the early and late stages of emission. Significant differences are found between the stages: the early stage is shorter, it has harder emission, and it becomes a smaller fraction of the total burst duration for burst groups of decreasing intensity.

  8. HLA Haplotype Frequency Estimation from Real-Life Data with the Hapl-o-Mat Software.

    PubMed

    Sauter, Jürgen; Schäfer, Christian; Schmidt, Alexander H

    2018-01-01

    HLA haplotype frequencies are of use in a variety of settings. Such data are typically derived either from family pedigree data with targeted typing or from statistical analysis of large population-specific genotype samples. As established tools for the latter approach lacked the ability to handle the amount, ambiguity, and inhomogeneity found in genotype data in hematopoietic stem cell donor registries, we developed Hapl-o-Mat to alleviate these specific shortcomings.

  9. Time Series Analysis Based on Running Mann Whitney Z Statistics

    USDA-ARS?s Scientific Manuscript database

    A sensitive and objective time series analysis method based on the calculation of Mann Whitney U statistics is described. This method samples data rankings over moving time windows, converts those samples to Mann-Whitney U statistics, and then normalizes the U statistics to Z statistics using Monte-...
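
    The U-to-Z normalization referenced here is the standard large-sample conversion (valid in the absence of ties). A hedged sketch with synthetic data, where the window lengths and change point are illustrative choices:

        import numpy as np
        from scipy.stats import rankdata

        def mann_whitney_z(win_a, win_b):
            """U statistic for two windows, normalized to Z (no-ties approximation)."""
            m, n = len(win_a), len(win_b)
            ranks = rankdata(np.concatenate([win_a, win_b]))
            u = ranks[:m].sum() - m * (m + 1) / 2      # U for the first window
            mu = m * n / 2                             # E[U] under H0
            sigma = np.sqrt(m * n * (m + n + 1) / 12)  # SD[U] under H0
            return (u - mu) / sigma

        rng = np.random.default_rng(5)
        series = np.concatenate([rng.normal(0, 1, 60), rng.normal(1, 1, 60)])
        z = [mann_whitney_z(series[i:i + 20], series[i + 20:i + 40]) for i in range(80)]
        print(int(np.argmax(np.abs(z))))  # |Z| peaks where windows straddle the change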

  10. Response Variability in Commercial MOSFET SEE Qualification

    DOE PAGES

    George, J. S.; Clymer, D. A.; Turflinger, T. L.; ...

    2016-12-01

    Single-event effects (SEE) evaluation of five different part types of next generation, commercial trench MOSFETs indicates large part-to-part variation in determining a safe operating area (SOA) for drain-source voltage (V_DS) following a test campaign that exposed >50 samples per part type to heavy ions. These results suggest a determination of a SOA using small sample sizes may fail to capture the full extent of the part-to-part variability. An example method is discussed for establishing a safe operating area using a one-sided statistical tolerance limit based on the number of test samples. Finally, burn-in is shown to be a critical factor in reducing part-to-part variation in part response. Implications for radiation qualification requirements are also explored.
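
    The one-sided statistical tolerance limit mentioned above has a standard closed form under a normality assumption, via the noncentral t distribution. A hedged sketch with hypothetical failure voltages (not the paper's measurements):

        import numpy as np
        from scipy.stats import nct, norm

        def k_factor(n, p=0.90, conf=0.95):
            """One-sided normal tolerance factor: bound covers proportion p of the
            population with confidence conf, from a sample of size n."""
            delta = norm.ppf(p) * np.sqrt(n)            # noncentrality parameter
            return nct.ppf(conf, df=n - 1, nc=delta) / np.sqrt(n)

        # Hypothetical V_DS failure voltages (V) for n tested parts
        v_fail = np.array([82.1, 79.5, 85.0, 80.2, 77.9, 83.3, 81.0, 78.8])
        bound = v_fail.mean() - k_factor(len(v_fail)) * v_fail.std(ddof=1)
        print(f"95%/90% lower tolerance limit: {bound:.1f} V")  # candidate SOA bound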

  11. Response Variability in Commercial MOSFET SEE Qualification

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    George, J. S.; Clymer, D. A.; Turflinger, T. L.

    Single-event effects (SEE) evaluation of five different part types of next generation, commercial trench MOSFETs indicates large part-to-part variation in determining a safe operating area (SOA) for drain-source voltage (V_DS) following a test campaign that exposed >50 samples per part type to heavy ions. These results suggest a determination of a SOA using small sample sizes may fail to capture the full extent of the part-to-part variability. An example method is discussed for establishing a safe operating area using a one-sided statistical tolerance limit based on the number of test samples. Finally, burn-in is shown to be a critical factor in reducing part-to-part variation in part response. Implications for radiation qualification requirements are also explored.

  12. Sampling populations of humans across the world: ELSI issues.

    PubMed

    Knoppers, Bartha Maria; Zawati, Ma'n H; Kirby, Emily S

    2012-01-01

    There are an increasing number of population studies collecting data and samples to illuminate gene-environment contributions to disease risk and health. The rising affordability of innovative technologies capable of generating large amounts of data helps achieve statistical power and has paved the way for new international research collaborations. Most data and sample collections can be grouped into longitudinal, disease-specific, or residual tissue biobanks, with accompanying ethical, legal, and social issues (ELSI). Issues pertaining to consent, confidentiality, and oversight cannot be examined using a one-size-fits-all approach: the particularities of each biobank must be taken into account. It remains to be seen whether current governance approaches will be adequate to handle the impact of next-generation sequencing technologies on communication with participants in population biobanking studies.

  13. CAN'T MISS--conquer any number task by making important statistics simple. Part 2. Probability, populations, samples, and normal distributions.

    PubMed

    Hansen, John P

    2003-01-01

    Healthcare quality improvement professionals need to understand and use inferential statistics to interpret sample data from their organizations. In quality improvement and healthcare research studies all the data from a population often are not available, so investigators take samples and make inferences about the population by using inferential statistics. This three-part series will give readers an understanding of the concepts of inferential statistics as well as the specific tools for calculating confidence intervals for samples of data. This article, Part 2, describes probability, populations, and samples. The uses of descriptive and inferential statistics are outlined. The article also discusses the properties and probability of normal distributions, including the standard normal distribution.

  14. Detection and Attribution of Temperature Trends in the Presence of Natural Variability

    NASA Astrophysics Data System (ADS)

    Wallace, J. M.

    2014-12-01

    The fingerprint of human-induced global warming stands out clearly above the noise in the time series of global-mean temperature, but not local temperature. At extratropical latitudes over land, the standard error of 50-year linear temperature trends at a fixed point is as large as the cumulative rise in global-mean temperature over the past century. Much of the sampling variability in local temperature trends is "dynamically induced", i.e., attributable to the fact that the seasonally-varying mean circulation varies substantially from one year to the next and anomalous circulation patterns are generally accompanied by anomalous temperature patterns. In the presence of such large sampling variability it is virtually impossible to identify the spatial signature of greenhouse warming based on observational data or to partition observed local temperature trends into natural and human-induced components. It follows that previous IPCC assessments, which have focused on the deterministic signature of human-induced climate change, are inherently limited as to what they can tell us about the attribution of the past record of local temperature change or about how much the temperature at a particular place is likely to rise in the next few decades in response to global warming. To obtain more informative assessments of regional and local climate variability and change it will be necessary to take a probabilistic approach. Just as the use of ensembles has contributed to more informative extended-range weather predictions, large ensembles of climate model simulations can provide a statistical context for interpreting observed climate change and for framing projections of future climate. For some purposes, statistics relating to the interannual variability in the historical record can serve as a surrogate for statistics relating to the diversity of climate change scenarios in large ensembles.

  15. Optimal spatial sampling techniques for ground truth data in microwave remote sensing of soil moisture

    NASA Technical Reports Server (NTRS)

    Rao, R. G. S.; Ulaby, F. T.

    1977-01-01

    The paper examines optimal sampling techniques for obtaining accurate spatial averages of soil moisture, at various depths and for cell sizes in the range 2.5-40 acres, with a minimum number of samples. Both simple random sampling and stratified sampling procedures are used to reach a set of recommended sample sizes for each depth and for each cell size. Major conclusions from statistical sampling test results are that (1) the number of samples required decreases with increasing depth; (2) when the total number of samples cannot be prespecified or the moisture in only one single layer is of interest, then a simple random sampling procedure should be used, based on the observed mean and SD for data from a single field; (3) when the total number of samples can be prespecified and the objective is to measure the soil moisture profile with depth, then stratified random sampling based on optimal allocation should be used (a sketch of this allocation follows below); and (4) decreasing the sensor resolution cell size leads to fairly large decreases in sample sizes with stratified sampling procedures, whereas only a moderate decrease is obtained in simple random sampling procedures.
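
    As referenced in conclusion (3), "optimal allocation" in stratified sampling is commonly implemented as Neyman allocation, splitting the total sample across strata in proportion to stratum size times stratum standard deviation. A hedged sketch with hypothetical soil-moisture strata:

        import numpy as np

        def neyman_allocation(n_total, stratum_sizes, stratum_sds):
            """Allocate n_total samples across strata proportionally to N_h * S_h."""
            w = np.asarray(stratum_sizes, float) * np.asarray(stratum_sds, float)
            return np.round(n_total * w / w.sum()).astype(int)

        sizes = [120, 80, 50]   # hypothetical stratum sizes (number of plots)
        sds = [2.0, 4.5, 7.0]   # hypothetical stratum SDs (% soil moisture)
        print(neyman_allocation(40, sizes, sds))  # samples to draw per stratum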

  16. Chi-Squared Test of Fit and Sample Size-A Comparison between a Random Sample Approach and a Chi-Square Value Adjustment Method.

    PubMed

    Bergh, Daniel

    2015-01-01

    Chi-square statistics are commonly used for tests of fit of measurement models. Chi-square is also sensitive to sample size, which is why several approaches to handling large samples in test-of-fit analysis have been developed. One strategy to handle the sample size problem may be to adjust the sample size in the analysis of fit. An alternative is to adopt a random sample approach. The purpose of this study was to analyze and compare these two strategies using simulated data. Given an original sample size of 21,000, for reductions of sample size down to the order of 5,000 the adjusted sample size function works as well as the random sample approach. In contrast, when applying adjustments to sample sizes of lower order, the adjustment function is less effective at approximating the chi-square value for an actual random sample of the relevant size. Hence, the fit is exaggerated and misfit underestimated using the adjusted sample size function. Although there are big differences in chi-square values between the two approaches at lower sample sizes, the inferences based on the p-values may be the same.
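
    The two strategies can be contrasted in a few lines. In SEM-style fit testing the statistic is T = (N - 1) * F_min, so one common algebraic adjustment (an assumption here, not necessarily the paper's exact function) rescales the full-sample statistic by the ratio of (target N - 1) to (original N - 1); the random sample approach instead recomputes on an actual subsample. The fit function below is a toy stand-in, not a real SEM fit:

        import numpy as np

        def adjusted_chi2(chi2_full, n_full, n_target):
            """Algebraic adjustment: T = (N - 1) * F_min scales linearly in N - 1."""
            return chi2_full * (n_target - 1) / (n_full - 1)

        def subsample_chi2(data, n_target, fit_fn, seed=None):
            """Random sample approach: recompute the statistic on a real subsample."""
            rng = np.random.default_rng(seed)
            return fit_fn(data[rng.choice(len(data), n_target, replace=False)])

        # Toy "fit function" standing in for a model chi-square (illustrative only)
        fit_fn = lambda d: (len(d) - 1) * float(np.abs(np.corrcoef(d.T)[0, 1]))
        data = np.random.default_rng(6).normal(size=(21000, 2))
        full = fit_fn(data)
        print(adjusted_chi2(full, 21000, 5000), subsample_chi2(data, 5000, fit_fn, 7))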

  17. Testing the equivalence of modern human cranial covariance structure: Implications for bioarchaeological applications.

    PubMed

    von Cramon-Taubadel, Noreen; Schroeder, Lauren

    2016-10-01

    Estimation of the variance-covariance (V/CV) structure of fragmentary bioarchaeological populations requires the use of proxy extant V/CV parameters. However, it is currently unclear whether extant human populations exhibit equivalent V/CV structures. Random skewers (RS) and hierarchical analyses of common principal components (CPC) were applied to a modern human cranial dataset. Cranial V/CV similarity was assessed globally for samples of individual populations (jackknifed method) and for pairwise population sample contrasts. The results were examined in light of potential explanatory factors for covariance difference, such as geographic region, among-group distance, and sample size. RS analyses showed that population samples exhibited highly correlated multivariate responses to selection, and that differences in RS results were primarily a consequence of differences in sample size. The CPC method yielded mixed results, depending upon the statistical criterion used to evaluate the hierarchy. The hypothesis-testing (step-up) approach was deemed problematic due to sensitivity to low statistical power and elevated Type I errors. In contrast, the model-fitting (lowest AIC) approach suggested that V/CV matrices were proportional and/or shared a large number of CPCs. Pairwise population sample CPC results were correlated with cranial distance, suggesting that population history explains some of the variability in V/CV structure among groups. The results indicate that patterns of covariance in human craniometric samples are broadly similar but not identical. These findings have important implications for choosing extant covariance matrices to use as proxy V/CV parameters in evolutionary analyses of past populations. © 2016 Wiley Periodicals, Inc.
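
    The random skewers step is compact enough to sketch: apply many random unit-length selection vectors to two covariance matrices and average the vector correlation of the predicted responses. The matrices below are hypothetical, not craniometric data:

        import numpy as np

        def random_skewers(G1, G2, n_skewers=1000, seed=None):
            """Mean response-vector correlation of two covariance matrices."""
            rng = np.random.default_rng(seed)
            corrs = []
            for _ in range(n_skewers):
                s = rng.normal(size=G1.shape[0])
                s /= np.linalg.norm(s)        # random unit selection vector
                r1, r2 = G1 @ s, G2 @ s       # predicted responses to selection
                corrs.append(r1 @ r2 / (np.linalg.norm(r1) * np.linalg.norm(r2)))
            return float(np.mean(corrs))

        rng = np.random.default_rng(8)
        A = rng.normal(size=(200, 10))        # hypothetical trait samples
        B = A + 0.3 * rng.normal(size=(200, 10))
        print(random_skewers(np.cov(A.T), np.cov(B.T)))  # near 1 = similar structure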

  18. Occurrence of Radio Minihalos in a Mass-limited Sample of Galaxy Clusters

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Giacintucci, Simona; Clarke, Tracy E.; Markevitch, Maxim

    2017-06-01

    We investigate the occurrence of radio minihalos—diffuse radio sources of unknown origin observed in the cores of some galaxy clusters—in a statistical sample of 58 clusters drawn from the Planck Sunyaev–Zel’dovich cluster catalog using a mass cut (M_500 > 6 × 10^14 M_⊙). We supplement our statistical sample with a similarly sized nonstatistical sample mostly consisting of clusters in the ACCEPT X-ray catalog with suitable X-ray and radio data, which includes lower-mass clusters. Where necessary (for nine clusters), we reanalyzed the Very Large Array archival radio data to determine whether a minihalo is present. Our total sample includes all 28 currently known and recently discovered radio minihalos, including six candidates. We classify clusters as cool-core or non-cool-core according to the value of the specific entropy floor in the cluster center, rederived or newly derived from the Chandra X-ray density and temperature profiles where necessary (for 27 clusters). Contrary to the common wisdom that minihalos are rare, we find that almost all cool cores—at least 12 out of 15 (80%)—in our complete sample of massive clusters exhibit minihalos. The supplementary sample shows that the occurrence of minihalos may be lower in lower-mass cool-core clusters. No minihalos are found in non-cool cores or “warm cores.” These findings will help test theories of the origin of minihalos and provide information on the physical processes and energetics of the cluster cores.

  19. Improved Statistical Sampling and Accuracy with Accelerated Molecular Dynamics on Rotatable Torsions.

    PubMed

    Doshi, Urmi; Hamelberg, Donald

    2012-11-13

    In enhanced sampling techniques, the precision of the reweighted ensemble properties is often decreased due to large variation in statistical weights and reduction in the effective sampling size. To abate this reweighting problem, here, we propose a general accelerated molecular dynamics (aMD) approach in which only the rotatable dihedrals are subjected to aMD (RaMD), unlike the typical implementation wherein all dihedrals are boosted (all-aMD). Nonrotatable and improper dihedrals are marginally important to conformational changes or the different rotameric states. Not accelerating them avoids the sharp increases in the potential energies due to small deviations from their minimum energy conformations and leads to improvement in the precision of RaMD. We present benchmark studies on two model dipeptides, Ace-Ala-Nme and Ace-Trp-Nme, simulated with normal MD, all-aMD, and RaMD. We carry out a systematic comparison between the performances of both forms of aMD using a theory that allows quantitative estimation of the effective number of sampled points and the associated uncertainty. Our results indicate that, for the same level of acceleration and simulation length, as used in all-aMD, RaMD results in significantly less loss in the effective sample size and, hence, increased accuracy in the sampling of φ-ψ space. RaMD yields an accuracy comparable to that of all-aMD, from simulation lengths 5 to 1000 times shorter, depending on the peptide and the acceleration level. Such improvement in speed and accuracy over all-aMD is highly remarkable, suggesting RaMD as a promising method for sampling larger biomolecules.
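
    The "effective number of sampled points" under reweighting can be illustrated with the Kish effective sample size. The boost-potential fluctuations below are hypothetical stand-ins, chosen only to show how larger fluctuations (all-aMD-like) collapse the effective sample relative to smaller ones (RaMD-like):

        import numpy as np

        def effective_sample_size(weights):
            """Kish ESS: shrinks toward 1 when a few weights dominate."""
            w = np.asarray(weights)
            return w.sum() ** 2 / (w ** 2).sum()

        rng = np.random.default_rng(9)
        kT = 0.596                              # kcal/mol at ~300 K
        dV_all = rng.normal(5.0, 3.0, 10000)    # large boost fluctuations
        dV_rot = rng.normal(5.0, 0.8, 10000)    # smaller boost fluctuations
        for dV in (dV_all, dV_rot):
            w = np.exp((dV - dV.max()) / kT)    # reweighting factors, stabilized
            print(round(effective_sample_size(w), 1))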

  20. Identifying Minefields and Verifying Clearance: Adapting Statistical Methods for UXO Target Detection

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gilbert, Richard O.; O'Brien, Robert F.; Wilson, John E.

    2003-09-01

    It may not be feasible to completely survey large tracts of land suspected of containing minefields. It is desirable to develop a characterization protocol that will confidently identify minefields within these large land tracts if they exist. Naturally, surveying areas of greatest concern and most likely locations would be necessary but will not provide the needed confidence that an unknown minefield had not eluded detection. Once minefields are detected, methods are needed to bound the area that will require detailed mine detection surveys. The US Department of Defense Strategic Environmental Research and Development Program (SERDP) is sponsoring the development of statistical survey methods and tools for detecting potential UXO targets. These methods may be directly applicable to demining efforts. Statistical methods are employed to determine the optimal geophysical survey transect spacing to have confidence of detecting target areas of a critical size, shape, and anomaly density. Other methods under development determine the proportion of a land area that must be surveyed to confidently conclude that there are no UXO present. Adaptive sampling schemes are also being developed as an approach for bounding the target areas. These methods and tools will be presented and the status of relevant research in this area will be discussed.
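
    The geometry underlying transect spacing can be sketched simply: for parallel transects with spacing s and swath width w, a circular target area of radius r placed uniformly at random is traversed with probability roughly min(1, (2r + w)/s). A hedged sketch with a Monte Carlo check (all numbers illustrative):

        import numpy as np

        def traversal_probability(radius, spacing, swath=0.0):
            """P(at least one transect crosses a circular target area)."""
            return min(1.0, (2 * radius + swath) / spacing)

        r, s, w = 20.0, 50.0, 2.0                  # hypothetical meters
        rng = np.random.default_rng(10)
        x = rng.uniform(0, s, 100_000)             # center offset from a transect
        hit = (x <= r + w / 2) | (s - x <= r + w / 2)
        print(traversal_probability(r, s, w), hit.mean())  # formula vs. simulation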

  1. 75 FR 53738 - Proposed Collection; Comment Request for Rev. Proc. 2007-35

    Federal Register 2010, 2011, 2012, 2013, 2014

    2010-09-01

    ... Revenue Procedure Revenue Procedure 2007-35, Statistical Sampling for purposes of Section 199. DATES... through the Internet, at [email protected] . SUPPLEMENTARY INFORMATION: Title: Statistical Sampling...: This revenue procedure provides for determining when statistical sampling may be used in purposes of...

  2. Is psychology suffering from a replication crisis? What does "failure to replicate" really mean?

    PubMed

    Maxwell, Scott E; Lau, Michael Y; Howard, George S

    2015-09-01

    Psychology has recently been viewed as facing a replication crisis because efforts to replicate past study findings frequently do not show the same result. Often, the first study showed a statistically significant result but the replication does not. Questions then arise about whether the first study results were false positives, and whether the replication study correctly indicates that there is truly no effect after all. This article suggests these so-called failures to replicate may not be failures at all, but rather are the result of low statistical power in single replication studies, and the result of failure to appreciate the need for multiple replications in order to have enough power to identify true effects. We provide examples of these power problems and suggest some solutions using Bayesian statistics and meta-analysis. Although the need for multiple replication studies may frustrate those who would prefer quick answers to psychology's alleged crisis, the large sample sizes typically needed to provide firm evidence will almost always require concerted efforts from multiple investigators. As a result, it remains to be seen how many of the recently claimed failures to replicate will be supported or instead may turn out to be artifacts of inadequate sample sizes and single study replications. (PsycINFO Database Record (c) 2015 APA, all rights reserved).
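
    The power arithmetic behind this argument is easy to reproduce. The sketch below uses a normal approximation to the two-sample t-test; the effect size and sample sizes are illustrative, not taken from any particular replication project:

        import numpy as np
        from scipy import stats

        def power_two_sample(d, n_per_group, alpha=0.05):
            """Approximate power of a two-sided two-sample test of effect size d."""
            se = np.sqrt(2 / n_per_group)          # SE of the estimated d
            z = stats.norm.ppf(1 - alpha / 2)
            return (1 - stats.norm.cdf(z - d / se)) + stats.norm.cdf(-z - d / se)

        p_small = power_two_sample(0.3, 50)        # one small replication
        p_large = power_two_sample(0.3, 350)       # adequately powered study
        print(round(p_small, 2), round(p_large, 2), round(p_small ** 3, 2))
        # a true d = 0.3 effect "fails to replicate" most of the time at n = 50,
        # and three small replications all succeed only rarely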

  3. Making Sense of 'Big Data' in Provenance Studies

    NASA Astrophysics Data System (ADS)

    Vermeesch, P.

    2014-12-01

    Huge online databases can be 'mined' to reveal previously hidden trends and relationships in society. One could argue that sedimentary geology has entered a similar era of 'Big Data', as modern provenance studies routinely apply multiple proxies to dozens of samples. Just like the Internet, sedimentary geology now requires specialised statistical tools to interpret such large datasets. These can be organised on three levels of progressively higher order. A single sample: The most effective way to reveal the provenance information contained in a representative sample of detrital zircon U-Pb ages is with probability density estimators such as histograms and kernel density estimates. The widely popular 'probability density plots' implemented in IsoPlot and AgeDisplay compound analytical uncertainty with geological scatter and are therefore invalid. Several samples: Multi-panel diagrams comprising many detrital age distributions or compositional pie charts quickly become unwieldy and uninterpretable. For example, if there are N samples in a study, then the number of pairwise comparisons between samples increases quadratically as N(N-1)/2. This is simply too much information for the human eye to process. To solve this problem, it is necessary to (a) express the 'distance' between two samples as a simple scalar and (b) combine all N(N-1)/2 such values in a single two-dimensional 'map', grouping similar and pulling apart dissimilar samples. This can be easily achieved using simple statistics-based dissimilarity measures and a standard statistical method called Multidimensional Scaling (MDS). Several methods: Suppose that we use four provenance proxies: bulk petrography, chemistry, heavy minerals and detrital geochronology. This will result in four MDS maps, each of which likely shows slightly different trends and patterns. To deal with such cases, it may be useful to use a related technique called 'three-way multidimensional scaling'. This results in two graphical outputs: an MDS map, and a map with 'weights' showing to what extent the different provenance proxies influence the horizontal and vertical axes of the MDS map. Thus, detrital data can inform the user not only about the provenance of sediments, but also about the causal relationships between the mineralogy, geochronology and chemistry.
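
    The second level (several samples) is straightforward to prototype. The hedged sketch below uses the two-sample Kolmogorov-Smirnov statistic as the scalar dissimilarity (one common choice) between synthetic zircon age samples, then embeds the resulting matrix with scikit-learn's MDS; nothing here is the author's own implementation:

        import numpy as np
        from scipy.stats import ks_2samp
        from sklearn.manifold import MDS

        rng = np.random.default_rng(11)
        # Synthetic detrital age samples from two hypothetical source regions
        samples = [rng.normal(loc, 50, 120) for loc in (300, 300, 310, 900, 900, 920)]

        n = len(samples)
        D = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                D[i, j] = D[j, i] = ks_2samp(samples[i], samples[j]).statistic

        coords = MDS(n_components=2, dissimilarity="precomputed",
                     random_state=0).fit_transform(D)
        print(np.round(coords, 2))   # similar samples plot close together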

  4. Novel Kalman Filter Algorithm for Statistical Monitoring of Extensive Landscapes with Synoptic Sensor Data

    PubMed Central

    Czaplewski, Raymond L.

    2015-01-01

    Wall-to-wall remotely sensed data are increasingly available to monitor landscape dynamics over large geographic areas. However, statistical monitoring programs that use post-stratification cannot fully utilize those sensor data. The Kalman filter (KF) is an alternative statistical estimator. I develop a new KF algorithm that is numerically robust with large numbers of study variables and auxiliary sensor variables. A National Forest Inventory (NFI) illustrates application within an official statistics program. Practical recommendations regarding remote sensing and statistical issues are offered. This algorithm has the potential to increase the value of synoptic sensor data for statistical monitoring of large geographic areas. PMID:26393588
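
    The update step at the heart of any Kalman filter can be shown in scalar form. This is a hedged, didactic sketch of inverse-variance blending, not the paper's multivariate algorithm; the forest-cover numbers are hypothetical:

        def kalman_update(x_prior, var_prior, z, var_z):
            """Blend a prior estimate with a new observation by relative precision."""
            k = var_prior / (var_prior + var_z)    # Kalman gain
            x_post = x_prior + k * (z - x_prior)   # updated estimate
            var_post = (1 - k) * var_prior         # reduced uncertainty
            return x_post, var_post

        # Ground-plot forest cover estimate (%) updated by a sensor-based estimate
        x, v = kalman_update(x_prior=42.0, var_prior=9.0, z=45.0, var_z=4.0)
        print(x, v)   # posterior leans toward the lower-variance source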

  5. Evaluating the statistical methodology of randomized trials on dentin hypersensitivity management.

    PubMed

    Matranga, Domenica; Matera, Federico; Pizzo, Giuseppe

    2017-12-27

    The present study aimed to evaluate the characteristics and quality of statistical methodology used in clinical studies on dentin hypersensitivity management. An electronic search was performed for data published from 2009 to 2014 by using PubMed, Ovid/MEDLINE, and Cochrane Library databases. The primary search terms were used in combination. Eligibility criteria included randomized clinical trials that evaluated the efficacy of desensitizing agents in terms of reducing dentin hypersensitivity. A total of 40 studies were considered eligible for assessment of quality statistical methodology. The four main concerns identified were i) use of nonparametric tests in the presence of large samples, coupled with lack of information about normality and equality of variances of the response; ii) lack of P-value adjustment for multiple comparisons; iii) failure to account for interactions between treatment and follow-up time; and iv) no information about the number of teeth examined per patient and the consequent lack of cluster-specific approach in data analysis. Owing to these concerns, statistical methodology was judged as inappropriate in 77.1% of the 35 studies that used parametric methods. Additional studies with appropriate statistical analysis are required to obtain appropriate assessment of the efficacy of desensitizing agents.

  6. Sample Size and Statistical Conclusions from Tests of Fit to the Rasch Model According to the Rasch Unidimensional Measurement Model (Rumm) Program in Health Outcome Measurement.

    PubMed

    Hagell, Peter; Westergren, Albert

    Sample size is a major factor in statistical null hypothesis testing, which is the basis for many approaches to testing Rasch model fit. Few sample size recommendations for testing fit to the Rasch model concern the Rasch Unidimensional Measurement Models (RUMM) software, which features chi-square and ANOVA/F-ratio based fit statistics, including Bonferroni and algebraic sample size adjustments. This paper explores the occurrence of Type I errors with RUMM fit statistics, and the effects of algebraic sample size adjustments. Data simulated to fit the Rasch model for 25-item dichotomous scales, with sample sizes ranging from N = 50 to N = 2500, were analysed with and without algebraically adjusted sample sizes. Results suggest the occurrence of Type I errors with N ≤ 500, and that Bonferroni correction as well as downward algebraic sample size adjustment are useful to avoid such errors, whereas upward adjustment of smaller samples falsely signals misfit. Our observations suggest that sample sizes around N = 250 to N = 500 may provide a good balance for the statistical interpretation of the RUMM fit statistics studied here with respect to Type I errors and under the assumption of Rasch model fit within the examined frame of reference (i.e., about 25 item parameters well targeted to the sample).

  7. A Census of Large-scale (≥10 PC), Velocity-coherent, Dense Filaments in the Northern Galactic Plane: Automated Identification Using Minimum Spanning Tree

    NASA Astrophysics Data System (ADS)

    Wang, Ke; Testi, Leonardo; Burkert, Andreas; Walmsley, C. Malcolm; Beuther, Henrik; Henning, Thomas

    2016-09-01

    Large-scale gaseous filaments with lengths up to the order of 100 pc are on the upper end of the filamentary hierarchy of the Galactic interstellar medium (ISM). Their association with respect to the Galactic structure and their role in Galactic star formation are of great interest from both an observational and theoretical point of view. Previous “by-eye” searches, combined together, have started to uncover the Galactic distribution of large filaments, yet inherent bias and small sample size limit conclusive statistical results from being drawn. Here, we present (1) a new, automated method for identifying large-scale velocity-coherent dense filaments, and (2) the first statistics and the Galactic distribution of these filaments. We use a customized minimum spanning tree algorithm to identify filaments by connecting voxels in the position-position-velocity space, using the Bolocam Galactic Plane Survey spectroscopic catalog. In the range 7.5° ≤ l ≤ 194°, we have identified 54 large-scale filaments and derived mass (~10^3-10^5 M_⊙), length (10-276 pc), linear mass density (54-8625 M_⊙ pc^-1), aspect ratio, linearity, velocity gradient, temperature, fragmentation, Galactic location, and orientation angle. The filaments concentrate along major spiral arms. They are widely distributed across the Galactic disk, with 50% located within ±20 pc from the Galactic mid-plane and 27% run in the center of spiral arms. An order of 1% of the molecular ISM is confined in large filaments. Massive star formation is more favorable in large filaments compared to elsewhere. This is the first comprehensive catalog of large filaments that can be useful for a quantitative comparison with spiral structures and numerical simulations.
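
    The identification step translates directly into code. The hedged sketch below builds a minimum spanning tree over catalog sources in (l, b, v) space with SciPy and cuts edges longer than a linking length to obtain connected groups; the coordinates, the 0.1 deg per km/s velocity scaling, and the linking length are all illustrative assumptions, not the paper's calibrated values:

        import numpy as np
        from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
        from scipy.spatial.distance import pdist, squareform

        rng = np.random.default_rng(12)
        # Hypothetical catalog: l, b (deg) and v_lsr (km/s), velocity rescaled
        lbv = np.column_stack([rng.uniform(10, 12, 60), rng.uniform(-1, 1, 60),
                               0.1 * rng.uniform(40, 60, 60)])

        D = squareform(pdist(lbv))                 # pairwise PPV distances
        mst = minimum_spanning_tree(D).toarray()   # MST edge-weight matrix
        mst[mst > 0.3] = 0                         # cut edges above linking length
        n_groups, labels = connected_components(mst, directed=False)
        print(n_groups, np.bincount(labels))       # candidate coherent structures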

  8. Effects of wind direction on coarse and fine particulate matter concentrations in southeast Kansas.

    PubMed

    Guerra, Sergio A; Lane, Dennis D; Marotz, Glen A; Carter, Ray E; Hohl, Carrie M; Baldauf, Richard W

    2006-11-01

    Field data for coarse particulate matter (PM10) and fine particulate matter (PM2.5) were collected at selected sites in Southeast Kansas from March 1999 to October 2000, using portable MiniVol particulate samplers. The purpose was to assess the influence on air quality of four industrial facilities in the area that burn hazardous waste, located in the communities of Chanute, Independence, Fredonia, and Coffeyville. Both spatial and temporal variation were observed in the data. Variation due to sampling site was found to be statistically significant for PM10 but not for PM2.5. PM10 concentrations were typically slightly higher at sites located within the four study communities than at background sites. Sampling sites were located north and south of the four targeted sources to provide upwind and downwind monitoring pairs. No statistically significant differences were found between upwind and downwind samples for either PM10 or PM2.5, indicating that the targeted sources did not contribute significantly to PM concentrations. Wind direction can frequently contribute to temporal variation in air pollutant concentrations and was investigated in this study. Sampling days were divided into four classifications: predominantly south winds, predominantly north winds, calm/variable winds, and winds from other directions. The effect of wind direction was found to be statistically significant for both PM10 and PM2.5. For both size ranges, PM concentrations were typically highest on days with predominantly south winds; days with calm/variable winds generally produced higher concentrations than did those with predominantly north winds or those with winds from "other" directions. The significant effect of wind direction suggests that regional sources may exert a large influence on PM concentrations in the area.

  9. Testing non-inferiority of a new treatment in three-arm clinical trials with binary endpoints.

    PubMed

    Tang, Nian-Sheng; Yu, Bin; Tang, Man-Lai

    2014-12-18

    A two-arm non-inferiority trial without a placebo is usually adopted to demonstrate that an experimental treatment is not worse than a reference treatment by a small pre-specified non-inferiority margin due to ethical concerns. Selection of the non-inferiority margin and establishment of assay sensitivity are two major issues in the design, analysis and interpretation for two-arm non-inferiority trials. Alternatively, a three-arm non-inferiority clinical trial including a placebo is usually conducted to assess the assay sensitivity and internal validity of a trial. Recently, some large-sample approaches have been developed to assess the non-inferiority of a new treatment based on the three-arm trial design. However, these methods behave badly with small sample sizes in the three arms. This manuscript aims to develop some reliable small-sample methods to test three-arm non-inferiority. Saddlepoint approximation, exact and approximate unconditional, and bootstrap-resampling methods are developed to calculate p-values of the Wald-type, score and likelihood ratio tests. Simulation studies are conducted to evaluate their performance in terms of type I error rate and power. Our empirical results show that the saddlepoint approximation method generally behaves better than the asymptotic method based on the Wald-type test statistic. For small sample sizes, approximate unconditional and bootstrap-resampling methods based on the score test statistic perform better in the sense that their corresponding type I error rates are generally closer to the prespecified nominal level than those of other test procedures. Both approximate unconditional and bootstrap-resampling test procedures based on the score test statistic are generally recommended for three-arm non-inferiority trials with binary outcomes.

  10. Application of PCA and SIMCA statistical analysis of FT-IR spectra for the classification and identification of different slag types with environmental origin.

    PubMed

    Stumpe, B; Engel, T; Steinweg, B; Marschner, B

    2012-04-03

    In the past, different slag materials were often used for landscaping and construction purposes or simply dumped. Nowadays German environmental laws strictly control the use of slags, but a remaining 35% is still dumped in landfills without control. Since some slags have high heavy metal contents, and different slag types have characteristic chemical and physical properties that influence the risk potential and other characteristics of the deposits, an identification of the slag types is needed. We developed an FT-IR-based statistical method to identify different slag classes. Slag samples were collected at different sites throughout various cities within the industrial Ruhr area. Then, spectra of 35 samples from four different slag classes, ladle furnace (LF), blast furnace (BF), oxygen furnace steel (OF), and zinc furnace slags (ZF), were determined in the mid-infrared region (4000-400 cm(-1)). The spectral data sets were subjected to statistical classification methods to separate the spectra of the different slag classes. Principal component analysis (PCA) models for each slag class were developed and further used for soft independent modeling of class analogy (SIMCA). Precise classification of slag samples into the four slag classes was achieved by using two different SIMCA models stepwise. First, SIMCA 1 was used for the classification of ZF and OF slags over the total spectral range. If no correct classification was found, the spectrum was analyzed with SIMCA 2 at reduced wavenumbers for the classification of LF and BF spectra. As a result, we provide a time- and cost-efficient method based on FT-IR spectroscopy for processing and identifying large numbers of environmental slag samples.
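
    The stepwise SIMCA logic can be caricatured with per-class PCA models: a new spectrum is assigned to the class whose principal-component subspace reconstructs it with the smallest residual. The "spectra" below are synthetic shapes, not FT-IR measurements:

        import numpy as np
        from sklearn.decomposition import PCA

        rng = np.random.default_rng(13)
        wn = np.linspace(0, 1, 200)                         # pseudo-wavenumber axis
        bases = [np.sin(2 * np.pi * f * wn) for f in (1, 2, 3, 4)]
        train = {c: b + 0.05 * rng.normal(size=(9, 200))    # 9 spectra per class
                 for c, b in zip(["LF", "BF", "OF", "ZF"], bases)}
        models = {c: PCA(n_components=3).fit(X) for c, X in train.items()}

        def simca_classify(spectrum):
            """Assign to the class whose PCA model reconstructs the spectrum best."""
            resid = {c: float(((spectrum - m.inverse_transform(
                         m.transform(spectrum[None, :]))) ** 2).sum())
                     for c, m in models.items()}
            return min(resid, key=resid.get)

        unknown = bases[2] + 0.05 * rng.normal(size=200)    # a noisy "OF" spectrum
        print(simca_classify(unknown))                      # expected: OF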

  11. 78 FR 43002 - Proposed Collection; Comment Request for Revenue Procedure 2004-29

    Federal Register 2010, 2011, 2012, 2013, 2014

    2013-07-18

    ... comments concerning statistical sampling in Sec. 274 Context. DATES: Written comments should be received on... INFORMATION: Title: Statistical Sampling in Sec. 274 Context. OMB Number: 1545-1847. Revenue Procedure Number: Revenue Procedure 2004-29. Abstract: Revenue Procedure 2004-29 prescribes the statistical sampling...

  12. 42 CFR 1003.133 - Statistical sampling.

    Code of Federal Regulations, 2014 CFR

    2014-10-01

    ... 42 Public Health 5 2014-10-01 2014-10-01 false Statistical sampling. 1003.133 Section 1003.133 Public Health OFFICE OF INSPECTOR GENERAL-HEALTH CARE, DEPARTMENT OF HEALTH AND HUMAN SERVICES OIG AUTHORITIES CIVIL MONEY PENALTIES, ASSESSMENTS AND EXCLUSIONS § 1003.133 Statistical sampling. (a) In meeting...

  13. EVALUATION OF A NEW MEAN SCALED AND MOMENT ADJUSTED TEST STATISTIC FOR SEM.

    PubMed

    Tong, Xiaoxiao; Bentler, Peter M

    2013-01-01

    Recently a new mean scaled and skewness adjusted test statistic was developed for evaluating structural equation models in small samples and with potentially nonnormal data, but this statistic has received only limited evaluation. The performance of this statistic is compared to normal theory maximum likelihood and two well-known robust test statistics. A modification to the Satorra-Bentler scaled statistic is developed for the condition that sample size is smaller than degrees of freedom. The behavior of the four test statistics is evaluated with a Monte Carlo confirmatory factor analysis study that varies seven sample sizes and three distributional conditions obtained using Headrick's fifth-order transformation to nonnormality. The new statistic performs badly in most conditions except under the normal distribution. The goodness-of-fit χ2 test based on maximum-likelihood estimation performed well under normal distributions as well as under a condition of asymptotic robustness. The Satorra-Bentler scaled test statistic performed best overall, while the mean scaled and variance adjusted test statistic outperformed the others at small and moderate sample sizes under certain distributional conditions.

  14. Comparison of Two Methods for the Isolation of Salmonellae From Imported Foods

    PubMed Central

    Taylor, Welton I.; Hobbs, Betty C.; Smith, Muriel E.

    1964-01-01

    Two methods for the detection of salmonellae in foods were compared in 179 imported meat and egg samples. The number of positive samples and replications, and the number of strains and kinds of serotypes were statistically comparable by both the direct enrichment method of the Food Hygiene Laboratory in England, and the pre-enrichment method devised for processed foods in the United States. Boneless frozen beef, veal, and horsemeat imported from five countries for consumption in England were found to have salmonellae present in 48 of 116 (41%) samples. Dried egg products imported from three countries were observed to have salmonellae in 10 of 63 (16%) samples. The high incidence of salmonellae isolated from imported foods illustrated the existence of an international health hazard resulting from the continuous introduction of exogenous strains of pathogenic microorganisms on a large scale. PMID:14106941

  15. Computer-aided boundary delineation of agricultural lands

    NASA Technical Reports Server (NTRS)

    Cheng, Thomas D.; Angelici, Gary L.; Slye, Robert E.; Ma, Matt

    1989-01-01

    The National Agricultural Statistics Service of the United States Department of Agriculture (USDA) presently uses labor-intensive aerial photographic interpretation techniques to divide large geographical areas into manageable-sized units for estimating domestic crop and livestock production. Prototype software, the computer-aided stratification (CAS) system, was developed to automate the procedure, and currently runs on a Sun-based image processing system. With a background display of LANDSAT Thematic Mapper and United States Geological Survey Digital Line Graph data, the operator uses a cursor to delineate agricultural areas, called sampling units, which are assigned to strata of land-use and land-cover types. The resultant stratified sampling units are used as input into subsequent USDA sampling procedures. As a test, three counties in Missouri were chosen for application of the CAS procedures. Subsequent analysis indicates that CAS was five times faster in creating sampling units than the manual techniques were.

  16. Viewpoints: Interactive Exploration of Large Multivariate Earth and Space Science Data Sets

    NASA Astrophysics Data System (ADS)

    Levit, C.; Gazis, P. R.

    2006-05-01

    Analysis and visualization of extremely large and complex data sets may be one of the most significant challenges facing earth and space science investigators in the forthcoming decades. While advances in hardware speed and storage technology have roughly kept up with (indeed, have driven) increases in database size, the same is not true of our abilities to manage the complexity of these data. Current missions, instruments, and simulations produce so much data of such high dimensionality that they outstrip the capabilities of traditional visualization and analysis software. This problem can only be expected to get worse as data volumes increase by orders of magnitude in future missions and in ever-larger supercomputer simulations. For large multivariate data (more than 10^5 samples or records with more than 5 variables per sample) the interactive graphics response of most existing statistical analysis, machine learning, exploratory data analysis, and/or visualization tools such as Torch, MLC++, Matlab, S++/R, and IDL stutters, stalls, or stops working altogether. Fortunately, the graphics processing units (GPUs) built in to all professional desktop and laptop computers currently on the market are capable of transforming, filtering, and rendering hundreds of millions of points per second. We present a prototype open-source cross-platform application which leverages much of the power latent in the GPU to enable smooth interactive exploration and analysis of large high-dimensional data using a variety of classical and recent techniques. The targeted application is the interactive analysis of large, complex, multivariate data sets, with dimensionalities that may surpass 100 and sample sizes that may exceed 10^6-10^8.

  17. STAPP: Spatiotemporal analysis of plantar pressure measurements using statistical parametric mapping.

    PubMed

    Booth, Brian G; Keijsers, Noël L W; Sijbers, Jan; Huysmans, Toon

    2018-05-03

    Pedobarography produces large sets of plantar pressure samples that are routinely subsampled (e.g. using regions of interest) or aggregated (e.g. center of pressure trajectories, peak pressure images) in order to simplify statistical analysis and provide intuitive clinical measures. We hypothesize that these data reductions discard gait information that can be used to differentiate between groups or conditions. To test the hypothesis of null information loss, we created an implementation of statistical parametric mapping (SPM) for dynamic plantar pressure datasets (i.e. plantar pressure videos). Our SPM software framework brings all plantar pressure videos into anatomical and temporal correspondence, then performs statistical tests at each sampling location in space and time. As a novel element, we introduce non-linear temporal registration into the framework in order to normalize for timing differences within the stance phase. We refer to our software framework as STAPP: spatiotemporal analysis of plantar pressure measurements. Using STAPP, we tested our hypothesis on plantar pressure videos from 33 healthy subjects walking at different speeds. As walking speed increased, STAPP was able to identify significant decreases in plantar pressure at mid-stance from the heel through the lateral forefoot. The extent of these plantar pressure decreases has not previously been observed using existing plantar pressure analysis techniques. We therefore conclude that the subsampling of plantar pressure videos - a task which led to the discarding of gait information in our study - can be avoided using STAPP. Copyright © 2018 Elsevier B.V. All rights reserved.

  18. European consumer attitudes on the associated health benefits of neutraceutical-containing processed meats using Co-enzyme Q10 as a sample functional ingredient.

    PubMed

    Tobin, Brian D; O'Sullivan, Maurice G; Hamill, Ruth; Kerry, Joseph P

    2014-06-01

    This study gathered European consumer attitudes towards processed meats and their use as a functional food. A survey was set up using an online web application to collect information on consumer perception of processed meats as well as neutraceutical-containing processed meats. 548 responses were obtained and statistical analysis was carried out using a statistical software package. Data were summarized as frequencies for each question and statistical differences were analyzed using the chi-square test with a significance level of 5% (P<0.05). The majority of consumer attitudes indicate that processed meats are seen as unhealthy products. Most believe that processed meats contain large quantities of harmful chemicals, fat, and salt. Consumers were found to be strongly in favor of bioactive compounds in yogurt-style products but unsure of their feelings about meat-based products, likely due to a lack of familiarity with such products. Many of the respondents were willing to consume meat-based functional foods but were not willing to pay more for them. Copyright © 2014 Elsevier Ltd. All rights reserved.

  19. 78 FR 63568 - Proposed Collection; Comment Request for Rev. Proc. 2007-35

    Federal Register 2010, 2011, 2012, 2013, 2014

    2013-10-24

    ... Revenue Procedure 2007-35, Statistical Sampling for purposes of Section 199. DATES: Written comments... . SUPPLEMENTARY INFORMATION: Title: Statistical Sampling for purposes of Section 199. OMB Number: 1545-2072... statistical sampling may be used in purposes of section 199, which provides a deduction for income...

  20. On the use of total aerobic spore bacteria to make treatment decisions due to Cryptosporidium risk at public water system wells.

    PubMed

    Berger, Philip; Messner, Michael J; Crosby, Jake; Vacs Renwick, Deborah; Heinrich, Austin

    2018-05-01

    Spore reduction can be used as a surrogate measure of Cryptosporidium natural filtration efficiency. Estimates of log10 (log) reduction were derived from spore measurements in paired surface and well water samples in Casper, Wyoming and Kearney, Nebraska. We found that these data were suitable for testing the hypothesis (H0) that the average reduction at each site was 2 log or less, using a one-sided Student's t-test. After establishing data quality objectives for the test (expressed as tolerable Type I and Type II error rates), we evaluated the test's performance as a function of (a) the true log reduction, (b) the number of paired samples assayed and (c) the variance of observed log reductions. We found that 36 paired spore samples are sufficient to achieve the objectives over a wide range of variance, including the variances observed in the two data sets. We also explored the feasibility of using smaller numbers of paired spore samples to supplement bioparticle counts for screening purposes in alluvial aquifers, to differentiate wells with large-volume surface-water-induced recharge from wells with negligible surface-water-induced recharge. With key assumptions, we propose a normal statistical test of the same hypothesis (H0), but with different performance objectives. As few as six paired spore samples appear adequate as a screening metric to supplement bioparticle counts to differentiate wells in alluvial aquifers with large-volume surface-water-induced recharge. For the case when all available information (including failure to reject H0 based on the limited paired spore data) leads to the conclusion that wells have large surface-water-induced recharge, we recommend further evaluation using additional paired biweekly spore samples. Published by Elsevier GmbH.
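
    The paired test described here is a one-sided, one-sample t-test on per-pair log10 reductions against the 2-log null. A hedged sketch with synthetic counts (the true reduction and variability below are invented for illustration):

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(14)
        # Hypothetical paired spore counts (CFU/L): surface water vs. well water
        surface = rng.lognormal(mean=6.0, sigma=0.5, size=36)
        well = surface / 10 ** rng.normal(2.6, 0.6, size=36)  # ~2.6 log reduction

        log_reduction = np.log10(surface / well)
        t, p = stats.ttest_1samp(log_reduction, popmean=2.0, alternative="greater")
        print(f"mean = {log_reduction.mean():.2f} log, t = {t:.2f}, p = {p:.4f}")
        # small p: reject H0 that average reduction is 2 log or less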
