A Selective Overview of Variable Selection in High Dimensional Feature Space
Fan, Jianqing
2010-01-01
High dimensional statistical problems arise from diverse fields of scientific research and technological development. Variable selection plays a pivotal role in contemporary statistical learning and scientific discoveries. The traditional idea of best subset selection methods, which can be regarded as a specific form of penalized likelihood, is computationally too expensive for many modern statistical applications. Other forms of penalized likelihood methods have been successfully developed over the last decade to cope with high dimensionality. They have been widely applied for simultaneously selecting important variables and estimating their effects in high dimensional statistical inference. In this article, we present a brief account of the recent developments of theory, methods, and implementations for high dimensional variable selection. Questions of what limits of dimensionality such methods can handle, what role penalty functions play, and what statistical properties they possess are rapidly driving advances in the field. The properties of non-concave penalized likelihood and its roles in high dimensional statistical modeling are emphasized. We also review some recent advances in ultra-high dimensional variable selection, with emphasis on independence screening and two-scale methods. PMID:21572976
Some challenges with statistical inference in adaptive designs.
Hung, H M James; Wang, Sue-Jane; Yang, Peiling
2014-01-01
Adaptive designs have attracted a great deal of attention in clinical trial communities. The literature contains many statistical methods to deal with the added statistical uncertainties concerning the adaptations. Increasingly encountered in regulatory applications are adaptive statistical information designs, which allow modification of sample size or related statistical information, and adaptive selection designs, which allow selection of doses or patient populations during the course of a clinical trial. For adaptive statistical information designs, a few statistical testing methods are mathematically equivalent, as a number of articles have stipulated, but arguably there are large differences in their practical ramifications. We pinpoint some undesirable features of these methods in this work. For adaptive selection designs, selection based on biomarker data for testing the correlated clinical endpoints may increase statistical uncertainty in terms of type I error probability, and most importantly, the increased statistical uncertainty may be impossible to assess.
Statistical Analysis of Big Data on Pharmacogenomics
Fan, Jianqing; Liu, Han
2013-01-01
This paper discusses statistical methods for estimating complex correlation structure from large pharmacogenomic datasets. We selectively review several prominent statistical methods for estimating large covariance matrices for understanding correlation structure, inverse covariance matrices for network modeling, large-scale simultaneous tests for selecting significantly differentially expressed genes and proteins and genetic markers for complex diseases, and high dimensional variable selection for identifying important molecules for understanding molecular mechanisms in pharmacogenomics. Their applications to gene network estimation and biomarker selection are used to illustrate the methodological power. Several new challenges of Big Data analysis, including complex data distribution, missing data, measurement error, spurious correlation, endogeneity, and the need for robust statistical methods, are also discussed. PMID:23602905
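The review's theme of regularized large covariance estimation can be illustrated with a generic entry-wise thresholding sketch. This is an illustrative scheme in the spirit of the family of estimators surveyed, not any specific method from the paper; the threshold `tau` and the data are invented:

```python
import numpy as np

def threshold_covariance(X, tau):
    """Entry-wise soft-thresholding of the sample covariance matrix.

    A generic regularization scheme for high-dimensional covariance
    estimation: off-diagonal entries smaller than tau in magnitude are
    shrunk toward zero, which sparsifies the estimated correlation
    structure. Illustrative only; the diagonal is left untouched."""
    S = np.cov(X, rowvar=False)            # p x p sample covariance
    off = S - np.diag(np.diag(S))          # off-diagonal part
    shrunk = np.sign(off) * np.maximum(np.abs(off) - tau, 0.0)
    return np.diag(np.diag(S)) + shrunk

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))              # n = 50 samples, p = 10 variables
S_hat = threshold_covariance(X, tau=0.1)
print(S_hat.shape)                         # (10, 10)
```

In the high-dimensional regime the review discusses (p comparable to or larger than n), the unregularized sample covariance is ill-conditioned, which is what motivates shrinkage schemes of this general shape.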
[Evaluation of using statistical methods in selected national medical journals].
Sych, Z
1996-01-01
This paper evaluates how frequently statistical methods were applied in papers published in six selected national medical journals in the years 1988-1992. The journals chosen were: Klinika Oczna, Medycyna Pracy, Pediatria Polska, Polski Tygodnik Lekarski, Roczniki Państwowego Zakładu Higieny, and Zdrowie Publiczne. From the respective volumes of Pol. Tyg. Lek., a number of papers corresponding to the average in the remaining journals was randomly selected. The analysis excluded papers in which no statistical analysis was implemented, for both national and international publications, as well as review papers, case reports, reviews of books, handbooks, and monographs, reports from scientific congresses, and papers on historical topics. The number of papers was determined for each volume. Each study was then classified by how its sample was obtained, distinguishing two categories: random and purposive selection. Attention was also paid to the presence of a control sample in individual papers, and to the completeness of the sample characteristics, using three categories: complete, partial, and lacking. The results of the studies are presented in tables and figures (Tab. 1, 3). The rate at which statistical methods were employed in the relevant volumes of the six selected national medical journals for the years 1988-1992 was analyzed, along with the number of papers in which no statistical methods were used and the frequency with which individual statistical methods were applied.
Prominence was given to fundamental methods of descriptive statistics (measures of position, measures of dispersion) and to the most important methods of mathematical statistics, such as parametric tests of significance, analysis of variance (in single and dual classifications), non-parametric tests of significance, and correlation and regression. Papers using multiple correlation, multiple regression, or more complex methods for studying relationships among two or more variables were counted among those using correlation and regression, as were other methods, e.g. statistical methods used in epidemiology (coefficients of incidence and morbidity, standardization of coefficients, survival tables), factor analysis by the Jacobi-Hotelling method, taxonomic methods, and others. On the basis of the performed studies, it was established that the frequency of statistical methods in the six selected national medical journals in the years 1988-1992 was 61.1-66.0% of the analyzed papers (Tab. 3), generally similar to the frequency reported for English-language medical journals. On the whole, no significant differences were found in the frequency of applied statistical methods (Tab. 4) or in the frequency of random samples (Tab. 3) in the analyzed papers across the years 1988-1992. The statistical methods most frequently used in the analyzed papers for 1988-1992 were measures of position (44.2-55.6%), measures of dispersion (32.5-38.5%), and parametric tests of significance (26.3-33.1% of the papers analyzed; Tab. 4). To increase the frequency and reliability of the statistical methods used, the teaching of biostatistics should be expanded in medical studies and in postgraduate training for physicians and scientific-didactic staff.
NASA Astrophysics Data System (ADS)
Yuan, Ye; Ries, Ludwig; Petermeier, Hannes; Steinbacher, Martin; Gómez-Peláez, Angel J.; Leuenberger, Markus C.; Schumacher, Marcus; Trickl, Thomas; Couret, Cedric; Meinhardt, Frank; Menzel, Annette
2018-03-01
Critical data selection is essential for determining representative baseline levels of atmospheric trace gases even at remote measurement sites. Different data selection techniques have been used around the world, which could potentially lead to reduced compatibility when comparing data from different stations. This paper presents a novel statistical data selection method named adaptive diurnal minimum variation selection (ADVS) based on CO2 diurnal patterns typically occurring at elevated mountain stations. Its capability and applicability were studied on records of atmospheric CO2 observations at six Global Atmosphere Watch stations in Europe, namely, Zugspitze-Schneefernerhaus (Germany), Sonnblick (Austria), Jungfraujoch (Switzerland), Izaña (Spain), Schauinsland (Germany), and Hohenpeissenberg (Germany). Three other frequently applied statistical data selection methods were included for comparison. Among the studied methods, our ADVS method resulted in a lower fraction of data selected as a baseline with lower maxima during winter and higher minima during summer in the selected data. The measured time series were analyzed for long-term trends and seasonality by a seasonal-trend decomposition technique. In contrast to unselected data, mean annual growth rates of all selected datasets were not significantly different among the sites, except for the data recorded at Schauinsland. However, clear differences were found in the annual amplitudes as well as the seasonal time structure. Based on a pairwise analysis of correlations between stations on the seasonal-trend decomposed components by statistical data selection, we conclude that the baseline identified by the ADVS method is a better representation of lower free tropospheric (LFT) conditions than baselines identified by the other methods.
ERIC Educational Resources Information Center
Jones, Andrew T.
2011-01-01
Practitioners often depend on item analysis to select items for exam forms and have a variety of options available to them. These include the point-biserial correlation, the agreement statistic, the B index, and the phi coefficient. Although research has demonstrated that these statistics can be useful for item selection, no research as of yet has…
Variable Selection in the Presence of Missing Data: Imputation-based Methods.
Zhao, Yize; Long, Qi
2017-01-01
Variable selection plays an essential role in regression analysis as it identifies important variables that are associated with outcomes and is known to improve the predictive accuracy of resulting models. Variable selection methods have been widely investigated for fully observed data. However, in the presence of missing data, methods for variable selection need to be carefully designed to account for missing data mechanisms and the statistical techniques used for handling missing data. Since imputation is arguably the most popular method for handling missing data due to its ease of use, statistical methods for variable selection that are combined with imputation are of particular interest. These methods, valid under the assumptions of missing at random (MAR) and missing completely at random (MCAR), largely fall into three general strategies. The first strategy applies existing variable selection methods to each imputed dataset and then combines variable selection results across all imputed datasets. The second strategy applies existing variable selection methods to stacked imputed datasets. The third strategy combines resampling techniques such as the bootstrap with imputation. Despite recent advances, this area remains under-developed and offers fertile ground for further research.
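The first strategy above can be sketched with scikit-learn, using `IterativeImputer` for multiple imputation and LASSO as the base selector, then keeping variables selected in most imputed datasets. This is a minimal illustration, not any specific estimator from the review; the number of imputations `m` and the selection-frequency `threshold` are arbitrary choices:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LassoCV

def select_across_imputations(X, y, m=5, threshold=0.6):
    """Strategy 1 sketch: impute m times, run LASSO on each completed
    dataset, and keep variables selected in at least a fraction
    `threshold` of the imputations."""
    counts = np.zeros(X.shape[1])
    for seed in range(m):
        # sample_posterior=True makes the imputations vary across seeds,
        # mimicking proper multiple imputation
        X_imp = IterativeImputer(random_state=seed,
                                 sample_posterior=True).fit_transform(X)
        lasso = LassoCV(cv=5, random_state=seed).fit(X_imp, y)
        counts += (lasso.coef_ != 0)
    return np.where(counts / m >= threshold)[0]

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))
y = 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(size=100)  # only x0, x1 matter
X[rng.random(X.shape) < 0.1] = np.nan                 # 10% MCAR missingness
selected = select_across_imputations(X, y)
print(selected)
```

The stacked-dataset strategy (strategy 2) would instead concatenate the m completed datasets and run a single weighted selection, trading per-imputation variability for a single combined fit.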
Xu, Rengyi; Mesaros, Clementina; Weng, Liwei; Snyder, Nathaniel W; Vachani, Anil; Blair, Ian A; Hwang, Wei-Ting
2017-07-01
We compared three statistical methods for selecting a panel of serum lipid biomarkers for mesothelioma and asbestos exposure. Serum samples from mesothelioma patients, asbestos-exposed subjects, and controls (40 per group) were analyzed. Three variable selection methods were considered: top-ranked predictors from univariate models, stepwise selection, and the least absolute shrinkage and selection operator (LASSO). Cross-validated area under the receiver operating characteristic curve was used to compare prediction performance, and lipids with high cross-validated area under the curve were identified. A lipid with a mass-to-charge ratio of 372.31 was selected by all three methods when comparing mesothelioma versus control. Lipids with mass-to-charge ratios of 1464.80 and 329.21 were selected by two models for asbestos exposure versus control. The different methods selected a similar set of serum lipids. Combining candidate biomarkers can improve prediction.
Plant selection for ethnobotanical uses on the Amalfi Coast (Southern Italy).
Savo, V; Joy, R; Caneva, G; McClatchey, W C
2015-07-15
Many ethnobotanical studies have investigated selection criteria for medicinal and non-medicinal plants. In this paper we test several statistical methods on different ethnobotanical datasets in order to 1) determine to what extent the nature of the datasets can affect the interpretation of results, and 2) determine whether the selection of plants for different uses is based on phylogeny or on other selection criteria. We considered three ethnobotanical datasets: two datasets of medicinal plants and one dataset of non-medicinal plants (handicraft production, domestic and agro-pastoral practices), together with two floras of the Amalfi Coast. We performed residual analysis from linear regression, the binomial test, and a Bayesian approach to identify under-used and over-used plant families within the ethnobotanical datasets. Percentages of agreement were calculated to compare the results of the analyses. We also analyzed the relationship between plant selection and phylogeny, chorology, life form, and habitat using the chi-square test. Pearson's residuals for each of the significant chi-square analyses were examined to investigate alternative hypotheses about plant selection criteria. The results of the three statistical methods differed within the same dataset, and between datasets and floras, but with some similarities. In the two medicinal datasets, only Lamiaceae was identified in both floras as an over-used family by all three statistical methods. For one flora, all statistical methods agreed that Malvaceae was over-used and Poaceae under-used, but this was not consistent with results for the second flora, in which one statistical result was non-significant. All other families showed some discrepancy in significance across methods or floras. Significant over- or under-use was observed in only a minority of cases. The chi-square analyses were significant for phylogeny, life form, and habitat.
Pearson's residuals indicated a non-random selection of woody species for non-medicinal uses and an under-use of plants of temperate forests for medicinal uses. Our study showed that selection criteria for plant uses (including medicinal) are not always based on phylogeny. The comparison of different statistical methods (regression, binomial and Bayesian) under different conditions led to the conclusion that the most conservative results are obtained using regression analysis.
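The binomial test used above for flagging over- or under-used families can be sketched as follows. The counts are invented for illustration; the real analysis compared family representation in the ethnobotanical datasets against the two floras:

```python
from scipy.stats import binomtest

def family_use_test(used_in_family, used_total, flora_in_family, flora_total):
    """Binomial test sketch: is a family cited in the ethnobotanical
    dataset more (or less) often than its share of the local flora
    would predict under random selection?"""
    expected_p = flora_in_family / flora_total
    return binomtest(used_in_family, used_total, expected_p,
                     alternative="two-sided")

# Hypothetical counts: 30 of 200 use records belong to one family,
# which makes up 40 of 1000 species in the flora (expected share 4%).
res = family_use_test(30, 200, 40, 1000)
print(res.pvalue < 0.01)  # True: 15% observed vs 4% expected, over-used
```

The regression and Bayesian approaches mentioned in the abstract address the same question with different assumptions about variability across families, which is why the paper compares their agreement.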
Conditional maximum-entropy method for selecting prior distributions in Bayesian statistics
NASA Astrophysics Data System (ADS)
Abe, Sumiyoshi
2014-11-01
The conditional maximum-entropy method (abbreviated here as C-MaxEnt) is formulated for selecting prior probability distributions in Bayesian statistics for parameter estimation. This method is inspired by a statistical-mechanical approach to systems governed by dynamics with largely separated time scales and is based on three key concepts: conjugate pairs of variables, dimensionless integration measures with coarse-graining factors and partial maximization of the joint entropy. The method enables one to calculate a prior purely from a likelihood in a simple way. It is shown, in particular, how it not only yields Jeffreys's rules but also reveals new structures hidden behind them.
Petersson, K M; Nichols, T E; Poline, J B; Holmes, A P
1999-01-01
Functional neuroimaging (FNI) provides experimental access to the intact living brain, making it possible to study higher cognitive functions in humans. In this review and in a companion paper in this issue, we discuss some common methods used to analyse FNI data. The emphasis in both papers is on the assumptions and limitations of the methods reviewed. There are several methods available to analyse FNI data, indicating that none is optimal for all purposes. In order to make optimal use of the methods available, it is important to know their limits of applicability. For the interpretation of FNI results it is also important to take into account the assumptions, approximations, and inherent limitations of the methods used. This paper gives a brief overview of some non-inferential descriptive methods and common statistical models used in FNI. Issues relating to the complex problem of model selection are discussed. In general, proper model selection is a necessary prerequisite for the validity of the subsequent statistical inference. The non-inferential section describes methods that, combined with inspection of parameter estimates and other simple measures, can aid in the process of model selection and verification of assumptions. The section on statistical models covers approaches to global normalization and some aspects of univariate, multivariate, and Bayesian models. Finally, approaches to functional connectivity and effective connectivity are discussed. In the companion paper we review issues related to signal detection and statistical inference. PMID:10466149
DOT National Transportation Integrated Search
1979-03-01
There are several conditions that can influence the calculation of the statistical validity of a test battery such as that used to selected Air Traffic Control Specialists. Two conditions of prime importance to statistical validity are recruitment pr...
Ries, Kernell G.; Eng, Ken
2010-01-01
The U.S. Geological Survey, in cooperation with the Maryland Department of the Environment, operated a network of 20 low-flow partial-record stations during 2008 in a region that extends from southwest of Baltimore to the northeastern corner of Maryland to obtain estimates of selected streamflow statistics at the station locations. The study area is expected to face a substantial influx of new residents and businesses as a result of military and civilian personnel transfers associated with the Federal Base Realignment and Closure Act of 2005. The estimated streamflow statistics, which include monthly 85-percent duration flows, the 10-year recurrence-interval minimum base flow, and the 7-day, 10-year low flow, are needed to provide a better understanding of the availability of water resources in the area to be affected by base-realignment activities. Streamflow measurements collected for this study at the low-flow partial-record stations and measurements collected previously for 8 of the 20 stations were related to concurrent daily flows at nearby index streamgages to estimate the streamflow statistics. Three methods were used to estimate the streamflow statistics and two methods were used to select the index streamgages. Of the three methods used to estimate the streamflow statistics, two of them--the Moments and MOVE1 methods--rely on correlating the streamflow measurements at the low-flow partial-record stations with concurrent streamflows at nearby, hydrologically similar index streamgages to determine the estimates. These methods, recommended for use by the U.S. Geological Survey, generally require about 10 streamflow measurements at the low-flow partial-record station. The third method transfers the streamflow statistics from the index streamgage to the partial-record station based on the average of the ratios of the measured streamflows at the partial-record station to the concurrent streamflows at the index streamgage. 
This method can be used with as few as one pair of streamflow measurements made on a single streamflow recession at the low-flow partial-record station, although additional pairs of measurements will increase the accuracy of the estimates. Errors associated with the two correlation methods generally were lower than the errors associated with the flow-ratio method, but the advantages of the flow-ratio method are that it can produce reasonably accurate estimates from streamflow measurements much faster and at lower cost than estimates obtained using the correlation methods. The two index-streamgage selection methods were (1) selection based on the highest correlation coefficient between the low-flow partial-record station and the index streamgages, and (2) selection based on Euclidean distance, where the Euclidean distance was computed as a function of geographic proximity and the basin characteristics: drainage area, percentage of forested area, percentage of impervious area, and the base-flow recession time constant, t. Method 1 generally selected index streamgages that were significantly closer to the low-flow partial-record stations than method 2. The errors associated with the estimated streamflow statistics generally were lower for method 1 than for method 2, but the differences were not statistically significant. The flow-ratio method for estimating streamflow statistics at low-flow partial-record stations was shown to be independent of the two correlation-based estimation methods. As a result, final estimates were determined for eight low-flow partial-record stations by weighting estimates from the flow-ratio method with estimates from one of the two correlation methods according to the respective variances of the estimates. Average standard errors of estimate for the final estimates ranged from 7.0 to 90.0 percent, with an average value of 26.5 percent.
Average standard errors of estimate for the weighted estimates were, on average, 4.3 percent less than the best average standard errors of estimate.
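The MOVE.1 correlation method mentioned above can be sketched in its standard Hirsch (1982) form: fit a variance-preserving line between log flows at the partial-record station and concurrent log flows at the index streamgage, then transfer a statistic from the gage to the station. The data below are hypothetical, not the report's measurements or record lengths:

```python
import numpy as np

def move1(partial_record_q, index_q_concurrent, index_q_target):
    """MOVE.1 (Maintenance Of Variance Extension, type 1) sketch.

    Unlike ordinary regression, the slope is the ratio of standard
    deviations, so the transferred estimates preserve the variance of
    the station's flows; the sign comes from the correlation."""
    x = np.log(index_q_concurrent)
    y = np.log(partial_record_q)
    slope = np.std(y, ddof=1) / np.std(x, ddof=1)
    sign = np.sign(np.corrcoef(x, y)[0, 1])
    y_hat = y.mean() + sign * slope * (np.log(index_q_target) - x.mean())
    return np.exp(y_hat)

# Hypothetical measurements (cfs): 10 concurrent pairs, then a low-flow
# statistic (e.g. the 7-day, 10-year low flow) transferred from the gage.
rng = np.random.default_rng(2)
idx = np.exp(rng.normal(3.0, 0.5, 10))                      # index gage flows
prt = np.exp(0.5 + 0.9 * np.log(idx) + rng.normal(0, 0.05, 10))
est = move1(prt, idx, index_q_target=12.0)
print(est > 0)  # True: transferred estimate is a positive flow
```

The flow-ratio method described in the report is simpler still: it multiplies the index-gage statistic by the average ratio of measured station flow to concurrent gage flow.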
Garud, Nandita R; Rosenberg, Noah A
2015-06-01
Soft selective sweeps represent an important form of adaptation in which multiple haplotypes bearing adaptive alleles rise to high frequency. Most statistical methods for detecting selective sweeps from genetic polymorphism data, however, have focused on identifying hard selective sweeps in which a favored allele appears on a single haplotypic background; these methods might be underpowered to detect soft sweeps. Among the exceptions is the set of haplotype homozygosity statistics introduced for the detection of soft sweeps by Garud et al. (2015). These statistics, examining frequencies of multiple haplotypes in relation to each other, include H12, a statistic designed to identify both hard and soft selective sweeps, and H2/H1, a statistic that, conditional on high H12 values, seeks to distinguish between hard and soft sweeps. A challenge in the use of H2/H1 is that its range depends on the associated value of H12, so that equal H2/H1 values might provide different levels of support for a soft sweep model at different values of H12. Here, we enhance the H12 and H2/H1 haplotype homozygosity statistics for selective sweep detection by deriving the upper bound on H2/H1 as a function of H12, thereby generating a statistic that normalizes H2/H1 to lie between 0 and 1. Through a reanalysis of resequencing data from inbred lines of Drosophila, we show that the enhanced statistic both strengthens interpretations obtained with the unnormalized statistic and leads to empirical insights that are less readily apparent without the normalization. Copyright © 2015 Elsevier Inc. All rights reserved.
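The underlying statistics can be computed directly from sorted haplotype frequencies, following the definitions in Garud et al. (2015). The paper's contribution, the derived upper bound on H2/H1 given H12 used for normalization, is not reproduced here:

```python
def haplotype_homozygosity(freqs):
    """H1, H2, and H12 from haplotype frequencies (Garud et al. 2015):
    H1 = sum of squared frequencies (expected haplotype homozygosity);
    H12 pools the two most frequent haplotypes, H12 = H1 + 2*p1*p2,
    boosting power for soft sweeps; H2 drops the most frequent one,
    H2 = H1 - p1^2, so H2/H1 is large when secondary haplotypes are
    common (soft sweep) and small when one haplotype dominates (hard)."""
    p = sorted(freqs, reverse=True)
    h1 = sum(f * f for f in p)
    h12 = h1 + 2 * p[0] * p[1]
    h2 = h1 - p[0] ** 2
    return h1, h2, h12

# A soft-sweep-like configuration: two common haplotypes.
h1, h2, h12 = haplotype_homozygosity([0.45, 0.40, 0.10, 0.05])
print(round(h2 / h1, 3), round(h12, 3))  # prints: 0.46 0.735
```

A hard-sweep-like configuration such as `[0.85, 0.05, 0.05, 0.05]` yields a similarly high H12 but a much smaller H2/H1, which is exactly the contrast the H2/H1 statistic exploits.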
Brandt, Laura A.; Benscoter, Allison; Harvey, Rebecca G.; Speroterra, Carolina; Bucklin, David N.; Romañach, Stephanie; Watling, James I.; Mazzotti, Frank J.
2017-01-01
Climate envelope models are widely used to describe the potential future distribution of species under different climate change scenarios. It is broadly recognized that there are both strengths and limitations to using climate envelope models and that outcomes are sensitive to initial assumptions, inputs, and modeling methods. Selection of predictor variables, a central step in modeling, is one of the areas where different techniques can yield varying results. Selection of climate variables to use as predictors is often done using statistical approaches that develop correlations between occurrences and climate data. These approaches have been criticized for relying on the statistical properties of the data rather than directly incorporating biological information about species responses to temperature and precipitation. We evaluated and compared models and prediction maps for 15 threatened or endangered species in Florida based on two variable selection techniques: expert opinion and a statistical method. We compared model performance between the two approaches for contemporary predictions, as well as the spatial correlation, spatial overlap, and area predicted for contemporary and future climate predictions. In general, experts identified more variables as important than the statistical method did, and there was low overlap in the variable sets (<40%) between the two methods. Despite these differences in variable sets (expert versus statistical), models had high performance metrics (>0.9 for area under the curve (AUC) and >0.7 for true skill statistic (TSS)). Spatial overlap, which compares the spatial configuration between maps constructed using the different variable selection techniques, was only moderate overall (about 60%), with a great deal of variability across species. Differences in spatial overlap were even greater under future climate projections, indicating additional divergence of model outputs from different variable selection techniques.
Our work is in agreement with other studies which have found that for broad-scale species distribution modeling, using statistical methods of variable selection is a useful first step, especially when there is a need to model a large number of species or expert knowledge of the species is limited. Expert input can then be used to refine models that seem unrealistic or for species that experts believe are particularly sensitive to change. It also emphasizes the importance of using multiple models to reduce uncertainty and improve map outputs for conservation planning. Where outputs overlap or show the same direction of change there is greater certainty in the predictions. Areas of disagreement can be used for learning by asking why the models do not agree, and may highlight areas where additional on-the-ground data collection could improve the models.
Continuum radiation from active galactic nuclei: A statistical study
NASA Technical Reports Server (NTRS)
Isobe, T.; Feigelson, E. D.; Singh, K. P.; Kembhavi, A.
1986-01-01
The physics of the continuum spectrum of active galactic nuclei (AGNs) was examined using a large data set and rigorous statistical methods. A data base was constructed for 469 objects, including radio-selected quasars, optically selected quasars, X-ray-selected AGNs, BL Lac objects, and optically unidentified compact radio sources. Each object has measurements of its radio, optical, and X-ray core continuum luminosities, though many of them are upper limits. Since many radio sources have extended components, the core components were carefully separated out from the total radio luminosity. With survival analysis statistical methods, which can treat upper limits correctly, these data can yield better statistical results than those previously obtained. A variety of statistical tests are performed, such as comparison of the luminosity functions in different subsamples and linear regressions of luminosities in different bands. Interpretation of the results leads to the following tentative conclusions: the main emission mechanism of optically selected quasars and X-ray-selected AGNs is thermal, while that of BL Lac objects is synchrotron; radio-selected quasars may have two different emission mechanisms in the X-ray band; BL Lac objects appear to be special cases of the radio-selected quasars; some compact radio sources show the possibility of synchrotron self-Compton (SSC) emission in the optical band; and the spectral index between the optical and X-ray bands depends on the optical luminosity.
Xu, Cheng-Jian; van der Schaaf, Arjen; Schilstra, Cornelis; Langendijk, Johannes A; van't Veld, Aart A
2012-03-15
To study the impact of different statistical learning methods on the prediction performance of multivariate normal tissue complication probability (NTCP) models. In this study, three learning methods, stepwise selection, least absolute shrinkage and selection operator (LASSO), and Bayesian model averaging (BMA), were used to build NTCP models of xerostomia following radiotherapy treatment for head and neck cancer. Performance of each learning method was evaluated by a repeated cross-validation scheme in order to obtain a fair comparison among methods. It was found that the LASSO and BMA methods produced models with significantly better predictive power than the stepwise selection method. Furthermore, the LASSO method yields a model that is as easily interpretable as the stepwise method's, in contrast to the less intuitive BMA method. The commonly used stepwise selection method, which is simple to execute, may be insufficient for NTCP modeling. The LASSO method is recommended. Copyright © 2012 Elsevier Inc. All rights reserved.
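The LASSO-plus-repeated-cross-validation scheme can be sketched with scikit-learn on synthetic data. L1-penalized logistic regression stands in for the NTCP model; the regularization strength `C=0.5`, the data, and the fold counts are illustrative choices, not values from the study:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for dose/clinical predictors and a binary
# complication outcome; nothing here reproduces the paper's data.
X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           random_state=0)

# L1 (LASSO-type) penalized logistic regression zeroes out weak
# predictors, giving the sparse, interpretable model the paper favors.
model = make_pipeline(StandardScaler(),
                      LogisticRegression(penalty="l1", solver="liblinear",
                                         C=0.5))

# Repeated cross-validated AUC, mirroring the repeated CV comparison
# scheme described in the abstract.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(auc.mean() > 0.7)
```

In practice the penalty strength would itself be tuned inside each training fold (nested cross-validation) to avoid an optimistic performance estimate.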
ERIC Educational Resources Information Center
Nielsen, Richard A.
2016-01-01
This article shows how statistical matching methods can be used to select "most similar" cases for qualitative analysis. I first offer a methodological justification for research designs based on selecting most similar cases. I then discuss the applicability of existing matching methods to the task of selecting most similar cases and…
Two Paradoxes in Linear Regression Analysis.
Feng, Ge; Peng, Jing; Tu, Dongke; Zheng, Julia Z; Feng, Changyong
2016-12-25
Regression is one of the favorite tools in applied statistics. However, misuse and misinterpretation of results from regression analysis are common in biomedical research. In this paper we use statistical theory and simulation studies to clarify some paradoxes around this popular statistical method. In particular, we show that a widely used model selection procedure employed in many publications in top medical journals is wrong. Formal procedures based on solid statistical theory should be used in model selection.
Curve fitting and modeling with splines using statistical variable selection techniques
NASA Technical Reports Server (NTRS)
Smith, P. L.
1982-01-01
The successful application of statistical variable selection techniques to fit splines is demonstrated. Major emphasis is given to knot selection, but order determination is also discussed. Two FORTRAN backward elimination programs, using the B-spline basis, were developed. The program for knot elimination is compared in detail with two other spline-fitting methods and several statistical software packages. An example is also given for the two-variable case using a tensor product basis, with a theoretical discussion of the difficulties of their use.
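The backward elimination over knots can be sketched in Python (the report's programs are in FORTRAN; this only illustrates the idea, and the stopping rule `tol` is an arbitrary stand-in for a formal statistical test):

```python
import numpy as np
from scipy.interpolate import make_lsq_spline

def fit_rss(x, y, interior, k=3):
    """Residual sum of squares of the least-squares cubic B-spline
    with the given interior knots (boundary knots repeated k+1 times)."""
    t = np.r_[[x[0]] * (k + 1), np.sort(interior), [x[-1]] * (k + 1)]
    spl = make_lsq_spline(x, y, t, k)
    return float(np.sum((y - spl(x)) ** 2))

def backward_knot_elimination(x, y, interior, k=3, tol=1.05):
    """Drop, one at a time, the interior knot whose removal increases
    the RSS the least; stop when any removal would inflate the RSS by
    more than the factor `tol`."""
    interior = list(interior)
    rss = fit_rss(x, y, interior, k)
    while interior:
        trials = [(fit_rss(x, y, interior[:i] + interior[i + 1:], k), i)
                  for i in range(len(interior))]
        best_rss, best_i = min(trials)
        if best_rss > tol * rss:
            break
        del interior[best_i]
        rss = best_rss
    return interior

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.05, 200)
kept = backward_knot_elimination(x, y, interior=np.linspace(0.1, 0.9, 9))
print(len(kept) <= 9)  # True: starts from 9 candidate knots
```

Since knot removal yields nested models, the RSS is non-decreasing along the elimination path; a production version would replace the `tol` factor with the F-type test implied by the report's statistical variable selection framing.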
Fitting multidimensional splines using statistical variable selection techniques
NASA Technical Reports Server (NTRS)
Smith, P. L.
1982-01-01
This report demonstrates the successful application of statistical variable selection techniques to fit splines. Major emphasis is given to knot selection, but order determination is also discussed. Two FORTRAN backward elimination programs using the B-spline basis were developed, and the one for knot elimination is compared in detail with two other spline-fitting methods and several statistical software packages. An example is also given for the two-variable case using a tensor product basis, with a theoretical discussion of the difficulties of their use.
Cao, Hui; Markatou, Marianthi; Melton, Genevieve B; Chiang, Michael F; Hripcsak, George
2005-01-01
This paper applies co-occurrence statistics to discover disease-finding associations in a clinical data warehouse. We used two methods, χ2 statistics and the proportion confidence interval (PCI) method, to measure the dependence of pairs of diseases and findings, and then used heuristic cutoff values for association selection. An intrinsic evaluation showed that 94 percent of disease-finding associations obtained by χ2 statistics and 76.8 percent obtained by the PCI method were true associations. The selected associations were used to construct knowledge bases of disease-finding relations (KB-χ2, KB-PCI). An extrinsic evaluation showed that both KB-χ2 and KB-PCI could assist in eliminating clinically non-informative and redundant findings from problem lists generated by our automated problem list summarization system.
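As a minimal sketch of the χ2 approach, the dependence of one disease-finding pair can be scored from a 2×2 co-occurrence table; the counts below are made up for illustration, not taken from the study.

```python
from scipy.stats import chi2_contingency

# Hypothetical 2x2 co-occurrence counts from a clinical data warehouse:
# rows = disease present/absent, columns = finding present/absent.
table = [[120, 30],
         [40, 310]]

chi2, p, dof, expected = chi2_contingency(table)

# A heuristic cutoff, e.g. the 5% critical value for 1 degree of
# freedom (3.84), flags the pair as a candidate association.
associated = chi2 > 3.84
print(dof, associated)
```

In practice each disease-finding pair in the warehouse would be scored this way and pairs above the cutoff retained for the knowledge base.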
Frequentist Model Averaging in Structural Equation Modelling.
Jin, Shaobo; Ankargren, Sebastian
2018-06-04
Model selection from a set of candidate models plays an important role in many structural equation modelling applications. However, traditional model selection methods introduce extra randomness that is not accounted for by post-model selection inference. In the current study, we propose a model averaging technique within the frequentist statistical framework. Instead of selecting an optimal model, the contributions of all candidate models are acknowledged. Valid confidence intervals and a [Formula: see text] test statistic are proposed. A simulation study shows that the proposed method is able to produce a robust mean-squared error, a better coverage probability, and a better goodness-of-fit test compared to model selection. It is an interesting compromise between model selection and the full model.
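The general idea of frequentist model averaging can be illustrated with simple smoothed-AIC weights. This is a generic sketch with hypothetical numbers, not the specific weighting scheme or test statistic proposed in the article.

```python
import numpy as np

# Hypothetical AIC values and parameter estimates for three candidate
# structural equation models (illustrative numbers only).
aic = np.array([100.0, 101.5, 106.0])
estimates = np.array([0.30, 0.42, 0.55])

# Smoothed-AIC weights: every candidate model contributes in
# proportion to exp(-AIC/2) instead of the best model taking all.
w = np.exp(-(aic - aic.min()) / 2)
w /= w.sum()
averaged = float(w @ estimates)
print(round(averaged, 4))
```

The averaged estimate sits between the candidates, dominated by the models with the lowest AIC; valid post-averaging inference additionally requires accounting for the randomness of the weights, which is the harder problem the article addresses.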
Two Paradoxes in Linear Regression Analysis
FENG, Ge; PENG, Jing; TU, Dongke; ZHENG, Julia Z.; FENG, Changyong
2016-01-01
Regression is one of the favorite tools in applied statistics. However, misuse and misinterpretation of results from regression analysis are common in biomedical research. In this paper we use statistical theory and simulation studies to clarify some paradoxes around this popular statistical method. In particular, we show that a widely used model selection procedure employed in many publications in top medical journals is wrong. Formal procedures based on solid statistical theory should be used in model selection. PMID:28638214
A Selective Review of Group Selection in High-Dimensional Models
Huang, Jian; Breheny, Patrick; Ma, Shuangge
2013-01-01
Grouping structures arise naturally in many statistical modeling problems. Several methods have been proposed for variable selection that respect grouping structure in variables. Examples include the group LASSO and several concave group selection methods. In this article, we give a selective review of group selection concerning methodological developments, theoretical properties and computational algorithms. We pay particular attention to group selection methods involving concave penalties. We address both group selection and bi-level selection methods. We describe several applications of these methods in nonparametric additive models, semiparametric regression, seemingly unrelated regressions, genomic data analysis and genome wide association studies. We also highlight some issues that require further study. PMID:24174707
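The group LASSO mentioned above rests on a groupwise soft-thresholding (proximal) step, which either shrinks an entire coefficient group or zeroes it out together; a minimal NumPy sketch of that operator follows (group-size scaling of the penalty is omitted for simplicity).

```python
import numpy as np

def group_soft_threshold(beta, lam):
    """Proximal operator of the group-LASSO penalty lam * ||beta||_2:
    shrinks the whole coefficient group toward zero, and sets it
    exactly to zero when its norm falls below lam."""
    norm = np.linalg.norm(beta)
    if norm <= lam:
        return np.zeros_like(beta)
    return (1 - lam / norm) * beta

g = np.array([3.0, 4.0])               # a group with ||g||_2 = 5
print(group_soft_threshold(g, 1.0))    # shrunk but kept: [2.4, 3.2]
print(group_soft_threshold(g, 6.0))    # whole group zeroed out
```

This all-in-or-all-out behavior is exactly what makes the penalty respect grouping structure; concave group penalties modify the shrinkage rule to reduce bias on large groups.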
ERIC Educational Resources Information Center
Henry, Gary T.; And Others
1992-01-01
A statistical technique is presented for developing performance standards based on benchmark groups. The benchmark groups are selected using a multivariate technique that relies on a squared Euclidean distance method. For each observation unit (a school district in the example), a unique comparison group is selected. (SLD)
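A minimal sketch of selecting a benchmark comparison group by squared Euclidean distance; the standardized district profiles below are hypothetical, not the article's data.

```python
import numpy as np

# Hypothetical standardized profiles (rows = school districts,
# columns = covariates such as enrollment, spending, poverty rate).
districts = np.array([
    [0.0, 0.0, 0.0],    # district of interest
    [0.1, -0.2, 0.1],
    [2.0, 1.5, -1.0],
    [0.3, 0.1, -0.1],
])

target = districts[0]
# Squared Euclidean distance from the target to every other district.
d2 = ((districts[1:] - target) ** 2).sum(axis=1)
# Benchmark group: the k most similar districts (k = 2 here).
benchmark = np.argsort(d2)[:2] + 1
print(benchmark)   # indices of the two nearest districts
```

Each observation unit gets its own benchmark group this way, so performance standards are set against comparable peers rather than a statewide average.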
Cao, Hui; Markatou, Marianthi; Melton, Genevieve B.; Chiang, Michael F.; Hripcsak, George
2005-01-01
This paper applies co-occurrence statistics to discover disease-finding associations in a clinical data warehouse. We used two methods, χ2 statistics and the proportion confidence interval (PCI) method, to measure the dependence of pairs of diseases and findings, and then used heuristic cutoff values for association selection. An intrinsic evaluation showed that 94 percent of disease-finding associations obtained by χ2 statistics and 76.8 percent obtained by the PCI method were true associations. The selected associations were used to construct knowledge bases of disease-finding relations (KB-χ2, KB-PCI). An extrinsic evaluation showed that both KB-χ2 and KB-PCI could assist in eliminating clinically non-informative and redundant findings from problem lists generated by our automated problem list summarization system. PMID:16779011
Vallée, Julie; Souris, Marc; Fournet, Florence; Bochaton, Audrey; Mobillion, Virginie; Peyronnie, Karine; Salem, Gérard
2007-01-01
Background: Geographical objectives and probabilistic methods are difficult to reconcile in a unique health survey. Probabilistic methods focus on individuals to provide estimates of a variable's prevalence with a certain precision, while geographical approaches emphasise the selection of specific areas to study interactions between spatial characteristics and health outcomes. A sample selected from a small number of specific areas creates statistical challenges: the observations are not independent at the local level, and this results in poor statistical validity at the global level. Therefore, it is difficult to construct a sample that is appropriate for both geographical and probability methods. Methods: We used a two-stage selection procedure with a first non-random stage of selection of clusters. Instead of randomly selecting clusters, we deliberately chose a group of clusters, which as a whole would contain all the variation in health measures in the population. As there was no health information available before the survey, we selected a priori determinants that can influence the spatial homogeneity of the health characteristics. This method yields a distribution of variables in the sample that closely resembles that in the overall population, something that cannot be guaranteed with randomly-selected clusters, especially if the number of selected clusters is small. In this way, we were able to survey specific areas while minimising design effects and maximising statistical precision. Application: We applied this strategy in a health survey carried out in Vientiane, Lao People's Democratic Republic. We selected well-known health determinants with unequal spatial distribution within the city: nationality and literacy. We deliberately selected a combination of clusters whose distribution of nationality and literacy is similar to the distribution in the general population. 
Conclusion: This paper describes the conceptual reasoning behind the construction of the survey sample and shows that it can be advantageous to choose clusters using reasoned hypotheses, based on both probability and geographical approaches, in contrast to a conventional, random cluster selection strategy. PMID:17543100
2012-01-01
Background: The detection of conserved residue clusters on a protein structure is one of the effective strategies for the prediction of functional protein regions. Various methods, such as Evolutionary Trace, have been developed based on this strategy. In such approaches, the conserved residues are identified through comparisons of homologous amino acid sequences. Therefore, the selection of homologous sequences is a critical step. It is empirically known that a certain degree of sequence divergence in the set of homologous sequences is required for the identification of conserved residues. However, the development of a method to select homologous sequences appropriate for the identification of conserved residues has not been sufficiently addressed. An objective and general method to select appropriate homologous sequences is desired for the efficient prediction of functional regions. Results: We have developed a novel index to select the sequences appropriate for the identification of conserved residues, and implemented the index within our method to predict the functional regions of a protein. The implementation of the index improved the performance of the functional region prediction. The index represents the degree of conserved residue clustering on the tertiary structure of the protein. For this purpose, the structure and sequence information were integrated within the index by the application of spatial statistics. Spatial statistics is a field of statistics in which not only the attributes but also the geometrical coordinates of the data are considered simultaneously. Higher degrees of clustering generate larger index scores. We adopted the set of homologous sequences with the highest index score, under the assumption that the best prediction accuracy is obtained when the degree of clustering is the maximum. 
The set of sequences selected by the index led to higher functional region prediction performance than the sets of sequences selected by other sequence-based methods. Conclusions: Appropriate homologous sequences are selected automatically and objectively by the index. Such sequence selection improved the performance of functional region prediction. As far as we know, this is the first approach in which spatial statistics have been applied to protein analyses. Such integration of structure and sequence information would be useful for other bioinformatics problems. PMID:22643026
Howard B. Stauffer; Cynthia J. Zabel; Jeffrey R. Dunk
2005-01-01
We compared a set of competing logistic regression habitat selection models for Northern Spotted Owls (Strix occidentalis caurina) in California. The habitat selection models were estimated, compared, evaluated, and tested using multiple sample datasets collected on federal forestlands in northern California. We used Bayesian methods in interpreting...
Random forest (RF) modeling has emerged as an important statistical learning method in ecology due to its exceptional predictive performance. However, for large and complex ecological datasets there is limited guidance on variable selection methods for RF modeling. Typically, e...
Lin, Kao; Li, Haipeng; Schlötterer, Christian; Futschik, Andreas
2011-01-01
Summary statistics are widely used in population genetics, but they suffer from the drawback that no simple sufficient summary statistic exists that captures all the information required to distinguish different evolutionary hypotheses. Here, we apply boosting, a recent statistical method that combines simple classification rules to maximize their joint predictive performance. We show that our implementation of boosting has high power to detect selective sweeps. Demographic events, such as bottlenecks, do not result in a large excess of false positives. A comparison shows that our boosting implementation performs well relative to other neutrality tests. Furthermore, we evaluated the relative contribution of different summary statistics to the identification of selection and found that integrated haplotype homozygosity is very informative for recent sweeps, whereas older sweeps are better detected by Tajima's π. Overall, Watterson's θ was found to contribute the most information for distinguishing between bottlenecks and selection. PMID:21041556
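The approach can be sketched with a generic gradient-boosting classifier trained on simulated summary statistics. The feature names, distributions, and effect sizes below are illustrative assumptions, not the authors' simulation design.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 400
# Hypothetical summary statistics per genomic window; sweeps shift
# their joint distribution (columns loosely standing in for integrated
# haplotype homozygosity, Tajima's pi, and Watterson's theta).
neutral = rng.normal(0.0, 1.0, size=(n, 3))
sweep = rng.normal([1.5, -1.0, -0.8], 1.0, size=(n, 3))
X = np.vstack([neutral, sweep])
y = np.array([0] * n + [1] * n)

# Boosting combines many weak classifiers into one predictor that
# uses all summary statistics jointly, rather than testing each alone.
clf = GradientBoostingClassifier(random_state=0).fit(X, y)
print(clf.score(X, y) > 0.8)
```

Feature-importance scores from such a model are one way to ask which summary statistics contribute most to detecting selection, echoing the comparison reported in the abstract.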
Vallée, Julie; Souris, Marc; Fournet, Florence; Bochaton, Audrey; Mobillion, Virginie; Peyronnie, Karine; Salem, Gérard
2007-06-01
Geographical objectives and probabilistic methods are difficult to reconcile in a unique health survey. Probabilistic methods focus on individuals to provide estimates of a variable's prevalence with a certain precision, while geographical approaches emphasise the selection of specific areas to study interactions between spatial characteristics and health outcomes. A sample selected from a small number of specific areas creates statistical challenges: the observations are not independent at the local level, and this results in poor statistical validity at the global level. Therefore, it is difficult to construct a sample that is appropriate for both geographical and probability methods. We used a two-stage selection procedure with a first non-random stage of selection of clusters. Instead of randomly selecting clusters, we deliberately chose a group of clusters, which as a whole would contain all the variation in health measures in the population. As there was no health information available before the survey, we selected a priori determinants that can influence the spatial homogeneity of the health characteristics. This method yields a distribution of variables in the sample that closely resembles that in the overall population, something that cannot be guaranteed with randomly-selected clusters, especially if the number of selected clusters is small. In this way, we were able to survey specific areas while minimising design effects and maximising statistical precision. We applied this strategy in a health survey carried out in Vientiane, Lao People's Democratic Republic. We selected well-known health determinants with unequal spatial distribution within the city: nationality and literacy. We deliberately selected a combination of clusters whose distribution of nationality and literacy is similar to the distribution in the general population. 
This paper describes the conceptual reasoning behind the construction of the survey sample and shows that it can be advantageous to choose clusters using reasoned hypotheses, based on both probability and geographical approaches, in contrast to a conventional, random cluster selection strategy.
Austin, Peter C; Schuster, Tibor; Platt, Robert W
2015-10-15
Estimating statistical power is an important component of the design of both randomized controlled trials (RCTs) and observational studies. Methods for estimating statistical power in RCTs have been well described and can be implemented simply. In observational studies, statistical methods must be used to remove the effects of confounding that can occur due to non-random treatment assignment. Inverse probability of treatment weighting (IPTW) using the propensity score is an attractive method for estimating the effects of treatment using observational data. However, sample size and power calculations have not been adequately described for these methods. We used an extensive series of Monte Carlo simulations to compare the statistical power of an IPTW analysis of an observational study with time-to-event outcomes with that of an analysis of a similarly structured RCT. We examined the impact of four factors on the statistical power function: number of observed events, prevalence of treatment, the marginal hazard ratio, and the strength of the treatment-selection process. We found that, on average, an IPTW analysis had lower statistical power compared to an analysis of a similarly structured RCT. The difference in statistical power increased as the strength of the treatment-selection process increased. The statistical power of an IPTW analysis tended to be lower than the statistical power of a similarly structured RCT.
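A minimal sketch of the IPTW idea on simulated data: weighting by the inverse of the propensity score balances a confounder across treatment arms. The data-generating model is hypothetical, and the true propensity is used directly rather than estimated.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
x = rng.normal(size=n)                     # a measured confounder
p_treat = 1.0 / (1.0 + np.exp(-x))         # treatment-selection model
treated = rng.random(n) < p_treat

# Stabilized inverse-probability-of-treatment weights (true propensity
# score assumed known here; in practice it is estimated, e.g. by
# logistic regression).
pi = treated.mean()
w = np.where(treated, pi / p_treat, (1.0 - pi) / (1.0 - p_treat))

# Confounding: the raw treated-group mean of x is shifted away from 0,
# while the IPTW-weighted mean is pulled back toward the population mean.
raw_mean = x[treated].mean()
weighted_mean = np.average(x, weights=w * treated)
print(round(raw_mean, 2), round(weighted_mean, 2))
```

The variability of these weights is one reason an IPTW analysis can have lower power than a similarly structured RCT, as the abstract reports.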
Reproducibility-optimized test statistic for ranking genes in microarray studies.
Elo, Laura L; Filén, Sanna; Lahesmaa, Riitta; Aittokallio, Tero
2008-01-01
A principal goal of microarray studies is to identify the genes showing differential expression under distinct conditions. In such studies, the selection of an optimal test statistic is a crucial challenge, which depends on the type and amount of data under analysis. While previous studies on simulated or spike-in datasets do not provide practical guidance on how to choose the best method for a given real dataset, we introduce an enhanced reproducibility-optimization procedure, which enables the selection of a suitable gene-ranking statistic directly from the data. In comparison with existing ranking methods, the reproducibility-optimized statistic shows consistently good performance under various simulated conditions and on an Affymetrix spike-in dataset. Further, the feasibility of the novel statistic is confirmed in a practical research setting, using data from an in-house cDNA microarray study of asthma-related gene expression changes. These results suggest that the procedure facilitates the selection of an appropriate test statistic for a given dataset without relying on a priori assumptions, which may bias the findings and their interpretation. Moreover, the general reproducibility-optimization procedure is not limited to detecting differential expression but could be extended to a wide range of other applications as well.
Southard, Rodney E.
2013-01-01
The weather and precipitation patterns in Missouri vary considerably from year to year. In 2008, the statewide average rainfall was 57.34 inches, and in 2012, the statewide average rainfall was 30.64 inches. This variability in precipitation and resulting streamflow in Missouri underlies the necessity for water managers and users to have reliable streamflow statistics and a means to compute selected statistics at ungaged locations for a better understanding of water availability. Knowledge of surface-water availability is dependent on the streamflow data that have been collected and analyzed by the U.S. Geological Survey for more than 100 years at approximately 350 streamgages throughout Missouri. The U.S. Geological Survey, in cooperation with the Missouri Department of Natural Resources, computed streamflow statistics at streamgages through the 2010 water year, defined periods of drought, defined methods to estimate streamflow statistics at ungaged locations, and developed regional regression equations to compute selected streamflow statistics at ungaged locations. Streamflow statistics and flow durations were computed for 532 streamgages in Missouri and in neighboring States. For streamgages with more than 10 years of record, Kendall's tau was computed to test for trends in streamflow data. If trends were detected, the variable-length method was used to define the period of no trend. Water years were removed from the dataset from the beginning of the record for a streamgage until no trend was detected. Low-flow frequency statistics were then computed for the entire period of record and for the period of no trend if 10 or more years of record were available for each analysis. Three methods are presented for computing selected streamflow statistics at ungaged locations. The first method uses power curve equations developed for 28 selected streams in Missouri and neighboring States that have multiple streamgages on the same streams. 
Statistical estimates on one of these streams can be calculated at an ungaged location that has a drainage area that is between 40 percent of the drainage area of the farthest upstream streamgage and within 150 percent of the drainage area of the farthest downstream streamgage along the stream of interest. The second method may be used on any stream with a streamgage that has operated for 10 years or longer and for which anthropogenic effects have not changed the low-flow characteristics at the ungaged location since collection of the streamflow data. A ratio of drainage area of the stream at the ungaged location to the drainage area of the stream at the streamgage was computed to estimate the statistic at the ungaged location. The range of applicability is between 40- and 150-percent of the drainage area of the streamgage, and the ungaged location must be located on the same stream as the streamgage. The third method uses regional regression equations to estimate selected low-flow frequency statistics for unregulated streams in Missouri. This report presents regression equations to estimate frequency statistics for the 10-year recurrence interval and for the N-day durations of 1, 2, 3, 7, 10, 30, and 60 days. Basin and climatic characteristics were computed using geographic information system software and digital geospatial data. A total of 35 characteristics were computed for use in preliminary statewide and regional regression analyses based on existing digital geospatial data and previous studies. Spatial analyses for geographical bias in the predictive accuracy of the regional regression equations defined three low-flow regions with the State representing the three major physiographic provinces in Missouri. Region 1 includes the Central Lowlands, Region 2 includes the Ozark Plateaus, and Region 3 includes the Mississippi Alluvial Plain. A total of 207 streamgages were used in the regression analyses for the regional equations. Of the 207 U.S. 
Geological Survey streamgages, 77 were located in Region 1, 120 were located in Region 2, and 10 were located in Region 3. Streamgages located outside of Missouri were selected to extend the range of data used for the independent variables in the regression analyses. Streamgages included in the regression analyses had 10 or more years of record and were considered to be affected minimally by anthropogenic activities or trends. Regional regression analyses identified three characteristics as statistically significant for the development of regional equations. For Region 1, drainage area, longest flow path, and streamflow-variability index were statistically significant. The range in the standard error of estimate for Region 1 is 79.6 to 94.2 percent. For Region 2, drainage area and streamflow-variability index were statistically significant, and the range in the standard error of estimate is 48.2 to 72.1 percent. For Region 3, drainage area and streamflow-variability index also were statistically significant, with a range in the standard error of estimate of 48.1 to 96.2 percent. Limitations on estimating low-flow frequency statistics at ungaged locations depend on the method used. The first method outlined for use in Missouri, power curve equations, was developed to estimate the selected statistics for ungaged locations on 28 selected streams with multiple streamgages located on the same stream. A second method uses a drainage-area ratio to compute statistics at an ungaged location using data from a single streamgage on the same stream with 10 or more years of record. Ungaged locations on these streams may use the ratio of the drainage area at an ungaged location to the drainage area at a streamgage location to scale the selected statistic value from the streamgage location to the ungaged location. This method can be used if the drainage area of the ungaged location is within 40 to 150 percent of the streamgage drainage area. 
The third method is the use of the regional regression equations. The limits for the use of these equations are based on the ranges of the characteristics used as independent variables and on the requirement that streams be affected minimally by anthropogenic activities.
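The drainage-area ratio (the second method) reduces to a one-line computation with the stated 40-150 percent applicability check; the flow value and drainage areas below are hypothetical.

```python
def drainage_area_ratio_estimate(stat_gaged, da_gaged, da_ungaged):
    """Scale a streamflow statistic from a streamgage to an ungaged
    site on the same stream by the ratio of drainage areas.
    Applicable only when the ungaged drainage area is within
    40-150 percent of the streamgage drainage area."""
    ratio = da_ungaged / da_gaged
    if not 0.40 <= ratio <= 1.50:
        raise ValueError("ungaged site outside 40-150 percent applicability range")
    return stat_gaged * ratio

# Hypothetical example: a 7-day low flow of 12.0 ft^3/s at a gage
# draining 200 mi^2, scaled to an ungaged site draining 150 mi^2.
print(drainage_area_ratio_estimate(12.0, 200.0, 150.0))   # 9.0
```

Outside the 40-150 percent range, the report's power-curve equations or regional regression equations are the appropriate alternatives.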
Marateb, Hamid Reza; Mansourian, Marjan; Adibi, Peyman; Farina, Dario
2014-01-01
Background: Selecting the correct statistical test and data mining method depends highly on the measurement scale of the data, the type of variables, and the purpose of the analysis. Different measurement scales are studied in detail, and statistical comparison, modeling, and data mining methods are illustrated using several medical examples. We present two clustering examples with ordinal variables, which are more challenging to analyze, using the Wisconsin Breast Cancer Data (WBCD). Ordinal-to-interval scale conversion example: a breast cancer database of nine 10-level ordinal variables for 683 patients was analyzed by two ordinal-scale clustering methods. The performance of the clustering methods was assessed by comparison with the gold standard groups of malignant and benign cases that had been identified by clinical tests. Results: The sensitivity and accuracy of the two clustering methods were 98% and 96%, respectively. Their specificity was comparable. Conclusion: Using a clustering algorithm appropriate to the measurement scale of the variables in the study yields high performance. Moreover, descriptive and inferential statistics, as well as the modeling approach, must be selected based on the scale of the variables. PMID:24672565
NASA Astrophysics Data System (ADS)
Hsu, Kuo-Hsien
2012-11-01
Formosat-2 imagery is a kind of high-spatial-resolution (2-meter GSD) remote sensing satellite data comprising one panchromatic band and four multispectral bands (blue, green, red, near-infrared). An essential step in the daily processing of received Formosat-2 images is to estimate the cloud statistic of each image using an Automatic Cloud Coverage Assessment (ACCA) algorithm. The cloud statistic is subsequently recorded as an important metadata item in the image product catalog. In this paper, we propose an ACCA method with two consecutive stages: pre-processing and post-processing analysis. For the pre-processing analysis, unsupervised K-means classification, Sobel's method, a thresholding method, re-examination of non-cloudy pixels, and a cross-band filter method are implemented in sequence to determine the cloud statistic. For the post-processing analysis, a box-counting fractal method is implemented. In other words, the cloud statistic is first determined via the pre-processing analysis, and its correctness across different spectral bands is then cross-examined qualitatively and quantitatively via the post-processing analysis. The selection of an appropriate thresholding method is critical to the result of the ACCA method. Therefore, we first conducted a series of experiments on clustering-based and spatial thresholding methods, including Otsu's, Local Entropy (LE), Joint Entropy (JE), Global Entropy (GE), and Global Relative Entropy (GRE) methods, for performance comparison. The results show that Otsu's and GE methods both perform better than the others for Formosat-2 images. Additionally, our proposed ACCA method, with Otsu's method selected as the thresholding method, successfully extracted the cloudy pixels of Formosat-2 images for accurate cloud statistic estimation.
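Otsu's thresholding, which the study found to perform well, can be sketched in NumPy: it picks the threshold that maximizes the between-class variance of the two resulting pixel classes. The two-mode "image" below is synthetic, not Formosat-2 data.

```python
import numpy as np

def otsu_threshold(pixels, nbins=256):
    """Otsu's method: choose the threshold maximizing the
    between-class variance (mu_T*w0 - mu)^2 / (w0*w1)."""
    hist, edges = np.histogram(pixels, bins=nbins)
    p = hist / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(p)                 # class-0 probability mass
    mu = np.cumsum(p * centers)       # cumulative mean
    mu_t = mu[-1]
    w1 = 1 - w0
    valid = (w0 > 0) & (w1 > 0)
    sigma_b = np.zeros(nbins)
    sigma_b[valid] = (mu_t * w0[valid] - mu[valid]) ** 2 / (w0[valid] * w1[valid])
    return centers[np.argmax(sigma_b)]

rng = np.random.default_rng(0)
# Synthetic scene: dark ground around 0.2, bright cloud around 0.8.
img = np.concatenate([rng.normal(0.2, 0.05, 7000), rng.normal(0.8, 0.05, 3000)])
t = otsu_threshold(img)
print(0.3 < t < 0.7)   # threshold falls between the two modes
```

Pixels above the threshold would be counted as cloudy, giving the per-image cloud statistic; the study's pipeline adds the re-examination and cross-band steps around this core.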
Fully Bayesian tests of neutrality using genealogical summary statistics.
Drummond, Alexei J; Suchard, Marc A
2008-10-31
Many data summary statistics have been developed to detect departures from neutral expectations of evolutionary models. However, questions about the neutrality of the evolution of genetic loci within natural populations remain difficult to assess. One critical cause of this difficulty is that most methods for testing neutrality make simplifying assumptions simultaneously about the mutational model and the population size model. Consequently, rejecting the null hypothesis of neutrality under these methods could result from violations of either or both assumptions, making interpretation troublesome. Here we harness posterior predictive simulation to exploit summary statistics of both the data and model parameters to test the goodness-of-fit of standard models of evolution. We apply the method to test the selective neutrality of molecular evolution in non-recombining gene genealogies, and we demonstrate its utility on four real data sets, identifying significant departures from neutrality in human influenza A virus even after controlling for variation in population size. Importantly, by employing a full model-based Bayesian analysis, our method separates the effects of demography from the effects of selection. The method also allows multiple summary statistics to be used in concert, thus potentially increasing sensitivity. Furthermore, our method remains useful in situations where analytical expectations and variances of summary statistics are not available. This aspect has great potential for the analysis of temporally spaced data, an expanding area previously neglected owing to limited availability of theory and methods.
Fine Mapping Causal Variants with an Approximate Bayesian Method Using Marginal Test Statistics
Chen, Wenan; Larrabee, Beth R.; Ovsyannikova, Inna G.; Kennedy, Richard B.; Haralambieva, Iana H.; Poland, Gregory A.; Schaid, Daniel J.
2015-01-01
Two recently developed fine-mapping methods, CAVIAR and PAINTOR, demonstrate better performance than other fine-mapping methods. They also have the advantage of using only the marginal test statistics and the correlation among SNPs. Both methods leverage the fact that the marginal test statistics asymptotically follow a multivariate normal distribution and are likelihood based. However, their relationship with Bayesian fine mapping, such as BIMBAM, is not clear. In this study, we first show that CAVIAR and BIMBAM are actually approximately equivalent to each other. This leads to a fine-mapping method using marginal test statistics in the Bayesian framework, which we call CAVIAR Bayes factor (CAVIARBF). Another advantage of the Bayesian framework is that it can answer both association and fine-mapping questions. We also used simulations to compare CAVIARBF with other methods under different numbers of causal variants. The results showed that both CAVIARBF and BIMBAM have better performance than PAINTOR and other methods. Compared to BIMBAM, CAVIARBF has the advantage of using only marginal test statistics and takes about one-quarter to one-fifth of the running time. We applied the different methods to two independent cohorts of the same phenotype. Results showed that CAVIARBF, BIMBAM, and PAINTOR selected the same top 3 SNPs; however, CAVIARBF and BIMBAM had better consistency in selecting the top 10 ranked SNPs between the two cohorts. Software is available at https://bitbucket.org/Wenan/caviarbf. PMID:25948564
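One standard way to obtain an approximate Bayes factor from a marginal test statistic alone is Wakefield's asymptotic formula. The sketch below illustrates that general idea and is not the CAVIARBF implementation; the prior variance w is an assumed value.

```python
import numpy as np

def approx_bayes_factor(z, v, w=0.04):
    """Wakefield-style approximate Bayes factor (alternative vs. null)
    from a marginal z-statistic. v is the variance of the effect
    estimate; w is the prior variance of the true effect. Derived by
    comparing beta_hat ~ N(0, v + w) under H1 with N(0, v) under H0."""
    r = w / (v + w)
    return np.sqrt(1 - r) * np.exp(z ** 2 * r / 2)

# Hypothetical marginal z-scores for three SNPs, each with v = 0.01.
for z in [1.0, 3.0, 5.0]:
    print(round(float(approx_bayes_factor(z, 0.01)), 2))
```

The Bayes factor grows rapidly with |z|, so SNPs with strong marginal evidence dominate the posterior over causal configurations; multi-SNP methods such as CAVIARBF extend this to sets of SNPs using their correlation matrix.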
Evaluation of resource selection methods with different definitions of availability
Seth A. McClean; Mark A. Rumble; Rudy M. King; William L. Baker
1998-01-01
Because resource selection is of paramount importance to ecology and management of any species, we compared 6 statistical methods of analyzing resource selection data, given the known biological requirements of radiomarked Merriam's wild turkey (Meleagris gallopavo merriami) hens with poults in the Black Hills of South Dakota. A single variable,...
Data-adaptive test statistics for microarray data.
Mukherjee, Sach; Roberts, Stephen J; van der Laan, Mark J
2005-09-01
An important task in microarray data analysis is the selection of genes that are differentially expressed between different tissue samples, such as healthy and diseased. However, microarray data contain an enormous number of dimensions (genes) and very few samples (arrays), a mismatch that poses fundamental statistical problems for the selection process that have defied easy resolution. In this paper, we present a novel approach to the selection of differentially expressed genes in which test statistics are learned from data, using a simple notion of reproducibility in selection results as the learning criterion. Reproducibility, as we define it, can be computed without any knowledge of the 'ground truth', but takes advantage of certain properties of microarray data to provide an asymptotically valid guide to expected loss under the true data-generating distribution. We are therefore able to indirectly minimize expected loss, and obtain results substantially more robust than conventional methods. We apply our method to simulated and oligonucleotide array data. Availability: by request to the corresponding author.
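The reproducibility criterion can be sketched as the overlap of top-k gene lists computed on disjoint halves of the arrays. Synthetic data and a fixed Welch t-statistic stand in here for the learned statistics of the paper; the effect sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes = 1000
# Synthetic expression matrix: 20 arrays, first 50 genes truly
# differential between the two sample classes (hypothetical data).
data = rng.normal(size=(n_genes, 20))
data[:50, 10:] += 2.0
labels = np.array([0] * 10 + [1] * 10)

def welch_t(cols):
    """Welch t-statistic per gene, computed on a subset of arrays."""
    x = data[:, cols[labels[cols] == 0]]
    y = data[:, cols[labels[cols] == 1]]
    se = np.sqrt(x.var(axis=1, ddof=1) / x.shape[1] +
                 y.var(axis=1, ddof=1) / y.shape[1])
    return (y.mean(axis=1) - x.mean(axis=1)) / se

def top_k(stat, k=50):
    return set(np.argsort(-np.abs(stat))[:k])

# Reproducibility score: overlap of the top-k gene lists obtained on
# two disjoint halves of the arrays (5 arrays per class in each half).
half_a = np.array([0, 1, 2, 3, 4, 10, 11, 12, 13, 14])
half_b = np.array([5, 6, 7, 8, 9, 15, 16, 17, 18, 19])
overlap = len(top_k(welch_t(half_a)) & top_k(welch_t(half_b))) / 50
print(round(overlap, 2))
```

A test statistic that ranks the same genes on both halves scores higher; averaging this score over many random splits gives a data-driven criterion for choosing among candidate statistics.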
Chao, Li-Wei; Szrek, Helena; Peltzer, Karl; Ramlagan, Shandir; Fleming, Peter; Leite, Rui; Magerman, Jesswill; Ngwenya, Godfrey B.; Pereira, Nuno Sousa; Behrman, Jere
2011-01-01
Finding an efficient method for sampling micro- and small-enterprises (MSEs) for research and statistical reporting purposes is a challenge in developing countries, where registries of MSEs are often nonexistent or outdated. This lack of a sampling frame creates an obstacle in finding a representative sample of MSEs. This study uses computer simulations to draw samples from a census of businesses and non-businesses in the Tshwane Municipality of South Africa, using three different sampling methods: the traditional probability sampling method, the compact segment sampling method, and the World Health Organization’s Expanded Programme on Immunization (EPI) sampling method. Three mechanisms by which the methods could differ are tested, the proximity selection of respondents, the at-home selection of respondents, and the use of inaccurate probability weights. The results highlight the importance of revisits and accurate probability weights, but the lesser effect of proximity selection on the samples’ statistical properties. PMID:22582004
Lam, Lun Tak; Sun, Yi; Davey, Neil; Adams, Rod; Prapopoulou, Maria; Brown, Marc B; Moss, Gary P
2010-06-01
The aim was to employ Gaussian processes to assess mathematically the nature of a skin permeability dataset and to employ these methods, particularly feature selection, to determine the key physicochemical descriptors which exert the most significant influence on percutaneous absorption, and to compare such models with established existing models. Gaussian processes, including automatic relevance determination (GPRARD) methods, were employed to develop models of percutaneous absorption that identified key physicochemical descriptors of percutaneous absorption. Using MatLab software, the statistical performance of these models was compared with single linear networks (SLN) and quantitative structure-permeability relationships (QSPRs). Feature selection methods were used to examine in more detail the physicochemical parameters used in this study. A range of statistical measures to determine model quality were used. The inherently nonlinear nature of the skin dataset was confirmed. The Gaussian process regression (GPR) methods yielded predictive models that offered statistically significant improvements over SLN and QSPR models with regard to predictivity (where the rank order was: GPR > SLN > QSPR). Feature selection analysis determined that the best GPR models were those that contained log P, melting point and the number of hydrogen bond donor groups as significant descriptors. Further statistical analysis also found that great synergy existed between certain parameters. It suggested that a number of the descriptors employed were effectively interchangeable, thus questioning the use of models where discrete variables are output, usually in the form of an equation. The use of a nonlinear GPR method produced models with significantly improved predictivity, compared with SLN or QSPR models. Feature selection methods were able to provide important mechanistic information.
However, it was also shown that significant synergy existed between certain parameters, and as such it was possible to interchange certain descriptors (i.e. molecular weight and melting point) without incurring a loss of model quality. Such synergy suggested that a model constructed from discrete terms in an equation may not be the most appropriate way of representing mechanistic understandings of skin absorption.
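The automatic relevance determination (ARD) idea behind this kind of feature selection can be sketched with a toy Gaussian process marginal likelihood: giving an irrelevant input a large length scale effectively removes it from the kernel, and the marginal likelihood rewards doing so. The data, length scales and noise level below are illustrative assumptions, not the paper's skin permeability dataset or its MatLab models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: the response depends only on feature 0; feature 1 is an
# irrelevant input (a synthetic stand-in for a physicochemical descriptor).
n = 40
X = rng.uniform(0.0, 3.0, size=(n, 2))
y = np.sin(2.0 * X[:, 0]) + 0.1 * rng.standard_normal(n)

def log_marginal_likelihood(X, y, length_scales, noise=0.1):
    """Log marginal likelihood of a zero-mean GP with an ARD RBF kernel."""
    Z = X / np.asarray(length_scales)            # per-dimension scaling (ARD)
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    K = np.exp(-0.5 * sq) + noise**2 * np.eye(len(y))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.log(np.diag(L)).sum()
            - 0.5 * len(y) * np.log(2 * np.pi))

# Isotropic kernel: both features treated as equally relevant.
lml_iso = log_marginal_likelihood(X, y, [0.5, 0.5])
# ARD kernel: a large length scale switches the irrelevant feature off.
lml_ard = log_marginal_likelihood(X, y, [0.5, 50.0])
```

In a full ARD fit the length scales are optimised rather than compared by hand; descriptors whose length scales grow large are deemed irrelevant, which is the mechanism behind the relevance ranking the abstract describes.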
Visual analytics in cheminformatics: user-supervised descriptor selection for QSAR methods.
Martínez, María Jimena; Ponzoni, Ignacio; Díaz, Mónica F; Vazquez, Gustavo E; Soto, Axel J
2015-01-01
The design of QSAR/QSPR models is a challenging problem, where the selection of the most relevant descriptors constitutes a key step of the process. Several feature selection methods that address this step concentrate on statistical associations among descriptors and target properties, whereas chemical knowledge is left out of the analysis. For this reason, the interpretability and generality of the QSAR/QSPR models obtained by these feature selection methods are drastically affected. Therefore, an approach for integrating domain experts' knowledge in the selection process is needed to increase confidence in the final set of descriptors. In this paper we propose a software tool, named Visual and Interactive DEscriptor ANalysis (VIDEAN), that combines statistical methods with interactive visualizations for choosing a set of descriptors for predicting a target property. Domain expertise can be added to the feature selection process by means of an interactive visual exploration of data, aided by statistical tools and metrics based on information theory. Coordinated visual representations are presented for capturing different relationships and interactions among descriptors, target properties and candidate subsets of descriptors. The capabilities of the proposed software were assessed through different scenarios. These scenarios reveal how an expert can use this tool to choose one subset of descriptors from a group of candidate subsets, modify existing descriptor subsets, or even incorporate new descriptors according to his or her own knowledge of the target property. The reported experiences showed the suitability of our software for selecting sets of descriptors with low cardinality, high interpretability, low redundancy and high statistical performance in a visual exploratory way.
Therefore, it is possible to conclude that the resulting tool allows the integration of a chemist's expertise in the descriptor selection process with a low cognitive effort, in contrast with the alternative of an ad hoc manual analysis of the selected descriptors. Graphical abstract: VIDEAN allows the visual analysis of candidate subsets of descriptors for QSAR/QSPR. In the two panels on the top, users can interactively explore numerical correlations as well as co-occurrences in the candidate subsets through two interactive graphs.
Logging costs and production rates for the group selection cutting method
Philip M. McDonald
1965-01-01
Young-growth, mixed-conifer stands were logged by a group-selection method designed to create openings 30, 60, and 90 feet in diameter. Total costs for felling, limbing, bucking, and skidding on these openings ranged from $7.04 to $7.99 per thousand board feet. Cost differences between openings were not statistically significant. Logging costs for group selection...
Nebraska's forests, 2005: statistics, methods, and quality assurance
Patrick D. Miles; Dacia M. Meneguzzo; Charles J. Barnett
2011-01-01
The first full annual inventory of Nebraska's forests was completed in 2005 after 8,335 plots were selected and 274 forested plots were visited and measured. This report includes detailed information on forest inventory methods and data quality estimates. Tables of various important resource statistics are presented. A detailed analysis of the inventory data is...
Kansas's forests, 2005: statistics, methods, and quality assurance
Patrick D. Miles; W. Keith Moser; Charles J. Barnett
2011-01-01
The first full annual inventory of Kansas's forests was completed in 2005 after 8,868 plots were selected and 468 forested plots were visited and measured. This report includes detailed information on forest inventory methods and data quality estimates. Important resource statistics are included in the tables. A detailed analysis of Kansas inventory is presented...
ERIC Educational Resources Information Center
National Center for Education Statistics (ED), Washington, DC.
The papers were presented at the Social Statistics Section, the Government Statistics Section, and the Section on Survey Research Methods. The following papers are included in the Social Statistics Section and Government Statistics Section, "Overcoming the Bureaucratic Paradigm: Memorial Session in Honor of Roger Herriot": "1995…
The T(ea) Test: Scripted Stories Increase Statistical Method Selection Skills
ERIC Educational Resources Information Center
Hackathorn, Jana; Ashdown, Brien
2015-01-01
To teach statistics, teachers must attempt to overcome pedagogical obstacles, such as dread, anxiety, and boredom. There are many options available to teachers that facilitate a pedagogically conducive environment in the classroom. The current study examined the effectiveness of incorporating scripted stories and humor into statistical method…
Fine Mapping Causal Variants with an Approximate Bayesian Method Using Marginal Test Statistics.
Chen, Wenan; Larrabee, Beth R; Ovsyannikova, Inna G; Kennedy, Richard B; Haralambieva, Iana H; Poland, Gregory A; Schaid, Daniel J
2015-07-01
Two recently developed fine-mapping methods, CAVIAR and PAINTOR, demonstrate better performance over other fine-mapping methods. They also have the advantage of using only the marginal test statistics and the correlation among SNPs. Both methods leverage the fact that the marginal test statistics asymptotically follow a multivariate normal distribution and are likelihood based. However, their relationship with Bayesian fine mapping, such as BIMBAM, is not clear. In this study, we first show that CAVIAR and BIMBAM are actually approximately equivalent to each other. This leads to a fine-mapping method using marginal test statistics in the Bayesian framework, which we call CAVIAR Bayes factor (CAVIARBF). Another advantage of the Bayesian framework is that it can answer both association and fine-mapping questions. We also used simulations to compare CAVIARBF with other methods under different numbers of causal variants. The results showed that both CAVIARBF and BIMBAM have better performance than PAINTOR and other methods. Compared to BIMBAM, CAVIARBF has the advantage of using only marginal test statistics and takes about one-quarter to one-fifth of the running time. We applied different methods on two independent cohorts of the same phenotype. Results showed that CAVIARBF, BIMBAM, and PAINTOR selected the same top 3 SNPs; however, CAVIARBF and BIMBAM had better consistency in selecting the top 10 ranked SNPs between the two cohorts. Software is available at https://bitbucket.org/Wenan/caviarbf. Copyright © 2015 by the Genetics Society of America.
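The core of Bayesian fine mapping from marginal test statistics alone can be illustrated with a single-SNP approximate Bayes factor of the kind CAVIARBF generalises: with a normal prior on the effect, the Bayes factor is a closed-form function of the z-score. The z-scores and prior variance below are hypothetical, and CAVIARBF itself additionally models the correlation (LD) among SNPs.

```python
import numpy as np

def wakefield_log_bf(z, w=1.0, v=1.0):
    """log Bayes factor (H1 vs H0) for one SNP from its marginal z-score.

    Model: beta_hat ~ N(beta, v); prior beta ~ N(0, w) under H1, beta = 0
    under H0. Marginally beta_hat ~ N(0, v) under H0 and N(0, v + w) under
    H1, so BF10 = sqrt(v / (v + w)) * exp(z^2 * w / (2 * (v + w))).
    """
    r = w / (v + w)
    return 0.5 * np.log(1.0 - r) + 0.5 * z**2 * r

# Hypothetical marginal z-scores for five SNPs in one region.
z = np.array([0.3, 1.1, 5.2, 0.8, 2.0])
log_bf = wakefield_log_bf(z)

# Posterior inclusion probabilities under a one-causal-variant assumption
# with a uniform prior over SNPs (log-sum-exp for numerical stability).
m = log_bf.max()
pip = np.exp(log_bf - m) / np.exp(log_bf - m).sum()
```

Ranking SNPs by these posterior probabilities is the fine-mapping step; allowing several causal variants requires the multivariate normal likelihood over correlated z-scores that the compared methods implement.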
Statistical Selection of Biological Models for Genome-Wide Association Analyses.
Bi, Wenjian; Kang, Guolian; Pounds, Stanley B
2018-05-24
Genome-wide association studies have discovered many biologically important associations of genes with phenotypes. Typically, genome-wide association analyses formally test the association of each genetic feature (SNP, CNV, etc.) with the phenotype of interest and summarize the results with multiplicity-adjusted p-values. However, very small p-values only provide evidence against the null hypothesis of no association without indicating which biological model best explains the observed data. Correctly identifying a specific biological model may improve the scientific interpretation and can be used to more effectively select and design a follow-up validation study. Thus, statistical methodology to identify the correct biological model for a particular genotype-phenotype association can be very useful to investigators. Here, we propose a general statistical method to summarize how accurately each of five biological models (null, additive, dominant, recessive, co-dominant) represents the data observed for each variant in a GWAS. We show that the new method stringently controls the false discovery rate and asymptotically selects the correct biological model. Simulations of two-stage discovery-validation studies show that the new method has these properties and that its validation power is similar to or exceeds that of simple methods that use the same statistical model for all SNPs. Example analyses of three data sets also highlight these advantages of the new method. An R package is freely available at www.stjuderesearch.org/site/depts/biostats/maew. Copyright © 2018. Published by Elsevier Inc.
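The five candidate biological models amount to different codings of the genotype in a regression. A minimal numpy sketch of that idea, selecting among codings by BIC on simulated data (an illustration only, not the paper's FDR-controlling procedure):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated variant with a purely recessive effect on a quantitative trait.
n = 400
g = rng.choice([0, 1, 2], size=n, p=[0.25, 0.5, 0.25])
y = 2.0 * (g == 2) + rng.standard_normal(n) * 0.5

def design(g, model):
    """Design matrix for each biological model's genotype coding."""
    one = np.ones_like(g, dtype=float)
    cols = {
        "null":       [one],
        "additive":   [one, g.astype(float)],
        "dominant":   [one, (g >= 1).astype(float)],
        "recessive":  [one, (g == 2).astype(float)],
        "codominant": [one, (g == 1).astype(float), (g == 2).astype(float)],
    }
    return np.column_stack(cols[model])

def bic(g, y, model):
    X = design(g, model)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = ((y - X @ beta) ** 2).sum()
    k = X.shape[1] + 1                       # coefficients + error variance
    return len(y) * np.log(rss / len(y)) + k * np.log(len(y))

scores = {m: bic(g, y, m) for m in
          ["null", "additive", "dominant", "recessive", "codominant"]}
best_model = min(scores, key=scores.get)
```

The codominant model always fits at least as well as the recessive one; the information criterion's complexity penalty is what lets the simpler, correct coding win, mirroring the trade-off the paper formalises.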
CADDIS Volume 4. Data Analysis: Selecting an Analysis Approach
An approach for selecting statistical analyses to inform causal analysis. Describes methods for determining whether test site conditions differ from reference expectations. Describes an approach for estimating stressor-response relationships.
Max-AUC Feature Selection in Computer-Aided Detection of Polyps in CT Colonography
Xu, Jian-Wu; Suzuki, Kenji
2014-01-01
We propose a feature selection method based on a sequential forward floating selection (SFFS) procedure to improve the performance of a classifier in computerized detection of polyps in CT colonography (CTC). The feature selection method is coupled with a nonlinear support vector machine (SVM) classifier. Unlike the conventional linear method based on Wilks' lambda, the proposed method selected the most relevant features that would maximize the area under the receiver operating characteristic curve (AUC), which directly maximizes classification performance, evaluated based on AUC value, in the computer-aided detection (CADe) scheme. We presented two variants of the proposed method with different stopping criteria used in the SFFS procedure. The first variant searched all feature combinations allowed in the SFFS procedure and selected the subsets that maximize the AUC values. The second variant performed a statistical test at each step during the SFFS procedure, and it was terminated if the increase in the AUC value was not statistically significant. The advantage of the second variant is its lower computational cost. To test the performance of the proposed method, we compared it against the popular stepwise feature selection method based on Wilks' lambda for a colonic-polyp database (25 polyps and 2624 nonpolyps). We extracted 75 morphologic, gray-level-based, and texture features from the segmented lesion candidate regions. The two variants of the proposed feature selection method chose 29 and 7 features, respectively. Two SVM classifiers trained with these selected features yielded a 96% by-polyp sensitivity at false-positive (FP) rates of 4.1 and 6.5 per patient, respectively. Experiments showed a significant improvement in the performance of the classifier with the proposed feature selection method over that with the popular stepwise feature selection based on Wilks' lambda that yielded 18.0 FPs per patient at the same sensitivity level. PMID:24608058
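A simplified sketch of the first variant's search: greedily add the feature whose inclusion most increases AUC, stopping when no addition helps. Full SFFS also performs conditional backward removals after each addition, and the paper couples the search to a nonlinear SVM rather than the least-squares linear discriminant assumed here; the synthetic features are stand-ins for the morphologic and texture features.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic candidate features: columns 0 and 1 discriminate the classes,
# columns 2-5 are pure noise.
n = 300
y = (np.arange(n) < n // 2).astype(int)
X = rng.standard_normal((n, 6))
X[y == 1, 0] += 1.5
X[y == 1, 1] += 1.2

def auc(scores, y):
    """Area under the ROC curve via pairwise comparisons (Mann-Whitney)."""
    pos, neg = scores[y == 1], scores[y == 0]
    diff = pos[:, None] - neg[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

def subset_auc(cols):
    """Score with a least-squares linear discriminant on the chosen columns."""
    Z = np.column_stack([np.ones(n), X[:, cols]])
    w, *_ = np.linalg.lstsq(Z, y.astype(float), rcond=None)
    return auc(Z @ w, y)

# Greedy forward pass: repeatedly add the feature that most increases AUC.
selected, best = [], 0.5
while len(selected) < X.shape[1]:
    gains = {j: subset_auc(selected + [j])
             for j in range(X.shape[1]) if j not in selected}
    j_best = max(gains, key=gains.get)
    if gains[j_best] <= best + 1e-6:
        break
    selected.append(j_best)
    best = gains[j_best]
```

Evaluating AUC on the training set, as here, can admit noise features through overfitting; the paper's second variant counters this by requiring each AUC gain to be statistically significant before accepting it.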
ERIC Educational Resources Information Center
Peterlin, Primoz
2010-01-01
Two methods of data analysis are compared: spreadsheet software and a statistics software suite. Their use is compared analysing data collected in three selected experiments taken from an introductory physics laboratory, which include a linear dependence, a nonlinear dependence and a histogram. The merits of each method are compared. (Contains 7…
North Dakota's forests, 2005: statistics, methods, and quality assurance
Patrick D. Miles; David E. Haugen; Charles J. Barnett
2011-01-01
The first full annual inventory of North Dakota's forests was completed in 2005 after 7,622 plots were selected and 164 forested plots were visited and measured. This report includes detailed information on forest inventory methods and data quality estimates. Important resource statistics are included in the tables. A detailed analysis of the North Dakota...
South Dakota's forests, 2005: statistics, methods, and quality assurance
Patrick D. Miles; Ronald J. Piva; Charles J. Barnett
2011-01-01
The first full annual inventory of South Dakota's forests was completed in 2005 after 8,302 plots were selected and 325 forested plots were visited and measured. This report includes detailed information on forest inventory methods and data quality estimates. Important resource statistics are included in the tables. A detailed analysis of the South Dakota...
Statistical learning and selective inference.
Taylor, Jonathan; Tibshirani, Robert J
2015-06-23
We describe the problem of "selective inference." This addresses the following challenge: Having mined a set of data to find potential associations, how do we properly assess the strength of these associations? The fact that we have "cherry-picked"--searched for the strongest associations--means that we must set a higher bar for declaring significant the associations that we see. This challenge becomes more important in the era of big data and complex statistical modeling. The cherry tree (dataset) can be very large and the tools for cherry picking (statistical learning methods) are now very sophisticated. We describe some recent new developments in selective inference and illustrate their use in forward stepwise regression, the lasso, and principal components analysis.
NASA Astrophysics Data System (ADS)
Määttä, A.; Laine, M.; Tamminen, J.; Veefkind, J. P.
2013-09-01
We study uncertainty quantification in remote sensing of aerosols in the atmosphere with top of the atmosphere reflectance measurements from the nadir-viewing Ozone Monitoring Instrument (OMI). Focus is on the uncertainty in aerosol model selection of pre-calculated aerosol models and on the statistical modelling of the model inadequacies. The aim is to apply statistical methodologies that improve the uncertainty estimates of the aerosol optical thickness (AOT) retrieval by propagating model selection and model error related uncertainties more realistically. We utilise Bayesian model selection and model averaging methods for the model selection problem and use Gaussian processes to model the smooth systematic discrepancies from the modelled to observed reflectance. The systematic model error is learned from an ensemble of operational retrievals. The operational OMI multi-wavelength aerosol retrieval algorithm OMAERO is used for cloud-free, over-land pixels of the OMI instrument with the additional Bayesian model selection and model discrepancy techniques. The method is demonstrated with four examples with different aerosol properties: weakly absorbing aerosols, forest fires over Greece and Russia, and Sahara desert dust. The presented statistical methodology is general; it is not restricted to this particular satellite retrieval application.
Viallon, Vivian; Banerjee, Onureena; Jougla, Eric; Rey, Grégoire; Coste, Joel
2014-03-01
Looking for associations among multiple variables is a topical issue in statistics due to the increasing amount of data encountered in biology, medicine, and many other domains involving statistical applications. Graphical models have recently gained popularity for this purpose in the statistical literature. In the binary case, however, exact inference is generally very slow or even intractable because of the form of the so-called log-partition function. In this paper, we review various approximate methods for structure selection in binary graphical models that have recently been proposed in the literature and compare them through an extensive simulation study. We also propose a modification of one existing method, that is shown to achieve good performance and to be generally very fast. We conclude with an application in which we search for associations among causes of death recorded on French death certificates. © 2014 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
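One widely used approximate method of this kind is node-wise neighborhood selection: regress each binary variable on all the others with an L1-penalised logistic regression and connect variables whose coefficients are non-zero. A self-contained numpy sketch on a toy chain-structured dataset (the penalty value, step size and data are assumptions for illustration, not the paper's tuned settings):

```python
import numpy as np

rng = np.random.default_rng(3)

# Binary data with chain dependence 0 - 1 - 2, plus an independent
# variable 3 (a toy stand-in for cause-of-death indicators).
n = 2000
flip = lambda x, p: np.where(rng.random(n) < p, 1 - x, x)
x0 = rng.integers(0, 2, n)
x1 = flip(x0, 0.2)
x2 = flip(x1, 0.2)
x3 = rng.integers(0, 2, n)
X = np.column_stack([x0, x1, x2, x3]).astype(float)

def l1_logistic(A, b, lam, lr=0.5, iters=2000):
    """L1-penalised logistic regression by proximal gradient (ISTA)."""
    w, c = np.zeros(A.shape[1]), 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(A @ w + c)))
        c -= lr * (p - b).mean()                      # unpenalised intercept
        w = w - lr * A.T @ (p - b) / len(b)           # gradient step
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # soft-threshold
    return w

# Node-wise regressions; symmetrise with the OR rule.
d, lam, edges = X.shape[1], 0.05, set()
for j in range(d):
    others = [k for k in range(d) if k != j]
    w = l1_logistic(X[:, others], X[:, j], lam)
    for k, wk in zip(others, w):
        if abs(wk) > 1e-3:
            edges.add(frozenset((j, k)))
```

This avoids the intractable log-partition function entirely, which is why neighborhood-type approximations scale to many binary variables where exact inference does not.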
NASA Technical Reports Server (NTRS)
Schweikhard, W. G.; Chen, Y. S.
1986-01-01
The Melick method of inlet flow dynamic distortion prediction by statistical means is outlined. A hypothetical vortex model is used as the basis for the mathematical formulations. The main variables are identified by matching the theoretical total pressure rms ratio with the measured total pressure rms ratio. Data comparisons, using the HiMAT inlet test data set, indicate satisfactory prediction of the dynamic peak distortion for cases with boundary layer control device vortex generators. A method for the dynamic probe selection was developed. Validity of the probe selection criteria is demonstrated by comparing the reduced-probe predictions with the 40-probe predictions. It is indicated that the number of dynamic probes can be reduced to as few as two and still retain good accuracy.
Refining the Use of Linkage Disequilibrium as a Robust Signature of Selective Sweeps.
Jacobs, Guy S; Sluckin, Tim J; Kivisild, Toomas
2016-08-01
During a selective sweep, characteristic patterns of linkage disequilibrium can arise in the genomic region surrounding a selected locus. These have been used to infer past selective sweeps. However, the recombination rate is known to vary substantially along the genome for many species. We here investigate the effectiveness of current (Kelly's [Formula: see text] and [Formula: see text]) and novel statistics at inferring hard selective sweeps based on linkage disequilibrium distortions under different conditions, including a human-realistic demographic model and recombination rate variation. When the recombination rate is constant, Kelly's [Formula: see text] offers high power, but is outperformed by a novel statistic that we test, which we call [Formula: see text] We also find this statistic to be effective at detecting sweeps from standing variation. When recombination rate fluctuations are included, there is a considerable reduction in power for all linkage disequilibrium-based statistics. However, this can largely be reversed by appropriately controlling for expected linkage disequilibrium using a genetic map. To further test these different methods, we perform selection scans on well-characterized HapMap data, finding that all three statistics-[Formula: see text] Kelly's [Formula: see text] and [Formula: see text]-are able to replicate signals at regions previously identified as selection candidates based on population differentiation or the site frequency spectrum. While [Formula: see text] replicates most candidates when recombination map data are not available, the [Formula: see text] and [Formula: see text] statistics are more successful when recombination rate variation is controlled for. Given both this and their higher power in simulations of selective sweeps, these statistics are preferred when information on local recombination rate variation is available. Copyright © 2016 by the Genetics Society of America.
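Kelly's statistic referenced above is, in essence, the average pairwise r² across polymorphic sites in a window, which rises sharply when one haplotype dominates after a sweep. A toy numpy sketch of that contrast (the simulated haplotype matrices are illustrative constructions, not coalescent simulations, and no recombination-map correction is applied):

```python
import numpy as np

rng = np.random.default_rng(4)

def avg_pairwise_r2(H):
    """Average r^2 over all pairs of polymorphic sites (Kelly-style LD summary).

    H is a haplotype matrix: rows = haplotypes, columns = 0/1 alleles.
    """
    H = H[:, H.std(axis=0) > 0]              # drop monomorphic sites
    R = np.corrcoef(H, rowvar=False)
    iu = np.triu_indices(R.shape[0], k=1)
    return (R[iu] ** 2).mean()

# Neutral-like window: sites drawn independently -> little LD.
neutral = (rng.random((100, 20)) < 0.4).astype(int)

# Sweep-like window: one core haplotype at high frequency (80 of 100
# copies) versus its ancestral alternative, plus a sprinkle of new
# mutations -- this creates strong long-range allelic association.
core = (rng.random(20) < 0.5).astype(int)
sweep = np.vstack([np.tile(core, (80, 1)), np.tile(1 - core, (20, 1))])
mutate = rng.random(sweep.shape) < 0.02
sweep = np.where(mutate, 1 - sweep, sweep)

z_neutral = avg_pairwise_r2(neutral)
z_sweep = avg_pairwise_r2(sweep)
```

The paper's central caveat applies directly to such summaries: local recombination-rate variation also inflates or deflates pairwise r², so the observed value should be compared against the LD expected from a genetic map rather than against a flat neutral baseline.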
Wright, Marvin N; Dankowski, Theresa; Ziegler, Andreas
2017-04-15
The most popular approach for analyzing survival data is the Cox regression model. The Cox model may, however, be misspecified, and its proportionality assumption may not always be fulfilled. An alternative approach for survival prediction is random forests for survival outcomes. The standard split criterion for random survival forests is the log-rank test statistic, which favors splitting variables with many possible split points. Conditional inference forests avoid this split variable selection bias. However, linear rank statistics are utilized by default in conditional inference forests to select the optimal splitting variable, which cannot detect non-linear effects in the independent variables. An alternative is to use maximally selected rank statistics for the split point selection. As in conditional inference forests, splitting variables are compared on the p-value scale. However, instead of the conditional Monte-Carlo approach used in conditional inference forests, p-value approximations are employed. We describe several p-value approximations and the implementation of the proposed random forest approach. A simulation study demonstrates that unbiased split variable selection is possible. However, there is a trade-off between unbiased split variable selection and runtime. In benchmark studies of prediction performance on simulated and real datasets, the new method performs better than random survival forests if informative dichotomous variables are combined with uninformative variables with more categories and better than conditional inference forests if non-linear covariate effects are included. In a runtime comparison, the method proves to be computationally faster than both alternatives, if a simple p-value approximation is used. Copyright © 2017 John Wiley & Sons, Ltd.
Uncertainties in Estimates of Fleet Average Fuel Economy : A Statistical Evaluation
DOT National Transportation Integrated Search
1977-01-01
Research was performed to assess the current Federal procedure for estimating the average fuel economy of each automobile manufacturer's new car fleet. Test vehicle selection and fuel economy estimation methods were characterized statistically and so...
A Method for Search Engine Selection using Thesaurus for Selective Meta-Search Engine
NASA Astrophysics Data System (ADS)
Goto, Shoji; Ozono, Tadachika; Shintani, Toramatsu
In this paper, we propose a new method for selecting search engines on the WWW for a selective meta-search engine. In a selective meta-search engine, a method is needed that enables selecting appropriate search engines for users' queries. Most existing methods use statistical data such as document frequency. These methods may select inappropriate search engines if a query contains polysemous words. In this paper, we describe a search engine selection method based on a thesaurus. In our method, a thesaurus is constructed from documents in a search engine and is used as a source description of the search engine. The form of a particular thesaurus depends on the documents used for its construction. Our method enables search engine selection by considering relationships between terms and overcomes the problems caused by polysemous words. Further, our method does not have a centralized broker maintaining data such as document frequency for all search engines. As a result, it is easy to add a new search engine, and meta-search engines become more scalable with our method compared to other existing methods.
NASA Astrophysics Data System (ADS)
Wang, Dong
2016-03-01
Gears are the most commonly used components in mechanical transmission systems. Their failures may cause transmission system breakdown and result in economic loss. Identification of different gear crack levels is important to prevent any unexpected gear failure because gear cracks lead to gear tooth breakage. Signal-processing-based methods mainly require expertise to interpret gear fault signatures, which is usually not easy for ordinary users. In order to automatically identify different gear crack levels, intelligent gear crack identification methods should be developed. Previous case studies experimentally proved that K-nearest neighbors based methods exhibit high prediction accuracies for identification of 3 different gear crack levels under different motor speeds and loads. In this short communication, to further enhance prediction accuracies of existing K-nearest neighbors based methods and extend identification of 3 different gear crack levels to identification of 5 different gear crack levels, redundant statistical features are constructed by using Daubechies 44 (db44) binary wavelet packet transform at different wavelet decomposition levels, prior to the use of a K-nearest neighbors method. The dimensionality of the redundant statistical features is 620, which provides richer gear fault signatures. Since many of these statistical features are redundant and highly correlated with each other, dimensionality reduction of the redundant statistical features is conducted to obtain new significant statistical features. Finally, the K-nearest neighbors method is used to identify 5 different gear crack levels under different motor speeds and loads. A case study including 3 experiments is investigated to demonstrate that the developed method provides higher prediction accuracies than the existing K-nearest neighbors based methods for recognizing different gear crack levels under different motor speeds and loads.
Based on the new significant statistical features, some other popular statistical models including linear discriminant analysis, quadratic discriminant analysis, classification and regression tree and naive Bayes classifier, are compared with the developed method. The results show that the developed method has the highest prediction accuracies among these statistical models. Additionally, selection of the number of new significant features and parameter selection of K-nearest neighbors are thoroughly investigated.
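The pipeline described — statistical features, dimensionality reduction, then K-nearest neighbors — can be sketched end to end in numpy. Everything below is a stand-in: the synthetic "vibration" signals, the small feature set and the PCA step replace the paper's db44 wavelet-packet statistics (620 features) computed from real gearbox measurements.

```python
import numpy as np

rng = np.random.default_rng(5)

def stat_features(x):
    """A small statistical feature set for one vibration segment."""
    rms = np.sqrt((x**2).mean())
    sd = x.std()
    skew = ((x - x.mean())**3).mean() / sd**3
    kurt = ((x - x.mean())**4).mean() / sd**4
    crest = np.abs(x).max() / rms
    return np.array([x.mean(), sd, rms, skew, kurt, crest])

def make_segment(level):
    """Toy vibration segment; deeper 'cracks' add impulsive content."""
    t = np.arange(1024) / 1024
    x = np.sin(2 * np.pi * 50 * t) + 0.3 * rng.standard_normal(1024)
    x[rng.random(1024) < 0.01 * level] += 4.0
    return x

levels = [0, 2, 4]                        # three crack severities
X, y = [], []
for label, lv in enumerate(levels):
    for _ in range(40):
        X.append(stat_features(make_segment(lv)))
        y.append(label)
X, y = np.array(X), np.array(y)

# Standardise, then reduce dimension with PCA (SVD on centred data).
Xs = (X - X.mean(0)) / X.std(0)
U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
Z = Xs @ Vt[:3].T                         # keep 3 components

def knn_predict(Ztr, ytr, z, k=3):
    idx = np.argsort(((Ztr - z) ** 2).sum(1))[:k]
    return np.bincount(ytr[idx]).argmax()

# Leave-one-out accuracy of the reduced-feature KNN classifier.
correct = sum(knn_predict(np.delete(Z, i, 0), np.delete(y, i), Z[i]) == y[i]
              for i in range(len(y)))
accuracy = correct / len(y)
```

Choosing how many reduced components to keep and the neighbor count k is exactly the selection problem the abstract flags as needing careful investigation.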
Mansourian, Robert; Mutch, David M; Antille, Nicolas; Aubert, Jerome; Fogel, Paul; Le Goff, Jean-Marc; Moulin, Julie; Petrov, Anton; Rytz, Andreas; Voegel, Johannes J; Roberts, Matthew-Alan
2004-11-01
Microarray technology has become a powerful research tool in many fields of study; however, the cost of microarrays often results in the use of a low number of replicates (k). Under circumstances where k is low, it becomes difficult to perform standard statistical tests to extract the most biologically significant experimental results. Other more advanced statistical tests have been developed; however, their use and interpretation often remain difficult to implement in routine biological research. The present work outlines a method that achieves sufficient statistical power for selecting differentially expressed genes under conditions of low k, while remaining an intuitive and computationally efficient procedure. The present study describes a Global Error Assessment (GEA) methodology to select differentially expressed genes in microarray datasets, and was developed using an in vitro experiment that compared control and interferon-gamma treated skin cells. In this experiment, up to nine replicates were used to confidently estimate error, thereby enabling methods of different statistical power to be compared. Gene expression results of a similar absolute expression are binned, so as to enable a highly accurate local estimate of the mean squared error within conditions. The model then relates variability of gene expression in each bin to absolute expression levels and uses this in a test derived from the classical ANOVA. The GEA selection method is compared with both the classical and permutational ANOVA tests, and demonstrates an increased stability, robustness and confidence in gene selection. A subset of the selected genes were validated by real-time reverse transcription-polymerase chain reaction (RT-PCR). All these results suggest that GEA methodology is (i) suitable for selection of differentially expressed genes in microarray data, (ii) intuitive and computationally efficient and (iii) especially advantageous under conditions of low k.
The GEA code for R software is freely available upon request to authors.
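The binning idea behind GEA can be sketched briefly: pool the within-condition error variance across genes of similar absolute expression, then use that pooled variance in a t-like test, so each gene borrows degrees of freedom from its neighbours. Everything below (the intensity-dependent noise model, bin size and effect size) is an illustrative assumption, not the published R implementation.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy two-condition microarray: 1000 genes, k = 3 replicates per condition,
# with noise SD that shrinks with absolute expression (assumed noise model).
n_genes, k = 1000, 3
mu = rng.uniform(4, 12, n_genes)                 # log2 expression level
sd = 0.1 + 0.5 * np.exp(-0.4 * (mu - 4))
a = mu[:, None] + sd[:, None] * rng.standard_normal((n_genes, k))
b = mu[:, None] + sd[:, None] * rng.standard_normal((n_genes, k))
b[:50] += 1.5                                    # first 50 genes are truly DE

# GEA-style step: bin genes by mean expression and estimate the error
# variance locally from all genes in the bin, instead of per gene.
mean_ab = (a.mean(1) + b.mean(1)) / 2
per_gene_var = (a.var(1, ddof=1) + b.var(1, ddof=1)) / 2
order = np.argsort(mean_ab)
binned_var = np.empty(n_genes)
bin_size = 100
for start in range(0, n_genes, bin_size):
    idx = order[start:start + bin_size]
    binned_var[idx] = per_gene_var[idx].mean()   # pooled local MSE

# t-like statistic using the pooled local variance.
t = (b.mean(1) - a.mean(1)) / np.sqrt(binned_var * (2 / k))
top = np.argsort(-np.abs(t))[:50]
hits = np.intersect1d(top, np.arange(50)).size
```

With k = 3, per-gene variance estimates are so noisy that an ordinary t-test is unstable; pooling within intensity bins is what restores power at low k, which is the abstract's central claim.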
The discounting model selector: Statistical software for delay discounting applications.
Gilroy, Shawn P; Franck, Christopher T; Hantula, Donald A
2017-05-01
Original, open-source computer software was developed and validated against established delay discounting methods in the literature. The software executed approximate Bayesian model selection methods from user-supplied temporal discounting data and computed the effective delay 50 (ED50) from the best performing model. The software was custom-designed to enable behavior analysts to conveniently apply recent statistical methods to temporal discounting data with the aid of a graphical user interface (GUI). The results of independent validation of the approximate Bayesian model selection methods indicated that the program provided results identical to those of the original source paper and its methods. Monte Carlo simulation (n = 50,000) confirmed that the true model was selected most often in each setting. Simulation code and data for this study were posted to an online repository for use by other researchers. The model selection approach was applied to three existing delay discounting data sets from the literature in addition to the data from the source paper. Comparisons of model-selected ED50 were consistent with traditional indices of discounting. Conceptual issues related to the development and use of computer software by behavior analysts and the opportunities afforded by free and open-source software are discussed, and a review of possible expansions of this software is provided. © 2017 Society for the Experimental Analysis of Behavior.
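For the hyperbolic discounting model V = A/(1 + kD), the ED50 such software reports has a closed form: the delay at which subjective value halves is 1/k. A small numpy sketch that recovers it from synthetic indifference points; the grid-search least-squares fit is a simple stand-in for the approximate Bayesian model selection the paper describes, and the delays, true k and noise level are assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic indifference points from Mazur's hyperbolic model
# V = A / (1 + k * D), with A = 1, true k = 0.05 (so true ED50 = 20).
delays = np.array([1.0, 7, 14, 30, 90, 180, 365])
k_true = 0.05
values = 1.0 / (1.0 + k_true * delays) + 0.02 * rng.standard_normal(delays.size)

# Fit k by least squares over a log-spaced grid of candidate values.
grid = np.logspace(-4, 1, 2000)
sse = [((values - 1.0 / (1.0 + kk * delays)) ** 2).sum() for kk in grid]
k_hat = grid[int(np.argmin(sse))]
ed50 = 1.0 / k_hat            # delay at which value drops to half of A
```

In the full approach, several discounting models compete via Bayes factors and the ED50 is taken from whichever model wins, which keeps the summary measure comparable across participants fitted by different models.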
Limited Evidence for Classic Selective Sweeps in African Populations
Granka, Julie M.; Henn, Brenna M.; Gignoux, Christopher R.; Kidd, Jeffrey M.; Bustamante, Carlos D.; Feldman, Marcus W.
2012-01-01
While hundreds of loci have been identified as reflecting strong positive selection in human populations, connections between candidate loci and specific selective pressures often remain obscure. This study investigates broader patterns of selection in African populations, which are underrepresented despite their potential to offer key insights into human adaptation. We scan for hard selective sweeps using several haplotype and allele-frequency statistics with a data set of nearly 500,000 genome-wide single-nucleotide polymorphisms in 12 highly diverged African populations that span a range of environments and subsistence strategies. We find that positive selection does not appear to be a strong determinant of allele-frequency differentiation among these African populations. Haplotype statistics do identify putatively selected regions that are shared across African populations. However, as assessed by extensive simulations, patterns of haplotype sharing between African populations follow neutral expectations and suggest that tails of the empirical distributions contain false-positive signals. After highlighting several genomic regions where positive selection can be inferred with higher confidence, we use a novel method to identify biological functions enriched among populations’ empirical tail genomic windows, such as immune response in agricultural groups. In general, however, it seems that current methods for selection scans are poorly suited to populations that, like the African populations in this study, are affected by ascertainment bias and have low levels of linkage disequilibrium, possibly old selective sweeps, and potentially reduced phasing accuracy. Additionally, population history can confound the interpretation of selection statistics, suggesting that greater care is needed in attributing broad genetic patterns to human adaptation. PMID:22960214
A statistical method (cross-validation) for bone loss region detection after spaceflight
Zhao, Qian; Li, Wenjun; Li, Caixia; Chu, Philip W.; Kornak, John; Lang, Thomas F.
2010-01-01
Astronauts experience bone loss after long spaceflight missions. Identifying specific regions that undergo the greatest losses (e.g. the proximal femur) could reveal information about the processes of bone loss in disuse and disease. Methods for detecting such regions, however, remain an open problem. This paper focuses on statistical methods to detect such regions. We perform statistical parametric mapping to obtain t-maps of changes in images, and propose a new cross-validation method to select an optimum suprathreshold for forming clusters of pixels. Once these candidate clusters are formed, we use permutation testing of longitudinal labels to derive significant changes. PMID:20632144
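The cluster-forming step described in this abstract can be sketched generically: threshold the t-map, then group the surviving pixels into connected components. The 4-connectivity choice is an assumption of this illustration, and the cross-validated tuning of the threshold (the paper's contribution) is not reproduced here.

```python
import numpy as np

def suprathreshold_clusters(t_map, threshold):
    """Return sizes (largest first) of 4-connected clusters of pixels
    whose t-value exceeds the given suprathreshold."""
    above = t_map > threshold
    visited = np.zeros_like(above, dtype=bool)
    rows, cols = t_map.shape
    sizes = []
    for i in range(rows):
        for j in range(cols):
            if above[i, j] and not visited[i, j]:
                # flood fill one connected component
                stack, size = [(i, j)], 0
                visited[i, j] = True
                while stack:
                    r, c = stack.pop()
                    size += 1
                    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        rr, cc = r + dr, c + dc
                        if (0 <= rr < rows and 0 <= cc < cols
                                and above[rr, cc] and not visited[rr, cc]):
                            visited[rr, cc] = True
                            stack.append((rr, cc))
                sizes.append(size)
    return sorted(sizes, reverse=True)
```

In a permutation scheme like the one the abstract mentions, the same routine would be rerun on t-maps computed under shuffled longitudinal labels, and observed cluster sizes compared against the permutation distribution of the largest cluster.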
NASA Astrophysics Data System (ADS)
Samanta, Gaurab; Beris, Antony; Handler, Robert; Housiadas, Kostas
2009-03-01
Karhunen-Loeve (KL) analysis of DNS data of viscoelastic turbulent channel flows helps reveal more information on the time-dependent dynamics of the viscoelastic modification of turbulence [Samanta et al., J. Turbulence (in press), 2008]. A selected set of KL modes can be used for data reduction modeling of these flows. However, it is pertinent that verification be done against established DNS results. For this purpose, we compared velocity and conformation statistics and probability density functions (PDFs) of relevant quantities obtained from DNS and from fields reconstructed using selected KL modes and time-dependent coefficients. While the velocity statistics show good agreement between DNS and KL reconstructions even with just hundreds of KL modes, tens of thousands of KL modes are required to adequately capture the trace of the polymer conformation resulting from DNS. New modifications to the KL method have therefore been attempted to account for the differences in conformation statistics. The applicability and impact of these new modified KL methods will be discussed from the perspective of data reduction modeling.
Gardenier, John S
2012-12-01
This paper recommends how authors of statistical studies can communicate to general audiences fully, clearly, and comfortably. The studies may use statistical methods to explore issues in science, engineering, and society or they may address issues in statistics specifically. In either case, readers without explicit statistical training should have no problem understanding the issues, the methods, or the results at a non-technical level. The arguments for those results should be clear, logical, and persuasive. This paper also provides advice for editors of general journals on selecting high quality statistical articles without the need for exceptional work or expense. Finally, readers are also advised to watch out for some common errors or misuses of statistics that can be detected without a technical statistical background.
New insights into time series analysis. II - Non-correlated observations
NASA Astrophysics Data System (ADS)
Ferreira Lopes, C. E.; Cross, N. J. G.
2017-08-01
Context. Statistical parameters are used to draw conclusions in a vast number of fields such as finance, weather, industry, and science. These parameters are also used to identify variability patterns in photometric data in order to select non-stochastic variations that are indicative of astrophysical effects. New, more efficient selection methods are mandatory to analyze the huge amount of astronomical data. Aims: We seek to improve the current methods used to select non-stochastic variations in non-correlated data. Methods: We used standard and new data-mining parameters to analyze non-correlated data to find the best way to discriminate between stochastic and non-stochastic variations. A new approach that includes a modified Strateva function was used to select non-stochastic variations. Monte Carlo simulations and public time-domain data were used to estimate its accuracy and performance. Results: We introduce 16 modified statistical parameters covering different features of statistical distributions such as average, dispersion, and shape parameters. Many dispersion and shape parameters are unbound parameters, i.e. equations that do not require the calculation of the average. Unbound parameters are computed in a single loop, hence decreasing running time. Moreover, the majority of these parameters have lower errors than previous parameters, which is mainly observed for distributions with few measurements. A set of non-correlated variability indices, sample size corrections, and a new noise model along with tests of different apertures and cut-offs on the data (BAS approach) are introduced. The number of mis-selections is reduced by about 520% using a single waveband and 1200% combining all wavebands. On the other hand, the even-mean also improves the correlated indices introduced in Paper I. The mis-selection rate is reduced by about 18% if the even-mean is used instead of the mean to compute the correlated indices in the WFCAM database.
Even-statistics allow us to improve the effectiveness of both correlated and non-correlated indices. Conclusions: The selection of non-stochastic variations is improved by non-correlated indices. The even-averages provide a better estimation of the mean and median for almost all statistical distributions analyzed. The correlated variability indices, which were proposed in the first paper of this series, are also improved if the even-mean is used. The even-parameters will also be useful for classifying light curves in the last step of this project. We consider that the first step of this project, in which we set out new techniques and methods that provide a huge improvement in the efficiency of selection of variable stars, is now complete. Many of these techniques may be useful for a large number of fields. Next, we will commence a new step of this project regarding the analysis of period-search methods.
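The noise-model step this record refers to can be illustrated with the classical three-term Strateva function, sigma(m) = a + b·10^(0.4m) + c·10^(0.8m), which relates photometric scatter to magnitude; sources whose dispersion sits well above the fitted curve are variability candidates. This sketch uses the original form, not the paper's modification, and the 2-sigma cut-off is an assumption.

```python
import numpy as np

def _basis(mag):
    """Design matrix for the Strateva form a + b*10^(0.4m) + c*10^(0.8m)."""
    return np.column_stack([np.ones_like(mag), 10**(0.4 * mag), 10**(0.8 * mag)])

def fit_noise_model(mag, sigma):
    """Least-squares fit of the Strateva-style noise model; columns are
    rescaled before the solve to keep the system well conditioned."""
    X = _basis(mag)
    scale = X.max(axis=0)
    coef, *_ = np.linalg.lstsq(X / scale, sigma, rcond=None)
    return coef / scale

def variable_candidates(mag, sigma, coef, n_sigma=2.0):
    """Flag sources whose dispersion exceeds n_sigma times the model."""
    return sigma > n_sigma * (_basis(mag) @ coef)
```

The rescaling matters in practice: the 10^(0.8m) column spans many orders of magnitude over a typical survey's magnitude range, and an unscaled solve can silently truncate the intercept term.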
Yan, Zhengbing; Kuang, Te-Hui; Yao, Yuan
2017-09-01
In recent years, multivariate statistical monitoring of batch processes has become a popular research topic, wherein multivariate fault isolation is an important step aiming at the identification of the faulty variables contributing most to the detected process abnormality. Although contribution plots have been commonly used in statistical fault isolation, such methods suffer from the smearing effect between correlated variables. In particular, in batch process monitoring, the high autocorrelations and cross-correlations that exist in variable trajectories make the smearing effect unavoidable. To address such a problem, a variable selection-based fault isolation method is proposed in this research, which transforms the fault isolation problem into a variable selection problem in partial least squares discriminant analysis and solves it by calculating a sparse partial least squares model. As different from the traditional methods, the proposed method emphasizes the relative importance of each process variable. Such information may help process engineers in conducting root-cause diagnosis. Copyright © 2017 ISA. Published by Elsevier Ltd. All rights reserved.
Perturbation Selection and Local Influence Analysis for Nonlinear Structural Equation Model
ERIC Educational Resources Information Center
Chen, Fei; Zhu, Hong-Tu; Lee, Sik-Yum
2009-01-01
Local influence analysis is an important statistical method for studying the sensitivity of a proposed model to model inputs. One of its important issues is related to the appropriate choice of a perturbation vector. In this paper, we develop a general method to select an appropriate perturbation vector and a second-order local influence measure…
Trutschel, Diana; Palm, Rebecca; Holle, Bernhard; Simon, Michael
2017-11-01
Because not every scientific question on effectiveness can be answered with randomised controlled trials, research methods that minimise bias in observational studies are required. Two major concerns influence the internal validity of effect estimates: selection bias and clustering. Hence, to reduce the bias of the effect estimates, more sophisticated statistical methods are needed. The aim is to introduce statistical approaches such as propensity score matching and mixed models into representative real-world analysis, and to present their implementation in the statistical software R so that the results can be reproduced. We perform a two-level analytic strategy to address the problems of bias and clustering: (i) generalised models with different abilities to adjust for dependencies are used to analyse binary data and (ii) the genetic matching and covariate adjustment methods are used to adjust for selection bias. Hence, we analyse the data from two population samples, the sample produced by the matching method and the full sample. The different analysis methods in this article present different results but still point in the same direction. In our example, the estimate of the probability of receiving a case conference is higher in the treatment group than in the control group. Both strategies, genetic matching and covariate adjustment, have their limitations but complement each other to provide the whole picture. The statistical approaches were feasible for reducing bias but were nevertheless limited by the sample used. For each study and obtained sample, the pros and cons of the different methods have to be weighed. Copyright © 2017 The Author(s). Published by Elsevier Ltd. All rights reserved.
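The two ingredients named above, a propensity model and matching, have this generic shape. The paper uses genetic matching in R; the plain logistic fit by gradient descent and greedy 1:1 nearest-neighbour matching below are simplifying assumptions for illustration.

```python
import numpy as np

def propensity_scores(X, treated, lr=0.1, steps=2000):
    """Estimate P(treatment | covariates) with a logistic regression
    fit by plain gradient ascent on the log-likelihood (a minimal
    numpy stand-in for R's glm)."""
    Xb = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w += lr * Xb.T @ (treated - p) / len(X)
    return 1.0 / (1.0 + np.exp(-Xb @ w))

def nearest_neighbor_match(scores, treated):
    """Greedy 1:1 matching of each treated unit to the closest
    still-unmatched control on the propensity score."""
    t_idx = np.where(treated == 1)[0]
    c_idx = list(np.where(treated == 0)[0])
    pairs = []
    for t in t_idx:
        j = min(c_idx, key=lambda c: abs(scores[c] - scores[t]))
        c_idx.remove(j)
        pairs.append((t, j))
    return pairs
```

After matching, the outcome analysis would be run on the matched sample, ideally with a model (e.g. a mixed model) that also accounts for the clustering the abstract highlights.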
Pantyley, Viktoriya
2014-01-01
In the new conditions of socio-economic development in Ukraine, the health of the population of children is considered the most reliable indicator of the socio-economic development of the country. The primary goal of the study was analysis of the effect of contemporary socio-economic transformations, their scope, and strength of effect on the demographic and social situation of children in various regions of Ukraine. The methodological objectives of the study were as follows: development of a synthetic measure of the state of health of the population of children, based on Hellwig's method, and selection of districts in Ukraine according to the present health-demographic situation of children. The study was based on statistical data from the State Statistics Service of Ukraine, the Centre of Medical Statistics in Kiev, the Ukrainian Ministry of Defence, as well as the Ministry of Education and Science, Youth and Sports of Ukraine. The following research methods were used: analysis of literature and Internet sources, selection and analysis of statistical materials, and cartographic and statistical methods. Basic indices of the demographic and health situation of the population of children were analyzed, as well as factors of a socio-economic nature which affect this situation. A set of variables was developed for the synthetic evaluation of the state of health of the population of children. The typology of the Ukrainian districts was performed according to the state of health of the child population, based on Hellwig's taxonomic method. Deterioration of selected quality parameters was observed, as well as a change in the strength and directions of the effect of factors of an organizational-institutional, socio-economic, historical, and cultural nature on the potential of the population of children.
Healthy Worker Effect Phenomenon: Revisited with Emphasis on Statistical Methods – A Review
Chowdhury, Ritam; Shah, Divyang; Payal, Abhishek R.
2017-01-01
Known since 1885 but studied systematically only in the past four decades, the healthy worker effect (HWE) is a special form of selection bias common to occupational cohort studies. The phenomenon has been under debate for many years with respect to its impact, conceptual approach (confounding, selection bias, or both), and ways to resolve or account for its effect. The effect is not uniform across age groups, gender, race, and types of occupation, nor is it constant over time. Hence, assessing the HWE and accounting for it in statistical analyses is complicated and requires sophisticated methods. Here, we review the HWE, factors affecting it, and methods developed so far to deal with it. PMID:29391741
An instrument to assess the statistical intensity of medical research papers.
Nieminen, Pentti; Virtanen, Jorma I; Vähänikkilä, Hannu
2017-01-01
There is widespread evidence that statistical methods play an important role in original research articles, especially in medical research. The evaluation of statistical methods and reporting in journals suffers from a lack of standardized methods for assessing the use of statistics. The objective of this study was to develop and evaluate an instrument to assess the statistical intensity in research articles in a standardized way. A checklist-type measure scale was developed by selecting and refining items from previous reports about the statistical contents of medical journal articles and from published guidelines for statistical reporting. A total of 840 original medical research articles that were published between 2007 and 2015 in 16 journals were evaluated to test the scoring instrument. The total sum of all items was used to assess the intensity between sub-fields and journals. Inter-rater agreement was examined using a random sample of 40 articles. Four raters read and evaluated the selected articles using the developed instrument. The scale consisted of 66 items. The total summary score adequately discriminated between research articles according to their study design characteristics. The new instrument could also discriminate between journals according to their statistical intensity. The inter-observer agreement measured by the ICC was 0.88 between all four raters. Individual item analysis showed very high agreement between the rater pairs, with percentage agreement ranging from 91.7% to 95.2%. A reliable and applicable instrument for evaluating the statistical intensity in research papers was developed. It is a helpful tool for comparing the statistical intensity between sub-fields and journals. The novel instrument may be applied in manuscript peer review to identify papers in need of additional statistical review.
Identifying natural flow regimes using fish communities
NASA Astrophysics Data System (ADS)
Chang, Fi-John; Tsai, Wen-Ping; Wu, Tzu-Ching; Chen, Hung-kwai; Herricks, Edwin E.
2011-10-01
Modern water resources management has adopted natural flow regimes as reasonable targets for river restoration and conservation. The characterization of a natural flow regime begins with the development of hydrologic statistics from flow records. However, little guidance exists for defining the period of record needed for regime determination. In Taiwan, the Taiwan Eco-hydrological Indicator System (TEIS), a group of hydrologic statistics selected for fisheries relevance, is being used to evaluate ecological flows. The TEIS consists of a group of hydrologic statistics selected to characterize the relationships between flow and the life history of indigenous species. Using the TEIS and biosurvey data for Taiwan, this paper identifies the length of hydrologic record sufficient for natural flow regime characterization. To define the ecological hydrology of fish communities, this study connected hydrologic statistics to fish communities by using methods to define antecedent conditions that influence existing community composition. A moving average method was applied to TEIS statistics to reflect the effects of antecedent flow condition and a point-biserial correlation method was used to relate fisheries collections with TEIS statistics. The resulting fish species-TEIS (FISH-TEIS) hydrologic statistics matrix takes full advantage of historical flows and fisheries data. The analysis indicates that, in the watersheds analyzed, averaging TEIS statistics for the present year and 3 years prior to the sampling date, termed MA(4), is sufficient to develop a natural flow regime. This result suggests that flow regimes based on hydrologic statistics for the period of record can be replaced by regimes developed for sampled fish communities.
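The MA(4) window described above, the sampling year averaged with the three preceding years, reduces to a short moving average over annual statistics. A minimal sketch, assuming rows are years ordered oldest first and columns are the individual TEIS statistics:

```python
import numpy as np

def ma4(annual_stats):
    """MA(4): average each hydrologic statistic over the sampling year
    and the 3 preceding years. annual_stats is a (years x statistics)
    array with the oldest year in row 0."""
    out = []
    for t in range(3, len(annual_stats)):
        out.append(annual_stats[t - 3:t + 1].mean(axis=0))
    return np.array(out)
```

The first three years of record are consumed as lead-in, so a record of n years yields n - 3 usable MA(4) rows to pair with biosurvey dates.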
Adaptive interference cancel filter for evoked potential using high-order cumulants.
Lin, Bor-Shyh; Lin, Bor-Shing; Chong, Fok-Ching; Lai, Feipei
2004-01-01
This paper presents evoked potential (EP) processing using an adaptive interference cancel (AIC) filter with second- and higher-order cumulants. In the conventional ensemble averaging method, experiments must be repeated many times to record the required data. Recently, the use of the AIC structure with second-order statistics in processing EPs has proved more efficient than the traditional averaging method, but it is sensitive to both the reference signal statistics and the choice of step size. We therefore propose a higher-order-statistics-based AIC method to remedy these disadvantages. The method was tested on somatosensory EPs corrupted by EEG. A gradient-type algorithm is used in the AIC method. Comparisons of AIC filters based on second-, third-, and fourth-order statistics are also presented in this paper. We observed that the AIC filter with third-order statistics has better convergence performance for EP processing and is not sensitive to the selection of step size and reference input.
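For reference, the second-order baseline that the abstract compares against is the standard LMS adaptive interference canceller: a gradient-type filter predicts the interference in the primary channel from a reference channel, and the residual is the recovered signal. The cumulant-based (third- and fourth-order) updates the paper studies are not reproduced here; the filter order and step size below are illustrative assumptions.

```python
import numpy as np

def lms_canceller(primary, reference, order=4, mu=0.01):
    """Standard (second-order) LMS adaptive interference canceller.
    The residual e[n] = primary[n] - w.x[n] is the signal estimate."""
    w = np.zeros(order)
    out = np.zeros(len(primary))
    for n in range(order - 1, len(primary)):
        x = reference[n - order + 1:n + 1][::-1]  # current + past samples
        e = primary[n] - w @ x                    # residual = signal estimate
        w += 2 * mu * e * x                       # LMS gradient update
        out[n] = e
    return out
```

The step-size sensitivity the abstract criticizes is visible here: mu too large makes the weight update diverge, mu too small slows convergence, which motivates the higher-order variants that the paper reports to be more robust.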
A statistical approach to selecting and confirming validation targets in -omics experiments
2012-01-01
Background: Genomic technologies are, by their very nature, designed for hypothesis generation. In some cases, the hypotheses that are generated require that genome scientists confirm findings about specific genes or proteins. But one major advantage of high-throughput technology is that global genetic, genomic, transcriptomic, and proteomic behaviors can be observed. Manual confirmation of every statistically significant genomic result is prohibitively expensive. This has led researchers in genomics to adopt the strategy of confirming only a handful of the most statistically significant results, a small subset chosen for biological interest, or a small random subset. But there is no standard approach for selecting and quantitatively evaluating validation targets. Results: Here we present a new statistical method and approach for statistically validating lists of significant results based on confirming only a small random sample. We apply our statistical method to show that the usual practice of confirming only the most statistically significant results does not statistically validate result lists. We analyze an extensively validated RNA-sequencing experiment to show that confirming a random subset can statistically validate entire lists of significant results. Finally, we analyze multiple publicly available microarray experiments to show that statistically validating random samples can both (i) provide evidence to confirm long gene lists and (ii) save thousands of dollars and hundreds of hours of labor over manual validation of each significant result. Conclusions: For high-throughput -omics studies, statistical validation is a cost-effective and statistically valid approach to confirming lists of significant results. PMID:22738145
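The statistical intuition behind confirming a random subset is that the confirmed fraction of the subset is a binomial estimate of the whole list's true confirmation rate, so an interval estimate extends to the entire list. A minimal sketch using a Wilson score interval follows; the paper's actual validation procedure differs, so this is illustration only.

```python
import math

def validation_ci(n_confirmed, n_sampled, z=1.96):
    """Wilson score interval for the true confirmation rate of the
    full significant list, estimated from a randomly chosen,
    manually validated subset of size n_sampled."""
    p = n_confirmed / n_sampled
    denom = 1 + z**2 / n_sampled
    centre = (p + z**2 / (2 * n_sampled)) / denom
    half = (z / denom) * math.sqrt(
        p * (1 - p) / n_sampled + z**2 / (4 * n_sampled**2))
    return centre - half, centre + half
```

Crucially, the subset must be drawn at random: validating only the most significant hits, as the abstract notes, says nothing about the confirmation rate further down the list.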
Temporal variation and scale in movement-based resource selection functions
Hooten, M.B.; Hanks, E.M.; Johnson, D.S.; Alldredge, M.W.
2013-01-01
A common population characteristic of interest in animal ecology studies pertains to the selection of resources. That is, given the resources available to animals, what do they ultimately choose to use? A variety of statistical approaches have been employed to examine this question and each has advantages and disadvantages with respect to the form of available data and the properties of estimators given model assumptions. A wealth of high resolution telemetry data are now being collected to study animal population movement and space use and these data present both challenges and opportunities for statistical inference. We summarize traditional methods for resource selection and then describe several extensions to deal with measurement uncertainty and an explicit movement process that exists in studies involving high-resolution telemetry data. Our approach uses a correlated random walk movement model to obtain temporally varying use and availability distributions that are employed in a weighted distribution context to estimate selection coefficients. The temporally varying coefficients are then weighted by their contribution to selection and combined to provide inference at the population level. The result is an intuitive and accessible statistical procedure that uses readily available software and is computationally feasible for large datasets. These methods are demonstrated using data collected as part of a large-scale mountain lion monitoring study in Colorado, USA.
Koltun, G.F.; Kula, Stephanie P.
2013-01-01
This report presents the results of a study to develop methods for estimating selected low-flow statistics and for determining annual flow-duration statistics for Ohio streams. Regression techniques were used to develop equations for estimating 10-year recurrence-interval (10-percent annual-nonexceedance probability) low-flow yields, in cubic feet per second per square mile, with averaging periods of 1, 7, 30, and 90-day(s), and for estimating the yield corresponding to the long-term 80-percent duration flow. These equations, which estimate low-flow yields as a function of a streamflow-variability index, are based on previously published low-flow statistics for 79 long-term continuous-record streamgages with at least 10 years of data collected through water year 1997. When applied to the calibration dataset, average absolute percent errors for the regression equations ranged from 15.8 to 42.0 percent. The regression results have been incorporated into the U.S. Geological Survey (USGS) StreamStats application for Ohio (http://water.usgs.gov/osw/streamstats/ohio.html) in the form of a yield grid to facilitate estimation of the corresponding streamflow statistics in cubic feet per second. Logistic-regression equations also were developed and incorporated into the USGS StreamStats application for Ohio for selected low-flow statistics to help identify occurrences of zero-valued statistics. Quantiles of daily and 7-day mean streamflows were determined for annual and annual-seasonal (September–November) periods for each complete climatic year of streamflow-gaging station record for 110 selected streamflow-gaging stations with 20 or more years of record. The quantiles determined for each climatic year were the 99-, 98-, 95-, 90-, 80-, 75-, 70-, 60-, 50-, 40-, 30-, 25-, 20-, 10-, 5-, 2-, and 1-percent exceedance streamflows. 
Selected exceedance percentiles of the annual-exceedance percentiles were subsequently computed and tabulated to help facilitate consideration of the annual risk of exceedance or nonexceedance of annual and annual-seasonal-period flow-duration values. The quantiles are based on streamflow data collected through climatic year 2008.
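The exceedance quantiles described in this record have a direct numerical reading: the flow exceeded p percent of the time is the (100 - p)th percentile of the flow record. A one-function sketch, with the caveat that numpy's default linear interpolation may differ from the plotting-position convention used in the report:

```python
import numpy as np

def exceedance_flows(daily_flows, exceed_pcts):
    """Map each exceedance percentage p to the flow exceeded p percent
    of the time, i.e. the (100 - p)th percentile of the record."""
    return {p: float(np.percentile(daily_flows, 100 - p))
            for p in exceed_pcts}
```

For example, the 90-percent-exceedance flow is a low-flow measure (exceeded nine days in ten), while the 10-percent-exceedance flow characterizes high flows.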
Zhang, Ying; Sun, Jin; Zhang, Yun-Jiao; Chai, Qian-Yun; Zhang, Kang; Ma, Hong-Li; Wu, Xiao-Ke; Liu, Jian-Ping
2016-10-21
Although Traditional Chinese Medicine (TCM) has been widely used in clinical settings, a major challenge that remains in TCM is to evaluate its efficacy scientifically. This randomized controlled trial aims to evaluate the efficacy and safety of berberine in the treatment of patients with polycystic ovary syndrome. In order to improve the transparency and research quality of this clinical trial, we prepared this statistical analysis plan (SAP). The trial design, primary and secondary outcomes, and safety outcomes were declared to reduce selection biases in data analysis and result reporting. We specified detailed methods for data management and statistical analyses. Statistics in corresponding tables, listings, and graphs were outlined. The SAP provides more detailed information than the trial protocol on data management and statistical analysis methods. Any post hoc analyses can be identified by referring to this SAP, and possible selection bias and performance bias will be reduced in the trial. This study is registered at ClinicalTrials.gov (NCT01138930) on 7 June 2010.
Refining the Use of Linkage Disequilibrium as a Robust Signature of Selective Sweeps
Jacobs, Guy S.; Sluckin, Timothy J.; Kivisild, Toomas
2016-01-01
During a selective sweep, characteristic patterns of linkage disequilibrium can arise in the genomic region surrounding a selected locus. These have been used to infer past selective sweeps. However, the recombination rate is known to vary substantially along the genome for many species. We here investigate the effectiveness of current (Kelly’s ZnS and ωmax) and novel statistics at inferring hard selective sweeps based on linkage disequilibrium distortions under different conditions, including a human-realistic demographic model and recombination rate variation. When the recombination rate is constant, Kelly’s ZnS offers high power, but is outperformed by a novel statistic that we test, which we call Zα. We also find this statistic to be effective at detecting sweeps from standing variation. When recombination rate fluctuations are included, there is a considerable reduction in power for all linkage disequilibrium-based statistics. However, this can largely be reversed by appropriately controlling for expected linkage disequilibrium using a genetic map. To further test these different methods, we perform selection scans on well-characterized HapMap data, finding that all three statistics—ωmax, Kelly’s ZnS, and Zα—are able to replicate signals at regions previously identified as selection candidates based on population differentiation or the site frequency spectrum. While ωmax replicates most candidates when recombination map data are not available, the ZnS and Zα statistics are more successful when recombination rate variation is controlled for. Given both this and their higher power in simulations of selective sweeps, these statistics are preferred when information on local recombination rate variation is available. PMID:27516617
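Kelly's ZnS, used throughout the record above, is simply the average pairwise squared correlation (r²) of allele states across all pairs of segregating sites in a window. A direct sketch (haplotypes coded 0/1; the windowing and the Zα/ωmax variants are not shown):

```python
import numpy as np

def kelly_zns(haplotypes):
    """Kelly's ZnS: mean r^2 over all pairs of segregating sites.
    Rows are haplotypes, columns are biallelic sites coded 0/1."""
    n_sites = haplotypes.shape[1]
    total, pairs = 0.0, 0
    for i in range(n_sites):
        for j in range(i + 1, n_sites):
            r = np.corrcoef(haplotypes[:, i], haplotypes[:, j])[0, 1]
            total += r**2
            pairs += 1
    return total / pairs
```

Because ZnS averages r² uniformly over the window, it inherits the window's expected background linkage disequilibrium, which is why the abstract stresses controlling for the local recombination rate with a genetic map.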
Greedy feature selection for glycan chromatography data with the generalized Dirichlet distribution
2013-01-01
Background: Glycoproteins are involved in a diverse range of biochemical and biological processes. Changes in protein glycosylation are believed to occur in many diseases, particularly during cancer initiation and progression. The identification of biomarkers for human disease states is becoming increasingly important, as early detection is key to improving survival and recovery rates. To this end, the serum glycome has been proposed as a potential source of biomarkers for different types of cancers. High-throughput hydrophilic interaction liquid chromatography (HILIC) technology for glycan analysis allows for the detailed quantification of the glycan content in human serum. However, the experimental data from this analysis is compositional by nature. Compositional data are subject to a constant-sum constraint, which restricts the sample space to a simplex. Statistical analysis of glycan chromatography datasets should account for their unusual mathematical properties. As the volume of glycan HILIC data being produced increases, there is a considerable need for a framework to support appropriate statistical analysis. Proposed here is a methodology for feature selection in compositional data. The principal objective is to provide a template for the analysis of glycan chromatography data that may be used to identify potential glycan biomarkers. Results: A greedy search algorithm, based on the generalized Dirichlet distribution, is carried out over the feature space to search for the set of “grouping variables” that best discriminate between known group structures in the data, modelling the compositional variables using beta distributions. The algorithm is applied to two glycan chromatography datasets. Statistical classification methods are used to test the ability of the selected features to differentiate between known groups in the data. Two well-known methods are used for comparison: correlation-based feature selection (CFS) and recursive partitioning (rpart). CFS is a feature selection method, while recursive partitioning is a tree-learning algorithm that has been used for feature selection in the past. Conclusions: The proposed feature selection method performs well for both glycan chromatography datasets. It is computationally slower, but results in a lower misclassification rate and a higher sensitivity rate than both correlation-based feature selection and the classification tree method. PMID:23651459
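Greedy forward search, the skeleton of the algorithm described above, has a generic shape independent of the scoring criterion. In this sketch the score function is a simple class-mean separation, a placeholder assumption standing in for the paper's generalized-Dirichlet likelihood criterion.

```python
import numpy as np

def greedy_select(X, y, score_fn, k):
    """Greedy forward search: repeatedly add the feature whose
    inclusion most improves score_fn on the current subset."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < k:
        best = max(remaining,
                   key=lambda f: score_fn(X[:, selected + [f]], y))
        selected.append(best)
        remaining.remove(best)
    return selected

def separation(Xs, y):
    """Placeholder criterion: summed absolute distance between the
    two class means over the chosen features."""
    return np.abs(Xs[y == 0].mean(axis=0) - Xs[y == 1].mean(axis=0)).sum()
```

Swapping `separation` for a likelihood under a compositional model is what turns this generic search into the paper's method; the greedy structure itself is unchanged.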
Analysis of Statistical Methods Currently used in Toxicology Journals
Na, Jihye; Yang, Hyeri; Bae, SeungJin; Lim, Kyung-Min
2014-01-01
Statistical methods are frequently used in toxicology, yet it is not clear whether the methods employed by the studies are used consistently and conducted based on sound statistical grounds. The purpose of this paper is to describe statistical methods used in top toxicology journals. More specifically, we sampled 30 papers published in 2014 from Toxicology and Applied Pharmacology, Archives of Toxicology, and Toxicological Science and described the methodologies used to provide descriptive and inferential statistics. One hundred thirteen endpoints were observed in those 30 papers, and most studies had sample sizes less than 10, with the median and the mode being 6 and 3 & 6, respectively. The mean (105/113, 93%) was predominantly used to measure central tendency, and the standard error of the mean (64/113, 57%) and standard deviation (39/113, 34%) were used to measure dispersion, while few studies provided justification for why these methods were selected. Inferential statistics were frequently conducted (93/113, 82%), with one-way ANOVA being most popular (52/93, 56%), yet few studies conducted either normality or equal-variance tests. These results suggest that more consistent and appropriate use of statistical methods is necessary, which may enhance the role of toxicology in public health. PMID:25343012
Introduction to Statistics. Learning Packages in the Policy Sciences Series, PS-26. Revised Edition.
ERIC Educational Resources Information Center
Policy Studies Associates, Croton-on-Hudson, NY.
The primary objective of this booklet is to introduce students to basic statistical skills that are useful in the analysis of public policy data. A few, selected statistical methods are presented, and theory is not emphasized. Chapter 1 provides instruction for using tables, bar graphs, bar graphs with grouped data, trend lines, pie diagrams,…
Exploring Milk and Yogurt Selection in an Urban Universal School Breakfast Program
ERIC Educational Resources Information Center
Miller, M. Elizabeth; Kwon, Sockju
2015-01-01
Purpose/Objectives: The purpose of this study was to explore milk and yogurt selection among students participating in a School Breakfast Program. Methods: Researchers observed breakfast selection of milk, juice and yogurt in six elementary and four secondary schools. Data were analyzed using descriptive statistics and logistic regression to…
Sakunpak, Apirak; Suksaeree, Jirapornchai; Monton, Chaowalit; Pathompak, Pathamaporn; Kraisintu, Krisana
2014-01-01
Objective To develop and validate an image analysis method for quantitative analysis of γ-oryzanol in cold pressed rice bran oil. Methods TLC-densitometric and TLC-image analysis methods were developed, validated, and used for quantitative analysis of γ-oryzanol in cold pressed rice bran oil. The results obtained by these two different quantification methods were compared by paired t-test. Results Both assays provided good linearity, accuracy, reproducibility and selectivity for determination of γ-oryzanol. Conclusions The TLC-densitometric and TLC-image analysis methods provided a similar reproducibility, accuracy and selectivity for the quantitative determination of γ-oryzanol in cold pressed rice bran oil. A statistical comparison of the quantitative determinations of γ-oryzanol in samples did not show any statistically significant difference between TLC-densitometric and TLC-image analysis methods. As both methods were found to be equal, they therefore can be used for the determination of γ-oryzanol in cold pressed rice bran oil. PMID:25182282
[The research protocol III. Study population].
Arias-Gómez, Jesús; Villasís-Keever, Miguel Ángel; Miranda-Novales, María Guadalupe
2016-01-01
The study population is defined as a set of cases that is determined, limited, and accessible, and that will constitute the subjects for the selection of the sample; it must fulfill several characteristics and distinct criteria. The objectives of this manuscript focus on specifying each of the elements required to select the participants of a research project during the elaboration of the protocol, including the concepts of study population, sample, selection criteria, and sampling methods. After delineating the study population, the researcher must specify the criteria that each participant has to comply with. The criteria that comprise these specific characteristics are called selection or eligibility criteria. These criteria are inclusion, exclusion, and elimination criteria, and they delineate the eligible population. Sampling methods are divided into two large groups: 1) probabilistic or random sampling and 2) non-probabilistic sampling. The difference lies in the employment of statistical methods to select the subjects. In every research project, it is necessary to establish at the outset the specific number of participants to be included to achieve the objectives of the study. This number is the sample size, and it can be calculated or estimated with mathematical formulas and statistical software.
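One common sample-size calculation of the kind mentioned here, estimating a proportion with a given absolute precision, can be sketched as follows (the formula choice and the plugged-in values are illustrative assumptions, not from the article):

```python
import math

# Sample size for estimating a proportion p with absolute precision d at
# 95% confidence (z = 1.96): n = z^2 * p * (1 - p) / d^2, rounded up.
def sample_size_proportion(p, d, z=1.96):
    return math.ceil(z ** 2 * p * (1 - p) / d ** 2)

n = sample_size_proportion(p=0.30, d=0.05)   # expected prevalence 30%, +/- 5%
# n == 323
```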
Collection development using interlibrary loan borrowing and acquisitions statistics.
Byrd, G D; Thomas, D A; Hughes, K E
1982-01-01
Libraries, especially those supporting the sciences, continually face the problem of selecting appropriate new books for their users. Traditional collection development techniques include the use of librarian or user subject specialists, user recommendations, and approval plans. These methods of selection, however, are most effective in large libraries and do not systematically correlate new book purchases with the actual demands of the users served. This paper describes a statistical method for determining subject strengths and weaknesses in a library book collection in relation to user demand. Using interlibrary loan borrowing and book acquisition statistics gathered for one fiscal year from three health sciences libraries, the authors developed a way to graph the broad and narrow subject fields of strength and potential weakness in a book collection. This method has the advantages of simplicity, speed of implementation, and clarity. It can also be used over a period of time to verify the success or failure of a collection development program. Finally, the method has potential as a tool for use by two or more libraries seeking to improve cooperative collection development in a network or consortium. PMID:7059712
Addeh, Abdoljalil; Khormali, Aminollah; Golilarz, Noorbakhsh Amiri
2018-05-04
Control chart patterns are the most commonly used statistical process control (SPC) tools to monitor process changes; when a control chart produces an out-of-control signal, the process has changed. In this study, a new method based on an optimized radial basis function neural network (RBFNN) is proposed for control chart pattern (CCP) recognition. The proposed method consists of four main modules: feature extraction, feature selection, classification, and learning. In the feature extraction module, shape and statistical features are used; various shape and statistical features have recently been presented for CCP recognition. In the feature selection module, the association rules (AR) method is employed to select the best set of shape and statistical features. In the classification module, an RBFNN is used; because the learning algorithm has a high impact on network performance, a new learning algorithm based on the bees algorithm is used in the learning module. Most studies have considered only six patterns: Normal, Cyclic, Increasing Trend, Decreasing Trend, Upward Shift and Downward Shift. Because the Normal, Stratification, and Systematic patterns are very similar to each other and difficult to distinguish, most studies have not considered Stratification and Systematic. To support continuous monitoring and control of the production process and exact identification of the type of problem encountered, eight patterns are investigated in this study. The proposed method is tested on a dataset containing 1600 samples (200 samples from each pattern), and the results show that it performs very well. Copyright © 2018 ISA. Published by Elsevier Ltd. All rights reserved.
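The pattern generation and feature extraction steps can be sketched on synthetic data. The generators, noise level, and the three features below are illustrative assumptions, not the paper's exact definitions:

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(30.0)   # sample index within one chart window

# Synthetic control chart patterns (illustrative generators):
normal = rng.normal(0, 0.5, 30)                                # in-control noise
trend  = rng.normal(0, 0.5, 30) + 0.1 * t                      # increasing trend
shift  = rng.normal(0, 0.5, 30) + np.where(t >= 15, 2.0, 0.0)  # upward shift

def features(x):
    # A few statistical/shape features of the kind used for CCP recognition.
    slope = np.polyfit(t, x, 1)[0]   # least-squares slope (shape feature)
    return {"mean": x.mean(), "std": x.std(ddof=1), "slope": slope}
```

A classifier (here, the paper's optimized RBFNN) is then trained on such feature vectors rather than on the raw chart values.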
Erus, Guray; Zacharaki, Evangelia I; Davatzikos, Christos
2014-04-01
This paper presents a method for capturing statistical variation of normal imaging phenotypes, with emphasis on brain structure. The method aims to estimate the statistical variation of a normative set of images from healthy individuals, and identify abnormalities as deviations from normality. A direct estimation of the statistical variation of the entire volumetric image is challenged by the high-dimensionality of images relative to smaller sample sizes. To overcome this limitation, we iteratively sample a large number of lower dimensional subspaces that capture image characteristics ranging from fine and localized to coarser and more global. Within each subspace, a "target-specific" feature selection strategy is applied to further reduce the dimensionality, by considering only imaging characteristics present in a test subject's images. Marginal probability density functions of selected features are estimated through PCA models, in conjunction with an "estimability" criterion that limits the dimensionality of estimated probability densities according to available sample size and underlying anatomy variation. A test sample is iteratively projected to the subspaces of these marginals as determined by PCA models, and its trajectory delineates potential abnormalities. The method is applied to segmentation of various brain lesion types, and to simulated data on which superiority of the iterative method over straight PCA is demonstrated. Copyright © 2014 Elsevier B.V. All rights reserved.
Learning process mapping heuristics under stochastic sampling overheads
NASA Technical Reports Server (NTRS)
Ieumwananonthachai, Arthur; Wah, Benjamin W.
1991-01-01
A statistical method was developed previously for improving process mapping heuristics. The method systematically explores the space of possible heuristics under a specified time constraint. Its goal is to obtain the best possible heuristics while trading off the solution quality of the process mapping heuristics against their execution time. The statistical selection method is extended to take into consideration the variations in the amount of time used to evaluate heuristics on a problem instance. The improvement in performance under this more realistic assumption is presented, along with some methods that alleviate the additional complexity.
NASA Astrophysics Data System (ADS)
Steinberg, P. D.; Brener, G.; Duffy, D.; Nearing, G. S.; Pelissier, C.
2017-12-01
Hyperparameter optimization of statistical models, i.e. automated model scoring and selection using methods such as evolutionary algorithms, grid searches, and randomized searches, can improve forecast model skill by reducing errors associated with model parameterization, model structure, and statistical properties of training data. Ensemble Learning Models (Elm), and the related Earthio package, provide a flexible interface for automating the selection of parameters and model structure for machine learning models common in climate science and land cover classification, offering convenient tools for loading NetCDF, HDF, Grib, or GeoTiff files, decomposition methods like PCA and manifold learning, and parallel training and prediction with unsupervised and supervised classification, clustering, and regression estimators. Continuum Analytics is using Elm to experiment with statistical soil moisture forecasting based on meteorological forcing data from NASA's North American Land Data Assimilation System (NLDAS). There Elm uses the NSGA-2 multiobjective optimization algorithm to optimize statistical preprocessing of the forcing data to improve goodness-of-fit for statistical models (i.e. feature engineering). This presentation will discuss Elm and its components, including dask (distributed task scheduling), xarray (data structures for n-dimensional arrays), and scikit-learn (statistical preprocessing, clustering, classification, regression), and it will show how NSGA-2 is being used to automate the selection of soil moisture forecast statistical models for North America.
Zhang, Yun; Baheti, Saurabh; Sun, Zhifu
2018-05-01
High-throughput bisulfite methylation sequencing such as reduced representation bisulfite sequencing (RRBS), Agilent SureSelect Human Methyl-Seq (Methyl-seq) or whole-genome bisulfite sequencing is commonly used for base-resolution methylome research. These data are represented either by the ratio of methylated cytosines versus total coverage at a CpG site or by the numbers of methylated and unmethylated cytosines. Multiple statistical methods can be used to detect differentially methylated CpGs (DMCs) between conditions, and these methods are often the basis for the next step of differentially methylated region identification. The ratio data have the flexibility of fitting many linear models, while the raw count data take coverage information into consideration. There is an array of options in each datatype for DMC detection; however, it is not clear which statistical method is optimal. In this study, we systematically evaluated four statistical methods on methylation ratio data and four methods on count-based data and compared their performance with regard to type I error control, sensitivity and specificity of DMC detection, and computational resource demands, using real RRBS data along with simulation. Our results show that the ratio-based tests are generally more conservative (less sensitive) than the count-based tests. However, some count-based methods have high false-positive rates and should be avoided. The beta-binomial model gives a good balance between sensitivity and specificity and is the preferred method. Selection of methods in different settings, signal versus noise, and sample size estimation are also discussed.
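A toy sketch of a beta-binomial likelihood-ratio test at a single CpG site follows. The counts, the log-parameterization, the optimizer, and the chi-square reference are all illustrative assumptions, not any specific package's implementation:

```python
import numpy as np
from scipy import optimize, stats

# Methylated counts k out of coverage n, per sample (hypothetical numbers).
k_ctrl = np.array([3, 5, 2, 4]);     n_ctrl = np.array([20, 25, 18, 22])
k_case = np.array([15, 18, 12, 16]); n_case = np.array([20, 25, 18, 22])

def neg_loglik(params, k, n):
    a, b = np.exp(params)                 # log-parameterization keeps a, b > 0
    return -stats.betabinom.logpmf(k, n, a, b).sum()

def max_loglik(k, n):
    res = optimize.minimize(neg_loglik, x0=[0.0, 0.0], args=(k, n),
                            method="Nelder-Mead")
    return -res.fun

# Likelihood-ratio test: separate beta-binomial fits per group (4 parameters)
# versus one pooled fit (2 parameters).
ll_sep = max_loglik(k_ctrl, n_ctrl) + max_loglik(k_case, n_case)
ll_pool = max_loglik(np.concatenate([k_ctrl, k_case]),
                     np.concatenate([n_ctrl, n_case]))
lr = 2 * (ll_sep - ll_pool)
p_value = stats.chi2.sf(lr, df=2)         # asymptotic chi-square reference
```

The beta-binomial captures extra-binomial variation across samples, which is why it balances sensitivity and specificity better than a plain binomial test.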
solGS: a web-based tool for genomic selection
USDA-ARS?s Scientific Manuscript database
Genomic selection (GS) promises to improve accuracy in estimating breeding values and genetic gain for quantitative traits compared to traditional breeding methods. Its reliance on high-throughput genome-wide markers and statistical complexity, however, is a serious challenge in data management, ana...
NASA Astrophysics Data System (ADS)
Giovanna, Vessia; Luca, Pisano; Carmela, Vennari; Mauro, Rossi; Mario, Parise
2016-01-01
This paper proposes an automated method for the selection of the rainfall data (duration, D, and cumulated rainfall, E) responsible for shallow landslide initiation. The method mimics an expert identifying D and E from rainfall records through a manual procedure whose rules are applied according to his or her judgement. The comparison between the two methods is based on 300 D-E pairs drawn from temporal rainfall data series recorded in a 30-day window before landslide occurrence. Statistical tests on the D and E samples, considered both as paired and as independent values to verify whether they belong to the same population, show that the automated procedure is able to replicate the pairs drawn by expert judgement. Furthermore, a criterion based on cumulative distribution functions (CDFs) is proposed to select, among the six pairs drawn by the coded procedure, the D-E pair most closely related to the expert one for tracing the empirical rainfall threshold line.
Rao, Masood Hussain; Khan, Nazeer
2010-09-01
To compare the statistical methods, types of articles, and designs of studies used in the 1998 and 2007 articles of leading indexed and non-indexed medical journals of Pakistan, six leading medical journals of Pakistan were selected for this study: (1) JCPSP, (2) JPMA, (3) JAMC, (4) PJMS, (5) PJMR and (6) PAFMJ. A total of 1,057 articles were reviewed to achieve this objective: 366 from 1998 and 691 from 2007. Original articles contributed the maximum percentage of 65.6%, followed by case reports with 24.8%. The contribution of case reports in 1998 was 20.5%, which increased to 27.1% in 2007. There was no statistically significant difference between indexed and non-indexed journals for the different types of statistical methods in 1998 or 2007. In total, 749 articles were categorized as original articles or short communications. Among them, 51% mentioned a study design, and 67.3% of those were correct for the respective methodology. In 1998, 202 (74%) articles used no statistics or only descriptive statistics, while in 2007, 239 (50.2%) articles did the same. A reader familiar with the t-test and contingency tables could have understood 97.4% of the scientific articles in 1998; this percentage dropped to 83.0% in 2007. The quality of elaborating methods and the usage of biostatistics in the six leading Pakistani medical journals improved from 1998 to 2007, but still lag behind those of other Western medical journals.
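The two "workhorse" analyses the survey refers to, the t-test and contingency tables, can be sketched with hypothetical data:

```python
import numpy as np
from scipy.stats import chi2_contingency, ttest_ind

# 2x2 contingency table, e.g. outcome by treatment (hypothetical counts).
table = np.array([[30, 10],
                  [20, 25]])
chi2, p_table, dof, expected = chi2_contingency(table)

# Two-sample t-test on a continuous endpoint (hypothetical values).
group1 = [5.1, 4.8, 5.6, 5.0, 4.9]
group2 = [5.9, 6.1, 5.7, 6.4, 6.0]
t_stat, p_t = ttest_ind(group1, group2)
```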
Semisupervised Clustering by Iterative Partition and Regression with Neuroscience Applications
Qian, Guoqi; Wu, Yuehua; Ferrari, Davide; Qiao, Puxue; Hollande, Frédéric
2016-01-01
Regression clustering is a mixture of unsupervised and supervised statistical learning and data mining methods, found in a wide range of applications including artificial intelligence and neuroscience. It performs unsupervised learning when it clusters the data according to their respective unobserved regression hyperplanes. The method also performs supervised learning when it fits regression hyperplanes to the corresponding data clusters. Applying regression clustering in practice requires means of determining the underlying number of clusters in the data, finding the cluster label of each data point, and estimating the regression coefficients of the model. In this paper, we review the estimation and selection issues in regression clustering with regard to the least squares and robust statistical methods. We also provide a model selection based technique to determine the number of regression clusters underlying the data. We further develop a computing procedure for regression clustering estimation and selection. Finally, simulation studies are presented for assessing the procedure, together with analyzing a real data set on RGB cell marking in neuroscience to illustrate and interpret the method. PMID:27212939
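The alternating partition-and-regression iteration can be sketched for two regression lines. The data, initialization, and fixed iteration count are illustrative simplifications; the paper additionally covers robust fitting and selecting the number of clusters:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data generated from two regression lines plus noise.
x = rng.uniform(0, 10, 200)
true_c = rng.integers(0, 2, 200)
y = np.where(true_c == 0, 1.0 + 2.0 * x, 8.0 - 1.0 * x) + rng.normal(0, 0.3, 200)

X = np.column_stack([np.ones_like(x), x])      # design matrix: intercept, slope
coefs = np.array([[0.0, 1.0], [5.0, -0.5]])    # rough initial hyperplanes

for _ in range(20):
    # Partition step: assign each point to the line that fits it best.
    resid = y[:, None] - X @ coefs.T
    assign = np.argmin(resid ** 2, axis=1)
    # Regression step: refit each line by least squares on its own cluster.
    for c in (0, 1):
        coefs[c] = np.linalg.lstsq(X[assign == c], y[assign == c], rcond=None)[0]

slopes = sorted(coefs[:, 1])   # should approach the generating slopes -1 and 2
```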
Baccini, Michela; Carreras, Giulia
2014-10-01
This paper describes the methods used to investigate variations in total alcoholic beverage consumption as related to selected control intervention policies and other socioeconomic factors (unplanned factors) within 12 European countries involved in the AMPHORA project. The analysis presented several critical points: presence of missing values, strong correlation among the unplanned factors, and long-term waves or trends in both the time series of alcohol consumption and the time series of the main explanatory variables. These difficulties were addressed by implementing a multiple imputation procedure for filling in missing values, then specifying for each country a multiple regression model which accounted for time trend, policy measures, and a limited set of unplanned factors selected in advance on the basis of sociological and statistical considerations. This approach allowed estimating the "net" effect of the selected control policies on alcohol consumption, but not the association between each unplanned factor and the outcome.
Parameter estimation and order selection for an empirical model of VO2 on-kinetics.
Alata, O; Bernard, O
2007-04-27
In humans, VO2 on-kinetics are noisy numerical signals that reflect pulmonary oxygen exchange kinetics at the onset of exercise. They are empirically modelled as a sum of an offset and delayed exponentials. The number of delayed exponentials, i.e. the order of the model, is commonly supposed to be 1 for low-intensity exercises and 2 for high-intensity exercises. As no ground truth has ever been provided to validate these postulates, physiologists still need statistical methods to verify their hypotheses about the number of exponentials of the VO2 on-kinetics, especially in the case of high-intensity exercises. Our objectives are first to develop accurate methods for estimating the parameters of the model at a fixed order, and then to propose statistical tests for selecting the appropriate order. In this paper, we evaluate, on simulated data, the performance of simulated annealing for estimating the model parameters and of information criteria for selecting the order. The simulated data are generated with both single-exponential and double-exponential models and corrupted by white Gaussian noise. Performance is reported at various signal-to-noise ratios (SNR). Concerning parameter estimation, results show that the confidence of the estimated parameters improves as the SNR of the response to be fitted increases. Concerning model selection, results show that information criteria are appropriate statistical criteria for selecting the number of exponentials.
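The order-selection idea can be sketched as an AIC comparison between an offset-only model and an offset-plus-one-exponential model on simulated data. The parameter values, bounds, and Gaussian-noise AIC form are illustrative assumptions; the paper evaluates several information criteria and uses simulated annealing for the fits:

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(3)
t = np.linspace(0, 360, 181)                      # time (s) from exercise onset

# Offset plus one delayed exponential (parameter values are illustrative).
def mono(t, a0, a1, tau, td):
    dt = np.clip(t - td, 0.0, None)               # response starts at delay td
    return a0 + a1 * (1.0 - np.exp(-dt / tau))

y = mono(t, 0.5, 1.5, 30.0, 10.0) + rng.normal(0, 0.05, t.size)

def aic(y, yhat, k):
    # AIC for Gaussian noise, up to an additive constant.
    n = y.size
    return n * np.log(np.sum((y - yhat) ** 2) / n) + 2 * k

# Candidate orders: offset only (k = 1) vs offset + one exponential (k = 4).
aic0 = aic(y, np.full_like(y, y.mean()), k=1)
p1, _ = curve_fit(mono, t, y, p0=[0.4, 1.0, 20.0, 5.0],
                  bounds=([0, 0, 1, 0], [5, 5, 200, 120]))
aic1 = aic(y, mono(t, *p1), k=4)
# The lower AIC identifies the better-supported order.
```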
Mao, Yong; Zhou, Xiao-Bo; Pi, Dao-Ying; Sun, You-Xian; Wong, Stephen T C
2005-10-01
In microarray-based cancer classification, gene selection is an important issue owing to the large number of variables, the small number of samples, and the non-linearity of the problem. It is difficult to get satisfying results using conventional linear statistical methods. Recursive feature elimination based on support vector machines (SVM-RFE) is an effective algorithm for gene selection and cancer classification, which are integrated into a consistent framework. In this paper, we propose a new method for selecting the parameters of this algorithm implemented with Gaussian kernel SVMs: a genetic algorithm is used to search for the optimal pair of parameters, as a better alternative to the common practice of selecting the apparently best parameters. Fast implementation issues for this method are also discussed for pragmatic reasons. The proposed method was tested on two representative datasets, hereditary breast cancer and acute leukaemia. The experimental results indicate that the proposed method performs well in selecting genes and achieves high classification accuracies with these genes.
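A sketch of SVM-RFE using scikit-learn follows. Note one deliberate simplification: sklearn's `RFE` ranks features via the estimator's `coef_`, so a linear kernel is used here, whereas the paper works with Gaussian-kernel SVMs and tunes their parameters with a genetic algorithm. The synthetic dataset stands in for microarray data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Synthetic "expression" data: 100 samples, 50 features, 5 informative.
X, y = make_classification(n_samples=100, n_features=50, n_informative=5,
                           random_state=0)

# Recursively drop the lowest-weighted 10% of features until 5 remain.
selector = RFE(SVC(kernel="linear", C=1.0), n_features_to_select=5, step=0.1)
selector.fit(X, y)
selected = [i for i, keep in enumerate(selector.support_) if keep]
```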
Uniting statistical and individual-based approaches for animal movement modelling.
Latombe, Guillaume; Parrott, Lael; Basille, Mathieu; Fortin, Daniel
2014-01-01
The dynamic nature of their internal states and of the environment directly shapes animals' spatial behaviours and gives rise to emergent properties at broader scales in natural systems. However, integrating these dynamic features into habitat selection studies remains challenging, because field work to access internal states is practically impossible and current statistical models cannot produce dynamic outputs. To address these issues, we developed a robust method which combines statistical and individual-based modelling (IBM). Using a statistical technique for forward modelling of the IBM has the advantage of being faster to parameterize than a pure inverse modelling technique and allows for robust selection of parameters. Using GPS locations from caribou monitored in Québec, caribou movements were modelled based on generative mechanisms accounting for dynamic variables at a low level of emergence. These variables were accessed by replicating real individuals' movements in parallel sub-models, and movement parameters were then empirically parameterized using Step Selection Functions. The final IBM was validated using both k-fold cross-validation and emergent-pattern validation, and was tested under two scenarios with varying hardwood encroachment. Our results highlighted a functional response in habitat selection, which suggests that our method was able to capture the complexity of the natural system and adequately provided projections on future possible states of the system in response to different management plans. This is especially relevant for testing the long-term impact of scenarios corresponding to environmental configurations that have yet to be observed in real systems.
Le, Trang T; Simmons, W Kyle; Misaki, Masaya; Bodurka, Jerzy; White, Bill C; Savitz, Jonathan; McKinney, Brett A
2017-09-15
Classification of individuals into disease or clinical categories from high-dimensional biological data with low prediction error is an important challenge of statistical learning in bioinformatics. Feature selection can improve classification accuracy but must be incorporated carefully into cross-validation to avoid overfitting. Recently, feature selection methods based on differential privacy, such as differentially private random forests and reusable holdout sets, have been proposed. However, for domains such as bioinformatics, where the number of features is much larger than the number of observations (p ≫ n), these differential privacy methods are susceptible to overfitting. We introduce private Evaporative Cooling, a stochastic privacy-preserving machine learning algorithm that uses Relief-F for feature selection and random forest for privacy-preserving classification that also prevents overfitting. We relate the privacy-preserving threshold mechanism to a thermodynamic Maxwell-Boltzmann distribution, where the temperature represents the privacy threshold. We use the thermal statistical physics concept of Evaporative Cooling of atomic gases to perform backward stepwise privacy-preserving feature selection. On simulated data with main effects and statistical interactions, we compare accuracies on holdout and validation sets for three privacy-preserving methods: the reusable holdout, reusable holdout with random forest, and private Evaporative Cooling, which uses Relief-F feature selection and random forest classification. In simulations where interactions exist between attributes, private Evaporative Cooling provides higher classification accuracy without overfitting based on an independent validation set. In simulations without interactions, thresholdout with random forest and private Evaporative Cooling give comparable accuracies. We also apply these privacy methods to human brain resting-state fMRI data from a study of major depressive disorder.
Code available at http://insilico.utulsa.edu/software/privateEC . brett-mckinney@utulsa.edu. Supplementary data are available at Bioinformatics online. © The Author (2017). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
'New insight into statistical hydrology' preface to the special issue
NASA Astrophysics Data System (ADS)
Kochanek, Krzysztof
2018-04-01
Statistical methods are still the basic tool for investigating the random, extreme events occurring in the hydrosphere. On 21-22 September 2017, the international workshop Statistical Hydrology (StaHy) 2017 took place in Warsaw (Poland) under the auspices of the International Association of Hydrological Sciences. The authors of the presentations proposed to publish their research results in this Special Issue of Acta Geophysica, 'New Insight into Statistical Hydrology'. Five papers were selected for publication, touching on the most crucial issues of statistical methodology in hydrology.
NASA Technical Reports Server (NTRS)
Heine, John J. (Inventor); Clarke, Laurence P. (Inventor); Deans, Stanley R. (Inventor); Stauduhar, Richard Paul (Inventor); Cullers, David Kent (Inventor)
2001-01-01
A system and method for analyzing a medical image to determine whether an abnormality is present, for example, in digital mammograms, includes the application of a wavelet expansion to a raw image to obtain subspace images of varying resolution. At least one subspace image is selected that has a resolution commensurate with a desired predetermined detection resolution range. A functional form of a probability distribution function is determined for each selected subspace image, and an optimal statistical normal image region test is determined for each selected subspace image. A threshold level for the probability distribution function is established from the optimal statistical normal image region test for each selected subspace image. A region size comprising at least one sector is defined, and an output image is created that includes a combination of all regions for each selected subspace image. Each region has a first value when the region intensity level is above the threshold and a second value when the region intensity level is below the threshold. This permits the localization of a potential abnormality within the image.
Statistical process control methods allow the analysis and improvement of anesthesia care.
Fasting, Sigurd; Gisvold, Sven E
2003-10-01
Quality aspects of the anesthetic process are reflected in the rate of intraoperative adverse events. The purpose of this report is to illustrate how the quality of the anesthesia process can be analyzed using statistical process control methods, and exemplify how this analysis can be used for quality improvement. We prospectively recorded anesthesia-related data from all anesthetics for five years. The data included intraoperative adverse events, which were graded into four levels, according to severity. We selected four adverse events, representing important quality and safety aspects, for statistical process control analysis. These were: inadequate regional anesthesia, difficult emergence from general anesthesia, intubation difficulties and drug errors. We analyzed the underlying process using 'p-charts' for statistical process control. In 65,170 anesthetics we recorded adverse events in 18.3%; mostly of lesser severity. Control charts were used to define statistically the predictable normal variation in problem rate, and then used as a basis for analysis of the selected problems with the following results: Inadequate plexus anesthesia: stable process, but unacceptably high failure rate; Difficult emergence: unstable process, because of quality improvement efforts; Intubation difficulties: stable process, rate acceptable; Medication errors: methodology not suited because of low rate of errors. By applying statistical process control methods to the analysis of adverse events, we have exemplified how this allows us to determine if a process is stable, whether an intervention is required, and if quality improvement efforts have the desired effect.
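The p-chart analysis described above can be sketched in a few lines: the centre line is the pooled event rate p-bar, and each subgroup gets 3-sigma limits p-bar +/- 3*sqrt(p-bar*(1-p-bar)/n). The function, variable names, and monthly counts below are illustrative assumptions, not the authors' data or code.

```python
import math

def p_chart_limits(event_counts, sample_sizes):
    """Centre line and 3-sigma control limits for a p-chart.

    Assumes each subgroup's adverse-event count is binomial around a
    common underlying rate (illustrative sketch, not the authors' code).
    """
    p_bar = sum(event_counts) / sum(sample_sizes)  # pooled event rate
    limits = []
    for n in sample_sizes:
        sigma = math.sqrt(p_bar * (1 - p_bar) / n)
        limits.append((max(0.0, p_bar - 3 * sigma), min(1.0, p_bar + 3 * sigma)))
    return p_bar, limits

# Hypothetical monthly counts of one adverse event per number of anesthetics
counts = [12, 9, 15, 11, 30, 10]
sizes = [600, 550, 700, 640, 660, 610]
p_bar, limits = p_chart_limits(counts, sizes)
# Months whose observed rate falls outside the predictable variation
out_of_control = [i for i, (c, n) in enumerate(zip(counts, sizes))
                  if not limits[i][0] <= c / n <= limits[i][1]]
```

With these hypothetical numbers only the fifth month signals special-cause variation, which is the kind of judgement (stable process vs. intervention required) the authors draw from their charts.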
A Generic multi-dimensional feature extraction method using multiobjective genetic programming.
Zhang, Yang; Rockett, Peter I
2009-01-01
In this paper, we present a generic feature extraction method for pattern classification using multiobjective genetic programming. This not only evolves the (near-)optimal set of mappings from a pattern space to a multi-dimensional decision space, but also simultaneously optimizes the dimensionality of that decision space. The presented framework evolves vector-to-vector feature extractors that maximize class separability. We demonstrate the efficacy of our approach by making statistically founded comparisons with a wide variety of established classifier paradigms over a range of datasets and find that for most of the pairwise comparisons, our evolutionary method delivers statistically smaller misclassification errors. At worst, our method displays no statistical difference in a few pairwise comparisons with established classifier/dataset combinations; crucially, none of the misclassification results produced by our method is worse than any comparator classifier. Although principally focused on feature extraction, feature selection is also performed as an implicit side effect; we show that both feature extraction and selection are important to the success of our technique. The presented method has the practical consequence of obviating the need to exhaustively evaluate a large family of conventional classifiers when faced with a new pattern recognition problem in order to attain a good classification accuracy.
Park, Ji Eun; Han, Kyunghwa; Sung, Yu Sub; Chung, Mi Sun; Koo, Hyun Jung; Yoon, Hee Mang; Choi, Young Jun; Lee, Seung Soo; Kim, Kyung Won; Shin, Youngbin; An, Suah; Cho, Hyo-Min
2017-01-01
Objective To evaluate the frequency and adequacy of statistical analyses in a general radiology journal when reporting a reliability analysis for a diagnostic test. Materials and Methods Sixty-three studies of diagnostic test accuracy (DTA) and 36 studies reporting reliability analyses published in the Korean Journal of Radiology between 2012 and 2016 were analyzed. Studies were judged using the methodological guidelines of the Radiological Society of North America-Quantitative Imaging Biomarkers Alliance (RSNA-QIBA), and COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) initiative. DTA studies were evaluated by nine editorial board members of the journal. Reliability studies were evaluated by study reviewers experienced with reliability analysis. Results Thirty-one (49.2%) of the 63 DTA studies did not include a reliability analysis when deemed necessary. Among the 36 reliability studies, proper statistical methods were used in all (5/5) studies dealing with dichotomous/nominal data, 46.7% (7/15) of studies dealing with ordinal data, and 95.2% (20/21) of studies dealing with continuous data. Statistical methods were described in sufficient detail regarding weighted kappa in 28.6% (2/7) of studies and regarding the model and assumptions of the intraclass correlation coefficient in 35.3% (6/17) and 29.4% (5/17) of studies, respectively. Reliability parameters were used as if they were agreement parameters in 23.1% (3/13) of studies. Reproducibility and repeatability were used incorrectly in 20% (3/15) of studies. Conclusion Greater attention to the importance of reporting reliability, thorough description of the related statistical methods, efforts not to neglect agreement parameters, and better use of relevant terminology are necessary. PMID:29089821
Statistical procedures for analyzing mental health services data.
Elhai, Jon D; Calhoun, Patrick S; Ford, Julian D
2008-08-15
In mental health services research, analyzing service utilization data often poses serious problems, given the presence of substantially skewed data distributions. This article presents a non-technical introduction to statistical methods specifically designed to handle the complexly distributed datasets that represent mental health service use, including Poisson, negative binomial, zero-inflated, and zero-truncated regression models. A flowchart is provided to assist the investigator in selecting the most appropriate method. Finally, a dataset of mental health service use reported by medical patients is described, and a comparison of results across several different statistical methods is presented. Implications of matching data analytic techniques appropriately with the often complexly distributed datasets of mental health services utilization variables are discussed.
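The flowchart idea above, matching a count-data model family to the shape of the service-use distribution, can be caricatured with two simple summaries: overdispersion (variance well above the mean) and excess zeros relative to a Poisson fit. The 1.5x and 2x thresholds and the example counts below are illustrative assumptions, not the article's decision rules.

```python
import math
import statistics

def suggest_count_model(counts):
    """Toy decision rule for skewed count data: check overdispersion and
    excess zeros, then name a candidate model family (assumed thresholds)."""
    mean = statistics.mean(counts)
    var = statistics.pvariance(counts)
    zero_frac = counts.count(0) / len(counts)
    poisson_zero = math.exp(-mean)          # P(X = 0) under Poisson(mean)
    overdispersed = var > 1.5 * mean
    excess_zeros = zero_frac > 2 * poisson_zero
    if excess_zeros:
        return "zero-inflated negative binomial" if overdispersed else "zero-inflated Poisson"
    return "negative binomial" if overdispersed else "Poisson"

# Hypothetical service-use counts
regular = [0, 1, 2, 1, 0, 2, 1, 3, 2, 1]   # modest, Poisson-like use
skewed = [0] * 15 + [1, 1, 5, 9, 12]       # many non-users, heavy tail
```

A zero-truncated model, the remaining family in the abstract, would instead apply when the sampling design excludes zero counts altogether (e.g., only service users are enrolled).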
Decisions that Make a Difference in Detecting Differential Item Functioning
ERIC Educational Resources Information Center
Sireci, Stephen G.; Rios, Joseph A.
2013-01-01
There are numerous statistical procedures for detecting items that function differently across subgroups of examinees that take a test or survey. However, in endeavouring to detect items that may function differentially, selection of the statistical method is only one of many important decisions. In this article, we discuss the important decisions…
Assessing Statistical Change Indices in Selected Social Work Intervention Research Studies
ERIC Educational Resources Information Center
Ham, Amanda D.; Huggins-Hoyt, Kimberly Y.; Pettus, Joelle
2016-01-01
Objectives: This study examined how evaluation and intervention research (IR) studies assessed statistical change to ascertain effectiveness. Methods: Studies from six core social work journals (2009-2013) were reviewed (N = 1,380). Fifty-two evaluation (n= 27) and intervention (n = 25) studies met the inclusion criteria. These studies were…
Zhang, Xiaohua Douglas; Yang, Xiting Cindy; Chung, Namjin; Gates, Adam; Stec, Erica; Kunapuli, Priya; Holder, Dan J; Ferrer, Marc; Espeseth, Amy S
2006-04-01
RNA interference (RNAi) high-throughput screening (HTS) experiments carried out using large (>5000 short interfering [si]RNA) libraries generate a huge amount of data. In order to use these data to identify the most effective siRNAs tested, it is critical to adopt and develop appropriate statistical methods. To address the questions in hit selection of RNAi HTS, we proposed a quartile-based method which is robust to outliers, true hits and nonsymmetrical data. We compared it with the more traditional tests, mean +/- k standard deviations (SD) and median +/- 3 median absolute deviations (MAD). The results suggested that the quartile-based method selected more hits than mean +/- k SD under the same preset error rate. The number of hits selected by median +/- k MAD was close to that by the quartile-based method. Further analysis suggested that the quartile-based method had the greatest power in detecting true hits, especially weak or moderate true hits. Our investigation also suggested that platewise analysis (determining effective siRNAs on a plate-by-plate basis) can adjust for systematic errors in different plates, while an experimentwise analysis, in which effective siRNAs are identified in an analysis of the entire experiment, cannot. However, experimentwise analysis may detect a cluster of true positive hits placed together in one or several plates, while platewise analysis may not. To display hit selection results, we designed a specific figure called a plate-well series plot. We thus suggest the following strategy for hit selection in RNAi HTS experiments. First, choose the quartile-based method, or median +/- k MAD, for identifying effective siRNAs. Second, perform the chosen method experimentwise on transformed/normalized data, such as percentage inhibition, to check the possibility of hit clusters. If a cluster of selected hits is observed, repeat the analysis based on untransformed data to determine whether the cluster is due to an artifact in the data.
If no clusters of hits are observed, select hits by performing platewise analysis on transformed data. Third, adopt the plate-well series plot to visualize both the data and the hit selection results, as well as to check for artifacts.
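The hit-selection rules compared above can be illustrated with stdlib Python. The abstract does not give the exact form of the quartile-based rule, so the boxplot-style Q1/Q3 +/- c*IQR rule below is an assumed stand-in; the median +/- k*MAD rule follows the abstract directly. Names and the example readouts are hypothetical.

```python
import statistics

def quartile_hits(values, c=1.5):
    """Boxplot-style quartile rule (assumed form): flag values beyond
    Q1 - c*IQR or Q3 + c*IQR; c controls the preset error rate."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [i for i, v in enumerate(values)
            if v < q1 - c * iqr or v > q3 + c * iqr]

def mad_hits(values, k=3):
    """median +/- k*MAD rule from the abstract, robust to outliers."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return [i for i, v in enumerate(values) if abs(v - med) > k * mad]

# Hypothetical normalized siRNA readouts; index 6 is a strong hit
scores = [0.1, 0.0, -0.2, 0.05, 0.15, -0.1, 5.0, 0.2, -0.05, 0.1]
```

Both rules keep their cutoffs anchored to the bulk of the distribution, which is why, unlike mean +/- k SD, a single extreme hit does not inflate its own threshold.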
Jasińska-Stroschein, Magdalena; Kurczewska, Urszula; Orszulak-Michalak, Daria
2017-05-01
When performing in vitro dissolution testing, especially in the area of biowaivers, it is necessary to follow regulatory guidelines to minimize the risk of an unsafe or ineffective product being approved. The present study examines model-independent and model-dependent methods of comparing dissolution profiles, comparing and contrasting various international guidelines. Dissolution profiles for immediate-release solid oral dosage forms were generated. The test material comprised tablets containing several substances, with at least 85% of the labeled amount dissolved within 15 min, 20-30 min, or 45 min. Conclusions about dissolution profile similarity can vary with regard to the following criteria: time point selection (including the last time point), coefficient of variation, and statistical method selection. Variation between regulatory guidance documents and statistical methods can raise methodological questions and potentially result in different outcomes when reporting dissolution profile testing. Harmonization of existing guidelines would address existing problems concerning the interpretation of regulatory recommendations and research findings.
Assessment of NDE reliability data
NASA Technical Reports Server (NTRS)
Yee, B. G. W.; Couchman, J. C.; Chang, F. H.; Packman, D. F.
1975-01-01
Twenty sets of relevant nondestructive test (NDT) reliability data were identified, collected, compiled, and categorized. A criterion for the selection of data for statistical analysis considerations was formulated, and a model to grade the quality and validity of the data sets was developed. Data input formats, which record the pertinent parameters of the defect/specimen and inspection procedures, were formulated for each NDE method. A comprehensive computer program was written and debugged to calculate the probability of flaw detection at several confidence limits by the binomial distribution. This program also selects the desired data sets for pooling and tests the statistical pooling criteria before calculating the composite detection reliability. An example of the calculated reliability of crack detection in bolt holes by an automatic eddy current method is presented.
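The binomial probability-of-detection (POD) calculation described above can be sketched as a one-sided Clopper-Pearson lower confidence bound, solved here by bisection with only the standard library. The functions are an illustrative sketch of the binomial analysis, not the report's computer program.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

def pod_lower_bound(detections, trials, confidence=0.95):
    """One-sided lower confidence bound on the probability of flaw
    detection (Clopper-Pearson form, solved by bisection)."""
    if detections == 0:
        return 0.0
    alpha = 1 - confidence
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        # P(X >= detections | p) grows with p; the bound is where it hits alpha
        if 1 - binom_cdf(detections - 1, trials, mid) < alpha:
            lo = mid
        else:
            hi = mid
    return lo
```

For example, detecting 29 flaws in 29 trials gives a 95% lower bound of about 0.90, the familiar "90/95" point in NDE reliability demonstrations.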
Podsakoff, Philip M; MacKenzie, Scott B; Lee, Jeong-Yeon; Podsakoff, Nathan P
2003-10-01
Interest in the problem of method biases has a long history in the behavioral sciences. Despite this, a comprehensive summary of the potential sources of method biases and how to control for them does not exist. Therefore, the purpose of this article is to examine the extent to which method biases influence behavioral research results, identify potential sources of method biases, discuss the cognitive processes through which method biases influence responses to measures, evaluate the many different procedural and statistical techniques that can be used to control method biases, and provide recommendations for how to select appropriate procedural and statistical remedies for different types of research settings.
Mueller, David S.
2013-01-01
profiles from the entire cross section and multiple transects to determine a mean profile for the measurement. The use of an exponent derived from normalized data from the entire cross section is shown to be valid for application of the power velocity distribution law in the computation of the unmeasured discharge in a cross section. Selected statistics are combined with empirically derived criteria to automatically select the appropriate extrapolation methods. A graphical user interface (GUI) provides the user tools to visually evaluate the automatically selected extrapolation methods and manually change them, as necessary. The sensitivity of the total discharge to available extrapolation methods is presented in the GUI. Use of extrap by field hydrographers has demonstrated that extrap is a more accurate and efficient method of determining the appropriate extrapolation methods compared with tools currently (2012) provided in the ADCP manufacturers’ software.
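The power velocity distribution law mentioned above can be sketched in stdlib Python: fit v = a*z**b by least squares on log-transformed data, then integrate the fitted profile over the unmeasured portion of the water column. The functions and the example profile (using the classical 1/6 exponent) are illustrative assumptions, not extrap's implementation.

```python
import math

def fit_power_law(depths, velocities):
    """Least-squares fit of v = a * z**b on log-transformed data."""
    lx = [math.log(z) for z in depths]
    ly = [math.log(v) for v in velocities]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    b = (sum((x - mx) * (y - my) for x, y in zip(lx, ly))
         / sum((x - mx) ** 2 for x in lx))
    a = math.exp(my - b * mx)
    return a, b

def unmeasured_discharge(a, b, z1, z2):
    """Per-unit-width discharge from integrating a*z**b between z1 and z2."""
    return a / (b + 1) * (z2 ** (b + 1) - z1 ** (b + 1))

# Hypothetical measured profile following the classical 1/6-power law
depths = [0.5, 1.0, 2.0, 4.0, 6.0]
vels = [2.0 * z ** (1 / 6) for z in depths]
a, b = fit_power_law(depths, vels)
q_bottom = unmeasured_discharge(a, b, 0.0, 0.5)  # unmeasured zone near the bed
```

Deriving the exponent b from the measured portion of the cross section, rather than assuming 1/6, mirrors the abstract's point about using normalized data from the entire cross section.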
Statistical Analysis of the Uncertainty in Pre-Flight Aerodynamic Database of a Hypersonic Vehicle
NASA Astrophysics Data System (ADS)
Huh, Lynn
The objective of the present research was to develop a new method to derive the aerodynamic coefficients and the associated uncertainties for flight vehicles via post-flight inertial navigation analysis using data from the inertial measurement unit. Statistical estimates of vehicle state and aerodynamic coefficients are derived using Monte Carlo simulation. Trajectory reconstruction using the inertial navigation system (INS) is a simple and widely used method. However, deriving realistic uncertainties in the reconstructed state and any associated parameters is not so straightforward. Extended Kalman filters, batch minimum-variance estimation, and other approaches have been used, but these methods generally depend on assumed physical models, assume statistical distributions (usually Gaussian), or have convergence issues for nonlinear problems. The approach here assumes no physical models, is applicable to any statistical distribution, and has no convergence issues. The new approach obtains the statistics directly from a sufficient number of Monte Carlo samples using only the generally well-known gyro and accelerometer specifications, and can be applied to systems of nonlinear form and non-Gaussian distribution. When redundant data are available, the set of Monte Carlo simulations is constrained to satisfy the redundant data within the uncertainties specified for those data. The proposed method was applied to validate the uncertainty in the pre-flight aerodynamic database of the X-43A Hyper-X research vehicle. In addition to gyro and acceleration data, the actual flight data include redundant measurements of position and velocity from the global positioning system (GPS). Criteria derived from the blended GPS and INS accuracy were used to select valid trajectories for statistical analysis.
The aerodynamic coefficients were derived from the selected trajectories either by direct extraction based on the equations of dynamics or by querying the pre-flight aerodynamic database. After applying the proposed method to the X-43A Hyper-X research vehicle, it was found that 1) there were consistent differences between the aerodynamic coefficients from the pre-flight aerodynamic database and those from post-flight analysis, 2) the pre-flight estimate of the pitching moment coefficients differed significantly from the post-flight analysis, 3) the type of distribution of the states from the Monte Carlo simulation was affected by that of the perturbation parameters, 4) the uncertainties in the pre-flight model were overestimated, 5) the range where the aerodynamic coefficients from the pre-flight aerodynamic database and post-flight analysis are in closest agreement is between Mach *.* and *.*, and more data points may be needed between Mach * and ** in the pre-flight aerodynamic database, 6) the selection criterion for valid trajectories from the Monte Carlo simulations was driven mostly by the horizontal velocity error, 7) the selection criterion must be based on a reasonable model to ensure the validity of the statistics from the proposed method, and 8) the results from the proposed method applied to two different flights with identical geometry and similar flight profiles were consistent.
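The constrained Monte Carlo idea above can be caricatured in one dimension: perturb accelerometer readings with a random bias and white noise drawn from the sensor specifications, integrate to position, and keep only samples consistent with a redundant GPS fix. Everything below (sensor levels, tolerances, the 1-D setting) is a hypothetical stand-in, not the X-43A analysis.

```python
import random

def mc_trajectories(accels, dt, bias_sigma, noise_sigma,
                    gps_pos, gps_tol, n_samples=2000, seed=1):
    """1-D Monte Carlo sketch of INS trajectory reconstruction with a
    GPS-consistency constraint on the accepted samples."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_samples):
        bias = rng.gauss(0.0, bias_sigma)  # per-run accelerometer bias
        v = p = 0.0
        for a in accels:
            v += (a + bias + rng.gauss(0.0, noise_sigma)) * dt
            p += v * dt
        if abs(p - gps_pos) <= gps_tol:  # redundant-data constraint
            accepted.append(p)
    return accepted

# 10 s of constant 1 m/s^2 acceleration sampled at 10 Hz
samples = mc_trajectories([1.0] * 100, dt=0.1, bias_sigma=0.01,
                          noise_sigma=0.05, gps_pos=50.5, gps_tol=1.0)
```

The statistics of `samples` (spread, skew) come directly from the accepted ensemble, with no Gaussian assumption on the output distribution, which is the method's selling point in the abstract.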
Wall finish selection in hospital design: a survey of facility managers.
Lavy, Sarel; Dixit, Manish K
2012-01-01
This paper seeks to analyze healthcare facility managers' perceptions regarding the materials used for interior wall finishes and the criteria used to select them. It also examines differences in wall finish materials and the selection process in three major hospital spaces: emergency, surgery, and in-patient units. These findings are compared with healthcare designers' perceptions on similar issues, as currently documented in the literature. Hospital design and the materials used for hospital construction have a considerable effect on the environment and health of patients. A 2002 survey revealed which characteristics healthcare facility designers consider when selecting materials for healthcare facilities; however, no similar study has examined the views of facility managers on building finish selection. A 22-question survey questionnaire was distributed to 210 facility managers of metropolitan, for-profit hospitals in Texas; IRB approval was obtained. Respondents were asked to rank 10 interior wall finish materials and 11 selection criteria for wall finishes. Data from 48 complete questionnaires were analyzed using descriptive statistics and nonparametric statistical analysis methods. The study found no statistically significant differences in terms of wall finish materials or the characteristics for material selection in the three major spaces studied. It identified facility managers' four most-preferred wall finish materials and the five most-preferred characteristics, with a statistical confidence level of greater than 95%. The paper underscores the importance of incorporating all perspectives: facility designers and facility managers should work together toward achieving common organizational goals.
NASA Technical Reports Server (NTRS)
1989-01-01
An assessment is made of quantitative methods and measures for evaluating launch commit criteria (LCC) performance trends. A statistical performance trending analysis pilot study was conducted and compared to STS-26 mission data. This study used four selected shuttle measurement types (solid rocket booster, external tank, space shuttle main engine, and range safety switch safe and arm device) from the five missions prior to mission 51-L. After obtaining raw data coordinates, each set of measurements was processed to obtain statistical confidence bounds and mean data profiles for each of the selected measurement types. STS-26 measurements were compared to the statistical database profiles to verify the statistical capability of assessing occurrences of data trend anomalies and abnormal time-varying operational conditions associated with data amplitude and phase shifts.
Feature selection from a facial image for distinction of sasang constitution.
Koo, Imhoi; Kim, Jong Yeol; Kim, Myoung Geun; Kim, Keun Ho
2009-09-01
Recently, oriental medicine has received attention for providing personalized medicine through consideration of the unique nature and constitution of individual patients. With the eventual goal of globalization, the current trend in oriental medicine research is standardization by adopting western scientific methods, which could represent a scientific revolution. The purpose of this study is to establish methods for finding statistically significant features in a facial image with respect to distinguishing constitution and to show the meaning of those features. From facial photo images, facial elements are analyzed in terms of distances, angles, and distance ratios, for which there are 1225, 61,250, and 749,700 features, respectively. Due to the very large number of facial features, it is quite difficult to determine truly meaningful features. We suggest a process for the efficient analysis of facial features including the removal of outliers, control for missing data to guarantee data confidence, and calculation of statistical significance by applying ANOVA. We show the statistical properties of selected features according to different constitutions using the nine distance, 10 angle, and 10 distance-ratio features that are finally established. Additionally, the Sasang constitutional meaning of the selected features is shown here.
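The feature counts quoted above are consistent with pairwise combinatorics over roughly 50 facial landmark points; the sketch below reproduces them under that assumption. In particular, the angle count 61,250 corresponds to pairing each of the 1225 distances with one of the 50 points, which is an inferred definition, not one stated in the abstract.

```python
from math import comb

n_points = 50                 # assumed number of facial landmark points
n_dist = comb(n_points, 2)    # unordered point pairs -> distances
n_angle = n_dist * n_points   # a distance pair combined with a reference point (inferred)
n_ratio = comb(n_dist, 2)     # unordered pairs of distinct distances -> ratios
```

The combinatorial explosion from 1225 to 749,700 features is exactly why the abstract stresses outlier removal and significance filtering before interpretation.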
Exploring and accounting for publication bias in mental health: a brief overview of methods.
Mavridis, Dimitris; Salanti, Georgia
2014-02-01
OBJECTIVE Publication bias undermines the integrity of published research. The aim of this paper is to present a synopsis of methods for exploring and accounting for publication bias. METHODS We discussed the main features of the following methods to assess publication bias: funnel plot analysis; trim-and-fill methods; regression techniques; and selection models. We applied these methods to a well-known example comparing published antidepressant trials with those submitted to the Food and Drug Administration (FDA) for regulatory approval. RESULTS The funnel plot-related methods (visual inspection, trim-and-fill, regression models) revealed an association between effect size and SE. Contours of statistical significance showed that asymmetry in the funnel plot is probably due to publication bias. The selection model found a significant correlation between effect size and propensity for publication. CONCLUSIONS Researchers should always consider the possible impact of publication bias. Funnel plot-related methods should be seen as a means of examining small-study effects and not be directly equated with publication bias. Possible causes for funnel plot asymmetry should be explored. Contours of statistical significance may help disentangle whether asymmetry in a funnel plot is caused by publication bias or not. Selection models, although underused, could be a useful resource when publication bias and heterogeneity are suspected, because they address directly the problem of publication bias and not that of small-study effects.
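The regression techniques surveyed above can be sketched with ordinary least squares: an Egger-style test regresses the standardized effect (effect/SE) on precision (1/SE) and inspects the intercept, which should be near zero for a symmetric funnel. This is a generic illustration with hypothetical numbers, not the authors' analysis of the FDA dataset.

```python
import statistics

def egger_regression(effects, ses):
    """Egger-style funnel-plot regression: standardized effect on
    precision; an intercept far from zero suggests small-study effects."""
    y = [e / s for e, s in zip(effects, ses)]
    x = [1 / s for s in ses]
    mx, my = statistics.mean(x), statistics.mean(y)
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return my - slope * mx, slope  # (intercept, slope)

# Hypothetical symmetric meta-analysis: same effect at every precision
intercept, slope = egger_regression([0.4] * 4, [0.1, 0.2, 0.3, 0.4])
```

Echoing the paper's caution: a nonzero intercept flags small-study effects, which may or may not be publication bias.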
Wang, Yikai; Kang, Jian; Kemmer, Phebe B.; Guo, Ying
2016-01-01
Currently, network-oriented analysis of fMRI data has become an important tool for understanding brain organization and brain networks. Among the range of network modeling methods, partial correlation has shown great promise in accurately detecting true brain network connections. However, the application of partial correlation in investigating brain connectivity, especially in large-scale brain networks, has been limited so far due to the technical challenges in its estimation. In this paper, we propose an efficient and reliable statistical method for estimating partial correlation in large-scale brain network modeling. Our method derives partial correlation based on the precision matrix estimated via the Constrained L1-minimization Approach (CLIME), a recently developed statistical method that is more efficient and demonstrates better performance than existing methods. To help select an appropriate tuning parameter for sparsity control in the network estimation, we propose a new Dens-based selection method that provides a more informative and flexible tool to allow users to select the tuning parameter based on the desired sparsity level. Another appealing feature of the Dens-based method is that it is much faster than existing methods, which provides an important advantage in neuroimaging applications. Simulation studies show that the Dens-based method demonstrates comparable or better performance with respect to existing methods in network estimation. We applied the proposed partial correlation method to investigate resting state functional connectivity using rs-fMRI data from the Philadelphia Neurodevelopmental Cohort (PNC) study. Our results show that partial correlation analysis removed considerable between-module marginal connections identified by full correlation analysis, suggesting these connections were likely caused by global effects or common connections to other nodes.
Based on partial correlation, we find that the most significant direct connections are between homologous brain locations in the left and right hemisphere. When comparing partial correlation derived under different sparse tuning parameters, an important finding is that the sparse regularization has more shrinkage effects on negative functional connections than on positive connections, which supports previous findings that many of the negative brain connections are due to non-neurophysiological effects. An R package “DensParcorr” can be downloaded from CRAN for implementing the proposed statistical methods. PMID:27242395
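The link between partial correlation and the precision matrix underlies the method above: for precision matrix Omega, the partial correlation is rho_ij = -Omega_ij / sqrt(Omega_ii * Omega_jj). For three variables this reduces to the first-order formula sketched below; it is a generic illustration of the quantity being estimated, not the CLIME-based estimator or the DensParcorr package.

```python
import math

def partial_corr(rxy, rxz, ryz):
    """First-order partial correlation of X and Y controlling for Z.
    Equivalent to -Omega_xy / sqrt(Omega_xx * Omega_yy) for the 3x3 case."""
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# A marginal correlation fully explained by a shared neighbour Z:
indirect = partial_corr(0.4, 0.8, 0.5)  # rxy == rxz * ryz, so the link vanishes
direct = partial_corr(0.6, 0.5, 0.5)    # a residual direct connection remains
```

The first example is precisely the effect reported in the abstract: a marginal (full-correlation) connection that disappears once the common neighbour is conditioned on.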
1983-06-16
has been advocated by Gnanadesikan and Wilk (1969), and others in the literature. This suggests that, if we use the formal significance test type...American Statistical Asso., 62, 1159-1178. Gnanadesikan, R., and Wilk, M. B. (1969). Data Analytic Methods in Multivariate Statistical Analysis. In
Detection of Alzheimer's disease using group lasso SVM-based region selection
NASA Astrophysics Data System (ADS)
Sun, Zhuo; Fan, Yong; Lelieveldt, Boudewijn P. F.; van de Giessen, Martijn
2015-03-01
Alzheimer's disease (AD) is one of the most frequent forms of dementia and an increasingly challenging public health problem. In the last two decades, structural magnetic resonance imaging (MRI) has shown potential in distinguishing patients with Alzheimer's disease from elderly controls (CN). To obtain AD-specific biomarkers, previous research used either statistical testing to find statistically significantly different regions between the two clinical groups, or l1 sparse learning to select isolated features in the image domain. In this paper, we propose a new framework that uses structural MRI to simultaneously distinguish the two clinical groups and find the biomarkers of AD, using a group lasso support vector machine (SVM). The group lasso term (mixed l1-l2 norm) introduces anatomical information from the image domain into the feature domain, such that the resulting set of selected voxels is more meaningful than that of the l1 sparse SVM. Because of large inter-structure size variation, we introduce a group-specific normalization factor to deal with the structure size bias. Experiments have been performed on a well-designed AD vs. CN dataset to validate our method. Compared to the l1 sparse SVM approach, our method achieved better classification performance and a more meaningful biomarker selection. When we varied the training set, the regions selected by our method were more stable than those of the l1 sparse SVM. Classification experiments showed that our group normalization led to higher classification accuracy with fewer selected regions than the non-normalized method. Compared to state-of-the-art AD vs. CN classification methods, our approach not only obtains high accuracy on the same dataset but, more importantly, simultaneously finds the brain anatomies that are closely related to the disease.
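The mixed l1-l2 (group lasso) penalty mentioned above shrinks each anatomical region's coefficients as a block. A minimal sketch of its proximal operator, group soft-thresholding, with a √(group size) factor standing in for the paper's group-specific normalization; the grouping and coefficient values are invented for illustration.

```python
import numpy as np

def group_soft_threshold(w, groups, lam):
    """Proximal operator of the group lasso penalty: each group's
    coefficient block is shrunk toward zero as a unit."""
    out = np.zeros_like(w)
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        scale = np.sqrt(len(idx))        # size normalization per group
        norm = np.linalg.norm(w[idx])
        if norm > lam * scale:
            out[idx] = (1 - lam * scale / norm) * w[idx]
    return out

# Two "regions" of three voxels each: a weak group (zeroed out entirely)
# and a strong group (kept, but shrunk toward zero).
w = np.array([0.1, -0.2, 0.15, 2.0, 1.5, -1.8])
groups = np.array([0, 0, 0, 1, 1, 1])
shrunk = group_soft_threshold(w, groups, lam=0.3)
print(shrunk.round(3))
```

This block-level selection is what makes the surviving voxel sets align with anatomical structures rather than isolated voxels.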
Statistical auditing and randomness test of lotto k/N-type games
NASA Astrophysics Data System (ADS)
Coronel-Brizio, H. F.; Hernández-Montoya, A. R.; Rapallo, F.; Scalas, E.
2008-11-01
One of the most popular lottery games worldwide is the so-called “lotto k/N”. It considers N numbers 1,2,…,N from which k are drawn randomly, without replacement. A player selects k or more numbers and the first prize is shared amongst those players whose selected numbers match all of the k randomly drawn. Exact rules may vary in different countries. In this paper, mean values and covariances for the random variables representing the numbers drawn from this kind of game are presented, with the aim of using them to audit statistically the consistency of a given sample of historical results with theoretical values coming from a hypergeometric statistical model. The method can be adapted to test pseudorandom number generators.
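The statistical audit described above can be sketched as follows: under a fair k/N draw every number appears with probability k/N per draw, so the appearance counts accumulated over a sample of historical results can be checked against uniform expected frequencies with a chi-square test. N, k, and the number of draws below are illustrative, and numbers are 0-based for convenience.

```python
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(1)
N, k, n_draws = 49, 6, 2000  # an illustrative 6/49 lotto

# Simulate historical results: each draw takes k of N numbers
# (0-based here) without replacement.
draws = np.array([rng.choice(N, size=k, replace=False)
                  for _ in range(n_draws)])

# Under the fair hypergeometric model every number appears with
# probability k/N per draw, so expected appearance counts are uniform.
counts = np.bincount(draws.ravel(), minlength=N)
stat, p = chisquare(counts)  # uniform expected frequencies by default

print(f"chi2 = {stat:.1f}, p-value = {p:.3f}")
```

The same check applied to a biased pseudorandom number generator would yield a very small p-value.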
Lin, Wen-Bin; Tung, I-Wu; Chen, Mei-Jung; Chen, Mei-Yen
2011-08-01
Selection of a qualified pitcher previously relied on qualitative indices; here, both quantitative and qualitative indices, including pitching statistics, defense, mental skills, experience, and managers' recognition, were collected, and an analytic hierarchy process was used to rank baseball pitchers. The participants were 8 experts who ranked characteristics and statistics of 15 baseball pitchers who comprised the first round of potential representatives for the Chinese Taipei National Baseball team. The results indicated a selection rate that was 91% consistent with the official national team roster, as 11 pitchers with the highest scores who were recommended as optimal choices to be official members of the Chinese Taipei National Baseball team actually participated in the 2009 Baseball World Cup. An analytic hierarchy can aid in selection of qualified pitchers, depending on situational and practical needs; the method could be extended to other sports and team-selection situations.
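A minimal sketch of the analytic hierarchy process used above: criterion priorities are the normalized principal eigenvector of a pairwise-comparison matrix, and a consistency ratio guards against contradictory judgments. The criteria and Saaty-scale entries below are invented for illustration, not the study's actual expert judgments.

```python
import numpy as np

# Illustrative Saaty-scale pairwise comparisons for three criteria,
# e.g. (pitching statistics, mental skills, experience); A[j, i] must
# equal 1 / A[i, j].
A = np.array([[1.0, 3.0, 5.0],
              [1 / 3, 1.0, 2.0],
              [1 / 5, 1 / 2, 1.0]])

# AHP priorities: normalized principal eigenvector of A.
vals, vecs = np.linalg.eig(A)
w = vecs[:, np.argmax(vals.real)].real
w = w / w.sum()

# Consistency ratio (CR < 0.1 is the usual acceptance rule; 0.58 is
# Saaty's random index for n = 3).
ci = (vals.real.max() - len(A)) / (len(A) - 1)
cr = ci / 0.58
print("priority weights:", w.round(3), "CR:", round(float(cr), 3))
```

Alternative weights per pitcher on each criterion would then be aggregated with these priorities to produce the overall ranking.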
USDA FS
1982-01-01
Instructions, illustrated with examples and experimental results, are given for the controlled-environment propagation and selection of poplar clones. Greenhouse and growth-room culture of poplar stock plants and scions are described, and statistical techniques for discriminating among clones on the basis of growth variables are emphasized.
Chen, Carla Chia-Ming; Schwender, Holger; Keith, Jonathan; Nunkesser, Robin; Mengersen, Kerrie; Macrossan, Paula
2011-01-01
Due to advancements in computational ability, enhanced technology and a reduction in the price of genotyping, more data are being generated for understanding genetic associations with diseases and disorders. However, with the availability of large data sets comes the inherent challenges of new methods of statistical analysis and modeling. Considering a complex phenotype may be the effect of a combination of multiple loci, various statistical methods have been developed for identifying genetic epistasis effects. Among these methods, logic regression (LR) is an intriguing approach incorporating tree-like structures. Various methods have built on the original LR to improve different aspects of the model. In this study, we review four variations of LR, namely Logic Feature Selection, Monte Carlo Logic Regression, Genetic Programming for Association Studies, and Modified Logic Regression-Gene Expression Programming, and investigate the performance of each method using simulated and real genotype data. We contrast these with another tree-like approach, namely Random Forests, and a Bayesian logistic regression with stochastic search variable selection.
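Logic regression's core idea, a Boolean tree over binary genotype indicators, can be sketched as follows. The SNP names, logic term, and penetrance values are illustrative, and no tree search is performed here; the LR variants reviewed above search over such trees rather than fixing one.

```python
import numpy as np

rng = np.random.default_rng(2)

# Binary indicators for three SNPs (e.g. "carries a variant allele");
# all names and probabilities here are illustrative.
n = 1000
snp1, snp2, snp3 = rng.integers(0, 2, size=(3, n)).astype(bool)

# A logic-regression tree encodes epistasis as a Boolean combination,
# here L = (snp1 AND snp2) OR (NOT snp3).
L = (snp1 & snp2) | ~snp3

# Disease status driven by the logic term.
y = rng.random(n) < np.where(L, 0.6, 0.2)

rate_L1, rate_L0 = y[L].mean(), y[~L].mean()
print(f"case rate | L=1: {rate_L1:.2f}, | L=0: {rate_L0:.2f}")
```

A single logic term like this captures an interaction that marginal single-SNP tests can miss, which is the motivation for the tree-based search strategies compared in the study.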
Assessment of statistical methods used in library-based approaches to microbial source tracking.
Ritter, Kerry J; Carruthers, Ethan; Carson, C Andrew; Ellender, R D; Harwood, Valerie J; Kingsley, Kyle; Nakatsu, Cindy; Sadowsky, Michael; Shear, Brian; West, Brian; Whitlock, John E; Wiggins, Bruce A; Wilbur, Jayson D
2003-12-01
Several commonly used statistical methods for fingerprint identification in microbial source tracking (MST) were examined to assess the effectiveness of pattern-matching algorithms to correctly identify sources. Although numerous statistical methods have been employed for source identification, no widespread consensus exists as to which is most appropriate. A large-scale comparison of several MST methods, using identical fecal sources, presented a unique opportunity to assess the utility of several popular statistical methods. These included discriminant analysis, nearest neighbour analysis, maximum similarity and average similarity, along with several measures of distance or similarity. Threshold criteria for excluding uncertain or poorly matched isolates from final analysis were also examined for their ability to reduce false positives and increase prediction success. Six independent libraries used in the study were constructed from indicator bacteria isolated from fecal materials of humans, seagulls, cows and dogs. Three of these libraries were constructed using the rep-PCR technique and three relied on antibiotic resistance analysis (ARA). Five of the libraries were constructed using Escherichia coli and one using Enterococcus spp. (ARA). Overall, the outcome of this study suggests a high degree of variability across statistical methods. Despite large differences in correct classification rates among the statistical methods, no single statistical approach emerged as superior. Thresholds failed to consistently increase rates of correct classification and improvement was often associated with substantial effective sample size reduction. Recommendations are provided to aid in selecting appropriate analyses for these types of data.
Spectroscopic Diagnosis of Arsenic Contamination in Agricultural Soils
Shi, Tiezhu; Liu, Huizeng; Chen, Yiyun; Fei, Teng; Wang, Junjie; Wu, Guofeng
2017-01-01
This study investigated the abilities of pre-processing, feature selection and machine-learning methods for the spectroscopic diagnosis of soil arsenic contamination. The spectral data were pre-processed by using Savitzky-Golay smoothing, first and second derivatives, multiplicative scatter correction, standard normal variate, and mean centering. Principal component analysis (PCA) and the RELIEF algorithm were used to extract spectral features. Machine-learning methods, including random forests (RF), artificial neural network (ANN), and radial basis function- and linear function-based support vector machines (RBF- and LF-SVM), were employed for establishing diagnosis models. The model accuracies were evaluated and compared by using overall accuracies (OAs). The statistical significance of the difference between models was evaluated by using McNemar’s test (Z value). The results showed that the OAs varied with the different combinations of pre-processing, feature selection, and classification methods. Feature selection methods could improve the modeling efficiencies and diagnosis accuracies, and RELIEF often outperformed PCA. The optimal models established by RF (OA = 86%), ANN (OA = 89%), RBF- (OA = 89%) and LF-SVM (OA = 87%) had no statistical difference in diagnosis accuracies (Z < 1.96, p < 0.05). These results indicated that it was feasible to diagnose soil arsenic contamination using reflectance spectroscopy. The appropriate combination of multivariate methods was important to improve diagnosis accuracies. PMID:28471412
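Two of the pre-processing steps listed above, Savitzky-Golay smoothing/derivatives and mean centering, can be sketched with SciPy on a synthetic spectrum; the window length, polynomial order, and spectral shape are illustrative choices, not the study's settings.

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(3)

# Synthetic reflectance spectrum: one smooth absorption feature + noise.
wl = np.linspace(400, 2400, 500)
clean = 1.0 - 0.3 * np.exp(-((wl - 1400) / 120) ** 2)
noisy = clean + rng.normal(0, 0.01, wl.size)

# Savitzky-Golay smoothing and first derivative.
smoothed = savgol_filter(noisy, window_length=21, polyorder=2)
deriv1 = savgol_filter(noisy, window_length=21, polyorder=2, deriv=1)

# Mean centering, another listed pre-processing step.
centered = smoothed - smoothed.mean()

print("noise std before/after smoothing:",
      np.std(noisy - clean).round(4), np.std(smoothed - clean).round(4))
```

The smoothed or derivative spectra would then feed the PCA or RELIEF feature-selection step before classification.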
Extending the Peak Bandwidth of Parameters for Softmax Selection in Reinforcement Learning.
Iwata, Kazunori
2016-05-11
Softmax selection is one of the most popular methods for action selection in reinforcement learning. Although various recently proposed methods may be more effective with full parameter tuning, implementing a complicated method that requires the tuning of many parameters can be difficult. Thus, softmax selection is still worth revisiting, considering the cost savings of its implementation and tuning. In fact, this method works adequately in practice with only one parameter appropriately set for the environment. The aim of this paper is to improve the variable setting of this method to extend the bandwidth of good parameters, thereby reducing the cost of implementation and parameter tuning. To achieve this, we take advantage of the asymptotic equipartition property in a Markov decision process to extend the peak bandwidth of softmax selection. Using a variety of episodic tasks, we show that our setting is effective in extending the bandwidth and that it yields a better policy in terms of stability. The bandwidth is quantitatively assessed in a series of statistical tests.
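A minimal sketch of softmax (Boltzmann) action selection with its single temperature parameter τ, the parameter whose usable bandwidth the paper aims to extend; the Q-values and τ below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

def softmax_select(q_values, tau, rng):
    """Boltzmann/softmax action selection with temperature tau."""
    prefs = np.asarray(q_values, dtype=float) / tau
    prefs -= prefs.max()              # numerical stability
    probs = np.exp(prefs)
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs), probs

q = [1.0, 2.0, 0.5]                   # illustrative action values
action, probs = softmax_select(q, tau=0.5, rng=rng)
print("selection probabilities:", probs.round(3), "chosen action:", action)
```

A small τ concentrates choice on the greedy action while a large τ approaches uniform exploration; performance is adequate only within a band of τ values, which is why widening that band reduces tuning cost.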
NASA Astrophysics Data System (ADS)
Mueller, David S.
2013-04-01
Selection of the appropriate extrapolation methods for computing the discharge in the unmeasured top and bottom parts of a moving-boat acoustic Doppler current profiler (ADCP) streamflow measurement is critical to the total discharge computation. The software tool, extrap, combines normalized velocity profiles from the entire cross section and multiple transects to determine a mean profile for the measurement. The use of an exponent derived from normalized data from the entire cross section is shown to be valid for application of the power velocity distribution law in the computation of the unmeasured discharge in a cross section. Selected statistics are combined with empirically derived criteria to automatically select the appropriate extrapolation methods. A graphical user interface (GUI) provides the user with tools to visually evaluate the automatically selected extrapolation methods and manually change them, as necessary. The sensitivity of the total discharge to available extrapolation methods is presented in the GUI. Use of extrap by field hydrographers has demonstrated that extrap is a more accurate and efficient method of determining the appropriate extrapolation methods compared with tools currently (2012) provided in the ADCP manufacturers' software.
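The power velocity distribution law mentioned above can be sketched as follows: fit u = a·z^b to normalized profile data, then integrate the fitted law over the unmeasured depth range. The 1/6 exponent and noise level are illustrative, and this least-squares fit is a simplification, not extrap's actual composite-profile procedure.

```python
import numpy as np

rng = np.random.default_rng(5)

# Normalized depth z (0 = bed, 1 = surface) and measured velocities from
# a power law u = a * z**b with the conventional 1/6 exponent, plus noise.
z = np.linspace(0.1, 0.9, 40)
u = 1.2 * z ** (1 / 6) + rng.normal(0, 0.01, z.size)

# Fit the exponent by linear regression in log-log space.
b, log_a = np.polyfit(np.log(z), np.log(u), 1)
a = np.exp(log_a)

# Unmeasured unit discharge near the surface (z in [0.9, 1.0]) is the
# integral of the fitted law over that range.
q_top = a / (b + 1) * (1.0 ** (b + 1) - 0.9 ** (b + 1))
print(f"fitted exponent b = {b:.3f}, top unit discharge = {q_top:.3f}")
```

The same integral over [0, z_bottom] estimates the unmeasured bottom discharge.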
Granato, Gregory E.
2014-01-01
The U.S. Geological Survey (USGS) developed the Stochastic Empirical Loading and Dilution Model (SELDM) in cooperation with the Federal Highway Administration (FHWA) to indicate the risk for stormwater concentrations, flows, and loads to be above user-selected water-quality goals and the potential effectiveness of mitigation measures to reduce such risks. SELDM models the potential effect of mitigation measures by using Monte Carlo methods with statistics that approximate the net effects of structural and nonstructural best management practices (BMPs). In this report, structural BMPs are defined as the components of the drainage pathway between the source of runoff and a stormwater discharge location that affect the volume, timing, or quality of runoff. SELDM uses a simple stochastic statistical model of BMP performance to develop planning-level estimates of runoff-event characteristics. This statistical approach can be used to represent a single BMP or an assemblage of BMPs. The SELDM BMP-treatment module has provisions for stochastic modeling of three stormwater treatments: volume reduction, hydrograph extension, and water-quality treatment. In SELDM, these three treatment variables are modeled by using the trapezoidal distribution and the rank correlation with the associated highway-runoff variables. This report describes methods for calculating the trapezoidal-distribution statistics and rank correlation coefficients for stochastic modeling of volume reduction, hydrograph extension, and water-quality treatment by structural stormwater BMPs and provides the calculated values for these variables. This report also provides robust methods for estimating the minimum irreducible concentration (MIC), which is the lowest expected effluent concentration from a particular BMP site or a class of BMPs. These statistics are different from the statistics commonly used to characterize or compare BMPs. 
They are designed to provide a stochastic transfer function to approximate the quantity, duration, and quality of BMP effluent given the associated inflow values for a population of storm events. A database application and several spreadsheet tools are included in the digital media accompanying this report for further documentation of methods and for future use. In this study, analyses were done with data extracted from a modified copy of the January 2012 version of International Stormwater Best Management Practices Database, designated herein as the January 2012a version. Statistics for volume reduction, hydrograph extension, and water-quality treatment were developed with selected data. Sufficient data were available to estimate statistics for 5 to 10 BMP categories by using data from 40 to more than 165 monitoring sites. Water-quality treatment statistics were developed for 13 runoff-quality constituents commonly measured in highway and urban runoff studies including turbidity, sediment and solids; nutrients; total metals; organic carbon; and fecal coliforms. The medians of the best-fit statistics for each category were selected to construct generalized cumulative distribution functions for the three treatment variables. For volume reduction and hydrograph extension, interpretation of available data indicates that selection of a Spearman’s rho value that is the average of the median and maximum values for the BMP category may help generate realistic simulation results in SELDM. The median rho value may be selected to help generate realistic simulation results for water-quality treatment variables. MIC statistics were developed for 12 runoff-quality constituents commonly measured in highway and urban runoff studies by using data from 11 BMP categories and more than 167 monitoring sites. Four statistical techniques were applied for estimating MIC values with monitoring data from each site. These techniques produce a range of lower-bound estimates for each site. 
Four MIC estimators are proposed as alternatives for selecting a value from among the estimates from multiple sites. Correlation analysis indicates that the MIC estimates from multiple sites were weakly correlated with the geometric mean of inflow values, which indicates that there may be a qualitative or semiquantitative link between the inflow quality and the MIC. Correlations probably are weak because the MIC is influenced by the inflow water quality and the capability of each individual BMP site to reduce inflow concentrations.
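The trapezoidal distribution used in SELDM's BMP-treatment module can be sampled directly with SciPy. The support, modes, and the independent synthetic inflow below are illustrative values, not fitted SELDM statistics; SELDM would additionally impose a rank correlation between the treatment variable and the associated inflow variable rather than leaving them independent.

```python
import numpy as np
from scipy.stats import trapezoid, spearmanr

rng = np.random.default_rng(6)

# Trapezoidal distribution for, e.g., the fraction of inflow volume
# remaining after a BMP: support [0, 1] with modes at 0.3 and 0.7.
c, d, loc, scale = 0.3, 0.7, 0.0, 1.0
frac = trapezoid.rvs(c, d, loc=loc, scale=scale, size=10000,
                     random_state=rng)

# A synthetic, independent inflow series; its rank correlation with the
# sampled fractions should be near zero here.
inflow = rng.lognormal(mean=1.0, sigma=0.5, size=10000)
rho, _ = spearmanr(inflow, frac)

print(f"mean fraction = {frac.mean():.3f}, Spearman rho = {rho:.3f}")
```

Imposing a nonzero Spearman's rho, as the report recommends for volume reduction and hydrograph extension, would couple the sampled fractions to storm size.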
Analyzing Kernel Matrices for the Identification of Differentially Expressed Genes
Xia, Xiao-Lei; Xing, Huanlai; Liu, Xueqin
2013-01-01
One of the most important applications of microarray data is the class prediction of biological samples. For this purpose, statistical tests have often been applied to identify the differentially expressed genes (DEGs), followed by the employment of state-of-the-art learning machines, including the Support Vector Machines (SVM) in particular. The SVM is a typical sample-based classifier whose performance comes down to how discriminant samples are. However, DEGs identified by statistical tests are not guaranteed to result in a training dataset composed of discriminant samples. To tackle this problem, a novel gene ranking method, namely the Kernel Matrix Gene Selection (KMGS), is proposed. The rationale of the method, which is rooted in the fundamental ideas of the SVM algorithm, is described. The notion of "the separability of a sample", which is estimated by performing -like statistics on each column of the kernel matrix, is first introduced. The separability of a classification problem is then measured, from which the significance of a specific gene is deduced. Also described is a method of Kernel Matrix Sequential Forward Selection (KMSFS), which shares the KMGS method's essential ideas but proceeds in a greedy manner. On three public microarray datasets, our proposed algorithms achieved noticeably competitive performance in terms of the B.632+ error rate. PMID:24349110
NASA Astrophysics Data System (ADS)
Eissa, Maya S.; Abou Al Alamein, Amal M.
2018-03-01
Different innovative spectrophotometric methods were introduced for the first time for simultaneous quantification of sacubitril/valsartan in their binary mixture and in their combined dosage form without prior separation, through two manipulation approaches. These approaches were developed and based either on two-wavelength selection in zero-order absorption spectra, namely dual wavelength method (DWL) at 226 nm and 275 nm for valsartan, induced dual wavelength method (IDW) at 226 nm and 254 nm for sacubitril, and advanced absorbance subtraction (AAS) based on their iso-absorptive point at 246 nm (λiso) and 261 nm (sacubitril shows equal absorbance values at the two selected wavelengths); or on ratio spectra using their normalized spectra, namely ratio difference spectrophotometric method (RD) at 225 nm and 264 nm for both of them in their ratio spectra, first derivative of ratio spectra (DR1) at 232 nm for valsartan and 239 nm for sacubitril, and mean centering of ratio spectra (MCR) at 260 nm for both of them. Both sacubitril and valsartan showed linearity upon application of these methods in the range of 2.5-25.0 μg/mL. The developed spectrophotometric methods were successfully applied to the analysis of their combined tablet dosage form ENTRESTO™. The adopted spectrophotometric methods were also validated according to ICH guidelines. The results obtained from the proposed methods were statistically compared to a reported HPLC method using Student t-test and F-test, and a comparative study was also developed with one-way ANOVA, showing no statistical difference with respect to precision and accuracy.
Potential errors and misuse of statistics in studies on leakage in endodontics.
Lucena, C; Lopez, J M; Pulgar, R; Abalos, C; Valderrama, M J
2013-04-01
To assess the quality of the statistical methodology used in studies of leakage in Endodontics, and to compare the results found using appropriate versus inappropriate inferential statistical methods. The search strategy used the descriptors 'root filling', 'microleakage', 'dye penetration', 'dye leakage', 'polymicrobial leakage' and 'fluid filtration' for the time interval 2001-2010 in journals within the categories 'Dentistry, Oral Surgery and Medicine' and 'Materials Science, Biomaterials' of the Journal Citation Report. All retrieved articles were reviewed to find potential pitfalls in statistical methodology that may be encountered during study design, data management or data analysis. The database included 209 papers. In all the studies reviewed, the statistical methods used were appropriate for the category attributed to the outcome variable, but in 41% of the cases, the chi-square test or parametric methods were subsequently selected inappropriately. In 2% of the papers, no statistical test was used. In 99% of cases, a statistically 'significant' or 'not significant' effect was reported as a main finding, whilst only 1% also presented an estimation of the magnitude of the effect. When the appropriate statistical methods were applied in the studies with originally inappropriate data analysis, the conclusions changed in 19% of the cases. Statistical deficiencies in leakage studies may affect their results and interpretation and might be one of the reasons for the poor agreement amongst the reported findings. Therefore, more effort should be made to standardize statistical methodology. © 2012 International Endodontic Journal.
Point process statistics in atom probe tomography.
Philippe, T; Duguay, S; Grancher, G; Blavette, D
2013-09-01
We present a review of spatial point processes as statistical models that we have designed for the analysis and treatment of atom probe tomography (APT) data. As a major advantage, these methods do not require sampling. The mean distance to nearest neighbour is an attractive approach to exhibit a non-random atomic distribution. A χ² test based on distance distributions to nearest neighbour has been developed to detect deviation from randomness. Best-fit methods based on first nearest neighbour distance (1 NN method) and pair correlation function are presented and compared to assess the chemical composition of tiny clusters. Delaunay tessellation for cluster selection has been also illustrated. These statistical tools have been applied to APT experiments on microelectronics materials. Copyright © 2012 Elsevier B.V. All rights reserved.
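The mean nearest-neighbour test mentioned above compares the observed mean 1-NN distance with its expectation under complete spatial randomness (CSR); for intensity λ in 3-D that expectation is Γ(4/3)/(4πλ/3)^(1/3). A sketch with illustrative point counts:

```python
import numpy as np
from math import gamma, pi
from scipy.spatial import cKDTree

rng = np.random.default_rng(7)

# Random (CSR) atom positions in a cube of side `box`.
n, box = 20000, 10.0
pts = rng.uniform(0, box, size=(n, 3))
lam = n / box**3                       # intensity: atoms per unit volume

tree = cKDTree(pts)
dists, _ = tree.query(pts, k=2)        # column 0 is each point itself
d1nn = dists[:, 1]

expected = gamma(4 / 3) / (4 / 3 * pi * lam) ** (1 / 3)
print(f"observed mean 1NN = {d1nn.mean():.4f}, "
      f"CSR expectation = {expected:.4f}")
```

A clustered atomic distribution would give a mean 1-NN distance well below the CSR expectation, which is the deviation the χ² test is designed to detect.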
Shi, Weiwei; Bugrim, Andrej; Nikolsky, Yuri; Nikolskya, Tatiana; Brennan, Richard J
2008-01-01
ABSTRACT The ideal toxicity biomarker is composed of the properties of prediction (is detected prior to traditional pathological signs of injury), accuracy (high sensitivity and specificity), and mechanistic relationships to the endpoint measured (biological relevance). Gene expression-based toxicity biomarkers ("signatures") have shown good predictive power and accuracy, but are difficult to interpret biologically. We have compared different statistical methods of feature selection with knowledge-based approaches, using GeneGo's database of canonical pathway maps, to generate gene sets for the classification of renal tubule toxicity. The gene set selection algorithms include four univariate analyses: t-statistics, fold-change, B-statistics, and RankProd, and their combination and overlap for the identification of differentially expressed probes. Enrichment analysis following the results of the four univariate analyses, Hotelling T-square test, and, finally out-of-bag selection, a variant of cross-validation, were used to identify canonical pathway maps-sets of genes coordinately involved in key biological processes-with classification power. Differentially expressed genes identified by the different statistical univariate analyses all generated reasonably performing classifiers of tubule toxicity. Maps identified by enrichment analysis or Hotelling T-square had lower classification power, but highlighted perturbed lipid homeostasis as a common discriminator of nephrotoxic treatments. The out-of-bag method yielded the best functionally integrated classifier. The map "ephrins signaling" performed comparably to a classifier derived using sparse linear programming, a machine learning algorithm, and represents a signaling network specifically involved in renal tubule development and integrity. 
Such functional descriptors of toxicity promise to better integrate predictive toxicogenomics with mechanistic analysis, facilitating the interpretation and risk assessment of predictive genomic investigations.
Method of analysis of local neuronal circuits in the vertebrate central nervous system.
Reinis, S; Weiss, D S; McGaraughty, S; Tsoukatos, J
1992-06-01
Although a considerable amount of knowledge has been accumulated about the activity of individual nerve cells in the brain, little is known about their mutual interactions at the local level. The method presented in this paper allows the reconstruction of functional relations within a group of neurons as recorded by a single microelectrode. Data are sampled at 10 or 13 kHz. Prominent spikes produced by one or more single cells are selected and sorted by K-means cluster analysis. The activities of single cells are then related to the background firing of neurons in their vicinity. Auto-correlograms of the leading cells, auto-correlograms of the background cells (mass correlograms) and cross-correlograms between these two levels of firing are computed and evaluated. The statistical probability of mutual interactions is determined, and the statistically significant, most common interspike intervals are stored and attributed to real pairs of spikes in the original record. Selected pairs of spikes, characterized by statistically significant intervals between them, are then assembled into a working model of the system. This method has revealed substantial differences between the information processing in the visual cortex, the inferior colliculus, the rostral ventromedial medulla and the ventrobasal complex of the thalamus. Even short 1-s records of the multiple neuronal activity may provide meaningful and statistically significant results.
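The K-means spike-sorting step described above can be sketched on synthetic two-unit data; the two waveform features and all cluster parameters are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(8)

# Two features (e.g. peak amplitude, spike width) for spikes from two
# simulated units with distinct waveform shapes.
unit_a = rng.normal([1.0, 0.3], 0.05, size=(200, 2))
unit_b = rng.normal([0.5, 0.6], 0.05, size=(200, 2))
feats = np.vstack([unit_a, unit_b])

# K-means cluster analysis sorts the spikes into putative single units.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(feats)

# Spikes of each simulated unit should share one label.
purity_a = max(np.mean(labels[:200] == 0), np.mean(labels[:200] == 1))
purity_b = max(np.mean(labels[200:] == 0), np.mean(labels[200:] == 1))
print(f"cluster purity: unit A {purity_a:.2f}, unit B {purity_b:.2f}")
```

The sorted spike trains of each leading cell would then be cross-correlated against the background firing, as the method describes.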
Active learning methods for interactive image retrieval.
Gosselin, Philippe Henri; Cord, Matthieu
2008-07-01
Active learning methods have been considered with increased interest in the statistical learning community. Initially developed within a classification framework, they are now being extended to handle multimedia applications. This paper provides algorithms within a statistical framework to extend active learning for online content-based image retrieval (CBIR). The classification framework is presented with experiments to compare several powerful classification techniques in this information retrieval context. Focusing on interactive methods, the active learning strategy is then described. The limitations of this approach for CBIR are emphasized before presenting our new active selection process, RETIN. First, as any active method is sensitive to the boundary estimation between classes, the RETIN strategy carries out a boundary correction to make the retrieval process more robust. Second, the criterion of generalization error used to optimize the active learning selection is modified to better represent the CBIR objective of database ranking. Third, a batch processing of images is proposed. Our strategy leads to a fast and efficient active learning scheme to retrieve sets of online images (query concept). Experiments on large databases show that the RETIN method performs well in comparison to several other active strategies.
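One common boundary-sensitive active selection rule, a generic sketch and not the RETIN algorithm itself, queries the unlabeled samples closest to the current SVM decision boundary; the 2-D features and set sizes are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(10)

# A small labeled seed set and a larger unlabeled pool (2-D features
# standing in for image descriptors).
X_lab = np.vstack([rng.normal(-1, 0.5, (20, 2)),
                   rng.normal(1, 0.5, (20, 2))])
y_lab = np.array([0] * 20 + [1] * 20)
X_pool = rng.uniform(-2, 2, size=(500, 2))

clf = SVC(kernel="linear").fit(X_lab, y_lab)

# Query the pool points with the smallest |decision value|, i.e. the
# images the current boundary is least certain about.
margin = np.abs(clf.decision_function(X_pool))
query_idx = np.argsort(margin)[:10]
print("mean |margin| of queried points:",
      round(float(margin[query_idx].mean()), 3))
```

This is the kind of selection whose sensitivity to boundary estimation motivates RETIN's boundary correction.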
A New Sample Size Formula for Regression.
ERIC Educational Resources Information Center
Brooks, Gordon P.; Barcikowski, Robert S.
The focus of this research was to determine the efficacy of a new method of selecting sample sizes for multiple linear regression. A Monte Carlo simulation was used to study both empirical predictive power rates and empirical statistical power rates of the new method and seven other methods: those of C. N. Park and A. L. Dudycha (1974); J. Cohen…
A comparison of linear interpolation models for iterative CT reconstruction.
Hahn, Katharina; Schöndube, Harald; Stierstorfer, Karl; Hornegger, Joachim; Noo, Frédéric
2016-12-01
Recent reports indicate that model-based iterative reconstruction methods may improve image quality in computed tomography (CT). One difficulty with these methods is the number of options available to implement them, including the selection of the forward projection model and the penalty term. Currently, the literature is fairly scarce in terms of guidance regarding this selection step, whereas these options impact image quality. Here, the authors investigate the merits of three forward projection models that rely on linear interpolation: the distance-driven method, Joseph's method, and the bilinear method. The authors' selection is motivated by three factors: (1) in CT, linear interpolation is often seen as a suitable trade-off between discretization errors and computational cost, (2) the first two methods are popular with manufacturers, and (3) the third method enables assessing the importance of a key assumption in the other methods. One approach to evaluate forward projection models is to inspect their effect on discretized images, as well as the effect of their transpose on data sets, but significance of such studies is unclear since the matrix and its transpose are always jointly used in iterative reconstruction. Another approach is to investigate the models in the context they are used, i.e., together with statistical weights and a penalty term. Unfortunately, this approach requires the selection of a preferred objective function and does not provide clear information on features that are intrinsic to the model. The authors adopted the following two-stage methodology. First, the authors analyze images that progressively include components of the singular value decomposition of the model in a reconstructed image without statistical weights and penalty term. Next, the authors examine the impact of weights and penalty on observed differences. 
Image quality metrics were investigated for 16 different fan-beam imaging scenarios that enabled probing various aspects of all models. The metrics include a surrogate for computational cost, as well as bias, noise, and an estimation task, all at matched resolution. The analysis revealed fundamental differences in terms of both bias and noise. Task-based assessment appears to be required to appreciate the differences in noise; the estimation task the authors selected showed that these differences balance out to yield similar performance. Some scenarios highlighted merits for the distance-driven method in terms of bias but with an increase in computational cost. Three combinations of statistical weights and penalty term showed that the observed differences remain the same, but strong edge-preserving penalty can dramatically reduce the magnitude of these differences. In many scenarios, Joseph's method seems to offer an interesting compromise between accuracy and computational effort. The distance-driven method offers the possibility to reduce bias but with an increase in computational cost. The bilinear method indicated that a key assumption in the other two methods is highly robust. Last, strong edge-preserving penalty can act as a compensator for insufficiencies in the forward projection model, bringing all models to similar levels in the most challenging imaging scenarios. Also, the authors find that their evaluation methodology helps in appreciating how the model, statistical weights, and penalty term interplay.
Lancaster, Cady; Espinoza, Edgard
2012-05-15
International trade of several Dalbergia wood species is regulated by The Convention on International Trade in Endangered Species of Wild Fauna and Flora (CITES). In order to supplement morphological identification of these species, a rapid chemical method of analysis was developed. Using Direct Analysis in Real Time (DART) ionization coupled with Time-of-Flight (TOF) Mass Spectrometry (MS), selected Dalbergia and common trade species were analyzed. Each of the 13 wood species was classified using principal component analysis and linear discriminant analysis (LDA). These statistical data clusters served as reliable anchors for species identification of unknowns. Analysis of 20 or more samples from the 13 species studied in this research indicates that the DART-TOFMS results are reproducible. Statistical analysis of the most abundant ions gave good classifications that were useful for identifying unknown wood samples. DART-TOFMS and LDA analysis of 13 species of selected timber samples and the statistical classification allowed for the correct assignment of unknown wood samples. This method is rapid and can be useful when anatomical identification is difficult but needed in order to support CITES enforcement. Published 2012. This article is a US Government work and is in the public domain in the USA.
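The PCA-then-LDA classification workflow described above can be sketched with scikit-learn on synthetic spectra standing in for DART-TOFMS ion-abundance data; all dimensions, species counts, and noise levels are invented for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(11)

# Mock spectra: 3 "species", 20 reference samples each, 200 m/z bins,
# with species-specific mean intensities (purely synthetic data).
n_per, n_bins = 20, 200
means = rng.uniform(0, 1, size=(3, n_bins))
X = np.vstack([m + rng.normal(0, 0.3, (n_per, n_bins)) for m in means])
y = np.repeat([0, 1, 2], n_per)

# Dimension reduction with PCA, then LDA for classification.
model = make_pipeline(PCA(n_components=10), LinearDiscriminantAnalysis())
model.fit(X, y)

# An "unknown" sample generated from species 1.
unknown = means[1] + rng.normal(0, 0.3, n_bins)
print("predicted species:", int(model.predict([unknown])[0]))
```

The fitted class clusters play the role of the "statistical data clusters" that anchor species identification of unknown wood samples.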
An efficient empirical Bayes method for genomewide association studies.
Wang, Q; Wei, J; Pan, Y; Xu, S
2016-08-01
The linear mixed model (LMM) is one of the most popular methods for genomewide association studies (GWAS). Numerous forms of LMM have been developed; however, there are two major issues in GWAS that have not been fully addressed before: (i) genomic background noise and (ii) low statistical power after Bonferroni correction. We proposed an empirical Bayes (EB) method that assigns each marker effect a normal prior distribution, resulting in shrinkage estimates of the marker effects. We found that such a shrinkage approach can selectively shrink marker effects and reduce the noise level to zero for the majority of non-associated markers. At the same time, the EB method allows us to use an 'effective number of tests' to perform Bonferroni correction for multiple tests. Simulation studies for both human and pig data showed that the EB method can significantly increase statistical power compared with widely used exact GWAS methods, such as GEMMA and FaST-LMM-Select. Real data analyses in human breast cancer identified improved detection signals for markers previously known to be associated with breast cancer. We therefore believe that the EB method is a valuable tool for identifying the genetic basis of complex traits. © 2015 Blackwell Verlag GmbH.
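The shrinkage step described above can be sketched numerically. This is a minimal illustration under simplifying assumptions (single-marker regression, known residual and prior variances), not the authors' LMM-based implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 50                                   # individuals, markers
X = rng.choice([0.0, 1.0, 2.0], size=(n, m))     # additive genotype codes
beta_true = np.zeros(m)
beta_true[3] = 1.5                               # one truly associated marker
y = X @ beta_true + rng.normal(0.0, 1.0, n)

Xc = X - X.mean(axis=0)                          # centre genotypes and phenotype
yc = y - y.mean()
sigma2, tau2 = 1.0, 0.05                         # residual and prior variances (assumed known)

# Single-marker least-squares estimates, shrunk toward zero by the normal prior:
#   beta_EB_j = x_j'x_j / (x_j'x_j + sigma2/tau2) * beta_OLS_j
xtx = (Xc ** 2).sum(axis=0)
beta_ols = (Xc.T @ yc) / xtx
beta_eb = xtx / (xtx + sigma2 / tau2) * beta_ols
```

Non-associated markers are shrunk close to zero while the associated marker retains a large estimate, which is the behaviour the abstract describes.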
System and method for progressive band selection for hyperspectral images
NASA Technical Reports Server (NTRS)
Fisher, Kevin (Inventor)
2013-01-01
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for progressive band selection for hyperspectral images. A system having a module configured to control a processor to practice the method calculates the virtual dimensionality of a hyperspectral image having multiple bands to determine a quantity Q of how many bands are needed for a threshold level of information, ranks each band based on a statistical measure, selects Q bands from the multiple bands to generate a subset of bands based on the virtual dimensionality, and generates a reduced image based on the subset of bands. This approach can create reduced datasets of full hyperspectral images tailored to individual applications. The system uses a metric specific to a target application to rank the image bands, and then selects the most useful bands. The number of bands selected can be specified manually or calculated from the hyperspectral image's virtual dimensionality.
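The rank-and-select flow can be sketched as follows; variance is a hypothetical stand-in for the application-specific statistical measure, and Q is fixed by hand rather than estimated from the virtual dimensionality:

```python
import numpy as np

rng = np.random.default_rng(1)
rows, cols, bands = 32, 32, 10
cube = rng.normal(0.0, 1.0, (rows, cols, bands))   # synthetic hyperspectral cube
cube[:, :, [2, 5, 7]] *= 5.0                       # three bands carry more signal

Q = 3                                  # stand-in for the estimated virtual dimensionality
flat = cube.reshape(-1, bands)
scores = flat.var(axis=0)              # rank bands by a simple statistical measure
selected = np.argsort(scores)[::-1][:Q]
reduced = cube[:, :, np.sort(selected)]            # reduced image keeps the Q best bands
```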
Horsch, Alexander; Hapfelmeier, Alexander; Elter, Matthias
2011-11-01
Breast cancer is globally a major threat for women's health. Screening and adequate follow-up can significantly reduce the mortality from breast cancer. Human second reading of screening mammograms can increase breast cancer detection rates, whereas this has not been proven for current computer-aided detection systems as "second reader". Critical factors include the detection accuracy of the systems and the screening experience and training of the radiologist with the system. When assessing the performance of systems and system components, the choice of evaluation methods is particularly critical. Core assets herein are reference image databases and statistical methods. We have analyzed characteristics and usage of the currently largest publicly available mammography database, the Digital Database for Screening Mammography (DDSM) from the University of South Florida, in literature indexed in Medline, IEEE Xplore, SpringerLink, and SPIE, with respect to type of computer-aided diagnosis (CAD) (detection, CADe, or diagnostics, CADx), selection of database subsets, choice of evaluation method, and quality of descriptions. 59 publications presenting 106 evaluation studies met our selection criteria. In 54 studies (50.9%), the selection of test items (cases, images, regions of interest) extracted from the DDSM was not reproducible. Only 2 CADx studies, not any CADe studies, used the entire DDSM. The number of test items varies from 100 to 6000. Different statistical evaluation methods are chosen. Most common are train/test (34.9% of the studies), leave-one-out (23.6%), and N-fold cross-validation (18.9%). Database-related terminology tends to be imprecise or ambiguous, especially regarding the term "case". Overall, both the use of the DDSM as data source for evaluation of mammography CAD systems, and the application of statistical evaluation methods were found highly diverse. Results reported from different studies are therefore hardly comparable. 
Drawbacks of the DDSM (e.g. varying quality of lesion annotations) may contribute to the reasons. But larger bias seems to be caused by authors' own decisions upon study design. RECOMMENDATIONS/CONCLUSION: For future evaluation studies, we derive a set of 13 recommendations concerning the construction and usage of a test database, as well as the application of statistical evaluation methods.
DAnTE: a statistical tool for quantitative analysis of -omics data
DOE Office of Scientific and Technical Information (OSTI.GOV)
Polpitiya, Ashoka D.; Qian, Weijun; Jaitly, Navdeep
2008-05-03
DAnTE (Data Analysis Tool Extension) is a statistical tool designed to address challenges unique to quantitative bottom-up, shotgun proteomics data. This tool has also been demonstrated for microarray data and can easily be extended to other high-throughput data types. DAnTE features selected normalization methods, missing value imputation algorithms, peptide to protein rollup methods, an extensive array of plotting functions, and a comprehensive ANOVA scheme that can handle unbalanced data and random effects. The Graphical User Interface (GUI) is designed to be very intuitive and user friendly.
NASA Astrophysics Data System (ADS)
Gao, Jike
2018-01-01
Using literature review, instrument measurement, questionnaires, and mathematical statistics, this paper analyses the current state of mass sports in the Tibetan plateau areas of Gansu Province. Taking experimentally measured air pollutant and meteorological index data from these Tibetan areas as its foundation, and referring to the relevant national standards and exercise science, the statistical analysis is intended to provide people in the Tibetan plateau areas of Gansu Province who participate in physical exercise with scientific methods and appropriate times for exercising.
Forester, James D; Im, Hae Kyung; Rathouz, Paul J
2009-12-01
Patterns of resource selection by animal populations emerge as a result of the behavior of many individuals. Statistical models that describe these population-level patterns of habitat use can miss important interactions between individual animals and characteristics of their local environment; however, identifying these interactions is difficult. One approach to this problem is to incorporate models of individual movement into resource selection models. To do this, we propose a model for step selection functions (SSF) that is composed of a resource-independent movement kernel and a resource selection function (RSF). We show that standard case-control logistic regression may be used to fit the SSF; however, the sampling scheme used to generate control points (i.e., the definition of availability) must be accommodated. We used three sampling schemes to analyze simulated movement data and found that ignoring sampling and the resource-independent movement kernel yielded biased estimates of selection. The level of bias depended on the method used to generate control locations, the strength of selection, and the spatial scale of the resource map. Using empirical or parametric methods to sample control locations produced biased estimates under stronger selection; however, we show that the addition of a distance function to the analysis substantially reduced that bias. Assuming a uniform availability within a fixed buffer yielded strongly biased selection estimates that could be corrected by including the distance function but remained inefficient relative to the empirical and parametric sampling methods. As a case study, we used location data collected from elk in Yellowstone National Park, USA, to show that selection and bias may be temporally variable. Because under constant selection the amount of bias depends on the scale at which a resource is distributed in the landscape, we suggest that distance always be included as a covariate in SSF analyses. 
This approach to modeling resource selection is easily implemented using common statistical tools and promises to provide deeper insight into the movement ecology of animals.
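The parametric scheme for sampling control (available) locations can be sketched as below; the movement-kernel parameters are illustrative, and the case-control logistic regression fit itself is omitted:

```python
import numpy as np

rng = np.random.default_rng(10)

# One observed (used) step, plus control steps drawn from a resource-independent
# movement kernel: exponential step lengths and uniform turning angles.
start = np.array([0.0, 0.0])
used = np.array([1.0, 0.5])

n_controls = 20
lengths = rng.exponential(scale=1.0, size=n_controls)
angles = rng.uniform(0.0, 2.0 * np.pi, n_controls)
controls = start + np.stack([lengths * np.cos(angles),
                             lengths * np.sin(angles)], axis=1)

# The distance covariate the authors recommend always including in SSF analyses
dist_used = np.linalg.norm(used - start)
dist_controls = np.linalg.norm(controls - start, axis=1)
```

The used and control distances would enter the case-control logistic regression as a covariate, correcting the bias the abstract describes.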
USDA-ARS?s Scientific Manuscript database
Genomic Selection (GS) is a new breeding method in which genome-wide markers are used to predict the breeding value of individuals in a breeding population. GS has been shown to improve breeding efficiency in dairy cattle and several crop plant species, and here we evaluate for the first time its ef...
Guo, Hui; Zhang, Zhen; Yao, Yuan; Liu, Jialin; Chang, Ruirui; Liu, Zhao; Hao, Hongyuan; Huang, Taohong; Wen, Jun; Zhou, Tingting
2018-08-30
Semen sojae praeparatum, with homology of medicine and food, is a famous traditional Chinese medicine. A simple and effective quality fingerprint analysis, coupled with chemometric methods, was developed for quality assessment of Semen sojae praeparatum. First, similarity analysis (SA) and hierarchical clustering analysis (HCA) were applied to select the qualitative markers that obviously influence the quality of Semen sojae praeparatum. Twenty-one chemicals were selected and characterized by liquid chromatography coupled with high-resolution ion trap/time-of-flight mass spectrometry (LC-IT-TOF-MS). Subsequently, principal component analysis (PCA) and orthogonal partial least squares discriminant analysis (OPLS-DA) were conducted to select the quantitative markers of Semen sojae praeparatum samples from different origins. Moreover, 11 compounds with statistical significance were determined quantitatively, providing accurate and informative data for quality evaluation. This study proposes a new strategy of "statistical analysis-based fingerprint establishment", which should be a valuable reference for further study. Copyright © 2018 Elsevier Ltd. All rights reserved.
NASA Astrophysics Data System (ADS)
Yuvchenko, S. A.; Ushakova, E. V.; Pavlova, M. V.; Alonova, M. V.; Zimnyakov, D. A.
2018-04-01
We consider the practical realization of a new optical probing method for random media, defined as reference-free path-length interferometry with intensity-moments analysis. A peculiarity in the statistics of the spectrally selected fluorescence radiation in a laser-pumped dye-doped random medium is discussed. Previously established correlations between the second- and third-order moments of the intensity fluctuations in the random interference patterns, the coherence function of the probe radiation, and the path-difference probability density for the interfering partial waves in the medium are confirmed. The correlations were verified using statistical analysis of the spectrally selected fluorescence radiation emitted by a laser-pumped dye-doped random medium. An aqueous solution of Rhodamine 6G was applied as the doping fluorescent agent for ensembles of densely packed silica grains, which were pumped by the 532 nm radiation of a solid-state laser. The spectrum of the mean path length for a random medium was reconstructed.
Alles, Susan; Peng, Linda X; Mozola, Mark A
2009-01-01
A modification to Performance-Tested Method (PTM) 070601, Reveal Listeria Test (Reveal), is described. The modified method uses a new media formulation, LESS enrichment broth, in single-step enrichment protocols for both foods and environmental sponge and swab samples. Food samples are enriched for 27-30 h at 30 degrees C and environmental samples for 24-48 h at 30 degrees C. Implementation of these abbreviated enrichment procedures allows test results to be obtained on a next-day basis. In testing of 14 food types in internal comparative studies with inoculated samples, there was a statistically significant difference in performance between the Reveal and reference culture [U.S. Food and Drug Administration's Bacteriological Analytical Manual (FDA/BAM) or U.S. Department of Agriculture-Food Safety and Inspection Service (USDA-FSIS)] methods for only a single food in one trial (pasteurized crab meat) at the 27 h enrichment time point, with more positive results obtained with the FDA/BAM reference method. No foods showed statistically significant differences in method performance at the 30 h time point. Independent laboratory testing of 3 foods again produced a statistically significant difference in results for crab meat at the 27 h time point; otherwise results of the Reveal and reference methods were statistically equivalent. Overall, considering both internal and independent laboratory trials, sensitivity of the Reveal method relative to the reference culture procedures in testing of foods was 85.9% at 27 h and 97.1% at 30 h. Results from 5 environmental surfaces inoculated with various strains of Listeria spp. showed that the Reveal method was more productive than the reference USDA-FSIS culture procedure for 3 surfaces (stainless steel, plastic, and cast iron), whereas results were statistically equivalent to the reference method for the other 2 surfaces (ceramic tile and sealed concrete). An independent laboratory trial with ceramic tile inoculated with L. 
monocytogenes confirmed the effectiveness of the Reveal method at the 24 h time point. Overall, sensitivity of the Reveal method at 24 h relative to that of the USDA-FSIS method was 153%. The Reveal method exhibited extremely high specificity, with only a single false-positive result in all trials combined for overall specificity of 99.5%.
Infrared face recognition based on LBP histogram and KW feature selection
NASA Astrophysics Data System (ADS)
Xie, Zhihua
2014-07-01
Conventional features based on the local binary pattern (LBP) histogram still have room for performance improvement. This paper focuses on the dimension reduction of LBP micro-patterns and proposes an improved infrared face recognition method based on LBP histogram representation. To extract robust local features from infrared face images, LBP is chosen to capture the composition of micro-patterns in sub-blocks. Based on statistical test theory, the Kruskal-Wallis (KW) feature selection method is proposed to select the LBP patterns best suited for infrared face recognition. The experimental results show that combining LBP with KW feature selection improves the performance of infrared face recognition; the proposed method outperforms traditional methods based on the LBP histogram, discrete cosine transform (DCT), or principal component analysis (PCA).
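The Kruskal-Wallis selection step can be sketched on synthetic histogram-like features; the H statistic below is computed without tie correction, and the LBP extraction itself is omitted:

```python
import numpy as np

def kruskal_wallis_H(groups):
    """Kruskal-Wallis H statistic (no tie correction) for a list of 1-D samples."""
    all_vals = np.concatenate(groups)
    N = all_vals.size
    ranks = np.empty(N)
    ranks[np.argsort(all_vals)] = np.arange(1, N + 1)   # rank 1..N, smallest first
    H, start = 0.0, 0
    for g in groups:
        r = ranks[start:start + g.size]
        start += g.size
        H += g.size * (r.mean() - (N + 1) / 2.0) ** 2
    return 12.0 / (N * (N + 1)) * H

rng = np.random.default_rng(2)
n_per_class, n_feats = 30, 8
A = rng.normal(0.0, 1.0, (n_per_class, n_feats))   # class 1 features (synthetic)
B = rng.normal(0.0, 1.0, (n_per_class, n_feats))   # class 2 features
B[:, 0] += 4.0                                     # only feature 0 separates the classes

H_scores = np.array([kruskal_wallis_H([A[:, j], B[:, j]]) for j in range(n_feats)])
keep = np.argsort(H_scores)[::-1][:3]              # keep the most discriminative features
```

Features with large H (strong class separation in ranks) are retained, which is the dimension-reduction idea applied to LBP micro-patterns.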
Recursive feature selection with significant variables of support vectors.
Tsai, Chen-An; Huang, Chien-Hsun; Chang, Ching-Wei; Chen, Chun-Houh
2012-01-01
The development of DNA microarrays enables researchers to screen thousands of genes simultaneously and helps determine genes with high and low expression levels in normal and disease tissues. Selecting relevant genes for cancer classification is an important issue. Most gene selection methods use univariate ranking criteria and arbitrarily choose a threshold for selecting genes. However, the parameter setting may not be compatible with the chosen classification algorithm. In this paper, we propose a new gene selection method (SVM-t) based on t-statistics embedded in a support vector machine. We compared its performance with two similar SVM-based methods: SVM recursive feature elimination (SVMRFE) and recursive support vector machine (RSVM). The three methods were compared through extensive simulation experiments and analyses of two published microarray datasets. In the simulation experiments, we found that the proposed method is more robust in selecting informative genes than SVMRFE and RSVM and is capable of attaining good classification performance when the variation of informative and noninformative genes differs. In the analysis of the two microarray datasets, the proposed method yields better performance in identifying fewer genes with good prediction accuracy, compared to SVMRFE and RSVM.
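A minimal sketch of the univariate t-statistic ranking that SVM-t builds on, using synthetic expression data; the SVM training and the recursive elimination loops of SVMRFE/RSVM are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(3)
n1 = n2 = 20
genes = 100
normal  = rng.normal(0.0, 1.0, (n1, genes))   # expression in normal tissue
disease = rng.normal(0.0, 1.0, (n2, genes))   # expression in disease tissue
disease[:, :5] += 4.0                         # five differentially expressed genes

m1, m2 = normal.mean(axis=0), disease.mean(axis=0)
v1, v2 = normal.var(axis=0, ddof=1), disease.var(axis=0, ddof=1)
t = (m2 - m1) / np.sqrt(v1 / n1 + v2 / n2)    # Welch two-sample t per gene
top = np.argsort(np.abs(t))[::-1][:5]         # rank genes by |t| and keep the best 5
```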
Properties of different selection signature statistics and a new strategy for combining them.
Ma, Y; Ding, X; Qanbari, S; Weigend, S; Zhang, Q; Simianer, H
2015-11-01
Identifying signatures of recent or ongoing selection is of high relevance in livestock population genomics. From a statistical perspective, determining a proper testing procedure and combining various test statistics is challenging. On the basis of extensive simulations in this study, we discuss the statistical properties of eight different established selection signature statistics. In the considered scenario, we show that reasonable power to detect selection signatures is achieved with high marker density (>1 SNP/kb) as obtained from sequencing, while rather small sample sizes (~15 diploid individuals) appear to be sufficient. Most selection signature statistics, such as the composite likelihood ratio and cross-population extended haplotype homozygosity, have the highest power when fixation of the selected allele is reached, while the integrated haplotype score has the highest power when selection is ongoing. We suggest a novel strategy, called de-correlated composite of multiple signals (DCMS), to combine different statistics for detecting selection signatures while accounting for the correlation between them. When examined with simulated data, DCMS consistently has higher power than most of the single statistics and shows reliable positional resolution. We illustrate the new statistic on the established selective sweep around the lactase gene in human HapMap data, providing further evidence of its reliability. We then apply it to scan for selection signatures in two chicken samples with diverse skin color. Our analysis suggests that a set of well-known genes, such as BCO2, MC1R, ASIP and TYR, was involved in the divergent selection for this trait.
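One plausible reading of the combination idea is sketched below, assuming rank-based p-values and weights that down-weight mutually correlated statistics; this is an illustration, not the authors' exact DCMS formula:

```python
import numpy as np

rng = np.random.default_rng(4)
n_loci = 500
base = rng.normal(0.0, 1.0, n_loci)
# Three correlated selection-signature statistics; loci 0-4 carry a real signal.
stats = np.stack([base + rng.normal(0.0, 0.5, n_loci) for _ in range(3)], axis=1)
stats[:5] += 6.0

# Rank-based one-sided p-value per statistic (large statistic -> small p).
p = np.empty_like(stats)
for k in range(stats.shape[1]):
    ranks = stats[:, k].argsort().argsort()      # 0 = smallest value
    p[:, k] = 1.0 - (ranks + 0.5) / n_loci
logp = -np.log10(p)

# De-correlated combination: down-weight statistics that are mutually correlated.
r = np.corrcoef(stats, rowvar=False)
weights = 1.0 / np.abs(r).sum(axis=0)
dcms = logp @ weights
```

Highly correlated statistics contribute less to the composite, so redundant evidence is not double-counted.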
Handique, Bijoy K; Khan, Siraj A; Mahanta, J; Sudhakar, S
2014-09-01
Japanese encephalitis (JE) is one of the most dreaded mosquito-borne viral diseases, prevalent mostly in south Asian countries including India. Early warning of the disease in terms of disease intensity is crucial for taking adequate and appropriate intervention measures. The present study was carried out in Dibrugarh district in the state of Assam, located in the northeastern region of India, to assess the accuracy of selected forecasting methods based on historical morbidity patterns of JE incidence during the past 22 years (1985-2006). Four selected forecasting methods, viz. seasonal average (SA), seasonal adjustment with the last three observations (SAT), a modified method adjusting for long-term and cyclic trend (MSAT), and autoregressive integrated moving average (ARIMA), were employed to assess the accuracy of each forecasting method. The forecasting methods were validated over five consecutive years, 2007-2012, and the accuracy of each method was assessed. The method utilising seasonal adjustment with long-term and cyclic trend emerged as the best of the four selected forecasting methods and outperformed even the statistically more advanced ARIMA method. The peak of disease incidence could be predicted effectively with all the methods, but there were significant variations in the magnitude of forecast errors among the selected methods. As expected, variation in forecasts at the primary health centre (PHC) level was wide compared with district-level forecasts. The study showed that the adopted forecasting techniques could reasonably forecast the intensity of JE cases at PHC level without considering external variables. 
The results indicate that the understanding of long-term and cyclic trend of the disease intensity will improve the accuracy of the forecasts, but there is a need for making the forecast models more robust to explain sudden variation in the disease intensity with detail analysis of parasite and host population dynamics.
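The two simplest of the four forecasting schemes can be sketched on synthetic monthly counts; the MSAT trend adjustments and the ARIMA model are omitted:

```python
import numpy as np

rng = np.random.default_rng(5)
years, months = 10, 12
# Synthetic monthly JE-like case counts with a single seasonal peak.
seasonal = 50.0 + 40.0 * np.sin(2.0 * np.pi * np.arange(months) / months)
history = np.maximum(0.0, seasonal + rng.normal(0.0, 3.0, (years, months)))

sa_forecast = history.mean(axis=0)        # SA: mean of all past values for each month
sat_forecast = history[-3:].mean(axis=0)  # SAT-like: mean of the last three observations
```

Both forecasts reproduce the seasonal peak; the SAT variant reacts faster to recent shifts at the cost of higher variance.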
Variability aware compact model characterization for statistical circuit design optimization
NASA Astrophysics Data System (ADS)
Qiao, Ying; Qian, Kun; Spanos, Costas J.
2012-03-01
Variability modeling at the compact transistor model level can enable statistically optimized designs in view of limitations imposed by the fabrication technology. In this work we propose an efficient variability-aware compact model characterization methodology based on the linear propagation of variance. Hierarchical spatial variability patterns of selected compact model parameters are directly calculated from transistor array test structures. This methodology has been implemented and tested using transistor I-V measurements and the EKV-EPFL compact model. Calculation results compare well to full-wafer direct model parameter extractions. Further studies are done on the proper selection of both the compact model parameters and the electrical measurement metrics used in the method.
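Linear propagation of variance can be illustrated on a toy square-law drain-current expression; the parameter values and variances below are hypothetical, not EKV-EPFL extractions:

```python
import numpy as np

# Linear propagation of variance: for y = f(p), Var(y) ~= J @ Sigma_p @ J^T,
# shown on Id = k * (Vgs - Vth)^2 with p = (k, Vth).
k, vth = 2e-4, 0.45          # nominal parameter values (illustrative)
vgs = 1.0

def drain_current(k, vth, vgs=vgs):
    return k * (vgs - vth) ** 2

# Jacobian of Id with respect to (k, vth) at the nominal point
J = np.array([(vgs - vth) ** 2, -2.0 * k * (vgs - vth)])
cov_p = np.diag([(0.05 * k) ** 2, 0.01 ** 2])    # assumed parameter variances
var_id = J @ cov_p @ J

# Monte Carlo check of the linearised variance
rng = np.random.default_rng(11)
samples = drain_current(rng.normal(k, 0.05 * k, 100_000),
                        rng.normal(vth, 0.01, 100_000))
```

For small parameter spreads the linearised variance closely matches the Monte Carlo estimate, which is why the method is efficient for full-wafer characterization.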
Use of the smartphone for end vertebra selection in scoliosis.
Pepe, Murad; Kocadal, Onur; Iyigun, Abdullah; Gunes, Zafer; Aksahin, Ertugrul; Aktekin, Cem Nuri
2017-03-01
The aim of our study was to develop a smartphone-aided end vertebra selection method and to investigate its effectiveness in Cobb angle measurement. The pre-operative posteroanterior scoliosis radiographs of 29 adolescent idiopathic scoliosis patients were used for end vertebra selection and Cobb angle measurement by the standard method and the smartphone-aided method. Measurements were performed by 7 examiners. The intraclass correlation coefficient was used to analyze selection and measurement reliability. Summary statistics of variance calculations were used to provide 95% prediction limits for the error in Cobb angle measurements. A paired 2-tailed t test was used to analyze end vertebra selection differences. The mean absolute Cobb angle difference was 3.6° for the manual method and 1.9° for the smartphone-aided method. Both intraobserver and interobserver reliability were found to be excellent in the manual and smartphone sets for Cobb angle measurement, and likewise for end vertebra selection, although reliability values for the manual set were lower than those for the smartphone set. Two observers selected significantly different end vertebrae in their repeated selections with the manual method. The smartphone-aided method for end vertebra selection and Cobb angle measurement showed excellent reliability. We can expect a reduction in measurement error rates with the widespread use of this method in clinical practice. Level III, Diagnostic study. Copyright © 2016 Turkish Association of Orthopaedics and Traumatology. Production and hosting by Elsevier B.V. All rights reserved.
Selecting long-term care facilities with high use of acute hospitalisations: issues and options
2014-01-01
Background This paper considers approaches to the question “Which long-term care facilities have residents with high use of acute hospitalisations?” It compares four methods of identifying long-term care facilities with high use of acute hospitalisations by demonstrating four selection methods, identifies key factors to be resolved when deciding which methods to employ, and discusses their appropriateness for different research questions. Methods OPAL was a census-type survey of aged care facilities and residents in Auckland, New Zealand, in 2008. It collected information about facility management and resident demographics, needs and care. Survey records (149 aged care facilities, 6271 residents) were linked to hospital and mortality records routinely assembled by health authorities. The main ranking endpoint was acute hospitalisations for diagnoses that were classified as potentially avoidable. Facilities were ranked using 1) simple event counts per person, 2) event rates per year of resident follow-up, 3) statistical model of rates using four predictors, and 4) change in ranks between methods 2) and 3). A generalized mixed model was used for Method 3 to handle the clustered nature of the data. Results 3048 potentially avoidable hospitalisations were observed during 22 months’ follow-up. The same “top ten” facilities were selected by Methods 1 and 2. The statistical model (Method 3), predicting rates from resident and facility characteristics, ranked facilities differently than these two simple methods. The change-in-ranks method identified a very different set of “top ten” facilities. All methods showed a continuum of use, with no clear distinction between facilities with higher use. Conclusion Choice of selection method should depend upon the purpose of selection. To monitor performance during a period of change, a recent simple rate, count per resident, or even count per bed, may suffice. 
To find high-use facilities regardless of resident needs, recent history of admissions is highly predictive. To target a few high-use facilities that have high rates after considering facility and resident characteristics, model residuals or a large increase in rank may be preferable. PMID:25052433
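Methods 1 and 2 from the study above can be sketched on synthetic facility data; the mixed-model Method 3 and the change-in-ranks Method 4 are omitted:

```python
import numpy as np

rng = np.random.default_rng(6)
n_fac = 20
residents = rng.integers(20, 120, n_fac)
followup_years = residents * rng.uniform(1.0, 1.8, n_fac)  # person-years of follow-up

true_rate = np.full(n_fac, 0.4)    # avoidable admissions per person-year (synthetic)
true_rate[0] = 1.5                 # one facility with genuinely high use
events = rng.poisson(true_rate * followup_years)

count_per_resident = events / residents       # Method 1: simple count per person
rate_per_year = events / followup_years       # Method 2: rate per year of follow-up
rank1 = np.argsort(count_per_resident)[::-1]  # facilities ranked highest-use first
rank2 = np.argsort(rate_per_year)[::-1]
```

With a strong signal the two simple methods agree on the top facility, matching the study's finding that Methods 1 and 2 selected the same "top ten".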
Will genomic selection be a practical method for plant breeding?
Nakaya, Akihiro; Isobe, Sachiko N
2012-11-01
Genomic selection or genome-wide selection (GS) has been highlighted as a new approach for marker-assisted selection (MAS) in recent years. GS is a form of MAS that selects favourable individuals based on genomic estimated breeding values. Previous studies have suggested the utility of GS, especially for capturing small-effect quantitative trait loci, but GS has not become a popular methodology in the field of plant breeding, possibly because there is insufficient information available on GS for practical use. In this review, GS is discussed from a practical breeding viewpoint. Statistical approaches employed in GS are briefly described, before the recent progress in GS studies is surveyed. GS practices in plant breeding are then reviewed before future prospects are discussed. Statistical concepts used in GS are discussed with genetic models and variance decomposition, heritability, breeding value and linear model. Recent progress in GS studies is reviewed with a focus on empirical studies. For the practice of GS in plant breeding, several specific points are discussed including linkage disequilibrium, feature of populations and genotyped markers and breeding scheme. Currently, GS is not perfect, but it is a potent, attractive and valuable approach for plant breeding. This method will be integrated into many practical breeding programmes in the near future with further advances and the maturing of its theory.
NASA Astrophysics Data System (ADS)
Adeniyi, D. A.; Wei, Z.; Yang, Y.
2017-10-01
The recommendation problem has been extensively studied by researchers in the fields of data mining, databases and information retrieval. This study presents the design and realisation of an automated, personalised news recommendation system based on a Chi-square statistics-based K-nearest neighbour (χ2SB-KNN) model. The proposed χ2SB-KNN model has the potential to overcome computational complexity and information-overload problems, reducing runtime and speeding up execution through the use of the critical value of the χ2 distribution. The proposed recommendation engine can alleviate scalability challenges through combined online pattern discovery and pattern matching for real-time recommendations. This work also showcases a novel method of feature selection referred to as the Data Discretisation-Based feature selection method, which is used to select the best features for the proposed χ2SB-KNN algorithm at the preprocessing stage of the classification procedure. The proposed χ2SB-KNN model is implemented in an in-house Java program on an experimental website, the OUC newsreaders' website. Finally, we compared the performance of our system with two baseline methods, the traditional Euclidean-distance K-nearest neighbour and Naive Bayesian techniques. The results show a significant improvement of our method over the baseline methods studied.
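The discretisation-plus-chi-square scoring idea can be sketched as follows; the features, bin counts, and labels are synthetic stand-ins rather than the χ2SB-KNN implementation:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 400
clicks = rng.integers(0, 2, n)                   # 1 = user clicked the article
# One informative numeric feature and one pure-noise feature (both synthetic).
age = np.where(clicks == 1, rng.normal(45, 8, n), rng.normal(30, 8, n))
noise_feat = rng.normal(0.0, 1.0, n)

def chi2_score(feature, labels, bins=4):
    """Discretise a numeric feature, then compute its chi-square statistic vs. the label."""
    cut = np.quantile(feature, np.linspace(0, 1, bins + 1)[1:-1])
    d = np.digitize(feature, cut)                # bin index 0..bins-1
    obs = np.zeros((bins, 2))
    for b in range(bins):
        for c in (0, 1):
            obs[b, c] = np.sum((d == b) & (labels == c))
    exp = obs.sum(1, keepdims=True) * obs.sum(0, keepdims=True) / obs.sum()
    return ((obs - exp) ** 2 / exp).sum()

s_age = chi2_score(age, clicks)
s_noise = chi2_score(noise_feat, clicks)
```

Features whose score exceeds the χ2 critical value for the relevant degrees of freedom would be kept at the preprocessing stage.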
A re-evaluation of a case-control model with contaminated controls for resource selection studies
Christopher T. Rota; Joshua J. Millspaugh; Dylan C. Kesler; Chad P. Lehman; Mark A. Rumble; Catherine M. B. Jachowski
2013-01-01
A common sampling design in resource selection studies involves measuring resource attributes at sample units used by an animal and at sample units considered available for use. Few models can estimate the absolute probability of using a sample unit from such data, but such approaches are generally preferred over statistical methods that estimate a relative probability...
Muddukrishna, B S; Pai, Vasudev; Lobo, Richard; Pai, Aravinda
2017-11-22
In the present study, five important binary fingerprinting techniques were used to model novel flavones for the selective inhibition of Tankyrase I. Of the fingerprints used, the atom-pair fingerprint resulted in a statistically significant 2D QSAR model using a kernel-based partial least squares regression method. This model indicates that the presence of electron-donating groups contributes positively to activity, whereas the presence of electron-withdrawing groups contributes negatively. The model could be used to develop more potent as well as selective analogues for the inhibition of Tankyrase I. A schematic representation of the 2D QSAR workflow is provided.
Selecting relevant 3D image features of margin sharpness and texture for lung nodule retrieval.
Ferreira, José Raniery; de Azevedo-Marques, Paulo Mazzoncini; Oliveira, Marcelo Costa
2017-03-01
Lung cancer is the leading cause of cancer-related deaths in the world. Its diagnosis is a challenging task for specialists due to several aspects of the classification of lung nodules. It is therefore important to integrate content-based image retrieval methods into the lung nodule classification process, since they are capable of retrieving similar cases from databases that were previously diagnosed. However, this mechanism depends on extracting relevant image features in order to achieve high efficiency. The goal of this paper is to select the 3D image features of margin sharpness and texture that can be relevant to the retrieval of similar cancerous and benign lung nodules. A total of 48 3D image attributes were extracted from the nodule volume. Border sharpness features were extracted from perpendicular lines drawn over the lesion boundary. Second-order texture features were extracted from a co-occurrence matrix. Relevant features were selected by a correlation-based method and a statistical significance analysis. Retrieval performance was assessed according to the nodule's potential malignancy on the 10 most similar cases and by the parameters of precision and recall. Statistically significant features reduced retrieval performance. The correlation-based method selected 2 margin sharpness attributes and 6 texture attributes and obtained higher precision on similar nodule retrieval compared with all 48 extracted features. A feature space dimensionality reduction of 83% yielded higher retrieval performance and proved to be a computationally low-cost method of retrieving similar nodules for the diagnosis of lung cancer.
Biomarker selection for medical diagnosis using the partial area under the ROC curve
2014-01-01
Background A biomarker is usually used as a diagnostic or assessment tool in medical research. Finding an ideal biomarker is not easy, and combining multiple biomarkers provides a promising alternative. Moreover, some biomarkers based on the optimal linear combination do not have enough discriminatory power. The aim of this study was therefore to find the significant biomarkers based on the optimal linear combination maximizing the pAUC for assessment of the biomarkers. Methods Under the binormality assumption, we obtain the optimal linear combination of biomarkers maximizing the partial area under the receiver operating characteristic curve (pAUC). Related statistical tests are developed for assessment of a biomarker set and of an individual biomarker. Stepwise biomarker selections are introduced to identify those biomarkers of statistical significance. Results The results of a simulation study and three real examples (Duchenne muscular dystrophy, heart disease, and breast tissue) show that our methods are most suitable for biomarker selection in data sets with a moderate number of biomarkers. Conclusions Our proposed biomarker selection approaches can be used to find the significant biomarkers based on hypothesis testing. PMID:24410929
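An empirical version of the pAUC criterion can be sketched as follows; the binormal data and the fixed combination weights are illustrative, and the stepwise selection and formal tests are omitted:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
healthy  = rng.normal(0.0, 1.0, (n, 2))          # two biomarkers, controls
diseased = rng.normal([1.2, 0.8], 1.0, (n, 2))   # cases, shifted means

def pauc(scores_h, scores_d, fpr_max=0.2, grid=201):
    """Empirical partial AUC over the false-positive range [0, fpr_max]."""
    fprs = np.linspace(0.0, fpr_max, grid)
    thr = np.quantile(scores_h, 1.0 - fprs)      # thresholds matching each FPR
    tprs = (scores_d[:, None] > thr[None, :]).mean(axis=0)
    return float(np.sum((tprs[1:] + tprs[:-1]) / 2.0 * np.diff(fprs)))  # trapezoid rule

w = np.array([1.2, 0.8])   # under binormality with identity covariance, the
                           # mean-difference direction is the optimal combination
combo  = pauc(healthy @ w, diseased @ w)
single = pauc(healthy[:, 0], diseased[:, 0])
```

Candidate combinations would be compared by pAUC, with a marker retained only when it significantly increases the criterion.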
Robust Feature Selection Technique using Rank Aggregation.
Sarkar, Chandrima; Cooley, Sarah; Srivastava, Jaideep
2014-01-01
Although feature selection is a well-developed research area, there is an ongoing need to develop methods that make classifiers more efficient. One important challenge is the lack of a universal feature selection technique that produces similar outcomes with all types of classifiers. This is because all feature selection techniques have individual statistical biases, while classifiers exploit different statistical properties of data for evaluation. In numerous situations this can leave researchers in a dilemma as to which feature selection method and which classifier to choose from a vast range of options. In this paper, we propose a technique that aggregates the consensus properties of various feature selection methods to develop a better solution. The ensemble nature of our technique makes it more robust across various classifiers; in other words, it is stable towards achieving similar, and ideally higher, classification accuracy across a wide variety of classifiers. We quantify this concept of robustness with a measure known as the Robustness Index (RI). We perform an extensive empirical evaluation of our technique on eight data sets of different dimensions, including Arrhythmia, Lung Cancer, Madelon, mfeat-fourier, internet-ads, Leukemia-3c, and Embryonal Tumor, as well as a real-world data set, Acute Myeloid Leukemia (AML). We demonstrate not only that our algorithm is more robust, but also that, compared to other techniques, it improves classification accuracy by approximately 3-4% (on data sets with fewer than 500 features) and by more than 5% (on data sets with more than 500 features), across a wide range of classifiers.
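The abstract does not spell out the aggregation scheme; a minimal Borda-style sketch, in which each feature's ranks from several selection methods are averaged, could look like this (the feature names and rankings are hypothetical):

```python
import numpy as np

def aggregate_ranks(rank_lists):
    """Borda-style rank aggregation: average each feature's rank position
    across the rankings produced by several feature selection methods."""
    ranks = {}
    for ranking in rank_lists:
        for pos, feat in enumerate(ranking):
            ranks.setdefault(feat, []).append(pos)
    # lower mean rank = more consistently preferred feature
    return sorted(ranks, key=lambda f: np.mean(ranks[f]))

# three hypothetical selectors (e.g. chi-square, mutual information, ReliefF)
# each ranking the same five features from best to worst
r1 = ["f3", "f1", "f4", "f2", "f5"]
r2 = ["f1", "f3", "f2", "f4", "f5"]
r3 = ["f3", "f4", "f1", "f5", "f2"]
consensus = aggregate_ranks([r1, r2, r3])
```

A consensus ranking of this kind is then truncated at the desired number of features before training any downstream classifier.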
Statistical assessment of the learning curves of health technologies.
Ramsay, C R; Grant, A M; Wallace, S A; Garthwaite, P H; Monk, A F; Russell, I T
2001-01-01
(1) To describe systematically studies that directly assessed the learning curve effect of health technologies. (2) To identify systematically 'novel' statistical techniques applied to learning curve data in other fields, such as psychology and manufacturing. (3) To test these statistical techniques on data sets from studies of varying designs to assess health technologies in which learning curve effects are known to exist. METHODS - STUDY SELECTION (HEALTH TECHNOLOGY ASSESSMENT LITERATURE REVIEW): For a study to be included, it had to include a formal analysis of the learning curve of a health technology using a graphical, tabular or statistical technique. METHODS - STUDY SELECTION (NON-HEALTH TECHNOLOGY ASSESSMENT LITERATURE SEARCH): For a study to be included, it had to include a formal assessment of a learning curve using a statistical technique that had not been identified in the previous search. METHODS - DATA SOURCES: Six clinical and 16 non-clinical biomedical databases were searched. A limited amount of handsearching and scanning of reference lists was also undertaken. METHODS - DATA EXTRACTION (HEALTH TECHNOLOGY ASSESSMENT LITERATURE REVIEW): A number of study characteristics were abstracted from the papers, such as study design, study size, number of operators and the statistical method used. METHODS - DATA EXTRACTION (NON-HEALTH TECHNOLOGY ASSESSMENT LITERATURE SEARCH): The new statistical techniques identified were categorised into four subgroups of increasing complexity: exploratory data analysis; simple series data analysis; complex data structure analysis; and generic techniques. METHODS - TESTING OF STATISTICAL METHODS: Some of the statistical methods identified in the systematic searches for single (simple) operator series data and for multiple (complex) operator series data were illustrated and explored using three data sets.
The first was a case series of 190 consecutive laparoscopic fundoplication procedures performed by a single surgeon; the second was a case series of consecutive laparoscopic cholecystectomy procedures performed by ten surgeons; the third was randomised trial data derived from the laparoscopic procedure arm of a multicentre trial of groin hernia repair, supplemented by data from non-randomised operations performed during the trial. RESULTS - HEALTH TECHNOLOGY ASSESSMENT LITERATURE REVIEW: Of 4571 abstracts identified, 272 (6%) were later included in the study after review of the full paper. Some 51% of studies assessed a surgical minimal access technique and 95% were case series. The statistical method used most often (60%) was splitting the data into consecutive parts (such as halves or thirds), with only 14% attempting a more formal statistical analysis. The reporting of the studies was poor, with 31% giving no details of data collection methods. RESULTS - NON-HEALTH TECHNOLOGY ASSESSMENT LITERATURE SEARCH: Of 9431 abstracts assessed, 115 (1%) were deemed appropriate for further investigation and, of these, 18 were included in the study. All of the methods for complex data sets were identified in the non-clinical literature. These were discriminant analysis, two-stage estimation of learning rates, generalised estimating equations, multilevel models, latent curve models, time series models and stochastic parameter models. In addition, eight new shapes of learning curves were identified. RESULTS - TESTING OF STATISTICAL METHODS: No one particular shape of learning curve performed significantly better than another. The performance of 'operation time' as a proxy for learning differed between the three procedures. Multilevel modelling using the laparoscopic cholecystectomy data demonstrated and measured surgeon-specific and confounding effects. 
The inclusion of non-randomised cases, despite the possible limitations of the method, enhanced the interpretation of learning effects. CONCLUSIONS - HEALTH TECHNOLOGY ASSESSMENT LITERATURE REVIEW: The statistical methods used for assessing learning effects in health technology assessment have been crude and the reporting of studies poor. CONCLUSIONS - NON-HEALTH TECHNOLOGY ASSESSMENT LITERATURE SEARCH: A number of statistical methods for assessing learning effects were identified that had not hitherto been used in health technology assessment. There was a hierarchy of methods for the identification and measurement of learning, and the more sophisticated methods for both have had little if any use in health technology assessment. This demonstrated the value of considering fields outside clinical research when addressing methodological issues in health technology assessment. CONCLUSIONS - TESTING OF STATISTICAL METHODS: It has been demonstrated that the portfolio of techniques identified can enhance investigations of learning curve effects. (ABSTRACT TRUNCATED)
Lindsell, Christopher J.; Welty, Leah J.; Mazumdar, Madhu; Thurston, Sally W.; Rahbar, Mohammad H.; Carter, Rickey E.; Pollock, Bradley H.; Cucchiara, Andrew J.; Kopras, Elizabeth J.; Jovanovic, Borko D.; Enders, Felicity T.
2014-01-01
Introduction Statistics is an essential training component for a career in clinical and translational science (CTS). Given the increasing complexity of statistics, learners may have difficulty selecting appropriate courses. Our question was: what depth of statistical knowledge do different CTS learners require? Methods For three types of CTS learners (principal investigator, co‐investigator, informed reader of the literature), each with different backgrounds in research (no previous research experience, reader of the research literature, previous research experience), 18 experts in biostatistics, epidemiology, and research design proposed levels for 21 statistical competencies. Results Statistical competencies were categorized as fundamental, intermediate, or specialized. CTS learners who intend to become independent principal investigators require more specialized training, while those intending to become informed consumers of the medical literature require more fundamental education. For most competencies, less training was proposed for those with more research background. Discussion When selecting statistical coursework, the learner's research background and career goal should guide the decision. Some statistical competencies are considered to be more important than others. Baseline knowledge assessments may help learners identify appropriate coursework. Conclusion Rather than one size fits all, tailoring education to baseline knowledge, learner background, and future goals increases learning potential while minimizing classroom time. PMID:25212569
Mapping landslide susceptibility using data-driven methods.
Zêzere, J L; Pereira, S; Melo, R; Oliveira, S C; Garcia, R A C
2017-07-01
Most epistemic uncertainty within data-driven landslide susceptibility assessment results from errors in landslide inventories, difficulty in identifying and mapping landslide causes, and decisions related to the modelling procedure. In this work we evaluate and discuss differences observed on landslide susceptibility maps resulting from: (i) the selection of the statistical method; (ii) the selection of the terrain mapping unit; and (iii) the selection of the feature type used to represent landslides in the model (polygon versus point). The work is performed in a single study area (Silveira Basin, 18.2 km², Lisbon Region, Portugal) using a unique database of geo-environmental landslide predisposing factors and an inventory of 82 shallow translational slides. Logistic regression, discriminant analysis and two versions of the information value method were used, and we conclude that multivariate statistical methods perform better when computed over heterogeneous terrain units and should be selected to assess landslide susceptibility based on slope terrain units, geo-hydrological terrain units or census terrain units. However, evidence was found that the chosen terrain mapping unit can produce greater differences in the final susceptibility results than the chosen statistical method. Landslide susceptibility should be assessed over grid cell terrain units whenever the spatial accuracy of the landslide inventory is good. In addition, a single point per landslide proved to be sufficient to generate accurate landslide susceptibility maps, provided the landslides are small in size, thus minimizing possible heterogeneities of predisposing factors within the landslide boundary.
Although in recent years ROC curves have been preferred for evaluating the performance of susceptibility models, evidence was found that the model with the highest AUC is not necessarily the best landslide susceptibility model, namely when terrain mapping units are heterogeneous in size and few in number. Copyright © 2017 Elsevier B.V. All rights reserved.
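As a sketch of one of the bivariate techniques mentioned, the information value method scores each class of a predisposing factor by the log ratio of the landslide density within the class to the overall density (the grid below is toy data, not from the Silveira Basin study):

```python
import numpy as np

def information_value(class_ids, landslide, eps=1e-9):
    """Information value per factor class: log of (landslide density in
    class / overall landslide density). Positive values indicate classes
    more prone to landslides than average."""
    class_ids = np.asarray(class_ids)
    landslide = np.asarray(landslide, bool)
    overall = landslide.mean()
    iv = {}
    for c in np.unique(class_ids):
        dens = landslide[class_ids == c].mean()
        iv[int(c)] = np.log((dens + eps) / (overall + eps))
    return iv

# toy terrain units: a slope class per cell, and landslide presence/absence
classes = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 3])
slides  = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
iv = information_value(classes, slides)
```

Summing the class scores of all factors for each terrain unit then yields the susceptibility score that is mapped.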
Spindel, Jennifer; Begum, Hasina; Akdemir, Deniz; Virk, Parminder; Collard, Bertrand; Redoña, Edilberto; Atlin, Gary; Jannink, Jean-Luc; McCouch, Susan R
2015-02-01
Genomic Selection (GS) is a new breeding method in which genome-wide markers are used to predict the breeding value of individuals in a breeding population. GS has been shown to improve breeding efficiency in dairy cattle and several crop plant species, and here we evaluate for the first time its efficacy for breeding inbred lines of rice. We performed a genome-wide association study (GWAS) in conjunction with five-fold GS cross-validation on a population of 363 elite breeding lines from the International Rice Research Institute's (IRRI) irrigated rice breeding program and herein report the GS results. The population was genotyped with 73,147 markers using genotyping-by-sequencing. The training population, statistical method used to build the GS model, number of markers, and trait were varied to determine their effect on prediction accuracy. For all three traits, genomic prediction models outperformed prediction based on pedigree records alone. Prediction accuracies ranged from 0.31 and 0.34 for grain yield and plant height to 0.63 for flowering time. Analyses using subsets of the full marker set suggest that using one marker every 0.2 cM is sufficient for genomic selection in this collection of rice breeding materials. RR-BLUP was the best performing statistical method for grain yield where no large effect QTL were detected by GWAS, while for flowering time, where a single very large effect QTL was detected, the non-GS multiple linear regression method outperformed GS models. For plant height, in which four mid-sized QTL were identified by GWAS, random forest produced the most consistently accurate GS models. Our results suggest that GS, informed by GWAS interpretations of genetic architecture and population structure, could become an effective tool for increasing the efficiency of rice breeding as the costs of genotyping continue to decline.
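RR-BLUP, the best performer for grain yield above, amounts to ridge regression of phenotypes on marker scores with one common shrinkage parameter; a minimal sketch on simulated marker data (in practice λ is derived from variance-component estimates, whereas here it is fixed arbitrarily):

```python
import numpy as np

def rr_blup(Z, y, lam):
    """Ridge-regression BLUP: solve (Z'Z + lam I) u = Z'(y - mean(y)),
    shrinking all marker effects equally toward zero."""
    n, p = Z.shape
    yc = y - y.mean()
    u = np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ yc)
    return y.mean(), u

rng = np.random.default_rng(1)
n, p = 100, 500                      # p >> n, as with genome-wide markers
Z = rng.choice([-1.0, 0.0, 1.0], size=(n, p))   # coded marker genotypes
true_u = rng.normal(0.0, 0.1, p)                # many small marker effects
y = Z @ true_u + rng.normal(0.0, 0.5, n)        # phenotype = signal + noise
mu, u_hat = rr_blup(Z, y, lam=10.0)
pred = mu + Z @ u_hat                           # genomic estimated values
```

The equal-shrinkage assumption is exactly why RR-BLUP does well for traits with many small-effect QTL and worse when a single large-effect QTL dominates, as the abstract observes for flowering time.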
Eissa, Maya S; Abou Al Alamein, Amal M
2018-03-15
Different innovative spectrophotometric methods were introduced for the first time for the simultaneous quantification of sacubitril/valsartan in their binary mixture and in their combined dosage form without prior separation, through two manipulation approaches. The first approach is based on the selection of two wavelengths in the zero-order absorption spectra: the dual wavelength method (DWL) at 226 nm and 275 nm for valsartan, the induced dual wavelength method (IDW) at 226 nm and 254 nm for sacubitril, and advanced absorbance subtraction (AAS) based on their iso-absorptive point at 246 nm (λiso) and 261 nm (where sacubitril shows equal absorbance values at the two selected wavelengths). The second approach is based on ratio spectra using their normalized spectra: the ratio difference spectrophotometric method (RD) at 225 nm and 264 nm for both drugs in their ratio spectra, the first derivative of ratio spectra (DR1) at 232 nm for valsartan and 239 nm for sacubitril, and mean centering of ratio spectra (MCR) at 260 nm for both drugs. Both sacubitril and valsartan showed linearity upon application of these methods in the range of 2.5-25.0 μg/mL. The developed spectrophotometric methods were successfully applied to the analysis of their combined tablet dosage form ENTRESTO™ and were validated according to ICH guidelines. The results obtained from the proposed methods were statistically compared to a reported HPLC method using the Student t-test and F-test, and a comparative study was also performed with one-way ANOVA, showing no statistical difference in precision and accuracy. Copyright © 2017 Elsevier B.V. All rights reserved.
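The dual wavelength principle can be illustrated numerically: if the interfering component absorbs equally at the two chosen wavelengths, its contribution cancels in the absorbance difference, which then depends only on the analyte (the absorptivity values below are hypothetical, not those of sacubitril or valsartan):

```python
# hypothetical absorptivities (absorbance units per µg/mL) at two wavelengths;
# the interferent absorbs equally at both, so it cancels in the difference
EPS_ANALYTE = {"l1": 0.040, "l2": 0.015}
EPS_INTERF  = {"l1": 0.025, "l2": 0.025}

def delta_A(c_analyte, c_interf):
    """Absorbance difference A(l1) - A(l2) of the binary mixture (Beer's law,
    additive absorbances)."""
    a1 = EPS_ANALYTE["l1"] * c_analyte + EPS_INTERF["l1"] * c_interf
    a2 = EPS_ANALYTE["l2"] * c_analyte + EPS_INTERF["l2"] * c_interf
    return a1 - a2

def concentration(dA):
    """Invert the calibration line: dA = (eps_l1 - eps_l2) * c_analyte."""
    return dA / (EPS_ANALYTE["l1"] - EPS_ANALYTE["l2"])

c = concentration(delta_A(10.0, 7.0))   # interferent level drops out
```

In practice the two wavelengths are located on the measured spectrum of the pure interferent, and the difference is calibrated against analyte standards.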
Weir, Christopher J; Butcher, Isabella; Assi, Valentina; Lewis, Stephanie C; Murray, Gordon D; Langhorne, Peter; Brady, Marian C
2018-03-07
Rigorous, informative meta-analyses rely on availability of appropriate summary statistics or individual participant data. For continuous outcomes, especially those with naturally skewed distributions, summary information on the mean or variability often goes unreported. While full reporting of original trial data is the ideal, we sought to identify methods for handling unreported mean or variability summary statistics in meta-analysis. We undertook two systematic literature reviews to identify methodological approaches used to deal with missing mean or variability summary statistics. Five electronic databases were searched, in addition to the Cochrane Colloquium abstract books and the Cochrane Statistics Methods Group mailing list archive. We also conducted cited reference searching and emailed topic experts to identify recent methodological developments. Details recorded included the description of the method, the information required to implement the method, any underlying assumptions and whether the method could be readily applied in standard statistical software. We provided a summary description of the methods identified, illustrating selected methods in example meta-analysis scenarios. For missing standard deviations (SDs), following screening of 503 articles, fifteen methods were identified in addition to those reported in a previous review. These included Bayesian hierarchical modelling at the meta-analysis level; summary statistic level imputation based on observed SD values from other trials in the meta-analysis; a practical approximation based on the range; and algebraic estimation of the SD based on other summary statistics. Following screening of 1124 articles for methods estimating the mean, one approximate Bayesian computation approach and three papers based on alternative summary statistics were identified. 
Illustrative meta-analyses showed that when replacing a missing SD the approximation using the range minimised loss of precision and generally performed better than omitting trials. When estimating missing means, a formula using the median, lower quartile and upper quartile performed best in preserving the precision of the meta-analysis findings, although in some scenarios, omitting trials gave superior results. Methods based on summary statistics (minimum, maximum, lower quartile, upper quartile, median) reported in the literature facilitate more comprehensive inclusion of randomised controlled trials with missing mean or variability summary statistics within meta-analyses.
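Two of the simplest approximations of the kind reviewed above can be sketched as follows (the range/4 rule for a missing SD and the quartile-based mean estimate are rules of thumb from this literature; their accuracy depends on sample size and skewness):

```python
def sd_from_range(minimum, maximum):
    """Rough SD approximation from the range: range/4 is a common rule of
    thumb for moderate sample sizes."""
    return (maximum - minimum) / 4.0

def mean_from_quartiles(q1, median, q3):
    """Approximate the mean from the median and quartiles when only a
    five-number summary is reported."""
    return (q1 + median + q3) / 3.0

sd_hat = sd_from_range(2.0, 18.0)
mean_hat = mean_from_quartiles(4.0, 7.0, 13.0)
```

Imputed values like these let an otherwise-excluded trial contribute to the pooled estimate, at the cost of extra (usually unquantified) uncertainty.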
Zhang, Xinyu; Cao, Jiguo; Carroll, Raymond J
2015-03-01
We consider model selection and estimation in a context where there are competing ordinary differential equation (ODE) models, and all the models are special cases of a "full" model. We propose a computationally inexpensive approach that employs statistical estimation of the full model, followed by a combination of a least squares approximation (LSA) and the adaptive Lasso. We show the resulting method, here called the LSA method, to be an (asymptotically) oracle model selection method. The finite sample performance of the proposed LSA method is investigated with Monte Carlo simulations, in which we examine the percentage of selecting true ODE models, the efficiency of the parameter estimation compared to simply using the full and true models, and coverage probabilities of the estimated confidence intervals for ODE parameters, all of which have satisfactory performances. Our method is also demonstrated by selecting the best predator-prey ODE to model a lynx and hare population dynamical system among some well-known and biologically interpretable ODE models. © 2014, The International Biometric Society.
Inference on the Strength of Balancing Selection for Epistatically Interacting Loci
Buzbas, Erkan Ozge; Joyce, Paul; Rosenberg, Noah A.
2011-01-01
Existing inference methods for estimating the strength of balancing selection in multi-locus genotypes rely on the assumption that there are no epistatic interactions between loci. Complex systems in which balancing selection is prevalent, such as sets of human immune system genes, are known to contain components that interact epistatically. Therefore, current methods may not produce reliable inference on the strength of selection at these loci. In this paper, we address this problem by presenting statistical methods that can account for epistatic interactions in making inference about balancing selection. A theoretical result due to Fearnhead (2006) is used to build a multi-locus Wright-Fisher model of balancing selection, allowing for epistatic interactions among loci. Antagonistic and synergistic types of interactions are examined. The joint posterior distribution of the selection and mutation parameters is sampled by Markov chain Monte Carlo methods, and the plausibility of models is assessed via Bayes factors. As a component of the inference process, an algorithm to generate multi-locus allele frequencies under balancing selection models with epistasis is also presented. Recent evidence on interactions among a set of human immune system genes is introduced as a motivating biological system for the epistatic model, and data on these genes are used to demonstrate the methods. PMID:21277883
NASA Astrophysics Data System (ADS)
Riad, Safaa M.; El-Rahman, Mohamed K. Abd; Fawaz, Esraa M.; Shehata, Mostafa A.
2015-06-01
Three sensitive, selective, and precise stability indicating spectrophotometric methods for the determination of the X-ray contrast agent, diatrizoate sodium (DTA) in the presence of its acidic degradation product (highly cytotoxic 3,5-diamino metabolite) and in pharmaceutical formulation, were developed and validated. The first method is ratio difference, the second one is the bivariate method, and the third one is the dual wavelength method. The calibration curves for the three proposed methods are linear over a concentration range of 2-24 μg/mL. The selectivity of the proposed methods was tested using laboratory prepared mixtures. The proposed methods have been successfully applied to the analysis of DTA in pharmaceutical dosage forms without interference from other dosage form additives. The results were statistically compared with the official US pharmacopeial method. No significant difference for either accuracy or precision was observed.
Fundamental Vocabulary Selection Based on Word Familiarity
NASA Astrophysics Data System (ADS)
Sato, Hiroshi; Kasahara, Kaname; Kanasugi, Tomoko; Amano, Shigeaki
This paper proposes a new method for selecting fundamental vocabulary. We are presently constructing the Fundamental Vocabulary Knowledge-base of Japanese, which contains integrated information on syntax, semantics and pragmatics for the purposes of advanced natural language processing. This database mainly consists of a lexicon and a treebank: Lexeed (a Japanese Semantic Lexicon) and the Hinoki Treebank. Fundamental vocabulary selection is the first step in the construction of Lexeed. The vocabulary should include sufficient words to describe general concepts for self-expandability, yet should not be so large as to be prohibitive to construct and maintain. There are two conventional methods for selecting fundamental vocabulary. The first is intuition-based selection by experts, the traditional method for making dictionaries; its weak point is that the selection depends strongly on personal intuition. The second is corpus-based selection, which is superior in objectivity to intuition-based selection; however, it is difficult to compile a sufficiently balanced corpus. We propose a psychologically motivated selection method that adopts word familiarity as the selection criterion. Word familiarity is a rating that represents the familiarity of a word as a real number ranging from 1 (least familiar) to 7 (most familiar). We determined the word familiarity ratings statistically based on psychological experiments with 32 subjects. We selected about 30,000 words as the fundamental vocabulary, based on a minimum word familiarity threshold of 5. We also evaluated the vocabulary by comparing its word coverage with that of conventional intuition-based and corpus-based selection over dictionary definition sentences and novels, and demonstrated the superior coverage of our lexicon. Based on this, we conclude that the proposed method is superior to conventional methods for fundamental vocabulary selection.
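The selection criterion itself reduces to a threshold on the familiarity ratings; a toy sketch (the romanized entries and ratings below are hypothetical, standing in for the experimentally measured 1-7 scale):

```python
# hypothetical familiarity ratings on the 1 (least) to 7 (most familiar) scale
ratings = {"taberu": 6.4, "hashiru": 5.9, "shomou": 3.2,
           "neko": 6.8, "gengo": 4.7}

def select_fundamental(ratings, threshold=5.0):
    """Keep words whose familiarity rating meets the minimum threshold,
    as in the paper's cutoff of 5."""
    return sorted(w for w, r in ratings.items() if r >= threshold)

vocab = select_fundamental(ratings)
```

Applied to roughly 80,000 rated words, a threshold of 5 yields the ~30,000-word fundamental vocabulary described above.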
Genome-wide regression and prediction with the BGLR statistical package.
Pérez, Paulino; de los Campos, Gustavo
2014-10-01
Many modern genomic data analyses require implementing regressions where the number of parameters (p, e.g., the number of marker effects) exceeds sample size (n). Implementing these large-p-with-small-n regressions poses several statistical and computational challenges, some of which can be confronted using Bayesian methods. This approach allows integrating various parametric and nonparametric shrinkage and variable selection procedures in a unified and consistent manner. The BGLR R-package implements a large collection of Bayesian regression models, including parametric variable selection and shrinkage methods and semiparametric procedures (Bayesian reproducing kernel Hilbert spaces regressions, RKHS). The software was originally developed for genomic applications; however, the methods implemented are useful for many nongenomic applications as well. The response can be continuous (censored or not) or categorical (either binary or ordinal). The algorithm is based on a Gibbs sampler with scalar updates and the implementation takes advantage of efficient compiled C and Fortran routines. In this article we describe the methods implemented in BGLR, present examples of the use of the package, and discuss practical issues emerging in real-data analysis. Copyright © 2014 by the Genetics Society of America.
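BGLR itself is an R package; as a rough Python sketch of the scalar-update Gibbs strategy it describes, the following samples each regression coefficient in turn from its full conditional in a Bayesian ridge model (the residual and prior variances are held fixed here for simplicity, whereas BGLR samples them as well):

```python
import numpy as np

def gibbs_ridge(X, y, s2=1.0, t2=0.5, iters=200, seed=0):
    """Gibbs sampler with scalar updates for y = X b + e, e ~ N(0, s2 I),
    b_j ~ N(0, t2). Each coefficient is drawn from its Gaussian full
    conditional while the running residual is kept up to date."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    b = np.zeros(p)
    resid = y - X @ b
    draws = []
    for _ in range(iters):
        for j in range(p):
            xj = X[:, j]
            resid += xj * b[j]                      # remove j's contribution
            v = 1.0 / (xj @ xj / s2 + 1.0 / t2)     # conditional variance
            m = v * (xj @ resid) / s2               # conditional mean
            b[j] = rng.normal(m, np.sqrt(v))
            resid -= xj * b[j]                      # restore residual
        draws.append(b.copy())
    return np.mean(draws[iters // 2:], axis=0)      # posterior mean, post burn-in

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 10))
beta = np.array([2.0, -1.5] + [0.0] * 8)
y = X @ beta + rng.normal(scale=1.0, size=80)
b_hat = gibbs_ridge(X, y)
```

The scalar-update trick (adding a coefficient's contribution back into the residual before redrawing it) is what keeps each sweep cheap even when p is large.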
Eash, David A.; Barnes, Kimberlee K.
2017-01-01
A statewide study was conducted to develop regression equations for estimating six selected low-flow frequency statistics and harmonic mean flows for ungaged stream sites in Iowa. The estimation equations developed for the six low-flow frequency statistics include: the annual 1-, 7-, and 30-day mean low flows for a recurrence interval of 10 years, the annual 30-day mean low flow for a recurrence interval of 5 years, and the seasonal (October 1 through December 31) 1- and 7-day mean low flows for a recurrence interval of 10 years. Estimation equations also were developed for the harmonic-mean-flow statistic. Estimates of these seven selected statistics are provided for 208 U.S. Geological Survey continuous-record streamgages using data through September 30, 2006. The study area comprises streamgages located within Iowa and 50 miles beyond the State's borders. Because trend analyses indicated statistically significant positive trends when considering the entire period of record for the majority of the streamgages, the longest, most recent period of record without a significant trend was determined for each streamgage for use in the study. The median number of years of record used to compute each of these seven selected statistics was 35. Geographic information system software was used to measure 54 selected basin characteristics for each streamgage. Following the removal of two streamgages from the initial data set, data collected for 206 streamgages were compiled to investigate three approaches for regionalization of the seven selected statistics. Regionalization, a process using statistical regression analysis, provides a relation for efficiently transferring information from a group of streamgages in a region to ungaged sites in the region. The three regionalization approaches tested included statewide, regional, and region-of-influence regressions. 
For the regional regression, the study area was divided into three low-flow regions on the basis of hydrologic characteristics, landform regions, and soil regions. A comparison of root mean square errors and average standard errors of prediction for the statewide, regional, and region-of-influence regressions determined that the regional regression provided the best estimates of the seven selected statistics at ungaged sites in Iowa. Because a significant number of streams in Iowa reach zero flow as their minimum flow during low-flow years, four different types of regression analyses were used: left-censored, logistic, generalized-least-squares, and weighted-least-squares regression. A total of 192 streamgages were included in the development of 27 regression equations for the three low-flow regions. For the northeast and northwest regions, a censoring threshold was used to develop 12 left-censored regression equations to estimate the 6 low-flow frequency statistics for each region. For the southern region a total of 12 regression equations were developed; 6 logistic regression equations were developed to estimate the probability of zero flow for the 6 low-flow frequency statistics and 6 generalized least-squares regression equations were developed to estimate the 6 low-flow frequency statistics, if nonzero flow is estimated first by use of the logistic equations. A weighted-least-squares regression equation was developed for each region to estimate the harmonic-mean-flow statistic. Average standard errors of estimate for the left-censored equations for the northeast region range from 64.7 to 88.1 percent and for the northwest region range from 85.8 to 111.8 percent. Misclassification percentages for the logistic equations for the southern region range from 5.6 to 14.0 percent. 
Average standard errors of prediction for generalized least-squares equations for the southern region range from 71.7 to 98.9 percent and pseudo coefficients of determination for the generalized-least-squares equations range from 87.7 to 91.8 percent. Average standard errors of prediction for weighted-least-squares equations developed for estimating the harmonic-mean-flow statistic for each of the three regions range from 66.4 to 80.4 percent. The regression equations are applicable only to stream sites in Iowa with low flows not significantly affected by regulation, diversion, or urbanization and with basin characteristics within the range of those used to develop the equations. If the equations are used at ungaged sites on regulated streams, or on streams affected by water-supply and agricultural withdrawals, then the estimates will need to be adjusted by the amount of regulation or withdrawal to estimate the actual flow conditions if that is of interest. Caution is advised when applying the equations for basins with characteristics near the applicable limits of the equations and for basins located in karst topography. A test of two drainage-area ratio methods using 31 pairs of streamgages, for the annual 7-day mean low-flow statistic for a recurrence interval of 10 years, indicates a weighted drainage-area ratio method provides better estimates than regional regression equations for an ungaged site on a gaged stream in Iowa when the drainage-area ratio is between 0.5 and 1.4. These regression equations will be implemented within the U.S. Geological Survey StreamStats web-based geographic-information-system tool. StreamStats allows users to click on any ungaged site on a river and compute estimates of the seven selected statistics; in addition, 90-percent prediction intervals and the measured basin characteristics for the ungaged sites also are provided. 
StreamStats also allows users to click on any streamgage in Iowa and estimates computed for these seven selected statistics are provided for the streamgage.
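As a hedged illustration of the southern-region two-stage approach described above (a logistic equation for the probability of zero flow, followed by a back-transformed regression estimate when nonzero flow is predicted), the sketch below uses purely hypothetical coefficients and a single basin characteristic; the published Iowa equations use more variables and different values.

```python
import math

# Hypothetical coefficients for illustration only -- NOT the published
# Iowa equations.  B* parameterize a logistic model for the probability
# of zero flow; C* parameterize a log10-space regression for nonzero flow.
B0, B1 = 4.0, -1.8   # logistic: intercept, log10(drainage area) term
C0, C1 = -0.5, 1.1   # log10(flow): intercept, log10(drainage area) term

def prob_zero_flow(drainage_area_mi2):
    """Logistic-regression estimate of the probability of zero flow."""
    z = B0 + B1 * math.log10(drainage_area_mi2)
    return 1.0 / (1.0 + math.exp(-z))

def estimate_low_flow(drainage_area_mi2, threshold=0.5):
    """Two-stage estimate: zero if the estimated probability of zero flow
    exceeds the threshold, otherwise the back-transformed regression
    estimate (in the same flow units the regression was fit in)."""
    if prob_zero_flow(drainage_area_mi2) > threshold:
        return 0.0
    return 10 ** (C0 + C1 * math.log10(drainage_area_mi2))
```

With these made-up coefficients, a small basin is classified as going to zero flow while a large basin receives a positive regression estimate.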
NASA Astrophysics Data System (ADS)
Röpke, G.
2018-01-01
One of the fundamental problems in physics that are not yet rigorously solved is the statistical mechanics of nonequilibrium processes. An important contribution to describing irreversible behavior starting from reversible Hamiltonian dynamics was given by D. N. Zubarev, who invented the method of the nonequilibrium statistical operator. We discuss this approach, in particular, the extended von Neumann equation, and as an example consider the electrical conductivity of a system of charged particles. We consider the selection of the set of relevant observables. We show the relation between kinetic theory and linear response theory. Using thermodynamic Green's functions, we present a systematic treatment of correlation functions, but the convergence needs investigation. We compare different expressions for the conductivity and list open questions.
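For orientation, the extended von Neumann equation at the heart of Zubarev's method is commonly written with an infinitesimal source term that breaks time-reversal symmetry and selects the retarded solution:

```latex
\frac{\partial \rho_{\varepsilon}(t)}{\partial t}
  + \frac{i}{\hbar}\,\bigl[H,\rho_{\varepsilon}(t)\bigr]
  = -\,\varepsilon\,\bigl(\rho_{\varepsilon}(t) - \rho_{\mathrm{rel}}(t)\bigr),
\qquad \varepsilon \to +0 .
```

Here $\rho_{\mathrm{rel}}(t)$ is the relevant statistical operator constructed from the selected set of relevant observables by maximizing the information entropy, and the limit $\varepsilon \to +0$ is taken after the thermodynamic limit.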
The construction and assessment of a statistical model for the prediction of protein assay data.
Pittman, J; Sacks, J; Young, S Stanley
2002-01-01
The focus of this work is the development of a statistical model for a bioinformatics database whose distinctive structure makes model assessment an interesting and challenging problem. The key components of the statistical methodology, including a fast approximation to the singular value decomposition and the use of adaptive spline modeling and tree-based methods, are described, and preliminary results are presented. These results are shown to compare favorably to selected results achieved using comparative methods. An attempt to determine the predictive ability of the model through the use of cross-validation experiments is discussed. In conclusion, a synopsis of the results of these experiments and their implications for the analysis of bioinformatic databases in general is presented.
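The abstract does not specify which fast SVD approximation was used; as one common possibility, a randomized range finder yields a low-rank SVD at a fraction of the cost of the full decomposition. The sketch below illustrates that general technique, not the authors' implementation.

```python
import numpy as np

def randomized_svd(A, k, oversample=5, seed=0):
    """Approximate rank-k SVD via a randomized range finder.
    One common fast approximation; the paper's exact algorithm
    is not specified in the abstract."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Project onto a random subspace that captures the dominant range of A.
    omega = rng.standard_normal((n, k + oversample))
    Q, _ = np.linalg.qr(A @ omega)
    # Solve the small SVD problem in the reduced space, then lift back.
    B = Q.T @ A
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]
```

For a matrix of exact rank k, the approximation recovers the matrix essentially exactly; for noisy data it captures the leading singular subspace with high probability.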
Mathematics pre-service teachers’ statistical reasoning about meaning
NASA Astrophysics Data System (ADS)
Kristanto, Y. D.
2018-01-01
This article offers a descriptive qualitative analysis of three second-year pre-service teachers’ statistical reasoning about the mean. Twenty-six pre-service teachers were tested using an open-ended problem in which they were expected to analyze a method for finding the mean of a data set. Three of their test results were selected for analysis. The results suggest that the pre-service teachers did not use context to develop their interpretation of the mean. Therefore, this article also offers strategies for promoting statistical reasoning about the mean that draw on various contexts.
Statistical Analysis of Adaptive Beam-Forming Methods
1988-05-01
minimum amount of computing resources? * What are the tradeoffs being made when a system design selects block averaging over exponential averaging? Will...understood by many signal processing practitioners, however, is how system parameters and the number of sensors effect the distribution of the... system performance improve and if so by how much? b " It is well known that the noise sampled at adjacent sensors is not statistically independent
Poppe, L.J.; Eliason, A.H.; Hastings, M.E.
2004-01-01
Measures that describe and summarize sediment grain-size distributions are important to geologists because of the large amount of information contained in textural data sets. Statistical methods are usually employed to simplify the necessary comparisons among samples and quantify the observed differences. The two statistical methods most commonly used by sedimentologists to describe particle distributions are mathematical moments (Krumbein and Pettijohn, 1938) and inclusive graphics (Folk, 1974). The choice of which of these statistical measures to use is typically governed by the amount of data available (Royse, 1970). If the entire distribution is known, the method of moments may be used; if the next-to-last accumulated percent is greater than 95, inclusive graphics statistics can be generated. Unfortunately, earlier programs designed to describe sediment grain-size distributions statistically do not run in a Windows environment, do not allow extrapolation of the distribution's tails, or do not generate both moment and graphic statistics (Kane and Hubert, 1963; Collias et al., 1963; Schlee and Webster, 1967; Poppe et al., 2000). Owing to analytical limitations, electro-resistance multichannel particle-size analyzers, such as Coulter Counters, commonly truncate the tails of the fine-fraction part of grain-size distributions. These devices do not detect fine clay in the 0.6–0.1 μm range (part of the 11-phi and all of the 12-phi and 13-phi fractions). Although size analyses performed down to 0.6 μm are adequate for most freshwater and nearshore marine sediments, samples from many deeper water marine environments (e.g., rise and abyssal plain) may contain significant material in the fine clay fraction, and these analyses benefit from extrapolation. The program (GSSTAT) described herein generates statistics to characterize sediment grain-size distributions and can extrapolate the fine-grained end of the particle distribution.
It is written in Microsoft Visual Basic 6.0 and provides a window to facilitate program execution. The input for the sediment fractions is weight percentages in whole-phi notation (Krumbein, 1934; Inman, 1952), and the program permits the user to select output in either method of moments or inclusive graphics statistics (Fig. 1). Users select options primarily with mouse-click events, or through interactive dialogue boxes.
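The method-of-moments statistics that such a program reports can be sketched as follows for a distribution given as weight percentages at whole-phi class midpoints; this minimal version omits GSSTAT's tail extrapolation and inclusive-graphics output.

```python
def moment_statistics(midpoints_phi, weight_pct):
    """Method-of-moments grain-size statistics in phi units (mean,
    sorting, skewness, kurtosis) from class midpoints and weight
    percentages, following the general moments approach of
    Krumbein and Pettijohn."""
    total = sum(weight_pct)
    mean = sum(f * m for f, m in zip(weight_pct, midpoints_phi)) / total
    var = sum(f * (m - mean) ** 2
              for f, m in zip(weight_pct, midpoints_phi)) / total
    std = var ** 0.5  # "sorting"
    skew = sum(f * (m - mean) ** 3
               for f, m in zip(weight_pct, midpoints_phi)) / (total * std ** 3)
    kurt = sum(f * (m - mean) ** 4
               for f, m in zip(weight_pct, midpoints_phi)) / (total * var ** 2)
    return mean, std, skew, kurt
```

A symmetric three-class example (25/50/25 percent at 1, 2, and 3 phi) has a mean of 2 phi and zero skewness.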
Comparison of Housing Construction Development in Selected Regions of Central Europe
NASA Astrophysics Data System (ADS)
Dvorský, Ján; Petráková, Zora; Hollý, Ján
2017-12-01
In fast-growing countries, the economic growth that came after the global financial crisis ought to be manifested in the development of housing policy. The development of a region is directly related to the increase in the quality of living of its inhabitants, and housing construction and its relation to the availability of housing is a key issue for the population overall. Comparison of its development in selected regions is important for experts in the field of construction, mayors of the regions, and the state, but especially for the inhabitants themselves. The aim of the article is to compare the number of new dwellings with building permits and completed dwellings with final building approval between selected regions by using a method of mathematical statistics, analysis of variance. The article also uses the tools of descriptive statistics, such as a point graph, a graph of deviations from the average, and basic statistical characteristics of central tendency and variability. Qualitative factors influencing the construction of flats, as well as the causes of quantitative differences in the numbers of started and completed apartments in selected regions of Central Europe, are the subjects of the article’s conclusions.
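The core of such a comparison, a one-way analysis of variance across regions, reduces to the F statistic below (between-group over within-group mean squares); the dwelling counts in the test are hypothetical, not the article's data.

```python
def one_way_anova_F(groups):
    """One-way ANOVA F statistic for comparing group means, e.g.
    annual dwelling counts across regions.  groups is a list of
    lists of observations; requires at least two groups and
    nonzero within-group variation."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their group mean.
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

Identical groups give F = 0; well-separated groups give a large F, which is then compared to the F distribution with (k-1, n-k) degrees of freedom.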
Methods for estimating flow-duration and annual mean-flow statistics for ungaged streams in Oklahoma
Esralew, Rachel A.; Smith, S. Jerrod
2010-01-01
Flow statistics can be used to provide decision makers with surface-water information needed for activities such as water-supply permitting, flow regulation, and other water rights issues. Flow statistics could be needed at any location along a stream. Most often, streamflow statistics are needed at ungaged sites, where no flow data are available to compute the statistics. Methods are presented in this report for estimating flow-duration and annual mean-flow statistics for ungaged streams in Oklahoma. Flow statistics included the (1) annual (period of record), (2) seasonal (summer-autumn and winter-spring), and (3) 12 monthly duration statistics, including the 20th, 50th, 80th, 90th, and 95th percentile flow exceedances, and the annual mean-flow (mean of daily flows for the period of record). Flow statistics were calculated from daily streamflow information collected from 235 streamflow-gaging stations throughout Oklahoma and areas in adjacent states. A drainage-area ratio method is the preferred method for estimating flow statistics at an ungaged location that is on a stream near a gage. The method generally is reliable only if the drainage-area ratio of the two sites is between 0.5 and 1.5. Regression equations that relate flow statistics to drainage-basin characteristics were developed for the purpose of estimating selected flow-duration and annual mean-flow statistics for ungaged streams that are not near gaging stations on the same stream. Regression equations were developed from flow statistics and drainage-basin characteristics for 113 unregulated gaging stations. Separate regression equations were developed by using U.S. Geological Survey streamflow-gaging stations in regions with similar drainage-basin characteristics. These equations can increase the accuracy of regression equations used for estimating flow-duration and annual mean-flow statistics at ungaged stream locations in Oklahoma. 
Streamflow-gaging stations were grouped by selected drainage-basin characteristics by using a k-means cluster analysis. Three regions were identified for Oklahoma on the basis of the clustering of gaging stations and a manual delineation of distinguishable hydrologic and geologic boundaries: Region 1 (western Oklahoma excluding the Oklahoma and Texas Panhandles), Region 2 (north- and south-central Oklahoma), and Region 3 (eastern and central Oklahoma). A total of 228 regression equations (225 flow-duration regressions and three annual mean-flow regressions) were developed using ordinary least-squares and left-censored (Tobit) multiple-regression techniques. These equations can be used to estimate 75 flow-duration statistics and annual mean-flow for ungaged streams in the three regions. Drainage-basin characteristics that were statistically significant independent variables in the regression analyses were (1) contributing drainage area; (2) station elevation; (3) mean drainage-basin elevation; (4) channel slope; (5) percentage of forested canopy; (6) mean drainage-basin hillslope; (7) soil permeability; and (8) mean annual, seasonal, and monthly precipitation. The accuracy of flow-duration regression equations generally decreased from high-flow exceedance (low-exceedance probability) to low-flow exceedance (high-exceedance probability). This decrease may have occurred because a greater uncertainty exists for low-flow estimates and low flow is largely affected by localized geology that was not quantified by the drainage-basin characteristics selected. The standard errors of estimate of regression equations for Region 1 (western Oklahoma) were substantially larger than those standard errors for other regions, especially for low-flow exceedances. These errors may be a result of greater variability in low flow because of increased irrigation activities in this region.
Regression equations may not be reliable for sites where the drainage-basin characteristics are outside the range of values of the independent variables used to develop the equations.
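The drainage-area ratio method recommended for ungaged sites on gaged streams can be sketched as follows; the range check reflects this report's 0.5–1.5 ratio guidance, and exponent=1.0 gives the plain ratio method (a fitted exponent would give a weighted variant). Function and argument names are illustrative.

```python
def drainage_area_ratio_estimate(q_gaged, area_gaged, area_ungaged,
                                 exponent=1.0):
    """Transfer a flow statistic from a gage to an ungaged site on the
    same stream by scaling with the drainage-area ratio.  The report
    considers the method reliable only when the ratio lies between
    0.5 and 1.5."""
    ratio = area_ungaged / area_gaged
    if not 0.5 <= ratio <= 1.5:
        raise ValueError(
            "drainage-area ratio %.2f is outside the reliable range" % ratio)
    return q_gaged * ratio ** exponent
```

For example, an ungaged basin with half the drainage area of the gage receives half the gaged flow statistic under the plain ratio method.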
Testing for X-Ray–SZ Differences and Redshift Evolution in the X-Ray Morphology of Galaxy Clusters
DOE Office of Scientific and Technical Information (OSTI.GOV)
Nurgaliev, D.; McDonald, M.; Benson, B. A.
2017-05-16
We present a quantitative study of the X-ray morphology of galaxy clusters, as a function of their detection method and redshift. We analyze two separate samples of galaxy clusters: a sample of 36 clusters at $0.35 < z < 0.9$ selected in the X-ray with the ROSAT PSPC 400 deg² survey, and a sample of 90 clusters at $0.25 < z < 1.2$ selected via the Sunyaev–Zel’dovich (SZ) effect with the South Pole Telescope. Clusters from both samples have similar-quality Chandra observations, which allow us to quantify their X-ray morphologies via two distinct methods: centroid shifts (w) and photon asymmetry ($A_{\mathrm{phot}}$). The latter technique provides nearly unbiased morphology estimates for clusters spanning a broad range of redshift and data quality. We further compare the X-ray morphologies of X-ray- and SZ-selected clusters with those of simulated clusters. We do not find a statistically significant difference in the measured X-ray morphology of X-ray- and SZ-selected clusters over the redshift range probed by these samples, suggesting that the two are probing similar populations of clusters. We find that the X-ray morphologies of simulated clusters are statistically indistinguishable from those of X-ray- or SZ-selected clusters, implying that the most important physics for dictating the large-scale gas morphology (outside of the core) is well-approximated in these simulations. Finally, we find no statistically significant redshift evolution in the X-ray morphology (both for observed and simulated clusters), over the range of $z \sim 0.3$ to $z \sim 1$, seemingly in contradiction with the redshift-dependent halo merger rate predicted by simulations.
Wang, Jie; Feng, Zuren; Lu, Na; Luo, Jing
2018-06-01
Feature selection plays an important role in motor imagery pattern classification based on EEG signals. It is a process that aims to select an optimal feature subset from the original set, with two significant advantages: lowering the computational burden so as to speed up the learning procedure, and removing redundant and irrelevant features so as to improve classification performance. Feature selection is therefore widely employed in the classification of EEG signals in practical brain-computer interface systems. In this paper, we present a novel statistical model that selects the optimal feature subset based on the Kullback-Leibler divergence measure and automatically selects the optimal subject-specific time segment. The proposed method comprises four successive stages: broad frequency-band filtering and common spatial pattern enhancement as preprocessing; feature extraction by autoregressive modeling and log-variance; Kullback-Leibler divergence-based selection of the optimal features and time segment; and linear discriminant analysis classification. More importantly, this paper provides a potential framework for combining other feature extraction models and classification algorithms with the proposed method for EEG signal classification. Experiments on single-trial EEG signals from two public competition datasets not only demonstrate that the proposed method is effective in selecting discriminative features and time segments, but also show that it yields relatively better classification results in comparison with other competitive methods.
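A generic sketch of divergence-based feature ranking is shown below: each feature is scored by the symmetrised Kullback-Leibler divergence between Gaussian fits to the two classes, and higher scores indicate more discriminative features. This illustrates the general idea only; the paper's exact criterion, preprocessing, and time-segment search are not reproduced.

```python
import math

def gaussian_kl(mu0, var0, mu1, var1):
    """KL(N0 || N1) between two univariate Gaussians."""
    return 0.5 * (math.log(var1 / var0)
                  + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

def rank_features_by_kl(class0, class1):
    """Rank feature indices by symmetrised KL divergence between
    per-class Gaussian fits (most discriminative first).  class0 and
    class1 are lists of feature vectors, one vector per trial."""
    scores = []
    for f0, f1 in zip(zip(*class0), zip(*class1)):
        mu0 = sum(f0) / len(f0); mu1 = sum(f1) / len(f1)
        var0 = sum((x - mu0) ** 2 for x in f0) / len(f0) or 1e-12
        var1 = sum((x - mu1) ** 2 for x in f1) / len(f1) or 1e-12
        scores.append(gaussian_kl(mu0, var0, mu1, var1)
                      + gaussian_kl(mu1, var1, mu0, var0))
    return sorted(range(len(scores)), key=lambda i: -scores[i])
```

A feature whose class means are well separated relative to its variances is ranked ahead of one whose distributions overlap.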
Morphometrical study on senile larynx.
Zieliński, R
2001-01-01
The aim of the study was a morphometrical macroscopic evaluation of senile larynges with regard to its usefulness for ORL diagnostic and operative methods. Larynx preparations were taken from cadavers of both sexes, aged 65 and over, about 24 hours after death. Clinically important laryngeal diameters were collected using common morphometrical methods, and several body features were also gathered. Computer statistical methods were used in the data assessment, including basic statistics and linear correlations between diameters, and between diameters and body features. The data presented in the study may be very helpful in the evaluation of diagnostic methods. They may also help in selecting the right sizes of operative tools, choosing the most appropriate operative technique, planning preoperative preparations, and designing and building virtual and plastic models for physicians' training.
Wiuf, Carsten; Schaumburg-Müller Pallesen, Jonatan; Foldager, Leslie; Grove, Jakob
2016-08-01
In many areas of science it is customary to perform many, potentially millions, of tests simultaneously. To gain statistical power it is common to group tests based on a priori criteria such as predefined regions or sliding windows. However, it is not straightforward to choose grouping criteria, and the results might depend on the chosen criteria. Methods that summarize, or aggregate, test statistics or p-values without relying on a priori criteria are therefore desirable. We present a simple method to aggregate a sequence of stochastic variables, such as test statistics or p-values, into fewer variables without assuming a priori defined groups. We provide different ways to evaluate the significance of the aggregated variables based on theoretical considerations and resampling techniques, and show that under certain assumptions the family-wise error rate (FWER) is controlled in the strong sense. Validity of the method was demonstrated using simulations and real data analyses. Our method may be a useful supplement to standard procedures relying on evaluation of test statistics individually. Moreover, by being agnostic and not relying on predefined selected regions, it might be a practical alternative to conventionally used methods of aggregation of p-values over regions. The method is implemented in Python and freely available online (through GitHub; see the Supplementary information).
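The paper's aggregation scheme is agnostic to predefined groups; as a point of contrast, the conventional window-based aggregation it supplements can be sketched with a standard combination rule. Stouffer's Z method is used here purely as an illustrative baseline, not as the paper's procedure.

```python
from statistics import NormalDist

_N = NormalDist()

def stouffer_combine(pvalues):
    """Combine a block of p-values into one via Stouffer's Z method:
    transform to z-scores, average with a 1/sqrt(k) scaling, and map
    back to a combined p-value."""
    z = sum(_N.inv_cdf(1.0 - p) for p in pvalues) / len(pvalues) ** 0.5
    return 1.0 - _N.cdf(z)

def aggregate_windows(pvalues, width):
    """Aggregate a sequence of p-values over non-overlapping windows
    of a predefined width -- the a priori grouping the paper avoids."""
    return [stouffer_combine(pvalues[i:i + width])
            for i in range(0, len(pvalues), width)]
```

Two uninformative p-values of 0.5 combine back to 0.5, while two concordant small p-values combine to something smaller than either.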
Trends in study design and the statistical methods employed in a leading general medicine journal.
Gosho, M; Sato, Y; Nagashima, K; Takahashi, S
2018-02-01
Study design and statistical methods have become core components of medical research, and the methodology has become more multifaceted and complicated over time. The study of the comprehensive details and current trends of study design and statistical methods is required to support the future implementation of well-planned clinical studies providing information about evidence-based medicine. Our purpose was to illustrate the study designs and statistical methods employed in recent medical literature. This was an extension study of Sato et al. (N Engl J Med 2017; 376: 1086-1087), which reviewed 238 articles published in 2015 in the New England Journal of Medicine (NEJM) and briefly summarized the statistical methods employed in NEJM. Using the same database, we performed a new investigation of the detailed trends in study design and individual statistical methods that were not reported in the Sato study. In accordance with the CONSORT statement, prespecification and justification of sample size are obligatory in planning intervention studies. Although standard survival methods (e.g., the Kaplan-Meier estimator and Cox regression model) were most frequently applied, the Gray test and the Fine-Gray proportional hazards model, which account for competing risks, were sometimes used for a more valid statistical inference. With respect to handling missing data, model-based methods, which are valid for missing-at-random data, were more frequently used than single imputation methods. Single imputation methods are not recommended for a primary analysis, but they have been applied in many clinical trials. Group sequential designs with interim analyses were one of the standard designs, and novel designs, such as adaptive dose selection and sample size re-estimation, were sometimes employed in NEJM. In light of the information found in some publications, model-based approaches for handling missing data should replace single imputation methods for primary analysis. Use of adaptive designs with interim analyses has been increasing since the release of the FDA guidance on adaptive design.
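As an illustration of the standard survival machinery mentioned above, a minimal Kaplan-Meier (product-limit) estimator for right-censored data can be written as follows; real analyses would add confidence intervals and tie handling conventions.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimates S(t) at each observed event time.
    times: follow-up times; events: 1 = event occurred, 0 = censored.
    At tied times, events are processed before censorings."""
    data = sorted(zip(times, events), key=lambda te: (te[0], -te[1]))
    n_at_risk = len(data)
    surv, s = [], 1.0
    for t, e in data:
        if e:
            # Multiply in the conditional survival probability at this time.
            s *= 1.0 - 1.0 / n_at_risk
            surv.append((t, s))
        n_at_risk -= 1  # this subject leaves the risk set either way
    return surv
```

For times [1, 2, 3, 4] with the third observation censored, the survival curve steps through 0.75, 0.50, and finally 0.0 at the last event.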
ERIC Educational Resources Information Center
Egberink, Iris J. L.; Meijer, Rob R.; Tendeiro, Jorge N.
2015-01-01
A popular method to assess measurement invariance of a particular item is based on likelihood ratio tests with all other items as anchor items. The results of this method are often only reported in terms of statistical significance, and researchers proposed different methods to empirically select anchor items. It is unclear, however, how many…
Sun, Hokeun; Wang, Shuang
2014-08-15
Existing association methods for rare variants from sequencing data have focused on aggregating variants in a gene or a genetic region because analysing individual rare variants is underpowered. However, these existing rare variant detection methods are not able to identify which rare variants in a gene or a genetic region are associated with the complex diseases or traits. Once phenotypic associations of a gene or a genetic region are identified, the natural next step in an association study with sequencing data is to locate the susceptible rare variants within the gene or the genetic region. In this article, we propose a power set-based statistical selection procedure that is able to identify the locations of the potentially susceptible rare variants within a disease-related gene or a genetic region. The selection performance of the proposed procedure was evaluated through simulation studies, where we demonstrated its feasibility and superior power over several comparable existing methods. In particular, the proposed method is able to handle mixed effects when both risk and protective variants are present in a gene or a genetic region. The proposed selection procedure was also applied to the sequence data on the ANGPTL gene family from the Dallas Heart Study to identify potentially susceptible rare variants within the trait-related genes. An R package 'rvsel' can be downloaded from http://www.columbia.edu/∼sw2206/ and http://statsun.pusan.ac.kr.
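The power-set idea, scoring every subset of variants in a region and selecting the best, can be sketched as below. The separation score used here (absolute difference in mean collapsed burden between cases and controls) is a placeholder for the authors' test statistic, and the real procedure additionally accounts for the selection over subsets.

```python
from itertools import combinations

def best_variant_subset(geno_cases, geno_controls):
    """Enumerate all non-empty subsets of variant columns and select the
    one whose collapsed burden best separates cases from controls.
    Genotypes are rows of 0/1 carrier indicators, one row per subject.
    Exhaustive enumeration is only feasible for small variant counts."""
    n_var = len(geno_cases[0])

    def mean_burden(rows, subset):
        return sum(sum(r[j] for j in subset) for r in rows) / len(rows)

    best, best_score = None, -1.0
    for k in range(1, n_var + 1):
        for subset in combinations(range(n_var), k):
            score = abs(mean_burden(geno_cases, subset)
                        - mean_burden(geno_controls, subset))
            # Strict improvement favors smaller subsets on ties,
            # so neutral variants drop out of the selection.
            if score > best_score + 1e-12:
                best, best_score = subset, score
    return best, best_score
```

In a toy region where only the first variant is enriched in cases, the procedure returns that single-variant subset.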
Evaluation of redundancy analysis to identify signatures of local adaptation.
Capblancq, Thibaut; Luu, Keurcien; Blum, Michael G B; Bazin, Eric
2018-05-26
Ordination is a common tool in ecology that aims at representing complex biological information in a reduced space. In landscape genetics, ordination methods such as principal component analysis (PCA) have been used to detect adaptive variation based on genomic data. Taking advantage of environmental data in addition to genotype data, redundancy analysis (RDA) is another ordination approach that is useful for detecting adaptive variation. This paper proposes a test statistic based on RDA to search for loci under selection. We compare redundancy analysis to pcadapt, which is a nonconstrained ordination method, and to a latent factor mixed model (LFMM), which is a univariate genotype-environment association method. Individual-based simulations identify evolutionary scenarios where RDA genome scans have greater statistical power than genome scans based on PCA. By constraining the analysis with environmental variables, RDA performs better than PCA in identifying adaptive variation when selection gradients are weakly correlated with population structure. Additionally, we show that although RDA and LFMM have similar power to identify genetic markers associated with environmental variables, the RDA-based procedure has the advantage of identifying the main selective gradients as combinations of environmental variables. To give a concrete illustration of RDA in population genomics, we apply the method to the detection of outliers and selective gradients in an SNP data set of Populus trichocarpa (Geraldes et al., 2013). The RDA-based approach identifies a main selective gradient contrasting southern and coastal populations with northern and continental populations on the northwestern American coast.
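A minimal version of the RDA recipe — regress the genotype matrix on the environmental predictors, then decompose the fitted values — can be sketched with NumPy. This is the generic constrained-ordination construction, not the authors' exact pipeline or their test statistic.

```python
import numpy as np

def rda_axes(Y, X, n_axes=1):
    """Minimal redundancy analysis: least-squares fit of genotypes Y
    (individuals x loci) on environmental predictors X (individuals x
    predictors), followed by an SVD of the fitted values.  Returns
    individual scores and locus loadings on the constrained axes."""
    Yc = Y - Y.mean(axis=0)
    Xc = X - X.mean(axis=0)
    beta, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)
    fitted = Xc @ beta                      # environmentally explained part of Y
    U, s, Vt = np.linalg.svd(fitted, full_matrices=False)
    return U[:, :n_axes] * s[:n_axes], Vt[:n_axes].T  # scores, loadings

# Loci whose loadings sit many standard deviations from the mean loading
# on a constrained axis are outlier candidates (a z-score cutoff is one
# common choice for flagging them).
```

In a toy data set where only the first locus tracks the environmental gradient, that locus dominates the loadings of the first constrained axis.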
Evaluation of cancer mortality in a cohort of workers exposed to low-level radiation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lea, C.S.
1995-12-01
The purpose of this dissertation was to re-analyze existing data to explore methodologic approaches that may determine whether excess cancer mortality in the ORNL cohort can be explained by time-related factors not previously considered, the grouping of cancer outcomes, selection bias due to the method chosen to incorporate an empirical induction period, or the type of statistical model chosen.
Chen, Yifei; Sun, Yuxing; Han, Bing-Qing
2015-01-01
Protein interaction article classification is a text classification task in the biological domain to determine which articles describe protein-protein interactions. Since the feature space in text classification is high-dimensional, feature selection is widely used to reduce the dimensionality of features and speed up computation without sacrificing classification performance. Many existing feature selection methods are based on the statistical measures of document frequency and term frequency. One potential drawback of these methods is that they treat features separately. Hence, we first design a similarity measure between the context information to take word co-occurrences and phrase chunks around the features into account. We then introduce the similarity of context information into the importance measure of the features, substituting for document and term frequency, and thereby propose new context similarity-based feature selection methods. Their performance is evaluated on two protein interaction article collections and compared against the frequency-based methods. The experimental results reveal that the context similarity-based methods perform better in terms of the F1 measure and the dimension reduction rate. Benefiting from the context information surrounding the features, the proposed methods can select distinctive features effectively for protein interaction article classification.
A Developed Meta-model for Selection of Cotton Fabrics Using Design of Experiments and TOPSIS Method
NASA Astrophysics Data System (ADS)
Chakraborty, Shankar; Chatterjee, Prasenjit
2017-12-01
Selection of cotton fabrics for providing optimal clothing comfort is often considered a multi-criteria decision making problem, consisting of an array of candidate alternatives to be evaluated on the basis of several conflicting properties. In this paper, design of experiments and the technique for order preference by similarity to ideal solution (TOPSIS) are integrated so as to develop regression meta-models for identifying the most suitable cotton fabrics with respect to the computed TOPSIS scores. The applicability of the adopted method is demonstrated using two real-life examples. The developed models can also identify the statistically significant fabric properties, and their interactions, affecting the measured TOPSIS scores and final selection decisions. There exists a good degree of congruence between the ranking patterns derived using these meta-models and the existing methods for cotton fabric ranking and subsequent selection.
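The TOPSIS scoring underlying such meta-models follows a standard recipe: vector-normalise the decision matrix, weight it, and rank alternatives by relative closeness to the ideal solution. A minimal sketch, with illustrative rather than paper-specific data in the test:

```python
import numpy as np

def topsis(matrix, weights, benefit):
    """TOPSIS closeness scores (higher = better).  matrix is
    alternatives x criteria; benefit[j] is True if criterion j is to
    be maximised (e.g. air permeability) and False if minimised."""
    M = np.asarray(matrix, dtype=float)
    w = np.asarray(weights, dtype=float)
    V = (M / np.linalg.norm(M, axis=0)) * w   # normalise, then weight
    ideal = np.where(benefit, V.max(axis=0), V.min(axis=0))
    anti = np.where(benefit, V.min(axis=0), V.max(axis=0))
    d_pos = np.linalg.norm(V - ideal, axis=1)  # distance to ideal
    d_neg = np.linalg.norm(V - anti, axis=1)   # distance to anti-ideal
    return d_neg / (d_pos + d_neg)
```

An alternative that is best on every criterion coincides with the ideal solution and receives a closeness score of 1.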
Bayesian Group Bridge for Bi-level Variable Selection.
Mallick, Himel; Yi, Nengjun
2017-06-01
A Bayesian bi-level variable selection method (BAGB: Bayesian Analysis of Group Bridge) is developed for regularized regression and classification. This new development is motivated by grouped data, where generic variables can be divided into multiple groups, with variables in the same group being mechanistically related or statistically correlated. As an alternative to frequentist group variable selection methods, BAGB incorporates structural information among predictors through a group-wise shrinkage prior. Posterior computation proceeds via an efficient MCMC algorithm. In addition to the usual ease-of-interpretation of hierarchical linear models, the Bayesian formulation produces valid standard errors, a feature that is notably absent in the frequentist framework. Empirical evidence of the attractiveness of the method is illustrated by extensive Monte Carlo simulations and real data analysis. Finally, several extensions of this new approach are presented, providing a unified framework for bi-level variable selection in general models with flexible penalties.
Simultaneous Spectrophotometric Estimation of Nitazoxanide and Ofloxacin in Tablets
Game, Madhuri D.; Sakarkar, D. M.
2011-01-01
Two simple, accurate and precise spectrophotometric methods have been developed for the simultaneous determination of nitazoxanide and ofloxacin in tablets. Method I is a Q-absorbance ratio method, which involves absorbance measurements at the isosbestic point (306.25 nm) and at the λmax of nitazoxanide (347.5 nm), while Method II is a two-wavelength method, in which 244.6 nm and 273.0 nm were selected as λ1 and λ2 for the determination of nitazoxanide, and 294.3 nm and 388.1 nm were selected as λ3 and λ4 for the determination of ofloxacin. Both drugs obeyed Beer's law in the concentration range 2-30 μg/ml (correlation coefficient r² < 1). Both methods were validated statistically, and recovery studies were carried out to confirm their accuracy. A commercial tablet formulation was successfully analyzed using the developed methods. PMID:22131624
Biological evolution and statistical physics
NASA Astrophysics Data System (ADS)
Drossel, Barbara
2001-03-01
This review is an introduction to theoretical models and mathematical calculations for biological evolution, aimed at physicists. The methods in the field are naturally very similar to those used in statistical physics, although the majority of publications have appeared in biology journals. The review has three parts, which can be read independently. The first part deals with evolution in fitness landscapes and includes Fisher's theorem, adaptive walks, quasispecies models, effects of finite population sizes, and neutral evolution. The second part studies models of coevolution, including evolutionary game theory, kin selection, group selection, sexual selection, speciation, and coevolution of hosts and parasites. The third part discusses models for networks of interacting species and their extinction avalanches. Throughout the review, attention is paid to giving the necessary biological information, and to pointing out the assumptions underlying the models, and their limits of validity.
A Comparison of Analytical and Data Preprocessing Methods for Spectral Fingerprinting
LUTHRIA, DEVANAND L.; MUKHOPADHYAY, SUDARSAN; LIN, LONG-ZE; HARNLY, JAMES M.
2013-01-01
Spectral fingerprinting, as a method of discriminating between plant cultivars and growing treatments for a common set of broccoli samples, was compared for six analytical instruments. Spectra were acquired for finely powdered solid samples using Fourier transform infrared (FT-IR) and Fourier transform near-infrared (NIR) spectrometry. Spectra were also acquired for unfractionated aqueous methanol extracts of the powders using molecular absorption in the ultraviolet (UV) and visible (VIS) regions and mass spectrometry with negative (MS−) and positive (MS+) ionization. The spectra were analyzed using nested one-way analysis of variance (ANOVA) and principal component analysis (PCA) to statistically evaluate the quality of discrimination. All six methods showed statistically significant differences between the cultivars and treatments. The significance of the statistical tests was improved by the judicious selection of spectral regions (IR and NIR), masses (MS+ and MS−), and derivatives (IR, NIR, UV, and VIS). PMID:21352644
Vomweg, T W; Buscema, M; Kauczor, H U; Teifke, A; Intraligi, M; Terzi, S; Heussel, C P; Achenbach, T; Rieker, O; Mayer, D; Thelen, M
2003-09-01
The aim of this study was to evaluate the capability of improved artificial neural networks (ANN) and additional novel training methods in distinguishing between benign and malignant breast lesions in contrast-enhanced magnetic resonance mammography (MRM). A total of 604 histologically proven cases of contrast-enhanced lesions of the female breast at MRI were analyzed. Morphological, dynamic and clinical parameters were collected and stored in a database. The data set was divided into several groups using random or experimental methods [the Training & Testing (T&T) algorithm] to train and test different ANNs. An additional novel computer program for input variable selection was applied. Sensitivity and specificity were calculated and compared with a statistical method and an expert radiologist. After optimization of the distribution of cases among the training and testing sets by the T&T algorithm, and the reduction of input variables by the input selection procedure, a highly sophisticated ANN achieved a sensitivity of 93.6% and a specificity of 91.9% in predicting malignancy of lesions within an independent prediction sample set. The best statistical method reached a sensitivity of 90.5% and a specificity of 68.9%. An expert radiologist performed better than the statistical method but worse than the ANN (sensitivity 92.1%, specificity 85.6%). Features extracted from dynamic contrast-enhanced MRM and additional clinical data can be successfully analyzed by advanced ANNs. The quality of the resulting network strongly depends on the training methods, which are improved by the use of novel training tools. The best results of an improved ANN outperform expert radiologists.
NASA Astrophysics Data System (ADS)
Lazoglou, Georgia; Anagnostopoulou, Christina; Tolika, Konstantia; Kolyva-Machera, Fotini
2018-04-01
The increasing trend in the intensity and frequency of temperature and precipitation extremes during the past decades has substantial environmental and socioeconomic impacts. Thus, the objective of the present study is the comparison of several statistical methods of extreme value theory (EVT) in order to identify which is the most appropriate to analyze the behavior of extreme precipitation, and high and low temperature events, in the Mediterranean region. Extreme values were selected using both the block maxima and the peaks-over-threshold (POT) techniques, and consequently both the generalized extreme value (GEV) and generalized Pareto distributions (GPDs) were used to fit them. The results were compared in order to select the most appropriate distribution for characterizing extremes. Moreover, this study evaluates maximum likelihood estimation, L-moments, and the Bayesian method, based on both graphical and statistical goodness-of-fit tests. It was revealed that the GPD can accurately characterize both precipitation and temperature extreme events. Additionally, the GEV distribution with the Bayesian method proved appropriate, especially for the greatest extreme values. Another important objective of this investigation was the estimation of precipitation and temperature return levels for three return periods (50, 100, and 150 years), classifying the data into groups with similar characteristics. Finally, the return level values were estimated with both GEV and GPD and with the three different estimation methods, revealing that the selected method can affect the return level values for both precipitation and temperature.
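Once GEV parameters have been fitted (by any of the three estimation methods), the T-period return level follows from the GEV quantile function. A minimal sketch, with hypothetical parameter values rather than the study's Mediterranean fits:

```python
import math

# Return-level sketch for a fitted GEV(mu, sigma, xi): the T-year return
# level is the quantile exceeded on average once every T years.
# The parameter values below are illustrative assumptions.

def gev_return_level(mu, sigma, xi, T):
    """T-period return level of a GEV distribution (requires xi != 0)."""
    y = -math.log(1.0 - 1.0 / T)            # reduced variate
    return mu - (sigma / xi) * (1.0 - y ** (-xi))

# e.g. annual precipitation maxima fitted as GEV(40 mm, 10 mm, 0.1),
# evaluated at the study's three return periods:
for T in (50, 100, 150):
    print(T, round(gev_return_level(40.0, 10.0, 0.1, T), 1))
```

A positive shape parameter xi makes the return level grow without bound as T increases, which is why the choice of estimation method can visibly shift the longest-period return levels.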
GAPIT: genome association and prediction integrated tool.
Lipka, Alexander E; Tian, Feng; Wang, Qishan; Peiffer, Jason; Li, Meng; Bradbury, Peter J; Gore, Michael A; Buckler, Edward S; Zhang, Zhiwu
2012-09-15
Software programs that conduct genome-wide association studies and genomic prediction and selection need to use methodologies that maximize statistical power, provide high prediction accuracy and run in a computationally efficient manner. We developed an R package called Genome Association and Prediction Integrated Tool (GAPIT) that implements advanced statistical methods including the compressed mixed linear model (CMLM) and CMLM-based genomic prediction and selection. The GAPIT package can handle large datasets in excess of 10,000 individuals and 1 million single-nucleotide polymorphisms with minimal computational time, while providing user-friendly access and concise tables and graphs to interpret results. Availability: http://www.maizegenetics.net/GAPIT. Contact: zhiwu.zhang@cornell.edu. Supplementary data are available at Bioinformatics online.
Pereira, Tiago Veiga; Rudnicki, Martina; Pereira, Alexandre Costa; Pombo-de-Oliveira, Maria S; Franco, Rendrik França
2006-01-01
Meta-analysis has become an important statistical tool in genetic association studies, since it may provide more powerful and precise estimates. However, meta-analytic studies are prone to several potential biases, not only because of the preferential publication of "positive" studies but also due to difficulties in obtaining all relevant information during the study selection process. In this letter, we point out major problems in meta-analysis that may lead to biased conclusions, illustrated with an empirical example of two recent meta-analyses on the relation between MTHFR polymorphisms and risk of acute lymphoblastic leukemia that, despite the similarity in statistical methods and period of study selection, provided partially conflicting results.
An evaluation of flow-stratified sampling for estimating suspended sediment loads
Robert B. Thomas; Jack Lewis
1995-01-01
Flow-stratified sampling is a new method for sampling water quality constituents such as suspended sediment to estimate loads. As with selection-at-list-time (SALT) and time-stratified sampling, flow-stratified sampling is a statistical method requiring random sampling, and yielding unbiased estimates of load and variance. It can be used to estimate event...
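The unbiased load and variance estimates come from the classical stratified estimator: expand each stratum's sample mean by its stratum size and accumulate the finite-population-corrected variances. A minimal sketch; the strata sizes and load measurements below are invented for illustration, not from the study.

```python
# Stratified-sampling sketch: unbiased total-load and variance estimates
# from per-stratum random samples (classical stratified estimator).
# All numbers are illustrative assumptions.

def stratified_estimate(strata):
    """strata: list of (N_h, samples), where N_h is the number of sampling
    units in the stratum and samples holds the measured loads of the n_h
    randomly chosen units."""
    total, var = 0.0, 0.0
    for N, xs in strata:
        n = len(xs)
        mean = sum(xs) / n
        s2 = sum((x - mean) ** 2 for x in xs) / (n - 1)   # sample variance
        total += N * mean                                 # expand to stratum
        var += N * N * (1 - n / N) * s2 / n               # with FPC
    return total, var

# Two flow strata (e.g. baseflow vs. storm flow), unit loads in tonnes:
total, var = stratified_estimate([(100, [1.0, 1.2, 0.8]),
                                  (20, [5.0, 7.0, 6.0])])
print(total, var)
```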
Sampling methods for amphibians in streams in the Pacific Northwest.
R. Bruce Bury; Paul Stephen Corn
1991-01-01
Methods describing how to sample aquatic and semiaquatic amphibians in small streams and headwater habitats in the Pacific Northwest are presented. We developed a technique that samples 10-meter stretches of selected streams, which was adequate to detect presence or absence of amphibian species and provided sample sizes statistically sufficient to compare abundance of...
Efficient association study design via power-optimized tag SNP selection
HAN, BUHM; KANG, HYUN MIN; SEO, MYEONG SEONG; ZAITLEN, NOAH; ESKIN, ELEAZAR
2008-01-01
Discovering statistical correlation between causal genetic variation and clinical traits through association studies is an important method for identifying the genetic basis of human diseases. Since fully resequencing a cohort is prohibitively costly, genetic association studies take advantage of local correlation structure (or linkage disequilibrium) between single nucleotide polymorphisms (SNPs) by selecting a subset of SNPs to be genotyped (tag SNPs). While many current association studies are performed using commercially available high-throughput genotyping products that define a set of tag SNPs, choosing tag SNPs remains an important problem for both custom follow-up studies as well as designing the high-throughput genotyping products themselves. The most widely used tag SNP selection method optimizes over the correlation between SNPs (r2). However, tag SNPs chosen based on an r2 criterion do not necessarily maximize the statistical power of an association study. We propose a study design framework that chooses SNPs to maximize power and efficiently measures the power through empirical simulation. Empirical results based on the HapMap data show that our method gains considerable power over a widely used r2-based method, or equivalently reduces the number of tag SNPs required to attain the desired power of a study. Our power-optimized 100k whole genome tag set provides equivalent power to the Affymetrix 500k chip for the CEU population. For the design of custom follow-up studies, our method provides up to twice the power increase using the same number of tag SNPs as r2-based methods. Our method is publicly available via web server at http://design.cs.ucla.edu. PMID:18702637
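The r2-based baseline that the paper improves on can be sketched as a greedy set cover: keep adding the SNP that covers the most still-untagged SNPs at r2 above a threshold. The haplotype matrix below is a toy 0/1 example (polymorphic at every SNP), not HapMap data.

```python
# Greedy r^2-based tagging sketch (the widely used baseline, not the
# paper's power-optimized method). Haplotype data are illustrative.

def r2(a, b):
    """Squared correlation between two biallelic SNPs across haplotypes
    (assumes both SNPs are polymorphic)."""
    n = len(a)
    pa, pb = sum(a) / n, sum(b) / n
    pab = sum(x * y for x, y in zip(a, b)) / n
    d = pab - pa * pb                       # linkage disequilibrium D
    return d * d / (pa * (1 - pa) * pb * (1 - pb))

def greedy_tags(haps, t=0.8):
    """Pick tag SNPs until every SNP is covered by some tag at r^2 >= t."""
    m = len(haps)
    uncovered, tags = set(range(m)), []
    while uncovered:
        best = max(range(m), key=lambda i: sum(r2(haps[i], haps[j]) >= t
                                               for j in uncovered))
        tags.append(best)
        uncovered -= {j for j in uncovered if r2(haps[best], haps[j]) >= t}
    return tags

haps = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]]   # SNP x haplotype
print(greedy_tags(haps))   # SNPs 0 and 1 are in perfect LD
```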
A probabilistic method for testing and estimating selection differences between populations
He, Yungang; Wang, Minxian; Huang, Xin; Li, Ran; Xu, Hongyang; Xu, Shuhua; Jin, Li
2015-01-01
Human populations around the world encounter various environmental challenges and, consequently, develop genetic adaptations to different selection forces. Identifying the differences in natural selection between populations is critical for understanding the roles of specific genetic variants in evolutionary adaptation. Although numerous methods have been developed to detect genetic loci under recent directional selection, a probabilistic solution for testing and quantifying selection differences between populations is lacking. Here we report the development of a probabilistic method for testing and estimating selection differences between populations. By use of a probabilistic model of genetic drift and selection, we showed that logarithm odds ratios of allele frequencies provide estimates of the differences in selection coefficients between populations. The estimates approximate a normal distribution, and variance can be estimated using genome-wide variants. This allows us to quantify differences in selection coefficients and to determine the confidence intervals of the estimate. Our work also revealed the link between genetic association testing and hypothesis testing of selection differences. It therefore supplies a solution for hypothesis testing of selection differences. This method was applied to a genome-wide data analysis of Han and Tibetan populations. The results confirmed that both the EPAS1 and EGLN1 genes are under statistically different selection in Han and Tibetan populations. We further estimated differences in the selection coefficients for genetic variants involved in melanin formation and determined their confidence intervals between continental population groups. Application of the method to empirical data demonstrated the outstanding capability of this novel approach for testing and quantifying differences in natural selection. PMID:26463656
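The paper's core quantity is simple to compute: the log odds ratio of allele frequencies in two populations estimates the difference in selection coefficients, with a normal-approximation confidence interval from the allele counts. A minimal sketch with illustrative counts, not the Han/Tibetan data:

```python
import math

# Sketch: log odds ratio of allele frequencies as an estimate of the
# selection-coefficient difference between two populations, with a Wald
# interval from the allele counts. Counts below are illustrative.

def log_or_ci(k1, n1, k2, n2, z=1.96):
    """k/n: derived-allele count and total allele count per population."""
    p1, p2 = k1 / n1, k2 / n2
    lor = math.log(p1 / (1 - p1)) - math.log(p2 / (1 - p2))
    se = math.sqrt(1 / k1 + 1 / (n1 - k1) + 1 / k2 + 1 / (n2 - k2))
    return lor - z * se, lor, lor + z * se

lo, lor, hi = log_or_ci(80, 100, 40, 100)
print(round(lor, 2), round(lo, 2), round(hi, 2))
```

An interval excluding zero suggests statistically different selection at the locus, the criterion the paper applies to EPAS1 and EGLN1.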
A Granular Self-Organizing Map for Clustering and Gene Selection in Microarray Data.
Ray, Shubhra Sankar; Ganivada, Avatharam; Pal, Sankar K
2016-09-01
A new granular self-organizing map (GSOM) is developed by integrating the concept of a fuzzy rough set with the SOM. While training the GSOM, the weights of a winning neuron and the neighborhood neurons are updated through a modified learning procedure. The neighborhood is newly defined using the fuzzy rough sets. The clusters (granules) evolved by the GSOM are presented to a decision table as its decision classes. Based on the decision table, a method of gene selection is developed. The effectiveness of the GSOM is shown in both clustering samples and developing an unsupervised fuzzy rough feature selection (UFRFS) method for gene selection in microarray data. While the superior results of the GSOM, as compared with the related clustering methods, are provided in terms of β-index, DB-index, Dunn-index, and fuzzy rough entropy, the genes selected by the UFRFS are not only better in terms of classification accuracy and a feature evaluation index, but also statistically more significant than the related unsupervised methods. The C-codes of the GSOM and UFRFS are available online at http://avatharamg.webs.com/software-code.
Walking through the statistical black boxes of plant breeding.
Xavier, Alencar; Muir, William M; Craig, Bruce; Rainey, Katy Martin
2016-10-01
The main statistical procedures in plant breeding are based on Gaussian process and can be computed through mixed linear models. Intelligent decision making relies on our ability to extract useful information from data to help us achieve our goals more efficiently. Many plant breeders and geneticists perform statistical analyses without understanding the underlying assumptions of the methods or their strengths and pitfalls. In other words, they treat these statistical methods (software and programs) like black boxes. Black boxes represent complex pieces of machinery with contents that are not fully understood by the user. The user sees the inputs and outputs without knowing how the outputs are generated. By providing a general background on statistical methodologies, this review aims (1) to introduce basic concepts of machine learning and its applications to plant breeding; (2) to link classical selection theory to current statistical approaches; (3) to show how to solve mixed models and extend their application to pedigree-based and genomic-based prediction; and (4) to clarify how the algorithms of genome-wide association studies work, including their assumptions and limitations.
Application of multivariate statistical techniques in microbial ecology
Paliy, O.; Shankar, V.
2016-01-01
Recent advances in high-throughput methods of molecular analysis have led to an explosion of studies generating large-scale ecological datasets. The effect has been especially noticeable in the field of microbial ecology, where new experimental approaches have provided in-depth assessments of the composition, functions, and dynamic changes of complex microbial communities. Because even a single high-throughput experiment produces large amounts of data, powerful statistical techniques of multivariate analysis are well suited to analyze and interpret these datasets. Many different multivariate techniques are available, and often it is not clear which method should be applied to a particular dataset. In this review we describe and compare the most widely used multivariate statistical techniques, including exploratory, interpretive, and discriminatory procedures. We consider several important limitations and assumptions of these methods, and we present examples of how these approaches have been utilized in recent studies to provide insight into the ecology of the microbial world. Finally, we offer suggestions for the selection of appropriate methods based on the research question and dataset structure. PMID:26786791
Pitfalls in statistical landslide susceptibility modelling
NASA Astrophysics Data System (ADS)
Schröder, Boris; Vorpahl, Peter; Märker, Michael; Elsenbeer, Helmut
2010-05-01
The use of statistical methods is a well-established approach to predict landslide occurrence probabilities and to assess landslide susceptibility. This is achieved by applying statistical methods relating historical landslide inventories to topographic indices as predictor variables. In our contribution, we compare several new and powerful methods developed in machine learning and well-established in landscape ecology and macroecology for predicting the distribution of shallow landslides in tropical mountain rainforests in southern Ecuador (among others: boosted regression trees, multivariate adaptive regression splines, maximum entropy). Although these methods are powerful, we think it is necessary to follow a basic set of guidelines to avoid some pitfalls regarding data sampling, predictor selection, and model quality assessment, especially if a comparison of different models is contemplated. We therefore suggest applying a novel toolbox to evaluate approaches to the statistical modelling of landslide susceptibility. Additionally, we propose some methods to open the "black box" that is an inherent part of machine learning methods, in order to achieve further explanatory insights into the preparatory factors that control landslides. Sampling of training data should be guided by hypotheses regarding the processes that lead to slope failure, taking into account their respective spatial scales. This approach leads to the selection of a set of candidate predictor variables considered on adequate spatial scales. This set should be checked for multicollinearity in order to facilitate interpretation of model response curves. Model quality assessment measures how well a model is able to reproduce independent observations of its response variable. This includes criteria to evaluate different aspects of model performance, i.e. model discrimination, model calibration, and model refinement.
In order to assess a possible violation of the assumption of independence in the training samples, or a possible lack of explanatory information in the chosen set of predictor variables, the model residuals need to be checked for spatial autocorrelation. Therefore, we calculate spline correlograms. In addition, we investigate partial dependency plots and bivariate interaction plots, considering possible interactions between predictors, to improve model interpretation. Aiming at presenting this toolbox for model quality assessment, we investigate the influence of strategies for constructing training datasets for statistical models on model quality.
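A lightweight stand-in for the residual check described above is Moran's I with inverse-distance weights: positive values indicate spatially clustered residuals, i.e. a likely violation of the independence assumption. The coordinates, residuals, and weighting scheme below are illustrative, not the study's spline-correlogram procedure.

```python
# Residual spatial-autocorrelation sketch: Moran's I with inverse-distance
# weights. Coordinates and residuals are illustrative assumptions.

def morans_i(coords, resid):
    n = len(resid)
    mean = sum(resid) / n
    dev = [r - mean for r in resid]
    num = den_w = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = coords[i][0] - coords[j][0]
            dy = coords[i][1] - coords[j][1]
            w = 1.0 / (dx * dx + dy * dy) ** 0.5   # inverse-distance weight
            num += w * dev[i] * dev[j]
            den_w += w
    den = sum(d * d for d in dev)
    return (num / den_w) / (den / n)

# Clustered residuals (nearby cells share sign) -> positive Moran's I:
coords = [(0, 0), (0, 1), (5, 0), (5, 1)]
resid = [1.0, 0.8, -1.0, -0.9]
print(morans_i(coords, resid) > 0)  # True
```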
Davalos, Angel D; Luben, Thomas J; Herring, Amy H; Sacks, Jason D
2017-02-01
Air pollution epidemiology traditionally focuses on the relationship between individual air pollutants and health outcomes (e.g., mortality). To account for potential copollutant confounding, individual pollutant associations are often estimated by adjusting or controlling for other pollutants in the mixture. Recently, the need to characterize the relationship between health outcomes and the larger multipollutant mixture has been emphasized in an attempt to better protect public health and inform more sustainable air quality management decisions. New and innovative statistical methods to examine multipollutant exposures were identified through a broad literature search, with a specific focus on those statistical approaches currently used in epidemiologic studies of short-term exposures to criteria air pollutants (i.e., particulate matter, carbon monoxide, sulfur dioxide, nitrogen dioxide, and ozone). Five broad classes of statistical approaches were identified for examining associations between short-term multipollutant exposures and health outcomes, specifically additive main effects, effect measure modification, unsupervised dimension reduction, supervised dimension reduction, and nonparametric methods. These approaches are characterized including advantages and limitations in different epidemiologic scenarios. By highlighting the characteristics of various studies in which multipollutant statistical methods have been used, this review provides epidemiologists and biostatisticians with a resource to aid in the selection of the most optimal statistical method to use when examining multipollutant exposures. Published by Elsevier Inc.
Alles, Susan; Peng, Linda X; Mozola, Mark A
2009-01-01
A modification to Performance-Tested Method 010403, GeneQuence Listeria Test (DNAH method), is described. The modified method uses a new media formulation, LESS enrichment broth, in single-step enrichment protocols for both foods and environmental sponge and swab samples. Food samples are enriched for 27-30 h at 30 degrees C, and environmental samples for 24-48 h at 30 degrees C. Implementation of these abbreviated enrichment procedures allows test results to be obtained on a next-day basis. In testing of 14 food types in internal comparative studies with inoculated samples, there were statistically significant differences in method performance between the DNAH method and reference culture procedures for only 2 foods (pasteurized crab meat and lettuce) at the 27 h enrichment time point and for only a single food (pasteurized crab meat) in one trial at the 30 h enrichment time point. Independent laboratory testing with 3 foods showed statistical equivalence between the methods for all foods, and results support the findings of the internal trials. Overall, considering both internal and independent laboratory trials, sensitivity of the DNAH method relative to the reference culture procedures was 90.5%. Results of testing 5 environmental surfaces inoculated with various strains of Listeria spp. showed that the DNAH method was more productive than the reference U.S. Department of Agriculture-Food Safety and Inspection Service (USDA-FSIS) culture procedure for 3 surfaces (stainless steel, plastic, and cast iron), whereas results were statistically equivalent to the reference method for the other 2 surfaces (ceramic tile and sealed concrete). An independent laboratory trial with ceramic tile inoculated with L. monocytogenes confirmed the effectiveness of the DNAH method at the 24 h time point. Overall, sensitivity of the DNAH method at 24 h relative to that of the USDA-FSIS method was 152%. The DNAH method exhibited extremely high specificity, with only 1% false-positive reactions overall.
Dietrich, Stefan; Floegel, Anna; Troll, Martina; Kühn, Tilman; Rathmann, Wolfgang; Peters, Anette; Sookthai, Disorn; von Bergen, Martin; Kaaks, Rudolf; Adamski, Jerzy; Prehn, Cornelia; Boeing, Heiner; Schulze, Matthias B; Illig, Thomas; Pischon, Tobias; Knüppel, Sven; Wang-Sattler, Rui; Drogan, Dagmar
2016-10-01
The application of metabolomics in prospective cohort studies is statistically challenging. Given the importance of appropriate statistical methods for selecting disease-associated metabolites in highly correlated complex data, we combined random survival forest (RSF) with an automated backward elimination procedure that addresses such issues. Our RSF approach was illustrated with data from the European Prospective Investigation into Cancer and Nutrition (EPIC)-Potsdam study, with concentrations of 127 serum metabolites as exposure variables and time to development of type 2 diabetes mellitus (T2D) as the outcome variable. An analysis of this data set using Cox regression with a stepwise selection method was recently published. The methodological comparison (RSF versus Cox regression) was replicated in two independent cohorts. Finally, the R code for implementing the metabolite selection procedure in the RSF syntax is provided. The application of the RSF approach in EPIC-Potsdam resulted in the identification of 16 incident T2D-associated metabolites, which slightly improved prediction of T2D when used in addition to traditional T2D risk factors and also when used together with classical biomarkers. The identified metabolites partly agreed with previous findings using Cox regression, though RSF selected a higher number of highly correlated metabolites. The RSF method appears to be a promising approach for the identification of disease-associated variables in complex data with time to event as the outcome. The demonstrated RSF approach provides findings comparable to the generally used Cox regression, but also addresses the problem of multicollinearity and is suitable for high-dimensional data. © The Author 2016; all rights reserved. Published by Oxford University Press on behalf of the International Epidemiological Association.
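The control flow of importance-driven backward elimination can be sketched independently of the forest itself. Here `fit_score` is a hypothetical stand-in for fitting an RSF and returning (prediction error, per-variable importance); the toy scoring function and metabolite names are invented to illustrate the loop, not the randomForestSRC API or the EPIC-Potsdam data.

```python
# Backward-elimination sketch in the spirit of the paper's RSF procedure:
# repeatedly drop the least important variables, keeping the variable set
# with the lowest prediction error seen so far. `fit_score` is a
# hypothetical stand-in for an RSF fit; all names are illustrative.

def backward_eliminate(variables, fit_score, drop_frac=0.2):
    best_vars, best_err = list(variables), float("inf")
    current = list(variables)
    while len(current) > 1:
        err, importance = fit_score(current)
        if err < best_err:
            best_err, best_vars = err, list(current)
        k = max(1, int(len(current) * drop_frac))       # drop worst ~20%
        current = sorted(current, key=importance.get, reverse=True)[:-k]
    err, _ = fit_score(current)                          # score final model
    if err < best_err:
        best_err, best_vars = err, current
    return best_vars, best_err

# Toy score: only metabolites m1 and m2 are informative; noise variables
# inflate the error slightly.
def toy_fit_score(vars_):
    signal = {"m1": 0.5, "m2": 0.3}
    imp = {v: signal.get(v, 0.01) for v in vars_}
    err = 1.0 - sum(signal.get(v, -0.02) for v in vars_)
    return err, imp

sel, err = backward_eliminate(["m1", "m2", "m3", "m4", "m5"], toy_fit_score)
print(sorted(sel))
```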
Genome-Wide Specific Selection in Three Domestic Sheep Breeds.
Wang, Huihua; Zhang, Li; Cao, Jiaxve; Wu, Mingming; Ma, Xiaomeng; Liu, Zhen; Liu, Ruizao; Zhao, Fuping; Wei, Caihong; Du, Lixin
2015-01-01
Commercial sheep raised for mutton grow faster than traditional Chinese sheep breeds. Here, we aimed to evaluate genetic selection among three different types of sheep breed: two well-known commercial mutton breeds and one indigenous Chinese breed. We first combined locus-specific branch lengths and the di statistic to detect candidate regions targeted by selection in the three populations. The results showed that the genetic distances reached at least medium divergence for each pairwise combination. We found that the two methods were highly correlated, and identified many growth-related candidate genes undergoing artificial selection. For production traits, APOBR and FTO are associated with body mass index. For meat traits, ALDOA, STK32B and FAM190A are related to marbling. For reproduction traits, CCNB2 and SLC8A3 affect oocyte development. We also found that two well-known genes, GHR (which affects meat production and quality) and EDAR (associated with hair thickness), were under selection in German mutton merino sheep. Furthermore, four genes (POL, RPL7, MSL1 and SHISA9) were associated with pre-weaning gain in our previous genome-wide association study. Our results indicate that combining locus-specific branch lengths with the di statistic can narrow the search for regions under breed-specific selection, and the analysis yielded many credible candidate genes that not only confirm the results of previous reports but also provide a suite of novel candidate genes in defined breeds to guide hybridization breeding.
Will genomic selection be a practical method for plant breeding?
Nakaya, Akihiro; Isobe, Sachiko N.
2012-01-01
Background Genomic selection or genome-wide selection (GS) has been highlighted as a new approach for marker-assisted selection (MAS) in recent years. GS is a form of MAS that selects favourable individuals based on genomic estimated breeding values. Previous studies have suggested the utility of GS, especially for capturing small-effect quantitative trait loci, but GS has not become a popular methodology in the field of plant breeding, possibly because there is insufficient information available on GS for practical use. Scope In this review, GS is discussed from a practical breeding viewpoint. Statistical approaches employed in GS are briefly described, before the recent progress in GS studies is surveyed. GS practices in plant breeding are then reviewed before future prospects are discussed. Conclusions Statistical concepts used in GS are discussed with genetic models and variance decomposition, heritability, breeding value and linear model. Recent progress in GS studies is reviewed with a focus on empirical studies. For the practice of GS in plant breeding, several specific points are discussed including linkage disequilibrium, feature of populations and genotyped markers and breeding scheme. Currently, GS is not perfect, but it is a potent, attractive and valuable approach for plant breeding. This method will be integrated into many practical breeding programmes in the near future with further advances and the maturing of its theory. PMID:22645117
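GS ranks individuals by genomic estimated breeding values (GEBVs), and a common statistical core is ridge-type shrinkage of marker effects within the linear model the review describes. A minimal sketch with toy genotypes and a small hand-rolled solver; it is an RR-BLUP-style illustration under assumed data, not a production implementation.

```python
# Genomic-selection sketch: ridge-shrunken marker effects and GEBVs.
# beta = (Z'Z + lam*I)^-1 Z'y ; GEBV_i = sum_j Z[i][j] * beta[j].
# Genotypes, phenotypes, and lambda are illustrative assumptions.

def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting (small systems)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def marker_effects(Z, y, lam):
    """Ridge estimate of marker effects (shrinkage toward zero)."""
    m = len(Z[0])
    ZtZ = [[sum(Z[i][a] * Z[i][b] for i in range(len(Z)))
            + (lam if a == b else 0.0) for b in range(m)] for a in range(m)]
    Zty = [sum(Z[i][a] * y[i] for i in range(len(Z))) for a in range(m)]
    return solve(ZtZ, Zty)

Z = [[1, 0], [0, 1], [1, 1], [0, 0]]     # genotypes coded 0/1 at 2 markers
y = [1.0, 0.5, 1.5, 0.0]                 # phenotypes
beta = marker_effects(Z, y, lam=0.1)
gebv = [sum(z * b for z, b in zip(row, beta)) for row in Z]
print([round(b, 2) for b in beta])       # shrunken marker effects
```

Selection then simply keeps the individuals with the largest GEBVs, which is how small-effect loci contribute without ever passing a per-marker significance test.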
Garcia, Rafaela Alvim; Vanelli, Chislene Pereira; Pereira Junior, Olavo Dos Santos; Corrêa, José Otávio do Amaral
2018-06-19
Hydroelectrolytic disorders are common in clinical situations and may be harmful to the patient, especially those involving plasma sodium and potassium. Among the possible measurement methods are flame photometry, the ion-selective electrode (ISE), and the colorimetric enzymatic method. We analyzed 175 samples with the three different methods cited, from patients attending the laboratory of the University Hospital of the Federal University of Juiz de Fora. The values obtained were statistically treated using SPSS 19.0 software. The present study aims to evaluate the impact of using these different methods in the determination of plasma sodium and potassium. The averages obtained for sodium and potassium measured by flame photometry were similar (P > .05) to those obtained for the two electrolytes by ISE. The averages obtained by the colorimetric enzymatic method differed statistically from those of ISE, for both sodium and potassium. In the correlation analysis, both flame photometry and the colorimetric enzymatic method showed a strong correlation with the ISE method for both measurements. This is the first time in a single study that sodium and potassium were analyzed by three different methods, and the results allowed us to conclude that the methods show a strong, positive correlation and can be applied in the clinical routine. © 2018 Wiley Periodicals, Inc.
Efficient Variable Selection Method for Exposure Variables on Binary Data
NASA Astrophysics Data System (ADS)
Ohno, Manabu; Tarumi, Tomoyuki
In this paper, we propose a new variable selection method for "robust" exposure variables. We define "robust" as the property that the same variables are selected from both the original data and perturbed data. There are few studies of effective methods for this kind of selection. The problem of selecting exposure variables is almost the same as the problem of extracting correlation rules without robustness. [Brin 97] suggested that correlation rules can be extracted efficiently from binary data using the chi-squared statistic of a contingency table, which has a monotone property. However, the chi-squared value itself does not have the monotone property, so the method easily judges a variable set to be dependent as the dimension increases, even when the set is completely independent, and it is therefore not usable for selecting robust exposure variables. To select robust independent variables, we assume an anti-monotone property for independent variables and use the apriori algorithm. The apriori algorithm is one of the algorithms that find association rules in market-basket data; it exploits the anti-monotone property of the support defined for association rules. Independence does not completely satisfy the anti-monotone property on the AIC of the independence probability model, but the tendency toward anti-monotonicity is strong. Therefore, variables selected under an assumed anti-monotone property on the AIC have robustness. Our method judges whether a given variable is an exposure variable for an independent variable by comparison of AIC values. Our numerical experiments show that our method can select robust exposure variables efficiently and precisely.
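The chi-squared statistic that [Brin 97]'s correlation-rule mining builds on is easy to state for the 2x2 binary case; the counts below are toy values for illustration.

```python
# Chi-squared sketch for a 2x2 contingency table over two binary
# variables X and Y (the statistic used to extract correlation rules).
# The counts are illustrative assumptions.

def chi2_2x2(a, b, c, d):
    """Cells: a = both X and Y, b = X only, c = Y only, d = neither."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    exp = [row1 * col1 / n, row1 * col2 / n, row2 * col1 / n, row2 * col2 / n]
    obs = [a, b, c, d]
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

print(round(chi2_2x2(40, 10, 10, 40), 1))  # strongly associated -> 36.0
print(round(chi2_2x2(25, 25, 25, 25), 1))  # independent        -> 0.0
```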
DOE Office of Scientific and Technical Information (OSTI.GOV)
Blume-Kohout, Robin J; Scholten, Travis L.
Quantum state tomography on a d-dimensional system demands resources that grow rapidly with d. They may be reduced by using model selection to tailor the number of parameters in the model (i.e., the size of the density matrix). Most model selection methods typically rely on a test statistic and a null theory that describes its behavior when two models are equally good. Here, we consider the loglikelihood ratio. Because of the positivity constraint ρ ≥ 0, quantum state space does not generally satisfy local asymptotic normality (LAN), meaning the classical null theory for the loglikelihood ratio (the Wilks theorem) should not be used. Thus, understanding and quantifying how positivity affects the null behavior of this test statistic is necessary for its use in model selection for state tomography. We define a new generalization of LAN, metric-projected LAN, show that quantum state space satisfies it, and derive a replacement for the Wilks theorem. In addition to enabling reliable model selection, our results shed more light on the qualitative effects of the positivity constraint on state tomography.
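For contrast with the constrained quantum setting, the classical Wilks recipe in an unconstrained one-parameter problem is a one-liner: twice the loglikelihood ratio between nested models is referred to a chi-squared critical value. A minimal binomial sketch with illustrative counts (the paper's point is precisely that this recipe fails near the boundary ρ ≥ 0):

```python
import math

# Classical Wilks-theorem sketch: 2*log(LR) for a free binomial success
# probability vs. a null value p0, compared to the chi2(1) critical value
# 3.84 at the 5% level. Counts are illustrative assumptions.

def llr_binomial(k, n, p0):
    """2 * loglikelihood ratio of p-hat = k/n against the null p0."""
    p = k / n
    def ll(q):
        return k * math.log(q) + (n - k) * math.log(1 - q)
    return 2.0 * (ll(p) - ll(p0))

stat = llr_binomial(60, 100, 0.5)
print(round(stat, 2), stat > 3.84)   # reject p0 = 0.5 at ~5%
```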
Robertson, David S; Prevost, A Toby; Bowden, Jack
2016-09-30
Seamless phase II/III clinical trials offer an efficient way to select an experimental treatment and perform confirmatory analysis within a single trial. However, combining the data from both stages in the final analysis can induce bias into the estimates of treatment effects. Methods for bias adjustment developed thus far have made restrictive assumptions about the design and selection rules followed. In order to address these shortcomings, we apply recent methodological advances to derive the uniformly minimum variance conditionally unbiased estimator for two-stage seamless phase II/III trials. Our framework allows for the precision of the treatment arm estimates to take arbitrary values, can be utilised for all treatments that are taken forward to phase III and is applicable when the decision to select or drop treatment arms is driven by a multiplicity-adjusted hypothesis testing procedure. © 2016 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd.
Quantitative trait nucleotide analysis using Bayesian model selection.
Blangero, John; Goring, Harald H H; Kent, Jack W; Williams, Jeff T; Peterson, Charles P; Almasy, Laura; Dyer, Thomas D
2005-10-01
Although much attention has been given to statistical genetic methods for the initial localization and fine mapping of quantitative trait loci (QTLs), little methodological work has been done to date on the problem of statistically identifying the most likely functional polymorphisms using sequence data. In this paper we provide a general statistical genetic framework, called Bayesian quantitative trait nucleotide (BQTN) analysis, for assessing the likely functional status of genetic variants. The approach requires the initial enumeration of all genetic variants in a set of resequenced individuals. These polymorphisms are then typed in a large number of individuals (potentially in families), and marker variation is related to quantitative phenotypic variation using Bayesian model selection and averaging. For each sequence variant a posterior probability of effect is obtained and can be used to prioritize additional molecular functional experiments. An example of this quantitative nucleotide analysis is provided using the GAW12 simulated data. The results show that the BQTN method may be useful for choosing the most likely functional variants within a gene (or set of genes). We also include instructions on how to use our computer program, SOLAR, for association analysis and BQTN analysis.
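The step of converting per-variant model fits into posterior probabilities of effect can be sketched with the common BIC approximation p(M | data) ∝ exp(-BIC/2) under equal prior model probabilities. This is a textbook stand-in; the actual BQTN/SOLAR machinery (model averaging over variant sets) is more elaborate, and the function and key names are assumptions.

```python
import math

def bic_posterior_probs(bics):
    """Approximate posterior model probabilities from BIC values,
    assuming equal prior probabilities: p(M) proportional to
    exp(-BIC / 2). Input maps model/variant name -> BIC."""
    m = min(bics.values())                       # subtract min for numerical stability
    w = {k: math.exp(-(b - m) / 2) for k, b in bics.items()}
    z = sum(w.values())
    return {k: v / z for k, v in w.items()}

# Hypothetical BICs for two candidate functional variants:
probs = bic_posterior_probs({"variant_A": 100.0, "variant_B": 110.0})
```

A 10-unit BIC gap translates to a posterior probability above 0.99 for the better model, which is the kind of prioritization score the abstract describes.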
CAPSAS: Computer Assisted Program for the Selection of Appropriate Statistics.
ERIC Educational Resources Information Center
Shermis, Mark D.; Albert, Susan L.
A computer-assisted program has been developed for the selection of statistics or statistical techniques by both students and researchers. Based on Andrews, Klem, Davidson, O'Malley and Rodgers "A Guide for Selecting Statistical Techniques for Analyzing Social Science Data," this FORTRAN-compiled interactive computer program was…
NASA Astrophysics Data System (ADS)
Peresan, Antonella; Gentili, Stefania
2017-04-01
Identification and statistical characterization of seismic clusters may provide useful insights about the features of seismic energy release and their relation to physical properties of the crust within a given region. Moreover, a number of studies based on spatio-temporal analysis of main-shocks occurrence require preliminary declustering of the earthquake catalogs. Since various methods, relying on different physical/statistical assumptions, may lead to diverse classifications of earthquakes into main events and related events, we aim to investigate the classification differences among different declustering techniques. Accordingly, a formal selection and comparative analysis of earthquake clusters is carried out for the most relevant earthquakes in North-Eastern Italy, as reported in the local OGS-CRS bulletins, compiled at the National Institute of Oceanography and Experimental Geophysics since 1977. The comparison is then extended to selected earthquake sequences associated with a different seismotectonic setting, namely to events that occurred in the region struck by the recent Central Italy destructive earthquakes, making use of INGV data. Various techniques, ranging from classical space-time window methods to ad hoc manual identification of aftershocks, are applied for detection of earthquake clusters. In particular, a statistical method based on nearest-neighbor distances of events in the space-time-energy domain is considered. Results from cluster identification by the nearest-neighbor method turn out quite robust with respect to the time span of the input catalogue, as well as to the minimum magnitude cutoff. The identified clusters for the largest events reported in North-Eastern Italy since 1977 are well consistent with those reported in earlier studies, which were aimed at detailed manual aftershock identification.
The study shows that the data-driven approach based on nearest-neighbor distances can be satisfactorily applied to decompose the seismic catalog into background seismicity and individual sequences of earthquake clusters, even in areas characterized by moderate seismic activity, where standard declustering techniques may turn out to be rather gross approximations. With these results in hand, the main statistical features of seismic clusters are explored, including the complex interdependence of related events, with the aim of characterizing the space-time patterns of earthquake occurrence in North-Eastern Italy and capturing their basic differences from the Central Italy sequences.
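A space-time-energy nearest-neighbor distance of the kind used for such cluster identification can be sketched in the style of the Zaliapin–Ben-Zion metric, eta = dt * dr^df * 10^(-b * m_parent). The formula's parameter values (b-value, fractal dimension df) and the function names are illustrative assumptions, not the exact OGS implementation.

```python
import math

def nn_distance(t_parent, xy_parent, m_parent, t_child, xy_child,
                b=1.0, df=1.6):
    """Space-time-energy distance from a candidate parent event to a
    child event: eta = dt * dr**df * 10**(-b * m_parent). Small eta
    suggests the child belongs to the parent's cluster. Parameters b
    and df are assumed illustrative values."""
    dt = t_child - t_parent
    if dt <= 0:
        return float('inf')   # parents must precede children
    dr = math.dist(xy_parent, xy_child)
    return dt * (dr ** df) * 10 ** (-b * m_parent)

def nearest_parent(events, j, **kw):
    """Index of the nearest-neighbor parent of event j, where events
    are (time, (x, y), magnitude) tuples sorted by time."""
    dists = [nn_distance(t, xy, m, events[j][0], events[j][1], **kw)
             for (t, xy, m) in events[:j]]
    return dists.index(min(dists))

# A large early event attracts a nearby later event despite a closer,
# smaller intermediate shock (hypothetical toy catalog):
events = [(0.0, (0.00, 0.0), 6.0),
          (0.5, (0.10, 0.0), 3.0),
          (1.0, (0.11, 0.0), 2.0)]
```

The magnitude term rescales distances so that large events "own" wider space-time neighborhoods, which is why the method is robust to catalog time span and magnitude cutoff.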
History and Development of the Schmidt-Hunter Meta-Analysis Methods
ERIC Educational Resources Information Center
Schmidt, Frank L.
2015-01-01
In this article, I provide answers to the questions posed by Will Shadish about the history and development of the Schmidt-Hunter methods of meta-analysis. In the 1970s, I headed a research program on personnel selection at the US Office of Personnel Management (OPM). After our research showed that validity studies have low statistical power, OPM…
NASA Astrophysics Data System (ADS)
Shams Esfand Abadi, Mohammad; AbbasZadeh Arani, Seyed Ali Asghar
2011-12-01
This paper extends the recently introduced variable step-size (VSS) approach to a family of adaptive filter algorithms. The method uses prior knowledge of the channel impulse response statistics; accordingly, the optimal step-size vector is obtained by minimizing the mean-square deviation (MSD). The presented algorithms are the VSS affine projection algorithm (VSS-APA), the VSS selective partial update NLMS (VSS-SPU-NLMS), the VSS-SPU-APA, and the VSS selective regressor APA (VSS-SR-APA). In the VSS-SPU adaptive algorithms, the filter coefficients are only partially updated, which reduces the computational complexity. In VSS-SR-APA, an optimal selection of input regressors is performed during adaptation. The presented algorithms feature good convergence speed, low steady-state mean square error (MSE), and low computational complexity. We demonstrate the good performance of the proposed algorithms through several simulations in a system identification scenario.
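As a baseline for the VSS variants above, a fixed step-size NLMS system-identification loop might look like the following sketch. The VSS algorithms replace the fixed scalar mu with a step-size vector optimized to minimize the MSD, which is not reproduced here; the function name and defaults are assumptions.

```python
def nlms_identify(x, d, num_taps, mu=0.5, eps=1e-8):
    """Fixed step-size NLMS system identification: estimate the FIR
    coefficients relating input x to desired output d. A minimal
    baseline sketch, not the paper's VSS algorithms."""
    w = [0.0] * num_taps
    for n in range(num_taps - 1, len(x)):
        u = x[n - num_taps + 1:n + 1][::-1]       # regressor, newest sample first
        y = sum(wi * ui for wi, ui in zip(w, u))  # filter output
        e = d[n] - y                              # a priori error
        norm = sum(ui * ui for ui in u) + eps     # regularized input energy
        w = [wi + mu * e * ui / norm for wi, ui in zip(w, u)]
    return w

# Noiseless two-tap channel h = [1.0, -0.5] driven by a decorrelating input:
x = [((i * 37) % 11) - 5 for i in range(400)]
d = [x[0]] + [x[n] - 0.5 * x[n - 1] for n in range(1, len(x))]
w = nlms_identify(x, d, 2)   # converges toward [1.0, -0.5]
```

With a persistently exciting input and no noise, the weight error shrinks geometrically; the SPU variants would update only a subset of `w` per iteration.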
Development of a Multidimensional Functional Health Scale for Older Adults in China.
Mao, Fanzhen; Han, Yaofeng; Chen, Junze; Chen, Wei; Yuan, Manqiong; Alicia Hong, Y; Fang, Ya
2016-05-01
A first step toward successful aging is assessing the functional wellbeing of older adults. This study reports the development of a culturally appropriate brief scale (the Multidimensional Functional Health Scale for Chinese Elderly, MFHSCE) to assess the functional health of Chinese elderly. Through systematic literature review, the Delphi method, cultural adaptation, synthetic statistical item selection, Cronbach's alpha, and confirmatory factor analysis, we developed the item pool, conducted two rounds of item selection, and performed psychometric evaluation. The synthetic statistical item selection and the psychometric evaluation were conducted among 539 and 2,032 older adults, respectively. The MFHSCE consists of 30 items, covering activities of daily living, social relationships, physical health, mental health, cognitive function, and economic resources. The Cronbach's alpha was 0.92, and the comparative fit index was 0.917. The MFHSCE has good internal consistency and construct validity; it is also concise and easy to use in general practice, especially in communities in China.
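The internal-consistency figure reported above (Cronbach's alpha of 0.92) comes from the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of totals), which can be sketched as follows; the function name and sample data are illustrative.

```python
def cronbach_alpha(items):
    """Cronbach's alpha for a list of item-score columns (one list of
    respondent scores per item). Uses population variances; a minimal
    sketch of the internal-consistency statistic reported for the MFHSCE."""
    k = len(items)
    n = len(items[0])
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    # Per-respondent total scale scores:
    totals = [sum(col[i] for col in items) for i in range(n)]
    return k / (k - 1) * (1 - sum(var(col) for col in items) / var(totals))

# Two perfectly correlated items (hypothetical scores for 4 respondents):
alpha = cronbach_alpha([[1, 2, 3, 4], [2, 4, 6, 8]])  # -> 8/9 ~ 0.889
```

Highly correlated items push alpha toward 1; uncorrelated items push it toward 0.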
Soares, Micaela A R; Andrade, Sandra R; Martins, Rui C; Quina, Margarida J; Quinta-Ferreira, Rosa M
2012-01-01
Composting is one of the technologies recommended for pre-treating industrial eggshells (ES) before their application in soils for calcium recycling. However, due to the high inorganic content of ES, a mixture of biodegradable materials is required to ensure a successful procedure. In this study, an adequate organic blend composition containing potato peel (PP), grass clippings (GC) and wheat straw (WS) was determined by applying the simplex-centroid mixture design method to achieve the desired moisture content, carbon:nitrogen ratio and free air space for effective composting of ES. A blend of 56% PP, 37% GC and 7% WS was selected and tested in a self-heating reactor, where 10% (w/w) of ES was incorporated. After 29 days of reactor operation, a dry matter reduction of 46% was achieved and thermophilic temperatures were maintained for 15 days, indicating that the blend selected by the statistical approach was adequate for composting of ES.
Statistical classification of road pavements using near field vehicle rolling noise measurements.
Paulo, Joel Preto; Coelho, J L Bento; Figueiredo, Mário A T
2010-10-01
Low noise surfaces have been increasingly considered as a viable and cost-effective alternative to acoustical barriers. However, road planners and administrators frequently lack information on the correlation between the type of road surface and the resulting noise emission profile. To address this problem, a method to identify and classify different types of road pavements was developed, whereby near field road noise is analyzed using statistical learning methods. The vehicle rolling sound signal near the tires and close to the road surface was acquired by two microphones in a special arrangement which implements the Close-Proximity method. A set of features, characterizing the properties of the road pavement, was extracted from the corresponding sound profiles. A feature selection method was used to automatically select those that are most relevant in predicting the type of pavement, while reducing the computational cost. A set of different types of road pavement segments were tested and the performance of the classifier was evaluated. Results of pavement classification performed during a road journey are presented on a map, together with geographical data. This procedure leads to a considerable improvement in the quality of road pavement noise data, thereby increasing the accuracy of road traffic noise prediction models.
An online sleep apnea detection method based on recurrence quantification analysis.
Nguyen, Hoa Dinh; Wilkins, Brek A; Cheng, Qi; Benjamin, Bruce Allen
2014-07-01
This paper introduces an online sleep apnea detection method based on heart rate complexity as measured by recurrence quantification analysis (RQA) statistics of heart rate variability (HRV) data. RQA statistics can capture the nonlinear dynamics of a complex cardiorespiratory system during obstructive sleep apnea. In order to obtain a more robust measurement of the nonstationarity of the cardiorespiratory system, we use several fixed-amount-of-neighbors thresholds for the recurrence plot calculation. We integrate a feature selection algorithm based on conditional mutual information to select the most informative RQA features for classification and, hence, to speed up the real-time classification process without degrading the performance of the system. Two types of binary classifiers, i.e., support vector machine and neural network, are used to differentiate apnea from normal sleep. A soft decision fusion rule is developed to combine the results of these classifiers in order to improve the classification performance of the whole system. Experimental results show that our proposed method achieves better classification results compared with the previous recurrence analysis-based approach. We also show that our method is flexible and a strong candidate for a truly efficient sleep apnea detection system.
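The fixed-amount-of-neighbors (FAN) thresholding for recurrence plots can be sketched in one dimension as follows: each time point is marked recurrent with its k nearest neighbors in amplitude. Real HRV analysis would first embed the series in a higher-dimensional phase space; the function names are assumptions.

```python
def recurrence_matrix_fan(series, k):
    """Recurrence plot with fixed-amount-of-neighbors thresholding:
    R[i][j] = 1 iff j is among the k points closest to i in amplitude.
    A 1-D sketch; production RQA uses delay embedding first."""
    n = len(series)
    R = [[0] * n for _ in range(n)]
    for i in range(n):
        # Indices of the k closest points to series[i], excluding i itself.
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: abs(series[j] - series[i]))
        for j in order[:k]:
            R[i][j] = 1
    return R

def recurrence_rate(R):
    """Simplest RQA statistic: fraction of recurrent cells."""
    n = len(R)
    return sum(map(sum, R)) / (n * n)

R = recurrence_matrix_fan(list(range(10)), 2)
```

With FAN thresholding the recurrence rate is fixed at k/n by construction, which is precisely why it gives a measurement robust to nonstationarity; the discriminative information sits in other RQA statistics (determinism, laminarity, etc.).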
Pavlidis, Paul; Qin, Jie; Arango, Victoria; Mann, John J; Sibille, Etienne
2004-06-01
One of the challenges in the analysis of gene expression data is placing the results in the context of other data available about genes and their relationships to each other. Here, we approach this problem in the study of gene expression changes associated with age in two areas of the human prefrontal cortex, comparing two computational methods. The first method, "overrepresentation analysis" (ORA), is based on statistically evaluating the fraction of genes in a particular gene ontology class found among the set of genes showing age-related changes in expression. The second method, "functional class scoring" (FCS), examines the statistical distribution of individual gene scores among all genes in the gene ontology class and does not involve an initial gene selection step. We find that FCS yields more consistent results than ORA, and the results of ORA depended strongly on the gene selection threshold. Our findings highlight the utility of functional class scoring for the analysis of complex expression data sets and emphasize the advantage of considering all available genomic information rather than sets of genes that pass a predetermined "threshold of significance."
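The ORA step described above, testing whether a gene ontology class is overrepresented among the selected genes, is a hypergeometric upper-tail test; the sketch below illustrates why its answer depends on the selection threshold (changing `n_selected` changes the p-value). Names are illustrative.

```python
import math

def ora_pvalue(n_genes, n_class, n_selected, n_overlap):
    """Overrepresentation analysis (ORA): hypergeometric probability of
    seeing at least n_overlap genes from a GO class of size n_class when
    n_selected genes are drawn from n_genes total. Illustrates the
    threshold-dependent step the abstract contrasts with FCS."""
    def hyper_pmf(k):
        return (math.comb(n_class, k)
                * math.comb(n_genes - n_class, n_selected - k)
                / math.comb(n_genes, n_selected))
    upper = min(n_class, n_selected)
    return sum(hyper_pmf(k) for k in range(n_overlap, upper + 1))

# 5 of 5 selected genes fall in a 5-gene class, out of 10 genes total:
p = ora_pvalue(10, 5, 5, 5)  # -> 1/252
```

FCS avoids this threshold sensitivity by scoring the whole distribution of gene-level statistics within the class instead of a selected subset.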
SD-MSAEs: Promoter recognition in human genome based on deep feature extraction.
Xu, Wenxuan; Zhang, Li; Lu, Yaping
2016-06-01
The prediction and recognition of promoters in the human genome play an important role in DNA sequence analysis. Shannon entropy from information theory is a versatile tool for the detailed analysis of bioinformatic data. Relative entropy estimator methods based on statistical divergence (SD) are used to extract meaningful features that distinguish different regions of DNA sequences. In this paper, we choose context features and use a set of SD methods to select the most effective n-mers for distinguishing promoter regions from other DNA regions in the human genome. From the total possible combinations of n-mers, we obtain four sparse distributions based on promoter and non-promoter training samples. The informative n-mers are selected by optimizing how well these distributions are differentiated. In particular, we combine the advantages of statistical divergence and multiple sparse auto-encoders (MSAEs) in deep learning to extract deep features for promoter recognition. We then apply multiple SVMs and a decision model to construct a human promoter recognition method called SD-MSAEs. The framework is flexible, in that it can freely integrate new feature extraction or classification models. Experimental results show that our method has high sensitivity and specificity. Copyright © 2016 Elsevier Inc. All rights reserved.
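A divergence-based ranking of n-mers can be sketched with a symmetric Kullback–Leibler score on per-n-mer occurrence frequencies. The two-point (present/absent) frequency encoding, the epsilon smoothing, and the function names are simplifying assumptions, not the paper's exact estimator.

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """Kullback-Leibler divergence between two discrete distributions,
    with epsilon smoothing to guard against zero frequencies."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def rank_nmers(nmers, promoter_freq, background_freq):
    """Rank n-mers by the symmetric KL divergence between their
    occurrence frequencies in promoter vs. background samples. A toy
    version of divergence-based n-mer selection."""
    def score(m):
        p = [promoter_freq[m], 1 - promoter_freq[m]]
        q = [background_freq[m], 1 - background_freq[m]]
        return kl_divergence(p, q) + kl_divergence(q, p)
    return sorted(nmers, key=score, reverse=True)

# Hypothetical frequencies: 'TATA' is enriched in promoters, 'GCGC' is not.
ranked = rank_nmers(['TATA', 'GCGC'],
                    {'TATA': 0.4, 'GCGC': 0.1},
                    {'TATA': 0.1, 'GCGC': 0.1})
```

N-mers whose promoter and background frequencies differ most float to the top, which is the selection signal the sparse auto-encoders then compress further.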
Integrative Analysis of “-Omics” Data Using Penalty Functions
Zhao, Qing; Shi, Xingjie; Huang, Jian; Liu, Jin; Li, Yang; Ma, Shuangge
2014-01-01
In the analysis of omics data, integrative analysis provides an effective way of pooling information across multiple datasets or multiple correlated responses, and can be more effective than single-dataset (response) analysis. Multiple families of integrative analysis methods have been proposed in the literature. The current review focuses on the penalization methods. Special attention is paid to sparse meta-analysis methods that pool summary statistics across datasets, and integrative analysis methods that pool raw data across datasets. We discuss their formulation and rationale. Beyond “standard” penalized selection, we also review contrasted penalization and Laplacian penalization which accommodate finer data structures. The computational aspects, including computational algorithms and tuning parameter selection, are examined. This review concludes with possible limitations and extensions. PMID:25691921
Detection of selective sweeps in structured populations: a comparison of recent methods.
Vatsiou, Alexandra I; Bazin, Eric; Gaggiotti, Oscar E
2016-01-01
Identifying genomic regions targeted by positive selection has been a long-standing interest of evolutionary biologists. This objective was difficult to achieve until the recent emergence of next-generation sequencing, which is fostering the development of large-scale catalogues of genetic variation for increasing number of species. Several statistical methods have been recently developed to analyse these rich data sets, but there is still a poor understanding of the conditions under which these methods produce reliable results. This study aims at filling this gap by assessing the performance of genome-scan methods that consider explicitly the physical linkage among SNPs surrounding a selected variant. Our study compares the performance of seven recent methods for the detection of selective sweeps (iHS, nSL, EHHST, xp-EHH, XP-EHHST, XPCLR and hapFLK). We use an individual-based simulation approach to investigate the power and accuracy of these methods under a wide range of population models under both hard and soft sweeps. Our results indicate that XPCLR and hapFLK perform best and can detect soft sweeps under simple population structure scenarios if migration rate is low. All methods perform poorly with moderate-to-high migration rates, or with weak selection and very poorly under a hierarchical population structure. Finally, no single method is able to detect both starting and nearly completed selective sweeps. However, combining several methods (XPCLR or hapFLK with iHS or nSL) can greatly increase the power to pinpoint the selected region. © 2015 John Wiley & Sons Ltd.
Supply chain value creation methodology under BSC approach
NASA Astrophysics Data System (ADS)
Golrizgashti, Seyedehfatemeh
2014-06-01
The objective of this paper is to propose a developed balanced scorecard approach to measure supply chain performance, with the aim of creating more value in manufacturing and business operations. The most important metrics were selected based on experts' opinions acquired through in-depth interviews focused on creating more value for stakeholders. Using the factor analysis method, a survey was used to categorize the selected metrics into balanced scorecard perspectives. The result identifies the intensity of correlation between perspectives and the cause-and-effect chains among them using a statistical method, based on a real case study in the home appliance manufacturing industry.
NASA Astrophysics Data System (ADS)
Weiss, Jake; Newberg, Heidi Jo; Arsenault, Matthew; Bechtel, Torrin; Desell, Travis; Newby, Matthew; Thompson, Jeffery M.
2016-01-01
Statistical photometric parallax is a method for using the distribution of absolute magnitudes of stellar tracers to statistically recover the underlying density distribution of these tracers. In previous work, statistical photometric parallax was used to trace the Sagittarius Dwarf tidal stream, the so-called bifurcated piece of the Sagittarius stream, and the Virgo Overdensity through the Milky Way. We use an improved knowledge of this distribution in a new algorithm that accounts for the changes in the stellar population of color-selected stars near the photometric limit of the Sloan Digital Sky Survey (SDSS). Although we select bluer main sequence turnoff (MSTO) stars as tracers, large color errors near the survey limit cause many stars to be scattered out of our selection box and many fainter, redder stars to be scattered into it. We show that we are able to recover parameters for analogues of these streams in simulated data using a maximum likelihood optimization on MilkyWay@home. We also present the preliminary results of fitting the density distribution of major Milky Way tidal streams in SDSS data. This research is supported by generous gifts from the Marvin Clan, Babette Josephs, Manit Limlamai, and the MilkyWay@home volunteers.
Compressive strength of human openwedges: a selection method
NASA Astrophysics Data System (ADS)
Follet, H.; Gotteland, M.; Bardonnet, R.; Sfarghiu, A. M.; Peyrot, J.; Rumelhart, C.
2004-02-01
A series of 44 samples of bone wedges of human origin, intended for allograft openwedge osteotomy and obtained without particular precautions during hip arthroplasty, were re-examined. After chemical viral-inactivation treatment, lyophilisation and radio-sterilisation (intended to produce optimal health safety), the compressive strength, independent of age, sex and the height of the sample (or angle of cut), proved to be too widely dispersed [10–158 MPa] in the first study. We propose a method for selecting samples which takes into account their geometry (width, length, thicknesses, cortical surface area). Statistical methods (Principal Components Analysis (PCA), Hierarchical Cluster Analysis, multilinear regression) allowed final selection of 29 samples having a mean compressive strength σ_max = 103 ± 26 MPa, with variation [61–158 MPa]. These results are equivalent to or greater than the average for materials currently used in openwedge osteotomy.
Guisande, Cástor; Vari, Richard P; Heine, Jürgen; García-Roselló, Emilio; González-Dacosta, Jacinto; Perez-Schofield, Baltasar J García; González-Vilas, Luis; Pelayo-Villamil, Patricia
2016-09-12
We present and discuss VARSEDIG, an algorithm which identifies the morphometric features that significantly discriminate two taxa and validates the morphological distinctness between them via a Monte-Carlo test. VARSEDIG is freely available as a function of the RWizard application PlotsR (http://www.ipez.es/RWizard) and as an R package on CRAN. The variables selected by VARSEDIG with the overlap method were very similar to those selected by logistic regression and discriminant analysis, but the method overcomes some of their shortcomings. VARSEDIG is therefore a good alternative to current classical classification methods for identifying morphometric features that significantly discriminate a taxon and for validating its morphological distinctness from other taxa. As a demonstration of the potential of VARSEDIG for this purpose, we analyze morphological discrimination among some species of the Neotropical freshwater family Characidae.
2014-01-01
Background The UK Clinical Aptitude Test (UKCAT) was designed to address issues identified with traditional methods of selection. This study aims to examine the predictive validity of the UKCAT and compare this to traditional selection methods in the senior years of medical school. This was a follow-up study of two cohorts of students from two medical schools who had previously taken part in a study examining the predictive validity of the UKCAT in first year. Methods The sample consisted of 4th and 5th Year students who commenced their studies at the University of Aberdeen or University of Dundee medical schools in 2007. Data collected were: demographics (gender and age group), UKCAT scores; Universities and Colleges Admissions Service (UCAS) form scores; admission interview scores; Year 4 and 5 degree examination scores. Pearson’s correlations were used to examine the relationships between admissions variables, examination scores, gender and age group, and to select variables for multiple linear regression analysis to predict examination scores. Results Ninety-nine and 89 students at Aberdeen medical school from Years 4 and 5 respectively, and 51 Year 4 students in Dundee, were included in the analysis. Neither UCAS form nor interview scores were statistically significant predictors of examination performance. Conversely, the UKCAT yielded statistically significant validity coefficients between .24 and .36 in four of five assessments investigated. Multiple regression analysis showed the UKCAT made a statistically significant unique contribution to variance in examination performance in the senior years. Conclusions Results suggest the UKCAT appears to predict performance better in the later years of medical school compared to earlier years and provides modest supportive evidence for the UKCAT’s role in student selection within these institutions. 
Further research is needed to assess the predictive validity of the UKCAT against professional and behavioural outcomes as the cohort commences working life. PMID:24762134
Catto, James W F; Abbod, Maysam F; Wild, Peter J; Linkens, Derek A; Pilarsky, Christian; Rehman, Ishtiaq; Rosario, Derek J; Denzinger, Stefan; Burger, Maximilian; Stoehr, Robert; Knuechel, Ruth; Hartmann, Arndt; Hamdy, Freddie C
2010-03-01
New methods for identifying bladder cancer (BCa) progression are required. Gene expression microarrays can reveal insights into disease biology and identify novel biomarkers. However, these experiments produce large datasets that are difficult to interpret. To develop a novel method of microarray analysis combining two forms of artificial intelligence (AI): neurofuzzy modelling (NFM) and artificial neural networks (ANN) and validate it in a BCa cohort. We used AI and statistical analyses to identify progression-related genes in a microarray dataset (n=66 tumours, n=2800 genes). The AI-selected genes were then investigated in a second cohort (n=262 tumours) using immunohistochemistry. We compared the accuracy of AI and statistical approaches to identify tumour progression. AI identified 11 progression-associated genes (odds ratio [OR]: 0.70; 95% confidence interval [CI], 0.56-0.87; p=0.0004), and these were more discriminate than genes chosen using statistical analyses (OR: 1.24; 95% CI, 0.96-1.60; p=0.09). The expression of six AI-selected genes (LIG3, FAS, KRT18, ICAM1, DSG2, and BRCA2) was determined using commercial antibodies and successfully identified tumour progression (concordance index: 0.66; log-rank test: p=0.01). AI-selected genes were more discriminate than pathologic criteria at determining progression (Cox multivariate analysis: p=0.01). Limitations include the use of statistical correlation to identify 200 genes for AI analysis and that we did not compare regression identified genes with immunohistochemistry. AI and statistical analyses use different techniques of inference to determine gene-phenotype associations and identify distinct prognostic gene signatures that are equally valid. We have identified a prognostic gene signature whose members reflect a variety of carcinogenic pathways that could identify progression in non-muscle-invasive BCa. 2009 European Association of Urology. Published by Elsevier B.V. All rights reserved.
NASA Astrophysics Data System (ADS)
Ohyanagi, S.; Dileonardo, C.
2013-12-01
As a natural phenomenon, earthquake occurrence is difficult to predict. Statistical analysis of earthquake data was performed using candlestick chart and Bollinger Band methods. These statistical methods, commonly used in the financial world to analyze market trends, were tested against earthquake data. Earthquakes above Mw 4.0 located off the shore of Sanriku (37.75°N ~ 41.00°N, 143.00°E ~ 144.50°E) from February 1973 to May 2013 were selected for analysis. Two specific patterns in earthquake occurrence were recognized through the analysis. One is a spread of the candlestick prior to the occurrence of events greater than Mw 6.0. A second pattern shows convergence in the Bollinger Band, which implies a positive or negative change in the trend of earthquake occurrence. Both patterns match general models for the buildup and release of strain through the earthquake cycle, and agree with the characteristics of both the candlestick chart and the Bollinger Band analysis. These results show a high correlation between patterns in earthquake occurrence and trend analysis by these two statistical methods, and support the appropriateness of applying these financial analysis methods to the analysis of earthquake occurrence.
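Bollinger Bands are simply a moving average bracketed by bands at ±k standard deviations of the same window; band convergence means the windowed standard deviation is shrinking. A plain sketch (not the authors' exact windowing or series construction) follows.

```python
def bollinger_bands(values, window, k=2.0):
    """Moving average with bands at +/- k windowed standard deviations.
    Returns (mid, lower, upper) lists with one entry per full window.
    A generic sketch of the financial-charting statistic applied in the
    abstract to earthquake time series."""
    mids, lows, ups = [], [], []
    for i in range(window - 1, len(values)):
        win = values[i - window + 1:i + 1]
        m = sum(win) / window
        sd = (sum((x - m) ** 2 for x in win) / window) ** 0.5
        mids.append(m)
        lows.append(m - k * sd)
        ups.append(m + k * sd)
    return mids, lows, ups
```

Convergence of `lows` toward `ups` over time is the "band squeeze" pattern the abstract associates with an impending change in the earthquake-occurrence trend.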
Identification of atypical flight patterns
NASA Technical Reports Server (NTRS)
Statler, Irving C. (Inventor); Ferryman, Thomas A. (Inventor); Amidan, Brett G. (Inventor); Whitney, Paul D. (Inventor); White, Amanda M. (Inventor); Willse, Alan R. (Inventor); Cooley, Scott K. (Inventor); Jay, Joseph Griffith (Inventor); Lawrence, Robert E. (Inventor); Mosbrucker, Chris (Inventor)
2005-01-01
Method and system for analyzing aircraft data, including multiple selected flight parameters for a selected phase of a selected flight, and for determining when the selected phase of the selected flight is atypical, when compared with corresponding data for the same phase for other similar flights. A flight signature is computed using continuous-valued and discrete-valued flight parameters for the selected flight parameters and is optionally compared with a statistical distribution of other observed flight signatures, yielding atypicality scores for the same phase for other similar flights. A cluster analysis is optionally applied to the flight signatures to define an optimal collection of clusters. A level of atypicality for a selected flight is estimated, based upon an index associated with the cluster analysis.
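A simplified stand-in for scoring how atypical one flight signature is relative to the fleet distribution is to aggregate per-feature z-scores into a mean squared z. The patent's actual signature construction, distribution comparison, and cluster analysis are richer; names and the scoring rule here are illustrative assumptions.

```python
def atypicality_scores(signatures):
    """Atypicality of each flight signature (a list of numeric features)
    relative to all signatures: mean squared z-score across features.
    Higher scores flag flights whose parameters sit far from the fleet
    average. A toy stand-in for the patent's distribution comparison."""
    n, k = len(signatures), len(signatures[0])
    means = [sum(s[j] for s in signatures) / n for j in range(k)]
    sds = [max((sum((s[j] - means[j]) ** 2 for s in signatures) / n) ** 0.5,
               1e-12)  # guard against zero-variance features
           for j in range(k)]
    return [sum(((s[j] - means[j]) / sds[j]) ** 2 for j in range(k)) / k
            for s in signatures]

# Hypothetical 2-feature signatures; the last flight is an outlier:
scores = atypicality_scores([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                             [1.0, 1.0], [10.0, 10.0]])
```

A full implementation would use the covariance structure (e.g., a Mahalanobis distance) and cluster-relative scores, as the patent describes.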
A probabilistic method for testing and estimating selection differences between populations.
He, Yungang; Wang, Minxian; Huang, Xin; Li, Ran; Xu, Hongyang; Xu, Shuhua; Jin, Li
2015-12-01
Human populations around the world encounter various environmental challenges and, consequently, develop genetic adaptations to different selection forces. Identifying the differences in natural selection between populations is critical for understanding the roles of specific genetic variants in evolutionary adaptation. Although numerous methods have been developed to detect genetic loci under recent directional selection, a probabilistic solution for testing and quantifying selection differences between populations is lacking. Here we report the development of a probabilistic method for testing and estimating selection differences between populations. By use of a probabilistic model of genetic drift and selection, we showed that logarithm odds ratios of allele frequencies provide estimates of the differences in selection coefficients between populations. The estimates approximate a normal distribution, and variance can be estimated using genome-wide variants. This allows us to quantify differences in selection coefficients and to determine the confidence intervals of the estimate. Our work also revealed the link between genetic association testing and hypothesis testing of selection differences. It therefore supplies a solution for hypothesis testing of selection differences. This method was applied to a genome-wide data analysis of Han and Tibetan populations. The results confirmed that both the EPAS1 and EGLN1 genes are under statistically different selection in Han and Tibetan populations. We further estimated differences in the selection coefficients for genetic variants involved in melanin formation and determined their confidence intervals between continental population groups. Application of the method to empirical data demonstrated the outstanding capability of this novel approach for testing and quantifying differences in natural selection. © 2015 He et al.; Published by Cold Spring Harbor Laboratory Press.
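The core estimator, a log odds ratio of allele frequencies with a confidence interval, can be sketched with the standard Wald approximation. Here the textbook 1/cell-count variance stands in for the genome-wide variance calibration described in the abstract; names are illustrative.

```python
import math

def selection_difference(k1, n1, k2, n2):
    """Log odds ratio of allele counts in two populations (k successes
    out of n chromosomes each), with a Wald 95% confidence interval.
    Per the abstract's idea, the log OR estimates the difference in
    selection coefficients; the simple 1/cell variance used here is an
    illustrative stand-in for the genome-wide estimate."""
    log_or = math.log((k1 / (n1 - k1)) / (k2 / (n2 - k2)))
    se = math.sqrt(1 / k1 + 1 / (n1 - k1) + 1 / k2 + 1 / (n2 - k2))
    return log_or, (log_or - 1.96 * se, log_or + 1.96 * se)

# Identical allele frequencies -> no selection difference:
est, ci = selection_difference(50, 100, 50, 100)
```

A confidence interval excluding zero is the hypothesis-test criterion for a statistically different selection regime, as applied to EPAS1 and EGLN1 in the Han/Tibetan analysis.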
NASA Technical Reports Server (NTRS)
Coggeshall, M. E.; Hoffer, R. M.
1973-01-01
Remote sensing equipment and automatic data processing techniques were employed as aids in the institution of improved forest resource management methods. On the basis of automatically calculated statistics derived from manually selected training samples, the feature selection processor of LARSYS selected, upon consideration of various groups of the four available spectral regions, a series of channel combinations whose automatic classification performances (for six cover types, including both deciduous and coniferous forest) were tested, analyzed, and further compared with automatic classification results obtained from digitized color infrared photography.
RAId_aPS: MS/MS Analysis with Multiple Scoring Functions and Spectrum-Specific Statistics
Alves, Gelio; Ogurtsov, Aleksey Y.; Yu, Yi-Kuo
2010-01-01
Statistically meaningful comparison/combination of peptide identification results from various search methods is impeded by the lack of a universal statistical standard. Providing an E-value calibration protocol, we demonstrated earlier the feasibility of translating either the score or heuristic E-value reported by any method into the textbook-defined E-value, which may serve as the universal statistical standard. This protocol, although robust, may lose spectrum-specific statistics and might require a new calibration when changes in experimental setup occur. To mitigate these issues, we developed a new MS/MS search tool, RAId_aPS, that is able to provide spectrum-specific E-values for additive scoring functions. Given a selection of scoring functions out of RAId score, K-score, Hyperscore, and XCorr, RAId_aPS generates the corresponding score histograms of all possible peptides using dynamic programming. Using these score histograms to assign E-values enables a calibration-free protocol for accurate significance assignment for each scoring function. RAId_aPS features four different modes: (i) compute the total number of possible peptides for a given molecular mass range, (ii) generate the score histogram given a MS/MS spectrum and a scoring function, (iii) reassign E-values for a list of candidate peptides given a MS/MS spectrum and the scoring functions chosen, and (iv) perform database searches using selected scoring functions. In modes (iii) and (iv), RAId_aPS is also capable of combining results from different scoring functions using spectrum-specific statistics. The web link is http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/raid_aps/index.html. Relevant binaries for Linux, Windows, and Mac OS X are available from the same page. PMID:21103371
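The dynamic-programming construction of a score histogram for an additive scoring function can be illustrated with a toy example. This is a sketch only: real scoring functions operate over peptide masses and much larger alphabets, and the integer residue scores here are hypothetical.

```python
from collections import Counter

def score_histogram(residue_scores, length):
    """Exact score histogram over all possible peptides of a given
    length, for an additive scoring function (each residue contributes
    independently). Built by dynamic-programming convolution, one
    residue position at a time."""
    hist = Counter({0: 1})  # empty peptide has score 0
    for _ in range(length):
        nxt = Counter()
        for s, count in hist.items():
            for r in residue_scores.values():
                nxt[s + r] += count
        hist = nxt
    return hist

def p_value(hist, observed):
    """Spectrum-specific p-value: fraction of all possible peptides
    scoring at least as high as the observed candidate."""
    total = sum(hist.values())
    tail = sum(c for s, c in hist.items() if s >= observed)
    return tail / total

# toy integer residue scores (hypothetical, three-letter alphabet)
scores = {"A": 1, "G": 2, "S": 3}
h = score_histogram(scores, length=4)
p = p_value(h, observed=9)
```

Because the histogram enumerates all 3^4 = 81 length-4 peptides exactly, the resulting p-value needs no calibration, which is the point of the spectrum-specific approach.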
Li, Qizhai; Hu, Jiyuan; Ding, Juan; Zheng, Gang
2014-04-01
A classical approach to combining independent test statistics is Fisher's combination of p-values, which follows the chi-squared distribution. When the test statistics are dependent, the gamma distribution (GD) is commonly used for the Fisher's combination test (FCT). We propose to use two generalizations of the GD: the generalized and the exponentiated GDs. We study some properties of misusing the GD for the FCT to combine dependent statistics when one of the two proposed distributions is true. Our results show that both generalizations have better control of type I error rates than the GD, which tends to have inflated type I error rates at more extreme tails. In practice, common model selection criteria (e.g., the Akaike or Bayesian information criterion) can be used to help select a better distribution to use for the FCT. A simple strategy for applying the two generalizations of the GD in genome-wide association studies is discussed. Applications of the results to genetic pleiotropic associations are described, where multiple traits are tested for association with a single marker.
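Fisher's combination test, and the gamma-distribution variant used for dependent statistics, can be sketched with standard-library code. The gamma scale parameter below is a placeholder assumption, not a moment-matched value from the paper:

```python
import math

def chi2_sf_even_df(x, k):
    """Survival function of a chi-square with 2k degrees of freedom
    (closed form available for even df): P(X >= x)."""
    half = x / 2.0
    return math.exp(-half) * sum(half**i / math.factorial(i) for i in range(k))

def fisher_combination(pvals):
    """Fisher's statistic T = -2 * sum(log p_i); under independence
    T ~ chi-square with 2k degrees of freedom."""
    t = -2.0 * sum(math.log(p) for p in pvals)
    return t, chi2_sf_even_df(t, len(pvals))

def gamma_sf_int_shape(x, shape, scale):
    """Gamma survival function for integer shape, the kind of reference
    distribution used for dependent statistics (the scale here is a
    hypothetical value; in practice it is fitted to the data)."""
    y = x / scale
    return math.exp(-y) * sum(y**i / math.factorial(i) for i in range(shape))

t, p_indep = fisher_combination([0.01, 0.04, 0.20])
p_dep = gamma_sf_int_shape(t, shape=3, scale=2.5)
```

Note that p_dep > p_indep here: treating dependent statistics as independent inflates significance, which is the tail-behavior problem the two GD generalizations aim to control.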
Analysis of Information Content in High-Spectral Resolution Sounders using Subset Selection Analysis
NASA Technical Reports Server (NTRS)
Velez-Reyes, Miguel; Joiner, Joanna
1998-01-01
In this paper, we summarize the results of the sensitivity analysis and data reduction carried out to determine the information content of AIRS and IASI channels. The analysis and data reduction were based on subset selection techniques developed in the linear algebra and statistics communities to study linear dependencies in high dimensional data sets. We applied the subset selection method to study dependency among channels by studying the dependency among their weighting functions. We also applied the technique to study the information provided by the different levels into which the atmosphere is discretized for retrievals and analysis. Results from the method correlate well with intuition in many respects and point to possible modifications for band selection in sensor design and for the number and location of levels in the analysis process.
ERIC Educational Resources Information Center
van der Linden, Wim J.; Scrams, David J.; Schnipke, Deborah L.
This paper proposes an item selection algorithm that can be used to neutralize the effect of time limits in computer adaptive testing. The method is based on a statistical model for the response-time distributions of the test takers on the items in the pool that is updated each time a new item has been administered. Predictions from the model are…
Minimizing Statistical Bias with Queries.
1995-09-14
method for optimally selecting these points would offer enormous savings in time and money. An active learning system will typically attempt to select data...research in active learning assumes that the second term of Equation 2 is approximately zero, that is, that the learner is unbiased. If this is the case...outperforms the variance-minimizing algorithm and random exploration. and effective strategy for active learning. I have given empirical evidence that, with
A Seasonal Time-Series Model Based on Gene Expression Programming for Predicting Financial Distress
Cheng, Ching-Hsue; Chan, Chia-Pang; Yang, Jun-He
2018-01-01
Financial distress prediction is an important and challenging research topic in the financial field. Many methods exist for predicting firm bankruptcy and financial crisis, including artificial intelligence and traditional statistical methods, and past studies have shown that artificial intelligence methods outperform traditional statistical ones. Financial statements are quarterly reports; hence, the financial crisis of companies is seasonal time-series data, and the attribute data affecting the financial distress of companies are nonlinear and nonstationary time-series data with fluctuations. Therefore, this study employed a nonlinear attribute selection method to build a nonlinear financial distress prediction model: this paper proposes a novel seasonal time-series gene expression programming model for predicting the financial distress of companies. The proposed model has several advantages: (i) unlike previous models, it incorporates the concept of time series; (ii) the proposed integrated attribute selection method can find the core attributes and reduce high dimensional data; and (iii) the model can generate rules and mathematical formulas of financial distress, providing references to investors and decision makers. The results show that the proposed method outperforms the compared classifiers under three criteria; hence, the proposed model has competitive advantages in predicting the financial distress of companies. PMID:29765399
Bostanov, Vladimir; Kotchoubey, Boris
2006-12-01
This study was aimed at developing a method for extraction and assessment of event-related brain potentials (ERP) from single-trials. This method should be applicable in the assessment of single persons' ERPs and should be able to handle both single ERP components and whole waveforms. We adopted a recently developed ERP feature extraction method, the t-CWT, for the purposes of hypothesis testing in the statistical assessment of ERPs. The t-CWT is based on the continuous wavelet transform (CWT) and Student's t-statistics. The method was tested in two ERP paradigms, oddball and semantic priming, by assessing individual-participant data on a single-trial basis, and testing the significance of selected ERP components, P300 and N400, as well as of whole ERP waveforms. The t-CWT was also compared to other univariate and multivariate ERP assessment methods: peak picking, area computation, discrete wavelet transform (DWT) and principal component analysis (PCA). The t-CWT produced better results than all of the other assessment methods it was compared with. The t-CWT can be used as a reliable and powerful method for ERP-component detection and testing of statistical hypotheses concerning both single ERP components and whole waveforms extracted from either single persons' or group data. The t-CWT is the first such method based explicitly on the criteria of maximal statistical difference between two average ERPs in the time-frequency domain and is particularly suitable for ERP assessment of individual data (e.g. in clinical settings), but also for the investigation of small and/or novel ERP effects from group data.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zhen, X; Chen, H; Liao, Y
Purpose: To study the feasibility of employing deformable registration methods for accurate calculation of rectum dose-volume parameters and their potential to reveal differences in rectum dose-toxicity between complication and non-complication cervical cancer patients treated with brachytherapy. Methods and Materials: Data from 60 patients treated with BT, including planning images, treatment plans, and follow-up clinical exams, were retrospectively collected. Among them, 12 patients who complained of hematochezia were further examined with colonoscopy and scored as Grade 1-3 complication (CP). Meanwhile, another 12 non-complication (NCP) patients were selected as a reference group. To seek potential gains in rectum toxicity prediction when fractional anatomical deformations are accounted for, the rectum dose-volume parameters D0.1/1/2cc of the selected patients were retrospectively computed by three different approaches: the simple "worst-case scenario" (WS) addition method, an intensity-based deformable image registration (DIR) algorithm (Demons), and a more accurate, recently developed local-topology-preserving non-rigid point matching algorithm (TOP). Statistical significance of the differences between rectum doses of the CP and NCP groups was tested by a two-tailed t-test, and results were considered statistically significant if p < 0.05. Results: For D0.1cc, no statistical differences were found between the CP and NCP groups with any of the three methods. For D1cc, no dose difference was detected by the WS method; however, statistical differences between the two groups were observed with both Demons and TOP, more evidently with TOP. For D2cc, differences between the CP and NCP cases were statistically significant for all three methods, but more pronounced with TOP. Conclusion: In this study, we calculated the rectum D0.1/1/2cc by simple WS addition and two DIR methods to seek gains in rectum toxicity prediction. The results favor the claim that accurate dose deformation and summation tend to be more sensitive in unveiling the dose-toxicity relationship. This work is supported in part by a grant from Varian Medical Systems Inc., the National Natural Science Foundation of China (nos. 81428019 and 81301940), the Guangdong Natural Science Foundation (2015A030313302), and the 2015 Pearl River S&T Nova Program of Guangzhou (201506010096).
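The kind of two-tailed two-sample comparison used above can be sketched as a Welch t-test (unequal variances); the dose values below are hypothetical, not from the study:

```python
import math

def welch_t(x, y):
    """Two-sample Welch t statistic and Welch-Satterthwaite degrees of
    freedom, for comparing two groups with unequal variances."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)  # sample variances
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    se2 = vx / nx + vy / ny
    t = (mx - my) / math.sqrt(se2)
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df

# hypothetical D2cc doses (Gy) for complication vs. non-complication groups
t, df = welch_t([78.2, 81.5, 79.9, 83.0], [72.1, 74.3, 73.5, 75.0])
```

The t statistic would then be compared against the t distribution with df degrees of freedom at the 0.05 two-tailed level.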
Labor Productivity Standards in Texas School Foodservice Operations
ERIC Educational Resources Information Center
Sherrin, A. Rachelle; Bednar, Carolyn; Kwon, Junehee
2009-01-01
Purpose: Purpose of this research was to investigate utilization of labor productivity standards and variables that affect productivity in Texas school foodservice operations. Methods: A questionnaire was developed, validated, and pilot tested, then mailed to 200 randomly selected Texas school foodservice directors. Descriptive statistics for…
Linear regression analysis: part 14 of a series on evaluation of scientific publications.
Schneider, Astrid; Hommel, Gerhard; Blettner, Maria
2010-11-01
Regression analysis is an important statistical method for the analysis of medical data. It enables the identification and characterization of relationships among multiple factors. It also enables the identification of prognostically relevant risk factors and the calculation of risk scores for individual prognostication. This article is based on selected textbooks of statistics, a selective review of the literature, and our own experience. After a brief introduction of the uni- and multivariable regression models, illustrative examples are given to explain what the important considerations are before a regression analysis is performed, and how the results should be interpreted. The reader should then be able to judge whether the method has been used correctly and interpret the results appropriately. The performance and interpretation of linear regression analysis are subject to a variety of pitfalls, which are discussed here in detail. The reader is made aware of common errors of interpretation through practical examples. Both the opportunities for applying linear regression analysis and its limitations are presented.
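As a minimal illustration of the univariable model discussed above, ordinary least squares for one predictor reduces to two closed-form estimates. The blood-pressure numbers are hypothetical:

```python
def ols_fit(x, y):
    """Simple (univariable) linear regression: slope and intercept
    minimizing the sum of squared residuals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)            # spread of the predictor
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))  # co-variation
    slope = sxy / sxx
    return slope, my - slope * mx

# hypothetical example: systolic blood pressure (mmHg) vs. age (years)
age = [30, 40, 50, 60, 70]
sbp = [120, 126, 133, 139, 146]
slope, intercept = ols_fit(age, sbp)
```

Interpreting the slope (here, mmHg per year of age) rather than just reporting it is exactly the kind of step the article walks readers through.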
Methods for estimating selected low-flow frequency statistics for unregulated streams in Kentucky
Martin, Gary R.; Arihood, Leslie D.
2010-01-01
This report provides estimates of, and presents methods for estimating, selected low-flow frequency statistics for unregulated streams in Kentucky including the 30-day mean low flows for recurrence intervals of 2 and 5 years (30Q2 and 30Q5) and the 7-day mean low flows for recurrence intervals of 5, 10, and 20 years (7Q2, 7Q10, and 7Q20). Estimates of these statistics are provided for 121 U.S. Geological Survey streamflow-gaging stations with data through the 2006 climate year, which is the 12-month period ending March 31 of each year. Data were screened to identify the periods of homogeneous, unregulated flows for use in the analyses. Logistic-regression equations are presented for estimating the annual probability of the selected low-flow frequency statistics being equal to zero. Weighted-least-squares regression equations were developed for estimating the magnitude of the nonzero 30Q2, 30Q5, 7Q2, 7Q10, and 7Q20 low flows. Three low-flow regions were defined for estimating the 7-day low-flow frequency statistics. The explicit explanatory variables in the regression equations include total drainage area and the mapped streamflow-variability index measured from a revised statewide coverage of this characteristic. The percentage of the station low-flow statistics correctly classified as zero or nonzero by use of the logistic-regression equations ranged from 87.5 to 93.8 percent. The average standard errors of prediction of the weighted-least-squares regression equations ranged from 108 to 226 percent. The 30Q2 regression equations have the smallest standard errors of prediction, and the 7Q20 regression equations have the largest standard errors of prediction. The regression equations are applicable only to stream sites with low flows unaffected by regulation from reservoirs and local diversions of flow and to drainage basins in specified ranges of basin characteristics. 
Caution is advised when applying the equations for basins with characteristics near the applicable limits and for basins with karst drainage features.
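The two-stage structure described above, a logistic equation for the probability that a low-flow statistic is zero plus a regression equation for the nonzero magnitude, can be sketched as follows. The functional forms and all coefficients here are illustrative assumptions, not the report's fitted equations:

```python
import math

def prob_zero_flow(drainage_area, svi, b0, b1, b2):
    """Logistic-regression form for the annual probability that a
    low-flow statistic equals zero, driven by drainage area and the
    streamflow-variability index (SVI). Coefficients are hypothetical."""
    eta = b0 + b1 * math.log10(drainage_area) + b2 * svi
    return 1.0 / (1.0 + math.exp(-eta))

def estimate_low_flow(drainage_area, svi, a, b, c):
    """Log-linear regression form often used for a nonzero low-flow
    magnitude: Q = 10^a * DA^b * 10^(c*SVI). Coefficients are hypothetical."""
    return 10 ** a * drainage_area ** b * 10 ** (c * svi)

# hypothetical basin: 25 sq. mi. drainage area, SVI of 0.6
p0 = prob_zero_flow(25.0, 0.6, b0=-2.0, b1=-1.5, b2=4.0)
q = estimate_low_flow(25.0, 0.6, a=-0.5, b=1.0, c=-1.2)
```

In use, the logistic stage classifies the site as zero/nonzero, and the magnitude equation is applied only when a nonzero flow is predicted.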
Sparse High Dimensional Models in Economics
Fan, Jianqing; Lv, Jinchi; Qi, Lei
2010-01-01
This paper reviews the literature on sparse high dimensional models and discusses some applications in economics and finance. Recent developments of theory, methods, and implementations in penalized least squares and penalized likelihood methods are highlighted. These variable selection methods are proved to be effective in high dimensional sparse modeling. The limits of dimensionality that regularization methods can handle, the role of penalty functions, and their statistical properties are detailed. Some recent advances in ultra-high dimensional sparse modeling are also briefly discussed. PMID:22022635
Leyrat, Clémence; Caille, Agnès; Foucher, Yohann; Giraudeau, Bruno
2016-01-22
Despite randomization, baseline imbalance and confounding bias may occur in cluster randomized trials (CRTs). Covariate imbalance may jeopardize the validity of statistical inferences if it occurs on prognostic factors. Thus, the diagnosis of such an imbalance is essential to adjust the statistical analysis if required. We developed a tool based on the c-statistic of the propensity score (PS) model to detect global baseline covariate imbalance in CRTs and assess the risk of confounding bias. We performed a simulation study to assess the performance of the proposed tool and applied this method to analyze the data from 2 published CRTs. The proposed method had good performance for large sample sizes (n = 500 per arm) and when the number of unbalanced covariates was not too small compared with the total number of baseline covariates (≥40% of unbalanced covariates). We also provide a strategy for preselection of the covariates to be included in the PS model to enhance imbalance detection. The proposed tool could be useful in deciding whether covariate adjustment is required before performing statistical analyses of CRTs.
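The c-statistic of a PS model is the AUC of the fitted propensity scores against arm membership: the probability that a randomly chosen unit from one arm has a higher score than one from the other. A pairwise sketch, with hypothetical fitted scores:

```python
def c_statistic(scores_arm_a, scores_arm_b):
    """C-statistic (AUC) of a propensity-score model: fraction of
    cross-arm pairs in which the arm-A unit has the higher score
    (ties count one half). Values near 0.5 suggest the covariates
    cannot separate the arms (balance); values well above 0.5 flag
    global baseline imbalance."""
    n_pairs = concordant = ties = 0
    for a in scores_arm_a:
        for b in scores_arm_b:
            n_pairs += 1
            if a > b:
                concordant += 1
            elif a == b:
                ties += 1
    return (concordant + 0.5 * ties) / n_pairs

# hypothetical fitted propensity scores in the two arms of a CRT
c = c_statistic([0.55, 0.62, 0.70, 0.58], [0.45, 0.50, 0.40, 0.57])
```

A value this far from 0.5 would suggest adjusting the analysis for baseline covariates.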
Ziegeweid, Jeffrey R.; Lorenz, David L.; Sanocki, Chris A.; Czuba, Christiana R.
2015-12-24
Equations developed in this study apply only to stream locations where flows are not substantially affected by regulation, diversion, or urbanization. All equations presented in this study will be incorporated into StreamStats, a web-based geographic information system tool developed by the U.S. Geological Survey. StreamStats allows users to obtain streamflow statistics, basin characteristics, and other information for user-selected locations on streams through an interactive map.
Ross, Joseph S; Bates, Jonathan; Parzynski, Craig S; Akar, Joseph G; Curtis, Jeptha P; Desai, Nihar R; Freeman, James V; Gamble, Ginger M; Kuntz, Richard; Li, Shu-Xia; Marinac-Dabic, Danica; Masoudi, Frederick A; Normand, Sharon-Lise T; Ranasinghe, Isuru; Shaw, Richard E; Krumholz, Harlan M
2017-01-01
Machine learning methods may complement traditional analytic methods for medical device surveillance. Using data from the National Cardiovascular Data Registry for implantable cardioverter-defibrillators (ICDs) linked to Medicare administrative claims for longitudinal follow-up, we applied three statistical approaches to safety-signal detection for commonly used dual-chamber ICDs that used two propensity score (PS) models: one specified by subject-matter experts (PS-SME), and the other by machine learning-based selection (PS-ML). The first approach used PS-SME and cumulative incidence (time-to-event), the second approach used PS-SME and cumulative risk (Data Extraction and Longitudinal Trend Analysis [DELTA]), and the third approach used PS-ML and cumulative risk (embedded feature selection). Safety-signal surveillance was conducted for eleven dual-chamber ICD models implanted at least 2,000 times over 3 years. Between 2006 and 2010, there were 71,948 Medicare fee-for-service beneficiaries who received dual-chamber ICDs. Cumulative device-specific unadjusted 3-year event rates varied for three surveyed safety signals: death from any cause, 12.8%-20.9%; nonfatal ICD-related adverse events, 19.3%-26.3%; and death from any cause or nonfatal ICD-related adverse event, 27.1%-37.6%. Agreement among safety signals detected/not detected between the time-to-event and DELTA approaches was 90.9% (360 of 396, κ = 0.068), between the time-to-event and embedded feature-selection approaches was 91.7% (363 of 396, κ = -0.028), and between the DELTA and embedded feature-selection approaches was 88.1% (349 of 396, κ = -0.042). Three statistical approaches, including one machine learning method, identified important safety signals, but without exact agreement. Ensemble methods may be needed to detect all safety signals for further evaluation during medical device surveillance.
Swiderska, Zaneta; Korzynska, Anna; Markiewicz, Tomasz; Lorent, Malgorzata; Zak, Jakub; Wesolowska, Anna; Roszkowiak, Lukasz; Slodkowska, Janina; Grala, Bartlomiej
2015-01-01
Background. This paper presents the study concerning hot-spot selection in the assessment of whole slide images of tissue sections collected from meningioma patients. The samples were immunohistochemically stained to determine the Ki-67/MIB-1 proliferation index used for prognosis and treatment planning. Objective. The observer performance was examined by comparing results of the proposed method of automatic hot-spot selection in whole slide images, results of traditional scoring under a microscope, and results of a pathologist's manual hot-spot selection. Methods. The results of scoring the Ki-67 index using optical scoring under a microscope, software for Ki-67 index quantification based on hot spots selected by two pathologists (resp., once and three times), and the same software but on hot spots selected by proposed automatic methods were compared using Kendall's tau-b statistics. Results. Results show intra- and interobserver agreement. The agreement between Ki-67 scoring with manual and automatic hot-spot selection is high, while agreement between Ki-67 index scoring results in whole slide images and traditional microscopic examination is lower. Conclusions. The agreement observed for the three scoring methods shows that automation of area selection is an effective tool in supporting physicians and in increasing the reliability of Ki-67 scoring in meningioma.
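Kendall's tau-b, the agreement statistic used above, can be computed directly from pairwise comparisons, with a tie correction in the denominator. The observer scores below are hypothetical:

```python
def kendall_tau_b(x, y):
    """Kendall's tau-b between two scorings of the same cases,
    handling ties in either variable via the tie-corrected denominator."""
    n = len(x)
    concordant = discordant = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                ties_x += 1; ties_y += 1
            elif dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1   # pair ordered the same way by both
            else:
                discordant += 1   # pair ordered oppositely
    n0 = n * (n - 1) / 2
    denom = ((n0 - ties_x) * (n0 - ties_y)) ** 0.5
    return (concordant - discordant) / denom

# hypothetical Ki-67 index scores (%) from two observers on five slides
tau = kendall_tau_b([5, 8, 12, 12, 20], [6, 7, 11, 13, 18])
```

Values near 1 indicate the strong interobserver agreement reported for the manual and automatic hot-spot selections.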
Sharpening method of satellite thermal image based on the geographical statistical model
NASA Astrophysics Data System (ADS)
Qi, Pengcheng; Hu, Shixiong; Zhang, Haijun; Guo, Guangmeng
2016-04-01
To improve the effectiveness of thermal sharpening in mountainous regions, paying more attention to the laws of land surface energy balance, a thermal sharpening method based on the geographical statistical model (GSM) is proposed. Explanatory variables were selected from the processes of land surface energy budget and thermal infrared electromagnetic radiation transmission, then high spatial resolution (57 m) raster layers were generated for these variables through spatially simulating or using other raster data as proxies. Based on this, the local adaptation statistical relationship between brightness temperature (BT) and the explanatory variables, i.e., the GSM, was built at 1026-m resolution using the method of multivariate adaptive regression splines. Finally, the GSM was applied to the high-resolution (57-m) explanatory variables; thus, the high-resolution (57-m) BT image was obtained. This method produced a sharpening result with low error and good visual effect. The method can avoid the blind choice of explanatory variables and remove the dependence on synchronous imagery at visible and near-infrared bands. The influences of the explanatory variable combination, sampling method, and the residual error correction on sharpening results were analyzed deliberately, and their influence mechanisms are reported herein.
Wingate, Peter H; Thornton, George C; McIntyre, Kelly S; Frame, Jennifer H
2003-02-01
The present study examined relationships between reduction-in-force (RIF) personnel practices, presentation of statistical evidence, and litigation outcomes. Policy capturing methods were utilized to analyze the components of 115 federal district court opinions involving age discrimination disparate treatment allegations and organizational downsizing. Univariate analyses revealed meaningful links between RIF personnel practices, use of statistical evidence, and judicial verdict. The defendant organization was awarded summary judgment in 73% of the claims included in the study. Judicial decisions in favor of the defendant organization were found to be significantly related to such variables as formal performance appraisal systems, termination decision review within the organization, methods of employee assessment and selection for termination, and the presence of a concrete layoff policy. The use of statistical evidence in ADEA disparate treatment litigation was investigated and found to be a potentially persuasive type of indirect evidence. Legal, personnel, and evidentiary ramifications are reviewed, and a framework of downsizing mechanics emphasizing legal defensibility is presented.
A likelihood ratio test for evolutionary rate shifts and functional divergence among proteins
Knudsen, Bjarne; Miyamoto, Michael M.
2001-01-01
Changes in protein function can lead to changes in the selection acting on specific residues. This can often be detected as evolutionary rate changes at the sites in question. A maximum-likelihood method for detecting evolutionary rate shifts at specific protein positions is presented. The method determines significance values of the rate differences to give a sound statistical foundation for the conclusions drawn from the analyses. A statistical test for detecting slowly evolving sites is also described. The methods are applied to a set of Myc proteins for the identification of both conserved sites and those with changing evolutionary rates. Those positions with conserved and changing rates are related to the structures and functions of their proteins. The results are compared with an earlier Bayesian method, thereby highlighting the advantages of the new likelihood ratio tests. PMID:11734650
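A generic likelihood ratio test of the kind described (rate-shift model vs. equal-rate null) reduces to comparing twice the log-likelihood difference with a chi-square distribution. The log-likelihood values below are hypothetical, and one degree of freedom is assumed for the single extra rate parameter:

```python
import math

def lrt_p_value(loglik_null, loglik_alt):
    """Likelihood ratio test with one degree of freedom:
    statistic = 2 * (lnL_alt - lnL_null); the chi-square(1) survival
    function is P(X >= s) = erfc(sqrt(s/2))."""
    stat = 2.0 * (loglik_alt - loglik_null)
    p = math.erfc(math.sqrt(stat / 2.0))
    return stat, p

# hypothetical log-likelihoods: equal-rate null vs. rate-shift alternative
stat, p = lrt_p_value(loglik_null=-1234.6, loglik_alt=-1231.2)
```

A significant p-value at a site would indicate an evolutionary rate shift, giving the statistical foundation the abstract emphasizes over a purely Bayesian ranking.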
NASA Astrophysics Data System (ADS)
Attallah, Bilal; Serir, Amina; Chahir, Youssef; Boudjelal, Abdelwahhab
2017-11-01
Palmprint recognition systems are dependent on feature extraction. A method of feature extraction using higher discrimination information was developed to characterize palmprint images. In this method, two individual feature extraction techniques are applied to a discrete wavelet transform of a palmprint image, and their outputs are fused. The two techniques used in the fusion are the histogram of gradient and the binarized statistical image features. They are then evaluated using an extreme learning machine classifier before selecting a feature based on principal component analysis. Three palmprint databases, the Hong Kong Polytechnic University (PolyU) Multispectral Palmprint Database, Hong Kong PolyU Palmprint Database II, and the Delhi Touchless (IIDT) Palmprint Database, are used in this study. The study shows that our method effectively identifies and verifies palmprints and outperforms other methods based on feature extraction.
The effect of feature selection methods on computer-aided detection of masses in mammograms
NASA Astrophysics Data System (ADS)
Hupse, Rianne; Karssemeijer, Nico
2010-05-01
In computer-aided diagnosis (CAD) research, feature selection methods are often used to improve generalization performance of classifiers and shorten computation times. In an application that detects malignant masses in mammograms, we investigated the effect of using a selection criterion that is similar to the final performance measure we are optimizing, namely the mean sensitivity of the system in a predefined range of the free-response receiver operating characteristics (FROC). To obtain the generalization performance of the selected feature subsets, a cross validation procedure was performed on a dataset containing 351 abnormal and 7879 normal regions, each region providing a set of 71 mass features. The same number of noise features, not containing any information, were added to investigate the ability of the feature selection algorithms to distinguish between useful and non-useful features. It was found that significantly higher performances were obtained using feature sets selected by the general test statistic Wilks' lambda than using feature sets selected by the more specific FROC measure. Feature selection leads to better performance when compared to a system in which all features were used.
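For a single feature and two classes, Wilks' lambda reduces to the within-group sum of squares over the total sum of squares. A small sketch with hypothetical feature values, showing why an informative feature scores lower than a noise feature:

```python
def wilks_lambda(groups):
    """Univariate Wilks' lambda for one candidate feature:
    within-group SS / total SS. Smaller values mean the group means
    explain more of the variation, so features are ranked ascending."""
    all_vals = [v for g in groups for v in g]
    grand_mean = sum(all_vals) / len(all_vals)
    ss_total = sum((v - grand_mean) ** 2 for v in all_vals)
    ss_within = 0.0
    for g in groups:
        m = sum(g) / len(g)
        ss_within += sum((v - m) ** 2 for v in g)
    return ss_within / ss_total

# hypothetical mass feature measured in normal vs. abnormal regions
lam_informative = wilks_lambda([[1.0, 1.2, 0.9], [2.1, 2.3, 2.0]])
lam_noise = wilks_lambda([[1.0, 2.0, 1.5], [1.1, 1.9, 1.6]])
```

A noise feature yields a lambda near 1 (groups indistinguishable), which is how the criterion separates the 71 informative mass features from the added noise features.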
Statistical Model Selection for TID Hardness Assurance
NASA Technical Reports Server (NTRS)
Ladbury, R.; Gorelick, J. L.; McClure, S.
2010-01-01
Radiation Hardness Assurance (RHA) methodologies against Total Ionizing Dose (TID) degradation impose rigorous statistical treatments for data from a part's Radiation Lot Acceptance Test (RLAT) and/or its historical performance. However, no similar methods exist for using "similarity" data - that is, data for similar parts fabricated in the same process as the part under qualification. This is despite the greater difficulty and potential risk in interpreting similarity data. In this work, we develop methods to disentangle part-to-part, lot-to-lot and part-type-to-part-type variation. The methods we develop apply not just for qualification decisions, but also for quality control and detection of process changes and other "out-of-family" behavior. We begin by discussing the data used in the study and the challenges of developing a statistic providing a meaningful measure of degradation across multiple part types, each with its own performance specifications. We then develop analysis techniques and apply them to the different data sets.
NASA Technical Reports Server (NTRS)
Sprowls, D. O.; Bucci, R. J.; Ponchel, B. M.; Brazill, R. L.; Bretz, P. E.
1984-01-01
A technique is demonstrated for accelerated stress corrosion testing of high strength aluminum alloys. The method offers better precision and shorter exposure times than traditional pass/fail procedures. The approach uses data from tension tests performed on replicate groups of smooth specimens after various lengths of exposure to static stress. The breaking strength measures degradation in the test specimen load carrying ability due to the environmental attack. Analysis of breaking load data by extreme value statistics enables the calculation of survival probabilities and a statistically defined threshold stress applicable to the specific test conditions. A fracture mechanics model is given which quantifies depth of attack in the stress corroded specimen by an effective flaw size calculated from the breaking stress and the material strength and fracture toughness properties. Comparisons are made with experimental results from three tempers of 7075 alloy plate tested by the breaking load method and by traditional tests of statically loaded smooth tension bars and conventional precracked specimens.
Saad, Ahmed S; Attia, Ali K; Alaraki, Manal S; Elzanfaly, Eman S
2015-11-05
Five different spectrophotometric methods were applied for simultaneous determination of fenbendazole and rafoxanide in their binary mixture; namely first derivative, derivative ratio, ratio difference, dual wavelength and H-point standard addition spectrophotometric methods. Different factors affecting each of the applied spectrophotometric methods were studied and the selectivity of the applied methods was compared. The applied methods were validated as per the ICH guidelines, and good accuracy, specificity, and precision were proven within the concentration range of 5-50 μg/mL for both drugs. Statistical analysis using one-way ANOVA proved no significant differences among the proposed methods for the determination of the two drugs. The proposed methods successfully determined both drugs in laboratory prepared and commercially available binary mixtures, and were found applicable for the routine analysis in quality control laboratories. Copyright © 2015 Elsevier B.V. All rights reserved.
Optimization of Microphone Locations for Acoustic Liner Impedance Eduction
NASA Technical Reports Server (NTRS)
Jones, M. G.; Watson, W. R.; June, J. C.
2015-01-01
Two impedance eduction methods are explored for use with data acquired in the NASA Langley Grazing Flow Impedance Tube. The first is an indirect method based on the convected Helmholtz equation, and the second is a direct method based on the Kumaresan and Tufts algorithm. Synthesized no-flow data, with random jitter to represent measurement error, are used to evaluate a number of possible microphone locations. Statistical approaches are used to evaluate the suitability of each set of microphone locations. Given the computational resources required, small sample statistics are employed for the indirect method. Since the direct method is much less computationally intensive, a Monte Carlo approach is employed to gather its statistics. A comparison of results achieved with full and reduced sets of microphone locations is used to determine which sets of microphone locations are acceptable. For the indirect method, each array that includes microphones in all three regions (upstream and downstream hard wall sections, and liner test section) provides acceptable results, even when as few as eight microphones are employed. The best arrays employ microphones well away from the leading and trailing edges of the liner. The direct method is constrained to use microphones opposite the liner. Although a number of arrays are acceptable, the optimum set employs 14 microphones positioned well away from the leading and trailing edges of the liner. The selected sets of microphone locations are also evaluated with data measured for ceramic tubular and perforate-over-honeycomb liners at three flow conditions (Mach 0.0, 0.3, and 0.5). They compare favorably with results attained using all 53 microphone locations. Although different optimum microphone locations are selected for the two impedance eduction methods, there is significant overlap. Thus, the union of these two microphone arrays is preferred, as it supports usage of both methods. This array contains 3 microphones in the upstream hard wall section, 14 microphones opposite the liner, and 3 microphones in the downstream hard wall section.
Ryu, Jihye; Lee, Chaeyoung
2014-12-01
Positive selection not only increases beneficial allele frequency but also causes augmentation of allele frequencies of sequence variants in close proximity. Signals for positive selection were detected by the statistical differences in subsequent allele frequencies. To identify selection signatures in Korean cattle, we applied a composite log-likelihood (CLL)-based method, which calculates a composite likelihood of the allelic frequencies observed across sliding windows of five adjacent loci and compares the value with the critical statistic estimated by 50,000 permutations. Data for a total of 11,799 nucleotide polymorphisms were used with 71 Korean cattle and 209 foreign beef cattle. As a result, 147 signals were identified for Korean cattle based on CLL estimates (P < 0.01). The signals might be candidate genetic factors for meat quality by which the Korean cattle have been selected. Further genetic association analysis with 41 intragenic variants in the selection signatures with the greatest CLL for each chromosome revealed that marbling score was associated with five variants. Intensive association studies with all the selection signatures identified in this study are required to exclude signals associated with other phenotypes or signals falsely detected and thus to identify genetic markers for meat quality. © 2014 Stichting International Foundation for Animal Genetics.
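The permutation logic behind such window-based selection scans can be sketched as follows (a simplified, illustrative stand-in: a summed squared allele-frequency difference replaces the composite log-likelihood, the critical value is taken from the permutation distribution of the window maximum, and all names and toy data are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

def window_stat(freq_a, freq_b, size=5):
    """Sum of squared allele-frequency differences over sliding windows
    of `size` adjacent loci (a crude proxy for the composite likelihood)."""
    d = (freq_a - freq_b) ** 2
    return np.convolve(d, np.ones(size), mode="valid")

def permutation_threshold(geno_a, geno_b, size=5, n_perm=1000, alpha=0.01):
    """Critical value of the window statistic under random relabeling
    of individuals, using the per-permutation maximum across windows."""
    pooled = np.vstack([geno_a, geno_b])
    n_a = geno_a.shape[0]
    maxima = []
    for _ in range(n_perm):
        idx = rng.permutation(pooled.shape[0])
        pa = pooled[idx[:n_a]].mean(axis=0)
        pb = pooled[idx[n_a:]].mean(axis=0)
        maxima.append(window_stat(pa, pb, size).max())
    return np.quantile(maxima, 1 - alpha)

# toy data: rows = individuals, columns = biallelic loci coded 0/1
geno_a = rng.integers(0, 2, size=(40, 100))
geno_b = rng.integers(0, 2, size=(40, 100))
obs = window_stat(geno_a.mean(axis=0), geno_b.mean(axis=0))
thr = permutation_threshold(geno_a, geno_b)
signals = np.where(obs > thr)[0]   # candidate window start positions
```

Windows whose statistic exceeds the permutation-derived threshold are candidate selection signatures; with the toy null data above, few or none should appear.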
Marembo, Joan; Zvinavashe, Mathilda; Nyamakura, Rudo; Shaibu, Sheila; Mogobe, Keitshokile Dintle
2014-10-01
To assess factors influencing infant-feeding methods selected by HIV-infected mothers. A descriptive quantitative study was conducted among 80 mothers with babies aged 0-6 months who were randomly selected and interviewed. Descriptive statistics were used to summarize the findings. Factors considered by women in choosing the infant-feeding methods included sociocultural acceptability (58.8%), feasibility and support from significant others (35%), knowledge of the selected method (55%), affordability (61.2%), implementation of the infant-feeding method without interference (62.5%), and safety (47.5%). Exclusive breast-feeding was the most preferred method of infant feeding. Disclosure of HIV status by a woman to her partner is a major condition for successful replacement feeding method, especially within the African cultural context. However, disclosure of HIV status to the partner was feared by most women as only 16.2% of the women disclosed their HIV status to partners. The factors considered by women in choosing the infant-feeding option were ability to implement the options without interference from significant others, affordability, and sociocultural acceptability. Knowledge of the selected option, its advantages and disadvantages, safety, and feasibility were also important factors. Nurses and midwives have to educate clients and support them in their choice of infant-feeding methods. © 2013 The Authors. Japan Journal of Nursing Science © 2013 Japan Academy of Nursing Science.
Selecting the most appropriate inferential statistical test for your quantitative research study.
Bettany-Saltikov, Josette; Whittaker, Victoria Jane
2014-06-01
To discuss the issues and processes relating to the selection of the most appropriate statistical test. A review of the basic research concepts together with a number of clinical scenarios is used to illustrate this. Quantitative nursing research generally features the use of empirical data which necessitates the selection of both descriptive and statistical tests. Different types of research questions can be answered by different types of research designs, which in turn need to be matched to a specific statistical test(s). Discursive paper. This paper discusses the issues relating to the selection of the most appropriate statistical test and makes some recommendations as to how these might be dealt with. When conducting empirical quantitative studies, a number of key issues need to be considered. Considerations for selecting the most appropriate statistical tests are discussed and flow charts provided to facilitate this process. When nursing clinicians and researchers conduct quantitative research studies, it is crucial that the most appropriate statistical test is selected to enable valid conclusions to be made. © 2013 John Wiley & Sons Ltd.
NASA Astrophysics Data System (ADS)
Ding, Hao; Cao, Ming; DuPont, Andrew W.; Scott, Larry D.; Guha, Sushovan; Singhal, Shashideep; Younes, Mamoun; Pence, Isaac; Herline, Alan; Schwartz, David; Xu, Hua; Mahadevan-Jansen, Anita; Bi, Xiaohong
2016-03-01
Inflammatory bowel disease (IBD) is an idiopathic disease that is typically characterized by chronic inflammation of the gastrointestinal tract. Recently much effort has been devoted to the development of novel diagnostic tools that can assist physicians for fast, accurate, and automated diagnosis of the disease. Previous research based on Raman spectroscopy has shown promising results in differentiating IBD patients from normal screening cases. In the current study, we examined IBD patients in vivo through a colonoscope-coupled Raman system. Optical diagnosis for IBD discrimination was conducted based on full-range spectra using multivariate statistical methods. Further, we incorporated several feature selection methods in machine learning into the classification model. The diagnostic performance for disease differentiation was significantly improved after feature selection. Our results showed that improved IBD diagnosis can be achieved using Raman spectroscopy in combination with multivariate analysis and feature selection.
NASA Astrophysics Data System (ADS)
Vallam, P.; Qin, X. S.
2017-10-01
Anthropogenically driven climate change would affect the global ecosystem and is becoming a worldwide concern. Numerous studies have been undertaken to determine the future trends of meteorological variables at different scales. Despite these studies, there remains significant uncertainty in the prediction of future climates. To examine the uncertainty arising from using different schemes to downscale the meteorological variables for the future horizons, projections from different statistical downscaling schemes were examined. These schemes included the statistical downscaling method (SDSM), the change factor method incorporated with LARS-WG, and the bias corrected disaggregation (BCD) method. Global circulation models (GCMs) based on CMIP3 (HadCM3) and CMIP5 (CanESM2) were utilized to perturb the changes in the future climate. Five study sites (i.e., Alice Springs, Edmonton, Frankfurt, Miami, and Singapore) with diverse climatic conditions were chosen for examining the spatial variability of applying various statistical downscaling schemes. The study results indicated that the regions experiencing heavy precipitation intensities were most likely to demonstrate the divergence between the predictions from various statistical downscaling methods. Also, the variance computed in projecting the weather extremes indicated the uncertainty derived from the selection of downscaling tools and climate models. This study could help gain an improved understanding about the features of different downscaling approaches and the overall downscaling uncertainty.
Thapaliya, Kiran; Pyun, Jae-Young; Park, Chun-Su; Kwon, Goo-Rak
2013-01-01
The level set approach is a powerful tool for segmenting images. This paper proposes a method for segmenting brain tumor images from MR images. A new signed pressure function (SPF) that can efficiently stop the contours at weak or blurred edges is introduced. The local statistics of the different objects present in the MR images were calculated. Using local statistics, the tumor objects were identified among different objects. In this level set method, the calculation of the parameters is a challenging task. The calculations of different parameters for different types of images were automatic. The basic thresholding value was updated and adjusted automatically for different MR images. This thresholding value was used to calculate the different parameters in the proposed algorithm. The proposed algorithm was tested on the magnetic resonance images of the brain for tumor segmentation and its performance was evaluated visually and quantitatively. Numerical experiments on some brain tumor images highlighted the efficiency and robustness of this method. Crown Copyright © 2013. Published by Elsevier Ltd. All rights reserved.
Statistical Tests of System Linearity Based on the Method of Surrogate Data
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hunter, N.; Paez, T.; Red-Horse, J.
When dealing with measured data from dynamic systems we often make the tacit assumption that the data are generated by linear dynamics. While some systematic tests for linearity and determinism are available - for example the coherence function, the probability density function, and the bispectrum - further tests that quantify the existence and the degree of nonlinearity are clearly needed. In this paper we demonstrate a statistical test for the nonlinearity exhibited by a dynamic system excited by Gaussian random noise. We perform the usual division of the input and response time series data into blocks as required by the Welch method of spectrum estimation and search for significant relationships between a given input frequency and response at harmonics of the selected input frequency. We argue that systematic tests based on the recently developed statistical method of surrogate data readily detect significant nonlinear relationships. The paper elucidates the method of surrogate data. Typical results are illustrated for a linear single degree-of-freedom system and for a system with polynomial stiffness nonlinearity.
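The surrogate-data idea can be sketched as follows (a generic phase-randomization scheme paired with a time-reversal-asymmetry statistic, not the authors' harmonic-response test; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def phase_randomized_surrogate(x):
    """Surrogate series: same power spectrum as x, random Fourier phases.

    Surrogates share the linear (second-order) properties of x, so a
    statistic that separates x from its surrogates indicates nonlinearity.
    """
    n = len(x)
    spec = np.fft.rfft(x)
    phases = rng.uniform(0.0, 2.0 * np.pi, len(spec))
    phases[0] = 0.0              # keep the mean (DC bin) real
    if n % 2 == 0:
        phases[-1] = 0.0         # keep the Nyquist bin real
    return np.fft.irfft(np.abs(spec) * np.exp(1j * phases), n=n)

def surrogate_pvalue(x, stat, n_surr=99):
    """One-sided rank p-value of stat(x) within its surrogate distribution."""
    s0 = stat(x)
    exceed = sum(stat(phase_randomized_surrogate(x)) >= s0
                 for _ in range(n_surr))
    return (exceed + 1) / (n_surr + 1)

# time-reversal asymmetry: near zero in expectation for linear Gaussian data
def trev(x):
    return abs(np.mean((x[1:] - x[:-1]) ** 3))

x = rng.normal(size=2048)        # a linear (white Gaussian) series
p = surrogate_pvalue(x, trev)    # should usually be non-significant here
```

A small p-value means the discriminating statistic of the measured series lies in the tail of the surrogate distribution, i.e., the linear-process null is rejected.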
Li, Ke; Zhang, Qiuju; Wang, Kun; Chen, Peng; Wang, Huaqing
2016-01-01
A new fault diagnosis method for rotating machinery based on adaptive statistic test filter (ASTF) and Diagnostic Bayesian Network (DBN) is presented in this paper. ASTF is proposed to obtain weak fault features under background noise; it is based on statistical hypothesis testing in the frequency domain, evaluating the similarity between a reference (noise) signal and the original signal and removing the components of high similarity. The optimal level of significance α is obtained using particle swarm optimization (PSO). To evaluate the performance of the ASTF, an evaluation factor Ipq is also defined. In addition, a simulation experiment is designed to verify the effectiveness and robustness of ASTF. A sensitive evaluation method using principal component analysis (PCA) is proposed to evaluate the sensitiveness of symptom parameters (SPs) for condition diagnosis. In this way, the SPs that have high sensitiveness for condition diagnosis can be selected. A three-layer DBN is developed to identify the condition of rotating machinery based on Bayesian Belief Network (BBN) theory. A condition diagnosis experiment on rolling element bearings demonstrates the effectiveness of the proposed method. PMID:26761006
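The core filtering step can be sketched in simplified form (a fixed power-ratio threshold replaces the paper's PSO-tuned significance level, and a per-bin power comparison stands in for its hypothesis test; all names and data are illustrative):

```python
import numpy as np

def spectral_noise_filter(signal, noise_ref, alpha=3.0):
    """Keep only frequency bins whose power clearly exceeds a noise reference.

    Bins similar to the noise reference are removed, as in the ASTF idea;
    `alpha` plays the role of the significance threshold, which the paper
    tunes with PSO (omitted here).
    """
    S = np.fft.rfft(signal)
    noise_power = np.abs(np.fft.rfft(noise_ref)) ** 2
    keep = np.abs(S) ** 2 > alpha * noise_power   # dissimilar-to-noise bins
    return np.fft.irfft(S * keep, n=len(signal))

rng = np.random.default_rng(2)
t = np.arange(4096) / 4096
fault = 0.5 * np.sin(2 * np.pi * 120 * t)         # weak periodic fault feature
noisy = fault + rng.normal(scale=1.0, size=t.size)
cleaned = spectral_noise_filter(noisy, rng.normal(scale=1.0, size=t.size))
```

On this toy signal the 120-cycle fault component survives the filter while most broadband noise bins are suppressed, so `cleaned` tracks `fault` more closely than `noisy` does.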
Application of multivariate statistical techniques in microbial ecology.
Paliy, O; Shankar, V
2016-03-01
Recent advances in high-throughput methods of molecular analyses have led to an explosion of studies generating large-scale ecological data sets. In particular, a noticeable effect has been attained in the field of microbial ecology, where new experimental approaches have provided in-depth assessments of the composition, functions and dynamic changes of complex microbial communities. Because even a single high-throughput experiment produces a large amount of data, powerful statistical techniques of multivariate analysis are well suited to analyse and interpret these data sets. Many different multivariate techniques are available, and often it is not clear which method should be applied to a particular data set. In this review, we describe and compare the most widely used multivariate statistical techniques, including exploratory, interpretive and discriminatory procedures. We consider several important limitations and assumptions of these methods, and we present examples of how these approaches have been utilized in recent studies to provide insight into the ecology of the microbial world. Finally, we offer suggestions for the selection of appropriate methods based on the research question and data set structure. © 2016 John Wiley & Sons Ltd.
Statistics in the pharmacy literature.
Lee, Charlene M; Soin, Herpreet K; Einarson, Thomas R
2004-09-01
Research in statistical methods is essential for maintenance of high quality of the published literature. To update previous reports of the types and frequencies of statistical terms and procedures in research studies of selected professional pharmacy journals. We obtained all research articles published in 2001 in 6 journals: American Journal of Health-System Pharmacy, The Annals of Pharmacotherapy, Canadian Journal of Hospital Pharmacy, Formulary, Hospital Pharmacy, and Journal of the American Pharmaceutical Association. Two independent reviewers identified and recorded descriptive and inferential statistical terms/procedures found in the methods, results, and discussion sections of each article. Results were determined by tallying the total number of times, as well as the percentage, that each statistical term or procedure appeared in the articles. One hundred forty-four articles were included. Ninety-eight percent employed descriptive statistics; of these, 28% used only descriptive statistics. The most common descriptive statistical terms were percentage (90%), mean (74%), standard deviation (58%), and range (46%). Sixty-nine percent of the articles used inferential statistics, the most frequent being chi-square (33%), Student's t-test (26%), Pearson's correlation coefficient r (18%), ANOVA (14%), and logistic regression (11%). Statistical terms and procedures were found in nearly all of the research articles published in pharmacy journals. Thus, pharmacy education should aim to provide current and future pharmacists with an understanding of the common statistical terms and procedures identified to facilitate the appropriate appraisal and consequential utilization of the information available in research articles.
Systematic sampling for suspended sediment
Robert B. Thomas
1991-01-01
Because of high costs or complex logistics, scientific populations cannot be measured entirely and must be sampled. Accepted scientific practice holds that sample selection be based on statistical principles to assure objectivity when estimating totals and variances. Probability sampling--obtaining samples with known probabilities--is the only method that...
An Analysis of Methods Used to Examine Gender Differences in Computer-Related Behavior.
ERIC Educational Resources Information Center
Kay, Robin
1992-01-01
Review of research investigating gender differences in computer-related behavior examines statistical and methodological flaws. Issues addressed include sample selection, sample size, scale development, scale quality, the use of univariate and multivariate analyses, regressional analysis, construct definition, construct testing, and the…
Dong, Qi; Elliott, Michael R; Raghunathan, Trivellore E
2014-06-01
Outside of the survey sampling literature, samples are often assumed to be generated by a simple random sampling process that produces independent and identically distributed (IID) samples. Many statistical methods are developed largely in this IID world. Application of these methods to data from complex sample surveys without making allowance for the survey design features can lead to erroneous inferences. Hence, much time and effort have been devoted to developing statistical methods that analyze complex survey data and account for the sample design. This issue is particularly important when generating synthetic populations using finite population Bayesian inference, as is often done in missing data or disclosure risk settings, or when combining data from multiple surveys. By extending previous work in the finite population Bayesian bootstrap literature, we propose a method to generate synthetic populations from a posterior predictive distribution in a fashion that inverts the complex sampling design features and generates simple random samples from a superpopulation point of view, adjusting the complex data so that they can be analyzed as simple random samples. We consider a simulation study with a stratified, clustered unequal-probability of selection sample design, and use the proposed nonparametric method to generate synthetic populations for the 2006 National Health Interview Survey (NHIS) and the Medical Expenditure Panel Survey (MEPS), which are stratified, clustered unequal-probability of selection sample designs.
Statistical technique for analysing functional connectivity of multiple spike trains.
Masud, Mohammad Shahed; Borisyuk, Roman
2011-03-15
A new statistical technique, the Cox method, used for analysing functional connectivity of simultaneously recorded multiple spike trains is presented. This method is based on the theory of modulated renewal processes and it estimates a vector of influence strengths from multiple spike trains (called reference trains) to the selected (target) spike train. Selecting another target spike train and repeating the calculation of the influence strengths from the reference spike trains enables researchers to find all functional connections among multiple spike trains. In order to study functional connectivity an "influence function" is identified. This function recognises the specificity of neuronal interactions and reflects the dynamics of postsynaptic potential. In comparison to existing techniques, the Cox method has the following advantages: it does not use bins (binless method); it is applicable to cases where the sample size is small; it is sufficiently sensitive such that it estimates weak influences; it supports the simultaneous analysis of multiple influences; it is able to identify a correct connectivity scheme in difficult cases of "common source" or "indirect" connectivity. The Cox method has been thoroughly tested using multiple sets of data generated by the neural network model of the leaky integrate and fire neurons with a prescribed architecture of connections. The results suggest that this method is highly successful for analysing functional connectivity of simultaneously recorded multiple spike trains. Copyright © 2011 Elsevier B.V. All rights reserved.
Kaye, T.N.; Pyke, David A.
2003-01-01
Population viability analysis is an important tool for conservation biologists, and matrix models that incorporate stochasticity are commonly used for this purpose. However, stochastic simulations may require assumptions about the distribution of matrix parameters, and modelers often select a statistical distribution that seems reasonable without sufficient data to test its fit. We used data from long-term (5–10 year) studies with 27 populations of five perennial plant species to compare seven methods of incorporating environmental stochasticity. We estimated stochastic population growth rate (a measure of viability) using a matrix-selection method, in which whole observed matrices were selected at random at each time step of the model. In addition, we drew matrix elements (transition probabilities) at random using various statistical distributions: beta, truncated-gamma, truncated-normal, triangular, uniform, or discontinuous/observed. Recruitment rates were held constant at their observed mean values. Two methods of constraining stage-specific survival to ≤100% were also compared. Different methods of incorporating stochasticity and constraining matrix column sums interacted in their effects and resulted in different estimates of stochastic growth rate (differing by up to 16%). Modelers should be aware that when constraining stage-specific survival to ≤100%, different methods may introduce different levels of bias in transition element means, and when this happens, different distributions for generating random transition elements may result in different viability estimates. There was no species effect on the results and the growth rates derived from all methods were highly correlated with one another. We conclude that the absolute value of population viability estimates is sensitive to model assumptions, but the relative ranking of populations (and management treatments) is robust. Furthermore, these results are applicable to a range of perennial plants and possibly other life histories.
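The matrix-selection method can be sketched as follows (the matrices and all names are invented for illustration; the alternative schemes that draw individual transition elements from statistical distributions are omitted):

```python
import numpy as np

rng = np.random.default_rng(3)

def stochastic_growth_rate(matrices, t_max=5000):
    """Estimate log lambda_s by drawing a whole observed projection matrix
    at random at each time step and averaging the log of total growth."""
    n = np.ones(matrices[0].shape[0])   # initial stage-abundance vector
    log_growth = 0.0
    for _ in range(t_max):
        A = matrices[rng.integers(len(matrices))]
        n = A @ n
        total = n.sum()
        log_growth += np.log(total)
        n /= total                      # renormalize to avoid overflow
    return log_growth / t_max

# three observed annual matrices for a hypothetical 2-stage perennial
mats = [np.array([[0.1, 1.2], [0.4, 0.8]]),
        np.array([[0.0, 0.6], [0.3, 0.9]]),
        np.array([[0.2, 2.0], [0.5, 0.7]])]
log_lambda_s = stochastic_growth_rate(mats)   # > 0 implies a viable population
```

Because whole observed matrices are resampled, within-year correlations among vital rates are preserved automatically, which is the main appeal of this method over element-by-element random draws.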
Identifying Node Role in Social Network Based on Multiple Indicators
Huang, Shaobin; Lv, Tianyang; Zhang, Xizhe; Yang, Yange; Zheng, Weimin; Wen, Chao
2014-01-01
It is a classic topic of social network analysis to evaluate the importance of nodes and identify the node that takes on the role of core or bridge in a network. Because a single indicator is not sufficient to analyze multiple characteristics of a node, it is a natural solution to apply multiple indicators that should be selected carefully. An intuitive idea is to select some indicators with weak correlations to efficiently assess different characteristics of a node. However, this paper shows that it is much better to select the indicators with strong correlations. Because indicator correlation is based on the statistical analysis of a large number of nodes, the particularity of an important node will be outlined if its indicator relationship doesn't comply with the statistical correlation. Therefore, the paper selects the multiple indicators including degree, ego-betweenness centrality and eigenvector centrality to evaluate the importance and the role of a node. The importance of a node is equal to the normalized sum of its three indicators. A candidate for core or bridge is selected from the great degree nodes or the nodes with great ego-betweenness centrality respectively. Then, the role of a candidate is determined according to the difference between its indicators' relationship with the statistical correlation of the overall network. Based on 18 real networks and 3 kinds of model networks, the experimental results show that the proposed methods perform quite well in evaluating the importance of nodes and in identifying the node role. PMID:25089823
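A simplified sketch of the importance score (min-max normalized sum of the three indicators; the core/bridge classification against the network-wide indicator correlation is omitted, and the toy graph and all names are invented):

```python
import numpy as np

def node_importance(A):
    """Importance = sum of min-max normalized degree, eigenvector centrality
    and ego-betweenness centrality for a symmetric 0/1 adjacency matrix A."""
    n = A.shape[0]
    degree = A.sum(axis=1)

    # eigenvector centrality: leading eigenvector of the adjacency matrix
    w, v = np.linalg.eigh(A)
    eig = np.abs(v[:, np.argmax(w)])

    # ego-betweenness (Everett-Borgatti shortcut): for each non-adjacent pair
    # of a node's neighbours, add 1 / (number of length-2 paths linking them
    # inside the ego network, including the path through the ego itself)
    ego_btw = np.zeros(n)
    for u in range(n):
        nbrs = np.flatnonzero(A[u])
        B = A[np.ix_(nbrs, nbrs)]
        two_paths = B @ B + 1.0          # +1 for the path through u
        for i in range(len(nbrs)):
            for j in range(i + 1, len(nbrs)):
                if B[i, j] == 0:
                    ego_btw[u] += 1.0 / two_paths[i, j]

    def minmax(x):
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)

    return minmax(degree) + minmax(eig) + minmax(ego_btw)

# two triangles (0-1-2 and 4-5-6) joined through a bridge node 3
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
A = np.zeros((7, 7))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
scores = node_importance(A)
```

In this toy graph the two nodes that anchor the triangles to the bridge (2 and 4) combine high degree, high eigenvector centrality and high ego-betweenness, so they receive the top scores.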
ERIC Educational Resources Information Center
Substance Abuse and Mental Health Services Administration (DHHS/PHS), Rockville, MD. Office of Applied Studies.
This document presents the technical appendices and selected data tables from the 2001 National Household Survey on Drug Abuse. Included are a description of the survey; statistical methods and limitations of the data; effects of changes in survey protocol on trend measurement; key definitions for the 1999-2001 survey years; and other sources of…
Yokoyama, Shozo; Takenaka, Naomi
2005-04-01
Red-green color vision is strongly suspected to enhance the survival of its possessors. Despite being red-green color blind, however, many species have successfully competed in nature, which brings into question the evolutionary advantage of achieving red-green color vision. Here, we propose a new method of identifying positive selection at individual amino acid sites with the premise that if positive Darwinian selection has driven the evolution of the protein under consideration, then it should be found mostly at the branches in the phylogenetic tree where its function had changed. The statistical and molecular methods have been applied to 29 visual pigments with the wavelengths of maximal absorption at approximately 510-540 nm (green- or middle wavelength-sensitive [MWS] pigments) and at approximately 560 nm (red- or long wavelength-sensitive [LWS] pigments), which are sampled from a diverse range of vertebrate species. The results show that the MWS pigments are positively selected through amino acid replacements S180A, Y277F, and T285A and that the LWS pigments have been subjected to strong evolutionary conservation. The fact that these positively selected M/LWS pigments are found not only in animals with red-green color vision but also in those with red-green color blindness strongly suggests that both red-green color vision and color blindness have undergone adaptive evolution independently in different species.
Method for Reducing Pumping Damage to Blood
NASA Technical Reports Server (NTRS)
Bozeman, Richard J., Jr. (Inventor); Akkerman, James W. (Inventor); Aber, Gregory S. (Inventor); VanDamm, George Arthur (Inventor); Bacak, James W. (Inventor); Svejkovsky, Robert J. (Inventor); Benkowski, Robert J. (Inventor)
1997-01-01
Methods are provided for minimizing damage to blood in a blood pump wherein the blood pump comprises a plurality of pump components that may affect blood damage such as clearance between pump blades and housing, number of impeller blades, rounded or flat blade edges, variations in entrance angles of blades, impeller length, and the like. The process comprises selecting a plurality of pump components believed to affect blood damage such as those listed herein before. Construction variations for each of the plurality of pump components are then selected. The pump components and variations are preferably listed in a matrix for easy visual comparison of test results. Blood is circulated through a pump configuration to test each variation of each pump component. After each test, total blood damage is determined for the blood pump. Preferably each pump component variation is tested at least three times to provide statistical results and check consistency of results. The least hemolytic variation for each pump component is preferably selected as an optimized component. If no statistical difference as to blood damage is produced for a variation of a pump component, then the variation that provides preferred hydrodynamic performance is selected. To compare the variation of pump components such as impeller and stator blade geometries, the preferred embodiment of the invention uses a stereolithography technique for realizing complex shapes within a short time period.
Mallik, Saurav; Bhadra, Tapas; Maulik, Ujjwal
2017-01-01
Epigenetic biomarker discovery is an important task in bioinformatics. In this article, we develop a new framework of identifying statistically significant epigenetic biomarkers using maximal-relevance and minimal-redundancy criterion based feature (gene) selection for multi-omics dataset. Firstly, we determine the genes that have both expression as well as methylation values, and follow normal distribution. Similarly, we identify the genes which consist of both expression and methylation values, but do not follow normal distribution. For each case, we utilize a gene-selection method that provides maximal-relevant, but variable-weighted minimum-redundant genes as top ranked genes. For statistical validation, we apply t-test on both the expression and methylation data consisting of only the normally distributed top ranked genes to determine how many of them are both differentially expressed and methylated. Similarly, we utilize Limma package for performing non-parametric Empirical Bayes test on both expression and methylation data comprising only the non-normally distributed top ranked genes to identify how many of them are both differentially expressed and methylated. We finally report the top-ranking significant gene-markers with biological validation. Moreover, our framework improves positive predictive rate and reduces false positive rate in marker identification. In addition, we provide a comparative analysis of our gene-selection method as well as other methods based on classification performances obtained using several well-known classifiers.
Pashaei, Elnaz; Pashaei, Elham; Aydin, Nizamettin
2018-04-14
In cancer classification, gene selection is an important data preprocessing technique, but it is a difficult task due to the large search space. Accordingly, the objective of this study is to develop a hybrid meta-heuristic Binary Black Hole Algorithm (BBHA) and Binary Particle Swarm Optimization (BPSO) (4-2) model that emphasizes gene selection. In this model, the BBHA is embedded in the BPSO (4-2) algorithm to make the BPSO (4-2) more effective and to facilitate the exploration and exploitation of the BPSO (4-2) algorithm to further improve the performance. This model has been combined with the Random Forest Recursive Feature Elimination (RF-RFE) pre-filtering technique. The classifiers evaluated in the proposed framework are Sparse Partial Least Squares Discriminant Analysis (SPLSDA), k-nearest neighbor and Naive Bayes. The performance of the proposed method was evaluated on two benchmark and three clinical microarray datasets. The experimental results and statistical analysis confirm the better performance of the BPSO (4-2)-BBHA compared with the BBHA, the BPSO (4-2) and several state-of-the-art methods in terms of avoiding local minima, convergence rate, accuracy and number of selected genes. The results also show that the BPSO (4-2)-BBHA model can successfully identify known biologically and statistically significant genes from the clinical datasets. Copyright © 2018 Elsevier Inc. All rights reserved.
Improved method for selection of the NOAEL.
Calabrese, E J; Baldwin, L A
1994-02-01
The paper proposes that the NOAEL be defined as the highest dosage tested that is not statistically significantly different from the control group while also being statistically significantly different from the LOAEL. This new definition requires that the NOAEL be defined from two points of reference rather than the current approach (i.e., a single point of reference) in which the NOAEL represents only the highest dosage not statistically significantly different from the control group. This proposal is necessary in order to differentiate NOAELs which are statistically distinguishable from the LOAEL. Under the new regime, only those satisfying both criteria would be designated a true NOAEL, while those satisfying only one criterion (i.e., not statistically significantly different from the control group) would be designated a "quasi" NOAEL and handled differently (i.e., via an uncertainty factor) for risk assessment purposes.
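The two-reference definition proposed above can be sketched procedurally. The sketch below is illustrative only: it substitutes a large-sample normal approximation for the t-tests a real toxicology analysis would use, and the function names and alpha level are assumptions, not taken from the paper.

```python
from statistics import NormalDist, mean, stdev

def z_test_p(a, b):
    """Two-sided large-sample test that groups a and b share a mean
    (a normal-approximation stand-in for a proper t-test)."""
    se = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    z = abs(mean(a) - mean(b)) / se
    return 2.0 * (1.0 - NormalDist().cdf(z))

def classify_noael(control, dose_groups, alpha=0.05):
    """Apply the proposed two-reference definition: the candidate NOAEL
    is the highest dose group NOT significantly different from control;
    it is a 'true' NOAEL only if it also differs significantly from the
    LOAEL (the lowest group that does differ from control)."""
    differs = [z_test_p(control, g) < alpha for g in dose_groups]
    noael_idx = max((i for i, d in enumerate(differs) if not d), default=None)
    loael_idx = min((i for i, d in enumerate(differs) if d), default=None)
    if noael_idx is None or loael_idx is None:
        return noael_idx, loael_idx, "undetermined"
    status = ("true NOAEL"
              if z_test_p(dose_groups[noael_idx], dose_groups[loael_idx]) < alpha
              else "quasi NOAEL")
    return noael_idx, loael_idx, status
```

Under the single-reference convention the `loael` comparison would simply be skipped, which is exactly the distinction the paper draws.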
Population genetics inference for longitudinally-sampled mutants under strong selection.
Lacerda, Miguel; Seoighe, Cathal
2014-11-01
Longitudinal allele frequency data are becoming increasingly prevalent. Such samples permit statistical inference of the population genetics parameters that influence the fate of mutant variants. To infer these parameters by maximum likelihood, the mutant frequency is often assumed to evolve according to the Wright-Fisher model. For computational reasons, this discrete model is commonly approximated by a diffusion process that requires the assumption that the forces of natural selection and mutation are weak. This assumption is not always appropriate. For example, mutations that impart drug resistance in pathogens may evolve under strong selective pressure. Here, we present an alternative approximation to the mutant-frequency distribution that does not make any assumptions about the magnitude of selection or mutation and is much more computationally efficient than the standard diffusion approximation. Simulation studies are used to compare the performance of our method to that of the Wright-Fisher and Gaussian diffusion approximations. For large populations, our method is found to provide a much better approximation to the mutant-frequency distribution when selection is strong, while all three methods perform comparably when selection is weak. Importantly, maximum-likelihood estimates of the selection coefficient are severely attenuated when selection is strong under the two diffusion models, but not when our method is used. This is further demonstrated with an application to mutant-frequency data from an experimental study of bacteriophage evolution. We therefore recommend our method for estimating the selection coefficient when the effective population size is too large to utilize the discrete Wright-Fisher model. Copyright © 2014 by the Genetics Society of America.
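The discrete Wright-Fisher model that the authors recommend for strong selection can be sketched directly: each generation applies selection, then mutation, then binomial drift. The fitness parameterization (mutant fitness 1 + s) and symmetric mutation rate below are common conventions assumed for illustration, not necessarily those used in the paper.

```python
import random

def wright_fisher_trajectory(n_gen, pop_size, p0, s, mu=0.0, rng=None):
    """Mutant allele frequency under the discrete Wright-Fisher model:
    selection (mutant fitness 1 + s), symmetric mutation at rate mu,
    then binomial sampling of 2N gametes each generation."""
    rng = rng or random.Random(0)
    p = p0
    traj = [p]
    for _ in range(n_gen):
        p_sel = p * (1 + s) / (1 + s * p)            # selection
        p_mut = p_sel * (1 - mu) + (1 - p_sel) * mu  # mutation
        draws = sum(1 for _ in range(2 * pop_size) if rng.random() < p_mut)
        p = draws / (2 * pop_size)                   # drift (binomial draw)
        traj.append(p)
    return traj
```

The per-generation binomial draw over 2N gametes is what becomes computationally burdensome for large populations, motivating the approximations compared in the paper.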
NASA Astrophysics Data System (ADS)
Zheng, Feifei; Maier, Holger R.; Wu, Wenyan; Dandy, Graeme C.; Gupta, Hoshin V.; Zhang, Tuqiao
2018-02-01
Hydrological models are used for a wide variety of engineering purposes, including streamflow forecasting and flood-risk estimation. To develop such models, it is common to allocate the available data to calibration and evaluation data subsets. Surprisingly, the issue of how this allocation can affect model evaluation performance has been largely ignored in the research literature. This paper discusses the evaluation performance bias that can arise from how available data are allocated to calibration and evaluation subsets. As a first step to assessing this issue in a statistically rigorous fashion, we present a comprehensive investigation of the influence of data allocation on the development of data-driven artificial neural network (ANN) models of streamflow. Four well-known formal data splitting methods are applied to 754 catchments from Australia and the U.S. to develop 902,483 ANN models. Results clearly show that the choice of the method used for data allocation has a significant impact on model performance, particularly for runoff data that are more highly skewed, highlighting the importance of considering the impact of data splitting when developing hydrological models. The statistical behavior of the data splitting methods investigated is discussed and guidance is offered on the selection of the most appropriate data splitting methods to achieve representative evaluation performance for streamflow data with different statistical properties. Although our results are obtained for data-driven models, they highlight the fact that this issue is likely to have a significant impact on all types of hydrological models, especially conceptual rainfall-runoff models.
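The formal data splitting methods compared above aim to give the calibration and evaluation subsets similar statistical properties. A minimal sketch of that idea (a simple systematic split by magnitude, far cruder than the four formal methods the study evaluates; the function name is invented for illustration):

```python
def representative_split(values, eval_fraction=0.2):
    """Sort records by magnitude and send every k-th one to the
    evaluation set, so both subsets span the full range of the data
    (in the spirit of, though much simpler than, formal splitting
    schemes such as DUPLEX)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    k = max(1, round(1 / eval_fraction))
    eval_idx = set(order[::k])
    calibration = [values[i] for i in range(len(values)) if i not in eval_idx]
    evaluation = [values[i] for i in range(len(values)) if i in eval_idx]
    return calibration, evaluation
```

For highly skewed runoff data, a purely random split can easily place all extreme events in one subset; a systematic split of this kind avoids that failure mode.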
Exercise reduces depressive symptoms in adults with arthritis: Evidential value
Kelley, George A; Kelley, Kristi S
2016-01-01
AIM To determine whether evidential value exists that exercise reduces depression in adults with arthritis and other rheumatic conditions. METHODS Utilizing data derived from a prior meta-analysis of 29 randomized controlled trials comprising 2449 participants (1470 exercise, 979 control) with fibromyalgia, osteoarthritis, rheumatoid arthritis or systemic lupus erythematosus, a new method, P-curve, was utilized to assess for evidential value as well as to rule out selective reporting of statistically significant results regarding exercise and depression in adults with arthritis and other rheumatic conditions. Using the method of Stouffer, Z-scores were calculated to examine selective-reporting bias. An alpha (P) value < 0.05 was deemed statistically significant. In addition, average power of the tests included in P-curve, adjusted for publication bias, was calculated. RESULTS Fifteen of 29 studies (51.7%) with exercise and depression results were statistically significant (P < 0.05) while none of the results were statistically significant with respect to exercise increasing depression in adults with arthritis and other rheumatic conditions. Right-skew to rule out selective reporting was identified (Z = −5.28, P < 0.0001). In addition, the included studies did not lack evidential value (Z = 2.39, P = 0.99), nor did they lack evidential value and were P-hacked (Z = 5.28, P > 0.99). The relative frequencies of P-values were 66.7% at 0.01, 6.7% each at 0.02 and 0.03, 13.3% at 0.04 and 6.7% at 0.05. The average power of the tests included in P-curve, corrected for publication bias, was 69%. Diagnostic plot results revealed that the observed power estimate was a better fit than the alternatives. CONCLUSION Evidential value results provide additional support that exercise reduces depression in adults with arthritis and other rheumatic conditions. PMID:27489782
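The method of Stouffer used above combines one-sided p-values into a single Z-score. A minimal sketch, assuming standard one-sided p-values as input (the full P-curve analysis involves additional steps not shown here):

```python
from statistics import NormalDist

def stouffer_z(p_values):
    """Stouffer's method: Z = sum_i Phi^-1(1 - p_i) / sqrt(k) for k
    independent one-sided p-values (Phi = standard normal CDF)."""
    nd = NormalDist()
    return sum(nd.inv_cdf(1.0 - p) for p in p_values) / len(p_values) ** 0.5

def combined_p(p_values):
    """One-sided p-value corresponding to the combined Z-score."""
    return 1.0 - NormalDist().cdf(stouffer_z(p_values))
```

For example, four studies each significant at p = 0.05 combine to Z ≈ 3.29, far more extreme than any single study alone.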
Global vision of druggability issues: applications and perspectives.
Abi Hussein, Hiba; Geneix, Colette; Petitjean, Michel; Borrel, Alexandre; Flatters, Delphine; Camproux, Anne-Claude
2017-02-01
During the preliminary stage of a drug discovery project, the lack of druggability information and poor target selection are the main causes of frequent failures. Elaborating on accurate computational druggability prediction methods is a requirement for prioritizing target selection, designing new drugs and avoiding side effects. In this review, we describe a survey of recently reported druggability prediction methods mainly based on networks, statistical pocket druggability predictions and virtual screening. An application for a frequent mutation of p53 tumor suppressor is presented, illustrating the complementarity of druggability prediction approaches, the remaining challenges and potential new drug development perspectives. Copyright © 2016 Elsevier Ltd. All rights reserved.
On the analysis of very small samples of Gaussian repeated measurements: an alternative approach.
Westgate, Philip M; Burchett, Woodrow W
2017-03-15
The analysis of very small samples of Gaussian repeated measurements can be challenging. First, due to a very small number of independent subjects contributing outcomes over time, statistical power can be quite small. Second, nuisance covariance parameters must be appropriately accounted for in the analysis in order to maintain the nominal test size. However, available statistical strategies that ensure valid statistical inference may lack power, whereas more powerful methods may have the potential for inflated test sizes. Therefore, we explore an alternative approach to the analysis of very small samples of Gaussian repeated measurements, with the goal of maintaining valid inference while also improving statistical power relative to other valid methods. This approach uses generalized estimating equations with a bias-corrected empirical covariance matrix that accounts for all small-sample aspects of nuisance correlation parameter estimation in order to maintain valid inference. Furthermore, the approach utilizes correlation selection strategies with the goal of choosing the working structure that will result in the greatest power. In our study, we show that when accurate modeling of the nuisance correlation structure impacts the efficiency of regression parameter estimation, this method can improve power relative to existing methods that yield valid inference. Copyright © 2017 John Wiley & Sons, Ltd.
Stephan, Wolfgang
2016-01-01
In the past 15 years, numerous methods have been developed to detect selective sweeps underlying adaptations. These methods are based on relatively simple population genetic models, including one or two loci at which positive directional selection occurs, and one or two marker loci at which the impact of selection on linked neutral variation is quantified. Information about the phenotype under selection is not included in these models (except for fitness). In contrast, in the quantitative genetic models of adaptation, selection acts on one or more phenotypic traits, such that a genotype-phenotype map is required to bridge the gap to population genetics theory. Here I describe the range of population genetic models from selective sweeps in a panmictic population of constant size to evolutionary traffic when simultaneous sweeps at multiple loci interfere, and I also consider the case of polygenic selection characterized by subtle allele frequency shifts at many loci. Furthermore, I present an overview of the statistical tests that have been proposed based on these population genetics models to detect evidence for positive selection in the genome. © 2015 John Wiley & Sons Ltd.
Modelling multiple sources of dissemination bias in meta-analysis.
Bowden, Jack; Jackson, Dan; Thompson, Simon G
2010-03-30
Asymmetry in the funnel plot for a meta-analysis suggests the presence of dissemination bias. This may be caused by publication bias through the decisions of journal editors, by selective reporting of research results by authors or by a combination of both. Typically, study results that are statistically significant or have larger estimated effect sizes are more likely to appear in the published literature, hence giving a biased picture of the evidence-base. Previous statistical approaches for addressing dissemination bias have assumed only a single selection mechanism. Here we consider a more realistic scenario in which multiple dissemination processes, involving both the publishing authors and journals, are operating. In practical applications, the methods can be used to provide sensitivity analyses for the potential effects of multiple dissemination biases operating in meta-analysis.
NASA Astrophysics Data System (ADS)
Lazar, Dora; Ihasz, Istvan
2013-04-01
Short- and medium-range operational forecasting of severe weather, together with the associated warnings and alarms, is among the most important activities of the Hungarian Meteorological Service. Our study provides a comprehensive summary of newly developed methods based on ECMWF ensemble forecasts that support successful prediction of convective weather situations. In the first part of the study, a brief overview is given of the components of atmospheric convection: lifting, moisture convergence and vertical wind shear. Atmospheric instability is commonly characterized by so-called instability indices; one of the most popular and widely used is the convective available potential energy (CAPE). Heavy convective events, such as intense storms, supercells and tornadoes, require vertical instability, adequate moisture and vertical wind shear. As a first step, various statistical analyses of these three parameters were performed on a nine-year time series of the 51-member ensemble forecasting model over the convective summer period. The relationship between the ratio of convective to total precipitation and the above three parameters was studied with different statistical methods. Four visualization methods were applied to support forecasts of severe weather. Two of them, the ensemble meteogram and the ensemble vertical profiles, were already available at the beginning of our work; both show the probability distribution of meteorological parameters at a selected location. Two further methods have been newly developed. The first provides a probability map of the event exceeding predefined values, making the spatial uncertainty of the event explicit. Because convective events are often spatially sporadic and occur in places other than expected, such maps help delineate the likely event area, so the ensemble forecasts give very good support.
The second new visualization tool shows, in graphical form, the time evolution of multiple predefined thresholds for any selected location. With this tool, the severity of dangerous weather conditions can be estimated well, and intensive convective periods are clearly marked within the forecast period. The developments were implemented with the MAGICS++ software under the UNIX operating system. In the third part of the study, the usefulness of these tools is demonstrated in three interesting case studies from last summer.
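The probability-map product described above reduces, per grid point, to the fraction of ensemble members exceeding a threshold. A minimal sketch (the data layout, a list of 2-D grids, is an assumption for illustration):

```python
def probability_map(member_fields, threshold):
    """Per-grid-point probability that a forecast field (e.g. CAPE or
    convective precipitation) exceeds a threshold, computed over an
    ensemble. member_fields: one 2-D grid (list of rows) per member."""
    n_members = len(member_fields)
    rows, cols = len(member_fields[0]), len(member_fields[0][0])
    return [[sum(1 for m in member_fields if m[r][c] > threshold) / n_members
             for c in range(cols)]
            for r in range(rows)]
```

With a 51-member ensemble such as the ECMWF EPS, the resulting probabilities are resolved in steps of 1/51.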
Behavior of the maximum likelihood in quantum state tomography
NASA Astrophysics Data System (ADS)
Scholten, Travis L.; Blume-Kohout, Robin
2018-02-01
Quantum state tomography on a d-dimensional system demands resources that grow rapidly with d. They may be reduced by using model selection to tailor the number of parameters in the model (i.e., the size of the density matrix). Most model selection methods rely on a test statistic and a null theory that describes its behavior when two models are equally good. Here, we consider the loglikelihood ratio. Because of the positivity constraint ρ ≥ 0, quantum state space does not generally satisfy local asymptotic normality (LAN), meaning the classical null theory for the loglikelihood ratio (the Wilks theorem) should not be used. Thus, understanding and quantifying how positivity affects the null behavior of this test statistic is necessary for its use in model selection for state tomography. We define a new generalization of LAN, metric-projected LAN, show that quantum state space satisfies it, and derive a replacement for the Wilks theorem. In addition to enabling reliable model selection, our results shed more light on the qualitative effects of the positivity constraint on state tomography.
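For contrast with the constrained quantum case discussed above, the classical Wilks null can be illustrated by Monte-Carlo: for an unconstrained d-dimensional Gaussian mean with unit covariance, the loglikelihood-ratio statistic is exactly chi-squared with d degrees of freedom. This sketch is a classical illustration only; it does not implement the paper's metric-projected LAN machinery.

```python
import random

def loglike_ratio_null_samples(dim, n_obs, n_rep, seed=0):
    """Monte-Carlo samples of lambda = 2(logL_full - logL_null) for a
    d-dimensional Gaussian mean with unit covariance (null: mean = 0,
    alternative: mean free). For this unconstrained model the statistic
    equals n * ||sample_mean||^2, distributed chi-squared with `dim`
    degrees of freedom -- the classical Wilks null."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_rep):
        means = [sum(rng.gauss(0.0, 1.0) for _ in range(n_obs)) / n_obs
                 for _ in range(dim)]
        samples.append(n_obs * sum(m * m for m in means))
    return samples
```

Under a boundary constraint like ρ ≥ 0, the analogous null distribution is compressed toward zero, which is precisely the deviation from Wilks behavior the paper quantifies.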
Identification of reliable gridded reference data for statistical downscaling methods in Alberta
NASA Astrophysics Data System (ADS)
Eum, H. I.; Gupta, A.
2017-12-01
Climate models provide essential information to assess impacts of climate change at regional and global scales. In practice, statistical downscaling methods are applied to prepare climate model data for applications such as hydrologic and ecologic modelling at a watershed scale. As the reliability and (spatial and temporal) resolution of statistically downscaled climate data depend mainly on the reference data, identifying the most reliable reference data is crucial for statistical downscaling. A growing number of gridded climate products are available for key climate variables, which are the main input data to regional modelling systems. However, inconsistencies in these climate products, for example, different combinations of climate variables, varying data domains and data lengths, and data accuracy varying with the physiographic characteristics of the landscape, have caused significant challenges in selecting the most suitable reference climate data for various environmental studies and modelling. Employing various observation-based daily gridded climate products available in the public domain, i.e., thin plate spline regression products (ANUSPLIN and TPS), an inverse distance method (Alberta Townships), a numerical climate model (North American Regional Reanalysis) and an optimum interpolation technique (Canadian Precipitation Analysis), this study evaluates the accuracy of the climate products at each grid point by comparing with the Adjusted and Homogenized Canadian Climate Data (AHCCD) observations for precipitation, minimum and maximum temperature over the province of Alberta. Based on the performance of the climate products at AHCCD stations, we ranked the reliability of these publicly available climate products according to station elevation, discretized into several classes. According to the rank of climate products for each elevation class, we identified the most reliable climate products based on the elevation of target points.
A web-based system was developed to allow users to easily select the most reliable reference climate data at each target point based on the elevation of grid cell. By constructing the best combination of reference data for the study domain, the accurate and reliable statistically downscaled climate projections could be significantly improved.
Zarei, Kobra; Atabati, Morteza; Ahmadi, Monire
2017-05-04
The bee algorithm (BA) is an optimization algorithm inspired by the natural foraging behaviour of honey bees, and it can be applied to feature selection. In this paper, shuffling cross-validation-BA (CV-BA) was applied to select the best descriptors that could describe the retention factor (log k) in the biopartitioning micellar chromatography (BMC) of 79 heterogeneous pesticides. Six descriptors were obtained using BA, and the selected descriptors were then applied for model development using multiple linear regression (MLR). The descriptor selection was also performed using stepwise, genetic algorithm and simulated annealing methods, with MLR applied for model development, and the results were compared with those obtained from shuffling CV-BA. The results showed that shuffling CV-BA can be applied as a powerful descriptor selection method. Support vector machine (SVM) was also applied for model development using the six descriptors selected by BA. The statistical results obtained using SVM were better than those obtained using MLR: the root mean square error (RMSE) and correlation coefficient (R) for the whole data set (training and test) were 0.1863 and 0.9426, respectively, using shuffling CV-BA-MLR, while for the shuffling CV-BA-SVM method they were 0.0704 and 0.9922, respectively.
Statistical and Spatial Analysis of Bathymetric Data for the St. Clair River, 1971-2007
Bennion, David
2009-01-01
To address questions concerning ongoing geomorphic processes in the St. Clair River, selected bathymetric datasets spanning 36 years were analyzed. Comparisons of recent high-resolution datasets covering the upper river indicate a highly variable, active environment. Although statistical and spatial comparisons of the datasets show that some changes to the channel size and shape have taken place during the study period, uncertainty associated with the various survey methods and interpolation processes limits the statistical certainty of these results. The methods used to spatially compare the datasets are sensitive to small variations in position and depth that are within the range of uncertainty associated with the datasets. Characteristics of the data, such as the density of measured points and the range of values surveyed, can also influence the results of spatial comparison. With due consideration of these limitations, apparently active and ongoing areas of elevation change in the river are mapped and discussed.
NASA Astrophysics Data System (ADS)
Majerek, Dariusz; Guz, Łukasz; Suchorab, Zbigniew; Łagód, Grzegorz; Sobczuk, Henryk
2017-07-01
Mold that develops on moistened building barriers is a major cause of the Sick Building Syndrome (SBS). Fungal contamination is normally evaluated using standard biological methods, which are time-consuming and require a lot of manual labor. Fungi emit Volatile Organic Compounds (VOCs) that can be detected in indoor air using several techniques, e.g. chromatography. VOCs can also be detected using gas sensor arrays. Each sensor in the array generates a particular voltage signal that must be analyzed using properly selected statistical methods of interpretation. This work focuses on applying statistical classification models to the evaluation of signals from a gas sensor matrix, analyzing air sampled from the headspace of various types of building materials at different levels of contamination, as well as from clean reference materials.
Nonparametric entropy estimation using kernel densities.
Lake, Douglas E
2009-01-01
The entropy of experimental data from the biological and medical sciences provides additional information over summary statistics. Calculating entropy involves estimates of probability density functions, which can be effectively accomplished using kernel density methods. Kernel density estimation has been widely studied and a univariate implementation is readily available in MATLAB. The traditional definition of Shannon entropy is part of a larger family of statistics, called Renyi entropy, which are useful in applications that require a measure of the Gaussianity of data. Of particular note is the quadratic entropy which is related to the Friedman-Tukey (FT) index, a widely used measure in the statistical community. One application where quadratic entropy is very useful is the detection of abnormal cardiac rhythms, such as atrial fibrillation (AF). Asymptotic and exact small-sample results for optimal bandwidth and kernel selection to estimate the FT index are presented and lead to improved methods for entropy estimation.
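For a Gaussian kernel, the integral ∫f̂(x)² dx underlying the quadratic (Renyi order-2) entropy has a closed form, so no numerical integration is needed. A one-dimensional sketch (bandwidth selection, which the paper treats in detail, is left to the caller):

```python
import math

def quadratic_entropy(data, bandwidth):
    """Renyi quadratic entropy H2 = -log integral f(x)^2 dx of a 1-D
    sample, with f estimated by a Gaussian kernel density of bandwidth
    h. For Gaussian kernels the integral reduces to the double sum
    (1/n^2) * sum_ij N(x_i - x_j; 0, 2h^2)."""
    n = len(data)
    var2 = 2.0 * bandwidth ** 2          # variance of the convolved kernel
    norm = 1.0 / math.sqrt(2.0 * math.pi * var2)
    total = sum(norm * math.exp(-(x - y) ** 2 / (2.0 * var2))
                for x in data for y in data)
    return -math.log(total / (n * n))
```

More widely dispersed samples overlap less under the kernel, shrinking ∫f̂² and hence raising the entropy, which is why this quantity serves as an irregularity measure for RR-interval data in AF detection.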
Mensah, F K; Willett, E V; Simpson, J; Smith, A G; Roman, E
2007-09-15
Substantial heterogeneity has been observed among case-control studies investigating associations between non-Hodgkin's lymphoma and familial characteristics, such as birth order and sibship size. The potential role of selection bias in explaining such heterogeneity is considered within this study. Selection bias according to familial characteristics and socioeconomic status is investigated within a United Kingdom-based case-control study of non-Hodgkin's lymphoma diagnosed during 1998-2001. Reported distributions of birth order and maternal age are each compared with expected reference distributions derived using national birth statistics from the United Kingdom. A method is detailed in which yearly data are used to derive expected distributions, taking account of variability in birth statistics over time. Census data are used to reweight both the case and control study populations such that they are comparable with the general population with regard to socioeconomic status. The authors found little support for an association between non-Hodgkin's lymphoma and birth order or family size and little evidence for an influence of selection bias. However, the findings suggest that between-study heterogeneity could be explained by selection biases that influence the demographic characteristics of participants.
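The census-based reweighting described above is, in essence, post-stratification: each group's weight is its population share divided by its sample share. A minimal sketch with a single categorical variable (an illustrative simplification of the study's actual adjustment):

```python
def poststratification_weights(sample_counts, population_shares):
    """Weights that make a study sample match the general population on
    a categorical variable (e.g. socioeconomic group):
    w_g = population_share_g / sample_share_g."""
    total = sum(sample_counts.values())
    return {g: population_shares[g] / (sample_counts[g] / total)
            for g in sample_counts}
```

Applying these weights to both cases and controls, as the authors do, makes each group comparable with the general population before familial characteristics are contrasted.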
Evidential Value That Exercise Improves BMI z-Score in Overweight and Obese Children and Adolescents
Kelley, George A.; Kelley, Kristi S.
2015-01-01
Background. Given the cardiovascular disease (CVD) related importance of understanding the true effects of exercise on adiposity in overweight and obese children and adolescents, this study examined whether there is evidential value to rule out excessive and inappropriate reporting of statistically significant results, a major problem in the published literature, with respect to exercise-induced improvements in BMI z-score among overweight and obese children and adolescents. Methods. Using data from a previous meta-analysis of 10 published studies that included 835 overweight and obese children and adolescents, a novel, recently developed approach (p-curve) was used to test for evidential value and rule out selective reporting of findings. Chi-squared tests (χ2) were used to test for statistical significance with alpha (p) values <0.05 considered statistically significant. Results. Six of 10 findings (60%) were statistically significant. Statistically significant right-skew to rule out selective reporting was found (χ2 = 38.8, p = 0.0001). Conversely, studies neither lacked evidential value (χ2 = 6.8, p = 0.87) nor lacked evidential value and were intensely p-hacked (χ2 = 4.3, p = 0.98). Conclusion. Evidential value results confirm that exercise reduces BMI z-score in overweight and obese children and adolescents, an important therapeutic strategy for treating and preventing CVD. PMID:26509145
Journal selection decisions: a biomedical library operations research model. I. The framework.
Kraft, D H; Polacsek, R A; Soergel, L; Burns, K; Klair, A
1976-01-01
The problem of deciding which journal titles to select for acquisition in a biomedical library is modeled. The approach taken is based on cost/benefit ratios. Measures of journal worth, methods of data collection, and journal cost data are considered. The emphasis is on the development of a practical process for selecting journal titles, based on the objectivity and rationality of the model; and on the collection of the appropriate data and library statistics in a reasonable manner. The implications of this process towards an overall management information system (MIS) for biomedical serials handling are discussed. PMID:820391
Hybrid genetic algorithm-neural network: feature extraction for unpreprocessed microarray data.
Tong, Dong Ling; Schierz, Amanda C
2011-09-01
Suitable techniques for microarray analysis have been widely researched, particularly for the study of marker genes expressed in a specific type of cancer. Most of the machine learning methods that have been applied to significant gene selection focus on the classification ability rather than the selection ability of the method. These methods also require the microarray data to be preprocessed before analysis takes place. The objective of this study is to develop a hybrid genetic algorithm-neural network (GANN) model that emphasises feature selection and can operate on unpreprocessed microarray data. The GANN is a hybrid model where the fitness value of the genetic algorithm (GA) is based upon the number of samples correctly labelled by a standard feedforward artificial neural network (ANN). The model is evaluated by using two benchmark microarray datasets with different array platforms and differing numbers of classes (a 2-class oligonucleotide microarray dataset for acute leukaemia and a 4-class complementary DNA (cDNA) microarray dataset for SRBCTs (small round blue cell tumours)). The underlying concept of the GANN algorithm is to select highly informative genes by co-evolving both the GA fitness function and the ANN weights at the same time. The novel GANN selected approximately 50% of the same genes as the original studies. This may indicate that these common genes are more biologically significant than other genes in the datasets. The remaining 50% of the significant genes identified were used to build predictive models and, for both datasets, the models based on the set of genes extracted by the GANN method produced more accurate results. The results also suggest that the GANN method not only can detect genes that are exclusively associated with a single cancer type but can also explore the genes that are differentially expressed in multiple cancer types.
The results show that the GANN model has successfully extracted statistically significant genes from the unpreprocessed microarray data as well as extracting known biologically significant genes. We also show that assessing the biological significance of genes based on classification accuracy may be misleading and though the GANN's set of extra genes prove to be more statistically significant than those selected by other methods, a biological assessment of these genes is highly recommended to confirm their functionality. Copyright © 2011 Elsevier B.V. All rights reserved.
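The genetic-algorithm half of such a hybrid can be sketched in a few lines. This is a minimal illustration, not the authors' GANN: the fitness function here is a toy stand-in for the ANN labelling accuracy, and all parameter values are arbitrary.

```python
import random

def evolve_feature_mask(fitness, n_features, pop_size=20, generations=30,
                        p_mut=0.05, seed=0):
    """Minimal genetic algorithm over binary feature masks.

    `fitness(mask)` scores a candidate feature subset; in the hybrid
    scheme it would wrap the classifier (an ANN). This sketch shows only
    the GA: tournament selection, uniform crossover, bit-flip mutation.
    """
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        next_pop = scored[:2]  # elitism: keep the two best masks
        while len(next_pop) < pop_size:
            a, b = (max(rng.sample(scored, 3), key=fitness) for _ in range(2))
            child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
            child = [1 - g if rng.random() < p_mut else g for g in child]
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness)

# Toy fitness: reward masks matching a hypothetical "informative" pattern.
target = [1, 0, 0, 1, 0, 0]
best = evolve_feature_mask(
    lambda m: sum(1 for g, t in zip(m, target) if g == t), 6)
```

In the real model, each fitness call would evaluate the feedforward ANN on the samples restricted to the genes the mask selects, so mask and network weights co-evolve.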
Michigan's forests, 2004: statistics and quality assurance
Scott A. Pugh; Mark H. Hansen; Gary Brand; Ronald E. McRoberts
2010-01-01
The first annual inventory of Michigan's forests was completed in 2004 after 18,916 plots were selected and 10,355 forested plots were visited. This report includes detailed information on forest inventory methods, quality of estimates, and additional tables. An earlier publication presented analyses of the inventoried data (Pugh et al. 2009).
49 CFR Schedule G to Subpart B of... - Selected Statistical Data
Code of Federal Regulations, 2011 CFR
2011-10-01
Selected Statistical Data, Schedule G to Subpart B [dollars in thousands], reported for Greyhound Lines, Inc., Trailways combined, and all study carriers. The purpose of Schedule G is to develop selected property, labor and operational data for use in evaluating...
Murray, Christopher J L
2007-03-10
Health statistics are at the centre of an increasing number of worldwide health controversies. Several factors are sharpening the tension between the supply and demand for high quality health information, and the health-related Millennium Development Goals (MDGs) provide a high-profile example. With thousands of indicators recommended but few measured well, the worldwide health community needs to focus its efforts on improving measurement of a small set of priority areas. Priority indicators should be selected on the basis of public-health significance and several dimensions of measurability. Health statistics can be divided into three types: crude, corrected, and predicted. Health statistics are necessary inputs to planning and strategic decision making, programme implementation, monitoring progress towards targets, and assessment of what works and what does not. Crude statistics that are biased have no role in any of these steps; corrected statistics are preferred. For strategic decision making, when corrected statistics are unavailable, predicted statistics can play an important part. For monitoring progress towards agreed targets and assessment of what works and what does not, however, predicted statistics should not be used. Perhaps the most effective method to decrease controversy over health statistics and to encourage better primary data collection and the development of better analytical methods is a strong commitment to provision of an explicit data audit trail. This initiative would make available the primary data, all post-data collection adjustments, models including covariates used for farcasting and forecasting, and necessary documentation to the public.
Identification of selection footprints on the X chromosome in pig.
Ma, Yunlong; Zhang, Haihan; Zhang, Qin; Ding, Xiangdong
2014-01-01
Identifying footprints of selection can provide straightforward insight into the mechanism of artificial selection and help identify causal genes related to important traits. In this study, three between-population and two within-population approaches, the Cross-Population Extended Haplotype Homozygosity test (XPEHH), the Cross-Population Composite Likelihood Ratio (XPCLR), the F-statistic (Fst), the Integrated Haplotype Score (iHS) and Tajima's D, were implemented to detect selection footprints on the X chromosome in three pig breeds using the Illumina Porcine60K SNP chip. In the detection of selection footprints using between-population methods, 11, 11 and 7 potential selection regions with lengths of 15.62 Mb, 12.32 Mb and 9.38 Mb were identified in Landrace, Chinese Songliao and Yorkshire by XPEHH, respectively; 16, 13 and 17 potential selection regions with lengths of 15.20 Mb, 13.00 Mb and 19.21 Mb by XPCLR; and 4, 2 and 4 potential selection regions with lengths of 3.20 Mb, 1.60 Mb and 3.20 Mb by Fst. For within-population methods, 7, 10 and 9 potential selection regions with lengths of 8.12 Mb, 8.40 Mb and 9.99 Mb were identified in Landrace, Chinese Songliao and Yorkshire by iHS, and 4, 3 and 2 potential selection regions with lengths of 3.20 Mb, 2.40 Mb and 1.60 Mb by Tajima's D. Moreover, the selection regions from different methods partly overlapped; in particular, the regions around 22∼25 Mb were detected as under selection in Landrace and Yorkshire, but not in Chinese Songliao, by all three between-population methods. Only a few overlaps between the selection regions identified by between-population and within-population methods were found. Bioinformatics analysis showed that genes relevant to meat quality, reproduction and immunity were found in the potential selection regions.
In addition, three out of five significant SNPs associated with hematological traits reported in our genome-wide association study were harbored in potential selection regions.
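Of the statistics named above, Fst is the simplest to write down. A per-SNP sketch under Wright's classical definition, with equal sample sizes assumed (our illustration; the study's exact estimator is not specified here):

```python
def wright_fst(p1, p2):
    """Wright's Fst for one biallelic SNP from the allele frequencies in
    two populations: Fst = (Ht - Hs) / Ht, where Hs is the mean within-
    population heterozygosity and Ht the heterozygosity at the pooled
    allele frequency. Fst near 0 means no differentiation; near 1 means
    the populations are fixed for different alleles."""
    p_bar = (p1 + p2) / 2.0
    ht = 2.0 * p_bar * (1.0 - p_bar)
    hs = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return 0.0 if ht == 0 else (ht - hs) / ht
```

Windows of SNPs with elevated Fst between two breeds are the candidate selection regions reported by the between-population scans.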
Velasco, Valeria; Sherwood, Julie S.; Rojas-García, Pedro P.; Logue, Catherine M.
2014-01-01
The aim of this study was to compare a real-time PCR assay, with a conventional culture/PCR method, to detect S. aureus, mecA and Panton-Valentine Leukocidin (PVL) genes in animals and retail meat, using a two-step selective enrichment protocol. A total of 234 samples were examined (77 animal nasal swabs, 112 retail raw meat, and 45 deli meat). The multiplex real-time PCR targeted the genes: nuc (identification of S. aureus), mecA (associated with methicillin resistance) and PVL (virulence factor), and the primary and secondary enrichment samples were assessed. The conventional culture/PCR method included the two-step selective enrichment, selective plating, biochemical testing, and multiplex PCR for confirmation. The conventional culture/PCR method recovered 95/234 positive S. aureus samples. Application of real-time PCR on samples following primary and secondary enrichment detected S. aureus in 111/234 and 120/234 samples respectively. For detection of S. aureus, the kappa statistic was 0.68–0.88 (from substantial to almost perfect agreement) and 0.29–0.77 (from fair to substantial agreement) for primary and secondary enrichments, using real-time PCR. For detection of mecA gene, the kappa statistic was 0–0.49 (from no agreement beyond that expected by chance to moderate agreement) for primary and secondary enrichment samples. Two pork samples were mecA gene positive by all methods. The real-time PCR assay detected the mecA gene in samples that were negative for S. aureus, but positive for Staphylococcus spp. The PVL gene was not detected in any sample by the conventional culture/PCR method or the real-time PCR assay. Among S. aureus isolated by conventional culture/PCR method, the sequence type ST398, and multi-drug resistant strains were found in animals and raw meat samples. The real-time PCR assay may be recommended as a rapid method for detection of S. aureus and the mecA gene, with further confirmation of methicillin-resistant S. 
aureus (MRSA) using the standard culture method. PMID:24849624
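The kappa statistic used above to grade agreement between the two detection methods is straightforward to compute. A minimal sketch of Cohen's kappa for two raters or methods (our illustration; the verbal benchmarks quoted in the abstract, such as "substantial" and "almost perfect", follow the conventional Landis-Koch scale):

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two paired categorical ratings:
    kappa = (po - pe) / (1 - pe), where po is the observed agreement and
    pe the agreement expected by chance from the marginal rates."""
    assert len(a) == len(b)
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n
    cats = set(a) | set(b)
    pe = sum((a.count(c) / n) * (b.count(c) / n) for c in cats)
    return 1.0 if pe == 1 else (po - pe) / (1 - pe)
```

Here each element of `a` and `b` would be the positive/negative call of the culture/PCR method and the real-time PCR assay on the same sample.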
Rotundo, Roberto; Nieri, Michele; Bonaccini, Daniele; Mori, Massimiliano; Lamberti, Elena; Massironi, Domenico; Giachetti, Luca; Franchi, Lorenzo; Venezia, Piero; Cavalcanti, Raffaele; Bondi, Elena; Farneti, Mauro; Pinchi, Vilma; Buti, Jacopo
2015-01-01
To propose a method to measure the esthetics of the smile and to report its validation by means of an intra-rater and inter-rater agreement analysis. Ten variables were chosen as determinants of the esthetics of a smile: smile line and facial midline, tooth alignment, tooth deformity, tooth dischromy, gingival dischromy, gingival recession, gingival excess, gingival scars and diastema/missing papillae. One examiner consecutively selected seventy smile pictures, all in frontal view. Ten examiners, with different levels of clinical experience and different specialties, applied the proposed assessment method twice on the selected pictures, independently and blindly. Intraclass correlation coefficient (ICC) and Fleiss' kappa statistics were used to analyse the intra-rater and inter-rater agreement. Considering the cumulative assessment of the Smile Esthetic Index (SEI), the ICC value for the inter-rater agreement of the 10 examiners was 0.62 (95% CI: 0.51 to 0.72), representing substantial agreement. Intra-rater agreement ranged from 0.86 to 0.99. Inter-rater agreement (Fleiss' kappa statistics) calculated for each variable ranged from 0.17 to 0.75. The SEI was a reproducible method to assess the esthetic component of the smile, useful for the diagnostic phase and for setting appropriate treatment plans.
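Because each picture is scored by multiple examiners, the per-variable agreement above uses Fleiss' kappa rather than Cohen's two-rater version. A minimal sketch with toy counts (our illustration, not the study's code):

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for n subjects each rated by m raters into k
    categories; `counts[i][j]` is the number of raters assigning
    subject i to category j (every row must sum to m)."""
    n = len(counts)
    m = sum(counts[0])
    k = len(counts[0])
    # category proportions over all ratings
    p_j = [sum(row[j] for row in counts) / (n * m) for j in range(k)]
    # per-subject agreement: pairs of raters that agree
    p_i = [(sum(c * c for c in row) - m) / (m * (m - 1)) for row in counts]
    p_bar = sum(p_i) / n
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)
```

For the SEI, each row would be one smile picture and the columns the score categories for a given esthetic variable across the ten examiners.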
[Study on ecological suitability regionalization of Eucommia ulmoides in Guizhou].
Kang, Chuan-Zhi; Wang, Qing-Qing; Zhou, Tao; Jiang, Wei-Ke; Xiao, Cheng-Hong; Xie, Yu
2014-05-01
To study the ecological suitability regionalization of Eucommia ulmoides in Guizhou, for selecting artificial planting bases and high-quality industrial raw material purchase areas for the herb. Based on an investigation of 14 Eucommia ulmoides producing areas, pinoresinol diglucoside content and ecological factors were obtained, and a spatial analysis method was used to carry out the ecological suitability regionalization. Meanwhile, combining pinoresinol diglucoside content, the correlations between the major active components and environmental factors were analyzed statistically. The most suitable planting area of Eucommia ulmoides was the northwest of Guizhou. The distribution of Eucommia ulmoides was mainly affected by the type and pH value of the soil and by monthly precipitation. The spatial structure of the major active components in Eucommia ulmoides was randomly distributed in global space, but had only one aggregation point with a high positive correlation in local space. The major active components of Eucommia ulmoides had no correlation with altitude, longitude or latitude. Using the spatial analysis method and statistical analysis, based on environmental factors and pinoresinol diglucoside content, the ecological suitability regionalization of Eucommia ulmoides can provide a reference for the selection of suitable planting areas and artificial planting bases and for directing production layout.
Kling, Teresia; Johansson, Patrik; Sanchez, José; Marinescu, Voichita D.; Jörnsten, Rebecka; Nelander, Sven
2015-01-01
Statistical network modeling techniques are increasingly important tools to analyze cancer genomics data. However, current tools and resources are not designed to work across multiple diagnoses and technical platforms, thus limiting their applicability to comprehensive pan-cancer datasets such as The Cancer Genome Atlas (TCGA). To address this, we describe a new data driven modeling method, based on generalized Sparse Inverse Covariance Selection (SICS). The method integrates genetic, epigenetic and transcriptional data from multiple cancers, to define links that are present in multiple cancers, a subset of cancers, or a single cancer. It is shown to be statistically robust and effective at detecting direct pathway links in data from TCGA. To facilitate interpretation of the results, we introduce a publicly accessible tool (cancerlandscapes.org), in which the derived networks are explored as interactive web content, linked to several pathway and pharmacological databases. To evaluate the performance of the method, we constructed a model for eight TCGA cancers, using data from 3900 patients. The model rediscovered known mechanisms and contained interesting predictions. Possible applications include prediction of regulatory relationships, comparison of network modules across multiple forms of cancer and identification of drug targets. PMID:25953855
Labots, M; Laarakker, M C; Ohl, F; van Lith, H A
2016-06-29
Selecting chromosome substitution strains (CSSs, also called consomic strains/lines) used in the search for quantitative trait loci (QTLs) consistently requires the identification of the respective phenotypic trait of interest and is simply based on a significant difference between a consomic and host strain. However, statistical significance as represented by P values does not necessarily imply practical importance. We therefore propose a method that pays attention to both the statistical significance and the actual size of the observed effect. The present paper extends this approach and describes in more detail the use of effect size measures (Cohen's d and partial eta squared, ηp2) together with the P value as statistical selection parameters for the chromosomal assignment of QTLs influencing anxiety-related behavior and locomotion in laboratory mice. The effect size measures were based on integrated behavioral z-scoring and were calculated in three experiments: (A) a complete consomic male mouse panel with A/J as the donor strain and C57BL/6J as the host strain. This panel, including host and donor strains, was analyzed in the modified Hole Board (mHB). The consomic line with chromosome 19 from A/J (CSS-19A) was selected since it showed increased anxiety-related behavior, but similar locomotion, compared to its host. (B) Following experiment A, female CSS-19A mice were compared with their C57BL/6J counterparts; however, no significant differences were found and effect sizes were close to zero. (C) A different consomic mouse strain (CSS-19PWD), with chromosome 19 from PWD/PhJ transferred onto the genetic background of C57BL/6J, was compared with its host strain. Here, in contrast with CSS-19A, overall anxiety was decreased in CSS-19PWD males compared to C57BL/6J, but locomotion was not.
This new method shows an improved way to identify CSSs for QTL analysis for anxiety-related behavior using a combination of statistical significance testing and effect sizes. In addition, an intercross between CSS-19A and CSS-19PWD may be of interest for future studies on the genetic background of anxiety-related behavior.
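The two effect size measures named above have simple closed forms. A minimal sketch (our illustration; here ηp2 is recovered from an ANOVA F statistic and its degrees of freedom rather than from raw sums of squares):

```python
import math

def cohens_d(x, y):
    """Cohen's d: the standardized mean difference between two groups,
    i.e. (mean_x - mean_y) divided by the pooled standard deviation."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    sp = math.sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))
    return (mx - my) / sp

def partial_eta_squared(f, df_effect, df_error):
    """Partial eta squared recovered from an ANOVA F statistic:
    eta_p^2 = F*df_effect / (F*df_effect + df_error)."""
    return f * df_effect / (f * df_effect + df_error)
```

A consomic-vs-host comparison that is "significant but tiny" shows up as a small d or ηp2, which is exactly the case the proposed selection method is designed to screen out.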
Genomic selection of agronomic traits in hybrid rice using an NCII population.
Xu, Yang; Wang, Xin; Ding, Xiaowen; Zheng, Xingfei; Yang, Zefeng; Xu, Chenwu; Hu, Zhongli
2018-05-10
Hybrid breeding is an effective tool to improve yield in rice, while parental selection remains the key and difficult issue. Genomic selection (GS) provides opportunities to predict the performance of hybrids before phenotypes are measured. However, the application of GS is influenced by several genetic and statistical factors. Here, we used a rice North Carolina II (NC II) population constructed by crossing 115 rice varieties with five male sterile lines as a model to evaluate the effects of statistical method, heritability, marker density and training population size on prediction of hybrid performance. From the comparison of six GS methods, we found that predictabilities for different methods are significantly different, with genomic best linear unbiased prediction (GBLUP) and the least absolute shrinkage and selection operator (LASSO) being the best, and support vector machine (SVM) and partial least squares (PLS) being the worst. Marker density has less influence on predicting rice hybrid performance than the size of the training population. Additionally, we used the 575 (115 × 5) rice hybrids as a training population to predict eight agronomic traits of all hybrids derived from the 120 (115 + 5) rice varieties each mated with the 3023 rice accessions from the 3000 rice genomes project (3 K RGP). Of the 362,760 potential hybrids, selection of the top 100 predicted hybrids would lead to 35.5%, 23.25%, 30.21%, 42.87%, 61.80%, 75.83%, 19.24% and 36.12% increases in grain yield per plant, thousand-grain weight, panicle number per plant, plant height, secondary branch number, grain number per panicle, panicle length and primary branch number, respectively. This study evaluated the factors affecting predictability in hybrid prediction and demonstrated the implementation of GS to predict hybrid performance in rice. Our results suggest that GS could enable the rapid selection of superior hybrids, thus increasing the efficiency of rice hybrid breeding.
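GBLUP is equivalent to ridge regression on marker genotypes (the RR-BLUP formulation), which makes the core computation easy to sketch. The following is a stdlib-only illustration with a toy marker matrix, not the study's pipeline; in practice the shrinkage parameter is tied to the trait's heritability.

```python
def solve(A, b):
    # Gaussian elimination with partial pivoting for a small dense system.
    n = len(A)
    M = [row[:] + [bv] for row, bv in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c]
                              for c in range(r + 1, n))) / M[r][r]
    return x

def ridge_marker_effects(X, y, lam=1.0):
    """Ridge regression (the RR-BLUP form of genomic prediction):
    solve (X'X + lam*I) beta = X'y for the marker effects beta."""
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)

def predict(X, beta):
    """Predicted phenotype of an untested genotype: its marker profile
    times the estimated marker effects."""
    return [sum(x * b for x, b in zip(row, beta)) for row in X]
```

Untested hybrids are then ranked by `predict` on their (in silico) marker profiles, which is how the 362,760 potential hybrids can be screened without field trials.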
Selected Streamflow Statistics for Streamgaging Stations in Delaware, 2003
Ries, Kernell G.
2004-01-01
Flow-duration and low-flow frequency statistics were calculated for 15 streamgaging stations in Delaware, in cooperation with the Delaware Geological Survey. The flow-duration statistics include the 1-, 2-, 5-, 10-, 20-, 30-, 40-, 50-, 60-, 70-, 80-, 90-, 95-, 98-, and 99-percent duration discharges. The low-flow frequency statistics include the average discharges for 1, 7, 14, 30, 60, 90, and 120 days that recur, on average, once in 1.01, 2, 5, 10, 20, 50, and 100 years. The statistics were computed using U.S. Geological Survey computer programs that can be downloaded from the World Wide Web at no cost. The computer programs automate standard U.S. Geological Survey methods for computing the statistics. Documentation is provided at the Web sites for the individual programs. The computed statistics are presented in tabular format on a separate page for each station, along with the station name, station number, the location, the period of record, and remarks.
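A flow-duration statistic is simply the discharge equalled or exceeded a given percentage of the time. A minimal sketch of how such percent-duration discharges can be read from a ranked daily record (our illustration using the Weibull plotting position; the USGS programs referenced above implement more careful handling and interpolation):

```python
def flow_duration(discharges, percents=(1, 5, 10, 50, 90, 95, 99)):
    """Flow-duration statistics: for each percent p, the discharge
    equalled or exceeded p percent of the time, read from the record
    ranked largest-first using the plotting position p = rank/(n+1)."""
    q = sorted(discharges, reverse=True)  # rank 1 = largest flow
    n = len(q)
    out = {}
    for pct in percents:
        rank = pct / 100.0 * (n + 1)
        i = min(max(int(round(rank)), 1), n)  # clamp to a valid rank
        out[pct] = q[i - 1]
    return out
```

The 90- or 95-percent duration discharges from such a table are the low-flow end of the curve, which is why they pair naturally with the low-flow frequency statistics listed above.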
NASA Astrophysics Data System (ADS)
Attia, Khalid A. M.; El-Abasawi, Nasr M.; El-Olemy, Ahmed; Abdelazim, Ahmed H.
2018-01-01
Three UV spectrophotometric methods have been developed for the first time for the simultaneous determination of two newly FDA-approved drugs, elbasvir and grazoprevir, in their combined pharmaceutical dosage form. These methods comprise the simultaneous equation method and partial least squares with and without a variable selection procedure (genetic algorithm). For the simultaneous equation method, the absorbance values at 369 nm (λmax of elbasvir) and 253 nm (λmax of grazoprevir) were selected to form the two simultaneous equations required for the mathematical processing and quantitative analysis of the studied drugs. Alternatively, partial least squares with and without the variable selection procedure (genetic algorithm) was applied in the spectral analysis, because the simultaneous inclusion of many wavelengths, rather than a single or dual wavelength, greatly increases the precision and predictive ability of the methods. The drugs were successfully assayed in their pharmaceutical formulation by the proposed methods. A statistical comparison of the obtained results with those of the manufacturers' methods was performed. Notably, there was no significant difference between the proposed methods and the manufacturers' methods with respect to the validation parameters.
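The simultaneous equation (Vierordt) method reduces to solving a 2×2 linear system built from Beer's law at the two selected wavelengths. A minimal sketch with made-up absorptivity values (the real coefficients would come from calibration at 369 and 253 nm):

```python
def simultaneous_equation(a11, a12, a21, a22, A1, A2):
    """Two-wavelength simultaneous equation (Vierordt) method: at each
    wavelength the mixture absorbance is additive under Beer's law,
    A1 = a11*Cx + a12*Cy and A2 = a21*Cx + a22*Cy, giving two linear
    equations in the two concentrations, solved here by Cramer's rule.
    The a-values are per-unit-concentration absorptivities."""
    det = a11 * a22 - a12 * a21
    cx = (A1 * a22 - a12 * A2) / det
    cy = (a11 * A2 - A1 * a21) / det
    return cx, cy
```

The coefficient values in the test below are purely illustrative, chosen so the known concentrations (10, 5) are recovered from the mixture absorbances.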
Shah, Munir H; Shaheen, N; Jaffar, M
2006-03-01
To understand the metal distribution characteristics in the atmosphere of urban Islamabad, total suspended particulate (TSP) samples were collected on a daily 12 h basis, at the Quaid-i-Azam University campus, using a high volume sampler. The TSP samples were treated with an HNO3/HClO4 based wet digestion method for the quantification of eight selected metals: Fe, Zn, Pb, Mn, Cr, Co, Ni and Cd by the FAAS method. The monitoring period ran from June 2001 to January 2002, with a total of 194 samples collected on cellulose filters. Effects of different meteorological conditions such as temperature, relative humidity, wind speed and wind direction on selected metal levels were interpreted by means of a multivariate statistical approach. Enhanced metal levels for Fe (930 ng/m3), Zn (542 ng/m3) and Pb (210 ng/m3) were found on the mean scale, while Mn, Cr, Co and Ni emerged as minor contributors. A statistical correlation study was also conducted, and a strong correlation was observed between Pb-Cr (r=0.611). The relative humidity showed some significant influence on atmospheric metal distribution, while other meteorological parameters showed weak relationships with TSP metal levels. Regarding the origin of sources of heavy metals in TSP, the statistical procedure identified three source profiles: automobile emissions, industrial/metallurgical units, and natural soil dust. The metal levels were also compared with those reported for other parts of the world, which showed that the metal levels in the urban atmosphere of Islamabad exceed those of European industrial and urban sites while being comparable with some Asian sites.
Ross, Joseph S; Bates, Jonathan; Parzynski, Craig S; Akar, Joseph G; Curtis, Jeptha P; Desai, Nihar R; Freeman, James V; Gamble, Ginger M; Kuntz, Richard; Li, Shu-Xia; Marinac-Dabic, Danica; Masoudi, Frederick A; Normand, Sharon-Lise T; Ranasinghe, Isuru; Shaw, Richard E; Krumholz, Harlan M
2017-01-01
Background Machine learning methods may complement traditional analytic methods for medical device surveillance. Methods and results Using data from the National Cardiovascular Data Registry for implantable cardioverter–defibrillators (ICDs) linked to Medicare administrative claims for longitudinal follow-up, we applied three statistical approaches to safety-signal detection for commonly used dual-chamber ICDs that used two propensity score (PS) models: one specified by subject-matter experts (PS-SME), and the other one by machine learning-based selection (PS-ML). The first approach used PS-SME and cumulative incidence (time-to-event), the second approach used PS-SME and cumulative risk (Data Extraction and Longitudinal Trend Analysis [DELTA]), and the third approach used PS-ML and cumulative risk (embedded feature selection). Safety-signal surveillance was conducted for eleven dual-chamber ICD models implanted at least 2,000 times over 3 years. Between 2006 and 2010, there were 71,948 Medicare fee-for-service beneficiaries who received dual-chamber ICDs. Cumulative device-specific unadjusted 3-year event rates varied for three surveyed safety signals: death from any cause, 12.8%–20.9%; nonfatal ICD-related adverse events, 19.3%–26.3%; and death from any cause or nonfatal ICD-related adverse event, 27.1%–37.6%. Agreement among safety signals detected/not detected between the time-to-event and DELTA approaches was 90.9% (360 of 396, k=0.068), between the time-to-event and embedded feature-selection approaches was 91.7% (363 of 396, k=−0.028), and between the DELTA and embedded feature selection approaches was 88.1% (349 of 396, k=−0.042). Conclusion Three statistical approaches, including one machine learning method, identified important safety signals, but without exact agreement. Ensemble methods may be needed to detect all safety signals for further evaluation during medical device surveillance. PMID:28860874
Pseudorandom number generation using chaotic true orbits of the Bernoulli map
DOE Office of Scientific and Technical Information (OSTI.GOV)
Saito, Asaki, E-mail: saito@fun.ac.jp; Yamaguchi, Akihiro
We devise a pseudorandom number generator that exactly computes chaotic true orbits of the Bernoulli map on quadratic algebraic integers. Moreover, we describe a way to select the initial points (seeds) for generating multiple pseudorandom binary sequences. This selection method distributes the initial points almost uniformly (equidistantly) in the unit interval, and latter parts of the generated sequences are guaranteed not to coincide. We also demonstrate through statistical testing that the generated sequences possess good randomness properties.
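The idea of exact (true-orbit) computation can be illustrated with stdlib Python. The sketch below tracks x = p + q·√2 as a pair of rationals and emits Bernoulli-map bits; it is our simplification of the scheme, not the paper's generator (the paper works with quadratic algebraic integers and guarantees the floor computation rigorously, which this sketch only approximates with a high-precision rational bound for √2).

```python
from fractions import Fraction
from math import isqrt

# High-precision rational lower bound for sqrt(2). It is precise enough
# that floor(p + q*sqrt(2)) is computed correctly for the modest number
# of iterations shown here; a rigorous generator would track error bounds.
PREC = 10 ** 50
SQRT2 = Fraction(isqrt(2 * PREC * PREC), PREC)

def bernoulli_bits(p, q, n):
    """Emit n bits of the Bernoulli (doubling) map orbit of the quadratic
    irrational x0 = p + q*sqrt(2), with p, q tracked exactly as rationals
    so no floating-point round-off accumulates along the orbit."""
    bits = []
    for _ in range(n):
        p, q = 2 * p, 2 * q             # x <- 2x
        b = int(p + q * SQRT2)          # floor(2x) via the rational bound
        p -= b                          # x <- 2x mod 1
        bits.append(b)
    return bits
```

Seeding with x0 = √2 − 1 (p = −1, q = 1) yields the exact binary expansion of that irrational, bit by bit; distinct quadratic-irrational seeds give sequences guaranteed not to merge, which is the selection property the abstract describes.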
Khan, Hafiz; Saxena, Anshul; Perisetti, Abhilash; Rafiq, Aamrin; Gabbidon, Kemesha; Mende, Sarah; Lyuksyutova, Maria; Quesada, Kandi; Blakely, Summre; Torres, Tiffany; Afesse, Mahlet
2016-12-01
Background: Breast cancer is a worldwide public health concern and is the most prevalent type of cancer in women in the United States. This study concerned the best fit of statistical probability models on the basis of survival times for nine state cancer registries: California, Connecticut, Georgia, Hawaii, Iowa, Michigan, New Mexico, Utah, and Washington. Materials and Methods: A probability random sampling method was applied to select and extract records of 2,000 breast cancer patients from the Surveillance Epidemiology and End Results (SEER) database for each of the nine state cancer registries used in this study. EasyFit software was utilized to identify the best probability models by using goodness of fit tests, and to estimate parameters for various statistical probability distributions that fit survival data. Results: Statistical analysis for the summary of statistics is reported for each of the states for the years 1973 to 2012. Kolmogorov-Smirnov, Anderson-Darling, and Chi-squared goodness of fit test values were used for survival data, the highest values of goodness of fit statistics being considered indicative of the best fit survival model for each state. Conclusions: It was found that California, Connecticut, Georgia, Iowa, New Mexico, and Washington followed the Burr probability distribution, while the Dagum probability distribution gave the best fit for Michigan and Utah, and Hawaii followed the Gamma probability distribution. These findings highlight differences between states through selected sociodemographic variables and also demonstrate probability modeling differences in breast cancer survival times. The results of this study can be used to guide healthcare providers and researchers for further investigations into social and environmental factors in order to reduce the occurrence of and mortality due to breast cancer.
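The distribution-fitting comparison above rests on goodness-of-fit statistics such as Kolmogorov-Smirnov. A minimal sketch of the one-sample KS statistic with a toy sample (our illustration; EasyFit computes this alongside Anderson-Darling and chi-squared):

```python
import math

def ks_statistic(sample, cdf):
    """One-sample Kolmogorov-Smirnov statistic D: the largest absolute
    gap between the empirical CDF of the data and a candidate model CDF.
    A smaller D means the model CDF tracks the data more closely."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        fx = cdf(x)
        # the empirical CDF jumps from i/n to (i+1)/n at x
        d = max(d, abs((i + 1) / n - fx), abs(fx - i / n))
    return d

data = [i / 10 for i in range(1, 10)]  # toy sample: 0.1 .. 0.9
d_uniform = ks_statistic(data, lambda x: min(max(x, 0.0), 1.0))
d_exponential = ks_statistic(data, lambda x: 1 - math.exp(-x))
```

For survival times, `cdf` would be the fitted Burr, Dagum, or Gamma distribution function, and candidate models are compared by their resulting statistics.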
Experimental statistics for biological sciences.
Bang, Heejung; Davidian, Marie
2010-01-01
In this chapter, we cover basic and fundamental principles and methods in statistics - from "What are Data and Statistics?" to "ANOVA and linear regression," which are the basis of any statistical thinking and undertaking. Readers can easily find the selected topics in most introductory statistics textbooks, but we have tried to assemble and structure them in a succinct and reader-friendly manner in a stand-alone chapter. This text has long been used in real classroom settings for both undergraduate and graduate students, whether or not they major in the statistical sciences. We hope that from this chapter readers will understand the key statistical concepts and terminology, how to design a study (experimental or observational), how to analyze the data (e.g., describe the data and/or estimate the parameter(s) and make inferences), and how to interpret the results. This text would be most useful as supplemental material while readers take their own statistics courses, or as a reference to accompany the manual of any statistical software package as a self-teaching guide.
Selected-node stochastic simulation algorithm
NASA Astrophysics Data System (ADS)
Duso, Lorenzo; Zechner, Christoph
2018-04-01
Stochastic simulations of biochemical networks are of vital importance for understanding complex dynamics in cells and tissues. However, existing methods to perform such simulations are associated with computational difficulties, and addressing these remains a daunting challenge. Here we introduce the selected-node stochastic simulation algorithm (snSSA), which allows us to exclusively simulate an arbitrary, selected subset of molecular species of a possibly large and complex reaction network. The algorithm is based on an analytical elimination of chemical species, thereby avoiding explicit simulation of the associated chemical events. These species are instead described continuously in terms of statistical moments derived from a stochastic filtering equation, resulting in a substantial speedup when compared to Gillespie's stochastic simulation algorithm (SSA). Moreover, we show that statistics obtained via snSSA profit from a variance reduction, which can significantly lower the number of Monte Carlo samples needed to achieve a certain performance. We demonstrate the algorithm using several biological case studies for which the simulation time could be reduced by orders of magnitude.
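For contrast with the snSSA described above, a minimal version of Gillespie's SSA — the baseline the authors compare against — for a one-species birth-death process might look as follows. The process and rate constants are invented for illustration; they are not from the paper's case studies.

```python
import numpy as np

def gillespie_birth_death(k_birth, k_death, x0, t_end, rng):
    """Exact SSA trajectory for the birth-death process 0 -> X (rate k_birth)
    and X -> 0 (rate k_death per molecule)."""
    t, x = 0.0, x0
    times, states = [t], [x]
    while t < t_end:
        a_birth = k_birth
        a_death = k_death * x
        a_total = a_birth + a_death
        if a_total == 0:
            break
        t += rng.exponential(1.0 / a_total)        # waiting time to next event
        if rng.random() < a_birth / a_total:       # choose which reaction fires
            x += 1
        else:
            x -= 1
        times.append(t)
        states.append(x)
    return np.array(times), np.array(states)

rng = np.random.default_rng(0)
# Stationary mean of this process is k_birth / k_death = 10.
finals = [gillespie_birth_death(10.0, 1.0, 0, 50.0, rng)[1][-1] for _ in range(200)]
mean_final = float(np.mean(finals))
```

The cost of the explicit event loop is what snSSA avoids by integrating out the unselected species analytically.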
Optimization of Thick, Large Area YBCO Film Growth Through Response Surface Methods
NASA Astrophysics Data System (ADS)
Porzio, J.; Mahoney, C. H.; Sullivan, M. C.
2014-03-01
We present our work on the optimization of thick, large area YBa2Cu3O7-δ (YBCO) film growth through response surface methods. Thick, large area films have commercial uses and have recently been used in dramatic demonstrations of levitation and suspension. Our films are grown via pulsed laser deposition and we have optimized growth parameters via response surface methods. Response surface methodology is a statistical tool to optimize selected quantities with respect to a set of variables. We optimized our YBCO films' critical temperatures, thicknesses, and structures with respect to three PLD growth parameters: deposition temperature, laser energy, and deposition pressure. We will present an overview of YBCO growth via pulsed laser deposition, the statistical theory behind response surface methods, and the application of response surface methods to pulsed laser deposition growth of YBCO. Results from the experiment will be presented in a discussion of the optimized film quality. Supported by NSF grant DMR-1305637
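The response-surface idea — fit a second-order polynomial to measured outcomes and solve for the stationary point — can be sketched as below. The two coded "growth factors" and the response values are synthetic stand-ins, not actual PLD data; a real design would use three factors and a structured design matrix (e.g. central composite).

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical design: two coded growth factors (e.g. temperature, pressure).
x1 = rng.uniform(-1, 1, 30)
x2 = rng.uniform(-1, 1, 30)
# Unknown "true" response with optimum at (0.3, -0.2), plus measurement noise.
y = 90 - 5 * (x1 - 0.3) ** 2 - 8 * (x2 + 0.2) ** 2 + rng.normal(0, 0.1, 30)

# Second-order model: y ~ b0 + b1*x1 + b2*x2 + b11*x1^2 + b22*x2^2 + b12*x1*x2
X = np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2, b11, b22, b12 = beta

# Stationary point: solve the gradient system  2*b11*x1 + b12*x2 = -b1, etc.
A = np.array([[2 * b11, b12], [b12, 2 * b22]])
opt = np.linalg.solve(A, -np.array([b1, b2]))
```

Negative quadratic coefficients confirm the stationary point is a maximum of the fitted surface.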
A study of finite mixture model: Bayesian approach on financial time series data
NASA Astrophysics Data System (ADS)
Phoong, Seuk-Yen; Ismail, Mohd Tahir
2014-07-01
Recently, statisticians have emphasized fitting finite mixture models using Bayesian methods. A finite mixture model combines several distributions to model a statistical distribution, while the Bayesian method is a statistical approach used to fit the mixture model. The Bayesian method is widely used because its asymptotic properties provide remarkable results. In addition, the Bayesian method exhibits consistency, meaning that the parameter estimates are close to the predictive distributions. In the present paper, the number of components for the mixture model is selected using the Bayesian Information Criterion; identifying the correct number of components is important because a misspecified number may lead to invalid results. The Bayesian method is then utilized to fit the k-component mixture model in order to explore the relationship between rubber prices and stock market prices for Malaysia, Thailand, the Philippines and Indonesia. The results show a negative relationship between rubber prices and stock market prices for all selected countries.
Wu, K; Daruwalla, Z J; Wong, K L; Murphy, D; Ren, H
2015-08-01
The commercial humeral implants based on the Western population are currently not entirely compatible with Asian patients, owing to differences in bone size, shape and structure. Surgeons may have to compromise or use different, less conforming implants, which may cause complications with, as well as inconvenience to, implant positioning. The construction of Asian humerus atlases of different clusters has therefore been proposed to eradicate this problem and to facilitate planning of minimally invasive surgical procedures [6,31]. According to the features of the atlases, new implants could be designed specifically for different patients. Furthermore, an automatic implant selection algorithm has also been proposed in order to reduce the complications caused by implant-bone mismatch. Prior to the design of the implant, data clustering and extraction of the relevant features were carried out on the datasets of each gender. The fuzzy C-means clustering method is explored in this paper. In addition, two new implant selection schemes, namely the Procrustes analysis-based (PA-based) scheme and the group average distance-based (GAD-based) scheme, were proposed to better match implants from the database to new patients. Neither algorithm had previously been used in this area, yet both turn out to perform excellently in implant selection. Additionally, algorithms to calculate the matching scores between various implants and the patient data are proposed in this paper to assist the implant selection procedure. The results obtained have indicated the feasibility of the proposed development and selection scheme. The 16 sets of male data were divided into two clusters with 8 and 8 subjects, respectively, and the 11 female datasets were also divided into two clusters with 5 and 6 subjects, respectively.
Based on the features of each cluster, the implants designed by the proposed algorithm fit very well on their reference humeri and the proposed implant selection procedure allows for a scenario of treating a patient with merely a preoperative anatomical model in order to correctly select the implant that has the best fit. Based on the leave-one-out validation, it can be concluded that both the PA-based method and GAD-based method are able to achieve excellent performance when dealing with the problem of implant selection. The accuracy and average execution time for the PA-based method were 100% and 0.132 s, respectively, while those of the GAD-based method were 100% and 0.058 s. Therefore, the GAD-based method outperformed the PA-based method in terms of execution speed. The primary contributions of this paper include the proposal of methods for development of Asian-, gender- and cluster-specific implants based on shape features and selection of the best fit implants for future patients according to their features. To the best of our knowledge, this is the first work that proposes implant design and selection for Asian patients automatically based on features extracted from cluster-specific statistical atlases.
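A Procrustes-analysis-based matching score of the kind described above can be sketched with `scipy.spatial.procrustes`, which removes translation, scale and rotation before computing a disparity. The landmark sets below are random stand-ins for implant and patient geometry, not the humerus atlases of the paper.

```python
import numpy as np
from scipy.spatial import procrustes

rng = np.random.default_rng(3)
# Hypothetical landmark sets (10 landmarks in 2-D) for two implant templates.
implant_a = rng.normal(size=(10, 2))
implant_b = rng.normal(size=(10, 2))

# A "patient" whose anatomy is implant_a under rotation, scaling and slight noise.
theta = 0.4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
patient = 1.7 * implant_a @ R.T + rng.normal(scale=0.01, size=(10, 2))

# Procrustes disparity as the matching score: lower = better fit.
scores = {name: procrustes(patient, shape)[2]
          for name, shape in [("A", implant_a), ("B", implant_b)]}
best = min(scores, key=scores.get)
```

Because Procrustes analysis factors out pose and scale, the score reflects pure shape mismatch, which is what implant conformity depends on.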
Modelling gene expression profiles related to prostate tumor progression using binary states
2013-01-01
Background Cancer is a complex disease commonly characterized by the disrupted activity of several cancer-related genes such as oncogenes and tumor-suppressor genes. Previous studies suggest that the process of tumor progression to malignancy is dynamic and can be traced by changes in gene expression. Despite the enormous efforts made for differential expression detection and biomarker discovery, few methods have been designed to model gene expression levels across tumor stages during malignancy progression. Such models could help us understand the dynamics and simplify or reveal the complexity of tumor progression. Methods We have modeled an on-off state of gene activation per sample and then per stage to select gene expression profiles associated with tumor progression. The selection is guided by statistical significance of profiles based on randomly permuted datasets. Results We show that our method identifies expected profiles corresponding to oncogenes and tumor suppressor genes in a prostate tumor progression dataset. Comparisons with other methods support our findings and indicate that a considerable proportion of significant profiles is not found by other statistical tests commonly used to detect differential expression between tumor stages nor by other tailored methods. Ontology and pathway analysis concurred with these findings. Conclusions Results suggest that our methodology may be a valuable tool to study tumor malignancy progression, which might reveal novel cancer therapies. PMID:23721350
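Permutation-based significance for a binary (on/off) expression profile can be sketched as follows. The stage labels, flip rate and the association score are invented for illustration; they are not the authors' exact statistic, only the same permutation logic.

```python
import numpy as np

rng = np.random.default_rng(5)
stages = np.repeat([0, 1, 2, 3], 10)              # 40 samples across 4 tumor stages
# Hypothetical on/off calls for one gene: switches on from stage 2, with 10% noise.
profile = (stages >= 2).astype(int)
flip = rng.random(stages.size) < 0.1
profile = np.where(flip, 1 - profile, profile)

def assoc(p, s):
    """Association between a binary profile and the stage ordering."""
    return abs(np.corrcoef(p, s)[0, 1])

obs = assoc(profile, stages)
# Null distribution: permute the on/off calls across samples.
null = np.array([assoc(rng.permutation(profile), stages) for _ in range(2000)])
p_value = (np.sum(null >= obs) + 1) / (null.size + 1)
```

Profiles whose permutation p-value survives a multiple-testing correction would be retained as progression-associated.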
Härkänen, Tommi; Kaikkonen, Risto; Virtala, Esa; Koskinen, Seppo
2014-11-06
To assess the nonresponse rates in a questionnaire survey with respect to administrative register data, and to correct the bias statistically. The Finnish Regional Health and Well-being Study (ATH) in 2010 was based on a national sample and several regional samples. Missing data analysis was based on socio-demographic register data covering the whole sample. Inverse probability weighting (IPW) and doubly robust (DR) methods were estimated using the logistic regression model, which was selected using the Bayesian information criterion. The crude, weighted and true self-reported turnout in the 2008 municipal election and prevalences of entitlements to specially reimbursed medication, and the crude and weighted body mass index (BMI) means were compared. The IPW method appeared to remove a relatively large proportion of the bias compared to the crude prevalence estimates of the turnout and the entitlements to specially reimbursed medication. Several demographic factors were shown to be associated with missing data, but few interactions were found. Our results suggest that the IPW method can improve the accuracy of results of a population survey, and the model selection provides insight into the structure of missing data. However, health-related missing data mechanisms are beyond the scope of statistical methods, which mainly rely on socio-demographic information to correct the results.
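The IPW correction can be illustrated on a synthetic survey in which response depends on a register covariate. For simplicity the response probability is estimated within covariate strata rather than by the logistic model of the paper; the population, outcome and response rates below are all invented.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 20_000
# Register covariate known for the whole sample: age group (0 = young, 1 = old).
old = rng.random(n) < 0.5
# Outcome of interest (e.g. voted in the election), which depends on age group.
voted = rng.random(n) < np.where(old, 0.8, 0.4)
true_mean = voted.mean()

# Nonresponse mechanism: older people answer the questionnaire more often.
responded = rng.random(n) < np.where(old, 0.9, 0.5)

crude = voted[responded].mean()            # biased upward: respondents skew old

# IPW: weight each respondent by 1 / P(respond | covariate),
# with the response probability estimated within covariate strata.
p_hat = np.where(old, responded[old].mean(), responded[~old].mean())
weights = 1.0 / p_hat[responded]
ipw = np.average(voted[responded], weights=weights)
```

Up-weighting the under-represented stratum restores the population composition, which is exactly what the register-based weights do in the study.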
Designing Intervention Studies: Selected Populations, Range Restrictions, and Statistical Power
ERIC Educational Resources Information Center
Miciak, Jeremy; Taylor, W. Pat; Stuebing, Karla K.; Fletcher, Jack M.; Vaughn, Sharon
2016-01-01
An appropriate estimate of statistical power is critical for the design of intervention studies. Although the inclusion of a pretest covariate in the test of the primary outcome can increase statistical power, samples selected on the basis of pretest performance may demonstrate range restriction on the selection measure and other correlated…
Designing Intervention Studies: Selected Populations, Range Restrictions, and Statistical Power
Miciak, Jeremy; Taylor, W. Pat; Stuebing, Karla K.; Fletcher, Jack M.; Vaughn, Sharon
2016-01-01
An appropriate estimate of statistical power is critical for the design of intervention studies. Although the inclusion of a pretest covariate in the test of the primary outcome can increase statistical power, samples selected on the basis of pretest performance may demonstrate range restriction on the selection measure and other correlated measures. This can result in attenuated pretest-posttest correlations, reducing the variance explained by the pretest covariate. We investigated the implications of two potential range restriction scenarios: direct truncation on a selection measure and indirect range restriction on correlated measures. Empirical and simulated data indicated direct range restriction on the pretest covariate greatly reduced statistical power and necessitated sample size increases of 82%–155% (dependent on selection criteria) to achieve equivalent statistical power to parameters with unrestricted samples. However, measures demonstrating indirect range restriction required much smaller sample size increases (32%–71%) under equivalent scenarios. Additional analyses manipulated the correlations between measures and pretest-posttest correlations to guide planning experiments. Results highlight the need to differentiate between selection measures and potential covariates and to investigate range restriction as a factor impacting statistical power. PMID:28479943
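The attenuation of the pretest-posttest correlation under direct truncation — the mechanism driving the power losses reported above — can be reproduced in a small simulation. The population correlation of 0.7 and the bottom-25% selection rule are illustrative choices, not the study's empirical values.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
pretest = rng.normal(size=n)
# Posttest correlates 0.7 with pretest in the unrestricted population.
posttest = 0.7 * pretest + np.sqrt(1 - 0.7**2) * rng.normal(size=n)

full_r = np.corrcoef(pretest, posttest)[0, 1]

# Direct truncation: keep only the bottom 25% on the selection (pretest) measure,
# mimicking selection of the lowest-performing students for an intervention.
cutoff = np.quantile(pretest, 0.25)
sel = pretest <= cutoff
restricted_r = np.corrcoef(pretest[sel], posttest[sel])[0, 1]
```

The restricted-sample correlation drops well below 0.7, so the pretest covariate explains far less posttest variance and the covariate-adjusted test loses power accordingly.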
Heskes, Tom; Eisinga, Rob; Breitling, Rainer
2014-11-21
The rank product method is a powerful statistical technique for identifying differentially expressed molecules in replicated experiments. A critical issue in molecule selection is accurate calculation of the p-value of the rank product statistic to adequately address multiple testing. Exact calculation, as well as permutation and gamma approximations, has been proposed to determine molecule-level significance. These current approaches have serious drawbacks as they are either computationally burdensome or provide inaccurate estimates in the tail of the p-value distribution. We derive strict lower and upper bounds to the exact p-value along with an accurate approximation that can be used to assess the significance of the rank product statistic in a computationally fast manner. The bounds and the proposed approximation are shown to provide far better accuracy over existing approximate methods in determining tail probabilities, with the slightly conservative upper bound protecting against false positives. We illustrate the proposed method in the context of a recently published analysis on transcriptomic profiling performed in blood. We provide a method to determine upper bounds and accurate approximate p-values of the rank product statistic. The proposed algorithm provides an order of magnitude increase in throughput as compared with current approaches and offers the opportunity to explore new application domains with even larger multiple testing issues. The R code is published in one of the Additional files and is available at http://www.ru.nl/publish/pages/726696/rankprodbounds.zip.
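A brute-force permutation version of the rank product p-value — the computationally burdensome baseline the bounds are meant to replace — can be sketched as follows. The expression matrix and effect size are synthetic; a real analysis would need vastly more permutations, which is precisely the cost the paper's bounds avoid.

```python
import numpy as np

rng = np.random.default_rng(9)
n_genes, k = 1000, 5
data = rng.normal(size=(n_genes, k))             # expression in 5 replicates
data[0] += 3.0                                   # one truly up-regulated molecule

def rank_product(m):
    """Geometric mean of within-replicate ranks (rank 1 = most up-regulated)."""
    ranks = np.argsort(np.argsort(-m, axis=0), axis=0) + 1
    return ranks.prod(axis=1) ** (1.0 / m.shape[1])

obs = rank_product(data)[0]                      # statistic for the molecule of interest

# Permutation null: shuffle each replicate independently, pool all null RP values.
n_perm = 100
count = 0
for _ in range(n_perm):
    count += np.sum(rank_product(rng.permuted(data, axis=0)) <= obs)
p_value = (count + 1) / (n_genes * n_perm + 1)
```

Accurate tail probabilities need p-values far below what 100 permutations can resolve, which is why closed-form bounds are so much more practical.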
Kline, Joshua C.
2014-01-01
Over the past four decades, various methods have been implemented to measure synchronization of motor-unit firings. In this work, we provide evidence that prior reports of the existence of universal common inputs to all motoneurons and the presence of long-term synchronization are misleading, because they did not use sufficiently rigorous statistical tests to detect synchronization. We developed a statistically based method (SigMax) for computing synchronization and tested it with data from 17,736 motor-unit pairs containing 1,035,225 firing instances from the first dorsal interosseous and vastus lateralis muscles—a data set one order of magnitude greater than that reported in previous studies. Only firing data, obtained from surface electromyographic signal decomposition with >95% accuracy, were used in the study. The data were not subjectively selected in any manner. Because of the size of our data set and the statistical rigor inherent to SigMax, we have confidence that the synchronization values that we calculated provide an improved estimate of physiologically driven synchronization. Compared with three other commonly used techniques, ours revealed three types of discrepancies that result from failing to use sufficiently rigorous statistical tests to detect synchronization. 1) On average, the z-score method falsely detected synchronization at 16 separate latencies in each motor-unit pair. 2) The cumulative sum method missed one out of every four synchronization identifications found by SigMax. 3) The common input assumption method identified synchronization from 100% of motor-unit pairs studied. SigMax revealed that only 50% of motor-unit pairs actually manifested synchronization. PMID:25210152
Statistical Considerations of Food Allergy Prevention Studies.
Bahnson, Henry T; du Toit, George; Lack, Gideon
Clinical studies to prevent the development of food allergy have recently helped reshape public policy recommendations on the early introduction of allergenic foods. These trials are also prompting new research, and it is therefore important to address the unique design and analysis challenges of prevention trials. We highlight statistical concepts and give recommendations that clinical researchers may wish to adopt when designing future study protocols and analysis plans for prevention studies. Topics include selecting a study sample, addressing internal and external validity, improving statistical power, choosing alpha and beta, analysis innovations to address dilution effects, and analysis methods to deal with poor compliance, dropout, and missing data.
Dougherty, T B; Porche, V H; Thall, P F
2000-04-01
This study investigated the ability of the modified continual reassessment method (MCRM) to determine the maximum tolerated dose of the opioid antagonist nalmefene, which does not reverse analgesia in an acceptable number of postoperative patients receiving epidural fentanyl in 0.075% bupivacaine. In the postanesthetic care unit, patients received a single intravenous dose of 0.25, 0.50, 0.75, or 1.00 microg/kg nalmefene. Reversal of analgesia was defined as an increase in pain score of two or more integers above baseline on a visual analog scale from 0 through 10 after nalmefene administration. Patients were treated in cohorts of one, starting with the lowest dose. The maximum tolerated dose of nalmefene was defined as that dose, among the four studied, with a final mean probability of reversal of analgesia (PROA) closest to 0.20 (i.e., a 20% chance of causing reversal). The modified continual reassessment method is an iterative Bayesian statistical procedure that, in this study, selected the dose for each successive cohort as that having a mean PROA closest to the preselected target PROA of 0.20. The modified continual reassessment method repeatedly updated the PROA of each dose level as successive patients were observed for presence or absence of ROA. After 25 patients, the maximum tolerated dose of nalmefene was selected as 0.50 microg/kg (final mean PROA = 0.18). The 1.00-microg/kg dose was never tried because its projected PROA was far above 0.20. The modified continual reassessment method facilitated determination of the maximum tolerated dose of nalmefene. Operating characteristics of the modified continual reassessment method suggest it may be an effective statistical tool for dose-finding in trials of selected analgesic or anesthetic agents.
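The Bayesian machinery of a continual reassessment method can be sketched with a one-parameter power model and a grid posterior. The skeleton probabilities, prior and outcomes below are invented, not those of the study, and the study's specific modifications are not reproduced.

```python
import numpy as np

doses = [0.25, 0.50, 0.75, 1.00]                 # microg/kg, as in the study
skeleton = np.array([0.05, 0.15, 0.30, 0.50])    # hypothetical prior guesses of P(reversal)
target = 0.20

# One-parameter power model p_i(a) = skeleton_i ** exp(a), with prior a ~ N(0, 1)
# represented on a grid.
a_grid = np.linspace(-3, 3, 601)
prior = np.exp(-0.5 * a_grid**2)
prior /= prior.sum()

def next_dose(outcomes, prior):
    """outcomes: list of (dose_index, reversal 0/1) pairs observed so far."""
    post = prior.copy()
    for i, y in outcomes:
        p = skeleton[i] ** np.exp(a_grid)
        post *= p if y else 1.0 - p              # Bayes update with each outcome
    post /= post.sum()
    # Posterior-mean reversal probability at each dose; pick the dose closest to target.
    p_mean = np.array([np.sum(skeleton[i] ** np.exp(a_grid) * post)
                       for i in range(len(doses))])
    return int(np.argmin(np.abs(p_mean - target))), p_mean

# For example, after three patients at the lowest dose with no reversal observed:
choice, p_mean = next_dose([(0, 0), (0, 0), (0, 0)], prior)
```

Each new patient's outcome reshapes the posterior over the toxicity curve, so the recommended dose adapts cohort by cohort, just as in the trial's iterative procedure.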
NASA Technical Reports Server (NTRS)
Thomas, J. M.; Hanagud, S.
1974-01-01
The design criteria and test options for aerospace structural reliability were investigated. A decision methodology was developed for selecting a combination of structural tests and structural design factors. The decision method involves the use of Bayesian statistics and statistical decision theory. Procedures are discussed for obtaining and updating data-based probabilistic strength distributions for aerospace structures when test information is available and for obtaining subjective distributions when data are not available. The techniques used in developing the distributions are explained.
Visual attention based bag-of-words model for image classification
NASA Astrophysics Data System (ADS)
Wang, Qiwei; Wan, Shouhong; Yue, Lihua; Wang, Che
2014-04-01
Bag-of-words is a classical method for image classification. The core problem is how to count the frequencies of the visual words and which visual words to select. In this paper, we propose a visual attention based bag-of-words model (VABOW model) for the image classification task. The VABOW model utilizes a visual attention method to generate a saliency map, and uses the saliency map as a weighting matrix to guide the counting of visual word frequencies. On the other hand, the VABOW model combines shape, color and texture cues and uses L1-regularized logistic regression to select the most relevant and most efficient features. We compare our approach with the traditional bag-of-words based method on two datasets, and the results show that our VABOW model outperforms the state-of-the-art method for image classification.
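The core weighting idea — each visual word vote counts in proportion to the saliency of its source region — reduces to a weighted histogram. The word assignments and saliency values below are random placeholders for a real feature detector and saliency model.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical: 500 local descriptors, each assigned to one of 8 visual words.
words = rng.integers(0, 8, size=500)
# Saliency of the patch each descriptor came from, in [0, 1].
saliency = rng.random(500)

# Classical BoW: every descriptor votes equally.
plain_hist = np.bincount(words, minlength=8).astype(float)
plain_hist /= plain_hist.sum()

# VABOW-style idea: weight each descriptor's vote by its saliency.
weighted_hist = np.bincount(words, weights=saliency, minlength=8)
weighted_hist /= weighted_hist.sum()
```

Both normalized histograms can be fed to the same classifier; the weighted one simply emphasizes words that occur in visually salient regions.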
Daee, Pedram; Mirian, Maryam S; Ahmadabadi, Majid Nili
2014-01-01
In a multisensory task, human adults integrate information from different sensory modalities--behaviorally in an optimal Bayesian fashion--while children mostly rely on a single sensory modality for decision making. The reason behind this change of behavior over age and the process behind learning the required statistics for optimal integration are still unclear and have not been justified by conventional Bayesian modeling. We propose an interactive multisensory learning framework without making any prior assumptions about the sensory models. In this framework, learning in every modality and in their joint space is done in parallel using a single-step reinforcement learning method. A simple statistical test on confidence intervals on the mean of reward distributions is used to select the most informative source of information among the individual modalities and the joint space. Analyses of the method and the simulation results on a multimodal localization task show that the learning system autonomously starts with sensory selection and gradually switches to sensory integration. This is because relying more on individual modalities--i.e. selection--at early learning steps (childhood) is more rewarding than favoring decisions learned in the joint space, since the smaller state-space of each modality results in faster learning. In contrast, after gaining sufficient experience (adulthood), the quality of learning in the joint space matures, while learning in the individual modalities suffers from insufficient accuracy due to perceptual aliasing. This results in a tighter confidence interval for the joint space and consequently causes a smooth shift from selection to integration. It suggests that sensory selection and integration are emergent behaviors, both outputs of a single reward-maximization process; i.e. the transition is not a preprogrammed phenomenon.
Defining the ecological hydrology of Taiwan Rivers using multivariate statistical methods
NASA Astrophysics Data System (ADS)
Chang, Fi-John; Wu, Tzu-Ching; Tsai, Wen-Ping; Herricks, Edwin E.
2009-09-01
Summary: The identification and verification of ecohydrologic flow indicators has found new support as the importance of ecological flow regimes is recognized in modern water resources management, particularly in river restoration and reservoir management. An ecohydrologic indicator system reflecting the unique characteristics of Taiwan's water resources and hydrology has been developed, the Taiwan ecohydrological indicator system (TEIS). A major challenge for the water resources community is using the TEIS to provide environmental flow rules that improve existing water resources management. This paper examines data from the extensive network of flow monitoring stations in Taiwan using TEIS statistics to define and refine environmental flow options in Taiwan. Multivariate statistical methods were used to examine TEIS statistics for 102 stations representing the geographic and land use diversity of Taiwan. The Pearson correlation coefficient showed high multicollinearity between the TEIS statistics. Watersheds were separated into upper and lower-watershed locations. An analysis of variance indicated significant differences between upstream, more natural, and downstream, more developed, locations in the same basin, with hydrologic indicator redundancy in flow change and magnitude statistics. Issues of multicollinearity were examined using a Principal Component Analysis (PCA), with the first three components related to general flow and high/low flow statistics, frequency and time statistics, and quantity statistics. These principal components explain about 85% of the total variation. A major conclusion is that managers must be aware of differences among basins, as well as differences within basins, that will require careful selection of management procedures to achieve needed flow regimes.
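The PCA step — extracting a few components that absorb most of the variance of many collinear indicators — can be reproduced on synthetic data built from three latent "flow" factors. The 102 stations echo the study's sample size, but the indicator matrix itself is simulated, not TEIS data.

```python
import numpy as np

rng = np.random.default_rng(6)
n_stations, n_indicators = 102, 10
# Hypothetical indicator matrix with built-in collinearity: 10 indicators
# driven by 3 latent factors (e.g. magnitude, timing, frequency) plus noise.
latent = rng.normal(size=(n_stations, 3))
loadings = rng.normal(size=(3, n_indicators))
X = latent @ loadings + 0.3 * rng.normal(size=(n_stations, n_indicators))

# Standardize, then PCA via SVD of the centered matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
explained = s**2 / (s**2).sum()      # variance share per principal component
cum3 = explained[:3].sum()           # share captured by the first three PCs
```

With three latent drivers, the first three components soak up nearly all the variance, mirroring the ~85% figure reported for the TEIS statistics.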
47 CFR 54.807 - Interstate access universal service support.
Code of Federal Regulations, 2010 CFR
2010-10-01
... Common Carrier Bureau Report, Statistics of Communications Common Carriers, Table 6.10—Selected Operating Statistics. Interested parties may obtain this report from the U.S. Government Printing Office or by... Bureau Report, Statistics of Communications Common Carriers, Table 6.10—Selected Operating Statistics...
NASA Astrophysics Data System (ADS)
Määttä, A.; Laine, M.; Tamminen, J.; Veefkind, J. P.
2014-05-01
Satellite instruments are nowadays successfully utilised for measuring atmospheric aerosol in many applications as well as in research. Therefore, there is a growing need for rigorous error characterisation of the measurements. Here, we introduce a methodology for quantifying the uncertainty in the retrieval of aerosol optical thickness (AOT). In particular, we concentrate on two aspects: uncertainty due to aerosol microphysical model selection and uncertainty due to imperfect forward modelling. We apply the introduced methodology for aerosol optical thickness retrieval of the Ozone Monitoring Instrument (OMI) on board NASA's Earth Observing System (EOS) Aura satellite, launched in 2004. We apply statistical methodologies that improve the uncertainty estimates of the aerosol optical thickness retrieval by propagating aerosol microphysical model selection and forward model error more realistically. For the microphysical model selection problem, we utilise Bayesian model selection and model averaging methods. Gaussian processes are utilised to characterise the smooth systematic discrepancies between the measured and modelled reflectances (i.e. residuals). The spectral correlation is composed empirically by exploring a set of residuals. The operational OMI multi-wavelength aerosol retrieval algorithm OMAERO is used for cloud-free, over-land pixels of the OMI instrument with the additional Bayesian model selection and model discrepancy techniques introduced here. The method and improved uncertainty characterisation is demonstrated by several examples with different aerosol properties: weakly absorbing aerosols, forest fires over Greece and Russia, and Sahara desert dust. The statistical methodology presented is general; it is not restricted to this particular satellite retrieval application.
Why the null matters: statistical tests, random walks and evolution.
Sheets, H D; Mitchell, C E
2001-01-01
A number of statistical tests have been developed to determine what type of dynamics underlie observed changes in morphology in evolutionary time series, based on the pattern of change within the time series. The theory of the 'scaled maximum', the 'log-rate-interval' (LRI) method, and the Hurst exponent all operate on the same principle of comparing the maximum change, or rate of change, in the observed dataset to the maximum change expected of a random walk. Less change in a dataset than expected of a random walk has been interpreted as indicating stabilizing selection, while more change implies directional selection. The 'runs test' in contrast, operates on the sequencing of steps, rather than on excursion. Applications of these tests to computer generated, simulated time series of known dynamical form and various levels of additive noise indicate that there is a fundamental asymmetry in the rate of type II errors of the tests based on excursion: they are all highly sensitive to noise in models of directional selection that result in a linear trend within a time series, but are largely noise immune in the case of a simple model of stabilizing selection. Additionally, the LRI method has a lower sensitivity than originally claimed, due to the large range of LRI rates produced by random walks. Examination of the published results of these tests show that they have seldom produced a conclusion that an observed evolutionary time series was due to directional selection, a result which needs closer examination in light of the asymmetric response of these tests.
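The excursion-test logic — compare observed change against the maximum change expected of a random walk — can be sketched as a Monte Carlo test. Step sizes, drift and series length below are arbitrary illustrative choices, not parameters from any of the cited tests.

```python
import numpy as np

rng = np.random.default_rng(8)
n_steps, n_sims = 50, 5000

def max_excursion(series):
    """Maximum absolute deviation from the starting value."""
    return np.max(np.abs(series - series[0]))

# Null distribution: maximum excursion of an unbiased +/-1 random walk.
null = np.array([max_excursion(np.cumsum(rng.choice([-1.0, 1.0], n_steps)))
                 for _ in range(n_sims)])

# A trend-like series (directional-selection analogue): drift 1.0 plus +/-1 noise.
trend = np.cumsum(1.0 + rng.choice([-1.0, 1.0], n_steps))
obs = max_excursion(trend)

# One-sided p-value: does the series show more change than a random walk would?
p_more = (np.sum(null >= obs) + 1) / (n_sims + 1)
```

Under this framing, significantly more excursion than the null suggests a directional trend, while significantly less has been read as stabilizing selection, which is exactly the asymmetry the paper probes.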
Kong, Deguo; MacLeod, Matthew; Hung, Hayley; Cousins, Ian T
2014-11-04
During recent decades concentrations of persistent organic pollutants (POPs) in the atmosphere have been monitored at multiple stations worldwide. We used three statistical methods to analyze a total of 748 time series of selected POPs in the atmosphere to determine if there are statistically significant reductions in levels of POPs that have had control actions enacted to restrict or eliminate manufacture, use and emissions. Significant decreasing trends were identified in 560 (75%) of the 748 time series collected from the Arctic, North America, and Europe, indicating that the atmospheric concentrations of these POPs are generally decreasing, consistent with the overall effectiveness of emission control actions. Statistically significant trends in synthetic time series could be reliably identified with the improved Mann-Kendall (iMK) test and the digital filtration (DF) technique in time series longer than 5 years. The temporal trends of new (or emerging) POPs in the atmosphere are often unclear because time series are too short. A statistical detrending method based on the iMK test was not able to identify abrupt changes in the rates of decline of atmospheric POP concentrations encoded into synthetic time series.
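The classical Mann-Kendall test underlying the improved (iMK) variant used in the study can be implemented in a few lines. The synthetic series below mimics a roughly 5%-per-year decline with lognormal measurement noise; it is not monitoring data, and the tie correction and the iMK refinements are omitted.

```python
import numpy as np

def mann_kendall_z(x):
    """Mann-Kendall trend statistic S and its normal approximation z (no ties)."""
    n = len(x)
    s = sum(np.sign(x[j] - x[i]) for i in range(n) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0     # variance of S under the null
    if s > 0:
        z = (s - 1) / np.sqrt(var_s)             # continuity correction
    elif s < 0:
        z = (s + 1) / np.sqrt(var_s)
    else:
        z = 0.0
    return s, z

rng = np.random.default_rng(10)
years = np.arange(15)                            # 15 annual mean concentrations
# Hypothetical POP series: first-order decline plus lognormal noise.
conc = 100 * np.exp(-0.05 * years) * np.exp(rng.normal(0, 0.05, 15))
s, z = mann_kendall_z(conc)
# z < -1.96 indicates a statistically significant decreasing trend at the 5% level.
```

Because the test uses only the signs of pairwise differences, it is robust to the skewed, non-normal concentrations typical of atmospheric POP data.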
Comparisons of topological properties in autism for the brain network construction methods
NASA Astrophysics Data System (ADS)
Lee, Min-Hee; Kim, Dong Youn; Lee, Sang Hyeon; Kim, Jin Uk; Chung, Moo K.
2015-03-01
Structural brain networks can be constructed from the white matter fiber tractography of diffusion tensor imaging (DTI), and the structural characteristics of the brain can be analyzed from these networks. When brain networks are constructed by the parcellation method, their network structures change according to the choice of parcellation scale and arbitrary thresholding. To overcome these issues, we modified the Ɛ-neighbor construction method proposed by Chung et al. (2011). The purpose of this study was to construct brain networks for 14 control subjects and 16 subjects with autism using both the parcellation and the Ɛ-neighbor construction methods and to compare the topological properties between the two methods. As the number of nodes increased, connectedness decreased in the parcellation method. However, in the Ɛ-neighbor construction method, connectedness remained high even as the number of nodes increased. In addition, statistical analysis for the parcellation method showed a significant difference only in path length, whereas statistical analysis for the Ɛ-neighbor construction method showed significant differences in path length, degree, and density.
A multi-strategy approach to informative gene identification from gene expression data.
Liu, Ziying; Phan, Sieu; Famili, Fazel; Pan, Youlian; Lenferink, Anne E G; Cantin, Christiane; Collins, Catherine; O'Connor-McCourt, Maureen D
2010-02-01
An unsupervised multi-strategy approach has been developed to identify informative genes from high-throughput genomic data. Several statistical methods have been used in the field to identify differentially expressed genes. Since different methods generate different lists of genes, it is very challenging to determine the most reliable gene list and the appropriate method. This paper presents a multi-strategy method in which a combination of several data analysis techniques is applied to a given dataset and a confidence measure is established to select genes from the gene lists generated by these techniques to form the core of our final selection. The remaining genes, which form the peripheral region, are subject to exclusion from or inclusion into the final selection. This paper demonstrates the methodology through its application to an in-house cancer genomics dataset and a public dataset. The results indicate that our method provides a more reliable list of genes, which are validated using biological knowledge, biological experiments, and literature search. We further evaluated our multi-strategy method by consolidating two pairs of independent datasets, each pair covering the same disease but generated by different labs using different platforms. In these evaluations our method again produced markedly better results.
Saad, Ahmed S; Abo-Talib, Nisreen F; El-Ghobashy, Mohamed R
2016-01-05
Different methods have been introduced to enhance the selectivity of UV spectrophotometry, enabling accurate determination of co-formulated components; however, mixtures whose components exhibit wide variation in absorptivities have remained an obstacle to its application. The developed ratio difference at coabsorptive point (RDC) method offers a simple, effective solution to this problem: the additive property of absorbance allows the two components to be treated as multiples of the lower-absorptivity component at a certain wavelength (the coabsorptive point), at which their total concentration in such multiples can be determined, while the other component is selectively determined by applying the ratio difference method in a single step. A mixture of perindopril arginine (PA) and amlodipine besylate (AM) exemplifies the problem, since the low absorptivity of PA relative to AM hinders selective spectrophotometric determination of PA. The developed method successfully determined both components in the overlapped region of their spectra, with accuracies of 99.39±1.60 and 100.51±1.21 for PA and AM, respectively. The method was validated per the USP guidelines and showed no significant difference upon statistical comparison with a reported chromatographic method. Copyright © 2015 Elsevier B.V. All rights reserved.
Stewart, Sarah; Pearson, Janet; Rome, Keith; Dalbeth, Nicola; Vandal, Alain C
2018-01-01
Statistical techniques currently used in musculoskeletal research often inefficiently account for paired-limb measurements or the relationship between measurements taken from multiple regions within limbs. This study compared three commonly used analysis methods with a mixed-models approach that appropriately accounted for the association between limbs, regions, and trials and that utilised all information available from repeated trials. Four analysis methods were applied to an existing data set containing plantar pressure data, which were collected for seven masked regions on right and left feet, over three trials, across three participant groups. Methods 1-3 averaged data over trials and analysed right foot data (Method 1), data from a randomly selected foot (Method 2), and averaged right and left foot data (Method 3). Method 4 used all available data in a mixed-effects regression that accounted for repeated measures taken for each foot, foot region, and trial. Confidence interval widths for the mean differences between groups for each foot region were used as the criterion for comparing statistical efficiency. Mean differences in pressure between groups were similar across methods for each foot region, while the confidence interval widths were consistently smaller for Method 4. Method 4 also revealed significant between-group differences that were not detected by Methods 1-3. A mixed-effects linear model approach improves efficiency and power by producing more precise estimates compared with alternative approaches that discard information in the process of accounting for paired-limb measurements. This approach is recommended for generating more clinically sound and statistically efficient research outputs. Copyright © 2017 Elsevier B.V. All rights reserved.
Terrestrial photovoltaic cell process testing
NASA Technical Reports Server (NTRS)
Burger, D. R.
1985-01-01
The paper examines critical test parameters, criteria for selecting appropriate tests, and the use of statistical controls and test patterns to enhance PV-cell process test results. The coverage of critical test parameters is evaluated by examining available test methods and then screening these methods by considering the ability to measure those critical parameters which are most affected by the generic process, the cost of the test equipment and test performance, and the feasibility for process testing.
Campana, R.; Bernieri, E.; Massaro, E.; ...
2013-05-22
The minimal spanning tree (MST) algorithm is a graph-theoretical cluster-finding method. We previously applied it to γ-ray bidimensional images, showing that it is quite sensitive in finding faint sources. Possible sources are associated with the regions where the photon arrival directions cluster. MST selects clusters starting from a particular "tree" connecting all the points of the image and performing a cut based on the angular distance between photons, keeping clusters with a number of events higher than a given threshold. In this paper, we show how a further filtering, based on parameters linked to the cluster properties, can be applied to reduce spurious detections. We find that the most efficient parameter for this secondary selection is the magnitude M of a cluster, defined as the product of its number of events and its clustering degree. We test the sensitivity of the method by means of simulated and real Fermi Large Area Telescope (LAT) fields. Our results show that √M is strongly correlated with other statistical significance parameters derived from a wavelet-based algorithm and maximum likelihood (ML) analysis, and that it can be used as a good estimator of the statistical significance of MST detections. Finally, we apply the method to a 2-year LAT image at energies higher than 3 GeV, and we show the presence of new clusters, likely associated with BL Lac objects.
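A compact sketch of the MST-plus-cut idea follows. The cut fraction, the minimum cluster size, and the exact form of the clustering degree are illustrative assumptions here, not the paper's tuned values:

```python
import math

def mst_clusters_magnitude(points, cut_frac=0.7, min_events=4):
    """Sketch of an MST cluster finder: build the minimal spanning tree
    (Prim's algorithm), remove edges longer than cut_frac times the mean
    edge length, and score each surviving cluster by a magnitude
    M = n_k * g_k, with clustering degree g_k taken here as
    (mean MST edge length) / (mean edge length inside the cluster)."""
    n = len(points)
    d = lambda i, j: math.dist(points[i], points[j])
    in_tree, edges = {0}, []
    while len(in_tree) < n:  # Prim's algorithm; O(n^3) but fine for a sketch
        e = min((d(i, j), i, j) for i in in_tree for j in range(n) if j not in in_tree)
        edges.append(e)
        in_tree.add(e[2])
    mean_len = sum(length for length, _, _ in edges) / len(edges)

    parent = list(range(n))  # union-find over the kept (short) edges
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    kept = [e for e in edges if e[0] <= cut_frac * mean_len]
    for _, i, j in kept:
        parent[find(i)] = find(j)

    clusters = {}
    for length, i, _ in kept:
        clusters.setdefault(find(i), []).append(length)
    scores = []
    for lens in clusters.values():
        n_k = len(lens) + 1  # events in a tree = edges + 1
        if n_k >= min_events:
            scores.append((n_k * mean_len / (sum(lens) / len(lens)), n_k))
    return sorted(scores, reverse=True)  # [(magnitude M, cluster size), ...]
```

On a field with one tight knot of points amid a sparse background, only the knot survives the cut and receives a large M, mirroring the secondary-selection step described in the abstract.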
Medical cost analysis: application to colorectal cancer data from the SEER Medicare database.
Bang, Heejung
2005-10-01
Incompleteness is a key feature of most survival data. Numerous well-established statistical methodologies and algorithms exist for analyzing life or failure time data. However, induced censoring invalidates the use of those standard analytic tools for some survival-type data such as medical costs. In this paper, some valid methods currently available for analyzing censored medical cost data are reviewed. Some cautionary findings under different assumptions are illustrated through application to medical costs from colorectal cancer patients. Cost analysis should be suitably planned and carefully interpreted under various meaningful scenarios, even with judiciously selected statistical methods. This approach would be greatly helpful to policy makers who seek to prioritize health care expenditures and to assess the elements of resource use.
Hickey, Graeme L; Blackstone, Eugene H
2016-08-01
Clinical risk-prediction models serve an important role in healthcare. They are used for clinical decision-making and measuring the performance of healthcare providers. To establish confidence in a model, external model validation is imperative. When designing such an external model validation study, thought must be given to patient selection, risk factor and outcome definitions, missing data, and the transparent reporting of the analysis. In addition, there are a number of statistical methods available for external model validation. Execution of a rigorous external validation study rests in proper study design, application of suitable statistical methods, and transparent reporting. Copyright © 2016 The American Association for Thoracic Surgery. Published by Elsevier Inc. All rights reserved.
49 CFR Schedule G to Subpart B of... - Selected Statistical Data
Code of Federal Regulations, 2010 CFR
2010-10-01
Excerpt: 49 Transportation (2010-10-01), Schedule G—Selected Statistical Data [dollars in thousands], reported for Greyhound Lines, Inc., Trailways combined, and all study carriers; line items include the number of regular-route intercity passenger miles (Sch. 9002, L. 9, col. (b)).
Shirazi, Mohammadali; Dhavala, Soma Sekhar; Lord, Dominique; Geedipally, Srinivas Reddy
2017-10-01
Safety analysts usually use post-modeling methods, such as goodness-of-fit statistics or the likelihood ratio test, to decide between two or more competing distributions or models. Such metrics require all competing distributions to be fitted to the data before any comparisons can be made. Given the continuous growth in newly introduced statistical distributions, choosing the best one using such post-modeling methods is not a trivial task, in addition to all the theoretical or numerical issues the analyst may face during the analysis. Furthermore, and most importantly, these measures or tests do not provide any intuition about why a specific distribution (or model) is preferred over another (goodness-of-logic). This paper addresses these issues by proposing a methodology for designing model-selection heuristics based on the characteristics of the data, in terms of descriptive summary statistics, before fitting the models. The proposed methodology employs two analytic tools, (1) Monte Carlo simulations and (2) machine learning classifiers, to design simple heuristics that predict the label of the 'most-likely-true' distribution for the data at hand. The methodology was applied to investigate when the recently introduced Negative Binomial-Lindley (NB-L) distribution is preferred over the Negative Binomial (NB) distribution. Heuristics were designed to select the 'most-likely-true' distribution between these two, given a set of prescribed summary statistics of the data. The proposed heuristics compared favorably against classical tests on several real or observed datasets. Not only are they easy to use and free of any post-modeling inputs, but they also give the analyst useful information about why the NB-L is preferred over the NB (or vice versa) when modeling data. Copyright © 2017 Elsevier Ltd. All rights reserved.
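The descriptive summary statistics that feed such pre-modeling heuristics can be computed simply. The features below are plausible examples of classifier inputs for count data; the paper's actual feature set and learned thresholds are not reproduced here:

```python
import math
import statistics

def summary_features(counts):
    """Descriptive statistics of the kind used as classifier inputs when
    designing pre-modeling model-selection heuristics for count data.
    A learned rule over such features (e.g. a decision tree) then plays
    the role of the heuristic; no real thresholds are encoded here."""
    n = len(counts)
    mean = statistics.fmean(counts)
    var = statistics.pvariance(counts)
    sd = math.sqrt(var)
    skew = sum((c - mean) ** 3 for c in counts) / (n * sd ** 3) if sd else 0.0
    return {
        "mean": mean,
        "variance-to-mean": var / mean if mean else float("nan"),  # dispersion
        "skewness": skew,
        "zero-fraction": counts.count(0) / n,
    }
```

Crash-count data are typically overdispersed (variance-to-mean well above 1) and right-skewed, which is precisely the regime where heavier-tailed alternatives such as the NB-L become candidates.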
Spectroflurimetric estimation of the new antiviral agent ledipasvir in presence of sofosbuvir
NASA Astrophysics Data System (ADS)
Salama, Fathy M.; Attia, Khalid A.; Abouserie, Ahmed A.; El-Olemy, Ahmed; Abolmagd, Ebrahim
2018-02-01
A spectrofluorimetric method has been developed and validated for the selective quantitative determination of ledipasvir in the presence of sofosbuvir. In this method, the native fluorescence of ledipasvir in ethanol at 405 nm was measured after excitation at 340 nm. The proposed method was validated according to ICH guidelines and showed high sensitivity, accuracy, and precision. Furthermore, the method was successfully applied to the analysis of ledipasvir in a pharmaceutical dosage form without interference from sofosbuvir or other additives, and statistical comparison of the results with those of a reported method found no significant difference.
Yilmaz, E; Kayikcioglu, T; Kayipmaz, S
2017-07-01
In this article, we propose a decision support system for the effective classification of dental periapical cyst and keratocystic odontogenic tumor (KCOT) lesions obtained via cone beam computed tomography (CBCT). CBCT has been used effectively in recent years for diagnosing dental pathologies and determining their boundaries and content. Unlike other imaging techniques, CBCT provides detailed and distinctive information about the pathologies by enabling a three-dimensional (3D) image of the region to be displayed. We employed 50 CBCT 3D image dataset files as the full dataset of our study. These datasets were identified by experts as periapical cyst or KCOT lesions according to clinical, radiographic, and histopathologic features. Segmentation operations were performed on the CBCT images using viewer software that we developed. Using the tools of this software, we marked the lesional volume of interest and calculated the order statistics and 3D gray-level co-occurrence matrix for each CBCT dataset. A feature vector of the lesional region, comprising 636 different feature items, was created from those statistics. Six classifiers were used for the classification experiments. The Support Vector Machine (SVM) classifier achieved the best classification performance, with 100% accuracy and a 100% F-score (F1), in the experiments in which a ten-fold cross-validation method was used with a forward feature selection algorithm. SVM also achieved the best performance with a split-sample validation method (96.00% accuracy, 96.00% F1) and with a leave-one-out cross-validation (LOOCV) method (94.00% accuracy, 93.88% F1), in both cases combined with a forward feature selection algorithm.
Based on the results, we determined that periapical cyst and KCOT lesions can be classified with a high accuracy with the models that we built using the new dataset selected for this study. The studies mentioned in this article, along with the selected 3D dataset, 3D statistics calculated from the dataset, and performance results of the different classifiers, comprise an important contribution to the field of computer-aided diagnosis of dental apical lesions. Copyright © 2017 Elsevier B.V. All rights reserved.
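The forward feature selection loop paired with each classifier above can be sketched generically. Here `score` stands in for a cross-validated classifier accuracy over a candidate feature subset; all names are illustrative:

```python
def forward_select(features, score, max_k=10):
    """Greedy forward feature selection: repeatedly add the feature that
    most improves a user-supplied score(feature_index_list) callback,
    stopping when no candidate improves on the current best score.
    In the study the score would be cross-validated classifier accuracy."""
    selected, best = [], float("-inf")
    while len(selected) < max_k:
        cand = [(score(selected + [f]), f) for f in features if f not in selected]
        s, f = max(cand)
        if s <= best:  # no improvement: stop growing the subset
            break
        selected.append(f)
        best = s
    return selected, best
```

With 636 candidate features and only 50 cases, such a greedy wrapper keeps the evaluated subsets small, which is why it is commonly paired with cross-validation to guard against overfitting.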
Adjusting for radiotelemetry error to improve estimates of habitat use.
Scott L. Findholt; Bruce K. Johnson; Lyman L. McDonald; John W. Kern; Alan Ager; Rosemary J. Stussy; Larry D. Bryant
2002-01-01
Animal locations estimated from radiotelemetry have traditionally been treated as error-free when analyzed in relation to habitat variables. Location error lowers the power of statistical tests of habitat selection. We describe a method that incorporates the error surrounding point estimates into measures of environmental variables determined from a geographic...
Federal Register 2010, 2011, 2012, 2013, 2014
2010-08-06
... considerations affecting the design and conduct of repellent studies when human subjects are involved. Any... recommendations for the design and execution of studies to evaluate the performance of pesticide products intended... recommends appropriate study designs and methods for selecting subjects, statistical analysis, and reporting...
Farmers as Consumers of Agricultural Education Services: Willingness to Pay and Spend Time
ERIC Educational Resources Information Center
Charatsari, Chrysanthi; Papadaki-Klavdianou, Afroditi; Michailidis, Anastasios
2011-01-01
This study assessed farmers' willingness to pay for and spend time attending an Agricultural Educational Program (AEP). Primary data on the demographic and socio-economic variables of farmers were collected from 355 farmers selected randomly from Northern Greece. Descriptive statistics and multivariate analysis methods were used in order to meet…
Psycho-Motor Needs Assessment of Virginia School Children.
ERIC Educational Resources Information Center
Glen Haven Achievement Center, Fort Collins, CO.
An effort to assess psycho-motor (P-M) needs among Virginia children in K-4 and in special primary classes for the educable mentally retarded is presented. Included are methods for selecting, combining, and developing evaluation measures, which are verified statistically by analyses of data collected from a stratified sample of approximately 4,500…
Assessment of NDE Reliability Data
NASA Technical Reports Server (NTRS)
Yee, B. G. W.; Chang, F. H.; Couchman, J. C.; Lemon, G. H.; Packman, P. F.
1976-01-01
Twenty sets of relevant Nondestructive Evaluation (NDE) reliability data have been identified, collected, compiled, and categorized. A criterion for the selection of data for statistical analysis considerations has been formulated. A model to grade the quality and validity of the data sets has been developed. Data input formats, which record the pertinent parameters of the defect/specimen and inspection procedures, have been formulated for each NDE method. A comprehensive computer program has been written to calculate the probability of flaw detection at several confidence levels by the binomial distribution. This program also selects the desired data sets for pooling and tests the statistical pooling criteria before calculating the composite detection reliability. Probability of detection curves at 95 and 50 percent confidence levels have been plotted for individual sets of relevant data as well as for several sets of merged data with common sets of NDE parameters.
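A minimal sketch of the binomial probability-of-detection calculation described above: the one-sided lower confidence bound on POD given s detections in n inspections, found by bisection. The exact procedure used in the original program is an assumption; this is the standard Clopper-Pearson-style bound:

```python
from math import comb

def pod_lower_bound(detections, trials, confidence=0.95):
    """One-sided lower confidence bound on probability of detection from
    the binomial distribution: the p at which observing `detections` or
    more successes in `trials` has probability exactly 1 - confidence."""
    if detections == 0:
        return 0.0
    alpha = 1.0 - confidence
    tail = lambda p: sum(comb(trials, k) * p ** k * (1 - p) ** (trials - k)
                         for k in range(detections, trials + 1))
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection; tail(p) is increasing in p
        mid = (lo + hi) / 2
        if tail(mid) < alpha:
            lo = mid
        else:
            hi = mid
    return lo

# the classical NDE case of 29 detections in 29 trials at 95% confidence
print(pod_lower_bound(29, 29, 0.95))  # ≈ 0.902, i.e. the familiar 90/95 POD
```

Plotting this bound against flaw size, at 95% and 50% confidence, reproduces the kind of POD curves the report describes.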
Paretti, Nicholas V.; Kennedy, Jeffrey R.; Turney, Lovina A.; Veilleux, Andrea G.
2014-01-01
The regional regression equations were integrated into the U.S. Geological Survey’s StreamStats program. The StreamStats program is a national map-based web application that allows the public to easily access published flood frequency and basin characteristic statistics. The interactive web application allows a user to select a point within a watershed (gaged or ungaged) and retrieve flood-frequency estimates derived from the current regional regression equations and geographic information system data within the selected basin. StreamStats provides users with an efficient and accurate means for retrieving the most up to date flood frequency and basin characteristic data. StreamStats is intended to provide consistent statistics, minimize user error, and reduce the need for large datasets and costly geographic information system software.
NASA Astrophysics Data System (ADS)
Shi, Jinfei; Zhu, Songqing; Chen, Ruwen
2017-12-01
An order selection method based on multiple stepwise regression is proposed for the General expression of Nonlinear AutoRegressive (GNAR) model, converting the model-order problem into variable selection for a multiple linear regression equation. The partial autocorrelation function is adopted to define the linear terms in the GNAR model; the result is set as the initial model, and the nonlinear terms are then introduced gradually. Statistics measuring the improvement in model fit contributed by both the newly introduced and the previously included variables are used to decide which model variables to retain or eliminate, and the optimal model is obtained through measures of data-fitting quality or significance tests. Simulation results and experiments on classic time-series data show that the proposed method is simple, reliable, and applicable to practical engineering.
Genomic Selection in Plant Breeding: Methods, Models, and Perspectives.
Crossa, José; Pérez-Rodríguez, Paulino; Cuevas, Jaime; Montesinos-López, Osval; Jarquín, Diego; de Los Campos, Gustavo; Burgueño, Juan; González-Camacho, Juan M; Pérez-Elizalde, Sergio; Beyene, Yoseph; Dreisigacker, Susanne; Singh, Ravi; Zhang, Xuecai; Gowda, Manje; Roorkiwal, Manish; Rutkoski, Jessica; Varshney, Rajeev K
2017-11-01
Genomic selection (GS) facilitates the rapid selection of superior genotypes and accelerates the breeding cycle. In this review, we discuss the history, principles, and basis of GS and genomic-enabled prediction (GP) as well as the genetics and statistical complexities of GP models, including genomic genotype×environment (G×E) interactions. We also examine the accuracy of GP models and methods for two cereal crops and two legume crops based on random cross-validation. GS applied to maize breeding has shown tangible genetic gains. Based on GP results, we speculate how GS in germplasm enhancement (i.e., prebreeding) programs could accelerate the flow of genes from gene bank accessions to elite lines. Recent advances in hyperspectral image technology could be combined with GS and pedigree-assisted breeding. Copyright © 2017 Elsevier Ltd. All rights reserved.
Pace, M.N.; Rosentreter, J.J.; Bartholomay, R.C.
2001-01-01
Idaho State University and the US Geological Survey, in cooperation with the US Department of Energy, conducted a study to determine and evaluate strontium distribution coefficients (Kds) of subsurface materials at the Idaho National Engineering and Environmental Laboratory (INEEL). The Kds were determined to aid in assessing the variability of strontium Kds and their effects on chemical transport of strontium-90 in the Snake River Plain aquifer system. Data from batch experiments done to determine strontium Kds of five sediment-infill samples and six standard reference material samples were analyzed by using multiple linear regression analysis and the stepwise variable-selection method in the statistical program, Statistical Product and Service Solutions, to derive an equation of variables that can be used to predict strontium Kds of sediment-infill samples. The sediment-infill samples were from basalt vesicles and fractures from a selected core at the INEEL; strontium Kds ranged from approximately 201 to 356 ml g-1. The standard material samples consisted of clay minerals and calcite. The statistical analyses of the batch-experiment results showed that the amount of strontium in the initial solution, the amount of manganese oxide in the sample material, and the amount of potassium in the initial solution are the most important variables in predicting strontium Kds of sediment-infill samples.
Atlas of susceptibility to pollution in marinas. Application to the Spanish coast.
Gómez, Aina G; Ondiviela, Bárbara; Fernández, María; Juanes, José A
2017-01-15
An atlas of susceptibility to pollution of 320 Spanish marinas is provided. Susceptibility is assessed through a simple, fast and low cost empirical method estimating the flushing capacity of marinas. The Complexity Tidal Range Index (CTRI) was selected among eleven empirical methods. The CTRI method was selected by means of statistical analyses because: it contributes to explain the system's variance; it is highly correlated to numerical model results; and, it is sensitive to marinas' location and typology. The process of implementation to the Spanish coast confirmed its usefulness, versatility and adaptability as a tool for the environmental management of marinas worldwide. The atlas of susceptibility, assessed through CTRI values, is an appropriate instrument to prioritize environmental and planning strategies at a regional scale. Copyright © 2016 Elsevier Ltd. All rights reserved.
Dashtban, M; Balafar, Mohammadali
2017-03-01
Gene selection is a demanding task for microarray data analysis. The diverse complexity of different cancers makes this issue still challenging. In this study, a novel evolutionary method based on genetic algorithms and artificial intelligence is proposed to identify predictive genes for cancer classification. A filter method was first applied to reduce the dimensionality of the feature space, followed by an integer-coded genetic algorithm with dynamic-length genotype, intelligent parameter settings, and modified operators. The algorithmic behaviors, including convergence trends, mutation and crossover rate changes, and running time, were studied, conceptually discussed, and shown to be coherent with literature findings. Two well-known filter methods, Laplacian and Fisher score, were examined with respect to their similarities, the quality of the selected genes, and their influence on the evolutionary approach. Several statistical tests concerning the choice of classifier, dataset, and filter method were performed, and they revealed significant differences between the performance of different classifiers and filter methods across datasets. The proposed method was benchmarked on five popular high-dimensional cancer datasets; for each, the top explored genes are reported. Comparing the experimental results with several state-of-the-art methods reveals that the proposed method outperforms previous methods on the DLBCL dataset. Copyright © 2017 Elsevier Inc. All rights reserved.
Probability genotype imputation method and integrated weighted lasso for QTL identification.
Demetrashvili, Nino; Van den Heuvel, Edwin R; Wit, Ernst C
2013-12-30
Many QTL studies have two common features: (1) there is often missing marker information, and (2) among the many markers involved in the biological process, only a few are causal. In statistics, the second issue falls under the headings "sparsity" and "causal inference". The goal of this work is to develop a two-step statistical methodology for QTL mapping for markers with binary genotypes. The first step introduces a novel imputation method for missing genotypes. The outcomes of the proposed imputation method are probabilities, which serve as weights in the second step, a weighted lasso. The sparse phenotype inference is employed to select a set of predictive markers for the trait of interest. Simulation studies validate the proposed methodology under a wide range of realistic settings, in which it outperforms alternative imputation and variable selection methods. The methodology was applied to an Arabidopsis experiment containing 69 markers for 165 recombinant inbred lines of an F8 generation. The results confirm previously identified regions; however, several new markers are also found. On the basis of the inferred ROC behavior, these markers show good potential for being real, especially for the germination trait Gmax. Our imputation method shows higher accuracy in terms of sensitivity and specificity than the alternative imputation method. Also, the proposed weighted lasso outperforms commonly practiced multiple regression as well as the traditional lasso and adaptive lasso with three weighting schemes. This means that under realistic missing-data settings this methodology can be used for QTL identification.
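A bare-bones coordinate-descent sketch of a weighted lasso, where per-coefficient weights scale the penalty (in the paper's setting these weights would come from the imputation probabilities). This is illustrative pure Python, not the authors' code, and assumes roughly standardized columns:

```python
def weighted_lasso(X, y, weights, lam=0.1, iters=200):
    """Coordinate descent for the weighted lasso objective
        (1/2n) * ||y - X b||^2 + lam * sum_j weights[j] * |b_j|.
    Each coordinate update soft-thresholds the partial-residual
    correlation by lam * weights[j], so heavily weighted (less trusted)
    coefficients are shrunk toward zero more aggressively."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(iters):
        for j in range(p):
            # correlation of column j with the partial residual (b_j held out)
            rho = sum(X[i][j] * (y[i] - sum(X[i][k] * b[k] for k in range(p) if k != j))
                      for i in range(n)) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            t = lam * weights[j]
            b[j] = (max(abs(rho) - t, 0.0) * (1 if rho > 0 else -1)) / z  # soft-threshold
    return b
```

Setting all weights to 1 recovers the ordinary lasso, which makes the role of the imputation-derived weights easy to see: they interpolate between keeping and discarding each marker.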
NASA Astrophysics Data System (ADS)
Jin, Yang; Ciwei, Gao; Jing, Zhang; Min, Sun; Jie, Yu
2017-05-01
The selection and evaluation of priority domains in Global Energy Internet standards development will help to overcome the limits of national investment, so that priority can be given to standardizing the technical areas of highest urgency and feasibility. In this paper, a Delphi survey process based on technology foresight is put forward, an evaluation index system for priority domains is established, and the index calculation method is determined. Statistical methods are then used to evaluate the alternative domains. Finally, the top four priority domains are determined as follows: Interconnected Network Planning and Simulation Analysis, Interconnected Network Safety Control and Protection, Intelligent Power Transmission and Transformation, and Internet of Things.
Nagy, László G; Urban, Alexander; Orstadius, Leif; Papp, Tamás; Larsson, Ellen; Vágvölgyi, Csaba
2010-12-01
Recently developed comparative phylogenetic methods offer a wide spectrum of applications in evolutionary biology, although it is generally accepted that their statistical properties are incompletely known. Here, we examine and compare the statistical power of the ML and Bayesian methods with regard to selection of best-fit models of fruiting-body evolution and hypothesis testing of ancestral states on a real-life data set of a physiological trait (autodigestion) in the family Psathyrellaceae. Our phylogenies are based on the first multigene data set generated for the family. Two different coding regimes (binary and multistate) and two data sets differing in taxon sampling density are examined. The Bayesian method outperformed Maximum Likelihood with regard to statistical power in all analyses. This is particularly evident if the signal in the data is weak, i.e. in cases when the ML approach does not provide support to choose among competing hypotheses. Results based on binary and multistate coding differed only modestly, although it was evident that multistate analyses were less conclusive in all cases. It seems that increased taxon sampling density has favourable effects on inference of ancestral states, while model parameters are influenced to a smaller extent. The model best fitting our data implies that the rate of losses of deliquescence equals zero, although model selection in ML does not provide proper support to reject three of the four candidate models. The results also support the hypothesis that non-deliquescence (lack of autodigestion) has been ancestral in Psathyrellaceae, and that deliquescent fruiting bodies represent the preferred state, having evolved independently several times during evolution. Copyright © 2010 Elsevier Inc. All rights reserved.
NASA Astrophysics Data System (ADS)
Viironen, K.; Marín-Franch, A.; López-Sanjuan, C.; Varela, J.; Chaves-Montero, J.; Cristóbal-Hornillos, D.; Molino, A.; Fernández-Soto, A.; Vilella-Rojo, G.; Ascaso, B.; Cenarro, A. J.; Cerviño, M.; Cepa, J.; Ederoclite, A.; Márquez, I.; Masegosa, J.; Moles, M.; Oteo, I.; Pović, M.; Aguerri, J. A. L.; Alfaro, E.; Aparicio-Villegas, T.; Benítez, N.; Broadhurst, T.; Cabrera-Caño, J.; Castander, J. F.; Del Olmo, A.; González Delgado, R. M.; Husillos, C.; Infante, L.; Martínez, V. J.; Perea, J.; Prada, F.; Quintana, J. M.
2015-04-01
Context. Most observational results on the high redshift restframe UV-bright galaxies are based on samples pinpointed using the so-called dropout technique or Ly-α selection. However, the availability of multifilter data now allows the dropout selections to be replaced by direct methods based on photometric redshifts. In this paper we present the methodology to select and study the population of high redshift galaxies in the ALHAMBRA survey data. Aims: Our aim is to develop a less biased methodology than the traditional dropout technique to study the high redshift galaxies in ALHAMBRA and other multifilter data. Thanks to the wide area ALHAMBRA covers, we especially aim at contributing to the study of the brightest, least frequent, high redshift galaxies. Methods: The methodology is based on redshift probability distribution functions (zPDFs). It is shown how a clean galaxy sample can be obtained by selecting the galaxies with high integrated probability of being within a given redshift interval. However, reaching both a complete and clean sample with this method is challenging. Hence, a method to derive statistical properties by summing the zPDFs of all the galaxies in the redshift bin of interest is introduced. Results: Using this methodology we derive the galaxy rest frame UV number counts in five redshift bins centred at z = 2.5,3.0,3.5,4.0, and 4.5, being complete up to the limiting magnitude at mUV(AB) = 24, where mUV refers to the first ALHAMBRA filter redwards of the Ly-α line. With the wide field ALHAMBRA data we especially contribute to the study of the brightest ends of these counts, accurately sampling the surface densities down to mUV(AB) = 21-22. Conclusions: We show that using the zPDFs it is easy to select a very clean sample of high redshift galaxies. 
We also show that it is better to do statistical analysis of the properties of galaxies using a probabilistic approach, which takes into account both the incompleteness and contamination issues in a natural way. Based on observations collected at the German-Spanish Astronomical Center, Calar Alto, jointly operated by the Max-Planck-Institut für Astronomie (MPIA) at Heidelberg and the Instituto de Astrofísica de Andalucía (CSIC).
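The zPDF-based selection and the statistical summation described above can be sketched as follows. The function names, the trapezoidal integration, and the 0.9 probability threshold are illustrative assumptions, not the actual ALHAMBRA pipeline.

```python
# Sketch of selection and statistics from redshift probability
# distribution functions (zPDFs). All names and thresholds are
# illustrative assumptions, not the ALHAMBRA implementation.

def integrated_probability(zpdf, z_grid, z_lo, z_hi):
    """Integrate a normalized zPDF over [z_lo, z_hi] (trapezoid rule)."""
    total = 0.0
    for i in range(len(z_grid) - 1):
        if z_lo <= z_grid[i] and z_grid[i + 1] <= z_hi:
            total += 0.5 * (zpdf[i] + zpdf[i + 1]) * (z_grid[i + 1] - z_grid[i])
    return total

def clean_sample(catalog, z_grid, z_lo, z_hi, threshold=0.9):
    """Keep galaxies with high integrated probability of lying in the bin."""
    return [g for g in catalog
            if integrated_probability(g["zpdf"], z_grid, z_lo, z_hi) >= threshold]

def statistical_count(catalog, z_grid, z_lo, z_hi):
    """Sum the zPDF probability of *all* galaxies in the bin: contamination
    and incompleteness are handled statistically instead of by a hard cut."""
    return sum(integrated_probability(g["zpdf"], z_grid, z_lo, z_hi)
               for g in catalog)
```

Note how the hard cut (`clean_sample`) trades completeness for purity, while the summation (`statistical_count`) keeps every galaxy's fractional contribution.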
Comparing data collected by computerized and written surveys for adolescence health research.
Wu, Ying; Newfield, Susan A
2007-01-01
This study assessed whether data-collection formats, computerized versus paper-and-pencil, affect response patterns and descriptive statistics for adolescent health assessment surveys. Youth were assessed as part of a health risk reduction program. Baseline data from 1131 youth were analyzed. Participants completed the questionnaire either by computer (n = 390) or by paper-and-pencil (n = 741). The rate of returned surveys meeting inclusion requirements was 90.6% and did not differ by methods. However, the computerized method resulted in significantly less incompleteness but more identical responses. Multiple regression indicated that the survey methods did not contribute to problematic responses. The two survey methods yielded similar scale internal reliability and descriptive statistics for behavioral and psychological outcomes, although the computerized method elicited higher reports of some risk items such as carrying a knife, beating up a person, selling drugs, and delivering drugs. Overall, the survey method did not produce a significant difference in outcomes. This provides support for program personnel selecting survey methods based on study goals with confidence that the method of administration will not have a significant impact on the outcome.
Meta-analysis of diagnostic test data: a bivariate Bayesian modeling approach.
Verde, Pablo E
2010-12-30
In the last decades, the amount of published results on clinical diagnostic tests has expanded very rapidly. The counterpart to this development has been the formal evaluation and synthesis of diagnostic results. However, published results present substantial heterogeneity and can be regarded as so far removed from the classical domain of meta-analysis that they provide a rather severe test of classical statistical methods. Recently, bivariate random effects meta-analytic methods, which model the pairs of sensitivities and specificities, have been presented from the classical point of view. In this work a bivariate Bayesian modeling approach is presented. This approach substantially extends the scope of classical bivariate methods by allowing the structural distribution of the random effects to depend on multiple sources of variability. Meta-analysis is summarized by the predictive posterior distributions for sensitivity and specificity. This new approach also makes it possible to perform extensive model checking, model diagnostics, and model selection. Statistical computations are implemented in public domain statistical software (WinBUGS and R) and illustrated with real data examples. Copyright © 2010 John Wiley & Sons, Ltd.
Lifshits, A M
1979-01-01
General characteristics of multivariate statistical analysis (MSA) are given. Methodological premises and criteria for the selection of an MSA method adequate for pathoanatomic investigations of the epidemiology of multicausal diseases are presented. The experience of using MSA with computers and standard computing programs in studies of coronary artery atherosclerosis, based on material from 2,060 autopsies, is described. The combined use of four MSA methods (sequential, correlation, regression, and discriminant analysis) made it possible to quantify the contribution of each of the eight examined risk factors to the development of atherosclerosis. The most important factors were found to be age, arterial hypertension, and heredity. Occupational hypodynamia and increased fatness were more important in men, whereas diabetes mellitus was more important in women. Registering this combination of risk factors by MSA methods provides a more reliable prognosis of the likelihood of coronary heart disease with a fatal outcome than prognosis based on the degree of coronary atherosclerosis alone.
Experimental and environmental factors affect spurious detection of ecological thresholds
Daily, Jonathan P.; Hitt, Nathaniel P.; Smith, David; Snyder, Craig D.
2012-01-01
Threshold detection methods are increasingly popular for assessing nonlinear responses to environmental change, but their statistical performance remains poorly understood. We simulated linear change in stream benthic macroinvertebrate communities and evaluated the performance of commonly used threshold detection methods based on model fitting (piecewise quantile regression [PQR]), data partitioning (nonparametric change point analysis [NCPA]), and a hybrid approach (significant zero crossings [SiZer]). We demonstrated that false detection of ecological thresholds (type I errors) and inferences on threshold locations are influenced by sample size, rate of linear change, and frequency of observations across the environmental gradient (i.e., sample-environment distribution, SED). However, the relative importance of these factors varied among statistical methods and between inference types. False detection rates were influenced primarily by user-selected parameters for PQR (τ) and SiZer (bandwidth) and secondarily by sample size (for PQR) and SED (for SiZer). In contrast, the location of reported thresholds was influenced primarily by SED. Bootstrapped confidence intervals for NCPA threshold locations revealed strong correspondence to SED. We conclude that the choice of statistical methods for threshold detection should be matched to experimental and environmental constraints to minimize false detection rates and avoid spurious inferences regarding threshold location.
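The type I error the authors simulate can be illustrated with a minimal broken-stick experiment: generate purely linear responses and count how often a change-point fit beats the linear fit by some margin. The sketch below is a generic illustration under assumed settings (sample size, noise level, an arbitrary improvement cutoff), not the PQR, NCPA, or SiZer implementations.

```python
# Sketch of spurious (type I) threshold detection: how often does a naive
# two-segment change-point fit "win" against a linear fit on data that is
# truly linear? All parameters below are illustrative assumptions.
import random

def linfit_sse(xs, ys):
    """Sum of squared errors of an ordinary least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def changepoint_sse(xs, ys):
    """Best SSE over all two-segment (broken-stick) splits."""
    best = float("inf")
    for k in range(3, len(xs) - 2):          # at least 3 points per segment
        sse = linfit_sse(xs[:k], ys[:k]) + linfit_sse(xs[k:], ys[k:])
        best = min(best, sse)
    return best

def false_detection_rate(n=30, sims=200, improvement=0.5, seed=1):
    """Fraction of purely linear datasets where the split model reduces
    SSE by more than `improvement` (an arbitrary, illustrative cutoff)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        xs = [i / n for i in range(n)]
        ys = [2.0 * x + rng.gauss(0.0, 0.3) for x in xs]
        if changepoint_sse(xs, ys) < (1 - improvement) * linfit_sse(xs, ys):
            hits += 1
    return hits / sims
```

Because the split model nests the linear one segment-wise, `changepoint_sse` never exceeds `linfit_sse`, which is exactly why an uncorrected comparison inflates false detections.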
Smith, Joseph M.; Mather, Martha E.
2012-01-01
Ecological indicators are science-based tools used to assess how human activities have impacted environmental resources. For monitoring and environmental assessment, existing species assemblage data can be used to make these comparisons through time or across sites. An impediment to using assemblage data, however, is that these data are complex and need to be simplified in an ecologically meaningful way. Because multivariate statistics are mathematical relationships, statistical groupings may not make ecological sense and will not have utility as indicators. Our goal was to define a process to select defensible and ecologically interpretable statistical simplifications of assemblage data in which researchers and managers can have confidence. For this, we chose a suite of statistical methods, compared the groupings that resulted from these analyses, identified convergence among groupings, then we interpreted the groupings using species and ecological guilds. When we tested this approach using a statewide stream fish dataset, not all statistical methods worked equally well. For our dataset, logistic regression (Log), detrended correspondence analysis (DCA), cluster analysis (CL), and non-metric multidimensional scaling (NMDS) provided consistent, simplified output. Specifically, the Log, DCA, CL-1, and NMDS-1 groupings were ≥60% similar to each other, overlapped with the fluvial-specialist ecological guild, and contained a common subset of species. Groupings based on number of species (e.g., Log, DCA, CL and NMDS) outperformed groupings based on abundance [e.g., principal components analysis (PCA) and Poisson regression]. Although the specific methods that worked on our test dataset have generality, here we are advocating a process (e.g., identifying convergent groupings with redundant species composition that are ecologically interpretable) rather than the automatic use of any single statistical tool. 
We summarize this process in step-by-step guidance for the future use of these commonly available ecological and statistical methods in preparing assemblage data for use in ecological indicators.
Rivoirard, Romain; Duplay, Vianney; Oriol, Mathieu; Tinquaut, Fabien; Chauvin, Franck; Magne, Nicolas; Bourmaud, Aurelie
2016-01-01
Background: Quality of reporting for Randomized Clinical Trials (RCTs) in oncology has been analyzed in several systematic reviews, but, in this setting, there is a paucity of data on outcome definitions and on the consistency of reporting of statistical tests in RCTs and Observational Studies (OBS). The objective of this review was to describe those two reporting aspects for OBS and RCTs in oncology. Methods: From a list of 19 medical journals, three were retained for analysis after a random selection: British Medical Journal (BMJ), Annals of Oncology (AoO) and British Journal of Cancer (BJC). All original articles published between March 2009 and March 2014 were screened. Only studies whose main outcome was accompanied by a corresponding statistical test were included in the analysis. Studies based on censored data were excluded. The primary outcome was to assess quality of reporting for the description of the primary outcome measure in RCTs and of the variables of interest in OBS. A logistic regression was performed to identify covariates of studies potentially associated with concordance of tests between the Methods and Results sections. Results: 826 studies were included in the review, and 698 were OBS. Variables were described in the Methods section for all OBS studies, and the primary endpoint was clearly detailed in the Methods section for 109 RCTs (85.2%). 295 OBS (42.2%) and 43 RCTs (33.6%) had perfect agreement for the reported statistical test between the Methods and Results sections. In multivariable analysis, the variable "number of included patients in study" was associated with test consistency: the aOR (adjusted Odds Ratio) for the third group compared to the first group was aOR Grp3 = 0.52 [0.31–0.89] (P value = 0.009). Conclusion: Variables in OBS and primary endpoints in RCTs are reported and described with high frequency. However, consistency of statistical tests between the Methods and Results sections of OBS is not always achieved.
Therefore, we encourage authors and peer reviewers to verify consistency of statistical tests in oncology studies. PMID:27716793
Suner, Aslı; Karakülah, Gökhan; Dicle, Oğuz
2014-01-01
Statistical hypothesis testing is an essential component of biological and medical studies for making inferences and estimations from the collected data; however, misuse of statistical tests is widespread. To prevent errors in statistical test selection, users can currently consult test selection algorithms developed for various purposes. However, the lack of an algorithm presenting the statistical tests most commonly used in biomedical research in a single flowchart causes several problems, such as shifting users among algorithms, poor decision support in test selection, and dissatisfaction among potential users. Herein, we present a unified flowchart that covers the statistical tests most commonly used in the biomedical domain, to provide decision aid to non-statistician users in choosing the appropriate statistical test for their hypothesis. We also discuss some of the findings made while integrating the flowcharts into a single, more comprehensive decision algorithm.
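As a rough illustration of what such a flowchart encodes, the function below follows a common textbook branching scheme for simple designs; the exact branches are an assumption for illustration, not the authors' published algorithm.

```python
# A toy decision function in the spirit of a unified test-selection
# flowchart. The branching is a common textbook scheme (assumed),
# not the algorithm developed in the paper.

def select_test(outcome, groups, paired, normal):
    """Return a conventional test name for simple two/multi-group designs.

    outcome: "continuous" or "categorical"
    groups:  number of groups being compared (2 or more)
    paired:  whether observations are paired/repeated
    normal:  whether a normality assumption is reasonable
    """
    if outcome == "categorical":
        return "McNemar test" if paired else "Chi-squared test"
    if groups == 2:
        if paired:
            return "Paired t-test" if normal else "Wilcoxon signed-rank test"
        return "Two-sample t-test" if normal else "Mann-Whitney U test"
    if paired:
        return "Repeated-measures ANOVA" if normal else "Friedman test"
    return "One-way ANOVA" if normal else "Kruskal-Wallis test"
```

For example, `select_test("continuous", 2, False, False)` returns the Mann-Whitney U test, the usual nonparametric counterpart of the two-sample t-test.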
Comparison of statistical tests for association between rare variants and binary traits.
Bacanu, Silviu-Alin; Nelson, Matthew R; Whittaker, John C
2012-01-01
Genome-wide association studies have found thousands of common genetic variants associated with a wide variety of diseases and other complex traits. However, a large portion of the predicted genetic contribution to many traits remains unknown. One plausible explanation is that some of the missing variation is due to the effects of rare variants. Nonetheless, the statistical analysis of rare variants is challenging. A commonly used method is to contrast, within the same region (gene), the frequency of minor alleles at rare variants between cases and controls. However, this strategy is most useful under the assumption that the tested variants have similar effects. We previously proposed a method that can accommodate heterogeneous effects in the analysis of quantitative traits. Here we extend this method to binary traits while accommodating covariates. We use simulations for a variety of causal and covariate impact scenarios to compare the performance of the proposed method to standard logistic regression, C-alpha, SKAT, and EREC. We found that (i) logistic regression methods perform well when the heterogeneity of the effects is not extreme and (ii) SKAT and EREC have good performance under all tested scenarios, but they can be computationally intensive. Consequently, it would be more computationally desirable to use a two-step strategy of (i) selecting promising genes with faster methods and (ii) analyzing the selected genes using SKAT/EREC. To select promising genes one can use (1) regression methods when effect heterogeneity is assumed to be low and the covariates explain a non-negligible part of trait variability, (2) C-alpha when heterogeneity is assumed to be large and covariates explain a small fraction of trait variability, and (3) the proposed trend and heterogeneity test when the heterogeneity is assumed to be non-trivial and the covariates explain a large fraction of trait variability.
NASA Astrophysics Data System (ADS)
Liu, Yonghe; Feng, Jinming; Liu, Xiu; Zhao, Yadi
2017-12-01
Statistical downscaling (SD) is a method that acquires the local information required for hydrological impact assessment from large-scale atmospheric variables. Very few statistical and deterministic downscaling models for daily precipitation have been developed for local sites influenced by the East Asian monsoon. In this study, SD models were constructed by selecting the best predictors and using generalized linear models (GLMs) for Feixian, a site in the Yishu River Basin and Shandong Province. By calculating and mapping Spearman rank correlation coefficients between the gridded standardized values of five large-scale variables and daily observed precipitation, different cyclonic circulation patterns were found for monsoonal precipitation in summer (June-September) and winter (November-December and January-March); the values of the gridded boxes with the highest absolute correlations with observed precipitation were selected as predictors. Data for predictors and predictands covered the period 1979-2015, and different calibration and validation periods were used when fitting and validating the models. Meanwhile, the bootstrap method was also used to fit the GLM. These thorough validations indicated that the models were robust and not sensitive to different samples or different periods. Pearson's correlations between downscaled and observed precipitation (logarithmically transformed) on a daily scale reached 0.54-0.57 in summer and 0.56-0.61 in winter, and the Nash-Sutcliffe efficiency between downscaled and observed precipitation reached 0.1 in summer and 0.41 in winter. The downscaled precipitation partially reproduced the interannual variations of total precipitation in winter and the main trends in summer. For the number of wet days, both winter and summer models were able to reflect interannual variations. Other comparisons were also made in this study. 
These results demonstrated that when downscaling, it is appropriate to combine a correlation-based predictor selection across a spatial domain with GLM modeling.
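The correlation-based predictor selection step can be sketched as follows: rank-correlate each candidate grid box with observed precipitation and keep the box with the largest absolute Spearman correlation. The grid-box names are invented for illustration, and the subsequent GLM fitting step is omitted.

```python
# Sketch of Spearman-based predictor selection across a spatial domain.
# Grid-box names are invented; the GLM fitting step is not shown.

def ranks(values):
    """Average ranks (ties receive the mean of their rank positions)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # ranks are 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

def best_predictor(grid_boxes, precip):
    """Pick the grid box most rank-correlated with observed precipitation."""
    return max(grid_boxes,
               key=lambda name: abs(spearman(grid_boxes[name], precip)))
```

Spearman correlation is a natural choice here because daily precipitation is heavily skewed, and rank correlation is insensitive to monotone transformations of the predictor.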
Tamborini, Marco
2015-11-01
This paper examines the subversive role of statistics in paleontology at the end of the 19th and the beginning of the 20th centuries. In particular, I will focus on German paleontology and its relationship with statistics. I argue that in paleontology, the quantitative method was questioned and strongly limited by the first decade of the 20th century because, as its opponents noted, when the fossil record was treated statistically, it generated results openly in conflict with the Darwinian theory of evolution. Essentially, statistics called into question the gradual mode of evolution and the role of natural selection. The main objections to statistics were addressed during the meetings at the Kaiserlich-Königliche Geologische Reichsanstalt in Vienna in the 1880s. After having introduced the statistical treatment of the fossil record, I will use the works of Charles Léo Lesquereux (1806-1889), Joachim Barrande (1799-1883), and Henry Shaler Williams (1847-1918) to compare the objections raised in Vienna with how the statistical treatment of the data worked in practice. Furthermore, I will discuss the criticisms of Melchior Neumayr (1845-1890), one of the leading German opponents of statistical paleontology, to show why, and to what extent, statistics were questioned in Vienna. The final part of this paper considers what paleontologists can derive from a statistical notion of data: the necessity of opening a discussion about the completeness and nature of the paleontological data. The Vienna discussion about which method paleontologists should follow offers an interesting case study for understanding the epistemic tensions within paleontology surrounding Darwin's theory, as well as the variety of non-Darwinian alternatives that emerged from the statistical treatment of the fossil record at the end of the 19th century.
Nevada's Children, 1996. Selected Educational and Social Statistics--Nevada and National.
ERIC Educational Resources Information Center
Horner, Mary P., Comp.
This report presents selected 1996 educational and social statistics that provide information about the status of children in Nevada. State statistics are in some cases compared to national statistics. The first part presents facts about education in Nevada with regard to student characteristics, enrollment, racial and ethnic populations, high…
Gasparini, Patrizia; Di Cosmo, Lucio; Cenni, Enrico; Pompei, Enrico; Ferretti, Marco
2013-07-01
In the frame of a process aiming at harmonizing the National Forest Inventory (NFI) and ICP Forests Level I Forest Condition Monitoring (FCM) in Italy, we investigated (a) the long-term consistency between FCM sample points (a subsample of the first NFI, 1985, NFI_1) and recent forest area estimates (after the second NFI, 2005, NFI_2) and (b) the effect of tree selection method (tree-based or plot-based) on sample composition and defoliation statistics. The two investigations were carried out on 261 and 252 FCM sites, respectively. Results show that some individual forest categories (larch and stone pine, Norway spruce, other coniferous, beech, temperate oaks and cork oak forests) are over-represented and others (hornbeam and hophornbeam, other deciduous broadleaved and holm oak forests) are under-represented in the FCM sample. This is probably due to a change in forest cover, which has increased by 1,559,200 ha from 1985 to 2005. In the case of a shift from a tree-based to a plot-based selection method, 3,130 (46.7%) of the original 6,703 sample trees would be abandoned, and 1,473 new trees would be selected. The balance between exclusion of former sample trees and inclusion of new ones would be particularly unfavourable for conifers (with only 16.4% of excluded trees replaced by new ones) and less so for deciduous broadleaves (with 63.5% of excluded trees replaced). The total number of tree species surveyed would not be affected, while the number of trees per species would, and the resulting (plot-based) sample composition would have a much larger frequency of deciduous broadleaved trees. The newly selected trees have, in general, smaller diameters at breast height (DBH) and lower defoliation scores. Given the larger rate of turnover, the deciduous broadleaved part of the sample would be more strongly affected. 
Our results suggest that both a revision of FCM network to account for forest area change and a plot-based approach to permit statistical inference and avoid bias in the tree sample composition in terms of DBH (and likely age and structure) are desirable in Italy. As the adoption of a plot-based approach will keep a large share of the trees formerly selected, direct tree-by-tree comparison will remain possible, thus limiting the impact on the time series comparability. In addition, the plot-based design will favour the integration with NFI_2.
Bayesian inference of selection in a heterogeneous environment from genetic time-series data.
Gompert, Zachariah
2016-01-01
Evolutionary geneticists have sought to characterize the causes and molecular targets of selection in natural populations for many years. Although this research programme has been somewhat successful, most statistical methods employed were designed to detect consistent, weak to moderate selection. In contrast, phenotypic studies in nature show that selection varies in time and that individual bouts of selection can be strong. Measurements of the genomic consequences of such fluctuating selection could help test and refine hypotheses concerning the causes of ecological specialization and the maintenance of genetic variation in populations. Herein, I proposed a Bayesian nonhomogeneous hidden Markov model to estimate effective population sizes and quantify variable selection in heterogeneous environments from genetic time-series data. The model is described and then evaluated using a series of simulated data, including cases where selection occurs on a trait with a simple or polygenic molecular basis. The proposed method accurately distinguished neutral loci from non-neutral loci under strong selection, but not from those under weak selection. Selection coefficients were accurately estimated when selection was constant or when the fitness values of genotypes varied linearly with the environment, but these estimates were less accurate when fitness was polygenic or the relationship between the environment and the fitness of genotypes was nonlinear. Past studies of temporal evolutionary dynamics in laboratory populations have been remarkably successful. The proposed method makes similar analyses of genetic time-series data from natural populations more feasible and thereby could help answer fundamental questions about the causes and consequences of evolution in the wild. © 2015 John Wiley & Sons Ltd.
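To make the inference problem concrete, the sketch below grid-searches a maximum-likelihood selection coefficient from an allele-frequency time series under a deterministic genic-selection model with binomial sampling. This is only a toy stand-in for the paper's Bayesian nonhomogeneous hidden Markov model, which additionally estimates effective population size and lets selection vary with the environment.

```python
# Toy estimation of a constant selection coefficient s from allele
# counts sampled over generations. Deterministic selection + binomial
# sampling noise; all settings are illustrative assumptions.
import math

def step(p, s):
    """One generation of deterministic genic selection on frequency p."""
    w = p * (1 + s) + (1 - p)           # mean fitness
    return p * (1 + s) / w

def log_likelihood(counts, n, p0, s):
    """Binomial log likelihood of observed allele counts given s."""
    p, ll = p0, 0.0
    for k in counts:
        p = step(p, s)
        p = min(max(p, 1e-9), 1 - 1e-9)  # guard against log(0)
        ll += (math.log(math.comb(n, k))
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return ll

def estimate_s(counts, n, p0, grid=None):
    """Grid-search maximum-likelihood estimate of s."""
    if grid is None:
        grid = [i / 100 for i in range(-30, 31)]    # s in [-0.3, 0.3]
    return max(grid, key=lambda s: log_likelihood(counts, n, p0, s))
```

A rising series of counts yields a positive estimate and a falling series a negative one; distinguishing weak selection from drift, as the abstract notes, requires modeling the population size as well.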
Kawata, Masaaki; Sato, Chikara
2007-06-01
In determining the three-dimensional (3D) structure of macromolecular assemblies in single particle analysis, a large representative dataset of two-dimensional (2D) average images from a huge number of raw images is key to high resolution. Because alignments prior to averaging are computationally intensive, currently available multireference alignment (MRA) software does not survey every possible alignment. This leads to misaligned images, creating blurred averages and reducing the quality of the final 3D reconstruction. We present a new method, in which multireference alignment is harmonized with classification (multireference multiple alignment: MRMA). This method enables a statistical comparison of multiple alignment peaks, reflecting the similarities between each raw image and a set of reference images. Among the selected alignment candidates for each raw image, misaligned images are statistically excluded, based on the principle that aligned raw images of similar projections have a dense distribution around the correctly aligned coordinates in image space. This newly developed method was examined for accuracy and speed using model image sets with various signal-to-noise ratios, and with electron microscope images of the Transient Receptor Potential C3 channel and the sodium channel. In every data set, the newly developed method outperformed conventional methods in robustness against noise and in speed, creating 2D average images of higher quality. This statistically harmonized alignment-classification combination should greatly improve the quality of single particle analysis.
AbdelRahman, Samir E; Zhang, Mingyuan; Bray, Bruce E; Kawamoto, Kensaku
2014-05-27
The aim of this study was to propose an analytical approach to develop high-performing predictive models for congestive heart failure (CHF) readmission using an operational dataset with incomplete records and changing data over time. Our analytical approach involves three steps: pre-processing, systematic model development, and risk factor analysis. For pre-processing, variables that were absent in >50% of records were removed. Moreover, the dataset was divided into a validation dataset and derivation datasets, which were separated into three temporal subsets based on changes to the data over time. For systematic model development, using the different temporal datasets and the remaining explanatory variables, the models were developed by combining various (i) statistical analyses to explore the relationships between the validation and the derivation datasets; (ii) adjustment methods for handling missing values; (iii) classifiers; (iv) feature selection methods; and (v) discretization methods. We then selected the best derivation dataset and the models with the highest predictive performance. For risk factor analysis, factors in the highest-performing predictive models were analyzed and ranked using (i) statistical analyses of the best derivation dataset, (ii) feature rankers, and (iii) a newly developed algorithm to categorize risk factors as being strong, regular, or weak. The analysis dataset consisted of 2,787 CHF hospitalizations at University of Utah Health Care from January 2003 to June 2013. In this study, we used the complete-case analysis and mean-based imputation adjustment methods; the wrapper subset feature selection method; and four ranking strategies based on information gain, gain ratio, symmetrical uncertainty, and wrapper subset feature evaluators. 
The best-performing models resulted from the use of a complete-case analysis derivation dataset combined with the Class-Attribute Contingency Coefficient discretization method and a voting classifier which averaged the results of multi-nominal logistic regression and voting feature intervals classifiers. Of 42 final model risk factors, discharge disposition, discretized age, and indicators of anemia were the most significant. This model achieved a c-statistic of 86.8%. The proposed three-step analytical approach enhanced predictive model performance for CHF readmissions. It could potentially be leveraged to improve predictive model performance in other areas of clinical medicine.
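Two of the pre-processing rules above (dropping variables absent in more than half of the records, and mean-based imputation) plus the probability-averaging idea behind a voting classifier can be sketched as follows. The record fields are invented, and the real pipeline's trained classifiers, feature selection, and discretization are not reproduced here.

```python
# Sketch of pre-processing rules and soft voting from the abstract.
# Field names and toy records are illustrative assumptions.

def drop_sparse_variables(records, max_missing=0.5):
    """Drop variables missing (None) in more than `max_missing` of records."""
    keys = {k for r in records for k in r}
    keep = [k for k in sorted(keys)
            if sum(r.get(k) is None for r in records) / len(records)
            <= max_missing]
    return [{k: r.get(k) for k in keep} for r in records]

def mean_impute(records):
    """Replace missing numeric values with the per-variable mean."""
    keys = records[0].keys()
    means = {}
    for k in keys:
        vals = [r[k] for r in records if r[k] is not None]
        means[k] = sum(vals) / len(vals)
    return [{k: (r[k] if r[k] is not None else means[k]) for k in keys}
            for r in records]

def soft_vote(probabilities):
    """Average predicted readmission probabilities from several models."""
    return sum(probabilities) / len(probabilities)
```

The complete-case alternative mentioned in the abstract would instead discard any record still containing a `None` after the sparse variables are removed.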
Raeissi, Pouran; Sharifi, Marziye; Khosravizadeh, Omid; Heidari, Mohammad
2017-01-01
Background: Patient safety culture plays an important role in healthcare systems, especially in chemotherapy and oncology departments (CODs), and its assessment can help to improve quality of services and hospital care. Objective: This study aimed to evaluate and compare items and dimensions of patient safety culture in the CODs of selected teaching hospitals of Iran and Tehran University of Medical Sciences. Materials and Methods: This descriptive-analytical cross-sectional survey was conducted during a six-month period on 270 people from chemotherapy and oncology departments selected through a cluster sampling method. All participants answered the standard questionnaire for "Hospital Survey of Patient Safety Culture" (HSOPSC). Statistical analyses were performed using SPSS/18 software. Results: The average score for patient safety culture was three for the majority of the studied CODs. Statistically significant differences were observed for supervisor actions, teamwork within various units, feedback and communications about errors, and the level of hospital management support (p < 0.05). Relationships between studied hospitals and patient safety culture were not statistically significant (p > 0.05). Conclusion: Our results showed that the overall status of patient safety culture is not good in the studied CODs. In particular, teamwork across different units and organizational learning with continuous improvement were the only two properly operating items among 12 dimensions of patient safety culture. Therefore, systematic interventions are strongly required to promote communication. PMID:29072411
Estimating and Identifying Unspecified Correlation Structure for Longitudinal Data
Hu, Jianhua; Wang, Peng; Qu, Annie
2014-01-01
Identifying correlation structure is important to achieving estimation efficiency in analyzing longitudinal data, and is also crucial for drawing valid statistical inference for large size clustered data. In this paper, we propose a nonparametric method to estimate the correlation structure, which is applicable for discrete longitudinal data. We utilize eigenvector-based basis matrices to approximate the inverse of the empirical correlation matrix and determine the number of basis matrices via model selection. A penalized objective function based on the difference between the empirical and model approximation of the correlation matrices is adopted to select an informative structure for the correlation matrix. The eigenvector representation of the correlation estimation is capable of reducing the risk of model misspecification, and also provides useful information on the specific within-cluster correlation pattern of the data. We show that the proposed method possesses the oracle property and selects the true correlation structure consistently. The proposed method is illustrated through simulations and two data examples on air pollution and sonar signal studies. PMID:26361433
NASA Astrophysics Data System (ADS)
Wang, Jianing; Liu, Yuan; Noble, Jack H.; Dawant, Benoit M.
2017-02-01
Medical image registration establishes a correspondence between images of biological structures, and it is at the core of many applications. Commonly used deformable image registration methods depend on a good preregistration initialization. The initialization can be performed by localizing homologous landmarks and calculating a point-based transformation between the images. The selection of landmarks, however, is important. In this work, we present a learning-based method to automatically find a set of robust landmarks in 3D MR image volumes of the head to initialize non-rigid transformations. To validate our method, these selected landmarks are localized in unknown image volumes and used to compute a smoothing thin-plate spline transformation that registers the atlas to the volumes. The transformed atlas image is then used as the preregistration initialization of an intensity-based non-rigid registration algorithm. We show that the registration accuracy of this algorithm is statistically significantly improved when using the presented registration initialization over a standard intensity-based affine registration.
Model selection and assessment for multi-species occupancy models
Broms, Kristin M.; Hooten, Mevin B.; Fitzpatrick, Ryan M.
2016-01-01
While multi-species occupancy models (MSOMs) are emerging as a popular method for analyzing biodiversity data, formal checking and validation approaches for this class of models have lagged behind. Concurrent with the rise in application of MSOMs among ecologists, a quiet regime shift is occurring in Bayesian statistics where predictive model comparison approaches are experiencing a resurgence. Unlike single-species occupancy models that use integrated likelihoods, MSOMs are usually couched in a Bayesian framework and contain multiple levels. Standard model checking and selection methods are often unreliable in this setting and there is only limited guidance in the ecological literature for this class of models. We examined several different contemporary Bayesian hierarchical approaches for checking and validating MSOMs and applied these methods to a freshwater aquatic study system in Colorado, USA, to better understand the diversity and distributions of plains fishes. Our findings indicated distinct differences among model selection approaches, with cross-validation techniques performing the best in terms of prediction.
NASA Astrophysics Data System (ADS)
Verfaillie, Deborah; Déqué, Michel; Morin, Samuel; Lafaysse, Matthieu
2017-11-01
We introduce the method ADAMONT v1.0 to adjust and disaggregate daily climate projections from a regional climate model (RCM) using an observational dataset at hourly time resolution. The method uses a refined quantile mapping approach for statistical adjustment and an analogous method for sub-daily disaggregation. The method ultimately produces adjusted hourly time series of temperature, precipitation, wind speed, humidity, and short- and longwave radiation, which can in turn be used to force any energy balance land surface model. While the method is generic and can be employed for any appropriate observation time series, here we focus on the description and evaluation of the method in the French mountainous regions. The observational dataset used here is the SAFRAN meteorological reanalysis, which covers the entire French Alps split into 23 massifs, within which meteorological conditions are provided for several 300 m elevation bands. In order to evaluate the skills of the method itself, it is applied to the ALADIN-Climate v5 RCM using the ERA-Interim reanalysis as boundary conditions, for the time period from 1980 to 2010. Results of the ADAMONT method are compared to the SAFRAN reanalysis itself. Various evaluation criteria are used for temperature and precipitation but also snow depth, which is computed by the SURFEX/ISBA-Crocus model using the meteorological driving data from either the adjusted RCM data or the SAFRAN reanalysis itself. The evaluation addresses in particular the time transferability of the method (using various learning/application time periods), the impact of the RCM grid point selection procedure for each massif/altitude band configuration, and the intervariable consistency of the adjusted meteorological data generated by the method. Results show that the performance of the method is satisfactory, with similar or even better evaluation metrics than alternative methods. 
However, results for air temperature are generally better than for precipitation. Results in terms of snow depth are satisfactory, which can be viewed as indicating a reasonably good intervariable consistency of the meteorological data produced by the method. In terms of temporal transferability (evaluated over time periods of only 15 years), results depend on the learning period. In terms of RCM grid point selection, a complex technique that takes into account not only horizontal but also altitudinal proximity to SAFRAN massif centre point/altitude couples generally degrades evaluation metrics for high altitudes compared to a simpler selection method based on horizontal distance alone.
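The statistical adjustment step at the heart of the method, quantile mapping, can be sketched in a few lines. This is a minimal empirical-quantile version for illustration only, not the refined ADAMONT implementation; the function name and toy temperature values are invented for the example.

```python
import bisect

def quantile_map(value, model_sorted, obs_sorted):
    """Map a model value to the observed distribution by matching
    empirical quantiles (a basic form of quantile mapping)."""
    n = len(model_sorted)
    # empirical quantile of `value` within the model distribution
    rank = bisect.bisect_left(model_sorted, value)
    q = rank / (n - 1) if n > 1 else 0.5
    # look up the same quantile in the observed distribution
    m = len(obs_sorted)
    idx = min(int(round(q * (m - 1))), m - 1)
    return obs_sorted[idx]

# toy example: the model runs about 2 degrees too warm
model = sorted([12.0, 13.0, 14.0, 15.0, 16.0])
obs = sorted([10.0, 11.0, 12.0, 13.0, 14.0])
adjusted = quantile_map(14.0, model, obs)  # maps the model median to the observed median
```

A production version would fit the mapping per variable, season, and elevation band, and interpolate between empirical quantiles rather than snapping to the nearest one.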
Husbands, Adrian; Mathieson, Alistair; Dowell, Jonathan; Cleland, Jennifer; MacKenzie, Rhoda
2014-04-23
The UK Clinical Aptitude Test (UKCAT) was designed to address issues identified with traditional methods of selection. This study aims to examine the predictive validity of the UKCAT and compare it to traditional selection methods in the senior years of medical school. This was a follow-up study of two cohorts of students from two medical schools who had previously taken part in a study examining the predictive validity of the UKCAT in first year. The sample consisted of 4th and 5th Year students who commenced their studies at the University of Aberdeen or University of Dundee medical schools in 2007. Data collected were: demographics (gender and age group); UKCAT scores; Universities and Colleges Admissions Service (UCAS) form scores; admission interview scores; and Year 4 and 5 degree examination scores. Pearson's correlations were used to examine the relationships between admissions variables, examination scores, gender, and age group, and to select variables for multiple linear regression analysis to predict examination scores. Ninety-nine and 89 students at Aberdeen medical school from Years 4 and 5, respectively, and 51 Year 4 students in Dundee were included in the analysis. Neither UCAS form nor interview scores were statistically significant predictors of examination performance. Conversely, the UKCAT yielded statistically significant validity coefficients between .24 and .36 in four of the five assessments investigated. Multiple regression analysis showed the UKCAT made a statistically significant unique contribution to variance in examination performance in the senior years. Results suggest the UKCAT predicts performance better in the later years of medical school than in earlier years, providing modest supportive evidence for the UKCAT's role in student selection within these institutions. Further research is needed to assess the predictive validity of the UKCAT against professional and behavioural outcomes as the cohort commences working life.
Open-source platform to benchmark fingerprints for ligand-based virtual screening
2013-01-01
Similarity-search methods using molecular fingerprints are an important tool for ligand-based virtual screening. A huge variety of fingerprints exist, and their performance, usually assessed in retrospective benchmarking studies with data sets of known actives and known or assumed inactives, depends largely on the validation data sets and the similarity measure used. Comparing new methods to existing ones in any systematic way is difficult due to the lack of standard data sets and evaluation procedures. Here, we present a standard platform for the benchmarking of 2D fingerprints. The open-source platform contains all source code, structural data for the actives and inactives used (drawn from three publicly available collections of data sets), and lists of randomly selected query molecules to be used for statistically valid comparisons of methods. This allows the exact reproduction and comparison of results in future studies. The results for 12 standard fingerprints together with two simple baseline fingerprints, assessed by seven evaluation methods, are shown together with the correlations between methods. High correlations were found between the 12 fingerprints, and a careful statistical analysis showed that only the two baseline fingerprints differed from the others in a statistically significant way. High correlations were also found between six of the seven evaluation methods, indicating that despite their seeming differences, many of these methods are similar to each other. PMID:23721588
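As an illustration of the similarity searches such benchmarks evaluate, here is a minimal Tanimoto ranking over fingerprints stored as sets of on-bit positions. The molecule names and bit positions are invented for the example; real platforms operate on much richer fingerprint types.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprints
    represented as sets of 'on' bit positions."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)

# rank a tiny hypothetical 'database' against a query fingerprint
query = {1, 4, 7, 9}
database = {"mol_a": {1, 4, 7, 9}, "mol_b": {1, 4, 8}, "mol_c": {2, 3}}
ranked = sorted(database, key=lambda m: tanimoto(query, database[m]),
                reverse=True)  # most similar first
```

Retrospective evaluation then asks how highly the known actives rank in such a sorted list, which is what measures like AUC or enrichment factors summarize.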
Liver segmentation from CT images using a sparse priori statistical shape model (SP-SSM).
Wang, Xuehu; Zheng, Yongchang; Gan, Lan; Wang, Xuan; Sang, Xinting; Kong, Xiangfeng; Zhao, Jie
2017-01-01
This study proposes a new liver segmentation method based on a sparse a priori statistical shape model (SP-SSM). First, mark points are selected in the liver a priori model and the original image. Then, the a priori shape and its mark points are used to obtain a dictionary for the liver boundary information. Second, the sparse coefficient is calculated based on the correspondence between mark points in the original image and those in the a priori model, and then the sparse statistical model is established by combining the sparse coefficients and the dictionary. Finally, the intensity energy and boundary energy models are built based on the intensity information and the specific boundary information of the original image. Then, the sparse matching constraint model is established based on the sparse coding theory. These models jointly drive the iterative deformation of the sparse statistical model to approximate and accurately extract the liver boundaries. This method can solve the problems of deformation model initialization and a priori method accuracy using the sparse dictionary. The SP-SSM can achieve a mean overlap error of 4.8% and a mean volume difference of 1.8%, whereas the average symmetric surface distance and the root mean square symmetric surface distance can reach 0.8 mm and 1.4 mm, respectively.
Background noise spectra of global seismic stations
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wada, M.M.; Claassen, J.P.
1996-08-01
Over an extended period of time, station noise spectra were collected from various sources for use in estimating the detection and location performance of global networks of seismic stations. As the database of noise spectra grew and duplicate entries became available, an effort was mounted to select station noise spectra more carefully while discarding others. This report discusses the methodology and criteria by which the noise spectra were selected. It also identifies and illustrates the station noise spectra which survived the selection process and which currently contribute to the modeling efforts. The resulting catalog of noise statistics benefits not only those who model network performance but also those who wish to select stations on the basis of their noise level, as may occur in designing networks or in selecting seismological data for analysis. In view of the various ways by which station noise was estimated by the different contributors, it is advisable that future efforts to predict network performance have available station noise data and spectral estimation methods which are compatible with the statistics underlying seismic noise. This appropriately requires (1) averaging noise over seasonal and/or diurnal cycles, (2) averaging noise over time intervals comparable to those employed by actual detectors, and (3) using logarithmic measures of the noise.
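Point (3), logarithmic measures of the noise, amounts to averaging power spectra per frequency bin on a dB scale, i.e., taking a geometric rather than arithmetic mean. A minimal sketch with invented one-bin spectra:

```python
import math

def log_average(spectra):
    """Average a list of power spectra on a logarithmic scale:
    a geometric mean per frequency bin, so that occasional very
    noisy intervals do not dominate the average."""
    n_bins = len(spectra[0])
    avg = []
    for k in range(n_bins):
        mean_log = sum(math.log10(s[k]) for s in spectra) / len(spectra)
        avg.append(10 ** mean_log)
    return avg

# two toy one-bin "spectra": the geometric mean of 1 and 100 is 10,
# whereas an arithmetic mean would give 50.5
avg = log_average([[1.0], [100.0]])
```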
Bird, Patrick; Flannery, Jonathan; Crowley, Erin; Agin, James R; Goins, David; Monteroso, Lisa
2016-07-01
The 3M™ Molecular Detection Assay (MDA) 2 - Salmonella uses real-time isothermal technology for the rapid and accurate detection of Salmonella spp. from enriched select food, feed, and food-process environmental samples. The 3M MDA 2 - Salmonella was evaluated in a multilaboratory collaborative study using an unpaired study design. The 3M MDA 2 - Salmonella was compared to the U.S. Food and Drug Administration Bacteriological Analytical Manual Chapter 5 reference method for the detection of Salmonella in creamy peanut butter, and to the U.S. Department of Agriculture, Food Safety and Inspection Service Microbiology Laboratory Guidebook Chapter 4.08 reference method "Isolation and Identification of Salmonella from Meat, Poultry, Pasteurized Egg and Catfish Products and Carcass and Environmental Samples" for the detection of Salmonella in raw ground beef (73% lean). Technicians from 16 laboratories located within the continental United States participated. Each matrix was evaluated at three levels of contamination: an uninoculated control level (0 CFU/test portion), a low inoculum level (0.2-2 CFU/test portion), and a high inoculum level (2-5 CFU/test portion). Statistical analysis was conducted according to the probability of detection (POD) statistical model. Results obtained for the low inoculum level test portions produced difference in collaborator POD values of 0.03 (95% confidence interval, -0.10 to 0.16) for raw ground beef and 0.06 (95% confidence interval, -0.06 to 0.18) for creamy peanut butter, indicating no statistically significant difference between the candidate and reference methods.
Bluszcz, Anna
Methods for measuring and assessing the level of sustainable development at the international, national, and regional levels are a current research problem that requires multi-dimensional analysis. The aim of the studies presented in this article was a relative assessment of the sustainability level of the European Union member states and a comparative analysis of the position of Poland relative to the other countries. EU member states were treated as objects in a multi-dimensional space. The dimensions of the space were specified by ten diagnostic variables describing the sustainability level of EU countries in three dimensions: social, economic, and environmental. Because the compiled statistical data were expressed in different units of measure, taxonomic methods were used to build an aggregated measure of the level of sustainable development of EU member states, which, through normalisation of variables, enabled a comparative analysis between countries. The methodology consisted of eight stages, including: defining the data matrices; calculating the variability coefficient for all variables and discarding those with a variability coefficient under 10%; dividing the variables into stimulants and destimulants; selecting the method of variable normalisation; developing matrices of normalised data; selecting the formula for and calculating the aggregated indicator of the relative level of sustainable development of the EU countries; calculating partial development indicators for the three studied dimensions (social, economic, and environmental); and classifying the EU countries according to their relative level of sustainable development. Statistical data were collected from publications of the Polish Central Statistical Office.
Agier, Lydiane; Portengen, Lützen; Chadeau-Hyam, Marc; Basagaña, Xavier; Giorgis-Allemand, Lise; Siroux, Valérie; Robinson, Oliver; Vlaanderen, Jelle; González, Juan R; Nieuwenhuijsen, Mark J; Vineis, Paolo; Vrijheid, Martine; Slama, Rémy; Vermeulen, Roel
2016-12-01
The exposome constitutes a promising framework to improve understanding of the effects of environmental exposures on health by explicitly considering multiple testing and avoiding selective reporting. However, exposome studies are challenged by the simultaneous consideration of many correlated exposures. We compared the performances of linear regression-based statistical methods in assessing exposome-health associations. In a simulation study, we generated 237 exposure covariates with a realistic correlation structure and with a health outcome linearly related to 0 to 25 of these covariates. Statistical methods were compared primarily in terms of false discovery proportion (FDP) and sensitivity. On average over all simulation settings, the elastic net and sparse partial least-squares regression showed a sensitivity of 76% and an FDP of 44%; Graphical Unit Evolutionary Stochastic Search (GUESS) and the deletion/substitution/addition (DSA) algorithm revealed a sensitivity of 81% and an FDP of 34%. The environment-wide association study (EWAS) underperformed these methods in terms of FDP (average FDP, 86%) despite a higher sensitivity. Performances decreased considerably when assuming an exposome exposure matrix with high levels of correlation between covariates. Correlation between exposures is a challenge for exposome research, and the statistical methods investigated in this study were limited in their ability to efficiently differentiate true predictors from correlated covariates in a realistic exposome context. Although GUESS and DSA provided a marginally better balance between sensitivity and FDP, they did not outperform the other multivariate methods across all scenarios and properties examined, and computational complexity and flexibility should also be considered when choosing between these methods. 
Citation: Agier L, Portengen L, Chadeau-Hyam M, Basagaña X, Giorgis-Allemand L, Siroux V, Robinson O, Vlaanderen J, González JR, Nieuwenhuijsen MJ, Vineis P, Vrijheid M, Slama R, Vermeulen R. 2016. A systematic comparison of linear regression-based statistical methods to assess exposome-health associations. Environ Health Perspect 124:1848-1856; http://dx.doi.org/10.1289/EHP172.
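The two performance measures driving the comparison above, false discovery proportion and sensitivity, are simple to compute for any selection result. A sketch with hypothetical covariate names:

```python
def fdp_and_sensitivity(selected, truth):
    """False discovery proportion and sensitivity of a variable
    selection result, given the set of truly associated covariates."""
    selected, truth = set(selected), set(truth)
    true_pos = len(selected & truth)
    # FDP: fraction of selected covariates that are false discoveries
    fdp = (len(selected) - true_pos) / len(selected) if selected else 0.0
    # sensitivity: fraction of true predictors that were recovered
    sens = true_pos / len(truth) if truth else 1.0
    return fdp, sens

# 3 of the 5 selected covariates are true predictors; 3 of 4 truths found
fdp, sens = fdp_and_sensitivity(["x1", "x2", "x3", "x8", "x9"],
                                ["x1", "x2", "x3", "x4"])
```

Simulation studies like this one repeat the computation over many synthetic data sets and average, which is why correlated covariates hurt: a correlated proxy of a true predictor counts toward FDP, not sensitivity.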
Model selection bias and Freedman's paradox
Lukacs, P.M.; Burnham, K.P.; Anderson, D.R.
2010-01-01
In situations where limited knowledge of a system exists and the ratio of data points to variables is small, variable selection methods can often be misleading. Freedman (Am Stat 37:152-155, 1983) demonstrated how common it is to select completely unrelated variables as highly "significant" when the number of data points is similar in magnitude to the number of variables. A new type of model averaging estimator based on model selection with Akaike's AIC is used with linear regression to investigate the problems of likely inclusion of spurious effects and of model selection bias, the bias introduced while using the data to select a single seemingly "best" model from a (often large) set of models employing many predictor variables. The new model averaging estimator helps reduce these problems and provides confidence interval coverage at the nominal level, while traditional stepwise selection has poor inferential properties. © The Institute of Statistical Mathematics, Tokyo 2009.
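Freedman's paradox is easy to reproduce: generate a response and many predictors that are all pure noise, then count how many predictors clear the usual |t| > 2 cutoff. The sketch below screens each predictor marginally via its correlation with the response, a simplification of Freedman's full-regression setup; with 50 predictors at the nominal 5% level, a couple of spurious "significant" variables are expected on average.

```python
import math
import random

def spurious_significant(n=50, p=50, seed=1):
    """Count predictors flagged 'significant' (|t| > 2) when both
    the response and all predictors are independent Gaussian noise."""
    rng = random.Random(seed)
    y = [rng.gauss(0, 1) for _ in range(n)]
    y_mean = sum(y) / n
    flagged = 0
    for _ in range(p):
        x = [rng.gauss(0, 1) for _ in range(n)]
        x_mean = sum(x) / n
        sxy = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
        sxx = sum((a - x_mean) ** 2 for a in x)
        syy = sum((b - y_mean) ** 2 for b in y)
        r = sxy / math.sqrt(sxx * syy)          # sample correlation
        t = r * math.sqrt((n - 2) / (1 - r * r))  # t-statistic for r
        if abs(t) > 2:  # roughly the two-sided 5% cutoff
            flagged += 1
    return flagged

k = spurious_significant(n=50, p=50)
```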
Multivariate statistical approach to estimate mixing proportions for unknown end members
Valder, Joshua F.; Long, Andrew J.; Davis, Arden D.; Kenner, Scott J.
2012-01-01
A multivariate statistical method is presented, which includes principal components analysis (PCA) and an end-member mixing model to estimate unknown end-member hydrochemical compositions and the relative mixing proportions of those end members in mixed waters. PCA, together with the Hotelling T2 statistic and a conceptual model of groundwater flow and mixing, was used in selecting samples that best approximate end members, which then were used as initial values in optimization of the end-member mixing model. This method was tested on controlled datasets (i.e., true values of estimates were known a priori) and found effective in estimating these end members and mixing proportions. The controlled datasets included synthetically generated hydrochemical data, synthetically generated mixing proportions, and laboratory analyses of sample mixtures, which were used in an evaluation of the effectiveness of this method for potential use in actual hydrological settings. For three different scenarios tested, correlation coefficients (R2) for linear regression between the estimated and known values ranged from 0.968 to 0.993 for mixing proportions and from 0.839 to 0.998 for end-member compositions. The method also was applied to field data from a study of end-member mixing in groundwater as a field example and partial method validation.
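In the simplest case of two end members and a single conservative tracer, the mixing proportion follows directly from mass balance; the PCA-based approach above generalizes this idea to many tracers and unknown end-member compositions. An illustrative sketch with invented chloride concentrations:

```python
def mixing_fraction(c_mix, c_end1, c_end2):
    """Fraction of end member 1 in a two-end-member mixture, from a
    single conservative tracer: c_mix = f*c_end1 + (1 - f)*c_end2."""
    return (c_mix - c_end2) / (c_end1 - c_end2)

# hypothetical chloride (mg/L): groundwater end member 100, rain 10;
# a sample at 55 mg/L implies an equal mixture
f = mixing_fraction(55.0, 100.0, 10.0)
```

With more end members than tracers-plus-one this inversion is underdetermined, which is why the paper couples PCA (to find how many end members the data support) with optimization of the mixing model.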
Ma, Yue; Yin, Fei; Zhang, Tao; Zhou, Xiaohua Andrew; Li, Xiaosong
2016-01-01
Spatial scan statistics are widely used in various fields. The performance of these statistics is influenced by parameters, such as maximum spatial cluster size, and can be improved by parameter selection using performance measures. Current performance measures are based on the presence of clusters and are thus inapplicable to data sets without known clusters. In this work, we propose a novel overall performance measure called maximum clustering set–proportion (MCS-P), which is based on the likelihood of the union of detected clusters and the applied dataset. MCS-P was compared with existing performance measures in a simulation study to select the maximum spatial cluster size. Results of other performance measures, such as sensitivity and misclassification, suggest that the spatial scan statistic achieves accurate results in most scenarios with the maximum spatial cluster sizes selected using MCS-P. Given that previously known clusters are not required in the proposed strategy, selection of the optimal maximum cluster size with MCS-P can improve the performance of the scan statistic in applications without identified clusters. PMID:26820646
Eash, David A.; Barnes, Kimberlee K.; O'Shea, Padraic S.
2016-09-19
A statewide study was conducted to develop regression equations for estimating three selected spring and three selected fall low-flow frequency statistics for ungaged stream sites in Iowa. The estimation equations developed for the six low-flow frequency statistics include spring (April through June) 1-, 7-, and 30-day mean low flows for a recurrence interval of 10 years and fall (October through December) 1-, 7-, and 30-day mean low flows for a recurrence interval of 10 years. Estimates of the three selected spring statistics are provided for 241 U.S. Geological Survey continuous-record streamgages, and estimates of the three selected fall statistics are provided for 238 of these streamgages, using data through June 2014. Because only 9 years of fall streamflow record were available, three streamgages included in the development of the spring regression equations were not included in the development of the fall regression equations. Because of regulation, diversion, or urbanization, 30 of the 241 streamgages were not included in the development of the regression equations. The study area includes Iowa and adjacent areas within 50 miles of the Iowa border. Because trend analyses indicated statistically significant positive trends when considering the period of record for most of the streamgages, the longest, most recent period of record without a significant trend was determined for each streamgage for use in the study. Geographic information system software was used to measure 63 selected basin characteristics for each of the 211 streamgages used to develop the regional regression equations.
The study area was divided into three low-flow regions that were defined in a previous study for the development of regional regression equations. Because several streamgages included in the development of regional regression equations have estimates of zero flow calculated from observed streamflow for selected spring and fall low-flow frequency statistics, the final equations for the three low-flow regions were developed using two types of regression analyses—left-censored and generalized-least-squares regression analyses. A total of 211 streamgages were included in the development of nine spring regression equations—three equations for each of the three low-flow regions. A total of 208 streamgages were included in the development of nine fall regression equations—three equations for each of the three low-flow regions. A censoring threshold was used to develop 15 left-censored regression equations to estimate the three fall low-flow frequency statistics for each of the three low-flow regions and to estimate the three spring low-flow frequency statistics for the southern and northwest regions. For the northeast region, generalized-least-squares regression was used to develop three equations to estimate the three spring low-flow frequency statistics. For the northeast region, average standard errors of prediction range from 32.4 to 48.4 percent for the spring equations and average standard errors of estimate range from 56.4 to 73.8 percent for the fall equations. For the northwest region, average standard errors of estimate range from 58.9 to 62.1 percent for the spring equations and from 83.2 to 109.4 percent for the fall equations.
For the southern region, average standard errors of estimate range from 43.2 to 64.0 percent for the spring equations and from 78.1 to 78.7 percent for the fall equations. The regression equations are applicable only to stream sites in Iowa with low flows not substantially affected by regulation, diversion, or urbanization and with basin characteristics within the range of those used to develop the equations. The regression equations will be implemented within the U.S. Geological Survey StreamStats Web-based geographic information system application. StreamStats allows users to click on any ungaged stream site and compute estimates of the six selected spring and fall low-flow statistics; in addition, 90-percent prediction intervals and the measured basin characteristics for the ungaged site are provided. StreamStats also allows users to click on any Iowa streamgage to obtain computed estimates for the six selected spring and fall low-flow statistics.
Townsley, Michael; Bernasco, Wim; Ruiter, Stijn; Johnson, Shane D.; White, Gentry; Baum, Scott
2015-01-01
Objectives: This study builds on research undertaken by Bernasco and Nieuwbeerta and explores the generalizability of a theoretically derived offender target selection model in three cross-national study regions. Methods: Taking a discrete spatial choice approach, we estimate the impact of both environment- and offender-level factors on residential burglary placement in the Netherlands, the United Kingdom, and Australia. Combining cleared burglary data from all study regions in a single statistical model, we make statistical comparisons between environments. Results: In all three study regions, the likelihood an offender selects an area for burglary is positively influenced by proximity to their home, the proportion of easily accessible targets, and the total number of targets available. Furthermore, in two of the three study regions, juvenile offenders under the legal driving age are significantly more influenced by target proximity than adult offenders. Post hoc tests indicate the magnitudes of these impacts vary significantly between study regions. Conclusions: While burglary target selection strategies are consistent with opportunity-based explanations of offending, the impact of environmental context is significant. As such, the approach undertaken in combining observations from multiple study regions may aid criminology scholars in assessing the generalizability of observed findings across multiple environments. PMID:25866418
Constantinescu, Alexandra C; Wolters, Maria; Moore, Adam; MacPherson, Sarah E
2017-06-01
The International Affective Picture System (IAPS; Lang, Bradley, & Cuthbert, 2008) is a stimulus database that is frequently used to investigate various aspects of emotional processing. Despite its extensive use, selecting IAPS stimuli for a research project is not usually done according to an established strategy, but rather is tailored to individual studies. Here we propose a standard, replicable method for stimulus selection based on cluster analysis, which re-creates the group structure that is most likely to have produced the valence, arousal, and dominance norms associated with the IAPS images. Our method includes screening the database for outliers, identifying a suitable clustering solution, and then extracting the desired number of stimuli on the basis of their level of certainty of belonging to the cluster they were assigned to. Our method preserves statistical power by maximizing the likelihood that the stimuli belong to the cluster structure fitted to them and by filtering stimuli according to their certainty of cluster membership. In addition, although our cluster-based method is illustrated using the IAPS, it can be extended to other stimulus databases.
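The selection step, keeping the stimuli most certain to belong to their assigned cluster, can be sketched with a simple distance-to-centroid proxy for certainty. The function, stimulus names, ratings, and centroids below are all invented for illustration; the actual method fits a model-based clustering to the full normative ratings.

```python
import math

def select_by_certainty(stimuli, centroids, n_per_cluster=2):
    """Assign each stimulus (here a (valence, arousal) pair) to its
    nearest centroid and keep the n closest members of each cluster,
    using distance to the centroid as a crude certainty proxy."""
    clusters = {i: [] for i in range(len(centroids))}
    for name, point in stimuli.items():
        dists = [math.dist(point, c) for c in centroids]
        k = dists.index(min(dists))
        clusters[k].append((dists[k], name))
    # sort each cluster by distance (most certain first) and truncate
    return {k: [name for _, name in sorted(members)[:n_per_cluster]]
            for k, members in clusters.items()}

# hypothetical normative ratings on 1-9 scales
stimuli = {"img1": (7.8, 6.1), "img2": (7.5, 5.9), "img3": (2.1, 5.8),
           "img4": (2.4, 6.2), "img5": (5.0, 3.0)}
centroids = [(7.5, 6.0), (2.2, 6.0)]
selected = select_by_certainty(stimuli, centroids, n_per_cluster=2)
```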
EAS fluctuation approach to primary mass composition investigation
NASA Technical Reports Server (NTRS)
Stamenov, J. N.; Janminchev, V. D.
1985-01-01
Analysis of the shapes of muon and electron fluctuation distributions by a statistical inverse-problem method makes it possible to obtain the relative contributions of the five main groups of primary nuclei. The method is model-independent for a large class of interaction models and can give good results for observation levels not too far from the shower development maximum, provided showers are selected with fixed sizes and zenith angles no larger than 30 deg.
NASA Astrophysics Data System (ADS)
Goodman, Joseph W.
2000-07-01
The Wiley Classics Library consists of selected books that have become recognized classics in their respective fields. With these new unabridged and inexpensive editions, Wiley hopes to extend the life of these important works by making them available to future generations of mathematicians and scientists. Currently available in the Series: T. W. Anderson The Statistical Analysis of Time Series T. S. Arthanari & Yadolah Dodge Mathematical Programming in Statistics Emil Artin Geometric Algebra Norman T. J. Bailey The Elements of Stochastic Processes with Applications to the Natural Sciences Robert G. Bartle The Elements of Integration and Lebesgue Measure George E. P. Box & Norman R. Draper Evolutionary Operation: A Statistical Method for Process Improvement George E. P. Box & George C. Tiao Bayesian Inference in Statistical Analysis R. W. Carter Finite Groups of Lie Type: Conjugacy Classes and Complex Characters R. W. Carter Simple Groups of Lie Type William G. Cochran & Gertrude M. Cox Experimental Designs, Second Edition Richard Courant Differential and Integral Calculus, Volume I Richard Courant Differential and Integral Calculus, Volume II Richard Courant & D. Hilbert Methods of Mathematical Physics, Volume I Richard Courant & D. Hilbert Methods of Mathematical Physics, Volume II D. R. Cox Planning of Experiments Harold S. M. Coxeter Introduction to Geometry, Second Edition Charles W. Curtis & Irving Reiner Representation Theory of Finite Groups and Associative Algebras Charles W. Curtis & Irving Reiner Methods of Representation Theory with Applications to Finite Groups and Orders, Volume I Charles W. Curtis & Irving Reiner Methods of Representation Theory with Applications to Finite Groups and Orders, Volume II Cuthbert Daniel Fitting Equations to Data: Computer Analysis of Multifactor Data, Second Edition Bruno de Finetti Theory of Probability, Volume I Bruno de Finetti Theory of Probability, Volume II W. Edwards Deming Sample Design in Business Research
De Luca, Carlo J; Kline, Joshua C
2014-12-01
Over the past four decades, various methods have been implemented to measure the synchronization of motor-unit firings. In this work, we provide evidence that prior reports of the existence of universal common inputs to all motoneurons and of the presence of long-term synchronization are misleading, because they did not use sufficiently rigorous statistical tests to detect synchronization. We developed a statistically based method (SigMax) for computing synchronization and tested it with data from 17,736 motor-unit pairs containing 1,035,225 firing instances from the first dorsal interosseous and vastus lateralis muscles, a data set one order of magnitude greater than those reported in previous studies. Only firing data obtained from surface electromyographic signal decomposition with >95% accuracy were used in the study. The data were not subjectively selected in any manner. Because of the size of our data set and the statistical rigor inherent to SigMax, we have confidence that the synchronization values we calculated provide an improved estimate of physiologically driven synchronization. Compared with three other commonly used techniques, ours revealed three types of discrepancies that result from failing to apply statistical tests rigorous enough to detect synchronization. 1) On average, the z-score method falsely detected synchronization at 16 separate latencies in each motor-unit pair. 2) The cumulative sum method missed one out of every four synchronization identifications found by SigMax. 3) The common input assumption method identified synchronization in 100% of the motor-unit pairs studied, whereas SigMax revealed that only 50% of motor-unit pairs actually manifested synchronization. Copyright © 2014 the American Physiological Society.
Volcano plots in analyzing differential expressions with mRNA microarrays.
Li, Wentian
2012-12-01
A volcano plot displays unstandardized signal (e.g. log-fold-change) against noise-adjusted/standardized signal (e.g. t-statistic or -log10(p-value) from the t-test). We review the basic and interactive use of the volcano plot and its crucial role in understanding the regularized t-statistic. The joint filtering gene selection criterion based on regularized statistics has a curved discriminant line in the volcano plot, as compared to the two perpendicular lines of the "double filtering" criterion. This review attempts to provide a unifying framework for discussions on alternative measures of differential expression, improved methods for estimating variance, and visual display of microarray analysis results. We also discuss the possibility of applying volcano plots to fields beyond microarrays.
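As a concrete illustration of the two volcano-plot axes, the sketch below (plain Python) computes the unstandardized signal (log2 fold change), a regularized t-statistic, and the perpendicular-lines "double filtering" rule. The SAM-style fudge factor `s0` and the cutoffs `lfc_cut` and `t_cut` are illustrative assumptions, not values from the review.

```python
import math

def log2_fold_change(mean_a, mean_b):
    # unstandardized signal: log2 ratio of group means
    return math.log2(mean_a / mean_b)

def regularized_t(xs, ys, s0=0.1):
    # noise-adjusted signal: mean difference over (standard error + fudge factor s0);
    # s0 damps inflated statistics from near-zero variance estimates
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    se = math.sqrt(vx / nx + vy / ny)
    return (mx - my) / (se + s0)

def double_filter(lfc, t, lfc_cut=1.0, t_cut=2.0):
    # "double filtering": two perpendicular cut lines in the volcano plot
    return abs(lfc) >= lfc_cut and abs(t) >= t_cut
```

A joint criterion based on the regularized statistic alone would instead trace a curved boundary in the (lfc, t) plane, as the review notes.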
Statistical validation of normal tissue complication probability models.
Xu, Cheng-Jian; van der Schaaf, Arjen; Van't Veld, Aart A; Langendijk, Johannes A; Schilstra, Cornelis
2012-09-01
To investigate the applicability and value of double cross-validation and permutation tests as established statistical approaches in the validation of normal tissue complication probability (NTCP) models. A penalized regression method, LASSO (least absolute shrinkage and selection operator), was used to build NTCP models for xerostomia after radiation therapy treatment of head-and-neck cancer. Model assessment was based on the likelihood function and the area under the receiver operating characteristic curve. Repeated double cross-validation showed the uncertainty and instability of the NTCP models and indicated that the statistical significance of model performance can be obtained by permutation testing. Repeated double cross-validation and permutation tests are recommended to validate NTCP models before clinical use. Copyright © 2012 Elsevier Inc. All rights reserved.
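The permutation-testing idea can be sketched independently of the NTCP setting: shuffle the outcome labels, recompute the performance metric (here the area under the ROC curve), and ask how often shuffled data matches the observed score. This minimal pure-Python version illustrates the general principle only; it is not the authors' LASSO/double cross-validation pipeline.

```python
import random

def auc(scores, labels):
    # rank-based AUC: probability a random positive outranks a random negative
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def permutation_pvalue(scores, labels, n_perm=1000, seed=0):
    # significance of the observed AUC against a label-shuffled null distribution
    rng = random.Random(seed)
    observed = auc(scores, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if auc(scores, shuffled) >= observed:
            hits += 1
    # add-one correction keeps the p-value strictly positive
    return (hits + 1) / (n_perm + 1)
```

A small p-value indicates the model's discrimination is unlikely to arise from chance alone, which is the check the authors recommend before clinical use.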
Roh, Min K; Gillespie, Dan T; Petzold, Linda R
2010-11-07
The weighted stochastic simulation algorithm (wSSA) was developed by Kuwahara and Mura [J. Chem. Phys. 129, 165101 (2008)] to efficiently estimate the probabilities of rare events in discrete stochastic systems. The wSSA uses importance sampling to enhance the statistical accuracy in the estimation of the probability of the rare event. The original algorithm biases the reaction selection step with a fixed importance sampling parameter. In this paper, we introduce a novel method where the biasing parameter is state-dependent. The new method features improved accuracy, efficiency, and robustness.
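The core of importance-sampled reaction selection can be illustrated in a few lines: sample the next reaction from biased propensities, and carry a likelihood-ratio weight that keeps the rare-event probability estimate unbiased. This is a generic sketch of the weighting idea, not the state-dependent biasing scheme of the paper; the `bias` multipliers are hypothetical inputs.

```python
import random

def biased_step(propensities, bias, rng):
    # one reaction-selection step of an importance-sampled SSA:
    # pick reaction j with probability proportional to a_j * b_j, and return
    # the likelihood-ratio weight true_p[j] / biased_p[j] correcting the bias
    a0 = sum(propensities)
    biased = [a * b for a, b in zip(propensities, bias)]
    b0 = sum(biased)
    u, acc = rng.random() * b0, 0.0
    for j, ab in enumerate(biased):
        acc += ab
        if u <= acc:
            return j, (propensities[j] / a0) / (ab / b0)
    return len(biased) - 1, (propensities[-1] / a0) / (biased[-1] / b0)
```

Multiplying these weights along a trajectory and averaging the weighted indicator of the rare event over many trajectories yields an unbiased probability estimate.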
Biosynthesis and genetic encoding of phosphothreonine through parallel selection and deep sequencing
Huguenin-Dezot, Nicolas; Liang, Alexandria D.; Schmied, Wolfgang H.; Rogerson, Daniel T.; Chin, Jason W.
2017-01-01
The phosphorylation of threonine residues in proteins regulates diverse processes in eukaryotic cells, and thousands of threonine phosphorylations have been identified. An understanding of how threonine phosphorylation regulates biological function will be accelerated by general methods to biosynthesize defined phosphoproteins. Here we address limitations in current methods for discovering aminoacyl-tRNA synthetase/tRNA pairs for incorporating non-natural amino acids into proteins by combining parallel positive selections with deep sequencing and statistical analysis to create a rapid approach for directly discovering aminoacyl-tRNA synthetase/tRNA pairs that selectively incorporate non-natural substrates. Our approach is scalable and enables the direct discovery of aminoacyl-tRNA synthetase/tRNA pairs with mutually orthogonal substrate specificity. We biosynthesize phosphothreonine in cells, and use our new selection approach to discover a phosphothreonyl-tRNA synthetase/tRNACUA pair. By combining these advances we create an entirely biosynthetic route to incorporating phosphothreonine in proteins and biosynthesize several phosphoproteins, enabling phosphoprotein structure determination and synthetic protein kinase activation. PMID:28553966
Natural Selection in Large Populations
NASA Astrophysics Data System (ADS)
Desai, Michael
2011-03-01
I will discuss theoretical and experimental approaches to the evolutionary dynamics and population genetics of natural selection in large populations. In these populations, many mutations are often present simultaneously, and because recombination is limited, selection cannot act on them all independently. Rather, it can only affect whole combinations of mutations linked together on the same chromosome. Methods common in theoretical population genetics have been of limited utility in analyzing this coupling between the fates of different mutations. In the past few years it has become increasingly clear that this is a crucial gap in our understanding, as sequence data has begun to show that selection appears to act pervasively on many linked sites in a wide range of populations, including viruses, microbes, Drosophila, and humans. I will describe approaches that combine analytical tools drawn from statistical physics and dynamical systems with traditional methods in theoretical population genetics to address this problem, and describe how experiments in budding yeast can help us directly observe these evolutionary dynamics.
NASA Astrophysics Data System (ADS)
Su, Xing; Meng, Xingmin; Ye, Weilin; Wu, Weijiang; Liu, Xingrong; Wei, Wanhong
2018-03-01
Tianshui City is one of the mountainous cities threatened by severe geo-hazards in Gansu Province, China. Statistical probability models have been widely used to analyze and evaluate geo-hazards such as landslides. In this research, three approaches (the Certainty Factor, Weight of Evidence, and Information Quantity methods) were adopted to quantitatively analyze the relationship between the causative factors and the landslides. The source data used in this study include the SRTM DEM and local geological maps at the 1:200,000 scale. Twelve causative factors (altitude, slope, aspect, curvature, plan curvature, profile curvature, roughness, relief amplitude, distance to rivers, distance to faults, distance to roads, and stratum lithology) were selected for correlation analysis after thorough investigation of the geological conditions and historical landslides. The results indicate that the outcomes of the three models are fairly consistent.
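To make the factor-landslide correlation concrete, the sketch below computes an information-value score for one factor class: the log ratio of the landslide density inside the class to the overall landslide density. This is a common textbook form of the Information Quantity method; the study's exact formulation is not given in the abstract, so treat this as an assumed variant.

```python
import math

def information_value(landslide_cells_in_class, cells_in_class,
                      landslide_cells_total, cells_total):
    # information value of one factor class: positive values mean the class
    # is more landslide-prone than the study area on average
    class_density = landslide_cells_in_class / cells_in_class
    overall_density = landslide_cells_total / cells_total
    return math.log(class_density / overall_density)
```

Summing the class scores of all twelve factors for each map cell would yield a relative susceptibility index.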
New insights into old methods for identifying causal rare variants.
Wang, Haitian; Huang, Chien-Hsun; Lo, Shaw-Hwa; Zheng, Tian; Hu, Inchi
2011-11-29
The advance of high-throughput next-generation sequencing technology makes possible the analysis of rare variants. However, the investigation of rare variants in unrelated-individuals data sets faces the challenge of low power, and most methods circumvent the difficulty by using various collapsing procedures based on genes, pathways, or gene clusters. We suggest a new way to identify causal rare variants using the F-statistic and sliced inverse regression. The procedure is tested on the data set provided by the Genetic Analysis Workshop 17 (GAW17). After preliminary data reduction, we ranked markers according to their F-statistic values. Top-ranked markers were then subjected to sliced inverse regression, and those with higher absolute coefficients in the most significant sliced inverse regression direction were selected. The procedure yields good false discovery rates for the GAW17 data and thus is a promising method for future study on rare variants.
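The first step of the procedure, ranking markers by their F-statistic, amounts to a one-way ANOVA F computed per marker. A minimal sketch (the grouping of phenotype values by genotype is an assumed layout, not taken from the GAW17 data):

```python
def f_statistic(groups):
    # one-way ANOVA F: between-group mean square over within-group mean square
    k = len(groups)
    n = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

Markers sorted by this value in decreasing order would then feed the sliced inverse regression step described above.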
An adaptive data-driven method for accurate prediction of remaining useful life of rolling bearings
NASA Astrophysics Data System (ADS)
Peng, Yanfeng; Cheng, Junsheng; Liu, Yanfei; Li, Xuejun; Peng, Zhihua
2018-06-01
A novel data-driven method based on Gaussian mixture model (GMM) and distance evaluation technique (DET) is proposed to predict the remaining useful life (RUL) of rolling bearings. The data sets are clustered by GMM to divide all data sets into several health states adaptively and reasonably. The number of clusters is determined by the minimum description length principle. Thus, both the health states of the data sets and the number of states are obtained automatically. Meanwhile, the abnormal data sets can be recognized during the clustering process and removed from the training data sets. After obtaining the health states, appropriate features are selected by DET for increasing the classification and prediction accuracy. In the prediction process, each vibration signal is decomposed into several components by empirical mode decomposition. Some common statistical parameters of the components are calculated first and then the features are clustered using GMM to divide the data sets into several health states and remove the abnormal data sets. Thereafter, appropriate statistical parameters of the generated components are selected using DET. Finally, least squares support vector machine is utilized to predict the RUL of rolling bearings. Experimental results indicate that the proposed method reliably predicts the RUL of rolling bearings.
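The feature-selection step can be illustrated with a generic class-separability ratio in the spirit of the distance evaluation technique: features whose class centers sit far apart relative to the within-class spread score higher. The exact DET formula is not given in the abstract, so the ratio below is an assumed, simplified stand-in.

```python
def separability_score(feature_by_class):
    # ratio of mean between-class center distance to mean within-class spread;
    # higher values suggest the feature separates health states better
    centers = [sum(vals) / len(vals) for vals in feature_by_class]
    spread = sum(sum(abs(x - c) for x in vals) / len(vals)
                 for vals, c in zip(feature_by_class, centers)) / len(centers)
    k = len(centers)
    n_pairs = k * (k - 1) / 2
    between = sum(abs(centers[i] - centers[j])
                  for i in range(k) for j in range(i + 1, k)) / n_pairs
    return between / spread
```

Ranking candidate statistical parameters by such a score and keeping the top few mirrors the role DET plays before the regression stage.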
NASA Astrophysics Data System (ADS)
Zhang, Lufeng; Du, Jianxiu
2016-04-01
The development of a highly selective and sensitive method for iron(III) detection is of great importance from both a human health and an environmental point of view. We herein report a simple, selective and sensitive colorimetric method for the detection of Fe(III) at the submicromolar level with 3,3′,5,5′-tetramethylbenzidine (TMB) as a chromogenic probe. It was observed that Fe(III) could directly oxidize TMB to form a blue solution without adding any extra oxidants. The reaction has a stoichiometric ratio of 1:1 (Fe(III)/TMB) as determined by a molar ratio method. The resultant color change can be perceived by the naked eye or monitored via the absorbance change at 652 nm. The method allowed the measurement of Fe(III) in the range 1.0 × 10⁻⁷ to 1.5 × 10⁻⁴ mol L⁻¹ with a detection limit of 5.5 × 10⁻⁸ mol L⁻¹. The relative standard deviation was 0.9% for eleven replicate measurements of a 2.5 × 10⁻⁵ mol L⁻¹ Fe(III) solution. The chemistry showed high selectivity for Fe(III) over other common cations. The practicality of the method was evaluated by the determination of Fe in milk samples; good consistency was obtained between the results of this method and atomic absorption spectrophotometry, as indicated by statistical analysis.
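The quantitative reporting above (linear range, detection limit, comparison with a reference method) rests on a standard calibration workflow. The sketch below assumes ordinary least-squares calibration and the common 3-sigma detection-limit convention; the abstract does not state which convention the authors used.

```python
def linear_fit(x, y):
    # ordinary least squares: slope and intercept of y = a*x + b
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    a = sxy / sxx
    return a, my - a * mx

def detection_limit(blank_sd, slope, k=3.0):
    # 3-sigma convention: smallest concentration distinguishable from the blank
    return k * blank_sd / slope
```

Fitting absorbance at 652 nm against known Fe(III) concentrations gives the slope; the standard deviation of blank readings then fixes the detection limit.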
Allen, Peter J.; Dorozenko, Kate P.; Roberts, Lynne D.
2016-01-01
Quantitative research methods are essential to the development of professional competence in psychology. They are also an area of weakness for many students. In particular, students are known to struggle with the skill of selecting quantitative analytical strategies appropriate for common research questions, hypotheses and data types. To begin understanding this apparent deficit, we presented nine psychology undergraduates (who had all completed at least one quantitative methods course) with brief research vignettes, and asked them to explicate the process they would follow to identify an appropriate statistical technique for each. Thematic analysis revealed that all participants found this task challenging, and even those who had completed several research methods courses struggled to articulate how they would approach the vignettes on more than a very superficial and intuitive level. While some students recognized that there is a systematic decision making process that can be followed, none could describe it clearly or completely. We then presented the same vignettes to 10 psychology academics with particular expertise in conducting research and/or research methods instruction. Predictably, these “experts” were able to describe a far more systematic, comprehensive, flexible, and nuanced approach to statistical decision making, which begins early in the research process, and pays consideration to multiple contextual factors. They were sensitive to the challenges that students experience when making statistical decisions, which they attributed partially to how research methods and statistics are commonly taught. This sensitivity was reflected in their pedagogic practices. When asked to consider the format and features of an aid that could facilitate the statistical decision making process, both groups expressed a preference for an accessible, comprehensive and reputable resource that follows a basic decision tree logic. 
For the academics in particular, this aid should function as a teaching tool, which engages the user with each choice-point in the decision making process, rather than simply providing an “answer.” Based on these findings, we offer suggestions for tools and strategies that could be deployed in the research methods classroom to facilitate and strengthen students' statistical decision making abilities. PMID:26909064
Generation of synthetic flood hydrographs by hydrological donors (SHYDONHY method)
NASA Astrophysics Data System (ADS)
Paquet, Emmanuel
2017-04-01
For the design of hydraulic infrastructures like dams, a design hydrograph is required in most cases. Some of its features (e.g. peak value, duration, volume) corresponding to a given return period are computed with a wide range of methods: historical records, mono- or multivariate statistical analysis, stochastic simulation, etc. Various methods have then been proposed to construct design hydrographs having such characteristics, ranging from the traditional unit hydrograph to statistical methods (Yue et al., 2002). A new method to build design hydrographs (or more generally synthetic hydrographs) is introduced here, named SHYDONHY, a French acronym for "Synthèse d'HYdrogrammes par DONneurs HYdrologiques". It is based on an extensive database of 100,000 flood hydrographs recorded at hourly time-step at 1300 gauging stations in France and Switzerland, covering a wide range of catchment sizes and climatologies. For each station, an average of two hydrographs per year of record has been selected by a peak-over-threshold (POT) method with independence criteria (Lang et al., 1999). This sampling ensures that only hydrographs of intense floods are gathered in the dataset. For a given catchment, where few or no hydrographs are available at the outlet, a sub-set of 10 "donor stations" is selected within the complete dataset, considering several criteria: proximity, size, mean annual values and regimes for both total runoff and POT-selected floods. This sub-set of stations (and their corresponding flood hydrographs) makes it possible to: • Estimate a characteristic duration of flood hydrographs (e.g. the duration for which the discharge is above 50% of the peak value). • For a given duration (e.g. one day), estimate the average peak-to-volume ratio of floods. • For a given duration and peak-to-volume ratio, generate a synthetic reference hydrograph by combining appropriate hydrographs of the sub-set.
• For a given daily discharge sequence, whether observed or generated for extreme flood estimation, generate a suitable synthetic hydrograph, also by combining selected hydrographs of the sub-set. The reliability of the method is assessed by performing a jackknife validation on the whole dataset of stations, in particular by reconstructing the hydrograph of the largest flood of each station and comparing it to the actual one. Some applications are presented, e.g. the coupling of SHYDONHY with the SCHADEX method (Paquet et al., 2013) for the stochastic simulation of extreme reservoir levels in dams. References: Lang, M., Ouarda, T. B. M. J., & Bobée, B. (1999). Towards operational guidelines for over-threshold modeling. Journal of Hydrology, 225(3), 103-117. Paquet, E., Garavaglia, F., Garçon, R., & Gailhard, J. (2013). The SCHADEX method: A semi-continuous rainfall-runoff simulation for extreme flood estimation. Journal of Hydrology, 495, 23-37. Yue, S., Ouarda, T. B., Bobée, B., Legendre, P., & Bruneau, P. (2002). Approach for describing statistical properties of flood hydrograph. Journal of Hydrologic Engineering, 7(2), 147-153.
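The first donor-based estimate, the characteristic duration for which discharge exceeds 50% of the peak, is simple to compute from an hourly hydrograph. A minimal sketch (hourly sampling assumed):

```python
def characteristic_duration(discharge, dt_hours=1.0, fraction=0.5):
    # total time the hydrograph spends above `fraction` of its peak discharge
    peak = max(discharge)
    return sum(dt_hours for q in discharge if q > fraction * peak)
```

Averaging this duration over a catchment's donor-station hydrographs gives the characteristic value used when shaping the synthetic hydrograph.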
Identification of Selection Footprints on the X Chromosome in Pig
Zhang, Qin; Ding, Xiangdong
2014-01-01
Identifying footprints of selection can provide straightforward insight into the mechanism of artificial selection and help identify the causal genes related to important traits. In this study, three between-population and two within-population approaches, the Cross Population Extended Haplotype Homozygosity Test (XPEHH), the Cross Population Composite Likelihood Ratio (XPCLR), the F-statistics (Fst), the Integrated Haplotype Score (iHS) and Tajima's D, were implemented to detect selection footprints on the X chromosome in three pig breeds using the Illumina Porcine60K SNP chip. In the detection of selection footprints using between-population methods, 11, 11 and 7 potential selection regions with lengths of 15.62 Mb, 12.32 Mb and 9.38 Mb were identified in Landrace, Chinese Songliao and Yorkshire by XPEHH, respectively; 16, 13 and 17 potential selection regions with lengths of 15.20 Mb, 13.00 Mb and 19.21 Mb by XPCLR; and 4, 2 and 4 potential selection regions with lengths of 3.20 Mb, 1.60 Mb and 3.20 Mb by Fst. For within-population methods, 7, 10 and 9 potential selection regions with lengths of 8.12 Mb, 8.40 Mb and 9.99 Mb were identified in Landrace, Chinese Songliao and Yorkshire by iHS, and 4, 3 and 2 potential selection regions with lengths of 3.20 Mb, 2.40 Mb and 1.60 Mb by Tajima's D. Moreover, the selection regions from different methods partly overlapped; in particular, the regions around 22 to 25 Mb were detected as under selection in Landrace and Yorkshire, but not in Chinese Songliao, by all three between-population methods. Only a few overlaps were found between the selection regions identified by between-population and within-population methods. Bioinformatics analysis showed that genes relevant to meat quality, reproduction and immunity were found in the potential selection regions.
In addition, three out of five significant SNPs associated with hematological traits reported in our genome-wide association study were harbored in potential selection regions. PMID:24740293
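Among the listed statistics, Fst is the most compact to illustrate: it contrasts the expected heterozygosity of the pooled population with the average heterozygosity within subpopulations. A minimal sketch from biallelic allele frequencies (equal subpopulation weights by default; the study's exact estimator is not specified in the abstract):

```python
def fst(p_subpops, sizes=None):
    # Wright's Fst = (Ht - Hs) / Ht from subpopulation allele frequencies
    k = len(p_subpops)
    w = sizes or [1] * k
    tot = sum(w)
    p_bar = sum(wi * p for wi, p in zip(w, p_subpops)) / tot
    ht = 2 * p_bar * (1 - p_bar)  # expected heterozygosity of the pooled population
    hs = sum(wi * 2 * p * (1 - p) for wi, p in zip(w, p_subpops)) / tot  # within-subpop mean
    return (ht - hs) / ht
```

Values near 0 indicate similar allele frequencies across breeds; values approaching 1 mark strongly differentiated loci, the candidates for selection footprints.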
Brauchli Pernus, Yolanda; Nan, Cassandra; Verstraeten, Thomas; Pedenko, Mariia; Osokogu, Osemeke U; Weibel, Daniel; Sturkenboom, Miriam; Bonhoeffer, Jan
2016-12-12
Safety signal detection in spontaneous reporting system databases and electronic healthcare records is key to detecting previously unknown adverse events following immunization. Various statistical methods for signal detection in these different data sources have been developed; however, none are geared to the pediatric population, and none specifically to vaccines. A reference set comprising pediatric vaccine-adverse event pairs is required for reliable performance testing of statistical methods within and across data sources. The study was conducted within the context of the Global Research in Paediatrics (GRiP) project, part of the seventh framework programme (FP7) of the European Commission. The criteria for selecting the vaccines considered in the reference set were routine and global use in the pediatric population. Adverse events were primarily selected based on importance. Outcome-based systematic literature searches were performed for all identified vaccine-adverse event pairs and complemented by expert committee reports, evidence-based decision support systems (e.g. Micromedex), and summaries of product characteristics. Classification into positive (PC) and negative control (NC) pairs was performed by two independent reviewers according to a pre-defined algorithm and discussed for consensus in case of disagreement. We selected 13 vaccines and 14 adverse events to be included in the reference set. From a total of 182 vaccine-adverse event pairs, we classified 18 as PC, 113 as NC and 51 as unclassifiable. Most classifications (91) were based on literature review, 45 were based on expert committee reports, and for 46 vaccine-adverse event pairs an underlying pathomechanism was not plausible, classifying the association as NC. A reference set of vaccine-adverse event pairs was developed. We propose its use for comparing signal detection methods and systems in the pediatric population. Published by Elsevier Ltd.
Pathan, Multazim Muradkhan; Bhat, Kishore Gajanan; Joshi, Vinayak Mahableshwar
2017-01-01
Background: Several herbal mouthwashes and herbal extracts have been tested in vitro and in vivo in search of a suitable adjunct to mechanical therapy for long-term use. In this study, we aimed to examine the antimicrobial effect of a herbal mouthwash and a chlorhexidine (CHX) mouthwash on selected organisms in an in vitro test and an ex vivo model. Materials and Methods: The antimicrobial effects were determined against standard strains of bacteria that are involved in different stages of periodontal disease. The in vitro tests included determination of the minimum inhibitory concentration (MIC) using broth dilution and agar diffusion. In the ex vivo part of the study, supragingival dental plaque samples were obtained from 20 periodontally healthy adult volunteers. Descriptive analysis was done for all quantitative and qualitative variables recorded. Results: The MIC by the broth dilution method showed no statistically significant difference between the mouthwashes. The agar dilution method showed that CHX was more effective than the herbal mouthwash against standard strains of Streptococcus mutans, Streptococcus sanguinis, and Aggregatibacter actinomycetemcomitans. However, no difference was observed between the mouthwashes for Porphyromonas, Pseudomonas aeruginosa, and Fusobacterium nucleatum. The ex vivo results showed that none of the selected mouthwashes were statistically significantly different from each other. Conclusion: In the present study, CHX showed higher levels of antimicrobial action than the herbal mouthwash against the bacterial species. The results reinforce earlier findings that in vitro testing is sensitive to methods, and due diligence is needed when extrapolating the data for further use. However, long-term use and in vivo effectiveness against periopathogens need to be tested in well-planned clinical trials. PMID:29456300
Pitfalls of national routine death statistics for maternal mortality study.
Saucedo, Monica; Bouvier-Colle, Marie-Hélène; Chantry, Anne A; Lamarche-Vadel, Agathe; Rey, Grégoire; Deneux-Tharaux, Catherine
2014-11-01
The lessons learned from the study of maternal deaths depend on the accuracy of data. Our objective was to assess time trends in the underestimation of maternal mortality (MM) in the national routine death statistics in France and to evaluate their current accuracy for the selection and causes of maternal deaths. National data obtained by enhanced methods in 1989, 1999, and 2007-09 were used as the gold standard to assess time trends in the underestimation of MM ratios (MMRs) in death statistics. Enhanced data and death statistics for 2007-09 were further compared by characterising false negatives (FNs) and false positives (FPs). The distribution of cause-specific MMRs, as assessed by each system, was described. Underestimation of MM in death statistics decreased from 55.6% in 1989 to 11.4% in 2007-09 (P < 0.001). In 2007-09, of 787 pregnancy-associated deaths, 254 were classified as maternal by the enhanced system and 211 by the death statistics; 34% of maternal deaths in the enhanced system were FNs in the death statistics, and 20% of maternal deaths in the death statistics were FPs. The hierarchy of causes of MM differed between the two systems. The discordances were mainly explained by the lack of precision in the drafting of death certificates by clinicians. Although the underestimation of MM in routine death statistics has decreased substantially over time, one third of maternal deaths remain unidentified, and the main causes of death are incorrectly identified in these data. Defining relevant priorities in maternal health requires the use of enhanced methods for MM study. © 2014 John Wiley & Sons Ltd.
Dascălu, Cristina Gena; Antohe, Magda Ecaterina
2009-01-01
Based on eigenvalue and eigenvector analysis, principal component analysis has the purpose of identifying, from a set of parameters, the subspace of principal components that is sufficient to characterize the whole set. Interpreting the data for analysis as a cloud of points, we find through geometrical transformations the directions along which the cloud's dispersion is maximal: the lines that pass through the cloud's center of weight and have a maximal density of points around them (found by defining an appropriate criterion function and minimizing it). This method can be used successfully to simplify the statistical analysis of questionnaires, because it helps us select from a set of items only the most relevant ones, those that cover the variation of the whole data set. For instance, in the presented sample we started from a questionnaire with 28 items and, applying principal component analysis, we identified 7 principal components, or main items, a fact that significantly simplifies further statistical analysis of the data.
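The geometric description above, finding the direction of maximal dispersion through the cloud's center of weight, corresponds to extracting the dominant eigenvector of the covariance matrix. A minimal pure-Python sketch using power iteration (an illustrative numerical choice, not the authors' implementation):

```python
def covariance(data):
    # sample covariance matrix; rows are respondents, columns are items
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    c = [[0.0] * d for _ in range(d)]
    for row in data:
        dev = [row[j] - means[j] for j in range(d)]
        for i in range(d):
            for j in range(d):
                c[i][j] += dev[i] * dev[j] / (n - 1)
    return c

def leading_component(cov, iters=200):
    # power iteration: repeated multiplication converges to the dominant
    # eigenvector, i.e. the first principal axis of the point cloud
    d = len(cov)
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    eigval = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return eigval, v
```

Deflating the covariance matrix and repeating would yield the further components; items with large loadings on the retained components are the "most relevant ones" the abstract refers to.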
A comparison of linear and nonlinear statistical techniques in performance attribution.
Chan, N H; Genovese, C R
2001-01-01
Performance attribution is usually conducted under the linear framework of multifactor models. Although commonly used by practitioners in finance, linear multifactor models are known to be less than satisfactory in many situations. After a brief survey of nonlinear methods, nonlinear statistical techniques are applied to performance attribution of a portfolio constructed from a fixed universe of stocks, using factors derived from some commonly used cross-sectional linear multifactor models. By rebalancing this portfolio monthly, the cumulative returns for procedures based on the standard linear multifactor model and three nonlinear techniques (model selection, additive models, and neural networks) are calculated and compared. It is found that the first two nonlinear techniques, especially in combination, outperform the standard linear model. The results in the neural-network case are inconclusive because of the great variety of possible models. Although these methods are more complicated and may require some tuning, toolboxes are developed and suggestions on calibration are proposed. This paper demonstrates the usefulness of modern nonlinear statistical techniques in performance attribution.
Scanning probe recognition microscopy investigation of tissue scaffold properties
Fan, Yuan; Chen, Qian; Ayres, Virginia M; Baczewski, Andrew D; Udpa, Lalita; Kumar, Shiva
2007-01-01
Scanning probe recognition microscopy is a new scanning probe microscopy technique which enables selective scanning along individual nanofibers within a tissue scaffold. Statistically significant data for multiple properties can be collected by repetitively fine-scanning an identical region of interest. The results of a scanning probe recognition microscopy investigation of the surface roughness and elasticity of a series of tissue scaffolds are presented. Deconvolution and statistical methods were developed and used for data accuracy along curved nanofiber surfaces. Nanofiber features were also independently analyzed using transmission electron microscopy, with results that supported the scanning probe recognition microscopy-based analysis. PMID:18203431
Wang, Zhu; Shuangge, Ma; Wang, Ching-Yun
2017-01-01
In health services and outcome research, count outcomes are frequently encountered and often have a large proportion of zeros. The zero-inflated negative binomial (ZINB) regression model has important applications for this type of data. With many possible candidate risk factors, this paper proposes new variable selection methods for the ZINB model. We consider the maximum likelihood function plus a penalty, including the least absolute shrinkage and selection operator (LASSO), the smoothly clipped absolute deviation (SCAD) and the minimax concave penalty (MCP). An EM (expectation-maximization) algorithm is proposed for estimating the model parameters and conducting variable selection simultaneously. This algorithm consists of estimating penalized weighted negative binomial models and penalized logistic models via the coordinate descent algorithm. Furthermore, statistical properties, including standard error formulae, are provided. A simulation study shows that the new algorithm not only has more accurate, or at least comparable, estimation, but is also more robust than traditional stepwise variable selection. The proposed methods are applied to analyze the health care demand in Germany using the open-source R package mpath. PMID:26059498
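The coordinate descent updates mentioned above reduce, for a standardized univariate problem, to simple thresholding rules. The sketch below shows the LASSO soft-threshold operator and the corresponding MCP rule (unit-scaled design assumed; SCAD follows a similar three-piece form, omitted here). This illustrates the penalties generically, not the mpath implementation.

```python
def soft_threshold(z, gamma):
    # LASSO proximal operator: shrink toward zero by gamma, truncating at zero
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

def mcp_threshold(z, gamma, a=3.0):
    # minimax concave penalty: LASSO-like shrinkage near zero, but large
    # coefficients (|z| > a*gamma) are left unpenalized, reducing bias
    if abs(z) <= a * gamma:
        return soft_threshold(z, gamma) / (1 - 1 / a)
    return z
```

Within each EM iteration, updates of this form are applied coordinate by coordinate to the weighted negative binomial and logistic parts of the model.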
Selection of climate change scenario data for impact modelling.
Sloth Madsen, M; Maule, C Fox; MacKellar, N; Olesen, J E; Christensen, J Hesselbjerg
2012-01-01
Impact models investigating climate change effects on food safety often need detailed climate data. The aim of this study was to select climate change projection data for selected crop phenology and mycotoxin impact models. Using the ENSEMBLES database of climate model output, this study illustrates how the projected climate change signal of important variables such as temperature, precipitation and relative humidity depends on the choice of the climate model. Using climate change projections from at least two different climate models is recommended to account for model uncertainty. To make the climate projections suitable for impact analysis at the local scale, a weather generator approach was adopted. As the weather generator did not treat all the necessary variables, an ad-hoc statistical method was developed to synthesise realistic values of missing variables. The method is presented in this paper, applied to relative humidity, but it could be adapted to other variables if needed.
Accelerating assimilation development for new observing systems using EFSO
NASA Astrophysics Data System (ADS)
Lien, Guo-Yuan; Hotta, Daisuke; Kalnay, Eugenia; Miyoshi, Takemasa; Chen, Tse-Chun
2018-03-01
To successfully assimilate data from a new observing system, it is necessary to develop appropriate data selection strategies, assimilating only the generally useful data. This development work is usually done by trial and error using observing system experiments (OSEs), which are very time and resource consuming. This study proposes a new, efficient methodology to accelerate the development using ensemble forecast sensitivity to observations (EFSO). First, non-cycled assimilation of the new observation data is conducted to compute EFSO diagnostics for each observation within a large sample. Second, the average EFSO conditionally sampled in terms of various factors is computed. Third, potential data selection criteria are designed based on the non-cycled EFSO statistics, and tested in cycled OSEs to verify the actual assimilation impact. The usefulness of this method is demonstrated with the assimilation of satellite precipitation data. It is shown that the EFSO-based method can efficiently suggest data selection criteria that significantly improve the assimilation results.
NASA Astrophysics Data System (ADS)
Zink, Frank Edward
The detection and classification of pulmonary nodules is of great interest in chest radiography. Nodules are often indicative of primary cancer, and their detection is particularly important in asymptomatic patients. The ability to classify nodules as calcified or non-calcified is important because calcification is a positive indicator that the nodule is benign. Dual-energy methods offer the potential to improve both the detection and classification of nodules by allowing the formation of material-selective images. Tissue-selective images can improve detection by virtue of the elimination of obscuring rib structure. Bone-selective images are essentially calcium images, allowing classification of the nodule. A dual-energy technique is introduced which uses a computed radiography system to acquire dual-energy chest radiographs in a single exposure. All aspects of the dual-energy technique are described, with particular emphasis on the scatter-correction, beam-hardening correction, and noise-reduction algorithms. The adaptive noise-reduction algorithm employed improves the material-selective signal-to-noise ratio by up to a factor of seven with minimal sacrifice in selectivity. A clinical comparison study is described, undertaken to compare the dual-energy technique to conventional chest radiography for the tasks of nodule detection and classification. Observer performance data were collected using the Free Response Observer Characteristic (FROC) method and the bi-normal Alternative FROC (AFROC) performance model. Results of the comparison study, analyzed using two common multiple-observer statistical models, showed that the dual-energy technique was superior to conventional chest radiography for detection of nodules at a statistically significant level (p < .05).
Discussion of the comparison study emphasizes the unique combination of data collection and analysis techniques employed, as well as the limitations of comparison techniques in the larger context of technology assessment.
Optimal experimental designs for fMRI when the model matrix is uncertain.
Kao, Ming-Hung; Zhou, Lin
2017-07-15
This study concerns optimal designs for functional magnetic resonance imaging (fMRI) experiments when the model matrix of the statistical model depends on both the selected stimulus sequence (fMRI design) and the subject's uncertain feedback (e.g. answer) to each mental stimulus (e.g. question) presented to her/him. While practically important, this design issue is challenging, mainly because the information matrix cannot be fully determined at the design stage, making it difficult to evaluate the quality of the selected designs. To tackle this challenging issue, we propose an easy-to-use optimality criterion for evaluating the quality of designs, and an efficient approach for obtaining designs optimizing this criterion. Compared with a previously proposed method, our approach requires much less computing time to achieve designs with high statistical efficiencies. Copyright © 2017 Elsevier Inc. All rights reserved.
Hafen, G M; Hurst, C; Yearwood, J; Smith, J; Dzalilov, Z; Robinson, P J
2008-10-05
Cystic fibrosis is the most common fatal genetic disorder in the Caucasian population. Scoring systems for assessment of cystic fibrosis disease severity have been used for almost 50 years, without being adapted to the milder phenotype of the disease in the 21st century. The aim of this current project is to develop a new scoring system using a database and employing various statistical tools. This study protocol reports the development of the statistical tools in order to create such a scoring system. The evaluation is based on the cystic fibrosis database from the cohort at the Royal Children's Hospital in Melbourne. Initially, unsupervised clustering of all data records was performed using a range of clustering algorithms, in particular incremental clustering algorithms. The clusters obtained were characterised using rules from decision trees and the results examined by clinicians. In order to obtain a clearer definition of classes, expert opinion of each individual's clinical severity was sought. After data preparation, including expert opinion of each individual's clinical severity on a 3-point scale (mild, moderate and severe disease), two multivariate techniques were used throughout the analysis to establish a method that would have better success in feature selection and model derivation: 'Canonical Analysis of Principal Coordinates' (CAP) and 'Linear Discriminant Analysis' (DA). A 3-step procedure was performed: (1) selection of features, (2) extraction of 5 severity classes from the 3 classes defined by expert opinion and (3) establishment of calibration datasets.
(1) Feature selection: CAP has a more effective "modelling" focus than DA. (2) Extraction of 5 severity classes: after variables were identified as important in discriminating contiguous CF severity groups on the 3-point scale (mild/moderate and moderate/severe), a Discriminant Function (DF) was used to determine the new groups mild, intermediate moderate, moderate, intermediate severe and severe disease. (3) The generated confusion tables showed a misclassification rate of 19.1% for males and 16.5% for females, with a majority of misallocations into adjacent severity classes, particularly for males. Our preliminary data show that using CAP for feature selection and linear DA to derive the actual model in a CF database might be helpful in developing a scoring system. However, there are several limitations; in particular, more data entry points are needed to finalize a score, and the statistical tools have to be further refined and validated by re-running the statistical methods on the larger dataset.
New U.S. Geological Survey Method for the Assessment of Reserve Growth
Klett, Timothy R.; Attanasi, E.D.; Charpentier, Ronald R.; Cook, Troy A.; Freeman, P.A.; Gautier, Donald L.; Le, Phuong A.; Ryder, Robert T.; Schenk, Christopher J.; Tennyson, Marilyn E.; Verma, Mahendra K.
2011-01-01
Reserve growth is defined as the estimated increases in quantities of crude oil, natural gas, and natural gas liquids that have the potential to be added to remaining reserves in discovered accumulations through extension, revision, improved recovery efficiency, and additions of new pools or reservoirs. A new U.S. Geological Survey method was developed to assess the reserve-growth potential of technically recoverable crude oil and natural gas to be added to reserves under proven technology currently in practice within the trend or play, or which reasonably can be extrapolated from geologically similar trends or plays. This method currently is in use to assess potential additions to reserves in discovered fields of the United States. The new approach involves (1) individual analysis of selected large accumulations that contribute most to reserve growth, and (2) conventional statistical modeling of reserve growth in remaining accumulations. This report will focus on the individual accumulation analysis. In the past, the U.S. Geological Survey estimated reserve growth by statistical methods using historical recoverable-quantity data. Those statistical methods were based on growth rates averaged by the number of years since accumulation discovery. Accumulations in mature petroleum provinces with volumetrically significant reserve growth, however, bias statistical models of the data; therefore, accumulations with significant reserve growth are best analyzed separately from those with less significant reserve growth. Large (greater than 500 million barrels) and older (with respect to year of discovery) oil accumulations increase in size at greater rates late in their development history in contrast to more recently discovered accumulations that achieve most growth early in their development history. Such differences greatly affect the statistical methods commonly used to forecast reserve growth. 
The individual accumulation-analysis method involves estimating the in-place petroleum quantity and its uncertainty, as well as the estimated (forecasted) recoverability and its respective uncertainty. These variables are assigned probabilistic distributions and are combined statistically to provide probabilistic estimates of ultimate recoverable quantities. Cumulative production and remaining reserves are then subtracted from the estimated ultimate recoverable quantities to provide potential reserve growth. In practice, results of the two methods are aggregated to various scales, the highest of which includes an entire country or the world total. The aggregated results are reported along with the statistically appropriate uncertainties.
A sampling and classification item selection approach with content balancing.
Chen, Pei-Hua
2015-03-01
Existing automated test assembly methods typically employ constrained combinatorial optimization. Constructing forms sequentially based on an optimization approach usually results in unparallel forms and requires heuristic modifications. Methods based on a random search approach have the major advantage of producing parallel forms sequentially without further adjustment. This study incorporated a flexible content-balancing element into the statistical perspective item selection method of the cell-only method (Chen et al. in Educational and Psychological Measurement, 72(6), 933-953, 2012). The new method was compared with a sequential interitem distance weighted deviation model (IID WDM) (Swanson & Stocking in Applied Psychological Measurement, 17(2), 151-166, 1993), a simultaneous IID WDM, and a big-shadow-test mixed integer programming (BST MIP) method to construct multiple parallel forms based on matching a reference form item-by-item. The results showed that the cell-only method with content balancing and the sequential and simultaneous versions of IID WDM yielded results comparable to those obtained using the BST MIP method. The cell-only method with content balancing is computationally less intensive than the sequential and simultaneous versions of IID WDM.
Identifying Loci Under Selection Against Gene Flow in Isolation-with-Migration Models
Sousa, Vitor C.; Carneiro, Miguel; Ferrand, Nuno; Hey, Jody
2013-01-01
When divergence occurs in the presence of gene flow, there can arise an interesting dynamic in which selection against gene flow, at sites associated with population-specific adaptations or genetic incompatibilities, can cause net gene flow to vary across the genome. Loci linked to sites under selection may experience reduced gene flow and may experience genetic bottlenecks by the action of nearby selective sweeps. Data from histories such as these may be poorly fitted by conventional neutral model approaches to demographic inference, which treat all loci as equally subject to forces of genetic drift and gene flow. To allow for demographic inference in the face of such histories, as well as the identification of loci affected by selection, we developed an isolation-with-migration model that explicitly provides for variation among genomic regions in migration rates and/or rates of genetic drift. The method allows for loci to fall into any of multiple groups, each characterized by a different set of parameters, thus relaxing the assumption that all loci share the same demography. By grouping loci, the method can be applied to data with multiple loci and still have tractable dimensionality and statistical power. We studied the performance of the method using simulated data, and we applied the method to study the divergence of two subspecies of European rabbits (Oryctolagus cuniculus). PMID:23457232
NASA Astrophysics Data System (ADS)
Li, Li-Na; Ma, Chang-Ming; Chang, Ming; Zhang, Ren-Cheng
2017-12-01
A novel method based on SIMPLe-to-use Interactive Self-modeling Mixture Analysis (SIMPLISMA) and Kernel Partial Least Squares (KPLS), named SIMPLISMA-KPLS, is proposed in this paper for simultaneous selection of outlier samples and informative samples. It is a quick algorithm for model standardization (also known as model transfer) in near infrared (NIR) spectroscopy. NIR data from a corn experiment for analysis of protein content are used to evaluate the proposed method. Piecewise direct standardization (PDS) is employed in model transfer, and SIMPLISMA-PDS-KPLS and KS-PDS-KPLS are compared in terms of protein-content prediction accuracy and the calculation speed of each algorithm. The conclusions are that SIMPLISMA-KPLS can be utilized as an alternative sample selection method for model transfer. Although it has accuracy similar to Kennard-Stone (KS), it differs from KS in that it employs concentration information in the selection program. This ensures that analyte information is involved in the analysis and that the spectra (X) of the selected samples are correlated with the concentrations (y). It can also be used for simultaneous elimination of outlier samples through validation of the calibration. The running-time statistics make clear that the sample selection process is more rapid when using KPLS. The quick SIMPLISMA-KPLS algorithm is beneficial for improving the speed of online measurement using NIR spectroscopy.
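For context, the Kennard-Stone (KS) reference method used in the comparison is a max-min distance selection over the calibration samples. A minimal numpy sketch (illustrative only; the data and function name are made up, and this is not the paper's SIMPLISMA-KPLS code):

```python
import numpy as np

def kennard_stone(X, k):
    """Select k samples by the classic KS max-min Euclidean distance rule."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # start with the two mutually farthest samples
    i, j = np.unravel_index(np.argmax(d), d.shape)
    selected = [i, j]
    while len(selected) < k:
        remaining = [r for r in range(len(X)) if r not in selected]
        # add the sample farthest from its nearest already-selected neighbor
        nxt = max(remaining, key=lambda r: d[r, selected].min())
        selected.append(nxt)
    return selected

X = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [5.0, 5.0], [1.0, 1.0]])
sel = kennard_stone(X, 3)
```

KS spreads the selected samples over spectral (X) space only; the point of SIMPLISMA-KPLS is that concentration (y) information also enters the selection.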
Rivoirard, Romain; Duplay, Vianney; Oriol, Mathieu; Tinquaut, Fabien; Chauvin, Franck; Magne, Nicolas; Bourmaud, Aurelie
2016-01-01
Quality of reporting for Randomized Clinical Trials (RCTs) in oncology has been analyzed in several systematic reviews, but there is a paucity of data on outcome definitions and the consistency of reporting of statistical tests in RCTs and Observational Studies (OBS). The objective of this review was to describe these two reporting aspects for OBS and RCTs in oncology. From a list of 19 medical journals, three were retained for analysis after a random selection: British Medical Journal (BMJ), Annals of Oncology (AoO) and British Journal of Cancer (BJC). All original articles published between March 2009 and March 2014 were screened. Only studies whose main outcome was accompanied by a corresponding statistical test were included in the analysis. Studies based on censored data were excluded. The primary outcome was to assess the quality of reporting of the primary outcome measure in RCTs and of the variables of interest in OBS. A logistic regression was performed to identify study covariates potentially associated with concordance of tests between the Methods and Results sections. 826 studies were included in the review, of which 698 were OBS. Variables were described in the Methods section for all OBS, and the primary endpoint was clearly detailed in the Methods section for 109 RCTs (85.2%). 295 OBS (42.2%) and 43 RCTs (33.6%) had perfect agreement of the reported statistical test between the Methods and Results sections. In multivariable analysis, the variable "number of included patients in study" was associated with test consistency: the adjusted Odds Ratio (aOR) for the third group compared with the first was aOR Grp3 = 0.52 [0.31-0.89] (P value = 0.009). Variables in OBS and primary endpoints in RCTs are reported and described with high frequency. However, consistency of statistical tests between the Methods and Results sections of OBS is not always observed. Therefore, we encourage authors and peer reviewers to verify the consistency of statistical tests in oncology studies.
Paroxysmal atrial fibrillation prediction method with shorter HRV sequences.
Boon, K H; Khalil-Hani, M; Malarvili, M B; Sia, C W
2016-10-01
This paper proposes a method that predicts the onset of paroxysmal atrial fibrillation (PAF), using heart rate variability (HRV) segments that are shorter than those applied in existing methods, while maintaining good prediction accuracy. PAF is a common cardiac arrhythmia that increases the health risk of a patient, and the development of an accurate predictor of the onset of PAF is clinically important because it increases the possibility of electrically stabilizing the heart and preventing the onset of atrial arrhythmias with different pacing techniques. We investigate the effect of HRV features extracted from different lengths of HRV segments prior to PAF onset with the proposed PAF prediction method. The pre-processing stage of the predictor includes QRS detection, HRV quantification and ectopic beat correction. Time-domain, frequency-domain, non-linear and bispectrum features are then extracted from the quantified HRV. In the feature selection, the HRV feature set and classifier parameters are optimized simultaneously using an optimization procedure based on a genetic algorithm (GA). Both the full feature set and a statistically significant feature subset are optimized by the GA. For the statistically significant feature subset, the Mann-Whitney U test is used to filter out features that do not pass the test at the 20% significance level. The final stage of our predictor is the classifier, which is based on a support vector machine (SVM). A 10-fold cross-validation is applied in performance evaluation, and the proposed method achieves 79.3% prediction accuracy using 15-minute HRV segments. This accuracy is comparable to that achieved by existing methods that use 30-minute HRV segments, most of which achieve accuracy of around 80%. More importantly, our method significantly outperforms those that applied segments shorter than 30 minutes. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
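The 20%-level statistical screen can be sketched with a normal-approximation Mann-Whitney U test on synthetic feature vectors (an illustrative reconstruction only; the authors' actual features, data and pipeline are not reproduced, and the sample sizes here are made up):

```python
import math
import numpy as np

def mann_whitney_p(x, y):
    """Two-sided Mann-Whitney U p-value (normal approximation, no ties)."""
    nx, ny = len(x), len(y)
    # ranks 1..nx+ny of the pooled sample
    ranks = np.argsort(np.argsort(np.concatenate([x, y]))) + 1
    u = ranks[:nx].sum() - nx * (nx + 1) / 2          # U statistic for x
    mu = nx * ny / 2
    sigma = math.sqrt(nx * ny * (nx + ny + 1) / 12)
    z = abs(u - mu) / sigma
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

rng = np.random.default_rng(1)
n_features = 10
onset = rng.normal(size=(40, n_features))             # pre-onset segments (synthetic)
control = rng.normal(size=(40, n_features))           # normal segments (synthetic)
onset[:, 0] += 1.5                                    # one genuinely informative feature
keep = [j for j in range(n_features)
        if mann_whitney_p(onset[:, j], control[:, j]) < 0.20]
```

Filtering at such a generous level deliberately keeps borderline features; the GA and SVM stages then perform the final selection.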
A general scientific information system to support the study of climate-related data
NASA Technical Reports Server (NTRS)
Treinish, L. A.
1984-01-01
The development and use of NASA's Pilot Climate Data System (PCDS) are discussed. The PCDS is used as a focal point for managing and providing access to a large collection of actively used data for the Earth, ocean and atmospheric sciences. The PCDS provides uniform data catalogs, inventories, and access methods for selected NASA and non-NASA data sets. Scientific users can preview the data sets using graphical and statistical methods. The system has evolved from its original purpose as a climate database management system, created in response to a national climate program, into an extensive package of capabilities supporting many types of data sets from both spaceborne and surface-based measurements, with flexible data selection and analysis functions.
Regression Model Term Selection for the Analysis of Strain-Gage Balance Calibration Data
NASA Technical Reports Server (NTRS)
Ulbrich, Norbert Manfred; Volden, Thomas R.
2010-01-01
The paper discusses the selection of regression model terms for the analysis of wind tunnel strain-gage balance calibration data. Different function class combinations are presented that may be used to analyze calibration data using either a non-iterative or an iterative method. The role of the intercept term in a regression model of calibration data is reviewed. In addition, useful algorithms and metrics originating from linear algebra and statistics are recommended that will help an analyst (i) to identify and avoid both linear and near-linear dependencies between regression model terms and (ii) to make sure that the selected regression model of the calibration data uses only statistically significant terms. Three different tests are suggested that may be used to objectively assess the predictive capability of the final regression model of the calibration data. These tests use both the original data points and regression model independent confirmation points. Finally, data from a simplified manual calibration of the Ames MK40 balance is used to illustrate the application of some of the metrics and tests to a realistic calibration data set.
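One common linear-algebra metric for the near-linear dependencies mentioned above is the variance inflation factor (VIF) of each regression term. A minimal numpy sketch (illustrative only: this is not the balance-calibration software itself, and the data are synthetic):

```python
import numpy as np

def vif(X):
    """VIF per column: 1/(1 - R^2) from regressing it on the other columns."""
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(X)), others])   # include an intercept
        coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ coef
        r2 = 1 - resid.var() / X[:, j].var()
        out.append(1.0 / max(1 - r2, 1e-12))
    return np.array(out)

rng = np.random.default_rng(2)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
x3 = x1 + 0.01 * rng.normal(size=50)    # nearly a copy of x1
V = vif(np.column_stack([x1, x2, x3]))
```

A large VIF (often >10) flags a term that is almost a linear combination of the other terms and is therefore a candidate for removal from the regression model.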
c-Myc oncogene expression in selected odontogenic cysts and tumors: An immunohistochemical study
Moosvi, Zama; Rekha, K
2013-01-01
Aim: To investigate the role of c-Myc oncogene in selected odontogenic cysts and tumors. Materials and Methods: Ten cases each of ameloblastoma, adenomatoid odontogenic tumor (AOT), odontogenic keratocyst (OKC), dentigerous cyst, and radicular cyst were selected and primary monoclonal mouse anti-human c-Myc antibody was used at a dilution of 1:50. Statistical analysis was performed using the Mann-Whitney U test. Results: 80% positivity was observed in ameloblastoma, AOT and OKC; 50% positivity in radicular cyst and 20% positivity in dentigerous cyst. Comparison of c-Myc expression between ameloblastoma and AOT did not reveal significant results. Similarly, no statistical significance was observed when results of OKC were compared with ameloblastoma and AOT. In contrast, significant differences were seen on comparison of dentigerous cyst with ameloblastoma and AOT and radicular cyst with AOT. Conclusion: From the above data we conclude that (1) Ameloblastoma and AOT have similar proliferative potential and their biologic behavior cannot possibly be attributed to it. (2) OKC has an intrinsic growth potential which is absent in other cysts and reinforces its classification as keratocystic odontogenic tumor. PMID:23798830
Methodological development for selection of significant predictors explaining fatal road accidents.
Dadashova, Bahar; Arenas-Ramírez, Blanca; Mira-McWilliams, José; Aparicio-Izquierdo, Francisco
2016-05-01
Identification of the most relevant factors for explaining road accident occurrence is an important issue in road safety research, particularly for future decision-making processes in transport policy. However, model selection for this particular purpose is still an open research question. In this paper we propose a methodological development for model selection which addresses both explanatory variable selection and adequate model selection. A variable selection procedure, the TIM (two-input model) method, is carried out by combining neural network design and statistical approaches. The error structure of the fitted model is assumed to follow an autoregressive process. All models are estimated using the Markov Chain Monte Carlo method, where the model parameters are assigned non-informative prior distributions. The final model is built using the results of the variable selection. For the application of the proposed methodology, the number of fatal accidents in Spain during 2000-2011 was used. This indicator has experienced the largest reduction internationally during the indicated years, making it an interesting time series from a road safety policy perspective. Hence the identification of the variables that have affected this reduction is of particular interest for future decision making. The results of the variable selection process show that the selected variables are the main subjects of road safety policy measures. Published by Elsevier Ltd.
NASA Astrophysics Data System (ADS)
Rubin, D.; Aldering, G.; Barbary, K.; Boone, K.; Chappell, G.; Currie, M.; Deustua, S.; Fagrelius, P.; Fruchter, A.; Hayden, B.; Lidman, C.; Nordin, J.; Perlmutter, S.; Saunders, C.; Sofiatti, C.; Supernova Cosmology Project, The
2015-11-01
While recent supernova (SN) cosmology research has benefited from improved measurements, current analysis approaches are not statistically optimal and will prove insufficient for future surveys. This paper discusses the limitations of current SN cosmological analyses in treating outliers, selection effects, shape- and color-standardization relations, unexplained dispersion, and heterogeneous observations. We present a new Bayesian framework, called UNITY (Unified Nonlinear Inference for Type-Ia cosmologY), that incorporates significant improvements in our ability to confront these effects. We apply the framework to real SN observations and demonstrate smaller statistical and systematic uncertainties. We verify earlier results that SNe Ia require nonlinear shape and color standardizations, but we now include these nonlinear relations in a statistically well-justified way. This analysis was primarily performed blinded, in that the basic framework was first validated on simulated data before transitioning to real data. We also discuss possible extensions of the method.
A Comparison of Student Understanding of Seasons Using Inquiry and Didactic Teaching Methods
NASA Astrophysics Data System (ADS)
Ashcraft, Paul G.
2006-02-01
Student performance on open-ended questions concerning seasons in a university physical science content course was examined to note differences between classes that experienced inquiry using a 5-E lesson planning model and those that experienced the same content with a traditional, didactic lesson. The class examined is a required content course for elementary education majors and understanding the seasons is part of the university's state's elementary science standards. The two self-selected groups of students showed no statistically significant differences in pre-test scores, while there were statistically significant differences between the groups' post-test scores with those who participated in inquiry-based activities scoring higher. There were no statistically significant differences between the pre-test and the post-test for the students who experienced didactic teaching, while there were statistically significant improvements for the students who experienced the 5-E lesson.
Tips and Tricks for Successful Application of Statistical Methods to Biological Data.
Schlenker, Evelyn
2016-01-01
This chapter discusses experimental design and the use of statistics to describe characteristics of data (descriptive statistics) and inferential statistics that test the hypothesis posed by the investigator. Inferential statistics, based on probability distributions, depend upon the type and distribution of the data. For data that are continuous, randomly and independently selected, and normally distributed, more powerful parametric tests such as Student's t test and analysis of variance (ANOVA) can be used. For non-normally distributed or skewed data, transformation of the data (using logarithms) may normalize them, allowing the use of parametric tests. Alternatively, with skewed data, nonparametric tests can be utilized, some of which rely on data that are ranked prior to statistical analysis. Experimental designs and analyses need to balance between committing type 1 errors (false positives) and type 2 errors (false negatives). For a variety of clinical studies that determine risk or benefit, relative risk ratios (random clinical trials and cohort studies) or odds ratios (case-control studies) are utilized. Although both use 2 × 2 tables, their premise and calculations differ. Finally, special statistical methods are applied to microarray and proteomics data, since the large number of genes or proteins evaluated increases the likelihood of false discoveries. Additional studies in separate samples are used to verify microarray and proteomic data. Examples in this chapter and references are available to help continued investigation of experimental designs and appropriate data analysis.
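The log-transformation advice can be illustrated numerically: a right-skewed (lognormal) sample becomes roughly symmetric after taking logarithms, making parametric tests more defensible. A minimal numpy sketch (synthetic data; the skewness thresholds are illustrative, not prescriptive):

```python
import numpy as np

def skewness(x):
    """Sample skewness: mean of the cubed standardized values."""
    z = (x - x.mean()) / x.std()
    return (z ** 3).mean()

rng = np.random.default_rng(3)
raw = rng.lognormal(mean=0.0, sigma=1.0, size=2000)   # strongly right-skewed
logged = np.log(raw)                                  # ~ normal(0, 1)
```

On the raw scale the skewness is large and positive; on the log scale it is close to zero, so a t test or ANOVA on the transformed data is far more reasonable.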
NASA Technical Reports Server (NTRS)
Hoffman, R. N.; Leidner, S. M.; Henderson, J. M.; Atlas, R.; Ardizzone, J. V.; Bloom, S. C.; Atlas, Robert (Technical Monitor)
2001-01-01
In this study, we apply a two-dimensional variational analysis method (2d-VAR) to select a wind solution from NASA Scatterometer (NSCAT) ambiguous winds. 2d-VAR determines a "best" gridded surface wind analysis by minimizing a cost function. The cost function measures the misfit to the observations, the background, and the filtering and dynamical constraints. The ambiguity closest in direction to the minimizing analysis is selected. The 2d-VAR method, its sensitivity, and its numerical behavior are described. 2d-VAR is compared to statistical interpolation (OI) by examining the response of both systems to a single ship observation and to a swath of unique scatterometer winds. 2d-VAR is used with both NSCAT ambiguities and NSCAT backscatter values. Results are roughly comparable. When the background field is poor, 2d-VAR ambiguity removal often selects low-probability ambiguities. To avoid this behavior, an initial 2d-VAR analysis, using only the two most likely ambiguities, provides the first guess for an analysis using all the ambiguities or the backscatter data. 2d-VAR and median-filter selected ambiguities usually agree. Both methods require horizontal consistency, so disagreements occur in clumps, or as linear features. In these cases, 2d-VAR ambiguities are often more meteorologically reasonable and more consistent with satellite imagery.
Selected 1966-69 interior Alaska wildfire statistics with long-term comparisons.
Richard J. Barney
1971-01-01
This paper presents selected interior Alaska forest and range wildfire statistics for the period 1966-69. Comparisons are made with the decade 1956-65 and the 30-year period 1940-69, which are essentially the total recorded statistical history on wildfires available for Alaska.
NASA Astrophysics Data System (ADS)
Mehrvand, Masoud; Baghanam, Aida Hosseini; Razzaghzadeh, Zahra; Nourani, Vahid
2017-04-01
Since statistical downscaling methods are the most widely used approach in hydrologic impact studies under climate change scenarios, nonlinear regression models known as Artificial Intelligence (AI)-based models, such as Artificial Neural Networks (ANN) and Support Vector Machines (SVM), have been used to spatially downscale the precipitation outputs of Global Climate Models (GCMs). The study has been carried out using GCM and station data over GCM grid points located around the Peace-Tampa Bay watershed weather stations. Before downscaling with the AI-based model, correlation coefficients were computed between a few selected large-scale predictor variables and the local-scale predictands to select the most effective predictors. The selected predictors were then assessed considering the grid location for the site in question. To increase the accuracy of the AI-based downscaling model, pre-processing was applied to the precipitation time series: the precipitation data derived from the various GCMs were analyzed thoroughly to find the highest correlation coefficient between GCM-based historical data and station precipitation data. Both GCM and station precipitation time series were assessed by comparing means and variances over specific intervals. Results indicated that there is a similar trend between GCM and station precipitation data; however, the station data form a non-stationary time series while the GCM data do not. Finally, the AI-based downscaling models were applied to several GCMs with the selected predictors, targeting the local precipitation time series as predictand. The results of this step were used to produce multiple ensembles of downscaled AI-based models.
Deng, Bai-chuan; Yun, Yong-huan; Liang, Yi-zeng; Yi, Lun-zhao
2014-10-07
In this study, a new optimization algorithm called the Variable Iterative Space Shrinkage Approach (VISSA) that is based on the idea of model population analysis (MPA) is proposed for variable selection. Unlike most of the existing optimization methods for variable selection, VISSA statistically evaluates the performance of variable space in each step of optimization. Weighted binary matrix sampling (WBMS) is proposed to generate sub-models that span the variable subspace. Two rules are highlighted during the optimization procedure. First, the variable space shrinks in each step. Second, the new variable space outperforms the previous one. The second rule, which is rarely satisfied in most of the existing methods, is the core of the VISSA strategy. Compared with some promising variable selection methods such as competitive adaptive reweighted sampling (CARS), Monte Carlo uninformative variable elimination (MCUVE) and iteratively retaining informative variables (IRIV), VISSA showed better prediction ability for the calibration of NIR data. In addition, VISSA is user-friendly; only a few insensitive parameters are needed, and the program terminates automatically without any additional conditions. The Matlab codes for implementing VISSA are freely available on the website: https://sourceforge.net/projects/multivariateanalysis/files/VISSA/.
Navarro, Pedro J; Fernández-Isla, Carlos; Alcover, Pedro María; Suardíaz, Juan
2016-07-27
This paper presents a robust method for defect detection in textures: entropy-based automatic selection of the wavelet decomposition level (EADL), based on a wavelet reconstruction scheme, for detecting defects in a wide variety of structural and statistical textures. Two main features are presented. The first is an original use of the normalized absolute function value (NABS), calculated from the wavelet coefficients derived at various decomposition levels, to identify textures where the defect can be isolated by eliminating the texture pattern at the first decomposition level. The second is the use of Shannon's entropy, calculated over the detail subimages, for automatic selection of the band for image reconstruction; unlike other techniques, such as those based on the co-occurrence matrix or on energy calculation, this provides a lower decomposition level, avoiding excessive degradation of the image and allowing a more accurate defect segmentation. A metric analysis of the results of the proposed method with nine different thresholding algorithms determined that selecting the appropriate thresholding method is important for achieving optimum performance in defect detection. As a consequence, several different thresholding algorithms are proposed, depending on the type of texture.
Laharz_py: GIS tools for automated mapping of lahar inundation hazard zones
Schilling, Steve P.
2014-01-01
Laharz_py is written in the Python programming language as a suite of tools for use in the ArcMap Geographic Information System (GIS). Primarily, Laharz_py is a computational model that uses statistical descriptions of areas inundated by past mass-flow events to forecast areas likely to be inundated by hypothetical future events. The forecasts use physically motivated and statistically calibrated power-law equations, each of the form A = cV^(2/3), relating mass-flow volume (V) to the cross-sectional or planimetric area (A or B, respectively) inundated by an average flow as it descends a given drainage. Calibration of the equations utilizes logarithmic transformation and linear regression to determine the best-fit values of c. The software uses values of V, an algorithm for identifying mass-flow source locations, and digital elevation models of topography to portray forecast hazard zones for lahars, debris flows, or rock avalanches on maps. Laharz_py offers two methods to construct areas of potential inundation for lahars: (1) selection of a range of plausible V values results in a set of nested hazard zones showing areas likely to be inundated by a range of hypothetical flows; and (2) the user selects a single volume and a confidence interval for the prediction. In either case, Laharz_py calculates the mean expected A and B values from each user-selected value of V. However, for the second case, a single value of V yields two additional results representing the upper and lower values of the confidence interval of prediction. Calculation of these two bounding predictions requires the statistically calibrated prediction equations, a user-specified level of confidence, and t-distribution statistics to calculate the standard error of regression, standard error of the mean, and standard error of prediction. The portrayal of results from these two methods on maps compares the range of inundation areas due to prediction uncertainties with uncertainties in selection of V values.
The Open-File Report document explains how to install and use the software, and describes how to use all of the Laharz_py tools with an included example data set for Mount Rainier, Washington.
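The power-law relation and its calibration described in the preceding abstract can be sketched as follows; a minimal illustration of fitting c in A = c·V^(2/3) by logarithmic transformation and averaging of residuals, not the actual Laharz_py code (the example values below are hypothetical):

```python
import math

EXPONENT = 2.0 / 3.0  # fixed, physically motivated exponent in A = c * V**(2/3)

def calibrate_c(volumes, areas):
    """Best-fit c from past events: in log space the exponent is fixed,
    so log10(c) is the mean of log10(A) - (2/3)*log10(V)."""
    resid = [math.log10(a) - EXPONENT * math.log10(v)
             for v, a in zip(volumes, areas)]
    return 10 ** (sum(resid) / len(resid))

def predict_area(c, volume):
    """Mean expected inundated area for a hypothetical flow volume."""
    return c * volume ** EXPONENT
```

Laharz_py additionally brackets this mean prediction with a confidence interval built from t-distribution statistics, which is not reproduced here.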
McKinney, Brett A.; White, Bill C.; Grill, Diane E.; Li, Peter W.; Kennedy, Richard B.; Poland, Gregory A.; Oberg, Ann L.
2013-01-01
Relief-F is a nonparametric, nearest-neighbor machine learning method that has been successfully used to identify relevant variables that may interact in complex multivariate models to explain phenotypic variation. While several tools have been developed for assessing differential expression in sequence-based transcriptomics, the detection of statistical interactions between transcripts has received less attention in the area of RNA-seq analysis. We describe a new extension and assessment of Relief-F for feature selection in RNA-seq data. The ReliefSeq implementation adapts the number of nearest neighbors (k) for each gene to optimize the Relief-F test statistics (importance scores) for finding both main effects and interactions. We compare this gene-wise adaptive-k (gwak) Relief-F method with standard RNA-seq feature selection tools, such as DESeq and edgeR, and with the popular machine learning method Random Forests. We demonstrate performance on a panel of simulated data that have a range of distributional properties reflected in real mRNA-seq data including multiple transcripts with varying sizes of main effects and interaction effects. For simulated main effects, gwak-Relief-F feature selection performs comparably to standard tools DESeq and edgeR for ranking relevant transcripts. For gene-gene interactions, gwak-Relief-F outperforms all comparison methods at ranking relevant genes in all but the highest fold change/highest signal situations where it performs similarly. The gwak-Relief-F algorithm outperforms Random Forests for detecting relevant genes in all simulation experiments. In addition, Relief-F is comparable to the other methods based on computational time. We also apply ReliefSeq to an RNA-Seq study of smallpox vaccine to identify gene expression changes between vaccinia virus-stimulated and unstimulated samples. 
ReliefSeq is an attractive tool for inclusion in the suite of tools used for analysis of mRNA-Seq data; it has power to detect both main effects and interaction effects. Software Availability: http://insilico.utulsa.edu/ReliefSeq.php. PMID:24339943
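The core Relief-F weight update underlying ReliefSeq can be sketched as follows; a simplified two-class version with a fixed k rather than the gene-wise adaptive k the tool implements (an illustration, not the ReliefSeq code itself):

```python
import math

def relief_f(X, y, k=3):
    """Basic two-class Relief-F importance scores for continuous features.

    Features that differ between a sample and its nearest misses gain
    weight; features that differ from its nearest hits lose weight.
    """
    n, p = len(X), len(X[0])
    # per-feature value ranges used to normalise the diff() terms
    cols = [[row[j] for row in X] for j in range(p)]
    rng = [(max(c) - min(c)) or 1.0 for c in cols]
    w = [0.0] * p
    for i in range(n):
        dists = sorted((math.dist(X[i], X[j]), j) for j in range(n) if j != i)
        hits = [j for _, j in dists if y[j] == y[i]][:k]      # same class
        misses = [j for _, j in dists if y[j] != y[i]][:k]    # other class
        for f in range(p):
            for j in hits:
                w[f] -= abs(X[i][f] - X[j][f]) / rng[f] / (n * k)
            for j in misses:
                w[f] += abs(X[i][f] - X[j][f]) / rng[f] / (n * k)
    return w
```

ReliefSeq's gene-wise adaptive-k variant would rerun this scoring over a grid of k values per gene and keep the k that maximises the importance score.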
Multilocus approaches for the measurement of selection on correlated genetic loci.
Gompert, Zachariah; Egan, Scott P; Barrett, Rowan D H; Feder, Jeffrey L; Nosil, Patrik
2017-01-01
The study of ecological speciation is inherently linked to the study of selection. Methods for estimating phenotypic selection within a generation based on associations between trait values and fitness (e.g. survival) of individuals are established. These methods attempt to disentangle selection acting directly on a trait from indirect selection caused by correlations with other traits via multivariate statistical approaches (i.e. inference of selection gradients). The estimation of selection on genotypic or genomic variation could also benefit from disentangling direct and indirect selection on genetic loci. However, achieving this goal is difficult with genomic data because the number of potentially correlated genetic loci (p) is very large relative to the number of individuals sampled (n). In other words, the number of model parameters exceeds the number of observations (p ≫ n). We present simulations examining the utility of whole-genome regression approaches (i.e. Bayesian sparse linear mixed models) for quantifying direct selection in cases where p ≫ n. Such models have been used for genome-wide association mapping and are common in artificial breeding. Our results show they hold promise for studies of natural selection in the wild and thus of ecological speciation. But we also demonstrate important limitations to the approach and discuss study designs required for more robust inferences. © 2016 John Wiley & Sons Ltd.
Vainik, Uku; Konstabel, Kenn; Lätt, Evelin; Mäestu, Jarek; Purge, Priit; Jürimäe, Jaak
2016-10-01
Subjective energy intake (sEI) is often misreported, providing unreliable estimates of energy consumed. Therefore, relating sEI data to health outcomes is difficult. Recently, Börnhorst et al. compared various methods to correct sEI-based energy intake estimates. They criticised approaches that categorise participants as under-reporters, plausible reporters and over-reporters based on the sEI:total energy expenditure (TEE) ratio, and thereafter use these categories as statistical covariates or exclusion criteria. Instead, they recommended using external predictors of sEI misreporting as statistical covariates. We sought to confirm and extend these findings. Using a sample of 190 adolescent boys (mean age=14), we demonstrated that dual-energy X-ray absorptiometry-measured fat-free mass is strongly associated with objective energy intake data (onsite weighted breakfast), but the association with sEI (previous 3-d dietary interview) is weak. Comparing sEI with TEE revealed that sEI was mostly under-reported (74 %). Interestingly, statistically controlling for dietary reporting groups or restricting samples to plausible reporters created a stronger-than-expected association between fat-free mass and sEI. However, the association was an artifact caused by selection bias - that is, data re-sampling and simulations showed that these methods overestimated the effect size because fat-free mass was related to sEI both directly and indirectly via TEE. A more realistic association between sEI and fat-free mass was obtained when the model included common predictors of misreporting (e.g. BMI, restraint). To conclude, restricting sEI data only to plausible reporters can cause selection bias and inflated associations in later analyses. Therefore, we further support statistically correcting sEI data in nutritional analyses. The script for running simulations is provided.
Exercise reduces depressive symptoms in adults with arthritis: Evidential value.
Kelley, George A; Kelley, Kristi S
2016-07-12
To determine whether evidential value exists that exercise reduces depression in adults with arthritis and other rheumatic conditions. Utilizing data derived from a prior meta-analysis of 29 randomized controlled trials comprising 2449 participants (1470 exercise, 979 control) with fibromyalgia, osteoarthritis, rheumatoid arthritis or systemic lupus erythematosus, a new method, P-curve, was applied to assess evidential value and to rule out selective reporting of statistically significant results regarding exercise and depression in adults with arthritis and other rheumatic conditions. Using the method of Stouffer, Z-scores were calculated to examine selective-reporting bias. An alpha (P) value < 0.05 was deemed statistically significant. In addition, the average power of the tests included in P-curve, adjusted for publication bias, was calculated. Fifteen of 29 studies (51.7%) with exercise and depression results were statistically significant (P < 0.05), while none of the results were statistically significant with respect to exercise increasing depression in adults with arthritis and other rheumatic conditions. Right-skew, dismissing selective reporting, was identified (Z = -5.28, P < 0.0001). In addition, the included studies did not lack evidential value (Z = 2.39, P = 0.99), nor did they lack evidential value and show P-hacking (Z = 5.28, P > 0.99). The relative frequencies of P-values were 66.7% at 0.01, 6.7% each at 0.02 and 0.03, 13.3% at 0.04 and 6.7% at 0.05. The average power of the tests included in P-curve, corrected for publication bias, was 69%. Diagnostic plot results revealed that the observed power estimate was a better fit than the alternatives. Evidential value results provide additional support that exercise reduces depression in adults with arthritis and other rheumatic conditions.
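The Stouffer method used above to combine evidence across studies can be sketched as follows; a generic implementation of the textbook formula, not the authors' analysis script:

```python
from statistics import NormalDist

def stouffer_z(p_values):
    """Combine one-sided p-values with Stouffer's method:
    Z = sum(z_i) / sqrt(k), where z_i is the normal quantile of 1 - p_i."""
    nd = NormalDist()
    z = [nd.inv_cdf(1.0 - p) for p in p_values]
    return sum(z) / len(z) ** 0.5

def combined_p(p_values):
    """One-sided p-value for the combined Stouffer Z."""
    return 1.0 - NormalDist().cdf(stouffer_z(p_values))
```

Four studies each at p = 0.01 combine to a Z well above conventional significance thresholds, illustrating how consistent modest evidence accumulates.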
Assessing the accuracy and stability of variable selection ...
Random forest (RF) modeling has emerged as an important statistical learning method in ecology due to its exceptional predictive performance. However, for large and complex ecological datasets there is limited guidance on variable selection methods for RF modeling. Typically, either a preselected set of predictor variables are used, or stepwise procedures are employed which iteratively add/remove variables according to their importance measures. This paper investigates the application of variable selection methods to RF models for predicting probable biological stream condition. Our motivating dataset consists of the good/poor condition of n=1365 stream survey sites from the 2008/2009 National Rivers and Stream Assessment, and a large set (p=212) of landscape features from the StreamCat dataset. Two types of RF models are compared: a full variable set model with all 212 predictors, and a reduced variable set model selected using a backwards elimination approach. We assess model accuracy using RF's internal out-of-bag estimate, and a cross-validation procedure with validation folds external to the variable selection process. We also assess the stability of the spatial predictions generated by the RF models to changes in the number of predictors, and argue that model selection needs to consider both accuracy and stability. The results suggest that RF modeling is robust to the inclusion of many variables of moderate to low importance. We found no substanti
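The backwards-elimination strategy compared in this study can be sketched model-agnostically as follows; here `score_fn` stands in for fitting an RF model and returning its out-of-bag accuracy together with variable importances (a hypothetical interface, not the authors' code):

```python
def backward_eliminate(variables, score_fn, drop_frac=0.2, min_vars=1):
    """Iteratively drop the least important variables.

    score_fn(vars) must return (accuracy, importances), where importances
    maps each variable name to its importance measure. The variable set
    with the best observed accuracy is returned.
    """
    current = list(variables)
    best_vars, best_acc = current, float("-inf")
    while len(current) >= min_vars:
        acc, imp = score_fn(current)
        if acc > best_acc:
            best_acc, best_vars = acc, list(current)
        if len(current) == min_vars:
            break
        # drop the lowest-importance fraction, at least one variable
        n_drop = max(1, int(len(current) * drop_frac))
        current = sorted(current, key=imp.get, reverse=True)[:-n_drop]
    return best_vars, best_acc
```

The paper's point about stability is that the returned set should also be checked against external cross-validation folds, since the elimination loop itself can overfit the internal accuracy estimate.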
ERIC Educational Resources Information Center
Merrill, Ray M.; Chatterley, Amanda; Shields, Eric C.
2005-01-01
This study explored the effectiveness of selected statistical measures at motivating or maintaining regular exercise among college students. The study also considered whether ease in understanding these statistical measures was associated with perceived effectiveness at motivating or maintaining regular exercise. Analyses were based on a…
Statistics Report on TEQSA Registered Higher Education Providers, 2017
ERIC Educational Resources Information Center
Australian Government Tertiary Education Quality and Standards Agency, 2017
2017-01-01
The "Statistics Report on TEQSA Registered Higher Education Providers" ("the Statistics Report") is the fourth release of selected higher education sector data held by the Australian Government Tertiary Education Quality and Standards Agency (TEQSA) for its…
Statistical approach for selection of biologically informative genes.
Das, Samarendra; Rai, Anil; Mishra, D C; Rai, Shesh N
2018-05-20
Selection of informative genes from high dimensional gene expression data has emerged as an important research area in genomics. Many gene selection techniques proposed so far are based on either a relevancy or a redundancy measure. Further, the performance of these techniques has been adjudged through post-selection classification accuracy, computed by a classifier using the selected genes. This performance metric may be statistically sound but may not be biologically relevant. A statistical approach, i.e. Boot-MRMR, was proposed based on a composite measure of maximum relevance and minimum redundancy, which is both statistically sound and biologically relevant for informative gene selection. For comparative evaluation of the proposed approach, we developed two biologically relevant criteria, i.e. Gene Set Enrichment with QTL (GSEQ) and a biological similarity score based on Gene Ontology (GO). Further, a systematic and rigorous evaluation of the proposed technique against 12 existing gene selection techniques was carried out using five gene expression datasets. This evaluation was based on a broad spectrum of statistically sound (e.g. subject classification) and biologically relevant (based on QTL and GO) criteria under a multiple criteria decision-making framework. The performance analysis showed that the proposed technique selects informative genes which are more biologically relevant. The proposed technique is also found to be quite competitive with the existing techniques with respect to subject classification and computational time. Our results also showed that under the multiple criteria decision-making setup, the proposed technique is best for informative gene selection over the available alternatives. Based on the proposed approach, an R package, i.e. BootMRMR, has been developed and is available at https://cran.r-project.org/web/packages/BootMRMR.
This study provides a practical guide to selecting statistical techniques for identifying informative genes from high dimensional expression data in breeding and systems biology studies. Published by Elsevier B.V.
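The composite maximum-relevance minimum-redundancy idea behind Boot-MRMR can be sketched as follows; a minimal greedy mRMR selector using Pearson correlation as both the relevance and the redundancy measure (an illustration only — the BootMRMR package additionally wraps this in a bootstrap procedure):

```python
import math

def _corr(x, y):
    """Pearson correlation; 0.0 when either series has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den if den else 0.0

def mrmr_select(genes, labels, n_select=2):
    """Greedy max-relevance min-redundancy gene selection.

    genes: dict of gene name -> expression vector; labels: numeric classes.
    At each step, pick the gene maximising relevance to the labels minus
    mean redundancy with genes already selected.
    """
    relevance = {g: abs(_corr(x, labels)) for g, x in genes.items()}
    selected = [max(relevance, key=relevance.get)]
    while len(selected) < n_select:
        def score(g):
            red = sum(abs(_corr(genes[g], genes[s])) for s in selected)
            return relevance[g] - red / len(selected)
        rest = [g for g in genes if g not in selected]
        selected.append(max(rest, key=score))
    return selected
```

Note how a gene that merely duplicates an already-selected gene is penalised even if its marginal relevance is high.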
Xing, Nailin; Fan, Chuchuan; Zhou, Yongming
2014-01-01
Parental selection is crucial for hybrid breeding, but the methods available for such a selection are not very effective. In this study, a 6×6 incomplete diallel cross was designed using 12 rapeseed germplasms, and a total of 36 hybrids together with their parental lines were planted in 4 environments. Four yield-related traits and seed oil content (OC) were evaluated. Genetic distance (GD) was estimated with 359 simple sequence repeat (SSR) markers. Heterosis levels, general combining ability (GCA) and specific combining ability (SCA) were evaluated. GD was found to have a significant correlation with better-parent heterosis (BPH) of thousand seed weight (TSW) and with SCA of seeds per silique (SS), TSW, and seed yield per plant (SY), while SCA showed a statistically significant correlation with heterosis levels of all traits at the 1% significance level. Statistically significant correlations were also observed between GCA of maternal or paternal parents and heterosis levels of different traits, except for SS. Interestingly, maternal (TSW, SS, and OC) and paternal (siliques per plant (SP) and SY) inheritance of traits was detected using the contribution ratio of maternal and paternal GCA variance as well as correlations between GCA and heterosis levels. Phenotype and heterosis levels of all the traits except TSW of hybrids were significantly correlated with the average performance of parents. The correlations between SS and SP, SP and OC, and SY and OC were statistically significant in hybrids but not in parents. Potential applications of parental selection in hybrid breeding are discussed.
Kholeif, S A
2001-06-01
A new method belonging to the differential category for determining end points from potentiometric titration curves is presented. It uses a preprocess to find first-derivative values by fitting four data points in and around the region of inflection to a non-linear function, and then locates the end point, usually as a maximum or minimum, using an inverse parabolic interpolation procedure that has an analytical solution. The behavior and accuracy of the sigmoid and cumulative non-linear functions used are investigated against three factors. A statistical evaluation of the new method, using linear least-squares validation and multifactor data analysis, is presented. The new method applies generally to symmetrical and unsymmetrical potentiometric titration curves, and the end point is calculated using numerical procedures only. It outperforms the "parent" regular differential method at almost all factor levels and gives accurate results comparable to the true or estimated true end points. Calculated end points from selected experimental titration curves are also compared between the new method and methods in the equivalence-point category, such as Gran or Fortuin.
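The inverse parabolic interpolation step, which locates the extremum of the fitted derivative analytically, can be sketched with the generic three-point vertex formula (the preceding non-linear fitting step is not shown):

```python
def parabola_vertex(x1, y1, x2, y2, x3, y3):
    """Abscissa of the extremum of the parabola through three points
    (inverse parabolic interpolation, as used e.g. in Brent's method)."""
    d1, d3 = x2 - x1, x2 - x3
    num = d1 ** 2 * (y2 - y3) - d3 ** 2 * (y2 - y1)
    den = d1 * (y2 - y3) - d3 * (y2 - y1)
    return x2 - 0.5 * num / den
```

Applied to three first-derivative values bracketing the inflection of a titration curve, the returned abscissa is the estimated end-point volume; the formula is exact whenever the underlying function is locally quadratic.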
Miles, Jeffrey Hilton
2011-05-01
Combustion noise from turbofan engines has become important as the noise from sources like the fan and jet is reduced. An aligned and un-aligned coherence technique has been developed to determine a threshold level for the coherence and thereby help to separate the coherent combustion noise source from other noise sources measured with far-field microphones. This method is compared with a statistics-based coherence threshold estimation method. In addition, the un-aligned coherence procedure at the same time also reveals periodicities, spectral lines, and undamped sinusoids hidden by broadband turbofan engine noise. In calculating the coherence threshold using a statistical method, one may use either the number of independent records or a larger number corresponding to the number of overlapped records used to create the average. Using data from a turbofan engine and a simulation, this paper shows that applying the Fisher z-transform to the un-aligned coherence can aid in making the proper selection of samples and produce a reasonable statistics-based coherence threshold. Examples are presented showing that the underlying tonal and coherent broadband structure buried under random broadband noise and jet noise can be determined. The method also shows the possible presence of indirect combustion noise.
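The statistics-based coherence threshold mentioned above can be sketched with the standard significance level for zero true coherence; the choice of n_segments (number of independent records versus the larger number of overlapped records) is exactly the selection issue the paper discusses:

```python
def coherence_threshold(n_segments, alpha=0.05):
    """Magnitude-squared coherence value exceeded by chance with
    probability alpha when the true coherence is zero, for an estimate
    averaged over n_segments independent records:
        threshold = 1 - alpha**(1 / (n_segments - 1))
    """
    return 1.0 - alpha ** (1.0 / (n_segments - 1))
```

Using the overlapped-record count in place of the independent-record count lowers the threshold and can falsely flag noise as coherent, which is why the paper checks the choice against the Fisher z-transformed un-aligned coherence.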
AUPress: A Comparison of an Open Access University Press with Traditional Presses
ERIC Educational Resources Information Center
McGreal, Rory; Chen, Nian-Shing
2011-01-01
This study is a comparison of AUPress with three other traditional (non-open access) Canadian university presses. The analysis is based on the rankings that are correlated with book sales on Amazon.com and Amazon.ca. Statistical methods include the sampling of the sales ranking of randomly selected books from each press. The results of one-way…
ERIC Educational Resources Information Center
Dastjerdi, Negin Barat
2016-01-01
The research aims to evaluate ICT use in the teaching-learning process for students of Isfahan elementary schools. The method of this research is descriptive-surveying. The statistical population of the study was all teachers of Isfahan elementary schools. The sample size was determined to be 350 persons, selected through cluster sampling…
ERIC Educational Resources Information Center
Whitehead, Michele L.; Gutierrez, Laura; Miller, Melody
2014-01-01
The purpose of this study is to gain an understanding of current academic medical library circulation polices and examine methods libraries utilize to meet patron needs. Key informants were selected from five states. Statistics regarding financial practices, users, services, space access, and circulation practices were collected via survey…
ERIC Educational Resources Information Center
Shahbazzadeh, Somayeh; Beliad, Mohammad Reza
2017-01-01
This study investigates the mediatory role of exercise self-regulation in the relationship between personality traits and anger management among athletes. The statistical population of this study includes all athlete students of Shar-e Ghods College, among whom 260 people were selected as a sample using the random sampling method. In addition, the…
NASA Technical Reports Server (NTRS)
Tomberlin, T. J.
1985-01-01
Research studies of residents' responses to noise consist of interviews with samples of individuals who are drawn from a number of different compact study areas. The statistical techniques developed provide a basis for the associated sample design decisions. These techniques are suitable for a wide range of sample survey applications. A sample may consist of a random sample of residents selected from a sample of compact study areas, or in a more complex design, of a sample of residents selected from a sample of larger areas (e.g., cities). The techniques may be applied to estimates of the effects on annoyance of noise level, numbers of noise events, the time-of-day of the events, ambient noise levels, or other factors. Methods are provided for determining, in advance, how accurately these effects can be estimated for different sample sizes and study designs. Using a simple cost function, they also provide for optimum allocation of the sample across the stages of the design for estimating these effects. These techniques are developed via a regression model in which the regression coefficients are assumed to be random, with components of variance associated with the various stages of a multi-stage sample design.
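The cost-based optimum allocation idea can be sketched with the classical two-stage sampling result; this is a simplified illustration of the trade-off, not the paper's random-coefficient regression machinery:

```python
import math

def optimal_cluster_size(cost_area, cost_resident, var_between, var_within):
    """Residents to interview per study area that minimise the variance of
    an overall mean for a fixed budget, in classical two-stage sampling:
        m* = sqrt((cost_area / cost_resident) * (var_within / var_between))
    cost_area: fixed cost of adding a study area; cost_resident: cost per
    interview; var_between / var_within: variance components between and
    within areas.
    """
    return math.sqrt((cost_area / cost_resident) * (var_within / var_between))
```

Expensive-to-add areas or strong within-area variability push toward more interviews per area; strong between-area variability pushes toward more areas with fewer interviews each.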
Apical extrusion of debris and irrigant using hand and rotary systems: A comparative study
Ghivari, Sheetal B; Kubasad, Girish C; Chandak, Manoj G; Akarte, NR
2011-01-01
Aim: To evaluate and compare the amount of debris and irrigant extruded quantitatively by using two hand and rotary nickel–titanium (Ni–Ti) instrumentation techniques. Materials and Methods: Eighty freshly extracted mandibular premolars having similar canal length and curvature were selected and mounted in a debris collection apparatus. After each instrument change, 1 ml of distilled water was used as an irrigant and the amount of irrigant extruded was measured using the Meyers and Montgomery method. After drying, the debris was weighed using an electronic microbalance to determine its weight. Statistical analysis used: The data were analyzed statistically to determine the mean difference between the groups. The mean weight of the dry debris and irrigant within the group and between the groups was calculated by the one-way ANOVA and multiple comparison (Dunnett D) test. Results: The step-back technique extruded a greater quantity of debris and irrigant in comparison to the other hand and rotary Ni–Ti systems. Conclusions: Since all instrumentation techniques extrude debris and irrigant, it is prudent on the part of the clinician to select the instrumentation technique that extrudes the least amount of debris and irrigant, to prevent flare-up phenomena. PMID:21814364
Degrees of separation as a statistical tool for evaluating candidate genes.
Nelson, Ronald M; Pettersson, Mats E
2014-12-01
Selection of candidate genes is an important step in the exploration of complex genetic architecture. The number of gene networks available is increasing and these can provide information to help with candidate gene selection. It is currently common to use the degree of connectedness in gene networks as validation in Genome Wide Association (GWA) and Quantitative Trait Locus (QTL) mapping studies. However, it can cause misleading results if not validated properly. Here we present a method and tool for validating the gene pairs from GWA studies given the context of the network they co-occur in. It ensures that proposed interactions and gene associations are not statistical artefacts inherent to the specific gene network architecture. The CandidateBacon package provides an easy and efficient method to calculate the average degree of separation (DoS) between pairs of genes to currently available gene networks. We show how these empirical estimates of average connectedness are used to validate candidate gene pairs. Validation of interacting genes by comparing their connectedness with the average connectedness in the gene network will provide support for said interactions by utilising the growing amount of gene network information available. Copyright © 2014 Elsevier Ltd. All rights reserved.
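The degree-of-separation computation at the heart of such validation can be sketched as a breadth-first search over the gene network; a minimal illustration, not the CandidateBacon package:

```python
from collections import deque

def degree_of_separation(network, a, b):
    """Shortest-path length between genes a and b in an undirected network
    given as {gene: set(neighbours)}; None if they are not connected."""
    if a == b:
        return 0
    seen, frontier, depth = {a}, deque([a]), 0
    while frontier:
        depth += 1
        for _ in range(len(frontier)):
            for nxt in network[frontier.popleft()]:
                if nxt == b:
                    return depth
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    return None  # unconnected pair

def average_dos(network, pairs):
    """Mean degree of separation over a list of candidate gene pairs."""
    d = [degree_of_separation(network, a, b) for a, b in pairs]
    return sum(d) / len(d)
```

Comparing the average degree of separation of candidate pairs against the network-wide average is what guards against connectedness "validation" that any random pair would pass in a densely wired network.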
Logistic regression applied to natural hazards: rare event logistic regression with replications
NASA Astrophysics Data System (ADS)
Guns, M.; Vanacker, V.
2012-06-01
Statistical analysis of natural hazards needs particular attention, as most of these phenomena are rare events. This study shows that the ordinary rare event logistic regression, as it is now commonly used in geomorphologic studies, does not always lead to a robust detection of controlling factors, as the results can be strongly sample-dependent. In this paper, we introduce some concepts of Monte Carlo simulations in rare event logistic regression. This technique, so-called rare event logistic regression with replications, combines the strength of probabilistic and statistical methods, and allows overcoming some of the limitations of previous developments through robust variable selection. This technique was here developed for the analyses of landslide controlling factors, but the concept is widely applicable for statistical analyses of natural hazards.
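The replication idea can be sketched as follows: repeatedly pair all events with a random subsample of non-events, refit the logistic model, and check how stable each coefficient is across replications. This is a minimal illustration with a plain gradient-ascent logistic fit, not the authors' implementation:

```python
import math
import random

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Plain per-sample gradient-ascent logistic regression.
    Returns coefficients [bias, w1, w2, ...]."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))
            w[0] += lr * (yi - p)
            for j, xj in enumerate(xi):
                w[j + 1] += lr * (yi - p) * xj
    return w

def replicated_selection(X, y, n_rep=50, seed=0):
    """Rare-event logistic regression with replications: for each
    replication, keep all events plus an equally sized random draw of
    non-events, refit, and record how often each coefficient is positive.
    Robust controlling factors keep a consistent sign across replications."""
    rng = random.Random(seed)
    events = [i for i, yi in enumerate(y) if yi == 1]
    nonevents = [i for i, yi in enumerate(y) if yi == 0]
    pos_counts = [0] * len(X[0])
    for _ in range(n_rep):
        idx = events + rng.sample(nonevents, len(events))
        w = fit_logistic([X[i] for i in idx], [y[i] for i in idx])
        for j, wj in enumerate(w[1:]):
            pos_counts[j] += wj > 0
    return [c / n_rep for c in pos_counts]
```

A factor whose sign frequency sits near 0.5 is exactly the sample-dependent result the ordinary rare-event fit can mistake for a controlling factor.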
Ladd, David E.; Law, George S.
2007-01-01
The U.S. Geological Survey (USGS) provides streamflow and other stream-related information needed to protect people and property from floods, to plan and manage water resources, and to protect water quality in the streams. Streamflow statistics provided by the USGS, such as the 100-year flood and the 7-day 10-year low flow, frequently are used by engineers, land managers, biologists, and many others to help guide decisions in their everyday work. In addition to streamflow statistics, resource managers often need to know the physical and climatic characteristics (basin characteristics) of the drainage basins for locations of interest to help them understand the mechanisms that control water availability and water quality at these locations. StreamStats is a Web-enabled geographic information system (GIS) application that makes it easy for users to obtain streamflow statistics, basin characteristics, and other information for USGS data-collection stations and for ungaged sites of interest. If a user selects the location of a data-collection station, StreamStats will provide previously published information for the station from a database. If a user selects a location where no data are available (an ungaged site), StreamStats will run a GIS program to delineate a drainage basin boundary, measure basin characteristics, and estimate streamflow statistics based on USGS streamflow prediction methods. A user can download a GIS feature class of the drainage basin boundary with attributes including the measured basin characteristics and streamflow estimates.