Sample records for statistical prediction methods

  1. Guidelines 13 and 14—Prediction uncertainty

    USGS Publications Warehouse

    Hill, Mary C.; Tiedeman, Claire

    2005-01-01

    An advantage of using optimization for model development and calibration is that optimization provides methods for evaluating and quantifying prediction uncertainty. Both deterministic and statistical methods can be used. Guideline 13 discusses using regression and post-audits, which we classify as deterministic methods. Guideline 14 discusses inferential statistics and Monte Carlo methods, which we classify as statistical methods.

  2. Testing prediction methods: Earthquake clustering versus the Poisson model

    USGS Publications Warehouse

    Michael, A.J.

    1997-01-01

    Testing earthquake prediction methods requires statistical techniques that compare observed success to random chance. One technique is to produce simulated earthquake catalogs and measure the relative success of predicting real and simulated earthquakes. The accuracy of these tests depends on the validity of the statistical model used to simulate the earthquakes. This study tests the effect of clustering in the statistical earthquake model on the results. Three simulation models were used to produce significance levels for a VLF earthquake prediction method. As the degree of simulated clustering increases, the statistical significance drops. Hence, the use of a seismicity model with insufficient clustering can lead to overly optimistic results. A successful method must pass the statistical tests with a model that fully replicates the observed clustering. However, a method can be rejected based on tests with a model that contains insufficient clustering. U.S. copyright. Published in 1997 by the American Geophysical Union.

  3. Predicting juvenile recidivism: new method, old problems.

    PubMed

    Benda, B B

    1987-01-01

    This prediction study compared three statistical procedures for accuracy using two assessment methods. The criterion is return to a juvenile prison after the first release, and the models tested are logit analysis, predictive attribute analysis, and a Burgess procedure. No significant differences are found between statistics in prediction.

  4. Analysis Code - Data Analysis in 'Leveraging Multiple Statistical Methods for Inverse Prediction in Nuclear Forensics Applications' (LMSMIPNFA) v. 1.0

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lewis, John R

    R code that performs the analysis of a data set presented in the paper ‘Leveraging Multiple Statistical Methods for Inverse Prediction in Nuclear Forensics Applications’ by Lewis, J., Zhang, A., Anderson-Cook, C. It provides functions for doing inverse predictions in this setting using several different statistical methods. The data set is a publicly available data set from a historical Plutonium production experiment.

  5. Comparing multiple statistical methods for inverse prediction in nuclear forensics applications

    DOE PAGES

    Lewis, John R.; Zhang, Adah; Anderson-Cook, Christine Michaela

    2017-10-29

    Forensic science seeks to predict source characteristics using measured observables. Statistically, this objective can be thought of as an inverse problem where interest is in the unknown source characteristics or factors ( X) of some underlying causal model producing the observables or responses (Y = g ( X) + error). Here, this paper reviews several statistical methods for use in inverse problems and demonstrates that comparing results from multiple methods can be used to assess predictive capability. Motivation for assessing inverse predictions comes from the desired application to historical and future experiments involving nuclear material production for forensics research inmore » which inverse predictions, along with an assessment of predictive capability, are desired.« less

  6. Comparing multiple statistical methods for inverse prediction in nuclear forensics applications

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lewis, John R.; Zhang, Adah; Anderson-Cook, Christine Michaela

    Forensic science seeks to predict source characteristics using measured observables. Statistically, this objective can be thought of as an inverse problem where interest is in the unknown source characteristics or factors ( X) of some underlying causal model producing the observables or responses (Y = g ( X) + error). Here, this paper reviews several statistical methods for use in inverse problems and demonstrates that comparing results from multiple methods can be used to assess predictive capability. Motivation for assessing inverse predictions comes from the desired application to historical and future experiments involving nuclear material production for forensics research inmore » which inverse predictions, along with an assessment of predictive capability, are desired.« less

  7. A Hierarchical Multivariate Bayesian Approach to Ensemble Model output Statistics in Atmospheric Prediction

    DTIC Science & Technology

    2017-09-01

    efficacy of statistical post-processing methods downstream of these dynamical model components with a hierarchical multivariate Bayesian approach to...Bayesian hierarchical modeling, Markov chain Monte Carlo methods , Metropolis algorithm, machine learning, atmospheric prediction 15. NUMBER OF PAGES...scale processes. However, this dissertation explores the efficacy of statistical post-processing methods downstream of these dynamical model components

  8. Statistical procedures for evaluating daily and monthly hydrologic model predictions

    USGS Publications Warehouse

    Coffey, M.E.; Workman, S.R.; Taraba, J.L.; Fogle, A.W.

    2004-01-01

    The overall study objective was to evaluate the applicability of different qualitative and quantitative methods for comparing daily and monthly SWAT computer model hydrologic streamflow predictions to observed data, and to recommend statistical methods for use in future model evaluations. Statistical methods were tested using daily streamflows and monthly equivalent runoff depths. The statistical techniques included linear regression, Nash-Sutcliffe efficiency, nonparametric tests, t-test, objective functions, autocorrelation, and cross-correlation. None of the methods specifically applied to the non-normal distribution and dependence between data points for the daily predicted and observed data. Of the tested methods, median objective functions, sign test, autocorrelation, and cross-correlation were most applicable for the daily data. The robust coefficient of determination (CD*) and robust modeling efficiency (EF*) objective functions were the preferred methods for daily model results due to the ease of comparing these values with a fixed ideal reference value of one. Predicted and observed monthly totals were more normally distributed, and there was less dependence between individual monthly totals than was observed for the corresponding predicted and observed daily values. More statistical methods were available for comparing SWAT model-predicted and observed monthly totals. The 1995 monthly SWAT model predictions and observed data had a regression Rr2 of 0.70, a Nash-Sutcliffe efficiency of 0.41, and the t-test failed to reject the equal data means hypothesis. The Nash-Sutcliffe coefficient and the R r2 coefficient were the preferred methods for monthly results due to the ability to compare these coefficients to a set ideal value of one.

  9. Stochastic or statistic? Comparing flow duration curve models in ungauged basins and changing climates

    NASA Astrophysics Data System (ADS)

    Müller, M. F.; Thompson, S. E.

    2015-09-01

    The prediction of flow duration curves (FDCs) in ungauged basins remains an important task for hydrologists given the practical relevance of FDCs for water management and infrastructure design. Predicting FDCs in ungauged basins typically requires spatial interpolation of statistical or model parameters. This task is complicated if climate becomes non-stationary, as the prediction challenge now also requires extrapolation through time. In this context, process-based models for FDCs that mechanistically link the streamflow distribution to climate and landscape factors may have an advantage over purely statistical methods to predict FDCs. This study compares a stochastic (process-based) and statistical method for FDC prediction in both stationary and non-stationary contexts, using Nepal as a case study. Under contemporary conditions, both models perform well in predicting FDCs, with Nash-Sutcliffe coefficients above 0.80 in 75 % of the tested catchments. The main drives of uncertainty differ between the models: parameter interpolation was the main source of error for the statistical model, while violations of the assumptions of the process-based model represented the main source of its error. The process-based approach performed better than the statistical approach in numerical simulations with non-stationary climate drivers. The predictions of the statistical method under non-stationary rainfall conditions were poor if (i) local runoff coefficients were not accurately determined from the gauge network, or (ii) streamflow variability was strongly affected by changes in rainfall. A Monte Carlo analysis shows that the streamflow regimes in catchments characterized by a strong wet-season runoff and a rapid, strongly non-linear hydrologic response are particularly sensitive to changes in rainfall statistics. In these cases, process-based prediction approaches are strongly favored over statistical models.

  10. Comparing statistical and process-based flow duration curve models in ungauged basins and changing rain regimes

    NASA Astrophysics Data System (ADS)

    Müller, M. F.; Thompson, S. E.

    2016-02-01

    The prediction of flow duration curves (FDCs) in ungauged basins remains an important task for hydrologists given the practical relevance of FDCs for water management and infrastructure design. Predicting FDCs in ungauged basins typically requires spatial interpolation of statistical or model parameters. This task is complicated if climate becomes non-stationary, as the prediction challenge now also requires extrapolation through time. In this context, process-based models for FDCs that mechanistically link the streamflow distribution to climate and landscape factors may have an advantage over purely statistical methods to predict FDCs. This study compares a stochastic (process-based) and statistical method for FDC prediction in both stationary and non-stationary contexts, using Nepal as a case study. Under contemporary conditions, both models perform well in predicting FDCs, with Nash-Sutcliffe coefficients above 0.80 in 75 % of the tested catchments. The main drivers of uncertainty differ between the models: parameter interpolation was the main source of error for the statistical model, while violations of the assumptions of the process-based model represented the main source of its error. The process-based approach performed better than the statistical approach in numerical simulations with non-stationary climate drivers. The predictions of the statistical method under non-stationary rainfall conditions were poor if (i) local runoff coefficients were not accurately determined from the gauge network, or (ii) streamflow variability was strongly affected by changes in rainfall. A Monte Carlo analysis shows that the streamflow regimes in catchments characterized by frequent wet-season runoff and a rapid, strongly non-linear hydrologic response are particularly sensitive to changes in rainfall statistics. In these cases, process-based prediction approaches are favored over statistical models.

  11. Statistical prediction of dynamic distortion of inlet flow using minimum dynamic measurement. An application to the Melick statistical method and inlet flow dynamic distortion prediction without RMS measurements

    NASA Technical Reports Server (NTRS)

    Schweikhard, W. G.; Chen, Y. S.

    1986-01-01

    The Melick method of inlet flow dynamic distortion prediction by statistical means is outlined. A hypothetic vortex model is used as the basis for the mathematical formulations. The main variables are identified by matching the theoretical total pressure rms ratio with the measured total pressure rms ratio. Data comparisons, using the HiMAT inlet test data set, indicate satisfactory prediction of the dynamic peak distortion for cases with boundary layer control device vortex generators. A method for the dynamic probe selection was developed. Validity of the probe selection criteria is demonstrated by comparing the reduced-probe predictions with the 40-probe predictions. It is indicated that the the number of dynamic probes can be reduced to as few as two and still retain good accuracy.

  12. Application of statistical classification methods for predicting the acceptability of well-water quality

    NASA Astrophysics Data System (ADS)

    Cameron, Enrico; Pilla, Giorgio; Stella, Fabio A.

    2018-06-01

    The application of statistical classification methods is investigated—in comparison also to spatial interpolation methods—for predicting the acceptability of well-water quality in a situation where an effective quantitative model of the hydrogeological system under consideration cannot be developed. In the example area in northern Italy, in particular, the aquifer is locally affected by saline water and the concentration of chloride is the main indicator of both saltwater occurrence and groundwater quality. The goal is to predict if the chloride concentration in a water well will exceed the allowable concentration so that the water is unfit for the intended use. A statistical classification algorithm achieved the best predictive performances and the results of the study show that statistical classification methods provide further tools for dealing with groundwater quality problems concerning hydrogeological systems that are too difficult to describe analytically or to simulate effectively.

  13. Radar prediction of absolute rain fade distributions for earth-satellite paths and general methods for extrapolation of fade statistics to other locations

    NASA Technical Reports Server (NTRS)

    Goldhirsh, J.

    1982-01-01

    The first absolute rain fade distribution method described establishes absolute fade statistics at a given site by means of a sampled radar data base. The second method extrapolates absolute fade statistics from one location to another, given simultaneously measured fade and rain rate statistics at the former. Both methods employ similar conditional fade statistic concepts and long term rain rate distributions. Probability deviations in the 2-19% range, with an 11% average, were obtained upon comparison of measured and predicted levels at given attenuations. The extrapolation of fade distributions to other locations at 28 GHz showed very good agreement with measured data at three sites located in the continental temperate region.

  14. [Statistical prediction methods in violence risk assessment and its application].

    PubMed

    Liu, Yuan-Yuan; Hu, Jun-Mei; Yang, Min; Li, Xiao-Song

    2013-06-01

    It is an urgent global problem how to improve the violence risk assessment. As a necessary part of risk assessment, statistical methods have remarkable impacts and effects. In this study, the predicted methods in violence risk assessment from the point of statistics are reviewed. The application of Logistic regression as the sample of multivariate statistical model, decision tree model as the sample of data mining technique, and neural networks model as the sample of artificial intelligence technology are all reviewed. This study provides data in order to contribute the further research of violence risk assessment.

  15. BetaTPred: prediction of beta-TURNS in a protein using statistical algorithms.

    PubMed

    Kaur, Harpreet; Raghava, G P S

    2002-03-01

    beta-turns play an important role from a structural and functional point of view. beta-turns are the most common type of non-repetitive structures in proteins and comprise on average, 25% of the residues. In the past numerous methods have been developed to predict beta-turns in a protein. Most of these prediction methods are based on statistical approaches. In order to utilize the full potential of these methods, there is a need to develop a web server. This paper describes a web server called BetaTPred, developed for predicting beta-TURNS in a protein from its amino acid sequence. BetaTPred allows the user to predict turns in a protein using existing statistical algorithms. It also allows to predict different types of beta-TURNS e.g. type I, I', II, II', VI, VIII and non-specific. This server assists the users in predicting the consensus beta-TURNS in a protein. The server is accessible from http://imtech.res.in/raghava/betatpred/

  16. Statistical Learning Theory for High Dimensional Prediction: Application to Criterion-Keyed Scale Development

    PubMed Central

    Chapman, Benjamin P.; Weiss, Alexander; Duberstein, Paul

    2016-01-01

    Statistical learning theory (SLT) is the statistical formulation of machine learning theory, a body of analytic methods common in “big data” problems. Regression-based SLT algorithms seek to maximize predictive accuracy for some outcome, given a large pool of potential predictors, without overfitting the sample. Research goals in psychology may sometimes call for high dimensional regression. One example is criterion-keyed scale construction, where a scale with maximal predictive validity must be built from a large item pool. Using this as a working example, we first introduce a core principle of SLT methods: minimization of expected prediction error (EPE). Minimizing EPE is fundamentally different than maximizing the within-sample likelihood, and hinges on building a predictive model of sufficient complexity to predict the outcome well, without undue complexity leading to overfitting. We describe how such models are built and refined via cross-validation. We then illustrate how three common SLT algorithms–Supervised Principal Components, Regularization, and Boosting—can be used to construct a criterion-keyed scale predicting all-cause mortality, using a large personality item pool within a population cohort. Each algorithm illustrates a different approach to minimizing EPE. Finally, we consider broader applications of SLT predictive algorithms, both as supportive analytic tools for conventional methods, and as primary analytic tools in discovery phase research. We conclude that despite their differences from the classic null-hypothesis testing approach—or perhaps because of them–SLT methods may hold value as a statistically rigorous approach to exploratory regression. PMID:27454257

  17. K-nearest neighbors based methods for identification of different gear crack levels under different motor speeds and loads: Revisited

    NASA Astrophysics Data System (ADS)

    Wang, Dong

    2016-03-01

    Gears are the most commonly used components in mechanical transmission systems. Their failures may cause transmission system breakdown and result in economic loss. Identification of different gear crack levels is important to prevent any unexpected gear failure because gear cracks lead to gear tooth breakage. Signal processing based methods mainly require expertize to explain gear fault signatures which is usually not easy to be achieved by ordinary users. In order to automatically identify different gear crack levels, intelligent gear crack identification methods should be developed. The previous case studies experimentally proved that K-nearest neighbors based methods exhibit high prediction accuracies for identification of 3 different gear crack levels under different motor speeds and loads. In this short communication, to further enhance prediction accuracies of existing K-nearest neighbors based methods and extend identification of 3 different gear crack levels to identification of 5 different gear crack levels, redundant statistical features are constructed by using Daubechies 44 (db44) binary wavelet packet transform at different wavelet decomposition levels, prior to the use of a K-nearest neighbors method. The dimensionality of redundant statistical features is 620, which provides richer gear fault signatures. Since many of these statistical features are redundant and highly correlated with each other, dimensionality reduction of redundant statistical features is conducted to obtain new significant statistical features. At last, the K-nearest neighbors method is used to identify 5 different gear crack levels under different motor speeds and loads. A case study including 3 experiments is investigated to demonstrate that the developed method provides higher prediction accuracies than the existing K-nearest neighbors based methods for recognizing different gear crack levels under different motor speeds and loads. Based on the new significant statistical features, some other popular statistical models including linear discriminant analysis, quadratic discriminant analysis, classification and regression tree and naive Bayes classifier, are compared with the developed method. The results show that the developed method has the highest prediction accuracies among these statistical models. Additionally, selection of the number of new significant features and parameter selection of K-nearest neighbors are thoroughly investigated.

  18. Future missions studies: Combining Schatten's solar activity prediction model with a chaotic prediction model

    NASA Technical Reports Server (NTRS)

    Ashrafi, S.

    1991-01-01

    K. Schatten (1991) recently developed a method for combining his prediction model with our chaotic model. The philosophy behind this combined model and his method of combination is explained. Because the Schatten solar prediction model (KS) uses a dynamo to mimic solar dynamics, accurate prediction is limited to long-term solar behavior (10 to 20 years). The Chaotic prediction model (SA) uses the recently developed techniques of nonlinear dynamics to predict solar activity. It can be used to predict activity only up to the horizon. In theory, the chaotic prediction should be several orders of magnitude better than statistical predictions up to that horizon; beyond the horizon, chaotic predictions would theoretically be just as good as statistical predictions. Therefore, chaos theory puts a fundamental limit on predictability.

  19. Artificial intelligence in predicting bladder cancer outcome: a comparison of neuro-fuzzy modeling and artificial neural networks.

    PubMed

    Catto, James W F; Linkens, Derek A; Abbod, Maysam F; Chen, Minyou; Burton, Julian L; Feeley, Kenneth M; Hamdy, Freddie C

    2003-09-15

    New techniques for the prediction of tumor behavior are needed, because statistical analysis has a poor accuracy and is not applicable to the individual. Artificial intelligence (AI) may provide these suitable methods. Whereas artificial neural networks (ANN), the best-studied form of AI, have been used successfully, its hidden networks remain an obstacle to its acceptance. Neuro-fuzzy modeling (NFM), another AI method, has a transparent functional layer and is without many of the drawbacks of ANN. We have compared the predictive accuracies of NFM, ANN, and traditional statistical methods, for the behavior of bladder cancer. Experimental molecular biomarkers, including p53 and the mismatch repair proteins, and conventional clinicopathological data were studied in a cohort of 109 patients with bladder cancer. For all three of the methods, models were produced to predict the presence and timing of a tumor relapse. Both methods of AI predicted relapse with an accuracy ranging from 88% to 95%. This was superior to statistical methods (71-77%; P < 0.0006). NFM appeared better than ANN at predicting the timing of relapse (P = 0.073). The use of AI can accurately predict cancer behavior. NFM has a similar or superior predictive accuracy to ANN. However, unlike the impenetrable "black-box" of a neural network, the rules of NFM are transparent, enabling validation from clinical knowledge and the manipulation of input variables to allow exploratory predictions. This technique could be used widely in a variety of areas of medicine.

  20. Geographic and temporal validity of prediction models: Different approaches were useful to examine model performance

    PubMed Central

    Austin, Peter C.; van Klaveren, David; Vergouwe, Yvonne; Nieboer, Daan; Lee, Douglas S.; Steyerberg, Ewout W.

    2017-01-01

    Objective Validation of clinical prediction models traditionally refers to the assessment of model performance in new patients. We studied different approaches to geographic and temporal validation in the setting of multicenter data from two time periods. Study Design and Setting We illustrated different analytic methods for validation using a sample of 14,857 patients hospitalized with heart failure at 90 hospitals in two distinct time periods. Bootstrap resampling was used to assess internal validity. Meta-analytic methods were used to assess geographic transportability. Each hospital was used once as a validation sample, with the remaining hospitals used for model derivation. Hospital-specific estimates of discrimination (c-statistic) and calibration (calibration intercepts and slopes) were pooled using random effects meta-analysis methods. I2 statistics and prediction interval width quantified geographic transportability. Temporal transportability was assessed using patients from the earlier period for model derivation and patients from the later period for model validation. Results Estimates of reproducibility, pooled hospital-specific performance, and temporal transportability were on average very similar, with c-statistics of 0.75. Between-hospital variation was moderate according to I2 statistics and prediction intervals for c-statistics. Conclusion This study illustrates how performance of prediction models can be assessed in settings with multicenter data at different time periods. PMID:27262237

  1. Fast mean and variance computation of the diffuse sound transmission through finite-sized thick and layered wall and floor systems

    NASA Astrophysics Data System (ADS)

    Decraene, Carolina; Dijckmans, Arne; Reynders, Edwin P. B.

    2018-05-01

    A method is developed for computing the mean and variance of the diffuse field sound transmission loss of finite-sized layered wall and floor systems that consist of solid, fluid and/or poroelastic layers. This is achieved by coupling a transfer matrix model of the wall or floor to statistical energy analysis subsystem models of the adjacent room volumes. The modal behavior of the wall is approximately accounted for by projecting the wall displacement onto a set of sinusoidal lateral basis functions. This hybrid modal transfer matrix-statistical energy analysis method is validated on multiple wall systems: a thin steel plate, a polymethyl methacrylate panel, a thick brick wall, a sandwich panel, a double-leaf wall with poro-elastic material in the cavity, and a double glazing. The predictions are compared with experimental data and with results obtained using alternative prediction methods such as the transfer matrix method with spatial windowing, the hybrid wave based-transfer matrix method, and the hybrid finite element-statistical energy analysis method. These comparisons confirm the prediction accuracy of the proposed method and the computational efficiency against the conventional hybrid finite element-statistical energy analysis method.

  2. Calculation of precise firing statistics in a neural network model

    NASA Astrophysics Data System (ADS)

    Cho, Myoung Won

    2017-08-01

    A precise prediction of neural firing dynamics is requisite to understand the function of and the learning process in a biological neural network which works depending on exact spike timings. Basically, the prediction of firing statistics is a delicate manybody problem because the firing probability of a neuron at a time is determined by the summation over all effects from past firing states. A neural network model with the Feynman path integral formulation is recently introduced. In this paper, we present several methods to calculate firing statistics in the model. We apply the methods to some cases and compare the theoretical predictions with simulation results.

  3. Statistical learning theory for high dimensional prediction: Application to criterion-keyed scale development.

    PubMed

    Chapman, Benjamin P; Weiss, Alexander; Duberstein, Paul R

    2016-12-01

    Statistical learning theory (SLT) is the statistical formulation of machine learning theory, a body of analytic methods common in "big data" problems. Regression-based SLT algorithms seek to maximize predictive accuracy for some outcome, given a large pool of potential predictors, without overfitting the sample. Research goals in psychology may sometimes call for high dimensional regression. One example is criterion-keyed scale construction, where a scale with maximal predictive validity must be built from a large item pool. Using this as a working example, we first introduce a core principle of SLT methods: minimization of expected prediction error (EPE). Minimizing EPE is fundamentally different than maximizing the within-sample likelihood, and hinges on building a predictive model of sufficient complexity to predict the outcome well, without undue complexity leading to overfitting. We describe how such models are built and refined via cross-validation. We then illustrate how 3 common SLT algorithms-supervised principal components, regularization, and boosting-can be used to construct a criterion-keyed scale predicting all-cause mortality, using a large personality item pool within a population cohort. Each algorithm illustrates a different approach to minimizing EPE. Finally, we consider broader applications of SLT predictive algorithms, both as supportive analytic tools for conventional methods, and as primary analytic tools in discovery phase research. We conclude that despite their differences from the classic null-hypothesis testing approach-or perhaps because of them-SLT methods may hold value as a statistically rigorous approach to exploratory regression. (PsycINFO Database Record (c) 2016 APA, all rights reserved).

  4. In silico prediction of post-translational modifications.

    PubMed

    Liu, Chunmei; Li, Hui

    2011-01-01

    Methods for predicting protein post-translational modifications have been developed extensively. In this chapter, we review major post-translational modification prediction strategies, with a particular focus on statistical and machine learning approaches. We present the workflow of the methods and summarize the advantages and disadvantages of the methods.

  5. Development of a Predictive Corrosion Model Using Locality-Specific Corrosion Indices

    DTIC Science & Technology

    2017-09-12

    6 3.2.1 Statistical data analysis methods ...6 3.2.2 Algorithm development method ...components, and method ) were compiled into an executable program that uses mathematical models of materials degradation, and statistical calcula- tions

  6. Comparison of the predictive validity of diagnosis-based risk adjusters for clinical outcomes.

    PubMed

    Petersen, Laura A; Pietz, Kenneth; Woodard, LeChauncy D; Byrne, Margaret

    2005-01-01

    Many possible methods of risk adjustment exist, but there is a dearth of comparative data on their performance. We compared the predictive validity of 2 widely used methods (Diagnostic Cost Groups [DCGs] and Adjusted Clinical Groups [ACGs]) for 2 clinical outcomes using a large national sample of patients. We studied all patients who used Veterans Health Administration (VA) medical services in fiscal year (FY) 2001 (n = 3,069,168) and assigned both a DCG and an ACG to each. We used logistic regression analyses to compare predictive ability for death or long-term care (LTC) hospitalization for age/gender models, DCG models, and ACG models. We also assessed the effect of adding age to the DCG and ACG models. Patients in the highest DCG categories, indicating higher severity of illness, were more likely to die or to require LTC hospitalization. Surprisingly, the age/gender model predicted death slightly more accurately than the ACG model (c-statistic of 0.710 versus 0.700, respectively). The addition of age to the ACG model improved the c-statistic to 0.768. The highest c-statistic for prediction of death was obtained with a DCG/age model (0.830). The lowest c-statistics were obtained for age/gender models for LTC hospitalization (c-statistic 0.593). The c-statistic for use of ACGs to predict LTC hospitalization was 0.783, and improved to 0.792 with the addition of age. The c-statistics for use of DCGs and DCG/age to predict LTC hospitalization were 0.885 and 0.890, respectively, indicating the best prediction. We found that risk adjusters based upon diagnoses predicted an increased likelihood of death or LTC hospitalization, exhibiting good predictive validity. In this comparative analysis using VA data, DCG models were generally superior to ACG models in predicting clinical outcomes, although ACG model performance was enhanced by the addition of age.

  7. Comparison and validation of statistical methods for predicting power outage durations in the event of hurricanes.

    PubMed

    Nateghi, Roshanak; Guikema, Seth D; Quiring, Steven M

    2011-12-01

    This article compares statistical methods for modeling power outage durations during hurricanes and examines the predictive accuracy of these methods. Being able to make accurate predictions of power outage durations is valuable because the information can be used by utility companies to plan their restoration efforts more efficiently. This information can also help inform customers and public agencies of the expected outage times, enabling better collective response planning, and coordination of restoration efforts for other critical infrastructures that depend on electricity. In the long run, outage duration estimates for future storm scenarios may help utilities and public agencies better allocate risk management resources to balance the disruption from hurricanes with the cost of hardening power systems. We compare the out-of-sample predictive accuracy of five distinct statistical models for estimating power outage duration times caused by Hurricane Ivan in 2004. The methods compared include both regression models (accelerated failure time (AFT) and Cox proportional hazard models (Cox PH)) and data mining techniques (regression trees, Bayesian additive regression trees (BART), and multivariate additive regression splines). We then validate our models against two other hurricanes. Our results indicate that BART yields the best prediction accuracy and that it is possible to predict outage durations with reasonable accuracy. © 2011 Society for Risk Analysis.

  8. Statistical Prediction in Proprietary Rehabilitation.

    ERIC Educational Resources Information Center

    Johnson, Kurt L.; And Others

    1987-01-01

    Applied statistical methods to predict case expenditures for low back pain rehabilitation cases in proprietary rehabilitation. Extracted predictor variables from case records of 175 workers compensation claimants with some degree of permanent disability due to back injury. Performed several multiple regression analyses resulting in a formula that…

  9. Paroxysmal atrial fibrillation prediction method with shorter HRV sequences.

    PubMed

    Boon, K H; Khalil-Hani, M; Malarvili, M B; Sia, C W

    2016-10-01

    This paper proposes a method that predicts the onset of paroxysmal atrial fibrillation (PAF), using heart rate variability (HRV) segments that are shorter than those applied in existing methods, while maintaining good prediction accuracy. PAF is a common cardiac arrhythmia that increases the health risk of a patient, and the development of an accurate predictor of the onset of PAF is clinical important because it increases the possibility to stabilize (electrically) and prevent the onset of atrial arrhythmias with different pacing techniques. We investigate the effect of HRV features extracted from different lengths of HRV segments prior to PAF onset with the proposed PAF prediction method. The pre-processing stage of the predictor includes QRS detection, HRV quantification and ectopic beat correction. Time-domain, frequency-domain, non-linear and bispectrum features are then extracted from the quantified HRV. In the feature selection, the HRV feature set and classifier parameters are optimized simultaneously using an optimization procedure based on genetic algorithm (GA). Both full feature set and statistically significant feature subset are optimized by GA respectively. For the statistically significant feature subset, Mann-Whitney U test is used to filter non-statistical significance features that cannot pass the statistical test at 20% significant level. The final stage of our predictor is the classifier that is based on support vector machine (SVM). A 10-fold cross-validation is applied in performance evaluation, and the proposed method achieves 79.3% prediction accuracy using 15-minutes HRV segment. This accuracy is comparable to that achieved by existing methods that use 30-minutes HRV segments, most of which achieves accuracy of around 80%. More importantly, our method significantly outperforms those that applied segments shorter than 30 minutes. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.

  10. Control Theory and Statistical Generalizations.

    ERIC Educational Resources Information Center

    Powers, William T.

    1990-01-01

    Contrasts modeling methods in control theory to the methods of statistical generalizations in empirical studies of human or animal behavior. Presents a computer simulation that predicts behavior based on variables (effort and rewards) determined by the invariable (desired reward). Argues that control theory methods better reflect relationships to…

  11. Ensemble-based prediction of RNA secondary structures.

    PubMed

    Aghaeepour, Nima; Hoos, Holger H

    2013-04-24

    Accurate structure prediction methods play an important role for the understanding of RNA function. Energy-based, pseudoknot-free secondary structure prediction is one of the most widely used and versatile approaches, and improved methods for this task have received much attention over the past five years. Despite the impressive progress that as been achieved in this area, existing evaluations of the prediction accuracy achieved by various algorithms do not provide a comprehensive, statistically sound assessment. Furthermore, while there is increasing evidence that no prediction algorithm consistently outperforms all others, no work has been done to exploit the complementary strengths of multiple approaches. In this work, we present two contributions to the area of RNA secondary structure prediction. Firstly, we use state-of-the-art, resampling-based statistical methods together with a previously published and increasingly widely used dataset of high-quality RNA structures to conduct a comprehensive evaluation of existing RNA secondary structure prediction procedures. The results from this evaluation clarify the performance relationship between ten well-known existing energy-based pseudoknot-free RNA secondary structure prediction methods and clearly demonstrate the progress that has been achieved in recent years. Secondly, we introduce AveRNA, a generic and powerful method for combining a set of existing secondary structure prediction procedures into an ensemble-based method that achieves significantly higher prediction accuracies than obtained from any of its component procedures. Our new, ensemble-based method, AveRNA, improves the state of the art for energy-based, pseudoknot-free RNA secondary structure prediction by exploiting the complementary strengths of multiple existing prediction procedures, as demonstrated using a state-of-the-art statistical resampling approach. In addition, AveRNA allows an intuitive and effective control of the trade-off between false negative and false positive base pair predictions. Finally, AveRNA can make use of arbitrary sets of secondary structure prediction procedures and can therefore be used to leverage improvements in prediction accuracy offered by algorithms and energy models developed in the future. Our data, MATLAB software and a web-based version of AveRNA are publicly available at http://www.cs.ubc.ca/labs/beta/Software/AveRNA.

  12. An investigation of new toxicity test method performance in validation studies: 1. Toxicity test methods that have predictive capacity no greater than chance.

    PubMed

    Bruner, L H; Carr, G J; Harbell, J W; Curren, R D

    2002-06-01

    An approach commonly used to measure new toxicity test method (NTM) performance in validation studies is to divide toxicity results into positive and negative classifications, and the identify true positive (TP), true negative (TN), false positive (FP) and false negative (FN) results. After this step is completed, the contingent probability statistics (CPS), sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) are calculated. Although these statistics are widely used and often the only statistics used to assess the performance of toxicity test methods, there is little specific guidance in the validation literature on what values for these statistics indicate adequate performance. The purpose of this study was to begin developing data-based answers to this question by characterizing the CPS obtained from an NTM whose data have a completely random association with a reference test method (RTM). Determining the CPS of this worst-case scenario is useful because it provides a lower baseline from which the performance of an NTM can be judged in future validation studies. It also provides an indication of relationships in the CPS that help identify random or near-random relationships in the data. The results from this study of randomly associated tests show that the values obtained for the statistics vary significantly depending on the cut-offs chosen, that high values can be obtained for individual statistics, and that the different measures cannot be considered independently when evaluating the performance of an NTM. When the association between results of an NTM and RTM is random the sum of the complementary pairs of statistics (sensitivity + specificity, NPV + PPV) is approximately 1, and the prevalence (i.e., the proportion of toxic chemicals in the population of chemicals) and PPV are equal. Given that combinations of high sensitivity-low specificity or low specificity-high sensitivity (i.e., the sum of the sensitivity and specificity equal to approximately 1) indicate lack of predictive capacity, an NTM having these performance characteristics should be considered no better for predicting toxicity than by chance alone.

  13. Statistical Prediction of Sea Ice Concentration over Arctic

    NASA Astrophysics Data System (ADS)

    Kim, Jongho; Jeong, Jee-Hoon; Kim, Baek-Min

    2017-04-01

    In this study, a statistical method that predict sea ice concentration (SIC) over the Arctic is developed. We first calculate the Season-reliant Empirical Orthogonal Functions (S-EOFs) of monthly Arctic SIC from Nimbus-7 SMMR and DMSP SSM/I-SSMIS Passive Microwave Data, which contain the seasonal cycles (12 months long) of dominant SIC anomaly patterns. Then, the current SIC state index is determined by projecting observed SIC anomalies for latest 12 months to the S-EOFs. Assuming the current SIC anomalies follow the spatio-temporal evolution in the S-EOFs, we project the future (upto 12 months) SIC anomalies by multiplying the SI and the corresponding S-EOF and then taking summation. The predictive skill is assessed by hindcast experiments initialized at all the months for 1980-2010. When comparing predictive skill of SIC predicted by statistical model and NCEP CFS v2, the statistical model shows a higher skill in predicting sea ice concentration and extent.

  14. A new method of power load prediction in electrification railway

    NASA Astrophysics Data System (ADS)

    Dun, Xiaohong

    2018-04-01

    Aiming at the character of electrification railway, the paper mainly studies the problem of load prediction in electrification railway. After the preprocessing of data, and the similar days are separated on the basis of its statistical characteristics. Meanwhile the accuracy of different methods is analyzed. The paper provides a new thought of prediction and a new method of accuracy of judgment for the load prediction of power system.

  15. Use of model calibration to achieve high accuracy in analysis of computer networks

    DOEpatents

    Frogner, Bjorn; Guarro, Sergio; Scharf, Guy

    2004-05-11

    A system and method are provided for creating a network performance prediction model, and calibrating the prediction model, through application of network load statistical analyses. The method includes characterizing the measured load on the network, which may include background load data obtained over time, and may further include directed load data representative of a transaction-level event. Probabilistic representations of load data are derived to characterize the statistical persistence of the network performance variability and to determine delays throughout the network. The probabilistic representations are applied to the network performance prediction model to adapt the model for accurate prediction of network performance. Certain embodiments of the method and system may be used for analysis of the performance of a distributed application characterized as data packet streams.

  16. Analysis of methods to estimate spring flows in a karst aquifer

    USGS Publications Warehouse

    Sepulveda, N.

    2009-01-01

    Hydraulically and statistically based methods were analyzed to identify the most reliable method to predict spring flows in a karst aquifer. Measured water levels at nearby observation wells, measured spring pool altitudes, and the distance between observation wells and the spring pool were the parameters used to match measured spring flows. Measured spring flows at six Upper Floridan aquifer springs in central Florida were used to assess the reliability of these methods to predict spring flows. Hydraulically based methods involved the application of the Theis, Hantush-Jacob, and Darcy-Weisbach equations, whereas the statistically based methods were the multiple linear regressions and the technology of artificial neural networks (ANNs). Root mean square errors between measured and predicted spring flows using the Darcy-Weisbach method ranged between 5% and 15% of the measured flows, lower than the 7% to 27% range for the Theis or Hantush-Jacob methods. Flows at all springs were estimated to be turbulent based on the Reynolds number derived from the Darcy-Weisbach equation for conduit flow. The multiple linear regression and the Darcy-Weisbach methods had similar spring flow prediction capabilities. The ANNs provided the lowest residuals between measured and predicted spring flows, ranging from 1.6% to 5.3% of the measured flows. The model prediction efficiency criteria also indicated that the ANNs were the most accurate method predicting spring flows in a karst aquifer. ?? 2008 National Ground Water Association.

  17. Analysis of methods to estimate spring flows in a karst aquifer.

    PubMed

    Sepúlveda, Nicasio

    2009-01-01

    Hydraulically and statistically based methods were analyzed to identify the most reliable method to predict spring flows in a karst aquifer. Measured water levels at nearby observation wells, measured spring pool altitudes, and the distance between observation wells and the spring pool were the parameters used to match measured spring flows. Measured spring flows at six Upper Floridan aquifer springs in central Florida were used to assess the reliability of these methods to predict spring flows. Hydraulically based methods involved the application of the Theis, Hantush-Jacob, and Darcy-Weisbach equations, whereas the statistically based methods were the multiple linear regressions and the technology of artificial neural networks (ANNs). Root mean square errors between measured and predicted spring flows using the Darcy-Weisbach method ranged between 5% and 15% of the measured flows, lower than the 7% to 27% range for the Theis or Hantush-Jacob methods. Flows at all springs were estimated to be turbulent based on the Reynolds number derived from the Darcy-Weisbach equation for conduit flow. The multiple linear regression and the Darcy-Weisbach methods had similar spring flow prediction capabilities. The ANNs provided the lowest residuals between measured and predicted spring flows, ranging from 1.6% to 5.3% of the measured flows. The model prediction efficiency criteria also indicated that the ANNs were the most accurate method predicting spring flows in a karst aquifer.

  18. A Generalized Approach for Measuring Relationships Among Genes.

    PubMed

    Wang, Lijun; Ahsan, Md Asif; Chen, Ming

    2017-07-21

    Several methods for identifying relationships among pairs of genes have been developed. In this article, we present a generalized approach for measuring relationships between any pairs of genes, which is based on statistical prediction. We derive two particular versions of the generalized approach, least squares estimation (LSE) and nearest neighbors prediction (NNP). According to mathematical proof, LSE is equivalent to the methods based on correlation; and NNP is approximate to one popular method called the maximal information coefficient (MIC) according to the performances in simulations and real dataset. Moreover, the approach based on statistical prediction can be extended from two-genes relationships to multi-genes relationships. This application would help to identify relationships among multi-genes.

  19. Probabilistic performance estimators for computational chemistry methods: The empirical cumulative distribution function of absolute errors

    NASA Astrophysics Data System (ADS)

    Pernot, Pascal; Savin, Andreas

    2018-06-01

    Benchmarking studies in computational chemistry use reference datasets to assess the accuracy of a method through error statistics. The commonly used error statistics, such as the mean signed and mean unsigned errors, do not inform end-users on the expected amplitude of prediction errors attached to these methods. We show that, the distributions of model errors being neither normal nor zero-centered, these error statistics cannot be used to infer prediction error probabilities. To overcome this limitation, we advocate for the use of more informative statistics, based on the empirical cumulative distribution function of unsigned errors, namely, (1) the probability for a new calculation to have an absolute error below a chosen threshold and (2) the maximal amplitude of errors one can expect with a chosen high confidence level. Those statistics are also shown to be well suited for benchmarking and ranking studies. Moreover, the standard error on all benchmarking statistics depends on the size of the reference dataset. Systematic publication of these standard errors would be very helpful to assess the statistical reliability of benchmarking conclusions.

  20. Multi-fidelity machine learning models for accurate bandgap predictions of solids

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Pilania, Ghanshyam; Gubernatis, James E.; Lookman, Turab

    Here, we present a multi-fidelity co-kriging statistical learning framework that combines variable-fidelity quantum mechanical calculations of bandgaps to generate a machine-learned model that enables low-cost accurate predictions of the bandgaps at the highest fidelity level. Additionally, the adopted Gaussian process regression formulation allows us to predict the underlying uncertainties as a measure of our confidence in the predictions. In using a set of 600 elpasolite compounds as an example dataset and using semi-local and hybrid exchange correlation functionals within density functional theory as two levels of fidelities, we demonstrate the excellent learning performance of the method against actual high fidelitymore » quantum mechanical calculations of the bandgaps. The presented statistical learning method is not restricted to bandgaps or electronic structure methods and extends the utility of high throughput property predictions in a significant way.« less

  1. Multi-fidelity machine learning models for accurate bandgap predictions of solids

    DOE PAGES

    Pilania, Ghanshyam; Gubernatis, James E.; Lookman, Turab

    2016-12-28

    Here, we present a multi-fidelity co-kriging statistical learning framework that combines variable-fidelity quantum mechanical calculations of bandgaps to generate a machine-learned model that enables low-cost accurate predictions of the bandgaps at the highest fidelity level. Additionally, the adopted Gaussian process regression formulation allows us to predict the underlying uncertainties as a measure of our confidence in the predictions. In using a set of 600 elpasolite compounds as an example dataset and using semi-local and hybrid exchange correlation functionals within density functional theory as two levels of fidelities, we demonstrate the excellent learning performance of the method against actual high fidelitymore » quantum mechanical calculations of the bandgaps. The presented statistical learning method is not restricted to bandgaps or electronic structure methods and extends the utility of high throughput property predictions in a significant way.« less

  2. Proposal for a recovery prediction method for patients affected by acute mediastinitis

    PubMed Central

    2012-01-01

    Background An attempt to find a prediction method of death risk in patients affected by acute mediastinitis. There is not such a tool described in available literature for that serious disease. Methods The study comprised 44 consecutive cases of acute mediastinitis. General anamnesis and biochemical data were included. Factor analysis was used to extract the risk characteristic for the patients. The most valuable results were obtained for 8 parameters which were selected for further statistical analysis (all collected during few hours after admission). Three factors reached Eigenvalue >1. Clinical explanations of these combined statistical factors are: Factor1 - proteinic status (serum total protein, albumin, and hemoglobin level), Factor2 - inflammatory status (white blood cells, CRP, procalcitonin), and Factor3 - general risk (age, number of coexisting diseases). Threshold values of prediction factors were estimated by means of statistical analysis (factor analysis, Statgraphics Centurion XVI). Results The final prediction result for the patients is constructed as simultaneous evaluation of all factor scores. High probability of death should be predicted if factor 1 value decreases with simultaneous increase of factors 2 and 3. The diagnostic power of the proposed method was revealed to be high [sensitivity =90%, specificity =64%], for Factor1 [SNC = 87%, SPC = 79%]; for Factor2 [SNC = 87%, SPC = 50%] and for Factor3 [SNC = 73%, SPC = 71%]. Conclusion The proposed prediction method seems a useful emergency signal during acute mediastinitis control in affected patients. PMID:22574625

  3. A method for obtaining a statistically stationary turbulent free shear flow

    NASA Technical Reports Server (NTRS)

    Timson, Stephen F.; Lele, S. K.; Moser, R. D.

    1994-01-01

    The long-term goal of the current research is the study of Large-Eddy Simulation (LES) as a tool for aeroacoustics. New algorithms and developments in computer hardware are making possible a new generation of tools for aeroacoustic predictions, which rely on the physics of the flow rather than empirical knowledge. LES, in conjunction with an acoustic analogy, holds the promise of predicting the statistics of noise radiated to the far-field of a turbulent flow. LES's predictive ability will be tested through extensive comparison of acoustic predictions based on a Direct Numerical Simulation (DNS) and LES of the same flow, as well as a priori testing of DNS results. The method presented here is aimed at allowing simulation of a turbulent flow field that is both simple and amenable to acoustic predictions. A free shear flow is homogeneous in both the streamwise and spanwise directions and which is statistically stationary will be simulated using equations based on the Navier-Stokes equations with a small number of added terms. Studying a free shear flow eliminates the need to consider flow-surface interactions as an acoustic source. The homogeneous directions and the flow's statistically stationary nature greatly simplify the application of an acoustic analogy.

  4. Statistical Approaches for Spatiotemporal Prediction of Low Flows

    NASA Astrophysics Data System (ADS)

    Fangmann, A.; Haberlandt, U.

    2017-12-01

    An adequate assessment of regional climate change impacts on streamflow requires the integration of various sources of information and modeling approaches. This study proposes simple statistical tools for inclusion into model ensembles, which are fast and straightforward in their application, yet able to yield accurate streamflow predictions in time and space. Target variables for all approaches are annual low flow indices derived from a data set of 51 records of average daily discharge for northwestern Germany. The models require input of climatic data in the form of meteorological drought indices, derived from observed daily climatic variables, averaged over the streamflow gauges' catchments areas. Four different modeling approaches are analyzed. Basis for all pose multiple linear regression models that estimate low flows as a function of a set of meteorological indices and/or physiographic and climatic catchment descriptors. For the first method, individual regression models are fitted at each station, predicting annual low flow values from a set of annual meteorological indices, which are subsequently regionalized using a set of catchment characteristics. The second method combines temporal and spatial prediction within a single panel data regression model, allowing estimation of annual low flow values from input of both annual meteorological indices and catchment descriptors. The third and fourth methods represent non-stationary low flow frequency analyses and require fitting of regional distribution functions. Method three is subject to a spatiotemporal prediction of an index value, method four to estimation of L-moments that adapt the regional frequency distribution to the at-site conditions. The results show that method two outperforms successive prediction in time and space. Method three also shows a high performance in the near future period, but since it relies on a stationary distribution, its application for prediction of far future changes may be problematic. Spatiotemporal prediction of L-moments appeared highly uncertain for higher-order moments resulting in unrealistic future low flow values. All in all, the results promote an inclusion of simple statistical methods in climate change impact assessment.

  5. Impact of statistical learning methods on the predictive power of multivariate normal tissue complication probability models.

    PubMed

    Xu, Cheng-Jian; van der Schaaf, Arjen; Schilstra, Cornelis; Langendijk, Johannes A; van't Veld, Aart A

    2012-03-15

    To study the impact of different statistical learning methods on the prediction performance of multivariate normal tissue complication probability (NTCP) models. In this study, three learning methods, stepwise selection, least absolute shrinkage and selection operator (LASSO), and Bayesian model averaging (BMA), were used to build NTCP models of xerostomia following radiotherapy treatment for head and neck cancer. Performance of each learning method was evaluated by a repeated cross-validation scheme in order to obtain a fair comparison among methods. It was found that the LASSO and BMA methods produced models with significantly better predictive power than that of the stepwise selection method. Furthermore, the LASSO method yields an easily interpretable model as the stepwise method does, in contrast to the less intuitive BMA method. The commonly used stepwise selection method, which is simple to execute, may be insufficient for NTCP modeling. The LASSO method is recommended. Copyright © 2012 Elsevier Inc. All rights reserved.

  6. Assessment of Current Jet Noise Prediction Capabilities

    NASA Technical Reports Server (NTRS)

    Hunter, Craid A.; Bridges, James E.; Khavaran, Abbas

    2008-01-01

    An assessment was made of the capability of jet noise prediction codes over a broad range of jet flows, with the objective of quantifying current capabilities and identifying areas requiring future research investment. Three separate codes in NASA s possession, representative of two classes of jet noise prediction codes, were evaluated, one empirical and two statistical. The empirical code is the Stone Jet Noise Module (ST2JET) contained within the ANOPP aircraft noise prediction code. It is well documented, and represents the state of the art in semi-empirical acoustic prediction codes where virtual sources are attributed to various aspects of noise generation in each jet. These sources, in combination, predict the spectral directivity of a jet plume. A total of 258 jet noise cases were examined on the ST2JET code, each run requiring only fractions of a second to complete. Two statistical jet noise prediction codes were also evaluated, JeNo v1, and Jet3D. Fewer cases were run for the statistical prediction methods because they require substantially more resources, typically a Reynolds-Averaged Navier-Stokes solution of the jet, volume integration of the source statistical models over the entire plume, and a numerical solution of the governing propagation equation within the jet. In the evaluation process, substantial justification of experimental datasets used in the evaluations was made. In the end, none of the current codes can predict jet noise within experimental uncertainty. The empirical code came within 2dB on a 1/3 octave spectral basis for a wide range of flows. The statistical code Jet3D was within experimental uncertainty at broadside angles for hot supersonic jets, but errors in peak frequency and amplitude put it out of experimental uncertainty at cooler, lower speed conditions. Jet3D did not predict changes in directivity in the downstream angles. The statistical code JeNo,v1 was within experimental uncertainty predicting noise from cold subsonic jets at all angles, but did not predict changes with heating of the jet and did not account for directivity changes at supersonic conditions. Shortcomings addressed here give direction for future work relevant to the statistical-based prediction methods. A full report will be released as a chapter in a NASA publication assessing the state of the art in aircraft noise prediction.

  7. Multiple Kernel Learning with Random Effects for Predicting Longitudinal Outcomes and Data Integration

    PubMed Central

    Chen, Tianle; Zeng, Donglin

    2015-01-01

    Summary Predicting disease risk and progression is one of the main goals in many clinical research studies. Cohort studies on the natural history and etiology of chronic diseases span years and data are collected at multiple visits. Although kernel-based statistical learning methods are proven to be powerful for a wide range of disease prediction problems, these methods are only well studied for independent data but not for longitudinal data. It is thus important to develop time-sensitive prediction rules that make use of the longitudinal nature of the data. In this paper, we develop a novel statistical learning method for longitudinal data by introducing subject-specific short-term and long-term latent effects through a designed kernel to account for within-subject correlation of longitudinal measurements. Since the presence of multiple sources of data is increasingly common, we embed our method in a multiple kernel learning framework and propose a regularized multiple kernel statistical learning with random effects to construct effective nonparametric prediction rules. Our method allows easy integration of various heterogeneous data sources and takes advantage of correlation among longitudinal measures to increase prediction power. We use different kernels for each data source taking advantage of the distinctive feature of each data modality, and then optimally combine data across modalities. We apply the developed methods to two large epidemiological studies, one on Huntington's disease and the other on Alzheimer's Disease (Alzheimer's Disease Neuroimaging Initiative, ADNI) where we explore a unique opportunity to combine imaging and genetic data to study prediction of mild cognitive impairment, and show a substantial gain in performance while accounting for the longitudinal aspect of the data. PMID:26177419

  8. The Effect of General Statistical Fiber Misalignment on Predicted Damage Initiation in Composites

    NASA Technical Reports Server (NTRS)

    Bednarcyk, Brett A.; Aboudi, Jacob; Arnold, Steven M.

    2014-01-01

    A micromechanical method is employed for the prediction of unidirectional composites in which the fiber orientation can possess various statistical misalignment distributions. The method relies on the probability-weighted averaging of the appropriate concentration tensor, which is established by the micromechanical procedure. This approach provides access to the local field quantities throughout the constituents, from which initiation of damage in the composite can be predicted. In contrast, a typical macromechanical procedure can determine the effective composite elastic properties in the presence of statistical fiber misalignment, but cannot provide the local fields. Fully random fiber distribution is presented as a special case using the proposed micromechanical method. Results are given that illustrate the effects of various amounts of fiber misalignment in terms of the standard deviations of in-plane and out-of-plane misalignment angles, where normal distributions have been employed. Damage initiation envelopes, local fields, effective moduli, and strengths are predicted for polymer and ceramic matrix composites with given normal distributions of misalignment angles, as well as fully random fiber orientation.

  9. Comparison of four statistical and machine learning methods for crash severity prediction.

    PubMed

    Iranitalab, Amirfarrokh; Khattak, Aemal

    2017-11-01

    Crash severity prediction models enable different agencies to predict the severity of a reported crash with unknown severity or the severity of crashes that may be expected to occur sometime in the future. This paper had three main objectives: comparison of the performance of four statistical and machine learning methods including Multinomial Logit (MNL), Nearest Neighbor Classification (NNC), Support Vector Machines (SVM) and Random Forests (RF), in predicting traffic crash severity; developing a crash costs-based approach for comparison of crash severity prediction methods; and investigating the effects of data clustering methods comprising K-means Clustering (KC) and Latent Class Clustering (LCC), on the performance of crash severity prediction models. The 2012-2015 reported crash data from Nebraska, United States was obtained and two-vehicle crashes were extracted as the analysis data. The dataset was split into training/estimation (2012-2014) and validation (2015) subsets. The four prediction methods were trained/estimated using the training/estimation dataset and the correct prediction rates for each crash severity level, overall correct prediction rate and a proposed crash costs-based accuracy measure were obtained for the validation dataset. The correct prediction rates and the proposed approach showed NNC had the best prediction performance in overall and in more severe crashes. RF and SVM had the next two sufficient performances and MNL was the weakest method. Data clustering did not affect the prediction results of SVM, but KC improved the prediction performance of MNL, NNC and RF, while LCC caused improvement in MNL and RF but weakened the performance of NNC. Overall correct prediction rate had almost the exact opposite results compared to the proposed approach, showing that neglecting the crash costs can lead to misjudgment in choosing the right prediction method. Copyright © 2017 Elsevier Ltd. All rights reserved.

  10. Gene network inherent in genomic big data improves the accuracy of prognostic prediction for cancer patients.

    PubMed

    Kim, Yun Hak; Jeong, Dae Cheon; Pak, Kyoungjune; Goh, Tae Sik; Lee, Chi-Seung; Han, Myoung-Eun; Kim, Ji-Young; Liangwen, Liu; Kim, Chi Dae; Jang, Jeon Yeob; Cha, Wonjae; Oh, Sae-Ock

    2017-09-29

    Accurate prediction of prognosis is critical for therapeutic decisions regarding cancer patients. Many previously developed prognostic scoring systems have limitations in reflecting recent progress in the field of cancer biology such as microarray, next-generation sequencing, and signaling pathways. To develop a new prognostic scoring system for cancer patients, we used mRNA expression and clinical data in various independent breast cancer cohorts (n=1214) from the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) and Gene Expression Omnibus (GEO). A new prognostic score that reflects gene network inherent in genomic big data was calculated using Network-Regularized high-dimensional Cox-regression (Net-score). We compared its discriminatory power with those of two previously used statistical methods: stepwise variable selection via univariate Cox regression (Uni-score) and Cox regression via Elastic net (Enet-score). The Net scoring system showed better discriminatory power in prediction of disease-specific survival (DSS) than other statistical methods (p=0 in METABRIC training cohort, p=0.000331, 4.58e-06 in two METABRIC validation cohorts) when accuracy was examined by log-rank test. Notably, comparison of C-index and AUC values in receiver operating characteristic analysis at 5 years showed fewer differences between training and validation cohorts with the Net scoring system than other statistical methods, suggesting minimal overfitting. The Net-based scoring system also successfully predicted prognosis in various independent GEO cohorts with high discriminatory power. In conclusion, the Net-based scoring system showed better discriminative power than previous statistical methods in prognostic prediction for breast cancer patients. This new system will mark a new era in prognosis prediction for cancer patients.

  11. Gene network inherent in genomic big data improves the accuracy of prognostic prediction for cancer patients

    PubMed Central

    Kim, Yun Hak; Jeong, Dae Cheon; Pak, Kyoungjune; Goh, Tae Sik; Lee, Chi-Seung; Han, Myoung-Eun; Kim, Ji-Young; Liangwen, Liu; Kim, Chi Dae; Jang, Jeon Yeob; Cha, Wonjae; Oh, Sae-Ock

    2017-01-01

    Accurate prediction of prognosis is critical for therapeutic decisions regarding cancer patients. Many previously developed prognostic scoring systems have limitations in reflecting recent progress in the field of cancer biology such as microarray, next-generation sequencing, and signaling pathways. To develop a new prognostic scoring system for cancer patients, we used mRNA expression and clinical data in various independent breast cancer cohorts (n=1214) from the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) and Gene Expression Omnibus (GEO). A new prognostic score that reflects gene network inherent in genomic big data was calculated using Network-Regularized high-dimensional Cox-regression (Net-score). We compared its discriminatory power with those of two previously used statistical methods: stepwise variable selection via univariate Cox regression (Uni-score) and Cox regression via Elastic net (Enet-score). The Net scoring system showed better discriminatory power in prediction of disease-specific survival (DSS) than other statistical methods (p=0 in METABRIC training cohort, p=0.000331, 4.58e-06 in two METABRIC validation cohorts) when accuracy was examined by log-rank test. Notably, comparison of C-index and AUC values in receiver operating characteristic analysis at 5 years showed fewer differences between training and validation cohorts with the Net scoring system than other statistical methods, suggesting minimal overfitting. The Net-based scoring system also successfully predicted prognosis in various independent GEO cohorts with high discriminatory power. In conclusion, the Net-based scoring system showed better discriminative power than previous statistical methods in prognostic prediction for breast cancer patients. This new system will mark a new era in prognosis prediction for cancer patients. PMID:29100405

  12. Transmission overhaul and replacement predictions using Weibull and renewel theory

    NASA Technical Reports Server (NTRS)

    Savage, M.; Lewicki, D. G.

    1989-01-01

    A method to estimate the frequency of transmission overhauls is presented. This method is based on the two-parameter Weibull statistical distribution for component life. A second method is presented to estimate the number of replacement components needed to support the transmission overhaul pattern. The second method is based on renewal theory. Confidence statistics are applied with both methods to improve the statistical estimate of sample behavior. A transmission example is also presented to illustrate the use of the methods. Transmission overhaul frequency and component replacement calculations are included in the example.

  13. Implicit Statistical Learning and Language Skills in Bilingual Children

    ERIC Educational Resources Information Center

    Yim, Dongsun; Rudoy, John

    2013-01-01

    Purpose: Implicit statistical learning in 2 nonlinguistic domains (visual and auditory) was used to investigate (a) whether linguistic experience influences the underlying learning mechanism and (b) whether there are modality constraints in predicting implicit statistical learning with age and language skills. Method: Implicit statistical learning…

  14. Functional region prediction with a set of appropriate homologous sequences-an index for sequence selection by integrating structure and sequence information with spatial statistics

    PubMed Central

    2012-01-01

    Background The detection of conserved residue clusters on a protein structure is one of the effective strategies for the prediction of functional protein regions. Various methods, such as Evolutionary Trace, have been developed based on this strategy. In such approaches, the conserved residues are identified through comparisons of homologous amino acid sequences. Therefore, the selection of homologous sequences is a critical step. It is empirically known that a certain degree of sequence divergence in the set of homologous sequences is required for the identification of conserved residues. However, the development of a method to select homologous sequences appropriate for the identification of conserved residues has not been sufficiently addressed. An objective and general method to select appropriate homologous sequences is desired for the efficient prediction of functional regions. Results We have developed a novel index to select the sequences appropriate for the identification of conserved residues, and implemented the index within our method to predict the functional regions of a protein. The implementation of the index improved the performance of the functional region prediction. The index represents the degree of conserved residue clustering on the tertiary structure of the protein. For this purpose, the structure and sequence information were integrated within the index by the application of spatial statistics. Spatial statistics is a field of statistics in which not only the attributes but also the geometrical coordinates of the data are considered simultaneously. Higher degrees of clustering generate larger index scores. We adopted the set of homologous sequences with the highest index score, under the assumption that the best prediction accuracy is obtained when the degree of clustering is the maximum. The set of sequences selected by the index led to higher functional region prediction performance than the sets of sequences selected by other sequence-based methods. Conclusions Appropriate homologous sequences are selected automatically and objectively by the index. Such sequence selection improved the performance of functional region prediction. As far as we know, this is the first approach in which spatial statistics have been applied to protein analyses. Such integration of structure and sequence information would be useful for other bioinformatics problems. PMID:22643026

  15. Predicting Acquisition of Learning Outcomes: A Comparison of Traditional and Activity-Based Instruction in an Introductory Statistics Course.

    ERIC Educational Resources Information Center

    Geske, Jenenne A.; Mickelson, William T.; Bandalos, Deborah L.; Jonson, Jessica; Smith, Russell W.

    The bulk of experimental research related to reforms in the teaching of statistics concentrates on the effects of alternative teaching methods on statistics achievement. This study expands on that research by including an examination of the effects of instructor and the interaction between instructor and method on achievement as well as attitudes,…

  16. The Shock and Vibration Bulletin. Part 2. Invited Papers, Structural Dynamics

    DTIC Science & Technology

    1974-08-01

    VIKING LANDER DYNAMICS 41 Mr. Joseph C. Pohlen, Martin Marietta Aerospace, Denver, Colorado Structural Dynamics PERFORMANCE OF STATISTICAL ENERGY ANALYSIS 47...aerospace structures. Analytical prediction of these environments is beyond the current scope of classical modal techniques. Statistical energy analysis methods...have been developed that circumvent the difficulties of high-frequency nodal analysis. These statistical energy analysis methods are evaluated

  17. The Shock and Vibration Digest. Volume 13. Number 7

    DTIC Science & Technology

    1981-07-01

    Richards, ISVR, University of Southampton Presidential Address "A Structural Dynamicist Looks at Statistical Energy Analysis " Professor B.L...excitation and for random and sine sweep mechanical excitation. Test data were used to assess prediction methods, in particular a statistical energy analysis method

  18. Using string invariants for prediction searching for optimal parameters

    NASA Astrophysics Data System (ADS)

    Bundzel, Marek; Kasanický, Tomáš; Pinčák, Richard

    2016-02-01

    We have developed a novel prediction method based on string invariants. The method does not require learning but a small set of parameters must be set to achieve optimal performance. We have implemented an evolutionary algorithm for the parametric optimization. We have tested the performance of the method on artificial and real world data and compared the performance to statistical methods and to a number of artificial intelligence methods. We have used data and the results of a prediction competition as a benchmark. The results show that the method performs well in single step prediction but the method's performance for multiple step prediction needs to be improved. The method works well for a wide range of parameters.

  19. Life beyond MSE and R2 — improving validation of predictive models with observations

    NASA Astrophysics Data System (ADS)

    Papritz, Andreas; Nussbaum, Madlene

    2017-04-01

    Machine learning and statistical predictive methods are evaluated by the closeness of predictions to observations of a test dataset. Common criteria for rating predictive methods are bias and mean square error (MSE), characterizing systematic and random prediction errors. Many studies also report R2-values, but their meaning is not always clear (correlation between observations and predictions or MSE skill score; Wilks, 2011). The same criteria are also used for choosing tuning parameters of predictive procedures by cross-validation and bagging (e.g. Hastie et al., 2009). For evident reasons, atmospheric sciences have developed a rich box of tools for forecast verification. Specific criteria have been proposed for evaluating deterministic and probabilistic predictions of binary, multinomial, ordinal and continuous responses (see reviews by Wilks, 2011, Jollie and Stephenson, 2012 and Gneiting et al., 2007). It appears that these techniques are not very well-known in the geosciences community interested in machine learning. In our presentation we review techniques that offer more insight into proximity of data and predictions than bias, MSE and R2 alone. We mention here only examples: (i) Graphing observations vs. predictions is usually more appropriate than the reverse (Piñeiro et al., 2008). (ii) The decomposition of the Brier score score (= MSE for probabilistic predictions of binary yes/no data) into reliability and resolution reveals (conditional) bias and capability of discriminating yes/no observations by the predictions. We illustrate the approaches by applications from digital soil mapping studies. Gneiting, T., Balabdaoui, F., and Raftery, A. E. (2007). Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society Series B, 69, 243-268. Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning; Data Mining, Inference and Prediction. Springer, New York, second edition. Jolliffe, I. T. and Stephenson, D. B., editors (2012). Forecast Verification: A Practitioner's Guide in Atmospheric Science. Wiley-Blackwell, second edition. Piñeiro, G., Perelman, S., Guerschman, J., and Paruelo, J. (2008). How to evaluate models: Observed vs. predicted or predicted vs. observed? Ecological Modelling, 216, 316-322. Wilks, D. S. (2011). Statistical Methods in the Atmospheric Sciences. Academic Press, third edition.

  20. In vivo serial MRI-based models and statistical methods to quantify sensitivity and specificity of mechanical predictors for carotid plaque rupture: location and beyond.

    PubMed

    Wu, Zheyang; Yang, Chun; Tang, Dalin

    2011-06-01

    It has been hypothesized that mechanical risk factors may be used to predict future atherosclerotic plaque rupture. Truly predictive methods for plaque rupture and methods to identify the best predictor(s) from all the candidates are lacking in the literature. A novel combination of computational and statistical models based on serial magnetic resonance imaging (MRI) was introduced to quantify sensitivity and specificity of mechanical predictors to identify the best candidate for plaque rupture site prediction. Serial in vivo MRI data of carotid plaque from one patient was acquired with follow-up scan showing ulceration. 3D computational fluid-structure interaction (FSI) models using both baseline and follow-up data were constructed and plaque wall stress (PWS) and strain (PWSn) and flow maximum shear stress (FSS) were extracted from all 600 matched nodal points (100 points per matched slice, baseline matching follow-up) on the lumen surface for analysis. Each of the 600 points was marked "ulcer" or "nonulcer" using follow-up scan. Predictive statistical models for each of the seven combinations of PWS, PWSn, and FSS were trained using the follow-up data and applied to the baseline data to assess their sensitivity and specificity using the 600 data points for ulcer predictions. Sensitivity of prediction is defined as the proportion of the true positive outcomes that are predicted to be positive. Specificity of prediction is defined as the proportion of the true negative outcomes that are correctly predicted to be negative. Using probability 0.3 as a threshold to infer ulcer occurrence at the prediction stage, the combination of PWS and PWSn provided the best predictive accuracy with (sensitivity, specificity) = (0.97, 0.958). Sensitivity and specificity given by PWS, PWSn, and FSS individually were (0.788, 0.968), (0.515, 0.968), and (0.758, 0.928), respectively. The proposed computational-statistical process provides a novel method and a framework to assess the sensitivity and specificity of various risk indicators and offers the potential to identify the optimized predictor for plaque rupture using serial MRI with follow-up scan showing ulceration as the gold standard for method validation. While serial MRI data with actual rupture are hard to acquire, this single-case study suggests that combination of multiple predictors may provide potential improvement to existing plaque assessment schemes. With large-scale patient studies, this predictive modeling process may provide more solid ground for rupture predictor selection strategies and methods for image-based plaque vulnerability assessment.

  1. A Seasonal Time-Series Model Based on Gene Expression Programming for Predicting Financial Distress

    PubMed Central

    2018-01-01

    The issue of financial distress prediction plays an important and challenging research topic in the financial field. Currently, there have been many methods for predicting firm bankruptcy and financial crisis, including the artificial intelligence and the traditional statistical methods, and the past studies have shown that the prediction result of the artificial intelligence method is better than the traditional statistical method. Financial statements are quarterly reports; hence, the financial crisis of companies is seasonal time-series data, and the attribute data affecting the financial distress of companies is nonlinear and nonstationary time-series data with fluctuations. Therefore, this study employed the nonlinear attribute selection method to build a nonlinear financial distress prediction model: that is, this paper proposed a novel seasonal time-series gene expression programming model for predicting the financial distress of companies. The proposed model has several advantages including the following: (i) the proposed model is different from the previous models lacking the concept of time series; (ii) the proposed integrated attribute selection method can find the core attributes and reduce high dimensional data; and (iii) the proposed model can generate the rules and mathematical formulas of financial distress for providing references to the investors and decision makers. The result shows that the proposed method is better than the listing classifiers under three criteria; hence, the proposed model has competitive advantages in predicting the financial distress of companies. PMID:29765399

  2. A Seasonal Time-Series Model Based on Gene Expression Programming for Predicting Financial Distress.

    PubMed

    Cheng, Ching-Hsue; Chan, Chia-Pang; Yang, Jun-He

    2018-01-01

    The issue of financial distress prediction plays an important and challenging research topic in the financial field. Currently, there have been many methods for predicting firm bankruptcy and financial crisis, including the artificial intelligence and the traditional statistical methods, and the past studies have shown that the prediction result of the artificial intelligence method is better than the traditional statistical method. Financial statements are quarterly reports; hence, the financial crisis of companies is seasonal time-series data, and the attribute data affecting the financial distress of companies is nonlinear and nonstationary time-series data with fluctuations. Therefore, this study employed the nonlinear attribute selection method to build a nonlinear financial distress prediction model: that is, this paper proposed a novel seasonal time-series gene expression programming model for predicting the financial distress of companies. The proposed model has several advantages including the following: (i) the proposed model is different from the previous models lacking the concept of time series; (ii) the proposed integrated attribute selection method can find the core attributes and reduce high dimensional data; and (iii) the proposed model can generate the rules and mathematical formulas of financial distress for providing references to the investors and decision makers. The result shows that the proposed method is better than the listing classifiers under three criteria; hence, the proposed model has competitive advantages in predicting the financial distress of companies.

  3. Binary recursive partitioning: background, methods, and application to psychology.

    PubMed

    Merkle, Edgar C; Shaffer, Victoria A

    2011-02-01

    Binary recursive partitioning (BRP) is a computationally intensive statistical method that can be used in situations where linear models are often used. Instead of imposing many assumptions to arrive at a tractable statistical model, BRP simply seeks to accurately predict a response variable based on values of predictor variables. The method outputs a decision tree depicting the predictor variables that were related to the response variable, along with the nature of the variables' relationships. No significance tests are involved, and the tree's 'goodness' is judged based on its predictive accuracy. In this paper, we describe BRP methods in a detailed manner and illustrate their use in psychological research. We also provide R code for carrying out the methods.

  4. Prediction of biomechanical parameters of the proximal femur using statistical appearance models and support vector regression.

    PubMed

    Fritscher, Karl; Schuler, Benedikt; Link, Thomas; Eckstein, Felix; Suhm, Norbert; Hänni, Markus; Hengg, Clemens; Schubert, Rainer

    2008-01-01

    Fractures of the proximal femur are one of the principal causes of mortality among elderly persons. Traditional methods for the determination of femoral fracture risk use methods for measuring bone mineral density. However, BMD alone is not sufficient to predict bone failure load for an individual patient and additional parameters have to be determined for this purpose. In this work an approach that uses statistical models of appearance to identify relevant regions and parameters for the prediction of biomechanical properties of the proximal femur will be presented. By using Support Vector Regression the proposed model based approach is capable of predicting two different biomechanical parameters accurately and fully automatically in two different testing scenarios.

  5. Predictive onboard flow control for packet switching satellites

    NASA Technical Reports Server (NTRS)

    Bobinsky, Eric A.

    1992-01-01

    We outline two alternate approaches to predicting the onset of congestion in a packet switching satellite, and argue that predictive, rather than reactive, flow control is necessary for the efficient operation of such a system. The first method discussed is based on standard, statistical techniques which are used to periodically calculate a probability of near-term congestion based on arrival rate statistics. If this probability exceeds a present threshold, the satellite would transmit a rate-reduction signal to all active ground stations. The second method discussed would utilize a neural network to periodically predict the occurrence of buffer overflow based on input data which would include, in addition to arrival rates, the distributions of packet lengths, source addresses, and destination addresses.

  6. Seasonal Drought Prediction: Advances, Challenges, and Future Prospects

    NASA Astrophysics Data System (ADS)

    Hao, Zengchao; Singh, Vijay P.; Xia, Youlong

    2018-03-01

    Drought prediction is of critical importance to early warning for drought managements. This review provides a synthesis of drought prediction based on statistical, dynamical, and hybrid methods. Statistical drought prediction is achieved by modeling the relationship between drought indices of interest and a suite of potential predictors, including large-scale climate indices, local climate variables, and land initial conditions. Dynamical meteorological drought prediction relies on seasonal climate forecast from general circulation models (GCMs), which can be employed to drive hydrological models for agricultural and hydrological drought prediction with the predictability determined by both climate forcings and initial conditions. Challenges still exist in drought prediction at long lead time and under a changing environment resulting from natural and anthropogenic factors. Future research prospects to improve drought prediction include, but are not limited to, high-quality data assimilation, improved model development with key processes related to drought occurrence, optimal ensemble forecast to select or weight ensembles, and hybrid drought prediction to merge statistical and dynamical forecasts.

  7. Statistical energy analysis computer program, user's guide

    NASA Technical Reports Server (NTRS)

    Trudell, R. W.; Yano, L. I.

    1981-01-01

    A high frequency random vibration analysis, (statistical energy analysis (SEA) method) is examined. The SEA method accomplishes high frequency prediction of arbitrary structural configurations. A general SEA computer program is described. A summary of SEA theory, example problems of SEA program application, and complete program listing are presented.

  8. The forecasting of menstruation based on a state-space modeling of basal body temperature time series.

    PubMed

    Fukaya, Keiichi; Kawamori, Ai; Osada, Yutaka; Kitazawa, Masumi; Ishiguro, Makio

    2017-09-20

    Women's basal body temperature (BBT) shows a periodic pattern that associates with menstrual cycle. Although this fact suggests a possibility that daily BBT time series can be useful for estimating the underlying phase state as well as for predicting the length of current menstrual cycle, little attention has been paid to model BBT time series. In this study, we propose a state-space model that involves the menstrual phase as a latent state variable to explain the daily fluctuation of BBT and the menstruation cycle length. Conditional distributions of the phase are obtained by using sequential Bayesian filtering techniques. A predictive distribution of the next menstruation day can be derived based on this conditional distribution and the model, leading to a novel statistical framework that provides a sequentially updated prediction for upcoming menstruation day. We applied this framework to a real data set of women's BBT and menstruation days and compared prediction accuracy of the proposed method with that of previous methods, showing that the proposed method generally provides a better prediction. Because BBT can be obtained with relatively small cost and effort, the proposed method can be useful for women's health management. Potential extensions of this framework as the basis of modeling and predicting events that are associated with the menstrual cycles are discussed. © 2017 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd. © 2017 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd.

  9. Quantifying prognosis with risk predictions.

    PubMed

    Pace, Nathan L; Eberhart, Leopold H J; Kranke, Peter R

    2012-01-01

    Prognosis is a forecast, based on present observations in a patient, of their probable outcome from disease, surgery and so on. Research methods for the development of risk probabilities may not be familiar to some anaesthesiologists. We briefly describe methods for identifying risk factors and risk scores. A probability prediction rule assigns a risk probability to a patient for the occurrence of a specific event. Probability reflects the continuum between absolute certainty (Pi = 1) and certified impossibility (Pi = 0). Biomarkers and clinical covariates that modify risk are known as risk factors. The Pi as modified by risk factors can be estimated by identifying the risk factors and their weighting; these are usually obtained by stepwise logistic regression. The accuracy of probabilistic predictors can be separated into the concepts of 'overall performance', 'discrimination' and 'calibration'. Overall performance is the mathematical distance between predictions and outcomes. Discrimination is the ability of the predictor to rank order observations with different outcomes. Calibration is the correctness of prediction probabilities on an absolute scale. Statistical methods include the Brier score, coefficient of determination (Nagelkerke R2), C-statistic and regression calibration. External validation is the comparison of the actual outcomes to the predicted outcomes in a new and independent patient sample. External validation uses the statistical methods of overall performance, discrimination and calibration and is uniformly recommended before acceptance of the prediction model. Evidence from randomised controlled clinical trials should be obtained to show the effectiveness of risk scores for altering patient management and patient outcomes.

  10. Usefulness and limitations of various guinea-pig test methods in detecting human skin sensitizers-validation of guinea-pig tests for skin hypersensitivity.

    PubMed

    Marzulli, F; Maguire, H C

    1982-02-01

    Several guinea-pig predictive test methods were evaluated by comparison of results with those obtained with human predictive tests, using ten compounds that have been used in cosmetics. The method involves the statistical analysis of the frequency with which guinea-pig tests agree with the findings of tests in humans. In addition, the frequencies of false positive and false negative predictive findings are considered and statistically analysed. The results clearly demonstrate the superiority of adjuvant tests (complete Freund's adjuvant) in determining skin sensitizers and the overall superiority of the guinea-pig maximization test in providing results similar to those obtained by human testing. A procedure is suggested for utilizing adjuvant and non-adjuvant test methods for characterizing compounds as of weak, moderate or strong sensitizing potential.

  11. Forecasting runout of rock and debris avalanches

    USGS Publications Warehouse

    Iverson, Richard M.; Evans, S.G.; Mugnozza, G.S.; Strom, A.; Hermanns, R.L.

    2006-01-01

    Physically based mathematical models and statistically based empirical equations each may provide useful means of forecasting runout of rock and debris avalanches. This paper compares the foundations, strengths, and limitations of a physically based model and a statistically based forecasting method, both of which were developed to predict runout across three-dimensional topography. The chief advantage of the physically based model results from its ties to physical conservation laws and well-tested axioms of soil and rock mechanics, such as the Coulomb friction rule and effective-stress principle. The output of this model provides detailed information about the dynamics of avalanche runout, at the expense of high demands for accurate input data, numerical computation, and experimental testing. In comparison, the statistical method requires relatively modest computation and no input data except identification of prospective avalanche source areas and a range of postulated avalanche volumes. Like the physically based model, the statistical method yields maps of predicted runout, but it provides no information on runout dynamics. Although the two methods differ significantly in their structure and objectives, insights gained from one method can aid refinement of the other.

  12. Comparison of classical statistical methods and artificial neural network in traffic noise prediction

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Nedic, Vladimir, E-mail: vnedic@kg.ac.rs; Despotovic, Danijela, E-mail: ddespotovic@kg.ac.rs; Cvetanovic, Slobodan, E-mail: slobodan.cvetanovic@eknfak.ni.ac.rs

    2014-11-15

    Traffic is the main source of noise in urban environments and significantly affects human mental and physical health and labor productivity. Therefore it is very important to model the noise produced by various vehicles. Techniques for traffic noise prediction are mainly based on regression analysis, which generally is not good enough to describe the trends of noise. In this paper the application of artificial neural networks (ANNs) for the prediction of traffic noise is presented. As input variables of the neural network, the proposed structure of the traffic flow and the average speed of the traffic flow are chosen. Themore » output variable of the network is the equivalent noise level in the given time period L{sub eq}. Based on these parameters, the network is modeled, trained and tested through a comparative analysis of the calculated values and measured levels of traffic noise using the originally developed user friendly software package. It is shown that the artificial neural networks can be a useful tool for the prediction of noise with sufficient accuracy. In addition, the measured values were also used to calculate equivalent noise level by means of classical methods, and comparative analysis is given. The results clearly show that ANN approach is superior in traffic noise level prediction to any other statistical method. - Highlights: • We proposed an ANN model for prediction of traffic noise. • We developed originally designed user friendly software package. • The results are compared with classical statistical methods. • The results are much better predictive capabilities of ANN model.« less

  13. Estimating current and future streamflow characteristics at ungaged sites, central and eastern Montana, with application to evaluating effects of climate change on fish populations

    USGS Publications Warehouse

    Sando, Roy; Chase, Katherine J.

    2017-03-23

    A common statistical procedure for estimating streamflow statistics at ungaged locations is to develop a relational model between streamflow and drainage basin characteristics at gaged locations using least squares regression analysis; however, least squares regression methods are parametric and make constraining assumptions about the data distribution. The random forest regression method provides an alternative nonparametric method for estimating streamflow characteristics at ungaged sites and requires that the data meet fewer statistical conditions than least squares regression methods.Random forest regression analysis was used to develop predictive models for 89 streamflow characteristics using Precipitation-Runoff Modeling System simulated streamflow data and drainage basin characteristics at 179 sites in central and eastern Montana. The predictive models were developed from streamflow data simulated for current (baseline, water years 1982–99) conditions and three future periods (water years 2021–38, 2046–63, and 2071–88) under three different climate-change scenarios. These predictive models were then used to predict streamflow characteristics for baseline conditions and three future periods at 1,707 fish sampling sites in central and eastern Montana. The average root mean square error for all predictive models was about 50 percent. When streamflow predictions at 23 fish sampling sites were compared to nearby locations with simulated data, the mean relative percent difference was about 43 percent. When predictions were compared to streamflow data recorded at 21 U.S. Geological Survey streamflow-gaging stations outside of the calibration basins, the average mean absolute percent error was about 73 percent.

  14. Combination of statistical and physically based methods to assess shallow slide susceptibility at the basin scale

    NASA Astrophysics Data System (ADS)

    Oliveira, Sérgio C.; Zêzere, José L.; Lajas, Sara; Melo, Raquel

    2017-07-01

    Approaches used to assess shallow slide susceptibility at the basin scale are conceptually different depending on the use of statistical or physically based methods. The former are based on the assumption that the same causes are more likely to produce the same effects, whereas the latter are based on the comparison between forces which tend to promote movement along the slope and the counteracting forces that are resistant to motion. Within this general framework, this work tests two hypotheses: (i) although conceptually and methodologically distinct, the statistical and deterministic methods generate similar shallow slide susceptibility results regarding the model's predictive capacity and spatial agreement; and (ii) the combination of shallow slide susceptibility maps obtained with statistical and physically based methods, for the same study area, generate a more reliable susceptibility model for shallow slide occurrence. These hypotheses were tested at a small test site (13.9 km2) located north of Lisbon (Portugal), using a statistical method (the information value method, IV) and a physically based method (the infinite slope method, IS). The landslide susceptibility maps produced with the statistical and deterministic methods were combined into a new landslide susceptibility map. The latter was based on a set of integration rules defined by the cross tabulation of the susceptibility classes of both maps and analysis of the corresponding contingency tables. The results demonstrate a higher predictive capacity of the new shallow slide susceptibility map, which combines the independent results obtained with statistical and physically based models. Moreover, the combination of the two models allowed the identification of areas where the results of the information value and the infinite slope methods are contradictory. Thus, these areas were classified as uncertain and deserve additional investigation at a more detailed scale.

  15. Predicting Success in Psychological Statistics Courses.

    PubMed

    Lester, David

    2016-06-01

    Many students perform poorly in courses on psychological statistics, and it is useful to be able to predict which students will have difficulties. In a study of 93 undergraduates enrolled in Statistical Methods (18 men, 75 women; M age = 22.0 years, SD = 5.1), performance was significantly associated with sex (female students performed better) and proficiency in algebra in a linear regression analysis. Anxiety about statistics was not associated with course performance, indicating that basic mathematical skills are the best correlate for performance in statistics courses and can usefully be used to stream students into classes by ability. © The Author(s) 2016.

  16. Towards good practice for health statistics: lessons from the Millennium Development Goal health indicators.

    PubMed

    Murray, Christopher J L

    2007-03-10

    Health statistics are at the centre of an increasing number of worldwide health controversies. Several factors are sharpening the tension between the supply and demand for high quality health information, and the health-related Millennium Development Goals (MDGs) provide a high-profile example. With thousands of indicators recommended but few measured well, the worldwide health community needs to focus its efforts on improving measurement of a small set of priority areas. Priority indicators should be selected on the basis of public-health significance and several dimensions of measurability. Health statistics can be divided into three types: crude, corrected, and predicted. Health statistics are necessary inputs to planning and strategic decision making, programme implementation, monitoring progress towards targets, and assessment of what works and what does not. Crude statistics that are biased have no role in any of these steps; corrected statistics are preferred. For strategic decision making, when corrected statistics are unavailable, predicted statistics can play an important part. For monitoring progress towards agreed targets and assessment of what works and what does not, however, predicted statistics should not be used. Perhaps the most effective method to decrease controversy over health statistics and to encourage better primary data collection and the development of better analytical methods is a strong commitment to provision of an explicit data audit trail. This initiative would make available the primary data, all post-data collection adjustments, models including covariates used for farcasting and forecasting, and necessary documentation to the public.

  17. Uncertainty propagation for statistical impact prediction of space debris

    NASA Astrophysics Data System (ADS)

    Hoogendoorn, R.; Mooij, E.; Geul, J.

    2018-01-01

    Predictions of the impact time and location of space debris in a decaying trajectory are highly influenced by uncertainties. The traditional Monte Carlo (MC) method can be used to perform accurate statistical impact predictions, but requires a large computational effort. A method is investigated that directly propagates a Probability Density Function (PDF) in time, which has the potential to obtain more accurate results with less computational effort. The decaying trajectory of Delta-K rocket stages was used to test the methods using a six degrees-of-freedom state model. The PDF of the state of the body was propagated in time to obtain impact-time distributions. This Direct PDF Propagation (DPP) method results in a multi-dimensional scattered dataset of the PDF of the state, which is highly challenging to process. No accurate results could be obtained, because of the structure of the DPP data and the high dimensionality. Therefore, the DPP method is less suitable for practical uncontrolled entry problems and the traditional MC method remains superior. Additionally, the MC method was used with two improved uncertainty models to obtain impact-time distributions, which were validated using observations of true impacts. For one of the two uncertainty models, statistically more valid impact-time distributions were obtained than in previous research.

  18. Comparison of climate envelope models developed using expert-selected variables versus statistical selection

    USGS Publications Warehouse

    Brandt, Laura A.; Benscoter, Allison; Harvey, Rebecca G.; Speroterra, Carolina; Bucklin, David N.; Romañach, Stephanie; Watling, James I.; Mazzotti, Frank J.

    2017-01-01

    Climate envelope models are widely used to describe potential future distribution of species under different climate change scenarios. It is broadly recognized that there are both strengths and limitations to using climate envelope models and that outcomes are sensitive to initial assumptions, inputs, and modeling methods Selection of predictor variables, a central step in modeling, is one of the areas where different techniques can yield varying results. Selection of climate variables to use as predictors is often done using statistical approaches that develop correlations between occurrences and climate data. These approaches have received criticism in that they rely on the statistical properties of the data rather than directly incorporating biological information about species responses to temperature and precipitation. We evaluated and compared models and prediction maps for 15 threatened or endangered species in Florida based on two variable selection techniques: expert opinion and a statistical method. We compared model performance between these two approaches for contemporary predictions, and the spatial correlation, spatial overlap and area predicted for contemporary and future climate predictions. In general, experts identified more variables as being important than the statistical method and there was low overlap in the variable sets (<40%) between the two methods Despite these differences in variable sets (expert versus statistical), models had high performance metrics (>0.9 for area under the curve (AUC) and >0.7 for true skill statistic (TSS). Spatial overlap, which compares the spatial configuration between maps constructed using the different variable selection techniques, was only moderate overall (about 60%), with a great deal of variability across species. Difference in spatial overlap was even greater under future climate projections, indicating additional divergence of model outputs from different variable selection techniques. Our work is in agreement with other studies which have found that for broad-scale species distribution modeling, using statistical methods of variable selection is a useful first step, especially when there is a need to model a large number of species or expert knowledge of the species is limited. Expert input can then be used to refine models that seem unrealistic or for species that experts believe are particularly sensitive to change. It also emphasizes the importance of using multiple models to reduce uncertainty and improve map outputs for conservation planning. Where outputs overlap or show the same direction of change there is greater certainty in the predictions. Areas of disagreement can be used for learning by asking why the models do not agree, and may highlight areas where additional on-the-ground data collection could improve the models.

  19. Predicting recreational water quality advisories: A comparison of statistical methods

    USGS Publications Warehouse

    Brooks, Wesley R.; Corsi, Steven R.; Fienen, Michael N.; Carvin, Rebecca B.

    2016-01-01

    Epidemiological studies indicate that fecal indicator bacteria (FIB) in beach water are associated with illnesses among people having contact with the water. In order to mitigate public health impacts, many beaches are posted with an advisory when the concentration of FIB exceeds a beach action value. The most commonly used method of measuring FIB concentration takes 18–24 h before returning a result. In order to avoid the 24 h lag, it has become common to ”nowcast” the FIB concentration using statistical regressions on environmental surrogate variables. Most commonly, nowcast models are estimated using ordinary least squares regression, but other regression methods from the statistical and machine learning literature are sometimes used. This study compares 14 regression methods across 7 Wisconsin beaches to identify which consistently produces the most accurate predictions. A random forest model is identified as the most accurate, followed by multiple regression fit using the adaptive LASSO.

  20. Travelogue--a newcomer encounters statistics and the computer.

    PubMed

    Bruce, Peter

    2011-11-01

    Computer-intensive methods have revolutionized statistics, giving rise to new areas of analysis and expertise in predictive analytics, image processing, pattern recognition, machine learning, genomic analysis, and more. Interest naturally centers on the new capabilities the computer allows the analyst to bring to the table. This article, instead, focuses on the account of how computer-based resampling methods, with their relative simplicity and transparency, enticed one individual, untutored in statistics or mathematics, on a long journey into learning statistics, then teaching it, then starting an education institution.

  1. Evaluating concentration estimation errors in ELISA microarray experiments

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Daly, Don S.; White, Amanda M.; Varnum, Susan M.

    Enzyme-linked immunosorbent assay (ELISA) is a standard immunoassay to predict a protein concentration in a sample. Deploying ELISA in a microarray format permits simultaneous prediction of the concentrations of numerous proteins in a small sample. These predictions, however, are uncertain due to processing error and biological variability. Evaluating prediction error is critical to interpreting biological significance and improving the ELISA microarray process. Evaluating prediction error must be automated to realize a reliable high-throughput ELISA microarray system. Methods: In this paper, we present a statistical method based on propagation of error to evaluate prediction errors in the ELISA microarray process. Althoughmore » propagation of error is central to this method, it is effective only when comparable data are available. Therefore, we briefly discuss the roles of experimental design, data screening, normalization and statistical diagnostics when evaluating ELISA microarray prediction errors. We use an ELISA microarray investigation of breast cancer biomarkers to illustrate the evaluation of prediction errors. The illustration begins with a description of the design and resulting data, followed by a brief discussion of data screening and normalization. In our illustration, we fit a standard curve to the screened and normalized data, review the modeling diagnostics, and apply propagation of error.« less

  2. Statistical Methods for Rapid Aerothermal Analysis and Design Technology: Validation

    NASA Technical Reports Server (NTRS)

    DePriest, Douglas; Morgan, Carolyn

    2003-01-01

    The cost and safety goals for NASA s next generation of reusable launch vehicle (RLV) will require that rapid high-fidelity aerothermodynamic design tools be used early in the design cycle. To meet these requirements, it is desirable to identify adequate statistical models that quantify and improve the accuracy, extend the applicability, and enable combined analyses using existing prediction tools. The initial research work focused on establishing suitable candidate models for these purposes. The second phase is focused on assessing the performance of these models to accurately predict the heat rate for a given candidate data set. This validation work compared models and methods that may be useful in predicting the heat rate.

  3. Improved accuracy of supervised CRM discovery with interpolated Markov models and cross-species comparison

    PubMed Central

    Kazemian, Majid; Zhu, Qiyun; Halfon, Marc S.; Sinha, Saurabh

    2011-01-01

    Despite recent advances in experimental approaches for identifying transcriptional cis-regulatory modules (CRMs, ‘enhancers’), direct empirical discovery of CRMs for all genes in all cell types and environmental conditions is likely to remain an elusive goal. Effective methods for computational CRM discovery are thus a critically needed complement to empirical approaches. However, existing computational methods that search for clusters of putative binding sites are ineffective if the relevant TFs and/or their binding specificities are unknown. Here, we provide a significantly improved method for ‘motif-blind’ CRM discovery that does not depend on knowledge or accurate prediction of TF-binding motifs and is effective when limited knowledge of functional CRMs is available to ‘supervise’ the search. We propose a new statistical method, based on ‘Interpolated Markov Models’, for motif-blind, genome-wide CRM discovery. It captures the statistical profile of variable length words in known CRMs of a regulatory network and finds candidate CRMs that match this profile. The method also uses orthologs of the known CRMs from closely related genomes. We perform in silico evaluation of predicted CRMs by assessing whether their neighboring genes are enriched for the expected expression patterns. This assessment uses a novel statistical test that extends the widely used Hypergeometric test of gene set enrichment to account for variability in intergenic lengths. We find that the new CRM prediction method is superior to existing methods. Finally, we experimentally validate 12 new CRM predictions by examining their regulatory activity in vivo in Drosophila; 10 of the tested CRMs were found to be functional, while 6 of the top 7 predictions showed the expected activity patterns. We make our program available as downloadable source code, and as a plugin for a genome browser installed on our servers. PMID:21821659

  4. Does rational selection of training and test sets improve the outcome of QSAR modeling?

    PubMed

    Martin, Todd M; Harten, Paul; Young, Douglas M; Muratov, Eugene N; Golbraikh, Alexander; Zhu, Hao; Tropsha, Alexander

    2012-10-22

    Prior to using a quantitative structure activity relationship (QSAR) model for external predictions, its predictive power should be established and validated. In the absence of a true external data set, the best way to validate the predictive ability of a model is to perform its statistical external validation. In statistical external validation, the overall data set is divided into training and test sets. Commonly, this splitting is performed using random division. Rational splitting methods can divide data sets into training and test sets in an intelligent fashion. The purpose of this study was to determine whether rational division methods lead to more predictive models compared to random division. A special data splitting procedure was used to facilitate the comparison between random and rational division methods. For each toxicity end point, the overall data set was divided into a modeling set (80% of the overall set) and an external evaluation set (20% of the overall set) using random division. The modeling set was then subdivided into a training set (80% of the modeling set) and a test set (20% of the modeling set) using rational division methods and by using random division. The Kennard-Stone, minimal test set dissimilarity, and sphere exclusion algorithms were used as the rational division methods. The hierarchical clustering, random forest, and k-nearest neighbor (kNN) methods were used to develop QSAR models based on the training sets. For kNN QSAR, multiple training and test sets were generated, and multiple QSAR models were built. The results of this study indicate that models based on rational division methods generate better statistical results for the test sets than models based on random division, but the predictive power of both types of models are comparable.

  5. Statistical Method Based on Confidence and Prediction Regions for Analysis of Volatile Organic Compounds in Human Breath Gas

    NASA Astrophysics Data System (ADS)

    Wimmer, G.

    2008-01-01

    In this paper we introduce two confidence and two prediction regions for statistical characterization of concentration measurements of product ions in order to discriminate various groups of persons for prospective better detection of primary lung cancer. Two MATLAB algorithms have been created for more adequate description of concentration measurements of volatile organic compounds in human breath gas for potential detection of primary lung cancer and for evaluation of the appropriate confidence and prediction regions.

  6. Statistical Relational Learning to Predict Primary Myocardial Infarction from Electronic Health Records

    PubMed Central

    Weiss, Jeremy C; Page, David; Peissig, Peggy L; Natarajan, Sriraam; McCarty, Catherine

    2013-01-01

    Electronic health records (EHRs) are an emerging relational domain with large potential to improve clinical outcomes. We apply two statistical relational learning (SRL) algorithms to the task of predicting primary myocardial infarction. We show that one SRL algorithm, relational functional gradient boosting, outperforms propositional learners particularly in the medically-relevant high recall region. We observe that both SRL algorithms predict outcomes better than their propositional analogs and suggest how our methods can augment current epidemiological practices. PMID:25360347

  7. CADDIS Volume 4. Data Analysis: Predicting Environmental Conditions from Biological Observations (PECBO Appendix)

    EPA Pesticide Factsheets

    Overview of PECBO Module, using scripts to infer environmental conditions from biological observations, statistically estimating species-environment relationships, methods for inferring environmental conditions, statistical scripts in module.

  8. Evaluation and comparison of statistical methods for early temporal detection of outbreaks: A simulation-based study

    PubMed Central

    Le Strat, Yann

    2017-01-01

    The objective of this paper is to evaluate a panel of statistical algorithms for temporal outbreak detection. Based on a large dataset of simulated weekly surveillance time series, we performed a systematic assessment of 21 statistical algorithms, 19 implemented in the R package surveillance and two other methods. We estimated false positive rate (FPR), probability of detection (POD), probability of detection during the first week, sensitivity, specificity, negative and positive predictive values and F1-measure for each detection method. Then, to identify the factors associated with these performance measures, we ran multivariate Poisson regression models adjusted for the characteristics of the simulated time series (trend, seasonality, dispersion, outbreak sizes, etc.). The FPR ranged from 0.7% to 59.9% and the POD from 43.3% to 88.7%. Some methods had a very high specificity, up to 99.4%, but a low sensitivity. Methods with a high sensitivity (up to 79.5%) had a low specificity. All methods had a high negative predictive value, over 94%, while positive predictive values ranged from 6.5% to 68.4%. Multivariate Poisson regression models showed that performance measures were strongly influenced by the characteristics of time series. Past or current outbreak size and duration strongly influenced detection performances. PMID:28715489

  9. An injury mortality prediction based on the anatomic injury scale

    PubMed Central

    Wang, Muding; Wu, Dan; Qiu, Wusi; Wang, Weimi; Zeng, Yunji; Shen, Yi

    2017-01-01

    Abstract To determine whether the injury mortality prediction (IMP) statistically outperforms the trauma mortality prediction model (TMPM) as a predictor of mortality. The TMPM is currently the best trauma score method, which is based on the anatomic injury. Its ability of mortality prediction is superior to the injury severity score (ISS) and to the new injury severity score (NISS). However, despite its statistical significance, the predictive power of TMPM needs to be further improved. Retrospective cohort study is based on the data of 1,148,359 injured patients in the National Trauma Data Bank hospitalized from 2010 to 2011. Sixty percent of the data was used to derive an empiric measure of severity of different Abbreviated Injury Scale predot codes by taking the weighted average death probabilities of trauma patients. Twenty percent of the data was used to create computing method of the IMP model. The remaining 20% of the data was used to evaluate the statistical performance of IMP and then be compared with the TMPM and the single worst injury by examining area under the receiver operating characteristic curve (ROC), the Hosmer–Lemeshow (HL) statistic, and the Akaike information criterion. IMP exhibits significantly both better discrimination (ROC-IMP, 0.903 [0.899–0.907] and ROC-TMPM, 0.890 [0.886–0.895]) and calibration (HL-IMP, 9.9 [4.4–14.7] and HL-TMPM, 197 [143–248]) compared with TMPM. All models show slight changes after the extension of age, gender, and mechanism of injury, but the extended IMP still dominated TMPM in every performance. The IMP has slight improvement in discrimination and calibration compared with the TMPM and can accurately predict mortality. Therefore, we consider it as a new feasible scoring method in trauma research. PMID:28858124

  10. Studying Individual Differences in Predictability with Gamma Regression and Nonlinear Multilevel Models

    ERIC Educational Resources Information Center

    Culpepper, Steven Andrew

    2010-01-01

    Statistical prediction remains an important tool for decisions in a variety of disciplines. An equally important issue is identifying factors that contribute to more or less accurate predictions. The time series literature includes well developed methods for studying predictability and volatility over time. This article develops…

  11. Predicting protein concentrations with ELISA microarray assays, monotonic splines and Monte Carlo simulation

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Daly, Don S.; Anderson, Kevin K.; White, Amanda M.

    Background: A microarray of enzyme-linked immunosorbent assays, or ELISA microarray, predicts simultaneously the concentrations of numerous proteins in a small sample. These predictions, however, are uncertain due to processing error and biological variability. Making sound biological inferences as well as improving the ELISA microarray process require require both concentration predictions and creditable estimates of their errors. Methods: We present a statistical method based on monotonic spline statistical models, penalized constrained least squares fitting (PCLS) and Monte Carlo simulation (MC) to predict concentrations and estimate prediction errors in ELISA microarray. PCLS restrains the flexible spline to a fit of assay intensitymore » that is a monotone function of protein concentration. With MC, both modeling and measurement errors are combined to estimate prediction error. The spline/PCLS/MC method is compared to a common method using simulated and real ELISA microarray data sets. Results: In contrast to the rigid logistic model, the flexible spline model gave credible fits in almost all test cases including troublesome cases with left and/or right censoring, or other asymmetries. For the real data sets, 61% of the spline predictions were more accurate than their comparable logistic predictions; especially the spline predictions at the extremes of the prediction curve. The relative errors of 50% of comparable spline and logistic predictions differed by less than 20%. Monte Carlo simulation rendered acceptable asymmetric prediction intervals for both spline and logistic models while propagation of error produced symmetric intervals that diverged unrealistically as the standard curves approached horizontal asymptotes. Conclusions: The spline/PCLS/MC method is a flexible, robust alternative to a logistic/NLS/propagation-of-error method to reliably predict protein concentrations and estimate their errors. The spline method simplifies model selection and fitting, and reliably estimates believable prediction errors. For the 50% of the real data sets fit well by both methods, spline and logistic predictions are practically indistinguishable, varying in accuracy by less than 15%. The spline method may be useful when automated prediction across simultaneous assays of numerous proteins must be applied routinely with minimal user intervention.« less

  12. Light-transmittance predictions under multiple-light-scattering conditions. I. Direct problem: hybrid-method approximation.

    PubMed

    Czerwiński, M; Mroczka, J; Girasole, T; Gouesbet, G; Gréhan, G

    2001-03-20

    Our aim is to present a method of predicting light transmittances through dense three-dimensional layered media. A hybrid method is introduced as a combination of the four-flux method with coefficients predicted from a Monte Carlo statistical model to take into account the actual three-dimensional geometry of the problem under study. We present the principles of the hybrid method, some exemplifying results of numerical simulations, and their comparison with results obtained from Bouguer-Lambert-Beer law and from Monte Carlo simulations.

  13. Predicting Rotator Cuff Tears Using Data Mining and Bayesian Likelihood Ratios

    PubMed Central

    Lu, Hsueh-Yi; Huang, Chen-Yuan; Su, Chwen-Tzeng; Lin, Chen-Chiang

    2014-01-01

    Objectives Rotator cuff tear is a common cause of shoulder diseases. Correct diagnosis of rotator cuff tears can save patients from further invasive, costly and painful tests. This study used predictive data mining and Bayesian theory to improve the accuracy of diagnosing rotator cuff tears by clinical examination alone. Methods In this retrospective study, 169 patients who had a preliminary diagnosis of rotator cuff tear on the basis of clinical evaluation followed by confirmatory MRI between 2007 and 2011 were identified. MRI was used as a reference standard to classify rotator cuff tears. The predictor variable was the clinical assessment results, which consisted of 16 attributes. This study employed 2 data mining methods (ANN and the decision tree) and a statistical method (logistic regression) to classify the rotator cuff diagnosis into “tear” and “no tear” groups. Likelihood ratio and Bayesian theory were applied to estimate the probability of rotator cuff tears based on the results of the prediction models. Results Our proposed data mining procedures outperformed the classic statistical method. The correction rate, sensitivity, specificity and area under the ROC curve of predicting a rotator cuff tear were statistical better in the ANN and decision tree models compared to logistic regression. Based on likelihood ratios derived from our prediction models, Fagan's nomogram could be constructed to assess the probability of a patient who has a rotator cuff tear using a pretest probability and a prediction result (tear or no tear). Conclusions Our predictive data mining models, combined with likelihood ratios and Bayesian theory, appear to be good tools to classify rotator cuff tears as well as determine the probability of the presence of the disease to enhance diagnostic decision making for rotator cuff tears. PMID:24733553

  14. The construction and assessment of a statistical model for the prediction of protein assay data.

    PubMed

    Pittman, J; Sacks, J; Young, S Stanley

    2002-01-01

    The focus of this work is the development of a statistical model for a bioinformatics database whose distinctive structure makes model assessment an interesting and challenging problem. The key components of the statistical methodology, including a fast approximation to the singular value decomposition and the use of adaptive spline modeling and tree-based methods, are described, and preliminary results are presented. These results are shown to compare favorably to selected results achieved using comparitive methods. An attempt to determine the predictive ability of the model through the use of cross-validation experiments is discussed. In conclusion a synopsis of the results of these experiments and their implications for the analysis of bioinformatic databases in general is presented.

  15. Prediction of crime occurrence from multi-modal data using deep learning

    PubMed Central

    Kang, Hyeon-Woo

    2017-01-01

    In recent years, various studies have been conducted on the prediction of crime occurrences. This predictive capability is intended to assist in crime prevention by facilitating effective implementation of police patrols. Previous studies have used data from multiple domains such as demographics, economics, and education. Their prediction models treat data from different domains equally. These methods have problems in crime occurrence prediction, such as difficulty in discovering highly nonlinear relationships, redundancies, and dependencies between multiple datasets. In order to enhance crime prediction models, we consider environmental context information, such as broken windows theory and crime prevention through environmental design. In this paper, we propose a feature-level data fusion method with environmental context based on a deep neural network (DNN). Our dataset consists of data collected from various online databases of crime statistics, demographic and meteorological data, and images in Chicago, Illinois. Prior to generating training data, we select crime-related data by conducting statistical analyses. Finally, we train our DNN, which consists of the following four kinds of layers: spatial, temporal, environmental context, and joint feature representation layers. Coupled with crucial data extracted from various domains, our fusion DNN is a product of an efficient decision-making process that statistically analyzes data redundancy. Experimental performance results show that our DNN model is more accurate in predicting crime occurrence than other prediction models. PMID:28437486

  16. Prediction of crime occurrence from multi-modal data using deep learning.

    PubMed

    Kang, Hyeon-Woo; Kang, Hang-Bong

    2017-01-01

    In recent years, various studies have been conducted on the prediction of crime occurrences. This predictive capability is intended to assist in crime prevention by facilitating effective implementation of police patrols. Previous studies have used data from multiple domains such as demographics, economics, and education. Their prediction models treat data from different domains equally. These methods have problems in crime occurrence prediction, such as difficulty in discovering highly nonlinear relationships, redundancies, and dependencies between multiple datasets. In order to enhance crime prediction models, we consider environmental context information, such as broken windows theory and crime prevention through environmental design. In this paper, we propose a feature-level data fusion method with environmental context based on a deep neural network (DNN). Our dataset consists of data collected from various online databases of crime statistics, demographic and meteorological data, and images in Chicago, Illinois. Prior to generating training data, we select crime-related data by conducting statistical analyses. Finally, we train our DNN, which consists of the following four kinds of layers: spatial, temporal, environmental context, and joint feature representation layers. Coupled with crucial data extracted from various domains, our fusion DNN is a product of an efficient decision-making process that statistically analyzes data redundancy. Experimental performance results show that our DNN model is more accurate in predicting crime occurrence than other prediction models.

  17. SPATIO-TEMPORAL MODELING OF AGRICULTURAL YIELD DATA WITH AN APPLICATION TO PRICING CROP INSURANCE CONTRACTS

    PubMed Central

    Ozaki, Vitor A.; Ghosh, Sujit K.; Goodwin, Barry K.; Shirota, Ricardo

    2009-01-01

    This article presents a statistical model of agricultural yield data based on a set of hierarchical Bayesian models that allows joint modeling of temporal and spatial autocorrelation. This method captures a comprehensive range of the various uncertainties involved in predicting crop insurance premium rates as opposed to the more traditional ad hoc, two-stage methods that are typically based on independent estimation and prediction. A panel data set of county-average yield data was analyzed for 290 counties in the State of Paraná (Brazil) for the period of 1990 through 2002. Posterior predictive criteria are used to evaluate different model specifications. This article provides substantial improvements in the statistical and actuarial methods often applied to the calculation of insurance premium rates. These improvements are especially relevant to situations where data are limited. PMID:19890450

  18. Robust multiscale prediction of Po River discharge using a twofold AR-NN approach

    NASA Astrophysics Data System (ADS)

    Alessio, Silvia; Taricco, Carla; Rubinetti, Sara; Zanchettin, Davide; Rubino, Angelo; Mancuso, Salvatore

    2017-04-01

    The Mediterranean area is among the regions most exposed to hydroclimatic changes, with a likely increase of frequency and duration of droughts in the last decades and potentially substantial future drying according to climate projections. However, significant decadal variability is often superposed or even dominates these long-term hydrological trend as observed, for instance, in North Italian precipitation and river discharge records. The capability to accurately predict such decadal changes is, therefore, of utmost environmental and social importance. In order to forecast short and noisy hydroclimatic time series, we apply a twofold statistical approach that we improved with respect to previous works [1]. Our prediction strategy consists in the application of two independent methods that use autoregressive models and feed-forward neural networks. Since all prediction methods work better on clean signals, the predictions are not performed directly on the series, but rather on each significant variability components extracted with Singular Spectrum Analysis (SSA). In this contribution, we will illustrate the multiscale prediction approach and its application to the case of decadal prediction of annual-average Po River discharges (Italy). The discharge record is available for the last 209 years and allows to work with both interannual and decadal time-scale components. Fifteen-year forecasts obtained with both methods robustly indicate a prominent dry period in the second half of the 2020s. We will discuss advantages and limitations of the proposed statistical approach in the light of the current capabilities of decadal climate prediction systems based on numerical climate models, toward an integrated dynamical and statistical approach for the interannual-to-decadal prediction of hydroclimate variability in medium-size river basins. [1] Alessio et. al., Natural variability and anthropogenic effects in a Central Mediterranean core, Clim. of the Past, 8, 831-839, 2012.

  19. A statistical forecast model using the time-scale decomposition technique to predict rainfall during flood period over the middle and lower reaches of the Yangtze River Valley

    NASA Astrophysics Data System (ADS)

    Hu, Yijia; Zhong, Zhong; Zhu, Yimin; Ha, Yao

    2018-04-01

    In this paper, a statistical forecast model using the time-scale decomposition method is established to do the seasonal prediction of the rainfall during flood period (FPR) over the middle and lower reaches of the Yangtze River Valley (MLYRV). This method decomposites the rainfall over the MLYRV into three time-scale components, namely, the interannual component with the period less than 8 years, the interdecadal component with the period from 8 to 30 years, and the interdecadal component with the period larger than 30 years. Then, the predictors are selected for the three time-scale components of FPR through the correlation analysis. At last, a statistical forecast model is established using the multiple linear regression technique to predict the three time-scale components of the FPR, respectively. The results show that this forecast model can capture the interannual and interdecadal variation of FPR. The hindcast of FPR during 14 years from 2001 to 2014 shows that the FPR can be predicted successfully in 11 out of the 14 years. This forecast model performs better than the model using traditional scheme without time-scale decomposition. Therefore, the statistical forecast model using the time-scale decomposition technique has good skills and application value in the operational prediction of FPR over the MLYRV.

  20. Statistical inference, the bootstrap, and neural-network modeling with application to foreign exchange rates.

    PubMed

    White, H; Racine, J

    2001-01-01

    We propose tests for individual and joint irrelevance of network inputs. Such tests can be used to determine whether an input or group of inputs "belong" in a particular model, thus permitting valid statistical inference based on estimated feedforward neural-network models. The approaches employ well-known statistical resampling techniques. We conduct a small Monte Carlo experiment showing that our tests have reasonable level and power behavior, and we apply our methods to examine whether there are predictable regularities in foreign exchange rates. We find that exchange rates do appear to contain information that is exploitable for enhanced point prediction, but the nature of the predictive relations evolves through time.

  1. Developing points-based risk-scoring systems in the presence of competing risks.

    PubMed

    Austin, Peter C; Lee, Douglas S; D'Agostino, Ralph B; Fine, Jason P

    2016-09-30

    Predicting the occurrence of an adverse event over time is an important issue in clinical medicine. Clinical prediction models and associated points-based risk-scoring systems are popular statistical methods for summarizing the relationship between a multivariable set of patient risk factors and the risk of the occurrence of an adverse event. Points-based risk-scoring systems are popular amongst physicians as they permit a rapid assessment of patient risk without the use of computers or other electronic devices. The use of such points-based risk-scoring systems facilitates evidence-based clinical decision making. There is a growing interest in cause-specific mortality and in non-fatal outcomes. However, when considering these types of outcomes, one must account for competing risks whose occurrence precludes the occurrence of the event of interest. We describe how points-based risk-scoring systems can be developed in the presence of competing events. We illustrate the application of these methods by developing risk-scoring systems for predicting cardiovascular mortality in patients hospitalized with acute myocardial infarction. Code in the R statistical programming language is provided for the implementation of the described methods. © 2016 The Authors. Statistics in Medicine published by John Wiley & Sons Ltd. © 2016 The Authors. Statistics in Medicine published by John Wiley & Sons Ltd.

  2. Rolling Bearing Life Prediction-Past, Present, and Future

    NASA Technical Reports Server (NTRS)

    Zaretsky, E V; Poplawski, J. V.; Miller, C. R.

    2000-01-01

    Comparisons were made between the life prediction formulas of Lundberg and Palmgren, Ioannides and Harris, and Zaretsky and full-scale ball and roller bearing life data. The effect of Weibull slope on bearing life prediction was determined. Life factors are proposed to adjust the respective life formulas to the normalized statistical life distribution of each bearing type. The Lundberg-Palmgren method resulted in the most conservative life predictions compared to Ioannides and Harris, and Zaretsky methods which produced statistically similar results. Roller profile can have significant effects on bearing life prediction results. Roller edge loading can reduce life by as much as 98 percent. The resultant predicted life not only depends on the life equation used but on the Weibull slope assumed, the least variation occurring with the Zaretsky equation. The load-life exponent p of 10/3 used in the American National Standards Institute (ANSI)/American Bearing Manufacturers Association (ABMA)/International Organization for Standardization (ISO) standards is inconsistent with the majority roller bearings designed and used today.

  3. Risk prediction model: Statistical and artificial neural network approach

    NASA Astrophysics Data System (ADS)

    Paiman, Nuur Azreen; Hariri, Azian; Masood, Ibrahim

    2017-04-01

    Prediction models are increasingly gaining popularity and had been used in numerous areas of studies to complement and fulfilled clinical reasoning and decision making nowadays. The adoption of such models assist physician's decision making, individual's behavior, and consequently improve individual outcomes and the cost-effectiveness of care. The objective of this paper is to reviewed articles related to risk prediction model in order to understand the suitable approach, development and the validation process of risk prediction model. A qualitative review of the aims, methods and significant main outcomes of the nineteen published articles that developed risk prediction models from numerous fields were done. This paper also reviewed on how researchers develop and validate the risk prediction models based on statistical and artificial neural network approach. From the review done, some methodological recommendation in developing and validating the prediction model were highlighted. According to studies that had been done, artificial neural network approached in developing the prediction model were more accurate compared to statistical approach. However currently, only limited published literature discussed on which approach is more accurate for risk prediction model development.

  4. From local uncertainty to global predictions: Making predictions on fractal basins

    PubMed Central

    2018-01-01

    In nonlinear systems long term dynamics is governed by the attractors present in phase space. The presence of a chaotic saddle gives rise to basins of attraction with fractal boundaries and sometimes even to Wada boundaries. These two phenomena involve extreme difficulties in the prediction of the future state of the system. However, we show here that it is possible to make statistical predictions even if we do not have any previous knowledge of the initial conditions or the time series of the system until it reaches its final state. In this work, we develop a general method to make statistical predictions in systems with fractal basins. In particular, we have applied this new method to the Duffing oscillator for a choice of parameters where the system possesses the Wada property. We have computed the statistical properties of the Duffing oscillator for different phase space resolutions, to obtain information about the global dynamics of the system. The key idea is that the fraction of initial conditions that evolve towards each attractor is scale free—which we illustrate numerically. We have also shown numerically how having partial information about the initial conditions of the system does not improve in general the predictions in the Wada regions. PMID:29668687

  5. Use of statistical and neural net approaches in predicting toxicity of chemicals.

    PubMed

    Basak, S C; Grunwald, G D; Gute, B D; Balasubramanian, K; Opitz, D

    2000-01-01

    Hierarchical quantitative structure-activity relationships (H-QSAR) have been developed as a new approach in constructing models for estimating physicochemical, biomedicinal, and toxicological properties of interest. This approach uses increasingly more complex molecular descriptors in a graduated approach to model building. In this study, statistical and neural network methods have been applied to the development of H-QSAR models for estimating the acute aquatic toxicity (LC50) of 69 benzene derivatives to Pimephales promelas (fathead minnow). Topostructural, topochemical, geometrical, and quantum chemical indices were used as the four levels of the hierarchical method. It is clear from both the statistical and neural network models that topostructural indices alone cannot adequately model this set of congeneric chemicals. Not surprisingly, topochemical indices greatly increase the predictive power of both statistical and neural network models. Quantum chemical indices also add significantly to the modeling of this set of acute aquatic toxicity data.

  6. Enhancing predictive accuracy and reproducibility in clinical evaluation research: Commentary on the special section of the Journal of Evaluation in Clinical Practice.

    PubMed

    Bryant, Fred B

    2016-12-01

    This paper introduces a special section of the current issue of the Journal of Evaluation in Clinical Practice that includes a set of 6 empirical articles showcasing a versatile, new machine-learning statistical method, known as optimal data (or discriminant) analysis (ODA), specifically designed to produce statistical models that maximize predictive accuracy. As this set of papers clearly illustrates, ODA offers numerous important advantages over traditional statistical methods-advantages that enhance the validity and reproducibility of statistical conclusions in empirical research. This issue of the journal also includes a review of a recently published book that provides a comprehensive introduction to the logic, theory, and application of ODA in empirical research. It is argued that researchers have much to gain by using ODA to analyze their data. © 2016 John Wiley & Sons, Ltd.

  7. Prediction of Patient-Controlled Analgesic Consumption: A Multimodel Regression Tree Approach.

    PubMed

    Hu, Yuh-Jyh; Ku, Tien-Hsiung; Yang, Yu-Hung; Shen, Jia-Ying

    2018-01-01

    Several factors contribute to individual variability in postoperative pain, therefore, individuals consume postoperative analgesics at different rates. Although many statistical studies have analyzed postoperative pain and analgesic consumption, most have identified only the correlation and have not subjected the statistical model to further tests in order to evaluate its predictive accuracy. In this study involving 3052 patients, a multistrategy computational approach was developed for analgesic consumption prediction. This approach uses data on patient-controlled analgesia demand behavior over time and combines clustering, classification, and regression to mitigate the limitations of current statistical models. Cross-validation results indicated that the proposed approach significantly outperforms various existing regression methods. Moreover, a comparison between the predictions by anesthesiologists and medical specialists and those of the computational approach for an independent test data set of 60 patients further evidenced the superiority of the computational approach in predicting analgesic consumption because it produced markedly lower root mean squared errors.

  8. Framework for making better predictions by directly estimating variables' predictivity.

    PubMed

    Lo, Adeline; Chernoff, Herman; Zheng, Tian; Lo, Shaw-Hwa

    2016-12-13

    We propose approaching prediction from a framework grounded in the theoretical correct prediction rate of a variable set as a parameter of interest. This framework allows us to define a measure of predictivity that enables assessing variable sets for, preferably high, predictivity. We first define the prediction rate for a variable set and consider, and ultimately reject, the naive estimator, a statistic based on the observed sample data, due to its inflated bias for moderate sample size and its sensitivity to noisy useless variables. We demonstrate that the [Formula: see text]-score of the PR method of VS yields a relatively unbiased estimate of a parameter that is not sensitive to noisy variables and is a lower bound to the parameter of interest. Thus, the PR method using the [Formula: see text]-score provides an effective approach to selecting highly predictive variables. We offer simulations and an application of the [Formula: see text]-score on real data to demonstrate the statistic's predictive performance on sample data. We conjecture that using the partition retention and [Formula: see text]-score can aid in finding variable sets with promising prediction rates; however, further research in the avenue of sample-based measures of predictivity is much desired.

  9. OPR-PPR, a Computer Program for Assessing Data Importance to Model Predictions Using Linear Statistics

    USGS Publications Warehouse

    Tonkin, Matthew J.; Tiedeman, Claire; Ely, D. Matthew; Hill, Mary C.

    2007-01-01

    The OPR-PPR program calculates the Observation-Prediction (OPR) and Parameter-Prediction (PPR) statistics that can be used to evaluate the relative importance of various kinds of data to simulated predictions. The data considered fall into three categories: (1) existing observations, (2) potential observations, and (3) potential information about parameters. The first two are addressed by the OPR statistic; the third is addressed by the PPR statistic. The statistics are based on linear theory and measure the leverage of the data, which depends on the location, the type, and possibly the time of the data being considered. For example, in a ground-water system the type of data might be a head measurement at a particular location and time. As a measure of leverage, the statistics do not take into account the value of the measurement. As linear measures, the OPR and PPR statistics require minimal computational effort once sensitivities have been calculated. Sensitivities need to be calculated for only one set of parameter values; commonly these are the values estimated through model calibration. OPR-PPR can calculate the OPR and PPR statistics for any mathematical model that produces the necessary OPR-PPR input files. In this report, OPR-PPR capabilities are presented in the context of using the ground-water model MODFLOW-2000 and the universal inverse program UCODE_2005. The method used to calculate the OPR and PPR statistics is based on the linear equation for prediction standard deviation. Using sensitivities and other information, OPR-PPR calculates (a) the percent increase in the prediction standard deviation that results when one or more existing observations are omitted from the calibration data set; (b) the percent decrease in the prediction standard deviation that results when one or more potential observations are added to the calibration data set; or (c) the percent decrease in the prediction standard deviation that results when potential information on one or more parameters is added.

  10. Benchmarking protein-protein interface predictions: why you should care about protein size.

    PubMed

    Martin, Juliette

    2014-07-01

    A number of predictive methods have been developed to predict protein-protein binding sites. Each new method is traditionally benchmarked using sets of protein structures of various sizes, and global statistics are used to assess the quality of the prediction. Little attention has been paid to the potential bias due to protein size on these statistics. Indeed, small proteins involve proportionally more residues at interfaces than large ones. If a predictive method is biased toward small proteins, this can lead to an over-estimation of its performance. Here, we investigate the bias due to the size effect when benchmarking protein-protein interface prediction on the widely used docking benchmark 4.0. First, we simulate random scores that favor small proteins over large ones. Instead of the 0.5 AUC (Area Under the Curve) value expected by chance, these biased scores result in an AUC equal to 0.6 using hypergeometric distributions, and up to 0.65 using constant scores. We then use real prediction results to illustrate how to detect the size bias by shuffling, and subsequently correct it using a simple conversion of the scores into normalized ranks. In addition, we investigate the scores produced by eight published methods and show that they are all affected by the size effect, which can change their relative ranking. The size effect also has an impact on linear combination scores by modifying the relative contributions of each method. In the future, systematic corrections should be applied when benchmarking predictive methods using data sets with mixed protein sizes. © 2014 Wiley Periodicals, Inc.

  11. Experimental and statistical post-validation of positive example EST sequences carrying peroxisome targeting signals type 1 (PTS1)

    PubMed Central

    Lingner, Thomas; Kataya, Amr R. A.; Reumann, Sigrun

    2012-01-01

    We recently developed the first algorithms specifically for plants to predict proteins carrying peroxisome targeting signals type 1 (PTS1) from genome sequences.1 As validated experimentally, the prediction methods are able to correctly predict unknown peroxisomal Arabidopsis proteins and to infer novel PTS1 tripeptides. The high prediction performance is primarily determined by the large number and sequence diversity of the underlying positive example sequences, which mainly derived from EST databases. However, a few constructs remained cytosolic in experimental validation studies, indicating sequencing errors in some ESTs. To identify erroneous sequences, we validated subcellular targeting of additional positive example sequences in the present study. Moreover, we analyzed the distribution of prediction scores separately for each orthologous group of PTS1 proteins, which generally resembled normal distributions with group-specific mean values. The cytosolic sequences commonly represented outliers of low prediction scores and were located at the very tail of a fitted normal distribution. Three statistical methods for identifying outliers were compared in terms of sensitivity and specificity.” Their combined application allows elimination of erroneous ESTs from positive example data sets. This new post-validation method will further improve the prediction accuracy of both PTS1 and PTS2 protein prediction models for plants, fungi, and mammals. PMID:22415050

  12. Experimental and statistical post-validation of positive example EST sequences carrying peroxisome targeting signals type 1 (PTS1).

    PubMed

    Lingner, Thomas; Kataya, Amr R A; Reumann, Sigrun

    2012-02-01

    We recently developed the first algorithms specifically for plants to predict proteins carrying peroxisome targeting signals type 1 (PTS1) from genome sequences. As validated experimentally, the prediction methods are able to correctly predict unknown peroxisomal Arabidopsis proteins and to infer novel PTS1 tripeptides. The high prediction performance is primarily determined by the large number and sequence diversity of the underlying positive example sequences, which mainly derived from EST databases. However, a few constructs remained cytosolic in experimental validation studies, indicating sequencing errors in some ESTs. To identify erroneous sequences, we validated subcellular targeting of additional positive example sequences in the present study. Moreover, we analyzed the distribution of prediction scores separately for each orthologous group of PTS1 proteins, which generally resembled normal distributions with group-specific mean values. The cytosolic sequences commonly represented outliers of low prediction scores and were located at the very tail of a fitted normal distribution. Three statistical methods for identifying outliers were compared in terms of sensitivity and specificity." Their combined application allows elimination of erroneous ESTs from positive example data sets. This new post-validation method will further improve the prediction accuracy of both PTS1 and PTS2 protein prediction models for plants, fungi, and mammals.

  13. Developing Risk Prediction Models for Kidney Injury and Assessing Incremental Value for Novel Biomarkers

    PubMed Central

    Kerr, Kathleen F.; Meisner, Allison; Thiessen-Philbrook, Heather; Coca, Steven G.

    2014-01-01

    The field of nephrology is actively involved in developing biomarkers and improving models for predicting patients’ risks of AKI and CKD and their outcomes. However, some important aspects of evaluating biomarkers and risk models are not widely appreciated, and statistical methods are still evolving. This review describes some of the most important statistical concepts for this area of research and identifies common pitfalls. Particular attention is paid to metrics proposed within the last 5 years for quantifying the incremental predictive value of a new biomarker. PMID:24855282

  14. Evaluating the effect of disturbed ensemble distributions on SCFG based statistical sampling of RNA secondary structures.

    PubMed

    Scheid, Anika; Nebel, Markus E

    2012-07-09

    Over the past years, statistical and Bayesian approaches have become increasingly appreciated to address the long-standing problem of computational RNA structure prediction. Recently, a novel probabilistic method for the prediction of RNA secondary structures from a single sequence has been studied which is based on generating statistically representative and reproducible samples of the entire ensemble of feasible structures for a particular input sequence. This method samples the possible foldings from a distribution implied by a sophisticated (traditional or length-dependent) stochastic context-free grammar (SCFG) that mirrors the standard thermodynamic model applied in modern physics-based prediction algorithms. Specifically, that grammar represents an exact probabilistic counterpart to the energy model underlying the Sfold software, which employs a sampling extension of the partition function (PF) approach to produce statistically representative subsets of the Boltzmann-weighted ensemble. Although both sampling approaches have the same worst-case time and space complexities, it has been indicated that they differ in performance (both with respect to prediction accuracy and quality of generated samples), where neither of these two competing approaches generally outperforms the other. In this work, we will consider the SCFG based approach in order to perform an analysis on how the quality of generated sample sets and the corresponding prediction accuracy changes when different degrees of disturbances are incorporated into the needed sampling probabilities. This is motivated by the fact that if the results prove to be resistant to large errors on the distinct sampling probabilities (compared to the exact ones), then it will be an indication that these probabilities do not need to be computed exactly, but it may be sufficient and more efficient to approximate them. Thus, it might then be possible to decrease the worst-case time requirements of such an SCFG based sampling method without significant accuracy losses. If, on the other hand, the quality of sampled structures can be observed to strongly react to slight disturbances, there is little hope for improving the complexity by heuristic procedures. We hence provide a reliable test for the hypothesis that a heuristic method could be implemented to improve the time scaling of RNA secondary structure prediction in the worst-case - without sacrificing much of the accuracy of the results. Our experiments indicate that absolute errors generally lead to the generation of useless sample sets, whereas relative errors seem to have only small negative impact on both the predictive accuracy and the overall quality of resulting structure samples. Based on these observations, we present some useful ideas for developing a time-reduced sampling method guaranteeing an acceptable predictive accuracy. We also discuss some inherent drawbacks that arise in the context of approximation. The key results of this paper are crucial for the design of an efficient and competitive heuristic prediction method based on the increasingly accepted and attractive statistical sampling approach. This has indeed been indicated by the construction of prototype algorithms.

  15. Evaluating the effect of disturbed ensemble distributions on SCFG based statistical sampling of RNA secondary structures

    PubMed Central

    2012-01-01

    Background Over the past years, statistical and Bayesian approaches have become increasingly appreciated to address the long-standing problem of computational RNA structure prediction. Recently, a novel probabilistic method for the prediction of RNA secondary structures from a single sequence has been studied which is based on generating statistically representative and reproducible samples of the entire ensemble of feasible structures for a particular input sequence. This method samples the possible foldings from a distribution implied by a sophisticated (traditional or length-dependent) stochastic context-free grammar (SCFG) that mirrors the standard thermodynamic model applied in modern physics-based prediction algorithms. Specifically, that grammar represents an exact probabilistic counterpart to the energy model underlying the Sfold software, which employs a sampling extension of the partition function (PF) approach to produce statistically representative subsets of the Boltzmann-weighted ensemble. Although both sampling approaches have the same worst-case time and space complexities, it has been indicated that they differ in performance (both with respect to prediction accuracy and quality of generated samples), where neither of these two competing approaches generally outperforms the other. Results In this work, we will consider the SCFG based approach in order to perform an analysis on how the quality of generated sample sets and the corresponding prediction accuracy changes when different degrees of disturbances are incorporated into the needed sampling probabilities. This is motivated by the fact that if the results prove to be resistant to large errors on the distinct sampling probabilities (compared to the exact ones), then it will be an indication that these probabilities do not need to be computed exactly, but it may be sufficient and more efficient to approximate them. Thus, it might then be possible to decrease the worst-case time requirements of such an SCFG based sampling method without significant accuracy losses. If, on the other hand, the quality of sampled structures can be observed to strongly react to slight disturbances, there is little hope for improving the complexity by heuristic procedures. We hence provide a reliable test for the hypothesis that a heuristic method could be implemented to improve the time scaling of RNA secondary structure prediction in the worst-case – without sacrificing much of the accuracy of the results. Conclusions Our experiments indicate that absolute errors generally lead to the generation of useless sample sets, whereas relative errors seem to have only small negative impact on both the predictive accuracy and the overall quality of resulting structure samples. Based on these observations, we present some useful ideas for developing a time-reduced sampling method guaranteeing an acceptable predictive accuracy. We also discuss some inherent drawbacks that arise in the context of approximation. The key results of this paper are crucial for the design of an efficient and competitive heuristic prediction method based on the increasingly accepted and attractive statistical sampling approach. This has indeed been indicated by the construction of prototype algorithms. PMID:22776037

  16. In silico environmental chemical science: properties and processes from statistical and computational modelling

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Tratnyek, Paul G.; Bylaska, Eric J.; Weber, Eric J.

    2017-01-01

    Quantitative structure–activity relationships (QSARs) have long been used in the environmental sciences. More recently, molecular modeling and chemoinformatic methods have become widespread. These methods have the potential to expand and accelerate advances in environmental chemistry because they complement observational and experimental data with “in silico” results and analysis. The opportunities and challenges that arise at the intersection between statistical and theoretical in silico methods are most apparent in the context of properties that determine the environmental fate and effects of chemical contaminants (degradation rate constants, partition coefficients, toxicities, etc.). The main example of this is the calibration of QSARs usingmore » descriptor variable data calculated from molecular modeling, which can make QSARs more useful for predicting property data that are unavailable, but also can make them more powerful tools for diagnosis of fate determining pathways and mechanisms. Emerging opportunities for “in silico environmental chemical science” are to move beyond the calculation of specific chemical properties using statistical models and toward more fully in silico models, prediction of transformation pathways and products, incorporation of environmental factors into model predictions, integration of databases and predictive models into more comprehensive and efficient tools for exposure assessment, and extending the applicability of all the above from chemicals to biologicals and materials.« less

  17. Predicting and downscaling ENSO impacts on intraseasonal precipitation statistics in California: The 1997/98 event

    USGS Publications Warehouse

    Gershunov, A.; Barnett, T.P.; Cayan, D.R.; Tubbs, T.; Goddard, L.

    2000-01-01

    Three long-range forecasting methods have been evaluated for prediction and downscaling of seasonal and intraseasonal precipitation statistics in California. Full-statistical, hybrid-dynamical - statistical and full-dynamical approaches have been used to forecast El Nin??o - Southern Oscillation (ENSO) - related total precipitation, daily precipitation frequency, and average intensity anomalies during the January - March season. For El Nin??o winters, the hybrid approach emerges as the best performer, while La Nin??a forecasting skill is poor. The full-statistical forecasting method features reasonable forecasting skill for both La Nin??a and El Nin??o winters. The performance of the full-dynamical approach could not be evaluated as rigorously as that of the other two forecasting schemes. Although the full-dynamical forecasting approach is expected to outperform simpler forecasting schemes in the long run, evidence is presented to conclude that, at present, the full-dynamical forecasting approach is the least viable of the three, at least in California. The authors suggest that operational forecasting of any intraseasonal temperature, precipitation, or streamflow statistic derivable from the available records is possible now for ENSO-extreme years.

  18. GAPIT: genome association and prediction integrated tool.

    PubMed

    Lipka, Alexander E; Tian, Feng; Wang, Qishan; Peiffer, Jason; Li, Meng; Bradbury, Peter J; Gore, Michael A; Buckler, Edward S; Zhang, Zhiwu

    2012-09-15

    Software programs that conduct genome-wide association studies and genomic prediction and selection need to use methodologies that maximize statistical power, provide high prediction accuracy and run in a computationally efficient manner. We developed an R package called Genome Association and Prediction Integrated Tool (GAPIT) that implements advanced statistical methods including the compressed mixed linear model (CMLM) and CMLM-based genomic prediction and selection. The GAPIT package can handle large datasets in excess of 10 000 individuals and 1 million single-nucleotide polymorphisms with minimal computational time, while providing user-friendly access and concise tables and graphs to interpret results. http://www.maizegenetics.net/GAPIT. zhiwu.zhang@cornell.edu Supplementary data are available at Bioinformatics online.

  19. A review of statistical updating methods for clinical prediction models.

    PubMed

    Su, Ting-Li; Jaki, Thomas; Hickey, Graeme L; Buchan, Iain; Sperrin, Matthew

    2018-01-01

    A clinical prediction model is a tool for predicting healthcare outcomes, usually within a specific population and context. A common approach is to develop a new clinical prediction model for each population and context; however, this wastes potentially useful historical information. A better approach is to update or incorporate the existing clinical prediction models already developed for use in similar contexts or populations. In addition, clinical prediction models commonly become miscalibrated over time, and need replacing or updating. In this article, we review a range of approaches for re-using and updating clinical prediction models; these fall in into three main categories: simple coefficient updating, combining multiple previous clinical prediction models in a meta-model and dynamic updating of models. We evaluated the performance (discrimination and calibration) of the different strategies using data on mortality following cardiac surgery in the United Kingdom: We found that no single strategy performed sufficiently well to be used to the exclusion of the others. In conclusion, useful tools exist for updating existing clinical prediction models to a new population or context, and these should be implemented rather than developing a new clinical prediction model from scratch, using a breadth of complementary statistical methods.

  20. Fusion of multiscale wavelet-based fractal analysis on retina image for stroke prediction.

    PubMed

    Che Azemin, M Z; Kumar, Dinesh K; Wong, T Y; Wang, J J; Kawasaki, R; Mitchell, P; Arjunan, Sridhar P

    2010-01-01

    In this paper, we present a novel method of analyzing retinal vasculature using Fourier Fractal Dimension to extract the complexity of the retinal vasculature enhanced at different wavelet scales. Logistic regression was used as a fusion method to model the classifier for 5-year stroke prediction. The efficacy of this technique has been tested using standard pattern recognition performance evaluation, Receivers Operating Characteristics (ROC) analysis and medical prediction statistics, odds ratio. Stroke prediction model was developed using the proposed system.

  1. Evaluation of three statistical prediction models for forensic age prediction based on DNA methylation.

    PubMed

    Smeers, Inge; Decorte, Ronny; Van de Voorde, Wim; Bekaert, Bram

    2018-05-01

    DNA methylation is a promising biomarker for forensic age prediction. A challenge that has emerged in recent studies is the fact that prediction errors become larger with increasing age due to interindividual differences in epigenetic ageing rates. This phenomenon of non-constant variance or heteroscedasticity violates an assumption of the often used method of ordinary least squares (OLS) regression. The aim of this study was to evaluate alternative statistical methods that do take heteroscedasticity into account in order to provide more accurate, age-dependent prediction intervals. A weighted least squares (WLS) regression is proposed as well as a quantile regression model. Their performances were compared against an OLS regression model based on the same dataset. Both models provided age-dependent prediction intervals which account for the increasing variance with age, but WLS regression performed better in terms of success rate in the current dataset. However, quantile regression might be a preferred method when dealing with a variance that is not only non-constant, but also not normally distributed. Ultimately the choice of which model to use should depend on the observed characteristics of the data. Copyright © 2018 Elsevier B.V. All rights reserved.

  2. Radiomic analysis in prediction of Human Papilloma Virus status.

    PubMed

    Yu, Kaixian; Zhang, Youyi; Yu, Yang; Huang, Chao; Liu, Rongjie; Li, Tengfei; Yang, Liuqing; Morris, Jeffrey S; Baladandayuthapani, Veerabhadran; Zhu, Hongtu

    2017-12-01

    Human Papilloma Virus (HPV) has been associated with oropharyngeal cancer prognosis. Traditionally the HPV status is tested through invasive lab test. Recently, the rapid development of statistical image analysis techniques has enabled precise quantitative analysis of medical images. The quantitative analysis of Computed Tomography (CT) provides a non-invasive way to assess HPV status for oropharynx cancer patients. We designed a statistical radiomics approach analyzing CT images to predict HPV status. Various radiomics features were extracted from CT scans, and analyzed using statistical feature selection and prediction methods. Our approach ranked the highest in the 2016 Medical Image Computing and Computer Assisted Intervention (MICCAI) grand challenge: Oropharynx Cancer (OPC) Radiomics Challenge, Human Papilloma Virus (HPV) Status Prediction. Further analysis on the most relevant radiomic features distinguishing HPV positive and negative subjects suggested that HPV positive patients usually have smaller and simpler tumors.

  3. Characteristics of genomic signatures derived using univariate methods and mechanistically anchored functional descriptors for predicting drug- and xenobiotic-induced nephrotoxicity.

    PubMed

    Shi, Weiwei; Bugrim, Andrej; Nikolsky, Yuri; Nikolskya, Tatiana; Brennan, Richard J

    2008-01-01

    ABSTRACT The ideal toxicity biomarker is composed of the properties of prediction (is detected prior to traditional pathological signs of injury), accuracy (high sensitivity and specificity), and mechanistic relationships to the endpoint measured (biological relevance). Gene expression-based toxicity biomarkers ("signatures") have shown good predictive power and accuracy, but are difficult to interpret biologically. We have compared different statistical methods of feature selection with knowledge-based approaches, using GeneGo's database of canonical pathway maps, to generate gene sets for the classification of renal tubule toxicity. The gene set selection algorithms include four univariate analyses: t-statistics, fold-change, B-statistics, and RankProd, and their combination and overlap for the identification of differentially expressed probes. Enrichment analysis following the results of the four univariate analyses, Hotelling T-square test, and, finally out-of-bag selection, a variant of cross-validation, were used to identify canonical pathway maps-sets of genes coordinately involved in key biological processes-with classification power. Differentially expressed genes identified by the different statistical univariate analyses all generated reasonably performing classifiers of tubule toxicity. Maps identified by enrichment analysis or Hotelling T-square had lower classification power, but highlighted perturbed lipid homeostasis as a common discriminator of nephrotoxic treatments. The out-of-bag method yielded the best functionally integrated classifier. The map "ephrins signaling" performed comparably to a classifier derived using sparse linear programming, a machine learning algorithm, and represents a signaling network specifically involved in renal tubule development and integrity. Such functional descriptors of toxicity promise to better integrate predictive toxicogenomics with mechanistic analysis, facilitating the interpretation and risk assessment of predictive genomic investigations.

  4. A Critical Review for Developing Accurate and Dynamic Predictive Models Using Machine Learning Methods in Medicine and Health Care.

    PubMed

    Alanazi, Hamdan O; Abdullah, Abdul Hanan; Qureshi, Kashif Naseer

    2017-04-01

    Recently, Artificial Intelligence (AI) has been used widely in medicine and health care sector. In machine learning, the classification or prediction is a major field of AI. Today, the study of existing predictive models based on machine learning methods is extremely active. Doctors need accurate predictions for the outcomes of their patients' diseases. In addition, for accurate predictions, timing is another significant factor that influences treatment decisions. In this paper, existing predictive models in medicine and health care have critically reviewed. Furthermore, the most famous machine learning methods have explained, and the confusion between a statistical approach and machine learning has clarified. A review of related literature reveals that the predictions of existing predictive models differ even when the same dataset is used. Therefore, existing predictive models are essential, and current methods must be improved.

  5. Geographic variation in forest composition and precipitation predict the synchrony of forest insect outbreaks

    Treesearch

    Kyle J. Haynes; Andrew M. Liebhold; Ottar N. Bjørnstad; Andrew J. Allstadt; Randall S. Morin

    2018-01-01

    Evaluating the causes of spatial synchrony in population dynamics in nature is notoriously difficult due to a lack of data and appropriate statistical methods. Here, we use a recently developed method, a multivariate extension of the local indicators of spatial autocorrelation statistic, to map geographic variation in the synchrony of gypsy moth outbreaks. Regression...

  6. Integrating Crop Growth Models with Whole Genome Prediction through Approximate Bayesian Computation.

    PubMed

    Technow, Frank; Messina, Carlos D; Totir, L Radu; Cooper, Mark

    2015-01-01

    Genomic selection, enabled by whole genome prediction (WGP) methods, is revolutionizing plant breeding. Existing WGP methods have been shown to deliver accurate predictions in the most common settings, such as prediction of across environment performance for traits with additive gene effects. However, prediction of traits with non-additive gene effects and prediction of genotype by environment interaction (G×E), continues to be challenging. Previous attempts to increase prediction accuracy for these particularly difficult tasks employed prediction methods that are purely statistical in nature. Augmenting the statistical methods with biological knowledge has been largely overlooked thus far. Crop growth models (CGMs) attempt to represent the impact of functional relationships between plant physiology and the environment in the formation of yield and similar output traits of interest. Thus, they can explain the impact of G×E and certain types of non-additive gene effects on the expressed phenotype. Approximate Bayesian computation (ABC), a novel and powerful computational procedure, allows the incorporation of CGMs directly into the estimation of whole genome marker effects in WGP. Here we provide a proof of concept study for this novel approach and demonstrate its use with synthetic data sets. We show that this novel approach can be considerably more accurate than the benchmark WGP method GBLUP in predicting performance in environments represented in the estimation set as well as in previously unobserved environments for traits determined by non-additive gene effects. We conclude that this proof of concept demonstrates that using ABC for incorporating biological knowledge in the form of CGMs into WGP is a very promising and novel approach to improving prediction accuracy for some of the most challenging scenarios in plant breeding and applied genetics.

  7. Integrating Crop Growth Models with Whole Genome Prediction through Approximate Bayesian Computation

    PubMed Central

    Technow, Frank; Messina, Carlos D.; Totir, L. Radu; Cooper, Mark

    2015-01-01

    Genomic selection, enabled by whole genome prediction (WGP) methods, is revolutionizing plant breeding. Existing WGP methods have been shown to deliver accurate predictions in the most common settings, such as prediction of across environment performance for traits with additive gene effects. However, prediction of traits with non-additive gene effects and prediction of genotype by environment interaction (G×E), continues to be challenging. Previous attempts to increase prediction accuracy for these particularly difficult tasks employed prediction methods that are purely statistical in nature. Augmenting the statistical methods with biological knowledge has been largely overlooked thus far. Crop growth models (CGMs) attempt to represent the impact of functional relationships between plant physiology and the environment in the formation of yield and similar output traits of interest. Thus, they can explain the impact of G×E and certain types of non-additive gene effects on the expressed phenotype. Approximate Bayesian computation (ABC), a novel and powerful computational procedure, allows the incorporation of CGMs directly into the estimation of whole genome marker effects in WGP. Here we provide a proof of concept study for this novel approach and demonstrate its use with synthetic data sets. We show that this novel approach can be considerably more accurate than the benchmark WGP method GBLUP in predicting performance in environments represented in the estimation set as well as in previously unobserved environments for traits determined by non-additive gene effects. We conclude that this proof of concept demonstrates that using ABC for incorporating biological knowledge in the form of CGMs into WGP is a very promising and novel approach to improving prediction accuracy for some of the most challenging scenarios in plant breeding and applied genetics. PMID:26121133

  8. Comparison of statistical methods for detection of serum lipid biomarkers for mesothelioma and asbestos exposure.

    PubMed

    Xu, Rengyi; Mesaros, Clementina; Weng, Liwei; Snyder, Nathaniel W; Vachani, Anil; Blair, Ian A; Hwang, Wei-Ting

    2017-07-01

    We compared three statistical methods in selecting a panel of serum lipid biomarkers for mesothelioma and asbestos exposure. Serum samples from mesothelioma, asbestos-exposed subjects and controls (40 per group) were analyzed. Three variable selection methods were considered: top-ranked predictors from univariate model, stepwise and least absolute shrinkage and selection operator. Crossed-validated area under the receiver operating characteristic curve was used to compare the prediction performance. Lipids with high crossed-validated area under the curve were identified. Lipid with mass-to-charge ratio of 372.31 was selected by all three methods comparing mesothelioma versus control. Lipids with mass-to-charge ratio of 1464.80 and 329.21 were selected by two models for asbestos exposure versus control. Different methods selected a similar set of serum lipids. Combining candidate biomarkers can improve prediction.

  9. External validation of ADO, DOSE, COTE and CODEX at predicting death in primary care patients with COPD using standard and machine learning approaches.

    PubMed

    Morales, Daniel R; Flynn, Rob; Zhang, Jianguo; Trucco, Emmanuel; Quint, Jennifer K; Zutis, Kris

    2018-05-01

    Several models for predicting the risk of death in people with chronic obstructive pulmonary disease (COPD) exist but have not undergone large scale validation in primary care. The objective of this study was to externally validate these models using statistical and machine learning approaches. We used a primary care COPD cohort identified using data from the UK Clinical Practice Research Datalink. Age-standardised mortality rates were calculated for the population by gender and discrimination of ADO (age, dyspnoea, airflow obstruction), COTE (COPD-specific comorbidity test), DOSE (dyspnoea, airflow obstruction, smoking, exacerbations) and CODEX (comorbidity, dyspnoea, airflow obstruction, exacerbations) at predicting death over 1-3 years measured using logistic regression and a support vector machine learning (SVM) method of analysis. The age-standardised mortality rate was 32.8 (95%CI 32.5-33.1) and 25.2 (95%CI 25.4-25.7) per 1000 person years for men and women respectively. Complete data were available for 54879 patients to predict 1-year mortality. ADO performed the best (c-statistic of 0.730) compared with DOSE (c-statistic 0.645), COTE (c-statistic 0.655) and CODEX (c-statistic 0.649) at predicting 1-year mortality. Discrimination of ADO and DOSE improved at predicting 1-year mortality when combined with COTE comorbidities (c-statistic 0.780 ADO + COTE; c-statistic 0.727 DOSE + COTE). Discrimination did not change significantly over 1-3 years. Comparable results were observed using SVM. In primary care, ADO appears superior at predicting death in COPD. Performance of ADO and DOSE improved when combined with COTE comorbidities suggesting better models may be generated with additional data facilitated using novel approaches. Copyright © 2018. Published by Elsevier Ltd.

  10. Towards the feasibility of using ultrasound to determine mechanical properties of tissues in a bioreactor

    PubMed Central

    Mansour, Joseph M.; Gu, Di-Win Marine; Chung, Chen-Yuan; Heebner, Joseph; Althans, Jake; Abdalian, Sarah; Schluchter, Mark D.; Liu, Yiying; Welter, Jean F.

    2016-01-01

    Introduction Our ultimate goal is to non-destructively evaluate mechanical properties of tissue-engineered (TE) cartilage using ultrasound (US). We used agarose gels as surrogates for TE cartilage. Previously, we showed that mechanical properties measured using conventional methods were related to those measured using US, which suggested a way to non-destructively predict mechanical properties of samples with known volume fractions. In this study, we sought to determine whether the mechanical properties of samples, with unknown volume fractions could be predicted by US. Methods Aggregate moduli were calculated for hydrogels as a function of SOS, based on concentration and density using a poroelastic model. The data were used to train a statistical model, which we then used to predict volume fractions and mechanical properties of unknown samples. Young's and storage moduli were measured mechanically. Results The statistical model generally predicted the Young's moduli in compression to within < 10% of their mechanically measured value. We defined positive linear correlations between the aggregate modulus predicted from US and both the storage and Young's moduli determined from mechanical tests. Conclusions Mechanical properties of hydrogels with unknown volume fractions can be predicted successfully from US measurements. This method has the potential to predict mechanical properties of TE cartilage non-destructively in a bioreactor. PMID:25092421

  11. A Sub-filter Scale Noise Equation far Hybrid LES Simulations

    NASA Technical Reports Server (NTRS)

    Goldstein, Marvin E.

    2006-01-01

    Hybrid LES/subscale modeling approaches have an important advantage over the current noise prediction methods in that they only involve modeling of the relatively universal subscale motion and not the configuration dependent larger scale turbulence . Previous hybrid approaches use approximate statistical techniques or extrapolation methods to obtain the requisite information about the sub-filter scale motion. An alternative approach would be to adopt the modeling techniques used in the current noise prediction methods and determine the unknown stresses from experimental data. The present paper derives an equation for predicting the sub scale sound from information that can be obtained with currently available experimental procedures. The resulting prediction method would then be intermediate between the current noise prediction codes and previously proposed hybrid techniques.

  12. Improved accuracy of supervised CRM discovery with interpolated Markov models and cross-species comparison.

    PubMed

    Kazemian, Majid; Zhu, Qiyun; Halfon, Marc S; Sinha, Saurabh

    2011-12-01

    Despite recent advances in experimental approaches for identifying transcriptional cis-regulatory modules (CRMs, 'enhancers'), direct empirical discovery of CRMs for all genes in all cell types and environmental conditions is likely to remain an elusive goal. Effective methods for computational CRM discovery are thus a critically needed complement to empirical approaches. However, existing computational methods that search for clusters of putative binding sites are ineffective if the relevant TFs and/or their binding specificities are unknown. Here, we provide a significantly improved method for 'motif-blind' CRM discovery that does not depend on knowledge or accurate prediction of TF-binding motifs and is effective when limited knowledge of functional CRMs is available to 'supervise' the search. We propose a new statistical method, based on 'Interpolated Markov Models', for motif-blind, genome-wide CRM discovery. It captures the statistical profile of variable length words in known CRMs of a regulatory network and finds candidate CRMs that match this profile. The method also uses orthologs of the known CRMs from closely related genomes. We perform in silico evaluation of predicted CRMs by assessing whether their neighboring genes are enriched for the expected expression patterns. This assessment uses a novel statistical test that extends the widely used Hypergeometric test of gene set enrichment to account for variability in intergenic lengths. We find that the new CRM prediction method is superior to existing methods. Finally, we experimentally validate 12 new CRM predictions by examining their regulatory activity in vivo in Drosophila; 10 of the tested CRMs were found to be functional, while 6 of the top 7 predictions showed the expected activity patterns. We make our program available as downloadable source code, and as a plugin for a genome browser installed on our servers. © The Author(s) 2011. Published by Oxford University Press.

  13. Physics-based statistical model and simulation method of RF propagation in urban environments

    DOEpatents

    Pao, Hsueh-Yuan; Dvorak, Steven L.

    2010-09-14

    A physics-based statistical model and simulation/modeling method and system of electromagnetic wave propagation (wireless communication) in urban environments. In particular, the model is a computationally efficient close-formed parametric model of RF propagation in an urban environment which is extracted from a physics-based statistical wireless channel simulation method and system. The simulation divides the complex urban environment into a network of interconnected urban canyon waveguides which can be analyzed individually; calculates spectral coefficients of modal fields in the waveguides excited by the propagation using a database of statistical impedance boundary conditions which incorporates the complexity of building walls in the propagation model; determines statistical parameters of the calculated modal fields; and determines a parametric propagation model based on the statistical parameters of the calculated modal fields from which predictions of communications capability may be made.

  14. High precision in protein contact prediction using fully convolutional neural networks and minimal sequence features.

    PubMed

    Jones, David T; Kandathil, Shaun M

    2018-04-26

    In addition to substitution frequency data from protein sequence alignments, many state-of-the-art methods for contact prediction rely on additional sources of information, or features, of protein sequences in order to predict residue-residue contacts, such as solvent accessibility, predicted secondary structure, and scores from other contact prediction methods. It is unclear how much of this information is needed to achieve state-of-the-art results. Here, we show that using deep neural network models, simple alignment statistics contain sufficient information to achieve state-of-the-art precision. Our prediction method, DeepCov, uses fully convolutional neural networks operating on amino-acid pair frequency or covariance data derived directly from sequence alignments, without using global statistical methods such as sparse inverse covariance or pseudolikelihood estimation. Comparisons against CCMpred and MetaPSICOV2 show that using pairwise covariance data calculated from raw alignments as input allows us to match or exceed the performance of both of these methods. Almost all of the achieved precision is obtained when considering relatively local windows (around 15 residues) around any member of a given residue pairing; larger window sizes have comparable performance. Assessment on a set of shallow sequence alignments (fewer than 160 effective sequences) indicates that the new method is substantially more precise than CCMpred and MetaPSICOV2 in this regime, suggesting that improved precision is attainable on smaller sequence families. Overall, the performance of DeepCov is competitive with the state of the art, and our results demonstrate that global models, which employ features from all parts of the input alignment when predicting individual contacts, are not strictly needed in order to attain precise contact predictions. DeepCov is freely available at https://github.com/psipred/DeepCov. d.t.jones@ucl.ac.uk.

  15. Predicting September sea ice: Ensemble skill of the SEARCH Sea Ice Outlook 2008-2013

    NASA Astrophysics Data System (ADS)

    Stroeve, Julienne; Hamilton, Lawrence C.; Bitz, Cecilia M.; Blanchard-Wrigglesworth, Edward

    2014-04-01

    Since 2008, the Study of Environmental Arctic Change Sea Ice Outlook has solicited predictions of September sea-ice extent from the Arctic research community. Individuals and teams employ a variety of modeling, statistical, and heuristic approaches to make these predictions. Viewed as monthly ensembles each with one or two dozen individual predictions, they display a bimodal pattern of success. In years when observed ice extent is near its trend, the median predictions tend to be accurate. In years when the observed extent is anomalous, the median and most individual predictions are less accurate. Statistical analysis suggests that year-to-year variability, rather than methods, dominate the variation in ensemble prediction success. Furthermore, ensemble predictions do not improve as the season evolves. We consider the role of initial ice, atmosphere and ocean conditions, and summer storms and weather in contributing to the challenge of sea-ice prediction.

  16. Prediction of slant path rain attenuation statistics at various locations

    NASA Technical Reports Server (NTRS)

    Goldhirsh, J.

    1977-01-01

    The paper describes a method for predicting slant path attenuation statistics at arbitrary locations for variable frequencies and path elevation angles. The method involves the use of median reflectivity factor-height profiles measured with radar as well as the use of long-term point rain rate data and assumed or measured drop size distributions. The attenuation coefficient due to cloud liquid water in the presence of rain is also considered. Absolute probability fade distributions are compared for eight cases: Maryland (15 GHz), Texas (30 GHz), Slough, England (19 and 37 GHz), Fayetteville, North Carolina (13 and 18 GHz), and Cambridge, Massachusetts (13 and 18 GHz).

  17. GAPIT version 2: an enhanced integrated tool for genomic association and prediction

    USDA-ARS?s Scientific Manuscript database

    Most human diseases and agriculturally important traits are complex. Dissecting their genetic architecture requires continued development of innovative and powerful statistical methods. Corresponding advances in computing tools are critical to efficiently use these statistical innovations and to enh...

  18. Prediction of rainfall anomalies during the dry to wet transition season over the Southern Amazonia using machine learning tools

    NASA Astrophysics Data System (ADS)

    Shan, X.; Zhang, K.; Zhuang, Y.; Fu, R.; Hong, Y.

    2017-12-01

    Seasonal prediction of rainfall during the dry-to-wet transition season in austral spring (September-November) over southern Amazonia is central for improving planting crops and fire mitigation in that region. Previous studies have identified the key large-scale atmospheric dynamic and thermodynamics pre-conditions during the dry season (June-August) that influence the rainfall anomalies during the dry to wet transition season over Southern Amazonia. Based on these key pre-conditions during dry season, we have evaluated several statistical models and developed a Neural Network based statistical prediction system to predict rainfall during the dry to wet transition for Southern Amazonia (5-15°S, 50-70°W). Multivariate Empirical Orthogonal Function (EOF) Analysis is applied to the following four fields during JJA from the ECMWF Reanalysis (ERA-Interim) spanning from year 1979 to 2015: geopotential height at 200 hPa, surface relative humidity, convective inhibition energy (CIN) index and convective available potential energy (CAPE), to filter out noise and highlight the most coherent spatial and temporal variations. The first 10 EOF modes are retained for inputs to the statistical models, accounting for at least 70% of the total variance in the predictor fields. We have tested several linear and non-linear statistical methods. While the regularized Ridge Regression and Lasso Regression can generally capture the spatial pattern and magnitude of rainfall anomalies, we found that that Neural Network performs best with an accuracy greater than 80%, as expected from the non-linear dependence of the rainfall on the large-scale atmospheric thermodynamic conditions and circulation. Further tests of various prediction skill metrics and hindcasts also suggest this Neural Network prediction approach can significantly improve seasonal prediction skill than the dynamic predictions and regression based statistical predictions. Thus, this statistical prediction system could have shown potential to improve real-time seasonal rainfall predictions in the future.

  19. Filter Tuning Using the Chi-Squared Statistic

    NASA Technical Reports Server (NTRS)

    Lilly-Salkowski, Tyler B.

    2017-01-01

    This paper examines the use of the Chi-square statistic as a means of evaluating filter performance. The goal of the process is to characterize the filter performance in the metric of covariance realism. The Chi-squared statistic is the value calculated to determine the realism of a covariance based on the prediction accuracy and the covariance values at a given point in time. Once calculated, it is the distribution of this statistic that provides insight on the accuracy of the covariance. The process of tuning an Extended Kalman Filter (EKF) for Aqua and Aura support is described, including examination of the measurement errors of available observation types, and methods of dealing with potentially volatile atmospheric drag modeling. Predictive accuracy and the distribution of the Chi-squared statistic, calculated from EKF solutions, are assessed.

  20. Evaluation of statistical and rainfall-runoff models for predicting historical daily streamflow time series in the Des Moines and Iowa River watersheds

    USGS Publications Warehouse

    Farmer, William H.; Knight, Rodney R.; Eash, David A.; Kasey J. Hutchinson,; Linhart, S. Mike; Christiansen, Daniel E.; Archfield, Stacey A.; Over, Thomas M.; Kiang, Julie E.

    2015-08-24

    Daily records of streamflow are essential to understanding hydrologic systems and managing the interactions between human and natural systems. Many watersheds and locations lack streamgages to provide accurate and reliable records of daily streamflow. In such ungaged watersheds, statistical tools and rainfall-runoff models are used to estimate daily streamflow. Previous work compared 19 different techniques for predicting daily streamflow records in the southeastern United States. Here, five of the better-performing methods are compared in a different hydroclimatic region of the United States, in Iowa. The methods fall into three classes: (1) drainage-area ratio methods, (2) nonlinear spatial interpolations using flow duration curves, and (3) mechanistic rainfall-runoff models. The first two classes are each applied with nearest-neighbor and map-correlated index streamgages. Using a threefold validation and robust rank-based evaluation, the methods are assessed for overall goodness of fit of the hydrograph of daily streamflow, the ability to reproduce a daily, no-fail storage-yield curve, and the ability to reproduce key streamflow statistics. As in the Southeast study, a nonlinear spatial interpolation of daily streamflow using flow duration curves is found to be a method with the best predictive accuracy. Comparisons with previous work in Iowa show that the accuracy of mechanistic models with at-site calibration is substantially degraded in the ungaged framework.

  1. Developing risk prediction models for kidney injury and assessing incremental value for novel biomarkers.

    PubMed

    Kerr, Kathleen F; Meisner, Allison; Thiessen-Philbrook, Heather; Coca, Steven G; Parikh, Chirag R

    2014-08-07

    The field of nephrology is actively involved in developing biomarkers and improving models for predicting patients' risks of AKI and CKD and their outcomes. However, some important aspects of evaluating biomarkers and risk models are not widely appreciated, and statistical methods are still evolving. This review describes some of the most important statistical concepts for this area of research and identifies common pitfalls. Particular attention is paid to metrics proposed within the last 5 years for quantifying the incremental predictive value of a new biomarker. Copyright © 2014 by the American Society of Nephrology.

  2. The Statistical Nature of Fatigue Crack Propagation

    DTIC Science & Technology

    1977-03-01

    LEVEL x - V AFFDL-TRt-T843 r THE STATISTICAL NATURE OF b FATIGUE CRACK PROPAGATION D. A. VIRKLER B. M. HILLBERR Y LL= P. K. GOEL C* SCHOOL...function of crack length was best represented by the three-parameter log-normal distribution. Six growth rate calculation methods were investigated and the...dN, which varied moderately as a function of crack length, replicate a vs. N data were predicted This predicted data reproduced the mean behavior but

  3. [How reliable is the monitoring for doping?].

    PubMed

    Hüsler, J

    1990-12-01

    The reliability of the dope control, of the chemical analysis of the urine probes in the accredited laboratories and their decisions, is discussed using probabilistic and statistical methods. Basically, we evaluated and estimated the positive predictive value which means the probability that an urine probe contains prohibited dope substances given a positive test decision. Since there are not statistical data and evidence for some important quantities in relation to the predictive value, an exact evaluation is not possible, only conservative, lower bounds can be given. We found that the predictive value is at least 90% or 95% with respect to the analysis and decision based on the A-probe only, and at least 99% with respect to both A- and B-probes. A more realistic observation, but without sufficient statistical confidence, points to the fact that the true predictive value is significantly larger than these lower estimates.

  4. Vibration Response Models of a Stiffened Aluminum Plate Excited by a Shaker

    NASA Technical Reports Server (NTRS)

    Cabell, Randolph H.

    2008-01-01

    Numerical models of structural-acoustic interactions are of interest to aircraft designers and the space program. This paper describes a comparison between two energy finite element codes, a statistical energy analysis code, a structural finite element code, and the experimentally measured response of a stiffened aluminum plate excited by a shaker. Different methods for modeling the stiffeners and the power input from the shaker are discussed. The results show that the energy codes (energy finite element and statistical energy analysis) accurately predicted the measured mean square velocity of the plate. In addition, predictions from an energy finite element code had the best spatial correlation with measured velocities. However, predictions from a considerably simpler, single subsystem, statistical energy analysis model also correlated well with the spatial velocity distribution. The results highlight a need for further work to understand the relationship between modeling assumptions and the prediction results.

  5. No-Reference Video Quality Assessment Based on Statistical Analysis in 3D-DCT Domain.

    PubMed

    Li, Xuelong; Guo, Qun; Lu, Xiaoqiang

    2016-05-13

    It is an important task to design models for universal no-reference video quality assessment (NR-VQA) in multiple video processing and computer vision applications. However, most existing NR-VQA metrics are designed for specific distortion types which are not often aware in practical applications. A further deficiency is that the spatial and temporal information of videos is hardly considered simultaneously. In this paper, we propose a new NR-VQA metric based on the spatiotemporal natural video statistics (NVS) in 3D discrete cosine transform (3D-DCT) domain. In the proposed method, a set of features are firstly extracted based on the statistical analysis of 3D-DCT coefficients to characterize the spatiotemporal statistics of videos in different views. These features are used to predict the perceived video quality via the efficient linear support vector regression (SVR) model afterwards. The contributions of this paper are: 1) we explore the spatiotemporal statistics of videos in 3DDCT domain which has the inherent spatiotemporal encoding advantage over other widely used 2D transformations; 2) we extract a small set of simple but effective statistical features for video visual quality prediction; 3) the proposed method is universal for multiple types of distortions and robust to different databases. The proposed method is tested on four widely used video databases. Extensive experimental results demonstrate that the proposed method is competitive with the state-of-art NR-VQA metrics and the top-performing FR-VQA and RR-VQA metrics.

  6. Long-time predictions in nonlinear dynamics

    NASA Technical Reports Server (NTRS)

    Szebehely, V.

    1980-01-01

    It is known that nonintegrable dynamical systems do not allow precise predictions concerning their behavior for arbitrary long times. The available series solutions are not uniformly convergent according to Poincare's theorem and numerical integrations lose their meaningfulness after the elapse of arbitrary long times. Two approaches are the use of existing global integrals and statistical methods. This paper presents a generalized method along the first approach. As examples long-time predictions in the classical gravitational satellite and planetary problems are treated.

  7. A Comparison of the Performance of Advanced Statistical Techniques for the Refinement of Day-ahead and Longer NWP-based Wind Power Forecasts

    NASA Astrophysics Data System (ADS)

    Zack, J. W.

    2015-12-01

    Predictions from Numerical Weather Prediction (NWP) models are the foundation for wind power forecasts for day-ahead and longer forecast horizons. The NWP models directly produce three-dimensional wind forecasts on their respective computational grids. These can be interpolated to the location and time of interest. However, these direct predictions typically contain significant systematic errors ("biases"). This is due to a variety of factors including the limited space-time resolution of the NWP models and shortcomings in the model's representation of physical processes. It has become common practice to attempt to improve the raw NWP forecasts by statistically adjusting them through a procedure that is widely known as Model Output Statistics (MOS). The challenge is to identify complex patterns of systematic errors and then use this knowledge to adjust the NWP predictions. The MOS-based improvements are the basis for much of the value added by commercial wind power forecast providers. There are an enormous number of statistical approaches that can be used to generate the MOS adjustments to the raw NWP forecasts. In order to obtain insight into the potential value of some of the newer and more sophisticated statistical techniques often referred to as "machine learning methods" a MOS-method comparison experiment has been performed for wind power generation facilities in 6 wind resource areas of California. The underlying NWP models that provided the raw forecasts were the two primary operational models of the US National Weather Service: the GFS and NAM models. The focus was on 1- and 2-day ahead forecasts of the hourly wind-based generation. The statistical methods evaluated included: (1) screening multiple linear regression, which served as a baseline method, (2) artificial neural networks, (3) a decision-tree approach called random forests, (4) gradient boosted regression based upon an decision-tree algorithm, (5) support vector regression and (6) analog ensemble, which is a case-matching scheme. The presentation will provide (1) an overview of each method and the experimental design, (2) performance comparisons based on standard metrics such as bias, MAE and RMSE, (3) a summary of the performance characteristics of each approach and (4) a preview of further experiments to be conducted.

  8. Pile Driving

    NASA Technical Reports Server (NTRS)

    1987-01-01

    Machine-oriented structural engineering firm TERA, Inc. is engaged in a project to evaluate the reliability of offshore pile driving prediction methods to eventually predict the best pile driving technique for each new offshore oil platform. Phase I Pile driving records of 48 offshore platforms including such information as blow counts, soil composition and pertinent construction details were digitized. In Phase II, pile driving records were statistically compared with current methods of prediction. Result was development of modular software, the CRIPS80 Software Design Analyzer System, that companies can use to evaluate other prediction procedures or other data bases.

  9. Evaluating and implementing temporal, spatial, and spatio-temporal methods for outbreak detection in a local syndromic surveillance system.

    PubMed

    Mathes, Robert W; Lall, Ramona; Levin-Rector, Alison; Sell, Jessica; Paladini, Marc; Konty, Kevin J; Olson, Don; Weiss, Don

    2017-01-01

    The New York City Department of Health and Mental Hygiene has operated an emergency department syndromic surveillance system since 2001, using temporal and spatial scan statistics run on a daily basis for cluster detection. Since the system was originally implemented, a number of new methods have been proposed for use in cluster detection. We evaluated six temporal and four spatial/spatio-temporal detection methods using syndromic surveillance data spiked with simulated injections. The algorithms were compared on several metrics, including sensitivity, specificity, positive predictive value, coherence, and timeliness. We also evaluated each method's implementation, programming time, run time, and the ease of use. Among the temporal methods, at a set specificity of 95%, a Holt-Winters exponential smoother performed the best, detecting 19% of the simulated injects across all shapes and sizes, followed by an autoregressive moving average model (16%), a generalized linear model (15%), a modified version of the Early Aberration Reporting System's C2 algorithm (13%), a temporal scan statistic (11%), and a cumulative sum control chart (<2%). Of the spatial/spatio-temporal methods we tested, a spatial scan statistic detected 3% of all injects, a Bayes regression found 2%, and a generalized linear mixed model and a space-time permutation scan statistic detected none at a specificity of 95%. Positive predictive value was low (<7%) for all methods. Overall, the detection methods we tested did not perform well in identifying the temporal and spatial clusters of cases in the inject dataset. The spatial scan statistic, our current method for spatial cluster detection, performed slightly better than the other tested methods across different inject magnitudes and types. Furthermore, we found the scan statistics, as applied in the SaTScan software package, to be the easiest to program and implement for daily data analysis.

  10. Rainfall Prediction of Indian Peninsula: Comparison of Time Series Based Approach and Predictor Based Approach using Machine Learning Techniques

    NASA Astrophysics Data System (ADS)

    Dash, Y.; Mishra, S. K.; Panigrahi, B. K.

    2017-12-01

    Prediction of northeast/post monsoon rainfall which occur during October, November and December (OND) over Indian peninsula is a challenging task due to the dynamic nature of uncertain chaotic climate. It is imperative to elucidate this issue by examining performance of different machine leaning (ML) approaches. The prime objective of this research is to compare between a) statistical prediction using historical rainfall observations and global atmosphere-ocean predictors like Sea Surface Temperature (SST) and Sea Level Pressure (SLP) and b) empirical prediction based on a time series analysis of past rainfall data without using any other predictors. Initially, ML techniques have been applied on SST and SLP data (1948-2014) obtained from NCEP/NCAR reanalysis monthly mean provided by the NOAA ESRL PSD. Later, this study investigated the applicability of ML methods using OND rainfall time series for 1948-2014 and forecasted up to 2018. The predicted values of aforementioned methods were verified using observed time series data collected from Indian Institute of Tropical Meteorology and the result revealed good performance of ML algorithms with minimal error scores. Thus, it is found that both statistical and empirical methods are useful for long range climatic projections.

  11. Logistic model analysis of neurological findings in Minamata disease and the predicting index.

    PubMed

    Nakagawa, Masanori; Kodama, Tomoko; Akiba, Suminori; Arimura, Kimiyoshi; Wakamiya, Junji; Futatsuka, Makoto; Kitano, Takao; Osame, Mitsuhiro

    2002-01-01

    To establish a statistical diagnostic method to identify patients with Minamata disease (MD) considering factors of aging and sex, we analyzed the neurological findings in MD patients, inhabitants in a methylmercury polluted (MP) area, and inhabitants in a non-MP area. We compared the neurological findings in MD patients and inhabitants aged more than 40 years in the non-MP area. Based on the different frequencies of the neurological signs in the two groups, we devised the following formula to calculate the predicting index for MD: predicting index = 1/(1+e(-x)) x 100 (The value of x was calculated using the regression coefficients of each neurological finding obtained from logistic analysis. The index 100 indicated MD, and 0, non-MD). Using this method, we found that 100% of male and 98% of female patients with MD (95 cases) gave predicting indices higher than 95. Five percent of the aged inhabitants in the MP area (598 inhabitants) and 0.2% of those in the non-MP area (558 inhabitants) gave predicting indices of 50 or higher. Our statistical diagnostic method for MD was useful in distinguishing MD patients from healthy elders based on their neurological findings.

  12. Cytomegalovirus frequency in neonatal intrahepatic cholestasis determined by serology, histology, immunohistochemistry and PCR

    PubMed Central

    Bellomo-Brandao, Maria Angela; Andrade, Paula D; Costa, Sandra CB; Escanhoela, Cecilia AF; Vassallo, Jose; Porta, Gilda; De Tommaso, Adriana MA; Hessel, Gabriel

    2009-01-01

    AIM: To determine cytomegalovirus (CMV) frequency in neonatal intrahepatic cholestasis by serology, histological revision (searching for cytomegalic cells), immunohistochemistry, and polymerase chain reaction (PCR), and to verify the relationships among these methods. METHODS: The study comprised 101 non-consecutive infants submitted for hepatic biopsy between March 1982 and December 2005. Serological results were obtained from the patient’s files and the other methods were performed on paraffin-embedded liver samples from hepatic biopsies. The following statistical measures were calculated: frequency, sensibility, specific positive predictive value, negative predictive value, and accuracy. RESULTS: The frequencies of positive results were as follows: serology, 7/64 (11%); histological revision, 0/84; immunohistochemistry, 1/44 (2%), and PCR, 6/77 (8%). Only one patient had positive immunohistochemical findings and a positive PCR. The following statistical measures were calculated between PCR and serology: sensitivity, 33.3%; specificity, 88.89%; positive predictive value, 28.57%; negative predictive value, 90.91%; and accuracy, 82.35%. CONCLUSION: The frequency of positive CMV varied among the tests. Serology presented the highest positive frequency. When compared to PCR, the sensitivity and positive predictive value of serology were low. PMID:19610143

  13. Design of a testing strategy using non-animal based test methods: lessons learnt from the ACuteTox project.

    PubMed

    Kopp-Schneider, Annette; Prieto, Pilar; Kinsner-Ovaskainen, Agnieszka; Stanzel, Sven

    2013-06-01

    In the framework of toxicology, a testing strategy can be viewed as a series of steps which are taken to come to a final prediction about a characteristic of a compound under study. The testing strategy is performed as a single-step procedure, usually called a test battery, using simultaneously all information collected on different endpoints, or as tiered approach in which a decision tree is followed. Design of a testing strategy involves statistical considerations, such as the development of a statistical prediction model. During the EU FP6 ACuteTox project, several prediction models were proposed on the basis of statistical classification algorithms which we illustrate here. The final choice of testing strategies was not based on statistical considerations alone. However, without thorough statistical evaluations a testing strategy cannot be identified. We present here a number of observations made from the statistical viewpoint which relate to the development of testing strategies. The points we make were derived from problems we had to deal with during the evaluation of this large research project. A central issue during the development of a prediction model is the danger of overfitting. Procedures are presented to deal with this challenge. Copyright © 2012 Elsevier Ltd. All rights reserved.

  14. Strength and life criteria for corrugated fiberboard by three methods

    Treesearch

    Thomas J. Urbanik

    1997-01-01

    The conventional test method for determining the stacking life of corrugated containers at a fixed load level does not adequately predict a safe load when storage time is fixed. This study introduced multiple load levels and related the probability of time at failure to load. A statistical analysis of logarithm-of-time failure data varying with load level predicts the...

  15. Predictive analysis of beer quality by correlating sensory evaluation with higher alcohol and ester production using multivariate statistics methods.

    PubMed

    Dong, Jian-Jun; Li, Qing-Liang; Yin, Hua; Zhong, Cheng; Hao, Jun-Guang; Yang, Pan-Fei; Tian, Yu-Hong; Jia, Shi-Ru

    2014-10-15

    Sensory evaluation is regarded as a necessary procedure to ensure a reproducible quality of beer. Meanwhile, high-throughput analytical methods provide a powerful tool to analyse various flavour compounds, such as higher alcohol and ester. In this study, the relationship between flavour compounds and sensory evaluation was established by non-linear models such as partial least squares (PLS), genetic algorithm back-propagation neural network (GA-BP), support vector machine (SVM). It was shown that SVM with a Radial Basis Function (RBF) had a better performance of prediction accuracy for both calibration set (94.3%) and validation set (96.2%) than other models. Relatively lower prediction abilities were observed for GA-BP (52.1%) and PLS (31.7%). In addition, the kernel function of SVM played an essential role of model training when the prediction accuracy of SVM with polynomial kernel function was 32.9%. As a powerful multivariate statistics method, SVM holds great potential to assess beer quality. Copyright © 2014 Elsevier Ltd. All rights reserved.

  16. Contrast Analysis: A Tutorial

    ERIC Educational Resources Information Center

    Haans, Antal

    2018-01-01

    Contrast analysis is a relatively simple but effective statistical method for testing theoretical predictions about differences between group means against the empirical data. Despite its advantages, contrast analysis is hardly used to date, perhaps because it is not implemented in a convenient manner in many statistical software packages. This…

  17. Implementing statistical equating for MRCP(UK) Parts 1 and 2.

    PubMed

    McManus, I C; Chis, Liliana; Fox, Ray; Waller, Derek; Tang, Peter

    2014-09-26

    The MRCP(UK) exam, in 2008 and 2010, changed the standard-setting of its Part 1 and Part 2 examinations from a hybrid Angoff/Hofstee method to statistical equating using Item Response Theory, the reference group being UK graduates. The present paper considers the implementation of the change, the question of whether the pass rate increased amongst non-UK candidates, any possible role of Differential Item Functioning (DIF), and changes in examination predictive validity after the change. Analysis of data of MRCP(UK) Part 1 exam from 2003 to 2013 and Part 2 exam from 2005 to 2013. Inspection suggested that Part 1 pass rates were stable after the introduction of statistical equating, but showed greater annual variation probably due to stronger candidates taking the examination earlier. Pass rates seemed to have increased in non-UK graduates after equating was introduced, but was not associated with any changes in DIF after statistical equating. Statistical modelling of the pass rates for non-UK graduates found that pass rates, in both Part 1 and Part 2, were increasing year on year, with the changes probably beginning before the introduction of equating. The predictive validity of Part 1 for Part 2 was higher with statistical equating than with the previous hybrid Angoff/Hofstee method, confirming the utility of IRT-based statistical equating. Statistical equating was successfully introduced into the MRCP(UK) Part 1 and Part 2 written examinations, resulting in higher predictive validity than the previous Angoff/Hofstee standard setting. Concerns about an artefactual increase in pass rates for non-UK candidates after equating were shown not to be well-founded. Most likely the changes resulted from a genuine increase in candidate ability, albeit for reasons which remain unclear, coupled with a cognitive illusion giving the impression of a step-change immediately after equating began. Statistical equating provides a robust standard-setting method, with a better theoretical foundation than judgemental techniques such as Angoff, and is more straightforward and requires far less examiner time to provide a more valid result. The present study provides a detailed case study of introducing statistical equating, and issues which may need to be considered with its introduction.

  18. An adaptive data-driven method for accurate prediction of remaining useful life of rolling bearings

    NASA Astrophysics Data System (ADS)

    Peng, Yanfeng; Cheng, Junsheng; Liu, Yanfei; Li, Xuejun; Peng, Zhihua

    2018-06-01

    A novel data-driven method based on Gaussian mixture model (GMM) and distance evaluation technique (DET) is proposed to predict the remaining useful life (RUL) of rolling bearings. The data sets are clustered by GMM to divide all data sets into several health states adaptively and reasonably. The number of clusters is determined by the minimum description length principle. Thus, either the health state of the data sets or the number of the states is obtained automatically. Meanwhile, the abnormal data sets can be recognized during the clustering process and removed from the training data sets. After obtaining the health states, appropriate features are selected by DET for increasing the classification and prediction accuracy. In the prediction process, each vibration signal is decomposed into several components by empirical mode decomposition. Some common statistical parameters of the components are calculated first and then the features are clustered using GMM to divide the data sets into several health states and remove the abnormal data sets. Thereafter, appropriate statistical parameters of the generated components are selected using DET. Finally, least squares support vector machine is utilized to predict the RUL of rolling bearings. Experimental results indicate that the proposed method reliably predicts the RUL of rolling bearings.

  19. Assessing Discriminative Performance at External Validation of Clinical Prediction Models

    PubMed Central

    Nieboer, Daan; van der Ploeg, Tjeerd; Steyerberg, Ewout W.

    2016-01-01

    Introduction External validation studies are essential to study the generalizability of prediction models. Recently a permutation test, focusing on discrimination as quantified by the c-statistic, was proposed to judge whether a prediction model is transportable to a new setting. We aimed to evaluate this test and compare it to previously proposed procedures to judge any changes in c-statistic from development to external validation setting. Methods We compared the use of the permutation test to the use of benchmark values of the c-statistic following from a previously proposed framework to judge transportability of a prediction model. In a simulation study we developed a prediction model with logistic regression on a development set and validated them in the validation set. We concentrated on two scenarios: 1) the case-mix was more heterogeneous and predictor effects were weaker in the validation set compared to the development set, and 2) the case-mix was less heterogeneous in the validation set and predictor effects were identical in the validation and development set. Furthermore we illustrated the methods in a case study using 15 datasets of patients suffering from traumatic brain injury. Results The permutation test indicated that the validation and development set were homogenous in scenario 1 (in almost all simulated samples) and heterogeneous in scenario 2 (in 17%-39% of simulated samples). Previously proposed benchmark values of the c-statistic and the standard deviation of the linear predictors correctly pointed at the more heterogeneous case-mix in scenario 1 and the less heterogeneous case-mix in scenario 2. Conclusion The recently proposed permutation test may provide misleading results when externally validating prediction models in the presence of case-mix differences between the development and validation population. To correctly interpret the c-statistic found at external validation it is crucial to disentangle case-mix differences from incorrect regression coefficients. PMID:26881753

  20. Projecting future precipitation and temperature at sites with diverse climate through multiple statistical downscaling schemes

    NASA Astrophysics Data System (ADS)

    Vallam, P.; Qin, X. S.

    2017-10-01

    Anthropogenic-driven climate change would affect the global ecosystem and is becoming a world-wide concern. Numerous studies have been undertaken to determine the future trends of meteorological variables at different scales. Despite these studies, there remains significant uncertainty in the prediction of future climates. To examine the uncertainty arising from using different schemes to downscale the meteorological variables for the future horizons, projections from different statistical downscaling schemes were examined. These schemes included statistical downscaling method (SDSM), change factor incorporated with LARS-WG, and bias corrected disaggregation (BCD) method. Global circulation models (GCMs) based on CMIP3 (HadCM3) and CMIP5 (CanESM2) were utilized to perturb the changes in the future climate. Five study sites (i.e., Alice Springs, Edmonton, Frankfurt, Miami, and Singapore) with diverse climatic conditions were chosen for examining the spatial variability of applying various statistical downscaling schemes. The study results indicated that the regions experiencing heavy precipitation intensities were most likely to demonstrate the divergence between the predictions from various statistical downscaling methods. Also, the variance computed in projecting the weather extremes indicated the uncertainty derived from selection of downscaling tools and climate models. This study could help gain an improved understanding about the features of different downscaling approaches and the overall downscaling uncertainty.

  1. The energetic cost of walking: a comparison of predictive methods.

    PubMed

    Kramer, Patricia Ann; Sylvester, Adam D

    2011-01-01

    The energy that animals devote to locomotion has been of intense interest to biologists for decades and two basic methodologies have emerged to predict locomotor energy expenditure: those based on metabolic and those based on mechanical energy. Metabolic energy approaches share the perspective that prediction of locomotor energy expenditure should be based on statistically significant proxies of metabolic function, while mechanical energy approaches, which derive from many different perspectives, focus on quantifying the energy of movement. Some controversy exists as to which mechanical perspective is "best", but from first principles all mechanical methods should be equivalent if the inputs to the simulation are of similar quality. Our goals in this paper are 1) to establish the degree to which the various methods of calculating mechanical energy are correlated, and 2) to investigate to what degree the prediction methods explain the variation in energy expenditure. We use modern humans as the model organism in this experiment because their data are readily attainable, but the methodology is appropriate for use in other species. Volumetric oxygen consumption and kinematic and kinetic data were collected on 8 adults while walking at their self-selected slow, normal and fast velocities. Using hierarchical statistical modeling via ordinary least squares and maximum likelihood techniques, the predictive ability of several metabolic and mechanical approaches were assessed. We found that all approaches are correlated and that the mechanical approaches explain similar amounts of the variation in metabolic energy expenditure. Most methods predict the variation within an individual well, but are poor at accounting for variation between individuals. Our results indicate that the choice of predictive method is dependent on the question(s) of interest and the data available for use as inputs. Although we used modern humans as our model organism, these results can be extended to other species.

  2. Evaluating and implementing temporal, spatial, and spatio-temporal methods for outbreak detection in a local syndromic surveillance system

    PubMed Central

    Lall, Ramona; Levin-Rector, Alison; Sell, Jessica; Paladini, Marc; Konty, Kevin J.; Olson, Don; Weiss, Don

    2017-01-01

    The New York City Department of Health and Mental Hygiene has operated an emergency department syndromic surveillance system since 2001, using temporal and spatial scan statistics run on a daily basis for cluster detection. Since the system was originally implemented, a number of new methods have been proposed for use in cluster detection. We evaluated six temporal and four spatial/spatio-temporal detection methods using syndromic surveillance data spiked with simulated injections. The algorithms were compared on several metrics, including sensitivity, specificity, positive predictive value, coherence, and timeliness. We also evaluated each method’s implementation, programming time, run time, and the ease of use. Among the temporal methods, at a set specificity of 95%, a Holt-Winters exponential smoother performed the best, detecting 19% of the simulated injects across all shapes and sizes, followed by an autoregressive moving average model (16%), a generalized linear model (15%), a modified version of the Early Aberration Reporting System’s C2 algorithm (13%), a temporal scan statistic (11%), and a cumulative sum control chart (<2%). Of the spatial/spatio-temporal methods we tested, a spatial scan statistic detected 3% of all injects, a Bayes regression found 2%, and a generalized linear mixed model and a space-time permutation scan statistic detected none at a specificity of 95%. Positive predictive value was low (<7%) for all methods. Overall, the detection methods we tested did not perform well in identifying the temporal and spatial clusters of cases in the inject dataset. The spatial scan statistic, our current method for spatial cluster detection, performed slightly better than the other tested methods across different inject magnitudes and types. Furthermore, we found the scan statistics, as applied in the SaTScan software package, to be the easiest to program and implement for daily data analysis. PMID:28886112

  3. Estimation of Global Network Statistics from Incomplete Data

    PubMed Central

    Bliss, Catherine A.; Danforth, Christopher M.; Dodds, Peter Sheridan

    2014-01-01

    Complex networks underlie an enormous variety of social, biological, physical, and virtual systems. A profound complication for the science of complex networks is that in most cases, observing all nodes and all network interactions is impossible. Previous work addressing the impacts of partial network data is surprisingly limited, focuses primarily on missing nodes, and suggests that network statistics derived from subsampled data are not suitable estimators for the same network statistics describing the overall network topology. We generate scaling methods to predict true network statistics, including the degree distribution, from only partial knowledge of nodes, links, or weights. Our methods are transparent and do not assume a known generating process for the network, thus enabling prediction of network statistics for a wide variety of applications. We validate analytical results on four simulated network classes and empirical data sets of various sizes. We perform subsampling experiments by varying proportions of sampled data and demonstrate that our scaling methods can provide very good estimates of true network statistics while acknowledging limits. Lastly, we apply our techniques to a set of rich and evolving large-scale social networks, Twitter reply networks. Based on 100 million tweets, we use our scaling techniques to propose a statistical characterization of the Twitter Interactome from September 2008 to November 2008. Our treatment allows us to find support for Dunbar's hypothesis in detecting an upper threshold for the number of active social contacts that individuals maintain over the course of one week. PMID:25338183

  4. Seasonal drought predictability in Portugal using statistical-dynamical techniques

    NASA Astrophysics Data System (ADS)

    Ribeiro, A. F. S.; Pires, C. A. L.

    2016-08-01

    Atmospheric forecasting and predictability are important to promote adaption and mitigation measures in order to minimize drought impacts. This study estimates hybrid (statistical-dynamical) long-range forecasts of the regional drought index SPI (3-months) over homogeneous regions from mainland Portugal, based on forecasts from the UKMO operational forecasting system, with lead-times up to 6 months. ERA-Interim reanalysis data is used for the purpose of building a set of SPI predictors integrating recent past information prior to the forecast launching. Then, the advantage of combining predictors with both dynamical and statistical background in the prediction of drought conditions at different lags is evaluated. A two-step hybridization procedure is performed, in which both forecasted and observed 500 hPa geopotential height fields are subjected to a PCA in order to use forecasted PCs and persistent PCs as predictors. A second hybridization step consists on a statistical/hybrid downscaling to the regional SPI, based on regression techniques, after the pre-selection of the statistically significant predictors. The SPI forecasts and the added value of combining dynamical and statistical methods are evaluated in cross-validation mode, using the R2 and binary event scores. Results are obtained for the four seasons and it was found that winter is the most predictable season, and that most of the predictive power is on the large-scale fields from past observations. The hybridization improves the downscaling based on the forecasted PCs, since they provide complementary information (though modest) beyond that of persistent PCs. These findings provide clues about the predictability of the SPI, particularly in Portugal, and may contribute to the predictability of crops yields and to some guidance on users (such as farmers) decision making process.

  5. MODELING A MIXTURE: PBPK/PD APPROACHES FOR PREDICTING CHEMICAL INTERACTIONS.

    EPA Science Inventory

    Since environmental chemical exposures generally involve multiple chemicals, there are both regulatory and scientific drivers to develop methods to predict outcomes of these exposures. Even using efficient statistical and experimental designs, it is not possible to test in vivo a...

  6. Why significant variables aren't automatically good predictors.

    PubMed

    Lo, Adeline; Chernoff, Herman; Zheng, Tian; Lo, Shaw-Hwa

    2015-11-10

    Thus far, genome-wide association studies (GWAS) have been disappointing in the inability of investigators to use the results of identified, statistically significant variants in complex diseases to make predictions useful for personalized medicine. Why are significant variables not leading to good prediction of outcomes? We point out that this problem is prevalent in simple as well as complex data, in the sciences as well as the social sciences. We offer a brief explanation and some statistical insights on why higher significance cannot automatically imply stronger predictivity and illustrate through simulations and a real breast cancer example. We also demonstrate that highly predictive variables do not necessarily appear as highly significant, thus evading the researcher using significance-based methods. We point out that what makes variables good for prediction versus significance depends on different properties of the underlying distributions. If prediction is the goal, we must lay aside significance as the only selection standard. We suggest that progress in prediction requires efforts toward a new research agenda of searching for a novel criterion to retrieve highly predictive variables rather than highly significant variables. We offer an alternative approach that was not designed for significance, the partition retention method, which was very effective predicting on a long-studied breast cancer data set, by reducing the classification error rate from 30% to 8%.

  7. Kernel-based whole-genome prediction of complex traits: a review.

    PubMed

    Morota, Gota; Gianola, Daniel

    2014-01-01

    Prediction of genetic values has been a focus of applied quantitative genetics since the beginning of the 20th century, with renewed interest following the advent of the era of whole genome-enabled prediction. Opportunities offered by the emergence of high-dimensional genomic data fueled by post-Sanger sequencing technologies, especially molecular markers, have driven researchers to extend Ronald Fisher and Sewall Wright's models to confront new challenges. In particular, kernel methods are gaining consideration as a regression method of choice for genome-enabled prediction. Complex traits are presumably influenced by many genomic regions working in concert with others (clearly so when considering pathways), thus generating interactions. Motivated by this view, a growing number of statistical approaches based on kernels attempt to capture non-additive effects, either parametrically or non-parametrically. This review centers on whole-genome regression using kernel methods applied to a wide range of quantitative traits of agricultural importance in animals and plants. We discuss various kernel-based approaches tailored to capturing total genetic variation, with the aim of arriving at an enhanced predictive performance in the light of available genome annotation information. Connections between prediction machines born in animal breeding, statistics, and machine learning are revisited, and their empirical prediction performance is discussed. Overall, while some encouraging results have been obtained with non-parametric kernels, recovering non-additive genetic variation in a validation dataset remains a challenge in quantitative genetics.

  8. A fractional factorial probabilistic collocation method for uncertainty propagation of hydrologic model parameters in a reduced dimensional space

    NASA Astrophysics Data System (ADS)

    Wang, S.; Huang, G. H.; Huang, W.; Fan, Y. R.; Li, Z.

    2015-10-01

    In this study, a fractional factorial probabilistic collocation method is proposed to reveal statistical significance of hydrologic model parameters and their multi-level interactions affecting model outputs, facilitating uncertainty propagation in a reduced dimensional space. The proposed methodology is applied to the Xiangxi River watershed in China to demonstrate its validity and applicability, as well as its capability of revealing complex and dynamic parameter interactions. A set of reduced polynomial chaos expansions (PCEs) only with statistically significant terms can be obtained based on the results of factorial analysis of variance (ANOVA), achieving a reduction of uncertainty in hydrologic predictions. The predictive performance of reduced PCEs is verified by comparing against standard PCEs and the Monte Carlo with Latin hypercube sampling (MC-LHS) method in terms of reliability, sharpness, and Nash-Sutcliffe efficiency (NSE). Results reveal that the reduced PCEs are able to capture hydrologic behaviors of the Xiangxi River watershed, and they are efficient functional representations for propagating uncertainties in hydrologic predictions.

  9. Evaluation of bearing capacity of piles from cone penetration test data.

    DOT National Transportation Integrated Search

    2007-12-01

    A statistical analysis and ranking criteria were used to compare the CPT methods and the conventional alpha design method. Based on the results, the de Ruiter/Beringen and LCPC methods showed the best capability in predicting the measured load carryi...

  10. A multivariate model and statistical method for validating tree grade lumber yield equations

    Treesearch

    Donald W. Seegrist

    1975-01-01

    Lumber yields within lumber grades can be described by a multivariate linear model. A method for validating lumber yield prediction equations when there are several tree grades is presented. The method is based on multivariate simultaneous test procedures.

  11. Properties of the endogenous post-stratified estimator using a random forests model

    Treesearch

    John Tipton; Jean Opsomer; Gretchen G. Moisen

    2012-01-01

    Post-stratification is used in survey statistics as a method to improve variance estimates. In traditional post-stratification methods, the variable on which the data is being stratified must be known at the population level. In many cases this is not possible, but it is possible to use a model to predict values using covariates, and then stratify on these predicted...

  12. Review and statistical analysis of the ultrasonic velocity method for estimating the porosity fraction in polycrystalline materials

    NASA Technical Reports Server (NTRS)

    Roth, D. J.; Swickard, S. M.; Stang, D. B.; Deguire, M. R.

    1990-01-01

    A review and statistical analysis of the ultrasonic velocity method for estimating the porosity fraction in polycrystalline materials is presented. Initially, a semi-empirical model is developed showing the origin of the linear relationship between ultrasonic velocity and porosity fraction. Then, from a compilation of data produced by many researchers, scatter plots of velocity versus percent porosity data are shown for Al2O3, MgO, porcelain-based ceramics, PZT, SiC, Si3N4, steel, tungsten, UO2,(U0.30Pu0.70)C, and YBa2Cu3O(7-x). Linear regression analysis produced predicted slope, intercept, correlation coefficient, level of significance, and confidence interval statistics for the data. Velocity values predicted from regression analysis for fully-dense materials are in good agreement with those calculated from elastic properties.

  13. Predicting trauma patient mortality: ICD [or ICD-10-AM] versus AIS based approaches.

    PubMed

    Willis, Cameron D; Gabbe, Belinda J; Jolley, Damien; Harrison, James E; Cameron, Peter A

    2010-11-01

    The International Classification of Diseases Injury Severity Score (ICISS) has been proposed as an International Classification of Diseases (ICD)-10-based alternative to mortality prediction tools that use Abbreviated Injury Scale (AIS) data, including the Trauma and Injury Severity Score (TRISS). To date, studies have not examined the performance of ICISS using Australian trauma registry data. This study aimed to compare the performance of ICISS with other mortality prediction tools in an Australian trauma registry. This was a retrospective review of prospectively collected data from the Victorian State Trauma Registry. A training dataset was created for model development and a validation dataset for evaluation. The multiplicative ICISS model was compared with a worst injury ICISS approach, Victorian TRISS (V-TRISS, using local coefficients), maximum AIS severity and a multivariable model including ICD-10-AM codes as predictors. Models were investigated for discrimination (C-statistic) and calibration (Hosmer-Lemeshow statistic). The multivariable approach had the highest level of discrimination (C-statistic 0.90) and calibration (H-L 7.65, P= 0.468). Worst injury ICISS, V-TRISS and maximum AIS had similar performance. The multiplicative ICISS produced the lowest level of discrimination (C-statistic 0.80) and poorest calibration (H-L 50.23, P < 0.001). The performance of ICISS may be affected by the data used to develop estimates, the ICD version employed, the methods for deriving estimates and the inclusion of covariates. In this analysis, a multivariable approach using ICD-10-AM codes was the best-performing method. A multivariable ICISS approach may therefore be a useful alternative to AIS-based methods and may have comparable predictive performance to locally derived TRISS models. © 2010 The Authors. ANZ Journal of Surgery © 2010 Royal Australasian College of Surgeons.

  14. Predicting Binding Free Energy Change Caused by Point Mutations with Knowledge-Modified MM/PBSA Method.

    PubMed

    Petukh, Marharyta; Li, Minghui; Alexov, Emil

    2015-07-01

    A new methodology termed Single Amino Acid Mutation based change in Binding free Energy (SAAMBE) was developed to predict the changes of the binding free energy caused by mutations. The method utilizes 3D structures of the corresponding protein-protein complexes and takes advantage of both approaches: sequence- and structure-based methods. The method has two components: a MM/PBSA-based component, and an additional set of statistical terms delivered from statistical investigation of physico-chemical properties of protein complexes. While the approach is rigid body approach and does not explicitly consider plausible conformational changes caused by the binding, the effect of conformational changes, including changes away from binding interface, on electrostatics are mimicked with amino acid specific dielectric constants. This provides significant improvement of SAAMBE predictions as indicated by better match against experimentally determined binding free energy changes over 1300 mutations in 43 proteins. The final benchmarking resulted in a very good agreement with experimental data (correlation coefficient 0.624) while the algorithm being fast enough to allow for large-scale calculations (the average time is less than a minute per mutation).

  15. Choosing the appropriate forecasting model for predictive parameter control.

    PubMed

    Aleti, Aldeida; Moser, Irene; Meedeniya, Indika; Grunske, Lars

    2014-01-01

    All commonly used stochastic optimisation algorithms have to be parameterised to perform effectively. Adaptive parameter control (APC) is an effective method used for this purpose. APC repeatedly adjusts parameter values during the optimisation process for optimal algorithm performance. The assignment of parameter values for a given iteration is based on previously measured performance. In recent research, time series prediction has been proposed as a method of projecting the probabilities to use for parameter value selection. In this work, we examine the suitability of a variety of prediction methods for the projection of future parameter performance based on previous data. All considered prediction methods have assumptions the time series data has to conform to for the prediction method to provide accurate projections. Looking specifically at parameters of evolutionary algorithms (EAs), we find that all standard EA parameters with the exception of population size conform largely to the assumptions made by the considered prediction methods. Evaluating the performance of these prediction methods, we find that linear regression provides the best results by a very small and statistically insignificant margin. Regardless of the prediction method, predictive parameter control outperforms state of the art parameter control methods when the performance data adheres to the assumptions made by the prediction method. When a parameter's performance data does not adhere to the assumptions made by the forecasting method, the use of prediction does not have a notable adverse impact on the algorithm's performance.

  16. Three Strategies for the Critical Use of Statistical Methods in Psychological Research

    ERIC Educational Resources Information Center

    Campitelli, Guillermo; Macbeth, Guillermo; Ospina, Raydonal; Marmolejo-Ramos, Fernando

    2017-01-01

    We present three strategies to replace the null hypothesis statistical significance testing approach in psychological research: (1) visual representation of cognitive processes and predictions, (2) visual representation of data distributions and choice of the appropriate distribution for analysis, and (3) model comparison. The three strategies…

  17. The value of model averaging and dynamical climate model predictions for improving statistical seasonal streamflow forecasts over Australia

    NASA Astrophysics Data System (ADS)

    Pokhrel, Prafulla; Wang, Q. J.; Robertson, David E.

    2013-10-01

    Seasonal streamflow forecasts are valuable for planning and allocation of water resources. In Australia, the Bureau of Meteorology employs a statistical method to forecast seasonal streamflows. The method uses predictors that are related to catchment wetness at the start of a forecast period and to climate during the forecast period. For the latter, a predictor is selected among a number of lagged climate indices as candidates to give the "best" model in terms of model performance in cross validation. This study investigates two strategies for further improvement in seasonal streamflow forecasts. The first is to combine, through Bayesian model averaging, multiple candidate models with different lagged climate indices as predictors, to take advantage of different predictive strengths of the multiple models. The second strategy is to introduce additional candidate models, using rainfall and sea surface temperature predictions from a global climate model as predictors. This is to take advantage of the direct simulations of various dynamic processes. The results show that combining forecasts from multiple statistical models generally yields more skillful forecasts than using only the best model and appears to moderate the worst forecast errors. The use of rainfall predictions from the dynamical climate model marginally improves the streamflow forecasts when viewed over all the study catchments and seasons, but the use of sea surface temperature predictions provide little additional benefit.

  18. Predicting Flowering Behavior and Exploring Its Genetic Determinism in an Apple Multi-family Population Based on Statistical Indices and Simplified Phenotyping.

    PubMed

    Durand, Jean-Baptiste; Allard, Alix; Guitton, Baptiste; van de Weg, Eric; Bink, Marco C A M; Costes, Evelyne

    2017-01-01

    Irregular flowering over years is commonly observed in fruit trees. The early prediction of tree behavior is highly desirable in breeding programmes. This study aims at performing such predictions, combining simplified phenotyping and statistics methods. Sequences of vegetative vs. floral annual shoots (AS) were observed along axes in trees belonging to five apple related full-sib families. Sequences were analyzed using Markovian and linear mixed models including year and site effects. Indices of flowering irregularity, periodicity and synchronicity were estimated, at tree and axis scales. They were used to predict tree behavior and detect QTL with a Bayesian pedigree-based analysis, using an integrated genetic map containing 6,849 SNPs. The combination of a Biennial Bearing Index (BBI) with an autoregressive coefficient (γ g ) efficiently predicted and classified the genotype behaviors, despite few misclassifications. Four QTLs common to BBIs and γ g and one for synchronicity were highlighted and revealed the complex genetic architecture of the traits. Irregularity resulted from high AS synchronism, whereas regularity resulted from either asynchronous locally alternating or continual regular AS flowering. A relevant and time-saving method, based on a posteriori sampling of axes and statistical indices is proposed, which is efficient to evaluate the tree breeding values for flowering regularity and could be transferred to other species.

  19. Track-pattern-based seasonal prediction model for intense tropical cyclone activities over the North Atlantic and the western North Pacific basins

    NASA Astrophysics Data System (ADS)

    Choi, W.; Ho, C. H.

    2015-12-01

    Intense tropical cyclones (TCs) accompanying heavy rainfall and destructive wind gusts sometimes cause incredible socio-economic damages in the regions near their landfall. This study aims to analyze intense TC activities in the North Atlantic (NA) and the western North Pacific (WNP) basins and develop their track propensity seasonal prediction model. Considering that the number of TCs in the NA basin is much smaller than that in the WNP basin, different intensity criteria are used; category 1 and above for NA and category 3 and above for WNP based on Saffir-Simpson hurricane wind scale. By using a fuzzy clustering method, intense TC tracks in the NA and the WNP basins are classified into two and three representative patterns, respectively. Each pattern shows empirical relationships with climate variabilities such as sea surface temperature distribution associated with El Niño/La Niña or Atlantic Meridional Mode, Pacific decadal oscillation, upper and low level zonal wind, and strength of subtropical high. The hybrid statistical-dynamical method has been used to develop the seasonal prediction model for each pattern based on statistical relationships between the intense TC activity and seasonal averaged key predictors. The model performance is statistically assessed by cross validation for the training period (1982-2013) and has been applied for the 2014 and 2015 prediction. This study suggests applicability of this model to real prediction work and provide bridgehead of attempt for intense TC prediction.

  20. Addressing issues associated with evaluating prediction models for survival endpoints based on the concordance statistic.

    PubMed

    Wang, Ming; Long, Qi

    2016-09-01

    Prediction models for disease risk and prognosis play an important role in biomedical research, and evaluating their predictive accuracy in the presence of censored data is of substantial interest. The standard concordance (c) statistic has been extended to provide a summary measure of predictive accuracy for survival models. Motivated by a prostate cancer study, we address several issues associated with evaluating survival prediction models based on c-statistic with a focus on estimators using the technique of inverse probability of censoring weighting (IPCW). Compared to the existing work, we provide complete results on the asymptotic properties of the IPCW estimators under the assumption of coarsening at random (CAR), and propose a sensitivity analysis under the mechanism of noncoarsening at random (NCAR). In addition, we extend the IPCW approach as well as the sensitivity analysis to high-dimensional settings. The predictive accuracy of prediction models for cancer recurrence after prostatectomy is assessed by applying the proposed approaches. We find that the estimated predictive accuracy for the models in consideration is sensitive to NCAR assumption, and thus identify the best predictive model. Finally, we further evaluate the performance of the proposed methods in both settings of low-dimensional and high-dimensional data under CAR and NCAR through simulations. © 2016, The International Biometric Society.

  1. A range of complex probabilistic models for RNA secondary structure prediction that includes the nearest-neighbor model and more.

    PubMed

    Rivas, Elena; Lang, Raymond; Eddy, Sean R

    2012-02-01

    The standard approach for single-sequence RNA secondary structure prediction uses a nearest-neighbor thermodynamic model with several thousand experimentally determined energy parameters. An attractive alternative is to use statistical approaches with parameters estimated from growing databases of structural RNAs. Good results have been reported for discriminative statistical methods using complex nearest-neighbor models, including CONTRAfold, Simfold, and ContextFold. Little work has been reported on generative probabilistic models (stochastic context-free grammars [SCFGs]) of comparable complexity, although probabilistic models are generally easier to train and to use. To explore a range of probabilistic models of increasing complexity, and to directly compare probabilistic, thermodynamic, and discriminative approaches, we created TORNADO, a computational tool that can parse a wide spectrum of RNA grammar architectures (including the standard nearest-neighbor model and more) using a generalized super-grammar that can be parameterized with probabilities, energies, or arbitrary scores. By using TORNADO, we find that probabilistic nearest-neighbor models perform comparably to (but not significantly better than) discriminative methods. We find that complex statistical models are prone to overfitting RNA structure and that evaluations should use structurally nonhomologous training and test data sets. Overfitting has affected at least one published method (ContextFold). The most important barrier to improving statistical approaches for RNA secondary structure prediction is the lack of diversity of well-curated single-sequence RNA secondary structures in current RNA databases.

  2. A range of complex probabilistic models for RNA secondary structure prediction that includes the nearest-neighbor model and more

    PubMed Central

    Rivas, Elena; Lang, Raymond; Eddy, Sean R.

    2012-01-01

    The standard approach for single-sequence RNA secondary structure prediction uses a nearest-neighbor thermodynamic model with several thousand experimentally determined energy parameters. An attractive alternative is to use statistical approaches with parameters estimated from growing databases of structural RNAs. Good results have been reported for discriminative statistical methods using complex nearest-neighbor models, including CONTRAfold, Simfold, and ContextFold. Little work has been reported on generative probabilistic models (stochastic context-free grammars [SCFGs]) of comparable complexity, although probabilistic models are generally easier to train and to use. To explore a range of probabilistic models of increasing complexity, and to directly compare probabilistic, thermodynamic, and discriminative approaches, we created TORNADO, a computational tool that can parse a wide spectrum of RNA grammar architectures (including the standard nearest-neighbor model and more) using a generalized super-grammar that can be parameterized with probabilities, energies, or arbitrary scores. By using TORNADO, we find that probabilistic nearest-neighbor models perform comparably to (but not significantly better than) discriminative methods. We find that complex statistical models are prone to overfitting RNA structure and that evaluations should use structurally nonhomologous training and test data sets. Overfitting has affected at least one published method (ContextFold). The most important barrier to improving statistical approaches for RNA secondary structure prediction is the lack of diversity of well-curated single-sequence RNA secondary structures in current RNA databases. PMID:22194308

  3. “Wish You Were Here”: Examining Characteristics, Outcomes, and Statistical Solutions for Missing Cases in Web-Based Psychotherapeutic Trials

    PubMed Central

    Dear, Blake F; Heller, Gillian Z; Crane, Monique F; Titov, Nickolai

    2018-01-01

    Background Missing cases following treatment are common in Web-based psychotherapy trials. Without the ability to directly measure and evaluate the outcomes for missing cases, the ability to measure and evaluate the effects of treatment is challenging. Although common, little is known about the characteristics of Web-based psychotherapy participants who present as missing cases, their likely clinical outcomes, or the suitability of different statistical assumptions that can characterize missing cases. Objective Using a large sample of individuals who underwent Web-based psychotherapy for depressive symptoms (n=820), the aim of this study was to explore the characteristics of cases who present as missing cases at posttreatment (n=138), their likely treatment outcomes, and compare between statistical methods for replacing their missing data. Methods First, common participant and treatment features were tested through binary logistic regression models, evaluating the ability to predict missing cases. Second, the same variables were screened for their ability to increase or impede the rate symptom change that was observed following treatment. Third, using recontacted cases at 3-month follow-up to proximally represent missing cases outcomes following treatment, various simulated replacement scores were compared and evaluated against observed clinical follow-up scores. Results Missing cases were dominantly predicted by lower treatment adherence and increased symptoms at pretreatment. Statistical methods that ignored these characteristics can overlook an important clinical phenomenon and consequently produce inaccurate replacement outcomes, with symptoms estimates that can swing from −32% to 70% from the observed outcomes of recontacted cases. In contrast, longitudinal statistical methods that adjusted their estimates for missing cases outcomes by treatment adherence rates and baseline symptoms scores resulted in minimal measurement bias (<8%). Conclusions Certain variables can characterize and predict missing cases likelihood and jointly predict lesser clinical improvement. Under such circumstances, individuals with potentially worst off treatment outcomes can become concealed, and failure to adjust for this can lead to substantial clinical measurement bias. Together, this preliminary research suggests that missing cases in Web-based psychotherapeutic interventions may not occur as random events and can be systematically predicted. Critically, at the same time, missing cases may experience outcomes that are distinct and important for a complete understanding of the treatment effect. PMID:29674311

  4. Hip joint center localisation: A biomechanical application to hip arthroplasty population

    PubMed Central

    Bouffard, Vicky; Begon, Mickael; Champagne, Annick; Farhadnia, Payam; Vendittoli, Pascal-André; Lavigne, Martin; Prince, François

    2012-01-01

    AIM: To determine hip joint center (HJC) location on hip arthroplasty population comparing predictive and functional approaches with radiographic measurements. METHODS: The distance between the HJC and the mid-pelvis was calculated and compared between the three approaches. The localisation error between the predictive and functional approach was compared using the radiographic measurements as the reference. The operated leg was compared to the non-operated leg. RESULTS: A significant difference was found for the distance between the HJC and the mid-pelvis when comparing the predictive and functional method. The functional method leads to fewer errors. A statistical difference was found for the localization error between the predictive and functional method. The functional method is twice more precise. CONCLUSION: Although being more individualized, the functional method improves HJC localization and should be used in three-dimensional gait analysis. PMID:22919569

  5. Comparison of RF spectrum prediction methods for dynamic spectrum access

    NASA Astrophysics Data System (ADS)

    Kovarskiy, Jacob A.; Martone, Anthony F.; Gallagher, Kyle A.; Sherbondy, Kelly D.; Narayanan, Ram M.

    2017-05-01

    Dynamic spectrum access (DSA) refers to the adaptive utilization of today's busy electromagnetic spectrum. Cognitive radio/radar technologies require DSA to intelligently transmit and receive information in changing environments. Predicting radio frequency (RF) activity reduces sensing time and energy consumption for identifying usable spectrum. Typical spectrum prediction methods involve modeling spectral statistics with Hidden Markov Models (HMM) or various neural network structures. HMMs describe the time-varying state probabilities of Markov processes as a dynamic Bayesian network. Neural Networks model biological brain neuron connections to perform a wide range of complex and often non-linear computations. This work compares HMM, Multilayer Perceptron (MLP), and Recurrent Neural Network (RNN) algorithms and their ability to perform RF channel state prediction. Monte Carlo simulations on both measured and simulated spectrum data evaluate the performance of these algorithms. Generalizing spectrum occupancy as an alternating renewal process allows Poisson random variables to generate simulated data while energy detection determines the occupancy state of measured RF spectrum data for testing. The results suggest that neural networks achieve better prediction accuracy and prove more adaptable to changing spectral statistics than HMMs given sufficient training data.

  6. An object programming based environment for protein secondary structure prediction.

    PubMed

    Giacomini, M; Ruggiero, C; Sacile, R

    1996-01-01

    The most frequently used methods for protein secondary structure prediction are empirical statistical methods and rule based methods. A consensus system based on object-oriented programming is presented, which integrates the two approaches with the aim of improving the prediction quality. This system uses an object-oriented knowledge representation based on the concepts of conformation, residue and protein, where the conformation class is the basis, the residue class derives from it and the protein class derives from the residue class. The system has been tested with satisfactory results on several proteins of the Brookhaven Protein Data Bank. Its results have been compared with the results of the most widely used prediction methods, and they show a higher prediction capability and greater stability. Moreover, the system itself provides an index of the reliability of its current prediction. This system can also be regarded as a basis structure for programs of this kind.

  7. Survival Regression Modeling Strategies in CVD Prediction.

    PubMed

    Barkhordari, Mahnaz; Padyab, Mojgan; Sardarinia, Mahsa; Hadaegh, Farzad; Azizi, Fereidoun; Bozorgmanesh, Mohammadreza

    2016-04-01

    A fundamental part of prevention is prediction. Potential predictors are the sine qua non of prediction models. However, whether incorporating novel predictors to prediction models could be directly translated to added predictive value remains an area of dispute. The difference between the predictive power of a predictive model with (enhanced model) and without (baseline model) a certain predictor is generally regarded as an indicator of the predictive value added by that predictor. Indices such as discrimination and calibration have long been used in this regard. Recently, the use of added predictive value has been suggested while comparing the predictive performances of the predictive models with and without novel biomarkers. User-friendly statistical software capable of implementing novel statistical procedures is conspicuously lacking. This shortcoming has restricted implementation of such novel model assessment methods. We aimed to construct Stata commands to help researchers obtain the aforementioned statistical indices. We have written Stata commands that are intended to help researchers obtain the following. 1, Nam-D'Agostino X 2 goodness of fit test; 2, Cut point-free and cut point-based net reclassification improvement index (NRI), relative absolute integrated discriminatory improvement index (IDI), and survival-based regression analyses. We applied the commands to real data on women participating in the Tehran lipid and glucose study (TLGS) to examine if information relating to a family history of premature cardiovascular disease (CVD), waist circumference, and fasting plasma glucose can improve predictive performance of Framingham's general CVD risk algorithm. The command is adpredsurv for survival models. Herein we have described the Stata package "adpredsurv" for calculation of the Nam-D'Agostino X 2 goodness of fit test as well as cut point-free and cut point-based NRI, relative and absolute IDI, and survival-based regression analyses. We hope this work encourages the use of novel methods in examining predictive capacity of the emerging plethora of novel biomarkers.

  8. Interspecies scaling and prediction of human clearance: comparison of small- and macro-molecule drugs

    PubMed Central

    Huh, Yeamin; Smith, David E.; Feng, Meihau Rose

    2014-01-01

    Human clearance prediction for small- and macro-molecule drugs was evaluated and compared using various scaling methods and statistical analysis.Human clearance is generally well predicted using single or multiple species simple allometry for macro- and small-molecule drugs excreted renally.The prediction error is higher for hepatically eliminated small-molecules using single or multiple species simple allometry scaling, and it appears that the prediction error is mainly associated with drugs with low hepatic extraction ratio (Eh). The error in human clearance prediction for hepatically eliminated small-molecules was reduced using scaling methods with a correction of maximum life span (MLP) or brain weight (BRW).Human clearance of both small- and macro-molecule drugs is well predicted using the monkey liver blood flow method. Predictions using liver blood flow from other species did not work as well, especially for the small-molecule drugs. PMID:21892879

  9. Smoking Gun or Circumstantial Evidence? Comparison of Statistical Learning Methods using Functional Annotations for Prioritizing Risk Variants.

    PubMed

    Gagliano, Sarah A; Ravji, Reena; Barnes, Michael R; Weale, Michael E; Knight, Jo

    2015-08-24

    Although technology has triumphed in facilitating routine genome sequencing, new challenges have been created for the data-analyst. Genome-scale surveys of human variation generate volumes of data that far exceed capabilities for laboratory characterization. By incorporating functional annotations as predictors, statistical learning has been widely investigated for prioritizing genetic variants likely to be associated with complex disease. We compared three published prioritization procedures, which use different statistical learning algorithms and different predictors with regard to the quantity, type and coding. We also explored different combinations of algorithm and annotation set. As an application, we tested which methodology performed best for prioritizing variants using data from a large schizophrenia meta-analysis by the Psychiatric Genomics Consortium. Results suggest that all methods have considerable (and similar) predictive accuracies (AUCs 0.64-0.71) in test set data, but there is more variability in the application to the schizophrenia GWAS. In conclusion, a variety of algorithms and annotations seem to have a similar potential to effectively enrich true risk variants in genome-scale datasets, however none offer more than incremental improvement in prediction. We discuss how methods might be evolved for risk variant prediction to address the impending bottleneck of the new generation of genome re-sequencing studies.

  10. Ensemble flare forecasting: using numerical weather prediction techniques to improve space weather operations

    NASA Astrophysics Data System (ADS)

    Murray, S.; Guerra, J. A.

    2017-12-01

    One essential component of operational space weather forecasting is the prediction of solar flares. Early flare forecasting work focused on statistical methods based on historical flaring rates, but more complex machine learning methods have been developed in recent years. A multitude of flare forecasting methods are now available, however it is still unclear which of these methods performs best, and none are substantially better than climatological forecasts. Current operational space weather centres cannot rely on automated methods, and generally use statistical forecasts with a little human intervention. Space weather researchers are increasingly looking towards methods used in terrestrial weather to improve current forecasting techniques. Ensemble forecasting has been used in numerical weather prediction for many years as a way to combine different predictions in order to obtain a more accurate result. It has proved useful in areas such as magnetospheric modelling and coronal mass ejection arrival analysis, however has not yet been implemented in operational flare forecasting. Here we construct ensemble forecasts for major solar flares by linearly combining the full-disk probabilistic forecasts from a group of operational forecasting methods (ASSA, ASAP, MAG4, MOSWOC, NOAA, and Solar Monitor). Forecasts from each method are weighted by a factor that accounts for the method's ability to predict previous events, and several performance metrics (both probabilistic and categorical) are considered. The results provide space weather forecasters with a set of parameters (combination weights, thresholds) that allow them to select the most appropriate values for constructing the 'best' ensemble forecast probability value, according to the performance metric of their choice. In this way different forecasts can be made to fit different end-user needs.

  11. TH-CD-207A-07: Prediction of High Dimensional State Subject to Respiratory Motion: A Manifold Learning Approach

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Liu, W; Sawant, A; Ruan, D

    Purpose: The development of high dimensional imaging systems (e.g. volumetric MRI, CBCT, photogrammetry systems) in image-guided radiotherapy provides important pathways to the ultimate goal of real-time volumetric/surface motion monitoring. This study aims to develop a prediction method for the high dimensional state subject to respiratory motion. Compared to conventional linear dimension reduction based approaches, our method utilizes manifold learning to construct a descriptive feature submanifold, where more efficient and accurate prediction can be performed. Methods: We developed a prediction framework for high-dimensional state subject to respiratory motion. The proposed method performs dimension reduction in a nonlinear setting to permit moremore » descriptive features compared to its linear counterparts (e.g., classic PCA). Specifically, a kernel PCA is used to construct a proper low-dimensional feature manifold, where low-dimensional prediction is performed. A fixed-point iterative pre-image estimation method is applied subsequently to recover the predicted value in the original state space. We evaluated and compared the proposed method with PCA-based method on 200 level-set surfaces reconstructed from surface point clouds captured by the VisionRT system. The prediction accuracy was evaluated with respect to root-mean-squared-error (RMSE) for both 200ms and 600ms lookahead lengths. Results: The proposed method outperformed PCA-based approach with statistically higher prediction accuracy. In one-dimensional feature subspace, our method achieved mean prediction accuracy of 0.86mm and 0.89mm for 200ms and 600ms lookahead lengths respectively, compared to 0.95mm and 1.04mm from PCA-based method. The paired t-tests further demonstrated the statistical significance of the superiority of our method, with p-values of 6.33e-3 and 5.78e-5, respectively. Conclusion: The proposed approach benefits from the descriptiveness of a nonlinear manifold and the prediction reliability in such low dimensional manifold. The fixed-point iterative approach turns out to work well practically for the pre-image recovery. Our approach is particularly suitable to facilitate managing respiratory motion in image-guide radiotherapy. This work is supported in part by NIH grant R01 CA169102-02.« less

  12. Prediction of Multiple-Trait and Multiple-Environment Genomic Data Using Recommender Systems.

    PubMed

    Montesinos-López, Osval A; Montesinos-López, Abelardo; Crossa, José; Montesinos-López, José C; Mota-Sanchez, David; Estrada-González, Fermín; Gillberg, Jussi; Singh, Ravi; Mondal, Suchismita; Juliana, Philomin

    2018-01-04

    In genomic-enabled prediction, the task of improving the accuracy of the prediction of lines in environments is difficult because the available information is generally sparse and usually has low correlations between traits. In current genomic selection, although researchers have a large amount of information and appropriate statistical models to process it, there is still limited computing efficiency to do so. Although some statistical models are usually mathematically elegant, many of them are also computationally inefficient, and they are impractical for many traits, lines, environments, and years because they need to sample from huge normal multivariate distributions. For these reasons, this study explores two recommender systems: item-based collaborative filtering (IBCF) and the matrix factorization algorithm (MF) in the context of multiple traits and multiple environments. The IBCF and MF methods were compared with two conventional methods on simulated and real data. Results of the simulated and real data sets show that the IBCF technique was slightly better in terms of prediction accuracy than the two conventional methods and the MF method when the correlation was moderately high. The IBCF technique is very attractive because it produces good predictions when there is high correlation between items (environment-trait combinations) and its implementation is computationally feasible, which can be useful for plant breeders who deal with very large data sets. Copyright © 2018 Montesinos-Lopez et al.

  13. Prediction of Multiple-Trait and Multiple-Environment Genomic Data Using Recommender Systems

    PubMed Central

    Montesinos-López, Osval A.; Montesinos-López, Abelardo; Crossa, José; Montesinos-López, José C.; Mota-Sanchez, David; Estrada-González, Fermín; Gillberg, Jussi; Singh, Ravi; Mondal, Suchismita; Juliana, Philomin

    2018-01-01

    In genomic-enabled prediction, the task of improving the accuracy of the prediction of lines in environments is difficult because the available information is generally sparse and usually has low correlations between traits. In current genomic selection, although researchers have a large amount of information and appropriate statistical models to process it, there is still limited computing efficiency to do so. Although some statistical models are usually mathematically elegant, many of them are also computationally inefficient, and they are impractical for many traits, lines, environments, and years because they need to sample from huge normal multivariate distributions. For these reasons, this study explores two recommender systems: item-based collaborative filtering (IBCF) and the matrix factorization algorithm (MF) in the context of multiple traits and multiple environments. The IBCF and MF methods were compared with two conventional methods on simulated and real data. Results of the simulated and real data sets show that the IBCF technique was slightly better in terms of prediction accuracy than the two conventional methods and the MF method when the correlation was moderately high. The IBCF technique is very attractive because it produces good predictions when there is high correlation between items (environment–trait combinations) and its implementation is computationally feasible, which can be useful for plant breeders who deal with very large data sets. PMID:29097376

  14. The critical period hypothesis in second language acquisition: a statistical critique and a reanalysis.

    PubMed

    Vanhove, Jan

    2013-01-01

    In second language acquisition research, the critical period hypothesis (cph) holds that the function between learners' age and their susceptibility to second language input is non-linear. This paper revisits the indistinctness found in the literature with regard to this hypothesis's scope and predictions. Even when its scope is clearly delineated and its predictions are spelt out, however, empirical studies-with few exceptions-use analytical (statistical) tools that are irrelevant with respect to the predictions made. This paper discusses statistical fallacies common in cph research and illustrates an alternative analytical method (piecewise regression) by means of a reanalysis of two datasets from a 2010 paper purporting to have found cross-linguistic evidence in favour of the cph. This reanalysis reveals that the specific age patterns predicted by the cph are not cross-linguistically robust. Applying the principle of parsimony, it is concluded that age patterns in second language acquisition are not governed by a critical period. To conclude, this paper highlights the role of confirmation bias in the scientific enterprise and appeals to second language acquisition researchers to reanalyse their old datasets using the methods discussed in this paper. The data and R commands that were used for the reanalysis are provided as supplementary materials.

  15. Progress of statistical analysis in biomedical research through the historical review of the development of the Framingham score.

    PubMed

    Ignjatović, Aleksandra; Stojanović, Miodrag; Milošević, Zoran; Anđelković Apostolović, Marija

    2017-12-02

    The interest in developing risk models in medicine not only is appealing, but also associated with many obstacles in different aspects of predictive model development. Initially, the association of biomarkers or the association of more markers with the specific outcome was proven by statistical significance, but novel and demanding questions required the development of new and more complex statistical techniques. Progress of statistical analysis in biomedical research can be observed the best through the history of the Framingham study and development of the Framingham score. Evaluation of predictive models comes from a combination of the facts which are results of several metrics. Using logistic regression and Cox proportional hazards regression analysis, the calibration test, and the ROC curve analysis should be mandatory and eliminatory, and the central place should be taken by some new statistical techniques. In order to obtain complete information related to the new marker in the model, recently, there is a recommendation to use the reclassification tables by calculating the net reclassification index and the integrated discrimination improvement. Decision curve analysis is a novel method for evaluating the clinical usefulness of a predictive model. It may be noted that customizing and fine-tuning of the Framingham risk score initiated the development of statistical analysis. Clinically applicable predictive model should be a trade-off between all abovementioned statistical metrics, a trade-off between calibration and discrimination, accuracy and decision-making, costs and benefits, and quality and quantity of patient's life.

  16. Statistical evaluation of forecasts

    NASA Astrophysics Data System (ADS)

    Mader, Malenka; Mader, Wolfgang; Gluckman, Bruce J.; Timmer, Jens; Schelter, Björn

    2014-08-01

    Reliable forecasts of extreme but rare events, such as earthquakes, financial crashes, and epileptic seizures, would render interventions and precautions possible. Therefore, forecasting methods have been developed which intend to raise an alarm if an extreme event is about to occur. In order to statistically validate the performance of a prediction system, it must be compared to the performance of a random predictor, which raises alarms independent of the events. Such a random predictor can be obtained by bootstrapping or analytically. We propose an analytic statistical framework which, in contrast to conventional methods, allows for validating independently the sensitivity and specificity of a forecasting method. Moreover, our method accounts for the periods during which an event has to remain absent or occur after a respective forecast.

  17. Power flow as a complement to statistical energy analysis and finite element analysis

    NASA Technical Reports Server (NTRS)

    Cuschieri, J. M.

    1987-01-01

    Present methods of analysis of the structural response and the structure-borne transmission of vibrational energy use either finite element (FE) techniques or statistical energy analysis (SEA) methods. The FE methods are a very useful tool at low frequencies where the number of resonances involved in the analysis is rather small. On the other hand SEA methods can predict with acceptable accuracy the response and energy transmission between coupled structures at relatively high frequencies where the structural modal density is high and a statistical approach is the appropriate solution. In the mid-frequency range, a relatively large number of resonances exist which make finite element method too costly. On the other hand SEA methods can only predict an average level form. In this mid-frequency range a possible alternative is to use power flow techniques, where the input and flow of vibrational energy to excited and coupled structural components can be expressed in terms of input and transfer mobilities. This power flow technique can be extended from low to high frequencies and this can be integrated with established FE models at low frequencies and SEA models at high frequencies to form a verification of the method. This method of structural analysis using power flo and mobility methods, and its integration with SEA and FE analysis is applied to the case of two thin beams joined together at right angles.

  18. Solar Activity Heading for a Maunder Minimum?

    NASA Astrophysics Data System (ADS)

    Schatten, K. H.; Tobiska, W. K.

    2003-05-01

    Long-range (few years to decades) solar activity prediction techniques vary greatly in their methods. They range from examining planetary orbits, to spectral analyses (e.g. Fourier, wavelet and spectral analyses), to artificial intelligence methods, to simply using general statistical techniques. Rather than concentrate on statistical/mathematical/numerical methods, we discuss a class of methods which appears to have a "physical basis." Not only does it have a physical basis, but this basis is rooted in both "basic" physics (dynamo theory), but also solar physics (Babcock dynamo theory). The class we discuss is referred to as "precursor methods," originally developed by Ohl, Brown and Williams and others, using geomagnetic observations. My colleagues and I have developed some understanding for how these methods work and have expanded the prediction methods using "solar dynamo precursor" methods, notably a "SODA" index (SOlar Dynamo Amplitude). These methods are now based upon an understanding of the Sun's dynamo processes- to explain a connection between how the Sun's fields are generated and how the Sun broadcasts its future activity levels to Earth. This has led to better monitoring of the Sun's dynamo fields and is leading to more accurate prediction techniques. Related to the Sun's polar and toroidal magnetic fields, we explain how these methods work, past predictions, the current cycle, and predictions of future of solar activity levels for the next few solar cycles. The surprising result of these long-range predictions is a rapid decline in solar activity, starting with cycle #24. If this trend continues, we may see the Sun heading towards a "Maunder" type of solar activity minimum - an extensive period of reduced levels of solar activity. For the solar physicists, who enjoy studying solar activity, we hope this isn't so, but for NASA, which must place and maintain satellites in low earth orbit (LEO), it may help with reboost problems. Space debris, and other aspects of objects in LEO will also be affected. This research is supported by the NSF and NASA.

  19. Improved artificial neural networks in prediction of malignancy of lesions in contrast-enhanced MR-mammography.

    PubMed

    Vomweg, T W; Buscema, M; Kauczor, H U; Teifke, A; Intraligi, M; Terzi, S; Heussel, C P; Achenbach, T; Rieker, O; Mayer, D; Thelen, M

    2003-09-01

    The aim of this study was to evaluate the capability of improved artificial neural networks (ANN) and additional novel training methods in distinguishing between benign and malignant breast lesions in contrast-enhanced magnetic resonance-mammography (MRM). A total of 604 histologically proven cases of contrast-enhanced lesions of the female breast at MRI were analyzed. Morphological, dynamic and clinical parameters were collected and stored in a database. The data set was divided into several groups using random or experimental methods [Training & Testing (T&T) algorithm] to train and test different ANNs. An additional novel computer program for input variable selection was applied. Sensitivity and specificity were calculated and compared with a statistical method and an expert radiologist. After optimization of the distribution of cases among the training and testing sets by the T & T algorithm and the reduction of input variables by the Input Selection procedure a highly sophisticated ANN achieved a sensitivity of 93.6% and a specificity of 91.9% in predicting malignancy of lesions within an independent prediction sample set. The best statistical method reached a sensitivity of 90.5% and a specificity of 68.9%. An expert radiologist performed better than the statistical method but worse than the ANN (sensitivity 92.1%, specificity 85.6%). Features extracted out of dynamic contrast-enhanced MRM and additional clinical data can be successfully analyzed by advanced ANNs. The quality of the resulting network strongly depends on the training methods, which are improved by the use of novel training tools. The best results of an improved ANN outperform expert radiologists.

  20. Modeling student success in engineering education

    NASA Astrophysics Data System (ADS)

    Jin, Qu

    In order for the United States to maintain its global competitiveness, the long-term success of our engineering students in specific courses, programs, and colleges is now, more than ever, an extremely high priority. Numerous studies have focused on factors that impact student success, namely academic performance, retention, and/or graduation. However, there are only a limited number of works that have systematically developed models to investigate important factors and to predict student success in engineering. Therefore, this research presents three separate but highly connected investigations to address this gap. The first investigation involves explaining and predicting engineering students' success in Calculus I courses using statistical models. The participants were more than 4000 first-year engineering students (cohort years 2004 - 2008) who enrolled in Calculus I courses during the first semester in a large Midwestern university. Predictions from statistical models were proposed to be used to place engineering students into calculus courses. The success rates were improved by 12% in Calculus IA using predictions from models developed over traditional placement method. The results showed that these statistical models provided a more accurate calculus placement method than traditional placement methods and help improve success rates in those courses. In the second investigation, multi-outcome and single-outcome neural network models were designed to understand and to predict first-year retention and first-year GPA of engineering students. The participants were more than 3000 first year engineering students (cohort years 2004 - 2005) enrolled in a large Midwestern university. The independent variables include both high school academic performance factors and affective factors measured prior to entry. The prediction performances of the multi-outcome and single-outcome models were comparable. The ability to predict cumulative GPA at the end of an engineering student's first year of college was about a half of a grade point for both models. The predictors of retention and cumulative GPA while being similar differ in that high school academic metrics play a more important role in predicting cumulative GPA with the affective measures playing a more important role in predicting retention. In the last investigation, multi-outcome neural network models were used to understand and to predict engineering students' retention, GPA, and graduation from entry to departure. The participants were more than 4000 engineering students (cohort years 2004 - 2006) enrolled in a large Midwestern university. Different patterns of important predictors were identified for GPA, retention, and graduation. Overall, this research explores the feasibility of using modeling to enhance a student's educational experience in engineering. Student success modeling was used to identify the most important cognitive and affective predictors for a student's first calculus course retention, GPA, and graduation. The results suggest that the statistical modeling methods have great potential to assist decision making and help ensure student success in engineering education.

  1. Arkansas StreamStats: a U.S. Geological Survey web map application for basin characteristics and streamflow statistics

    USGS Publications Warehouse

    Pugh, Aaron L.

    2014-01-01

    Users of streamflow information often require streamflow statistics and basin characteristics at various locations along a stream. The USGS periodically calculates and publishes streamflow statistics and basin characteristics for streamflowgaging stations and partial-record stations, but these data commonly are scattered among many reports that may or may not be readily available to the public. The USGS also provides and periodically updates regional analyses of streamflow statistics that include regression equations and other prediction methods for estimating statistics for ungaged and unregulated streams across the State. Use of these regional predictions for a stream can be complex and often requires the user to determine a number of basin characteristics that may require interpretation. Basin characteristics may include drainage area, classifiers for physical properties, climatic characteristics, and other inputs. Obtaining these input values for gaged and ungaged locations traditionally has been time consuming, subjective, and can lead to inconsistent results.

  2. A statistical framework to predict functional non-coding regions in the human genome through integrated analysis of annotation data.

    PubMed

    Lu, Qiongshi; Hu, Yiming; Sun, Jiehuan; Cheng, Yuwei; Cheung, Kei-Hoi; Zhao, Hongyu

    2015-05-27

    Identifying functional regions in the human genome is a major goal in human genetics. Great efforts have been made to functionally annotate the human genome either through computational predictions, such as genomic conservation, or high-throughput experiments, such as the ENCODE project. These efforts have resulted in a rich collection of functional annotation data of diverse types that need to be jointly analyzed for integrated interpretation and annotation. Here we present GenoCanyon, a whole-genome annotation method that performs unsupervised statistical learning using 22 computational and experimental annotations thereby inferring the functional potential of each position in the human genome. With GenoCanyon, we are able to predict many of the known functional regions. The ability of predicting functional regions as well as its generalizable statistical framework makes GenoCanyon a unique and powerful tool for whole-genome annotation. The GenoCanyon web server is available at http://genocanyon.med.yale.edu.

  3. Prediction system of hydroponic plant growth and development using algorithm Fuzzy Mamdani method

    NASA Astrophysics Data System (ADS)

    Sudana, I. Made; Purnawirawan, Okta; Arief, Ulfa Mediaty

    2017-03-01

    Hydroponics is a method of farming without soil. One of the Hydroponic plants is Watercress (Nasturtium Officinale). The development and growth process of hydroponic Watercress was influenced by levels of nutrients, acidity and temperature. The independent variables can be used as input variable system to predict the value level of plants growth and development. The prediction system is using Fuzzy Algorithm Mamdani method. This system was built to implement the function of Fuzzy Inference System (Fuzzy Inference System/FIS) as a part of the Fuzzy Logic Toolbox (FLT) by using MATLAB R2007b. FIS is a computing system that works on the principle of fuzzy reasoning which is similar to humans' reasoning. Basically FIS consists of four units which are fuzzification unit, fuzzy logic reasoning unit, base knowledge unit and defuzzification unit. In addition to know the effect of independent variables on the plants growth and development that can be visualized with the function diagram of FIS output surface that is shaped three-dimensional, and statistical tests based on the data from the prediction system using multiple linear regression method, which includes multiple linear regression analysis, T test, F test, the coefficient of determination and donations predictor that are calculated using SPSS (Statistical Product and Service Solutions) software applications.

  4. Statistical prediction with Kanerva's sparse distributed memory

    NASA Technical Reports Server (NTRS)

    Rogers, David

    1989-01-01

    A new viewpoint of the processing performed by Kanerva's sparse distributed memory (SDM) is presented. In conditions of near- or over-capacity, where the associative-memory behavior of the model breaks down, the processing performed by the model can be interpreted as that of a statistical predictor. Mathematical results are presented which serve as the framework for a new statistical viewpoint of sparse distributed memory and for which the standard formulation of SDM is a special case. This viewpoint suggests possible enhancements to the SDM model, including a procedure for improving the predictiveness of the system based on Holland's work with genetic algorithms, and a method for improving the capacity of SDM even when used as an associative memory.

  5. Computer program for prediction of fuel consumption statistical data for an upper stage three-axes stabilized on-off control system

    NASA Technical Reports Server (NTRS)

    1982-01-01

    A FORTRAN coded computer program and method to predict the reaction control fuel consumption statistics for a three axis stabilized rocket vehicle upper stage is described. A Monte Carlo approach is used which is more efficient by using closed form estimates of impulses. The effects of rocket motor thrust misalignment, static unbalance, aerodynamic disturbances, and deviations in trajectory, mass properties and control system characteristics are included. This routine can be applied to many types of on-off reaction controlled vehicles. The pseudorandom number generation and statistical analyses subroutines including the output histograms can be used for other Monte Carlo analyses problems.

  6. Analysis of Machine Learning Techniques for Heart Failure Readmissions.

    PubMed

    Mortazavi, Bobak J; Downing, Nicholas S; Bucholz, Emily M; Dharmarajan, Kumar; Manhapra, Ajay; Li, Shu-Xia; Negahban, Sahand N; Krumholz, Harlan M

    2016-11-01

    The current ability to predict readmissions in patients with heart failure is modest at best. It is unclear whether machine learning techniques that address higher dimensional, nonlinear relationships among variables would enhance prediction. We sought to compare the effectiveness of several machine learning algorithms for predicting readmissions. Using data from the Telemonitoring to Improve Heart Failure Outcomes trial, we compared the effectiveness of random forests, boosting, random forests combined hierarchically with support vector machines or logistic regression (LR), and Poisson regression against traditional LR to predict 30- and 180-day all-cause readmissions and readmissions because of heart failure. We randomly selected 50% of patients for a derivation set, and a validation set comprised the remaining patients, validated using 100 bootstrapped iterations. We compared C statistics for discrimination and distributions of observed outcomes in risk deciles for predictive range. In 30-day all-cause readmission prediction, the best performing machine learning model, random forests, provided a 17.8% improvement over LR (mean C statistics, 0.628 and 0.533, respectively). For readmissions because of heart failure, boosting improved the C statistic by 24.9% over LR (mean C statistic 0.678 and 0.543, respectively). For 30-day all-cause readmission, the observed readmission rates in the lowest and highest deciles of predicted risk with random forests (7.8% and 26.2%, respectively) showed a much wider separation than LR (14.2% and 16.4%, respectively). Machine learning methods improved the prediction of readmission after hospitalization for heart failure compared with LR and provided the greatest predictive range in observed readmission rates. © 2016 American Heart Association, Inc.

  7. Assessing the accuracy and stability of variable selection methods for random forest modeling in ecology

    EPA Science Inventory

    Random forest (RF) modeling has emerged as an important statistical learning method in ecology due to its exceptional predictive performance. However, for large and complex ecological datasets there is limited guidance on variable selection methods for RF modeling. Typically, e...

  8. A NEW LOG EVALUATION METHOD TO APPRAISE MESAVERDE RE-COMPLETION OPPORTUNITIES

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Albert Greer

    2003-09-11

    Artificial intelligence tools, fuzzy logic and neural networks were used to evaluate the potential of the behind pipe Mesaverde formation in BMG's Mancos formation wells. A fractal geostatistical mapping algorithm was also used to predict Mesaverde production. Additionally, a conventional geological study was conducted. To date one Mesaverde completion has been performed. The Janet No.3 Mesaverde completion was non-economic. Both the AI method and the geostatistical methods predicted the failure of the Janet No.3. The Gavilan No.1 in the Mesaverde was completed during the course of the study and was an extremely good well. This well was not included inmore » the statistical dataset. The AI method predicted very good production while the fractal map predicted a poor producer.« less

  9. Bayesian inference based on dual generalized order statistics from the exponentiated Weibull model

    NASA Astrophysics Data System (ADS)

    Al Sobhi, Mashail M.

    2015-02-01

    Bayesian estimation for the two parameters and the reliability function of the exponentiated Weibull model are obtained based on dual generalized order statistics (DGOS). Also, Bayesian prediction bounds for future DGOS from exponentiated Weibull model are obtained. The symmetric and asymmetric loss functions are considered for Bayesian computations. The Markov chain Monte Carlo (MCMC) methods are used for computing the Bayes estimates and prediction bounds. The results have been specialized to the lower record values. Comparisons are made between Bayesian and maximum likelihood estimators via Monte Carlo simulation.

  10. Assessing the sensitivity and robustness of prediction models for apple firmness using spectral scattering technique

    USDA-ARS?s Scientific Manuscript database

    Spectral scattering is useful for nondestructive sensing of fruit firmness. Prediction models, however, are typically built using multivariate statistical methods such as partial least squares regression (PLSR), whose performance generally depends on the characteristics of the data. The aim of this ...

  11. Watershed Regressions for Pesticides (WARP) models for predicting stream concentrations of multiple pesticides

    USGS Publications Warehouse

    Stone, Wesley W.; Crawford, Charles G.; Gilliom, Robert J.

    2013-01-01

    Watershed Regressions for Pesticides for multiple pesticides (WARP-MP) are statistical models developed to predict concentration statistics for a wide range of pesticides in unmonitored streams. The WARP-MP models use the national atrazine WARP models in conjunction with an adjustment factor for each additional pesticide. The WARP-MP models perform best for pesticides with application timing and methods similar to those used with atrazine. For other pesticides, WARP-MP models tend to overpredict concentration statistics for the model development sites. For WARP and WARP-MP, the less-than-ideal sampling frequency for the model development sites leads to underestimation of the shorter-duration concentration; hence, the WARP models tend to underpredict 4- and 21-d maximum moving-average concentrations, with median errors ranging from 9 to 38% As a result of this sampling bias, pesticides that performed well with the model development sites are expected to have predictions that are biased low for these shorter-duration concentration statistics. The overprediction by WARP-MP apparent for some of the pesticides is variably offset by underestimation of the model development concentration statistics. Of the 112 pesticides used in the WARP-MP application to stream segments nationwide, 25 were predicted to have concentration statistics with a 50% or greater probability of exceeding one or more aquatic life benchmarks in one or more stream segments. Geographically, many of the modeled streams in the Corn Belt Region were predicted to have one or more pesticides that exceeded an aquatic life benchmark during 2009, indicating the potential vulnerability of streams in this region.

  12. Estimating sunspot number

    NASA Technical Reports Server (NTRS)

    Wilson, R. M.; Reichmann, E. J.; Teuber, D. L.

    1984-01-01

    An empirical method is developed to predict certain parameters of future solar activity cycles. Sunspot cycle statistics are examined, and curve fitting and linear regression analysis techniques are utilized.

  13. The application of feature selection to the development of Gaussian process models for percutaneous absorption.

    PubMed

    Lam, Lun Tak; Sun, Yi; Davey, Neil; Adams, Rod; Prapopoulou, Maria; Brown, Marc B; Moss, Gary P

    2010-06-01

    The aim was to employ Gaussian processes to assess mathematically the nature of a skin permeability dataset and to employ these methods, particularly feature selection, to determine the key physicochemical descriptors which exert the most significant influence on percutaneous absorption, and to compare such models with established existing models. Gaussian processes, including automatic relevance detection (GPRARD) methods, were employed to develop models of percutaneous absorption that identified key physicochemical descriptors of percutaneous absorption. Using MatLab software, the statistical performance of these models was compared with single linear networks (SLN) and quantitative structure-permeability relationships (QSPRs). Feature selection methods were used to examine in more detail the physicochemical parameters used in this study. A range of statistical measures to determine model quality were used. The inherently nonlinear nature of the skin data set was confirmed. The Gaussian process regression (GPR) methods yielded predictive models that offered statistically significant improvements over SLN and QSPR models with regard to predictivity (where the rank order was: GPR > SLN > QSPR). Feature selection analysis determined that the best GPR models were those that contained log P, melting point and the number of hydrogen bond donor groups as significant descriptors. Further statistical analysis also found that great synergy existed between certain parameters. It suggested that a number of the descriptors employed were effectively interchangeable, thus questioning the use of models where discrete variables are output, usually in the form of an equation. The use of a nonlinear GPR method produced models with significantly improved predictivity, compared with SLN or QSPR models. Feature selection methods were able to provide important mechanistic information. However, it was also shown that significant synergy existed between certain parameters, and as such it was possible to interchange certain descriptors (i.e. molecular weight and melting point) without incurring a loss of model quality. Such synergy suggested that a model constructed from discrete terms in an equation may not be the most appropriate way of representing mechanistic understandings of skin absorption.

  14. External model validation of binary clinical risk prediction models in cardiovascular and thoracic surgery.

    PubMed

    Hickey, Graeme L; Blackstone, Eugene H

    2016-08-01

    Clinical risk-prediction models serve an important role in healthcare. They are used for clinical decision-making and measuring the performance of healthcare providers. To establish confidence in a model, external model validation is imperative. When designing such an external model validation study, thought must be given to patient selection, risk factor and outcome definitions, missing data, and the transparent reporting of the analysis. In addition, there are a number of statistical methods available for external model validation. Execution of a rigorous external validation study rests in proper study design, application of suitable statistical methods, and transparent reporting. Copyright © 2016 The American Association for Thoracic Surgery. Published by Elsevier Inc. All rights reserved.

  15. Penalized Ordinal Regression Methods for Predicting Stage of Cancer in High-Dimensional Covariate Spaces.

    PubMed

    Gentry, Amanda Elswick; Jackson-Cook, Colleen K; Lyon, Debra E; Archer, Kellie J

    2015-01-01

    The pathological description of the stage of a tumor is an important clinical designation and is considered, like many other forms of biomedical data, an ordinal outcome. Currently, statistical methods for predicting an ordinal outcome using clinical, demographic, and high-dimensional correlated features are lacking. In this paper, we propose a method that fits an ordinal response model to predict an ordinal outcome for high-dimensional covariate spaces. Our method penalizes some covariates (high-throughput genomic features) without penalizing others (such as demographic and/or clinical covariates). We demonstrate the application of our method to predict the stage of breast cancer. In our model, breast cancer subtype is a nonpenalized predictor, and CpG site methylation values from the Illumina Human Methylation 450K assay are penalized predictors. The method has been made available in the ordinalgmifs package in the R programming environment.

  16. The skeletal maturation status estimated by statistical shape analysis: axial images of Japanese cervical vertebra

    PubMed Central

    Shin, S M; Choi, Y-S; Yamaguchi, T; Maki, K; Cho, B-H; Park, S-B

    2015-01-01

    Objectives: To evaluate axial cervical vertebral (ACV) shape quantitatively and to build a prediction model for skeletal maturation level using statistical shape analysis for Japanese individuals. Methods: The sample included 24 female and 19 male patients with hand–wrist radiographs and CBCT images. Through generalized Procrustes analysis and principal components (PCs) analysis, the meaningful PCs were extracted from each ACV shape and analysed for the estimation regression model. Results: Each ACV shape had meaningful PCs, except for the second axial cervical vertebra. Based on these models, the smallest prediction intervals (PIs) were from the combination of the shape space PCs, age and gender. Overall, the PIs of the male group were smaller than those of the female group. There was no significant correlation between centroid size as a size factor and skeletal maturation level. Conclusions: Our findings suggest that the ACV maturation method, which was applied by statistical shape analysis, could confirm information about skeletal maturation in Japanese individuals as an available quantifier of skeletal maturation and could be as useful a quantitative method as the skeletal maturation index. PMID:25411713

  17. Discrimination and prediction of the origin of Chinese and Korean soybeans using Fourier transform infrared spectrometry (FT-IR) with multivariate statistical analysis

    PubMed Central

    Lee, Byeong-Ju; Zhou, Yaoyao; Lee, Jae Soung; Shin, Byeung Kon; Seo, Jeong-Ah; Lee, Doyup; Kim, Young-Suk

    2018-01-01

    The ability to determine the origin of soybeans is an important issue following the inclusion of this information in the labeling of agricultural food products becoming mandatory in South Korea in 2017. This study was carried out to construct a prediction model for discriminating Chinese and Korean soybeans using Fourier-transform infrared (FT-IR) spectroscopy and multivariate statistical analysis. The optimal prediction models for discriminating soybean samples were obtained by selecting appropriate scaling methods, normalization methods, variable influence on projection (VIP) cutoff values, and wave-number regions. The factors for constructing the optimal partial-least-squares regression (PLSR) prediction model were using second derivatives, vector normalization, unit variance scaling, and the 4000–400 cm–1 region (excluding water vapor and carbon dioxide). The PLSR model for discriminating Chinese and Korean soybean samples had the best predictability when a VIP cutoff value was not applied. When Chinese soybean samples were identified, a PLSR model that has the lowest root-mean-square error of the prediction value was obtained using a VIP cutoff value of 1.5. The optimal PLSR prediction model for discriminating Korean soybean samples was also obtained using a VIP cutoff value of 1.5. This is the first study that has combined FT-IR spectroscopy with normalization methods, VIP cutoff values, and selected wave-number regions for discriminating Chinese and Korean soybeans. PMID:29689113

  18. Prediction of seizure-onset laterality by using Wada memory asymmetries in pediatric epilepsy surgery candidates.

    PubMed

    Lee, Gregory P; Park, Yong D; Hempel, Ann; Westerveld, Michael; Loring, David W

    2002-09-01

    Because the capacity of intracarotid amobarbital (Wada) memory assessment to predict seizure-onset laterality in children has not been thoroughly investigated, three comprehensive epilepsy surgery centers pooled their data and examined Wada memory asymmetries to predict side of seizure onset in children being considered for epilepsy surgery. One hundred fifty-two children with intractable epilepsy underwent Wada testing. Although the type and number of memory stimuli and methods varied at each institution, all children were presented with six to 10 items soon after amobarbital injection. After return to neurologic baseline, recognition memory for the stimuli was assessed. Seizure onset was determined by simultaneous video-EEG recordings of multiple seizures. In children with unilateral temporal lobe seizures (n = 87), Wada memory asymmetries accurately predicted seizure laterality to a statistically significant degree. Wada memory asymmetries also correctly predicted side of seizure onset in children with extra-temporal lobe seizures (n = 65). Although individual patient prediction accuracy was statistically significant in temporal lobe cases, onset laterality was incorrectly predicted in < or =52% of children with left temporal lobe seizure onset, depending on the methods and asymmetry criterion used. There also were significant differences between Wada prediction accuracy across the three epilepsy centers. Results suggest that Wada memory assessment is useful in predicting side of seizure onset in many children. However, Wada memory asymmetries should be interpreted more cautiously in children than in adults.

  19. Multimodel predictive system for carbon dioxide solubility in saline formation waters.

    PubMed

    Wang, Zan; Small, Mitchell J; Karamalidis, Athanasios K

    2013-02-05

    The prediction of carbon dioxide solubility in brine at conditions relevant to carbon sequestration (i.e., high temperature, pressure, and salt concentration (T-P-X)) is crucial when this technology is applied. Eleven mathematical models for predicting CO(2) solubility in brine are compared and considered for inclusion in a multimodel predictive system. Model goodness of fit is evaluated over the temperature range 304-433 K, pressure range 74-500 bar, and salt concentration range 0-7 m (NaCl equivalent), using 173 published CO(2) solubility measurements, particularly selected for those conditions. The performance of each model is assessed using various statistical methods, including the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). Different models emerge as best fits for different subranges of the input conditions. A classification tree is generated using machine learning methods to predict the best-performing model under different T-P-X subranges, allowing development of a multimodel predictive system (MMoPS) that selects and applies the model expected to yield the most accurate CO(2) solubility prediction. Statistical analysis of the MMoPS predictions, including a stratified 5-fold cross validation, shows that MMoPS outperforms each individual model and increases the overall accuracy of CO(2) solubility prediction across the range of T-P-X conditions likely to be encountered in carbon sequestration applications.

  20. [The new methods in gerontology for life expectancy prediction of the indigenous population of Yugra].

    PubMed

    Gavrilenko, T V; Es'kov, V M; Khadartsev, A A; Khimikova, O I; Sokolova, A A

    2014-01-01

    The behavior of the state vector of human cardio-vascular system in different age groups according to methods of theory of chaos-self-organization and methods of classical statistics was investigated. Observations were made on the indigenous people of North of the Russian Federation. Using methods of the theory of chaos-self-organization the differences in the parameters of quasi-attractors of the human state vector of cardio-vascular system of the people of Russian Federation North were shown. Comparison with the results obtained by classical statistics was made.

  1. Towards the feasibility of using ultrasound to determine mechanical properties of tissues in a bioreactor.

    PubMed

    Mansour, Joseph M; Gu, Di-Win Marine; Chung, Chen-Yuan; Heebner, Joseph; Althans, Jake; Abdalian, Sarah; Schluchter, Mark D; Liu, Yiying; Welter, Jean F

    2014-10-01

    Our ultimate goal is to non-destructively evaluate mechanical properties of tissue-engineered (TE) cartilage using ultrasound (US). We used agarose gels as surrogates for TE cartilage. Previously, we showed that mechanical properties measured using conventional methods were related to those measured using US, which suggested a way to non-destructively predict mechanical properties of samples with known volume fractions. In this study, we sought to determine whether the mechanical properties of samples, with unknown volume fractions could be predicted by US. Aggregate moduli were calculated for hydrogels as a function of SOS, based on concentration and density using a poroelastic model. The data were used to train a statistical model, which we then used to predict volume fractions and mechanical properties of unknown samples. Young's and storage moduli were measured mechanically. The statistical model generally predicted the Young's moduli in compression to within <10% of their mechanically measured value. We defined positive linear correlations between the aggregate modulus predicted from US and both the storage and Young's moduli determined from mechanical tests. Mechanical properties of hydrogels with unknown volume fractions can be predicted successfully from US measurements. This method has the potential to predict mechanical properties of TE cartilage non-destructively in a bioreactor.

  2. A Stochastic Model of Space-Time Variability of Mesoscale Rainfall: Statistics of Spatial Averages

    NASA Technical Reports Server (NTRS)

    Kundu, Prasun K.; Bell, Thomas L.

    2003-01-01

    A characteristic feature of rainfall statistics is that they depend on the space and time scales over which rain data are averaged. A previously developed spectral model of rain statistics that is designed to capture this property, predicts power law scaling behavior for the second moment statistics of area-averaged rain rate on the averaging length scale L as L right arrow 0. In the present work a more efficient method of estimating the model parameters is presented, and used to fit the model to the statistics of area-averaged rain rate derived from gridded radar precipitation data from TOGA COARE. Statistical properties of the data and the model predictions are compared over a wide range of averaging scales. An extension of the spectral model scaling relations to describe the dependence of the average fraction of grid boxes within an area containing nonzero rain (the "rainy area fraction") on the grid scale L is also explored.

  3. Prediction of drug transport processes using simple parameters and PLS statistics. The use of ACD/logP and ACD/ChemSketch descriptors.

    PubMed

    Osterberg, T; Norinder, U

    2001-01-01

    A method of modelling and predicting biopharmaceutical properties using simple theoretically computed molecular descriptors and multivariate statistics has been investigated for several data sets related to solubility, IAM chromatography, permeability across Caco-2 cell monolayers, human intestinal perfusion, brain-blood partitioning, and P-glycoprotein ATPase activity. The molecular descriptors (e.g. molar refractivity, molar volume, index of refraction, surface tension and density) and logP were computed with ACD/ChemSketch and ACD/logP, respectively. Good statistical models were derived that permit simple computational prediction of biopharmaceutical properties. All final models derived had R(2) values ranging from 0.73 to 0.95 and Q(2) values ranging from 0.69 to 0.86. The RMSEP values for the external test sets ranged from 0.24 to 0.85 (log scale).

  4. Trend extraction using empirical mode decomposition and statistical empirical mode decomposition: Case study: Kuala Lumpur stock market

    NASA Astrophysics Data System (ADS)

    Jaber, Abobaker M.

    2014-12-01

    Two nonparametric methods for prediction and modeling of financial time series signals are proposed. The proposed techniques are designed to handle non-stationary and non-linearity behave and to extract meaningful signals for reliable prediction. Due to Fourier Transform (FT), the methods select significant decomposed signals that will be employed for signal prediction. The proposed techniques developed by coupling Holt-winter method with Empirical Mode Decomposition (EMD) and it is Extending the scope of empirical mode decomposition by smoothing (SEMD). To show performance of proposed techniques, we analyze daily closed price of Kuala Lumpur stock market index.

  5. Use of predictive models and rapid methods to nowcast bacteria levels at coastal beaches

    USGS Publications Warehouse

    Francy, Donna S.

    2009-01-01

    The need for rapid assessments of recreational water quality to better protect public health is well accepted throughout the research and regulatory communities. Rapid analytical methods, such as quantitative polymerase chain reaction (qPCR) and immunomagnetic separation/adenosine triphosphate (ATP) analysis, are being tested but are not yet ready for widespread use.Another solution is the use of predictive models, wherein variable(s) that are easily and quickly measured are surrogates for concentrations of fecal-indicator bacteria. Rainfall-based alerts, the simplest type of model, have been used by several communities for a number of years. Deterministic models use mathematical representations of the processes that affect bacteria concentrations; this type of model is being used for beach-closure decisions at one location in the USA. Multivariable statistical models are being developed and tested in many areas of the USA; however, they are only used in three areas of the Great Lakes to aid in notifications of beach advisories or closings. These “operational” statistical models can result in more accurate assessments of recreational water quality than use of the previous day's Escherichia coli (E. coli)concentration as determined by traditional culture methods. The Ohio Nowcast, at Huntington Beach, Bay Village, Ohio, is described in this paper as an example of an operational statistical model. Because predictive modeling is a dynamic process, water-resource managers continue to collect additional data to improve the predictive ability of the nowcast and expand the nowcast to other Ohio beaches and a recreational river. Although predictive models have been shown to work well at some beaches and are becoming more widely accepted, implementation in many areas is limited by funding, lack of coordinated technical leadership, and lack of supporting epidemiological data.

  6. A Bayesian nonparametric method for prediction in EST analysis

    PubMed Central

    Lijoi, Antonio; Mena, Ramsés H; Prünster, Igor

    2007-01-01

    Background Expressed sequence tags (ESTs) analyses are a fundamental tool for gene identification in organisms. Given a preliminary EST sample from a certain library, several statistical prediction problems arise. In particular, it is of interest to estimate how many new genes can be detected in a future EST sample of given size and also to determine the gene discovery rate: these estimates represent the basis for deciding whether to proceed sequencing the library and, in case of a positive decision, a guideline for selecting the size of the new sample. Such information is also useful for establishing sequencing efficiency in experimental design and for measuring the degree of redundancy of an EST library. Results In this work we propose a Bayesian nonparametric approach for tackling statistical problems related to EST surveys. In particular, we provide estimates for: a) the coverage, defined as the proportion of unique genes in the library represented in the given sample of reads; b) the number of new unique genes to be observed in a future sample; c) the discovery rate of new genes as a function of the future sample size. The Bayesian nonparametric model we adopt conveys, in a statistically rigorous way, the available information into prediction. Our proposal has appealing properties over frequentist nonparametric methods, which become unstable when prediction is required for large future samples. EST libraries, previously studied with frequentist methods, are analyzed in detail. Conclusion The Bayesian nonparametric approach we undertake yields valuable tools for gene capture and prediction in EST libraries. The estimators we obtain do not feature the kind of drawbacks associated with frequentist estimators and are reliable for any size of the additional sample. PMID:17868445

  7. Predicting individual brain functional connectivity using a Bayesian hierarchical model.

    PubMed

    Dai, Tian; Guo, Ying

    2017-02-15

    Network-oriented analysis of functional magnetic resonance imaging (fMRI), especially resting-state fMRI, has revealed important association between abnormal connectivity and brain disorders such as schizophrenia, major depression and Alzheimer's disease. Imaging-based brain connectivity measures have become a useful tool for investigating the pathophysiology, progression and treatment response of psychiatric disorders and neurodegenerative diseases. Recent studies have started to explore the possibility of using functional neuroimaging to help predict disease progression and guide treatment selection for individual patients. These studies provide the impetus to develop statistical methodology that would help provide predictive information on disease progression-related or treatment-related changes in neural connectivity. To this end, we propose a prediction method based on Bayesian hierarchical model that uses individual's baseline fMRI scans, coupled with relevant subject characteristics, to predict the individual's future functional connectivity. A key advantage of the proposed method is that it can improve the accuracy of individualized prediction of connectivity by combining information from both group-level connectivity patterns that are common to subjects with similar characteristics as well as individual-level connectivity features that are particular to the specific subject. Furthermore, our method also offers statistical inference tools such as predictive intervals that help quantify the uncertainty or variability of the predicted outcomes. The proposed prediction method could be a useful approach to predict the changes in individual patient's brain connectivity with the progression of a disease. It can also be used to predict a patient's post-treatment brain connectivity after a specified treatment regimen. Another utility of the proposed method is that it can be applied to test-retest imaging data to develop a more reliable estimator for individual functional connectivity. We show there exists a nice connection between our proposed estimator and a recently developed shrinkage estimator of connectivity measures in the neuroimaging community. We develop an expectation-maximization (EM) algorithm for estimation of the proposed Bayesian hierarchical model. Simulations studies are performed to evaluate the accuracy of our proposed prediction methods. We illustrate the application of the methods with two data examples: the longitudinal resting-state fMRI from ADNI2 study and the test-retest fMRI data from Kirby21 study. In both the simulation studies and the fMRI data applications, we demonstrate that the proposed methods provide more accurate prediction and more reliable estimation of individual functional connectivity as compared with alternative methods. Copyright © 2017 Elsevier Inc. All rights reserved.

  8. Virtual reconstruction of glenoid bone defects using a statistical shape model.

    PubMed

    Plessers, Katrien; Vanden Berghe, Peter; Van Dijck, Christophe; Wirix-Speetjens, Roel; Debeer, Philippe; Jonkers, Ilse; Vander Sloten, Jos

    2018-01-01

    Description of the native shape of a glenoid helps surgeons to preoperatively plan the position of a shoulder implant. A statistical shape model (SSM) can be used to virtually reconstruct a glenoid bone defect and to predict the inclination, version, and center position of the native glenoid. An SSM-based reconstruction method has already been developed for acetabular bone reconstruction. The goal of this study was to evaluate the SSM-based method for the reconstruction of glenoid bone defects and the prediction of native anatomic parameters. First, an SSM was created on the basis of 66 healthy scapulae. Then, artificial bone defects were created in all scapulae and reconstructed using the SSM-based reconstruction method. For each bone defect, the reconstructed surface was compared with the original surface. Furthermore, the inclination, version, and glenoid center point of the reconstructed surface were compared with the original parameters of each scapula. For small glenoid bone defects, the healthy surface of the glenoid was reconstructed with a root mean square error of 1.2 ± 0.4 mm. Inclination, version, and glenoid center point were predicted with an accuracy of 2.4° ± 2.1°, 2.9° ± 2.2°, and 1.8 ± 0.8 mm, respectively. The SSM-based reconstruction method is able to accurately reconstruct the native glenoid surface and to predict the native anatomic parameters. Based on this outcome, statistical shape modeling can be considered a successful technique for use in the preoperative planning of shoulder arthroplasty. Copyright © 2017 Journal of Shoulder and Elbow Surgery Board of Trustees. Published by Elsevier Inc. All rights reserved.

  9. Improving binding mode and binding affinity predictions of docking by ligand-based search of protein conformations: evaluation in D3R grand challenge 2015

    NASA Astrophysics Data System (ADS)

    Xu, Xianjin; Yan, Chengfei; Zou, Xiaoqin

    2017-08-01

    The growing number of protein-ligand complex structures, particularly the structures of proteins co-bound with different ligands, in the Protein Data Bank helps us tackle two major challenges in molecular docking studies: the protein flexibility and the scoring function. Here, we introduced a systematic strategy by using the information embedded in the known protein-ligand complex structures to improve both binding mode and binding affinity predictions. Specifically, a ligand similarity calculation method was employed to search a receptor structure with a bound ligand sharing high similarity with the query ligand for the docking use. The strategy was applied to the two datasets (HSP90 and MAP4K4) in recent D3R Grand Challenge 2015. In addition, for the HSP90 dataset, a system-specific scoring function (ITScore2_hsp90) was generated by recalibrating our statistical potential-based scoring function (ITScore2) using the known protein-ligand complex structures and the statistical mechanics-based iterative method. For the HSP90 dataset, better performances were achieved for both binding mode and binding affinity predictions comparing with the original ITScore2 and with ensemble docking. For the MAP4K4 dataset, although there were only eight known protein-ligand complex structures, our docking strategy achieved a comparable performance with ensemble docking. Our method for receptor conformational selection and iterative method for the development of system-specific statistical potential-based scoring functions can be easily applied to other protein targets that have a number of protein-ligand complex structures available to improve predictions on binding.

  10. Aqua/Aura Updated Inclination Adjust Maneuver Performance Prediction Model

    NASA Technical Reports Server (NTRS)

    Boone, Spencer

    2017-01-01

    This presentation will discuss the updated Inclination Adjust Maneuver (IAM) performance prediction model that was developed for Aqua and Aura following the 2017 IAM series. This updated model uses statistical regression methods to identify potential long-term trends in maneuver parameters, yielding improved predictions when re-planning past maneuvers. The presentation has been reviewed and approved by Eric Moyer, ESMO Deputy Project Manager.

  11. Restoration of the Patient-Specific Anatomy of the Proximal and Distal Parts of the Humerus: Statistical Shape Modeling Versus Contralateral Registration Method.

    PubMed

    Vlachopoulos, Lazaros; Lüthi, Marcel; Carrillo, Fabio; Gerber, Christian; Székely, Gábor; Fürnstahl, Philipp

    2018-04-18

    In computer-assisted reconstructive surgeries, the contralateral anatomy is established as the best available reconstruction template. However, existing intra-individual bilateral differences or a pathological, contralateral humerus may limit the applicability of the method. The aim of the study was to evaluate whether a statistical shape model (SSM) has the potential to predict accurately the pretraumatic anatomy of the humerus from the posttraumatic condition. Three-dimensional (3D) triangular surface models were extracted from the computed tomographic data of 100 paired cadaveric humeri without a pathological condition. An SSM was constructed, encoding the characteristic shape variations among the individuals. To predict the patient-specific anatomy of the proximal (or distal) part of the humerus with the SSM, we generated segments of the humerus of predefined length excluding the part to predict. The proximal and distal humeral prediction (p-HP and d-HP) errors, defined as the deviation of the predicted (bone) model from the original (bone) model, were evaluated. For comparison with the state-of-the-art technique, i.e., the contralateral registration method, we used the same segments of the humerus to evaluate whether the SSM or the contralateral anatomy yields a more accurate reconstruction template. The p-HP error (mean and standard deviation, 3.8° ± 1.9°) using 85% of the distal end of the humerus to predict the proximal humeral anatomy was significantly smaller (p = 0.001) compared with the contralateral registration method. The difference between the d-HP error (mean, 5.5° ± 2.9°), using 85% of the proximal part of the humerus to predict the distal humeral anatomy, and the contralateral registration method was not significant (p = 0.61). The restoration of the humeral length was not significantly different between the SSM and the contralateral registration method. SSMs accurately predict the patient-specific anatomy of the proximal and distal aspects of the humerus. The prediction errors of the SSM depend on the size of the healthy part of the humerus. The prediction of the patient-specific anatomy of the humerus is of fundamental importance for computer-assisted reconstructive surgeries.

  12. The Stroke Riskometer™ App: Validation of a data collection tool and stroke risk predictor

    PubMed Central

    Parmar, Priya; Krishnamurthi, Rita; Ikram, M Arfan; Hofman, Albert; Mirza, Saira S; Varakin, Yury; Kravchenko, Michael; Piradov, Michael; Thrift, Amanda G; Norrving, Bo; Wang, Wenzhi; Mandal, Dipes Kumar; Barker-Collo, Suzanne; Sahathevan, Ramesh; Davis, Stephen; Saposnik, Gustavo; Kivipelto, Miia; Sindi, Shireen; Bornstein, Natan M; Giroud, Maurice; Béjot, Yannick; Brainin, Michael; Poulton, Richie; Narayan, K M Venkat; Correia, Manuel; Freire, António; Kokubo, Yoshihiro; Wiebers, David; Mensah, George; BinDhim, Nasser F; Barber, P Alan; Pandian, Jeyaraj Durai; Hankey, Graeme J; Mehndiratta, Man Mohan; Azhagammal, Shobhana; Ibrahim, Norlinah Mohd; Abbott, Max; Rush, Elaine; Hume, Patria; Hussein, Tasleem; Bhattacharjee, Rohit; Purohit, Mitali; Feigin, Valery L

    2015-01-01

    Background The greatest potential to reduce the burden of stroke is by primary prevention of first-ever stroke, which constitutes three quarters of all stroke. In addition to population-wide prevention strategies (the ‘mass’ approach), the ‘high risk’ approach aims to identify individuals at risk of stroke and to modify their risk factors, and risk, accordingly. Current methods of assessing and modifying stroke risk are difficult to access and implement by the general population, amongst whom most future strokes will arise. To help reduce the burden of stroke on individuals and the population a new app, the Stroke Riskometer™, has been developed. We aim to explore the validity of the app for predicting the risk of stroke compared with current best methods. Methods 752 stroke outcomes from a sample of 9501 individuals across three countries (New Zealand, Russia and the Netherlands) were utilized to investigate the performance of a novel stroke risk prediction tool algorithm (Stroke Riskometer™) compared with two established stroke risk score prediction algorithms (Framingham Stroke Risk Score [FSRS] and QStroke). We calculated the receiver operating characteristics (ROC) curves and area under the ROC curve (AUROC) with 95% confidence intervals, Harrels C-statistic and D-statistics for measure of discrimination, R2 statistics to indicate level of variability accounted for by each prediction algorithm, the Hosmer-Lemeshow statistic for calibration, and the sensitivity and specificity of each algorithm. Results The Stroke Riskometer™ performed well against the FSRS five-year AUROC for both males (FSRS = 75·0% (95% CI 72·3%–77·6%), Stroke Riskometer™ = 74·0(95% CI 71·3%–76·7%) and females [FSRS = 70·3% (95% CI 67·9%–72·8%, Stroke Riskometer™ = 71·5% (95% CI 69·0%–73·9%)], and better than QStroke [males – 59·7% (95% CI 57·3%–62·0%) and comparable to females = 71·1% (95% CI 69·0%–73·1%)]. Discriminative ability of all algorithms was low (C-statistic ranging from 0·51–0·56, D-statistic ranging from 0·01–0·12). Hosmer-Lemeshow illustrated that all of the predicted risk scores were not well calibrated with the observed event data (P < 0·006). Conclusions The Stroke Riskometer™ is comparable in performance for stroke prediction with FSRS and QStroke. All three algorithms performed equally poorly in predicting stroke events. The Stroke Riskometer™ will be continually developed and validated to address the need to improve the current stroke risk scoring systems to more accurately predict stroke, particularly by identifying robust ethnic/race ethnicity group and country specific risk factors. PMID:25491651

  13. New Methods for Estimating Seasonal Potential Climate Predictability

    NASA Astrophysics Data System (ADS)

    Feng, Xia

    This study develops two new statistical approaches to assess the seasonal potential predictability of the observed climate variables. One is the univariate analysis of covariance (ANOCOVA) model, a combination of autoregressive (AR) model and analysis of variance (ANOVA). It has the advantage of taking into account the uncertainty of the estimated parameter due to sampling errors in statistical test, which is often neglected in AR based methods, and accounting for daily autocorrelation that is not considered in traditional ANOVA. In the ANOCOVA model, the seasonal signals arising from external forcing are determined to be identical or not to assess any interannual variability that may exist is potentially predictable. The bootstrap is an attractive alternative method that requires no hypothesis model and is available no matter how mathematically complicated the parameter estimator. This method builds up the empirical distribution of the interannual variance from the resamplings drawn with replacement from the given sample, in which the only predictability in seasonal means arises from the weather noise. These two methods are applied to temperature and water cycle components including precipitation and evaporation, to measure the extent to which the interannual variance of seasonal means exceeds the unpredictable weather noise compared with the previous methods, including Leith-Shukla-Gutzler (LSG), Madden, and Katz. The potential predictability of temperature from ANOCOVA model, bootstrap, LSG and Madden exhibits a pronounced tropical-extratropical contrast with much larger predictability in the tropics dominated by El Nino/Southern Oscillation (ENSO) than in higher latitudes where strong internal variability lowers predictability. Bootstrap tends to display highest predictability of the four methods, ANOCOVA lies in the middle, while LSG and Madden appear to generate lower predictability. Seasonal precipitation from ANOCOVA, bootstrap, and Katz, resembling that for temperature, is more predictable over the tropical regions, and less predictable in extropics. Bootstrap and ANOCOVA are in good agreement with each other, both methods generating larger predictability than Katz. The seasonal predictability of evaporation over land bears considerably similarity with that of temperature using ANOCOVA, bootstrap, LSG and Madden. The remote SST forcing and soil moisture reveal substantial seasonality in their relations with the potentially predictable seasonal signals. For selected regions, either SST or soil moisture or both shows significant relationships with predictable signals, hence providing indirect insight on slowly varying boundary processes involved to enable useful seasonal climate predication. A multivariate analysis of covariance (MANOCOVA) model is established to identify distinctive predictable patterns, which are uncorrelated with each other. Generally speaking, the seasonal predictability from multivariate model is consistent with that from ANOCOVA. Besides unveiling the spatial variability of predictability, MANOCOVA model also reveals the temporal variability of each predictable pattern, which could be linked to the periodic oscillations.

  14. Hybrid perturbation methods based on statistical time series models

    NASA Astrophysics Data System (ADS)

    San-Juan, Juan Félix; San-Martín, Montserrat; Pérez, Iván; López, Rosario

    2016-04-01

    In this work we present a new methodology for orbit propagation, the hybrid perturbation theory, based on the combination of an integration method and a prediction technique. The former, which can be a numerical, analytical or semianalytical theory, generates an initial approximation that contains some inaccuracies derived from the fact that, in order to simplify the expressions and subsequent computations, not all the involved forces are taken into account and only low-order terms are considered, not to mention the fact that mathematical models of perturbations not always reproduce physical phenomena with absolute precision. The prediction technique, which can be based on either statistical time series models or computational intelligence methods, is aimed at modelling and reproducing missing dynamics in the previously integrated approximation. This combination results in the precision improvement of conventional numerical, analytical and semianalytical theories for determining the position and velocity of any artificial satellite or space debris object. In order to validate this methodology, we present a family of three hybrid orbit propagators formed by the combination of three different orders of approximation of an analytical theory and a statistical time series model, and analyse their capability to process the effect produced by the flattening of the Earth. The three considered analytical components are the integration of the Kepler problem, a first-order and a second-order analytical theories, whereas the prediction technique is the same in the three cases, namely an additive Holt-Winters method.

  15. Evaluation and statistical inference for human connectomes.

    PubMed

    Pestilli, Franco; Yeatman, Jason D; Rokem, Ariel; Kay, Kendrick N; Wandell, Brian A

    2014-10-01

    Diffusion-weighted imaging coupled with tractography is currently the only method for in vivo mapping of human white-matter fascicles. Tractography takes diffusion measurements as input and produces the connectome, a large collection of white-matter fascicles, as output. We introduce a method to evaluate the evidence supporting connectomes. Linear fascicle evaluation (LiFE) takes any connectome as input and predicts diffusion measurements as output, using the difference between the measured and predicted diffusion signals to quantify the prediction error. We use the prediction error to evaluate the evidence that supports the properties of the connectome, to compare tractography algorithms and to test hypotheses about tracts and connections.

  16. Global vision of druggability issues: applications and perspectives.

    PubMed

    Abi Hussein, Hiba; Geneix, Colette; Petitjean, Michel; Borrel, Alexandre; Flatters, Delphine; Camproux, Anne-Claude

    2017-02-01

    During the preliminary stage of a drug discovery project, the lack of druggability information and poor target selection are the main causes of frequent failures. Elaborating on accurate computational druggability prediction methods is a requirement for prioritizing target selection, designing new drugs and avoiding side effects. In this review, we describe a survey of recently reported druggability prediction methods mainly based on networks, statistical pocket druggability predictions and virtual screening. An application for a frequent mutation of p53 tumor suppressor is presented, illustrating the complementarity of druggability prediction approaches, the remaining challenges and potential new drug development perspectives. Copyright © 2016 Elsevier Ltd. All rights reserved.

  17. A Study on Predictive Analytics Application to Ship Machinery Maintenance

    DTIC Science & Technology

    2013-09-01

    Looking at the nature of the time series forecasting method , it would be better applied to offline analysis . The application for real- time online...other system attributes in future. Two techniques of statistical analysis , mainly time series models and cumulative sum control charts, are discussed in...statistical tool employed for the two techniques of statistical analysis . Both time series forecasting as well as CUSUM control charts are shown to be

  18. Low-Level Stratus Prediction Using Binary Statistical Regression: A Progress Report Using Moffett Field Data.

    DTIC Science & Technology

    1983-12-01

    analysis; such work is not reported here. It seems pos- sible that a robust principle component analysis may he informa- tive (see Gnanadesikan (1977...Statistics in Atmospheric Sciences, American Meteorological Soc., Boston, Mass. (1979) pp. 46-48. a Gnanadesikan , R., Methods for Statistical Data...North Carolina Chapel Hill, NC 20742 Dr. R. Gnanadesikan Bell Telephone Lab Murray Hill, NJ 07733 -%.. *5%a: *1 *15 I ,, - . . , ,, ... . . . . . . NO

  19. Translational Research for Occupational Therapy: Using SPRE in Hippotherapy for Children with Developmental Disabilities.

    PubMed

    Weissman-Miller, Deborah; Miller, Rosalie J; Shotwell, Mary P

    2017-01-01

    Translational research is redefined in this paper using a combination of methods in statistics and data science to enhance the understanding of outcomes and practice in occupational therapy. These new methods are applied, using larger data and smaller single-subject data, to a study in hippotherapy for children with developmental disabilities (DD). The Centers for Disease Control and Prevention estimates DD affects nearly 10 million children, aged 2-19, where diagnoses may be comorbid. Hippotherapy is defined here as a treatment strategy in occupational therapy using equine movement to achieve functional outcomes. Semiparametric ratio estimator (SPRE), a single-subject statistical and small data science model, is used to derive a "change point" indicating where the participant adapts to treatment, from which predictions are made. Data analyzed here is from an institutional review board approved pilot study using the Hippotherapy Evaluation and Assessment Tool measure, where outcomes are given separately for each of four measured domains and the total scores of each participant. Analysis with SPRE, using statistical methods to predict a "change point" and data science graphical interpretations of data, shows the translational comparisons between results from larger mean values and the very different results from smaller values for each HEAT domain in terms of relationships and statistical probabilities.

  20. Comparison between deterministic and statistical wavelet estimation methods through predictive deconvolution: Seismic to well tie example from the North Sea

    NASA Astrophysics Data System (ADS)

    de Macedo, Isadora A. S.; da Silva, Carolina B.; de Figueiredo, J. J. S.; Omoboya, Bode

    2017-01-01

    Wavelet estimation as well as seismic-to-well tie procedures are at the core of every seismic interpretation workflow. In this paper we perform a comparative study of wavelet estimation methods for seismic-to-well tie. Two approaches to wavelet estimation are discussed: a deterministic estimation, based on both seismic and well log data, and a statistical estimation, based on predictive deconvolution and the classical assumptions of the convolutional model, which provides a minimum-phase wavelet. Our algorithms, for both wavelet estimation methods introduce a semi-automatic approach to determine the optimum parameters of deterministic wavelet estimation and statistical wavelet estimation and, further, to estimate the optimum seismic wavelets by searching for the highest correlation coefficient between the recorded trace and the synthetic trace, when the time-depth relationship is accurate. Tests with numerical data show some qualitative conclusions, which are probably useful for seismic inversion and interpretation of field data, by comparing deterministic wavelet estimation and statistical wavelet estimation in detail, especially for field data example. The feasibility of this approach is verified on real seismic and well data from Viking Graben field, North Sea, Norway. Our results also show the influence of the washout zones on well log data on the quality of the well to seismic tie.

  1. Translational Research for Occupational Therapy: Using SPRE in Hippotherapy for Children with Developmental Disabilities

    PubMed Central

    Miller, Rosalie J.; Shotwell, Mary P.

    2017-01-01

    Translational research is redefined in this paper using a combination of methods in statistics and data science to enhance the understanding of outcomes and practice in occupational therapy. These new methods are applied, using larger data and smaller single-subject data, to a study in hippotherapy for children with developmental disabilities (DD). The Centers for Disease Control and Prevention estimates DD affects nearly 10 million children, aged 2–19, where diagnoses may be comorbid. Hippotherapy is defined here as a treatment strategy in occupational therapy using equine movement to achieve functional outcomes. Semiparametric ratio estimator (SPRE), a single-subject statistical and small data science model, is used to derive a “change point” indicating where the participant adapts to treatment, from which predictions are made. Data analyzed here is from an institutional review board approved pilot study using the Hippotherapy Evaluation and Assessment Tool measure, where outcomes are given separately for each of four measured domains and the total scores of each participant. Analysis with SPRE, using statistical methods to predict a “change point” and data science graphical interpretations of data, shows the translational comparisons between results from larger mean values and the very different results from smaller values for each HEAT domain in terms of relationships and statistical probabilities. PMID:29097962

  2. Polygenic scores via penalized regression on summary statistics.

    PubMed

    Mak, Timothy Shin Heng; Porsch, Robert Milan; Choi, Shing Wan; Zhou, Xueya; Sham, Pak Chung

    2017-09-01

    Polygenic scores (PGS) summarize the genetic contribution of a person's genotype to a disease or phenotype. They can be used to group participants into different risk categories for diseases, and are also used as covariates in epidemiological analyses. A number of possible ways of calculating PGS have been proposed, and recently there is much interest in methods that incorporate information available in published summary statistics. As there is no inherent information on linkage disequilibrium (LD) in summary statistics, a pertinent question is how we can use LD information available elsewhere to supplement such analyses. To answer this question, we propose a method for constructing PGS using summary statistics and a reference panel in a penalized regression framework, which we call lassosum. We also propose a general method for choosing the value of the tuning parameter in the absence of validation data. In our simulations, we showed that pseudovalidation often resulted in prediction accuracy that is comparable to using a dataset with validation phenotype and was clearly superior to the conservative option of setting the tuning parameter of lassosum to its lowest value. We also showed that lassosum achieved better prediction accuracy than simple clumping and P-value thresholding in almost all scenarios. It was also substantially faster and more accurate than the recently proposed LDpred. © 2017 WILEY PERIODICALS, INC.

  3. Correcting for Optimistic Prediction in Small Data Sets

    PubMed Central

    Smith, Gordon C. S.; Seaman, Shaun R.; Wood, Angela M.; Royston, Patrick; White, Ian R.

    2014-01-01

    The C statistic is a commonly reported measure of screening test performance. Optimistic estimation of the C statistic is a frequent problem because of overfitting of statistical models in small data sets, and methods exist to correct for this issue. However, many studies do not use such methods, and those that do correct for optimism use diverse methods, some of which are known to be biased. We used clinical data sets (United Kingdom Down syndrome screening data from Glasgow (1991–2003), Edinburgh (1999–2003), and Cambridge (1990–2006), as well as Scottish national pregnancy discharge data (2004–2007)) to evaluate different approaches to adjustment for optimism. We found that sample splitting, cross-validation without replication, and leave-1-out cross-validation produced optimism-adjusted estimates of the C statistic that were biased and/or associated with greater absolute error than other available methods. Cross-validation with replication, bootstrapping, and a new method (leave-pair-out cross-validation) all generated unbiased optimism-adjusted estimates of the C statistic and had similar absolute errors in the clinical data set. Larger simulation studies confirmed that all 3 methods performed similarly with 10 or more events per variable, or when the C statistic was 0.9 or greater. However, with lower events per variable or lower C statistics, bootstrapping tended to be optimistic but with lower absolute and mean squared errors than both methods of cross-validation. PMID:24966219

  4. A note on evaluating VAN earthquake predictions

    NASA Astrophysics Data System (ADS)

    Tselentis, G.-Akis; Melis, Nicos S.

    The evaluation of the success level of an earthquake prediction method should not be based on approaches that apply generalized strict statistical laws and avoid the specific nature of the earthquake phenomenon. Fault rupture processes cannot be compared to gambling processes. The outcome of the present note is that even an ideal earthquake prediction method is still shown to be a matter of a “chancy” association between precursors and earthquakes if we apply the same procedure proposed by Mulargia and Gasperini [1992] in evaluating VAN earthquake predictions. Each individual VAN prediction has to be evaluated separately, taking always into account the specific circumstances and information available. The success level of epicenter prediction should depend on the earthquake magnitude, and magnitude and time predictions may depend on earthquake clustering and the tectonic regime respectively.

  5. Could the outcome of the 2016 US elections have been predicted from past voting patterns?

    NASA Astrophysics Data System (ADS)

    Schmitz, Peter M. U.; Holloway, Jennifer P.; Dudeni-Tlhone, Nontembeko; Ntlangu, Mbulelo B.; Koen, Renee

    2018-05-01

    In South Africa, a team of analysts has for some years been using statistical techniques to predict election outcomes during election nights in South Africa. The prediction method involves using statistical clusters based on past voting patterns to predict final election outcomes, using a small number of released vote counts. With the US presidential elections in November 2016 hitting the global media headlines during the time period directly after successful predictions were done for the South African elections, the team decided to investigate adapting their meth-od to forecast the final outcome in the US elections. In particular, it was felt that the time zone differences between states would affect the time at which results are released and thereby provide a window of opportunity for doing election night prediction using only the early results from the eastern side of the US. Testing the method on the US presidential elections would have two advantages: it would determine whether the core methodology could be generalised, and whether it would work to include a stronger spatial element in the modelling, since the early results released would be spatially biased due to time zone differences. This paper presents a high-level view of the overall methodology and how it was adapted to predict the results of the US presidential elections. A discussion on the clustering of spatial units within the US is also provided and the spatial distribution of results together with the Electoral College prediction results from both a `test-run' and the final 2016 presidential elections are given and analysed.

  6. Investigation of Super Learner Methodology on HIV-1 Small Sample: Application on Jaguar Trial Data.

    PubMed

    Houssaïni, Allal; Assoumou, Lambert; Marcelin, Anne Geneviève; Molina, Jean Michel; Calvez, Vincent; Flandre, Philippe

    2012-01-01

    Background. Many statistical models have been tested to predict phenotypic or virological response from genotypic data. A statistical framework called Super Learner has been introduced either to compare different methods/learners (discrete Super Learner) or to combine them in a Super Learner prediction method. Methods. The Jaguar trial is used to apply the Super Learner framework. The Jaguar study is an "add-on" trial comparing the efficacy of adding didanosine to an on-going failing regimen. Our aim was also to investigate the impact on the use of different cross-validation strategies and different loss functions. Four different repartitions between training set and validations set were tested through two loss functions. Six statistical methods were compared. We assess performance by evaluating R(2) values and accuracy by calculating the rates of patients being correctly classified. Results. Our results indicated that the more recent Super Learner methodology of building a new predictor based on a weighted combination of different methods/learners provided good performance. A simple linear model provided similar results to those of this new predictor. Slight discrepancy arises between the two loss functions investigated, and slight difference arises also between results based on cross-validated risks and results from full dataset. The Super Learner methodology and linear model provided around 80% of patients correctly classified. The difference between the lower and higher rates is around 10 percent. The number of mutations retained in different learners also varys from one to 41. Conclusions. The more recent Super Learner methodology combining the prediction of many learners provided good performance on our small dataset.

  7. Real-time Mainshock Forecast by Statistical Discrimination of Foreshock Clusters

    NASA Astrophysics Data System (ADS)

    Nomura, S.; Ogata, Y.

    2016-12-01

    Foreshock discremination is one of the most effective ways for short-time forecast of large main shocks. Though many large earthquakes accompany their foreshocks, discreminating them from enormous small earthquakes is difficult and only probabilistic evaluation from their spatio-temporal features and magnitude evolution may be available. Logistic regression is the statistical learning method best suited to such binary pattern recognition problems where estimates of a-posteriori probability of class membership are required. Statistical learning methods can keep learning discreminating features from updating catalog and give probabilistic recognition of forecast in real time. We estimated a non-linear function of foreshock proportion by smooth spline bases and evaluate the possibility of foreshocks by the logit function. In this study, we classified foreshocks from earthquake catalog by the Japan Meteorological Agency by single-link clustering methods and learned spatial and temporal features of foreshocks by the probability density ratio estimation. We use the epicentral locations, time spans and difference in magnitudes for learning and forecasting. Magnitudes of main shocks are also predicted our method by incorporating b-values into our method. We discuss the spatial pattern of foreshocks from the classifier composed by our model. We also implement a back test to validate predictive performance of the model by this catalog.

  8. A method for evaluating the importance of system state observations to model predictions, with application to the Death Valley regional groundwater flow system

    USGS Publications Warehouse

    Tiedeman, Claire; Ely, D. Matthew; Hill, Mary C.; O'Brien, Grady M.

    2004-01-01

    We develop a new observation‐prediction (OPR) statistic for evaluating the importance of system state observations to model predictions. The OPR statistic measures the change in prediction uncertainty produced when an observation is added to or removed from an existing monitoring network, and it can be used to guide refinement and enhancement of the network. Prediction uncertainty is approximated using a first‐order second‐moment method. We apply the OPR statistic to a model of the Death Valley regional groundwater flow system (DVRFS) to evaluate the importance of existing and potential hydraulic head observations to predicted advective transport paths in the saturated zone underlying Yucca Mountain and underground testing areas on the Nevada Test Site. Important existing observations tend to be far from the predicted paths, and many unimportant observations are in areas of high observation density. These results can be used to select locations at which increased observation accuracy would be beneficial and locations that could be removed from the network. Important potential observations are mostly in areas of high hydraulic gradient far from the paths. Results for both existing and potential observations are related to the flow system dynamics and coarse parameter zonation in the DVRFS model. If system properties in different locations are as similar as the zonation assumes, then the OPR results illustrate a data collection opportunity whereby observations in distant, high‐gradient areas can provide information about properties in flatter‐gradient areas near the paths. If this similarity is suspect, then the analysis produces a different type of data collection opportunity involving testing of model assumptions critical to the OPR results.

  9. Review of Factors, Methods, and Outcome Definition in Designing Opioid Abuse Predictive Models.

    PubMed

    Alzeer, Abdullah H; Jones, Josette; Bair, Matthew J

    2018-05-01

    Several opioid risk assessment tools are available to prescribers to evaluate opioid analgesic abuse among chronic patients. The objectives of this study are to 1) identify variables available in the literature to predict opioid abuse; 2) explore and compare methods (population, database, and analysis) used to develop statistical models that predict opioid abuse; and 3) understand how outcomes were defined in each statistical model predicting opioid abuse. The OVID database was searched for this study. The search was limited to articles written in English and published from January 1990 to April 2016. This search generated 1,409 articles. Only seven studies and nine models met our inclusion-exclusion criteria. We found nine models and identified 75 distinct variables. Three studies used administrative claims data, and four studies used electronic health record data. The majority, four out of seven articles (six out of nine models), were primarily dependent on the presence or absence of opioid abuse or dependence (ICD-9 diagnosis code) to define opioid abuse. However, two articles used a predefined list of opioid-related aberrant behaviors. We identified variables used to predict opioid abuse from electronic health records and administrative data. Medication variables are the recurrent variables in the articles reviewed (33 variables). Age and gender are the most consistent demographic variables in predicting opioid abuse. Overall, there is similarity in the sampling method and inclusion/exclusion criteria (age, number of prescriptions, follow-up period, and data analysis methods). Intuitive research to utilize unstructured data may increase opioid abuse models' accuracy.

  10. Validation of Methods to Predict Vibration of a Panel in the Near Field of a Hot Supersonic Rocket Plume

    NASA Technical Reports Server (NTRS)

    Bremner, P. G.; Blelloch, P. A.; Hutchings, A.; Shah, P.; Streett, C. L.; Larsen, C. E.

    2011-01-01

    This paper describes the measurement and analysis of surface fluctuating pressure level (FPL) data and vibration data from a plume impingement aero-acoustic and vibration (PIAAV) test to validate NASA s physics-based modeling methods for prediction of panel vibration in the near field of a hot supersonic rocket plume. For this test - reported more fully in a companion paper by Osterholt & Knox at 26th Aerospace Testing Seminar, 2011 - the flexible panel was located 2.4 nozzle diameters from the plume centerline and 4.3 nozzle diameters downstream from the nozzle exit. The FPL loading is analyzed in terms of its auto spectrum, its cross spectrum, its spatial correlation parameters and its statistical properties. The panel vibration data is used to estimate the in-situ damping under plume FPL loading conditions and to validate both finite element analysis (FEA) and statistical energy analysis (SEA) methods for prediction of panel response. An assessment is also made of the effects of non-linearity in the panel elasticity.

  11. Comparison of the performance of the CMS Hierarchical Condition Category (CMS-HCC) risk adjuster with the Charlson and Elixhauser comorbidity measures in predicting mortality.

    PubMed

    Li, Pengxiang; Kim, Michelle M; Doshi, Jalpa A

    2010-08-20

    The Centers for Medicare and Medicaid Services (CMS) has implemented the CMS-Hierarchical Condition Category (CMS-HCC) model to risk adjust Medicare capitation payments. This study intends to assess the performance of the CMS-HCC risk adjustment method and to compare it to the Charlson and Elixhauser comorbidity measures in predicting in-hospital and six-month mortality in Medicare beneficiaries. The study used the 2005-2006 Chronic Condition Data Warehouse (CCW) 5% Medicare files. The primary study sample included all community-dwelling fee-for-service Medicare beneficiaries with a hospital admission between January 1st, 2006 and June 30th, 2006. Additionally, four disease-specific samples consisting of subgroups of patients with principal diagnoses of congestive heart failure (CHF), stroke, diabetes mellitus (DM), and acute myocardial infarction (AMI) were also selected. Four analytic files were generated for each sample by extracting inpatient and/or outpatient claims for each patient. Logistic regressions were used to compare the methods. Model performance was assessed using the c-statistic, the Akaike's information criterion (AIC), the Bayesian information criterion (BIC) and their 95% confidence intervals estimated using bootstrapping. The CMS-HCC had statistically significant higher c-statistic and lower AIC and BIC values than the Charlson and Elixhauser methods in predicting in-hospital and six-month mortality across all samples in analytic files that included claims from the index hospitalization. Exclusion of claims for the index hospitalization generally led to drops in model performance across all methods with the highest drops for the CMS-HCC method. However, the CMS-HCC still performed as well or better than the other two methods. The CMS-HCC method demonstrated better performance relative to the Charlson and Elixhauser methods in predicting in-hospital and six-month mortality. The CMS-HCC model is preferred over the Charlson and Elixhauser methods if information about the patient's diagnoses prior to the index hospitalization is available and used to code the risk adjusters. However, caution should be exercised in studies evaluating inpatient processes of care and where data on pre-index admission diagnoses are unavailable.

  12. Genomic selection and association mapping in rice (Oryza sativa): effect of trait genetic architecture, training population composition, marker number and statistical model on accuracy of rice genomic selection in elite, tropical rice breeding lines.

    PubMed

    Spindel, Jennifer; Begum, Hasina; Akdemir, Deniz; Virk, Parminder; Collard, Bertrand; Redoña, Edilberto; Atlin, Gary; Jannink, Jean-Luc; McCouch, Susan R

    2015-02-01

    Genomic Selection (GS) is a new breeding method in which genome-wide markers are used to predict the breeding value of individuals in a breeding population. GS has been shown to improve breeding efficiency in dairy cattle and several crop plant species, and here we evaluate for the first time its efficacy for breeding inbred lines of rice. We performed a genome-wide association study (GWAS) in conjunction with five-fold GS cross-validation on a population of 363 elite breeding lines from the International Rice Research Institute's (IRRI) irrigated rice breeding program and herein report the GS results. The population was genotyped with 73,147 markers using genotyping-by-sequencing. The training population, statistical method used to build the GS model, number of markers, and trait were varied to determine their effect on prediction accuracy. For all three traits, genomic prediction models outperformed prediction based on pedigree records alone. Prediction accuracies ranged from 0.31 and 0.34 for grain yield and plant height to 0.63 for flowering time. Analyses using subsets of the full marker set suggest that using one marker every 0.2 cM is sufficient for genomic selection in this collection of rice breeding materials. RR-BLUP was the best performing statistical method for grain yield where no large effect QTL were detected by GWAS, while for flowering time, where a single very large effect QTL was detected, the non-GS multiple linear regression method outperformed GS models. For plant height, in which four mid-sized QTL were identified by GWAS, random forest produced the most consistently accurate GS models. Our results suggest that GS, informed by GWAS interpretations of genetic architecture and population structure, could become an effective tool for increasing the efficiency of rice breeding as the costs of genotyping continue to decline.

  13. Genomic Selection and Association Mapping in Rice (Oryza sativa): Effect of Trait Genetic Architecture, Training Population Composition, Marker Number and Statistical Model on Accuracy of Rice Genomic Selection in Elite, Tropical Rice Breeding Lines

    PubMed Central

    Spindel, Jennifer; Begum, Hasina; Akdemir, Deniz; Virk, Parminder; Collard, Bertrand; Redoña, Edilberto; Atlin, Gary; Jannink, Jean-Luc; McCouch, Susan R.

    2015-01-01

    Genomic Selection (GS) is a new breeding method in which genome-wide markers are used to predict the breeding value of individuals in a breeding population. GS has been shown to improve breeding efficiency in dairy cattle and several crop plant species, and here we evaluate for the first time its efficacy for breeding inbred lines of rice. We performed a genome-wide association study (GWAS) in conjunction with five-fold GS cross-validation on a population of 363 elite breeding lines from the International Rice Research Institute's (IRRI) irrigated rice breeding program and herein report the GS results. The population was genotyped with 73,147 markers using genotyping-by-sequencing. The training population, statistical method used to build the GS model, number of markers, and trait were varied to determine their effect on prediction accuracy. For all three traits, genomic prediction models outperformed prediction based on pedigree records alone. Prediction accuracies ranged from 0.31 and 0.34 for grain yield and plant height to 0.63 for flowering time. Analyses using subsets of the full marker set suggest that using one marker every 0.2 cM is sufficient for genomic selection in this collection of rice breeding materials. RR-BLUP was the best performing statistical method for grain yield where no large effect QTL were detected by GWAS, while for flowering time, where a single very large effect QTL was detected, the non-GS multiple linear regression method outperformed GS models. For plant height, in which four mid-sized QTL were identified by GWAS, random forest produced the most consistently accurate GS models. Our results suggest that GS, informed by GWAS interpretations of genetic architecture and population structure, could become an effective tool for increasing the efficiency of rice breeding as the costs of genotyping continue to decline. PMID:25689273

  14. The Energetic Cost of Walking: A Comparison of Predictive Methods

    PubMed Central

    Kramer, Patricia Ann; Sylvester, Adam D.

    2011-01-01

    Background The energy that animals devote to locomotion has been of intense interest to biologists for decades and two basic methodologies have emerged to predict locomotor energy expenditure: those based on metabolic and those based on mechanical energy. Metabolic energy approaches share the perspective that prediction of locomotor energy expenditure should be based on statistically significant proxies of metabolic function, while mechanical energy approaches, which derive from many different perspectives, focus on quantifying the energy of movement. Some controversy exists as to which mechanical perspective is “best”, but from first principles all mechanical methods should be equivalent if the inputs to the simulation are of similar quality. Our goals in this paper are 1) to establish the degree to which the various methods of calculating mechanical energy are correlated, and 2) to investigate to what degree the prediction methods explain the variation in energy expenditure. Methodology/Principal Findings We use modern humans as the model organism in this experiment because their data are readily attainable, but the methodology is appropriate for use in other species. Volumetric oxygen consumption and kinematic and kinetic data were collected on 8 adults while walking at their self-selected slow, normal and fast velocities. Using hierarchical statistical modeling via ordinary least squares and maximum likelihood techniques, the predictive ability of several metabolic and mechanical approaches were assessed. We found that all approaches are correlated and that the mechanical approaches explain similar amounts of the variation in metabolic energy expenditure. Most methods predict the variation within an individual well, but are poor at accounting for variation between individuals. Conclusion Our results indicate that the choice of predictive method is dependent on the question(s) of interest and the data available for use as inputs. Although we used modern humans as our model organism, these results can be extended to other species. PMID:21731693

  15. Analysis of Physicochemical and Structural Properties Determining HIV-1 Coreceptor Usage

    PubMed Central

    Bozek, Katarzyna; Lengauer, Thomas; Sierra, Saleta; Kaiser, Rolf; Domingues, Francisco S.

    2013-01-01

    The relationship of HIV tropism with disease progression and the recent development of CCR5-blocking drugs underscore the importance of monitoring virus coreceptor usage. As an alternative to costly phenotypic assays, computational methods aim at predicting virus tropism based on the sequence and structure of the V3 loop of the virus gp120 protein. Here we present a numerical descriptor of the V3 loop encoding its physicochemical and structural properties. The descriptor allows for structure-based prediction of HIV tropism and identification of properties of the V3 loop that are crucial for coreceptor usage. Use of the proposed descriptor for prediction results in a statistically significant improvement over the prediction based solely on V3 sequence with 3 percentage points improvement in AUC and 7 percentage points in sensitivity at the specificity of the 11/25 rule (95%). We additionally assessed the predictive power of the new method on clinically derived ‘bulk’ sequence data and obtained a statistically significant improvement in AUC of 3 percentage points over sequence-based prediction. Furthermore, we demonstrated the capacity of our method to predict therapy outcome by applying it to 53 samples from patients undergoing Maraviroc therapy. The analysis of structural features of the loop informative of tropism indicates the importance of two loop regions and their physicochemical properties. The regions are located on opposite strands of the loop stem and the respective features are predominantly charge-, hydrophobicity- and structure-related. These regions are in close proximity in the bound conformation of the loop potentially forming a site determinant for the coreceptor binding. The method is available via server under http://structure.bioinf.mpi-inf.mpg.de/. PMID:23555214

  16. Datamining approaches for modeling tumor control probability.

    PubMed

    Naqa, Issam El; Deasy, Joseph O; Mu, Yi; Huang, Ellen; Hope, Andrew J; Lindsay, Patricia E; Apte, Aditya; Alaly, James; Bradley, Jeffrey D

    2010-11-01

    Tumor control probability (TCP) to radiotherapy is determined by complex interactions between tumor biology, tumor microenvironment, radiation dosimetry, and patient-related variables. The complexity of these heterogeneous variable interactions constitutes a challenge for building predictive models for routine clinical practice. We describe a datamining framework that can unravel the higher order relationships among dosimetric dose-volume prognostic variables, interrogate various radiobiological processes, and generalize to unseen data before when applied prospectively. Several datamining approaches are discussed that include dose-volume metrics, equivalent uniform dose, mechanistic Poisson model, and model building methods using statistical regression and machine learning techniques. Institutional datasets of non-small cell lung cancer (NSCLC) patients are used to demonstrate these methods. The performance of the different methods was evaluated using bivariate Spearman rank correlations (rs). Over-fitting was controlled via resampling methods. Using a dataset of 56 patients with primary NCSLC tumors and 23 candidate variables, we estimated GTV volume and V75 to be the best model parameters for predicting TCP using statistical resampling and a logistic model. Using these variables, the support vector machine (SVM) kernel method provided superior performance for TCP prediction with an rs=0.68 on leave-one-out testing compared to logistic regression (rs=0.4), Poisson-based TCP (rs=0.33), and cell kill equivalent uniform dose model (rs=0.17). The prediction of treatment response can be improved by utilizing datamining approaches, which are able to unravel important non-linear complex interactions among model variables and have the capacity to predict on unseen data for prospective clinical applications.

  17. No-reference image quality assessment based on statistics of convolution feature maps

    NASA Astrophysics Data System (ADS)

    Lv, Xiaoxin; Qin, Min; Chen, Xiaohui; Wei, Guo

    2018-04-01

    We propose a Convolutional Feature Maps (CFM) driven approach to accurately predict image quality. Our motivation bases on the finding that the Nature Scene Statistic (NSS) features on convolution feature maps are significantly sensitive to distortion degree of an image. In our method, a Convolutional Neural Network (CNN) is trained to obtain kernels for generating CFM. We design a forward NSS layer which performs on CFM to better extract NSS features. The quality aware features derived from the output of NSS layer is effective to describe the distortion type and degree an image suffered. Finally, a Support Vector Regression (SVR) is employed in our No-Reference Image Quality Assessment (NR-IQA) model to predict a subjective quality score of a distorted image. Experiments conducted on two public databases demonstrate the promising performance of the proposed method is competitive to state of the art NR-IQA methods.

  18. How Does One Assess the Accuracy of Academic Success Predictors? ROC Analysis Applied to University Entrance Factors

    ERIC Educational Resources Information Center

    Vivo, Juana-Maria; Franco, Manuel

    2008-01-01

    This article attempts to present a novel application of a method of measuring accuracy for academic success predictors that could be used as a standard. This procedure is known as the receiver operating characteristic (ROC) curve, which comes from statistical decision techniques. The statistical prediction techniques provide predictor models and…

  19. Methods for estimating low-flow statistics for Massachusetts streams

    USGS Publications Warehouse

    Ries, Kernell G.; Friesz, Paul J.

    2000-01-01

    Methods and computer software are described in this report for determining flow duration, low-flow frequency statistics, and August median flows. These low-flow statistics can be estimated for unregulated streams in Massachusetts using different methods depending on whether the location of interest is at a streamgaging station, a low-flow partial-record station, or an ungaged site where no data are available. Low-flow statistics for streamgaging stations can be estimated using standard U.S. Geological Survey methods described in the report. The MOVE.1 mathematical method and a graphical correlation method can be used to estimate low-flow statistics for low-flow partial-record stations. The MOVE.1 method is recommended when the relation between measured flows at a partial-record station and daily mean flows at a nearby, hydrologically similar streamgaging station is linear, and the graphical method is recommended when the relation is curved. Equations are presented for computing the variance and equivalent years of record for estimates of low-flow statistics for low-flow partial-record stations when either a single or multiple index stations are used to determine the estimates. The drainage-area ratio method or regression equations can be used to estimate low-flow statistics for ungaged sites where no data are available. The drainage-area ratio method is generally as accurate as or more accurate than regression estimates when the drainage-area ratio for an ungaged site is between 0.3 and 1.5 times the drainage area of the index data-collection site. Regression equations were developed to estimate the natural, long-term 99-, 98-, 95-, 90-, 85-, 80-, 75-, 70-, 60-, and 50-percent duration flows; the 7-day, 2-year and the 7-day, 10-year low flows; and the August median flow for ungaged sites in Massachusetts. Streamflow statistics and basin characteristics for 87 to 133 streamgaging stations and low-flow partial-record stations were used to develop the equations. The streamgaging stations had from 2 to 81 years of record, with a mean record length of 37 years. The low-flow partial-record stations had from 8 to 36 streamflow measurements, with a median of 14 measurements. All basin characteristics were determined from digital map data. The basin characteristics that were statistically significant in most of the final regression equations were drainage area, the area of stratified-drift deposits per unit of stream length plus 0.1, mean basin slope, and an indicator variable that was 0 in the eastern region and 1 in the western region of Massachusetts. The equations were developed by use of weighted-least-squares regression analyses, with weights assigned proportional to the years of record and inversely proportional to the variances of the streamflow statistics for the stations. Standard errors of prediction ranged from 70.7 to 17.5 percent for the equations to predict the 7-day, 10-year low flow and 50-percent duration flow, respectively. The equations are not applicable for use in the Southeast Coastal region of the State, or where basin characteristics for the selected ungaged site are outside the ranges of those for the stations used in the regression analyses. A World Wide Web application was developed that provides streamflow statistics for data collection stations from a data base and for ungaged sites by measuring the necessary basin characteristics for the site and solving the regression equations. Output provided by the Web application for ungaged sites includes a map of the drainage-basin boundary determined for the site, the measured basin characteristics, the estimated streamflow statistics, and 90-percent prediction intervals for the estimates. An equation is provided for combining regression and correlation estimates to obtain improved estimates of the streamflow statistics for low-flow partial-record stations. An equation is also provided for combining regression and drainage-area ratio estimates to obtain improved e

  20. Predictive Validity of DSM-IV and ICD-10 Criteria for ADHD and Hyperkinetic Disorder

    ERIC Educational Resources Information Center

    Lee, Soyoung I.; Schachar, Russell J.; Chen, Shirley X.; Ornstein, Tisha J.; Charach, Alice; Barr, Cathy; Ickowicz, Abel

    2008-01-01

    Background: The goal of this study was to compare the predictive validity of the two main diagnostic schemata for childhood hyperactivity--attention-deficit hyperactivity disorder (ADHD; "Diagnostic and Statistical Manual"-IV) and hyperkinetic disorder (HKD; "International Classification of Diseases"-10th Edition). Methods: Diagnostic criteria for…

  1. Three-dimensional virtual planning in orthognathic surgery enhances the accuracy of soft tissue prediction.

    PubMed

    Van Hemelen, Geert; Van Genechten, Maarten; Renier, Lieven; Desmedt, Maria; Verbruggen, Elric; Nadjmi, Nasser

    2015-07-01

    Throughout the history of computing, shortening the gap between the physical and digital world behind the screen has always been strived for. Recent advances in three-dimensional (3D) virtual surgery programs have reduced this gap significantly. Although 3D assisted surgery is now widely available for orthognathic surgery, one might still argue whether a 3D virtual planning approach is a better alternative to a conventional two-dimensional (2D) planning technique. The purpose of this study was to compare the accuracy of a traditional 2D technique and a 3D computer-aided prediction method. A double blind randomised prospective study was performed to compare the prediction accuracy of a traditional 2D planning technique versus a 3D computer-aided planning approach. The accuracy of the hard and soft tissue profile predictions using both planning methods was investigated. There was a statistically significant difference between 2D and 3D soft tissue planning (p < 0.05). The statistically significant difference found between 2D and 3D planning and the actual soft tissue outcome was not confirmed by a statistically significant difference between methods. The 3D planning approach provides more accurate soft tissue planning. However, the 2D orthognathic planning is comparable to 3D planning when it comes to hard tissue planning. This study provides relevant results for choosing between 3D and 2D planning in clinical practice. Copyright © 2015 European Association for Cranio-Maxillo-Facial Surgery. Published by Elsevier Ltd. All rights reserved.

  2. Reflexion on linear regression trip production modelling method for ensuring good model quality

    NASA Astrophysics Data System (ADS)

    Suprayitno, Hitapriya; Ratnasari, Vita

    2017-11-01

    Transport Modelling is important. For certain cases, the conventional model still has to be used, in which having a good trip production model is capital. A good model can only be obtained from a good sample. Two of the basic principles of a good sampling is having a sample capable to represent the population characteristics and capable to produce an acceptable error at a certain confidence level. It seems that this principle is not yet quite understood and used in trip production modeling. Therefore, investigating the Trip Production Modelling practice in Indonesia and try to formulate a better modeling method for ensuring the Model Quality is necessary. This research result is presented as follows. Statistics knows a method to calculate span of prediction value at a certain confidence level for linear regression, which is called Confidence Interval of Predicted Value. The common modeling practice uses R2 as the principal quality measure, the sampling practice varies and not always conform to the sampling principles. An experiment indicates that small sample is already capable to give excellent R2 value and sample composition can significantly change the model. Hence, good R2 value, in fact, does not always mean good model quality. These lead to three basic ideas for ensuring good model quality, i.e. reformulating quality measure, calculation procedure, and sampling method. A quality measure is defined as having a good R2 value and a good Confidence Interval of Predicted Value. Calculation procedure must incorporate statistical calculation method and appropriate statistical tests needed. A good sampling method must incorporate random well distributed stratified sampling with a certain minimum number of samples. These three ideas need to be more developed and tested.

  3. Artificial neural network models for prediction of cardiovascular autonomic dysfunction in general Chinese population

    PubMed Central

    2013-01-01

    Background The present study aimed to develop an artificial neural network (ANN) based prediction model for cardiovascular autonomic (CA) dysfunction in the general population. Methods We analyzed a previous dataset based on a population sample consisted of 2,092 individuals aged 30–80 years. The prediction models were derived from an exploratory set using ANN analysis. Performances of these prediction models were evaluated in the validation set. Results Univariate analysis indicated that 14 risk factors showed statistically significant association with CA dysfunction (P < 0.05). The mean area under the receiver-operating curve was 0.762 (95% CI 0.732–0.793) for prediction model developed using ANN analysis. The mean sensitivity, specificity, positive and negative predictive values were similar in the prediction models was 0.751, 0.665, 0.330 and 0.924, respectively. All HL statistics were less than 15.0. Conclusion ANN is an effective tool for developing prediction models with high value for predicting CA dysfunction among the general population. PMID:23902963

  4. A three-step approach for the derivation and validation of high-performing predictive models using an operational dataset: congestive heart failure readmission case study.

    PubMed

    AbdelRahman, Samir E; Zhang, Mingyuan; Bray, Bruce E; Kawamoto, Kensaku

    2014-05-27

    The aim of this study was to propose an analytical approach to develop high-performing predictive models for congestive heart failure (CHF) readmission using an operational dataset with incomplete records and changing data over time. Our analytical approach involves three steps: pre-processing, systematic model development, and risk factor analysis. For pre-processing, variables that were absent in >50% of records were removed. Moreover, the dataset was divided into a validation dataset and derivation datasets which were separated into three temporal subsets based on changes to the data over time. For systematic model development, using the different temporal datasets and the remaining explanatory variables, the models were developed by combining the use of various (i) statistical analyses to explore the relationships between the validation and the derivation datasets; (ii) adjustment methods for handling missing values; (iii) classifiers; (iv) feature selection methods; and (iv) discretization methods. We then selected the best derivation dataset and the models with the highest predictive performance. For risk factor analysis, factors in the highest-performing predictive models were analyzed and ranked using (i) statistical analyses of the best derivation dataset, (ii) feature rankers, and (iii) a newly developed algorithm to categorize risk factors as being strong, regular, or weak. The analysis dataset consisted of 2,787 CHF hospitalizations at University of Utah Health Care from January 2003 to June 2013. In this study, we used the complete-case analysis and mean-based imputation adjustment methods; the wrapper subset feature selection method; and four ranking strategies based on information gain, gain ratio, symmetrical uncertainty, and wrapper subset feature evaluators. The best-performing models resulted from the use of a complete-case analysis derivation dataset combined with the Class-Attribute Contingency Coefficient discretization method and a voting classifier which averaged the results of multi-nominal logistic regression and voting feature intervals classifiers. Of 42 final model risk factors, discharge disposition, discretized age, and indicators of anemia were the most significant. This model achieved a c-statistic of 86.8%. The proposed three-step analytical approach enhanced predictive model performance for CHF readmissions. It could potentially be leveraged to improve predictive model performance in other areas of clinical medicine.

  5. Recommendations for evaluation of computational methods

    NASA Astrophysics Data System (ADS)

    Jain, Ajay N.; Nicholls, Anthony

    2008-03-01

    The field of computational chemistry, particularly as applied to drug design, has become increasingly important in terms of the practical application of predictive modeling to pharmaceutical research and development. Tools for exploiting protein structures or sets of ligands known to bind particular targets can be used for binding-mode prediction, virtual screening, and prediction of activity. A serious weakness within the field is a lack of standards with respect to quantitative evaluation of methods, data set preparation, and data set sharing. Our goal should be to report new methods or comparative evaluations of methods in a manner that supports decision making for practical applications. Here we propose a modest beginning, with recommendations for requirements on statistical reporting, requirements for data sharing, and best practices for benchmark preparation and usage.

  6. Comparisons of survival predictions using survival risk ratios based on International Classification of Diseases, Ninth Revision and Abbreviated Injury Scale trauma diagnosis codes.

    PubMed

    Clarke, John R; Ragone, Andrew V; Greenwald, Lloyd

    2005-09-01

    We conducted a comparison of methods for predicting survival using survival risk ratios (SRRs), including new comparisons based on International Classification of Diseases, Ninth Revision (ICD-9) versus Abbreviated Injury Scale (AIS) six-digit codes. From the Pennsylvania trauma center's registry, all direct trauma admissions were collected through June 22, 1999. Patients with no comorbid medical diagnoses and both ICD-9 and AIS injury codes were used for comparisons based on a single set of data. SRRs for ICD-9 and then for AIS diagnostic codes were each calculated two ways: from the survival rate of patients with each diagnosis and when each diagnosis was an isolated diagnosis. Probabilities of survival for the cohort were calculated using each set of SRRs by the multiplicative ICISS method and, where appropriate, the minimum SRR method. These prediction sets were then internally validated against actual survival by the Hosmer-Lemeshow goodness-of-fit statistic. The 41,364 patients had 1,224 different ICD-9 injury diagnoses in 32,261 combinations and 1,263 corresponding AIS injury diagnoses in 31,755 combinations, ranging from 1 to 27 injuries per patient. All conventional ICD-9-based combinations of SRRs and methods had better Hosmer-Lemeshow goodness-of-fit statistic fits than their AIS-based counterparts. The minimum SRR method produced better calibration than the multiplicative methods, presumably because it did not magnify inaccuracies in the SRRs that might occur with multiplication. Predictions of survival based on anatomic injury alone can be performed using ICD-9 codes, with no advantage from extra coding of AIS diagnoses. Predictions based on the single worst SRR were closer to actual outcomes than those based on multiplying SRRs.

  7. An image based method for crop yield prediction using remotely sensed and crop canopy data: the case of Paphos district, western Cyprus

    NASA Astrophysics Data System (ADS)

    Papadavid, G.; Hadjimitsis, D.

    2014-08-01

    Remote sensing techniques development have provided the opportunity for optimizing yields in the agricultural procedure and moreover to predict the forthcoming yield. Yield prediction plays a vital role in Agricultural Policy and provides useful data to policy makers. In this context, crop and soil parameters along with NDVI index which are valuable sources of information have been elaborated statistically to test if a) Durum wheat yield can be predicted and b) when is the actual time-window to predict the yield in the district of Paphos, where Durum wheat is the basic cultivation and supports the rural economy of the area. 15 plots cultivated with Durum wheat from the Agricultural Research Institute of Cyprus for research purposes, in the area of interest, have been under observation for three years to derive the necessary data. Statistical and remote sensing techniques were then applied to derive and map a model that can predict yield of Durum wheat in this area. Indeed the semi-empirical model developed for this purpose, with very high correlation coefficient R2=0.886, has shown in practice that can predict yields very good. Students T test has revealed that predicted values and real values of yield have no statistically significant difference. The developed model can and will be further elaborated with more parameters and applied for other crops in the near future.

  8. A Comparison of Imputation Methods for Bayesian Factor Analysis Models

    ERIC Educational Resources Information Center

    Merkle, Edgar C.

    2011-01-01

    Imputation methods are popular for the handling of missing data in psychology. The methods generally consist of predicting missing data based on observed data, yielding a complete data set that is amiable to standard statistical analyses. In the context of Bayesian factor analysis, this article compares imputation under an unrestricted…

  9. Predictive validity of the UK clinical aptitude test in the final years of medical school: a prospective cohort study

    PubMed Central

    2014-01-01

    Background The UK Clinical Aptitude Test (UKCAT) was designed to address issues identified with traditional methods of selection. This study aims to examine the predictive validity of the UKCAT and compare this to traditional selection methods in the senior years of medical school. This was a follow-up study of two cohorts of students from two medical schools who had previously taken part in a study examining the predictive validity of the UKCAT in first year. Methods The sample consisted of 4th and 5th Year students who commenced their studies at the University of Aberdeen or University of Dundee medical schools in 2007. Data collected were: demographics (gender and age group), UKCAT scores; Universities and Colleges Admissions Service (UCAS) form scores; admission interview scores; Year 4 and 5 degree examination scores. Pearson’s correlations were used to examine the relationships between admissions variables, examination scores, gender and age group, and to select variables for multiple linear regression analysis to predict examination scores. Results Ninety-nine and 89 students at Aberdeen medical school from Years 4 and 5 respectively, and 51 Year 4 students in Dundee, were included in the analysis. Neither UCAS form nor interview scores were statistically significant predictors of examination performance. Conversely, the UKCAT yielded statistically significant validity coefficients between .24 and .36 in four of five assessments investigated. Multiple regression analysis showed the UKCAT made a statistically significant unique contribution to variance in examination performance in the senior years. Conclusions Results suggest the UKCAT appears to predict performance better in the later years of medical school compared to earlier years and provides modest supportive evidence for the UKCAT’s role in student selection within these institutions. Further research is needed to assess the predictive validity of the UKCAT against professional and behavioural outcomes as the cohort commences working life. PMID:24762134

  10. The application of the statistical theory of extreme values to gust-load problems

    NASA Technical Reports Server (NTRS)

    Press, Harry

    1950-01-01

    An analysis is presented which indicates that the statistical theory of extreme values is applicable to the problems of predicting the frequency of encountering the larger gust loads and gust velocities for both specific test conditions as well as commercial transport operations. The extreme-value theory provides an analytic form for the distributions of maximum values of gust load and velocity. Methods of fitting the distribution are given along with a method of estimating the reliability of the predictions. The theory of extreme values is applied to available load data from commercial transport operations. The results indicate that the estimates of the frequency of encountering the larger loads are more consistent with the data and more reliable than those obtained in previous analyses. (author)

  11. A statistical learning framework for groundwater nitrate models of the Central Valley, California, USA

    USGS Publications Warehouse

    Nolan, Bernard T.; Fienen, Michael N.; Lorenz, David L.

    2015-01-01

    We used a statistical learning framework to evaluate the ability of three machine-learning methods to predict nitrate concentration in shallow groundwater of the Central Valley, California: boosted regression trees (BRT), artificial neural networks (ANN), and Bayesian networks (BN). Machine learning methods can learn complex patterns in the data but because of overfitting may not generalize well to new data. The statistical learning framework involves cross-validation (CV) training and testing data and a separate hold-out data set for model evaluation, with the goal of optimizing predictive performance by controlling for model overfit. The order of prediction performance according to both CV testing R2 and that for the hold-out data set was BRT > BN > ANN. For each method we identified two models based on CV testing results: that with maximum testing R2 and a version with R2 within one standard error of the maximum (the 1SE model). The former yielded CV training R2 values of 0.94–1.0. Cross-validation testing R2 values indicate predictive performance, and these were 0.22–0.39 for the maximum R2 models and 0.19–0.36 for the 1SE models. Evaluation with hold-out data suggested that the 1SE BRT and ANN models predicted better for an independent data set compared with the maximum R2 versions, which is relevant to extrapolation by mapping. Scatterplots of predicted vs. observed hold-out data obtained for final models helped identify prediction bias, which was fairly pronounced for ANN and BN. Lastly, the models were compared with multiple linear regression (MLR) and a previous random forest regression (RFR) model. Whereas BRT results were comparable to RFR, MLR had low hold-out R2 (0.07) and explained less than half the variation in the training data. Spatial patterns of predictions by the final, 1SE BRT model agreed reasonably well with previously observed patterns of nitrate occurrence in groundwater of the Central Valley.

  12. [Study of beta-turns in globular proteins].

    PubMed

    Amirova, S R; Milchevskiĭ, Iu V; Filatov, I V; Esipova, N G; Tumanian, V G

    2005-01-01

    The formation of beta-turns in globular proteins has been studied by the method of molecular mechanics. Statistical method of discriminant analysis was applied to calculate energy components and sequences of oligopeptide segments, and after this prediction of I type beta-turns has been drawn. The accuracy of true positive prediction is 65%. Components of conformational energy considerably affecting beta-turn formation were delineated. There are torsional energy, energy of hydrogen bonds, and van der Waals energy.

  13. A Novel Approach for Adaptive Signal Processing

    NASA Technical Reports Server (NTRS)

    Chen, Ya-Chin; Juang, Jer-Nan

    1998-01-01

    Adaptive linear predictors have been used extensively in practice in a wide variety of forms. In the main, their theoretical development is based upon the assumption of stationarity of the signals involved, particularly with respect to the second order statistics. On this basis, the well-known normal equations can be formulated. If high- order statistical stationarity is assumed, then the equivalent normal equations involve high-order signal moments. In either case, the cross moments (second or higher) are needed. This renders the adaptive prediction procedure non-blind. A novel procedure for blind adaptive prediction has been proposed and considerable implementation has been made in our contributions in the past year. The approach is based upon a suitable interpretation of blind equalization methods that satisfy the constant modulus property and offers significant deviations from the standard prediction methods. These blind adaptive algorithms are derived by formulating Lagrange equivalents from mechanisms of constrained optimization. In this report, other new update algorithms are derived from the fundamental concepts of advanced system identification to carry out the proposed blind adaptive prediction. The results of the work can be extended to a number of control-related problems, such as disturbance identification. The basic principles are outlined in this report and differences from other existing methods are discussed. The applications implemented are speech processing, such as coding and synthesis. Simulations are included to verify the novel modelling method.

  14. "Wish You Were Here": Examining Characteristics, Outcomes, and Statistical Solutions for Missing Cases in Web-Based Psychotherapeutic Trials.

    PubMed

    Karin, Eyal; Dear, Blake F; Heller, Gillian Z; Crane, Monique F; Titov, Nickolai

    2018-04-19

    Missing cases following treatment are common in Web-based psychotherapy trials. Without the ability to directly measure and evaluate the outcomes for missing cases, the ability to measure and evaluate the effects of treatment is challenging. Although common, little is known about the characteristics of Web-based psychotherapy participants who present as missing cases, their likely clinical outcomes, or the suitability of different statistical assumptions that can characterize missing cases. Using a large sample of individuals who underwent Web-based psychotherapy for depressive symptoms (n=820), the aim of this study was to explore the characteristics of cases who present as missing cases at posttreatment (n=138), their likely treatment outcomes, and compare between statistical methods for replacing their missing data. First, common participant and treatment features were tested through binary logistic regression models, evaluating the ability to predict missing cases. Second, the same variables were screened for their ability to increase or impede the rate symptom change that was observed following treatment. Third, using recontacted cases at 3-month follow-up to proximally represent missing cases outcomes following treatment, various simulated replacement scores were compared and evaluated against observed clinical follow-up scores. Missing cases were dominantly predicted by lower treatment adherence and increased symptoms at pretreatment. Statistical methods that ignored these characteristics can overlook an important clinical phenomenon and consequently produce inaccurate replacement outcomes, with symptoms estimates that can swing from -32% to 70% from the observed outcomes of recontacted cases. In contrast, longitudinal statistical methods that adjusted their estimates for missing cases outcomes by treatment adherence rates and baseline symptoms scores resulted in minimal measurement bias (<8%). Certain variables can characterize and predict missing cases likelihood and jointly predict lesser clinical improvement. Under such circumstances, individuals with potentially worst off treatment outcomes can become concealed, and failure to adjust for this can lead to substantial clinical measurement bias. Together, this preliminary research suggests that missing cases in Web-based psychotherapeutic interventions may not occur as random events and can be systematically predicted. Critically, at the same time, missing cases may experience outcomes that are distinct and important for a complete understanding of the treatment effect. ©Eyal Karin, Blake F Dear, Gillian Z Heller, Monique F Crane, Nickolai Titov. Originally published in JMIR Mental Health (http://mental.jmir.org), 19.04.2018.

  15. Archaeology Through Computational Linguistics: Inscription Statistics Predict Excavation Sites of Indus Valley Artifacts.

    PubMed

    Recchia, Gabriel L; Louwerse, Max M

    2016-11-01

    Computational techniques comparing co-occurrences of city names in texts allow the relative longitudes and latitudes of cities to be estimated algorithmically. However, these techniques have not been applied to estimate the provenance of artifacts with unknown origins. Here, we estimate the geographic origin of artifacts from the Indus Valley Civilization, applying methods commonly used in cognitive science to the Indus script. We show that these methods can accurately predict the relative locations of archeological sites on the basis of artifacts of known provenance, and we further apply these techniques to determine the most probable excavation sites of four sealings of unknown provenance. These findings suggest that inscription statistics reflect historical interactions among locations in the Indus Valley region, and they illustrate how computational methods can help localize inscribed archeological artifacts of unknown origin. The success of this method offers opportunities for the cognitive sciences in general and for computational anthropology specifically. Copyright © 2015 Cognitive Science Society, Inc.

  16. UVPROM dosimetry, microdosimetry and applications to SEU and extreme value theory

    NASA Astrophysics Data System (ADS)

    Scheick, Leif Zebediah

    A new method is described for characterizing a device in terms of the statistical distribution of first failures. The method is based on the erasure of a commercial Ultra- Violet erasable Programmable Read Only Memory (UVPROM). The method of readout would be used on a spacecraft or in other restrictive radiation environments. The measurement of the charge remaining on the floating gate is used to determine absorbed dose. The method of determining dose does not require the detector to be destroyed or erased nor does it effect the ability for taking further measurements. This is compared to extreme value theory applied to the statistical distributions that apply to this device. This technique predicts the threshold of Single Event Effects (SEE), like anomalous changes in erasure time in programmable devices due to high microdose energy-deposition events. This technique also allows for advanced non-destructive, screening of a single microelectronic devices for predictable response in a stressful, i.e. radiation, environments.

  17. Confidence Intervals from Realizations of Simulated Nuclear Data

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Younes, W.; Ratkiewicz, A.; Ressler, J. J.

    2017-09-28

    Various statistical techniques are discussed that can be used to assign a level of confidence in the prediction of models that depend on input data with known uncertainties and correlations. The particular techniques reviewed in this paper are: 1) random realizations of the input data using Monte-Carlo methods, 2) the construction of confidence intervals to assess the reliability of model predictions, and 3) resampling techniques to impose statistical constraints on the input data based on additional information. These techniques are illustrated with a calculation of the keff value, based on the 235U(n, f) and 239Pu (n, f) cross sections.

  18. Gender discrimination and prediction on the basis of facial metric information.

    PubMed

    Fellous, J M

    1997-07-01

    Horizontal and vertical facial measurements are statistically independent. Discriminant analysis shows that five of such normalized distances explain over 95% of the gender differences of "training" samples and predict the gender of 90% novel test faces exhibiting various facial expressions. The robustness of the method and its results are assessed. It is argued that these distances (termed fiducial) are compatible with those found experimentally by psychophysical and neurophysiological studies. In consequence, partial explanations for the effects observed in these experiments can be found in the intrinsic statistical nature of the facial stimuli used.

  19. Dynamic Modeling and Very Short-term Prediction of Wind Power Output Using Box-Cox Transformation

    NASA Astrophysics Data System (ADS)

    Urata, Kengo; Inoue, Masaki; Murayama, Dai; Adachi, Shuichi

    2016-09-01

    We propose a statistical modeling method of wind power output for very short-term prediction. The modeling method with a nonlinear model has cascade structure composed of two parts. One is a linear dynamic part that is driven by a Gaussian white noise and described by an autoregressive model. The other is a nonlinear static part that is driven by the output of the linear part. This nonlinear part is designed for output distribution matching: we shape the distribution of the model output to match with that of the wind power output. The constructed model is utilized for one-step ahead prediction of the wind power output. Furthermore, we study the relation between the prediction accuracy and the prediction horizon.

  20. Hurricane track forecast cones from fluctuations

    PubMed Central

    Meuel, T.; Prado, G.; Seychelles, F.; Bessafi, M.; Kellay, H.

    2012-01-01

    Trajectories of tropical cyclones may show large deviations from predicted tracks leading to uncertainty as to their landfall location for example. Prediction schemes usually render this uncertainty by showing track forecast cones representing the most probable region for the location of a cyclone during a period of time. By using the statistical properties of these deviations, we propose a simple method to predict possible corridors for the future trajectory of a cyclone. Examples of this scheme are implemented for hurricane Ike and hurricane Jimena. The corridors include the future trajectory up to at least 50 h before landfall. The cones proposed here shed new light on known track forecast cones as they link them directly to the statistics of these deviations. PMID:22701776

  1. A novel principal component analysis for spatially misaligned multivariate air pollution data.

    PubMed

    Jandarov, Roman A; Sheppard, Lianne A; Sampson, Paul D; Szpiro, Adam A

    2017-01-01

    We propose novel methods for predictive (sparse) PCA with spatially misaligned data. These methods identify principal component loading vectors that explain as much variability in the observed data as possible, while also ensuring the corresponding principal component scores can be predicted accurately by means of spatial statistics at locations where air pollution measurements are not available. This will make it possible to identify important mixtures of air pollutants and to quantify their health effects in cohort studies, where currently available methods cannot be used. We demonstrate the utility of predictive (sparse) PCA in simulated data and apply the approach to annual averages of particulate matter speciation data from national Environmental Protection Agency (EPA) regulatory monitors.

  2. Principal Component Analysis in Construction of 3D Human Knee Joint Models Using a Statistical Shape Model Method

    PubMed Central

    Tsai, Tsung-Yuan; Li, Jing-Sheng; Wang, Shaobai; Li, Pingyue; Kwon, Young-Min; Li, Guoan

    2013-01-01

    The statistical shape model (SSM) method that uses 2D images of the knee joint to predict the 3D joint surface model has been reported in literature. In this study, we constructed a SSM database using 152 human CT knee joint models, including the femur, tibia and patella and analyzed the characteristics of each principal component of the SSM. The surface models of two in vivo knees were predicted using the SSM and their 2D bi-plane fluoroscopic images. The predicted models were compared to their CT joint models. The differences between the predicted 3D knee joint surfaces and the CT image-based surfaces were 0.30 ± 0.81 mm, 0.34 ± 0.79 mm and 0.36 ± 0.59 mm for the femur, tibia and patella, respectively (average ± standard deviation). The computational time for each bone of the knee joint was within 30 seconds using a personal computer. The analysis of this study indicated that the SSM method could be a useful tool to construct 3D surface models of the knee with sub-millimeter accuracy in real time. Thus it may have a broad application in computer assisted knee surgeries that require 3D surface models of the knee. PMID:24156375

  3. Extreme value statistics analysis of fracture strengths of a sintered silicon nitride failing from pores

    NASA Technical Reports Server (NTRS)

    Chao, Luen-Yuan; Shetty, Dinesh K.

    1992-01-01

    Statistical analysis and correlation between pore-size distribution and fracture strength distribution using the theory of extreme-value statistics is presented for a sintered silicon nitride. The pore-size distribution on a polished surface of this material was characterized, using an automatic optical image analyzer. The distribution measured on the two-dimensional plane surface was transformed to a population (volume) distribution, using the Schwartz-Saltykov diameter method. The population pore-size distribution and the distribution of the pore size at the fracture origin were correllated by extreme-value statistics. Fracture strength distribution was then predicted from the extreme-value pore-size distribution, usin a linear elastic fracture mechanics model of annular crack around pore and the fracture toughness of the ceramic. The predicted strength distribution was in good agreement with strength measurements in bending. In particular, the extreme-value statistics analysis explained the nonlinear trend in the linearized Weibull plot of measured strengths without postulating a lower-bound strength.

  4. Comparison of Adaline and Multiple Linear Regression Methods for Rainfall Forecasting

    NASA Astrophysics Data System (ADS)

    Sutawinaya, IP; Astawa, INGA; Hariyanti, NKD

    2018-01-01

    Heavy rainfall can cause disaster, therefore need a forecast to predict rainfall intensity. Main factor that cause flooding is there is a high rainfall intensity and it makes the river become overcapacity. This will cause flooding around the area. Rainfall factor is a dynamic factor, so rainfall is very interesting to be studied. In order to support the rainfall forecasting, there are methods that can be used from Artificial Intelligence (AI) to statistic. In this research, we used Adaline for AI method and Regression for statistic method. The more accurate forecast result shows the method that used is good for forecasting the rainfall. Through those methods, we expected which is the best method for rainfall forecasting here.

  5. Statistical tools for transgene copy number estimation based on real-time PCR.

    PubMed

    Yuan, Joshua S; Burris, Jason; Stewart, Nathan R; Mentewab, Ayalew; Stewart, C Neal

    2007-11-01

    As compared with traditional transgene copy number detection technologies such as Southern blot analysis, real-time PCR provides a fast, inexpensive and high-throughput alternative. However, the real-time PCR based transgene copy number estimation tends to be ambiguous and subjective stemming from the lack of proper statistical analysis and data quality control to render a reliable estimation of copy number with a prediction value. Despite the recent progresses in statistical analysis of real-time PCR, few publications have integrated these advancements in real-time PCR based transgene copy number determination. Three experimental designs and four data quality control integrated statistical models are presented. For the first method, external calibration curves are established for the transgene based on serially-diluted templates. The Ct number from a control transgenic event and putative transgenic event are compared to derive the transgene copy number or zygosity estimation. Simple linear regression and two group T-test procedures were combined to model the data from this design. For the second experimental design, standard curves were generated for both an internal reference gene and the transgene, and the copy number of transgene was compared with that of internal reference gene. Multiple regression models and ANOVA models can be employed to analyze the data and perform quality control for this approach. In the third experimental design, transgene copy number is compared with reference gene without a standard curve, but rather, is based directly on fluorescence data. Two different multiple regression models were proposed to analyze the data based on two different approaches of amplification efficiency integration. Our results highlight the importance of proper statistical treatment and quality control integration in real-time PCR-based transgene copy number determination. These statistical methods allow the real-time PCR-based transgene copy number estimation to be more reliable and precise with a proper statistical estimation. Proper confidence intervals are necessary for unambiguous prediction of trangene copy number. The four different statistical methods are compared for their advantages and disadvantages. Moreover, the statistical methods can also be applied for other real-time PCR-based quantification assays including transfection efficiency analysis and pathogen quantification.

  6. Earthquake prediction evaluation standards applied to the VAN Method

    NASA Astrophysics Data System (ADS)

    Jackson, David D.

    Earthquake prediction research must meet certain standards before it can be suitably evaluated for potential application in decision making. For methods that result in a binary (on or off) alarm condition, requirements include (1) a quantitative description of observables that trigger an alarm, (2) a quantitative description, including ranges of time, location, and magnitude, of the predicted earthquakes, (3) documented evidence of all previous alarms, (4) a complete list of predicted earthquakes, (5) a complete list of unpredicted earthquakes. The VAN technique [Varotsos and Lazaridou, 1991; Varotsos et al., 1996] has not yet been stated as a testable hypothesis. It fails criteria (1) and (2) so it is not ready to be evaluated properly. Although telegrams were transmitted in advance of claimed successes, these telegrams did not fully specify the predicted events, and all of the published statistical evaluations involve many subjective ex post facto decisions. Lacking a statistically demonstrated relationship to earthquakes, a candidate prediction technique should satisfy several plausibility criteria, including: (1) a reasonable relationship between the location of the candidate precursor and that of the predicted earthquake, (2) some demonstration that the candidate precursory observations are related to stress, strain, or other quantities related to earthquakes, and (3) the existence of co-seismic as well as pre-seismic variations of the candidate precursor. The VAN technique meets none of these criteria.

  7. Predicting survival of Escherichia coli O157:H7 in dry fermented sausage using artificial neural networks.

    PubMed

    Palanichamy, A; Jayas, D S; Holley, R A

    2008-01-01

    The Canadian Food Inspection Agency required the meat industry to ensure Escherichia coli O157:H7 does not survive (experiences > or = 5 log CFU/g reduction) in dry fermented sausage (salami) during processing after a series of foodborne illness outbreaks resulting from this pathogenic bacterium occurred. The industry is in need of an effective technique like predictive modeling for estimating bacterial viability, because traditional microbiological enumeration is a time-consuming and laborious method. The accuracy and speed of artificial neural networks (ANNs) for this purpose is an attractive alternative (developed from predictive microbiology), especially for on-line processing in industry. Data from a study of interactive effects of different levels of pH, water activity, and the concentrations of allyl isothiocyanate at various times during sausage manufacture in reducing numbers of E. coli O157:H7 were collected. Data were used to develop predictive models using a general regression neural network (GRNN), a form of ANN, and a statistical linear polynomial regression technique. Both models were compared for their predictive error, using various statistical indices. GRNN predictions for training and test data sets had less serious errors when compared with the statistical model predictions. GRNN models were better and slightly better for training and test sets, respectively, than was the statistical model. Also, GRNN accurately predicted the level of allyl isothiocyanate required, ensuring a 5-log reduction, when an appropriate production set was created by interpolation. Because they are simple to generate, fast, and accurate, ANN models may be of value for industrial use in dry fermented sausage manufacture to reduce the hazard associated with E. coli O157:H7 in fresh beef and permit production of consistently safe products from this raw material.

  8. Validity of a Manual Soft Tissue Profile Prediction Method Following Mandibular Setback Osteotomy

    PubMed Central

    Kolokitha, Olga-Elpis

    2007-01-01

    Objectives The aim of this study was to determine the validity of a manual cephalometric method used for predicting the post-operative soft tissue profiles of patients who underwent mandibular setback surgery and compare it to a computerized cephalometric prediction method (Dentofacial Planner). Lateral cephalograms of 18 adults with mandibular prognathism taken at the end of pre-surgical orthodontics and approximately one year after surgery were used. Methods To test the validity of the manual method the prediction tracings were compared to the actual post-operative tracings. The Dentofacial Planner software was used to develop the computerized post-surgical prediction tracings. Both manual and computerized prediction printouts were analyzed by using the cephalometric system PORDIOS. Statistical analysis was performed by means of t-test. Results Comparison between manual prediction tracings and the actual post-operative profile showed that the manual method results in more convex soft tissue profiles; the upper lip was found in a more prominent position, upper lip thickness was increased and, the mandible and lower lip were found in a less posterior position than that of the actual profiles. Comparison between computerized and manual prediction methods showed that in the manual method upper lip thickness was increased, the upper lip was found in a more anterior position and the lower anterior facial height was increased as compared to the computerized prediction method. Conclusions Cephalometric simulation of post-operative soft tissue profile following orthodontic-surgical management of mandibular prognathism imposes certain limitations related to the methods implied. However, both manual and computerized prediction methods remain a useful tool for patient communication. PMID:19212468

  9. No-reference image quality assessment based on natural scene statistics and gradient magnitude similarity

    NASA Astrophysics Data System (ADS)

    Jia, Huizhen; Sun, Quansen; Ji, Zexuan; Wang, Tonghan; Chen, Qiang

    2014-11-01

    The goal of no-reference/blind image quality assessment (NR-IQA) is to devise a perceptual model that can accurately predict the quality of a distorted image as human opinions, in which feature extraction is an important issue. However, the features used in the state-of-the-art "general purpose" NR-IQA algorithms are usually natural scene statistics (NSS) based or are perceptually relevant; therefore, the performance of these models is limited. To further improve the performance of NR-IQA, we propose a general purpose NR-IQA algorithm which combines NSS-based features with perceptually relevant features. The new method extracts features in both the spatial and gradient domains. In the spatial domain, we extract the point-wise statistics for single pixel values which are characterized by a generalized Gaussian distribution model to form the underlying features. In the gradient domain, statistical features based on neighboring gradient magnitude similarity are extracted. Then a mapping is learned to predict quality scores using a support vector regression. The experimental results on the benchmark image databases demonstrate that the proposed algorithm correlates highly with human judgments of quality and leads to significant performance improvements over state-of-the-art methods.

  10. Modern Prediction Methods for Turbomachine Performance

    DTIC Science & Technology

    1976-01-01

    easier on the basis of C factor correlation. Generally, correlation has to be carefully established; statistical methods may be of good help when a , large...FLOW TURBOMACHINE BLADES Walker, G. J. Instn of Engrs,, Australia-3rd Australiasian Conference on Hydraulics 8 Fluid Mechanics-Proc, Nov 25-29 1968

  11. Assessment of statistical methods used in library-based approaches to microbial source tracking.

    PubMed

    Ritter, Kerry J; Carruthers, Ethan; Carson, C Andrew; Ellender, R D; Harwood, Valerie J; Kingsley, Kyle; Nakatsu, Cindy; Sadowsky, Michael; Shear, Brian; West, Brian; Whitlock, John E; Wiggins, Bruce A; Wilbur, Jayson D

    2003-12-01

    Several commonly used statistical methods for fingerprint identification in microbial source tracking (MST) were examined to assess the effectiveness of pattern-matching algorithms to correctly identify sources. Although numerous statistical methods have been employed for source identification, no widespread consensus exists as to which is most appropriate. A large-scale comparison of several MST methods, using identical fecal sources, presented a unique opportunity to assess the utility of several popular statistical methods. These included discriminant analysis, nearest neighbour analysis, maximum similarity and average similarity, along with several measures of distance or similarity. Threshold criteria for excluding uncertain or poorly matched isolates from final analysis were also examined for their ability to reduce false positives and increase prediction success. Six independent libraries used in the study were constructed from indicator bacteria isolated from fecal materials of humans, seagulls, cows and dogs. Three of these libraries were constructed using the rep-PCR technique and three relied on antibiotic resistance analysis (ARA). Five of the libraries were constructed using Escherichia coli and one using Enterococcus spp. (ARA). Overall, the outcome of this study suggests a high degree of variability across statistical methods. Despite large differences in correct classification rates among the statistical methods, no single statistical approach emerged as superior. Thresholds failed to consistently increase rates of correct classification and improvement was often associated with substantial effective sample size reduction. Recommendations are provided to aid in selecting appropriate analyses for these types of data.

  12. CORSSA: Community Online Resource for Statistical Seismicity Analysis

    NASA Astrophysics Data System (ADS)

    Zechar, J. D.; Hardebeck, J. L.; Michael, A. J.; Naylor, M.; Steacy, S.; Wiemer, S.; Zhuang, J.

    2011-12-01

    Statistical seismology is critical to the understanding of seismicity, the evaluation of proposed earthquake prediction and forecasting methods, and the assessment of seismic hazard. Unfortunately, despite its importance to seismology-especially to those aspects with great impact on public policy-statistical seismology is mostly ignored in the education of seismologists, and there is no central repository for the existing open-source software tools. To remedy these deficiencies, and with the broader goal to enhance the quality of statistical seismology research, we have begun building the Community Online Resource for Statistical Seismicity Analysis (CORSSA, www.corssa.org). We anticipate that the users of CORSSA will range from beginning graduate students to experienced researchers. More than 20 scientists from around the world met for a week in Zurich in May 2010 to kick-start the creation of CORSSA: the format and initial table of contents were defined; a governing structure was organized; and workshop participants began drafting articles. CORSSA materials are organized with respect to six themes, each will contain between four and eight articles. CORSSA now includes seven articles with an additional six in draft form along with forums for discussion, a glossary, and news about upcoming meetings, special issues, and recent papers. Each article is peer-reviewed and presents a balanced discussion, including illustrative examples and code snippets. Topics in the initial set of articles include: introductions to both CORSSA and statistical seismology, basic statistical tests and their role in seismology; understanding seismicity catalogs and their problems; basic techniques for modeling seismicity; and methods for testing earthquake predictability hypotheses. We have also begun curating a collection of statistical seismology software packages.

  13. Progress in space weather predictions and applications

    NASA Astrophysics Data System (ADS)

    Lundstedt, H.

    The methods of today's predictions of space weather and effects are so much more advanced and yesterday's statistical methods are now replaced by integrated knowledge-based neuro-computing models and MHD methods. Within the ESA Space Weather Programme Study a real-time forecast service has been developed for space weather and effects. This prototype is now being implemented for specific users. Today's applications are not only so many more but also so much more advanced and user-oriented. A scientist needs real-time predictions of a global index as input for an MHD model calculating the radiation dose for EVAs. A power company system operator needs a prediction of the local value of a geomagnetically induced current. A science tourist needs to know whether or not aurora will occur. Soon we might even be able to predict the tropospheric climate changes and weather caused by the space weather.

  14. Modelling the effect of structural QSAR parameters on skin penetration using genetic programming

    NASA Astrophysics Data System (ADS)

    Chung, K. K.; Do, D. Q.

    2010-09-01

    In order to model relationships between chemical structures and biological effects in quantitative structure-activity relationship (QSAR) data, an alternative technique of artificial intelligence computing—genetic programming (GP)—was investigated and compared to the traditional method—statistical. GP, with the primary advantage of generating mathematical equations, was employed to model QSAR data and to define the most important molecular descriptions in QSAR data. The models predicted by GP agreed with the statistical results, and the most predictive models of GP were significantly improved when compared to the statistical models using ANOVA. Recently, artificial intelligence techniques have been applied widely to analyse QSAR data. With the capability of generating mathematical equations, GP can be considered as an effective and efficient method for modelling QSAR data.

  15. Validity of a manual soft tissue profile prediction method following mandibular setback osteotomy.

    PubMed

    Kolokitha, Olga-Elpis

    2007-10-01

    The aim of this study was to determine the validity of a manual cephalometric method used for predicting the post-operative soft tissue profiles of patients who underwent mandibular setback surgery and compare it to a computerized cephalometric prediction method (Dentofacial Planner). Lateral cephalograms of 18 adults with mandibular prognathism taken at the end of pre-surgical orthodontics and approximately one year after surgery were used. To test the validity of the manual method the prediction tracings were compared to the actual post-operative tracings. The Dentofacial Planner software was used to develop the computerized post-surgical prediction tracings. Both manual and computerized prediction printouts were analyzed by using the cephalometric system PORDIOS. Statistical analysis was performed by means of t-test. Comparison between manual prediction tracings and the actual post-operative profile showed that the manual method results in more convex soft tissue profiles; the upper lip was found in a more prominent position, upper lip thickness was increased and, the mandible and lower lip were found in a less posterior position than that of the actual profiles. Comparison between computerized and manual prediction methods showed that in the manual method upper lip thickness was increased, the upper lip was found in a more anterior position and the lower anterior facial height was increased as compared to the computerized prediction method. Cephalometric simulation of post-operative soft tissue profile following orthodontic-surgical management of mandibular prognathism imposes certain limitations related to the methods implied. However, both manual and computerized prediction methods remain a useful tool for patient communication.

  16. Prostate segmentation in MRI using a convolutional neural network architecture and training strategy based on statistical shape models.

    PubMed

    Karimi, Davood; Samei, Golnoosh; Kesch, Claudia; Nir, Guy; Salcudean, Septimiu E

    2018-05-15

    Most of the existing convolutional neural network (CNN)-based medical image segmentation methods are based on methods that have originally been developed for segmentation of natural images. Therefore, they largely ignore the differences between the two domains, such as the smaller degree of variability in the shape and appearance of the target volume and the smaller amounts of training data in medical applications. We propose a CNN-based method for prostate segmentation in MRI that employs statistical shape models to address these issues. Our CNN predicts the location of the prostate center and the parameters of the shape model, which determine the position of prostate surface keypoints. To train such a large model for segmentation of 3D images using small data (1) we adopt a stage-wise training strategy by first training the network to predict the prostate center and subsequently adding modules for predicting the parameters of the shape model and prostate rotation, (2) we propose a data augmentation method whereby the training images and their prostate surface keypoints are deformed according to the displacements computed based on the shape model, and (3) we employ various regularization techniques. Our proposed method achieves a Dice score of 0.88, which is obtained by using both elastic-net and spectral dropout for regularization. Compared with a standard CNN-based method, our method shows significantly better segmentation performance on the prostate base and apex. Our experiments also show that data augmentation using the shape model significantly improves the segmentation results. Prior knowledge about the shape of the target organ can improve the performance of CNN-based segmentation methods, especially where image features are not sufficient for a precise segmentation. Statistical shape models can also be employed to synthesize additional training data that can ease the training of large CNNs.

  17. Inference on network statistics by restricting to the network space: applications to sexual history data.

    PubMed

    Goyal, Ravi; De Gruttola, Victor

    2018-01-30

    Analysis of sexual history data intended to describe sexual networks presents many challenges arising from the fact that most surveys collect information on only a very small fraction of the population of interest. In addition, partners are rarely identified and responses are subject to reporting biases. Typically, each network statistic of interest, such as mean number of sexual partners for men or women, is estimated independently of other network statistics. There is, however, a complex relationship among networks statistics; and knowledge of these relationships can aid in addressing concerns mentioned earlier. We develop a novel method that constrains a posterior predictive distribution of a collection of network statistics in order to leverage the relationships among network statistics in making inference about network properties of interest. The method ensures that inference on network properties is compatible with an actual network. Through extensive simulation studies, we also demonstrate that use of this method can improve estimates in settings where there is uncertainty that arises both from sampling and from systematic reporting bias compared with currently available approaches to estimation. To illustrate the method, we apply it to estimate network statistics using data from the Chicago Health and Social Life Survey. Copyright © 2017 John Wiley & Sons, Ltd. Copyright © 2017 John Wiley & Sons, Ltd.

  18. Statistical analysis of water-quality data containing multiple detection limits: S-language software for regression on order statistics

    USGS Publications Warehouse

    Lee, L.; Helsel, D.

    2005-01-01

    Trace contaminants in water, including metals and organics, often are measured at sufficiently low concentrations to be reported only as values below the instrument detection limit. Interpretation of these "less thans" is complicated when multiple detection limits occur. Statistical methods for multiply censored, or multiple-detection limit, datasets have been developed for medical and industrial statistics, and can be employed to estimate summary statistics or model the distributions of trace-level environmental data. We describe S-language-based software tools that perform robust linear regression on order statistics (ROS). The ROS method has been evaluated as one of the most reliable procedures for developing summary statistics of multiply censored data. It is applicable to any dataset that has 0 to 80% of its values censored. These tools are a part of a software library, or add-on package, for the R environment for statistical computing. This library can be used to generate ROS models and associated summary statistics, plot modeled distributions, and predict exceedance probabilities of water-quality standards. ?? 2005 Elsevier Ltd. All rights reserved.

  19. A New Sample Size Formula for Regression.

    ERIC Educational Resources Information Center

    Brooks, Gordon P.; Barcikowski, Robert S.

    The focus of this research was to determine the efficacy of a new method of selecting sample sizes for multiple linear regression. A Monte Carlo simulation was used to study both empirical predictive power rates and empirical statistical power rates of the new method and seven other methods: those of C. N. Park and A. L. Dudycha (1974); J. Cohen…

  20. A clinical study of electrophysiological correlates of behavioural comfort levels in cochlear implantees.

    PubMed

    Raghunandhan, S; Ravikumar, A; Kameswaran, Mohan; Mandke, Kalyani; Ranjith, R

    2014-05-01

    Indications for cochlear implantation have expanded today to include very young children and those with syndromes/multiple handicaps. Programming the implant based on behavioural responses may be tedious for audiologists in such cases, wherein matching an effective Measurable Auditory Percept (MAP) and appropriate MAP becomes the key issue in the habilitation program. In 'Difficult to MAP' scenarios, objective measures become paramount to predict optimal current levels to be set in the MAP. We aimed to (a) study the trends in multi-modal electrophysiological tests and behavioural responses sequentially over the first year of implant use; (b) generate normative data from the above; (c) correlate the multi-modal electrophysiological thresholds levels with behavioural comfort levels; and (d) create predictive formulae for deriving optimal comfort levels (if unknown), using linear and multiple regression analysis. This prospective study included 10 profoundly hearing impaired children aged between 2 and 7 years with normal inner ear anatomy and no additional handicaps. They received the Advanced Bionics HiRes 90 K Implant with Harmony Speech processor and used HiRes-P with Fidelity 120 strategy. They underwent, impedance telemetry, neural response imaging, electrically evoked stapedial response telemetry (ESRT), and electrically evoked auditory brainstem response (EABR) tests at 1, 4, 8, and 12 months of implant use, in conjunction with behavioural mapping. Trends in electrophysiological and behavioural responses were analyzed using paired t-test. By Karl Pearson's correlation method, electrode-wise correlations were derived for neural response imaging (NRI) thresholds versus most comfortable level (M-levels) and offset based (apical, mid-array, and basal array) correlations for EABR and ESRT thresholds versus M-levels were calculated over time. These were used to derive predictive formulae by linear and multiple regression analysis. Such statistically predicted M-levels were compared with the behaviourally recorded M-levels among the cohort, using Cronbach's alpha reliability test method for confirming the efficacy of this method. NRI, ESRT, and EABR thresholds showed statistically significant positive correlations with behavioural M-levels, which improved with implant use over time. These correlations were used to derive predicted M-levels using regression analysis. On an average, predicted M-levels were found to be statistically reliable and they were a fair match to the actual behavioural M-levels. When applied in clinical practice, the predicted values were found to be useful for programming members of the study group. However, individuals showed considerable deviations in behavioural M-levels, above and below the electrophysiologically predicted values, due to various factors. While the current method appears helpful as a reference to predict initial maps in 'difficult to Map' subjects, it is recommended that behavioural measures are mandatory to further optimize the maps for these individuals. The study explores the trends, correlations and individual variabilities that occur between electrophysiological tests and behavioural responses, recorded over time among a cohort of cochlear implantees. The statistical method shown may be used as a guideline to predict optimal behavioural levels in difficult situations among future implantees, bearing in mind that optimal M-levels for individuals can vary from predicted values. In 'Difficult to MAP' scenarios, following a protocol of sequential behavioural programming, in conjunction with electrophysiological correlates will provide the best outcomes.

  1. Quasi-Static Probabilistic Structural Analyses Process and Criteria

    NASA Technical Reports Server (NTRS)

    Goldberg, B.; Verderaime, V.

    1999-01-01

    Current deterministic structural methods are easily applied to substructures and components, and analysts have built great design insights and confidence in them over the years. However, deterministic methods cannot support systems risk analyses, and it was recently reported that deterministic treatment of statistical data is inconsistent with error propagation laws that can result in unevenly conservative structural predictions. Assuming non-nal distributions and using statistical data formats throughout prevailing stress deterministic processes lead to a safety factor in statistical format, which integrated into the safety index, provides a safety factor and first order reliability relationship. The embedded safety factor in the safety index expression allows a historically based risk to be determined and verified over a variety of quasi-static metallic substructures consistent with the traditional safety factor methods and NASA Std. 5001 criteria.

  2. A summary of wind power prediction methods

    NASA Astrophysics Data System (ADS)

    Wang, Yuqi

    2018-06-01

    The deterministic prediction of wind power, the probability prediction and the prediction of wind power ramp events are introduced in this paper. Deterministic prediction includes the prediction of statistical learning based on histor ical data and the prediction of physical models based on NWP data. Due to the great impact of wind power ramp events on the power system, this paper also introduces the prediction of wind power ramp events. At last, the evaluation indicators of all kinds of prediction are given. The prediction of wind power can be a good solution to the adverse effects of wind power on the power system due to the abrupt, intermittent and undulation of wind power.

  3. Selected Streamflow Statistics and Regression Equations for Predicting Statistics at Stream Locations in Monroe County, Pennsylvania

    USGS Publications Warehouse

    Thompson, Ronald E.; Hoffman, Scott A.

    2006-01-01

    A suite of 28 streamflow statistics, ranging from extreme low to high flows, was computed for 17 continuous-record streamflow-gaging stations and predicted for 20 partial-record stations in Monroe County and contiguous counties in north-eastern Pennsylvania. The predicted statistics for the partial-record stations were based on regression analyses relating inter-mittent flow measurements made at the partial-record stations indexed to concurrent daily mean flows at continuous-record stations during base-flow conditions. The same statistics also were predicted for 134 ungaged stream locations in Monroe County on the basis of regression analyses relating the statistics to GIS-determined basin characteristics for the continuous-record station drainage areas. The prediction methodology for developing the regression equations used to estimate statistics was developed for estimating low-flow frequencies. This study and a companion study found that the methodology also has application potential for predicting intermediate- and high-flow statistics. The statistics included mean monthly flows, mean annual flow, 7-day low flows for three recurrence intervals, nine flow durations, mean annual base flow, and annual mean base flows for two recurrence intervals. Low standard errors of prediction and high coefficients of determination (R2) indicated good results in using the regression equations to predict the statistics. Regression equations for the larger flow statistics tended to have lower standard errors of prediction and higher coefficients of determination (R2) than equations for the smaller flow statistics. The report discusses the methodologies used in determining the statistics and the limitations of the statistics and the equations used to predict the statistics. Caution is indicated in using the predicted statistics for small drainage area situations. Study results constitute input needed by water-resource managers in Monroe County for planning purposes and evaluation of water-resources availability.

  4. A novel multi-target regression framework for time-series prediction of drug efficacy.

    PubMed

    Li, Haiqing; Zhang, Wei; Chen, Ying; Guo, Yumeng; Li, Guo-Zheng; Zhu, Xiaoxin

    2017-01-18

    Excavating from small samples is a challenging pharmacokinetic problem, where statistical methods can be applied. Pharmacokinetic data is special due to the small samples of high dimensionality, which makes it difficult to adopt conventional methods to predict the efficacy of traditional Chinese medicine (TCM) prescription. The main purpose of our study is to obtain some knowledge of the correlation in TCM prescription. Here, a novel method named Multi-target Regression Framework to deal with the problem of efficacy prediction is proposed. We employ the correlation between the values of different time sequences and add predictive targets of previous time as features to predict the value of current time. Several experiments are conducted to test the validity of our method and the results of leave-one-out cross-validation clearly manifest the competitiveness of our framework. Compared with linear regression, artificial neural networks, and partial least squares, support vector regression combined with our framework demonstrates the best performance, and appears to be more suitable for this task.

  5. A novel multi-target regression framework for time-series prediction of drug efficacy

    PubMed Central

    Li, Haiqing; Zhang, Wei; Chen, Ying; Guo, Yumeng; Li, Guo-Zheng; Zhu, Xiaoxin

    2017-01-01

    Excavating from small samples is a challenging pharmacokinetic problem, where statistical methods can be applied. Pharmacokinetic data is special due to the small samples of high dimensionality, which makes it difficult to adopt conventional methods to predict the efficacy of traditional Chinese medicine (TCM) prescription. The main purpose of our study is to obtain some knowledge of the correlation in TCM prescription. Here, a novel method named Multi-target Regression Framework to deal with the problem of efficacy prediction is proposed. We employ the correlation between the values of different time sequences and add predictive targets of previous time as features to predict the value of current time. Several experiments are conducted to test the validity of our method and the results of leave-one-out cross-validation clearly manifest the competitiveness of our framework. Compared with linear regression, artificial neural networks, and partial least squares, support vector regression combined with our framework demonstrates the best performance, and appears to be more suitable for this task. PMID:28098186

  6. Forecasts of non-Gaussian parameter spaces using Box-Cox transformations

    NASA Astrophysics Data System (ADS)

    Joachimi, B.; Taylor, A. N.

    2011-09-01

    Forecasts of statistical constraints on model parameters using the Fisher matrix abound in many fields of astrophysics. The Fisher matrix formalism involves the assumption of Gaussianity in parameter space and hence fails to predict complex features of posterior probability distributions. Combining the standard Fisher matrix with Box-Cox transformations, we propose a novel method that accurately predicts arbitrary posterior shapes. The Box-Cox transformations are applied to parameter space to render it approximately multivariate Gaussian, performing the Fisher matrix calculation on the transformed parameters. We demonstrate that, after the Box-Cox parameters have been determined from an initial likelihood evaluation, the method correctly predicts changes in the posterior when varying various parameters of the experimental setup and the data analysis, with marginally higher computational cost than a standard Fisher matrix calculation. We apply the Box-Cox-Fisher formalism to forecast cosmological parameter constraints by future weak gravitational lensing surveys. The characteristic non-linear degeneracy between matter density parameter and normalization of matter density fluctuations is reproduced for several cases, and the capabilities of breaking this degeneracy by weak-lensing three-point statistics is investigated. Possible applications of Box-Cox transformations of posterior distributions are discussed, including the prospects for performing statistical data analysis steps in the transformed Gaussianized parameter space.

  7. Statistical variation in progressive scrambling

    NASA Astrophysics Data System (ADS)

    Clark, Robert D.; Fox, Peter C.

    2004-07-01

    The two methods most often used to evaluate the robustness and predictivity of partial least squares (PLS) models are cross-validation and response randomization. Both methods may be overly optimistic for data sets that contain redundant observations, however. The kinds of perturbation analysis widely used for evaluating model stability in the context of ordinary least squares regression are only applicable when the descriptors are independent of each other and errors are independent and normally distributed; neither assumption holds for QSAR in general and for PLS in particular. Progressive scrambling is a novel, non-parametric approach to perturbing models in the response space in a way that does not disturb the underlying covariance structure of the data. Here, we introduce adjustments for two of the characteristic values produced by a progressive scrambling analysis - the deprecated predictivity (Q_s^{ast^2}) and standard error of prediction (SDEP s * ) - that correct for the effect of introduced perturbation. We also explore the statistical behavior of the adjusted values (Q_0^{ast^2} and SDEP 0 * ) and the sensitivity to perturbation (d q 2/d r yy ' 2). It is shown that the three statistics are all robust for stable PLS models, in terms of the stochastic component of their determination and of their variation due to sampling effects involved in training set selection.

  8. Predictive Big Data Analytics: A Study of Parkinson's Disease Using Large, Complex, Heterogeneous, Incongruent, Multi-Source and Incomplete Observations.

    PubMed

    Dinov, Ivo D; Heavner, Ben; Tang, Ming; Glusman, Gustavo; Chard, Kyle; Darcy, Mike; Madduri, Ravi; Pa, Judy; Spino, Cathie; Kesselman, Carl; Foster, Ian; Deutsch, Eric W; Price, Nathan D; Van Horn, John D; Ames, Joseph; Clark, Kristi; Hood, Leroy; Hampstead, Benjamin M; Dauer, William; Toga, Arthur W

    2016-01-01

    A unique archive of Big Data on Parkinson's Disease is collected, managed and disseminated by the Parkinson's Progression Markers Initiative (PPMI). The integration of such complex and heterogeneous Big Data from multiple sources offers unparalleled opportunities to study the early stages of prevalent neurodegenerative processes, track their progression and quickly identify the efficacies of alternative treatments. Many previous human and animal studies have examined the relationship of Parkinson's disease (PD) risk to trauma, genetics, environment, co-morbidities, or life style. The defining characteristics of Big Data-large size, incongruency, incompleteness, complexity, multiplicity of scales, and heterogeneity of information-generating sources-all pose challenges to the classical techniques for data management, processing, visualization and interpretation. We propose, implement, test and validate complementary model-based and model-free approaches for PD classification and prediction. To explore PD risk using Big Data methodology, we jointly processed complex PPMI imaging, genetics, clinical and demographic data. Collective representation of the multi-source data facilitates the aggregation and harmonization of complex data elements. This enables joint modeling of the complete data, leading to the development of Big Data analytics, predictive synthesis, and statistical validation. Using heterogeneous PPMI data, we developed a comprehensive protocol for end-to-end data characterization, manipulation, processing, cleaning, analysis and validation. Specifically, we (i) introduce methods for rebalancing imbalanced cohorts, (ii) utilize a wide spectrum of classification methods to generate consistent and powerful phenotypic predictions, and (iii) generate reproducible machine-learning based classification that enables the reporting of model parameters and diagnostic forecasting based on new data. We evaluated several complementary model-based predictive approaches, which failed to generate accurate and reliable diagnostic predictions. However, the results of several machine-learning based classification methods indicated significant power to predict Parkinson's disease in the PPMI subjects (consistent accuracy, sensitivity, and specificity exceeding 96%, confirmed using statistical n-fold cross-validation). Clinical (e.g., Unified Parkinson's Disease Rating Scale (UPDRS) scores), demographic (e.g., age), genetics (e.g., rs34637584, chr12), and derived neuroimaging biomarker (e.g., cerebellum shape index) data all contributed to the predictive analytics and diagnostic forecasting. Model-free Big Data machine learning-based classification methods (e.g., adaptive boosting, support vector machines) can outperform model-based techniques in terms of predictive precision and reliability (e.g., forecasting patient diagnosis). We observed that statistical rebalancing of cohort sizes yields better discrimination of group differences, specifically for predictive analytics based on heterogeneous and incomplete PPMI data. UPDRS scores play a critical role in predicting diagnosis, which is expected based on the clinical definition of Parkinson's disease. Even without longitudinal UPDRS data, however, the accuracy of model-free machine learning based classification is over 80%. The methods, software and protocols developed here are openly shared and can be employed to study other neurodegenerative disorders (e.g., Alzheimer's, Huntington's, amyotrophic lateral sclerosis), as well as for other predictive Big Data analytics applications.

  9. Bayesian model averaging using particle filtering and Gaussian mixture modeling: Theory, concepts, and simulation experiments

    NASA Astrophysics Data System (ADS)

    Rings, Joerg; Vrugt, Jasper A.; Schoups, Gerrit; Huisman, Johan A.; Vereecken, Harry

    2012-05-01

    Bayesian model averaging (BMA) is a standard method for combining predictive distributions from different models. In recent years, this method has enjoyed widespread application and use in many fields of study to improve the spread-skill relationship of forecast ensembles. The BMA predictive probability density function (pdf) of any quantity of interest is a weighted average of pdfs centered around the individual (possibly bias-corrected) forecasts, where the weights are equal to posterior probabilities of the models generating the forecasts, and reflect the individual models skill over a training (calibration) period. The original BMA approach presented by Raftery et al. (2005) assumes that the conditional pdf of each individual model is adequately described with a rather standard Gaussian or Gamma statistical distribution, possibly with a heteroscedastic variance. Here we analyze the advantages of using BMA with a flexible representation of the conditional pdf. A joint particle filtering and Gaussian mixture modeling framework is presented to derive analytically, as closely and consistently as possible, the evolving forecast density (conditional pdf) of each constituent ensemble member. The median forecasts and evolving conditional pdfs of the constituent models are subsequently combined using BMA to derive one overall predictive distribution. This paper introduces the theory and concepts of this new ensemble postprocessing method, and demonstrates its usefulness and applicability by numerical simulation of the rainfall-runoff transformation using discharge data from three different catchments in the contiguous United States. The revised BMA method receives significantly lower-prediction errors than the original default BMA method (due to filtering) with predictive uncertainty intervals that are substantially smaller but still statistically coherent (due to the use of a time-variant conditional pdf).

  10. Evaluation of Deep Learning Representations of Spatial Storm Data

    NASA Astrophysics Data System (ADS)

    Gagne, D. J., II; Haupt, S. E.; Nychka, D. W.

    2017-12-01

    The spatial structure of a severe thunderstorm and its surrounding environment provide useful information about the potential for severe weather hazards, including tornadoes, hail, and high winds. Statistics computed over the area of a storm or from the pre-storm environment can provide descriptive information but fail to capture structural information. Because the storm environment is a complex, high-dimensional space, identifying methods to encode important spatial storm information in a low-dimensional form should aid analysis and prediction of storms by statistical and machine learning models. Principal component analysis (PCA), a more traditional approach, transforms high-dimensional data into a set of linearly uncorrelated, orthogonal components ordered by the amount of variance explained by each component. The burgeoning field of deep learning offers two potential approaches to this problem. Convolutional Neural Networks are a supervised learning method for transforming spatial data into a hierarchical set of feature maps that correspond with relevant combinations of spatial structures in the data. Generative Adversarial Networks (GANs) are an unsupervised deep learning model that uses two neural networks trained against each other to produce encoded representations of spatial data. These different spatial encoding methods were evaluated on the prediction of severe hail for a large set of storm patches extracted from the NCAR convection-allowing ensemble. Each storm patch contains information about storm structure and the near-storm environment. Logistic regression and random forest models were trained using the PCA and GAN encodings of the storm data and were compared against the predictions from a convolutional neural network. All methods showed skill over climatology at predicting the probability of severe hail. However, the verification scores among the methods were very similar and the predictions were highly correlated. Further evaluations are being performed to determine how the choice of input variables affects the results.

  11. Wheat mill stream properties for discrete element method modeling

    USDA-ARS?s Scientific Manuscript database

    A discrete phase approach based on individual wheat kernel characteristics is needed to overcome the limitations of previous statistical models and accurately predict the milling behavior of wheat. As a first step to develop a discrete element method (DEM) model for the wheat milling process, this s...

  12. Statistical Tools And Artificial Intelligence Approaches To Predict Fracture In Bulk Forming Processes

    NASA Astrophysics Data System (ADS)

    Di Lorenzo, R.; Ingarao, G.; Fonti, V.

    2007-05-01

    The crucial task in the prevention of ductile fracture is the availability of a tool for the prediction of such defect occurrence. The technical literature presents a wide investigation on this topic and many contributions have been given by many authors following different approaches. The main class of approaches regards the development of fracture criteria: generally, such criteria are expressed by determining a critical value of a damage function which depends on stress and strain paths: ductile fracture is assumed to occur when such critical value is reached during the analysed process. There is a relevant drawback related to the utilization of ductile fracture criteria; in fact each criterion usually has good performances in the prediction of fracture for particular stress - strain paths, i.e. it works very well for certain processes but may provide no good results for other processes. On the other hand, the approaches based on damage mechanics formulation are very effective from a theoretical point of view but they are very complex and their proper calibration is quite difficult. In this paper, two different approaches are investigated to predict fracture occurrence in cold forming operations. The final aim of the proposed method is the achievement of a tool which has a general reliability i.e. it is able to predict fracture for different forming processes. The proposed approach represents a step forward within a research project focused on the utilization of innovative predictive tools for ductile fracture. The paper presents a comparison between an artificial neural network design procedure and an approach based on statistical tools; both the approaches were aimed to predict fracture occurrence/absence basing on a set of stress and strain paths data. The proposed approach is based on the utilization of experimental data available, for a given material, on fracture occurrence in different processes. More in detail, the approach consists in the analysis of experimental tests in which fracture occurs followed by the numerical simulations of such processes in order to track the stress-strain paths in the workpiece region where fracture is expected. Such data are utilized to build up a proper data set which was utilized both to train an artificial neural network and to perform a statistical analysis aimed to predict fracture occurrence. The developed statistical tool is properly designed and optimized and is able to recognize the fracture occurrence. The reliability and predictive capability of the statistical method were compared with the ones obtained from an artificial neural network developed to predict fracture occurrence. Moreover, the approach is validated also in forming processes characterized by a complex fracture mechanics.

  13. Predicting Statistical Response and Extreme Events in Uncertainty Quantification through Reduced-Order Models

    NASA Astrophysics Data System (ADS)

    Qi, D.; Majda, A.

    2017-12-01

    A low-dimensional reduced-order statistical closure model is developed for quantifying the uncertainty in statistical sensitivity and intermittency in principal model directions with largest variability in high-dimensional turbulent system and turbulent transport models. Imperfect model sensitivity is improved through a recent mathematical strategy for calibrating model errors in a training phase, where information theory and linear statistical response theory are combined in a systematic fashion to achieve the optimal model performance. The idea in the reduced-order method is from a self-consistent mathematical framework for general systems with quadratic nonlinearity, where crucial high-order statistics are approximated by a systematic model calibration procedure. Model efficiency is improved through additional damping and noise corrections to replace the expensive energy-conserving nonlinear interactions. Model errors due to the imperfect nonlinear approximation are corrected by tuning the model parameters using linear response theory with an information metric in a training phase before prediction. A statistical energy principle is adopted to introduce a global scaling factor in characterizing the higher-order moments in a consistent way to improve model sensitivity. Stringent models of barotropic and baroclinic turbulence are used to display the feasibility of the reduced-order methods. Principal statistical responses in mean and variance can be captured by the reduced-order models with accuracy and efficiency. Besides, the reduced-order models are also used to capture crucial passive tracer field that is advected by the baroclinic turbulent flow. It is demonstrated that crucial principal statistical quantities like the tracer spectrum and fat-tails in the tracer probability density functions in the most important large scales can be captured efficiently with accuracy using the reduced-order tracer model in various dynamical regimes of the flow field with distinct statistical structures.

  14. Statistical physics in foreign exchange currency and stock markets

    NASA Astrophysics Data System (ADS)

    Ausloos, M.

    2000-09-01

    Problems in economy and finance have attracted the interest of statistical physicists all over the world. Fundamental problems pertain to the existence or not of long-, medium- or/and short-range power-law correlations in various economic systems, to the presence of financial cycles and on economic considerations, including economic policy. A method like the detrended fluctuation analysis is recalled emphasizing its value in sorting out correlation ranges, thereby leading to predictability at short horizon. The ( m, k)-Zipf method is presented for sorting out short-range correlations in the sign and amplitude of the fluctuations. A well-known financial analysis technique, the so-called moving average, is shown to raise questions to physicists about fractional Brownian motion properties. Among spectacular results, the possibility of crash predictions has been demonstrated through the log-periodicity of financial index oscillations.

  15. Revealing how network structure affects accuracy of link prediction

    NASA Astrophysics Data System (ADS)

    Yang, Jin-Xuan; Zhang, Xiao-Dong

    2017-08-01

    Link prediction plays an important role in network reconstruction and network evolution. The network structure affects the accuracy of link prediction, which is an interesting problem. In this paper we use common neighbors and the Gini coefficient to reveal the relation between them, which can provide a good reference for the choice of a suitable link prediction algorithm according to the network structure. Moreover, the statistical analysis reveals correlation between the common neighbors index, Gini coefficient index and other indices to describe the network structure, such as Laplacian eigenvalues, clustering coefficient, degree heterogeneity, and assortativity of network. Furthermore, a new method to predict missing links is proposed. The experimental results show that the proposed algorithm yields better prediction accuracy and robustness to the network structure than existing currently used methods for a variety of real-world networks.

  16. Line identification studies using traditional techniques and wavelength coincidence statistics

    NASA Technical Reports Server (NTRS)

    Cowley, Charles R.; Adelman, Saul J.

    1990-01-01

    Traditional line identification techniques result in the assignment of individual lines to an atomic or ionic species. These methods may be supplemented by wavelength coincidence statistics (WCS). The strength and weakness of these methods are discussed using spectra of a number of normal and peculiar B and A stars that have been studied independently by both methods. The present results support the overall findings of some earlier studies. WCS would be most useful in a first survey, before traditional methods have been applied. WCS can quickly make a global search for all species and in this way may enable identifications of an unexpected spectrum that could easily be omitted entirely from a traditional study. This is illustrated by O I. WCS is a subject to well known weakness of any statistical technique, for example, a predictable number of spurious results are to be expected. The danger of small number statistics are illustrated. WCS is at its best relative to traditional methods in finding a line-rich atomic species that is only weakly present in a complicated stellar spectrum.

  17. aPPRove: An HMM-Based Method for Accurate Prediction of RNA-Pentatricopeptide Repeat Protein Binding Events

    PubMed Central

    Harrison, Thomas; Ruiz, Jaime; Sloan, Daniel B.; Ben-Hur, Asa; Boucher, Christina

    2016-01-01

    Pentatricopeptide repeat containing proteins (PPRs) bind to RNA transcripts originating from mitochondria and plastids. There are two classes of PPR proteins. The P class contains tandem P-type motif sequences, and the PLS class contains alternating P, L and S type sequences. In this paper, we describe a novel tool that predicts PPR-RNA interaction; specifically, our method, which we call aPPRove, determines where and how a PLS-class PPR protein will bind to RNA when given a PPR and one or more RNA transcripts by using a combinatorial binding code for site specificity proposed by Barkan et al. Our results demonstrate that aPPRove successfully locates how and where a PPR protein belonging to the PLS class can bind to RNA. For each binding event it outputs the binding site, the amino-acid-nucleotide interaction, and its statistical significance. Furthermore, we show that our method can be used to predict binding events for PLS-class proteins using a known edit site and the statistical significance of aligning the PPR protein to that site. In particular, we use our method to make a conjecture regarding an interaction between CLB19 and the second intronic region of ycf3. The aPPRove web server can be found at www.cs.colostate.edu/~approve. PMID:27560805

  18. Nonclassical acoustics

    NASA Technical Reports Server (NTRS)

    Kentzer, C. P.

    1976-01-01

    A statistical approach to sound propagation is considered in situations where, due to the presence of large gradients of properties of the medium, the classical (deterministic) treatment of wave motion is inadequate. Mathematical methods for wave motions not restricted to small wavelengths (analogous to known methods of quantum mechanics) are used to formulate a wave theory of sound in nonuniform flows. Nonlinear transport equations for field probabilities are derived for the limiting case of noninteracting sound waves and it is postulated that such transport equations, appropriately generalized, may be used to predict the statistical behavior of sound in arbitrary flows.

  19. Molecular activity prediction by means of supervised subspace projection based ensembles of classifiers.

    PubMed

    Cerruela García, G; García-Pedrajas, N; Luque Ruiz, I; Gómez-Nieto, M Á

    2018-03-01

    This paper proposes a method for molecular activity prediction in QSAR studies using ensembles of classifiers constructed by means of two supervised subspace projection methods, namely nonparametric discriminant analysis (NDA) and hybrid discriminant analysis (HDA). We studied the performance of the proposed ensembles compared to classical ensemble methods using four molecular datasets and eight different models for the representation of the molecular structure. Using several measures and statistical tests for classifier comparison, we observe that our proposal improves the classification results with respect to classical ensemble methods. Therefore, we show that ensembles constructed using supervised subspace projections offer an effective way of creating classifiers in cheminformatics.

  20. Comparison of in silico models for prediction of mutagenicity.

    PubMed

    Bakhtyari, Nazanin G; Raitano, Giuseppa; Benfenati, Emilio; Martin, Todd; Young, Douglas

    2013-01-01

    Using a dataset with more than 6000 compounds, the performance of eight quantitative structure activity relationships (QSAR) models was evaluated: ACD/Tox Suite, Absorption, Distribution, Metabolism, Elimination, and Toxicity of chemical substances (ADMET) predictor, Derek, Toxicity Estimation Software Tool (T.E.S.T.), TOxicity Prediction by Komputer Assisted Technology (TOPKAT), Toxtree, CEASAR, and SARpy (SAR in python). In general, the results showed a high level of performance. To have a realistic estimate of the predictive ability, the results for chemicals inside and outside the training set for each model were considered. The effect of applicability domain tools (when available) on the prediction accuracy was also evaluated. The predictive tools included QSAR models, knowledge-based systems, and a combination of both methods. Models based on statistical QSAR methods gave better results.

  1. Predictive Big Data Analytics: A Study of Parkinson’s Disease Using Large, Complex, Heterogeneous, Incongruent, Multi-Source and Incomplete Observations

    PubMed Central

    Dinov, Ivo D.; Heavner, Ben; Tang, Ming; Glusman, Gustavo; Chard, Kyle; Darcy, Mike; Madduri, Ravi; Pa, Judy; Spino, Cathie; Kesselman, Carl; Foster, Ian; Deutsch, Eric W.; Price, Nathan D.; Van Horn, John D.; Ames, Joseph; Clark, Kristi; Hood, Leroy; Hampstead, Benjamin M.; Dauer, William; Toga, Arthur W.

    2016-01-01

    Background A unique archive of Big Data on Parkinson’s Disease is collected, managed and disseminated by the Parkinson’s Progression Markers Initiative (PPMI). The integration of such complex and heterogeneous Big Data from multiple sources offers unparalleled opportunities to study the early stages of prevalent neurodegenerative processes, track their progression and quickly identify the efficacies of alternative treatments. Many previous human and animal studies have examined the relationship of Parkinson’s disease (PD) risk to trauma, genetics, environment, co-morbidities, or life style. The defining characteristics of Big Data–large size, incongruency, incompleteness, complexity, multiplicity of scales, and heterogeneity of information-generating sources–all pose challenges to the classical techniques for data management, processing, visualization and interpretation. We propose, implement, test and validate complementary model-based and model-free approaches for PD classification and prediction. To explore PD risk using Big Data methodology, we jointly processed complex PPMI imaging, genetics, clinical and demographic data. Methods and Findings Collective representation of the multi-source data facilitates the aggregation and harmonization of complex data elements. This enables joint modeling of the complete data, leading to the development of Big Data analytics, predictive synthesis, and statistical validation. Using heterogeneous PPMI data, we developed a comprehensive protocol for end-to-end data characterization, manipulation, processing, cleaning, analysis and validation. Specifically, we (i) introduce methods for rebalancing imbalanced cohorts, (ii) utilize a wide spectrum of classification methods to generate consistent and powerful phenotypic predictions, and (iii) generate reproducible machine-learning based classification that enables the reporting of model parameters and diagnostic forecasting based on new data. We evaluated several complementary model-based predictive approaches, which failed to generate accurate and reliable diagnostic predictions. However, the results of several machine-learning based classification methods indicated significant power to predict Parkinson’s disease in the PPMI subjects (consistent accuracy, sensitivity, and specificity exceeding 96%, confirmed using statistical n-fold cross-validation). Clinical (e.g., Unified Parkinson's Disease Rating Scale (UPDRS) scores), demographic (e.g., age), genetics (e.g., rs34637584, chr12), and derived neuroimaging biomarker (e.g., cerebellum shape index) data all contributed to the predictive analytics and diagnostic forecasting. Conclusions Model-free Big Data machine learning-based classification methods (e.g., adaptive boosting, support vector machines) can outperform model-based techniques in terms of predictive precision and reliability (e.g., forecasting patient diagnosis). We observed that statistical rebalancing of cohort sizes yields better discrimination of group differences, specifically for predictive analytics based on heterogeneous and incomplete PPMI data. UPDRS scores play a critical role in predicting diagnosis, which is expected based on the clinical definition of Parkinson’s disease. Even without longitudinal UPDRS data, however, the accuracy of model-free machine learning based classification is over 80%. The methods, software and protocols developed here are openly shared and can be employed to study other neurodegenerative disorders (e.g., Alzheimer’s, Huntington’s, amyotrophic lateral sclerosis), as well as for other predictive Big Data analytics applications. PMID:27494614

  2. Weighted Feature Significance: A Simple, Interpretable Model of Compound Toxicity Based on the Statistical Enrichment of Structural Features

    PubMed Central

    Huang, Ruili; Southall, Noel; Xia, Menghang; Cho, Ming-Hsuang; Jadhav, Ajit; Nguyen, Dac-Trung; Inglese, James; Tice, Raymond R.; Austin, Christopher P.

    2009-01-01

    In support of the U.S. Tox21 program, we have developed a simple and chemically intuitive model we call weighted feature significance (WFS) to predict the toxicological activity of compounds, based on the statistical enrichment of structural features in toxic compounds. We trained and tested the model on the following: (1) data from quantitative high–throughput screening cytotoxicity and caspase activation assays conducted at the National Institutes of Health Chemical Genomics Center, (2) data from Salmonella typhimurium reverse mutagenicity assays conducted by the U.S. National Toxicology Program, and (3) hepatotoxicity data published in the Registry of Toxic Effects of Chemical Substances. Enrichments of structural features in toxic compounds are evaluated for their statistical significance and compiled into a simple additive model of toxicity and then used to score new compounds for potential toxicity. The predictive power of the model for cytotoxicity was validated using an independent set of compounds from the U.S. Environmental Protection Agency tested also at the National Institutes of Health Chemical Genomics Center. We compared the performance of our WFS approach with classical classification methods such as Naive Bayesian clustering and support vector machines. In most test cases, WFS showed similar or slightly better predictive power, especially in the prediction of hepatotoxic compounds, where WFS appeared to have the best performance among the three methods. The new algorithm has the important advantages of simplicity, power, interpretability, and ease of implementation. PMID:19805409

  3. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lewis, John R.; Brooks, Dusty Marie

    In pressurized water reactors, the prevention, detection, and repair of cracks within dissimilar metal welds is essential to ensure proper plant functionality and safety. Weld residual stresses, which are difficult to model and cannot be directly measured, contribute to the formation and growth of cracks due to primary water stress corrosion cracking. Additionally, the uncertainty in weld residual stress measurements and modeling predictions is not well understood, further complicating the prediction of crack evolution. The purpose of this document is to develop methodology to quantify the uncertainty associated with weld residual stress that can be applied to modeling predictions andmore » experimental measurements. Ultimately, the results can be used to assess the current state of uncertainty and to build confidence in both modeling and experimental procedures. The methodology consists of statistically modeling the variation in the weld residual stress profiles using functional data analysis techniques. Uncertainty is quantified using statistical bounds (e.g. confidence and tolerance bounds) constructed with a semi-parametric bootstrap procedure. Such bounds describe the range in which quantities of interest, such as means, are expected to lie as evidenced by the data. The methodology is extended to provide direct comparisons between experimental measurements and modeling predictions by constructing statistical confidence bounds for the average difference between the two quantities. The statistical bounds on the average difference can be used to assess the level of agreement between measurements and predictions. The methodology is applied to experimental measurements of residual stress obtained using two strain relief measurement methods and predictions from seven finite element models developed by different organizations during a round robin study.« less

  4. Comparison of two statistical methods for probability prediction of monthly precipitation during summer over Huaihe River Basin in China, and applications in runoff prediction based on hydrological model

    NASA Astrophysics Data System (ADS)

    Liu, L.; Du, L.; Liao, Y.

    2017-12-01

    Based on the ensemble hindcast dataset of CSM1.1m by NCC, CMA, Bayesian merging models and a two-step statistical model are developed and employed to predict monthly grid/station precipitation in the Huaihe River China during summer at the lead-time of 1 to 3 months. The hindcast datasets span a period of 1991 to 2014. The skill of the two models is evaluated using area under the ROC curve (AUC) in a leave-one-out cross-validation framework, and is compared to the skill of CSM1.1m. CSM1.1m has highest skill for summer precipitation from April while lowest from May, and has highest skill for precipitation in June but lowest for precipitation in July. Compared with raw outputs of climate models, some schemes of the two approaches have higher skill for the prediction from March and May, but almost schemes have lower skill for prediction from April. Compared to two-step approach, one sampling scheme of Bayesian merging approach has higher skill for the prediction from March, but has lower skill from May. The results suggest that there is potential to apply the two statistical models for monthly precipitation forecast in summer from March and from May over Huaihe River basin, but is potential to apply CSM1.1m forecast from April. Finally, the summer runoff during 1991 to 2014 is simulated based on one hydrological model using the climate hindcast of CSM1.1m and the two statistical models.

  5. Study of Uncertainties of Predicting Space Shuttle Thermal Environment. [impact of heating rate prediction errors on weight of thermal protection system

    NASA Technical Reports Server (NTRS)

    Fehrman, A. L.; Masek, R. V.

    1972-01-01

    Quantitative estimates of the uncertainty in predicting aerodynamic heating rates for a fully reusable space shuttle system are developed and the impact of these uncertainties on Thermal Protection System (TPS) weight are discussed. The study approach consisted of statistical evaluations of the scatter of heating data on shuttle configurations about state-of-the-art heating prediction methods to define the uncertainty in these heating predictions. The uncertainties were then applied as heating rate increments to the nominal predicted heating rate to define the uncertainty in TPS weight. Separate evaluations were made for the booster and orbiter, for trajectories which included boost through reentry and touchdown. For purposes of analysis, the vehicle configuration is divided into areas in which a given prediction method is expected to apply, and separate uncertainty factors and corresponding uncertainty in TPS weight derived for each area.

  6. Accurate low-cost methods for performance evaluation of cache memory systems

    NASA Technical Reports Server (NTRS)

    Laha, Subhasis; Patel, Janak H.; Iyer, Ravishankar K.

    1988-01-01

    Methods of simulation based on statistical techniques are proposed to decrease the need for large trace measurements and for predicting true program behavior. Sampling techniques are applied while the address trace is collected from a workload. This drastically reduces the space and time needed to collect the trace. Simulation techniques are developed to use the sampled data not only to predict the mean miss rate of the cache, but also to provide an empirical estimate of its actual distribution. Finally, a concept of primed cache is introduced to simulate large caches by the sampling-based method.

  7. Statistical Analysis of Complexity Generators for Cost Estimation

    NASA Technical Reports Server (NTRS)

    Rowell, Ginger Holmes

    1999-01-01

    Predicting the cost of cutting edge new technologies involved with spacecraft hardware can be quite complicated. A new feature of the NASA Air Force Cost Model (NAFCOM), called the Complexity Generator, is being developed to model the complexity factors that drive the cost of space hardware. This parametric approach is also designed to account for the differences in cost, based on factors that are unique to each system and subsystem. The cost driver categories included in this model are weight, inheritance from previous missions, technical complexity, and management factors. This paper explains the Complexity Generator framework, the statistical methods used to select the best model within this framework, and the procedures used to find the region of predictability and the prediction intervals for the cost of a mission.

  8. XenoSite: accurately predicting CYP-mediated sites of metabolism with neural networks.

    PubMed

    Zaretzki, Jed; Matlock, Matthew; Swamidass, S Joshua

    2013-12-23

    Understanding how xenobiotic molecules are metabolized is important because it influences the safety, efficacy, and dose of medicines and how they can be modified to improve these properties. The cytochrome P450s (CYPs) are proteins responsible for metabolizing 90% of drugs on the market, and many computational methods can predict which atomic sites of a molecule--sites of metabolism (SOMs)--are modified during CYP-mediated metabolism. This study improves on prior methods of predicting CYP-mediated SOMs by using new descriptors and machine learning based on neural networks. The new method, XenoSite, is faster to train and more accurate by as much as 4% or 5% for some isozymes. Furthermore, some "incorrect" predictions made by XenoSite were subsequently validated as correct predictions by revaluation of the source literature. Moreover, XenoSite output is interpretable as a probability, which reflects both the confidence of the model that a particular atom is metabolized and the statistical likelihood that its prediction for that atom is correct.

  9. Rapid prediction of chemical metabolism by human UDP-glucuronosyltransferase isoforms using quantum chemical descriptors derived with the electronegativity equalization method.

    PubMed

    Sorich, Michael J; McKinnon, Ross A; Miners, John O; Winkler, David A; Smith, Paul A

    2004-10-07

    This study aimed to evaluate in silico models based on quantum chemical (QC) descriptors derived using the electronegativity equalization method (EEM) and to assess the use of QC properties to predict chemical metabolism by human UDP-glucuronosyltransferase (UGT) isoforms. Various EEM-derived QC molecular descriptors were calculated for known UGT substrates and nonsubstrates. Classification models were developed using support vector machine and partial least squares discriminant analysis. In general, the most predictive models were generated with the support vector machine. Combining QC and 2D descriptors (from previous work) using a consensus approach resulted in a statistically significant improvement in predictivity (to 84%) over both the QC and 2D models and the other methods of combining the descriptors. EEM-derived QC descriptors were shown to be both highly predictive and computationally efficient. It is likely that EEM-derived QC properties will be generally useful for predicting ADMET and physicochemical properties during drug discovery.

  10. Quasi-closed phase forward-backward linear prediction analysis of speech for accurate formant detection and estimation.

    PubMed

    Gowda, Dhananjaya; Airaksinen, Manu; Alku, Paavo

    2017-09-01

    Recently, a quasi-closed phase (QCP) analysis of speech signals for accurate glottal inverse filtering was proposed. However, the QCP analysis which belongs to the family of temporally weighted linear prediction (WLP) methods uses the conventional forward type of sample prediction. This may not be the best choice especially in computing WLP models with a hard-limiting weighting function. A sample selective minimization of the prediction error in WLP reduces the effective number of samples available within a given window frame. To counter this problem, a modified quasi-closed phase forward-backward (QCP-FB) analysis is proposed, wherein each sample is predicted based on its past as well as future samples thereby utilizing the available number of samples more effectively. Formant detection and estimation experiments on synthetic vowels generated using a physical modeling approach as well as natural speech utterances show that the proposed QCP-FB method yields statistically significant improvements over the conventional linear prediction and QCP methods.

  11. Accurate prediction of RNA-binding protein residues with two discriminative structural descriptors.

    PubMed

    Sun, Meijian; Wang, Xia; Zou, Chuanxin; He, Zenghui; Liu, Wei; Li, Honglin

    2016-06-07

    RNA-binding proteins participate in many important biological processes concerning RNA-mediated gene regulation, and several computational methods have been recently developed to predict the protein-RNA interactions of RNA-binding proteins. Newly developed discriminative descriptors will help to improve the prediction accuracy of these prediction methods and provide further meaningful information for researchers. In this work, we designed two structural features (residue electrostatic surface potential and triplet interface propensity) and according to the statistical and structural analysis of protein-RNA complexes, the two features were powerful for identifying RNA-binding protein residues. Using these two features and other excellent structure- and sequence-based features, a random forest classifier was constructed to predict RNA-binding residues. The area under the receiver operating characteristic curve (AUC) of five-fold cross-validation for our method on training set RBP195 was 0.900, and when applied to the test set RBP68, the prediction accuracy (ACC) was 0.868, and the F-score was 0.631. The good prediction performance of our method revealed that the two newly designed descriptors could be discriminative for inferring protein residues interacting with RNAs. To facilitate the use of our method, a web-server called RNAProSite, which implements the proposed method, was constructed and is freely available at http://lilab.ecust.edu.cn/NABind .

  12. Building blocks for automated elucidation of metabolites: machine learning methods for NMR prediction.

    PubMed

    Kuhn, Stefan; Egert, Björn; Neumann, Steffen; Steinbeck, Christoph

    2008-09-25

    Current efforts in Metabolomics, such as the Human Metabolome Project, collect structures of biological metabolites as well as data for their characterisation, such as spectra for identification of substances and measurements of their concentration. Still, only a fraction of existing metabolites and their spectral fingerprints are known. Computer-Assisted Structure Elucidation (CASE) of biological metabolites will be an important tool to leverage this lack of knowledge. Indispensable for CASE are modules to predict spectra for hypothetical structures. This paper evaluates different statistical and machine learning methods to perform predictions of proton NMR spectra based on data from our open database NMRShiftDB. A mean absolute error of 0.18 ppm was achieved for the prediction of proton NMR shifts ranging from 0 to 11 ppm. Random forest, J48 decision tree and support vector machines achieved similar overall errors. HOSE codes being a notably simple method achieved a comparatively good result of 0.17 ppm mean absolute error. NMR prediction methods applied in the course of this work delivered precise predictions which can serve as a building block for Computer-Assisted Structure Elucidation for biological metabolites.

  13. Generalized Polynomial Chaos Based Uncertainty Quantification for Planning MRgLITT Procedures

    PubMed Central

    Fahrenholtz, S.; Stafford, R. J.; Maier, F.; Hazle, J. D.; Fuentes, D.

    2014-01-01

    Purpose A generalized polynomial chaos (gPC) method is used to incorporate constitutive parameter uncertainties within the Pennes representation of bioheat transfer phenomena. The stochastic temperature predictions of the mathematical model are critically evaluated against MR thermometry data for planning MR-guided Laser Induced Thermal Therapies (MRgLITT). Methods Pennes bioheat transfer model coupled with a diffusion theory approximation of laser tissue interaction was implemented as the underlying deterministic kernel. A probabilistic sensitivity study was used to identify parameters that provide the most variance in temperature output. Confidence intervals of the temperature predictions are compared to MR temperature imaging (MRTI) obtained during phantom and in vivo canine (n=4) MRgLITT experiments. The gPC predictions were quantitatively compared to MRTI data using probabilistic linear and temporal profiles as well as 2-D 60 °C isotherms. Results Within the range of physically meaningful constitutive values relevant to the ablative temperature regime of MRgLITT, the sensitivity study indicated that the optical parameters, particularly the anisotropy factor, created the most variance in the stochastic model's output temperature prediction. Further, within the statistical sense considered, a nonlinear model of the temperature and damage dependent perfusion, absorption, and scattering is captured within the confidence intervals of the linear gPC method. Multivariate stochastic model predictions using parameters with the dominant sensitivities show good agreement with experimental MRTI data. Conclusions Given parameter uncertainties and mathematical modeling approximations of the Pennes bioheat model, the statistical framework demonstrates conservative estimates of the therapeutic heating and has potential for use as a computational prediction tool for thermal therapy planning. PMID:23692295

  14. Probabilistic Analysis of Aircraft Gas Turbine Disk Life and Reliability

    NASA Technical Reports Server (NTRS)

    Melis, Matthew E.; Zaretsky, Erwin V.; August, Richard

    1999-01-01

    Two series of low cycle fatigue (LCF) test data for two groups of different aircraft gas turbine engine compressor disk geometries were reanalyzed and compared using Weibull statistics. Both groups of disks were manufactured from titanium (Ti-6Al-4V) alloy. A NASA Glenn Research Center developed probabilistic computer code Probable Cause was used to predict disk life and reliability. A material-life factor A was determined for titanium (Ti-6Al-4V) alloy based upon fatigue disk data and successfully applied to predict the life of the disks as a function of speed. A comparison was made with the currently used life prediction method based upon crack growth rate. Applying an endurance limit to the computer code did not significantly affect the predicted lives under engine operating conditions. Failure location prediction correlates with those experimentally observed in the LCF tests. A reasonable correlation was obtained between the predicted disk lives using the Probable Cause code and a modified crack growth method for life prediction. Both methods slightly overpredict life for one disk group and significantly under predict it for the other.

  15. Truly random number generation: an example

    NASA Astrophysics Data System (ADS)

    Frauchiger, Daniela; Renner, Renato

    2013-10-01

    Randomness is crucial for a variety of applications, ranging from gambling to computer simulations, and from cryptography to statistics. However, many of the currently used methods for generating randomness do not meet the criteria that are necessary for these applications to work properly and safely. A common problem is that a sequence of numbers may look random but nevertheless not be truly random. In fact, the sequence may pass all standard statistical tests and yet be perfectly predictable. This renders it useless for many applications. For example, in cryptography, the predictability of a "andomly" chosen password is obviously undesirable. Here, we review a recently developed approach to generating true | and hence unpredictable | randomness.

  16. Clustering gene expression data based on predicted differential effects of GV interaction.

    PubMed

    Pan, Hai-Yan; Zhu, Jun; Han, Dan-Fu

    2005-02-01

    Microarray has become a popular biotechnology in biological and medical research. However, systematic and stochastic variabilities in microarray data are expected and unavoidable, resulting in the problem that the raw measurements have inherent "noise" within microarray experiments. Currently, logarithmic ratios are usually analyzed by various clustering methods directly, which may introduce bias interpretation in identifying groups of genes or samples. In this paper, a statistical method based on mixed model approaches was proposed for microarray data cluster analysis. The underlying rationale of this method is to partition the observed total gene expression level into various variations caused by different factors using an ANOVA model, and to predict the differential effects of GV (gene by variety) interaction using the adjusted unbiased prediction (AUP) method. The predicted GV interaction effects can then be used as the inputs of cluster analysis. We illustrated the application of our method with a gene expression dataset and elucidated the utility of our approach using an external validation.

  17. Link Prediction in Evolving Networks Based on Popularity of Nodes.

    PubMed

    Wang, Tong; He, Xing-Sheng; Zhou, Ming-Yang; Fu, Zhong-Qian

    2017-08-02

    Link prediction aims to uncover the underlying relationship behind networks, which could be utilized to predict missing edges or identify the spurious edges. The key issue of link prediction is to estimate the likelihood of potential links in networks. Most classical static-structure based methods ignore the temporal aspects of networks, limited by the time-varying features, such approaches perform poorly in evolving networks. In this paper, we propose a hypothesis that the ability of each node to attract links depends not only on its structural importance, but also on its current popularity (activeness), since active nodes have much more probability to attract future links. Then a novel approach named popularity based structural perturbation method (PBSPM) and its fast algorithm are proposed to characterize the likelihood of an edge from both existing connectivity structure and current popularity of its two endpoints. Experiments on six evolving networks show that the proposed methods outperform state-of-the-art methods in accuracy and robustness. Besides, visual results and statistical analysis reveal that the proposed methods are inclined to predict future edges between active nodes, rather than edges between inactive nodes.

  18. Recent development of risk-prediction models for incident hypertension: An updated systematic review

    PubMed Central

    Xiao, Lei; Liu, Ya; Wang, Zuoguang; Li, Chuang; Jin, Yongxin; Zhao, Qiong

    2017-01-01

    Background Hypertension is a leading global health threat and a major cardiovascular disease. Since clinical interventions are effective in delaying the disease progression from prehypertension to hypertension, diagnostic prediction models to identify patient populations at high risk for hypertension are imperative. Methods Both PubMed and Embase databases were searched for eligible reports of either prediction models or risk scores of hypertension. The study data were collected, including risk factors, statistic methods, characteristics of study design and participants, performance measurement, etc. Results From the searched literature, 26 studies reporting 48 prediction models were selected. Among them, 20 reports studied the established models using traditional risk factors, such as body mass index (BMI), age, smoking, blood pressure (BP) level, parental history of hypertension, and biochemical factors, whereas 6 reports used genetic risk score (GRS) as the prediction factor. AUC ranged from 0.64 to 0.97, and C-statistic ranged from 60% to 90%. Conclusions The traditional models are still the predominant risk prediction models for hypertension, but recently, more models have begun to incorporate genetic factors as part of their model predictors. However, these genetic predictors need to be well selected. The current reported models have acceptable to good discrimination and calibration ability, but whether the models can be applied in clinical practice still needs more validation and adjustment. PMID:29084293

  19. Mortality risk prediction in burn injury: Comparison of logistic regression with machine learning approaches.

    PubMed

    Stylianou, Neophytos; Akbarov, Artur; Kontopantelis, Evangelos; Buchan, Iain; Dunn, Ken W

    2015-08-01

    Predicting mortality from burn injury has traditionally employed logistic regression models. Alternative machine learning methods have been introduced in some areas of clinical prediction as the necessary software and computational facilities have become accessible. Here we compare logistic regression and machine learning predictions of mortality from burn. An established logistic mortality model was compared to machine learning methods (artificial neural network, support vector machine, random forests and naïve Bayes) using a population-based (England & Wales) case-cohort registry. Predictive evaluation used: area under the receiver operating characteristic curve; sensitivity; specificity; positive predictive value and Youden's index. All methods had comparable discriminatory abilities, similar sensitivities, specificities and positive predictive values. Although some machine learning methods performed marginally better than logistic regression the differences were seldom statistically significant and clinically insubstantial. Random forests were marginally better for high positive predictive value and reasonable sensitivity. Neural networks yielded slightly better prediction overall. Logistic regression gives an optimal mix of performance and interpretability. The established logistic regression model of burn mortality performs well against more complex alternatives. Clinical prediction with a small set of strong, stable, independent predictors is unlikely to gain much from machine learning outside specialist research contexts. Copyright © 2015 Elsevier Ltd and ISBI. All rights reserved.

  20. Machine learning to predict the occurrence of bisphosphonate-related osteonecrosis of the jaw associated with dental extraction: A preliminary report.

    PubMed

    Kim, Dong Wook; Kim, Hwiyoung; Nam, Woong; Kim, Hyung Jun; Cha, In-Ho

    2018-04-23

    The aim of this study was to build and validate five types of machine learning models that can predict the occurrence of BRONJ associated with dental extraction in patients taking bisphosphonates for the management of osteoporosis. A retrospective review of the medical records was conducted to obtain cases and controls for the study. Total 125 patients consisting of 41 cases and 84 controls were selected for the study. Five machine learning prediction algorithms including multivariable logistic regression model, decision tree, support vector machine, artificial neural network, and random forest were implemented. The outputs of these models were compared with each other and also with conventional methods, such as serum CTX level. Area under the receiver operating characteristic (ROC) curve (AUC) was used to compare the results. The performance of machine learning models was significantly superior to conventional statistical methods and single predictors. The random forest model yielded the best performance (AUC = 0.973), followed by artificial neural network (AUC = 0.915), support vector machine (AUC = 0.882), logistic regression (AUC = 0.844), decision tree (AUC = 0.821), drug holiday alone (AUC = 0.810), and CTX level alone (AUC = 0.630). Machine learning methods showed superior performance in predicting BRONJ associated with dental extraction compared to conventional statistical methods using drug holiday and serum CTX level. Machine learning can thus be applied in a wide range of clinical studies. Copyright © 2017. Published by Elsevier Inc.

  1. Cluster analysis as a prediction tool for pregnancy outcomes.

    PubMed

    Banjari, Ines; Kenjerić, Daniela; Šolić, Krešimir; Mandić, Milena L

    2015-03-01

    Considering specific physiology changes during gestation and thinking of pregnancy as a "critical window", classification of pregnant women at early pregnancy can be considered as crucial. The paper demonstrates the use of a method based on an approach from intelligent data mining, cluster analysis. Cluster analysis method is a statistical method which makes possible to group individuals based on sets of identifying variables. The method was chosen in order to determine possibility for classification of pregnant women at early pregnancy to analyze unknown correlations between different variables so that the certain outcomes could be predicted. 222 pregnant women from two general obstetric offices' were recruited. The main orient was set on characteristics of these pregnant women: their age, pre-pregnancy body mass index (BMI) and haemoglobin value. Cluster analysis gained a 94.1% classification accuracy rate with three branch- es or groups of pregnant women showing statistically significant correlations with pregnancy outcomes. The results are showing that pregnant women both of older age and higher pre-pregnancy BMI have a significantly higher incidence of delivering baby of higher birth weight but they gain significantly less weight during pregnancy. Their babies are also longer, and these women have significantly higher probability for complications during pregnancy (gestosis) and higher probability of induced or caesarean delivery. We can conclude that the cluster analysis method can appropriately classify pregnant women at early pregnancy to predict certain outcomes.

  2. Fully Bayesian tests of neutrality using genealogical summary statistics.

    PubMed

    Drummond, Alexei J; Suchard, Marc A

    2008-10-31

    Many data summary statistics have been developed to detect departures from neutral expectations of evolutionary models. However questions about the neutrality of the evolution of genetic loci within natural populations remain difficult to assess. One critical cause of this difficulty is that most methods for testing neutrality make simplifying assumptions simultaneously about the mutational model and the population size model. Consequentially, rejecting the null hypothesis of neutrality under these methods could result from violations of either or both assumptions, making interpretation troublesome. Here we harness posterior predictive simulation to exploit summary statistics of both the data and model parameters to test the goodness-of-fit of standard models of evolution. We apply the method to test the selective neutrality of molecular evolution in non-recombining gene genealogies and we demonstrate the utility of our method on four real data sets, identifying significant departures of neutrality in human influenza A virus, even after controlling for variation in population size. Importantly, by employing a full model-based Bayesian analysis, our method separates the effects of demography from the effects of selection. The method also allows multiple summary statistics to be used in concert, thus potentially increasing sensitivity. Furthermore, our method remains useful in situations where analytical expectations and variances of summary statistics are not available. This aspect has great potential for the analysis of temporally spaced data, an expanding area previously ignored for limited availability of theory and methods.

  3. SU-F-BRD-01: A Logistic Regression Model to Predict Objective Function Weights in Prostate Cancer IMRT

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Boutilier, J; Chan, T; Lee, T

    2014-06-15

    Purpose: To develop a statistical model that predicts optimization objective function weights from patient geometry for intensity-modulation radiotherapy (IMRT) of prostate cancer. Methods: A previously developed inverse optimization method (IOM) is applied retrospectively to determine optimal weights for 51 treated patients. We use an overlap volume ratio (OVR) of bladder and rectum for different PTV expansions in order to quantify patient geometry in explanatory variables. Using the optimal weights as ground truth, we develop and train a logistic regression (LR) model to predict the rectum weight and thus the bladder weight. Post hoc, we fix the weights of the leftmore » femoral head, right femoral head, and an artificial structure that encourages conformity to the population average while normalizing the bladder and rectum weights accordingly. The population average of objective function weights is used for comparison. Results: The OVR at 0.7cm was found to be the most predictive of the rectum weights. The LR model performance is statistically significant when compared to the population average over a range of clinical metrics including bladder/rectum V53Gy, bladder/rectum V70Gy, and mean voxel dose to the bladder, rectum, CTV, and PTV. On average, the LR model predicted bladder and rectum weights that are both 63% closer to the optimal weights compared to the population average. The treatment plans resulting from the LR weights have, on average, a rectum V70Gy that is 35% closer to the clinical plan and a bladder V70Gy that is 43% closer. Similar results are seen for bladder V54Gy and rectum V54Gy. Conclusion: Statistical modelling from patient anatomy can be used to determine objective function weights in IMRT for prostate cancer. Our method allows the treatment planners to begin the personalization process from an informed starting point, which may lead to more consistent clinical plans and reduce overall planning time.« less

  4. Genomic selection of agronomic traits in hybrid rice using an NCII population.

    PubMed

    Xu, Yang; Wang, Xin; Ding, Xiaowen; Zheng, Xingfei; Yang, Zefeng; Xu, Chenwu; Hu, Zhongli

    2018-05-10

    Hybrid breeding is an effective tool to improve yield in rice, while parental selection remains the key and difficult issue. Genomic selection (GS) provides opportunities to predict the performance of hybrids before phenotypes are measured. However, the application of GS is influenced by several genetic and statistical factors. Here, we used a rice North Carolina II (NC II) population constructed by crossing 115 rice varieties with five male sterile lines as a model to evaluate effects of statistical methods, heritability, marker density and training population size on prediction for hybrid performance. From the comparison of six GS methods, we found that predictabilities for different methods are significantly different, with genomic best linear unbiased prediction (GBLUP) and least absolute shrinkage and selection operation (LASSO) being the best, support vector machine (SVM) and partial least square (PLS) being the worst. The marker density has lower influence on predicting rice hybrid performance compared with the size of training population. Additionally, we used the 575 (115 × 5) hybrid rice as a training population to predict eight agronomic traits of all hybrids derived from 120 (115 + 5) rice varieties each mating with 3023 rice accessions from the 3000 rice genomes project (3 K RGP). Of the 362,760 potential hybrids, selection of the top 100 predicted hybrids would lead to 35.5%, 23.25%, 30.21%, 42.87%, 61.80%, 75.83%, 19.24% and 36.12% increase in grain yield per plant, thousand-grain weight, panicle number per plant, plant height, secondary branch number, grain number per panicle, panicle length and primary branch number, respectively. This study evaluated the factors affecting predictabilities for hybrid prediction and demonstrated the implementation of GS to predict hybrid performance of rice. Our results suggest that GS could enable the rapid selection of superior hybrids, thus increasing the efficiency of rice hybrid breeding.

  5. The index-flood and the GRADEX methods combination for flood frequency analysis.

    NASA Astrophysics Data System (ADS)

    Fuentes, Diana; Di Baldassarre, Giuliano; Quesada, Beatriz; Xu, Chong-Yu; Halldin, Sven; Beven, Keith

    2017-04-01

    Flood frequency analysis is used in many applications, including flood risk management, design of hydraulic structures, and urban planning. However, such analysis requires of long series of observed discharge data which are often not available in many basins around the world. In this study, we tested the usefulness of combining regional discharge and local precipitation data to estimate the event flood volume frequency curve for 63 catchments in Mexico, Central America and the Caribbean. This was achieved by combining two existing flood frequency analysis methods, the regionalization index-flood approach with the GRADEX method. For up to 10-years return period, similar shape of the scaled flood frequency curve for catchments with similar flood behaviour was assumed from the index-flood approach. For return periods larger than 10-years the probability distribution of rainfall and discharge volumes were assumed to be asymptotically and exponential-type functions with the same scale parameter from the GRADEX method. Results showed that if the mean annual flood (MAF), used as index-flood, is known, the index-flood approach performed well for up to 10 years return periods, resulting in 25% mean relative error in prediction. For larger return periods the prediction capability decreased but could be improved by the use of the GRADEX method. As the MAF is unknown at ungauged and short-period measured basins, we tested predicting the MAF using catchments climate-physical characteristics, and discharge statistics, the latter when observations were available for only 8 years. Only the use of discharge statistics resulted in acceptable predictions.

  6. Simulating Metabolism with Statistical Thermodynamics

    PubMed Central

    Cannon, William R.

    2014-01-01

    New methods are needed for large scale modeling of metabolism that predict metabolite levels and characterize the thermodynamics of individual reactions and pathways. Current approaches use either kinetic simulations, which are difficult to extend to large networks of reactions because of the need for rate constants, or flux-based methods, which have a large number of feasible solutions because they are unconstrained by the law of mass action. This report presents an alternative modeling approach based on statistical thermodynamics. The principles of this approach are demonstrated using a simple set of coupled reactions, and then the system is characterized with respect to the changes in energy, entropy, free energy, and entropy production. Finally, the physical and biochemical insights that this approach can provide for metabolism are demonstrated by application to the tricarboxylic acid (TCA) cycle of Escherichia coli. The reaction and pathway thermodynamics are evaluated and predictions are made regarding changes in concentration of TCA cycle intermediates due to 10- and 100-fold changes in the ratio of NAD+:NADH concentrations. Finally, the assumptions and caveats regarding the use of statistical thermodynamics to model non-equilibrium reactions are discussed. PMID:25089525

  7. Simulating metabolism with statistical thermodynamics.

    PubMed

    Cannon, William R

    2014-01-01

    New methods are needed for large scale modeling of metabolism that predict metabolite levels and characterize the thermodynamics of individual reactions and pathways. Current approaches use either kinetic simulations, which are difficult to extend to large networks of reactions because of the need for rate constants, or flux-based methods, which have a large number of feasible solutions because they are unconstrained by the law of mass action. This report presents an alternative modeling approach based on statistical thermodynamics. The principles of this approach are demonstrated using a simple set of coupled reactions, and then the system is characterized with respect to the changes in energy, entropy, free energy, and entropy production. Finally, the physical and biochemical insights that this approach can provide for metabolism are demonstrated by application to the tricarboxylic acid (TCA) cycle of Escherichia coli. The reaction and pathway thermodynamics are evaluated and predictions are made regarding changes in concentration of TCA cycle intermediates due to 10- and 100-fold changes in the ratio of NAD+:NADH concentrations. Finally, the assumptions and caveats regarding the use of statistical thermodynamics to model non-equilibrium reactions are discussed.

  8. A Low-Cost Method for Multiple Disease Prediction.

    PubMed

    Bayati, Mohsen; Bhaskar, Sonia; Montanari, Andrea

    Recently, in response to the rising costs of healthcare services, employers that are financially responsible for the healthcare costs of their workforce have been investing in health improvement programs for their employees. A main objective of these so called "wellness programs" is to reduce the incidence of chronic illnesses such as cardiovascular disease, cancer, diabetes, and obesity, with the goal of reducing future medical costs. The majority of these wellness programs include an annual screening to detect individuals with the highest risk of developing chronic disease. Once these individuals are identified, the company can invest in interventions to reduce the risk of those individuals. However, capturing many biomarkers per employee creates a costly screening procedure. We propose a statistical data-driven method to address this challenge by minimizing the number of biomarkers in the screening procedure while maximizing the predictive power over a broad spectrum of diseases. Our solution uses multi-task learning and group dimensionality reduction from machine learning and statistics. We provide empirical validation of the proposed solution using data from two different electronic medical records systems, with comparisons to a statistical benchmark.

  9. Syndromic surveillance models using Web data: the case of scarlet fever in the UK.

    PubMed

    Samaras, Loukas; García-Barriocanal, Elena; Sicilia, Miguel-Angel

    2012-03-01

    Recent research has shown the potential of Web queries as a source for syndromic surveillance, and existing studies show that these queries can be used as a basis for estimation and prediction of the development of a syndromic disease, such as influenza, using log linear (logit) statistical models. Two alternative models are applied to the relationship between cases and Web queries in this paper. We examine the applicability of using statistical methods to relate search engine queries with scarlet fever cases in the UK, taking advantage of tools to acquire the appropriate data from Google, and using an alternative statistical method based on gamma distributions. The results show that using logit models, the Pearson correlation factor between Web queries and the data obtained from the official agencies must be over 0.90, otherwise the prediction of the peak and the spread of the distributions gives significant deviations. In this paper, we describe the gamma distribution model and show that we can obtain better results in all cases using gamma transformations, and especially in those with a smaller correlation factor.

  10. The skeletal maturation status estimated by statistical shape analysis: axial images of Japanese cervical vertebra.

    PubMed

    Shin, S M; Kim, Y-I; Choi, Y-S; Yamaguchi, T; Maki, K; Cho, B-H; Park, S-B

    2015-01-01

    To evaluate axial cervical vertebral (ACV) shape quantitatively and to build a prediction model for skeletal maturation level using statistical shape analysis for Japanese individuals. The sample included 24 female and 19 male patients with hand-wrist radiographs and CBCT images. Through generalized Procrustes analysis and principal components (PCs) analysis, the meaningful PCs were extracted from each ACV shape and analysed for the estimation regression model. Each ACV shape had meaningful PCs, except for the second axial cervical vertebra. Based on these models, the smallest prediction intervals (PIs) were from the combination of the shape space PCs, age and gender. Overall, the PIs of the male group were smaller than those of the female group. There was no significant correlation between centroid size as a size factor and skeletal maturation level. Our findings suggest that the ACV maturation method, which was applied by statistical shape analysis, could confirm information about skeletal maturation in Japanese individuals as an available quantifier of skeletal maturation and could be as useful a quantitative method as the skeletal maturation index.

  11. The HCUP SID Imputation Project: Improving Statistical Inferences for Health Disparities Research by Imputing Missing Race Data.

    PubMed

    Ma, Yan; Zhang, Wei; Lyman, Stephen; Huang, Yihe

    2018-06-01

    To identify the most appropriate imputation method for missing data in the HCUP State Inpatient Databases (SID) and assess the impact of different missing data methods on racial disparities research. HCUP SID. A novel simulation study compared four imputation methods (random draw, hot deck, joint multiple imputation [MI], conditional MI) for missing values for multiple variables, including race, gender, admission source, median household income, and total charges. The simulation was built on real data from the SID to retain their hierarchical data structures and missing data patterns. Additional predictive information from the U.S. Census and American Hospital Association (AHA) database was incorporated into the imputation. Conditional MI prediction was equivalent or superior to the best performing alternatives for all missing data structures and substantially outperformed each of the alternatives in various scenarios. Conditional MI substantially improved statistical inferences for racial health disparities research with the SID. © Health Research and Educational Trust.

  12. A system for learning statistical motion patterns.

    PubMed

    Hu, Weiming; Xiao, Xuejuan; Fu, Zhouyu; Xie, Dan; Tan, Tieniu; Maybank, Steve

    2006-09-01

    Analysis of motion patterns is an effective approach for anomaly detection and behavior prediction. Current approaches for the analysis of motion patterns depend on known scenes, where objects move in predefined ways. It is highly desirable to automatically construct object motion patterns which reflect the knowledge of the scene. In this paper, we present a system for automatically learning motion patterns for anomaly detection and behavior prediction based on a proposed algorithm for robustly tracking multiple objects. In the tracking algorithm, foreground pixels are clustered using a fast accurate fuzzy K-means algorithm. Growing and prediction of the cluster centroids of foreground pixels ensure that each cluster centroid is associated with a moving object in the scene. In the algorithm for learning motion patterns, trajectories are clustered hierarchically using spatial and temporal information and then each motion pattern is represented with a chain of Gaussian distributions. Based on the learned statistical motion patterns, statistical methods are used to detect anomalies and predict behaviors. Our system is tested using image sequences acquired, respectively, from a crowded real traffic scene and a model traffic scene. Experimental results show the robustness of the tracking algorithm, the efficiency of the algorithm for learning motion patterns, and the encouraging performance of algorithms for anomaly detection and behavior prediction.

  13. The Complexity of Solar and Geomagnetic Indices

    NASA Astrophysics Data System (ADS)

    Pesnell, W. Dean

    2017-08-01

    How far in advance can the sunspot number be predicted with any degree of confidence? Solar cycle predictions are needed to plan long-term space missions. Fleets of satellites circle the Earth collecting science data, protecting astronauts, and relaying information. All of these satellites are sensitive at some level to solar cycle effects. Statistical and timeseries analyses of the sunspot number are often used to predict solar activity. These methods have not been completely successful as the solar dynamo changes over time and one cycle's sunspots are not a faithful predictor of the next cycle's activity. In some ways, using these techniques is similar to asking whether the stock market can be predicted. It has been shown that the Dow Jones Industrial Average (DJIA) can be more accurately predicted during periods when it obeys certain statistical properties than at other times. The Hurst exponent is one such way to partition the data. Another measure of the complexity of a timeseries is the fractal dimension. We can use these measures of complexity to compare the sunspot number with other solar and geomagnetic indices. Our concentration is on how trends are removed by the various techniques, either internally or externally. Comparisons of the statistical properties of the various solar indices may guide us in understanding how the dynamo manifests in the various indices and the Sun.

  14. Assessing the prediction accuracy of a cure model for censored survival data with long-term survivors: Application to breast cancer data.

    PubMed

    Asano, Junichi; Hirakawa, Akihiro

    2017-01-01

    The Cox proportional hazards cure model is a survival model incorporating a cure rate with the assumption that the population contains both uncured and cured individuals. It contains a logistic regression for the cure rate, and a Cox regression to estimate the hazard for uncured patients. A single predictive model for both the cure and hazard can be developed by using a cure model that simultaneously predicts the cure rate and hazards for uncured patients; however, model selection is a challenge because of the lack of a measure for quantifying the predictive accuracy of a cure model. Recently, we developed an area under the receiver operating characteristic curve (AUC) for determining the cure rate in a cure model (Asano et al., 2014), but the hazards measure for uncured patients was not resolved. In this article, we propose novel C-statistics that are weighted by the patients' cure status (i.e., cured, uncured, or censored cases) for the cure model. The operating characteristics of the proposed C-statistics and their confidence interval were examined by simulation analyses. We also illustrate methods for predictive model selection and for further interpretation of variables using the proposed AUCs and C-statistics via application to breast cancer data.

  15. Prediction of local concentration statistics in variably saturated soils: Influence of observation scale and comparison with field data

    NASA Astrophysics Data System (ADS)

    Graham, Wendy; Destouni, Georgia; Demmy, George; Foussereau, Xavier

    1998-07-01

    The methodology developed in Destouni and Graham [Destouni, G., Graham, W.D., 1997. The influence of observation method on local concentration statistics in the subsurface. Water Resour. Res. 33 (4) 663-676.] for predicting locally measured concentration statistics for solute transport in heterogeneous porous media under saturated flow conditions is applied to the prediction of conservative nonreactive solute transport in the vadose zone where observations are obtained by soil coring. Exact analytical solutions are developed for both the mean and variance of solute concentrations measured in discrete soil cores using a simplified physical model for vadose-zone flow and solute transport. Theoretical results show that while the ensemble mean concentration is relatively insensitive to the length-scale of the measurement, predictions of the concentration variance are significantly impacted by the sampling interval. Results also show that accounting for vertical heterogeneity in the soil profile results in significantly less spreading in the mean and variance of the measured solute breakthrough curves, indicating that it is important to account for vertical heterogeneity even for relatively small travel distances. Model predictions for both the mean and variance of locally measured solute concentration, based on independently estimated model parameters, agree well with data from a field tracer test conducted in Manatee County, Florida.

  16. Lung Cancer Risk Prediction Model Incorporating Lung Function: Development and Validation in the UK Biobank Prospective Cohort Study.

    PubMed

    Muller, David C; Johansson, Mattias; Brennan, Paul

    2017-03-10

    Purpose Several lung cancer risk prediction models have been developed, but none to date have assessed the predictive ability of lung function in a population-based cohort. We sought to develop and internally validate a model incorporating lung function using data from the UK Biobank prospective cohort study. Methods This analysis included 502,321 participants without a previous diagnosis of lung cancer, predominantly between 40 and 70 years of age. We used flexible parametric survival models to estimate the 2-year probability of lung cancer, accounting for the competing risk of death. Models included predictors previously shown to be associated with lung cancer risk, including sex, variables related to smoking history and nicotine addiction, medical history, family history of lung cancer, and lung function (forced expiratory volume in 1 second [FEV1]). Results During accumulated follow-up of 1,469,518 person-years, there were 738 lung cancer diagnoses. A model incorporating all predictors had excellent discrimination (concordance (c)-statistic [95% CI] = 0.85 [0.82 to 0.87]). Internal validation suggested that the model will discriminate well when applied to new data (optimism-corrected c-statistic = 0.84). The full model, including FEV1, also had modestly superior discriminatory power than one that was designed solely on the basis of questionnaire variables (c-statistic = 0.84 [0.82 to 0.86]; optimism-corrected c-statistic = 0.83; p FEV1 = 3.4 × 10 -13 ). The full model had better discrimination than standard lung cancer screening eligibility criteria (c-statistic = 0.66 [0.64 to 0.69]). Conclusion A risk prediction model that includes lung function has strong predictive ability, which could improve eligibility criteria for lung cancer screening programs.

  17. Relationship between affect and achievement in science and mathematics in Malaysia and Singapore

    NASA Astrophysics Data System (ADS)

    Thoe Ng, Khar; Fah Lay, Yoon; Areepattamannil, Shaljan; Treagust, David F.; Chandrasegaran, A. L.

    2012-11-01

    Background : The Trends in International Mathematics and Science Study (TIMSS) assesses the quality of the teaching and learning of science and mathematics among Grades 4 and 8 students across participating countries. Purpose : This study explored the relationship between positive affect towards science and mathematics and achievement in science and mathematics among Malaysian and Singaporean Grade 8 students. Sample : In total, 4466 Malaysia students and 4599 Singaporean students from Grade 8 who participated in TIMSS 2007 were involved in this study. Design and method : Students' achievement scores on eight items in the survey instrument that were reported in TIMSS 2007 were used as the dependent variable in the analysis. Students' scores on four items in the TIMSS 2007 survey instrument pertaining to students' affect towards science and mathematics together with students' gender, language spoken at home and parental education were used as the independent variables. Results : Positive affect towards science and mathematics indicated statistically significant predictive effects on achievement in the two subjects for both Malaysian and Singaporean Grade 8 students. There were statistically significant predictive effects on mathematics achievement for the students' gender, language spoken at home and parental education for both Malaysian and Singaporean students, with R 2 = 0.18 and 0.21, respectively. However, only parental education showed statistically significant predictive effects on science achievement for both countries. For Singapore, language spoken at home also demonstrated statistically significant predictive effects on science achievement, whereas gender did not. For Malaysia, neither gender nor language spoken at home had statistically significant predictive effects on science achievement. Conclusions : It is important for educators to consider implementing self-concept enhancement intervention programmes by incorporating 'affect' components of academic self-concept in order to develop students' talents and promote academic excellence in science and mathematics.

  18. Three-dimensional mapping of soil chemical characteristics at micrometric scale: Statistical prediction by combining 2D SEM-EDX data and 3D X-ray computed micro-tomographic images

    NASA Astrophysics Data System (ADS)

    Hapca, Simona

    2015-04-01

    Many soil properties and functions emerge from interactions of physical, chemical and biological processes at microscopic scales, which can be understood only by integrating techniques that traditionally are developed within separate disciplines. While recent advances in imaging techniques, such as X-ray computed tomography (X-ray CT), offer the possibility to reconstruct the 3D physical structure at fine resolutions, for the distribution of chemicals in soil, existing methods, based on scanning electron microscope (SEM) and energy dispersive X-ray detection (EDX), allow for characterization of the chemical composition only on 2D surfaces. At present, direct 3D measurement techniques are still lacking, sequential sectioning of soils, followed by 2D mapping of chemical elements and interpolation to 3D, being an alternative which is explored in this study. Specifically, we develop an integrated experimental and theoretical framework which combines 3D X-ray CT imaging technique with 2D SEM-EDX and use spatial statistics methods to map the chemical composition of soil in 3D. The procedure involves three stages 1) scanning a resin impregnated soil cube by X-ray CT, followed by precision cutting to produce parallel thin slices, the surfaces of which are scanned by SEM-EDX, 2) alignment of the 2D chemical maps within the internal 3D structure of the soil cube, and 3) development, of spatial statistics methods to predict the chemical composition of 3D soil based on the observed 2D chemical and 3D physical data. Specifically, three statistical models consisting of a regression tree, a regression tree kriging and cokriging model were used to predict the 3D spatial distribution of carbon, silicon, iron and oxygen in soil, these chemical elements showing a good spatial agreement between the X-ray grayscale intensities and the corresponding 2D SEM-EDX data. Due to the spatial correlation between the physical and chemical data, the regression-tree model showed a great potential in predicting chemical composition in particular for iron, which is generally sparsely distributed in soil. For carbon, silicon and oxygen, which are more densely distributed, the additional kriging of the regression tree residuals improved significantly the prediction, whereas prediction based on co-kriging was less consistent across replicates, underperforming regression-tree kriging. The present study shows a great potential in integrating geo-statistical methods with imaging techniques to unveil the 3D chemical structure of soil at very fine scales, the framework being suitable to be further applied to other types of imaging data such as images of biological thin sections for characterization of microbial distribution. Key words: X-ray CT, SEM-EDX, segmentation techniques, spatial correlation, 3D soil images, 2D chemical maps.

  19. Predictive analysis effectiveness in determining the epidemic disease infected area

    NASA Astrophysics Data System (ADS)

    Ibrahim, Najihah; Akhir, Nur Shazwani Md.; Hassan, Fadratul Hafinaz

    2017-10-01

    Epidemic disease outbreak had caused nowadays community to raise their great concern over the infectious disease controlling, preventing and handling methods to diminish the disease dissemination percentage and infected area. Backpropagation method was used for the counter measure and prediction analysis of the epidemic disease. The predictive analysis based on the backpropagation method can be determine via machine learning process that promotes the artificial intelligent in pattern recognition, statistics and features selection. This computational learning process will be integrated with data mining by measuring the score output as the classifier to the given set of input features through classification technique. The classification technique is the features selection of the disease dissemination factors that likely have strong interconnection between each other in causing infectious disease outbreaks. The predictive analysis of epidemic disease in determining the infected area was introduced in this preliminary study by using the backpropagation method in observation of other's findings. This study will classify the epidemic disease dissemination factors as the features for weight adjustment on the prediction of epidemic disease outbreaks. Through this preliminary study, the predictive analysis is proven to be effective method in determining the epidemic disease infected area by minimizing the error value through the features classification.

  20. An intercomparison of approaches for improving operational seasonal streamflow forecasts

    NASA Astrophysics Data System (ADS)

    Mendoza, Pablo A.; Wood, Andrew W.; Clark, Elizabeth; Rothwell, Eric; Clark, Martyn P.; Nijssen, Bart; Brekke, Levi D.; Arnold, Jeffrey R.

    2017-07-01

    For much of the last century, forecasting centers around the world have offered seasonal streamflow predictions to support water management. Recent work suggests that the two major avenues to advance seasonal predictability are improvements in the estimation of initial hydrologic conditions (IHCs) and the incorporation of climate information. This study investigates the marginal benefits of a variety of methods using IHCs and/or climate information, focusing on seasonal water supply forecasts (WSFs) in five case study watersheds located in the US Pacific Northwest region. We specify two benchmark methods that mimic standard operational approaches - statistical regression against IHCs and model-based ensemble streamflow prediction (ESP) - and then systematically intercompare WSFs across a range of lead times. Additional methods include (i) statistical techniques using climate information either from standard indices or from climate reanalysis variables and (ii) several hybrid/hierarchical approaches harnessing both land surface and climate predictability. In basins where atmospheric teleconnection signals are strong, and when watershed predictability is low, climate information alone provides considerable improvements. For those basins showing weak teleconnections, custom predictors from reanalysis fields were more effective in forecast skill than standard climate indices. ESP predictions tended to have high correlation skill but greater bias compared to other methods, and climate predictors failed to substantially improve these deficiencies within a trace weighting framework. Lower complexity techniques were competitive with more complex methods, and the hierarchical expert regression approach introduced here (hierarchical ensemble streamflow prediction - HESP) provided a robust alternative for skillful and reliable water supply forecasts at all initialization times. Three key findings from this effort are (1) objective approaches supporting methodologically consistent hindcasts open the door to a broad range of beneficial forecasting strategies; (2) the use of climate predictors can add to the seasonal forecast skill available from IHCs; and (3) sample size limitations must be handled rigorously to avoid over-trained forecast solutions. Overall, the results suggest that despite a rich, long heritage of operational use, there remain a number of compelling opportunities to improve the skill and value of seasonal streamflow predictions.

  1. Predicting the 10-year risk of hip and major osteoporotic fracture in rheumatoid arthritis and in the general population: an independent validation and update of UK FRAX without bone mineral density

    PubMed Central

    Klop, Corinne; de Vries, Frank; Bijlsma, Johannes W J; Leufkens, Hubert G M; Welsing, Paco M J

    2016-01-01

    Objectives FRAX incorporates rheumatoid arthritis (RA) as a dichotomous predictor for predicting the 10-year risk of hip and major osteoporotic fracture (MOF). However, fracture risk may deviate with disease severity, duration or treatment. Aims were to validate, and if needed to update, UK FRAX for patients with RA and to compare predictive performance with the general population (GP). Methods Cohort study within UK Clinical Practice Research Datalink (CPRD) (RA: n=11 582, GP: n=38 755), also linked to hospital admissions for hip fracture (CPRD-Hospital Episode Statistics, HES) (RA: n=7221, GP: n=24 227). Predictive performance of UK FRAX without bone mineral density was assessed by discrimination and calibration. Updating methods included recalibration and extension. Differences in predictive performance were assessed by the C-statistic and Net Reclassification Improvement (NRI) using the UK National Osteoporosis Guideline Group intervention thresholds. Results UK FRAX significantly overestimated fracture risk in patients with RA, both for MOF (mean predicted vs observed 10-year risk: 13.3% vs 8.4%) and hip fracture (CPRD: 5.5% vs 3.1%, CPRD-HES: 5.5% vs 4.1%). Calibration was good for hip fracture in the GP (CPRD-HES: 2.7% vs 2.4%). Discrimination was good for hip fracture (RA: 0.78, GP: 0.83) and moderate for MOF (RA: 0.69, GP: 0.71). Extension of the recalibrated UK FRAX using CPRD-HES with duration of RA disease, glucocorticoids (>7.5 mg/day) and secondary osteoporosis did not improve the NRI (0.01, 95% CI −0.04 to 0.05) or C-statistic (0.78). Conclusions UK FRAX overestimated fracture risk in RA, but performed well for hip fracture in the GP after linkage to hospitalisations. Extension of the recalibrated UK FRAX did not improve predictive performance. PMID:26984006

  2. Jet Noise Diagnostics Supporting Statistical Noise Prediction Methods

    NASA Technical Reports Server (NTRS)

    Bridges, James E.

    2006-01-01

    The primary focus of my presentation is the development of the jet noise prediction code JeNo with most examples coming from the experimental work that drove the theoretical development and validation. JeNo is a statistical jet noise prediction code, based upon the Lilley acoustic analogy. Our approach uses time-average 2-D or 3-D mean and turbulent statistics of the flow as input. The output is source distributions and spectral directivity. NASA has been investing in development of statistical jet noise prediction tools because these seem to fit the middle ground that allows enough flexibility and fidelity for jet noise source diagnostics while having reasonable computational requirements. These tools rely on Reynolds-averaged Navier-Stokes (RANS) computational fluid dynamics (CFD) solutions as input for computing far-field spectral directivity using an acoustic analogy. There are many ways acoustic analogies can be created, each with a series of assumptions and models, many often taken unknowingly. And the resulting prediction can be easily reverse-engineered by altering the models contained within. However, only an approach which is mathematically sound, with assumptions validated and modeled quantities checked against direct measurement will give consistently correct answers. Many quantities are modeled in acoustic analogies precisely because they have been impossible to measure or calculate, making this requirement a difficult task. The NASA team has spent considerable effort identifying all the assumptions and models used to take the Navier-Stokes equations to the point of a statistical calculation via an acoustic analogy very similar to that proposed by Lilley. Assumptions have been identified and experiments have been developed to test these assumptions. In some cases this has resulted in assumptions being changed. Beginning with the CFD used as input to the acoustic analogy, models for turbulence closure used in RANS CFD codes have been explored and compared against measurements of mean and rms velocity statistics over a range of jet speeds and temperatures. Models for flow parameters used in the acoustic analogy, most notably the space-time correlations of velocity, have been compared against direct measurements, and modified to better fit the observed data. These measurements have been extremely challenging for hot, high speed jets, and represent a sizeable investment in instrumentation development. As an intermediate check that the analysis is predicting the physics intended, phased arrays have been employed to measure source distributions for a wide range of jet cases. And finally, careful far-field spectral directivity measurements have been taken for final validation of the prediction code. Examples of each of these experimental efforts will be presented. The main result of these efforts is a noise prediction code, named JeNo, which is in middevelopment. JeNo is able to consistently predict spectral directivity, including aft angle directivity, for subsonic cold jets of most geometries. Current development on JeNo is focused on extending its capability to hot jets, requiring inclusion of a previously neglected second source associated with thermal fluctuations. A secondary result of the intensive experimentation is the archiving of various flow statistics applicable to other acoustic analogies and to development of time-resolved prediction methods. These will be of lasting value as we look ahead at future challenges to the aeroacoustic experimentalist.

  3. Building the Community Online Resource for Statistical Seismicity Analysis (CORSSA)

    NASA Astrophysics Data System (ADS)

    Michael, A. J.; Wiemer, S.; Zechar, J. D.; Hardebeck, J. L.; Naylor, M.; Zhuang, J.; Steacy, S.; Corssa Executive Committee

    2010-12-01

    Statistical seismology is critical to the understanding of seismicity, the testing of proposed earthquake prediction and forecasting methods, and the assessment of seismic hazard. Unfortunately, despite its importance to seismology - especially to those aspects with great impact on public policy - statistical seismology is mostly ignored in the education of seismologists, and there is no central repository for the existing open-source software tools. To remedy these deficiencies, and with the broader goal to enhance the quality of statistical seismology research, we have begun building the Community Online Resource for Statistical Seismicity Analysis (CORSSA). CORSSA is a web-based educational platform that is authoritative, up-to-date, prominent, and user-friendly. We anticipate that the users of CORSSA will range from beginning graduate students to experienced researchers. More than 20 scientists from around the world met for a week in Zurich in May 2010 to kick-start the creation of CORSSA: the format and initial table of contents were defined; a governing structure was organized; and workshop participants began drafting articles. CORSSA materials are organized with respect to six themes, each containing between four and eight articles. The CORSSA web page, www.corssa.org, officially unveiled on September 6, 2010, debuts with an initial set of approximately 10 to 15 articles available online for viewing and commenting with additional articles to be added over the coming months. Each article will be peer-reviewed and will present a balanced discussion, including illustrative examples and code snippets. Topics in the initial set of articles will include: introductions to both CORSSA and statistical seismology, basic statistical tests and their role in seismology; understanding seismicity catalogs and their problems; basic techniques for modeling seismicity; and methods for testing earthquake predictability hypotheses. A special article will compare and review available statistical seismology software packages.

  4. Principal component analysis in construction of 3D human knee joint models using a statistical shape model method.

    PubMed

    Tsai, Tsung-Yuan; Li, Jing-Sheng; Wang, Shaobai; Li, Pingyue; Kwon, Young-Min; Li, Guoan

    2015-01-01

    The statistical shape model (SSM) method that uses 2D images of the knee joint to predict the three-dimensional (3D) joint surface model has been reported in the literature. In this study, we constructed a SSM database using 152 human computed tomography (CT) knee joint models, including the femur, tibia and patella and analysed the characteristics of each principal component of the SSM. The surface models of two in vivo knees were predicted using the SSM and their 2D bi-plane fluoroscopic images. The predicted models were compared to their CT joint models. The differences between the predicted 3D knee joint surfaces and the CT image-based surfaces were 0.30 ± 0.81 mm, 0.34 ± 0.79 mm and 0.36 ± 0.59 mm for the femur, tibia and patella, respectively (average ± standard deviation). The computational time for each bone of the knee joint was within 30 s using a personal computer. The analysis of this study indicated that the SSM method could be a useful tool to construct 3D surface models of the knee with sub-millimeter accuracy in real time. Thus, it may have a broad application in computer-assisted knee surgeries that require 3D surface models of the knee.

  5. Forecasting Solar Flares Using Magnetogram-based Predictors and Machine Learning

    NASA Astrophysics Data System (ADS)

    Florios, Kostas; Kontogiannis, Ioannis; Park, Sung-Hong; Guerra, Jordan A.; Benvenuto, Federico; Bloomfield, D. Shaun; Georgoulis, Manolis K.

    2018-02-01

    We propose a forecasting approach for solar flares based on data from Solar Cycle 24, taken by the Helioseismic and Magnetic Imager (HMI) on board the Solar Dynamics Observatory (SDO) mission. In particular, we use the Space-weather HMI Active Region Patches (SHARP) product that facilitates cut-out magnetograms of solar active regions (AR) in the Sun in near-realtime (NRT), taken over a five-year interval (2012 - 2016). Our approach utilizes a set of thirteen predictors, which are not included in the SHARP metadata, extracted from line-of-sight and vector photospheric magnetograms. We exploit several machine learning (ML) and conventional statistics techniques to predict flares of peak magnitude {>} M1 and {>} C1 within a 24 h forecast window. The ML methods used are multi-layer perceptrons (MLP), support vector machines (SVM), and random forests (RF). We conclude that random forests could be the prediction technique of choice for our sample, with the second-best method being multi-layer perceptrons, subject to an entropy objective function. A Monte Carlo simulation showed that the best-performing method gives accuracy ACC=0.93(0.00), true skill statistic TSS=0.74(0.02), and Heidke skill score HSS=0.49(0.01) for {>} M1 flare prediction with probability threshold 15% and ACC=0.84(0.00), TSS=0.60(0.01), and HSS=0.59(0.01) for {>} C1 flare prediction with probability threshold 35%.

  6. Reliable probabilities through statistical post-processing of ensemble predictions

    NASA Astrophysics Data System (ADS)

    Van Schaeybroeck, Bert; Vannitsem, Stéphane

    2013-04-01

    We develop post-processing or calibration approaches based on linear regression that make ensemble forecasts more reliable. We enforce climatological reliability in the sense that the total variability of the prediction is equal to the variability of the observations. Second, we impose ensemble reliability such that the spread around the ensemble mean of the observation coincides with the one of the ensemble members. In general the attractors of the model and reality are inhomogeneous. Therefore ensemble spread displays a variability not taken into account in standard post-processing methods. We overcome this by weighting the ensemble by a variable error. The approaches are tested in the context of the Lorenz 96 model (Lorenz 1996). The forecasts become more reliable at short lead times as reflected by a flatter rank histogram. Our best method turns out to be superior to well-established methods like EVMOS (Van Schaeybroeck and Vannitsem, 2011) and Nonhomogeneous Gaussian Regression (Gneiting et al., 2005). References [1] Gneiting, T., Raftery, A. E., Westveld, A., Goldman, T., 2005: Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. Mon. Weather Rev. 133, 1098-1118. [2] Lorenz, E. N., 1996: Predictability - a problem partly solved. Proceedings, Seminar on Predictability ECMWF. 1, 1-18. [3] Van Schaeybroeck, B., and S. Vannitsem, 2011: Post-processing through linear regression, Nonlin. Processes Geophys., 18, 147.

  7. Statistical analysis and modeling of intermittent transport events in the tokamak scrape-off layer

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Anderson, Johan, E-mail: anderson.johan@gmail.com; Halpern, Federico D.; Ricci, Paolo

    The turbulence observed in the scrape-off-layer of a tokamak is often characterized by intermittent events of bursty nature, a feature which raises concerns about the prediction of heat loads on the physical boundaries of the device. It appears thus necessary to delve into the statistical properties of turbulent physical fields such as density, electrostatic potential, and temperature, focusing on the mathematical expression of tails of the probability distribution functions. The method followed here is to generate statistical information from time-traces of the plasma density stemming from Braginskii-type fluid simulations and check this against a first-principles theoretical model. The analysis ofmore » the numerical simulations indicates that the probability distribution function of the intermittent process contains strong exponential tails, as predicted by the analytical theory.« less

  8. LES/PDF studies of joint statistics of mixture fraction and progress variable in piloted methane jet flames with inhomogeneous inlet flows

    NASA Astrophysics Data System (ADS)

    Zhang, Pei; Barlow, Robert; Masri, Assaad; Wang, Haifeng

    2016-11-01

    The mixture fraction and progress variable are often used as independent variables for describing turbulent premixed and non-premixed flames. There is a growing interest in using these two variables for describing partially premixed flames. The joint statistical distribution of the mixture fraction and progress variable is of great interest in developing models for partially premixed flames. In this work, we conduct predictive studies of the joint statistics of mixture fraction and progress variable in a series of piloted methane jet flames with inhomogeneous inlet flows. The employed models combine large eddy simulations with the Monte Carlo probability density function (PDF) method. The joint PDFs and marginal PDFs are examined in detail by comparing the model predictions and the measurements. Different presumed shapes of the joint PDFs are also evaluated.

  9. Statistics of Smoothed Cosmic Fields in Perturbation Theory. I. Formulation and Useful Formulae in Second-Order Perturbation Theory

    NASA Astrophysics Data System (ADS)

    Matsubara, Takahiko

    2003-02-01

    We formulate a general method for perturbative evaluations of statistics of smoothed cosmic fields and provide useful formulae for application of the perturbation theory to various statistics. This formalism is an extensive generalization of the method used by Matsubara, who derived a weakly nonlinear formula of the genus statistic in a three-dimensional density field. After describing the general method, we apply the formalism to a series of statistics, including genus statistics, level-crossing statistics, Minkowski functionals, and a density extrema statistic, regardless of the dimensions in which each statistic is defined. The relation between the Minkowski functionals and other geometrical statistics is clarified. These statistics can be applied to several cosmic fields, including three-dimensional density field, three-dimensional velocity field, two-dimensional projected density field, and so forth. The results are detailed for second-order theory of the formalism. The effect of the bias is discussed. The statistics of smoothed cosmic fields as functions of rescaled threshold by volume fraction are discussed in the framework of second-order perturbation theory. In CDM-like models, their functional deviations from linear predictions plotted against the rescaled threshold are generally much smaller than that plotted against the direct threshold. There is still a slight meatball shift against rescaled threshold, which is characterized by asymmetry in depths of troughs in the genus curve. A theory-motivated asymmetry factor in the genus curve is proposed.

  10. A Comparison of Conventional Linear Regression Methods and Neural Networks for Forecasting Educational Spending.

    ERIC Educational Resources Information Center

    Baker, Bruce D.; Richards, Craig E.

    1999-01-01

    Applies neural network methods for forecasting 1991-95 per-pupil expenditures in U.S. public elementary and secondary schools. Forecasting models included the National Center for Education Statistics' multivariate regression model and three neural architectures. Regarding prediction accuracy, neural network results were comparable or superior to…

  11. Analyzing the Validity of the Adult-Adolescent Parenting Inventory for Low-Income Populations

    ERIC Educational Resources Information Center

    Lawson, Michael A.; Alameda-Lawson, Tania; Byrnes, Edward

    2017-01-01

    Objectives: The purpose of this study was to examine the construct and predictive validity of the Adult-Adolescent Parenting Inventory (AAPI-2). Methods: The validity of the AAPI-2 was evaluated using multiple statistical methods, including exploratory factor analysis, confirmatory factor analysis, and latent class analysis. These analyses were…

  12. PREDICTION OF VO2PEAK USING OMNI RATINGS OF PERCEIVED EXERTION FROM A SUBMAXIMAL CYCLE EXERCISE TEST

    PubMed Central

    Mays, Ryan J.; Goss, Fredric L.; Nagle-Stilley, Elizabeth F.; Gallagher, Michael; Schafer, Mark A.; Kim, Kevin H.; Robertson, Robert J.

    2015-01-01

    Summary The primary aim of this study was to develop statistical models to predict peak oxygen consumption (VO2peak) using OMNI Ratings of Perceived Exertion measured during submaximal cycle ergometry. Men (mean ± standard error: 20.90 ± 0.42 yrs) and women (21.59 ± 0.49 yrs) participants (n = 81) completed a load-incremented maximal cycle ergometer exercise test. Simultaneous multiple linear regression was used to develop separate VO2peak statistical models using submaximal ratings of perceived exertion for the overall body, legs, and chest/breathing as predictor variables. VO2peak (L·min−1) predicted for men and women from ratings of perceived exertion for the overall body (3.02 ± 0.06; 2.03 ± 0.04), legs (3.02 ± 0.06; 2.04 ± 0.04) and chest/breathing (3.02 ± 0.05; 2.03 ± 0.03) were similar with measured VO2peak (3.02 ± 0.10; 2.03 ± 0.06, ps > .05). Statistical models based on submaximal OMNI Ratings of Perceived Exertion provide an easily administered and accurate method to predict VO2peak. PMID:25068750

  13. Symptom Clusters in Advanced Cancer Patients: An Empirical Comparison of Statistical Methods and the Impact on Quality of Life.

    PubMed

    Dong, Skye T; Costa, Daniel S J; Butow, Phyllis N; Lovell, Melanie R; Agar, Meera; Velikova, Galina; Teckle, Paulos; Tong, Allison; Tebbutt, Niall C; Clarke, Stephen J; van der Hoek, Kim; King, Madeleine T; Fayers, Peter M

    2016-01-01

    Symptom clusters in advanced cancer can influence patient outcomes. There is large heterogeneity in the methods used to identify symptom clusters. To investigate the consistency of symptom cluster composition in advanced cancer patients using different statistical methodologies for all patients across five primary cancer sites, and to examine which clusters predict functional status, a global assessment of health and global quality of life. Principal component analysis and exploratory factor analysis (with different rotation and factor selection methods) and hierarchical cluster analysis (with different linkage and similarity measures) were used on a data set of 1562 advanced cancer patients who completed the European Organization for the Research and Treatment of Cancer Quality of Life Questionnaire-Core 30. Four clusters consistently formed for many of the methods and cancer sites: tense-worry-irritable-depressed (emotional cluster), fatigue-pain, nausea-vomiting, and concentration-memory (cognitive cluster). The emotional cluster was a stronger predictor of overall quality of life than the other clusters. Fatigue-pain was a stronger predictor of overall health than the other clusters. The cognitive cluster and fatigue-pain predicted physical functioning, role functioning, and social functioning. The four identified symptom clusters were consistent across statistical methods and cancer types, although there were some noteworthy differences. Statistical derivation of symptom clusters is in need of greater methodological guidance. A psychosocial pathway in the management of symptom clusters may improve quality of life. Biological mechanisms underpinning symptom clusters need to be delineated by future research. A framework for evidence-based screening, assessment, treatment, and follow-up of symptom clusters in advanced cancer is essential. Copyright © 2016 American Academy of Hospice and Palliative Medicine. Published by Elsevier Inc. All rights reserved.

  14. Estimating statistical uncertainty of Monte Carlo efficiency-gain in the context of a correlated sampling Monte Carlo code for brachytherapy treatment planning with non-normal dose distribution.

    PubMed

    Mukhopadhyay, Nitai D; Sampson, Andrew J; Deniz, Daniel; Alm Carlsson, Gudrun; Williamson, Jeffrey; Malusek, Alexandr

    2012-01-01

    Correlated sampling Monte Carlo methods can shorten computing times in brachytherapy treatment planning. Monte Carlo efficiency is typically estimated via efficiency gain, defined as the reduction in computing time by correlated sampling relative to conventional Monte Carlo methods when equal statistical uncertainties have been achieved. The determination of the efficiency gain uncertainty arising from random effects, however, is not a straightforward task specially when the error distribution is non-normal. The purpose of this study is to evaluate the applicability of the F distribution and standardized uncertainty propagation methods (widely used in metrology to estimate uncertainty of physical measurements) for predicting confidence intervals about efficiency gain estimates derived from single Monte Carlo runs using fixed-collision correlated sampling in a simplified brachytherapy geometry. A bootstrap based algorithm was used to simulate the probability distribution of the efficiency gain estimates and the shortest 95% confidence interval was estimated from this distribution. It was found that the corresponding relative uncertainty was as large as 37% for this particular problem. The uncertainty propagation framework predicted confidence intervals reasonably well; however its main disadvantage was that uncertainties of input quantities had to be calculated in a separate run via a Monte Carlo method. The F distribution noticeably underestimated the confidence interval. These discrepancies were influenced by several photons with large statistical weights which made extremely large contributions to the scored absorbed dose difference. The mechanism of acquiring high statistical weights in the fixed-collision correlated sampling method was explained and a mitigation strategy was proposed. Copyright © 2011 Elsevier Ltd. All rights reserved.

  15. Improving the long-lead predictability of El Niño using a novel forecasting scheme based on a dynamic components model

    NASA Astrophysics Data System (ADS)

    Petrova, Desislava; Koopman, Siem Jan; Ballester, Joan; Rodó, Xavier

    2017-02-01

    El Niño (EN) is a dominant feature of climate variability on inter-annual time scales driving changes in the climate throughout the globe, and having wide-spread natural and socio-economic consequences. In this sense, its forecast is an important task, and predictions are issued on a regular basis by a wide array of prediction schemes and climate centres around the world. This study explores a novel method for EN forecasting. In the state-of-the-art the advantageous statistical technique of unobserved components time series modeling, also known as structural time series modeling, has not been applied. Therefore, we have developed such a model where the statistical analysis, including parameter estimation and forecasting, is based on state space methods, and includes the celebrated Kalman filter. The distinguishing feature of this dynamic model is the decomposition of a time series into a range of stochastically time-varying components such as level (or trend), seasonal, cycles of different frequencies, irregular, and regression effects incorporated as explanatory covariates. These components are modeled separately and ultimately combined in a single forecasting scheme. Customary statistical models for EN prediction essentially use SST and wind stress in the equatorial Pacific. In addition to these, we introduce a new domain of regression variables accounting for the state of the subsurface ocean temperature in the western and central equatorial Pacific, motivated by our analysis, as well as by recent and classical research, showing that subsurface processes and heat accumulation there are fundamental for the genesis of EN. An important feature of the scheme is that different regression predictors are used at different lead months, thus capturing the dynamical evolution of the system and rendering more efficient forecasts. The new model has been tested with the prediction of all warm events that occurred in the period 1996-2015. Retrospective forecasts of these events were made for long lead times of at least two and a half years. Hence, the present study demonstrates that the theoretical limit of ENSO prediction should be sought much longer than the commonly accepted "Spring Barrier". The high correspondence between the forecasts and observations indicates that the proposed model outperforms all current operational statistical models, and behaves comparably to the best dynamical models used for EN prediction. Thus, the novel way in which the modeling scheme has been structured could also be used for improving other statistical and dynamical modeling systems.

  16. Risk prediction models of breast cancer: a systematic review of model performances.

    PubMed

    Anothaisintawee, Thunyarat; Teerawattananon, Yot; Wiratkapun, Chollathip; Kasamesup, Vijj; Thakkinstian, Ammarin

    2012-05-01

    The number of risk prediction models has been increasingly developed, for estimating about breast cancer in individual women. However, those model performances are questionable. We therefore have conducted a study with the aim to systematically review previous risk prediction models. The results from this review help to identify the most reliable model and indicate the strengths and weaknesses of each model for guiding future model development. We searched MEDLINE (PubMed) from 1949 and EMBASE (Ovid) from 1974 until October 2010. Observational studies which constructed models using regression methods were selected. Information about model development and performance were extracted. Twenty-five out of 453 studies were eligible. Of these, 18 developed prediction models and 7 validated existing prediction models. Up to 13 variables were included in the models and sample sizes for each study ranged from 550 to 2,404,636. Internal validation was performed in four models, while five models had external validation. Gail and Rosner and Colditz models were the significant models which were subsequently modified by other scholars. Calibration performance of most models was fair to good (expected/observe ratio: 0.87-1.12), but discriminatory accuracy was poor to fair both in internal validation (concordance statistics: 0.53-0.66) and in external validation (concordance statistics: 0.56-0.63). Most models yielded relatively poor discrimination in both internal and external validation. This poor discriminatory accuracy of existing models might be because of a lack of knowledge about risk factors, heterogeneous subtypes of breast cancer, and different distributions of risk factors across populations. In addition the concordance statistic itself is insensitive to measure the improvement of discrimination. Therefore, the new method such as net reclassification index should be considered to evaluate the improvement of the performance of a new develop model.

  17. Validating Variational Bayes Linear Regression Method With Multi-Central Datasets.

    PubMed

    Murata, Hiroshi; Zangwill, Linda M; Fujino, Yuri; Matsuura, Masato; Miki, Atsuya; Hirasawa, Kazunori; Tanito, Masaki; Mizoue, Shiro; Mori, Kazuhiko; Suzuki, Katsuyoshi; Yamashita, Takehiro; Kashiwagi, Kenji; Shoji, Nobuyuki; Asaoka, Ryo

    2018-04-01

    To validate the prediction accuracy of variational Bayes linear regression (VBLR) with two datasets external to the training dataset. The training dataset consisted of 7268 eyes of 4278 subjects from the University of Tokyo Hospital. The Japanese Archive of Multicentral Databases in Glaucoma (JAMDIG) dataset consisted of 271 eyes of 177 patients, and the Diagnostic Innovations in Glaucoma Study (DIGS) dataset includes 248 eyes of 173 patients, which were used for validation. Prediction accuracy was compared between the VBLR and ordinary least squared linear regression (OLSLR). First, OLSLR and VBLR were carried out using total deviation (TD) values at each of the 52 test points from the second to fourth visual fields (VFs) (VF2-4) to 2nd to 10th VF (VF2-10) of each patient in JAMDIG and DIGS datasets, and the TD values of the 11th VF test were predicted every time. The predictive accuracy of each method was compared through the root mean squared error (RMSE) statistic. OLSLR RMSEs with the JAMDIG and DIGS datasets were between 31 and 4.3 dB, and between 19.5 and 3.9 dB. On the other hand, VBLR RMSEs with JAMDIG and DIGS datasets were between 5.0 and 3.7, and between 4.6 and 3.6 dB. There was statistically significant difference between VBLR and OLSLR for both datasets at every series (VF2-4 to VF2-10) (P < 0.01 for all tests). However, there was no statistically significant difference in VBLR RMSEs between JAMDIG and DIGS datasets at any series of VFs (VF2-2 to VF2-10) (P > 0.05). VBLR outperformed OLSLR to predict future VF progression, and the VBLR has a potential to be a helpful tool at clinical settings.

  18. Statistical Approaches to Interpretation of Local, Regional, and National Highway-Runoff and Urban-Stormwater Data

    USGS Publications Warehouse

    Tasker, Gary D.; Granato, Gregory E.

    2000-01-01

    Decision makers need viable methods for the interpretation of local, regional, and national-highway runoff and urban-stormwater data including flows, concentrations and loads of chemical constituents and sediment, potential effects on receiving waters, and the potential effectiveness of various best management practices (BMPs). Valid (useful for intended purposes), current, and technically defensible stormwater-runoff models are needed to interpret data collected in field studies, to support existing highway and urban-runoffplanning processes, to meet National Pollutant Discharge Elimination System (NPDES) requirements, and to provide methods for computation of Total Maximum Daily Loads (TMDLs) systematically and economically. Historically, conceptual, simulation, empirical, and statistical models of varying levels of detail, complexity, and uncertainty have been used to meet various data-quality objectives in the decision-making processes necessary for the planning, design, construction, and maintenance of highways and for other land-use applications. Water-quality simulation models attempt a detailed representation of the physical processes and mechanisms at a given site. Empirical and statistical regional water-quality assessment models provide a more general picture of water quality or changes in water quality over a region. All these modeling techniques share one common aspect-their predictive ability is poor without suitable site-specific data for calibration. To properly apply the correct model, one must understand the classification of variables, the unique characteristics of water-resources data, and the concept of population structure and analysis. Classifying variables being used to analyze data may determine which statistical methods are appropriate for data analysis. An understanding of the characteristics of water-resources data is necessary to evaluate the applicability of different statistical methods, to interpret the results of these techniques, and to use tools and techniques that account for the unique nature of water-resources data sets. Populations of data on stormwater-runoff quantity and quality are often best modeled as logarithmic transformations. Therefore, these factors need to be considered to form valid, current, and technically defensible stormwater-runoff models. Regression analysis is an accepted method for interpretation of water-resources data and for prediction of current or future conditions at sites that fit the input data model. Regression analysis is designed to provide an estimate of the average response of a system as it relates to variation in one or more known variables. To produce valid models, however, regression analysis should include visual analysis of scatterplots, an examination of the regression equation, evaluation of the method design assumptions, and regression diagnostics. A number of statistical techniques are described in the text and in the appendixes to provide information necessary to interpret data by use of appropriate methods. Uncertainty is an important part of any decisionmaking process. In order to deal with uncertainty problems, the analyst needs to know the severity of the statistical uncertainty of the methods used to predict water quality. Statistical models need to be based on information that is meaningful, representative, complete, precise, accurate, and comparable to be deemed valid, up to date, and technically supportable. To assess uncertainty in the analytical tools, the modeling methods, and the underlying data set, all of these components need be documented and communicated in an accessible format within project publications.

  19. Adapt-Mix: learning local genetic correlation structure improves summary statistics-based analyses

    PubMed Central

    Park, Danny S.; Brown, Brielin; Eng, Celeste; Huntsman, Scott; Hu, Donglei; Torgerson, Dara G.; Burchard, Esteban G.; Zaitlen, Noah

    2015-01-01

    Motivation: Approaches to identifying new risk loci, training risk prediction models, imputing untyped variants and fine-mapping causal variants from summary statistics of genome-wide association studies are playing an increasingly important role in the human genetics community. Current summary statistics-based methods rely on global ‘best guess’ reference panels to model the genetic correlation structure of the dataset being studied. This approach, especially in admixed populations, has the potential to produce misleading results, ignores variation in local structure and is not feasible when appropriate reference panels are missing or small. Here, we develop a method, Adapt-Mix, that combines information across all available reference panels to produce estimates of local genetic correlation structure for summary statistics-based methods in arbitrary populations. Results: We applied Adapt-Mix to estimate the genetic correlation structure of both admixed and non-admixed individuals using simulated and real data. We evaluated our method by measuring the performance of two summary statistics-based methods: imputation and joint-testing. When using our method as opposed to the current standard of ‘best guess’ reference panels, we observed a 28% decrease in mean-squared error for imputation and a 73.7% decrease in mean-squared error for joint-testing. Availability and implementation: Our method is publicly available in a software package called ADAPT-Mix available at https://github.com/dpark27/adapt_mix. Contact: noah.zaitlen@ucsf.edu PMID:26072481

  20. The Use of Artificial Neural Network for Prediction of Dissolution Kinetics

    PubMed Central

    Elçiçek, H.; Akdoğan, E.; Karagöz, S.

    2014-01-01

    Colemanite is a preferred boron mineral in industry, such as boric acid production, fabrication of heat resistant glass, and cleaning agents. Dissolution of the mineral is one of the most important processes for these industries. In this study, dissolution of colemanite was examined in water saturated with carbon dioxide solutions. Also, prediction of dissolution rate was determined using artificial neural networks (ANNs) which are based on the multilayered perceptron. Reaction temperature, total pressure, stirring speed, solid/liquid ratio, particle size, and reaction time were selected as input parameters to predict the dissolution rate. Experimental dataset was used to train multilayer perceptron (MLP) networks to allow for prediction of dissolution kinetics. Developing ANNs has provided highly accurate predictions in comparison with an obtained mathematical model used through regression method. We conclude that ANNs may be a preferred alternative approach instead of conventional statistical methods for prediction of boron minerals. PMID:25028674

  1. Testing alternative ground water models using cross-validation and other methods

    USGS Publications Warehouse

    Foglia, L.; Mehl, S.W.; Hill, M.C.; Perona, P.; Burlando, P.

    2007-01-01

    Many methods can be used to test alternative ground water models. Of concern in this work are methods able to (1) rank alternative models (also called model discrimination) and (2) identify observations important to parameter estimates and predictions (equivalent to the purpose served by some types of sensitivity analysis). Some of the measures investigated are computationally efficient; others are computationally demanding. The latter are generally needed to account for model nonlinearity. The efficient model discrimination methods investigated include the information criteria: the corrected Akaike information criterion, Bayesian information criterion, and generalized cross-validation. The efficient sensitivity analysis measures used are dimensionless scaled sensitivity (DSS), composite scaled sensitivity, and parameter correlation coefficient (PCC); the other statistics are DFBETAS, Cook's D, and observation-prediction statistic. Acronyms are explained in the introduction. Cross-validation (CV) is a computationally intensive nonlinear method that is used for both model discrimination and sensitivity analysis. The methods are tested using up to five alternative parsimoniously constructed models of the ground water system of the Maggia Valley in southern Switzerland. The alternative models differ in their representation of hydraulic conductivity. A new method for graphically representing CV and sensitivity analysis results for complex models is presented and used to evaluate the utility of the efficient statistics. The results indicate that for model selection, the information criteria produce similar results at much smaller computational cost than CV. For identifying important observations, the only obviously inferior linear measure is DSS; the poor performance was expected because DSS does not include the effects of parameter correlation and PCC reveals large parameter correlations. ?? 2007 National Ground Water Association.

  2. Moments of inclination error distribution computer program

    NASA Technical Reports Server (NTRS)

    Myler, T. R.

    1981-01-01

    A FORTRAN coded computer program is described which calculates orbital inclination error statistics using a closed-form solution. This solution uses a data base of trajectory errors from actual flights to predict the orbital inclination error statistics. The Scott flight history data base consists of orbit insertion errors in the trajectory parameters - altitude, velocity, flight path angle, flight azimuth, latitude and longitude. The methods used to generate the error statistics are of general interest since they have other applications. Program theory, user instructions, output definitions, subroutine descriptions and detailed FORTRAN coding information are included.

  3. Empirical best linear unbiased prediction method for small areas with restricted maximum likelihood and bootstrap procedure to estimate the average of household expenditure per capita in Banjar Regency

    NASA Astrophysics Data System (ADS)

    Aminah, Agustin Siti; Pawitan, Gandhi; Tantular, Bertho

    2017-03-01

    So far, most of the data published by Statistics Indonesia (BPS) as data providers for national statistics are still limited to the district level. Less sufficient sample size for smaller area levels to make the measurement of poverty indicators with direct estimation produced high standard error. Therefore, the analysis based on it is unreliable. To solve this problem, the estimation method which can provide a better accuracy by combining survey data and other auxiliary data is required. One method often used for the estimation is the Small Area Estimation (SAE). There are many methods used in SAE, one of them is Empirical Best Linear Unbiased Prediction (EBLUP). EBLUP method of maximum likelihood (ML) procedures does not consider the loss of degrees of freedom due to estimating β with β ^. This drawback motivates the use of the restricted maximum likelihood (REML) procedure. This paper proposed EBLUP with REML procedure for estimating poverty indicators by modeling the average of household expenditures per capita and implemented bootstrap procedure to calculate MSE (Mean Square Error) to compare the accuracy EBLUP method with the direct estimation method. Results show that EBLUP method reduced MSE in small area estimation.

  4. Nonparametric method for failures diagnosis in the actuating subsystem of aircraft control system

    NASA Astrophysics Data System (ADS)

    Terentev, M. N.; Karpenko, S. S.; Zybin, E. Yu; Kosyanchuk, V. V.

    2018-02-01

    In this paper we design a nonparametric method for failures diagnosis in the aircraft control system that uses the measurements of the control signals and the aircraft states only. It doesn’t require a priori information of the aircraft model parameters, training or statistical calculations, and is based on analytical nonparametric one-step-ahead state prediction approach. This makes it possible to predict the behavior of unidentified and failure dynamic systems, to weaken the requirements to control signals, and to reduce the diagnostic time and problem complexity.

  5. A simple formula for insertion loss prediction of large acoustical enclosures using statistical energy analysis method

    NASA Astrophysics Data System (ADS)

    Kim, Hyun-Sil; Kim, Jae-Seung; Lee, Seong-Hyun; Seo, Yun-Ho

    2014-12-01

    Insertion loss prediction of large acoustical enclosures using Statistical Energy Analysis (SEA) method is presented. The SEA model consists of three elements: sound field inside the enclosure, vibration energy of the enclosure panel, and sound field outside the enclosure. It is assumed that the space surrounding the enclosure is sufficiently large so that there is no energy flow from the outside to the wall panel or to air cavity inside the enclosure. The comparison of the predicted insertion loss to the measured data for typical large acoustical enclosures shows good agreements. It is found that if the critical frequency of the wall panel falls above the frequency region of interest, insertion loss is dominated by the sound transmission loss of the wall panel and averaged sound absorption coefficient inside the enclosure. However, if the critical frequency of the wall panel falls into the frequency region of interest, acoustic power from the sound radiation by the wall panel must be added to the acoustic power from transmission through the panel.

  6. Bayesian Estimation of Thermonuclear Reaction Rates for Deuterium+Deuterium Reactions

    NASA Astrophysics Data System (ADS)

    Gómez Iñesta, Á.; Iliadis, C.; Coc, A.

    2017-11-01

    The study of d+d reactions is of major interest since their reaction rates affect the predicted abundances of D, 3He, and 7Li. In particular, recent measurements of primordial D/H ratios call for reduced uncertainties in the theoretical abundances predicted by Big Bang nucleosynthesis (BBN). Different authors have studied reactions involved in BBN by incorporating new experimental data and a careful treatment of systematic and probabilistic uncertainties. To analyze the experimental data, Coc et al. used results of ab initio models for the theoretical calculation of the energy dependence of S-factors in conjunction with traditional statistical methods based on χ 2 minimization. Bayesian methods have now spread to many scientific fields and provide numerous advantages in data analysis. Astrophysical S-factors and reaction rates using Bayesian statistics were calculated by Iliadis et al. Here we present a similar analysis for two d+d reactions, d(d, n)3He and d(d, p)3H, that has been translated into a total decrease of the predicted D/H value by 0.16%.

  7. FIT: statistical modeling tool for transcriptome dynamics under fluctuating field conditions

    PubMed Central

    Iwayama, Koji; Aisaka, Yuri; Kutsuna, Natsumaro

    2017-01-01

    Abstract Motivation: Considerable attention has been given to the quantification of environmental effects on organisms. In natural conditions, environmental factors are continuously changing in a complex manner. To reveal the effects of such environmental variations on organisms, transcriptome data in field environments have been collected and analyzed. Nagano et al. proposed a model that describes the relationship between transcriptomic variation and environmental conditions and demonstrated the capability to predict transcriptome variation in rice plants. However, the computational cost of parameter optimization has prevented its wide application. Results: We propose a new statistical model and efficient parameter optimization based on the previous study. We developed and released FIT, an R package that offers functions for parameter optimization and transcriptome prediction. The proposed method achieves comparable or better prediction performance within a shorter computational time than the previous method. The package will facilitate the study of the environmental effects on transcriptomic variation in field conditions. Availability and Implementation: Freely available from CRAN (https://cran.r-project.org/web/packages/FIT/). Contact: anagano@agr.ryukoku.ac.jp Supplementary information: Supplementary data are available at Bioinformatics online PMID:28158396

  8. Regression modeling of particle size distributions in urban storm water: advancements through improved sample collection methods

    USGS Publications Warehouse

    Fienen, Michael N.; Selbig, William R.

    2012-01-01

    A new sample collection system was developed to improve the representation of sediment entrained in urban storm water by integrating water quality samples from the entire water column. The depth-integrated sampler arm (DISA) was able to mitigate sediment stratification bias in storm water, thereby improving the characterization of suspended-sediment concentration and particle size distribution at three independent study locations. Use of the DISA decreased variability, which improved statistical regression to predict particle size distribution using surrogate environmental parameters, such as precipitation depth and intensity. The performance of this statistical modeling technique was compared to results using traditional fixed-point sampling methods and was found to perform better. When environmental parameters can be used to predict particle size distributions, environmental managers have more options when characterizing concentrations, loads, and particle size distributions in urban runoff.

  9. From plant traits to plant communities: a statistical mechanistic approach to biodiversity.

    PubMed

    Shipley, Bill; Vile, Denis; Garnier, Eric

    2006-11-03

    We developed a quantitative method, analogous to those used in statistical mechanics, to predict how biodiversity will vary across environments, which plant species from a species pool will be found in which relative abundances in a given environment, and which plant traits determine community assembly. This provides a scaling from plant traits to ecological communities while bypassing the complications of population dynamics. Our method treats community development as a sorting process involving species that are ecologically equivalent except with respect to particular functional traits, which leads to a constrained random assembly of species; the relative abundance of each species adheres to a general exponential distribution as a function of its traits. Using data for eight functional traits of 30 herbaceous species and community-aggregated values of these traits in 12 sites along a 42-year chronosequence of secondary succession, we predicted 94% of the variance in the relative abundances.

  10. Personalized Modeling for Prediction with Decision-Path Models

    PubMed Central

    Visweswaran, Shyam; Ferreira, Antonio; Ribeiro, Guilherme A.; Oliveira, Alexandre C.; Cooper, Gregory F.

    2015-01-01

    Deriving predictive models in medicine typically relies on a population approach where a single model is developed from a dataset of individuals. In this paper we describe and evaluate a personalized approach in which we construct a new type of decision tree model called decision-path model that takes advantage of the particular features of a given person of interest. We introduce three personalized methods that derive personalized decision-path models. We compared the performance of these methods to that of Classification And Regression Tree (CART) that is a population decision tree to predict seven different outcomes in five medical datasets. Two of the three personalized methods performed statistically significantly better on area under the ROC curve (AUC) and Brier skill score compared to CART. The personalized approach of learning decision path models is a new approach for predictive modeling that can perform better than a population approach. PMID:26098570

  11. IBM Watson Analytics: Automating Visualization, Descriptive, and Predictive Statistics

    PubMed Central

    2016-01-01

    Background We live in an era of explosive data generation that will continue to grow and involve all industries. One of the results of this explosion is the need for newer and more efficient data analytics procedures. Traditionally, data analytics required a substantial background in statistics and computer science. In 2015, International Business Machines Corporation (IBM) released the IBM Watson Analytics (IBMWA) software that delivered advanced statistical procedures based on the Statistical Package for the Social Sciences (SPSS). The latest entry of Watson Analytics into the field of analytical software products provides users with enhanced functions that are not available in many existing programs. For example, Watson Analytics automatically analyzes datasets, examines data quality, and determines the optimal statistical approach. Users can request exploratory, predictive, and visual analytics. Using natural language processing (NLP), users are able to submit additional questions for analyses in a quick response format. This analytical package is available free to academic institutions (faculty and students) that plan to use the tools for noncommercial purposes. Objective To report the features of IBMWA and discuss how this software subjectively and objectively compares to other data mining programs. Methods The salient features of the IBMWA program were examined and compared with other common analytical platforms, using validated health datasets. Results Using a validated dataset, IBMWA delivered similar predictions compared with several commercial and open source data mining software applications. The visual analytics generated by IBMWA were similar to results from programs such as Microsoft Excel and Tableau Software. In addition, assistance with data preprocessing and data exploration was an inherent component of the IBMWA application. Sensitivity and specificity were not included in the IBMWA predictive analytics results, nor were odds ratios, confidence intervals, or a confusion matrix. Conclusions IBMWA is a new alternative for data analytics software that automates descriptive, predictive, and visual analytics. This program is very user-friendly but requires data preprocessing, statistical conceptual understanding, and domain expertise. PMID:27729304

  12. Primary Sclerosing Cholangitis Risk Estimate Tool (PREsTo) Predicts Outcomes in PSC: A Derivation & Validation Study Using Machine Learning.

    PubMed

    Eaton, John E; Vesterhus, Mette; McCauley, Bryan M; Atkinson, Elizabeth J; Schlicht, Erik M; Juran, Brian D; Gossard, Andrea A; LaRusso, Nicholas F; Gores, Gregory J; Karlsen, Tom H; Lazaridis, Konstantinos N

    2018-05-09

    Improved methods are needed to risk stratify and predict outcomes in patients with primary sclerosing cholangitis (PSC). Therefore, we sought to derive and validate a new prediction model and compare its performance to existing surrogate markers. The model was derived using 509 subjects from a multicenter North American cohort and validated in an international multicenter cohort (n=278). Gradient boosting, a machine based learning technique, was used to create the model. The endpoint was hepatic decompensation (ascites, variceal hemorrhage or encephalopathy). Subjects with advanced PSC or cholangiocarcinoma at baseline were excluded. The PSC risk estimate tool (PREsTo) consists of 9 variables: bilirubin, albumin, serum alkaline phosphatase (SAP) times the upper limit of normal (ULN), platelets, AST, hemoglobin, sodium, patient age and the number of years since PSC was diagnosed. Validation in an independent cohort confirms PREsTo accurately predicts decompensation (C statistic 0.90, 95% confidence interval (CI) 0.84-0.95) and performed well compared to MELD score (C statistic 0.72, 95% CI 0.57-0.84), Mayo PSC risk score (C statistic 0.85, 95% CI 0.77-0.92) and SAP < 1.5x ULN (C statistic 0.65, 95% CI 0.55-0.73). PREsTo continued to be accurate among individuals with a bilirubin < 2.0 mg/dL (C statistic 0.90, 95% CI 0.82-0.96) and when the score was re-applied at a later course in the disease (C statistic 0.82, 95% CI 0.64-0.95). PREsTo accurately predicts hepatic decompensation in PSC and exceeds the performance among other widely available, noninvasive prognostic scoring systems. This article is protected by copyright. All rights reserved. © 2018 by the American Association for the Study of Liver Diseases.

  13. Evaluation of the Risk of Grade 3 Oral and Pharyngeal Dysphagia Using Atlas-Based Method and Multivariate Analyses of Individual Patient Dose Distributions

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Otter, Sophie; Schick, Ulrike; Gulliford, Sarah

    Purpose: The study aimed to apply the atlas of complication incidence (ACI) method to patients receiving radical treatment for head and neck squamous cell carcinomas (HNSCC), to generate constraints based on dose-volume histograms (DVHs), and to identify clinical and dosimetric parameters that predict the risk of grade 3 oral mucositis (g3OM) and pharyngeal dysphagia (g3PD). Methods and Materials: Oral and pharyngeal mucosal DVHs were generated for 253 patients who received radiation (RT) or chemoradiation (CRT). They were used to produce ACI for g3OM and g3PD. Multivariate analysis (MVA) of the effect of dosimetry, clinical, and patient-related variables was performed usingmore » logistic regression and bootstrapping. Receiver operating curve (ROC) analysis was also performed, and the Youden index was used to find volume constraints that discriminated between volumes that predicted for toxicity. Results: We derived statistically significant dose-volume constraints for g3OM over the range v28 to v70. Only 3 statistically significant constraints were derived for g3PD v67, v68, and v69. On MVA, mean dose to the oral mucosa predicted for g3OM and concomitant chemotherapy and mean dose to the inferior constrictor (IC) predicted for g3PD. Conclusions: We have used the ACI method to evaluate incidences of g3OM and g3PD and ROC analysis to generate constraints to predict g3OM and g3PD derived from entire individual patient DVHs. On MVA, the strongest predictors were radiation dose (for g3OM) and concomitant chemotherapy (for g3PD).« less

  14. Predicting radiotherapy outcomes using statistical learning techniques

    NASA Astrophysics Data System (ADS)

    El Naqa, Issam; Bradley, Jeffrey D.; Lindsay, Patricia E.; Hope, Andrew J.; Deasy, Joseph O.

    2009-09-01

    Radiotherapy outcomes are determined by complex interactions between treatment, anatomical and patient-related variables. A common obstacle to building maximally predictive outcome models for clinical practice is the failure to capture potential complexity of heterogeneous variable interactions and applicability beyond institutional data. We describe a statistical learning methodology that can automatically screen for nonlinear relations among prognostic variables and generalize to unseen data before. In this work, several types of linear and nonlinear kernels to generate interaction terms and approximate the treatment-response function are evaluated. Examples of institutional datasets of esophagitis, pneumonitis and xerostomia endpoints were used. Furthermore, an independent RTOG dataset was used for 'generalizabilty' validation. We formulated the discrimination between risk groups as a supervised learning problem. The distribution of patient groups was initially analyzed using principle components analysis (PCA) to uncover potential nonlinear behavior. The performance of the different methods was evaluated using bivariate correlations and actuarial analysis. Over-fitting was controlled via cross-validation resampling. Our results suggest that a modified support vector machine (SVM) kernel method provided superior performance on leave-one-out testing compared to logistic regression and neural networks in cases where the data exhibited nonlinear behavior on PCA. For instance, in prediction of esophagitis and pneumonitis endpoints, which exhibited nonlinear behavior on PCA, the method provided 21% and 60% improvements, respectively. Furthermore, evaluation on the independent pneumonitis RTOG dataset demonstrated good generalizabilty beyond institutional data in contrast with other models. This indicates that the prediction of treatment response can be improved by utilizing nonlinear kernel methods for discovering important nonlinear interactions among model variables. These models have the capacity to predict on unseen data. Part of this work was first presented at the Seventh International Conference on Machine Learning and Applications, San Diego, CA, USA, 11-13 December 2008.

  15. Are there pollination syndromes in the Australian epacrids (Ericaceae: Styphelioideae)? A novel statistical method to identify key floral traits per syndrome

    PubMed Central

    Johnson, Karen A.

    2013-01-01

    Background and Aims Convergent floral traits hypothesized as attracting particular pollinators are known as pollination syndromes. Floral diversity suggests that the Australian epacrid flora may be adapted to pollinator type. Currently there are empirical data on the pollination systems for 87 species (approx. 15 % of Australian epacrids). This provides an opportunity to test for pollination syndromes and their important morphological traits in an iconic element of the Australian flora. Methods Data on epacrid–pollinator relationships were obtained from published literature and field observation. A multivariate approach was used to test whether epacrid floral attributes related to pollinator profiles. Statistical classification was then used to rank floral attributes according to their predictive value. Data sets excluding mixed pollination systems were used to test the predictive power of statistical classification to identify pollination models. Key Results Floral attributes are correlated with bird, fly and bee pollination. Using floral attributes identified as correlating with pollinator type, bird pollination is classified with 86 % accuracy, red flowers being the most important predictor. Fly and bee pollination are classified with 78 and 69 % accuracy, but have a lack of individually important floral predictors. Excluding mixed pollination systems improved the accuracy of the prediction of both bee and fly pollination systems. Conclusions Although most epacrids have generalized pollination systems, a correlation between bird pollination and red, long-tubed epacrids is found. Statistical classification highlights the relative importance of each floral attribute in relation to pollinator type and proves useful in classifying epacrids to bird, fly and bee pollination systems. PMID:23681546

  16. Refining calibration and predictions of a Bayesian statistical-dynamical model for long term avalanche forecasting using dendrochronological reconstructions

    NASA Astrophysics Data System (ADS)

    Eckert, Nicolas; Schläppy, Romain; Jomelli, Vincent; Naaim, Mohamed

    2013-04-01

    A crucial step for proposing relevant long-term mitigation measures in long term avalanche forecasting is the accurate definition of high return period avalanches. Recently, "statistical-dynamical" approach combining a numerical model with stochastic operators describing the variability of its inputs-outputs have emerged. Their main interests is to take into account the topographic dependency of snow avalanche runout distances, and to constrain the correlation structure between model's variables by physical rules, so as to simulate the different marginal distributions of interest (pressure, flow depth, etc.) with a reasonable realism. Bayesian methods have been shown to be well adapted to achieve model inference, getting rid of identifiability problems thanks to prior information. An important problem which has virtually never been considered before is the validation of the predictions resulting from a statistical-dynamical approach (or from any other engineering method for computing extreme avalanches). In hydrology, independent "fossil" data such as flood deposits in caves are sometimes confronted to design discharges corresponding to high return periods. Hence, the aim of this work is to implement a similar comparison between high return period avalanches obtained with a statistical-dynamical approach and independent validation data resulting from careful dendrogeomorphological reconstructions. To do so, an up-to-date statistical model based on the depth-averaged equations and the classical Voellmy friction law is used on a well-documented case study. First, parameter values resulting from another path are applied, and the dendrological validation sample shows that this approach fails in providing realistic prediction for the case study. This may be due to the strongly bounded behaviour of runouts in this case (the extreme of their distribution is identified as belonging to the Weibull attraction domain). Second, local calibration on the available avalanche chronicle is performed with various prior distributions resulting from expert knowledge and/or other paths. For all calibrations, a very successful convergence is obtained, which confirms the robustness of the used Metropolis-Hastings estimation algorithm. This also demonstrates the interest of the Bayesian framework for aggregating information by sequential assimilation in the frequently encountered case of limited data quantity. Confrontation with the dendrological sample stresses the predominant role of the Coulombian friction coefficient distribution's variance on predicted high magnitude runouts. The optimal fit is obtained for a strong prior reflecting the local bounded behavior, and results in a 10-40 m difference for return periods ranging between 10 and 300 years. Implementing predictive simulations shows that this is largely within the range of magnitude of uncertainties to be taken into account. On the other hand, the different priors tested for the turbulent friction coefficient influence predictive performances only slightly, but have a large influence on predicted velocity and flow depth distributions. This all may be of high interest to refine calibration and predictive use of the statistical-dynamical model for any engineering application.

  17. Multivariate Analysis and Prediction of Dioxin-Furan ...

    EPA Pesticide Factsheets

    Peer Review Draft of Regional Methods Initiative Final Report Dioxins, which are bioaccumulative and environmentally persistent, pose an ongoing risk to human and ecosystem health. Fish constitute a significant source of dioxin exposure for humans and fish-eating wildlife. Current dioxin analytical methods are costly, time-consuming, and produce hazardous by-products. A Danish team developed a novel, multivariate statistical methodology based on the covariance of dioxin-furan congener Toxic Equivalences (TEQs) and fatty acid methyl esters (FAMEs) and applied it to North Atlantic Ocean fishmeal samples. The goal of the current study was to attempt to extend this Danish methodology to 77 whole and composite fish samples from three trophic groups: predator (whole largemouth bass), benthic (whole flathead and channel catfish) and forage fish (composite bluegill, pumpkinseed and green sunfish) from two dioxin contaminated rivers (Pocatalico R. and Kanawha R.) in West Virginia, USA. Multivariate statistical analyses, including, Principal Components Analysis (PCA), Hierarchical Clustering, and Partial Least Squares Regression (PLS), were used to assess the relationship between the FAMEs and TEQs in these dioxin contaminated freshwater fish from the Kanawha and Pocatalico Rivers. These three multivariate statistical methods all confirm that the pattern of Fatty Acid Methyl Esters (FAMEs) in these freshwater fish covaries with and is predictive of the WHO TE

  18. Forecasting daily source air quality using multivariate statistical analysis and radial basis function networks.

    PubMed

    Sun, Gang; Hoff, Steven J; Zelle, Brian C; Nelson, Minda A

    2008-12-01

    It is vital to forecast gas and particle matter concentrations and emission rates (GPCER) from livestock production facilities to assess the impact of airborne pollutants on human health, ecological environment, and global warming. Modeling source air quality is a complex process because of abundant nonlinear interactions between GPCER and other factors. The objective of this study was to introduce statistical methods and radial basis function (RBF) neural network to predict daily source air quality in Iowa swine deep-pit finishing buildings. The results show that four variables (outdoor and indoor temperature, animal units, and ventilation rates) were identified as relative important model inputs using statistical methods. It can be further demonstrated that only two factors, the environment factor and the animal factor, were capable of explaining more than 94% of the total variability after performing principal component analysis. The introduction of fewer uncorrelated variables to the neural network would result in the reduction of the model structure complexity, minimize computation cost, and eliminate model overfitting problems. The obtained results of RBF network prediction were in good agreement with the actual measurements, with values of the correlation coefficient between 0.741 and 0.995 and very low values of systemic performance indexes for all the models. The good results indicated the RBF network could be trained to model these highly nonlinear relationships. Thus, the RBF neural network technology combined with multivariate statistical methods is a promising tool for air pollutant emissions modeling.

  19. Identifying trends in climate: an application to the cenozoic

    NASA Astrophysics Data System (ADS)

    Richards, Gordon R.

    1998-05-01

    The recent literature on trending in climate has raised several issues, whether trends should be modeled as deterministic or stochastic, whether trends are nonlinear, and the relative merits of statistical models versus models based on physics. This article models trending since the late Cretaceous. This 68 million-year interval is selected because the reliability of tests for trending is critically dependent on the length of time spanned by the data. Two main hypotheses are tested, that the trend has been caused primarily by CO2 forcing, and that it reflects a variety of forcing factors which can be approximated by statistical methods. The CO2 data is obtained from model simulations. Several widely-used statistical models are found to be inadequate. ARIMA methods parameterize too much of the short-term variation, and do not identify low frequency movements. Further, the unit root in the ARIMA process does not predict the long-term path of temperature. Spectral methods also have little ability to predict temperature at long horizons. Instead, the statistical trend is estimated using a nonlinear smoothing filter. Both of these paradigms make it possible to model climate as a cointegrated process, in which temperature can wander quite far from the trend path in the intermediate term, but converges back over longer horizons. Comparing the forecasting properties of the two trend models demonstrates that the optimal forecasting model includes CO2 forcing and a parametric representation of the nonlinear variability in climate.

  20. Spatio-Chromatic Adaptation via Higher-Order Canonical Correlation Analysis of Natural Images

    PubMed Central

    Gutmann, Michael U.; Laparra, Valero; Hyvärinen, Aapo; Malo, Jesús

    2014-01-01

    Independent component and canonical correlation analysis are two general-purpose statistical methods with wide applicability. In neuroscience, independent component analysis of chromatic natural images explains the spatio-chromatic structure of primary cortical receptive fields in terms of properties of the visual environment. Canonical correlation analysis explains similarly chromatic adaptation to different illuminations. But, as we show in this paper, neither of the two methods generalizes well to explain both spatio-chromatic processing and adaptation at the same time. We propose a statistical method which combines the desirable properties of independent component and canonical correlation analysis: It finds independent components in each data set which, across the two data sets, are related to each other via linear or higher-order correlations. The new method is as widely applicable as canonical correlation analysis, and also to more than two data sets. We call it higher-order canonical correlation analysis. When applied to chromatic natural images, we found that it provides a single (unified) statistical framework which accounts for both spatio-chromatic processing and adaptation. Filters with spatio-chromatic tuning properties as in the primary visual cortex emerged and corresponding-colors psychophysics was reproduced reasonably well. We used the new method to make a theory-driven testable prediction on how the neural response to colored patterns should change when the illumination changes. We predict shifts in the responses which are comparable to the shifts reported for chromatic contrast habituation. PMID:24533049

  1. Spatio-chromatic adaptation via higher-order canonical correlation analysis of natural images.

    PubMed

    Gutmann, Michael U; Laparra, Valero; Hyvärinen, Aapo; Malo, Jesús

    2014-01-01

    Independent component and canonical correlation analysis are two general-purpose statistical methods with wide applicability. In neuroscience, independent component analysis of chromatic natural images explains the spatio-chromatic structure of primary cortical receptive fields in terms of properties of the visual environment. Canonical correlation analysis explains similarly chromatic adaptation to different illuminations. But, as we show in this paper, neither of the two methods generalizes well to explain both spatio-chromatic processing and adaptation at the same time. We propose a statistical method which combines the desirable properties of independent component and canonical correlation analysis: It finds independent components in each data set which, across the two data sets, are related to each other via linear or higher-order correlations. The new method is as widely applicable as canonical correlation analysis, and also to more than two data sets. We call it higher-order canonical correlation analysis. When applied to chromatic natural images, we found that it provides a single (unified) statistical framework which accounts for both spatio-chromatic processing and adaptation. Filters with spatio-chromatic tuning properties as in the primary visual cortex emerged and corresponding-colors psychophysics was reproduced reasonably well. We used the new method to make a theory-driven testable prediction on how the neural response to colored patterns should change when the illumination changes. We predict shifts in the responses which are comparable to the shifts reported for chromatic contrast habituation.

  2. Assessing Discriminative Performance at External Validation of Clinical Prediction Models.

    PubMed

    Nieboer, Daan; van der Ploeg, Tjeerd; Steyerberg, Ewout W

    2016-01-01

    External validation studies are essential to study the generalizability of prediction models. Recently a permutation test, focusing on discrimination as quantified by the c-statistic, was proposed to judge whether a prediction model is transportable to a new setting. We aimed to evaluate this test and compare it to previously proposed procedures to judge any changes in c-statistic from development to external validation setting. We compared the use of the permutation test to the use of benchmark values of the c-statistic following from a previously proposed framework to judge transportability of a prediction model. In a simulation study we developed a prediction model with logistic regression on a development set and validated them in the validation set. We concentrated on two scenarios: 1) the case-mix was more heterogeneous and predictor effects were weaker in the validation set compared to the development set, and 2) the case-mix was less heterogeneous in the validation set and predictor effects were identical in the validation and development set. Furthermore we illustrated the methods in a case study using 15 datasets of patients suffering from traumatic brain injury. The permutation test indicated that the validation and development set were homogenous in scenario 1 (in almost all simulated samples) and heterogeneous in scenario 2 (in 17%-39% of simulated samples). Previously proposed benchmark values of the c-statistic and the standard deviation of the linear predictors correctly pointed at the more heterogeneous case-mix in scenario 1 and the less heterogeneous case-mix in scenario 2. The recently proposed permutation test may provide misleading results when externally validating prediction models in the presence of case-mix differences between the development and validation population. To correctly interpret the c-statistic found at external validation it is crucial to disentangle case-mix differences from incorrect regression coefficients.

  3. Three methods to construct predictive models using logistic regression and likelihood ratios to facilitate adjustment for pretest probability give similar results.

    PubMed

    Chan, Siew Foong; Deeks, Jonathan J; Macaskill, Petra; Irwig, Les

    2008-01-01

    To compare three predictive models based on logistic regression to estimate adjusted likelihood ratios allowing for interdependency between diagnostic variables (tests). This study was a review of the theoretical basis, assumptions, and limitations of published models; and a statistical extension of methods and application to a case study of the diagnosis of obstructive airways disease based on history and clinical examination. Albert's method includes an offset term to estimate an adjusted likelihood ratio for combinations of tests. Spiegelhalter and Knill-Jones method uses the unadjusted likelihood ratio for each test as a predictor and computes shrinkage factors to allow for interdependence. Knottnerus' method differs from the other methods because it requires sequencing of tests, which limits its application to situations where there are few tests and substantial data. Although parameter estimates differed between the models, predicted "posttest" probabilities were generally similar. Construction of predictive models using logistic regression is preferred to the independence Bayes' approach when it is important to adjust for dependency of tests errors. Methods to estimate adjusted likelihood ratios from predictive models should be considered in preference to a standard logistic regression model to facilitate ease of interpretation and application. Albert's method provides the most straightforward approach.

  4. High resolution probabilistic precipitation forecast over Spain combining the statistical downscaling tool PROMETEO and the AEMET short range EPS system (AEMET/SREPS)

    NASA Astrophysics Data System (ADS)

    Cofino, A. S.; Santos, C.; Garcia-Moya, J. A.; Gutierrez, J. M.; Orfila, B.

    2009-04-01

    The Short-Range Ensemble Prediction System (SREPS) is a multi-LAM (UM, HIRLAM, MM5, LM and HRM) multi analysis/boundary conditions (ECMWF, UKMetOffice, DWD and GFS) run twice a day by AEMET (72 hours lead time) over a European domain, with a total of 5 (LAMs) x 4 (GCMs) = 20 members. One of the main goals of this project is analyzing the impact of models and boundary conditions in the short-range high-resolution forecasted precipitation. A previous validation of this method has been done considering a set of climate networks in Spain, France and Germany, by interpolating the prediction to the gauge locations (SREPS, 2008). In this work we compare these results with those obtained by using a statistical downscaling method to post-process the global predictions, obtaining an "advanced interpolation" for the local precipitation using climate network precipitation observations. In particular, we apply the PROMETEO downscaling system based on analogs and compare the SREPS ensemble of 20 members with the PROMETEO statistical ensemble of 5 (analog ensemble) x 4 (GCMs) = 20 members. Moreover, we will also compare the performance of a combined approach post-processing the SREPS outputs using the PROMETEO system. References: SREPS 2008. 2008 EWGLAM-SRNWP Meeting (http://www.aemet.es/documentos/va/divulgacion/conferencias/prediccion/Ewglam/PRED_CSantos.pdf)

  5. Bayesian methods in reliability

    NASA Astrophysics Data System (ADS)

    Sander, P.; Badoux, R.

    1991-11-01

    The present proceedings from a course on Bayesian methods in reliability encompasses Bayesian statistical methods and their computational implementation, models for analyzing censored data from nonrepairable systems, the traits of repairable systems and growth models, the use of expert judgment, and a review of the problem of forecasting software reliability. Specific issues addressed include the use of Bayesian methods to estimate the leak rate of a gas pipeline, approximate analyses under great prior uncertainty, reliability estimation techniques, and a nonhomogeneous Poisson process. Also addressed are the calibration sets and seed variables of expert judgment systems for risk assessment, experimental illustrations of the use of expert judgment for reliability testing, and analyses of the predictive quality of software-reliability growth models such as the Weibull order statistics.

  6. Short-Term fo F2 Forecast: Present Day State of Art

    NASA Astrophysics Data System (ADS)

    Mikhailov, A. V.; Depuev, V. H.; Depueva, A. H.

    An analysis of the F2-layer short-term forecast problem has been done. Both objective and methodological problems prevent us from a deliberate F2-layer forecast issuing at present. An empirical approach based on statistical methods may be recommended for practical use. A forecast method based on a new aeronomic index (a proxy) AI has been proposed and tested over selected 64 severe storm events. The method provides an acceptable prediction accuracy both for strongly disturbed and quiet conditions. The problems with the prediction of the F2-layer quiet-time disturbances as well as some other unsolved problems are discussed

  7. An improved method for predicting the lightning performance of high and extra-high-voltage substation shielding

    NASA Astrophysics Data System (ADS)

    Vinh, T.

    1980-08-01

    There is a need for better and more effective lightning protection for transmission and switching substations. In the past, a number of empirical methods were utilized to design systems to protect substations and transmission lines from direct lightning strokes. The need exists for convenient analytical lightning models adequate for engineering usage. In this study, analytical lightning models were developed along with a method for improved analysis of the physical properties of lightning through their use. This method of analysis is based upon the most recent statistical field data. The result is an improved method for predicting the occurrence of sheilding failure and for designing more effective protection for high and extra high voltage substations from direct strokes.

  8. Meteorological regimes for the classification of aerospace air quality predictions for NASA-Kennedy Space Center

    NASA Technical Reports Server (NTRS)

    Stephens, J. B.; Sloan, J. C.

    1976-01-01

    A method is described for developing a statistical air quality assessment for the launch of an aerospace vehicle from the Kennedy Space Center in terms of existing climatological data sets. The procedure can be refined as developing meteorological conditions are identified for use with the NASA-Marshall Space Flight Center Rocket Exhaust Effluent Diffusion (REED) description. Classical climatological regimes for the long range analysis can be narrowed as the synoptic and mesoscale structure is identified. Only broad synoptic regimes are identified at this stage of analysis. As the statistical data matrix is developed, synoptic regimes will be refined in terms of the resulting eigenvectors as applicable to aerospace air quality predictions.

  9. Bayesian model aggregation for ensemble-based estimates of protein pKa values

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gosink, Luke J.; Hogan, Emilie A.; Pulsipher, Trenton C.

    2014-03-01

    This paper investigates an ensemble-based technique called Bayesian Model Averaging (BMA) to improve the performance of protein amino acid pmore » $$K_a$$ predictions. Structure-based p$$K_a$$ calculations play an important role in the mechanistic interpretation of protein structure and are also used to determine a wide range of protein properties. A diverse set of methods currently exist for p$$K_a$$ prediction, ranging from empirical statistical models to {\\it ab initio} quantum mechanical approaches. However, each of these methods are based on a set of assumptions that have inherent bias and sensitivities that can effect a model's accuracy and generalizability for p$$K_a$$ prediction in complicated biomolecular systems. We use BMA to combine eleven diverse prediction methods that each estimate pKa values of amino acids in staphylococcal nuclease. These methods are based on work conducted for the pKa Cooperative and the pKa measurements are based on experimental work conducted by the Garc{\\'i}a-Moreno lab. Our study demonstrates that the aggregated estimate obtained from BMA outperforms all individual prediction methods in our cross-validation study with improvements from 40-70\\% over other method classes. This work illustrates a new possible mechanism for improving the accuracy of p$$K_a$$ prediction and lays the foundation for future work on aggregate models that balance computational cost with prediction accuracy.« less

  10. Statistical-learning strategies generate only modestly performing predictive models for urinary symptoms following external beam radiotherapy of the prostate: A comparison of conventional and machine-learning methods

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Yahya, Noorazrul, E-mail: noorazrul.yahya@research.uwa.edu.au; Ebert, Martin A.; Bulsara, Max

    Purpose: Given the paucity of available data concerning radiotherapy-induced urinary toxicity, it is important to ensure derivation of the most robust models with superior predictive performance. This work explores multiple statistical-learning strategies for prediction of urinary symptoms following external beam radiotherapy of the prostate. Methods: The performance of logistic regression, elastic-net, support-vector machine, random forest, neural network, and multivariate adaptive regression splines (MARS) to predict urinary symptoms was analyzed using data from 754 participants accrued by TROG03.04-RADAR. Predictive features included dose-surface data, comorbidities, and medication-intake. Four symptoms were analyzed: dysuria, haematuria, incontinence, and frequency, each with three definitions (grade ≥more » 1, grade ≥ 2 and longitudinal) with event rate between 2.3% and 76.1%. Repeated cross-validations producing matched models were implemented. A synthetic minority oversampling technique was utilized in endpoints with rare events. Parameter optimization was performed on the training data. Area under the receiver operating characteristic curve (AUROC) was used to compare performance using sample size to detect differences of ≥0.05 at the 95% confidence level. Results: Logistic regression, elastic-net, random forest, MARS, and support-vector machine were the highest-performing statistical-learning strategies in 3, 3, 3, 2, and 1 endpoints, respectively. Logistic regression, MARS, elastic-net, random forest, neural network, and support-vector machine were the best, or were not significantly worse than the best, in 7, 7, 5, 5, 3, and 1 endpoints. The best-performing statistical model was for dysuria grade ≥ 1 with AUROC ± standard deviation of 0.649 ± 0.074 using MARS. For longitudinal frequency and dysuria grade ≥ 1, all strategies produced AUROC>0.6 while all haematuria endpoints and longitudinal incontinence models produced AUROC<0.6. Conclusions: Logistic regression and MARS were most likely to be the best-performing strategy for the prediction of urinary symptoms with elastic-net and random forest producing competitive results. The predictive power of the models was modest and endpoint-dependent. New features, including spatial dose maps, may be necessary to achieve better models.« less

  11. Prediction of CpG-island function: CpG clustering vs. sliding-window methods

    PubMed Central

    2010-01-01

    Background Unmethylated stretches of CpG dinucleotides (CpG islands) are an outstanding property of mammal genomes. Conventionally, these regions are detected by sliding window approaches using %G + C, CpG observed/expected ratio and length thresholds as main parameters. Recently, clustering methods directly detect clusters of CpG dinucleotides as a statistical property of the genome sequence. Results We compare sliding-window to clustering (i.e. CpGcluster) predictions by applying new ways to detect putative functionality of CpG islands. Analyzing the co-localization with several genomic regions as a function of window size vs. statistical significance (p-value), CpGcluster shows a higher overlap with promoter regions and highly conserved elements, at the same time showing less overlap with Alu retrotransposons. The major difference in the prediction was found for short islands (CpG islets), often exclusively predicted by CpGcluster. Many of these islets seem to be functional, as they are unmethylated, highly conserved and/or located within the promoter region. Finally, we show that window-based islands can spuriously overlap several, differentially regulated promoters as well as different methylation domains, which might indicate a wrong merge of several CpG islands into a single, very long island. The shorter CpGcluster islands seem to be much more specific when concerning the overlap with alternative transcription start sites or the detection of homogenous methylation domains. Conclusions The main difference between sliding-window approaches and clustering methods is the length of the predicted islands. Short islands, often differentially methylated, are almost exclusively predicted by CpGcluster. This suggests that CpGcluster may be the algorithm of choice to explore the function of these short, but putatively functional CpG islands. PMID:20500903

  12. Monitoring Method of Cow Anthrax Based on Gis and Spatial Statistical Analysis

    NASA Astrophysics Data System (ADS)

    Li, Lin; Yang, Yong; Wang, Hongbin; Dong, Jing; Zhao, Yujun; He, Jianbin; Fan, Honggang

    Geographic information system (GIS) is a computer application system, which possesses the ability of manipulating spatial information and has been used in many fields related with the spatial information management. Many methods and models have been established for analyzing animal diseases distribution models and temporal-spatial transmission models. Great benefits have been gained from the application of GIS in animal disease epidemiology. GIS is now a very important tool in animal disease epidemiological research. Spatial analysis function of GIS can be widened and strengthened by using spatial statistical analysis, allowing for the deeper exploration, analysis, manipulation and interpretation of spatial pattern and spatial correlation of the animal disease. In this paper, we analyzed the cow anthrax spatial distribution characteristics in the target district A (due to the secret of epidemic data we call it district A) based on the established GIS of the cow anthrax in this district in combination of spatial statistical analysis and GIS. The Cow anthrax is biogeochemical disease, and its geographical distribution is related closely to the environmental factors of habitats and has some spatial characteristics, and therefore the correct analysis of the spatial distribution of anthrax cow for monitoring and the prevention and control of anthrax has a very important role. However, the application of classic statistical methods in some areas is very difficult because of the pastoral nomadic context. The high mobility of livestock and the lack of enough suitable sampling for the some of the difficulties in monitoring currently make it nearly impossible to apply rigorous random sampling methods. It is thus necessary to develop an alternative sampling method, which could overcome the lack of sampling and meet the requirements for randomness. The GIS computer application software ArcGIS9.1 was used to overcome the lack of data of sampling sites.Using ArcGIS 9.1 and GEODA to analyze the cow anthrax spatial distribution of district A. we gained some conclusions about cow anthrax' density: (1) there is a spatial clustering model. (2) there is an intensely spatial autocorrelation. We established a prediction model to estimate the anthrax distribution based on the spatial characteristic of the density of cow anthrax. Comparing with the true distribution, the prediction model has a well coincidence and is feasible to the application. The method using a GIS tool facilitates can be implemented significantly in the cow anthrax monitoring and investigation, and the space statistics - related prediction model provides a fundamental use for other study on space-related animal diseases.

  13. Accelerated battery-life testing - A concept

    NASA Technical Reports Server (NTRS)

    Mccallum, J.; Thomas, R. E.

    1971-01-01

    Test program, employing empirical, statistical and physical methods, determines service life and failure probabilities of electrochemical cells and batteries, and is applicable to testing mechanical, electrical, and chemical devices. Data obtained aids long-term performance prediction of battery or cell.

  14. The statistical mechanics of complex signaling networks: nerve growth factor signaling

    NASA Astrophysics Data System (ADS)

    Brown, K. S.; Hill, C. C.; Calero, G. A.; Myers, C. R.; Lee, K. H.; Sethna, J. P.; Cerione, R. A.

    2004-10-01

    The inherent complexity of cellular signaling networks and their importance to a wide range of cellular functions necessitates the development of modeling methods that can be applied toward making predictions and highlighting the appropriate experiments to test our understanding of how these systems are designed and function. We use methods of statistical mechanics to extract useful predictions for complex cellular signaling networks. A key difficulty with signaling models is that, while significant effort is being made to experimentally measure the rate constants for individual steps in these networks, many of the parameters required to describe their behavior remain unknown or at best represent estimates. To establish the usefulness of our approach, we have applied our methods toward modeling the nerve growth factor (NGF)-induced differentiation of neuronal cells. In particular, we study the actions of NGF and mitogenic epidermal growth factor (EGF) in rat pheochromocytoma (PC12) cells. Through a network of intermediate signaling proteins, each of these growth factors stimulates extracellular regulated kinase (Erk) phosphorylation with distinct dynamical profiles. Using our modeling approach, we are able to predict the influence of specific signaling modules in determining the integrated cellular response to the two growth factors. Our methods also raise some interesting insights into the design and possible evolution of cellular systems, highlighting an inherent property of these systems that we call 'sloppiness.'

  15. Prediction of protein secondary structure content for the twilight zone sequences.

    PubMed

    Homaeian, Leila; Kurgan, Lukasz A; Ruan, Jishou; Cios, Krzysztof J; Chen, Ke

    2007-11-15

    Secondary protein structure carries information about local structural arrangements, which include three major conformations: alpha-helices, beta-strands, and coils. Significant majority of successful methods for prediction of the secondary structure is based on multiple sequence alignment. However, multiple alignment fails to provide accurate results when a sequence comes from the twilight zone, that is, it is characterized by low (<30%) homology. To this end, we propose a novel method for prediction of secondary structure content through comprehensive sequence representation, called PSSC-core. The method uses a multiple linear regression model and introduces a comprehensive feature-based sequence representation to predict amount of helices and strands for sequences from the twilight zone. The PSSC-core method was tested and compared with two other state-of-the-art prediction methods on a set of 2187 twilight zone sequences. The results indicate that our method provides better predictions for both helix and strand content. The PSSC-core is shown to provide statistically significantly better results when compared with the competing methods, reducing the prediction error by 5-7% for helix and 7-9% for strand content predictions. The proposed feature-based sequence representation uses a comprehensive set of physicochemical properties that are custom-designed for each of the helix and strand content predictions. It includes composition and composition moment vectors, frequency of tetra-peptides associated with helical and strand conformations, various property-based groups like exchange groups, chemical groups of the side chains and hydrophobic group, auto-correlations based on hydrophobicity, side-chain masses, hydropathy, and conformational patterns for beta-sheets. The PSSC-core method provides an alternative for predicting the secondary structure content that can be used to validate and constrain results of other structure prediction methods. At the same time, it also provides useful insight into design of successful protein sequence representations that can be used in developing new methods related to prediction of different aspects of the secondary protein structure. (c) 2007 Wiley-Liss, Inc.

  16. Regression Models and Fuzzy Logic Prediction of TBM Penetration Rate

    NASA Astrophysics Data System (ADS)

    Minh, Vu Trieu; Katushin, Dmitri; Antonov, Maksim; Veinthal, Renno

    2017-03-01

    This paper presents statistical analyses of rock engineering properties and the measured penetration rate of tunnel boring machine (TBM) based on the data of an actual project. The aim of this study is to analyze the influence of rock engineering properties including uniaxial compressive strength (UCS), Brazilian tensile strength (BTS), rock brittleness index (BI), the distance between planes of weakness (DPW), and the alpha angle (Alpha) between the tunnel axis and the planes of weakness on the TBM rate of penetration (ROP). Four (4) statistical regression models (two linear and two nonlinear) are built to predict the ROP of TBM. Finally a fuzzy logic model is developed as an alternative method and compared to the four statistical regression models. Results show that the fuzzy logic model provides better estimations and can be applied to predict the TBM performance. The R-squared value (R2) of the fuzzy logic model scores the highest value of 0.714 over the second runner-up of 0.667 from the multiple variables nonlinear regression model.

  17. Predicting Slag Generation in Sub-Scale Test Motors Using a Neural Network

    NASA Technical Reports Server (NTRS)

    Wiesenberg, Brent

    1999-01-01

    Generation of slag (aluminum oxide) is an important issue for the Reusable Solid Rocket Motor (RSRM). Thiokol performed testing to quantify the relationship between raw material variations and slag generation in solid propellants by testing sub-scale motors cast with propellant containing various combinations of aluminum fuel and ammonium perchlorate (AP) oxidizer particle sizes. The test data were analyzed using statistical methods and an artificial neural network. This paper primarily addresses the neural network results with some comparisons to the statistical results. The neural network showed that the particle sizes of both the aluminum and unground AP have a measurable effect on slag generation. The neural network analysis showed that aluminum particle size is the dominant driver in slag generation, about 40% more influential than AP. The network predictions of the amount of slag produced during firing of sub-scale motors were 16% better than the predictions of a statistically derived empirical equation. Another neural network successfully characterized the slag generated during full-scale motor tests. The success is attributable to the ability of neural networks to characterize multiple complex factors including interactions that affect slag generation.

  18. Multi-year slant path rain fade statistics at 28.56 and 19.04 GHz for Wallops Island, Virginia

    NASA Technical Reports Server (NTRS)

    Goldhirsh, J.

    1979-01-01

    Multiyear rain fade statistics at 28.56 GHz and 19.04 GHz were compiled for the region of Wallops Island, Virginia covering the time periods, 1 April 1977 through 31 March 1978, and 1 September 1978 through 31 August 1979. The 28.56 GHz attenuations were derived by monitoring the beacon signals from the COMSTAR geosynchronous satellite, D sub 2 during the first year, and satellite, D sub 3, during the second year. Although 19.04 GHz beacons exist aboard these satellites, statistics at this frequency were predicted using the 28 GHz fade data, the measured rain rate distribution, and effective path length concepts. The prediction method used was tested against radar derived fade distributions and excellent comparisons were noted. For example, the rms deviations between the predicted and test distributions were less than or equal to 0.2dB or 4% at 19.04 GHz. The average ratio between the 28.56 GHz and 19.04 GHz fades were also derived for equal percentages of time resulting in a factor of 2.1 with a .05 standard deviation.

  19. Improving the Validity of Activity of Daily Living Dependency Risk Assessment

    PubMed Central

    Clark, Daniel O.; Stump, Timothy E.; Tu, Wanzhu; Miller, Douglas K.

    2015-01-01

    Objectives Efforts to prevent activity of daily living (ADL) dependency may be improved through models that assess older adults’ dependency risk. We evaluated whether cognition and gait speed measures improve the predictive validity of interview-based models. Method Participants were 8,095 self-respondents in the 2006 Health and Retirement Survey who were aged 65 years or over and independent in five ADLs. Incident ADL dependency was determined from the 2008 interview. Models were developed using random 2/3rd cohorts and validated in the remaining 1/3rd. Results Compared to a c-statistic of 0.79 in the best interview model, the model including cognitive measures had c-statistics of 0.82 and 0.80 while the best fitting gait speed model had c-statistics of 0.83 and 0.79 in the development and validation cohorts, respectively. Conclusion Two relatively brief models, one that requires an in-person assessment and one that does not, had excellent validity for predicting incident ADL dependency but did not significantly improve the predictive validity of the best fitting interview-based models. PMID:24652867

  20. Predicting the graft survival for heart-lung transplantation patients: an integrated data mining methodology.

    PubMed

    Oztekin, Asil; Delen, Dursun; Kong, Zhenyu James

    2009-12-01

    Predicting the survival of heart-lung transplant patients has the potential to play a critical role in understanding and improving the matching procedure between the recipient and graft. Although voluminous data related to the transplantation procedures is being collected and stored, only a small subset of the predictive factors has been used in modeling heart-lung transplantation outcomes. The previous studies have mainly focused on applying statistical techniques to a small set of factors selected by the domain-experts in order to reveal the simple linear relationships between the factors and survival. The collection of methods known as 'data mining' offers significant advantages over conventional statistical techniques in dealing with the latter's limitations such as normality assumption of observations, independence of observations from each other, and linearity of the relationship between the observations and the output measure(s). There are statistical methods that overcome these limitations. Yet, they are computationally more expensive and do not provide fast and flexible solutions as do data mining techniques in large datasets. The main objective of this study is to improve the prediction of outcomes following combined heart-lung transplantation by proposing an integrated data-mining methodology. A large and feature-rich dataset (16,604 cases with 283 variables) is used to (1) develop machine learning based predictive models and (2) extract the most important predictive factors. Then, using three different variable selection methods, namely, (i) machine learning methods driven variables-using decision trees, neural networks, logistic regression, (ii) the literature review-based expert-defined variables, and (iii) common sense-based interaction variables, a consolidated set of factors is generated and used to develop Cox regression models for heart-lung graft survival. The predictive models' performance in terms of 10-fold cross-validation accuracy rates for two multi-imputed datasets ranged from 79% to 86% for neural networks, from 78% to 86% for logistic regression, and from 71% to 79% for decision trees. The results indicate that the proposed integrated data mining methodology using Cox hazard models better predicted the graft survival with different variables than the conventional approaches commonly used in the literature. This result is validated by the comparison of the corresponding Gains charts for our proposed methodology and the literature review based Cox results, and by the comparison of Akaike information criteria (AIC) values received from each. Data mining-based methodology proposed in this study reveals that there are undiscovered relationships (i.e. interactions of the existing variables) among the survival-related variables, which helps better predict the survival of the heart-lung transplants. It also brings a different set of variables into the scene to be evaluated by the domain-experts and be considered prior to the organ transplantation.

  1. Can bias correction and statistical downscaling methods improve the skill of seasonal precipitation forecasts?

    NASA Astrophysics Data System (ADS)

    Manzanas, R.; Lucero, A.; Weisheimer, A.; Gutiérrez, J. M.

    2018-02-01

    Statistical downscaling methods are popular post-processing tools which are widely used in many sectors to adapt the coarse-resolution biased outputs from global climate simulations to the regional-to-local scale typically required by users. They range from simple and pragmatic Bias Correction (BC) methods, which directly adjust the model outputs of interest (e.g. precipitation) according to the available local observations, to more complex Perfect Prognosis (PP) ones, which indirectly derive local predictions (e.g. precipitation) from appropriate upper-air large-scale model variables (predictors). Statistical downscaling methods have been extensively used and critically assessed in climate change applications; however, their advantages and limitations in seasonal forecasting are not well understood yet. In particular, a key problem in this context is whether they serve to improve the forecast quality/skill of raw model outputs beyond the adjustment of their systematic biases. In this paper we analyze this issue by applying two state-of-the-art BC and two PP methods to downscale precipitation from a multimodel seasonal hindcast in a challenging tropical region, the Philippines. To properly assess the potential added value beyond the reduction of model biases, we consider two validation scores which are not sensitive to changes in the mean (correlation and reliability categories). Our results show that, whereas BC methods maintain or worsen the skill of the raw model forecasts, PP methods can yield significant skill improvement (worsening) in cases for which the large-scale predictor variables considered are better (worse) predicted by the model than precipitation. For instance, PP methods are found to increase (decrease) model reliability in nearly 40% of the stations considered in boreal summer (autumn). Therefore, the choice of a convenient downscaling approach (either BC or PP) depends on the region and the season.

  2. A Public-Private Partnership Develops and Externally Validates a 30-Day Hospital Readmission Risk Prediction Model

    PubMed Central

    Choudhry, Shahid A.; Li, Jing; Davis, Darcy; Erdmann, Cole; Sikka, Rishi; Sutariya, Bharat

    2013-01-01

    Introduction: Preventing the occurrence of hospital readmissions is needed to improve quality of care and foster population health across the care continuum. Hospitals are being held accountable for improving transitions of care to avert unnecessary readmissions. Advocate Health Care in Chicago and Cerner (ACC) collaborated to develop all-cause, 30-day hospital readmission risk prediction models to identify patients that need interventional resources. Ideally, prediction models should encompass several qualities: they should have high predictive ability; use reliable and clinically relevant data; use vigorous performance metrics to assess the models; be validated in populations where they are applied; and be scalable in heterogeneous populations. However, a systematic review of prediction models for hospital readmission risk determined that most performed poorly (average C-statistic of 0.66) and efforts to improve their performance are needed for widespread usage. Methods: The ACC team incorporated electronic health record data, utilized a mixed-method approach to evaluate risk factors, and externally validated their prediction models for generalizability. Inclusion and exclusion criteria were applied on the patient cohort and then split for derivation and internal validation. Stepwise logistic regression was performed to develop two predictive models: one for admission and one for discharge. The prediction models were assessed for discrimination ability, calibration, overall performance, and then externally validated. Results: The ACC Admission and Discharge Models demonstrated modest discrimination ability during derivation, internal and external validation post-recalibration (C-statistic of 0.76 and 0.78, respectively), and reasonable model fit during external validation for utility in heterogeneous populations. Conclusions: The ACC Admission and Discharge Models embody the design qualities of ideal prediction models. The ACC plans to continue its partnership to further improve and develop valuable clinical models. PMID:24224068

  3. Energy Production Calculations with Field Flow Models and Windspeed Predictions with Statistical Methods

    NASA Astrophysics Data System (ADS)

    Rüstemoǧlu, Sevinç; Barutçu, Burak; Sibel Menteş, Å.ž.

    2010-05-01

    The continuous usage of fossil fuels as primary energy source is the reason of the emission of CO and powerless economy of the country affected by the great flactuations in the unit price of energy sources. In recent years, developments in wind energy sector and the supporting new renewable energy policies of the countries allow the new wind farm owners and the firms who expect to be an owner to consider and invest on the renewable sources. In this study, the annual production of the turbines with 1.8 kW and 30 kW which are available for Istanbul Technical University in Energy Institute is calculated by Wasp and WindPro Field Flow Models and the wind characteristics of the area are analysed. The meteorological data used in calculation includes the period between 02.March.2000 and 31.May.2004 and is taken from the meteorological mast ( ) in Istanbul Technical University's campus area. The measurement data is taken from 2 m and 10 m heights with hourly means. The topography, roughness classes and shelter effects are defined in the models to make accurate extrapolation to the turbine sites. As an advantage, the region is nearly 3.5 km close to the Istanbul Bosphorous but as it can be seen from the Wasp and WindPro Model Results, the Bosphorous effect is interrupted by the new buildings and hight forestry. The shelter effect of these high buildings have a great influence on the wind flow and decrease the high wind energy potential which is produced by the Bosphorous effect. This study, which determines wind characteristics and expected annual production, is important for this Project Site and therefore gains importance before the construction of wind energy system. However, when the system is operating, developing the energy management skills, forecasting the wind speed and direction will become important. At this point, three statistical models which are Kalman Fitler, AR Model and Neural Networks models are used to determine the success of each method for correct wind prediction. Statistical methods' preditictions as time series are included and the similartiy rates are compared for each method. The algorithms which are performed in MATLAB, gave the similarity results of each model. According to the Neural Networks results which are found to be the most successful method for prediction within these three statistical models, the windspeed similarity rate between the original measurements and the prediction set which includes 1 year period between 2003 and 2004, is evaluated as % 94.7. For wind direction, the similarity rate is %81.61. High noise margin and ability to learn the characteristics of the signal are important advantages of Neural Networks for compatible windspeed and direction predictions compared with measurements.

  4. Sirc-cvs cytotoxicity test: an alternative for predicting rodent acute systemic toxicity.

    PubMed

    Kitagaki, Masato; Wakuri, Shinobu; Hirota, Morihiko; Tanaka, Noriho; Itagaki, Hiroshi

    2006-10-01

    An in vitro crystal violet staining method using the rabbit cornea-derived cell line (SIRC-CVS) has been developed as an alternative to predict acute systemic toxicity in rodents. Seventy-nine chemicals, the in vitro cytotoxicity of which was already reported by the Multicenter Evaluation of In vitro Toxicity (MEIC) and ICCVAM/ECVAM, were selected as test compounds. The cells were incubated with the chemicals for 72 hrs and the IC(50) and IC(35) values (microg/mL) were obtained. The results were compared to the in vivo (rat or mouse) "most toxic" oral, intraperitoneal, subcutaneous and intravenous LD(50) values (mg/kg) taken from the RTECS database for each of the chemicals by using Pearson's correlation statistics. The following parameters were calculated: accuracy, sensitivity, specificity, prevalence, positive predictability, and negative predictability. Good linear correlations (Pearson's coefficient; r>0.6) were observed between either the IC(50) or the IC(35) values and all the LD(50) values. Among them, a statistically significant high correlation (r=0.8102, p<0.001) required for acute systemic toxicity prediction was obtained between the IC(50) values and the oral LD(50) values. By using the cut-off concentrations of 2,000 mg/kg (LD(50)) and 4,225 microg/mL (IC(50)), no false negatives were observed, and the accuracy was 84.8%. From this, it is concluded that this method could be used to predict the acute systemic toxicity potential of chemicals in rodents.

  5. Review and statistical analysis of the use of ultrasonic velocity for estimating the porosity fraction in polycrystalline materials

    NASA Technical Reports Server (NTRS)

    Roth, D. J.; Swickard, S. M.; Stang, D. B.; Deguire, M. R.

    1991-01-01

    A review and statistical analysis of the ultrasonic velocity method for estimating the porosity fraction in polycrystalline materials is presented. Initially, a semiempirical model is developed showing the origin of the linear relationship between ultrasonic velocity and porosity fraction. Then, from a compilation of data produced by many researchers, scatter plots of velocity versus percent porosity data are shown for Al2O3, MgO, porcelain-based ceramics, PZT, SiC, Si3N4, steel, tungsten, UO2,(U0.30Pu0.70)C, and YBa2Cu3O(7-x). Linear regression analysis produces predicted slope, intercept, correlation coefficient, level of significance, and confidence interval statistics for the data. Velocity values predicted from regression analysis of fully-dense materials are in good agreement with those calculated from elastic properties.

  6. Prediction of Knee Joint Contact Forces From External Measures Using Principal Component Prediction and Reconstruction.

    PubMed

    Saliba, Christopher M; Clouthier, Allison L; Brandon, Scott C E; Rainbow, Michael J; Deluzio, Kevin J

    2018-05-29

    Abnormal loading of the knee joint contributes to the pathogenesis of knee osteoarthritis. Gait retraining is a non-invasive intervention that aims to reduce knee loads by providing audible, visual, or haptic feedback of gait parameters. The computational expense of joint contact force prediction has limited real-time feedback to surrogate measures of the contact force, such as the knee adduction moment. We developed a method to predict knee joint contact forces using motion analysis and a statistical regression model that can be implemented in near real-time. Gait waveform variables were deconstructed using principal component analysis and a linear regression was used to predict the principal component scores of the contact force waveforms. Knee joint contact force waveforms were reconstructed using the predicted scores. We tested our method using a heterogenous population of asymptomatic controls and subjects with knee osteoarthritis. The reconstructed contact force waveforms had mean (SD) RMS differences of 0.17 (0.05) bodyweight compared to the contact forces predicted by a musculoskeletal model. Our method successfully predicted subject-specific shape features of contact force waveforms and is a potentially powerful tool in biofeedback and clinical gait analysis.

  7. Modeling potential habitats for alien species Dreissena polymorpha in continental USA

    USGS Publications Warehouse

    Mingyang, Li; Yunwei, Ju; Kumar, Sunil; Stohlgren, Thomas J.

    2008-01-01

    The effective measure to minimize the damage of invasive species is to block the potential invasive species to enter into suitable areas. 1864 occurrence points with GPS coordinates and 34 environmental variables from Daymet datasets were gathered, and 4 modeling methods, i.e., Logistic Regression (LR), Classification and Regression Trees (CART), Genetic Algorithm for Rule-Set Prediction (GARP), and maximum entropy method (Maxent), were introduced to generate potential geographic distributions for invasive species Dreissena polymorpha in Continental USA. Then 3 statistical criteria of the area under the Receiver Operating Characteristic curve (AUC), Pearson correlation (COR) and Kappa value were calculated to evaluate the performance of the models, followed by analyses on major contribution variables. Results showed that in terms of the 3 statistical criteria, the prediction results of the 4 ecological niche models were either excellent or outstanding, in which Maxent outperformed the others in 3 aspects of predicting current distribution habitats, selecting major contribution factors, and quantifying the influence of environmental variables on habitats. Distance to water, elevation, frequency of precipitation and solar radiation were 4 environmental forcing factors. The method suggested in the paper can have some reference meaning for modeling habitats of alien species in China and provide a direction to prevent Mytilopsis sallei on the Chinese coast line.

  8. Simultaneous fitting of genomic-BLUP and Bayes-C components in a genomic prediction model.

    PubMed

    Iheshiulor, Oscar O M; Woolliams, John A; Svendsen, Morten; Solberg, Trygve; Meuwissen, Theo H E

    2017-08-24

    The rapid adoption of genomic selection is due to two key factors: availability of both high-throughput dense genotyping and statistical methods to estimate and predict breeding values. The development of such methods is still ongoing and, so far, there is no consensus on the best approach. Currently, the linear and non-linear methods for genomic prediction (GP) are treated as distinct approaches. The aim of this study was to evaluate the implementation of an iterative method (called GBC) that incorporates aspects of both linear [genomic-best linear unbiased prediction (G-BLUP)] and non-linear (Bayes-C) methods for GP. The iterative nature of GBC makes it less computationally demanding similar to other non-Markov chain Monte Carlo (MCMC) approaches. However, as a Bayesian method, GBC differs from both MCMC- and non-MCMC-based methods by combining some aspects of G-BLUP and Bayes-C methods for GP. Its relative performance was compared to those of G-BLUP and Bayes-C. We used an imputed 50 K single-nucleotide polymorphism (SNP) dataset based on the Illumina Bovine50K BeadChip, which included 48,249 SNPs and 3244 records. Daughter yield deviations for somatic cell count, fat yield, milk yield, and protein yield were used as response variables. GBC was frequently (marginally) superior to G-BLUP and Bayes-C in terms of prediction accuracy and was significantly better than G-BLUP only for fat yield. On average across the four traits, GBC yielded a 0.009 and 0.006 increase in prediction accuracy over G-BLUP and Bayes-C, respectively. Computationally, GBC was very much faster than Bayes-C and similar to G-BLUP. Our results show that incorporating some aspects of G-BLUP and Bayes-C in a single model can improve accuracy of GP over the commonly used method: G-BLUP. Generally, GBC did not statistically perform better than G-BLUP and Bayes-C, probably due to the close relationships between reference and validation individuals. Nevertheless, it is a flexible tool, in the sense, that it simultaneously incorporates some aspects of linear and non-linear models for GP, thereby exploiting family relationships while also accounting for linkage disequilibrium between SNPs and genes with large effects. The application of GBC in GP merits further exploration.

  9. Ultra-low-density genotype panels for breed assignment of Angus and Hereford cattle.

    PubMed

    Judge, M M; Kelleher, M M; Kearney, J F; Sleator, R D; Berry, D P

    2017-06-01

    Angus and Hereford beef is marketed internationally for apparent superior meat quality attributes; DNA-based breed authenticity could be a useful instrument to ensure consumer confidence on premium meat products. The objective of this study was to develop an ultra-low-density genotype panel to accurately quantify the Angus and Hereford breed proportion in biological samples. Medium-density genotypes (13 306 single nucleotide polymorphisms (SNPs)) were available on 54 703 commercial and 4042 purebred animals. The breed proportion of the commercial animals was generated from the medium-density genotypes and this estimate was regarded as the gold-standard breed composition. Ten genotype panels (100 to 1000 SNPs) were developed from the medium-density genotypes; five methods were used to identify the most informative SNPs and these included the Delta statistic, the fixation (F st) statistic and an index of both. Breed assignment analyses were undertaken for each breed, panel density and SNP selection method separately with a programme to infer population structure using the entire 13 306 SNP panel (representing the gold-standard measure). Breed assignment was undertaken for all commercial animals (n=54 703), animals deemed to contain some proportion of Angus based on pedigree (n=5740) and animals deemed to contain some proportion of Hereford based on pedigree (n=5187). The predicted breed proportion of all animals from the lower density panels was then compared with the gold-standard breed prediction. Panel density, SNP selection method and breed all had a significant effect on the correlation of predicted and actual breed proportion. Regardless of breed, the Index method of SNP selection numerically (but not significantly) outperformed all other selection methods in accuracy (i.e. correlation and root mean square of prediction) when panel density was ⩾300 SNPs. The correlation between actual and predicted breed proportion increased as panel density increased. Using 300 SNPs (selected using the global index method), the correlation between predicted and actual breed proportion was 0.993 and 0.995 in the Angus and Hereford validation populations, respectively. When SNP panels optimised for breed prediction in one population were used to predict the breed proportion of a separate population, the correlation between predicted and actual breed proportion was 0.034 and 0.044 weaker in the Hereford and Angus populations, respectively (using the 300 SNP panel). It is necessary to include at least 300 to 400 SNPs (per breed) on genotype panels to accurately predict breed proportion from biological samples.

  10. Interpreting the concordance statistic of a logistic regression model: relation to the variance and odds ratio of a continuous explanatory variable

    PubMed Central

    2012-01-01

    Background When outcomes are binary, the c-statistic (equivalent to the area under the Receiver Operating Characteristic curve) is a standard measure of the predictive accuracy of a logistic regression model. Methods An analytical expression was derived under the assumption that a continuous explanatory variable follows a normal distribution in those with and without the condition. We then conducted an extensive set of Monte Carlo simulations to examine whether the expressions derived under the assumption of binormality allowed for accurate prediction of the empirical c-statistic when the explanatory variable followed a normal distribution in the combined sample of those with and without the condition. We also examine the accuracy of the predicted c-statistic when the explanatory variable followed a gamma, log-normal or uniform distribution in combined sample of those with and without the condition. Results Under the assumption of binormality with equality of variances, the c-statistic follows a standard normal cumulative distribution function with dependence on the product of the standard deviation of the normal components (reflecting more heterogeneity) and the log-odds ratio (reflecting larger effects). Under the assumption of binormality with unequal variances, the c-statistic follows a standard normal cumulative distribution function with dependence on the standardized difference of the explanatory variable in those with and without the condition. In our Monte Carlo simulations, we found that these expressions allowed for reasonably accurate prediction of the empirical c-statistic when the distribution of the explanatory variable was normal, gamma, log-normal, and uniform in the entire sample of those with and without the condition. Conclusions The discriminative ability of a continuous explanatory variable cannot be judged by its odds ratio alone, but always needs to be considered in relation to the heterogeneity of the population. PMID:22716998

  11. The non-equilibrium statistical mechanics of a simple geophysical fluid dynamics model

    NASA Astrophysics Data System (ADS)

    Verkley, Wim; Severijns, Camiel

    2014-05-01

    Lorenz [1] has devised a dynamical system that has proved to be very useful as a benchmark system in geophysical fluid dynamics. The system in its simplest form consists of a periodic array of variables that can be associated with an atmospheric field on a latitude circle. The system is driven by a constant forcing, is damped by linear friction and has a simple advection term that causes the model to behave chaotically if the forcing is large enough. Our aim is to predict the statistics of Lorenz' model on the basis of a given average value of its total energy - obtained from a numerical integration - and the assumption of statistical stationarity. Our method is the principle of maximum entropy [2] which in this case reads: the information entropy of the system's probability density function shall be maximal under the constraints of normalization, a given value of the average total energy and statistical stationarity. Statistical stationarity is incorporated approximately by using `stationarity constraints', i.e., by requiring that the average first and possibly higher-order time-derivatives of the energy are zero in the maximization of entropy. The analysis [3] reveals that, if the first stationarity constraint is used, the resulting probability density function rather accurately reproduces the statistics of the individual variables. If the second stationarity constraint is used as well, the correlations between the variables are also reproduced quite adequately. The method can be generalized straightforwardly and holds the promise of a viable non-equilibrium statistical mechanics of the forced-dissipative systems of geophysical fluid dynamics. [1] E.N. Lorenz, 1996: Predictability - A problem partly solved, in Proc. Seminar on Predictability (ECMWF, Reading, Berkshire, UK), Vol. 1, pp. 1-18. [2] E.T. Jaynes, 2003: Probability Theory - The Logic of Science (Cambridge University Press, Cambridge). [3] W.T.M. Verkley and C.A. Severijns, 2014: The maximum entropy principle applied to a dynamical system proposed by Lorenz, Eur. Phys. J. B, 87:7, http://dx.doi.org/10.1140/epjb/e2013-40681-2 (open access).

  12. Prediction of the Electromagnetic Field Distribution in a Typical Aircraft Using the Statistical Energy Analysis

    NASA Astrophysics Data System (ADS)

    Kovalevsky, Louis; Langley, Robin S.; Caro, Stephane

    2016-05-01

    Due to the high cost of experimental EMI measurements significant attention has been focused on numerical simulation. Classical methods such as Method of Moment or Finite Difference Time Domain are not well suited for this type of problem, as they require a fine discretisation of space and failed to take into account uncertainties. In this paper, the authors show that the Statistical Energy Analysis is well suited for this type of application. The SEA is a statistical approach employed to solve high frequency problems of electromagnetically reverberant cavities at a reduced computational cost. The key aspects of this approach are (i) to consider an ensemble of system that share the same gross parameter, and (ii) to avoid solving Maxwell's equations inside the cavity, using the power balance principle. The output is an estimate of the field magnitude distribution in each cavity. The method is applied on a typical aircraft structure.

  13. High-Density Signal Interface Electromagnetic Radiation Prediction for Electromagnetic Compatibility Evaluation.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Halligan, Matthew

    Radiated power calculation approaches for practical scenarios of incomplete high- density interface characterization information and incomplete incident power information are presented. The suggested approaches build upon a method that characterizes power losses through the definition of power loss constant matrices. Potential radiated power estimates include using total power loss information, partial radiated power loss information, worst case analysis, and statistical bounding analysis. A method is also proposed to calculate radiated power when incident power information is not fully known for non-periodic signals at the interface. Incident data signals are modeled from a two-state Markov chain where bit state probabilities aremore » derived. The total spectrum for windowed signals is postulated as the superposition of spectra from individual pulses in a data sequence. Statistical bounding methods are proposed as a basis for the radiated power calculation due to the statistical calculation complexity to find a radiated power probability density function.« less

  14. Classical Statistics and Statistical Learning in Imaging Neuroscience

    PubMed Central

    Bzdok, Danilo

    2017-01-01

    Brain-imaging research has predominantly generated insight by means of classical statistics, including regression-type analyses and null-hypothesis testing using t-test and ANOVA. Throughout recent years, statistical learning methods enjoy increasing popularity especially for applications in rich and complex data, including cross-validated out-of-sample prediction using pattern classification and sparsity-inducing regression. This concept paper discusses the implications of inferential justifications and algorithmic methodologies in common data analysis scenarios in neuroimaging. It is retraced how classical statistics and statistical learning originated from different historical contexts, build on different theoretical foundations, make different assumptions, and evaluate different outcome metrics to permit differently nuanced conclusions. The present considerations should help reduce current confusion between model-driven classical hypothesis testing and data-driven learning algorithms for investigating the brain with imaging techniques. PMID:29056896

  15. Predictive validity of the UK clinical aptitude test in the final years of medical school: a prospective cohort study.

    PubMed

    Husbands, Adrian; Mathieson, Alistair; Dowell, Jonathan; Cleland, Jennifer; MacKenzie, Rhoda

    2014-04-23

    The UK Clinical Aptitude Test (UKCAT) was designed to address issues identified with traditional methods of selection. This study aims to examine the predictive validity of the UKCAT and compare this to traditional selection methods in the senior years of medical school. This was a follow-up study of two cohorts of students from two medical schools who had previously taken part in a study examining the predictive validity of the UKCAT in first year. The sample consisted of 4th and 5th Year students who commenced their studies at the University of Aberdeen or University of Dundee medical schools in 2007. Data collected were: demographics (gender and age group), UKCAT scores; Universities and Colleges Admissions Service (UCAS) form scores; admission interview scores; Year 4 and 5 degree examination scores. Pearson's correlations were used to examine the relationships between admissions variables, examination scores, gender and age group, and to select variables for multiple linear regression analysis to predict examination scores. Ninety-nine and 89 students at Aberdeen medical school from Years 4 and 5 respectively, and 51 Year 4 students in Dundee, were included in the analysis. Neither UCAS form nor interview scores were statistically significant predictors of examination performance. Conversely, the UKCAT yielded statistically significant validity coefficients between .24 and .36 in four of five assessments investigated. Multiple regression analysis showed the UKCAT made a statistically significant unique contribution to variance in examination performance in the senior years. Results suggest the UKCAT appears to predict performance better in the later years of medical school compared to earlier years and provides modest supportive evidence for the UKCAT's role in student selection within these institutions. Further research is needed to assess the predictive validity of the UKCAT against professional and behavioural outcomes as the cohort commences working life.

  16. Complex networks as a unified framework for descriptive analysis and predictive modeling in climate

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Steinhaeuser, Karsten J K; Chawla, Nitesh; Ganguly, Auroop R

    The analysis of climate data has relied heavily on hypothesis-driven statistical methods, while projections of future climate are based primarily on physics-based computational models. However, in recent years a wealth of new datasets has become available. Therefore, we take a more data-centric approach and propose a unified framework for studying climate, with an aim towards characterizing observed phenomena as well as discovering new knowledge in the climate domain. Specifically, we posit that complex networks are well-suited for both descriptive analysis and predictive modeling tasks. We show that the structural properties of climate networks have useful interpretation within the domain. Further,more » we extract clusters from these networks and demonstrate their predictive power as climate indices. Our experimental results establish that the network clusters are statistically significantly better predictors than clusters derived using a more traditional clustering approach. Using complex networks as data representation thus enables the unique opportunity for descriptive and predictive modeling to inform each other.« less

  17. A New Approach of Juvenile Age Estimation using Measurements of the Ilium and Multivariate Adaptive Regression Splines (MARS) Models for Better Age Prediction.

    PubMed

    Corron, Louise; Marchal, François; Condemi, Silvana; Chaumoître, Kathia; Adalian, Pascal

    2017-01-01

    Juvenile age estimation methods used in forensic anthropology generally lack methodological consistency and/or statistical validity. Considering this, a standard approach using nonparametric Multivariate Adaptive Regression Splines (MARS) models were tested to predict age from iliac biometric variables of male and female juveniles from Marseilles, France, aged 0-12 years. Models using unidimensional (length and width) and bidimensional iliac data (module and surface) were constructed on a training sample of 176 individuals and validated on an independent test sample of 68 individuals. Results show that MARS prediction models using iliac width, module and area give overall better and statistically valid age estimates. These models integrate punctual nonlinearities of the relationship between age and osteometric variables. By constructing valid prediction intervals whose size increases with age, MARS models take into account the normal increase of individual variability. MARS models can qualify as a practical and standardized approach for juvenile age estimation. © 2016 American Academy of Forensic Sciences.

  18. Statistical modelling coupled with LC-MS analysis to predict human upper intestinal absorption of phytochemical mixtures.

    PubMed

    Selby-Pham, Sophie N B; Howell, Kate S; Dunshea, Frank R; Ludbey, Joel; Lutz, Adrian; Bennett, Louise

    2018-04-15

    A diet rich in phytochemicals confers benefits for health by reducing the risk of chronic diseases via regulation of oxidative stress and inflammation (OSI). For optimal protective bio-efficacy, the time required for phytochemicals and their metabolites to reach maximal plasma concentrations (T max ) should be synchronised with the time of increased OSI. A statistical model has been reported to predict T max of individual phytochemicals based on molecular mass and lipophilicity. We report the application of the model for predicting the absorption profile of an uncharacterised phytochemical mixture, herein referred to as the 'functional fingerprint'. First, chemical profiles of phytochemical extracts were acquired using liquid chromatography mass spectrometry (LC-MS), then the molecular features for respective components were used to predict their plasma absorption maximum, based on molecular mass and lipophilicity. This method of 'functional fingerprinting' of plant extracts represents a novel tool for understanding and optimising the health efficacy of plant extracts. Copyright © 2017 Elsevier Ltd. All rights reserved.

  19. Comparing statistical and machine learning classifiers: alternatives for predictive modeling in human factors research.

    PubMed

    Carnahan, Brian; Meyer, Gérard; Kuntz, Lois-Ann

    2003-01-01

    Multivariate classification models play an increasingly important role in human factors research. In the past, these models have been based primarily on discriminant analysis and logistic regression. Models developed from machine learning research offer the human factors professional a viable alternative to these traditional statistical classification methods. To illustrate this point, two machine learning approaches--genetic programming and decision tree induction--were used to construct classification models designed to predict whether or not a student truck driver would pass his or her commercial driver license (CDL) examination. The models were developed and validated using the curriculum scores and CDL exam performances of 37 student truck drivers who had completed a 320-hr driver training course. Results indicated that the machine learning classification models were superior to discriminant analysis and logistic regression in terms of predictive accuracy. Actual or potential applications of this research include the creation of models that more accurately predict human performance outcomes.

  20. Prediction of the effect of formulation on the toxicity of chemicals.

    PubMed

    Mistry, Pritesh; Neagu, Daniel; Sanchez-Ruiz, Antonio; Trundle, Paul R; Vessey, Jonathan D; Gosling, John Paul

    2017-01-01

    Two approaches for the prediction of which of two vehicles will result in lower toxicity for anticancer agents are presented. Machine-learning models are developed using decision tree, random forest and partial least squares methodologies and statistical evidence is presented to demonstrate that they represent valid models. Separately, a clustering method is presented that allows the ordering of vehicles by the toxicity they show for chemically-related compounds.

  1. A Method to Predict the Structure and Stability of RNA/RNA Complexes.

    PubMed

    Xu, Xiaojun; Chen, Shi-Jie

    2016-01-01

    RNA/RNA interactions are essential for genomic RNA dimerization and regulation of gene expression. Intermolecular loop-loop base pairing is a widespread and functionally important tertiary structure motif in RNA machinery. However, computational prediction of intermolecular loop-loop base pairing is challenged by the entropy and free energy calculation due to the conformational constraint and the intermolecular interactions. In this chapter, we describe a recently developed statistical mechanics-based method for the prediction of RNA/RNA complex structures and stabilities. The method is based on the virtual bond RNA folding model (Vfold). The main emphasis in the method is placed on the evaluation of the entropy and free energy for the loops, especially tertiary kissing loops. The method also uses recursive partition function calculations and two-step screening algorithm for large, complicated structures of RNA/RNA complexes. As case studies, we use the HIV-1 Mal dimer and the siRNA/HIV-1 mutant (T4) to illustrate the method.

  2. Impacts of pesticide mixtures in European rivers as predicted by the Species Sensitivity Distribution (SSD) models and SPEAR bioindication

    NASA Astrophysics Data System (ADS)

    Jesenska, Sona; Liess, Mathias; Schäfer, Ralf; Beketov, Mikhail; Blaha, Ludek

    2013-04-01

    Species sensitivity distribution (SSD) is statistical method broadly used in the ecotoxicological risk assessment of chemicals. Originally it has been used for prospective risk assessment of single substances but nowadays it is becoming more important also in the retrospective risk assessment of mixtures, including the catchment scale. In the present work, SSD predictions (impacts of mixtures consisting of 25 pesticides; data from several catchments in Germany, France and Finland) were compared with SPEAR-pesticides, which a bioindicator index based on biological traits responsive to the effects of pesticides and post-contamination recovery. The results showed statistically significant correlations (Pearson's R, p<0.01) between SSD (predicted msPAF values) and values of SPEAR-pesticides (based on field biomonitoring observations). Comparisons of the thresholds established for the SSD and SPEAR approaches (SPEAR-pesticides=45%, i.e. LOEC level, and msPAF = 0.05 for SSD, i.e. HC5) showed that use of chronic toxicity data significantly improved the agreement between the two methods but the SPEAR-pesticides index was still more sensitive. Taken together, the validation study shows good potential of SSD models in predicting the real impacts of micropollutant mixtures on natural communities of aquatic biota.

  3. Open-access programs for injury categorization using ICD-9 or ICD-10.

    PubMed

    Clark, David E; Black, Adam W; Skavdahl, David H; Hallagan, Lee D

    2018-04-09

    The article introduces Programs for Injury Categorization, using the International Classification of Diseases (ICD) and R statistical software (ICDPIC-R). Starting with ICD-8, methods have been described to map injury diagnosis codes to severity scores, especially the Abbreviated Injury Scale (AIS) and Injury Severity Score (ISS). ICDPIC was originally developed for this purpose using Stata, and ICDPIC-R is an open-access update that accepts both ICD-9 and ICD-10 codes. Data were obtained from the National Trauma Data Bank (NTDB), Admission Year 2015. ICDPIC-R derives CDC injury mechanism categories and an approximate ISS ("RISS") from either ICD-9 or ICD-10 codes. For ICD-9-coded cases, RISS is derived similar to the Stata package (with some improvements reflecting user feedback). For ICD-10-coded cases, RISS may be calculated in several ways: The "GEM" methods convert ICD-10 to ICD-9 (using General Equivalence Mapping tables from CMS) and then calculate ISS with options similar to the Stata package; a "ROCmax" method calculates RISS directly from ICD-10 codes, based on diagnosis-specific mortality in the NTDB, maximizing the C-statistic for predicting NTDB mortality while attempting to minimize the difference between RISS and ISS submitted by NTDB registrars (ISSAIS). Findings were validated using data from the National Inpatient Survey (NIS, 2015). NTDB contained 917,865 cases, of which 86,878 had valid ICD-10 injury codes. For a random 100,000 ICD-9-coded cases in NTDB, RISS using the GEM methods was nearly identical to ISS calculated by the Stata version, which has been previously validated. For ICD-10-coded cases in NTDB, categorized ISS using any version of RISS was similar to ISSAIS; for both NTDB and NIS cases, increasing ISS was associated with increasing mortality. Prediction of NTDB mortality was associated with C-statistics of 0.81 for ISSAIS, 0.75 for RISS using the GEM methods, and 0.85 for RISS using the ROCmax method; prediction of NIS mortality was associated with C-statistics of 0.75-0.76 for RISS using the GEM methods, and 0.78 for RISS using the ROCmax method. Instructions are provided for accessing ICDPIC-R at no cost. The ideal methods of injury categorization and injury severity scoring involve trained personnel with access to injured persons or their medical records. ICDPIC-R may be a useful substitute when this ideal cannot be obtained.

  4. Pedophilia: an evaluation of diagnostic and risk prediction methods.

    PubMed

    Wilson, Robin J; Abracen, Jeffrey; Looman, Jan; Picheca, Janice E; Ferguson, Meaghan

    2011-06-01

    One hundred thirty child sexual abusers were diagnosed using each of following four methods: (a) phallometric testing, (b) strict application of Diagnostic and Statistical Manual of Mental Disorders (4th ed., text revision [DSM-IV-TR]) criteria, (c) Rapid Risk Assessment of Sex Offender Recidivism (RRASOR) scores, and (d) "expert" diagnoses rendered by a seasoned clinician. Comparative utility and intermethod consistency of these methods are reported, along with recidivism data indicating predictive validity for risk management. Results suggest that inconsistency exists in diagnosing pedophilia, leading to diminished accuracy in risk assessment. Although the RRASOR and DSM-IV-TR methods were significantly correlated with expert ratings, RRASOR and DSM-IV-TR were unrelated to each other. Deviant arousal was not associated with any of the other methods. Only the expert ratings and RRASOR scores were predictive of sexual recidivism. Logistic regression analyses showed that expert diagnosis did not add to prediction of sexual offence recidivism over and above RRASOR alone. Findings are discussed within a context of encouragement of clinical consistency and evidence-based practice regarding treatment and risk management of those who sexually abuse children.

  5. Predicting treatment failure, death and drug resistance using a computed risk score among newly diagnosed TB patients in Tamaulipas, Mexico.

    PubMed

    Abdelbary, B E; Garcia-Viveros, M; Ramirez-Oropesa, H; Rahbar, M H; Restrepo, B I

    2017-10-01

    The purpose of this study was to develop a method for identifying newly diagnosed tuberculosis (TB) patients at risk for TB adverse events in Tamaulipas, Mexico. Surveillance data between 2006 and 2013 (8431 subjects) was used to develop risk scores based on predictive modelling. The final models revealed that TB patients failing their treatment regimen were more likely to have at most a primary school education, multi-drug resistance (MDR)-TB, and few to moderate bacilli on acid-fast bacilli smear. TB patients who died were more likely to be older males with MDR-TB, HIV, malnutrition, and reporting excessive alcohol use. Modified risk scores were developed with strong predictability for treatment failure and death (c-statistic 0·65 and 0·70, respectively), and moderate predictability for drug resistance (c-statistic 0·57). Among TB patients with diabetes, risk scores showed moderate predictability for death (c-statistic 0·68). Our findings suggest that in the clinical setting, the use of our risk scores for TB treatment failure or death will help identify these individuals for tailored management to prevent these adverse events. In contrast, the available variables in the TB surveillance dataset are not robust predictors of drug resistance, indicating the need for prompt testing at time of diagnosis.

  6. Validation of Statistical Predictive Models Meant to Select Melanoma Patients for Sentinel Lymph Node Biopsy

    PubMed Central

    Sabel, Michael S.; Rice, John D.; Griffith, Kent A.; Lowe, Lori; Wong, Sandra L.; Chang, Alfred E.; Johnson, Timothy M.; Taylor, Jeremy M.G.

    2013-01-01

    Introduction To identify melanoma patients at sufficiently low risk of nodal metastases who could avoid SLN biopsy (SLNB). Several statistical models have been proposed based upon patient/tumor characteristics, including logistic regression, classification trees, random forests and support vector machines. We sought to validate recently published models meant to predict sentinel node status. Methods We queried our comprehensive, prospectively-collected melanoma database for consecutive melanoma patients undergoing SLNB. Prediction values were estimated based upon 4 published models, calculating the same reported metrics: negative predictive value (NPV), rate of negative predictions (RNP), and false negative rate (FNR). Results Logistic regression performed comparably with our data when considering NPV (89.4% vs. 93.6%); however the model’s specificity was not high enough to significantly reduce the rate of biopsies (SLN reduction rate of 2.9%). When applied to our data, the classification tree produced NPV and reduction in biopsies rates that were lower 87.7% vs. 94.1% and 29.8% vs. 14.3%, respectively. Two published models could not be applied to our data due to model complexity and the use of proprietary software. Conclusions Published models meant to reduce the SLNB rate among patients with melanoma either underperformed when applied to our larger dataset, or could not be validated. Differences in selection criteria and histopathologic interpretation likely resulted in underperformance. Development of statistical predictive models must be created in a clinically applicable manner to allow for both validation and ultimately clinical utility. PMID:21822550

  7. A test to evaluate the earthquake prediction algorithm, M8

    USGS Publications Warehouse

    Healy, John H.; Kossobokov, Vladimir G.; Dewey, James W.

    1992-01-01

    A test of the algorithm M8 is described. The test is constructed to meet four rules, which we propose to be applicable to the test of any method for earthquake prediction:  1. An earthquake prediction technique should be presented as a well documented, logical algorithm that can be used by  investigators without restrictions. 2. The algorithm should be coded in a common programming language and implementable on widely available computer systems. 3. A test of the earthquake prediction technique should involve future predictions with a black box version of the algorithm in which potentially adjustable parameters are fixed in advance. The source of the input data must be defined and ambiguities in these data must be resolved automatically by the algorithm. 4. At least one reasonable null hypothesis should be stated in advance of testing the earthquake prediction method, and it should be stated how this null hypothesis will be used to estimate the statistical significance of the earthquake predictions. The M8 algorithm has successfully predicted several destructive earthquakes, in the sense that the earthquakes occurred inside regions with linear dimensions from 384 to 854 km that the algorithm had identified as being in times of increased probability for strong earthquakes. In addition, M8 has successfully "post predicted" high percentages of strong earthquakes in regions to which it has been applied in retroactive studies. The statistical significance of previous predictions has not been established, however, and post-prediction studies in general are notoriously subject to success-enhancement through hindsight. Nor has it been determined how much more precise an M8 prediction might be than forecasts and probability-of-occurrence estimates made by other techniques. We view our test of M8 both as a means to better determine the effectiveness of M8 and as an experimental structure within which to make observations that might lead to improvements in the algorithm or conceivably lead to a radically different approach to earthquake prediction.

  8. Robust Ultraviolet-Visible (UV-Vis) Partial Least-Squares (PLS) Models for Tannin Quantification in Red Wine.

    PubMed

    Aleixandre-Tudo, José Luis; Nieuwoudt, Helené; Aleixandre, José Luis; Du Toit, Wessel J

    2015-02-04

    The validation of ultraviolet-visible (UV-vis) spectroscopy combined with partial least-squares (PLS) regression to quantify red wine tannins is reported. The methylcellulose precipitable (MCP) tannin assay and the bovine serum albumin (BSA) tannin assay were used as reference methods. To take the high variability of wine tannins into account when the calibration models were built, a diverse data set was collected from samples of South African red wines that consisted of 18 different cultivars, from regions spanning the wine grape-growing areas of South Africa with their various sites, climates, and soils, ranging in vintage from 2000 to 2012. A total of 240 wine samples were analyzed, and these were divided into a calibration set (n = 120) and a validation set (n = 120) to evaluate the predictive ability of the models. To test the robustness of the PLS calibration models, the predictive ability of the classifying variables cultivar, vintage year, and experimental versus commercial wines was also tested. In general, the statistics obtained when BSA was used as a reference method were slightly better than those obtained with MCP. Despite this, the MCP tannin assay should also be considered as a valid reference method for developing PLS calibrations. The best calibration statistics for the prediction of new samples were coefficient of correlation (R 2 val) = 0.89, root mean standard error of prediction (RMSEP) = 0.16, and residual predictive deviation (RPD) = 3.49 for MCP and R 2 val = 0.93, RMSEP = 0.08, and RPD = 4.07 for BSA, when only the UV region (260-310 nm) was selected, which also led to a faster analysis time. In addition, a difference in the results obtained when the predictive ability of the classifying variables vintage, cultivar, or commercial versus experimental wines was studied suggests that tannin composition is highly affected by many factors. This study also discusses the correlations in tannin values between the methylcellulose and protein precipitation methods.

  9. Rare earth element geochemistry of shallow carbonate outcropping strata in Saudi Arabia: Application for depositional environments prediction

    NASA Astrophysics Data System (ADS)

    Eltom, Hassan A.; Abdullatif, Osman M.; Makkawi, Mohammed H.; Eltoum, Isam-Eldin A.

    2017-03-01

    The interpretation of depositional environments provides important information to understand facies distribution and geometry. The classical approach to interpret depositional environments principally relies on the analysis of lithofacies, biofacies and stratigraphic data, among others. An alternative method, based on geochemical data (chemical element data), is advantageous because it can simply, reproducibly and efficiently interpret and refine the interpretation of the depositional environment of carbonate strata. Here we geochemically analyze and statistically model carbonate samples (n = 156) from seven sections of the Arab-D reservoir outcrop analog of central Saudi Arabia, to determine whether the elemental signatures (major, trace and rare earth elements [REEs]) can be effectively used to predict depositional environments. We find that lithofacies associations of the studied outcrop (peritidal to open marine depositional environments) possess altered REE signatures, and that this trend increases stratigraphically from bottom-to-top, which corresponds to an upward shallowing of depositional environments. The relationship between REEs and major, minor and trace elements indicates that contamination by detrital materials is the principal source of REEs, whereas redox condition, marine and diagenetic processes have minimal impact on the relative distribution of REEs in the lithofacies. In a statistical model (factor analysis and logistic regression), REEs, major and trace elements cluster together and serve as markers to differentiate between peritidal and open marine facies and to differentiate between intertidal and subtidal lithofacies within the peritidal facies. The results indicate that statistical modelling of the elemental composition of carbonate strata can be used as a quantitative method to predict depositional environments and regional paleogeography. The significance of this study lies in offering new assessments of the relationships between lithofacies and geochemical elements by using advanced statistical analysis, a method that could be used elsewhere to interpret depositional environment and refine facies models.

  10. Pitfalls in statistical landslide susceptibility modelling

    NASA Astrophysics Data System (ADS)

    Schröder, Boris; Vorpahl, Peter; Märker, Michael; Elsenbeer, Helmut

    2010-05-01

    The use of statistical methods is a well-established approach to predict landslide occurrence probabilities and to assess landslide susceptibility. This is achieved by applying statistical methods relating historical landslide inventories to topographic indices as predictor variables. In our contribution, we compare several new and powerful methods developed in machine learning and well-established in landscape ecology and macroecology for predicting the distribution of shallow landslides in tropical mountain rainforests in southern Ecuador (among others: boosted regression trees, multivariate adaptive regression splines, maximum entropy). Although these methods are powerful, we think it is necessary to follow a basic set of guidelines to avoid some pitfalls regarding data sampling, predictor selection, and model quality assessment, especially if a comparison of different models is contemplated. We therefore suggest to apply a novel toolbox to evaluate approaches to the statistical modelling of landslide susceptibility. Additionally, we propose some methods to open the "black box" as an inherent part of machine learning methods in order to achieve further explanatory insights into preparatory factors that control landslides. Sampling of training data should be guided by hypotheses regarding processes that lead to slope failure taking into account their respective spatial scales. This approach leads to the selection of a set of candidate predictor variables considered on adequate spatial scales. This set should be checked for multicollinearity in order to facilitate model response curve interpretation. Model quality assesses how well a model is able to reproduce independent observations of its response variable. This includes criteria to evaluate different aspects of model performance, i.e. model discrimination, model calibration, and model refinement. In order to assess a possible violation of the assumption of independency in the training samples or a possible lack of explanatory information in the chosen set of predictor variables, the model residuals need to be checked for spatial auto¬correlation. Therefore, we calculate spline correlograms. In addition to this, we investigate partial dependency plots and bivariate interactions plots considering possible interactions between predictors to improve model interpretation. Aiming at presenting this toolbox for model quality assessment, we investigate the influence of strategies in the construction of training datasets for statistical models on model quality.

  11. Fast and Accurate Multivariate Gaussian Modeling of Protein Families: Predicting Residue Contacts and Protein-Interaction Partners

    PubMed Central

    Feinauer, Christoph; Procaccini, Andrea; Zecchina, Riccardo; Weigt, Martin; Pagnani, Andrea

    2014-01-01

    In the course of evolution, proteins show a remarkable conservation of their three-dimensional structure and their biological function, leading to strong evolutionary constraints on the sequence variability between homologous proteins. Our method aims at extracting such constraints from rapidly accumulating sequence data, and thereby at inferring protein structure and function from sequence information alone. Recently, global statistical inference methods (e.g. direct-coupling analysis, sparse inverse covariance estimation) have achieved a breakthrough towards this aim, and their predictions have been successfully implemented into tertiary and quaternary protein structure prediction methods. However, due to the discrete nature of the underlying variable (amino-acids), exact inference requires exponential time in the protein length, and efficient approximations are needed for practical applicability. Here we propose a very efficient multivariate Gaussian modeling approach as a variant of direct-coupling analysis: the discrete amino-acid variables are replaced by continuous Gaussian random variables. The resulting statistical inference problem is efficiently and exactly solvable. We show that the quality of inference is comparable or superior to the one achieved by mean-field approximations to inference with discrete variables, as done by direct-coupling analysis. This is true for (i) the prediction of residue-residue contacts in proteins, and (ii) the identification of protein-protein interaction partner in bacterial signal transduction. An implementation of our multivariate Gaussian approach is available at the website http://areeweb.polito.it/ricerca/cmp/code. PMID:24663061

  12. Walking through the statistical black boxes of plant breeding.

    PubMed

    Xavier, Alencar; Muir, William M; Craig, Bruce; Rainey, Katy Martin

    2016-10-01

    The main statistical procedures in plant breeding are based on Gaussian process and can be computed through mixed linear models. Intelligent decision making relies on our ability to extract useful information from data to help us achieve our goals more efficiently. Many plant breeders and geneticists perform statistical analyses without understanding the underlying assumptions of the methods or their strengths and pitfalls. In other words, they treat these statistical methods (software and programs) like black boxes. Black boxes represent complex pieces of machinery with contents that are not fully understood by the user. The user sees the inputs and outputs without knowing how the outputs are generated. By providing a general background on statistical methodologies, this review aims (1) to introduce basic concepts of machine learning and its applications to plant breeding; (2) to link classical selection theory to current statistical approaches; (3) to show how to solve mixed models and extend their application to pedigree-based and genomic-based prediction; and (4) to clarify how the algorithms of genome-wide association studies work, including their assumptions and limitations.

  13. Statistical downscaling modeling with quantile regression using lasso to estimate extreme rainfall

    NASA Astrophysics Data System (ADS)

    Santri, Dewi; Wigena, Aji Hamim; Djuraidah, Anik

    2016-02-01

    Rainfall is one of the climatic elements with high diversity and has many negative impacts especially extreme rainfall. Therefore, there are several methods that required to minimize the damage that may occur. So far, Global circulation models (GCM) are the best method to forecast global climate changes include extreme rainfall. Statistical downscaling (SD) is a technique to develop the relationship between GCM output as a global-scale independent variables and rainfall as a local- scale response variable. Using GCM method will have many difficulties when assessed against observations because GCM has high dimension and multicollinearity between the variables. The common method that used to handle this problem is principal components analysis (PCA) and partial least squares regression. The new method that can be used is lasso. Lasso has advantages in simultaneuosly controlling the variance of the fitted coefficients and performing automatic variable selection. Quantile regression is a method that can be used to detect extreme rainfall in dry and wet extreme. Objective of this study is modeling SD using quantile regression with lasso to predict extreme rainfall in Indramayu. The results showed that the estimation of extreme rainfall (extreme wet in January, February and December) in Indramayu could be predicted properly by the model at quantile 90th.

  14. Use of statistical study methods for the analysis of the results of the imitation modeling of radiation transfer

    NASA Astrophysics Data System (ADS)

    Alekseenko, M. A.; Gendrina, I. Yu.

    2017-11-01

    Recently, due to the abundance of various types of observational data in the systems of vision through the atmosphere and the need for their processing, the use of various methods of statistical research in the study of such systems as correlation-regression analysis, dynamic series, variance analysis, etc. is actual. We have attempted to apply elements of correlation-regression analysis for the study and subsequent prediction of the patterns of radiation transfer in these systems same as in the construction of radiation models of the atmosphere. In this paper, we present some results of statistical processing of the results of numerical simulation of the characteristics of vision systems through the atmosphere obtained with the help of a special software package.1

  15. Characterizing bias correction uncertainty in wheat yield predictions

    NASA Astrophysics Data System (ADS)

    Ortiz, Andrea Monica; Jones, Julie; Freckleton, Robert; Scaife, Adam

    2017-04-01

    Farming systems are under increased pressure due to current and future climate change, variability and extremes. Research on the impacts of climate change on crop production typically rely on the output of complex Global and Regional Climate Models, which are used as input to crop impact models. Yield predictions from these top-down approaches can have high uncertainty for several reasons, including diverse model construction and parameterization, future emissions scenarios, and inherent or response uncertainty. These uncertainties propagate down each step of the 'cascade of uncertainty' that flows from climate input to impact predictions, leading to yield predictions that may be too complex for their intended use in practical adaptation options. In addition to uncertainty from impact models, uncertainty can also stem from the intermediate steps that are used in impact studies to adjust climate model simulations to become more realistic when compared to observations, or to correct the spatial or temporal resolution of climate simulations, which are often not directly applicable as input into impact models. These important steps of bias correction or calibration also add uncertainty to final yield predictions, given the various approaches that exist to correct climate model simulations. In order to address how much uncertainty the choice of bias correction method can add to yield predictions, we use several evaluation runs from Regional Climate Models from the Coordinated Regional Downscaling Experiment over Europe (EURO-CORDEX) at different resolutions together with different bias correction methods (linear and variance scaling, power transformation, quantile-quantile mapping) as input to a statistical crop model for wheat, a staple European food crop. The objective of our work is to compare the resulting simulation-driven hindcasted wheat yields to climate observation-driven wheat yield hindcasts from the UK and Germany in order to determine ranges of yield uncertainty that result from different climate model simulation input and bias correction methods. We simulate wheat yields using a General Linear Model that includes the effects of seasonal maximum temperatures and precipitation, since wheat is sensitive to heat stress during important developmental stages. We use the same statistical model to predict future wheat yields using the recently available bias-corrected simulations of EURO-CORDEX-Adjust. While statistical models are often criticized for their lack of complexity, an advantage is that we are here able to consider only the effect of the choice of climate model, resolution or bias correction method on yield. Initial results using both past and future bias-corrected climate simulations with a process-based model will also be presented. Through these methods, we make recommendations in preparing climate model output for crop models.

  16. Characterization and identification of ubiquitin conjugation sites with E3 ligase recognition specificities.

    PubMed

    Nguyen, Van-Nui; Huang, Kai-Yao; Huang, Chien-Hsun; Chang, Tzu-Hao; Bretaña, Neil; Lai, K; Weng, Julia; Lee, Tzong-Yi

    2015-01-01

    In eukaryotes, ubiquitin-conjugation is an important mechanism underlying proteasome-mediated degradation of proteins, and as such, plays an essential role in the regulation of many cellular processes. In the ubiquitin-proteasome pathway, E3 ligases play important roles by recognizing a specific protein substrate and catalyzing the attachment of ubiquitin to a lysine (K) residue. As more and more experimental data on ubiquitin conjugation sites become available, it becomes possible to develop prediction models that can be scaled to big data. However, no development that focuses on the investigation of ubiquitinated substrate specificities has existed. Herein, we present an approach that exploits an iteratively statistical method to identify ubiquitin conjugation sites with substrate site specificities. In this investigation, totally 6259 experimentally validated ubiquitinated proteins were obtained from dbPTM. After having filtered out homologous fragments with 40% sequence identity, the training data set contained 2658 ubiquitination sites (positive data) and 5532 non-ubiquitinated sites (negative data). Due to the difficulty in characterizing the substrate site specificities of E3 ligases by conventional sequence logo analysis, a recursively statistical method has been applied to obtain significant conserved motifs. The profile hidden Markov model (profile HMM) was adopted to construct the predictive models learned from the identified substrate motifs. A five-fold cross validation was then used to evaluate the predictive model, achieving sensitivity, specificity, and accuracy of 73.07%, 65.46%, and 67.93%, respectively. Additionally, an independent testing set, completely blind to the training data of the predictive model, was used to demonstrate that the proposed method could provide a promising accuracy (76.13%) and outperform other ubiquitination site prediction tool. A case study demonstrated the effectiveness of the characterized substrate motifs for identifying ubiquitination sites. The proposed method presents a practical means of preliminary analysis and greatly diminishes the total number of potential targets required for further experimental confirmation. This method may help unravel their mechanisms and roles in E3 recognition and ubiquitin-mediated protein degradation.

  17. Predicting Survival of De Novo Metastatic Breast Cancer in Asian Women: Systematic Review and Validation Study

    PubMed Central

    Miao, Hui; Hartman, Mikael; Bhoo-Pathy, Nirmala; Lee, Soo-Chin; Taib, Nur Aishah; Tan, Ern-Yu; Chan, Patrick; Moons, Karel G. M.; Wong, Hoong-Seam; Goh, Jeremy; Rahim, Siti Mastura; Yip, Cheng-Har; Verkooijen, Helena M.

    2014-01-01

    Background In Asia, up to 25% of breast cancer patients present with distant metastases at diagnosis. Given the heterogeneous survival probabilities of de novo metastatic breast cancer, individual outcome prediction is challenging. The aim of the study is to identify existing prognostic models for patients with de novo metastatic breast cancer and validate them in Asia. Materials and Methods We performed a systematic review to identify prediction models for metastatic breast cancer. Models were validated in 642 women with de novo metastatic breast cancer registered between 2000 and 2010 in the Singapore Malaysia Hospital Based Breast Cancer Registry. Survival curves for low, intermediate and high-risk groups according to each prognostic score were compared by log-rank test and discrimination of the models was assessed by concordance statistic (C-statistic). Results We identified 16 prediction models, seven of which were for patients with brain metastases only. Performance status, estrogen receptor status, metastatic site(s) and disease-free interval were the most common predictors. We were able to validate nine prediction models. The capacity of the models to discriminate between poor and good survivors varied from poor to fair with C-statistics ranging from 0.50 (95% CI, 0.48–0.53) to 0.63 (95% CI, 0.60–0.66). Conclusion The discriminatory performance of existing prediction models for de novo metastatic breast cancer in Asia is modest. Development of an Asian-specific prediction model is needed to improve prognostication and guide decision making. PMID:24695692

  18. Individual and population pharmacokinetic compartment analysis: a graphic procedure for quantification of predictive performance.

    PubMed

    Eksborg, Staffan

    2013-01-01

    Pharmacokinetic studies are important for optimizing of drug dosing, but requires proper validation of the used pharmacokinetic procedures. However, simple and reliable statistical methods suitable for evaluation of the predictive performance of pharmacokinetic analysis are essentially lacking. The aim of the present study was to construct and evaluate a graphic procedure for quantification of predictive performance of individual and population pharmacokinetic compartment analysis. Original data from previously published pharmacokinetic compartment analyses after intravenous, oral, and epidural administration, and digitized data, obtained from published scatter plots of observed vs predicted drug concentrations from population pharmacokinetic studies using the NPEM algorithm and NONMEM computer program and Bayesian forecasting procedures, were used for estimating the predictive performance according to the proposed graphical method and by the method of Sheiner and Beal. The graphical plot proposed in the present paper proved to be a useful tool for evaluation of predictive performance of both individual and population compartment pharmacokinetic analysis. The proposed method is simple to use and gives valuable information concerning time- and concentration-dependent inaccuracies that might occur in individual and population pharmacokinetic compartment analysis. Predictive performance can be quantified by the fraction of concentration ratios within arbitrarily specified ranges, e.g. within the range 0.8-1.2.

  19. Prediction Methods in Solar Sunspots Cycles

    PubMed Central

    Ng, Kim Kwee

    2016-01-01

    An understanding of the Ohl’s Precursor Method, which is used to predict the upcoming sunspots activity, is presented by employing a simplified movable divided-blocks diagram. Using a new approach, the total number of sunspots in a solar cycle and the maximum averaged monthly sunspots number Rz(max) are both shown to be statistically related to the geomagnetic activity index in the prior solar cycle. The correlation factors are significant and they are respectively found to be 0.91 ± 0.13 and 0.85 ± 0.17. The projected result is consistent with the current observation of solar cycle 24 which appears to have attained at least Rz(max) at 78.7 ± 11.7 in March 2014. Moreover, in a statistical study of the time-delayed solar events, the average time between the peak in the monthly geomagnetic index and the peak in the monthly sunspots numbers in the succeeding ascending phase of the sunspot activity is found to be 57.6 ± 3.1 months. The statistically determined time-delayed interval confirms earlier observational results by others that the Sun’s electromagnetic dipole is moving toward the Sun’s Equator during a solar cycle. PMID:26868269

  20. Predicting Chemically Induced Duodenal Ulcer and Adrenal Necrosis with Classification Trees

    NASA Astrophysics Data System (ADS)

    Giampaolo, Casimiro; Gray, Andrew T.; Olshen, Richard A.; Szabo, Sandor

    1991-07-01

    Binary tree-structured statistical classification algorithms and properties of 56 model alkyl nucleophiles were brought to bear on two problems of experimental pharmacology and toxicology. Each rat of a learning sample of 745 was administered one compound and autopsied to determine the presence of duodenal ulcer or adrenal hemorrhagic necrosis. The cited statistical classification schemes were then applied to these outcomes and 67 features of the compounds to ascertain those characteristics that are associated with biologic activity. For predicting duodenal ulceration, dipole moment, melting point, and solubility in octanol are particularly important, while for predicting adrenal necrosis, important features include the number of sulfhydryl groups and double bonds. These methods may constitute inexpensive but powerful ways to screen untested compounds for possible organ-specific toxicity. Mechanisms for the etiology and pathogenesis of the duodenal and adrenal lesions are suggested, as are additional avenues for drug design.

  1. Narrowing the scope of failure prediction using targeted fault load injection

    NASA Astrophysics Data System (ADS)

    Jordan, Paul L.; Peterson, Gilbert L.; Lin, Alan C.; Mendenhall, Michael J.; Sellers, Andrew J.

    2018-05-01

    As society becomes more dependent upon computer systems to perform increasingly critical tasks, ensuring that those systems do not fail becomes increasingly important. Many organizations depend heavily on desktop computers for day-to-day operations. Unfortunately, the software that runs on these computers is written by humans and, as such, is still subject to human error and consequent failure. A natural solution is to use statistical machine learning to predict failure. However, since failure is still a relatively rare event, obtaining labelled training data to train these models is not a trivial task. This work presents new simulated fault-inducing loads that extend the focus of traditional fault injection techniques to predict failure in the Microsoft enterprise authentication service and Apache web server. These new fault loads were successful in creating failure conditions that were identifiable using statistical learning methods, with fewer irrelevant faults being created.

  2. New efficient optimizing techniques for Kalman filters and numerical weather prediction models

    NASA Astrophysics Data System (ADS)

    Famelis, Ioannis; Galanis, George; Liakatas, Aristotelis

    2016-06-01

    The need for accurate local environmental predictions and simulations beyond the classical meteorological forecasts are increasing the last years due to the great number of applications that are directly or not affected: renewable energy resource assessment, natural hazards early warning systems, global warming and questions on the climate change can be listed among them. Within this framework the utilization of numerical weather and wave prediction systems in conjunction with advanced statistical techniques that support the elimination of the model bias and the reduction of the error variability may successfully address the above issues. In the present work, new optimization methods are studied and tested in selected areas of Greece where the use of renewable energy sources is of critical. The added value of the proposed work is due to the solid mathematical background adopted making use of Information Geometry and Statistical techniques, new versions of Kalman filters and state of the art numerical analysis tools.

  3. A statistical approach based on accumulated degree-days to predict decomposition-related processes in forensic studies.

    PubMed

    Michaud, Jean-Philippe; Moreau, Gaétan

    2011-01-01

    Using pig carcasses exposed over 3 years in rural fields during spring, summer, and fall, we studied the relationship between decomposition stages and degree-day accumulation (i) to verify the predictability of the decomposition stages used in forensic entomology to document carcass decomposition and (ii) to build a degree-day accumulation model applicable to various decomposition-related processes. Results indicate that the decomposition stages can be predicted with accuracy from temperature records and that a reliable degree-day index can be developed to study decomposition-related processes. The development of degree-day indices opens new doors for researchers and allows for the application of inferential tools unaffected by climatic variability, as well as for the inclusion of statistics in a science that is primarily descriptive and in need of validation methods in courtroom proceedings. © 2010 American Academy of Forensic Sciences.

  4. Predicting perceptual quality of images in realistic scenario using deep filter banks

    NASA Astrophysics Data System (ADS)

    Zhang, Weixia; Yan, Jia; Hu, Shiyong; Ma, Yang; Deng, Dexiang

    2018-03-01

    Classical image perceptual quality assessment models usually resort to natural scene statistic methods, which are based on an assumption that certain reliable statistical regularities hold on undistorted images and will be corrupted by introduced distortions. However, these models usually fail to accurately predict degradation severity of images in realistic scenarios since complex, multiple, and interactive authentic distortions usually appear on them. We propose a quality prediction model based on convolutional neural network. Quality-aware features extracted from filter banks of multiple convolutional layers are aggregated into the image representation. Furthermore, an easy-to-implement and effective feature selection strategy is used to further refine the image representation and finally a linear support vector regression model is trained to map image representation into images' subjective perceptual quality scores. The experimental results on benchmark databases present the effectiveness and generalizability of the proposed model.

  5. A practical approach for the scale-up of roller compaction process.

    PubMed

    Shi, Weixian; Sprockel, Omar L

    2016-09-01

    An alternative approach for the scale-up of ribbon formation during roller compaction was investigated, which required only one batch at the commercial scale to set the operational conditions. The scale-up of ribbon formation was based on a probability method. It was sufficient in describing the mechanism of ribbon formation at both scales. In this method, a statistical relationship between roller compaction parameters and ribbon attributes (thickness and density) was first defined with DoE using a pilot Alexanderwerk WP120 roller compactor. While the milling speed was included in the design, it has no practical effect on granule properties within the study range despite its statistical significance. The statistical relationship was then adapted to a commercial Alexanderwerk WP200 roller compactor with one experimental run. The experimental run served as a calibration of the statistical model parameters. The proposed transfer method was then confirmed by conducting a mapping study on the Alexanderwerk WP200 using a factorial DoE, which showed a match between the predictions and the verification experiments. The study demonstrates the applicability of the roller compaction transfer method using the statistical model from the development scale calibrated with one experiment point at the commercial scale. Copyright © 2016 Elsevier B.V. All rights reserved.

  6. Visualization of the significance of Receiver Operating Characteristics based on confidence ellipses

    NASA Astrophysics Data System (ADS)

    Sarlis, Nicholas V.; Christopoulos, Stavros-Richard G.

    2014-03-01

    The Receiver Operating Characteristics (ROC) is used for the evaluation of prediction methods in various disciplines like meteorology, geophysics, complex system physics, medicine etc. The estimation of the significance of a binary prediction method, however, remains a cumbersome task and is usually done by repeating the calculations by Monte Carlo. The FORTRAN code provided here simplifies this problem by evaluating the significance of binary predictions for a family of ellipses which are based on confidence ellipses and cover the whole ROC space. Catalogue identifier: AERY_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AERY_v1_0.html Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 11511 No. of bytes in distributed program, including test data, etc.: 72906 Distribution format: tar.gz Programming language: FORTRAN. Computer: Any computer supporting a GNU FORTRAN compiler. Operating system: Linux, MacOS, Windows. RAM: 1Mbyte Classification: 4.13, 9, 14. Nature of problem: The Receiver Operating Characteristics (ROC) is used for the evaluation of prediction methods in various disciplines like meteorology, geophysics, complex system physics, medicine etc. The estimation of the significance of a binary prediction method, however, remains a cumbersome task and is usually done by repeating the calculations by Monte Carlo. The FORTRAN code provided here simplifies this problem by evaluating the significance of binary predictions for a family of ellipses which are based on confidence ellipses and cover the whole ROC space. Solution method: Using the statistics of random binary predictions for a given value of the predictor threshold ɛt, one can construct the corresponding confidence ellipses. The envelope of these corresponding confidence ellipses is estimated when ɛt varies from 0 to 1. This way a new family of ellipses is obtained, named k-ellipses, which covers the whole ROC plane and leads to a well defined Area Under the Curve (AUC). For the latter quantity, Mason and Graham [1] have shown that it follows the Mann-Whitney U-statistics [2] which can be applied [3] for the estimation of the statistical significance of each k-ellipse. As the transformation is invertible, any point on the ROC plane corresponds to a unique value of k, thus to a unique p-value to obtain this point by chance. The present FORTRAN code provides this p-value field on the ROC plane as well as the k-ellipses corresponding to the (p=)10%, 5% and 1% significance levels using as input the number of the positive (P) and negative (Q) cases to be predicted. Unusual features: In some machines, the compiler directive -O2 or -O3 should be used to avoid NaN’s in some points of the p-field along the diagonal. Running time: Depending on the application, e.g., 4s for an Intel(R) Core(TM)2 CPU E7600 at 3.06 GHz with 2 GB RAM for the examples presented here References: [1] S.J. Mason, N.E. Graham, Quart. J. Roy. Meteor. Soc. 128 (2002) 2145. [2] H.B. Mann, D.R. Whitney, Ann. Math. Statist. 18 (1947) 50. [3] L.C. Dinneen, B.C. Blakesley, J. Roy. Stat. Soc. Ser. C Appl. Stat. 22 (1973) 269.

  7. Avalanche Statistics Identify Intrinsic Stellar Processes near Criticality in KIC 8462852

    NASA Astrophysics Data System (ADS)

    Sheikh, Mohammed A.; Weaver, Richard L.; Dahmen, Karin A.

    2016-12-01

    The star KIC8462852 (Tabby's star) has shown anomalous drops in light flux. We perform a statistical analysis of the more numerous smaller dimming events by using methods found useful for avalanches in ferromagnetism and plastic flow. Scaling exponents for avalanche statistics and temporal profiles of the flux during the dimming events are close to mean field predictions. Scaling collapses suggest that this star may be near a nonequilibrium critical point. The large events are interpreted as avalanches marked by modified dynamics, limited by the system size, and not within the scaling regime.

  8. Machine learning-based methods for prediction of linear B-cell epitopes.

    PubMed

    Wang, Hsin-Wei; Pai, Tun-Wen

    2014-01-01

    B-cell epitope prediction facilitates immunologists in designing peptide-based vaccine, diagnostic test, disease prevention, treatment, and antibody production. In comparison with T-cell epitope prediction, the performance of variable length B-cell epitope prediction is still yet to be satisfied. Fortunately, due to increasingly available verified epitope databases, bioinformaticians could adopt machine learning-based algorithms on all curated data to design an improved prediction tool for biomedical researchers. Here, we have reviewed related epitope prediction papers, especially those for linear B-cell epitope prediction. It should be noticed that a combination of selected propensity scales and statistics of epitope residues with machine learning-based tools formulated a general way for constructing linear B-cell epitope prediction systems. It is also observed from most of the comparison results that the kernel method of support vector machine (SVM) classifier outperformed other machine learning-based approaches. Hence, in this chapter, except reviewing recently published papers, we have introduced the fundamentals of B-cell epitope and SVM techniques. In addition, an example of linear B-cell prediction system based on physicochemical features and amino acid combinations is illustrated in details.

  9. Remaining dischargeable time prediction for lithium-ion batteries using unscented Kalman filter

    NASA Astrophysics Data System (ADS)

    Dong, Guangzhong; Wei, Jingwen; Chen, Zonghai; Sun, Han; Yu, Xiaowei

    2017-10-01

    To overcome the range anxiety, one of the important strategies is to accurately predict the range or dischargeable time of the battery system. To accurately predict the remaining dischargeable time (RDT) of a battery, a RDT prediction framework based on accurate battery modeling and state estimation is presented in this paper. Firstly, a simplified linearized equivalent-circuit-model is developed to simulate the dynamic characteristics of a battery. Then, an online recursive least-square-algorithm method and unscented-Kalman-filter are employed to estimate the system matrices and SOC at every prediction point. Besides, a discrete wavelet transform technique is employed to capture the statistical information of past dynamics of input currents, which are utilized to predict the future battery currents. Finally, the RDT can be predicted based on the battery model, SOC estimation results and predicted future battery currents. The performance of the proposed methodology has been verified by a lithium-ion battery cell. Experimental results indicate that the proposed method can provide an accurate SOC and parameter estimation and the predicted RDT can solve the range anxiety issues.

  10. Predictive capacity of a non-radioisotopic local lymph node assay using flow cytometry, LLNA:BrdU-FCM: Comparison of a cutoff approach and inferential statistics.

    PubMed

    Kim, Da-Eun; Yang, Hyeri; Jang, Won-Hee; Jung, Kyoung-Mi; Park, Miyoung; Choi, Jin Kyu; Jung, Mi-Sook; Jeon, Eun-Young; Heo, Yong; Yeo, Kyung-Wook; Jo, Ji-Hoon; Park, Jung Eun; Sohn, Soo Jung; Kim, Tae Sung; Ahn, Il Young; Jeong, Tae-Cheon; Lim, Kyung-Min; Bae, SeungJin

    2016-01-01

    In order for a novel test method to be applied for regulatory purposes, its reliability and relevance, i.e., reproducibility and predictive capacity, must be demonstrated. Here, we examine the predictive capacity of a novel non-radioisotopic local lymph node assay, LLNA:BrdU-FCM (5-bromo-2'-deoxyuridine-flow cytometry), with a cutoff approach and inferential statistics as a prediction model. 22 reference substances in OECD TG429 were tested with a concurrent positive control, hexylcinnamaldehyde 25%(PC), and the stimulation index (SI) representing the fold increase in lymph node cells over the vehicle control was obtained. The optimal cutoff SI (2.7≤cutoff <3.5), with respect to predictive capacity, was obtained by a receiver operating characteristic curve, which produced 90.9% accuracy for the 22 substances. To address the inter-test variability in responsiveness, SI values standardized with PC were employed to obtain the optimal percentage cutoff (42.6≤cutoff <57.3% of PC), which produced 86.4% accuracy. A test substance may be diagnosed as a sensitizer if a statistically significant increase in SI is elicited. The parametric one-sided t-test and non-parametric Wilcoxon rank-sum test produced 77.3% accuracy. Similarly, a test substance could be defined as a sensitizer if the SI means of the vehicle control, and of the low, middle, and high concentrations were statistically significantly different, which was tested using ANOVA or Kruskal-Wallis, with post hoc analysis, Dunnett, or DSCF (Dwass-Steel-Critchlow-Fligner), respectively, depending on the equal variance test, producing 81.8% accuracy. The absolute SI-based cutoff approach produced the best predictive capacity, however the discordant decisions between prediction models need to be examined further. Copyright © 2015 Elsevier Inc. All rights reserved.

  11. Preparing systems engineering and computing science students in disciplined methods, quantitative, and advanced statistical techniques to improve process performance

    NASA Astrophysics Data System (ADS)

    McCray, Wilmon Wil L., Jr.

    The research was prompted by a need to conduct a study that assesses process improvement, quality management and analytical techniques taught to students in U.S. colleges and universities undergraduate and graduate systems engineering and the computing science discipline (e.g., software engineering, computer science, and information technology) degree programs during their academic training that can be applied to quantitatively manage processes for performance. Everyone involved in executing repeatable processes in the software and systems development lifecycle processes needs to become familiar with the concepts of quantitative management, statistical thinking, process improvement methods and how they relate to process-performance. Organizations are starting to embrace the de facto Software Engineering Institute (SEI) Capability Maturity Model Integration (CMMI RTM) Models as process improvement frameworks to improve business processes performance. High maturity process areas in the CMMI model imply the use of analytical, statistical, quantitative management techniques, and process performance modeling to identify and eliminate sources of variation, continually improve process-performance; reduce cost and predict future outcomes. The research study identifies and provides a detail discussion of the gap analysis findings of process improvement and quantitative analysis techniques taught in U.S. universities systems engineering and computing science degree programs, gaps that exist in the literature, and a comparison analysis which identifies the gaps that exist between the SEI's "healthy ingredients " of a process performance model and courses taught in U.S. universities degree program. The research also heightens awareness that academicians have conducted little research on applicable statistics and quantitative techniques that can be used to demonstrate high maturity as implied in the CMMI models. The research also includes a Monte Carlo simulation optimization model and dashboard that demonstrates the use of statistical methods, statistical process control, sensitivity analysis, quantitative and optimization techniques to establish a baseline and predict future customer satisfaction index scores (outcomes). The American Customer Satisfaction Index (ACSI) model and industry benchmarks were used as a framework for the simulation model.

  12. A Stochastic Framework for Evaluating Seizure Prediction Algorithms Using Hidden Markov Models

    PubMed Central

    Wong, Stephen; Gardner, Andrew B.; Krieger, Abba M.; Litt, Brian

    2007-01-01

    Responsive, implantable stimulation devices to treat epilepsy are now in clinical trials. New evidence suggests that these devices may be more effective when they deliver therapy before seizure onset. Despite years of effort, prospective seizure prediction, which could improve device performance, remains elusive. In large part, this is explained by lack of agreement on a statistical framework for modeling seizure generation and a method for validating algorithm performance. We present a novel stochastic framework based on a three-state hidden Markov model (HMM) (representing interictal, preictal, and seizure states) with the feature that periods of increased seizure probability can transition back to the interictal state. This notion reflects clinical experience and may enhance interpretation of published seizure prediction studies. Our model accommodates clipped EEG segments and formalizes intuitive notions regarding statistical validation. We derive equations for type I and type II errors as a function of the number of seizures, duration of interictal data, and prediction horizon length and we demonstrate the model’s utility with a novel seizure detection algorithm that appeared to predicted seizure onset. We propose this framework as a vital tool for designing and validating prediction algorithms and for facilitating collaborative research in this area. PMID:17021032

  13. Gaussian covariance graph models accounting for correlated marker effects in genome-wide prediction.

    PubMed

    Martínez, C A; Khare, K; Rahman, S; Elzo, M A

    2017-10-01

    Several statistical models used in genome-wide prediction assume uncorrelated marker allele substitution effects, but it is known that these effects may be correlated. In statistics, graphical models have been identified as a useful tool for covariance estimation in high-dimensional problems and it is an area that has recently experienced a great expansion. In Gaussian covariance graph models (GCovGM), the joint distribution of a set of random variables is assumed to be Gaussian and the pattern of zeros of the covariance matrix is encoded in terms of an undirected graph G. In this study, methods adapting the theory of GCovGM to genome-wide prediction were developed (Bayes GCov, Bayes GCov-KR and Bayes GCov-H). In simulated data sets, improvements in correlation between phenotypes and predicted breeding values and accuracies of predicted breeding values were found. Our models account for correlation of marker effects and permit to accommodate general structures as opposed to models proposed in previous studies, which consider spatial correlation only. In addition, they allow incorporation of biological information in the prediction process through its use when constructing graph G, and their extension to the multi-allelic loci case is straightforward. © 2017 Blackwell Verlag GmbH.

  14. Derivation and Validation of the Surgical Site Infections Risk Model Using Health Administrative Data.

    PubMed

    van Walraven, Carl; Jackson, Timothy D; Daneman, Nick

    2016-04-01

    OBJECTIVE Surgical site infections (SSIs) are common hospital-acquired infections. Tracking SSIs is important to monitor their incidence, and this process requires primary data collection. In this study, we derived and validated a method using health administrative data to predict the probability that a person who had surgery would develop an SSI within 30 days. METHODS All patients enrolled in the National Surgical Quality Improvement Program (NSQIP) from 2 sites were linked to population-based administrative datasets in Ontario, Canada. We derived a multivariate model, stratified by surgical specialty, to determine the independent association of SSI status with patient and hospitalization covariates as well as physician claim codes. This SSI risk model was validated in 2 cohorts. RESULTS The derivation cohort included 5,359 patients with a 30-day SSI incidence of 6.0% (n=118). The SSI risk model predicted the probability that a person had an SSI based on 7 covariates: index hospitalization diagnostic score; physician claims score; emergency visit diagnostic score; operation duration; surgical service; and potential SSI codes. More than 90% of patients had predicted SSI risks lower than 10%. In the derivation group, model discrimination and calibration was excellent (C statistic, 0.912; Hosmer-Lemeshow [H-L] statistic, P=.47). In the 2 validation groups, performance decreased slightly (C statistics, 0.853 and 0.812; H-L statistics, 26.4 [P=.0009] and 8.0 [P=.42]), but low-risk patients were accurately identified. CONCLUSION Health administrative data can effectively identify postoperative patients with a very low risk of surgical site infection within 30 days of their procedure. Records of higher-risk patients can be reviewed to confirm SSI status.

  15. Integrative genetic risk prediction using non-parametric empirical Bayes classification.

    PubMed

    Zhao, Sihai Dave

    2017-06-01

    Genetic risk prediction is an important component of individualized medicine, but prediction accuracies remain low for many complex diseases. A fundamental limitation is the sample sizes of the studies on which the prediction algorithms are trained. One way to increase the effective sample size is to integrate information from previously existing studies. However, it can be difficult to find existing data that examine the target disease of interest, especially if that disease is rare or poorly studied. Furthermore, individual-level genotype data from these auxiliary studies are typically difficult to obtain. This article proposes a new approach to integrative genetic risk prediction of complex diseases with binary phenotypes. It accommodates possible heterogeneity in the genetic etiologies of the target and auxiliary diseases using a tuning parameter-free non-parametric empirical Bayes procedure, and can be trained using only auxiliary summary statistics. Simulation studies show that the proposed method can provide superior predictive accuracy relative to non-integrative as well as integrative classifiers. The method is applied to a recent study of pediatric autoimmune diseases, where it substantially reduces prediction error for certain target/auxiliary disease combinations. The proposed method is implemented in the R package ssa. © 2016, The International Biometric Society.

  16. Applications of Principled Search Methods in Climate Influences and Mechanisms

    NASA Technical Reports Server (NTRS)

    Glymour, Clark

    2005-01-01

    Forest and grass fires cause economic losses in the billions of dollars in the U.S. alone. In addition, boreal forests constitute a large carbon store; it has been estimated that, were no burning to occur, an additional 7 gigatons of carbon would be sequestered in boreal soils each century. Effective wildfire suppression requires anticipation of locales and times for which wildfire is most probable, preferably with a two to four week forecast, so that limited resources can be efficiently deployed. The United States Forest Service (USFS), and other experts and agencies have developed several measures of fire risk combining physical principles and expert judgment, and have used them in automated procedures for forecasting fire risk. Forecasting accuracies for some fire risk indices in combination with climate and other variables have been estimated for specific locations, with the value of fire risk index variables assessed by their statistical significance in regressions. In other cases, the MAPSS forecasts [23, 241 for example, forecasting accuracy has been estimated only by simulated data. We describe alternative forecasting methods that predict fire probability by locale and time using statistical or machine learning procedures trained on historical data, and we give comparative assessments of their forecasting accuracy for one fire season year, April- October, 2003, for all U.S. Forest Service lands. Aside from providing an accuracy baseline for other forecasting methods, the results illustrate the interdependence between the statistical significance of prediction variables and the forecasting method used.

  17. Deriving Signatures of In Vivo Toxicity Using Both Efficacy and Potency Information from In Vitro Assays: Evaluating Model Performance as a Function of Increasing Variability in Experimental Data

    EPA Science Inventory

    The US EPA ToxCast program aims to develop methods for mechanistically-based chemical prioritization using a suite of high throughput, in vitro assays that probe relevant biological pathways, and coupling them with statistical and machine learning methods that produce predictive ...

  18. Heritability of and Mortality Prediction With a Longevity Phenotype: The Healthy Aging Index

    PubMed Central

    2014-01-01

    Background. Longevity-associated genes may modulate risk for age-related diseases and survival. The Healthy Aging Index (HAI) may be a subphenotype of longevity, which can be constructed in many studies for genetic analysis. We investigated the HAI’s association with survival in the Cardiovascular Health Study and heritability in the Long Life Family Study. Methods. The HAI includes systolic blood pressure, pulmonary vital capacity, creatinine, fasting glucose, and Modified Mini-Mental Status Examination score, each scored 0, 1, or 2 using approximate tertiles and summed from 0 (healthy) to 10 (unhealthy). In Cardiovascular Health Study, the association with mortality and accuracy predicting death were determined with Cox proportional hazards analysis and c-statistics, respectively. In Long Life Family Study, heritability was determined with a variance component–based family analysis using a polygenic model. Results. Cardiovascular Health Study participants with unhealthier index scores (7–10) had 2.62-fold (95% confidence interval: 2.22, 3.10) greater mortality than participants with healthier scores (0–2). The HAI alone predicted death moderately well (c-statistic = 0.643, 95% confidence interval: 0.626, 0.661, p < .0001) and slightly worse than age alone (c-statistic = 0.700, 95% confidence interval: 0.684, 0.717, p < .0001; p < .0001 for comparison of c-statistics). Prediction increased significantly with adjustment for demographics, health behaviors, and clinical comorbidities (c-statistic = 0.780, 95% confidence interval: 0.765, 0.794, p < .0001). In Long Life Family Study, the heritability of the HAI was 0.295 (p < .0001) overall, 0.387 (p < .0001) in probands, and 0.238 (p = .0004) in offspring. Conclusion. The HAI should be investigated further as a candidate phenotype for uncovering longevity-associated genes in humans. PMID:23913930

  19. Statistical classification of road pavements using near field vehicle rolling noise measurements.

    PubMed

    Paulo, Joel Preto; Coelho, J L Bento; Figueiredo, Mário A T

    2010-10-01

    Low noise surfaces have been increasingly considered as a viable and cost-effective alternative to acoustical barriers. However, road planners and administrators frequently lack information on the correlation between the type of road surface and the resulting noise emission profile. To address this problem, a method to identify and classify different types of road pavements was developed, whereby near field road noise is analyzed using statistical learning methods. The vehicle rolling sound signal near the tires and close to the road surface was acquired by two microphones in a special arrangement which implements the Close-Proximity method. A set of features, characterizing the properties of the road pavement, was extracted from the corresponding sound profiles. A feature selection method was used to automatically select those that are most relevant in predicting the type of pavement, while reducing the computational cost. A set of different types of road pavement segments were tested and the performance of the classifier was evaluated. Results of pavement classification performed during a road journey are presented on a map, together with geographical data. This procedure leads to a considerable improvement in the quality of road pavement noise data, thereby increasing the accuracy of road traffic noise prediction models.

  20. Structure and Randomness of Continuous-Time, Discrete-Event Processes

    NASA Astrophysics Data System (ADS)

    Marzen, Sarah E.; Crutchfield, James P.

    2017-10-01

    Loosely speaking, the Shannon entropy rate is used to gauge a stochastic process' intrinsic randomness; the statistical complexity gives the cost of predicting the process. We calculate, for the first time, the entropy rate and statistical complexity of stochastic processes generated by finite unifilar hidden semi-Markov models—memoryful, state-dependent versions of renewal processes. Calculating these quantities requires introducing novel mathematical objects (ɛ -machines of hidden semi-Markov processes) and new information-theoretic methods to stochastic processes.

  1. To What Extent Is Mathematical Ability Predictive of Performance in a Methodology and Statistics Course? Can an Action Research Approach Be Used to Understand the Relevance of Mathematical Ability in Psychology Undergraduates?

    ERIC Educational Resources Information Center

    Bourne, Victoria J.

    2014-01-01

    Research methods and statistical analysis is typically the least liked and most anxiety provoking aspect of a psychology undergraduate degree, in large part due to the mathematical component of the content. In this first cycle of a piece of action research, students' mathematical ability is examined in relation to their performance across…

  2. Polypropylene Production Optimization in Fluidized Bed Catalytic Reactor (FBCR): Statistical Modeling and Pilot Scale Experimental Validation

    PubMed Central

    Khan, Mohammad Jakir Hossain; Hussain, Mohd Azlan; Mujtaba, Iqbal Mohammed

    2014-01-01

    Propylene is one type of plastic that is widely used in our everyday life. This study focuses on the identification and justification of the optimum process parameters for polypropylene production in a novel pilot plant based fluidized bed reactor. This first-of-its-kind statistical modeling with experimental validation for the process parameters of polypropylene production was conducted by applying ANNOVA (Analysis of variance) method to Response Surface Methodology (RSM). Three important process variables i.e., reaction temperature, system pressure and hydrogen percentage were considered as the important input factors for the polypropylene production in the analysis performed. In order to examine the effect of process parameters and their interactions, the ANOVA method was utilized among a range of other statistical diagnostic tools such as the correlation between actual and predicted values, the residuals and predicted response, outlier t plot, 3D response surface and contour analysis plots. The statistical analysis showed that the proposed quadratic model had a good fit with the experimental results. At optimum conditions with temperature of 75°C, system pressure of 25 bar and hydrogen percentage of 2%, the highest polypropylene production obtained is 5.82% per pass. Hence it is concluded that the developed experimental design and proposed model can be successfully employed with over a 95% confidence level for optimum polypropylene production in a fluidized bed catalytic reactor (FBCR). PMID:28788576

  3. Development of a statistical method for predicting human driver decisions.

    DOT National Transportation Integrated Search

    2015-09-01

    As autonomous vehicles enter the fleet, there will be a long period when these vehicles will have to interact with : human drivers. One of the challenges for autonomous vehicles is that human drivers do not communicate their : decisions well. However...

  4. Spatial modelling of disease using data- and knowledge-driven approaches.

    PubMed

    Stevens, Kim B; Pfeiffer, Dirk U

    2011-09-01

    The purpose of spatial modelling in animal and public health is three-fold: describing existing spatial patterns of risk, attempting to understand the biological mechanisms that lead to disease occurrence and predicting what will happen in the medium to long-term future (temporal prediction) or in different geographical areas (spatial prediction). Traditional methods for temporal and spatial predictions include general and generalized linear models (GLM), generalized additive models (GAM) and Bayesian estimation methods. However, such models require both disease presence and absence data which are not always easy to obtain. Novel spatial modelling methods such as maximum entropy (MAXENT) and the genetic algorithm for rule set production (GARP) require only disease presence data and have been used extensively in the fields of ecology and conservation, to model species distribution and habitat suitability. Other methods, such as multicriteria decision analysis (MCDA), use knowledge of the causal factors of disease occurrence to identify areas potentially suitable for disease. In addition to their less restrictive data requirements, some of these novel methods have been shown to outperform traditional statistical methods in predictive ability (Elith et al., 2006). This review paper provides details of some of these novel methods for mapping disease distribution, highlights their advantages and limitations, and identifies studies which have used the methods to model various aspects of disease distribution. Copyright © 2011. Published by Elsevier Ltd.

  5. 3D Markov Process for Traffic Flow Prediction in Real-Time.

    PubMed

    Ko, Eunjeong; Ahn, Jinyoung; Kim, Eun Yi

    2016-01-25

    Recently, the correct estimation of traffic flow has begun to be considered an essential component in intelligent transportation systems. In this paper, a new statistical method to predict traffic flows using time series analyses and geometric correlations is proposed. The novelty of the proposed method is two-fold: (1) a 3D heat map is designed to describe the traffic conditions between roads, which can effectively represent the correlations between spatially- and temporally-adjacent traffic states; and (2) the relationship between the adjacent roads on the spatiotemporal domain is represented by cliques in MRF and the clique parameters are obtained by example-based learning. In order to assess the validity of the proposed method, it is tested using data from expressway traffic that are provided by the Korean Expressway Corporation, and the performance of the proposed method is compared with existing approaches. The results demonstrate that the proposed method can predict traffic conditions with an accuracy of 85%, and this accuracy can be improved further.

  6. 3D Markov Process for Traffic Flow Prediction in Real-Time

    PubMed Central

    Ko, Eunjeong; Ahn, Jinyoung; Kim, Eun Yi

    2016-01-01

    Recently, the correct estimation of traffic flow has begun to be considered an essential component in intelligent transportation systems. In this paper, a new statistical method to predict traffic flows using time series analyses and geometric correlations is proposed. The novelty of the proposed method is two-fold: (1) a 3D heat map is designed to describe the traffic conditions between roads, which can effectively represent the correlations between spatially- and temporally-adjacent traffic states; and (2) the relationship between the adjacent roads on the spatiotemporal domain is represented by cliques in MRF and the clique parameters are obtained by example-based learning. In order to assess the validity of the proposed method, it is tested using data from expressway traffic that are provided by the Korean Expressway Corporation, and the performance of the proposed method is compared with existing approaches. The results demonstrate that the proposed method can predict traffic conditions with an accuracy of 85%, and this accuracy can be improved further. PMID:26821025

  7. An Aquatic Decomposition Scoring Method to Potentially Predict the Postmortem Submersion Interval of Bodies Recovered from the North Sea.

    PubMed

    van Daalen, Marjolijn A; de Kat, Dorothée S; Oude Grotebevelsborg, Bernice F L; de Leeuwe, Roosje; Warnaar, Jeroen; Oostra, Roelof Jan; M Duijst-Heesters, Wilma L J

    2017-03-01

    This study aimed to develop an aquatic decomposition scoring (ADS) method and investigated the predictive value of this method in estimating the postmortem submersion interval (PMSI) of bodies recovered from the North Sea. This method, consisting of an ADS item list and a pictorial reference atlas, showed a high interobserver agreement (Krippendorff's alpha ≥ 0.93) and hence proved to be valid. This scoring method was applied to data, collected from closed cases-cases in which the postmortal submersion interval (PMSI) was known-concerning bodies recovered from the North Sea from 1990 to 2013. Thirty-eight cases met the inclusion criteria and were scored by quantifying the observed total aquatic decomposition score (TADS). Statistical analysis demonstrated that TADS accurately predicts the PMSI (p < 0.001), confirming that the decomposition process in the North Sea is strongly correlated to time. © 2017 American Academy of Forensic Sciences.

  8. FUN-LDA: A Latent Dirichlet Allocation Model for Predicting Tissue-Specific Functional Effects of Noncoding Variation: Methods and Applications.

    PubMed

    Backenroth, Daniel; He, Zihuai; Kiryluk, Krzysztof; Boeva, Valentina; Pethukova, Lynn; Khurana, Ekta; Christiano, Angela; Buxbaum, Joseph D; Ionita-Laza, Iuliana

    2018-05-03

    We describe a method based on a latent Dirichlet allocation model for predicting functional effects of noncoding genetic variants in a cell-type- and/or tissue-specific way (FUN-LDA). Using this unsupervised approach, we predict tissue-specific functional effects for every position in the human genome in 127 different tissues and cell types. We demonstrate the usefulness of our predictions by using several validation experiments. Using eQTL data from several sources, including the GTEx project, Geuvadis project, and TwinsUK cohort, we show that eQTLs in specific tissues tend to be most enriched among the predicted functional variants in relevant tissues in Roadmap. We further show how these integrated functional scores can be used for (1) deriving the most likely cell or tissue type causally implicated for a complex trait by using summary statistics from genome-wide association studies and (2) estimating a tissue-based correlation matrix of various complex traits. We found large enrichment of heritability in functional components of relevant tissues for various complex traits, and FUN-LDA yielded higher enrichment estimates than existing methods. Finally, using experimentally validated functional variants from the literature and variants possibly implicated in disease by previous studies, we rigorously compare FUN-LDA with state-of-the-art functional annotation methods and show that FUN-LDA has better prediction accuracy and higher resolution than these methods. In particular, our results suggest that tissue- and cell-type-specific functional prediction methods tend to have substantially better prediction accuracy than organism-level prediction methods. Scores for each position in the human genome and for each ENCODE and Roadmap tissue are available online (see Web Resources). Copyright © 2018 American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.

  9. Nonparametric predictive inference for combining diagnostic tests with parametric copula

    NASA Astrophysics Data System (ADS)

    Muhammad, Noryanti; Coolen, F. P. A.; Coolen-Maturi, T.

    2017-09-01

    Measuring the accuracy of diagnostic tests is crucial in many application areas including medicine and health care. The Receiver Operating Characteristic (ROC) curve is a popular statistical tool for describing the performance of diagnostic tests. The area under the ROC curve (AUC) is often used as a measure of the overall performance of the diagnostic test. In this paper, we interest in developing strategies for combining test results in order to increase the diagnostic accuracy. We introduce nonparametric predictive inference (NPI) for combining two diagnostic test results with considering dependence structure using parametric copula. NPI is a frequentist statistical framework for inference on a future observation based on past data observations. NPI uses lower and upper probabilities to quantify uncertainty and is based on only a few modelling assumptions. While copula is a well-known statistical concept for modelling dependence of random variables. A copula is a joint distribution function whose marginals are all uniformly distributed and it can be used to model the dependence separately from the marginal distributions. In this research, we estimate the copula density using a parametric method which is maximum likelihood estimator (MLE). We investigate the performance of this proposed method via data sets from the literature and discuss results to show how our method performs for different family of copulas. Finally, we briefly outline related challenges and opportunities for future research.

  10. Fresh Biomass Estimation in Heterogeneous Grassland Using Hyperspectral Measurements and Multivariate Statistical Analysis

    NASA Astrophysics Data System (ADS)

    Darvishzadeh, R.; Skidmore, A. K.; Mirzaie, M.; Atzberger, C.; Schlerf, M.

    2014-12-01

    Accurate estimation of grassland biomass at their peak productivity can provide crucial information regarding the functioning and productivity of the rangelands. Hyperspectral remote sensing has proved to be valuable for estimation of vegetation biophysical parameters such as biomass using different statistical techniques. However, in statistical analysis of hyperspectral data, multicollinearity is a common problem due to large amount of correlated hyper-spectral reflectance measurements. The aim of this study was to examine the prospect of above ground biomass estimation in a heterogeneous Mediterranean rangeland employing multivariate calibration methods. Canopy spectral measurements were made in the field using a GER 3700 spectroradiometer, along with concomitant in situ measurements of above ground biomass for 170 sample plots. Multivariate calibrations including partial least squares regression (PLSR), principal component regression (PCR), and Least-Squared Support Vector Machine (LS-SVM) were used to estimate the above ground biomass. The prediction accuracy of the multivariate calibration methods were assessed using cross validated R2 and RMSE. The best model performance was obtained using LS_SVM and then PLSR both calibrated with first derivative reflectance dataset with R2cv = 0.88 & 0.86 and RMSEcv= 1.15 & 1.07 respectively. The weakest prediction accuracy was appeared when PCR were used (R2cv = 0.31 and RMSEcv= 2.48). The obtained results highlight the importance of multivariate calibration methods for biomass estimation when hyperspectral data are used.

  11. Earthquake Prediction in a Big Data World

    NASA Astrophysics Data System (ADS)

    Kossobokov, V. G.

    2016-12-01

    The digital revolution started just about 15 years ago has already surpassed the global information storage capacity of more than 5000 Exabytes (in optimally compressed bytes) per year. Open data in a Big Data World provides unprecedented opportunities for enhancing studies of the Earth System. However, it also opens wide avenues for deceptive associations in inter- and transdisciplinary data and for inflicted misleading predictions based on so-called "precursors". Earthquake prediction is not an easy task that implies a delicate application of statistics. So far, none of the proposed short-term precursory signals showed sufficient evidence to be used as a reliable precursor of catastrophic earthquakes. Regretfully, in many cases of seismic hazard assessment (SHA), from term-less to time-dependent (probabilistic PSHA or deterministic DSHA), and short-term earthquake forecasting (StEF), the claims of a high potential of the method are based on a flawed application of statistics and, therefore, are hardly suitable for communication to decision makers. Self-testing must be done in advance claiming prediction of hazardous areas and/or times. The necessity and possibility of applying simple tools of Earthquake Prediction Strategies, in particular, Error Diagram, introduced by G.M. Molchan in early 1990ies, and Seismic Roulette null-hypothesis as a metric of the alerted space, is evident. The set of errors, i.e. the rates of failure and of the alerted space-time volume, can be easily compared to random guessing, which comparison permits evaluating the SHA method effectiveness and determining the optimal choice of parameters in regard to a given cost-benefit function. These and other information obtained in such a simple testing may supply us with a realistic estimates of confidence and accuracy of SHA predictions and, if reliable but not necessarily perfect, with related recommendations on the level of risks for decision making in regard to engineering design, insurance, and emergency management. The examples of independent expertize of "seismic hazard maps", "precursors", and "forecast/prediction methods" are provided.

  12. Laharz_py: GIS tools for automated mapping of lahar inundation hazard zones

    USGS Publications Warehouse

    Schilling, Steve P.

    2014-01-01

    Laharz_py is written in the Python programming language as a suite of tools for use in ArcMap Geographic Information System (GIS). Primarily, Laharz_py is a computational model that uses statistical descriptions of areas inundated by past mass-flow events to forecast areas likely to be inundated by hypothetical future events. The forecasts use physically motivated and statistically calibrated power-law equations that each has a form A = cV2/3, relating mass-flow volume (V) to planimetric or cross-sectional areas (A) inundated by an average flow as it descends a given drainage. Calibration of the equations utilizes logarithmic transformation and linear regression to determine the best-fit values of c. The software uses values of V, an algorithm for idenitifying mass-flow source locations, and digital elevation models of topography to portray forecast hazard zones for lahars, debris flows, or rock avalanches on maps. Laharz_py offers two methods to construct areas of potential inundation for lahars: (1) Selection of a range of plausible V values results in a set of nested hazard zones showing areas likely to be inundated by a range of hypothetical flows; and (2) The user selects a single volume and a confidence interval for the prediction. In either case, Laharz_py calculates the mean expected A and B value from each user-selected value of V. However, for the second case, a single value of V yields two additional results representing the upper and lower values of the confidence interval of prediction. Calculation of these two bounding predictions require the statistically calibrated prediction equations, a user-specified level of confidence, and t-distribution statistics to calculate the standard error of regression, standard error of the mean, and standard error of prediction. The portrayal of results from these two methods on maps compares the range of inundation areas due to prediction uncertainties with uncertainties in selection of V values. The Open-File Report document contains an explanation of how to install and use the software. The Laharz_py software includes an example data set for Mount Rainier, Washington. The second part of the documentation describes how to use all of the Laharz_py tools in an example dataset at Mount Rainier, Washington.

  13. NUMERICAL ANALYSIS TECHNIQUE USING THE STATISTICAL ENERGY ANALYSIS METHOD CONCERNING THE BLASTING NOISE REDUCTION BY THE SOUND INSULATION DOOR USED IN TUNNEL CONSTRUCTIONS

    NASA Astrophysics Data System (ADS)

    Ishida, Shigeki; Mori, Atsuo; Shinji, Masato

    The main method to reduce the blasting charge noise which occurs in a tunnel under construction is to install the sound insulation door in the tunnel. However, the numerical analysis technique to predict the accurate effect of the transmission loss in the sound insulation door is not established. In this study, we measured the blasting charge noise and the vibration of the sound insulation door in the tunnel with the blasting charge, and performed analysis and modified acoustic feature. In addition, we reproduced the noise reduction effect of the sound insulation door by statistical energy analysis method and confirmed that numerical simulation is possible by this procedure.

  14. On the calibration process of film dosimetry: OLS inverse regression versus WLS inverse prediction.

    PubMed

    Crop, F; Van Rompaye, B; Paelinck, L; Vakaet, L; Thierens, H; De Wagter, C

    2008-07-21

    The purpose of this study was both putting forward a statistically correct model for film calibration and the optimization of this process. A reliable calibration is needed in order to perform accurate reference dosimetry with radiographic (Gafchromic) film. Sometimes, an ordinary least squares simple linear (in the parameters) regression is applied to the dose-optical-density (OD) curve with the dose as a function of OD (inverse regression) or sometimes OD as a function of dose (inverse prediction). The application of a simple linear regression fit is an invalid method because heteroscedasticity of the data is not taken into account. This could lead to erroneous results originating from the calibration process itself and thus to a lower accuracy. In this work, we compare the ordinary least squares (OLS) inverse regression method with the correct weighted least squares (WLS) inverse prediction method to create calibration curves. We found that the OLS inverse regression method could lead to a prediction bias of up to 7.3 cGy at 300 cGy and total prediction errors of 3% or more for Gafchromic EBT film. Application of the WLS inverse prediction method resulted in a maximum prediction bias of 1.4 cGy and total prediction errors below 2% in a 0-400 cGy range. We developed a Monte-Carlo-based process to optimize calibrations, depending on the needs of the experiment. This type of thorough analysis can lead to a higher accuracy for film dosimetry.

  15. Statistics of return intervals between long heartbeat intervals and their usability for online prediction of disorders

    NASA Astrophysics Data System (ADS)

    Bogachev, Mikhail I.; Kireenkov, Igor S.; Nifontov, Eugene M.; Bunde, Armin

    2009-06-01

    We study the statistics of return intervals between large heartbeat intervals (above a certain threshold Q) in 24 h records obtained from healthy subjects. We find that both the linear and the nonlinear long-term memory inherent in the heartbeat intervals lead to power-laws in the probability density function PQ(r) of the return intervals. As a consequence, the probability WQ(t; Δt) that at least one large heartbeat interval will occur within the next Δt heartbeat intervals, with an increasing elapsed number of intervals t after the last large heartbeat interval, follows a power-law. Based on these results, we suggest a method of obtaining a priori information about the occurrence of the next large heartbeat interval, and thus to predict it. We show explicitly that the proposed method, which exploits long-term memory, is superior to the conventional precursory pattern recognition technique, which focuses solely on short-term memory. We believe that our results can be straightforwardly extended to obtain more reliable predictions in other physiological signals like blood pressure, as well as in other complex records exhibiting multifractal behaviour, e.g. turbulent flow, precipitation, river flows and network traffic.

  16. An Assessment of Phylogenetic Tools for Analyzing the Interplay Between Interspecific Interactions and Phenotypic Evolution.

    PubMed

    Drury, J P; Grether, G F; Garland, T; Morlon, H

    2018-05-01

    Much ecological and evolutionary theory predicts that interspecific interactions often drive phenotypic diversification and that species phenotypes in turn influence species interactions. Several phylogenetic comparative methods have been developed to assess the importance of such processes in nature; however, the statistical properties of these methods have gone largely untested. Focusing mainly on scenarios of competition between closely-related species, we assess the performance of available comparative approaches for analyzing the interplay between interspecific interactions and species phenotypes. We find that many currently used statistical methods often fail to detect the impact of interspecific interactions on trait evolution, that sister-taxa analyses are particularly unreliable in general, and that recently developed process-based models have more satisfactory statistical properties. Methods for detecting predictors of species interactions are generally more reliable than methods for detecting character displacement. In weighing the strengths and weaknesses of different approaches, we hope to provide a clear guide for empiricists testing hypotheses about the reciprocal effect of interspecific interactions and species phenotypes and to inspire further development of process-based models.

  17. Predictive capabilities of statistical learning methods for lung nodule malignancy classification using diagnostic image features: an investigation using the Lung Image Database Consortium dataset

    NASA Astrophysics Data System (ADS)

    Hancock, Matthew C.; Magnan, Jerry F.

    2017-03-01

    To determine the potential usefulness of quantified diagnostic image features as inputs to a CAD system, we investigate the predictive capabilities of statistical learning methods for classifying nodule malignancy, utilizing the Lung Image Database Consortium (LIDC) dataset, and only employ the radiologist-assigned diagnostic feature values for the lung nodules therein, as well as our derived estimates of the diameter and volume of the nodules from the radiologists' annotations. We calculate theoretical upper bounds on the classification accuracy that is achievable by an ideal classifier that only uses the radiologist-assigned feature values, and we obtain an accuracy of 85.74 (+/-1.14)% which is, on average, 4.43% below the theoretical maximum of 90.17%. The corresponding area-under-the-curve (AUC) score is 0.932 (+/-0.012), which increases to 0.949 (+/-0.007) when diameter and volume features are included, along with the accuracy to 88.08 (+/-1.11)%. Our results are comparable to those in the literature that use algorithmically-derived image-based features, which supports our hypothesis that lung nodules can be classified as malignant or benign using only quantified, diagnostic image features, and indicates the competitiveness of this approach. We also analyze how the classification accuracy depends on specific features, and feature subsets, and we rank the features according to their predictive power, statistically demonstrating the top four to be spiculation, lobulation, subtlety, and calcification.

  18. Beyond discrimination: A comparison of calibration methods and clinical usefulness of predictive models of readmission risk.

    PubMed

    Walsh, Colin G; Sharman, Kavya; Hripcsak, George

    2017-12-01

    Prior to implementing predictive models in novel settings, analyses of calibration and clinical usefulness remain as important as discrimination, but they are not frequently discussed. Calibration is a model's reflection of actual outcome prevalence in its predictions. Clinical usefulness refers to the utilities, costs, and harms of using a predictive model in practice. A decision analytic approach to calibrating and selecting an optimal intervention threshold may help maximize the impact of readmission risk and other preventive interventions. To select a pragmatic means of calibrating predictive models that requires a minimum amount of validation data and that performs well in practice. To evaluate the impact of miscalibration on utility and cost via clinical usefulness analyses. Observational, retrospective cohort study with electronic health record data from 120,000 inpatient admissions at an urban, academic center in Manhattan. The primary outcome was thirty-day readmission for three causes: all-cause, congestive heart failure, and chronic coronary atherosclerotic disease. Predictive modeling was performed via L1-regularized logistic regression. Calibration methods were compared including Platt Scaling, Logistic Calibration, and Prevalence Adjustment. Performance of predictive modeling and calibration was assessed via discrimination (c-statistic), calibration (Spiegelhalter Z-statistic, Root Mean Square Error [RMSE] of binned predictions, Sanders and Murphy Resolutions of the Brier Score, Calibration Slope and Intercept), and clinical usefulness (utility terms represented as costs). The amount of validation data necessary to apply each calibration algorithm was also assessed. C-statistics by diagnosis ranged from 0.7 for all-cause readmission to 0.86 (0.78-0.93) for congestive heart failure. Logistic Calibration and Platt Scaling performed best and this difference required analyzing multiple metrics of calibration simultaneously, in particular Calibration Slopes and Intercepts. Clinical usefulness analyses provided optimal risk thresholds, which varied by reason for readmission, outcome prevalence, and calibration algorithm. Utility analyses also suggested maximum tolerable intervention costs, e.g., $1720 for all-cause readmissions based on a published cost of readmission of $11,862. Choice of calibration method depends on availability of validation data and on performance. Improperly calibrated models may contribute to higher costs of intervention as measured via clinical usefulness. Decision-makers must understand underlying utilities or costs inherent in the use-case at hand to assess usefulness and will obtain the optimal risk threshold to trigger intervention with intervention cost limits as a result. Copyright © 2017 Elsevier Inc. All rights reserved.

  19. Planning for subacute care: predicting demand using acute activity data.

    PubMed

    Green, Janette P; McNamee, Jennifer P; Kobel, Conrad; Seraji, Md Habibur R; Lawrence, Suanne J

    2016-01-01

    Objective The aim of the present study was to develop a robust model that uses the concept of 'rehabilitation-sensitive' Diagnosis Related Groups (DRGs) in predicting demand for rehabilitation and geriatric evaluation and management (GEM) care following acute in-patient episodes provided in Australian hospitals. Methods The model was developed using statistical analyses of national datasets, informed by a panel of expert clinicians and jurisdictional advice. Logistic regression analysis was undertaken using acute in-patient data, published national hospital statistics and data from the Australasian Rehabilitation Outcomes Centre. Results The predictive model comprises tables of probabilities that patients will require rehabilitation or GEM care after an acute episode, with columns defined by age group and rows defined by grouped Australian Refined (AR)-DRGs. Conclusions The existing concept of rehabilitation-sensitive DRGs was revised and extended. When applied to national data, the model provided a conservative estimate of 83% of the activity actually provided. An example demonstrates the application of the model for service planning. What is known about the topic? Health service planning is core business for jurisdictions and local areas. With populations ageing and an acknowledgement of the underservicing of subacute care, it is timely to find improved methods of estimating demand for this type of care. Traditionally, age-sex standardised utilisation rates for individual DRGs have been applied to Australian Bureau of Statistics (ABS) population projections to predict the future need for subacute services. Improved predictions became possible when some AR-DRGs were designated 'rehabilitation-sensitive'. This improved methodology has been used in several Australian jurisdictions. What does this paper add? This paper presents a new tool, or model, to predict demand for rehabilitation and GEM services based on in-patient acute activity. In this model, the methodology based on rehabilitation-sensitive AR-DRGs has been extended by updating them to AR-DRG Version 7.0, quantifying the level of 'sensitivity' and incorporating the patient's age to improve the prediction of demand for subacute services. What are the implications for practitioners? The predictive model takes the form of tables of probabilities that patients will require rehabilitation or GEM care after an acute episode and can be applied to acute in-patient administrative datasets in any Australian jurisdiction or local area. The use of patient-level characteristics will enable service planners to improve their forecasting of demand for these services. Clinicians and jurisdictional representatives consulted during the project regarded the model favourably and believed that it was an improvement on currently available methods.

  20. Topological and canonical kriging for design flood prediction in ungauged catchments: an improvement over a traditional regional regression approach?

    USGS Publications Warehouse

    Archfield, Stacey A.; Pugliese, Alessio; Castellarin, Attilio; Skøien, Jon O.; Kiang, Julie E.

    2013-01-01

    In the United States, estimation of flood frequency quantiles at ungauged locations has been largely based on regional regression techniques that relate measurable catchment descriptors to flood quantiles. More recently, spatial interpolation techniques of point data have been shown to be effective for predicting streamflow statistics (i.e., flood flows and low-flow indices) in ungauged catchments. Literature reports successful applications of two techniques, canonical kriging, CK (or physiographical-space-based interpolation, PSBI), and topological kriging, TK (or top-kriging). CK performs the spatial interpolation of the streamflow statistic of interest in the two-dimensional space of catchment descriptors. TK predicts the streamflow statistic along river networks taking both the catchment area and nested nature of catchments into account. It is of interest to understand how these spatial interpolation methods compare with generalized least squares (GLS) regression, one of the most common approaches to estimate flood quantiles at ungauged locations. By means of a leave-one-out cross-validation procedure, the performance of CK and TK was compared to GLS regression equations developed for the prediction of 10, 50, 100 and 500 yr floods for 61 streamgauges in the southeast United States. TK substantially outperforms GLS and CK for the study area, particularly for large catchments. The performance of TK over GLS highlights an important distinction between the treatments of spatial correlation when using regression-based or spatial interpolation methods to estimate flood quantiles at ungauged locations. The analysis also shows that coupling TK with CK slightly improves the performance of TK; however, the improvement is marginal when compared to the improvement in performance over GLS.

Top