Sample records for regression models including

  1. Regression Models for Identifying Noise Sources in Magnetic Resonance Images

    PubMed Central

    Zhu, Hongtu; Li, Yimei; Ibrahim, Joseph G.; Shi, Xiaoyan; An, Hongyu; Chen, Yashen; Gao, Wei; Lin, Weili; Rowe, Daniel B.; Peterson, Bradley S.

    2009-01-01

    Stochastic noise, susceptibility artifacts, magnetic field and radiofrequency inhomogeneities, and other noise components in magnetic resonance images (MRIs) can introduce serious bias into any measurements made with those images. We formally introduce three regression models including a Rician regression model and two associated normal models to characterize stochastic noise in various magnetic resonance imaging modalities, including diffusion-weighted imaging (DWI) and functional MRI (fMRI). Estimation algorithms are introduced to maximize the likelihood function of the three regression models. We also develop a diagnostic procedure for systematically exploring MR images to identify noise components other than simple stochastic noise, and to detect discrepancies between the fitted regression models and MRI data. The diagnostic procedure includes goodness-of-fit statistics, measures of influence, and tools for graphical display. The goodness-of-fit statistics can assess the key assumptions of the three regression models, whereas measures of influence can isolate outliers caused by certain noise components, including motion artifacts. The tools for graphical display permit graphical visualization of the values for the goodness-of-fit statistic and influence measures. Finally, we conduct simulation studies to evaluate performance of these methods, and we analyze a real dataset to illustrate how our diagnostic procedure localizes subtle image artifacts by detecting intravoxel variability that is not captured by the regression models. PMID:19890478

  2. A menu-driven software package of Bayesian nonparametric (and parametric) mixed models for regression analysis and density estimation.

    PubMed

    Karabatsos, George

    2017-02-01

    Most of applied statistics involves regression analysis of data. In practice, it is important to specify a regression model that has minimal assumptions which are not violated by data, to ensure that statistical inferences from the model are informative and not misleading. This paper presents a stand-alone and menu-driven software package, Bayesian Regression: Nonparametric and Parametric Models, constructed from MATLAB Compiler. Currently, this package gives the user a choice from 83 Bayesian models for data analysis. They include 47 Bayesian nonparametric (BNP) infinite-mixture regression models; 5 BNP infinite-mixture models for density estimation; and 31 normal random effects models (HLMs), including normal linear models. Each of the 78 regression models handles either a continuous, binary, or ordinal dependent variable, and can handle multi-level (grouped) data. All 83 Bayesian models can handle the analysis of weighted observations (e.g., for meta-analysis), and the analysis of left-censored, right-censored, and/or interval-censored data. Each BNP infinite-mixture model has a mixture distribution assigned one of various BNP prior distributions, including priors defined by either the Dirichlet process, Pitman-Yor process (including the normalized stable process), beta (two-parameter) process, normalized inverse-Gaussian process, geometric weights prior, dependent Dirichlet process, or the dependent infinite-probits prior. The software user can mouse-click to select a Bayesian model and perform data analysis via Markov chain Monte Carlo (MCMC) sampling. After the sampling completes, the software automatically opens text output that reports MCMC-based estimates of the model's posterior distribution and model predictive fit to the data. Additional text and/or graphical output can be generated by mouse-clicking other menu options. 
This includes output of MCMC convergence analyses, and estimates of the model's posterior predictive distribution, for selected functionals and values of covariates. The software is illustrated through the BNP regression analysis of real data.

  3. Using Weighted Least Squares Regression for Obtaining Langmuir Sorption Constants

    USDA-ARS's Scientific Manuscript database

    One of the most commonly used models for describing phosphorus (P) sorption to soils is the Langmuir model. To obtain model parameters, the Langmuir model is fit to measured sorption data using least squares regression. Least squares regression is based on several assumptions including normally dist...

  4. Support vector methods for survival analysis: a comparison between ranking and regression approaches.

    PubMed

    Van Belle, Vanya; Pelckmans, Kristiaan; Van Huffel, Sabine; Suykens, Johan A K

    2011-10-01

    To compare and evaluate ranking, regression and combined machine learning approaches for the analysis of survival data. The literature describes two approaches based on support vector machines to deal with censored observations. In the first approach the key idea is to rephrase the task as a ranking problem via the concordance index, a problem which can be solved efficiently in a context of structural risk minimization and convex optimization techniques. In a second approach, one uses a regression approach, dealing with censoring by means of inequality constraints. The goal of this paper is then twofold: (i) introducing a new model combining the ranking and regression strategy, which retains the link with existing survival models such as the proportional hazards model via transformation models; and (ii) comparison of the three techniques on 6 clinical and 3 high-dimensional datasets and discussing the relevance of these techniques over classical approaches for survival data. We compare svm-based survival models based on ranking constraints, based on regression constraints and models based on both ranking and regression constraints. The performance of the models is compared by means of three different measures: (i) the concordance index, measuring the model's discriminating ability; (ii) the logrank test statistic, indicating whether patients with a prognostic index lower than the median prognostic index have a significantly different survival than patients with a prognostic index higher than the median; and (iii) the hazard ratio after normalization to restrict the prognostic index between 0 and 1. Our results indicate a significantly better performance for models including regression constraints over models based only on ranking constraints. This work gives empirical evidence that svm-based models using regression constraints perform significantly better than svm-based models based on ranking constraints. Our experiments show a comparable performance for methods including only regression or both regression and ranking constraints on clinical data. On high dimensional data, the former model performs better. However, this approach does not have a theoretical link with standard statistical models for survival data. This link can be made by means of transformation models when ranking constraints are included. Copyright © 2011 Elsevier B.V. All rights reserved.

  5. Linear regression metamodeling as a tool to summarize and present simulation model results.

    PubMed

    Jalal, Hawre; Dowd, Bryan; Sainfort, François; Kuntz, Karen M

    2013-10-01

    Modelers lack a tool to systematically and clearly present complex model results, including those from sensitivity analyses. The objective was to propose linear regression metamodeling as a tool to increase transparency of decision analytic models and better communicate their results. We used a simplified cancer cure model to demonstrate our approach. The model computed the lifetime cost and benefit of 3 treatment options for cancer patients. We simulated 10,000 cohorts in a probabilistic sensitivity analysis (PSA) and regressed the model outcomes on the standardized input parameter values in a set of regression analyses. We used the regression coefficients to describe measures of sensitivity analyses, including threshold and parameter sensitivity analyses. We also compared the results of the PSA to deterministic full-factorial and one-factor-at-a-time designs. The regression intercept represented the estimated base-case outcome, and the other coefficients described the relative parameter uncertainty in the model. We defined simple relationships that compute the average and incremental net benefit of each intervention. Metamodeling produced outputs similar to traditional deterministic 1-way or 2-way sensitivity analyses but was more reliable since it used all parameter values. Linear regression metamodeling is a simple, yet powerful, tool that can assist modelers in communicating model characteristics and sensitivity analyses.
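
    The metamodeling step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' cancer cure model: the toy net-benefit function, the input distributions, and every name (p_cure, cost_tx, wtp, qaly_gain) are assumptions made up for the example. It regresses simulated outcomes on standardized inputs, so that the intercept equals the mean simulated outcome and each slope is the change in outcome per one standard deviation of that input.

```python
import numpy as np

rng = np.random.default_rng(1)
n_sim = 10_000  # number of simulated PSA draws

# Hypothetical uncertain inputs (distributions chosen only for illustration).
p_cure = rng.beta(20, 30, n_sim)       # probability of cure
cost_tx = rng.gamma(100, 100, n_sim)   # treatment cost

# Hypothetical toy outcome: net monetary benefit of one treatment.
wtp = 50_000      # assumed willingness to pay per QALY
qaly_gain = 10    # assumed QALYs gained if cured
net_benefit = wtp * qaly_gain * p_cure - cost_tx

# Standardize each input to mean 0 and standard deviation 1.
inputs = np.column_stack([p_cure, cost_tx])
Z = (inputs - inputs.mean(axis=0)) / inputs.std(axis=0)

# Linear regression metamodel: outcome ~ intercept + standardized inputs.
design = np.column_stack([np.ones(n_sim), Z])
coef, *_ = np.linalg.lstsq(design, net_benefit, rcond=None)
intercept, b_cure, b_cost = coef

print(f"intercept (mean outcome): {intercept:.1f}")
print(f"effect of +1 SD in p_cure:  {b_cure:.1f}")
print(f"effect of +1 SD in cost_tx: {b_cost:.1f}")
```

    Because the standardized columns have mean zero, the fitted intercept coincides with the average simulated outcome, which is why it can be read as a base-case estimate under these assumptions.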

  6. Robust mislabel logistic regression without modeling mislabel probabilities.

    PubMed

    Hung, Hung; Jou, Zhi-Yu; Huang, Su-Yun

    2018-03-01

    Logistic regression is among the most widely used statistical methods for linear discriminant analysis. In many applications, we only observe possibly mislabeled responses. Fitting a conventional logistic regression can then lead to biased estimation. One common resolution is to fit a mislabel logistic regression model, which takes mislabeled responses into consideration. Another common method is to adopt a robust M-estimation by down-weighting suspected instances. In this work, we propose a new robust mislabel logistic regression based on γ-divergence. Our proposal possesses two advantageous features: (1) It does not need to model the mislabel probabilities. (2) The minimum γ-divergence estimation leads to a weighted estimating equation without the need to include any bias correction term, that is, it is automatically bias-corrected. These features make the proposed γ-logistic regression more robust in model fitting and more intuitive for model interpretation through a simple weighting scheme. Our method is also easy to implement, and two types of algorithms are included. Simulation studies and the Pima data application are presented to demonstrate the performance of γ-logistic regression. © 2017, The International Biometric Society.

  7. Army College Fund Cost-Effectiveness Study

    DTIC Science & Technology

    1990-11-01

    Section A.2 presents a theory of enlistment supply to provide a basis for specifying the regression model. The model is specified in Section A.3, which... Supplementary materials are included in the final four sections. Section A.6 provides annual trends in the regression model variables. Estimates of the model ... millions, A.S. ESTIMATION OF A YOUTH EARNINGS FORECASTING MODEL Civilian pay is an important explanatory variable in the regression model. Previous

  8. Correlation and simple linear regression.

    PubMed

    Eberly, Lynn E

    2007-01-01

    This chapter highlights important steps in using correlation and simple linear regression to address scientific questions about the association of two continuous variables with each other. These steps include estimation and inference, assessing model fit, the connection between regression and ANOVA, and study design. Examples in microbiology are used throughout. This chapter provides a framework that is helpful in understanding more complex statistical techniques, such as multiple linear regression, linear mixed effects models, logistic regression, and proportional hazards regression.

  9. Predicting recycling behaviour: Comparison of a linear regression model and a fuzzy logic model.

    PubMed

    Vesely, Stepan; Klöckner, Christian A; Dohnal, Mirko

    2016-03-01

    In this paper we demonstrate that fuzzy logic can provide a better tool for predicting recycling behaviour than the customarily used linear regression. To show this, we take a set of empirical data on recycling behaviour (N=664), which we randomly divide into two halves. The first half is used to estimate a linear regression model of recycling behaviour, and to develop a fuzzy logic model of recycling behaviour. As the first comparison, the fit of both models to the data included in estimation of the models (N=332) is evaluated. As the second comparison, predictive accuracy of both models for "new" cases (hold-out data not included in building the models, N=332) is assessed. In both cases, the fuzzy logic model significantly outperforms the regression model in terms of fit. To conclude, when accurate predictions of recycling and possibly other environmental behaviours are needed, fuzzy logic modelling seems to be a promising technique. Copyright © 2015 Elsevier Ltd. All rights reserved.

  10. Regression modeling of ground-water flow

    USGS Publications Warehouse

    Cooley, R.L.; Naff, R.L.

    1985-01-01

    Nonlinear multiple regression methods are developed to model and analyze groundwater flow systems. Complete descriptions of regression methodology as applied to groundwater flow models allow scientists and engineers engaged in flow modeling to apply the methods to a wide range of problems. Organization of the text proceeds from an introduction that discusses the general topic of groundwater flow modeling, to a review of basic statistics necessary to properly apply regression techniques, and then to the main topic: exposition and use of linear and nonlinear regression to model groundwater flow. Statistical procedures are given to analyze and use the regression models. A number of exercises and answers are included to exercise the student on nearly all the methods that are presented for modeling and statistical analysis. Three computer programs implement the more complex methods. These three are a general two-dimensional, steady-state regression model for flow in an anisotropic, heterogeneous porous medium, a program to calculate a measure of model nonlinearity with respect to the regression parameters, and a program to analyze model errors in computed dependent variables such as hydraulic head. (USGS)

  11. Model selection for logistic regression models

    NASA Astrophysics Data System (ADS)

    Duller, Christine

    2012-09-01

    Model selection for logistic regression models decides which of some given potential regressors have an effect and hence should be included in the final model. The second interesting question is whether a certain factor is heterogeneous among some subsets, i.e. whether the model should include a random intercept or not. In this paper these questions will be answered with classical as well as with Bayesian methods. The applications show some results of recent research projects in medicine and business administration.

  12. Breeding value accuracy estimates for growth traits using random regression and multi-trait models in Nelore cattle.

    PubMed

    Boligon, A A; Baldi, F; Mercadante, M E Z; Lobo, R B; Pereira, R J; Albuquerque, L G

    2011-06-28

    We quantified the potential increase in accuracy of expected breeding value for weights of Nelore cattle, from birth to mature age, using multi-trait and random regression models on Legendre polynomials and B-spline functions. A total of 87,712 weight records from 8144 females were used, recorded every three months from birth to mature age from the Nelore Brazil Program. For random regression analyses, all female weight records from birth to eight years of age (data set I) were considered. From this general data set, a subset was created (data set II), which included only nine weight records: at birth, weaning, 365 and 550 days of age, and 2, 3, 4, 5, and 6 years of age. Data set II was analyzed using random regression and multi-trait models. The model of analysis included the contemporary group as fixed effects and age of dam as a linear and quadratic covariable. In the random regression analyses, average growth trends were modeled using a cubic regression on orthogonal polynomials of age. Residual variances were modeled by a step function with five classes. Legendre polynomials of fourth and sixth order were utilized to model the direct genetic and animal permanent environmental effects, respectively, while third-order Legendre polynomials were considered for maternal genetic and maternal permanent environmental effects. Quadratic polynomials were applied to model all random effects in random regression models on B-spline functions. Direct genetic and animal permanent environmental effects were modeled using three segments or five coefficients, and genetic maternal and maternal permanent environmental effects were modeled with one segment or three coefficients in the random regression models on B-spline functions. For both data sets (I and II), animals ranked differently according to expected breeding value obtained by random regression or multi-trait models. 
With random regression models, the highest gains in accuracy were obtained at ages with a low number of weight records. The results indicate that random regression models provide more accurate expected breeding values than the traditionally finite multi-trait models. Thus, higher genetic responses are expected for beef cattle growth traits by replacing a multi-trait model with random regression models for genetic evaluation. B-spline functions could be applied as an alternative to Legendre polynomials to model covariance functions for weights from birth to mature age.

  13. A review of logistic regression models used to predict post-fire tree mortality of western North American conifers

    Treesearch

    Travis Woolley; David C. Shaw; Lisa M. Ganio; Stephen Fitzgerald

    2012-01-01

    Logistic regression models used to predict tree mortality are critical to post-fire management, planning prescribed burns and understanding disturbance ecology. We review literature concerning post-fire mortality prediction using logistic regression models for coniferous tree species in the western USA. We include synthesis and review of: methods to develop, evaluate...

  14. Developing a predictive tropospheric ozone model for Tabriz

    NASA Astrophysics Data System (ADS)

    Khatibi, Rahman; Naghipour, Leila; Ghorbani, Mohammad A.; Smith, Michael S.; Karimi, Vahid; Farhoudi, Reza; Delafrouz, Hadi; Arvanaghi, Hadi

    2013-04-01

    Predictive ozone models are becoming indispensable tools by providing a capability for pollution alerts to serve people who are vulnerable to the risks. We have developed a tropospheric ozone prediction capability for Tabriz, Iran, by using the following five modeling strategies: three regression-type methods: Multiple Linear Regression (MLR), Artificial Neural Networks (ANNs), and Gene Expression Programming (GEP); and two auto-regression-type models: Nonlinear Local Prediction (NLP) to implement chaos theory and Auto-Regressive Integrated Moving Average (ARIMA) models. The regression-type modeling strategies explain the data in terms of: temperature, solar radiation, dew point temperature, and wind speed, by regressing present ozone values to their past values. The ozone time series are available at various time intervals, including hourly intervals, from August 2010 to March 2011. The results for MLR, ANN and GEP models are not overly good but those produced by NLP and ARIMA are promising for establishing a forecasting capability.

  15. SPSS macros to compare any two fitted values from a regression model.

    PubMed

    Weaver, Bruce; Dubois, Sacha

    2012-12-01

    In regression models with first-order terms only, the coefficient for a given variable is typically interpreted as the change in the fitted value of Y for a one-unit increase in that variable, with all other variables held constant. Therefore, each regression coefficient represents the difference between two fitted values of Y. But the coefficients represent only a fraction of the possible fitted value comparisons that might be of interest to researchers. For many fitted value comparisons that are not captured by any of the regression coefficients, common statistical software packages do not provide the standard errors needed to compute confidence intervals or carry out statistical tests, particularly in more complex models that include interactions, polynomial terms, or regression splines. We describe two SPSS macros that implement a matrix algebra method for comparing any two fitted values from a regression model. The !OLScomp and !MLEcomp macros are for use with models fitted via ordinary least squares and maximum likelihood estimation, respectively. The output from the macros includes the standard error of the difference between the two fitted values, a 95% confidence interval for the difference, and a corresponding statistical test with its p-value.
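
    The matrix algebra behind such a comparison can be sketched for the ordinary least squares case. This is not the !OLScomp macro itself; it is a minimal Python illustration under the standard OLS result that, for two covariate rows x_a and x_b with c = x_a - x_b, the difference in fitted values is c'b and its variance is c' Cov(b) c. The function name compare_fitted_values and the simulated quadratic data are assumptions made up for the example.

```python
import numpy as np

def compare_fitted_values(X, y, x_a, x_b):
    """Return (difference, standard error, t statistic) for the OLS
    fitted value at design row x_a minus the fitted value at x_b."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y              # OLS coefficient estimates
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)      # residual variance estimate
    cov_beta = sigma2 * xtx_inv           # covariance matrix of beta
    c = np.asarray(x_a, dtype=float) - np.asarray(x_b, dtype=float)
    diff = c @ beta
    se = np.sqrt(c @ cov_beta @ c)
    return diff, se, diff / se

# Simulated example with a polynomial term, where no single coefficient
# equals the comparison of interest (fitted Y at x = 8 versus x = 2).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
X = np.column_stack([np.ones_like(x), x, x**2])
y = 1.0 + 0.5 * x - 0.03 * x**2 + rng.normal(0, 0.5, size=200)

diff, se, t_stat = compare_fitted_values(X, y, [1, 8, 64], [1, 2, 4])
print(f"difference = {diff:.3f}, SE = {se:.3f}, t = {t_stat:.2f}")
```

    A 95% confidence interval would follow as the difference plus or minus the appropriate t quantile (n - p degrees of freedom) times the standard error.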

  16. Moderation analysis using a two-level regression model.

    PubMed

    Yuan, Ke-Hai; Cheng, Ying; Maxwell, Scott

    2014-10-01

    Moderation analysis is widely used in social and behavioral research. The most commonly used model for moderation analysis is moderated multiple regression (MMR) in which the explanatory variables of the regression model include product terms, and the model is typically estimated by least squares (LS). This paper argues for a two-level regression model in which the regression coefficients of a criterion variable on predictors are further regressed on moderator variables. An algorithm for estimating the parameters of the two-level model by normal-distribution-based maximum likelihood (NML) is developed. Formulas for the standard errors (SEs) of the parameter estimates are provided and studied. Results indicate that, when heteroscedasticity exists, NML with the two-level model gives more efficient and more accurate parameter estimates than the LS analysis of the MMR model. When error variances are homoscedastic, NML with the two-level model leads to essentially the same results as LS with the MMR model. Most importantly, the two-level regression model permits estimating the percentage of variance of each regression coefficient that is due to moderator variables. When applied to data from General Social Surveys 1991, NML with the two-level model identified a significant moderation effect of race on the regression of job prestige on years of education while LS with the MMR model did not. An R package is also developed and documented to facilitate the application of the two-level model.
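
    The moderated multiple regression (MMR) baseline mentioned in this abstract can be sketched as a least squares fit with a product term. This illustrates only the conventional MMR model, not the authors' two-level model, its NML estimator, or their R package; the simulated data and the names x, z, and slope_of_x are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
x = rng.normal(size=n)   # predictor
z = rng.normal(size=n)   # moderator

# Simulated criterion: the slope of y on x is 0.8 + 0.3 * z.
y = 1.0 + 0.8 * x + 0.2 * z + 0.3 * x * z + rng.normal(scale=0.5, size=n)

# MMR design matrix: intercept, predictor, moderator, product term.
design = np.column_stack([np.ones(n), x, z, x * z])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
b0, b_x, b_z, b_xz = coef

def slope_of_x(z_value):
    """Estimated slope of y on x at a given moderator value."""
    return b_x + b_xz * z_value

print(f"product-term coefficient: {b_xz:.3f}")
print(f"slope of x at z = -1: {slope_of_x(-1.0):.3f}")
print(f"slope of x at z = +1: {slope_of_x(1.0):.3f}")
```

    In this form the moderator shifts the slope deterministically; the two-level model in the abstract additionally treats the coefficient as having variance of its own, which this sketch does not attempt.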

  17. Comparison Between Linear and Non-parametric Regression Models for Genome-Enabled Prediction in Wheat

    PubMed Central

    Pérez-Rodríguez, Paulino; Gianola, Daniel; González-Camacho, Juan Manuel; Crossa, José; Manès, Yann; Dreisigacker, Susanne

    2012-01-01

    In genome-enabled prediction, parametric, semi-parametric, and non-parametric regression models have been used. This study assessed the predictive ability of linear and non-linear models using dense molecular markers. The linear models were linear on marker effects and included the Bayesian LASSO, Bayesian ridge regression, Bayes A, and Bayes B. The non-linear models (this refers to non-linearity on markers) were reproducing kernel Hilbert space (RKHS) regression, Bayesian regularized neural networks (BRNN), and radial basis function neural networks (RBFNN). These statistical models were compared using 306 elite wheat lines from CIMMYT genotyped with 1717 diversity array technology (DArT) markers and two traits, days to heading (DTH) and grain yield (GY), measured in each of 12 environments. It was found that the three non-linear models had better overall prediction accuracy than the linear regression specification. Results showed a consistent superiority of RKHS and RBFNN over the Bayesian LASSO, Bayesian ridge regression, Bayes A, and Bayes B models. PMID:23275882

  18. Comparison between linear and non-parametric regression models for genome-enabled prediction in wheat.

    PubMed

    Pérez-Rodríguez, Paulino; Gianola, Daniel; González-Camacho, Juan Manuel; Crossa, José; Manès, Yann; Dreisigacker, Susanne

    2012-12-01

    In genome-enabled prediction, parametric, semi-parametric, and non-parametric regression models have been used. This study assessed the predictive ability of linear and non-linear models using dense molecular markers. The linear models were linear on marker effects and included the Bayesian LASSO, Bayesian ridge regression, Bayes A, and Bayes B. The non-linear models (this refers to non-linearity on markers) were reproducing kernel Hilbert space (RKHS) regression, Bayesian regularized neural networks (BRNN), and radial basis function neural networks (RBFNN). These statistical models were compared using 306 elite wheat lines from CIMMYT genotyped with 1717 diversity array technology (DArT) markers and two traits, days to heading (DTH) and grain yield (GY), measured in each of 12 environments. It was found that the three non-linear models had better overall prediction accuracy than the linear regression specification. Results showed a consistent superiority of RKHS and RBFNN over the Bayesian LASSO, Bayesian ridge regression, Bayes A, and Bayes B models.

  19. Procedures for adjusting regional regression models of urban-runoff quality using local data

    USGS Publications Warehouse

    Hoos, A.B.; Sisolak, J.K.

    1993-01-01

    Statistical operations termed model-adjustment procedures (MAPs) can be used to incorporate local data into existing regression models to improve the prediction of urban-runoff quality. Each MAP is a form of regression analysis in which the local data base is used as a calibration data set. Regression coefficients are determined from the local data base, and the resulting 'adjusted' regression models can then be used to predict storm-runoff quality at unmonitored sites. The response variable in the regression analyses is the observed load or mean concentration of a constituent in storm runoff for a single storm. The set of explanatory variables used in the regression analyses is different for each MAP, but always includes the predicted value of load or mean concentration from a regional regression model. The four MAPs examined in this study were: single-factor regression against the regional model prediction, P (termed MAP-1F-P); regression against P (termed MAP-R-P); regression against P and additional local variables (termed MAP-R-P+nV); and a weighted combination of P and a local-regression prediction (termed MAP-W). The procedures were tested by means of split-sample analysis, using data from three cities included in the Nationwide Urban Runoff Program: Denver, Colorado; Bellevue, Washington; and Knoxville, Tennessee. The MAP that provided the greatest predictive accuracy for the verification data set differed among the three test data bases and among model types (MAP-W for Denver and Knoxville, MAP-1F-P and MAP-R-P for Bellevue load models, and MAP-R-P+nV for Bellevue concentration models) and, in many cases, was not clearly indicated by the values of standard error of estimate for the calibration data set. A scheme to guide MAP selection, based on exploratory data analysis of the calibration data set, is presented and tested. The MAPs were tested for sensitivity to the size of a calibration data set. As expected, predictive accuracy of all MAPs for the verification data set decreased as the calibration data-set size decreased, but predictive accuracy was not as sensitive for the MAPs as it was for the local regression models.

  20. Multivariate Analysis of Seismic Field Data

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Alam, M. Kathleen

    1999-06-01

    This report includes the details of the model building procedure and prediction of seismic field data. Principal Components Regression, a multivariate analysis technique, was used to model seismic data collected as two pieces of equipment were cycled on and off. Models built that included only the two pieces of equipment of interest had trouble predicting data containing signals not included in the model. Evidence for poor predictions came from the prediction curves as well as spectral F-ratio plots. Once the extraneous signals were included in the model, predictions improved dramatically. While Principal Components Regression performed well for the present data sets, the present data analysis suggests further work will be needed to develop more robust modeling methods as the data become more complex.

  1. A general framework for the use of logistic regression models in meta-analysis.

    PubMed

    Simmonds, Mark C; Higgins, Julian Pt

    2016-12-01

    Where individual participant data are available for every randomised trial in a meta-analysis of dichotomous event outcomes, "one-stage" random-effects logistic regression models have been proposed as a way to analyse these data. Such models can also be used even when individual participant data are not available and we have only summary contingency table data. One benefit of this one-stage regression model over conventional meta-analysis methods is that it maximises the correct binomial likelihood for the data and so does not require the common assumption that effect estimates are normally distributed. A second benefit of using this model is that it may be applied, with only minor modification, in a range of meta-analytic scenarios, including meta-regression, network meta-analyses and meta-analyses of diagnostic test accuracy. This single model can potentially replace the variety of often complex methods used in these areas. This paper considers, with a range of meta-analysis examples, how random-effects logistic regression models may be used in a number of different types of meta-analyses. This one-stage approach is compared with widely used meta-analysis methods including Bayesian network meta-analysis and the bivariate and hierarchical summary receiver operating characteristic (ROC) models for meta-analyses of diagnostic test accuracy. © The Author(s) 2014.

  2. Linear Multivariable Regression Models for Prediction of Eddy Dissipation Rate from Available Meteorological Data

    NASA Technical Reports Server (NTRS)

    McKissick, Burnell T. (Technical Monitor); Plassman, Gerald E.; Mall, Gerald H.; Quagliano, John R.

    2005-01-01

    Linear multivariable regression models for predicting day and night Eddy Dissipation Rate (EDR) from available meteorological data sources are defined and validated. Model definition is based on a combination of 1997-2000 Dallas/Fort Worth (DFW) data sources, EDR from Aircraft Vortex Spacing System (AVOSS) deployment data, and regression variables primarily from corresponding Automated Surface Observation System (ASOS) data. Model validation is accomplished through EDR predictions on a similar combination of 1994-1995 Memphis (MEM) AVOSS and ASOS data. Model forms include an intercept plus a single term of fixed optimal power for each of these regression variables: 30-minute forward averaged mean and variance of near-surface wind speed and temperature, variance of wind direction, and a discrete cloud cover metric. Distinct day and night models, regressing on EDR and the natural log of EDR respectively, yield best performance and avoid model discontinuity over day/night data boundaries.

  3. Boosted regression tree, table, and figure data

    EPA Pesticide Factsheets

    Spreadsheets are included here to support the manuscript Boosted Regression Tree Models to Explain Watershed Nutrient Concentrations and Biological Condition. This dataset is associated with the following publication: Golden, H., C. Lane, A. Prues, and E. D'Amico. Boosted Regression Tree Models to Explain Watershed Nutrient Concentrations and Biological Condition. JAWRA. American Water Resources Association, Middleburg, VA, USA, 52(5): 1251-1274, (2016).

  4. Regression analysis using dependent Polya trees.

    PubMed

    Schörgendorfer, Angela; Branscum, Adam J

    2013-11-30

    Many commonly used models for linear regression analysis force overly simplistic shape and scale constraints on the residual structure of data. We propose a semiparametric Bayesian model for regression analysis that produces data-driven inference by using a new type of dependent Polya tree prior to model arbitrary residual distributions that are allowed to evolve across increasing levels of an ordinal covariate (e.g., time, in repeated measurement studies). By modeling residual distributions at consecutive covariate levels or time points using separate, but dependent Polya tree priors, distributional information is pooled while allowing for broad pliability to accommodate many types of changing residual distributions. We can use the proposed dependent residual structure in a wide range of regression settings, including fixed-effects and mixed-effects linear and nonlinear models for cross-sectional, prospective, and repeated measurement data. A simulation study illustrates the flexibility of our novel semiparametric regression model to accurately capture evolving residual distributions. In an application to immune development data on immunoglobulin G antibodies in children, our new model outperforms several contemporary semiparametric regression models based on a predictive model selection criterion. Copyright © 2013 John Wiley & Sons, Ltd.

  5. A land use regression model for ambient ultrafine particles in Montreal, Canada: A comparison of linear regression and a machine learning approach.

    PubMed

    Weichenthal, Scott; Ryswyk, Keith Van; Goldstein, Alon; Bagg, Scott; Shekkarizfard, Maryam; Hatzopoulou, Marianne

    2016-04-01

    Existing evidence suggests that ambient ultrafine particles (UFPs) (<0.1 µm) may contribute to acute cardiorespiratory morbidity. However, few studies have examined the long-term health effects of these pollutants owing in part to a need for exposure surfaces that can be applied in large population-based studies. To address this need, we developed a land use regression model for UFPs in Montreal, Canada using mobile monitoring data collected from 414 road segments during the summer and winter months between 2011 and 2012. Two different approaches were examined for model development including standard multivariable linear regression and a machine learning approach (kernel-based regularized least squares (KRLS)) that learns the functional form of covariate impacts on ambient UFP concentrations from the data. The final models included parameters for population density, ambient temperature and wind speed, land use parameters (park space and open space), length of local roads and rail, and estimated annual average NOx emissions from traffic. The final multivariable linear regression model explained 62% of the spatial variation in ambient UFP concentrations whereas the KRLS model explained 79% of the variance. The KRLS model performed slightly better than the linear regression model when evaluated using an external dataset (R² = 0.58 vs. 0.55) or a cross-validation procedure (R² = 0.67 vs. 0.60). In general, our findings suggest that the KRLS approach may offer modest improvements in predictive performance compared to standard multivariable linear regression models used to estimate spatial variations in ambient UFPs. However, differences in predictive performance were not statistically significant when evaluated using the cross-validation procedure. Crown Copyright © 2015. Published by Elsevier Inc. All rights reserved.

  6. Predicting Quantitative Traits With Regression Models for Dense Molecular Markers and Pedigree

    PubMed Central

    de los Campos, Gustavo; Naya, Hugo; Gianola, Daniel; Crossa, José; Legarra, Andrés; Manfredi, Eduardo; Weigel, Kent; Cotes, José Miguel

    2009-01-01

    The availability of genomewide dense markers brings opportunities and challenges to breeding programs. An important question concerns the ways in which dense markers and pedigrees, together with phenotypic records, should be used to arrive at predictions of genetic values for complex traits. If a large number of markers are included in a regression model, marker-specific shrinkage of regression coefficients may be needed. For this reason, the Bayesian least absolute shrinkage and selection operator (LASSO) (BL) appears to be an interesting approach for fitting marker effects in a regression model. This article adapts the BL to arrive at a regression model where markers, pedigrees, and covariates other than markers are considered jointly. Connections between BL and other marker-based regression models are discussed, and the sensitivity of BL with respect to the choice of prior distributions assigned to key parameters is evaluated using simulation. The proposed model was fitted to two data sets from wheat and mouse populations, and evaluated using cross-validation methods. Results indicate that inclusion of markers in the regression further improved the predictive ability of models. An R program that implements the proposed model is freely available. PMID:19293140

  7. Relations of water-quality constituent concentrations to surrogate measurements in the lower Platte River corridor, Nebraska, 2007 through 2011

    USGS Publications Warehouse

    Schaepe, Nathaniel J.; Soenksen, Philip J.; Rus, David L.

    2014-01-01

    The lower Platte River, Nebraska, provides drinking water, irrigation water, and in-stream flows for recreation, wildlife habitat, and vital habitats for several threatened and endangered species. The U.S. Geological Survey (USGS), in cooperation with the Lower Platte River Corridor Alliance (LPRCA) developed site-specific regression models for water-quality constituents at four sites (Shell Creek near Columbus, Nebraska [USGS site 06795500]; Elkhorn River at Waterloo, Nebr. [USGS site 06800500]; Salt Creek near Ashland, Nebr. [USGS site 06805000]; and Platte River at Louisville, Nebr. [USGS site 06805500]) in the lower Platte River corridor. The models were developed by relating continuously monitored water-quality properties (surrogate measurements) to discrete water-quality samples. These models enable existing web-based software to provide near-real-time estimates of stream-specific constituent concentrations to support natural resources management decisions. Since 2007, USGS, in cooperation with the LPRCA, has continuously monitored four water-quality properties seasonally within the lower Platte River corridor: specific conductance, water temperature, dissolved oxygen, and turbidity. During 2007 through 2011, the USGS and the Nebraska Department of Environmental Quality collected and analyzed discrete water-quality samples for nutrients, major ions, pesticides, suspended sediment, and bacteria. These datasets were used to develop the regression models. This report documents the collection of these various water-quality datasets and the development of the site-specific regression models. Regression models were developed for all four monitored sites. Constituent models for Shell Creek included nitrate plus nitrite, total phosphorus, orthophosphate, atrazine, acetochlor, suspended sediment, and Escherichia coli (E. coli) bacteria. 
    Regression models that were developed for the Elkhorn River included nitrate plus nitrite, total Kjeldahl nitrogen, total phosphorus, orthophosphate, chloride, atrazine, acetochlor, suspended sediment, and E. coli. Models developed for Salt Creek included nitrate plus nitrite, total Kjeldahl nitrogen, suspended sediment, and E. coli. Lastly, models developed for the Platte River site included total Kjeldahl nitrogen, total phosphorus, sodium, metolachlor, atrazine, acetochlor, suspended sediment, and E. coli.

  8. Random regression analyses using B-splines functions to model growth from birth to adult age in Canchim cattle.

    PubMed

    Baldi, F; Alencar, M M; Albuquerque, L G

    2010-12-01

    The objective of this work was to estimate covariance functions using random regression models on B-splines functions of animal age, for weights from birth to adult age in Canchim cattle. Data comprised 49,011 records on 2435 females. The model of analysis included fixed effects of contemporary groups, age of dam as quadratic covariable and the population mean trend taken into account by a cubic regression on orthogonal polynomials of animal age. Residual variances were modelled through a step function with four classes. The direct and maternal additive genetic effects, and animal and maternal permanent environmental effects were included as random effects in the model. A total of seventeen analyses, considering linear, quadratic and cubic B-splines functions and up to seven knots, were carried out. B-spline functions of the same order were considered for all random effects. Random regression models on B-splines functions were compared to a random regression model on Legendre polynomials and with a multitrait model. Results from different models of analyses were compared using the REML form of the Akaike Information criterion and Schwarz' Bayesian Information criterion. In addition, the variance components and genetic parameters estimated for each random regression model were also used as criteria to choose the most adequate model to describe the covariance structure of the data. A model fitting quadratic B-splines, with four knots or three segments for direct additive genetic effect and animal permanent environmental effect and two knots for maternal additive genetic effect and maternal permanent environmental effect, was the most adequate to describe the covariance structure of the data. Random regression models using B-spline functions as base functions fitted the data better than Legendre polynomials, especially at mature ages, but higher number of parameters need to be estimated with B-splines functions. © 2010 Blackwell Verlag GmbH.

  9. Orthogonal Projection in Teaching Regression and Financial Mathematics

    ERIC Educational Resources Information Center

    Kachapova, Farida; Kachapov, Ilias

    2010-01-01

    Two improvements in teaching linear regression are suggested. The first is to include the population regression model at the beginning of the topic. The second is to use a geometric approach: to interpret the regression estimate as an orthogonal projection and the estimation error as the distance (which is minimized by the projection). Linear…
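
    The geometric reading described in this abstract can be checked numerically. The sketch below is an illustration only, not material from the article: the function names and the toy data are assumptions. It fits a straight line by least squares and confirms that the residual vector is orthogonal to both the constant vector and the predictor, so the fitted values form an orthogonal projection of y.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def project_onto_line_model(x, y):
    """Least-squares fit of y = a + b*x; returns (fitted, residuals).

    The fitted vector is the orthogonal projection of y onto the plane
    spanned by the constant vector and x.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    fitted = [intercept + slope * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    return fitted, residuals


# Toy data, chosen only for illustration.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 2.0, 5.0]
fitted, residuals = project_onto_line_model(x, y)

# Orthogonality of the residual to each spanning vector (up to rounding).
print(dot(residuals, [1.0] * len(x)), dot(residuals, x))
```

    Because the residual is orthogonal to the fitted vector, the squared lengths obey the Pythagorean identity |y|² = |fitted|² + |residual|², which is the decomposition behind the usual sums of squares.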

  10. Boosting structured additive quantile regression for longitudinal childhood obesity data.

    PubMed

    Fenske, Nora; Fahrmeir, Ludwig; Hothorn, Torsten; Rzehak, Peter; Höhle, Michael

    2013-07-25

    Childhood obesity and the investigation of its risk factors has become an important public health issue. Our work is based on and motivated by a German longitudinal study including 2,226 children with up to ten measurements on their body mass index (BMI) and risk factors from birth to the age of 10 years. We introduce boosting of structured additive quantile regression as a novel distribution-free approach for longitudinal quantile regression. The quantile-specific predictors of our model include conventional linear population effects, smooth nonlinear functional effects, varying-coefficient terms, and individual-specific effects, such as intercepts and slopes. Estimation is based on boosting, a computer intensive inference method for highly complex models. We propose a component-wise functional gradient descent boosting algorithm that allows for penalized estimation of the large variety of different effects, particularly leading to individual-specific effects shrunken toward zero. This concept allows us to flexibly estimate the nonlinear age curves of upper quantiles of the BMI distribution, both on population and on individual-specific level, adjusted for further risk factors and to detect age-varying effects of categorical risk factors. Our model approach can be regarded as the quantile regression analog of Gaussian additive mixed models (or structured additive mean regression models), and we compare both model classes with respect to our obesity data.

  11. Comparisons between physics-based, engineering, and statistical learning models for outdoor sound propagation.

    PubMed

    Hart, Carl R; Reznicek, Nathan J; Wilson, D Keith; Pettit, Chris L; Nykaza, Edward T

    2016-05-01

    Many outdoor sound propagation models exist, ranging from highly complex physics-based simulations to simplified engineering calculations, and more recently, highly flexible statistical learning methods. Several engineering and statistical learning models are evaluated by using a particular physics-based model, namely, a Crank-Nicholson parabolic equation (CNPE), as a benchmark. Narrowband transmission loss values predicted with the CNPE, based upon a simulated data set of meteorological, boundary, and source conditions, act as simulated observations. In the simulated data set sound propagation conditions span from downward refracting to upward refracting, for acoustically hard and soft boundaries, and low frequencies. Engineering models used in the comparisons include the ISO 9613-2 method, Harmonoise, and Nord2000 propagation models. Statistical learning methods used in the comparisons include bagged decision tree regression, random forest regression, boosting regression, and artificial neural network models. Computed skill scores are relative to sound propagation in a homogeneous atmosphere over a rigid ground. Overall skill scores for the engineering noise models are 0.6%, -7.1%, and 83.8% for the ISO 9613-2, Harmonoise, and Nord2000 models, respectively. Overall skill scores for the statistical learning models are 99.5%, 99.5%, 99.6%, and 99.6% for bagged decision tree, random forest, boosting, and artificial neural network regression models, respectively.

  12. Population heterogeneity in the salience of multiple risk factors for adolescent delinquency.

    PubMed

    Lanza, Stephanie T; Cooper, Brittany R; Bray, Bethany C

    2014-03-01

    To present mixture regression analysis as an alternative to more standard regression analysis for predicting adolescent delinquency. We demonstrate how mixture regression analysis allows for the identification of population subgroups defined by the salience of multiple risk factors. We identified population subgroups (i.e., latent classes) of individuals based on their coefficients in a regression model predicting adolescent delinquency from eight previously established risk indices drawn from the community, school, family, peer, and individual levels. The study included N = 37,763 10th-grade adolescents who participated in the Communities That Care Youth Survey. Standard, zero-inflated, and mixture Poisson and negative binomial regression models were considered. Standard and mixture negative binomial regression models were selected as optimal. The five-class regression model was interpreted based on the class-specific regression coefficients, indicating that risk factors had varying salience across classes of adolescents. Standard regression showed that all risk factors were significantly associated with delinquency. Mixture regression provided more nuanced information, suggesting a unique set of risk factors that were salient for different subgroups of adolescents. Implications for the design of subgroup-specific interventions are discussed. Copyright © 2014 Society for Adolescent Health and Medicine. Published by Elsevier Inc. All rights reserved.

  13. Retargeted Least Squares Regression Algorithm.

    PubMed

    Zhang, Xu-Yao; Wang, Lingfeng; Xiang, Shiming; Liu, Cheng-Lin

    2015-09-01

    This brief presents a framework of retargeted least squares regression (ReLSR) for multicategory classification. The core idea is to directly learn the regression targets from data rather than using the traditional zero-one matrix as regression targets. The learned target matrix can guarantee a large margin constraint for the requirement of correct classification for each data point. Compared with the traditional least squares regression (LSR) and a recently proposed discriminative LSR model, ReLSR is much more accurate in measuring the classification error of the regression model. Furthermore, ReLSR is a single and compact model, hence there is no need to train two-class (binary) machines that are independent of each other. The convex optimization problem of ReLSR is solved elegantly and efficiently with an alternating procedure including regression and retargeting as substeps. The experimental evaluation over a range of databases demonstrates the validity of our method.

  14. Default Bayes Factors for Model Selection in Regression

    ERIC Educational Resources Information Center

    Rouder, Jeffrey N.; Morey, Richard D.

    2012-01-01

    In this article, we present a Bayes factor solution for inference in multiple regression. Bayes factors are principled measures of the relative evidence from data for various models or positions, including models that embed null hypotheses. In this regard, they may be used to state positive evidence for a lack of an effect, which is not possible…

  15. Advanced statistics: linear regression, part I: simple linear regression.

    PubMed

    Marill, Keith A

    2004-01-01

    Simple linear regression is a mathematical technique used to model the relationship between a single independent predictor variable and a single dependent outcome variable. In this, the first of a two-part series exploring concepts in linear regression analysis, the four fundamental assumptions and the mechanics of simple linear regression are reviewed. The most common technique used to derive the regression line, the method of least squares, is described. The reader will be acquainted with other important concepts in simple linear regression, including: variable transformations, dummy variables, relationship to inference testing, and leverage. Simplified clinical examples with small datasets and graphic models are used to illustrate the points. This will provide a foundation for the second article in this series: a discussion of multiple linear regression, in which there are multiple predictor variables.
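
    The method of least squares mentioned in this abstract has a closed-form solution for one predictor. The following is a minimal sketch using the standard textbook formulas; the function name and the example data are assumptions for illustration and do not come from the article.

```python
def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) minimizing the sum of squared residuals.

    slope     = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
    intercept = mean_y - slope * mean_x
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must contain at least two distinct values")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Points lying exactly on y = 1 + 2x, so the fit recovers those values.
intercept, slope = fit_simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(intercept, slope)
```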

  16. Comparison of Regression Analysis and Transfer Function in Estimating the Parameters of Central Pulse Waves from Brachial Pulse Wave.

    PubMed

    Chai, Rui; Xu, Li-Sheng; Yao, Yang; Hao, Li-Ling; Qi, Lin

    2017-01-01

    This study analyzed ascending branch slope (A_slope), dicrotic notch height (Hn), diastolic area (Ad), systolic area (As), diastolic blood pressure (DBP), systolic blood pressure (SBP), pulse pressure (PP), subendocardial viability ratio (SEVR), waveform parameter (k), stroke volume (SV), cardiac output (CO), and peripheral resistance (RS) of central pulse waves measured invasively and non-invasively. Invasively measured parameters were compared with parameters derived from brachial pulse waves by a regression model and by a transfer function model. The accuracy of the parameters estimated by the two models was also compared. Findings showed that the k value and the invasively measured central and brachial pulse wave parameters were positively correlated. Regression model parameters, including A_slope, DBP, and SEVR, and the transfer function model parameters agreed well with the invasively measured parameters, and the two models showed the same degree of consistency. SBP, PP, SV, and CO could be calculated through the regression model, but less accurately than with the transfer function model.

  17. An empirical model for estimating annual consumption by freshwater fish populations

    USGS Publications Warehouse

    Liao, H.; Pierce, C.L.; Larscheid, J.G.

    2005-01-01

    Population consumption is an important process linking predator populations to their prey resources. Simple tools are needed to enable fisheries managers to estimate population consumption. We assembled 74 individual estimates of annual consumption by freshwater fish populations and their mean annual population size, 41 of which also included estimates of mean annual biomass. The data set included 14 freshwater fish species from 10 different bodies of water. From this data set we developed two simple linear regression models predicting annual population consumption. Log-transformed population size explained 94% of the variation in log-transformed annual population consumption. Log-transformed biomass explained 98% of the variation in log-transformed annual population consumption. We quantified the accuracy of our regressions and three alternative consumption models as the mean percent difference from observed (bioenergetics-derived) estimates in a test data set. Predictions from our population-size regression matched observed consumption estimates poorly (mean percent difference = 222%). Predictions from our biomass regression matched observed consumption reasonably well (mean percent difference = 24%). The biomass regression was superior to an alternative model, similar in complexity, and comparable to two alternative models that were more complex and difficult to apply. Our biomass regression model, log10(consumption) = 0.5442 + 0.9962 × log10(biomass), will be a useful tool for fishery managers, enabling them to make reasonably accurate annual population consumption predictions from mean annual biomass estimates. © Copyright by the American Fisheries Society 2005.
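
    The biomass regression quoted in this abstract can be applied directly. The sketch below is a hedged illustration: the two coefficients are the ones given in the abstract, while the function name is hypothetical and the units are assumed to match those used in the paper (the abstract does not state them). It back-transforms the log10 prediction without any bias correction.

```python
import math

# Coefficients as quoted in the abstract.
INTERCEPT = 0.5442
SLOPE = 0.9962


def predict_annual_consumption(biomass):
    """Predict annual population consumption from mean annual biomass.

    Implements log10(consumption) = 0.5442 + 0.9962 * log10(biomass) and
    returns the prediction on the original (untransformed) scale.
    Units are an assumption: they must match those used in the paper.
    """
    if biomass <= 0:
        raise ValueError("biomass must be positive")
    return 10 ** (INTERCEPT + SLOPE * math.log10(biomass))


# Hypothetical biomass values, used only to show the call.
for biomass in (1.0, 100.0, 200.0):
    print(biomass, predict_annual_consumption(biomass))
```

    Because the slope is close to 1, predicted consumption is nearly proportional to biomass: doubling biomass multiplies the prediction by 2^0.9962, just under 2.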

  18. A gentle introduction to quantile regression for ecologists

    USGS Publications Warehouse

    Cade, B.S.; Noon, B.R.

    2003-01-01

    Quantile regression is a way to estimate the conditional quantiles of a response variable distribution in the linear model that provides a more complete view of possible causal relationships between variables in ecological processes. Typically, not all the factors that affect ecological processes are measured and included in the statistical models used to investigate relationships between variables associated with those processes. As a consequence, there may be a weak or no predictive relationship between the mean of the response variable (y) distribution and the measured predictive factors (X). Yet there may be stronger, useful predictive relationships with other parts of the response variable distribution. This primer relates quantile regression estimates to prediction intervals in parametric error distribution regression models (e.g., least squares), and discusses the ordering characteristics, interval nature, sampling variation, weighting, and interpretation of the estimates for homogeneous and heterogeneous regression models.
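
    The abstract does not state the estimating criterion. The sketch below uses the standard check (pinball) loss formulation of quantile estimation, restricted to an intercept-only model so that it stays short; the function names and the data are assumptions for illustration. It shows that minimizing the check loss picks out a quantile, whereas the mean is pulled toward the extreme value.

```python
def check_loss(residual, tau):
    """Check (pinball) loss: tau * r if r >= 0, else (tau - 1) * r."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def total_check_loss(values, candidate, tau):
    """Sum of check losses when every value is predicted by `candidate`."""
    return sum(check_loss(v - candidate, tau) for v in values)


def sample_quantile_by_check_loss(values, tau):
    """Intercept-only quantile estimate.

    Searches the observed values for the one that minimizes the total
    check loss; a minimizer always exists among the data points.
    """
    if not 0 < tau < 1:
        raise ValueError("tau must lie strictly between 0 and 1")
    return min(sorted(values), key=lambda c: total_check_loss(values, c, tau))


# Toy data with one extreme observation.
data = [1, 2, 3, 4, 100]
median = sample_quantile_by_check_loss(data, 0.5)
mean = sum(data) / len(data)
print(median, mean)
```

    A full quantile regression replaces the single candidate value with a linear predictor and minimizes the same loss over its coefficients, typically by linear programming.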

  19. Understanding Poisson regression.

    PubMed

    Hayat, Matthew J; Higgins, Melinda

    2014-04-01

    Nurse investigators often collect study data in the form of counts. Traditional methods of data analysis have historically approached analysis of count data either as if the count data were continuous and normally distributed or with dichotomization of the counts into the categories of occurred or did not occur. These outdated methods for analyzing count data have been replaced with more appropriate statistical methods that make use of the Poisson probability distribution, which is useful for analyzing count data. The purpose of this article is to provide an overview of the Poisson distribution and its use in Poisson regression. Assumption violations for the standard Poisson regression model are addressed with alternative approaches, including addition of an overdispersion parameter or negative binomial regression. An illustrative example is presented with an application from the ENSPIRE study, and regression modeling of comorbidity data is included for illustrative purposes. Copyright 2014, SLACK Incorporated.

  20. Background stratified Poisson regression analysis of cohort data.

    PubMed

    Richardson, David B; Langholz, Bryan

    2012-03-01

    Background stratified Poisson regression is an approach that has been used in the analysis of data derived from a variety of epidemiologically important studies of radiation-exposed populations, including uranium miners, nuclear industry workers, and atomic bomb survivors. We describe a novel approach to fit Poisson regression models that adjust for a set of covariates through background stratification while directly estimating the radiation-disease association of primary interest. The approach makes use of an expression for the Poisson likelihood that treats the coefficients for stratum-specific indicator variables as 'nuisance' variables and avoids the need to explicitly estimate the coefficients for these stratum-specific parameters. Log-linear models, as well as other general relative rate models, are accommodated. This approach is illustrated using data from the Life Span Study of Japanese atomic bomb survivors and data from a study of underground uranium miners. The point estimate and confidence interval obtained from this 'conditional' regression approach are identical to the values obtained using unconditional Poisson regression with model terms for each background stratum. Moreover, it is shown that the proposed approach allows estimation of background stratified Poisson regression models of non-standard form, such as models that parameterize latency effects, as well as regression models in which the number of strata is large, thereby overcoming the limitations of previously available statistical software for fitting background stratified Poisson regression models.

  1. A consistent framework for Horton regression statistics that leads to a modified Hack's law

    USGS Publications Warehouse

    Furey, P.R.; Troutman, B.M.

    2008-01-01

    A statistical framework is introduced that resolves important problems with the interpretation and use of traditional Horton regression statistics. The framework is based on a univariate regression model that leads to an alternative expression for Horton ratio, connects Horton regression statistics to distributional simple scaling, and improves the accuracy in estimating Horton plot parameters. The model is used to examine data for drainage area A and mainstream length L from two groups of basins located in different physiographic settings. Results show that confidence intervals for the Horton plot regression statistics are quite wide. Nonetheless, an analysis of covariance shows that regression intercepts, but not regression slopes, can be used to distinguish between basin groups. The univariate model is generalized to include n > 1 dependent variables. For the case where the dependent variables represent ln A and ln L, the generalized model performs somewhat better at distinguishing between basin groups than two separate univariate models. The generalized model leads to a modification of Hack's law where L depends on both A and Strahler order ω. Data show that ω plays a statistically significant role in the modified Hack's law expression. © 2008 Elsevier B.V.

  2. Improving near-infrared prediction model robustness with support vector machine regression: a pharmaceutical tablet assay example.

    PubMed

    Igne, Benoît; Drennen, James K; Anderson, Carl A

    2014-01-01

    Changes in raw materials and process wear and tear can have significant effects on the prediction error of near-infrared calibration models. When the variability that is present during routine manufacturing is not included in the calibration, test, and validation sets, the long-term performance and robustness of the model will be limited. Nonlinearity is a major source of interference. In near-infrared spectroscopy, nonlinearity can arise from light path-length differences that can come from differences in particle size or density. The usefulness of support vector machine (SVM) regression to handle nonlinearity and improve the robustness of calibration models in scenarios where the calibration set did not include all the variability present in test was evaluated. Compared to partial least squares (PLS) regression, SVM regression was less affected by physical (particle size) and chemical (moisture) differences. The linearity of the SVM predicted values was also improved. Nevertheless, although visualization and interpretation tools have been developed to enhance the usability of SVM-based methods, work is yet to be done to provide chemometricians in the pharmaceutical industry with a regression method that can supplement PLS-based methods.

  3. Can We Use Regression Modeling to Quantify Mean Annual Streamflow at a Global-Scale?

    NASA Astrophysics Data System (ADS)

    Barbarossa, V.; Huijbregts, M. A. J.; Hendriks, J. A.; Beusen, A.; Clavreul, J.; King, H.; Schipper, A.

    2016-12-01

    Quantifying mean annual flow of rivers (MAF) at ungauged sites is essential for a number of applications, including assessments of global water supply, ecosystem integrity and water footprints. MAF can be quantified with spatially explicit process-based models, which might be overly time-consuming and data-intensive for this purpose, or with empirical regression models that predict MAF based on climate and catchment characteristics. Yet, regression models have mostly been developed at a regional scale and the extent to which they can be extrapolated to other regions is not known. In this study, we developed a global-scale regression model for MAF using observations of discharge and catchment characteristics from 1,885 catchments worldwide, ranging from 2 to 10^6 km² in size. In addition, we compared the performance of the regression model with the predictive ability of the spatially explicit global hydrological model PCR-GLOBWB [van Beek et al., 2011] by comparing results from both models to independent measurements. We obtained a regression model explaining 89% of the variance in MAF based on catchment area, mean annual precipitation and air temperature, average slope and elevation. The regression model performed better than PCR-GLOBWB for the prediction of MAF, as root-mean-square error values were lower (0.29 - 0.38 compared to 0.49 - 0.57) and the modified index of agreement was higher (0.80 - 0.83 compared to 0.72 - 0.75). Our regression model can be applied globally at any point of the river network, provided that the input parameters are within the range of values employed in the calibration of the model. The performance is reduced for water scarce regions and further research should focus on improving such an aspect for regression-based global hydrological models.

  4. Harmonic regression of Landsat time series for modeling attributes from national forest inventory data

    NASA Astrophysics Data System (ADS)

    Wilson, Barry T.; Knight, Joseph F.; McRoberts, Ronald E.

    2018-03-01

    Imagery from the Landsat Program has been used frequently as a source of auxiliary data for modeling land cover, as well as a variety of attributes associated with tree cover. With ready access to all scenes in the archive since 2008 due to the USGS Landsat Data Policy, new approaches to deriving such auxiliary data from dense Landsat time series are required. Several methods have previously been developed for use with finer temporal resolution imagery (e.g. AVHRR and MODIS), including image compositing and harmonic regression using Fourier series. The manuscript presents a study, using Minnesota, USA during the years 2009-2013 as the study area and timeframe. The study examined the relative predictive power of land cover models, in particular those related to tree cover, using predictor variables based solely on composite imagery versus those using estimated harmonic regression coefficients. The study used two common non-parametric modeling approaches (i.e. k-nearest neighbors and random forests) for fitting classification and regression models of multiple attributes measured on USFS Forest Inventory and Analysis plots using all available Landsat imagery for the study area and timeframe. The estimated Fourier coefficients developed by harmonic regression of tasseled cap transformation time series data were shown to be correlated with land cover, including tree cover. Regression models using estimated Fourier coefficients as predictor variables showed a two- to threefold increase in explained variance for a small set of continuous response variables, relative to comparable models using monthly image composites. Similarly, the overall accuracies of classification models using the estimated Fourier coefficients were approximately 10-20 percentage points higher than the models using the image composites, with corresponding individual class accuracies between six and 45 percentage points higher.
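
    Harmonic regression with a Fourier series, as named in this abstract, fits a mean plus cosine and sine terms to a time series and uses the estimated coefficients as predictors. The sketch below is a simplified illustration under stated assumptions: a single annual harmonic, a 365.25-day period, and synthetic observations at irregular dates. None of these choices, nor the function names, are taken from the study.

```python
import math


def harmonic_design_row(t, period):
    """Design row [1, cos(2*pi*t/period), sin(2*pi*t/period)]."""
    w = 2.0 * math.pi * t / period
    return [1.0, math.cos(w), math.sin(w)]


def solve_linear_system(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: too few distinct dates")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_harmonic(times, values, period=365.25):
    """Least-squares estimates [mean, cosine coef, sine coef].

    Solves the normal equations, so irregularly spaced dates are fine.
    """
    rows = [harmonic_design_row(t, period) for t in times]
    k = 3
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, values)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic, noise-free series sampled on irregular days of the year.
days = [3, 20, 47, 95, 130, 180, 222, 260, 301, 340]
series = [
    0.4 + 0.2 * math.cos(2 * math.pi * d / 365.25)
    + 0.1 * math.sin(2 * math.pi * d / 365.25)
    for d in days
]
print(fit_harmonic(days, series))
```

    In a per-pixel workflow the three fitted coefficients would stand in for the time series as predictor variables; additional harmonics would add further cosine and sine columns to the design row.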

  5. [Prediction model of health workforce and beds in county hospitals of Hunan by multiple linear regression].

    PubMed

    Ling, Ru; Liu, Jiawang

    2011-12-01

    To construct a prediction model for health workforce and hospital beds in county hospitals of Hunan by multiple linear regression. We surveyed 16 counties in Hunan with stratified random sampling using uniform questionnaires, and performed multiple linear regression analysis with 20 indicators selected by literature review. Independent variables in the multiple linear regression model on medical personnel in county hospitals included the counties' urban residents' income, crude death rate, medical beds, business occupancy, professional equipment value, the number of devices valued above 10 000 yuan, fixed assets, long-term debt, medical income, medical expenses, outpatient and emergency visits, hospital visits, actual available bed days, and utilization rate of hospital beds. Independent variables in the multiple linear regression model on county hospital beds included the population aged 65 and above in the counties, disposable income of urban residents, medical personnel of medical institutions in county area, business occupancy, the total value of professional equipment, fixed assets, long-term debt, medical income, medical expenses, outpatient and emergency visits, hospital visits, actual available bed days, utilization rate of hospital beds, and length of hospitalization. The prediction model shows good explanatory power and fit, and may be used for short- and mid-term forecasting.

  6. Evaluation of land use regression models (LURs) for nitrogen dioxide and benzene in four U.S. Cities.

    EPA Science Inventory

    Spatial analysis studies have included application of land use regression models (LURs) for health and air quality assessments. Recent LUR studies have collected nitrogen dioxide (NO2) and volatile organic compounds (VOCs) using passive samplers at urban air monitoring networks ...

  7. Time series regression model for infectious disease and weather.

    PubMed

    Imai, Chisato; Armstrong, Ben; Chalabi, Zaid; Mangtani, Punam; Hashizume, Masahiro

    2015-10-01

    Time series regression has been developed and long used to evaluate the short-term associations of air pollution and weather with mortality or morbidity of non-infectious diseases. The application of the regression approaches from this tradition to infectious diseases, however, is less well explored and raises some new issues. We discuss and present potential solutions for five issues often arising in such analyses: changes in immune population, strong autocorrelations, a wide range of plausible lag structures and association patterns, seasonality adjustments, and large overdispersion. The potential approaches are illustrated with datasets of cholera cases and rainfall from Bangladesh and influenza and temperature in Tokyo. Though this article focuses on the application of the traditional time series regression to infectious diseases and weather factors, we also briefly introduce alternative approaches, including mathematical modeling, wavelet analysis, and autoregressive integrated moving average (ARIMA) models. Modifications proposed to standard time series regression practice include using sums of past cases as proxies for the immune population, and using the logarithm of lagged disease counts to control autocorrelation due to true contagion, both of which are motivated from "susceptible-infectious-recovered" (SIR) models. The complexity of lag structures and association patterns can often be informed by biological mechanisms and explored by using distributed lag non-linear models. For overdispersed models, alternative distribution models such as quasi-Poisson and negative binomial should be considered. Time series regression can be used to investigate dependence of infectious diseases on weather, but may need modifying to allow for features specific to this context. Copyright © 2015 The Authors. Published by Elsevier Inc. All rights reserved.
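
The two modifications named in this abstract (sums of past cases as a proxy for the immune population, and the logarithm of lagged counts to control autocorrelation) can be sketched as a feature-building step. The function name, the window lengths, the `+ 1` offset that avoids `log(0)`, and the toy data are all assumptions for illustration, not details from the article.

```python
import math

def build_rows(cases, weather, lag=1, immunity_window=3):
    """Return (count, weather, log lagged count, sum of past cases) per time step."""
    rows = []
    start = max(lag, immunity_window)
    for t in range(start, len(cases)):
        log_lag = math.log(cases[t - lag] + 1)        # +1 offset is an assumption
        past = sum(cases[t - immunity_window:t])      # crude immunity proxy
        rows.append((cases[t], weather[t], log_lag, past))
    return rows

# Hypothetical weekly case counts and a weather covariate.
cases = [0, 3, 5, 9, 4, 2]
weather = [10, 12, 15, 20, 18, 11]
rows = build_rows(cases, weather)
```

Rows like these could then be passed to a quasi-Poisson or negative binomial regression, the distribution families the abstract suggests for overdispersed counts.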

  8. A Comparison between Multiple Regression Models and CUN-BAE Equation to Predict Body Fat in Adults

    PubMed Central

    Fuster-Parra, Pilar; Bennasar-Veny, Miquel; Tauler, Pedro; Yañez, Aina; López-González, Angel A.; Aguiló, Antoni

    2015-01-01

    Background Because the accurate measure of body fat (BF) is difficult, several prediction equations have been proposed. The aim of this study was to compare different multiple regression models to predict BF, including the recently reported CUN-BAE equation. Methods Multiple regression models using body mass index (BMI) and body adiposity index (BAI) as predictors of BF will be compared. These models will be also compared with the CUN-BAE equation. For all the analyses a sample including all the participants and another one including only the overweight and obese subjects will be considered. The BF reference measure was made using Bioelectrical Impedance Analysis. Results The simplest models including only BMI or BAI as independent variables showed that BAI is a better predictor of BF. However, adding the variable sex to both models made BMI a better predictor than the BAI. For both the whole group of participants and the group of overweight and obese participants, using simple models (BMI, age and sex as variables) allowed obtaining similar correlations with BF as when the more complex CUN-BAE was used (ρ = 0.87 vs. ρ = 0.86 for the whole sample and ρ = 0.88 vs. ρ = 0.89 for overweight and obese subjects, the second value being the one for CUN-BAE). Conclusions There are simpler models than the CUN-BAE equation that fit BF as well as CUN-BAE does. Therefore, it could be considered that CUN-BAE overfits. Using a simple linear regression model, the BAI, as the only variable, predicts BF better than BMI. However, when the sex variable is introduced, BMI becomes the indicator of choice to predict BF. PMID:25821960

  9. A comparison between multiple regression models and CUN-BAE equation to predict body fat in adults.

    PubMed

    Fuster-Parra, Pilar; Bennasar-Veny, Miquel; Tauler, Pedro; Yañez, Aina; López-González, Angel A; Aguiló, Antoni

    2015-01-01

    Because the accurate measure of body fat (BF) is difficult, several prediction equations have been proposed. The aim of this study was to compare different multiple regression models to predict BF, including the recently reported CUN-BAE equation. Multiple regression models using body mass index (BMI) and body adiposity index (BAI) as predictors of BF will be compared. These models will be also compared with the CUN-BAE equation. For all the analyses a sample including all the participants and another one including only the overweight and obese subjects will be considered. The BF reference measure was made using Bioelectrical Impedance Analysis. The simplest models including only BMI or BAI as independent variables showed that BAI is a better predictor of BF. However, adding the variable sex to both models made BMI a better predictor than the BAI. For both the whole group of participants and the group of overweight and obese participants, using simple models (BMI, age and sex as variables) allowed obtaining similar correlations with BF as when the more complex CUN-BAE was used (ρ = 0.87 vs. ρ = 0.86 for the whole sample and ρ = 0.88 vs. ρ = 0.89 for overweight and obese subjects, the second value being the one for CUN-BAE). There are simpler models than the CUN-BAE equation that fit BF as well as CUN-BAE does. Therefore, it could be considered that CUN-BAE overfits. Using a simple linear regression model, the BAI, as the only variable, predicts BF better than BMI. However, when the sex variable is introduced, BMI becomes the indicator of choice to predict BF.

  10. Developing a dengue forecast model using machine learning: A case study in China.

    PubMed

    Guo, Pi; Liu, Tao; Zhang, Qin; Wang, Li; Xiao, Jianpeng; Zhang, Qingying; Luo, Ganfeng; Li, Zhihao; He, Jianfeng; Zhang, Yonghui; Ma, Wenjun

    2017-10-01

    In China, dengue remains an important public health issue with expanded areas and increased incidence recently. Accurate and timely forecasts of dengue incidence in China are still lacking. We aimed to use the state-of-the-art machine learning algorithms to develop an accurate predictive model of dengue. Weekly dengue cases, Baidu search queries and climate factors (mean temperature, relative humidity and rainfall) during 2011-2014 in Guangdong were gathered. A dengue search index was constructed for developing the predictive models in combination with climate factors. The observed year and week were also included in the models to control for the long-term trend and seasonality. Several machine learning algorithms, including the support vector regression (SVR) algorithm, step-down linear regression model, gradient boosted regression tree algorithm (GBM), negative binomial regression model (NBM), least absolute shrinkage and selection operator (LASSO) linear regression model and generalized additive model (GAM), were used as candidate models to predict dengue incidence. Performance and goodness of fit of the models were assessed using the root-mean-square error (RMSE) and R-squared measures. The residuals of the models were examined using the autocorrelation and partial autocorrelation function analyses to check the validity of the models. The models were further validated using dengue surveillance data from five other provinces. The epidemics during the last 12 weeks and the peak of the 2014 large outbreak were accurately forecasted by the SVR model selected by a cross-validation technique. Moreover, the SVR model had the consistently smallest prediction error rates for tracking the dynamics of dengue and forecasting the outbreaks in other areas in China. The proposed SVR model achieved a superior performance in comparison with other forecasting techniques assessed in this study. 
    The findings can help the government and community respond early to dengue epidemics.
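
The two performance measures named in this abstract, root-mean-square error (RMSE) and R-squared, can be written out in a few lines. This is a minimal sketch; the observed and predicted weekly counts below are invented for illustration and are not data from the study.

```python
import math

def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def r_squared(observed, predicted):
    """1 - SS_res / SS_tot, with SS_tot taken around the observed mean."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

# Hypothetical weekly case counts and model forecasts.
observed = [3, 5, 8, 13, 21]
predicted = [4, 5, 7, 14, 20]
error = rmse(observed, predicted)
fit = r_squared(observed, predicted)
```

Lower `error` and higher `fit` indicate a better candidate model; comparing these across algorithms is one way to rank them, as the abstract describes.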

  11. Prediction of Emergency Department Hospital Admission Based on Natural Language Processing and Neural Networks.

    PubMed

    Zhang, Xingyu; Kim, Joyce; Patzer, Rachel E; Pitts, Stephen R; Patzer, Aaron; Schrager, Justin D

    2017-10-26

    To describe and compare logistic regression and neural network modeling strategies to predict hospital admission or transfer following initial presentation to Emergency Department (ED) triage with and without the addition of natural language processing elements. Using data from the National Hospital Ambulatory Medical Care Survey (NHAMCS), a cross-sectional probability sample of United States EDs from 2012 and 2013 survey years, we developed several predictive models with the outcome being admission to the hospital or transfer vs. discharge home. We included patient characteristics immediately available after the patient has presented to the ED and undergone a triage process. We used this information to construct logistic regression (LR) and multilayer neural network models (MLNN) which included natural language processing (NLP) and principal component analysis from the patient's reason for visit. Ten-fold cross validation was used to test the predictive capacity of each model, and the area under the receiver operating characteristic curve (AUC) was then calculated for each model. Of the 47,200 ED visits from 642 hospitals, 6,335 (13.42%) resulted in hospital admission (or transfer). A total of 48 principal components were extracted by NLP from the reason for visit fields, which explained 75% of the overall variance for hospitalization. In the model including only structured variables, the AUC was 0.824 (95% CI 0.818-0.830) for logistic regression and 0.823 (95% CI 0.817-0.829) for MLNN. Models including only free-text information generated AUC of 0.742 (95% CI 0.731-0.753) for logistic regression and 0.753 (95% CI 0.742-0.764) for MLNN. When both structured variables and free text variables were included, the AUC reached 0.846 (95% CI 0.839-0.853) for logistic regression and 0.844 (95% CI 0.836-0.852) for MLNN. The predictive accuracy of hospital admission or transfer for patients who presented to ED triage overall was good, and was improved with the inclusion of free text data from a patient's reason for visit regardless of modeling approach. Natural language processing and neural networks that incorporate patient-reported outcome free text may increase predictive accuracy for hospital admission.
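
The AUC values reported above summarize how well a model ranks admitted patients above discharged ones. A minimal sketch of that quantity, computed as the fraction of (admitted, discharged) pairs that the model orders correctly, is given below; the scores and labels are made up for the example, and ties count as half.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted admission probabilities and true outcomes (1 = admitted).
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
value = auc(scores, labels)
```

The pairwise form is quadratic in the sample size, so it suits small illustrations; for a data set the size of the one described, a rank-based computation would be the more practical choice.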

  12. Covariance functions for body weight from birth to maturity in Nellore cows.

    PubMed

    Boligon, A A; Mercadante, M E Z; Forni, S; Lôbo, R B; Albuquerque, L G

    2010-03-01

    The objective of this study was to estimate (co)variance functions using random regression models on Legendre polynomials for the analysis of repeated measures of BW from birth to adult age. A total of 82,064 records from 8,145 females were analyzed. Different models were compared. The models included additive direct and maternal effects, and animal and maternal permanent environmental effects as random terms. Contemporary group and dam age at calving (linear and quadratic effect) were included as fixed effects, and orthogonal Legendre polynomials of animal age (cubic regression) were considered as random covariables. Eight models with polynomials of third to sixth order were used to describe additive direct and maternal effects, and animal and maternal permanent environmental effects. Residual effects were modeled using 1 (i.e., assuming homogeneity of variances across all ages) or 5 age classes. The model with 5 classes was the best to describe the trajectory of residuals along the growth curve. The model including fourth- and sixth-order polynomials for additive direct and animal permanent environmental effects, respectively, and third-order polynomials for maternal genetic and maternal permanent environmental effects was the best. Estimates of (co)variance obtained with the multi-trait and random regression models were similar. Direct heritability estimates obtained with the random regression models followed a trend similar to that obtained with the multi-trait model. The largest estimates of maternal heritability were those of BW taken close to 240 d of age. In general, estimates of correlation between BW from birth to 8 yr of age decreased with increasing distance between ages.
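
The Legendre covariables mentioned in this abstract are obtained by rescaling age to the interval [-1, 1] and evaluating the polynomials there. The sketch below uses the standard three-term recurrence and unnormalized polynomials; the age range in days and the function names are assumptions for illustration, and random regression software may apply a different normalization.

```python
def standardize_age(age, age_min, age_max):
    """Map an age onto [-1, 1], the domain of the Legendre polynomials."""
    return -1.0 + 2.0 * (age - age_min) / (age_max - age_min)

def legendre_row(x, order):
    """Return [P_0(x), ..., P_order(x)] using Bonnet's recurrence."""
    p = [1.0, x]
    for n in range(1, order):
        p.append(((2 * n + 1) * x * p[n] - n * p[n - 1]) / (n + 1))
    return p[:order + 1]

# Hypothetical age range: birth (day 0) to 8 years (day 2920).
x = standardize_age(2190, 0, 2920)
row = legendre_row(x, order=3)   # cubic regression on age
```

Each record would contribute one such row as its covariables for the random regression coefficients.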

  13. Development and implementation of a regression model for predicting recreational water quality in the Cuyahoga River, Cuyahoga Valley National Park, Ohio 2009-11

    USGS Publications Warehouse

    Brady, Amie M.G.; Plona, Meg B.

    2012-01-01

    The Cuyahoga River within Cuyahoga Valley National Park (CVNP) is at times impaired for recreational use due to elevated concentrations of Escherichia coli (E. coli), a fecal-indicator bacterium. During the recreational seasons of mid-May through September during 2009–11, samples were collected 4 days per week and analyzed for E. coli concentrations at two sites within CVNP. Other water-quality and environmental data, including turbidity, rainfall, and streamflow, were measured and (or) tabulated for analysis. Regression models developed to predict recreational water quality in the river were implemented during the recreational seasons of 2009–11 for one site within CVNP, Jaite. For the 2009 and 2010 seasons, the regression models were better at predicting exceedances of Ohio's single-sample standard for primary-contact recreation compared to the traditional method of using the previous day's E. coli concentration. During 2009, the regression model was based on data collected during 2005 through 2008, excluding available 2004 data. The resulting model for 2009 did not perform as well as expected (based on the calibration data set) and tended to overestimate concentrations (correct responses at 69 percent). During 2010, the regression model was based on data collected during 2004 through 2009, including all of the available data. The 2010 model performed well, correctly predicting 89 percent of the samples above or below the single-sample standard, even though the predictions tended to be lower than actual sample concentrations. During 2011, the regression model was based on data collected during 2004 through 2010 and tended to overestimate concentrations. The 2011 model did not perform as well as the traditional method or as expected, based on the calibration dataset (correct responses at 56 percent). At a second site, Lock 29, approximately 5 river miles upstream from Jaite, a regression model based on data collected at the site during the recreational seasons of 2008–10 also did not perform as well as the traditional method or as well as expected (correct responses at 60 percent). Above normal precipitation in the region and a delayed start to the 2011 sampling season (sampling began mid-June) may have affected how well the 2011 models performed. With these new data, however, updated regression models may be better able to predict recreational water quality conditions due to the increased amount of diverse water quality conditions included in the calibration data. Daily recreational water-quality predictions for Jaite were made available on the Ohio Nowcast Web site at www.ohionowcast.info. Other public outreach included signage at trailheads in the park, articles in the park's quarterly-published schedule of events and volunteer newsletters. A U.S. Geological Survey Fact Sheet was also published to bring attention to water-quality issues in the park.
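
The "correct responses" percentages quoted in this abstract count how often a prediction falls on the same side of the single-sample standard as the measured concentration. A minimal sketch of that tally follows; the threshold value and the concentrations are hypothetical placeholders, not figures from the report.

```python
def percent_correct(predicted, observed, threshold):
    """Percent of samples where prediction and observation agree on exceedance."""
    hits = sum((p > threshold) == (o > threshold)
               for p, o in zip(predicted, observed))
    return 100.0 * hits / len(observed)

threshold = 300.0  # hypothetical standard, NOT taken from the source
predicted = [120, 450, 310, 90, 600]   # hypothetical model output
observed = [100, 500, 250, 80, 700]    # hypothetical measured concentrations
score = percent_correct(predicted, observed, threshold)
```

The same function can score the traditional method by passing the previous day's measured concentration as `predicted`.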

  14. Cox regression analysis with missing covariates via nonparametric multiple imputation.

    PubMed

    Hsu, Chiu-Hsieh; Yu, Mandi

    2018-01-01

    We consider the situation of estimating Cox regression in which some covariates are subject to missing, and there exists additional information (including observed event time, censoring indicator and fully observed covariates) which may be predictive of the missing covariates. We propose to use two working regression models: one for predicting the missing covariates and the other for predicting the missing probabilities. For each missing covariate observation, these two working models are used to define a nearest neighbor imputing set. This set is then used to non-parametrically impute covariate values for the missing observation. Upon the completion of imputation, Cox regression is performed on the multiply imputed datasets to estimate the regression coefficients. In a simulation study, we compare the nonparametric multiple imputation approach with the augmented inverse probability weighted (AIPW) method, which directly incorporates the two working models into estimation of Cox regression, and the predictive mean matching imputation (PMM) method. We show that all approaches can reduce bias due to non-ignorable missing mechanism. The proposed nonparametric imputation method is robust to mis-specification of either one of the two working models and robust to mis-specification of the link function of the two working models. In contrast, the PMM method is sensitive to misspecification of the covariates included in imputation. The AIPW method is sensitive to the selection probability. We apply the approaches to a breast cancer dataset from Surveillance, Epidemiology and End Results (SEER) Program.
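
The nearest-neighbor imputing set described in this abstract can be sketched as follows, under several assumptions: each subject carries two predictive scores (one from each working model), closeness is a weighted Euclidean distance on those scores with equal weights, and an imputed value is drawn at random from the nearest donors. The distance, the weights, the set size `k`, and all names are illustrative choices, not the authors' specification.

```python
import random

def imputing_set(target, donors, k=3, w_value=0.5, w_missing=0.5):
    """Return the k donors closest to `target` in the two-score space.

    target: (value_score, missingness_score)
    donors: list of (value_score, missingness_score, observed_covariate)
    """
    def distance(d):
        return (w_value * (target[0] - d[0]) ** 2
                + w_missing * (target[1] - d[1]) ** 2) ** 0.5
    return sorted(donors, key=distance)[:k]

def impute(target, donors, rng, k=3):
    """Draw one covariate value at random from the imputing set."""
    return rng.choice(imputing_set(target, donors, k))[2]

# Hypothetical scores for subjects with the covariate observed.
donors = [(0.10, 0.20, 5.0), (0.90, 0.80, 9.0), (0.15, 0.25, 5.5),
          (0.50, 0.50, 7.0), (0.20, 0.10, 4.8)]
target = (0.12, 0.22)
neighbours = imputing_set(target, donors)
rng = random.Random(0)
draws = [impute(target, donors, rng) for _ in range(5)]  # 5 imputations
```

Repeating the draw yields the multiply imputed data sets on which the Cox regression would then be fitted.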

  15. Predicting 30-day Hospital Readmission with Publicly Available Administrative Database. A Conditional Logistic Regression Modeling Approach.

    PubMed

    Zhu, K; Lou, Z; Zhou, J; Ballester, N; Kong, N; Parikh, P

    2015-01-01

    This article is part of the Focus Theme of Methods of Information in Medicine on "Big Data and Analytics in Healthcare". Hospital readmissions raise healthcare costs and cause significant distress to providers and patients. It is, therefore, of great interest to healthcare organizations to predict which patients are at risk of being readmitted to their hospitals. However, current logistic regression based risk prediction models have limited prediction power when applied to hospital administrative data. Meanwhile, although decision trees and random forests have been applied, they tend to be too complex for hospital practitioners to understand. Explore the use of conditional logistic regression to increase the prediction accuracy. We analyzed an HCUP statewide inpatient discharge record dataset, which includes patient demographics, clinical and care utilization data from California. We extracted records of heart failure Medicare beneficiaries who had inpatient experience during an 11-month period. We corrected the data imbalance issue with under-sampling. In our study, we first applied standard logistic regression and decision tree to obtain influential variables and derive practically meaningful decision rules. We then stratified the original data set accordingly and applied logistic regression on each data stratum. We further explored the effect of interacting variables in the logistic regression modeling. We conducted cross validation to assess the overall prediction performance of conditional logistic regression (CLR) and compared it with standard classification models. The developed CLR models outperformed several standard classification models (e.g., straightforward logistic regression, stepwise logistic regression, random forest, support vector machine). For example, the best CLR model improved the classification accuracy by nearly 20% over the straightforward logistic regression model. Furthermore, the developed CLR models tend to achieve better sensitivity of more than 10% over the standard classification models, which can be translated to correct labeling of an additional 400-500 readmissions for heart failure patients in the state of California over a year. Lastly, several key predictors identified from the HCUP data include the disposition location from discharge, the number of chronic conditions, and the number of acute procedures. It would be beneficial to apply simple decision rules obtained from the decision tree in an ad-hoc manner to guide the cohort stratification. It could be potentially beneficial to explore the effect of pairwise interactions between influential predictors when building the logistic regression models for different data strata. Judicious use of the ad-hoc CLR models developed offers insights into future development of prediction models for hospital readmissions, which can lead to better intuition in identifying high-risk patients and developing effective post-discharge care strategies. Lastly, this paper is expected to raise the awareness of collecting data on additional markers and developing necessary database infrastructure for larger-scale exploratory studies on readmission risk prediction.
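
The under-sampling step mentioned in this abstract can be sketched as random removal of majority-class records until both classes are the same size. The abstract does not say which under-sampling scheme was used, so plain random under-sampling is an assumption here, as are the record layout and field names.

```python
import random

def undersample(records, label_of, seed=0):
    """Randomly drop majority-class records so both classes have equal size."""
    rng = random.Random(seed)
    pos = [r for r in records if label_of(r) == 1]
    neg = [r for r in records if label_of(r) == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced

# Hypothetical discharge records: 10 readmissions among 50 stays.
records = [{"id": i, "readmitted": 1 if i % 5 == 0 else 0} for i in range(50)]
balanced = undersample(records, lambda r: r["readmitted"])
```

After balancing, the data could be split into strata by a decision rule and a separate logistic regression fitted per stratum, which is the stratify-then-fit idea the abstract outlines.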

  16. Data Mining Methods Applied to Flight Operations Quality Assurance Data: A Comparison to Standard Statistical Methods

    NASA Technical Reports Server (NTRS)

    Stolzer, Alan J.; Halford, Carl

    2007-01-01

    In a previous study, multiple regression techniques were applied to Flight Operations Quality Assurance-derived data to develop parsimonious model(s) for fuel consumption on the Boeing 757 airplane. The present study examined several data mining algorithms, including neural networks, on the fuel consumption problem and compared them to the multiple regression results obtained earlier. Using regression methods, parsimonious models were obtained that explained approximately 85% of the variation in fuel flow. In general data mining methods were more effective in predicting fuel consumption. Classification and Regression Tree methods reported correlation coefficients of .91 to .92, and General Linear Models and Multilayer Perceptron neural networks reported correlation coefficients of about .99. These data mining models show great promise for use in further examining large FOQA databases for operational and safety improvements.

  17. Adjusting for overdispersion in piecewise exponential regression models to estimate excess mortality rate in population-based research.

    PubMed

    Luque-Fernandez, Miguel Angel; Belot, Aurélien; Quaresma, Manuela; Maringe, Camille; Coleman, Michel P; Rachet, Bernard

    2016-10-01

    In population-based cancer research, piecewise exponential regression models are used to derive adjusted estimates of excess mortality due to cancer using the Poisson generalized linear modelling framework. However, the assumption that the conditional mean and variance of the rate parameter given the set of covariates x_i are equal is strong and may fail to account for overdispersion given the variability of the rate parameter (the variance exceeds the mean). Using an empirical example, we aimed to describe simple methods to test and correct for overdispersion. We used a regression-based score test for overdispersion under the relative survival framework and proposed different approaches to correct for overdispersion including a quasi-likelihood, robust standard errors estimation, negative binomial regression and flexible piecewise modelling. All piecewise exponential regression models showed the presence of significant inherent overdispersion (p-value <0.001). However, the flexible piecewise exponential model showed the smallest overdispersion parameter (3.2 versus 21.3 for non-flexible piecewise exponential models). We showed that there were no major differences between methods. However, using a flexible piecewise regression modelling, with either a quasi-likelihood or robust standard errors, was the best approach as it deals with both overdispersion due to model misspecification and true or inherent overdispersion.
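
A common quick diagnostic for the overdispersion discussed here is the Pearson dispersion statistic, which should be near 1 when the Poisson mean-variance assumption holds. The sketch below uses that statistic rather than the regression-based score test of the abstract, and shows the quasi-likelihood correction of scaling a Poisson standard error by the square root of the dispersion. The counts, fitted means, and standard error are invented for the example.

```python
import math

def pearson_dispersion(observed, fitted, n_params):
    """Pearson chi-square divided by residual degrees of freedom."""
    chi2 = sum((y - mu) ** 2 / mu for y, mu in zip(observed, fitted))
    return chi2 / (len(observed) - n_params)

def quasi_poisson_se(poisson_se, dispersion):
    """Inflate a Poisson standard error by the square root of the dispersion."""
    return poisson_se * math.sqrt(dispersion)

# Hypothetical death counts and fitted means from a 2-parameter Poisson model.
observed = [2, 10, 1, 14, 3, 20]
fitted = [5, 6, 4, 9, 6, 12]
phi = pearson_dispersion(observed, fitted, n_params=2)
adjusted_se = quasi_poisson_se(0.10, phi)
```

A value of `phi` well above 1, as in this toy example, signals that unadjusted Poisson standard errors would be too small.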

  18. Conjoint Analysis: A Study of the Effects of Using Person Variables.

    ERIC Educational Resources Information Center

    Fraas, John W.; Newman, Isadore

    Three statistical techniques--conjoint analysis, a multiple linear regression model, and a multiple linear regression model with a surrogate person variable--were used to estimate the relative importance of five university attributes for students in the process of selecting a college. The five attributes include: availability and variety of…

  19. Suppressor Variables: The Difference between "Is" versus "Acting As"

    ERIC Educational Resources Information Center

    Ludlow, Larry; Klein, Kelsey

    2014-01-01

    Correlated predictors in regression models are a fact of life in applied social science research. The extent to which they are correlated will influence the estimates and statistics associated with the other variables they are modeled along with. These effects, for example, may include enhanced regression coefficients for the other variables--a…

  20. A Comparison of Conventional Linear Regression Methods and Neural Networks for Forecasting Educational Spending.

    ERIC Educational Resources Information Center

    Baker, Bruce D.; Richards, Craig E.

    1999-01-01

    Applies neural network methods for forecasting 1991-95 per-pupil expenditures in U.S. public elementary and secondary schools. Forecasting models included the National Center for Education Statistics' multivariate regression model and three neural architectures. Regarding prediction accuracy, neural network results were comparable or superior to…

  1. Nonlinear-regression groundwater flow modeling of a deep regional aquifer system

    USGS Publications Warehouse

    Cooley, Richard L.; Konikow, Leonard F.; Naff, Richard L.

    1986-01-01

    A nonlinear regression groundwater flow model, based on a Galerkin finite-element discretization, was used to analyze steady state two-dimensional groundwater flow in the areally extensive Madison aquifer in a 75,000 mi² area of the Northern Great Plains. Regression parameters estimated include intrinsic permeabilities of the main aquifer and separate lineament zones, discharges from eight major springs surrounding the Black Hills, and specified heads on the model boundaries. Aquifer thickness and temperature variations were included as specified functions. The regression model was applied using sequential F testing so that the fewest number and simplest zonation of intrinsic permeabilities, combined with the simplest overall model, were evaluated initially; additional complexities (such as subdivisions of zones and variations in temperature and thickness) were added in stages to evaluate the subsequent degree of improvement in the model results. It was found that only the eight major springs, a single main aquifer intrinsic permeability, two separate lineament intrinsic permeabilities of much smaller values, and temperature variations are warranted by the observed data (hydraulic heads and prior information on some parameters) for inclusion in a model that attempts to explain significant controls on groundwater flow. Addition of thickness variations did not significantly improve model results; however, thickness variations were included in the final model because they are fairly well defined. Effects on the observed head distribution from other features, such as vertical leakage and regional variations in intrinsic permeability, apparently were overshadowed by measurement errors in the observed heads. Estimates of the parameters correspond well to estimates obtained from other independent sources.
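
The sequential F testing described in this abstract compares a simpler model with a more complex one that nests it, asking whether the drop in the error sum of squares justifies the added parameters. The sketch below computes the usual partial F statistic; for a nonlinear regression the F distribution holds only approximately, and the sums of squares and degrees of freedom shown are hypothetical.

```python
def partial_f(sse_reduced, df_reduced, sse_full, df_full):
    """F statistic for adding parameters to a nested regression model.

    sse_*: error sum of squares; df_*: residual degrees of freedom.
    """
    extra_params = df_reduced - df_full
    return ((sse_reduced - sse_full) / extra_params) / (sse_full / df_full)

# Hypothetical stage: adding two permeability zones to a simpler model.
f_stat = partial_f(sse_reduced=120.0, df_reduced=47, sse_full=90.0, df_full=45)
```

The statistic would be compared with a critical value of the F distribution with (2, 45) degrees of freedom in this example; the added complexity is kept only if the statistic exceeds it.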

  2. Nonlinear-Regression Groundwater Flow Modeling of a Deep Regional Aquifer System

    NASA Astrophysics Data System (ADS)

    Cooley, Richard L.; Konikow, Leonard F.; Naff, Richard L.

    1986-12-01

    A nonlinear regression groundwater flow model, based on a Galerkin finite-element discretization, was used to analyze steady state two-dimensional groundwater flow in the areally extensive Madison aquifer in a 75,000 mi² area of the Northern Great Plains. Regression parameters estimated include intrinsic permeabilities of the main aquifer and separate lineament zones, discharges from eight major springs surrounding the Black Hills, and specified heads on the model boundaries. Aquifer thickness and temperature variations were included as specified functions. The regression model was applied using sequential F testing so that the fewest number and simplest zonation of intrinsic permeabilities, combined with the simplest overall model, were evaluated initially; additional complexities (such as subdivisions of zones and variations in temperature and thickness) were added in stages to evaluate the subsequent degree of improvement in the model results. It was found that only the eight major springs, a single main aquifer intrinsic permeability, two separate lineament intrinsic permeabilities of much smaller values, and temperature variations are warranted by the observed data (hydraulic heads and prior information on some parameters) for inclusion in a model that attempts to explain significant controls on groundwater flow. Addition of thickness variations did not significantly improve model results; however, thickness variations were included in the final model because they are fairly well defined. Effects on the observed head distribution from other features, such as vertical leakage and regional variations in intrinsic permeability, apparently were overshadowed by measurement errors in the observed heads. Estimates of the parameters correspond well to estimates obtained from other independent sources.

  3. Genetic Programming Transforms in Linear Regression Situations

    NASA Astrophysics Data System (ADS)

    Castillo, Flor; Kordon, Arthur; Villa, Carlos

    The chapter summarizes the use of Genetic Programming (GP) in Multiple Linear Regression (MLR) to address multicollinearity and Lack of Fit (LOF). The basis of the proposed method is applying appropriate input transforms (model respecification) that deal with these issues while preserving the information content of the original variables. The transforms are selected from symbolic regression models with optimal trade-off between accuracy of prediction and expressional complexity, generated by multiobjective Pareto-front GP. The chapter includes a comparative study of the GP-generated transforms with Ridge Regression, a variant of ordinary Multiple Linear Regression, which has been a useful and commonly employed approach for reducing multicollinearity. The advantages of GP-generated model respecification are clearly defined and demonstrated. Some recommendations for transforms selection are given as well. The application benefits of the proposed approach are illustrated with a real industrial application in one of the broadest empirical modeling areas in manufacturing - robust inferential sensors. The chapter contributes to increasing the awareness of the potential of GP in statistical model building by MLR.

  4. Multiple linear regression and regression with time series error models in forecasting PM10 concentrations in Peninsular Malaysia.

    PubMed

    Ng, Kar Yong; Awang, Norhashidah

    2018-01-06

    Frequent haze occurrences in Malaysia have made the management of PM10 (particulate matter with aerodynamic diameter less than 10 μm) pollution a critical task. This requires knowledge of the factors associated with PM10 variation and good forecasts of PM10 concentrations. Hence, this paper demonstrates the prediction of 1-day-ahead daily average PM10 concentrations based on predictor variables including meteorological parameters and gaseous pollutants. Three different models were built: a multiple linear regression (MLR) model with lagged predictor variables (MLR1), an MLR model with lagged predictor variables and PM10 concentrations (MLR2), and a regression with time series error (RTSE) model. The findings revealed that humidity, temperature, wind speed, wind direction, carbon monoxide and ozone were the main factors explaining the PM10 variation in Peninsular Malaysia. Comparison among the three models showed that the MLR2 model was on the same level as the RTSE model in terms of forecasting accuracy, while the MLR1 model was the worst.

  5. Development of LACIE CCEA-1 weather/wheat yield models. [regression analysis]

    NASA Technical Reports Server (NTRS)

    Strommen, N. D.; Sakamoto, C. M.; Leduc, S. K.; Umberger, D. E. (Principal Investigator)

    1979-01-01

    The advantages and disadvantages of the causal (phenological, dynamic, physiological), statistical regression, and analog approaches to modeling for grain yield are examined. Given LACIE's primary goal of estimating wheat production for the large areas of eight major wheat-growing regions, the statistical regression approach of correlating historical yield and climate data offered the Center for Climatic and Environmental Assessment the greatest potential return within the constraints of time and data sources. The basic equation for the first generation wheat-yield model is given. Topics discussed include truncation, trend variable, selection of weather variables, episodic events, strata selection, operational data flow, weighting, and model results.

  6. Examining the Association between Patient-Reported Symptoms of Attention and Memory Dysfunction with Objective Cognitive Performance: A Latent Regression Rasch Model Approach.

    PubMed

    Li, Yuelin; Root, James C; Atkinson, Thomas M; Ahles, Tim A

    2016-06-01

    Patient-reported cognition generally exhibits poor concordance with objectively assessed cognitive performance. In this article, we introduce latent regression Rasch modeling and provide a step-by-step tutorial for applying Rasch methods as an alternative to traditional correlation to better clarify the relationship of self-report and objective cognitive performance. An example analysis using these methods is also included. Introduction to latent regression Rasch modeling is provided together with a tutorial on implementing it using the JAGS programming language for the Bayesian posterior parameter estimates. In an example analysis, data from a longitudinal neurocognitive outcomes study of 132 breast cancer patients and 45 non-cancer matched controls that included self-report and objective performance measures pre- and post-treatment were analyzed using both conventional and latent regression Rasch model approaches. Consistent with previous research, conventional analysis and correlations between neurocognitive decline and self-reported problems were generally near zero. In contrast, application of latent regression Rasch modeling found statistically reliable associations between objective attention and processing speed measures with self-reported Attention and Memory scores. Latent regression Rasch modeling, together with correlation of specific self-reported cognitive domains with neurocognitive measures, helps to clarify the relationship of self-report with objective performance. While the majority of patients attribute their cognitive difficulties to memory decline, the Rasch modeling suggests the importance of processing speed and initial learning. To encourage the use of this method, a step-by-step guide and programming language for implementation is provided. Implications of this method in cognitive outcomes research are discussed. © The Author 2016. Published by Oxford University Press. All rights reserved.

  7. Monthly monsoon rainfall forecasting using artificial neural networks

    NASA Astrophysics Data System (ADS)

    Ganti, Ravikumar

    2014-10-01

    India's agriculture sector depends heavily on monsoon rainfall for successful harvesting. In the past, prediction of rainfall was mainly performed using regression models, which provide reasonable accuracy in the modelling and forecasting of complex physical systems. Recently, Artificial Neural Networks (ANNs) have been proposed as efficient tools for modelling and forecasting. A feed-forward multi-layer perceptron type of ANN architecture trained using the popular back-propagation algorithm was employed in this study. Other techniques investigated for modeling monthly monsoon rainfall include linear and non-linear regression models for comparison purposes. The data employed in this study include monthly rainfall and monthly average of the daily maximum temperature in the North Central region in India. Specifically, four regression models and two ANN models were developed. The performance of the various models was evaluated using a wide variety of standard statistical parameters and scatter plots. The results obtained in this study for forecasting monsoon rainfall using ANNs have been encouraging. India's economy and agricultural activities can be managed effectively when accurate monsoon rainfall forecasts are available.

  8. Regression analysis and transfer function in estimating the parameters of central pulse waves from brachial pulse wave.

    PubMed

    Chai Rui; Li Si-Man; Xu Li-Sheng; Yao Yang; Hao Li-Ling

    2017-07-01

    This study mainly analyzed parameters such as ascending branch slope (A_slope), dicrotic notch height (Hn), diastolic area (Ad), systolic area (As), diastolic blood pressure (DBP), systolic blood pressure (SBP), pulse pressure (PP), subendocardial viability ratio (SEVR), waveform parameter (k), stroke volume (SV), cardiac output (CO) and peripheral resistance (RS) of the central pulse wave, measured invasively and non-invasively. The parameters extracted from the invasively measured central pulse wave were compared with the parameters measured from the brachial pulse waves by a regression model and a transfer function model. The accuracy of the parameters estimated by the regression model and by the transfer function model was also compared. Our findings showed that in addition to the k value, the above parameters of the central pulse wave and the brachial pulse wave invasively measured had positive correlation. Both the regression model parameters including A_slope, DBP, SEVR and the transfer function model parameters had good consistency with the parameters invasively measured, and they had the same effect of consistency. The regression equations of the three parameters were expressed by Y' = a + bx. The SBP, PP, SV, CO of the central pulse wave could be calculated through the regression model, but their accuracies were worse than those of the transfer function model.
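
    A minimal sketch of fitting a regression equation of the form Y' = a + bx by ordinary least squares, as the abstract describes. The function name and the sample numbers are hypothetical and are not data from the study.

```python
def fit_line(x, y):
    """Least-squares estimates (a, b) for the line y' = a + b*x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has no variation, slope is undefined")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept
    return a, b


# Hypothetical brachial values (x) and central values (y), made up for illustration.
brachial = [60.0, 70.0, 80.0, 90.0]
central = [59.0, 68.0, 77.0, 86.0]
a, b = fit_line(brachial, central)
estimate = a + b * 75.0  # estimated central value for a new brachial reading
```

    The closed-form slope and intercept are used here because only one predictor is involved; nothing in the sketch reflects the transfer function model.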

  9. An Effect Size for Regression Predictors in Meta-Analysis

    ERIC Educational Resources Information Center

    Aloe, Ariel M.; Becker, Betsy Jane

    2012-01-01

    A new effect size representing the predictive power of an independent variable from a multiple regression model is presented. The index, denoted as r_sp, is the semipartial correlation of the predictor with the outcome of interest. This effect size can be computed when multiple predictor variables are included in the regression model…
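
    As a hedged illustration of the semipartial correlation named in this abstract, the sketch below computes it for a model with two predictors from pairwise Pearson correlations. It relies on the standard identity that the squared semipartial correlation equals the gain in R² when the predictor is added to the model. Function names and data are hypothetical, and the article's own estimator for meta-analysis may be computed differently.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length samples."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    saa = sum((u - mean_a) ** 2 for u in a)
    sbb = sum((v - mean_b) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def semipartial(y, x1, x2):
    """Correlation of y with the part of x1 that is independent of x2."""
    r_y1, r_y2, r_12 = pearson(y, x1), pearson(y, x2), pearson(x1, x2)
    return (r_y1 - r_y2 * r_12) / math.sqrt(1.0 - r_12 ** 2)


def r_squared_two(y, x1, x2):
    """R^2 of the regression of y on x1 and x2."""
    r_y1, r_y2, r_12 = pearson(y, x1), pearson(y, x2), pearson(x1, x2)
    return (r_y1 ** 2 + r_y2 ** 2 - 2.0 * r_y1 * r_y2 * r_12) / (1.0 - r_12 ** 2)


# Made-up data for illustration only.
y = [3.0, 5.0, 4.0, 8.0, 9.0, 11.0]
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]

r_sp = semipartial(y, x1, x2)
gain = r_squared_two(y, x1, x2) - pearson(y, x2) ** 2  # equals r_sp ** 2
```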

  10. Time series modeling by a regression approach based on a latent process.

    PubMed

    Chamroukhi, Faicel; Samé, Allou; Govaert, Gérard; Aknin, Patrice

    2009-01-01

    Time series are used in many domains including finance, engineering, economics and bioinformatics generally to represent the change of a measurement over time. Modeling techniques may then be used to give a synthetic representation of such data. A new approach for time series modeling is proposed in this paper. It consists of a regression model incorporating a discrete hidden logistic process allowing for activating smoothly or abruptly different polynomial regression models. The model parameters are estimated by the maximum likelihood method performed by a dedicated Expectation Maximization (EM) algorithm. The M step of the EM algorithm uses a multi-class Iterative Reweighted Least-Squares (IRLS) algorithm to estimate the hidden process parameters. To evaluate the proposed approach, an experimental study on simulated data and real world data was performed using two alternative approaches: a heteroskedastic piecewise regression model using a global optimization algorithm based on dynamic programming, and a Hidden Markov Regression Model whose parameters are estimated by the Baum-Welch algorithm. Finally, in the context of the remote monitoring of components of the French railway infrastructure, and more particularly the switch mechanism, the proposed approach has been applied to modeling and classifying time series representing the condition measurements acquired during switch operations.

  11. On the Latent Regression Model of Item Response Theory. Research Report. ETS RR-07-12

    ERIC Educational Resources Information Center

    Antal, Tamás

    2007-01-01

    A full account of the latent regression model for the National Assessment of Educational Progress is given. The treatment includes derivation of the EM algorithm, Newton-Raphson method, and the asymptotic standard errors. The paper also features the use of the adaptive Gauss-Hermite numerical integration method as a basic tool to evaluate…

  12. Measurement error and outcome distributions: Methodological issues in regression analyses of behavioral coding data.

    PubMed

    Holsclaw, Tracy; Hallgren, Kevin A; Steyvers, Mark; Smyth, Padhraic; Atkins, David C

    2015-12-01

    Behavioral coding is increasingly used for studying mechanisms of change in psychosocial treatments for substance use disorders (SUDs). However, behavioral coding data typically include features that can be problematic in regression analyses, including measurement error in independent variables, non-normal distributions of count outcome variables, and conflation of predictor and outcome variables with third variables, such as session length. Methodological research in econometrics has shown that these issues can lead to biased parameter estimates, inaccurate standard errors, and increased Type I and Type II error rates, yet these statistical issues are not widely known within SUD treatment research, or more generally, within psychotherapy coding research. Using minimally technical language intended for a broad audience of SUD treatment researchers, the present paper illustrates the nature in which these data issues are problematic. We draw on real-world data and simulation-based examples to illustrate how these data features can bias estimation of parameters and interpretation of models. A weighted negative binomial regression is introduced as an alternative to ordinary linear regression that appropriately addresses the data characteristics common to SUD treatment behavioral coding data. We conclude by demonstrating how to use and interpret these models with data from a study of motivational interviewing. SPSS and R syntax for weighted negative binomial regression models is included in online supplemental materials. (c) 2016 APA, all rights reserved.

  13. Measurement error and outcome distributions: Methodological issues in regression analyses of behavioral coding data

    PubMed Central

    Holsclaw, Tracy; Hallgren, Kevin A.; Steyvers, Mark; Smyth, Padhraic; Atkins, David C.

    2015-01-01

    Behavioral coding is increasingly used for studying mechanisms of change in psychosocial treatments for substance use disorders (SUDs). However, behavioral coding data typically include features that can be problematic in regression analyses, including measurement error in independent variables, non-normal distributions of count outcome variables, and conflation of predictor and outcome variables with third variables, such as session length. Methodological research in econometrics has shown that these issues can lead to biased parameter estimates, inaccurate standard errors, and increased type-I and type-II error rates, yet these statistical issues are not widely known within SUD treatment research, or more generally, within psychotherapy coding research. Using minimally-technical language intended for a broad audience of SUD treatment researchers, the present paper illustrates the nature in which these data issues are problematic. We draw on real-world data and simulation-based examples to illustrate how these data features can bias estimation of parameters and interpretation of models. A weighted negative binomial regression is introduced as an alternative to ordinary linear regression that appropriately addresses the data characteristics common to SUD treatment behavioral coding data. We conclude by demonstrating how to use and interpret these models with data from a study of motivational interviewing. SPSS and R syntax for weighted negative binomial regression models is included in supplementary materials. PMID:26098126

  14. Individualized Prediction of Heat Stress in Firefighters: A Data-Driven Approach Using Classification and Regression Trees.

    PubMed

    Mani, Ashutosh; Rao, Marepalli; James, Kelley; Bhattacharya, Amit

    2015-01-01

    The purpose of this study was to explore data-driven models, based on decision trees, to develop practical and easy-to-use predictive models for early identification of firefighters who are likely to cross the threshold of hyperthermia during live-fire training. Predictive models were created for three consecutive live-fire training scenarios. The final predicted outcome was a categorical variable: will a firefighter cross the upper threshold of hyperthermia - Yes/No. Two tiers of models were built, one with and one without taking into account the outcome (whether a firefighter crossed hyperthermia or not) from the previous training scenario. The first tier of models included age, baseline heart rate and core body temperature, body mass index, and duration of training scenario as predictors. The second tier of models included the outcome of the previous scenario in the prediction space, in addition to all the predictors from the first tier of models. Classification and regression trees were used independently for prediction. The response variable for the regression tree was the quantitative variable: core body temperature at the end of each scenario. The predicted quantitative variable from regression trees was compared to the upper threshold of hyperthermia (38°C) to predict whether a firefighter would enter hyperthermia. The performance of classification and regression tree models was satisfactory for the second (success rate = 79%) and third (success rate = 89%) training scenarios but not for the first (success rate = 43%). Data-driven models based on decision trees can be a useful tool for predicting physiological response without modeling the underlying physiological systems. Early prediction of heat stress coupled with proactive interventions, such as pre-cooling, can help reduce heat stress in firefighters.
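
    The step in which a regression-tree prediction of core body temperature is compared with the 38°C threshold can be sketched with a one-split regression tree (a stump) on a single predictor. This is a simplified stand-in for the full trees used in the study. The function names, the choice of scenario duration as the only predictor, the strict "greater than" comparison, and all numbers are assumptions made for illustration.

```python
def fit_stump(x, y):
    """Fit a one-split regression tree; returns (threshold, left_mean, right_mean)."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between identical predictor values
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        # Sum of squared errors when each side is predicted by its own mean.
        sse = (sum((v - left_mean) ** 2 for v in left)
               + sum((v - right_mean) ** 2 for v in right))
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("predictor has no variation, no split is possible")
    return best[1], best[2], best[3]


def predict_hyperthermia(stump, x_new, limit=38.0):
    """Return (predicted temperature, True if it exceeds the limit)."""
    threshold, left_mean, right_mean = stump
    temperature = left_mean if x_new <= threshold else right_mean
    return temperature, temperature > limit


# Hypothetical scenario durations (minutes) and end-of-scenario core temperatures (°C).
durations = [10.0, 12.0, 14.0, 20.0, 22.0, 25.0]
temperatures = [37.4, 37.5, 37.6, 38.2, 38.3, 38.4]
stump = fit_stump(durations, temperatures)
temperature, crosses = predict_hyperthermia(stump, 24.0)
```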

  15. Parsimonious estimation of the Wechsler Memory Scale, Fourth Edition demographically adjusted index scores: immediate and delayed memory.

    PubMed

    Miller, Justin B; Axelrod, Bradley N; Schutte, Christian

    2012-01-01

    The recent release of the Wechsler Memory Scale Fourth Edition contains many improvements from a theoretical and administration perspective, including demographic corrections using the Advanced Clinical Solutions. Although the administration time has been reduced from previous versions, a shortened version may be desirable in certain situations given practical time limitations in clinical practice. The current study evaluated two- and three-subtest estimations of demographically corrected Immediate and Delayed Memory index scores using both simple arithmetic prorating and regression models. All estimated values were significantly associated with observed index scores. Use of Lin's Concordance Correlation Coefficient as a measure of agreement showed a high degree of precision and virtually zero bias in the models, although the regression models showed a stronger association than prorated models. Regression-based models proved to be more accurate than prorated estimates with less dispersion around observed values, particularly when using three subtest regression models. Overall, the present research shows strong support for estimating demographically corrected index scores on the WMS-IV in clinical practice with an adequate performance using arithmetically prorated models and a stronger performance using regression models to predict index scores.

  16. Mixed-effects Gaussian process functional regression models with application to dose-response curve prediction.

    PubMed

    Shi, J Q; Wang, B; Will, E J; West, R M

    2012-11-20

    We propose a new semiparametric model for functional regression analysis, combining a parametric mixed-effects model with a nonparametric Gaussian process regression model, namely a mixed-effects Gaussian process functional regression model. The parametric component can provide explanatory information between the response and the covariates, whereas the nonparametric component can add nonlinearity. We can model the mean and covariance structures simultaneously, combining the information borrowed from other subjects with the information collected from each individual subject. We apply the model to dose-response curves that describe changes in the responses of subjects for differing levels of the dose of a drug or agent and have a wide application in many areas. We illustrate the method for the management of renal anaemia. An individual dose-response curve is improved when more information is included by this mechanism from the subject/patient over time, enabling a patient-specific treatment regime. Copyright © 2012 John Wiley & Sons, Ltd.

  17. Accounting for spatial effects in land use regression for urban air pollution modeling.

    PubMed

    Bertazzon, Stefania; Johnson, Markey; Eccles, Kristin; Kaplan, Gilaad G

    2015-01-01

    In order to accurately assess air pollution risks, health studies require spatially resolved pollution concentrations. Land-use regression (LUR) models estimate ambient concentrations at a fine spatial scale. However, spatial effects such as spatial non-stationarity and spatial autocorrelation can reduce the accuracy of LUR estimates by increasing regression errors and uncertainty; and statistical methods for resolving these effects--e.g., spatially autoregressive (SAR) and geographically weighted regression (GWR) models--may be difficult to apply simultaneously. We used an alternate approach to address spatial non-stationarity and spatial autocorrelation in LUR models for nitrogen dioxide. Traditional models were re-specified to include a variable capturing wind speed and direction, and re-fit as GWR models. Mean R² values for the resulting GWR-wind models (summer: 0.86, winter: 0.73) showed a 10-20% improvement over traditional LUR models. GWR-wind models effectively addressed both spatial effects and produced meaningful predictive models. These results suggest a useful method for improving spatially explicit models. Copyright © 2015 The Authors. Published by Elsevier Ltd. All rights reserved.

  18. Advanced statistics: linear regression, part II: multiple linear regression.

    PubMed

    Marill, Keith A

    2004-01-01

    The applications of simple linear regression in medical research are limited, because in most situations, there are multiple relevant predictor variables. Univariate statistical techniques such as simple linear regression use a single predictor variable, and they often may be mathematically correct but clinically misleading. Multiple linear regression is a mathematical technique used to model the relationship between multiple independent predictor variables and a single dependent outcome variable. It is used in medical research to model observational data, as well as in diagnostic and therapeutic studies in which the outcome is dependent on more than one factor. Although the technique generally is limited to data that can be expressed with a linear function, it benefits from a well-developed mathematical framework that yields unique solutions and exact confidence intervals for regression coefficients. Building on Part I of this series, this article acquaints the reader with some of the important concepts in multiple regression analysis. These include multicollinearity, interaction effects, and an expansion of the discussion of inference testing, leverage, and variable transformations to multivariate models. Examples from the first article in this series are expanded on using a primarily graphic, rather than mathematical, approach. The importance of the relationships among the predictor variables and the dependence of the multivariate model coefficients on the choice of these variables are stressed. Finally, concepts in regression model building are discussed.

  19. Comparing lagged linear correlation, lagged regression, Granger causality, and vector autoregression for uncovering associations in EHR data.

    PubMed

    Levine, Matthew E; Albers, David J; Hripcsak, George

    2016-01-01

    Time series analysis methods have been shown to reveal clinical and biological associations in data collected in the electronic health record. We wish to develop reliable high-throughput methods for identifying adverse drug effects that are easy to implement and produce readily interpretable results. To move toward this goal, we used univariate and multivariate lagged regression models to investigate associations between twenty pairs of drug orders and laboratory measurements. Multivariate lagged regression models exhibited higher sensitivity and specificity than univariate lagged regression in the 20 examples, and incorporating autoregressive terms for labs and drugs produced more robust signals in cases of known associations among the 20 example pairings. Moreover, including inpatient admission terms in the model attenuated the signals for some cases of unlikely associations, demonstrating how multivariate lagged regression models' explicit handling of context-based variables can provide a simple way to probe for health-care processes that confound analyses of EHR data.

  20. Regression-based model of skin diffuse reflectance for skin color analysis

    NASA Astrophysics Data System (ADS)

    Tsumura, Norimichi; Kawazoe, Daisuke; Nakaguchi, Toshiya; Ojima, Nobutoshi; Miyake, Yoichi

    2008-11-01

    A simple regression-based model of skin diffuse reflectance is developed based on reflectance samples calculated by Monte Carlo simulation of light transport in a two-layered skin model. This reflectance model includes the values of spectral reflectance in the visible spectra for Japanese women. The modified Lambert-Beer law holds in the proposed model with a modified mean free path length in non-linear density space. The averaged RMS and maximum errors of the proposed model were 1.1 and 3.1%, respectively, in the above range.

  1. Deletion Diagnostics for Alternating Logistic Regressions

    PubMed Central

    Preisser, John S.; By, Kunthel; Perin, Jamie; Qaqish, Bahjat F.

    2013-01-01

    Deletion diagnostics are introduced for the regression analysis of clustered binary outcomes estimated with alternating logistic regressions, an implementation of generalized estimating equations (GEE) that estimates regression coefficients in a marginal mean model and in a model for the intracluster association given by the log odds ratio. The diagnostics are developed within an estimating equations framework that recasts the estimating functions for association parameters based upon conditional residuals into equivalent functions based upon marginal residuals. Extensions of earlier work on GEE diagnostics follow directly, including computational formulae for one-step deletion diagnostics that measure the influence of a cluster of observations on the estimated regression parameters and on the overall marginal mean or association model fit. The diagnostic formulae are evaluated with simulation studies and with an application concerning an assessment of factors associated with health maintenance visits in primary care medical practices. The application and the simulations demonstrate that the proposed cluster-deletion diagnostics for alternating logistic regressions are good approximations of their exact fully iterated counterparts. PMID:22777960

  2. Developing a dengue forecast model using machine learning: A case study in China

    PubMed Central

    Zhang, Qin; Wang, Li; Xiao, Jianpeng; Zhang, Qingying; Luo, Ganfeng; Li, Zhihao; He, Jianfeng; Zhang, Yonghui; Ma, Wenjun

    2017-01-01

    Background: In China, dengue remains an important public health issue with expanded areas and increased incidence recently. Accurate and timely forecasts of dengue incidence in China are still lacking. We aimed to use the state-of-the-art machine learning algorithms to develop an accurate predictive model of dengue. Methodology/Principal findings: Weekly dengue cases, Baidu search queries and climate factors (mean temperature, relative humidity and rainfall) during 2011–2014 in Guangdong were gathered. A dengue search index was constructed for developing the predictive models in combination with climate factors. The observed year and week were also included in the models to control for the long-term trend and seasonality. Several machine learning algorithms, including the support vector regression (SVR) algorithm, step-down linear regression model, gradient boosted regression tree algorithm (GBM), negative binomial regression model (NBM), least absolute shrinkage and selection operator (LASSO) linear regression model and generalized additive model (GAM), were used as candidate models to predict dengue incidence. Performance and goodness of fit of the models were assessed using the root-mean-square error (RMSE) and R-squared measures. The residuals of the models were examined using the autocorrelation and partial autocorrelation function analyses to check the validity of the models. The models were further validated using dengue surveillance data from five other provinces. The epidemics during the last 12 weeks and the peak of the 2014 large outbreak were accurately forecasted by the SVR model selected by a cross-validation technique. Moreover, the SVR model had the consistently smallest prediction error rates for tracking the dynamics of dengue and forecasting the outbreaks in other areas in China. Conclusion and significance: The proposed SVR model achieved a superior performance in comparison with other forecasting techniques assessed in this study. The findings can help the government and community respond early to dengue epidemics. PMID:29036169
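
    The two performance measures named in this abstract, root-mean-square error (RMSE) and R-squared, can be sketched as follows. The function names and the weekly counts are hypothetical. R-squared is computed here as one minus the ratio of residual to total sum of squares, which is one common definition and may differ from the one used in the study.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between observations and predictions."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observations have no variation, R-squared is undefined")
    return 1.0 - ss_res / ss_tot


# Made-up weekly case counts and model forecasts, for illustration only.
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]
error = rmse(observed, predicted)
fit = r_squared(observed, predicted)
```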

  3. Estimation of Logistic Regression Models in Small Samples. A Simulation Study Using a Weakly Informative Default Prior Distribution

    ERIC Educational Resources Information Center

    Gordovil-Merino, Amalia; Guardia-Olmos, Joan; Pero-Cebollero, Maribel

    2012-01-01

    In this paper, we used simulations to compare the performance of classical and Bayesian estimations in logistic regression models using small samples. In the performed simulations, conditions were varied, including the type of relationship between independent and dependent variable values (i.e., unrelated and related values), the type of variable…

  4. An overall strategy based on regression models to estimate relative survival and model the effects of prognostic factors in cancer survival studies.

    PubMed

    Remontet, L; Bossard, N; Belot, A; Estève, J

    2007-05-10

    Relative survival provides a measure of the proportion of patients dying from the disease under study without requiring the knowledge of the cause of death. We propose an overall strategy based on regression models to estimate the relative survival and model the effects of potential prognostic factors. The baseline hazard was modelled until 10 years follow-up using parametric continuous functions. Six models including cubic regression splines were considered and the Akaike Information Criterion was used to select the final model. This approach yielded smooth and reliable estimates of mortality hazard and allowed us to deal with sparse data taking into account all the available information. Splines were also used to model simultaneously non-linear effects of continuous covariates and time-dependent hazard ratios. This led to a graphical representation of the hazard ratio that can be useful for clinical interpretation. Estimates of these models were obtained by likelihood maximization. We showed that these estimates could be also obtained using standard algorithms for Poisson regression. Copyright 2006 John Wiley & Sons, Ltd.

  5. Modeling health survey data with excessive zero and K responses.

    PubMed

    Lin, Ting Hsiang; Tsai, Min-Hsiao

    2013-04-30

    Zero-inflated Poisson regression is a popular tool used to analyze data with excessive zeros. Although much work has already been performed to fit zero-inflated data, most models heavily depend on special features of the individual data. To be specific, this means that there is a sizable group of respondents who endorse the same answers making the data have peaks. In this paper, we propose a new model with the flexibility to model excessive counts other than zero, and the model is a mixture of multinomial logistic and Poisson regression, in which the multinomial logistic component models the occurrence of excessive counts, including zeros, K (where K is a positive integer) and all other values. The Poisson regression component models the counts that are assumed to follow a Poisson distribution. Two examples are provided to illustrate our models when the data have counts containing many ones and sixes. As a result, the zero-inflated and K-inflated models exhibit a better fit than the zero-inflated Poisson and standard Poisson regressions. Copyright © 2012 John Wiley & Sons, Ltd.
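
    One way to write the mixture described above, sketched under stated assumptions: extra probability mass pi0 at zero and piK at the count K, with the remaining mass following a Poisson distribution. The parameter names (lam, pi0, pik) are hypothetical, and the covariate-dependent multinomial logistic and Poisson regression parts of the paper's model are replaced here by fixed numbers.

```python
import math


def poisson_pmf(y, lam):
    """Poisson probability of count y with mean lam."""
    return math.exp(-lam) * lam ** y / math.factorial(y)


def zk_inflated_pmf(y, lam, pi0, pik, k):
    """Probability of count y under a zero-and-K-inflated Poisson mixture."""
    if pi0 < 0 or pik < 0 or pi0 + pik > 1:
        raise ValueError("pi0 and pik must be non-negative and sum to at most 1")
    prob = (1.0 - pi0 - pik) * poisson_pmf(y, lam)  # Poisson component
    if y == 0:
        prob += pi0  # extra mass at zero
    if y == k:
        prob += pik  # extra mass at K
    return prob


# Hypothetical parameters: Poisson mean 2, 20% extra zeros, 10% extra sixes.
probabilities = [zk_inflated_pmf(y, 2.0, 0.2, 0.1, 6) for y in range(61)]
total = sum(probabilities)  # close to 1, up to a negligible truncated tail
```

    Setting pik to zero reduces the sketch to the ordinary zero-inflated Poisson distribution.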

  6. Forecasting daily meteorological time series using ARIMA and regression models

    NASA Astrophysics Data System (ADS)

    Murat, Małgorzata; Malinowska, Iwona; Gos, Magdalena; Krzyszczak, Jaromir

    2018-04-01

    The daily air temperature and precipitation time series recorded between January 1, 1980 and December 31, 2010 in four European sites (Jokioinen, Dikopshof, Lleida and Lublin) from different climatic zones were modeled and forecasted. In our forecasting we used the methods of the Box-Jenkins and Holt-Winters seasonal auto regressive integrated moving-average, the autoregressive integrated moving-average with external regressors in the form of Fourier terms and the time series regression, including trend and seasonality components methodology with R software. It was demonstrated that the obtained models are able to capture the dynamics of the time series data and to produce sensible forecasts.

  7. Application of nonlinear least-squares regression to ground-water flow modeling, west-central Florida

    USGS Publications Warehouse

    Yobbi, D.K.

    2000-01-01

    A nonlinear least-squares regression technique for estimation of ground-water flow model parameters was applied to an existing model of the regional aquifer system underlying west-central Florida. The regression technique minimizes the differences between measured and simulated water levels. Regression statistics, including parameter sensitivities and correlations, were calculated for reported parameter values in the existing model. Optimal parameter values for selected hydrologic variables of interest are estimated by nonlinear regression. Optimal estimates of parameter values are about 140 times greater than and about 0.01 times less than reported values. Independently estimating all parameters by nonlinear regression was impossible, given the existing zonation structure and number of observations, because of parameter insensitivity and correlation. Although the model yields parameter values similar to those estimated by other methods and reproduces the measured water levels reasonably accurately, a simpler parameter structure should be considered. Some possible ways of improving model calibration are to: (1) modify the defined parameter-zonation structure by omitting and/or combining parameters to be estimated; (2) carefully eliminate observation data based on evidence that they are likely to be biased; (3) collect additional water-level data; (4) assign values to insensitive parameters, and (5) estimate the most sensitive parameters first, then, using the optimized values for these parameters, estimate the entire data set.

  8. BFLCRM: A BAYESIAN FUNCTIONAL LINEAR COX REGRESSION MODEL FOR PREDICTING TIME TO CONVERSION TO ALZHEIMER’S DISEASE*

    PubMed Central

    Lee, Eunjee; Zhu, Hongtu; Kong, Dehan; Wang, Yalin; Giovanello, Kelly Sullivan; Ibrahim, Joseph G

    2015-01-01

    The aim of this paper is to develop a Bayesian functional linear Cox regression model (BFLCRM) with both functional and scalar covariates. This new development is motivated by establishing the likelihood of conversion to Alzheimer’s disease (AD) in 346 patients with mild cognitive impairment (MCI) enrolled in the Alzheimer’s Disease Neuroimaging Initiative 1 (ADNI-1) and the early markers of conversion. These 346 MCI patients were followed over 48 months, with 161 MCI participants progressing to AD at 48 months. The functional linear Cox regression model was used to establish that functional covariates including hippocampus surface morphology and scalar covariates including brain MRI volumes, cognitive performance (ADAS-Cog), and APOE status can accurately predict time to onset of AD. Posterior computation proceeds via an efficient Markov chain Monte Carlo algorithm. A simulation study is performed to evaluate the finite sample performance of BFLCRM. PMID:26900412

  9. Blood oxygen level dependent magnetic resonance imaging for detecting pathological patterns in lupus nephritis patients: a preliminary study using a decision tree model.

    PubMed

    Shi, Huilan; Jia, Junya; Li, Dong; Wei, Li; Shang, Wenya; Zheng, Zhenfeng

    2018-02-09

    Precise renal histopathological diagnosis will guide therapy strategy in patients with lupus nephritis. Blood oxygen level dependent (BOLD) magnetic resonance imaging (MRI) has been an applicable noninvasive technique in renal disease. This study was performed to explore whether BOLD MRI could contribute to diagnosing renal pathological patterns. Adult patients with a renal pathological diagnosis of lupus nephritis were recruited for this study. Renal biopsy tissues were assessed based on the lupus nephritis ISN/RPS 2003 classification. BOLD MRI was used to obtain values of the functional magnetic resonance parameter R2*. Several functions of R2* values were calculated and used to construct algorithmic models for renal pathological patterns. In addition, the algorithmic models were compared as to their diagnostic capability. Both histopathology and BOLD MRI were used to examine a total of twelve patients. Renal pathological patterns included five class III (including 3 as class III + V) and seven class IV (including 4 as class IV + V). Three algorithmic models, including decision tree, line discriminant, and logistic regression, were constructed to distinguish the renal pathological pattern of class III and class IV. The sensitivity of the decision tree model was better than that of the line discriminant model (71.87% vs 59.48%, P < 0.001) and inferior to that of the logistic regression model (71.87% vs 78.71%, P < 0.001). The specificity of the decision tree model was equivalent to that of the line discriminant model (63.87% vs 63.73%, P = 0.939) and higher than that of the logistic regression model (63.87% vs 38.0%, P < 0.001). The area under the ROC curve (AUROCC) of the decision tree model was greater than that of the line discriminant model (0.765 vs 0.629, P < 0.001) and logistic regression model (0.765 vs 0.662, P < 0.001).
    BOLD MRI is a useful non-invasive imaging technique for the evaluation of lupus nephritis. Decision tree models constructed using functions of R2* values may facilitate the prediction of renal pathological patterns.

  10. An operational GLS model for hydrologic regression

    USGS Publications Warehouse

    Tasker, Gary D.; Stedinger, J.R.

    1989-01-01

    Recent Monte Carlo studies have documented the value of generalized least squares (GLS) procedures to estimate empirical relationships between streamflow statistics and physiographic basin characteristics. This paper presents a number of extensions of the GLS method that deal with realities and complexities of regional hydrologic data sets that were not addressed in the simulation studies. These extensions include: (1) a more realistic model of the underlying model errors; (2) smoothed estimates of cross correlation of flows; (3) procedures for including historical flow data; (4) diagnostic statistics describing leverage and influence for GLS regression; and (5) the formulation of a mathematical program for evaluating future gaging activities. © 1989.

  11. Modeling the frequency of opposing left-turn conflicts at signalized intersections using generalized linear regression models.

    PubMed

    Zhang, Xin; Liu, Pan; Chen, Yuguang; Bai, Lu; Wang, Wei

    2014-01-01

    The primary objective of this study was to identify whether the frequency of traffic conflicts at signalized intersections can be modeled. The opposing left-turn conflicts were selected for the development of conflict predictive models. Using data collected at 30 approaches at 20 signalized intersections, the underlying distributions of the conflicts under different traffic conditions were examined. Different conflict-predictive models were developed to relate the frequency of opposing left-turn conflicts to various explanatory variables. The models considered include a linear regression model, a negative binomial model, and separate models developed for four traffic scenarios. The prediction performance of different models was compared. The frequency of traffic conflicts follows a negative binomial distribution. The linear regression model is not appropriate for the conflict frequency data. In addition, drivers behaved differently under different traffic conditions. Accordingly, the effects of conflicting traffic volumes on conflict frequency vary across different traffic conditions. The occurrences of traffic conflicts at signalized intersections can be modeled using generalized linear regression models. The use of conflict predictive models has potential to expand the uses of surrogate safety measures in safety estimation and evaluation.

  12. An adaptive two-stage analog/regression model for probabilistic prediction of small-scale precipitation in France

    NASA Astrophysics Data System (ADS)

    Chardon, Jérémy; Hingray, Benoit; Favre, Anne-Catherine

    2018-01-01

    Statistical downscaling models (SDMs) are often used to produce local weather scenarios from large-scale atmospheric information. SDMs include transfer functions which are based on a statistical link identified from observations between local weather and a set of large-scale predictors. As physical processes driving surface weather vary in time, the most relevant predictors and the regression link are likely to vary in time too. This is well known for precipitation for instance and the link is thus often estimated after some seasonal stratification of the data. In this study, we present a two-stage analog/regression model where the regression link is estimated from atmospheric analogs of the current prediction day. Atmospheric analogs are identified from fields of geopotential heights at 1000 and 500 hPa. For the regression stage, two generalized linear models are further used to model the probability of precipitation occurrence and the distribution of non-zero precipitation amounts, respectively. The two-stage model is evaluated for the probabilistic prediction of small-scale precipitation over France. It noticeably improves the skill of the prediction for both precipitation occurrence and amount. As the analog days vary from one prediction day to another, the atmospheric predictors selected in the regression stage and the value of the corresponding regression coefficients can vary from one prediction day to another. The model allows thus for a day-to-day adaptive and tailored downscaling. It can also reveal specific predictors for peculiar and non-frequent weather configurations.

  13. Soil Cd, Cr, Cu, Ni, Pb and Zn sorption and retention models using SVM: Variable selection and competitive model.

    PubMed

    González Costa, J J; Reigosa, M J; Matías, J M; Covelo, E F

    2017-09-01

    The aim of this study was to model the sorption and retention of Cd, Cu, Ni, Pb and Zn in soils. To that extent, the sorption and retention of these metals were studied and the soil characterization was performed separately. Multiple stepwise regression was used to produce multivariate models with linear techniques and with support vector machines, all of which included 15 explanatory variables characterizing soils. When the R-squared values are represented, two different groups are noticed. Cr, Cu and Pb sorption and retention show a higher R-squared; the most explanatory variables being humified organic matter, Al oxides and, in some cases, cation-exchange capacity (CEC). The other group of metals (Cd, Ni and Zn) shows a lower R-squared, and clays are the most explanatory variables, including a percentage of vermiculite and slime. In some cases, quartz, plagioclase or hematite percentages also show some explanatory capacity. Support Vector Machine (SVM) regression shows that the different models are not as regular as in multiple regression in terms of number of variables, the regression for nickel adsorption being the one with the highest number of variables in its optimal model. On the other hand, there are cases where the most explanatory variables are the same for two metals, as it happens with Cd and Cr adsorption. A similar adsorption mechanism is thus postulated. These patterns of the introduction of variables in the model allow us to create explainability sequences. Those which are the most similar to the selectivity sequences obtained by Covelo (2005) are Mn oxides in multiple regression and change capacity in SVM. Among all the variables, the only one that is explanatory for all the metals after applying the maximum parsimony principle is the percentage of sand in the retention process. 
    In the competitive model arising from the aforementioned sequences, the most intense competitiveness for the adsorption and retention of different metals appears between Cr and Cd, Cu and Zn in multiple regression; and between Cr and Cd in SVM regression. Copyright © 2017 Elsevier B.V. All rights reserved.

  14. Regression calibration for models with two predictor variables measured with error and their interaction, using instrumental variables and longitudinal data.

    PubMed

    Strand, Matthew; Sillau, Stefan; Grunwald, Gary K; Rabinovitch, Nathan

    2014-02-10

    Regression calibration provides a way to obtain unbiased estimators of fixed effects in regression models when one or more predictors are measured with error. Recent development of measurement error methods has focused on models that include interaction terms between measured-with-error predictors, and separately, methods for estimation in models that account for correlated data. In this work, we derive explicit and novel forms of regression calibration estimators and associated asymptotic variances for longitudinal models that include interaction terms, when data from instrumental and unbiased surrogate variables are available but not the actual predictors of interest. The longitudinal data are fit using linear mixed models that contain random intercepts and account for serial correlation and unequally spaced observations. The motivating application involves a longitudinal study of exposure to two pollutants (predictors) - outdoor fine particulate matter and cigarette smoke - and their association in interactive form with levels of a biomarker of inflammation, leukotriene E4 (LTE4, outcome) in asthmatic children. Because the exposure concentrations could not be directly observed, we used measurements from a fixed outdoor monitor and urinary cotinine concentrations as instrumental variables, and we used concentrations of fine ambient particulate matter and cigarette smoke measured with error by personal monitors as unbiased surrogate variables. We applied the derived regression calibration methods to estimate coefficients of the unobserved predictors and their interaction, allowing for direct comparison of toxicity of the different pollutants. We used simulations to verify accuracy of inferential methods based on asymptotic theory. Copyright © 2013 John Wiley & Sons, Ltd.

  15. Analysis of low flows and selected methods for estimating low-flow characteristics at partial-record and ungaged stream sites in western Washington

    USGS Publications Warehouse

    Curran, Christopher A.; Eng, Ken; Konrad, Christopher P.

    2012-01-01

    Regional low-flow regression models for estimating Q7,10 at ungaged stream sites are developed from the records of daily discharge at 65 continuous gaging stations (including 22 discontinued gaging stations) for the purpose of evaluating explanatory variables. By incorporating the base-flow recession time constant τ as an explanatory variable in the regression model, the root-mean square error for estimating Q7,10 at ungaged sites can be lowered to 72 percent (for known values of τ), which is 42 percent less than if only basin area and mean annual precipitation are used as explanatory variables. If partial-record sites are included in the regression data set, τ must be estimated from pairs of discharge measurements made during continuous periods of declining low flows. Eight measurement pairs are optimal for estimating τ at partial-record sites, and result in a lowering of the root-mean square error by 25 percent. A low-flow survey strategy that includes paired measurements at partial-record sites requires additional effort and planning beyond a standard strategy, but could be used to enhance regional estimates of τ and potentially reduce the error of regional regression models for estimating low-flow characteristics at ungaged sites.

  16. Modelling of binary logistic regression for obesity among secondary students in a rural area of Kedah

    NASA Astrophysics Data System (ADS)

    Kamaruddin, Ainur Amira; Ali, Zalila; Noor, Norlida Mohd.; Baharum, Adam; Ahmad, Wan Muhamad Amir W.

    2014-07-01

    Logistic regression analysis examines the influence of various factors on a dichotomous outcome by estimating the probability of the event's occurrence. Logistic regression, also called a logit model, is a statistical procedure used to model dichotomous outcomes. In the logit model the log odds of the dichotomous outcome is modeled as a linear combination of the predictor variables. The log odds ratio in logistic regression provides a description of the probabilistic relationship of the variables and the outcome. In conducting logistic regression, selection procedures are used to select important predictor variables; diagnostics are used to check that the assumptions are valid, which include independence of errors, linearity in the logit for continuous variables, absence of multicollinearity, and lack of strongly influential outliers; and a test statistic is calculated to determine the aptness of the model. This study used the binary logistic regression model to investigate overweight and obesity among rural secondary school students on the basis of their demographic profile, medical history, diet and lifestyle. The results indicate that overweight and obesity of students are influenced by obesity in the family and the interaction between a student's ethnicity and routine meals intake. The odds of a student being overweight and obese are higher for a student having a family history of obesity and for a non-Malay student who frequently takes routine meals as compared to a Malay student.
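
    The logit relationship described in this abstract, where the log odds are a linear combination of the predictors, can be made concrete with a small sketch. The coefficients below are made up for illustration and are not estimates from the study; the function names are likewise hypothetical.

```python
import math


def linear_predictor(intercept, coefs, x):
    """Log odds of the outcome: intercept plus a weighted sum of predictors."""
    return intercept + sum(b * xi for b, xi in zip(coefs, x))


def probability(intercept, coefs, x):
    """Inverse logit: convert the log odds into an event probability."""
    return 1.0 / (1.0 + math.exp(-linear_predictor(intercept, coefs, x)))


def odds_ratio(coef):
    """Multiplicative change in the odds for a one-unit increase in a predictor."""
    return math.exp(coef)


# Made-up coefficients: one binary predictor (e.g. family history, 0 or 1).
p_without = probability(-1.0, [0.7], [0])
p_with = probability(-1.0, [0.7], [1])
```

    Dividing the odds `p_with / (1 - p_with)` by the odds `p_without / (1 - p_without)` gives `odds_ratio(0.7)`, which is why a fitted coefficient is usually reported after exponentiation.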

  17. Evaluation of the Bitterness of Traditional Chinese Medicines using an E-Tongue Coupled with a Robust Partial Least Squares Regression Method.

    PubMed

    Lin, Zhaozhou; Zhang, Qiao; Liu, Ruixin; Gao, Xiaojie; Zhang, Lu; Kang, Bingya; Shi, Junhan; Wu, Zidan; Gui, Xinjing; Li, Xuelin

    2016-01-25

    To accurately, safely, and efficiently evaluate the bitterness of Traditional Chinese Medicines (TCMs), a robust predictor was developed using the robust partial least squares (RPLS) regression method based on data obtained from an electronic tongue (e-tongue) system. The data quality was verified by Grubbs' test. Moreover, potential outliers were detected based on both the standardized residual and score distance calculated for each sample. The performance of RPLS on the dataset before and after outlier detection was compared to other state-of-the-art methods including multivariate linear regression, least squares support vector machine, and the plain partial least squares regression. Both R² and root-mean-squares error (RMSE) of cross-validation (CV) were recorded for each model. With four latent variables, a robust RMSECV value of 0.3916, with bitterness values ranging from 0.63 to 4.78, was obtained for the RPLS model that was constructed based on the dataset including outliers. Meanwhile, the RMSECV, which was calculated using the models constructed by other methods, was larger than that of the RPLS model. After six outliers were excluded, the performance of all benchmark methods markedly improved, but the difference between the RPLS model constructed before and after outlier exclusion was negligible. In conclusion, the bitterness of TCM decoctions can be accurately evaluated with the RPLS model constructed using e-tongue data.

  18. Regression trees for predicting mortality in patients with cardiovascular disease: What improvement is achieved by using ensemble-based methods?

    PubMed Central

    Austin, Peter C; Lee, Douglas S; Steyerberg, Ewout W; Tu, Jack V

    2012-01-01

    In biomedical research, the logistic regression model is the most commonly used method for predicting the probability of a binary outcome. While many clinical researchers have expressed an enthusiasm for regression trees, this method may have limited accuracy for predicting health outcomes. We aimed to evaluate the improvement that is achieved by using ensemble-based methods, including bootstrap aggregation (bagging) of regression trees, random forests, and boosted regression trees. We analyzed 30-day mortality in two large cohorts of patients hospitalized with either acute myocardial infarction (N = 16,230) or congestive heart failure (N = 15,848) in two distinct eras (1999–2001 and 2004–2005). We found that both the in-sample and out-of-sample prediction of ensemble methods offered substantial improvement in predicting cardiovascular mortality compared to conventional regression trees. However, conventional logistic regression models that incorporated restricted cubic smoothing splines had even better performance. We conclude that ensemble methods from the data mining and machine learning literature increase the predictive performance of regression trees, but may not lead to clear advantages over conventional logistic regression models for predicting short-term mortality in population-based samples of subjects with cardiovascular disease. PMID:22777999

  19. ToxiM: A Toxicity Prediction Tool for Small Molecules Developed Using Machine Learning and Chemoinformatics Approaches.

    PubMed

    Sharma, Ashok K; Srivastava, Gopal N; Roy, Ankita; Sharma, Vineet K

    2017-01-01

    The experimental methods for the prediction of molecular toxicity are tedious and time-consuming tasks. Thus, the computational approaches could be used to develop alternative methods for toxicity prediction. We have developed a tool for the prediction of molecular toxicity along with the aqueous solubility and permeability of any molecule/metabolite. Using a comprehensive and curated set of toxin molecules as a training set, the different chemical and structural based features such as descriptors and fingerprints were exploited for feature selection, optimization and development of machine learning based classification and regression models. The compositional differences in the distribution of atoms were apparent between toxins and non-toxins, and hence, the molecular features were used for the classification and regression. On 10-fold cross-validation, the descriptor-based, fingerprint-based and hybrid-based classification models showed similar accuracy (93%) and Matthews's correlation coefficient (0.84). The performances of all the three models were comparable (Matthews's correlation coefficient = 0.84-0.87) on the blind dataset. In addition, the regression-based models using descriptors as input features were also compared and evaluated on the blind dataset. Random forest based regression model for the prediction of solubility performed better (R2 = 0.84) than the multi-linear regression (MLR) and partial least square regression (PLSR) models, whereas, the partial least squares based regression model for the prediction of permeability (caco-2) performed better (R2 = 0.68) in comparison to the random forest and MLR based regression models. The performance of final classification and regression models was evaluated using the two validation datasets including the known toxins and commonly used constituents of health products, which attests to its accuracy.
    The ToxiM web server would be a highly useful and reliable tool for the prediction of toxicity, solubility, and permeability of small molecules.

  20. ToxiM: A Toxicity Prediction Tool for Small Molecules Developed Using Machine Learning and Chemoinformatics Approaches

    PubMed Central

    Sharma, Ashok K.; Srivastava, Gopal N.; Roy, Ankita; Sharma, Vineet K.

    2017-01-01

    The experimental methods for the prediction of molecular toxicity are tedious and time-consuming tasks. Thus, the computational approaches could be used to develop alternative methods for toxicity prediction. We have developed a tool for the prediction of molecular toxicity along with the aqueous solubility and permeability of any molecule/metabolite. Using a comprehensive and curated set of toxin molecules as a training set, the different chemical and structural based features such as descriptors and fingerprints were exploited for feature selection, optimization and development of machine learning based classification and regression models. The compositional differences in the distribution of atoms were apparent between toxins and non-toxins, and hence, the molecular features were used for the classification and regression. On 10-fold cross-validation, the descriptor-based, fingerprint-based and hybrid-based classification models showed similar accuracy (93%) and Matthews's correlation coefficient (0.84). The performances of all the three models were comparable (Matthews's correlation coefficient = 0.84–0.87) on the blind dataset. In addition, the regression-based models using descriptors as input features were also compared and evaluated on the blind dataset. Random forest based regression model for the prediction of solubility performed better (R2 = 0.84) than the multi-linear regression (MLR) and partial least square regression (PLSR) models, whereas, the partial least squares based regression model for the prediction of permeability (caco-2) performed better (R2 = 0.68) in comparison to the random forest and MLR based regression models. The performance of final classification and regression models was evaluated using the two validation datasets including the known toxins and commonly used constituents of health products, which attests to its accuracy. 
    The ToxiM web server would be a highly useful and reliable tool for the prediction of toxicity, solubility, and permeability of small molecules. PMID:29249969

  1. Analytics of Radioactive Materials Released in the Fukushima Daiichi Nuclear Accident

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Egarievwe, Stephen U.; Nuclear Engineering Department, University of Tennessee, Knoxville, TN; Coble, Jamie B.

    The 2011 Fukushima Daiichi nuclear accident in Japan resulted in the release of radioactive materials into the atmosphere, the nearby sea, and the surrounding land. Following the accident, several meteorological models were used to predict the transport of the radioactive materials to other continents such as North America and Europe. Also of high importance is the dispersion of radioactive materials locally and within Japan. Based on the International Atomic Energy Agency (IAEA) Convention on Early Notification of a Nuclear Accident, several radiological data sets were collected on the accident by the Japanese authorities. Among the radioactive materials monitored are I-131 and Cs-137, which form the major contributions to the contamination of drinking water. The radiation dose in the atmosphere was also measured. It is impractical to measure contamination and radiation dose in every place of interest. Therefore, modeling helps to predict contamination and radiation dose. Some modeling studies that have been reported in the literature include the simulation of transport and deposition of I-131 and Cs-137 from the accident, Cs-137 deposition and contamination of Japanese soils, and preliminary estimates of I-131 and Cs-137 discharged from the plant into the atmosphere. In this paper, we present statistical analytics of I-131 and Cs-137 with the goal of predicting gamma dose from the Fukushima Daiichi nuclear accident. The data sets used in our study were collected from the IAEA Fukushima Monitoring Database. As part of this study, we investigated several regression models to find the best algorithm for modeling the gamma dose. The modeling techniques used in our study include linear regression, principal component regression (PCR), partial least square (PLS) regression, and ridge regression.
    Our preliminary results on the first set of data showed that the linear regression model with one variable was the best with a root mean square error of 0.0133 μSv/h, compared to 0.0210 μSv/h for PCR, 0.231 μSv/h for ridge regression L-curve, 0.0856 μSv/h for PLS, and 0.0860 μSv/h for ridge regression cross validation. Complete results using the full datasets for these models will also be presented. (authors)

  2. Nonlinear-regression flow model of the Gulf Coast aquifer systems in the south-central United States

    USGS Publications Warehouse

    Kuiper, L.K.

    1994-01-01

    A multiple-regression methodology was used to help answer questions concerning model reliability, and to calibrate a time-dependent variable-density ground-water flow model of the Gulf Coast aquifer systems in the south-central United States. More than 40 regression models with 2 to 31 regression parameters are used and detailed results are presented for 12 of the models. More than 3,000 values for grid-element volume-averaged head and hydraulic conductivity are used for the regression model observations. Calculated prediction interval half widths, though perhaps inaccurate due to a lack of normality of the residuals, are the smallest for models with only four regression parameters. In addition, the root-mean weighted residual decreases very little with an increase in the number of regression parameters. The various models showed considerable overlap between the prediction intervals for shallow head and hydraulic conductivity. Approximate 95-percent prediction interval half widths for volume-averaged freshwater head exceed 108 feet; for volume-averaged base 10 logarithm hydraulic conductivity, they exceed 0.89. All of the models are unreliable for the prediction of head and ground-water flow in the deeper parts of the aquifer systems, including the amount of flow coming from the underlying geopressured zone. Truncating the domain of solution of one model to exclude that part of the system having a ground-water density greater than 1.005 grams per cubic centimeter or to exclude that part of the systems below a depth of 3,000 feet, and setting the density to that of freshwater does not appreciably change the results for head and ground-water flow, except for locations close to the truncation surface.

  3. Bootstrap investigation of the stability of a Cox regression model.

    PubMed

    Altman, D G; Andersen, P K

    1989-07-01

    We describe a bootstrap investigation of the stability of a Cox proportional hazards regression model resulting from the analysis of a clinical trial of azathioprine versus placebo in patients with primary biliary cirrhosis. We have considered stability to refer both to the choice of variables included in the model and, more importantly, to the predictive ability of the model. In stepwise Cox regression analyses of 100 bootstrap samples using 17 candidate variables, the most frequently selected variables were those selected in the original analysis, and no other important variable was identified. Thus there was no reason to doubt the model obtained in the original analysis. For each patient in the trial, bootstrap confidence intervals were constructed for the estimated probability of surviving two years. It is shown graphically that these intervals are markedly wider than those obtained from the original model.

  4. Regression models of monthly water-level change in and near the Closed Basin Division of the San Luis Valley, south-central Colorado

    USGS Publications Warehouse

    Watts, Kenneth R.

    1995-01-01

    The Bureau of Reclamation is developing a water-resource project, the Closed Basin Division, in the San Luis Valley of south-central Colorado that is designed to salvage unconfined ground water that currently is discharged as evapotranspiration. The water table in and near the 130,000-acre Closed Basin Division area will be lowered by an annual withdrawal of as much as 100,000 acre-feet of ground water from the unconfined aquifer. The legislation authorizing the project limits resulting drawdown of the water table in preexisting irrigation and domestic wells outside the Closed Basin Division to a maximum of 2 feet. Water levels in the closed basin in the northern part of the San Luis Valley historically have fluctuated more than 2 feet in response to water-use practices and variation of climatically controlled recharge and discharge. Declines of water levels in nearby wells that are caused by withdrawals in the Closed Basin Division can be quantified if water-level fluctuations that result from other water-use practices and climatic variations can be estimated. This study was done to evaluate water-level change at selected observation wells in and near the Closed Basin Division. Regression models of monthly water-level change were developed to predict monthly water-level change in 46 selected observation wells. Predictions of monthly water-level change are based on one or more of the following: elapsed time, cosine and sine functions with an annual period, streamflow depletion of the Rio Grande, electrical use for agricultural purposes, runoff into the closed basin, precipitation, and mean air temperature. Regression models for five of the wells include only an intercept term and either an elapsed-time term or terms determined by the cosine and sine functions. Regression models for the other 41 wells include 1 to 4 of the 5 other variables, which can vary from month to month and from year to year. 
Serial correlation of the residuals was detected in 24 of the regression models. These models also include an autoregressive term to account for serial correlation in the residuals. The adjusted coefficient of determination (Ra2) for the 46 regression models ranges from 0.08 to 0.89, and the standard errors of estimate range from 0.034 to 2.483 feet. The regression models of monthly water-level change can be used to evaluate whether post-1985 monthly water-level change values at the selected observation wells are within the 95-percent confidence limits of predicted monthly water-level change.
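
    The kind of model described above, with an intercept, an elapsed-time term, and cosine and sine terms with an annual period, can be sketched as an ordinary least-squares fit. This is only an illustration under stated assumptions: the data are synthetic, the coefficient values are hypothetical, the function names are invented here, and the autoregressive term and the other explanatory variables from the report are left out.

```python
import math

def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def design_row(month):
    """Predictors for one month: intercept, elapsed time, annual cosine and sine."""
    angle = 2.0 * math.pi * month / 12.0
    return [1.0, float(month), math.cos(angle), math.sin(angle)]

def fit_ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve_linear(xtx, xty)

# Synthetic, noise-free monthly series built from hypothetical coefficients.
months = list(range(48))
true_beta = [0.5, -0.01, 0.8, 0.3]
y = [sum(b * v for b, v in zip(true_beta, design_row(t))) for t in months]
beta = fit_ols([design_row(t) for t in months], y)
```

    With noise-free input the fit recovers the hypothetical coefficients; with real water-level data the residuals would still need to be checked for serial correlation, as the abstract describes.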

  5. Properties of added variable plots in Cox's regression model.

    PubMed

    Lindkvist, M

    2000-03-01

    The added variable plot is useful for examining the effect of a covariate in regression models. The plot provides information regarding the inclusion of a covariate, and is useful in identifying influential observations on the parameter estimates. Hall et al. (1996) proposed a plot for Cox's proportional hazards model derived by regarding the Cox model as a generalized linear model. This paper proves and discusses properties of this plot. These properties make the plot a valuable tool in model evaluation. Quantities considered include parameter estimates, residuals, leverage, case influence measures and correspondence to previously proposed residuals and diagnostics.

  6. Regression Models and Fuzzy Logic Prediction of TBM Penetration Rate

    NASA Astrophysics Data System (ADS)

    Minh, Vu Trieu; Katushin, Dmitri; Antonov, Maksim; Veinthal, Renno

    2017-03-01

    This paper presents statistical analyses of rock engineering properties and the measured penetration rate of tunnel boring machine (TBM) based on the data of an actual project. The aim of this study is to analyze the influence of rock engineering properties including uniaxial compressive strength (UCS), Brazilian tensile strength (BTS), rock brittleness index (BI), the distance between planes of weakness (DPW), and the alpha angle (Alpha) between the tunnel axis and the planes of weakness on the TBM rate of penetration (ROP). Four statistical regression models (two linear and two nonlinear) are built to predict the ROP of TBM. Finally, a fuzzy logic model is developed as an alternative method and compared to the four statistical regression models. Results show that the fuzzy logic model provides better estimations and can be applied to predict the TBM performance. The fuzzy logic model has the highest R-squared value (R2), 0.714, followed by 0.667 for the multiple-variable nonlinear regression model.

  7. Semisupervised Clustering by Iterative Partition and Regression with Neuroscience Applications

    PubMed Central

    Qian, Guoqi; Wu, Yuehua; Ferrari, Davide; Qiao, Puxue; Hollande, Frédéric

    2016-01-01

    Regression clustering is a method that combines unsupervised and supervised statistical learning and data mining, and it is found in a wide range of applications including artificial intelligence and neuroscience. It performs unsupervised learning when it clusters the data according to their respective unobserved regression hyperplanes. The method also performs supervised learning when it fits regression hyperplanes to the corresponding data clusters. Applying regression clustering in practice requires means of determining the underlying number of clusters in the data, finding the cluster label of each data point, and estimating the regression coefficients of the model. In this paper, we review the estimation and selection issues in regression clustering with regard to the least squares and robust statistical methods. We also provide a model selection based technique to determine the number of regression clusters underlying the data. We further develop a computing procedure for regression clustering estimation and selection. Finally, simulation studies are presented for assessing the procedure, together with analyzing a real data set on RGB cell marking in neuroscience to illustrate and interpret the method. PMID:27212939
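
    The alternation between partitioning and regression described above can be sketched in a toy form. This is not the authors' procedure: it assumes straight-line fits by least squares, a known number of clusters, and a caller-supplied starting partition, and every name and data value is hypothetical.

```python
def fit_line(points):
    """Least-squares line through (x, y) points; returns (intercept, slope)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0.0:
        return (mean_y, 0.0)  # degenerate case: all x equal
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return (mean_y - slope * mean_x, slope)

def regression_clustering(points, labels, n_clusters, max_iter=50):
    """Alternate between fitting one line per cluster and relabelling points."""
    labels = list(labels)
    lines = []
    for _ in range(max_iter):
        # Regression step: fit a line to the current members of each cluster.
        lines = []
        for k in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == k]
            lines.append(fit_line(members) if len(members) >= 2 else (0.0, 0.0))
        # Partition step: move each point to the line with the smallest squared residual.
        new_labels = [
            min(range(n_clusters),
                key=lambda k: (y - lines[k][0] - lines[k][1] * x) ** 2)
            for x, y in points
        ]
        if new_labels == labels:
            break  # partition is stable
        labels = new_labels
    return labels, lines

# Two synthetic lines, y = 2x + 1 and y = -x + 40, with one point mislabelled at the start.
points = [(x, 2 * x + 1) for x in range(10)] + [(x, -x + 40) for x in range(10)]
start = [1] + [0] * 9 + [1] * 10
labels, lines = regression_clustering(points, start, n_clusters=2)
```

    The result depends on the starting partition, and the sketch does nothing to choose the number of clusters, which the abstract treats as a separate model selection problem.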

  8. Restoration of Monotonicity Respecting in Dynamic Regression

    PubMed Central

    Huang, Yijian

    2017-01-01

    Dynamic regression models, including the quantile regression model and Aalen’s additive hazards model, are widely adopted to investigate evolving covariate effects. Yet lack of monotonicity respecting with standard estimation procedures remains an outstanding issue. Advances have recently been made, but none provides a complete resolution. In this article, we propose a novel adaptive interpolation method to restore monotonicity respecting, by successively identifying and then interpolating nearest monotonicity-respecting points of an original estimator. Under mild regularity conditions, the resulting regression coefficient estimator is shown to be asymptotically equivalent to the original. Our numerical studies have demonstrated that the proposed estimator is much smoother and may have better finite-sample efficiency than the original as well as, when available as only in special cases, other competing monotonicity-respecting estimators. Illustration with a clinical study is provided. PMID:29430068

  9. Selection of the NIR region for a regression model of the ethanol concentration in fermentation process by an online NIR and mid-IR dual-region spectrometer and 2D heterospectral correlation spectroscopy.

    PubMed

    Nishii, Takashi; Genkawa, Takuma; Watari, Masahiro; Ozaki, Yukihiro

    2012-01-01

    A new selection procedure of an informative near-infrared (NIR) region for regression model building is proposed that uses an online NIR/mid-infrared (mid-IR) dual-region spectrometer in conjunction with two-dimensional (2D) NIR/mid-IR heterospectral correlation spectroscopy. In this procedure, both NIR and mid-IR spectra of a liquid sample are acquired sequentially during a reaction process using the NIR/mid-IR dual-region spectrometer; the 2D NIR/mid-IR heterospectral correlation spectrum is subsequently calculated from the obtained spectral data set. From the calculated 2D spectrum, a NIR region is selected that includes bands of high positive correlation intensity with mid-IR bands assigned to the analyte, and used for the construction of a regression model. To evaluate the performance of this procedure, a partial least-squares (PLS) regression model of the ethanol concentration in a fermentation process was constructed. During fermentation, NIR/mid-IR spectra in the 10000 - 1200 cm(-1) region were acquired every 3 min, and a 2D NIR/mid-IR heterospectral correlation spectrum was calculated to investigate the correlation intensity between the NIR and mid-IR bands. NIR regions that include bands at 4343, 4416, 5778, 5904, and 5955 cm(-1), which result from the combinations and overtones of the C-H group of ethanol, were selected for use in the PLS regression models, by taking the correlation intensity of a mid-IR band at 2985 cm(-1) arising from the CH(3) asymmetric stretching vibration mode of ethanol as a reference. The predicted results indicate that the ethanol concentrations calculated from the PLS regression models fit well to those obtained by high-performance liquid chromatography. Thus, it can be concluded that the selection procedure using the NIR/mid-IR dual-region spectrometer combined with 2D NIR/mid-IR heterospectral correlation spectroscopy is a powerful method for the construction of a reliable regression model.

  10. Predicting risk for portal vein thrombosis in acute pancreatitis patients: A comparison of radical basis function artificial neural network and logistic regression models.

    PubMed

    Fei, Yang; Hu, Jian; Gao, Kun; Tu, Jianfeng; Li, Wei-Qin; Wang, Wei

    2017-06-01

    To construct a radical basis function (RBF) artificial neural networks (ANNs) model to predict the incidence of acute pancreatitis (AP)-induced portal vein thrombosis. The analysis included 353 patients with AP who had been admitted between January 2011 and December 2015. An RBF ANNs model and a logistic regression model were each constructed based on eleven factors relevant to AP. Statistical indexes were used to evaluate the value of the prediction in the two models. The sensitivity, specificity, positive predictive value, negative predictive value and accuracy of the RBF ANNs model for PVT were 73.3%, 91.4%, 68.8%, 93.0% and 87.7%, respectively. There were significant differences between the RBF ANNs and logistic regression models in these parameters (P<0.05). In addition, a comparison of the area under receiver operating characteristic curves of the two models showed a statistically significant difference (P<0.05). The RBF ANNs model is more likely to predict the occurrence of PVT induced by AP than the logistic regression model. D-dimer, AMY, Hct and PT were important prediction factors of approval for AP-induced PVT. Copyright © 2017 Elsevier Inc. All rights reserved.
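
    The five statistical indexes named above all follow from a 2x2 confusion matrix. The sketch below shows the standard definitions; the counts are hypothetical and are not taken from the study, and the function name is invented for illustration.

```python
def classification_indexes(tp, fn, fp, tn):
    """Standard indexes from true/false positive and negative counts."""
    return {
        "sensitivity": tp / (tp + fn),  # share of true cases that are detected
        "specificity": tn / (tn + fp),  # share of non-cases correctly ruled out
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }

# Hypothetical counts, chosen only to exercise the formulas.
indexes = classification_indexes(tp=40, fn=10, fp=20, tn=130)
```

    Predictive values depend on how common the outcome is in the sample, so they do not transfer directly to populations with a different incidence.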

  11. Regression mixture models: Does modeling the covariance between independent variables and latent classes improve the results?

    PubMed Central

    Lamont, Andrea E.; Vermunt, Jeroen K.; Van Horn, M. Lee

    2016-01-01

    Regression mixture models are increasingly used as an exploratory approach to identify heterogeneity in the effects of a predictor on an outcome. In this simulation study, we test the effects of violating an implicit assumption often made in these models – i.e., independent variables in the model are not directly related to latent classes. Results indicated that the major risk of failing to model the relationship between predictor and latent class was an increase in the probability of selecting additional latent classes and biased class proportions. Additionally, this study tests whether regression mixture models can detect a piecewise relationship between a predictor and outcome. Results suggest that these models are able to detect piecewise relations, but only when the relationship between the latent class and the predictor is included in model estimation. We illustrate the implications of making this assumption through a re-analysis of applied data examining heterogeneity in the effects of family resources on academic achievement. We compare previous results (which assumed no relation between independent variables and latent class) to the model where this assumption is lifted. Implications and analytic suggestions for conducting regression mixture based on these findings are noted. PMID:26881956

  12. Improving reliability of aggregation, numerical simulation and analysis of complex systems by empirical data

    NASA Astrophysics Data System (ADS)

    Dobronets, Boris S.; Popova, Olga A.

    2018-05-01

    The paper considers a new approach to regression modeling that uses aggregated data presented in the form of density functions. Approaches to improving the reliability of aggregation of empirical data are considered: improving accuracy and estimating errors. We discuss the procedures of data aggregation as a preprocessing stage for subsequent regression modeling. An important feature of the study is a demonstration of how to represent the aggregated data. It is proposed to use piecewise polynomial models, including spline aggregate functions. We show that the proposed approach to data aggregation can be interpreted as the frequency distribution. To study its properties, the density function concept is used. Various types of mathematical models of data aggregation are discussed. For the construction of regression models, it is proposed to use data representation procedures based on piecewise polynomial models. New approaches to modeling functional dependencies based on spline aggregations are proposed.

  13. An introduction to using Bayesian linear regression with clinical data.

    PubMed

    Baldwin, Scott A; Larson, Michael J

    2017-11-01

    Statistical training in psychology focuses on frequentist methods. Bayesian methods are an alternative to standard frequentist methods. This article provides researchers with an introduction to fundamental ideas in Bayesian modeling. We use data from an electroencephalogram (EEG) and anxiety study to illustrate Bayesian models. Specifically, the models examine the relationship between error-related negativity (ERN), a particular event-related potential, and trait anxiety. Methodological topics covered include: how to set up a regression model in a Bayesian framework, specifying priors, examining convergence of the model, visualizing and interpreting posterior distributions, interval estimates, expected and predicted values, and model comparison tools. We also discuss situations where Bayesian methods can outperform frequentist methods as well as how to specify more complicated regression models. Finally, we conclude with recommendations about reporting guidelines for those using Bayesian methods in their own research. We provide data and R code for replicating our analyses. Copyright © 2017 Elsevier Ltd. All rights reserved.
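
    To make the idea of a prior and a posterior concrete, here is a minimal sketch of the simplest conjugate case: a single slope with no intercept, a normal prior on that slope, and a noise variance assumed known. It is written in Python rather than the R code the authors provide, it is not their model, and the data, prior settings, and function name are all hypothetical.

```python
def posterior_slope(x, y, noise_var, prior_mean, prior_var):
    """Posterior mean and variance of the slope in y_i = slope * x_i + noise.

    Assumes noise ~ Normal(0, noise_var) with noise_var known and a
    Normal(prior_mean, prior_var) prior on the slope.
    """
    prior_precision = 1.0 / prior_var
    data_precision = sum(xi * xi for xi in x) / noise_var
    post_precision = prior_precision + data_precision
    weighted_sum = prior_precision * prior_mean + sum(
        xi * yi for xi, yi in zip(x, y)) / noise_var
    return weighted_sum / post_precision, 1.0 / post_precision

# Hypothetical data with a weakly informative prior centred on zero.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
post_mean, post_var = posterior_slope(x, y, noise_var=1.0,
                                      prior_mean=0.0, prior_var=100.0)
```

    The posterior mean is a precision-weighted compromise between the prior mean and the least-squares slope, so a wide prior leaves it close to the frequentist estimate. Models with unknown variance or several predictors are usually fitted by sampling, which is where the convergence checks mentioned in the abstract come in.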

  14. RRegrs: an R package for computer-aided model selection with multiple regression models.

    PubMed

    Tsiliki, Georgia; Munteanu, Cristian R; Seoane, Jose A; Fernandez-Lozano, Carlos; Sarimveis, Haralambos; Willighagen, Egon L

    2015-01-01

    Predictive regression models can be created with many different modelling approaches. Choices need to be made for data set splitting, cross-validation methods, specific regression parameters and best model criteria, as they all affect the accuracy and efficiency of the produced predictive models, and therefore, raising model reproducibility and comparison issues. Cheminformatics and bioinformatics are extensively using predictive modelling and exhibit a need for standardization of these methodologies in order to assist model selection and speed up the process of predictive model development. A tool accessible to all users, irrespectively of their statistical knowledge, would be valuable if it tests several simple and complex regression models and validation schemes, produce unified reports, and offer the option to be integrated into more extensive studies. Additionally, such methodology should be implemented as a free programming package, in order to be continuously adapted and redistributed by others. We propose an integrated framework for creating multiple regression models, called RRegrs. The tool offers the option of ten simple and complex regression methods combined with repeated 10-fold and leave-one-out cross-validation. Methods include Multiple Linear regression, Generalized Linear Model with Stepwise Feature Selection, Partial Least Squares regression, Lasso regression, and Support Vector Machines Recursive Feature Elimination. The new framework is an automated fully validated procedure which produces standardized reports to quickly oversee the impact of choices in modelling algorithms and assess the model and cross-validation results. The methodology was implemented as an open source R package, available at https://www.github.com/enanomapper/RRegrs, by reusing and extending on the caret package. The universality of the new methodology is demonstrated using five standard data sets from different scientific fields. 
Its efficiency in cheminformatics and QSAR modelling is shown with three use cases: proteomics data for surface-modified gold nanoparticles, nano-metal oxides descriptor data, and molecular descriptors for acute aquatic toxicity data. The results show that for all data sets RRegrs reports models with equal or better performance for both training and test sets than those reported in the original publications. Its good performance as well as its adaptability in terms of parameter optimization could make RRegrs a popular framework to assist the initial exploration of predictive models, and with that, the design of more comprehensive in silico screening applications. Graphical abstract: RRegrs is a computer-aided model selection framework for R multiple regression models; this is a fully validated procedure with application to QSAR modelling.
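
    The two validation schemes named above, repeated 10-fold and leave-one-out cross-validation, differ only in how sample indices are split into training and test sets. The sketch below shows that splitting step in Python. It does not use or reproduce the RRegrs or caret APIs, and the function names and defaults are assumptions made for illustration.

```python
import random

def repeated_kfold(n_samples, n_folds=10, n_repeats=3, seed=0):
    """Return (train, test) index lists for repeated k-fold cross-validation."""
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    splits = []
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh random partition for every repeat
        for fold in range(n_folds):
            test = sorted(order[fold::n_folds])
            held_out = set(test)
            train = [i for i in range(n_samples) if i not in held_out]
            splits.append((train, test))
    return splits

def leave_one_out(n_samples):
    """Return (train, test) index lists where each sample is held out once."""
    return [([j for j in range(n_samples) if j != i], [i])
            for i in range(n_samples)]

splits = repeated_kfold(n_samples=25)
loo_splits = leave_one_out(4)
```

    A model would be refitted on each training list and scored on the matching test list, and the scores averaged over all splits give the cross-validated performance estimate.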

  15. Assessing risk factors for periodontitis using regression

    NASA Astrophysics Data System (ADS)

    Lobo Pereira, J. A.; Ferreira, Maria Cristina; Oliveira, Teresa

    2013-10-01

    Multivariate statistical analysis is indispensable to assess the associations and interactions between different factors and the risk of periodontitis. Among others, regression analysis is a statistical technique widely used in healthcare to investigate and model the relationship between variables. In our work we study the impact of socio-demographic, medical and behavioral factors on periodontal health. Using regression, linear and logistic models, we can assess the relevance, as risk factors for periodontitis disease, of the following independent variables (IVs): Age, Gender, Diabetic Status, Education, Smoking status and Plaque Index. The multiple linear regression analysis model was built to evaluate the influence of IVs on mean Attachment Loss (AL). Thus, the regression coefficients will be obtained along with the respective p-values from the significance tests. The classification of a case (individual) adopted in the logistic model was the extent of the destruction of periodontal tissues defined by an Attachment Loss greater than or equal to 4 mm in 25% (AL≥4mm/≥25%) of sites surveyed. The association measures include the Odds Ratios together with the corresponding 95% confidence intervals.
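
    An odds ratio and its 95% confidence interval are commonly derived from a logistic regression coefficient and its standard error. The sketch below uses the Wald interval, which is an assumption here because the abstract does not say how the intervals were computed. The coefficient and standard error are hypothetical.

```python
import math

def odds_ratio_with_ci(coefficient, std_error, z=1.96):
    """Odds ratio exp(beta) with a Wald interval exp(beta -/+ z * SE)."""
    odds_ratio = math.exp(coefficient)
    lower = math.exp(coefficient - z * std_error)
    upper = math.exp(coefficient + z * std_error)
    return odds_ratio, lower, upper

# Hypothetical coefficient for a binary risk factor (log odds ratio of ln 2).
odds_ratio, lower, upper = odds_ratio_with_ci(math.log(2.0), 0.2)
```

    An interval that excludes 1 corresponds to a coefficient that differs from zero at roughly the 5% level under the same normal approximation.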

  16. Comparison of Sub-Pixel Classification Approaches for Crop-Specific Mapping

    EPA Science Inventory

    This paper examined two non-linear models, Multilayer Perceptron (MLP) regression and Regression Tree (RT), for estimating sub-pixel crop proportions using time-series MODIS-NDVI data. The sub-pixel proportions were estimated for three major crop types including corn, soybean, a...

  17. Regression: The Apple Does Not Fall Far From the Tree.

    PubMed

    Vetter, Thomas R; Schober, Patrick

    2018-05-15

    Researchers and clinicians are frequently interested in either: (1) assessing whether there is a relationship or association between 2 or more variables and quantifying this association; or (2) determining whether 1 or more variables can predict another variable. The strength of such an association is mainly described by the correlation. However, regression analysis and regression models can be used not only to identify whether there is a significant relationship or association between variables but also to generate estimations of such a predictive relationship between variables. This basic statistical tutorial discusses the fundamental concepts and techniques related to the most common types of regression analysis and modeling, including simple linear regression, multiple regression, logistic regression, ordinal regression, and Poisson regression, as well as the common yet often underrecognized phenomenon of regression toward the mean. The various types of regression analysis are powerful statistical techniques, which when appropriately applied, can allow for the valid interpretation of complex, multifactorial data. Regression analysis and models can assess whether there is a relationship or association between 2 or more observed variables and estimate the strength of this association, as well as determine whether 1 or more variables can predict another variable. Regression is thus being applied more commonly in anesthesia, perioperative, critical care, and pain research. However, it is crucial to note that regression can identify plausible risk factors; it does not prove causation (a definitive cause and effect relationship). The results of a regression analysis instead identify independent (predictor) variable(s) associated with the dependent (outcome) variable. As with other statistical methods, applying regression requires that certain assumptions be met, which can be tested with specific diagnostics.

  18. Factor complexity of crash occurrence: An empirical demonstration using boosted regression trees.

    PubMed

    Chung, Yi-Shih

    2013-12-01

    Factor complexity is a characteristic of traffic crashes. This paper proposes a novel method, namely boosted regression trees (BRT), to investigate the complex and nonlinear relationships in high-variance traffic crash data. The Taiwanese 2004-2005 single-vehicle motorcycle crash data are used to demonstrate the utility of BRT. Traditional logistic regression and classification and regression tree (CART) models are also used to compare their estimation results and external validities. Both the in-sample cross-validation and out-of-sample validation results show that an increase in tree complexity provides improved, although declining, classification performance, indicating a limited factor complexity of single-vehicle motorcycle crashes. The effects of crucial variables including geographical, time, and sociodemographic factors explain some fatal crashes. Relatively unique fatal crashes are better approximated by interactive terms, especially combinations of behavioral factors. BRT models generally provide better transferability than conventional logistic regression and CART models. This study also discusses the implications of the results for devising safety policies. Copyright © 2012 Elsevier Ltd. All rights reserved.

  19. Multilayer Perceptron for Robust Nonlinear Interval Regression Analysis Using Genetic Algorithms

    PubMed Central

    2014-01-01

    On the basis of fuzzy regression, computational models in intelligence such as neural networks have the capability to be applied to nonlinear interval regression analysis for dealing with uncertain and imprecise data. When training data are not contaminated by outliers, computational models perform well by including almost all given training data in the data interval. Nevertheless, since training data are often corrupted by outliers, robust learning algorithms employed to resist outliers for interval regression analysis have been an interesting area of research. Several approaches involving computational intelligence are effective for resisting outliers, but the required parameters for these approaches are related to whether the collected data contain outliers or not. Since it seems difficult to prespecify the degree of contamination beforehand, this paper uses multilayer perceptron to construct the robust nonlinear interval regression model using the genetic algorithm. Outliers beyond or beneath the data interval will impose slight effect on the determination of data interval. Simulation results demonstrate that the proposed method performs well for contaminated datasets. PMID:25110755

  20. Multilayer perceptron for robust nonlinear interval regression analysis using genetic algorithms.

    PubMed

    Hu, Yi-Chung

    2014-01-01

    On the basis of fuzzy regression, computational models in intelligence such as neural networks have the capability to be applied to nonlinear interval regression analysis for dealing with uncertain and imprecise data. When training data are not contaminated by outliers, computational models perform well by including almost all given training data in the data interval. Nevertheless, since training data are often corrupted by outliers, robust learning algorithms employed to resist outliers for interval regression analysis have been an interesting area of research. Several approaches involving computational intelligence are effective for resisting outliers, but the required parameters for these approaches are related to whether the collected data contain outliers or not. Since it seems difficult to prespecify the degree of contamination beforehand, this paper uses multilayer perceptron to construct the robust nonlinear interval regression model using the genetic algorithm. Outliers beyond or beneath the data interval will impose slight effect on the determination of data interval. Simulation results demonstrate that the proposed method performs well for contaminated datasets.

  1. Genetic analysis of body weights of individually fed beef bulls in South Africa using random regression models.

    PubMed

    Selapa, N W; Nephawe, K A; Maiwashe, A; Norris, D

    2012-02-08

    The aim of this study was to estimate genetic parameters for body weights of individually fed beef bulls measured at centralized testing stations in South Africa using random regression models. Weekly body weights of Bonsmara bulls (N = 2919) tested between 1999 and 2003 were available for the analyses. The model included a fixed regression of the body weights on fourth-order orthogonal Legendre polynomials of the actual days on test (7, 14, 21, 28, 35, 42, 49, 56, 63, 70, 77, and 84) for starting age and contemporary group effects. Random regressions on fourth-order orthogonal Legendre polynomials of the actual days on test were included for additive genetic effects and additional uncorrelated random effects of the weaning-herd-year and the permanent environment of the animal. Residual effects were assumed to be independently distributed with heterogeneous variance for each test day. Variance ratios for additive genetic, permanent environment and weaning-herd-year for weekly body weights at different test days ranged from 0.26 to 0.29, 0.37 to 0.44 and 0.26 to 0.34, respectively. The weaning-herd-year was found to have a significant effect on the variation of body weights of bulls despite a 28-day adjustment period. Genetic correlations amongst body weights at different test days were high, ranging from 0.89 to 1.00. Heritability estimates were comparable to literature using multivariate models. Therefore, random regression model could be applied in the genetic evaluation of body weight of individually fed beef bulls in South Africa.
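
    Random regression models of this kind use Legendre polynomials of a standardized time variable as covariates. The sketch below maps days on test onto the interval [-1, 1] and evaluates the polynomials by the standard three-term recurrence. Two things are assumptions: the reading of "fourth-order" as degree 4 (some of the animal-breeding literature counts coefficients instead), and the use of unnormalized polynomials. The function names are invented here.

```python
def standardize(day, first_day, last_day):
    """Map a day on test linearly onto [-1, 1]."""
    return -1.0 + 2.0 * (day - first_day) / (last_day - first_day)

def legendre_basis(x, max_degree):
    """Values P_0(x) .. P_max_degree(x) from the recurrence
    (n + 1) P_{n+1}(x) = (2n + 1) x P_n(x) - n P_{n-1}(x)."""
    values = [1.0, x]
    for n in range(1, max_degree):
        values.append(((2 * n + 1) * x * values[n] - n * values[n - 1]) / (n + 1))
    return values[:max_degree + 1]

# Weekly weighing days listed in the abstract.
days = [7, 14, 21, 28, 35, 42, 49, 56, 63, 70, 77, 84]
covariates = [legendre_basis(standardize(d, days[0], days[-1]), 4) for d in days]
```

    Each row of `covariates` holds the regression covariates for one weighing day. The same rows can serve for the fixed regression and for the random regressions on additive genetic and environmental effects.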

  2. Evaluation and application of regional turbidity-sediment regression models in Virginia

    USGS Publications Warehouse

    Hyer, Kenneth; Jastram, John D.; Moyer, Douglas; Webber, James S.; Chanat, Jeffrey G.

    2015-01-01

    Conventional thinking has long held that turbidity-sediment surrogate-regression equations are site specific and that regression equations developed at a single monitoring station should not be applied to another station; however, few studies have evaluated this issue in a rigorous manner. If robust regional turbidity-sediment models can be developed successfully, their applications could greatly expand the usage of these methods. Suspended sediment load estimation could occur as soon as flow and turbidity monitoring commence at a site, suspended sediment sampling frequencies for various projects potentially could be reduced, and special-project applications (sediment monitoring following dam removal, for example) could be significantly enhanced. The objective of this effort was to investigate the turbidity-suspended sediment concentration (SSC) relations at all available USGS monitoring sites within Virginia to determine whether meaningful turbidity-sediment regression models can be developed by combining the data from multiple monitoring stations into a single model, known as a “regional” model. Following the development of the regional model, additional objectives included a comparison of predicted SSCs between the regional model and commonly used site-specific models, as well as an evaluation of why specific monitoring stations did not fit the regional model.

  3. Random regression models using different functions to model milk flow in dairy cows.

    PubMed

    Laureano, M M M; Bignardi, A B; El Faro, L; Cardoso, V L; Tonhati, H; Albuquerque, L G

    2014-09-12

    We analyzed 75,555 test-day milk flow records from 2175 primiparous Holstein cows that calved between 1997 and 2005. Milk flow was obtained by dividing the mean milk yield (kg) of the 3 daily milkings by the total milking time (min) and was expressed as kg/min. Milk flow was grouped into 43 weekly classes. The analyses were performed using single-trait random regression models that included direct additive genetic, permanent environmental, and residual random effects. In addition, the contemporary group and linear and quadratic effects of cow age at calving were included as fixed effects. A fourth-order orthogonal Legendre polynomial of days in milk was used to model the mean trend in milk flow. The additive genetic and permanent environmental covariance functions were estimated using random regression Legendre polynomials and B-spline functions of days in milk. The model using a third-order Legendre polynomial for additive genetic effects and a sixth-order polynomial for permanent environmental effects, which contained 7 residual classes, proved to be the most adequate to describe variations in milk flow, and was also the most parsimonious. The heritability in milk flow estimated by the most parsimonious model was of moderate to high magnitude.

  4. Multilevel covariance regression with correlated random effects in the mean and variance structure.

    PubMed

    Quintero, Adrian; Lesaffre, Emmanuel

    2017-09-01

    Multivariate regression methods generally assume a constant covariance matrix for the observations. In case a heteroscedastic model is needed, the parametric and nonparametric covariance regression approaches can be restrictive in the literature. We propose a multilevel regression model for the mean and covariance structure, including random intercepts in both components and allowing for correlation between them. The implied conditional covariance function can be different across clusters as a result of the random effect in the variance structure. In addition, allowing for correlation between the random intercepts in the mean and covariance makes the model convenient for skewedly distributed responses. Furthermore, it permits us to analyse directly the relation between the mean response level and the variability in each cluster. Parameter estimation is carried out via Gibbs sampling. We compare the performance of our model to other covariance modelling approaches in a simulation study. Finally, the proposed model is applied to the RN4CAST dataset to identify the variables that impact burnout of nurses in Belgium. © 2017 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  5. Regression model development and computational procedures to support estimation of real-time concentrations and loads of selected constituents in two tributaries to Lake Houston near Houston, Texas, 2005-9

    USGS Publications Warehouse

    Lee, Michael T.; Asquith, William H.; Oden, Timothy D.

    2012-01-01

    In December 2005, the U.S. Geological Survey (USGS), in cooperation with the City of Houston, Texas, began collecting discrete water-quality samples for nutrients, total organic carbon, bacteria (Escherichia coli and total coliform), atrazine, and suspended sediment at two USGS streamflow-gaging stations that represent watersheds contributing to Lake Houston (08068500 Spring Creek near Spring, Tex., and 08070200 East Fork San Jacinto River near New Caney, Tex.). Data from the discrete water-quality samples collected during 2005–9, in conjunction with continuously monitored real-time data that included streamflow and other physical water-quality properties (specific conductance, pH, water temperature, turbidity, and dissolved oxygen), were used to develop regression models for the estimation of concentrations of water-quality constituents of substantial source watersheds to Lake Houston. The potential explanatory variables included discharge (streamflow), specific conductance, pH, water temperature, turbidity, dissolved oxygen, and time (to account for seasonal variations inherent in some water-quality data). The response variables (the selected constituents) at each site were nitrite plus nitrate nitrogen, total phosphorus, total organic carbon, E. coli, atrazine, and suspended sediment. The explanatory variables provide easily measured quantities to serve as potential surrogate variables to estimate concentrations of the selected constituents through statistical regression. Statistical regression also facilitates accompanying estimates of uncertainty in the form of prediction intervals. Each regression model potentially can be used to estimate concentrations of a given constituent in real time. Among other regression diagnostics, the diagnostics used as indicators of general model reliability and reported herein include the adjusted R-squared, the residual standard error, residual plots, and p-values. 
Adjusted R-squared values for the Spring Creek models ranged from .582–.922 (dimensionless). The residual standard errors ranged from .073–.447 (base-10 logarithm). Adjusted R-squared values for the East Fork San Jacinto River models ranged from .253–.853 (dimensionless). The residual standard errors ranged from .076–.388 (base-10 logarithm). In conjunction with estimated concentrations, constituent loads can be estimated by multiplying the estimated concentration by the corresponding streamflow and by applying the appropriate conversion factor. The regression models presented in this report are site specific, that is, they are specific to the Spring Creek and East Fork San Jacinto River streamflow-gaging stations; however, the general methods that were developed and documented could be applied to most perennial streams for the purpose of estimating real-time water quality data.
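
    The load calculation described above (estimated concentration times streamflow times a conversion factor) can be sketched as follows. This is a minimal illustration, not the report's procedure: the function names are hypothetical, the units (mg/L, cubic feet per second, kg/day) are assumed, and the back-transformation from base-10 logarithms is the naive one, with no retransformation bias correction.

```python
# Hypothetical helpers; the unit choices (mg/L, ft^3/s, kg/day) are assumptions
# made for illustration and are not taken from the report.
LITERS_PER_CUBIC_FOOT = 28.316846592
SECONDS_PER_DAY = 86400
KG_PER_MG = 1e-6


def concentration_from_log10(log10_estimate):
    """Naive back-transform of a base-10 log estimate (no bias correction)."""
    return 10.0 ** log10_estimate


def load_kg_per_day(concentration_mg_per_l, streamflow_cfs):
    """Constituent load = concentration x streamflow x conversion factor."""
    liters_per_day = streamflow_cfs * LITERS_PER_CUBIC_FOOT * SECONDS_PER_DAY
    return concentration_mg_per_l * liters_per_day * KG_PER_MG


# Made-up inputs: a log10 concentration estimate of 0.5 at 120 ft^3/s.
example_load = load_kg_per_day(concentration_from_log10(0.5), 120.0)
```

    The conversion factor is built from its physical parts rather than hard-coded, so a different unit system only requires changing the constants.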

  6. Independent variable complexity for regional regression of the flow duration curve in ungauged basins

    NASA Astrophysics Data System (ADS)

    Fouad, Geoffrey; Skupin, André; Hope, Allen

    2016-04-01

    The flow duration curve (FDC) is one of the most widely used tools to quantify streamflow. Its percentile flows are often required for water resource applications, but these values must be predicted for ungauged basins with insufficient or no streamflow data. Regional regression is a commonly used approach for predicting percentile flows that involves identifying hydrologic regions and calibrating regression models to each region. The independent variables used to describe the physiographic and climatic setting of the basins are a critical component of regional regression, yet few studies have investigated their effect on resulting predictions. In this study, the complexity of the independent variables needed for regional regression is investigated. Different levels of variable complexity are applied for a regional regression consisting of 918 basins in the US. Both the hydrologic regions and regression models are determined according to the different sets of variables, and the accuracy of resulting predictions is assessed. The different sets of variables include (1) a simple set of three variables strongly tied to the FDC (mean annual precipitation, potential evapotranspiration, and baseflow index), (2) a traditional set of variables describing the average physiographic and climatic conditions of the basins, and (3) a more complex set of variables extending the traditional variables to include statistics describing the distribution of physiographic data and temporal components of climatic data. The latter set of variables is not typically used in regional regression, and is evaluated for its potential to predict percentile flows. The simplest set of only three variables performed similarly to the other more complex sets of variables. 
Traditional variables used to describe climate, topography, and soil offered little more to the predictions, and the experimental set of variables describing the distribution of basin data in more detail did not improve predictions. These results are largely reflective of cross-correlation existing in hydrologic datasets, and highlight the limited predictive power of many traditionally used variables for regional regression. A parsimonious approach including fewer variables chosen based on their connection to streamflow may be more efficient than a data mining approach including many different variables. Future regional regression studies may benefit from having a hydrologic rationale for including different variables and attempting to create new variables related to streamflow.
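
    As a rough illustration of the percentile flows that regional regression tries to predict, the sketch below computes points on a flow duration curve from a gauged record. The function names and the linear-interpolation percentile rule are assumptions chosen for this example; the study's own computation is not described in the abstract.

```python
def percentile(values, pct):
    """Percentile with linear interpolation between ordered values."""
    ordered = sorted(values)
    rank = (pct / 100.0) * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def flow_duration_curve(daily_flows, exceedance_pcts=(10, 50, 90)):
    """Map each exceedance percentage to the flow equalled or exceeded that often.

    A flow exceeded p% of the time is the (100 - p)th percentile of the record.
    """
    return {p: percentile(daily_flows, 100 - p) for p in exceedance_pcts}


# Made-up record: daily flows of 1, 2, ..., 100 (arbitrary units).
example_flows = list(range(1, 101))
fdc = flow_duration_curve(example_flows)
```

    In a regional regression, values such as `fdc[90]` from gauged basins would serve as the response variable, with basin characteristics as predictors.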

  7. Improving precision of glomerular filtration rate estimating model by ensemble learning.

    PubMed

    Liu, Xun; Li, Ningshan; Lv, Linsheng; Fu, Yongmei; Cheng, Cailian; Wang, Caixia; Ye, Yuqiu; Li, Shaomin; Lou, Tanqi

    2017-11-09

    Accurate assessment of kidney function is clinically important, but estimates of glomerular filtration rate (GFR) by regression are imprecise. We hypothesized that ensemble learning could improve precision. A total of 1419 participants were enrolled, with 1002 in the development dataset and 417 in the external validation dataset. GFR was independently estimated from age, sex and serum creatinine using an artificial neural network (ANN), support vector machine (SVM), regression, and ensemble learning. GFR was measured by 99mTc-DTPA renal dynamic imaging calibrated with dual plasma sample 99mTc-DTPA GFR. Mean measured GFRs were 70.0 ml/min/1.73 m² in the development and 53.4 ml/min/1.73 m² in the external validation cohorts. In the external validation cohort, precision was better in the ensemble model of the ANN, SVM and regression equation (IQR = 13.5 ml/min/1.73 m²) than in the new regression model (IQR = 14.0 ml/min/1.73 m², P < 0.001). The precision of ensemble learning was the best of the three models, but the models had similar bias and accuracy. The median difference ranged from 2.3 to 3.7 ml/min/1.73 m², 30% accuracy ranged from 73.1 to 76.0%, and P was > 0.05 for all comparisons of the new regression equation and the other new models. An ensemble learning model including three variables, the average ANN, SVM, and regression equation values, was more precise than the new regression model. A more complex ensemble learning strategy may further improve GFR estimates.
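
    The performance measures named in this abstract can be sketched as below. The definitions are assumptions based on common usage rather than on the paper: bias as the median of (estimated minus measured), precision as the interquartile range of that difference, and 30% accuracy as the percentage of estimates within 30% of the measured value. The simple averaging ensemble and all numbers are illustrative only.

```python
import statistics


def ensemble_mean(*model_estimates):
    """Average the per-participant estimates of several models."""
    return [sum(vals) / len(vals) for vals in zip(*model_estimates)]


def performance(estimated, measured):
    """Bias (median difference), precision (IQR) and 30% accuracy (P30)."""
    diffs = [e - m for e, m in zip(estimated, measured)]
    q1, _, q3 = statistics.quantiles(diffs, n=4, method="inclusive")
    within_30 = sum(abs(d) <= 0.30 * m for d, m in zip(diffs, measured))
    return {
        "bias": statistics.median(diffs),
        "precision_iqr": q3 - q1,
        "p30": 100.0 * within_30 / len(measured),
    }


# Made-up estimates (ml/min/1.73 m^2) for four hypothetical participants.
ann = [62.0, 48.0, 90.0, 35.0]
svm = [58.0, 52.0, 84.0, 37.0]
reg = [60.0, 50.0, 87.0, 33.0]
measured = [64.0, 47.0, 80.0, 36.0]

ensemble = ensemble_mean(ann, svm, reg)
ensemble_performance = performance(ensemble, measured)
```

    A smaller `precision_iqr` means the estimates scatter less tightly around the measured values, which is the sense in which the abstract calls the ensemble more precise.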

  8. A regression-based 3-D shoulder rhythm.

    PubMed

    Xu, Xu; Lin, Jia-hua; McGorry, Raymond W

    2014-03-21

    In biomechanical modeling of the shoulder, it is important to know the orientation of each bone in the shoulder girdle when estimating the loads on each musculoskeletal element. However, because of the soft tissue overlying the bones, it is difficult to accurately derive the orientation of the clavicle and scapula using surface markers during dynamic movement. The purpose of this study is to develop two regression models which predict the orientation of the clavicle and the scapula. The first regression model uses humerus orientation and individual factors such as age, gender, and anthropometry data as the predictors. The second regression model includes only the humerus orientation as the predictor. Thirty-eight participants performed 118 static postures covering the volume of the right hand reach. The orientation of the thorax, clavicle, scapula and humerus were measured with a motion tracking system. Regression analysis was performed on the Euler angles decomposed from the orientation of each bone from 26 randomly selected participants. The regression models were then validated with the remaining 12 participants. The results indicate that for the first model, the r² of the predicted orientation of the clavicle and the scapula ranged between 0.31 and 0.65, and the RMSE obtained from the validation dataset ranged from 6.92° to 10.39°. For the second model, the r² ranged between 0.19 and 0.57, and the RMSE obtained from the validation dataset ranged from 6.62° to 11.13°. The derived regression-based shoulder rhythm could be useful in future biomechanical modeling of the shoulder. Copyright © 2014 The Authors. Published by Elsevier Ltd. All rights reserved.

  9. Methods for estimating population density in data-limited areas: evaluating regression and tree-based models in Peru.

    PubMed

    Anderson, Weston; Guikema, Seth; Zaitchik, Ben; Pan, William

    2014-01-01

    Obtaining accurate small area estimates of population is essential for policy and health planning but is often difficult in countries with limited data. In lieu of available population data, small area estimate models draw information from previous time periods or from similar areas. This study focuses on model-based methods for estimating population when no direct samples are available in the area of interest. To explore the efficacy of tree-based models for estimating population density, we compare six different model structures including Random Forest and Bayesian Additive Regression Trees. Results demonstrate that without information from prior time periods, non-parametric tree-based models produced more accurate predictions than did conventional regression methods. Improving estimates of population density in non-sampled areas is important for regions with incomplete census data and has implications for economic, health and development policies.

  10. Methods for Estimating Population Density in Data-Limited Areas: Evaluating Regression and Tree-Based Models in Peru

    PubMed Central

    Anderson, Weston; Guikema, Seth; Zaitchik, Ben; Pan, William

    2014-01-01

    Obtaining accurate small area estimates of population is essential for policy and health planning but is often difficult in countries with limited data. In lieu of available population data, small area estimate models draw information from previous time periods or from similar areas. This study focuses on model-based methods for estimating population when no direct samples are available in the area of interest. To explore the efficacy of tree-based models for estimating population density, we compare six different model structures including Random Forest and Bayesian Additive Regression Trees. Results demonstrate that without information from prior time periods, non-parametric tree-based models produced more accurate predictions than did conventional regression methods. Improving estimates of population density in non-sampled areas is important for regions with incomplete census data and has implications for economic, health and development policies. PMID:24992657

  11. Bayesian quantile regression-based partially linear mixed-effects joint models for longitudinal data with multiple features.

    PubMed

    Zhang, Hanze; Huang, Yangxin; Wang, Wei; Chen, Henian; Langland-Orban, Barbara

    2017-01-01

    In longitudinal AIDS studies, it is of interest to investigate the relationship between HIV viral load and CD4 cell counts, as well as the complicated time effect. Most common models for analyzing such complex longitudinal data are based on mean regression, which fails to provide efficient estimates in the presence of outliers and/or heavy tails. Quantile regression-based partially linear mixed-effects models, a special case of semiparametric models enjoying benefits of both parametric and nonparametric models, have the flexibility to monitor the viral dynamics nonparametrically and detect the varying CD4 effects parametrically at different quantiles of viral load. Meanwhile, it is critical to consider various data features of repeated measurements, including left-censoring due to a limit of detection, covariate measurement error, and asymmetric distribution. In this research, we first establish Bayesian joint models that account for all these data features simultaneously in the framework of quantile regression-based partially linear mixed-effects models. The proposed models are applied to analyze the Multicenter AIDS Cohort Study (MACS) data. Simulation studies are also conducted to assess the performance of the proposed methods under different scenarios.

  12. Evaluation of the Bitterness of Traditional Chinese Medicines using an E-Tongue Coupled with a Robust Partial Least Squares Regression Method

    PubMed Central

    Lin, Zhaozhou; Zhang, Qiao; Liu, Ruixin; Gao, Xiaojie; Zhang, Lu; Kang, Bingya; Shi, Junhan; Wu, Zidan; Gui, Xinjing; Li, Xuelin

    2016-01-01

    To accurately, safely, and efficiently evaluate the bitterness of Traditional Chinese Medicines (TCMs), a robust predictor was developed using the robust partial least squares (RPLS) regression method based on data obtained from an electronic tongue (e-tongue) system. The data quality was verified by Grubbs' test. Moreover, potential outliers were detected based on both the standardized residual and score distance calculated for each sample. The performance of RPLS on the dataset before and after outlier detection was compared to other state-of-the-art methods including multivariate linear regression, least squares support vector machine, and the plain partial least squares regression. Both R² and root-mean-square error (RMSE) of cross-validation (CV) were recorded for each model. With four latent variables, a robust RMSECV value of 0.3916, with bitterness values ranging from 0.63 to 4.78, was obtained for the RPLS model that was constructed based on the dataset including outliers. Meanwhile, the RMSECV, which was calculated using the models constructed by other methods, was larger than that of the RPLS model. After six outliers were excluded, the performance of all benchmark methods markedly improved, but the difference between the RPLS model constructed before and after outlier exclusion was negligible. In conclusion, the bitterness of TCM decoctions can be accurately evaluated with the RPLS model constructed using e-tongue data. PMID:26821026

  13. A novel strategy for forensic age prediction by DNA methylation and support vector regression model

    PubMed Central

    Xu, Cheng; Qu, Hongzhu; Wang, Guangyu; Xie, Bingbing; Shi, Yi; Yang, Yaran; Zhao, Zhao; Hu, Lan; Fang, Xiangdong; Yan, Jiangwei; Feng, Lei

    2015-01-01

    High deviations resulting from the prediction model, gender and population differences have limited the application of DNA methylation markers to age estimation. Here we identified 2,957 novel age-associated DNA methylation sites (P < 0.01 and R² > 0.5) in blood of eight pairs of Chinese Han female monozygotic twins. Among them, nine novel sites (false discovery rate < 0.01), along with three other reported sites, were further validated in 49 unrelated female volunteers with ages of 20–80 years by Sequenom Massarray. A total of 95 CpGs were covered in the PCR products and 11 of them were used to build the age prediction models. After comparing four different models, including multivariate linear regression, multivariate nonlinear regression, back propagation neural network and support vector regression (SVR), SVR was identified as the most robust model, with the least mean absolute deviation from real chronological age (2.8 years) and an average accuracy of 4.7 years predicted by only six loci from the 11 loci, as well as a lower cross-validated error compared with the linear regression model. Our novel strategy provides an accurate measurement that is highly useful in estimating the individual age in forensic practice as well as in tracking the aging process in other related applications. PMID:26635134

  14. Comparison of methods for the analysis of relatively simple mediation models.

    PubMed

    Rijnhart, Judith J M; Twisk, Jos W R; Chinapaw, Mai J M; de Boer, Michiel R; Heymans, Martijn W

    2017-09-01

    Statistical mediation analysis is an often used method in trials, to unravel the pathways underlying the effect of an intervention on a particular outcome variable. Throughout the years, several methods have been proposed, such as ordinary least squares (OLS) regression, structural equation modeling (SEM), and the potential outcomes framework. Most applied researchers do not know that these methods are mathematically equivalent when applied to mediation models with a continuous mediator and outcome variable. Therefore, the aim of this paper was to demonstrate the similarities between OLS regression, SEM, and the potential outcomes framework in three mediation models: 1) a crude model, 2) a confounder-adjusted model, and 3) a model with an interaction term for exposure-mediator interaction. We performed a secondary data analysis of a randomized controlled trial that included 546 schoolchildren. In our data example, the mediator and outcome variable were both continuous. We compared the estimates of the total, direct and indirect effects, proportion mediated, and 95% confidence intervals (CIs) for the indirect effect across OLS regression, SEM, and the potential outcomes framework. OLS regression, SEM, and the potential outcomes framework yielded the same effect estimates in the crude mediation model, the confounder-adjusted mediation model, and the mediation model with an interaction term for exposure-mediator interaction. Since OLS regression, SEM, and the potential outcomes framework yield the same results in three mediation models with a continuous mediator and outcome variable, researchers can continue using the method that is most convenient to them.
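
    For the crude model with a continuous mediator and outcome, the OLS side of this comparison can be sketched with the product-of-coefficients decomposition, in which the total effect equals the direct effect plus the indirect effect (a times b). The code below is an illustrative sketch only: the function names and the eight data points are invented, and it covers neither SEM, the potential outcomes framework, nor the confounder-adjusted and interaction models.

```python
def _mean(values):
    return sum(values) / len(values)


def _cross(u, v):
    """Sum of cross-products of deviations from the means."""
    mu, mv = _mean(u), _mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


def crude_mediation(x, m, y):
    """OLS estimates for the crude mediation model.

    a:      slope of the mediator m on the exposure x
    total:  slope of the outcome y on x alone
    direct, b: slopes of y on x and m jointly (2x2 normal equations)
    """
    sxx, smm = _cross(x, x), _cross(m, m)
    sxm, sxy, smy = _cross(x, m), _cross(x, y), _cross(m, y)
    a = sxm / sxx
    total = sxy / sxx
    det = sxx * smm - sxm ** 2
    direct = (sxy * smm - sxm * smy) / det
    b = (sxx * smy - sxm * sxy) / det
    return {"a": a, "b": b, "total": total, "direct": direct, "indirect": a * b}


# Made-up data: x is a binary intervention indicator.
x = [0, 0, 0, 0, 1, 1, 1, 1]
m = [1.0, 2.0, 1.5, 2.5, 3.0, 2.0, 3.5, 4.0]
y = [2.0, 3.1, 2.4, 3.9, 5.2, 3.8, 5.9, 6.1]
effects = crude_mediation(x, m, y)
```

    For OLS fitted on the same sample, `effects["total"]` equals `effects["direct"] + effects["indirect"]` up to floating-point rounding, which is an algebraic identity of this decomposition.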

  15. Connectome-based predictive modeling of attention: Comparing different functional connectivity features and prediction methods across datasets.

    PubMed

    Yoo, Kwangsun; Rosenberg, Monica D; Hsu, Wei-Ting; Zhang, Sheng; Li, Chiang-Shan R; Scheinost, Dustin; Constable, R Todd; Chun, Marvin M

    2018-02-15

    Connectome-based predictive modeling (CPM; Finn et al., 2015; Shen et al., 2017) was recently developed to predict individual differences in traits and behaviors, including fluid intelligence (Finn et al., 2015) and sustained attention (Rosenberg et al., 2016a), from functional brain connectivity (FC) measured with fMRI. Here, using the CPM framework, we compared the predictive power of three different measures of FC (Pearson's correlation, accordance, and discordance) and two different prediction algorithms (linear and partial least squares [PLS] regression) for attention function. Accordance and discordance are recently proposed FC measures that respectively track in-phase synchronization and out-of-phase anti-correlation (Meskaldji et al., 2015). We defined connectome-based models using task-based or resting-state FC data, and tested the effects of (1) functional connectivity measure and (2) feature-selection/prediction algorithm on individualized attention predictions. Models were internally validated in a training dataset using leave-one-subject-out cross-validation, and externally validated with three independent datasets. The training dataset included fMRI data collected while participants performed a sustained attention task and rested (N = 25; Rosenberg et al., 2016a). The validation datasets included: 1) data collected during performance of a stop-signal task and at rest (N = 83, including 19 participants who were administered methylphenidate prior to scanning; Farr et al., 2014a; Rosenberg et al., 2016b), 2) data collected during Attention Network Task performance and rest (N = 41, Rosenberg et al., in press), and 3) resting-state data and ADHD symptom severity from the ADHD-200 Consortium (N = 113; Rosenberg et al., 2016a). 
Models defined using all combinations of functional connectivity measure (Pearson's correlation, accordance, and discordance) and prediction algorithm (linear and PLS regression) predicted attentional abilities, with correlations between predicted and observed measures of attention as high as 0.9 for internal validation, and 0.6 for external validation (all p's < 0.05). Models trained on task data outperformed models trained on rest data. Pearson's correlation and accordance features generally showed a small numerical advantage over discordance features, while PLS regression models were usually better than linear regression models. Overall, in addition to correlation features combined with linear models (Rosenberg et al., 2016a), it is useful to consider accordance features and PLS regression for CPM. Copyright © 2017 Elsevier Inc. All rights reserved.

  16. Modeling the language learning strategies and English language proficiency of pre-university students in UMS: A case study

    NASA Astrophysics Data System (ADS)

    Kiram, J. J.; Sulaiman, J.; Swanto, S.; Din, W. A.

    2015-10-01

    This study aims to construct a mathematical model of the relationship between a student's Language Learning Strategy usage and English Language proficiency. Fifty-six pre-university students of University Malaysia Sabah participated in this study. A self-report questionnaire called the Strategy Inventory for Language Learning was administered to them to measure their language learning strategy preferences before they sat for the Malaysian University English Test (MUET), the results of which were utilised to measure their English language proficiency. We fitted a Multiple Linear Regression model, with variables selected by Stepwise regression. We conducted various assessments of the model obtained, including the Global F-test, Root Mean Square Error and R-squared. The model obtained suggests that not all language learning strategies should be included in the model in an attempt to predict Language Proficiency.

  17. Modeling Fire Occurrence at the City Scale: A Comparison between Geographically Weighted Regression and Global Linear Regression.

    PubMed

    Song, Chao; Kwan, Mei-Po; Zhu, Jiping

    2017-04-08

    An increasing number of fires are occurring with the rapid development of cities, resulting in increased risk for human beings and the environment. This study compares geographically weighted regression-based models, including geographically weighted regression (GWR) and geographically and temporally weighted regression (GTWR), which integrates spatial and temporal effects, with global linear regression models (LM) for modeling fire risk at the city scale. The results show that the road density and the spatial distribution of enterprises have the strongest influences on fire risk, which implies that we should focus on areas where roads and enterprises are densely clustered. In addition, locations with a large number of enterprises have fewer fire ignition records, probably because of strict management and prevention measures. A changing number of significant variables across space indicates that heterogeneity mainly exists in the northern and eastern rural and suburban areas of Hefei city, where human-related facilities or road construction are only clustered in the city sub-centers. GTWR can capture small changes in the spatiotemporal heterogeneity of the variables while GWR and LM cannot. An approach that integrates space and time enables us to better understand the dynamic changes in fire risk. Thus governments can use the results to manage fire safety at the city scale.

  18. Modeling Fire Occurrence at the City Scale: A Comparison between Geographically Weighted Regression and Global Linear Regression

    PubMed Central

    Song, Chao; Kwan, Mei-Po; Zhu, Jiping

    2017-01-01

    An increasing number of fires are occurring with the rapid development of cities, resulting in increased risk for human beings and the environment. This study compares geographically weighted regression-based models, including geographically weighted regression (GWR) and geographically and temporally weighted regression (GTWR), which integrates spatial and temporal effects, with global linear regression models (LM) for modeling fire risk at the city scale. The results show that the road density and the spatial distribution of enterprises have the strongest influences on fire risk, which implies that we should focus on areas where roads and enterprises are densely clustered. In addition, locations with a large number of enterprises have fewer fire ignition records, probably because of strict management and prevention measures. A changing number of significant variables across space indicates that heterogeneity mainly exists in the northern and eastern rural and suburban areas of Hefei city, where human-related facilities or road construction are only clustered in the city sub-centers. GTWR can capture small changes in the spatiotemporal heterogeneity of the variables while GWR and LM cannot. An approach that integrates space and time enables us to better understand the dynamic changes in fire risk. Thus governments can use the results to manage fire safety at the city scale. PMID:28397745

  19. [Logistic regression model of noninvasive prediction for portal hypertensive gastropathy in patients with hepatitis B associated cirrhosis].

    PubMed

    Wang, Qingliang; Li, Xiaojie; Hu, Kunpeng; Zhao, Kun; Yang, Peisheng; Liu, Bo

    2015-05-12

    To explore the risk factors of portal hypertensive gastropathy (PHG) in patients with hepatitis B associated cirrhosis and establish a Logistic regression model of noninvasive prediction. The clinical data of 234 hospitalized patients with hepatitis B associated cirrhosis from March 2012 to March 2014 were analyzed retrospectively. The dependent variable was the occurrence of PHG while the independent variables were screened by binary Logistic analysis. Multivariate Logistic regression was used for further analysis of significant noninvasive independent variables. Logistic regression model was established and odds ratio was calculated for each factor. The accuracy, sensitivity and specificity of model were evaluated by the curve of receiver operating characteristic (ROC). According to univariate Logistic regression, the risk factors included hepatic dysfunction, albumin (ALB), bilirubin (TB), prothrombin time (PT), platelet (PLT), white blood cell (WBC), portal vein diameter, spleen index, splenic vein diameter, diameter ratio, PLT to spleen volume ratio, esophageal varices (EV) and gastric varices (GV). Multivariate analysis showed that hepatic dysfunction (X1), TB (X2), PLT (X3) and splenic vein diameter (X4) were the major occurring factors for PHG. The established regression model was Logit P = -2.667 + 2.186 X1 - 2.167 X2 + 0.725 X3 + 0.976 X4. The accuracy of model for PHG was 79.1% with a sensitivity of 77.2% and a specificity of 80.8%. Hepatic dysfunction, TB, PLT and splenic vein diameter are risk factors for PHG and the noninvasive predicted Logistic regression model was Logit P = -2.667 + 2.186 X1 - 2.167 X2 + 0.725 X3 + 0.976 X4.
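
    The reported equation gives the log-odds of PHG; converting it to a probability uses the inverse logit. The sketch below only evaluates the published coefficients. The abstract does not state how X1 to X4 are coded (units, cut-offs, or categories), so the inputs shown are placeholders and the function names are hypothetical.

```python
import math

# Coefficients as reported in the abstract.
INTERCEPT = -2.667
COEFFICIENTS = (2.186, -2.167, 0.725, 0.976)  # X1, X2, X3, X4


def phg_logit(x1, x2, x3, x4):
    """Log-odds from the reported model; the coding of X1..X4 is not given."""
    terms = zip(COEFFICIENTS, (x1, x2, x3, x4))
    return INTERCEPT + sum(coef * value for coef, value in terms)


def phg_probability(x1, x2, x3, x4):
    """Inverse logit: P = 1 / (1 + exp(-logit))."""
    return 1.0 / (1.0 + math.exp(-phg_logit(x1, x2, x3, x4)))


# Placeholder inputs only; real use requires the paper's variable coding.
baseline_probability = phg_probability(0, 0, 0, 0)
```

    Exponentiating a single coefficient, for example `math.exp(2.186)`, gives the odds ratio associated with a one-unit increase in that variable, holding the others fixed.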

  20. Linear models for calculating digestible energy for sheep diets.

    PubMed

    Fonnesbeck, P V; Christiansen, M L; Harris, L E

    1981-05-01

    Equations for estimating the digestible energy (DE) content of sheep diets were generated from the chemical contents and a factorial description of diets fed to lambs in digestion trials. The diet factors were two forages (alfalfa and grass hay), harvested at three stages of maturity (late vegetative, early bloom and full bloom), fed in two ingredient combinations (all hay or a 50:50 hay and corn grain mixture) and prepared by two forage texture processes (coarsely chopped or finely chopped and pelleted). The 2 x 3 x 2 x 2 factorial arrangement produced 24 diet treatments. These were replicated twice, for a total of 48 lamb digestion trials. In model 1 regression equations, DE was calculated directly from chemical composition of the diet. In model 2, regression equations predicted the percentage of digested nutrient from the chemical contents of the diet and then DE of the diet was calculated as the sum of the gross energy of the digested organic components. Expanded forms of model 1 and model 2 were also developed that included diet factors as qualitative indicator variables to adjust the regression constant and regression coefficients for the diet description. The expanded forms of the equations accounted for significantly more variation in DE than did the simple models and more accurately estimated DE of the diet. Information provided by the diet description proved as useful as chemical analyses for the prediction of digestibility of nutrients. The statistics indicate that, with model 1, neutral detergent fiber and plant cell wall analyses provided as much information for the estimation of DE as did model 2 with the combined information from crude protein, available carbohydrate, total lipid, cellulose and hemicellulose. Regression equations are presented for estimating DE with the most currently analyzed organic components, including linear and curvilinear variables and diet factors that significantly reduce the standard error of the estimate. 
To estimate DE of a diet, the user utilizes the equation that uses the chemical analysis information and diet description most effectively.

  1. Classification and regression tree analysis of acute-on-chronic hepatitis B liver failure: Seeing the forest for the trees.

    PubMed

    Shi, K-Q; Zhou, Y-Y; Yan, H-D; Li, H; Wu, F-L; Xie, Y-Y; Braddock, M; Lin, X-Y; Zheng, M-H

    2017-02-01

    At present, there is no ideal model for predicting the short-term outcome of patients with acute-on-chronic hepatitis B liver failure (ACHBLF). This study aimed to establish and validate a prognostic model by using the classification and regression tree (CART) analysis. A total of 1047 patients from two separate medical centres with suspected ACHBLF were screened in the study and served as the derivation cohort and validation cohort, respectively. CART analysis was applied to predict the 3-month mortality of patients with ACHBLF. The accuracy of the CART model was tested using the area under the receiver operating characteristic curve, which was compared with the model for end-stage liver disease (MELD) score and a new logistic regression model. CART analysis identified four variables as prognostic factors of ACHBLF: total bilirubin, age, serum sodium and INR, and three distinct risk groups: low risk (4.2%), intermediate risk (30.2%-53.2%) and high risk (81.4%-96.9%). The new logistic regression model was constructed with four independent factors, including age, total bilirubin, serum sodium and prothrombin activity by multivariate logistic regression analysis. The performance of the CART model (0.896), similar to that of the logistic regression model (0.914, P=.382), exceeded that of the MELD score (0.667, P<.001). The results were confirmed in the validation cohort. We have developed and validated a novel CART model superior to MELD for predicting three-month mortality of patients with ACHBLF. Thus, the CART model could facilitate medical decision-making and provide clinicians with a validated practical bedside tool for ACHBLF risk stratification. © 2016 John Wiley & Sons Ltd.

  2. Forecasting space weather: Can new econometric methods improve accuracy?

    NASA Astrophysics Data System (ADS)

    Reikard, Gordon

    2011-06-01

    Space weather forecasts are currently used in areas ranging from navigation and communication to electric power system operations. The relevant forecast horizons can range from as little as 24 h to several days. This paper analyzes the predictability of two major space weather measures using new time series methods, many of them derived from econometrics. The data sets are the Ap geomagnetic index and the solar radio flux at 10.7 cm. The methods tested include nonlinear regressions, neural networks, frequency domain algorithms, GARCH models (which utilize the residual variance), state transition models, and models that combine elements of several techniques. While combined models are complex, they can be programmed using modern statistical software. The data frequency is daily, and forecasting experiments are run over horizons ranging from 1 to 7 days. Two major conclusions stand out. First, the frequency domain method forecasts the Ap index more accurately than any time domain model, including both regressions and neural networks. This finding is very robust, and holds for all forecast horizons. Combining the frequency domain method with other techniques yields a further small improvement in accuracy. Second, the neural network forecasts the solar flux more accurately than any other method, although at short horizons (2 days or less) the regression and net yield similar results. The neural net does best when it includes measures of the long-term component in the data.

  3. Robust functional regression model for marginal mean and subject-specific inferences.

    PubMed

    Cao, Chunzheng; Shi, Jian Qing; Lee, Youngjo

    2017-01-01

    We introduce flexible robust functional regression models, using various heavy-tailed processes, including a Student t-process. We propose efficient algorithms in estimating parameters for the marginal mean inferences and in predicting conditional means as well as interpolation and extrapolation for the subject-specific inferences. We develop bootstrap prediction intervals (PIs) for conditional mean curves. Numerical studies show that the proposed model provides a robust approach against data contamination or distribution misspecification, and the proposed PIs maintain the nominal confidence levels. A real data application is presented as an illustrative example.

  4. Inverse models: A necessary next step in ground-water modeling

    USGS Publications Warehouse

    Poeter, E.P.; Hill, M.C.

    1997-01-01

    Inverse models using, for example, nonlinear least-squares regression, provide capabilities that help modelers take full advantage of the insight available from ground-water models. However, lack of information about the requirements and benefits of inverse models is an obstacle to their widespread use. This paper presents a simple ground-water flow problem to illustrate the requirements and benefits of the nonlinear least-squares regression method of inverse modeling and discusses how these attributes apply to field problems. The benefits of inverse modeling include: (1) expedited determination of best fit parameter values; (2) quantification of the (a) quality of calibration, (b) data shortcomings and needs, and (c) confidence limits on parameter estimates and predictions; and (3) identification of issues that are easily overlooked during nonautomated calibration.

  5. Multicollinearity and Regression Analysis

    NASA Astrophysics Data System (ADS)

    Daoud, Jamal I.

    2017-12-01

    In regression analysis it is obvious to have a correlation between the response and predictor(s), but having correlation among predictors is something undesired. The number of predictors included in the regression model depends on many factors, among which are historical data, experience, etc. In the end, the selection of the most important predictors is something objective due to the researcher. Multicollinearity is a phenomenon in which two or more predictors are correlated; if this happens, the standard errors of the coefficients will increase [8]. Increased standard errors mean that the coefficients for some or all independent variables may be found not to be significantly different from 0. In other words, by overinflating the standard errors, multicollinearity makes some variables statistically insignificant when they should be significant. In this paper we focus on multicollinearity, its reasons, and its consequences for the reliability of the regression model.
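
    The link between correlated predictors and inflated standard errors can be sketched with the variance inflation factor (VIF). The snippet below is a minimal illustration, not taken from the paper: it assumes a model with exactly two predictors, in which case VIF = 1 / (1 - r^2), where r is their sample correlation. The function names and the data are made up for the example.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """Variance inflation factor for a model with exactly two predictors.

    With two predictors, the R^2 from regressing one on the other equals r^2,
    so both coefficients share the same VIF = 1 / (1 - r^2).
    """
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)

# Illustrative (made-up) data: x2 is nearly a linear function of x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
vif = vif_two_predictors(x1, x2)
# The standard error of each coefficient grows by sqrt(VIF) relative to
# the uncorrelated case.
print(f"VIF = {vif:.1f}, standard-error inflation = {math.sqrt(vif):.1f}x")
```

    A VIF of 1 means no inflation; with more than two predictors the same idea applies, but r^2 is replaced by the R^2 from regressing each predictor on all the others.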

  6. A comparison of different ways of including baseline counts in negative binomial models for data from falls prevention trials.

    PubMed

    Zheng, Han; Kimber, Alan; Goodwin, Victoria A; Pickering, Ruth M

    2018-01-01

    A common design for a falls prevention trial is to assess falling at baseline, randomize participants into an intervention or control group, and ask them to record the number of falls they experience during a follow-up period of time. This paper addresses how best to include the baseline count in the analysis of the follow-up count of falls in negative binomial (NB) regression. We examine the performance of various approaches in simulated datasets where both counts are generated from a mixed Poisson distribution with shared random subject effect. Including the baseline count after log-transformation as a regressor in NB regression (NB-logged) or as an offset (NB-offset) resulted in greater power than including the untransformed baseline count (NB-unlogged). Cook and Wei's conditional negative binomial (CNB) model replicates the underlying process generating the data. In our motivating dataset, a statistically significant intervention effect resulted from the NB-logged, NB-offset, and CNB models, but not from NB-unlogged, and large, outlying baseline counts were overly influential in NB-unlogged but not in NB-logged. We conclude that there is little to lose by including the log-transformed baseline count in standard NB regression compared to CNB for moderate to larger sized datasets. © 2017 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  7. GLOBALLY ADAPTIVE QUANTILE REGRESSION WITH ULTRA-HIGH DIMENSIONAL DATA

    PubMed Central

    Zheng, Qi; Peng, Limin; He, Xuming

    2015-01-01

    Quantile regression has become a valuable tool to analyze heterogeneous covariate-response associations that are often encountered in practice. The development of quantile regression methodology for high dimensional covariates primarily focuses on examination of model sparsity at a single or multiple quantile levels, which are typically prespecified ad hoc by the users. The resulting models may be sensitive to the specific choices of the quantile levels, leading to difficulties in interpretation and erosion of confidence in the results. In this article, we propose a new penalization framework for quantile regression in the high dimensional setting. We employ adaptive L1 penalties, and more importantly, propose a uniform selector of the tuning parameter for a set of quantile levels to avoid some of the potential problems with model selection at individual quantile levels. Our proposed approach achieves consistent shrinkage of regression quantile estimates across a continuous range of quantile levels, enhancing the flexibility and robustness of the existing penalized quantile regression methods. Our theoretical results include the oracle rate of uniform convergence and weak convergence of the parameter estimators. We also use numerical studies to confirm our theoretical findings and illustrate the practical utility of our proposal. PMID:26604424
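
    For orientation, the two ingredients named in the abstract, the quantile (check) loss and an adaptive L1 penalty, can be written down in a few lines. This is a generic sketch of the standard definitions, not the authors' estimator or their uniform tuning-parameter selector; the function names, the weights, and the example numbers are assumptions made for illustration.

```python
def check_loss(u, tau):
    """Quantile check loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def adaptive_l1_penalty(coefs, weights, lam):
    """Adaptive L1 penalty: lam * sum_j w_j * |b_j|, one weight per coefficient."""
    return lam * sum(w * abs(b) for w, b in zip(weights, coefs))

def penalized_objective(residuals, coefs, weights, tau, lam):
    """Average check loss over the residuals plus the adaptive L1 penalty."""
    fit = sum(check_loss(r, tau) for r in residuals) / len(residuals)
    return fit + adaptive_l1_penalty(coefs, weights, lam)

# At tau = 0.25, under-prediction (positive residual) is penalized less
# heavily than over-prediction of the same size.
print(check_loss(2.0, 0.25), check_loss(-2.0, 0.25))
```

    Minimizing such an objective at a single tau gives one penalized quantile fit; the abstract's contribution concerns choosing the tuning parameter uniformly over a set of tau values, which this sketch does not attempt.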

  8. INNOVATIVE INSTRUMENTATION AND ANALYSIS OF THE TEMPERATURE MEASUREMENT FOR HIGH TEMPERATURE GASIFICATION

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Seong W. Lee

    During this reporting period, the literature survey including the gasifier temperature measurement literature, the ultrasonic application and its background study in cleaning application, and spray coating process are completed. The gasifier simulator (cold model) testing has been successfully conducted. Four factors (blower voltage, ultrasonic application, injection time intervals, particle weight) were considered as significant factors that affect the temperature measurement. The Analysis of Variance (ANOVA) was applied to analyze the test data. The analysis shows that all four factors are significant to the temperature measurements in the gasifier simulator (cold model). The regression analysis for the case with the normalized room temperature shows that linear model fits the temperature data with 82% accuracy (18% error). The regression analysis for the case without the normalized room temperature shows 72.5% accuracy (27.5% error). The nonlinear regression analysis indicates a better fit than that of the linear regression. The nonlinear regression model's accuracy is 88.7% (11.3% error) for normalized room temperature case, which is better than the linear regression analysis. The hot model thermocouple sleeve design and fabrication are completed. The gasifier simulator (hot model) design and the fabrication are completed. The system tests of the gasifier simulator (hot model) have been conducted and some modifications have been made. Based on the system tests and results analysis, the gasifier simulator (hot model) has met the proposed design requirement and is ready for system test. The ultrasonic cleaning method is under evaluation and will be further studied for the gasifier simulator (hot model) application. The progress of this project has been on schedule.

  9. RBF kernel based support vector regression to estimate the blood volume and heart rate responses during hemodialysis.

    PubMed

    Javed, Faizan; Chan, Gregory S H; Savkin, Andrey V; Middleton, Paul M; Malouf, Philip; Steel, Elizabeth; Mackie, James; Lovell, Nigel H

    2009-01-01

    This paper uses non-linear support vector regression (SVR) to model the blood volume and heart rate (HR) responses in 9 hemodynamically stable kidney failure patients during hemodialysis. Using radial basis function (RBF) kernels, the non-parametric models of relative blood volume (RBV) change with time as well as percentage change in HR with respect to RBV were obtained. The e-insensitivity based loss function was used for SVR modeling. Selection of the design parameters, which include capacity (C), insensitivity region (e) and the RBF kernel parameter (sigma), was made based on a grid search approach, and the selected models were cross-validated using the average mean square error (AMSE) calculated from testing data based on a k-fold cross-validation technique. Linear regression was also applied to fit the curves and the AMSE was calculated for comparison with SVR. For the model based on RBV with time, SVR gave a lower AMSE for both training (AMSE=1.5) as well as testing data (AMSE=1.4) compared to linear regression (AMSE=1.8 and 1.5). SVR also provided a better fit for HR with RBV for both training as well as testing data (AMSE=15.8 and 16.4) compared to linear regression (AMSE=25.2 and 20.1).
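
    The two building blocks mentioned in the abstract, the RBF kernel and the insensitivity-based loss, have standard textbook forms that can be sketched as follows. The sigma parameterization shown here, exp(-||x - z||^2 / (2 sigma^2)), is one common convention and is an assumption; the abstract does not state which form the authors used. Function names are illustrative.

```python
import math

def rbf_kernel(x, z, sigma):
    """Gaussian RBF kernel k(x, z) = exp(-||x - z||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def eps_insensitive_loss(y_true, y_pred, eps):
    """Loss is zero inside the tube |y_true - y_pred| <= eps, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - eps)

# Identical points have kernel value 1; the value decays with distance.
print(rbf_kernel([0.0, 0.0], [0.0, 0.0], sigma=1.0))
# Errors smaller than eps cost nothing.
print(eps_insensitive_loss(1.0, 1.05, eps=0.1))
```

    In a grid search such as the one described, C, the tube width, and sigma would each be varied over a candidate set and scored by cross-validated error.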

  10. Genetic prediction of type 2 diabetes using deep neural network.

    PubMed

    Kim, J; Kim, J; Kwak, M J; Bajaj, M

    2018-04-01

    Type 2 diabetes (T2DM) has strong heritability but genetic models to explain heritability have been challenging. We tested a deep neural network (DNN) to predict T2DM using the nested case-control study of Nurses' Health Study (3326 females, 45.6% T2DM) and Health Professionals Follow-up Study (2502 males, 46.5% T2DM). We selected 96, 214, 399, and 678 single-nucleotide polymorphisms (SNPs) through Fisher's exact test and L1-penalized logistic regression. We split each dataset randomly in 4:1 to train prediction models and test their performance. DNN and logistic regressions showed better area under the curve (AUC) of ROC curves than the clinical model when 399 or more SNPs were included. DNN was superior to logistic regressions in AUC with 399 or more SNPs in males and 678 SNPs in females. Addition of clinical factors consistently increased AUC of DNN but failed to improve logistic regressions with 214 or more SNPs. In conclusion, we show that DNN can be a versatile tool to predict T2DM incorporating large numbers of SNPs and clinical information. Limitations include a relatively small number of subjects, mostly of European ethnicity. Further studies are warranted to confirm and improve performance of genetic prediction models using DNN in different ethnic groups. © 2017 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.

  11. Using occupancy modeling and logistic regression to assess the distribution of shrimp species in lowland streams, Costa Rica: Does regional groundwater create favorable habitat?

    USGS Publications Warehouse

    Snyder, Marcia; Freeman, Mary C.; Purucker, S. Thomas; Pringle, Catherine M.

    2016-01-01

    Freshwater shrimps are an important biotic component of tropical ecosystems. However, they can have a low probability of detection when abundances are low. We sampled 3 of the most common freshwater shrimp species, Macrobrachium olfersii, Macrobrachium carcinus, and Macrobrachium heterochirus, and used occupancy modeling and logistic regression models to improve our limited knowledge of distribution of these cryptic species by investigating both local- and landscape-scale effects at La Selva Biological Station in Costa Rica. Local-scale factors included substrate type and stream size, and landscape-scale factors included presence or absence of regional groundwater inputs. Capture rates for 2 of the sampled species (M. olfersii and M. carcinus) were sufficient to compare the fit of occupancy models. Occupancy models did not converge for M. heterochirus, but M. heterochirus had high enough occupancy rates that logistic regression could be used to model the relationship between occupancy rates and predictors. The best-supported models for M. olfersii and M. carcinus included conductivity, discharge, and substrate parameters. Stream size was positively correlated with occupancy rates of all 3 species. High stream conductivity, which reflects the quantity of regional groundwater input into the stream, was positively correlated with M. olfersii occupancy rates. Boulder substrates increased occupancy rate of M. carcinus and decreased the detection probability of M. olfersii. Our models suggest that shrimp distribution is driven by factors that function at local (substrate and discharge) and landscape (conductivity) scales.

  12. The creation and evaluation of a model to simulate the probability of conception in seasonal-calving pasture-based dairy heifers.

    PubMed

    Fenlon, Caroline; O'Grady, Luke; Butler, Stephen; Doherty, Michael L; Dunnion, John

    2017-01-01

    Herd fertility in pasture-based dairy farms is a key driver of farm economics. Models for predicting nulliparous reproductive outcomes are rare, but age, genetics, weight, and BCS have been identified as factors influencing heifer conception. The aim of this study was to create a simulation model of heifer conception to service with thorough evaluation. Artificial Insemination service records from two research herds and ten commercial herds were provided to build and evaluate the models. All were managed as spring-calving pasture-based systems. The factors studied were related to age, genetics, and time of service. The data were split into training and testing sets and bootstrapping was used to train the models. Logistic regression (with and without random effects) and generalised additive modelling were selected as the model-building techniques. Two types of evaluation were used to test the predictive ability of the models: discrimination and calibration. Discrimination, which includes sensitivity, specificity, accuracy and ROC analysis, measures a model's ability to distinguish between positive and negative outcomes. Calibration measures the accuracy of the predicted probabilities with the Hosmer-Lemeshow goodness-of-fit, calibration plot and calibration error. After data cleaning and the removal of services with missing values, 1396 services remained to train the models and 597 were left for testing. Age, breed, genetic predicted transmitting ability for calving interval, month and year were significant in the multivariate models. The regression models also included an interaction between age and month. Year within herd was a random effect in the mixed regression model. Overall prediction accuracy was between 77.1% and 78.9%. All three models had very high sensitivity, but low specificity. The two regression models were very well-calibrated. The mean absolute calibration errors were all below 4%. 
Because the models were not adept at identifying unsuccessful services, they are not suggested for use in predicting the outcome of individual heifer services. Instead, they are useful for the comparison of services with different covariate values or as sub-models in whole-farm simulations. The mixed regression model was identified as the best model for prediction, as the random effects can be ignored and the other variables can be easily obtained or simulated.
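
    The discrimination measures named in the abstract follow directly from a confusion matrix. The sketch below uses made-up counts (they are not the study's data) chosen to mimic the reported pattern of high sensitivity with low specificity; the function name is illustrative.

```python
def discrimination_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # share of true conceptions detected
        "specificity": tn / (tn + fp),   # share of failed services detected
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# Hypothetical counts, for illustration only.
metrics = discrimination_metrics(tp=90, fp=80, tn=20, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

    A model can score well on sensitivity while rarely flagging unsuccessful outcomes, which is why the abstract reports both measures alongside calibration.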

  13. Information, Avoidance Behavior, and Health: The Effect of Ozone on Asthma Hospitalizations

    ERIC Educational Resources Information Center

    Neidell, Matthew

    2009-01-01

    This paper assesses whether responses to information about risk impact estimates of the relationship between ozone and asthma in Southern California. Using a regression discontinuity design, I find smog alerts significantly reduce daily attendance at two major outdoor facilities. Using daily time-series regression models that include year-month…

  14. Multiple Logistic Regression Analysis of Cigarette Use among High School Students

    ERIC Educational Resources Information Center

    Adwere-Boamah, Joseph

    2011-01-01

    A binary logistic regression analysis was performed to predict high school students' cigarette smoking behavior from selected predictors from 2009 CDC Youth Risk Behavior Surveillance Survey. The specific target student behavior of interest was frequent cigarette use. Five predictor variables included in the model were: a) race, b) frequency of…

  15. Length bias correction in gene ontology enrichment analysis using logistic regression.

    PubMed

    Mi, Gu; Di, Yanming; Emerson, Sarah; Cumbie, Jason S; Chang, Jeff H

    2012-01-01

    When assessing differential gene expression from RNA sequencing data, commonly used statistical tests tend to have greater power to detect differential expression of genes encoding longer transcripts. This phenomenon, called "length bias", will influence subsequent analyses such as Gene Ontology enrichment analysis. In the presence of length bias, Gene Ontology categories that include longer genes are more likely to be identified as enriched. These categories, however, are not necessarily biologically more relevant. We show that one can effectively adjust for length bias in Gene Ontology analysis by including transcript length as a covariate in a logistic regression model. The logistic regression model makes the statistical issue underlying length bias more transparent: transcript length becomes a confounding factor when it correlates with both the Gene Ontology membership and the significance of the differential expression test. The inclusion of the transcript length as a covariate allows one to investigate the direct correlation between the Gene Ontology membership and the significance of testing differential expression, conditional on the transcript length. We present both real and simulated data examples to show that the logistic regression approach is simple, effective, and flexible.

  16. Global-scale high-resolution (~1 km) modelling of mean, maximum and minimum annual streamflow

    NASA Astrophysics Data System (ADS)

    Barbarossa, Valerio; Huijbregts, Mark; Hendriks, Jan; Beusen, Arthur; Clavreul, Julie; King, Henry; Schipper, Aafke

    2017-04-01

    Quantifying mean, maximum and minimum annual flow (AF) of rivers at ungauged sites is essential for a number of applications, including assessments of global water supply, ecosystem integrity and water footprints. AF metrics can be quantified with spatially explicit process-based models, which might be overly time-consuming and data-intensive for this purpose, or with empirical regression models that predict AF metrics based on climate and catchment characteristics. Yet, so far, regression models have mostly been developed at a regional scale and the extent to which they can be extrapolated to other regions is not known. We developed global-scale regression models that quantify mean, maximum and minimum AF as function of catchment area and catchment-averaged slope, elevation, and mean, maximum and minimum annual precipitation and air temperature. We then used these models to obtain global 30 arc-seconds (~1 km) maps of mean, maximum and minimum AF for each year from 1960 through 2015, based on a newly developed hydrologically conditioned digital elevation model. We calibrated our regression models based on observations of discharge and catchment characteristics from about 4,000 catchments worldwide, ranging from 100 to 10^6 km^2 in size, and validated them against independent measurements as well as the output of a number of process-based global hydrological models (GHMs). The variance explained by our regression models ranged up to 90% and the performance of the models compared well with the performance of existing GHMs. Yet, our AF maps provide a level of spatial detail that cannot yet be achieved by current GHMs.

  17. Improved estimation of PM2.5 using Lagrangian satellite-measured aerosol optical depth

    NASA Astrophysics Data System (ADS)

    Olivas Saunders, Rolando

    Suspended particulate matter (aerosols) with aerodynamic diameters less than 2.5 μm (PM2.5) has negative effects on human health, plays an important role in climate change and also causes the corrosion of structures by acid deposition. Accurate estimates of PM2.5 concentrations are thus relevant in air quality, epidemiology, cloud microphysics and climate forcing studies. Aerosol optical depth (AOD) retrieved by the Moderate Resolution Imaging Spectroradiometer (MODIS) satellite instrument has been used as an empirical predictor to estimate ground-level concentrations of PM2.5. These estimates usually have large uncertainties and errors. The main objective of this work is to assess the value of using upwind (Lagrangian) MODIS-AOD as predictors in empirical models of PM2.5. The upwind locations of the Lagrangian AOD were estimated using modeled backward air trajectories. Since the specification of an arrival elevation is somewhat arbitrary, trajectories were calculated to arrive at four different elevations at ten measurement sites within the continental United States. A systematic examination revealed trajectory model calculations to be sensitive to starting elevation. With a 500 m difference in starting elevation, the 48-hr mean horizontal separation of trajectory endpoints was 326 km. When the difference in starting elevation was doubled and tripled to 1000 m and 1500 m, the mean horizontal separation of trajectory endpoints approximately doubled and tripled to 627 km and 886 km, respectively. A seasonal dependence of this sensitivity was also found: the smallest mean horizontal separation of trajectory endpoints was exhibited during the summer and the largest separations during the winter. A daily average AOD product was generated and coupled to the trajectory model in order to determine AOD values upwind of the measurement sites during the period 2003-2007. 
Empirical models that included in situ AOD and upwind AOD as predictors of PM2.5 were generated by multivariate linear regressions using the least squares method. The multivariate models showed improved performance over the single variable regression (PM2.5 and in situ AOD) models. The statistical significance of the improvement of the multivariate models over the single variable regression models was tested using the extra sum of squares principle. In many cases, even when the R-squared was high for the multivariate models, the improvement over the single models was not statistically significant. The R-squared of these multivariate models varied with respect to seasons, with the best performance occurring during the summer months. A set of seasonal categorical variables was included in the regressions to exploit this variability. The multivariate regression models that included these categorical seasonal variables performed better than the models that didn't account for seasonal variability. Furthermore, 71% of these regressions exhibited improvement over the single variable models that was statistically significant at a 95% confidence level.
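
    The extra sum of squares principle mentioned above compares a reduced model (here, in situ AOD only) with a full model (in situ plus upwind AOD) through an F statistic. The sketch below shows the standard formula; the function name and the example numbers are assumptions for illustration and are not results from this work.

```python
def extra_sum_of_squares_f(sse_reduced, sse_full, df_reduced, df_full):
    """F statistic for nested least-squares models.

    sse_*: residual sums of squares; df_*: residual degrees of freedom
    (number of observations minus number of fitted parameters).
    """
    extra_ss_per_df = (sse_reduced - sse_full) / (df_reduced - df_full)
    full_mse = sse_full / df_full
    return extra_ss_per_df / full_mse

# Hypothetical example: adding one predictor lowers the SSE from 120 to 100
# with 49 observations (47 residual degrees of freedom in the full model).
f_stat = extra_sum_of_squares_f(120.0, 100.0, df_reduced=48, df_full=47)
print(f"F = {f_stat:.2f}")
```

    The statistic is then compared with an F distribution with (df_reduced - df_full, df_full) degrees of freedom; a higher R-squared alone does not guarantee that this test rejects, which matches the pattern the abstract reports.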

  18. Novel forecasting approaches using combination of machine learning and statistical models for flood susceptibility mapping.

    PubMed

    Shafizadeh-Moghadam, Hossein; Valavi, Roozbeh; Shahabi, Himan; Chapi, Kamran; Shirzadi, Ataollah

    2018-07-01

    In this research, eight individual machine learning and statistical models are implemented and compared, and based on their results, seven ensemble models for flood susceptibility assessment are introduced. The individual models included artificial neural networks, classification and regression trees, flexible discriminant analysis, generalized linear model, generalized additive model, boosted regression trees, multivariate adaptive regression splines, and maximum entropy, and the ensemble models were Ensemble Model committee averaging (EMca), Ensemble Model confidence interval Inferior (EMciInf), Ensemble Model confidence interval Superior (EMciSup), Ensemble Model to estimate the coefficient of variation (EMcv), Ensemble Model to estimate the mean (EMmean), Ensemble Model to estimate the median (EMmedian), and Ensemble Model based on weighted mean (EMwmean). The data set covered 201 flood events in the Haraz watershed (Mazandaran province in Iran) and 10,000 randomly selected non-occurrence points. Among the individual models, the highest Area Under the Receiver Operating Characteristic (AUROC) belonged to boosted regression trees (0.975) and the lowest value was recorded for generalized linear model (0.642). On the other hand, the proposed EMmedian resulted in the highest accuracy (0.976) among all models. In spite of the outstanding performance of some models, variability among the predictions of individual models was considerable. Therefore, to reduce uncertainty and create more generalizable, more stable, and less sensitive models, ensemble forecasting approaches, and in particular the EMmedian, are recommended for flood susceptibility assessment. Copyright © 2018 Elsevier Ltd. All rights reserved.
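
    Several of the ensemble rules listed above reduce to simple summaries of the individual models' predicted probabilities at one location. The sketch below is an illustration under stated assumptions, not the authors' implementation: committee averaging is taken to be the share of models predicting occurrence at a 0.5 threshold, the weights are supplied by the caller, and the probabilities are made up.

```python
from statistics import mean, median

def ensemble_scores(probs, weights, threshold=0.5):
    """Combine per-model flood probabilities for a single location."""
    votes = [1.0 if p >= threshold else 0.0 for p in probs]
    return {
        "EMmean": mean(probs),
        "EMmedian": median(probs),
        # Weighted mean, e.g. with weights derived from each model's skill.
        "EMwmean": sum(w * p for w, p in zip(weights, probs)) / sum(weights),
        # Committee averaging: fraction of models voting "flood".
        "EMca": mean(votes),
    }

# Hypothetical predictions from three models, the third weighted double.
scores = ensemble_scores(probs=[0.2, 0.6, 0.9], weights=[1.0, 1.0, 2.0])
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

    The median is less affected by a single outlying model than the mean, which is one plausible reading of why EMmedian is singled out in the abstract.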

  19. Application of Boosting Regression Trees to Preliminary Cost Estimation in Building Construction Projects

    PubMed Central

    2015-01-01

    Among the recent data mining techniques available, the boosting approach has attracted a great deal of attention because of its effective learning algorithm and strong boundaries in terms of its generalization performance. However, the boosting approach has yet to be used in regression problems within the construction domain, including cost estimations, but has been actively utilized in other domains. Therefore, a boosting regression tree (BRT) is applied to cost estimations at the early stage of a construction project to examine the applicability of the boosting approach to a regression problem within the construction domain. To evaluate the performance of the BRT model, its performance was compared with that of a neural network (NN) model, which has been proven to have a high performance in cost estimation domains. The BRT model has shown results similar to those of NN model using 234 actual cost datasets of a building construction project. In addition, the BRT model can provide additional information such as the importance plot and structure model, which can support estimators in comprehending the decision making process. Consequently, the boosting approach has potential applicability in preliminary cost estimations in a building construction project. PMID:26339227

  20. Application of Boosting Regression Trees to Preliminary Cost Estimation in Building Construction Projects.

    PubMed

    Shin, Yoonseok

    2015-01-01

    Among the recent data mining techniques available, the boosting approach has attracted a great deal of attention because of its effective learning algorithm and strong boundaries in terms of its generalization performance. However, the boosting approach has yet to be used in regression problems within the construction domain, including cost estimations, but has been actively utilized in other domains. Therefore, a boosting regression tree (BRT) is applied to cost estimations at the early stage of a construction project to examine the applicability of the boosting approach to a regression problem within the construction domain. To evaluate the performance of the BRT model, its performance was compared with that of a neural network (NN) model, which has been proven to have a high performance in cost estimation domains. The BRT model has shown results similar to those of NN model using 234 actual cost datasets of a building construction project. In addition, the BRT model can provide additional information such as the importance plot and structure model, which can support estimators in comprehending the decision making process. Consequently, the boosting approach has potential applicability in preliminary cost estimations in a building construction project.

  1. A simulation study on Bayesian Ridge regression models for several collinearity levels

    NASA Astrophysics Data System (ADS)

    Efendi, Achmad; Effrihan

    2017-12-01

    When analyzing data with a multiple regression model, if there are collinearities, one or several predictor variables are usually omitted from the model. However, there are sometimes reasons, for instance medical or economic ones, why the predictors are all important and should be included in the model. Ridge regression is not uncommonly used in research to cope with collinearity. In this modeling, weights for the predictor variables are used for estimating parameters. The subsequent estimation process could follow the concept of likelihood. Furthermore, the Bayesian version of the estimation could nowadays be an alternative. This estimation method does not match the likelihood one in terms of popularity due to some difficulties, such as computation. Nevertheless, with the recent improvement of computational methodology, this caveat should no longer be a problem. This paper discusses a simulation process for evaluating the characteristics of Bayesian Ridge regression parameter estimates. There are several simulation settings based on a variety of collinearity levels and sample sizes. The results show that the Bayesian method gives better performance for relatively small sample sizes, and for other settings the method performs similarly to the likelihood method.

  2. A New SEYHAN's Approach in Case of Heterogeneity of Regression Slopes in ANCOVA.

    PubMed

    Ankarali, Handan; Cangur, Sengul; Ankarali, Seyit

    2018-06-01

    In this study, a new approach named SEYHAN is suggested for using conventional ANCOVA, instead of robust or nonlinear ANCOVA, when the assumptions of linearity and homogeneity of regression slopes of conventional ANCOVA are not met. The proposed SEYHAN's approach involves transformation of the continuous covariate into a categorical structure when the relationship between the covariate and the dependent variable is nonlinear and the regression slopes are not homogeneous. A simulated data set was used to explain SEYHAN's approach. In this approach, after the MARS method was used to categorize the covariate, we performed conventional ANCOVA in each subgroup constituted according to the knot values, as well as analysis of variance with a two-factor model. The first model is simpler than the second model, which includes an interaction term. Since the model with the interaction effect has more subjects, the power of the test also increases and the existing significant difference is revealed better. We can say that, with the help of this approach, linearity and homogeneity of regression slopes are not a problem for data analysis by the conventional linear ANCOVA model. It can be used quickly and efficiently in the presence of one or more covariates.

  3. Application of logistic regression for landslide susceptibility zoning of Cekmece Area, Istanbul, Turkey

    NASA Astrophysics Data System (ADS)

    Duman, T. Y.; Can, T.; Gokceoglu, C.; Nefeslioglu, H. A.; Sonmez, H.

    2006-11-01

    As a result of industrialization, throughout the world, cities have been growing rapidly for the last century. One typical example of these growing cities is Istanbul, the population of which is over 10 million. Due to rapid urbanization, new areas suitable for settlement and engineering structures are necessary. The Cekmece area located west of the Istanbul metropolitan area is studied, because the landslide activity is extensive in this area. The purpose of this study is to develop a model that can be used to characterize landslide susceptibility in map form using logistic regression analysis of an extensive landslide database. A database of landslide activity was constructed using both aerial-photography and field studies. About 19.2% of the selected study area is covered by deep-seated landslides. The landslides that occur in the area are primarily located in sandstones with interbedded permeable and impermeable layers such as claystone, siltstone and mudstone. About 31.95% of the total landslide area is located at this unit. To apply logistic regression analyses, a data matrix including 37 variables was constructed. The variables used in the forward stepwise analyses are different measures of slope, aspect, elevation, stream power index (SPI), plan curvature, profile curvature, geology, geomorphology and relative permeability of lithological units. A total of 25 variables were identified as exerting strong influence on landslide occurrence, and included in the logistic regression equation. Wald statistics values indicate that lithology, SPI and slope are more important than the other parameters in the equation. Beta coefficients of the 25 variables included in the logistic regression equation provide a model for landslide susceptibility in the Cekmece area. This model is used to generate a landslide susceptibility map that correctly classified 83.8% of the landslide-prone areas.

  4. Associations between dietary and lifestyle risk factors and colorectal cancer in the Scottish population.

    PubMed

    Theodoratou, Evropi; Farrington, Susan M; Tenesa, Albert; McNeill, Geraldine; Cetnarskyj, Roseanne; Korakakis, Emmanouil; Din, Farhat V N; Porteous, Mary E; Dunlop, Malcolm G; Campbell, Harry

    2014-01-01

    Colorectal cancer (CRC) accounts for 9.7% of all cancer cases and for 8% of all cancer-related deaths. Established risk factors include personal or family history of CRC as well as lifestyle and dietary factors. We investigated the relationship between CRC and demographic, lifestyle, food and nutrient risk factors through a case-control study that included 2062 patients and 2776 controls from Scotland. Forward and backward stepwise regression was applied and the stability of the models was assessed in 1000 bootstrap samples. The variables that were automatically selected to be included by the forward or backward stepwise regression and whose selection was verified by bootstrap sampling in the current study were family history, dietary energy, 'high-energy snack foods', eggs, juice, sugar-sweetened beverages and white fish (associated with an increased CRC risk) and NSAIDs, coffee and magnesium (associated with a decreased CRC risk). Application of forward and backward stepwise regression in this CRC study identified some already established as well as some novel potential risk factors. Bootstrap findings suggest that examination of the stability of regression models by bootstrap sampling is useful in the interpretation of study findings. 'High-energy snack foods' and high-energy drinks (including sugar-sweetened beverages and fruit juices) as risk factors for CRC have not been reported previously and merit further investigation as such snacks and beverages are important contributors in European and North American diets.

  5. Mapping the spatial pattern of temperate forest above ground biomass by integrating airborne lidar with Radarsat-2 imagery via geostatistical models

    NASA Astrophysics Data System (ADS)

    Li, Wang; Niu, Zheng; Gao, Shuai; Wang, Cheng

    2014-11-01

    Light Detection and Ranging (LiDAR) and Synthetic Aperture Radar (SAR) are two competitive active remote sensing techniques in forest above ground biomass estimation, which is important for forest management and global climate change study. This study aims to further explore their capabilities in temperate forest above ground biomass (AGB) estimation by emphasizing the spatial auto-correlation of variables obtained from these two remote sensing tools, which is an aspect usually overlooked in remote sensing applications to vegetation studies. Remote sensing variables including airborne LiDAR metrics, backscattering coefficient for different SAR polarizations and their ratio variables for Radarsat-2 imagery were calculated. First, a simple linear regression (SLR) model was established between the field-estimated above ground biomass and the remote sensing variables. Pearson's correlation coefficient (R2) was used to find which LiDAR metric showed the most significant correlation with the regression residuals and could be selected as co-variable in regression co-kriging (RCoKrig). Second, regression co-kriging was conducted by choosing the regression residuals as dependent variable and the LiDAR metric (Hmean) with highest R2 as co-variable. Third, above ground biomass over the study area was estimated using the SLR model and the RCoKrig model, respectively. The results for these two models were validated using the same ground points. Results showed that both of these methods achieved satisfactory prediction accuracy, while regression co-kriging showed the lower estimation error. It is shown that the regression co-kriging model is feasible and effective in mapping the spatial pattern of AGB in the temperate forest using Radarsat-2 data calibrated by airborne LiDAR metrics.

  6. A computational approach to compare regression modelling strategies in prediction research.

    PubMed

    Pajouheshnia, Romin; Pestman, Wiebe R; Teerenstra, Steven; Groenwold, Rolf H H

    2016-08-25

    It is often unclear which approach to fit, assess and adjust a model will yield the most accurate prediction model. We present an extension of an approach for comparing modelling strategies in linear regression to the setting of logistic regression and demonstrate its application in clinical prediction research. A framework for comparing logistic regression modelling strategies by their likelihoods was formulated using a wrapper approach. Five different strategies for modelling, including simple shrinkage methods, were compared in four empirical data sets to illustrate the concept of a priori strategy comparison. Simulations were performed in both randomly generated data and empirical data to investigate the influence of data characteristics on strategy performance. We applied the comparison framework in a case study setting. Optimal strategies were selected based on the results of a priori comparisons in a clinical data set and the performance of models built according to each strategy was assessed using the Brier score and calibration plots. The performance of modelling strategies was highly dependent on the characteristics of the development data in both linear and logistic regression settings. A priori comparisons in four empirical data sets found that no strategy consistently outperformed the others. The percentage of times that a model adjustment strategy outperformed a logistic model ranged from 3.9 to 94.9 %, depending on the strategy and data set. However, in our case study setting the a priori selection of optimal methods did not result in detectable improvement in model performance when assessed in an external data set. The performance of prediction modelling strategies is a data-dependent process and can be highly variable between data sets within the same clinical domain. A priori strategy comparison can be used to determine an optimal logistic regression modelling strategy for a given data set before selecting a final modelling approach.
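
    The abstract above assesses model performance with the Brier score and calibration plots. As a minimal sketch of those two standard measures (not the authors' code), the following computes the Brier score and a binned calibration table. The function names, the equal-width binning, and the example numbers are assumptions made for illustration.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and equally long")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def calibration_bins(probs, outcomes, n_bins=10):
    """Group predictions into equal-width bins and return, per non-empty bin,
    (mean predicted probability, observed event fraction, count)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    table = []
    for members in bins:
        if members:
            n = len(members)
            table.append((sum(p for p, _ in members) / n,
                          sum(y for _, y in members) / n,
                          n))
    return table


# Hypothetical predictions and outcomes, for illustration only.
example_probs = [0.9, 0.1, 0.8, 0.3]
example_outcomes = [1, 0, 1, 0]
print(brier_score(example_probs, example_outcomes))
print(calibration_bins(example_probs, example_outcomes, n_bins=2))
```

    A lower Brier score indicates better predictions. In a well-calibrated model, the mean predicted probability in each bin is close to the observed event fraction, which is what a calibration plot displays.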

  7. Robust and efficient estimation with weighted composite quantile regression

    NASA Astrophysics Data System (ADS)

    Jiang, Xuejun; Li, Jingzhi; Xia, Tian; Yan, Wanfeng

    2016-09-01

    In this paper we introduce a weighted composite quantile regression (CQR) estimation approach and study its application in nonlinear models such as exponential models and ARCH-type models. The weighted CQR is augmented by using a data-driven weighting scheme. With the error distribution unspecified, the proposed estimators share robustness from quantile regression and achieve nearly the same efficiency as the oracle maximum likelihood estimator (MLE) for a variety of error distributions including the normal, mixed-normal, Student's t, Cauchy distributions, etc. We also suggest an algorithm for the fast implementation of the proposed methodology. Simulations are carried out to compare the performance of different estimators, and the proposed approach is used to analyze the daily S&P 500 Composite index, which verifies the effectiveness and efficiency of our theoretical results.

  8. Trans-dimensional joint inversion of seabed scattering and reflection data.

    PubMed

    Steininger, Gavin; Dettmer, Jan; Dosso, Stan E; Holland, Charles W

    2013-03-01

    This paper examines joint inversion of acoustic scattering and reflection data to resolve seabed interface roughness parameters (spectral strength, exponent, and cutoff) and geoacoustic profiles. Trans-dimensional (trans-D) Bayesian sampling is applied with both the number of sediment layers and the order (zeroth or first) of auto-regressive parameters in the error model treated as unknowns. A prior distribution that allows fluid sediment layers over an elastic basement in a trans-D inversion is derived and implemented. Three cases are considered: Scattering-only inversion, joint scattering and reflection inversion, and joint inversion with the trans-D auto-regressive error model. Including reflection data improves the resolution of scattering and geoacoustic parameters. The trans-D auto-regressive model further improves scattering resolution and correctly differentiates between strongly and weakly correlated residual errors.

  9. Goodness-Of-Fit Test for Nonparametric Regression Models: Smoothing Spline ANOVA Models as Example.

    PubMed

    Teran Hidalgo, Sebastian J; Wu, Michael C; Engel, Stephanie M; Kosorok, Michael R

    2018-06-01

    Nonparametric regression models do not require the specification of the functional form between the outcome and the covariates. Despite their popularity, the amount of diagnostic statistics, in comparison to their parametric counter-parts, is small. We propose a goodness-of-fit test for nonparametric regression models with linear smoother form. In particular, we apply this testing framework to smoothing spline ANOVA models. The test can consider two sources of lack-of-fit: whether covariates that are not currently in the model need to be included, and whether the current model fits the data well. The proposed method derives estimated residuals from the model. Then, statistical dependence is assessed between the estimated residuals and the covariates using the HSIC. If dependence exists, the model does not capture all the variability in the outcome associated with the covariates, otherwise the model fits the data well. The bootstrap is used to obtain p-values. Application of the method is demonstrated with a neonatal mental development data analysis. We demonstrate correct type I error as well as power performance through simulations.

  10. Seasonal forecasting of high wind speeds over Western Europe

    NASA Astrophysics Data System (ADS)

    Palutikof, J. P.; Holt, T.

    2003-04-01

    As financial losses associated with extreme weather events escalate, there is interest from end users in the forestry and insurance industries, for example, in the development of seasonal forecasting models with a long lead time. This study uses exceedances of the 90th, 95th, and 99th percentiles of daily maximum wind speed over the period 1958 to present to derive predictands of winter wind extremes. The source data is the 6-hourly NCEP Reanalysis gridded surface wind field. Predictor variables include principal components of Atlantic sea surface temperature and several indices of climate variability, including the NAO and SOI. Lead times of up to a year are considered, in monthly increments. Three regression techniques are evaluated: multiple linear regression (MLR), principal component regression (PCR), and partial least squares regression (PLS). PCR and PLS proved considerably superior to MLR with much lower standard errors. PLS was chosen to formulate the predictive model since it offers more flexibility in experimental design and gave slightly better results than PCR. The results indicate that winter windiness can be predicted with considerable skill one year ahead for much of coastal Europe, but that this deteriorates rapidly in the hinterland. The experiment succeeded in highlighting PLS as a very useful method for developing more precise forecasting models, and in identifying areas of high predictability.

  11. Meta-regression analysis of the effect of trans fatty acids on low-density lipoprotein cholesterol.

    PubMed

    Allen, Bruce C; Vincent, Melissa J; Liska, DeAnn; Haber, Lynne T

    2016-12-01

    We conducted a meta-regression of controlled clinical trial data to investigate quantitatively the relationship between dietary intake of industrial trans fatty acids (iTFA) and increased low-density lipoprotein cholesterol (LDL-C). Previous regression analyses included insufficient data to determine the nature of the dose response in the low-dose region and have nonetheless assumed a linear relationship between iTFA intake and LDL-C levels. This work contributes to the previous work by 1) including additional studies examining low-dose intake (identified using an evidence mapping procedure); 2) investigating a range of curve shapes, including both linear and nonlinear models; and 3) using Bayesian meta-regression to combine results across trials. We found that, contrary to previous assumptions, the linear model does not acceptably fit the data, while the nonlinear, S-shaped Hill model fits the data well. Based on a conservative estimate of the degree of intra-individual variability in LDL-C (0.1 mmol/L), as an estimate of a change in LDL-C that is not adverse, a change in iTFA intake of 2.2% of energy intake (%en) (corresponding to a total iTFA intake of 2.2-2.9%en) does not cause adverse effects on LDL-C. The iTFA intake associated with this change in LDL-C is substantially higher than the average iTFA intake (0.5%en). Copyright © 2016 The Authors. Published by Elsevier Ltd. All rights reserved.

  12. Analysis of training sample selection strategies for regression-based quantitative landslide susceptibility mapping methods

    NASA Astrophysics Data System (ADS)

    Erener, Arzu; Sivas, A. Abdullah; Selcuk-Kestel, A. Sevtap; Düzgün, H. Sebnem

    2017-07-01

    All of the quantitative landslide susceptibility mapping (QLSM) methods require two basic data types, namely, landslide inventory and factors that influence landslide occurrence (landslide influencing factors, LIF). Depending on type of landslides, nature of triggers and LIF, accuracy of the QLSM methods differs. Moreover, how to balance the number of 0 (nonoccurrence) and 1 (occurrence) in the training set obtained from the landslide inventory and how to select which one of the 1's and 0's to be included in QLSM models play a critical role in the accuracy of the QLSM. Although performance of various QLSM methods is largely investigated in the literature, the challenge of training set construction is not adequately investigated for the QLSM methods. In order to tackle this challenge, in this study three different training set selection strategies along with the original data set are used for testing the performance of three different regression methods, namely Logistic Regression (LR), Bayesian Logistic Regression (BLR) and Fuzzy Logistic Regression (FLR). The first sampling strategy is proportional random sampling (PRS), which takes into account a weighted selection of landslide occurrences in the sample set. The second method, namely non-selective nearby sampling (NNS), includes randomly selected sites and their surrounding neighboring points at certain preselected distances to include the impact of clustering. Selective nearby sampling (SNS) is the third method, which concentrates on the group of 1's and their surrounding neighborhood. A randomly selected group of landslide sites and their neighborhood are considered in the analyses similar to NNS parameters. It is found that LR-PRS, FLR-PRS and BLR-Whole Data set-ups, in that order, yield the best fits among the other alternatives. The results indicate that in QLSM based on regression models, avoidance of spatial correlation in the data set is critical for the model's performance.

  13. Application and interpretation of functional data analysis techniques to differential scanning calorimetry data from lupus patients.

    PubMed

    Kendrick, Sarah K; Zheng, Qi; Garbett, Nichola C; Brock, Guy N

    2017-01-01

    DSC is used to determine thermally-induced conformational changes of biomolecules within a blood plasma sample. Recent research has indicated that DSC curves (or thermograms) may have different characteristics based on disease status and, thus, may be useful as a monitoring and diagnostic tool for some diseases. Since thermograms are curves measured over a range of temperature values, they are considered functional data. In this paper we apply functional data analysis techniques to analyze differential scanning calorimetry (DSC) data from individuals from the Lupus Family Registry and Repository (LFRR). The aim was to assess the effect of lupus disease status as well as additional covariates on the thermogram profiles, and use FD analysis methods to create models for classifying lupus vs. control patients on the basis of the thermogram curves. Thermograms were collected for 300 lupus patients and 300 controls without lupus who were matched with diseased individuals based on sex, race, and age. First, functional regression with a functional response (DSC) and categorical predictor (disease status) was used to determine how thermogram curve structure varied according to disease status and other covariates including sex, race, and year of birth. Next, functional logistic regression with disease status as the response and functional principal component analysis (FPCA) scores as the predictors was used to model the effect of thermogram structure on disease status prediction. The prediction accuracy for patients with Osteoarthritis and Rheumatoid Arthritis but without Lupus was also calculated to determine the ability of the classifier to differentiate between Lupus and other diseases. Data were divided 1000 times into separate 2/3 training and 1/3 test data for evaluation of predictions. Finally, derivatives of thermogram curves were included in the models to determine whether they aided in prediction of disease status. 
    Functional regression with thermogram as a functional response and disease status as predictor showed a clear separation in thermogram curve structure between cases and controls. The logistic regression model with FPCA scores as the predictors gave the most accurate results with a mean 79.22% correct classification rate with a mean sensitivity = 79.70%, and specificity = 81.48%. The model correctly classified OA and RA patients without Lupus as controls at a rate of 75.92% on average with a mean sensitivity = 79.70% and specificity = 77.6%. Regression models including FPCA scores for derivative curves did not perform as well, nor did regression models including covariates. Changes in thermograms observed in the disease state likely reflect covalent modifications of plasma proteins or changes in large protein-protein interacting networks resulting in the stabilization of plasma proteins towards thermal denaturation. By relating functional principal components from thermograms to disease status, our Functional Principal Component Analysis model provides results that are more easily interpretable compared to prior studies. Further, the model could also potentially be coupled with other biomarkers to improve diagnostic classification for lupus.

  14. Modelling subject-specific childhood growth using linear mixed-effect models with cubic regression splines.

    PubMed

    Grajeda, Laura M; Ivanescu, Andrada; Saito, Mayuko; Crainiceanu, Ciprian; Jaganath, Devan; Gilman, Robert H; Crabtree, Jean E; Kelleher, Dermott; Cabrera, Lilia; Cama, Vitaliano; Checkley, William

    2016-01-01

    Childhood growth is a cornerstone of pediatric research. Statistical models need to consider individual trajectories to adequately describe growth outcomes. Specifically, well-defined longitudinal models are essential to characterize both population and subject-specific growth. Linear mixed-effect models with cubic regression splines can account for the nonlinearity of growth curves and provide reasonable estimators of population and subject-specific growth, velocity and acceleration. We provide a stepwise approach that builds from simple to complex models, and accounts for the intrinsic complexity of the data. We start with standard cubic splines regression models and build up to a model that includes subject-specific random intercepts and slopes and residual autocorrelation. We then compare cubic regression splines vis-à-vis linear piecewise splines, and with varying number of knots and positions. Statistical code is provided to ensure reproducibility and improve dissemination of methods. Models are applied to longitudinal height measurements in a cohort of 215 Peruvian children followed from birth until their fourth year of life. Unexplained variability, as measured by the variance of the regression model, was reduced from 7.34 when using ordinary least squares to 0.81 (p < 0.001) when using a linear mixed-effect model with random slopes and a first order continuous autoregressive error term. There was substantial heterogeneity in both the intercept (p < 0.001) and slopes (p < 0.001) of the individual growth trajectories. We also identified important serial correlation within the structure of the data (ρ = 0.66; 95 % CI 0.64 to 0.68; p < 0.001), which we modeled with a first order continuous autoregressive error term as evidenced by the variogram of the residuals and by a lack of association among residuals.
    The final model provides a parametric linear regression equation for both estimation and prediction of population- and individual-level growth in height. We show that cubic regression splines are superior to linear regression splines for the case of a small number of knots in both estimation and prediction with the full linear mixed effect model (AIC 19,352 vs. 19,598, respectively). While the regression parameters are more complex to interpret in the former, we argue that inference for any problem depends more on the estimated curve or differences in curves rather than the coefficients. Moreover, use of cubic regression splines provides biologically meaningful growth velocity and acceleration curves despite increased complexity in coefficient interpretation. Through this stepwise approach, we provide a set of tools to model longitudinal childhood data for non-statisticians using linear mixed-effect models.

  15. A Poisson regression approach to model monthly hail occurrence in Northern Switzerland using large-scale environmental variables

    NASA Astrophysics Data System (ADS)

    Madonna, Erica; Ginsbourger, David; Martius, Olivia

    2018-05-01

    In Switzerland, hail regularly causes substantial damage to agriculture, cars and infrastructure; however, little is known about its long-term variability. To study the variability, the monthly number of days with hail in northern Switzerland is modeled in a regression framework using large-scale predictors derived from ERA-Interim reanalysis. The model is developed and verified using radar-based hail observations for the extended summer season (April-September) in the period 2002-2014. The seasonality of hail is explicitly modeled with a categorical predictor (month) and monthly anomalies of several large-scale predictors are used to capture the year-to-year variability. Several regression models are applied and their performance tested with respect to standard scores and cross-validation. The chosen model includes four predictors: the monthly anomaly of the two meter temperature, the monthly anomaly of the logarithm of the convective available potential energy (CAPE), the monthly anomaly of the wind shear and the month. This model captures the intra-annual variability well and slightly underestimates the inter-annual variability. The regression model is applied to the reanalysis data back in time to 1980. The resulting hail day time series shows an increase of the number of hail days per month, which is (in the model) related to an increase in temperature and CAPE. The trend corresponds to approximately 0.5 days per month per decade. The results of the regression model have been compared to two independent data sets. All data sets agree on the sign of the trend, but the trend is weaker in the other data sets.
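
    The record above models monthly hail-day counts with Poisson regression. As a rough sketch of how such a model can be fitted (not the authors' implementation), the following estimates a Poisson regression with a log link, an intercept and a single predictor by Newton-Raphson. The single-predictor simplification, the function name `fit_poisson`, and the data values are assumptions made for illustration; the model in the abstract uses four predictors.

```python
import math


def fit_poisson(x, y, n_iter=50, tol=1e-10):
    """Fit log(mu_i) = b0 + b1 * x_i by maximum likelihood (Newton-Raphson).

    Assumes x is not constant and the counts in y are not all zero.
    """
    if sum(y) <= 0:
        raise ValueError("y must contain at least one positive value")
    b0 = math.log(sum(y) / len(y))  # start at the log of the mean count
    b1 = 0.0
    for _ in range(n_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Gradient of the log-likelihood (score vector).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Negative Hessian (Fisher information), a 2x2 matrix.
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system information * delta = gradient.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# Synthetic example: a hypothetical monthly anomaly and hail-day counts.
anomaly = [-1.0, -0.5, 0.0, 0.5, 1.0]
hail_days = [1, 2, 2, 4, 6]
intercept, slope = fit_poisson(anomaly, hail_days)
print(intercept, slope)
```

    With a log link, exp(slope) is the multiplicative change in the expected count for a one-unit increase in the predictor.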

  16. Predictors and Neuropsychiatric Profile of Nucleus Basalis of Meynert Degeneration in Parkinson Disease

    DTIC Science & Technology

    2017-10-01

    baseline were available for 228 PD subjects. In a logistic regression model adjusted for age and sex, Ch4 density was associated with lower risk of...events, there were no significant differences in age or sex (p>0.05). PD subjects with 2 or more psychotic events had significantly lower baseline Ch4...Aim 1 and 2 include use of linear regression models to adjust for age, sex, and other significant covariates. Aim 3 is a cross-sectional controlled

  17. A general regression framework for a secondary outcome in case-control studies.

    PubMed

    Tchetgen Tchetgen, Eric J

    2014-01-01

    Modern case-control studies typically involve the collection of data on a large number of outcomes, often at considerable logistical and monetary expense. These data are of potentially great value to subsequent researchers, who, although not necessarily concerned with the disease that defined the case series in the original study, may want to use the available information for a regression analysis involving a secondary outcome. Because cases and controls are selected with unequal probability, regression analysis involving a secondary outcome generally must acknowledge the sampling design. In this paper, the author presents a new framework for the analysis of secondary outcomes in case-control studies. The approach is based on a careful re-parameterization of the conditional model for the secondary outcome given the case-control outcome and regression covariates, in terms of (a) the population regression of interest of the secondary outcome given covariates and (b) the population regression of the case-control outcome on covariates. The error distribution for the secondary outcome given covariates and case-control status is otherwise unrestricted. For a continuous outcome, the approach sometimes reduces to extending model (a) by including a residual of (b) as a covariate. However, the framework is general in the sense that models (a) and (b) can take any functional form, and the methodology allows for an identity, log or logit link function for model (a).

  18. What are hierarchical models and how do we analyze them?

    USGS Publications Warehouse

    Royle, Andy

    2016-01-01

    In this chapter we provide a basic definition of hierarchical models and introduce the two canonical hierarchical models in this book: site occupancy and N-mixture models. The former is a hierarchical extension of logistic regression and the latter is a hierarchical extension of Poisson regression. We introduce basic concepts of probability modeling and statistical inference including likelihood and Bayesian perspectives. We go through the mechanics of maximizing the likelihood and characterizing the posterior distribution by Markov chain Monte Carlo (MCMC) methods. We give a general perspective on topics such as model selection and assessment of model fit, although we demonstrate these topics in practice in later chapters (especially Chapters 5, 6, 7, and 10).

  19. An Analysis of San Diego's Housing Market Using a Geographically Weighted Regression Approach

    NASA Astrophysics Data System (ADS)

    Grant, Christina P.

    San Diego County real estate transaction data was evaluated with a set of linear models calibrated by ordinary least squares and geographically weighted regression (GWR). The goal of the analysis was to determine whether the spatial effects assumed to be in the data are best studied globally with no spatial terms, globally with a fixed effects submarket variable, or locally with GWR. 18,050 single-family residential sales which closed in the six months between April 2014 and September 2014 were used in the analysis. Diagnostic statistics including AICc, R2, Global Moran's I, and visual inspection of diagnostic plots and maps indicate superior model performance by GWR as compared to both global regressions.

  20. Influence diagnostics in meta-regression model.

    PubMed

    Shi, Lei; Zuo, ShanShan; Yu, Dalei; Zhou, Xiaohua

    2017-09-01

    This paper studies influence diagnostics in the meta-regression model, including case deletion diagnostics and local influence analysis. We derive the subset deletion formulae for the estimation of the regression coefficient and heterogeneity variance and obtain the corresponding influence measures. The DerSimonian and Laird estimation and maximum likelihood estimation methods in meta-regression are considered, respectively, to derive the results. Internal and external residual and leverage measures are defined. The local influence analysis based on the case-weights perturbation scheme, responses perturbation scheme, covariate perturbation scheme, and within-variance perturbation scheme is explored. We introduce a method that simultaneously perturbs responses, covariate, and within-variance to obtain the local influence measure, which has the advantage of being able to compare the influence magnitude of influential studies from different perturbations. An example is used to illustrate the proposed methodology. Copyright © 2017 John Wiley & Sons, Ltd.

  1. A critical re-evaluation of the regression model specification in the US D1 EQ-5D value function

    PubMed Central

    2012-01-01

    Background: The EQ-5D is a generic health-related quality of life instrument (five dimensions with three levels, 243 health states), used extensively in cost-utility/cost-effectiveness analyses. EQ-5D health states are assigned values on a scale anchored in perfect health (1) and death (0). The dominant procedure for defining values for EQ-5D health states involves regression modeling. These regression models have typically included a constant term, interpreted as the utility loss associated with any movement away from perfect health. The authors of the United States EQ-5D valuation study replaced this constant with a variable, D1, which corresponds to the number of impaired dimensions beyond the first. The aim of this study was to illustrate how the use of the D1 variable in place of a constant is problematic. Methods: We compared the original D1 regression model with a mathematically equivalent model with a constant term. Comparisons included implications for the magnitude and statistical significance of the coefficients, multicollinearity (variance inflation factors, or VIFs), number of calculation steps needed to determine tariff values, and consequences for tariff interpretation. Results: Using the D1 variable in place of a constant shifted all dummy variable coefficients away from zero by the value of the constant, greatly increased the multicollinearity of the model (maximum VIF of 113.2 vs. 21.2), and increased the mean number of calculation steps required to determine health state values. Discussion: Using the D1 variable in place of a constant constitutes an unnecessary complication of the model, obscures the fact that at least two of the main effect dummy variables are statistically nonsignificant, and complicates and biases interpretation of the tariff algorithm. PMID:22244261

  2. A critical re-evaluation of the regression model specification in the US D1 EQ-5D value function.

    PubMed

    Rand-Hendriksen, Kim; Augestad, Liv A; Dahl, Fredrik A

    2012-01-13

    The EQ-5D is a generic health-related quality of life instrument (five dimensions with three levels, 243 health states), used extensively in cost-utility/cost-effectiveness analyses. EQ-5D health states are assigned values on a scale anchored in perfect health (1) and death (0). The dominant procedure for defining values for EQ-5D health states involves regression modeling. These regression models have typically included a constant term, interpreted as the utility loss associated with any movement away from perfect health. The authors of the United States EQ-5D valuation study replaced this constant with a variable, D1, which corresponds to the number of impaired dimensions beyond the first. The aim of this study was to illustrate how the use of the D1 variable in place of a constant is problematic. We compared the original D1 regression model with a mathematically equivalent model with a constant term. Comparisons included implications for the magnitude and statistical significance of the coefficients, multicollinearity (variance inflation factors, or VIFs), number of calculation steps needed to determine tariff values, and consequences for tariff interpretation. Using the D1 variable in place of a constant shifted all dummy variable coefficients away from zero by the value of the constant, greatly increased the multicollinearity of the model (maximum VIF of 113.2 vs. 21.2), and increased the mean number of calculation steps required to determine health state values. Using the D1 variable in place of a constant constitutes an unnecessary complication of the model, obscures the fact that at least two of the main effect dummy variables are statistically nonsignificant, and complicates and biases interpretation of the tariff algorithm.
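
    The abstract above compares models by their variance inflation factors. As an illustrative sketch of the standard definition (not the authors' code), the VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors. The helper names and the small data set below are hypothetical.

```python
def _solve(a, b):
    """Solve a linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def r_squared(rows, y):
    """R^2 of an ordinary least squares fit of y on the rows, with intercept."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    beta = _solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in design]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def vif(rows):
    """Variance inflation factor of each column of a predictor matrix."""
    factors = []
    for j in range(len(rows[0])):
        target = [row[j] for row in rows]
        others = [[v for i, v in enumerate(row) if i != j] for row in rows]
        factors.append(1.0 / (1.0 - r_squared(others, target)))
    return factors


# Hypothetical predictor matrix with two correlated columns.
predictors = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
print(vif(predictors))
```

    A VIF of 1 means a predictor is uncorrelated with the others; larger values mean that more of its variance is explained by the other predictors, which inflates the variance of its coefficient estimate.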

  3. Estimation and Selection via Absolute Penalized Convex Minimization And Its Multistage Adaptive Applications

    PubMed Central

    Huang, Jian; Zhang, Cun-Hui

    2013-01-01

    The ℓ1-penalized method, or the Lasso, has emerged as an important tool for the analysis of large data sets. Many important results have been obtained for the Lasso in linear regression which have led to a deeper understanding of high-dimensional statistical problems. In this article, we consider a class of weighted ℓ1-penalized estimators for convex loss functions of a general form, including the generalized linear models. We study the estimation, prediction, selection and sparsity properties of the weighted ℓ1-penalized estimator in sparse, high-dimensional settings where the number of predictors p can be much larger than the sample size n. Adaptive Lasso is considered as a special case. A multistage method is developed to approximate concave regularized estimation by applying an adaptive Lasso recursively. We provide prediction and estimation oracle inequalities for single- and multi-stage estimators, a general selection consistency theorem, and an upper bound for the dimension of the Lasso estimator. Important models including the linear regression, logistic regression and log-linear models are used throughout to illustrate the applications of the general results. PMID:24348100

  4. Regression Analysis of Stage Variability for West-Central Florida Lakes

    USGS Publications Warehouse

    Sacks, Laura A.; Ellison, Donald L.; Swancar, Amy

    2008-01-01

    The variability in a lake's stage depends upon many factors, including surface-water flows, meteorological conditions, and hydrogeologic characteristics near the lake. An understanding of the factors controlling lake-stage variability for a population of lakes may be helpful to water managers who set regulatory levels for lakes. The goal of this study is to determine whether lake-stage variability can be predicted using multiple linear regression and readily available lake and basin characteristics defined for each lake. Regressions were evaluated for a recent 10-year period (1996-2005) and for a historical 10-year period (1954-63). Ground-water pumping is considered to have affected stage at many of the 98 lakes included in the recent period analysis, and not to have affected stage at the 20 lakes included in the historical period analysis. For the recent period, regression models had coefficients of determination (R2) values ranging from 0.60 to 0.74, and up to five explanatory variables. Standard errors ranged from 21 to 37 percent of the average stage variability. Net leakage was the most important explanatory variable in regressions describing the full range and low range in stage variability for the recent period. The most important explanatory variable in the model predicting the high range in stage variability was the height over median lake stage at which surface-water outflow would occur. Other explanatory variables in final regression models for the recent period included the range in annual rainfall for the period and several variables related to local and regional hydrogeology: (1) ground-water pumping within 1 mile of each lake, (2) the amount of ground-water inflow (by category), (3) the head gradient between the lake and the Upper Floridan aquifer, and (4) the thickness of the intermediate confining unit. 
Many of the variables in final regression models are related to hydrogeologic characteristics, underscoring the importance of ground-water exchange in controlling the stage of karst lakes in Florida. Regression equations were used to predict lake-stage variability for the recent period for 12 additional lakes, and the median difference between predicted and observed values ranged from 11 to 23 percent. Coefficients of determination for the historical period were considerably lower (maximum R2 of 0.28) than for the recent period. Reasons for these low R2 values are probably related to the small number of lakes (20) with stage data for an equivalent time period that were unaffected by ground-water pumping, the similarity of many of the lake types (large surface-water drainage lakes), and the greater uncertainty in defining historical basin characteristics. The lack of lake-stage data unaffected by ground-water pumping and the poor regression results obtained for that group of lakes limit the ability to predict natural lake-stage variability using this method in west-central Florida.

  5. A Bayesian Semiparametric Latent Variable Model for Mixed Responses

    ERIC Educational Resources Information Center

    Fahrmeir, Ludwig; Raach, Alexander

    2007-01-01

    In this paper we introduce a latent variable model (LVM) for mixed ordinal and continuous responses, where covariate effects on the continuous latent variables are modelled through a flexible semiparametric Gaussian regression model. We extend existing LVMs with the usual linear covariate effects by including nonparametric components for nonlinear…

  6. A Predictive Model for Readmissions Among Medicare Patients in a California Hospital.

    PubMed

    Duncan, Ian; Huynh, Nhan

    2017-11-17

    Predictive models for hospital readmission rates are in high demand because of the Centers for Medicare & Medicaid Services (CMS) Hospital Readmission Reduction Program (HRRP). The LACE index is one of the most popular predictive tools among hospitals in the United States. The LACE index is a simple tool with 4 parameters: Length of stay, Acuity of admission, Comorbidity, and Emergency visits in the previous 6 months. The authors applied logistic regression to develop a predictive model for a medium-sized not-for-profit community hospital in California using patient-level data with more specific patient information (including 13 explanatory variables). Specifically, the logistic regression is applied to 2 populations: a general population including all patients and the specific group of patients targeted by the CMS penalty (characterized as ages 65 or older with select conditions). The 2 resulting logistic regression models have a higher sensitivity rate compared to the sensitivity of the LACE index. The C statistic values of the model applied to both populations demonstrate moderate levels of predictive power. The authors also build an economic model to demonstrate the potential financial impact of the use of the model for targeting high-risk patients in a sample hospital and demonstrate that, on balance, whether the hospital gains or loses from reducing readmissions depends on its margin and the extent of its readmission penalties.

  7. Cross-validation pitfalls when selecting and assessing regression and classification models.

    PubMed

    Krstajic, Damjan; Buturovic, Ljubomir J; Leahy, David E; Thomas, Simon

    2014-03-29

    We address the problem of selecting and assessing classification and regression models using cross-validation. Current state-of-the-art methods can yield models with high variance, rendering them unsuitable for a number of practical applications including QSAR. In this paper we describe and evaluate best practices which improve reliability and increase confidence in selected models. A key operational component of the proposed methods is cloud computing which enables routine use of previously infeasible approaches. We describe in detail an algorithm for repeated grid-search V-fold cross-validation for parameter tuning in classification and regression, and we define a repeated nested cross-validation algorithm for model assessment. As regards variable selection and parameter tuning we define two algorithms (repeated grid-search cross-validation and double cross-validation), and provide arguments for using the repeated grid-search in the general case. We show results of our algorithms on seven QSAR datasets. The variation of the prediction performance, which is the result of choosing different splits of the dataset in V-fold cross-validation, needs to be taken into account when selecting and assessing classification and regression models. We demonstrate the importance of repeating cross-validation when selecting an optimal model, as well as the importance of repeating nested cross-validation when assessing a prediction error.
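
    The idea of repeating grid-search V-fold cross-validation over several random splits can be sketched as follows. This is a minimal illustration only, not the authors' published algorithm: the function names, the `error_fn` callback, and the default values of `v` and `repeats` are assumptions introduced here.

```python
import random


def v_fold_splits(n, v, rng):
    """Yield (train, test) index lists for one random V-fold partition of n items."""
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::v] for i in range(v)]  # v folds of nearly equal size
    for k in range(v):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


def repeated_grid_search(n, grid, error_fn, v=5, repeats=10, seed=0):
    """Average the cross-validated error of each grid value over several random splits.

    error_fn(train, test, param) is a hypothetical user-supplied callback that
    fits a model on `train` with tuning value `param` and returns its error on `test`.
    """
    rng = random.Random(seed)
    totals = {param: 0.0 for param in grid}
    for _ in range(repeats):
        # Every grid value is scored on the same partition within a repeat.
        splits = list(v_fold_splits(n, v, rng))
        for param in grid:
            fold_errors = [error_fn(train, test, param) for train, test in splits]
            totals[param] += sum(fold_errors) / v
    means = {param: total / repeats for param, total in totals.items()}
    best = min(means, key=means.get)
    return best, means
```

    Averaging over repeats is what reduces the dependence of the selected tuning value on any single split of the data; assessing the prediction error of the whole procedure would additionally require an outer (nested) cross-validation loop, which this sketch omits.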

  8. Three methods to construct predictive models using logistic regression and likelihood ratios to facilitate adjustment for pretest probability give similar results.

    PubMed

    Chan, Siew Foong; Deeks, Jonathan J; Macaskill, Petra; Irwig, Les

    2008-01-01

    To compare three predictive models based on logistic regression to estimate adjusted likelihood ratios allowing for interdependency between diagnostic variables (tests). This study was a review of the theoretical basis, assumptions, and limitations of published models; and a statistical extension of methods and application to a case study of the diagnosis of obstructive airways disease based on history and clinical examination. Albert's method includes an offset term to estimate an adjusted likelihood ratio for combinations of tests. Spiegelhalter and Knill-Jones method uses the unadjusted likelihood ratio for each test as a predictor and computes shrinkage factors to allow for interdependence. Knottnerus' method differs from the other methods because it requires sequencing of tests, which limits its application to situations where there are few tests and substantial data. Although parameter estimates differed between the models, predicted "posttest" probabilities were generally similar. Construction of predictive models using logistic regression is preferred to the independence Bayes' approach when it is important to adjust for dependency of tests errors. Methods to estimate adjusted likelihood ratios from predictive models should be considered in preference to a standard logistic regression model to facilitate ease of interpretation and application. Albert's method provides the most straightforward approach.
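
    As background to the comparison above, the conversion from a pretest probability to a posttest probability through likelihood ratios can be written in a few lines. This sketch shows only the odds form of Bayes' theorem; the function name is an assumption, and simply multiplying unadjusted likelihood ratios treats the tests as conditionally independent, which is exactly the simplification the adjusted methods in the abstract are meant to relax.

```python
def posttest_probability(pretest_prob, likelihood_ratios):
    """Update a pretest probability with one or more likelihood ratios.

    Posttest odds = pretest odds x LR_1 x LR_2 x ...
    Multiplying several ratios assumes the tests are conditionally independent.
    """
    if not 0.0 < pretest_prob < 1.0:
        raise ValueError("pretest_prob must lie strictly between 0 and 1")
    odds = pretest_prob / (1.0 - pretest_prob)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1.0 + odds)
```

    For example, a pretest probability of 0.5 combined with a single likelihood ratio of 3 gives posttest odds of 3 and a posttest probability of 0.75.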

  9. [Associated factors in newborns with intrauterine growth retardation].

    PubMed

    Thompson-Chagoyán, Oscar C; Vega-Franco, Leopoldo

    2008-01-01

    To identify the risk factors implicated in the intrauterine growth retardation (IUGR) of neonates born in a social security institution. A case-control study was conducted in 376 neonates: 188 with IUGR (weight < 10th percentile) and 188 without IUGR. At birth, information on 30 risk variables for IUGR was obtained from the mothers. Risk analysis and stepwise logistic regression were used. Odds ratios were significant for 12 of the variables. The model obtained by stepwise regression included: weight gain during pregnancy, prenatal care attendance, toxemia, chocolate ingestion, father's weight, and the home environment. Most of the variables included in the model are related to socioeconomic disadvantages associated with the risk of IUGR in the population.

  10. Extended cox regression model: The choice of timefunction

    NASA Astrophysics Data System (ADS)

    Isik, Hatice; Tutkun, Nihal Ata; Karasoy, Durdu

    2017-07-01

    The Cox regression model (CRM), which takes into account the effect of censored observations, is one of the most applicable and widely used models in survival analysis for evaluating the effects of covariates. Proportional hazards (PH), which requires a constant hazard ratio over time, is the assumption of the CRM. The extended CRM provides a test of the PH assumption by including a time-dependent covariate, or an alternative model in the case of nonproportional hazards. In this study, different types of real data sets are used to choose the time function, and the differences between time functions are analyzed and discussed.

  11. Simplified large African carnivore density estimators from track indices.

    PubMed

    Winterbach, Christiaan W; Ferreira, Sam M; Funston, Paul J; Somers, Michael J

    2016-01-01

    The range, population size and trend of large carnivores are important parameters to assess their status globally and to plan conservation strategies. One can use linear models to assess population size and trends of large carnivores from track-based surveys on suitable substrates. The conventional approach of a linear model with intercept may not pass through zero, but may fit the data better than a linear model through the origin. We assess whether a linear regression through the origin is more appropriate than a linear regression with intercept to model large African carnivore densities and track indices. We did simple linear regression with intercept analysis and simple linear regression through the origin and used the confidence interval for β in the linear model y = αx + β, Standard Error of Estimate, Mean Squares Residual and Akaike Information Criteria to evaluate the models. The Lion on Clay and Low Density on Sand models with intercept were not significant (P > 0.05). The other four models with intercept and the six models through the origin were all significant (P < 0.05). The models using linear regression with intercept all included zero in the confidence interval for β and the null hypothesis that β = 0 could not be rejected. All models showed that the linear model through the origin provided a better fit than the linear model with intercept, as indicated by the Standard Error of Estimate and Mean Square Residuals. Akaike Information Criteria showed that linear models through the origin were better and that none of the linear models with intercept had substantial support. Our results showed that linear regression through the origin is justified over the more typical linear regression with intercept for all models we tested. A general model can be used to estimate large carnivore densities from track densities across species and study areas.
The formula observed track density = 3.26 × carnivore density can be used to estimate densities of large African carnivores using track counts on sandy substrates in areas where carnivore densities are 0.27 carnivores/100 km² or higher. To improve the current models, we need independent data to validate the models and data to test for non-linear relationship between track indices and true density at low densities.
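
    A regression through the origin has a closed-form slope, and the reported formula can be inverted to turn a track density into a carnivore density. The sketch below assumes ordinary least squares without weights; the function names are illustrative, and only the coefficient 3.26 comes from the abstract.

```python
def slope_through_origin(x, y):
    """Least-squares slope of y = slope * x with no intercept: sum(x*y) / sum(x*x)."""
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    return sxy / sxx


def carnivore_density(track_density, slope=3.26):
    """Invert track density = slope x carnivore density (sandy substrates).

    The abstract restricts the formula to areas with at least
    0.27 carnivores/100 km²; this function does not enforce that limit.
    """
    return track_density / slope
```

    With the default slope, an observed track density of 6.52 corresponds to an estimated density of about 2 carnivores/100 km².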

  12. Imaging-based biomarkers of cognitive performance in older adults constructed via high-dimensional pattern regression applied to MRI and PET.

    PubMed

    Wang, Ying; Goh, Joshua O; Resnick, Susan M; Davatzikos, Christos

    2013-01-01

    In this study, we used high-dimensional pattern regression methods based on structural (gray and white matter; GM and WM) and functional (positron emission tomography of regional cerebral blood flow; PET) brain data to identify cross-sectional imaging biomarkers of cognitive performance in cognitively normal older adults from the Baltimore Longitudinal Study of Aging (BLSA). We focused on specific components of executive and memory domains known to decline with aging, including manipulation, semantic retrieval, long-term memory (LTM), and short-term memory (STM). For each imaging modality, brain regions associated with each cognitive domain were generated by adaptive regional clustering. A relevance vector machine was adopted to model the nonlinear continuous relationship between brain regions and cognitive performance, with cross-validation to select the most informative brain regions (using recursive feature elimination) as imaging biomarkers and optimize model parameters. Predicted cognitive scores using our regression algorithm based on the resulting brain regions correlated well with actual performance. Also, regression models obtained using combined GM, WM, and PET imaging modalities outperformed models based on single modalities. Imaging biomarkers related to memory performance included the orbito-frontal and medial temporal cortical regions with LTM showing stronger correlation with the temporal lobe than STM. Brain regions predicting executive performance included orbito-frontal, and occipito-temporal areas. The PET modality had higher contribution to most cognitive domains except manipulation, which had higher WM contribution from the superior longitudinal fasciculus and the genu of the corpus callosum. These findings based on machine-learning methods demonstrate the importance of combining structural and functional imaging data in understanding complex cognitive mechanisms and also their potential usage as biomarkers that predict cognitive status.

  13. Aircraft Anomaly Detection Using Performance Models Trained on Fleet Data

    NASA Technical Reports Server (NTRS)

    Gorinevsky, Dimitry; Matthews, Bryan L.; Martin, Rodney

    2012-01-01

    This paper describes an application of data mining technology called Distributed Fleet Monitoring (DFM) to Flight Operational Quality Assurance (FOQA) data collected from a fleet of commercial aircraft. DFM transforms the data into aircraft performance models, flight-to-flight trends, and individual flight anomalies by fitting a multi-level regression model to the data. The model represents aircraft flight performance and takes into account fixed effects: flight-to-flight and vehicle-to-vehicle variability. The regression parameters include aerodynamic coefficients and other aircraft performance parameters that are usually identified by aircraft manufacturers in flight tests. Using DFM, the multi-terabyte FOQA data set with half a million flights was processed in a few hours. The anomalies found include wrong values of computed variables (e.g., aircraft weight), sensor failures and biases, and failures, biases, and trends in flight actuators. These anomalies were missed by the existing airline monitoring of FOQA data exceedances.

  14. Outcome modelling strategies in epidemiology: traditional methods and basic alternatives

    PubMed Central

    Greenland, Sander; Daniel, Rhian; Pearce, Neil

    2016-01-01

    Controlling for too many potential confounders can lead to or aggravate problems of data sparsity or multicollinearity, particularly when the number of covariates is large in relation to the study size. As a result, methods to reduce the number of modelled covariates are often deployed. We review several traditional modelling strategies, including stepwise regression and the ‘change-in-estimate’ (CIE) approach to deciding which potential confounders to include in an outcome-regression model for estimating effects of a targeted exposure. We discuss their shortcomings, and then provide some basic alternatives and refinements that do not require special macros or programming. Throughout, we assume the main goal is to derive the most accurate effect estimates obtainable from the data and commercial software. Allowing that most users must stay within standard software packages, this goal can be roughly approximated using basic methods to assess, and thereby minimize, mean squared error (MSE). PMID:27097747

  15. Effects of land cover, topography, and built structure on seasonal water quality at multiple spatial scales.

    PubMed

    Pratt, Bethany; Chang, Heejun

    2012-03-30

    The relationship among land cover, topography, built structure and stream water quality in the Portland Metro region of Oregon and Clark County, Washington areas, USA, is analyzed using ordinary least squares (OLS) and geographically weighted (GWR) multiple regression models. Two scales of analysis, a sectional watershed and a buffer, offered a local and a global investigation of the sources of stream pollutants. Model accuracy, measured by R² values, fluctuated according to the scale, season, and regression method used. While most wet season water quality parameters are associated with urban land covers, most dry season water quality parameters are related to topographic features such as elevation and slope. GWR models, which take into consideration local relations of spatial autocorrelation, had stronger results than OLS regression models. In the multiple regression models, sectioned watershed results were consistently better than the sectioned buffer results, except for dry season pH and stream temperature parameters. This suggests that while riparian land cover does have an effect on water quality, a wider contributing area needs to be included in order to account for distant sources of pollutants. Copyright © 2012 Elsevier B.V. All rights reserved.

  16. The quest for conditional independence in prospectivity modeling: weights-of-evidence, boost weights-of-evidence, and logistic regression

    NASA Astrophysics Data System (ADS)

    Schaeben, Helmut; Semmler, Georg

    2016-09-01

    The objective of prospectivity modeling is prediction of the conditional probability of the presence T = 1 or absence T = 0 of a target T given favorable or prohibitive predictors B, or construction of a two classes 0,1 classification of T. A special case of logistic regression called weights-of-evidence (WofE) is geologists' favorite method of prospectivity modeling due to its apparent simplicity. However, the numerical simplicity is deceiving as it is implied by the severe mathematical modeling assumption of joint conditional independence of all predictors given the target. General weights of evidence are explicitly introduced which are as simple to estimate as conventional weights, i.e., by counting, but do not require conditional independence. Complementary to the regression view is the classification view on prospectivity modeling. Boosting is the construction of a strong classifier from a set of weak classifiers. From the regression point of view it is closely related to logistic regression. Boost weights-of-evidence (BoostWofE) was introduced into prospectivity modeling to counterbalance violations of the assumption of conditional independence even though relaxation of modeling assumptions with respect to weak classifiers was not the (initial) purpose of boosting. In the original publication of BoostWofE a fabricated dataset was used to "validate" this approach. Using the same fabricated dataset it is shown that BoostWofE cannot generally compensate lacking conditional independence whatever the consecutively processing order of predictors. Thus the alleged features of BoostWofE are disproved by way of counterexamples, while theoretical findings are confirmed that logistic regression including interaction terms can exactly compensate violations of joint conditional independence if the predictors are indicators.
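
    To make "estimation by counting" concrete, the conventional weights for a single binary predictor B can be computed from four cell counts. This is a sketch of the conventional weights as they are commonly defined, not of the general weights or of BoostWofE discussed in the abstract; the function and argument names are assumptions.

```python
import math


def weights_of_evidence(n_b_and_t, n_t, n_b_and_not_t, n_not_t):
    """Conventional weights of evidence for one binary predictor B and target T.

    n_b_and_t      cells where B is present and the target is present
    n_t            all cells where the target is present
    n_b_and_not_t  cells where B is present and the target is absent
    n_not_t        all cells where the target is absent
    """
    p_b_given_t = n_b_and_t / n_t
    p_b_given_not_t = n_b_and_not_t / n_not_t
    # W+ applies where B is present, W- where B is absent.
    w_plus = math.log(p_b_given_t / p_b_given_not_t)
    w_minus = math.log((1.0 - p_b_given_t) / (1.0 - p_b_given_not_t))
    return w_plus, w_minus
```

    Adding such weights across several predictors to the prior log-odds of the target is only valid under the joint conditional independence assumption that the abstract criticizes; the counts in any usage of this function are hypothetical inputs.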

  17. UCODE, a computer code for universal inverse modeling

    USGS Publications Warehouse

    Poeter, E.P.; Hill, M.C.

    1999-01-01

    This article presents the US Geological Survey computer program UCODE, which was developed in collaboration with the US Army Corps of Engineers Waterways Experiment Station and the International Ground Water Modeling Center of the Colorado School of Mines. UCODE performs inverse modeling, posed as a parameter-estimation problem, using nonlinear regression. Any application model or set of models can be used; the only requirement is that they have numerical (ASCII or text only) input and output files and that the numbers in these files have sufficient significant digits. Application models can include preprocessors and postprocessors as well as models related to the processes of interest (physical, chemical and so on), making UCODE extremely powerful for model calibration. Estimated parameters can be defined flexibly with user-specified functions. Observations to be matched in the regression can be any quantity for which a simulated equivalent value can be produced, thus simulated equivalent values are calculated using values that appear in the application model output files and can be manipulated with additive and multiplicative functions, if necessary. Prior, or direct, information on estimated parameters also can be included in the regression. The nonlinear regression problem is solved by minimizing a weighted least-squares objective function with respect to the parameter values using a modified Gauss-Newton method. Sensitivities needed for the method are calculated approximately by forward or central differences and problems and solutions related to this approximation are discussed. Statistics are calculated and printed for use in (1) diagnosing inadequate data or identifying parameters that probably cannot be estimated with the available data, (2) evaluating estimated parameter values, (3) evaluating the model representation of the actual processes and (4) quantifying the uncertainty of model simulated values. 
UCODE is intended for use on any computer operating system: it consists of algorithms programmed in perl, a freeware language designed for text manipulation and Fortran90, which efficiently performs numerical calculations.
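
    The core iteration described above (weighted least squares solved by Gauss-Newton with forward-difference sensitivities) can be sketched in plain Python. This is not UCODE's code: the modifications UCODE applies to Gauss-Newton are not described in the abstract and are not reproduced here, and `simulate`, `rel_step`, and the other names are assumptions for illustration.

```python
def solve(matrix, rhs):
    """Solve a small dense linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def gauss_newton(simulate, params, observed, weights, iterations=20, rel_step=1e-6):
    """Minimize sum_i w_i * (observed_i - simulate(params)_i)**2 by plain Gauss-Newton.

    simulate(params) stands in for running the application model and reading
    its simulated equivalents; it must return one value per observation.
    """
    params = list(params)
    n = len(params)
    for _ in range(iterations):
        base = simulate(params)
        resid = [o - s for o, s in zip(observed, base)]
        # Forward-difference sensitivities: sens[j][i] = d simulated_i / d param_j.
        sens = []
        for j, p in enumerate(params):
            h = rel_step * max(abs(p), 1.0)
            shifted = params[:]
            shifted[j] = p + h
            sens.append([(s1 - s0) / h for s1, s0 in zip(simulate(shifted), base)])
        # Normal equations (J^T W J) step = J^T W r.
        jtwj = [[sum(w * sens[r][i] * sens[c][i] for i, w in enumerate(weights))
                 for c in range(n)] for r in range(n)]
        jtwr = [sum(w * sens[r][i] * resid[i] for i, w in enumerate(weights))
                for r in range(n)]
        step = solve(jtwj, jtwr)
        params = [p + d for p, d in zip(params, step)]
    return params
```

    Because the step is undamped, this sketch can diverge from poor starting values on strongly nonlinear models, and it fails if two parameters have nearly identical sensitivities, which is the kind of parameter that the abstract says the reported statistics help identify.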

  18. A study of machine learning regression methods for major elemental analysis of rocks using laser-induced breakdown spectroscopy

    NASA Astrophysics Data System (ADS)

    Boucher, Thomas F.; Ozanne, Marie V.; Carmosino, Marco L.; Dyar, M. Darby; Mahadevan, Sridhar; Breves, Elly A.; Lepore, Kate H.; Clegg, Samuel M.

    2015-05-01

    The ChemCam instrument on the Mars Curiosity rover is generating thousands of LIBS spectra and bringing interest in this technique to public attention. The key to interpreting Mars or any other types of LIBS data are calibrations that relate laboratory standards to unknowns examined in other settings and enable predictions of chemical composition. Here, LIBS spectral data are analyzed using linear regression methods including partial least squares (PLS-1 and PLS-2), principal component regression (PCR), least absolute shrinkage and selection operator (lasso), elastic net, and linear support vector regression (SVR-Lin). These were compared against results from nonlinear regression methods including kernel principal component regression (K-PCR), polynomial kernel support vector regression (SVR-Py) and k-nearest neighbor (kNN) regression to discern the most effective models for interpreting chemical abundances from LIBS spectra of geological samples. The results were evaluated for 100 samples analyzed with 50 laser pulses at each of five locations averaged together. Wilcoxon signed-rank tests were employed to evaluate the statistical significance of differences among the nine models using their predicted residual sum of squares (PRESS) to make comparisons. For MgO, SiO2, Fe2O3, CaO, and MnO, the sparse models outperform all the others except for linear SVR, while for Na2O, K2O, TiO2, and P2O5, the sparse methods produce inferior results, likely because their emission lines in this energy range have lower transition probabilities. The strong performance of the sparse methods in this study suggests that use of dimensionality-reduction techniques as a preprocessing step may improve the performance of the linear models. Nonlinear methods tend to overfit the data and predict less accurately, while the linear methods proved to be more generalizable with better predictive performance. 
These results are attributed to the high dimensionality of the data (6144 channels) relative to the small number of samples studied. The best-performing models were SVR-Lin for SiO2, MgO, Fe2O3, and Na2O, lasso for Al2O3, elastic net for MnO, and PLS-1 for CaO, TiO2, and K2O. Although these differences in model performance between methods were identified, most of the models produce comparable results when p ≤ 0.05 and all techniques except kNN produced statistically-indistinguishable results. It is likely that a combination of models could be used together to yield a lower total error of prediction, depending on the requirements of the user.

  19. Stature estimation from the lengths of the growing foot-a study on North Indian adolescents.

    PubMed

    Krishan, Kewal; Kanchan, Tanuj; Passi, Neelam; DiMaggio, John A

    2012-12-01

    Stature estimation is considered as one of the basic parameters of the investigation process in unknown and commingled human remains in medico-legal case work. Race, age and sex are the other parameters which help in this process. Stature estimation is of the utmost importance as it completes the biological profile of a person along with the other three parameters of identification. The present research is intended to formulate standards for stature estimation from foot dimensions in adolescent males from North India and study the pattern of foot growth during the growing years. 154 male adolescents from the Northern part of India were included in the study. Besides stature, five anthropometric measurements that included the length of the foot from each toe (T1, T2, T3, T4, and T5 respectively) to pternion were measured on each foot. The data was analyzed statistically using Student's t-test, Pearson's correlation, linear and multiple regression analysis for estimation of stature and growth of foot during ages 13-18 years. Correlation coefficients between stature and all the foot measurements were found to be highly significant and positively correlated. Linear regression models and multiple regression models (with age as a co-variable) were derived for estimation of stature from the different measurements of the foot. Multiple regression models (with age as a co-variable) estimate stature with greater accuracy than the regression models for 13-18 years age group. The study shows the growth pattern of feet in North Indian adolescents and indicates that anthropometric measurements of the foot and its segments are valuable in estimation of stature in growing individuals of that population. Copyright © 2012 Elsevier Ltd. All rights reserved.

  20. Estimation of streamflow, base flow, and nitrate-nitrogen loads in Iowa using multiple linear regression models

    USGS Publications Warehouse

    Schilling, K.E.; Wolter, C.F.

    2005-01-01

    Nineteen variables, including precipitation, soils and geology, land use, and basin morphologic characteristics, were evaluated to develop Iowa regression models to predict total streamflow (Q), base flow (Qb), storm flow (Qs) and base flow percentage (%Qb) in gauged and ungauged watersheds in the state. Discharge records from a set of 33 watersheds across the state for the 1980 to 2000 period were separated into Qb and Qs. Multiple linear regression found that 75.5 percent of long term average Q was explained by rainfall, sand content, and row crop percentage variables, whereas 88.5 percent of Qb was explained by these three variables plus permeability and floodplain area variables. Qs was explained by average rainfall and %Qb was a function of row crop percentage, permeability, and basin slope variables. Regional regression models developed for long term average Q and Qb were adapted to annual rainfall and showed good correlation between measured and predicted values. Combining the regression model for Q with an estimate of mean annual nitrate concentration, a map of potential nitrate loads in the state was produced. Results from this study have important implications for understanding geomorphic and land use controls on streamflow and base flow in Iowa watersheds and similar agriculture dominated watersheds in the glaciated Midwest. (JAWRA) (Copyright © 2005).

  1. Quantile regression models of animal habitat relationships

    USGS Publications Warehouse

    Cade, Brian S.

    2003-01-01

    Typically, all factors that limit an organism are not measured and included in statistical models used to investigate relationships with their environment. If important unmeasured variables interact multiplicatively with the measured variables, the statistical models often will have heterogeneous response distributions with unequal variances. Quantile regression is an approach for estimating the conditional quantiles of a response variable distribution in the linear model, providing a more complete view of possible causal relationships between variables in ecological processes. Chapter 1 introduces quantile regression and discusses the ordering characteristics, interval nature, sampling variation, weighting, and interpretation of estimates for homogeneous and heterogeneous regression models. Chapter 2 evaluates performance of quantile rankscore tests used for hypothesis testing and constructing confidence intervals for linear quantile regression estimates (0 ≤ τ ≤ 1). A permutation F test maintained better Type I errors than the Chi-square T test for models with smaller n, greater number of parameters p, and more extreme quantiles τ. Both versions of the test required weighting to maintain correct Type I errors when there was heterogeneity under the alternative model. An example application related trout densities to stream channel width:depth. Chapter 3 evaluates a drop in dispersion, F-ratio like permutation test for hypothesis testing and constructing confidence intervals for linear quantile regression estimates (0 ≤ τ ≤ 1). Chapter 4 simulates from a large (N = 10,000) finite population representing grid areas on a landscape to demonstrate various forms of hidden bias that might occur when the effect of a measured habitat variable on some animal was confounded with the effect of another unmeasured variable (spatially and not spatially structured). 
Depending on whether interactions of the measured habitat and unmeasured variable were negative (interference interactions) or positive (facilitation interactions), either upper (τ > 0.5) or lower (τ < 0.5) quantile regression parameters were less biased than mean rate parameters. Sampling (n = 20 - 300) simulations demonstrated that confidence intervals constructed by inverting rankscore tests provided valid coverage of these biased parameters. Quantile regression was used to estimate effects of physical habitat resources on a bivalve mussel (Macomona liliana) in a New Zealand harbor by modeling the spatial trend surface as a cubic polynomial of location coordinates.
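
    The idea of fitting separate lines to lower and upper conditional quantiles can be sketched in a few lines of Python. This is a minimal illustration, not the rankscore-test machinery of the record above: the data are hypothetical, the function names `check_loss` and `fit_quantile_line` are invented for this sketch, and the brute-force search relies on the fact that, for one predictor plus an intercept, an optimal quantile regression line passes through at least two data points. Real analyses would use a linear-programming solver.

```python
from itertools import combinations


def check_loss(residual, tau):
    """Check (pinball) loss: tau * r if r >= 0, else (tau - 1) * r."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def fit_quantile_line(x, y, tau):
    """Brute-force linear quantile regression with a single predictor.

    Tries every line through two data points with distinct x values and
    keeps the one with the smallest total check loss.
    Returns (intercept, slope).
    """
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # vertical line, not a valid candidate
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        loss = sum(check_loss(yk - (intercept + slope * xk), tau)
                   for xk, yk in zip(x, y))
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    if best is None:
        raise ValueError("need at least two observations with distinct x")
    return best[1], best[2]


# Hypothetical heterogeneous data: the spread of y grows with x, as it would
# if an unmeasured factor interacted multiplicatively with x.
x = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
y = [1, 2, 2, 4, 3, 6, 4, 8, 5, 10]

lower = fit_quantile_line(x, y, tau=0.1)
upper = fit_quantile_line(x, y, tau=0.9)
print("tau=0.1 (intercept, slope):", lower)
print("tau=0.9 (intercept, slope):", upper)
```

    With data like these the slope estimated for an upper quantile differs from the slope for a lower quantile, which a single mean regression line would hide.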

  2. Multilevel Modeling in Psychosomatic Medicine Research

    PubMed Central

    Myers, Nicholas D.; Brincks, Ahnalee M.; Ames, Allison J.; Prado, Guillermo J.; Penedo, Frank J.; Benedict, Catherine

    2012-01-01

    The primary purpose of this manuscript is to provide an overview of multilevel modeling for Psychosomatic Medicine readers and contributors. The manuscript begins with a general introduction to multilevel modeling. Multilevel regression modeling at two-levels is emphasized because of its prevalence in psychosomatic medicine research. Simulated datasets based on some core ideas from the Familias Unidas effectiveness study are used to illustrate key concepts including: communication of model specification, parameter interpretation, sample size and power, and missing data. Input and key output files from Mplus and SAS are provided. A cluster randomized trial with repeated measures (i.e., three-level regression model) is then briefly presented with simulated data based on some core ideas from a cognitive behavioral stress management intervention in prostate cancer. PMID:23107843

  3. Human Language Technology: Opportunities and Challenges

    DTIC Science & Technology

    2005-01-01

    because of the connections to and reliance on signal processing. Audio diarization critically includes indexing of speakers [12], since speaker ...to reduce inter-speaker variability in training. Standard techniques include vocal-tract length normalization, adaptation of acoustic models using...maximum likelihood linear regression (MLLR), and speaker-adaptive training based on MLLR. The acoustic models are mixtures of Gaussians, typically with

  4. Forecasting models for sugi (Cryptomeria japonica D. Don) pollen count showing an alternate dispersal rhythm.

    PubMed

    Ito, Yukiko; Hattori, Reiko; Mase, Hiroki; Watanabe, Masako; Shiotani, Itaru

    2008-12-01

    Pollen information is indispensable for allergic individuals and clinicians. This study aimed to develop forecasting models for the total annual count of airborne pollen grains based on data monitored over the last 20 years at the Mie Chuo Medical Center, Tsu, Mie, Japan. Airborne pollen grains were collected using a Durham sampler. Total annual pollen count and pollen count from October to December (OD pollen count) of the previous year were transformed to logarithms. Regression analysis of the total pollen count was performed using variables such as the OD pollen count and the maximum temperature for mid-July of the previous year. Time series analysis revealed an alternate rhythm of the series of total pollen count. The alternate rhythm consisted of a cyclic alternation of an "on" year (high pollen count) and an "off" year (low pollen count). This rhythm was used as a dummy variable in regression equations. Of the three models involving the OD pollen count, a multiple regression equation that included the alternate rhythm variable and the interaction of this rhythm with OD pollen count showed a high coefficient of determination (0.844). Of the three models involving the maximum temperature for mid-July, those including the alternate rhythm variable and the interaction of this rhythm with maximum temperature had the highest coefficient of determination (0.925). An alternate pollen dispersal rhythm represented by a dummy variable in the multiple regression analysis plays a key role in improving forecasting models for the total annual sugi pollen count.
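
    One plausible way to write the regression form described in this record is shown below. The symbols are assumptions introduced here, not taken from the paper: Y_t is the total annual pollen count in year t, X_{t-1} is the October to December (OD) pollen count of the previous year, and D_t is the alternate-rhythm dummy (1 in an "on" year, 0 in an "off" year). Whether the published equation kept every main effect is not stated in the abstract.

```latex
\log Y_t = \beta_0 + \beta_1 \log X_{t-1} + \beta_2 D_t
         + \beta_3 \, D_t \log X_{t-1} + \varepsilon_t
```

    Under this form the slope on the log OD count is \beta_1 in "off" years and \beta_1 + \beta_3 in "on" years, which is what the interaction term contributes.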

  5. The PIT-trap-A "model-free" bootstrap procedure for inference about regression models with discrete, multivariate responses.

    PubMed

    Warton, David I; Thibaut, Loïc; Wang, Yi Alice

    2017-01-01

    Bootstrap methods are widely used in statistics, and bootstrapping of residuals can be especially useful in the regression context. However, difficulties are encountered extending residual resampling to regression settings where residuals are not identically distributed (thus not amenable to bootstrapping)-common examples including logistic or Poisson regression and generalizations to handle clustered or multivariate data, such as generalised estimating equations. We propose a bootstrap method based on probability integral transform (PIT-) residuals, which we call the PIT-trap, which assumes data come from some marginal distribution F of known parametric form. This method can be understood as a type of "model-free bootstrap", adapted to the problem of discrete and highly multivariate data. PIT-residuals have the key property that they are (asymptotically) pivotal. The PIT-trap thus inherits the key property, not afforded by any other residual resampling approach, that the marginal distribution of data can be preserved under PIT-trapping. This in turn enables the derivation of some standard bootstrap properties, including second-order correctness of pivotal PIT-trap test statistics. In multivariate data, bootstrapping rows of PIT-residuals affords the property that it preserves correlation in data without the need for it to be modelled, a key point of difference as compared to a parametric bootstrap. The proposed method is illustrated on an example involving multivariate abundance data in ecology, and demonstrated via simulation to have improved properties as compared to competing resampling methods.

  6. The PIT-trap—A “model-free” bootstrap procedure for inference about regression models with discrete, multivariate responses

    PubMed Central

    Warton, David I.; Thibaut, Loïc; Wang, Yi Alice

    2017-01-01

    Bootstrap methods are widely used in statistics, and bootstrapping of residuals can be especially useful in the regression context. However, difficulties are encountered extending residual resampling to regression settings where residuals are not identically distributed (thus not amenable to bootstrapping)—common examples including logistic or Poisson regression and generalizations to handle clustered or multivariate data, such as generalised estimating equations. We propose a bootstrap method based on probability integral transform (PIT-) residuals, which we call the PIT-trap, which assumes data come from some marginal distribution F of known parametric form. This method can be understood as a type of “model-free bootstrap”, adapted to the problem of discrete and highly multivariate data. PIT-residuals have the key property that they are (asymptotically) pivotal. The PIT-trap thus inherits the key property, not afforded by any other residual resampling approach, that the marginal distribution of data can be preserved under PIT-trapping. This in turn enables the derivation of some standard bootstrap properties, including second-order correctness of pivotal PIT-trap test statistics. In multivariate data, bootstrapping rows of PIT-residuals affords the property that it preserves correlation in data without the need for it to be modelled, a key point of difference as compared to a parametric bootstrap. The proposed method is illustrated on an example involving multivariate abundance data in ecology, and demonstrated via simulation to have improved properties as compared to competing resampling methods. PMID:28738071

  7. A comparison of the performances of an artificial neural network and a regression model for GFR estimation.

    PubMed

    Liu, Xun; Li, Ning-shan; Lv, Lin-sheng; Huang, Jian-hua; Tang, Hua; Chen, Jin-xia; Ma, Hui-juan; Wu, Xiao-ming; Lou, Tan-qi

    2013-12-01

    Accurate estimation of glomerular filtration rate (GFR) is important in clinical practice. Current models derived from regression are limited by the imprecision of GFR estimates. We hypothesized that an artificial neural network (ANN) might improve the precision of GFR estimates. This was a study of diagnostic test accuracy. 1,230 patients with chronic kidney disease were enrolled, including the development cohort (n=581), internal validation cohort (n=278), and external validation cohort (n=371). GFR was estimated (eGFR) using a new ANN model and a new regression model, both using age, sex, and standardized serum creatinine level and derived in the development and internal validation cohorts, and using the CKD-EPI (Chronic Kidney Disease Epidemiology Collaboration) 2009 creatinine equation. Measured GFR (mGFR) was obtained using a diethylenetriaminepentaacetic acid renal dynamic imaging method. Serum creatinine was measured with an enzymatic method traceable to isotope-dilution mass spectrometry. In the external validation cohort, mean mGFR was 49±27 (SD) mL/min/1.73 m² and biases (median difference between mGFR and eGFR) for the CKD-EPI, new regression, and new ANN models were 0.4, 1.5, and -0.5 mL/min/1.73 m², respectively (P<0.001 and P=0.02 compared to CKD-EPI and P<0.001 comparing the new regression and ANN models). Precisions (IQRs for the difference) were 22.6, 14.9, and 15.6 mL/min/1.73 m², respectively (P<0.001 for both compared to CKD-EPI and P<0.001 comparing the new ANN and new regression models). Accuracies (proportions of eGFRs not deviating >30% from mGFR) were 50.9%, 77.4%, and 78.7%, respectively (P<0.001 for both compared to CKD-EPI and P=0.5 comparing the new ANN and new regression models). Different methods for measuring GFR were a source of systematic bias in comparisons of new models to CKD-EPI, and both the derivation and validation cohorts consisted of a group of patients who were referred to the same institution.
An ANN model using 3 variables did not perform better than a new regression model. Whether ANN can improve GFR estimation using more variables requires further investigation. Copyright © 2013 National Kidney Foundation, Inc. Published by Elsevier Inc. All rights reserved.
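
    The three agreement measures defined in this record (bias as the median difference between mGFR and eGFR, precision as the interquartile range of that difference, and accuracy as the proportion of estimates within 30% of the measured value) can be computed as in the sketch below. The function name `gfr_agreement` and the four-patient data set are hypothetical, and the quartile convention is that of Python's `statistics.quantiles`, which may differ from the one used in the study.

```python
import statistics


def gfr_agreement(measured, estimated):
    """Return (bias, precision, p30) for paired mGFR / eGFR values.

    bias      : median of (mGFR - eGFR)
    precision : interquartile range of (mGFR - eGFR)
    p30       : fraction of eGFR values within 30% of the paired mGFR
    """
    diffs = [m - e for m, e in zip(measured, estimated)]
    bias = statistics.median(diffs)
    q1, _, q3 = statistics.quantiles(diffs, n=4)  # quartile cut points
    precision = q3 - q1
    within = sum(abs(e - m) <= 0.30 * m for m, e in zip(measured, estimated))
    return bias, precision, within / len(measured)


# Hypothetical values in mL/min/1.73 m^2, for illustration only.
mgfr = [100.0, 50.0, 60.0, 80.0]
egfr = [90.0, 70.0, 60.0, 100.0]
bias, precision, p30 = gfr_agreement(mgfr, egfr)
print(f"bias={bias:.1f}, precision (IQR)={precision:.1f}, P30={p30:.2f}")
```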

  8. Solving large test-day models by iteration on data and preconditioned conjugate gradient.

    PubMed

    Lidauer, M; Strandén, I; Mäntysaari, E A; Pösö, J; Kettunen, A

    1999-12-01

    A preconditioned conjugate gradient method was implemented into an iteration on data program for estimation of breeding values, and its convergence characteristics were studied. An algorithm was used as a reference in which one fixed effect was solved by the Gauss-Seidel method, and other effects were solved by a second-order Jacobi method. Implementation of the preconditioned conjugate gradient required storing four vectors (size equal to number of unknowns in the mixed model equations) in random access memory and reading the data at each round of iteration. The preconditioner comprised diagonal blocks of the coefficient matrix. Comparison of algorithms was based on solutions of mixed model equations obtained by a single-trait animal model and a single-trait, random regression test-day model. Data sets for both models used milk yield records of primiparous Finnish dairy cows. Animal model data comprised 665,629 lactation milk yields and random regression test-day model data of 6,732,765 test-day milk yields. Both models included pedigree information of 1,099,622 animals. The animal model [random regression test-day model] required 122 [305] rounds of iteration to converge with the reference algorithm, but only 88 [149] were required with the preconditioned conjugate gradient. Solving the random regression test-day model with the preconditioned conjugate gradient required 237 megabytes of random access memory and took 14% of the computation time needed by the reference algorithm.
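
    A preconditioned conjugate gradient loop of the kind summarized above can be sketched as follows. Several simplifications are assumptions of this sketch rather than details from the record: the preconditioner is the plain diagonal of the coefficient matrix (the record uses diagonal blocks), the coefficient matrix is applied through a caller-supplied `matvec` function that stands in for one pass over the data, and the names `pcg`, `matvec`, and `diag` are invented here. The loop keeps four working vectors (solution, residual, search direction, and matrix-vector product), reusing none of them for the preconditioned residual to keep the code readable.

```python
import math


def pcg(matvec, b, diag, tol=1e-10, max_rounds=1000):
    """Solve A x = b for symmetric positive definite A.

    matvec(v) must return A @ v as a list; diag holds the diagonal of A
    and serves as a simple (Jacobi) preconditioner.
    Returns (x, rounds_used).
    """
    n = len(b)
    x = [0.0] * n
    r = list(b)                                  # residual b - A x for x = 0
    b_norm = math.sqrt(sum(v * v for v in b))
    if b_norm == 0.0:
        return x, 0                              # trivial system
    z = [ri / di for ri, di in zip(r, diag)]     # preconditioned residual
    p = list(z)                                  # search direction
    rz = sum(ri * zi for ri, zi in zip(r, z))
    for rounds in range(1, max_rounds + 1):
        q = matvec(p)                            # one "pass over the data"
        alpha = rz / sum(pi * qi for pi, qi in zip(p, q))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        if math.sqrt(sum(v * v for v in r)) <= tol * b_norm:
            return x, rounds
        z = [ri / di for ri, di in zip(r, diag)]
        rz_new = sum(ri * zi for ri, zi in zip(r, z))
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_rounds


# Tiny hypothetical system; a real application would never form A explicitly.
A = [[4.0, 1.0], [1.0, 3.0]]
rhs = [1.0, 2.0]
solution, rounds = pcg(lambda v: [sum(a * vi for a, vi in zip(row, v)) for row in A],
                       rhs, diag=[A[0][0], A[1][1]])
print(solution, rounds)
```

    Because the matrix only enters through `matvec`, the same loop works whether the product is computed from a stored matrix or accumulated record by record while reading the data.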

  9. Regression rate study of porous axial-injection, endburning hybrid fuel grains

    NASA Astrophysics Data System (ADS)

    Hitt, Matthew A.

    This experimental and theoretical work examines the effects of gaseous oxidizer flow rates and pressure on the regression rates of porous fuels for hybrid rocket applications. Testing was conducted using polyethylene as the porous fuel and both gaseous oxygen and nitrous oxide as the oxidizer. Nominal test articles were tested using 200, 100, 50, and 15 micron fuel pore sizes. Pressures tested ranged from atmospheric to 1160 kPa for the gaseous oxygen tests and from 207 kPa to 1054 kPa for the nitrous oxide tests, and oxidizer injection velocities ranged from 35 m/s to 80 m/s for the gaseous oxygen tests and from 7.5 m/s to 16.8 m/s for the nitrous oxide tests. Regression rates were determined using pretest and posttest length measurements of the solid fuel. Experimental results demonstrated that the regression rate of the porous axial-injection, end-burning hybrid was a function of the chamber pressure, as opposed to the oxidizer mass flux typical in conventional hybrids. Regression rates ranged from approximately 0.75 mm/s at atmospheric pressure to 8.89 mm/s at 1160 kPa for the gaseous oxygen tests and 0.21 mm/s at 207 kPa to 1.44 mm/s at 1054 kPa for the nitrous oxide tests. The analytical model was developed based on a standard ablative model modified to include oxidizer flow through the grain. The heat transfer from the flame was primarily modeled using an empirically determined flame coefficient that included all heat transfer mechanisms in one term. An exploratory flame model based on the Granular Diffusion Flame model used for solid rocket motors was also adapted for comparison with the empirical flame coefficient. This model was then evaluated quantitatively using the experimental results of the gaseous oxygen tests as well as qualitatively using the experimental results of the nitrous oxide tests. 
The model showed agreement with the experimental results indicating it has potential for giving insight into the flame structure in this motor configuration. Results from the model suggested that both kinetic and diffusion processes could be relevant to the combustion depending on the chamber pressure.

  10. Robust, Adaptive Functional Regression in Functional Mixed Model Framework.

    PubMed

    Zhu, Hongxiao; Brown, Philip J; Morris, Jeffrey S

    2011-09-01

    Functional data are increasingly encountered in scientific studies, and their high dimensionality and complexity lead to many analytical challenges. Various methods for functional data analysis have been developed, including functional response regression methods that involve regression of a functional response on univariate/multivariate predictors with nonparametrically represented functional coefficients. In existing methods, however, the functional regression can be sensitive to outlying curves and outlying regions of curves, so is not robust. In this paper, we introduce a new Bayesian method, robust functional mixed models (R-FMM), for performing robust functional regression within the general functional mixed model framework, which includes multiple continuous or categorical predictors and random effect functions accommodating potential between-function correlation induced by the experimental design. The underlying model involves a hierarchical scale mixture model for the fixed effects, random effect and residual error functions. These modeling assumptions across curves result in robust nonparametric estimators of the fixed and random effect functions which down-weight outlying curves and regions of curves, and produce statistics that can be used to flag global and local outliers. These assumptions also lead to distributions across wavelet coefficients that have outstanding sparsity and adaptive shrinkage properties, with great flexibility for the data to determine the sparsity and the heaviness of the tails. Together with the down-weighting of outliers, these within-curve properties lead to fixed and random effect function estimates that appear in our simulations to be remarkably adaptive in their ability to remove spurious features yet retain true features of the functions. We have developed general code to implement this fully Bayesian method that is automatic, requiring the user to only provide the functional data and design matrices. 
It is efficient enough to handle large data sets, and yields posterior samples of all model parameters that can be used to perform desired Bayesian estimation and inference. Although we present details for a specific implementation of the R-FMM using specific distributional choices in the hierarchical model, 1D functions, and wavelet transforms, the method can be applied more generally using other heavy-tailed distributions, higher dimensional functions (e.g. images), and using other invertible transformations as alternatives to wavelets.

  11. Robust, Adaptive Functional Regression in Functional Mixed Model Framework

    PubMed Central

    Zhu, Hongxiao; Brown, Philip J.; Morris, Jeffrey S.

    2012-01-01

    Functional data are increasingly encountered in scientific studies, and their high dimensionality and complexity lead to many analytical challenges. Various methods for functional data analysis have been developed, including functional response regression methods that involve regression of a functional response on univariate/multivariate predictors with nonparametrically represented functional coefficients. In existing methods, however, the functional regression can be sensitive to outlying curves and outlying regions of curves, so is not robust. In this paper, we introduce a new Bayesian method, robust functional mixed models (R-FMM), for performing robust functional regression within the general functional mixed model framework, which includes multiple continuous or categorical predictors and random effect functions accommodating potential between-function correlation induced by the experimental design. The underlying model involves a hierarchical scale mixture model for the fixed effects, random effect and residual error functions. These modeling assumptions across curves result in robust nonparametric estimators of the fixed and random effect functions which down-weight outlying curves and regions of curves, and produce statistics that can be used to flag global and local outliers. These assumptions also lead to distributions across wavelet coefficients that have outstanding sparsity and adaptive shrinkage properties, with great flexibility for the data to determine the sparsity and the heaviness of the tails. Together with the down-weighting of outliers, these within-curve properties lead to fixed and random effect function estimates that appear in our simulations to be remarkably adaptive in their ability to remove spurious features yet retain true features of the functions. We have developed general code to implement this fully Bayesian method that is automatic, requiring the user to only provide the functional data and design matrices. 
It is efficient enough to handle large data sets, and yields posterior samples of all model parameters that can be used to perform desired Bayesian estimation and inference. Although we present details for a specific implementation of the R-FMM using specific distributional choices in the hierarchical model, 1D functions, and wavelet transforms, the method can be applied more generally using other heavy-tailed distributions, higher dimensional functions (e.g. images), and using other invertible transformations as alternatives to wavelets. PMID:22308015

  12. Differential item functioning analysis with ordinal logistic regression techniques. DIFdetect and difwithpar.

    PubMed

    Crane, Paul K; Gibbons, Laura E; Jolley, Lance; van Belle, Gerald

    2006-11-01

    We present an ordinal logistic regression model for identification of items with differential item functioning (DIF) and apply this model to a Mini-Mental State Examination (MMSE) dataset. We employ item response theory ability estimation in our models. Three nested ordinal logistic regression models are applied to each item. Model testing begins with examination of the statistical significance of the interaction term between ability and the group indicator, consistent with nonuniform DIF. Then we turn our attention to the coefficient of the ability term in models with and without the group term. If including the group term has a marked effect on that coefficient, we declare that it has uniform DIF. We examined DIF related to language of test administration in addition to self-reported race, Hispanic ethnicity, age, years of education, and sex. We used PARSCALE for IRT analyses and STATA for ordinal logistic regression approaches. We used an iterative technique for adjusting IRT ability estimates on the basis of DIF findings. Five items were found to have DIF related to language. These same items also had DIF related to other covariates. The ordinal logistic regression approach to DIF detection, when combined with IRT ability estimates, provides a reasonable alternative for DIF detection. There appear to be several items with significant DIF related to language of test administration in the MMSE. More attention needs to be paid to the specific criteria used to determine whether an item has DIF, not just the technique used to identify DIF.
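
    The three nested models described in this record can be written, for one item, roughly as follows. The notation is an assumption made here for illustration: Y is the ordinal item response, \theta is the IRT ability estimate, G is the group indicator, and the cumulative-logit sign convention shown may differ from the one used by the software in the study.

```latex
\begin{align*}
\text{Model 1:}\quad \operatorname{logit} P(Y \le k) &= \alpha_k + \beta_1 \theta + \beta_2 G + \beta_3 \,\theta G \\
\text{Model 2:}\quad \operatorname{logit} P(Y \le k) &= \alpha_k + \beta_1 \theta + \beta_2 G \\
\text{Model 3:}\quad \operatorname{logit} P(Y \le k) &= \alpha_k + \beta_1 \theta
\end{align*}
```

    In this notation, nonuniform DIF corresponds to a statistically significant \beta_3 in Model 1, and uniform DIF corresponds to a marked change in \beta_1 between Model 3 and Model 2.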

  13. Prediction of silicon oxynitride plasma etching using a generalized regression neural network

    NASA Astrophysics Data System (ADS)

    Kim, Byungwhan; Lee, Byung Teak

    2005-08-01

    A prediction model of silicon oxynitride (SiON) etching was constructed using a neural network. Model prediction performance was improved by means of a genetic algorithm. The etching was conducted in a C₂F₆ inductively coupled plasma. A 2⁴ full factorial experiment was employed to systematically characterize parameter effects on SiON etching. The process parameters include radio frequency source power, bias power, pressure, and C₂F₆ flow rate. To test the appropriateness of the trained model, an additional 16 experiments were conducted. For comparison, four types of statistical regression models were built. Compared to the best regression model, the optimized neural network model demonstrated an improvement of about 52%. The optimized model was used to infer etch mechanisms as a function of parameters. The pressure effect was noticeably large only when relatively large ion bombardment was maintained in the process chamber. Ion-bombardment-activated polymer deposition played the most significant role in interpreting the complex effect of bias power or C₂F₆ flow rate. Moreover, [CF₂] was expected to be the predominant precursor to polymer deposition.
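
    A generalized regression neural network predicts by taking a Gaussian-kernel weighted average of the training responses, with a single spread parameter controlling the smoothing. The sketch below shows that prediction step only. The spread is set by hand here, whereas the record reports tuning the model with a genetic algorithm; the two coded factors, the response values, and the name `grnn_predict` are hypothetical.

```python
import math


def grnn_predict(train_x, train_y, query, spread):
    """Kernel-weighted average of train_y, weighted by distance to query.

    train_x : list of input tuples, train_y : list of responses,
    spread  : Gaussian kernel width (the smoothing parameter).
    """
    weights = []
    for xi in train_x:
        dist2 = sum((a - b) ** 2 for a, b in zip(xi, query))
        weights.append(math.exp(-dist2 / (2.0 * spread ** 2)))
    total = sum(weights)
    if total == 0.0:
        raise ValueError("query too far from all training points for this spread")
    return sum(w * y for w, y in zip(weights, train_y)) / total


# Hypothetical 2^2 factorial design in coded units (-1 / +1) with made-up responses.
design = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
response = [10.0, 20.0, 30.0, 40.0]

print(grnn_predict(design, response, (0, 0), spread=1.0))   # centre of the design
print(grnn_predict(design, response, (1, 1), spread=0.2))   # close to a training run
```

    A small spread reproduces the nearest training run almost exactly, while a large spread pulls every prediction toward the overall mean, which is why the spread is the natural quantity to optimize.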

  14. Examining geological controls on baseflow index (BFI) using regression analysis: An illustration from the Thames Basin, UK

    NASA Astrophysics Data System (ADS)

    Bloomfield, J. P.; Allen, D. J.; Griffiths, K. J.

    2009-06-01

    Linear regression methods can be used to quantify geological controls on baseflow index (BFI). This is illustrated using an example from the Thames Basin, UK. Two approaches have been adopted. The areal extents of geological classes based on lithostratigraphic and hydrogeological classification schemes have been correlated with BFI for 44 'natural' catchments from the Thames Basin. When regression models that include a constant term are built using lithostratigraphic classes, the models are shown to have some physical meaning and the relative influence of the different geological classes on BFI can be quantified. For example, the regression constants for two such models, 0.64 and 0.69, are consistent with the mean observed BFI (0.65) for the Thames Basin, and the signs and relative magnitudes of the regression coefficients for each of the lithostratigraphic classes are consistent with the hydrogeology of the Basin. In addition, regression coefficients for the lithostratigraphic classes scale linearly with estimates of log10 hydraulic conductivity for each lithological class. When a regression is built using a hydrogeological classification scheme with no constant term, the model does not have any physical meaning, but it has a relatively high adjusted R² value and because of the continuous coverage of the hydrogeological classification scheme, the model can be used for predictive purposes. A model calibrated on the 44 'natural' catchments and using four hydrogeological classes (low-permeability surficial deposits, consolidated aquitards, fractured aquifers and intergranular aquifers) is shown to perform as well as a model based on a hydrology of soil types (BFIHOST) scheme in predicting BFI in the Thames Basin. Validation of this model using 110 other 'variably impacted' catchments in the Basin shows that there is a correlation between modelled and observed BFI.
    Where the observed BFI is significantly higher than modelled BFI, the deviations can be explained by an exogenous factor, catchment urban area. It is inferred that this may be due to influences from sewage discharge, mains leakage, and leakage from septic tanks.

  15. Static and moving solid/gas interface modeling in a hybrid rocket engine

    NASA Astrophysics Data System (ADS)

    Mangeot, Alexandre; William-Louis, Mame; Gillard, Philippe

    2018-07-01

    A numerical model was developed with CFD-ACE software to study the working condition of an oxygen-nitrogen/polyethylene hybrid rocket combustor. As a first approach, a simplified numerical model is presented. It includes a compressible transient gas phase in which a two-step combustion mechanism is implemented coupled to a radiative model. The solid phase from the fuel grain is a semi-opaque material with its degradation process modeled by an Arrhenius type law. Two versions of the model were tested. The first considers the solid/gas interface with a static grid while the second uses grid deformation during the computation to follow the asymmetrical regression. The numerical results are obtained with two different regression kinetics originating from ThermoGravimetry Analysis and test bench results. In each case, the fuel surface temperature is retrieved within a range of 5% error. However, good results are only found using kinetics from the test bench. The regression rate is found within 0.03 mm s⁻¹, and the average combustor pressure and its variation over time have the same intensity as the measurements conducted on the test bench. The simulation that uses grid deformation to follow the regression shows good stability over 10 s of simulated time.

  16. Quantifying the causal effects of 20mph zones on road casualties in London via doubly robust estimation.

    PubMed

    Li, Haojie; Graham, Daniel J

    2016-08-01

    This paper estimates the causal effect of 20mph zones on road casualties in London. Potential confounders in the key relationship of interest are included within outcome regression and propensity score models, and the models are then combined to form a doubly robust estimator. A total of 234 treated zones and 2844 potential control zones are included in the data sample. The propensity score model is used to select a viable control group which has common support in the covariate distributions. We compare the doubly robust estimates with those obtained using three other methods: inverse probability weighting, regression adjustment, and propensity score matching. The results indicate that 20mph zones have had a significant causal impact on road casualty reduction in both absolute and proportional terms. Copyright © 2016 Elsevier Ltd. All rights reserved.
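
    The way an outcome regression and a propensity score model are combined into a doubly robust estimate can be illustrated with the augmented inverse probability weighting (AIPW) formula. This is a sketch under stated assumptions, not the paper's estimator: it targets the average treatment effect, it takes the fitted propensity scores and outcome predictions as given inputs rather than fitting them, and the name `aipw_ate` and the toy numbers are hypothetical.

```python
def aipw_ate(y, t, propensity, mu1, mu0):
    """Augmented IPW estimate of the average treatment effect.

    y          : observed outcomes
    t          : treatment indicators (1 = treated, 0 = control)
    propensity : fitted P(t = 1 | covariates), strictly between 0 and 1
    mu1, mu0   : outcome-model predictions under treatment and under control
    """
    total = 0.0
    for yi, ti, ei, m1, m0 in zip(y, t, propensity, mu1, mu0):
        total += ((m1 - m0)
                  + ti * (yi - m1) / ei                    # correction from treated units
                  - (1 - ti) * (yi - m0) / (1.0 - ei))     # correction from control units
    return total / len(y)


# Toy inputs: the outcome model fits the observed outcomes exactly, so the
# weighting corrections vanish and the estimate is the mean of mu1 - mu0.
y = [3.0, 5.0, 2.0, 4.0]
t = [1, 1, 0, 0]
e = [0.5, 0.5, 0.5, 0.5]
print(aipw_ate(y, t, e, mu1=[3.0, 5.0, 4.0, 6.0], mu0=[1.0, 3.0, 2.0, 4.0]))
```

    The estimate remains consistent if either the outcome model or the propensity model is correctly specified, which is the sense in which the combination is called doubly robust. With the outcome predictions set to zero the formula reduces to plain inverse probability weighting.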

  17. Modelling Nitrogen Oxides in Los Angeles Using a Hybrid Dispersion/Land Use Regression Model

    NASA Astrophysics Data System (ADS)

    Wilton, Darren C.

    The goal of this dissertation is to develop models capable of predicting long-term annual average NOx concentrations in urban areas. Predictions from simple meteorological dispersion models and seasonal proxies for NO2 oxidation were included as covariates in a land use regression (LUR) model for NOx in Los Angeles, CA. The NOx measurements were obtained from a comprehensive measurement campaign that is part of the Multi-Ethnic Study of Atherosclerosis Air Pollution Study (MESA Air). Simple land use regression models were initially developed using a suite of GIS-derived land use variables developed from various buffer sizes (R²=0.15). Caline3, a simple steady-state Gaussian line source model, was initially incorporated into the land-use regression framework. The addition of this spatio-temporally varying Caline3 covariate improved the simple LUR model predictions. The extent of improvement was much more pronounced for models based solely on the summer measurements (simple LUR: R²=0.45; Caline3/LUR: R²=0.70) than it was for models based on all seasons (R²=0.20). We then used a Lagrangian dispersion model to convert static land use covariates for population density and commercial/industrial area into spatially and temporally varying covariates. The inclusion of these covariates resulted in significant improvement in model prediction (R²=0.57). In addition to the dispersion model covariates described above, a two-week average value of daily peak-hour ozone was included as a surrogate of the oxidation of NO2 during the different sampling periods. This additional covariate further improved overall model performance for all models. The best model by 10-fold cross validation (R²=0.73) contained the Caline3 prediction, a static covariate for length of A3 roads within 50 meters, the Calpuff-adjusted covariates derived from both population density and industrial/commercial land area, and the ozone covariate.
This model was tested against annual average NOx concentrations from an independent data set from the EPA's Air Quality System (AQS) and MESA Air fixed site monitors, and performed very well (R²=0.82).

  18. Nonparametric instrumental regression with non-convex constraints

    NASA Astrophysics Data System (ADS)

    Grasmair, M.; Scherzer, O.; Vanhems, A.

    2013-03-01

    This paper considers the nonparametric regression model with an additive error that is dependent on the explanatory variables. As is common in empirical studies in epidemiology and economics, it also supposes that valid instrumental variables are observed. A classical example in microeconomics considers the consumer demand function as a function of the price of goods and the income, both variables often considered as endogenous. In this framework, the economic theory also imposes shape restrictions on the demand function, such as integrability conditions. Motivated by this illustration in microeconomics, we study an estimator of a nonparametric constrained regression function using instrumental variables by means of Tikhonov regularization. We derive rates of convergence for the regularized model both in a deterministic and stochastic setting under the assumption that the true regression function satisfies a projected source condition including, because of the non-convexity of the imposed constraints, an additional smallness condition.

  19. Fundamental Phenomena on Fuel Decomposition and Boundary-Layer Combustion Processes with Applications to Hybrid Rocket Motors. Part 1; Experimental Investigation

    NASA Technical Reports Server (NTRS)

    Kuo, Kenneth K.; Lu, Yeu-Cherng; Chiaverini, Martin J.; Johnson, David K.; Serin, Nadir; Risha, Grant A.; Merkle, Charles L.; Venkateswaran, Sankaran

    1996-01-01

    This final report summarizes the major findings on the subject of 'Fundamental Phenomena on Fuel Decomposition and Boundary-Layer Combustion Processes with Applications to Hybrid Rocket Motors', performed from 1 April 1994 to 30 June 1996. Both experimental results from Task 1 and theoretical/numerical results from Task 2 are reported here in two parts. Part 1 covers the experimental work performed and describes the test facility setup, data reduction techniques employed, and results of the test firings, including effects of operating conditions and fuel additives on solid fuel regression rate and thermal profiles of the condensed phase. Part 2 concerns the theoretical/numerical work. It covers physical modeling of the combustion processes including gas/surface coupling, and radiation effect on regression rate. The numerical solution of the flowfield structure and condensed phase regression behavior are presented. Experimental data from the test firings were used for numerical model validation.

  20. An application in identifying high-risk populations in alternative tobacco product use utilizing logistic regression and CART: a heuristic comparison.

    PubMed

    Lei, Yang; Nollen, Nikki; Ahluwahlia, Jasjit S; Yu, Qing; Mayo, Matthew S

    2015-04-09

    Other forms of tobacco use are increasing in prevalence, yet most tobacco control efforts are aimed at cigarettes. In light of this, it is important to identify individuals who are using both cigarettes and alternative tobacco products (ATPs). Most previous studies have used regression models. We fitted a traditional logistic regression model and a classification and regression tree (CART) model to illustrate and discuss the added advantages of using CART in the setting of identifying high-risk subgroups of ATP users among cigarette smokers. The data were collected from an online cross-sectional survey administered by Survey Sampling International between July 5, 2012 and August 15, 2012. Eligible participants self-identified as current smokers, African American, White, or Latino (of any race), were English-speaking, and were at least 25 years old. The study sample included 2,376 participants and was divided into independent training and validation samples for a hold-out validation. Logistic regression and CART models were used to examine the important predictors of cigarettes + ATP users. The logistic regression model identified nine important factors: gender, age, race, nicotine dependence, buying cigarettes or borrowing, whether the price of cigarettes influences the brand purchased, whether the participants set limits on cigarettes per day, alcohol use scores, and discrimination frequencies. The c-index of the logistic regression model was 0.74, indicating good discriminatory capability. The model also performed well in the validation cohort, with good discrimination (c-index = 0.73) and excellent calibration (R-square = 0.96 in the calibration regression). The parsimonious CART model identified gender, age, alcohol use score, race, and discrimination frequencies to be the most important factors. It also revealed interesting partial interactions. The c-index was 0.70 for the training sample and 0.69 for the validation sample. The misclassification rate was 0.342 for the training sample and 0.346 for the validation sample. The CART model was easier to interpret and discovered target populations that possess clinical significance. This study suggests that the non-parametric CART model is parsimonious, potentially easier to interpret, and provides additional information in identifying the subgroups at high risk of ATP use among cigarette smokers.

  1. Overhead longwave infrared hyperspectral material identification using radiometric models

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Zelinski, M. E.

    Material detection algorithms used in hyperspectral data processing are computationally efficient but can produce relatively high numbers of false positives. Material identification performed as a secondary processing step on detected pixels can help separate true and false positives. This paper presents a material identification processing chain for longwave infrared hyperspectral data of solid materials collected from airborne platforms. The algorithms utilize unwhitened radiance data and an iterative algorithm that determines the temperature, humidity, and ozone of the atmospheric profile. Pixel unmixing is done using constrained linear regression and Bayesian Information Criteria for model selection. The resulting product includes an optimal atmospheric profile and full radiance material model that includes material temperature, abundance values, and several fit statistics. A logistic regression method utilizing all model parameters to improve identification is also presented. This paper details the processing chain and provides justification for the algorithms used. Several examples are provided using modeled data at different noise levels.

  2. Statistical relations among earthquake magnitude, surface rupture length, and surface fault displacement

    USGS Publications Warehouse

    Bonilla, M.G.; Mark, R.K.; Lienkaemper, J.J.

    1984-01-01

    In order to refine correlations of surface-wave magnitude, fault rupture length at the ground surface, and fault displacement at the surface by including the uncertainties in these variables, the existing data were critically reviewed and a new data base was compiled. Earthquake magnitudes were redetermined as necessary to make them as consistent as possible with the Gutenberg methods and results, which necessarily make up much of the data base. Measurement errors were estimated for the three variables for 58 moderate to large shallow-focus earthquakes. Regression analyses were then made utilizing the estimated measurement errors. The regression analysis demonstrates that the relations among the variables magnitude, length, and displacement are stochastic in nature. The stochastic variance, introduced in part by incomplete surface expression of seismogenic faulting, variation in shear modulus, and regional factors, dominates the estimated measurement errors. Thus, it is appropriate to use ordinary least squares for the regression models, rather than regression models based upon an underlying deterministic relation with the variance resulting from measurement errors. Significant differences exist in correlations of certain combinations of length, displacement, and magnitude when events are grouped by fault type or by region, including attenuation regions delineated by Evernden and others. Subdivision of the data results in too few data for some fault types and regions, and for these only regressions using all of the data as a group are reported. Estimates of the magnitude and the standard deviation of the magnitude of a prehistoric or future earthquake associated with a fault can be made by correlating M with the logarithms of rupture length, fault displacement, or the product of length and displacement. Fault rupture area could be reliably estimated for about 20 of the events in the data set. Regression of MS on rupture area did not result in a marked improvement over regressions that did not involve rupture area. Because no subduction-zone earthquakes are included in this study, the reported results do not apply to such zones.

  3. Measurement error in epidemiologic studies of air pollution based on land-use regression models.

    PubMed

    Basagaña, Xavier; Aguilera, Inmaculada; Rivera, Marcela; Agis, David; Foraster, Maria; Marrugat, Jaume; Elosua, Roberto; Künzli, Nino

    2013-10-15

    Land-use regression (LUR) models are increasingly used to estimate air pollution exposure in epidemiologic studies. These models use air pollution measurements taken at a small set of locations and modeling based on geographical covariates for which data are available at all study participant locations. The process of LUR model development commonly includes a variable selection procedure. When LUR model predictions are used as explanatory variables in a model for a health outcome, measurement error can lead to bias of the regression coefficients and to inflation of their variance. In previous studies dealing with spatial predictions of air pollution, bias was shown to be small while most of the effect of measurement error was on the variance. In this study, we show that in realistic cases where LUR models are applied to health data, bias in health-effect estimates can be substantial. This bias depends on the number of air pollution measurement sites, the number of available predictors for model selection, and the amount of explainable variability in the true exposure. These results should be taken into account when interpreting health effects from studies that used LUR models.

  4. Regression analysis of current-status data: an application to breast-feeding.

    PubMed

    Grummer-strawn, L M

    1993-09-01

    "Although techniques for calculating mean survival time from current-status data are well known, their use in multiple regression models is somewhat troublesome. Using data on current breast-feeding behavior, this article considers a number of techniques that have been suggested in the literature, including parametric, nonparametric, and semiparametric models as well as the application of standard schedules. Models are tested in both proportional-odds and proportional-hazards frameworks....I fit [the] models to current status data on breast-feeding from the Demographic and Health Survey (DHS) in six countries: two African (Mali and Ondo State, Nigeria), two Asian (Indonesia and Sri Lanka), and two Latin American (Colombia and Peru)." excerpt

  5. [Use of multiple regression models in observational studies (1970-2013) and requirements of the STROBE guidelines in Spanish scientific journals].

    PubMed

    Real, J; Cleries, R; Forné, C; Roso-Llorach, A; Martínez-Sánchez, J M

    In medicine and biomedical research, statistical techniques like logistic, linear, Cox and Poisson regression are widely known. The main objective is to describe the evolution of multivariate techniques used in observational studies indexed in PubMed (1970-2013), and to check the requirements of the STROBE guidelines in the author guidelines of Spanish journals indexed in PubMed. A targeted PubMed search was performed to identify papers that used logistic, linear, Cox and Poisson models. A review was also made of the author guidelines of journals published in Spain and indexed in PubMed and Web of Science. Only 6.1% of the indexed manuscripts included a term related to multivariate analysis, increasing from 0.14% in 1980 to 12.3% in 2013. In 2013, 6.7, 2.5, 3.5, and 0.31% of the manuscripts contained terms related to logistic, linear, Cox and Poisson regression, respectively. On the other hand, 12.8% of journals' author guidelines explicitly recommend following the STROBE guidelines, and 35.9% recommend the CONSORT guideline. A low percentage of Spanish scientific journals indexed in PubMed include the STROBE statement requirement in the author guidelines. Multivariate regression models such as logistic, linear, Cox and Poisson regression are increasingly used in published observational studies, both at the international level and in journals published in Spanish. Copyright © 2015 Sociedad Española de Médicos de Atención Primaria (SEMERGEN). Publicado por Elsevier España, S.L.U. All rights reserved.

  6. The association between short interpregnancy interval and preterm birth in Louisiana: a comparison of methods.

    PubMed

    Howard, Elizabeth J; Harville, Emily; Kissinger, Patricia; Xiong, Xu

    2013-07-01

    There is growing interest in the application of propensity scores (PS) in epidemiologic studies, especially within the field of reproductive epidemiology. This retrospective cohort study assesses the impact of a short interpregnancy interval (IPI) on preterm birth and compares the results of the conventional logistic regression analysis with analyses utilizing a PS. The study included 96,378 singleton infants from Louisiana birth certificate data (1995-2007). Five regression models designed for methods comparison are presented. Ten percent (10.17 %) of all births were preterm; 26.83 % of births were from a short IPI. The PS-adjusted model produced a more conservative estimate of the exposure variable compared to the conventional logistic regression method (β-coefficient: 0.21 vs. 0.43), as well as a smaller standard error (0.024 vs. 0.028), odds ratio and 95 % confidence intervals [1.15 (1.09, 1.20) vs. 1.23 (1.17, 1.30)]. The inclusion of more covariate and interaction terms in the PS did not change the estimates of the exposure variable. This analysis indicates that PS-adjusted regression may be appropriate for validation of conventional methods in a large dataset with a fairly common outcome. PS's may be beneficial in producing more precise estimates, especially for models with many confounders and effect modifiers and where conventional adjustment with logistic regression is unsatisfactory. Short intervals between pregnancies are associated with preterm birth in this population, according to either technique. Birth spacing is an issue that women have some control over. Educational interventions, including birth control, should be applied during prenatal visits and following delivery.

  7. A Developmental Sequence Model to University Adjustment of International Undergraduate Students

    ERIC Educational Resources Information Center

    Chavoshi, Saeid; Wintre, Maxine Gallander; Dentakos, Stella; Wright, Lorna

    2017-01-01

    The current study proposes a Developmental Sequence Model to University Adjustment and uses a multifaceted measure, including academic, social and psychological adjustment, to examine factors predictive of undergraduate international student adjustment. A hierarchic regression model is carried out on the Student Adaptation to College Questionnaire…

  8. A Latent Transition Model with Logistic Regression

    ERIC Educational Resources Information Center

    Chung, Hwan; Walls, Theodore A.; Park, Yousung

    2007-01-01

    Latent transition models increasingly include covariates that predict prevalence of latent classes at a given time or transition rates among classes over time. In many situations, the covariate of interest may be latent. This paper describes an approach for handling both manifest and latent covariates in a latent transition model. A Bayesian…

  9. A method for nonlinear exponential regression analysis

    NASA Technical Reports Server (NTRS)

    Junkin, B. G.

    1971-01-01

    A computer-oriented technique is presented for performing a nonlinear exponential regression analysis on decay-type experimental data. The technique involves the least squares procedure wherein the nonlinear problem is linearized by expansion in a Taylor series. A linear curve fitting procedure for determining the initial nominal estimates for the unknown exponential model parameters is included as an integral part of the technique. A correction matrix was derived and then applied to the nominal estimate to produce an improved set of model parameters. The solution cycle is repeated until some predetermined criterion is satisfied.
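
    The iteration described above (linear fit for the nominal estimates, Taylor-series linearization, repeated correction until a criterion is met) can be sketched as a Gauss-Newton loop. This is a minimal illustration and not the report's own program: the single-term model y = a·exp(b·t), the function name `fit_exponential`, the log-linear starting fit, and the stopping tolerance are all assumptions made for the example.

```python
import numpy as np

def fit_exponential(t, y, tol=1e-10, max_iter=50):
    """Fit y ~ a * exp(b * t) by Gauss-Newton (hypothetical helper)."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    # Nominal estimates from a straight-line fit to log(y); assumes y > 0.
    b, log_a = np.polyfit(t, np.log(y), 1)
    a = np.exp(log_a)
    for _ in range(max_iter):
        e = np.exp(b * t)
        residual = y - a * e
        # First-order Taylor terms: partial derivatives w.r.t. a and b.
        jacobian = np.column_stack([e, a * t * e])
        # Least-squares correction from the linearized problem.
        delta, *_ = np.linalg.lstsq(jacobian, residual, rcond=None)
        a, b = a + delta[0], b + delta[1]
        # Assumed stopping criterion: corrections become negligible.
        if np.max(np.abs(delta)) < tol:
            break
    return a, b

# Synthetic decay data generated with a = 3.0 and b = -0.7.
t = np.linspace(0.0, 5.0, 20)
y = 3.0 * np.exp(-0.7 * t)
a_hat, b_hat = fit_exponential(t, y)
```

    On noise-free synthetic data the log-linear starting values are already close to the generating parameters, so the loop mainly confirms them; with noisy data the correction step is what refines the nominal estimates.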

  10. pLARmEB: integration of least angle regression with empirical Bayes for multilocus genome-wide association studies.

    PubMed

    Zhang, J; Feng, J-Y; Ni, Y-L; Wen, Y-J; Niu, Y; Tamba, C L; Yue, C; Song, Q; Zhang, Y-M

    2017-06-01

    Multilocus genome-wide association studies (GWAS) have become the state-of-the-art procedure to identify quantitative trait nucleotides (QTNs) associated with complex traits. However, implementation of multilocus model in GWAS is still difficult. In this study, we integrated least angle regression with empirical Bayes to perform multilocus GWAS under polygenic background control. We used an algorithm of model transformation that whitened the covariance matrix of the polygenic matrix K and environmental noise. Markers on one chromosome were included simultaneously in a multilocus model and least angle regression was used to select the most potentially associated single-nucleotide polymorphisms (SNPs), whereas the markers on the other chromosomes were used to calculate kinship matrix as polygenic background control. The selected SNPs in multilocus model were further detected for their association with the trait by empirical Bayes and likelihood ratio test. We herein refer to this method as the pLARmEB (polygenic-background-control-based least angle regression plus empirical Bayes). Results from simulation studies showed that pLARmEB was more powerful in QTN detection and more accurate in QTN effect estimation, had less false positive rate and required less computing time than Bayesian hierarchical generalized linear model, efficient mixed model association (EMMA) and least angle regression plus empirical Bayes. pLARmEB, multilocus random-SNP-effect mixed linear model and fast multilocus random-SNP-effect EMMA methods had almost equal power of QTN detection in simulation experiments. However, only pLARmEB identified 48 previously reported genes for 7 flowering time-related traits in Arabidopsis thaliana.

  11. On approaches to analyze the sensitivity of simulated hydrologic fluxes to model parameters in the community land model

    DOE PAGES

    Bao, Jie; Hou, Zhangshuan; Huang, Maoyi; ...

    2015-12-04

    Here, effective sensitivity analysis approaches are needed to identify important parameters or factors and their uncertainties in complex Earth system models composed of multi-phase multi-component phenomena and multiple biogeophysical-biogeochemical processes. In this study, the impacts of 10 hydrologic parameters in the Community Land Model on simulations of runoff and latent heat flux are evaluated using data from a watershed. Different metrics, including residual statistics, the Nash-Sutcliffe coefficient, and log mean square error, are used as alternative measures of the deviations between the simulated and field observed values. Four sensitivity analysis (SA) approaches, including analysis of variance based on the generalized linear model, generalized cross validation based on the multivariate adaptive regression splines model, standardized regression coefficients based on a linear regression model, and analysis of variance based on support vector machine, are investigated. Results suggest that these approaches show consistent measurement of the impacts of major hydrologic parameters on response variables, but with differences in the relative contributions, particularly for the secondary parameters. The convergence behaviors of the SA with respect to the number of sampling points are also examined with different combinations of input parameter sets and output response variables and their alternative metrics. This study helps identify the optimal SA approach, provides guidance for the calibration of the Community Land Model parameters to improve the model simulations of land surface fluxes, and approximates the magnitudes to be adjusted in the parameter values during parametric model optimization.

  12. Genetic parameters for growth characteristics of free-range chickens under univariate random regression models.

    PubMed

    Rovadoscki, Gregori A; Petrini, Juliana; Ramirez-Diaz, Johanna; Pertile, Simone F N; Pertille, Fábio; Salvian, Mayara; Iung, Laiza H S; Rodriguez, Mary Ana P; Zampar, Aline; Gaya, Leila G; Carvalho, Rachel S B; Coelho, Antonio A D; Savino, Vicente J M; Coutinho, Luiz L; Mourão, Gerson B

    2016-09-01

    Repeated measures from the same individual have been analyzed by using repeatability and finite dimension models under univariate or multivariate analyses. However, in the last decade, the use of random regression models for genetic studies with longitudinal data has become more common. Thus, the aim of this research was to estimate genetic parameters for body weight of four experimental chicken lines by using univariate random regression models. Body weight data from hatching to 84 days of age (n = 34,730) from four experimental free-range chicken lines (7P, Caipirão da ESALQ, Caipirinha da ESALQ and Carijó Barbado) were used. The analysis model included the fixed effects of contemporary group (gender and rearing system), fixed regression coefficients for age at measurement, and random regression coefficients for permanent environmental effects and additive genetic effects. Heterogeneous variances for residual effects were considered, and one residual variance was assigned for each of six subclasses of age at measurement. Random regression curves were modeled by using Legendre polynomials of the second and third orders, with the best model chosen based on the Akaike Information Criterion, Bayesian Information Criterion, and restricted maximum likelihood. Multivariate analyses under the same animal mixed model were also performed for the validation of the random regression models. The Legendre polynomials of second order were better for describing the growth curves of the lines studied. Moderate to high heritabilities (h² = 0.15 to 0.98) were estimated for body weight between one and 84 days of age, suggesting that selection for body weight at all ages can be used as a selection criterion. Genetic correlations among body weight records obtained through multivariate analyses ranged from 0.18 to 0.96, 0.12 to 0.89, 0.06 to 0.96, and 0.28 to 0.96 in 7P, Caipirão da ESALQ, Caipirinha da ESALQ, and Carijó Barbado chicken lines, respectively. Results indicate that genetic gain for body weight can be achieved by selection. Also, selection for body weight at 42 days of age can be maintained as a selection criterion. © 2016 Poultry Science Association Inc.
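
    To make the Legendre-polynomial covariates concrete, the sketch below builds a second-order basis over age. It is only an illustration under stated assumptions: the function name `legendre_basis` is hypothetical, hatching is taken as day 0 so that ages span 0 to 84 days, and the polynomials are left unnormalized, whereas random regression software often uses a normalized form.

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_basis(age_days, min_age=0.0, max_age=84.0, order=2):
    """Return Legendre covariates P0..P_order for each age (hypothetical helper)."""
    age = np.asarray(age_days, dtype=float)
    # Legendre polynomials are defined on [-1, 1], so rescale age first.
    x = 2.0 * (age - min_age) / (max_age - min_age) - 1.0
    # Column j holds the unnormalized Legendre polynomial of degree j.
    return legendre.legvander(x, order)

# One row per record, one column per polynomial order.
basis = legendre_basis([0, 42, 84])
```

    In a random regression model each animal would receive its own random coefficients on these columns; fitting those coefficients requires mixed-model software and is outside this sketch.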

  13. A new model for estimating total body water from bioelectrical resistance

    NASA Technical Reports Server (NTRS)

    Siconolfi, S. F.; Kear, K. T.

    1992-01-01

    Estimation of total body water (T) from bioelectrical resistance (R) is commonly done by stepwise regression models with height squared over R (H²/R), age, sex, and weight (W). Polynomials of H²/R have not been included in these models. We examined the validity of a model with third-order polynomials and W. Methods: T was measured with oxygen-18 labeled water in 27 subjects. R at 50 kHz was obtained from electrodes placed on the hand and foot while subjects were in the supine position. A stepwise regression equation was developed with 13 subjects (age 31.5 ± 6.2 years, T 38.2 ± 6.6 L, W 65.2 ± 12.0 kg). Correlations, standard errors of estimate and mean differences were computed between T and estimated T's from the new (N) model and other models. Evaluations were completed with the remaining 14 subjects (age 32.4 ± 6.3 years, T 40.3 ± 8 L, W 70.2 ± 12.3 kg) and two of its subgroups (high and low). Results: A regression equation was developed from the model. The only significant mean difference was between T and one of the earlier models. Conclusion: Third-order polynomials in regression models may increase the accuracy of estimating total body water. Evaluating the model with a larger population is needed.

  14. Estimated Perennial Streams of Idaho and Related Geospatial Datasets

    USGS Publications Warehouse

    Rea, Alan; Skinner, Kenneth D.

    2009-01-01

    The perennial or intermittent status of a stream has bearing on many regulatory requirements. Because of changing technologies over time, cartographic representation of perennial/intermittent status of streams on U.S. Geological Survey (USGS) topographic maps is not always accurate and (or) consistent from one map sheet to another. Idaho Administrative Code defines an intermittent stream as one having a 7-day, 2-year low flow (7Q2) less than 0.1 cubic feet per second. To establish consistency with the Idaho Administrative Code, the USGS developed regional regression equations for Idaho streams for several low-flow statistics, including 7Q2. Using these regression equations, the 7Q2 streamflow may be estimated for naturally flowing streams anywhere in Idaho to help determine perennial/intermittent status of streams. Using these equations in conjunction with a Geographic Information System (GIS) technique known as weighted flow accumulation allows for an automated and continuous estimation of 7Q2 streamflow at all points along a stream, which in turn can be used to determine if a stream is intermittent or perennial according to the Idaho Administrative Code operational definition. The selected regression equations were applied to create continuous grids of 7Q2 estimates for the eight low-flow regression regions of Idaho. By applying the 0.1 ft³/s criterion, the perennial streams have been estimated in each low-flow region. Uncertainty in the estimates is shown by identifying a 'transitional' zone, corresponding to flow estimates of 0.1 ft³/s plus and minus one standard error. Considerable additional uncertainty exists in the model of perennial streams presented in this report. The regression models provide overall estimates based on general trends within each regression region. These models do not include local factors such as a large spring or a losing reach that may greatly affect flows at any given point. Site-specific flow data, assuming a sufficient period of record, generally would be considered to represent flow conditions better at a given site than flow estimates based on regionalized regression models. The geospatial datasets of modeled perennial streams are considered a first-cut estimate, and should not be construed to override site-specific flow data.
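
    The classification rule in this abstract can be written as a small function. The sketch is illustrative only: the name `classify_stream` is hypothetical, and it assumes the standard error is expressed additively in cubic feet per second, which the abstract does not state (regional regression errors are often reported in other units).

```python
def classify_stream(q7q2_cfs, standard_error_cfs, threshold_cfs=0.1):
    """Label a 7Q2 estimate as perennial, intermittent, or transitional."""
    # Assumed transitional band: threshold plus and minus one standard error.
    lower = threshold_cfs - standard_error_cfs
    upper = threshold_cfs + standard_error_cfs
    if q7q2_cfs < lower:
        return "intermittent"
    if q7q2_cfs > upper:
        return "perennial"
    return "transitional"

labels = [classify_stream(q, 0.05) for q in (0.02, 0.12, 0.5)]
```

    Applied to every cell of a 7Q2 grid, a rule of this kind would give the three-way map the abstract describes, subject to the caveat that site-specific flow data take precedence.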

  15. A Selective Review of Group Selection in High-Dimensional Models

    PubMed Central

    Huang, Jian; Breheny, Patrick; Ma, Shuangge

    2013-01-01

    Grouping structures arise naturally in many statistical modeling problems. Several methods have been proposed for variable selection that respect grouping structure in variables. Examples include the group LASSO and several concave group selection methods. In this article, we give a selective review of group selection concerning methodological developments, theoretical properties and computational algorithms. We pay particular attention to group selection methods involving concave penalties. We address both group selection and bi-level selection methods. We describe several applications of these methods in nonparametric additive models, semiparametric regression, seemingly unrelated regressions, genomic data analysis and genome wide association studies. We also highlight some issues that require further study. PMID:24174707

  16. [Associations of the Employment Status during the First 2 Years Following Medical Rehabilitation and Long Term Occupational Trajectories: Implications for Outcome Measurement].

    PubMed

    Holstiege, J; Kaluscha, R; Jankowiak, S; Krischak, G

    2017-02-01

    Study Objectives: The aim was to investigate the predictive value of the employment status measured in the 6th, 12th, 18th and 24th month after medical rehabilitation for long-term employment trajectories during 4 years. Methods: A retrospective study was conducted based on a 20% sample of all patients receiving inpatient rehabilitation funded by the German pension fund. Patients aged <62 years who were treated due to musculoskeletal, cardiovascular or psychosomatic disorders during the years 2002-2005 were included and followed for 4 consecutive years. The predictive value of the employment status in 4 predefined months after discharge (6th, 12th, 18th and 24th month), for the total number of months in employment in 4 years following rehabilitative treatment was analyzed using multiple linear regression. Per time point, separate regression analyses were conducted, including the employment status (employed vs. unemployed) at the respective point in time as explanatory variable, besides a standard set of additional prognostic variables. Results: A total of 252 591 patients were eligible for study inclusion. The level of explained variance of the regression models increased with the point in time used to measure the employment status, included as explanatory variable. Overall the R²-measure increased by 30% from the regression model that included the employment status in the 6th month (R²=0.60) to the model that included the work status in the 24th month (R²=0.78). Conclusion: The degree of accuracy in the prognosis of long-term employment biographies increases with the point in time used to measure employment in the first 2 years following rehabilitation. These findings should be taken into consideration for the predefinition of time points used to measure the employment status in future studies. © Georg Thieme Verlag KG Stuttgart · New York.
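
    The reported 30% appears to be a relative rather than an absolute increase in explained variance; this reading is an inference from the two R² values quoted above, not something the abstract states:

```latex
\frac{R^2_{24} - R^2_{6}}{R^2_{6}} = \frac{0.78 - 0.60}{0.60} = 0.30
```

    In absolute terms the difference is 0.18, that is, 18 percentage points of explained variance.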

  17. Voxel-wise prostate cell density prediction using multiparametric magnetic resonance imaging and machine learning.

    PubMed

    Sun, Yu; Reynolds, Hayley M; Wraith, Darren; Williams, Scott; Finnegan, Mary E; Mitchell, Catherine; Murphy, Declan; Haworth, Annette

    2018-04-26

    There are currently no methods to estimate cell density in the prostate. This study aimed to develop predictive models to estimate prostate cell density from multiparametric magnetic resonance imaging (mpMRI) data at a voxel level using machine learning techniques. In vivo mpMRI data were collected from 30 patients before radical prostatectomy. Sequences included T2-weighted imaging, diffusion-weighted imaging and dynamic contrast-enhanced imaging. Ground truth cell density maps were computed from histology and co-registered with mpMRI. Feature extraction and selection were performed on mpMRI data. Final models were fitted using three regression algorithms including multivariate adaptive regression spline (MARS), polynomial regression (PR) and generalised additive model (GAM). Model parameters were optimised using leave-one-out cross-validation on the training data and model performance was evaluated on test data using root mean square error (RMSE) measurements. Predictive models to estimate voxel-wise prostate cell density were successfully trained and tested using the three algorithms. The best model (GAM) achieved a RMSE of 1.06 (± 0.06) × 10³ cells/mm² and a relative deviation of 13.3 ± 0.8%. Prostate cell density can be quantitatively estimated non-invasively from mpMRI data using high-quality co-registered data at a voxel level. These cell density predictions could be used for tissue classification, treatment response evaluation and personalised radiotherapy.

  18. Monthly streamflow forecasting with auto-regressive integrated moving average

    NASA Astrophysics Data System (ADS)

    Nasir, Najah; Samsudin, Ruhaidah; Shabri, Ani

    2017-09-01

    Forecasting of streamflow is one of the many ways that can contribute to better decision making for water resource management. The auto-regressive integrated moving average (ARIMA) model was selected in this research for monthly streamflow forecasting with enhancement made by pre-processing the data using singular spectrum analysis (SSA). This study also proposed an extension of the SSA technique to include a step where clustering was performed on the eigenvector pairs before reconstruction of the time series. The monthly streamflow data of Sungai Muda at Jeniang, Sungai Muda at Jambatan Syed Omar and Sungai Ketil at Kuala Pegang was gathered from the Department of Irrigation and Drainage Malaysia. A ratio of 9:1 was used to divide the data into training and testing sets. The ARIMA, SSA-ARIMA and Clustered SSA-ARIMA models were all developed in R software. Results from the proposed model are then compared to a conventional auto-regressive integrated moving average model using the root-mean-square error and mean absolute error values. It was found that the proposed model can outperform the conventional model.
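
    The evaluation set-up in this abstract (a 9:1 chronological split scored with root-mean-square error and mean absolute error) can be sketched as follows. The study itself was carried out in R; this Python sketch is only an illustration, the helper names and the ten flow values are made up, and a naive last-value forecast stands in for the ARIMA and SSA-ARIMA models, which are not reproduced here.

```python
import numpy as np

def split_series(series, train_fraction=0.9):
    """Chronological split: first 90% for training, the rest for testing."""
    values = np.asarray(series, dtype=float)
    cut = int(round(len(values) * train_fraction))
    return values[:cut], values[cut:]

def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    err = np.asarray(observed, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))

def mae(observed, predicted):
    """Mean absolute error between observed and predicted values."""
    err = np.asarray(observed, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(err)))

# Hypothetical monthly flows, for illustration only.
flow = np.array([12.0, 15.0, 11.0, 18.0, 22.0, 19.0, 14.0, 13.0, 16.0, 20.0])
train, test = split_series(flow)
# Placeholder forecast: repeat the last training value (not an ARIMA model).
naive_forecast = np.full_like(test, train[-1])
scores = (rmse(test, naive_forecast), mae(test, naive_forecast))
```

    Comparing two models then amounts to computing both scores on the same test segment and preferring the model with the lower values.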

  19. Comparison of Cox's Regression Model and Parametric Models in Evaluating the Prognostic Factors for Survival after Liver Transplantation in Shiraz during 2000-2012.

    PubMed

    Adelian, R; Jamali, J; Zare, N; Ayatollahi, S M T; Pooladfar, G R; Roustaei, N

    2015-01-01

    Identification of the prognostic factors for survival in patients with liver transplantation is challenging. Various methods of survival analysis have provided different, sometimes contradictory, results from the same data. The objective was to compare Cox's regression model with parametric models for determining the independent factors for predicting adults' and pediatrics' survival after liver transplantation. This study was conducted on 183 pediatric patients and 346 adults who underwent liver transplantation in Namazi Hospital, Shiraz, southern Iran. The study population included all patients undergoing liver transplantation from 2000 to 2012. The prognostic factors sex, age, Child class, initial diagnosis of the liver disease, PELD/MELD score, and pre-operative laboratory markers were selected for survival analysis. Among 529 patients, 346 (65.4%) were adult and 183 (34.6%) were pediatric cases. Overall, the lognormal distribution was the best-fitting model for adult and pediatric patients. Age in adults (HR=1.16, p<0.05) and weight (HR=2.68, p<0.01) and Child class B (HR=2.12, p<0.05) in pediatric patients were the most important factors for prediction of survival after liver transplantation. Adult patients younger than the mean age and pediatric patients weighing above the mean and Child class A (compared to those with classes B or C) had better survival. The parametric regression model is a good alternative to Cox's regression model.

  20. Risk stratification personalised model for prediction of life-threatening ventricular tachyarrhythmias in patients with chronic heart failure.

    PubMed

    Frolov, Alexander Vladimirovich; Vaikhanskaya, Tatjana Gennadjevna; Melnikova, Olga Petrovna; Vorobiev, Anatoly Pavlovich; Guel, Ludmila Michajlovna

    2017-01-01

    The development of prognostic factors of life-threatening ventricular tachyarrhythmias (VTA) and sudden cardiac death (SCD) continues to maintain its priority and relevance in cardiology. The development of a method of personalised prognosis based on multifactorial analysis of the risk factors associated with life-threatening heart rhythm disturbances is considered a key research and clinical task. The aim was to design a prognostic and mathematical model to define personalised risk for life-threatening VTA in patients with chronic heart failure (CHF). The study included 240 patients with CHF (mean age of 50.5 ± 12.1 years; left ventricular ejection fraction 32.8 ± 10.9%; follow-up period 36.8 ± 5.7 months). The participants received basic therapy for heart failure. The electrocardiogram (ECG) markers of myocardial electrical instability were assessed, including microvolt T-wave alternans, heart rate turbulence, heart rate deceleration, and QT dispersion. Additionally, echocardiography and Holter monitoring (HM) were performed. The cardiovascular events were considered as primary endpoints, including SCD, paroxysmal ventricular tachycardia/ventricular fibrillation (VT/VF) based on HM-ECG data, and data obtained from implantable device interrogation (CRT-D, ICD) as well as appropriate shocks. During the follow-up period, 66 (27.5%) subjects with CHF showed adverse arrhythmic events, including nine SCD events and 57 VTAs. Data from a stepwise discriminant analysis of cumulative ECG-markers of myocardial electrical instability were used to make a mathematical model of preliminary VTA risk stratification. Uni- and multivariate Cox logistic regression analysis were performed to define an individualised risk stratification model of SCD/VTA. A binary logistic regression model demonstrated a high prognostic significance of discriminant function with a classification sensitivity of 80.8% and specificity of 99.1% (F = 31.2; χ² = 143.2; p < 0.0001). 
The method of personalised risk stratification using Cox logistic regression allows correct classification of more than 93.9% of CHF cases. A robust body of evidence concerning logistic regression prognostic significance to define VTA risk allows inclusion of this method into the algorithm of subsequent control and selection of the optimal treatment modality to treat patients with CHF.

  1. Using automated texture features to determine the probability for masking of a tumor on mammography, but not ultrasound.

    PubMed

    Häberle, Lothar; Hack, Carolin C; Heusinger, Katharina; Wagner, Florian; Jud, Sebastian M; Uder, Michael; Beckmann, Matthias W; Schulz-Wendtland, Rüdiger; Wittenberg, Thomas; Fasching, Peter A

    2017-08-30

    Tumors in radiologically dense breasts were overlooked on mammograms more often than tumors in low-density breasts. A fast, reproducible and automated method of assessing percentage mammographic density (PMD) would be desirable to support decisions whether ultrasonography should be provided for women in addition to mammography in diagnostic mammography units. PMD assessment has still not been included in clinical routine work, as there are issues of interobserver variability and the procedure is quite time consuming. This study investigated whether fully automatically generated texture features of mammograms can replace time-consuming semi-automatic PMD assessment to predict a patient's risk of having an invasive breast tumor that is visible on ultrasound but masked on mammography (mammography failure). This observational study included 1334 women with invasive breast cancer treated at a hospital-based diagnostic mammography unit. Ultrasound was available for the entire cohort as part of routine diagnosis. Computer-based threshold PMD assessments ("observed PMD") were carried out and 363 texture features were obtained from each mammogram. Several variable selection and regression techniques (univariate selection, lasso, boosting, random forest) were applied to predict PMD from the texture features. The predicted PMD values were each used as a new predictor for masking in logistic regression models together with clinical predictors. These four logistic regression models with predicted PMD were compared among themselves and with a logistic regression model with observed PMD. The most accurate masking prediction was determined by cross-validation. About 120 of the 363 texture features were selected for predicting PMD. Density predictions with boosting were the best substitute for observed PMD to predict masking. 
Overall, the corresponding logistic regression model performed better (cross-validated AUC, 0.747) than one without mammographic density (0.734), but less well than the one with the observed PMD (0.753). However, in patients with an assigned mammography failure risk >10%, covering about half of all masked tumors, the boosting-based model performed at least as accurately as the original PMD model. Automatically generated texture features can replace semi-automatically determined PMD in a prediction model for mammography failure, such that more than 50% of masked tumors could be discovered.

  2. Multicollinearity is a red herring in the search for moderator variables: A guide to interpreting moderated multiple regression models and a critique of Iacobucci, Schneider, Popovich, and Bakamitsos (2016).

    PubMed

    McClelland, Gary H; Irwin, Julie R; Disatnik, David; Sivan, Liron

    2017-02-01

    Multicollinearity is irrelevant to the search for moderator variables, contrary to the implications of Iacobucci, Schneider, Popovich, and Bakamitsos (Behavior Research Methods, 2016, this issue). Multicollinearity is like the red herring in a mystery novel that distracts the statistical detective from the pursuit of a true moderator relationship. We show multicollinearity is completely irrelevant for tests of moderator variables. Furthermore, readers of Iacobucci et al. might be confused by a number of their errors. We note those errors, but more positively, we describe a variety of methods researchers might use to test and interpret their moderated multiple regression models, including two-stage testing, mean-centering, spotlighting, orthogonalizing, and floodlighting without regard to putative issues of multicollinearity. We cite a number of recent studies in the psychological literature in which the researchers used these methods appropriately to test, to interpret, and to report their moderated multiple regression models. We conclude with a set of recommendations for the analysis and reporting of moderated multiple regression that should help researchers better understand their models and facilitate generalizations across studies.

  3. Cox proportional hazards model of myopic regression for laser in situ keratomileusis flap creation with a femtosecond laser and with a mechanical microkeratome.

    PubMed

    Lin, Meng-Yin; Chang, David C K; Hsu, Wen-Ming; Wang, I-Jong

    2012-06-01

    To compare predictive factors for postoperative myopic regression between laser in situ keratomileusis (LASIK) with a femtosecond laser and LASIK with a mechanical microkeratome. Nobel Eye Clinic, Taipei, Taiwan. Retrospective comparative study. Refractive outcomes were recorded 1 day, 1 week, and 1, 3, 6, 9, and 12 months after LASIK. A Cox proportional hazards model was used to evaluate the impact of the 2 flap-creating methods and other covariates on postoperative myopic regression. The femtosecond group comprised 409 eyes and the mechanical microkeratome group, 377 eyes. For both methods, significant predictors for myopic regression after LASIK included preoperative manifest spherical equivalent (P=.0001) and central corneal thickness (P=.027). Laser in situ keratomileusis with a mechanical microkeratome had a higher probability of postoperative myopic regression than LASIK with a femtosecond laser (P=.0002). After adjusting for other covariates in the Cox proportional hazards model, the cumulative risk for myopic regression with a mechanical microkeratome was higher than with a femtosecond laser 12 months postoperatively (P=.0002). With the definition of myopic regression as a myopic shift of 0.50 diopter (D) or more and residual myopia of -0.50 D or less, the risk estimate based on the mean covariates in all eyes in the femtosecond group and mechanical microkeratome group at 12 months was 43.6% and 66.9%, respectively. Laser in situ keratomileusis with a mechanical microkeratome had a higher risk for myopic regression than LASIK with a femtosecond laser through 12 months postoperatively. Copyright © 2012. Published by Elsevier Inc.

  4. Modelling tendon excursions and moment arms of the finger flexors: anatomic fidelity versus function.

    PubMed

    Kociolek, Aaron M; Keir, Peter J

    2011-07-07

    A detailed musculoskeletal model of the human hand is needed to investigate the pathomechanics of tendon disorders and carpal tunnel syndrome. The purpose of this study was to develop a biomechanical model with realistic flexor tendon excursions and moment arms. An existing upper extremity model served as a starting point, which included programmed movement of the index finger. Movement capabilities were added for the other fingers. Metacarpophalangeal articulations were modelled as universal joints to simulate flexion/extension and abduction/adduction while interphalangeal articulations used hinges to represent flexion. Flexor tendon paths were modelled using two approaches. The first method constrained tendons with control points, representing annular pulleys. The second technique used wrap objects at the joints as tendon constraints. Both control point and joint wrap models were iteratively adjusted to coincide with tendon excursions and moment arms from an anthropometric regression model using inputs for a 50th percentile male. Tendon excursions from the joint wrap method best matched the regression model even though anatomic features of the tendon paths were not preserved (absolute differences: mean<0.33 mm, peak<0.74 mm). The joint wrap model also produced similar moment arms to the regression (absolute differences: mean<0.63 mm, peak<1.58 mm). When a scaling algorithm was used to test anthropometrics, the scaled joint wrap models better matched the regression than the scaled control point models. Detailed patient-specific anatomical data will improve model outcomes for clinical use; however, population studies may benefit from simplified geometry, especially with anthropometric scaling. Copyright © 2011 Elsevier Ltd. All rights reserved.

  5. Parametric correlation functions to model the structure of permanent environmental (co)variances in milk yield random regression models.

    PubMed

    Bignardi, A B; El Faro, L; Cardoso, V L; Machado, P F; Albuquerque, L G

    2009-09-01

    The objective of the present study was to estimate milk yield genetic parameters applying random regression models and parametric correlation functions combined with a variance function to model animal permanent environmental effects. A total of 152,145 test-day milk yields from 7,317 first lactations of Holstein cows belonging to herds located in the southeastern region of Brazil were analyzed. Test-day milk yields were divided into 44 weekly classes of days in milk. Contemporary groups were defined by herd-test-day comprising a total of 2,539 classes. The model included direct additive genetic, permanent environmental, and residual random effects. The following fixed effects were considered: contemporary group, age of cow at calving (linear and quadratic regressions), and the population average lactation curve modeled by fourth-order orthogonal Legendre polynomial. Additive genetic effects were modeled by random regression on orthogonal Legendre polynomials of days in milk, whereas permanent environmental effects were estimated using a stationary or nonstationary parametric correlation function combined with a variance function of different orders. The structure of residual variances was modeled using a step function containing 6 variance classes. The genetic parameter estimates obtained with the model using a stationary correlation function associated with a variance function to model permanent environmental effects were similar to those obtained with models employing orthogonal Legendre polynomials for the same effect. A model using a sixth-order polynomial for additive effects and a stationary parametric correlation function associated with a seventh-order variance function to model permanent environmental effects would be sufficient for data fitting.

  6. Assessing the accuracy of ANFIS, EEMD-GRNN, PCR, and MLR models in predicting PM2.5

    NASA Astrophysics Data System (ADS)

    Ausati, Shadi; Amanollahi, Jamil

    2016-10-01

    Since Sanandaj is considered one of the polluted cities of Iran, prediction of any type of pollution, especially of suspended PM2.5 particles, which are the cause of many diseases, could contribute to the health of society through timely announcements prior to increases in PM2.5. In order to predict PM2.5 concentration in the Sanandaj air, the hybrid models consisting of an ensemble empirical mode decomposition and general regression neural network (EEMD-GRNN), Adaptive Neuro-Fuzzy Inference System (ANFIS), principal component regression (PCR), and a linear model, multiple linear regression (MLR), were used. In these models the data of suspended particles of PM2.5 were the dependent variable and the data related to air quality including PM2.5, PM10, SO2, NO2, CO, O3 and meteorological data including average minimum temperature (Min T), average maximum temperature (Max T), average atmospheric pressure (AP), daily total precipitation (TP), daily relative humidity level of the air (RH) and daily wind speed (WS) for the year 2014 in Sanandaj were the independent variables. Among the models used, the EEMD-GRNN model, with values of R2 = 0.90, root mean square error (RMSE) = 4.9218 and mean absolute error (MAE) = 3.4644 in the training phase and with values of R2 = 0.79, RMSE = 5.0324 and MAE = 3.2565 in the testing phase, exhibited the best performance in predicting this phenomenon. It can be concluded that hybrid models predict PM2.5 concentration more accurately than the linear model.

  7. Water quality of storm runoff and comparison of procedures for estimating storm-runoff loads, volume, event-mean concentrations, and the mean load for a storm for selected properties and constituents for Colorado Springs, southeastern Colorado, 1992

    USGS Publications Warehouse

    Von Guerard, Paul; Weiss, W.B.

    1995-01-01

    The U.S. Environmental Protection Agency requires that municipalities that have a population of 100,000 or greater obtain National Pollutant Discharge Elimination System permits to characterize the quality of their storm runoff. In 1992, the U.S. Geological Survey, in cooperation with the Colorado Springs City Engineering Division, began a study to characterize the water quality of storm runoff and to evaluate procedures for the estimation of storm-runoff loads, volume and event-mean concentrations for selected properties and constituents. Precipitation, streamflow, and water-quality data were collected during 1992 at five sites in Colorado Springs. Thirty-five samples were collected, seven at each of the five sites. At each site, three samples were collected for permitting purposes; two of the samples were collected during rainfall runoff, and one sample was collected during snowmelt runoff. Four additional samples were collected at each site to obtain a large enough sample size to estimate storm-runoff loads, volume, and event-mean concentrations for selected properties and constituents using linear-regression procedures developed using data from the Nationwide Urban Runoff Program (NURP). Storm-water samples were analyzed for as many as 186 properties and constituents. The constituents measured include total-recoverable metals, volatile-organic compounds, acid-base/neutral organic compounds, and pesticides. Storm runoff sampled had large concentrations of chemical oxygen demand and 5-day biochemical oxygen demand. Chemical oxygen demand ranged from 100 to 830 milligrams per liter, and 5-day biochemical oxygen demand ranged from 14 to 260 milligrams per liter. Total-organic carbon concentrations ranged from 18 to 240 milligrams per liter. The total-recoverable metals lead and zinc had the largest concentrations of the total-recoverable metals analyzed. 
Concentrations of lead ranged from 23 to 350 micrograms per liter, and concentrations of zinc ranged from 110 to 1,400 micrograms per liter. The data for 30 storms representing rainfall runoff from 5 drainage basins were used to develop single-storm local-regression models. The response variables, storm-runoff loads, volume, and event-mean concentrations were modeled using explanatory variables for climatic, physical, and land-use characteristics. The r2 for models that use ordinary least-squares regression ranged from 0.57 to 0.86 for storm-runoff loads and volume and from 0.25 to 0.63 for storm-runoff event-mean concentrations. Except for cadmium, standard errors of estimate ranged from 43 to 115 percent for storm-runoff loads and volume and from 35 to 66 percent for storm-runoff event-mean concentrations. Eleven of the 30 concentrations collected during rainfall runoff for total-recoverable cadmium were censored (less than) concentrations. Ordinary least-squares regression should not be used with censored data; however, censored data can be included with uncensored data using tobit regression. Standard errors of estimate for storm-runoff load and event-mean concentration for total-recoverable cadmium, computed using tobit regression, are 247 and 171 percent. Estimates from single-storm regional-regression models, developed from the Nationwide Urban Runoff Program data base, were compared with observed storm-runoff loads, volume, and event-mean concentrations determined from samples collected in the study area. Single-storm regional-regression models tended to overestimate storm-runoff loads, volume, and event-mean concentrations. Therefore, single-storm local- and regional-regression models were combined using model-adjustment procedures to take advantage of the strengths of both models while minimizing the deficiencies of each model. 
Procedures were used to develop single-storm regression equations that were adjusted using local data and estimates from single-storm regional-regression equations. Single-storm regression models developed using model-adjustment proce

  8. A Semiparametric Change-Point Regression Model for Longitudinal Observations.

    PubMed

    Xing, Haipeng; Ying, Zhiliang

    2012-12-01

    Many longitudinal studies involve relating an outcome process to a set of possibly time-varying covariates, giving rise to the usual regression models for longitudinal data. When the purpose of the study is to investigate the covariate effects when the experimental environment undergoes abrupt changes or to locate the periods with different levels of covariate effects, a simple and easy-to-interpret approach is to introduce change-points in regression coefficients. In this connection, we propose a semiparametric change-point regression model, in which the error process (stochastic component) is nonparametric and the baseline mean function (functional part) is completely unspecified, the observation times are allowed to be subject-specific, and the number, locations and magnitudes of change-points are unknown and need to be estimated. We further develop an estimation procedure which combines recent advances in semiparametric analysis based on counting process arguments and multiple change-points inference, and discuss its large sample properties, including consistency and asymptotic normality, under suitable regularity conditions. Simulation results show that the proposed methods work well under a variety of scenarios. An application to a real data set is also given.

  9. Relating soil geochemical properties to arsenic bioaccessibility through hierarchical modeling.

    EPA Science Inventory

    Interest in improved understanding of relationships among soil properties and arsenic (As) bioaccessibility has motivated the use of regression models for As bioaccessibility prediction. However, limits in the numbers and types of soils included in previous studies restrict the u...

  10. Extrapolation of a predictive model for growth of a low inoculum size of Salmonella typhimurium DT104 on chicken skin to higher inoculum sizes

    USDA-ARS's Scientific Manuscript database

    Validation of model predictions for independent variables not included in model development can save time and money by identifying conditions for which new models are not needed. A single strain of Salmonella Typhimurium DT104 was used to develop a general regression neural network model for growth...

  11. Comparison and validation of statistical methods for predicting power outage durations in the event of hurricanes.

    PubMed

    Nateghi, Roshanak; Guikema, Seth D; Quiring, Steven M

    2011-12-01

    This article compares statistical methods for modeling power outage durations during hurricanes and examines the predictive accuracy of these methods. Being able to make accurate predictions of power outage durations is valuable because the information can be used by utility companies to plan their restoration efforts more efficiently. This information can also help inform customers and public agencies of the expected outage times, enabling better collective response planning, and coordination of restoration efforts for other critical infrastructures that depend on electricity. In the long run, outage duration estimates for future storm scenarios may help utilities and public agencies better allocate risk management resources to balance the disruption from hurricanes with the cost of hardening power systems. We compare the out-of-sample predictive accuracy of five distinct statistical models for estimating power outage duration times caused by Hurricane Ivan in 2004. The methods compared include both regression models (accelerated failure time (AFT) and Cox proportional hazard models (Cox PH)) and data mining techniques (regression trees, Bayesian additive regression trees (BART), and multivariate additive regression splines). We then validate our models against two other hurricanes. Our results indicate that BART yields the best prediction accuracy and that it is possible to predict outage durations with reasonable accuracy. © 2011 Society for Risk Analysis.

  12. How to address data gaps in life cycle inventories: a case study on estimating CO2 emissions from coal-fired electricity plants on a global scale.

    PubMed

    Steinmann, Zoran J N; Venkatesh, Aranya; Hauck, Mara; Schipper, Aafke M; Karuppiah, Ramkumar; Laurenzi, Ian J; Huijbregts, Mark A J

    2014-05-06

    One of the major challenges in life cycle assessment (LCA) is the availability and quality of data used to develop models and to make appropriate recommendations. Approximations and assumptions are often made if appropriate data are not readily available. However, these proxies may introduce uncertainty into the results. A regression model framework may be employed to assess missing data in LCAs of products and processes. In this study, we develop such a regression-based framework to estimate CO2 emission factors associated with coal power plants in the absence of reported data. Our framework hypothesizes that emissions from coal power plants can be explained by plant-specific factors (predictors) that include steam pressure, total capacity, plant age, fuel type, and gross domestic product (GDP) per capita of the resident nations of those plants. Using reported emission data for 444 plants worldwide, plant level CO2 emission factors were fitted to the selected predictors by a multiple linear regression model and a local linear regression model. The validated models were then applied to 764 coal power plants worldwide, for which no reported data were available. Cumulatively, available reported data and our predictions together account for 74% of the total world's coal-fired power generation capacity.

  13. Estimating Causal Effects with Ancestral Graph Markov Models

    PubMed Central

    Malinsky, Daniel; Spirtes, Peter

    2017-01-01

    We present an algorithm for estimating bounds on causal effects from observational data which combines graphical model search with simple linear regression. We assume that the underlying system can be represented by a linear structural equation model with no feedback, and we allow for the possibility of latent variables. Under assumptions standard in the causal search literature, we use conditional independence constraints to search for an equivalence class of ancestral graphs. Then, for each model in the equivalence class, we perform the appropriate regression (using causal structure information to determine which covariates to include in the regression) to estimate a set of possible causal effects. Our approach is based on the “IDA” procedure of Maathuis et al. (2009), which assumes that all relevant variables have been measured (i.e., no unmeasured confounders). We generalize their work by relaxing this assumption, which is often violated in applied contexts. We validate the performance of our algorithm on simulated data and demonstrate improved precision over IDA when latent variables are present. PMID:28217244

  14. Analysis of an Environmental Exposure Health Questionnaire in a Metropolitan Minority Population Utilizing Logistic Regression and Support Vector Machines

    PubMed Central

    Chen, Chau-Kuang; Bruce, Michelle; Tyler, Lauren; Brown, Claudine; Garrett, Angelica; Goggins, Susan; Lewis-Polite, Brandy; Weriwoh, Mirabel L; Juarez, Paul D.; Hood, Darryl B.; Skelton, Tyler

    2014-01-01

    The goal of this study was to analyze a 54-item instrument for assessment of perception of exposure to environmental contaminants within the context of the built environment, or exposome. This exposome was defined in five domains to include 1) home and hobby, 2) school, 3) community, 4) occupation, and 5) exposure history. Interviews were conducted with child-bearing-age minority women at Metro Nashville General Hospital at Meharry Medical College. Data were analyzed utilizing DTReg software for Support Vector Machine (SVM) modeling followed by an SPSS package for a logistic regression model. The target (outcome) variable of interest was respondent's residence by ZIP code. The results demonstrate that the rank order of important variables with respect to SVM modeling versus traditional logistic regression models is almost identical. This is the first study documenting that SVM analysis has discriminate power for determination of higher-ordered spatial relationships on an environmental exposure history questionnaire. PMID:23395953

  15. Mixed effect Poisson log-linear models for clinical and epidemiological sleep hypnogram data

    PubMed Central

    Swihart, Bruce J.; Caffo, Brian S.; Crainiceanu, Ciprian; Punjabi, Naresh M.

    2013-01-01

    Bayesian Poisson log-linear multilevel models scalable to epidemiological studies are proposed to investigate population variability in sleep state transition rates. Hierarchical random effects are used to account for pairings of subjects and repeated measures within those subjects, as comparing diseased to non-diseased subjects while minimizing bias is of importance. Essentially, non-parametric piecewise constant hazards are estimated and smoothed, allowing for time-varying covariates and segment of the night comparisons. The Bayesian Poisson regression is justified through a re-derivation of a classical algebraic likelihood equivalence of Poisson regression with a log(time) offset and survival regression assuming exponentially distributed survival times. Such re-derivation allows synthesis of two methods currently used to analyze sleep transition phenomena: stratified multi-state proportional hazards models and log-linear models with GEE for transition counts. An example data set from the Sleep Heart Health Study is analyzed. Supplementary material includes the analyzed data set as well as the code for a reproducible analysis. PMID:22241689
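
    The classical likelihood equivalence mentioned above can be written out in a few lines. The notation is an assumption made for illustration, not the authors': t_i is the observed time at risk for spell i, d_i in {0, 1} indicates whether a transition occurred, x_i is the covariate vector, and beta the regression coefficients.

```latex
\begin{align*}
\ell_{\mathrm{exp}}(\beta)
  &= \sum_i \bigl[\, d_i \log \lambda_i - \lambda_i t_i \,\bigr],
  \qquad \lambda_i = \exp(x_i^{\top}\beta), \\
\ell_{\mathrm{Pois}}(\beta)
  &= \sum_i \bigl[\, d_i \log \mu_i - \mu_i - \log d_i! \,\bigr],
  \qquad \log \mu_i = \log t_i + x_i^{\top}\beta, \\
  &= \sum_i \bigl[\, d_i \log \lambda_i - \lambda_i t_i \,\bigr]
     + \sum_i d_i \log t_i
   = \ell_{\mathrm{exp}}(\beta) + \text{const}.
\end{align*}
```

    Because d_i is 0 or 1, log d_i! vanishes, and the remaining extra term does not involve beta. The two log-likelihoods therefore have the same maximiser and the same curvature in beta, which is why a Poisson model with a log(time) offset can stand in for exponential survival regression.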

  16. Analysis of an environmental exposure health questionnaire in a metropolitan minority population utilizing logistic regression and Support Vector Machines.

    PubMed

    Chen, Chau-Kuang; Bruce, Michelle; Tyler, Lauren; Brown, Claudine; Garrett, Angelica; Goggins, Susan; Lewis-Polite, Brandy; Weriwoh, Mirabel L; Juarez, Paul D; Hood, Darryl B; Skelton, Tyler

    2013-02-01

    The goal of this study was to analyze a 54-item instrument for assessment of perception of exposure to environmental contaminants within the context of the built environment, or exposome. This exposome was defined in five domains to include 1) home and hobby, 2) school, 3) community, 4) occupation, and 5) exposure history. Interviews were conducted with child-bearing-age minority women at Metro Nashville General Hospital at Meharry Medical College. Data were analyzed utilizing DTReg software for Support Vector Machine (SVM) modeling followed by an SPSS package for a logistic regression model. The target (outcome) variable of interest was respondent's residence by ZIP code. The results demonstrate that the rank order of important variables with respect to SVM modeling versus traditional logistic regression models is almost identical. This is the first study documenting that SVM analysis has discriminate power for determination of higher-ordered spatial relationships on an environmental exposure history questionnaire.

  17. Can Predictive Modeling Identify Head and Neck Oncology Patients at Risk for Readmission?

    PubMed

    Manning, Amy M; Casper, Keith A; Peter, Kay St; Wilson, Keith M; Mark, Jonathan R; Collar, Ryan M

    2018-05-01

    Objective: Unplanned readmission within 30 days is a contributor to health care costs in the United States. The use of predictive modeling during hospitalization to identify patients at risk for readmission offers a novel approach to quality improvement and cost reduction. Study Design: Two-phase study including retrospective analysis of prospectively collected data followed by prospective longitudinal study. Setting: Tertiary academic medical center. Subjects and Methods: Prospectively collected data for patients undergoing surgical treatment for head and neck cancer from January 2013 to January 2015 were used to build predictive models for readmission within 30 days of discharge using logistic regression, classification and regression tree (CART) analysis, and random forests. One model (logistic regression) was then placed prospectively into the discharge workflow from March 2016 to May 2016 to determine the model's ability to predict which patients would be readmitted within 30 days. Results: In total, 174 admissions had descriptive data. Thirty-two were excluded due to incomplete data. Logistic regression, CART, and random forest predictive models were constructed using the remaining 142 admissions. When applied to 106 consecutive prospective head and neck oncology patients at the time of discharge, the logistic regression model predicted readmissions with a specificity of 94%, a sensitivity of 47%, a negative predictive value of 90%, and a positive predictive value of 62% (odds ratio, 14.9; 95% confidence interval, 4.02-55.45). Conclusion: Prospectively collected head and neck cancer databases can be used to develop predictive models that can accurately predict which patients will be readmitted. This offers valuable support for quality improvement initiatives and readmission-related cost reduction in head and neck cancer care.
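
    The reported accuracy measures all derive from a single 2x2 table of predicted versus actual readmission, as the sketch below shows. The abstract does not give the cell counts: the counts used here (8 true positives, 9 false negatives, 5 false positives, 84 true negatives, totalling 106) are hypothetical values chosen because they are consistent with the reported percentages. The log-based (Woolf) confidence interval is likewise an assumption about how such an interval is commonly computed, not a statement of the authors' method.

```python
import math


def diagnostic_summary(tp, fn, fp, tn, z=1.96):
    """Accuracy measures and odds ratio (with a log-based CI) from a 2x2 table."""
    odds_ratio = (tp * tn) / (fp * fn)
    # Standard error of log(odds ratio), Woolf's method (assumed here).
    se_log_or = math.sqrt(1 / tp + 1 / fn + 1 / fp + 1 / tn)
    return {
        "sensitivity": tp / (tp + fn),   # readmitted patients flagged by the model
        "specificity": tn / (tn + fp),   # non-readmitted patients not flagged
        "ppv": tp / (tp + fp),           # flagged patients who were readmitted
        "npv": tn / (tn + fn),           # unflagged patients who were not readmitted
        "odds_ratio": odds_ratio,
        "ci_low": math.exp(math.log(odds_ratio) - z * se_log_or),
        "ci_high": math.exp(math.log(odds_ratio) + z * se_log_or),
    }


# Hypothetical counts (NOT reported in the abstract); total is 106 patients.
summary = diagnostic_summary(tp=8, fn=9, fp=5, tn=84)

for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

    With these illustrative counts the measures round to the figures quoted in the abstract. The low sensitivity alongside high specificity shows the model flags few patients, but most of those it flags are at genuinely elevated risk.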

  18. Improving Global Models of Remotely Sensed Ocean Chlorophyll Content Using Partial Least Squares and Geographically Weighted Regression

    NASA Astrophysics Data System (ADS)

    Gholizadeh, H.; Robeson, S. M.

    2015-12-01

    Empirical models have been widely used to estimate global chlorophyll content from remotely sensed data. Here, we focus on the standard NASA empirical models that use blue-green band ratios. These band ratio ocean color (OC) algorithms are in the form of fourth-order polynomials and the parameters of these polynomials (i.e. coefficients) are estimated from the NASA bio-Optical Marine Algorithm Data set (NOMAD). Most of the points in this data set have been sampled from tropical and temperate regions. However, polynomial coefficients obtained from this data set are used to estimate chlorophyll content in all ocean regions with different properties such as sea-surface temperature, salinity, and downwelling/upwelling patterns. Further, the polynomial terms in these models are highly correlated. In sum, the limitations of these empirical models are as follows: 1) the independent variables within the empirical models, in their current form, are correlated (multicollinear), and 2) current algorithms are global approaches and are based on the spatial stationarity assumption, so they are independent of location. The multicollinearity problem is resolved by using partial least squares (PLS). PLS, which transforms the data into a set of independent components, can be considered as a combined form of principal component regression (PCR) and multiple regression. Geographically weighted regression (GWR) is also used to investigate the validity of the spatial stationarity assumption. GWR solves a regression model over each sample point by using the observations within its neighbourhood. PLS results show that the empirical method underestimates chlorophyll content in high latitudes, including the Southern Ocean region, when compared to PLS (see Figure 1). Cluster analysis of GWR coefficients also shows that the spatial stationarity assumption in empirical models is unlikely to be valid.
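
    The GWR idea described above (a separate regression at each sample point, weighted toward nearby observations) can be sketched as follows. This is a minimal illustration with one predictor and a Gaussian distance kernel; the kernel, the bandwidth, the function names, and the toy data are all assumptions rather than details taken from the abstract.

```python
import math

def weighted_line(xs, ys, ws):
    """Weighted least-squares fit of y = intercept + slope * x."""
    total = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / total
    mean_y = sum(w * y for w, y in zip(ws, ys)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def gwr(coords, xs, ys, bandwidth):
    """Fit one local regression per sample point (assumed Gaussian kernel)."""
    fits = []
    for u, v in coords:
        ws = [math.exp(-0.5 * (math.hypot(u - a, v - b) / bandwidth) ** 2)
              for a, b in coords]
        fits.append(weighted_line(xs, ys, ws))
    return fits

# Invented toy data: ten points on a transect whose true slope changes
# from 1 (first half) to 3 (second half), i.e. a non-stationary relation.
coords = [(float(i), 0.0) for i in range(10)]
xs = [float(i % 3) for i in range(10)]
ys = [(1.0 if i < 5 else 3.0) * x for i, x in enumerate(xs)]

local_fits = gwr(coords, xs, ys, bandwidth=1.0)
for (u, _), (intercept, slope) in zip(coords, local_fits):
    print(f"location {u:.0f}: local slope {slope:.2f}")
```

    A single global fit would report one compromise slope for the whole transect; local slopes that differ systematically by location are the kind of evidence against spatial stationarity that the abstract refers to.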

  19. A phenomenological biological dose model for proton therapy based on linear energy transfer spectra.

    PubMed

    Rørvik, Eivind; Thörnqvist, Sara; Stokkevåg, Camilla H; Dahle, Tordis J; Fjaera, Lars Fredrik; Ytre-Hauge, Kristian S

    2017-06-01

    The relative biological effectiveness (RBE) of protons varies with the radiation quality, quantified by the linear energy transfer (LET). Most phenomenological models employ a linear dependency of the dose-averaged LET (LETd) to calculate the biological dose. However, several experiments have indicated a possible non-linear trend. Our aim was to investigate if biological dose models including non-linear LET dependencies should be considered, by introducing a LET spectrum based dose model. The RBE-LET relationship was investigated by fitting of polynomials from 1st to 5th degree to a database of 85 data points from aerobic in vitro experiments. We included both unweighted and weighted regression, the latter taking into account experimental uncertainties. Statistical testing was performed to decide whether higher degree polynomials provided better fits to the data as compared to lower degrees. The newly developed models were compared to three published LETd based models for a simulated spread out Bragg peak (SOBP) scenario. The statistical analysis of the weighted regression analysis favored a non-linear RBE-LET relationship, with the quartic polynomial found to best represent the experimental data (P = 0.010). The results of the unweighted regression analysis were on the borderline of statistical significance for non-linear functions (P = 0.053), and with the current database a linear dependency could not be rejected. For the SOBP scenario, the weighted non-linear model estimated a similar mean RBE value (1.14) compared to the three established models (1.13-1.17). The unweighted model calculated a considerably higher RBE value (1.22). The analysis indicated that non-linear models could give a better representation of the RBE-LET relationship. However, this is not decisive, as inclusion of the experimental uncertainties in the regression analysis had a significant impact on the determination and ranking of the models. As differences between the models were observed for the SOBP scenario, both non-linear LET spectrum- and linear LETd based models should be further evaluated in clinically realistic scenarios. © 2017 American Association of Physicists in Medicine.
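
    The comparison of polynomial degrees under weighted regression can be illustrated with a small sketch. It fits polynomials by weighted least squares (weights 1/sigma^2) and forms the usual F statistic for nested models. Everything here is an assumption for illustration: the data are invented stand-ins rather than the 85-point database, the helper names are hypothetical, and the abstract does not say which statistical test the authors used.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def weighted_polyfit(xs, ys, sigmas, degree):
    """Minimise sum(((y - p(x)) / sigma)^2); coefficients are lowest order first."""
    ws = [1.0 / s ** 2 for s in sigmas]
    k = degree + 1
    a = [[sum(w * x ** (i + j) for w, x in zip(ws, xs)) for j in range(k)]
         for i in range(k)]
    b = [sum(w * y * x ** i for w, x, y in zip(ws, xs, ys)) for i in range(k)]
    coef = solve(a, b)
    chi2 = sum(w * (y - sum(c * x ** p for p, c in enumerate(coef))) ** 2
               for w, x, y in zip(ws, xs, ys))
    return coef, chi2

def nested_f(chi2_low, chi2_high, n, k_low, k_high):
    """F statistic for adding (k_high - k_low) parameters to the smaller model."""
    return ((chi2_low - chi2_high) / (k_high - k_low)) / (chi2_high / (n - k_high))

# Invented data: x and y are stand-ins for LET and RBE, sigmas for uncertainties.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.14, 1.25, 1.49, 1.75, 1.98, 2.31, 2.71, 3.06]
sigmas = [0.02, 0.02, 0.02, 0.02, 0.05, 0.05, 0.05, 0.05]

_, chi2_lin = weighted_polyfit(xs, ys, sigmas, degree=1)
_, chi2_quad = weighted_polyfit(xs, ys, sigmas, degree=2)
f_stat = nested_f(chi2_lin, chi2_quad, n=len(xs), k_low=2, k_high=3)
print(f"chi2 linear={chi2_lin:.1f}, quadratic={chi2_quad:.1f}, F={f_stat:.1f}")
```

    Setting every sigma to the same value gives the unweighted fit, so the same two functions cover both analyses described in the abstract.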

  20. Regression approaches in the test-negative study design for assessment of influenza vaccine effectiveness.

    PubMed

    Bond, H S; Sullivan, S G; Cowling, B J

    2016-06-01

    Influenza vaccination is the most practical means available for preventing influenza virus infection and is widely used in many countries. Because vaccine components and circulating strains frequently change, it is important to continually monitor vaccine effectiveness (VE). The test-negative design is frequently used to estimate VE. In this design, patients meeting the same clinical case definition are recruited and tested for influenza; those who test positive are the cases and those who test negative form the comparison group. When determining VE in these studies, the typical approach has been to use logistic regression, adjusting for potential confounders. Because vaccine coverage and influenza incidence change throughout the season, time is included among these confounders. While most studies use unconditional logistic regression, adjusting for time, an alternative approach is to use conditional logistic regression, matching on time. Here, we used simulation data to examine the potential for both regression approaches to permit accurate and robust estimates of VE. In situations where vaccine coverage changed during the influenza season, the conditional model and unconditional models adjusting for categorical week and using a spline function for week provided more accurate estimates. We illustrated the two approaches on data from a test-negative study of influenza VE against hospitalization in children in Hong Kong, in which the conditional logistic regression model provided the best fit to the data.
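
    Why time matters in this design can be shown with a small worked example. In test-negative studies VE is conventionally computed as (1 - odds ratio) x 100%. The sketch below compares a crude odds ratio with a week-stratified one. Note the substitution: Mantel-Haenszel stratification is used here as a simple stand-in for adjusting or matching on time, and it is not the conditional logistic regression evaluated in the abstract. The weekly counts are invented.

```python
def odds_ratio(a, b, c, d):
    """a, b = vaccinated/unvaccinated cases; c, d = vaccinated/unvaccinated controls."""
    return (a * d) / (b * c)

def mantel_haenszel_or(strata):
    """Pooled odds ratio across strata (here: calendar weeks)."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

def vaccine_effectiveness(or_value):
    """VE in percent, using the conventional VE = (1 - OR) * 100."""
    return (1.0 - or_value) * 100.0

# Invented counts: vaccine coverage and influenza incidence both rise from
# week 1 to week 2, while the within-week odds ratio is 0.5 in each week.
weeks = [(10, 40, 50, 100), (60, 40, 75, 25)]
pooled = [sum(col) for col in zip(*weeks)]

crude_ve = vaccine_effectiveness(odds_ratio(*pooled))
adjusted_ve = vaccine_effectiveness(mantel_haenszel_or(weeks))
print(f"crude VE = {crude_ve:.1f}%, week-stratified VE = {adjusted_ve:.1f}%")
```

    In this toy setting, ignoring week makes the vaccine look far less effective than it is within either week, which is the confounding by time that the abstract's regression approaches are designed to remove.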

  1. Comparison of Logistic Regression and Artificial Neural Network in Low Back Pain Prediction: Second National Health Survey

    PubMed Central

    Parsaeian, M; Mohammad, K; Mahmoudi, M; Zeraati, H

    2012-01-01

    Background: The purpose of this investigation was to empirically compare the predictive ability of an artificial neural network with that of a logistic regression in the prediction of low back pain. Methods: Data from the second national health survey were considered in this investigation. These data include information on low back pain and its associated risk factors among Iranian people aged 15 years and older. Artificial neural network and logistic regression models were developed using a set of 17294 observations and they were validated in a test set of 17295 observations. The Hosmer and Lemeshow recommendation for model selection was used in fitting the logistic regression. A three-layer perceptron with 9 inputs, 3 hidden and 1 output neurons was employed. The efficiency of the two models was compared by receiver operating characteristic analysis, root mean square and -2 log-likelihood criteria. Results: The area under the ROC curve (SE), root mean square and -2 log-likelihood of the logistic regression were 0.752 (0.004), 0.3832 and 14769.2, respectively. The area under the ROC curve (SE), root mean square and -2 log-likelihood of the artificial neural network were 0.754 (0.004), 0.3770 and 14757.6, respectively. Conclusions: Based on these three criteria, the artificial neural network would give better performance than logistic regression. Although the difference is statistically significant, it does not seem to be clinically significant. PMID:23113198
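
    The three comparison criteria named in this abstract can each be computed from a vector of observed outcomes and predicted probabilities. The sketch below is a minimal illustration on four invented observations. It assumes that "root mean square" means the root mean square error of the predicted probabilities, which the abstract does not state explicitly, and the function names are hypothetical.

```python
import math

def auc(labels, scores):
    """Area under the ROC curve as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

def rmse(labels, scores):
    """Root mean square error between outcomes (0/1) and predicted probabilities."""
    return math.sqrt(sum((y - s) ** 2 for y, s in zip(labels, scores)) / len(labels))

def minus_two_log_likelihood(labels, scores):
    """-2 times the Bernoulli log-likelihood of the predictions."""
    return -2.0 * sum(math.log(s) if y == 1 else math.log(1.0 - s)
                      for y, s in zip(labels, scores))

# Invented toy data, not survey records.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print("AUC:", auc(labels, scores))
print("RMSE:", round(rmse(labels, scores), 4))
print("-2 log-likelihood:", round(minus_two_log_likelihood(labels, scores), 4))
```

    Higher AUC and lower values of the other two criteria indicate the better model, which is the direction of the comparison reported above.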

  2. Comparison of logistic regression and artificial neural network in low back pain prediction: second national health survey.

    PubMed

    Parsaeian, M; Mohammad, K; Mahmoudi, M; Zeraati, H

    2012-01-01

    The purpose of this investigation was to empirically compare the predictive ability of an artificial neural network with that of a logistic regression in the prediction of low back pain. Data from the second national health survey were considered in this investigation. These data include information on low back pain and its associated risk factors among Iranian people aged 15 years and older. Artificial neural network and logistic regression models were developed using a set of 17294 observations and they were validated in a test set of 17295 observations. The Hosmer and Lemeshow recommendation for model selection was used in fitting the logistic regression. A three-layer perceptron with 9 inputs, 3 hidden and 1 output neurons was employed. The efficiency of the two models was compared by receiver operating characteristic analysis, root mean square and -2 log-likelihood criteria. The area under the ROC curve (SE), root mean square and -2 log-likelihood of the logistic regression were 0.752 (0.004), 0.3832 and 14769.2, respectively. The area under the ROC curve (SE), root mean square and -2 log-likelihood of the artificial neural network were 0.754 (0.004), 0.3770 and 14757.6, respectively. Based on these three criteria, the artificial neural network would give better performance than logistic regression. Although the difference is statistically significant, it does not seem to be clinically significant.

  3. Relations between continuous real-time physical properties and discrete water-quality constituents in the Little Arkansas River, south-central Kansas, 1998-2014

    USGS Publications Warehouse

    Rasmussen, Patrick P.; Eslick, Patrick J.; Ziegler, Andrew C.

    2016-08-11

    Water from the Little Arkansas River is used as source water for artificial recharge of the Equus Beds aquifer, one of the primary water-supply sources for the city of Wichita, Kansas. The U.S. Geological Survey has operated two continuous real-time water-quality monitoring stations since 1995 on the Little Arkansas River in Kansas. Regression models were developed to establish relations between discretely sampled constituent concentrations and continuously measured physical properties to compute concentrations of those constituents of interest. Site-specific regression models were originally published in 2000 for the near Halstead and near Sedgwick U.S. Geological Survey streamgaging stations and the site-specific regression models were then updated in 2003. This report updates those regression models using discrete and continuous data collected during May 1998 through August 2014. In addition to the constituents listed in the 2003 update, new regression models were developed for total organic carbon. The real-time computations of water-quality concentrations and loads are available at http://nrtwq.usgs.gov. The water-quality information in this report is important to the city of Wichita because water-quality information allows for real-time quantification and characterization of chemicals of concern (including chloride), in addition to nutrients, sediment, bacteria, and atrazine transported in the Little Arkansas River. The water-quality information in this report aids in the decision making for water treatment before artificial recharge.

  4. Multiple Imputation of a Randomly Censored Covariate Improves Logistic Regression Analysis.

    PubMed

    Atem, Folefac D; Qian, Jing; Maye, Jacqueline E; Johnson, Keith A; Betensky, Rebecca A

    2016-01-01

    Randomly censored covariates arise frequently in epidemiologic studies. The most commonly used methods, including complete case and single imputation or substitution, suffer from inefficiency and bias. They make strong parametric assumptions or they consider limit of detection censoring only. We employ multiple imputation, in conjunction with semi-parametric modeling of the censored covariate, to overcome these shortcomings and to facilitate robust estimation. We develop a multiple imputation approach for randomly censored covariates within the framework of a logistic regression model. We use the non-parametric estimate of the covariate distribution or the semiparametric Cox model estimate in the presence of additional covariates in the model. We evaluate this procedure in simulations, and compare its operating characteristics to those from the complete case analysis and a survival regression approach. We apply the procedures to an Alzheimer's study of the association between amyloid positivity and maternal age of onset of dementia. Multiple imputation achieves lower standard errors and higher power than the complete case approach under heavy and moderate censoring and is comparable under light censoring. The survival regression approach achieves the highest power among all procedures, but does not produce interpretable estimates of association. Multiple imputation offers a favorable alternative to complete case analysis and ad hoc substitution methods in the presence of randomly censored covariates within the framework of logistic regression.
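
    Multiple imputation produces one fitted model per imputed data set, and the per-imputation estimates must then be combined. The abstract does not say how this was done; Rubin's rules are the usual choice and are assumed in the sketch below. The coefficient values are invented and the function name is hypothetical.

```python
import math
import statistics

def pool_rubin(estimates, variances):
    """Combine per-imputation estimates and their squared standard errors."""
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)        # average sampling variance
    between = statistics.variance(estimates)    # variance across imputations (divisor m - 1)
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total

# Invented log-odds-ratio estimates from three imputed data sets.
estimates = [0.5, 0.6, 0.7]
variances = [0.04, 0.04, 0.04]

pooled, total_var = pool_rubin(estimates, variances)
print(f"pooled estimate = {pooled:.3f}, standard error = {math.sqrt(total_var):.3f}")
```

    The between-imputation term is what carries the extra uncertainty due to the censored covariate, so the pooled standard error is larger than the standard error from any single imputed data set.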

  5. Computer-program documentation of an interactive-accounting model to simulate streamflow, water quality, and water-supply operations in a river basin

    USGS Publications Warehouse

    Burns, A.W.

    1988-01-01

    This report describes an interactive-accounting model used to simulate streamflow, chemical-constituent concentrations and loads, and water-supply operations in a river basin. The model uses regression equations to compute flow from incremental (internode) drainage areas. Conservative chemical constituents (typically dissolved solids) also are computed from regression equations. Both flow and water quality loads are accumulated downstream. Optionally, the model simulates the water use and the simplified groundwater systems of a basin. Water users include agricultural, municipal, industrial, and in-stream users, and reservoir operators. Water users list their potential water sources, including direct diversions, groundwater pumpage, interbasin imports, or reservoir releases, in the order in which they will be used. Direct diversions conform to basinwide water law priorities. The model is interactive, and although the input data exist in files, the user can modify them interactively. A major feature of the model is its color-graphic-output options. This report includes a description of the model, organizational charts of subroutines, and examples of the graphics. Detailed format instructions for the input data, example files of input data, definitions of program variables, and listing of the FORTRAN source code are Attachments to the report. (USGS)

  6. Predictive factors of early moderate/severe ovarian hyperstimulation syndrome in non-polycystic ovarian syndrome patients: a statistical model.

    PubMed

    Ashrafi, Mahnaz; Bahmanabadi, Akram; Akhond, Mohammad Reza; Arabipoor, Arezoo

    2015-11-01

    To evaluate demographic, medical history and clinical cycle characteristics of infertile non-polycystic ovary syndrome (NPCOS) women with the purpose of investigating their associations with the prevalence of moderate-to-severe OHSS. In this retrospective study, among 7073 in vitro fertilization and/or intracytoplasmic sperm injection (IVF/ICSI) cycles, 86 cases of NPCOS patients who developed moderate-to-severe OHSS while being treated with IVF/ICSI cycles were analyzed during the period of January 2008 to December 2010 at Royan Institute. To review the OHSS risk factors, 172 NPCOS patients without developing OHSS, treated at the same period of time, were selected randomly by computer as control group. We used multiple logistic regression in a backward manner to build a prediction model. The regression analysis revealed that the variables, including age [odds ratio (OR) 0.9, confidence interval (CI) 0.81-0.99], antral follicles count (OR 4.3, CI 2.7-6.9), infertility cause (tubal factor, OR 11.5, CI 1.1-51.3), hypothyroidism (OR 3.8, CI 1.5-9.4) and positive history of ovarian surgery (OR 0.2, CI 0.05-0.9) were the most important predictors of OHSS. The regression model had an area under curve of 0.94, presenting an allowable discriminative performance that was equal with two strong predictive variables, including the number of follicles and serum estradiol level on human chorionic gonadotropin day. The predictive regression model based on primary characteristics of NPCOS patients had equal specificity in comparison with two mentioned strong predictive variables. Therefore, it may be beneficial to apply this model before the beginning of ovarian stimulation protocol.

  7. Poisson Mixture Regression Models for Heart Disease Prediction.

    PubMed

    Mufudza, Chipo; Erol, Hamza

    2016-01-01

    Early heart disease control can be achieved by high disease prediction and diagnosis efficiency. This paper focuses on the use of model-based clustering techniques to predict and diagnose heart disease via Poisson mixture regression models. Analysis and application of Poisson mixture regression models is here addressed under two different classes: standard and concomitant variable mixture regression models. Results show that a two-component concomitant variable Poisson mixture regression model predicts heart disease better than both the standard Poisson mixture regression model and the ordinary general linear Poisson regression model due to its low Bayesian Information Criterion value. Furthermore, a Zero Inflated Poisson Mixture Regression model turned out to be the best model for heart prediction over all models as it both clusters individuals into a high or low risk category and predicts the rate of heart disease componentwise given the available clusters. It is deduced that heart disease prediction can be effectively done by identifying the major risks componentwise using a Poisson mixture regression model.
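
    The clustering-plus-BIC logic behind these models can be sketched in a simplified form. The code below fits an intercept-only two-component Poisson mixture by expectation-maximisation and compares its BIC with that of a single Poisson distribution. It is a deliberate simplification: the models in the abstract include covariates (and concomitant variables for the mixing weights), which are omitted here, and the counts and function names are invented for illustration.

```python
import math

def poisson_logpmf(k, lam):
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def mixture_loglik(counts, weights, rates):
    return sum(math.log(sum(w * math.exp(poisson_logpmf(k, lam))
                            for w, lam in zip(weights, rates)))
               for k in counts)

def fit_poisson_mixture(counts, n_iter=200):
    """EM for a two-component Poisson mixture without covariates."""
    rates = [max(min(counts), 0.5), float(max(counts))]  # crude starting values
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior probability that each count belongs to each component.
        resp = []
        for k in counts:
            parts = [w * math.exp(poisson_logpmf(k, lam))
                     for w, lam in zip(weights, rates)]
            total = sum(parts)
            resp.append([p / total for p in parts])
        # M-step: update mixing weights and component rates.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(counts)
            rates[j] = sum(r[j] * k for r, k in zip(resp, counts)) / nj
    loglik = mixture_loglik(counts, weights, rates)
    bic = -2.0 * loglik + 3 * math.log(len(counts))  # 2 rates + 1 free weight
    return weights, rates, bic

# Invented counts with an obvious low-rate and high-rate group.
counts = [0, 1, 0, 2, 1, 0, 1, 8, 9, 10, 7, 11, 9]

weights, rates, bic_mixture = fit_poisson_mixture(counts)
mean_rate = sum(counts) / len(counts)
bic_single = (-2.0 * sum(poisson_logpmf(k, mean_rate) for k in counts)
              + 1 * math.log(len(counts)))
print(f"rates = {rates[0]:.2f}, {rates[1]:.2f}; "
      f"BIC mixture = {bic_mixture:.1f}, BIC single = {bic_single:.1f}")
```

    A lower BIC for the mixture is the same kind of evidence the abstract cites when preferring one mixture model over another.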

  8. Poisson Mixture Regression Models for Heart Disease Prediction

    PubMed Central

    Erol, Hamza

    2016-01-01

    Early heart disease control can be achieved by high disease prediction and diagnosis efficiency. This paper focuses on the use of model-based clustering techniques to predict and diagnose heart disease via Poisson mixture regression models. Analysis and application of Poisson mixture regression models is here addressed under two different classes: standard and concomitant variable mixture regression models. Results show that a two-component concomitant variable Poisson mixture regression model predicts heart disease better than both the standard Poisson mixture regression model and the ordinary general linear Poisson regression model due to its low Bayesian Information Criterion value. Furthermore, a Zero Inflated Poisson Mixture Regression model turned out to be the best model for heart prediction over all models as it both clusters individuals into a high or low risk category and predicts the rate of heart disease componentwise given the available clusters. It is deduced that heart disease prediction can be effectively done by identifying the major risks componentwise using a Poisson mixture regression model. PMID:27999611

  9. Generalized linear and generalized additive models in studies of species distributions: Setting the scene

    USGS Publications Warehouse

    Guisan, Antoine; Edwards, T.C.; Hastie, T.

    2002-01-01

    An important statistical development of the last 30 years has been the advance in regression analysis provided by generalized linear models (GLMs) and generalized additive models (GAMs). Here we introduce a series of papers prepared within the framework of an international workshop entitled: Advances in GLMs/GAMs modeling: from species distribution to environmental management, held in Riederalp, Switzerland, 6-11 August 2001. We first discuss some general uses of statistical models in ecology, as well as provide a short review of several key examples of the use of GLMs and GAMs in ecological modeling efforts. We next present an overview of GLMs and GAMs, and discuss some of their related statistics used for predictor selection, model diagnostics, and evaluation. Included is a discussion of several new approaches applicable to GLMs and GAMs, such as ridge regression, an alternative to stepwise selection of predictors, and methods for the identification of interactions by a combined use of regression trees and several other approaches. We close with an overview of the papers and how we feel they advance our understanding of their application to ecological modeling. © 2002 Elsevier Science B.V. All rights reserved.

  10. Use of ocean color scanner data in water quality mapping

    NASA Technical Reports Server (NTRS)

    Khorram, S.

    1981-01-01

    Remotely sensed data, in combination with in situ data, are used in assessing water quality parameters within the San Francisco Bay-Delta. The parameters include suspended solids, chlorophyll, and turbidity. Regression models are developed between each of the water quality parameter measurements and the Ocean Color Scanner (OCS) data. The models are then extended to the entire study area for mapping water quality parameters. The results include a series of color-coded maps, each pertaining to one of the water quality parameters, and the statistical analysis of the OCS data and regression models. It is found that concurrently collected OCS data and surface truth measurements are highly useful in mapping the selected water quality parameters and locating areas having relatively high biological activity. In addition, it is found to be virtually impossible, at least within this test site, to locate such areas on U-2 color and color-infrared photography.

  11. Post-processing through linear regression

    NASA Astrophysics Data System (ADS)

    van Schaeybroeck, B.; Vannitsem, S.

    2011-03-01

    Various post-processing techniques are compared for both deterministic and ensemble forecasts, all based on linear regression between forecast data and observations. In order to evaluate the quality of the regression methods, three criteria are proposed, related to the effective correction of forecast error, the optimal variability of the corrected forecast and multicollinearity. The regression schemes under consideration include the ordinary least-square (OLS) method, a new time-dependent Tikhonov regularization (TDTR) method, the total least-square method, a new geometric-mean regression (GM), a recently introduced error-in-variables (EVMOS) method and, finally, a "best member" OLS method. The advantages and drawbacks of each method are clarified. These techniques are applied in the context of the 63 Lorenz system, whose model version is affected by both initial condition and model errors. For short forecast lead times, the number and choice of predictors play an important role. Contrary to the other techniques, GM degrades when the number of predictors increases. At intermediate lead times, linear regression is unable to provide corrections to the forecast and can sometimes degrade the performance (GM and the best member OLS with noise). At long lead times the regression schemes (EVMOS, TDTR) which yield the correct variability and the largest correlation between ensemble error and spread, should be preferred.
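
    The "variability of the corrected forecast" criterion can be illustrated by contrasting two of the listed schemes for a single predictor. The sketch below uses the textbook definitions: the OLS slope is cov(f, o) / var(f), and the geometric-mean slope is sign(cov) * sd(o) / sd(f). Whether the paper's multi-predictor GM scheme reduces to exactly this form is an assumption, and the forecast and observation values are invented.

```python
import math

def mean(values):
    return sum(values) / len(values)

def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def regression_slopes(forecast, observed):
    """Return (OLS slope, geometric-mean slope) for one predictor."""
    mf, mo = mean(forecast), mean(observed)
    cov = mean([(f - mf) * (o - mo) for f, o in zip(forecast, observed)])
    ols = cov / variance(forecast)
    gm = math.copysign(math.sqrt(variance(observed) / variance(forecast)), cov)
    return ols, gm

def correct(forecast, observed, slope):
    """Corrected forecast that passes through the two sample means."""
    mf, mo = mean(forecast), mean(observed)
    return [mo + slope * (f - mf) for f in forecast]

# Invented toy forecast/observation pairs.
forecast = [1.0, 2.0, 3.0, 4.0, 5.0]
observed = [1.5, 1.7, 3.4, 3.6, 5.3]

ols, gm = regression_slopes(forecast, observed)
var_ols = variance(correct(forecast, observed, ols))
var_gm = variance(correct(forecast, observed, gm))
print(f"OLS slope {ols:.3f}, GM slope {gm:.3f}")
print(f"variance: observed {variance(observed):.3f}, OLS {var_ols:.3f}, GM {var_gm:.3f}")
```

    The OLS-corrected forecast is less variable than the observations whenever the correlation is below one, while the GM-corrected forecast reproduces the observed variance by construction.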

  12. Estimating effects of limiting factors with regression quantiles

    USGS Publications Warehouse

    Cade, B.S.; Terrell, J.W.; Schroeder, R.L.

    1999-01-01

    In a recent Concepts paper in Ecology, Thomson et al. emphasized that assumptions of conventional correlation and regression analyses fundamentally conflict with the ecological concept of limiting factors, and they called for new statistical procedures to address this problem. The analytical issue is that unmeasured factors may be the active limiting constraint and may induce a pattern of unequal variation in the biological response variable through an interaction with the measured factors. Consequently, changes near the maxima, rather than at the center of response distributions, are better estimates of the effects expected when the observed factor is the active limiting constraint. Regression quantiles provide estimates for linear models fit to any part of a response distribution, including near the upper bounds, and require minimal assumptions about the form of the error distribution. Regression quantiles extend the concept of one-sample quantiles to the linear model by solving an optimization problem of minimizing an asymmetric function of absolute errors. Rank-score tests for regression quantiles provide tests of hypotheses and confidence intervals for parameters in linear models with heteroscedastic errors, conditions likely to occur in models of limiting ecological relations. We used selected regression quantiles (e.g., 5th, 10th, ..., 95th) and confidence intervals to test hypotheses that parameters equal zero for estimated changes in average annual acorn biomass due to forest canopy cover of oak (Quercus spp.) and oak species diversity. Regression quantiles also were used to estimate changes in glacier lily (Erythronium grandiflorum) seedling numbers as a function of lily flower numbers, rockiness, and pocket gopher (Thomomys talpoides fossor) activity, data that motivated the query by Thomson et al. for new statistical procedures. Both example applications showed that effects of limiting factors estimated by changes in some upper regression quantile (e.g., 90-95th) were greater than if effects were estimated by changes in the means from standard linear model procedures. Estimating a range of regression quantiles (e.g., 5-95th) provides a comprehensive description of biological response patterns for exploratory and inferential analyses in observational studies of limiting factors, especially when sampling large spatial and temporal scales.
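
    The "asymmetric function of absolute errors" mentioned above is commonly written as the check (or pinball) loss: a positive residual r costs tau * r and a negative one costs (1 - tau) * |r|. The sketch below shows, for the one-sample case only, that minimising this loss over a constant recovers the ordinary sample quantile, which is the concept that regression quantiles extend to linear models. The function names and data are illustrative assumptions.

```python
def check_loss(residual, tau):
    """Asymmetric absolute error: tau * r if r >= 0, else (1 - tau) * |r|."""
    return residual * (tau - (residual < 0))

def total_loss(ys, constant, tau):
    return sum(check_loss(y - constant, tau) for y in ys)

def best_constant(ys, tau):
    """Search the observed values for the constant that minimises the check loss."""
    return min(ys, key=lambda c: total_loss(ys, c, tau))

# Invented data: tau = 0.5 should give the median, tau = 0.9 an upper quantile.
skewed = [1, 2, 3, 4, 100]
evenly_spaced = list(range(11))

print("tau=0.5 on skewed data:", best_constant(skewed, 0.5))
print("tau=0.9 on 0..10:", best_constant(evenly_spaced, 0.9))
```

    Replacing the constant with a linear predictor and minimising the same loss over its coefficients gives a regression quantile; that step usually needs a linear-programming solver and is not shown here.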

  13. Tutorial on Biostatistics: Linear Regression Analysis of Continuous Correlated Eye Data.

    PubMed

    Ying, Gui-Shuang; Maguire, Maureen G; Glynn, Robert; Rosner, Bernard

    2017-04-01

    To describe and demonstrate appropriate linear regression methods for analyzing correlated continuous eye data. We describe several approaches to regression analysis involving both eyes, including mixed effects and marginal models under various covariance structures to account for inter-eye correlation. We demonstrate, with SAS statistical software, applications in a study comparing baseline refractive error between one eye with choroidal neovascularization (CNV) and the unaffected fellow eye, and in a study determining factors associated with visual field in the elderly. When refractive errors from both eyes were analyzed with standard linear regression without accounting for inter-eye correlation (adjusting for demographic and ocular covariates), the difference between eyes with CNV and fellow eyes was 0.15 diopters (D; 95% confidence interval, CI -0.03 to 0.32D, p = 0.10). Using a mixed effects model or a marginal model, the estimated difference was the same but with narrower 95% CI (0.01 to 0.28D, p = 0.03). Standard regression for visual field data from both eyes provided biased estimates of standard error (generally underestimated) and smaller p-values, while analysis of the worse eye provided larger p-values than mixed effects models and marginal models. In research involving both eyes, ignoring inter-eye correlation can lead to invalid inferences. Analysis using only right or left eyes is valid, but decreases power. Worse-eye analysis can provide less power and biased estimates of effect. Mixed effects or marginal models using the eye as the unit of analysis should be used to appropriately account for inter-eye correlation and maximize power and precision.

  14. The alarming problems of confounding equivalence using logistic regression models in the perspective of causal diagrams.

    PubMed

    Yu, Yuanyuan; Li, Hongkai; Sun, Xiaoru; Su, Ping; Wang, Tingting; Liu, Yi; Yuan, Zhongshang; Liu, Yanxun; Xue, Fuzhong

    2017-12-28

    Confounders can produce spurious associations between exposure and outcome in observational studies. For the majority of epidemiologists, adjusting for confounders using a logistic regression model is the habitual method, though it has some problems in accuracy and precision. It is, therefore, important to highlight the problems of logistic regression and search for an alternative method. Four causal diagram models were defined to summarize confounding equivalence. Both theoretical proofs and simulation studies were performed to verify whether conditioning on different confounding equivalence sets had the same bias-reducing potential and then to select the optimum adjusting strategy, in which logistic regression model and inverse probability weighting based marginal structural model (IPW-based-MSM) were compared. The "do-calculus" was used to calculate the true causal effect of exposure on outcome, then the bias and standard error were used to evaluate the performances of different strategies. Adjusting for different sets of confounding equivalence, as judged by identical Markov boundaries, produced different bias-reducing potential in the logistic regression model. For the sets that satisfied G-admissibility, adjusting for the set including all the confounders reduced bias by the same amount as adjusting for the set containing the parent nodes of the outcome, while the bias after adjusting for the parent nodes of exposure was not equivalent to them. In addition, all causal effect estimations through logistic regression were biased, although the estimation after adjusting for the parent nodes of exposure was nearest to the true causal effect. However, conditioning on different confounding equivalence sets had the same bias-reducing potential under IPW-based-MSM. Compared with logistic regression, the IPW-based-MSM could obtain unbiased causal effect estimation when the adjusted confounders satisfied G-admissibility and the optimal strategy was to adjust for the parent nodes of outcome, which obtained the highest precision. All adjustment strategies through logistic regression were biased for causal effect estimation, while IPW-based-MSM could always obtain unbiased estimation when the adjusted set satisfied G-admissibility. Thus, IPW-based-MSM was recommended to adjust for the confounder set.
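
    The inverse probability weighting idea behind the IPW-based-MSM can be shown with a single binary confounder. In the sketch below each person is weighted by the inverse of the probability of the exposure they actually received, given the confounder, and the weighted risks are then compared. The counts are invented so that the true causal risk difference is 0.2, the effect is expressed as a risk difference rather than whatever effect measure the paper used, and the function names are hypothetical.

```python
# Each row: (confounder z, exposure x, number of people, number with the outcome).
# ASSUMED toy counts: P(x=1 | z=0) = 0.2, P(x=1 | z=1) = 0.8,
# and risk = 0.1 + 0.2 * x + 0.3 * z, so the true risk difference is 0.2.
rows = [
    (0, 0, 800, 80),
    (0, 1, 200, 60),
    (1, 0, 200, 80),
    (1, 1, 800, 480),
]

def crude_risk(rows, x):
    people = sum(n for _, xx, n, _ in rows if xx == x)
    events = sum(e for _, xx, _, e in rows if xx == x)
    return events / people

def propensity(rows, z):
    """P(x = 1 | z), estimated as the exposed fraction within stratum z."""
    exposed = sum(n for zz, xx, n, _ in rows if zz == z and xx == 1)
    total = sum(n for zz, _, n, _ in rows if zz == z)
    return exposed / total

def ipw_risk(rows, x):
    """Risk under exposure level x in the weighted pseudo-population."""
    weighted_events = weighted_people = 0.0
    for z, xx, n, events in rows:
        if xx != x:
            continue
        p = propensity(rows, z)
        weight = 1.0 / p if x == 1 else 1.0 / (1.0 - p)
        weighted_events += weight * events
        weighted_people += weight * n
    return weighted_events / weighted_people

crude_rd = crude_risk(rows, 1) - crude_risk(rows, 0)
ipw_rd = ipw_risk(rows, 1) - ipw_risk(rows, 0)
print(f"crude risk difference = {crude_rd:.2f}, IPW risk difference = {ipw_rd:.2f}")
```

    The crude comparison mixes the effect of exposure with the effect of the confounder, while weighting balances the confounder across exposure groups and recovers the effect built into the toy data.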

  15. Characterizing the spatial distribution of ambient ultrafine particles in Toronto, Canada: A land use regression model.

    PubMed

    Weichenthal, Scott; Van Ryswyk, Keith; Goldstein, Alon; Shekarrizfard, Maryam; Hatzopoulou, Marianne

    2016-01-01

    Exposure models are needed to evaluate the chronic health effects of ambient ultrafine particles (<0.1 μm) (UFPs). We developed a land use regression model for ambient UFPs in Toronto, Canada using mobile monitoring data collected during summer/winter 2010-2011. In total, 405 road segments were included in the analysis. The final model explained 67% of the spatial variation in mean UFPs and included terms for the logarithm of distances to highways, major roads, the central business district, Pearson airport, and bus routes as well as variables for the number of on-street trees, parks, open space, and the length of bus routes within a 100 m buffer. There was no systematic difference between measured and predicted values when the model was evaluated in an external dataset, although the R² value decreased (R² = 50%). This model will be used to evaluate the chronic health effects of UFPs using population-based cohorts in the Toronto area. Crown Copyright © 2015. Published by Elsevier Ltd. All rights reserved.

  16. R package PRIMsrc: Bump Hunting by Patient Rule Induction Method for Survival, Regression and Classification

    PubMed Central

    Dazard, Jean-Eudes; Choe, Michael; LeBlanc, Michael; Rao, J. Sunil

    2015-01-01

    PRIMsrc is a novel implementation of a non-parametric bump hunting procedure, based on the Patient Rule Induction Method (PRIM), offering a unified treatment of outcome variables, including censored time-to-event (Survival), continuous (Regression) and discrete (Classification) responses. To fit the model, it uses a recursive peeling procedure with specific peeling criteria and stopping rules depending on the response. To validate the model, it provides an objective function based on prediction-error or other specific statistic, as well as two alternative cross-validation techniques, adapted to the task of decision-rule making and estimation in the three types of settings. PRIMsrc comes as an open source R package, including at this point: (i) a main function for fitting a Survival Bump Hunting model with various options allowing cross-validated model selection to control model size (#covariates) and model complexity (#peeling steps) and generation of cross-validated end-point estimates; (ii) parallel computing; (iii) various S3-generic and specific plotting functions for data visualization, diagnostic, prediction, summary and display of results. It is available on CRAN and GitHub. PMID:26798326

  17. Spatial measurement error and correction by spatial SIMEX in linear regression models when using predicted air pollution exposures.

    PubMed

    Alexeeff, Stacey E; Carroll, Raymond J; Coull, Brent

    2016-04-01

    Spatial modeling of air pollution exposures is widespread in air pollution epidemiology research as a way to improve exposure assessment. However, there are key sources of exposure model uncertainty when air pollution is modeled, including estimation error and model misspecification. We examine the use of predicted air pollution levels in linear health effect models under a measurement error framework. For the prediction of air pollution exposures, we consider a universal Kriging framework, which may include land-use regression terms in the mean function and a spatial covariance structure for the residuals. We derive the bias induced by estimation error and by model misspecification in the exposure model, and we find that a misspecified exposure model can induce asymptotic bias in the effect estimate of air pollution on health. We propose a new spatial simulation extrapolation (SIMEX) procedure, and we demonstrate that the procedure has good performance in correcting this asymptotic bias. We illustrate spatial SIMEX in a study of air pollution and birthweight in Massachusetts. © The Author 2015. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

  18. Genomic Bayesian functional regression models with interactions for predicting wheat grain yield using hyper-spectral image data.

    PubMed

    Montesinos-López, Abelardo; Montesinos-López, Osval A; Cuevas, Jaime; Mata-López, Walter A; Burgueño, Juan; Mondal, Sushismita; Huerta, Julio; Singh, Ravi; Autrique, Enrique; González-Pérez, Lorena; Crossa, José

    2017-01-01

    Modern agriculture uses hyperspectral cameras that provide hundreds of reflectance data at discrete narrow bands in many environments. These bands often cover the whole visible light spectrum and part of the infrared and ultraviolet light spectra. With the bands, vegetation indices are constructed for predicting agronomically important traits such as grain yield and biomass. However, since vegetation indices only use some wavelengths (referred to as bands), we propose using all bands simultaneously as predictor variables for the primary trait grain yield; results of several multi-environment maize (Aguate et al. in Crop Sci 57(5):1-8, 2017) and wheat (Montesinos-López et al. in Plant Methods 13(4):1-23, 2017) breeding trials indicated that using all bands produced better prediction accuracy than vegetation indices. However, until now, these prediction models have not accounted for the effects of genotype × environment (G × E) and band × environment (B × E) interactions incorporating genomic or pedigree information. In this study, we propose Bayesian functional regression models that take into account all available bands, genomic or pedigree information, the main effects of lines and environments, as well as G × E and B × E interaction effects. The data set used comprises 976 wheat lines evaluated for grain yield in three environments (Drought, Irrigated and Reduced Irrigation). The reflectance data were measured in 250 discrete narrow bands ranging from 392 to 851 nm. The proposed Bayesian functional regression models were implemented using two types of basis: B-splines and Fourier. Results of the proposed Bayesian functional regression models, including all the wavelengths for predicting grain yield, were compared with results from conventional models with and without bands. 
We observed that the models with B × E interaction terms were the most accurate models, whereas the functional regression models (with B-splines and Fourier basis) and the conventional models performed similarly in terms of prediction accuracy. However, the functional regression models are more parsimonious and computationally more efficient because the number of beta coefficients to be estimated is 21 (number of basis), rather than estimating the 250 regression coefficients for all bands. In this study adding pedigree or genomic information did not increase prediction accuracy.
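
    The parsimony argument in this abstract (21 basis coefficients instead of 250 band coefficients) can be illustrated with a minimal sketch. It assumes a Fourier basis made of one constant plus ten sine/cosine pairs on an evenly spaced grid; the function names and the flat example spectrum are hypothetical, and the authors' actual basis construction and Bayesian fitting are not reproduced here.

```python
import math

def fourier_basis(n_points, n_basis):
    """Return an n_points x n_basis matrix: a constant column, then sin/cos pairs."""
    if n_basis % 2 == 0:
        raise ValueError("n_basis must be odd: 1 constant + sin/cos pairs")
    rows = []
    for j in range(n_points):
        t = j / n_points  # band position rescaled to [0, 1)
        row = [1.0]
        for k in range(1, (n_basis - 1) // 2 + 1):
            row.append(math.sin(2 * math.pi * k * t))
            row.append(math.cos(2 * math.pi * k * t))
        rows.append(row)
    return rows

def project(spectrum, basis):
    """Collapse one spectrum (length n_points) into n_basis scores z = Phi^T x."""
    n_basis = len(basis[0])
    return [sum(x * basis[j][b] for j, x in enumerate(spectrum))
            for b in range(n_basis)]

basis = fourier_basis(250, 21)   # 250 bands, 21 basis functions (as in the abstract)
spectrum = [0.5] * 250           # hypothetical flat reflectance curve
scores = project(spectrum, basis)
print(len(scores))               # number of regressors per line after projection
```

    Regressing yield on the 21 scores rather than on the 250 raw bands is what reduces the number of coefficients; if the coefficient curve is written as a combination of the basis functions, the two formulations give the same linear predictor.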

  19. Online Statistical Modeling (Regression Analysis) for Independent Responses

    NASA Astrophysics Data System (ADS)

    Made Tirta, I.; Anggraeni, Dian; Pandutama, Martinus

    2017-06-01

    Regression analysis (statistical modelling) is among the statistical methods most frequently needed in analyzing quantitative data, especially to model the relationship between response and explanatory variables. Nowadays, statistical models have been developed in various directions to model various types of data and complex relationships. A rich variety of advanced and recent statistical models is available, mostly in open source software (one of them is R). However, these advanced statistical models are not very friendly to novice R users, since they are based on programming scripts or a command line interface. Our research aims to develop a web interface (based on R and shiny), so that the most recent and advanced statistical models are readily available, accessible and applicable on the web. We have previously made an interface in the form of an e-tutorial for several modern and advanced statistical models in R, especially for independent responses (including linear models/LM, generalized linear models/GLM, generalized additive models/GAM and generalized additive models for location, scale and shape/GAMLSS). In this research we unified them in the form of data analysis, including models using Computer Intensive Statistics (Bootstrap and Markov Chain Monte Carlo/MCMC). All are readily accessible on our online Virtual Statistics Laboratory. The web interface makes statistical modeling easier to apply and makes the models easier to compare in order to find the most appropriate model for the data.

  20. Cost-effectiveness analysis of the diarrhea alleviation through zinc and oral rehydration therapy (DAZT) program in rural Gujarat India: an application of the net-benefit regression framework.

    PubMed

    Shillcutt, Samuel D; LeFevre, Amnesty E; Fischer-Walker, Christa L; Taneja, Sunita; Black, Robert E; Mazumder, Sarmila

    2017-01-01

    This study evaluates the cost-effectiveness of the DAZT program for scaling up treatment of acute child diarrhea in Gujarat, India using a net-benefit regression framework. Costs were calculated from societal and caregivers' perspectives and effectiveness was assessed in terms of coverage of zinc and both zinc and Oral Rehydration Salt. Regression models were tested in simple linear regression, with a specified set of covariates, and with a specified set of covariates and interaction terms; linear regression with endogenous treatment effects was used as the reference case. The DAZT program was cost-effective with over 95% certainty above $5.50 and $7.50 per appropriately treated child in the unadjusted and adjusted models respectively, with specifications including interaction terms being cost-effective with 85-97% certainty. Findings from this study should be combined with other evidence when considering decisions to scale up programs such as the DAZT program to promote the use of ORS and zinc to treat child diarrhea.
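
    As a rough illustration of the net-benefit framework named in this abstract, the sketch below computes each child's net benefit as willingness-to-pay times effect minus cost and compares program and comparison groups. With a single program indicator and no covariates, the regression coefficient on that indicator equals this difference in group means. The records and willingness-to-pay values are made up for illustration and are not the study's data; the adjusted and endogenous-treatment specifications are not shown.

```python
def net_benefit(effect, cost, wtp):
    """Net benefit for one subject: willingness-to-pay * effect - cost."""
    return wtp * effect - cost

def incremental_net_benefit(records, wtp):
    """records: iterable of (in_program, effect, cost) tuples.

    Returns mean net benefit in the program group minus that in the
    comparison group (the unadjusted net-benefit regression coefficient).
    """
    program = [net_benefit(e, c, wtp) for flag, e, c in records if flag]
    comparison = [net_benefit(e, c, wtp) for flag, e, c in records if not flag]
    return sum(program) / len(program) - sum(comparison) / len(comparison)

# Hypothetical data: effect = 1 if the child was appropriately treated.
records = [
    (True, 1, 2.0), (True, 1, 2.0), (True, 0, 2.0), (True, 1, 2.0),
    (False, 0, 1.0), (False, 1, 1.0), (False, 0, 1.0), (False, 0, 1.0),
]
for wtp in (2.0, 5.0):
    print(wtp, incremental_net_benefit(records, wtp))
```

    A positive incremental net benefit at a given willingness-to-pay indicates cost-effectiveness at that threshold; repeating the calculation over a range of thresholds is what produces statements such as "cost-effective above $5.50".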

  1. Estimating suspended sediment load with multivariate adaptive regression spline, teaching-learning based optimization, and artificial bee colony models.

    PubMed

    Yilmaz, Banu; Aras, Egemen; Nacar, Sinan; Kankal, Murat

    2018-05-23

    The functional life of a dam is often determined by the rate of sediment delivery to its reservoir. Therefore, an accurate estimate of the sediment load in rivers with dams is essential for designing and predicting a dam's useful lifespan. The most credible method is direct measurements of sediment input, but this can be very costly and it cannot always be implemented at all gauging stations. In this study, we tested various regression models to estimate suspended sediment load (SSL) at two gauging stations on the Çoruh River in Turkey, including artificial bee colony (ABC), teaching-learning-based optimization algorithm (TLBO), and multivariate adaptive regression splines (MARS). These models were also compared with one another and with classical regression analyses (CRA). Streamflow values and previously collected data of SSL were used as model inputs with predicted SSL data as output. Two different training and testing dataset configurations were used to reinforce the model accuracy. For the MARS method, the root mean square error value was found to range between 35% and 39% for the test data at the two gauging stations, which was lower than errors for other models. Error values were even lower (7% to 15%) using another dataset. Our results indicate that simultaneous measurements of streamflow with SSL provide the most effective parameter for obtaining accurate predictive models and that MARS is the most accurate model for predicting SSL. Copyright © 2017 Elsevier B.V. All rights reserved.

  2. GIS-based groundwater potential mapping using boosted regression tree, classification and regression tree, and random forest machine learning models in Iran.

    PubMed

    Naghibi, Seyed Amir; Pourghasemi, Hamid Reza; Dixon, Barnali

    2016-01-01

    Groundwater is considered one of the most valuable fresh water resources. The main objective of this study was to produce groundwater spring potential maps in the Koohrang Watershed, Chaharmahal-e-Bakhtiari Province, Iran, using three machine learning models: boosted regression tree (BRT), classification and regression tree (CART), and random forest (RF). Thirteen hydrological-geological-physiographical (HGP) factors that influence locations of springs were considered in this research. These factors include slope degree, slope aspect, altitude, topographic wetness index (TWI), slope length (LS), plan curvature, profile curvature, distance to rivers, distance to faults, lithology, land use, drainage density, and fault density. Subsequently, groundwater spring potential was modeled and mapped using CART, RF, and BRT algorithms. The predicted results from the three models were validated using the receiver operating characteristics curve (ROC). From 864 springs identified, 605 (≈70 %) locations were used for the spring potential mapping, while the remaining 259 (≈30 %) springs were used for the model validation. The area under the curve (AUC) for the BRT model was calculated as 0.8103 and for CART and RF the AUC were 0.7870 and 0.7119, respectively. Therefore, it was concluded that the BRT model produced the best prediction results while predicting locations of springs followed by CART and RF models, respectively. Geospatially integrated BRT, CART, and RF methods proved to be useful in generating the spring potential map (SPM) with reasonable accuracy.
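
    The validation scheme described in this abstract (a roughly 70/30 split of the 864 springs, then the area under the ROC curve) can be sketched as follows. This is a generic illustration, not the authors' GIS workflow: the function names, the random seed and the small score lists are assumptions. The AUC is computed through its rank interpretation, i.e. the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half.

```python
import random

def split_train_validation(items, train_fraction, seed=0):
    """Shuffle a copy of items and split it into training and validation parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

def auc(pos_scores, neg_scores):
    """Area under the ROC curve as P(pos score > neg score) + 0.5 * P(tie)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

spring_ids = range(864)                      # 864 springs, as in the abstract
train, valid = split_train_validation(spring_ids, 0.7)
print(len(train), len(valid))

# Hypothetical predicted potentials at spring and non-spring locations.
print(auc([0.9, 0.8, 0.4], [0.7, 0.3]))
```

    The pairwise loop is quadratic in the number of cases, which is acceptable for a few hundred validation points; a rank-based formula would be preferable for large data sets.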

  3. An investigation on fatality of drivers in vehicle-fixed object accidents on expressways in China: Using multinomial logistic regression model.

    PubMed

    Peng, Yong; Peng, Shuangling; Wang, Xinghua; Tan, Shiyang

    2018-06-01

    This study aims to identify the effects of characteristics of vehicle, roadway, driver, and environment on fatality of drivers in vehicle-fixed object accidents on expressways in Changsha-Zhuzhou-Xiangtan district of Hunan province in China by developing multinomial logistic regression models. For this purpose, 121 vehicle-fixed object accidents from 2011-2017 are included in the modeling process. First, descriptive statistical analysis is made to understand the main characteristics of the vehicle-fixed object crashes. Then, 19 explanatory variables are selected, and correlation analysis of each pair of variables is conducted to choose the variables to be included. Finally, five multinomial logistic regression models including different independent variables are compared, and the model with the best fit and prediction capability is chosen as the final model. The results showed that the turning direction in avoiding fixed objects raised the possibility that drivers would die. About 64% of the drivers who died in these accidents were found to have been ejected from the car, of whom 50% were not using a seatbelt before the fatal accident. Drivers are likely to die when they encounter bad weather on the expressway. Drivers with less than 10 years of driving experience are more likely to die in these accidents. Fatigue or distracted driving is also a significant factor in fatality of drivers. Findings from this research provide an insight into reducing fatality of drivers in vehicle-fixed object accidents.

  4. [Mapping environmental vulnerability from ETM + data in the Yellow River Mouth Area].

    PubMed

    Wang, Rui-Yan; Yu, Zhen-Wen; Xia, Yan-Ling; Wang, Xiang-Feng; Zhao, Geng-Xing; Jiang, Shu-Qian

    2013-10-01

    Environmental vulnerability retrieval is important to support continuing data. The spatial distribution of regional environmental vulnerability was obtained through remote sensing retrieval. Based on soil and vegetation, an environmental vulnerability evaluation index system was built, and the environmental vulnerability of the sampling points was calculated by the AHP-fuzzy method; then the correlation between the environmental vulnerability of the sampling points and the ETM+ spectral reflectance ratio, including several kinds of transformed data, was analyzed to determine the sensitive spectral parameters. On that basis, correlation analysis, traditional regression, BP neural network and support vector regression models were used to describe the quantitative relationship between the spectral reflectance and the environmental vulnerability. With this model, the environmental vulnerability distribution was retrieved in the Yellow River Mouth Area. The results showed that the correlations between the environmental vulnerability and the spring NDVI, the September NDVI and the spring brightness were stronger than the others, so these were selected as the sensitive spectral parameters. The model precision results showed that, in addition to the support vector model, the other models reached the significance level. All the multi-variable regressions were better than all the one-variable regressions, and the model accuracy of the BP neural network was the best. This study will serve as a reliable theoretical reference for large spatial scale environmental vulnerability estimation based on remote sensing data.

  5. Parametric regression model for survival data: Weibull regression model as an example

    PubMed Central

    2016-01-01

    The Weibull regression model is one of the most popular forms of parametric regression model in that it provides an estimate of the baseline hazard function as well as coefficients for covariates. Because of technical difficulties, the Weibull regression model is seldom used in the medical literature compared with the semi-parametric proportional hazards model. To familiarize clinical investigators with the Weibull regression model, this article introduces some basic knowledge of the model and then illustrates how to fit it with R software. The SurvRegCensCov package is useful for converting estimated coefficients to clinically relevant statistics such as the hazard ratio (HR) and event time ratio (ETR). Model adequacy can be assessed by inspecting Kaplan-Meier curves stratified by a categorical variable. The eha package provides an alternative method to fit the Weibull regression model. The check.dist() function helps to assess the goodness-of-fit of the model. Variable selection is based on the importance of a covariate, which can be tested using the anova() function. Alternatively, backward elimination starting from a full model is an efficient way to develop a model. Visualization of the Weibull regression model after model development is interesting in that it provides another way to report your findings. PMID:28149846
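
    The conversion from fitted coefficients to HR and ETR mentioned in this abstract can be sketched in a few lines. The sketch assumes the accelerated failure time parameterization, log T = intercept + beta * x + scale * W, in which the Weibull shape is 1/scale, the event time ratio is exp(beta) and the hazard ratio is exp(-beta/scale). The function name and the example numbers are hypothetical; this is not the SurvRegCensCov code itself.

```python
import math

def convert_weibull_aft(beta, scale):
    """Convert an AFT-scale Weibull coefficient to clinically oriented quantities.

    beta  : coefficient on the log-time scale for one covariate
    scale : scale parameter of the log-time error term (shape = 1 / scale)
    """
    shape = 1.0 / scale
    return {
        "shape": shape,
        "event_time_ratio": math.exp(beta),       # multiplicative change in event time
        "hazard_ratio": math.exp(-beta * shape),  # multiplicative change in hazard
    }

# Hypothetical fit: the covariate doubles event time, scale = 0.5.
result = convert_weibull_aft(math.log(2.0), 0.5)
print(result)
```

    The opposite signs are expected: a covariate that lengthens the time to event (ETR above 1) lowers the hazard (HR below 1).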

  6. Neural network modeling for surgical decisions on traumatic brain injury patients.

    PubMed

    Li, Y C; Liu, L; Chiu, W T; Jian, W S

    2000-01-01

    Computerized medical decision support systems have been a major research topic in recent years. Intelligent computer programs were implemented to aid physicians and other medical professionals in making difficult medical decisions. This report compares three different mathematical models for building a traumatic brain injury (TBI) medical decision support system (MDSS). These models were developed based on a large TBI patient database. This MDSS accepts a set of patient data such as the types of skull fracture, Glasgow Coma Scale (GCS), and episode of convulsion, and returns the chance that a neurosurgeon would recommend an open-skull surgery for this patient. The three mathematical models described in this report include a logistic regression model, a multi-layer perceptron (MLP) neural network and a radial-basis-function (RBF) neural network. From the 12,640 patients selected from the database, a randomly drawn 9480 cases were used as the training group to develop/train our models. The other 3160 cases were in the validation group, which we used to evaluate the performance of these models. We used sensitivity, specificity, areas under receiver-operating characteristics (ROC) curve and calibration curves as the indicators of how accurate these models are in predicting a neurosurgeon's decision on open-skull surgery. The results showed that, assuming equal importance of sensitivity and specificity, the logistic regression model had a (sensitivity, specificity) of (73%, 68%), compared to (80%, 80%) from the RBF model and (88%, 80%) from the MLP model. The resultant areas under ROC curve for logistic regression, RBF and MLP neural networks are 0.761, 0.880 and 0.897, respectively (P < 0.05). Among these models, the logistic regression has noticeably poorer calibration. This study demonstrated the feasibility of applying neural networks as the mechanism for TBI decision support systems based on clinical databases. 
The results also suggest that neural networks may be a better solution for complex, non-linear medical decision support systems than conventional statistical techniques such as logistic regression.
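
    The (sensitivity, specificity) pairs reported in this abstract depend on a cut-off applied to each model's predicted probability. A minimal sketch of that calculation follows; the labels, probabilities and the 0.5 threshold are invented for illustration and are not taken from the TBI database.

```python
def sensitivity_specificity(labels, probabilities, threshold=0.5):
    """Return (sensitivity, specificity) for binary labels (1 = surgery recommended).

    A case is predicted positive when its probability is >= threshold.
    """
    tp = fn = tn = fp = 0
    for y, p in zip(labels, probabilities):
        predicted = p >= threshold
        if y == 1 and predicted:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical validation cases.
labels = [1, 1, 1, 0, 0]
probabilities = [0.9, 0.6, 0.3, 0.4, 0.1]
sens, spec = sensitivity_specificity(labels, probabilities)
print(sens, spec)
```

    Lowering the threshold raises sensitivity at the expense of specificity, which is why the abstract fixes the trade-off ("assuming equal importance") before comparing models.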

  7. Eigenvector Spatial Filtering Regression Modeling of Ground PM2.5 Concentrations Using Remotely Sensed Data.

    PubMed

    Zhang, Jingyi; Li, Bin; Chen, Yumin; Chen, Meijie; Fang, Tao; Liu, Yongfeng

    2018-06-11

    This paper proposes a regression model using the Eigenvector Spatial Filtering (ESF) method to estimate ground PM2.5 concentrations. Covariates are derived from remotely sensed data including aerosol optical depth, normalized difference vegetation index, surface temperature, air pressure, relative humidity, height of planetary boundary layer and digital elevation model. In addition, cultural variables such as factory densities and road densities are used in the model. With the Yangtze River Delta region as the study area, we constructed ESF-based Regression (ESFR) models at different time scales, using data for the period between December 2015 and November 2016. We found that the ESFR models effectively filtered spatial autocorrelation in the OLS residuals and resulted in increases in the goodness-of-fit metrics as well as reductions in residual standard errors and cross-validation errors, compared to the classic OLS models. The annual ESFR model explained 70% of the variability in PM2.5 concentrations, 16.7% more than the non-spatial OLS model. With the ESFR models, we performed detailed analyses of the spatial and temporal distributions of PM2.5 concentrations in the study area. The model predictions are lower than ground observations but match the general trend. The experiment shows that ESFR provides a promising approach to PM2.5 analysis and prediction.

  8. Tools to Support Interpreting Multiple Regression in the Face of Multicollinearity

    PubMed Central

    Kraha, Amanda; Turner, Heather; Nimon, Kim; Zientek, Linda Reichwein; Henson, Robin K.

    2012-01-01

    While multicollinearity may increase the difficulty of interpreting multiple regression (MR) results, it should not cause undue problems for the knowledgeable researcher. In the current paper, we argue that rather than using one technique to investigate regression results, researchers should consider multiple indices to understand the contributions that predictors make not only to a regression model, but to each other as well. Some of the techniques to interpret MR effects include, but are not limited to, correlation coefficients, beta weights, structure coefficients, all possible subsets regression, commonality coefficients, dominance weights, and relative importance weights. This article will review a set of techniques to interpret MR effects, identify the elements of the data on which the methods focus, and identify statistical software to support such analyses. PMID:22457655

  9. Tools to support interpreting multiple regression in the face of multicollinearity.

    PubMed

    Kraha, Amanda; Turner, Heather; Nimon, Kim; Zientek, Linda Reichwein; Henson, Robin K

    2012-01-01

    While multicollinearity may increase the difficulty of interpreting multiple regression (MR) results, it should not cause undue problems for the knowledgeable researcher. In the current paper, we argue that rather than using one technique to investigate regression results, researchers should consider multiple indices to understand the contributions that predictors make not only to a regression model, but to each other as well. Some of the techniques to interpret MR effects include, but are not limited to, correlation coefficients, beta weights, structure coefficients, all possible subsets regression, commonality coefficients, dominance weights, and relative importance weights. This article will review a set of techniques to interpret MR effects, identify the elements of the data on which the methods focus, and identify statistical software to support such analyses.

  10. Introduction to the use of regression models in epidemiology.

    PubMed

    Bender, Ralf

    2009-01-01

    Regression modeling is one of the most important statistical techniques used in analytical epidemiology. By means of regression models the effect of one or several explanatory variables (e.g., exposures, subject characteristics, risk factors) on a response variable such as mortality or cancer can be investigated. From multiple regression models, adjusted effect estimates can be obtained that take the effect of potential confounders into account. Regression methods can be applied in all epidemiologic study designs so that they represent a universal tool for data analysis in epidemiology. Different kinds of regression models have been developed in dependence on the measurement scale of the response variable and the study design. The most important methods are linear regression for continuous outcomes, logistic regression for binary outcomes, Cox regression for time-to-event data, and Poisson regression for frequencies and rates. This chapter provides a nontechnical introduction to these regression models with illustrating examples from cancer research.

  11. The cross-validated AUC for MCP-logistic regression with high-dimensional data.

    PubMed

    Jiang, Dingfeng; Huang, Jian; Zhang, Ying

    2013-10-01

    We propose a cross-validated area under the receiver operating characteristic (ROC) curve (CV-AUC) criterion for tuning parameter selection for penalized methods in sparse, high-dimensional logistic regression models. We use this criterion in combination with the minimax concave penalty (MCP) method for variable selection. The CV-AUC criterion is specifically designed for optimizing the classification performance for binary outcome data. To implement the proposed approach, we derive an efficient coordinate descent algorithm to compute the MCP-logistic regression solution surface. Simulation studies are conducted to evaluate the finite sample performance of the proposed method and its comparison with the existing methods including the Akaike information criterion (AIC), Bayesian information criterion (BIC) or Extended BIC (EBIC). The model selected based on the CV-AUC criterion tends to have a larger predictive AUC and smaller classification error than those with tuning parameters selected using the AIC, BIC or EBIC. We illustrate the application of the MCP-logistic regression with the CV-AUC criterion on three microarray datasets from the studies that attempt to identify genes related to cancers. Our simulation studies and data examples demonstrate that the CV-AUC is an attractive method for tuning parameter selection for penalized methods in high-dimensional logistic regression models.
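
    For readers unfamiliar with the minimax concave penalty, the sketch below writes out the penalty and the single-coordinate update it induces. The update shown is the closed form for the least-squares case with standardized predictors and gamma > 1; it is given only to show how MCP relaxes shrinkage for large coefficients. The logistic version described in the abstract needs further steps that are not reproduced here, and the function names are hypothetical.

```python
def mcp_penalty(t, lam, gamma):
    """MCP value: lam*|t| - t^2/(2*gamma) up to |t| = gamma*lam, constant beyond."""
    a = abs(t)
    if a <= gamma * lam:
        return lam * a - a * a / (2.0 * gamma)
    return 0.5 * gamma * lam * lam

def soft_threshold(z, lam):
    """Lasso soft-thresholding operator."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def mcp_coordinate_update(z, lam, gamma):
    """Univariate MCP solution (least squares, standardized predictor, gamma > 1).

    z is the unpenalized univariate estimate for the coordinate being updated.
    """
    if abs(z) <= gamma * lam:
        return soft_threshold(z, lam) / (1.0 - 1.0 / gamma)
    return z  # large coefficients are left unshrunk

for z in (0.5, 2.0, 4.0):
    print(z, mcp_coordinate_update(z, lam=1.0, gamma=3.0))
```

    In a cross-validated tuning loop, lam (and possibly gamma) would be varied over a grid and the value giving the largest held-out AUC would be kept, which is the role the CV-AUC criterion plays in the abstract.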

  12. Mapping urban environmental noise: a land use regression method.

    PubMed

    Xie, Dan; Liu, Yi; Chen, Jining

    2011-09-01

    Forecasting and preventing urban noise pollution are major challenges in urban environmental management. Most existing efforts, including experiment-based models, statistical models, and noise mapping, however, have limited capacity to explain the association between urban growth and corresponding noise change. Therefore, these conventional methods can hardly forecast urban noise at a given outlook of development layout. This paper, for the first time, introduces a land use regression method, which has been applied for simulating urban air quality for a decade, to construct an urban noise model (LUNOS) in Dalian Municipality, Northeast China. The LUNOS model describes noise as a dependent variable of the various surrounding land areas via a regression function. The results suggest that a linear model performs better in fitting monitoring data, and there is no significant difference in the LUNOS outputs when applied to different spatial scales. As the LUNOS facilitates a better understanding of the association between land use and urban environmental noise in comparison to conventional methods, it can be regarded as a promising tool for noise prediction for planning purposes and to aid smart decision-making.

  13. Food Insecurity in U.S. Households That Include Children with Disabilities

    ERIC Educational Resources Information Center

    Sonik, Rajan; Parish, Susan L.; Ghosh, Subharati; Igdalsky, Leah

    2016-01-01

    The authors examined food insecurity in households including children with disabilities, analyzing data from the 2004 and 2008 panels of the Survey of Income and Program Participation, which included 24,729 households with children, 3,948 of which had children with disabilities. Logistic regression models were used to estimate the likelihood of…

  14. Sensitivity analysis, calibration, and testing of a distributed hydrological model using error‐based weighting and one objective function

    USGS Publications Warehouse

    Foglia, L.; Hill, Mary C.; Mehl, Steffen W.; Burlando, P.

    2009-01-01

    We evaluate the utility of three interrelated means of using data to calibrate the fully distributed rainfall‐runoff model TOPKAPI as applied to the Maggia Valley drainage area in Switzerland. The use of error‐based weighting of observation and prior information data, local sensitivity analysis, and single‐objective function nonlinear regression provides quantitative evaluation of sensitivity of the 35 model parameters to the data, identification of data types most important to the calibration, and identification of correlations among parameters that contribute to nonuniqueness. Sensitivity analysis required only 71 model runs, and regression required about 50 model runs. The approach presented appears to be ideal for evaluation of models with long run times or as a preliminary step to more computationally demanding methods. The statistics used include composite scaled sensitivities, parameter correlation coefficients, leverage, Cook's D, and DFBETAS. Tests suggest predictive ability of the calibrated model typical of hydrologic models.

  15. A novel variational Bayes multiple locus Z-statistic for genome-wide association studies with Bayesian model averaging

    PubMed Central

    Logsdon, Benjamin A.; Carty, Cara L.; Reiner, Alexander P.; Dai, James Y.; Kooperberg, Charles

    2012-01-01

    Motivation: For many complex traits, including height, the majority of variants identified by genome-wide association studies (GWAS) have small effects, leaving a significant proportion of the heritable variation unexplained. Although many penalized multiple regression methodologies have been proposed to increase the power to detect associations for complex genetic architectures, they generally lack mechanisms for false-positive control and diagnostics for model over-fitting. Our methodology is the first penalized multiple regression approach that explicitly controls Type I error rates and provides model over-fitting diagnostics through a novel normally distributed statistic defined for every marker within the GWAS, based on results from a variational Bayes spike regression algorithm. Results: We compare the performance of our method to the lasso and single marker analysis on simulated data and demonstrate that our approach has superior performance in terms of power and Type I error control. In addition, using the Women's Health Initiative (WHI) SNP Health Association Resource (SHARe) GWAS of African-Americans, we show that our method has power to detect additional novel associations with body height. These findings replicate by reaching a stringent cutoff of marginal association in a larger cohort. Availability: An R-package, including an implementation of our variational Bayes spike regression (vBsr) algorithm, is available at http://kooperberg.fhcrc.org/soft.html. Contact: blogsdon@fhcrc.org Supplementary information: Supplementary data are available at Bioinformatics online. PMID:22563072

  16. Interpretation of commonly used statistical regression models.

    PubMed

    Kasza, Jessica; Wolfe, Rory

    2014-01-01

    A review of some regression models commonly used in respiratory health applications is provided in this article. Simple linear regression, multiple linear regression, logistic regression and ordinal logistic regression are considered. The focus of this article is on the interpretation of the regression coefficients of each model, which are illustrated through the application of these models to a respiratory health research study. © 2013 The Authors. Respirology © 2013 Asian Pacific Society of Respirology.

  17. Assessment and improvement of biotransfer models to cow's milk and beef used in exposure assessment tools for organic pollutants.

    PubMed

    Takaki, Koki; Wade, Andrew J; Collins, Chris D

    2015-11-01

    The aim of this study was to assess and improve the accuracy of biotransfer models for the organic pollutants (PCBs, PCDD/Fs, PBDEs, PFCAs, and pesticides) into cow's milk and beef used in human exposure assessment. Metabolic rate in cattle is known to be a key parameter for this biotransfer; however, few experimental data and no simulation methods are currently available. In this research, metabolic rate was estimated using existing QSAR biodegradation models of microorganisms (BioWIN) and fish (EPI-HL and IFS-HL). This simulated metabolic rate was then incorporated into the mechanistic cattle biotransfer models (RAIDAR, ACC-HUMAN, OMEGA, and CKow). The goodness-of-fit tests showed that the performance of the RAIDAR, ACC-HUMAN, and OMEGA models was significantly improved using either of the QSARs when comparing the new model outputs to observed data. The CKow model is the only one that separates the processes in the gut and liver. This model showed the lowest residual error of all the models tested when the BioWIN model was used to represent the ruminant metabolic process in the gut and the two fish QSARs were used to represent the metabolic process in the liver. Our testing included EUSES and CalTOX, which are KOW-regression models that are widely used in regulatory assessment. New regressions based on the simulated rate of the two metabolic processes are also proposed as an alternative to KOW-regression models for a screening risk assessment. The modified CKow model is more physiologically realistic, but has equivalent usability to existing KOW-regression models for estimating cattle biotransfer of organic pollutants. Copyright © 2015. Published by Elsevier Ltd.

  18. Predicting allergic contact dermatitis: a hierarchical structure activity relationship (SAR) approach to chemical classification using topological and quantum chemical descriptors

    NASA Astrophysics Data System (ADS)

    Basak, Subhash C.; Mills, Denise; Hawkins, Douglas M.

    2008-06-01

    A hierarchical classification study was carried out based on a set of 70 chemicals—35 which produce allergic contact dermatitis (ACD) and 35 which do not. This approach was implemented using a regular ridge regression computer code, followed by conversion of regression output to binary data values. The hierarchical descriptor classes used in the modeling include topostructural (TS), topochemical (TC), and quantum chemical (QC), all of which are based solely on chemical structure. The concordance, sensitivity, and specificity are reported. The model based on the TC descriptors was found to be the best, while the TS model was extremely poor.
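
    The step of converting regression output to binary values and then reporting concordance, sensitivity, and specificity can be sketched as follows. This is only an illustration: the 0.5 cutoff, the function name `classification_metrics`, and the example scores are assumptions, since the abstract does not state how the conversion was done.

```python
def classification_metrics(scores, labels, threshold=0.5):
    """Threshold continuous regression output and compare it with 0/1 labels.

    Assumes both classes occur in labels, so no denominator is zero.
    """
    tp = tn = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        elif predicted == 1 and label == 0:
            fp += 1
        else:
            fn += 1
    return {
        "concordance": (tp + tn) / len(labels),  # overall agreement
        "sensitivity": tp / (tp + fn),           # true positive rate
        "specificity": tn / (tn + fp),           # true negative rate
    }


# Hypothetical regression outputs and true classes (1 = sensitizer).
print(classification_metrics([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 1]))
```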

  19. Predicting daily use of urban forest recreation sites

    Treesearch

    John F. Dwyer

    1988-01-01

    A multiple linear regression model explains 90% of the variance in daily use of an urban recreation site. Explanatory variables include season, day of the week, and weather. The results offer guides for recreation site planning and management as well as suggestions for improving the model.

  20. Individual risk factors for deep infection and compromised fracture healing after intramedullary nailing of tibial shaft fractures: a single centre experience of 480 patients.

    PubMed

    Metsemakers, W-J; Handojo, K; Reynders, P; Sermon, A; Vanderschot, P; Nijs, S

    2015-04-01

    Despite modern advances in the treatment of tibial shaft fractures, complications including nonunion, malunion, and infection remain relatively frequent. A better understanding of these injuries and their complications could lead to prevention rather than treatment strategies. A retrospective study was performed to identify risk factors for deep infection and compromised fracture healing after intramedullary nailing (IMN) of tibial shaft fractures. Between January 2000 and January 2012, 480 consecutive patients with 486 tibial shaft fractures were enrolled in the study. Statistical analysis was performed to determine predictors of deep infection and compromised fracture healing. Compromised fracture healing was subdivided into delayed union and nonunion. The following independent variables were selected for analysis: age, sex, smoking, obesity, diabetes, American Society of Anaesthesiologists (ASA) classification, polytrauma, fracture type, open fractures, Gustilo type, primary external fixation (EF), time to nailing (TTN) and reaming. As primary statistical evaluation we performed a univariate analysis, followed by a multiple logistic regression model. Univariate regression analysis revealed similar risk factors for delayed union and nonunion, including fracture type, open fractures and Gustilo type. Factors affecting the occurrence of deep infection in this model were primary EF, a prolonged TTN, open fractures and Gustilo type. Multiple logistic regression analysis revealed polytrauma as the single risk factor for nonunion. With respect to delayed union, no risk factors could be identified. In the same statistical model, deep infection was correlated with primary EF. The purpose of this study was to evaluate risk factors of poor outcome after IMN of tibial shaft fractures. The univariate regression analysis showed that the nature of complications after tibial shaft nailing could be multifactorial. This was not confirmed in a multiple logistic regression model, which only revealed polytrauma and primary EF as risk factors for nonunion and deep infection, respectively. Future strategies should focus on prevention in high-risk populations such as polytrauma patients treated with EF. Copyright © 2014 Elsevier Ltd. All rights reserved.

  1. Classification and regression tree analysis vs. multivariable linear and logistic regression methods as statistical tools for studying haemophilia.

    PubMed

    Henrard, S; Speybroeck, N; Hermans, C

    2015-11-01

    Haemophilia is a rare genetic haemorrhagic disease characterized by partial or complete deficiency of coagulation factor VIII, for haemophilia A, or IX, for haemophilia B. As in any other medical research domain, the field of haemophilia research is increasingly concerned with finding factors associated with binary or continuous outcomes through multivariable models. Traditional models include multiple logistic regressions, for binary outcomes, and multiple linear regressions for continuous outcomes. Yet these regression models are at times difficult to implement, especially for non-statisticians, and can be difficult to interpret. The present paper sought to didactically explain how, why, and when to use classification and regression tree (CART) analysis for haemophilia research. The CART method is non-parametric and non-linear, based on the repeated partitioning of a sample into subgroups based on a certain criterion. Breiman developed this method in 1984. Classification trees (CTs) are used to analyse categorical outcomes and regression trees (RTs) to analyse continuous ones. The CART methodology has become increasingly popular in the medical field, yet only a few examples of studies using this methodology specifically in haemophilia have to date been published. Two examples using CART analysis and previously published in this field are didactically explained in details. There is increasing interest in using CART analysis in the health domain, primarily due to its ease of implementation, use, and interpretation, thus facilitating medical decision-making. This method should be promoted for analysing continuous or categorical outcomes in haemophilia, when applicable. © 2015 John Wiley & Sons Ltd.
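
    The repeated partitioning that the abstract describes can be illustrated with a single regression-tree split. The sketch below is a simplified stand-in for CART, assuming one numeric predictor and a squared-error criterion; the function name `best_split` and the data are hypothetical.

```python
def best_split(xs, ys):
    """Return (threshold, sse) for the one split of xs that minimises the
    summed squared error of ys around each subgroup mean."""

    def sse(values):
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best = (None, float("inf"))
    distinct = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# Hypothetical predictor values and a continuous outcome.
print(best_split([1, 2, 3, 10, 11, 12], [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]))
```

    A full tree would apply the same search recursively to each subgroup and over every candidate predictor, stopping on a rule such as a minimum subgroup size.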

  2. Comparison of Cox’s Regression Model and Parametric Models in Evaluating the Prognostic Factors for Survival after Liver Transplantation in Shiraz during 2000–2012

    PubMed Central

    Adelian, R.; Jamali, J.; Zare, N.; Ayatollahi, S. M. T.; Pooladfar, G. R.; Roustaei, N.

    2015-01-01

    Background: Identification of the prognostic factors for survival in patients with liver transplantation is challenging. Various methods of survival analysis have provided different, sometimes contradictory, results from the same data. Objective: To compare Cox’s regression model with parametric models for determining the independent factors for predicting adults’ and pediatrics’ survival after liver transplantation. Method: This study was conducted on 183 pediatric patients and 346 adults who underwent liver transplantation in Namazi Hospital, Shiraz, southern Iran. The study population included all patients undergoing liver transplantation from 2000 to 2012. The prognostic factors sex, age, Child class, initial diagnosis of the liver disease, PELD/MELD score, and pre-operative laboratory markers were selected for survival analysis. Result: Among 529 patients, 346 (65.4%) were adult and 183 (34.6%) were pediatric cases. Overall, the lognormal distribution was the best-fitting model for adult and pediatric patients. Age in adults (HR=1.16, p<0.05) and weight (HR=2.68, p<0.01) and Child class B (HR=2.12, p<0.05) in pediatric patients were the most important factors for prediction of survival after liver transplantation. Adult patients younger than the mean age and pediatric patients weighing above the mean and Child class A (compared to those with classes B or C) had better survival. Conclusion: The parametric regression model is a good alternative to Cox’s regression model. PMID:26306158

  3. A novel model incorporating two variability sources for describing motor evoked potentials

    PubMed Central

    Goetz, Stefan M.; Luber, Bruce; Lisanby, Sarah H.; Peterchev, Angel V.

    2014-01-01

    Objective: Motor evoked potentials (MEPs) play a pivotal role in transcranial magnetic stimulation (TMS), e.g., for determining the motor threshold and probing cortical excitability. Sampled across the range of stimulation strengths, MEPs outline an input–output (IO) curve, which is often used to characterize the corticospinal tract. More detailed understanding of the signal generation and variability of MEPs would provide insight into the underlying physiology and aid correct statistical treatment of MEP data. Methods: A novel regression model is tested using measured IO data of twelve subjects. The model splits MEP variability into two independent contributions, acting on both sides of a strong sigmoidal nonlinearity that represents neural recruitment. Traditional sigmoidal regression with a single variability source after the nonlinearity is used for comparison. Results: The distribution of MEP amplitudes varied across different stimulation strengths, violating statistical assumptions in traditional regression models. In contrast to the conventional regression model, the dual variability source model better described the IO characteristics including phenomena such as changing distribution spread and skewness along the IO curve. Conclusions: MEP variability is best described by two sources that most likely separate variability in the initial excitation process from effects occurring later on. The new model enables more accurate and sensitive estimation of the IO curve characteristics, enhancing its power as a detection tool, and may apply to other brain stimulation modalities. Furthermore, it extracts new information from the IO data concerning the neural variability, information that has previously been treated as noise. PMID:24794287

  4. Fusing Data Mining, Machine Learning and Traditional Statistics to Detect Biomarkers Associated with Depression

    PubMed Central

    Dipnall, Joanna F.

    2016-01-01

    Background: Atheoretical large-scale data mining techniques using machine learning algorithms have promise in the analysis of large epidemiological datasets. This study illustrates the use of a hybrid methodology for variable selection that took account of missing data and complex survey design to identify key biomarkers associated with depression from a large epidemiological study. Methods: The study used a three-step methodology amalgamating multiple imputation, a machine learning boosted regression algorithm and logistic regression, to identify key biomarkers associated with depression in the National Health and Nutrition Examination Study (2009–2010). Depression was measured using the Patient Health Questionnaire-9 and 67 biomarkers were analysed. Covariates in this study included gender, age, race, smoking, food security, Poverty Income Ratio, Body Mass Index, physical activity, alcohol use, medical conditions and medications. The final imputed weighted multiple logistic regression model included possible confounders and moderators. Results: After the creation of 20 imputation data sets from multiple chained regression sequences, machine learning boosted regression initially identified 21 biomarkers associated with depression. Using traditional logistic regression methods, including controlling for possible confounders and moderators, a final set of three biomarkers was selected. The final three biomarkers from the novel hybrid variable selection methodology were red cell distribution width (OR 1.15; 95% CI 1.01, 1.30), serum glucose (OR 1.01; 95% CI 1.00, 1.01) and total bilirubin (OR 0.12; 95% CI 0.05, 0.28). Significant interactions were found between total bilirubin and the Mexican American/Hispanic group (p = 0.016) and current smokers (p<0.001). Conclusion: The systematic use of a hybrid methodology for variable selection, fusing data mining techniques using a machine learning algorithm with traditional statistical modelling, accounted for missing data and complex survey sampling methodology and was demonstrated to be a useful tool for detecting three biomarkers associated with depression for future hypothesis generation: red cell distribution width, serum glucose and total bilirubin. PMID:26848571

  5. Fusing Data Mining, Machine Learning and Traditional Statistics to Detect Biomarkers Associated with Depression.

    PubMed

    Dipnall, Joanna F; Pasco, Julie A; Berk, Michael; Williams, Lana J; Dodd, Seetal; Jacka, Felice N; Meyer, Denny

    2016-01-01

    Atheoretical large-scale data mining techniques using machine learning algorithms have promise in the analysis of large epidemiological datasets. This study illustrates the use of a hybrid methodology for variable selection that took account of missing data and complex survey design to identify key biomarkers associated with depression from a large epidemiological study. The study used a three-step methodology amalgamating multiple imputation, a machine learning boosted regression algorithm and logistic regression, to identify key biomarkers associated with depression in the National Health and Nutrition Examination Study (2009-2010). Depression was measured using the Patient Health Questionnaire-9 and 67 biomarkers were analysed. Covariates in this study included gender, age, race, smoking, food security, Poverty Income Ratio, Body Mass Index, physical activity, alcohol use, medical conditions and medications. The final imputed weighted multiple logistic regression model included possible confounders and moderators. After the creation of 20 imputation data sets from multiple chained regression sequences, machine learning boosted regression initially identified 21 biomarkers associated with depression. Using traditional logistic regression methods, including controlling for possible confounders and moderators, a final set of three biomarkers was selected. The final three biomarkers from the novel hybrid variable selection methodology were red cell distribution width (OR 1.15; 95% CI 1.01, 1.30), serum glucose (OR 1.01; 95% CI 1.00, 1.01) and total bilirubin (OR 0.12; 95% CI 0.05, 0.28). Significant interactions were found between total bilirubin and the Mexican American/Hispanic group (p = 0.016) and current smokers (p<0.001). The systematic use of a hybrid methodology for variable selection, fusing data mining techniques using a machine learning algorithm with traditional statistical modelling, accounted for missing data and complex survey sampling methodology and was demonstrated to be a useful tool for detecting three biomarkers associated with depression for future hypothesis generation: red cell distribution width, serum glucose and total bilirubin.

  6. Prognostic model for survival in patients with early stage cervical cancer.

    PubMed

    Biewenga, Petra; van der Velden, Jacobus; Mol, Ben Willem J; Stalpers, Lukas J A; Schilthuis, Marten S; van der Steeg, Jan Willem; Burger, Matthé P M; Buist, Marrije R

    2011-02-15

    In the management of early stage cervical cancer, knowledge about the prognosis is critical. Although many factors have an impact on survival, their relative importance remains controversial. This study aims to develop a prognostic model for survival in early stage cervical cancer patients and to reconsider grounds for adjuvant treatment. A multivariate Cox regression model was used to identify the prognostic weight of clinical and histological factors for disease-specific survival (DSS) in 710 consecutive patients who had surgery for early stage cervical cancer (FIGO [International Federation of Gynecology and Obstetrics] stage IA2-IIA). Prognostic scores were derived by converting the regression coefficients for each prognostic marker and used in a score chart. The discriminative capacity was expressed as the area under the curve (AUC) of the receiver operating characteristic. The 5-year DSS was 92%. Tumor diameter, histological type, lymph node metastasis, depth of stromal invasion, lymph vascular space invasion, and parametrial extension were independently associated with DSS and were included in a Cox regression model. This prognostic model, corrected for the 9% overfit shown by internal validation, showed a fair discriminative capacity (AUC, 0.73). The derived score chart predicting 5-year DSS showed a good discriminative capacity (AUC, 0.85). In patients with early stage cervical cancer, DSS can be predicted with a statistical model. Models, such as that presented here, should be used in clinical trials on the effects of adjuvant treatments in high-risk early cervical cancer patients, both to stratify and to include patients. Copyright © 2010 American Cancer Society.
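
    The discriminative capacity reported above is an area under the ROC curve. As a rough sketch of what that number measures, the code below computes the AUC for a binary outcome as the share of case/non-case pairs in which the case has the higher score. It ignores censoring and follow-up time, so it is a simplification of what a Cox-based analysis would require; `auc` and the scores are hypothetical.

```python
def auc(scores_events, scores_nonevents):
    """Probability that a randomly chosen event scores higher than a
    randomly chosen non-event, counting ties as one half."""
    wins = 0.0
    for event_score in scores_events:
        for nonevent_score in scores_nonevents:
            if event_score > nonevent_score:
                wins += 1.0
            elif event_score == nonevent_score:
                wins += 0.5
    return wins / (len(scores_events) * len(scores_nonevents))


# Hypothetical prognostic scores for patients with and without the event.
print(auc([0.8, 0.3], [0.5, 0.1]))
```

    A value of 0.5 corresponds to no discrimination and 1.0 to perfect separation of the two groups.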

  7. Statistical relations among earthquake magnitude, surface rupture length, and surface fault displacement

    USGS Publications Warehouse

    Bonilla, Manuel G.; Mark, Robert K.; Lienkaemper, James J.

    1984-01-01

    In order to refine correlations of surface-wave magnitude, fault rupture length at the ground surface, and fault displacement at the surface by including the uncertainties in these variables, the existing data were critically reviewed and a new data base was compiled. Earthquake magnitudes were redetermined as necessary to make them as consistent as possible with the Gutenberg methods and results, which make up much of the data base. Measurement errors were estimated for the three variables for 58 moderate to large shallow-focus earthquakes. Regression analyses were then made utilizing the estimated measurement errors. The regression analysis demonstrates that the relations among the variables magnitude, length, and displacement are stochastic in nature. The stochastic variance, introduced in part by incomplete surface expression of seismogenic faulting, variation in shear modulus, and regional factors, dominates the estimated measurement errors. Thus, it is appropriate to use ordinary least squares for the regression models, rather than regression models based upon an underlying deterministic relation in which the variance results primarily from measurement errors. Significant differences exist in correlations of certain combinations of length, displacement, and magnitude when events are grouped by fault type or by region, including attenuation regions delineated by Evernden and others. Estimates of the magnitude and the standard deviation of the magnitude of a prehistoric or future earthquake associated with a fault can be made by correlating Ms with the logarithms of rupture length, fault displacement, or the product of length and displacement. Fault rupture area could be reliably estimated for about 20 of the events in the data set. Regression of Ms on rupture area did not result in a marked improvement over regressions that did not involve rupture area. Because no subduction-zone earthquakes are included in this study, the reported results do not apply to such zones.

  8. Using Generalized Additive Models to Analyze Single-Case Designs

    ERIC Educational Resources Information Center

    Shadish, William; Sullivan, Kristynn

    2013-01-01

    Many analyses for single-case designs (SCDs)--including nearly all the effect size indicators-- currently assume no trend in the data. Regression and multilevel models allow for trend, but usually test only linear trend and have no principled way of knowing if higher order trends should be represented in the model. This paper shows how Generalized…

  9. A new approach to correct the QT interval for changes in heart rate using a nonparametric regression model in beagle dogs.

    PubMed

    Watanabe, Hiroyuki; Miyazaki, Hiroyasu

    2006-01-01

    Over- and/or under-correction of QT intervals for changes in heart rate may lead to misleading conclusions and/or masking the potential of a drug to prolong the QT interval. This study examines a nonparametric regression model (Loess Smoother) to adjust the QT interval for differences in heart rate, with an improved fitness over a wide range of heart rates. 240 sets of (QT, RR) observations collected from each of 8 conscious and non-treated beagle dogs were used as the materials for investigation. The fitness of the nonparametric regression model to the QT-RR relationship was compared with four models (individual linear regression, common linear regression, and Bazett's and Fridericia's correlation models) with reference to Akaike's Information Criterion (AIC). Residuals were visually assessed. The bias-corrected AIC of the nonparametric regression model was the best of the models examined in this study. Although the parametric models did not fit, the nonparametric regression model improved the fitting at both fast and slow heart rates. The nonparametric regression model is the more flexible method compared with the parametric method. The mathematical fit for linear regression models was unsatisfactory at both fast and slow heart rates, while the nonparametric regression model showed significant improvement at all heart rates in beagle dogs.
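
    For context, the two fixed-exponent corrections that the study uses as comparators have simple closed forms: Bazett divides QT by the square root of RR, and Fridericia divides it by the cube root of RR, with RR in seconds. The sketch below shows only these standard formulas, not the Loess-based method of the paper; the function names and sample values are illustrative.

```python
def bazett(qt_ms, rr_s):
    """Bazett correction: QTc = QT / RR**(1/2), with RR in seconds."""
    return qt_ms / rr_s ** 0.5


def fridericia(qt_ms, rr_s):
    """Fridericia correction: QTc = QT / RR**(1/3), with RR in seconds."""
    return qt_ms / rr_s ** (1.0 / 3.0)


# Hypothetical observation: QT of 230 ms at an RR interval of 0.5 s.
print(round(bazett(230.0, 0.5), 1), round(fridericia(230.0, 0.5), 1))
```

    Both formulas leave QT unchanged at RR = 1 s and diverge from each other as the heart rate moves away from 60 beats per minute, which is one reason a data-driven fit may be preferred over a wide range of heart rates.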

  10. A semi-nonparametric Poisson regression model for analyzing motor vehicle crash data.

    PubMed

    Ye, Xin; Wang, Ke; Zou, Yajie; Lord, Dominique

    2018-01-01

    This paper develops a semi-nonparametric Poisson regression model to analyze motor vehicle crash frequency data collected from rural multilane highway segments in California, US. Motor vehicle crash frequency on rural highway is a topic of interest in the area of transportation safety due to higher driving speeds and the resultant severity level. Unlike the traditional Negative Binomial (NB) model, the semi-nonparametric Poisson regression model can accommodate an unobserved heterogeneity following a highly flexible semi-nonparametric (SNP) distribution. Simulation experiments are conducted to demonstrate that the SNP distribution can well mimic a large family of distributions, including normal distributions, log-gamma distributions, bimodal and trimodal distributions. Empirical estimation results show that such flexibility offered by the SNP distribution can greatly improve model precision and the overall goodness-of-fit. The semi-nonparametric distribution can provide a better understanding of crash data structure through its ability to capture potential multimodality in the distribution of unobserved heterogeneity. When estimated coefficients in empirical models are compared, SNP and NB models are found to have a substantially different coefficient for the dummy variable indicating the lane width. The SNP model with better statistical performance suggests that the NB model overestimates the effect of lane width on crash frequency reduction by 83.1%.

  11. Evaluation of the Use of Zero-Augmented Regression Techniques to Model Incidence of Campylobacter Infections in FoodNet.

    PubMed

    Tremblay, Marlène; Crim, Stacy M; Cole, Dana J; Hoekstra, Robert M; Henao, Olga L; Döpfer, Dörte

    2017-10-01

    The Foodborne Diseases Active Surveillance Network (FoodNet) is currently using a negative binomial (NB) regression model to estimate temporal changes in the incidence of Campylobacter infection. FoodNet active surveillance in 483 counties collected data on 40,212 Campylobacter cases between years 2004 and 2011. We explored models that disaggregated these data to allow us to account for demographic, geographic, and seasonal factors when examining changes in incidence of Campylobacter infection. We hypothesized that modeling structural zeros and including demographic variables would increase the fit of FoodNet's Campylobacter incidence regression models. Five different models were compared: NB without demographic covariates, NB with demographic covariates, hurdle NB with covariates in the count component only, hurdle NB with covariates in both zero and count components, and zero-inflated NB with covariates in the count component only. Of the models evaluated, the nonzero-augmented NB model with demographic variables provided the best fit. Results suggest that even though zero inflation was not present at this level, individualizing the level of aggregation and using different model structures and predictors per site might be required to correctly distinguish between structural and observational zeros and account for risk factors that vary geographically.
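
    To show what zero augmentation changes, the sketch below compares the probability of a zero count under a plain negative binomial model with that under a zero-inflated one. It assumes the mean/size parameterisation of the negative binomial, in which P(Y = 0) = (r / (r + mu))^r; the function names and parameter values are hypothetical and unrelated to the FoodNet data.

```python
def nb_zero_probability(mu, size):
    """P(Y = 0) for a negative binomial with mean mu and dispersion size."""
    return (size / (size + mu)) ** size


def zinb_zero_probability(mu, size, pi_structural):
    """P(Y = 0) for a zero-inflated negative binomial.

    pi_structural is the probability of a structural zero; the remaining
    mass follows the ordinary negative binomial, which can also yield zeros.
    """
    return pi_structural + (1.0 - pi_structural) * nb_zero_probability(mu, size)


# Hypothetical parameters: mean 2 cases, size 1, 20% structural zeros.
print(nb_zero_probability(2.0, 1.0), zinb_zero_probability(2.0, 1.0, 0.2))
```

    When the structural-zero probability is estimated near zero, the two models coincide, which is consistent with a plain NB model fitting as well as an augmented one.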

  12. Improving virtual screening predictive accuracy of Human kallikrein 5 inhibitors using machine learning models.

    PubMed

    Fang, Xingang; Bagui, Sikha; Bagui, Subhash

    2017-08-01

    The readily available high throughput screening (HTS) data from the PubChem database provides an opportunity for mining of small molecules in a variety of biological systems using machine learning techniques. From the thousands of available molecular descriptors developed to encode useful chemical information representing the characteristics of molecules, descriptor selection is an essential step in building an optimal quantitative structural-activity relationship (QSAR) model. For the development of a systematic descriptor selection strategy, we need the understanding of the relationship between: (i) the descriptor selection; (ii) the choice of the machine learning model; and (iii) the characteristics of the target bio-molecule. In this work, we employed the Signature descriptor to generate a dataset on the Human kallikrein 5 (hK 5) inhibition confirmatory assay data and compared multiple classification models including logistic regression, support vector machine, random forest and k-nearest neighbor. Under optimal conditions, the logistic regression model provided extremely high overall accuracy (98%) and precision (90%), with good sensitivity (65%) in the cross validation test. In testing the primary HTS screening data with more than 200K molecular structures, the logistic regression model exhibited the capability of eliminating more than 99.9% of the inactive structures. As part of our exploration of the descriptor-model-target relationship, the excellent predictive performance of the combination of the Signature descriptor and the logistic regression model on the assay data of the Human kallikrein 5 (hK 5) target suggested a feasible descriptor/model selection strategy on similar targets. Copyright © 2017 Elsevier Ltd. All rights reserved.

  13. Aspects of porosity prediction using multivariate linear regression

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Byrnes, A.P.; Wilson, M.D.

    1991-03-01

    Highly accurate multiple linear regression models have been developed for sandstones of diverse compositions. Porosity reduction or enhancement processes are controlled by the fundamental variables, Pressure (P), Temperature (T), Time (t), and Composition (X), where composition includes mineralogy, size, sorting, fluid composition, etc. The multiple linear regression equation, of which all linear porosity prediction models are subsets, takes the generalized form: $\mathrm{Porosity} = C_0 + C_1(P) + C_2(T) + C_3(X) + C_4(t) + C_5(PT) + C_6(PX) + C_7(Pt) + C_8(TX) + C_9(Tt) + C_{10}(Xt) + C_{11}(PTX) + C_{12}(PXt) + C_{13}(PTt) + C_{14}(TXt) + C_{15}(PTXt)$. The first four primary variables are often interactive, thus requiring terms involving two or more primary variables (the form shown implies interaction and not necessarily multiplication). The final terms used may also involve simple mathematical transforms such as $\log X$, $e^{T}$, $X^{2}$, or more complex transformations such as the Time-Temperature Index (TTI). The X term in the equation above represents a suite of compositional variables and, therefore, a fully expanded equation may include a series of terms incorporating these variables. Numerous published bivariate porosity prediction models involving P (or depth) or Tt (TTI) are effective to a degree, largely because of the high degree of colinearity between P and TTI. However, all such bivariate models ignore the unique contributions of P and Tt, as well as various X terms. These simpler models become poor predictors in regions where colinear relations change, where important variables have been ignored, or where the database does not include a sufficient range or weight distribution for the critical variables.
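
    The sixteen terms of the generalized equation are the intercept plus every non-empty combination of P, T, X, and t. The sketch below enumerates them for one sample. It treats each interaction as a plain product, which is an assumption made here for illustration, since the abstract notes that interaction does not necessarily mean multiplication; `design_row` and the sample values are hypothetical.

```python
from itertools import combinations
from math import prod


def design_row(values):
    """Build one regression design row from primary variables.

    values maps variable names (for example P, T, X, t) to numbers.
    The row holds an intercept and the product of every non-empty subset.
    """
    names = list(values)
    row = {"1": 1.0}  # intercept term, multiplied by C_0
    for order in range(1, len(names) + 1):
        for combo in combinations(names, order):
            row["".join(combo)] = prod(values[name] for name in combo)
    return row


# Hypothetical, unitless sample values for the four primary variables.
row = design_row({"P": 2.0, "T": 3.0, "X": 5.0, "t": 7.0})
print(len(row), row["Pt"], row["PTXt"])
```

    Fitting the coefficients would then be an ordinary least-squares problem over rows like this one, and dropping columns yields the simpler bivariate models the abstract criticises.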

  14. Nitrate removal in stream ecosystems measured by 15N addition experiments: Total uptake

    USGS Publications Warehouse

    Hall, R.O.; Tank, J.L.; Sobota, D.J.; Mulholland, P.J.; O'Brien, J. M.; Dodds, W.K.; Webster, J.R.; Valett, H.M.; Poole, G.C.; Peterson, B.J.; Meyer, J.L.; McDowell, W.H.; Johnson, S.L.; Hamilton, S.K.; Grimm, N. B.; Gregory, S.V.; Dahm, Clifford N.; Cooper, L.W.; Ashkenas, L.R.; Thomas, S.M.; Sheibley, R.W.; Potter, J.D.; Niederlehner, B.R.; Johnson, L.T.; Helton, A.M.; Crenshaw, C.M.; Burgin, A.J.; Bernot, M.J.; Beaulieu, J.J.; Arangob, C.P.

    2009-01-01

    We measured uptake length of ¹⁵NO₃⁻ in 72 streams in eight regions across the United States and Puerto Rico to develop quantitative predictive models on controls of NO₃⁻ uptake length. As part of the Lotic Intersite Nitrogen eXperiment II project, we chose nine streams in each region corresponding to natural (reference), suburban-urban, and agricultural land uses. Study streams spanned a range of human land use to maximize variation in NO₃⁻ concentration, geomorphology, and metabolism. We tested a causal model predicting controls on NO₃⁻ uptake length using structural equation modeling. The model included concomitant measurements of ecosystem metabolism, hydraulic parameters, and nitrogen concentration. We compared this structural equation model to multiple regression models which included additional biotic, catchment, and riparian variables. The structural equation model explained 79% of the variation in log uptake length (SWtot). Uptake length increased with specific discharge (Q/w) and increasing NO₃⁻ concentrations, showing a loss in removal efficiency in streams with high NO₃⁻ concentration. Uptake lengths shortened with increasing gross primary production, suggesting autotrophic assimilation dominated NO₃⁻ removal. The fraction of catchment area as agriculture and suburban-urban land use weakly predicted NO₃⁻ uptake in bivariate regression, and did improve prediction in a set of multiple regression models. Adding land use to the structural equation model showed that land use indirectly affected NO₃⁻ uptake lengths via directly increasing both gross primary production and NO₃⁻ concentration. Gross primary production shortened SWtot, while increasing NO₃⁻ lengthened SWtot, resulting in no net effect of land use on NO₃⁻ removal. © 2009.

  15. Drought Patterns Forecasting using an Auto-Regressive Logistic Model

    NASA Astrophysics Data System (ADS)

    del Jesus, M.; Sheffield, J.; Méndez Incera, F. J.; Losada, I. J.; Espejo, A.

    2014-12-01

    Drought is characterized by a water deficit that may manifest across a large range of spatial and temporal scales. Drought may have important socio-economic consequences, often of catastrophic dimensions. A quantifiable definition of drought is elusive because, depending on its impacts, consequences and generation mechanism, a given water deficit period may be identified as a drought under some definitions but not under others. Droughts are linked to the water cycle and, although a climate change signal may not have emerged yet, they are also intimately linked to climate. In this work we develop an auto-regressive logistic model for drought prediction at different temporal scales that makes use of a spatially explicit framework. Our model allows covariates, continuous or categorical, to be included to improve the performance of the auto-regressive component. Our approach makes use of dimensionality reduction (principal component analysis) and classification techniques (K-Means and maximum dissimilarity) to simplify the representation of complex climatic patterns, such as sea surface temperature (SST) and sea level pressure (SLP), while including information on their spatial structure, i.e. considering their spatial patterns. This procedure allows us to include in the analysis multivariate representations of complex climatic phenomena, such as the El Niño-Southern Oscillation. We also explore the impact of other climate-related variables such as sunspots. The model allows the uncertainty of the forecasts to be quantified and can be easily adapted to make predictions under future climatic scenarios. The framework presented herein may be extended to other applications such as flash flood analysis or risk assessment of natural hazards.

  16. Quantile regression via vector generalized additive models.

    PubMed

    Yee, Thomas W

    2004-07-30

    One of the most popular methods for quantile regression is the LMS method of Cole and Green. The method naturally falls within a penalized likelihood framework, and consequently allows for considerable flexibility because all three parameters may be modelled by cubic smoothing splines. The model is also very understandable: for a given value of the covariate, the LMS method applies a Box-Cox transformation to the response in order to transform it to standard normality; to obtain the quantiles, an inverse Box-Cox transformation is applied to the quantiles of the standard normal distribution. The purposes of this article are three-fold. Firstly, LMS quantile regression is presented within the framework of the class of vector generalized additive models. This confers a number of advantages such as a unifying theory and estimation process. Secondly, a new LMS method based on the Yeo-Johnson transformation is proposed, which has the advantage that the response is not restricted to be positive. Lastly, this paper describes a software implementation of three LMS quantile regression methods in the S language. This includes the LMS-Yeo-Johnson method, which is estimated efficiently by a new numerical integration scheme. The LMS-Yeo-Johnson method is illustrated by way of a large cross-sectional data set from a New Zealand working population. Copyright 2004 John Wiley & Sons, Ltd.
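
    The inverse Box-Cox step described in this abstract can be sketched as follows. This is a minimal illustration of the standard LMS quantile formula, not the paper's S implementation; the function name and the parameter values are hypothetical, and in practice the three parameters would come from fitted smooth functions of the covariate.

```python
from math import exp
from statistics import NormalDist


def lms_quantile(p, lam, mu, sigma):
    """Quantile at probability p from LMS parameters at one covariate value.

    lam is the Box-Cox power (L), mu the median (M), and sigma the
    approximate coefficient of variation (S).
    """
    z = NormalDist().inv_cdf(p)  # quantile of the standard normal distribution
    if abs(lam) < 1e-12:
        # Limiting case of the Box-Cox transformation as lam -> 0 (log scale).
        return mu * exp(sigma * z)
    # Inverse Box-Cox transformation of the standard normal quantile.
    return mu * (1.0 + lam * sigma * z) ** (1.0 / lam)


# Hypothetical parameter values, chosen only for illustration.
quantiles = {
    p: lms_quantile(p, lam=-0.5, mu=20.0, sigma=0.1) for p in (0.05, 0.5, 0.95)
}
```

    With these assumed values the 50th percentile equals the median parameter, and the outer quantiles are asymmetric around it because the power parameter is not 1.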

  17. Statistical downscaling modeling with quantile regression using lasso to estimate extreme rainfall

    NASA Astrophysics Data System (ADS)

    Santri, Dewi; Wigena, Aji Hamim; Djuraidah, Anik

    2016-02-01

    Rainfall is one of the climatic elements with high variability, and it has many negative impacts, especially extreme rainfall. Therefore, methods are required to minimize the damage that may occur. So far, global circulation models (GCMs) are the best method to forecast global climate changes, including extreme rainfall. Statistical downscaling (SD) is a technique to develop the relationship between GCM output as global-scale independent variables and rainfall as a local-scale response variable. Using GCM output is difficult when it is assessed against observations because the output is high-dimensional and its variables are multicollinear. The common methods used to handle this problem are principal components analysis (PCA) and partial least squares regression. A newer method that can be used is the lasso. The lasso has the advantage of simultaneously controlling the variance of the fitted coefficients and performing automatic variable selection. Quantile regression is a method that can be used to detect extreme rainfall at both the dry and wet extremes. The objective of this study is to model SD using quantile regression with the lasso to predict extreme rainfall in Indramayu. The results showed that extreme rainfall (extreme wet in January, February and December) in Indramayu could be predicted properly by the model at the 90th quantile.
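
    The combination of quantile regression and the lasso named in this abstract amounts to minimizing a check (pinball) loss plus an L1 penalty. The sketch below only evaluates that objective; it is not the authors' code, the function names are hypothetical, and leaving the intercept unpenalized is an assumption.

```python
def pinball_loss(residual, tau):
    """Check (pinball) loss of quantile regression for a single residual."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def lasso_quantile_objective(beta, intercept, X, y, tau, lam):
    """Mean pinball loss plus an L1 penalty on the slopes.

    Assumption: the intercept is left unpenalized.
    """
    total = 0.0
    for row, target in zip(X, y):
        fitted = intercept + sum(b * x for b, x in zip(beta, row))
        total += pinball_loss(target - fitted, tau)
    return total / len(y) + lam * sum(abs(b) for b in beta)


# Toy data (hypothetical): a perfect fit leaves only the penalty term.
toy_value = lasso_quantile_objective(
    beta=[1.0], intercept=0.0, X=[[1.0], [2.0]], y=[1.0, 2.0], tau=0.9, lam=0.5
)
```

    At tau = 0.9 an under-prediction costs nine times as much as an over-prediction of the same size, which is what pulls the fit toward the wet extreme.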

  18. Variable selection in near-infrared spectroscopy: benchmarking of feature selection methods on biodiesel data.

    PubMed

    Balabin, Roman M; Smirnov, Sergey V

    2011-04-29

    During the past several years, near-infrared (near-IR/NIR) spectroscopy has increasingly been adopted as an analytical tool in various fields from petroleum to biomedical sectors. The NIR spectrum (above 4000 cm(-1)) of a sample is typically measured by modern instruments at a few hundred wavelengths. Recently, considerable effort has been directed towards developing procedures to identify variables (wavelengths) that contribute useful information. Variable selection (VS) or feature selection, also called frequency selection or wavelength selection, is a critical step in data analysis for vibrational spectroscopy (infrared, Raman, or NIRS). In this paper, we compare the performance of 16 different feature selection methods for the prediction of properties of biodiesel fuel, including density, viscosity, methanol content, and water concentration. The feature selection algorithms tested include stepwise multiple linear regression (MLR-step), interval partial least squares regression (iPLS), backward iPLS (BiPLS), forward iPLS (FiPLS), moving window partial least squares regression (MWPLS), (modified) changeable size moving window partial least squares (CSMWPLS/MCSMWPLSR), searching combination moving window partial least squares (SCMWPLS), successive projections algorithm (SPA), uninformative variable elimination (UVE, including UVE-SPA), simulated annealing (SA), back-propagation artificial neural networks (BP-ANN), Kohonen artificial neural network (K-ANN), and genetic algorithms (GAs, including GA-iPLS). Two linear techniques for calibration model building, namely multiple linear regression (MLR) and partial least squares regression/projection to latent structures (PLS/PLSR), are used for the evaluation of biofuel properties. A comparison with a non-linear calibration model, artificial neural networks (ANN-MLP), is also provided. Discussion of gasoline, ethanol-gasoline (bioethanol), and diesel fuel data is presented. The results of applying other spectroscopic techniques, such as Raman, ultraviolet-visible (UV-vis), or nuclear magnetic resonance (NMR) spectroscopies, can be greatly improved by an appropriate feature selection choice. Copyright © 2011 Elsevier B.V. All rights reserved.

  19. The effect of service satisfaction and spiritual well-being on the quality of life of patients with schizophrenia.

    PubMed

    Lanfredi, Mariangela; Candini, Valentina; Buizza, Chiara; Ferrari, Clarissa; Boero, Maria E; Giobbio, Gian M; Goldschmidt, Nicoletta; Greppo, Stefania; Iozzino, Laura; Maggi, Paolo; Melegari, Anna; Pasqualetti, Patrizio; Rossi, Giuseppe; de Girolamo, Giovanni

    2014-05-15

    Quality of life (QOL) has been considered an important outcome measure in psychiatric research and determinants of QOL have been widely investigated. We aimed to detect predictors of QOL at baseline and to test the longitudinal interrelations of the baseline predictors with QOL scores at a 1-year follow-up in a sample of patients living in Residential Facilities (RFs). Logistic regression models were adopted to evaluate the association between WHOQoL-Bref scores and potential determinants of QOL. In addition, all variables significantly associated with QOL domains in the final logistic regression model were included in a Structural Equation Modeling (SEM) analysis. We included 139 patients with a diagnosis of schizophrenia spectrum. In the final logistic regression model, level of activity, social support, age, service satisfaction, spiritual well-being and symptoms' severity were identified as predictors of QOL scores at baseline. Longitudinal analyses carried out by SEM showed that 40% of QOL follow-up variability was explained by QOL at baseline, and significant indirect effects toward QOL at follow-up were found for satisfaction with services and for social support. Rehabilitation plans for people with schizophrenia living in RFs should also consider mediators of change in subjective QOL such as satisfaction with mental health services. Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.

  20. Correlates of motivation to change in pathological gamblers completing cognitive-behavioral group therapy.

    PubMed

    Gómez-Peña, Mónica; Penelo, Eva; Granero, Roser; Fernández-Aranda, Fernando; Alvarez-Moya, Eva; Santamaría, Juan José; Moragas, Laura; Neus Aymamí, Maria; Gunnard, Katarina; Menchón, José M; Jimenez-Murcia, Susana

    2012-07-01

    The present study analyzes the association between the motivation to change and the cognitive-behavioral group intervention, in terms of dropouts and relapses, in a sample of male pathological gamblers. The specific objectives were as follows: (a) to estimate the predictive value of baseline University of Rhode Island Change Assessment scale (URICA) scores (i.e., at the start of the study) as regards the risk of relapse and dropout during treatment and (b) to assess the incremental predictive ability of URICA scores, as regards the mean change produced in the clinical status of patients between the start and finish of treatment. The relationship between the URICA and the response to treatment was analyzed by means of a pre-post design applied to a sample of 191 patients who were consecutively receiving cognitive-behavioral group therapy. The statistical analysis included logistic regression models and hierarchical multiple linear regression models. The discriminative ability of the models including the four URICA scores regarding the likelihood of relapse and dropout was acceptable (area under the receiver operating characteristic curve: .73 and .71, respectively). No significant predictive ability was found as regards the differences between baseline and posttreatment scores (changes in R(2) below 5% in the multiple regression models). The availability of useful measures of motivation to change would enable treatment outcomes to be optimized through the application of specific therapeutic interventions. © 2012 Wiley Periodicals, Inc.

  1. 77 FR 3121 - Program Integrity: Gainful Employment-Debt Measures; Correction

    Federal Register 2010, 2011, 2012, 2013, 2014

    2012-01-23

    ...On June 13, 2011, the Secretary of Education (Secretary) published a notice of final regulations in the Federal Register for Program Integrity: Gainful Employment--Debt Measures (Gainful Employment--Debt Measures) (76 FR 34386). In the preamble of the final regulations, we used the wrong data to calculate the percent of total variance in institutions' repayment rates that may be explained by race/ethnicity. Our intent was to use the data that included all minority students per institution. However, we mistakenly used the data for a subset of minority students per institution. We have now recalculated the total variance using the data that includes all minority students. Through this document, we correct, in the preamble of the Gainful Employment--Debt Measures final regulations, the errors resulting from this misapplication. We do not change the regression analysis model itself; we are using the same model with the appropriate data. Through this notice we also correct, in the preamble of the Gainful Employment--Debt Measures final regulations, our description of one component of the regression analysis. The preamble referred to use of an institutional variable measuring acceptance rates. This description was incorrect; in fact we used an institutional variable measuring retention rates. Correcting this language does not change the regression analysis model itself or the variance explained by the model. The text of the final regulations remains unchanged.

  2. Load estimator (LOADEST): a FORTRAN program for estimating constituent loads in streams and rivers

    USGS Publications Warehouse

    Runkel, Robert L.; Crawford, Charles G.; Cohn, Timothy A.

    2004-01-01

    LOAD ESTimator (LOADEST) is a FORTRAN program for estimating constituent loads in streams and rivers. Given a time series of streamflow, additional data variables, and constituent concentration, LOADEST assists the user in developing a regression model for the estimation of constituent load (calibration). Explanatory variables within the regression model include various functions of streamflow, decimal time, and additional user-specified data variables. The formulated regression model then is used to estimate loads over a user-specified time interval (estimation). Mean load estimates, standard errors, and 95 percent confidence intervals are developed on a monthly and(or) seasonal basis. The calibration and estimation procedures within LOADEST are based on three statistical estimation methods. The first two methods, Adjusted Maximum Likelihood Estimation (AMLE) and Maximum Likelihood Estimation (MLE), are appropriate when the calibration model errors (residuals) are normally distributed. Of the two, AMLE is the method of choice when the calibration data set (time series of streamflow, additional data variables, and concentration) contains censored data. The third method, Least Absolute Deviation (LAD), is an alternative to maximum likelihood estimation when the residuals are not normally distributed. LOADEST output includes diagnostic tests and warnings to assist the user in determining the appropriate estimation method and in interpreting the estimated loads. This report describes the development and application of LOADEST. Sections of the report describe estimation theory, input/output specifications, sample applications, and installation instructions.

  3. Predicting outcome in severe traumatic brain injury using a simple prognostic model.

    PubMed

    Sobuwa, Simpiwe; Hartzenberg, Henry Benjamin; Geduld, Heike; Uys, Corrie

    2014-06-17

    Several studies have made it possible to predict outcome in severe traumatic brain injury (TBI) making it beneficial as an aid for clinical decision-making in the emergency setting. However, reliable predictive models are lacking for resource-limited prehospital settings such as those in developing countries like South Africa. To develop a simple predictive model for severe TBI using clinical variables in a South African prehospital setting. All consecutive patients admitted at two level-one centres in Cape Town, South Africa, for severe TBI were included. A binary logistic regression model was used, which included three predictor variables: oxygen saturation (SpO₂), Glasgow Coma Scale (GCS) and pupil reactivity. The Glasgow Outcome Scale was used to assess outcome on hospital discharge. A total of 74.4% of the outcomes were correctly predicted by the logistic regression model. The model demonstrated SpO₂ (p=0.019), GCS (p=0.001) and pupil reactivity (p=0.002) as independently significant predictors of outcome in severe TBI. Odds ratios of a good outcome were 3.148 (SpO₂ ≥ 90%), 5.108 (GCS 6 - 8) and 4.405 (pupils bilaterally reactive). This model is potentially useful for effective predictions of outcome in severe TBI.
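
    The odds ratios reported in this abstract can be read on the log-odds scale of the logistic model. The arithmetic below is a sketch under two loudly stated assumptions that are not in the abstract: the model contains main effects only (no interaction terms), and the baseline odds value is invented purely to show the odds-to-probability conversion.

```python
from math import exp, log

# Odds ratios of a good outcome, as reported in the abstract above.
odds_ratios = {
    "SpO2 >= 90%": 3.148,
    "GCS 6-8": 5.108,
    "pupils bilaterally reactive": 4.405,
}

# A logistic regression coefficient is the natural log of its odds ratio.
coefficients = {name: log(value) for name, value in odds_ratios.items()}

# Assumption: main effects only, so coefficients add on the log-odds scale
# and the odds ratios multiply when all three favourable findings are present.
combined_odds_ratio = exp(sum(coefficients.values()))


def probability_from_odds(odds):
    """Convert odds to a probability."""
    return odds / (1.0 + odds)


# Hypothetical baseline odds (NOT from the abstract; the intercept is not reported).
baseline_odds = 0.1
probability_all_favourable = probability_from_odds(baseline_odds * combined_odds_ratio)
```

    Under these assumptions the combined odds ratio is roughly 71; the resulting probability depends entirely on the invented baseline and should not be read as a study result.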

  4. Factors associated with parasite dominance in fishes from Brazil.

    PubMed

    Amarante, Cristina Fernandes do; Tassinari, Wagner de Souza; Luque, Jose Luis; Pereira, Maria Julia Salim

    2016-06-14

    The present study used regression models to evaluate the existence of factors that may influence the numerical parasite dominance with an epidemiological approximation. A database including 3,746 fish specimens and their respective parasites was used to evaluate the relationship between parasite dominance and biotic characteristics inherent to the studied hosts and the parasite taxa. Multivariate, classical, and mixed effects linear regression models were fitted. The calculations were performed using R software (95% CI). In the fitting of the classical multiple linear regression model, freshwater and planktivorous fish species and body length, as well as the species of the taxa Trematoda, Monogenea, and Hirudinea, were associated with parasite dominance. However, the fitting of the mixed effects model showed that the body length of the host and the species of the taxa Nematoda, Trematoda, Monogenea, Hirudinea, and Crustacea were significantly associated with parasite dominance. Studies that consider specific biological aspects of the hosts and parasites should expand the knowledge regarding factors that influence the numerical dominance of fish in Brazil. The use of a mixed model shows, once again, the importance of the appropriate use of a model correlated with the characteristics of the data to obtain consistent results.

  5. Modified Regression Correlation Coefficient for Poisson Regression Model

    NASA Astrophysics Data System (ADS)

    Kaengthong, Nattacha; Domthong, Uthumporn

    2017-09-01

    This study focuses on indicators of the predictive power of the Generalized Linear Model (GLM), which are widely used but often have some restrictions. We are interested in the regression correlation coefficient for a Poisson regression model. This is a measure of predictive power, defined by the relationship between the dependent variable (Y) and the expected value of the dependent variable given the independent variables [E(Y|X)] for the Poisson regression model. The dependent variable is distributed as Poisson. The purpose of this research was to modify the regression correlation coefficient for the Poisson regression model. We also compare the proposed modified regression correlation coefficient with the traditional regression correlation coefficient in the case of two or more independent variables with multicollinearity among them. The results show that the proposed regression correlation coefficient is better than the traditional regression correlation coefficient in terms of bias and root mean square error (RMSE).

  6. Multiple balance tests improve the assessment of postural stability in subjects with Parkinson's disease

    PubMed Central

    Jacobs, J V; Horak, F B; Tran, V K; Nutt, J G

    2006-01-01

    Objectives Clinicians often base the implementation of therapies on the presence of postural instability in subjects with Parkinson's disease (PD). These decisions are frequently based on the pull test from the Unified Parkinson's Disease Rating Scale (UPDRS). We sought to determine whether combining the pull test, the one‐leg stance test, the functional reach test, and UPDRS items 27–29 (arise from chair, posture, and gait) predicts balance confidence and falling better than any test alone. Methods The study included 67 subjects with PD. Subjects performed the one‐leg stance test, the functional reach test, and the UPDRS motor exam. Subjects also responded to the Activities‐specific Balance Confidence (ABC) scale and reported how many times they fell during the previous year. Regression models determined the combination of tests that optimally predicted mean ABC scores or categorised fall frequency. Results When all tests were included in a stepwise linear regression, only gait (UPDRS item 29), the pull test (UPDRS item 30), and the one‐leg stance test, in combination, represented significant predictor variables for mean ABC scores (r2 = 0.51). A multinomial logistic regression model including the one‐leg stance test and gait represented the model with the fewest significant predictor variables that correctly identified the most subjects as fallers or non‐fallers (85% of subjects were correctly identified). Conclusions Multiple balance tests (including the one‐leg stance test, and the gait and pull test items of the UPDRS) that assess different types of postural stress provide an optimal assessment of postural stability in subjects with PD. PMID:16484639

  7. Bias correction by use of errors-in-variables regression models in studies with K-X-ray fluorescence bone lead measurements.

    PubMed

    Lamadrid-Figueroa, Héctor; Téllez-Rojo, Martha M; Angeles, Gustavo; Hernández-Ávila, Mauricio; Hu, Howard

    2011-01-01

    In-vivo measurement of bone lead by means of K-X-ray fluorescence (KXRF) is the preferred biological marker of chronic exposure to lead. Unfortunately, considerable measurement error associated with KXRF estimations can introduce bias in estimates of the effect of bone lead when this variable is included as the exposure in a regression model. Estimates of uncertainty reported by the KXRF instrument reflect the variance of the measurement error and, although they can be used to correct the measurement error bias, they are seldom used in epidemiological statistical analyses. Errors-in-variables regression (EIV) allows for correction of bias caused by measurement error in predictor variables, based on the knowledge of the reliability of such variables. The authors propose a way to obtain reliability coefficients for bone lead measurements from uncertainty data reported by the KXRF instrument and compare, by the use of Monte Carlo simulations, results obtained using EIV regression models vs. those obtained by the standard procedures. Results of the simulations show that Ordinary Least Squares (OLS) regression models provide severely biased estimates of effect, and that EIV provides nearly unbiased estimates. Although EIV effect estimates are more imprecise, their mean squared error is much smaller than that of OLS estimates. In conclusion, EIV is a better alternative than OLS to estimate the effect of bone lead when measured by KXRF. Copyright © 2010 Elsevier Inc. All rights reserved.
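
    The attenuation that this abstract attributes to OLS, and the reliability-based correction, can be illustrated with a small simulation. This is a sketch of the single-predictor case with classical measurement error and a known reliability, not the authors' Monte Carlo design; every numeric setting below is an assumption chosen for illustration.

```python
import random


def ols_slope(x, y):
    """Ordinary least squares slope of y on a single predictor x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 20000
true_slope = 2.0

# True exposure with variance 1, measured with independent error of variance 1,
# so the reliability is var(true) / var(measured) = 1 / (1 + 1) = 0.5.
reliability = 0.5
true_x = [rng.gauss(0.0, 1.0) for _ in range(n)]
measured_x = [x + rng.gauss(0.0, 1.0) for x in true_x]
y = [true_slope * x + rng.gauss(0.0, 0.5) for x in true_x]

naive_slope = ols_slope(measured_x, y)        # attenuated toward zero
corrected_slope = naive_slope / reliability   # single-predictor EIV correction
```

    Dividing by the reliability removes the bias but also scales up the sampling error, which is the precision trade-off the abstract mentions.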

  8. Improving power and robustness for detecting genetic association with extreme-value sampling design.

    PubMed

    Chen, Hua Yun; Li, Mingyao

    2011-12-01

    Extreme-value sampling design that samples subjects with extremely large or small quantitative trait values is commonly used in genetic association studies. Samples in such designs are often treated as "cases" and "controls" and analyzed using logistic regression. Such a case-control analysis ignores the potential dose-response relationship between the quantitative trait and the underlying trait locus and thus may lead to loss of power in detecting genetic association. An alternative approach to analyzing such data is to model the dose-response relationship by a linear regression model. However, parameter estimation from this model can be biased, which may lead to inflated type I errors. We propose a robust and efficient approach that takes into consideration both the biased sampling design and the potential dose-response relationship. Extensive simulations demonstrate that the proposed method is more powerful than the traditional logistic regression analysis and is more robust than the linear regression analysis. We applied our method to the analysis of a candidate gene association study on high-density lipoprotein cholesterol (HDL-C) which includes study subjects with extremely high or low HDL-C levels. Using our method, we identified several SNPs showing stronger evidence of association with HDL-C than the traditional case-control logistic regression analysis. Our results suggest that it is important to appropriately model the quantitative traits and to adjust for the biased sampling when a dose-response relationship exists in extreme-value sampling designs. © 2011 Wiley Periodicals, Inc.

  9. Climate Prediction Center - Seasonal Outlook

    Science.gov Websites

    SEASONAL CLIMATE VARIABILITY, INCLUDING ENSO, SOIL MOISTURE, AND VARIOUS STATE-OF-THE-ART DYNAMICAL MODEL ACROSS PARTS OF THE EAST-CENTRAL CONUS CENTERED ON THE MISSISSIPPI RIVER. THIS IS DUE TO VERY HIGH SOIL TRENDS, NEGATIVE SOIL MOISTURE ANOMALIES, LAGGED ENSO REGRESSIONS, AND DYNAMICAL MODEL GUIDANCE ARE ALL

  10. Bias-motivated bullying and psychosocial problems: implications for HIV risk behaviors among young men who have sex with men.

    PubMed

    Li, Michael Jonathan; Distefano, Anthony; Mouttapa, Michele; Gill, Jasmeet K

    2014-02-01

    The present study aimed to determine whether the experience of bias-motivated bullying was associated with behaviors known to increase the risk of HIV infection among young men who have sex with men (YMSM) aged 18-29, and to assess whether the psychosocial problems moderated this relationship. Using an Internet-based direct marketing approach in sampling, we recruited 545 YMSM residing in the USA to complete an online questionnaire. Multiple linear regression analyses tested three regression models where we controlled for sociodemographics. The first model indicated that bullying during high school was associated with unprotected receptive anal intercourse within the past 12 months, while the second model indicated that bullying after high school was associated with engaging in anal intercourse while under the influence of drugs or alcohol in the past 12 months. In the final regression model, our composite measure of HIV risk behavior was found to be associated with lifetime verbal harassment. None of the psychosocial problems measured in this study - depression, low self-esteem, and internalized homonegativity - moderated any of the associations between bias-motivated bullying victimization and HIV risk behaviors in our regression models. Still, these findings provide novel evidence that bullying prevention programs in schools and communities should be included in comprehensive approaches to HIV prevention among YMSM.

  11. Determinants of The Grade A Embryos in Infertile Women; Zero-Inflated Regression Model.

    PubMed

    Almasi-Hashiani, Amir; Ghaheri, Azadeh; Omani Samani, Reza

    2017-10-01

    In assisted reproductive technology, it is important to choose high quality embryos for embryo transfer. The aim of the present study was to determine the grade A embryo count and factors related to it in infertile women. This historical cohort study included 996 infertile women. The main outcome was the number of grade A embryos. Zero-Inflated Poisson (ZIP) regression and Zero-Inflated Negative Binomial (ZINB) regression were used to model the count data as it contained excessive zeros. Stata software, version 13 (Stata Corp, College Station, TX, USA) was used for all statistical analyses. After adjusting for potential confounders, results from the ZINB model show that each unit increase in the number of 2 pronuclear (2PN) zygotes multiplies the expected grade A embryo count by an incidence rate ratio of 1.45 (95% confidence interval (CI): 1.23-1.69, P=0.001), and each unit increase in the cleavage day multiplies it by 0.35 (95% CI: 0.20-0.61, P=0.001). There is a significant association between both the number of 2PN zygotes and cleavage day with the number of grade A embryos in both ZINB and ZIP regression models. The estimated coefficients are more plausible than values found in earlier studies using less relevant models. Copyright© by Royan Institute. All rights reserved.
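
    The zero-inflated Poisson model named in this abstract mixes a point mass at zero with an ordinary Poisson count. The sketch below writes out that probability mass function and shows how an incidence rate ratio scales the mean of the count part; the function names and the mixing and rate values are hypothetical, and only the 1.45 ratio is taken from the abstract.

```python
from math import exp, factorial


def zip_pmf(k, pi_zero, lam):
    """P(Y = k) under a zero-inflated Poisson model.

    pi_zero is the probability of a structural zero and lam is the mean
    of the Poisson count component.
    """
    poisson = exp(-lam) * lam ** k / factorial(k)
    if k == 0:
        return pi_zero + (1.0 - pi_zero) * poisson
    return (1.0 - pi_zero) * poisson


def zip_mean(pi_zero, lam):
    """Expected count under the zero-inflated Poisson model."""
    return (1.0 - pi_zero) * lam


# Hypothetical parameter values, for illustration only.
pi_zero, lam = 0.3, 2.0

# Incidence rate ratio reported in the abstract for one extra 2PN zygote.
irr = 1.45
lam_after = lam * irr  # mean of the count component after a one-unit increase
```

    The excess zeros show up as a probability at zero that is larger than a plain Poisson with the same rate would give.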

  12. Challenges Associated with Estimating Utility in Wet Age-Related Macular Degeneration: A Novel Regression Analysis to Capture the Bilateral Nature of the Disease.

    PubMed

    Hodgson, Robert; Reason, Timothy; Trueman, David; Wickstead, Rose; Kusel, Jeanette; Jasilek, Adam; Claxton, Lindsay; Taylor, Matthew; Pulikottil-Jacob, Ruth

    2017-10-01

    The estimation of utility values for the economic evaluation of therapies for wet age-related macular degeneration (AMD) is a particular challenge. Previous economic models in wet AMD have been criticized for failing to capture the bilateral nature of wet AMD by modelling visual acuity (VA) and utility values associated with the better-seeing eye only. Here we present a de novo regression analysis using generalized estimating equations (GEE) applied to a previous dataset of time trade-off (TTO)-derived utility values from a sample of the UK population that wore contact lenses to simulate visual deterioration in wet AMD. This analysis allows utility values to be estimated as a function of VA in both the better-seeing eye (BSE) and worse-seeing eye (WSE). VAs in both the BSE and WSE were found to be statistically significant (p < 0.05) when regressed separately. When included without an interaction term, only the coefficient for VA in the BSE was significant (p = 0.04), but when an interaction term between VA in the BSE and WSE was included, only the constant term (mean TTO utility value) was significant, potentially a result of the collinearity between the VA of the two eyes. The lack of both formal model fit statistics from the GEE approach and theoretical knowledge to support the superiority of one model over another make it difficult to select the best model. Limitations of this analysis arise from the potential influence of collinearity between the VA of both eyes, and the use of contact lenses to reflect VA states to obtain the original dataset. Whilst further research is required to elicit more accurate utility values for wet AMD, this novel regression analysis provides a possible source of utility values to allow future economic models to capture the quality of life impact of changes in VA in both eyes. Novartis Pharmaceuticals UK Limited.

  13. Spatially Explicit Estimates of Suspended Sediment and Bedload Transport Rates for Western Oregon and Northwestern California

    NASA Astrophysics Data System (ADS)

    O'Connor, J. E.; Wise, D. R.; Mangano, J.; Jones, K.

    2015-12-01

    Empirical analyses of suspended sediment and bedload transport give estimates of sediment flux for western Oregon and northwestern California. The estimates of both bedload and suspended load are from regression models relating measured annual sediment yield to geologic, physiographic, and climatic properties of contributing basins. The best models include generalized geology and either slope or precipitation. The best-fit suspended-sediment model is based on basin geology, precipitation, and area of recent wildfire. It explains 65% of the variance for 68 suspended sediment measurement sites within the model area. Predicted suspended sediment yields range from no yield from the High Cascades geologic province to 200 tonnes/km²-yr in the northern Oregon Coast Range and 1000 tonnes/km²-yr in recently burned areas of the northern Klamath terrane. Bed-material yield is similarly estimated from a regression model based on 22 sites of measured bed-material transport, mostly from reservoir accumulation analyses but also from several bedload measurement programs. The resulting best-fit regression is based on basin slope and the presence/absence of the Klamath geologic terrane. For the Klamath terrane, bed-material yield is twice that of the other geologic provinces. This model explains more than 80% of the variance of the better-quality measurements. Predicted bed-material yields range up to 350 tonnes/km²-yr in steep areas of the Klamath terrane. Applying these regressions to small individual watersheds (mean size: 66 km² for bed-material, 3 km² for suspended sediment) and cumulating totals down the hydrologic network (but also decreasing the bed-material flux by experimentally determined attrition rates) gives spatially explicit estimates of both bed-material and suspended sediment flux. This enables assessment of several management issues, including the effects of dams on bedload transport, instream gravel mining, habitat formation processes, and water quality. The combined fluxes can also be compared to long-term rock uplift and cosmogenically determined landscape erosion rates.

  14. Predictors of the number of under-five malnourished children in Bangladesh: application of the generalized poisson regression model

    PubMed Central

    2013-01-01

    Background Malnutrition is one of the principal causes of child mortality in developing countries including Bangladesh. To our knowledge, most of the available studies that addressed the issue of malnutrition among under-five children considered categorical (dichotomous/polychotomous) outcome variables and applied logistic regression (binary/multinomial) to find their predictors. In this study, the malnutrition variable (i.e. outcome) is defined as the number of under-five malnourished children in a family, which is a non-negative count variable. The purposes of the study are (i) to demonstrate the applicability of the generalized Poisson regression (GPR) model as an alternative to other statistical methods and (ii) to find some predictors of this outcome variable. Methods The data are extracted from the Bangladesh Demographic and Health Survey (BDHS) 2007. Briefly, this survey employs a nationally representative sample which is based on a two-stage stratified sample of households. A total of 4,460 under-five children is analysed using various statistical techniques, namely the Chi-square test and the GPR model. Results The GPR model (as compared to the standard Poisson regression and negative Binomial regression) is found to be justified to study the above-mentioned outcome variable because of its under-dispersion (variance < mean) property. Our study also identifies several significant predictors of the outcome variable, namely mother’s education, father’s education, wealth index, sanitation status, source of drinking water, and total number of children ever born to a woman. Conclusions Consistencies of our findings in light of many other studies suggest that the GPR model is an ideal alternative to other statistical models to analyse the number of under-five malnourished children in a family. Strategies based on significant predictors may improve the nutritional status of children in Bangladesh. PMID:23297699
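
The under-dispersion criterion cited in the Results (variance < mean) can be checked directly on count data before choosing among Poisson-type models. The following is a minimal Python sketch; the function name `dispersion_index` and the example counts are hypothetical and are not taken from the BDHS data.

```python
from statistics import mean, variance

def dispersion_index(counts):
    """Return sample variance divided by sample mean for a list of counts.

    A value near 1 is consistent with a Poisson model, a value below 1
    suggests under-dispersion, and a value above 1 suggests over-dispersion.
    """
    m = mean(counts)
    if m == 0:
        raise ValueError("dispersion index is undefined when the mean is zero")
    return variance(counts) / m

# Hypothetical numbers of malnourished children in ten families.
example_counts = [0, 1, 1, 0, 1, 2, 1, 0, 1, 1]
index = dispersion_index(example_counts)
print(f"dispersion index = {index:.2f}")
print("under-dispersed" if index < 1 else "not under-dispersed")
```

An index clearly below 1, as in this toy sample, is the situation in which the abstract argues for the generalized Poisson model over the standard Poisson and negative binomial models.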

  15. Tutorial on Biostatistics: Linear Regression Analysis of Continuous Correlated Eye Data

    PubMed Central

    Ying, Gui-shuang; Maguire, Maureen G; Glynn, Robert; Rosner, Bernard

    2017-01-01

    Purpose To describe and demonstrate appropriate linear regression methods for analyzing correlated continuous eye data. Methods We describe several approaches to regression analysis involving both eyes, including mixed effects and marginal models under various covariance structures to account for inter-eye correlation. We demonstrate, with SAS statistical software, applications in a study comparing baseline refractive error between one eye with choroidal neovascularization (CNV) and the unaffected fellow eye, and in a study determining factors associated with visual field data in the elderly. Results When refractive error from both eyes was analyzed with standard linear regression without accounting for inter-eye correlation (adjusting for demographic and ocular covariates), the difference between eyes with CNV and fellow eyes was 0.15 diopters (D; 95% confidence interval, CI −0.03 to 0.32D, P=0.10). Using a mixed effects model or a marginal model, the estimated difference was the same but with narrower 95% CI (0.01 to 0.28D, P=0.03). Standard regression for visual field data from both eyes provided biased estimates of standard error (generally underestimated) and smaller P-values, while analysis of the worse eye provided larger P-values than mixed effects models and marginal models. Conclusion In research involving both eyes, ignoring inter-eye correlation can lead to invalid inferences. Analysis using only right or left eyes is valid, but decreases power. Worse-eye analysis can provide less power and biased estimates of effect. Mixed effects or marginal models using the eye as the unit of analysis should be used to appropriately account for inter-eye correlation and maximize power and precision. PMID:28102741

  16. Analytic model for the long-term evolution of circular Earth satellite orbits including lunar node regression

    NASA Astrophysics Data System (ADS)

    Zhu, Ting-Lei; Zhao, Chang-Yin; Zhang, Ming-Jiang

    2017-04-01

    This paper aims to obtain an analytic approximation to the evolution of circular orbits governed by the Earth's J2 and the luni-solar gravitational perturbations. Assuming that the lunar orbital plane coincides with the ecliptic plane, Allan and Cook (Proc. R. Soc. A, Math. Phys. Eng. Sci. 280(1380):97, 1964) derived an analytic solution to the orbital plane evolution of circular orbits. Using their result as an intermediate solution, we establish an approximate analytic model with lunar orbital inclination and its node regression taken into account. Finally, an approximate analytic expression is derived, which is accurate compared to the numerical results except for the resonant cases when the period of the reference orbit approximately equals an integer multiple (especially 1 or 2 times) of the lunar node regression period.

  17. Random regression analyses using B-spline functions to model growth of Nellore cattle.

    PubMed

    Boligon, A A; Mercadante, M E Z; Lôbo, R B; Baldi, F; Albuquerque, L G

    2012-02-01

    The objective of this study was to estimate (co)variance components using random regression on B-spline functions to weight records obtained from birth to adulthood. A total of 82 064 weight records of 8145 females obtained from the data bank of the Nellore Breeding Program (PMGRN/Nellore Brazil) which started in 1987, were used. The models included direct additive and maternal genetic effects and animal and maternal permanent environmental effects as random. Contemporary group and dam age at calving (linear and quadratic effect) were included as fixed effects, and orthogonal Legendre polynomials of age (cubic regression) were considered as random covariate. The random effects were modeled using B-spline functions considering linear, quadratic and cubic polynomials for each individual segment. Residual variances were grouped in five age classes. Direct additive genetic and animal permanent environmental effects were modeled using up to seven knots (six segments). A single segment with two knots at the end points of the curve was used for the estimation of maternal genetic and maternal permanent environmental effects. A total of 15 models were studied, with the number of parameters ranging from 17 to 81. The models that used B-splines were compared with multi-trait analyses with nine weight traits and to a random regression model that used orthogonal Legendre polynomials. A model fitting quadratic B-splines, with four knots or three segments for direct additive genetic effect and animal permanent environmental effect and two knots for maternal additive genetic effect and maternal permanent environmental effect, was the most appropriate and parsimonious model to describe the covariance structure of the data. Selection for higher weight, such as at young ages, should be performed taking into account an increase in mature cow weight. 
Particularly, this is important in most of Nellore beef cattle production systems, where the cow herd is maintained on range conditions. There is limited modification of the growth curve of Nellore cattle with respect to the aim of selecting them for rapid growth at young ages while maintaining constant adult weight.

  18. ORACLE INEQUALITIES FOR THE LASSO IN THE COX MODEL

    PubMed Central

    Huang, Jian; Sun, Tingni; Ying, Zhiliang; Yu, Yi; Zhang, Cun-Hui

    2013-01-01

    We study the absolute penalized maximum partial likelihood estimator in sparse, high-dimensional Cox proportional hazards regression models where the number of time-dependent covariates can be larger than the sample size. We establish oracle inequalities based on natural extensions of the compatibility and cone invertibility factors of the Hessian matrix at the true regression coefficients. Similar results based on an extension of the restricted eigenvalue can be also proved by our method. However, the presented oracle inequalities are sharper since the compatibility and cone invertibility factors are always greater than the corresponding restricted eigenvalue. In the Cox regression model, the Hessian matrix is based on time-dependent covariates in censored risk sets, so that the compatibility and cone invertibility factors, and the restricted eigenvalue as well, are random variables even when they are evaluated for the Hessian at the true regression coefficients. Under mild conditions, we prove that these quantities are bounded from below by positive constants for time-dependent covariates, including cases where the number of covariates is of greater order than the sample size. Consequently, the compatibility and cone invertibility factors can be treated as positive constants in our oracle inequalities. PMID:24086091

  19. ORACLE INEQUALITIES FOR THE LASSO IN THE COX MODEL.

    PubMed

    Huang, Jian; Sun, Tingni; Ying, Zhiliang; Yu, Yi; Zhang, Cun-Hui

    2013-06-01

    We study the absolute penalized maximum partial likelihood estimator in sparse, high-dimensional Cox proportional hazards regression models where the number of time-dependent covariates can be larger than the sample size. We establish oracle inequalities based on natural extensions of the compatibility and cone invertibility factors of the Hessian matrix at the true regression coefficients. Similar results based on an extension of the restricted eigenvalue can be also proved by our method. However, the presented oracle inequalities are sharper since the compatibility and cone invertibility factors are always greater than the corresponding restricted eigenvalue. In the Cox regression model, the Hessian matrix is based on time-dependent covariates in censored risk sets, so that the compatibility and cone invertibility factors, and the restricted eigenvalue as well, are random variables even when they are evaluated for the Hessian at the true regression coefficients. Under mild conditions, we prove that these quantities are bounded from below by positive constants for time-dependent covariates, including cases where the number of covariates is of greater order than the sample size. Consequently, the compatibility and cone invertibility factors can be treated as positive constants in our oracle inequalities.

  20. Applicability of linear regression equation for prediction of chlorophyll content in rice leaves

    NASA Astrophysics Data System (ADS)

    Li, Yunmei

    2005-09-01

    A modeling approach is used to assess the applicability of derived equations that are capable of predicting the chlorophyll content of rice leaves at a given view direction. Two radiative transfer models, the PROSPECT model operating at leaf level and the FCR model operating at canopy level, are used in the study. The study consists of three steps: (1) Simulation of bidirectional reflectance from canopies with different leaf chlorophyll contents, leaf-area-index (LAI) values and understorey configurations; (2) Establishment of prediction relations for chlorophyll content by stepwise regression; and (3) Assessment of the applicability of these relations. The results show that the accuracy of prediction is affected by the understorey configuration, but that the accuracy tends to improve greatly with increasing LAI.

  1. Prediction of hearing outcomes by multiple regression analysis in patients with idiopathic sudden sensorineural hearing loss.

    PubMed

    Suzuki, Hideaki; Tabata, Takahisa; Koizumi, Hiroki; Hohchi, Nobusuke; Takeuchi, Shoko; Kitamura, Takuro; Fujino, Yoshihisa; Ohbuchi, Toyoaki

    2014-12-01

    This study aimed to create a multiple regression model for predicting hearing outcomes of idiopathic sudden sensorineural hearing loss (ISSNHL). The participants were 205 consecutive patients (205 ears) with ISSNHL (hearing level ≥ 40 dB, interval between onset and treatment ≤ 30 days). They received systemic steroid administration combined with intratympanic steroid injection. Data were examined by simple and multiple regression analyses. Three hearing indices (percentage hearing improvement, hearing gain, and posttreatment hearing level [HLpost]) and 7 prognostic factors (age, days from onset to treatment, initial hearing level, initial hearing level at low frequencies, initial hearing level at high frequencies, presence of vertigo, and contralateral hearing level) were included in the multiple regression analysis as dependent and explanatory variables, respectively. In the simple regression analysis, the percentage hearing improvement, hearing gain, and HLpost showed significant correlation with 2, 5, and 6 of the 7 prognostic factors, respectively. The multiple correlation coefficients were 0.396, 0.503, and 0.714 for the percentage hearing improvement, hearing gain, and HLpost, respectively. Predicted values of HLpost calculated by the multiple regression equation were reliable with 70% probability with a 40-dB-width prediction interval. Prediction of HLpost by the multiple regression model may be useful to estimate the hearing prognosis of ISSNHL. © The Author(s) 2014.

  2. Estimating interaction on an additive scale between continuous determinants in a logistic regression model.

    PubMed

    Knol, Mirjam J; van der Tweel, Ingeborg; Grobbee, Diederick E; Numans, Mattijs E; Geerlings, Mirjam I

    2007-10-01

    To determine the presence of interaction in epidemiologic research, typically a product term is added to the regression model. In linear regression, the regression coefficient of the product term reflects interaction as departure from additivity. However, in logistic regression it refers to interaction as departure from multiplicativity. Rothman has argued that interaction estimated as departure from additivity better reflects biologic interaction. So far, literature on estimating interaction on an additive scale using logistic regression only focused on dichotomous determinants. The objective of the present study was to provide the methods to estimate interaction between continuous determinants and to illustrate these methods with a clinical example. From the existing literature we derived the formulas to quantify interaction as departure from additivity between one continuous and one dichotomous determinant and between two continuous determinants using logistic regression. Bootstrapping was used to calculate the corresponding confidence intervals. To illustrate the theory with an empirical example, data from the Utrecht Health Project were used, with age and body mass index as risk factors for elevated diastolic blood pressure. The methods and formulas presented in this article are intended to assist epidemiologists in calculating interaction on an additive scale between two variables on a certain outcome. The proposed methods are included in a spreadsheet which is freely available at: http://www.juliuscenter.nl/additive-interaction.xls.
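
The distinction between departure from multiplicativity and departure from additivity can be illustrated with the commonly used relative excess risk due to interaction (RERI) for two dichotomous determinants. The sketch below is an illustration only: it uses the standard dichotomous form with odds ratios standing in for relative risks, it does not reproduce the continuous-determinant formulas derived in the article, and the function name and coefficient values are hypothetical.

```python
import math

def reri(b1, b2, b3):
    """Relative excess risk due to interaction for two dichotomous exposures.

    b1 and b2 are the logistic regression coefficients of the two exposures
    and b3 is the coefficient of their product term. Odds ratios are used as
    approximations of relative risks:
        RERI = OR11 - OR10 - OR01 + 1
    """
    or_10 = math.exp(b1)            # exposure 1 only
    or_01 = math.exp(b2)            # exposure 2 only
    or_11 = math.exp(b1 + b2 + b3)  # both exposures
    return or_11 - or_10 - or_01 + 1.0

# Hypothetical coefficients: odds ratios of 2 and 3 with no product term.
value = reri(math.log(2.0), math.log(3.0), 0.0)
print(f"RERI = {value:.2f}")
```

With a product-term coefficient of zero there is no departure from multiplicativity, yet the RERI in this toy case is positive, which is why the product term of a logistic model cannot be read as a measure of additive interaction.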

  3. Can dispersion modeling of air pollution be improved by land-use regression? An example from Stockholm, Sweden

    PubMed Central

    Korek, Michal; Johansson, Christer; Svensson, Nina; Lind, Tomas; Beelen, Rob; Hoek, Gerard; Pershagen, Göran; Bellander, Tom

    2017-01-01

    Both dispersion modeling (DM) and land-use regression modeling (LUR) are often used for assessment of long-term air pollution exposure in epidemiological studies, but seldom in combination. We developed a hybrid DM–LUR model using 93 biweekly observations of NOx at 31 sites in greater Stockholm (Sweden). The DM was based on spatially resolved topographic, physiographic and emission data, and hourly meteorological data from a diagnostic wind model. Other data were from land use, meteorology and routine monitoring of NOx. We built a linear regression model for NOx, using a stepwise forward selection of covariates. The resulting model predicted observed NOx (R2=0.89) better than the DM without covariates (R2=0.68, P-interaction <0.001) and with minimal apparent bias. The model included (in descending order of importance) DM, traffic intensity on the nearest street, population (number of inhabitants) within 100 m radius, global radiation (direct sunlight plus diffuse or scattered light) and urban contribution to NOx levels (routine urban NOx, less routine rural NOx). Our results indicate that there is a potential for improving estimates of air pollutant concentrations based on DM, by incorporating further spatial characteristics of the immediate surroundings, possibly accounting for imperfections in the emission data. PMID:27485990

  4. Can dispersion modeling of air pollution be improved by land-use regression? An example from Stockholm, Sweden.

    PubMed

    Korek, Michal; Johansson, Christer; Svensson, Nina; Lind, Tomas; Beelen, Rob; Hoek, Gerard; Pershagen, Göran; Bellander, Tom

    2017-11-01

    Both dispersion modeling (DM) and land-use regression modeling (LUR) are often used for assessment of long-term air pollution exposure in epidemiological studies, but seldom in combination. We developed a hybrid DM-LUR model using 93 biweekly observations of NOx at 31 sites in greater Stockholm (Sweden). The DM was based on spatially resolved topographic, physiographic and emission data, and hourly meteorological data from a diagnostic wind model. Other data were from land use, meteorology and routine monitoring of NOx. We built a linear regression model for NOx, using a stepwise forward selection of covariates. The resulting model predicted observed NOx (R2=0.89) better than the DM without covariates (R2=0.68, P-interaction <0.001) and with minimal apparent bias. The model included (in descending order of importance) DM, traffic intensity on the nearest street, population (number of inhabitants) within 100 m radius, global radiation (direct sunlight plus diffuse or scattered light) and urban contribution to NOx levels (routine urban NOx, less routine rural NOx). Our results indicate that there is a potential for improving estimates of air pollutant concentrations based on DM, by incorporating further spatial characteristics of the immediate surroundings, possibly accounting for imperfections in the emission data.

  5. The comparison of landslide ratio-based and general logistic regression landslide susceptibility models in the Chishan watershed after 2009 Typhoon Morakot

    NASA Astrophysics Data System (ADS)

    WU, Chunhung

    2015-04-01

    The research built the original logistic regression landslide susceptibility model (abbreviated as or-LRLSM) and the landslide ratio-based logistic regression landslide susceptibility model (abbreviated as lr-LRLSM), compared their performance, and explained the error sources of the two models. The research assumes that the performance of the logistic regression model can be better if the distribution of landslide ratio and weighted value of each variable is similar. Landslide ratio is the ratio of landslide area to total area in a specific area and is a useful index to evaluate the seriousness of landslide disasters in Taiwan. The research adopted the landslide inventory induced by 2009 Typhoon Morakot in the Chishan watershed, which was the most serious disaster event of the last decade in Taiwan. The research adopted the 20 m grid as the basic unit in building the LRLSM, and six variables, including elevation, slope, aspect, geological formation, accumulated rainfall, and bank erosion, were included in the two models. In building the or-LRLSM, the six variables were divided into continuous variables (elevation, slope, and accumulated rainfall) and categorical variables (aspect, geological formation and bank erosion), while in building the lr-LRLSM all variables were classified based on landslide ratio and treated as categorical variables. Because the total number of basic units in the Chishan watershed was too large to process with commercial software, the research used random sampling instead of all basic units. The research adopted equal proportions of landslide and not landslide units in the logistic regression analysis. The research drew 10 random samples and selected the group with the best Cox & Snell R2 value and Nagelkerke R2 value as the database for the following analysis. 
Based on the best result from 10 random sampling groups, the or-LRLSM (lr-LRLSM) is significant at the 1% level with Cox & Snell R2 = 0.190 (0.196) and Nagelkerke R2 = 0.253 (0.260). The unit with the landslide susceptibility value > 0.5 (≦ 0.5) will be classified as a predicted landslide unit (not landslide unit). The AUC, i.e. the area under the relative operating characteristic curve, of or-LRLSM in the Chishan watershed is 0.72, while that of lr-LRLSM is 0.77. Furthermore, the average correct ratio of lr-LRLSM (73.3%) is better than that of or-LRLSM (68.3%). The research analyzed in detail the error sources from the two models. In continuous variables, using the landslide ratio-based classification in building the lr-LRLSM can let the distribution of weighted value more similar to distribution of landslide ratio in the range of continuous variable than that in building the or-LRLSM. In categorical variables, the meaning of using the landslide ratio-based classification in building the lr-LRLSM is to gather the parameters with approximate landslide ratio together. The mean correct ratio in continuous variables (categorical variables) by using the lr-LRLSM is better than that in or-LRLSM by 0.6 ~ 2.6% (1.7% ~ 6.0%). Building the landslide susceptibility model by using landslide ratio-based classification is practical and of better performance than that by using the original logistic regression.
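
The landslide ratio, the 0.5 classification threshold and the correct ratio described above can be sketched in a few lines of Python. This is a minimal illustration under assumptions: the function names, the areas and the susceptibility values are hypothetical and are not data from the Chishan watershed.

```python
def landslide_ratio(landslide_area, total_area):
    """Ratio of landslide area to total area within a given zone."""
    if total_area <= 0:
        raise ValueError("total_area must be positive")
    return landslide_area / total_area

def classify(susceptibility, threshold=0.5):
    """Predict a landslide unit when the susceptibility value exceeds the threshold."""
    return susceptibility > threshold

def correct_ratio(susceptibilities, observed):
    """Fraction of units whose predicted class matches the observed class."""
    hits = sum(classify(s) == o for s, o in zip(susceptibilities, observed))
    return hits / len(observed)

# Hypothetical zone: 2 km2 of landslide area in an 8 km2 zone.
print(landslide_ratio(2.0, 8.0))

# Hypothetical susceptibility values and observed landslide flags for four grid units.
scores = [0.8, 0.3, 0.6, 0.5]
truth = [True, False, False, False]
print(correct_ratio(scores, truth))
```

Note that a unit with a susceptibility value of exactly 0.5 is classified as a not landslide unit, matching the rule stated in the abstract.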

  6. Random regression models on Legendre polynomials to estimate genetic parameters for weights from birth to adult age in Canchim cattle.

    PubMed

    Baldi, F; Albuquerque, L G; Alencar, M M

    2010-08-01

    The objective of this work was to estimate covariance functions for direct and maternal genetic effects, animal and maternal permanent environmental effects, and subsequently, to derive relevant genetic parameters for growth traits in Canchim cattle. Data comprised 49,011 weight records on 2435 females from birth to adult age. The model of analysis included fixed effects of contemporary groups (year and month of birth and at weighing) and age of dam as quadratic covariable. Mean trends were taken into account by a cubic regression on orthogonal polynomials of animal age. Residual variances were allowed to vary and were modelled by a step function with 1, 4 or 11 classes based on animal's age. The model fitting four classes of residual variances was the best. A total of 12 random regression models from second to seventh order were used to model direct and maternal genetic effects, animal and maternal permanent environmental effects. The model with direct and maternal genetic effects, animal and maternal permanent environmental effects fitted by quadratic, cubic, quintic and linear Legendre polynomials, respectively, was the most adequate to describe the covariance structure of the data. Estimates of direct and maternal heritability obtained by multi-trait (seven traits) and random regression models were very similar. Selection for higher weight at any age, especially after weaning, will produce an increase in mature cow weight. The possibility to modify the growth curve in Canchim cattle to obtain animals with rapid growth at early ages and moderate to low mature cow weight is limited.

  7. Fast function-on-scalar regression with penalized basis expansions.

    PubMed

    Reiss, Philip T; Huang, Lei; Mennes, Maarten

    2010-01-01

    Regression models for functional responses and scalar predictors are often fitted by means of basis functions, with quadratic roughness penalties applied to avoid overfitting. The fitting approach described by Ramsay and Silverman in the 1990s amounts to a penalized ordinary least squares (P-OLS) estimator of the coefficient functions. We recast this estimator as a generalized ridge regression estimator, and present a penalized generalized least squares (P-GLS) alternative. We describe algorithms by which both estimators can be implemented, with automatic selection of optimal smoothing parameters, in a more computationally efficient manner than has heretofore been available. We discuss pointwise confidence intervals for the coefficient functions, simultaneous inference by permutation tests, and model selection, including a novel notion of pointwise model selection. P-OLS and P-GLS are compared in a simulation study. Our methods are illustrated with an analysis of age effects in a functional magnetic resonance imaging data set, as well as a reanalysis of a now-classic Canadian weather data set. An R package implementing the methods is publicly available.
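
For orientation, a generic generalized ridge estimator with a quadratic penalty can be written as below. The symbols (response vector y, design matrix X, symmetric positive semidefinite penalty matrix P, smoothing parameter \lambda) are assumptions chosen for illustration and are not the notation of the paper, whose estimator acts on the basis coefficients of the coefficient functions.

```latex
\hat{\beta}_{\lambda}
  = \arg\min_{\beta} \left\{ \lVert y - X\beta \rVert^{2}
      + \lambda\, \beta^{\top} P \beta \right\}
  = \left( X^{\top} X + \lambda P \right)^{-1} X^{\top} y
```

The closed form holds whenever X^{\top} X + \lambda P is invertible; setting P to the identity matrix recovers ordinary ridge regression, and \lambda = 0 recovers ordinary least squares.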

  8. Additive hazards regression and partial likelihood estimation for ecological monitoring data across space.

    PubMed

    Lin, Feng-Chang; Zhu, Jun

    2012-01-01

    We develop continuous-time models for the analysis of environmental or ecological monitoring data such that subjects are observed at multiple monitoring time points across space. Of particular interest are additive hazards regression models where the baseline hazard function can take on flexible forms. We consider time-varying covariates and take into account spatial dependence via autoregression in space and time. We develop statistical inference for the regression coefficients via partial likelihood. Asymptotic properties, including consistency and asymptotic normality, are established for parameter estimates under suitable regularity conditions. Feasible algorithms utilizing existing statistical software packages are developed for computation. We also consider a simpler additive hazards model with homogeneous baseline hazard and develop hypothesis testing for homogeneity. A simulation study demonstrates that the statistical inference using partial likelihood has sound finite-sample properties and offers a viable alternative to maximum likelihood estimation. For illustration, we analyze data from an ecological study that monitors bark beetle colonization of red pines in a plantation of Wisconsin.

  9. On state-of-charge determination for lithium-ion batteries

    NASA Astrophysics Data System (ADS)

    Li, Zhe; Huang, Jun; Liaw, Bor Yann; Zhang, Jianbo

    2017-04-01

    Accurate estimation of state-of-charge (SOC) of a battery through its life remains challenging in battery research. Although improved precisions continue to be reported at times, almost all are based on regression methods empirically, while the accuracy is often not properly addressed. Here, a comprehensive review is set to address such issues, from fundamental principles that are supposed to define SOC to methodologies to estimate SOC for practical use. It covers topics from calibration, regression (including modeling methods) to validation in terms of precision and accuracy. At the end, we intend to answer the following questions: 1) can SOC estimation be self-adaptive without bias? 2) Why Ah-counting is a necessity in almost all battery-model-assisted regression methods? 3) How to establish a consistent framework of coupling in multi-physics battery models? 4) To assess the accuracy in SOC estimation, statistical methods should be employed to analyze factors that contribute to the uncertainty. We hope, through this proper discussion of the principles, accurate SOC estimation can be widely achieved.
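
Ah-counting (coulomb counting), mentioned in question 2, updates SOC by integrating the measured current over time and dividing by the cell capacity. The sketch below is a minimal illustration under stated assumptions: discharge current is taken as positive, the capacity is treated as constant, the integration is a simple rectangle rule, and the function name and numbers are hypothetical rather than taken from the review.

```python
def soc_by_ah_counting(soc_initial, currents_a, dt_s, capacity_ah):
    """Estimate state of charge by integrating current over time.

    soc_initial : starting SOC as a fraction between 0 and 1
    currents_a  : current samples in amperes (assumed positive on discharge)
    dt_s        : sampling interval in seconds
    capacity_ah : cell capacity in ampere-hours (assumed constant)
    """
    # Rectangle-rule integral of the current, converted from A*s to Ah.
    discharged_ah = sum(i * dt_s for i in currents_a) / 3600.0
    soc = soc_initial - discharged_ah / capacity_ah
    # Clamp to the physically meaningful range.
    return min(1.0, max(0.0, soc))

# Hypothetical 2 Ah cell discharged at 1 A for one hour, sampled once per second.
samples = [1.0] * 3600
print(soc_by_ah_counting(1.0, samples, 1.0, 2.0))
```

Because the estimate depends on the initial SOC and accumulates any current-sensor error, Ah-counting alone gives no way to correct bias, which is the kind of accuracy question the review raises.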

  10. Simple estimation procedures for regression analysis of interval-censored failure time data under the proportional hazards model.

    PubMed

    Sun, Jianguo; Feng, Yanqin; Zhao, Hui

    2015-01-01

    Interval-censored failure time data occur in many fields including epidemiological and medical studies as well as financial and sociological studies, and many authors have investigated their analysis (Sun, The statistical analysis of interval-censored failure time data, 2006; Zhang, Stat Modeling 9:321-343, 2009). In particular, a number of procedures have been developed for regression analysis of interval-censored data arising from the proportional hazards model (Finkelstein, Biometrics 42:845-854, 1986; Huang, Ann Stat 24:540-568, 1996; Pan, Biometrics 56:199-203, 2000). For most of these procedures, however, one drawback is that they involve estimation of both regression parameters and baseline cumulative hazard function. In this paper, we propose two simple estimation approaches that do not need estimation of the baseline cumulative hazard function. The asymptotic properties of the resulting estimates are given, and an extensive simulation study is conducted and indicates that they work well for practical situations.

  11. Application of nonlinear-regression methods to a ground-water flow model of the Albuquerque Basin, New Mexico

    USGS Publications Warehouse

    Tiedeman, C.R.; Kernodle, J.M.; McAda, D.P.

    1998-01-01

    This report documents the application of nonlinear-regression methods to a numerical model of ground-water flow in the Albuquerque Basin, New Mexico. In the Albuquerque Basin, ground water is the primary source for most water uses. Ground-water withdrawal has steadily increased since the 1940's, resulting in large declines in water levels in the Albuquerque area. A ground-water flow model was developed in 1994 and revised and updated in 1995 for the purpose of managing basin ground-water resources. In the work presented here, nonlinear-regression methods were applied to a modified version of the previous flow model. Goals of this work were to use regression methods to calibrate the model with each of six different configurations of the basin subsurface and to assess and compare optimal parameter estimates, model fit, and model error among the resulting calibrations. The Albuquerque Basin is one in a series of north trending structural basins within the Rio Grande Rift, a region of Cenozoic crustal extension. Mountains, uplifts, and fault zones bound the basin, and rock units within the basin include pre-Santa Fe Group deposits, Tertiary Santa Fe Group basin fill, and post-Santa Fe Group volcanics and sediments. The Santa Fe Group is greater than 14,000 feet (ft) thick in the central part of the basin. During deposition of the Santa Fe Group, crustal extension resulted in development of north trending normal faults with vertical displacements of as much as 30,000 ft. Ground-water flow in the Albuquerque Basin occurs primarily in the Santa Fe Group and post-Santa Fe Group deposits. Water flows between the ground-water system and surface-water bodies in the inner valley of the basin, where the Rio Grande, a network of interconnected canals and drains, and Cochiti Reservoir are located. 
Recharge to the ground-water flow system occurs as infiltration of precipitation along mountain fronts and infiltration of stream water along tributaries to the Rio Grande; subsurface flow from adjacent regions; irrigation and septic field seepage; and leakage through the Rio Grande, canal, and Cochiti Reservoir beds. Ground water is discharged from the basin by withdrawal; evapotranspiration; subsurface flow; and flow to the Rio Grande, canals, and drains. The transient, three-dimensional numerical model of ground-water flow to which nonlinear-regression methods were applied simulates flow in the Albuquerque Basin from 1900 to March 1995. Six different basin subsurface configurations are considered in the model. These configurations are designed to test the effects of (1) varying the simulated basin thickness, (2) including a hypothesized hydrogeologic unit with large hydraulic conductivity in the western part of the basin (the west basin high-K zone), and (3) substantially lowering the simulated hydraulic conductivity of a fault in the western part of the basin (the low-K fault zone). The model with each of the subsurface configurations was calibrated using a nonlinear least-squares regression technique. The calibration data set includes 802 hydraulic-head measurements that provide broad spatial and temporal coverage of basin conditions, and one measurement of net flow from the Rio Grande and drains to the ground-water system in the Albuquerque area. Data are weighted on the basis of estimates of the standard deviations of measurement errors. The 10 to 12 parameters to which the calibration data as a whole are generally most sensitive were estimated by nonlinear regression, whereas the remaining model parameter values were specified. 
Results of model calibration indicate that the optimal parameter estimates as a whole are most reasonable in calibrations of the model with configurations 3 (which contains 1,600-ft-thick basin deposits and the west basin high-K zone), 4 (which contains 5,000-ft-thick basin de

  12. Prediction of Baseflow Index of Catchments using Machine Learning Algorithms

    NASA Astrophysics Data System (ADS)

    Yadav, B.; Hatfield, K.

    2017-12-01

    We present the results of eight machine learning techniques for predicting the baseflow index (BFI) of ungauged basins using a surrogate of catchment-scale climate and physiographic data. The tested algorithms include ordinary least squares, ridge regression, least absolute shrinkage and selection operator (lasso), elastic net, support vector machine, gradient boosted regression trees, random forests, and extremely randomized trees. Our work seeks to identify the dominant controls of BFI that can be readily obtained from ancillary geospatial databases and remote sensing measurements, such that the developed techniques can be extended to ungauged catchments. More than 800 gauged catchments spanning the continental United States were selected to develop the general methodology. The BFI calculation was based on the baseflow separated from the daily streamflow hydrograph using the HYSEP filter. The surrogate catchment attributes were compiled from multiple sources, including digital elevation models, soil, land use, and climate data, and other publicly available ancillary and geospatial data. 80% of the catchments were used to train the ML algorithms, and the remaining 20% were used as an independent test set to measure the generalization performance of the fitted models. A k-fold cross-validation using exhaustive grid search was used to fit the hyperparameters of each model. Initial model development was based on 19 independent variables, but after variable selection and feature ranking, we generated revised sparse models of BFI prediction that are based on only six catchment attributes. These key predictive variables, selected after careful evaluation of the bias-variance tradeoff, include average catchment elevation, slope, fraction of sand, permeability, temperature, and precipitation.
The most promising algorithms exceeding an accuracy score (r-square) of 0.7 on test data include support vector machine, gradient boosted regression trees, random forests, and extremely randomized trees. Considering both the accuracy and the computational complexity of these algorithms, we identify the extremely randomized trees as the best performing algorithm for BFI prediction in ungauged basins.
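
    The workflow described above (an 80/20 train/test split, then k-fold cross-validation over a hyperparameter grid) can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: it assumes a single predictor, a ridge model without intercept, and synthetic data, and every name in it (ridge_slope, k_fold_mse, the penalty grid) is hypothetical.

```python
import random

def ridge_slope(xs, ys, lam):
    """Closed-form ridge estimate for a one-predictor model without intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def k_fold_mse(xs, ys, lam, k=5):
    """Mean squared prediction error averaged over k held-out folds."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # every k-th point is held out
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        beta = ridge_slope(train_x, train_y, lam)
        errs = [(ys[i] - beta * xs[i]) ** 2 for i in test_idx]
        fold_errors.append(sum(errs) / len(errs))
    return sum(fold_errors) / k

# Synthetic data (an assumption, not catchment data): y = 2x + noise.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(100)]
ys = [2.0 * x + rng.gauss(0, 0.3) for x in xs]

# 80% of the points train the model; 20% are kept as an independent test set.
train_x, test_x = xs[:80], xs[80:]
train_y, test_y = ys[:80], ys[80:]

# Exhaustive grid search: score every candidate penalty by cross-validation.
grid = [0.0, 0.1, 1.0, 10.0, 100.0]
cv_scores = {lam: k_fold_mse(train_x, train_y, lam) for lam in grid}
best_lam = min(cv_scores, key=cv_scores.get)

# Refit on all training data and report the error on the untouched test set.
beta = ridge_slope(train_x, train_y, best_lam)
test_mse = sum((y - beta * x) ** 2 for x, y in zip(test_x, test_y)) / len(test_x)
print(best_lam, round(beta, 3), round(test_mse, 3))
```

    The test set is touched only once, after the penalty has been chosen, which is what makes it an independent measure of generalization.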

  13. Perceived Organizational Support for Enhancing Welfare at Work: A Regression Tree Model

    PubMed Central

    Giorgi, Gabriele; Dubin, David; Perez, Javier Fiz

    2016-01-01

    When trying to examine outcomes such as welfare and well-being, research tends to focus on main effects and to take into account a limited number of variables at a time. There are a number of techniques that may help address this problem. For example, many statistical packages available in R provide easy-to-use methods of modeling complicated analyses such as classification and regression trees (i.e., recursive partitioning). The present research illustrates the value of recursive partitioning in the prediction of perceived organizational support in a sample of more than 6000 Italian bankers. Using the tree function of the party package in R, we estimated a regression tree model predicting perceived organizational support from a multitude of job characteristics including job demand, lack of job control, lack of supervisor support, training, etc. The resulting model appears particularly helpful in pointing out several interactions in the prediction of perceived organizational support. In particular, training is the dominant factor. Another dimension that seems to influence organizational support is reporting (perceived communication about safety and stress concerns). Results are discussed from a theoretical and methodological point of view. PMID:28082924
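
    Recursive partitioning repeatedly chooses the split of one predictor that best separates the outcome. The sketch below shows a single such step using a CART-style criterion (minimum within-node sum of squared errors). That criterion is a substitution made here for simplicity: the party package uses conditional-inference tests to choose splits, so this is not the abstract's method. The data and function names are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty node)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, total_sse) for the split x <= threshold with the lowest SSE."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # the largest value cannot split the data
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best

# Hypothetical data: a training score (x) and a perceived-support score (y).
training = [1, 2, 3, 4, 5, 6]
support = [2.0, 2.1, 1.9, 5.0, 5.2, 4.8]
threshold, total = best_split(training, support)
print(threshold)
```

    A full tree applies the same search to each resulting node and to every predictor, which is how interactions between job characteristics emerge.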

  14. Multi-parameters monitoring during traditional Chinese medicine concentration process with near infrared spectroscopy and chemometrics

    NASA Astrophysics Data System (ADS)

    Liu, Ronghua; Sun, Qiaofeng; Hu, Tian; Li, Lian; Nie, Lei; Wang, Jiayue; Zhou, Wanhui; Zang, Hengchang

    2018-03-01

    As a powerful process analytical technology (PAT) tool, near infrared (NIR) spectroscopy has been widely used in real-time monitoring. In this study, NIR spectroscopy was applied to monitor multiple parameters of traditional Chinese medicine (TCM) Shenzhiling oral liquid during the concentration process to guarantee the quality of products. Five lab-scale batches were employed to construct quantitative models to determine five chemical ingredients and a physical change (sample density) during the concentration process. Paeoniflorin, albiflorin, liquiritin and sample density were modeled by partial least squares regression (PLSR), while the contents of glycyrrhizic acid and cinnamic acid were modeled by support vector machine regression (SVMR). Standard normal variate (SNV) and/or Savitzky-Golay (SG) smoothing with derivative methods were adopted for spectra pretreatment. Variable selection methods including correlation coefficient (CC), competitive adaptive reweighted sampling (CARS) and interval partial least squares regression (iPLS) were performed for optimizing the models. The results indicated that NIR spectroscopy was an effective tool for monitoring the concentration process of Shenzhiling oral liquid.
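
    Standard normal variate (SNV) pretreatment centers and scales each spectrum by its own mean and standard deviation, which removes additive offsets and multiplicative scaling between samples. A minimal sketch, assuming a spectrum is a plain list of absorbance values; the numbers are made up for illustration.

```python
from statistics import mean, stdev

def snv(spectrum):
    """Standard normal variate: center and scale one spectrum by its own statistics."""
    m = mean(spectrum)
    s = stdev(spectrum)  # sample standard deviation (n - 1 in the denominator)
    return [(v - m) / s for v in spectrum]

# Hypothetical absorbances; the second spectrum is the first one with a
# baseline offset and a scale change, as a scattering effect might produce.
raw = [0.10, 0.20, 0.40, 0.30]
shifted = [2.0 * v + 0.5 for v in raw]

print([round(v, 3) for v in snv(raw)])
print([round(v, 3) for v in snv(shifted)])
```

    Both printed lists should agree, since SNV is invariant to a per-spectrum offset and positive scale factor.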

  15. Correlations of turbidity to suspended-sediment concentration in the Toutle River Basin, near Mount St. Helens, Washington, 2010-11

    USGS Publications Warehouse

    Uhrich, Mark A.; Kolasinac, Jasna; Booth, Pamela L.; Fountain, Robert L.; Spicer, Kurt R.; Mosbrucker, Adam R.

    2014-01-01

    Researchers at the U.S. Geological Survey, Cascades Volcano Observatory, investigated alternatives to the traditional sample-based sediment record procedure for determining suspended-sediment concentration (SSC) and discharge. One such sediment-surrogate technique was developed using turbidity and discharge to estimate SSC for two gaging stations in the Toutle River Basin near Mount St. Helens, Washington. To provide context for the study, methods for collecting sediment data and monitoring turbidity are discussed. Statistical methods used include the development of ordinary least squares regression models for each gaging station. Issues of time-related autocorrelation also are evaluated. Lagged explanatory variables were added to account for autocorrelation in the turbidity, discharge, and SSC data. Final regression model equations and plots are presented for the two gaging stations. The regression models support near-real-time estimates of SSC and improved suspended-sediment discharge records by incorporating continuous instream turbidity. Future use of such models may lower the costs of sediment monitoring by reducing the time it takes to collect and process samples and to derive a sediment-discharge record.
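
    Two building blocks behind the autocorrelation treatment mentioned above are a lag-1 autocorrelation estimate and a lagged copy of an explanatory variable. The sketch below shows both under simple assumptions (an evenly spaced series, a single lag); the function names and the toy series are hypothetical and are not taken from the report.

```python
def lag1_autocorrelation(series):
    """Lag-1 sample autocorrelation of an evenly spaced series."""
    n = len(series)
    m = sum(series) / n
    num = sum((series[t] - m) * (series[t - 1] - m) for t in range(1, n))
    den = sum((v - m) ** 2 for v in series)
    return num / den

def with_lag(series, lag=1):
    """Pair each value with the value `lag` steps earlier; the first `lag` rows are dropped."""
    return [(series[t], series[t - lag]) for t in range(lag, len(series))]

# Toy turbidity-like series (made-up values).
turbidity = [1.0, 2.0, 3.0, 4.0, 5.0]
print(lag1_autocorrelation(turbidity))
print(with_lag(turbidity))
```

    Each pair from with_lag supplies the current value and its lagged counterpart, so the lagged column can enter a regression as an additional explanatory variable.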

  16. Statistical-learning strategies generate only modestly performing predictive models for urinary symptoms following external beam radiotherapy of the prostate: A comparison of conventional and machine-learning methods

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Yahya, Noorazrul, E-mail: noorazrul.yahya@research.uwa.edu.au; Ebert, Martin A.; Bulsara, Max

    Purpose: Given the paucity of available data concerning radiotherapy-induced urinary toxicity, it is important to ensure derivation of the most robust models with superior predictive performance. This work explores multiple statistical-learning strategies for prediction of urinary symptoms following external beam radiotherapy of the prostate. Methods: The performance of logistic regression, elastic-net, support-vector machine, random forest, neural network, and multivariate adaptive regression splines (MARS) to predict urinary symptoms was analyzed using data from 754 participants accrued by TROG03.04-RADAR. Predictive features included dose-surface data, comorbidities, and medication-intake. Four symptoms were analyzed: dysuria, haematuria, incontinence, and frequency, each with three definitions (grade ≥ 1, grade ≥ 2 and longitudinal) with event rate between 2.3% and 76.1%. Repeated cross-validations producing matched models were implemented. A synthetic minority oversampling technique was utilized in endpoints with rare events. Parameter optimization was performed on the training data. Area under the receiver operating characteristic curve (AUROC) was used to compare performance using sample size to detect differences of ≥0.05 at the 95% confidence level. Results: Logistic regression, elastic-net, random forest, MARS, and support-vector machine were the highest-performing statistical-learning strategies in 3, 3, 3, 2, and 1 endpoints, respectively. Logistic regression, MARS, elastic-net, random forest, neural network, and support-vector machine were the best, or were not significantly worse than the best, in 7, 7, 5, 5, 3, and 1 endpoints. The best-performing statistical model was for dysuria grade ≥ 1 with AUROC ± standard deviation of 0.649 ± 0.074 using MARS. For longitudinal frequency and dysuria grade ≥ 1, all strategies produced AUROC>0.6 while all haematuria endpoints and longitudinal incontinence models produced AUROC<0.6.
Conclusions: Logistic regression and MARS were most likely to be the best-performing strategy for the prediction of urinary symptoms with elastic-net and random forest producing competitive results. The predictive power of the models was modest and endpoint-dependent. New features, including spatial dose maps, may be necessary to achieve better models.

  17. Effect of motivational interviewing on rates of early childhood caries: a randomized trial.

    PubMed

    Harrison, Rosamund; Benton, Tonya; Everson-Stewart, Siobhan; Weinstein, Phil

    2007-01-01

    The purposes of this randomized controlled trial were to: (1) test motivational interviewing (MI) to prevent early childhood caries; and (2) use Poisson regression for data analysis. A total of 240 South Asian children 6 to 18 months old were enrolled and randomly assigned to either the MI or control condition. Children had a dental exam, and their mothers completed pretested instruments at baseline and 1 and 2 years postintervention. Other covariates that might explain outcomes over and above treatment differences were modeled using Poisson regression. Hazard ratios were produced. Analyses included all participants whenever possible. Poisson regression supported a protective effect of MI (hazard ratio [HR]=0.54; 95% CI=0.35-0.84); that is, the MI group had about a 46% lower rate of dmfs at 2 years than did control children. Similar treatment effect estimates were obtained from models that included, as alternative outcomes, ds, dms, and dmfs, including "white spot lesions." Exploratory analyses revealed that rates of dmfs were higher in children whose mothers had: (1) prechewed their food; (2) been raised in a rural environment; and (3) a higher family income (P<.05). A motivational interviewing-style intervention shows promise to promote preventive behaviors in mothers of young children at high risk for caries.
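
    The "about a 46% lower rate" figure follows directly from the reported ratio. A short worked version, assuming a log-link Poisson model with a treatment indicator (the symbols \beta_0, \beta_1, and \mathrm{MI}_i are illustrative and do not come from the paper; the abstract labels the exponentiated coefficient a hazard ratio, which in a Poisson model is usually read as a rate ratio):

```latex
\log \mathbb{E}[Y_i] = \beta_0 + \beta_1\,\mathrm{MI}_i + \dots,
\qquad \mathrm{HR} = e^{\beta_1} = 0.54,
\qquad (1 - 0.54) \times 100\% = 46\%.
```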

  18. On the Asymptotic Relative Efficiency of Planned Missingness Designs.

    PubMed

    Rhemtulla, Mijke; Savalei, Victoria; Little, Todd D

    2016-03-01

    In planned missingness (PM) designs, certain data are set a priori to be missing. PM designs can increase validity and reduce cost; however, little is known about the loss of efficiency that accompanies these designs. The present paper compares PM designs to reduced sample (RN) designs that have the same total number of data points concentrated in fewer participants. In 4 studies, we consider models for both observed and latent variables, designs that do or do not include an "X set" of variables with complete data, and a full range of between- and within-set correlation values. All results are obtained using asymptotic relative efficiency formulas, and thus no data are generated; this novel approach allows us to examine whether PM designs have theoretical advantages over RN designs removing the impact of sampling error. Our primary findings are that (a) in manifest variable regression models, estimates of regression coefficients have much lower relative efficiency in PM designs as compared to RN designs, (b) relative efficiency of factor correlation or latent regression coefficient estimates is maximized when the indicators of each latent variable come from different sets, and (c) the addition of an X set improves efficiency in manifest variable regression models only for the parameters that directly involve the X-set variables, but it substantially improves efficiency of most parameters in latent variable models. We conclude that PM designs can be beneficial when the model of interest is a latent variable model; recommendations are made for how to optimize such a design.

  19. Development and validation of a risk calculator predicting exercise-induced ventricular arrhythmia in patients with cardiovascular disease.

    PubMed

    Hermes, Ilarraza-Lomelí; Marianna, García-Saldivia; Jessica, Rojano-Castillo; Carlos, Barrera-Ramírez; Rafael, Chávez-Domínguez; María Dolores, Rius-Suárez; Pedro, Iturralde

    2016-10-01

    Mortality due to cardiovascular disease is often associated with ventricular arrhythmias. Nowadays, patients with cardiovascular disease are more encouraged to take part in physical training programs. Nevertheless, high-intensity exercise is associated with a higher risk of sudden death, even in apparently healthy people. During exercise testing (ET), health care professionals provide patients, in a controlled scenario, an intense physiological stimulus that could precipitate cardiac arrhythmia in high-risk individuals. There is still no clinical or statistical tool to predict this incidence. The aim of this study was to develop a statistical model to predict the incidence of exercise-induced potentially life-threatening ventricular arrhythmia (PLVA) during high-intensity exercise. A total of 6415 patients underwent a symptom-limited ET with a Balke ramp protocol. A multivariate logistic regression model was fitted with PLVA as the primary outcome. Incidence of PLVA was 548 cases (8.5%). After a bivariate model, thirty-one clinical or ergometric variables were statistically associated with PLVA and were included in the regression model. In the multivariate model, 13 of these variables were found to be statistically significant. A regression model (G) with a χ² of 283.987 and p<0.001 was constructed. Significant variables included: heart failure, antiarrhythmic drugs, myocardial lower-VD, age and use of digoxin, nitrates, among others. This study allows clinicians to identify patients at risk of ventricular tachycardia or couplets during exercise, and to take preventive measures or provide appropriate supervision. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.

  20. The Application of the Cumulative Logistic Regression Model to Automated Essay Scoring

    ERIC Educational Resources Information Center

    Haberman, Shelby J.; Sinharay, Sandip

    2010-01-01

    Most automated essay scoring programs use a linear regression model to predict an essay score from several essay features. This article applied a cumulative logit model instead of the linear regression model to automated essay scoring. Comparison of the performances of the linear regression model and the cumulative logit model was performed on a…

  1. The Association between Environmental Factors and Scarlet Fever Incidence in Beijing Region: Using GIS and Spatial Regression Models

    PubMed Central

    Mahara, Gehendra; Wang, Chao; Yang, Kun; Chen, Sipeng; Guo, Jin; Gao, Qi; Wang, Wei; Wang, Quanyi; Guo, Xiuhua

    2016-01-01

    (1) Background: Evidence regarding scarlet fever and its relationship with meteorological, including air pollution factors, is not very available. This study aimed to examine the relationship between ambient air pollutants and meteorological factors with scarlet fever occurrence in Beijing, China. (2) Methods: A retrospective ecological study was carried out to distinguish the epidemic characteristics of scarlet fever incidence in Beijing districts from 2013 to 2014. Daily incidence and corresponding air pollutant and meteorological data were used to develop the model. Global Moran’s I statistic and Anselin’s local Moran’s I (LISA) were applied to detect the spatial autocorrelation (spatial dependency) and clusters of scarlet fever incidence. The spatial lag model (SLM) and spatial error model (SEM) including ordinary least squares (OLS) models were then applied to probe the association between scarlet fever incidence and meteorological including air pollution factors. (3) Results: Among the 5491 cases, more than half (62%) were male, and more than one-third (37.8%) were female, with the annual average incidence rate 14.64 per 100,000 population. Spatial autocorrelation analysis exhibited the existence of spatial dependence; therefore, we applied spatial regression models. After comparing the values of R-square, log-likelihood and the Akaike information criterion (AIC) among the three models, the OLS model (R2 = 0.0741, log likelihood = −1819.69, AIC = 3665.38), SLM (R2 = 0.0786, log likelihood = −1819.04, AIC = 3665.08) and SEM (R2 = 0.0743, log likelihood = −1819.67, AIC = 3665.36), identified that the spatial lag model (SLM) was best for model fit for the regression model. There was a positive significant association between nitrogen oxide (p = 0.027), rainfall (p = 0.036) and sunshine hour (p = 0.048), while the relative humidity (p = 0.034) had an adverse association with scarlet fever incidence in SLM. 
(4) Conclusions: Our findings indicated that meteorological as well as air pollutant factors may increase the incidence of scarlet fever; these findings may help to guide scarlet fever control programs and to target interventions. PMID:27827946
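
    The global Moran's I statistic used above to detect spatial autocorrelation can be computed directly from district values and a spatial weights matrix. The sketch below is a minimal illustration with a binary adjacency matrix for four hypothetical districts arranged in a line; neither the weights nor the values come from the study.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a square spatial weights matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return (n / w_sum) * cross / sum(d * d for d in dev)

# Four hypothetical districts in a row; neighbours share a border (weight 1).
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = [1.0, 1.0, 5.0, 5.0]    # similar values sit next to each other
alternating = [1.0, 5.0, 1.0, 5.0]  # neighbours always differ

print(morans_i(clustered, W))
print(morans_i(alternating, W))
```

    Positive values indicate that neighbouring districts tend to have similar incidence, negative values that they tend to differ; the first case is the situation that motivates spatial lag or spatial error models over plain OLS.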

  2. The Association between Environmental Factors and Scarlet Fever Incidence in Beijing Region: Using GIS and Spatial Regression Models.

    PubMed

    Mahara, Gehendra; Wang, Chao; Yang, Kun; Chen, Sipeng; Guo, Jin; Gao, Qi; Wang, Wei; Wang, Quanyi; Guo, Xiuhua

    2016-11-04

    (1) Background: Evidence regarding scarlet fever and its relationship with meteorological, including air pollution factors, is not very available. This study aimed to examine the relationship between ambient air pollutants and meteorological factors with scarlet fever occurrence in Beijing, China. (2) Methods: A retrospective ecological study was carried out to distinguish the epidemic characteristics of scarlet fever incidence in Beijing districts from 2013 to 2014. Daily incidence and corresponding air pollutant and meteorological data were used to develop the model. Global Moran's I statistic and Anselin's local Moran's I (LISA) were applied to detect the spatial autocorrelation (spatial dependency) and clusters of scarlet fever incidence. The spatial lag model (SLM) and spatial error model (SEM) including ordinary least squares (OLS) models were then applied to probe the association between scarlet fever incidence and meteorological including air pollution factors. (3) Results: Among the 5491 cases, more than half (62%) were male, and more than one-third (37.8%) were female, with the annual average incidence rate 14.64 per 100,000 population. Spatial autocorrelation analysis exhibited the existence of spatial dependence; therefore, we applied spatial regression models. After comparing the values of R-square, log-likelihood and the Akaike information criterion (AIC) among the three models, the OLS model (R² = 0.0741, log likelihood = -1819.69, AIC = 3665.38), SLM (R² = 0.0786, log likelihood = -1819.04, AIC = 3665.08) and SEM (R² = 0.0743, log likelihood = -1819.67, AIC = 3665.36), identified that the spatial lag model (SLM) was best for model fit for the regression model. There was a positive significant association between nitrogen oxide ( p = 0.027), rainfall ( p = 0.036) and sunshine hour ( p = 0.048), while the relative humidity ( p = 0.034) had an adverse association with scarlet fever incidence in SLM. 
(4) Conclusions: Our findings indicated that meteorological as well as air pollutant factors may increase the incidence of scarlet fever; these findings may help to guide scarlet fever control programs and to target interventions.

  3. 40 CFR 80.48 - Augmentation of the complex emission model by vehicle testing.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... section, the analysis shall fit a regression model to a combined data set that includes vehicle testing... logarithm of emissions contained in this combined data set: (A) A term for each vehicle that shall reflect... nearest limit of the data core, using the unaugmented complex model. (B) “B” shall be set equal to the...

  4. 40 CFR 80.48 - Augmentation of the complex emission model by vehicle testing.

    Code of Federal Regulations, 2012 CFR

    2012-07-01

    ... section, the analysis shall fit a regression model to a combined data set that includes vehicle testing... logarithm of emissions contained in this combined data set: (A) A term for each vehicle that shall reflect... nearest limit of the data core, using the unaugmented complex model. (B) “B” shall be set equal to the...

  5. 40 CFR 80.48 - Augmentation of the complex emission model by vehicle testing.

    Code of Federal Regulations, 2014 CFR

    2014-07-01

    ... section, the analysis shall fit a regression model to a combined data set that includes vehicle testing... logarithm of emissions contained in this combined data set: (A) A term for each vehicle that shall reflect... nearest limit of the data core, using the unaugmented complex model. (B) “B” shall be set equal to the...

  6. 40 CFR 80.48 - Augmentation of the complex emission model by vehicle testing.

    Code of Federal Regulations, 2011 CFR

    2011-07-01

    ... section, the analysis shall fit a regression model to a combined data set that includes vehicle testing... logarithm of emissions contained in this combined data set: (A) A term for each vehicle that shall reflect... nearest limit of the data core, using the unaugmented complex model. (B) “B” shall be set equal to the...

  7. 40 CFR 80.48 - Augmentation of the complex emission model by vehicle testing.

    Code of Federal Regulations, 2013 CFR

    2013-07-01

    ... section, the analysis shall fit a regression model to a combined data set that includes vehicle testing... logarithm of emissions contained in this combined data set: (A) A term for each vehicle that shall reflect... nearest limit of the data core, using the unaugmented complex model. (B) “B” shall be set equal to the...

  8. Modeling Longitudinal Data Containing Non-Normal Within Subject Errors

    NASA Technical Reports Server (NTRS)

    Feiveson, Alan; Glenn, Nancy L.

    2013-01-01

    The mission of the National Aeronautics and Space Administration’s (NASA) human research program is to advance safe human spaceflight. This involves conducting experiments, collecting data, and analyzing data. The data are longitudinal and come from a relatively small number of subjects, typically 10-20. A longitudinal study refers to an investigation where participant outcomes and possibly treatments are collected at multiple follow-up times. Standard statistical designs such as mean regression with random effects and mixed-effects regression are inadequate for such data because the population is typically not approximately normally distributed. Hence, more advanced data analysis methods are necessary. This research focuses on four such methods for longitudinal data analysis: the recently proposed linear quantile mixed models (lqmm) by Geraci and Bottai (2013), quantile regression, multilevel mixed-effects linear regression, and robust regression. This research also provides computational algorithms for longitudinal data that scientists can directly use for human spaceflight and other longitudinal data applications, then presents statistical evidence that verifies which method is best for specific situations. This advances the study of longitudinal data in a broad range of applications, including applications in the science, technology, engineering and mathematics fields.
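
    Quantile regression replaces the squared-error criterion of mean regression with an asymmetric "check" loss, which is why it copes better with non-normal errors. The sketch below illustrates the loss for the simplest case, a constant-only model at the median, on a made-up sample with one extreme value. It is not the lqmm algorithm, and all names are hypothetical.

```python
def check_loss(residual, tau):
    """Quantile (pinball) loss of one residual at quantile level tau."""
    return residual * (tau - (0.0 if residual >= 0 else 1.0))

def total_loss(data, q, tau):
    """Total check loss when every observation is predicted by the constant q."""
    return sum(check_loss(y - q, tau) for y in data)

# Hypothetical small sample with one extreme observation.
data = [1.0, 2.0, 3.0, 4.0, 100.0]

# For a constant-only model a minimizer is always attained at a data point,
# so it is enough to search the observed values.
best_q = min(data, key=lambda q: total_loss(data, q, 0.5))
sample_mean = sum(data) / len(data)
print(best_q, sample_mean)
```

    The check-loss minimizer at tau = 0.5 is the sample median, which is barely affected by the extreme point, whereas the mean is pulled strongly toward it.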

  9. The microcomputer scientific software series 2: general linear model--regression.

    Treesearch

    Harold M. Rauscher

    1983-01-01

    The general linear model regression (GLMR) program provides the microcomputer user with a sophisticated regression analysis capability. The output provides a regression ANOVA table, estimators of the regression model coefficients, their confidence intervals, confidence intervals around the predicted Y-values, residuals for plotting, a check for multicollinearity, a...

  10. Valuation of National Park System Visitation: The Efficient Use of Count Data Models, Meta-Analysis, and Secondary Visitor Survey Data

    NASA Astrophysics Data System (ADS)

    Neher, Christopher; Duffield, John; Patterson, David

    2013-09-01

    The National Park Service (NPS) currently manages a large and diverse system of park units nationwide which received an estimated 279 million recreational visits in 2011. This article uses park visitor data collected by the NPS Visitor Services Project to estimate a consistent set of count data travel cost models of park visitor willingness to pay (WTP). Models were estimated using 58 different park unit survey datasets. WTP estimates for these 58 park surveys were used within a meta-regression analysis model to predict average and total WTP for NPS recreational visitation system-wide. Estimated WTP per NPS visit in 2011 averaged $102 system-wide, and ranged across park units from $67 to $288. Total 2011 visitor WTP for the NPS system is estimated at $28.5 billion with a 95% confidence interval of $19.7-$43.1 billion. The estimation of a meta-regression model using consistently collected data and identical specification of visitor WTP models greatly reduces problems common to meta-regression models, including sample selection bias, primary data heterogeneity, and heteroskedasticity, as well as some aspects of panel effects. The article provides the first estimate of total annual NPS visitor WTP within the literature directly based on NPS visitor survey data.
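
    The reported system-wide total is consistent with the visit count and the average per-visit WTP quoted above. A quick arithmetic check, assuming the total is approximately visits times mean WTP (the article derives its total from a meta-regression, so this is only a plausibility check; variable names are hypothetical):

```python
visits_2011 = 279e6         # estimated recreational visits in 2011
mean_wtp_per_visit = 102.0  # average WTP per visit, in the abstract's monetary units

total_wtp = visits_2011 * mean_wtp_per_visit
print(f"{total_wtp / 1e9:.1f} billion")
```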

  11. An hourly regression model for ultrafine particles in a near-highway urban area

    PubMed Central

    Patton, Allison P.; Collins, Caitlin; Naumova, Elena N.; Zamore, Wig; Brugge, Doug; Durant, John L.

    2015-01-01

    Estimating ultrafine particle number concentrations (PNC) near highways for exposure assessment in chronic health studies requires models capable of capturing PNC spatial and temporal variations over the course of a full year. The objectives of this work were to describe the relationship between near-highway PNC and potential predictors, and to build and validate hourly log-linear regression models. PNC was measured near Interstate 93 (I-93) in Somerville, MA (USA) using a mobile monitoring platform driven for 234 hours on 43 days between August 2009 and September 2010. Compared to urban background, PNC levels were consistently elevated within 100–200 m of I-93, with gradients impacted by meteorological and traffic conditions. Temporal and spatial variables including wind speed and direction, temperature, highway traffic, and distance to I-93 and major roads contributed significantly to the full regression model. Cross-validated model R² values ranged from 0.38 to 0.47, with higher values achieved (0.43 to 0.53) when short-duration PNC spikes were removed. The model predicts highest PNC near major roads and on cold days with low wind speeds. The model allows estimation of hourly ambient PNC at 20-m resolution in a near-highway neighborhood. PMID:24559198
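
    A log-linear regression models the natural logarithm of the response as a linear function of the predictors, so each coefficient acts multiplicatively on the original scale. The sketch below fits such a model with one predictor by ordinary least squares. The data are synthetic (an assumed exponential decay of concentration with distance) and the names are hypothetical; the published model uses many more predictors.

```python
import math

def fit_log_linear(x, y):
    """OLS fit of ln(y) = intercept + slope * x for a single predictor."""
    log_y = [math.log(v) for v in y]
    n = len(x)
    mean_x = sum(x) / n
    mean_ly = sum(log_y) / n
    sxy = sum((a - mean_x) * (b - mean_ly) for a, b in zip(x, log_y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_ly - slope * mean_x
    return intercept, slope

# Synthetic example: concentration decays exponentially with distance (metres).
distance_m = [0.0, 50.0, 100.0, 150.0, 200.0]
pnc = [40000.0 * math.exp(-0.005 * d) for d in distance_m]

b0, b1 = fit_log_linear(distance_m, pnc)
# exp(b1 * 100) is the multiplicative change in PNC per 100 m of distance.
print(round(b1, 4), round(math.exp(b0)), round(math.exp(b1 * 100.0), 3))
```

    Predictions on the original scale are obtained by exponentiating the fitted linear predictor.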

  12. Spatiotemporal analysis of the relationship between socioeconomic factors and stroke in the Portuguese mainland population under 65 years old.

    PubMed

    Oliveira, André; Cabral, António J R; Mendes, Jorge M; Martins, Maria R O; Cabral, Pedro

    2015-11-04

    Stroke risk has been shown to display varying patterns of geographic distribution amongst countries but also between regions of the same country. Traditionally a disease of older persons, a global 25% increase in incidence instead was noticed between 1990 and 2010 in persons aged 20-≤64 years, particularly in low- and medium-income countries. Understanding spatial disparities in the association between socioeconomic factors and stroke is critical to target public health initiatives aiming to mitigate or prevent this disease, including in younger persons. We aimed to identify socioeconomic determinants of geographic disparities of stroke risk in people <65 years old, in municipalities of mainland Portugal, and the spatiotemporal variation of the association between these determinants and stroke risk during two study periods (1992-1996 and 2002-2006). Poisson and negative binomial global regression models were used to explore determinants of disease risk. Geographically weighted regression (GWR) represents a distinctive approach, allowing estimation of local regression coefficients. Models for both study periods were identified. Significant variables included education attainment, work hours per week and unemployment. Local Poisson GWR models achieved the best fit and evidenced spatially varying regression coefficients. Spatiotemporal inequalities were observed in significant variables, with dissimilarities between men and women. This study contributes to a better understanding of the relationship between stroke and socioeconomic factors in the population <65 years of age, one age group seldom analysed separately. It can thus help to improve the targeting of public health initiatives, even more in a context of economic crisis.

  13. Neural Network and Regression Methods Demonstrated in the Design Optimization of a Subsonic Aircraft

    NASA Technical Reports Server (NTRS)

    Hopkins, Dale A.; Lavelle, Thomas M.; Patnaik, Surya

    2003-01-01

    The neural network and regression methods of NASA Glenn Research Center's COMETBOARDS design optimization testbed were used to generate approximate analysis and design models for a subsonic aircraft operating at Mach 0.85 cruise speed. The analytical model is defined by nine design variables: wing aspect ratio, engine thrust, wing area, sweep angle, chord-thickness ratio, turbine temperature, pressure ratio, bypass ratio, fan pressure; and eight response parameters: weight, landing velocity, takeoff and landing field lengths, approach thrust, overall efficiency, and compressor pressure and temperature. The variables were adjusted to optimally balance the engines to the airframe. The solution strategy included a sensitivity model and the soft analysis model. Researchers generated the sensitivity model by training the approximators to predict an optimum design. The trained neural network predicted all response variables, within 5-percent error. This was reduced to 1 percent by the regression method. The soft analysis model was developed to replace aircraft analysis as the reanalyzer in design optimization. Soft models have been generated for a neural network method, a regression method, and a hybrid method obtained by combining the approximators. The performance of the models is graphed for aircraft weight versus thrust as well as for wing area and turbine temperature. The regression method followed the analytical solution with little error. The neural network exhibited 5-percent maximum error over all parameters. Performance of the hybrid method was intermediate in comparison to the individual approximators. Error in the response variable is smaller than that shown in the figure because of a distortion scale factor.
The overall performance of the approximators was considered to be satisfactory because aircraft analysis with NASA Langley Research Center s FLOPS (Flight Optimization System) code is a synthesis of diverse disciplines: weight estimation, aerodynamic analysis, engine cycle analysis, propulsion data interpolation, mission performance, airfield length for landing and takeoff, noise footprint, and others.

  14. Climate variations and salmonellosis transmission in Adelaide, South Australia: a comparison between regression models

    NASA Astrophysics Data System (ADS)

    Zhang, Ying; Bi, Peng; Hiller, Janet

    2008-01-01

    This is the first study to identify appropriate regression models for the association between climate variation and salmonellosis transmission. A comparison between different regression models was conducted using surveillance data in Adelaide, South Australia. By using notified salmonellosis cases and climatic variables from the Adelaide metropolitan area over the period 1990-2003, four regression methods were examined: standard Poisson regression, autoregressive adjusted Poisson regression, multiple linear regression, and a seasonal autoregressive integrated moving average (SARIMA) model. Notified salmonellosis cases in 2004 were used to test the forecasting ability of the four models. Parameter estimation, goodness-of-fit and forecasting ability of the four regression models were compared. Temperatures occurring 2 weeks prior to cases were positively associated with cases of salmonellosis. Rainfall was also inversely related to the number of cases. The comparison of the goodness-of-fit and forecasting ability suggests that the SARIMA model is better than the other three regression models. Temperature and rainfall may be used as climatic predictors of salmonellosis cases in regions with climatic characteristics similar to those of Adelaide. The SARIMA model could, thus, be adopted to quantify the relationship between climate variations and salmonellosis transmission.
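
    As a rough illustration of the simplest of the four methods, the sketch below fits a standard Poisson regression of weekly counts on temperature lagged by two weeks. The data are synthetic, the coefficients are invented for the simulation, and `fit_poisson` is a hypothetical helper written with NumPy only; it is not the study's analysis.

```python
import numpy as np

def fit_poisson(X, y, n_iter=50, tol=1e-10):
    """Poisson regression with log link, fitted by iteratively reweighted least squares."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(Xd.shape[1])
    beta[0] = np.log(y.mean())                 # start from the intercept-only fit
    for _ in range(n_iter):
        eta = Xd @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu                # working response
        XtW = Xd.T * mu                        # IRLS weights equal the fitted means
        new_beta = np.linalg.solve(XtW @ Xd, XtW @ z)
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta

# Synthetic weekly series (assumption): seasonal temperature, counts driven by lag-2 temperature.
rng = np.random.default_rng(1)
weeks = 520
temp = 20 + 8 * np.sin(2 * np.pi * np.arange(weeks) / 52) + rng.normal(size=weeks)
lag = 2
temp_lagged = temp[:-lag]                      # temperature two weeks before each count
cases = rng.poisson(np.exp(0.5 + 0.05 * temp_lagged))   # cases[k] belongs to week k + lag

beta = fit_poisson(temp_lagged, cases)
print(beta)   # (intercept, log rate ratio per degree of lagged temperature)
```

    The autoregressive and SARIMA variants compared in the study additionally model serial dependence in the counts, which this sketch ignores.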

  15. Body Fat Percentage Prediction Using Intelligent Hybrid Approaches

    PubMed Central

    Shao, Yuehjen E.

    2014-01-01

    Excess body fat often leads to obesity. Obesity is typically associated with serious medical diseases, such as cancer, heart disease, and diabetes. Accordingly, knowing the body fat is an extremely important issue since it affects everyone's health. Although there are several ways to measure the body fat percentage (BFP), the accurate methods are often associated with hassle and/or high costs. Traditional single-stage approaches may use certain body measurements or explanatory variables to predict the BFP. Diverging from existing approaches, this study proposes new intelligent hybrid approaches to obtain fewer explanatory variables, and the proposed forecasting models are able to effectively predict the BFP. The proposed hybrid models consist of multiple regression (MR), artificial neural network (ANN), multivariate adaptive regression splines (MARS), and support vector regression (SVR) techniques. The first stage of the modeling includes the use of MR and MARS to obtain fewer but more important sets of explanatory variables. In the second stage, the remaining important variables serve as inputs for the other forecasting methods. A real dataset was used to demonstrate the development of the proposed hybrid models. The prediction results revealed that the proposed hybrid schemes outperformed the typical, single-stage forecasting models. PMID:24723804

  16. High dimensional linear regression models under long memory dependence and measurement error

    NASA Astrophysics Data System (ADS)

    Kaul, Abhishek

    This dissertation consists of three chapters. The first chapter introduces the models under consideration and motivates problems of interest. A brief literature review is also provided in this chapter. The second chapter investigates the properties of Lasso under long range dependent model errors. Lasso is a computationally efficient approach to model selection and estimation, and its properties are well studied when the regression errors are independent and identically distributed. We study the case where the regression errors form a long memory moving average process. We establish a finite sample oracle inequality for the Lasso solution. We then show the asymptotic sign consistency in this setup. These results are established in the high dimensional setup (p > n) where p can be increasing exponentially with n. Finally, we show the consistency, n^(1/2-d)-consistency of Lasso, along with the oracle property of adaptive Lasso, in the case where p is fixed. Here d is the memory parameter of the stationary error sequence. The performance of Lasso is also analysed in the present setup with a simulation study. The third chapter proposes and investigates the properties of a penalized quantile based estimator for measurement error models. Standard formulations of prediction problems in high dimension regression models assume the availability of fully observed covariates and sub-Gaussian and homogeneous model errors. This makes these methods inapplicable to measurement error models where covariates are unobservable and observations are possibly non-sub-Gaussian and heterogeneous. We propose weighted penalized corrected quantile estimators for the regression parameter vector in linear regression models with additive measurement errors, where unobservable covariates are nonrandom. The proposed estimators forgo the need for the above mentioned model assumptions. We study these estimators in both the fixed dimension and high dimensional sparse setups; in the latter setup, the dimensionality can grow exponentially with the sample size. In the fixed dimensional setting we provide the oracle properties associated with the proposed estimators. In the high dimensional setting, we provide bounds for the statistical error associated with the estimation, that hold with asymptotic probability 1, thereby providing the ℓ1-consistency of the proposed estimator. We also establish the model selection consistency in terms of the correctly estimated zero components of the parameter vector. A simulation study that investigates the finite sample accuracy of the proposed estimator is also included in this chapter.
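
    For readers unfamiliar with the Lasso in the p > n setting, the following is a minimal sketch of cyclic coordinate descent for the Lasso objective. It uses independent Gaussian errors, not the long memory errors studied in the dissertation, and the penalty level, dimensions, and function names are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z towards zero by t; values inside [-t, t] become exactly zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Minimise (1/2n)||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()
    for _ in range(n_sweeps):
        for j in range(p):
            resid += X[:, j] * beta[j]          # remove coordinate j from the fit
            rho = X[:, j] @ resid / n
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
            resid -= X[:, j] * beta[j]          # put the updated coordinate back
    return beta

# Synthetic sparse problem (assumption): 3 active coefficients out of 200, 50 observations.
rng = np.random.default_rng(2)
n, p = 50, 200
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + rng.normal(scale=0.1, size=n)

beta_hat = lasso_cd(X, y, lam=0.3)
print(np.flatnonzero(beta_hat))   # indices of the selected variables
```

    Sign consistency, as discussed in the abstract, concerns whether the signs of `beta_hat` match those of `beta_true`, zeros included, as the sample size grows.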

  17. [Evaluation of estimation of prevalence ratio using bayesian log-binomial regression model].

    PubMed

    Gao, W L; Lin, H; Liu, X N; Ren, X W; Li, J S; Shen, X P; Zhu, S L

    2017-03-10

    To evaluate the estimation of the prevalence ratio (PR) using a Bayesian log-binomial regression model and its application, we estimated the PR of medical care-seeking prevalence in relation to caregivers' recognition of risk signs of diarrhea in their infants using a Bayesian log-binomial regression model in OpenBUGS software. The results showed that caregivers' recognition of their infant's risk signs of diarrhea was significantly associated with a 13% increase in medical care-seeking. We also compared the point and interval estimates of the PR, and the convergence, of three models (model 1: not adjusting for covariates; model 2: adjusting for duration of caregivers' education; model 3: adjusting for distance between village and township and child age in months, based on model 2) between the Bayesian log-binomial regression model and the conventional log-binomial regression model. All three Bayesian log-binomial regression models converged, and the estimated PRs were 1.130 (95% CI: 1.005-1.265), 1.128 (95% CI: 1.001-1.264) and 1.132 (95% CI: 1.004-1.267), respectively. Conventional log-binomial regression models 1 and 2 converged, with PRs of 1.130 (95% CI: 1.055-1.206) and 1.126 (95% CI: 1.051-1.203), respectively, but model 3 failed to converge, so the COPY method was used to estimate the PR, which was 1.125 (95% CI: 1.051-1.200). In addition, the point and interval estimates of the PRs from the three Bayesian log-binomial regression models differed slightly from those of the conventional log-binomial regression models, but the two approaches were consistent in estimating the PR. Therefore, the Bayesian log-binomial regression model can effectively estimate the PR with fewer convergence failures and has more advantages in application than the conventional log-binomial regression model.
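
    What a prevalence ratio measures can be shown with a small worked sketch for the unadjusted case. With a single binary exposure, the log-binomial maximum likelihood estimate of the PR equals the ratio of the two sample prevalences, and a Wald interval can be formed on the log scale. The counts below are hypothetical and are not the study's data.

```python
import math

def prevalence_ratio(a, n1, c, n0, z=1.96):
    """PR of exposed (a cases of n1) vs unexposed (c cases of n0), Wald CI on the log scale."""
    p1, p0 = a / n1, c / n0
    pr = p1 / p0
    # Standard error of log(PR): sqrt((1 - p1)/a + (1 - p0)/c)
    se_log = math.sqrt((1 - p1) / a + (1 - p0) / c)
    lower = pr * math.exp(-z * se_log)
    upper = pr * math.exp(z * se_log)
    return pr, lower, upper

# Hypothetical counts: 60 of 100 exposed and 50 of 100 unexposed sought care.
pr, lower, upper = prevalence_ratio(a=60, n1=100, c=50, n0=100)
print(round(pr, 3), round(lower, 3), round(upper, 3))
```

    Adjusted models such as models 2 and 3 in the abstract have no closed form, which is where convergence problems of the conventional log-binomial fit arise.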

  18. Mixed and Mixture Regression Models for Continuous Bounded Responses Using the Beta Distribution

    ERIC Educational Resources Information Center

    Verkuilen, Jay; Smithson, Michael

    2012-01-01

    Doubly bounded continuous data are common in the social and behavioral sciences. Examples include judged probabilities, confidence ratings, derived proportions such as percent time on task, and bounded scale scores. Dependent variables of this kind are often difficult to analyze using normal theory models because their distributions may be quite…

  19. Drivers willingness to pay progressive rate for street parking.

    DOT National Transportation Integrated Search

    2015-01-01

    This study finds willingness to pay and price elasticity for on-street parking demand using stated preference data obtained from 238 respondents. Descriptive, statistical and economic analyses including regression, generalized linear model, and f...

  20. Trees grow on money: urban tree canopy cover and environmental justice.

    PubMed

    Schwarz, Kirsten; Fragkias, Michail; Boone, Christopher G; Zhou, Weiqi; McHale, Melissa; Grove, J Morgan; O'Neil-Dunne, Jarlath; McFadden, Joseph P; Buckley, Geoffrey L; Childers, Dan; Ogden, Laura; Pincetl, Stephanie; Pataki, Diane; Whitmer, Ali; Cadenasso, Mary L

    2015-01-01

    This study examines the distributional equity of urban tree canopy (UTC) cover for Baltimore, MD, Los Angeles, CA, New York, NY, Philadelphia, PA, Raleigh, NC, Sacramento, CA, and Washington, D.C. using high spatial resolution land cover data and census data. Data are analyzed at the Census Block Group levels using Spearman's correlation, ordinary least squares regression (OLS), and a spatial autoregressive model (SAR). Across all cities there is a strong positive correlation between UTC cover and median household income. Negative correlations between race and UTC cover exist in bivariate models for some cities, but they are generally not observed using multivariate regressions that include additional variables on income, education, and housing age. SAR models result in higher r-square values compared to the OLS models across all cities, suggesting that spatial autocorrelation is an important feature of our data. Similarities among cities can be found based on shared characteristics of climate, race/ethnicity, and size. Our findings suggest that a suite of variables, including income, contribute to the distribution of UTC cover. These findings can help target simultaneous strategies for UTC goals and environmental justice concerns.

  1. Does money matter in inflation forecasting?

    NASA Astrophysics Data System (ADS)

    Binner, J. M.; Tino, P.; Tepper, J.; Anderson, R.; Jones, B.; Kendall, G.

    2010-11-01

    This paper provides the most comprehensive evidence to date on whether or not monetary aggregates are valuable for forecasting US inflation in the early to mid 2000s. We explore a wide range of different definitions of money, including different methods of aggregation and different collections of included monetary assets. In our forecasting experiment we use two nonlinear techniques, namely, recurrent neural networks and kernel recursive least squares regression, techniques that are new to macroeconomics. Recurrent neural networks operate with potentially unbounded input memory, while the kernel regression technique is a finite memory predictor. The two methodologies compete to find the best fitting US inflation forecasting models and are then compared to forecasts from a naïve random walk model. The best models were nonlinear autoregressive models based on kernel methods. Our findings do not provide much support for the usefulness of monetary aggregates in forecasting inflation. Beyond its economic findings, our study is in the tradition of physicists’ long-standing interest in the interconnections among statistical mechanics, neural networks, and related nonparametric statistical methods, and suggests potential avenues of extension for such studies.

  2. Issues and Importance of "Good" Starting Points for Nonlinear Regression for Mathematical Modeling with Maple: Basic Model Fitting to Make Predictions with Oscillating Data

    ERIC Educational Resources Information Center

    Fox, William

    2012-01-01

    The purpose of our modeling effort is to predict future outcomes. We assume the data collected are both accurate and relatively precise. For our oscillating data, we examined several mathematical modeling forms for predictions. We also examined both ignoring the oscillations as an important feature and including the oscillations as an important…

  3. Evaluation of weighted regression and sample size in developing a taper model for loblolly pine

    Treesearch

    Kenneth L. Cormier; Robin M. Reich; Raymond L. Czaplewski; William A. Bechtold

    1992-01-01

    A stem profile model, fit using pseudo-likelihood weighted regression, was used to estimate merchantable volume of loblolly pine (Pinus taeda L.) in the southeast. The weighted regression increased model fit marginally, but did not substantially increase model performance. In all cases, the unweighted regression models performed as well as the...

  4. Quantifying Vegetation Biophysical Variables from Imaging Spectroscopy Data: A Review on Retrieval Methods

    NASA Astrophysics Data System (ADS)

    Verrelst, Jochem; Malenovský, Zbyněk; Van der Tol, Christiaan; Camps-Valls, Gustau; Gastellu-Etchegorry, Jean-Philippe; Lewis, Philip; North, Peter; Moreno, Jose

    2018-06-01

    An unprecedented spectroscopic data stream will soon become available with forthcoming Earth-observing satellite missions equipped with imaging spectroradiometers. This data stream will open up a vast array of opportunities to quantify a diversity of biochemical and structural vegetation properties. The processing requirements for such large data streams require reliable retrieval techniques enabling the spatiotemporally explicit quantification of biophysical variables. With the aim of preparing for this new era of Earth observation, this review summarizes the state-of-the-art retrieval methods that have been applied in experimental imaging spectroscopy studies inferring all kinds of vegetation biophysical variables. Identified retrieval methods are categorized into: (1) parametric regression, including vegetation indices, shape indices and spectral transformations; (2) nonparametric regression, including linear and nonlinear machine learning regression algorithms; (3) physically based, including inversion of radiative transfer models (RTMs) using numerical optimization and look-up table approaches; and (4) hybrid regression methods, which combine RTM simulations with machine learning regression methods. For each of these categories, an overview of widely applied methods with application to mapping vegetation properties is given. In view of processing imaging spectroscopy data, a critical aspect involves the challenge of dealing with spectral multicollinearity. The ability to provide robust estimates, retrieval uncertainties and acceptable retrieval processing speed are other important aspects in view of operational processing. Recommendations towards new-generation spectroscopy-based processing chains for operational production of biophysical variables are given.

  5. Parameters Estimation of Geographically Weighted Ordinal Logistic Regression (GWOLR) Model

    NASA Astrophysics Data System (ADS)

    Zuhdi, Shaifudin; Retno Sari Saputro, Dewi; Widyaningsih, Purnami

    2017-06-01

    A regression model represents the relationship between independent variables and a dependent variable. When the dependent variable is categorical, a logistic regression model is used to calculate the odds. When the categories of the dependent variable are ordered levels, the logistic regression model is ordinal. The GWOLR model is an ordinal logistic regression model that is influenced by the geographical location of the observation site. Parameter estimation in the model is needed to determine population values from a sample. The purpose of this research is to estimate the parameters of the GWOLR model using R software. Parameter estimation uses data on the number of dengue fever patients in Semarang City. The observation units are 144 villages in Semarang City. The research yields a local GWOLR model for each village and the probabilities of the categories of the number of dengue fever patients.

  6. Modelling fourier regression for time series data- a case study: modelling inflation in foods sector in Indonesia

    NASA Astrophysics Data System (ADS)

    Prahutama, Alan; Suparti; Wahyu Utami, Tiani

    2018-03-01

    Regression analysis models the relationship between response variables and predictor variables. The parametric approach to regression relies on strict assumptions, whereas a nonparametric regression model does not require an assumed model form. Time series data are observations of a variable recorded over time, so to model a time series by regression, the response and predictor variables must be determined first. The response variable is the value at time t (yt), while the predictor variables are the significant lags. In nonparametric regression modeling, one developing approach is to use Fourier series. One advantage of the nonparametric regression approach using Fourier series is its ability to handle data with a trigonometric pattern. Modeling with Fourier series requires a parameter K, whose value can be determined with the Generalized Cross Validation method. Inflation modeling for the transportation, communication and financial services sector using Fourier series yields an optimal K of 120 parameters with an R-square of 99%, whereas multiple linear regression yields an R-square of 90%.
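
    Choosing K by Generalized Cross Validation can be sketched as follows. The basis (an intercept plus cosine and sine pairs), the GCV formula (mean squared residual divided by the squared fraction of residual degrees of freedom), and the synthetic series are assumptions for illustration; the abstract does not state the exact estimator form.

```python
import numpy as np

def fourier_design(t, K):
    """Columns: intercept, then cos(k t) and sin(k t) for k = 1..K."""
    cols = [np.ones_like(t)]
    for k in range(1, K + 1):
        cols += [np.cos(k * t), np.sin(k * t)]
    return np.column_stack(cols)

def gcv_score(t, y, K):
    """GCV(K) = (RSS / n) / (1 - df / n)^2 for a least-squares Fourier fit."""
    B = fourier_design(t, K)
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)
    rss = np.sum((y - B @ coef) ** 2)
    n, df = len(y), B.shape[1]     # hat-matrix trace equals the column count at full rank
    return (rss / n) / (1 - df / n) ** 2

# Synthetic oscillating series (assumption) with harmonics 1 and 3.
rng = np.random.default_rng(3)
t = np.linspace(0, 2 * np.pi, 120, endpoint=False)
y = 1.0 + 2.0 * np.sin(t) + 0.8 * np.cos(3 * t) + rng.normal(scale=0.2, size=t.size)

scores = {K: gcv_score(t, y, K) for K in range(1, 11)}
best_K = min(scores, key=scores.get)
print(best_K)
```

    Because GCV penalises each added pair of terms, the selected K balances fit against the number of parameters rather than simply maximising R-square.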

  7. A Ranking Approach to Genomic Selection.

    PubMed

    Blondel, Mathieu; Onogi, Akio; Iwata, Hiroyoshi; Ueda, Naonori

    2015-01-01

    Genomic selection (GS) is a recent selective breeding method which uses predictive models based on whole-genome molecular markers. Until now, existing studies formulated GS as the problem of modeling an individual's breeding value for a particular trait of interest, i.e., as a regression problem. To assess predictive accuracy of the model, the Pearson correlation between observed and predicted trait values was used. In this paper, we propose to formulate GS as the problem of ranking individuals according to their breeding value. Our proposed framework allows us to employ machine learning methods for ranking which had previously not been considered in the GS literature. To assess ranking accuracy of a model, we introduce a new measure originating from the information retrieval literature called normalized discounted cumulative gain (NDCG). NDCG rewards more strongly models which assign a high rank to individuals with high breeding value. Therefore, NDCG reflects a prerequisite objective in selective breeding: accurate selection of individuals with high breeding value. We conducted a comparison of 10 existing regression methods and 3 new ranking methods on 6 datasets, consisting of 4 plant species and 25 traits. Our experimental results suggest that tree-based ensemble methods including McRank, Random Forests and Gradient Boosting Regression Trees achieve excellent ranking accuracy. RKHS regression and RankSVM also achieve good accuracy when used with an RBF kernel. Traditional regression methods such as Bayesian lasso, wBSR and BayesC were found less suitable for ranking. Pearson correlation was found to correlate poorly with NDCG. Our study suggests two important messages. First, ranking methods are a promising research direction in GS. Second, NDCG can be a useful evaluation measure for GS.
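
    The NDCG measure described above can be sketched in a few lines. This version assumes a linear gain (the true trait value itself, taken to be non-negative) and a log2 rank discount; the paper may use a different gain function, and the example values are hypothetical.

```python
import numpy as np

def dcg_at_k(gains_in_rank_order, k):
    """Discounted cumulative gain of the first k items, discount log2(rank + 1)."""
    g = np.asarray(gains_in_rank_order, dtype=float)[:k]
    discounts = np.log2(np.arange(2, g.size + 2))
    return float(np.sum(g / discounts))

def ndcg_at_k(y_true, y_score, k):
    """DCG of the predicted ordering divided by DCG of the ideal ordering."""
    y_true = np.asarray(y_true, dtype=float)
    order = np.argsort(y_score)[::-1]          # best predicted individual first
    ideal = np.sort(y_true)[::-1]
    return dcg_at_k(y_true[order], k) / dcg_at_k(ideal, k)

breeding_values = [3.0, 1.0, 0.0, 2.0]         # hypothetical, non-negative
good_ranking = ndcg_at_k(breeding_values, [0.9, 0.2, 0.1, 0.5], k=3)
poor_ranking = ndcg_at_k(breeding_values, [0.1, 0.5, 0.9, 0.2], k=3)
print(good_ranking, poor_ranking)
```

    Because the discount grows with rank, placing a high-value individual near the top contributes more than placing it lower, which is the property the abstract highlights.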

  8. Selecting risk factors: a comparison of discriminant analysis, logistic regression and Cox's regression model using data from the Tromsø Heart Study.

    PubMed

    Brenn, T; Arnesen, E

    1985-01-01

    For comparative evaluation, discriminant analysis, logistic regression and Cox's model were used to select risk factors for total and coronary deaths among 6595 men aged 20-49 followed for 9 years. Groups with mortality between 5 and 93 per 1000 were considered. Discriminant analysis selected variable sets only marginally different from the logistic and Cox methods which always selected the same sets. A time-saving option, offered for both the logistic and Cox selection, showed no advantage compared with discriminant analysis. Analysing more than 3800 subjects, the logistic and Cox methods consumed, respectively, 80 and 10 times more computer time than discriminant analysis. When including the same set of variables in non-stepwise analyses, all methods estimated coefficients that in most cases were almost identical. In conclusion, discriminant analysis is advocated for preliminary or stepwise analysis, otherwise Cox's method should be used.

  9. Forecasting urban water demand: A meta-regression analysis.

    PubMed

    Sebri, Maamar

    2016-12-01

    Water managers and planners require accurate water demand forecasts over the short-, medium- and long-term for many purposes. These range from assessing water supply needs over spatial and temporal patterns to optimizing future investments and planning future allocations across competing sectors. This study surveys the empirical literature on the urban water demand forecasting using the meta-analytical approach. Specifically, using more than 600 estimates, a meta-regression analysis is conducted to identify explanations of cross-studies variation in accuracy of urban water demand forecasting. Our study finds that accuracy depends significantly on study characteristics, including demand periodicity, modeling method, forecasting horizon, model specification and sample size. The meta-regression results remain robust to different estimators employed as well as to a series of sensitivity checks performed. The importance of these findings lies in the conclusions and implications drawn out for regulators and policymakers and for academics alike.

  10. Bayesian median regression for temporal gene expression data

    NASA Astrophysics Data System (ADS)

    Yu, Keming; Vinciotti, Veronica; Liu, Xiaohui; 't Hoen, Peter A. C.

    2007-09-01

    Most of the existing methods for the identification of biologically interesting genes in a temporal expression profiling dataset do not fully exploit the temporal ordering in the dataset and are based on normality assumptions for the gene expression. In this paper, we introduce a Bayesian median regression model to detect genes whose temporal profile is significantly different across a number of biological conditions. The regression model is defined by a polynomial function where both time and condition effects as well as interactions between the two are included. MCMC-based inference returns the posterior distribution of the polynomial coefficients. From this a simple Bayes factor test is proposed to test for significance. The estimation of the median rather than the mean, and within a Bayesian framework, increases the robustness of the method compared to a Hotelling T²-test previously suggested. This is shown on simulated data and on muscular dystrophy gene expression data.

  11. Applications of MIDAS regression in analysing trends in water quality

    NASA Astrophysics Data System (ADS)

    Penev, Spiridon; Leonte, Daniela; Lazarov, Zdravetz; Mann, Rob A.

    2014-04-01

    We discuss novel statistical methods in analysing trends in water quality. Such analysis uses complex data sets of different classes of variables, including water quality, hydrological and meteorological. We analyse the effect of rainfall and flow on trends in water quality utilising a flexible model called Mixed Data Sampling (MIDAS). This model arises because of the mixed frequency in the data collection. Typically, water quality variables are sampled fortnightly, whereas the rain data is sampled daily. The advantage of using MIDAS regression is in the flexible and parsimonious modelling of the influence of the rain and flow on trends in water quality variables. We discuss the model and its implementation on a data set from the Shoalhaven Supply System and Catchments in the state of New South Wales, Australia. Information criteria indicate that MIDAS modelling improves upon simplistic approaches that do not utilise the mixed data sampling nature of the data.
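
    The mixed-frequency step in a MIDAS regression can be sketched as a weighted aggregation of daily values into one regressor per fortnightly sample. The exponential Almon weighting used here is a common choice in the MIDAS literature but is an assumption, since the abstract does not name the weighting scheme; the parameter values and rainfall series are synthetic.

```python
import numpy as np

def exp_almon_weights(theta1, theta2, n_lags):
    """Normalised exponential Almon lag weights for lags 1..n_lags."""
    j = np.arange(1, n_lags + 1)
    raw = np.exp(theta1 * j + theta2 * j ** 2)
    return raw / raw.sum()

def midas_regressor(daily, n_lags, step, weights):
    """Weighted sum of the n_lags daily values preceding each low-frequency sample."""
    ends = np.arange(n_lags, daily.size + 1, step)     # index just past each window
    # Reverse each window so weights[0] applies to the most recent day.
    return np.array([weights @ daily[e - n_lags:e][::-1] for e in ends])

rng = np.random.default_rng(4)
rain = rng.gamma(shape=0.5, scale=4.0, size=14 * 60)   # synthetic daily rainfall
w = exp_almon_weights(theta1=0.1, theta2=-0.05, n_lags=14)
x_fortnightly = midas_regressor(rain, n_lags=14, step=14, weights=w)
print(x_fortnightly.shape)
```

    In a full MIDAS fit the two weight parameters are estimated jointly with the regression coefficients, typically by nonlinear least squares, so only two parameters describe the influence of all 14 daily lags.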

  12. The hemispherical asymmetry of the residual polar caps on Mars

    NASA Technical Reports Server (NTRS)

    Lindner, Bernhard Lee

    1991-01-01

    A model of the polar caps of Mars was created which allows: (1) for light penetration into the cap; (2) ice albedo to vary with age, latitude, hemisphere, dust content, and solar zenith angle; and (3) for diurnal variability. The model includes the radiative effects of clouds and dust, and heat transport as represented by a thermal wind. The model reproduces polar cap regression data very well, including the survival of CO2 frost at the south pole and reproduces the general trend in the Viking Lander pressure data.

  13. Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors.

    PubMed

    Woodard, Dawn B; Crainiceanu, Ciprian; Ruppert, David

    2013-01-01

    We propose a new method for regression using a parsimonious and scientifically interpretable representation of functional predictors. Our approach is designed for data that exhibit features such as spikes, dips, and plateaus whose frequency, location, size, and shape varies stochastically across subjects. We propose Bayesian inference of the joint functional and exposure models, and give a method for efficient computation. We contrast our approach with existing state-of-the-art methods for regression with functional predictors, and show that our method is more effective and efficient for data that include features occurring at varying locations. We apply our methodology to a large and complex dataset from the Sleep Heart Health Study, to quantify the association between sleep characteristics and health outcomes. Software and technical appendices are provided in online supplemental materials.

  14. PREDICTION OF MALIGNANT BREAST LESIONS FROM MRI FEATURES: A COMPARISON OF ARTIFICIAL NEURAL NETWORK AND LOGISTIC REGRESSION TECHNIQUES

    PubMed Central

    McLaren, Christine E.; Chen, Wen-Pin; Nie, Ke; Su, Min-Ying

    2009-01-01

    Rationale and Objectives: Dynamic contrast enhanced MRI (DCE-MRI) is a clinical imaging modality for detection and diagnosis of breast lesions. Analytical methods were compared for diagnostic feature selection and performance of lesion classification to differentiate between malignant and benign lesions in patients. Materials and Methods: The study included 43 malignant and 28 benign histologically-proven lesions. Eight morphological parameters, ten gray level co-occurrence matrices (GLCM) texture features, and fourteen Laws’ texture features were obtained using automated lesion segmentation and quantitative feature extraction. Artificial neural network (ANN) and logistic regression analysis were compared for selection of the best predictors of malignant lesions among the normalized features. Results: Using ANN, the final four selected features were compactness, energy, homogeneity, and Law_LS, with area under the receiver operating characteristic curve (AUC) = 0.82, and accuracy = 0.76. The diagnostic performance of these four features computed on the basis of logistic regression yielded AUC = 0.80 (95% CI, 0.688 to 0.905), similar to that of ANN. The analysis also shows that the odds of a malignant lesion decreased by 48% (95% CI, 25% to 92%) for every increase of 1 SD in the Law_LS feature, adjusted for differences in compactness, energy, and homogeneity. Using logistic regression with z-score transformation, a model comprised of compactness, NRL entropy, and gray level sum average was selected, and it had the highest overall accuracy of 0.75 among all models, with AUC = 0.77 (95% CI, 0.660 to 0.880). When logistic modeling of transformations using the Box-Cox method was performed, the most parsimonious model with predictors, compactness and Law_LS, had an AUC of 0.79 (95% CI, 0.672 to 0.898). Conclusion: The diagnostic performance of models selected by ANN and logistic regression was similar. The analytic methods were found to be roughly equivalent in terms of predictive ability when a small number of variables were chosen. The robust ANN methodology utilizes a sophisticated non-linear model, while logistic regression analysis provides insightful information to enhance interpretation of the model features. PMID:19409817

  15. Developing a Model for Forecasting Road Traffic Accident (RTA) Fatalities in Yemen

    NASA Astrophysics Data System (ADS)

    Karim, Fareed M. A.; Abdo Saleh, Ali; Taijoobux, Aref; Ševrović, Marko

    2017-12-01

    The aim of this paper is to develop a model for forecasting RTA fatalities in Yemen. Yearly fatalities were modeled as the dependent variable, while the candidate independent variables were population, number of vehicles, GNP, GDP and real GDP per capita. All these variables were found to be highly correlated (correlation coefficient r ≈ 0.9); to avoid multicollinearity in the model, the single variable with the highest r value (real GDP per capita) was selected. A simple regression model was developed; the model fit was very good (R² = 0.916); however, the residuals were serially correlated. The Prais-Winsten procedure was used to overcome this violation of the regression assumption. The data for a 20-year period from 1991-2010 were analyzed to build the model; the model was validated by using data for the years 2011-2013; the historical fit for the period 1991-2011 was very good. Also, the validation for 2011-2013 proved accurate.
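
    The Prais-Winsten correction for serially correlated residuals can be sketched as an iterated feasible generalised least squares fit under an AR(1) error assumption. The implementation below is a minimal NumPy illustration on synthetic data; the iteration count, clipping bounds, and simulated coefficients are assumptions, and it is not the paper's fitted model.

```python
import numpy as np

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def prais_winsten(x, y, n_iter=20):
    """Iterated Prais-Winsten fit of y = b0 + b1 x + u with u_t = rho u_{t-1} + e_t."""
    X = np.column_stack([np.ones_like(x), x])
    beta = ols(X, y)
    rho = 0.0
    for _ in range(n_iter):
        u = y - X @ beta
        rho = (u[1:] @ u[:-1]) / (u[:-1] @ u[:-1])     # AR(1) coefficient of residuals
        rho = float(np.clip(rho, -0.99, 0.99))
        s = np.sqrt(1 - rho ** 2)
        # Quasi-difference rows 2..n; rescale the first row instead of dropping it.
        Xs = np.vstack([s * X[0], X[1:] - rho * X[:-1]])
        ys = np.concatenate([[s * y[0]], y[1:] - rho * y[:-1]])
        beta = ols(Xs, ys)
    return beta, rho

# Synthetic series (assumption) with AR(1) errors, true rho = 0.7.
rng = np.random.default_rng(5)
n = 200
x = np.linspace(0, 10, n)
e = rng.normal(scale=0.5, size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.7 * u[t - 1] + e[t]
y = 2.0 + 1.5 * x + u

beta, rho = prais_winsten(x, y)
print(beta, rho)
```

    Keeping the rescaled first observation is what distinguishes Prais-Winsten from the Cochrane-Orcutt procedure, which matters for short series such as a 20-year record.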

  16. Comparing statistical and machine learning classifiers: alternatives for predictive modeling in human factors research.

    PubMed

    Carnahan, Brian; Meyer, Gérard; Kuntz, Lois-Ann

    2003-01-01

    Multivariate classification models play an increasingly important role in human factors research. In the past, these models have been based primarily on discriminant analysis and logistic regression. Models developed from machine learning research offer the human factors professional a viable alternative to these traditional statistical classification methods. To illustrate this point, two machine learning approaches--genetic programming and decision tree induction--were used to construct classification models designed to predict whether or not a student truck driver would pass his or her commercial driver license (CDL) examination. The models were developed and validated using the curriculum scores and CDL exam performances of 37 student truck drivers who had completed a 320-hr driver training course. Results indicated that the machine learning classification models were superior to discriminant analysis and logistic regression in terms of predictive accuracy. Actual or potential applications of this research include the creation of models that more accurately predict human performance outcomes.

  17. Quantitative imaging features of pretreatment CT predict volumetric response to chemotherapy in patients with colorectal liver metastases.

    PubMed

    Creasy, John M; Midya, Abhishek; Chakraborty, Jayasree; Adams, Lauryn B; Gomes, Camilla; Gonen, Mithat; Seastedt, Kenneth P; Sutton, Elizabeth J; Cercek, Andrea; Kemeny, Nancy E; Shia, Jinru; Balachandran, Vinod P; Kingham, T Peter; Allen, Peter J; DeMatteo, Ronald P; Jarnagin, William R; D'Angelica, Michael I; Do, Richard K G; Simpson, Amber L

    2018-06-19

    This study investigates whether quantitative image analysis of pretreatment CT scans can predict volumetric response to chemotherapy for patients with colorectal liver metastases (CRLM). Patients treated with chemotherapy for CRLM (hepatic artery infusion (HAI) combined with systemic or systemic alone) were included in the study. Patients were imaged at baseline and approximately 8 weeks after treatment. Response was measured as the percentage change in tumour volume from baseline. Quantitative imaging features were derived from the index hepatic tumour on pretreatment CT, and features statistically significant on univariate analysis were included in a linear regression model to predict volumetric response. The regression model was constructed from 70% of data, while 30% were reserved for testing. Test data were input into the trained model. Model performance was evaluated with mean absolute prediction error (MAPE) and R2. Clinicopathologic factors were assessed for correlation with response. 157 patients were included, split into training (n = 110) and validation (n = 47) sets. MAPE from the multivariate linear regression model was 16.5% (R2 = 0.774) and 21.5% in the training and validation sets, respectively. Stratified by HAI utilisation, MAPE in the validation set was 19.6% for HAI and 25.1% for systemic chemotherapy alone. Clinical factors associated with differences in median tumour response were treatment strategy, systemic chemotherapy regimen, age and KRAS mutation status (p < 0.05). Quantitative imaging features extracted from pretreatment CT are promising predictors of volumetric response to chemotherapy in patients with CRLM. Pretreatment predictors of response have the potential to better select patients for specific therapies.
    • Colorectal liver metastases (CRLM) are downsized with chemotherapy but predicting the patients that will respond to chemotherapy is currently not possible.
    • Heterogeneity and enhancement patterns of CRLM can be measured with quantitative imaging.
    • Prediction model constructed that predicts volumetric response with 20% error suggesting that quantitative imaging holds promise to better select patients for specific treatments.
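
    The 70/30 split and the MAPE metric described in this abstract can be illustrated with a short sketch. The abstract does not give the formula, so this example assumes MAPE is the mean absolute difference between observed and predicted percentage change in tumour volume; the function names and the seeded shuffle are hypothetical.

```python
import random


def split_train_test(records, train_fraction=0.7, seed=0):
    """Shuffle records reproducibly and split them into training and test sets."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def mean_absolute_prediction_error(observed, predicted):
    """Average absolute gap between observed and predicted response.

    Assumption: both inputs are percentage changes in tumour volume,
    so the result is expressed in percentage points.
    """
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


# With 157 patients, a 70% split gives the 110/47 partition reported above.
train, test = split_train_test(range(157))
print(len(train), len(test))
```

    Reporting the metric separately on the training and test partitions, as the abstract does, shows how much of the fit survives on unseen patients.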

  18. The effect of occlusion on the semantics of projective spatial terms: a case study in grounding language in perception.

    PubMed

    Kelleher, John D; Ross, Robert J; Sloan, Colm; Mac Namee, Brian

    2011-02-01

    Although data-driven spatial template models provide a practical and cognitively motivated mechanism for characterizing spatial term meaning, the influence of perceptual rather than solely geometric and functional properties has yet to be systematically investigated. In the light of this, in this paper, we investigate the effects of the perceptual phenomenon of object occlusion on the semantics of projective terms. We did this by conducting a study to test whether object occlusion had a noticeable effect on the acceptance values assigned to projective terms with respect to a 2.5-dimensional visual stimulus. Based on the data collected, a regression model was constructed and presented. Subsequent analysis showed that the regression model that included the occlusion factor outperformed an adaptation of Regier & Carlson's well-regarded AVS model for that same spatial configuration.

  19. The Colorectal Cancer Mortality-to-Incidence Ratio as an Indicator of Global Cancer Screening and Care

    PubMed Central

    Sunkara, Vasu; Hébert, James R.

    2015-01-01

    BACKGROUND Disparities in cancer screening, incidence, treatment, and survival are worsening globally. The mortality-to-incidence ratio (MIR) has been used previously to evaluate such disparities. METHODS The MIR for colorectal cancer is calculated for all Organisation for Economic Cooperation and Development (OECD) countries using the 2012 GLOBOCAN incidence and mortality statistics. Health system rankings were obtained from the World Health Organization. Two linear regression models were fit with the MIR as the dependent variable and health system ranking as the independent variable; one included all countries and one model had the “divergents” removed. RESULTS The regression model for all countries explained 24% of the total variance in the MIR. Nine countries were found to have regression-calculated MIRs that differed from the actual MIR by >20%. Countries with lower-than-expected MIRs were found to have strong national health systems characterized by formal colorectal cancer screening programs. Conversely, countries with higher-than-expected MIRs lack screening programs. When these divergent points were removed from the data set, the recalculated regression model explained 60% of the total variance in the MIR. CONCLUSIONS The MIR proved useful for identifying disparities in cancer screening and treatment internationally. It has potential as an indicator of the long-term success of cancer surveillance programs and may be extended to other cancer types for these purposes. PMID:25572676
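
    The procedure in this abstract (fit a line, flag countries whose predicted MIR differs from the actual MIR by more than 20%, then refit without them) can be sketched as below. The abstract does not say whether the 20% is relative to the actual or the predicted MIR; this sketch assumes the actual MIR. All function names and data values are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(x, y, intercept, slope):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def divergent_indices(x, y, intercept, slope, threshold=0.20):
    """Indices where the prediction misses the actual value by more than `threshold`.

    Assumption: the gap is measured relative to the actual value.
    """
    return [
        i
        for i, (xi, yi) in enumerate(zip(x, y))
        if abs((intercept + slope * xi) - yi) / yi > threshold
    ]


# Hypothetical rankings and MIR values, for illustration only.
ranking = [1, 2, 3, 4, 5]
mir = [0.30, 0.32, 0.20, 0.36, 0.38]
b0, b1 = fit_line(ranking, mir)
flagged = divergent_indices(ranking, mir, b0, b1)
kept = [i for i in range(len(mir)) if i not in flagged]
b0_refit, b1_refit = fit_line([ranking[i] for i in kept], [mir[i] for i in kept])
```

    Because the divergent points are removed using the same model that is then refitted, the higher explained variance after removal is expected by construction and is best read as descriptive.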

  20. The colorectal cancer mortality-to-incidence ratio as an indicator of global cancer screening and care.

    PubMed

    Sunkara, Vasu; Hébert, James R

    2015-05-15

    Disparities in cancer screening, incidence, treatment, and survival are worsening globally. The mortality-to-incidence ratio (MIR) has been used previously to evaluate such disparities. The MIR for colorectal cancer is calculated for all Organisation for Economic Cooperation and Development (OECD) countries using the 2012 GLOBOCAN incidence and mortality statistics. Health system rankings were obtained from the World Health Organization. Two linear regression models were fit with the MIR as the dependent variable and health system ranking as the independent variable; one included all countries and one model had the "divergents" removed. The regression model for all countries explained 24% of the total variance in the MIR. Nine countries were found to have regression-calculated MIRs that differed from the actual MIR by >20%. Countries with lower-than-expected MIRs were found to have strong national health systems characterized by formal colorectal cancer screening programs. Conversely, countries with higher-than-expected MIRs lack screening programs. When these divergent points were removed from the data set, the recalculated regression model explained 60% of the total variance in the MIR. The MIR proved useful for identifying disparities in cancer screening and treatment internationally. It has potential as an indicator of the long-term success of cancer surveillance programs and may be extended to other cancer types for these purposes. © 2015 American Cancer Society.

  1. Development of hybrid genetic-algorithm-based neural networks using regression trees for modeling air quality inside a public transportation bus.

    PubMed

    Kadiyala, Akhil; Kaur, Devinder; Kumar, Ashok

    2013-02-01

    The present study developed a novel approach to modeling indoor air quality (IAQ) of a public transportation bus by the development of hybrid genetic-algorithm-based neural networks (also known as evolutionary neural networks) with input variables optimized using regression trees, referred to as the GART approach. This study validated the applicability of the GART modeling approach in solving complex nonlinear systems by accurately predicting the monitored contaminants of carbon dioxide (CO2), carbon monoxide (CO), nitric oxide (NO), sulfur dioxide (SO2), 0.3-0.4 μm sized particle numbers, 0.4-0.5 μm sized particle numbers, particulate matter (PM) concentrations less than 1.0 μm (PM1.0), and PM concentrations less than 2.5 μm (PM2.5) inside a public transportation bus operating on 20% grade biodiesel in Toledo, OH. First, the important variables affecting each monitored in-bus contaminant were determined using regression trees. Second, the analysis of variance was used as a complementary sensitivity analysis to the regression tree results to determine a subset of statistically significant variables affecting each monitored in-bus contaminant. Finally, the identified subsets of statistically significant variables were used as inputs to develop three artificial neural network (ANN) models. The models developed were regression tree-based back-propagation network (BPN-RT), regression tree-based radial basis function network (RBFN-RT), and GART models. Performance measures were used to validate the predictive capacity of the developed IAQ models. The results from this approach were compared with the results obtained from using a theoretical approach and a generalized practicable approach to modeling IAQ that included the consideration of additional independent variables when developing the aforementioned ANN models. The hybrid GART models were able to capture the majority of the variance in the monitored in-bus contaminants. 
The genetic-algorithm-based neural network IAQ models outperformed the traditional ANN methods of the back-propagation and the radial basis function networks. The novelty of this research is the development of a novel approach to modeling vehicular indoor air quality by integration of the advanced methods of genetic algorithms, regression trees, and the analysis of variance for the monitored in-vehicle gaseous and particulate matter contaminants, and comparing the results obtained from using the developed approach with conventional artificial intelligence techniques of back propagation networks and radial basis function networks. This study validated the newly developed approach using holdout and threefold cross-validation methods. These results are of great interest to scientists, researchers, and the public in understanding the various aspects of modeling an indoor microenvironment. This methodology can easily be extended to other fields of study also.

  2. A methodology for the design of experiments in computational intelligence with multiple regression models.

    PubMed

    Fernandez-Lozano, Carlos; Gestal, Marcos; Munteanu, Cristian R; Dorado, Julian; Pazos, Alejandro

    2016-01-01

    The design of experiments and the validation of the results achieved with them are vital in any research study. This paper focuses on the use of different Machine Learning approaches for regression tasks in the field of Computational Intelligence and especially on a correct comparison between the different results provided for different methods, as those techniques are complex systems that require further study to be fully understood. A methodology commonly accepted in Computational intelligence is implemented in an R package called RRegrs. This package includes ten simple and complex regression models to carry out predictive modeling using Machine Learning and well-known regression algorithms. The framework for experimental design presented herein is evaluated and validated against RRegrs. Our results are different for three out of five state-of-the-art simple datasets and it can be stated that the selection of the best model according to our proposal is statistically significant and relevant. It is of relevance to use a statistical approach to indicate whether the differences are statistically significant using this kind of algorithms. Furthermore, our results with three real complex datasets report different best models than with the previously published methodology. Our final goal is to provide a complete methodology for the use of different steps in order to compare the results obtained in Computational Intelligence problems, as well as from other fields, such as for bioinformatics, cheminformatics, etc., given that our proposal is open and modifiable.

  3. A methodology for the design of experiments in computational intelligence with multiple regression models

    PubMed Central

    Gestal, Marcos; Munteanu, Cristian R.; Dorado, Julian; Pazos, Alejandro

    2016-01-01

    The design of experiments and the validation of the results achieved with them are vital in any research study. This paper focuses on the use of different Machine Learning approaches for regression tasks in the field of Computational Intelligence and especially on a correct comparison between the different results provided for different methods, as those techniques are complex systems that require further study to be fully understood. A methodology commonly accepted in Computational intelligence is implemented in an R package called RRegrs. This package includes ten simple and complex regression models to carry out predictive modeling using Machine Learning and well-known regression algorithms. The framework for experimental design presented herein is evaluated and validated against RRegrs. Our results are different for three out of five state-of-the-art simple datasets and it can be stated that the selection of the best model according to our proposal is statistically significant and relevant. It is of relevance to use a statistical approach to indicate whether the differences are statistically significant using this kind of algorithms. Furthermore, our results with three real complex datasets report different best models than with the previously published methodology. Our final goal is to provide a complete methodology for the use of different steps in order to compare the results obtained in Computational Intelligence problems, as well as from other fields, such as for bioinformatics, cheminformatics, etc., given that our proposal is open and modifiable. PMID:27920952

  4. Applying Kaplan-Meier to Item Response Data

    ERIC Educational Resources Information Center

    McNeish, Daniel

    2018-01-01

    Some IRT models can be equivalently modeled in alternative frameworks such as logistic regression. Logistic regression can also model time-to-event data, which concerns the probability of an event occurring over time. Using the relation between time-to-event models and logistic regression and the relation between logistic regression and IRT, this…

  5. Data to support statistical modeling of instream nutrient load based on watershed attributes, southeastern United States, 2002

    USGS Publications Warehouse

    Hoos, Anne B.; Terziotti, Silvia; McMahon, Gerard; Savvas, Katerina; Tighe, Kirsten C.; Alkons-Wolinsky, Ruth

    2008-01-01

    This report presents and describes the digital datasets that characterize nutrient source inputs, environmental characteristics, and instream nutrient loads for the purpose of calibrating and applying a nutrient water-quality model for the southeastern United States for 2002. The model area includes all of the river basins draining to the south Atlantic and the eastern Gulf of Mexico, as well as the Tennessee River basin (referred to collectively as the SAGT area). The water-quality model SPARROW (SPAtially-Referenced Regression On Watershed attributes), developed by the U.S. Geological Survey, uses a regression equation to describe the relation between watershed attributes (predictors) and measured instream loads (response). Watershed attributes that are considered to describe nutrient input conditions and are tested in the SPARROW model for the SAGT area as source variables include atmospheric deposition, fertilizer application to farmland, manure from livestock production, permitted wastewater discharge, and land cover. Watershed and channel attributes that are considered to affect rates of nutrient transport from land to water and are tested in the SAGT SPARROW model as nutrient-transport variables include characteristics of soil, landform, climate, reach time of travel, and reservoir hydraulic loading. Datasets with estimates of each of these attributes for each individual reach or catchment in the reach-catchment network are presented in this report, along with descriptions of methods used to produce them. Measurements of nutrient water quality at stream monitoring sites from a combination of monitoring programs were used to develop observations of the response variable - mean annual nitrogen or phosphorus load - in the SPARROW regression equation. 
Instream load of nitrogen and phosphorus was estimated using bias-corrected log-linear regression models using the program Fluxmaster, which provides temporally detrended estimates of long-term mean load well-suited for spatial comparisons. The detrended, or normalized, estimates of load are useful for regional-scale assessments but should be used with caution for local-scale interpretations, for which use of loads estimated for actual time periods and employing more detailed regression analysis is suggested. The mean value of the nitrogen yield estimates, normalized to 2002, for 637 stations in the SAGT area is 4.7 kilograms per hectare; the mean value of nitrogen flow-weighted mean concentration is 1.2 milligrams per liter. The mean value of the phosphorus yield estimates, normalized to 2002, for the 747 stations in the SAGT area is 0.66 kilogram per hectare; the mean value of phosphorus flow-weighted mean concentration is 0.17 milligram per liter. Nutrient conditions measured in streams affected by substantial influx or outflux of water and nutrient mass across surface-water basin divides do not reflect nutrient source and transport conditions in the topographic watershed; therefore, inclusion of such streams in the SPARROW modeling approach is considered inappropriate. River basins identified with this concern include south Florida (where surface-water flow paths have been extensively altered) and the Oklawaha, Crystal, Lower Sante Fe, Lower Suwanee, St. Marks, and Chipola River basins in central and northern Florida (where flow exchange with the underlying regional aquifer may represent substantial nitrogen influx to and outflux from the surface-water basins).

  6. Local Composite Quantile Regression Smoothing for Harris Recurrent Markov Processes

    PubMed Central

    Li, Degui; Li, Runze

    2016-01-01

    In this paper, we study the local polynomial composite quantile regression (CQR) smoothing method for the nonlinear and nonparametric models under the Harris recurrent Markov chain framework. The local polynomial CQR regression method is a robust alternative to the widely-used local polynomial method, and has been well studied in stationary time series. In this paper, we relax the stationarity restriction on the model, and allow that the regressors are generated by a general Harris recurrent Markov process which includes both the stationary (positive recurrent) and nonstationary (null recurrent) cases. Under some mild conditions, we establish the asymptotic theory for the proposed local polynomial CQR estimator of the mean regression function, and show that the convergence rate for the estimator in nonstationary case is slower than that in stationary case. Furthermore, a weighted type local polynomial CQR estimator is provided to improve the estimation efficiency, and a data-driven bandwidth selection is introduced to choose the optimal bandwidth involved in the nonparametric estimators. Finally, we give some numerical studies to examine the finite sample performance of the developed methodology and theory. PMID:27667894

  7. Nonparametric Stochastic Model for Uncertainty Quantification of Short-term Wind Speed Forecasts

    NASA Astrophysics Data System (ADS)

    AL-Shehhi, A. M.; Chaouch, M.; Ouarda, T.

    2014-12-01

    Wind energy is increasing in importance as a renewable energy source due to its potential role in reducing carbon emissions. It is a safe, clean, and inexhaustible source of energy. The amount of wind energy generated by wind turbines is closely related to the wind speed. Wind speed forecasting plays a vital role in the wind energy sector in terms of wind turbine optimal operation, wind energy dispatch and scheduling, efficient energy harvesting, etc. It is also considered during planning, design, and assessment of any proposed wind project. Therefore, accurate prediction of wind speed is of particular importance to the wind industry. Many methods have been proposed in the literature for short-term wind speed forecasting. These methods are usually based on modeling historical fixed time intervals of the wind speed data and using them for future prediction. The methods mainly include statistical models such as ARMA and ARIMA, physical models such as numerical weather prediction, and artificial intelligence techniques such as support vector machines and neural networks. In this paper, we are interested in estimating hourly wind speed measures in the United Arab Emirates (UAE). More precisely, we predict hourly wind speed using a nonparametric kernel estimation of the regression and volatility functions pertaining to a nonlinear autoregressive model with an ARCH model, which includes an unknown nonlinear regression function and a volatility function already discussed in the literature. The unknown nonlinear regression function describes the dependence between the value of the wind speed at time t and its historical data at times t - 1, t - 2, … , t - d. This function plays a key role in predicting the hourly wind speed process. The volatility function, i.e., the conditional variance given the past, measures the risk associated with this prediction. 
    Since the regression and the volatility functions are supposed to be unknown, they are estimated using nonparametric kernel methods. In addition to the pointwise hourly wind speed forecasts, a confidence interval is also provided, which allows the uncertainty around the forecasts to be quantified.
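
    A kernel estimate of the regression and volatility functions of this kind can be sketched for the simplest case of a single lag (d = 1). The abstract does not specify the kernel, the bandwidth, or how the interval is built, so the Gaussian kernel, the bandwidth argument h, and the normal-quantile interval below are assumptions; all function names are hypothetical.

```python
import math


def gaussian_kernel(u):
    """Unnormalised Gaussian kernel (the constant cancels in the ratio below)."""
    return math.exp(-0.5 * u * u)


def nw_estimate(x0, xs, ys, h):
    """Nadaraya-Watson estimate of E[Y | X = x0] with bandwidth h."""
    weights = [gaussian_kernel((x0 - xi) / h) for xi in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no observations near x0; increase the bandwidth")
    return sum(w * yi for w, yi in zip(weights, ys)) / total


def forecast_with_interval(series, h, z=1.96):
    """One-step-ahead forecast and interval from a lag-1 model.

    Regression function: kernel regression of X_t on X_{t-1}.
    Volatility function: kernel regression of the squared residuals on X_{t-1}.
    Assumption: the interval uses a normal quantile z.
    """
    lagged, current = series[:-1], series[1:]
    squared_residuals = [
        (y - nw_estimate(x, lagged, current, h)) ** 2
        for x, y in zip(lagged, current)
    ]
    last = series[-1]
    mean = nw_estimate(last, lagged, current, h)
    variance = nw_estimate(last, lagged, squared_residuals, h)
    half_width = z * math.sqrt(variance)
    return mean, mean - half_width, mean + half_width
```

    In practice the bandwidth would be chosen from the data, for example by cross-validation, and more lags would enter the kernel through a multivariate distance.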

  8. A spatially explicit approach to the study of socio-demographic inequality in the spatial distribution of trees across Boston neighborhoods.

    PubMed

    Duncan, Dustin T; Kawachi, Ichiro; Kum, Susan; Aldstadt, Jared; Piras, Gianfranco; Matthews, Stephen A; Arbia, Giuseppe; Castro, Marcia C; White, Kellee; Williams, David R

    2014-04-01

    The racial/ethnic and income composition of neighborhoods often influences local amenities, including the potential spatial distribution of trees, which are important for population health and community wellbeing, particularly in urban areas. This ecological study used spatial analytical methods to assess the relationship between neighborhood socio-demographic characteristics (i.e. minority racial/ethnic composition and poverty) and tree density at the census tract level in Boston, Massachusetts (US). We examined spatial autocorrelation with the Global Moran's I for all study variables and in the ordinary least squares (OLS) regression residuals as well as computed Spearman correlations non-adjusted and adjusted for spatial autocorrelation between socio-demographic characteristics and tree density. Next, we fit traditional regressions (i.e. OLS regression models) and spatial regressions (i.e. spatial simultaneous autoregressive models), as appropriate. We found significant positive spatial autocorrelation for all neighborhood socio-demographic characteristics (Global Moran's I range from 0.24 to 0.86, all P =0.001), for tree density (Global Moran's I =0.452, P =0.001), and in the OLS regression residuals (Global Moran's I range from 0.32 to 0.38, all P <0.001). Therefore, we fit the spatial simultaneous autoregressive models. There was a negative correlation between neighborhood percent non-Hispanic Black and tree density (r S =-0.19; conventional P -value=0.016; spatially adjusted P -value=0.299) as well as a negative correlation between predominantly non-Hispanic Black (over 60% Black) neighborhoods and tree density (r S =-0.18; conventional P -value=0.019; spatially adjusted P -value=0.180). 
While the conventional OLS regression model found a marginally significant inverse relationship between Black neighborhoods and tree density, we found no statistically significant relationship between neighborhood socio-demographic composition and tree density in the spatial regression models. Methodologically, our study suggests the need to take into account spatial autocorrelation as findings/conclusions can change when the spatial autocorrelation is ignored. Substantively, our findings suggest no need for policy intervention vis-à-vis trees in Boston, though we hasten to add that replication studies, and more nuanced data on tree quality, age and diversity are needed.
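
    The Global Moran's I statistic used throughout this abstract can be computed directly from its definition. The sketch below is a minimal illustration with a hypothetical function name and a toy four-unit neighbourhood structure; it does not reproduce the study's weights matrix or its significance tests.

```python
def morans_i(values, weights):
    """Global Moran's I for values observed on n spatial units.

    weights[i][j] is the spatial weight between units i and j
    (0 when they are not neighbours, and 0 on the diagonal).
    """
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    weight_sum = sum(sum(row) for row in weights)
    cross_products = sum(
        weights[i][j] * deviations[i] * deviations[j]
        for i in range(n)
        for j in range(n)
    )
    sum_of_squares = sum(d * d for d in deviations)
    return (n / weight_sum) * (cross_products / sum_of_squares)


# Toy example: four tracts in a row, each adjacent only to its neighbours.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
smooth = morans_i([1.0, 2.0, 3.0, 4.0], chain)        # similar neighbours
alternating = morans_i([1.0, -1.0, 1.0, -1.0], chain)  # dissimilar neighbours
```

    Positive values indicate that neighbouring units tend to have similar values, which is the situation that invalidates the independence assumption of the OLS models in the abstract.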

  9. Immortal time bias in observational studies of time-to-event outcomes.

    PubMed

    Jones, Mark; Fowler, Robert

    2016-12-01

    The purpose of the study is to show, through simulation and example, the magnitude and direction of immortal time bias when an inappropriate analysis is used. We compare 4 methods of analysis for observational studies of time-to-event outcomes: logistic regression, standard Cox model, landmark analysis, and time-dependent Cox model using an example data set of patients critically ill with influenza and a simulation study. For the example data set, logistic regression, standard Cox model, and landmark analysis all showed some evidence that treatment with oseltamivir provides protection from mortality in patients critically ill with influenza. However, when the time-dependent nature of treatment exposure is taken account of using a time-dependent Cox model, there is no longer evidence of a protective effect of treatment. The simulation study showed that, under various scenarios, the time-dependent Cox model consistently provides unbiased treatment effect estimates, whereas standard Cox model leads to bias in favor of treatment. Logistic regression and landmark analysis may also lead to bias. To minimize the risk of immortal time bias in observational studies of survival outcomes, we strongly suggest time-dependent exposures be included as time-dependent variables in hazard-based analyses. Copyright © 2016 Elsevier Inc. All rights reserved.

  10. Construction of multiple linear regression models using blood biomarkers for selecting against abdominal fat traits in broilers.

    PubMed

    Dong, J Q; Zhang, X Y; Wang, S Z; Jiang, X F; Zhang, K; Ma, G W; Wu, M Q; Li, H; Zhang, H

    2018-01-01

    Plasma very low-density lipoprotein (VLDL) can be used to select for low body fat or abdominal fat (AF) in broilers, but its correlation with AF is limited. We investigated whether any other biochemical indicator can be used in combination with VLDL for a better selective effect. Nineteen plasma biochemical indicators were measured in male chickens from the Northeast Agricultural University broiler lines divergently selected for AF content (NEAUHLF) in the fed state at 46 and 48 d of age. The average concentration of every parameter for the 2 d was used for statistical analysis. Levels of these 19 plasma biochemical parameters were compared between the lean and fat lines. The phenotypic correlations between these plasma biochemical indicators and AF traits were analyzed. Then, multiple linear regression models were constructed to select the best model used for selecting against AF content, and the heritabilities of plasma indicators contained in the best models were estimated. The results showed that 11 plasma biochemical indicators (triglycerides, total bile acid, total protein, globulin, albumin/globulin, aspartate transaminase, alanine transaminase, gamma-glutamyl transpeptidase, uric acid, creatinine, and VLDL) differed significantly between the lean and fat lines (P < 0.01), and correlated significantly with AF traits (P < 0.05). The best multiple linear regression models based on albumin/globulin, VLDL, triglycerides, globulin, total bile acid, and uric acid, had higher R2 (0.73) than the model based only on VLDL (0.21). The plasma parameters included in the best models had moderate heritability estimates (0.21 ≤ h2 ≤ 0.43). These results indicate that these multiple linear regression models can be used to select for lean broiler chickens. © 2017 Poultry Science Association Inc.

  11. Visual abilities distinguish pitchers from hitters in professional baseball.

    PubMed

    Klemish, David; Ramger, Benjamin; Vittetoe, Kelly; Reiter, Jerome P; Tokdar, Surya T; Appelbaum, Lawrence Gregory

    2018-01-01

    This study aimed to evaluate the possibility that differences in sensorimotor abilities exist between hitters and pitchers in a large cohort of baseball players of varying levels of experience. Secondary data analysis was performed on 9 sensorimotor tasks comprising the Nike Sensory Station assessment battery. Bayesian hierarchical regression modelling was applied to test for differences between pitchers and hitters in data from 566 baseball players (112 high school, 85 college, 369 professional) collected at 20 testing centres. Explanatory variables including height, handedness, eye dominance, concussion history, and player position were modelled along with age curves using basis regression splines. Regression analyses revealed better performance for hitters relative to pitchers at the professional level in the visual clarity and depth perception tasks, but these differences did not exist at the high school or college levels. No significant differences were observed in the other 7 measures of sensorimotor capabilities included in the test battery, and no systematic biases were found between the testing centres. These findings, indicating that professional-level hitters have better visual acuity and depth perception than professional-level pitchers, affirm the notion that highly experienced athletes have differing perceptual skills. Findings are discussed in relation to deliberate practice theory.

  12. Classification of Effective Soil Depth by Using Multinomial Logistic Regression Analysis

    NASA Astrophysics Data System (ADS)

    Chang, C. H.; Chan, H. C.; Chen, B. A.

    2016-12-01

    Classification of effective soil depth is a task of determining the slopeland utilizable limitation in Taiwan. The "Slopeland Conservation and Utilization Act" categorizes the slopeland into agriculture and husbandry land, land suitable for forestry and land for enhanced conservation according to factors including average slope, effective soil depth, soil erosion and parental rock. However, site investigation of the effective soil depth requires cost-effective field work. This research aimed to classify the effective soil depth by using multinomial logistic regression with environmental factors. The Wen-Shui Watershed, located in central Taiwan, was selected as the study area. The multinomial logistic regression analysis was performed with the assistance of a Geographic Information System (GIS). The effective soil depth was categorized into four levels: deeper, deep, shallow and shallower. The environmental factors of slope, aspect, digital elevation model (DEM), curvature and normalized difference vegetation index (NDVI) were selected for classifying the soil depth. An Error Matrix was then used to assess the model accuracy. The results showed an overall accuracy of 75%. Finally, a map of effective soil depth was produced to help planners and decision makers in determining the slopeland utilizable limitation in the study area.
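
    The overall accuracy reported from the Error Matrix is the share of cases on the matrix diagonal. The sketch below shows that calculation for the four depth classes; the counts are hypothetical, chosen only so that the example reproduces a 75% figure, and the function name is an assumption.

```python
def overall_accuracy(error_matrix):
    """Correctly classified cases (the diagonal) divided by all cases.

    error_matrix[i][j] counts cases of observed class i predicted as class j.
    """
    correct = sum(error_matrix[i][i] for i in range(len(error_matrix)))
    total = sum(sum(row) for row in error_matrix)
    return correct / total


# Hypothetical counts; rows and columns follow the order
# deeper, deep, shallow, shallower.
matrix = [
    [8, 2, 0, 0],
    [1, 7, 2, 0],
    [0, 2, 8, 0],
    [0, 0, 3, 7],
]
accuracy = overall_accuracy(matrix)
```

    Overall accuracy hides per-class performance, so the row and column totals of the same matrix are usually inspected as well.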

  13. Section 3. The SPARROW Surface Water-Quality Model: Theory, Application and User Documentation

    USGS Publications Warehouse

    Schwarz, G.E.; Hoos, A.B.; Alexander, R.B.; Smith, R.A.

    2006-01-01

    SPARROW (SPAtially Referenced Regressions On Watershed attributes) is a watershed modeling technique for relating water-quality measurements made at a network of monitoring stations to attributes of the watersheds containing the stations. The core of the model consists of a nonlinear regression equation describing the non-conservative transport of contaminants from point and diffuse sources on land to rivers and through the stream and river network. The model predicts contaminant flux, concentration, and yield in streams and has been used to evaluate alternative hypotheses about the important contaminant sources and watershed properties that control transport over large spatial scales. This report provides documentation for the SPARROW modeling technique and computer software to guide users in constructing and applying basic SPARROW models. The documentation gives details of the SPARROW software, including the input data and installation requirements, and guidance in the specification, calibration, and application of basic SPARROW models, as well as descriptions of the model output and its interpretation. The documentation is intended for both researchers and water-resource managers with interest in using the results of existing models and developing and applying new SPARROW models. The documentation of the model is presented in two parts. Part 1 provides a theoretical and practical introduction to SPARROW modeling techniques, which includes a discussion of the objectives, conceptual attributes, and model infrastructure of SPARROW. Part 1 also includes background on the commonly used model specifications and the methods for estimating and evaluating parameters, evaluating model fit, and generating water-quality predictions and measures of uncertainty. Part 2 provides a user's guide to SPARROW, which includes a discussion of the software architecture and details of the model input requirements and output files, graphs, and maps. 
The text documentation and computer software are available on the Web at http://usgs.er.gov/sparrow/sparrow-mod/.

  14. Use of random regression to estimate genetic parameters of temperament across an age continuum in a crossbred cattle population.

    PubMed

    Littlejohn, B P; Riley, D G; Welsh, T H; Randel, R D; Willard, S T; Vann, R C

    2018-05-12

    The objective was to estimate genetic parameters of temperament in beef cattle across an age continuum. The population consisted predominantly of Brahman-British crossbred cattle. Temperament was quantified by: 1) pen score (PS), the reaction of a calf to a single experienced evaluator on a scale of 1 to 5 (1 = calm, 5 = excitable); 2) exit velocity (EV), the rate (m/sec) at which a calf traveled 1.83 m upon exiting a squeeze chute; and 3) temperament score (TS), the numerical average of PS and EV. Covariates included days of age and proportion of Bos indicus in the calf and dam. Random regression models included the fixed effects determined from the repeated measures models, except for calf age. Likelihood ratio tests were used to determine the most appropriate random structures. In repeated measures models, the proportion of Bos indicus in the calf was positively related with each calf temperament trait (0.41 ± 0.20, 0.85 ± 0.21, and 0.57 ± 0.18 for PS, EV, and TS, respectively; P < 0.01). There was an effect of contemporary group (combinations of season, year of birth, and management group) and dam age (P < 0.001) in all models. From repeated records analyses, estimates of heritability (h2) were 0.34 ± 0.04, 0.31 ± 0.04, and 0.39 ± 0.04, while estimates of permanent environmental variance as a proportion of the phenotypic variance (c2) were 0.30 ± 0.04, 0.31 ± 0.03, and 0.34 ± 0.04 for PS, EV, and TS, respectively. Quadratic additive genetic random regressions on Legendre polynomials of age were significant for all traits. Quadratic permanent environmental random regressions were significant for PS and TS, but linear permanent environmental random regressions were significant for EV. Random regression results suggested that these components change across the age dimension of these data. 
There appeared to be an increasing influence of permanent environmental effects and decreasing influence of additive genetic effects corresponding to increasing calf age for EV, and to a lesser extent for TS. Inherited temperament may be overcome by accumulating environmental stimuli with increases in age, especially after weaning.
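
    To illustrate what "random regressions on Legendre polynomials of age" involves, the following minimal sketch builds the quadratic Legendre covariates for one record. It is not the authors' code: the age range, the function names, and the sqrt((2k + 1)/2) normalization (a common convention in random regression work) are assumptions made for illustration.

```python
import math


def standardize_age(age, age_min, age_max):
    """Map an age in [age_min, age_max] onto [-1, 1], the domain of Legendre polynomials."""
    return -1.0 + 2.0 * (age - age_min) / (age_max - age_min)


def legendre_covariates(t, order=2):
    """Return normalized Legendre polynomials phi_0..phi_order evaluated at t in [-1, 1]."""
    # Unnormalized polynomials via Bonnet's recursion:
    # (k + 1) * P_{k+1}(t) = (2k + 1) * t * P_k(t) - k * P_{k-1}(t)
    p = [1.0, t]
    for k in range(1, order):
        p.append(((2 * k + 1) * t * p[k] - k * p[k - 1]) / (k + 1))
    p = p[: order + 1]
    # Assumed normalization: phi_k = sqrt((2k + 1) / 2) * P_k
    return [math.sqrt((2 * k + 1) / 2.0) * p[k] for k in range(order + 1)]


# Hypothetical record: a calf measured at 200 days in data spanning 100 to 300 days.
t = standardize_age(200.0, 100.0, 300.0)
phi = legendre_covariates(t, order=2)
```

    Each animal's additive genetic and permanent environmental effects would then be modeled as a linear combination of these covariates, which is what allows the variance components to change with age.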

  15. Practical Guidance for Conducting Mediation Analysis With Multiple Mediators Using Inverse Odds Ratio Weighting

    PubMed Central

    Nguyen, Quynh C.; Osypuk, Theresa L.; Schmidt, Nicole M.; Glymour, M. Maria; Tchetgen Tchetgen, Eric J.

    2015-01-01

    Despite the recent flourishing of mediation analysis techniques, many modern approaches are difficult to implement or applicable to only a restricted range of regression models. This report provides practical guidance for implementing a new technique utilizing inverse odds ratio weighting (IORW) to estimate natural direct and indirect effects for mediation analyses. IORW takes advantage of the odds ratio's invariance property and condenses information on the odds ratio for the relationship between the exposure (treatment) and multiple mediators, conditional on covariates, by regressing exposure on mediators and covariates. The inverse of the covariate-adjusted exposure-mediator odds ratio association is used to weight the primary analytical regression of the outcome on treatment. The treatment coefficient in such a weighted regression estimates the natural direct effect of treatment on the outcome, and indirect effects are identified by subtracting direct effects from total effects. Weighting renders treatment and mediators independent, thereby deactivating indirect pathways of the mediators. This new mediation technique accommodates multiple discrete or continuous mediators. IORW is easily implemented and is appropriate for any standard regression model, including quantile regression and survival analysis. An empirical example is given using data from the Moving to Opportunity (1994–2002) experiment, testing whether neighborhood context mediated the effects of a housing voucher program on obesity. Relevant Stata code (StataCorp LP, College Station, Texas) is provided. PMID:25693776
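
    The weighting step described above can be sketched as follows. This is a hedged illustration rather than the Stata code the authors provide: it assumes the logistic regression of exposure on mediators and covariates has already been fitted, that unexposed observations receive a weight of 1, and that all coefficient values and function names shown are hypothetical.

```python
import math


def iorw_weight(exposed, mediators, mediator_coefs):
    """Inverse odds ratio weight for a single observation.

    mediator_coefs are the mediator coefficients (log odds ratios) from a
    logistic regression of exposure on mediators and covariates.
    """
    if not exposed:
        return 1.0  # assumption: unexposed observations keep weight 1
    log_odds_ratio = sum(b * m for b, m in zip(mediator_coefs, mediators))
    return math.exp(-log_odds_ratio)  # inverse of the exposure-mediator odds ratio


def indirect_effect(total_effect, direct_effect):
    """Natural indirect effect obtained by subtracting the direct from the total effect."""
    return total_effect - direct_effect


# Hypothetical fitted coefficient and mediator value, for illustration only.
w_exposed = iorw_weight(True, [2.0], [0.5])
w_unexposed = iorw_weight(False, [2.0], [0.5])
```

    The weights would then be supplied to the regression of the outcome on treatment; the treatment coefficient from that weighted fit estimates the natural direct effect, and `indirect_effect` applies the subtraction the abstract describes.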

  16. Threshold regression to accommodate a censored covariate.

    PubMed

    Qian, Jing; Chiou, Sy Han; Maye, Jacqueline E; Atem, Folefac; Johnson, Keith A; Betensky, Rebecca A

    2018-06-22

    In several common study designs, regression modeling is complicated by the presence of censored covariates. Examples of such covariates include maternal age of onset of dementia that may be right censored in an Alzheimer's amyloid imaging study of healthy subjects, metabolite measurements that are subject to limit of detection censoring in a case-control study of cardiovascular disease, and progressive biomarkers whose baseline values are of interest, but are measured post-baseline in longitudinal neuropsychological studies of Alzheimer's disease. We propose threshold regression approaches for linear regression models with a covariate that is subject to random censoring. Threshold regression methods allow for immediate testing of the significance of the effect of a censored covariate. In addition, they provide for unbiased estimation of the regression coefficient of the censored covariate. We derive the asymptotic properties of the resulting estimators under mild regularity conditions. Simulations demonstrate that the proposed estimators have good finite-sample performance, and often offer improved efficiency over existing methods. We also derive a principled method for selection of the threshold. We illustrate the approach in application to an Alzheimer's disease study that investigated brain amyloid levels in older individuals, as measured through positron emission tomography scans, as a function of maternal age of dementia onset, with adjustment for other covariates. We have developed an R package, censCov, for implementation of our method, available at CRAN. © 2018, The International Biometric Society.

  17. Psychosocial work characteristics and long-term sickness absence due to mental disorders.

    PubMed

    van Hoffen, Marieke F A; Roelen, Corné A M; van Rhenen, Willem; Schaufeli, Wilmar B; Heymans, Martijn W; Twisk, Jos W R

    2018-02-09

    Psychosocial work characteristics are associated with all-cause long-term sickness absence (LTSA). This study investigated whether psychosocial work characteristics such as higher workload, faster pace of work, less variety in work, lack of performance feedback, and lack of supervisor support are prospectively associated with higher LTSA due to mental disorders. The cohort study included 4877 workers employed in the distribution and transport sector in The Netherlands. Psychosocial work characteristics were included in a logistic regression model estimating the odds ratios (OR) and 95% confidence intervals (CI) of mental LTSA during 2-year follow-up. The ability of the regression model to discriminate between workers with and without mental LTSA was investigated with the area under the receiver operating characteristic curve (AUC). Two thousand seven hundred and eighty-two (57%) workers were included in the analysis; 73 (3%) had mental LTSA. Feedback about one's performance (OR = 0.82; 95% CI 0.70-0.96) was associated with mental LTSA. A prediction model including psychosocial work characteristics poorly discriminated (AUC = 0.65; 95% CI 0.56-0.74) between workers with and without mental LTSA. Feedback about one's performance is associated with lower rates of mental LTSA, but it is not useful to measure psychosocial work characteristics to identify workers at risk of mental LTSA.
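
    For readers unfamiliar with the discrimination measure used here, the AUC can be read as the probability that a randomly chosen worker with mental LTSA receives a higher predicted risk than a randomly chosen worker without it. The sketch below computes it by direct pairwise comparison; the predicted risks are invented for illustration and are not data from the study.

```python
def auc(scores_cases, scores_noncases):
    """Area under the ROC curve via pairwise comparison (ties count as one half)."""
    wins = 0.0
    for s_case in scores_cases:
        for s_non in scores_noncases:
            if s_case > s_non:
                wins += 1.0
            elif s_case == s_non:
                wins += 0.5
    return wins / (len(scores_cases) * len(scores_noncases))


# Hypothetical predicted risks from a fitted logistic regression model.
example_auc = auc([0.9, 0.4], [0.1, 0.4, 0.3])
```

    A value of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation, which is why the reported AUC of 0.65 is described as poor.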

  18. Job stress models, depressive disorders and work performance of engineers in microelectronics industry.

    PubMed

    Chen, Sung-Wei; Wang, Po-Chuan; Hsin, Ping-Lung; Oates, Anthony; Sun, I-Wen; Liu, Shen-Ing

    2011-01-01

    Microelectronic engineers are considered valuable human capital contributing significantly toward economic development, but they may encounter stressful work conditions in the context of a globalized industry. The study aims at identifying risk factors of depressive disorders primarily based on job stress models, the Demand-Control-Support and Effort-Reward Imbalance models, and at evaluating whether depressive disorders impair work performance in microelectronics engineers in Taiwan. The case-control study was conducted among 678 microelectronics engineers, 452 controls and 226 cases with depressive disorders which were defined by a score of 17 or more on the Beck Depression Inventory and a psychiatrist's diagnosis. The self-administered questionnaires included the Job Content Questionnaire, Effort-Reward Imbalance Questionnaire, demography, psychosocial factors, health behaviors and work performance. Hierarchical logistic regression was applied to identify risk factors of depressive disorders. Multivariate linear regressions were used to determine factors affecting work performance. By hierarchical logistic regression, risk factors of depressive disorders are high demands, low work social support, high effort/reward ratio and low frequency of physical exercise. Combining the two job stress models may have better predictive power for depressive disorders than adopting either model alone. Three multivariate linear regressions provide similar results indicating that depressive disorders are associated with impaired work performance in terms of absence, role limitation and social functioning limitation. The results may provide insight into the applicability of job stress models in a globalized high-tech industry considerably focused in non-Western countries, and the design of workplace preventive strategies for depressive disorders in the Asian electronics engineering population.

  19. Predictive modeling of hazardous waste landfill total above-ground biomass using passive optical and LIDAR remotely sensed data

    NASA Astrophysics Data System (ADS)

    Hadley, Brian Christopher

    This dissertation assessed remotely sensed data and geospatial modeling technique(s) to map the spatial distribution of total above-ground biomass present on the surface of the Savannah River National Laboratory's (SRNL) Mixed Waste Management Facility (MWMF) hazardous waste landfill. Ordinary least squares (OLS) regression, regression kriging, and tree-structured regression were employed to model the empirical relationship between in-situ measured Bahia (Paspalum notatum Flugge) and Centipede [Eremochloa ophiuroides (Munro) Hack.] grass biomass against an assortment of explanatory variables extracted from fine spatial resolution passive optical and LIDAR remotely sensed data. Explanatory variables included: (1) discrete channels of visible, near-infrared (NIR), and short-wave infrared (SWIR) reflectance, (2) spectral vegetation indices (SVI), (3) spectral mixture analysis (SMA) modeled fractions, (4) narrow-band derivative-based vegetation indices, and (5) LIDAR derived topographic variables (i.e., elevation, slope, and aspect). Results showed that a linear combination of the first- (1DZ_DGVI), second- (2DZ_DGVI), and third-derivative of green vegetation indices (3DZ_DGVI) calculated from hyperspectral data recorded over the 400-960 nm wavelengths of the electromagnetic spectrum explained the largest percentage of statistical variation (R² = 0.5184) in the total above-ground biomass measurements. In general, the topographic variables did not correlate well with the MWMF biomass data, accounting for less than five percent of the statistical variation. It was concluded that tree-structured regression represented the optimum geospatial modeling technique due to a combination of model performance and efficiency/flexibility factors.

  20. Computerized pigment design based on property hypersurfaces

    NASA Astrophysics Data System (ADS)

    Halova, Jaroslava; Sulcova, Petra; Kupka, Karel

    2007-05-01

    Competition is tough in the pigment market. Rational pigment design has therefore a competitive advantage, saving time and money. The aim of this work is to provide methods that can assist in designing pigments with defined properties. These methods include partial least squares regression (PLSR), neural network (NN) and generalized regression ANOVA model. Authors show how PLS bi-plot can be used to identify market gaps poorly covered by pigment manufacturers, thus giving an opportunity to develop pigments with potentially profitable properties.

  1. Asthma exacerbation and proximity of residence to major roads: a population-based matched case-control study among the pediatric Medicaid population in Detroit, Michigan

    PubMed Central

    2011-01-01

    Background: The relationship between asthma and traffic-related pollutants has received considerable attention. The use of individual-level exposure measures, such as residence location or proximity to emission sources, may avoid ecological biases. Method: This study focused on the pediatric Medicaid population in Detroit, MI, a high-risk population for asthma-related events. A population-based matched case-control analysis was used to investigate associations between acute asthma outcomes and proximity of residence to major roads, including freeways. Asthma cases were identified as all children who made at least one asthma claim, including inpatient and emergency department visits, during the three-year study period, 2004-06. Individually matched controls were randomly selected from the rest of the Medicaid population on the basis of non-respiratory related illness. We used conditional logistic regression with distance as both categorical and continuous variables, and examined non-linear relationships with distance using polynomial splines. The conditional logistic regression models were then extended by considering multiple asthma states (based on the frequency of acute asthma outcomes) using polychotomous conditional logistic regression. Results: Asthma events were associated with proximity to primary roads with an odds ratio of 0.97 (95% CI: 0.94, 0.99) for a 1 km increase in distance using conditional logistic regression, implying that asthma events are less likely as the distance between the residence and a primary road increases. Similar relationships and effect sizes were found using polychotomous conditional logistic regression. Another plausible exposure metric, a reduced form response surface model that represents atmospheric dispersion of pollutants from roads, was not associated under that exposure model. Conclusions: There is moderately strong evidence of elevated risk of asthma close to major roads based on the results obtained in this population-based matched case-control study. PMID:21513554
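
    As a worked illustration of how a per-kilometre odds ratio such as the reported 0.97 can be rescaled to other distances, the sketch below assumes distance enters the logistic model as a single continuous (log-linear) term, so that odds ratios multiply across units. The function name and the 5 km example are hypothetical and are not taken from the study.

```python
import math


def scale_odds_ratio(or_per_unit, units):
    """Odds ratio for a change of `units`, assuming a log-linear continuous term."""
    # The log odds ratio is the regression coefficient, which scales linearly.
    return math.exp(units * math.log(or_per_unit))


# Odds ratio implied for a hypothetical 5 km increase in distance.
or_5km = scale_odds_ratio(0.97, 5)
```

    Under that assumption the implied odds ratio for a 5 km increase is about 0.86. The rescaling would not apply to the categorical-distance or spline models mentioned in the abstract.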

  2. A Model Comparison for Count Data with a Positively Skewed Distribution with an Application to the Number of University Mathematics Courses Completed

    ERIC Educational Resources Information Center

    Liou, Pey-Yan

    2009-01-01

    The current study examines three regression models: OLS (ordinary least square) linear regression, Poisson regression, and negative binomial regression for analyzing count data. Simulation results show that the OLS regression model performed better than the others, since it did not produce more false statistically significant relationships than…

  3. Quantitative assessment of cervical vertebral maturation using cone beam computed tomography in Korean girls.

    PubMed

    Byun, Bo-Ram; Kim, Yong-Il; Yamaguchi, Tetsutaro; Maki, Koutaro; Son, Woo-Sung

    2015-01-01

    This study aimed to examine the correlation between skeletal maturation status and parameters from the odontoid process/body of the second vertebra and the bodies of the third and fourth cervical vertebrae, and simultaneously to build multiple regression models able to estimate skeletal maturation status in Korean girls. Hand-wrist radiographs and cone beam computed tomography (CBCT) images were obtained from 74 Korean girls (6-18 years of age). CBCT-generated cervical vertebral maturation (CVM) was used to demarcate the odontoid process and the body of the second cervical vertebra, based on the dentocentral synchondrosis. Correlation coefficient analysis and multiple linear regression analysis were used for each parameter of the cervical vertebrae (P < 0.05). Forty-seven of 64 parameters from CBCT-generated CVM (independent variables) exhibited statistically significant correlations (P < 0.05). The multiple regression model with the greatest R² had six parameters (PH2/W2, UW2/W2, (OH+AH2)/LW2, UW3/LW3, D3, and H4/W4) as independent variables with a variance inflation factor (VIF) of <2. CBCT-generated CVM was able to include parameters from the second cervical vertebral body and odontoid process, respectively, for the multiple regression models. This suggests that quantitative analysis might be used to estimate skeletal maturation status.

  4. Solid-phase cadmium speciation in soil using L3-edge XANES spectroscopy with partial least-squares regression.

    PubMed

    Siebers, Nina; Kruse, Jens; Eckhardt, Kai-Uwe; Hu, Yongfeng; Leinweber, Peter

    2012-07-01

    Cadmium (Cd) has a high toxicity and resolving its speciation in soil is challenging but essential for estimating the environmental risk. In this study partial least-squares (PLS) regression was tested for its capability to deconvolute Cd L3-edge X-ray absorption near-edge structure (XANES) spectra of multi-compound mixtures. For this, a library of Cd reference compound spectra and a spectrum of a soil sample were acquired. A good coefficient of determination (R²) of Cd compounds in mixtures was obtained for the PLS model using binary and ternary mixtures of various Cd reference compounds, proving the validity of this approach. In order to describe complex systems like soil, multi-compound mixtures of a variety of Cd compounds must be included in the PLS model. The obtained PLS regression model was then applied to a highly Cd-contaminated soil, revealing Cd3(PO4)2 (36.1%), Cd(NO3)2·4H2O (24.5%), Cd(OH)2 (21.7%), CdCO3 (17.1%) and CdCl2 (0.4%). These preliminary results proved that PLS regression is a promising approach for a direct determination of Cd speciation in the solid phase of a soil sample.

  5. Developing logistic regression models using purchase attributes and demographics to predict the probability of purchases of regular and specialty eggs.

    PubMed

    Bejaei, M; Wiseman, K; Cheng, K M

    2015-01-01

    Consumers' interest in specialty eggs appears to be growing in Europe and North America. The objective of this research was to develop logistic regression models that utilise purchaser attributes and demographics to predict the probability of a consumer purchasing a specific type of table egg including regular (white and brown), non-caged (free-run, free-range and organic) or nutrient-enhanced eggs. These purchase prediction models, together with the purchasers' attributes, can be used to assess market opportunities of different egg types specifically in British Columbia (BC). An online survey was used to gather data for the models. A total of 702 completed questionnaires were submitted by BC residents. Selected independent variables were included in the logistic regression to develop models for different egg types to predict the probability of a consumer purchasing a specific type of table egg. The variables used in the model accounted for 54% and 49% of variances in the purchase of regular and non-caged eggs, respectively. Research results indicate that consumers of different egg types exhibit a set of unique and statistically significant characteristics and/or demographics. For example, consumers of regular eggs were less educated, older, price sensitive, major chain store buyers, and store flyer users, and had lower awareness about different types of eggs and less concern regarding animal welfare issues. However, most of the non-caged egg consumers were less concerned about price, had higher awareness about different types of table eggs, purchased their eggs from local/organic grocery stores, farm gates or farmers markets, and were more concerned about care and feeding of hens compared to consumers of other egg types.

  6. An event-based approach to understanding decadal fluctuations in the Atlantic meridional overturning circulation

    NASA Astrophysics Data System (ADS)

    Allison, Lesley; Hawkins, Ed; Woollings, Tim

    2015-01-01

    Many previous studies have shown that unforced climate model simulations exhibit decadal-scale fluctuations in the Atlantic meridional overturning circulation (AMOC), and that this variability can have impacts on surface climate fields. However, the robustness of these surface fingerprints across different models is less clear. Furthermore, with the potential for coupled feedbacks that may amplify or damp the response, it is not known whether the associated climate signals are linearly related to the strength of the AMOC changes, or if the fluctuation events exhibit nonlinear behaviour with respect to their strength or polarity. To explore these questions, we introduce an objective and flexible method for identifying the largest natural AMOC fluctuation events in multicentennial/multimillennial simulations of a variety of coupled climate models. The characteristics of the events are explored, including their magnitude, meridional coherence and spatial structure, as well as links with ocean heat transport and the horizontal circulation. The surface fingerprints in ocean temperature and salinity are examined, and compared with the results of linear regression analysis. It is found that the regressions generally provide a good indication of the surface changes associated with the largest AMOC events. However, there are some exceptions, including a nonlinear change in the atmospheric pressure signal, particularly at high latitudes, in HadCM3. Some asymmetries are also found between the changes associated with positive and negative AMOC events in the same model. Composite analysis suggests that there are signals that are robust across the largest AMOC events in each model, which provides reassurance that the surface changes associated with one particular event will be similar to those expected from regression analysis. 
However, large differences are found between the AMOC fingerprints in different models, which may hinder the prediction and attribution of such events in reality.

  7. [Stature estimation for Sichuan Han nationality female based on X-ray technology with measurement of lumbar vertebrae].

    PubMed

    Qing, Si-han; Chang, Yun-feng; Dong, Xiao-ai; Li, Yuan; Chen, Xiao-gang; Shu, Yong-kang; Deng, Zhen-hua

    2013-10-01

    To establish mathematical models of stature estimation for Sichuan Han females from X-ray measurements of the lumbar vertebrae, and to provide essential data for forensic anthropology research. The samples, 206 Sichuan Han females, were divided into three groups (A, B and C) according to age. Group A (206 samples) consisted of all ages, group B (116 samples) was 20-45 years old, and group C (90 samples) was over 45 years old. The lumbar vertebrae of all samples were examined by CR technology; the parameters included the anterior border, posterior border and central heights of the five centrums (L1-L5) (x1-x15), the total central height of the lumbar spine (x16), and the real height of every sample. Linear regression analysis was performed on the parameters to establish the mathematical models of stature estimation. Sixty-two trained subjects were tested to verify the accuracy of the mathematical models. The established models were statistically significant by hypothesis tests of the linear regression equations (P<0.05). The standard errors of the equations were 2.982-5.004 cm, while correlation coefficients were 0.370-0.779 and multiple correlation coefficients were 0.533-0.834. The return tests of the equations with the highest correlation coefficient and multiple correlation coefficient in each group showed that the highest accuracy, obtained with the multiple regression equation y = 100.33 + 1.489 x3 - 0.548 x6 + 0.772 x9 + 0.058 x12 + 0.645 x15 in group A, was 80.6% (+/- 1SE) and 100% (+/- 2SE). The established mathematical models in this study could be applied for the stature estimation for Sichuan Han females.
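
    The group A equation quoted above can be evaluated directly, as in the following sketch. The coefficients are copied from the abstract; the input values are hypothetical, and the abstract does not state the units of x3 to x15, so the stature being in centimetres is only inferred from the units of the reported standard errors.

```python
# Coefficients of the group A multiple regression equation quoted in the abstract.
INTERCEPT = 100.33
COEFS = {"x3": 1.489, "x6": -0.548, "x9": 0.772, "x12": 0.058, "x15": 0.645}


def estimate_stature(measurements):
    """Evaluate y = 100.33 + 1.489 x3 - 0.548 x6 + 0.772 x9 + 0.058 x12 + 0.645 x15."""
    return INTERCEPT + sum(coef * measurements[name] for name, coef in COEFS.items())


# Hypothetical vertebral measurements, all set to 27.0 purely for illustration.
example = {name: 27.0 for name in COEFS}
stature = estimate_stature(example)
```

    With these invented inputs the equation returns about 165.6; a real application would use the five measured heights from the radiograph and report the estimate with the +/- 1SE or +/- 2SE range.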

  8. Evaluating differential effects using regression interactions and regression mixture models

    PubMed Central

    Van Horn, M. Lee; Jaki, Thomas; Masyn, Katherine; Howe, George; Feaster, Daniel J.; Lamont, Andrea E.; George, Melissa R. W.; Kim, Minjung

    2015-01-01

    Research increasingly emphasizes understanding differential effects. This paper focuses on understanding regression mixture models, a relatively new statistical method for assessing differential effects, by comparing results to using an interactive term in linear regression. The research questions which each model answers, their formulation, and their assumptions are compared using Monte Carlo simulations and real data analysis. The capabilities of regression mixture models are described and specific issues to be addressed when conducting regression mixtures are proposed. The paper aims to clarify the role that regression mixtures can take in the estimation of differential effects and increase awareness of the benefits and potential pitfalls of this approach. Regression mixture models are shown to be a potentially effective exploratory method for finding differential effects when these effects can be defined by a small number of classes of respondents who share a typical relationship between a predictor and an outcome. It is also shown that the comparison between regression mixture models and interactions becomes substantially more complex as the number of classes increases. It is argued that regression interactions are well suited for direct tests of specific hypotheses about differential effects and regression mixtures provide a useful approach for exploring effect heterogeneity given adequate samples and study design. PMID:26556903

  9. Evaluating Differential Effects Using Regression Interactions and Regression Mixture Models

    ERIC Educational Resources Information Center

    Van Horn, M. Lee; Jaki, Thomas; Masyn, Katherine; Howe, George; Feaster, Daniel J.; Lamont, Andrea E.; George, Melissa R. W.; Kim, Minjung

    2015-01-01

    Research increasingly emphasizes understanding differential effects. This article focuses on understanding regression mixture models, which are relatively new statistical methods for assessing differential effects by comparing results to using an interactive term in linear regression. The research questions which each model answers, their…

  10. School Attendance Problems and Youth Psychopathology: Structural Cross-Lagged Regression Models in Three Longitudinal Data Sets

    ERIC Educational Resources Information Center

    Wood, Jeffrey J.; Lynne-Landsman, Sarah D.; Langer, David A.; Wood, Patricia A.; Clark, Shaunna L.; Eddy, J. Mark; Ialongo, Nick

    2012-01-01

    This study tests a model of reciprocal influences between absenteeism and youth psychopathology using 3 longitudinal datasets (Ns = 20,745, 2,311, and 671). Participants in 1st through 12th grades were interviewed annually or biannually. Measures of psychopathology include self-, parent-, and teacher-report questionnaires. Structural cross-lagged…

  11. Potential redistribution of tree species habitat under five climate change scenarios in the eastern US

    Treesearch

    Louis R. Iverson; Anantha M. Prasad

    2002-01-01

    Global climate change could have profound effects on the Earth's biota, including large redistributions of tree species and forest types. We used DISTRIB, a deterministic regression tree analysis model, to examine environmental drivers related to current forest-species distributions and then model potential suitable habitat under five climate change scenarios...

  12. Deciphering factors controlling groundwater arsenic spatial variability in Bangladesh

    NASA Astrophysics Data System (ADS)

    Tan, Z.; Yang, Q.; Zheng, C.; Zheng, Y.

    2017-12-01

    Elevated concentrations of geogenic arsenic in groundwater have been found in many countries to exceed 10 μg/L, the WHO's guideline value for drinking water. A common yet unexplained characteristic of groundwater arsenic spatial distribution is the extensive variability at various spatial scales. This study investigates factors influencing the spatial variability of groundwater arsenic in Bangladesh to improve the accuracy of models predicting arsenic exceedance rate spatially. A novel boosted regression tree method is used to establish a weak-learning ensemble model, which is compared to a linear model using a conventional stepwise logistic regression method. The boosted regression tree models offer the advantage of parametric interaction when big datasets are analyzed in comparison to the logistic regression. The point data set (n=3,538) of groundwater hydrochemistry with 19 parameters was obtained by the British Geological Survey in 2001. The spatial data sets of geological parameters (n=13) were from the Consortium for Spatial Information, Technical University of Denmark, University of East Anglia and the FAO, while the soil parameters (n=42) were from the Harmonized World Soil Database. The aforementioned parameters were regressed to categorical groundwater arsenic concentrations below or above three thresholds: 5 μg/L, 10 μg/L and 50 μg/L to identify respective controlling factors. Boosted regression tree method outperformed logistic regression methods in all three threshold levels in terms of accuracy, specificity and sensitivity, resulting in an improvement of spatial distribution map of probability of groundwater arsenic exceeding all three thresholds when compared to disjunctive-kriging interpolated spatial arsenic map using the same groundwater arsenic dataset. 
Boosted regression tree models also show that the most important controlling factors of groundwater arsenic distribution include groundwater iron content and well depth for all three thresholds. The probability of a well with iron content higher than 5 mg/L to contain greater than 5 μg/L, 10 μg/L and 50 μg/L As is estimated to be more than 91%, 85% and 51%, respectively, while the probability of a well from depth more than 160 m to contain more than 5 μg/L, 10 μg/L and 50 μg/L As is estimated to be less than 38%, 25% and 14%, respectively.
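
    The three criteria used above to compare the classifiers (accuracy, sensitivity and specificity) can be computed from observed and predicted threshold exceedances as in this minimal sketch. The concentrations and predictions are invented for illustration; only the 10 μg/L threshold comes from the abstract.

```python
def exceeds(concentrations, threshold):
    """Convert concentrations (ug/L) to booleans: True where the threshold is exceeded."""
    return [c > threshold for c in concentrations]


def classification_metrics(observed, predicted):
    """Accuracy, sensitivity and specificity for paired boolean labels."""
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o and p)
    tn = sum(1 for o, p in pairs if not o and not p)
    fp = sum(1 for o, p in pairs if not o and p)
    fn = sum(1 for o, p in pairs if o and not p)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # share of exceeding wells correctly flagged
        "specificity": tn / (tn + fp),  # share of non-exceeding wells correctly cleared
    }


# Hypothetical arsenic concentrations and model predictions at the 10 ug/L threshold.
observed = exceeds([3.0, 12.0, 60.0, 8.0], 10.0)
predicted = [False, True, False, False]
metrics = classification_metrics(observed, predicted)
```

    Repeating the calculation at 5 μg/L and 50 μg/L would give the per-threshold comparison the abstract reports for the two modeling approaches.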

  13. Characterizing mammographic images by using generic texture features

    PubMed Central

    2012-01-01

    Introduction: Although mammographic density is an established risk factor for breast cancer, its use is limited in clinical practice because of a lack of automated and standardized measurement methods. The aims of this study were to evaluate a variety of automated texture features in mammograms as risk factors for breast cancer and to compare them with the percentage mammographic density (PMD) by using a case-control study design. Methods: A case-control study including 864 cases and 418 controls was analyzed automatically. Four hundred seventy features were explored as possible risk factors for breast cancer. These included statistical features, moment-based features, spectral-energy features, and form-based features. An elaborate variable selection process using logistic regression analyses was performed to identify those features that were associated with case-control status. In addition, PMD was assessed and included in the regression model. Results: Of the 470 image-analysis features explored, 46 remained in the final logistic regression model. An area under the curve of 0.79, with an odds ratio per standard deviation change of 2.88 (95% CI, 2.28 to 3.65), was obtained with validation data. Adding the PMD did not improve the final model. Conclusions: Using texture features to predict the risk of breast cancer appears feasible. PMD did not show any additional value in this study. With regard to the features assessed, most of the analysis tools appeared to reflect mammographic density, although some features did not correlate with PMD. It remains to be investigated in larger case-control studies whether these features can contribute to increased prediction accuracy. PMID:22490545

  14. Sex differences in the effect of aging on dry eye disease.

    PubMed

    Ahn, Jong Ho; Choi, Yoon-Hyeong; Paik, Hae Jung; Kim, Mee Kum; Wee, Won Ryang; Kim, Dong Hyun

    2017-01-01

    Aging is a major risk factor in dry eye disease (DED), and understanding sexual differences is very important in biomedical research. However, there is little information about sex differences in the effect of aging on DED. We investigated sex differences in the effect of aging and other risk factors for DED. This study included data of 16,824 adults from the Korea National Health and Nutrition Examination Survey (2010-2012), which is a population-based cross-sectional survey. DED was defined as the presence of frequent ocular dryness or a previous diagnosis by an ophthalmologist. Basic sociodemographic factors and previously known risk factors for DED were included in the analyses. Linear regression modeling and multivariate logistic regression modeling were used to compare the sex differences in the effect of risk factors for DED; we additionally performed tests for interactions between sex and other risk factors for DED in logistic regression models. In our linear regression models, the prevalence of DED symptoms in men increased with age ( R =0.311, P =0.012); however, there was no association between aging and DED in women ( P >0.05). Multivariate logistic regression analyses showed that aging in men was not associated with DED (DED symptoms/diagnosis: odds ratio [OR] =1.01/1.04, each P >0.05), while aging in women was protectively associated with DED (DED symptoms/diagnosis: OR =0.94/0.91, P =0.011/0.003). Previous ocular surgery was significantly associated with DED in both men and women (men/women: OR =2.45/1.77 [DED symptoms] and 3.17/2.05 [DED diagnosis], each P <0.001). Tests for interactions of sex revealed significantly different aging × sex and previous ocular surgery × sex interactions ( P for interaction of sex: DED symptoms/diagnosis - 0.044/0.011 [age] and 0.012/0.006 [previous ocular surgery]). There were distinct sex differences in the effect of aging on DED in the Korean population. 
DED following ocular surgery also showed sexually different patterns. Age matching and sex matching are strongly recommended in further studies about DED, especially DED following ocular surgery.

  15. Association between response rates and survival outcomes in patients with newly diagnosed multiple myeloma. A systematic review and meta-regression analysis.

    PubMed

    Mainou, Maria; Madenidou, Anastasia-Vasiliki; Liakos, Aris; Paschos, Paschalis; Karagiannis, Thomas; Bekiari, Eleni; Vlachaki, Efthymia; Wang, Zhen; Murad, Mohammad Hassan; Kumar, Shaji; Tsapas, Apostolos

    2017-06-01

    We performed a systematic review and meta-regression analysis of randomized controlled trials to investigate the association between response to initial treatment and survival outcomes in patients with newly diagnosed multiple myeloma (MM). Response outcomes included complete response (CR) and the combined outcome of CR or very good partial response (VGPR), while survival outcomes were overall survival (OS) and progression-free survival (PFS). We used random-effects meta-regression models and conducted sensitivity analyses based on definition of CR and study quality. Seventy-two trials were included in the systematic review, 63 of which contributed data to the meta-regression analyses. There was no association between OS and CR in patients without autologous stem cell transplant (ASCT) (regression coefficient: 0.02, 95% confidence interval [CI] -0.06, 0.10), in patients undergoing ASCT (-0.11, 95% CI -0.44, 0.22) or in trials comparing ASCT with non-ASCT patients (0.04, 95% CI -0.29, 0.38). Similarly, OS did not correlate with the combined metric of CR or VGPR, and no association was evident between response outcomes and PFS. Sensitivity analyses yielded similar results. This meta-regression analysis suggests that there is no association between conventional response outcomes and survival in patients with newly diagnosed MM. © 2017 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.

  16. Household water treatment in developing countries: comparing different intervention types using meta-regression.

    PubMed

    Hunter, Paul R

    2009-12-01

    Household water treatment (HWT) is being widely promoted as an appropriate intervention for reducing the burden of waterborne disease in poor communities in developing countries. A recent study has raised concerns about the effectiveness of HWT, in part because of the lack of blinding and in part because of considerable heterogeneity in the reported effectiveness of randomized controlled trials. This study set out to investigate the causes of this heterogeneity and so identify factors associated with good health gains. Studies identified in an earlier systematic review and meta-analysis were supplemented with more recently published randomized controlled trials. A total of 28 separate studies of randomized controlled trials of HWT with 39 intervention arms were included in the analysis. Heterogeneity was studied using the "metareg" command in Stata. Initial analyses with single candidate predictors were undertaken and all variables significant at the P < 0.2 level were included in a final regression model. Further analyses were done to estimate the effect of the interventions over time by Monte Carlo modeling using @Risk and the parameter estimates from the final regression model. The overall effect size of all unblinded studies was relative risk = 0.56 (95% confidence intervals 0.51-0.63), but after adjusting for bias due to lack of blinding the effect size was much lower (RR = 0.85, 95% CI = 0.76-0.97). Four main variables were significant predictors of intervention effectiveness in a multipredictor meta-regression model. Log duration of study follow-up (regression coefficient of log effect size = 0.186, standard error (SE) = 0.072), whether or not the study was blinded (coefficient 0.251, SE 0.066) and being conducted in an emergency setting (coefficient -0.351, SE 0.076) were all significant predictors of effect size in the final model. 
Compared to the ceramic filter, all other interventions were much less effective (Biosand 0.247, 0.073; chlorine and safe waste storage 0.295, 0.061; combined coagulant-chlorine 0.2349, 0.067; SODIS 0.302, 0.068). A Monte Carlo model predicted that over 12 months ceramic filters were likely to be still effective at reducing disease, whereas SODIS, chlorination, and coagulation-chlorination had little if any benefit. Indeed, these three interventions are predicted to have the same or less effect than what may be expected due purely to reporting bias in unblinded studies. With the currently available evidence, ceramic filters are the most effective form of HWT in the long term; disinfection-only interventions including SODIS appear to have poor if any long-term public health benefit.

  17. Spatial regression analysis on 32 years of total column ozone data

    NASA Astrophysics Data System (ADS)

    Knibbe, J. S.; van der A, R. J.; de Laat, A. T. J.

    2014-08-01

    Multiple-regression analyses have been performed on 32 years of total ozone column data that was spatially gridded with a 1 × 1.5° resolution. The total ozone data consist of the MSR (Multi Sensor Reanalysis; 1979-2008) and 2 years of assimilated SCIAMACHY (SCanning Imaging Absorption spectroMeter for Atmospheric CHartographY) ozone data (2009-2010). The two-dimensionality in this data set allows us to perform the regressions locally and investigate spatial patterns of regression coefficients and their explanatory power. Seasonal dependencies of ozone on regressors are included in the analysis. A new physically oriented model is developed to parameterize stratospheric ozone. Ozone variations on nonseasonal timescales are parameterized by explanatory variables describing the solar cycle, stratospheric aerosols, the quasi-biennial oscillation (QBO), El Niño-Southern Oscillation (ENSO) and stratospheric alternative halogens which are parameterized by the effective equivalent stratospheric chlorine (EESC). For several explanatory variables, seasonally adjusted versions of these explanatory variables are constructed to account for the difference in their effect on ozone throughout the year. To account for seasonal variation in ozone, explanatory variables describing the polar vortex, geopotential height, potential vorticity and average day length are included. Results of this regression model are compared to that of a similar analysis based on a more commonly applied statistically oriented model. The physically oriented model provides spatial patterns in the regression results for each explanatory variable. 
The EESC has a significant depleting effect on ozone at mid- and high latitudes, the solar cycle affects ozone positively mostly in the Southern Hemisphere, stratospheric aerosols affect ozone negatively at high northern latitudes, the effect of QBO is positive and negative in the tropics and mid- to high latitudes, respectively, and ENSO affects ozone negatively between 30° N and 30° S, particularly over the Pacific. The contribution of explanatory variables describing seasonal ozone variation is generally large at mid- to high latitudes. We observe ozone increases with potential vorticity and day length and ozone decreases with geopotential height and variable ozone effects due to the polar vortex in regions to the north and south of the polar vortices. Recovery of ozone is identified globally. However, recovery rates and uncertainties strongly depend on choices that can be made in defining the explanatory variables. The application of several trend models, each with their own pros and cons, yields a large range of recovery rate estimates. Overall these results suggest that care has to be taken in determining ozone recovery rates, in particular for the Antarctic ozone hole.

  18. Genetic Parameter Estimates for Metabolizing Two Common Pharmaceuticals in Swine.

    PubMed

    Howard, Jeremy T; Ashwell, Melissa S; Baynes, Ronald E; Brooks, James D; Yeatts, James L; Maltecca, Christian

    2018-01-01

    In livestock, the regulation of drugs used to treat livestock has received increased attention and it is currently unknown how much of the phenotypic variation in drug metabolism is due to the genetics of an animal. Therefore, the objective of the study was to determine the amount of phenotypic variation in fenbendazole and flunixin meglumine drug metabolism due to genetics. The population consisted of crossbred female and castrated male nursery pigs ( n = 198) that were sired by boars represented by four breeds. The animals were spread across nine batches. Drugs were administered intravenously and blood collected a minimum of 10 times over a 48 h period. Genetic parameters for the parent drug and metabolite concentration within each drug were estimated based on pharmacokinetics (PK) parameters or concentrations across time utilizing a random regression model. The PK parameters were estimated using a non-compartmental analysis. The PK model included fixed effects of sex and breed of sire along with random sire and batch effects. The random regression model utilized Legendre polynomials and included a fixed population concentration curve, sex, and breed of sire effects along with a random sire deviation from the population curve and batch effect. The sire effect included the intercept for all models except for the fenbendazole metabolite (i.e., intercept and slope). The mean heritability across PK parameters for the fenbendazole and flunixin meglumine parent drug (metabolite) was 0.15 (0.18) and 0.31 (0.40), respectively. For the parent drug (metabolite), the mean heritability across time was 0.27 (0.60) and 0.14 (0.44) for fenbendazole and flunixin meglumine, respectively. The errors surrounding the heritability estimates for the random regression model were smaller compared to estimates obtained from PK parameters. Across both the PK and plasma drug concentration across model, a moderate heritability was estimated. 
The model that utilized the plasma drug concentration across time resulted in estimates with a smaller standard error compared to models that utilized PK parameters. The current study found that a low to moderate proportion of the phenotypic variation in metabolizing fenbendazole and flunixin meglumine was explained by genetics.

  19. Genetic Parameter Estimates for Metabolizing Two Common Pharmaceuticals in Swine

    PubMed Central

    Howard, Jeremy T.; Ashwell, Melissa S.; Baynes, Ronald E.; Brooks, James D.; Yeatts, James L.; Maltecca, Christian

    2018-01-01

    In livestock, the regulation of drugs used to treat livestock has received increased attention and it is currently unknown how much of the phenotypic variation in drug metabolism is due to the genetics of an animal. Therefore, the objective of the study was to determine the amount of phenotypic variation in fenbendazole and flunixin meglumine drug metabolism due to genetics. The population consisted of crossbred female and castrated male nursery pigs (n = 198) that were sired by boars represented by four breeds. The animals were spread across nine batches. Drugs were administered intravenously and blood collected a minimum of 10 times over a 48 h period. Genetic parameters for the parent drug and metabolite concentration within each drug were estimated based on pharmacokinetics (PK) parameters or concentrations across time utilizing a random regression model. The PK parameters were estimated using a non-compartmental analysis. The PK model included fixed effects of sex and breed of sire along with random sire and batch effects. The random regression model utilized Legendre polynomials and included a fixed population concentration curve, sex, and breed of sire effects along with a random sire deviation from the population curve and batch effect. The sire effect included the intercept for all models except for the fenbendazole metabolite (i.e., intercept and slope). The mean heritability across PK parameters for the fenbendazole and flunixin meglumine parent drug (metabolite) was 0.15 (0.18) and 0.31 (0.40), respectively. For the parent drug (metabolite), the mean heritability across time was 0.27 (0.60) and 0.14 (0.44) for fenbendazole and flunixin meglumine, respectively. The errors surrounding the heritability estimates for the random regression model were smaller compared to estimates obtained from PK parameters. Across both the PK and plasma drug concentration across model, a moderate heritability was estimated. 
The model that utilized the plasma drug concentration across time resulted in estimates with a smaller standard error compared to models that utilized PK parameters. The current study found that a low to moderate proportion of the phenotypic variation in metabolizing fenbendazole and flunixin meglumine was explained by genetics. PMID:29487615

  20. An enhanced PM 2.5 air quality forecast model based on nonlinear regression and back-trajectory concentrations

    NASA Astrophysics Data System (ADS)

    Cobourn, W. Geoffrey

    2010-08-01

    An enhanced PM 2.5 air quality forecast model based on nonlinear regression (NLR) and back-trajectory concentrations has been developed for use in the Louisville, Kentucky metropolitan area. The PM 2.5 air quality forecast model is designed for use in the warm season, from May through September, when PM 2.5 air quality is more likely to be critical for human health. The enhanced PM 2.5 model consists of a basic NLR model, developed for use with an automated air quality forecast system, and an additional parameter based on upwind PM 2.5 concentration, called PM24. The PM24 parameter is designed to be determined manually, by synthesizing backward air trajectory and regional air quality information to compute 24-h back-trajectory concentrations. The PM24 parameter may be used by air quality forecasters to adjust the forecast provided by the automated forecast system. In this study of the 2007 and 2008 forecast seasons, the enhanced model performed well using forecasted meteorological data and PM24 as input. The enhanced PM 2.5 model was compared with three alternative models, including the basic NLR model, the basic NLR model with a persistence parameter added, and the NLR model with persistence and PM24. The two models that included PM24 were of comparable accuracy. The two models incorporating back-trajectory concentrations had lower mean absolute errors and higher rates of detecting unhealthy PM 2.5 concentrations compared to the other models.

  1. Modeling absolute differences in life expectancy with a censored skew-normal regression approach

    PubMed Central

    Clough-Gorr, Kerri; Zwahlen, Marcel

    2015-01-01

    Parameter estimates from commonly used multivariable parametric survival regression models do not directly quantify differences in years of life expectancy. Gaussian linear regression models give results in terms of absolute mean differences, but are not appropriate for modeling life expectancy, because in many situations time to death has a negatively skewed distribution. A regression approach using a skew-normal distribution would be an alternative to parametric survival models in the modeling of life expectancy, because parameter estimates can be interpreted in terms of survival time differences while allowing for skewness of the distribution. In this paper we show how to use skew-normal regression so that censored and left-truncated observations are accounted for. With this we model differences in life expectancy using data from the Swiss National Cohort Study and from official life expectancy estimates and compare the results with those derived from commonly used survival regression models. We conclude that a censored skew-normal survival regression approach for left-truncated observations can be used to model differences in life expectancy across covariates of interest. PMID:26339544

  2. Comparison of random regression test-day models for Polish Black and White cattle.

    PubMed

    Strabel, T; Szyda, J; Ptak, E; Jamrozik, J

    2005-10-01

    Test-day milk yields of first-lactation Black and White cows were used to select the model for routine genetic evaluation of dairy cattle in Poland. The population of Polish Black and White cows is characterized by small herd size, low level of production, and relatively early peak of lactation. Several random regression models for first-lactation milk yield were initially compared using the "percentage of squared bias" criterion and the correlations between true and predicted breeding values. Models with random herd-test-date effects, fixed age-season and herd-year curves, and random additive genetic and permanent environmental curves (Legendre polynomials of different orders were used for all regressions) were chosen for further studies. Additional comparisons included analyses of the residuals and shapes of variance curves in days in milk. The low production level and early peak of lactation of the breed required the use of Legendre polynomials of order 5 to describe age-season lactation curves. For the other curves, Legendre polynomials of order 3 satisfactorily described daily milk yield variation. Fitting third-order polynomials for the permanent environmental effect made it possible to adequately account for heterogeneous residual variance at different stages of lactation.
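
    The Legendre polynomial covariates used in random regression test-day models of this kind can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the 5 to 305 day range for days in milk, and the use of unnormalized polynomials are assumptions made for the example (genetic evaluation software often uses normalized Legendre polynomials instead).

```python
def standardize_dim(dim, dim_min=5, dim_max=305):
    """Map days in milk onto [-1, 1], the interval on which Legendre
    polynomials are orthogonal. The 5-305 day range is an assumption."""
    return 2.0 * (dim - dim_min) / (dim_max - dim_min) - 1.0


def legendre_basis(x, max_degree):
    """Return [P_0(x), ..., P_max_degree(x)] using Bonnet's recurrence:
    (n + 1) P_{n+1}(x) = (2n + 1) x P_n(x) - n P_{n-1}(x)."""
    values = [1.0]
    if max_degree >= 1:
        values.append(float(x))
    for n in range(1, max_degree):
        next_value = ((2 * n + 1) * x * values[n] - n * values[n - 1]) / (n + 1)
        values.append(next_value)
    return values


# Covariates for one test-day record at 155 days in milk (hypothetical record).
covariates = legendre_basis(standardize_dim(155), max_degree=3)
```

    Each test-day record then contributes one row of such covariates to the fixed and random regression parts of the model; the maximum degree is chosen per curve, as in the comparison described above.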

  3. Partial Least Squares Regression Calibration of an Ultraviolet-Visible Spectrophotometer for Measurements of Chemical Oxygen Demand in Dye Wastewater

    NASA Astrophysics Data System (ADS)

    Mai, W.; Zhang, J.-F.; Zhao, X.-M.; Li, Z.; Xu, Z.-W.

    2017-11-01

    Wastewater from the dye industry is typically analyzed using a standard method for measurement of chemical oxygen demand (COD) or by a single-wavelength spectroscopic method. To overcome the disadvantages of these methods, ultraviolet-visible (UV-Vis) spectroscopy was combined with principal component regression (PCR) and partial least squares regression (PLSR) in this study. Unlike the standard method, this method does not require digestion of the samples for preparation. Experiments showed that the PLSR model offered high prediction performance for COD, with a mean relative error of about 5% for two dyes. This error is similar to that obtained with the standard method. In this study, the precision of the PLSR model decreased with the number of dye compounds present. It is likely that multiple models will be required in reality, and the complexity of a COD monitoring system would be greatly reduced if the PLSR model is used because it can include several dyes. UV-Vis spectroscopy with PLSR successfully enhanced the performance of COD prediction for dye wastewater and showed good potential for application in on-line water quality monitoring.

  4. Use of Two-Part Regression Calibration Model to Correct for Measurement Error in Episodically Consumed Foods in a Single-Replicate Study Design: EPIC Case Study

    PubMed Central

    Agogo, George O.; van der Voet, Hilko; Veer, Pieter van’t; Ferrari, Pietro; Leenders, Max; Muller, David C.; Sánchez-Cantalejo, Emilio; Bamia, Christina; Braaten, Tonje; Knüppel, Sven; Johansson, Ingegerd; van Eeuwijk, Fred A.; Boshuizen, Hendriek

    2014-01-01

    In epidemiologic studies, measurement error in dietary variables often attenuates the association between dietary intake and disease occurrence. To adjust for the attenuation caused by error in dietary intake, regression calibration is commonly used. To apply regression calibration, unbiased reference measurements are required. Short-term reference measurements for foods that are not consumed daily contain excess zeroes that pose challenges in the calibration model. We adapted a two-part regression calibration model, initially developed for multiple replicates of reference measurements per individual, to a single-replicate setting. We showed how to handle excess zero reference measurements with a two-step modeling approach, how to explore heteroscedasticity in the consumed amount with a variance-mean graph, how to explore nonlinearity with the generalized additive modeling (GAM) and the empirical logit approaches, and how to select covariates in the calibration model. The performance of the two-part calibration model was compared with that of its one-part counterpart. We used vegetable intake and mortality data from the European Prospective Investigation into Cancer and Nutrition (EPIC) study. In EPIC, reference measurements were taken with 24-hour recalls. For each of the three vegetable subgroups assessed separately, correcting for error with an appropriately specified two-part calibration model resulted in an approximately threefold increase in the strength of association with all-cause mortality, as measured by the log hazard ratio. We also found that the standard way of including covariates in the calibration model can lead to overfitting of the two-part calibration model. Moreover, the extent of adjusting for error is influenced by the number and forms of covariates in the calibration model. For episodically consumed foods, we advise researchers to pay special attention to response distribution, nonlinearity, and covariate inclusion in specifying the calibration model. PMID:25402487
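
    The general idea of a two-part calibration model can be sketched as follows: one part models the probability of a non-zero reference measurement, the other models the amount consumed when it is non-zero, and the calibrated intake is their product. This is a simplified illustration under stated assumptions, not the model specified in the study: the function names, the single questionnaire covariate, the untransformed linear amount model, and the coefficients a0, a1, b0, b1 are all hypothetical.

```python
import math


def consumption_probability(questionnaire_intake, a0, a1):
    """Part 1 (assumed logistic): probability that the 24-hour recall
    reports a non-zero amount."""
    return 1.0 / (1.0 + math.exp(-(a0 + a1 * questionnaire_intake)))


def amount_if_consumed(questionnaire_intake, b0, b1):
    """Part 2 (assumed linear, untransformed): expected amount on a
    day when the food is consumed."""
    return b0 + b1 * questionnaire_intake


def calibrated_intake(questionnaire_intake, a0, a1, b0, b1):
    """Expected reference intake = P(consumed) * E(amount | consumed)."""
    return (consumption_probability(questionnaire_intake, a0, a1)
            * amount_if_consumed(questionnaire_intake, b0, b1))
```

    In practice both parts would be fitted to the reference measurements, and the calibrated values would replace the questionnaire intake in the disease model.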

  5. Error Covariance Penalized Regression: A novel multivariate model combining penalized regression with multivariate error structure.

    PubMed

    Allegrini, Franco; Braga, Jez W B; Moreira, Alessandro C O; Olivieri, Alejandro C

    2018-06-29

    A new multivariate regression model, named Error Covariance Penalized Regression (ECPR), is presented. Following a penalized regression strategy, the proposed model incorporates information about the measurement error structure of the system, using the error covariance matrix (ECM) as a penalization term. Results are reported from both simulations and experimental data based on replicate mid and near infrared (MIR and NIR) spectral measurements. The results for ECPR are better under non-iid conditions when compared with traditional first-order multivariate methods such as ridge regression (RR), principal component regression (PCR) and partial least-squares regression (PLS). Copyright © 2018 Elsevier B.V. All rights reserved.

  6. Analyzing Student Learning Outcomes: Usefulness of Logistic and Cox Regression Models. IR Applications, Volume 5

    ERIC Educational Resources Information Center

    Chen, Chau-Kuang

    2005-01-01

    Logistic and Cox regression methods are practical tools used to model the relationships between certain student learning outcomes and their relevant explanatory variables. The logistic regression model fits an S-shaped curve into a binary outcome with data points of zero and one. The Cox regression model allows investigators to study the duration…
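
    The S-shaped curve of a logistic regression model can be illustrated with a short sketch. The function names and the single-predictor form are assumptions made for the example; no coefficients from the report are used.

```python
import math


def predicted_probability(x, intercept, slope):
    """Logistic curve: probability that the binary outcome equals one."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))


def odds_ratio(slope):
    """Multiplicative change in the odds for a one-unit increase in x."""
    return math.exp(slope)
```

    The predicted probability always stays between zero and one, and each one-unit increase in the predictor multiplies the odds of the outcome by the same factor.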

  7. Empirical Modeling of Plant Gas Fluxes in Controlled Environments

    NASA Technical Reports Server (NTRS)

    Cornett, Jessie David

    1994-01-01

    As humans extend their reach beyond the earth, bioregenerative life support systems must replace the resupply and physical/chemical systems now used. The Controlled Ecological Life Support System (CELSS) will utilize plants to recycle the carbon dioxide (CO2) and excrement produced by humans and return oxygen (O2), purified water and food. CELSS design requires knowledge of gas flux levels for net photosynthesis (PS(sub n)), dark respiration (R(sub d)) and evapotranspiration (ET). Full season gas flux data regarding these processes for wheat (Triticum aestivum), soybean (Glycine max) and rice (Oryza sativa) from published sources were used to develop empirical models. Univariate models relating crop age (days after planting) and gas flux were fit by simple regression. Models are either high order (5th to 8th) or more complex polynomials whose curves describe crop development characteristics. The models provide good estimates of gas flux maxima, but are of limited utility. To broaden the applicability, data were transformed to dimensionless or correlation formats and, again, fit by regression. Polynomials, similar to those in the initial effort, were selected as the most appropriate models. These models indicate that, within a cultivar, gas flux patterns appear remarkably similar prior to maximum flux, but exhibit considerable variation beyond this point. This suggests that more broadly applicable models of plant gas flux are feasible, but univariate models defining gas flux as a function of crop age are too simplistic. Multivariate models using CO2 and crop age were fit for PS(sub n), and R(sub d) by multiple regression. In each case, the selected model is a subset of a full third order model with all possible interactions. These models are improvements over the univariate models because they incorporate more than the single factor, crop age, as the primary variable governing gas flux. 
They are still limited, however, by their reliance on the other environmental conditions under which the original data were collected. Three-dimensional plots representing the response surface of each model are included. Suitability of using empirical models to generate engineering design estimates is discussed. Recommendations for the use of more complex multivariate models to increase versatility are included.

  8. Robust geographically weighted regression of modeling the Air Polluter Standard Index (APSI)

    NASA Astrophysics Data System (ADS)

    Warsito, Budi; Yasin, Hasbi; Ispriyanti, Dwi; Hoyyi, Abdul

    2018-05-01

    The Geographically Weighted Regression (GWR) model has been widely applied in many practical fields to explore the spatial heterogeneity of a regression model. However, this method is inherently not robust to outliers. Outliers commonly exist in data sets and may lead to a distorted estimate of the underlying regression model. One solution for handling outliers in the regression model is to use robust models; the resulting model is called Robust Geographically Weighted Regression (RGWR). This research aims to aid the government in the policy-making process related to air pollution mitigation by developing a standard index model for air polluters (Air Polluter Standard Index - APSI) based on the RGWR approach. In this research, we also consider seven variables that are directly related to the air pollution level: traffic velocity, population density, the business center aspect, air humidity, wind velocity, air temperature, and the area of urban forest. The best model is determined by the smallest AIC value. There are significant differences between regression and RGWR in this case, but basic GWR using the Gaussian kernel is the best model for APSI because it has the smallest AIC.
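
    The local fitting step behind GWR with a Gaussian kernel can be sketched as follows. This is a minimal illustration with one predictor, not the authors' implementation: the function names, the kernel form exp(-0.5 (d / h)^2), and the bandwidth h are assumptions made for the example, and the robust (RGWR) reweighting step is not shown.

```python
import math


def gaussian_weights(distances, bandwidth):
    """Weight each observation by its distance to the regression point,
    using one common form of the Gaussian kernel (assumed here)."""
    return [math.exp(-0.5 * (d / bandwidth) ** 2) for d in distances]


def weighted_simple_regression(x, y, weights):
    """Weighted least squares for y = intercept + slope * x."""
    total = sum(weights)
    x_bar = sum(w * xi for w, xi in zip(weights, x)) / total
    y_bar = sum(w * yi for w, yi in zip(weights, y)) / total
    s_xx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
    s_xy = sum(w * (xi - x_bar) * (yi - y_bar)
               for w, xi, yi in zip(weights, x, y))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope
```

    Repeating the weighted fit at every location, each time with distances measured from that location, yields the spatially varying coefficients that GWR reports.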

  9. Classification of Large-Scale Remote Sensing Images for Automatic Identification of Health Hazards: Smoke Detection Using an Autologistic Regression Classifier.

    PubMed

    Wolters, Mark A; Dean, C B

    2017-01-01

    Remote sensing images from Earth-orbiting satellites are a potentially rich data source for monitoring and cataloguing atmospheric health hazards that cover large geographic regions. A method is proposed for classifying such images into hazard and nonhazard regions using the autologistic regression model, which may be viewed as a spatial extension of logistic regression. The method includes a novel and simple approach to parameter estimation that makes it well suited to handling the large and high-dimensional datasets arising from satellite-borne instruments. The methodology is demonstrated on both simulated images and a real application to the identification of forest fire smoke.

  10. Estimation of stature from the foot and its segments in a sub-adult female population of North India

    PubMed Central

    2011-01-01

    Background: Establishing personal identity is one of the main concerns in forensic investigations. Estimation of stature forms a basic domain of the investigation process in unknown and co-mingled human remains in forensic anthropology case work. The objective of the present study was to set up standards for estimation of stature from the foot and its segments in a sub-adult female population. Methods: The sample for the study constituted 149 young females from the Northern part of India. The participants were aged between 13 and 18 years. Besides stature, seven anthropometric measurements that included length of the foot from each toe (T1, T2, T3, T4, and T5 respectively), foot breadth at ball (BBAL) and foot breadth at heel (BHEL) were measured on both feet in each participant using standard methods and techniques. Results: The results indicated that statistically significant differences (p < 0.05) between left and right feet occur in both the foot breadth measurements (BBAL and BHEL). Foot length measurements (T1 to T5 lengths) did not show any statistically significant bilateral asymmetry. The correlation between stature and all the foot measurements was found to be positive and statistically significant (p-value < 0.001). Linear regression models and multiple regression models were derived for estimation of stature from the measurements of the foot. The present study indicates that anthropometric measurements of foot and its segments are valuable in the estimation of stature. Foot length measurements estimate stature with greater accuracy when compared to foot breadth measurements. Conclusions: The present study concluded that foot measurements have a strong relationship with stature in the sub-adult female population of North India. Hence, the stature of an individual can be successfully estimated from the foot and its segments using different regression models derived in the study. The regression models derived in the study may be applied successfully for the estimation of stature in sub-adult females, whenever foot remains are brought for forensic examination. Stepwise multiple regression models tend to estimate stature more accurately than linear regression models in female sub-adults. PMID:22104433

  11. Estimation of stature from the foot and its segments in a sub-adult female population of North India.

    PubMed

    Krishan, Kewal; Kanchan, Tanuj; Passi, Neelam

    2011-11-21

    Establishing personal identity is one of the main concerns in forensic investigations. Estimation of stature forms a basic domain of the investigation process in unknown and co-mingled human remains in forensic anthropology case work. The objective of the present study was to set up standards for estimation of stature from the foot and its segments in a sub-adult female population. The sample for the study constituted 149 young females from the Northern part of India. The participants were aged between 13 and 18 years. Besides stature, seven anthropometric measurements that included length of the foot from each toe (T1, T2, T3, T4, and T5 respectively), foot breadth at ball (BBAL) and foot breadth at heel (BHEL) were measured on both feet in each participant using standard methods and techniques. The results indicated that statistically significant differences (p < 0.05) between left and right feet occur in both the foot breadth measurements (BBAL and BHEL). Foot length measurements (T1 to T5 lengths) did not show any statistically significant bilateral asymmetry. The correlation between stature and all the foot measurements was found to be positive and statistically significant (p-value < 0.001). Linear regression models and multiple regression models were derived for estimation of stature from the measurements of the foot. The present study indicates that anthropometric measurements of foot and its segments are valuable in the estimation of stature. Foot length measurements estimate stature with greater accuracy when compared to foot breadth measurements. The present study concluded that foot measurements have a strong relationship with stature in the sub-adult female population of North India. Hence, the stature of an individual can be successfully estimated from the foot and its segments using different regression models derived in the study. The regression models derived in the study may be applied successfully for the estimation of stature in sub-adult females, whenever foot remains are brought for forensic examination. Stepwise multiple regression models tend to estimate stature more accurately than linear regression models in female sub-adults.

  12. Estimation of genetic parameters related to eggshell strength using random regression models.

    PubMed

    Guo, J; Ma, M; Qu, L; Shen, M; Dou, T; Wang, K

    2015-01-01

    This study examined the changes in eggshell strength and the genetic parameters related to this trait throughout a hen's laying life using random regression. The data were collected from a crossbred population between 2011 and 2014, where the eggshell strength was determined repeatedly for 2260 hens. Using random regression models (RRMs), several Legendre polynomials were employed to estimate the fixed, direct genetic and permanent environment effects. The residual effects were treated as independently distributed with heterogeneous variance for each test week. The direct genetic variance was included with second-order Legendre polynomials and the permanent environment with third-order Legendre polynomials. The heritability of eggshell strength ranged from 0.26 to 0.43, the repeatability ranged between 0.47 and 0.69, and the estimated genetic correlations between test weeks were high (> 0.67). The first eigenvalue of the genetic covariance matrix accounted for about 97% of the sum of all the eigenvalues. The flexibility and statistical power of RRM suggest that this model could be an effective method to improve eggshell quality and to reduce losses due to cracked eggs in a breeding plan.
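
    As a rough illustration of how a random regression model turns Legendre covariance functions into week-specific heritability and repeatability, the sketch below evaluates the variance components at one test week. Everything numeric here is hypothetical: the coefficient covariance matrices K_GEN and K_PE, the residual variance, and the week range are made up for illustration, and "second-order" and "third-order" are assumed to mean polynomial degree 2 and 3. Unnormalized Legendre polynomials are used for simplicity; this is not the authors' implementation.

```python
def legendre_basis(t, degree):
    """Unnormalized Legendre polynomials P_0..P_degree at t in [-1, 1]."""
    vals = [1.0, t]
    for k in range(1, degree):
        # Bonnet recurrence: (k+1) P_{k+1} = (2k+1) t P_k - k P_{k-1}
        vals.append(((2 * k + 1) * t * vals[k] - k * vals[k - 1]) / (k + 1))
    return vals[: degree + 1]


def quad_form(phi, K):
    """phi' K phi: variance implied by coefficient covariance matrix K."""
    n = len(phi)
    return sum(phi[i] * K[i][j] * phi[j] for i in range(n) for j in range(n))


def variance_ratios(week, first_week, last_week, K_gen, K_pe, resid_var):
    """Return (heritability, repeatability) at one test week."""
    # Rescale the week to the interval [-1, 1] on which the basis is defined.
    t = 2.0 * (week - first_week) / (last_week - first_week) - 1.0
    var_g = quad_form(legendre_basis(t, len(K_gen) - 1), K_gen)
    var_pe = quad_form(legendre_basis(t, len(K_pe) - 1), K_pe)
    total = var_g + var_pe + resid_var
    return var_g / total, (var_g + var_pe) / total


# Hypothetical covariance matrices (degree 2 genetic, degree 3 permanent env.)
K_GEN = [[4.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.5]]
K_PE = [[3.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 0.5, 0.0], [0.0, 0.0, 0.0, 0.2]]

h2, rep = variance_ratios(week=40, first_week=20, last_week=60,
                          K_gen=K_GEN, K_pe=K_PE, resid_var=5.0)
print(f"heritability={h2:.3f}, repeatability={rep:.3f}")
```

    Because the residual variance is heterogeneous across test weeks in the abstract, resid_var would be supplied separately for each week in practice.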

  13. Method validation using weighted linear regression models for quantification of UV filters in water samples.

    PubMed

    da Silva, Claudia Pereira; Emídio, Elissandro Soares; de Marchi, Mary Rosa Rodrigues

    2015-01-01

    This paper describes the validation of a method consisting of solid-phase extraction followed by gas chromatography-tandem mass spectrometry for the analysis of the ultraviolet (UV) filters benzophenone-3, ethylhexyl salicylate, ethylhexyl methoxycinnamate and octocrylene. The method validation criteria included evaluation of selectivity, analytical curve, trueness, precision, limits of detection and limits of quantification. The non-weighted linear regression model has traditionally been used for calibration, but it is not necessarily the optimal model in all cases. Because the assumption of homoscedasticity was not met for the analytical data in this work, a weighted least squares linear regression was used for the calibration method. The evaluated analytical parameters were satisfactory for the analytes and showed recoveries at four fortification levels between 62% and 107%, with relative standard deviations less than 14%. The detection limits ranged from 7.6 to 24.1 ng L(-1). The proposed method was used to determine the amount of UV filters in water samples from water treatment plants in Araraquara and Jau in São Paulo, Brazil. Copyright © 2014 Elsevier B.V. All rights reserved.
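
    The switch from ordinary to weighted least squares described above can be sketched in a few lines. The calibration data and the 1/x^2 weighting scheme below are assumptions chosen for illustration (the abstract does not state which weights were used), and weighted_linear_fit is a hypothetical helper name.

```python
def weighted_linear_fit(x, y, w):
    """Return (intercept, slope) minimizing sum(w_i * (y_i - a - b * x_i)**2)."""
    sw = sum(w)
    # Weighted means of x and y.
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    # Weighted sums of squares and cross-products about the weighted means.
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


# Hypothetical calibration standards (concentration, instrument response).
conc = [10.0, 50.0, 100.0, 500.0, 1000.0]
signal = [21.0, 99.0, 205.0, 990.0, 2030.0]

# Assumed weighting: 1/x^2, which downweights the noisier high standards.
weights = [1.0 / c ** 2 for c in conc]

a, b = weighted_linear_fit(conc, signal, weights)
print(f"intercept={a:.3f}, slope={b:.4f}")
```

    Setting every weight to 1.0 recovers the non-weighted fit, which makes it easy to compare the two calibration lines on the same standards.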

  14. Forecasting the probability of future groundwater levels declining below specified low thresholds in the conterminous U.S.

    USGS Publications Warehouse

    Dudley, Robert W.; Hodgkins, Glenn A.; Dickinson, Jesse

    2017-01-01

    We present a logistic regression approach for forecasting the probability of future groundwater levels declining or maintaining below specific groundwater-level thresholds. We tested our approach on 102 groundwater wells in different climatic regions and aquifers of the United States that are part of the U.S. Geological Survey Groundwater Climate Response Network. We evaluated the importance of current groundwater levels, precipitation, streamflow, seasonal variability, Palmer Drought Severity Index, and atmosphere/ocean indices for developing the logistic regression equations. Several diagnostics of model fit were used to evaluate the regression equations, including testing of autocorrelation of residuals, goodness-of-fit metrics, and bootstrap validation testing. The probabilistic predictions were most successful at wells with high persistence (low month-to-month variability) in their groundwater records and at wells where the groundwater level remained below the defined low threshold for sustained periods (generally three months or longer). The model fit was weakest at wells with strong seasonal variability in levels and with shorter duration low-threshold events. We identified challenges in deriving probabilistic-forecasting models and possible approaches for addressing those challenges.

  15. Decomposing Racial/Ethnic Disparities in Influenza Vaccination among the Elderly

    PubMed Central

    Yoo, Byung-Kwang; Hasebe, Takuya; Szilagyi, Peter G.

    2015-01-01

    While persistent racial/ethnic disparities in influenza vaccination have been reported among the elderly, characteristics contributing to disparities are poorly understood. This study aimed to assess characteristics associated with racial/ethnic disparities in influenza vaccination using a nonlinear Oaxaca-Blinder decomposition method. We performed cross-sectional multivariable logistic regression analyses for which the dependent variable was self-reported receipt of influenza vaccine during the 2010–2011 season among community-dwelling non-Hispanic African-American (AA), non-Hispanic White (W), English-speaking Hispanic (EH) and Spanish-speaking Hispanic (SH) elderly, enrolled in the 2011 Medicare Current Beneficiary Survey (MCBS) (unweighted/weighted N = 6,095/19.2 million). Using the nonlinear Oaxaca-Blinder decomposition method, we assessed the relative contribution of seventeen covariates (including socio-demographic characteristics, health status, insurance, access, preference regarding healthcare, and geographic regions) to disparities in influenza vaccination. Unadjusted racial/ethnic disparities in influenza vaccination were 14.1 percentage points (pp) (W-AA disparity, p<.001), 25.7 pp (W-SH disparity, p<.001) and 0.6 pp (W-EH disparity, p>.8). The Oaxaca-Blinder decomposition method estimated that the unadjusted W-AA and W-SH disparities in vaccination could be reduced by only 45% even if AA and SH groups become equivalent to Whites in all covariates in multivariable regression models. The remaining 55% of disparities were attributed to (a) racial/ethnic differences in the estimated coefficients (e.g., odds ratios) in the regression models and (b) characteristics not included in the regression models. Our analysis found that only about 45% of racial/ethnic disparities in influenza vaccination among the elderly could be reduced by equalizing recognized characteristics among racial/ethnic groups. Future studies are needed to identify additional modifiable characteristics causing disparities in influenza vaccination. PMID:25900133

  16. Determining delayed admission to intensive care unit for mechanically ventilated patients in the emergency department.

    PubMed

    Hung, Shih-Chiang; Kung, Chia-Te; Hung, Chih-Wei; Liu, Ber-Ming; Liu, Jien-Wei; Chew, Ghee; Chuang, Hung-Yi; Lee, Wen-Huei; Lee, Tzu-Chi

    2014-08-23

    The adverse effects of delayed admission to the intensive care unit (ICU) have been recognized in previous studies. However, the definition of delayed admission varies across studies. This study proposed a model to define "delayed admission", and explored the effect of ICU-waiting time on patients' outcome. This retrospective cohort study included non-traumatic adult patients on mechanical ventilation in the emergency department (ED), from July 2009 to June 2010. The primary outcome measures were 21-ventilator-day mortality and prolonged hospital stays (over 30 days). Models of Cox regression and logistic regression were used for multivariate analysis. The non-delayed ICU-waiting was defined as a period in which the time effect on mortality was not statistically significant in a Cox regression model. To identify a suitable cut-off point between "delayed" and "non-delayed", subsets from the overall data were made based on ICU-waiting time and the hazard ratio of ICU-waiting hour in each subset was iteratively calculated. The cut-off time was then used to evaluate the impact of delayed ICU admission on mortality and prolonged length of hospital stay. The final analysis included 1,242 patients. The time effect on mortality emerged after 4 hours, thus we defined an ICU-waiting time in the ED of > 4 hours as delayed. By logistic regression analysis, delayed ICU admission affected the outcomes of 21-ventilator-day mortality and prolonged hospital stay, with odds ratios of 1.41 (95% confidence interval, 1.05 to 1.89) and 1.56 (95% confidence interval, 1.07 to 2.27) respectively. For patients on mechanical ventilation at the ED, delayed ICU admission is associated with higher probability of mortality and additional resource expenditure. A benchmark waiting time of no more than 4 hours for ICU admission is recommended.

  17. Using Structured Additive Regression Models to Estimate Risk Factors of Malaria: Analysis of 2010 Malawi Malaria Indicator Survey Data

    PubMed Central

    Chirombo, James; Lowe, Rachel; Kazembe, Lawrence

    2014-01-01

    Background: After years of implementing Roll Back Malaria (RBM) interventions, the changing landscape of malaria in terms of risk factors and spatial pattern has not been fully investigated. This paper uses the 2010 malaria indicator survey data to investigate if known malaria risk factors remain relevant after many years of interventions. Methods: We adopted a structured additive logistic regression model that allowed for spatial correlation, to more realistically estimate malaria risk factors. Our model included child and household level covariates, as well as climatic and environmental factors. Continuous variables were modelled by assuming second order random walk priors, while spatial correlation was specified as a Markov random field prior, with fixed effects assigned diffuse priors. Inference was fully Bayesian resulting in an under five malaria risk map for Malawi. Results: Malaria risk increased with increasing age of the child. With respect to socio-economic factors, the greater the household wealth, the lower the malaria prevalence. A general decline in malaria risk was observed as altitude increased. Minimum temperatures and average total rainfall in the three months preceding the survey did not show a strong association with disease risk. Conclusions: The structured additive regression model offered a flexible extension to standard regression models by enabling simultaneous modelling of possible nonlinear effects of continuous covariates, spatial correlation and heterogeneity, while estimating usual fixed effects of categorical and continuous observed variables. Our results confirmed that malaria epidemiology is a complex interaction of biotic and abiotic factors, both at the individual, household and community level and that risk factors are still relevant many years after extensive implementation of RBM activities. PMID:24991915

  18. Using structured additive regression models to estimate risk factors of malaria: analysis of 2010 Malawi malaria indicator survey data.

    PubMed

    Chirombo, James; Lowe, Rachel; Kazembe, Lawrence

    2014-01-01

    After years of implementing Roll Back Malaria (RBM) interventions, the changing landscape of malaria in terms of risk factors and spatial pattern has not been fully investigated. This paper uses the 2010 malaria indicator survey data to investigate if known malaria risk factors remain relevant after many years of interventions. We adopted a structured additive logistic regression model that allowed for spatial correlation, to more realistically estimate malaria risk factors. Our model included child and household level covariates, as well as climatic and environmental factors. Continuous variables were modelled by assuming second order random walk priors, while spatial correlation was specified as a Markov random field prior, with fixed effects assigned diffuse priors. Inference was fully Bayesian resulting in an under five malaria risk map for Malawi. Malaria risk increased with increasing age of the child. With respect to socio-economic factors, the greater the household wealth, the lower the malaria prevalence. A general decline in malaria risk was observed as altitude increased. Minimum temperatures and average total rainfall in the three months preceding the survey did not show a strong association with disease risk. The structured additive regression model offered a flexible extension to standard regression models by enabling simultaneous modelling of possible nonlinear effects of continuous covariates, spatial correlation and heterogeneity, while estimating usual fixed effects of categorical and continuous observed variables. Our results confirmed that malaria epidemiology is a complex interaction of biotic and abiotic factors, both at the individual, household and community level and that risk factors are still relevant many years after extensive implementation of RBM activities.

  19. Estimation of Fine Particulate Matter in Taipei Using Landuse Regression and Bayesian Maximum Entropy Methods

    PubMed Central

    Yu, Hwa-Lung; Wang, Chih-Hsih; Liu, Ming-Che; Kuo, Yi-Ming

    2011-01-01

    Fine airborne particulate matter (PM2.5) has adverse effects on human health. Assessing the long-term effects of PM2.5 exposure on human health and ecology is often limited by a lack of reliable PM2.5 measurements. In Taipei, PM2.5 levels were not systematically measured until August 2005. Due to the popularity of geographic information systems (GIS), the landuse regression method has been widely used in the spatial estimation of PM concentrations. This method accounts for the potential contributing factors of the local environment, such as traffic volume. Geostatistical methods, on the other hand, account for the spatiotemporal dependence among the observations of ambient pollutants. This study assesses the performance of the landuse regression model for the spatiotemporal estimation of PM2.5 in the Taipei area. Specifically, this study integrates the landuse regression model with the geostatistical approach within the framework of the Bayesian maximum entropy (BME) method. The resulting epistemic framework can assimilate knowledge bases including: (a) empirical-based spatial trends of PM concentration based on landuse regression, (b) the spatio-temporal dependence among PM observation information, and (c) site-specific PM observations. The proposed approach performs the spatiotemporal estimation of PM2.5 levels in the Taipei area (Taiwan) from 2005–2007. PMID:21776223

  20. Estimation of fine particulate matter in Taipei using landuse regression and bayesian maximum entropy methods.

    PubMed

    Yu, Hwa-Lung; Wang, Chih-Hsih; Liu, Ming-Che; Kuo, Yi-Ming

    2011-06-01

    Fine airborne particulate matter (PM2.5) has adverse effects on human health. Assessing the long-term effects of PM2.5 exposure on human health and ecology is often limited by a lack of reliable PM2.5 measurements. In Taipei, PM2.5 levels were not systematically measured until August 2005. Due to the popularity of geographic information systems (GIS), the landuse regression method has been widely used in the spatial estimation of PM concentrations. This method accounts for the potential contributing factors of the local environment, such as traffic volume. Geostatistical methods, on the other hand, account for the spatiotemporal dependence among the observations of ambient pollutants. This study assesses the performance of the landuse regression model for the spatiotemporal estimation of PM2.5 in the Taipei area. Specifically, this study integrates the landuse regression model with the geostatistical approach within the framework of the Bayesian maximum entropy (BME) method. The resulting epistemic framework can assimilate knowledge bases including: (a) empirical-based spatial trends of PM concentration based on landuse regression, (b) the spatio-temporal dependence among PM observation information, and (c) site-specific PM observations. The proposed approach performs the spatiotemporal estimation of PM2.5 levels in the Taipei area (Taiwan) from 2005-2007.

  1. A spatial regression procedure for evaluating the relationship between AVHRR-NDVI and climate in the northern Great Plains

    USGS Publications Warehouse

    Ji, Lei; Peters, Albert J.

    2004-01-01

    The relationship between vegetation and climate in the grassland and cropland of the northern US Great Plains was investigated with Normalized Difference Vegetation Index (NDVI) (1989–1993) images derived from the Advanced Very High Resolution Radiometer (AVHRR), and climate data from automated weather stations. The relationship was quantified using a spatial regression technique that adjusts for spatial autocorrelation inherent in these data. Conventional regression techniques used frequently in previous studies are not adequate, because they are based on the assumption of independent observations. Six climate variables during the growing season (precipitation, potential evapotranspiration, daily maximum and minimum air temperature, soil temperature, and solar irradiation) were regressed on NDVI derived from a 10-km weather station buffer. The regression model identified precipitation and potential evapotranspiration as the most significant climatic variables, indicating that the water balance is the most important factor controlling vegetation condition at an annual timescale. The model indicates that 46% and 24% of variation in NDVI is accounted for by climate in grassland and cropland, respectively, indicating that grassland vegetation has a more pronounced response to climate variation than cropland. Other factors contributing to NDVI variation include environmental factors (soil, groundwater and terrain), human manipulation of crops, and sensor variation.

  2. Bayesian Unimodal Density Regression for Causal Inference

    ERIC Educational Resources Information Center

    Karabatsos, George; Walker, Stephen G.

    2011-01-01

    Karabatsos and Walker (2011) introduced a new Bayesian nonparametric (BNP) regression model. Through analyses of real and simulated data, they showed that the BNP regression model outperforms other parametric and nonparametric regression models of common use, in terms of predictive accuracy of the outcome (dependent) variable. The other,…

  3. Bayesian Estimation of Multivariate Latent Regression Models: Gauss versus Laplace

    ERIC Educational Resources Information Center

    Culpepper, Steven Andrew; Park, Trevor

    2017-01-01

    A latent multivariate regression model is developed that employs a generalized asymmetric Laplace (GAL) prior distribution for regression coefficients. The model is designed for high-dimensional applications where an approximate sparsity condition is satisfied, such that many regression coefficients are near zero after accounting for all the model…

  4. A simple approach to power and sample size calculations in logistic regression and Cox regression models.

    PubMed

    Vaeth, Michael; Skovlund, Eva

    2004-06-15

    For a given regression problem it is possible to identify a suitably defined equivalent two-sample problem such that the power or sample size obtained for the two-sample problem also applies to the regression problem. For a standard linear regression model the equivalent two-sample problem is easily identified, but for generalized linear models and for Cox regression models the situation is more complicated. An approximately equivalent two-sample problem may, however, also be identified here. In particular, we show that for logistic regression and Cox regression models the equivalent two-sample problem is obtained by selecting two equally sized samples for which the parameters differ by a value equal to the slope times twice the standard deviation of the independent variable and further requiring that the overall expected number of events is unchanged. In a simulation study we examine the validity of this approach to power calculations in logistic regression and Cox regression models. Several different covariate distributions are considered for selected values of the overall response probability and a range of alternatives. For the Cox regression model we consider both constant and non-constant hazard rates. The results show that in general the approach is remarkably accurate even in relatively small samples. Some discrepancies are, however, found in small samples with few events and a highly skewed covariate distribution. Comparison with results based on alternative methods for logistic regression models with a single continuous covariate indicates that the proposed method is at least as good as its competitors. The method is easy to implement and therefore provides a simple way to extend the range of problems that can be covered by the usual formulas for power and sample size determination. Copyright 2004 John Wiley & Sons, Ltd.
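
    The equivalent two-sample idea above can be sketched for logistic regression as follows. This is one reading of the abstract, not the authors' code: "parameters" is assumed to mean the log-odds in each group, "overall expected number of events is unchanged" is assumed to mean that the two group probabilities average to the overall response probability, and the usual normal-approximation formula for comparing two proportions is assumed as the "usual formula". The function names and the example inputs are hypothetical.

```python
import math
from statistics import NormalDist


def expit(t):
    return 1.0 / (1.0 + math.exp(-t))


def logit(p):
    return math.log(p / (1.0 - p))


def equivalent_two_sample(beta, sd_x, p_bar, tol=1e-12):
    """Find (p1, p2) whose log-odds differ by 2*beta*sd_x and average to p_bar."""
    delta = 2.0 * beta * sd_x
    lo, hi = 1e-12, 1.0 - 1e-12
    # Bisection: expit(logit(p1) + delta) + p1 is increasing in p1.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expit(logit(mid) + delta) + mid - 2.0 * p_bar < 0.0:
            lo = mid
        else:
            hi = mid
    p1 = 0.5 * (lo + hi)
    return p1, expit(logit(p1) + delta)


def total_sample_size(beta, sd_x, p_bar, alpha=0.05, power=0.80):
    """Total n (both groups) from the two-proportion formula; beta must be nonzero."""
    p1, p2 = equivalent_two_sample(beta, sd_x, p_bar)
    z_a = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_b = NormalDist().inv_cdf(power)
    numerator = (z_a * math.sqrt(2.0 * p_bar * (1.0 - p_bar))
                 + z_b * math.sqrt(p1 * (1.0 - p1) + p2 * (1.0 - p2))) ** 2
    return 2 * math.ceil(numerator / (p2 - p1) ** 2)


# Hypothetical inputs: log odds ratio 0.5 per unit of x, SD(x) = 1, 30% events.
print(total_sample_size(beta=0.5, sd_x=1.0, p_bar=0.3))
```

    A steeper slope or a more variable covariate widens the gap between the two group probabilities, so the required sample size falls, which matches the intuition behind the two-sample reduction.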

  5. Comparative evaluation of urban storm water quality models

    NASA Astrophysics Data System (ADS)

    Vaze, J.; Chiew, Francis H. S.

    2003-10-01

    The estimation of urban storm water pollutant loads is required for the development of mitigation and management strategies to minimize impacts to receiving environments. Event pollutant loads are typically estimated using either regression equations or "process-based" water quality models. The relative merit of using regression models compared to process-based models is not clear. A modeling study is carried out here to evaluate the comparative ability of the regression equations and process-based water quality models to estimate event diffuse pollutant loads from impervious surfaces. The results indicate that, once calibrated, both the regression equations and the process-based model can estimate event pollutant loads satisfactorily. In fact, the loads estimated using the regression equation as a function of rainfall intensity and runoff rate are better than the loads estimated using the process-based model. Therefore, if only estimates of event loads are required, regression models should be used because they are simpler and require less data compared to process-based models.

  6. The natural outcome of melamine-induced bladder stones with bladder epithelial hyperplasia after the withdrawal of melamine in mice.

    PubMed

    Ren, Shu-Ting; Xu, Chang-Fu; Du, Yun-Xia; Gao, Xiao-Li; Sun, Ying; Jiang, Yi-Na

    2012-07-01

    The natural outcome of melamine-induced bladder stones (cystoliths) with bladder epithelial hyperplasia (BEH) after melamine withdrawal is unclear. Using an ideal dual-model system, three experiments were conducted in BALB/c mice. Each experiment included control, model 1 and model 2 groups. The mice were fed a regular diet in controls or a 9373 ppm melamine diet in models, and the first day was designated as dosing day 1. The melamine diet was then replaced by the regular diet in the model 2 groups, and the first day was designated as post-dosing day 1. On dosing days 12, 35 and 49, the incidence of cystoliths and diffusely active BEH was 8/8 in the mice of three model 1 groups. On post-dosing days 1, 4 and 8, in the mice of three model 2 groups, the incidence of cystoliths was 2/8, 0/8 and 1/8, respectively, and the progressive regression of BEH was observed. In conclusion, both the stones and BEH have the natural property of rapid development and rapid regression, and melamine withdrawal plays a key role in the stone dissolution-discharge necessary for BEH regression. BEH may be reversible after the discharge of the stones. The conventionally conservative therapy is thus reasonable. Copyright © 2012 Elsevier Ltd. All rights reserved.

  7. Meteorological adjustment of yearly mean values for air pollutant concentration comparison

    NASA Technical Reports Server (NTRS)

    Sidik, S. M.; Neustadter, H. E.

    1976-01-01

    Using multiple linear regression analysis, models which estimate mean concentrations of Total Suspended Particulate (TSP), sulfur dioxide, and nitrogen dioxide as a function of several meteorologic variables, two rough economic indicators, and a simple trend in time are studied. Meteorologic data were obtained and do not include inversion heights. The goodness of fit of the estimated models is partially reflected by the squared coefficient of multiple correlation which indicates that, at the various sampling stations, the models accounted for about 23 to 47 percent of the total variance of the observed TSP concentrations. If the resulting model equations are used in place of simple overall means of the observed concentrations, there is about a 20 percent improvement in either: (1) predicting mean concentrations for specified meteorological conditions; or (2) adjusting successive yearly averages to allow for comparisons devoid of meteorological effects. An application to source identification is presented using regression coefficients of wind velocity predictor variables.

  8. MMI: Multimodel inference or models with management implications?

    USGS Publications Warehouse

    Fieberg, J.; Johnson, Douglas H.

    2015-01-01

    We consider a variety of regression modeling strategies for analyzing observational data associated with typical wildlife studies, including all subsets and stepwise regression, a single full model, and Akaike's Information Criterion (AIC)-based multimodel inference. Although there are advantages and disadvantages to each approach, we suggest that there is no unique best way to analyze data. Further, we argue that, although multimodel inference can be useful in natural resource management, the importance of considering causality and accurately estimating effect sizes is greater than simply considering a variety of models. Determining causation is far more valuable than simply indicating how the response variable and explanatory variables covaried within a data set, especially when the data set did not arise from a controlled experiment. Understanding the causal mechanism will provide much better predictions beyond the range of data observed. Published 2015. This article is a U.S. Government work and is in the public domain in the USA.

  9. Third molar development: measurements versus scores as age predictor.

    PubMed

    Thevissen, P W; Fieuws, S; Willems, G

    2011-10-01

    Human third molar development is widely used to predict chronological age of sub-adult individuals with unknown or doubted age. For these predictions, classically, the radiologically observed third molar growth and maturation is registered using a staging and related scoring technique. Measures of lengths and widths of the developing wisdom tooth and its adjacent second molar can be considered as an alternative registration. The aim of this study was to verify relations between mandibular third molar developmental stages or measurements of mandibular second molar and third molars and age. Age related performance of stages and measurements were compared to assess if measurements added information to age predictions from third molar formation stage. The sample was 340 orthopantomograms (170 females, 170 males) of individuals homogeneously distributed in age between 7 and 24 years. Mandibular lower right third and second molars were staged following Gleiser and Hunt, length and width measurements were registered, and various ratios of these measurements were calculated. Univariable regression models with age as response and third molar stage, measurements and ratios of second and third molars as predictors, were considered. Multivariable regression models assessed if measurements or ratios added information to age prediction from third molar stage. Coefficients of determination (R(2)) and root mean squared errors (RMSE) obtained from all regression models were compared. The univariable regression model using stages as predictor yielded the most accurate age predictions (males: R(2) 0.85, RMSE between 0.85 and 1.22 year; females: R(2) 0.77, RMSE between 1.19 and 2.11 year) compared to all models including measurements and ratios. The multivariable regression models indicated that measurements and ratios added no clinically relevant information to the age prediction from third molar stage. Ratios and measurements of second and third molars are less accurate age predictors than stages of developing third molars. Copyright © 2011 Elsevier Ltd. All rights reserved.

  10. Understanding bias in relationships between the food environment and diet quality: the Coronary Artery Risk Development in Young Adults (CARDIA) study.

    PubMed

    Rummo, Pasquale E; Guilkey, David K; Ng, Shu Wen; Meyer, Katie A; Popkin, Barry M; Reis, Jared P; Shikany, James M; Gordon-Larsen, Penny

    2017-12-01

    The relationship between food environment exposures and diet behaviours is unclear, possibly because the majority of studies ignore potential residual confounding. We used 20 years (1985-1986, 1992-1993, 2005-2006) of data from the Coronary Artery Risk Development in Young Adults (CARDIA) study across four US cities (Birmingham, Alabama; Chicago, Illinois; Minneapolis, Minnesota; Oakland, California) and instrumental variables (IV) regression to obtain causal estimates of longitudinal associations between the percentage of neighbourhood food outlets (per total food outlets within 1 km network distance of respondent residence) and an a priori diet quality score, with higher scores indicating higher diet quality. To assess the presence and magnitude of bias related to residual confounding, we compared results from causal models (IV regression) to non-causal models, including ordinary least squares regression, which does not account for residual confounding at all, and fixed-effects regression, which only controls for time-invariant unmeasured characteristics. The mean diet quality score across follow-up was 63.4 (SD=12.7). A 10% increase in fast food restaurants (relative to full-service restaurants) was associated with a lower diet quality score over time using IV regression (β=-1.01, 95% CI -1.99 to -0.04); estimates were attenuated using non-causal models. The percentage of neighbourhood convenience and grocery stores (relative to supermarkets) was not associated with diet quality in any model, but estimates from non-causal models were similarly attenuated compared with causal models. Ignoring residual confounding may generate biased estimated effects of neighbourhood food outlets on diet outcomes and may have contributed to weak findings in the food environment literature. © Article author(s) (or their employer(s) unless otherwise stated in the text of the article) 2017. All rights reserved. No commercial use is permitted unless otherwise expressly granted.

  11. Epidemiological characteristics of reported sporadic and outbreak cases of E. coli O157 in people from Alberta, Canada (2000-2002): methodological challenges of comparing clustered to unclustered data.

    PubMed

    Pearl, D L; Louie, M; Chui, L; Doré, K; Grimsrud, K M; Martin, S W; Michel, P; Svenson, L W; McEwen, S A

    2008-04-01

    Using multivariable models, we compared whether there were significant differences between reported outbreak and sporadic cases in terms of their sex, age, and mode and site of disease transmission. We also determined the potential role of administrative, temporal, and spatial factors within these models. We compared a variety of approaches to account for clustering of cases in outbreaks including weighted logistic regression, random effects models, generalized estimating equations, robust variance estimates, and the random selection of one case from each outbreak. Age and mode of transmission were the only epidemiologically and statistically significant covariates in our final models using the above approaches. Weighting observations in a logistic regression model by the inverse of their outbreak size appeared to be a relatively robust and valid means for modelling these data. Some analytical techniques, designed to account for clustering, had difficulty converging or producing realistic measures of association.
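
    The inverse-outbreak-size weighting mentioned in this abstract can be sketched as follows. This is only an illustration of how such weights might be computed; the function name, the use of None to mark sporadic cases, and the example data are assumptions, and the weights would then be passed to whatever weighted logistic regression routine is in use.

```python
from collections import Counter


def inverse_cluster_weights(outbreak_ids):
    """Weight each case by 1 / (number of cases in its outbreak).

    Sporadic cases are marked with None and keep a weight of 1.0,
    so every outbreak contributes a total weight of 1.
    """
    sizes = Counter(oid for oid in outbreak_ids if oid is not None)
    return [1.0 if oid is None else 1.0 / sizes[oid] for oid in outbreak_ids]


# Hypothetical data: three cases from outbreak "A", two from "B", two sporadic
cases = ["A", "A", "A", "B", "B", None, None]
weights = inverse_cluster_weights(cases)
print(weights)
```

    With these weights, a large outbreak counts as much as a single sporadic case, which keeps one big cluster from dominating the fitted model.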

  12. A New Method for Partial Correction of Residual Confounding in Time-Series and Other Observational Studies.

    PubMed

    Flanders, W Dana; Strickland, Matthew J; Klein, Mitchel

    2017-05-15

    Methods exist to detect residual confounding in epidemiologic studies. One requires a negative control exposure with 2 key properties: 1) conditional independence of the negative control and the outcome (given modeled variables) absent confounding and other model misspecification, and 2) associations of the negative control with uncontrolled confounders and the outcome. We present a new method to partially correct for residual confounding: When confounding is present and our assumptions hold, we argue that estimators from models that include a negative control exposure with these 2 properties tend to be less biased than those from models without it. Using regression theory, we provide theoretical arguments that support our claims. In simulations, we empirically evaluated the approach using a time-series study of ozone effects on asthma emergency department visits. In simulations, effect estimators from models that included the negative control exposure (ozone concentrations 1 day after the emergency department visit) had slightly or modestly less residual confounding than those from models without it. Theory and simulations show that including the negative control can reduce residual confounding, if our assumptions hold. Our method differs from available methods because it uses a regression approach involving an exposure-based indicator rather than a negative control outcome to partially correct for confounding. © The Author 2017. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

  13. A generalized right truncated bivariate Poisson regression model with applications to health data.

    PubMed

    Islam, M Ataharul; Chowdhury, Rafiqul I

    2017-01-01

    A generalized right truncated bivariate Poisson regression model is proposed in this paper. Estimation and tests for goodness of fit and over or under dispersion are illustrated for both untruncated and right truncated bivariate Poisson regression models using marginal-conditional approach. Estimation and test procedures are illustrated for bivariate Poisson regression models with applications to Health and Retirement Study data on number of health conditions and the number of health care services utilized. The proposed test statistics are easy to compute and it is evident from the results that the models fit the data very well. A comparison between the right truncated and untruncated bivariate Poisson regression models using the test for nonnested models clearly shows that the truncated model performs significantly better than the untruncated model.

  14. A generalized right truncated bivariate Poisson regression model with applications to health data

    PubMed Central

    Islam, M. Ataharul; Chowdhury, Rafiqul I.

    2017-01-01

    A generalized right truncated bivariate Poisson regression model is proposed in this paper. Estimation and tests for goodness of fit and over or under dispersion are illustrated for both untruncated and right truncated bivariate Poisson regression models using marginal-conditional approach. Estimation and test procedures are illustrated for bivariate Poisson regression models with applications to Health and Retirement Study data on number of health conditions and the number of health care services utilized. The proposed test statistics are easy to compute and it is evident from the results that the models fit the data very well. A comparison between the right truncated and untruncated bivariate Poisson regression models using the test for nonnested models clearly shows that the truncated model performs significantly better than the untruncated model. PMID:28586344

  15. Moving beyond regression techniques in cardiovascular risk prediction: applying machine learning to address analytic challenges

    PubMed Central

    Goldstein, Benjamin A.; Navar, Ann Marie; Carter, Rickey E.

    2017-01-01

    Risk prediction plays an important role in clinical cardiology research. Traditionally, most risk models have been based on regression models. While useful and robust, these statistical methods are limited to using a small number of predictors which operate in the same way on everyone, and uniformly throughout their range. The purpose of this review is to illustrate the use of machine-learning methods for development of risk prediction models. Typically presented as black box approaches, most machine-learning methods are aimed at solving particular challenges that arise in data analysis that are not well addressed by typical regression approaches. To illustrate these challenges, as well as how different methods can address them, we consider predicting mortality after diagnosis of acute myocardial infarction. We use data derived from our institution's electronic health record and abstract data on 13 regularly measured laboratory markers. We walk through different challenges that arise in modelling these data and then introduce different machine-learning approaches. Finally, we discuss general issues in the application of machine-learning methods including tuning parameters, loss functions, variable importance, and missing data. Overall, this review serves as an introduction for those working on risk modelling to approach the diffuse field of machine learning. PMID:27436868

  16. Development and validation of a mortality risk model for pediatric sepsis.

    PubMed

    Chen, Mengshi; Lu, Xiulan; Hu, Li; Liu, Pingping; Zhao, Wenjiao; Yan, Haipeng; Tang, Liang; Zhu, Yimin; Xiao, Zhenghui; Chen, Lizhang; Tan, Hongzhuan

    2017-05-01

    Pediatric sepsis is a burdensome public health problem. Assessing the mortality risk of pediatric sepsis patients, offering effective treatment guidance, and improving prognosis to reduce mortality rates, are crucial. We extracted data derived from electronic medical records of pediatric sepsis patients that were collected during the first 24 hours after admission to the pediatric intensive care unit (PICU) of the Hunan Children's hospital from January 2012 to June 2014. A total of 788 children were randomly divided into a training (592, 75%) and validation group (196, 25%). The risk factors for mortality among these patients were identified by conducting multivariate logistic regression in the training group. Based on the established logistic regression equation, the logit probabilities for all patients (in both groups) were calculated to verify the model's internal and external validities. According to the training group, 6 variables (brain natriuretic peptide, albumin, total bilirubin, D-dimer, lactate levels, and mechanical ventilation in 24 hours) were included in the final logistic regression model. The areas under the curves of the model were 0.854 (0.826, 0.881) and 0.844 (0.816, 0.873) in the training and validation groups, respectively. The Mortality Risk Model for Pediatric Sepsis we established in this study showed acceptable accuracy to predict the mortality risk in pediatric sepsis patients.

  17. Development and validation of a mortality risk model for pediatric sepsis

    PubMed Central

    Chen, Mengshi; Lu, Xiulan; Hu, Li; Liu, Pingping; Zhao, Wenjiao; Yan, Haipeng; Tang, Liang; Zhu, Yimin; Xiao, Zhenghui; Chen, Lizhang; Tan, Hongzhuan

    2017-01-01

    Pediatric sepsis is a burdensome public health problem. Assessing the mortality risk of pediatric sepsis patients, offering effective treatment guidance, and improving prognosis to reduce mortality rates, are crucial. We extracted data derived from electronic medical records of pediatric sepsis patients that were collected during the first 24 hours after admission to the pediatric intensive care unit (PICU) of the Hunan Children's hospital from January 2012 to June 2014. A total of 788 children were randomly divided into a training (592, 75%) and validation group (196, 25%). The risk factors for mortality among these patients were identified by conducting multivariate logistic regression in the training group. Based on the established logistic regression equation, the logit probabilities for all patients (in both groups) were calculated to verify the model's internal and external validities. According to the training group, 6 variables (brain natriuretic peptide, albumin, total bilirubin, D-dimer, lactate levels, and mechanical ventilation in 24 hours) were included in the final logistic regression model. The areas under the curves of the model were 0.854 (0.826, 0.881) and 0.844 (0.816, 0.873) in the training and validation groups, respectively. The Mortality Risk Model for Pediatric Sepsis we established in this study showed acceptable accuracy to predict the mortality risk in pediatric sepsis patients. PMID:28514310

  18. Predicting Energy Performance of a Net-Zero Energy Building: A Statistical Approach

    PubMed Central

    Kneifel, Joshua; Webb, David

    2016-01-01

    Performance-based building requirements have become more prevalent because it gives freedom in building design while still maintaining or exceeding the energy performance required by prescriptive-based requirements. In order to determine if building designs reach target energy efficiency improvements, it is necessary to estimate the energy performance of a building using predictive models and different weather conditions. Physics-based whole building energy simulation modeling is the most common approach. However, these physics-based models include underlying assumptions and require significant amounts of information in order to specify the input parameter values. An alternative approach to test the performance of a building is to develop a statistically derived predictive regression model using post-occupancy data that can accurately predict energy consumption and production based on a few common weather-based factors, thus requiring less information than simulation models. A regression model based on measured data should be able to predict energy performance of a building for a given day as long as the weather conditions are similar to those during the data collection time frame. This article uses data from the National Institute of Standards and Technology (NIST) Net-Zero Energy Residential Test Facility (NZERTF) to develop and validate a regression model to predict the energy performance of the NZERTF using two weather variables aggregated to the daily level, applies the model to estimate the energy performance of hypothetical NZERTFs located in different cities in the Mixed-Humid climate zone, and compares these estimates to the results from already existing EnergyPlus whole building energy simulations. This regression model exhibits agreement with EnergyPlus predictive trends in energy production and net consumption, but differs greatly in energy consumption. 
The model can be used as a framework for alternative and more complex models based on the experimental data collected from the NZERTF. PMID:27956756

  19. Predicting Energy Performance of a Net-Zero Energy Building: A Statistical Approach.

    PubMed

    Kneifel, Joshua; Webb, David

    2016-09-01

    Performance-based building requirements have become more prevalent because it gives freedom in building design while still maintaining or exceeding the energy performance required by prescriptive-based requirements. In order to determine if building designs reach target energy efficiency improvements, it is necessary to estimate the energy performance of a building using predictive models and different weather conditions. Physics-based whole building energy simulation modeling is the most common approach. However, these physics-based models include underlying assumptions and require significant amounts of information in order to specify the input parameter values. An alternative approach to test the performance of a building is to develop a statistically derived predictive regression model using post-occupancy data that can accurately predict energy consumption and production based on a few common weather-based factors, thus requiring less information than simulation models. A regression model based on measured data should be able to predict energy performance of a building for a given day as long as the weather conditions are similar to those during the data collection time frame. This article uses data from the National Institute of Standards and Technology (NIST) Net-Zero Energy Residential Test Facility (NZERTF) to develop and validate a regression model to predict the energy performance of the NZERTF using two weather variables aggregated to the daily level, applies the model to estimate the energy performance of hypothetical NZERTFs located in different cities in the Mixed-Humid climate zone, and compares these estimates to the results from already existing EnergyPlus whole building energy simulations. This regression model exhibits agreement with EnergyPlus predictive trends in energy production and net consumption, but differs greatly in energy consumption. 
The model can be used as a framework for alternative and more complex models based on the experimental data collected from the NZERTF.

  20. A Technique of Fuzzy C-Mean in Multiple Linear Regression Model toward Paddy Yield

    NASA Astrophysics Data System (ADS)

    Syazwan Wahab, Nur; Saifullah Rusiman, Mohd; Mohamad, Mahathir; Amira Azmi, Nur; Che Him, Norziha; Ghazali Kamardan, M.; Ali, Maselan

    2018-04-01

    In this paper, we propose a hybrid model that combines a multiple linear regression model with the fuzzy c-means method. The research examines the relationship between 20 topsoil variates, analyzed prior to planting, and paddy yield at standard fertilizer rates. Data used were from the multi-location trials for rice carried out by MARDI at major paddy granaries in Peninsular Malaysia during the period from 2009 to 2012. Missing observations were estimated using mean estimation techniques. The data were analyzed using a multiple linear regression model and a combination of the multiple linear regression model and the fuzzy c-means method. Analyses of normality and multicollinearity indicate that the data are normally distributed, with no multicollinearity among the independent variables. Fuzzy c-means analysis clusters the paddy yield into two clusters before the multiple linear regression model is applied. The comparison between the two methods indicates that the hybrid of the multiple linear regression model and the fuzzy c-means method outperforms the multiple linear regression model alone, with a lower mean square error.

  1. Relations that affect the probability and prediction of nitrate concentration in private wells in the glacial aquifer system in the United States

    USGS Publications Warehouse

    Warner, Kelly L.; Arnold, Terri L.

    2010-01-01

    Nitrate in private wells in the glacial aquifer system is a concern for an estimated 17 million people using private wells because of the proximity of many private wells to nitrogen sources. Yet, less than 5 percent of private wells sampled in this study contained nitrate in concentrations that exceeded the U.S. Environmental Protection Agency (USEPA) Maximum Contaminant Level (MCL) of 10 mg/L (milligrams per liter) as N (nitrogen). However, this small group with nitrate concentrations above the USEPA MCL includes some of the highest nitrate concentrations detected in groundwater from private wells (77 mg/L). Median nitrate concentration measured in groundwater from private wells in the glacial aquifer system (0.11 mg/L as N) is lower than that in water from other unconsolidated aquifers and is not strongly related to surface sources of nitrate. Background concentration of nitrate is less than 1 mg/L as N. Although overall nitrate concentration in private wells was low relative to the MCL, concentrations were highly variable over short distances and at various depths below land surface. Groundwater from wells in the glacial aquifer system at all depths was a mixture of old and young water. Oxidation and reduction potential changes with depth and groundwater age were important influences on nitrate concentrations in private wells. A series of 10 logistic regression models was developed to estimate the probability of nitrate concentration above various thresholds. The threshold concentration (1 to 10 mg/L) affected the number of variables in the model. Fewer explanatory variables are needed to predict nitrate at higher threshold concentrations. The variables that were identified as significant predictors for nitrate concentration above 4 mg/L as N included well characteristics such as open-interval diameter, open-interval length, and depth to top of open interval. 
Environmental variables in the models were mean percent silt in soil, soil type, and mean depth to saturated soil. The 10-year mean (1992-2001) application rate of nitrogen fertilizer applied to farms was included as the potential source variable. A linear regression model also was developed to predict mean nitrate concentrations in well networks. The model is based on network averages because nitrate concentrations are highly variable over short distances. Using values for each of the predictor variables averaged by network (network mean value) from the logistic regression models, the linear regression model developed in this study predicted the mean nitrate concentration in well networks with a 95 percent confidence in predictions.

  2. Creating a non-linear total sediment load formula using polynomial best subset regression model

    NASA Astrophysics Data System (ADS)

    Okcu, Davut; Pektas, Ali Osman; Uyumaz, Ali

    2016-08-01

    The aim of this study is to derive a new total sediment load formula that is more accurate and has fewer application constraints than the well-known formulae in the literature. The five best-known stream power concept sediment formulae approved by ASCE are used for benchmarking on a wide range of datasets that includes both field and flume (lab) observations. The dimensionless parameters of these widely used formulae are used as inputs in a new regression approach, called polynomial best subset regression (PBSR) analysis. The aim of the PBSR analysis is to fit and test all possible combinations of the input variables and to select the best subset. All the input variables, together with their second and third powers, are included in the regression to test the possible relation between the explanatory variables and the dependent variable. The best subset is selected using a multistep approach that depends on the significance values and the degree of multicollinearity of the inputs. The new formula is compared with the others on a holdout dataset, and detailed performance investigations are conducted for the field and lab datasets within this holdout data. Different goodness-of-fit statistics are used, as they represent different perspectives on model accuracy. After the detailed comparisons, we identified the most accurate equation that is also applicable to both flume and river data. On the field dataset in particular, the proposed formula outperformed the benchmark formulations.
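
    The search space implied by this abstract, every input together with its second and third powers and all possible combinations of those terms, can be enumerated with a short sketch. The variable names x1 and x2 are hypothetical, and the code only lists candidate subsets; the fitting, significance screening and multicollinearity checks the authors describe are not reproduced here.

```python
from itertools import combinations


def candidate_terms(variables, max_power=3):
    """Return (variable, power) pairs for powers 1..max_power."""
    return [(v, p) for v in variables for p in range(1, max_power + 1)]


def all_subsets(terms):
    """Yield every non-empty subset of the candidate terms."""
    for k in range(1, len(terms) + 1):
        yield from combinations(terms, k)


variables = ["x1", "x2"]  # hypothetical dimensionless inputs
terms = candidate_terms(variables)
subsets = list(all_subsets(terms))

print(len(terms))    # 6 candidate terms
print(len(subsets))  # 63 candidate models (2**6 - 1)
```

    The number of candidate models grows as 2^(3p) - 1 for p inputs, which is one reason a multistep screening procedure is useful in place of a plain exhaustive comparison.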

  3. A crash-prediction model for multilane roads.

    PubMed

    Caliendo, Ciro; Guida, Maurizio; Parisi, Alessandra

    2007-07-01

    Considerable research has been carried out in recent years to establish relationships between crashes and traffic flow, geometric infrastructure characteristics and environmental factors for two-lane rural roads. Crash-prediction models focused on multilane rural roads, however, have rarely been investigated. In addition, most research has paid but little attention to the safety effects of variables such as stopping sight distance and pavement surface characteristics. Moreover, the statistical approaches have generally included Poisson and Negative Binomial regression models, whilst Negative Multinomial regression model has been used to a lesser extent. Finally, as far as the authors are aware, prediction models involving all the above-mentioned factors have still not been developed in Italy for multilane roads, such as motorways. Thus, in this paper crash-prediction models for a four-lane median-divided Italian motorway were set up on the basis of accident data observed during a 5-year monitoring period extending between 1999 and 2003. The Poisson, Negative Binomial and Negative Multinomial regression models, applied separately to tangents and curves, were used to model the frequency of accident occurrence. Model parameters were estimated by the Maximum Likelihood Method, and the Generalized Likelihood Ratio Test was applied to detect the significant variables to be included in the model equation. Goodness-of-fit was measured by means of both the explained fraction of total variation and the explained fraction of systematic variation. The Cumulative Residuals Method was also used to test the adequacy of a regression model throughout the range of each variable. The candidate set of explanatory variables was: length (L), curvature (1/R), annual average daily traffic (AADT), sight distance (SD), side friction coefficient (SFC), longitudinal slope (LS) and the presence of a junction (J). 
Separate prediction models for total crashes and for fatal and injury crashes only were considered. For curves it is shown that significant variables are L, 1/R and AADT, whereas for tangents they are L, AADT and junctions. The effect of rain precipitation was analysed on the basis of hourly rainfall data and assumptions about drying time. It is shown that a wet pavement significantly increases the number of crashes. The models developed in this paper for Italian motorways appear to be useful for many applications such as the detection of critical factors, the estimation of accident reduction due to infrastructure and pavement improvement, and the predictions of accidents counts when comparing different design options. Thus this research may represent a point of reference for engineers in adjusting or designing multilane roads.

  4. Spatial Assessment of Model Errors from Four Regression Techniques

    Treesearch

    Lianjun Zhang; Jeffrey H. Gove; Jeffrey H. Gove

    2005-01-01

    Forest modelers have attempted to account for the spatial autocorrelations among trees in growth and yield models by applying alternative regression techniques such as linear mixed models (LMM), generalized additive models (GAM), and geographically weighted regression (GWR). However, the model errors are commonly assessed using average errors across the entire study...

  5. Analysis of precision and accuracy in a simple model of machine learning

    NASA Astrophysics Data System (ADS)

    Lee, Julian

    2017-12-01

    Machine learning is a procedure where a model for the world is constructed from a training set of examples. It is important that the model should capture relevant features of the training set, and at the same time make correct prediction for examples not included in the training set. I consider the polynomial regression, the simplest method of learning, and analyze the accuracy and precision for different levels of the model complexity.

  6. Modeling of Micro Deval abrasion loss based on some rock properties

    NASA Astrophysics Data System (ADS)

    Capik, Mehmet; Yilmaz, Ali Osman

    2017-10-01

    Aggregate is one of the most widely used construction materials. The quality of the aggregate is determined using several testing methods. Among these methods, the Micro Deval Abrasion Loss (MDAL) test is commonly used for the determination of the quality and the abrasion resistance of aggregate. The main objective of this study is to develop models for the prediction of MDAL from rock properties. The properties examined are uniaxial compressive strength, Brazilian tensile strength, point load index, Schmidt rebound hardness, apparent porosity, void ratio, Cerchar abrasivity index and Bohme abrasion test. The MDAL is modeled using simple regression analysis and multiple linear regression analysis based on the rock properties. The study shows that the MDAL decreases with the increase of uniaxial compressive strength, Brazilian tensile strength, point load index, Schmidt rebound hardness and Cerchar abrasivity index. It is also concluded that the MDAL increases with the increase of apparent porosity, void ratio and Bohme abrasion test. The modeling results show that the models based on Bohme abrasion test and L type Schmidt rebound hardness give better forecasting performance for the MDAL. More models, including the uniaxial compressive strength, the apparent porosity and Cerchar abrasivity index, are developed for the rapid estimation of the MDAL of the rocks. The developed models were verified by statistical tests. Additionally, it can be stated that the proposed models can be used to forecast aggregate quality.

  7. The extension of total gain (TG) statistic in survival models: properties and applications.

    PubMed

    Choodari-Oskooei, Babak; Royston, Patrick; Parmar, Mahesh K B

    2015-07-01

    The results of multivariable regression models are usually summarized in the form of parameter estimates for the covariates, goodness-of-fit statistics, and the relevant p-values. These statistics do not inform us about whether covariate information will lead to any substantial improvement in prediction. Predictive ability measures can be used for this purpose since they provide important information about the practical significance of prognostic factors. R²-type indices are the most familiar forms of such measures in survival models, but they all have limitations and none is widely used. In this paper, we extend the total gain (TG) measure, proposed for a logistic regression model, to survival models and explore its properties using simulations and real data. TG is based on the binary regression quantile plot, otherwise known as the predictiveness curve. Standardised TG ranges from 0 (no explanatory power) to 1 ('perfect' explanatory power). The results of our simulations show that unlike many of the other R²-type predictive ability measures, TG is independent of random censoring. It increases as the effect of a covariate increases and can be applied to different types of survival models, including models with time-dependent covariate effects. We also apply TG to quantify the predictive ability of multivariable prognostic models developed in several disease areas. Overall, TG performs well in our simulation studies and can be recommended as a measure to quantify the predictive ability in survival models.

  8. Incorporating wind availability into land use regression modelling of air quality in mountainous high-density urban environment.

    PubMed

    Shi, Yuan; Lau, Kevin Ka-Lun; Ng, Edward

    2017-08-01

    Urban air quality serves as an important function of the quality of urban life. Land use regression (LUR) modelling of air quality is essential for conducting health impacts assessment but more challenging in mountainous high-density urban scenario due to the complexities of the urban environment. In this study, a total of 21 LUR models are developed for seven kinds of air pollutants (gaseous air pollutants CO, NO2, NOx, O3, SO2 and particulate air pollutants PM2.5, PM10) with reference to three different time periods (summertime, wintertime and annual average of 5-year long-term hourly monitoring data from local air quality monitoring network) in Hong Kong. Under the mountainous high-density urban scenario, we improved the traditional LUR modelling method by incorporating wind availability information into LUR modelling based on surface geomorphometrical analysis. As a result, 269 independent variables were examined to develop the LUR models by using the "ADDRESS" independent variable selection method and stepwise multiple linear regression (MLR). Cross validation has been performed for each resultant model. The results show that wind-related variables are included in most of the resultant models as statistically significant independent variables. Compared with the traditional method, a maximum increase of 20% was achieved in the prediction performance of annual averaged NO2 concentration level by incorporating wind-related variables into LUR model development. Copyright © 2017 Elsevier Inc. All rights reserved.

  9. Unitary Response Regression Models

    ERIC Educational Resources Information Center

    Lipovetsky, S.

    2007-01-01

    The dependent variable in a regular linear regression is a numerical variable, and in a logistic regression it is a binary or categorical variable. In these models the dependent variable has varying values. However, there are problems yielding an identity output of a constant value which can also be modelled in a linear or logistic regression with…

  10. Modelling infant mortality rate in Central Java, Indonesia use generalized poisson regression method

    NASA Astrophysics Data System (ADS)

    Prahutama, Alan; Sudarno

    2018-05-01

    The infant mortality rate is the number of deaths under one year of age occurring among the live births in a given geographical area during a given year, per 1,000 live births occurring among the population of the given geographical area during the same year. This problem needs to be addressed because it is an important element of a country’s economic development. A high infant mortality rate will disrupt the stability of a country, as it relates to the sustainability of the country's population. One regression model that can be used to analyze the relationship between a dependent variable Y in the form of discrete data and an independent variable X is the Poisson regression model. Regression models used for data with a discrete dependent variable include, among others, Poisson regression, negative binomial regression and generalized Poisson regression. In this research, generalized Poisson regression modeling gives a better AIC value than Poisson regression. The most significant variable is the number of health facilities (X1), while the variable that has the most influence on the infant mortality rate is the average breastfeeding (X9).
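
    Two quantities in this abstract lend themselves to a small worked example: the infant mortality rate as defined in its first sentence, and the AIC used to compare the Poisson and generalized Poisson fits. The sketch below uses the standard definition AIC = 2k - 2 ln L. All numbers are hypothetical and are not taken from the study.

```python
def infant_mortality_rate(infant_deaths, live_births):
    """Deaths under one year of age per 1,000 live births."""
    return 1000.0 * infant_deaths / live_births


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical counts for one district in one year
rate = infant_mortality_rate(infant_deaths=120, live_births=15000)

# Hypothetical fitted log-likelihoods; the generalized Poisson model
# is assumed to carry one extra (dispersion) parameter
aic_poisson = aic(log_likelihood=-250.0, n_params=3)
aic_generalized = aic(log_likelihood=-240.0, n_params=4)

print(rate)             # 8.0 deaths per 1,000 live births
print(aic_poisson)      # 506.0
print(aic_generalized)  # 488.0
```

    In this made-up comparison the generalized Poisson model has the lower AIC despite its extra parameter, which is the kind of result the abstract reports when it says the generalized model gives the better AIC value.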

  11. The extraction of simple relationships in growth factor-specific multiple-input and multiple-output systems in cell-fate decisions by backward elimination PLS regression.

    PubMed

    Akimoto, Yuki; Yugi, Katsuyuki; Uda, Shinsuke; Kudo, Takamasa; Komori, Yasunori; Kubota, Hiroyuki; Kuroda, Shinya

    2013-01-01

    Cells use common signaling molecules for the selective control of downstream gene expression and cell-fate decisions. The relationship between signaling molecules and downstream gene expression and cellular phenotypes is a multiple-input and multiple-output (MIMO) system and is difficult to understand due to its complexity. For example, it has been reported that, in PC12 cells, different types of growth factors activate MAP kinases (MAPKs) including ERK, JNK, and p38, and CREB, for selective protein expression of immediate early genes (IEGs) such as c-FOS, c-JUN, EGR1, JUNB, and FOSB, leading to cell differentiation, proliferation and cell death; however, how multiple-inputs such as MAPKs and CREB regulate multiple-outputs such as expression of the IEGs and cellular phenotypes remains unclear. To address this issue, we employed a statistical method called partial least squares (PLS) regression, which involves a reduction of the dimensionality of the inputs and outputs into latent variables and a linear regression between these latent variables. We measured 1,200 data points for MAPKs and CREB as the inputs and 1,900 data points for IEGs and cellular phenotypes as the outputs, and we constructed the PLS model from these data. The PLS model highlighted the complexity of the MIMO system and growth factor-specific input-output relationships of cell-fate decisions in PC12 cells. Furthermore, to reduce the complexity, we applied a backward elimination method to the PLS regression, in which 60 input variables were reduced to 5 variables, including the phosphorylation of ERK at 10 min, CREB at 5 min and 60 min, AKT at 5 min and JNK at 30 min. The simple PLS model with only 5 input variables demonstrated a predictive ability comparable to that of the full PLS model. The 5 input variables effectively extracted the growth factor-specific simple relationships within the MIMO system in cell-fate decisions in PC12 cells.

  12. [From clinical judgment to linear regression model].

    PubMed

    Palacios-Cruz, Lino; Pérez, Marcela; Rivas-Ruiz, Rodolfo; Talavera, Juan O

    2013-01-01

    When we think about mathematical models, such as the linear regression model, we think that these terms are only used by those engaged in research, a notion that is far from the truth. Legendre described the first mathematical model in 1805, and Galton introduced the formal term in 1886. Linear regression is one of the most commonly used regression models in clinical practice. It is useful to predict or show the relationship between two or more variables as long as the dependent variable is quantitative and has a normal distribution. Stated another way, regression is used to predict a measure based on the knowledge of at least one other variable. Linear regression has as its first objective to determine the slope or inclination of the regression line: Y = a + bx, where "a" is the intercept or regression constant, equivalent to the value of "Y" when "x" equals 0, and "b" (also called the slope) indicates the increase or decrease that occurs when the variable "x" increases or decreases by one unit. In the regression line, "b" is called the regression coefficient. The coefficient of determination (R²) indicates the importance of independent variables in the outcome.
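
    The quantities named in this abstract (intercept a, slope b, and R²) can be computed directly with the least-squares formulas. The following is a minimal sketch; the function name fit_line and the data are hypothetical.

```python
def fit_line(x, y):
    """Least-squares fit of Y = a + b*x; returns (a, b, r_squared)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx                 # slope: change in Y per unit change in x
    a = mean_y - b * mean_x       # intercept: value of Y when x = 0
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1.0 - ss_res / ss_tot  # share of variance in Y explained by x
    return a, b, r_squared

# Hypothetical data lying exactly on Y = 1 + 2x
a, b, r2 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b, r2)
```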

  13. Impact of multicollinearity on small sample hydrologic regression models

    NASA Astrophysics Data System (ADS)

    Kroll, Charles N.; Song, Peter

    2013-06-01

    Often hydrologic regression models are developed with ordinary least squares (OLS) procedures. The use of OLS with highly correlated explanatory variables produces multicollinearity, which creates highly sensitive parameter estimators with inflated variances and improper model selection. It is not clear how to best address multicollinearity in hydrologic regression models. Here a Monte Carlo simulation is developed to compare four techniques to address multicollinearity: OLS, OLS with variance inflation factor screening (VIF), principal component regression (PCR), and partial least squares regression (PLS). The performance of these four techniques was observed for varying sample sizes, correlation coefficients between the explanatory variables, and model error variances consistent with hydrologic regional regression models. The negative effects of multicollinearity are magnified at smaller sample sizes, higher correlations between the variables, and larger model error variances (smaller R²). The Monte Carlo simulation indicates that if the true model is known, multicollinearity is present, and the estimation and statistical testing of regression parameters are of interest, then PCR or PLS should be employed. If the model is unknown, or if the interest is solely in model predictions, it is recommended that OLS be employed, since using more complicated techniques did not produce any improvement in model performance. A leave-one-out cross-validation case study was also performed using low-streamflow data sets from the eastern United States. Results indicate that OLS with stepwise selection generally produces models across study regions with varying levels of multicollinearity that are as good as biased regression techniques such as PCR and PLS.
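
    The variance inflation factor used for screening can be sketched for the simplest case. With exactly two explanatory variables, VIF = 1 / (1 - r²), where r is their Pearson correlation; with more variables, r² is replaced by the R² from regressing one variable on all the others. The sketch below covers only the two-variable case, and the function names and data are hypothetical.

```python
import math

def pearson_r(u, v):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    suv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    suu = sum((a - mean_u) ** 2 for a in u)
    svv = sum((b - mean_v) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor regression."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)

# Uncorrelated predictors: no variance inflation
vif_low = vif_two_predictors([-1, 0, 1], [1, -2, 1])
# Strongly correlated predictors: large variance inflation
vif_high = vif_two_predictors([1, 2, 3, 4], [1, 2, 3, 5])
print(vif_low, vif_high)
```

    A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity; the threshold used in the study is not stated in the abstract.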

  14. Factors associated with self-medication in Spain: a cross-sectional study in different age groups.

    PubMed

    Niclós, Gracia; Olivar, Teresa; Rodilla, Vicent

    2018-06-01

    To identify factors which may influence a patient's decision to self-medicate. Descriptive, cross-sectional study of the adult population (at least 16 years old), using data from the 2009 European Health Interview Survey in Spain, which included 22 188 subjects. Logistic regression models enabled us to estimate the effect of each analysed variable on self-medication. In total, 14 863 (67%) individuals reported using medication (prescribed and non-prescribed) and 3274 (22.0%) of them self-medicated. Using logistic regression and stratifying by age, four different models were constructed. Our results include different variables in each of the models to explain self-medication, but the one that appears in all four models is education level. Age is the other important factor which influences self-medication. Self-medication is strongly associated with socio-demographic factors, such as sex, educational level or age, as well as several health factors such as long-standing illness or physical activity. When our data are compared to those from previous Spanish surveys carried out in 2003 and 2006, we can conclude that self-medication is increasing in Spain. © 2017 Royal Pharmaceutical Society.

  15. Trees Grow on Money: Urban Tree Canopy Cover and Environmental Justice

    PubMed Central

    Schwarz, Kirsten; Fragkias, Michail; Boone, Christopher G.; Zhou, Weiqi; McHale, Melissa; Grove, J. Morgan; O’Neil-Dunne, Jarlath; McFadden, Joseph P.; Buckley, Geoffrey L.; Childers, Dan; Ogden, Laura; Pincetl, Stephanie; Pataki, Diane; Whitmer, Ali; Cadenasso, Mary L.

    2015-01-01

    This study examines the distributional equity of urban tree canopy (UTC) cover for Baltimore, MD, Los Angeles, CA, New York, NY, Philadelphia, PA, Raleigh, NC, Sacramento, CA, and Washington, D.C. using high spatial resolution land cover data and census data. Data are analyzed at the Census Block Group levels using Spearman’s correlation, ordinary least squares regression (OLS), and a spatial autoregressive model (SAR). Across all cities there is a strong positive correlation between UTC cover and median household income. Negative correlations between race and UTC cover exist in bivariate models for some cities, but they are generally not observed using multivariate regressions that include additional variables on income, education, and housing age. SAR models result in higher r-square values compared to the OLS models across all cities, suggesting that spatial autocorrelation is an important feature of our data. Similarities among cities can be found based on shared characteristics of climate, race/ethnicity, and size. Our findings suggest that a suite of variables, including income, contribute to the distribution of UTC cover. These findings can help target simultaneous strategies for UTC goals and environmental justice concerns. PMID:25830303

  16. Analysis of Longitudinal Studies With Repeated Outcome Measures: Adjusting for Time-Dependent Confounding Using Conventional Methods.

    PubMed

    Keogh, Ruth H; Daniel, Rhian M; VanderWeele, Tyler J; Vansteelandt, Stijn

    2018-05-01

    Estimation of causal effects of time-varying exposures using longitudinal data is a common problem in epidemiology. When there are time-varying confounders, which may include past outcomes, affected by prior exposure, standard regression methods can lead to bias. Methods such as inverse probability weighted estimation of marginal structural models have been developed to address this problem. However, in this paper we show how standard regression methods can be used, even in the presence of time-dependent confounding, to estimate the total effect of an exposure on a subsequent outcome by controlling appropriately for prior exposures, outcomes, and time-varying covariates. We refer to the resulting estimation approach as sequential conditional mean models (SCMMs), which can be fitted using generalized estimating equations. We outline this approach and describe how including propensity score adjustment is advantageous. We compare the causal effects being estimated using SCMMs and marginal structural models, and we compare the two approaches using simulations. SCMMs enable more precise inferences, with greater robustness against model misspecification via propensity score adjustment, and easily accommodate continuous exposures and interactions. A new test for direct effects of past exposures on a subsequent outcome is described.

  17. Real estate value prediction using multivariate regression models

    NASA Astrophysics Data System (ADS)

    Manjula, R.; Jain, Shubham; Srivastava, Sharad; Rajiv Kher, Pranav

    2017-11-01

    The real estate market is one of the most competitive in terms of pricing, and prices tend to vary significantly based on many factors; hence it is a prime field in which to apply machine learning concepts to optimize and predict prices with high accuracy. Therefore, in this paper we present various important features to use when predicting housing prices with good accuracy. We describe regression models that use various features to achieve a lower residual sum of squares error. When using features in a regression model, some feature engineering is required for better prediction. Often a set of features (multiple regression) or polynomial regression (applying various powers of the features) is used to obtain a better model fit. Because these models are expected to be susceptible to overfitting, ridge regression is used to reduce it. This paper thus points to the best application of regression models, in addition to other techniques, to optimize the result.
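
    The shrinkage effect of ridge regression described here can be sketched in the one-predictor case, where the penalized slope has the closed form b = Sxy / (Sxx + lambda) after centering and the intercept is left unpenalized. This is a simplified illustration, not the multi-feature models of the paper; the function name ridge_line, the penalty values and the data are hypothetical.

```python
def ridge_line(x, y, lam):
    """One-predictor ridge fit; lam = 0 reproduces ordinary least squares."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / (sxx + lam)     # the penalty lam shrinks the slope toward zero
    a = mean_y - b * mean_x   # intercept is not penalized
    return a, b

x = [0, 1, 2, 3]
y = [1, 3, 5, 7]

a_ols, b_ols = ridge_line(x, y, lam=0.0)
a_ridge, b_ridge = ridge_line(x, y, lam=5.0)
print(b_ols, b_ridge)
```

    In practice the penalty is usually chosen by cross-validation, trading a small increase in bias for a reduction in variance.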

  18. Improved animal models for testing gene therapy for atherosclerosis.

    PubMed

    Du, Liang; Zhang, Jingwan; De Meyer, Guido R Y; Flynn, Rowan; Dichek, David A

    2014-04-01

    Gene therapy delivered to the blood vessel wall could augment current therapies for atherosclerosis, including systemic drug therapy and stenting. However, identification of clinically useful vectors and effective therapeutic transgenes remains at the preclinical stage. Identification of effective vectors and transgenes would be accelerated by availability of animal models that allow practical and expeditious testing of vessel-wall-directed gene therapy. Such models would include humanlike lesions that develop rapidly in vessels that are amenable to efficient gene delivery. Moreover, because human atherosclerosis develops in normal vessels, gene therapy that prevents atherosclerosis is most logically tested in relatively normal arteries. Similarly, gene therapy that causes atherosclerosis regression requires gene delivery to an existing lesion. Here we report development of three new rabbit models for testing vessel-wall-directed gene therapy that either prevents or reverses atherosclerosis. Carotid artery intimal lesions in these new models develop within 2-7 months after initiation of a high-fat diet and are 20-80 times larger than lesions in a model we described previously. Individual models allow generation of lesions that are relatively rich in either macrophages or smooth muscle cells, permitting testing of gene therapy strategies targeted at either cell type. Two of the models include gene delivery to essentially normal arteries and will be useful for identifying strategies that prevent lesion development. The third model generates lesions rapidly in vector-naïve animals and can be used for testing gene therapy that promotes lesion regression. These models are optimized for testing helper-dependent adenovirus (HDAd)-mediated gene therapy; however, they could be easily adapted for testing of other vectors or of different types of molecular therapies, delivered directly to the blood vessel wall. Our data also supports the promise of HDAd to deliver long-term therapy from vascular endothelium without accelerating atherosclerotic disease.

  19. Regression analysis for solving diagnosis problem of children's health

    NASA Astrophysics Data System (ADS)

    Cherkashina, Yu A.; Gerget, O. M.

    2016-04-01

    The paper presents the results of research devoted to the application of statistical techniques, namely regression analysis, to assess the health status of children in the neonatal period based on medical data (hemostatic parameters, blood test parameters, gestational age, vascular-endothelial growth factor) measured at 3-5 days of life. A detailed description of the studied medical data is given, and a binary logistic regression procedure is discussed. The basic results of the research are presented. A classification table of predicted and observed values is shown, and the overall percentage of correct classification is determined. Regression equation coefficients are calculated, and the general regression equation is written based on them. Based on the results of logistic regression, ROC analysis was performed: the sensitivity and specificity of the model are calculated and ROC curves are constructed. These mathematical techniques allow children's health to be diagnosed with a high quality of recognition. The results make a significant contribution to the development of evidence-based medicine and have high practical importance in the professional activity of the author.
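
    The classification table, overall percentage of correct classification, sensitivity and specificity mentioned in this abstract can be computed as in the following minimal sketch. The labels are hypothetical (1 = condition present, 0 = absent) and are not taken from the study; the function names are likewise illustrative.

```python
def classification_table(y_true, y_pred):
    """Return (tp, fp, tn, fn) counts for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def diagnostic_rates(y_true, y_pred):
    """Return (sensitivity, specificity, accuracy)."""
    tp, fp, tn, fn = classification_table(y_true, y_pred)
    sensitivity = tp / (tp + fn)   # share of true cases that are detected
    specificity = tn / (tn + fp)   # share of non-cases correctly ruled out
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy

# Hypothetical observed and predicted classes
observed = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]

sens, spec, acc = diagnostic_rates(observed, predicted)
print(sens, spec, acc)
```

    Repeating this calculation over a range of probability cut-offs and plotting sensitivity against 1 - specificity gives the ROC curve.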

  20. Measurement of pediatric regional cerebral blood flow from 6 months to 15 years of age in a clinical population.

    PubMed

    Carsin-Vu, Aline; Corouge, Isabelle; Commowick, Olivier; Bouzillé, Guillaume; Barillot, Christian; Ferré, Jean-Christophe; Proisy, Maia

    2018-04-01

    To investigate changes in cerebral blood flow (CBF) in gray matter (GM) between 6 months and 15 years of age and to provide CBF values for the brain, GM, white matter (WM), hemispheres and lobes. Between 2013 and 2016, we retrospectively included all clinical MRI examinations with arterial spin labeling (ASL). We excluded subjects with a condition potentially affecting brain perfusion. For each subject, mean values of CBF in the brain, GM, WM, hemispheres and lobes were calculated. GM CBF was fitted using linear, quadratic and cubic polynomial regression against age. Regression models were compared with Akaike's information criterion (AIC) and likelihood ratio tests. 84 children were included (44 females/40 males). Mean CBF values were 64.2 ± 13.8 mL/100 g/min in GM, and 29.3 ± 10.0 mL/100 g/min in WM. The best-fit model of brain perfusion was the cubic polynomial function (AIC = 672.7, versus AIC = 673.9 and AIC = 674.1 with the linear negative function and the quadratic polynomial function, respectively). A statistically significant difference between the tested models demonstrating the superiority of the quadratic (p = 0.18) or cubic polynomial model (p = 0.06) over the negative linear regression model was not found. No effect of general anesthesia (p = 0.34) or of gender (p = 0.16) was found. We provided values for ASL CBF in the brain, GM, WM, hemispheres, and lobes over a wide pediatric age range, approximately showing inverted U-shaped changes in GM perfusion over the course of childhood. Copyright © 2018 Elsevier B.V. All rights reserved.
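
    The comparison of linear, quadratic and cubic fits by AIC can be sketched as follows. The sketch assumes Gaussian errors, for which AIC equals n*ln(RSS/n) + 2k up to an additive constant that is the same for all models, with k counting the polynomial coefficients plus the error variance. The data are synthetic and the helper names (solve, polyfit, rss, aic) are hypothetical; none of this reproduces the study's data or software.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def polyfit(x, y, degree):
    """Least-squares coefficients c[0] + c[1]*x + ... via the normal equations."""
    p = degree + 1
    ata = [[sum(xi ** (i + j) for xi in x) for j in range(p)] for i in range(p)]
    aty = [sum(yi * xi ** i for xi, yi in zip(x, y)) for i in range(p)]
    return solve(ata, aty)

def rss(x, y, coef):
    """Residual sum of squares of a fitted polynomial."""
    return sum((yi - sum(c * xi ** k for k, c in enumerate(coef))) ** 2
               for xi, yi in zip(x, y))

def aic(x, y, degree):
    """Gaussian AIC up to an additive constant shared by all models."""
    n = len(x)
    k = degree + 2  # polynomial coefficients plus the error variance
    return n * math.log(rss(x, y, polyfit(x, y, degree)) / n) + 2 * k

# Synthetic inverted-U data with a small deterministic perturbation
ages = list(range(10))
values = [2 + 0.5 * t - 0.3 * t ** 2 + (0.1 if t % 2 == 0 else -0.1)
          for t in ages]

for degree in (1, 2, 3):
    print(degree, round(aic(ages, values, degree), 2))
```

    A lower AIC indicates a better trade-off between fit and complexity; as in the abstract, small AIC differences between nested models should be checked with a likelihood ratio test before preferring the more complex model.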
