Poisson Mixture Regression Models for Heart Disease Prediction.
Mufudza, Chipo; Erol, Hamza
2016-01-01
Early heart disease control can be achieved by high disease prediction and diagnosis efficiency. This paper focuses on the use of model based clustering techniques to predict and diagnose heart disease via Poisson mixture regression models. Analysis and application of Poisson mixture regression models is here addressed under two different classes: standard and concomitant variable mixture regression models. Results show that a two-component concomitant variable Poisson mixture regression model predicts heart disease better than both the standard Poisson mixture regression model and the ordinary general linear Poisson regression model due to its low Bayesian Information Criteria value. Furthermore, a Zero Inflated Poisson Mixture Regression model turned out to be the best model for heart prediction over all models as it both clusters individuals into high or low risk category and predicts rate to heart disease componentwise given clusters available. It is deduced that heart disease prediction can be effectively done by identifying the major risks componentwise using Poisson mixture regression model. PMID:27999611
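The model comparison by Bayesian information criterion described in this abstract can be illustrated with a much simpler setting. The sketch below fits a two-component Poisson mixture without covariates by EM and compares its BIC with that of a single Poisson model. It is not the authors' concomitant-variable regression model; the function names, the starting values, and the simulated counts are all assumptions made for illustration.

```python
import math
import numpy as np

def poisson_logpmf(y, lam):
    """log P(Y = y) for Y ~ Poisson(lam); lgamma(y + 1) equals log(y!)."""
    y = np.asarray(y, dtype=float)
    log_factorial = np.array([math.lgamma(v + 1.0) for v in y])
    return y * np.log(lam) - lam - log_factorial

def mixture_loglik(y, weights, rates):
    """Log-likelihood of a finite Poisson mixture (stable log-sum-exp)."""
    comp = np.stack([np.log(w) + poisson_logpmf(y, r)
                     for w, r in zip(weights, rates)], axis=1)
    top = comp.max(axis=1, keepdims=True)
    return float(np.sum(top[:, 0] + np.log(np.exp(comp - top).sum(axis=1))))

def fit_two_poisson_mixture(y, n_iter=200):
    """EM for a two-component Poisson mixture with no covariates."""
    y = np.asarray(y, dtype=float)
    # Hypothetical starting values: one low-rate and one high-rate component.
    rates = np.array([np.quantile(y, 0.25) + 0.5, np.quantile(y, 0.75) + 1.0])
    weights = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: posterior probability that each count belongs to each component.
        comp = np.stack([np.log(w) + poisson_logpmf(y, r)
                         for w, r in zip(weights, rates)], axis=1)
        comp -= comp.max(axis=1, keepdims=True)
        resp = np.exp(comp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: weighted proportions and weighted means (guarded against zero).
        weights = np.clip(resp.mean(axis=0), 1e-12, None)
        totals = np.maximum(resp.sum(axis=0), 1e-300)
        rates = np.maximum((resp * y[:, None]).sum(axis=0) / totals, 1e-8)
    return weights, rates, mixture_loglik(y, weights, rates)

def bic(loglik, n_params, n_obs):
    """Bayesian information criterion; lower values indicate the preferred model."""
    return n_params * math.log(n_obs) - 2.0 * loglik

# Simulated counts from a low-risk and a high-risk group (illustrative only).
rng = np.random.default_rng(0)
counts = np.concatenate([rng.poisson(1.0, 300), rng.poisson(10.0, 300)])
weights, rates, ll_mix = fit_two_poisson_mixture(counts)
ll_single = float(poisson_logpmf(counts, counts.mean()).sum())
print("BIC single:", bic(ll_single, 1, counts.size))
print("BIC mixture:", bic(ll_mix, 3, counts.size))
```

The mixture has three free parameters (two rates and one mixing proportion) against one for the single Poisson model, which is why the BIC penalty differs between the two calls.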
Bayesian Estimation of Multivariate Latent Regression Models: Gauss versus Laplace
ERIC Educational Resources Information Center
Culpepper, Steven Andrew; Park, Trevor
2017-01-01
A latent multivariate regression model is developed that employs a generalized asymmetric Laplace (GAL) prior distribution for regression coefficients. The model is designed for high-dimensional applications where an approximate sparsity condition is satisfied, such that many regression coefficients are near zero after accounting for all the model…
Predicting U.S. Army Reserve Unit Manning Using Market Demographics
2015-06-01
develops linear regression, classification tree, and logistic regression models to determine the ability of the location to support manning requirements... logistic regression model delivers predictive results that allow decision-makers to identify locations with a high probability of meeting unit...manning requirements. The recommendation of this thesis is that the USAR implement the logistic regression model.
Impact of multicollinearity on small sample hydrologic regression models
NASA Astrophysics Data System (ADS)
Kroll, Charles N.; Song, Peter
2013-06-01
Often hydrologic regression models are developed with ordinary least squares (OLS) procedures. The use of OLS with highly correlated explanatory variables produces multicollinearity, which creates highly sensitive parameter estimators with inflated variances and improper model selection. It is not clear how to best address multicollinearity in hydrologic regression models. Here a Monte Carlo simulation is developed to compare four techniques to address multicollinearity: OLS, OLS with variance inflation factor screening (VIF), principal component regression (PCR), and partial least squares regression (PLS). The performance of these four techniques was observed for varying sample sizes, correlation coefficients between the explanatory variables, and model error variances consistent with hydrologic regional regression models. The negative effects of multicollinearity are magnified at smaller sample sizes, higher correlations between the variables, and larger model error variances (smaller R²). The Monte Carlo simulation indicates that if the true model is known, multicollinearity is present, and the estimation and statistical testing of regression parameters are of interest, then PCR or PLS should be employed. If the model is unknown, or if the interest is solely on model predictions, it is recommended that OLS be employed since using more complicated techniques did not produce any improvement in model performance. A leave-one-out cross-validation case study was also performed using low-streamflow data sets from the eastern United States. Results indicate that OLS with stepwise selection generally produces models across study regions with varying levels of multicollinearity that are as good as biased regression techniques such as PCR and PLS.
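The variance inflation factor screening mentioned in this abstract can be sketched as follows. For each explanatory variable, the VIF is 1 / (1 - R²), where R² comes from regressing that variable on all the others. The function name and the simulated predictors are illustrative assumptions, not the study's data or code.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF for each column of X, an array of shape (n_samples, n_features)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress column j on an intercept plus all remaining columns.
        design = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        centered = target - target.mean()
        r_squared = 1.0 - (resid @ resid) / (centered @ centered)
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs

# Two nearly collinear predictors and one independent predictor (simulated).
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.05 * rng.normal(size=200)
x3 = rng.normal(size=200)
print(variance_inflation_factors(np.column_stack([x1, x2, x3])))
```

A common rule of thumb flags predictors with a VIF above about 10, although the threshold used in the study is not stated in the abstract.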
Tøndel, Kristin; Indahl, Ulf G; Gjuvsland, Arne B; Vik, Jon Olav; Hunter, Peter; Omholt, Stig W; Martens, Harald
2011-06-01
Deterministic dynamic models of complex biological systems contain a large number of parameters and state variables, related through nonlinear differential equations with various types of feedback. A metamodel of such a dynamic model is a statistical approximation model that maps variation in parameters and initial conditions (inputs) to variation in features of the trajectories of the state variables (outputs) throughout the entire biologically relevant input space. A sufficiently accurate mapping can be exploited both instrumentally and epistemically. Multivariate regression methodology is a commonly used approach for emulating dynamic models. However, when the input-output relations are highly nonlinear or non-monotone, a standard linear regression approach is prone to give suboptimal results. We therefore hypothesised that a more accurate mapping can be obtained by locally linear or locally polynomial regression. We present here a new method for local regression modelling, Hierarchical Cluster-based PLS regression (HC-PLSR), where fuzzy C-means clustering is used to separate the data set into parts according to the structure of the response surface. We compare the metamodelling performance of HC-PLSR with polynomial partial least squares regression (PLSR) and ordinary least squares (OLS) regression on various systems: six different gene regulatory network models with various types of feedback, a deterministic mathematical model of the mammalian circadian clock and a model of the mouse ventricular myocyte function. Our results indicate that multivariate regression is well suited for emulating dynamic models in systems biology. The hierarchical approach turned out to be superior to both polynomial PLSR and OLS regression in all three test cases. The advantage, in terms of explained variance and prediction accuracy, was largest in systems with highly nonlinear functional relationships and in systems with positive feedback loops. 
HC-PLSR is a promising approach for metamodelling in systems biology, especially for highly nonlinear or non-monotone parameter to phenotype maps. The algorithm can be flexibly adjusted to suit the complexity of the dynamic model behaviour, inviting automation in the metamodelling of complex systems. PMID:21627852
SEMIPARAMETRIC QUANTILE REGRESSION WITH HIGH-DIMENSIONAL COVARIATES
Zhu, Liping; Huang, Mian; Li, Runze
2012-01-01
This paper is concerned with quantile regression for a semiparametric regression model, in which both the conditional mean and conditional variance function of the response given the covariates admit a single-index structure. This semiparametric regression model enables us to reduce the dimension of the covariates and simultaneously retains the flexibility of nonparametric regression. Under mild conditions, we show that the simple linear quantile regression offers a consistent estimate of the index parameter vector. This is a surprising and interesting result because the single-index model is possibly misspecified under the linear quantile regression. With a root-n consistent estimate of the index vector, one may employ a local polynomial regression technique to estimate the conditional quantile function. This procedure is computationally efficient, which is very appealing in high-dimensional data analysis. We show that the resulting estimator of the quantile function performs asymptotically as efficiently as if the true value of the index vector were known. The methodologies are demonstrated through comprehensive simulation studies and an application to a real dataset. PMID:24501536
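Quantile regression replaces the squared-error loss of mean regression with the check (pinball) loss. As a minimal sketch, the code below evaluates that loss and confirms numerically that the constant minimizing it is a sample quantile. The function names are assumptions made for illustration, and the single-index and local polynomial steps of the paper are not reproduced.

```python
import numpy as np

def check_loss(residuals, tau):
    """Mean check loss: tau * r for r >= 0 and (tau - 1) * r for r < 0."""
    r = np.asarray(residuals, dtype=float)
    return float(np.mean(r * (tau - (r < 0))))

def best_constant(y, tau):
    """Search the observed values for the constant with the smallest check loss."""
    y = np.asarray(y, dtype=float)
    losses = [check_loss(y - c, tau) for c in y]
    return float(y[int(np.argmin(losses))])

y = np.arange(1, 102)          # the integers 1, ..., 101
print(best_constant(y, 0.5))   # → 51.0, the sample median
print(best_constant(y, 0.9))   # → 91.0, the 0.9 sample quantile
```

Linear quantile regression minimizes the same loss with the constant replaced by a linear function of the covariates.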
Prediction of dynamical systems by symbolic regression
NASA Astrophysics Data System (ADS)
Quade, Markus; Abel, Markus; Shafi, Kamran; Niven, Robert K.; Noack, Bernd R.
2016-07-01
We study the modeling and prediction of dynamical systems based on conventional models derived from measurements. Such algorithms are highly desirable in situations where the underlying dynamics are hard to model from physical principles or simplified models need to be found. We focus on symbolic regression methods as a part of machine learning. These algorithms are capable of learning an analytically tractable model from data, a highly valuable property. Symbolic regression methods can be considered as generalized regression methods. We investigate two particular algorithms, the so-called fast function extraction which is a generalized linear regression algorithm, and genetic programming which is a very general method. Both are able to combine functions in a certain way such that a good model for the prediction of the temporal evolution of a dynamical system can be identified. We illustrate the algorithms by finding a prediction for the evolution of a harmonic oscillator based on measurements, by detecting an arriving front in an excitable system, and as a real-world application, the prediction of solar power production based on energy production observations at a given site together with the weather forecast.
Van Belle, Vanya; Pelckmans, Kristiaan; Van Huffel, Sabine; Suykens, Johan A K
2011-10-01
To compare and evaluate ranking, regression and combined machine learning approaches for the analysis of survival data. The literature describes two approaches based on support vector machines to deal with censored observations. In the first approach the key idea is to rephrase the task as a ranking problem via the concordance index, a problem which can be solved efficiently in a context of structural risk minimization and convex optimization techniques. In a second approach, one uses a regression approach, dealing with censoring by means of inequality constraints. The goal of this paper is then twofold: (i) introducing a new model combining the ranking and regression strategy, which retains the link with existing survival models such as the proportional hazards model via transformation models; and (ii) comparison of the three techniques on 6 clinical and 3 high-dimensional datasets and discussing the relevance of these techniques over classical approaches for survival data. We compare SVM-based survival models based on ranking constraints, based on regression constraints and models based on both ranking and regression constraints. The performance of the models is compared by means of three different measures: (i) the concordance index, measuring the model's discriminating ability; (ii) the logrank test statistic, indicating whether patients with a prognostic index lower than the median prognostic index have a significantly different survival than patients with a prognostic index higher than the median; and (iii) the hazard ratio after normalization to restrict the prognostic index between 0 and 1. Our results indicate a significantly better performance for models including regression constraints over models based only on ranking constraints. This work gives empirical evidence that SVM-based models using regression constraints perform significantly better than SVM-based models based on ranking constraints.
Our experiments show a comparable performance for methods including only regression or both regression and ranking constraints on clinical data. On high dimensional data, the former model performs better. However, this approach does not have a theoretical link with standard statistical models for survival data. This link can be made by means of transformation models when ranking constraints are included. Copyright © 2011 Elsevier B.V. All rights reserved.
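The concordance index used as the first performance measure can be sketched for right-censored data as below. A pair of patients is treated as comparable when the one with the shorter follow-up time had an observed event, and it counts as concordant when that patient also has the higher predicted risk. This is a plain pairwise version that ignores tied times; the function name and the toy data are assumptions, and it is not the SVM training procedure from the paper.

```python
import numpy as np

def concordance_index(time, event, risk):
    """Pairwise concordance index for right-censored survival data."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    risk = np.asarray(risk, dtype=float)
    concordant = 0.0
    comparable = 0
    for i in range(len(time)):
        if not event[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(len(time)):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Four hypothetical patients; the third is censored (event = 0).
print(concordance_index([1, 2, 3, 4], [1, 1, 0, 1], [4, 3, 2, 1]))  # → 1.0
```

A value of 0.5 corresponds to random ordering and 1.0 to a perfect ranking of the comparable pairs.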
ERIC Educational Resources Information Center
Bulcock, J. W.; And Others
Advantages of normalization regression estimation over ridge regression estimation are demonstrated by reference to Bloom's model of school learning. Theoretical concern centered on the structure of scholastic achievement at grade 10 in Canadian high schools. Data on 886 students were randomly sampled from the Carnegie Human Resources Data Bank.…
NASA Astrophysics Data System (ADS)
Lu, Lin; Chang, Yunlong; Li, Yingmin; He, Youyou
2013-05-01
A transverse magnetic field was introduced to the arc plasma during high-speed tungsten inert gas (TIG) arc welding of stainless steel tubes without filler wire. The influence of the external magnetic field on welding quality was investigated. Nine sets of parameters were designed by means of an orthogonal experiment. The tensile strength of the welded joint and the form factor of the weld were taken as the main measures of welding quality. A binary quadratic nonlinear regression equation was established in terms of magnetic induction and Ar gas flow rate. The residual standard deviation was calculated to assess the accuracy of the regression model. The results showed that the regression model was correct and effective in calculating the tensile strength and aspect ratio of the weld. Two 3D regression models were designed respectively, and the effect of magnetic induction on welding quality was then examined.
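A quadratic regression in two predictors, with a residual standard deviation as the accuracy check, can be sketched as follows. The exact terms of the authors' equation are not given in the abstract, so the full second-order form used here (intercept, linear, squared, and interaction terms) is an assumption, as are the function names and the coded factor levels.

```python
import numpy as np

def quadratic_design(x1, x2):
    """Columns: 1, x1, x2, x1^2, x2^2, x1*x2."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2])

def fit_quadratic_surface(x1, x2, z):
    """Least-squares fit; returns coefficients and residual standard deviation."""
    design = quadratic_design(x1, x2)
    z = np.asarray(z, dtype=float)
    coef, *_ = np.linalg.lstsq(design, z, rcond=None)
    resid = z - design @ coef
    dof = len(z) - design.shape[1]          # observations minus fitted terms
    resid_sd = float(np.sqrt(resid @ resid / dof))
    return coef, resid_sd

# Nine runs on a coded 3 x 3 grid; x1 stands in for magnetic induction and
# x2 for Ar flow rate. The response values are synthetic, not measured data.
levels = np.array([-1.0, 0.0, 1.0])
x1, x2 = [g.ravel() for g in np.meshgrid(levels, levels)]
z = 1.0 + 2.0 * x1 - x2 + 0.5 * x1**2 + 0.25 * x2**2 - 0.75 * x1 * x2
coef, resid_sd = fit_quadratic_surface(x1, x2, z)
print(coef, resid_sd)
```

With nine runs and six fitted terms only three residual degrees of freedom remain, so the residual standard deviation from such a design is a coarse accuracy measure.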
Real estate value prediction using multivariate regression models
NASA Astrophysics Data System (ADS)
Manjula, R.; Jain, Shubham; Srivastava, Sharad; Rajiv Kher, Pranav
2017-11-01
The real estate market is one of the most competitive in terms of pricing, and prices tend to vary significantly with many factors, which makes it a prime field in which to apply machine learning to predict prices with high accuracy. In this paper, we therefore present various important features to use when predicting housing prices with good accuracy. We describe regression models that use various features to obtain a lower residual sum of squares error. When features are used in a regression model, some feature engineering is required for better prediction. Often a set of features (multiple regression) or polynomial regression (applying various powers of the features) is used to obtain a better model fit. Because these models are expected to be susceptible to overfitting, ridge regression is used to reduce it. This paper thus points to the best application of regression models, in addition to other techniques, to optimize the result.
NASA Astrophysics Data System (ADS)
Chu, Hone-Jay; Kong, Shish-Jeng; Chang, Chih-Hua
2018-03-01
The turbidity (TB) of a water body varies with time and space. Water quality is traditionally estimated via linear regression based on satellite images. However, estimating and mapping water quality require a spatio-temporal nonstationary model, while TB mapping necessitates the use of geographically and temporally weighted regression (GTWR) and geographically weighted regression (GWR) models, both of which are more precise than linear regression. Given the temporal nonstationary models for mapping water quality, GTWR offers the best option for estimating regional water quality. Compared with GWR, GTWR provides highly reliable information for water quality mapping, boasts a relatively high goodness of fit, improves the explanation of variance from 44% to 87%, and shows a sufficient space-time explanatory power. The seasonal patterns of TB and the main spatial patterns of TB variability can be identified using the estimated TB maps from GTWR and by conducting an empirical orthogonal function (EOF) analysis.
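The core of geographically weighted regression is a separate weighted least-squares fit at each target location, with weights that decay with distance. The sketch below uses a Gaussian kernel and a fixed bandwidth, both of which are assumptions, as are the function name and the simulated data. In GTWR the spatial distance is replaced by a combined space-time distance, which is not implemented here.

```python
import numpy as np

def gwr_coefficients(coords, X, y, target, bandwidth):
    """Local intercept and slopes at `target` from kernel-weighted least squares."""
    coords = np.asarray(coords, dtype=float)
    y = np.asarray(y, dtype=float)
    dist = np.linalg.norm(coords - np.asarray(target, dtype=float), axis=1)
    weights = np.exp(-0.5 * (dist / bandwidth) ** 2)    # Gaussian distance decay
    design = np.column_stack([np.ones(len(y)), X])
    weighted = design.T * weights                        # equals A^T W
    return np.linalg.solve(weighted @ design, weighted @ y)

# Simulated example: the slope of y on x equals the east-west coordinate s.
rng = np.random.default_rng(2)
s = rng.uniform(0.0, 10.0, 500)
coords = np.column_stack([s, np.zeros(500)])
x = rng.normal(size=500)
y = s * x
print(gwr_coefficients(coords, x, y, [2.0, 0.0], bandwidth=0.5))
print(gwr_coefficients(coords, x, y, [8.0, 0.0], bandwidth=0.5))
```

A single global linear regression would return one slope for the whole region, whereas the local fits track how the relationship changes across space.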
ERIC Educational Resources Information Center
Kobrin, Jennifer L.; Sinharay, Sandip; Haberman, Shelby J.; Chajewski, Michael
2011-01-01
This study examined the adequacy of a multiple linear regression model for predicting first-year college grade point average (FYGPA) using SAT[R] scores and high school grade point average (HSGPA). A variety of techniques, both graphical and statistical, were used to examine if it is possible to improve on the linear regression model. The results…
GLOBALLY ADAPTIVE QUANTILE REGRESSION WITH ULTRA-HIGH DIMENSIONAL DATA
Zheng, Qi; Peng, Limin; He, Xuming
2015-01-01
Quantile regression has become a valuable tool to analyze heterogeneous covariate-response associations that are often encountered in practice. The development of quantile regression methodology for high dimensional covariates primarily focuses on examination of model sparsity at a single or multiple quantile levels, which are typically prespecified ad hoc by the users. The resulting models may be sensitive to the specific choices of the quantile levels, leading to difficulties in interpretation and erosion of confidence in the results. In this article, we propose a new penalization framework for quantile regression in the high dimensional setting. We employ adaptive L1 penalties, and more importantly, propose a uniform selector of the tuning parameter for a set of quantile levels to avoid some of the potential problems with model selection at individual quantile levels. Our proposed approach achieves consistent shrinkage of regression quantile estimates across a continuous range of quantile levels, enhancing the flexibility and robustness of the existing penalized quantile regression methods. Our theoretical results include the oracle rate of uniform convergence and weak convergence of the parameter estimators. We also use numerical studies to confirm our theoretical findings and illustrate the practical utility of our proposal. PMID:26604424
Modelling infant mortality rate in Central Java, Indonesia use generalized poisson regression method
NASA Astrophysics Data System (ADS)
Prahutama, Alan; Sudarno
2018-05-01
The infant mortality rate is the number of deaths under one year of age occurring among the live births in a given geographical area during a given year, per 1,000 live births occurring among the population of the given geographical area during the same year. This problem needs to be addressed because it is an important element of a country’s economic development. A high infant mortality rate will disrupt the stability of a country as it relates to the sustainability of the population in the country. One regression model that can be used to analyze the relationship between a dependent variable Y in the form of discrete data and an independent variable X is the Poisson regression model. Regression models recently used for data with a discrete dependent variable include, among others, Poisson regression, negative binomial regression and generalized Poisson regression. In this research, generalized Poisson regression modeling gives a better AIC value than Poisson regression. The most significant variable is the number of health facilities (X1), while the variable that has the most influence on the infant mortality rate is average breastfeeding (X9).
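The generalized Poisson distribution adds a dispersion parameter to the Poisson distribution, and AIC trades fit against parameter count. The sketch below uses Consul's parameterization, P(Y = y) = θ(θ + δy)^(y-1) exp(-θ - δy) / y!, which reduces to the Poisson case when δ = 0. Whether the paper uses this exact parameterization is an assumption, and the function names and parameter values are illustrative.

```python
import math
import numpy as np

def generalized_poisson_logpmf(y, theta, delta):
    """Log-pmf of Consul's generalized Poisson; delta = 0 gives Poisson(theta)."""
    y = np.asarray(y, dtype=float)
    log_factorial = np.array([math.lgamma(v + 1.0) for v in y])
    return (np.log(theta) + (y - 1.0) * np.log(theta + delta * y)
            - theta - delta * y - log_factorial)

def aic(loglik, n_params):
    """Akaike information criterion; lower values indicate the preferred model."""
    return 2.0 * n_params - 2.0 * loglik

# Mean and variance under theta = 2, delta = 0.3 (hypothetical values).
support = np.arange(0, 400)
prob = np.exp(generalized_poisson_logpmf(support, 2.0, 0.3))
mean = float((support * prob).sum())
variance = float(((support - mean) ** 2 * prob).sum())
print(mean, variance)   # theory: theta / (1 - delta) and theta / (1 - delta)**3
```

For 0 < δ < 1 the variance exceeds the mean, which is the overdispersion that plain Poisson regression cannot represent.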
ERIC Educational Resources Information Center
Bulcock, J. W.; And Others
Multicollinearity refers to the presence of highly intercorrelated independent variables in structural equation models, that is, models estimated by using techniques such as least squares regression and maximum likelihood. There is a problem of multicollinearity in both the natural and social sciences where theory formulation and estimation is in…
Building Your Own Regression Model
ERIC Educational Resources Information Center
Horton, Robert, M.; Phillips, Vicki; Kenelly, John
2004-01-01
Spreadsheets to explore regression with an algebra 2 class in a medium-sized rural high school are presented. The use of spreadsheets can help students develop sophisticated understanding of mathematical models and use them to describe real-world phenomena.
Fonseca, Maria de Jesus Mendes da; Juvanhol, Leidjaira Lopes; Rotenberg, Lúcia; Nobre, Aline Araújo; Griep, Rosane Härter; Alves, Márcia Guimarães de Mello; Cardoso, Letícia de Oliveira; Giatti, Luana; Nunes, Maria Angélica; Aquino, Estela M L; Chor, Dóra
2017-11-17
This paper explores the association between job strain and adiposity, using two statistical analysis approaches and considering the role of gender. The research evaluated 11,960 active baseline participants (2008-2010) in the ELSA-Brasil study. Job strain was evaluated through a demand-control questionnaire, while body mass index (BMI) and waist circumference (WC) were evaluated in continuous form. The associations were estimated using gamma regression models with an identity link function. Quantile regression models were also estimated from the final set of co-variables established by gamma regression. The relationship that was found varied by analytical approach and gender. Among the women, no association was observed between job strain and adiposity in the fitted gamma models. In the quantile models, a pattern of increasing effects of high strain was observed at higher BMI and WC distribution quantiles. Among the men, high strain was associated with adiposity in the gamma regression models. However, when quantile regression was used, that association was found not to be homogeneous across outcome distributions. In addition, in the quantile models an association was observed between active jobs and BMI. Our results point to an association between job strain and adiposity, which follows a heterogeneous pattern. Modelling strategies can produce different results and should, accordingly, be used to complement one another.
NASA Astrophysics Data System (ADS)
Wibowo, Wahyu; Wene, Chatrien; Budiantara, I. Nyoman; Permatasari, Erma Oktania
2017-03-01
Multiresponse semiparametric regression is a simultaneous-equation regression model that fuses parametric and nonparametric models. The regression model comprises several models, and each model has two components, parametric and nonparametric. The model used has a linear function as the parametric component and a truncated polynomial spline as the nonparametric component. The model can handle both linear and nonlinear relationships between the response and the sets of predictor variables. The aim of this paper is to demonstrate the application of the regression model to modeling the effect of regional socio-economic conditions on the use of information technology. More specifically, the response variables are the percentage of households that have access to the internet and the percentage of households that have a personal computer. The predictor variables are the percentage of literate people, the percentage of electrification and the percentage of economic growth. Based on identification of the relationship between the response and predictor variables, economic growth is treated as a nonparametric predictor and the others are parametric predictors. The results show that multiresponse semiparametric regression can be applied well, as indicated by the high coefficient of determination, 90 percent.
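A truncated polynomial spline represents the nonparametric component with ordinary powers of x plus terms of the form (x - k)_+^d that switch on after each knot k. The sketch below builds that basis for a single response and fits it by least squares. The function name, the knot location, and the test curve are assumptions, and the multiresponse weighting of the paper is not reproduced.

```python
import numpy as np

def truncated_power_basis(x, knots, degree=1):
    """Columns x, ..., x^degree followed by (x - k)_+^degree for each knot k."""
    x = np.asarray(x, dtype=float)
    columns = [x ** d for d in range(1, degree + 1)]
    columns += [np.maximum(x - k, 0.0) ** degree for k in knots]
    return np.column_stack(columns)

# A linear spline with one knot at 0.5 reproduces |x - 0.5| exactly, since
# |x - 0.5| = 0.5 - x + 2 * (x - 0.5)_+.
x = np.linspace(0.0, 1.0, 21)
y = np.abs(x - 0.5)
design = np.column_stack([np.ones_like(x), truncated_power_basis(x, [0.5])])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
print(coef)   # intercept, slope, and knot coefficient
```

In a semiparametric model the parametric predictors would enter as additional linear columns of the same design matrix.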
Ridge: a computer program for calculating ridge regression estimates
Donald E. Hilt; Donald W. Seegrist
1977-01-01
Least-squares coefficients for multiple-regression models may be unstable when the independent variables are highly correlated. Ridge regression is a biased estimation procedure that produces stable estimates of the coefficients. Ridge regression is discussed, and a computer program for calculating the ridge coefficients is presented.
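The ridge estimator has the closed form (X'X + kI)^(-1) X'y on centered data, where k is the ridge constant. The following is a minimal NumPy sketch of that formula, not the program described in the report. The function name and the simulated data are assumptions, and the predictors are centered but not rescaled.

```python
import numpy as np

def ridge_coefficients(X, y, k):
    """Intercept and slopes from the ridge normal equations on centered data."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean = X.mean(axis=0)
    Xc = X - x_mean
    yc = y - y.mean()
    gram = Xc.T @ Xc + k * np.eye(X.shape[1])   # k = 0 gives ordinary least squares
    beta = np.linalg.solve(gram, Xc.T @ yc)
    intercept = y.mean() - x_mean @ beta
    return intercept, beta

# Two highly correlated predictors (simulated for illustration).
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + 0.01 * rng.normal(size=100)
X = np.column_stack([x1, x2])
y = 1.0 + x1 + x2 + 0.1 * rng.normal(size=100)
for k in (0.0, 1.0, 10.0):
    print(k, ridge_coefficients(X, y, k))
```

Increasing k shrinks the coefficient vector toward zero, trading a small bias for much lower variance when the predictors are nearly collinear.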
Lin, Chao-Cheng; Bai, Ya-Mei; Chen, Jen-Yeu; Hwang, Tzung-Jeng; Chen, Tzu-Ting; Chiu, Hung-Wen; Li, Yu-Chuan
2010-03-01
Metabolic syndrome (MetS) is an important side effect of second-generation antipsychotics (SGAs). However, many SGA-treated patients with MetS remain undetected. In this study, we trained and validated artificial neural network (ANN) and multiple logistic regression models without biochemical parameters to rapidly identify MetS in patients with SGA treatment. A total of 383 patients with a diagnosis of schizophrenia or schizoaffective disorder (DSM-IV criteria) with SGA treatment for more than 6 months were investigated to determine whether they met the MetS criteria according to the International Diabetes Federation. The data for these patients were collected between March 2005 and September 2005. The input variables of ANN and logistic regression were limited to demographic and anthropometric data only. All models were trained by randomly selecting two-thirds of the patient data and were internally validated with the remaining one-third of the data. The models were then externally validated with data from 69 patients from another hospital, collected between March 2008 and June 2008. The area under the receiver operating characteristic curve (AUC) was used to measure the performance of all models. Both the final ANN and logistic regression models had high accuracy (88.3% vs 83.6%), sensitivity (93.1% vs 86.2%), and specificity (86.9% vs 83.8%) to identify MetS in the internal validation set. The mean +/- SD AUC was high for both the ANN and logistic regression models (0.934 +/- 0.033 vs 0.922 +/- 0.035, P = .63). During external validation, high AUC was still obtained for both models. Waist circumference and diastolic blood pressure were the common variables that were left in the final ANN and logistic regression models. Our study developed accurate ANN and logistic regression models to detect MetS in patients with SGA treatment. The models are likely to provide a noninvasive tool for large-scale screening of MetS in this group of patients. 
(c) 2010 Physicians Postgraduate Press, Inc.
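The accuracy, sensitivity, specificity, and AUC figures reported in this abstract are standard screening metrics. The sketch below computes them from binary labels and predicted scores, with the AUC taken as the fraction of positive-negative pairs that are ranked correctly. The function names and the toy labels are assumptions, and none of the study's data are reproduced.

```python
import numpy as np

def screening_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity from binary labels and predictions."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = int(np.sum(y_true & y_pred))
    tn = int(np.sum(~y_true & ~y_pred))
    fp = int(np.sum(~y_true & y_pred))
    fn = int(np.sum(y_true & ~y_pred))
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }

def auc(y_true, scores):
    """Probability that a random positive case outscores a random negative case."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    diff = scores[y_true][:, None] - scores[~y_true][None, :]
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size)

labels = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 0, 1]
print(screening_metrics(labels, predicted))
```

Sensitivity and specificity depend on the classification threshold, while the AUC summarizes ranking quality across all thresholds.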
High dimensional linear regression models under long memory dependence and measurement error
NASA Astrophysics Data System (ADS)
Kaul, Abhishek
This dissertation consists of three chapters. The first chapter introduces the models under consideration and motivates problems of interest. A brief literature review is also provided in this chapter. The second chapter investigates the properties of Lasso under long range dependent model errors. Lasso is a computationally efficient approach to model selection and estimation, and its properties are well studied when the regression errors are independent and identically distributed. We study the case where the regression errors form a long memory moving average process. We establish a finite sample oracle inequality for the Lasso solution. We then show the asymptotic sign consistency in this setup. These results are established in the high dimensional setup (p > n) where p can be increasing exponentially with n. Finally, we show the n^(1/2-d)-consistency of Lasso, along with the oracle property of adaptive Lasso, in the case where p is fixed. Here d is the memory parameter of the stationary error sequence. The performance of Lasso is also analysed in the present setup with a simulation study. The third chapter proposes and investigates the properties of a penalized quantile based estimator for measurement error models. Standard formulations of prediction problems in high dimension regression models assume the availability of fully observed covariates and sub-Gaussian and homogeneous model errors. This makes these methods inapplicable to measurement error models where covariates are unobservable and observations are possibly non sub-Gaussian and heterogeneous. We propose weighted penalized corrected quantile estimators for the regression parameter vector in linear regression models with additive measurement errors, where unobservable covariates are nonrandom. The proposed estimators forgo the need for the above mentioned model assumptions.
We study these estimators in both the fixed dimensional and high dimensional sparse setups; in the latter setup, the dimensionality can grow exponentially with the sample size. In the fixed dimensional setting we provide the oracle properties associated with the proposed estimators. In the high dimensional setting, we provide bounds for the statistical error associated with the estimation that hold with asymptotic probability 1, thereby providing the ℓ1-consistency of the proposed estimator. We also establish the model selection consistency in terms of the correctly estimated zero components of the parameter vector. A simulation study that investigates the finite sample accuracy of the proposed estimator is also included in this chapter.
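For readers unfamiliar with the Lasso, the sketch below solves the objective (1/2n)||y - Xb||² + λ||b||₁ by cyclic coordinate descent with soft-thresholding, in a small p > n example. It uses independent errors, not the long memory errors studied in the dissertation. The function names, the value of λ, and the simulated design are all assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly zero when |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Minimize (1 / 2n) * ||y - X b||^2 + lam * ||b||_1 one coordinate at a time."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    beta = np.zeros(p)
    col_scale = (X ** 2).sum(axis=0) / n
    resid = y.copy()                             # residual for beta = 0
    for _ in range(n_sweeps):
        for j in range(p):
            resid += X[:, j] * beta[j]           # remove coordinate j from the fit
            rho = X[:, j] @ resid / n
            beta[j] = soft_threshold(rho, lam) / col_scale[j]
            resid -= X[:, j] * beta[j]           # add the updated coordinate back
    return beta

# Sparse truth with p = 100 predictors and n = 50 observations (simulated).
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 100))
beta_true = np.zeros(100)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.1 * rng.normal(size=50)
beta_hat = lasso_coordinate_descent(X, y, lam=0.1)
print(np.flatnonzero(beta_hat))
```

Sign consistency, as discussed in the abstract, asks that the estimated coefficients have the same signs as the true ones, including exact zeros off the true support.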
Neural Network and Regression Soft Model Extended for PAX-300 Aircraft Engine
NASA Technical Reports Server (NTRS)
Patnaik, Surya N.; Hopkins, Dale A.
2002-01-01
In fiscal year 2001, the neural network and regression capabilities of NASA Glenn Research Center's COMETBOARDS design optimization testbed were extended to generate approximate models for the PAX-300 aircraft engine. The analytical model of the engine is defined through nine variables: the fan efficiency factor, the low pressure of the compressor, the high pressure of the compressor, the high pressure of the turbine, the low pressure of the turbine, the operating pressure, and three critical temperatures (T(sub 4), T(sub vane), and T(sub metal)). Numerical Propulsion System Simulation (NPSS) calculations of the specific fuel consumption (TSFC), as a function of the variables can become time consuming, and numerical instabilities can occur during these design calculations. "Soft" models can alleviate both deficiencies. These approximate models are generated from a set of high-fidelity input-output pairs obtained from the NPSS code and a design of the experiment strategy. A neural network and a regression model with 45 weight factors were trained for the input/output pairs. Then, the trained models were validated through a comparison with the original NPSS code. Comparisons of TSFC versus the operating pressure and of TSFC versus the three temperatures (T(sub 4), T(sub vane), and T(sub metal)) are depicted in the figures. The overall performance was satisfactory for both the regression and the neural network model. The regression model required fewer calculations than the neural network model, and it produced marginally superior results. Training the approximate methods is time consuming. Once trained, the approximate methods generated the solution with only a trivial computational effort, reducing the solution time from hours to less than a minute.
Lei, Yang; Nollen, Nikki; Ahluwahlia, Jasjit S; Yu, Qing; Mayo, Matthew S
2015-04-09
Other forms of tobacco use are increasing in prevalence, yet most tobacco control efforts are aimed at cigarettes. In light of this, it is important to identify individuals who are using both cigarettes and alternative tobacco products (ATPs). Most previous studies have used regression models. We conducted a traditional logistic regression model and a classification and regression tree (CART) model to illustrate and discuss the added advantages of using CART in the setting of identifying high-risk subgroups of ATP users among cigarette smokers. The data were collected from an online cross-sectional survey administered by Survey Sampling International between July 5, 2012 and August 15, 2012. Eligible participants self-identified as current smokers, African American, White, or Latino (of any race), were English-speaking, and were at least 25 years old. The study sample included 2,376 participants and was divided into independent training and validation samples for a hold-out validation. Logistic regression and CART models were used to examine the important predictors of cigarette + ATP users. The logistic regression model identified nine important factors: gender, age, race, nicotine dependence, buying cigarettes or borrowing, whether the price of cigarettes influences the brand purchased, whether the participants set limits on cigarettes per day, alcohol use scores, and discrimination frequencies. The C-index of the logistic regression model was 0.74, indicating good discriminatory capability. The model also performed well in the validation cohort, with good discrimination (c-index = 0.73) and excellent calibration (R-square = 0.96 in the calibration regression). The parsimonious CART model identified gender, age, alcohol use score, race, and discrimination frequencies to be the most important factors. It also revealed interesting partial interactions. The c-index was 0.70 for the training sample and 0.69 for the validation sample. The misclassification rate was 0.342 for the training sample and 0.346 for the validation sample. The CART model was easier to interpret and discovered target populations that possess clinical significance. This study suggests that the non-parametric CART model is parsimonious, potentially easier to interpret, and provides additional information in identifying the subgroups at high risk of ATP use among cigarette smokers.
Distributed Monitoring of the R(sup 2) Statistic for Linear Regression
NASA Technical Reports Server (NTRS)
Bhaduri, Kanishka; Das, Kamalika; Giannella, Chris R.
2011-01-01
The problem of monitoring a multivariate linear regression model is relevant in studying the evolving relationship between a set of input variables (features) and one or more dependent target variables. This problem becomes challenging for large scale data in a distributed computing environment when only a subset of instances is available at individual nodes and the local data changes frequently. Data centralization and periodic model recomputation can add high overhead to tasks like anomaly detection in such dynamic settings. Therefore, the goal is to develop techniques for monitoring and updating the model over the union of all nodes' data in a communication-efficient fashion. Correctness guarantees on such techniques are also often highly desirable, especially in safety-critical application scenarios. In this paper we develop DReMo, a distributed algorithm with very low resource overhead, for monitoring the quality of a regression model in terms of its coefficient of determination (R2 statistic). When the nodes collectively determine that R2 has dropped below a fixed threshold, the linear regression model is recomputed via a network-wide convergecast and the updated model is broadcast back to all nodes. We show empirically, using both synthetic and real data, that our proposed method is highly communication-efficient and scalable, and also provide theoretical guarantees on correctness.
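As a rough illustration of the quantity being monitored, the sketch below computes a global R2 for a fixed linear model from per-node summaries that are simply added together. It is a centralized simplification under stated assumptions, not the DReMo algorithm itself; the names `NodeSummary`, `summarize`, `global_r2`, and `needs_recompute` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class NodeSummary:
    """Additive statistics one node reports (hypothetical structure)."""
    n: int          # number of local instances
    sum_y: float    # sum of local targets
    sum_y2: float   # sum of squared local targets
    sse: float      # sum of squared residuals under the current model


def summarize(xs: Sequence[Sequence[float]], ys: Sequence[float],
              weights: Sequence[float], intercept: float) -> NodeSummary:
    """Compute the local summary for the current linear model."""
    sse = 0.0
    for x, y in zip(xs, ys):
        pred = intercept + sum(w * xi for w, xi in zip(weights, x))
        sse += (y - pred) ** 2
    return NodeSummary(len(ys), sum(ys), sum(y * y for y in ys), sse)


def global_r2(summaries: List[NodeSummary]) -> float:
    """R2 = 1 - SSE/SST over the union of all nodes' data."""
    n = sum(s.n for s in summaries)
    sum_y = sum(s.sum_y for s in summaries)
    sum_y2 = sum(s.sum_y2 for s in summaries)
    sse = sum(s.sse for s in summaries)
    sst = sum_y2 - sum_y ** 2 / n  # total sum of squares about the mean
    if sst <= 0.0:
        raise ValueError("targets are constant; R2 is undefined")
    return 1.0 - sse / sst


def needs_recompute(summaries: List[NodeSummary], threshold: float) -> bool:
    """True when the global R2 has dropped below the fixed threshold."""
    return global_r2(summaries) < threshold
```

Because every field of the summary is a plain sum, the global statistic can be assembled without moving raw data. The abstract's contribution is doing this monitoring with low communication, which this sketch does not attempt.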
A novel strategy for forensic age prediction by DNA methylation and support vector regression model
Xu, Cheng; Qu, Hongzhu; Wang, Guangyu; Xie, Bingbing; Shi, Yi; Yang, Yaran; Zhao, Zhao; Hu, Lan; Fang, Xiangdong; Yan, Jiangwei; Feng, Lei
2015-01-01
High deviations resulting from prediction model, gender, and population differences have limited the application of DNA methylation markers to age estimation. Here we identified 2,957 novel age-associated DNA methylation sites (P < 0.01 and R2 > 0.5) in blood of eight pairs of Chinese Han female monozygotic twins. Among them, nine novel sites (false discovery rate < 0.01), along with three other reported sites, were further validated in 49 unrelated female volunteers with ages of 20–80 years by Sequenom Massarray. A total of 95 CpGs were covered in the PCR products and 11 of them were used to build the age prediction models. After comparing four different models, including multivariate linear regression, multivariate nonlinear regression, back propagation neural network, and support vector regression (SVR), SVR was identified as the most robust model, with the least mean absolute deviation from real chronological age (2.8 years) and an average accuracy of 4.7 years predicted by only six of the 11 loci, as well as a smaller cross-validated error compared with the linear regression model. Our novel strategy provides an accurate measurement that is highly useful in estimating the individual age in forensic practice as well as in tracking the aging process in other related applications. PMID:26635134
Vaeth, Michael; Skovlund, Eva
2004-06-15
For a given regression problem it is possible to identify a suitably defined equivalent two-sample problem such that the power or sample size obtained for the two-sample problem also applies to the regression problem. For a standard linear regression model the equivalent two-sample problem is easily identified, but for generalized linear models and for Cox regression models the situation is more complicated. An approximately equivalent two-sample problem may, however, also be identified here. In particular, we show that for logistic regression and Cox regression models the equivalent two-sample problem is obtained by selecting two equally sized samples for which the parameters differ by a value equal to the slope times twice the standard deviation of the independent variable and further requiring that the overall expected number of events is unchanged. In a simulation study we examine the validity of this approach to power calculations in logistic regression and Cox regression models. Several different covariate distributions are considered for selected values of the overall response probability and a range of alternatives. For the Cox regression model we consider both constant and non-constant hazard rates. The results show that in general the approach is remarkably accurate even in relatively small samples. Some discrepancies are, however, found in small samples with few events and a highly skewed covariate distribution. Comparison with results based on alternative methods for logistic regression models with a single continuous covariate indicates that the proposed method is at least as good as its competitors. The method is easy to implement and therefore provides a simple way to extend the range of problems that can be covered by the usual formulas for power and sample size determination. Copyright 2004 John Wiley & Sons, Ltd.
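The recipe described above for logistic regression can be sketched as follows: find two group probabilities whose log-odds differ by the slope times twice the standard deviation of the covariate while their average equals the overall response probability, then apply a two-sample power formula. The function names are hypothetical, and the normal-approximation power formula for two proportions is a standard textbook choice assumed here rather than taken from the paper.

```python
import math
from statistics import NormalDist


def logit(p: float) -> float:
    # Clamp to avoid log(0) at the extremes of the bisection interval.
    p = min(max(p, 1e-15), 1.0 - 1e-15)
    return math.log(p / (1.0 - p))


def equivalent_two_sample_probs(slope: float, sd_x: float, overall_p: float):
    """Return (p1, p2) with logit(p2) - logit(p1) = 2 * slope * sd_x
    and (p1 + p2) / 2 = overall_p (equal group sizes)."""
    delta = 2.0 * slope * sd_x
    lo = max(0.0, 2.0 * overall_p - 1.0)
    hi = min(1.0, 2.0 * overall_p)
    for _ in range(200):  # bisection; the log-odds gap decreases as p1 grows
        p1 = 0.5 * (lo + hi)
        p2 = 2.0 * overall_p - p1
        if logit(p2) - logit(p1) > delta:
            lo = p1
        else:
            hi = p1
    p1 = 0.5 * (lo + hi)
    return p1, 2.0 * overall_p - p1


def two_sample_power(p1: float, p2: float, n_total: int, alpha: float = 0.05) -> float:
    """Two-sided power for comparing two proportions with n_total / 2 per group."""
    m = n_total / 2.0
    se = math.sqrt(p1 * (1.0 - p1) / m + p2 * (1.0 - p2) / m)
    z = abs(p2 - p1) / se
    nd = NormalDist()
    z_crit = nd.inv_cdf(1.0 - alpha / 2.0)
    return nd.cdf(z - z_crit) + nd.cdf(-z - z_crit)
```

For example, `two_sample_power(*equivalent_two_sample_probs(0.5, 1.0, 0.3), n_total=200)` gives the approximate power for a log-odds slope of 0.5 per unit of a covariate with unit standard deviation.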
Hart, Carl R; Reznicek, Nathan J; Wilson, D Keith; Pettit, Chris L; Nykaza, Edward T
2016-05-01
Many outdoor sound propagation models exist, ranging from highly complex physics-based simulations to simplified engineering calculations, and more recently, highly flexible statistical learning methods. Several engineering and statistical learning models are evaluated by using a particular physics-based model, namely, a Crank-Nicholson parabolic equation (CNPE), as a benchmark. Narrowband transmission loss values predicted with the CNPE, based upon a simulated data set of meteorological, boundary, and source conditions, act as simulated observations. In the simulated data set sound propagation conditions span from downward refracting to upward refracting, for acoustically hard and soft boundaries, and low frequencies. Engineering models used in the comparisons include the ISO 9613-2 method, Harmonoise, and Nord2000 propagation models. Statistical learning methods used in the comparisons include bagged decision tree regression, random forest regression, boosting regression, and artificial neural network models. Computed skill scores are relative to sound propagation in a homogeneous atmosphere over a rigid ground. Overall skill scores for the engineering noise models are 0.6%, -7.1%, and 83.8% for the ISO 9613-2, Harmonoise, and Nord2000 models, respectively. Overall skill scores for the statistical learning models are 99.5%, 99.5%, 99.6%, and 99.6% for bagged decision tree, random forest, boosting, and artificial neural network regression models, respectively.
Gaussian functional regression for output prediction: Model assimilation and experimental design
NASA Astrophysics Data System (ADS)
Nguyen, N. C.; Peraire, J.
2016-03-01
In this paper, we introduce a Gaussian functional regression (GFR) technique that integrates multi-fidelity models with model reduction to efficiently predict the input-output relationship of a high-fidelity model. The GFR method combines the high-fidelity model with a low-fidelity model to provide an estimate of the output of the high-fidelity model in the form of a posterior distribution that can characterize uncertainty in the prediction. A reduced basis approximation is constructed upon the low-fidelity model and incorporated into the GFR method to yield an inexpensive posterior distribution of the output estimate. As this posterior distribution depends crucially on a set of training inputs at which the high-fidelity models are simulated, we develop a greedy sampling algorithm to select the training inputs. Our approach results in an output prediction model that inherits the fidelity of the high-fidelity model and has the computational complexity of the reduced basis approximation. Numerical results are presented to demonstrate the proposed approach.
Cao, Qingqing; Wu, Zhenqiang; Sun, Ying; Wang, Tiezhu; Han, Tengwei; Gu, Chaomei; Sun, Yehuan
2011-11-01
To explore the application of negative binomial regression and modified Poisson regression analysis in analyzing the influential factors for injury frequency and the risk factors leading to an increase in injury frequency. A total of 2917 primary and secondary school students were selected from Hefei by cluster random sampling and surveyed by questionnaire. The count data on injury events were used to fit modified Poisson regression and negative binomial regression models. The risk factors associated with an increase in unintentional injury frequency among juvenile students were explored, so as to assess the efficiency of these two models in studying the influential factors for injury frequency. The Poisson model showed over-dispersion (P < 0.0001) based on the Lagrange multiplier test. Therefore, the over-dispersed data were fitted better using the modified Poisson regression and negative binomial regression models, respectively. Both models showed that male gender, younger age, father working outside of the hometown, the guardian's education level being above junior high school, and smoking might be associated with higher injury frequencies. For clustered frequency data on injury events, both the modified Poisson regression analysis and negative binomial regression analysis can be used. However, based on our data, the modified Poisson regression fitted better and could give a more accurate interpretation of the relevant factors affecting the frequency of injury.
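A quick way to see what over-dispersion means for count data is the Pearson dispersion ratio of an intercept-only Poisson model, sketched below. This is a simpler diagnostic than the Lagrange multiplier test used in the study, and the function name is hypothetical. Values near 1 are consistent with a Poisson model, while values well above 1 suggest over-dispersion.

```python
from typing import Sequence


def pearson_dispersion(counts: Sequence[int]) -> float:
    """Pearson chi-square divided by its degrees of freedom for an
    intercept-only Poisson model (fitted mean = sample mean)."""
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(counts) / n
    if mean <= 0:
        raise ValueError("mean count must be positive")
    chi2 = sum((y - mean) ** 2 / mean for y in counts)
    return chi2 / (n - 1)
```

With covariates in the model, the same ratio would use the fitted means and subtract the number of estimated parameters from the degrees of freedom.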
Modeling nitrate at domestic and public-supply well depths in the Central Valley, California
Nolan, Bernard T.; Gronberg, JoAnn M.; Faunt, Claudia C.; Eberts, Sandra M.; Belitz, Ken
2014-01-01
Aquifer vulnerability models were developed to map groundwater nitrate concentration at domestic and public-supply well depths in the Central Valley, California. We compared three modeling methods for ability to predict nitrate concentration >4 mg/L: logistic regression (LR), random forest classification (RFC), and random forest regression (RFR). All three models indicated processes of nitrogen fertilizer input at the land surface, transmission through coarse-textured, well-drained soils, and transport in the aquifer to the well screen. The total percent correct predictions were similar among the three models (69–82%), but RFR had greater sensitivity (84% for shallow wells and 51% for deep wells). The results suggest that RFR can better identify areas with high nitrate concentration but that LR and RFC may better describe bulk conditions in the aquifer. A unique aspect of the modeling approach was inclusion of outputs from previous, physically based hydrologic and textural models as predictor variables, which were important to the models. Vertical water fluxes in the aquifer and percent coarse material above the well screen were ranked moderately high-to-high in the RFR models, and the average vertical water flux during the irrigation season was highly significant (p < 0.0001) in logistic regression.
Bayesian Analysis of High Dimensional Classification
NASA Astrophysics Data System (ADS)
Mukhopadhyay, Subhadeep; Liang, Faming
2009-12-01
Modern data mining and bioinformatics have presented an important playground for statistical learning techniques, where the number of input variables is possibly much larger than the sample size of the training data. In supervised learning, logistic regression or probit regression can be used to model a binary output and form perceptron classification rules based on Bayesian inference. In these cases, there is a lot of interest in searching for sparse models in the high dimensional regression (or classification) setup. We first discuss two common challenges for analyzing high dimensional data. The first is the curse of dimensionality: the complexity of many existing algorithms scales exponentially with the dimensionality of the space, so the algorithms soon become computationally intractable and therefore inapplicable in many real applications. The second is multicollinearity among the predictors, which severely slows down the algorithms. In order to make Bayesian analysis operational in high dimension, we propose a novel 'Hierarchical stochastic approximation Monte Carlo algorithm' (HSAMC), which overcomes the curse of dimensionality and the multicollinearity of predictors in high dimension, and also possesses a self-adjusting mechanism to avoid local minima separated by high energy barriers. Models and methods are illustrated by simulations inspired by the field of genomics. Numerical results indicate that HSAMC can work as a general model selection sampler in high dimensional complex model space.
Estimating the Depth of the Navy Recruiting Market
2016-09-01
...recommend that NRC make use of the Poisson regression model in order to determine high-yield ZIP codes for market depth. ... Thesis by Emilie M. Monaghan, September 2016. Thesis Advisor: Lyn R. Whitaker. Second Reader: Jonathan K. Alt.
Hidden Connections between Regression Models of Strain-Gage Balance Calibration Data
NASA Technical Reports Server (NTRS)
Ulbrich, Norbert
2013-01-01
Hidden connections between regression models of wind tunnel strain-gage balance calibration data are investigated. These connections become visible whenever balance calibration data is supplied in its design format and both the Iterative and Non-Iterative Method are used to process the data. First, it is shown how the regression coefficients of the fitted balance loads of a force balance can be approximated by using the corresponding regression coefficients of the fitted strain-gage outputs. Then, data from the manual calibration of the Ames MK40 six-component force balance is chosen to illustrate how estimates of the regression coefficients of the fitted balance loads can be obtained from the regression coefficients of the fitted strain-gage outputs. The study illustrates that load predictions obtained by applying the Iterative or the Non-Iterative Method originate from two related regression solutions of the balance calibration data as long as balance loads are given in the design format of the balance, gage outputs behave highly linear, strict statistical quality metrics are used to assess regression models of the data, and regression model term combinations of the fitted loads and gage outputs can be obtained by a simple variable exchange.
Ebrahimzadeh, Farzad; Hajizadeh, Ebrahim; Vahabi, Nasim; Almasian, Mohammad; Bakhteyar, Katayoon
2015-01-01
Background: Unwanted pregnancy not intended by at least one of the parents has undesirable consequences for the family and the society. In the present study, three classification models were used and compared to predict unwanted pregnancies in an urban population. Methods: In this cross-sectional study, 887 pregnant mothers referring to health centers in Khorramabad, Iran, in 2012 were selected by the stratified and cluster sampling; relevant variables were measured and for prediction of unwanted pregnancy, logistic regression, discriminant analysis, and probit regression models and SPSS software version 21 were used. To compare these models, indicators such as sensitivity, specificity, the area under the ROC curve, and the percentage of correct predictions were used. Results: The prevalence of unwanted pregnancies was 25.3%. The logistic and probit regression models indicated that parity and pregnancy spacing, contraceptive methods, household income and number of living male children were related to unwanted pregnancy. The performance of the models based on the area under the ROC curve was 0.735, 0.733, and 0.680 for logistic regression, probit regression, and linear discriminant analysis, respectively. Conclusion: Given the relatively high prevalence of unwanted pregnancies in Khorramabad, it seems necessary to revise family planning programs. Despite the similar accuracy of the models, if the researcher is interested in the interpretability of the results, the use of the logistic regression model is recommended. PMID:26793655
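The comparison indicators named above (sensitivity, specificity, and the area under the ROC curve) can be computed from predicted scores as in the following sketch. The function names and the 0.5 default cut-off are assumptions for illustration; the AUC is computed through its rank interpretation, the proportion of positive-negative pairs in which the positive case receives the higher score.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(labels: Sequence[int], scores: Sequence[float],
                            threshold: float = 0.5) -> Tuple[float, float]:
    """Classify as positive when score >= threshold, then return
    (sensitivity, specificity). Labels are 1 (event) or 0 (no event)."""
    tp = fn = tn = fp = 0
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold
        if y == 1:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve; tied scores count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

The pairwise loop is quadratic in the sample size, which is fine for a study of a few hundred cases; a rank-based formula would scale better.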
Testing a single regression coefficient in high dimensional linear models
Zhong, Ping-Shou; Li, Runze; Wang, Hansheng; Tsai, Chih-Ling
2017-01-01
In linear regression models with high dimensional data, the classical z-test (or t-test) for testing the significance of each single regression coefficient is no longer applicable. This is mainly because the number of covariates exceeds the sample size. In this paper, we propose a simple and novel alternative by introducing the Correlated Predictors Screening (CPS) method to control for predictors that are highly correlated with the target covariate. Accordingly, the classical ordinary least squares approach can be employed to estimate the regression coefficient associated with the target covariate. In addition, we demonstrate that the resulting estimator is consistent and asymptotically normal even if the random errors are heteroscedastic. This enables us to apply the z-test to assess the significance of each covariate. Based on the p-value obtained from testing the significance of each covariate, we further conduct multiple hypothesis testing by controlling the false discovery rate at the nominal level. Then, we show that the multiple hypothesis testing achieves consistent model selection. Simulation studies and empirical examples are presented to illustrate the finite sample performance and the usefulness of the proposed method, respectively. PMID:28663668
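Once a p-value is available for each covariate, false discovery rate control is often done with the Benjamini-Hochberg step-up procedure. The abstract does not name the procedure it uses, so the sketch below is an assumed, generic choice, and `benjamini_hochberg` is a hypothetical name.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], q: float = 0.05) -> List[bool]:
    """Return a rejection flag per hypothesis, in the input order,
    controlling the false discovery rate at level q (step-up rule)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    # Find the largest rank k whose sorted p-value is at most k * q / m.
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank
    rejected = [False] * m
    # Reject every hypothesis ranked at or below k_max.
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected
```

Note the step-up behaviour: a hypothesis can be rejected even if its own p-value exceeds its own threshold, provided a larger-ranked p-value meets its threshold.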
Testing a single regression coefficient in high dimensional linear models.
Lan, Wei; Zhong, Ping-Shou; Li, Runze; Wang, Hansheng; Tsai, Chih-Ling
2016-11-01
In linear regression models with high dimensional data, the classical z-test (or t-test) for testing the significance of each single regression coefficient is no longer applicable. This is mainly because the number of covariates exceeds the sample size. In this paper, we propose a simple and novel alternative by introducing the Correlated Predictors Screening (CPS) method to control for predictors that are highly correlated with the target covariate. Accordingly, the classical ordinary least squares approach can be employed to estimate the regression coefficient associated with the target covariate. In addition, we demonstrate that the resulting estimator is consistent and asymptotically normal even if the random errors are heteroscedastic. This enables us to apply the z-test to assess the significance of each covariate. Based on the p-value obtained from testing the significance of each covariate, we further conduct multiple hypothesis testing by controlling the false discovery rate at the nominal level. Then, we show that the multiple hypothesis testing achieves consistent model selection. Simulation studies and empirical examples are presented to illustrate the finite sample performance and the usefulness of the proposed method, respectively.
NASA Astrophysics Data System (ADS)
Shi, Liangliang; Mao, Zhihua; Wang, Zheng
2018-02-01
Satellite imagery has played an important role in monitoring the water quality of lakes and coastal waters, but has scarcely been applied to inland rivers. This paper examines the feasibility of applying a regression model to quantify and map the concentrations of total suspended matter (CTSM) in inland rivers that have a large spatial scale and a high CTSM dynamic range, using high resolution WorldView-2 satellite remote sensing data. An empirical approach is used to quantify CTSM through the integrated use of high resolution WorldView-2 multispectral data and 21 in situ CTSM measurements. Radiometric, geometric, and atmospheric corrections are carried out during image processing to derive the surface reflectance, which is then correlated with CTSM using single-variable and multivariable regression techniques. The regression results show that the single near-infrared (NIR) band 8 of WorldView-2 has a relatively strong relationship (R2=0.93) with CTSM. Different prediction models were developed on various combinations of WorldView-2 bands, and the Akaike Information Criterion approach was used to choose the best model. The model involving bands 1, 3, 5, and 8 of WorldView-2 performed best, with an R2 of 0.92 and an SEE of 53.30 g/m3. Spatial distribution maps were produced using the best multiple regression model. The results indicate that it is feasible to apply the empirical model with high resolution satellite imagery to retrieve the CTSM of inland rivers in routine monitoring of water quality.
Non-Asymptotic Oracle Inequalities for the High-Dimensional Cox Regression via Lasso.
Kong, Shengchun; Nan, Bin
2014-01-01
We consider finite sample properties of the regularized high-dimensional Cox regression via lasso. Existing literature focuses on linear models or generalized linear models with Lipschitz loss functions, where the empirical risk functions are the summations of independent and identically distributed (iid) losses. The summands in the negative log partial likelihood function for censored survival data, however, are neither iid nor Lipschitz. We first approximate the negative log partial likelihood function by a sum of iid non-Lipschitz terms, then derive the non-asymptotic oracle inequalities for the lasso penalized Cox regression using pointwise arguments to tackle the difficulties caused by lacking iid Lipschitz losses.
Non-Asymptotic Oracle Inequalities for the High-Dimensional Cox Regression via Lasso
Kong, Shengchun; Nan, Bin
2013-01-01
We consider finite sample properties of the regularized high-dimensional Cox regression via lasso. Existing literature focuses on linear models or generalized linear models with Lipschitz loss functions, where the empirical risk functions are the summations of independent and identically distributed (iid) losses. The summands in the negative log partial likelihood function for censored survival data, however, are neither iid nor Lipschitz. We first approximate the negative log partial likelihood function by a sum of iid non-Lipschitz terms, then derive the non-asymptotic oracle inequalities for the lasso penalized Cox regression using pointwise arguments to tackle the difficulties caused by lacking iid Lipschitz losses. PMID:24516328
Ecotoxicology of phenylphosphonothioates.
Francis, B M; Hansen, L G; Fukuto, T R; Lu, P Y; Metcalf, R L
1980-01-01
The phenylphosphonothioate insecticides EPN and leptophos, and several analogs, were evaluated with respect to their delayed neurotoxic effects in hens and their environmental behavior in a terrestrial-aquatic model ecosystem. Acute toxicity to insects was highly correlated with Σσ of the substituted phenyl group (regression coefficient r = -0.91) while acute toxicity to mammals was slightly less well correlated (regression coefficient r = -0.71), and neurotoxicity was poorly correlated with Σσ (regression coefficient r = -0.35). Both EPN and leptophos were markedly more persistent and bioaccumulative in the model ecosystem than parathion. Desbromoleptophos, a contaminant and metabolite of leptophos, was seen to be a highly stable and persistent terminal residue of leptophos. PMID:6159210
NASA Astrophysics Data System (ADS)
Song, Lanlan
2017-04-01
Nitrous oxide is a much more potent greenhouse gas than carbon dioxide. However, the estimation of N2O flux is usually clouded with uncertainty, mainly due to high spatial and temporal variations. This also hampers the development of general mechanistic models for N2O emission, as most previously developed models were empirical or exhibited low predictability with numerous assumptions. In this study, we tested General Regression Neural Networks (GRNN) as an alternative to classic empirical models for simulating N2O emission in riparian zones of reservoirs. GRNN and nonlinear regression (NLR) were applied to estimate the N2O flux from 1-year observations in riparian zones of the Three Gorges Reservoir. NLR resulted in lower prediction power and higher residuals compared to GRNN. Although the nonlinear regression model estimated similar average values of N2O, it could not capture the fluctuation patterns accurately. In contrast, the GRNN model achieved fairly high predictability, with an R2 of 0.59 for model validation, 0.77 for model calibration (training), and a low root mean square error (RMSE), indicating a high capacity to simulate the dynamics of N2O flux. A sensitivity analysis of the GRNN explained the nonlinear relationships between input variables and N2O flux well. Our results suggest that the GRNN developed in this study performs better in simulating variations in N2O flux than nonlinear regression.
A computational approach to compare regression modelling strategies in prediction research.
Pajouheshnia, Romin; Pestman, Wiebe R; Teerenstra, Steven; Groenwold, Rolf H H
2016-08-25
It is often unclear which approach to fit, assess and adjust a model will yield the most accurate prediction model. We present an extension of an approach for comparing modelling strategies in linear regression to the setting of logistic regression and demonstrate its application in clinical prediction research. A framework for comparing logistic regression modelling strategies by their likelihoods was formulated using a wrapper approach. Five different strategies for modelling, including simple shrinkage methods, were compared in four empirical data sets to illustrate the concept of a priori strategy comparison. Simulations were performed in both randomly generated data and empirical data to investigate the influence of data characteristics on strategy performance. We applied the comparison framework in a case study setting. Optimal strategies were selected based on the results of a priori comparisons in a clinical data set and the performance of models built according to each strategy was assessed using the Brier score and calibration plots. The performance of modelling strategies was highly dependent on the characteristics of the development data in both linear and logistic regression settings. A priori comparisons in four empirical data sets found that no strategy consistently outperformed the others. The percentage of times that a model adjustment strategy outperformed a logistic model ranged from 3.9 to 94.9 %, depending on the strategy and data set. However, in our case study setting the a priori selection of optimal methods did not result in detectable improvement in model performance when assessed in an external data set. The performance of prediction modelling strategies is a data-dependent process and can be highly variable between data sets within the same clinical domain. A priori strategy comparison can be used to determine an optimal logistic regression modelling strategy for a given data set before selecting a final modelling approach.
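The Brier score used above to assess the final models is the mean squared difference between predicted probabilities and observed binary outcomes; lower is better, and 0 is a perfect score. A minimal sketch, with a hypothetical function name:

```python
from typing import Sequence


def brier_score(outcomes: Sequence[int], predicted_probs: Sequence[float]) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(outcomes) != len(predicted_probs) or not outcomes:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum((p - y) ** 2 for y, p in zip(outcomes, predicted_probs))
    return total / len(outcomes)
```

A model that always predicts 0.5 scores 0.25 regardless of the outcomes, which gives a convenient reference point when comparing strategies.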
ERIC Educational Resources Information Center
Koon, Sharon; Petscher, Yaacov
2015-01-01
The purpose of this report was to explicate the use of logistic regression and classification and regression tree (CART) analysis in the development of early warning systems. It was motivated by state education leaders' interest in maintaining high classification accuracy while simultaneously improving practitioner understanding of the rules by…
The cross-validated AUC for MCP-logistic regression with high-dimensional data.
Jiang, Dingfeng; Huang, Jian; Zhang, Ying
2013-10-01
We propose a cross-validated area under the receiver operating characteristic (ROC) curve (CV-AUC) criterion for tuning parameter selection for penalized methods in sparse, high-dimensional logistic regression models. We use this criterion in combination with the minimax concave penalty (MCP) method for variable selection. The CV-AUC criterion is specifically designed for optimizing the classification performance for binary outcome data. To implement the proposed approach, we derive an efficient coordinate descent algorithm to compute the MCP-logistic regression solution surface. Simulation studies are conducted to evaluate the finite sample performance of the proposed method and its comparison with the existing methods including the Akaike information criterion (AIC), Bayesian information criterion (BIC) or Extended BIC (EBIC). The model selected based on the CV-AUC criterion tends to have a larger predictive AUC and smaller classification error than those with tuning parameters selected using the AIC, BIC or EBIC. We illustrate the application of the MCP-logistic regression with the CV-AUC criterion on three microarray datasets from the studies that attempt to identify genes related to cancers. Our simulation studies and data examples demonstrate that the CV-AUC is an attractive method for tuning parameter selection for penalized methods in high-dimensional logistic regression models.
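For readers unfamiliar with the minimax concave penalty, the sketch below evaluates its standard form for a single coefficient. The parameter names `lam` (penalty level) and `gamma` (concavity parameter) and the function name are assumptions for illustration; the coordinate descent algorithm and the CV-AUC criterion from the paper are not reproduced here.

```python
def mcp_penalty(beta: float, lam: float, gamma: float) -> float:
    """Minimax concave penalty for one coefficient.

    Grows like the lasso penalty lam * |beta| near zero, then flattens
    to the constant gamma * lam**2 / 2 once |beta| exceeds gamma * lam,
    so large coefficients are not shrunk further.
    """
    if lam < 0 or gamma <= 0:
        raise ValueError("lam must be non-negative and gamma positive")
    t = abs(beta)
    if t <= gamma * lam:
        return lam * t - t * t / (2.0 * gamma)
    return 0.5 * gamma * lam * lam
```

The two branches agree at `|beta| = gamma * lam`, so the penalty is continuous. The tuning parameters `lam` and `gamma` are what a criterion such as CV-AUC would be used to select.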
Experimental and computational prediction of glass transition temperature of drugs.
Alzghoul, Ahmad; Alhalaweh, Amjad; Mahlin, Denny; Bergström, Christel A S
2014-12-22
Glass transition temperature (Tg) is an important inherent property of an amorphous solid material which is usually determined experimentally. In this study, the relation between Tg and melting temperature (Tm) was evaluated using a data set of 71 structurally diverse druglike compounds. Further, in silico models for prediction of Tg were developed based on calculated molecular descriptors and linear (multilinear regression, partial least-squares, principal component regression) and nonlinear (neural network, support vector regression) modeling techniques. The models based on Tm predicted Tg with an RMSE of 19.5 K for the test set. Among the five computational models developed herein the support vector regression gave the best result with RMSE of 18.7 K for the test set using only four chemical descriptors. Hence, two different models that predict Tg of drug-like molecules with high accuracy were developed. If Tm is available, a simple linear regression can be used to predict Tg. However, the results also suggest that support vector regression and calculated molecular descriptors can predict Tg with equal accuracy, already before compound synthesis.
NASA Astrophysics Data System (ADS)
Keat, Sim Chong; Chun, Beh Boon; San, Lim Hwee; Jafri, Mohd Zubir Mat
2015-04-01
Climate change due to carbon dioxide (CO2) emissions is one of the most complex challenges threatening our planet. The issue is a major international concern and is primarily attributed to the use of different fossil fuels. In this paper, a regression model is used to analyze the causal relationship between CO2 emissions and energy consumption in Malaysia using time series data for the period 1980-2010. The equations were developed using a regression model based on the eight major sources that contribute to CO2 emissions: non energy, Liquefied Petroleum Gas (LPG), diesel, kerosene, refinery gas, Aviation Turbine Fuel (ATF) and Aviation Gasoline (AV Gas), fuel oil and motor petrol. Part of the data (1980-2000) was used to fit the regression model and the remainder (2001-2010) was used to validate it. Comparison of the model predictions with the measured data showed a high correlation coefficient (R2=0.9544), indicating the model's accuracy and efficiency. These results are accurate and can be used in early warning of the population to comply with air quality standards.
Aulenbach, Brent T.
2013-01-01
A regression-model based approach is a commonly used, efficient method for estimating streamwater constituent load when there is a relationship between streamwater constituent concentration and continuous variables such as streamwater discharge, season and time. A subsetting experiment using a 30-year dataset of daily suspended sediment observations from the Mississippi River at Thebes, Illinois, was performed to determine optimal sampling frequency, model calibration period length, and regression model methodology, as well as to determine the effect of serial correlation of model residuals on load estimate precision. Two regression-based methods were used to estimate streamwater loads, the Adjusted Maximum Likelihood Estimator (AMLE), and the composite method, a hybrid load estimation approach. While both methods accurately and precisely estimated loads at the model’s calibration period time scale, precisions were progressively worse at shorter reporting periods, from annually to monthly. Serial correlation in model residuals caused the observed AMLE precision to be significantly worse than the model-calculated standard errors of prediction. The composite method effectively improved upon AMLE loads for shorter reporting periods, but required a sampling interval of 15 days or shorter when the serial correlations in the observed load residuals were greater than 0.15. AMLE precision was better at shorter sampling intervals and when using the shortest model calibration periods, such that the regression models better fit the temporal changes in the concentration–discharge relationship. The models with the largest errors typically had poor high flow sampling coverage resulting in unrepresentative models. Increasing sampling frequency and/or targeted high flow sampling are more efficient approaches to ensure sufficient sampling and to avoid poorly performing models than increasing calibration period length.
Statistical downscaling of precipitation using long short-term memory recurrent neural networks
NASA Astrophysics Data System (ADS)
Misra, Saptarshi; Sarkar, Sudeshna; Mitra, Pabitra
2017-11-01
Hydrological impacts of global climate change on regional scale are generally assessed by downscaling large-scale climatic variables, simulated by General Circulation Models (GCMs), to regional, small-scale hydrometeorological variables like precipitation, temperature, etc. In this study, we propose a new statistical downscaling model based on Recurrent Neural Network with Long Short-Term Memory which captures the spatio-temporal dependencies in local rainfall. The previous studies have used several other methods such as linear regression, quantile regression, kernel regression, beta regression, and artificial neural networks. Deep neural networks and recurrent neural networks have been shown to be highly promising in modeling complex and highly non-linear relationships between input and output variables in different domains and hence we investigated their performance in the task of statistical downscaling. We have tested this model on two datasets—one on precipitation in Mahanadi basin in India and the second on precipitation in Campbell River basin in Canada. Our autoencoder coupled long short-term memory recurrent neural network model performs the best compared to other existing methods on both the datasets with respect to temporal cross-correlation, mean squared error, and capturing the extremes.
Multiple Logistic Regression Analysis of Cigarette Use among High School Students
ERIC Educational Resources Information Center
Adwere-Boamah, Joseph
2011-01-01
A binary logistic regression analysis was performed to predict high school students' cigarette smoking behavior from selected predictors from 2009 CDC Youth Risk Behavior Surveillance Survey. The specific target student behavior of interest was frequent cigarette use. Five predictor variables included in the model were: a) race, b) frequency of…
Calibrated Multivariate Regression with Application to Neural Semantic Basis Discovery.
Liu, Han; Wang, Lie; Zhao, Tuo
2015-08-01
We propose a calibrated multivariate regression method named CMR for fitting high dimensional multivariate regression models. Compared with existing methods, CMR calibrates regularization for each regression task with respect to its noise level so that it simultaneously attains improved finite-sample performance and tuning insensitiveness. Theoretically, we provide sufficient conditions under which CMR achieves the optimal rate of convergence in parameter estimation. Computationally, we propose an efficient smoothed proximal gradient algorithm with a worst-case numerical rate of convergence O(1/ϵ), where ϵ is a pre-specified accuracy of the objective function value. We conduct thorough numerical simulations to illustrate that CMR consistently outperforms other high dimensional multivariate regression methods. We also apply CMR to solve a brain activity prediction problem and find that it is as competitive as a handcrafted model created by human experts. The R package camel implementing the proposed method is available on the Comprehensive R Archive Network http://cran.r-project.org/web/packages/camel/.
Multivariate Boosting for Integrative Analysis of High-Dimensional Cancer Genomic Data
Xiong, Lie; Kuan, Pei-Fen; Tian, Jianan; Keles, Sunduz; Wang, Sijian
2015-01-01
In this paper, we propose a novel multivariate component-wise boosting method for fitting multivariate response regression models under the high-dimension, low sample size setting. Our method is motivated by modeling the association among different biological molecules based on multiple types of high-dimensional genomic data. Particularly, we are interested in two applications: studying the influence of DNA copy number alterations on RNA transcript levels and investigating the association between DNA methylation and gene expression. For this purpose, we model the dependence of the RNA expression levels on DNA copy number alterations and the dependence of gene expression on DNA methylation through multivariate regression models and utilize boosting-type method to handle the high dimensionality as well as model the possible nonlinear associations. The performance of the proposed method is demonstrated through simulation studies. Finally, our multivariate boosting method is applied to two breast cancer studies. PMID:26609213
Miller, Justin B; Axelrod, Bradley N; Schutte, Christian
2012-01-01
The recent release of the Wechsler Memory Scale Fourth Edition contains many improvements from a theoretical and administration perspective, including demographic corrections using the Advanced Clinical Solutions. Although the administration time has been reduced from previous versions, a shortened version may be desirable in certain situations given practical time limitations in clinical practice. The current study evaluated two- and three-subtest estimations of demographically corrected Immediate and Delayed Memory index scores using both simple arithmetic prorating and regression models. All estimated values were significantly associated with observed index scores. Use of Lin's Concordance Correlation Coefficient as a measure of agreement showed a high degree of precision and virtually zero bias in the models, although the regression models showed a stronger association than prorated models. Regression-based models proved to be more accurate than prorated estimates with less dispersion around observed values, particularly when using three subtest regression models. Overall, the present research shows strong support for estimating demographically corrected index scores on the WMS-IV in clinical practice with an adequate performance using arithmetically prorated models and a stronger performance using regression models to predict index scores.
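Lin's concordance correlation coefficient, used above as the agreement measure, penalizes both scatter and systematic offset between estimated and observed scores, which a plain Pearson correlation does not. A minimal sketch, assuming the usual definition with population (1/n) variances; the example scores are hypothetical and not from the study.

```python
def concordance_correlation(x, y):
    """Lin's concordance correlation coefficient between two score lists.

    CCC = 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y)) ** 2),
    with variances and covariance computed using a 1/n divisor.
    """
    if len(x) != len(y) or not x:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    denominator = var_x + var_y + (mean_x - mean_y) ** 2
    if denominator == 0:
        raise ValueError("both inputs are constant and equal; CCC is undefined")
    return 2 * cov_xy / denominator


# Hypothetical scores: the estimates track the observed values perfectly
# but are shifted by one point, so Pearson r would be 1 while CCC is lower.
observed = [1.0, 2.0, 3.0]
estimated = [2.0, 3.0, 4.0]
ccc_shifted = concordance_correlation(observed, estimated)
ccc_identical = concordance_correlation(observed, observed)
```

In this toy case the constant bias alone pulls the coefficient well below 1, which is why the measure suits a comparison of prorated and regression-based estimates.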
Forest type mapping of the Interior West
Bonnie Ruefenacht; Gretchen G. Moisen; Jock A. Blackard
2004-01-01
This paper develops techniques for the mapping of forest types in Arizona, New Mexico, and Wyoming. The methods involve regression-tree modeling using a variety of remote sensing and GIS layers along with Forest Inventory Analysis (FIA) point data. Regression-tree modeling is a fast and efficient technique of estimating variables for large data sets with high accuracy...
David R. Weise; Eunmo Koo; Xiangyang Zhou; Shankar Mahalingam; Frédéric Morandini; Jacques-Henri Balbi
2016-01-01
Fire behaviour data from 240 laboratory fires in high-density live chaparral fuel beds were compared with model predictions. Logistic regression was used to develop a model to predict fire spread success in the fuel beds and linear regression was used to predict rate of spread. Predictions from the Rothermel equation and three proposed changes as well as two physically...
Levine, Matthew E; Albers, David J; Hripcsak, George
2016-01-01
Time series analysis methods have been shown to reveal clinical and biological associations in data collected in the electronic health record. We wish to develop reliable high-throughput methods for identifying adverse drug effects that are easy to implement and produce readily interpretable results. To move toward this goal, we used univariate and multivariate lagged regression models to investigate associations between twenty pairs of drug orders and laboratory measurements. Multivariate lagged regression models exhibited higher sensitivity and specificity than univariate lagged regression in the 20 examples, and incorporating autoregressive terms for labs and drugs produced more robust signals in cases of known associations among the 20 example pairings. Moreover, including inpatient admission terms in the model attenuated the signals for some cases of unlikely associations, demonstrating how multivariate lagged regression models' explicit handling of context-based variables can provide a simple way to probe for health-care processes that confound analyses of EHR data.
A Ranking Approach to Genomic Selection.
Blondel, Mathieu; Onogi, Akio; Iwata, Hiroyoshi; Ueda, Naonori
2015-01-01
Genomic selection (GS) is a recent selective breeding method which uses predictive models based on whole-genome molecular markers. Until now, existing studies formulated GS as the problem of modeling an individual's breeding value for a particular trait of interest, i.e., as a regression problem. To assess predictive accuracy of the model, the Pearson correlation between observed and predicted trait values was used. In this paper, we propose to formulate GS as the problem of ranking individuals according to their breeding value. Our proposed framework allows us to employ machine learning methods for ranking which had previously not been considered in the GS literature. To assess ranking accuracy of a model, we introduce a new measure originating from the information retrieval literature called normalized discounted cumulative gain (NDCG). NDCG rewards more strongly models which assign a high rank to individuals with high breeding value. Therefore, NDCG reflects a prerequisite objective in selective breeding: accurate selection of individuals with high breeding value. We conducted a comparison of 10 existing regression methods and 3 new ranking methods on 6 datasets, consisting of 4 plant species and 25 traits. Our experimental results suggest that tree-based ensemble methods including McRank, Random Forests and Gradient Boosting Regression Trees achieve excellent ranking accuracy. RKHS regression and RankSVM also achieve good accuracy when used with an RBF kernel. Traditional regression methods such as Bayesian lasso, wBSR and BayesC were found less suitable for ranking. Pearson correlation was found to correlate poorly with NDCG. Our study suggests two important messages. First, ranking methods are a promising research direction in GS. Second, NDCG can be a useful evaluation measure for GS.
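The NDCG measure described above can be sketched in a few lines. This is one common variant (linear gain, logarithmic position discount), assumed here for illustration; the abstract does not state which variant the authors used, and the example values are hypothetical. Gains are assumed non-negative.

```python
import math


def dcg(gains, k):
    """Discounted cumulative gain of the first k items, in ranked order."""
    return sum(g / math.log2(position + 2) for position, g in enumerate(gains[:k]))


def ndcg(true_values, predicted_scores, k):
    """NDCG@k: DCG of the predicted ranking divided by DCG of the ideal ranking."""
    if len(true_values) != len(predicted_scores):
        raise ValueError("inputs must have the same length")
    # Rank individuals by predicted score, highest first.
    order = sorted(range(len(true_values)), key=lambda i: predicted_scores[i], reverse=True)
    ranked_gains = [true_values[i] for i in order]
    ideal_gains = sorted(true_values, reverse=True)
    ideal = dcg(ideal_gains, k)
    if ideal == 0:
        return 0.0
    return dcg(ranked_gains, k) / ideal


# Hypothetical breeding values and two sets of predictions.
breeding_values = [3.0, 2.0, 1.0]
perfect = ndcg(breeding_values, [0.9, 0.5, 0.1], k=3)
reversed_order = ndcg(breeding_values, [0.1, 0.5, 0.9], k=3)
```

Because the discount shrinks with position, errors near the top of the ranking cost more than errors near the bottom, which matches the selection objective stated in the abstract.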
Ertas, Gokhan
2018-07-01
To assess the value of joint evaluation of diffusion tensor imaging (DTI) measures by using logistic regression modelling to detect high GS risk group prostate tumors. Fifty tumors imaged using DTI on a 3 T MRI device were analyzed. Regions of interest focusing on the center of tumor foci and noncancerous tissue on the maps of mean diffusivity (MD) and fractional anisotropy (FA) were used to extract the minimum, the maximum and the mean measures. Measure ratio was computed by dividing tumor measure by noncancerous tissue measure. Logistic regression models were fitted for all possible pair combinations of the measures using 5-fold cross validation. Systematic differences are present for all MD measures and also for all FA measures in distinguishing the high risk tumors [GS ≥ 7(4 + 3)] from the low risk tumors [GS ≤ 7(3 + 4)] (P < 0.05). Smaller value for MD measures and larger value for FA measures indicate the high risk. The models enrolling the measures achieve good fits and good classification performances (adjusted R2 = 0.55-0.60, AUC = 0.88-0.91); however, the models using the measure ratios perform better (adjusted R2 = 0.59-0.75, AUC = 0.88-0.95). The model that employs the ratios of minimum MD and maximum FA accomplishes the highest sensitivity, specificity and accuracy (Se = 77.8%, Sp = 96.9% and Acc = 90.0%). Joint evaluation of MD and FA diffusion tensor imaging measures is valuable to detect high GS risk group peripheral zone prostate tumors. However, use of the ratios of the measures improves the accuracy of the detections substantially. Logistic regression modelling provides a favorable solution for the joint evaluations easily adoptable in clinical practice. Copyright © 2018 Elsevier Inc. All rights reserved.
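Sensitivity, specificity and accuracy all follow from the four cells of a confusion matrix. The sketch below shows the arithmetic. The counts are hypothetical: the abstract does not report the confusion matrix, and these values were chosen only because they total 50 tumors and are consistent with the reported percentages.

```python
def classification_rates(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    sensitivity = tp / (tp + fn)   # share of high-risk tumors detected
    specificity = tn / (tn + fp)   # share of low-risk tumors correctly cleared
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Assumed counts (NOT from the paper): 18 high-risk and 32 low-risk tumors.
se, sp, acc = classification_rates(tp=14, fn=4, tn=31, fp=1)
percentages = tuple(round(100 * rate, 1) for rate in (se, sp, acc))
```

Other splits of the 50 tumors may match the reported figures as well, so the counts should be read as an arithmetic illustration rather than a reconstruction of the study's results.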
Marek K. Jakubowksi; Qinghua Guo; Brandon Collins; Scott Stephens; Maggi Kelly
2013-01-01
We compared the ability of several classification and regression algorithms to predict forest stand structure metrics and standard surface fuel models. Our study area spans a dense, topographically complex Sierra Nevada mixed-conifer forest. We used clustering, regression trees, and support vector machine algorithms to analyze high density (average 9 pulses/m
NASA Astrophysics Data System (ADS)
Gholizadeh, H.; Robeson, S. M.
2015-12-01
Empirical models have been widely used to estimate global chlorophyll content from remotely sensed data. Here, we focus on the standard NASA empirical models that use blue-green band ratios. These band ratio ocean color (OC) algorithms are in the form of fourth-order polynomials and the parameters of these polynomials (i.e. coefficients) are estimated from the NASA bio-Optical Marine Algorithm Data set (NOMAD). Most of the points in this data set have been sampled from tropical and temperate regions. However, polynomial coefficients obtained from this data set are used to estimate chlorophyll content in all ocean regions with different properties such as sea-surface temperature, salinity, and downwelling/upwelling patterns. Further, the polynomial terms in these models are highly correlated. In sum, the limitations of these empirical models are as follows: 1) the independent variables within the empirical models, in their current form, are correlated (multicollinear), and 2) current algorithms are global approaches and are based on the spatial stationarity assumption, so they are independent of location. The multicollinearity problem is resolved by using partial least squares (PLS). PLS, which transforms the data into a set of independent components, can be considered as a combined form of principal component regression (PCR) and multiple regression. Geographically weighted regression (GWR) is also used to investigate the validity of the spatial stationarity assumption. GWR solves a regression model over each sample point by using the observations within its neighbourhood. PLS results show that the empirical method underestimates chlorophyll content in high latitudes, including the Southern Ocean region, when compared to PLS (see Figure 1). Cluster analysis of GWR coefficients also shows that the spatial stationarity assumption in empirical models is not likely to be valid.
Hyper-Spectral Image Analysis With Partially Latent Regression and Spatial Markov Dependencies
NASA Astrophysics Data System (ADS)
Deleforge, Antoine; Forbes, Florence; Ba, Sileye; Horaud, Radu
2015-09-01
Hyper-spectral data can be analyzed to recover physical properties at large planetary scales. This involves resolving inverse problems which can be addressed within machine learning, with the advantage that, once a relationship between physical parameters and spectra has been established in a data-driven fashion, the learned relationship can be used to estimate physical parameters for new hyper-spectral observations. Within this framework, we propose a spatially-constrained and partially-latent regression method which maps high-dimensional inputs (hyper-spectral images) onto low-dimensional responses (physical parameters such as the local chemical composition of the soil). The proposed regression model comprises two key features. Firstly, it combines a Gaussian mixture of locally-linear mappings (GLLiM) with a partially-latent response model. While the former makes high-dimensional regression tractable, the latter makes it possible to deal with physical parameters that cannot be observed or, more generally, with data contaminated by experimental artifacts that cannot be explained with noise models. Secondly, spatial constraints are introduced in the model through a Markov random field (MRF) prior which provides a spatial structure to the Gaussian-mixture hidden variables. Experiments conducted on a database composed of remotely sensed observations of Mars collected by the Mars Express orbiter demonstrate the effectiveness of the proposed model.
SPReM: Sparse Projection Regression Model For High-dimensional Linear Regression
Sun, Qiang; Zhu, Hongtu; Liu, Yufeng; Ibrahim, Joseph G.
2014-01-01
The aim of this paper is to develop a sparse projection regression modeling (SPReM) framework to perform multivariate regression modeling with a large number of responses and a multivariate covariate of interest. We propose two novel heritability ratios to simultaneously perform dimension reduction, response selection, estimation, and testing, while explicitly accounting for correlations among multivariate responses. Our SPReM is devised to specifically address the low statistical power issue of many standard statistical approaches, such as the Hotelling’s T2 test statistic or a mass univariate analysis, for high-dimensional data. We formulate the estimation problem of SPReM as a novel sparse unit rank projection (SURP) problem and propose a fast optimization algorithm for SURP. Furthermore, we extend SURP to the sparse multi-rank projection (SMURP) by adopting a sequential SURP approximation. Theoretically, we have systematically investigated the convergence properties of SURP and the convergence rate of SURP estimates. Our simulation results and real data analysis have shown that SPReM outperforms other state-of-the-art methods. PMID:26527844
Sharma, Ashok K.; Srivastava, Gopal N.; Roy, Ankita; Sharma, Vineet K.
2017-01-01
The experimental methods for the prediction of molecular toxicity are tedious and time-consuming tasks. Thus, the computational approaches could be used to develop alternative methods for toxicity prediction. We have developed a tool for the prediction of molecular toxicity along with the aqueous solubility and permeability of any molecule/metabolite. Using a comprehensive and curated set of toxin molecules as a training set, the different chemical and structural based features such as descriptors and fingerprints were exploited for feature selection, optimization and development of machine learning based classification and regression models. The compositional differences in the distribution of atoms were apparent between toxins and non-toxins, and hence, the molecular features were used for the classification and regression. On 10-fold cross-validation, the descriptor-based, fingerprint-based and hybrid-based classification models showed similar accuracy (93%) and Matthews's correlation coefficient (0.84). The performances of all the three models were comparable (Matthews's correlation coefficient = 0.84–0.87) on the blind dataset. In addition, the regression-based models using descriptors as input features were also compared and evaluated on the blind dataset. Random forest based regression model for the prediction of solubility performed better (R2 = 0.84) than the multi-linear regression (MLR) and partial least square regression (PLSR) models, whereas, the partial least squares based regression model for the prediction of permeability (caco-2) performed better (R2 = 0.68) in comparison to the random forest and MLR based regression models. The performance of final classification and regression models was evaluated using the two validation datasets including the known toxins and commonly used constituents of health products, which attests to its accuracy. 
The ToxiM web server would be a highly useful and reliable tool for the prediction of toxicity, solubility, and permeability of small molecules. PMID:29249969
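The Matthews correlation coefficient reported above summarizes all four cells of a binary confusion matrix in a single value between -1 and 1. A minimal sketch of the calculation follows; the counts are hypothetical and are not the study's data.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when any row or column of the confusion matrix is empty,
    which is a common convention for the otherwise undefined case.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts for 100 molecules (50 toxins, 50 non-toxins).
mcc = matthews_corrcoef(tp=45, tn=48, fp=2, fn=5)
accuracy = (45 + 48) / 100
```

Unlike accuracy, the coefficient stays informative when the two classes are of very different sizes, which is one reason it is often reported next to accuracy for toxin classifiers.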
Engvall, Karin; Hult, M; Corner, R; Lampa, E; Norbäck, D; Emenius, G
2010-01-01
The aim was to develop a new model to identify residential buildings with higher frequencies of "SBS" than expected, "risk buildings". In 2005, 481 multi-family buildings with 10,506 dwellings in Stockholm were studied by a new stratified random sampling. A standardised self-administered questionnaire was used to assess "SBS", atopy and personal factors. The response rate was 73%. Statistical analysis was performed by multiple logistic regressions. Dwellers owning their building reported less "SBS" than those renting. There was a strong relationship between socio-economic factors and ownership. The regression model ended up with high explanatory values for age, gender, atopy and ownership. Applying our model, 9% of all residential buildings in Stockholm were classified as "risk buildings" with the highest proportion in houses built 1961-1975 (26%) and lowest in houses built 1985-1990 (4%). To identify "risk buildings", it is necessary to adjust for ownership and population characteristics.
Chen, Sung-Wei; Wang, Po-Chuan; Hsin, Ping-Lung; Oates, Anthony; Sun, I-Wen; Liu, Shen-Ing
2011-01-01
Microelectronic engineers are considered valuable human capital contributing significantly toward economic development, but they may encounter stressful work conditions in the context of a globalized industry. The study aims at identifying risk factors of depressive disorders primarily based on job stress models, the Demand-Control-Support and Effort-Reward Imbalance models, and at evaluating whether depressive disorders impair work performance in microelectronics engineers in Taiwan. The case-control study was conducted among 678 microelectronics engineers, 452 controls and 226 cases with depressive disorders which were defined by a score of 17 or more on the Beck Depression Inventory and a psychiatrist's diagnosis. The self-administered questionnaires included the Job Content Questionnaire, Effort-Reward Imbalance Questionnaire, demography, psychosocial factors, health behaviors and work performance. Hierarchical logistic regression was applied to identify risk factors of depressive disorders. Multivariate linear regressions were used to determine factors affecting work performance. By hierarchical logistic regression, risk factors of depressive disorders are high demands, low work social support, high effort/reward ratio and low frequency of physical exercise. Combining the two job stress models may have better predictive power for depressive disorders than adopting either model alone. Three multivariate linear regressions provide similar results indicating that depressive disorders are associated with impaired work performance in terms of absence, role limitation and social functioning limitation. The results may provide insight into the applicability of job stress models in a globalized high-tech industry considerably focused in non-Western countries, and the design of workplace preventive strategies for depressive disorders in Asian electronics engineering population.
Jovanovic, Milos; Radovanovic, Sandro; Vukicevic, Milan; Van Poucke, Sven; Delibasic, Boris
2016-09-01
Quantification and early identification of unplanned readmission risk have the potential to improve the quality of care during hospitalization and after discharge. However, high dimensionality, sparsity, and class imbalance of electronic health data and the complexity of risk quantification, challenge the development of accurate predictive models. Predictive models require a certain level of interpretability in order to be applicable in real settings and create actionable insights. This paper aims to develop accurate and interpretable predictive models for readmission in a general pediatric patient population, by integrating a data-driven model (sparse logistic regression) and domain knowledge based on the international classification of diseases 9th-revision clinical modification (ICD-9-CM) hierarchy of diseases. Additionally, we propose a way to quantify the interpretability of a model and inspect the stability of alternative solutions. The analysis was conducted on >66,000 pediatric hospital discharge records from California, State Inpatient Databases, Healthcare Cost and Utilization Project between 2009 and 2011. We incorporated domain knowledge based on the ICD-9-CM hierarchy in a data driven, Tree-Lasso regularized logistic regression model, providing the framework for model interpretation. This approach was compared with traditional Lasso logistic regression resulting in models that are easier to interpret by fewer high-level diagnoses, with comparable prediction accuracy. The results revealed that the use of a Tree-Lasso model was as competitive in terms of accuracy (measured by area under the receiver operating characteristic curve-AUC) as the traditional Lasso logistic regression, but integration with the ICD-9-CM hierarchy of diseases provided more interpretable models in terms of high-level diagnoses. Additionally, interpretations of models are in accordance with existing medical understanding of pediatric readmission. 
Best performing models have similar performances reaching AUC values 0.783 and 0.779 for traditional Lasso and Tree-Lasso, respectively. However, information loss of Lasso models is 0.35 bits higher compared to Tree-Lasso model. We propose a method for building predictive models applicable for the detection of readmission risk based on Electronic Health records. Integration of domain knowledge (in the form of ICD-9-CM taxonomy) and a data-driven, sparse predictive algorithm (Tree-Lasso Logistic Regression) resulted in an increase of interpretability of the resulting model. The models are interpreted for the readmission prediction problem in general pediatric population in California, as well as several important subpopulations, and the interpretations of models comply with existing medical understanding of pediatric readmission. Finally, quantitative assessment of the interpretability of the models is given, that is beyond simple counts of selected low-level features. Copyright © 2016 Elsevier B.V. All rights reserved.
Deep ensemble learning of sparse regression models for brain disease diagnosis.
Suk, Heung-Il; Lee, Seong-Whan; Shen, Dinggang
2017-04-01
Recent studies on brain imaging analysis witnessed the core roles of machine learning techniques in computer-assisted intervention for brain disease diagnosis. Of various machine-learning techniques, sparse regression models have proved their effectiveness in handling high-dimensional data but with a small number of training samples, especially in medical problems. In the meantime, deep learning methods have been making great successes by outperforming the state-of-the-art performances in various applications. In this paper, we propose a novel framework that combines the two conceptually different methods of sparse regression and deep learning for Alzheimer's disease/mild cognitive impairment diagnosis and prognosis. Specifically, we first train multiple sparse regression models, each of which is trained with different values of a regularization control parameter. Thus, our multiple sparse regression models potentially select different feature subsets from the original feature set; thereby they have different powers to predict the response values, i.e., clinical label and clinical scores in our work. By regarding the response values from our sparse regression models as target-level representations, we then build a deep convolutional neural network for clinical decision making, which thus we call 'Deep Ensemble Sparse Regression Network.' To our best knowledge, this is the first work that combines sparse regression models with deep neural network. In our experiments with the ADNI cohort, we validated the effectiveness of the proposed method by achieving the highest diagnostic accuracies in three classification tasks. We also rigorously analyzed our results and compared with the previous studies on the ADNI cohort in the literature. Copyright © 2017 Elsevier B.V. All rights reserved.
Deep ensemble learning of sparse regression models for brain disease diagnosis
Suk, Heung-Il; Lee, Seong-Whan; Shen, Dinggang
2018-01-01
Recent studies on brain imaging analysis witnessed the core roles of machine learning techniques in computer-assisted intervention for brain disease diagnosis. Of various machine-learning techniques, sparse regression models have proved their effectiveness in handling high-dimensional data but with a small number of training samples, especially in medical problems. In the meantime, deep learning methods have been making great successes by outperforming the state-of-the-art performances in various applications. In this paper, we propose a novel framework that combines the two conceptually different methods of sparse regression and deep learning for Alzheimer’s disease/mild cognitive impairment diagnosis and prognosis. Specifically, we first train multiple sparse regression models, each of which is trained with different values of a regularization control parameter. Thus, our multiple sparse regression models potentially select different feature subsets from the original feature set; thereby they have different powers to predict the response values, i.e., clinical label and clinical scores in our work. By regarding the response values from our sparse regression models as target-level representations, we then build a deep convolutional neural network for clinical decision making, which thus we call ‘ Deep Ensemble Sparse Regression Network.’ To our best knowledge, this is the first work that combines sparse regression models with deep neural network. In our experiments with the ADNI cohort, we validated the effectiveness of the proposed method by achieving the highest diagnostic accuracies in three classification tasks. We also rigorously analyzed our results and compared with the previous studies on the ADNI cohort in the literature. PMID:28167394
Jang, Dae -Heung; Anderson-Cook, Christine Michaela
2016-11-22
With many predictors in regression, fitting the full model can induce multicollinearity problems. The Least Absolute Shrinkage and Selection Operator (LASSO) is useful when the effects of many explanatory variables are sparse in a high-dimensional dataset. Influential points can have a disproportionate impact on the estimated values of model parameters. This paper describes a new influence plot that can be used to increase understanding of the contributions of individual observations and the robustness of results. This can serve as a complement to other regression diagnostics techniques in the LASSO regression setting. Using this influence plot, we can find influential points and their impact on shrinkage of model parameters and model selection. Lastly, we provide two examples to illustrate the methods.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jang, Dae -Heung; Anderson-Cook, Christine Michaela
With many predictors in regression, fitting the full model can induce multicollinearity problems. The Least Absolute Shrinkage and Selection Operator (LASSO) is useful when the effects of many explanatory variables are sparse in a high-dimensional dataset. Influential points can have a disproportionate impact on the estimated values of model parameters. This paper describes a new influence plot that can be used to increase understanding of the contributions of individual observations and the robustness of results. This can serve as a complement to other regression diagnostics techniques in the LASSO regression setting. Using this influence plot, we can find influential points and their impact on shrinkage of model parameters and model selection. Lastly, we provide two examples to illustrate the methods.
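The shrinkage and selection behaviour mentioned in this abstract can be illustrated with a minimal sketch of the lasso fitted by cyclic coordinate descent. This is not the influence plot the paper describes; the function names, the objective scaling (1/(2n)) and the toy data are assumptions made for illustration only.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    X is a list of rows and y a list of responses. No intercept is fitted,
    so the columns of X and y are assumed to be centred (an assumption of
    this sketch), and every column must contain a non-zero value.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # correlation of feature j with the partial residual
            norm = 0.0  # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                norm += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (norm / n)
    return beta


# Toy data (hypothetical): y depends only on the first column.
X_toy = [[-1.0, 0.5], [0.0, -1.0], [1.0, 0.5]]
y_toy = [-2.0, 0.0, 2.0]
beta_toy = lasso_coordinate_descent(X_toy, y_toy, lam=0.1)
```

In this toy fit the coefficient of the irrelevant second column is set exactly to zero while the first is shrunk below its least-squares value of 2, which is the selection-plus-shrinkage effect that a single influential observation can alter.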
Censored quantile regression with recursive partitioning-based weights
Wey, Andrew; Wang, Lan; Rudser, Kyle
2014-01-01
Censored quantile regression provides a useful alternative to the Cox proportional hazards model for analyzing survival data. It directly models the conditional quantile of the survival time and hence is easy to interpret. Moreover, it relaxes the proportionality constraint on the hazard function associated with the popular Cox model and is natural for modeling heterogeneity of the data. Recently, Wang and Wang (2009. Locally weighted censored quantile regression. Journal of the American Statistical Association 103, 1117–1128) proposed a locally weighted censored quantile regression approach that allows for covariate-dependent censoring and is less restrictive than other censored quantile regression methods. However, their kernel smoothing-based weighting scheme requires all covariates to be continuous and encounters practical difficulty with even a moderate number of covariates. We propose a new weighting approach that uses recursive partitioning, e.g. survival trees, that offers greater flexibility in handling covariate-dependent censoring in moderately high dimensions and can incorporate both continuous and discrete covariates. We prove that this new weighting scheme leads to consistent estimation of the quantile regression coefficients and demonstrate its effectiveness via Monte Carlo simulations. We also illustrate the new method using a widely recognized data set from a clinical trial on primary biliary cirrhosis. PMID:23975800
Ridge regression for predicting elastic moduli and hardness of calcium aluminosilicate glasses
NASA Astrophysics Data System (ADS)
Deng, Yifan; Zeng, Huidan; Jiang, Yejia; Chen, Guorong; Chen, Jianding; Sun, Luyi
2018-03-01
It is of great significance to design glasses with satisfactory mechanical properties predictively through modeling. Among various modeling methods, data-driven modeling is such a reliable approach that can dramatically shorten research duration, cut research cost and accelerate the development of glass materials. In this work, the ridge regression (RR) analysis was used to construct regression models for predicting the compositional dependence of CaO-Al2O3-SiO2 glass elastic moduli (Shear, Bulk, and Young’s moduli) and hardness based on the ternary diagram of the compositions. The property prediction over a large glass composition space was accomplished with known experimental data of various compositions in the literature, and the simulated results are in good agreement with the measured ones. This regression model can serve as a facile and effective tool for studying the relationship between the compositions and the property, enabling high-efficient design of glasses to meet the requirements for specific elasticity and hardness.
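For reference, the standard ridge estimator can be written as follows. The abstract does not give the authors' exact formulation, so the symbols here are assumptions: y is the vector of measured property values, X the matrix of composition-derived predictors, and λ the penalty strength.

```latex
\hat{\beta}_{\mathrm{ridge}}
  = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2
  = \left( X^{\top} X + \lambda I \right)^{-1} X^{\top} y ,
  \qquad \lambda > 0 .
```

The added term λI keeps the matrix invertible even when composition variables are strongly correlated, which is typically the case for components of a ternary system that sum to a constant.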
NASA Astrophysics Data System (ADS)
Tan, C. H.; Matjafri, M. Z.; Lim, H. S.
2015-10-01
This paper presents prediction models that analyze and compute CO2 emissions in Malaysia. Each prediction model is analyzed for three main groups: transportation; electricity and heat production; and residential buildings together with commercial and public services. The prediction models were generated using data obtained from World Bank Open Data. The best subset method is used to remove irrelevant data, followed by multiple linear regression to produce the prediction models. The results show a high R-square (prediction) value, which implies that the models are reliable for predicting CO2 emissions from the specified data. In addition, the CO2 emissions from these three groups are forecast using trend analysis plots for observation purposes.
Bianca N.I. Eskelson; Hailemariam Temesgen; Tara M. Barrett
2009-01-01
Cavity tree and snag abundance data are highly variable and contain many zero observations. We predict cavity tree and snag abundance from variables that are readily available from forest cover maps or remotely sensed data using negative binomial (NB), zero-inflated NB, and zero-altered NB (ZANB) regression models as well as nearest neighbor (NN) imputation methods....
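A minimal sketch of the zero-inflated negative binomial probability mass function may help make the model class concrete. The mean/dispersion parameterisation and all names are assumptions of this sketch, not taken from the paper.

```python
import math


def nb_pmf(k, mu, size):
    """Negative binomial pmf with mean mu and dispersion parameter size."""
    log_p = (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mu))
        + k * math.log(mu / (size + mu))
    )
    return math.exp(log_p)


def zinb_pmf(k, mu, size, pi_zero):
    """Zero-inflated NB: a structural zero with probability pi_zero, else an NB draw."""
    base = (1.0 - pi_zero) * nb_pmf(k, mu, size)
    return pi_zero + base if k == 0 else base
```

The only difference from the plain NB is the extra mass at zero, which is what lets the model accommodate plots with no cavity trees or snags beyond what the count distribution alone would predict.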
Incremental online learning in high dimensions.
Vijayakumar, Sethu; D'Souza, Aaron; Schaal, Stefan
2005-12-01
Locally weighted projection regression (LWPR) is a new algorithm for incremental nonlinear function approximation in high-dimensional spaces with redundant and irrelevant input dimensions. At its core, it employs nonparametric regression with locally linear models. In order to stay computationally efficient and numerically robust, each local model performs the regression analysis with a small number of univariate regressions in selected directions in input space in the spirit of partial least squares regression. We discuss when and how local learning techniques can successfully work in high-dimensional spaces and review the various techniques for local dimensionality reduction before finally deriving the LWPR algorithm. The properties of LWPR are that it (1) learns rapidly with second-order learning methods based on incremental training, (2) uses statistically sound stochastic leave-one-out cross validation for learning without the need to memorize training data, (3) adjusts its weighting kernels based on only local information in order to minimize the danger of negative interference of incremental learning, (4) has a computational complexity that is linear in the number of inputs, and (5) can deal with a large number of-possibly redundant-inputs, as shown in various empirical evaluations with up to 90 dimensional data sets. For a probabilistic interpretation, predictive variance and confidence intervals are derived. To our knowledge, LWPR is the first truly incremental spatially localized learning method that can successfully and efficiently operate in very high-dimensional spaces.
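The idea of nonparametric regression with locally linear models can be sketched in one dimension as below. This is plain batch locally weighted linear regression with a Gaussian kernel, not LWPR itself: the incremental updates, projection directions and kernel adaptation described in the abstract are omitted, and the function name and bandwidth parameter are assumptions.

```python
import math


def locally_weighted_predict(xs, ys, x_query, bandwidth):
    """Fit a weighted straight line around x_query and evaluate it there."""
    # Gaussian kernel weights: nearby points dominate the local fit.
    w = [math.exp(-0.5 * ((x - x_query) / bandwidth) ** 2) for x in xs]
    sw = sum(w)
    x_bar = sum(wi * x for wi, x in zip(w, xs)) / sw
    y_bar = sum(wi * y for wi, y in zip(w, ys)) / sw
    # Weighted least-squares slope around the weighted means.
    sxx = sum(wi * (x - x_bar) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - x_bar) * (y - y_bar) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx
    return y_bar + slope * (x_query - x_bar)
```

A small bandwidth tracks curvature closely but uses fewer effective points; a large one approaches an ordinary global linear fit.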
Prediction models for clustered data: comparison of a random intercept and standard regression model
2013-01-01
Background When study data are clustered, standard regression analysis is considered inappropriate and analytical techniques for clustered data need to be used. For prediction research in which the interest of predictor effects is on the patient level, random effect regression models are probably preferred over standard regression analysis. It is well known that the random effect parameter estimates and the standard logistic regression parameter estimates are different. Here, we compared random effect and standard logistic regression models for their ability to provide accurate predictions. Methods Using an empirical study on 1642 surgical patients at risk of postoperative nausea and vomiting, who were treated by one of 19 anesthesiologists (clusters), we developed prognostic models either with standard or random intercept logistic regression. External validity of these models was assessed in new patients from other anesthesiologists. We supported our results with simulation studies using intra-class correlation coefficients (ICC) of 5%, 15%, or 30%. Standard performance measures and measures adapted for the clustered data structure were estimated. Results The model developed with random effect analysis showed better discrimination than the standard approach, if the cluster effects were used for risk prediction (standard c-index of 0.69 versus 0.66). In the external validation set, both models showed similar discrimination (standard c-index 0.68 versus 0.67). The simulation study confirmed these results. For datasets with a high ICC (≥15%), model calibration was only adequate in external subjects, if the used performance measure assumed the same data structure as the model development method: standard calibration measures showed good calibration for the standard developed model, calibration measures adapting the clustered data structure showed good calibration for the prediction model with random intercept. 
Conclusion The models with random intercept discriminate better than the standard model only if the cluster effect is used for predictions. The prediction model with random intercept had good calibration within clusters. PMID:23414436
Bouwmeester, Walter; Twisk, Jos W R; Kappen, Teus H; van Klei, Wilton A; Moons, Karel G M; Vergouwe, Yvonne
2013-02-15
When study data are clustered, standard regression analysis is considered inappropriate and analytical techniques for clustered data need to be used. For prediction research in which the interest of predictor effects is on the patient level, random effect regression models are probably preferred over standard regression analysis. It is well known that the random effect parameter estimates and the standard logistic regression parameter estimates are different. Here, we compared random effect and standard logistic regression models for their ability to provide accurate predictions. Using an empirical study on 1642 surgical patients at risk of postoperative nausea and vomiting, who were treated by one of 19 anesthesiologists (clusters), we developed prognostic models either with standard or random intercept logistic regression. External validity of these models was assessed in new patients from other anesthesiologists. We supported our results with simulation studies using intra-class correlation coefficients (ICC) of 5%, 15%, or 30%. Standard performance measures and measures adapted for the clustered data structure were estimated. The model developed with random effect analysis showed better discrimination than the standard approach, if the cluster effects were used for risk prediction (standard c-index of 0.69 versus 0.66). In the external validation set, both models showed similar discrimination (standard c-index 0.68 versus 0.67). The simulation study confirmed these results. For datasets with a high ICC (≥15%), model calibration was only adequate in external subjects, if the used performance measure assumed the same data structure as the model development method: standard calibration measures showed good calibration for the standard developed model, calibration measures adapting the clustered data structure showed good calibration for the prediction model with random intercept. 
The models with random intercept discriminate better than the standard model only if the cluster effect is used for predictions. The prediction model with random intercept had good calibration within clusters.
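The standard c-index used to compare discrimination can be computed directly from predicted risks and binary outcomes. The sketch below is the usual pairwise definition; the function name and the convention of counting tied risks as half-concordant are assumptions, and the cluster-adapted measures mentioned in the abstract are not covered.

```python
def c_index(risks, outcomes):
    """Fraction of (event, non-event) pairs in which the event has the higher risk.

    Ties in predicted risk count as half-concordant.
    """
    events = [r for r, o in zip(risks, outcomes) if o == 1]
    non_events = [r for r, o in zip(risks, outcomes) if o == 0]
    concordant = 0.0
    for r_event in events:
        for r_non in non_events:
            if r_event > r_non:
                concordant += 1.0
            elif r_event == r_non:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))
```

A value of 0.5 corresponds to no discrimination and 1.0 to perfect ranking, which gives a scale for reading the reported 0.66 to 0.69.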
Bian, Xihui; Li, Shujuan; Lin, Ligang; Tan, Xiaoyao; Fan, Qingjie; Li, Ming
2016-06-21
Accurate model prediction is fundamental to the successful analysis of complex samples. To utilize the abundant information embedded in the frequency and time domains, a novel regression model is presented for quantitative analysis of hydrocarbon contents in fuel oil samples. The proposed method, named high and low frequency unfolded PLSR (HLUPLSR), integrates empirical mode decomposition (EMD) and an unfolding strategy with partial least squares regression (PLSR). In the proposed method, the original signals are first decomposed into a finite number of intrinsic mode functions (IMFs) and a residue by EMD. Second, the former high frequency IMFs are summed as a high frequency matrix, and the latter IMFs and the residue are summed as a low frequency matrix. Finally, the two matrices are unfolded to an extended matrix in the variable dimension, and the PLSR model is then built between the extended matrix and the target values. Coupled with ultraviolet (UV) spectroscopy, HLUPLSR has been applied to determine the hydrocarbon contents of light gas oil and diesel fuel samples. Compared with single PLSR and other signal processing techniques, the proposed method shows superior prediction ability and better model interpretation. Therefore, the HLUPLSR method provides a promising tool for quantitative analysis of complex samples. Copyright © 2016 Elsevier B.V. All rights reserved.
Nishii, Takashi; Genkawa, Takuma; Watari, Masahiro; Ozaki, Yukihiro
2012-01-01
A new selection procedure of an informative near-infrared (NIR) region for regression model building is proposed that uses an online NIR/mid-infrared (mid-IR) dual-region spectrometer in conjunction with two-dimensional (2D) NIR/mid-IR heterospectral correlation spectroscopy. In this procedure, both NIR and mid-IR spectra of a liquid sample are acquired sequentially during a reaction process using the NIR/mid-IR dual-region spectrometer; the 2D NIR/mid-IR heterospectral correlation spectrum is subsequently calculated from the obtained spectral data set. From the calculated 2D spectrum, a NIR region is selected that includes bands of high positive correlation intensity with mid-IR bands assigned to the analyte, and used for the construction of a regression model. To evaluate the performance of this procedure, a partial least-squares (PLS) regression model of the ethanol concentration in a fermentation process was constructed. During fermentation, NIR/mid-IR spectra in the 10000-1200 cm⁻¹ region were acquired every 3 min, and a 2D NIR/mid-IR heterospectral correlation spectrum was calculated to investigate the correlation intensity between the NIR and mid-IR bands. NIR regions that include bands at 4343, 4416, 5778, 5904, and 5955 cm⁻¹, which result from the combinations and overtones of the C-H group of ethanol, were selected for use in the PLS regression models, by taking the correlation intensity of a mid-IR band at 2985 cm⁻¹ arising from the CH3 asymmetric stretching vibration mode of ethanol as a reference. The predicted results indicate that the ethanol concentrations calculated from the PLS regression models fit well to those obtained by high-performance liquid chromatography. Thus, it can be concluded that the selection procedure using the NIR/mid-IR dual-region spectrometer combined with 2D NIR/mid-IR heterospectral correlation spectroscopy is a powerful method for the construction of a reliable regression model.
Improved model of the retardance in citric acid coated ferrofluids using stepwise regression
NASA Astrophysics Data System (ADS)
Lin, J. F.; Qiu, X. R.
2017-06-01
Citric acid (CA) coated Fe3O4 ferrofluids (FFs) have been investigated for biomedical applications. The magneto-optical retardance of CA coated FFs was measured by a Stokes polarimeter. Optimization and multiple regression of retardance in FFs were previously carried out using the Taguchi method and Microsoft Excel, and the F value of the regression model was large enough. However, the model produced in Excel was not systematic. Instead, we adopted stepwise regression to model the retardance of CA coated FFs. From the results of stepwise regression in MATLAB, the developed model had high predictive ability, with an F value of 2.55897e+7 and a correlation coefficient of one. The average absolute error of predicted retardances relative to measured retardances was just 0.0044%. Using the genetic algorithm (GA) in MATLAB, the optimized parametric combination was determined as [4.709 0.12 39.998 70.006], corresponding to the pH of the suspension, the molar ratio of CA to Fe3O4, the CA volume, and the coating temperature. The maximum retardance was found to be 31.712°, close to that obtained by the evolutionary solver in Excel, with a relative error of -0.013%. Above all, the stepwise regression method was successfully used to model the retardance of CA coated FFs, and the maximum global retardance was determined by the use of GA.
Multi-Target Regression via Robust Low-Rank Learning.
Zhen, Xiantong; Yu, Mengyang; He, Xiaofei; Li, Shuo
2018-02-01
Multi-target regression has recently regained great popularity due to its capability of simultaneously learning multiple relevant regression tasks and its wide applications in data mining, computer vision and medical image analysis, while great challenges arise from jointly handling inter-target correlations and input-output relationships. In this paper, we propose Multi-layer Multi-target Regression (MMR) which enables simultaneously modeling intrinsic inter-target correlations and nonlinear input-output relationships in a general framework via robust low-rank learning. Specifically, the MMR can explicitly encode inter-target correlations in a structure matrix by matrix elastic nets (MEN); the MMR can work in conjunction with the kernel trick to effectively disentangle highly complex nonlinear input-output relationships; the MMR can be efficiently solved by a new alternating optimization algorithm with guaranteed convergence. The MMR leverages the strength of kernel methods for nonlinear feature learning and the structural advantage of multi-layer learning architectures for inter-target correlation modeling. More importantly, it offers a new multi-layer learning paradigm for multi-target regression which is endowed with high generality, flexibility and expressive ability. Extensive experimental evaluation on 18 diverse real-world datasets demonstrates that our MMR can achieve consistently high performance and outperforms representative state-of-the-art algorithms, which shows its great effectiveness and generality for multivariate prediction.
Mohd Yusof, Mohd Yusmiaidil Putera; Cauwels, Rita; Deschepper, Ellen; Martens, Luc
2015-08-01
Third molar development (TMD) has been widely utilized as one of the radiographic methods for dental age estimation. By using the same radiograph of the same individual, third molar eruption (TME) information can be incorporated into the TMD regression model. This study aims to evaluate the performance of dental age estimation in the individual method models and the combined model (TMD and TME) based on the classic regressions of multiple linear and principal component analysis. A sample of 705 digital panoramic radiographs of Malay sub-adults aged between 14.1 and 23.8 years was collected. The techniques described by Gleiser and Hunt (modified by Kohler) and Olze were employed to stage the TMD and TME, respectively. The data were divided to develop three respective models based on the two regressions of multiple linear and principal component analysis. The trained models were then validated on the test sample, and the accuracy of age prediction was compared between the models. The coefficient of determination (R²) and root mean square error (RMSE) were calculated. In both genders, adjusted R² increased in the linear regressions of the combined model compared with the individual models. An overall decrease in RMSE was detected in the combined model compared with TMD (0.03-0.06) and TME (0.2-0.8). In principal component regression, the combined model exhibited a low adjusted R² and a high RMSE, except in males. Dental age is better predicted using the combined model in multiple linear regression models. Copyright © 2015 Elsevier Ltd and Faculty of Forensic and Legal Medicine. All rights reserved.
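The three comparison metrics named in this abstract follow standard definitions, sketched below. The function names are assumptions, and any ages passed in are hypothetical values rather than data from the study.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n, p):
    """Penalise R² for the number of predictors p given n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
```

Adjusted R² is the relevant quantity when a combined model adds predictors, since plain R² cannot decrease when a predictor is added to a least-squares fit on the same data.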
Factor complexity of crash occurrence: An empirical demonstration using boosted regression trees.
Chung, Yi-Shih
2013-12-01
Factor complexity is a characteristic of traffic crashes. This paper proposes a novel method, namely boosted regression trees (BRT), to investigate the complex and nonlinear relationships in high-variance traffic crash data. The Taiwanese 2004-2005 single-vehicle motorcycle crash data are used to demonstrate the utility of BRT. Traditional logistic regression and classification and regression tree (CART) models are also used to compare their estimation results and external validities. Both the in-sample cross-validation and out-of-sample validation results show that an increase in tree complexity provides improved, although diminishing, classification performance, indicating a limited factor complexity of single-vehicle motorcycle crashes. The effects of crucial variables including geographical, time, and sociodemographic factors explain some fatal crashes. Relatively unique fatal crashes are better approximated by interactive terms, especially combinations of behavioral factors. BRT models generally provide better transferability than conventional logistic regression and CART models. This study also discusses the implications of the results for devising safety policies. Copyright © 2012 Elsevier Ltd. All rights reserved.
Shi, Yuan; Lau, Kevin Ka-Lun; Ng, Edward
2017-08-01
Urban air quality serves as an important function of the quality of urban life. Land use regression (LUR) modelling of air quality is essential for conducting health impacts assessment but more challenging in mountainous high-density urban scenario due to the complexities of the urban environment. In this study, a total of 21 LUR models are developed for seven kinds of air pollutants (gaseous air pollutants CO, NO2, NOx, O3, SO2 and particulate air pollutants PM2.5, PM10) with reference to three different time periods (summertime, wintertime and annual average of 5-year long-term hourly monitoring data from local air quality monitoring network) in Hong Kong. Under the mountainous high-density urban scenario, we improved the traditional LUR modelling method by incorporating wind availability information into LUR modelling based on surface geomorphometrical analysis. As a result, 269 independent variables were examined to develop the LUR models by using the "ADDRESS" independent variable selection method and stepwise multiple linear regression (MLR). Cross validation has been performed for each resultant model. The results show that wind-related variables are included in most of the resultant models as statistically significant independent variables. Compared with the traditional method, a maximum increase of 20% was achieved in the prediction performance of annual averaged NO2 concentration level by incorporating wind-related variables into LUR model development. Copyright © 2017 Elsevier Inc. All rights reserved.
Fang, Xingang; Bagui, Sikha; Bagui, Subhash
2017-08-01
The readily available high throughput screening (HTS) data from the PubChem database provides an opportunity for mining of small molecules in a variety of biological systems using machine learning techniques. From the thousands of available molecular descriptors developed to encode useful chemical information representing the characteristics of molecules, descriptor selection is an essential step in building an optimal quantitative structural-activity relationship (QSAR) model. For the development of a systematic descriptor selection strategy, we need the understanding of the relationship between: (i) the descriptor selection; (ii) the choice of the machine learning model; and (iii) the characteristics of the target bio-molecule. In this work, we employed the Signature descriptor to generate a dataset on the Human kallikrein 5 (hK 5) inhibition confirmatory assay data and compared multiple classification models including logistic regression, support vector machine, random forest and k-nearest neighbor. Under optimal conditions, the logistic regression model provided extremely high overall accuracy (98%) and precision (90%), with good sensitivity (65%) in the cross validation test. In testing the primary HTS screening data with more than 200K molecular structures, the logistic regression model exhibited the capability of eliminating more than 99.9% of the inactive structures. As part of our exploration of the descriptor-model-target relationship, the excellent predictive performance of the combination of the Signature descriptor and the logistic regression model on the assay data of the Human kallikrein 5 (hK 5) target suggested a feasible descriptor/model selection strategy on similar targets. Copyright © 2017 Elsevier Ltd. All rights reserved.
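The gap between the reported accuracy and sensitivity is easier to read with the definitions in hand. The sketch below computes the three metrics from binary labels (1 = active, 0 = inactive); the function name and label coding are assumptions, and it presumes at least one predicted and one actual active.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision and sensitivity (recall) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp),    # share of predicted actives that are active
        "sensitivity": tp / (tp + fn),  # share of true actives that are found
    }
```

When inactive structures dominate, as in screening data, accuracy is driven mostly by true negatives, so a high accuracy can coexist with a much lower sensitivity.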
Collinearity and Causal Diagrams: A Lesson on the Importance of Model Specification.
Schisterman, Enrique F; Perkins, Neil J; Mumford, Sunni L; Ahrens, Katherine A; Mitchell, Emily M
2017-01-01
Correlated data are ubiquitous in epidemiologic research, particularly in nutritional and environmental epidemiology where mixtures of factors are often studied. Our objectives are to demonstrate how highly correlated data arise in epidemiologic research and provide guidance, using a directed acyclic graph approach, on how to proceed analytically when faced with highly correlated data. We identified three fundamental structural scenarios in which high correlation between a given variable and the exposure can arise: intermediates, confounders, and colliders. For each of these scenarios, we evaluated the consequences of increasing correlation between the given variable and the exposure on the bias and variance for the total effect of the exposure on the outcome using unadjusted and adjusted models. We derived closed-form solutions for continuous outcomes using linear regression and empirically present our findings for binary outcomes using logistic regression. For models properly specified, total effect estimates remained unbiased even when there was almost perfect correlation between the exposure and a given intermediate, confounder, or collider. In general, as the correlation increased, the variance of the parameter estimate for the exposure in the adjusted models increased, while in the unadjusted models, the variance increased to a lesser extent or decreased. Our findings highlight the importance of considering the causal framework under study when specifying regression models. Strategies that do not take into consideration the causal structure may lead to biased effect estimation for the original question of interest, even under high correlation.
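The variance behaviour in the adjusted linear model can be illustrated with the textbook two-predictor result. This is a standard identity rather than the paper's closed-form solutions, and the symbols are assumptions: x is the exposure, z the adjustment variable, r their sample correlation, and σ² the residual variance of the adjusted model.

```latex
\operatorname{Var}\!\left(\hat{\beta}_x\right)
  = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
    \cdot \frac{1}{1 - r^2}
```

Holding σ² fixed, the factor 1/(1 - r²) grows without bound as |r| approaches 1, so the adjusted estimate becomes less precise although the formula introduces no bias. How σ² itself changes with adjustment depends on the causal role of z, which is one reason the unadjusted variance can move in either direction.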
A self-trained classification technique for producing 30 m percent-water maps from Landsat data
Rover, Jennifer R.; Wylie, Bruce K.; Ji, Lei
2010-01-01
Small bodies of water can be mapped with moderate-resolution satellite data using methods where water is mapped as subpixel fractions using field measurements or high-resolution images as training datasets. A new method, developed from a regression-tree technique, uses a 30 m Landsat image for training the regression tree that, in turn, is applied to the same image to map subpixel water. The self-trained method was evaluated by comparing the percent-water map with three other maps generated from established percent-water mapping methods: (1) a regression-tree model trained with a 5 m SPOT 5 image, (2) a regression-tree model based on endmembers and (3) a linear unmixing classification technique. The results suggest that subpixel water fractions can be accurately estimated when high-resolution satellite data or intensively interpreted training datasets are not available, which increases our ability to map small water bodies or small changes in lake size at a regional scale.
Statistical downscaling modeling with quantile regression using lasso to estimate extreme rainfall
NASA Astrophysics Data System (ADS)
Santri, Dewi; Wigena, Aji Hamim; Djuraidah, Anik
2016-02-01
Rainfall is one of the most variable climatic elements, and extreme rainfall in particular has many negative impacts. Methods are therefore required to minimize the damage that may occur. So far, global circulation models (GCMs) are the best method for forecasting global climate change, including extreme rainfall. Statistical downscaling (SD) is a technique for developing the relationship between GCM output, as global-scale independent variables, and rainfall, as a local-scale response variable. GCM output is difficult to assess against observations because it is high-dimensional and its variables are multicollinear. The common methods used to handle this problem are principal components analysis (PCA) and partial least squares regression. A newer method that can be used is the lasso, which has the advantage of simultaneously controlling the variance of the fitted coefficients and performing automatic variable selection. Quantile regression is a method that can be used to detect extreme rainfall at both the dry and wet extremes. The objective of this study is to model SD using quantile regression with the lasso to predict extreme rainfall in Indramayu. The results showed that extreme rainfall (extreme wet in January, February and December) in Indramayu could be predicted properly by the model at the 90th quantile.
Improving power and robustness for detecting genetic association with extreme-value sampling design.
Chen, Hua Yun; Li, Mingyao
2011-12-01
Extreme-value sampling design that samples subjects with extremely large or small quantitative trait values is commonly used in genetic association studies. Samples in such designs are often treated as "cases" and "controls" and analyzed using logistic regression. Such a case-control analysis ignores the potential dose-response relationship between the quantitative trait and the underlying trait locus and thus may lead to loss of power in detecting genetic association. An alternative approach to analyzing such data is to model the dose-response relationship by a linear regression model. However, parameter estimation from this model can be biased, which may lead to inflated type I errors. We propose a robust and efficient approach that takes into consideration both the biased sampling design and the potential dose-response relationship. Extensive simulations demonstrate that the proposed method is more powerful than the traditional logistic regression analysis and is more robust than the linear regression analysis. We applied our method to the analysis of a candidate gene association study on high-density lipoprotein cholesterol (HDL-C) which includes study subjects with extremely high or low HDL-C levels. Using our method, we identified several SNPs showing stronger evidence of association with HDL-C than the traditional case-control logistic regression analysis. Our results suggest that it is important to appropriately model the quantitative traits and to adjust for the biased sampling when a dose-response relationship exists in extreme-value sampling designs. © 2011 Wiley Periodicals, Inc.
Wang, Shuang; Jiang, Xiaoqian; Wu, Yuan; Cui, Lijuan; Cheng, Samuel; Ohno-Machado, Lucila
2013-01-01
We developed an EXpectation Propagation LOgistic REgRession (EXPLORER) model for distributed privacy-preserving online learning. The proposed framework provides a high level guarantee for protecting sensitive information, since the information exchanged between the server and the client is the encrypted posterior distribution of coefficients. Through experimental results, EXPLORER shows the same performance (e.g., discrimination, calibration, feature selection etc.) as the traditional frequentist Logistic Regression model, but provides more flexibility in model updating. That is, EXPLORER can be updated one point at a time rather than having to retrain the entire data set when new observations are recorded. The proposed EXPLORER supports asynchronized communication, which relieves the participants from coordinating with one another, and prevents service breakdown from the absence of participants or interrupted communications. PMID:23562651
Continuous monitoring of sediment and nutrients in the Illinois River at Florence, Illinois, 2012-13
Terrio, Paul J.; Straub, Timothy D.; Domanski, Marian M.; Siudyla, Nicholas A.
2015-01-01
The Illinois River is the largest river in Illinois and is the primary contributing watershed for nitrogen, phosphorus, and suspended-sediment loading to the upper Mississippi River from Illinois. In addition to streamflow, the following water-quality constituents were monitored at the Illinois River at Florence, Illinois (U.S. Geological Survey station number 05586300), during May 2012–October 2013: phosphate, nitrate, turbidity, temperature, specific conductance, pH, and dissolved oxygen. The objectives of this monitoring were to (1) determine performance capabilities of the in-situ instruments; (2) collect continuous data that would provide an improved understanding of constituent characteristics during normal, low-, and high-flow periods and during different climatic and land-use seasons; (3) evaluate the ability to use continuous turbidity as a surrogate constituent to determine suspended-sediment concentrations; and (4) evaluate the ability to develop a regression model for total phosphorus using phosphate, turbidity, and other measured parameters. Reliable data collection was achieved, following some initial periods of instrument and data-communication difficulties. The resulting regression models for suspended sediment had coefficient of determination (R2) values of about 0.9. Nitrate plus nitrite loads computed using continuous data were found to be approximately 8 percent larger than loads computed using traditional discrete-sampling based models. A regression model for total phosphorus was developed by using historic orthophosphate data (important during periods of low flow and low concentrations) and historic suspended-sediment data (important during periods of high flow and higher concentrations). The R2 of the total phosphorus regression model using orthophosphorus and suspended sediment was 0.8. Data collection and refinement of the regression models is ongoing.
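The surrogate idea above (predicting suspended-sediment concentration from continuous turbidity and judging the fit by R2) can be sketched with a one-predictor least-squares fit. This is an illustration under stated assumptions, not the model form used in the report: the helper names `fit_line` and `r_squared` are hypothetical, the fit is untransformed, and the sample values are invented for demonstration.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(x, y, slope, intercept):
    """Coefficient of determination of the fitted line on (x, y)."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Invented paired samples: turbidity readings and lab-measured
    # suspended-sediment concentrations. Not data from the study.
    turbidity = [10.0, 20.0, 40.0, 80.0, 160.0]
    sediment = [18.0, 35.0, 70.0, 150.0, 290.0]
    slope, intercept = fit_line(turbidity, sediment)
    print(f"slope={slope:.3f} intercept={intercept:.3f} "
          f"R2={r_squared(turbidity, sediment, slope, intercept):.4f}")
```

Once such a relation is calibrated on paired samples, the continuous turbidity record can be passed through it to give a continuous sediment estimate.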
Li, Michael Jonathan; Distefano, Anthony; Mouttapa, Michele; Gill, Jasmeet K
2014-02-01
The present study aimed to determine whether the experience of bias-motivated bullying was associated with behaviors known to increase the risk of HIV infection among young men who have sex with men (YMSM) aged 18-29, and to assess whether the psychosocial problems moderated this relationship. Using an Internet-based direct marketing approach in sampling, we recruited 545 YMSM residing in the USA to complete an online questionnaire. Multiple linear regression analyses tested three regression models where we controlled for sociodemographics. The first model indicated that bullying during high school was associated with unprotected receptive anal intercourse within the past 12 months, while the second model indicated that bullying after high school was associated with engaging in anal intercourse while under the influence of drugs or alcohol in the past 12 months. In the final regression model, our composite measure of HIV risk behavior was found to be associated with lifetime verbal harassment. None of the psychosocial problems measured in this study - depression, low self-esteem, and internalized homonegativity - moderated any of the associations between bias-motivated bullying victimization and HIV risk behaviors in our regression models. Still, these findings provide novel evidence that bullying prevention programs in schools and communities should be included in comprehensive approaches to HIV prevention among YMSM.
Risk factors for displaced abomasum or ketosis in Swedish dairy herds.
Stengärde, L; Hultgren, J; Tråvén, M; Holtenius, K; Emanuelson, U
2012-03-01
Risk factors associated with high or low long-term incidence of displaced abomasum (DA) or clinical ketosis were studied in 60 Swedish dairy herds, using multivariable logistic regression modelling. Forty high-incidence herds were included as cases and 20 low-incidence herds as controls. Incidence rates were calculated based on veterinary records of clinical diagnoses. During the 3-year period preceding the herd classification, herds with a high incidence had a disease incidence of DA or clinical ketosis above the 3rd quartile in a national database for disease recordings. Control herds had no cows with DA or clinical ketosis. All herds were visited during the housing period and herdsmen were interviewed about management routines, housing, feeding, milk yield, and herd health. Target groups were heifers in late gestation, dry cows, and cows in early lactation. Univariable logistic regression was used to screen for factors associated with being a high-incidence herd. A multivariable logistic regression model was built using stepwise regression. A higher maximum daily milk yield in multiparous cows and a large herd size (p=0.054 and p=0.066, respectively) tended to be associated with being a high-incidence herd. Not cleaning the heifer feeding platform daily increased the odds of having a high-incidence herd twelvefold (p<0.01). Keeping cows in only one group in the dry period increased the odds of having a high incidence herd eightfold (p=0.03). Herd size was confounded with housing system. Housing system was therefore added to the final logistic regression model. In conclusion, a large herd size, a high maximum daily milk yield, keeping dry cows in one group, and not cleaning the feeding platform daily appear to be important risk factors for a high incidence of DA or clinical ketosis in Swedish dairy herds. 
These results confirm the importance of housing, management and feeding in the prevention of metabolic disorders in dairy cows around parturition and in early lactation. Copyright © 2011 Elsevier B.V. All rights reserved.
Washington, Simon; Haque, Md Mazharul; Oh, Jutaek; Lee, Dongmin
2014-05-01
Hot spot identification (HSID) aims to identify potential sites (roadway segments, intersections, crosswalks, interchanges, ramps, etc.) with disproportionately high crash risk relative to similar sites. An inefficient HSID methodology might result in either identifying a safe site as high risk (false positive) or a high risk site as safe (false negative), and consequently lead to the misuse of the available public funds, to poor investment decisions, and to inefficient risk management practice. Current HSID methods suffer from issues like underreporting of minor injury and property damage only (PDO) crashes, challenges of accounting for crash severity in the methodology, and selection of a proper safety performance function to model crash data that is often heavily skewed by a preponderance of zeros. Addressing these challenges, this paper proposes a combination of a PDO equivalency calculation and quantile regression technique to identify hot spots in a transportation network. In particular, issues related to underreporting and crash severity are tackled by incorporating equivalent PDO crashes, whilst the concerns related to the non-count nature of equivalent PDO crashes and the skewness of crash data are addressed by the non-parametric quantile regression technique. The proposed method identifies covariate effects on various quantiles of a population, rather than the population mean like most methods in practice, which more closely corresponds with how black spots are identified in practice. The proposed methodology is illustrated using rural road segment data from Korea and compared against the traditional EB method with negative binomial regression.
Application of a quantile regression model on equivalent PDO crashes enables identification of a set of high-risk sites that reflect the true safety costs to the society, simultaneously reduces the influence of under-reported PDO and minor injury crashes, and overcomes the limitation of traditional NB model in dealing with preponderance of zeros problem or right skewed dataset. Copyright © 2014 Elsevier Ltd. All rights reserved.
Boosting structured additive quantile regression for longitudinal childhood obesity data.
Fenske, Nora; Fahrmeir, Ludwig; Hothorn, Torsten; Rzehak, Peter; Höhle, Michael
2013-07-25
Childhood obesity and the investigation of its risk factors has become an important public health issue. Our work is based on and motivated by a German longitudinal study including 2,226 children with up to ten measurements on their body mass index (BMI) and risk factors from birth to the age of 10 years. We introduce boosting of structured additive quantile regression as a novel distribution-free approach for longitudinal quantile regression. The quantile-specific predictors of our model include conventional linear population effects, smooth nonlinear functional effects, varying-coefficient terms, and individual-specific effects, such as intercepts and slopes. Estimation is based on boosting, a computer intensive inference method for highly complex models. We propose a component-wise functional gradient descent boosting algorithm that allows for penalized estimation of the large variety of different effects, particularly leading to individual-specific effects shrunken toward zero. This concept allows us to flexibly estimate the nonlinear age curves of upper quantiles of the BMI distribution, both on population and on individual-specific level, adjusted for further risk factors and to detect age-varying effects of categorical risk factors. Our model approach can be regarded as the quantile regression analog of Gaussian additive mixed models (or structured additive mean regression models), and we compare both model classes with respect to our obesity data.
NASA Astrophysics Data System (ADS)
Eyarkai Nambi, Vijayaram; Thangavel, Kuladaisamy; Manickavasagan, Annamalai; Shahir, Sultan
2017-01-01
Prediction of ripeness level in climacteric fruits is essential for post-harvest handling. An index capable of predicting ripening level with minimum inputs would be highly beneficial to handlers, processors and researchers in the fruit industry. A study was conducted with Indian mango cultivars to develop a ripeness index and an associated model. Changes in physicochemical, colour and textural properties were measured throughout the ripening period, and the period was classified into five stages (unripe, early ripe, partially ripe, ripe and over ripe). Multivariate regression techniques (partial least squares regression, principal component regression and multiple linear regression) were compared and evaluated for their predictive ability. A multiple linear regression model with 12 parameters was found to be the most suitable for ripening prediction. A scientific variable-reduction method was adopted to simplify the developed model. Better prediction was achieved with either 2 or 3 variables (total soluble solids, colour and acidity). Cross-validation was performed to increase robustness, and the proposed ripening index was found to be more effective in predicting ripening stages. The three-variable model would be suitable for commercial applications where reasonable accuracy is sufficient. However, the 12-variable model can be used to obtain more precise results in research and development applications.
Giménez-Espert, María Del Carmen; Prado-Gascó, Vicente Javier
2018-03-01
To analyse link between empathy and emotional intelligence as a predictor of nurses' attitudes towards communication while comparing the contribution of emotional aspects and attitudinal elements on potential behaviour. Nurses' attitudes towards communication, empathy and emotional intelligence are key skills for nurses involved in patient care. There are currently no studies analysing this link, and its investigation is needed because attitudes may influence communication behaviours. Correlational study. To attain this goal, self-reported instruments (attitudes towards communication of nurses, trait emotional intelligence (Trait Emotional Meta-Mood Scale) and Jefferson Scale of Nursing Empathy (Jefferson Scale Nursing Empathy) were collected from 460 nurses between September 2015-February 2016. Two different analytical methodologies were used: traditional regression models and fuzzy-set qualitative comparative analysis models. The results of the regression model suggest that cognitive dimensions of attitude are a significant and positive predictor of the behavioural dimension. The perspective-taking dimension of empathy and the emotional-clarity dimension of emotional intelligence were significant positive predictors of the dimensions of attitudes towards communication, except for the affective dimension (for which the association was negative). The results of the fuzzy-set qualitative comparative analysis models confirm that the combination of high levels of cognitive dimension of attitudes, perspective-taking and emotional clarity explained high levels of the behavioural dimension of attitude. Empathy and emotional intelligence are predictors of nurses' attitudes towards communication, and the cognitive dimension of attitude is a good predictor of the behavioural dimension of attitudes towards communication of nurses in both regression models and fuzzy-set qualitative comparative analysis. 
In general, the fuzzy-set qualitative comparative analysis models appear to be better predictors than the regression models are. To evaluate current practices, establish intervention strategies and evaluate their effectiveness. The evaluation of these variables and their relationships are important in creating a satisfied and sustainable workforce and improving quality of care and patient health. © 2018 John Wiley & Sons Ltd.
Hughes, James P.; Haley, Danielle F.; Frew, Paula M.; Golin, Carol E.; Adimora, Adaora A; Kuo, Irene; Justman, Jessica; Soto-Torres, Lydia; Wang, Jing; Hodder, Sally
2015-01-01
Purpose Reductions in risk behaviors are common following enrollment in HIV prevention studies. We develop methods to quantify the proportion of change in risk behaviors that can be attributed to regression to the mean versus study participation and other factors. Methods A novel model that incorporates both regression to the mean and study participation effects is developed for binary measures. The model is used to estimate the proportion of change in the prevalence of “unprotected sex in the past 6 months” that can be attributed to study participation versus regression to the mean in a longitudinal cohort of women at risk for HIV infection who were recruited from ten US communities with high rates of HIV and poverty. HIV risk behaviors were evaluated using audio computer-assisted self-interviews at baseline and every 6 months for up to 12 months. Results The prevalence of “unprotected sex in the past 6 months” declined from 96% at baseline to 77% at 12 months. However, this change could be almost completely explained by regression to the mean. Conclusions Analyses that examine changes over time in cohorts selected for high or low risk behaviors should account for regression to the mean effects. PMID:25883065
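The regression-to-the-mean effect described above can be reproduced with a small calculation. This is a simplified sketch and not the authors' model: it assumes each participant has a fixed personal probability of reporting the behavior in any 6-month window, that reports in different windows are independent given that probability, and that only people who report the behavior at baseline are enrolled. The function name and the example propensities are hypothetical.

```python
def followup_prevalence_among_selected(propensities, weights):
    """Expected follow-up prevalence among people enrolled because they
    reported the behavior at baseline, with no intervention effect at all.

    propensities: each subgroup's probability of reporting the behavior.
    weights: each subgroup's share of the source population.
    """
    # Share of the population that reports the behavior at baseline.
    selected = sum(w * p for p, w in zip(propensities, weights))
    # Share that reports it at baseline and again at follow-up.
    both = sum(w * p * p for p, w in zip(propensities, weights))
    return both / selected


if __name__ == "__main__":
    # Invented population: half report the behavior with probability 0.5,
    # half with probability 0.9. Not estimates from the cohort.
    prevalence = followup_prevalence_among_selected([0.5, 0.9], [0.5, 0.5])
    print(f"baseline prevalence among enrolled: 1.000")
    print(f"expected follow-up prevalence:      {prevalence:.3f}")
```

In this toy setting the prevalence among enrolled participants falls well below 100% at follow-up even though nobody's underlying behavior changed, which is why selection on a high baseline value has to be separated from any study participation effect.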
[New method of mixed gas infrared spectrum analysis based on SVM].
Bai, Peng; Xie, Wen-Jun; Liu, Jun-Hua
2007-07-01
A new method of infrared spectrum analysis for mixture gas, based on the support vector machine (SVM), was proposed. The kernel function in SVM was used to map the seriously overlapping absorption spectrum into high-dimensional space; after this transformation, the high-dimensional data could be processed in the original space, so a regression calibration model was established, and the regression calibration model with SVM was then applied to analyze the concentration of each component gas. It was also shown that the regression calibration model with SVM could be used for component recognition of mixture gas. The method was applied to the analysis of different data samples. Factors that affect the model, such as scan interval, wavelength range, kernel function and penalty coefficient C, were discussed. Experimental results show that the maximal mean absolute error (Mean AE) of the component concentration is 0.132%, and the component recognition accuracy is higher than 94%. The problems of overlapping absorption spectra, of using the same method for qualitative and quantitative analysis, and of a limited number of training samples were solved. The method could be used in other mixture gas infrared spectrum analyses and has promising theoretical and practical value.
Shi, K-Q; Zhou, Y-Y; Yan, H-D; Li, H; Wu, F-L; Xie, Y-Y; Braddock, M; Lin, X-Y; Zheng, M-H
2017-02-01
At present, there is no ideal model for predicting the short-term outcome of patients with acute-on-chronic hepatitis B liver failure (ACHBLF). This study aimed to establish and validate a prognostic model by using the classification and regression tree (CART) analysis. A total of 1047 patients from two separate medical centres with suspected ACHBLF were screened in the study, which were recognized as derivation cohort and validation cohort, respectively. CART analysis was applied to predict the 3-month mortality of patients with ACHBLF. The accuracy of the CART model was tested using the area under the receiver operating characteristic curve, which was compared with the model for end-stage liver disease (MELD) score and a new logistic regression model. CART analysis identified four variables as prognostic factors of ACHBLF: total bilirubin, age, serum sodium and INR, and three distinct risk groups: low risk (4.2%), intermediate risk (30.2%-53.2%) and high risk (81.4%-96.9%). The new logistic regression model was constructed with four independent factors, including age, total bilirubin, serum sodium and prothrombin activity by multivariate logistic regression analysis. The performances of the CART model (0.896), similar to the logistic regression model (0.914, P=.382), exceeded that of MELD score (0.667, P<.001). The results were confirmed in the validation cohort. We have developed and validated a novel CART model superior to MELD for predicting three-month mortality of patients with ACHBLF. Thus, the CART model could facilitate medical decision-making and provide clinicians with a validated practical bedside tool for ACHBLF risk stratification. © 2016 John Wiley & Sons Ltd.
Alexeeff, Stacey E.; Schwartz, Joel; Kloog, Itai; Chudnovsky, Alexandra; Koutrakis, Petros; Coull, Brent A.
2016-01-01
Many epidemiological studies use predicted air pollution exposures as surrogates for true air pollution levels. These predicted exposures contain exposure measurement error, yet simulation studies have typically found negligible bias in resulting health effect estimates. However, previous studies typically assumed a statistical spatial model for air pollution exposure, which may be oversimplified. We address this shortcoming by assuming a realistic, complex exposure surface derived from fine-scale (1km x 1km) remote-sensing satellite data. Using simulation, we evaluate the accuracy of epidemiological health effect estimates in linear and logistic regression when using spatial air pollution predictions from kriging and land use regression models. We examined chronic (long-term) and acute (short-term) exposure to air pollution. Results varied substantially across different scenarios. Exposure models with low out-of-sample R2 yielded severe biases in the health effect estimates of some models, ranging from 60% upward bias to 70% downward bias. One land use regression exposure model with greater than 0.9 out-of-sample R2 yielded upward biases up to 13% for acute health effect estimates. Almost all models drastically underestimated the standard errors. Land use regression models performed better in chronic effects simulations. These results can help researchers when interpreting health effect estimates in these types of studies. PMID:24896768
Choi, Seung Hoan; Labadorf, Adam T; Myers, Richard H; Lunetta, Kathryn L; Dupuis, Josée; DeStefano, Anita L
2017-02-06
Next generation sequencing provides a count of RNA molecules in the form of short reads, yielding discrete, often highly non-normally distributed gene expression measurements. Although Negative Binomial (NB) regression has been generally accepted in the analysis of RNA sequencing (RNA-Seq) data, its appropriateness has not been exhaustively evaluated. We explore logistic regression as an alternative method for RNA-Seq studies designed to compare cases and controls, where disease status is modeled as a function of RNA-Seq reads using simulated and Huntington disease data. We evaluate the effect of adjusting for covariates that have an unknown relationship with gene expression. Finally, we incorporate the data adaptive method in order to compare false positive rates. When the sample size is small or the expression levels of a gene are highly dispersed, the NB regression shows inflated Type-I error rates but the Classical logistic and Bayes logistic (BL) regressions are conservative. Firth's logistic (FL) regression performs well or is slightly conservative. Large sample size and low dispersion generally make Type-I error rates of all methods close to nominal alpha levels of 0.05 and 0.01. However, Type-I error rates are controlled after applying the data adaptive method. The NB, BL, and FL regressions gain increased power with large sample size, large log2 fold-change, and low dispersion. The FL regression has comparable power to NB regression. We conclude that implementing the data adaptive method appropriately controls Type-I error rates in RNA-Seq analysis. Firth's logistic regression provides a concise statistical inference process and reduces spurious associations from inaccurately estimated dispersion parameters in the negative binomial framework.
Pellerin, Brian A.; Bergamaschi, Brian A.; Gilliom, Robert J.; Crawford, Charles G.; Saraceno, John F.; Frederick, C. Paul; Downing, Bryan D.; Murphy, Jennifer C.
2014-01-01
Accurately quantifying nitrate (NO3–) loading from the Mississippi River is important for predicting summer hypoxia in the Gulf of Mexico and targeting nutrient reduction within the basin. Loads have historically been modeled with regression-based techniques, but recent advances with high frequency NO3– sensors allowed us to evaluate model performance relative to measured loads in the lower Mississippi River. Patterns in NO3– concentrations and loads were observed at daily to annual time steps, with considerable variability in concentration-discharge relationships over the two year study. Differences were particularly accentuated during the 2012 drought and 2013 flood, which resulted in anomalously high NO3– concentrations consistent with a large flush of stored NO3– from soil. The comparison between measured loads and modeled loads (LOADEST, Composite Method, WRTDS) showed underestimates of only 3.5% across the entire study period, but much larger differences at shorter time steps. Absolute differences in loads were typically greatest in the spring and early summer critical to Gulf hypoxia formation, with the largest differences (underestimates) for all models during the flood period of 2013. In addition to improving the accuracy and precision of monthly loads, high frequency NO3– measurements offer additional benefits not available with regression-based or other load estimation techniques.
Quantile regression for the statistical analysis of immunological data with many non-detects.
Eilers, Paul H C; Röder, Esther; Savelkoul, Huub F J; van Wijk, Roy Gerth
2012-07-07
Immunological parameters are hard to measure. A well-known problem is the occurrence of values below the detection limit, the non-detects. Non-detects are a nuisance, because classical statistical analyses, like ANOVA and regression, cannot be applied. The more advanced statistical techniques currently available for the analysis of datasets with non-detects can only be used if a small percentage of the data are non-detects. Quantile regression, a generalization of percentiles to regression models, models the median or higher percentiles and tolerates very high numbers of non-detects. We present a non-technical introduction and illustrate it with an implementation to real data from a clinical trial. We show that by using quantile regression, groups can be compared and that meaningful linear trends can be computed, even if more than half of the data consists of non-detects. Quantile regression is a valuable addition to the statistical methods that can be used for the analysis of immunological datasets with non-detects.
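Why upper quantiles tolerate non-detects can be seen from the check (pinball) loss that quantile regression minimizes. The sketch below covers only the intercept-only case, where the minimizer is a sample quantile, and is not the authors' implementation: the function names and the measurement values are hypothetical, and non-detects are replaced by an arbitrary fill value below the detection limit.

```python
def check_loss(residual, tau):
    """Quantile (pinball) loss for one residual at quantile level tau."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def sample_quantile(values, tau):
    """Observed value that minimizes the total check loss at level tau."""
    return min(sorted(values),
               key=lambda q: sum(check_loss(v - q, tau) for v in values))


def substitute(n_nondetects, detected, fill):
    """Data set in which every non-detect is replaced by `fill`."""
    return [fill] * n_nondetects + list(detected)


if __name__ == "__main__":
    # Invented example: 5 of 8 measurements fall below a detection limit.
    detected = [2.0, 3.1, 4.8]
    with_zero = substitute(5, detected, 0.0)
    with_half = substitute(5, detected, 0.5)
    # The 70th percentile lies among the detected values in both versions.
    print(sample_quantile(with_zero, 0.7), sample_quantile(with_half, 0.7))
    # The median lies among the non-detects, so it follows the fill value.
    print(sample_quantile(with_zero, 0.5), sample_quantile(with_half, 0.5))
```

The fitted quantile is unaffected by the fill value as long as the chosen quantile level sits above the fraction of non-detects, which is the property that lets groups be compared when more than half of the data are below the detection limit.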
NASA Astrophysics Data System (ADS)
Fei, Cheng-Wei; Bai, Guang-Chen
2014-12-01
To improve the computational precision and efficiency of probabilistic design for mechanical dynamic assemblies such as the blade-tip radial running clearance (BTRRC) of a gas turbine, a distribution collaborative probabilistic design method based on support vector machine regression (SR), called DCSRM, is proposed by integrating the distribution collaborative response surface method with a support vector machine regression model. The mathematical model of DCSRM is established and the probabilistic design idea of DCSRM is introduced. The dynamic assembly probabilistic design of an aeroengine high-pressure turbine (HPT) BTRRC is carried out to verify the proposed DCSRM. The analysis yields the optimal static blade-tip clearance of the HPT for designing the BTRRC and improving the performance and reliability of the aeroengine. The comparison of methods shows that the DCSRM has high computational accuracy and high computational efficiency in BTRRC probabilistic analysis. The present research offers an effective way to carry out the reliability design of mechanical dynamic assemblies and enriches mechanical reliability theory and methods.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dean, Jamie A., E-mail: jamie.dean@icr.ac.uk; Wong, Kee H.; Gay, Hiram
Purpose: Current normal tissue complication probability modeling using logistic regression suffers from bias and high uncertainty in the presence of highly correlated radiation therapy (RT) dose data. This hinders robust estimates of dose-response associations and, hence, optimal normal tissue–sparing strategies from being elucidated. Using functional data analysis (FDA) to reduce the dimensionality of the dose data could overcome this limitation. Methods and Materials: FDA was applied to modeling of severe acute mucositis and dysphagia resulting from head and neck RT. Functional partial least squares regression (FPLS) and functional principal component analysis were used for dimensionality reduction of the dose-volume histogram data. The reduced dose data were input into functional logistic regression models (functional partial least squares–logistic regression [FPLS-LR] and functional principal component–logistic regression [FPC-LR]) along with clinical data. This approach was compared with penalized logistic regression (PLR) in terms of predictive performance and the significance of treatment covariate–response associations, assessed using bootstrapping. Results: The area under the receiver operating characteristic curve for the PLR, FPC-LR, and FPLS-LR models was 0.65, 0.69, and 0.67, respectively, for mucositis (internal validation) and 0.81, 0.83, and 0.83, respectively, for dysphagia (external validation). The calibration slopes/intercepts for the PLR, FPC-LR, and FPLS-LR models were 1.6/−0.67, 0.45/0.47, and 0.40/0.49, respectively, for mucositis (internal validation) and 2.5/−0.96, 0.79/−0.04, and 0.79/0.00, respectively, for dysphagia (external validation). The bootstrapped odds ratios indicated significant associations between RT dose and severe toxicity in the mucositis and dysphagia FDA models. Cisplatin was significantly associated with severe dysphagia in the FDA models.
None of the covariates was significantly associated with severe toxicity in the PLR models. Dose levels greater than approximately 1.0 Gy/fraction were most strongly associated with severe acute mucositis and dysphagia in the FDA models. Conclusions: FPLS and functional principal component analysis marginally improved predictive performance compared with PLR and provided robust dose-response associations. FDA is recommended for use in normal tissue complication probability modeling.
Dean, Jamie A; Wong, Kee H; Gay, Hiram; Welsh, Liam C; Jones, Ann-Britt; Schick, Ulrike; Oh, Jung Hun; Apte, Aditya; Newbold, Kate L; Bhide, Shreerang A; Harrington, Kevin J; Deasy, Joseph O; Nutting, Christopher M; Gulliford, Sarah L
2016-11-15
Current normal tissue complication probability modeling using logistic regression suffers from bias and high uncertainty in the presence of highly correlated radiation therapy (RT) dose data. This hinders robust estimates of dose-response associations and, hence, optimal normal tissue-sparing strategies from being elucidated. Using functional data analysis (FDA) to reduce the dimensionality of the dose data could overcome this limitation. FDA was applied to modeling of severe acute mucositis and dysphagia resulting from head and neck RT. Functional partial least squares regression (FPLS) and functional principal component analysis were used for dimensionality reduction of the dose-volume histogram data. The reduced dose data were input into functional logistic regression models (functional partial least squares-logistic regression [FPLS-LR] and functional principal component-logistic regression [FPC-LR]) along with clinical data. This approach was compared with penalized logistic regression (PLR) in terms of predictive performance and the significance of treatment covariate-response associations, assessed using bootstrapping. The area under the receiver operating characteristic curve for the PLR, FPC-LR, and FPLS-LR models was 0.65, 0.69, and 0.67, respectively, for mucositis (internal validation) and 0.81, 0.83, and 0.83, respectively, for dysphagia (external validation). The calibration slopes/intercepts for the PLR, FPC-LR, and FPLS-LR models were 1.6/-0.67, 0.45/0.47, and 0.40/0.49, respectively, for mucositis (internal validation) and 2.5/-0.96, 0.79/-0.04, and 0.79/0.00, respectively, for dysphagia (external validation). The bootstrapped odds ratios indicated significant associations between RT dose and severe toxicity in the mucositis and dysphagia FDA models. Cisplatin was significantly associated with severe dysphagia in the FDA models. None of the covariates was significantly associated with severe toxicity in the PLR models. 
Dose levels greater than approximately 1.0 Gy/fraction were most strongly associated with severe acute mucositis and dysphagia in the FDA models. FPLS and functional principal component analysis marginally improved predictive performance compared with PLR and provided robust dose-response associations. FDA is recommended for use in normal tissue complication probability modeling. Copyright © 2016 The Author(s). Published by Elsevier Inc. All rights reserved.
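The area under the receiver operating characteristic curve reported in this abstract can be computed directly from predicted scores and binary outcomes. The following is a minimal sketch using the rank-based (Mann-Whitney) formulation; the function name `roc_auc` and the toy scores and labels are hypothetical and are not data from the study.

```python
def roc_auc(scores, labels):
    """Rank-based AUC: chance that a random positive case outscores a random negative.

    A tie between a positive and a negative case counts as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted toxicity probabilities and observed outcomes (illustrative only).
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auc = roc_auc(toy_scores, toy_labels)
```

The pairwise loop is quadratic in the number of cases, which is acceptable for a small cohort; a sort-based version would be preferable for large data sets.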
DOE Office of Scientific and Technical Information (OSTI.GOV)
Seong W. Lee
During this reporting period, the literature survey including the gasifier temperature measurement literature, the ultrasonic application and its background study in cleaning application, and spray coating process are completed. The gasifier simulator (cold model) testing has been successfully conducted. Four factors (blower voltage, ultrasonic application, injection time intervals, particle weight) were considered as significant factors that affect the temperature measurement. The Analysis of Variance (ANOVA) was applied to analyze the test data. The analysis shows that all four factors are significant to the temperature measurements in the gasifier simulator (cold model). The regression analysis for the case with the normalized room temperature shows that a linear model fits the temperature data with 82% accuracy (18% error). The regression analysis for the case without the normalized room temperature shows 72.5% accuracy (27.5% error). The nonlinear regression analysis indicates a better fit than that of the linear regression. The nonlinear regression model's accuracy is 88.7% (11.3% error) for the normalized room temperature case, which is better than the linear regression analysis. The hot model thermocouple sleeve design and fabrication are completed. The gasifier simulator (hot model) design and the fabrication are completed. The system tests of the gasifier simulator (hot model) have been conducted and some modifications have been made. Based on the system tests and results analysis, the gasifier simulator (hot model) has met the proposed design requirement and is ready for system test. The ultrasonic cleaning method is under evaluation and will be further studied for the gasifier simulator (hot model) application. The progress of this project has been on schedule.
Wang, Shuang; Jiang, Xiaoqian; Wu, Yuan; Cui, Lijuan; Cheng, Samuel; Ohno-Machado, Lucila
2013-06-01
We developed an EXpectation Propagation LOgistic REgRession (EXPLORER) model for distributed privacy-preserving online learning. The proposed framework provides a high level guarantee for protecting sensitive information, since the information exchanged between the server and the client is the encrypted posterior distribution of coefficients. Through experimental results, EXPLORER shows the same performance (e.g., discrimination, calibration, feature selection, etc.) as the traditional frequentist logistic regression model, but provides more flexibility in model updating. That is, EXPLORER can be updated one point at a time rather than having to retrain the entire data set when new observations are recorded. The proposed EXPLORER supports asynchronized communication, which relieves the participants from coordinating with one another, and prevents service breakdown from the absence of participants or interrupted communications. Copyright © 2013 Elsevier Inc. All rights reserved.
NASA Astrophysics Data System (ADS)
Rooper, Christopher N.; Zimmermann, Mark; Prescott, Megan M.
2017-08-01
Deep-sea coral and sponge ecosystems are widespread throughout most of Alaska's marine waters, and are associated with many different species of fishes and invertebrates. These ecosystems are vulnerable to the effects of commercial fishing activities and climate change. We compared four commonly used species distribution models (general linear models, generalized additive models, boosted regression trees and random forest models) and an ensemble model to predict the presence or absence and abundance of six groups of benthic invertebrate taxa in the Gulf of Alaska. All four model types performed adequately on training data for predicting presence and absence, with regression forest models having the best overall performance measured by the area under the receiver-operating-curve (AUC). The models also performed well on the test data for presence and absence with average AUCs ranging from 0.66 to 0.82. For the test data, ensemble models performed the best. For abundance data, there was an obvious demarcation in performance between the two regression-based methods (general linear models and generalized additive models), and the tree-based models. The boosted regression tree and random forest models out-performed the other models by a wide margin on both the training and testing data. However, there was a significant drop-off in performance for all models of invertebrate abundance (~50%) when moving from the training data to the testing data. Ensemble model performance was between the tree-based and regression-based methods. The maps of predictions from the models for both presence and abundance agreed very well across model types, with an increase in variability in predictions for the abundance data.
We conclude that where data conforms well to the modeled distribution (such as the presence-absence data and binomial distribution in this study), the four types of models will provide similar results, although the regression-type models may be more consistent with biological theory. For data with highly zero-inflated distributions and non-normal distributions such as the abundance data from this study, the tree-based methods performed better. Ensemble models that averaged predictions across the four model types, performed better than the GLM or GAM models but slightly poorer than the tree-based methods, suggesting ensemble models might be more robust to overfitting than tree methods, while mitigating some of the disadvantages in predictive performance of regression methods.
Data error and highly parameterized groundwater models
Hill, M.C.
2008-01-01
Strengths and weaknesses of highly parameterized models, in which the number of parameters exceeds the number of observations, are demonstrated using a synthetic test case. Results suggest that the approach can yield close matches to observations but also serious errors in system representation. It is proposed that avoiding the difficulties of highly parameterized models requires close evaluation of: (1) model fit, (2) performance of the regression, and (3) estimated parameter distributions. Comparisons to hydrogeologic information are expected to be critical to obtaining credible models. Copyright © 2008 IAHS Press.
Zhu, K; Lou, Z; Zhou, J; Ballester, N; Kong, N; Parikh, P
2015-01-01
This article is part of the Focus Theme of Methods of Information in Medicine on "Big Data and Analytics in Healthcare". Hospital readmissions raise healthcare costs and cause significant distress to providers and patients. It is, therefore, of great interest to healthcare organizations to predict which patients are at risk of being readmitted to their hospitals. However, current logistic regression based risk prediction models have limited prediction power when applied to hospital administrative data. Meanwhile, although decision trees and random forests have been applied, they tend to be too complex for hospital practitioners to understand. We explore the use of conditional logistic regression to increase the prediction accuracy. We analyzed an HCUP statewide inpatient discharge record dataset, which includes patient demographics, clinical and care utilization data from California. We extracted records of heart failure Medicare beneficiaries who had inpatient experience during an 11-month period. We corrected the data imbalance issue with under-sampling. In our study, we first applied standard logistic regression and decision tree to obtain influential variables and derive practically meaningful decision rules. We then stratified the original data set accordingly and applied logistic regression on each data stratum. We further explored the effect of interacting variables in the logistic regression modeling. We conducted cross validation to assess the overall prediction performance of conditional logistic regression (CLR) and compared it with standard classification models. The developed CLR models outperformed several standard classification models (e.g., straightforward logistic regression, stepwise logistic regression, random forest, support vector machine). For example, the best CLR model improved the classification accuracy by nearly 20% over the straightforward logistic regression model.
Furthermore, the developed CLR models tend to achieve better sensitivity of more than 10% over the standard classification models, which can be translated to correct labeling of an additional 400-500 readmissions for heart failure patients in the state of California over a year. Lastly, several key predictors identified from the HCUP data include the disposition location from discharge, the number of chronic conditions, and the number of acute procedures. It would be beneficial to apply simple decision rules obtained from the decision tree in an ad-hoc manner to guide the cohort stratification. It could be potentially beneficial to explore the effect of pairwise interactions between influential predictors when building the logistic regression models for different data strata. Judicious use of the ad-hoc CLR models developed offers insights into future development of prediction models for hospital readmissions, which can lead to better intuition in identifying high-risk patients and developing effective post-discharge care strategies. Lastly, this paper is expected to raise the awareness of collecting data on additional markers and developing necessary database infrastructure for larger-scale exploratory studies on readmission risk prediction.
Global-scale high-resolution (~1 km) modelling of mean, maximum and minimum annual streamflow
NASA Astrophysics Data System (ADS)
Barbarossa, Valerio; Huijbregts, Mark; Hendriks, Jan; Beusen, Arthur; Clavreul, Julie; King, Henry; Schipper, Aafke
2017-04-01
Quantifying mean, maximum and minimum annual flow (AF) of rivers at ungauged sites is essential for a number of applications, including assessments of global water supply, ecosystem integrity and water footprints. AF metrics can be quantified with spatially explicit process-based models, which might be overly time-consuming and data-intensive for this purpose, or with empirical regression models that predict AF metrics based on climate and catchment characteristics. Yet, so far, regression models have mostly been developed at a regional scale and the extent to which they can be extrapolated to other regions is not known. We developed global-scale regression models that quantify mean, maximum and minimum AF as a function of catchment area and catchment-averaged slope, elevation, and mean, maximum and minimum annual precipitation and air temperature. We then used these models to obtain global 30 arc-seconds (~1 km) maps of mean, maximum and minimum AF for each year from 1960 through 2015, based on a newly developed hydrologically conditioned digital elevation model. We calibrated our regression models based on observations of discharge and catchment characteristics from about 4,000 catchments worldwide, ranging from 10^0 to 10^6 km² in size, and validated them against independent measurements as well as the output of a number of process-based global hydrological models (GHMs). The variance explained by our regression models ranged up to 90% and the performance of the models compared well with the performance of existing GHMs. Yet, our AF maps provide a level of spatial detail that cannot yet be achieved by current GHMs.
NASA Astrophysics Data System (ADS)
Li, Weixuan; Lin, Guang; Li, Bing
2016-09-01
Many uncertainty quantification (UQ) approaches suffer from the curse of dimensionality, that is, their computational costs become intractable for problems involving a large number of uncertainty parameters. In these situations, the classic Monte Carlo (MC) method often remains the method of choice because its convergence rate O(n^(-1/2)), where n is the required number of model simulations, does not depend on the dimension of the problem. However, many high-dimensional UQ problems are intrinsically low-dimensional, because the variation of the quantity of interest (QoI) is often caused by only a few latent parameters varying within a low-dimensional subspace, known as the sufficient dimension reduction (SDR) subspace in the statistics literature. Motivated by this observation, we propose two inverse regression-based UQ algorithms (IRUQ) for high-dimensional problems. Both algorithms use inverse regression to convert the original high-dimensional problem to a low-dimensional one, which is then efficiently solved by building a response surface for the reduced model, for example via the polynomial chaos expansion. The first algorithm, which is for the situations where an exact SDR subspace exists, is proved to converge at rate O(n^(-1)), hence much faster than MC. The second algorithm, which doesn't require an exact SDR, employs the reduced model as a control variate to reduce the error of the MC estimate. The accuracy gain could still be significant, depending on how well the reduced model approximates the original high-dimensional one. IRUQ also provides several additional practical advantages: it is non-intrusive; it does not require computing the high-dimensional gradient of the QoI; and it reports an error bar so the user knows how reliable the result is.
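The control-variate idea mentioned in this abstract can be illustrated on a one-dimensional toy problem. This is only a sketch of the generic technique, not the IRUQ algorithm: the "full model" exp(x), the "reduced model" 1 + x, the uniform input and all function names are assumptions chosen so that the exact answer is known.

```python
import math
import random


def control_variate_estimate(f_vals, g_vals, g_mean):
    """Estimate E[f] using g, whose mean g_mean is known, as a control variate."""
    n = len(f_vals)
    f_bar = sum(f_vals) / n
    g_bar = sum(g_vals) / n
    cov_fg = sum((f - f_bar) * (g - g_bar) for f, g in zip(f_vals, g_vals)) / (n - 1)
    var_g = sum((g - g_bar) ** 2 for g in g_vals) / (n - 1)
    c = cov_fg / var_g  # variance-minimizing coefficient, estimated from the same sample
    return f_bar - c * (g_bar - g_mean)


def run_trial(rng, n):
    """Return (plain MC estimate, control-variate estimate) from one sample of size n."""
    xs = [rng.random() for _ in range(n)]
    f_vals = [math.exp(x) for x in xs]  # toy stand-in for the expensive full model
    g_vals = [1.0 + x for x in xs]      # toy stand-in for the cheap reduced model
    plain = sum(f_vals) / n
    cv = control_variate_estimate(f_vals, g_vals, g_mean=1.5)
    return plain, cv


rng = random.Random(0)
exact = math.e - 1.0  # E[exp(X)] for X ~ Uniform(0, 1)
trials = [run_trial(rng, 200) for _ in range(300)]
rmse_plain = math.sqrt(sum((p - exact) ** 2 for p, _ in trials) / len(trials))
rmse_cv = math.sqrt(sum((c - exact) ** 2 for _, c in trials) / len(trials))
```

Because the toy reduced model is strongly correlated with the toy full model, the control-variate error is much smaller than the plain MC error at the same sample size, which mirrors the abstract's remark that the gain depends on how well the reduced model approximates the original one.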
Spertus, Jacob V; Normand, Sharon-Lise T
2018-04-23
High-dimensional data provide many potential confounders that may bolster the plausibility of the ignorability assumption in causal inference problems. Propensity score methods are powerful causal inference tools, which are popular in health care research and are particularly useful for high-dimensional data. Recent interest has surrounded a Bayesian treatment of propensity scores in order to flexibly model the treatment assignment mechanism and summarize posterior quantities while incorporating variance from the treatment model. We discuss methods for Bayesian propensity score analysis of binary treatments, focusing on modern methods for high-dimensional Bayesian regression and the propagation of uncertainty. We introduce a novel and simple estimator for the average treatment effect that capitalizes on conjugacy of the beta and binomial distributions. Through simulations, we show the utility of horseshoe priors and Bayesian additive regression trees paired with our new estimator, while demonstrating the importance of including variance from the treatment regression model. An application to cardiac stent data with almost 500 confounders and 9000 patients illustrates approaches and facilitates comparison with existing alternatives. As measured by a falsifiability endpoint, we improved confounder adjustment compared with past observational research of the same problem. © 2018 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
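The abstract above refers to conjugacy of the beta and binomial distributions. The sketch below shows only the generic conjugate update, in which a Beta(alpha, beta) prior combined with s events in n trials gives a Beta(alpha + s, beta + n - s) posterior. It is not the authors' estimator, and the uniform prior and the event counts are hypothetical.

```python
def beta_binomial_update(alpha, beta, successes, trials):
    """Posterior Beta parameters for a binomial success probability.

    Prior: Beta(alpha, beta). Data: `successes` events out of `trials`.
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    return alpha + successes, beta + (trials - successes)


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical counts: 30 events among 100 treated, 45 events among 100 untreated.
post_treated = beta_binomial_update(1.0, 1.0, successes=30, trials=100)
post_control = beta_binomial_update(1.0, 1.0, successes=45, trials=100)
risk_difference = beta_mean(*post_treated) - beta_mean(*post_control)
```

The update needs no numerical integration, which is what makes conjugate pairs convenient when posterior summaries must be recomputed many times.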
ERIC Educational Resources Information Center
Ikuma, Takeshi; Kunduk, Melda; McWhorter, Andrew J.
2014-01-01
Purpose: The model-based quantitative analysis of high-speed videoendoscopy (HSV) data at a low frame rate of 2,000 frames per second was assessed for its clinical adequacy. Stepwise regression was employed to evaluate the HSV parameters using harmonic models and their relationships to the Voice Handicap Index (VHI). Also, the model-based HSV…
2015-01-01
Among the recent data mining techniques available, the boosting approach has attracted a great deal of attention because of its effective learning algorithm and strong boundaries in terms of its generalization performance. However, the boosting approach has yet to be used in regression problems within the construction domain, including cost estimations, but has been actively utilized in other domains. Therefore, a boosting regression tree (BRT) is applied to cost estimations at the early stage of a construction project to examine the applicability of the boosting approach to a regression problem within the construction domain. To evaluate the performance of the BRT model, its performance was compared with that of a neural network (NN) model, which has been proven to have a high performance in cost estimation domains. The BRT model has shown results similar to those of NN model using 234 actual cost datasets of a building construction project. In addition, the BRT model can provide additional information such as the importance plot and structure model, which can support estimators in comprehending the decision making process. Consequently, the boosting approach has potential applicability in preliminary cost estimations in a building construction project. PMID:26339227
Shin, Yoonseok
2015-01-01
Among the recent data mining techniques available, the boosting approach has attracted a great deal of attention because of its effective learning algorithm and strong boundaries in terms of its generalization performance. However, the boosting approach has yet to be used in regression problems within the construction domain, including cost estimations, but has been actively utilized in other domains. Therefore, a boosting regression tree (BRT) is applied to cost estimations at the early stage of a construction project to examine the applicability of the boosting approach to a regression problem within the construction domain. To evaluate the performance of the BRT model, its performance was compared with that of a neural network (NN) model, which has been proven to have a high performance in cost estimation domains. The BRT model has shown results similar to those of NN model using 234 actual cost datasets of a building construction project. In addition, the BRT model can provide additional information such as the importance plot and structure model, which can support estimators in comprehending the decision making process. Consequently, the boosting approach has potential applicability in preliminary cost estimations in a building construction project.
NASA Astrophysics Data System (ADS)
Rounaghi, Mohammad Mahdi; Abbaszadeh, Mohammad Reza; Arashi, Mohammad
2015-11-01
One of the most important topics of interest to investors is stock price changes. Investors whose goals are long term are sensitive to stock price and its changes and react to them. In this regard, we used the multivariate adaptive regression splines (MARS) model and the semi-parametric splines technique for predicting stock price in this study. The MARS model, a nonparametric method, is an adaptive method for regression that is suited to problems with high dimensions and several variables. The semi-parametric splines technique was also used in this study. Smoothing splines is a nonparametric regression method. In this study, we used 40 variables (30 accounting variables and 10 economic variables) for predicting stock price using the MARS model and the semi-parametric splines technique. After investigating the models, we selected 4 accounting variables (book value per share, predicted earnings per share, P/E ratio and risk) as influencing variables for predicting stock price using the MARS model. After fitting the semi-parametric splines technique, only 4 accounting variables (dividends, net EPS, EPS forecast and P/E ratio) were selected as variables effective in forecasting stock prices.
Gender differences in body consciousness and substance use among high-risk adolescents.
Black, David Scott; Sussman, Steve; Unger, Jennifer; Pokhrel, Pallav; Sun, Ping
2010-08-01
This study explores the association between private and public body consciousness and past 30-day cigarette, alcohol, marijuana, and hard drug use among adolescents. Self-reported data from alternative high school students in California were analyzed (N = 976) using multilevel regression models to account for student clustering within schools. Separate regression analyses were conducted for males and females. Both cross-sectional baseline data and one-year longitudinal prediction models indicated that body consciousness is associated with specific drug use categories differentially by gender. Findings suggest that body consciousness accounts for additional variance in substance use etiology not explained by previously recognized dispositional variables.
Ribaroff, G A; Wastnedge, E; Drake, A J; Sharpe, R M; Chambers, T J G
2017-06-01
Animal models of maternal high fat diet (HFD) demonstrate perturbed offspring metabolism although the effects differ markedly between models. We assessed studies investigating metabolic parameters in the offspring of HFD fed mothers to identify factors explaining these inter-study differences. A total of 171 papers were identified, which provided data from 6047 offspring. Data were extracted regarding body weight, adiposity, glucose homeostasis and lipidaemia. Information regarding the macronutrient content of diet, species, time point of exposure and gestational weight gain was collected and utilized in meta-regression models to explore predictive factors. Publication bias was assessed using Egger's regression test. Maternal HFD exposure did not affect offspring birthweight but increased weaning weight, final bodyweight, adiposity, triglyceridaemia, cholesterolaemia and insulinaemia in both female and male offspring. Hyperglycaemia was found in female offspring only. Meta-regression analysis identified lactational HFD exposure as a key moderator. The fat content of the diet did not correlate with any outcomes. There was evidence of significant publication bias for all outcomes except birthweight. Maternal HFD exposure was associated with perturbed metabolism in offspring, but the variation between studies was not accounted for by dietary constituents, species, strain or maternal gestational weight gain. Specific weaknesses in experimental design predispose many of the results to bias. © 2017 The Authors. Obesity Reviews published by John Wiley & Sons Ltd on behalf of World Obesity Federation.
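Egger's regression test, used in this review to assess publication bias, regresses each study's standardized effect (effect divided by its standard error) on its precision (one over the standard error); an intercept far from zero suggests funnel-plot asymmetry. The sketch below computes only the least-squares intercept and slope and omits the significance test on the intercept. The function name and the toy effect sizes are hypothetical and are not taken from the review.

```python
def egger_intercept(effects, std_errors):
    """Least-squares intercept and slope from regressing effect/SE on 1/SE."""
    y = [e / s for e, s in zip(effects, std_errors)]  # standardized effects
    x = [1.0 / s for s in std_errors]                  # precisions
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


toy_se = [0.1, 0.2, 0.25, 0.5]

# Symmetric case: every study reports the same effect regardless of its size.
symmetric = egger_intercept([0.5, 0.5, 0.5, 0.5], toy_se)

# Asymmetric case: less precise studies report systematically larger effects.
asymmetric = egger_intercept([0.5 + 1.0 * s for s in toy_se], toy_se)
```

In the symmetric toy case the intercept is zero and the slope recovers the common effect; in the asymmetric case the intercept equals the size of the small-study inflation built into the toy data.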
Lim, Changwon
2015-03-30
Nonlinear regression is often used to evaluate the toxicity of a chemical or a drug by fitting data from a dose-response study. Toxicologists and pharmacologists may draw a conclusion about whether a chemical is toxic by testing the significance of the estimated parameters. However, sometimes the null hypothesis cannot be rejected even though the fit is quite good. One possible reason for such cases is that the estimated standard errors of the parameter estimates are extremely large. In this paper, we propose robust ridge regression estimation procedures for nonlinear models to solve this problem. The asymptotic properties of the proposed estimators are investigated; in particular, their mean squared errors are derived. The performances of the proposed estimators are compared with several standard estimators using simulation studies. The proposed methodology is also illustrated using high throughput screening assay data obtained from the National Toxicology Program. Copyright © 2014 John Wiley & Sons, Ltd.
NASA Astrophysics Data System (ADS)
Stigter, T. Y.; Ribeiro, L.; Dill, A. M. M. Carvalho
2008-07-01
Summary: Factorial regression models, based on correspondence analysis, are built to explain the high nitrate concentrations in groundwater beneath an agricultural area in the south of Portugal, exceeding 300 mg/l, as a function of chemical variables, electrical conductivity (EC), land use and hydrogeological setting. Two important advantages of the proposed methodology are that qualitative parameters can be involved in the regression analysis and that multicollinearity is avoided. Regression is performed on eigenvectors extracted from the data similarity matrix, the first of which clearly reveals the impact of agricultural practices and hydrogeological setting on the groundwater chemistry of the study area. Significant correlation exists between response variable NO3^- and explanatory variables Ca^2+, Cl^-, SO4^2-, depth to water, aquifer media and land use. Substituting Cl^- by the EC results in the most accurate regression model for nitrate, when disregarding the four largest outliers (model A). When built solely on land use and hydrogeological setting, the regression model (model B) is less accurate but more interesting from a practical viewpoint, as it is based on easily obtainable data and can be used to predict nitrate concentrations in groundwater in other areas with similar conditions. This is particularly useful for conservative contaminants, where risk and vulnerability assessment methods, based on assumed rather than established correlations, generally produce erroneous results. Another purpose of the models can be to predict the future evolution of nitrate concentrations under influence of changes in land use or fertilization practices, which occur in compliance with policies such as the Nitrates Directive. Model B predicts a 40% decrease in nitrate concentrations in groundwater of the study area, when horticulture is replaced by other land use with much lower fertilization and irrigation rates.
Geographically weighted regression and multicollinearity: dispelling the myth
NASA Astrophysics Data System (ADS)
Fotheringham, A. Stewart; Oshan, Taylor M.
2016-10-01
Geographically weighted regression (GWR) extends the familiar regression framework by estimating a set of parameters for any number of locations within a study area, rather than producing a single parameter estimate for each relationship specified in the model. Recent literature has suggested that GWR is highly susceptible to the effects of multicollinearity between explanatory variables and has proposed a series of local measures of multicollinearity as an indicator of potential problems. In this paper, we employ a controlled simulation to demonstrate that GWR is in fact very robust to the effects of multicollinearity. Consequently, the contention that GWR is highly susceptible to multicollinearity issues needs rethinking.
Stone, Wesley W.; Gilliom, Robert J.; Crawford, Charles G.
2008-01-01
Regression models were developed for predicting annual maximum and selected annual maximum moving-average concentrations of atrazine in streams using the Watershed Regressions for Pesticides (WARP) methodology developed by the National Water-Quality Assessment Program (NAWQA) of the U.S. Geological Survey (USGS). The current effort builds on the original WARP models, which were based on the annual mean and selected percentiles of the annual frequency distribution of atrazine concentrations. Estimates of annual maximum and annual maximum moving-average concentrations for selected durations are needed to characterize the levels of atrazine and other pesticides for comparison to specific water-quality benchmarks for evaluation of potential concerns regarding human health or aquatic life. Separate regression models were derived for the annual maximum and annual maximum 21-day, 60-day, and 90-day moving-average concentrations. Development of the regression models used the same explanatory variables, transformations, model development data, model validation data, and regression methods as those used in the original development of WARP. The models accounted for 72 to 75 percent of the variability in the concentration statistics among the 112 sampling sites used for model development. Predicted concentration statistics from the four models were within a factor of 10 of the observed concentration statistics for most of the model development and validation sites. Overall, performance of the models for the development and validation sites supports the application of the WARP models for predicting annual maximum and selected annual maximum moving-average atrazine concentration in streams and provides a framework to interpret the predictions in terms of uncertainty. 
For streams with inadequate direct measurements of atrazine concentrations, the WARP model predictions for the annual maximum and the annual maximum moving-average atrazine concentrations can be used to characterize the probable levels of atrazine for comparison to specific water-quality benchmarks. Sites with a high probability of exceeding a benchmark for human health or aquatic life can be prioritized for monitoring.
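The concentration statistics that these models predict (the annual maximum and the annual maximum 21-day, 60-day, and 90-day moving averages) can be defined concretely for a daily series. The sketch below computes such statistics from a complete daily record; it illustrates the definition only and is not the WARP regression itself. The function name, the short toy series, and the assumption of gap-free daily data are all hypothetical.

```python
def max_moving_average(daily_values, window):
    """Largest mean over any run of `window` consecutive daily values."""
    if window <= 0 or window > len(daily_values):
        raise ValueError("window must be between 1 and the series length")
    running = sum(daily_values[:window])  # sum of the first window
    best = running
    for i in range(window, len(daily_values)):
        # Slide the window one day forward: add the new day, drop the oldest.
        running += daily_values[i] - daily_values[i - window]
        best = max(best, running)
    return best / window


# Hypothetical daily concentrations (illustrative values only).
toy_series = [0.1, 0.1, 0.4, 2.0, 1.2, 0.6, 0.2, 0.1]
annual_max = max_moving_average(toy_series, 1)  # a window of 1 gives the plain maximum
max_3day = max_moving_average(toy_series, 3)    # short window standing in for 21/60/90 days
```

Longer windows smooth out short peaks, so the maximum moving average never exceeds the plain maximum; this is why separate statistics are compared against benchmarks defined over different exposure durations.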
Uncovering state-dependent relationships in shallow lakes using Bayesian latent variable regression.
Vitense, Kelsey; Hanson, Mark A; Herwig, Brian R; Zimmer, Kyle D; Fieberg, John
2018-03-01
Ecosystems sometimes undergo dramatic shifts between contrasting regimes. Shallow lakes, for instance, can transition between two alternative stable states: a clear state dominated by submerged aquatic vegetation and a turbid state dominated by phytoplankton. Theoretical models suggest that critical nutrient thresholds differentiate three lake types: highly resilient clear lakes, lakes that may switch between clear and turbid states following perturbations, and highly resilient turbid lakes. For effective and efficient management of shallow lakes and other systems, managers need tools to identify critical thresholds and state-dependent relationships between driving variables and key system features. Using shallow lakes as a model system for which alternative stable states have been demonstrated, we developed an integrated framework using Bayesian latent variable regression (BLR) to classify lake states, identify critical total phosphorus (TP) thresholds, and estimate steady state relationships between TP and chlorophyll a (chl a) using cross-sectional data. We evaluated the method using data simulated from a stochastic differential equation model and compared its performance to k-means clustering with regression (KMR). We also applied the framework to data comprising 130 shallow lakes. For simulated data sets, BLR had high state classification rates (median/mean accuracy >97%) and accurately estimated TP thresholds and state-dependent TP-chl a relationships. Classification and estimation improved with increasing sample size and decreasing noise levels. Compared to KMR, BLR had higher classification rates and better approximated the TP-chl a steady state relationships and TP thresholds. We fit the BLR model to three different years of empirical shallow lake data, and managers can use the estimated bifurcation diagrams to prioritize lakes for management according to their proximity to thresholds and chance of successful rehabilitation. 
Our model improves upon previous methods for shallow lakes because it allows classification and regression to occur simultaneously and inform one another, directly estimates TP thresholds and the uncertainty associated with thresholds and state classifications, and enables meaningful constraints to be built into models. The BLR framework is broadly applicable to other ecosystems known to exhibit alternative stable states in which regression can be used to establish relationships between driving variables and state variables. © 2017 by the Ecological Society of America.
Prediction of siRNA potency using sparse logistic regression.
Hu, Wei; Hu, John
2014-06-01
RNA interference (RNAi) can modulate gene expression at post-transcriptional as well as transcriptional levels. Short interfering RNA (siRNA) serves as a trigger for the RNAi gene inhibition mechanism, and therefore is a crucial intermediate step in RNAi. There have been extensive studies to identify the sequence characteristics of potent siRNAs. One such study built a linear model using LASSO (Least Absolute Shrinkage and Selection Operator) to measure the contribution of each siRNA sequence feature. This model is simple and interpretable, but it requires a large number of nonzero weights. We have introduced a novel technique, sparse logistic regression, to build a linear model using single-position specific nucleotide compositions which has the same prediction accuracy as the linear model based on LASSO. The weights in our new model share the same general trend as those in the previous model, but only 25 of the 84 weights are nonzero, a 54% reduction compared to the previous model. Contrary to the linear model based on LASSO, our model suggests that only a few positions are influential on the efficacy of the siRNA, which are the 5' and 3' ends and the seed region of siRNA sequences. We also employed sparse logistic regression to build a linear model using dual-position specific nucleotide compositions, a task LASSO is not able to accomplish well due to its high dimensional nature. Our results demonstrate the superiority of sparse logistic regression as a technique for both feature selection and regression over LASSO in the context of siRNA design.
Syed, Hamzah; Jorgensen, Andrea L; Morris, Andrew P
2016-06-01
To evaluate the power to detect associations between SNPs and time-to-event outcomes across a range of pharmacogenomic study designs while comparing alternative regression approaches. Simulations were conducted to compare Cox proportional hazards modeling accounting for censoring and logistic regression modeling of a dichotomized outcome at the end of the study. The Cox proportional hazards model was demonstrated to be more powerful than the logistic regression analysis. The difference in power between the approaches was highly dependent on the rate of censoring. Initial evaluation of single-nucleotide polymorphism association signals using computationally efficient software with dichotomized outcomes provides an effective screening tool for some design scenarios, and thus has important implications for the development of analytical protocols in pharmacogenomic studies.
High-risk regions and outbreak modelling of tularemia in humans.
Desvars-Larrive, A; Liu, X; Hjertqvist, M; Sjöstedt, A; Johansson, A; Rydén, P
2017-02-01
Sweden reports large and variable numbers of human tularemia cases, but the high-risk regions are anecdotally defined and factors explaining annual variations are poorly understood. Here, high-risk regions were identified by spatial cluster analysis on disease surveillance data for 1984-2012. Negative binomial regression with five previously validated predictors (including predicted mosquito abundance and predictors based on local weather data) was used to model the annual number of tularemia cases within the high-risk regions. Seven high-risk regions were identified with annual incidences of 3·8-44 cases/100 000 inhabitants, accounting for 56·4% of the tularemia cases but only 9·3% of Sweden's population. For all high-risk regions, most cases occurred between July and September. The regression models explained the annual variation of tularemia cases within most high-risk regions and discriminated between years with and without outbreaks. In conclusion, tularemia in Sweden is concentrated in a few high-risk regions and shows high annual and seasonal variations. We present reproducible methods for identifying tularemia high-risk regions and modelling tularemia cases within these regions. The results may help health authorities to target populations at risk and lay the foundation for developing an early warning system for outbreaks.
Feng, Yongjiu; Tong, Xiaohua
2017-09-22
Defining transition rules is an important issue in cellular automaton (CA)-based land use modeling because these models incorporate highly correlated driving factors. Multicollinearity among correlated driving factors may produce negative effects that must be eliminated from the modeling. Using exploratory regression under pre-defined criteria, we identified all possible combinations of factors from the candidate factors affecting land use change. Three combinations that incorporate five driving factors meeting pre-defined criteria were assessed. With the selected combinations of factors, three logistic regression-based CA models were built to simulate dynamic land use change in Shanghai, China, from 2000 to 2015. For comparative purposes, a CA model with all candidate factors was also applied to simulate the land use change. Simulations using three CA models with multicollinearity eliminated performed better (with accuracy improvements of about 3.6%) than the model incorporating all candidate factors. Our results showed that not all candidate factors are necessary for accurate CA modeling and the simulations were not sensitive to changes in statistically non-significant driving factors. We conclude that exploratory regression is an effective method to search for the optimal combinations of driving factors, leading to better land use change models that are devoid of multicollinearity. We suggest identification of dominant factors and elimination of multicollinearity before building land change models, making it possible to simulate more realistic outcomes.
Hughes, James P; Haley, Danielle F; Frew, Paula M; Golin, Carol E; Adimora, Adaora A; Kuo, Irene; Justman, Jessica; Soto-Torres, Lydia; Wang, Jing; Hodder, Sally
2015-06-01
Reductions in risk behaviors are common following enrollment in human immunodeficiency virus (HIV) prevention studies. We develop methods to quantify the proportion of change in risk behaviors that can be attributed to regression to the mean versus study participation and other factors. A novel model that incorporates both regression to the mean and study participation effects is developed for binary measures. The model is used to estimate the proportion of change in the prevalence of "unprotected sex in the past 6 months" that can be attributed to study participation versus regression to the mean in a longitudinal cohort of women at risk for HIV infection who were recruited from ten U.S. communities with high rates of HIV and poverty. HIV risk behaviors were evaluated using audio computer-assisted self-interviews at baseline and every 6 months for up to 12 months. The prevalence of "unprotected sex in the past 6 months" declined from 96% at baseline to 77% at 12 months. However, this change could be almost completely explained by regression to the mean. Analyses that examine changes over time in cohorts selected for high- or low- risk behaviors should account for regression to the mean effects. Copyright © 2015 Elsevier Inc. All rights reserved.
Rupert, Michael G.; Cannon, Susan H.; Gartner, Joseph E.
2003-01-01
Logistic regression was used to predict the probability of debris flows occurring in areas recently burned by wildland fires. Multiple logistic regression is conceptually similar to multiple linear regression because statistical relations between one dependent variable and several independent variables are evaluated. In logistic regression, however, the dependent variable is transformed to a binary variable (debris flow did or did not occur), and the actual probability of the debris flow occurring is statistically modeled. Data from 399 basins located within 15 wildland fires that burned during 2000-2002 in Colorado, Idaho, Montana, and New Mexico were evaluated. More than 35 independent variables describing the burn severity, geology, land surface gradient, rainfall, and soil properties were evaluated. The models were developed as follows: (1) Basins that did and did not produce debris flows were delineated from National Elevation Data using a Geographic Information System (GIS). (2) Data describing the burn severity, geology, land surface gradient, rainfall, and soil properties were determined for each basin. These data were then downloaded to a statistics software package for analysis using logistic regression. (3) Relations between the occurrence/non-occurrence of debris flows and burn severity, geology, land surface gradient, rainfall, and soil properties were evaluated and several preliminary multivariate logistic regression models were constructed. All possible combinations of independent variables were evaluated to determine which combination produced the most effective model. The multivariate model that best predicted the occurrence of debris flows was selected. (4) The multivariate logistic regression model was entered into a GIS, and a map showing the probability of debris flows was constructed. 
The most effective model incorporates the percentage of each basin with slope greater than 30 percent, percentage of land burned at medium and high burn severity in each basin, particle size sorting, average storm intensity (millimeters per hour), soil organic matter content, soil permeability, and soil drainage. The results of this study demonstrate that logistic regression is a valuable tool for predicting the probability of debris flows occurring in recently-burned landscapes.
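As a minimal sketch of how a fitted multivariate logistic regression model turns basin attributes into a probability, the following evaluates the logistic function on a linear predictor. The intercept, coefficients, and input values are hypothetical placeholders chosen for illustration; they are not the coefficients reported by the authors.

```python
import math

def logistic_probability(intercept, coefficients, values):
    """Return P(event) = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, values))
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical model with two predictors, e.g. percent of basin with slope
# above 30 percent and percent burned at medium or high severity.
# These numbers are illustrative assumptions, not values from the study.
p = logistic_probability(-2.0, [0.03, 0.02], [40.0, 50.0])
print(f"Estimated probability of a debris flow: {p:.2f}")
```

Applying the same function to every basin in a GIS layer would give the kind of probability map described in step (4).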
Seasonal forecasting of high wind speeds over Western Europe
NASA Astrophysics Data System (ADS)
Palutikof, J. P.; Holt, T.
2003-04-01
As financial losses associated with extreme weather events escalate, there is interest from end users in the forestry and insurance industries, for example, in the development of seasonal forecasting models with a long lead time. This study uses exceedances of the 90th, 95th, and 99th percentiles of daily maximum wind speed over the period 1958 to present to derive predictands of winter wind extremes. The source data is the 6-hourly NCEP Reanalysis gridded surface wind field. Predictor variables include principal components of Atlantic sea surface temperature and several indices of climate variability, including the NAO and SOI. Lead times of up to a year are considered, in monthly increments. Three regression techniques are evaluated: multiple linear regression (MLR), principal component regression (PCR), and partial least squares regression (PLS). PCR and PLS proved considerably superior to MLR with much lower standard errors. PLS was chosen to formulate the predictive model since it offers more flexibility in experimental design and gave slightly better results than PCR. The results indicate that winter windiness can be predicted with considerable skill one year ahead for much of coastal Europe, but that this deteriorates rapidly in the hinterland. The experiment succeeded in highlighting PLS as a very useful method for developing more precise forecasting models, and in identifying areas of high predictability.
Siebers, Nina; Kruse, Jens; Eckhardt, Kai-Uwe; Hu, Yongfeng; Leinweber, Peter
2012-07-01
Cadmium (Cd) has a high toxicity and resolving its speciation in soil is challenging but essential for estimating the environmental risk. In this study partial least-square (PLS) regression was tested for its capability to deconvolute Cd L3-edge X-ray absorption near-edge structure (XANES) spectra of multi-compound mixtures. For this, a library of Cd reference compound spectra and a spectrum of a soil sample were acquired. A good coefficient of determination (R²) of Cd compounds in mixtures was obtained for the PLS model using binary and ternary mixtures of various Cd reference compounds proving the validity of this approach. In order to describe complex systems like soil, multi-compound mixtures of a variety of Cd compounds must be included in the PLS model. The obtained PLS regression model was then applied to a highly Cd-contaminated soil revealing Cd3(PO4)2 (36.1%), Cd(NO3)2·4H2O (24.5%), Cd(OH)2 (21.7%), CdCO3 (17.1%) and CdCl2 (0.4%). These preliminary results proved that PLS regression is a promising approach for a direct determination of Cd speciation in the solid phase of a soil sample.
Quantile Regression in the Study of Developmental Sciences
Petscher, Yaacov; Logan, Jessica A. R.
2014-01-01
Linear regression analysis is one of the most common techniques applied in developmental research, but only allows for an estimate of the average relations between the predictor(s) and the outcome. This study describes quantile regression, which provides estimates of the relations between the predictor(s) and outcome, but across multiple points of the outcome’s distribution. Using data from the High School and Beyond and U.S. Sustained Effects Study databases, quantile regression is demonstrated and contrasted with linear regression when considering models with: (a) one continuous predictor, (b) one dichotomous predictor, (c) a continuous and a dichotomous predictor, and (d) a longitudinal application. Results from each example exhibited the differential inferences which may be drawn using linear or quantile regression. PMID:24329596
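To make the contrast with the average concrete, here is a small sketch of the check ("pinball") loss that underlies quantile regression, applied to the simplest case of fitting a single constant. The data and function names are illustrative assumptions and are not taken from the study's databases.

```python
def pinball_loss(residual, tau):
    """Check loss: weight tau on positive residuals, (1 - tau) on negative ones."""
    return tau * residual if residual >= 0 else (tau - 1.0) * residual

def best_constant(values, tau):
    """Return the sample value that minimises total pinball loss at level tau."""
    return min(values, key=lambda c: sum(pinball_loss(y - c, tau) for y in values))

# Made-up scores with one extreme observation.
scores = [1, 2, 3, 4, 100]
mean_fit = sum(scores) / len(scores)      # pulled upward by the extreme value
median_fit = best_constant(scores, 0.5)   # tau = 0.5 recovers the median
upper_fit = best_constant(scores, 0.9)    # tau = 0.9 targets the upper tail
print(mean_fit, median_fit, upper_fit)
```

A full quantile regression replaces the constant with a linear function of the predictors and minimises the same loss, so each choice of tau gives its own set of coefficients.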
NASA Astrophysics Data System (ADS)
Öktem, H.
2012-01-01
Plastic injection molding plays a key role in the production of high-quality plastic parts. Shrinkage is one of the most significant quality problems of plastic parts in plastic injection molding. This article focuses on the study of the modeling and analysis of the effects of process parameters on the shrinkage by evaluating the quality of the plastic part of a DVD-ROM cover made with Acrylonitrile Butadiene Styrene (ABS) polymer material. An effective regression model was developed to determine the mathematical relationship between the process parameters (mold temperature, melt temperature, injection pressure, injection time, and cooling time) and the volumetric shrinkage by utilizing the analysis data. Finite element (FE) analyses designed by Taguchi (L27) orthogonal arrays were run in the Moldflow simulation program. Analysis of variance (ANOVA) was then performed to check the adequacy of the regression model and to determine the effect of the process parameters on the shrinkage. Experiments were conducted to check the accuracy of the regression model against the FE analyses obtained from Moldflow. The results show that the regression model agrees very well with the FE analyses and the experiments. From this, it can be concluded that this study succeeded in modeling the shrinkage problem in our application.
Cross-validation pitfalls when selecting and assessing regression and classification models.
Krstajic, Damjan; Buturovic, Ljubomir J; Leahy, David E; Thomas, Simon
2014-03-29
We address the problem of selecting and assessing classification and regression models using cross-validation. Current state-of-the-art methods can yield models with high variance, rendering them unsuitable for a number of practical applications including QSAR. In this paper we describe and evaluate best practices which improve reliability and increase confidence in selected models. A key operational component of the proposed methods is cloud computing which enables routine use of previously infeasible approaches. We describe in detail an algorithm for repeated grid-search V-fold cross-validation for parameter tuning in classification and regression, and we define a repeated nested cross-validation algorithm for model assessment. As regards variable selection and parameter tuning we define two algorithms (repeated grid-search cross-validation and double cross-validation), and provide arguments for using the repeated grid-search in the general case. We show results of our algorithms on seven QSAR datasets. The variation of the prediction performance, which is the result of choosing different splits of the dataset in V-fold cross-validation, needs to be taken into account when selecting and assessing classification and regression models. We demonstrate the importance of repeating cross-validation when selecting an optimal model, as well as the importance of repeating nested cross-validation when assessing a prediction error.
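The core idea of repeating V-fold cross-validation over different random splits can be sketched as follows. This is only an index generator under assumed names (`vfold_indices`, `repeated_vfold`); it is not the authors' grid-search or nested algorithm, which would wrap model fitting and parameter tuning around these splits.

```python
import random

def vfold_indices(n_samples, n_folds, rng):
    """Shuffle the sample indices and deal them into n_folds disjoint folds."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[k::n_folds] for k in range(n_folds)]

def repeated_vfold(n_samples, n_folds, n_repeats, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for every split of every repeat."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        folds = vfold_indices(n_samples, n_folds, rng)
        for k, test_idx in enumerate(folds):
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield repeat, k, train_idx, test_idx

splits = list(repeated_vfold(n_samples=10, n_folds=5, n_repeats=3))
print(len(splits))  # 3 repeats x 5 folds
```

Scoring a model on every split and inspecting the spread of scores across repeats is one way to see the split-to-split variation that the abstract says must be taken into account.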
Application of General Regression Neural Network to the Prediction of LOD Change
NASA Astrophysics Data System (ADS)
Zhang, Xiao-Hong; Wang, Qi-Jie; Zhu, Jian-Jun; Zhang, Hao
2012-01-01
Traditional methods for predicting the change in length of day (LOD change) are mainly based on linear models, such as the least squares model and the autoregression model. However, the LOD change comprises complicated non-linear factors, and the predictions of linear models are often unsatisfactory. Thus, a non-linear neural network, the general regression neural network (GRNN) model, is used to predict the LOD change, and the result is compared with the predictions obtained with the BP (back propagation) neural network model and other models. The comparison shows that applying the GRNN to the prediction of the LOD change is highly effective and feasible.
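A GRNN prediction is, in its usual formulation, a Gaussian-kernel-weighted average of the training targets. The sketch below shows that calculation for a one-dimensional input; the training values and the smoothing parameter `sigma` are illustrative assumptions, not LOD data or settings from the paper.

```python
import math

def grnn_predict(train_x, train_y, query, sigma):
    """Kernel-weighted average of train_y, with Gaussian weights centred on query.

    Note: for queries very far from all training points the weights can
    underflow to zero; a production version would guard against that.
    """
    weights = [math.exp(-((query - x) ** 2) / (2.0 * sigma ** 2)) for x in train_x]
    return sum(w * y for w, y in zip(weights, train_y)) / sum(weights)

# Made-up training pairs for illustration.
train_x = [0.0, 1.0, 2.0]
train_y = [0.0, 1.0, 4.0]
sharp = grnn_predict(train_x, train_y, query=1.0, sigma=0.1)      # close to 1.0
smooth = grnn_predict(train_x, train_y, query=1.0, sigma=1000.0)  # close to the mean of train_y
print(sharp, smooth)
```

The single smoothing parameter controls the trade-off: a small `sigma` reproduces nearby training targets, while a large one approaches the overall mean.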
A Regional Analysis of Non-Methane Hydrocarbons And Meteorology of The Rural Southeast United States
1996-01-01
Zt is an ARIMA time series. This is a typical regression model , except that it allows for autocorrelation in the error term Z. In this work, an ARMA...data=folder; var residual; run; II Statistical output of 1992 regression model on 1993 ozone data ARIMA Procedure Maximum Likelihood Estimation Approx...at each of the sites, and to show the effect of synoptic meteorology on high ozone by examining NOAA daily weather maps and climatic data
USDA-ARS's Scientific Manuscript database
High-throughput phenotyping (HTP) platforms can be used to measure traits that are genetically correlated with wheat (Triticum aestivum L.) grain yield across time. Incorporating such secondary traits in the multivariate pedigree and genomic prediction models would be desirable to improve indirect s...
Spatial regression analysis on 32 years of total column ozone data
NASA Astrophysics Data System (ADS)
Knibbe, J. S.; van der A, R. J.; de Laat, A. T. J.
2014-08-01
Multiple-regression analyses have been performed on 32 years of total ozone column data that was spatially gridded with a 1 × 1.5° resolution. The total ozone data consist of the MSR (Multi Sensor Reanalysis; 1979-2008) and 2 years of assimilated SCIAMACHY (SCanning Imaging Absorption spectroMeter for Atmospheric CHartographY) ozone data (2009-2010). The two-dimensionality in this data set allows us to perform the regressions locally and investigate spatial patterns of regression coefficients and their explanatory power. Seasonal dependencies of ozone on regressors are included in the analysis. A new physically oriented model is developed to parameterize stratospheric ozone. Ozone variations on nonseasonal timescales are parameterized by explanatory variables describing the solar cycle, stratospheric aerosols, the quasi-biennial oscillation (QBO), El Niño-Southern Oscillation (ENSO) and stratospheric alternative halogens which are parameterized by the effective equivalent stratospheric chlorine (EESC). For several explanatory variables, seasonally adjusted versions of these explanatory variables are constructed to account for the difference in their effect on ozone throughout the year. To account for seasonal variation in ozone, explanatory variables describing the polar vortex, geopotential height, potential vorticity and average day length are included. Results of this regression model are compared to that of a similar analysis based on a more commonly applied statistically oriented model. The physically oriented model provides spatial patterns in the regression results for each explanatory variable. 
The EESC has a significant depleting effect on ozone at mid- and high latitudes, the solar cycle affects ozone positively mostly in the Southern Hemisphere, stratospheric aerosols affect ozone negatively at high northern latitudes, the effect of QBO is positive and negative in the tropics and mid- to high latitudes, respectively, and ENSO affects ozone negatively between 30° N and 30° S, particularly over the Pacific. The contribution of explanatory variables describing seasonal ozone variation is generally large at mid- to high latitudes. We observe ozone increases with potential vorticity and day length and ozone decreases with geopotential height and variable ozone effects due to the polar vortex in regions to the north and south of the polar vortices. Recovery of ozone is identified globally. However, recovery rates and uncertainties strongly depend on choices that can be made in defining the explanatory variables. The application of several trend models, each with their own pros and cons, yields a large range of recovery rate estimates. Overall these results suggest that care has to be taken in determining ozone recovery rates, in particular for the Antarctic ozone hole.
A Predictive Model for Readmissions Among Medicare Patients in a California Hospital.
Duncan, Ian; Huynh, Nhan
2017-11-17
Predictive models for hospital readmission rates are in high demand because of the Centers for Medicare & Medicaid Services (CMS) Hospital Readmission Reduction Program (HRRP). The LACE index is one of the most popular predictive tools among hospitals in the United States. The LACE index is a simple tool with 4 parameters: Length of stay, Acuity of admission, Comorbidity, and Emergency visits in the previous 6 months. The authors applied logistic regression to develop a predictive model for a medium-sized not-for-profit community hospital in California using patient-level data with more specific patient information (including 13 explanatory variables). Specifically, the logistic regression is applied to 2 populations: a general population including all patients and the specific group of patients targeted by the CMS penalty (characterized as ages 65 or older with select conditions). The 2 resulting logistic regression models have a higher sensitivity rate compared to the sensitivity of the LACE index. The C statistic values of the model applied to both populations demonstrate moderate levels of predictive power. The authors also build an economic model to demonstrate the potential financial impact of the use of the model for targeting high-risk patients in a sample hospital and demonstrate that, on balance, whether the hospital gains or loses from reducing readmissions depends on its margin and the extent of its readmission penalties.
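The C statistic mentioned above can be read as the probability that a randomly chosen readmitted patient receives a higher predicted risk than a randomly chosen non-readmitted patient. A minimal sketch of that pairwise calculation follows; the scores and labels are made up for illustration.

```python
def c_statistic(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count one half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    concordant = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return concordant / (len(positives) * len(negatives))

# Hypothetical predicted risks and observed outcomes (1 = readmitted).
scores = [0.9, 0.8, 0.3, 0.2]
labels = [1, 0, 1, 0]
print(c_statistic(scores, labels))  # 3 of the 4 pairs are ordered correctly
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation.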
Huang, Jian; Zhang, Cun-Hui
2013-01-01
The ℓ1-penalized method, or the Lasso, has emerged as an important tool for the analysis of large data sets. Many important results have been obtained for the Lasso in linear regression which have led to a deeper understanding of high-dimensional statistical problems. In this article, we consider a class of weighted ℓ1-penalized estimators for convex loss functions of a general form, including the generalized linear models. We study the estimation, prediction, selection and sparsity properties of the weighted ℓ1-penalized estimator in sparse, high-dimensional settings where the number of predictors p can be much larger than the sample size n. Adaptive Lasso is considered as a special case. A multistage method is developed to approximate concave regularized estimation by applying an adaptive Lasso recursively. We provide prediction and estimation oracle inequalities for single- and multi-stage estimators, a general selection consistency theorem, and an upper bound for the dimension of the Lasso estimator. Important models including the linear regression, logistic regression and log-linear models are used throughout to illustrate the applications of the general results. PMID:24348100
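For orientation, a weighted ℓ1-penalized estimator of the kind described can be written in the following generic form. The notation (loss ℓ, tuning parameter λ, weights w_j, initial estimate β̃, exponent γ) is assumed here for illustration and may differ from the symbols used in the article.

```latex
% Generic weighted l1-penalized estimator for a convex loss \ell(\beta):
\hat{\beta} \;=\; \arg\min_{\beta \in \mathbb{R}^{p}}
  \Big\{ \ell(\beta) \;+\; \lambda \sum_{j=1}^{p} w_j \, |\beta_j| \Big\},
  \qquad w_j \ge 0 .

% The ordinary Lasso takes w_j = 1 for all j. A common adaptive Lasso choice
% sets the weights from an initial estimate \tilde{\beta}:
w_j \;=\; \frac{1}{|\tilde{\beta}_j|^{\gamma}}, \qquad \gamma > 0 .
```

Under this reading, a multistage procedure would recompute the weights from the current estimate and solve the weighted problem again at each stage.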
García Nieto, P J; Alonso Fernández, J R; de Cos Juez, F J; Sánchez Lasheras, F; Díaz Muñiz, C
2013-04-01
Cyanotoxins, a kind of poisonous substance produced by cyanobacteria, are responsible for health risks in drinking and recreational waters. As a result, anticipating their presence is important for preventing risks. The aim of this study is to use a hybrid approach based on support vector regression (SVR) in combination with genetic algorithms (GAs), known as a genetic algorithm support vector regression (GA-SVR) model, in forecasting the presence of cyanotoxins in the Trasona reservoir (Northern Spain). The GA-SVR approach is aimed at highly nonlinear biological problems with sharp peaks and the tests carried out proved its high performance. Some physical-chemical parameters have been considered along with the biological ones. The results obtained are two-fold. First, the significance of each biological and physical-chemical variable for the presence of cyanotoxins in the reservoir is determined with success. Second, a predictive model able to forecast the possible presence of cyanotoxins in the short term was obtained. Copyright © 2013 Elsevier Inc. All rights reserved.
NASA Astrophysics Data System (ADS)
Kargoll, Boris; Omidalizarandi, Mohammad; Loth, Ina; Paffenholz, Jens-André; Alkhatib, Hamza
2018-03-01
In this paper, we investigate a linear regression time series model of possibly outlier-afflicted observations and autocorrelated random deviations. This colored noise is represented by a covariance-stationary autoregressive (AR) process, in which the independent error components follow a scaled (Student's) t-distribution. This error model allows for the stochastic modeling of multiple outliers and for an adaptive robust maximum likelihood (ML) estimation of the unknown regression and AR coefficients, the scale parameter, and the degree of freedom of the t-distribution. This approach is meant to be an extension of known estimators, which tend to focus only on the regression model, or on the AR error model, or on normally distributed errors. For the purpose of ML estimation, we derive an expectation conditional maximization either algorithm, which leads to an easy-to-implement version of iteratively reweighted least squares. The estimation performance of the algorithm is evaluated via Monte Carlo simulations for a Fourier as well as a spline model in connection with AR colored noise models of different orders and with three different sampling distributions generating the white noise components. We apply the algorithm to a vibration dataset recorded by a high-accuracy, single-axis accelerometer, focusing on the evaluation of the estimated AR colored noise model.
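The link between t-distributed errors and iteratively reweighted least squares can be illustrated with a deliberately reduced example: estimating a single location parameter with the scale and degrees of freedom held fixed. This is a simplified sketch under those assumptions, not the authors' algorithm, which also estimates the AR coefficients, the scale, and the degrees of freedom.

```python
def t_weights(residuals, scale, dof):
    """Standard EM weights for Student-t errors: large residuals get small weights."""
    return [(dof + 1.0) / (dof + (r / scale) ** 2) for r in residuals]

def robust_mean(values, scale=1.0, dof=3.0, iterations=50):
    """Location estimate by iteratively reweighted least squares (fixed scale, dof)."""
    estimate = sum(values) / len(values)  # start from the ordinary mean
    for _ in range(iterations):
        weights = t_weights([v - estimate for v in values], scale, dof)
        estimate = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    return estimate

# Made-up observations with one gross outlier.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 50.0]
print(sum(data) / len(data))  # ordinary mean, dragged toward the outlier
print(robust_mean(data))      # stays near the bulk of the data
```

In a regression setting the weighted average is replaced by a weighted least squares solve, but the reweighting step has the same form.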
Stylianou, Neophytos; Akbarov, Artur; Kontopantelis, Evangelos; Buchan, Iain; Dunn, Ken W
2015-08-01
Predicting mortality from burn injury has traditionally employed logistic regression models. Alternative machine learning methods have been introduced in some areas of clinical prediction as the necessary software and computational facilities have become accessible. Here we compare logistic regression and machine learning predictions of mortality from burn. An established logistic mortality model was compared to machine learning methods (artificial neural network, support vector machine, random forests and naïve Bayes) using a population-based (England & Wales) case-cohort registry. Predictive evaluation used: area under the receiver operating characteristic curve; sensitivity; specificity; positive predictive value and Youden's index. All methods had comparable discriminatory abilities, similar sensitivities, specificities and positive predictive values. Although some machine learning methods performed marginally better than logistic regression the differences were seldom statistically significant and clinically insubstantial. Random forests were marginally better for high positive predictive value and reasonable sensitivity. Neural networks yielded slightly better prediction overall. Logistic regression gives an optimal mix of performance and interpretability. The established logistic regression model of burn mortality performs well against more complex alternatives. Clinical prediction with a small set of strong, stable, independent predictors is unlikely to gain much from machine learning outside specialist research contexts. Copyright © 2015 Elsevier Ltd and ISBI. All rights reserved.
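The evaluation measures listed in this abstract, apart from the area under the ROC curve, follow directly from a 2 x 2 confusion table. The sketch below computes them from hypothetical counts; the numbers are illustrative and are not results from the registry.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, positive predictive value and Youden's index."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    youden = sensitivity + specificity - 1.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "youden": youden,
    }

# Hypothetical counts: 50 deaths (40 predicted), 100 survivors (90 predicted).
metrics = diagnostic_metrics(tp=40, fp=10, tn=90, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```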
Hsu, David
2015-09-27
Clustering methods are often used to model energy consumption for two reasons. First, clustering is often used to process data and to improve the predictive accuracy of subsequent energy models. Second, stable clusters that are reproducible with respect to non-essential changes can be used to group, target, and interpret observed subjects. However, it is well known that clustering methods are highly sensitive to the choice of algorithms and variables. This can lead to misleading assessments of predictive accuracy and mis-interpretation of clusters in policymaking. This paper therefore introduces two methods to the modeling of energy consumption in buildings: clusterwise regression, also known as latent class regression, which integrates clustering and regression simultaneously; and cluster validation methods to measure stability. Using a large dataset of multifamily buildings in New York City, clusterwise regression is compared to common two-stage algorithms that use K-means and model-based clustering with linear regression. Predictive accuracy is evaluated using 20-fold cross validation, and the stability of the perturbed clusters is measured using the Jaccard coefficient. These results show that there seems to be an inherent tradeoff between prediction accuracy and cluster stability. This paper concludes by discussing which clustering methods may be appropriate for different analytical purposes.
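The Jaccard coefficient used for cluster stability compares the membership of a cluster before and after perturbation: the size of the intersection divided by the size of the union. A minimal sketch with made-up building identifiers:

```python
def jaccard(a, b):
    """Jaccard coefficient |A and B| / |A or B|; defined as 1.0 for two empty sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Hypothetical cluster memberships from the original and a perturbed dataset.
original_cluster = {"bldg1", "bldg2", "bldg3"}
perturbed_cluster = {"bldg2", "bldg3", "bldg4"}
print(jaccard(original_cluster, perturbed_cluster))  # 2 shared out of 4 distinct
```

Values near 1 indicate that a cluster is recovered almost unchanged, while values near 0 indicate that it dissolves under perturbation.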
Krider, Lori A.; Magner, Joseph A.; Perry, Jim; Vondracek, Bruce C.; Ferrington, Leonard C.
2013-01-01
Carbonate-sandstone geology in southeastern Minnesota creates a heterogeneous landscape of springs, seeps, and sinkholes that supply groundwater into streams. Air temperatures are effective predictors of water temperature in surface-water dominated streams. However, no published work investigates the relationship between air and water temperatures in groundwater-fed streams (GWFS) across watersheds. We used simple linear regressions to examine weekly air-water temperature relationships for 40 GWFS in southeastern Minnesota. A 40-stream, composite linear regression model has a slope of 0.38, an intercept of 6.63, and R² of 0.83. The regression models for GWFS have lower slopes and higher intercepts in comparison to surface-water dominated streams. Regression models for streams with high R² values offer promise for use as predictive tools for future climate conditions. Climate change is expected to alter the thermal regime of groundwater-fed systems, but will do so at a slower rate than surface-water dominated systems. A regression model of intercept vs. slope can be used to identify streams for which water temperatures are more meteorologically than groundwater controlled, and thus more vulnerable to climate change. Such relationships can be used to guide restoration vs. management strategies to protect trout streams.
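The composite model reported here can be applied directly as a line with slope 0.38 and intercept 6.63. The sketch below does so; the abstract does not state the temperature units, so degrees Celsius is an assumption, and the function name is hypothetical.

```python
SLOPE = 0.38      # composite 40-stream slope from the abstract
INTERCEPT = 6.63  # composite 40-stream intercept from the abstract

def predict_water_temperature(air_temperature):
    """Weekly water temperature predicted from weekly air temperature.

    Units are assumed to be degrees Celsius; the abstract does not say.
    """
    return INTERCEPT + SLOPE * air_temperature

for air in (0.0, 10.0, 20.0):
    print(air, round(predict_water_temperature(air), 2))
```

Because the slope is well below 1, a given change in air temperature maps to a much smaller predicted change in water temperature, which matches the abstract's point that groundwater-fed streams respond more slowly than surface-water dominated ones.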
Zhao, Lue Ping; Bolouri, Hamid
2016-04-01
Maturing omics technologies enable researchers to generate high dimension omics data (HDOD) routinely in translational clinical studies. In the field of oncology, The Cancer Genome Atlas (TCGA) provided funding support to researchers to generate different types of omics data on a common set of biospecimens with accompanying clinical data and has made the data available for the research community to mine. One important application, and the focus of this manuscript, is to build predictive models for prognostic outcomes based on HDOD. To complement prevailing regression-based approaches, we propose to use an object-oriented regression (OOR) methodology to identify exemplars specified by HDOD patterns and to assess their associations with prognostic outcome. Through computing patient's similarities to these exemplars, the OOR-based predictive model produces a risk estimate using a patient's HDOD. The primary advantages of OOR are twofold: reducing the penalty of high dimensionality and retaining the interpretability to clinical practitioners. To illustrate its utility, we apply OOR to gene expression data from non-small cell lung cancer patients in TCGA and build a predictive model for prognostic survivorship among stage I patients, i.e., we stratify these patients by their prognostic survival risks beyond histological classifications. Identification of these high-risk patients helps oncologists to develop effective treatment protocols and post-treatment disease management plans. Using the TCGA data, the total sample is divided into training and validation data sets. After building up a predictive model in the training set, we compute risk scores from the predictive model, and validate associations of risk scores with prognostic outcome in the validation data (P-value=0.015). Copyright © 2016 Elsevier Inc. All rights reserved.
Zhao, Lue Ping; Bolouri, Hamid
2016-01-01
Maturing omics technologies enable researchers to generate high dimension omics data (HDOD) routinely in translational clinical studies. In the field of oncology, The Cancer Genome Atlas (TCGA) provided funding support to researchers to generate different types of omics data on a common set of biospecimens with accompanying clinical data and to make the data available for the research community to mine. One important application, and the focus of this manuscript, is to build predictive models for prognostic outcomes based on HDOD. To complement prevailing regression-based approaches, we propose to use an object-oriented regression (OOR) methodology to identify exemplars specified by HDOD patterns and to assess their associations with prognostic outcome. Through computing patient’s similarities to these exemplars, the OOR-based predictive model produces a risk estimate using a patient’s HDOD. The primary advantages of OOR are twofold: reducing the penalty of high dimensionality and retaining the interpretability to clinical practitioners. To illustrate its utility, we apply OOR to gene expression data from non-small cell lung cancer patients in TCGA and build a predictive model for prognostic survivorship among stage I patients, i.e., we stratify these patients by their prognostic survival risks beyond histological classifications. Identification of these high-risk patients helps oncologists to develop effective treatment protocols and post-treatment disease management plans. Using the TCGA data, the total sample is divided into training and validation data sets. After building up a predictive model in the training set, we compute risk scores from the predictive model, and validate associations of risk scores with prognostic outcome in the validation data (p=0.015). PMID:26972839
NASA Astrophysics Data System (ADS)
Li, Xiao Ju; Yao, Kun; Dai, Jun Yu; Song, Yun Long
2018-05-01
Underground space, also known as the “fourth dimension” of the city, reflects the efficient, intensive use of space in urban development. The urban traffic link tunnel is a typical underground space of limited length. Because of its geographical location, its special spatial structure, and the curvature of the tunnel, high-temperature smoke can easily produce the “smoke turning” phenomenon, and the fire risk is extremely high. Taking an urban traffic link tunnel as an example, this paper focuses on the relationship between curvature and the temperature near the fire source. Fire models with different curvatures were built in PyroSim to analyze the influence of curvature on fire temperature, and multivariate regression analysis in SPSS was then applied to the simulated curvature and fire temperature data. Finally, a model predicting fire temperature from the curvature of an urban traffic link tunnel is proposed. Regression analysis and testing show that curvature is negatively correlated with tunnel temperature. The model is feasible and can provide a theoretical reference for the fire protection design of urban traffic link tunnels and the preparation of evacuation plans. It also provides a reference for curvature design and smoke control measures in other curved tunnels.
Chahine, Teresa; Schultz, Bradley D.; Zartarian, Valerie G.; Xue, Jianping; Subramanian, SV; Levy, Jonathan I.
2011-01-01
Community-based cumulative risk assessment requires characterization of exposures to multiple chemical and non-chemical stressors, with consideration of how the non-chemical stressors may influence risks from chemical stressors. Residential radon provides an interesting case example, given its large attributable risk, effect modification due to smoking, and significant variability in radon concentrations and smoking patterns. In spite of this fact, no study to date has estimated geographic and sociodemographic patterns of both radon and smoking in a manner that would allow for inclusion of radon in community-based cumulative risk assessment. In this study, we apply multi-level regression models to explain variability in radon based on housing characteristics and geological variables, and construct a regression model predicting housing characteristics using U.S. Census data. Multi-level regression models of smoking based on predictors common to the housing model allow us to link the exposures. We estimate county-average lifetime lung cancer risks from radon ranging from 0.15 to 1.8 in 100, with high-risk clusters in areas and for subpopulations with high predicted radon and smoking rates. Our findings demonstrate the viability of screening-level assessment to characterize patterns of lung cancer risk from radon, with an approach that can be generalized to multiple chemical and non-chemical stressors. PMID:22016710
A Solution to Separation and Multicollinearity in Multiple Logistic Regression
Shen, Jianzhao; Gao, Sujuan
2010-01-01
In dementia screening tests, item selection for shortening an existing screening test can be achieved using multiple logistic regression. However, maximum likelihood estimates for such logistic regression models often suffer from serious bias or even fail to exist because of separation and multicollinearity problems resulting from a large number of highly correlated items. Firth (1993, Biometrika, 80(1), 27–38) proposed a penalized likelihood estimator for generalized linear models, which was shown to reduce bias and the non-existence problems. Ridge regression has been used in logistic regression to stabilize the estimates in cases of multicollinearity. However, neither approach solves the problem that the other addresses. In this paper, we propose a double penalized maximum likelihood estimator combining Firth’s penalized likelihood equation with a ridge parameter. We present a simulation study evaluating the empirical performance of the double penalized likelihood estimator in small to moderate sample sizes. We demonstrate the proposed approach using current screening data from a community-based dementia study. PMID:20376286
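For orientation, one common way to write a log-likelihood that combines Firth's penalty with a ridge penalty is sketched below. This is an assumed form, not necessarily the authors' exact parameterization: here \ell(\beta) is the ordinary logistic log-likelihood, I(\beta) is the Fisher information matrix, and \lambda \ge 0 is the ridge parameter. The scaling of the ridge term and whether the intercept is penalized are assumptions.

```latex
% Firth's penalized log-likelihood (Jeffreys-prior penalty):
\ell^{*}(\beta) = \ell(\beta) + \tfrac{1}{2}\,\log\bigl|I(\beta)\bigr|

% Assumed double penalized form, adding a ridge term with parameter \lambda:
\ell^{**}(\beta) = \ell(\beta) + \tfrac{1}{2}\,\log\bigl|I(\beta)\bigr|
                 - \lambda \sum_{j=1}^{p} \beta_j^{2}
```

In this form, the first penalty targets bias and non-existence of estimates under separation, while the second shrinks the coefficients of highly correlated items, which matches the division of labour described in the abstract.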
A Solution to Separation and Multicollinearity in Multiple Logistic Regression.
Shen, Jianzhao; Gao, Sujuan
2008-10-01
In dementia screening tests, item selection for shortening an existing screening test can be achieved using multiple logistic regression. However, maximum likelihood estimates for such logistic regression models often suffer from serious bias or even fail to exist because of separation and multicollinearity problems resulting from a large number of highly correlated items. Firth (1993, Biometrika, 80(1), 27-38) proposed a penalized likelihood estimator for generalized linear models, which was shown to reduce bias and the non-existence problems. Ridge regression has been used in logistic regression to stabilize the estimates in cases of multicollinearity. However, neither approach solves the problem that the other addresses. In this paper, we propose a double penalized maximum likelihood estimator combining Firth's penalized likelihood equation with a ridge parameter. We present a simulation study evaluating the empirical performance of the double penalized likelihood estimator in small to moderate sample sizes. We demonstrate the proposed approach using current screening data from a community-based dementia study.
Modeling time-to-event (survival) data using classification tree analysis.
Linden, Ariel; Yarnold, Paul R
2017-12-01
Time to the occurrence of an event is often studied in health research. Survival analysis differs from other designs in that follow-up times for individuals who do not experience the event by the end of the study (called censored) are accounted for in the analysis. Cox regression is the standard method for analysing censored data, but the assumptions required of these models are easily violated. In this paper, we introduce classification tree analysis (CTA) as a flexible alternative for modelling censored data. Classification tree analysis is a "decision-tree"-like classification model that provides parsimonious, transparent (ie, easy to visually display and interpret) decision rules that maximize predictive accuracy, derives exact P values via permutation tests, and evaluates model cross-generalizability. Using empirical data, we identify all statistically valid, reproducible, longitudinally consistent, and cross-generalizable CTA survival models and then compare their predictive accuracy to estimates derived via Cox regression and an unadjusted naïve model. Model performance is assessed using integrated Brier scores and a comparison between estimated survival curves. The Cox regression model best predicts average incidence of the outcome over time, whereas CTA survival models best predict either relatively high, or low, incidence of the outcome over time. Classification tree analysis survival models offer many advantages over Cox regression, such as explicit maximization of predictive accuracy, parsimony, statistical robustness, and transparency. Therefore, researchers interested in accurate prognoses and clear decision rules should consider developing models using the CTA-survival framework. © 2017 John Wiley & Sons, Ltd.
Xu, Wenjun; Chen, Jie; Lau, Henry Y K; Ren, Hongliang
2017-09-01
Accurate motion control of flexible surgical manipulators is crucial in tissue manipulation tasks. The tendon-driven serpentine manipulator (TSM) is one of the most widely adopted flexible mechanisms in minimally invasive surgery because of its enhanced maneuverability in tortuous environments. The TSM, however, exhibits high nonlinearities, and the conventional analytical kinematics model is insufficient to achieve high accuracy. To account for the system nonlinearities, we applied a data-driven approach to encode the system inverse kinematics. Three regression methods, extreme learning machine (ELM), Gaussian mixture regression (GMR) and K-nearest neighbors regression (KNNR), were implemented to learn a nonlinear mapping from the robot 3D position states to the control inputs. The performance of the three algorithms was evaluated both in simulation and in physical trajectory tracking experiments. KNNR performed the best in the tracking experiments, with the lowest RMSE of 2.1275 mm. The proposed inverse kinematics learning methods provide an alternative and efficient way to accurately model the tendon-driven flexible manipulator. Copyright © 2016 John Wiley & Sons, Ltd.
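The idea of learning inverse kinematics with K-nearest neighbors regression can be sketched in a few lines of Python. This is a generic, unweighted KNN regressor under our own assumptions, not the authors' implementation: the function names, the choice of Euclidean distance, the default k, and the toy data are all hypothetical.

```python
import math


def knn_regress(train_positions, train_controls, query_position, k=3):
    """Predict control inputs for a 3D position as the mean of the
    control inputs recorded at the k nearest training positions."""
    if not train_positions or len(train_positions) != len(train_controls):
        raise ValueError("training positions and controls must match and be non-empty")
    # Rank training samples by Euclidean distance to the query position.
    order = sorted(
        range(len(train_positions)),
        key=lambda i: math.dist(train_positions[i], query_position),
    )
    nearest = order[:k]
    n_outputs = len(train_controls[0])
    return [
        sum(train_controls[i][j] for i in nearest) / len(nearest)
        for j in range(n_outputs)
    ]


def rmse(predicted, target):
    """Root-mean-square error between two equal-length sequences of numbers."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target))


if __name__ == "__main__":
    # Toy data (hypothetical): positions in mm, one control input per sample.
    positions = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (5, 5, 5)]
    controls = [[0.0], [1.0], [2.0], [10.0]]
    print(knn_regress(positions, controls, (0.1, 0.1, 0.0), k=3))
```

Because KNN simply averages recorded samples, it needs no explicit kinematic model, which is the property that makes it attractive for a strongly nonlinear mechanism.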
NASA Astrophysics Data System (ADS)
Bloomfield, J. P.; Allen, D. J.; Griffiths, K. J.
2009-06-01
Linear regression methods can be used to quantify geological controls on baseflow index (BFI). This is illustrated using an example from the Thames Basin, UK. Two approaches have been adopted. The areal extents of geological classes based on lithostratigraphic and hydrogeological classification schemes have been correlated with BFI for 44 'natural' catchments from the Thames Basin. When regression models are built using lithostratigraphic classes that include a constant term then the model is shown to have some physical meaning and the relative influence of the different geological classes on BFI can be quantified. For example, the regression constants for two such models, 0.64 and 0.69, are consistent with the mean observed BFI (0.65) for the Thames Basin, and the signs and relative magnitudes of the regression coefficients for each of the lithostratigraphic classes are consistent with the hydrogeology of the Basin. In addition, regression coefficients for the lithostratigraphic classes scale linearly with estimates of log10 hydraulic conductivity for each lithological class. When a regression is built using a hydrogeological classification scheme with no constant term, the model does not have any physical meaning, but it has a relatively high adjusted R2 value and because of the continuous coverage of the hydrogeological classification scheme, the model can be used for predictive purposes. A model calibrated on the 44 'natural' catchments and using four hydrogeological classes (low-permeability surficial deposits, consolidated aquitards, fractured aquifers and intergranular aquifers) is shown to perform as well as a model based on a hydrology of soil types (BFIHOST) scheme in predicting BFI in the Thames Basin. Validation of this model using 110 other 'variably impacted' catchments in the Basin shows that there is a correlation between modelled and observed BFI. 
Where the observed BFI is significantly higher than modelled BFI, the deviations can be explained by an exogenous factor, catchment urban area. It is inferred that this may be due to influences from sewage discharge, mains leakage, and leakage from septic tanks.
Wong, Man Sing; Ho, Hung Chak; Yang, Lin; Shi, Wenzhong; Yang, Jinxin; Chan, Ta-Chien
2017-07-24
Dust events have long been recognized to be associated with a higher mortality risk. However, no study has investigated how prolonged dust events affect the spatial variability of mortality across districts in a downwind city. In this study, we applied a spatial regression approach to estimate the district-level mortality during two extreme dust events in Hong Kong. We compared spatial and non-spatial models to evaluate the ability of each regression to estimate mortality. We also compared prolonged dust events with non-dust events to determine the influences of community factors on mortality across the city. The density of the built environment (estimated by the sky view factor) had a positive association with excess mortality in each district, while socioeconomic deprivation arising from lower income and lower education induced a higher mortality impact in each territory planning unit during a prolonged dust event. Based on the model comparison, spatial error modelling with first-order queen contiguity consistently outperformed other models. The high-risk areas with a higher increase in mortality were located in an urban high-density environment with higher socioeconomic deprivation. Our model design shows the ability to predict spatial variability of mortality risk during an extreme weather event, which cannot be estimated with traditional time-series analysis or ecological studies. Our spatial protocol can be used for public health surveillance, sustainable planning and disaster preparation when relevant data are available.
Binder, Harald; Porzelius, Christine; Schumacher, Martin
2011-03-01
Analysis of molecular data promises identification of biomarkers for improving prognostic models, thus potentially enabling better patient management. For identifying such biomarkers, risk prediction models can be employed that link high-dimensional molecular covariate data to a clinical endpoint. In low-dimensional settings, a multitude of statistical techniques already exists for building such models, e.g. allowing for variable selection or for quantifying the added value of a new biomarker. We provide an overview of techniques for regularized estimation that transfer this toward high-dimensional settings, with a focus on models for time-to-event endpoints. Techniques for incorporating specific covariate structure are discussed, as well as techniques for dealing with more complex endpoints. Employing gene expression data from patients with diffuse large B-cell lymphoma, some typical modeling issues from low-dimensional settings are illustrated in a high-dimensional application. First, the performance of classical stepwise regression is compared to stage-wise regression, as implemented by a component-wise likelihood-based boosting approach. A second issue arises when the response is artificially transformed into a binary variable. The effects of the resulting loss of efficiency and potential bias in a high-dimensional setting are illustrated, and a link to competing risks models is provided. Finally, we discuss conditions for adequately quantifying the added value of high-dimensional gene expression measurements, both at the stage of model fitting and when performing evaluation. Copyright © 2011 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Ito, Yukiko; Hattori, Reiko; Mase, Hiroki; Watanabe, Masako; Shiotani, Itaru
2008-12-01
Pollen information is indispensable for allergic individuals and clinicians. This study aimed to develop forecasting models for the total annual count of airborne pollen grains based on data monitored over the last 20 years at the Mie Chuo Medical Center, Tsu, Mie, Japan. Airborne pollen grains were collected using a Durham sampler. Total annual pollen count and pollen count from October to December (OD pollen count) of the previous year were transformed to logarithms. Regression analysis of the total pollen count was performed using variables such as the OD pollen count and the maximum temperature for mid-July of the previous year. Time series analysis revealed an alternate rhythm of the series of total pollen count. The alternate rhythm consisted of a cyclic alternation of an "on" year (high pollen count) and an "off" year (low pollen count). This rhythm was used as a dummy variable in regression equations. Of the three models involving the OD pollen count, a multiple regression equation that included the alternate rhythm variable and the interaction of this rhythm with OD pollen count showed a high coefficient of determination (0.844). Of the three models involving the maximum temperature for mid-July, those including the alternate rhythm variable and the interaction of this rhythm with maximum temperature had the highest coefficient of determination (0.925). An alternate pollen dispersal rhythm represented by a dummy variable in the multiple regression analysis plays a key role in improving forecasting models for the total annual sugi pollen count.
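The model structure described above, with the alternate rhythm as a dummy variable and its interaction with the predictor, could be written as follows. This is an assumed form for illustration only: the symbols are ours, the abstract reports neither the coefficients nor the base of the logarithm, and the error term is a standard regression assumption.

```latex
% Y_t     : total annual pollen count in year t
% X_{t-1} : OD pollen count (October to December) of the previous year
% D_t     : alternate-rhythm dummy (1 for an "on" year, 0 for an "off" year)
\log Y_t = \beta_0 + \beta_1 \log X_{t-1} + \beta_2 D_t
         + \beta_3 \, D_t \log X_{t-1} + \varepsilon_t
```

The interaction term lets the slope on the OD pollen count differ between "on" and "off" years. The temperature-based models would have the same shape, with the mid-July maximum temperature of the previous year in place of \log X_{t-1}.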
Simple to complex modeling of breathing volume using a motion sensor.
John, Dinesh; Staudenmayer, John; Freedson, Patty
2013-06-01
To compare simple and complex modeling techniques to estimate categories of low, medium, and high ventilation (VE) from ActiGraph™ activity counts. Vertical axis ActiGraph™ GT1M activity counts, oxygen consumption and VE were measured during treadmill walking and running, sports, household chores and labor-intensive employment activities. Categories of low (<19.3 l/min), medium (19.3 to 35.4 l/min) and high (>35.4 l/min) VEs were derived from activity intensity classifications (light <2.9 METs, moderate 3.0 to 5.9 METs and vigorous >6.0 METs). We examined the accuracy of two simple modeling techniques (multiple regression and activity count cut-point analyses) and one complex modeling technique (random forest) in predicting VE from activity counts. Prediction accuracy of the complex random forest technique was marginally better than that of the simple multiple regression method. Both techniques accurately predicted VE categories almost 80% of the time. The multiple regression and random forest techniques were more accurate (85 to 88%) in predicting medium VE. Both techniques predicted high VE (70 to 73%) with greater accuracy than low VE (57 to 60%). ActiGraph™ cut-points for low, medium and high VEs were <1381, 1381 to 3660 and >3660 cpm. There were minor differences in prediction accuracy between the multiple regression and the random forest technique. This study provides methods to objectively estimate VE categories using activity monitors that can easily be deployed in the field. Objective estimates of VE should provide a better understanding of the dose-response relationship between internal exposure to pollutants and disease. Copyright © 2013 Elsevier B.V. All rights reserved.
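The cut-point approach reported above is the simplest of the three techniques and can be sketched directly from the published thresholds. The function and constant names below are hypothetical; only the cut-point values (1381 and 3660 counts per minute) come from the abstract.

```python
# Cut-points reported in the abstract, in activity counts per minute (cpm).
LOW_UPPER_CPM = 1381   # counts below this value map to low VE
HIGH_LOWER_CPM = 3660  # counts above this value map to high VE


def ve_category(counts_per_minute: float) -> str:
    """Classify ventilation (VE) as 'low', 'medium' or 'high' from activity counts."""
    if counts_per_minute < LOW_UPPER_CPM:
        return "low"
    if counts_per_minute <= HIGH_LOWER_CPM:
        return "medium"
    return "high"


if __name__ == "__main__":
    for cpm in (500, 1381, 2500, 3660, 5000):
        print(cpm, ve_category(cpm))
```

Both boundary values are assigned to the medium category, following the abstract's ranges of "<1381", "1381 to 3660" and ">3660" cpm.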
Hermes, Ilarraza-Lomelí; Marianna, García-Saldivia; Jessica, Rojano-Castillo; Carlos, Barrera-Ramírez; Rafael, Chávez-Domínguez; María Dolores, Rius-Suárez; Pedro, Iturralde
2016-10-01
Mortality due to cardiovascular disease is often associated with ventricular arrhythmias. Nowadays, patients with cardiovascular disease are increasingly encouraged to take part in physical training programs. Nevertheless, high-intensity exercise is associated with a higher risk of sudden death, even in apparently healthy people. During exercise testing (ET), health care professionals provide patients, in a controlled scenario, with an intense physiological stimulus that could precipitate cardiac arrhythmia in high-risk individuals. There is still no clinical or statistical tool to predict this incidence. The aim of this study was to develop a statistical model to predict the incidence of exercise-induced potentially life-threatening ventricular arrhythmia (PLVA) during high-intensity exercise. A total of 6415 patients underwent a symptom-limited ET with a Balke ramp protocol. A multivariate logistic regression model was fitted with PLVA as the primary outcome. The incidence of PLVA was 548 cases (8.5%). In bivariate analysis, thirty-one clinical or ergometric variables were statistically associated with PLVA and were included in the regression model. In the multivariate model, 13 of these variables were found to be statistically significant. A regression model (G) with a χ² of 283.987 and p<0.001 was constructed. Significant variables included heart failure, antiarrhythmic drugs, myocardial lower-VD, age, and use of digoxin and nitrates, among others. This study allows clinicians to identify patients at risk of ventricular tachycardia or couplets during exercise, and to take preventive measures or provide appropriate supervision. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
Ma, Jing; Yu, Jiong; Hao, Guangshu; Wang, Dan; Sun, Yanni; Lu, Jianxin; Cao, Hongcui; Lin, Feiyan
2017-02-20
The prevalence of hyperlipemia is increasing around the world. Our aims were to analyze the relationship of triglyceride (TG) and cholesterol (TC) with indexes of liver function and kidney function, and to develop prediction models of TG and TC in overweight people. A total of 302 healthy adult subjects and 273 overweight subjects were enrolled in this study. The fasting levels of TG (fs-TG), TC (fs-TC), blood glucose, and indexes of liver function and kidney function were measured and analyzed by correlation analysis and multiple linear regression (MLR). A back propagation artificial neural network (BP-ANN) was applied to develop prediction models of fs-TG and fs-TC. The results showed significant differences in biochemical indexes between healthy people and overweight people. The correlation analysis showed that fs-TG was related to weight, height, blood glucose, and indexes of liver and kidney function, while fs-TC was correlated with age and indexes of liver function (P < 0.01). The MLR analysis indicated that the regression equations of fs-TG and fs-TC were both statistically significant (P < 0.01) when the independent indexes were included. The BP-ANN model of fs-TG reached its training goal at 59 epochs, while the fs-TC model achieved high prediction accuracy after training for 1000 epochs. In conclusion, fs-TG and fs-TC were strongly related to weight, height, age, blood glucose, and indexes of liver function and kidney function. Based on the related variables, fs-TG and fs-TC can be predicted by BP-ANN models in overweight people.
Regression analysis for solving diagnosis problem of children's health
NASA Astrophysics Data System (ADS)
Cherkashina, Yu A.; Gerget, O. M.
2016-04-01
This paper presents the results of research on the application of statistical techniques, namely regression analysis, to assess the health status of children in the neonatal period based on medical data (hemostatic parameters, blood test parameters, gestational age, and vascular endothelial growth factor) measured at 3-5 days of life. A detailed description of the studied medical data is given, and a binary logistic regression procedure is discussed. The main results of the research are presented. A classification table of predicted and observed values is shown, and the overall percentage of correct recognition is determined. The regression coefficients are calculated and the general regression equation is written from them. Based on the results of the logistic regression, ROC analysis was performed: the sensitivity and specificity of the model were calculated and ROC curves were constructed. These mathematical techniques allow the health of children to be diagnosed with a high quality of recognition. The results make a significant contribution to the development of evidence-based medicine and are of high practical importance in the professional activity of the author.
The Application of Censored Regression Models in Low Streamflow Analyses
NASA Astrophysics Data System (ADS)
Kroll, C.; Luz, J.
2003-12-01
Estimation of low streamflow statistics at gauged and ungauged river sites is often a daunting task. This process is further confounded by the presence of intermittent streamflows, where streamflow is sometimes reported as zero, within a region. Streamflows recorded as zero may be zero, or may be less than the measurement detection limit. Such data are often referred to as censored data. Numerous methods have been developed to characterize intermittent streamflow series. Logit regression has been proposed to develop regional models of the probability that annual lowflow series (such as 7-day lowflows) are zero. In addition, Tobit regression, a method of regression that allows for censored dependent variables, has been proposed for lowflow regional regression models in regions where the lowflow statistic of interest is estimated as zero at some sites in the region. While these methods have been proposed, their use in practice has been limited. Here a delete-one jackknife simulation is presented to examine the performance of Logit and Tobit models of 7-day annual minimum flows in 6 USGS water resource regions in the United States. For the Logit model, an assessment is made of whether sites are correctly classified as having at least 10% of 7-day annual lowflows equal to zero. In such a situation, the 7-day, 10-year lowflow (Q710), a commonly employed low streamflow statistic, would be reported as zero. For the Tobit model, a comparison is made between results from the Tobit model and results from performing either ordinary least squares (OLS) or principal component regression (PCR) after the zero sites are dropped from the analysis. Initial results for the Logit model indicate that this method has a high probability of correctly classifying sites into groups with zero and non-zero Q710s. Initial results also indicate that the Tobit model produces better results than PCR and OLS when more than 5% of the sites in the region have Q710 values calculated as zero.
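The classification target used for the Logit model above, whether a site has at least 10% of its 7-day annual lowflows equal to zero, can be sketched as a small Python helper. The function names and the example record are hypothetical; only the 10% criterion and its link to a zero Q710 come from the abstract.

```python
def fraction_zero(annual_min_flows):
    """Return the fraction of annual 7-day minimum flows recorded as zero."""
    if not annual_min_flows:
        raise ValueError("at least one annual minimum flow is required")
    zeros = sum(1 for q in annual_min_flows if q == 0)
    return zeros / len(annual_min_flows)


def q710_reported_as_zero(annual_min_flows, threshold=0.10):
    """True when at least `threshold` of the annual lowflows are zero,
    the situation in which the abstract says Q710 would be reported as zero."""
    return fraction_zero(annual_min_flows) >= threshold


if __name__ == "__main__":
    # Hypothetical 10-year record of 7-day annual minimum flows (one zero year).
    record = [0.0, 1.2, 0.8, 2.5, 1.1, 0.9, 3.0, 1.7, 0.6, 2.2]
    print(fraction_zero(record), q710_reported_as_zero(record))
```

A regional Logit model would then be fitted to this binary label across sites. That fitting step is not shown here because the abstract does not list the predictors used.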
Geodesic regression on orientation distribution functions with its application to an aging study.
Du, Jia; Goh, Alvina; Kushnarev, Sergey; Qiu, Anqi
2014-02-15
In this paper, we treat orientation distribution functions (ODFs) derived from high angular resolution diffusion imaging (HARDI) as elements of a Riemannian manifold and present a method for geodesic regression on this manifold. In order to find the optimal regression model, we pose this as a least-squares problem involving the sum-of-squared geodesic distances between observed ODFs and their model fitted data. We derive the appropriate gradient terms and employ gradient descent to find the minimizer of this least-squares optimization problem. In addition, we show how to perform statistical testing for determining the significance of the relationship between the manifold-valued regressors and the real-valued regressands. Experiments on both synthetic and real human data are presented. In particular, we examine aging effects on HARDI via geodesic regression of ODFs in normal adults aged 22 years old and above. © 2013 Elsevier Inc. All rights reserved.
Rupert, Michael G.; Cannon, Susan H.; Gartner, Joseph E.; Michael, John A.; Helsel, Dennis R.
2008-01-01
Logistic regression was used to develop statistical models that can be used to predict the probability of debris flows in areas recently burned by wildfires by using data from 14 wildfires that burned in southern California during 2003-2006. Twenty-eight independent variables describing the basin morphology, burn severity, rainfall, and soil properties of 306 drainage basins located within those burned areas were evaluated. The models were developed as follows: (1) Basins that did and did not produce debris flows soon after the 2003 to 2006 fires were delineated from data in the National Elevation Dataset using a geographic information system; (2) Data describing the basin morphology, burn severity, rainfall, and soil properties were compiled for each basin. These data were then input to a statistics software package for analysis using logistic regression; and (3) Relations between the occurrence or absence of debris flows and the basin morphology, burn severity, rainfall, and soil properties were evaluated, and five multivariate logistic regression models were constructed. All possible combinations of independent variables were evaluated to determine which combinations produced the most effective models, and the multivariate models that best predicted the occurrence of debris flows were identified. Percentage of high burn severity and 3-hour peak rainfall intensity were significant variables in all models. Soil organic matter content and soil clay content were significant variables in all models except Model 5. Soil slope was a significant variable in all models except Model 4. The most suitable model can be selected from these five models on the basis of the availability of independent variables in the particular area of interest and field checking of probability maps. 
The multivariate logistic regression models can be entered into a geographic information system, and maps showing the probability of debris flows can be constructed in recently burned areas of southern California. This study demonstrates that logistic regression is a valuable tool for developing models that predict the probability of debris flows occurring in recently burned landscapes.
Rovadoscki, Gregori A; Petrini, Juliana; Ramirez-Diaz, Johanna; Pertile, Simone F N; Pertille, Fábio; Salvian, Mayara; Iung, Laiza H S; Rodriguez, Mary Ana P; Zampar, Aline; Gaya, Leila G; Carvalho, Rachel S B; Coelho, Antonio A D; Savino, Vicente J M; Coutinho, Luiz L; Mourão, Gerson B
2016-09-01
Repeated measures from the same individual have been analyzed by using repeatability and finite dimension models under univariate or multivariate analyses. However, in the last decade, the use of random regression models for genetic studies with longitudinal data has become more common. Thus, the aim of this research was to estimate genetic parameters for body weight of four experimental chicken lines by using univariate random regression models. Body weight data from hatching to 84 days of age (n = 34,730) from four experimental free-range chicken lines (7P, Caipirão da ESALQ, Caipirinha da ESALQ and Carijó Barbado) were used. The analysis model included the fixed effects of contemporary group (gender and rearing system), fixed regression coefficients for age at measurement, and random regression coefficients for permanent environmental effects and additive genetic effects. Heterogeneous variances for residual effects were considered, and one residual variance was assigned for each of six subclasses of age at measurement. Random regression curves were modeled by using Legendre polynomials of the second and third orders, with the best model chosen based on the Akaike Information Criterion, Bayesian Information Criterion, and restricted maximum likelihood. Multivariate analyses under the same animal mixed model were also performed for the validation of the random regression models. The Legendre polynomials of second order were better for describing the growth curves of the lines studied. Moderate to high heritabilities (h² = 0.15 to 0.98) were estimated for body weight between one and 84 days of age, suggesting that body weight at all ages can be used as a selection criterion. Genetic correlations among body weight records obtained through multivariate analyses ranged from 0.18 to 0.96, 0.12 to 0.89, 0.06 to 0.96, and 0.28 to 0.96 in 7P, Caipirão da ESALQ, Caipirinha da ESALQ, and Carijó Barbado chicken lines, respectively. 
Results indicate that genetic gain for body weight can be achieved by selection. Also, selection for body weight at 42 days of age can be maintained as a selection criterion. © 2016 Poultry Science Association Inc.
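The covariates of a random regression model of this kind can be sketched as follows: age is mapped onto [-1, 1] and Legendre polynomials are evaluated up to the chosen order. This is only an illustrative sketch. The function names, the age range of 0 to 84 days, and the use of unnormalized polynomials are assumptions; the abstract does not state which normalization the authors used.

```python
def standardize_age(age, age_min=0.0, age_max=84.0):
    """Map an age in days onto [-1, 1] (age range is an assumption)."""
    return 2.0 * (age - age_min) / (age_max - age_min) - 1.0


def legendre_basis(x, order):
    """Return [P_0(x), ..., P_order(x)] using Bonnet's recurrence."""
    values = [1.0, x]
    for n in range(1, order):
        # (n + 1) * P_{n+1}(x) = (2n + 1) * x * P_n(x) - n * P_{n-1}(x)
        values.append(((2 * n + 1) * x * values[n] - n * values[n - 1]) / (n + 1))
    return values[: order + 1]


# Second-order covariates for a bird weighed at 42 days of age.
covariates = legendre_basis(standardize_age(42.0), order=2)
```

Each animal's additive genetic and permanent environmental deviations would then be modeled as linear combinations of these covariates, with one random coefficient per polynomial term.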
Phung, Dung; Huang, Cunrui; Rutherford, Shannon; Chu, Cordia; Wang, Xiaoming; Nguyen, Minh; Nguyen, Nga Huy; Manh, Cuong Do
2015-01-01
The Mekong Delta is highly vulnerable to climate change and a dengue endemic area in Vietnam. This study aims to examine the association between climate factors and dengue incidence and to identify the best climate prediction model for dengue incidence in Can Tho city, the Mekong Delta area in Vietnam. We used three different regression models comprising: standard multiple regression model (SMR), seasonal autoregressive integrated moving average model (SARIMA), and Poisson distributed lag model (PDLM) to examine the association between climate factors and dengue incidence over the period 2003-2010. We validated the models by forecasting dengue cases for the period of January-December, 2011 using the mean absolute percentage error (MAPE). Receiver operating characteristics curves were used to analyze the sensitivity of the forecast of a dengue outbreak. The results indicate that temperature and relative humidity, but not cumulative rainfall, are significantly associated with changes in dengue incidence consistently across the modeling methods used. The Poisson distributed lag model (PDLM) gave the best prediction of dengue incidence for 6-, 9-, and 12-month periods and the best diagnosis of an outbreak; however, the SARIMA model gave a better prediction of dengue incidence for a 3-month period. The standard multiple regression model gave highly imprecise predictions of dengue incidence. We recommend a follow-up study to validate the model on a larger scale in the Mekong Delta region and to analyze the possibility of incorporating a climate-based dengue early warning method into the national dengue surveillance system. Copyright © 2014 Elsevier B.V. All rights reserved.
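The forecast validation metric named above, the mean absolute percentage error, can be written in a few lines. The sketch below uses the usual textbook definition; the function name and the example case counts are hypothetical and are not taken from the study.

```python
def mape(observed, forecast):
    """Mean absolute percentage error, in percent."""
    if len(observed) != len(forecast) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    # Each term is undefined when the observed value is zero.
    terms = [abs((o - f) / o) for o, f in zip(observed, forecast)]
    return 100.0 * sum(terms) / len(terms)


# Hypothetical monthly case counts and forecasts.
error_percent = mape([100, 200], [110, 180])
```

A lower MAPE over the hold-out months indicates a better forecast, which is how the three models would be ranked under this criterion.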
A new model for estimating total body water from bioelectrical resistance
NASA Technical Reports Server (NTRS)
Siconolfi, S. F.; Kear, K. T.
1992-01-01
Estimation of total body water (T) from bioelectrical resistance (R) is commonly done by stepwise regression models with height squared over R (H²/R), age, sex, and weight (W). Polynomials of H²/R have not been included in these models. We examined the validity of a model with third order polynomials and W. Methods: T was measured with oxygen-18 labeled water in 27 subjects. R at 50 kHz was obtained from electrodes placed on the hand and foot while subjects were in the supine position. A stepwise regression equation was developed with 13 subjects (age 31.5 ± 6.2 years, T 38.2 ± 6.6 L, W 65.2 ± 12.0 kg). Correlations, standard errors of estimate and mean differences were computed between T and estimated T's from the new (N) model and other models. Evaluations were completed with the remaining 14 subjects (age 32.4 ± 6.3 years, T 40.3 ± 8 L, W 70.2 ± 12.3 kg) and two of its subgroups (high and low). Results: A regression equation was developed from the model. The only significant mean difference was between T and one of the earlier models. Conclusion: Third order polynomials in regression models may increase the accuracy of estimating total body water. Evaluating the model with a larger population is needed.
NASA Astrophysics Data System (ADS)
Hassanzadeh, S.; Hosseinibalam, F.; Omidvari, M.
2008-04-01
Data of seven meteorological variables (relative humidity, wet temperature, dry temperature, maximum temperature, minimum temperature, ground temperature and sun radiation time) and ozone values have been used for statistical analysis. Meteorological variables and ozone values were analyzed using both multiple linear regression and principal component methods. Data for the period 1999-2004 are analyzed jointly using both methods. For all periods, temperature dependent variables were highly correlated, but were all negatively correlated with relative humidity. Multiple regression analysis was used to fit the meteorological variables using the meteorological variables as predictors. A variable selection method based on high loading of varimax rotated principal components was used to obtain subsets of the predictor variables to be included in the linear regression model of the meteorological variables. In 1999, 2001 and 2002 one of the meteorological variables was weakly influenced predominantly by the ozone concentrations. However, the model did not predict that the meteorological variables for the year 2000 were not influenced predominantly by the ozone concentrations that point to variation in sun radiation. This could be due to other factors that were not explicitly considered in this study.
Logistic regression model for diagnosis of transition zone prostate cancer on multi-parametric MRI.
Dikaios, Nikolaos; Alkalbani, Jokha; Sidhu, Harbir Singh; Fujiwara, Taiki; Abd-Alazeez, Mohamed; Kirkham, Alex; Allen, Clare; Ahmed, Hashim; Emberton, Mark; Freeman, Alex; Halligan, Steve; Taylor, Stuart; Atkinson, David; Punwani, Shonit
2015-02-01
We aimed to develop logistic regression (LR) models for classifying prostate cancer within the transition zone on multi-parametric magnetic resonance imaging (mp-MRI). One hundred and fifty-five patients (training cohort, 70 patients; temporal validation cohort, 85 patients) underwent mp-MRI and transperineal-template-prostate-mapping (TPM) biopsy. Positive cores were classified by cancer definitions: (1) any-cancer; (2) definition-1 [≥Gleason 4 + 3 or ≥ 6 mm cancer core length (CCL)] [high risk significant]; and (3) definition-2 (≥Gleason 3 + 4 or ≥ 4 mm CCL) cancer [intermediate-high risk significant]. For each, logistic-regression mp-MRI models were derived from the training cohort and validated internally and with the temporal cohort. Sensitivity/specificity and the area under the receiver operating characteristic (ROC-AUC) curve were calculated. LR model performance was compared to radiologists' performance. Twenty-eight of 70 patients from the training cohort, and 25/85 patients from the temporal validation cohort had significant cancer on TPM. The ROC-AUC of the LR model for classification of cancer was 0.73/0.67 at internal/temporal validation. The radiologist A/B ROC-AUC was 0.65/0.74 (temporal cohort). For patients scored by radiologists as Prostate Imaging Reporting and Data System (Pi-RADS) score 3, sensitivity/specificity of radiologist A 'best guess' and LR model was 0.14/0.54 and 0.71/0.61, respectively; and radiologist B 'best guess' and LR model was 0.40/0.34 and 0.50/0.76, respectively. LR models can improve classification of Pi-RADS score 3 lesions similar to experienced radiologists. • MRI helps find prostate cancer in the anterior of the gland • Logistic regression models based on mp-MRI can classify prostate cancer • Computers can help confirm cancer in areas doctors are uncertain about.
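The performance measures reported above (sensitivity, specificity and ROC-AUC) can be computed from labels and model scores as in the following sketch. The function names and the toy data are hypothetical. The AUC is computed by the pairwise-comparison (Mann-Whitney) definition, which may differ from the software the authors used.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true, scores):
    """Probability that a random positive case outscores a random negative one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # Ties between a positive and a negative score count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Sensitivity and specificity depend on the probability threshold used to turn scores into predictions, whereas the AUC summarizes performance across all thresholds.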
Modeling and managing risk early in software development
NASA Technical Reports Server (NTRS)
Briand, Lionel C.; Thomas, William M.; Hetmanski, Christopher J.
1993-01-01
In order to improve the quality of the software development process, we need to be able to build empirical multivariate models based on data collectable early in the software process. These models need to be both useful for prediction and easy to interpret, so that remedial actions may be taken in order to control and optimize the development process. We present an automated modeling technique which can be used as an alternative to regression techniques. We show how it can be used to facilitate the identification and aid the interpretation of the significant trends which characterize 'high risk' components in several Ada systems. Finally, we evaluate the effectiveness of our technique based on a comparison with logistic regression based models.
Parametric regression model for survival data: Weibull regression model as an example
2016-01-01
The Weibull regression model is one of the most popular parametric regression models because it provides an estimate of the baseline hazard function as well as coefficients for covariates. Because of technical difficulties, the Weibull regression model is seldom used in the medical literature compared with the semi-parametric proportional hazards model. To make clinical investigators familiar with the Weibull regression model, this article introduces some basic knowledge of the model and then illustrates how to fit it with R software. The SurvRegCensCov package is useful for converting estimated coefficients to clinically relevant statistics such as the hazard ratio (HR) and event time ratio (ETR). Model adequacy can be assessed by inspecting Kaplan-Meier curves stratified by a categorical variable. The eha package provides an alternative method for fitting the Weibull regression model. The check.dist() function helps to assess the goodness-of-fit of the model. Variable selection is based on the importance of a covariate, which can be tested using the anova() function. Alternatively, backward elimination starting from a full model is an efficient way to develop a model. Visualizing the Weibull regression model after model development is worthwhile because it provides another way to report findings. PMID:28149846
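The conversion from fitted coefficients to HR and ETR can be illustrated with a short Python sketch. It assumes the accelerated failure time parameterization log T = mu + gamma * x + sigma * W (the form reported by R's survreg), under which shape = 1/sigma, ETR = exp(gamma) and HR = exp(-gamma / sigma). The function names are hypothetical, and this is not the code of the SurvRegCensCov package.

```python
import math


def convert_weibull_aft(mu, gamma, sigma):
    """Convert AFT-style Weibull estimates to shape, scale, HR and ETR."""
    shape = 1.0 / sigma
    return {
        "shape": shape,
        "scale": math.exp(mu),
        "hazard_ratio": math.exp(-gamma * shape),  # per unit increase in x
        "event_time_ratio": math.exp(gamma),       # per unit increase in x
    }


def weibull_hazard(t, shape, scale):
    """Baseline hazard h(t) = (shape / scale) * (t / scale)**(shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)
```

A positive gamma lengthens event times (ETR above 1) and therefore lowers the hazard (HR below 1), which is why the two statistics move in opposite directions.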
Lacagnina, Valerio; Leto-Barone, Maria S; La Piana, Simona; Seidita, Aurelio; Pingitore, Giuseppe; Di Lorenzo, Gabriele
2014-01-01
This article uses the logistic regression model for diagnostic decision making in patients with chronic nasal symptoms. We studied the ability of the logistic regression model, obtained by the evaluation of a database, to detect patients with positive allergy skin-prick test (SPT) and patients with negative SPT. The model developed was validated using the data set obtained from another medical institution. The analysis was performed using a database obtained from a questionnaire administered to the patients with nasal symptoms containing personal data, clinical data, and results of allergy testing (SPT). All variables found to be significantly different between patients with positive and negative SPT (p < 0.05) were selected for the logistic regression models and were analyzed with backward stepwise logistic regression, evaluated with area under the curve of the receiver operating characteristic curve. A second set of patients from another institution was used to prove the model. The accuracy of the model in identifying, over the second set, both patients whose SPT will be positive and negative was high. The model detected 96% of patients with nasal symptoms and positive SPT and classified 94% of those with negative SPT. This study is preliminary to the creation of a software that could help the primary care doctors in a diagnostic decision making process (need of allergy testing) in patients complaining of chronic nasal symptoms.
Smith, S. Jerrod; Lewis, Jason M.; Graves, Grant M.
2015-09-28
Generalized-least-squares multiple-linear regression analysis was used to formulate regression relations between peak-streamflow frequency statistics and basin characteristics. Contributing drainage area was the only basin characteristic determined to be statistically significant for all percentage of annual exceedance probabilities and was the only basin characteristic used in regional regression equations for estimating peak-streamflow frequency statistics on unregulated streams in and near the Oklahoma Panhandle. The regression model pseudo-coefficient of determination, converted to percent, for the Oklahoma Panhandle regional regression equations ranged from about 38 to 63 percent. The standard errors of prediction and the standard model errors for the Oklahoma Panhandle regional regression equations ranged from about 84 to 148 percent and from about 76 to 138 percent, respectively. These errors were comparable to those reported for regional peak-streamflow frequency regression equations for the High Plains areas of Texas and Colorado. The root mean square errors for the Oklahoma Panhandle regional regression equations (ranging from 3,170 to 92,000 cubic feet per second) were less than the root mean square errors for the Oklahoma statewide regression equations (ranging from 18,900 to 412,000 cubic feet per second); therefore, the Oklahoma Panhandle regional regression equations produce more accurate peak-streamflow statistic estimates for the irrigated period of record in the Oklahoma Panhandle than do the Oklahoma statewide regression equations. The regression equations developed in this report are applicable to streams that are not substantially affected by regulation, impoundment, or surface-water withdrawals. These regression equations are intended for use for stream sites with contributing drainage areas less than or equal to about 2,060 square miles, the maximum value for the independent variable used in the regression analysis.
2018-01-01
Objective The objective of this study was to estimate genetic parameters of milk, fat, and protein yields within and across lactations in Tunisian Holsteins using a random regression test-day (TD) model. Methods A random regression multiple trait multiple lactation TD model was used to estimate genetic parameters in the Tunisian dairy cattle population. Data were TD yields of milk, fat, and protein from the first three lactations. Random regressions were modeled with third-order Legendre polynomials for the additive genetic, and permanent environment effects. Heritabilities, and genetic correlations were estimated by Bayesian techniques using the Gibbs sampler. Results All variance components tended to be high in the beginning and the end of lactations. Additive genetic variances for milk, fat, and protein yields were the lowest and were the least variable compared to permanent variances. Heritability values tended to increase with parity. Estimates of heritabilities for 305-d yield-traits were low to moderate, 0.14 to 0.2, 0.12 to 0.17, and 0.13 to 0.18 for milk, fat, and protein yields, respectively. Within-parity, genetic correlations among traits were up to 0.74. Genetic correlations among lactations for the yield traits were relatively high and ranged from 0.78±0.01 to 0.82±0.03, between the first and second parities, from 0.73±0.03 to 0.8±0.04 between the first and third parities, and from 0.82±0.02 to 0.84±0.04 between the second and third parities. Conclusion These results are comparable to previously reported estimates on the same population, indicating that the adoption of a random regression TD model as the official genetic evaluation for production traits in Tunisia, as developed by most Interbull countries, is possible in the Tunisian Holsteins. PMID:28823122
Ben Zaabza, Hafedh; Ben Gara, Abderrahmen; Rekik, Boulbaba
2018-05-01
The objective of this study was to estimate genetic parameters of milk, fat, and protein yields within and across lactations in Tunisian Holsteins using a random regression test-day (TD) model. A random regression multiple trait multiple lactation TD model was used to estimate genetic parameters in the Tunisian dairy cattle population. Data were TD yields of milk, fat, and protein from the first three lactations. Random regressions were modeled with third-order Legendre polynomials for the additive genetic, and permanent environment effects. Heritabilities, and genetic correlations were estimated by Bayesian techniques using the Gibbs sampler. All variance components tended to be high in the beginning and the end of lactations. Additive genetic variances for milk, fat, and protein yields were the lowest and were the least variable compared to permanent variances. Heritability values tended to increase with parity. Estimates of heritabilities for 305-d yield-traits were low to moderate, 0.14 to 0.2, 0.12 to 0.17, and 0.13 to 0.18 for milk, fat, and protein yields, respectively. Within-parity, genetic correlations among traits were up to 0.74. Genetic correlations among lactations for the yield traits were relatively high and ranged from 0.78±0.01 to 0.82±0.03, between the first and second parities, from 0.73±0.03 to 0.8±0.04 between the first and third parities, and from 0.82±0.02 to 0.84±0.04 between the second and third parities. These results are comparable to previously reported estimates on the same population, indicating that the adoption of a random regression TD model as the official genetic evaluation for production traits in Tunisia, as developed by most Interbull countries, is possible in the Tunisian Holsteins.
Soil salt content estimation in the Yellow River delta with satellite hyperspectral data
Weng, Yongling; Gong, Peng; Zhu, Zhi-Liang
2008-01-01
Soil salinization is one of the most common land degradation processes and is a severe environmental hazard. The primary objective of this study is to investigate the potential of predicting salt content in soils with hyperspectral data acquired with EO-1 Hyperion. Both partial least-squares regression (PLSR) and conventional multiple linear regression (MLR), such as stepwise regression (SWR), were tested as the prediction model. PLSR is commonly used to overcome the problem caused by high-dimensional and correlated predictors. Chemical analysis of 95 samples collected from the top layer of soils in the Yellow River delta area shows that salt content was high on average, and the dominant chemicals in the saline soil were NaCl and MgCl2. Multivariate models were established between soil contents and hyperspectral data. Our results indicate that the PLSR technique with laboratory spectral data has a strong prediction capacity. Spectral bands at 1487-1527, 1971-1991, 2032-2092, and 2163-2355 nm possessed large absolute values of regression coefficients, with the largest coefficient at 2203 nm. We obtained a root mean squared error (RMSE) for calibration (with 61 samples) of RMSEC = 0.753 (R² = 0.893) and a root mean squared error for validation (with 30 samples) of RMSEV = 0.574. The prediction model was applied on a pixel-by-pixel basis to a Hyperion reflectance image to yield a quantitative surface distribution map of soil salt content. The result was validated successfully from 38 sampling points. We obtained an RMSE estimate of 1.037 (R² = 0.784) for the soil salt content map derived by the PLSR model. The salinity map derived from the SWR model shows that the predicted value is higher than the true value. These results demonstrate that the PLSR method is a more suitable technique than stepwise regression for quantitative estimation of soil salt content in a large area. © 2008 CASI.
Space shuttle propulsion parameter estimation using optimal estimation techniques
NASA Technical Reports Server (NTRS)
1983-01-01
A regression analysis on tabular aerodynamic data provided a representative aerodynamic model for coefficient estimation. It also reduced the storage requirements for the "normal" model used to check out the estimation algorithms. The results of the regression analyses are presented. The computer routines for the filter portion of the estimation algorithm were developed, and the SRB predictive program was brought up on the computer. For the filter program, approximately 54 routines were developed. The routines were highly subsegmented to facilitate overlaying program segments within the partitioned storage space on the computer.
NASA Astrophysics Data System (ADS)
Salleh, Nur Hanim Mohd; Ali, Zalila; Noor, Norlida Mohd.; Baharum, Adam; Saad, Ahmad Ramli; Sulaiman, Husna Mahirah; Ahmad, Wan Muhamad Amir W.
2014-07-01
Polynomial regression is used to model a curvilinear relationship between a response variable and one or more predictor variables. It is a form of a least squares linear regression model that predicts a single response variable by decomposing the predictor variables into an nth order polynomial. In a curvilinear relationship, the number of extreme points of each curve is at most one fewer than the order of the polynomial. A quadratic model will have either a single maximum or minimum, whereas a cubic model can have both a relative maximum and a minimum. This study used quadratic modeling techniques to analyze the effects of environmental factors (temperature, relative humidity, and rainfall distribution) on the breeding of Aedes albopictus, a type of Aedes mosquito. Data were collected at an urban area in south-west Penang from September 2010 until January 2011. The results indicated that the breeding of Aedes albopictus in the urban area is influenced by all three environmental characteristics. The number of mosquito eggs is estimated to reach a maximum value at a medium temperature, a medium relative humidity and a high rainfall distribution.
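The statement that a quadratic has a single maximum or minimum, while a cubic can have both, follows from setting the first derivative of the fitted polynomial to zero. The sketch below locates those points from already-estimated coefficients. The function names and coefficient values are hypothetical and are not results from the study.

```python
import math


def quadratic_extremum(b0, b1, b2):
    """Stationary point of y = b0 + b1*x + b2*x**2 (requires b2 != 0)."""
    x = -b1 / (2.0 * b2)              # solves dy/dx = b1 + 2*b2*x = 0
    y = b0 + b1 * x + b2 * x * x
    kind = "maximum" if b2 < 0 else "minimum"
    return x, y, kind


def cubic_extrema(b1, b2, b3):
    """x-locations of local extrema of y = b0 + b1*x + b2*x**2 + b3*x**3 (b3 != 0)."""
    # Discriminant of dy/dx = 3*b3*x**2 + 2*b2*x + b1.
    disc = 4.0 * b2 * b2 - 12.0 * b3 * b1
    if disc <= 0:
        return []  # monotone cubic: no local maximum or minimum
    root = math.sqrt(disc)
    return sorted([(-2.0 * b2 - root) / (6.0 * b3), (-2.0 * b2 + root) / (6.0 * b3)])
```

With a negative quadratic coefficient, the predictor value returned by quadratic_extremum is where the fitted response (for example, egg counts against temperature) is estimated to peak.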
Introduction to the use of regression models in epidemiology.
Bender, Ralf
2009-01-01
Regression modeling is one of the most important statistical techniques used in analytical epidemiology. By means of regression models the effect of one or several explanatory variables (e.g., exposures, subject characteristics, risk factors) on a response variable such as mortality or cancer can be investigated. From multiple regression models, adjusted effect estimates can be obtained that take the effect of potential confounders into account. Regression methods can be applied in all epidemiologic study designs so that they represent a universal tool for data analysis in epidemiology. Different kinds of regression models have been developed in dependence on the measurement scale of the response variable and the study design. The most important methods are linear regression for continuous outcomes, logistic regression for binary outcomes, Cox regression for time-to-event data, and Poisson regression for frequencies and rates. This chapter provides a nontechnical introduction to these regression models with illustrating examples from cancer research.
NASA Astrophysics Data System (ADS)
Nazeer, Majid; Bilal, Muhammad
2018-04-01
Landsat-5 Thematic Mapper (TM) data were used to estimate salinity in the coastal area of Hong Kong. Four adjacent Landsat TM images were used in this study and were atmospherically corrected using the Second Simulation of the Satellite Signal in the Solar Spectrum (6S) radiative transfer code. The atmospherically corrected images were then used to develop models for salinity using Ordinary Least Squares (OLS) regression and Geographically Weighted Regression (GWR) based on in situ data from October 2009. Results show that the coefficient of determination (R²) between the OLS-estimated and in situ measured salinity (R² = 0.42) is much lower than that of the GWR model, which is about two times higher (R² = 0.86). This indicates that the GWR model predicts salinity better than the OLS regression model and better captures its spatial heterogeneity. Salinity was high in Deep Bay (north-western part of Hong Kong), which might be due to industrial waste disposal, whereas salinity was estimated to be constant (32 practical salinity units) towards the open sea.
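The difference between OLS and GWR is that GWR refits the regression at each location, weighting observations by their distance from that location. The following is a minimal sketch of one such local fit with a single predictor. The Gaussian kernel, the fixed bandwidth, and the function name are assumptions for illustration; the abstract does not describe the kernel or bandwidth used in the study.

```python
import math


def gwr_local_fit(target_xy, coords, x, y, bandwidth):
    """Distance-weighted simple linear regression centred on target_xy.

    Returns (intercept, slope) of the local model y ~ intercept + slope * x.
    """
    # Gaussian distance-decay kernel: nearby observations get more weight.
    w = [math.exp(-0.5 * (math.dist(target_xy, c) / bandwidth) ** 2) for c in coords]
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return ybar - slope * xbar, slope
```

Calling the function once per pixel or station yields a map of local coefficients. As the bandwidth grows, all weights approach one and the local fit approaches the single global OLS fit.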
A Selective Review of Group Selection in High-Dimensional Models
Huang, Jian; Breheny, Patrick; Ma, Shuangge
2013-01-01
Grouping structures arise naturally in many statistical modeling problems. Several methods have been proposed for variable selection that respect grouping structure in variables. Examples include the group LASSO and several concave group selection methods. In this article, we give a selective review of group selection concerning methodological developments, theoretical properties and computational algorithms. We pay particular attention to group selection methods involving concave penalties. We address both group selection and bi-level selection methods. We describe several applications of these methods in nonparametric additive models, semiparametric regression, seemingly unrelated regressions, genomic data analysis and genome wide association studies. We also highlight some issues that require further study. PMID:24174707
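The way the group LASSO selects variables group by group can be seen in its proximal (group soft-thresholding) step, which shrinks a whole coefficient group toward zero and removes it entirely when its Euclidean norm falls below the penalty level. The sketch below shows that single step under the assumption of a plain penalty lam * ||beta_g||, without the group-size weights that are often included. The function name is hypothetical.

```python
import math


def group_soft_threshold(beta_group, lam):
    """Proximal operator of lam * ||beta_group||_2 applied to one group."""
    norm = math.sqrt(sum(b * b for b in beta_group))
    if norm <= lam:
        # The whole group is set to zero together: group-level selection.
        return [0.0] * len(beta_group)
    shrink = 1.0 - lam / norm
    return [shrink * b for b in beta_group]
```

Because every coefficient in a group is multiplied by the same factor, the group enters or leaves the model as a unit, in contrast to the ordinary LASSO, which thresholds each coefficient separately.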
NASA Astrophysics Data System (ADS)
Xie, Yanan; Zhou, Mingliang; Pan, Dengke
2017-10-01
The forward-scattering model is introduced to describe the response of the normalized radar cross section (NRCS) of precipitation observed with synthetic aperture radar (SAR). Since the distribution of near-surface rainfall is related to the near-surface rainfall rate and a horizontal distribution factor, a retrieval algorithm called modified regression empirical and model-oriented statistical (M-M), based on Volterra integration theory, is proposed. Compared with the model-oriented statistical and Volterra integration (MOSVI) algorithm, the biggest difference is that the M-M algorithm uses the modified regression empirical algorithm rather than the linear regression formula to retrieve the near-surface rainfall rate. The number of empirical parameters in the weighted integral is halved, and a smaller average relative error is obtained when the rainfall rate is less than 100 mm/h. Therefore, the algorithm proposed in this paper can obtain high-precision rainfall information.
The 2011 heat wave in Greater Houston: Effects of land use on temperature.
Zhou, Weihe; Ji, Shuang; Chen, Tsun-Hsuan; Hou, Yi; Zhang, Kai
2014-11-01
Effects of land use on temperatures during severe heat waves have been rarely studied. This paper examines land use-temperature associations during the 2011 heat wave in Greater Houston. We obtained high resolution of satellite-derived land use data from the US National Land Cover Database, and temperature observations at 138 weather stations from Weather Underground, Inc (WU) during the August of 2011, which was the hottest month in Houston since 1889. Land use regression and quantile regression methods were applied to the monthly averages of daily maximum/mean/minimum temperatures and 114 land use-related predictors. Although selected variables vary with temperature metric, distance to the coastline consistently appears among all models. Other variables are generally related to high developed intensity, open water or wetlands. In addition, our quantile regression analysis shows that distance to the coastline and high developed intensity areas have larger impacts on daily average temperatures at higher quantiles, and open water area has greater impacts on daily minimum temperatures at lower quantiles. By utilizing both land use regression and quantile regression on a recent heat wave in one of the largest US metropolitan areas, this paper provides a new perspective on the impacts of land use on temperatures. Our models can provide estimates of heat exposures for epidemiological studies, and our findings can be combined with demographic variables, air conditioning and relevant diseases information to identify 'hot spots' of population vulnerability for public health interventions to reduce heat-related health effects during heat waves. Copyright © 2014 Elsevier Inc. All rights reserved.
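Quantile regression differs from ordinary regression in the loss it minimizes: instead of squared error it uses the asymmetric "pinball" (check) loss, which penalizes under- and over-prediction differently depending on the target quantile tau. The sketch below evaluates that loss. The function name and example temperatures are hypothetical.

```python
def pinball_loss(y_true, y_pred, tau):
    """Average check loss for quantile level tau in (0, 1)."""
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        # Under-prediction costs tau per unit; over-prediction costs (1 - tau).
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(y_true)


# Hypothetical station temperatures (deg C) and fitted 0.9-quantile values.
loss_90 = pinball_loss([38.0, 41.0], [40.0, 36.0], tau=0.9)
```

Fitting the same predictors at several values of tau is what allows a predictor such as distance to the coastline to show a larger effect at the upper quantiles of temperature than at the lower ones.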
Theodoratou, Evropi; Farrington, Susan M; Tenesa, Albert; McNeill, Geraldine; Cetnarskyj, Roseanne; Korakakis, Emmanouil; Din, Farhat V N; Porteous, Mary E; Dunlop, Malcolm G; Campbell, Harry
2014-01-01
Colorectal cancer (CRC) accounts for 9.7% of all cancer cases and for 8% of all cancer-related deaths. Established risk factors include personal or family history of CRC as well as lifestyle and dietary factors. We investigated the relationship between CRC and demographic, lifestyle, food and nutrient risk factors through a case-control study that included 2062 patients and 2776 controls from Scotland. Forward and backward stepwise regression was applied and the stability of the models was assessed in 1000 bootstrap samples. The variables that were automatically selected to be included by the forward or backward stepwise regression and whose selection was verified by bootstrap sampling in the current study were family history, dietary energy, 'high-energy snack foods', eggs, juice, sugar-sweetened beverages and white fish (associated with an increased CRC risk) and NSAIDs, coffee and magnesium (associated with a decreased CRC risk). Application of forward and backward stepwise regression in this CRC study identified some already established as well as some novel potential risk factors. Bootstrap findings suggest that examination of the stability of regression models by bootstrap sampling is useful in the interpretation of study findings. 'High-energy snack foods' and high-energy drinks (including sugar-sweetened beverages and fruit juices) as risk factors for CRC have not been reported previously and merit further investigation as such snacks and beverages are important contributors in European and North American diets.
Meng, Xia; Fu, Qingyan; Ma, Zongwei; Chen, Li; Zou, Bin; Zhang, Yan; Xue, Wenbo; Wang, Jinnan; Wang, Dongfang; Kan, Haidong; Liu, Yang
2016-01-01
The development of exposure assessment models is a key component of epidemiological studies concerning air pollution, but the evidence from China is limited. Therefore, a linear mixed effects (LME) model was established in this study in a Chinese metropolis by incorporating aerosol optical depth (AOD), meteorological information and the land use regression (LUR) model to predict ground PM10 levels at high spatiotemporal resolution. The cross validation (CV) R² and the RMSE of the LME model were 0.87 and 19.2 μg/m³, respectively. The relative prediction errors (RPE) of daily and annual mean predicted PM10 concentrations were 19.1% and 7.5%, respectively. This study was the first attempt in China to estimate both short-term and long-term variation of PM10 levels with high spatial resolution in a Chinese metropolis with the LME model. The results suggested that the LME model could provide exposure assessment for short-term and long-term epidemiological studies in China. Copyright © 2015 Elsevier Ltd. All rights reserved.
Latin hypercube approach to estimate uncertainty in ground water vulnerability
Gurdak, J.J.; McCray, J.E.; Thyne, G.; Qi, S.L.
2007-01-01
A methodology is proposed to quantify prediction uncertainty associated with ground water vulnerability models that were developed through an approach that coupled multivariate logistic regression with a geographic information system (GIS). This method uses Latin hypercube sampling (LHS) to illustrate the propagation of input error and estimate uncertainty associated with the logistic regression predictions of ground water vulnerability. Central to the proposed method is the assumption that prediction uncertainty in ground water vulnerability models is a function of input error propagation from uncertainty in the estimated logistic regression model coefficients (model error) and the values of explanatory variables represented in the GIS (data error). Input probability distributions that represent both model and data error sources of uncertainty were simultaneously sampled using a Latin hypercube approach with logistic regression calculations of probability of elevated nonpoint source contaminants in ground water. The resulting probability distribution represents the prediction intervals and associated uncertainty of the ground water vulnerability predictions. The method is illustrated through a ground water vulnerability assessment of the High Plains regional aquifer. Results of the LHS simulations reveal significant prediction uncertainties that vary spatially across the regional aquifer. Additionally, the proposed method enables a spatial deconstruction of the prediction uncertainty that can lead to improved prediction of ground water vulnerability. © 2007 National Ground Water Association.
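The sampling scheme described above can be sketched in plain Python: a Latin hypercube draws exactly one value from each equal-probability stratum of every uncertain input, and each draw is pushed through the logistic equation to build up a distribution of predicted probabilities. The uniform input ranges, the single explanatory variable, and all names below are assumptions for illustration; the abstract does not give the distributions used in the study.

```python
import math
import random


def latin_hypercube(n_samples, n_dims, rng):
    """Return n_samples points in [0, 1)^n_dims, one per stratum in each dimension."""
    columns = []
    for _ in range(n_dims):
        # One uniform draw inside each of n_samples equal-width strata.
        col = [(i + rng.random()) / n_samples for i in range(n_samples)]
        rng.shuffle(col)  # decouple the stratum order across dimensions
        columns.append(col)
    return [tuple(col[i] for col in columns) for i in range(n_samples)]


def propagate(n_samples, intercept, coef_range, x_range, seed=0):
    """Sorted logistic probabilities under sampled model error and data error."""
    rng = random.Random(seed)
    probs = []
    for u_coef, u_x in latin_hypercube(n_samples, 2, rng):
        coef = coef_range[0] + u_coef * (coef_range[1] - coef_range[0])  # model error
        x = x_range[0] + u_x * (x_range[1] - x_range[0])                 # data error
        probs.append(1.0 / (1.0 + math.exp(-(intercept + coef * x))))
    return sorted(probs)
```

Percentiles of the returned list (for example the 5th and 95th) give an approximate prediction interval at one location; repeating the calculation per grid cell would produce the kind of spatially varying uncertainty the abstract reports.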
Interpretation of commonly used statistical regression models.
Kasza, Jessica; Wolfe, Rory
2014-01-01
A review of some regression models commonly used in respiratory health applications is provided in this article. Simple linear regression, multiple linear regression, logistic regression and ordinal logistic regression are considered. The focus of this article is on the interpretation of the regression coefficients of each model, which are illustrated through the application of these models to a respiratory health research study. © 2013 The Authors. Respirology © 2013 Asian Pacific Society of Respirology.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Li, Weixuan; Lin, Guang; Li, Bing
2016-09-01
A well-known challenge in uncertainty quantification (UQ) is the "curse of dimensionality". However, many high-dimensional UQ problems are essentially low-dimensional, because the randomness of the quantity of interest (QoI) is caused only by uncertain parameters varying within a low-dimensional subspace, known as the sufficient dimension reduction (SDR) subspace. Motivated by this observation, we propose and demonstrate in this paper an inverse regression-based UQ approach (IRUQ) for high-dimensional problems. Specifically, we use an inverse regression procedure to estimate the SDR subspace and then convert the original problem to a low-dimensional one, which can be efficiently solved by building a response surface model such as a polynomial chaos expansion. The novelty and advantages of the proposed approach are seen in its computational efficiency and practicality. Compared with Monte Carlo, the traditionally preferred approach for high-dimensional UQ, IRUQ with a comparable cost generally gives much more accurate solutions even for high-dimensional problems, and even when the dimension reduction is not exactly sufficient. Theoretically, IRUQ is proved to converge twice as fast as the approach it uses to seek the SDR subspace. For example, while a sliced inverse regression method converges to the SDR subspace at the rate of $$O(n^{-1/2})$$, the corresponding IRUQ converges at $$O(n^{-1})$$. IRUQ also provides several desired conveniences in practice. It is non-intrusive, requiring only a simulator to generate realizations of the QoI, and there is no need to compute the high-dimensional gradient of the QoI. Finally, error bars can be derived for the estimation results reported by IRUQ.
Geodesic least squares regression for scaling studies in magnetic confinement fusion
DOE Office of Scientific and Technical Information (OSTI.GOV)
Verdoolaege, Geert
In regression analyses for deriving scaling laws that occur in various scientific disciplines, usually standard regression methods have been applied, of which ordinary least squares (OLS) is the most popular. However, concerns have been raised with respect to several assumptions underlying OLS in its application to scaling laws. We here discuss a new regression method that is robust in the presence of significant uncertainty on both the data and the regression model. The method, which we call geodesic least squares regression (GLS), is based on minimization of the Rao geodesic distance on a probabilistic manifold. We demonstrate the superiority of the method using synthetic data and we present an application to the scaling law for the power threshold for the transition to the high confinement regime in magnetic confinement fusion devices.
Zhao, Zeng-hui; Wang, Wei-ming; Gao, Xin; Yan, Ji-xing
2013-01-01
According to the geological characteristics of the Xinjiang Ili mine in the western area of China, a physical model of interstratified strata composed of soft rock and hard coal seam was established. Selecting the tunnel position, deformation modulus, and strength parameters of each layer as influencing factors, the sensitivity coefficient of roadway deformation to each parameter was first analyzed based on a Mohr-Coulomb strain softening model and nonlinear elastic-plastic finite element analysis. Then the effect laws of the influencing factors that showed high sensitivity were further discussed. Finally, a regression model for the relationship between roadway displacements and multiple factors was obtained by equivalent linear regression under multiple factors. The results show that the roadway deformation is highly sensitive to the depth of the coal seam under the floor, which should be considered in the layout of the coal roadway; the deformation modulus and strength of the coal seam and floor have a great influence on the global stability of the tunnel; on the contrary, roadway deformation is not sensitive to the mechanical parameters of the soft roof; roadway deformation under random combinations of multiple factors can be deduced by the regression model. These conclusions provide theoretical significance to the arrangement and stability maintenance of coal roadways. PMID:24459447
SMURC: High-Dimension Small-Sample Multivariate Regression With Covariance Estimation.
Bayar, Belhassen; Bouaynaya, Nidhal; Shterenberg, Roman
2017-03-01
We consider a high-dimension low sample-size multivariate regression problem that accounts for correlation of the response variables. The system is underdetermined as there are more parameters than samples. We show that the maximum likelihood approach with covariance estimation is senseless because the likelihood diverges. We subsequently propose a normalization of the likelihood function that guarantees convergence. We call this method small-sample multivariate regression with covariance (SMURC) estimation. We derive an optimization problem and its convex approximation to compute SMURC. Simulation results show that the proposed algorithm outperforms the regularized likelihood estimator with known covariance matrix and the sparse conditional Gaussian graphical model. We also apply SMURC to the inference of the wing-muscle gene network of the Drosophila melanogaster (fruit fly).
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mudunuru, Maruti Kumar; Karra, Satish; Harp, Dylan Robert
Reduced-order modeling is a promising approach, as many phenomena can be described by a few parameters/mechanisms. An advantage and attractive aspect of a reduced-order model is that it is computationally inexpensive to evaluate when compared to running a high-fidelity numerical simulation. A reduced-order model takes a couple of seconds to run on a laptop while a high-fidelity simulation may take a couple of hours to run on a high-performance computing cluster. The goal of this paper is to assess the utility of regression-based reduced-order models (ROMs) developed from high-fidelity numerical simulations for predicting transient thermal power output for an enhanced geothermal reservoir while explicitly accounting for uncertainties in the subsurface system and site-specific details. Numerical simulations are performed based on equally spaced values in the specified range of model parameters. Key sensitive parameters are then identified from these simulations, which are fracture zone permeability, well/skin factor, bottom hole pressure, and injection flow rate. We found the fracture zone permeability to be the most sensitive parameter. The fracture zone permeability, along with time, is used to build regression-based ROMs for the thermal power output. The ROMs are trained and validated using detailed physics-based numerical simulations. Finally, predictions from the ROMs are compared with field data. We propose three different ROMs with different levels of model parsimony, each describing key and essential features of the power production curves. The coefficients in the proposed regression-based ROMs are developed by minimizing a non-linear least-squares misfit function using the Levenberg–Marquardt algorithm. The misfit function is based on the difference between numerical simulation data and reduced-order model. ROM-1 is constructed based on polynomials up to fourth order.
ROM-1 is able to accurately reproduce the power output of numerical simulations for low values of permeabilities and certain features of the field-scale data. ROM-2 is a model with more analytical functions consisting of polynomials up to order eight, exponential functions and smooth approximations of Heaviside functions, and accurately describes the field-data. At higher permeabilities, ROM-2 reproduces numerical results better than ROM-1; however, there is a considerable deviation from numerical results at low fracture zone permeabilities. ROM-3 consists of polynomials up to order ten, and is developed by taking the best aspects of ROM-1 and ROM-2. ROM-1 is more parsimonious than ROM-2 and ROM-3, while ROM-2 overfits the data. ROM-3, on the other hand, provides a middle ground for model parsimony. Based on R²-values for training, validation, and prediction data sets, we found that ROM-3 is a better model than ROM-2 and ROM-1. For predicting thermal drawdown in EGS applications, where high fracture zone permeabilities (typically greater than 10⁻¹⁵ m²) are desired, ROM-2 and ROM-3 outperform ROM-1. In terms of computational time, all the ROMs are 10⁴ times faster than running a high-fidelity numerical simulation. In conclusion, this makes the proposed regression-based ROMs attractive for real-time EGS applications because they are fast and provide reasonably good predictions for thermal power output.
Anderson, Chauncey W.; Rounds, Stewart A.
2010-01-01
Management of water quality in streams of the United States is becoming increasingly complex as regulators seek to control aquatic pollution and ecological problems through Total Maximum Daily Load programs that target reductions in the concentrations of certain constituents. Sediment, nutrients, and bacteria, for example, are constituents that regulators target for reduction nationally and in the Tualatin River basin, Oregon. These constituents require laboratory analysis of discrete samples for definitive determinations of concentrations in streams. Recent technological advances in the nearly continuous, in situ monitoring of related water-quality parameters have fostered the use of these parameters as surrogates for the labor-intensive, laboratory-analyzed constituents. Although these correlative techniques have been successful in large rivers, it was unclear whether they could be applied successfully in tributaries of the Tualatin River, primarily because these streams tend to be small, have rapid hydrologic response to rainfall and high streamflow variability, and may contain unique sources of sediment, nutrients, and bacteria. This report evaluates the feasibility of developing correlative regression models for predicting dependent variables (concentrations of total suspended solids, total phosphorus, and Escherichia coli bacteria) in two Tualatin River basin streams: one draining highly urbanized land (Fanno Creek near Durham, Oregon) and one draining rural agricultural land (Dairy Creek at Highway 8 near Hillsboro, Oregon), during 2002-04. An important difference between these two streams is their response to storm runoff; Fanno Creek has a relatively rapid response due to extensive upstream impervious areas and Dairy Creek has a relatively slow response because of the large amount of undeveloped upstream land. Four other stream sites also were evaluated, but in less detail.
Potential explanatory variables included continuously monitored streamflow (discharge), stream stage, specific conductance, turbidity, and time (to account for seasonal processes). Preliminary multiple-regression models were identified using stepwise regression and Mallows' Cp, which maximizes regression correlation coefficients and accounts for the loss of additional degrees of freedom when extra explanatory variables are used. Several data scenarios were created and evaluated for each site to assess the representativeness of existing monitoring data and autosampler-derived data, and to assess the utility of the available data to develop robust predictive models. The goodness-of-fit of candidate predictive models was assessed with diagnostic statistics from validation exercises that compared predictions against a subset of the available data. The regression modeling met with mixed success. Functional model forms that have a high likelihood of success were identified for most (but not all) dependent variables at each site, but there were limitations in the available datasets, notably the lack of samples from high flows. These limitations increase the uncertainty in the predictions of the models and suggest that the models are not yet ready for use in assessing these streams, particularly under high-flow conditions, without additional data collection and recalibration of model coefficients. Nonetheless, the results reveal opportunities to use existing resources more efficiently. Baseline conditions are well represented in the available data, and, for the most part, the models reproduced these conditions well. Future sampling might therefore focus on high-flow conditions, without much loss of ability to characterize the baseline.
Seasonal cycles, as represented by trigonometric functions of time, were not significant in the evaluated models, perhaps because the baseline conditions are well characterized in the datasets or because the other explanatory variables indirectly incorporate seasonal aspects. Multicollinearity among independent variabl
Mudunuru, Maruti Kumar; Karra, Satish; Harp, Dylan Robert; ...
2017-07-10
Reduced-order modeling is a promising approach, as many phenomena can be described by a few parameters/mechanisms. An advantage and attractive aspect of a reduced-order model is that it is computationally inexpensive to evaluate when compared to running a high-fidelity numerical simulation. A reduced-order model takes a couple of seconds to run on a laptop while a high-fidelity simulation may take a couple of hours to run on a high-performance computing cluster. The goal of this paper is to assess the utility of regression-based reduced-order models (ROMs) developed from high-fidelity numerical simulations for predicting transient thermal power output for an enhanced geothermal reservoir while explicitly accounting for uncertainties in the subsurface system and site-specific details. Numerical simulations are performed based on equally spaced values in the specified range of model parameters. Key sensitive parameters are then identified from these simulations, which are fracture zone permeability, well/skin factor, bottom hole pressure, and injection flow rate. We found the fracture zone permeability to be the most sensitive parameter. The fracture zone permeability, along with time, is used to build regression-based ROMs for the thermal power output. The ROMs are trained and validated using detailed physics-based numerical simulations. Finally, predictions from the ROMs are compared with field data. We propose three different ROMs with different levels of model parsimony, each describing key and essential features of the power production curves. The coefficients in the proposed regression-based ROMs are developed by minimizing a non-linear least-squares misfit function using the Levenberg–Marquardt algorithm. The misfit function is based on the difference between numerical simulation data and reduced-order model. ROM-1 is constructed based on polynomials up to fourth order.
ROM-1 is able to accurately reproduce the power output of numerical simulations for low values of permeabilities and certain features of the field-scale data. ROM-2 is a model with more analytical functions consisting of polynomials up to order eight, exponential functions and smooth approximations of Heaviside functions, and accurately describes the field-data. At higher permeabilities, ROM-2 reproduces numerical results better than ROM-1; however, there is a considerable deviation from numerical results at low fracture zone permeabilities. ROM-3 consists of polynomials up to order ten, and is developed by taking the best aspects of ROM-1 and ROM-2. ROM-1 is more parsimonious than ROM-2 and ROM-3, while ROM-2 overfits the data. ROM-3, on the other hand, provides a middle ground for model parsimony. Based on R²-values for training, validation, and prediction data sets, we found that ROM-3 is a better model than ROM-2 and ROM-1. For predicting thermal drawdown in EGS applications, where high fracture zone permeabilities (typically greater than 10⁻¹⁵ m²) are desired, ROM-2 and ROM-3 outperform ROM-1. In terms of computational time, all the ROMs are 10⁴ times faster than running a high-fidelity numerical simulation. In conclusion, this makes the proposed regression-based ROMs attractive for real-time EGS applications because they are fast and provide reasonably good predictions for thermal power output.
A nonparametric multiple imputation approach for missing categorical data.
Zhou, Muhan; He, Yulei; Yu, Mandi; Hsu, Chiu-Hsieh
2017-06-06
Incomplete categorical variables with more than two categories are common in public health data. However, most of the existing missing-data methods do not use the information from nonresponse (missingness) probabilities. We propose a nearest-neighbour multiple imputation approach to impute a missing at random categorical outcome and to estimate the proportion of each category. The donor set for imputation is formed by measuring distances between each missing value with other non-missing values. The distance function is calculated based on a predictive score, which is derived from two working models: one fits a multinomial logistic regression for predicting the missing categorical outcome (the outcome model) and the other fits a logistic regression for predicting missingness probabilities (the missingness model). A weighting scheme is used to accommodate contributions from two working models when generating the predictive score. A missing value is imputed by randomly selecting one of the non-missing values with the smallest distances. We conduct a simulation to evaluate the performance of the proposed method and compare it with several alternative methods. A real-data application is also presented. The simulation study suggests that the proposed method performs well when missingness probabilities are not extreme under some misspecifications of the working models. However, the calibration estimator, which is also based on two working models, can be highly unstable when missingness probabilities for some observations are extremely high. In this scenario, the proposed method produces more stable and better estimates. In addition, proper weights need to be chosen to balance the contributions from the two working models and achieve optimal results for the proposed method. 
We conclude that the proposed multiple imputation method is a reasonable approach to dealing with missing categorical outcome data with more than two levels for assessing the distribution of the outcome. In terms of the choices for the working models, we suggest a multinomial logistic regression for predicting the missing outcome and a binary logistic regression for predicting the missingness probability.
NASA Astrophysics Data System (ADS)
Astuti, H. N.; Saputro, D. R. S.; Susanti, Y.
2017-06-01
The mixed geographically weighted regression (MGWR) model combines the linear regression model and the geographically weighted regression (GWR) model, so it produces global parameter estimates together with local parameter estimates that depend on the observation location. The linkage between observation locations is expressed through a specific weighting, here the adaptive bi-square. In this research, we applied the MGWR model with adaptive bi-square weighting to cases of dengue hemorrhagic fever (DHF) in Surakarta based on 10 factors (variables) that are supposed to influence the number of people with DHF. The observation units are 51 urban villages, and the variables are number of inhabitants, number of houses, house index, number of public places, number of healthy homes, number of Posyandu, area width, level of population density, welfare of the family, and high-region. Based on this research, we obtained 51 MGWR models. The MGWR models were divided into 4 groups, with house index significant as a global variable, area width significant as a local variable, and the remaining variables varying in each group. Global variables are variables that significantly affect all locations, while local variables are variables that significantly affect a specific location.
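For readers unfamiliar with the adaptive bi-square weighting named above, the following sketch shows the commonly used form of the kernel: the weight is (1 - (d/h)^2)^2 for distances d below the bandwidth h and zero otherwise, with h set to the distance to the k-th nearest neighbour. The abstract does not give the exact kernel definition or bandwidth used in the study, so this form, the function name, and the coordinates are assumptions for illustration only.

```python
import math

def adaptive_bisquare_weights(coords, i, k):
    """Weights of all locations relative to location i.

    The bandwidth h is the distance from location i to its k-th nearest
    neighbour, so it adapts to the local density of observations.
    Assumes that distance is positive (no k coincident locations).
    """
    xi, yi = coords[i]
    dists = [math.hypot(x - xi, y - yi) for x, y in coords]
    h = sorted(dists)[k]  # index 0 is location i itself (distance 0)
    return [(1 - (d / h) ** 2) ** 2 if d < h else 0.0 for d in dists]

# Hypothetical village centroids (not data from the study).
coords = [(0, 0), (1, 0), (0, 2), (3, 4), (6, 8)]
weights = adaptive_bisquare_weights(coords, i=0, k=3)
```

In a local regression at location i these weights would enter a weighted least squares fit, so nearby villages dominate the local coefficients while villages at or beyond the bandwidth contribute nothing.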
Analytics of Radioactive Materials Released in the Fukushima Daiichi Nuclear Accident
DOE Office of Scientific and Technical Information (OSTI.GOV)
Egarievwe, Stephen U.; Nuclear Engineering Department, University of Tennessee, Knoxville, TN; Coble, Jamie B.
The 2011 Fukushima Daiichi nuclear accident in Japan resulted in the release of radioactive materials into the atmosphere, the nearby sea, and the surrounding land. Following the accident, several meteorological models were used to predict the transport of the radioactive materials to other continents such as North America and Europe. Also of high importance is the dispersion of radioactive materials locally and within Japan. Based on the International Atomic Energy Agency (IAEA) Convention on Early Notification of a nuclear accident, several radiological data sets were collected on the accident by the Japanese authorities. Among the radioactive materials monitored are I-131 and Cs-137, which form the major contributions to the contamination of drinking water. The radiation dose in the atmosphere was also measured. It is impractical to measure contamination and radiation dose in every place of interest. Therefore, modeling helps to predict contamination and radiation dose. Some modeling studies that have been reported in the literature include the simulation of transport and deposition of I-131 and Cs-137 from the accident, Cs-137 deposition and contamination of Japanese soils, and preliminary estimates of I-131 and Cs-137 discharged from the plant into the atmosphere. In this paper, we present statistical analytics of I-131 and Cs-137 with the goal of predicting gamma dose from the Fukushima Daiichi nuclear accident. The data sets used in our study were collected from the IAEA Fukushima Monitoring Database. As part of this study, we investigated several regression models to find the best algorithm for modeling the gamma dose. The modeling techniques used in our study include linear regression, principal component regression (PCR), partial least square (PLS) regression, and ridge regression.
Our preliminary results on the first set of data showed that the linear regression model with one variable was the best with a root mean square error of 0.0133 μSv/h, compared to 0.0210 μSv/h for PCR, 0.231 μSv/h for ridge regression L-curve, 0.0856 μSv/h for PLS, and 0.0860 μSv/h for ridge regression cross validation. Complete results using the full datasets for these models will also be presented. (authors)
Pedersen, Nicklas Juel; Jensen, David Hebbelstrup; Lelkaitis, Giedrius; Kiss, Katalin; Charabi, Birgitte; Specht, Lena; von Buchwald, Christian
2017-01-01
It is challenging to identify at diagnosis those patients with early oral squamous cell carcinoma (OSCC), who have a poor prognosis and those that have a high risk of harboring occult lymph node metastases. The aim of this study was to develop a standardized and objective digital scoring method to evaluate the predictive value of tumor budding. We developed a semi-automated image-analysis algorithm, Digital Tumor Bud Count (DTBC), to evaluate tumor budding. The algorithm was tested in 222 consecutive patients with early-stage OSCC and major endpoints were overall (OS) and progression free survival (PFS). We subsequently constructed and cross-validated a binary logistic regression model and evaluated its clinical utility by decision curve analysis. A high DTBC was an independent predictor of both poor OS and PFS in a multivariate Cox regression model. The logistic regression model was able to identify patients with occult lymph node metastases with an area under the curve (AUC) of 0.83 (95% CI: 0.78–0.89, P <0.001) and a 10-fold cross-validated AUC of 0.79. Compared to other known histopathological risk factors, the DTBC had a higher diagnostic accuracy. The proposed, novel risk model could be used as a guide to identify patients who would benefit from an up-front neck dissection. PMID:28212555
Weaver, Brian Thomas; Fitzsimons, Kathleen; Braman, Jerrod; Haut, Roger
2016-09-01
The goal of the current study was to expand on previous work to validate the use of pressure insole technology in conjunction with linear regression models to predict the free torque at the shoe-surface interface that is generated while wearing different athletic shoes. Three distinctly different shoe designs were utilised. The stiffness of each shoe was determined with a materials testing machine. Six participants wore each shoe, which was fitted with an insole pressure measurement device, and performed rotation trials on an embedded force plate. A pressure sensor mask was constructed from those sensors having a high linear correlation with free torque values. Linear regression models were developed to predict free torques from these pressure sensor data. The models were able to accurately predict their own free torque (RMS error 3.72 ± 0.74 Nm), but not that of the other shoes (RMS error 10.43 ± 3.79 Nm). Models performing self-prediction were also able to measure differences in shoe stiffness. The results of the current study showed the need for participant-shoe specific linear regression models to ensure high prediction accuracy of free torques from pressure sensor data during isolated internal and external rotations of the body with respect to a planted foot.
Yamazaki, Takeshi; Takeda, Hisato; Hagiya, Koichi; Yamaguchi, Satoshi; Sasaki, Osamu
2018-03-13
Because lactation periods in dairy cows lengthen with increasing total milk production, it is important to predict individual productivities after 305 days in milk (DIM) to determine the optimal lactation period. We therefore examined whether the random regression (RR) coefficient from 306 to 450 DIM (M2) can be predicted from those during the first 305 DIM (M1) by using a random regression model. We analyzed test-day milk records from 85690 Holstein cows in their first lactations and 131727 cows in their later (second to fifth) lactations. Data in M1 and M2 were analyzed separately by using different single-trait RR animal models. We then performed a multiple regression analysis of the RR coefficients of M2 on those of M1 during the first and later lactations. The first-order Legendre polynomials were practical covariates of random regression for the milk yields of M2. All RR coefficients for the additive genetic (AG) effect and the intercept for the permanent environmental (PE) effect of M2 had moderate to strong correlations with the intercept for the AG effect of M1. The coefficients of determination for multiple regression of the combined intercepts for the AG and PE effects of M2 on the coefficients for the AG effect of M1 were moderate to high. The daily milk yields of M2 predicted by using the RR coefficients for the AG effect of M1 were highly correlated with those obtained by using the coefficients of M2. Milk production after 305 DIM can be predicted by using the RR coefficient estimates of the AG effect during the first 305 DIM.
The Chinese High School Student's Stress in the School and Academic Achievement
ERIC Educational Resources Information Center
Liu, Yangyang; Lu, Zuhong
2011-01-01
In a sample of 466 Chinese high school students, we examined the relationships between Chinese high school students' stress in the school and their academic achievements. Regression mixture modelling identified two different classes of the effects of Chinese high school students' stress on their academic achievements. One class contained 87% of…
Using regression methods to estimate stream phosphorus loads at the Illinois River, Arkansas
Haggard, B.E.; Soerens, T.S.; Green, W.R.; Richards, R.P.
2003-01-01
The development of total maximum daily loads (TMDLs) requires evaluating existing constituent loads in streams. Accurate estimates of constituent loads are needed to calibrate watershed and reservoir models for TMDL development. The best approach to estimate constituent loads is high frequency sampling, particularly during storm events, and mass integration of constituents passing a point in a stream. Most often, resources are limited and discrete water quality samples are collected on fixed intervals and sometimes supplemented with directed sampling during storm events. When resources are limited, mass integration is not an accurate means to determine constituent loads and other load estimation techniques such as regression models are used. The objective of this work was to determine a minimum number of water-quality samples needed to provide constituent concentration data adequate to estimate constituent loads at a large stream. Twenty sets of water quality samples with and without supplemental storm samples were randomly selected at various fixed intervals from a database at the Illinois River, northwest Arkansas. The random sets were used to estimate total phosphorus (TP) loads using regression models. The regression-based annual TP loads were compared to the integrated annual TP load estimated using all the data. At a minimum, monthly sampling plus supplemental storm samples (six samples per year) was needed to produce a root mean square error of less than 15%. Water quality samples should be collected at least semi-monthly (every 15 days) in studies less than two years if seasonal time factors are to be used in the regression models. Annual TP loads estimated from independently collected discrete water quality samples further demonstrated the utility of using regression models to estimate annual TP loads in this stream system.
Kumar, S.; Spaulding, S.A.; Stohlgren, T.J.; Hermann, K.A.; Schmidt, T.S.; Bahls, L.L.
2009-01-01
The diatom Didymosphenia geminata is a single-celled alga found in lakes, streams, and rivers. Nuisance blooms of D. geminata affect the diversity, abundance, and productivity of other aquatic organisms. Because D. geminata can be transported by humans on waders and other gear, accurate spatial prediction of habitat suitability is urgently needed for early detection and rapid response, as well as for evaluation of monitoring and control programs. We compared four modeling methods to predict D. geminata's habitat distribution; two methods use presence-absence data (logistic regression and classification and regression tree [CART]), and two involve presence data (maximum entropy model [Maxent] and genetic algorithm for rule-set production [GARP]). Using these methods, we evaluated spatially explicit, bioclimatic and environmental variables as predictors of diatom distribution. The Maxent model provided the most accurate predictions, followed by logistic regression, CART, and GARP. The most suitable habitats were predicted to occur in the western US, in relatively cool sites, and at high elevations with a high base-flow index. The results provide insights into the factors that affect the distribution of D. geminata and a spatial basis for the prediction of nuisance blooms. © The Ecological Society of America.
Predicting Student Success on the Texas Chemistry STAAR Test: A Logistic Regression Analysis
ERIC Educational Resources Information Center
Johnson, William L.; Johnson, Annabel M.; Johnson, Jared
2012-01-01
Background: The context is the new Texas STAAR end-of-course testing program. Purpose: The authors developed a logistic regression model to predict who would pass-or-fail the new Texas chemistry STAAR end-of-course exam. Setting: Robert E. Lee High School (5A) with an enrollment of 2700 students, Tyler, Texas. Date of the study was the 2011-2012…
Zhang, Ling Yu; Liu, Zhao Gang
2017-12-01
Based on the data collected from 108 permanent plots of the forest resources survey in Maoershan Experimental Forest Farm during 2004-2016, this study investigated the spatial distribution of recruitment trees in natural secondary forest by global Poisson regression and geographically weighted Poisson regression (GWPR) with four bandwidths of 2.5, 5, 10 and 15 km. The simulation effects of the 5 regressions and the factors influencing the recruitment trees in stands were analyzed, and the spatial autocorrelation of the regression residuals was described at global and local levels using Moran's I. The results showed that the spatial distribution of the number of recruitment trees in natural secondary forest was significantly influenced by stand and topographic factors, especially average DBH. The GWPR model at the smallest scale (2.5 km) had high accuracy of model fitting, generated a large range of model parameter estimates, and captured the localized spatial distribution effect of the model parameters. The GWPR models at small scales (2.5 and 5 km) produced a small range of model residuals, and the stability of the model was improved. The global spatial autocorrelation of the GWPR model residuals at the smallest scale (2.5 km) was the lowest, and the local spatial autocorrelation was significantly reduced, forming an ideal spatial distribution pattern of small clusters with different observations. The local model at the smallest scale (2.5 km) was much better than the global model in simulating the spatial distribution of the number of recruitment trees.
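The global Moran's I statistic used above to check residual autocorrelation can be computed directly from its textbook definition, I = (n / S0) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2, where S0 is the sum of all weights. The sketch below is illustrative only: the residual values and the chain-shaped neighbour matrix are hypothetical and are not taken from the study.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a square spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)  # sum of all spatial weights
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / s0) * (num / den)

# Hypothetical example: four plots in a row, each neighbouring the adjacent plot(s).
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
clustered = morans_i([1, 1, -1, -1], W)    # similar residuals sit next to each other
alternating = morans_i([1, -1, 1, -1], W)  # neighbouring residuals have opposite signs
```

Values near zero suggest that a model has absorbed most of the spatial structure, which is the sense in which the 2.5 km GWPR residuals are described as having the lowest global autocorrelation.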
Zhou, Qingping; Jiang, Haiyan; Wang, Jianzhou; Zhou, Jianling
2014-10-15
Exposure to high concentrations of fine particulate matter (PM₂.₅) can cause serious health problems because PM₂.₅ contains microscopic solid or liquid droplets that are sufficiently small to be ingested deep into human lungs. Thus, daily prediction of PM₂.₅ levels is notably important for regulatory plans that inform the public and restrict social activities in advance when harmful episodes are foreseen. A hybrid EEMD-GRNN (ensemble empirical mode decomposition-general regression neural network) model based on data preprocessing and analysis is firstly proposed in this paper for one-day-ahead prediction of PM₂.₅ concentrations. The EEMD part is utilized to decompose original PM₂.₅ data into several intrinsic mode functions (IMFs), while the GRNN part is used for the prediction of each IMF. The hybrid EEMD-GRNN model is trained using input variables obtained from principal component regression (PCR) model to remove redundancy. These input variables accurately and succinctly reflect the relationships between PM₂.₅ and both air quality and meteorological data. The model is trained with data from January 1 to November 1, 2013 and is validated with data from November 2 to November 21, 2013 in Xi'an Province, China. The experimental results show that the developed hybrid EEMD-GRNN model outperforms a single GRNN model without EEMD, a multiple linear regression (MLR) model, a PCR model, and a traditional autoregressive integrated moving average (ARIMA) model. The hybrid model with fast and accurate results can be used to develop rapid air quality warning systems. Copyright © 2014 Elsevier B.V. All rights reserved.
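The GRNN part of the hybrid can be read as a kernel-weighted average of training targets. The sketch below follows the standard general regression neural network formulation with a Gaussian kernel; the abstract does not give the authors' settings, so the smoothing parameter `sigma`, the function name, and the toy data are assumptions, and the EEMD decomposition step is not shown.

```python
import math


def grnn_predict(x_train, y_train, x_new, sigma):
    """Predict one target as a Gaussian-kernel-weighted mean of y_train.

    x_train is a list of feature vectors, x_new a single feature vector,
    and sigma the (assumed) smoothing parameter.
    """
    weights = []
    for x in x_train:
        squared_distance = sum((a - b) ** 2 for a, b in zip(x, x_new))
        weights.append(math.exp(-squared_distance / (2.0 * sigma ** 2)))
    return sum(w * y for w, y in zip(weights, y_train)) / sum(weights)


# Toy example: two training points with targets 0 and 2.
train_inputs = [[0.0], [1.0]]
train_targets = [0.0, 2.0]
midpoint_prediction = grnn_predict(train_inputs, train_targets, [0.5], sigma=0.3)
```

In a decomposition-based hybrid of this kind, one such predictor would be fitted per IMF and the component forecasts summed; that aggregation is an assumption here, since the abstract only says each IMF is predicted by the GRNN.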
Watanabe, Hiroyuki; Miyazaki, Hiroyasu
2006-01-01
Over- and/or under-correction of QT intervals for changes in heart rate may lead to misleading conclusions and/or mask the potential of a drug to prolong the QT interval. This study examines a nonparametric regression model (Loess smoother) to adjust the QT interval for differences in heart rate, with an improved fit over a wide range of heart rates. For each of 8 conscious, non-treated beagle dogs, 240 sets of (QT, RR) observations were collected and used for the investigation. The fit of the nonparametric regression model to the QT-RR relationship was compared with four models (individual linear regression, common linear regression, and Bazett's and Fridericia's correction models) with reference to Akaike's Information Criterion (AIC). Residuals were visually assessed. The bias-corrected AIC of the nonparametric regression model was the best of the models examined in this study. Although the parametric models did not fit, the nonparametric regression model improved the fit at both fast and slow heart rates. The nonparametric regression model is more flexible than the parametric methods. The mathematical fit of the linear regression models was unsatisfactory at both fast and slow heart rates, while the nonparametric regression model showed significant improvement at all heart rates in beagle dogs.
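For reference, the two fixed-exponent formulas named above can be written out directly. This sketch uses the conventional definitions (QT divided by the square root or cube root of RR, with RR in seconds); the function names are hypothetical, and the study's Loess-based adjustment is not reproduced here.

```python
def qtc_bazett(qt_ms, rr_s):
    """Bazett correction: QT divided by the square root of RR (RR in seconds)."""
    return qt_ms / rr_s ** 0.5


def qtc_fridericia(qt_ms, rr_s):
    """Fridericia correction: QT divided by the cube root of RR (RR in seconds)."""
    return qt_ms / rr_s ** (1.0 / 3.0)


# At RR = 1 s both corrections leave QT unchanged; they diverge as the
# heart rate moves away from 60 beats per minute.
qt_at_rest = qtc_bazett(400.0, 1.0)
qt_fast_bazett = qtc_bazett(400.0, 0.64)
qt_fast_fridericia = qtc_fridericia(400.0, 0.64)
```

Because each formula fixes the exponent in advance, it cannot adapt to the curvature of an individual QT-RR relationship, which is the limitation a nonparametric smoother is meant to address.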
Šarić, Željko; Xu, Xuecai; Duan, Li; Babić, Darko
2018-06-20
This study investigated the interactions between accident rate and traffic signs on state roads in Croatia while accommodating the heterogeneity attributed to unobserved factors. Data from 130 state roads between 2012 and 2016 were collected from the Traffic Accident Database System maintained by the Republic of Croatia Ministry of the Interior. To address the heterogeneity, a panel quantile regression model was proposed, in which the quantile regression model offers a more complete view and a more comprehensive analysis of the relationship between accident rate and traffic signs, while the panel data model accommodates the heterogeneity attributed to unobserved factors. Results revealed that (1) low visibility of material damage (MD) and death or injured (DI) increased the accident rate; (2) the number of mandatory signs and the number of warning signs were more likely to reduce the accident rate; and (3) average speed limit and the number of invalid traffic signs per km were associated with a high accident rate. To our knowledge, this is the first attempt to analyze the interactions between accident consequences and traffic signs by employing a panel quantile regression model. By including visibility, the present study demonstrates that low visibility causes a relatively higher risk of MD and DI. It is noteworthy that average speed limit is positively associated with accident rate, that the number of mandatory signs and the number of warning signs are more likely to reduce the accident rate, and that the number of invalid traffic signs per km is significant for accident rate; regular maintenance should therefore be kept up for a safer roadway environment.
An open-access CMIP5 pattern library for temperature and precipitation: Description and methodology
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lynch, Cary D.; Hartin, Corinne A.; Bond-Lamberty, Benjamin; ...
2017-05-15
Pattern scaling is used to efficiently emulate general circulation models and explore uncertainty in climate projections under multiple forcing scenarios. Pattern scaling methods assume that local climate changes scale with a global mean temperature increase, allowing for spatial patterns to be generated for multiple models for any future emission scenario. For uncertainty quantification and probabilistic statistical analysis, a library of patterns with descriptive statistics for each file would be beneficial, but such a library does not presently exist. Of the possible techniques used to generate patterns, the two most prominent are the delta and least squared regression methods. We explore the differences and statistical significance between patterns generated by each method and assess performance of the generated patterns across methods and scenarios. Differences in patterns across seasons between methods and epochs were largest in high latitudes (60-90°N/S). Bias and mean errors between modeled and pattern predicted output from the linear regression method were smaller than patterns generated by the delta method. Across scenarios, differences in the linear regression method patterns were more statistically significant, especially at high latitudes. We found that pattern generation methodologies were able to approximate the forced signal of change to within ≤ 0.5°C, but choice of pattern generation methodology for pattern scaling purposes should be informed by user goals and criteria. As a result, this paper describes our library of least squared regression patterns from all CMIP5 models for temperature and precipitation on an annual and sub-annual basis, along with the code used to generate these patterns.
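The two pattern-generation approaches can be contrasted for a single grid cell. The sketch below is an illustration under stated assumptions, not the library's code: the regression pattern is taken as the ordinary least squares slope of local temperature on global mean temperature, and the delta pattern as the ratio of epoch-mean changes; the epoch length and all names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def regression_pattern(global_temp, local_temp):
    """OLS slope of local temperature on global mean temperature."""
    g_bar, l_bar = mean(global_temp), mean(local_temp)
    covariance = sum((g - g_bar) * (l - l_bar)
                     for g, l in zip(global_temp, local_temp))
    variance = sum((g - g_bar) ** 2 for g in global_temp)
    return covariance / variance


def delta_pattern(global_temp, local_temp, epoch_length):
    """Local change between first and last epochs per unit of global change."""
    local_change = mean(local_temp[-epoch_length:]) - mean(local_temp[:epoch_length])
    global_change = mean(global_temp[-epoch_length:]) - mean(global_temp[:epoch_length])
    return local_change / global_change


# Toy series: the grid cell warms twice as fast as the global mean.
global_series = [0.0, 1.0, 2.0, 3.0]
local_series = [0.0, 2.0, 4.0, 6.0]
```

On this noise-free toy series both estimates agree; with internal variability the delta estimate depends only on the two epochs, while the regression slope uses every year in the series.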
NASA Astrophysics Data System (ADS)
Gonçalves, Karen dos Santos; Winkler, Mirko S.; Benchimol-Barbosa, Paulo Roberto; de Hoogh, Kees; Artaxo, Paulo Eduardo; de Souza Hacon, Sandra; Schindler, Christian; Künzli, Nino
2018-07-01
Epidemiological studies generally use measurements of particulate matter with a diameter of less than 2.5 μm (PM2.5) from monitoring networks. Satellite aerosol optical depth (AOD) data have considerable potential for predicting PM2.5 concentrations, and thus provide an alternative method for producing knowledge regarding the level of pollution and its health impact in areas where no ground PM2.5 measurements are available. This is the case in the Brazilian Amazon rainforest region, where forest fires are frequent sources of high pollution. In this study, we applied a non-linear model for predicting PM2.5 concentration from AOD retrievals using interaction terms between average temperature, relative humidity, the sine and cosine of date with a period of 365.25 days, and the square of the lagged relative residual. Regression performance statistics were tested by comparing the goodness of fit and R² based on results from linear and non-linear regression for six different models. The non-linear prediction showed the best performance, explaining on average 82% of the daily PM2.5 concentrations over the whole period studied. In the context of Amazonia, this was the first study predicting PM2.5 concentrations using the latest high-resolution AOD products in combination with testing the performance of a non-linear model. Our results permitted a reliable prediction based on the AOD-PM2.5 relationship and set the basis for further investigations of air pollution impacts in the complex context of the Brazilian Amazon region.
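The sine and cosine of date mentioned above are a common way to encode an annual cycle as two regression inputs. A minimal sketch, assuming the date is expressed as a day count from an arbitrary origin (the abstract does not say how the authors indexed dates); the function name is hypothetical.

```python
import math

PERIOD_DAYS = 365.25  # length of the annual cycle named in the abstract


def seasonal_terms(day_count):
    """Return (sine, cosine) of the date for an annual harmonic."""
    angle = 2.0 * math.pi * day_count / PERIOD_DAYS
    return math.sin(angle), math.cos(angle)


start_sin, start_cos = seasonal_terms(0.0)
quarter_sin, quarter_cos = seasonal_terms(PERIOD_DAYS / 4.0)
```

Using the pair rather than the raw day number keeps the end of one year numerically adjacent to the start of the next, so a regression can fit a smooth seasonal curve.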
Quantile regression applied to spectral distance decay
Rocchini, D.; Cade, B.S.
2008-01-01
Remotely sensed imagery has long been recognized as a powerful support for characterizing and estimating biodiversity. Spectral distance among sites has proven to be a powerful approach for detecting species composition variability. Regression analysis of species similarity versus spectral distance allows us to quantitatively estimate the amount of turnover in species composition with respect to spectral and ecological variability. In classical regression analysis, the residual sum of squares is minimized for the mean of the dependent variable distribution. However, many ecological data sets are characterized by a high number of zeroes that add noise to the regression model. Quantile regressions can be used to evaluate trend in the upper quantiles rather than a mean trend across the whole distribution of the dependent variable. In this letter, we used ordinary least squares (OLS) and quantile regressions to estimate the decay of species similarity versus spectral distance. The achieved decay rates were statistically nonzero (p < 0.01), considering both OLS and quantile regressions. Nonetheless, the OLS regression estimate of the mean decay rate was only half the decay rate indicated by the upper quantiles. Moreover, the intercept value, representing the similarity reached when the spectral distance approaches zero, was very low compared with the intercepts of the upper quantiles, which detected high species similarity when habitats are more similar. In this letter, we demonstrated the power of using quantile regressions applied to spectral distance decay to reveal species diversity patterns otherwise lost or underestimated by OLS regression. © 2008 IEEE.
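The contrast between a mean trend and an upper-quantile trend comes from the loss being minimized. The sketch below shows the check (pinball) loss that quantile regression minimizes, applied to the simplest case of a constant prediction; the data are invented for illustration and contain many zeroes, as the abstract describes for ecological data sets.

```python
def pinball_loss(residual, tau):
    """Check loss for one residual (observed minus predicted) at quantile tau."""
    return tau * residual if residual >= 0 else (tau - 1.0) * residual


def best_constant(observations, tau):
    """Observed value that minimizes total pinball loss as a constant prediction."""
    candidates = sorted(set(observations))
    return min(
        candidates,
        key=lambda c: sum(pinball_loss(v - c, tau) for v in observations),
    )


# Hypothetical zero-heavy sample.
similarity = [0, 0, 0, 0, 0, 0, 1, 2, 3, 10]
median_fit = best_constant(similarity, 0.5)
upper_fit = best_constant(similarity, 0.75)
```

The asymmetric weights penalize under-prediction more heavily for large tau, so the fitted value moves toward the upper part of the distribution instead of being pulled down by the zeroes.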
Spectral distance decay: Assessing species beta-diversity by quantile regression
Rocchini, D.; Nagendra, H.; Ghate, R.; Cade, B.S.
2009-01-01
Remotely sensed data represent key information for characterizing and estimating biodiversity. Spectral distance among sites has proven to be a powerful approach for detecting species composition variability. Regression analysis of species similarity versus spectral distance may allow us to quantitatively estimate how beta-diversity in species changes with respect to spectral and ecological variability. In classical regression analysis, the residual sum of squares is minimized for the mean of the dependent variable distribution. However, many ecological datasets are characterized by a high number of zeroes that can add noise to the regression model. Quantile regression can be used to evaluate trend in the upper quantiles rather than a mean trend across the whole distribution of the dependent variable. In this paper, we used ordinary least squares (OLS) and quantile regression to estimate the decay of species similarity versus spectral distance. The achieved decay rates were statistically nonzero (p < 0.05) considering both OLS and quantile regression. Nonetheless, the OLS regression estimate of the mean decay rate was only half the decay rate indicated by the upper quantiles. Moreover, the intercept value, representing the similarity reached when spectral distance approaches zero, was very low compared with the intercepts of the upper quantiles, which detected high species similarity when habitats are more similar. In this paper we demonstrated the power of using quantile regressions applied to spectral distance decay in order to reveal species diversity patterns otherwise lost or underestimated by ordinary least squares regression. © 2009 American Society for Photogrammetry and Remote Sensing.
Alwee, Razana; Hj Shamsuddin, Siti Mariyam; Sallehuddin, Roselina
2013-01-01
Crime forecasting is an important area in the field of criminology. Linear models, such as regression and econometric models, are commonly applied in crime forecasting. However, real crime data commonly consist of both linear and nonlinear components. A single model may not be sufficient to identify all the characteristics of the data. The purpose of this study is to introduce a hybrid model that combines support vector regression (SVR) and autoregressive integrated moving average (ARIMA) for forecasting crime rates. SVR is very robust with small training data and high-dimensional problems. Meanwhile, ARIMA has the ability to model several types of time series. However, the accuracy of the SVR model depends on the values of its parameters, while ARIMA is not robust when applied to small data sets. Therefore, to overcome this problem, particle swarm optimization is used to estimate the parameters of the SVR and ARIMA models. The proposed hybrid model is used to forecast the property crime rates of the United States based on economic indicators. The experimental results show that the proposed hybrid model is able to produce more accurate forecasting results than the individual models. PMID:23766729
2011-01-01
Background: Several regression models have been proposed for estimation of isometric joint torque using surface electromyography (SEMG) signals. Common issues related to torque estimation models are degradation of model accuracy with passage of time, electrode displacement, and alteration of limb posture. This work compares the performance of the most commonly used regression models under these circumstances, in order to assist researchers with identifying the most appropriate model for a specific biomedical application. Methods: Eleven healthy volunteers participated in this study. A custom-built rig, equipped with a torque sensor, was used to measure isometric torque as each volunteer flexed and extended his wrist. SEMG signals from eight forearm muscles, in addition to wrist joint torque data, were gathered during the experiment. Additional data were gathered one hour and twenty-four hours following the completion of the first data gathering session, for the purpose of evaluating the effects of passage of time and electrode displacement on the accuracy of the models. Acquired SEMG signals were filtered, rectified, normalized and then fed to the models for training. Results: Mean adjusted coefficient of determination (Rₐ²) values decreased by 20% to 35% for different models after one hour, while altering arm posture decreased mean Rₐ² values by 64% to 74% for different models. Conclusions: Model estimation accuracy drops significantly with passage of time, electrode displacement, and alteration of limb posture. Therefore model retraining is crucial for preserving estimation accuracy. Data resampling can significantly reduce model training time without losing estimation accuracy. Among the models compared, the ordinary least squares (OLS) linear regression model was shown to have high isometric torque estimation accuracy combined with very short training times. PMID:21943179
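The adjusted coefficient of determination reported in the results penalizes R² for the number of predictors. A minimal sketch of the standard formula, with hypothetical argument names; the sample size and predictor counts in the example are illustrative and are not taken from the study.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Same raw R2, different numbers of predictors (illustrative values only).
one_predictor = adjusted_r2(0.9, 11, 1)
five_predictors = adjusted_r2(0.9, 11, 5)
```

Comparing models on the adjusted value keeps a model with many SEMG channels from looking better simply because it has more inputs.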
Variable Selection for Regression Models of Percentile Flows
NASA Astrophysics Data System (ADS)
Fouad, G.
2017-12-01
Percentile flows describe the flow magnitude equaled or exceeded for a given percent of time, and are widely used in water resource management. However, these statistics are normally unavailable since most basins are ungauged. Percentile flows of ungauged basins are often predicted using regression models based on readily observable basin characteristics, such as mean elevation. The number of these independent variables is too large to evaluate all possible models. A subset of models is typically evaluated using automatic procedures, like stepwise regression. This ignores a large variety of methods from the field of feature (variable) selection and physical understanding of percentile flows. A study of 918 basins in the United States was conducted to compare an automatic regression procedure to the following variable selection methods: (1) principal component analysis, (2) correlation analysis, (3) random forests, (4) genetic programming, (5) Bayesian networks, and (6) physical understanding. The automatic regression procedure only performed better than principal component analysis. Poor performance of the regression procedure was due to a commonly used filter for multicollinearity, which rejected the strongest models because they had cross-correlated independent variables. Multicollinearity did not decrease model performance in validation because of a representative set of calibration basins. Variable selection methods based strictly on predictive power (numbers 2-5 from above) performed similarly, likely indicating a limit to the predictive power of the variables. Similar performance was also reached using variables selected based on physical understanding, a finding that substantiates recent calls to emphasize physical understanding in modeling for predictions in ungauged basins. The strongest variables highlighted the importance of geology and land cover, whereas widely used topographic variables were the weakest predictors. Variables suffered from a high degree of multicollinearity, possibly illustrating the co-evolution of climatic and physiographic conditions. Given the ineffectiveness of many variables used here, future work should develop new variables that target specific processes associated with percentile flows.
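The definition of a percentile flow given at the start of this abstract can be made concrete for a gauged record. The sketch below uses a simple empirical rule (the k-th largest observation, with k chosen so that at least the requested share of observations equal or exceed it); this rule and the names are assumptions for illustration, and operational studies often use other plotting positions or interpolation.

```python
import math


def percentile_flow(flows, percent):
    """Flow equaled or exceeded for at least `percent` of the record.

    `percent` is expected in the range (0, 100].
    """
    ordered = sorted(flows, reverse=True)
    rank = max(1, math.ceil(percent * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical ten-day record of daily flows.
daily_flows = [4, 9, 1, 7, 3, 10, 2, 8, 6, 5]
low_flow = percentile_flow(daily_flows, 90)   # exceeded most of the time
high_flow = percentile_flow(daily_flows, 10)  # exceeded rarely
```

For an ungauged basin no such record exists, which is why the study predicts these values from basin characteristics instead.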
Memory complaints in epilepsy: An examination of the role of mood and illness perceptions.
Tinson, Deborah; Crockford, Christopher; Gharooni, Sara; Russell, Helen; Zoeller, Sophie; Leavy, Yvonne; Lloyd, Rachel; Duncan, Susan
2018-03-01
The study examined the role of mood and illness perceptions in explaining the variance in the memory complaints of patients with epilepsy. Forty-four patients from an outpatient tertiary care center and 43 volunteer controls completed a formal assessment of memory and a verbal fluency test, as well as validated self-report questionnaires on memory complaints, mood, and illness perceptions. In hierarchical multiple regression analyses, objective memory test performance and verbal fluency did not contribute significantly to the variance in memory complaints for either patients or controls. In patients, illness perceptions and mood were highly correlated. Illness perceptions correlated more highly with memory complaints than mood and were therefore added to the multiple regression analysis. This accounted for an additional 25% of the variance, after controlling for objective memory test performance and verbal fluency, and the model was significant (model B). In order to compare with other studies, mood was added to a second model, instead of illness perceptions. This accounted for an additional 24% of the variance, which was again significant (model C). In controls, low mood accounted for 11% of the variance in memory complaints (model C2). A measure of illness perceptions was more highly correlated with the memory complaints of patients with epilepsy than with a measure of mood. In a hierarchical multiple regression model, illness perceptions accounted for 25% of the variance in memory complaints. Illness perceptions could provide useful information in a clinical investigation into the self-reported memory complaints of patients with epilepsy, alongside the assessment of mood and formal memory testing. Copyright © 2017 Elsevier Inc. All rights reserved.
Modified Regression Correlation Coefficient for Poisson Regression Model
NASA Astrophysics Data System (ADS)
Kaengthong, Nattacha; Domthong, Uthumporn
2017-09-01
This study focuses on widely used indicators of the predictive power of generalized linear models (GLMs), which often have some restrictions. We are interested in the regression correlation coefficient for a Poisson regression model. This is a measure of predictive power, defined by the relationship between the dependent variable (Y) and the expected value of the dependent variable given the independent variables [E(Y|X)] for the Poisson regression model. The dependent variable follows a Poisson distribution. The purpose of this research was to modify the regression correlation coefficient for the Poisson regression model. We also compare the proposed modified regression correlation coefficient with the traditional regression correlation coefficient in the case of two or more independent variables with multicollinearity among them. The results show that the proposed regression correlation coefficient is better than the traditional one in terms of bias and root mean square error (RMSE).
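The traditional regression correlation coefficient described above is the correlation between Y and E(Y|X). The sketch below computes it for a Poisson model with a log link and one predictor; the coefficients `b0` and `b1` are supplied by hand as hypothetical values rather than estimated, and the modified coefficient proposed in the study is not shown because the abstract does not define it.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    cov = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b))
    var_a = sum((x - a_bar) ** 2 for x in a)
    var_b = sum((y - b_bar) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def poisson_mean(x, b0, b1):
    """E(Y|X = x) under a log-link Poisson model with assumed coefficients."""
    return math.exp(b0 + b1 * x)


def regression_correlation(y, x, b0, b1):
    """Correlation between observed counts and their fitted means."""
    fitted = [poisson_mean(v, b0, b1) for v in x]
    return pearson(y, fitted)


# Hypothetical data whose counts match the fitted means 1, 2, 4 exactly.
predictor = [0.0, 1.0, 2.0]
counts = [1, 2, 4]
perfect = regression_correlation(counts, predictor, 0.0, math.log(2.0))
```

A value near 1 indicates that the fitted means track the observed counts closely, which is the sense in which the coefficient measures predictive power.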
Aqil, Muhammad; Kita, Ichiro; Yano, Akira; Nishiyama, Soichi
2007-10-01
Traditionally, multiple linear regression has been one of the most widely used techniques for simulating hydrological time series. However, when nonlinear phenomena are significant, multiple linear regression will fail to produce an appropriate predictive model. Recently, neuro-fuzzy systems have gained much popularity for calibrating nonlinear relationships. This study evaluated the potential of a neuro-fuzzy system as an alternative to the traditional statistical regression technique for the purpose of predicting flow from a local source in a river basin. The effectiveness of the proposed identification technique was demonstrated through a simulation study of the river flow time series of the Citarum River in Indonesia. Furthermore, in order to quantify the uncertainty associated with the estimation of river flow, a Monte Carlo simulation was performed. As a comparison, a multiple linear regression analysis that was being used by the Citarum River Authority was also examined using various statistical indices. The simulation results using 95% confidence intervals indicated that the neuro-fuzzy model consistently underestimated the magnitude of high flow, while the low and medium flow magnitudes were estimated closer to the observed data. The comparison of the prediction accuracy of the neuro-fuzzy and linear regression methods indicated that the neuro-fuzzy approach was more accurate in predicting river flow dynamics. The neuro-fuzzy model improved the root mean square error (RMSE) and mean absolute percentage error (MAPE) values of the multiple linear regression forecasts by about 13.52% and 10.73%, respectively. Considering its simplicity and efficiency, the neuro-fuzzy model is recommended as an alternative tool for modeling flow dynamics in the study area.
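The two error measures and the percentage improvement quoted above can be computed as follows. This is a minimal sketch using the standard definitions; the function names and the numbers in the example are hypothetical and are not the study's data.

```python
import math


def rmse(observed, predicted):
    """Root mean square error."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mape(observed, predicted):
    """Mean absolute percentage error, in percent (observed values must be nonzero)."""
    ratios = [abs(o - p) / abs(o) for o, p in zip(observed, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


def percent_improvement(baseline_error, new_error):
    """Relative reduction of an error measure, in percent."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Illustrative flows and forecasts.
observed_flow = [2.0, 4.0]
forecast = [1.0, 5.0]
```

An improvement of about 13.52% in RMSE therefore means the neuro-fuzzy RMSE was about 0.8648 times the multiple linear regression RMSE.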
Use of ocean color scanner data in water quality mapping
NASA Technical Reports Server (NTRS)
Khorram, S.
1981-01-01
Remotely sensed data, in combination with in situ data, are used in assessing water quality parameters within the San Francisco Bay-Delta. The parameters include suspended solids, chlorophyll, and turbidity. Regression models are developed between each of the water quality parameter measurements and the Ocean Color Scanner (OCS) data. The models are then extended to the entire study area for mapping water quality parameters. The results include a series of color-coded maps, each pertaining to one of the water quality parameters, and the statistical analysis of the OCS data and regression models. It is found that concurrently collected OCS data and surface truth measurements are highly useful in mapping the selected water quality parameters and locating areas having relatively high biological activity. In addition, it is found to be virtually impossible, at least within this test site, to locate such areas on U-2 color and color-infrared photography.
Visual abilities distinguish pitchers from hitters in professional baseball.
Klemish, David; Ramger, Benjamin; Vittetoe, Kelly; Reiter, Jerome P; Tokdar, Surya T; Appelbaum, Lawrence Gregory
2018-01-01
This study aimed to evaluate the possibility that differences in sensorimotor abilities exist between hitters and pitchers in a large cohort of baseball players of varying levels of experience. Secondary data analysis was performed on 9 sensorimotor tasks comprising the Nike Sensory Station assessment battery. Bayesian hierarchical regression modelling was applied to test for differences between pitchers and hitters in data from 566 baseball players (112 high school, 85 college, 369 professional) collected at 20 testing centres. Explanatory variables including height, handedness, eye dominance, concussion history, and player position were modelled along with age curves using basis regression splines. Regression analyses revealed better performance for hitters relative to pitchers at the professional level in the visual clarity and depth perception tasks, but these differences did not exist at the high school or college levels. No significant differences were observed in the other 7 measures of sensorimotor capabilities included in the test battery, and no systematic biases were found between the testing centres. These findings, indicating that professional-level hitters have better visual acuity and depth perception than professional-level pitchers, affirm the notion that highly experienced athletes have differing perceptual skills. Findings are discussed in relation to deliberate practice theory.
Sandborgh, Maria; Johansson, Ann-Christin; Söderlund, Anne
2016-01-01
In the fear-avoidance (FA) model social cognitive constructs could add to explaining the disabling process in whiplash associated disorder (WAD). The aim was to exemplify the possible input from Social Cognitive Theory on the FA model. Specifically the role of functional self-efficacy and perceived responses from a spouse/intimate partner was studied. A cross-sectional and correlational design was used. Data from 64 patients with acute WAD were used. Measures were pain intensity measured with a numerical rating scale, the Pain Disability Index, support, punishing responses, solicitous responses, and distracting responses subscales from the Multidimensional Pain Inventory, the Catastrophizing subscale from the Coping Strategies Questionnaire, the Tampa Scale of Kinesiophobia, and the Self-Efficacy Scale. Bivariate correlational, simple linear regression, and multiple regression analyses were used. In the statistical prediction models high pain intensity indicated high punishing responses, which indicated high catastrophizing. High catastrophizing indicated high fear of movement, which indicated low self-efficacy. Low self-efficacy indicated high disability, which indicated high pain intensity. All independent variables together explained 66.4% of the variance in pain disability, p < 0.001. Results suggest a possible link between one aspect of the social environment, perceived punishing responses from a spouse/intimate partner, pain intensity, and catastrophizing. Further, results support a mediating role of self-efficacy between fear of movement and disability in WAD.
Conceptual model of consumer’s willingness to eat functional foods
Babicz-Zielinska, Ewa; Jezewska-Zychowicz, Maria
Functional foods constitute an important segment of the food market. Among the factors that determine intentions to eat functional foods, psychological factors play very important roles. Motives, attitudes and personality are key factors. The relationships between socio-demographic characteristics, attitudes and willingness to purchase functional foods have not been fully confirmed. Consumers' beliefs about the health benefits of the foods they eat seem to be a strong determinant of the choice of functional foods. The objective of this study was to determine the relations between familiarity, attitudes, and beliefs in the benefits and risks of functional foods, and to develop some conceptual models of willingness to eat them. The sample of Polish consumers comprised 1002 subjects aged 15 and over. Foods enriched with vitamins or minerals, and cholesterol-lowering margarine or drinks, were considered. A questionnaire was constructed that focused on familiarity with the foods, attitudes, and beliefs about the benefits and risks of their consumption. Pearson's correlations and linear regression equations were calculated. The strongest relations appeared between attitudes, high health value and high benefits (r = 0.722 and 0.712 for enriched foods, and 0.664 and 0.693 for cholesterol-lowering foods), and between high health value and high benefits (0.814 for enriched foods and 0.758 for cholesterol-lowering foods). Conceptual models were developed based on linear regression of the relations between attitudes and all other variables, with and without familiarity with the foods. Positive attitudes and declared consumption are more important for enriched foods. Beliefs in high health value and high benefits play the most important role in the purchase. The interrelations between different variables may be described by new linear regression models, with beliefs in high benefits, positive attitudes and familiarity being the most significant predictors. Health expectations and trust in functional foods are the key factors in their choice.
REVEAL: An Extensible Reduced Order Model Builder for Simulation and Modeling
DOE Office of Scientific and Technical Information (OSTI.GOV)
Agarwal, Khushbu; Sharma, Poorva; Ma, Jinliang
2013-04-30
Many science domains need to build computationally efficient and accurate representations of high fidelity, computationally expensive simulations. These computationally efficient versions are known as reduced-order models. This paper presents the design and implementation of a novel reduced-order model (ROM) builder, the REVEAL toolset. This toolset generates ROMs based on science- and engineering-domain specific simulations executed on high performance computing (HPC) platforms. The toolset encompasses a range of sampling and regression methods that can be used to generate a ROM, automatically quantifies the ROM accuracy, and provides support for an iterative approach to improve ROM accuracy. REVEAL is designed to be extensible in order to utilize the core functionality with any simulator that has published input and output formats. It also defines programmatic interfaces to include new sampling and regression techniques so that users can ‘mix and match’ mathematical techniques to best suit the characteristics of their model. In this paper, we describe the architecture of REVEAL and demonstrate its usage with a computational fluid dynamics model used in carbon capture.
Snyder, Marcia; Freeman, Mary C.; Purucker, S. Thomas; Pringle, Catherine M.
2016-01-01
Freshwater shrimps are an important biotic component of tropical ecosystems. However, they can have a low probability of detection when abundances are low. We sampled 3 of the most common freshwater shrimp species, Macrobrachium olfersii, Macrobrachium carcinus, and Macrobrachium heterochirus, and used occupancy modeling and logistic regression models to improve our limited knowledge of distribution of these cryptic species by investigating both local- and landscape-scale effects at La Selva Biological Station in Costa Rica. Local-scale factors included substrate type and stream size, and landscape-scale factors included presence or absence of regional groundwater inputs. Capture rates for 2 of the sampled species (M. olfersii and M. carcinus) were sufficient to compare the fit of occupancy models. Occupancy models did not converge for M. heterochirus, but M. heterochirus had high enough occupancy rates that logistic regression could be used to model the relationship between occupancy rates and predictors. The best-supported models for M. olfersii and M. carcinus included conductivity, discharge, and substrate parameters. Stream size was positively correlated with occupancy rates of all 3 species. High stream conductivity, which reflects the quantity of regional groundwater input into the stream, was positively correlated with M. olfersii occupancy rates. Boulder substrates increased occupancy rate of M. carcinus and decreased the detection probability of M. olfersii. Our models suggest that shrimp distribution is driven by factors that function at local (substrate and discharge) and landscape (conductivity) scales.
Laurens, L M L; Wolfrum, E J
2013-12-18
One of the challenges associated with microalgal biomass characterization and the comparison of microalgal strains and conversion processes is the rapid determination of the composition of algae. We have developed and applied a high-throughput screening technology based on near-infrared (NIR) spectroscopy for the rapid and accurate determination of algal biomass composition. We show that NIR spectroscopy can accurately predict the full composition using multivariate linear regression analysis of varying lipid, protein, and carbohydrate content of algal biomass samples from three strains. We also demonstrate a high quality of predictions of an independent validation set. A high-throughput 96-well configuration for spectroscopy gives equally good prediction relative to a ring-cup configuration, and thus, spectra can be obtained from as little as 10-20 mg of material. We found that lipids exhibit a dominant, distinct, and unique fingerprint in the NIR spectrum that allows for the use of single and multiple linear regression of respective wavelengths for the prediction of the biomass lipid content. This is not the case for carbohydrate and protein content, and thus, the use of multivariate statistical modeling approaches remains necessary.
NASA Astrophysics Data System (ADS)
Qianxiang, Zhou
2012-07-01
Clarifying the geometric characteristics of human body segments and constructing an analysis model are very important for ergonomic design and for the application of ergonomic virtual humans. Typical anthropometric data of 1122 Chinese men aged 20-35 years were collected using a three-dimensional laser body scanner. Based on the correlations between parameters, curves were fitted between seven trunk parameters and ten body parameters with the SPSS 16.0 software. Hip circumference and shoulder breadth were the most important parameters in the models, and both were highly correlated with the other body parameters. Compared with conventional regression curves, the present regression equations with the seven trunk parameters forecast the geometric dimensions of the head, neck, height and four limbs more accurately. They are therefore of great value for ergonomic design and the analysis of man-machine systems. This result will be very useful for astronaut body model analysis and application.
Winters, Eric R; Petosa, Rick L; Charlton, Thomas E
2003-06-01
To examine whether knowledge of high school students' actions of self-regulation, and perceptions of self-efficacy to overcome exercise barriers, social situation, and outcome expectation will predict non-school related moderate and vigorous physical exercise. High school students enrolled in introductory Physical Education courses completed questionnaires that targeted selected Social Cognitive Theory variables. They also self-reported their typical "leisure-time" exercise participation using a standardized questionnaire. Bivariate correlation statistic and hierarchical regression were conducted on reports of moderate and vigorous exercise frequency. Each predictor variable was significantly associated with measures of moderate and vigorous exercise frequency. All predictor variables were significant in the final regression model used to explain vigorous exercise. After controlling for the effects of gender, the psychosocial variables explained 29% of variance in vigorous exercise frequency. Three of four predictor variables were significant in the final regression equation used to explain moderate exercise. The final regression equation accounted for 11% of variance in moderate exercise frequency. Professionals who attempt to increase the prevalence of physical exercise through educational methods should focus on the psychosocial variables utilized in this study.
Røislien, Jo; Lossius, Hans Morten; Kristiansen, Thomas
2015-01-01
Background: Trauma is a leading global cause of death. Trauma mortality rates are higher in rural areas, constituting a challenge for quality and equality in trauma care. The aim of the study was to explore population density and transport time to hospital care as possible predictors of geographical differences in mortality rates, and to what extent choice of statistical method might affect the analytical results and accompanying clinical conclusions. Methods: Using data from the Norwegian Cause of Death registry, deaths from external causes 1998–2007 were analysed. Norway consists of 434 municipalities, and municipality population density and travel time to hospital care were entered as predictors of municipality mortality rates in univariate and multiple regression models of increasing model complexity. We fitted linear regression models with continuous and categorised predictors, as well as piecewise linear and generalised additive models (GAMs). Models were compared using Akaike's information criterion (AIC). Results: Population density was an independent predictor of trauma mortality rates, while the contribution of transport time to hospital care was highly dependent on choice of statistical model. A multiple GAM or piecewise linear model was superior, and similar, in terms of AIC. However, while transport time was statistically significant in multiple models with piecewise linear or categorised predictors, it was not in GAM or standard linear regression. Conclusions: Population density is an independent predictor of trauma mortality rates. The added explanatory value of transport time to hospital care is marginal and model-dependent, highlighting the importance of exploring several statistical models when studying complex associations in observational data. PMID:25972600
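As a rough illustration of the AIC-based model comparison described above, the sketch below computes a Gaussian AIC (up to an additive constant) for two least-squares fits of increasing complexity. The data are invented for illustration and are not from the Norwegian registry; the function names are hypothetical, and the two models are deliberately simpler than the piecewise linear models and GAMs used in the study.

```python
import math

def fit_mean(y):
    """Intercept-only model: returns fitted values and number of mean parameters."""
    m = sum(y) / len(y)
    return [m] * len(y), 1

def fit_line(x, y):
    """Simple least-squares line: returns fitted values and number of mean parameters."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return [intercept + slope * xi for xi in x], 2

def gaussian_aic(y, fitted, n_params):
    """AIC up to a constant: n*ln(RSS/n) + 2*(k + 1), the +1 counting the error variance."""
    n = len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return n * math.log(rss / n) + 2 * (n_params + 1)

# Hypothetical data (made up): a predictor and a mortality-rate-like outcome.
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
y = [9.1, 8.4, 8.0, 7.1, 6.9, 6.0, 5.8, 5.1]

aic_mean = gaussian_aic(y, *fit_mean(y))
aic_line = gaussian_aic(y, *fit_line(x, y))
best = "line" if aic_line < aic_mean else "mean"  # lower AIC is preferred
```

The same `gaussian_aic` helper could be reused for any model that returns fitted values and a parameter count, which is the sense in which several model families can be ranked on one scale.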
Regression modeling of ground-water flow
Cooley, R.L.; Naff, R.L.
1985-01-01
Nonlinear multiple regression methods are developed to model and analyze groundwater flow systems. Complete descriptions of regression methodology as applied to groundwater flow models allow scientists and engineers engaged in flow modeling to apply the methods to a wide range of problems. Organization of the text proceeds from an introduction that discusses the general topic of groundwater flow modeling, to a review of basic statistics necessary to properly apply regression techniques, and then to the main topic: exposition and use of linear and nonlinear regression to model groundwater flow. Statistical procedures are given to analyze and use the regression models. A number of exercises and answers are included to exercise the student on nearly all the methods that are presented for modeling and statistical analysis. Three computer programs implement the more complex methods. These three are a general two-dimensional, steady-state regression model for flow in an anisotropic, heterogeneous porous medium, a program to calculate a measure of model nonlinearity with respect to the regression parameters, and a program to analyze model errors in computed dependent variables such as hydraulic head. (USGS)
First-year growth, recruitment, and maturity of walleyes in western Lake Erie
Madenjian, Charles P.; Tyson, Jeffrey T.; Knight, Roger L.; Kershner, Mark W.; Hansen, Michael J.
1996-01-01
In some lakes, first-year growth of walleyes Stizostedion vitreum has been identified as an important factor governing recruitment of juveniles to the adult population. We developed a regression model for walleye recruitment in western Lake Erie by considering factors such as first-year growth, size of the spawning stock, the rate at which the lake warmed during the spring, and abundance of gizzard shad Dorosoma cepedianum. Gizzard shad abundance during the fall prior to spring walleye spawning explained over 40% of the variation in walleye recruitment. Gizzard shad are relatively high in lipids and are preferred prey for walleyes in Lake Erie. Therefore, the high degree of correlation between shad abundance and subsequent walleye recruitment supported the contention that mature females needed adequate lipid reserves during the winter to spawn the following spring. According to the regression analysis, spring warming rate and size of the parental stock also influenced walleye recruitment. Our regression model explained 92% of the variation in recruitment of age-2 fish into the Lake Erie walleye population from 1981 to 1993. The regression model is potentially valuable as a management tool because it could be used to forecast walleye recruitment to the fishery 2 years in advance. First-year growth was poorly correlated with recruitment, which may reflect the unusually low incidence of walleye cannibalism in western Lake Erie. In contrast, first-year growth was strongly linked to age at maturity.
Predicting birth weight with conditionally linear transformation models.
Möst, Lisa; Schmid, Matthias; Faschingbauer, Florian; Hothorn, Torsten
2016-12-01
Low and high birth weight (BW) are important risk factors for neonatal morbidity and mortality. Gynecologists must therefore accurately predict BW before delivery. Most prediction formulas for BW are based on prenatal ultrasound measurements carried out within one week prior to birth. Although successfully used in clinical practice, these formulas focus on point predictions of BW but do not systematically quantify uncertainty of the predictions, i.e. they result in estimates of the conditional mean of BW but do not deliver prediction intervals. To overcome this problem, we introduce conditionally linear transformation models (CLTMs) to predict BW. Instead of focusing only on the conditional mean, CLTMs model the whole conditional distribution function of BW given prenatal ultrasound parameters. Consequently, the CLTM approach delivers both point predictions of BW and fetus-specific prediction intervals. Prediction intervals constitute an easy-to-interpret measure of prediction accuracy and allow identification of fetuses subject to high prediction uncertainty. Using a data set of 8712 deliveries at the Perinatal Centre at the University Clinic Erlangen (Germany), we analyzed variants of CLTMs and compared them to standard linear regression estimation techniques used in the past and to quantile regression approaches. The best-performing CLTM variant was competitive with quantile regression and linear regression approaches in terms of conditional coverage and average length of the prediction intervals. We propose that CLTMs be used because they are able to account for possible heteroscedasticity, kurtosis, and skewness of the distribution of BWs. © The Author(s) 2014.
The Application of the Cumulative Logistic Regression Model to Automated Essay Scoring
ERIC Educational Resources Information Center
Haberman, Shelby J.; Sinharay, Sandip
2010-01-01
Most automated essay scoring programs use a linear regression model to predict an essay score from several essay features. This article applied a cumulative logit model instead of the linear regression model to automated essay scoring. Comparison of the performances of the linear regression model and the cumulative logit model was performed on a…
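To make the contrast with linear regression concrete, the sketch below evaluates category probabilities under one common parameterization of the cumulative logit (proportional odds) model, P(Y <= j) = 1 / (1 + exp(-(theta_j - eta))). The truncated abstract does not state the exact parameterization used in the article, so the form, the thresholds and the value of the linear predictor are assumptions for illustration only.

```python
import math

def cumulative_logit_probs(eta, thresholds):
    """Score-category probabilities from cumulative logits.

    eta        : linear predictor built from essay features (hypothetical value here)
    thresholds : increasing cut points theta_1 < ... < theta_{K-1}
    """
    cum = [1.0 / (1.0 + math.exp(-(t - eta))) for t in thresholds]
    cum.append(1.0)  # P(Y <= K) is always 1
    # Each category probability is the difference of adjacent cumulative probabilities.
    return [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]

# Four score categories defined by three assumed thresholds.
probs = cumulative_logit_probs(eta=0.0, thresholds=[-1.0, 0.0, 1.0])
```

Unlike a linear regression prediction, the output is a full probability distribution over the discrete score categories.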
NASA Astrophysics Data System (ADS)
Bradshaw, Tyler; Fu, Rau; Bowen, Stephen; Zhu, Jun; Forrest, Lisa; Jeraj, Robert
2015-07-01
Dose painting relies on the ability of functional imaging to identify resistant tumor subvolumes to be targeted for additional boosting. This work assessed the ability of FDG, FLT, and Cu-ATSM PET imaging to predict the locations of residual FDG PET in canine tumors following radiotherapy. Nineteen canines with spontaneous sinonasal tumors underwent PET/CT imaging with radiotracers FDG, FLT, and Cu-ATSM prior to hypofractionated radiotherapy. Therapy consisted of 10 fractions of 4.2 Gy to the sinonasal cavity with or without an integrated boost of 0.8 Gy to the GTV. Patients had an additional FLT PET/CT scan after fraction 2, a Cu-ATSM PET/CT scan after fraction 3, and follow-up FDG PET/CT scans after radiotherapy. Following image registration, simple and multiple linear and logistic voxel regressions were performed to assess how well pre- and mid-treatment PET imaging predicted post-treatment FDG uptake. R2 and pseudo R2 were used to assess the goodness of fits. For simple linear regression models, regression coefficients for all pre- and mid-treatment PET images were significantly positive across the population (P < 0.05). However, there was large variability among patients in goodness of fits: R2 ranged from 0.00 to 0.85, with a median of 0.12. Results for logistic regression models were similar. Multiple linear regression models resulted in better fits (median R2 = 0.31), but there was still large variability between patients in R2. The R2 from regression models for different predictor variables were highly correlated across patients (R ≈ 0.8), indicating tumors that were poorly predicted with one tracer were also poorly predicted by other tracers. In conclusion, the high inter-patient variability in goodness of fits indicates that PET was able to predict locations of residual tumor in some patients, but not others. This suggests not all patients would be good candidates for dose painting based on a single biological target.
Bradshaw, Tyler; Fu, Rau; Bowen, Stephen; Zhu, Jun; Forrest, Lisa; Jeraj, Robert
2015-07-07
Dose painting relies on the ability of functional imaging to identify resistant tumor subvolumes to be targeted for additional boosting. This work assessed the ability of FDG, FLT, and Cu-ATSM PET imaging to predict the locations of residual FDG PET in canine tumors following radiotherapy. Nineteen canines with spontaneous sinonasal tumors underwent PET/CT imaging with radiotracers FDG, FLT, and Cu-ATSM prior to hypofractionated radiotherapy. Therapy consisted of 10 fractions of 4.2 Gy to the sinonasal cavity with or without an integrated boost of 0.8 Gy to the GTV. Patients had an additional FLT PET/CT scan after fraction 2, a Cu-ATSM PET/CT scan after fraction 3, and follow-up FDG PET/CT scans after radiotherapy. Following image registration, simple and multiple linear and logistic voxel regressions were performed to assess how well pre- and mid-treatment PET imaging predicted post-treatment FDG uptake. R(2) and pseudo R(2) were used to assess the goodness of fits. For simple linear regression models, regression coefficients for all pre- and mid-treatment PET images were significantly positive across the population (P < 0.05). However, there was large variability among patients in goodness of fits: R(2) ranged from 0.00 to 0.85, with a median of 0.12. Results for logistic regression models were similar. Multiple linear regression models resulted in better fits (median R(2) = 0.31), but there was still large variability between patients in R(2). The R(2) from regression models for different predictor variables were highly correlated across patients (R ≈ 0.8), indicating tumors that were poorly predicted with one tracer were also poorly predicted by other tracers. In conclusion, the high inter-patient variability in goodness of fits indicates that PET was able to predict locations of residual tumor in some patients, but not others. This suggests not all patients would be good candidates for dose painting based on a single biological target.
NASA Astrophysics Data System (ADS)
Shrivastava, Prashant Kumar; Pandey, Arun Kumar
2018-06-01
Inconel-718 is in high demand in different industries due to its superior mechanical properties. Traditional cutting methods face difficulties in cutting this alloy due to its low thermal potential, lower elasticity and high chemical compatibility at elevated temperature. Traditional machining also faces challenges in machining and/or finishing unusual shapes and/or sizes in this material. Laser beam cutting may be applied for miniaturization and ultra-precision cutting and/or finishing through appropriate control of the process parameters. This paper presents a multi-objective optimization of the kerf deviation, kerf width and kerf taper in the laser cutting of Inconel-718 sheet. Second-order regression models were developed for the quality characteristics from experimental data. The regression models were used as objective functions for multi-objective optimization based on a hybrid approach of multiple regression analysis and genetic algorithm. Comparison of the optimization results with the experimental results shows an improvement of 88%, 10.63% and 42.15% in kerf deviation, kerf width and kerf taper, respectively. Finally, the effects of the process parameters on the quality characteristics are discussed.
NASA Astrophysics Data System (ADS)
Lombardo, L.; Cama, M.; Maerker, M.; Parisi, L.; Rotigliano, E.
2014-12-01
This study aims at comparing the performances of Binary Logistic Regression (BLR) and Boosted Regression Trees (BRT) methods in assessing landslide susceptibility for multiple-occurrence regional landslide events within the Mediterranean region. A test area was selected in the north-eastern sector of Sicily (southern Italy), corresponding to the catchments of the Briga and the Giampilieri streams, both stretching for a few kilometres from the Peloritan ridge (eastern Sicily, Italy) to the Ionian sea. This area was struck on the 1st October 2009 by an extreme climatic event resulting in thousands of rapid shallow landslides, mainly of debris flow and debris avalanche types, involving the weathered layer of a low to high grade metamorphic bedrock. Exploiting the same set of predictors and the 2009 landslide archive, BLR- and BRT-based susceptibility models were obtained for the two catchments separately, adopting a random partition (RP) technique for validation; in addition, the models trained in one of the two catchments (Briga) were tested in predicting the landslide distribution in the other (Giampilieri), adopting a spatial partition (SP) based validation procedure. All the validation procedures were based on multi-fold tests so as to evaluate and compare the reliability of the fitting, the prediction skill, the coherence in the predictor selection and the precision of the susceptibility estimates. All the models obtained with the two methods produced very high predictive performances, with a general congruence between BLR and BRT in the predictor importance. In particular, the research highlighted that BRT models reached a higher prediction performance than BLR models for RP-based modelling, whilst for the SP-based models the difference in predictive skill between the two methods dropped drastically, converging to an analogous excellent performance.
However, when looking at the precision of the probability estimates, BLR produced more robust models in terms of selected predictors and coefficients, as well as of dispersion of the estimated probabilities around the mean value for each mapped pixel. The difference in behaviour could be interpreted as the result of overfitting effects, which affect decision tree classification more heavily than logistic regression techniques.
A semi-nonparametric Poisson regression model for analyzing motor vehicle crash data.
Ye, Xin; Wang, Ke; Zou, Yajie; Lord, Dominique
2018-01-01
This paper develops a semi-nonparametric Poisson regression model to analyze motor vehicle crash frequency data collected from rural multilane highway segments in California, US. Motor vehicle crash frequency on rural highway is a topic of interest in the area of transportation safety due to higher driving speeds and the resultant severity level. Unlike the traditional Negative Binomial (NB) model, the semi-nonparametric Poisson regression model can accommodate an unobserved heterogeneity following a highly flexible semi-nonparametric (SNP) distribution. Simulation experiments are conducted to demonstrate that the SNP distribution can well mimic a large family of distributions, including normal distributions, log-gamma distributions, bimodal and trimodal distributions. Empirical estimation results show that such flexibility offered by the SNP distribution can greatly improve model precision and the overall goodness-of-fit. The semi-nonparametric distribution can provide a better understanding of crash data structure through its ability to capture potential multimodality in the distribution of unobserved heterogeneity. When estimated coefficients in empirical models are compared, SNP and NB models are found to have a substantially different coefficient for the dummy variable indicating the lane width. The SNP model with better statistical performance suggests that the NB model overestimates the effect of lane width on crash frequency reduction by 83.1%.
Selapa, N W; Nephawe, K A; Maiwashe, A; Norris, D
2012-02-08
The aim of this study was to estimate genetic parameters for body weights of individually fed beef bulls measured at centralized testing stations in South Africa using random regression models. Weekly body weights of Bonsmara bulls (N = 2919) tested between 1999 and 2003 were available for the analyses. The model included a fixed regression of the body weights on fourth-order orthogonal Legendre polynomials of the actual days on test (7, 14, 21, 28, 35, 42, 49, 56, 63, 70, 77, and 84) for starting age and contemporary group effects. Random regressions on fourth-order orthogonal Legendre polynomials of the actual days on test were included for additive genetic effects and additional uncorrelated random effects of the weaning-herd-year and the permanent environment of the animal. Residual effects were assumed to be independently distributed with heterogeneous variance for each test day. Variance ratios for additive genetic, permanent environment and weaning-herd-year for weekly body weights at different test days ranged from 0.26 to 0.29, 0.37 to 0.44 and 0.26 to 0.34, respectively. The weaning-herd-year was found to have a significant effect on the variation of body weights of bulls despite a 28-day adjustment period. Genetic correlations amongst body weights at different test days were high, ranging from 0.89 to 1.00. Heritability estimates were comparable to literature using multivariate models. Therefore, random regression model could be applied in the genetic evaluation of body weight of individually fed beef bulls in South Africa.
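The Legendre covariates used in random regression models of this kind can be sketched as follows. The example maps the twelve test days onto [-1, 1] and evaluates the polynomials with Bonnet's recurrence. Treating "fourth-order" as polynomial degree 4 is an assumption (some animal breeding literature uses "order" for the number of coefficients), the polynomials are left unnormalized, and the function name is hypothetical.

```python
def legendre_values(x, max_degree):
    """Return [P_0(x), ..., P_max_degree(x)] via Bonnet's recurrence."""
    values = [1.0, x]
    for n in range(1, max_degree):
        # (n + 1) P_{n+1}(x) = (2n + 1) x P_n(x) - n P_{n-1}(x)
        values.append(((2 * n + 1) * x * values[n] - n * values[n - 1]) / (n + 1))
    return values[: max_degree + 1]

days = list(range(7, 85, 7))  # 7, 14, ..., 84 days on test
lo, hi = days[0], days[-1]

# Map each test day onto [-1, 1], the interval on which Legendre polynomials are orthogonal.
standardized = [2.0 * (d - lo) / (hi - lo) - 1.0 for d in days]

# One row of covariates per test day; degree 4 is an assumption about "fourth-order".
design = [legendre_values(t, 4) for t in standardized]
```

Each row of `design` would then multiply the fixed or random regression coefficients for the corresponding test day.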
NASA Astrophysics Data System (ADS)
Polat, Esra; Gunay, Suleyman
2013-10-01
One of the problems encountered in Multiple Linear Regression (MLR) is multicollinearity, which causes overestimation of the regression parameters and inflates their variance. Hence, when multicollinearity is present, biased estimation procedures such as classical Principal Component Regression (CPCR) and Partial Least Squares Regression (PLSR) are used. The SIMPLS algorithm is the leading PLSR algorithm because of its speed and efficiency, and because its results are easier to interpret. However, both CPCR and SIMPLS yield very unreliable results when the data set contains outlying observations. Therefore, Hubert and Vanden Branden (2003) presented a robust PCR (RPCR) method and a robust PLSR (RPLSR) method called RSIMPLS. In RPCR, a robust Principal Component Analysis (PCA) method for high-dimensional data is first applied to the independent variables, and the dependent variables are then regressed on the scores using a robust regression method. RSIMPLS is constructed from a robust covariance matrix for high-dimensional data and robust linear regression. The purpose of this study is to show the use of the RPCR and RSIMPLS methods on an econometric data set by comparing the two methods on an inflation model of Turkey. The methods were compared in terms of predictive ability and goodness of fit using a robust Root Mean Squared Error of Cross-validation (R-RMSECV), a robust R2 value and the Robust Component Selection (RCS) statistic.
Moderation analysis using a two-level regression model.
Yuan, Ke-Hai; Cheng, Ying; Maxwell, Scott
2014-10-01
Moderation analysis is widely used in social and behavioral research. The most commonly used model for moderation analysis is moderated multiple regression (MMR) in which the explanatory variables of the regression model include product terms, and the model is typically estimated by least squares (LS). This paper argues for a two-level regression model in which the regression coefficients of a criterion variable on predictors are further regressed on moderator variables. An algorithm for estimating the parameters of the two-level model by normal-distribution-based maximum likelihood (NML) is developed. Formulas for the standard errors (SEs) of the parameter estimates are provided and studied. Results indicate that, when heteroscedasticity exists, NML with the two-level model gives more efficient and more accurate parameter estimates than the LS analysis of the MMR model. When error variances are homoscedastic, NML with the two-level model leads to essentially the same results as LS with the MMR model. Most importantly, the two-level regression model permits estimating the percentage of variance of each regression coefficient that is due to moderator variables. When applied to data from General Social Surveys 1991, NML with the two-level model identified a significant moderation effect of race on the regression of job prestige on years of education while LS with the MMR model did not. An R package is also developed and documented to facilitate the application of the two-level model.
Zhang, Guosheng; Huang, Kuan-Chieh; Xu, Zheng; Tzeng, Jung-Ying; Conneely, Karen N; Guan, Weihua; Kang, Jian; Li, Yun
2016-05-01
DNA methylation is a key epigenetic mark involved in both normal development and disease progression. Recent advances in high-throughput technologies have enabled genome-wide profiling of DNA methylation. However, DNA methylation profiling often employs different designs and platforms with varying resolution, which hinders joint analysis of methylation data from multiple platforms. In this study, we propose a penalized functional regression model to impute missing methylation data. By incorporating functional predictors, our model utilizes information from nonlocal probes to improve imputation quality. Here, we compared the performance of our functional model to linear regression and the best single probe surrogate in real data and via simulations. Specifically, we applied different imputation approaches to an acute myeloid leukemia dataset consisting of 194 samples and our method showed higher imputation accuracy, manifested, for example, by a 94% relative increase in information content and up to 86% more CpG sites passing post-imputation filtering. Our simulated association study further demonstrated that our method substantially improves the statistical power to identify trait-associated methylation loci. These findings indicate that the penalized functional regression model is a convenient and valuable imputation tool for methylation data, and it can boost statistical power in downstream epigenome-wide association study (EWAS). © 2016 WILEY PERIODICALS, INC.
NASA Astrophysics Data System (ADS)
Leroux, Romain; Chatellier, Ludovic; David, Laurent
2018-01-01
This article is devoted to the estimation of time-resolved particle image velocimetry (TR-PIV) flow fields using time-resolved point measurements of a voltage signal obtained by hot-film anemometry. A multiple linear regression model is first defined to map the TR-PIV flow fields onto the voltage signal. Due to the high temporal resolution of the signal acquired by the hot-film sensor, the estimates of the TR-PIV flow fields are obtained with a multiple linear regression method called orthonormalized partial least squares regression (OPLSR). Subsequently, this model is incorporated as the observation equation in an ensemble Kalman filter (EnKF) applied on a proper orthogonal decomposition reduced-order model to stabilize it while reducing the effects of the hot-film sensor noise. This method is assessed for the reconstruction of the flow around a NACA0012 airfoil at a Reynolds number of 1000 and an angle of attack of 20°. Comparisons with multi-time delay-modified linear stochastic estimation show that both the OPLSR and EnKF combined with OPLSR are more accurate as they produce a much lower relative estimation error, and provide a faithful reconstruction of the time evolution of the velocity flow fields.
Kolasa-Wiecek, Alicja
2015-04-01
The energy sector in Poland is the source of 81% of greenhouse gas (GHG) emissions. Among European Union countries, Poland occupies a leading position with regard to coal consumption. The Polish energy sector actively participates in efforts to reduce GHG emissions to the atmosphere through a gradual decrease of the share of coal in the fuel mix and the development of renewable energy sources. All evidence that adds to the knowledge about issues related to GHG emissions is a valuable source of information. The article presents the results of modeling the GHG emissions generated by the energy sector in Poland. For a better understanding of the quantitative relationship between total consumption of primary energy and greenhouse gas emissions, a multiple stepwise regression model was applied. The modeling results for CO2 emissions demonstrate a strong relationship (0.97) with the hard coal consumption variable. The adjustment coefficient of the model to the actual data is high and equal to 95%. For CH4 emissions, the backward stepwise regression model indicated hard coal (0.66), peat and fuel wood (0.34), and solid waste fuels as well as other sources (-0.64) as the most important variables. The adjusted coefficient is suitable and equals R2 = 0.90. For N2O emission modeling, the obtained coefficient of determination is low and equal to 43%. A significant variable influencing the amount of N2O emissions is peat and wood fuel consumption. Copyright © 2015. Published by Elsevier B.V.
Alternative High School Students: Prevalence and Correlates of Overweight
ERIC Educational Resources Information Center
Kubik, Martha Y.; Davey, Cynthia; Fulkerson, Jayne A.; Sirard, John; Story, Mary; Arcan, Chrisa
2009-01-01
Objective: To determine prevalence and correlates of overweight among adolescents attending alternative high schools (AHS). Methods: AHS students (n=145) from 6 schools completed surveys and anthropometric measures. Cross-sectional associations were assessed using mixed model multivariate logistic regression. Results: Among students, 42% were…
The Hill model of concentration-response is ubiquitous in toxicology, perhaps because its parameters directly relate to biologically significant metrics of toxicity such as efficacy and potency. Point estimates of these parameters obtained through least squares regression or maxi...
Local regression type methods applied to the study of geophysics and high frequency financial data
NASA Astrophysics Data System (ADS)
Mariani, M. C.; Basu, K.
2014-09-01
In this work we applied locally weighted scatterplot smoothing techniques (Lowess/Loess) to Geophysical and high frequency financial data. We first analyze and apply this technique to the California earthquake geological data. A spatial analysis was performed to show that the estimation of the earthquake magnitude at a fixed location is very accurate up to the relative error of 0.01%. We also applied the same method to a high frequency data set arising in the financial sector and obtained similar satisfactory results. The application of this approach to the two different data sets demonstrates that the overall method is accurate and efficient, and the Lowess approach is much more desirable than the Loess method. The previous works studied the time series analysis; in this paper our local regression models perform a spatial analysis for the geophysics data providing different information. For the high frequency data, our models estimate the curve of best fit where data are dependent on time.
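The local regression idea described in this abstract can be sketched in a few lines. The block below is not the authors' code: it fits one tricube-weighted straight line around a query point (a single Lowess-style pass without robustness iterations). The function name `local_linear_fit`, the `frac` parameter, and the toy data are assumptions made for illustration only.

```python
def local_linear_fit(xs, ys, x0, frac=0.5):
    """Estimate y at x0 from a tricube-weighted straight-line fit to nearby points."""
    n = len(xs)
    k = max(2, int(round(frac * n)))            # number of neighbours in the window
    dists = sorted(abs(x - x0) for x in xs)
    h = dists[k - 1] or 1e-12                   # bandwidth: distance to the k-th nearest point
    # Tricube weights: 1 at x0, falling to 0 at distance h and beyond.
    w = [(1.0 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
    sw = sum(w)
    if sw == 0.0:
        raise ValueError("no points received a positive weight; increase frac")
    # Weighted means, then weighted least-squares slope.
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0.0 else 0.0
    return my + slope * (x0 - mx)


# Toy data on an exact straight line (hypothetical, not from the paper).
toy_x = [float(i) for i in range(11)]
toy_y = [2.0 * x + 1.0 for x in toy_x]
print(local_linear_fit(toy_x, toy_y, 4.5))
```

A local linear fit reproduces exactly linear data, which makes a convenient sanity check; on real series the choice of `frac` controls the trade-off between smoothness and fidelity.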
Authoritative School Climate and High School Dropout Rates
ERIC Educational Resources Information Center
Jia, Yuane; Konold, Timothy R.; Cornell, Dewey
2016-01-01
This study tested the association between school-wide measures of an authoritative school climate and high school dropout rates in a statewide sample of 315 high schools. Regression models at the school level of analysis used teacher and student measures of disciplinary structure, student support, and academic expectations to predict overall high…
ERIC Educational Resources Information Center
Roulette-McIntyre, Ovella; Bagaka's, Joshua G.; Drake, Daniel D.
2005-01-01
This study identified parental practices that relate positively to high school students' academic performance. Parents of 643 high school students participated in the study. Data analysis, using a multiple linear regression model, shows parent-school connection, student gender, and race are significant predictors of student academic performance.…
The microcomputer scientific software series 2: general linear model--regression.
Harold M. Rauscher
1983-01-01
The general linear model regression (GLMR) program provides the microcomputer user with a sophisticated regression analysis capability. The output provides a regression ANOVA table, estimators of the regression model coefficients, their confidence intervals, confidence intervals around the predicted Y-values, residuals for plotting, a check for multicollinearity, a...
Functional CAR models for large spatially correlated functional datasets.
Zhang, Lin; Baladandayuthapani, Veerabhadran; Zhu, Hongxiao; Baggerly, Keith A; Majewski, Tadeusz; Czerniak, Bogdan A; Morris, Jeffrey S
2016-01-01
We develop a functional conditional autoregressive (CAR) model for spatially correlated data for which functions are collected on areal units of a lattice. Our model performs functional response regression while accounting for spatial correlations with potentially nonseparable and nonstationary covariance structure, in both the space and functional domains. We show theoretically that our construction leads to a CAR model at each functional location, with spatial covariance parameters varying and borrowing strength across the functional domain. Using basis transformation strategies, the nonseparable spatial-functional model is computationally scalable to enormous functional datasets, generalizable to different basis functions, and can be used on functions defined on higher dimensional domains such as images. Through simulation studies, we demonstrate that accounting for the spatial correlation in our modeling leads to improved functional regression performance. Applied to a high-throughput spatially correlated copy number dataset, the model identifies genetic markers not identified by comparable methods that ignore spatial correlations.
Random regression models using different functions to model milk flow in dairy cows.
Laureano, M M M; Bignardi, A B; El Faro, L; Cardoso, V L; Tonhati, H; Albuquerque, L G
2014-09-12
We analyzed 75,555 test-day milk flow records from 2175 primiparous Holstein cows that calved between 1997 and 2005. Milk flow was obtained by dividing the mean milk yield (kg) of the 3 daily milking by the total milking time (min) and was expressed as kg/min. Milk flow was grouped into 43 weekly classes. The analyses were performed using a single-trait Random Regression Models that included direct additive genetic, permanent environmental, and residual random effects. In addition, the contemporary group and linear and quadratic effects of cow age at calving were included as fixed effects. Fourth-order orthogonal Legendre polynomial of days in milk was used to model the mean trend in milk flow. The additive genetic and permanent environmental covariance functions were estimated using random regression Legendre polynomials and B-spline functions of days in milk. The model using a third-order Legendre polynomial for additive genetic effects and a sixth-order polynomial for permanent environmental effects, which contained 7 residual classes, proved to be the most adequate to describe variations in milk flow, and was also the most parsimonious. The heritability in milk flow estimated by the most parsimonious model was of moderate to high magnitude.
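Random regression models of this kind use Legendre polynomials of standardized days in milk as covariates. The sketch below shows how such covariates could be generated; it is an illustration, not the authors' implementation. The day range (5 to 305) and the function names are assumptions, and the polynomials are left unnormalized, whereas many animal-breeding packages rescale them.

```python
def standardize(dim, dim_min, dim_max):
    """Map days in milk from [dim_min, dim_max] onto [-1, 1]."""
    return 2.0 * (dim - dim_min) / (dim_max - dim_min) - 1.0


def legendre_values(x, degree):
    """Return [P_0(x), ..., P_degree(x)] using Bonnet's recurrence."""
    values = [1.0]
    if degree >= 1:
        values.append(x)
    for n in range(1, degree):
        # (n + 1) P_{n+1}(x) = (2n + 1) x P_n(x) - n P_{n-1}(x)
        values.append(((2 * n + 1) * x * values[n] - n * values[n - 1]) / (n + 1))
    return values


# Hypothetical lactation window of 5 to 305 days in milk.
covariates = legendre_values(standardize(230, 5, 305), 3)
print(covariates)
```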
An efficient surrogate-based simulation-optimization method for calibrating a regional MODFLOW model
NASA Astrophysics Data System (ADS)
Chen, Mingjie; Izady, Azizallah; Abdalla, Osman A.
2017-01-01
Simulation-optimization method entails a large number of model simulations, which is computationally intensive or even prohibitive if the model simulation is extremely time-consuming. Statistical models have been examined as a surrogate of the high-fidelity physical model during simulation-optimization process to tackle this problem. Among them, Multivariate Adaptive Regression Splines (MARS), a non-parametric adaptive regression method, is superior in overcoming problems of high-dimensions and discontinuities of the data. Furthermore, the stability and accuracy of MARS model can be improved by bootstrap aggregating methods, namely, bagging. In this paper, Bagging MARS (BMARS) method is integrated to a surrogate-based simulation-optimization framework to calibrate a three-dimensional MODFLOW model, which is developed to simulate the groundwater flow in an arid hardrock-alluvium region in northwestern Oman. The physical MODFLOW model is surrogated by the statistical model developed using BMARS algorithm. The surrogate model, which is fitted and validated using training dataset generated by the physical model, can approximate solutions rapidly. An efficient Sobol' method is employed to calculate global sensitivities of head outputs to input parameters, which are used to analyze their importance for the model outputs spatiotemporally. Only sensitive parameters are included in the calibration process to further improve the computational efficiency. Normalized root mean square error (NRMSE) between measured and simulated heads at observation wells is used as the objective function to be minimized during optimization. The reasonable history match between the simulated and observed heads demonstrated feasibility of this high-efficient calibration framework.
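The objective function named in this abstract, the normalized root mean square error between measured and simulated heads, can be sketched as below. The abstract does not say how the error is normalized; dividing by the range of the observed heads is one common convention and is an assumption here, as are the function name and the example values.

```python
import math


def nrmse(observed, simulated):
    """Root mean square error divided by the range of the observed values (assumed normalization)."""
    if not observed or len(observed) != len(simulated):
        raise ValueError("observed and simulated must be non-empty and equally long")
    mse = sum((o - s) ** 2 for o, s in zip(observed, simulated)) / len(observed)
    span = max(observed) - min(observed)
    if span == 0:
        raise ValueError("observed values have zero range")
    return math.sqrt(mse) / span


# Hypothetical heads (m) at four observation wells.
observed_heads = [100.0, 102.0, 104.0, 110.0]
simulated_heads = [101.0, 101.0, 105.0, 109.0]
print(nrmse(observed_heads, simulated_heads))
```

In a calibration loop, an optimizer would evaluate this function on surrogate-model output for each candidate parameter set and keep the set with the smallest value.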
Aziz, Shamsul Akmar Ab; Nuawi, Mohd Zaki; Nor, Mohd Jailani Mohd
2015-01-01
The objective of this study was to present a new method for determination of hand-arm vibration (HAV) in Malaysian Army (MA) three-tonne truck steering wheels based on changes in vehicle speed using regression model and the statistical analysis method known as Integrated Kurtosis-Based Algorithm for Z-Notch Filter Technique Vibro (I-kaz Vibro). The test was conducted for two different road conditions, tarmac and dirt roads. HAV exposure was measured using a Brüel & Kjær Type 3649 vibration analyzer, which is capable of recording HAV exposures from steering wheels. The data was analyzed using I-kaz Vibro to determine the HAV values in relation to varying speeds of a truck and to determine the degree of data scattering for HAV data signals. Based on the results obtained, HAV experienced by drivers can be determined using the daily vibration exposure A(8), I-kaz Vibro coefficient (Ƶ(v)(∞)), and the I-kaz Vibro display. The I-kaz Vibro displays also showed greater scatterings, indicating that the values of Ƶ(v)(∞) and A(8) were increasing. Prediction of HAV exposure was done using the developed regression model and graphical representations of Ƶ(v)(∞). The results of the regression model showed that Ƶ(v)(∞) increased when the vehicle speed and HAV exposure increased. For model validation, predicted and measured noise exposures were compared, and high coefficient of correlation (R²) values were obtained, indicating that good agreement was obtained between them. By using the developed regression model, we can easily predict HAV exposure from steering wheels for HAV exposure monitoring.
Trehan, Sumeet; Carlberg, Kevin T.; Durlofsky, Louis J.
2017-07-14
A machine learning–based framework for modeling the error introduced by surrogate models of parameterized dynamical systems is proposed. The framework entails the use of high-dimensional regression techniques (e.g., random forests and LASSO) to map a large set of inexpensively computed “error indicators” (i.e., features) produced by the surrogate model at a given time instance to a prediction of the surrogate-model error in a quantity of interest (QoI). This eliminates the need for the user to hand-select a small number of informative features. The methodology requires a training set of parameter instances at which the time-dependent surrogate-model error is computed by simulating both the high-fidelity and surrogate models. Using these training data, the method first determines regression-model locality (via classification or clustering) and subsequently constructs a “local” regression model to predict the time-instantaneous error within each identified region of feature space. We consider 2 uses for the resulting error model: (1) as a correction to the surrogate-model QoI prediction at each time instance and (2) as a way to statistically model arbitrary functions of the time-dependent surrogate-model error (e.g., time-integrated errors). We then apply the proposed framework to model errors in reduced-order models of nonlinear oil-water subsurface flow simulations, with time-varying well-control (bottom-hole pressure) parameters. The reduced-order models used in this work entail application of trajectory piecewise linearization in conjunction with proper orthogonal decomposition. Moreover, when the first use of the method is considered, numerical experiments demonstrate consistent improvement in accuracy in the time-instantaneous QoI prediction relative to the original surrogate model, across a large number of test cases.
When the second use is considered, results show that the proposed method provides accurate statistical predictions of the time- and well-averaged errors.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Trehan, Sumeet; Carlberg, Kevin T.; Durlofsky, Louis J.
A machine learning–based framework for modeling the error introduced by surrogate models of parameterized dynamical systems is proposed. The framework entails the use of high-dimensional regression techniques (e.g., random forests and LASSO) to map a large set of inexpensively computed “error indicators” (i.e., features) produced by the surrogate model at a given time instance to a prediction of the surrogate-model error in a quantity of interest (QoI). This eliminates the need for the user to hand-select a small number of informative features. The methodology requires a training set of parameter instances at which the time-dependent surrogate-model error is computed by simulating both the high-fidelity and surrogate models. Using these training data, the method first determines regression-model locality (via classification or clustering) and subsequently constructs a “local” regression model to predict the time-instantaneous error within each identified region of feature space. We consider 2 uses for the resulting error model: (1) as a correction to the surrogate-model QoI prediction at each time instance and (2) as a way to statistically model arbitrary functions of the time-dependent surrogate-model error (e.g., time-integrated errors). We then apply the proposed framework to model errors in reduced-order models of nonlinear oil-water subsurface flow simulations, with time-varying well-control (bottom-hole pressure) parameters. The reduced-order models used in this work entail application of trajectory piecewise linearization in conjunction with proper orthogonal decomposition. Moreover, when the first use of the method is considered, numerical experiments demonstrate consistent improvement in accuracy in the time-instantaneous QoI prediction relative to the original surrogate model, across a large number of test cases.
When the second use is considered, results show that the proposed method provides accurate statistical predictions of the time- and well-averaged errors.
High-flow oxygen therapy: pressure analysis in a pediatric airway model.
Urbano, Javier; del Castillo, Jimena; López-Herce, Jesús; Gallardo, José A; Solana, María J; Carrillo, Ángel
2012-05-01
The mechanism of high-flow oxygen therapy and the pressures reached in the airway have not been defined. We hypothesized that the flow would generate a low continuous positive pressure, and that elevated flow rates in this model could produce moderate pressures. The objective of this study was to analyze the pressure generated by a high-flow oxygen therapy system in an experimental model of the pediatric airway. An experimental in vitro study was performed. A high-flow oxygen therapy system was connected to 3 types of interface (nasal cannulae, nasal mask, and oronasal mask) and applied to 2 types of pediatric manikin (infant and neonatal). The pressures generated in the circuit, in the airway, and in the pharynx were measured at different flow rates (5, 10, 15, and 20 L/min). The experiment was conducted with and without a leak (mouth sealed and unsealed). Linear regression analyses were performed for each set of measurements. The pressures generated with the different interfaces were very similar. The maximum pressure recorded was 4 cm H2O with a flow of 20 L/min via nasal cannulae or nasal mask. When the mouth of the manikin was held open, the pressures reached in the airway and pharynx were undetectable. Linear regression analyses showed a similar linear relationship between flow and pressures measured in the pharynx (pressure = -0.375 + 0.138 × flow) and in the airway (pressure = -0.375 + 0.158 × flow) with the closed mouth condition. Consistent with our hypothesis, high-flow oxygen therapy systems produced a low-level CPAP in an experimental pediatric model, even with the use of very high flow rates. Linear regression analyses showed similar linear relationships between flow and pressures measured in the pharynx and in the airway. This finding suggests that, at least in part, the effects may be due to other mechanisms.
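The two regression equations reported in this abstract can be evaluated directly at the tested flow rates. The sketch below only restates those published coefficients (closed-mouth condition, pressure in cm H2O, flow in L/min); the dictionary and function names are assumptions made for illustration.

```python
# Intercept and slope for each measurement site, as reported in the abstract.
FITS = {
    "pharynx": (-0.375, 0.138),
    "airway": (-0.375, 0.158),
}


def predicted_pressure(flow_l_min, site):
    """Pressure (cm H2O) predicted by the reported linear fit for a given flow (L/min)."""
    intercept, slope = FITS[site]
    return intercept + slope * flow_l_min


for flow in (5, 10, 15, 20):
    row = ", ".join(f"{site}: {predicted_pressure(flow, site):.3f}" for site in FITS)
    print(f"{flow} L/min -> {row}")
```

These are values of the fitted lines, so they need not coincide with any single measured pressure, such as the reported maximum.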
Grimby-Ekman, Anna; Andersson, Eva M; Hagberg, Mats
2009-06-19
In the literature there are discussions on the choice of outcome and the need for more longitudinal studies of musculoskeletal disorders. The general aim of this longitudinal study was to analyze musculoskeletal neck pain, in a group of young adults. Specific aims were to determine whether psychosocial factors, computer use, high work/study demands, and lifestyle are long-term or short-term factors for musculoskeletal neck pain, and whether these factors are important for developing or ongoing musculoskeletal neck pain. Three regression models were used to analyze the different outcomes. Pain at present was analyzed with a marginal logistic model, for number of years with pain a Poisson regression model was used and for developing and ongoing pain a logistic model was used. Presented results are odds ratios and proportion ratios (logistic models) and rate ratios (Poisson model). The material consisted of web-based questionnaires answered by 1204 Swedish university students from a prospective cohort recruited in 2002. Perceived stress was a risk factor for pain at present (PR = 1.6), for developing pain (PR = 1.7) and for number of years with pain (RR = 1.3). High work/study demands were associated with pain at present (PR = 1.6); and with number of years with pain when the demands negatively affect home life (RR = 1.3). Computer use pattern (number of times/week with a computer session ≥ 4 h, without break) was a risk factor for developing pain (PR = 1.7), but also associated with pain at present (PR = 1.4) and number of years with pain (RR = 1.2). Among lifestyle factors, smoking (PR = 1.8) was found to be associated with pain at present. The difference between men and women in prevalence of musculoskeletal pain was confirmed in this study. It was smallest for the outcome ongoing pain (PR = 1.4) compared to pain at present (PR = 2.4) and developing pain (PR = 2.5).
By using different regression models, different aspects of the neck pain pattern could be addressed and the risk factors' impact on the pain pattern was identified. Short-term risk factors were perceived stress, high work/study demands and computer use pattern (break pattern). Those were also long-term risk factors. For developing pain, perceived stress and computer use pattern were risk factors.
Grimby-Ekman, Anna; Andersson, Eva M; Hagberg, Mats
2009-01-01
Background In the literature there are discussions on the choice of outcome and the need for more longitudinal studies of musculoskeletal disorders. The general aim of this longitudinal study was to analyze musculoskeletal neck pain, in a group of young adults. Specific aims were to determine whether psychosocial factors, computer use, high work/study demands, and lifestyle are long-term or short-term factors for musculoskeletal neck pain, and whether these factors are important for developing or ongoing musculoskeletal neck pain. Methods Three regression models were used to analyze the different outcomes. Pain at present was analyzed with a marginal logistic model, for number of years with pain a Poisson regression model was used and for developing and ongoing pain a logistic model was used. Presented results are odds ratios and proportion ratios (logistic models) and rate ratios (Poisson model). The material consisted of web-based questionnaires answered by 1204 Swedish university students from a prospective cohort recruited in 2002. Results Perceived stress was a risk factor for pain at present (PR = 1.6), for developing pain (PR = 1.7) and for number of years with pain (RR = 1.3). High work/study demands was associated with pain at present (PR = 1.6); and with number of years with pain when the demands negatively affect home life (RR = 1.3). Computer use pattern (number of times/week with a computer session ≥ 4 h, without break) was a risk factor for developing pain (PR = 1.7), but also associated with pain at present (PR = 1.4) and number of years with pain (RR = 1.2). Among life style factors smoking (PR = 1.8) was found to be associated to pain at present. The difference between men and women in prevalence of musculoskeletal pain was confirmed in this study. It was smallest for the outcome ongoing pain (PR = 1.4) compared to pain at present (PR = 2.4) and developing pain (PR = 2.5). 
Conclusion By using different regression models, different aspects of the neck pain pattern could be addressed and the risk factors' impact on the pain pattern was identified. Short-term risk factors were perceived stress, high work/study demands and computer use pattern (break pattern). Those were also long-term risk factors. For developing pain, perceived stress and computer use pattern were risk factors. PMID:19545386
Medalie, Laura
2014-01-01
Annual and daily concentrations and fluxes of total and dissolved phosphorus, total nitrogen, chloride, and total suspended solids were estimated for 18 monitored tributaries to Lake Champlain by using the Weighted Regressions on Time, Discharge, and Seasons regression model. Estimates were made for 21 or 23 years, depending on data availability, for the purpose of providing timely and accessible summary reports as stipulated in the 2010 update to the Lake Champlain “Opportunities for Action” management plan. Estimates of concentration and flux were provided for each tributary based on (1) observed daily discharges and (2) a flow-normalizing procedure, which removed the random fluctuations of climate-related variability. The flux bias statistic, an indicator of the ability of the Weighted Regressions on Time, Discharge, and Season regression models to provide accurate representations of flux, showed acceptable bias (less than ±10 percent) for 68 out of 72 models for total and dissolved phosphorus, total nitrogen, and chloride. Six out of 18 models for total suspended solids had moderate bias (between 10 and 30 percent), an expected result given the frequently nonlinear relation between total suspended solids and discharge. One model for total suspended solids with a very high bias was influenced by a single extreme value; however, removal of that value, although reducing the bias substantially, had little effect on annual fluxes.
Determinants of The Grade A Embryos in Infertile Women; Zero-Inflated Regression Model.
Almasi-Hashiani, Amir; Ghaheri, Azadeh; Omani Samani, Reza
2017-10-01
In assisted reproductive technology, it is important to choose high quality embryos for embryo transfer. The aim of the present study was to determine the grade A embryo count and factors related to it in infertile women. This historical cohort study included 996 infertile women. The main outcome was the number of grade A embryos. Zero-Inflated Poisson (ZIP) regression and Zero-Inflated Negative Binomial (ZINB) regression were used to model the count data, as it contained excess zeros. Stata software, version 13 (Stata Corp, College Station, TX, USA) was used for all statistical analyses. After adjusting for potential confounders, results from the ZINB model show that each unit increase in the number of 2 pronuclear (2PN) zygotes multiplies the expected grade A embryo count by 1.45 (incidence rate ratio; 95% confidence interval (CI): 1.23-1.69, P=0.001), and each unit increase in the cleavage day multiplies the expected grade A embryo count by 0.35 (95% CI: 0.20-0.61, P=0.001). There is a significant association between both the number of 2PN zygotes and cleavage day with the number of grade A embryos in both ZINB and ZIP regression models. The estimated coefficients are more plausible than values found in earlier studies using less relevant models. Copyright© by Royan Institute. All rights reserved.
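The zero-inflated Poisson idea used in this study can be sketched as a two-part mixture: with probability `pi_zero` the count is a structural zero, otherwise it follows a Poisson distribution with mean `lam`. The block below is an illustration only. The values `lam = 2.0` and `pi_zero = 0.3` are hypothetical; only the incidence rate ratio of 1.45 comes from the abstract, where it was estimated with the ZINB model and is applied to a ZIP mean here purely to show how such a ratio acts.

```python
import math


def zip_pmf(k, lam, pi_zero):
    """P(Y = k) for a zero-inflated Poisson with rate lam and extra-zero probability pi_zero."""
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        return pi_zero + (1.0 - pi_zero) * poisson
    return (1.0 - pi_zero) * poisson


def zip_mean(lam, pi_zero):
    """Expected count under the zero-inflated Poisson."""
    return (1.0 - pi_zero) * lam


lam, pi_zero = 2.0, 0.3   # hypothetical parameters, not estimates from the study
irr = 1.45                # incidence rate ratio per extra 2PN zygote (from the abstract)

print(zip_pmf(0, lam, pi_zero))                               # inflated probability of zero
print(zip_mean(lam * irr, pi_zero) / zip_mean(lam, pi_zero))  # ratio of expected counts
```

An incidence rate ratio multiplies the mean of the count component; with the zero-inflation probability held fixed, the overall expected count scales by the same factor.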
NASA Technical Reports Server (NTRS)
Trejo, Leonard J.; Shensa, Mark J.; Remington, Roger W. (Technical Monitor)
1998-01-01
This report describes the development and evaluation of mathematical models for predicting human performance from discrete wavelet transforms (DWT) of event-related potentials (ERP) elicited by task-relevant stimuli. The DWT was compared to principal components analysis (PCA) for representation of ERPs in linear regression and neural network models developed to predict a composite measure of human signal detection performance. Linear regression models based on coefficients of the decimated DWT predicted signal detection performance with half as many free parameters as comparable models based on PCA scores. In addition, the DWT-based models were more resistant to model degradation due to over-fitting than PCA-based models. Feed-forward neural networks were trained using the backpropagation algorithm to predict signal detection performance based on raw ERPs, PCA scores, or high-power coefficients of the DWT. Neural networks based on high-power DWT coefficients trained with fewer iterations, generalized to new data better, and were more resistant to overfitting than networks based on raw ERPs. Networks based on PCA scores did not generalize to new data as well as either the DWT network or the raw ERP network. The results show that wavelet expansions represent the ERP efficiently and extract behaviorally important features for use in linear regression or neural network models of human performance. The efficiency of the DWT is discussed in terms of its decorrelation and energy compaction properties. In addition, the DWT models provided evidence that a pattern of low-frequency activity (1 to 3.5 Hz) occurring at specific times and scalp locations is a reliable correlate of human signal detection performance.
NASA Technical Reports Server (NTRS)
Trejo, L. J.; Shensa, M. J.
1999-01-01
This report describes the development and evaluation of mathematical models for predicting human performance from discrete wavelet transforms (DWT) of event-related potentials (ERP) elicited by task-relevant stimuli. The DWT was compared to principal components analysis (PCA) for representation of ERPs in linear regression and neural network models developed to predict a composite measure of human signal detection performance. Linear regression models based on coefficients of the decimated DWT predicted signal detection performance with half as many free parameters as comparable models based on PCA scores. In addition, the DWT-based models were more resistant to model degradation due to over-fitting than PCA-based models. Feed-forward neural networks were trained using the backpropagation algorithm to predict signal detection performance based on raw ERPs, PCA scores, or high-power coefficients of the DWT. Neural networks based on high-power DWT coefficients trained with fewer iterations, generalized to new data better, and were more resistant to overfitting than networks based on raw ERPs. Networks based on PCA scores did not generalize to new data as well as either the DWT network or the raw ERP network. The results show that wavelet expansions represent the ERP efficiently and extract behaviorally important features for use in linear regression or neural network models of human performance. The efficiency of the DWT is discussed in terms of its decorrelation and energy compaction properties. In addition, the DWT models provided evidence that a pattern of low-frequency activity (1 to 3.5 Hz) occurring at specific times and scalp locations is a reliable correlate of human signal detection performance. Copyright 1999 Academic Press.
NASA Astrophysics Data System (ADS)
Zhang, Ying; Bi, Peng; Hiller, Janet
2008-01-01
This is the first study to identify appropriate regression models for the association between climate variation and salmonellosis transmission. A comparison between different regression models was conducted using surveillance data in Adelaide, South Australia. By using notified salmonellosis cases and climatic variables from the Adelaide metropolitan area over the period 1990-2003, four regression methods were examined: standard Poisson regression, autoregressive adjusted Poisson regression, multiple linear regression, and a seasonal autoregressive integrated moving average (SARIMA) model. Notified salmonellosis cases in 2004 were used to test the forecasting ability of the four models. Parameter estimation, goodness-of-fit and forecasting ability of the four regression models were compared. Temperatures occurring 2 weeks prior to cases were positively associated with cases of salmonellosis. Rainfall was also inversely related to the number of cases. The comparison of the goodness-of-fit and forecasting ability suggest that the SARIMA model is better than the other three regression models. Temperature and rainfall may be used as climatic predictors of salmonellosis cases in regions with climatic characteristics similar to those of Adelaide. The SARIMA model could, thus, be adopted to quantify the relationship between climate variations and salmonellosis transmission.
NASA Astrophysics Data System (ADS)
Rocha, Alby D.; Groen, Thomas A.; Skidmore, Andrew K.; Darvishzadeh, Roshanak; Willemen, Louise
2017-11-01
The growing number of narrow spectral bands in hyperspectral remote sensing improves the capacity to describe and predict biological processes in ecosystems. But it also poses a challenge to fit empirical models based on such high dimensional data, which often contain correlated and noisy predictors. As sample sizes, to train and validate empirical models, seem not to be increasing at the same rate, overfitting has become a serious concern. Overly complex models lead to overfitting by capturing more than the underlying relationship, and also through fitting random noise in the data. Many regression techniques claim to overcome these problems by using different strategies to constrain complexity, such as limiting the number of terms in the model, by creating latent variables or by shrinking parameter coefficients. This paper is proposing a new method, named Naïve Overfitting Index Selection (NOIS), which makes use of artificially generated spectra, to quantify the relative model overfitting and to select an optimal model complexity supported by the data. The robustness of this new method is assessed by comparing it to a traditional model selection based on cross-validation. The optimal model complexity is determined for seven different regression techniques, such as partial least squares regression, support vector machine, artificial neural network and tree-based regressions using five hyperspectral datasets. The NOIS method selects less complex models, which present accuracies similar to the cross-validation method. The NOIS method reduces the chance of overfitting, thereby avoiding models that present accurate predictions that are only valid for the data used, and too complex to make inferences about the underlying process.
[Evaluation of estimation of prevalence ratio using bayesian log-binomial regression model].
Gao, W L; Lin, H; Liu, X N; Ren, X W; Li, J S; Shen, X P; Zhu, S L
2017-03-10
To evaluate the estimation of the prevalence ratio (PR) using a Bayesian log-binomial regression model and its application, we estimated the PR of medical care-seeking prevalence in relation to caregivers' recognition of risk signs of diarrhea in their infants, using a Bayesian log-binomial regression model in OpenBUGS software. The results showed that caregivers' recognition of their infants' risk signs of diarrhea was significantly associated with a 13% increase in medical care-seeking. We also compared the point and interval estimates of the PR, and the convergence, of three models (model 1: not adjusting for covariates; model 2: adjusting for duration of caregivers' education; model 3: adjusting for distance between village and township and child age in months, in addition to model 2) between the Bayesian and the conventional log-binomial regression models. All three Bayesian log-binomial regression models converged, and the estimated PRs were 1.130 (95% CI: 1.005-1.265), 1.128 (95% CI: 1.001-1.264) and 1.132 (95% CI: 1.004-1.267), respectively. Conventional log-binomial regression models 1 and 2 converged, with PRs of 1.130 (95% CI: 1.055-1.206) and 1.126 (95% CI: 1.051-1.203), respectively, but model 3 failed to converge, so the COPY method was used to estimate the PR, which was 1.125 (95% CI: 1.051-1.200). In addition, the point and interval estimates of the PRs from the three Bayesian log-binomial regression models differed slightly from those of the conventional log-binomial regression model, but the two approaches were consistent in estimating the PR. Therefore, the Bayesian log-binomial regression model can effectively estimate the PR with fewer convergence failures and has advantages in application over the conventional log-binomial regression model.
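For a single binary exposure with no covariates, the prevalence ratio that a log-binomial model estimates reduces to the ratio of the two observed prevalences. The sketch below computes that ratio with a Wald interval on the log scale. It is neither the Bayesian procedure nor the COPY method from the abstract, and the counts are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math


def prevalence_ratio(cases_exposed, n_exposed, cases_unexposed, n_unexposed, z=1.96):
    """Unadjusted prevalence ratio with a Wald confidence interval computed on the log scale."""
    p1 = cases_exposed / n_exposed
    p0 = cases_unexposed / n_unexposed
    pr = p1 / p0
    # Standard error of log(PR) for two independent binomial samples.
    se = math.sqrt(1 / cases_exposed - 1 / n_exposed
                   + 1 / cases_unexposed - 1 / n_unexposed)
    lower = math.exp(math.log(pr) - z * se)
    upper = math.exp(math.log(pr) + z * se)
    return pr, lower, upper


# Hypothetical counts: 90 of 150 exposed and 80 of 160 unexposed sought care.
pr, lower, upper = prevalence_ratio(90, 150, 80, 160)
print(f"PR = {pr:.3f} (95% CI: {lower:.3f}-{upper:.3f})")
```

Adjusted PRs, as in models 2 and 3 of the abstract, require fitting the regression itself; this closed form covers only the unadjusted case.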
NASA Astrophysics Data System (ADS)
Nieto, Paulino José García; Antón, Juan Carlos Álvarez; Vilán, José Antonio Vilán; García-Gonzalo, Esperanza
2014-10-01
The aim of this research work is to build a regression model of the particulate matter up to 10 micrometers in size (PM10) by using the multivariate adaptive regression splines (MARS) technique in the Oviedo urban area (Northern Spain) at local scale. This research work explores the use of a nonparametric regression algorithm known as multivariate adaptive regression splines (MARS) which has the ability to approximate the relationship between the inputs and outputs, and express the relationship mathematically. In this sense, hazardous air pollutants or toxic air contaminants refer to any substance that may cause or contribute to an increase in mortality or serious illness, or that may pose a present or potential hazard to human health. To accomplish the objective of this study, the experimental dataset of nitrogen oxides (NOx), carbon monoxide (CO), sulfur dioxide (SO2), ozone (O3) and dust (PM10) were collected over 3 years (2006-2008) and they are used to create a highly nonlinear model of the PM10 in the Oviedo urban nucleus (Northern Spain) based on the MARS technique. One main objective of this model is to obtain a preliminary estimate of the dependence between PM10 pollutant in the Oviedo urban area at local scale. A second aim is to determine the factors with the greatest bearing on air quality with a view to proposing health and lifestyle improvements. The United States National Ambient Air Quality Standards (NAAQS) establishes the limit values of the main pollutants in the atmosphere in order to ensure the health of healthy people. Firstly, this MARS regression model captures the main perception of statistical learning theory in order to obtain a good prediction of the dependence among the main pollutants in the Oviedo urban area. Secondly, the main advantages of MARS are its capacity to produce simple, easy-to-interpret models, its ability to estimate the contributions of the input variables, and its computational efficiency. 
Finally, on the basis of these numerical calculations using the multivariate adaptive regression splines (MARS) technique, the conclusions of this research work are presented.
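The MARS technique builds its models from hinge functions of the form max(0, x - knot) and max(0, knot - x). The sketch below only evaluates such a model for one input variable; the coefficients, the knot and the function names are hypothetical and do not come from the PM10 study, and the knot-selection and pruning steps of MARS are not shown.

```python
def hinge(x, knot, sign=1):
    """MARS hinge basis: max(0, x - knot) for sign=+1, max(0, knot - x) for sign=-1."""
    return max(0.0, sign * (x - knot))

def mars_predict(x, intercept, terms):
    """Evaluate an additive MARS-style model for a single input value.

    terms is a list of (coefficient, knot, sign) tuples; the values used
    below are hypothetical and would normally come from a fitted model.
    """
    return intercept + sum(c * hinge(x, k, s) for c, k, s in terms)

# Hypothetical one-variable model with a single knot at x = 10.
terms = [(2.0, 10.0, +1), (-0.5, 10.0, -1)]
predictions = [mars_predict(x, 5.0, terms) for x in (6.0, 10.0, 13.0)]
```

Each term is piecewise linear, which is why the resulting models stay comparatively easy to read even when the overall relationship is nonlinear.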
Suzuki, Hideaki; Tabata, Takahisa; Koizumi, Hiroki; Hohchi, Nobusuke; Takeuchi, Shoko; Kitamura, Takuro; Fujino, Yoshihisa; Ohbuchi, Toyoaki
2014-12-01
This study aimed to create a multiple regression model for predicting hearing outcomes of idiopathic sudden sensorineural hearing loss (ISSNHL). The participants were 205 consecutive patients (205 ears) with ISSNHL (hearing level ≥ 40 dB, interval between onset and treatment ≤ 30 days). They received systemic steroid administration combined with intratympanic steroid injection. Data were examined by simple and multiple regression analyses. Three hearing indices (percentage hearing improvement, hearing gain, and posttreatment hearing level [HLpost]) and 7 prognostic factors (age, days from onset to treatment, initial hearing level, initial hearing level at low frequencies, initial hearing level at high frequencies, presence of vertigo, and contralateral hearing level) were included in the multiple regression analysis as dependent and explanatory variables, respectively. In the simple regression analysis, the percentage hearing improvement, hearing gain, and HLpost showed significant correlation with 2, 5, and 6 of the 7 prognostic factors, respectively. The multiple correlation coefficients were 0.396, 0.503, and 0.714 for the percentage hearing improvement, hearing gain, and HLpost, respectively. Predicted values of HLpost calculated by the multiple regression equation were reliable with 70% probability with a 40-dB-width prediction interval. Prediction of HLpost by the multiple regression model may be useful to estimate the hearing prognosis of ISSNHL. © The Author(s) 2014.
Learning accurate and interpretable models based on regularized random forests regression
2014-01-01
Background Many biology-related research works combine data from multiple sources in an effort to understand the underlying problems. It is important to find and interpret the most important information from these sources. Thus it will be beneficial to have an effective algorithm that can simultaneously extract decision rules and select critical features for good interpretation while preserving the prediction performance. Methods In this study, we focus on regression problems for biological data where target outcomes are continuous. In general, models constructed from linear regression approaches are relatively easy to interpret. However, many practical biological applications are nonlinear in essence, and we can hardly find a direct linear relationship between input and output. Nonlinear regression techniques can reveal nonlinear relationships in data, but are generally hard for humans to interpret. We propose a rule-based regression algorithm that uses 1-norm regularized random forests. The proposed approach simultaneously extracts a small number of rules from generated random forests and eliminates unimportant features. Results We tested the approach on some biological data sets. The proposed approach is able to construct a significantly smaller set of regression rules using a subset of attributes while achieving prediction performance comparable to that of random forests regression. Conclusion It demonstrates high potential in aiding prediction and interpretation of nonlinear relationships of the subject being studied. PMID:25350120
Cruz, Antonio M; Barr, Cameron; Puñales-Pozo, Elsa
2008-01-01
This research's main goals were to build a predictor for a turnaround time (TAT) indicator for estimating its values and use a numerical clustering technique for finding possible causes of undesirable TAT values. The following stages were used: domain understanding, data characterisation and sample reduction and insight characterisation. Building the TAT indicator multiple linear regression predictor and clustering techniques were used for improving corrective maintenance task efficiency in a clinical engineering department (CED). The indicator being studied was turnaround time (TAT). Multiple linear regression was used for building a predictive TAT value model. The variables contributing to such model were clinical engineering department response time (CE(rt), 0.415 positive coefficient), stock service response time (Stock(rt), 0.734 positive coefficient), priority level (0.21 positive coefficient) and service time (0.06 positive coefficient). The regression process showed heavy reliance on Stock(rt), CE(rt) and priority, in that order. Clustering techniques revealed the main causes of high TAT values. This examination has provided a means for analysing current technical service quality and effectiveness. In doing so, it has demonstrated a process for identifying areas and methods of improvement and a model against which to analyse these methods' effectiveness.
ORACLE INEQUALITIES FOR THE LASSO IN THE COX MODEL
Huang, Jian; Sun, Tingni; Ying, Zhiliang; Yu, Yi; Zhang, Cun-Hui
2013-01-01
We study the absolute penalized maximum partial likelihood estimator in sparse, high-dimensional Cox proportional hazards regression models where the number of time-dependent covariates can be larger than the sample size. We establish oracle inequalities based on natural extensions of the compatibility and cone invertibility factors of the Hessian matrix at the true regression coefficients. Similar results based on an extension of the restricted eigenvalue can be also proved by our method. However, the presented oracle inequalities are sharper since the compatibility and cone invertibility factors are always greater than the corresponding restricted eigenvalue. In the Cox regression model, the Hessian matrix is based on time-dependent covariates in censored risk sets, so that the compatibility and cone invertibility factors, and the restricted eigenvalue as well, are random variables even when they are evaluated for the Hessian at the true regression coefficients. Under mild conditions, we prove that these quantities are bounded from below by positive constants for time-dependent covariates, including cases where the number of covariates is of greater order than the sample size. Consequently, the compatibility and cone invertibility factors can be treated as positive constants in our oracle inequalities. PMID:24086091
ORACLE INEQUALITIES FOR THE LASSO IN THE COX MODEL.
Huang, Jian; Sun, Tingni; Ying, Zhiliang; Yu, Yi; Zhang, Cun-Hui
2013-06-01
We study the absolute penalized maximum partial likelihood estimator in sparse, high-dimensional Cox proportional hazards regression models where the number of time-dependent covariates can be larger than the sample size. We establish oracle inequalities based on natural extensions of the compatibility and cone invertibility factors of the Hessian matrix at the true regression coefficients. Similar results based on an extension of the restricted eigenvalue can be also proved by our method. However, the presented oracle inequalities are sharper since the compatibility and cone invertibility factors are always greater than the corresponding restricted eigenvalue. In the Cox regression model, the Hessian matrix is based on time-dependent covariates in censored risk sets, so that the compatibility and cone invertibility factors, and the restricted eigenvalue as well, are random variables even when they are evaluated for the Hessian at the true regression coefficients. Under mild conditions, we prove that these quantities are bounded from below by positive constants for time-dependent covariates, including cases where the number of covariates is of greater order than the sample size. Consequently, the compatibility and cone invertibility factors can be treated as positive constants in our oracle inequalities.
Application of near-infrared spectroscopy in the detection of fat-soluble vitamins in premix feed
NASA Astrophysics Data System (ADS)
Jia, Lian Ping; Tian, Shu Li; Zheng, Xue Cong; Jiao, Peng; Jiang, Xun Peng
2018-02-01
Vitamins are organic compounds necessary for animal physiological maintenance. Rapid determination of the content of different vitamins in premix feed can help to achieve accurate diets and efficient feeding. Compared with high-performance liquid chromatography and other wet chemical methods, near-infrared spectroscopy is a fast, non-destructive, non-polluting method. A total of 168 samples of premix feed were collected, and the contents of vitamin A, vitamin E and vitamin D3 were determined by the standard method. The near-infrared spectra of the samples were obtained over the range 10 000 to 4 000 cm^-1. Partial least squares regression (PLSR) and support vector machine regression (SVMR) were used to construct the quantitative models. The results showed that the RMSEP of the PLSR models for vitamin A, vitamin E and vitamin D3 were 0.43×10^7 IU/kg, 0.09×10^5 IU/kg and 0.17×10^7 IU/kg, respectively. The RMSEP of the SVMR models were 0.45×10^7 IU/kg, 0.11×10^5 IU/kg and 0.18×10^7 IU/kg. Compared with the nonlinear regression method (SVMR), the linear regression method (PLSR) is more suitable for the quantitative analysis of vitamins in premix feed.
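The models in this abstract are compared by their root mean square error of prediction (RMSEP). A minimal sketch of that metric follows; the function name and the sample values are hypothetical and are not data from the study.

```python
import math

def rmsep(reference, predicted):
    """Root mean square error of prediction over a validation set."""
    if not reference or len(reference) != len(predicted):
        raise ValueError("reference and predicted must be non-empty and equal in length")
    squared_errors = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return math.sqrt(sum(squared_errors) / len(reference))

# Hypothetical reference values and model predictions, in the same units.
reference = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]
error = rmsep(reference, predicted)
```

Because RMSEP carries the units of the measured quantity, values for different vitamins are only comparable within the same vitamin, as in the PLSR versus SVMR comparison above.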
Evaluation of weighted regression and sample size in developing a taper model for loblolly pine
Kenneth L. Cormier; Robin M. Reich; Raymond L. Czaplewski; William A. Bechtold
1992-01-01
A stem profile model, fit using pseudo-likelihood weighted regression, was used to estimate merchantable volume of loblolly pine (Pinus taeda L.) in the southeast. The weighted regression increased model fit marginally, but did not substantially increase model performance. In all cases, the unweighted regression models performed as well as the...
Parameters Estimation of Geographically Weighted Ordinal Logistic Regression (GWOLR) Model
NASA Astrophysics Data System (ADS)
Zuhdi, Shaifudin; Retno Sari Saputro, Dewi; Widyaningsih, Purnami
2017-06-01
A regression model represents the relationship between independent variables and a dependent variable. When the dependent variable is categorical, a logistic regression model is used to calculate the odds. When the categories of the dependent variable are ordered levels, the logistic regression model is ordinal. The GWOLR model is an ordinal logistic regression model that is influenced by the geographical location of the observation site. Parameter estimation is needed to infer population values from a sample. The purpose of this research is to estimate the parameters of the GWOLR model using R software. Parameter estimation uses data on the number of dengue fever patients in Semarang City. The observation units are 144 villages in Semarang City. The research yields a local GWOLR model for each village and the probabilities of the dengue fever patient-count categories.
NASA Astrophysics Data System (ADS)
Prahutama, Alan; Suparti; Wahyu Utami, Tiani
2018-03-01
Regression analysis models the relationship between response variables and predictor variables. The parametric approach to regression imposes strict assumptions on the form of the model, whereas nonparametric regression does not require an assumed model form. Time series data are observations of a variable recorded over time, so to model them by regression the response and predictor variables must be determined first. In a time series, the response variable is the value at time t (yt), while the predictor variables are its significant lags. In nonparametric regression modeling, one approach under development is the Fourier series approach. One advantage of the Fourier series approach is that it can handle data with a trigonometric (periodic) pattern. Modeling with a Fourier series requires a parameter K, which can be chosen by the Generalized Cross Validation method. Modeling inflation for the transportation, communication and financial services sector using a Fourier series yields an optimal K of 120 parameters with an R-square of 99%, whereas multiple linear regression yields an R-square of 90%.
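The choice of K by Generalized Cross Validation can be sketched as follows. This assumes a least squares fit in which the trace of the hat matrix equals the number of fitted parameters, so that GCV = MSE / (1 - p/n)^2. The function names, the mapping from K to parameter count and all numbers are hypothetical; the Fourier series fit itself is not shown.

```python
def gcv_score(observed, fitted, n_params):
    """GCV = MSE / (1 - p/n)^2, assuming trace(hat matrix) == n_params."""
    n = len(observed)
    if n_params >= n:
        raise ValueError("need fewer parameters than observations")
    mse = sum((y - f) ** 2 for y, f in zip(observed, fitted)) / n
    return mse / (1 - n_params / n) ** 2

def choose_k(observed, fits_by_k):
    """Pick the K with the smallest GCV score.

    fits_by_k maps each candidate K to (fitted_values, n_params).
    """
    return min(fits_by_k, key=lambda k: gcv_score(observed, *fits_by_k[k]))

# Hypothetical data and candidate fits: larger K fits closer but costs more parameters.
observed = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
fits_by_k = {
    1: ([y + 0.5 for y in observed], 3),
    2: ([y + 0.25 for y in observed], 5),
    3: ([y + 0.2 for y in observed], 7),
}
best_k = choose_k(observed, fits_by_k)
```

The denominator penalizes parameter count, so the closest-fitting candidate is not automatically selected.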
Theodoratou, Evropi; Zhang, Jian Shayne F.; Kolcic, Ivana; Davis, Andrew M.; Bhopal, Sunil; Nair, Harish; Chan, Kit Yee; Liu, Li; Johnson, Hope; Rudan, Igor; Campbell, Harry
2011-01-01
Background Pneumonia is the leading cause of child deaths globally. The aims of this study were to: a) estimate the number and global distribution of pneumonia deaths for children 1–59 months for 2008 for countries with low (<85%) or no coverage of death certification using single-cause regression models and b) compare these country estimates with recently published ones based on multi-cause regression models. Methods and Findings For 35 low child-mortality countries with <85% coverage of death certification, a regression model based on vital registration data of low child-mortality and >85% coverage of death certification countries was used. For 87 high child-mortality countries pneumonia death estimates were obtained by applying a regression model developed from published and unpublished verbal autopsy data from high child-mortality settings. The total number of 1–59 months pneumonia deaths for the year 2008 for these 122 countries was estimated to be 1.18 M (95% CI 0.77 M–1.80 M), which represented 23.27% (95% CI 17.15%–32.75%) of all 1–59 month child deaths. The country level estimation correlation coefficient between these two methods was 0.40. Interpretation Although the overall number of post-neonatal pneumonia deaths was similar irrespective to the method of estimation used, the country estimate correlation coefficient was low, and therefore country-specific estimates should be interpreted with caution. Pneumonia remains the leading cause of child deaths and is greatest in regions of poverty and high child-mortality. Despite the concerns about gender inequity linked with childhood mortality we could not estimate sex-specific pneumonia mortality rates due to the inadequate data. Life-saving interventions effective in preventing and treating pneumonia mortality exist but few children in high pneumonia disease burden regions are able to access them. 
Achieving the United Nations Millennium Development Goal 4 target of reducing child deaths by two-thirds by 2015 will require scaling up access to these effective pneumonia interventions. PMID:21966425
An open-access CMIP5 pattern library for temperature and precipitation: description and methodology
NASA Astrophysics Data System (ADS)
Lynch, Cary; Hartin, Corinne; Bond-Lamberty, Ben; Kravitz, Ben
2017-05-01
Pattern scaling is used to efficiently emulate general circulation models and explore uncertainty in climate projections under multiple forcing scenarios. Pattern scaling methods assume that local climate changes scale with a global mean temperature increase, allowing for spatial patterns to be generated for multiple models for any future emission scenario. For uncertainty quantification and probabilistic statistical analysis, a library of patterns with descriptive statistics for each file would be beneficial, but such a library does not presently exist. Of the possible techniques used to generate patterns, the two most prominent are the delta and least squares regression methods. We explore the differences and statistical significance between patterns generated by each method and assess performance of the generated patterns across methods and scenarios. Differences in patterns across seasons between methods and epochs were largest in high latitudes (60-90° N/S). Bias and mean errors between modeled and pattern-predicted output from the linear regression method were smaller than patterns generated by the delta method. Across scenarios, differences in the linear regression method patterns were more statistically significant, especially at high latitudes. We found that pattern generation methodologies were able to approximate the forced signal of change to within ≤ 0.5 °C, but the choice of pattern generation methodology for pattern scaling purposes should be informed by user goals and criteria. This paper describes our library of least squares regression patterns from all CMIP5 models for temperature and precipitation on an annual and sub-annual basis, along with the code used to generate these patterns. The dataset and netCDF data generation code are available at doi:10.5281/zenodo.495632.
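The two pattern generation methods named in this abstract can be illustrated for a single grid cell. This is a sketch under simplifying assumptions, not the code distributed with the library: the function names, epoch slices and temperature series are hypothetical.

```python
from statistics import mean

def delta_pattern(local, global_mean, base, future):
    """Delta method: local change between two epochs per unit of global mean change.

    base and future are slices selecting the baseline and future epochs.
    """
    d_local = mean(local[future]) - mean(local[base])
    d_global = mean(global_mean[future]) - mean(global_mean[base])
    return d_local / d_global

def least_squares_pattern(local, global_mean):
    """Least squares method: slope of local values regressed on global mean temperature."""
    g_bar, l_bar = mean(global_mean), mean(local)
    sxy = sum((g - g_bar) * (l - l_bar) for g, l in zip(global_mean, local))
    sxx = sum((g - g_bar) ** 2 for g in global_mean)
    return sxy / sxx

# Hypothetical series: the local cell warms exactly twice as fast as the global mean.
global_mean = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
local = [1.0 + 2.0 * g for g in global_mean]

delta = delta_pattern(local, global_mean, slice(0, 2), slice(4, 6))
slope = least_squares_pattern(local, global_mean)
```

With a perfectly linear response the two methods agree; they diverge when the local response is noisy or nonlinear, because the regression uses every time step while the delta method uses only the two epoch means.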
Wang, Ying; Goh, Joshua O; Resnick, Susan M; Davatzikos, Christos
2013-01-01
In this study, we used high-dimensional pattern regression methods based on structural (gray and white matter; GM and WM) and functional (positron emission tomography of regional cerebral blood flow; PET) brain data to identify cross-sectional imaging biomarkers of cognitive performance in cognitively normal older adults from the Baltimore Longitudinal Study of Aging (BLSA). We focused on specific components of executive and memory domains known to decline with aging, including manipulation, semantic retrieval, long-term memory (LTM), and short-term memory (STM). For each imaging modality, brain regions associated with each cognitive domain were generated by adaptive regional clustering. A relevance vector machine was adopted to model the nonlinear continuous relationship between brain regions and cognitive performance, with cross-validation to select the most informative brain regions (using recursive feature elimination) as imaging biomarkers and optimize model parameters. Predicted cognitive scores using our regression algorithm based on the resulting brain regions correlated well with actual performance. Also, regression models obtained using combined GM, WM, and PET imaging modalities outperformed models based on single modalities. Imaging biomarkers related to memory performance included the orbito-frontal and medial temporal cortical regions with LTM showing stronger correlation with the temporal lobe than STM. Brain regions predicting executive performance included orbito-frontal, and occipito-temporal areas. The PET modality had higher contribution to most cognitive domains except manipulation, which had higher WM contribution from the superior longitudinal fasciculus and the genu of the corpus callosum. These findings based on machine-learning methods demonstrate the importance of combining structural and functional imaging data in understanding complex cognitive mechanisms and also their potential usage as biomarkers that predict cognitive status.
Analysis of flight data from a High-Incidence Research Model by system identification methods
NASA Technical Reports Server (NTRS)
Batterson, James G.; Klein, Vladislav
1989-01-01
Data partitioning and modified stepwise regression were applied to recorded flight data from a Royal Aerospace Establishment high incidence research model. An aerodynamic model structure and corresponding stability and control derivatives were determined for angles of attack between 18 and 30 deg. Several nonlinearities in angles of attack and sideslip as well as a unique roll-dominated set of lateral modes were found. All flight estimated values were compared to available wind tunnel measurements.
NASA Astrophysics Data System (ADS)
Wang, Li-yong; Li, Le; Zhang, Zhi-hua
2016-09-01
Hot compression tests of Ti-6Al-4V alloy over a wide temperature range of 1023-1323 K and a strain rate range of 0.01-10 s^-1 were conducted on a servo-hydraulic, computer-controlled Gleeble-3500 machine. In order to characterize the highly nonlinear flow behaviors accurately and effectively, support vector regression (SVR), a machine learning method, was combined with a genetic algorithm (GA); the combination is referred to as GA-SVR. A prominent feature of GA-SVR is that, with identical training parameters, it keeps training accuracy and prediction accuracy at a stable level across different attempts on a given dataset. The learning abilities, generalization abilities, and modeling efficiencies of the mathematical regression model, the ANN, and GA-SVR for Ti-6Al-4V alloy were compared in detail. The comparison shows that the learning ability of GA-SVR is stronger than that of the mathematical regression model. The generalization abilities and modeling efficiencies of these models rank in ascending order as follows: the mathematical regression model < ANN < GA-SVR. The stress-strain data outside the experimental conditions were predicted by the well-trained GA-SVR, which improved the simulation accuracy of the load-stroke curve and can further benefit related research fields where stress-strain data play important roles, such as speculating work hardening and dynamic recovery, characterizing dynamic recrystallization evolution, and improving processing maps.
Riley, Richard D; Ensor, Joie; Jackson, Dan; Burke, Danielle L
2017-01-01
Many meta-analysis models contain multiple parameters, for example due to multiple outcomes, multiple treatments or multiple regression coefficients. In particular, meta-regression models may contain multiple study-level covariates, and one-stage individual participant data meta-analysis models may contain multiple patient-level covariates and interactions. Here, we propose how to derive percentage study weights for such situations, in order to reveal the (otherwise hidden) contribution of each study toward the parameter estimates of interest. We assume that studies are independent, and utilise a decomposition of Fisher's information matrix to decompose the total variance matrix of parameter estimates into study-specific contributions, from which percentage weights are derived. This approach generalises how percentage weights are calculated in a traditional, single parameter meta-analysis model. Application is made to one- and two-stage individual participant data meta-analyses, meta-regression and network (multivariate) meta-analysis of multiple treatments. These reveal percentage study weights toward clinically important estimates, such as summary treatment effects and treatment-covariate interactions, and are especially useful when some studies are potential outliers or at high risk of bias. We also derive percentage study weights toward methodologically interesting measures, such as the magnitude of ecological bias (difference between within-study and across-study associations) and the amount of inconsistency (difference between direct and indirect evidence in a network meta-analysis).
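For orientation, the traditional single-parameter case that this abstract generalises can be sketched as inverse-variance percentage weights. The multi-parameter decomposition of Fisher's information matrix is not reproduced here; the function names, estimates and variances are hypothetical.

```python
def percentage_weights(variances):
    """Percentage study weights in a fixed-effect, single-parameter meta-analysis."""
    inverse = [1.0 / v for v in variances]
    total = sum(inverse)
    return [100.0 * w / total for w in inverse]

def pooled_estimate(estimates, variances):
    """Inverse-variance weighted average of the study estimates."""
    weights = percentage_weights(variances)
    return sum(w * e for w, e in zip(weights, estimates)) / 100.0

# Hypothetical study estimates and their variances.
estimates = [0.2, 0.4, 0.6]
variances = [1.0, 1.0, 2.0]

weights = percentage_weights(variances)
pooled = pooled_estimate(estimates, variances)
```

The weights sum to 100, so each one reads directly as a study's share of the information behind the pooled estimate.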
Tularosa Basin Play Fairway: Weights of Evidence Models
Adam Brandt
2015-12-01
These models are related to the weights of evidence play fairway analysis of the Tularosa Basin, New Mexico and Texas. They were created with Spatial Data Modeler: ArcMAP 9.3 geoprocessing tools for spatial data modeling using weights of evidence, logistic regression, fuzzy logic and neural networks. They are used to identify high and low values for potential geothermal plays. The results are relative not only within the Tularosa Basin, but also throughout New Mexico, Utah, Nevada, and other places where high to moderate enthalpy geothermal systems are present (training sites).
Climate Prediction Center - Seasonal Outlook
SEASONAL CLIMATE VARIABILITY, INCLUDING ENSO, SOIL MOISTURE, AND VARIOUS STATE-OF-THE-ART DYNAMICAL MODEL ACROSS PARTS OF THE EAST-CENTRAL CONUS CENTERED ON THE MISSISSIPPI RIVER. THIS IS DUE TO VERY HIGH SOIL TRENDS, NEGATIVE SOIL MOISTURE ANOMALIES, LAGGED ENSO REGRESSIONS, AND DYNAMICAL MODEL GUIDANCE ARE ALL
Prediction of sweetness and amino acid content in soybean crops from hyperspectral imagery
NASA Astrophysics Data System (ADS)
Monteiro, Sildomar Takahashi; Minekawa, Yohei; Kosugi, Yukio; Akazawa, Tsuneya; Oda, Kunio
Hyperspectral image data provides a powerful tool for non-destructive crop analysis. This paper investigates a hyperspectral image data-processing method to predict the sweetness and amino acid content of soybean crops. Regression models based on artificial neural networks were developed in order to calculate the level of sucrose, glucose, fructose, and nitrogen concentrations, which can be related to the sweetness and amino acid content of vegetables. A performance analysis was conducted comparing regression models obtained using different preprocessing methods, namely, raw reflectance, second derivative, and principal components analysis. This method is demonstrated using high-resolution hyperspectral data of wavelengths ranging from the visible to the near infrared acquired from an experimental field of green vegetable soybeans. The best predictions were achieved using a nonlinear regression model of the second derivative transformed dataset. Glucose could be predicted with the greatest accuracy, followed by sucrose, fructose and nitrogen. The proposed method makes it possible to produce relatively accurate maps predicting the chemical content of soybean crop fields.
Estimating top-of-atmosphere thermal infrared radiance using MERRA-2 atmospheric data
NASA Astrophysics Data System (ADS)
Kleynhans, Tania; Montanaro, Matthew; Gerace, Aaron; Kanan, Christopher
2017-05-01
Thermal infrared satellite images have been widely used in environmental studies. However, satellites have limited temporal resolution, e.g., 16 day Landsat or 1 to 2 day Terra MODIS. This paper investigates the use of the Modern-Era Retrospective analysis for Research and Applications, Version 2 (MERRA-2) reanalysis data product, produced by NASA's Global Modeling and Assimilation Office (GMAO), to predict global top-of-atmosphere (TOA) thermal infrared radiance. The high temporal resolution of the MERRA-2 data product presents opportunities for novel research and applications. Various methods were applied to estimate TOA radiance from MERRA-2 variables, namely (1) a parameterized physics-based method, (2) linear regression models and (3) non-linear Support Vector Regression. Model prediction accuracy was evaluated using temporally and spatially coincident Moderate Resolution Imaging Spectroradiometer (MODIS) thermal infrared data as reference data. This research found that Support Vector Regression with a radial basis function kernel produced the lowest error rates. Sources of errors are discussed and defined. Further research is currently being conducted to train deep learning models to predict TOA thermal radiance.
Cohen, Mark E; Dimick, Justin B; Bilimoria, Karl Y; Ko, Clifford Y; Richards, Karen; Hall, Bruce Lee
2009-12-01
Although logistic regression has commonly been used to adjust for risk differences in patient and case mix to permit quality comparisons across hospitals, hierarchical modeling has been advocated as the preferred methodology, because it accounts for clustering of patients within hospitals. It is unclear whether hierarchical models would yield important differences in quality assessments compared with logistic models when applied to American College of Surgeons (ACS) National Surgical Quality Improvement Program (NSQIP) data. Our objective was to evaluate differences in logistic versus hierarchical modeling for identifying hospitals with outlying outcomes in the ACS-NSQIP. Data from ACS-NSQIP patients who underwent colorectal operations in 2008 at hospitals that reported at least 100 operations were used to generate logistic and hierarchical prediction models for 30-day morbidity and mortality. Differences in risk-adjusted performance (ratio of observed-to-expected events) and outlier detections from the two models were compared. Logistic and hierarchical models identified the same 25 hospitals as morbidity outliers (14 low and 11 high outliers), but the hierarchical model identified 2 additional high outliers. Both models identified the same eight hospitals as mortality outliers (five low and three high outliers). The values of observed-to-expected events ratios and p values from the two models were highly correlated. Results were similar when data were permitted from hospitals providing < 100 patients. When applied to ACS-NSQIP data, logistic and hierarchical models provided nearly identical results with respect to identification of hospitals' observed-to-expected events ratio outliers. As hierarchical models are prone to implementation problems, logistic regression will remain an accurate and efficient method for performing risk adjustment of hospital quality comparisons.
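The risk-adjusted performance measure used in this abstract, the ratio of observed to expected events, can be sketched for one hospital as follows. The predicted probabilities would come from a fitted risk model (logistic or hierarchical); here they and the outcomes are hypothetical, as is the function name.

```python
def observed_to_expected(outcomes, predicted_probabilities):
    """O/E ratio: observed event count over the sum of model-predicted risks."""
    observed = sum(outcomes)
    expected = sum(predicted_probabilities)
    if expected == 0:
        raise ValueError("expected event count is zero")
    return observed / expected

# Hypothetical patients at one hospital: 1 = event occurred, 0 = no event.
outcomes = [1, 0, 0, 1, 0]
predicted_probabilities = [0.5, 0.25, 0.25, 0.25, 0.25]
ratio = observed_to_expected(outcomes, predicted_probabilities)
```

A ratio above 1 means more events than the case mix predicts; whether a hospital counts as an outlier additionally depends on the statistical test applied to the ratio, which this sketch omits.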
Challenges of Electronic Medical Surveillance Systems
2004-06-01
More sophisticated approaches, such as regression models and classical autoregressive moving average (ARIMA) models that make estimates based on...with those predicted by a mathematical model. The primary benefit of ARIMA models is their ability to correct for local trends in the data so that...works well, for example, during a particularly severe flu season, where prolonged periods of high visit rates are adjusted to by the ARIMA model, thus
NASA Astrophysics Data System (ADS)
Wang, Liang-Jie; Sawada, Kazuhide; Moriguchi, Shuji
2013-01-01
To mitigate the damage caused by landslide disasters, different mathematical models have been applied to predict landslide spatial distribution characteristics. Although some researchers around the world have achieved excellent results, few studies take the spatial resolution of the database into account. Four digital elevation models (DEMs) with resolutions ranging from 2 to 20 m, derived from light detection and ranging technology, are used to analyze landslide susceptibility in Mizunami City, Gifu Prefecture, Japan. Fifteen landslide-causative factors are considered using a logistic-regression approach to create models for landslide potential analysis. Pre-existing landslide bodies are used to evaluate the performance of the four models. The results revealed that the 20-m model had the highest classification accuracy (71.9%), whereas the 2-m model had the lowest value (68.7%). In the 2-m model, 89.4% of the landslide bodies fit in the medium to very high categories. For the 20-m model, only 83.3% of the landslide bodies were concentrated in the medium to very high classes. When the cell size decreases from 20 to 2 m, the area under the relative operating characteristic curve increases from 0.68 to 0.77. Therefore, higher-resolution DEMs would provide better results for landslide-susceptibility mapping.
Big Data Toolsets to Pharmacometrics: Application of Machine Learning for Time-to-Event Analysis.
Gong, Xiajing; Hu, Meng; Zhao, Liang
2018-05-01
Additional value can be potentially created by applying big data tools to address pharmacometric problems. The performances of machine learning (ML) methods and the Cox regression model were evaluated based on simulated time-to-event data synthesized under various preset scenarios, i.e., with linear vs. nonlinear and dependent vs. independent predictors in the proportional hazard function, or with high-dimensional data featured by a large number of predictor variables. Our results showed that ML-based methods outperformed the Cox model in prediction performance as assessed by concordance index and in identifying the preset influential variables for high-dimensional data. The prediction performances of ML-based methods are also less sensitive to data size and censoring rates than the Cox regression model. In conclusion, ML-based methods provide a powerful tool for time-to-event analysis, with a built-in capacity for high-dimensional data and better performance when the predictor variables assume nonlinear relationships in the hazard function. © 2018 The Authors. Clinical and Translational Science published by Wiley Periodicals, Inc. on behalf of American Society for Clinical Pharmacology and Therapeutics.
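The concordance index used above to compare prediction performance can be sketched as follows. This is a plain pairwise version of Harrell's C for right-censored data, written for illustration; it is not the authors' evaluation code, and the toy inputs are hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored data.

    A pair (i, j) is comparable when subject i has an observed event
    (events[i] == 1) strictly before time j. The pair is concordant when
    the subject who fails earlier also has the higher risk score; ties
    in risk score count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: the third subject is censored at time 3.
c_good = concordance_index([1, 2, 3], [1, 1, 0], [3.0, 2.0, 1.0])
c_bad = concordance_index([1, 2, 3], [1, 1, 0], [1.0, 2.0, 3.0])
print(c_good, c_bad)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering of risk against event time.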
Aryee, Samuel; Walumbwa, Fred O; Seidu, Emmanuel Y M; Otaye, Lilian E
2012-03-01
We proposed and tested a multilevel model, underpinned by empowerment theory, that examines the processes linking high-performance work systems (HPWS) and performance outcomes at the individual and organizational levels of analyses. Data were obtained from 37 branches of 2 banking institutions in Ghana. Results of hierarchical regression analysis revealed that branch-level HPWS relates to empowerment climate. Additionally, results of hierarchical linear modeling that examined the hypothesized cross-level relationships revealed 3 salient findings. First, experienced HPWS and empowerment climate partially mediate the influence of branch-level HPWS on psychological empowerment. Second, psychological empowerment partially mediates the influence of empowerment climate and experienced HPWS on service performance. Third, service orientation moderates the psychological empowerment-service performance relationship such that the relationship is stronger for those high rather than low in service orientation. Last, ordinary least squares regression results revealed that branch-level HPWS influences branch-level market performance through cross-level and individual-level influences on service performance that emerges at the branch level as aggregated service performance.
Christoper J. Schmitt; A. Dennis Lemly; Parley V. Winger
1993-01-01
Data from several sources were collated and analyzed by correlation, regression, and principal components analysis to define surrogate variables for use in the brook trout (Salvelinus fontinalis) habitat suitability index (HSI) model, and to evaluate the applicability of the model for assessing habitat in high elevation streams of the southern Blue Ridge Province (...
Taslimitehrani, Vahid; Dong, Guozhu; Pereira, Naveen L; Panahiazar, Maryam; Pathak, Jyotishman
2016-04-01
Computerized survival prediction in healthcare, which identifies the risk of disease mortality, helps healthcare providers manage their patients effectively by providing appropriate treatment options. In this study, we propose to apply a classification algorithm, Contrast Pattern Aided Logistic Regression (CPXR(Log)) with the probabilistic loss function, to develop and validate prognostic risk models to predict 1-, 2-, and 5-year survival in heart failure (HF) using data from electronic health records (EHRs) at Mayo Clinic. The CPXR(Log) constructs a pattern aided logistic regression model defined by several patterns and corresponding local logistic regression models. One of the models generated by CPXR(Log) achieved an AUC and accuracy of 0.94 and 0.91, respectively, and significantly outperformed prognostic models reported in prior studies. Data extracted from EHRs allowed incorporation of patient co-morbidities into our models, which helped improve the performance of the CPXR(Log) models (15.9% AUC improvement), although it did not improve the accuracy of the models built by other classifiers. We also propose a probabilistic loss function to determine the large error and small error instances. The new loss function used in the algorithm outperforms other functions used in the previous studies by a 1% improvement in the AUC. This study revealed that using EHR data to build prediction models can be very challenging using existing classification methods due to the high dimensionality and complexity of EHR data. The risk models developed by CPXR(Log) also reveal that HF is a highly heterogeneous disease, i.e., different subgroups of HF patients require different types of considerations with their diagnosis and treatment.
Our risk models provided two valuable insights for application of predictive modeling techniques in biomedicine: Logistic risk models often make systematic prediction errors, and it is prudent to use subgroup based prediction models such as those given by CPXR(Log) when investigating heterogeneous diseases. Copyright © 2016 Elsevier Inc. All rights reserved.
Wright, Melecia; Sotres-Alvarez, Daniela; Mendez, Michelle A; Adair, Linda
2017-03-01
No study has analysed how protein intake from early childhood to young adulthood relates to adult BMI in a single cohort. To estimate the association of protein intake at 2, 11, 15, 19 and 22 years with age- and sex-standardised BMI at 22 years (early adulthood), we used linear regression models with dietary and anthropometric data from a Filipino birth cohort (1985-2005, n 2586). We used latent growth curve analysis to identify trajectories of protein intake relative to age-specific recommended daily allowance (intake in g/kg body weight) from 2 to 22 years, then related trajectory membership to early adulthood BMI using linear regression models. Lean mass and fat mass were secondary outcomes. Regression models included socioeconomic, dietary and anthropometric confounders from early life and adulthood. Protein intake relative to needs at age 2 years was positively associated with BMI and lean mass at age 22 years, but intakes at ages 11, 15 and 22 years were inversely associated with early adulthood BMI. Individuals were classified into four mutually exclusive trajectories: (i) normal consumers (referent trajectory, 58 % of cohort), (ii) high protein consumers in infancy (20 %), (iii) usually high consumers (18 %) and (iv) always high consumers (5 %). Compared with the normal consumers, 'usually high' consumption was inversely associated with BMI, lean mass and fat mass at age 22 years, whereas 'always high' consumption was inversely associated with lean mass in males. Proximal protein intakes were more important contributors to early adult BMI relative to early-childhood protein intake; protein intake history was differentially associated with adulthood body size.
Determination of riverbank erosion probability using Locally Weighted Logistic Regression
NASA Astrophysics Data System (ADS)
Ioannidou, Elena; Flori, Aikaterini; Varouchakis, Emmanouil A.; Giannakis, Georgios; Vozinaki, Anthi Eirini K.; Karatzas, George P.; Nikolaidis, Nikolaos
2015-04-01
Riverbank erosion is a natural geomorphologic process that affects the fluvial environment. The most important issue concerning riverbank erosion is the identification of the vulnerable locations. An alternative to the usual hydrodynamic models to predict vulnerable locations is to quantify the probability of erosion occurrence. This can be achieved by identifying the underlying relations between riverbank erosion and the geomorphological or hydrological variables that prevent or stimulate erosion. Thus, riverbank erosion can be determined by a regression model using independent variables that are considered to affect the erosion process. The impact of such variables may vary spatially, therefore, a non-stationary regression model is preferred instead of a stationary equivalent. Locally Weighted Regression (LWR) is proposed as a suitable choice. This method can be extended to predict the binary presence or absence of erosion based on a series of independent local variables by using the logistic regression model. It is referred to as Locally Weighted Logistic Regression (LWLR). Logistic regression is a type of regression analysis used for predicting the outcome of a categorical dependent variable (e.g. binary response) based on one or more predictor variables. The method can be combined with LWR to assign weights to local independent variables of the dependent one. LWR allows model parameters to vary over space in order to reflect spatial heterogeneity. The probabilities of the possible outcomes are modelled as a function of the independent variables using a logistic function. Logistic regression measures the relationship between a categorical dependent variable and, usually, one or several continuous independent variables by converting the dependent variable to probability scores. Then, a logistic regression is formed, which predicts success or failure of a given binary variable (e.g. erosion presence or absence) for any value of the independent variables. 
The erosion occurrence probability can be calculated in conjunction with the model deviance regarding the independent variables tested. The most straightforward measure for goodness of fit is the G statistic. It is a simple and effective way to study and evaluate the Logistic Regression model efficiency and the reliability of each independent variable. The developed statistical model is applied to the Koiliaris River Basin on the island of Crete, Greece. Two datasets of river bank slope, river cross-section width and indications of erosion were available for the analysis (12 and 8 locations). Two different types of spatial dependence functions, exponential and tricubic, were examined to determine the local spatial dependence of the independent variables at the measurement locations. The results show a significant improvement when the tricubic function is applied as the erosion probability is accurately predicted at all eight validation locations. Results for the model deviance show that cross-section width is more important than bank slope in the estimation of erosion probability along the Koiliaris riverbanks. The proposed statistical model is a useful tool that quantifies the erosion probability along the riverbanks and can be used to assist managing erosion and flooding events. Acknowledgements This work is part of an on-going THALES project (CYBERSENSORS - High Frequency Monitoring System for Integrated Water Resources Management of Rivers). The project has been co-financed by the European Union (European Social Fund - ESF) and Greek national funds through the Operational Program "Education and Lifelong Learning" of the National Strategic Reference Framework (NSRF) - Research Funding Program: THALES. Investing in knowledge society through the European Social Fund.
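The combination of a tricubic spatial dependence function with logistic regression described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the bandwidth, the small ridge term added for numerical stability, the Newton solver, and all data values are hypothetical.

```python
import numpy as np


def tricube_weights(distances, bandwidth):
    """Tricubic kernel: w = (1 - (d/h)^3)^3 for d < h, and 0 otherwise."""
    u = np.clip(np.abs(distances) / bandwidth, 0.0, 1.0)
    return (1.0 - u**3) ** 3


def fit_weighted_logistic(X, y, w, ridge=1e-3, n_iter=25):
    """Weighted logistic regression fitted by Newton's method.

    The ridge term is an assumption of this sketch; it keeps the Hessian
    invertible when only a few locations carry non-zero weight.
    """
    Xb = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        eta = np.clip(Xb @ beta, -30.0, 30.0)
        p = 1.0 / (1.0 + np.exp(-eta))
        grad = Xb.T @ (w * (y - p)) - ridge * beta
        hess = Xb.T @ (Xb * (w * p * (1.0 - p))[:, None]) + ridge * np.eye(len(beta))
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def lwlr_predict(coords, X, y, target_coord, target_x, bandwidth):
    """Erosion probability at one target location from a locally weighted fit."""
    d = np.linalg.norm(coords - target_coord, axis=1)
    beta = fit_weighted_logistic(X, y, tricube_weights(d, bandwidth))
    return 1.0 / (1.0 + np.exp(-(beta[0] + target_x @ beta[1:])))


# Hypothetical toy data: six locations along a river, one predictor each.
coords = np.arange(6, dtype=float).reshape(-1, 1)   # position along the river
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])  # e.g. bank slope
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])        # erosion absent/present
target = np.array([2.5])
p_low = lwlr_predict(coords, X, y, target, np.array([0.0]), bandwidth=10.0)
p_high = lwlr_predict(coords, X, y, target, np.array([5.0]), bandwidth=10.0)
print(p_low, p_high)
```

Because the weights depend on the distance to the target location, the coefficients are re-estimated for each location, which is what allows the model parameters to vary over space.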
Dean, J A; Welsh, L C; Wong, K H; Aleksic, A; Dunne, E; Islam, M R; Patel, A; Patel, P; Petkar, I; Phillips, I; Sham, J; Schick, U; Newbold, K L; Bhide, S A; Harrington, K J; Nutting, C M; Gulliford, S L
2017-04-01
A normal tissue complication probability (NTCP) model of severe acute mucositis would be highly useful to guide clinical decision making and inform radiotherapy planning. We aimed to improve upon our previous model by using a novel oral mucosal surface organ at risk (OAR) in place of an oral cavity OAR. Predictive models of severe acute mucositis were generated using radiotherapy dose to the oral cavity OAR or mucosal surface OAR and clinical data. Penalised logistic regression and random forest classification (RFC) models were generated for both OARs and compared. Internal validation was carried out with 100-iteration stratified shuffle split cross-validation, using multiple metrics to assess different aspects of model performance. Associations between treatment covariates and severe mucositis were explored using RFC feature importance. Penalised logistic regression and RFC models using the oral cavity OAR performed at least as well as the models using mucosal surface OAR. Associations between dose metrics and severe mucositis were similar between the mucosal surface and oral cavity models. The volumes of oral cavity or mucosal surface receiving intermediate and high doses were most strongly associated with severe mucositis. The simpler oral cavity OAR should be preferred over the mucosal surface OAR for NTCP modelling of severe mucositis. We recommend minimising the volume of mucosa receiving intermediate and high doses, where possible. Copyright © 2016 The Royal College of Radiologists. Published by Elsevier Ltd. All rights reserved.
Applying Kaplan-Meier to Item Response Data
ERIC Educational Resources Information Center
McNeish, Daniel
2018-01-01
Some IRT models can be equivalently modeled in alternative frameworks such as logistic regression. Logistic regression can also model time-to-event data, which concerns the probability of an event occurring over time. Using the relation between time-to-event models and logistic regression and the relation between logistic regression and IRT, this…
Jin, H; Wu, S; Vidyanti, I; Di Capua, P; Wu, B
2015-01-01
This article is part of the Focus Theme of Methods of Information in Medicine on "Big Data and Analytics in Healthcare". Depression is a common and often undiagnosed condition for patients with diabetes. It is also a condition that significantly impacts healthcare outcomes, use, and cost as well as elevating suicide risk. Therefore, a model to predict depression among diabetes patients is a promising and valuable tool for providers to proactively assess depressive symptoms and identify those with depression. This study seeks to develop a generalized multilevel regression model, using a longitudinal data set from a recent large-scale clinical trial, to predict depression severity and presence of major depression among patients with diabetes. Severity of depression was measured by the Patient Health Questionnaire PHQ-9 score. Predictors were selected from 29 candidate factors to develop a 2-level Poisson regression model that can make population-average predictions for all patients and subject-specific predictions for individual patients with historical records. Newly obtained patient records can be incorporated with historical records to update the prediction model. Root-mean-square errors (RMSE) were used to evaluate predictive accuracy of PHQ-9 scores. The study also evaluated the classification ability of using the predicted PHQ-9 scores to classify patients as having major depression. Two time-invariant and 10 time-varying predictors were selected for the model. Incorporating historical records and using them to update the model may improve both predictive accuracy of PHQ-9 scores and classification ability of the predicted scores. Subject-specific predictions (for individual patients with historical records) achieved RMSE about 4 and areas under the receiver operating characteristic (ROC) curve about 0.9 and are better than population-average predictions. 
The study developed a generalized multilevel regression model to predict depression and demonstrated that using generalized multilevel regression based on longitudinal patient records can achieve high predictive ability.
Complex Environmental Data Modelling Using Adaptive General Regression Neural Networks
NASA Astrophysics Data System (ADS)
Kanevski, Mikhail
2015-04-01
The research deals with an adaptation and application of Adaptive General Regression Neural Networks (GRNN) to high dimensional environmental data. GRNN [1,2,3] are efficient modelling tools both for spatial and temporal data and are based on nonparametric kernel methods closely related to the classical Nadaraya-Watson estimator. Adaptive GRNN, using anisotropic kernels, can also be applied to feature selection tasks when working with high dimensional data [1,3]. In the present research Adaptive GRNN are used to study geospatial data predictability and relevant feature selection using both simulated and real data case studies. The original raw data were either three dimensional monthly precipitation data or monthly wind speeds embedded into 13 dimensional space constructed by geographical coordinates and geo-features calculated from digital elevation model. GRNN were applied in two different ways: 1) adaptive GRNN with the resulting list of features ordered according to their relevancy; and 2) adaptive GRNN applied to evaluate all possible models N [in case of wind fields N=(2^13 -1)=8191] and rank them according to the cross-validation error. In both cases training was carried out applying the leave-one-out procedure. An important result of the study is that the set of the most relevant features depends on the month (strong seasonal effect) and year. The predictabilities of precipitation and wind field patterns, estimated using the cross-validation and testing errors of raw and shuffled data, were studied in detail. The results of both approaches were qualitatively and quantitatively compared. In conclusion, Adaptive GRNN with their ability to select features and efficient modelling of complex high dimensional data can be widely used in automatic/on-line mapping and as an integrated part of environmental decision support systems. References: 1. Kanevski M., Pozdnoukhov A., Timonin V. Machine Learning for Spatial Environmental Data. Theory, applications and software. EPFL Press. With a CD: data, software, guides. (2009). 2. Kanevski M. Spatial Predictions of Soil Contamination Using General Regression Neural Networks. Systems Research and Information Systems, Volume 8, number 4, 1999. 3. Robert S., Foresti L., Kanevski M. Spatial prediction of monthly wind speeds in complex terrain with adaptive general regression neural networks. International Journal of Climatology, 33 pp. 1793-1804, 2013.
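The kernel estimator behind GRNN and the leave-one-out error used for training can be sketched as follows. This is a minimal Nadaraya-Watson version with a Gaussian kernel and optional per-feature bandwidths, written as an assumption-laden illustration; it is not the software cited in the abstract, and the function names and toy data are hypothetical.

```python
import numpy as np


def _gaussian_kernel(A, B, sigma):
    """Kernel matrix between rows of A and B; sigma is a scalar or per-feature vector."""
    diff = (A[:, None, :] - B[None, :, :]) / sigma
    return np.exp(-0.5 * np.sum(diff**2, axis=2))


def grnn_predict(X_train, y_train, X_query, sigma):
    """Nadaraya-Watson prediction: kernel-weighted average of training targets."""
    k = _gaussian_kernel(X_query, X_train, sigma)
    return (k @ y_train) / np.maximum(k.sum(axis=1), 1e-300)


def loo_mse(X, y, sigma):
    """Leave-one-out mean squared error for a given bandwidth (or bandwidth vector)."""
    k = _gaussian_kernel(X, X, sigma)
    np.fill_diagonal(k, 0.0)  # each point is predicted without itself
    pred = (k @ y) / np.maximum(k.sum(axis=1), 1e-300)
    return float(np.mean((pred - y) ** 2))


# Hypothetical toy data: two training points in one dimension.
X_train = np.array([[0.0], [1.0]])
y_train = np.array([0.0, 2.0])
midpoint = grnn_predict(X_train, y_train, np.array([[0.5]]), sigma=1.0)
error = loo_mse(X_train, y_train, sigma=1.0)
print(midpoint, error)
```

With a per-feature `sigma`, a feature whose tuned bandwidth becomes very large contributes almost nothing to the kernel, which is one way an anisotropic kernel can indicate feature relevance.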
Identification of extremely premature infants at high risk of rehospitalization.
Ambalavanan, Namasivayam; Carlo, Waldemar A; McDonald, Scott A; Yao, Qing; Das, Abhik; Higgins, Rosemary D
2011-11-01
Extremely low birth weight infants often require rehospitalization during infancy. Our objective was to identify at the time of discharge which extremely low birth weight infants are at higher risk for rehospitalization. Data from extremely low birth weight infants in Eunice Kennedy Shriver National Institute of Child Health and Human Development Neonatal Research Network centers from 2002-2005 were analyzed. The primary outcome was rehospitalization by the 18- to 22-month follow-up, and secondary outcome was rehospitalization for respiratory causes in the first year. Using variables and odds ratios identified by stepwise logistic regression, scoring systems were developed with scores proportional to odds ratios. Classification and regression-tree analysis was performed by recursive partitioning and automatic selection of optimal cutoff points of variables. A total of 3787 infants were evaluated (mean ± SD birth weight: 787 ± 136 g; gestational age: 26 ± 2 weeks; 48% male, 42% black). Forty-five percent of the infants were rehospitalized by 18 to 22 months; 14.7% were rehospitalized for respiratory causes in the first year. Both regression models (area under the curve: 0.63) and classification and regression-tree models (mean misclassification rate: 40%-42%) were moderately accurate. Predictors for the primary outcome by regression were shunt surgery for hydrocephalus, hospital stay of >120 days for pulmonary reasons, necrotizing enterocolitis stage II or higher or spontaneous gastrointestinal perforation, higher fraction of inspired oxygen at 36 weeks, and male gender. By classification and regression-tree analysis, infants with hospital stays of >120 days for pulmonary reasons had a 66% rehospitalization rate compared with 42% without such a stay. The scoring systems and classification and regression-tree analysis models identified infants at higher risk of rehospitalization and might assist planning for care after discharge.
Identification of Extremely Premature Infants at High Risk of Rehospitalization
Carlo, Waldemar A.; McDonald, Scott A.; Yao, Qing; Das, Abhik; Higgins, Rosemary D.
2011-01-01
OBJECTIVE: Extremely low birth weight infants often require rehospitalization during infancy. Our objective was to identify at the time of discharge which extremely low birth weight infants are at higher risk for rehospitalization. METHODS: Data from extremely low birth weight infants in Eunice Kennedy Shriver National Institute of Child Health and Human Development Neonatal Research Network centers from 2002–2005 were analyzed. The primary outcome was rehospitalization by the 18- to 22-month follow-up, and secondary outcome was rehospitalization for respiratory causes in the first year. Using variables and odds ratios identified by stepwise logistic regression, scoring systems were developed with scores proportional to odds ratios. Classification and regression-tree analysis was performed by recursive partitioning and automatic selection of optimal cutoff points of variables. RESULTS: A total of 3787 infants were evaluated (mean ± SD birth weight: 787 ± 136 g; gestational age: 26 ± 2 weeks; 48% male, 42% black). Forty-five percent of the infants were rehospitalized by 18 to 22 months; 14.7% were rehospitalized for respiratory causes in the first year. Both regression models (area under the curve: 0.63) and classification and regression-tree models (mean misclassification rate: 40%–42%) were moderately accurate. Predictors for the primary outcome by regression were shunt surgery for hydrocephalus, hospital stay of >120 days for pulmonary reasons, necrotizing enterocolitis stage II or higher or spontaneous gastrointestinal perforation, higher fraction of inspired oxygen at 36 weeks, and male gender. By classification and regression-tree analysis, infants with hospital stays of >120 days for pulmonary reasons had a 66% rehospitalization rate compared with 42% without such a stay. 
CONCLUSIONS: The scoring systems and classification and regression-tree analysis models identified infants at higher risk of rehospitalization and might assist planning for care after discharge. PMID:22007016
Demonstration of a Fiber Optic Regression Probe in a High-Temperature Flow
NASA Technical Reports Server (NTRS)
Korman, Valentin; Polzin, Kurt
2011-01-01
The capability to provide localized, real-time monitoring of material regression rates in various applications has the potential to provide a new stream of data for development testing of various components and systems, as well as serving as a monitoring tool in flight applications. These applications include, but are not limited to, the regression of a combusting solid fuel surface, the ablation of the throat in a chemical rocket or the heat shield of an aeroshell, and the monitoring of erosion in long-life plasma thrusters. The rate of regression in the first application is very fast, while the second and third are increasingly slower. A recent fundamental sensor development effort has led to a novel regression, erosion, and ablation sensor technology (REAST). The REAST sensor allows for measurement of real-time surface erosion rates at a discrete surface location. The sensor is optical, using two different, co-located fiber-optics to perform the regression measurement. The disparate optical transmission properties of the two fiber-optics make it possible to measure the regression rate by monitoring the relative light attenuation through the fibers. As the fibers regress along with the parent material in which they are embedded, the relative light intensities through the two fibers change, providing a measure of the regression rate. The optical nature of the system makes it relatively easy to use in a variety of harsh, high temperature environments, and it is also unaffected by the presence of electric and magnetic fields. In addition, the sensor could be used to perform optical spectroscopy on the light emitted by a process and collected by fibers, giving localized measurements of various properties. The capability to perform an in-situ measurement of material regression rates is useful in addressing a variety of physical issues in various applications.
An in-situ measurement allows for real-time data regarding the erosion rates, providing a quick method for empirically anchoring any analysis geared towards lifetime qualification. Erosion rate data over an operating envelope could also be useful in modeling detailed physical processes. The sensor has been embedded in many regressing media to demonstrate the capabilities in a number of regressing environments. In the present work, sensors were installed in the eroding/regressing throat region of a converging-diverging flow, with the working gas heated to high temperatures by means of a high-pressure arc discharge at steady-state discharge power levels up to 500 kW. The amount of regression observed in each material sample was quantified using a laser profilometer, which was compared to the in-situ erosion measurements to demonstrate the efficacy of the measurement technique in very harsh, high-temperature environments.
Lee, Martha; Brauer, Michael; Wong, Paulina; Tang, Robert; Tsui, Tsz Him; Choi, Crystal; Cheng, Wei; Lai, Poh-Chin; Tian, Linwei; Thach, Thuan-Quoc; Allen, Ryan; Barratt, Benjamin
2017-08-15
Land use regression (LUR) is a common method of predicting spatial variability of air pollution to estimate exposure. Nitrogen dioxide (NO2), nitric oxide (NO), fine particulate matter (PM2.5), and black carbon (BC) concentrations were measured during two sampling campaigns (April-May and November-January) in Hong Kong (a prototypical high-density high-rise city). Along with 365 potential geospatial predictor variables, these concentrations were used to build two-dimensional land use regression (LUR) models for the territory. Summary statistics for combined measurements over both campaigns were: a) NO2 (Mean=106 μg/m³, SD=38.5, N=95), b) NO (M=147 μg/m³, SD=88.9, N=40), c) PM2.5 (M=35 μg/m³, SD=6.3, N=64), and d) BC (M=10.6 μg/m³, SD=5.3, N=76). Final LUR models had the following statistics: a) NO2 (R²=0.46, RMSE=28 μg/m³), b) NO (R²=0.50, RMSE=62 μg/m³), c) PM2.5 (R²=0.59, RMSE=4 μg/m³), and d) BC (R²=0.50, RMSE=4 μg/m³). Traditional LUR predictors such as road length, car park density, and land use types were included in most models. The NO2 prediction surface values were highest in Kowloon and the northern region of Hong Kong Island (downtown Hong Kong). NO showed a similar pattern in the built-up region. Both PM2.5 and BC predictions exhibited a northwest-southeast gradient, with higher concentrations in the north (close to mainland China). For BC, the port was also an area of elevated predicted concentrations. The results matched with existing literature on spatial variation in concentrations of air pollutants and in relation to important emission sources in Hong Kong. The success of these models suggests LUR is appropriate in high-density, high-rise cities. Copyright © 2017 Elsevier B.V. All rights reserved.
Bootstrap Enhanced Penalized Regression for Variable Selection with Neuroimaging Data
Abram, Samantha V.; Helwig, Nathaniel E.; Moodie, Craig A.; DeYoung, Colin G.; MacDonald, Angus W.; Waller, Niels G.
2016-01-01
Recent advances in fMRI research highlight the use of multivariate methods for examining whole-brain connectivity. Complementary data-driven methods are needed for determining the subset of predictors related to individual differences. Although commonly used for this purpose, ordinary least squares (OLS) regression may not be ideal due to multi-collinearity and over-fitting issues. Penalized regression is a promising and underutilized alternative to OLS regression. In this paper, we propose a nonparametric bootstrap quantile (QNT) approach for variable selection with neuroimaging data. We use real and simulated data, as well as annotated R code, to demonstrate the benefits of our proposed method. Our results illustrate the practical potential of our proposed bootstrap QNT approach. Our real data example demonstrates how our method can be used to relate individual differences in neural network connectivity with an externalizing personality measure. Also, our simulation results reveal that the QNT method is effective under a variety of data conditions. Penalized regression yields more stable estimates and sparser models than OLS regression in situations with large numbers of highly correlated neural predictors. Our results demonstrate that penalized regression is a promising method for examining associations between neural predictors and clinically relevant traits or behaviors. These findings have important implications for the growing field of functional connectivity research, where multivariate methods produce numerous, highly correlated brain networks. PMID:27516732
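The idea of selecting variables from bootstrap quantiles of penalized regression coefficients can be sketched as follows. This is only an illustration of the general approach: ridge regression is swapped in here because it has a closed form, and the penalty value, number of resamples, quantile level, and simulated data are all assumptions of this sketch rather than the settings or the annotated R code of the paper. Predictors and outcome are assumed to be centered, so no intercept is fitted.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients: (X'X + lam * I)^-1 X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


def bootstrap_quantile_select(X, y, lam=1.0, n_boot=500, alpha=0.05, seed=0):
    """Flag predictors whose bootstrap quantile interval excludes zero.

    Rows are resampled with replacement, the penalized model is refitted
    on each resample, and a predictor is selected when the
    [alpha/2, 1 - alpha/2] quantile interval of its coefficient lies
    entirely above or entirely below zero.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    coefs = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # nonparametric bootstrap resample
        coefs[b] = ridge_fit(X[idx], y[idx], lam)
    lower, upper = np.quantile(coefs, [alpha / 2, 1 - alpha / 2], axis=0)
    return (lower > 0) | (upper < 0)


# Simulated data (hypothetical): only the first of four predictors matters.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=200)
selected = bootstrap_quantile_select(X, y)
print(selected)
```

In practice the penalty would usually be tuned, for example by cross-validation, before or within the bootstrap loop; that step is omitted here to keep the sketch short.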
Assessing product image quality for online shopping
NASA Astrophysics Data System (ADS)
Goswami, Anjan; Chung, Sung H.; Chittar, Naren; Islam, Atiq
2012-01-01
Assessing product-image quality is important in the context of online shopping. A high quality image that conveys more information about a product can boost the buyer's confidence and can get more attention. However, the notion of image quality for product-images is not the same as that in other domains. The perception of quality of product-images depends not only on various photographic quality features but also on various high level features such as clarity of the foreground or goodness of the background. In this paper, we define a notion of product-image quality based on various such features. We conduct a crowd-sourced experiment to collect user judgments on thousands of eBay's images. We formulate a multi-class classification problem for modeling image quality by classifying images into good, fair and poor quality based on the guided perceptual notions from the judges. We also conduct experiments with regression using average crowd-sourced human judgments as target. We compute a pseudo-regression score with expected average of predicted classes and also compute a score from the regression technique. We design many experiments with various sampling and voting schemes with crowd-sourced data and construct various experimental image quality models. Most of our models have reasonable accuracies (greater than or equal to 70%) on the test data set. We observe that our computed image quality score has a high (0.66) rank correlation with average votes from the crowd sourced human judgments.
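One plausible reading of the pseudo-regression score described above is the expected class value under the classifier's predicted class probabilities. The sketch below shows that reading only; the numeric values assigned to poor, fair and good, and the example probabilities, are hypothetical and not taken from the paper.

```python
import numpy as np

# Hypothetical numeric values for the three quality classes: poor, fair, good.
CLASS_VALUES = np.array([0.0, 1.0, 2.0])


def pseudo_regression_score(class_probs, class_values=CLASS_VALUES):
    """Expected class value: sum over classes of P(class) * value(class)."""
    class_probs = np.asarray(class_probs, dtype=float)
    return class_probs @ class_values


# Each row holds predicted probabilities for (poor, fair, good) for one image.
scores = pseudo_regression_score([
    [0.0, 0.5, 0.5],   # split between fair and good
    [1.0, 0.0, 0.0],   # confidently poor
])
print(scores)
```

A continuous score of this kind can then be rank-correlated with the average crowd votes, which is the type of comparison the abstract reports.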
Merello, Paloma; García-Diego, Fernando-Juan; Beltrán, Pedro; Scatigno, Claudia
2018-01-25
The characterization of microclimatic conditions is fundamental for the preventive conservation of archaeological sites. In this context, identifying the factors that influence the thermo-hygrometric equilibrium is key to determining the causes of cultural heritage deterioration. In this work, the thermo-hygrometric conditions of Casa di Diana (Ostia Antica, Italy) are characterized by analyzing temperature and relative humidity data recorded by a system of sensors with high monitoring frequency. Sensors are installed in parallel, calibrated and synchronized with a microcontroller. A data set of 793,620 data points, arranged in a matrix with 66,135 rows and 12 columns, was used. Furthermore, the influence of human impact (visitors) is evaluated through a multiple linear regression model and a logistic regression model. The visitors do not affect the environmental humidity, as it is very high and constant throughout the year. The results show a significant influence of the visitors in upsetting the thermal balance. When a guided tour takes place, the hourly temperature variation is 10.64 times more likely to exceed its monthly average than to remain equal to or below it. The analysis of the regression residuals shows the influence of outdoor climatic variables, such as solar radiation or ventilation, on the thermal balance.
Merello, Paloma; García-Diego, Fernando-Juan; Beltrán, Pedro; Scatigno, Claudia
2018-01-01
The characterization of microclimatic conditions is fundamental for the preventive conservation of archaeological sites. In this context, identifying the factors that influence the thermo-hygrometric equilibrium is key to determining the causes of cultural heritage deterioration. In this work, the thermo-hygrometric conditions of Casa di Diana (Ostia Antica, Italy) are characterized by analyzing temperature and relative humidity data recorded by a system of sensors with high monitoring frequency. Sensors are installed in parallel, calibrated and synchronized with a microcontroller. A data set of 793,620 data points, arranged in a matrix with 66,135 rows and 12 columns, was used. Furthermore, the influence of human impact (visitors) is evaluated through a multiple linear regression model and a logistic regression model. The visitors do not affect the environmental humidity, as it is very high and constant throughout the year. The results show a significant influence of the visitors in upsetting the thermal balance. When a guided tour takes place, the hourly temperature variation is 10.64 times more likely to exceed its monthly average than to remain equal to or below it. The analysis of the regression residuals shows the influence of outdoor climatic variables, such as solar radiation or ventilation, on the thermal balance. PMID:29370142
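The two numeric statements in this abstract can be checked with a few lines of arithmetic. The sketch below assumes the 10.64 figure is an odds (probability of exceeding the monthly average divided by the probability of not exceeding it); the abstract does not state this explicitly, so the derived probability is an interpretation rather than a reported result.

```python
import math

# Data matrix dimensions reported in the abstract.
rows, cols = 66_135, 12
total_readings = rows * cols  # should equal the reported 793,620

# Interpreting 10.64 as odds during a guided tour (an assumption).
odds = 10.64
probability = odds / (1 + odds)  # about 0.914
log_odds = math.log(odds)        # the corresponding value on the logit scale
```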
Body Fat Percentage Prediction Using Intelligent Hybrid Approaches
Shao, Yuehjen E.
2014-01-01
Excess body fat often leads to obesity. Obesity is typically associated with serious medical diseases, such as cancer, heart disease, and diabetes. Accordingly, knowing the body fat level is extremely important since it affects everyone's health. Although there are several ways to measure the body fat percentage (BFP), the accurate methods are often associated with hassle and/or high costs. Traditional single-stage approaches may use certain body measurements or explanatory variables to predict the BFP. Diverging from existing approaches, this study proposes new intelligent hybrid approaches to obtain fewer explanatory variables, and the proposed forecasting models are able to effectively predict the BFP. The proposed hybrid models consist of multiple regression (MR), artificial neural network (ANN), multivariate adaptive regression splines (MARS), and support vector regression (SVR) techniques. The first stage of the modeling includes the use of MR and MARS to obtain fewer but more important sets of explanatory variables. In the second stage, the remaining important variables serve as inputs for the other forecasting methods. A real dataset was used to demonstrate the development of the proposed hybrid models. The prediction results revealed that the proposed hybrid schemes outperformed the typical, single-stage forecasting models. PMID:24723804
NASA Astrophysics Data System (ADS)
Yilmaz, Isik; Keskin, Inan; Marschalko, Marian; Bednarik, Martin
2010-05-01
This study compares GIS-based collapse susceptibility mapping methods, namely conditional probability (CP), logistic regression (LR) and artificial neural networks (ANN), applied to gypsum rock masses in the Sivas basin (Turkey). A Digital Elevation Model (DEM) was first constructed using GIS software. Collapse-related factors, directly or indirectly related to the causes of collapse occurrence, such as distance from faults, slope angle and aspect, topographical elevation, distance from drainage, topographic wetness index (TWI), stream power index (SPI), Normalized Difference Vegetation Index (NDVI) as a measure of vegetation cover, and distance from roads and settlements, were used in the collapse susceptibility analyses. In the last stage of the analyses, collapse susceptibility maps were produced from the CP, LR and ANN models, and they were then compared by means of their validations. Area Under Curve (AUC) values obtained from all three methodologies showed that the map obtained from the ANN model appears more accurate than those from the other models, and the results also showed that artificial neural networks are a useful tool in the preparation of collapse susceptibility maps and are highly compatible with GIS operating features. Key words: Collapse; doline; susceptibility map; gypsum; GIS; conditional probability; logistic regression; artificial neural networks.
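The AUC used to compare the three maps can be computed directly from susceptibility scores. The sketch below uses the pairwise (Mann-Whitney) definition: the share of collapse / non-collapse cell pairs in which the collapse cell scores higher, with ties counted as one half. The score lists are hypothetical.

```python
def auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Hypothetical susceptibility scores for map cells.
collapse_cells = [0.9, 0.8, 0.4]
stable_cells = [0.7, 0.3, 0.2, 0.1]
auc_value = auc(collapse_cells, stable_cells)  # 11 of 12 pairs ordered correctly
```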
Bootstrap evaluation of a young Douglas-fir height growth model for the Pacific Northwest
Nicholas R. Vaughn; Eric C. Turnblom; Martin W. Ritchie
2010-01-01
We evaluated the stability of a complex regression model developed to predict the annual height growth of young Douglas-fir. This model is highly nonlinear and is fit in an iterative manner for annual growth coefficients from data with multiple periodic remeasurement intervals. The traditional methods for such a sensitivity analysis either involve laborious math or...
Finley, Andrew O.; Banerjee, Sudipto; Cook, Bruce D.; Bradford, John B.
2013-01-01
In this paper we detail a multivariate spatial regression model that couples LiDAR, hyperspectral and forest inventory data to predict forest outcome variables at a high spatial resolution. The proposed model is used to analyze forest inventory data collected on the US Forest Service Penobscot Experimental Forest (PEF), ME, USA. In addition to helping meet the regression model's assumptions, results from the PEF analysis suggest that the addition of multivariate spatial random effects improves model fit and predictive ability, compared with two commonly applied modeling approaches. This improvement results from explicitly modeling the covariation among forest outcome variables and spatial dependence among observations through the random effects. Direct application of such multivariate models to even moderately large datasets is often computationally infeasible because of cubic order matrix algorithms involved in estimation. We apply a spatial dimension reduction technique to help overcome this computational hurdle without sacrificing richness in modeling.
Face Hallucination with Linear Regression Model in Semi-Orthogonal Multilinear PCA Method
NASA Astrophysics Data System (ADS)
Asavaskulkiet, Krissada
2018-04-01
In this paper, we propose a new face hallucination technique: face image reconstruction in HSV color space with a semi-orthogonal multilinear principal component analysis (SO-MPCA) method. This novel hallucination technique operates directly on tensors via tensor-to-vector projection by imposing the orthogonality constraint in only one mode. In our experiments, we use facial images from the FERET database to test our hallucination approach, which is demonstrated by extensive experiments with high-quality hallucinated color faces. The experimental results clearly demonstrate that we can generate photorealistic color face images by using the SO-MPCA subspace with a linear regression model.
ERIC Educational Resources Information Center
Liou, Pey-Yan
2009-01-01
The current study examines three regression models: OLS (ordinary least squares) linear regression, Poisson regression, and negative binomial regression for analyzing count data. Simulation results show that the OLS regression model performed better than the others, since it did not produce more false statistically significant relationships than…
Boligon, A A; Baldi, F; Mercadante, M E Z; Lobo, R B; Pereira, R J; Albuquerque, L G
2011-06-28
We quantified the potential increase in accuracy of expected breeding value for weights of Nelore cattle, from birth to mature age, using multi-trait and random regression models on Legendre polynomials and B-spline functions. A total of 87,712 weight records from 8144 females were used, recorded every three months from birth to mature age from the Nelore Brazil Program. For random regression analyses, all female weight records from birth to eight years of age (data set I) were considered. From this general data set, a subset was created (data set II), which included only nine weight records: at birth, weaning, 365 and 550 days of age, and 2, 3, 4, 5, and 6 years of age. Data set II was analyzed using random regression and multi-trait models. The model of analysis included the contemporary group as fixed effects and age of dam as a linear and quadratic covariable. In the random regression analyses, average growth trends were modeled using a cubic regression on orthogonal polynomials of age. Residual variances were modeled by a step function with five classes. Legendre polynomials of fourth and sixth order were utilized to model the direct genetic and animal permanent environmental effects, respectively, while third-order Legendre polynomials were considered for maternal genetic and maternal permanent environmental effects. Quadratic polynomials were applied to model all random effects in random regression models on B-spline functions. Direct genetic and animal permanent environmental effects were modeled using three segments or five coefficients, and genetic maternal and maternal permanent environmental effects were modeled with one segment or three coefficients in the random regression models on B-spline functions. For both data sets (I and II), animals ranked differently according to expected breeding value obtained by random regression or multi-trait models. 
With random regression models, the highest gains in accuracy were obtained at ages with a low number of weight records. The results indicate that random regression models provide more accurate expected breeding values than the traditionally finite multi-trait models. Thus, higher genetic responses are expected for beef cattle growth traits by replacing a multi-trait model with random regression models for genetic evaluation. B-spline functions could be applied as an alternative to Legendre polynomials to model covariance functions for weights from birth to mature age.
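The Legendre covariates that enter a random regression model can be generated as in the sketch below. It assumes ages are first mapped to the interval [-1, 1] and uses unnormalized polynomials from Bonnet's recurrence; animal breeding software often uses a normalized variant, and the abstract does not say whether "fourth order" means degree four or four coefficients. The function names are hypothetical.

```python
def standardize_age(age, age_min, age_max):
    """Map an age in [age_min, age_max] onto [-1, 1]."""
    return 2.0 * (age - age_min) / (age_max - age_min) - 1.0

def legendre_basis(x, order):
    """Return [P_0(x), ..., P_order(x)] using Bonnet's recurrence."""
    values = [1.0, x]
    for n in range(1, order):
        values.append(((2 * n + 1) * x * values[n] - n * values[n - 1]) / (n + 1))
    return values[: order + 1]

# Covariates for a weight recorded at 4 years on a birth-to-8-years scale.
x_mid = standardize_age(4.0, 0.0, 8.0)  # 0.0, the midpoint of the age range
basis_mid = legendre_basis(x_mid, 4)    # P_0..P_4 evaluated at the midpoint
p2_at_half = legendre_basis(0.5, 2)[2]  # (3 * 0.25 - 1) / 2 = -0.125
```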
Evaluating differential effects using regression interactions and regression mixture models
Van Horn, M. Lee; Jaki, Thomas; Masyn, Katherine; Howe, George; Feaster, Daniel J.; Lamont, Andrea E.; George, Melissa R. W.; Kim, Minjung
2015-01-01
Research increasingly emphasizes understanding differential effects. This paper focuses on understanding regression mixture models, a relatively new statistical method for assessing differential effects, by comparing their results with those obtained using an interaction term in linear regression. The research questions which each model answers, their formulation, and their assumptions are compared using Monte Carlo simulations and real data analysis. The capabilities of regression mixture models are described and specific issues to be addressed when conducting regression mixtures are proposed. The paper aims to clarify the role that regression mixtures can take in the estimation of differential effects and increase awareness of the benefits and potential pitfalls of this approach. Regression mixture models are shown to be a potentially effective exploratory method for finding differential effects when these effects can be defined by a small number of classes of respondents who share a typical relationship between a predictor and an outcome. It is also shown that the comparison between regression mixture models and interactions becomes substantially more complex as the number of classes increases. It is argued that regression interactions are well suited for direct tests of specific hypotheses about differential effects and regression mixtures provide a useful approach for exploring effect heterogeneity given adequate samples and study design. PMID:26556903
Roso, V M; Schenkel, F S; Miller, S P; Schaeffer, L R
2005-08-01
Breed additive, dominance, and epistatic loss effects are of concern in the genetic evaluation of a multibreed population. Multiple regression equations used for fitting these effects may show a high degree of multicollinearity among predictor variables. Typically, when strong linear relationships exist, the regression coefficients have large SE and are sensitive to changes in the data file and to the addition or deletion of variables in the model. Generalized ridge regression methods were applied to obtain stable estimates of direct and maternal breed additive, dominance, and epistatic loss effects in the presence of multicollinearity among predictor variables. Preweaning weight gains of beef calves in Ontario, Canada, from 1986 to 1999 were analyzed. The genetic model included fixed direct and maternal breed additive, dominance, and epistatic loss effects, fixed environmental effects of age of the calf, contemporary group, and age of the dam x sex of the calf, random additive direct and maternal genetic effects, and random maternal permanent environment effect. The degree and the nature of the multicollinearity were identified and ridge regression methods were used as an alternative to ordinary least squares (LS). Ridge parameters were obtained using two different objective methods: 1) generalized ridge estimator of Hoerl and Kennard (R1); and 2) bootstrap in combination with cross-validation (R2). Both ridge regression methods outperformed the LS estimator with respect to mean squared error of predictions (MSEP) and variance inflation factors (VIF) computed over 100 bootstrap samples. The MSEP of R1 and R2 were similar, and they were 3% less than the MSEP of LS. The average VIF of LS, R1, and R2 were equal to 26.81, 6.10, and 4.18, respectively. Ridge regression methods were particularly effective in decreasing the multicollinearity involving predictor variables of breed additive effects. 
Because of a high degree of confounding between estimates of maternal dominance and direct epistatic loss effects, it was not possible to compare the relative importance of these effects with a high level of confidence. The inclusion of epistatic loss effects in the additive-dominance model did not cause noticeable reranking of sires, dams, and calves based on across-breed EBV. More precise estimates of breed effects as a result of this study may result in more stable across-breed estimated breeding values over the years.
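The variance inflation factor used above to gauge multicollinearity has a simple closed form in the two-predictor case, VIF = 1 / (1 - r^2), where r is the correlation between the predictors. The sketch below covers only that special case with made-up predictor values; the study itself computed VIF over many predictors and 100 bootstrap samples.

```python
def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    saa = sum((u - mean_a) ** 2 for u in a)
    sbb = sum((v - mean_b) ** 2 for v in b)
    return sab / (saa * sbb) ** 0.5

def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when the model has exactly two."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)

# Hypothetical, nearly collinear predictors.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1]
vif = vif_two_predictors(x1, x2)  # far above the averages reported in the abstract
```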
Evaluating Differential Effects Using Regression Interactions and Regression Mixture Models
ERIC Educational Resources Information Center
Van Horn, M. Lee; Jaki, Thomas; Masyn, Katherine; Howe, George; Feaster, Daniel J.; Lamont, Andrea E.; George, Melissa R. W.; Kim, Minjung
2015-01-01
Research increasingly emphasizes understanding differential effects. This article focuses on understanding regression mixture models, which are relatively new statistical methods for assessing differential effects by comparing results to using an interactive term in linear regression. The research questions which each model answers, their…
NASA Astrophysics Data System (ADS)
Underwood, Kristen L.; Rizzo, Donna M.; Schroth, Andrew W.; Dewoolkar, Mandar M.
2017-12-01
Given the variable biogeochemical, physical, and hydrological processes driving fluvial sediment and nutrient export, the water science and management communities need data-driven methods to identify regions prone to production and transport under variable hydrometeorological conditions. We use Bayesian analysis to segment concentration-discharge linear regression models for total suspended solids (TSS) and particulate and dissolved phosphorus (PP, DP) using 22 years of monitoring data from 18 Lake Champlain watersheds. Bayesian inference was leveraged to estimate segmented regression model parameters and identify threshold position. The identified threshold positions demonstrated a considerable range below and above the median discharge—which has been used previously as the default breakpoint in segmented regression models to discern differences between pre and post-threshold export regimes. We then applied a Self-Organizing Map (SOM), which partitioned the watersheds into clusters of TSS, PP, and DP export regimes using watershed characteristics, as well as Bayesian regression intercepts and slopes. A SOM defined two clusters of high-flux basins, one where PP flux was predominantly episodic and hydrologically driven; and another in which the sediment and nutrient sourcing and mobilization were more bimodal, resulting from both hydrologic processes at post-threshold discharges and reactive processes (e.g., nutrient cycling or lateral/vertical exchanges of fine sediment) at prethreshold discharges. A separate DP SOM defined two high-flux clusters exhibiting a bimodal concentration-discharge response, but driven by differing land use. Our novel framework shows promise as a tool with broad management application that provides insights into landscape drivers of riverine solute and sediment export.
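The idea of locating a threshold in a concentration-discharge relationship can be sketched with a simpler, non-Bayesian stand-in: try every admissible split of the sorted discharge values, fit a separate least-squares line on each side, and keep the split with the smallest total squared error. This grid search is an assumption made for illustration and is not the Bayesian procedure the authors used; `log_q` and `log_c` are hypothetical variable names.

```python
def ols(xs, ys):
    """Simple linear regression; returns (intercept, slope, sse)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, sse

def best_breakpoint(xs, ys, min_pts=3):
    """Grid search over splits of x-sorted data; returns (breakpoint, slope_lo, slope_hi)."""
    best = None
    for i in range(min_pts, len(xs) - min_pts + 1):
        left, right = ols(xs[:i], ys[:i]), ols(xs[i:], ys[i:])
        sse = left[2] + right[2]
        if best is None or sse < best[0]:
            best = (sse, (xs[i - 1] + xs[i]) / 2.0, left[1], right[1])
    return best[1], best[2], best[3]

# Synthetic data with a slope change between x = 5 and x = 6.
log_q = [float(v) for v in range(1, 11)]
log_c = [1.0, 2.0, 3.0, 4.0, 5.0, 7.0, 10.0, 13.0, 16.0, 19.0]
breakpoint_x, slope_below, slope_above = best_breakpoint(log_q, log_c)
```

A Bayesian treatment would instead place a prior on the threshold and report its posterior distribution, which also conveys uncertainty about the breakpoint position.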
Dunham, J.B.; Cade, B.S.; Terrell, J.W.
2002-01-01
We used regression quantiles to model potentially limiting relationships between the standing crop of cutthroat trout Oncorhynchus clarki and measures of stream channel morphology. Regression quantile models indicated that variation in fish density was inversely related to the width:depth ratio of streams but not to stream width or depth alone. The spatial and temporal stability of model predictions were examined across years and streams, respectively. Variation in fish density with width:depth ratio (10th-90th regression quantiles) modeled for streams sampled in 1993-1997 predicted the variation observed in 1998-1999, indicating similar habitat relationships across years. Both linear and nonlinear models described the limiting relationships well, the latter performing slightly better. Although estimated relationships were transferable in time, results were strongly dependent on the influence of spatial variation in fish density among streams. Density changes with width:depth ratio in a single stream were responsible for the significant (P < 0.10) negative slopes estimated for the higher quantiles (>80th). This suggests that stream-scale factors other than width:depth ratio play a more direct role in determining population density. Much of the variation in densities of cutthroat trout among streams was attributed to the occurrence of nonnative brook trout Salvelinus fontinalis (a possible competitor) or connectivity to migratory habitats. Regression quantiles can be useful for estimating the effects of limiting factors when ecological responses are highly variable, but our results indicate that spatiotemporal variability in the data should be explicitly considered. In this study, data from individual streams and stream-specific characteristics (e.g., the occurrence of nonnative species and habitat connectivity) strongly affected our interpretation of the relationship between width:depth ratio and fish density.
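A regression quantile line minimizes an asymmetric "check" loss rather than squared error. For a single predictor, an optimal line can always be found among the lines through pairs of data points, so a brute-force search over pairs is enough for a small illustration. The sketch below uses made-up width:depth ratios and densities; it is not the authors' fitting procedure.

```python
from itertools import combinations

def check_loss(residual, tau):
    """Quantile (pinball) loss: weight tau above the line, 1 - tau below it."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))

def quantile_line(xs, ys, tau):
    """Brute-force quantile regression for one predictor; returns (intercept, slope)."""
    best = None
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue  # a vertical line is not a valid fit
        slope = (ys[j] - ys[i]) / (xs[j] - xs[i])
        intercept = ys[i] - slope * xs[i]
        loss = sum(check_loss(y - intercept - slope * x, tau) for x, y in zip(xs, ys))
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    return best[1], best[2]

# Hypothetical width:depth ratios and fish densities.
ratio = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
density = [10.0, 9.0, 9.5, 7.0, 6.0, 6.5, 4.0, 3.0]
intercept_90, slope_90 = quantile_line(ratio, density, tau=0.9)
```

An upper quantile such as the 90th traces the ceiling of the scatter, which is why it is read as a limiting relationship rather than an average one.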
Distiller, Larry A; Joffe, Barry I; Melville, Vanessa; Welman, Tania; Distiller, Greg B
2006-01-01
The factors responsible for premature coronary atherosclerosis in patients with type 1 diabetes are ill defined. We therefore assessed carotid intima-media complex thickness (IMT) in relatively long-surviving patients with type 1 diabetes as a marker of atherosclerosis and correlated this with traditional risk factors. This cross-sectional study included 148 patients with relatively long-surviving (>18 years) type 1 diabetes (76 men and 72 women) attending the Centre for Diabetes and Endocrinology, Johannesburg. The mean common carotid artery IMT and the presence or absence of plaque were evaluated by high-resolution B-mode ultrasound. Their median age was 48 years and duration of diabetes 26 years (range 18-59 years). Traditional risk factors (age, duration of diabetes, glycemic control, hypertension, smoking and lipoprotein concentrations) were recorded. Three response variables were defined and modeled. Standard multiple regression was used for a continuous IMT variable, logistic regression for the presence/absence of plaque and ordinal logistic regression to model three categories of "risk." The median common carotid IMT was 0.62 mm (range 0.44-1.23 mm) with plaque detected in 28 cases. The multiple regression model found significant associations between IMT and current age (P=.001), duration of diabetes (P=.033), BMI (P=.008) and diagnosed hypertension (P=.046) with HDL showing a protective effect (P=.022). Current age (P=.001) and diagnosed hypertension (P=.004), smoking (P=.008) and retinopathy (P=.033) were significant in the logistic regression model. Current age was also significant in the ordinal logistic regression model (P<.001), as was total cholesterol/HDL ratio (P<.001) and mean HbA(1c) concentration (P=.073). 
The major factors influencing common carotid IMT in patients with relatively long-surviving type 1 diabetes are age, duration of diabetes, existing hypertension and HDL (protective) with a relatively minor role ascribed to relatively long-standing glycemic control.
Fenlon, Caroline; O'Grady, Luke; Butler, Stephen; Doherty, Michael L; Dunnion, John
2017-01-01
Herd fertility in pasture-based dairy farms is a key driver of farm economics. Models for predicting nulliparous reproductive outcomes are rare, but age, genetics, weight, and BCS have been identified as factors influencing heifer conception. The aim of this study was to create a simulation model of heifer conception to service with thorough evaluation. Artificial Insemination service records from two research herds and ten commercial herds were provided to build and evaluate the models. All were managed as spring-calving pasture-based systems. The factors studied were related to age, genetics, and time of service. The data were split into training and testing sets and bootstrapping was used to train the models. Logistic regression (with and without random effects) and generalised additive modelling were selected as the model-building techniques. Two types of evaluation were used to test the predictive ability of the models: discrimination and calibration. Discrimination, which includes sensitivity, specificity, accuracy and ROC analysis, measures a model's ability to distinguish between positive and negative outcomes. Calibration measures the accuracy of the predicted probabilities with the Hosmer-Lemeshow goodness-of-fit, calibration plot and calibration error. After data cleaning and the removal of services with missing values, 1396 services remained to train the models and 597 were left for testing. Age, breed, genetic predicted transmitting ability for calving interval, month and year were significant in the multivariate models. The regression models also included an interaction between age and month. Year within herd was a random effect in the mixed regression model. Overall prediction accuracy was between 77.1% and 78.9%. All three models had very high sensitivity, but low specificity. The two regression models were very well-calibrated. The mean absolute calibration errors were all below 4%. 
Because the models were not adept at identifying unsuccessful services, they are not suggested for use in predicting the outcome of individual heifer services. Instead, they are useful for the comparison of services with different covariate values or as sub-models in whole-farm simulations. The mixed regression model was identified as the best model for prediction, as the random effects can be ignored and the other variables can be easily obtained or simulated.
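The discrimination and calibration measures named above can be written out explicitly. In the sketch below, the confusion-matrix metrics follow their standard definitions, while the calibration error is one common choice (a count-weighted mean absolute gap between predicted and observed rates per bin) and may differ from the authors' exact definition. All example outcomes and bins are hypothetical; only the 1396 / 597 split comes from the abstract.

```python
def discrimination(y_true, y_pred):
    """Sensitivity, specificity and accuracy from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(y_true),
    }

def mean_abs_calibration_error(bins):
    """bins: list of (mean predicted probability, observed rate, count)."""
    total = sum(count for _, _, count in bins)
    return sum(count * abs(pred - obs) for pred, obs, count in bins) / total

# Hypothetical services: the model predicts conception almost every time.
observed = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
metrics = discrimination(observed, predicted)  # high sensitivity, low specificity

calibration = mean_abs_calibration_error(
    [(0.6, 0.58, 100), (0.8, 0.83, 200), (0.9, 0.88, 300)]
)

n_train, n_test = 1396, 597          # split sizes reported in the abstract
test_share = n_test / (n_train + n_test)
```

The example reproduces the pattern described in the abstract: a model can reach reasonable accuracy while rarely identifying unsuccessful services.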
Estimating the exceedance probability of rain rate by logistic regression
NASA Technical Reports Server (NTRS)
Chiu, Long S.; Kedem, Benjamin
1990-01-01
Recent studies have shown that the fraction of an area with rain intensity above a fixed threshold is highly correlated with the area-averaged rain rate. To estimate the fractional rainy area, a logistic regression model, which estimates the conditional probability that rain rate over an area exceeds a fixed threshold given the values of related covariates, is developed. The problem of dependency in the data in the estimation procedure is bypassed by the method of partial likelihood. Analyses of simulated scanning multichannel microwave radiometer and observed electrically scanning microwave radiometer data during the Global Atlantic Tropical Experiment period show that the use of logistic regression in pixel classification is superior to multiple regression in predicting whether rain rate at each pixel exceeds a given threshold, even in the presence of noisy data. The potential of the logistic regression technique in satellite rain rate estimation is discussed.
Grieve, Richard; Nixon, Richard; Thompson, Simon G
2010-01-01
Cost-effectiveness analyses (CEA) may be undertaken alongside cluster randomized trials (CRTs) where randomization is at the level of the cluster (for example, the hospital or primary care provider) rather than the individual. Costs (and outcomes) within clusters may be correlated so that the assumption made by standard bivariate regression models, that observations are independent, is incorrect. This study develops a flexible modeling framework to acknowledge the clustering in CEA that use CRTs. The authors extend previous Bayesian bivariate models for CEA of multicenter trials to recognize the specific form of clustering in CRTs. They develop new Bayesian hierarchical models (BHMs) that allow mean costs and outcomes, and also variances, to differ across clusters. They illustrate how each model can be applied using data from a large (1732 cases, 70 primary care providers) CRT evaluating alternative interventions for reducing postnatal depression. The analyses compare cost-effectiveness estimates from BHMs with standard bivariate regression models that ignore the data hierarchy. The BHMs show high levels of cost heterogeneity across clusters (intracluster correlation coefficient, 0.17). Compared with standard regression models, the BHMs yield substantially increased uncertainty surrounding the cost-effectiveness estimates, and altered point estimates. The authors conclude that ignoring clustering can lead to incorrect inferences. The BHMs that they present offer a flexible modeling framework that can be applied more generally to CEA that use CRTs.
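A rough sense of why clustering inflates uncertainty comes from the classical design effect, 1 + (m - 1) * ICC, where m is the cluster size. The sketch below applies it to the figures in the abstract (1732 cases, 70 providers, cost ICC of 0.17). It assumes equal cluster sizes, which the trial need not satisfy, and it is a back-of-envelope check rather than a substitute for the hierarchical models described.

```python
def design_effect(mean_cluster_size, icc):
    """Variance inflation from clustering, assuming equal cluster sizes."""
    return 1.0 + (mean_cluster_size - 1.0) * icc

n_cases, n_clusters, icc_cost = 1732, 70, 0.17  # figures from the abstract
mean_size = n_cases / n_clusters                # about 24.7 cases per provider
deff = design_effect(mean_size, icc_cost)       # about 5.0
effective_n = n_cases / deff                    # about 344 independent-equivalent cases
```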
Monthly streamflow forecasting based on hidden Markov model and Gaussian Mixture Regression
NASA Astrophysics Data System (ADS)
Liu, Yongqi; Ye, Lei; Qin, Hui; Hong, Xiaofeng; Ye, Jiajun; Yin, Xingli
2018-06-01
Reliable streamflow forecasts can be highly valuable for water resources planning and management. In this study, we combined a hidden Markov model (HMM) and Gaussian Mixture Regression (GMR) for probabilistic monthly streamflow forecasting. The HMM is initialized using a kernelized K-medoids clustering method, and the Baum-Welch algorithm is then executed to learn the model parameters. GMR derives a conditional probability distribution for the predictand given covariate information, including the antecedent flow at a local station and two surrounding stations. The performance of HMM-GMR was verified based on the mean square error and continuous ranked probability score skill scores. The reliability of the forecasts was assessed by examining the uniformity of the probability integral transform values. The results show that HMM-GMR obtained reasonably high skill scores and the uncertainty spread was appropriate. Different HMM states were assumed to be different climate conditions, which would lead to different types of observed values. We demonstrated that the HMM-GMR approach can handle multimodal and heteroscedastic data.
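The regression step of Gaussian Mixture Regression can be sketched for a single covariate. Each mixture component contributes its own linear conditional mean, and the components are weighted by how well they explain the observed covariate. The two components below are hypothetical, and the sketch omits the HMM state model and the conditional variance that the full approach would also provide.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def gmr_predict(x, components):
    """Conditional mean E[y | x] under a bivariate Gaussian mixture.

    components: list of (weight, mean_x, mean_y, var_x, cov_xy).
    """
    resp = [w * normal_pdf(x, mx, vx) for w, mx, _, vx, _ in components]
    total = sum(resp)
    return sum(
        (r / total) * (my + cxy / vx * (x - mx))
        for r, (_, mx, my, vx, cxy) in zip(resp, components)
    )

# Two hypothetical flow regimes: (weight, mean_x, mean_y, var_x, cov_xy).
regimes = [(0.5, 0.0, 0.0, 1.0, 0.5), (0.5, 10.0, 20.0, 1.0, 0.5)]
low_flow_forecast = gmr_predict(1.0, regimes)    # dominated by the first regime
high_flow_forecast = gmr_predict(10.0, regimes)  # dominated by the second regime
```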
Quantification of brain lipids by FTIR spectroscopy and partial least squares regression
NASA Astrophysics Data System (ADS)
Dreissig, Isabell; Machill, Susanne; Salzer, Reiner; Krafft, Christoph
2009-01-01
Brain tissue is characterized by a high lipid content. This content decreases and the lipid composition changes during transformation from normal brain tissue to tumors. Therefore, the analysis of brain lipids might complement the existing diagnostic tools to determine the tumor type and tumor grade. The objective of this work is to extract lipids from gray matter and white matter of porcine brain tissue, record infrared (IR) spectra of these extracts and develop a quantification model for the main lipids based on partial least squares (PLS) regression. IR spectra of the pure lipids cholesterol, cholesterol ester, phosphatidic acid, phosphatidylcholine, phosphatidylethanolamine, phosphatidylserine, phosphatidylinositol, sphingomyelin, galactocerebroside and sulfatide were used as references. Two lipid mixtures were prepared for training and validation of the quantification model. The composition of lipid extracts that was predicted by the PLS regression of IR spectra was compared with lipid quantification by thin layer chromatography.
Father and adolescent son variables related to son's HIV prevention.
Glenn, Betty L; Demi, Alice; Kimble, Laura P
2008-02-01
The purpose of this study was to examine the relationship between fathers' influences and African American male adolescents' perceptions of self-efficacy to reduce high-risk sexual behavior. A convenience sample of 70 fathers was recruited from churches in a large metropolitan area in the South. Hierarchical multiple linear regression analysis indicated father-related factors and son-related factors were associated with 26.1% of the variance in son's self-efficacy to be abstinent. In the regression model greater son's perception of the communication of sexual standards and greater father's perception of his son's self-efficacy were significantly related to greater son's self-efficacy for abstinence. The second regression model with son's self-efficacy for safer sex as the criterion was not statistically significant. Data support the need for fathers to express confidence in their sons' ability to be abstinent or practice safer sex and to communicate with their sons regarding sexual issues and standards.
Chouabe, C; Amsellem, J; Espinosa, L; Ribaux, P; Blaineau, S; Mégas, P; Bonvallet, R
2002-04-01
Recent studies indicate that regression of left ventricular hypertrophy normalizes membrane ionic current abnormalities. This work was designed to determine whether regression of right ventricular hypertrophy induced by permanent high-altitude exposure (4,500 m, 20 days) in adult rats also normalizes changes of ventricular myocyte electrophysiology. According to the current data, prolonged action potential, decreased transient outward current density, and increased inward sodium/calcium exchange current density normalized 20 days after the end of altitude exposure, whereas right ventricular hypertrophy evidenced by both the right ventricular weight-to-heart weight ratio and the right ventricular free wall thickness measurement normalized 40 days after the end of altitude exposure. This morphological normalization occurred at both the level of muscular tissue, as shown by the decrease toward control values of some myocyte parameters (perimeter, capacitance, and width), and the level of the interstitial collagenous connective tissue. In the chronic high-altitude hypoxia model, the regression of right ventricular hypertrophy would not be a prerequisite for normalization of ventricular electrophysiological abnormalities.
Modeling absolute differences in life expectancy with a censored skew-normal regression approach
Clough-Gorr, Kerri; Zwahlen, Marcel
2015-01-01
Parameter estimates from commonly used multivariable parametric survival regression models do not directly quantify differences in years of life expectancy. Gaussian linear regression models give results in terms of absolute mean differences, but are not appropriate for modeling life expectancy, because in many situations time to death has a negatively skewed distribution. A regression approach using a skew-normal distribution would be an alternative to parametric survival models in the modeling of life expectancy, because parameter estimates can be interpreted in terms of survival time differences while allowing for skewness of the distribution. In this paper we show how to use skew-normal regression so that censored and left-truncated observations are accounted for. With this we model differences in life expectancy using data from the Swiss National Cohort Study and from official life expectancy estimates and compare the results with those derived from commonly used survival regression models. We conclude that a censored skew-normal survival regression approach for left-truncated observations can be used to model differences in life expectancy across covariates of interest. PMID:26339544
Sarzynski, Mark A; Schuna, John M; Carnethon, Mercedes R; Jacobs, David R; Lewis, Cora E; Quesenberry, Charles P; Sidney, Stephen; Schreiner, Pamela J; Sternfeld, Barbara
2015-11-01
Few studies have examined the longitudinal associations of fitness or changes in fitness on the risk of developing dyslipidemias. This study examined the associations of (1) baseline fitness with 25-year dyslipidemia incidence and (2) 20-year fitness change on dyslipidemia development in middle age in the Coronary Artery Risk Development in Young Adults Study (CARDIA). Multivariable Cox proportional hazards regression models were used to test the association of baseline fitness (1985-1986) with dyslipidemia incidence over 25 years (2010-2011) in CARDIA (N=4,898). Modified Poisson regression models were used to examine the association of 20-year change in fitness with dyslipidemia incidence between Years 20 and 25 (n=2,487). Data were analyzed in June 2014 and February 2015. In adjusted models, the risk of incident low high-density lipoprotein cholesterol (HDL-C); high triglycerides; and high low-density lipoprotein cholesterol (LDL-C) was significantly lower, by 9%, 16%, and 14%, respectively, for each 2.0-minute increase in baseline treadmill endurance. After additional adjustment for baseline trait level, the associations remained significant for incident high triglycerides and high LDL-C in the total population and for incident high triglycerides in both men and women. In race-stratified models, these associations appeared to be limited to whites. In adjusted models, change in fitness did not predict 5-year incidence of dyslipidemias, whereas baseline fitness significantly predicted 5-year incidence of high triglycerides. Our findings demonstrate the importance of cardiorespiratory fitness in young adulthood as a risk factor for developing dyslipidemias, particularly high triglycerides, during the transition to middle age. Copyright © 2015 American Journal of Preventive Medicine. Published by Elsevier Inc. All rights reserved.
Sarzynski, Mark A.; Schuna, John M.; Carnethon, Mercedes R.; Jacobs, David R.; Lewis, Cora E.; Quesenberry, Charles P.; Sidney, Stephen; Schreiner, Pamela J.; Sternfeld, Barbara
2015-01-01
Introduction Few studies have examined the longitudinal associations of fitness or changes in fitness on the risk of developing dyslipidemias. This study examined the associations of: (1) baseline fitness with 25-year dyslipidemia incidence; and (2) 20-year fitness change on dyslipidemia development in middle age in the Coronary Artery Risk Development in young Adults (CARDIA) study. Methods Multivariable Cox proportional hazards regression models were used to test the association of baseline fitness (1985–1986) with dyslipidemia incidence over 25 years (2010–2011) in CARDIA (N=4,898). Modified Poisson regression models were used to examine the association of 20-year change in fitness with dyslipidemia incidence between Years 20 and 25 (n=2,487). Data were analyzed in June 2014 and February 2015. Results In adjusted models, the risk of incident low high-density lipoprotein cholesterol (HDL-C), high triglycerides, and high low-density lipoprotein cholesterol (LDL-C) was significantly lower, by 9%, 16%, and 14%, respectively, for each 2.0-minute increase in baseline treadmill endurance. After additional adjustment for baseline trait level, the associations remained significant for incident high triglycerides and high LDL-C in the total population and for incident high triglycerides in both men and women. In race-stratified models, these associations appeared to be limited to whites. In adjusted models, change in fitness did not predict 5-year incidence of dyslipidemias, whereas baseline fitness significantly predicted 5-year incidence of high triglycerides. Conclusions Our findings demonstrate the importance of cardiorespiratory fitness in young adulthood as a risk factor for developing dyslipidemias, particularly high triglycerides, during the transition to middle age. PMID:26165197
ERIC Educational Resources Information Center
Charles, Pajarita; Jones, Anne; Guo, Shenyang
2014-01-01
Objective: The purpose of the present study was to evaluate the treatment effects of a relationship skills and family strengthening intervention for n = 726 high-risk, disadvantaged new parents. Method: Hierarchical linear modeling and regression models were used to assess intervention treatment effects. These findings were subsequently verified…
Modeling current climate conditions for forest pest risk assessment
Frank H. Koch; John W. Coulston
2010-01-01
Current information on broad-scale climatic conditions is essential for assessing potential distribution of forest pests. At present, sophisticated spatial interpolation approaches such as the Parameter-elevation Regressions on Independent Slopes Model (PRISM) are used to create high-resolution climatic data sets. Unfortunately, these data sets are based on 30-year...
An Ionospheric Index Model based on Linear Regression and Neural Network Approaches
NASA Astrophysics Data System (ADS)
Tshisaphungo, Mpho; McKinnell, Lee-Anne; Bosco Habarulema, John
2017-04-01
The ionosphere is well known to reflect radio wave signals in the high frequency (HF) band due to the presence of electrons and ions within the region. To optimise the use of long distance HF communications, it is important to understand the drivers of ionospheric storms and accurately predict the propagation conditions, especially during disturbed days. This paper presents the development of an ionospheric storm-time index over the South African region for HF communication users. The model will result in a valuable tool to measure the complex ionospheric behaviour in an operational space weather monitoring and forecasting environment. The development of the ionospheric storm-time index is based on data from a single ionosonde station at Grahamstown (33.3°S, 26.5°E), South Africa. Critical frequency of the F2 layer (foF2) measurements for the period 1996-2014 were considered for this study. The model was developed based on linear regression and neural network approaches. In this talk, validation results for low, medium and high solar activity periods will be discussed to demonstrate the model's performance.
Allegrini, Franco; Braga, Jez W B; Moreira, Alessandro C O; Olivieri, Alejandro C
2018-06-29
A new multivariate regression model, named Error Covariance Penalized Regression (ECPR) is presented. Following a penalized regression strategy, the proposed model incorporates information about the measurement error structure of the system, using the error covariance matrix (ECM) as a penalization term. Results are reported from both simulations and experimental data based on replicate mid and near infrared (MIR and NIR) spectral measurements. The results for ECPR are better under non-iid conditions when compared with traditional first-order multivariate methods such as ridge regression (RR), principal component regression (PCR) and partial least-squares regression (PLS). Copyright © 2018 Elsevier B.V. All rights reserved.
Association between Personality Traits and Sleep Quality in Young Korean Women
Kim, Han-Na; Cho, Juhee; Chang, Yoosoo; Ryu, Seungho
2015-01-01
Personality is a trait that affects behavior and lifestyle, and sleep quality is an important component of a healthy life. We analyzed the association between personality traits and sleep quality in a cross-section of 1,406 young women (from 18 to 40 years of age) who were not reporting clinically meaningful depression symptoms. Surveys were carried out from December 2011 to February 2012, using the Revised NEO Personality Inventory and the Pittsburgh Sleep Quality Index (PSQI). All analyses were adjusted for demographic and behavioral variables. We considered beta weights, structure coefficients, unique effects, and common effects when evaluating the importance of sleep quality predictors in multiple linear regression models. Neuroticism was the most important contributor to PSQI global scores in the multiple regression models. By contrast, despite being strongly correlated with sleep quality, conscientiousness had a near-zero beta weight in linear regression models, because most variance was shared with other personality traits. However, conscientiousness was the most noteworthy predictor of poor sleep quality status (PSQI≥6) in logistic regression models: individuals high in conscientiousness were least likely to have poor sleep quality (OR 0.813), indicating that conscientiousness was protective against poor sleep quality. Personality may be a factor in poor sleep quality and should be considered in sleep interventions targeting young women. PMID:26030141
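As a reading aid for the odds ratio quoted in this abstract, the following minimal sketch shows how an odds ratio relates to a logistic regression coefficient. The function names are illustrative, the coefficient is back-computed from the reported OR of 0.813 rather than taken from the paper, and the unit of conscientiousness that the OR refers to is not stated in the abstract.

```python
import math

def odds_ratio(beta):
    """Odds ratio for a one-unit increase in a predictor with logit coefficient beta."""
    return math.exp(beta)

def coefficient_from_or(or_value):
    """Inverse mapping: the logit coefficient implied by a reported odds ratio."""
    return math.log(or_value)

# Back-computed from the OR quoted in the abstract (not a value reported by the authors).
beta = coefficient_from_or(0.813)

# A negative coefficient (OR below 1) means lower odds of the outcome per unit increase.
is_protective = beta < 0
print(beta, is_protective)
```

An OR below 1 corresponds to a negative coefficient, which is why the abstract can describe the trait as protective.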
Unconditional or Conditional Logistic Regression Model for Age-Matched Case-Control Data?
Kuo, Chia-Ling; Duan, Yinghui; Grady, James
2018-01-01
Matching on demographic variables is commonly used in case-control studies to adjust for confounding at the design stage. There is a presumption that matched data need to be analyzed by matched methods. Conditional logistic regression has become a standard for matched case-control data to tackle the sparse data problem. The sparse data problem, however, may not be a concern for loose-matching data when the matching between cases and controls is not unique, and one case can be matched to other controls without substantially changing the association. Data matched on a few demographic variables are clearly loose-matching data, and we hypothesize that unconditional logistic regression is a proper method to perform. To address the hypothesis, we compare unconditional and conditional logistic regression models by precision in estimates and hypothesis testing using simulated matched case-control data. Our results support our hypothesis; however, the unconditional model is not as robust as the conditional model to the matching distortion that the matching process not only makes cases and controls similar for matching variables but also for the exposure status. When the study design involves other complex features or the computational burden is high, matching in loose-matching data can be ignored for negligible loss in testing and estimation if the distributions of matching variables are not extremely different between cases and controls.
Unconditional or Conditional Logistic Regression Model for Age-Matched Case–Control Data?
Kuo, Chia-Ling; Duan, Yinghui; Grady, James
2018-01-01
Matching on demographic variables is commonly used in case–control studies to adjust for confounding at the design stage. There is a presumption that matched data need to be analyzed by matched methods. Conditional logistic regression has become a standard for matched case–control data to tackle the sparse data problem. The sparse data problem, however, may not be a concern for loose-matching data when the matching between cases and controls is not unique, and one case can be matched to other controls without substantially changing the association. Data matched on a few demographic variables are clearly loose-matching data, and we hypothesize that unconditional logistic regression is a proper method to perform. To address the hypothesis, we compare unconditional and conditional logistic regression models by precision in estimates and hypothesis testing using simulated matched case–control data. Our results support our hypothesis; however, the unconditional model is not as robust as the conditional model to the matching distortion that the matching process not only makes cases and controls similar for matching variables but also for the exposure status. When the study design involves other complex features or the computational burden is high, matching in loose-matching data can be ignored for negligible loss in testing and estimation if the distributions of matching variables are not extremely different between cases and controls. PMID:29552553
Learning Supervised Topic Models for Classification and Regression from Crowds.
Rodrigues, Filipe; Lourenco, Mariana; Ribeiro, Bernardete; Pereira, Francisco C
2017-12-01
The growing need to analyze large collections of documents has led to great developments in topic modeling. Since documents are frequently associated with other related variables, such as labels or ratings, much interest has been placed on supervised topic models. However, the nature of most annotation tasks, which are prone to ambiguity and noise and often involve high volumes of documents, makes learning under a single-annotator assumption unrealistic or impractical for most real-world applications. In this article, we propose two supervised topic models, one for classification and another for regression problems, which account for the heterogeneity and biases among different annotators that are encountered in practice when learning from crowds. We develop an efficient stochastic variational inference algorithm that is able to scale to very large datasets, and we empirically demonstrate the advantages of the proposed model over state-of-the-art approaches.
Electricity Load Forecasting Using Support Vector Regression with Memetic Algorithms
Hu, Zhongyi; Xiong, Tao
2013-01-01
Electricity load forecasting is an important issue that is widely explored and examined in the literature on power systems operation as well as in the literature on commercial transactions in electricity markets. Among the existing forecasting models, support vector regression (SVR) has gained much attention. Because the performance of SVR depends heavily on its parameters, this study proposed a firefly algorithm (FA) based memetic algorithm (FA-MA) to appropriately determine the parameters of the SVR forecasting model. In the proposed FA-MA algorithm, the FA algorithm is applied to explore the solution space, and pattern search is used to conduct individual learning and thus enhance the exploitation of FA. Experimental results confirm that the proposed FA-MA based SVR model can not only yield more accurate forecasting results than the other four evolutionary algorithms based SVR models and three well-known forecasting models but also outperform the hybrid algorithms in the related existing literature. PMID:24459425
Electricity load forecasting using support vector regression with memetic algorithms.
Hu, Zhongyi; Bao, Yukun; Xiong, Tao
2013-01-01
Electricity load forecasting is an important issue that is widely explored and examined in the literature on power systems operation as well as in the literature on commercial transactions in electricity markets. Among the existing forecasting models, support vector regression (SVR) has gained much attention. Because the performance of SVR depends heavily on its parameters, this study proposed a firefly algorithm (FA) based memetic algorithm (FA-MA) to appropriately determine the parameters of the SVR forecasting model. In the proposed FA-MA algorithm, the FA algorithm is applied to explore the solution space, and pattern search is used to conduct individual learning and thus enhance the exploitation of FA. Experimental results confirm that the proposed FA-MA based SVR model can not only yield more accurate forecasting results than the other four evolutionary algorithms based SVR models and three well-known forecasting models but also outperform the hybrid algorithms in the related existing literature.
Developing a Model for Forecasting Road Traffic Accident (RTA) Fatalities in Yemen
NASA Astrophysics Data System (ADS)
Karim, Fareed M. A.; Abdo Saleh, Ali; Taijoobux, Aref; Ševrović, Marko
2017-12-01
The aim of this paper is to develop a model for forecasting RTA fatalities in Yemen. Yearly fatalities were modeled as the dependent variable, while the candidate independent variables were population, number of vehicles, GNP, GDP and real GDP per capita. All these variables were found to be highly correlated (correlation coefficient r ≈ 0.9); in order to avoid multicollinearity in the model, only the single variable with the highest r value (real GDP per capita) was selected. A simple regression model was developed and fit well (R² = 0.916); however, the residuals were serially correlated. The Prais-Winsten procedure was used to overcome this violation of the regression assumption. The data for a 20-year period from 1991-2010 were analyzed to build the model; the model was validated by using data for the years 2011-2013; the historical fit for the period 1991 - 2011 was very good. Also, the validation for 2011-2013 proved accurate.
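To make the serial-correlation step in this abstract concrete, here is a minimal single-pass sketch of a Prais-Winsten style correction for a simple regression, together with the Durbin-Watson statistic often used to detect first-order autocorrelation. The function names are illustrative, the lag-1 estimator of rho is one common choice, and nothing here is taken from the authors' implementation; in practice the step is usually iterated until rho stabilises.

```python
import math

def ols_two_columns(c, x, y):
    """Least squares for y = a*c + b*x (no extra intercept) via the 2x2 normal equations."""
    scc = sum(ci * ci for ci in c)
    scx = sum(ci * xi for ci, xi in zip(c, x))
    sxx = sum(xi * xi for xi in x)
    scy = sum(ci * yi for ci, yi in zip(c, y))
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    det = scc * sxx - scx * scx
    return (scy * sxx - scx * sxy) / det, (scc * sxy - scx * scy) / det

def durbin_watson(resid):
    """Values well below 2 suggest positive first-order autocorrelation."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    return num / sum(e * e for e in resid)

def prais_winsten_step(x, y):
    """One pass: OLS, estimate rho from residuals, refit on transformed data."""
    n = len(y)
    ones = [1.0] * n
    a, b = ols_two_columns(ones, x, y)
    resid = [yi - a - b * xi for xi, yi in zip(x, y)]
    # Lag-1 autocorrelation estimate; this form keeps |rho| below 1.
    rho = sum(resid[t] * resid[t - 1] for t in range(1, n)) / sum(e * e for e in resid)
    w = math.sqrt(1.0 - rho * rho)

    def transform(v):
        # Prais-Winsten keeps the first observation (scaled) instead of dropping it.
        return [w * v[0]] + [v[t] - rho * v[t - 1] for t in range(1, n)]

    a_pw, b_pw = ols_two_columns(transform(ones), transform(x), transform(y))
    return rho, a_pw, b_pw
```

Keeping the rescaled first observation is what distinguishes Prais-Winsten from the Cochrane-Orcutt procedure, which matters for short series such as a 20-year annual record.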
Decline in bloater fecundity in Southern Lake Michigan after decline of Diporeia
Bunnell, D.B.; David, S.R.; Madenjian, C.P.
2009-01-01
Population fecundity can vary through time, sometimes owing to changes in adult condition. Consideration of these fecundity changes can improve understanding of recruitment variation. Herein, we estimated fecundity of Lake Michigan bloater Coregonus hoyi during December 2005 and February 2006. Bloater recruitment has been highly variable from 1962 to present, and consistently poor since 1992. We compared our fecundity vs. weight regression to a previously published regression that used fish sampled in October 1969. We wanted to develop a new regression for two reasons. First, it should be more accurate because it uses fish collected closer to spawning, thus minimizing the potential for atresia (egg reabsorption) which could bias fecundity high. Second, we hypothesized that fecundity would be lower in 2006 because adult condition was 41% lower in 2006 compared to 1969, likely owing to the decline of Diporeia spp, a primary prey for bloater. Although the slope of the fecundity versus weight regression was similar between the years, fecundity was 24% lower in 2006 than in 1969 for bloater weighing between 70 and 240 g. Whether this was the result of the difference in sampling time prior to spawning or of differences in condition is unknown. We also found no relationship between maternal size and mature oocyte size. Incorporating our updated fecundity regression into a stock/recruit model failed to improve the model fit, indicating that the low bloater recruitment that has been observed since the early 1990s is not solely the result of reduced fecundity. © 2008 Elsevier Inc. All rights reserved.
NASA Astrophysics Data System (ADS)
Pandremmenou, K.; Tziortziotis, N.; Paluri, S.; Zhang, W.; Blekas, K.; Kondi, L. P.; Kumar, S.
2015-03-01
We propose the use of the Least Absolute Shrinkage and Selection Operator (LASSO) regression method in order to predict the Cumulative Mean Squared Error (CMSE) incurred by the loss of individual slices in video transmission. We extract a number of quality-relevant features from the H.264/AVC video sequences, which are given as input to the LASSO. This method has the benefit of not only keeping a subset of the features that have the strongest effects on video quality, but also producing accurate CMSE predictions. In particular, we study the LASSO regression through two different architectures: the Global LASSO (G.LASSO) and the Local LASSO (L.LASSO). In G.LASSO, a single regression model is trained for all slice types together, while in L.LASSO, motivated by the fact that the values for some features are closely dependent on the considered slice type, each slice type has its own regression model, in an effort to improve LASSO's prediction capability. Based on the predicted CMSE values, we group the video slices into four priority classes. Additionally, we consider a video transmission scenario over a noisy channel, where Unequal Error Protection (UEP) is applied to all prioritized slices. The provided results demonstrate the efficiency of LASSO in estimating CMSE with high accuracy, using only a few features. files that typically contain high-entropy data, producing a footprint that is far less conspicuous than existing methods. The system uses a local web server to provide a file system, user interface and applications through a web architecture.
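The feature-selection behavior of the LASSO described in this abstract comes from the soft-thresholding operator, which coordinate-descent LASSO solvers apply to each coefficient when the predictors are standardized. The sketch below shows only that operator; the function name and the example inputs are illustrative and are not drawn from the paper.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam and set it exactly to zero when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# Illustrative inputs only: weak effects are zeroed out, strong ones are shrunk.
raw_effects = [3.0, -3.0, 0.5]
shrunk = [soft_threshold(z, 1.0) for z in raw_effects]
print(shrunk)
```

Coefficients that land exactly on zero correspond to features that are dropped, which is how a LASSO model can end up using only a few features.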
Miller, Matthew P.; Johnson, Henry M.; Susong, David D.; Wolock, David M.
2015-01-01
Understanding how watershed characteristics and climate influence the baseflow component of stream discharge is a topic of interest to both the scientific and water management communities. Therefore, the development of baseflow estimation methods is a topic of active research. Previous studies have demonstrated that graphical hydrograph separation (GHS) and conductivity mass balance (CMB) methods can be applied to stream discharge data to estimate daily baseflow. While CMB is generally considered to be a more objective approach than GHS, its application across broad spatial scales is limited by a lack of high frequency specific conductance (SC) data. We propose a new method that uses discrete SC data, which are widely available, to estimate baseflow at a daily time step using the CMB method. The proposed approach involves the development of regression models that relate discrete SC concentrations to stream discharge and time. Regression-derived CMB baseflow estimates were more similar to baseflow estimates obtained using a CMB approach with measured high frequency SC data than were the GHS baseflow estimates at twelve snowmelt dominated streams and rivers. There was a near perfect fit between the regression-derived and measured CMB baseflow estimates at sites where the regression models were able to accurately predict daily SC concentrations. We propose that the regression-derived approach could be applied to estimate baseflow at large numbers of sites, thereby enabling future investigations of watershed and climatic characteristics that influence the baseflow component of stream discharge across large spatial scales.
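The conductivity mass balance idea in this abstract can be sketched with the two-component mixing equation commonly used for CMB, Q_bf = Q (SC - SC_ro) / (SC_bf - SC_ro). The function and parameter names below are illustrative, the end-member values in the example are placeholders rather than data from the study, and clipping the result to the range [0, Q] is an assumption added here for robustness.

```python
def cmb_baseflow(q, sc, sc_bf, sc_ro):
    """Two-component conductivity mass balance.

    q      : total stream discharge for the day
    sc     : specific conductance of the stream that day (measured or regression-predicted)
    sc_bf  : baseflow end-member specific conductance
    sc_ro  : runoff end-member specific conductance
    """
    fraction = (sc - sc_ro) / (sc_bf - sc_ro)
    # Assumption: keep the baseflow fraction physically plausible (between 0 and 1).
    fraction = min(1.0, max(0.0, fraction))
    return q * fraction

# Placeholder numbers for illustration only.
print(cmb_baseflow(q=10.0, sc=300.0, sc_bf=500.0, sc_ro=100.0))
```

In the approach the abstract describes, the daily `sc` input would come from a regression on discharge and time fitted to discrete samples, rather than from a high-frequency sensor.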
ERIC Educational Resources Information Center
Chen, Chau-Kuang
2005-01-01
Logistic and Cox regression methods are practical tools used to model the relationships between certain student learning outcomes and their relevant explanatory variables. The logistic regression model fits an S-shaped curve into a binary outcome with data points of zero and one. The Cox regression model allows investigators to study the duration…
Robust geographically weighted regression of modeling the Air Polluter Standard Index (APSI)
NASA Astrophysics Data System (ADS)
Warsito, Budi; Yasin, Hasbi; Ispriyanti, Dwi; Hoyyi, Abdul
2018-05-01
The Geographically Weighted Regression (GWR) model has been widely applied in many practical fields for exploring the spatial heterogeneity of a regression model. However, this method is inherently not robust to outliers. Outliers commonly exist in data sets and may lead to a distorted estimate of the underlying regression model. One solution for handling outliers in the regression model is to use robust models; the resulting model is called Robust Geographically Weighted Regression (RGWR). This research aims to aid the government in the policy-making process related to air pollution mitigation by developing a standard index model for air polluters (Air Polluter Standard Index - APSI) based on the RGWR approach. In this research, we consider seven variables that are directly related to the air pollution level: traffic velocity, population density, the business center aspect, air humidity, wind velocity, air temperature, and the area of urban forest. The best model is determined by the smallest AIC value. There are significant differences between Regression and RGWR in this case, but basic GWR using the Gaussian kernel is the best model for APSI because it has the smallest AIC.
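Two ingredients mentioned in this abstract are easy to illustrate: the Gaussian kernel that turns distances into local regression weights, and model choice by smallest AIC. The sketch below assumes the common form w = exp(-0.5 (d / h)^2) with bandwidth h; the function names are illustrative and the abstract does not confirm the exact kernel parameterisation the authors used.

```python
import math

def gaussian_kernel_weights(distances, bandwidth):
    """Weight of each observation for a local fit; closer observations count more."""
    return [math.exp(-0.5 * (d / bandwidth) ** 2) for d in distances]

def best_by_aic(aic_by_model):
    """Return the name of the model with the smallest AIC."""
    return min(aic_by_model, key=aic_by_model.get)

# Distances from one regression point to three observations (arbitrary units).
weights = gaussian_kernel_weights([0.0, 1.0, 2.0], bandwidth=1.0)
print(weights)
```

A smaller bandwidth makes the weights fall off faster, so each local regression relies on fewer neighbouring observations.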
Mapping of the DLQI scores to EQ-5D utility values using ordinal logistic regression.
Ali, Faraz Mahmood; Kay, Richard; Finlay, Andrew Y; Piguet, Vincent; Kupfer, Joerg; Dalgard, Florence; Salek, M Sam
2017-11-01
The Dermatology Life Quality Index (DLQI) and the European Quality of Life-5 Dimension (EQ-5D) are separate measures that may be used to gather health-related quality of life (HRQoL) information from patients. The EQ-5D is a generic measure from which health utility estimates can be derived, whereas the DLQI is a specialty-specific measure to assess HRQoL. To reduce the burden of multiple measures being administered and to enable a more disease-specific calculation of health utility estimates, we explored an established mathematical technique known as ordinal logistic regression (OLR) to develop an appropriate model to map DLQI data to EQ-5D-based health utility estimates. Retrospective data from 4010 patients were randomly divided five times into two groups for the derivation and testing of the mapping model. Split-half cross-validation was utilized, resulting in a total of ten ordinal logistic regression models for each of the five EQ-5D dimensions against age, sex, and all ten items of the DLQI. Using Monte Carlo simulation, predicted health utility estimates were derived and compared against those observed. This method was repeated for both OLR and a previously tested mapping methodology based on linear regression. The model was shown to be highly predictive, and its repeated fitting demonstrated a stable model using OLR as well as linear regression. The mean differences between OLR-predicted health utility estimates and observed health utility estimates ranged from 0.0024 to 0.0239 across the ten modeling exercises, with an average overall difference of 0.0120 (a 1.6% underestimate, not of clinical importance). The modeling framework developed in this study will enable researchers to calculate EQ-5D health utility estimates from a specialty-specific study population, reducing patient and economic burden.
Bayesian Unimodal Density Regression for Causal Inference
ERIC Educational Resources Information Center
Karabatsos, George; Walker, Stephen G.
2011-01-01
Karabatsos and Walker (2011) introduced a new Bayesian nonparametric (BNP) regression model. Through analyses of real and simulated data, they showed that the BNP regression model outperforms other parametric and nonparametric regression models of common use, in terms of predictive accuracy of the outcome (dependent) variable. The other,…
Wilderjans, Tom Frans; Vande Gaer, Eva; Kiers, Henk A L; Van Mechelen, Iven; Ceulemans, Eva
2017-03-01
In the behavioral sciences, many research questions pertain to a regression problem in that one wants to predict a criterion on the basis of a number of predictors. Although in many cases ordinary least squares regression will suffice, sometimes the prediction problem is more challenging, for three reasons. First, multiple highly collinear predictors can be available, making it difficult to grasp their mutual relations as well as their relations to the criterion. In that case, it may be very useful to reduce the predictors to a few summary variables, on which one regresses the criterion and which at the same time yield insight into the predictor structure. Second, the population under study may consist of a few unknown subgroups that are characterized by different regression models. Third, the obtained data are often hierarchically structured, with, for instance, observations being nested into persons or participants within groups or countries. Although some methods have been developed that partially meet these challenges (i.e., principal covariates regression (PCovR), clusterwise regression (CR), and structural equation models), none of these methods adequately deals with all of them simultaneously. To fill this gap, we propose the principal covariates clusterwise regression (PCCR) method, which combines the key ideas behind PCovR (de Jong & Kiers in Chemom Intell Lab Syst 14(1-3):155-164, 1992) and CR (Späth in Computing 22(4):367-373, 1979). The PCCR method is validated by means of a simulation study and by applying it to cross-cultural data regarding satisfaction with life.
Comparative evaluation of urban storm water quality models
NASA Astrophysics Data System (ADS)
Vaze, J.; Chiew, Francis H. S.
2003-10-01
The estimation of urban storm water pollutant loads is required for the development of mitigation and management strategies to minimize impacts to receiving environments. Event pollutant loads are typically estimated using either regression equations or "process-based" water quality models. The relative merit of using regression models compared to process-based models is not clear. A modeling study is carried out here to evaluate the comparative ability of the regression equations and process-based water quality models to estimate event diffuse pollutant loads from impervious surfaces. The results indicate that, once calibrated, both the regression equations and the process-based model can estimate event pollutant loads satisfactorily. In fact, the loads estimated using the regression equation as a function of rainfall intensity and runoff rate are better than the loads estimated using the process-based model. Therefore, if only estimates of event loads are required, regression models should be used because they are simpler and require less data compared to process-based models.
NASA Astrophysics Data System (ADS)
Stas, Michiel; Dong, Qinghan; Heremans, Stien; Zhang, Beier; Van Orshoven, Jos
2016-08-01
This paper compares two machine learning techniques to predict regional winter wheat yields. The models, based on Boosted Regression Trees (BRT) and Support Vector Machines (SVM), are constructed of Normalized Difference Vegetation Indices (NDVI) derived from low resolution SPOT VEGETATION satellite imagery. Three types of NDVI-related predictors were used: Single NDVI, Incremental NDVI and Targeted NDVI. BRT and SVM were first used to select features with high relevance for predicting the yield. Although the exact selections differed between the prefectures, certain periods with high influence scores for multiple prefectures could be identified. The same period of high influence stretching from March to June was detected by both machine learning methods. After feature selection, BRT and SVM models were applied to the subset of selected features for actual yield forecasting. Whereas both machine learning methods returned very low prediction errors, BRT seems to slightly but consistently outperform SVM.
Liu, Xiaoyan; Li, Feng; Ding, Yongsheng; Zou, Ting; Wang, Lu; Hao, Kuangrong
2015-01-01
A hierarchical support vector regression (SVR) model (HSVRM) was employed to correlate the compositions and mechanical properties of bicomponent stents composed of poly(lactic-co-glycolic acid) (PGLA) film and poly(glycolic acid) (PGA) fibers for urethral repair for the first time. PGLA film and PGA fibers could provide ureteral stents with good compressive and tensile properties, respectively. In bicomponent stents, high film content led to high stiffness, while high fiber content resulted in poor compressional properties. To simplify the procedures to optimize the ratio of PGLA film and PGA fiber in the stents, a hierarchical support vector regression model (HSVRM) and particle swarm optimization (PSO) algorithm were used to construct relationships between the film-to-fiber weight ratio and the measured compressional/tensile properties of the stents. The experimental data and simulated data fit well, proving that the HSVRM could closely reflect the relationship between the component ratio and performance properties of the ureteral stents. PMID:28793658
Hansson, Lisbeth; Khamis, Harry J
2008-12-01
Simulated data sets are used to evaluate conditional and unconditional maximum likelihood estimation in an individual case-control design with continuous covariates when there are different rates of excluded cases and different levels of other design parameters. The effectiveness of the estimation procedures is measured by method bias, variance of the estimators, root mean square error (RMSE) for logistic regression and the percentage of explained variation. Conditional estimation leads to higher RMSE than unconditional estimation in the presence of missing observations, especially for 1:1 matching. The RMSE is higher for the smaller stratum size, especially for the 1:1 matching. The percentage of explained variation appears to be insensitive to missing data, but is generally higher for the conditional estimation than for the unconditional estimation. It is particularly good for the 1:2 matching design. For minimizing RMSE, a high matching ratio is recommended; in this case, conditional and unconditional logistic regression models yield comparable levels of effectiveness. For maximizing the percentage of explained variation, the 1:2 matching design with the conditional logistic regression model is recommended.
A generalized right truncated bivariate Poisson regression model with applications to health data.
Islam, M Ataharul; Chowdhury, Rafiqul I
2017-01-01
A generalized right truncated bivariate Poisson regression model is proposed in this paper. Estimation and tests for goodness of fit and over or under dispersion are illustrated for both untruncated and right truncated bivariate Poisson regression models using marginal-conditional approach. Estimation and test procedures are illustrated for bivariate Poisson regression models with applications to Health and Retirement Study data on number of health conditions and the number of health care services utilized. The proposed test statistics are easy to compute and it is evident from the results that the models fit the data very well. A comparison between the right truncated and untruncated bivariate Poisson regression models using the test for nonnested models clearly shows that the truncated model performs significantly better than the untruncated model.
A generalized right truncated bivariate Poisson regression model with applications to health data
Islam, M. Ataharul; Chowdhury, Rafiqul I.
2017-01-01
A generalized right truncated bivariate Poisson regression model is proposed in this paper. Estimation and tests for goodness of fit and over or under dispersion are illustrated for both untruncated and right truncated bivariate Poisson regression models using marginal-conditional approach. Estimation and test procedures are illustrated for bivariate Poisson regression models with applications to Health and Retirement Study data on number of health conditions and the number of health care services utilized. The proposed test statistics are easy to compute and it is evident from the results that the models fit the data very well. A comparison between the right truncated and untruncated bivariate Poisson regression models using the test for nonnested models clearly shows that the truncated model performs significantly better than the untruncated model. PMID:28586344
Yang, Xiaowei; Nie, Kun
2008-03-15
Longitudinal data sets in biomedical research often consist of large numbers of repeated measures. In many cases, the trajectories do not look globally linear or polynomial, making it difficult to summarize the data or test hypotheses using standard longitudinal data analysis based on various linear models. An alternative approach is to apply the approaches of functional data analysis, which directly target the continuous nonlinear curves underlying discretely sampled repeated measures. For the purposes of data exploration, many functional data analysis strategies have been developed based on various schemes of smoothing, but fewer options are available for making causal inferences regarding predictor-outcome relationships, a common task seen in hypothesis-driven medical studies. To compare groups of curves, two testing strategies with good power have been proposed for high-dimensional analysis of variance: the Fourier-based adaptive Neyman test and the wavelet-based thresholding test. Using a smoking cessation clinical trial data set, this paper demonstrates how to extend the strategies for hypothesis testing into the framework of functional linear regression models (FLRMs) with continuous functional responses and categorical or continuous scalar predictors. The analysis procedure consists of three steps: first, apply the Fourier or wavelet transform to the original repeated measures; then fit a multivariate linear model in the transformed domain; and finally, test the regression coefficients using either adaptive Neyman or thresholding statistics. Since a FLRM can be viewed as a natural extension of the traditional multiple linear regression model, the development of this model and computational tools should enhance the capacity of medical statistics for longitudinal data.
ERIC Educational Resources Information Center
Lichtenberger, Eric; George-Jackson, Casey
2013-01-01
This study examined how various individual, family, and school level contextual factors impact the likelihood of planning to major in one of the science, technology, engineering, or mathematics (STEM) fields for high school students. A binary logistic regression model was developed to determine the extent to which each of the covariates helped to…
A Technique of Fuzzy C-Mean in Multiple Linear Regression Model toward Paddy Yield
NASA Astrophysics Data System (ADS)
Syazwan Wahab, Nur; Saifullah Rusiman, Mohd; Mohamad, Mahathir; Amira Azmi, Nur; Che Him, Norziha; Ghazali Kamardan, M.; Ali, Maselan
2018-04-01
In this paper, we propose a hybrid model that combines a multiple linear regression model with the fuzzy c-means method. This research involved the relationship between 20 variates of the topsoil, analyzed prior to planting, and paddy yields at standard fertilizer rates. The data came from multi-location rice trials carried out by MARDI at major paddy granaries in Peninsular Malaysia from 2009 to 2012. Missing observations were estimated using mean estimation techniques. The data were analyzed using a multiple linear regression model alone and in combination with the fuzzy c-means method. Analyses of normality and multicollinearity indicate that the data are normally distributed, without multicollinearity among the independent variables. Fuzzy c-means analysis clusters the paddy yield into two clusters before the multiple linear regression model is applied. A comparison of the two methods indicates that the hybrid of multiple linear regression and fuzzy c-means outperforms the multiple linear regression model alone, with a lower mean square error.
Spatial Assessment of Model Errors from Four Regression Techniques
Lianjun Zhang; Jeffrey H. Gove; Jeffrey H. Gove
2005-01-01
Forest modelers have attempted to account for the spatial autocorrelations among trees in growth and yield models by applying alternative regression techniques such as linear mixed models (LMM), generalized additive models (GAM), and geographically weighted regression (GWR). However, the model errors are commonly assessed using average errors across the entire study...
Svensson, Fredrik; Aniceto, Natalia; Norinder, Ulf; Cortes-Ciriano, Isidro; Spjuth, Ola; Carlsson, Lars; Bender, Andreas
2018-05-29
Making predictions with an associated confidence is highly desirable as it facilitates decision making and resource prioritization. Conformal regression is a machine learning framework that allows the user to define the required confidence and delivers predictions that are guaranteed to be correct to the selected extent. In this study, we apply conformal regression to model molecular properties and bioactivity values and investigate different ways to scale the resultant prediction intervals to create as efficient (i.e., narrow) regressors as possible. Different algorithms to estimate the prediction uncertainty were used to normalize the prediction ranges, and the different approaches were evaluated on 29 publicly available data sets. Our results show that the most efficient conformal regressors are obtained when using the natural exponential of the ensemble standard deviation from the underlying random forest to scale the prediction intervals, but other approaches were almost as efficient. This approach afforded an average prediction range of 1.65 pIC50 units at the 80% confidence level when applied to bioactivity modeling. The choice of nonconformity function has a pronounced impact on the average prediction range with a difference of close to one log unit in bioactivity between the tightest and widest prediction range. Overall, conformal regression is a robust approach to generate bioactivity predictions with associated confidence.
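The interval-scaling idea in the abstract above can be sketched with a split (inductive) conformal procedure. This is a minimal plain-Python illustration, not the authors' implementation: the function names, the toy numbers, and the choice of the ceil((n+1)*confidence)-th smallest calibration score as the threshold are assumptions, and `sigma` merely stands in for an ensemble standard deviation such as one taken from a random forest.

```python
import math

def conformal_quantile(scores, confidence):
    """Return the ceil((n + 1) * confidence)-th smallest calibration score.

    If that rank exceeds n, there are too few calibration points to
    support the requested confidence, so the interval is unbounded.
    """
    n = len(scores)
    k = math.ceil((n + 1) * confidence)
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]

def calibrate(y_true, y_pred, sigma, confidence):
    """Compute the threshold from normalized nonconformity scores.

    Each absolute residual is divided by exp(sigma), so that examples
    with a larger estimated uncertainty receive wider intervals later.
    """
    scores = [abs(y - p) / math.exp(s) for y, p, s in zip(y_true, y_pred, sigma)]
    return conformal_quantile(scores, confidence)

def interval(pred, sigma, q):
    """Prediction interval: pred +/- q * exp(sigma)."""
    half_width = q * math.exp(sigma)
    return pred - half_width, pred + half_width

# Hypothetical calibration data (illustrative values only).
cal_true = [5.0, 6.0, 7.0, 8.0]
cal_pred = [5.5, 6.0, 6.0, 8.2]
cal_sigma = [0.0, 0.0, 0.0, 0.0]

q80 = calibrate(cal_true, cal_pred, cal_sigma, confidence=0.8)
print(interval(7.0, 0.0, q80))
```

With only a handful of calibration points the threshold is dominated by the largest residual; in practice a much larger calibration set would be needed for intervals that are both valid and narrow.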
Can We Use Regression Modeling to Quantify Mean Annual Streamflow at a Global-Scale?
NASA Astrophysics Data System (ADS)
Barbarossa, V.; Huijbregts, M. A. J.; Hendriks, J. A.; Beusen, A.; Clavreul, J.; King, H.; Schipper, A.
2016-12-01
Quantifying mean annual flow of rivers (MAF) at ungauged sites is essential for a number of applications, including assessments of global water supply, ecosystem integrity and water footprints. MAF can be quantified with spatially explicit process-based models, which might be overly time-consuming and data-intensive for this purpose, or with empirical regression models that predict MAF based on climate and catchment characteristics. Yet, regression models have mostly been developed at a regional scale and the extent to which they can be extrapolated to other regions is not known. In this study, we developed a global-scale regression model for MAF using observations of discharge and catchment characteristics from 1,885 catchments worldwide, ranging from 2 to 10^6 km2 in size. In addition, we compared the performance of the regression model with the predictive ability of the spatially explicit global hydrological model PCR-GLOBWB [van Beek et al., 2011] by comparing results from both models to independent measurements. We obtained a regression model explaining 89% of the variance in MAF based on catchment area, mean annual precipitation and air temperature, average slope and elevation. The regression model performed better than PCR-GLOBWB for the prediction of MAF, as root-mean-square error values were lower (0.29 - 0.38 compared to 0.49 - 0.57) and the modified index of agreement was higher (0.80 - 0.83 compared to 0.72 - 0.75). Our regression model can be applied globally at any point of the river network, provided that the input parameters are within the range of values employed in the calibration of the model. The performance is reduced for water scarce regions and further research should focus on improving such an aspect for regression-based global hydrological models.
Pereira, R J; Bignardi, A B; El Faro, L; Verneque, R S; Vercesi Filho, A E; Albuquerque, L G
2013-01-01
Studies investigating the use of random regression models for genetic evaluation of milk production in Zebu cattle are scarce. In this study, 59,744 test-day milk yield records from 7,810 first lactations of purebred dairy Gyr (Bos indicus) and crossbred (dairy Gyr × Holstein) cows were used to compare random regression models in which additive genetic and permanent environmental effects were modeled using orthogonal Legendre polynomials or linear spline functions. Residual variances were modeled considering 1, 5, or 10 classes of days in milk. Five classes fitted the changes in residual variances over the lactation adequately and were used for model comparison. The model that fitted linear spline functions with 6 knots provided the lowest sum of residual variances across lactation. On the other hand, according to the deviance information criterion (DIC) and Bayesian information criterion (BIC), a model using third-order and fourth-order Legendre polynomials for additive genetic and permanent environmental effects, respectively, provided the best fit. However, the high rank correlation (0.998) between this model and that applying third-order Legendre polynomials for additive genetic and permanent environmental effects indicates that, in practice, the same bulls would be selected by both models. The last model, which is less parameterized, is a parsimonious option for fitting dairy Gyr breed test-day milk yield records. Copyright © 2013 American Dairy Science Association. Published by Elsevier Inc. All rights reserved.
Family and Cultural Predictors of Depression among Samoan American Middle and High School Students
ERIC Educational Resources Information Center
Yeh, Christine J.; Borrero, Noah E.; Tito, Patsy
2013-01-01
This study investigated family intergenerational conflict and collective self-esteem as predictors of depression in a sample of 128 Samoan middle and high school students. Simultaneous regression analyses revealed that each independent variable significantly contributed to an overall model that accounted for 13% of the variance in depression.…
ERIC Educational Resources Information Center
Zullig, Keith; Ubbes, Valerie A.; Pyle, Jennifer; Valois, Robert F.
2006-01-01
This study explored the relationships among weight perceptions, dieting behavior, and breakfast eating in 4597 public high school adolescents using the Centers for Disease Control and Prevention Youth Risk Behavior Survey. Adjusted multiple logistic regression models were constructed separately for race and gender groups via SUDAAN (Survey Data…
Developing a predictive tropospheric ozone model for Tabriz
NASA Astrophysics Data System (ADS)
Khatibi, Rahman; Naghipour, Leila; Ghorbani, Mohammad A.; Smith, Michael S.; Karimi, Vahid; Farhoudi, Reza; Delafrouz, Hadi; Arvanaghi, Hadi
2013-04-01
Predictive ozone models are becoming indispensable tools by providing a capability for pollution alerts to serve people who are vulnerable to the risks. We have developed a tropospheric ozone prediction capability for Tabriz, Iran, by using the following five modeling strategies: three regression-type methods: Multiple Linear Regression (MLR), Artificial Neural Networks (ANNs), and Gene Expression Programming (GEP); and two auto-regression-type models: Nonlinear Local Prediction (NLP) to implement chaos theory and Auto-Regressive Integrated Moving Average (ARIMA) models. The regression-type modeling strategies explain the data in terms of: temperature, solar radiation, dew point temperature, and wind speed, by regressing present ozone values to their past values. The ozone time series are available at various time intervals, including hourly intervals, from August 2010 to March 2011. The results for the MLR, ANN and GEP models are not particularly good, but those produced by NLP and ARIMA are promising for establishing a forecasting capability.
High-Dimensional Heteroscedastic Regression with an Application to eQTL Data Analysis
Daye, Z. John; Chen, Jinbo; Li, Hongzhe
2011-01-01
Summary: We consider the problem of high-dimensional regression under non-constant error variances. Despite being a common phenomenon in biological applications, heteroscedasticity has, so far, been largely ignored in high-dimensional analysis of genomic data sets. We propose a new methodology that allows non-constant error variances for high-dimensional estimation and model selection. Our method incorporates heteroscedasticity by simultaneously modeling both the mean and variance components via a novel doubly regularized approach. Extensive Monte Carlo simulations indicate that our proposed procedure can result in better estimation and variable selection than existing methods when heteroscedasticity arises from the presence of predictors explaining error variances and outliers. Further, we demonstrate the presence of heteroscedasticity in and apply our method to an expression quantitative trait loci (eQTLs) study of 112 yeast segregants. The new procedure can automatically account for heteroscedasticity in identifying the eQTLs that are associated with gene expression variations and lead to smaller prediction errors. These results demonstrate the importance of considering heteroscedasticity in eQTL data analysis. PMID:22547833
Unitary Response Regression Models
ERIC Educational Resources Information Center
Lipovetsky, S.
2007-01-01
The dependent variable in a regular linear regression is a numerical variable, and in a logistic regression it is a binary or categorical variable. In these models the dependent variable has varying values. However, there are problems yielding an identity output of a constant value which can also be modelled in a linear or logistic regression with…
QSAR Analysis of 2-Amino or 2-Methyl-1-Substituted Benzimidazoles Against Pseudomonas aeruginosa
Podunavac-Kuzmanović, Sanja O.; Cvetković, Dragoljub D.; Barna, Dijana J.
2009-01-01
A set of benzimidazole derivatives were tested for their inhibitory activities against the Gram-negative bacterium Pseudomonas aeruginosa and minimum inhibitory concentrations were determined for all the compounds. Quantitative structure activity relationship (QSAR) analysis was applied to fourteen of the abovementioned derivatives using a combination of various physicochemical, steric, electronic, and structural molecular descriptors. A multiple linear regression (MLR) procedure was used to model the relationships between molecular descriptors and the antibacterial activity of the benzimidazole derivatives. The stepwise regression method was used to derive the most significant models as a calibration model for predicting the inhibitory activity of this class of molecules. The best QSAR models were further validated by a leave one out technique as well as by the calculation of statistical parameters for the established theoretical models. To confirm the predictive power of the models, an external set of molecules was used. High agreement between experimental and predicted inhibitory values, obtained in the validation procedure, indicated the good quality of the derived QSAR models. PMID:19468332
Real-time quality monitoring in debutanizer column with regression tree and ANFIS
NASA Astrophysics Data System (ADS)
Siddharth, Kumar; Pathak, Amey; Pani, Ajaya Kumar
2018-05-01
A debutanizer column is an integral part of any petroleum refinery. Online composition monitoring of debutanizer column outlet streams is highly desirable in order to maximize the production of liquefied petroleum gas. In this article, data-driven models for debutanizer column are developed for real-time composition monitoring. The dataset used has seven process variables as inputs and the output is the butane concentration in the debutanizer column bottom product. The input-output dataset is divided equally into a training (calibration) set and a validation (testing) set. The training set data were used to develop fuzzy inference, adaptive neuro fuzzy (ANFIS) and regression tree models for the debutanizer column. The accuracy of the developed models were evaluated by simulation of the models with the validation dataset. It is observed that the ANFIS model has better estimation accuracy than other models developed in this work and many data-driven models proposed so far in the literature for the debutanizer column.
[On the effectiveness of the homeopathic remedy Arnica montana].
Lüdtke, Rainer; Hacke, Daniela
2005-11-01
Arnica montana is a homeopathic remedy often prescribed after traumata and injuries. To assess whether Arnica is effective beyond placebo and to identify factors which support or contradict this effectiveness. All prospective, controlled trials on the effectiveness of homeopathic Arnica were included. Overall effectiveness was assessed by meta-analysis and meta-regression techniques. 68 comparisons from 49 clinical trials show a significant effectiveness of Arnica in traumatic injuries in random effects meta-analysis (odds ratio [OR], 0.36; 95% confidence interval [CI], 0.24-0.55), but not in meta-regression models (OR, 0.37; CI, 0.11-1.24). We found no evidence for publication bias. Studies from Medline-listed journals and high-quality studies are less likely to report positive results (p = 0.0006 and p = 0.0167). The hypothesis that homeopathic Arnica is effective could neither be proved nor rejected. All trials were highly heterogeneous, meta-regression does not help to explain this heterogeneity substantially.
The Roles of IL-6, IL-10, and IL-1RA in Obesity and Insulin Resistance in African-Americans
Doumatey, Ayo; Huang, Hanxia; Zhou, Jie; Chen, Guanjie; Shriner, Daniel; Adeyemo, Adebowale
2011-01-01
Objective: The aim of the study was to investigate the associations between IL-1 receptor antagonist (IL-1RA), IL-6, IL-10, measures of obesity, and insulin resistance in African-Americans. Research Design and Methods: Nondiabetic participants (n = 1025) of the Howard University Family Study were investigated for associations between serum IL (IL-1RA, IL-6, IL-10), measures of obesity, and insulin resistance, with adjustment for age and sex. Measures of obesity included body mass index, waist circumference, hip circumference, waist-to-hip ratio, and percent fat mass. Insulin resistance was assessed using the homeostasis model assessment of insulin resistance (HOMA-IR). Data were analyzed with R statistical software using linear regression and likelihood ratio tests. Results: IL-1RA and IL-6 were associated with measures of obesity and insulin resistance, explaining 4–12.7% of the variance observed (P values < 0.001). IL-1RA was bimodally distributed and therefore was analyzed based on grouping those with low vs. high IL-1RA levels. High IL-1RA explained up to 20 and 12% of the variance in measures of obesity and HOMA-IR, respectively. Among the IL, only high IL-1RA improved the fit of models regressing HOMA-IR on measures of obesity. In contrast, all measures of obesity improved the fit of models regressing HOMA-IR on IL. IL-10 was not associated with obesity measures or HOMA-IR. Conclusions: High IL-1RA levels and obesity measures are associated with HOMA-IR in this population-based sample of African-Americans. The results suggest that obesity and increased levels of IL-1RA both contribute to the development of insulin resistance. PMID:21956416
NASA Astrophysics Data System (ADS)
Suhandy, D.; Yulia, M.; Ogawa, Y.; Kondo, N.
2018-05-01
In the present research, we evaluated near infrared (NIR) spectroscopy in tandem with full spectrum partial least squares (FS-PLS) regression for quantifying the degree of adulteration in civet coffee. A total of 126 ground roasted coffee samples with degrees of adulteration of 0-51% were prepared. Spectral data were acquired using a NIR spectrometer equipped with an integrating sphere for diffuse reflectance measurement in the range of 1300-2500 nm. The samples were divided into two groups: a calibration sample set (84 samples) and a prediction sample set (42 samples). The calibration model was developed on the original spectra using FS-PLS regression with the full cross-validation method. The calibration model exhibited a determination coefficient of R2=0.96 for calibration and R2=0.92 for validation. The prediction resulted in a low root mean square error of prediction (RMSEP) (4.67%) and a high ratio of prediction to deviation (RPD) (3.75). In conclusion, the degree of adulteration in civet coffee has been quantified successfully using NIR spectroscopy and FS-PLS regression, a non-destructive, economical, precise, and highly sensitive method that requires very simple sample preparation.
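The two prediction metrics quoted in the abstract above can be computed as in the following sketch. It assumes the commonly used definitions, which the abstract itself does not spell out: RMSEP as the root of the mean squared prediction error over the prediction set, and RPD as the sample standard deviation of the reference values divided by RMSEP. The function names and the toy numbers are hypothetical and do not reproduce the reported values.

```python
import math
from statistics import stdev

def rmsep(reference, predicted):
    """Root mean square error of prediction over a prediction set."""
    n = len(reference)
    squared_errors = sum((p - r) ** 2 for r, p in zip(reference, predicted))
    return math.sqrt(squared_errors / n)

def rpd(reference, predicted):
    """Ratio of the reference values' sample standard deviation to RMSEP.

    Assumption: the sample (n - 1) standard deviation is used; some
    authors use the population form instead.
    """
    return stdev(reference) / rmsep(reference, predicted)

# Hypothetical adulteration levels (%) and model predictions.
reference = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]

print(round(rmsep(reference, predicted), 3))
print(round(rpd(reference, predicted), 3))
```

A higher RPD means the prediction error is small relative to the natural spread of the reference values, which is why it is reported alongside RMSEP.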
Modeling the spatio-temporal heterogeneity in the PM10-PM2.5 relationship
NASA Astrophysics Data System (ADS)
Chu, Hone-Jay; Huang, Bo; Lin, Chuan-Yao
2015-02-01
This paper explores the spatio-temporal patterns of particulate matter (PM) in Taiwan based on a series of methods. Using fuzzy c-means clustering first, the spatial heterogeneity (six clusters) in the PM data collected between 2005 and 2009 in Taiwan are identified and the industrial and urban areas of Taiwan (southwestern, west central, northwestern, and northern Taiwan) are found to have high PM concentrations. The PM10-PM2.5 relationship is then modeled with global ordinary least squares regression, geographically weighted regression (GWR), and geographically and temporally weighted regression (GTWR). The GTWR and GWR produce consistent results; however, GTWR provides more detailed information of spatio-temporal variations of the PM10-PM2.5 relationship. The results also show that GTWR provides a relatively high goodness of fit and sufficient space-time explanatory power. In particular, the PM2.5 or PM10 varies with time and space, depending on weather conditions and the spatial distribution of land use and emission patterns in local areas. Such information can be used to determine patterns of spatio-temporal heterogeneity in PM that will allow the control of pollutants and the reduction of public exposure.
[From clinical judgment to linear regression model].
Palacios-Cruz, Lino; Pérez, Marcela; Rivas-Ruiz, Rodolfo; Talavera, Juan O
2013-01-01
When we think about mathematical models, such as the linear regression model, we tend to think that these terms are used only by those engaged in research, a notion that is far from the truth. Legendre described the first mathematical model in 1805, and Galton introduced the formal term in 1886. Linear regression is one of the most commonly used regression models in clinical practice. It is useful to predict or show the relationship between two or more variables as long as the dependent variable is quantitative and has a normal distribution. Stated another way, regression is used to predict a measure based on the knowledge of at least one other variable. The first objective of linear regression is to determine the slope or inclination of the regression line: Y = a + bX, where "a" is the intercept or regression constant, equivalent to the value of "Y" when "X" equals 0, and "b" (also called the slope) indicates the increase or decrease in "Y" that occurs when "X" increases or decreases by one unit. In the regression line, "b" is called the regression coefficient. The coefficient of determination (R2) indicates the importance of the independent variables in the outcome.
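The quantities described in the abstract above (intercept "a", slope "b", and the coefficient of determination) can be computed for a single predictor with ordinary least squares. The sketch below is a minimal plain-Python illustration; the function name and the toy data are hypothetical and are not taken from the article.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares fit of Y = a + b*X.

    Returns (a, b, r2): the intercept, the slope, and the coefficient
    of determination of the fitted line.
    """
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx              # change in Y per one-unit change in X
    a = y_bar - b * x_bar      # value of Y when X equals 0
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    r2 = 1.0 - ss_res / ss_tot
    return a, b, r2

# Hypothetical data that lie exactly on the line Y = 1 + 2X.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
a, b, r2 = fit_line(xs, ys)
print(a, b, r2)
```

Because the toy points fall exactly on a line, the fit recovers an intercept of 1 and a slope of 2 with no residual error; real clinical data would give a coefficient of determination below 1.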
Synoptic and meteorological drivers of extreme ozone concentrations over Europe
NASA Astrophysics Data System (ADS)
Otero, Noelia Felipe; Sillmann, Jana; Schnell, Jordan L.; Rust, Henning W.; Butler, Tim
2016-04-01
The present work assesses the relationship between local and synoptic meteorological conditions and surface ozone concentration over Europe in spring and summer months, during the period 1998-2012 using a new interpolated data set of observed surface ozone concentrations over the European domain. Along with local meteorological conditions, the influence of large-scale atmospheric circulation on surface ozone is addressed through a set of airflow indices computed with a novel implementation of a grid-by-grid weather type classification across Europe. Drivers of surface ozone over the full distribution of maximum daily 8-hour average values are investigated, along with drivers of the extreme high percentiles and exceedances of air quality guideline thresholds. Three different regression techniques are applied: multiple linear regression to assess the drivers of maximum daily ozone, logistic regression to assess the probability of threshold exceedances and quantile regression to estimate the meteorological influence on extreme values, as represented by the 95th percentile. The relative importance of the input parameters (predictors) is assessed by a backward stepwise regression procedure that allows the identification of the most important predictors in each model. Spatial patterns of model performance exhibit distinct variations between regions. The inclusion of the ozone persistence is particularly relevant over Southern Europe. In general, the best model performance is found over Central Europe, where the maximum temperature plays an important role as a driver of maximum daily ozone as well as its extreme values, especially during warmer months.
Experimental validation of a coupled neutron-photon inverse radiation transport solver
NASA Astrophysics Data System (ADS)
Mattingly, John; Mitchell, Dean J.; Harding, Lee T.
2011-10-01
Sandia National Laboratories has developed an inverse radiation transport solver that applies nonlinear regression to coupled neutron-photon deterministic transport models. The inverse solver uses nonlinear regression to fit a radiation transport model to gamma spectrometry and neutron multiplicity counting measurements. The subject of this paper is the experimental validation of that solver. This paper describes a series of experiments conducted with a 4.5 kg sphere of α-phase, weapons-grade plutonium. The source was measured bare and reflected by high-density polyethylene (HDPE) spherical shells with total thicknesses between 1.27 and 15.24 cm. Neutron and photon emissions from the source were measured using three instruments: a gross neutron counter, a portable neutron multiplicity counter, and a high-resolution gamma spectrometer. These measurements were used as input to the inverse radiation transport solver to evaluate the solver's ability to correctly infer the configuration of the source from its measured radiation signatures.
Zhang, Qiuzhuo; Weng, Chen; Huang, Huiqin; Achal, Varenyam; Wang, Duanchao
2016-01-01
Water hyacinth was used as substrate for bioethanol production in the present study. Combination of acid pretreatment and enzymatic hydrolysis was the most effective process for sugar production that resulted in the production of 402.93 mg reducing sugar at optimal condition. A regression model was built to optimize the fermentation factors according to response surface method in saccharification and fermentation (SSF) process. The optimized condition for ethanol production by SSF process was fermented at 38.87°C in 81.87 h when inoculated with 6.11 ml yeast, where 1.291 g/L bioethanol was produced. Meanwhile, 1.289 g/L ethanol was produced during experimentation, which showed reliability of presented regression model in this research. The optimization method discussed in the present study leading to relatively high bioethanol production could provide a promising way for Alien Invasive Species with high cellulose content. PMID:26779125
Determinants of U.S. Prescription Drug Utilization using County Level Data.
Nianogo, Thierry; Okunade, Albert; Fofana, Demba; Chen, Weiwei
2016-05-01
Prescription drugs are the third largest component of U.S. healthcare expenditures. The 2006 Medicare Part D and the 2010 Affordable Care Act are catalysts for further growth in utilization because of insurance expansion effects. This research investigating the determinants of prescription drug utilization is timely, methodologically novel, and policy relevant. Differences in population health status, access to care, socioeconomics, demographics, and variations in per capita number of scripts filled at retail pharmacies across the U.S.A. justify fitting separate econometric models to county data of the states partitioned into low, medium, and high prescription drug users. Given the skewed distribution of per capita number of filled prescriptions (response variable), we fit the variance stabilizing Box-Cox power transformation regression models to 2011 county level data for investigating the correlates of prescription drug utilization separately for low, medium, and high utilization states. Maximum likelihood regression parameter estimates, including the optimal Box-Cox λ power transformations, differ across high (λ = 0.214), medium (λ = 0.942), and low (λ = 0.302) prescription drug utilization models. The estimated income elasticities of -0.634, 0.031, and -0.532 in high, medium, and low utilization models suggest that the economic behavior of prescriptions is not invariant across different utilization levels. Copyright © 2015 John Wiley & Sons, Ltd.
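The Box-Cox power transformation mentioned in the abstract above has a simple closed form, sketched below in plain Python. The function names and the sample value are hypothetical; the three λ values are borrowed from the abstract purely as illustrative inputs, and the maximum likelihood estimation of λ that the authors describe is not shown.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform of a positive value y with power parameter lam.

    Uses (y**lam - 1) / lam for lam != 0 and the natural log for lam == 0.
    """
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam

def inverse_box_cox(z, lam):
    """Map a transformed value z back to the original scale."""
    if lam == 0:
        return math.exp(z)
    return (lam * z + 1.0) ** (1.0 / lam)

# Hypothetical per capita number of filled prescriptions.
scripts_per_capita = 7.5

for lam in (0.214, 0.942, 0.302):
    z = box_cox(scripts_per_capita, lam)
    print(lam, round(z, 4), round(inverse_box_cox(z, lam), 4))
```

A λ close to 1 leaves the response nearly linear, while a λ closer to 0 behaves more like a log transform, which is one way to read the different λ estimates reported for the medium versus the high and low utilization groups.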
NASA Astrophysics Data System (ADS)
Boucher, Thomas F.; Ozanne, Marie V.; Carmosino, Marco L.; Dyar, M. Darby; Mahadevan, Sridhar; Breves, Elly A.; Lepore, Kate H.; Clegg, Samuel M.
2015-05-01
The ChemCam instrument on the Mars Curiosity rover is generating thousands of LIBS spectra and bringing interest in this technique to public attention. The key to interpreting Mars or any other types of LIBS data are calibrations that relate laboratory standards to unknowns examined in other settings and enable predictions of chemical composition. Here, LIBS spectral data are analyzed using linear regression methods including partial least squares (PLS-1 and PLS-2), principal component regression (PCR), least absolute shrinkage and selection operator (lasso), elastic net, and linear support vector regression (SVR-Lin). These were compared against results from nonlinear regression methods including kernel principal component regression (K-PCR), polynomial kernel support vector regression (SVR-Py) and k-nearest neighbor (kNN) regression to discern the most effective models for interpreting chemical abundances from LIBS spectra of geological samples. The results were evaluated for 100 samples analyzed with 50 laser pulses at each of five locations averaged together. Wilcoxon signed-rank tests were employed to evaluate the statistical significance of differences among the nine models using their predicted residual sum of squares (PRESS) to make comparisons. For MgO, SiO2, Fe2O3, CaO, and MnO, the sparse models outperform all the others except for linear SVR, while for Na2O, K2O, TiO2, and P2O5, the sparse methods produce inferior results, likely because their emission lines in this energy range have lower transition probabilities. The strong performance of the sparse methods in this study suggests that use of dimensionality-reduction techniques as a preprocessing step may improve the performance of the linear models. Nonlinear methods tend to overfit the data and predict less accurately, while the linear methods proved to be more generalizable with better predictive performance. 
These results are attributed to the high dimensionality of the data (6144 channels) relative to the small number of samples studied. The best-performing models were SVR-Lin for SiO2, MgO, Fe2O3, and Na2O, lasso for Al2O3, elastic net for MnO, and PLS-1 for CaO, TiO2, and K2O. Although these differences in model performance between methods were identified, most of the models produce comparable results when p ≤ 0.05 and all techniques except kNN produced statistically-indistinguishable results. It is likely that a combination of models could be used together to yield a lower total error of prediction, depending on the requirements of the user.
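The predicted residual sum of squares (PRESS) used above to compare models can be sketched with leave-one-out refitting. This is a deliberately simplified illustration with a single predictor and ordinary least squares; the study's models use thousands of spectral channels, and the helper names `fit_line` and `press` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def press(xs, ys):
    """Sum of squared errors when each point is predicted by a model
    fitted to all of the other points (leave-one-out)."""
    total = 0.0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        b0, b1 = fit_line(train_x, train_y)
        total += (ys[i] - (b0 + b1 * xs[i])) ** 2
    return total
```

Because every prediction is made for a point the model never saw, a lower PRESS indicates better generalization rather than a closer fit to the training data.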
Kayser, W; Glaze, J B; Welch, C M; Kerley, M; Hill, R A
2015-07-01
The objective of this study was to determine the effects of alternative measurements of body weight and DMI used to evaluate residual feed intake (RFI). Weaning weight (WW), ADG, and DMI were recorded on 970 growing purebred Charolais bulls (n = 519) and heifers (n = 451) and 153 Red Angus growing steers (n = 69) and heifers (n = 84) using a GrowSafe (GrowSafe, Airdrie, Alberta, Canada) system. Averages of individual DMI were calculated in 10-d increments and compared to the overall DMI to identify the magnitude of the errors associated with measuring DMI. These incremental measurements were also used in calculation of RFI, computed from the linear regression of DMI on ADG and midtest body weight^0.75 (MMWT). RFI_Regress was calculated using ADG_Regress (ADG calculated as the response of BW gain and DOF) and MMWT_PWG (metabolic midweight calculated throughout the postweaning gain test), considered the control in Red Angus. A similar calculation served as control for Charolais; RFI was calculated using 2-d consecutive start and finish weights (RFI_Calc). The RFI weaning weight (RFI_WW) was calculated using ADG_WW (ADG from weaning till the final out weight of the postweaning gain test) and MMWT_WW, calculated similarly. Overall average estimated DMI was highly correlated to the measurements derived over shorter periods, with 10 d being the least correlated and 60 d being the most correlated. The ADG_Calc (calculated using 2-d consecutive start and finish weight/DOF) and ADG_WW were highly correlated in Charolais. The ADG_Regress and ADG_Calc were highly correlated, and ADG_Regress and ADG_WW were moderately correlated in Red Angus. The control measures of RFI were highly correlated with the RFI_WW in Charolais and Red Angus. The outcomes of including abbreviated period DMI in the model with the weaning weight gain measurements showed that the model using 10 d of intake (RFI_WW_10) was the least correlated with the control measures. 
The model with 60 d of intake had the largest correlation with the control measures. The fewest measured intake days coupled with the weaning weight values providing acceptable predictive value was RFI_WW_40, being highly correlated with the control measures. As established in the literature, at least 70 d is required to accurately measure ADG. However, we conclude that a shorter period, possibly as few as 40 d is needed to accurately estimate DMI for a reliable calculation of RFI.
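The abstract above defines RFI as what remains of DMI after a linear regression on ADG and metabolic midtest body weight (body weight raised to the power 0.75). A minimal sketch of that calculation follows, assuming an intercept term and solving the normal equations directly; the function names and the animal data used to exercise it are hypothetical.

```python
def solve(a, b):
    """Solve the small linear system a x = b by Gaussian elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def residual_feed_intake(dmi, adg, mid_bw):
    """Residuals of DMI regressed on ADG and mid_bw ** 0.75 (with intercept)."""
    rows = [[1.0, gain, weight ** 0.75] for gain, weight in zip(adg, mid_bw)]
    k = 3
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, dmi)) for i in range(k)]
    beta = solve(xtx, xty)
    return [y - sum(b * v for b, v in zip(beta, r)) for r, y in zip(rows, dmi)]
```

Animals with negative RFI ate less than the regression predicts for their gain and size, so they are the more efficient ones; by construction the residuals sum to zero across the group.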
Sharer, Melissa; Cluver, Lucie; Shields, Joseph J; Ahearn, Frederick
2016-03-01
Children affected by HIV and AIDS have significantly higher rates of mental health problems than unaffected children. There is a need for research to examine how social support functions as a source of resiliency for children in high HIV-prevalence settings such as South Africa. The purpose of this research was to explore how family social support relates to depression, anxiety, and post-traumatic stress (PTS). Using the ecological model as a frame, data were drawn from a 2011 cross-sectional study of 1380 children classified as either orphaned by AIDS and/or living with an AIDS sick family member. The children were from high-poverty, high HIV-prevalent rural and urban communities in South Africa. Social support was analyzed in depth by examining the source (e.g. caregiver, sibling) and the type (e.g. emotional, instrumental, quality). These variables were entered into multiple regression analyses to estimate the most parsimonious regression models to show the relationships between social support and depression, anxiety, and PTS symptoms among the children. Siblings emerged as the most consistent source of social support on mental health. Overall caregiver and sibling support explained 13% variance in depression, 12% in anxiety, and 11% in PTS. Emotional support was the most frequent type of social support associated with mental health in all regression models, with higher levels of quality and instrumental support having the strongest relation to positive mental health outcomes. Although instrumental and quality support from siblings were related to positive mental health, unexpectedly, the higher the level of emotional support received from a sibling resulted in the child reporting more symptoms of depression, anxiety, and PTS. The opposite was true for emotional support provided via caregivers, higher levels of this support was related to lower levels of all mental health symptoms. Sex was significant in all regressions, indicating the presence of moderation.
Sharer, Melissa; Cluver, Lucie; Shields, Joseph J.; Ahearn, Frederick
2016-01-01
Children affected by HIV and AIDS have significantly higher rates of mental health problems than unaffected children. There is a need for research to examine how social support functions as a source of resiliency for children in high HIV-prevalence settings such as South Africa. The purpose of this research was to explore how family social support relates to depression, anxiety, and post-traumatic stress (PTS). Using the ecological model as a frame, data were drawn from a 2011 cross-sectional study of 1380 children classified as either orphaned by AIDS and/or living with an AIDS sick family member. The children were from high-poverty, high HIV-prevalent rural and urban communities in South Africa. Social support was analyzed in depth by examining the source (e.g. caregiver, sibling) and the type (e.g. emotional, instrumental, quality). These variables were entered into multiple regression analyses to estimate the most parsimonious regression models to show the relationships between social support and depression, anxiety, and PTS symptoms among the children. Siblings emerged as the most consistent source of social support on mental health. Overall caregiver and sibling support explained 13% variance in depression, 12% in anxiety, and 11% in PTS. Emotional support was the most frequent type of social support associated with mental health in all regression models, with higher levels of quality and instrumental support having the strongest relation to positive mental health outcomes. Although instrumental and quality support from siblings were related to positive mental health, unexpectedly, the higher the level of emotional support received from a sibling resulted in the child reporting more symptoms of depression, anxiety, and PTS. The opposite was true for emotional support provided via caregivers, higher levels of this support was related to lower levels of all mental health symptoms. 
Sex was significant in all regressions, indicating the presence of moderation. PMID:27392006
Predicting arsenic in drinking water wells of the Central Valley, California
Ayotte, Joseph; Nolan, Bernard T.; Gronberg, JoAnn M.
2016-01-01
Probabilities of arsenic in groundwater at depths used for domestic and public supply in the Central Valley of California are predicted using weak-learner ensemble models (boosted regression trees, BRT) and more traditional linear models (logistic regression, LR). Both methods captured major processes that affect arsenic concentrations, such as the chemical evolution of groundwater, redox differences, and the influence of aquifer geochemistry. Inferred flow-path length was the most important variable but near-surface-aquifer geochemical data also were significant. A unique feature of this study was that previously predicted nitrate concentrations in three dimensions were themselves predictive of arsenic and indicated an important redox effect at >10 μg/L, indicating low arsenic where nitrate was high. Additionally, a variable representing three-dimensional aquifer texture from the Central Valley Hydrologic Model was an important predictor, indicating high arsenic associated with fine-grained aquifer sediment. BRT outperformed LR at the 5 μg/L threshold in all five predictive performance measures and at 10 μg/L in four out of five measures. BRT yielded higher prediction sensitivity (39%) than LR (18%) at the 10 μg/L threshold–a useful outcome because a major objective of the modeling was to improve our ability to predict high arsenic areas.
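Prediction sensitivity, the measure on which BRT (39%) and LR (18%) are compared above, is the share of wells that truly exceed the threshold and are also predicted to exceed it. The sketch below shows that calculation on made-up exceedance flags; the function name and the data are hypothetical.

```python
def sensitivity(observed_exceeds, predicted_exceeds):
    """True positives divided by all observed exceedances."""
    pairs = list(zip(observed_exceeds, predicted_exceeds))
    true_pos = sum(1 for obs, pred in pairs if obs and pred)
    false_neg = sum(1 for obs, pred in pairs if obs and not pred)
    if true_pos + false_neg == 0:
        raise ValueError("no observed exceedances to evaluate")
    return true_pos / (true_pos + false_neg)


# Hypothetical wells: does arsenic exceed the threshold, observed vs. predicted?
observed = [True, True, True, False, False]
predicted = [True, False, False, False, True]
example_sensitivity = sensitivity(observed, predicted)
```

A low sensitivity means many high-arsenic wells are missed, which is why the abstract treats the higher BRT value as the practically useful result.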
Predictors of non-hookah smoking among high-school students based on prototype/willingness model.
Abedini, Sedigheh; MorowatiSharifabad, MohammadAli; Chaleshgar Kordasiabi, Mosharafeh; Ghanbarnejad, Amin
2014-01-01
The aim of the study was to determine predictors of refraining from hookah smoking among high-school students in Bandar Abbas, southern Iran, based on the Prototype/Willingness model. This cross-sectional study with an analytic approach was performed on 240 high-school students selected by cluster random sampling. Data on demographics and Prototype/Willingness Model constructs were acquired via a self-administered questionnaire. Data were analyzed using means, frequencies, correlation, and linear and logistic regression. Statistically significant determinants of the intention to refrain from hookah smoking were subjective norms, willingness, and attitude. The regression model indicated that the three items together explained 46.9% of the variance in the intention not to smoke hookah. Attitude and subjective norms predicted 36.0% of that variance. There was a significant relationship between the participants' negative prototype of hookah smokers and the willingness to avoid hookah smoking (P=0.002). Willingness also predicted not smoking hookah better than intention did (P<0.001). Designing interventions to strengthen the negative prototype of hookah smokers and to reduce the situations and conditions that facilitate hookah smoking, such as easy access to tobacco products in cafés and on beaches, can be useful for preventing hookah smoking among adolescents.
Minet, L; Gehr, R; Hatzopoulou, M
2017-11-01
The development of reliable measures of exposure to traffic-related air pollution is crucial for the evaluation of the health effects of transportation. Land-use regression (LUR) techniques have been widely used for the development of exposure surfaces; however, these surfaces are often highly sensitive to the data collected. With the rise of inexpensive air pollution sensors paired with GPS devices, we witness the emergence of mobile data collection protocols. For the same urban area, can we achieve a 'universal' model irrespective of the number of locations and sampling visits? Can we trade the temporal representation of fixed-point sampling for a larger spatial extent afforded by mobile monitoring? This study highlights the challenges of short-term mobile sampling campaigns in terms of the resulting exposure surfaces. A mobile monitoring campaign was conducted in 2015 in Montreal; nitrogen dioxide (NO2) levels at 1395 road segments were measured under repeated visits. We developed LUR models based on sub-segments, categorized in terms of the number of visits per road segment. We observe that LUR models were highly sensitive to the number of road segments and to the number of visits per road segment. The associated exposure surfaces were also highly dissimilar. Copyright © 2017 Elsevier Ltd. All rights reserved.
NASA Astrophysics Data System (ADS)
Mai, W.; Zhang, J.-F.; Zhao, X.-M.; Li, Z.; Xu, Z.-W.
2017-11-01
Wastewater from the dye industry is typically analyzed using a standard method for measurement of chemical oxygen demand (COD) or by a single-wavelength spectroscopic method. To overcome the disadvantages of these methods, ultraviolet-visible (UV-Vis) spectroscopy was combined with principal component regression (PCR) and partial least squares regression (PLSR) in this study. Unlike the standard method, this method does not require digestion of the samples for preparation. Experiments showed that the PLSR model offered high prediction performance for COD, with a mean relative error of about 5% for two dyes. This error is similar to that obtained with the standard method. In this study, the precision of the PLSR model decreased with the number of dye compounds present. It is likely that multiple models will be required in reality, and the complexity of a COD monitoring system would be greatly reduced if the PLSR model is used because it can include several dyes. UV-Vis spectroscopy with PLSR successfully enhanced the performance of COD prediction for dye wastewater and showed good potential for application in on-line water quality monitoring.
Test anxiety and academic performance in chiropractic students.
Zhang, Niu; Henderson, Charles N R
2014-01-01
Objective: We assessed the level of students' test anxiety, and the relationship between test anxiety and academic performance. Methods: We recruited 166 third-quarter students. The Test Anxiety Inventory (TAI) was administered to all participants. Total scores from written examinations and objective structured clinical examinations (OSCEs) were used as response variables. Results: Multiple regression analysis shows that there was a modest, but statistically significant negative correlation between TAI scores and written exam scores, but not OSCE scores. Worry and emotionality were the best predictive models for written exam scores. Mean total anxiety and emotionality scores for females were significantly higher than those for males, but not worry scores. Conclusion: Moderate-to-high test anxiety was observed in 85% of the chiropractic students examined. However, total test anxiety, as measured by the TAI score, was a very weak predictive model for written exam performance. Multiple regression analysis demonstrated that replacing total anxiety (TAI) with worry and emotionality (TAI subscales) produces a much more effective predictive model of written exam performance. Sex, age, highest current academic degree, and ethnicity contributed little additional predictive power in either regression model. Moreover, TAI scores were not found to be statistically significant predictors of physical exam skill performance, as measured by OSCEs.
Wang, Jake; Perry, Curtis J; Meeth, Katrina; Thakral, Durga; Damsky, William; Micevic, Goran; Kaech, Susan; Blenman, Kim; Bosenberg, Marcus
2017-07-01
Human melanomas exhibit relatively high somatic mutation burden compared to other malignancies. These somatic mutations may produce neoantigens that are recognized by the immune system, leading to an antitumor response. By irradiating a parental mouse melanoma cell line carrying three driver mutations with UVB and expanding a single-cell clone, we generated a mutagenized model that exhibits high somatic mutation burden. When inoculated at low cell numbers in immunocompetent C57BL/6J mice, YUMMER1.7 (Yale University Mouse Melanoma Exposed to Radiation) regresses after a brief period of growth. This regression phenotype is dependent on T cells as YUMMER1.7 tumors grow significantly faster in immunodeficient Rag1 -/- mice and C57BL/6J mice depleted of CD4 and CD8 T cells. Interestingly, regression can be overcome by injecting higher cell numbers of YUMMER1.7, which results in tumors that grow without effective rejection. Mice that have previously rejected YUMMER1.7 tumors develop immunity against higher doses of YUMMER1.7 tumor challenge. In addition, escaping YUMMER1.7 tumors are sensitive to anti-CTLA-4 and anti-PD-1 therapy, establishing a new model for the evaluation of immune checkpoint inhibition and antitumor immune responses. © 2017 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.
Numerical investigation on the regression rate of hybrid rocket motor with star swirl fuel grain
NASA Astrophysics Data System (ADS)
Zhang, Shuai; Hu, Fan; Zhang, Weihua
2016-10-01
Although hybrid rocket motor is prospected to have distinct advantages over liquid and solid rocket motor, low regression rate and insufficient efficiency are two major disadvantages which have prevented it from being commercially viable. In recent years, complex fuel grain configurations are attractive in overcoming the disadvantages with the help of Rapid Prototyping technology. In this work, an attempt has been made to numerically investigate the flow field characteristics and local regression rate distribution inside the hybrid rocket motor with complex star swirl grain. A propellant combination with GOX and HTPB has been chosen. The numerical model is established based on the three dimensional Navier-Stokes equations with turbulence, combustion, and coupled gas/solid phase formulations. The calculated fuel regression rate is compared with the experimental data to validate the accuracy of numerical model. The results indicate that, comparing the star swirl grain with the tube grain under the conditions of the same port area and the same grain length, the burning surface area rises about 200%, the spatially averaged regression rate rises as high as about 60%, and the oxidizer can combust sufficiently due to the big vortex around the axis in the aft-mixing chamber. The combustion efficiency of star swirl grain is better and more stable than that of tube grain.
Ji, Lei; Peters, Albert J.
2004-01-01
The relationship between vegetation and climate in the grassland and cropland of the northern US Great Plains was investigated with Normalized Difference Vegetation Index (NDVI) (1989–1993) images derived from the Advanced Very High Resolution Radiometer (AVHRR), and climate data from automated weather stations. The relationship was quantified using a spatial regression technique that adjusts for spatial autocorrelation inherent in these data. Conventional regression techniques used frequently in previous studies are not adequate, because they are based on the assumption of independent observations. Six climate variables during the growing season (precipitation, potential evapotranspiration, daily maximum and minimum air temperature, soil temperature, and solar irradiation) were regressed on NDVI derived from a 10-km weather station buffer. The regression model identified precipitation and potential evapotranspiration as the most significant climatic variables, indicating that the water balance is the most important factor controlling vegetation condition at an annual timescale. The model indicates that 46% and 24% of variation in NDVI is accounted for by climate in grassland and cropland, respectively, indicating that grassland vegetation has a more pronounced response to climate variation than cropland. Other factors contributing to NDVI variation include environmental factors (soil, groundwater and terrain), human manipulation of crops, and sensor variation.
Pérez-Rodríguez, Paulino; Gianola, Daniel; González-Camacho, Juan Manuel; Crossa, José; Manès, Yann; Dreisigacker, Susanne
2012-01-01
In genome-enabled prediction, parametric, semi-parametric, and non-parametric regression models have been used. This study assessed the predictive ability of linear and non-linear models using dense molecular markers. The linear models were linear on marker effects and included the Bayesian LASSO, Bayesian ridge regression, Bayes A, and Bayes B. The non-linear models (this refers to non-linearity on markers) were reproducing kernel Hilbert space (RKHS) regression, Bayesian regularized neural networks (BRNN), and radial basis function neural networks (RBFNN). These statistical models were compared using 306 elite wheat lines from CIMMYT genotyped with 1717 diversity array technology (DArT) markers and two traits, days to heading (DTH) and grain yield (GY), measured in each of 12 environments. It was found that the three non-linear models had better overall prediction accuracy than the linear regression specification. Results showed a consistent superiority of RKHS and RBFNN over the Bayesian LASSO, Bayesian ridge regression, Bayes A, and Bayes B models. PMID:23275882
Pérez-Rodríguez, Paulino; Gianola, Daniel; González-Camacho, Juan Manuel; Crossa, José; Manès, Yann; Dreisigacker, Susanne
2012-12-01
In genome-enabled prediction, parametric, semi-parametric, and non-parametric regression models have been used. This study assessed the predictive ability of linear and non-linear models using dense molecular markers. The linear models were linear on marker effects and included the Bayesian LASSO, Bayesian ridge regression, Bayes A, and Bayes B. The non-linear models (this refers to non-linearity on markers) were reproducing kernel Hilbert space (RKHS) regression, Bayesian regularized neural networks (BRNN), and radial basis function neural networks (RBFNN). These statistical models were compared using 306 elite wheat lines from CIMMYT genotyped with 1717 diversity array technology (DArT) markers and two traits, days to heading (DTH) and grain yield (GY), measured in each of 12 environments. It was found that the three non-linear models had better overall prediction accuracy than the linear regression specification. Results showed a consistent superiority of RKHS and RBFNN over the Bayesian LASSO, Bayesian ridge regression, Bayes A, and Bayes B models.
Gurnani, Ashita S; John, Samantha E; Gavett, Brandon E
2015-05-01
The current study developed regression-based normative adjustments for a bi-factor model of the Brief Test of Adult Cognition by Telephone (BTACT). Archival data from the Midlife Development in the United States-II Cognitive Project were used to develop eight separate linear regression models that predicted bi-factor BTACT scores, accounting for age, education, gender, and occupation, alone and in various combinations. All regression models provided statistically significant fit to the data. A three-predictor regression model fit best and accounted for 32.8% of the variance in the global bi-factor BTACT score. The fit of the regression models was not improved by gender. Eight different regression models are presented to allow the user flexibility in applying demographic corrections to the bi-factor BTACT scores. Occupation corrections, while not widely used, may provide useful demographic adjustments for adult populations or for those individuals who have attained an occupational status not commensurate with expected educational attainment. © The Author 2015. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
Regression Model Term Selection for the Analysis of Strain-Gage Balance Calibration Data
NASA Technical Reports Server (NTRS)
Ulbrich, Norbert Manfred; Volden, Thomas R.
2010-01-01
The paper discusses the selection of regression model terms for the analysis of wind tunnel strain-gage balance calibration data. Different function class combinations are presented that may be used to analyze calibration data using either a non-iterative or an iterative method. The role of the intercept term in a regression model of calibration data is reviewed. In addition, useful algorithms and metrics originating from linear algebra and statistics are recommended that will help an analyst (i) to identify and avoid both linear and near-linear dependencies between regression model terms and (ii) to make sure that the selected regression model of the calibration data uses only statistically significant terms. Three different tests are suggested that may be used to objectively assess the predictive capability of the final regression model of the calibration data. These tests use both the original data points and regression model independent confirmation points. Finally, data from a simplified manual calibration of the Ames MK40 balance is used to illustrate the application of some of the metrics and tests to a realistic calibration data set.
Panel regressions to estimate low-flow response to rainfall variability in ungaged basins
Bassiouni, Maoya; Vogel, Richard M.; Archfield, Stacey A.
2016-01-01
Multicollinearity and omitted-variable bias are major limitations to developing multiple linear regression models to estimate streamflow characteristics in ungaged areas and varying rainfall conditions. Panel regression is used to overcome limitations of traditional regression methods, and obtain reliable model coefficients, in particular to understand the elasticity of streamflow to rainfall. Using annual rainfall and selected basin characteristics at 86 gaged streams in the Hawaiian Islands, regional regression models for three stream classes were developed to estimate the annual low-flow duration discharges. Three panel-regression structures (random effects, fixed effects, and pooled) were compared to traditional regression methods, in which space is substituted for time. Results indicated that panel regression generally was able to reproduce the temporal behavior of streamflow and reduce the standard errors of model coefficients compared to traditional regression, even for models in which the unobserved heterogeneity between streams is significant and the variance inflation factor for rainfall is much greater than 10. This is because both spatial and temporal variability were better characterized in panel regression. In a case study, regional rainfall elasticities estimated from panel regressions were applied to ungaged basins on Maui, using available rainfall projections to estimate plausible changes in surface-water availability and usable stream habitat for native species. The presented panel-regression framework is shown to offer benefits over existing traditional hydrologic regression methods for developing robust regional relations to investigate streamflow response in a changing climate.
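One way to picture the fixed-effects structure mentioned above is the within transformation: log-flow and log-rainfall are demeaned per stream, so constant differences between streams drop out, and the pooled slope of the demeaned values estimates the rainfall elasticity of low flow. The sketch below assumes a single predictor and a log-log form; this is an illustrative simplification rather than the authors' specification, and the function name and data are hypothetical.

```python
import math
from collections import defaultdict


def fixed_effects_elasticity(records):
    """records: iterable of (stream_id, annual_rainfall, low_flow), all > 0.
    Returns the within-stream slope of log(flow) on log(rainfall)."""
    by_stream = defaultdict(list)
    for stream_id, rainfall, flow in records:
        by_stream[stream_id].append((math.log(rainfall), math.log(flow)))
    sxx = 0.0
    sxy = 0.0
    for observations in by_stream.values():
        mean_x = sum(x for x, _ in observations) / len(observations)
        mean_y = sum(y for _, y in observations) / len(observations)
        for x, y in observations:
            sxx += (x - mean_x) ** 2
            sxy += (x - mean_x) * (y - mean_y)
    return sxy / sxx


# Hypothetical streams that share an elasticity of 1.5 but differ in scale.
example_records = (
    [("A", r, 2.0 * r ** 1.5) for r in (100.0, 150.0, 200.0)]
    + [("B", r, 0.5 * r ** 1.5) for r in (300.0, 400.0, 500.0)]
)
example_elasticity = fixed_effects_elasticity(example_records)
```

An elasticity of 1.5 would mean that a 1% change in annual rainfall goes with roughly a 1.5% change in low flow at the same stream, regardless of how large that stream is.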
Panel regressions to estimate low-flow response to rainfall variability in ungaged basins
NASA Astrophysics Data System (ADS)
Bassiouni, Maoya; Vogel, Richard M.; Archfield, Stacey A.
2016-12-01
Multicollinearity and omitted-variable bias are major limitations to developing multiple linear regression models to estimate streamflow characteristics in ungaged areas and varying rainfall conditions. Panel regression is used to overcome limitations of traditional regression methods, and obtain reliable model coefficients, in particular to understand the elasticity of streamflow to rainfall. Using annual rainfall and selected basin characteristics at 86 gaged streams in the Hawaiian Islands, regional regression models for three stream classes were developed to estimate the annual low-flow duration discharges. Three panel-regression structures (random effects, fixed effects, and pooled) were compared to traditional regression methods, in which space is substituted for time. Results indicated that panel regression generally was able to reproduce the temporal behavior of streamflow and reduce the standard errors of model coefficients compared to traditional regression, even for models in which the unobserved heterogeneity between streams is significant and the variance inflation factor for rainfall is much greater than 10. This is because both spatial and temporal variability were better characterized in panel regression. In a case study, regional rainfall elasticities estimated from panel regressions were applied to ungaged basins on Maui, using available rainfall projections to estimate plausible changes in surface-water availability and usable stream habitat for native species. The presented panel-regression framework is shown to offer benefits over existing traditional hydrologic regression methods for developing robust regional relations to investigate streamflow response in a changing climate.
A Model for Predicting Student Performance on High-Stakes Assessment
ERIC Educational Resources Information Center
Dammann, Matthew Walter
2010-01-01
This research study examined the use of student achievement on reading and math state assessments to predict success on the science state assessment. Multiple regression analysis was utilized to test the prediction for all students in grades 5 and 8 in a mid-Atlantic state. The prediction model developed from the analysis explored the combined…
Kemper, Claudia; Koller, Daniela; Glaeske, Gerd; van den Bussche, Hendrik
2011-01-01
Aphasia, dementia, and depression are important and common neurological and neuropsychological disorders after ischemic stroke. We estimated the frequency of these comorbidities and their impact on mortality and nursing care dependency. Data of a German statutory health insurance were analyzed for people aged 50 years and older with first ischemic stroke. Aphasia, dementia, and depression were defined on the basis of outpatient medical diagnoses within 1 year after stroke. Logistic regression models for mortality and nursing care dependency were calculated and were adjusted for age, sex, and other relevant comorbidity. Of 977 individuals with a first ischemic stroke, 14.8% suffered from aphasia, 12.5% became demented, and 22.4% became depressed. The regression model for mortality showed a significant influence of age, aphasia, and other relevant comorbidity. In the regression model for nursing care dependency, the factors age, aphasia, dementia, depression, and other relevant comorbidity were significant. Aphasia has a high impact on mortality and nursing care dependency after ischemic stroke, while dementia and depression are strongly associated with increasing nursing care dependency.
Estimation of genetic parameters related to eggshell strength using random regression models.
Guo, J; Ma, M; Qu, L; Shen, M; Dou, T; Wang, K
2015-01-01
This study examined the changes in eggshell strength and the genetic parameters related to this trait throughout a hen's laying life using random regression. The data were collected from a crossbred population between 2011 and 2014, where the eggshell strength was determined repeatedly for 2260 hens. Using random regression models (RRMs), several Legendre polynomials were employed to estimate the fixed, direct genetic and permanent environment effects. The residual effects were treated as independently distributed with heterogeneous variance for each test week. The direct genetic variance was included with second-order Legendre polynomials and the permanent environment with third-order Legendre polynomials. The heritability of eggshell strength ranged from 0.26 to 0.43, the repeatability ranged between 0.47 and 0.69, and the estimated genetic correlations between test weeks were high (> 0.67). The first eigenvalue of the genetic covariance matrix accounted for about 97% of the sum of all the eigenvalues. The flexibility and statistical power of RRM suggest that this model could be an effective method to improve eggshell quality and to reduce losses due to cracked eggs in a breeding plan.
Dudley, Robert W.; Hodgkins, Glenn A.; Dickinson, Jesse
2017-01-01
We present a logistic regression approach for forecasting the probability of future groundwater levels declining or maintaining below specific groundwater-level thresholds. We tested our approach on 102 groundwater wells in different climatic regions and aquifers of the United States that are part of the U.S. Geological Survey Groundwater Climate Response Network. We evaluated the importance of current groundwater levels, precipitation, streamflow, seasonal variability, Palmer Drought Severity Index, and atmosphere/ocean indices for developing the logistic regression equations. Several diagnostics of model fit were used to evaluate the regression equations, including testing of autocorrelation of residuals, goodness-of-fit metrics, and bootstrap validation testing. The probabilistic predictions were most successful at wells with high persistence (low month-to-month variability) in their groundwater records and at wells where the groundwater level remained below the defined low threshold for sustained periods (generally three months or longer). The model fit was weakest at wells with strong seasonal variability in levels and with shorter duration low-threshold events. We identified challenges in deriving probabilistic-forecasting models and possible approaches for addressing those challenges.
NASA Astrophysics Data System (ADS)
Keshtpoor, M.; Carnacina, I.; Yablonsky, R. M.
2016-12-01
Extratropical cyclones (ETCs) are the primary driver of storm surge events along the UK and northwest mainland Europe coastlines. In an effort to evaluate the storm surge risk in coastal communities in this region, a stochastic catalog is developed by perturbing the historical storm seeds of European ETCs to account for 10,000 years of possible ETCs. Numerical simulation of the storm surge generated by the full 10,000-year stochastic catalog, however, is computationally expensive and may take several months to complete with available computational resources. A new statistical regression model is developed to select the major surge-generating events from the stochastic ETC catalog. This regression model is based on the maximum storm surge, obtained via numerical simulations using a calibrated version of the Delft3D-FM hydrodynamic model with a relatively coarse mesh, of 1750 historical ETC events that occurred over the past 38 years in Europe. These numerically-simulated surge values were regressed to the local sea level pressure and the U and V components of the wind field at the location of 196 tide gauge stations near the UK and northwest mainland Europe coastal areas. The regression model suggests that storm surge values in the area of interest are highly correlated to the U- and V-component of wind speed, as well as the sea level pressure. Based on these correlations, the regression model was then used to select surge-generating storms from the 10,000-year stochastic catalog. Results suggest that roughly 105,000 events out of 480,000 stochastic storms are surge-generating events and need to be considered for numerical simulation using a hydrodynamic model. The selected stochastic storms were then simulated in Delft3D-FM, and the final refinement of the storm population was performed based on return period analysis of the 1750 historical event simulations at each of the 196 tide gauges in preparation for Delft3D-FM fine mesh simulations.
Large biases in regression-based constituent flux estimates: causes and diagnostic tools
Hirsch, Robert M.
2014-01-01
It has been documented in the literature that, in some cases, widely used regression-based models can produce severely biased estimates of long-term mean river fluxes of various constituents. These models, estimated using sample values of concentration, discharge, and date, are used to compute estimated fluxes for a multiyear period at a daily time step. This study compares results of the LOADEST seven-parameter model, LOADEST five-parameter model, and the Weighted Regressions on Time, Discharge, and Season (WRTDS) model using subsampling of six very large datasets to better understand this bias problem. This analysis considers sample datasets for dissolved nitrate and total phosphorus. The results show that LOADEST-7 and LOADEST-5, although they often produce very nearly unbiased results, can produce highly biased results. This study identifies three conditions that can give rise to these severe biases: (1) lack of fit of the log of concentration vs. log discharge relationship, (2) substantial differences in the shape of this relationship across seasons, and (3) severely heteroscedastic residuals. The WRTDS model is more resistant to the bias problem than the LOADEST models but is not immune to it. Understanding the causes of the bias problem is crucial to selecting an appropriate method for flux computations. Diagnostic tools for identifying the potential for bias problems are introduced, and strategies for resolving bias problems are described.
Wilks, Scott E; Croom, Beth
2008-05-01
The study examined whether social support functioned as a protective, resilience factor among Alzheimer's disease (AD) caregivers. Moderation and mediation models were used to test social support amid stress and resilience. A cross-sectional analysis of self-reported data was conducted. Measures of demographics, perceived stress, family support, friend support, overall social support, and resilience were administered to caregiver attendees (N=229) of two AD caregiver conferences. Hierarchical regression analysis showed the compounded impact of predictors on resilience. Odds ratios generated probability of high resilience given high stress and social supports. Social support moderation and mediation were tested via distinct series of regression equations. Path analyses illustrated effects on the models for significant moderation and/or mediation. Stress negatively influenced and accounted for most variation in resilience. Social support positively influenced resilience, and caregivers with high family support had the highest probability of elevated resilience. Moderation was observed among all support factors. No social support fulfilled the complete mediation criteria. Evidence of social support as a protective, moderating factor yields implications for health care practitioners who deliver services to assist AD caregivers, particularly the promotion of identification and utilization of supportive familial and peer relations.
Barry, Adam E; Chaney, Beth; Chaney, J Don
2011-08-01
Truancy and alcohol use are quality indicators of academic achievement and success. However, there remains a paucity of substantive research articulating the impact these deviant behaviors have on an adolescent's educational aspirations. The purpose of this study is to assess whether recent alcohol use and truancy impact students' educational aspirations among a nationally representative sample of US high school seniors. This study conducted a secondary data analysis of the Monitoring the Future project data, 2006. Logistic regression was conducted to assess how alcohol use and truancy affected educational aspirations. Subsequent interaction effects were assessed in the final multivariable model. Demographic variables such as age, sex, race, and father and mother's educational level were included as covariates in the regression model. Results indicate that as students engage in increased alcohol use and/or truancy, educational aspirations decrease. Thus, students who indicated a desire to attend a 4-year college/university were less likely to engage in high-risk drinking behavior and/or truancy. Moreover, in testing the interaction between truancy and alcohol use, as it relates to educational aspirations, the logistic regression model found both of these independent variables to be statistically significant predictors of the likelihood students would attend a 4-year college/university. To ensure that adolescents further their education and maximize their potential life opportunities, school and public health officials should initiate efforts to reduce alcohol consumption and truancy among students. Furthermore, future research should examine the risk and protective factors that may influence one's educational aspirations. © 2011, American School Health Association.
Kepler AutoRegressive Planet Search: Motivation & Methodology
NASA Astrophysics Data System (ADS)
Caceres, Gabriel; Feigelson, Eric; Jogesh Babu, G.; Bahamonde, Natalia; Bertin, Karine; Christen, Alejandra; Curé, Michel; Meza, Cristian
2015-08-01
The Kepler AutoRegressive Planet Search (KARPS) project uses statistical methodology associated with autoregressive (AR) processes to model Kepler lightcurves in order to improve exoplanet transit detection in systems with high stellar variability. We also introduce a planet-search algorithm to detect transits in time-series residuals after application of the AR models. One of the main obstacles in detecting faint planetary transits is the intrinsic stellar variability of the host star. The variability displayed by many stars may have autoregressive properties, wherein later flux values are correlated with previous ones in some manner. Auto-Regressive Moving-Average (ARMA) models, Generalized Auto-Regressive Conditional Heteroskedasticity (GARCH), and related models are flexible, phenomenological methods used with great success to model stochastic temporal behaviors in many fields of study, particularly econometrics. Powerful statistical methods are implemented in the public statistical software environment R and its many packages. Modeling involves maximum likelihood fitting, model selection, and residual analysis. These techniques provide a useful framework to model stellar variability and are used in KARPS with the objective of reducing stellar noise to enhance opportunities to find as-yet-undiscovered planets. Our analysis procedure consists of three steps: pre-processing of the data to remove discontinuities, gaps and outliers; ARMA-type model selection and fitting; and transit signal search of the residuals using a new Transit Comb Filter (TCF) that replaces traditional box-finding algorithms. We apply the procedures to simulated Kepler-like time series with known stellar and planetary signals to evaluate the effectiveness of the KARPS procedures. The ARMA-type modeling is effective at reducing stellar noise, but also reduces and transforms the transit signal into ingress/egress spikes. A periodogram based on the TCF is constructed to concentrate the signal of these periodic spikes. When a periodic transit is found, the model is displayed on a standard period-folded averaged light curve. We also illustrate the efficient coding in R.
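The idea of removing autocorrelated stellar variability and searching the residuals can be illustrated with a much simpler stand-in: a first-order autoregressive, AR(1), fit by least squares. This is only a sketch on simulated data; the project described above uses ARMA-type models fitted by maximum likelihood in R, and the names `fit_ar1` and `ar1_residuals` are hypothetical.

```python
import random
import statistics

def fit_ar1(series):
    """Least-squares estimate of phi in x[t] = phi * x[t-1] + e[t] (zero-mean series)."""
    num = sum(prev * cur for prev, cur in zip(series, series[1:]))
    den = sum(prev * prev for prev in series[:-1])
    return num / den

def ar1_residuals(series, phi):
    """One-step-ahead prediction errors left after removing the AR(1) component."""
    return [cur - phi * prev for prev, cur in zip(series, series[1:])]

# Simulate a zero-mean "flux" series with autocorrelated noise (assumed phi = 0.8).
rng = random.Random(42)
true_phi = 0.8
flux = [0.0]
for _ in range(4999):
    flux.append(true_phi * flux[-1] + rng.gauss(0.0, 1.0))

phi_hat = fit_ar1(flux)
residuals = ar1_residuals(flux, phi_hat)

var_flux = statistics.pvariance(flux)
var_residuals = statistics.pvariance(residuals)
print(f"phi_hat={phi_hat:.3f}, variance before={var_flux:.2f}, after={var_residuals:.2f}")
```

The residual series has lower variance than the raw series, which is the property a subsequent transit search would rely on.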
Wolters, Mark A; Dean, C B
2017-01-01
Remote sensing images from Earth-orbiting satellites are a potentially rich data source for monitoring and cataloguing atmospheric health hazards that cover large geographic regions. A method is proposed for classifying such images into hazard and nonhazard regions using the autologistic regression model, which may be viewed as a spatial extension of logistic regression. The method includes a novel and simple approach to parameter estimation that makes it well suited to handling the large and high-dimensional datasets arising from satellite-borne instruments. The methodology is demonstrated on both simulated images and a real application to the identification of forest fire smoke.
Nattee, Cholwich; Khamsemanan, Nirattaya; Lawtrakul, Luckhana; Toochinda, Pisanu; Hannongbua, Supa
2017-01-01
Malaria is still one of the most serious diseases in tropical regions. This is due in part to the high resistance against available drugs for the inhibition of parasites, Plasmodium, the cause of the disease. New potent compounds with high clinical utility are urgently needed. In this work, we created a novel model using a regression tree to study structure-activity relationships and predict the inhibition constant, K_i, of three different antimalarial analogues (Trimethoprim, Pyrimethamine, and Cycloguanil) based on their molecular descriptors. To the best of our knowledge, this work is the first attempt to study the structure-activity relationships of all three analogues combined. The most relevant descriptors and appropriate parameters of the regression tree are harvested using extremely randomized trees. These descriptors are water accessible surface area, Log of the aqueous solubility, total hydrophobic van der Waals surface area, and molecular refractivity. Out of all possible combinations of these selected parameters and descriptors, the tree with the strongest coefficient of determination is selected to be our prediction model. Predicted K_i values from the proposed model show a strong coefficient of determination, R² = 0.996, to experimental K_i values. From the structure of the regression tree, compounds with high accessible surface area of all hydrophobic atoms (ASA_H) and low aqueous solubility of inhibitors (Log S) generally possess low K_i values. Our prediction model can also be utilized as a screening test for new antimalarial drug compounds which may reduce the time and expenses for new drug development. New compounds with high predicted K_i should be excluded from further drug development. It is also our inference that a threshold of ASA_H greater than 575.80 and Log S less than or equal to -4.36 is a sufficient condition for a new compound to possess a low K_i. Copyright © 2016 Elsevier Inc. All rights reserved.
Erdogan, Saffet
2009-10-01
The aim of the study is to describe the inter-province differences in traffic accidents and mortality on roads of Turkey. Two different risk indicators were used to evaluate the road safety performance of the provinces in Turkey. These indicators are the ratios between the number of persons killed in road traffic accidents (1) and the number of accidents (2) (numerators) and their exposure to traffic risk (denominator). Population and the number of registered motor vehicles in the provinces were used as denominators individually. Spatial analyses were performed on the mean annual rate of deaths and on the number of fatal accidents calculated for the period 2001-2006. Empirical Bayes smoothing was used to remove background noise from the raw death and accident rates, because some provinces are sparsely populated and have small numbers of accidents and deaths. Global and local spatial autocorrelation analyses were performed to show whether the provinces with high rates of deaths and accidents are clustered or are located close together by chance. The spatial distribution of provinces with high rates of deaths and accidents was nonrandom; spatial autocorrelation analyses detected clustering at a significance level of P<0.05. Regions with high concentration of fatal accidents and deaths were located in the provinces that contain the roads connecting the Istanbul, Ankara, and Antalya provinces. Accident and death rates were also modeled with some independent variables such as number of motor vehicles, length of roads, and so forth using geographically weighted regression analysis with forward step-wise elimination. The level of statistical significance was taken as P<0.05. Large differences were found between the rates of deaths and accidents according to denominators in the provinces. The geographically weighted regression analyses produced significantly better predictions for both accident rates and death rates than did ordinary least squares regression, as indicated by adjusted R² values. Geographically weighted regression gave adjusted R² values of 0.89-0.99 for death and accident rates, compared with 0.88-0.95 for ordinary least squares regression. Geographically weighted regression has the potential to reveal local patterns in the spatial distribution of rates, which would be ignored by the ordinary least squares approach. The application of spatial analysis and modeling of accident statistics and death rates at provincial level in Turkey will help identify provinces with outstandingly high accident and death rates. This could support more efficient road safety management in Turkey.
Linear regression metamodeling as a tool to summarize and present simulation model results.
Jalal, Hawre; Dowd, Bryan; Sainfort, François; Kuntz, Karen M
2013-10-01
Modelers lack a tool to systematically and clearly present complex model results, including those from sensitivity analyses. The objective was to propose linear regression metamodeling as a tool to increase transparency of decision analytic models and better communicate their results. We used a simplified cancer cure model to demonstrate our approach. The model computed the lifetime cost and benefit of 3 treatment options for cancer patients. We simulated 10,000 cohorts in a probabilistic sensitivity analysis (PSA) and regressed the model outcomes on the standardized input parameter values in a set of regression analyses. We used the regression coefficients to describe measures of sensitivity analyses, including threshold and parameter sensitivity analyses. We also compared the results of the PSA to deterministic full-factorial and one-factor-at-a-time designs. The regression intercept represented the estimated base-case outcome, and the other coefficients described the relative parameter uncertainty in the model. We defined simple relationships that compute the average and incremental net benefit of each intervention. Metamodeling produced outputs similar to traditional deterministic 1-way or 2-way sensitivity analyses but was more reliable since it used all parameter values. Linear regression metamodeling is a simple, yet powerful, tool that can assist modelers in communicating model characteristics and sensitivity analyses.
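The metamodeling step described above can be sketched on a toy problem: draw input parameters, run a decision model, then regress the outcome on the standardized inputs. Everything named here (`toy_net_benefit`, the parameter ranges, the 50,000 valuation) is a hypothetical stand-in, not the cancer cure model from the abstract.

```python
import random
import statistics

def standardize(values):
    """Return z-scores: (value - mean) / standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ols(columns, y):
    """Least-squares fit of y on an intercept plus the given predictor columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

def toy_net_benefit(p_cure, cost):
    """Hypothetical decision model: net benefit of a single treatment option."""
    return 50_000 * p_cure - cost

# Probabilistic sensitivity analysis: sample inputs, run the model once per draw.
rng = random.Random(0)
p_cure = [rng.uniform(0.3, 0.7) for _ in range(2000)]
cost = [rng.uniform(5_000, 15_000) for _ in range(2000)]
outcome = [toy_net_benefit(p, c) for p, c in zip(p_cure, cost)]

# Metamodel: regress the outcome on standardized inputs.
intercept, b_cure, b_cost = ols([standardize(p_cure), standardize(cost)], outcome)
print(f"intercept={intercept:.1f}, b_cure={b_cure:.1f}, b_cost={b_cost:.1f}")
```

Because the inputs are standardized, the intercept equals the mean simulated outcome, and each coefficient is the change in outcome per one standard deviation of that input, which makes the coefficients directly comparable across parameters.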
Liu, Sze Yan; Kawachi, Ichiro; Glymour, M Maria
2012-09-01
Concerns have been raised that education may have greater benefits for persons at high risk of coronary heart disease (CHD) than for those at low risk. We estimated the association of education (less than high school, high school, or college graduates) with 10-year CHD risk and body mass index (BMI), using linear and quantile regression models, in the following two nationally representative datasets: the 2006 wave of the Health and Retirement Survey and the 2003-2008 National Health and Nutrition Examination Survey (NHANES). Higher educational attainment was associated with lower 10-year CHD risk for all groups. However, the magnitude of this association varied considerably across quantiles for some subgroups. For example, among women in NHANES, a high school degree was associated with 4% (95% confidence interval = -9% to 1%) and 17% (-24% to -8%) lower CHD risk in the 10th and 90th percentiles, respectively. For BMI, a college degree was associated with uniform decreases across the distribution for women, but with varying increases for men. Compared with those who had not completed high school, male college graduates in the NHANES sample had a BMI that was 6% greater (2% to 11%) at the 10th percentile of the BMI distribution and 7% lower (-10% to -3%) at the 90th percentile (ie, overweight/obese). Estimates from the Health and Retirement Survey sample and the marginal quantile regression models showed similar patterns. Conventional regression methods may mask important variations in the associations between education and CHD risk.
Random regression analyses using B-splines to model growth of Australian Angus cattle
Meyer, Karin
2005-01-01
Regression on the basis function of B-splines has been advocated as an alternative to orthogonal polynomials in random regression analyses. Basic theory of splines in mixed model analyses is reviewed, and estimates from analyses of weights of Australian Angus cattle from birth to 820 days of age are presented. Data comprised 84 533 records on 20 731 animals in 43 herds, with a high proportion of animals with 4 or more weights recorded. Changes in weights with age were modelled through B-splines of age at recording. A total of thirteen analyses, considering different combinations of linear, quadratic and cubic B-splines and up to six knots, were carried out. Results showed good agreement for all ages with many records, but fluctuated where data were sparse. On the whole, analyses using B-splines appeared more robust against "end-of-range" problems and yielded more consistent and accurate estimates of the first eigenfunctions than previous, polynomial analyses. A model fitting quadratic B-splines, with knots at 0, 200, 400, 600 and 821 days and a total of 91 covariance components, appeared to be a good compromise between detailedness of the model, number of parameters to be estimated, plausibility of results, and fit, measured as residual mean square error. PMID:16093011
Senn, Stephen; Graf, Erika; Caputo, Angelika
2007-12-30
Stratifying and matching by the propensity score are increasingly popular approaches to deal with confounding in medical studies investigating effects of a treatment or exposure. A more traditional alternative technique is the direct adjustment for confounding in regression models. This paper discusses fundamental differences between the two approaches, with a focus on linear regression and propensity score stratification, and identifies points to be considered for an adequate comparison. The treatment estimators are examined for unbiasedness and efficiency. This is illustrated in an application to real data and supplemented by an investigation on properties of the estimators for a range of underlying linear models. We demonstrate that in specific circumstances the propensity score estimator is identical to the effect estimated from a full linear model, even if it is built on coarser covariate strata than the linear model. As a consequence the coarsening property of the propensity score (adjustment for a one-dimensional confounder instead of a high-dimensional covariate) may be viewed as a way to implement a pre-specified, richly parametrized linear model. We conclude that the propensity score estimator inherits the potential for overfitting and that care should be taken to restrict covariates to those relevant for outcome. Copyright (c) 2007 John Wiley & Sons, Ltd.
An Entropy-Based Measure for Assessing Fuzziness in Logistic Regression
Weiss, Brandi A.; Dardick, William
2015-01-01
This article introduces an entropy-based measure of data–model fit that can be used to assess the quality of logistic regression models. Entropy has previously been used in mixture-modeling to quantify how well individuals are classified into latent classes. The current study proposes the use of entropy for logistic regression models to quantify the quality of classification and separation of group membership. Entropy complements preexisting measures of data–model fit and provides unique information not contained in other measures. Hypothetical data scenarios, an applied example, and Monte Carlo simulation results are used to demonstrate the application of entropy in logistic regression. Entropy should be used in conjunction with other measures of data–model fit to assess how well logistic regression models classify cases into observed categories. PMID:29795897
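To make the idea concrete, here is a minimal sketch of a normalized entropy statistic of the kind commonly used in mixture modeling, applied to predicted probabilities from a binary logistic regression. Whether the article's measure has exactly this form is an assumption, and `normalized_entropy` is a hypothetical name.

```python
import math

def normalized_entropy(probabilities):
    """1 minus average binary entropy (in units of ln 2) of predicted probabilities.

    Values near 1 indicate sharp classification; values near 0 indicate
    predictions close to 0.5, i.e. fuzzy group membership.
    """
    def binary_entropy(p):
        # Convention: 0 * log(0) is treated as 0.
        if p in (0.0, 1.0):
            return 0.0
        return -(p * math.log(p) + (1 - p) * math.log(1 - p))

    total = sum(binary_entropy(p) for p in probabilities)
    return 1.0 - total / (len(probabilities) * math.log(2))

sharp = normalized_entropy([0.05, 0.95, 0.02, 0.99])
fuzzy = normalized_entropy([0.45, 0.55, 0.40, 0.60])
print(f"sharp predictions: {sharp:.3f}, fuzzy predictions: {fuzzy:.3f}")
```

Under this definition, a model whose fitted probabilities sit near 0 or 1 scores higher than one whose probabilities hover around 0.5, independently of whether the hard classifications are correct, which is why it is used alongside other fit measures.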
An Entropy-Based Measure for Assessing Fuzziness in Logistic Regression.
Weiss, Brandi A; Dardick, William
2016-12-01
This article introduces an entropy-based measure of data-model fit that can be used to assess the quality of logistic regression models. Entropy has previously been used in mixture-modeling to quantify how well individuals are classified into latent classes. The current study proposes the use of entropy for logistic regression models to quantify the quality of classification and separation of group membership. Entropy complements preexisting measures of data-model fit and provides unique information not contained in other measures. Hypothetical data scenarios, an applied example, and Monte Carlo simulation results are used to demonstrate the application of entropy in logistic regression. Entropy should be used in conjunction with other measures of data-model fit to assess how well logistic regression models classify cases into observed categories.
Predicting Quantitative Traits With Regression Models for Dense Molecular Markers and Pedigree
de los Campos, Gustavo; Naya, Hugo; Gianola, Daniel; Crossa, José; Legarra, Andrés; Manfredi, Eduardo; Weigel, Kent; Cotes, José Miguel
2009-01-01
The availability of genomewide dense markers brings opportunities and challenges to breeding programs. An important question concerns the ways in which dense markers and pedigrees, together with phenotypic records, should be used to arrive at predictions of genetic values for complex traits. If a large number of markers are included in a regression model, marker-specific shrinkage of regression coefficients may be needed. For this reason, the Bayesian least absolute shrinkage and selection operator (LASSO) (BL) appears to be an interesting approach for fitting marker effects in a regression model. This article adapts the BL to arrive at a regression model where markers, pedigrees, and covariates other than markers are considered jointly. Connections between BL and other marker-based regression models are discussed, and the sensitivity of BL with respect to the choice of prior distributions assigned to key parameters is evaluated using simulation. The proposed model was fitted to two data sets from wheat and mouse populations, and evaluated using cross-validation methods. Results indicate that inclusion of markers in the regression further improved the predictive ability of models. An R program that implements the proposed model is freely available. PMID:19293140
NASA Astrophysics Data System (ADS)
Kozioł, Michał
2017-10-01
The article presents a parametric model describing the recorded spectral distributions of optical radiation emitted by electrical discharges generated in three systems: needle-needle, needle-plate, and a system for surface discharges. Generation of electrical discharges and registration of the emitted radiation were carried out in three different electrical insulating oils: factory new, operated (used), and operated with air bubbles. A high-resolution spectrophotometer was used to register optical spectra in the ultraviolet, visible, and near-infrared ranges. The proposed mathematical model was developed in a regression procedure using a Gauss-sigmoid type function. The dependent variable was the intensity of the recorded optical signals. An evolutionary algorithm was used to estimate the optimal parameters of the model. The optimization procedure was performed in the Matlab environment. The coefficient of determination R² was used to assess how well the regression function fits the empirical data.
Casemix funding for a specialist paediatrics hospital: a hedonic regression approach.
Bridges, J F; Hanson, R M
2000-01-01
This paper inquires into the effects that Diagnosis Related Groups (DRGs) have had on the ability to explain patient-level costs in a specialist paediatrics hospital. Two hedonic models are estimated using 1996/97 New Children's Hospital (NCH) patient level cost data, one with and one without a casemix index (CMI). The results show that the inclusion of a casemix index as an explanatory variable leads to a better accounting of cost. The full hedonic model is then used to simulate a funding model for the 1997/98 NCH cost data. These costs are highly correlated with the actual costs reported for that year. In addition, univariate regression indicates that there has been inflation in costs in the order of 4.8% between the two years. In conclusion, hedonic analysis can provide valuable evidence for the design of funding models that account for casemix.
Bayesian function-on-function regression for multilevel functional data.
Meyer, Mark J; Coull, Brent A; Versace, Francesco; Cinciripini, Paul; Morris, Jeffrey S
2015-09-01
Medical and public health research increasingly involves the collection of complex and high dimensional data. In particular, functional data, where the unit of observation is a curve or set of curves that are finely sampled over a grid, is frequently obtained. Moreover, researchers often sample multiple curves per person resulting in repeated functional measures. A common question is how to analyze the relationship between two functional variables. We propose a general function-on-function regression model for repeatedly sampled functional data on a fine grid, presenting a simple model as well as a more extensive mixed model framework, and introducing various functional Bayesian inferential procedures that account for multiple testing. We examine these models via simulation and a data analysis with data from a study that used event-related potentials to examine how the brain processes various types of images. © 2015, The International Biometric Society.
Characterizing multivariate decoding models based on correlated EEG spectral features
McFarland, Dennis J.
2013-01-01
Objective: Multivariate decoding methods are popular techniques for analysis of neurophysiological data. The present study explored potential interpretative problems with these techniques when predictors are correlated. Methods: Data from sensorimotor rhythm-based cursor control experiments was analyzed offline with linear univariate and multivariate models. Features were derived from autoregressive (AR) spectral analysis of varying model order which produced predictors that varied in their degree of correlation (i.e., multicollinearity). Results: The use of multivariate regression models resulted in much better prediction of target position as compared to univariate regression models. However, with lower order AR features interpretation of the spectral patterns of the weights was difficult. This is likely to be due to the high degree of multicollinearity present with lower order AR features. Conclusions: Care should be exercised when interpreting the pattern of weights of multivariate models with correlated predictors. Comparison with univariate statistics is advisable. Significance: While multivariate decoding algorithms are very useful for prediction their utility for interpretation may be limited when predictors are correlated. PMID:23466267
Flexible Meta-Regression to Assess the Shape of the Benzene–Leukemia Exposure–Response Curve
Vlaanderen, Jelle; Portengen, Lützen; Rothman, Nathaniel; Lan, Qing; Kromhout, Hans; Vermeulen, Roel
2010-01-01
Background: Previous evaluations of the shape of the benzene–leukemia exposure–response curve (ERC) were based on a single set or on small sets of human occupational studies. Integrating evidence from all available studies that are of sufficient quality combined with flexible meta-regression models is likely to provide better insight into the functional relation between benzene exposure and risk of leukemia. Objectives: We used natural splines in a flexible meta-regression method to assess the shape of the benzene–leukemia ERC. Methods: We fitted meta-regression models to 30 aggregated risk estimates extracted from nine human observational studies and performed sensitivity analyses to assess the impact of a priori assessed study characteristics on the predicted ERC. Results: The natural spline showed a supralinear shape at cumulative exposures less than 100 ppm-years, although this model fitted the data only marginally better than a linear model (p = 0.06). Stratification based on study design and jackknifing indicated that the cohort studies had a considerable impact on the shape of the ERC at high exposure levels (> 100 ppm-years) but that predicted risks for the low exposure range (< 50 ppm-years) were robust. Conclusions: Although limited by the small number of studies and the large heterogeneity between studies, the inclusion of all studies of sufficient quality combined with a flexible meta-regression method provides the most comprehensive evaluation of the benzene–leukemia ERC to date. The natural spline based on all data indicates a significantly increased risk of leukemia [relative risk (RR) = 1.14; 95% confidence interval (CI), 1.04–1.26] at an exposure level as low as 10 ppm-years. PMID:20064779
Stature estimation from the lengths of the growing foot-a study on North Indian adolescents.
Krishan, Kewal; Kanchan, Tanuj; Passi, Neelam; DiMaggio, John A
2012-12-01
Stature estimation is considered as one of the basic parameters of the investigation process in unknown and commingled human remains in medico-legal case work. Race, age and sex are the other parameters which help in this process. Stature estimation is of the utmost importance as it completes the biological profile of a person along with the other three parameters of identification. The present research is intended to formulate standards for stature estimation from foot dimensions in adolescent males from North India and study the pattern of foot growth during the growing years. 154 male adolescents from the Northern part of India were included in the study. Besides stature, five anthropometric measurements that included the length of the foot from each toe (T1, T2, T3, T4, and T5 respectively) to pternion were measured on each foot. The data was analyzed statistically using Student's t-test, Pearson's correlation, linear and multiple regression analysis for estimation of stature and growth of foot during ages 13-18 years. Correlation coefficients between stature and all the foot measurements were found to be highly significant and positively correlated. Linear regression models and multiple regression models (with age as a co-variable) were derived for estimation of stature from the different measurements of the foot. Multiple regression models (with age as a co-variable) estimate stature with greater accuracy than the regression models for 13-18 years age group. The study shows the growth pattern of feet in North Indian adolescents and indicates that anthropometric measurements of the foot and its segments are valuable in estimation of stature in growing individuals of that population. Copyright © 2012 Elsevier Ltd. All rights reserved.
Ren, Yilong; Wang, Yunpeng; Wu, Xinkai; Yu, Guizhen; Ding, Chuan
2016-10-01
Red light running (RLR) has become a major safety concern at signalized intersections. To prevent RLR-related crashes, it is critical to identify the factors that significantly affect drivers' RLR behavior and to predict potential RLR in real time. In this research, nine months of RLR events extracted from high-resolution traffic data collected by loop detectors at three signalized intersections were used to identify the factors that significantly affect RLR behavior. The data analysis indicated that occupancy time, time gap, used yellow time, time left to yellow start, whether the preceding vehicle runs through the intersection during yellow, and whether a vehicle is passing through the intersection in the adjacent lane were significant factors for RLR behavior. Furthermore, because RLR events are rare, a modified rare events logistic regression model was developed for RLR prediction. The rare events logistic regression method has been applied to rare-event studies in many fields and shows impressive performance, but no previous research has applied it to RLR. The results showed that the rare events logistic regression model performed significantly better than the standard logistic regression model. More importantly, the proposed RLR prediction method is based purely on data collected from a single advance loop detector located 400 feet from the stop-bar. This brings great potential for future field applications of the proposed method, since loops have been widely installed at many intersections and can collect data in real time. This research is expected to contribute significantly to the improvement of intersection safety. Copyright © 2016 Elsevier Ltd. All rights reserved.
Construction of mathematical model for measuring material concentration by colorimetric method
NASA Astrophysics Data System (ADS)
Liu, Bing; Gao, Lingceng; Yu, Kairong; Tan, Xianghua
2018-06-01
This paper uses multiple linear regression to analyze the data for Problem C of the 2017 mathematical modeling problems. First, we established a regression model for the concentrations of five substances, but only the regression model for the concentration of urea in milk passed the significance test. The regression model established from the second set of data also passed the significance test, but it exhibits serious multicollinearity. We improved the model by principal component analysis. The improved model is used to control the system so that the concentration of a material can be measured by the direct colorimetric method.
Developing and testing a global-scale regression model to quantify mean annual streamflow
NASA Astrophysics Data System (ADS)
Barbarossa, Valerio; Huijbregts, Mark A. J.; Hendriks, A. Jan; Beusen, Arthur H. W.; Clavreul, Julie; King, Henry; Schipper, Aafke M.
2017-01-01
Quantifying mean annual flow of rivers (MAF) at ungauged sites is essential for assessments of global water supply, ecosystem integrity and water footprints. MAF can be quantified with spatially explicit process-based models, which might be overly time-consuming and data-intensive for this purpose, or with empirical regression models that predict MAF based on climate and catchment characteristics. Yet, regression models have mostly been developed at a regional scale and the extent to which they can be extrapolated to other regions is not known. In this study, we developed a global-scale regression model for MAF based on a dataset unprecedented in size, using observations of discharge and catchment characteristics from 1885 catchments worldwide, measuring between 2 and 10^6 km^2. In addition, we compared the performance of the regression model with the predictive ability of the spatially explicit global hydrological model PCR-GLOBWB by comparing results from both models to independent measurements. We obtained a regression model explaining 89% of the variance in MAF based on catchment area and catchment averaged mean annual precipitation and air temperature, slope and elevation. The regression model performed better than PCR-GLOBWB for the prediction of MAF, as root-mean-square error (RMSE) values were lower (0.29-0.38 compared to 0.49-0.57) and the modified index of agreement (d) was higher (0.80-0.83 compared to 0.72-0.75). Our regression model can be applied globally to estimate MAF at any point of the river network, thus providing a feasible alternative to spatially explicit process-based global hydrological models.
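The two skill scores quoted in this abstract can be sketched as follows. This is an illustrative implementation, assuming the "modified index of agreement" is the absolute-value form usually attributed to Willmott, d = 1 - sum|P - O| / sum(|P - mean(O)| + |O - mean(O)|); the abstract does not say on which scale (raw or transformed flow) the scores were computed, and the function names are hypothetical.

```python
import math

def rmse(obs, pred):
    """Root-mean-square error between observations and predictions."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))

def modified_index_of_agreement(obs, pred):
    """Modified index of agreement (absolute-value form, assumed definition).

    1 means perfect agreement; values decrease as predictions depart
    from the observations relative to the spread around the observed mean.
    """
    obs_mean = sum(obs) / len(obs)
    numerator = sum(abs(p - o) for o, p in zip(obs, pred))
    denominator = sum(abs(p - obs_mean) + abs(o - obs_mean)
                      for o, p in zip(obs, pred))
    return 1.0 - numerator / denominator

observed = [1.0, 2.0, 3.0]
predicted = [2.0, 3.0, 4.0]
print(rmse(observed, predicted))
print(modified_index_of_agreement(observed, predicted))
```

Lower RMSE and higher d both indicate better agreement, which is the direction of the comparison reported between the regression model and PCR-GLOBWB.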
Baba, Hiromi; Takahara, Jun-ichi; Yamashita, Fumiyoshi; Hashida, Mitsuru
2015-11-01
The solvent effect on skin permeability is important for assessing the effectiveness and toxicological risk of new dermatological formulations in pharmaceuticals and cosmetics development. The solvent effect occurs by diverse mechanisms, which could be elucidated by efficient and reliable prediction models. However, such prediction models have been hampered by the small variety of permeants and mixture components archived in databases and by low predictive performance. Here, we propose a solution to both problems. We first compiled a novel large database of 412 samples from 261 structurally diverse permeants and 31 solvents reported in the literature. The data were carefully screened to ensure their collection under consistent experimental conditions. To construct a high-performance predictive model, we then applied support vector regression (SVR) and random forest (RF) with greedy stepwise descriptor selection to our database. The models were internally and externally validated. The SVR achieved higher performance statistics than RF. The (externally validated) determination coefficient, root mean square error, and mean absolute error of SVR were 0.899, 0.351, and 0.268, respectively. Moreover, because all descriptors are fully computational, our method can predict as-yet unsynthesized compounds. Our high-performance prediction model offers an attractive alternative to permeability experiments for pharmaceutical and cosmetic candidate screening and optimizing skin-permeable topical formulations.
NASA Technical Reports Server (NTRS)
Parsons, Vickie S.
2009-01-01
The request to conduct an independent review of regression models, developed for determining the expected Launch Commit Criteria (LCC) External Tank (ET)-04 cycle count for the Space Shuttle ET tanking process, was submitted to the NASA Engineering and Safety Center (NESC) on September 20, 2005. The NESC team performed an independent review of regression models documented in Prepress Regression Analysis, Tom Clark and Angela Krenn, 10/27/05. This consultation consisted of a peer review by statistical experts of the proposed regression models provided in the Prepress Regression Analysis. This document is the consultation's final report.
Stochastic Approximation Methods for Latent Regression Item Response Models
ERIC Educational Resources Information Center
von Davier, Matthias; Sinharay, Sandip
2010-01-01
This article presents an application of a stochastic approximation expectation maximization (EM) algorithm using a Metropolis-Hastings (MH) sampler to estimate the parameters of an item response latent regression model. Latent regression item response models are extensions of item response theory (IRT) to a latent variable model with covariates…
Using Weighted Least Squares Regression for Obtaining Langmuir Sorption Constants
USDA-ARS's Scientific Manuscript database
One of the most commonly used models for describing phosphorus (P) sorption to soils is the Langmuir model. To obtain model parameters, the Langmuir model is fit to measured sorption data using least squares regression. Least squares regression is based on several assumptions including normally dist...
Regression analysis using dependent Polya trees.
Schörgendorfer, Angela; Branscum, Adam J
2013-11-30
Many commonly used models for linear regression analysis force overly simplistic shape and scale constraints on the residual structure of data. We propose a semiparametric Bayesian model for regression analysis that produces data-driven inference by using a new type of dependent Polya tree prior to model arbitrary residual distributions that are allowed to evolve across increasing levels of an ordinal covariate (e.g., time, in repeated measurement studies). By modeling residual distributions at consecutive covariate levels or time points using separate, but dependent Polya tree priors, distributional information is pooled while allowing for broad pliability to accommodate many types of changing residual distributions. We can use the proposed dependent residual structure in a wide range of regression settings, including fixed-effects and mixed-effects linear and nonlinear models for cross-sectional, prospective, and repeated measurement data. A simulation study illustrates the flexibility of our novel semiparametric regression model to accurately capture evolving residual distributions. In an application to immune development data on immunoglobulin G antibodies in children, our new model outperforms several contemporary semiparametric regression models based on a predictive model selection criterion. Copyright © 2013 John Wiley & Sons, Ltd.
NASA Astrophysics Data System (ADS)
Rhee, Jinyoung; Kim, Gayoung; Im, Jungho
2017-04-01
Three regions of Indonesia with different rainfall characteristics were chosen to develop drought forecast models based on machine learning. The 6-month Standardized Precipitation Index (SPI6) was selected as the target variable. The models' forecast skill was compared to the skill of long-range climate forecast models in terms of drought accuracy and regression mean absolute error (MAE). Indonesian droughts are known to be related to El Nino Southern Oscillation (ENSO) variability, despite regional differences, as well as to the monsoon, local sea surface temperature (SST), other large-scale atmosphere-ocean interactions such as the Indian Ocean Dipole (IOD) and Southern Pacific Convergence Zone (SPCZ), and local factors including topography and elevation. Machine learning models are thus intended to enhance drought forecast skill by combining local and remote SST and remote sensing information reflecting initial drought conditions with the long-range climate forecast model results. A total of 126 machine learning models were developed for the three regions of West Java (JB), West Sumatra (SB), and Gorontalo (GO), six long-range climate forecast models (MSC_CanCM3, MSC_CanCM4, NCEP, NASA, PNU, and POAMA) as well as one climatology model based on remote sensing precipitation data, and 1- to 6-month lead times. When the results of the machine learning models and the long-range climate forecast models were compared, the West Java and Gorontalo regions showed similar characteristics in terms of drought accuracy. The drought accuracy of the long-range climate forecast models was generally higher than that of the machine learning models at short lead times, but the opposite held at longer lead times. For West Sumatra, however, the machine learning models and the long-range climate forecast models showed similar drought accuracy. The machine learning models showed smaller regression errors for all three regions, especially at longer lead times.
Among the three regions, the machine learning models developed for Gorontalo showed the highest drought accuracy and the lowest regression error. West Java showed higher drought accuracy compared to West Sumatra, while West Sumatra showed lower regression error compared to West Java. The lower error in West Sumatra may be because of the smaller sample size used for training and evaluation for the region. Regional differences in forecast skill are determined by the effect of ENSO and the resulting forecast skill of the long-range climate forecast models. Although somewhat high in West Sumatra, the relative importance of remote sensing variables was low in most cases. The high importance of the variables based on long-range climate forecast models indicates that the forecast skill of the machine learning models is mostly determined by the forecast skill of the climate models.
NASA Astrophysics Data System (ADS)
Verrelst, Jochem; Rivera, J. P.; Alonso, L.; Guanter, L.; Moreno, J.
2012-04-01
ESA’s upcoming satellites Sentinel-2 (S2) and Sentinel-3 (S3) aim to ensure continuity for Landsat 5/7, SPOT-5, SPOT-Vegetation and Envisat MERIS observations by providing superspectral images of high spatial and temporal resolution. S2 and S3 will deliver near real-time operational products with a high accuracy for land monitoring. This unprecedented data availability leads to an urgent need for developing robust and accurate retrieval methods. Machine learning regression algorithms could be powerful candidates for the estimation of biophysical parameters from satellite reflectance measurements because of their ability to perform adaptive, nonlinear data fitting. By using data from the ESA-led field campaign SPARC (Barrax, Spain), it was recently found [1] that Gaussian processes regression (GPR) outperformed competitive machine learning algorithms such as neural networks, support vector regression and kernel ridge regression, both in terms of accuracy and computational speed. For various Sentinel configurations (S2-10m, S2-20m, S2-60m and S3-300m) three important biophysical parameters were estimated: leaf chlorophyll content (Chl), leaf area index (LAI) and fractional vegetation cover (FVC). GPR was the only method that reached the 10% precision required by end users in the estimation of Chl. In view of implementing the regressor into operational monitoring applications, here the portability of locally trained GPR models to other images was evaluated. The associated confidence maps proved to be a good indicator for evaluating the robustness of the trained models. Consistent retrievals were obtained across the different images, particularly over agricultural sites. To make the method suitable for operational use, however, the poorer confidences over bare soil areas suggest that the training dataset should be expanded with inputs from various land cover types.
Geodesic least squares regression on information manifolds
DOE Office of Scientific and Technical Information (OSTI.GOV)
Verdoolaege, Geert, E-mail: geert.verdoolaege@ugent.be
We present a novel regression method targeted at situations with significant uncertainty on both the dependent and independent variables or with non-Gaussian distribution models. Unlike the classic regression model, the conditional distribution of the response variable suggested by the data need not be the same as the modeled distribution. Instead they are matched by minimizing the Rao geodesic distance between them. This yields a more flexible regression method that is less constrained by the assumptions imposed through the regression model. As an example, we demonstrate the improved resistance of our method against some flawed model assumptions and we apply this to scaling laws in magnetic confinement fusion.
Lim, Jongguk; Kim, Giyoung; Mo, Changyeun; Kim, Moon S; Chao, Kuanglin; Qin, Jianwei; Fu, Xiaping; Baek, Insuck; Cho, Byoung-Kwan
2016-05-01
Illegal use of nitrogen-rich melamine (C3H6N6) to boost perceived protein content of food products such as milk, infant formula, frozen yogurt, pet food, biscuits, and coffee drinks has caused serious food safety problems. Conventional methods to detect melamine in foods, such as Enzyme-linked immunosorbent assay (ELISA), High-performance liquid chromatography (HPLC), and Gas chromatography-mass spectrometry (GC-MS), are sensitive but they are time-consuming, expensive, and labor-intensive. In this research, near-infrared (NIR) hyperspectral imaging technique combined with regression coefficient of partial least squares regression (PLSR) model was used to detect melamine particles in milk powders easily and quickly. NIR hyperspectral reflectance imaging data in the spectral range of 990-1700nm were acquired from melamine-milk powder mixture samples prepared at various concentrations ranging from 0.02% to 1%. PLSR models were developed to correlate the spectral data (independent variables) with melamine concentration (dependent variables) in melamine-milk powder mixture samples. PLSR models applying various pretreatment methods were used to reconstruct the two-dimensional PLS images. PLS images were converted to the binary images to detect the suspected melamine pixels in milk powder. As the melamine concentration was increased, the numbers of suspected melamine pixels of binary images were also increased. These results suggested that NIR hyperspectral imaging technique and the PLSR model can be regarded as an effective tool to detect melamine particles in milk powders. Copyright © 2016 Elsevier B.V. All rights reserved.
NASA Astrophysics Data System (ADS)
Ishizaki, N. N.; Dairaku, K.; Ueno, G.
2016-12-01
We have developed a statistical downscaling method for estimating probabilistic climate projections using multiple CMIP5 general circulation models (GCMs). A regression model was established so that the combination of weights of GCMs reflects the characteristics of the variation of observations at each grid point. Cross validations were conducted to select GCMs and to evaluate the regression model to avoid multicollinearity. Using a spatially high-resolution observation system, we produced statistically downscaled probabilistic climate projections with 20-km horizontal grid spacing. Root mean squared errors for monthly mean surface air temperature and precipitation estimated by the regression method were the smallest compared with the results derived from a simple ensemble mean of GCMs and a cumulative distribution function based bias correction method. Projected changes in the mean temperature and precipitation were broadly similar to those of the simple ensemble mean of GCMs. Mean precipitation was generally projected to increase in association with increased temperature and the consequent increase in atmospheric moisture content. Weakening of the winter monsoon may cause precipitation to decrease in some areas. A temperature increase in excess of 4 K was projected in most areas of Japan at the end of the 21st century under the RCP8.5 scenario. The estimated probability of monthly precipitation exceeding 300 mm would increase around the Pacific side during the summer and the Japan Sea side during the winter season. This probabilistic climate projection based on the statistical method can be expected to bring useful information to impact studies and risk assessments.
ERIC Educational Resources Information Center
Malin, Heather; Han, Hyemin; Liauw, Indrawati
2017-01-01
This study investigated the effects of internal and demographic variables on civic development in late adolescence using the construct "civic purpose." We conducted surveys on civic engagement with 480 high school seniors, and surveyed them again 2 years later. Using multivariate regression and linear mixed models, we tested the main…
ERIC Educational Resources Information Center
Rhea, David M.
2017-01-01
Many honors programs make admissions decisions based on student high school GPA and a standardized test score. However, McKay argued that standardized test scores can be a barrier to honors program participation, particularly for minority students. Minority students, particularly Hispanic and African American students, are apt to have lower…
Effects of Parental Divorce or a Father's Death on High School Completion
ERIC Educational Resources Information Center
Sapharas, Nicole K.; Estell, David B.; Doran, Kelly A.; Waldron, Mary
2016-01-01
Associations between parental loss and high school (HS) completion were examined in data drawn from 1,761 male and 1,689 female offspring born in wedlock to mothers participating in a nationally representative study. Multiple logistic regression models were conducted predicting HS completion by age 19 among offspring whose parents divorced or…
ERIC Educational Resources Information Center
Joyce, Beverly A.; Farenga, Stephen J.
1999-01-01
Examines specific science-related attitudes, informal science-related experiences, future interest in science, and gender of young high-ability students (n=111) who completed the Test of Science Related Attitudes (TOSRA), the Science Experience Survey (SES), and the Course Selection Sheet (CSS). Develops two regression models to predict the number…
Ngwa, Julius S; Cabral, Howard J; Cheng, Debbie M; Pencina, Michael J; Gagnon, David R; LaValley, Michael P; Cupples, L Adrienne
2016-11-03
Typical survival studies follow individuals to an event and measure explanatory variables for that event, sometimes repeatedly over the course of follow up. The Cox regression model has been used widely in the analyses of time to diagnosis or death from disease. The associations between the survival outcome and time dependent measures may be biased unless they are modeled appropriately. In this paper we explore the Time Dependent Cox Regression Model (TDCM), which quantifies the effect of repeated measures of covariates in the analysis of time to event data. This model is commonly used in biomedical research but sometimes does not explicitly adjust for the times at which time dependent explanatory variables are measured. This approach can yield different estimates of association compared to a model that adjusts for these times. In order to address the question of how different these estimates are from a statistical perspective, we compare the TDCM to Pooled Logistic Regression (PLR) and Cross Sectional Pooling (CSP), considering models that adjust and do not adjust for time in PLR and CSP. In a series of simulations we found that time adjusted CSP provided identical results to the TDCM while the PLR showed larger parameter estimates compared to the time adjusted CSP and the TDCM in scenarios with high event rates. We also observed upwardly biased estimates in the unadjusted CSP and unadjusted PLR methods. The time adjusted PLR had a positive bias in the time dependent Age effect with reduced bias when the event rate is low. The PLR methods showed a negative bias in the Sex effect, a subject level covariate, when compared to the other methods. The Cox models yielded reliable estimates for the Sex effect in all scenarios considered. We conclude that survival analyses that explicitly account in the statistical model for the times at which time dependent covariates are measured provide more reliable estimates compared to unadjusted analyses. 
We present results from the Framingham Heart Study in which lipid measurements and myocardial infarction data events were collected over a period of 26 years.
Robust, Adaptive Functional Regression in Functional Mixed Model Framework.
Zhu, Hongxiao; Brown, Philip J; Morris, Jeffrey S
2011-09-01
Functional data are increasingly encountered in scientific studies, and their high dimensionality and complexity lead to many analytical challenges. Various methods for functional data analysis have been developed, including functional response regression methods that involve regression of a functional response on univariate/multivariate predictors with nonparametrically represented functional coefficients. In existing methods, however, the functional regression can be sensitive to outlying curves and outlying regions of curves, so is not robust. In this paper, we introduce a new Bayesian method, robust functional mixed models (R-FMM), for performing robust functional regression within the general functional mixed model framework, which includes multiple continuous or categorical predictors and random effect functions accommodating potential between-function correlation induced by the experimental design. The underlying model involves a hierarchical scale mixture model for the fixed effects, random effect and residual error functions. These modeling assumptions across curves result in robust nonparametric estimators of the fixed and random effect functions which down-weight outlying curves and regions of curves, and produce statistics that can be used to flag global and local outliers. These assumptions also lead to distributions across wavelet coefficients that have outstanding sparsity and adaptive shrinkage properties, with great flexibility for the data to determine the sparsity and the heaviness of the tails. Together with the down-weighting of outliers, these within-curve properties lead to fixed and random effect function estimates that appear in our simulations to be remarkably adaptive in their ability to remove spurious features yet retain true features of the functions. We have developed general code to implement this fully Bayesian method that is automatic, requiring the user to only provide the functional data and design matrices. 
It is efficient enough to handle large data sets, and yields posterior samples of all model parameters that can be used to perform desired Bayesian estimation and inference. Although we present details for a specific implementation of the R-FMM using specific distributional choices in the hierarchical model, 1D functions, and wavelet transforms, the method can be applied more generally using other heavy-tailed distributions, higher dimensional functions (e.g. images), and using other invertible transformations as alternatives to wavelets.
Robust, Adaptive Functional Regression in Functional Mixed Model Framework
Zhu, Hongxiao; Brown, Philip J.; Morris, Jeffrey S.
2012-01-01
Functional data are increasingly encountered in scientific studies, and their high dimensionality and complexity lead to many analytical challenges. Various methods for functional data analysis have been developed, including functional response regression methods that involve regression of a functional response on univariate/multivariate predictors with nonparametrically represented functional coefficients. In existing methods, however, the functional regression can be sensitive to outlying curves and outlying regions of curves, so is not robust. In this paper, we introduce a new Bayesian method, robust functional mixed models (R-FMM), for performing robust functional regression within the general functional mixed model framework, which includes multiple continuous or categorical predictors and random effect functions accommodating potential between-function correlation induced by the experimental design. The underlying model involves a hierarchical scale mixture model for the fixed effects, random effect and residual error functions. These modeling assumptions across curves result in robust nonparametric estimators of the fixed and random effect functions which down-weight outlying curves and regions of curves, and produce statistics that can be used to flag global and local outliers. These assumptions also lead to distributions across wavelet coefficients that have outstanding sparsity and adaptive shrinkage properties, with great flexibility for the data to determine the sparsity and the heaviness of the tails. Together with the down-weighting of outliers, these within-curve properties lead to fixed and random effect function estimates that appear in our simulations to be remarkably adaptive in their ability to remove spurious features yet retain true features of the functions. We have developed general code to implement this fully Bayesian method that is automatic, requiring the user to only provide the functional data and design matrices. 
It is efficient enough to handle large data sets, and yields posterior samples of all model parameters that can be used to perform desired Bayesian estimation and inference. Although we present details for a specific implementation of the R-FMM using specific distributional choices in the hierarchical model, 1D functions, and wavelet transforms, the method can be applied more generally using other heavy-tailed distributions, higher dimensional functions (e.g. images), and using other invertible transformations as alternatives to wavelets. PMID:22308015
Warton, David I; Thibaut, Loïc; Wang, Yi Alice
2017-01-01
Bootstrap methods are widely used in statistics, and bootstrapping of residuals can be especially useful in the regression context. However, difficulties are encountered extending residual resampling to regression settings where residuals are not identically distributed (thus not amenable to bootstrapping)-common examples including logistic or Poisson regression and generalizations to handle clustered or multivariate data, such as generalised estimating equations. We propose a bootstrap method based on probability integral transform (PIT-) residuals, which we call the PIT-trap, which assumes data come from some marginal distribution F of known parametric form. This method can be understood as a type of "model-free bootstrap", adapted to the problem of discrete and highly multivariate data. PIT-residuals have the key property that they are (asymptotically) pivotal. The PIT-trap thus inherits the key property, not afforded by any other residual resampling approach, that the marginal distribution of data can be preserved under PIT-trapping. This in turn enables the derivation of some standard bootstrap properties, including second-order correctness of pivotal PIT-trap test statistics. In multivariate data, bootstrapping rows of PIT-residuals affords the property that it preserves correlation in data without the need for it to be modelled, a key point of difference as compared to a parametric bootstrap. The proposed method is illustrated on an example involving multivariate abundance data in ecology, and demonstrated via simulation to have improved properties as compared to competing resampling methods.
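The idea of resampling probability integral transform (PIT) residuals can be sketched for the simplest case. The code below is an illustration under stated assumptions, not the authors' implementation: it assumes a univariate Poisson marginal with known fitted means, uses the standard randomized PIT for discrete data, u = F(y - 1) + v (F(y) - F(y - 1)) with v uniform on (0, 1), and all function names are hypothetical.

```python
import math
import random

def poisson_cdf(k, mu):
    """P(Y <= k) for Y ~ Poisson(mu); returns 0 for k < 0."""
    if k < 0:
        return 0.0
    term = math.exp(-mu)
    total = term
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return min(total, 1.0)

def poisson_quantile(u, mu, k_max=10_000):
    """Smallest k with P(Y <= k) >= u (search capped at k_max for safety)."""
    term = math.exp(-mu)
    total = term
    k = 0
    while total < u and k < k_max:
        k += 1
        term *= mu / k
        total += term
    return k

def pit_residual(y, mu, rng):
    """Randomized PIT residual of a count y under a Poisson(mu) marginal."""
    lower = poisson_cdf(y - 1, mu)
    upper = poisson_cdf(y, mu)
    return lower + rng.random() * (upper - lower)

def pit_resample(counts, fitted_means, rng):
    """Resample PIT residuals with replacement and map them back to counts.

    Each resampled residual is inverted through the fitted marginal of the
    observation it is assigned to, so the new data follow that marginal.
    """
    residuals = [pit_residual(y, mu, rng) for y, mu in zip(counts, fitted_means)]
    drawn = [rng.choice(residuals) for _ in counts]
    return [poisson_quantile(u, mu) for u, mu in zip(drawn, fitted_means)]

rng = random.Random(0)
print(pit_resample([0, 2, 5, 1], [0.8, 2.1, 4.5, 1.3], rng))
```

In the multivariate setting described in the abstract, whole rows of residuals would be resampled together so that correlation between responses is carried along without being modelled; the sketch above omits that step.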
Thibaut, Loïc; Wang, Yi Alice
2017-01-01
Bootstrap methods are widely used in statistics, and bootstrapping of residuals can be especially useful in the regression context. However, difficulties are encountered extending residual resampling to regression settings where residuals are not identically distributed (thus not amenable to bootstrapping)—common examples including logistic or Poisson regression and generalizations to handle clustered or multivariate data, such as generalised estimating equations. We propose a bootstrap method based on probability integral transform (PIT-) residuals, which we call the PIT-trap, which assumes data come from some marginal distribution F of known parametric form. This method can be understood as a type of “model-free bootstrap”, adapted to the problem of discrete and highly multivariate data. PIT-residuals have the key property that they are (asymptotically) pivotal. The PIT-trap thus inherits the key property, not afforded by any other residual resampling approach, that the marginal distribution of data can be preserved under PIT-trapping. This in turn enables the derivation of some standard bootstrap properties, including second-order correctness of pivotal PIT-trap test statistics. In multivariate data, bootstrapping rows of PIT-residuals affords the property that it preserves correlation in data without the need for it to be modelled, a key point of difference as compared to a parametric bootstrap. The proposed method is illustrated on an example involving multivariate abundance data in ecology, and demonstrated via simulation to have improved properties as compared to competing resampling methods. PMID:28738071
Background stratified Poisson regression analysis of cohort data.
Richardson, David B; Langholz, Bryan
2012-03-01
Background stratified Poisson regression is an approach that has been used in the analysis of data derived from a variety of epidemiologically important studies of radiation-exposed populations, including uranium miners, nuclear industry workers, and atomic bomb survivors. We describe a novel approach to fit Poisson regression models that adjust for a set of covariates through background stratification while directly estimating the radiation-disease association of primary interest. The approach makes use of an expression for the Poisson likelihood that treats the coefficients for stratum-specific indicator variables as 'nuisance' variables and avoids the need to explicitly estimate the coefficients for these stratum-specific parameters. Log-linear models, as well as other general relative rate models, are accommodated. This approach is illustrated using data from the Life Span Study of Japanese atomic bomb survivors and data from a study of underground uranium miners. The point estimate and confidence interval obtained from this 'conditional' regression approach are identical to the values obtained using unconditional Poisson regression with model terms for each background stratum. Moreover, it is shown that the proposed approach allows estimation of background stratified Poisson regression models of non-standard form, such as models that parameterize latency effects, as well as regression models in which the number of strata is large, thereby overcoming the limitations of previously available statistical software for fitting background stratified Poisson regression models.
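As a rough illustration of the 'conditional' approach, the sketch below fits a one-parameter log-linear model without estimating any stratum coefficient. It assumes that, given the total cases in a stratum, cases fall into cells with probability proportional to person-years times exp(beta * x). The function name, the Newton solver and the data are hypothetical, not taken from the paper.

```python
import math


def conditional_fit(strata, iterations=25):
    """Newton fit of beta in rate = exp(alpha_s + beta * x), with the
    stratum intercepts alpha_s conditioned out of the likelihood.

    Each stratum is a list of (cases, person_years, x) cells.
    """
    beta = 0.0
    for _ in range(iterations):
        score, info = 0.0, 0.0
        for cells in strata:
            total_cases = sum(d for d, _, _ in cells)
            weights = [py * math.exp(beta * x) for _, py, x in cells]
            norm = sum(weights)
            mean_x = sum(w * x for w, (_, _, x) in zip(weights, cells)) / norm
            mean_x2 = sum(w * x * x for w, (_, _, x) in zip(weights, cells)) / norm
            # Score: observed minus expected sum of x among the cases.
            score += sum(d * x for d, _, x in cells) - total_cases * mean_x
            # Information: cases times the within-stratum variance of x.
            info += total_cases * (mean_x2 - mean_x ** 2)
        beta += score / info
    return beta


# Hypothetical data: two background strata, each with a rate ratio of 2
# between the exposed (x = 1) and unexposed (x = 0) cells.
strata = [
    [(20, 100.0, 1.0), (10, 100.0, 0.0)],
    [(8, 50.0, 1.0), (16, 200.0, 0.0)],
]
beta_hat = conditional_fit(strata)  # log rate ratio
```

The cost of each iteration grows with the number of cells rather than the number of stratum parameters, which is why this form remains practical when the number of strata is large.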
Procedures for adjusting regional regression models of urban-runoff quality using local data
Hoos, A.B.; Sisolak, J.K.
1993-01-01
Statistical operations termed model-adjustment procedures (MAPs) can be used to incorporate local data into existing regression models to improve the prediction of urban-runoff quality. Each MAP is a form of regression analysis in which the local data base is used as a calibration data set. Regression coefficients are determined from the local data base, and the resulting 'adjusted' regression models can then be used to predict storm-runoff quality at unmonitored sites. The response variable in the regression analyses is the observed load or mean concentration of a constituent in storm runoff for a single storm. The set of explanatory variables used in the regression analyses is different for each MAP, but always includes the predicted value of load or mean concentration from a regional regression model. The four MAPs examined in this study were: single-factor regression against the regional model prediction, P (termed MAP-1F-P); regression against P (termed MAP-R-P); regression against P and additional local variables (termed MAP-R-P+nV); and a weighted combination of P and a local-regression prediction (termed MAP-W). The procedures were tested by means of split-sample analysis, using data from three cities included in the Nationwide Urban Runoff Program: Denver, Colorado; Bellevue, Washington; and Knoxville, Tennessee. The MAP that provided the greatest predictive accuracy for the verification data set differed among the three test data bases and among model types (MAP-W for Denver and Knoxville, MAP-1F-P and MAP-R-P for Bellevue load models, and MAP-R-P+nV for Bellevue concentration models) and, in many cases, was not clearly indicated by the values of standard error of estimate for the calibration data set. A scheme to guide MAP selection, based on exploratory data analysis of the calibration data set, is presented and tested. The MAPs were tested for sensitivity to the size of a calibration data set. As expected, predictive accuracy of all MAPs for the verification data set decreased as the calibration data-set size decreased, but predictive accuracy was not as sensitive for the MAPs as it was for the local regression models.
lazar: a modular predictive toxicology framework
Maunz, Andreas; Gütlein, Martin; Rautenberg, Micha; Vorgrimmler, David; Gebele, Denis; Helma, Christoph
2013-01-01
lazar (lazy structure–activity relationships) is a modular framework for predictive toxicology. Similar to the read across procedure in toxicological risk assessment, lazar creates local QSAR (quantitative structure–activity relationship) models for each compound to be predicted. Model developers can choose between a large variety of algorithms for descriptor calculation and selection, chemical similarity indices, and model building. This paper presents a high level description of the lazar framework and discusses the performance of example classification and regression models. PMID:23761761
NASA Astrophysics Data System (ADS)
Fayad, Ibrahim; Baghdadi, Nicolas; Guitet, Stéphane; Bailly, Jean-Stéphane; Hérault, Bruno; Gond, Valéry; El Hajj, Mahmoud; Tong Minh, Dinh Ho
2016-10-01
Mapping forest aboveground biomass (AGB) has become an important task, particularly for the reporting of carbon stocks and changes. AGB can be mapped using synthetic aperture radar data (SAR) or passive optical data. However, these data are insensitive to high AGB levels (>150 Mg/ha, and >300 Mg/ha for P-band), which are commonly found in tropical forests. Studies have mapped the rough variations in AGB by combining optical and environmental data at regional and global scales. Nevertheless, these maps cannot represent local variations in AGB in tropical forests. In this paper, we hypothesize that the problem of misrepresenting local variations in AGB and AGB estimation with good precision occurs because of both methodological limits (signal saturation or dilution bias) and a lack of adequate calibration data in this range of AGB values. We test this hypothesis by developing a calibrated regression model to predict variations in high AGB values (mean >300 Mg/ha) in French Guiana by a methodological approach for spatial extrapolation with data from the optical geoscience laser altimeter system (GLAS), forest inventories, radar, optics, and environmental variables for spatial inter- and extrapolation. Given their higher point count, GLAS data allow a wider coverage of AGB values. We find that the metrics from GLAS footprints are correlated with field AGB estimations (R² = 0.54, RMSE = 48.3 Mg/ha) with no bias for high values. First, predictive models, including remote-sensing, environmental variables and spatial correlation functions, allow us to obtain "wall-to-wall" AGB maps over French Guiana with an RMSE for the in situ AGB estimates of ∼50 Mg/ha and R² = 0.66 at a 1-km grid size. We conclude that a calibrated regression model based on GLAS with dependent environmental data can produce good AGB predictions even for high AGB values if the calibration data fit the AGB range. 
We also demonstrate that small temporal and spatial mismatches between field data and GLAS footprints are not a problem for regional and global calibrated regression models because field data aim to predict large and deep tendencies in AGB variations from environmental gradients and do not aim to represent high but stochastic and temporally limited variations from forest dynamics. Thus, we advocate including a greater variety of data, even if less precise and shifted, to better represent high AGB values in global models and to improve the fitting of these models for high values.
Added sugars and periodontal disease in young adults: an analysis of NHANES III data.
Lula, Estevam C O; Ribeiro, Cecilia C C; Hugo, Fernando N; Alves, Cláudia M C; Silva, Antônio A M
2014-10-01
Added sugar consumption seems to trigger a hyperinflammatory state and may result in visceral adiposity, dyslipidemia, and insulin resistance. These conditions are risk factors for periodontal disease. However, the role of sugar intake in the cause of periodontal disease has not been adequately studied. We evaluated the association between the frequency of added sugar consumption and periodontal disease in young adults by using NHANES III data. Data from 2437 young adults (aged 18-25 y) who participated in NHANES III (1988-1994) were analyzed. We estimated the frequency of added sugar consumption by using food-frequency questionnaire responses. We considered periodontal disease to be present in teeth with bleeding on probing and a probing depth ≥3 mm at one or more sites. We evaluated this outcome as a discrete variable in Poisson regression models and as a categorical variable in multinomial logistic regression models adjusted for sex, age, race-ethnicity, education, poverty-income ratio, tobacco exposure, previous diagnosis of diabetes, and body mass index. A high consumption of added sugars was associated with a greater prevalence of periodontal disease in middle [prevalence ratio (PR): 1.39; 95% CI: 1.02, 1.89] and upper (PR: 1.42; 95% CI: 1.08, 1.85) tertiles of consumption in the adjusted Poisson regression model. The upper tertile of added sugar intake was associated with periodontal disease in ≥2 teeth (PR: 1.73; 95% CI: 1.19, 2.52) but not with periodontal disease in only one tooth (PR: 0.85; 95% CI: 0.54, 1.34) in the adjusted multinomial logistic regression model. A high frequency of consumption of added sugars is associated with periodontal disease, independent of traditional risk factors, suggesting that this consumption pattern may contribute to the systemic inflammation observed in periodontal disease and associated noncommunicable diseases. © 2014 American Society for Nutrition.
Meloun, Milan; Hill, Martin; Vceláková-Havlíková, Helena
2009-01-01
Pregnenolone sulfate (PregS) is known as a steroid conjugate positively modulating N-methyl-D-aspartate receptors on neuronal membranes. These receptors are responsible for permeability of calcium channels and activation of neuronal function. The neuroactivating effect of PregS is also exerted via non-competitive negative modulation of GABA(A) receptors regulating the chloride influx. Recently, penetrability of the blood-brain barrier for PregS was found in rat, but some experiments in agreement with this finding were reported even earlier. It is known that circulating levels of PregS in humans are relatively high, depending primarily on age and adrenal activity. Concerning the neuromodulating effect of PregS, we recently evaluated age relationships of PregS in both sexes using polynomial regression models, which are known to bring about the problems of multicollinearity, i.e., strong correlations among independent variables. Several criteria for the selection of suitable bias are demonstrated. Biased estimators based on the generalized principal component regression (GPCR) method avoiding multicollinearity problems are described. Significant differences were found between men and women in the course of the age dependence of PregS. In women, a significant maximum was found around the 30th year followed by a rapid decline, while the maximum in men was achieved almost 10 years earlier and changes were minor up to the 60th year. The investigation of gender differences and age dependencies in PregS could be of interest given its well-known neurostimulating effect, relatively high serum concentration, and the probable partial permeability of the blood-brain barrier for the steroid conjugate. GPCR in combination with the MEP (mean quadratic error of prediction) criterion is extremely useful and appealing for constructing biased models. 
It can also be used for achieving such estimates with regard to keeping the model course corresponding to the data trend, especially in polynomial type regression models.
Prostate specific antigen and acinar density: a new dimension, the "Prostatocrit".
Robinson, Simon; Laniado, Marc; Montgomery, Bruce
2017-01-01
Prostate-specific antigen densities have limited success in diagnosing prostate cancer. We emphasise the importance of the peripheral zone when considered with its cellular constituents, the "prostatocrit". Using zonal volumes and asymmetry of glandular acini, we generate a peripheral zone acinar volume and density. With the ratio to the whole gland, we can better predict high grade and all grade cancer. We can model the gland into its acinar and stromal elements. This new "prostatocrit" model could offer more accurate nomograms for biopsy. 674 patients underwent TRUS and biopsy. Whole gland and zonal volumes were recorded. We compared ratio and acinar volumes when added to a "clinic" model using traditional PSA density. Univariate logistic regression was used to find significant predictors for all and high grade cancer. Backwards multiple logistic regression was used to generate ROC curves comparing the new model to conventional density and PSA alone. Prediction of all grades of prostate cancer: significant variables revealed four significant "prostatocrit" parameters: log peripheral zone acinar density; peripheral zone acinar volume/whole gland acinar volume; peripheral zone acinar density/whole gland volume; peripheral zone acinar density. Acinar model (AUC 0.774), clinic model (AUC 0.745) (P=0.0105). Prediction of high grade prostate cancer: peripheral zone acinar density ("prostatocrit") was the only significant density predictor. Acinar model (AUC 0.811), clinic model (AUC 0.769) (P=0.0005). There is renewed use for ratio and "prostatocrit" density of the peripheral zone in predicting cancer. This outperforms all traditional density measurements. Copyright® by the International Brazilian Journal of Urology.
Slopen, Natalie; Loucks, Eric B; Appleton, Allison A; Kawachi, Ichiro; Kubzansky, Laura D; Non, Amy L; Buka, Stephen; Gilman, Stephen E
2015-01-01
Children exposed to social adversity carry a greater risk of poor physical and mental health into adulthood. This increased risk is thought to be due, in part, to inflammatory processes associated with early adversity that contribute to the etiology of many adult illnesses. The current study asks whether aspects of the prenatal social environment are associated with levels of inflammation in adulthood, and whether prenatal and childhood adversity both contribute to adult inflammation. We examined associations of prenatal and childhood adversity assessed through direct interviews of participants in the Collaborative Perinatal Project between 1959 and 1974 with blood levels of C-reactive protein in 355 offspring interviewed in adulthood (mean age=42.2 years). Linear and quantile regression models were used to estimate the effects of prenatal adversity and childhood adversity on adult inflammation, adjusting for age, sex, and race and other potential confounders. In separate linear regression models, high levels of prenatal and childhood adversity were associated with higher CRP in adulthood. When prenatal and childhood adversity were analyzed together, our results support the presence of an effect of prenatal adversity on (log) CRP level in adulthood (β=0.73, 95% CI: 0.26, 1.20) that is independent of childhood adversity and potential confounding factors including maternal health conditions reported during pregnancy. Supplemental analyses revealed similar findings using quantile regression models and logistic regression models that used a clinically-relevant CRP threshold (>3mg/L). In a fully-adjusted model that included childhood adversity, high prenatal adversity was associated with a 3-fold elevated odds (95% CI: 1.15, 8.02) of having a CRP level in adulthood that indicates high risk of cardiovascular disease. Social adversity during the prenatal period is a risk factor for elevated inflammation in adulthood independent of adversities during childhood. 
This evidence is consistent with studies demonstrating that adverse exposures in the maternal environment during gestation have lasting effects on development of the immune system. If these results reflect causal associations, they suggest that interventions to improve the social and environmental conditions of pregnancy would promote health over the life course. It remains necessary to identify the mechanisms that link maternal conditions during pregnancy to the development of fetal immune and other systems involved in adaptation to environmental stressors. Copyright © 2014 Elsevier Ltd. All rights reserved.
Modeling of geogenic radon in Switzerland based on ordered logistic regression.
Kropat, Georg; Bochud, François; Murith, Christophe; Palacios Gruson, Martha; Baechler, Sébastien
2017-01-01
The estimation of the radon hazard of a future construction site should ideally be based on the geogenic radon potential (GRP), since this estimate is free of anthropogenic influences and building characteristics. The goal of this study was to evaluate terrestrial gamma dose rate (TGD), geology, fault lines and topsoil permeability as predictors for the creation of a GRP map based on logistic regression. Soil gas radon measurements (SRC) are more suited for the estimation of GRP than indoor radon measurements (IRC) since the former do not depend on ventilation and heating habits or building characteristics. However, SRC have only been measured at a few locations in Switzerland. In former studies a good correlation between spatial aggregates of IRC and SRC has been observed. We therefore used IRC measurements aggregated on a 10 km × 10 km grid to calibrate an ordered logistic regression model for GRP. As predictors we took into account terrestrial gamma dose rate, regrouped geological units, fault line density and the permeability of the soil. The classification success rate of the model is 56% when all 4 predictor variables are included. Our results suggest that terrestrial gamma dose rate and regrouped geological units are more suited to model GRP than fault line density and soil permeability. Ordered logistic regression is a promising tool for the modeling of GRP maps due to its simplicity and fast computation time. Future studies should account for additional variables to improve the modeling of high radon hazard in the Jura Mountains of Switzerland. Copyright © 2016 The Authors. Published by Elsevier Ltd. All rights reserved.
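For readers unfamiliar with ordered logistic regression, the sketch below shows how class probabilities follow from a linear predictor and a set of ordered cutpoints under the usual proportional-odds form. The number of classes, the cutpoints and the predictor value are hypothetical; the abstract does not report them.

```python
import math


def ordered_logit_probs(linear_predictor, cutpoints):
    """Class probabilities under a proportional-odds model, where
    P(Y <= k) = logistic(cutpoints[k] - linear_predictor)."""
    cumulative = [1.0 / (1.0 + math.exp(-(c - linear_predictor))) for c in cutpoints]
    cumulative.append(1.0)  # the top class closes the distribution
    probs, previous = [], 0.0
    for value in cumulative:
        probs.append(value - previous)
        previous = value
    return probs


cutpoints = [-1.0, 0.5, 2.0]  # hypothetical thresholds between four GRP classes
eta = 0.8                     # hypothetical linear predictor for one grid cell
probs = ordered_logit_probs(eta, cutpoints)
# Classify the cell into its most probable class (index 0 = lowest class).
predicted_class = max(range(len(probs)), key=probs.__getitem__)
```

A classification success rate such as the one reported would then be the share of grid cells whose most probable class matches the observed class.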
Machine learning approaches to the social determinants of health in the health and retirement study.
Seligman, Benjamin; Tuljapurkar, Shripad; Rehkopf, David
2018-04-01
Social and economic factors are important predictors of health and of recognized importance for health systems. However, machine learning, used elsewhere in the biomedical literature, has not been extensively applied to study relationships between society and health. We investigate how machine learning may add to our understanding of social determinants of health using data from the Health and Retirement Study. A linear regression of age and gender, and a parsimonious theory-based regression additionally incorporating income, wealth, and education, were used to predict systolic blood pressure, body mass index, waist circumference, and telomere length. Prediction, fit, and interpretability were compared across four machine learning methods: linear regression, penalized regressions, random forests, and neural networks. All models had poor out-of-sample prediction. Most machine learning models performed similarly to the simpler models. However, neural networks greatly outperformed the three other methods. Neural networks also had good fit to the data (R² between 0.4 and 0.6, versus <0.3 for all others). Across machine learning models, nine variables were frequently selected or highly weighted as predictors: dental visits, current smoking, self-rated health, serial-seven subtractions, probability of receiving an inheritance, probability of leaving an inheritance of at least $10,000, number of children ever born, African-American race, and gender. Some of the machine learning methods do not improve prediction or fit beyond simpler models; however, neural networks performed well. The predictors identified across models suggest underlying social factors that are important predictors of biological indicators of chronic disease, and that the non-linear and interactive relationships between variables fundamental to the neural network approach may be important to consider.
Accounting for measurement error in log regression models with applications to accelerated testing.
Richardson, Robert; Tolley, H Dennis; Evenson, William E; Lunt, Barry M
2018-01-01
In regression settings, parameter estimates will be biased when the explanatory variables are measured with error. This bias can significantly affect modeling goals. In particular, accelerated lifetime testing involves an extrapolation of the fitted model, and a small amount of bias in parameter estimates may result in a significant increase in the bias of the extrapolated predictions. Additionally, bias may arise when the stochastic component of a log regression model is assumed to be multiplicative when the actual underlying stochastic component is additive. To account for these possible sources of bias, a log regression model with measurement error and additive error is approximated by a weighted regression model which can be estimated using Iteratively Re-weighted Least Squares. Using the reduced Eyring equation in an accelerated testing setting, the model is compared to previously accepted approaches to modeling accelerated testing data with both simulations and real data.
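The weighted-regression idea can be sketched for the additive-error part alone. The sketch assumes y = exp(a + b*x) + e with constant error variance, so that log y has variance roughly proportional to 1/mu² and the weights are the squared fitted means. This delta-method weighting is an illustrative assumption: it leaves out the measurement error in x that the paper also handles, and the data are hypothetical.

```python
import math


def weighted_line_fit(xs, zs, ws):
    """Weighted least squares for z = a + b * x; returns (a, b)."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    mz = sum(w * z for w, z in zip(ws, zs)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxz = sum(w * (x - mx) * (z - mz) for w, x, z in zip(ws, xs, zs))
    b = sxz / sxx
    return mz - b * mx, b


def irls_log_regression(xs, ys, iterations=20):
    """Iteratively re-weighted least squares on log(y), with weights
    mu^2 refreshed from the current fit on every pass (requires y > 0)."""
    zs = [math.log(y) for y in ys]
    ws = [1.0] * len(xs)  # first pass is ordinary least squares
    for _ in range(iterations):
        a, b = weighted_line_fit(xs, zs, ws)
        ws = [math.exp(a + b * x) ** 2 for x in xs]
    return a, b


# Hypothetical noise-free data generated from a = 1.0, b = 0.5.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [math.exp(1.0 + 0.5 * x) for x in xs]
a_hat, b_hat = irls_log_regression(xs, ys)
```

With noisy data the weights down-weight observations with small fitted means, where an additive error distorts log y the most.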
Wan, Jian; Chen, Yi-Chieh; Morris, A Julian; Thennadil, Suresh N
2017-07-01
Near-infrared (NIR) spectroscopy is being widely used in various fields ranging from pharmaceutics to the food industry for analyzing chemical and physical properties of the substances concerned. Its advantages over other analytical techniques include available physical interpretation of spectral data, nondestructive nature and high speed of measurements, and little or no need for sample preparation. The successful application of NIR spectroscopy relies on three main aspects: pre-processing of spectral data to eliminate nonlinear variations due to temperature, light scattering effects and many others; selection of those wavelengths that contribute useful information; and identification of suitable calibration models using linear/nonlinear regression. Several methods have been developed for each of these three aspects and many comparative studies of different methods exist for an individual aspect or some combinations. However, there is still a lack of comparative studies for the interactions among these three aspects, which can shed light on what role each aspect plays in the calibration and how to combine various methods of each aspect together to obtain the best calibration model. This paper aims to provide such a comparative study based on four benchmark data sets using three typical pre-processing methods, namely, orthogonal signal correction (OSC), extended multiplicative signal correction (EMSC) and optical path-length estimation and correction (OPLEC); two existing wavelength selection methods, namely, stepwise forward selection (SFS) and genetic algorithm optimization combined with partial least squares regression for spectral data (GAPLSSP); and four popular regression methods, namely, partial least squares (PLS), least absolute shrinkage and selection operator (LASSO), least squares support vector machine (LS-SVM), and Gaussian process regression (GPR). 
The comparative study indicates that, in general, pre-processing of spectral data can play a significant role in the calibration while wavelength selection plays a marginal role and the combination of certain pre-processing, wavelength selection, and nonlinear regression methods can achieve superior performance over traditional linear regression-based calibration.
Min, Seung Nam; Park, Se Jin; Kim, Dong Joon; Subramaniyam, Murali; Lee, Kyung-Sun
2018-01-01
Stroke is the second leading cause of death worldwide and remains an important health burden both for the individuals and for the national healthcare systems. Potentially modifiable risk factors for stroke include hypertension, cardiac disease, diabetes, and dysregulation of glucose metabolism, atrial fibrillation, and lifestyle factors. We aimed to derive a model equation for developing a stroke pre-diagnosis algorithm with the potentially modifiable risk factors. We used logistic regression for model derivation, together with data from the database of the Korea National Health Insurance Service (NHIS). We reviewed the NHIS records of 500,000 enrollees. For the regression analysis, data regarding 367 stroke patients were selected. The control group consisted of 500 patients followed up for 2 consecutive years and with no history of stroke. We developed a logistic regression model based on information regarding several well-known modifiable risk factors. The developed model could correctly discriminate between normal subjects and stroke patients in 65% of cases. The model developed in the present study can be applied in the clinical setting to estimate the probability of stroke in a year and thus improve the stroke prevention strategies in high-risk patients. The approach used to develop the stroke prevention algorithm can be applied for developing similar models for the pre-diagnosis of other diseases. © 2018 S. Karger AG, Basel.
Survival curves of Listeria monocytogenes in chorizos modeled with artificial neural networks.
Hajmeer, M; Basheer, I; Cliver, D O
2006-09-01
Using artificial neural networks (ANNs), a highly accurate model was developed to simulate survival curves of Listeria monocytogenes in chorizos as affected by the initial water activity (a_w0) of the sausage formulation, temperature (T), and air inflow velocity (F) where the sausages are stored. The ANN-based survival model (R² = 0.970) outperformed the regression-based cubic model (R² = 0.851), and as such was used to derive other models (using regression) that allow prediction of the times needed to drop count by 1, 2, 3, and 4 logs (i.e., nD-values, n = 1, 2, 3, 4). The nD-value regression models almost perfectly predicted the various times derived from a number of simulated survival curves exhibiting a wide variety of the operating conditions (R² = 0.990-0.995). The nD-values were found to decrease with decreasing a_w0, and increasing T and F. The influence of a_w0 on nD-values seems to become more significant at some critical value of a_w0, below which the variation is negligible (0.93 for 1D-value, 0.90 for 2D-value, and <0.85 for 3D- and 4D-values). There is greater influence of storage T and F on 3D- and 4D-values than on 1D- and 2D-values.
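To make the nD-value concrete, the sketch below reads it off a tabulated survival curve by linear interpolation. It only illustrates the definition (time for the count to fall n logs below its starting value); the study itself used regression models for this step, and the curve shown is hypothetical.

```python
def time_to_log_reduction(times, log_counts, n):
    """First time at which the log10 count has fallen n logs below its
    initial value, by linear interpolation between tabulated points.
    Returns None if the curve never drops that far."""
    target = log_counts[0] - n
    points = list(zip(times, log_counts))
    for (t0, c0), (t1, c1) in zip(points, points[1:]):
        if c0 > target >= c1:
            return t0 + (c0 - target) * (t1 - t0) / (c0 - c1)
    return None


# Hypothetical survival curve: storage time (days) and log10 CFU/g.
times = [0, 2, 4, 6, 8]
log_counts = [6.0, 5.5, 4.5, 3.0, 1.5]

d1 = time_to_log_reduction(times, log_counts, 1)  # 1D-value
d4 = time_to_log_reduction(times, log_counts, 4)  # 4D-value
d5 = time_to_log_reduction(times, log_counts, 5)  # not reached on this curve
```

On a curve that is not log-linear, as here, the 4D-value is not simply four times the 1D-value, which is why separate models for each n are useful.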
Chen, Baojiang; Qin, Jing
2014-05-10
In statistical analysis, a regression model is needed if one is interested in finding the relationship between a response variable and covariates. When the response depends on a covariate, it may also depend on a function of that covariate. If the functional form is unknown but is expected to be monotonically increasing or decreasing, then the isotonic regression model is preferable. Estimation of parameters for isotonic regression models is based on the pool-adjacent-violators algorithm (PAVA), where the monotonicity constraints are built in. With missing data, the augmented estimating method is often employed to improve estimation efficiency by incorporating auxiliary information through a working regression model. However, under the framework of the isotonic regression model, the PAVA does not work as the monotonicity constraints are violated. In this paper, we develop an empirical likelihood-based method for the isotonic regression model to incorporate the auxiliary information. Because the monotonicity constraints still hold, the PAVA can be used for parameter estimation. Simulation studies demonstrate that the proposed method can yield more efficient estimates, and in some situations, the efficiency improvement is substantial. We apply this method to a dementia study. Copyright © 2013 John Wiley & Sons, Ltd.
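The pool-adjacent-violators algorithm named in this abstract can be sketched in a few lines. This is the textbook complete-data version for a nondecreasing fit, not the empirical likelihood-based estimator the paper proposes; the function name and inputs are illustrative.

```python
def pava(values, weights=None):
    """Weighted least-squares nondecreasing fit via pool-adjacent-violators.
    `values` must already be ordered by the covariate."""
    if weights is None:
        weights = [1.0] * len(values)
    blocks = []  # each block holds [mean, total weight, number of points]
    for v, w in zip(values, weights):
        blocks.append([float(v), float(w), 1])
        # Pool backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, n1 + n2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted


fit = pava([1, 3, 2, 4])  # the violating pair (3, 2) is pooled to its mean
```

A nonincreasing fit can be obtained by negating the values, applying the same routine, and negating the result.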
Bayesian structured additive regression modeling of epidemic data: application to cholera
2012-01-01
Background: A significant interest in spatial epidemiology lies in identifying associated risk factors that enhance the risk of infection. Most studies, however, make no or limited use of the spatial structure of the data, or of possible nonlinear effects of the risk factors. Methods: We develop a Bayesian Structured Additive Regression model for cholera epidemic data. Model estimation and inference are based on a fully Bayesian approach via Markov Chain Monte Carlo (MCMC) simulations. The model is applied to cholera epidemic data in the Kumasi Metropolis, Ghana. Proximity to refuse dumps, density of refuse dumps, and proximity to potential cholera reservoirs were modeled as continuous functions; presence of slum settlers and population density were modeled as fixed effects, whereas spatial references to the communities were modeled as structured and unstructured spatial effects. Results: We observe that the risk of cholera is associated with slum settlements and high population density. The risk of cholera is equal and lower for communities with fewer refuse dumps, but variable and higher for communities with more refuse dumps. The risk is also lower for communities distant from refuse dumps and potential cholera reservoirs. The results also indicate distinct spatial variation in the risk of cholera infection. Conclusion: The study highlights the usefulness of Bayesian semi-parametric regression models in analyzing public health data. These findings could serve as novel information to help health planners and policy makers in making effective decisions to control or prevent cholera epidemics. PMID:22866662
Logistic regression for dichotomized counts.
Preisser, John S; Das, Kalyan; Benecha, Habtamu; Stamm, John W
2016-12-01
Sometimes there is interest in a dichotomized outcome indicating whether a count variable is positive or zero. Under this scenario, the application of ordinary logistic regression may result in efficiency loss, which is quantifiable under an assumed model for the counts. In such situations, a shared-parameter hurdle model is investigated for more efficient estimation of regression parameters relating to overall effects of covariates on the dichotomous outcome, while handling count data with many zeroes. One model part provides a logistic regression containing marginal log odds ratio effects of primary interest, while an ancillary model part describes the mean count of a Poisson or negative binomial process in terms of nuisance regression parameters. Asymptotic efficiency of the logistic model parameter estimators of the two-part models is evaluated with respect to ordinary logistic regression. Simulations are used to assess the properties of the models with respect to power and Type I error, the latter investigated under both misspecified and correctly specified models. The methods are applied to data from a randomized clinical trial of three toothpaste formulations to prevent incident dental caries in a large population of Scottish schoolchildren. © The Author(s) 2014.
ERIC Educational Resources Information Center
Laird, Robert D.; Weems, Carl F.
2011-01-01
Research on informant discrepancies has increasingly utilized difference scores. This article demonstrates the statistical equivalence of regression models using difference scores (raw or standardized) and regression models using separate scores for each informant to show that interpretations should be consistent with both models. First,…
Wood, Jeffrey J.; Lynne, Sarah D.; Langer, David A.; Wood, Patricia A.; Clark, Shaunna L.; Eddy, J. Mark; Ialongo, Nicholas
2011-01-01
This study tests a model of reciprocal influences between absenteeism and youth psychopathology using three longitudinal datasets (Ns = 20,745, 2,311, and 671). Participants in 1st through 12th grades were interviewed annually or bi-annually. Measures of psychopathology include self-, parent-, and teacher-report questionnaires. Structural cross-lagged regression models were tested. In a nationally representative dataset (Add Health), middle school students with relatively greater absenteeism at study year 1 tended towards increased depression and conduct problems in study year 2, over and above the effects of autoregressive associations and demographic covariates. The opposite direction of effects was found for both middle and high school students. Analyses with two regionally representative datasets were also partially supportive. Longitudinal links were more evident in adolescence than in childhood. PMID:22188462
NASA Technical Reports Server (NTRS)
Beck, L. R.; Rodriguez, M. H.; Dister, S. W.; Rodriguez, A. D.; Washino, R. K.; Roberts, D. R.; Spanner, M. A.
1997-01-01
A blind test of two remote sensing-based models for predicting adult populations of Anopheles albimanus in villages, an indicator of malaria transmission risk, was conducted in southern Chiapas, Mexico. One model was developed using a discriminant analysis approach, while the other was based on regression analysis. The models were developed in 1992 for an area around Tapachula, Chiapas, using Landsat Thematic Mapper (TM) satellite data and geographic information system functions. Using two remotely sensed landscape elements, the discriminant model was able to successfully distinguish between villages with high and low An. albimanus abundance with an overall accuracy of 90%. To test the predictive capability of the models, multitemporal TM data were used to generate a landscape map of the Huixtla area, northwest of Tapachula, where the models were used to predict risk for 40 villages. The resulting predictions were not disclosed until the end of the test. Independently, An. albimanus abundance data were collected in the 40 randomly selected villages for which the predictions had been made. These data were subsequently used to assess the models' accuracies. The discriminant model accurately predicted 79% of the high-abundance villages and 50% of the low-abundance villages, for an overall accuracy of 70%. The regression model correctly identified seven of the 10 villages with the highest mosquito abundance. This test demonstrated that remote sensing-based models generated for one area can be used successfully in another, comparable area.
Aspects of porosity prediction using multivariate linear regression
DOE Office of Scientific and Technical Information (OSTI.GOV)
Byrnes, A.P.; Wilson, M.D.
1991-03-01
Highly accurate multiple linear regression models have been developed for sandstones of diverse compositions. Porosity reduction or enhancement processes are controlled by the fundamental variables Pressure (P), Temperature (T), Time (t), and Composition (X), where composition includes mineralogy, size, sorting, fluid composition, etc. The multiple linear regression equation, of which all linear porosity prediction models are subsets, takes the generalized form: $\text{Porosity} = C_0 + C_1(P) + C_2(T) + C_3(X) + C_4(t) + C_5(PT) + C_6(PX) + C_7(Pt) + C_8(TX) + C_9(Tt) + C_{10}(Xt) + C_{11}(PTX) + C_{12}(PXt) + C_{13}(PTt) + C_{14}(TXt) + C_{15}(PTXt)$. The first four primary variables are often interactive, thus requiring terms involving two or more primary variables (the form shown implies interaction and not necessarily multiplication). The final terms used may also involve simple mathematical transforms such as $\log X$, $e^{T}$, $X^{2}$, or more complex transformations such as the Time-Temperature Index (TTI). The X term in the equation above represents a suite of compositional variables and, therefore, a fully expanded equation may include a series of terms incorporating these variables. Numerous published bivariate porosity prediction models involving P (or depth) or Tt (TTI) are effective to a degree, largely because of the high degree of collinearity between P and TTI. However, all such bivariate models ignore the unique contributions of P and Tt, as well as various X terms. These simpler models become poor predictors in regions where collinear relations change, where important variables have been ignored, or where the database does not include a sufficient range or weight distribution for the critical variables.
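As an illustration of the generalized form above, the following is a minimal sketch, not the authors' implementation. It assumes that each interaction is modelled as a plain product of the primary variables (the abstract notes that interaction need not mean multiplication) and treats X as a single numeric column. The function name `interaction_design_matrix` and the synthetic data are hypothetical. The columns are generated in combination order, so the three-way terms appear as PTX, PTt, PXt, TXt rather than in the order printed in the abstract.

```python
from itertools import combinations

import numpy as np


def interaction_design_matrix(P, T, X, t):
    """Build a 16-column design matrix: intercept, 4 main effects,
    6 two-way, 4 three-way and 1 four-way product terms."""
    variables = {
        "P": np.asarray(P, float),
        "T": np.asarray(T, float),
        "X": np.asarray(X, float),
        "t": np.asarray(t, float),
    }
    names = ["1"]
    columns = [np.ones_like(variables["P"])]
    for order in range(1, 5):
        for combo in combinations(variables, order):
            names.append("".join(combo))
            # Assumption: an interaction is the element-wise product.
            columns.append(np.prod([variables[v] for v in combo], axis=0))
    return names, np.column_stack(columns)


# Synthetic, noise-free data (hypothetical values, for illustration only).
rng = np.random.default_rng(0)
P, T, X, t = rng.uniform(0.5, 2.0, size=(4, 200))
names, A = interaction_design_matrix(P, T, X, t)

true_coef = np.zeros(16)
true_coef[[0, 1, 5]] = [30.0, -4.0, -1.5]  # intercept, P and PT terms
porosity = A @ true_coef

# Ordinary least squares fit of all 16 coefficients C_0 ... C_15.
coef, *_ = np.linalg.lstsq(A, porosity, rcond=None)
```

Any bivariate model mentioned in the abstract corresponds to keeping only a few of these columns, which is why such models can fail when the omitted terms matter.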
Xuan Chi; Barry Goodwin
2012-01-01
Spatial and temporal relationships among agricultural prices have been an important topic of applied research for many years. Such research is used to investigate the performance of markets and to examine linkages up and down the marketing chain. This research has empirically evaluated price linkages by using correlation and regression models and, later, linear and...
Measuring the impact of urbanization on scenic quality: land use change in the northeast
Robert O. Brush; James F. Palmer
1979-01-01
The changes in scenic quality resulting from urbanization are explored for a region in the Northeast. The relative contributions to scenic quality of certain landscape features are examined by developing regression models for the region and for town landscapes within that region. The models provide empirical evidence of the importance of trees for maintaining high...
ERIC Educational Resources Information Center
Konstantopoulos, Spyros
2009-01-01
Background: In recent years, Asian Americans have been consistently described as a model minority. The high levels of educational achievement and educational attainment are the main determinants for identifying Asian Americans as a model minority. Nonetheless, only a few studies have examined empirically the accomplishments of Asian Americans, and…
Majorization Minimization by Coordinate Descent for Concave Penalized Generalized Linear Models
Jiang, Dingfeng; Huang, Jian
2013-01-01
Recent studies have demonstrated theoretical attractiveness of a class of concave penalties in variable selection, including the smoothly clipped absolute deviation and minimax concave penalties. The computation of the concave penalized solutions in high-dimensional models, however, is a difficult task. We propose a majorization minimization by coordinate descent (MMCD) algorithm for computing the concave penalized solutions in generalized linear models. In contrast to the existing algorithms that use local quadratic or local linear approximation to the penalty function, the MMCD seeks to majorize the negative log-likelihood by a quadratic loss, but does not use any approximation to the penalty. This strategy makes it possible to avoid the computation of a scaling factor in each update of the solutions, which improves the efficiency of coordinate descent. Under certain regularity conditions, we establish theoretical convergence property of the MMCD. We implement this algorithm for a penalized logistic regression model using the SCAD and MCP penalties. Simulation studies and a data example demonstrate that the MMCD works sufficiently fast for the penalized logistic regression in high-dimensional settings where the number of covariates is much larger than the sample size. PMID:25309048
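For readers unfamiliar with the two concave penalties named above, here is a small sketch of their standard textbook definitions. It does not implement the MMCD algorithm itself. The function names and the default tuning values (`gamma=3.0`, `a=3.7`) are assumptions chosen for illustration.

```python
import numpy as np


def mcp_penalty(beta, lam, gamma=3.0):
    """Minimax concave penalty: lam*|b| - b^2/(2*gamma) up to gamma*lam,
    then constant at gamma*lam^2/2."""
    b = np.abs(np.asarray(beta, float))
    inner = lam * b - b**2 / (2.0 * gamma)
    outer = 0.5 * gamma * lam**2
    return np.where(b <= gamma * lam, inner, outer)


def scad_penalty(beta, lam, a=3.7):
    """Smoothly clipped absolute deviation penalty (three pieces)."""
    b = np.abs(np.asarray(beta, float))
    small = lam * b                                   # |b| <= lam
    middle = (2.0 * a * lam * b - b**2 - lam**2) / (2.0 * (a - 1.0))
    large = 0.5 * (a + 1.0) * lam**2                  # |b| > a*lam
    return np.where(b <= lam, small, np.where(b <= a * lam, middle, large))
```

Both penalties behave like the lasso penalty near zero and flatten out for large coefficients, which is the property that reduces the bias on large effects.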
Naish, Suchithra; Hu, Wenbiao; Nicholls, Neville; Mackenzie, John S; Dale, Pat; McMichael, Anthony J; Tong, Shilu
2009-02-01
To assess the socio-environmental predictors of Barmah forest virus (BFV) transmission in coastal areas, Queensland, Australia. Data on BFV notified cases, climate, tidal levels and socioeconomic index for area (SEIFA) in six coastal cities, Queensland, for the period 1992-2001 were obtained from the relevant government agencies. Negative binomial regression models were used to assess the socio-environmental predictors of BFV transmission. The results show that maximum and minimum temperature, rainfall, relative humidity, high and low tide were statistically significantly associated with BFV incidence at lags 0-2 months. The fitted negative binomial regression models indicate a significant independent association of each of maximum temperature (beta = 0.139, P = 0.000), high tide (beta = 0.005, P = 0.000) and SEIFA index (beta = -0.010, P = 0.000) with BFV transmission after adjustment for confounding variables. The transmission of BFV disease in Queensland coastal areas seemed to be determined by a combination of local social and environmental factors. The model developed in this study may have applications in the control and prevention of BFV disease in these areas.
NASA Astrophysics Data System (ADS)
Powell, James Eckhardt
Emissions inventories are an important tool, often built by governments, and used to manage emissions. To build an inventory of urban CO2 emissions and other fossil fuel combustion products in the urban atmosphere, an inventory of on-road traffic is required. In particular, a high resolution inventory is necessary to capture the local characteristics of transport emissions. These emissions vary widely due to the local nature of the fleet, fuel, and roads. Here we show a new model of ADT for the Portland, OR metropolitan region. The backbone is traffic counter recordings made by the Portland Bureau of Transportation at 7,767 sites over 21 years (1986-2006), augmented with PORTAL (The Portland Regional Transportation Archive Listing) freeway traffic count data. We constructed a regression model to fill in traffic network gaps using GIS data such as road class and population density. An EPA-supplied emissions factor was used to estimate transportation CO2 emissions, which is compared to several other estimates for the city's CO2 footprint.
Modelling soil salinity in Oued El Abid watershed, Morocco
NASA Astrophysics Data System (ADS)
Mouatassime Sabri, El; Boukdir, Ahmed; Karaoui, Ismail; Arioua, Abdelkrim; Messlouhi, Rachid; El Amrani Idrissi, Abdelkhalek
2018-05-01
Soil salinisation is a phenomenon considered to be a real threat to natural resources in semi-arid climates. The phenomenon is controlled by soil factors (texture, depth, slope, etc.), anthropogenic factors (drainage system, irrigation, crop types, etc.), and climate factors. This study was conducted in the watershed of Oued El Abid in the region of Beni Mellal-Khenifra and aimed at localising saline soil using remote sensing and a regression model. The spectral indices were extracted from Landsat imagery (30 m resolution). A linear correlation between the electrical conductivity measured on soil samples (ECs) and the values extracted from the spectral bands showed high accuracy, with an R² (coefficient of determination) of 0.80. This study proposes a new spectral salinity index using Landsat bands B1 and B4. This hydro-chemical and statistical study, based on a yearlong survey, showed a moderate amount of salinity, which threatens dam water quality. The results present an improved ability to use remote sensing and regression model integration to detect soil salinity with high accuracy and low cost, and permit intervention at an early stage of salinisation.
Development of statistical linear regression model for metals from transportation land uses.
Maniquiz, Marla C; Lee, Soyoung; Lee, Eunju; Kim, Lee-Hyung
2009-01-01
Transportation land uses with impervious surfaces, such as highways, parking lots, roads, and bridges, have been recognized as highly polluted non-point sources (NPSs) in urban areas. Many pollutants from urban transportation accumulate on paved surfaces during dry periods and are washed off during storms. In Korea, the identification and monitoring of NPSs still represent a great challenge. Since 2004, the Ministry of Environment (MOE) has been engaged in several research and monitoring efforts to develop stormwater management policies and treatment systems for future implementation. Data from 131 storm events between May 2004 and September 2008 at eleven sites were analyzed to identify correlations between particulates and metals, and to develop a simple linear regression (SLR) model to estimate event mean concentration (EMC). Results indicate that there was no significant relationship between metals and TSS EMC. However, the SLR estimation models, although not providing useful results, are valuable indicators of the high uncertainties that NPS pollution possesses. Therefore, long-term monitoring employing proper methods and precise statistical analysis of the data should be undertaken to eliminate these uncertainties.
Rasmussen, Patrick P.; Gray, John R.; Glysson, G. Douglas; Ziegler, Andrew C.
2009-01-01
In-stream continuous turbidity and streamflow data, calibrated with measured suspended-sediment concentration data, can be used to compute a time series of suspended-sediment concentration and load at a stream site. Development of a simple linear (ordinary least squares) regression model for computing suspended-sediment concentrations from instantaneous turbidity data is the first step in the computation process. If the model standard percentage error (MSPE) of the simple linear regression model meets a minimum criterion, this model should be used to compute a time series of suspended-sediment concentrations. Otherwise, a multiple linear regression model using paired instantaneous turbidity and streamflow data is developed and compared to the simple regression model. If the inclusion of the streamflow variable proves to be statistically significant and the uncertainty associated with the multiple regression model results in an improvement over that for the simple linear model, the turbidity-streamflow multiple linear regression model should be used to compute a suspended-sediment concentration time series. The computed concentration time series is subsequently used with its paired streamflow time series to compute suspended-sediment loads by standard U.S. Geological Survey techniques. Once an acceptable regression model is developed, it can be used to compute suspended-sediment concentration beyond the period of record used in model development with proper ongoing collection and analysis of calibration samples. Regression models to compute suspended-sediment concentrations are generally site specific and should never be considered static, but they represent a set period in a continually dynamic system in which additional data will help verify any change in sediment load, type, and source.
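The model-selection logic described above can be sketched roughly as follows. This is a simplified illustration, not the U.S. Geological Survey procedure: the percentage error used here (residual standard error divided by the mean response) is only a stand-in for the MSPE, the 20% threshold and the |t| > 2 significance rule are assumptions, and the function names are hypothetical.

```python
import numpy as np


def fit_ols(X, y):
    """Return coefficients, residual standard error and coefficient
    standard errors for an ordinary least squares fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    dof = len(y) - A.shape[1]
    sigma2 = resid @ resid / dof
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))
    return coef, np.sqrt(sigma2), se


def choose_model(turbidity, streamflow, ssc, max_pct_error=20.0, t_crit=2.0):
    """Pick 'simple' (turbidity only) or 'multiple' (turbidity + streamflow)."""
    y = np.asarray(ssc, float)
    turb = np.asarray(turbidity, float)
    flow = np.asarray(streamflow, float)

    # Step 1: simple linear regression on turbidity alone.
    _, rse_simple, _ = fit_ols(turb, y)
    pct_simple = 100.0 * rse_simple / y.mean()
    if pct_simple <= max_pct_error:
        return "simple", pct_simple

    # Step 2: add streamflow and keep it only if it is (roughly)
    # significant and reduces the error measure.
    coef, rse_multi, se = fit_ols(np.column_stack([turb, flow]), y)
    pct_multi = 100.0 * rse_multi / y.mean()
    flow_significant = abs(coef[2] / se[2]) > t_crit
    if flow_significant and pct_multi < pct_simple:
        return "multiple", pct_multi
    return "simple", pct_simple
```

In practice the inputs are often transformed before fitting and the fitted model is rechecked as new calibration samples arrive, in line with the abstract's point that such models should not be treated as static.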
Myer, Gregory D; Ford, Kevin R; Khoury, Jane; Succop, Paul; Hewett, Timothy E
2014-01-01
Objective: Knee abduction moment (KAM) during landing predicts non-contact anterior cruciate ligament (ACL) injury risk with high sensitivity and specificity in female athletes. The purpose of this study was to employ sensitive laboratory (lab-based) tools to determine predictive mechanisms that underlie increased KAM during landing. Methods: Female basketball and soccer players (N=744) from a single county public school district were recruited to participate in testing of anthropometrics, maturation, laxity/flexibility, strength and landing biomechanics. Linear regression was used to model KAM, and logistic regression was used to examine high (>25.25 Nm of KAM) versus low KAM as a surrogate for ACL injury risk. Results: The most parsimonious model included independent predictors (β±1 SE) (1) peak knee abduction angle (1.78±0.05; p<0.001), (2) peak knee extensor moment (0.17±0.01; p<0.001), (3) knee flexion range of motion (0.15±0.03; p<0.01), (4) body mass index (BMI) Z-score (−1.67±0.36; p<0.001) and (5) tibia length (−0.50±0.14; p<0.001) and accounted for 78% of the variance in KAM during landing. The logistic regression model that employed these same variables predicted high KAM status with 85% sensitivity and 93% specificity and a C-statistic of 0.96. Conclusions: Increased knee abduction angle, quadriceps recruitment, tibia length and BMI with decreased knee flexion account for 80% of the measured variance in KAM during a drop vertical jump. Clinical relevance: Females who demonstrate increased KAM are more responsive and more likely to benefit from neuromuscular training. These findings should significantly enhance the identification of those at increased risk and facilitate neuromuscular training targeted to this important risk factor (high KAM) for ACL injury. PMID:20558526
Parisi Kern, Andrea; Ferreira Dias, Michele; Piva Kulakowski, Marlova; Paulo Gomes, Luciana
2015-05-01
Reducing construction waste is becoming a key environmental issue in the construction industry. The quantification of waste generation rates in the construction sector is an invaluable management tool in supporting mitigation actions. However, the quantification of waste can be a difficult process because of the specific characteristics and the wide range of materials used in different construction projects. Large variations are observed in the methods used to predict the amount of waste generated because of the range of variables involved in construction processes and the different contexts in which these methods are employed. This paper proposes a statistical model to determine the amount of waste generated in the construction of high-rise buildings by assessing the influence of design process and production system, often mentioned as the major culprits behind the generation of waste in construction. Multiple regression was used to conduct a case study based on multiple sources of data from eighteen residential buildings. The resulting statistical model relates the dependent variable (i.e. amount of waste generated) to independent variables associated with the design and the production system used. The best regression model obtained from the sample data resulted in an adjusted R² value of 0.694, which means that it explains approximately 69% of the variance in waste generation in similar constructions. Most independent variables showed a low determination coefficient when assessed in isolation, which emphasizes the importance of assessing their joint influence on the response (dependent) variable. Copyright © 2015 Elsevier Ltd. All rights reserved.
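Because the sample contains only eighteen buildings, the adjustment for the number of predictors matters. The sketch below shows the usual adjusted R² formula. The function name is hypothetical, and the example inputs in the usage note are illustrative, since the abstract does not state the unadjusted R² or the number of predictors.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_obs - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)
```

For example, `adjusted_r2(0.75, 18, 3)` applies the correction for a hypothetical three-predictor model fitted to eighteen observations. The penalty grows quickly as predictors are added to a sample this small.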
Population heterogeneity in the salience of multiple risk factors for adolescent delinquency.
Lanza, Stephanie T; Cooper, Brittany R; Bray, Bethany C
2014-03-01
To present mixture regression analysis as an alternative to more standard regression analysis for predicting adolescent delinquency. We demonstrate how mixture regression analysis allows for the identification of population subgroups defined by the salience of multiple risk factors. We identified population subgroups (i.e., latent classes) of individuals based on their coefficients in a regression model predicting adolescent delinquency from eight previously established risk indices drawn from the community, school, family, peer, and individual levels. The study included N = 37,763 10th-grade adolescents who participated in the Communities That Care Youth Survey. Standard, zero-inflated, and mixture Poisson and negative binomial regression models were considered. Standard and mixture negative binomial regression models were selected as optimal. The five-class regression model was interpreted based on the class-specific regression coefficients, indicating that risk factors had varying salience across classes of adolescents. Standard regression showed that all risk factors were significantly associated with delinquency. Mixture regression provided more nuanced information, suggesting a unique set of risk factors that were salient for different subgroups of adolescents. Implications for the design of subgroup-specific interventions are discussed. Copyright © 2014 Society for Adolescent Health and Medicine. Published by Elsevier Inc. All rights reserved.
NASA Astrophysics Data System (ADS)
Suhartono, Lee, Muhammad Hisyam; Prastyo, Dedy Dwi
2015-12-01
The aim of this research is to develop a calendar variation model for forecasting retail sales data with the Eid ul-Fitr effect. The proposed model is based on two methods, namely two-level ARIMAX and regression. The two-level model is built by using ARIMAX for the first level and regression for the second level. Monthly men's jeans and women's trousers sales in a retail company for the period January 2002 to September 2009 are used as a case study. In general, the two-level calendar variation model yields two models: the first reconstructs the sales pattern that has already occurred, and the second forecasts the increase in sales due to Eid ul-Fitr, which affects sales in the same and the previous months. The results show that the proposed two-level calendar variation model based on ARIMAX and regression methods yields better forecasts than the seasonal ARIMA model and neural networks.
Carberry, Angela E.; Turner, Robin M.; Bek, Emily J.; Raynes-Greenow, Camille H.; McEwan, Alistair L.; Jeffery, Heather E.
2018-01-01
Background: With the greatest burden of infant undernutrition and morbidity in low and middle income countries (LMICs), there is a need for suitable approaches to monitor infants in a simple, low-cost and effective manner. Anthropometry continues to play a major role in characterising growth and nutritional status. Methods: We developed a range of models to aid in identifying neonates at risk of malnutrition. We first adopted a logistic regression approach to screen for a composite neonatal morbidity, low and high body fat (BF%) infants. We then developed linear regression models for the estimation of neonatal fat mass as an assessment of body composition and nutritional status. Results: We fitted logistic regression models combining up to four anthropometric variables to predict composite morbidity and low and high BF% neonates. The greatest area under receiver-operator characteristic curves (AUC with 95% confidence intervals (CI)) for identifying composite morbidity was 0.740 (0.63, 0.85), resulting from the combination of birthweight, length, chest and mid-thigh circumferences. The AUCs (95% CI) for identifying low and high BF% were 0.827 (0.78, 0.88) and 0.834 (0.79, 0.88), respectively. For identifying composite morbidity, BF% as measured via air displacement plethysmography showed strong predictive ability (AUC 0.786 (0.70, 0.88)), while birthweight percentiles had a lower AUC (0.695 (0.57, 0.82)). Birthweight percentiles could also identify low and high BF% neonates with AUCs of 0.792 (0.74, 0.85) and 0.834 (0.79, 0.88). We applied a sex-specific approach to anthropometric estimation of neonatal fat mass, demonstrating the influence of the testing sample size on the final model performance. Conclusions: These models display potential for further development and evaluation in LMICs to detect infants in need of further nutritional management, especially where traditional methods of risk management such as birthweight for gestational age percentiles may be variable or non-existent, or unable to detect appropriately grown, low fat newborns. PMID:29601596
Regression Models for Identifying Noise Sources in Magnetic Resonance Images
Zhu, Hongtu; Li, Yimei; Ibrahim, Joseph G.; Shi, Xiaoyan; An, Hongyu; Chen, Yashen; Gao, Wei; Lin, Weili; Rowe, Daniel B.; Peterson, Bradley S.
2009-01-01
Stochastic noise, susceptibility artifacts, magnetic field and radiofrequency inhomogeneities, and other noise components in magnetic resonance images (MRIs) can introduce serious bias into any measurements made with those images. We formally introduce three regression models including a Rician regression model and two associated normal models to characterize stochastic noise in various magnetic resonance imaging modalities, including diffusion-weighted imaging (DWI) and functional MRI (fMRI). Estimation algorithms are introduced to maximize the likelihood function of the three regression models. We also develop a diagnostic procedure for systematically exploring MR images to identify noise components other than simple stochastic noise, and to detect discrepancies between the fitted regression models and MRI data. The diagnostic procedure includes goodness-of-fit statistics, measures of influence, and tools for graphical display. The goodness-of-fit statistics can assess the key assumptions of the three regression models, whereas measures of influence can isolate outliers caused by certain noise components, including motion artifacts. The tools for graphical display permit graphical visualization of the values for the goodness-of-fit statistic and influence measures. Finally, we conduct simulation studies to evaluate performance of these methods, and we analyze a real dataset to illustrate how our diagnostic procedure localizes subtle image artifacts by detecting intravoxel variability that is not captured by the regression models. PMID:19890478
NASA Astrophysics Data System (ADS)
Bae, Gihyun; Huh, Hoon; Park, Sungho
This paper deals with a regression model for light weight and crashworthiness enhancement design of automotive parts in frontal car crash. The ULSAB-AVC model is employed for the crash analysis and effective parts are selected based on the amount of energy absorption during the crash behavior. Finite element analyses are carried out for designated design cases in order to investigate the crashworthiness and weight according to the material and thickness of main energy absorption parts. Based on simulations results, a regression analysis is performed to construct a regression model utilized for light weight and crashworthiness enhancement design of automotive parts. An example for weight reduction of main energy absorption parts demonstrates the validity of a regression model constructed.
Army College Fund Cost-Effectiveness Study
1990-11-01
Section A.2 presents a theory of enlistment supply to provide a basis for specifying the regression model. The model is specified in Section A.3, which... Supplementary materials are included in the final four sections. Section A.6 provides annual trends in the regression model variables. Estimates of the model... millions. A.5. ESTIMATION OF A YOUTH EARNINGS FORECASTING MODEL: Civilian pay is an important explanatory variable in the regression model. Previous
RRegrs: an R package for computer-aided model selection with multiple regression models.
Tsiliki, Georgia; Munteanu, Cristian R; Seoane, Jose A; Fernandez-Lozano, Carlos; Sarimveis, Haralambos; Willighagen, Egon L
2015-01-01
Predictive regression models can be created with many different modelling approaches. Choices need to be made for data set splitting, cross-validation methods, specific regression parameters and best model criteria, as they all affect the accuracy and efficiency of the produced predictive models, thereby raising model reproducibility and comparison issues. Cheminformatics and bioinformatics use predictive modelling extensively and exhibit a need for standardization of these methodologies in order to assist model selection and speed up the process of predictive model development. A tool accessible to all users, irrespective of their statistical knowledge, would be valuable if it tests several simple and complex regression models and validation schemes, produces unified reports, and offers the option to be integrated into more extensive studies. Additionally, such a methodology should be implemented as a free programming package, in order to be continuously adapted and redistributed by others. We propose an integrated framework for creating multiple regression models, called RRegrs. The tool offers the option of ten simple and complex regression methods combined with repeated 10-fold and leave-one-out cross-validation. Methods include Multiple Linear regression, Generalized Linear Model with Stepwise Feature Selection, Partial Least Squares regression, Lasso regression, and Support Vector Machines Recursive Feature Elimination. The new framework is an automated, fully validated procedure which produces standardized reports to quickly oversee the impact of choices in modelling algorithms and assess the model and cross-validation results. The methodology was implemented as an open source R package, available at https://www.github.com/enanomapper/RRegrs, by reusing and extending the caret package. The universality of the new methodology is demonstrated using five standard data sets from different scientific fields. 
Its efficiency in cheminformatics and QSAR modelling is shown with three use cases: proteomics data for surface-modified gold nanoparticles, nano-metal oxides descriptor data, and molecular descriptors for acute aquatic toxicity data. The results show that for all data sets RRegrs reports models with equal or better performance for both training and test sets than those reported in the original publications. Its good performance as well as its adaptability in terms of parameter optimization could make RRegrs a popular framework to assist the initial exploration of predictive models, and with that, the design of more comprehensive in silico screening applications. Graphical abstract: RRegrs is a computer-aided model selection framework for R multiple regression models; this is a fully validated procedure with application to QSAR modelling.
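To make the repeated 10-fold cross-validation scheme concrete, the following is a minimal sketch in Python rather than R. It does not use or reproduce the RRegrs or caret APIs. It evaluates a plain multiple linear regression only, and the function name `repeated_kfold_rmse` and its defaults are assumptions for illustration.

```python
import numpy as np


def repeated_kfold_rmse(X, y, k=10, repeats=3, seed=0):
    """Average held-out RMSE of an OLS model over `repeats` random
    k-fold partitions; also returns the number of fold scores."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, float)
    A = np.column_stack([np.ones(len(y)), np.asarray(X, float)])
    scores = []
    for _ in range(repeats):
        # A fresh random partition of the row indices for every repeat.
        folds = np.array_split(rng.permutation(len(y)), k)
        for test_idx in folds:
            train = np.ones(len(y), dtype=bool)
            train[test_idx] = False
            coef, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)
            err = y[test_idx] - A[test_idx] @ coef
            scores.append(np.sqrt(np.mean(err**2)))
    return float(np.mean(scores)), len(scores)
```

Repeating the partition reduces the dependence of the reported error on one particular split, which is one of the reproducibility concerns the abstract raises.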
Ngeo, Jimson; Tamei, Tomoya; Shibata, Tomohiro
2014-01-01
Surface electromyographic (EMG) signals have often been used in estimating upper and lower limb dynamics and kinematics for the purpose of controlling robotic devices such as robot prosthesis and finger exoskeletons. However, in estimating multiple and a high number of degrees-of-freedom (DOF) kinematics from EMG, output DOFs are usually estimated independently. In this study, we estimate finger joint kinematics from EMG signals using a multi-output convolved Gaussian Process (Multi-output Full GP) that considers dependencies between outputs. We show that estimation of finger joints from muscle activation inputs can be improved by using a regression model that considers inherent coupling or correlation within the hand and finger joints. We also provide a comparison of estimation performance between different regression methods, such as Artificial Neural Networks (ANN) which is used by many of the related studies. We show that using a multi-output GP gives improved estimation compared to multi-output ANN and even dedicated or independent regression models.
Schmiege, Sarah J; Bryan, Angela D
2016-04-01
Justice-involved adolescents engage in high levels of risky sexual behavior and substance use, and understanding potential relationships among these constructs is important for effective HIV/STI prevention. A regression mixture modeling approach was used to determine whether subgroups could be identified based on the regression of two indicators of sexual risk (condom use and frequency of intercourse) on three measures of substance use (alcohol, marijuana and hard drugs). Three classes were observed among n = 596 adolescents on probation: none of the substances predicted outcomes for approximately 18 % of the sample; alcohol and marijuana use were predictive for approximately 59 % of the sample, and marijuana use and hard drug use were predictive in approximately 23 % of the sample. Demographic, individual difference, and additional sexual and substance use risk variables were examined in relation to class membership. Findings are discussed in terms of understanding profiles of risk behavior among at-risk youth.
NASA Astrophysics Data System (ADS)
Cai, Jun; Wang, Kuaishe; Shi, Jiamin; Wang, Wen; Liu, Yingying
2018-01-01
Constitutive analysis for hot working of BFe10-1-2 alloy was carried out by using experimental stress-strain data from isothermal hot compression tests over a wide temperature range of 1,023–1,273 K and a strain rate range of 0.001–10 s⁻¹. A constitutive equation based on modified double multiple nonlinear regression was proposed considering the independent effects of strain, strain rate, temperature and their interrelation. The predicted flow stress data calculated from the developed equation were compared with the experimental data. Correlation coefficient (R), average absolute relative error (AARE) and relative errors were introduced to verify the validity of the developed constitutive equation. Subsequently, a comparative study was made on the capability of a strain-compensated Arrhenius-type constitutive model. The results showed that the developed constitutive equation based on modified double multiple nonlinear regression could predict flow stress of BFe10-1-2 alloy with good correlation and generalization.
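The two validation statistics named above can be computed as in the sketch below. It uses the commonly cited definitions of the correlation coefficient R and of AARE as a percentage. The abstract does not print the formulas, so these definitions and the function names are assumptions.

```python
import numpy as np


def correlation_coefficient(experimental, predicted):
    """Pearson correlation R between experimental and predicted stress."""
    e = np.asarray(experimental, float)
    p = np.asarray(predicted, float)
    de, dp = e - e.mean(), p - p.mean()
    return float(np.sum(de * dp) / np.sqrt(np.sum(de**2) * np.sum(dp**2)))


def aare_percent(experimental, predicted):
    """Average absolute relative error, in percent of the experimental value."""
    e = np.asarray(experimental, float)
    p = np.asarray(predicted, float)
    return float(100.0 * np.mean(np.abs((e - p) / e)))
```

R alone can be high for a biased model, which is why a term-by-term measure such as AARE is usually reported alongside it.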
Motion patterns in acupuncture needle manipulation.
Seo, Yoonjeong; Lee, In-Seon; Jung, Won-Mo; Ryu, Ho-Sun; Lim, Jinwoong; Ryu, Yeon-Hee; Kang, Jung-Won; Chae, Younbyoung
2014-10-01
In clinical practice, acupuncture manipulation is highly individualised for each practitioner. Before we establish a standard for acupuncture manipulation, it is important to understand completely the manifestations of acupuncture manipulation in the actual clinic. To examine motion patterns during acupuncture manipulation, we generated a fitted model of practitioners' motion patterns and evaluated their consistencies in acupuncture manipulation. Using a motion sensor, we obtained real-time motion data from eight experienced practitioners while they conducted acupuncture manipulation using their own techniques. We calculated the average amplitude and duration of a sampled motion unit for each practitioner and, after normalisation, we generated a true regression curve of motion patterns for each practitioner using generalised additive mixed modelling (GAMM). We observed significant differences in rotation amplitude and duration in motion samples among practitioners. GAMM showed marked variations in average regression curves of motion patterns among practitioners but there was strong consistency in motion parameters for individual practitioners. The fitted regression model showed that the true regression curve accounted for an average of 50.2% of variance in the motion pattern for each practitioner. Our findings suggest that there is great inter-individual variability between practitioners, but remarkable intra-individual consistency within each practitioner. Published by the BMJ Publishing Group Limited.
Non-stationary hydrologic frequency analysis using B-spline quantile regression
NASA Astrophysics Data System (ADS)
Nasri, B.; Bouezmarni, T.; St-Hilaire, A.; Ouarda, T. B. M. J.
2017-11-01
Hydrologic frequency analysis is commonly used by engineers and hydrologists to provide the basic information for planning, design and management of hydraulic and water resources systems under the assumption of stationarity. However, with increasing evidence of climate change, the assumption of stationarity, which is a prerequisite for traditional frequency analysis, and hence the results of conventional analysis may become questionable. In this study, we consider a framework for frequency analysis of extremes based on B-spline quantile regression, which makes it possible to model data in the presence of non-stationarity and/or dependence on covariates, with linear and non-linear dependence. A Markov Chain Monte Carlo (MCMC) algorithm was used to estimate quantiles and their posterior distributions. A coefficient of determination and the Bayesian information criterion (BIC) for quantile regression are used to select the best model, i.e. for each quantile, we choose the degree and number of knots of the adequate B-spline quantile regression model. The method is applied to annual maximum and minimum streamflow records in Ontario, Canada. Climate indices are considered to describe the non-stationarity in the variable of interest and to estimate the quantiles in this case. The results show large differences between the non-stationary quantiles and their stationary equivalents for annual maximum and minimum discharge with high annual non-exceedance probabilities.
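The quantile regression idea behind this abstract can be illustrated with the check (pinball) loss, whose minimiser is the requested quantile. The sketch below is a deliberately simplified stationary case, fitting a single constant rather than a B-spline in covariates, and it uses a direct search rather than the MCMC estimation described in the abstract. The streamflow values, the function name and the choice tau = 0.85 are assumptions made for illustration.

```python
def pinball_loss(y, q_hat, tau):
    """Mean check loss of predictions q_hat for the tau-th quantile of y."""
    total = 0.0
    for yi, qi in zip(y, q_hat):
        residual = yi - qi
        # Under-prediction is weighted by tau, over-prediction by (1 - tau).
        total += tau * residual if residual >= 0 else (tau - 1.0) * residual
    return total / len(y)


# Hypothetical annual maximum flows (m^3/s), not data from the study.
flows = [12.0, 15.0, 9.0, 30.0, 18.0, 22.0, 11.0, 27.0, 14.0, 20.0]
tau = 0.85  # assumed non-exceedance probability

# Stationary case: search the observed values for the constant that
# minimises the check loss; this recovers an empirical tau-quantile.
best = min(flows, key=lambda c: pinball_loss(flows, [c] * len(flows), tau))
print(f"estimated {tau:.2f}-quantile: {best}")
```

In a non-stationary setting, the constant would be replaced by a function of covariates such as climate indices, with the same loss applied to its predictions.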
Held, Elizabeth; Cape, Joshua; Tintle, Nathan
2016-01-01
Machine learning methods continue to show promise in the analysis of data from genetic association studies because of the high number of variables relative to the number of observations. However, few best practices exist for the application of these methods. We extend a recently proposed supervised machine learning approach for predicting disease risk by genotypes to be able to incorporate gene expression data and rare variants. We then apply 2 different versions of the approach (radial and linear support vector machines) to simulated data from Genetic Analysis Workshop 19 and compare performance to logistic regression. Method performance was not radically different across the 3 methods, although the linear support vector machine tended to show small gains in predictive ability relative to a radial support vector machine and logistic regression. Importantly, as the number of genes in the models was increased, even when those genes contained causal rare variants, model predictive ability showed a statistically significant decrease in performance for both the radial support vector machine and logistic regression. The linear support vector machine showed more robust performance to the inclusion of additional genes. Further work is needed to evaluate machine learning approaches on larger samples and to evaluate the relative improvement in model prediction from the incorporation of gene expression data.
Perl-speaks-NONMEM (PsN)--a Perl module for NONMEM related programming.
Lindbom, Lars; Ribbing, Jakob; Jonsson, E Niclas
2004-08-01
The NONMEM program is the most widely used nonlinear regression software in population pharmacokinetic/pharmacodynamic (PK/PD) analyses. In this article we describe a programming library, Perl-speaks-NONMEM (PsN), intended for programmers who aim to use the computational capability of NONMEM in external applications. The library is object oriented and written in the programming language Perl. The classes of the library are built around NONMEM's data, model and output files. The specification of the NONMEM model is easily set or changed through the model and data file classes, while the output from a model fit is accessed through the output file class. The classes have methods that help the programmer perform common repetitive tasks, e.g. summarising the output from a NONMEM run, setting the initial estimates of a model based on a previous run or truncating values over a certain threshold in the data file. PsN creates a basis for the development of high-level software using NONMEM as the regression tool.
Overhead longwave infrared hyperspectral material identification using radiometric models
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zelinski, M. E.
Material detection algorithms used in hyperspectral data processing are computationally efficient but can produce relatively high numbers of false positives. Material identification performed as a secondary processing step on detected pixels can help separate true and false positives. This paper presents a material identification processing chain for longwave infrared hyperspectral data of solid materials collected from airborne platforms. The algorithms utilize unwhitened radiance data and an iterative algorithm that determines the temperature, humidity, and ozone of the atmospheric profile. Pixel unmixing is done using constrained linear regression and Bayesian Information Criteria for model selection. The resulting product includes an optimal atmospheric profile and full radiance material model that includes material temperature, abundance values, and several fit statistics. A logistic regression method utilizing all model parameters to improve identification is also presented. This paper details the processing chain and provides justification for the algorithms used. Several examples are provided using modeled data at different noise levels.
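The use of a Bayesian information criterion to choose between unmixing models can be sketched as below. This is only an illustration of the selection step under a Gaussian-error assumption, where BIC reduces (up to a constant) to n ln(RSS/n) + k ln(n). The material names, band count and residual sums of squares are hypothetical stand-ins for the output of a constrained regression, which is not implemented here.

```python
import math


def bic_gaussian(rss, n, k):
    """BIC up to an additive constant, assuming Gaussian residuals.

    rss: residual sum of squares, n: number of observations (spectral bands),
    k: number of fitted parameters (here, one abundance per material).
    """
    return n * math.log(rss / n) + k * math.log(n)


n_bands = 128  # assumed number of spectral bands

# Hypothetical candidate models: materials included -> residual sum of squares.
candidates = {
    ("material_A",): 4.10,
    ("material_A", "material_B"): 2.05,
    ("material_A", "material_B", "material_C"): 2.04,
}

scores = {
    names: bic_gaussian(rss, n_bands, len(names))
    for names, rss in candidates.items()
}
best = min(scores, key=scores.get)  # lowest BIC wins
print("selected materials:", best)
```

With these assumed numbers, adding a third material lowers the residual only slightly, so the ln(n) penalty per parameter outweighs the gain and the two-material model is preferred.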