Sample records for regression method results

  1. Synthesizing Regression Results: A Factored Likelihood Method

    ERIC Educational Resources Information Center

    Wu, Meng-Jia; Becker, Betsy Jane

    2013-01-01

    Regression methods are widely used by researchers in many fields, yet methods for synthesizing regression results are scarce. This study proposes using a factored likelihood method, originally developed to handle missing data, to appropriately synthesize regression models involving different predictors. This method uses the correlations reported…

  2. Testing Different Model Building Procedures Using Multiple Regression.

    ERIC Educational Resources Information Center

    Thayer, Jerome D.

    The stepwise regression method of selecting predictors for computer assisted multiple regression analysis was compared with forward, backward, and best subsets regression, using 16 data sets. The results indicated the stepwise method was preferred because of its practical nature, when the models chosen by different selection methods were similar…

  3. Modeling Governance KB with CATPCA to Overcome Multicollinearity in the Logistic Regression

    NASA Astrophysics Data System (ADS)

    Khikmah, L.; Wijayanto, H.; Syafitri, U. D.

    2017-04-01

    The problem often encounters in logistic regression modeling are multicollinearity problems. Data that have multicollinearity between explanatory variables with the result in the estimation of parameters to be bias. Besides, the multicollinearity will result in error in the classification. In general, to overcome multicollinearity in regression used stepwise regression. They are also another method to overcome multicollinearity which involves all variable for prediction. That is Principal Component Analysis (PCA). However, classical PCA in only for numeric data. Its data are categorical, one method to solve the problems is Categorical Principal Component Analysis (CATPCA). Data were used in this research were a part of data Demographic and Population Survey Indonesia (IDHS) 2012. This research focuses on the characteristic of women of using the contraceptive methods. Classification results evaluated using Area Under Curve (AUC) values. The higher the AUC value, the better. Based on AUC values, the classification of the contraceptive method using stepwise method (58.66%) is better than the logistic regression model (57.39%) and CATPCA (57.39%). Evaluation of the results of logistic regression using sensitivity, shows the opposite where CATPCA method (99.79%) is better than logistic regression method (92.43%) and stepwise (92.05%). Therefore in this study focuses on major class classification (using a contraceptive method), then the selected model is CATPCA because it can raise the level of the major class model accuracy.

  4. Basis Selection for Wavelet Regression

    NASA Technical Reports Server (NTRS)

    Wheeler, Kevin R.; Lau, Sonie (Technical Monitor)

    1998-01-01

    A wavelet basis selection procedure is presented for wavelet regression. Both the basis and the threshold are selected using cross-validation. The method includes the capability of incorporating prior knowledge on the smoothness (or shape of the basis functions) into the basis selection procedure. The results of the method are demonstrated on sampled functions widely used in the wavelet regression literature. The results of the method are contrasted with other published methods.

  5. Shrinkage regression-based methods for microarray missing value imputation.

    PubMed

    Wang, Hsiuying; Chiu, Chia-Chun; Wu, Yi-Ching; Wu, Wei-Sheng

    2013-01-01

    Missing values commonly occur in the microarray data, which usually contain more than 5% missing values with up to 90% of genes affected. Inaccurate missing value estimation results in reducing the power of downstream microarray data analyses. Many types of methods have been developed to estimate missing values. Among them, the regression-based methods are very popular and have been shown to perform better than the other types of methods in many testing microarray datasets. To further improve the performances of the regression-based methods, we propose shrinkage regression-based methods. Our methods take the advantage of the correlation structure in the microarray data and select similar genes for the target gene by Pearson correlation coefficients. Besides, our methods incorporate the least squares principle, utilize a shrinkage estimation approach to adjust the coefficients of the regression model, and then use the new coefficients to estimate missing values. Simulation results show that the proposed methods provide more accurate missing value estimation in six testing microarray datasets than the existing regression-based methods do. Imputation of missing values is a very important aspect of microarray data analyses because most of the downstream analyses require a complete dataset. Therefore, exploring accurate and efficient methods for estimating missing values has become an essential issue. Since our proposed shrinkage regression-based methods can provide accurate missing value estimation, they are competitive alternatives to the existing regression-based methods.

  6. Detecting outliers when fitting data with nonlinear regression – a new method based on robust nonlinear regression and the false discovery rate

    PubMed Central

    Motulsky, Harvey J; Brown, Ronald E

    2006-01-01

    Background Nonlinear regression, like linear regression, assumes that the scatter of data around the ideal curve follows a Gaussian or normal distribution. This assumption leads to the familiar goal of regression: to minimize the sum of the squares of the vertical or Y-value distances between the points and the curve. Outliers can dominate the sum-of-the-squares calculation, and lead to misleading results. However, we know of no practical method for routinely identifying outliers when fitting curves with nonlinear regression. Results We describe a new method for identifying outliers when fitting data with nonlinear regression. We first fit the data using a robust form of nonlinear regression, based on the assumption that scatter follows a Lorentzian distribution. We devised a new adaptive method that gradually becomes more robust as the method proceeds. To define outliers, we adapted the false discovery rate approach to handling multiple comparisons. We then remove the outliers, and analyze the data using ordinary least-squares regression. Because the method combines robust regression and outlier removal, we call it the ROUT method. When analyzing simulated data, where all scatter is Gaussian, our method detects (falsely) one or more outlier in only about 1–3% of experiments. When analyzing data contaminated with one or several outliers, the ROUT method performs well at outlier identification, with an average False Discovery Rate less than 1%. Conclusion Our method, which combines a new method of robust nonlinear regression with a new method of outlier identification, identifies outliers from nonlinear curve fits with reasonable power and few false positives. PMID:16526949

  7. Kernel Partial Least Squares for Nonlinear Regression and Discrimination

    NASA Technical Reports Server (NTRS)

    Rosipal, Roman; Clancy, Daniel (Technical Monitor)

    2002-01-01

    This paper summarizes recent results on applying the method of partial least squares (PLS) in a reproducing kernel Hilbert space (RKHS). A previously proposed kernel PLS regression model was proven to be competitive with other regularized regression methods in RKHS. The family of nonlinear kernel-based PLS models is extended by considering the kernel PLS method for discrimination. Theoretical and experimental results on a two-class discrimination problem indicate usefulness of the method.

  8. [Regression on order statistics and its application in estimating nondetects for food exposure assessment].

    PubMed

    Yu, Xiaojin; Liu, Pei; Min, Jie; Chen, Qiguang

    2009-01-01

    To explore the application of regression on order statistics (ROS) in estimating nondetects for food exposure assessment. Regression on order statistics was adopted in analysis of cadmium residual data set from global food contaminant monitoring, the mean residual was estimated basing SAS programming and compared with the results from substitution methods. The results show that ROS method performs better obviously than substitution methods for being robust and convenient for posterior analysis. Regression on order statistics is worth to adopt,but more efforts should be make for details of application of this method.

  9. Bootstrap Enhanced Penalized Regression for Variable Selection with Neuroimaging Data.

    PubMed

    Abram, Samantha V; Helwig, Nathaniel E; Moodie, Craig A; DeYoung, Colin G; MacDonald, Angus W; Waller, Niels G

    2016-01-01

    Recent advances in fMRI research highlight the use of multivariate methods for examining whole-brain connectivity. Complementary data-driven methods are needed for determining the subset of predictors related to individual differences. Although commonly used for this purpose, ordinary least squares (OLS) regression may not be ideal due to multi-collinearity and over-fitting issues. Penalized regression is a promising and underutilized alternative to OLS regression. In this paper, we propose a nonparametric bootstrap quantile (QNT) approach for variable selection with neuroimaging data. We use real and simulated data, as well as annotated R code, to demonstrate the benefits of our proposed method. Our results illustrate the practical potential of our proposed bootstrap QNT approach. Our real data example demonstrates how our method can be used to relate individual differences in neural network connectivity with an externalizing personality measure. Also, our simulation results reveal that the QNT method is effective under a variety of data conditions. Penalized regression yields more stable estimates and sparser models than OLS regression in situations with large numbers of highly correlated neural predictors. Our results demonstrate that penalized regression is a promising method for examining associations between neural predictors and clinically relevant traits or behaviors. These findings have important implications for the growing field of functional connectivity research, where multivariate methods produce numerous, highly correlated brain networks.

  10. Bootstrap Enhanced Penalized Regression for Variable Selection with Neuroimaging Data

    PubMed Central

    Abram, Samantha V.; Helwig, Nathaniel E.; Moodie, Craig A.; DeYoung, Colin G.; MacDonald, Angus W.; Waller, Niels G.

    2016-01-01

    Recent advances in fMRI research highlight the use of multivariate methods for examining whole-brain connectivity. Complementary data-driven methods are needed for determining the subset of predictors related to individual differences. Although commonly used for this purpose, ordinary least squares (OLS) regression may not be ideal due to multi-collinearity and over-fitting issues. Penalized regression is a promising and underutilized alternative to OLS regression. In this paper, we propose a nonparametric bootstrap quantile (QNT) approach for variable selection with neuroimaging data. We use real and simulated data, as well as annotated R code, to demonstrate the benefits of our proposed method. Our results illustrate the practical potential of our proposed bootstrap QNT approach. Our real data example demonstrates how our method can be used to relate individual differences in neural network connectivity with an externalizing personality measure. Also, our simulation results reveal that the QNT method is effective under a variety of data conditions. Penalized regression yields more stable estimates and sparser models than OLS regression in situations with large numbers of highly correlated neural predictors. Our results demonstrate that penalized regression is a promising method for examining associations between neural predictors and clinically relevant traits or behaviors. These findings have important implications for the growing field of functional connectivity research, where multivariate methods produce numerous, highly correlated brain networks. PMID:27516732

  11. The comparison of robust partial least squares regression with robust principal component regression on a real

    NASA Astrophysics Data System (ADS)

    Polat, Esra; Gunay, Suleyman

    2013-10-01

    One of the problems encountered in Multiple Linear Regression (MLR) is multicollinearity, which causes the overestimation of the regression parameters and increase of the variance of these parameters. Hence, in case of multicollinearity presents, biased estimation procedures such as classical Principal Component Regression (CPCR) and Partial Least Squares Regression (PLSR) are then performed. SIMPLS algorithm is the leading PLSR algorithm because of its speed, efficiency and results are easier to interpret. However, both of the CPCR and SIMPLS yield very unreliable results when the data set contains outlying observations. Therefore, Hubert and Vanden Branden (2003) have been presented a robust PCR (RPCR) method and a robust PLSR (RPLSR) method called RSIMPLS. In RPCR, firstly, a robust Principal Component Analysis (PCA) method for high-dimensional data on the independent variables is applied, then, the dependent variables are regressed on the scores using a robust regression method. RSIMPLS has been constructed from a robust covariance matrix for high-dimensional data and robust linear regression. The purpose of this study is to show the usage of RPCR and RSIMPLS methods on an econometric data set, hence, making a comparison of two methods on an inflation model of Turkey. The considered methods have been compared in terms of predictive ability and goodness of fit by using a robust Root Mean Squared Error of Cross-validation (R-RMSECV), a robust R2 value and Robust Component Selection (RCS) statistic.

  12. [Comparison of application of Cochran-Armitage trend test and linear regression analysis for rate trend analysis in epidemiology study].

    PubMed

    Wang, D Z; Wang, C; Shen, C F; Zhang, Y; Zhang, H; Song, G D; Xue, X D; Xu, Z L; Zhang, S; Jiang, G H

    2017-05-10

    We described the time trend of acute myocardial infarction (AMI) from 1999 to 2013 in Tianjin incidence rate with Cochran-Armitage trend (CAT) test and linear regression analysis, and the results were compared. Based on actual population, CAT test had much stronger statistical power than linear regression analysis for both overall incidence trend and age specific incidence trend (Cochran-Armitage trend P value

  13. Methods for calculating confidence and credible intervals for the residual between-study variance in random effects meta-regression models

    PubMed Central

    2014-01-01

    Background Meta-regression is becoming increasingly used to model study level covariate effects. However this type of statistical analysis presents many difficulties and challenges. Here two methods for calculating confidence intervals for the magnitude of the residual between-study variance in random effects meta-regression models are developed. A further suggestion for calculating credible intervals using informative prior distributions for the residual between-study variance is presented. Methods Two recently proposed and, under the assumptions of the random effects model, exact methods for constructing confidence intervals for the between-study variance in random effects meta-analyses are extended to the meta-regression setting. The use of Generalised Cochran heterogeneity statistics is extended to the meta-regression setting and a Newton-Raphson procedure is developed to implement the Q profile method for meta-analysis and meta-regression. WinBUGS is used to implement informative priors for the residual between-study variance in the context of Bayesian meta-regressions. Results Results are obtained for two contrasting examples, where the first example involves a binary covariate and the second involves a continuous covariate. Intervals for the residual between-study variance are wide for both examples. Conclusions Statistical methods, and R computer software, are available to compute exact confidence intervals for the residual between-study variance under the random effects model for meta-regression. These frequentist methods are almost as easily implemented as their established counterparts for meta-analysis. Bayesian meta-regressions are also easily performed by analysts who are comfortable using WinBUGS. Estimates of the residual between-study variance in random effects meta-regressions should be routinely reported and accompanied by some measure of their uncertainty. Confidence and/or credible intervals are well-suited to this purpose. PMID:25196829

  14. Improving validation methods for molecular diagnostics: application of Bland-Altman, Deming and simple linear regression analyses in assay comparison and evaluation for next-generation sequencing

    PubMed Central

    Misyura, Maksym; Sukhai, Mahadeo A; Kulasignam, Vathany; Zhang, Tong; Kamel-Reid, Suzanne; Stockley, Tracy L

    2018-01-01

    Aims A standard approach in test evaluation is to compare results of the assay in validation to results from previously validated methods. For quantitative molecular diagnostic assays, comparison of test values is often performed using simple linear regression and the coefficient of determination (R2), using R2 as the primary metric of assay agreement. However, the use of R2 alone does not adequately quantify constant or proportional errors required for optimal test evaluation. More extensive statistical approaches, such as Bland-Altman and expanded interpretation of linear regression methods, can be used to more thoroughly compare data from quantitative molecular assays. Methods We present the application of Bland-Altman and linear regression statistical methods to evaluate quantitative outputs from next-generation sequencing assays (NGS). NGS-derived data sets from assay validation experiments were used to demonstrate the utility of the statistical methods. Results Both Bland-Altman and linear regression were able to detect the presence and magnitude of constant and proportional error in quantitative values of NGS data. Deming linear regression was used in the context of assay comparison studies, while simple linear regression was used to analyse serial dilution data. Bland-Altman statistical approach was also adapted to quantify assay accuracy, including constant and proportional errors, and precision where theoretical and empirical values were known. Conclusions The complementary application of the statistical methods described in this manuscript enables more extensive evaluation of performance characteristics of quantitative molecular assays, prior to implementation in the clinical molecular laboratory. PMID:28747393

  15. Price of gasoline: forecasting comparisons. [Box-Jenkins, econometric, and regression methods

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bopp, A.E.; Neri, J.A.

    Gasoline prices are simulated using three popular forecasting methodologies: A Box--Jenkins type method, an econometric method, and a regression method. One-period-ahead and 18-period-ahead comparisons are made. For the one-period-ahead method, a Box--Jenkins type time-series model simulated best, although all do well. However, for the 18-period simulation, the econometric and regression methods perform substantially better than the Box-Jenkins formulation. A rationale for and implications of these results ae discussed. 11 references.

  16. Comparative study of some robust statistical methods: weighted, parametric, and nonparametric linear regression of HPLC convoluted peak responses using internal standard method in drug bioavailability studies.

    PubMed

    Korany, Mohamed A; Maher, Hadir M; Galal, Shereen M; Ragab, Marwa A A

    2013-05-01

    This manuscript discusses the application and the comparison between three statistical regression methods for handling data: parametric, nonparametric, and weighted regression (WR). These data were obtained from different chemometric methods applied to the high-performance liquid chromatography response data using the internal standard method. This was performed on a model drug Acyclovir which was analyzed in human plasma with the use of ganciclovir as internal standard. In vivo study was also performed. Derivative treatment of chromatographic response ratio data was followed by convolution of the resulting derivative curves using 8-points sin x i polynomials (discrete Fourier functions). This work studies and also compares the application of WR method and Theil's method, a nonparametric regression (NPR) method with the least squares parametric regression (LSPR) method, which is considered the de facto standard method used for regression. When the assumption of homoscedasticity is not met for analytical data, a simple and effective way to counteract the great influence of the high concentrations on the fitted regression line is to use WR method. WR was found to be superior to the method of LSPR as the former assumes that the y-direction error in the calibration curve will increase as x increases. Theil's NPR method was also found to be superior to the method of LSPR as the former assumes that errors could occur in both x- and y-directions and that might not be normally distributed. Most of the results showed a significant improvement in the precision and accuracy on applying WR and NPR methods relative to LSPR.

  17. A regularization corrected score method for nonlinear regression models with covariate error.

    PubMed

    Zucker, David M; Gorfine, Malka; Li, Yi; Tadesse, Mahlet G; Spiegelman, Donna

    2013-03-01

    Many regression analyses involve explanatory variables that are measured with error, and failing to account for this error is well known to lead to biased point and interval estimates of the regression coefficients. We present here a new general method for adjusting for covariate error. Our method consists of an approximate version of the Stefanski-Nakamura corrected score approach, using the method of regularization to obtain an approximate solution of the relevant integral equation. We develop the theory in the setting of classical likelihood models; this setting covers, for example, linear regression, nonlinear regression, logistic regression, and Poisson regression. The method is extremely general in terms of the types of measurement error models covered, and is a functional method in the sense of not involving assumptions on the distribution of the true covariate. We discuss the theoretical properties of the method and present simulation results in the logistic regression setting (univariate and multivariate). For illustration, we apply the method to data from the Harvard Nurses' Health Study concerning the relationship between physical activity and breast cancer mortality in the period following a diagnosis of breast cancer. Copyright © 2013, The International Biometric Society.

  18. A comparison of methods for the analysis of binomial clustered outcomes in behavioral research.

    PubMed

    Ferrari, Alberto; Comelli, Mario

    2016-12-01

    In behavioral research, data consisting of a per-subject proportion of "successes" and "failures" over a finite number of trials often arise. This clustered binary data are usually non-normally distributed, which can distort inference if the usual general linear model is applied and sample size is small. A number of more advanced methods is available, but they are often technically challenging and a comparative assessment of their performances in behavioral setups has not been performed. We studied the performances of some methods applicable to the analysis of proportions; namely linear regression, Poisson regression, beta-binomial regression and Generalized Linear Mixed Models (GLMMs). We report on a simulation study evaluating power and Type I error rate of these models in hypothetical scenarios met by behavioral researchers; plus, we describe results from the application of these methods on data from real experiments. Our results show that, while GLMMs are powerful instruments for the analysis of clustered binary outcomes, beta-binomial regression can outperform them in a range of scenarios. Linear regression gave results consistent with the nominal level of significance, but was overall less powerful. Poisson regression, instead, mostly led to anticonservative inference. GLMMs and beta-binomial regression are generally more powerful than linear regression; yet linear regression is robust to model misspecification in some conditions, whereas Poisson regression suffers heavily from violations of the assumptions when used to model proportion data. We conclude providing directions to behavioral scientists dealing with clustered binary data and small sample sizes. Copyright © 2016 Elsevier B.V. All rights reserved.

  19. Double Cross-Validation in Multiple Regression: A Method of Estimating the Stability of Results.

    ERIC Educational Resources Information Center

    Rowell, R. Kevin

    In multiple regression analysis, where resulting predictive equation effectiveness is subject to shrinkage, it is especially important to evaluate result replicability. Double cross-validation is an empirical method by which an estimate of invariance or stability can be obtained from research data. A procedure for double cross-validation is…

  20. Selection of a Geostatistical Method to Interpolate Soil Properties of the State Crop Testing Fields using Attributes of a Digital Terrain Model

    NASA Astrophysics Data System (ADS)

    Sahabiev, I. A.; Ryazanov, S. S.; Kolcova, T. G.; Grigoryan, B. R.

    2018-03-01

    The three most common techniques to interpolate soil properties at a field scale—ordinary kriging (OK), regression kriging with multiple linear regression drift model (RK + MLR), and regression kriging with principal component regression drift model (RK + PCR)—were examined. The results of the performed study were compiled into an algorithm of choosing the most appropriate soil mapping technique. Relief attributes were used as the auxiliary variables. When spatial dependence of a target variable was strong, the OK method showed more accurate interpolation results, and the inclusion of the auxiliary data resulted in an insignificant improvement in prediction accuracy. According to the algorithm, the RK + PCR method effectively eliminates multicollinearity of explanatory variables. However, if the number of predictors is less than ten, the probability of multicollinearity is reduced, and application of the PCR becomes irrational. In that case, the multiple linear regression should be used instead.

  1. Using the Ridge Regression Procedures to Estimate the Multiple Linear Regression Coefficients

    NASA Astrophysics Data System (ADS)

    Gorgees, HazimMansoor; Mahdi, FatimahAssim

    2018-05-01

    This article concerns with comparing the performance of different types of ordinary ridge regression estimators that have been already proposed to estimate the regression parameters when the near exact linear relationships among the explanatory variables is presented. For this situations we employ the data obtained from tagi gas filling company during the period (2008-2010). The main result we reached is that the method based on the condition number performs better than other methods since it has smaller mean square error (MSE) than the other stated methods.

  2. Prediction of monthly rainfall in Victoria, Australia: Clusterwise linear regression approach

    NASA Astrophysics Data System (ADS)

    Bagirov, Adil M.; Mahmood, Arshad; Barton, Andrew

    2017-05-01

    This paper develops the Clusterwise Linear Regression (CLR) technique for prediction of monthly rainfall. The CLR is a combination of clustering and regression techniques. It is formulated as an optimization problem and an incremental algorithm is designed to solve it. The algorithm is applied to predict monthly rainfall in Victoria, Australia using rainfall data with five input meteorological variables over the period of 1889-2014 from eight geographically diverse weather stations. The prediction performance of the CLR method is evaluated by comparing observed and predicted rainfall values using four measures of forecast accuracy. The proposed method is also compared with the CLR using the maximum likelihood framework by the expectation-maximization algorithm, multiple linear regression, artificial neural networks and the support vector machines for regression models using computational results. The results demonstrate that the proposed algorithm outperforms other methods in most locations.

  3. Resting-state functional magnetic resonance imaging: the impact of regression analysis.

    PubMed

    Yeh, Chia-Jung; Tseng, Yu-Sheng; Lin, Yi-Ru; Tsai, Shang-Yueh; Huang, Teng-Yi

    2015-01-01

    To investigate the impact of regression methods on resting-state functional magnetic resonance imaging (rsfMRI). During rsfMRI preprocessing, regression analysis is considered effective for reducing the interference of physiological noise on the signal time course. However, it is unclear whether the regression method benefits rsfMRI analysis. Twenty volunteers (10 men and 10 women; aged 23.4 ± 1.5 years) participated in the experiments. We used node analysis and functional connectivity mapping to assess the brain default mode network by using five combinations of regression methods. The results show that regressing the global mean plays a major role in the preprocessing steps. When a global regression method is applied, the values of functional connectivity are significantly lower (P ≤ .01) than those calculated without a global regression. This step increases inter-subject variation and produces anticorrelated brain areas. rsfMRI data processed using regression should be interpreted carefully. The significance of the anticorrelated brain areas produced by global signal removal is unclear. Copyright © 2014 by the American Society of Neuroimaging.

  4. Assessment of Weighted Quantile Sum Regression for Modeling Chemical Mixtures and Cancer Risk

    PubMed Central

    Czarnota, Jenna; Gennings, Chris; Wheeler, David C

    2015-01-01

    In evaluation of cancer risk related to environmental chemical exposures, the effect of many chemicals on disease is ultimately of interest. However, because of potentially strong correlations among chemicals that occur together, traditional regression methods suffer from collinearity effects, including regression coefficient sign reversal and variance inflation. In addition, penalized regression methods designed to remediate collinearity may have limitations in selecting the truly bad actors among many correlated components. The recently proposed method of weighted quantile sum (WQS) regression attempts to overcome these problems by estimating a body burden index, which identifies important chemicals in a mixture of correlated environmental chemicals. Our focus was on assessing through simulation studies the accuracy of WQS regression in detecting subsets of chemicals associated with health outcomes (binary and continuous) in site-specific analyses and in non-site-specific analyses. We also evaluated the performance of the penalized regression methods of lasso, adaptive lasso, and elastic net in correctly classifying chemicals as bad actors or unrelated to the outcome. We based the simulation study on data from the National Cancer Institute Surveillance Epidemiology and End Results Program (NCI-SEER) case–control study of non-Hodgkin lymphoma (NHL) to achieve realistic exposure situations. Our results showed that WQS regression had good sensitivity and specificity across a variety of conditions considered in this study. The shrinkage methods had a tendency to incorrectly identify a large number of components, especially in the case of strong association with the outcome. PMID:26005323

  5. Assessment of weighted quantile sum regression for modeling chemical mixtures and cancer risk.

    PubMed

    Czarnota, Jenna; Gennings, Chris; Wheeler, David C

    2015-01-01

    In evaluation of cancer risk related to environmental chemical exposures, the effect of many chemicals on disease is ultimately of interest. However, because of potentially strong correlations among chemicals that occur together, traditional regression methods suffer from collinearity effects, including regression coefficient sign reversal and variance inflation. In addition, penalized regression methods designed to remediate collinearity may have limitations in selecting the truly bad actors among many correlated components. The recently proposed method of weighted quantile sum (WQS) regression attempts to overcome these problems by estimating a body burden index, which identifies important chemicals in a mixture of correlated environmental chemicals. Our focus was on assessing through simulation studies the accuracy of WQS regression in detecting subsets of chemicals associated with health outcomes (binary and continuous) in site-specific analyses and in non-site-specific analyses. We also evaluated the performance of the penalized regression methods of lasso, adaptive lasso, and elastic net in correctly classifying chemicals as bad actors or unrelated to the outcome. We based the simulation study on data from the National Cancer Institute Surveillance Epidemiology and End Results Program (NCI-SEER) case-control study of non-Hodgkin lymphoma (NHL) to achieve realistic exposure situations. Our results showed that WQS regression had good sensitivity and specificity across a variety of conditions considered in this study. The shrinkage methods had a tendency to incorrectly identify a large number of components, especially in the case of strong association with the outcome.

  6. Regression Analysis by Example. 5th Edition

    ERIC Educational Resources Information Center

    Chatterjee, Samprit; Hadi, Ali S.

    2012-01-01

    Regression analysis is a conceptually simple method for investigating relationships among variables. Carrying out a successful application of regression analysis, however, requires a balance of theoretical results, empirical rules, and subjective judgment. "Regression Analysis by Example, Fifth Edition" has been expanded and thoroughly…

  7. An empirical study using permutation-based resampling in meta-regression

    PubMed Central

    2012-01-01

    Background In meta-regression, as the number of trials in the analyses decreases, the risk of false positives or false negatives increases. This is partly due to the assumption of normality that may not hold in small samples. Creation of a distribution from the observed trials using permutation methods to calculate P values may allow for less spurious findings. Permutation has not been empirically tested in meta-regression. The objective of this study was to perform an empirical investigation to explore the differences in results for meta-analyses on a small number of trials using standard large sample approaches verses permutation-based methods for meta-regression. Methods We isolated a sample of randomized controlled clinical trials (RCTs) for interventions that have a small number of trials (herbal medicine trials). Trials were then grouped by herbal species and condition and assessed for methodological quality using the Jadad scale, and data were extracted for each outcome. Finally, we performed meta-analyses on the primary outcome of each group of trials and meta-regression for methodological quality subgroups within each meta-analysis. We used large sample methods and permutation methods in our meta-regression modeling. We then compared final models and final P values between methods. Results We collected 110 trials across 5 intervention/outcome pairings and 5 to 10 trials per covariate. When applying large sample methods and permutation-based methods in our backwards stepwise regression the covariates in the final models were identical in all cases. The P values for the covariates in the final model were larger in 78% (7/9) of the cases for permutation and identical for 22% (2/9) of the cases. Conclusions We present empirical evidence that permutation-based resampling may not change final models when using backwards stepwise regression, but may increase P values in meta-regression of multiple covariates for relatively small amount of trials. PMID:22587815

  8. A collaborative comparison of objective structured clinical examination (OSCE) standard setting methods at Australian medical schools.

    PubMed

    Malau-Aduli, Bunmi Sherifat; Teague, Peta-Ann; D'Souza, Karen; Heal, Clare; Turner, Richard; Garne, David L; van der Vleuten, Cees

    2017-12-01

    A key issue underpinning the usefulness of the OSCE assessment to medical education is standard setting, but the majority of standard-setting methods remain challenging for performance assessment because they produce varying passing marks. Several studies have compared standard-setting methods; however, most of these studies are limited by their experimental scope, or use data on examinee performance at a single OSCE station or from a single medical school. This collaborative study between 10 Australian medical schools investigated the effect of standard-setting methods on OSCE cut scores and failure rates. This research used 5256 examinee scores from seven shared OSCE stations to calculate cut scores and failure rates using two different compromise standard-setting methods, namely the Borderline Regression and Cohen's methods. The results of this study indicate that Cohen's method yields similar outcomes to the Borderline Regression method, particularly for large examinee cohort sizes. However, with lower examinee numbers on a station, the Borderline Regression method resulted in higher cut scores and larger difference margins in the failure rates. Cohen's method yields similar outcomes as the Borderline Regression method and its application for benchmarking purposes and in resource-limited settings is justifiable, particularly with large examinee numbers.

  9. Evaluating Differential Effects Using Regression Interactions and Regression Mixture Models

    ERIC Educational Resources Information Center

    Van Horn, M. Lee; Jaki, Thomas; Masyn, Katherine; Howe, George; Feaster, Daniel J.; Lamont, Andrea E.; George, Melissa R. W.; Kim, Minjung

    2015-01-01

    Research increasingly emphasizes understanding differential effects. This article focuses on understanding regression mixture models, which are relatively new statistical methods for assessing differential effects by comparing results to using an interactive term in linear regression. The research questions which each model answers, their…

  10. High correlations between MRI brain volume measurements based on NeuroQuant® and FreeSurfer.

    PubMed

    Ross, David E; Ochs, Alfred L; Tate, David F; Tokac, Umit; Seabaugh, John; Abildskov, Tracy J; Bigler, Erin D

    2018-05-30

    NeuroQuant ® (NQ) and FreeSurfer (FS) are commonly used computer-automated programs for measuring MRI brain volume. Previously they were reported to have high intermethod reliabilities but often large intermethod effect size differences. We hypothesized that linear transformations could be used to reduce the large effect sizes. This study was an extension of our previously reported study. We performed NQ and FS brain volume measurements on 60 subjects (including normal controls, patients with traumatic brain injury, and patients with Alzheimer's disease). We used two statistical approaches in parallel to develop methods for transforming FS volumes into NQ volumes: traditional linear regression, and Bayesian linear regression. For both methods, we used regression analyses to develop linear transformations of the FS volumes to make them more similar to the NQ volumes. The FS-to-NQ transformations based on traditional linear regression resulted in effect sizes which were small to moderate. The transformations based on Bayesian linear regression resulted in all effect sizes being trivially small. To our knowledge, this is the first report describing a method for transforming FS to NQ data so as to achieve high reliability and low effect size differences. Machine learning methods like Bayesian regression may be more useful than traditional methods. Copyright © 2018 Elsevier B.V. All rights reserved.

  11. Error Covariance Penalized Regression: A novel multivariate model combining penalized regression with multivariate error structure.

    PubMed

    Allegrini, Franco; Braga, Jez W B; Moreira, Alessandro C O; Olivieri, Alejandro C

    2018-06-29

    A new multivariate regression model, named Error Covariance Penalized Regression (ECPR) is presented. Following a penalized regression strategy, the proposed model incorporates information about the measurement error structure of the system, using the error covariance matrix (ECM) as a penalization term. Results are reported from both simulations and experimental data based on replicate mid and near infrared (MIR and NIR) spectral measurements. The results for ECPR are better under non-iid conditions when compared with traditional first-order multivariate methods such as ridge regression (RR), principal component regression (PCR) and partial least-squares regression (PLS). Copyright © 2018 Elsevier B.V. All rights reserved.

  12. Linear regression analysis for comparing two measurers or methods of measurement: but which regression?

    PubMed

    Ludbrook, John

    2010-07-01

    1. There are two reasons for wanting to compare measurers or methods of measurement. One is to calibrate one method or measurer against another; the other is to detect bias. Fixed bias is present when one method gives higher (or lower) values across the whole range of measurement. Proportional bias is present when one method gives values that diverge progressively from those of the other. 2. Linear regression analysis is a popular method for comparing methods of measurement, but the familiar ordinary least squares (OLS) method is rarely acceptable. The OLS method requires that the x values are fixed by the design of the study, whereas it is usual that both y and x values are free to vary and are subject to error. In this case, special regression techniques must be used. 3. Clinical chemists favour techniques such as major axis regression ('Deming's method'), the Passing-Bablok method or the bivariate least median squares method. Other disciplines, such as allometry, astronomy, biology, econometrics, fisheries research, genetics, geology, physics and sports science, have their own preferences. 4. Many Monte Carlo simulations have been performed to try to decide which technique is best, but the results are almost uninterpretable. 5. I suggest that pharmacologists and physiologists should use ordinary least products regression analysis (geometric mean regression, reduced major axis regression): it is versatile, can be used for calibration or to detect bias and can be executed by hand-held calculator or by using the loss function in popular, general-purpose, statistical software.

  13. How is the weather? Forecasting inpatient glycemic control

    PubMed Central

    Saulnier, George E; Castro, Janna C; Cook, Curtiss B; Thompson, Bithika M

    2017-01-01

    Aim: Apply methods of damped trend analysis to forecast inpatient glycemic control. Method: Observed and calculated point-of-care blood glucose data trends were determined over 62 weeks. Mean absolute percent error was used to calculate differences between observed and forecasted values. Comparisons were drawn between model results and linear regression forecasting. Results: The forecasted mean glucose trends observed during the first 24 and 48 weeks of projections compared favorably to the results provided by linear regression forecasting. However, in some scenarios, the damped trend method changed inferences compared with linear regression. In all scenarios, mean absolute percent error values remained below the 10% accepted by demand industries. Conclusion: Results indicate that forecasting methods historically applied within demand industries can project future inpatient glycemic control. Additional study is needed to determine if forecasting is useful in the analyses of other glucometric parameters and, if so, how to apply the techniques to quality improvement. PMID:29134125

  14. Application of XGBoost algorithm in hourly PM2.5 concentration prediction

    NASA Astrophysics Data System (ADS)

    Pan, Bingyue

    2018-02-01

    In view of prediction techniques of hourly PM2.5 concentration in China, this paper applied the XGBoost(Extreme Gradient Boosting) algorithm to predict hourly PM2.5 concentration. The monitoring data of air quality in Tianjin city was analyzed by using XGBoost algorithm. The prediction performance of the XGBoost method is evaluated by comparing observed and predicted PM2.5 concentration using three measures of forecast accuracy. The XGBoost method is also compared with the random forest algorithm, multiple linear regression, decision tree regression and support vector machines for regression models using computational results. The results demonstrate that the XGBoost algorithm outperforms other data mining methods.

  15. Performance of the modified Poisson regression approach for estimating relative risks from clustered prospective data.

    PubMed

    Yelland, Lisa N; Salter, Amy B; Ryan, Philip

    2011-10-15

    Modified Poisson regression, which combines a log Poisson regression model with robust variance estimation, is a useful alternative to log binomial regression for estimating relative risks. Previous studies have shown both analytically and by simulation that modified Poisson regression is appropriate for independent prospective data. This method is often applied to clustered prospective data, despite a lack of evidence to support its use in this setting. The purpose of this article is to evaluate the performance of the modified Poisson regression approach for estimating relative risks from clustered prospective data, by using generalized estimating equations to account for clustering. A simulation study is conducted to compare log binomial regression and modified Poisson regression for analyzing clustered data from intervention and observational studies. Both methods generally perform well in terms of bias, type I error, and coverage. Unlike log binomial regression, modified Poisson regression is not prone to convergence problems. The methods are contrasted by using example data sets from 2 large studies. The results presented in this article support the use of modified Poisson regression as an alternative to log binomial regression for analyzing clustered prospective data when clustering is taken into account by using generalized estimating equations.

  16. Heterogeneity in drug abuse among juvenile offenders: is mixture regression more informative than standard regression?

    PubMed

    Montgomery, Katherine L; Vaughn, Michael G; Thompson, Sanna J; Howard, Matthew O

    2013-11-01

    Research on juvenile offenders has largely treated this population as a homogeneous group. However, recent findings suggest that this at-risk population may be considerably more heterogeneous than previously believed. This study compared mixture regression analyses with standard regression techniques in an effort to explain how known factors such as distress, trauma, and personality are associated with drug abuse among juvenile offenders. Researchers recruited 728 juvenile offenders from Missouri juvenile correctional facilities for participation in this study. Researchers investigated past-year substance use in relation to the following variables: demographic characteristics (gender, ethnicity, age, familial use of public assistance), antisocial behavior, and mental illness symptoms (psychopathic traits, psychiatric distress, and prior trauma). Results indicated that standard and mixed regression approaches identified significant variables related to past-year substance use among this population; however, the mixture regression methods provided greater specificity in results. Mixture regression analytic methods may help policy makers and practitioners better understand and intervene with the substance-related subgroups of juvenile offenders.

  17. [Comparison of different methods in dealing with HIV viral load data with diversified missing value mechanism on HIV positive MSM].

    PubMed

    Jiang, Z; Dou, Z; Song, W L; Xu, J; Wu, Z Y

    2017-11-10

    Objective: To compare results of different methods: in organizing HIV viral load (VL) data with missing values mechanism. Methods We used software SPSS 17.0 to simulate complete and missing data with different missing value mechanism from HIV viral loading data collected from MSM in 16 cities in China in 2013. Maximum Likelihood Methods Using the Expectation and Maximization Algorithm (EM), regressive method, mean imputation, delete method, and Markov Chain Monte Carlo (MCMC) were used to supplement missing data respectively. The results: of different methods were compared according to distribution characteristics, accuracy and precision. Results HIV VL data could not be transferred into a normal distribution. All the methods showed good results in iterating data which is Missing Completely at Random Mechanism (MCAR). For the other types of missing data, regressive and MCMC methods were used to keep the main characteristic of the original data. The means of iterating database with different methods were all close to the original one. The EM, regressive method, mean imputation, and delete method under-estimate VL while MCMC overestimates it. Conclusion: MCMC can be used as the main imputation method for HIV virus loading missing data. The iterated data can be used as a reference for mean HIV VL estimation among the investigated population.

  18. London Measure of Unplanned Pregnancy: guidance for its use as an outcome measure

    PubMed Central

    Hall, Jennifer A; Barrett, Geraldine; Copas, Andrew; Stephenson, Judith

    2017-01-01

    Background The London Measure of Unplanned Pregnancy (LMUP) is a psychometrically validated measure of the degree of intention of a current or recent pregnancy. The LMUP is increasingly being used worldwide, and can be used to evaluate family planning or preconception care programs. However, beyond recommending the use of the full LMUP scale, there is no published guidance on how to use the LMUP as an outcome measure. Ordinal logistic regression has been recommended informally, but studies published to date have all used binary logistic regression and dichotomized the scale at different cut points. There is thus a need for evidence-based guidance to provide a standardized methodology for multivariate analysis and to enable comparison of results. This paper makes recommendations for the regression method for analysis of the LMUP as an outcome measure. Materials and methods Data collected from 4,244 pregnant women in Malawi were used to compare five regression methods: linear, logistic with two cut points, and ordinal logistic with either the full or grouped LMUP score. The recommendations were then tested on the original UK LMUP data. Results There were small but no important differences in the findings across the regression models. Logistic regression resulted in the largest loss of information, and assumptions were violated for the linear and ordinal logistic regression. Consequently, robust standard errors were used for linear regression and a partial proportional odds ordinal logistic regression model attempted. The latter could only be fitted for grouped LMUP score. Conclusion We recommend the linear regression model with robust standard errors to make full use of the LMUP score when analyzed as an outcome measure. Ordinal logistic regression could be considered, but a partial proportional odds model with grouped LMUP score may be required. Logistic regression is the least-favored option, due to the loss of information. For logistic regression, the cut point for un/planned pregnancy should be between nine and ten. These recommendations will standardize the analysis of LMUP data and enhance comparability of results across studies. PMID:28435343

  19. Robust regression on noisy data for fusion scaling laws

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Verdoolaege, Geert, E-mail: geert.verdoolaege@ugent.be; Laboratoire de Physique des Plasmas de l'ERM - Laboratorium voor Plasmafysica van de KMS

    2014-11-15

    We introduce the method of geodesic least squares (GLS) regression for estimating fusion scaling laws. Based on straightforward principles, the method is easily implemented, yet it clearly outperforms established regression techniques, particularly in cases of significant uncertainty on both the response and predictor variables. We apply GLS for estimating the scaling of the L-H power threshold, resulting in estimates for ITER that are somewhat higher than predicted earlier.

  20. A study of machine learning regression methods for major elemental analysis of rocks using laser-induced breakdown spectroscopy

    NASA Astrophysics Data System (ADS)

    Boucher, Thomas F.; Ozanne, Marie V.; Carmosino, Marco L.; Dyar, M. Darby; Mahadevan, Sridhar; Breves, Elly A.; Lepore, Kate H.; Clegg, Samuel M.

    2015-05-01

    The ChemCam instrument on the Mars Curiosity rover is generating thousands of LIBS spectra and bringing interest in this technique to public attention. The key to interpreting Mars or any other types of LIBS data are calibrations that relate laboratory standards to unknowns examined in other settings and enable predictions of chemical composition. Here, LIBS spectral data are analyzed using linear regression methods including partial least squares (PLS-1 and PLS-2), principal component regression (PCR), least absolute shrinkage and selection operator (lasso), elastic net, and linear support vector regression (SVR-Lin). These were compared against results from nonlinear regression methods including kernel principal component regression (K-PCR), polynomial kernel support vector regression (SVR-Py) and k-nearest neighbor (kNN) regression to discern the most effective models for interpreting chemical abundances from LIBS spectra of geological samples. The results were evaluated for 100 samples analyzed with 50 laser pulses at each of five locations averaged together. Wilcoxon signed-rank tests were employed to evaluate the statistical significance of differences among the nine models using their predicted residual sum of squares (PRESS) to make comparisons. For MgO, SiO2, Fe2O3, CaO, and MnO, the sparse models outperform all the others except for linear SVR, while for Na2O, K2O, TiO2, and P2O5, the sparse methods produce inferior results, likely because their emission lines in this energy range have lower transition probabilities. The strong performance of the sparse methods in this study suggests that use of dimensionality-reduction techniques as a preprocessing step may improve the performance of the linear models. Nonlinear methods tend to overfit the data and predict less accurately, while the linear methods proved to be more generalizable with better predictive performance. These results are attributed to the high dimensionality of the data (6144 channels) relative to the small number of samples studied. The best-performing models were SVR-Lin for SiO2, MgO, Fe2O3, and Na2O, lasso for Al2O3, elastic net for MnO, and PLS-1 for CaO, TiO2, and K2O. Although these differences in model performance between methods were identified, most of the models produce comparable results when p ≤ 0.05 and all techniques except kNN produced statistically-indistinguishable results. It is likely that a combination of models could be used together to yield a lower total error of prediction, depending on the requirements of the user.

  1. Predicting recreational water quality advisories: A comparison of statistical methods

    USGS Publications Warehouse

    Brooks, Wesley R.; Corsi, Steven R.; Fienen, Michael N.; Carvin, Rebecca B.

    2016-01-01

    Epidemiological studies indicate that fecal indicator bacteria (FIB) in beach water are associated with illnesses among people having contact with the water. In order to mitigate public health impacts, many beaches are posted with an advisory when the concentration of FIB exceeds a beach action value. The most commonly used method of measuring FIB concentration takes 18–24 h before returning a result. In order to avoid the 24 h lag, it has become common to ”nowcast” the FIB concentration using statistical regressions on environmental surrogate variables. Most commonly, nowcast models are estimated using ordinary least squares regression, but other regression methods from the statistical and machine learning literature are sometimes used. This study compares 14 regression methods across 7 Wisconsin beaches to identify which consistently produces the most accurate predictions. A random forest model is identified as the most accurate, followed by multiple regression fit using the adaptive LASSO.

  2. Data Mining Methods Applied to Flight Operations Quality Assurance Data: A Comparison to Standard Statistical Methods

    NASA Technical Reports Server (NTRS)

    Stolzer, Alan J.; Halford, Carl

    2007-01-01

    In a previous study, multiple regression techniques were applied to Flight Operations Quality Assurance-derived data to develop parsimonious model(s) for fuel consumption on the Boeing 757 airplane. The present study examined several data mining algorithms, including neural networks, on the fuel consumption problem and compared them to the multiple regression results obtained earlier. Using regression methods, parsimonious models were obtained that explained approximately 85% of the variation in fuel flow. In general data mining methods were more effective in predicting fuel consumption. Classification and Regression Tree methods reported correlation coefficients of .91 to .92, and General Linear Models and Multilayer Perceptron neural networks reported correlation coefficients of about .99. These data mining models show great promise for use in further examining large FOQA databases for operational and safety improvements.

  3. Performance and separation occurrence of binary probit regression estimator using maximum likelihood method and Firths approach under different sample size

    NASA Astrophysics Data System (ADS)

    Lusiana, Evellin Dewi

    2017-12-01

    The parameters of binary probit regression model are commonly estimated by using Maximum Likelihood Estimation (MLE) method. However, MLE method has limitation if the binary data contains separation. Separation is the condition where there are one or several independent variables that exactly grouped the categories in binary response. It will result the estimators of MLE method become non-convergent, so that they cannot be used in modeling. One of the effort to resolve the separation is using Firths approach instead. This research has two aims. First, to identify the chance of separation occurrence in binary probit regression model between MLE method and Firths approach. Second, to compare the performance of binary probit regression model estimator that obtained by MLE method and Firths approach using RMSE criteria. Those are performed using simulation method and under different sample size. The results showed that the chance of separation occurrence in MLE method for small sample size is higher than Firths approach. On the other hand, for larger sample size, the probability decreased and relatively identic between MLE method and Firths approach. Meanwhile, Firths estimators have smaller RMSE than MLEs especially for smaller sample sizes. But for larger sample sizes, the RMSEs are not much different. It means that Firths estimators outperformed MLE estimator.

  4. Improving validation methods for molecular diagnostics: application of Bland-Altman, Deming and simple linear regression analyses in assay comparison and evaluation for next-generation sequencing.

    PubMed

    Misyura, Maksym; Sukhai, Mahadeo A; Kulasignam, Vathany; Zhang, Tong; Kamel-Reid, Suzanne; Stockley, Tracy L

    2018-02-01

    A standard approach in test evaluation is to compare results of the assay in validation to results from previously validated methods. For quantitative molecular diagnostic assays, comparison of test values is often performed using simple linear regression and the coefficient of determination (R 2 ), using R 2 as the primary metric of assay agreement. However, the use of R 2 alone does not adequately quantify constant or proportional errors required for optimal test evaluation. More extensive statistical approaches, such as Bland-Altman and expanded interpretation of linear regression methods, can be used to more thoroughly compare data from quantitative molecular assays. We present the application of Bland-Altman and linear regression statistical methods to evaluate quantitative outputs from next-generation sequencing assays (NGS). NGS-derived data sets from assay validation experiments were used to demonstrate the utility of the statistical methods. Both Bland-Altman and linear regression were able to detect the presence and magnitude of constant and proportional error in quantitative values of NGS data. Deming linear regression was used in the context of assay comparison studies, while simple linear regression was used to analyse serial dilution data. Bland-Altman statistical approach was also adapted to quantify assay accuracy, including constant and proportional errors, and precision where theoretical and empirical values were known. The complementary application of the statistical methods described in this manuscript enables more extensive evaluation of performance characteristics of quantitative molecular assays, prior to implementation in the clinical molecular laboratory. © Article author(s) (or their employer(s) unless otherwise stated in the text of the article) 2018. All rights reserved. No commercial use is permitted unless otherwise expressly granted.

  5. A simple approach to power and sample size calculations in logistic regression and Cox regression models.

    PubMed

    Vaeth, Michael; Skovlund, Eva

    2004-06-15

    For a given regression problem it is possible to identify a suitably defined equivalent two-sample problem such that the power or sample size obtained for the two-sample problem also applies to the regression problem. For a standard linear regression model the equivalent two-sample problem is easily identified, but for generalized linear models and for Cox regression models the situation is more complicated. An approximately equivalent two-sample problem may, however, also be identified here. In particular, we show that for logistic regression and Cox regression models the equivalent two-sample problem is obtained by selecting two equally sized samples for which the parameters differ by a value equal to the slope times twice the standard deviation of the independent variable and further requiring that the overall expected number of events is unchanged. In a simulation study we examine the validity of this approach to power calculations in logistic regression and Cox regression models. Several different covariate distributions are considered for selected values of the overall response probability and a range of alternatives. For the Cox regression model we consider both constant and non-constant hazard rates. The results show that in general the approach is remarkably accurate even in relatively small samples. Some discrepancies are, however, found in small samples with few events and a highly skewed covariate distribution. Comparison with results based on alternative methods for logistic regression models with a single continuous covariate indicates that the proposed method is at least as good as its competitors. The method is easy to implement and therefore provides a simple way to extend the range of problems that can be covered by the usual formulas for power and sample size determination. Copyright 2004 John Wiley & Sons, Ltd.

  6. A comparative study on generating simulated Landsat NDVI images using data fusion and regression method-the case of the Korean Peninsula.

    PubMed

    Lee, Mi Hee; Lee, Soo Bong; Eo, Yang Dam; Kim, Sun Woong; Woo, Jung-Hun; Han, Soo Hee

    2017-07-01

    Landsat optical images have enough spatial and spectral resolution to analyze vegetation growth characteristics. But, the clouds and water vapor degrade the image quality quite often, which limits the availability of usable images for the time series vegetation vitality measurement. To overcome this shortcoming, simulated images are used as an alternative. In this study, weighted average method, spatial and temporal adaptive reflectance fusion model (STARFM) method, and multilinear regression analysis method have been tested to produce simulated Landsat normalized difference vegetation index (NDVI) images of the Korean Peninsula. The test results showed that the weighted average method produced the images most similar to the actual images, provided that the images were available within 1 month before and after the target date. The STARFM method gives good results when the input image date is close to the target date. Careful regional and seasonal consideration is required in selecting input images. During summer season, due to clouds, it is very difficult to get the images close enough to the target date. Multilinear regression analysis gives meaningful results even when the input image date is not so close to the target date. Average R 2 values for weighted average method, STARFM, and multilinear regression analysis were 0.741, 0.70, and 0.61, respectively.

  7. Implementations of geographically weighted lasso in spatial data with multicollinearity (Case study: Poverty modeling of Java Island)

    NASA Astrophysics Data System (ADS)

    Setiyorini, Anis; Suprijadi, Jadi; Handoko, Budhi

    2017-03-01

    Geographically Weighted Regression (GWR) is a regression model that takes into account the spatial heterogeneity effect. In the application of the GWR, inference on regression coefficients is often of interest, as is estimation and prediction of the response variable. Empirical research and studies have demonstrated that local correlation between explanatory variables can lead to estimated regression coefficients in GWR that are strongly correlated, a condition named multicollinearity. It later results on a large standard error on estimated regression coefficients, and, hence, problematic for inference on relationships between variables. Geographically Weighted Lasso (GWL) is a method which capable to deal with spatial heterogeneity and local multicollinearity in spatial data sets. GWL is a further development of GWR method, which adds a LASSO (Least Absolute Shrinkage and Selection Operator) constraint in parameter estimation. In this study, GWL will be applied by using fixed exponential kernel weights matrix to establish a poverty modeling of Java Island, Indonesia. The results of applying the GWL to poverty datasets show that this method stabilizes regression coefficients in the presence of multicollinearity and produces lower prediction and estimation error of the response variable than GWR does.

  8. Least median of squares and iteratively re-weighted least squares as robust linear regression methods for fluorimetric determination of α-lipoic acid in capsules in ideal and non-ideal cases of linearity.

    PubMed

    Korany, Mohamed A; Gazy, Azza A; Khamis, Essam F; Ragab, Marwa A A; Kamal, Miranda F

    2018-06-01

    This study outlines two robust regression approaches, namely least median of squares (LMS) and iteratively re-weighted least squares (IRLS) to investigate their application in instrument analysis of nutraceuticals (that is, fluorescence quenching of merbromin reagent upon lipoic acid addition). These robust regression methods were used to calculate calibration data from the fluorescence quenching reaction (∆F and F-ratio) under ideal or non-ideal linearity conditions. For each condition, data were treated using three regression fittings: Ordinary Least Squares (OLS), LMS and IRLS. Assessment of linearity, limits of detection (LOD) and quantitation (LOQ), accuracy and precision were carefully studied for each condition. LMS and IRLS regression line fittings showed significant improvement in correlation coefficients and all regression parameters for both methods and both conditions. In the ideal linearity condition, the intercept and slope changed insignificantly, but a dramatic change was observed for the non-ideal condition and linearity intercept. Under both linearity conditions, LOD and LOQ values after the robust regression line fitting of data were lower than those obtained before data treatment. The results obtained after statistical treatment indicated that the linearity ranges for drug determination could be expanded to lower limits of quantitation by enhancing the regression equation parameters after data treatment. Analysis results for lipoic acid in capsules, using both fluorimetric methods, treated by parametric OLS and after treatment by robust LMS and IRLS were compared for both linearity conditions. Copyright © 2018 John Wiley & Sons, Ltd.

  9. HIGHLIGHTING DIFFERENCES BETWEEN CONDITIONAL AND UNCONDITIONAL QUANTILE REGRESSION APPROACHES THROUGH AN APPLICATION TO ASSESS MEDICATION ADHERENCE

    PubMed Central

    BORAH, BIJAN J.; BASU, ANIRBAN

    2014-01-01

    The quantile regression (QR) framework provides a pragmatic approach in understanding the differential impacts of covariates along the distribution of an outcome. However, the QR framework that has pervaded the applied economics literature is based on the conditional quantile regression method. It is used to assess the impact of a covariate on a quantile of the outcome conditional on specific values of other covariates. In most cases, conditional quantile regression may generate results that are often not generalizable or interpretable in a policy or population context. In contrast, the unconditional quantile regression method provides more interpretable results as it marginalizes the effect over the distributions of other covariates in the model. In this paper, the differences between these two regression frameworks are highlighted, both conceptually and econometrically. Additionally, using real-world claims data from a large US health insurer, alternative QR frameworks are implemented to assess the differential impacts of covariates along the distribution of medication adherence among elderly patients with Alzheimer’s disease. PMID:23616446

  10. Linear regression analysis: part 14 of a series on evaluation of scientific publications.

    PubMed

    Schneider, Astrid; Hommel, Gerhard; Blettner, Maria

    2010-11-01

    Regression analysis is an important statistical method for the analysis of medical data. It enables the identification and characterization of relationships among multiple factors. It also enables the identification of prognostically relevant risk factors and the calculation of risk scores for individual prognostication. This article is based on selected textbooks of statistics, a selective review of the literature, and our own experience. After a brief introduction of the uni- and multivariable regression models, illustrative examples are given to explain what the important considerations are before a regression analysis is performed, and how the results should be interpreted. The reader should then be able to judge whether the method has been used correctly and interpret the results appropriately. The performance and interpretation of linear regression analysis are subject to a variety of pitfalls, which are discussed here in detail. The reader is made aware of common errors of interpretation through practical examples. Both the opportunities for applying linear regression analysis and its limitations are presented.

  11. The Research of Regression Method for Forecasting Monthly Electricity Sales Considering Coupled Multi-factor

    NASA Astrophysics Data System (ADS)

    Wang, Jiangbo; Liu, Junhui; Li, Tiantian; Yin, Shuo; He, Xinhui

    2018-01-01

    The monthly electricity sales forecasting is a basic work to ensure the safety of the power system. This paper presented a monthly electricity sales forecasting method which comprehensively considers the coupled multi-factors of temperature, economic growth, electric power replacement and business expansion. The mathematical model is constructed by using regression method. The simulation results show that the proposed method is accurate and effective.

  12. A simplified competition data analysis for radioligand specific activity determination.

    PubMed

    Venturino, A; Rivera, E S; Bergoc, R M; Caro, R A

    1990-01-01

    Non-linear regression and two-step linear fit methods were developed to determine the actual specific activity of 125I-ovine prolactin by radioreceptor self-displacement analysis. The experimental results obtained by the different methods are superposable. The non-linear regression method is considered to be the most adequate procedure to calculate the specific activity, but if its software is not available, the other described methods are also suitable.

  13. Methodology for constructing a colour-difference acceptability scale.

    PubMed

    Laborie, Baptiste; Viénot, Françoise; Langlois, Sabine

    2010-09-01

    Observers were invited to report their degree of satisfaction on a 6-point semantic scale with respect to the conformity of a test colour with a white reference colour, simultaneously presented on a PDP display. Eight test patches were chosen along each of the +a*, -a*, +b*, -b* axes of the CIELAB chromaticity plane, at Y = 80 ± 2 cd.m(-2) . Experimental conditions reliably represented the automotive environment (patch size, angular distance between patches) and observers could move their head and eyes freely. We have compared several methods of category scaling, the Torgerson-DMT method (Torgerson, W. S. (1958). Theory and methods of scaling. Wiley, New York, USA); two versions of the regression method i.e. Bonnet's (Bonnet, C. (1986). Manuel pratique de psychophysique. Armand Colin, Paris, France) and logistic regression; and the medians method. We describe in detail a case where all methods yield substantial but slightly different results. The solution proposed by the regression method which works with incomplete matrices and yields results directly on a colorimetric scale is probably the most useful in this industrial context. Finally we summarize the implementation of the logistic regression method over four hues and for one experimental condition. © 2010 The Authors, Ophthalmic and Physiological Optics © 2010 The College of Optometrists.

  14. On the calibration process of film dosimetry: OLS inverse regression versus WLS inverse prediction.

    PubMed

    Crop, F; Van Rompaye, B; Paelinck, L; Vakaet, L; Thierens, H; De Wagter, C

    2008-07-21

    The purpose of this study was both putting forward a statistically correct model for film calibration and the optimization of this process. A reliable calibration is needed in order to perform accurate reference dosimetry with radiographic (Gafchromic) film. Sometimes, an ordinary least squares simple linear (in the parameters) regression is applied to the dose-optical-density (OD) curve with the dose as a function of OD (inverse regression) or sometimes OD as a function of dose (inverse prediction). The application of a simple linear regression fit is an invalid method because heteroscedasticity of the data is not taken into account. This could lead to erroneous results originating from the calibration process itself and thus to a lower accuracy. In this work, we compare the ordinary least squares (OLS) inverse regression method with the correct weighted least squares (WLS) inverse prediction method to create calibration curves. We found that the OLS inverse regression method could lead to a prediction bias of up to 7.3 cGy at 300 cGy and total prediction errors of 3% or more for Gafchromic EBT film. Application of the WLS inverse prediction method resulted in a maximum prediction bias of 1.4 cGy and total prediction errors below 2% in a 0-400 cGy range. We developed a Monte-Carlo-based process to optimize calibrations, depending on the needs of the experiment. This type of thorough analysis can lead to a higher accuracy for film dosimetry.

  15. Correlative and multivariate analysis of increased radon concentration in underground laboratory.

    PubMed

    Maletić, Dimitrije M; Udovičić, Vladimir I; Banjanac, Radomir M; Joković, Dejan R; Dragić, Aleksandar L; Veselinović, Nikola B; Filipović, Jelena

    2014-11-01

    The results of analysis using correlative and multivariate methods, as developed for data analysis in high-energy physics and implemented in the Toolkit for Multivariate Analysis software package, of the relations of the variation of increased radon concentration with climate variables in shallow underground laboratory is presented. Multivariate regression analysis identified a number of multivariate methods which can give a good evaluation of increased radon concentrations based on climate variables. The use of the multivariate regression methods will enable the investigation of the relations of specific climate variable with increased radon concentrations by analysis of regression methods resulting in 'mapped' underlying functional behaviour of radon concentrations depending on a wide spectrum of climate variables. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.

  16. Introduction to methodology of dose-response meta-analysis for binary outcome: With application on software.

    PubMed

    Zhang, Chao; Jia, Pengli; Yu, Liu; Xu, Chang

    2018-05-01

    Dose-response meta-analysis (DRMA) is widely applied to investigate the dose-specific relationship between independent and dependent variables. Such methods have been in use for over 30 years and are increasingly employed in healthcare and clinical decision-making. In this article, we give an overview of the methodology used in DRMA. We summarize the commonly used regression model and the pooled method in DRMA. We also use an example to illustrate how to employ a DRMA by these methods. Five regression models, linear regression, piecewise regression, natural polynomial regression, fractional polynomial regression, and restricted cubic spline regression, were illustrated in this article to fit the dose-response relationship. And two types of pooling approaches, that is, one-stage approach and two-stage approach are illustrated to pool the dose-response relationship across studies. The example showed similar results among these models. Several dose-response meta-analysis methods can be used for investigating the relationship between exposure level and the risk of an outcome. However the methodology of DRMA still needs to be improved. © 2018 Chinese Cochrane Center, West China Hospital of Sichuan University and John Wiley & Sons Australia, Ltd.

  17. Comparison of anatomical, functional and regression methods for estimating the rotation axes of the forearm.

    PubMed

    Fraysse, François; Thewlis, Dominic

    2014-11-07

    Numerous methods exist to estimate the pose of the axes of rotation of the forearm. These include anatomical definitions, such as the conventions proposed by the ISB, and functional methods based on instantaneous helical axes, which are commonly accepted as the modelling gold standard for non-invasive, in-vivo studies. We investigated the validity of a third method, based on regression equations, to estimate the rotation axes of the forearm. We also assessed the accuracy of both ISB methods. Axes obtained from a functional method were considered as the reference. Results indicate a large inter-subject variability in the axes positions, in accordance with previous studies. Both ISB methods gave the same level of accuracy in axes position estimations. Regression equations seem to improve estimation of the flexion-extension axis but not the pronation-supination axis. Overall, given the large inter-subject variability, the use of regression equations cannot be recommended. Copyright © 2014 Elsevier Ltd. All rights reserved.

  18. Contrasting OLS and Quantile Regression Approaches to Student "Growth" Percentiles

    ERIC Educational Resources Information Center

    Castellano, Katherine Elizabeth; Ho, Andrew Dean

    2013-01-01

    Regression methods can locate student test scores in a conditional distribution, given past scores. This article contrasts and clarifies two approaches to describing these locations in terms of readily interpretable percentile ranks or "conditional status percentile ranks." The first is Betebenner's quantile regression approach that results in…

  19. Beyond Multiple Regression: Using Commonality Analysis to Better Understand R[superscript 2] Results

    ERIC Educational Resources Information Center

    Warne, Russell T.

    2011-01-01

    Multiple regression is one of the most common statistical methods used in quantitative educational research. Despite the versatility and easy interpretability of multiple regression, it has some shortcomings in the detection of suppressor variables and for somewhat arbitrarily assigning values to the structure coefficients of correlated…

  20. [Local Regression Algorithm Based on Net Analyte Signal and Its Application in Near Infrared Spectral Analysis].

    PubMed

    Zhang, Hong-guang; Lu, Jian-gang

    2016-02-01

    Abstract To overcome the problems of significant difference among samples and nonlinearity between the property and spectra of samples in spectral quantitative analysis, a local regression algorithm is proposed in this paper. In this algorithm, net signal analysis method(NAS) was firstly used to obtain the net analyte signal of the calibration samples and unknown samples, then the Euclidean distance between net analyte signal of the sample and net analyte signal of calibration samples was calculated and utilized as similarity index. According to the defined similarity index, the local calibration sets were individually selected for each unknown sample. Finally, a local PLS regression model was built on each local calibration sets for each unknown sample. The proposed method was applied to a set of near infrared spectra of meat samples. The results demonstrate that the prediction precision and model complexity of the proposed method are superior to global PLS regression method and conventional local regression algorithm based on spectral Euclidean distance.

  1. A Comparison of Conventional Linear Regression Methods and Neural Networks for Forecasting Educational Spending.

    ERIC Educational Resources Information Center

    Baker, Bruce D.; Richards, Craig E.

    1999-01-01

    Applies neural network methods for forecasting 1991-95 per-pupil expenditures in U.S. public elementary and secondary schools. Forecasting models included the National Center for Education Statistics' multivariate regression model and three neural architectures. Regarding prediction accuracy, neural network results were comparable or superior to…

  2. Comparison of multinomial logistic regression and logistic regression: which is more efficient in allocating land use?

    NASA Astrophysics Data System (ADS)

    Lin, Yingzhi; Deng, Xiangzheng; Li, Xing; Ma, Enjun

    2014-12-01

    Spatially explicit simulation of land use change is the basis for estimating the effects of land use and cover change on energy fluxes, ecology and the environment. At the pixel level, logistic regression is one of the most common approaches used in spatially explicit land use allocation models to determine the relationship between land use and its causal factors in driving land use change, and thereby to evaluate land use suitability. However, these models have a drawback in that they do not determine/allocate land use based on the direct relationship between land use change and its driving factors. Consequently, a multinomial logistic regression method was introduced to address this flaw, and thereby, judge the suitability of a type of land use in any given pixel in a case study area of the Jiangxi Province, China. A comparison of the two regression methods indicated that the proportion of correctly allocated pixels using multinomial logistic regression was 92.98%, which was 8.47% higher than that obtained using logistic regression. Paired t-test results also showed that pixels were more clearly distinguished by multinomial logistic regression than by logistic regression. In conclusion, multinomial logistic regression is a more efficient and accurate method for the spatial allocation of land use changes. The application of this method in future land use change studies may improve the accuracy of predicting the effects of land use and cover change on energy fluxes, ecology, and environment.

  3. LD Hub: a centralized database and web interface to perform LD score regression that maximizes the potential of summary level GWAS data for SNP heritability and genetic correlation analysis.

    PubMed

    Zheng, Jie; Erzurumluoglu, A Mesut; Elsworth, Benjamin L; Kemp, John P; Howe, Laurence; Haycock, Philip C; Hemani, Gibran; Tansey, Katherine; Laurin, Charles; Pourcain, Beate St; Warrington, Nicole M; Finucane, Hilary K; Price, Alkes L; Bulik-Sullivan, Brendan K; Anttila, Verneri; Paternoster, Lavinia; Gaunt, Tom R; Evans, David M; Neale, Benjamin M

    2017-01-15

    LD score regression is a reliable and efficient method of using genome-wide association study (GWAS) summary-level results data to estimate the SNP heritability of complex traits and diseases, partition this heritability into functional categories, and estimate the genetic correlation between different phenotypes. Because the method relies on summary level results data, LD score regression is computationally tractable even for very large sample sizes. However, publicly available GWAS summary-level data are typically stored in different databases and have different formats, making it difficult to apply LD score regression to estimate genetic correlations across many different traits simultaneously. In this manuscript, we describe LD Hub - a centralized database of summary-level GWAS results for 173 diseases/traits from different publicly available resources/consortia and a web interface that automates the LD score regression analysis pipeline. To demonstrate functionality and validate our software, we replicated previously reported LD score regression analyses of 49 traits/diseases using LD Hub; and estimated SNP heritability and the genetic correlation across the different phenotypes. We also present new results obtained by uploading a recent atopic dermatitis GWAS meta-analysis to examine the genetic correlation between the condition and other potentially related traits. In response to the growing availability of publicly accessible GWAS summary-level results data, our database and the accompanying web interface will ensure maximal uptake of the LD score regression methodology, provide a useful database for the public dissemination of GWAS results, and provide a method for easily screening hundreds of traits for overlapping genetic aetiologies. The web interface and instructions for using LD Hub are available at http://ldsc.broadinstitute.org/ CONTACT: jie.zheng@bristol.ac.ukSupplementary information: Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.

  4. Improved Correction of Atmospheric Pressure Data Obtained by Smartphones through Machine Learning

    PubMed Central

    Kim, Yong-Hyuk; Ha, Ji-Hun; Kim, Na-Young; Im, Hyo-Hyuc; Sim, Sangjin; Choi, Reno K. Y.

    2016-01-01

    A correction method using machine learning aims to improve the conventional linear regression (LR) based method for correction of atmospheric pressure data obtained by smartphones. The method proposed in this study conducts clustering and regression analysis with time domain classification. Data obtained in Gyeonggi-do, one of the most populous provinces in South Korea surrounding Seoul with the size of 10,000 km2, from July 2014 through December 2014, using smartphones were classified with respect to time of day (daytime or nighttime) as well as day of the week (weekday or weekend) and the user's mobility, prior to the expectation-maximization (EM) clustering. Subsequently, the results were analyzed for comparison by applying machine learning methods such as multilayer perceptron (MLP) and support vector regression (SVR). The results showed a mean absolute error (MAE) 26% lower on average when regression analysis was performed through EM clustering compared to that obtained without EM clustering. For machine learning methods, the MAE for SVR was around 31% lower for LR and about 19% lower for MLP. It is concluded that pressure data from smartphones are as good as the ones from national automatic weather station (AWS) network. PMID:27524999

  5. Panel regressions to estimate low-flow response to rainfall variability in ungaged basins

    USGS Publications Warehouse

    Bassiouni, Maoya; Vogel, Richard M.; Archfield, Stacey A.

    2016-01-01

    Multicollinearity and omitted-variable bias are major limitations to developing multiple linear regression models to estimate streamflow characteristics in ungaged areas and varying rainfall conditions. Panel regression is used to overcome limitations of traditional regression methods, and obtain reliable model coefficients, in particular to understand the elasticity of streamflow to rainfall. Using annual rainfall and selected basin characteristics at 86 gaged streams in the Hawaiian Islands, regional regression models for three stream classes were developed to estimate the annual low-flow duration discharges. Three panel-regression structures (random effects, fixed effects, and pooled) were compared to traditional regression methods, in which space is substituted for time. Results indicated that panel regression generally was able to reproduce the temporal behavior of streamflow and reduce the standard errors of model coefficients compared to traditional regression, even for models in which the unobserved heterogeneity between streams is significant and the variance inflation factor for rainfall is much greater than 10. This is because both spatial and temporal variability were better characterized in panel regression. In a case study, regional rainfall elasticities estimated from panel regressions were applied to ungaged basins on Maui, using available rainfall projections to estimate plausible changes in surface-water availability and usable stream habitat for native species. The presented panel-regression framework is shown to offer benefits over existing traditional hydrologic regression methods for developing robust regional relations to investigate streamflow response in a changing climate.

  6. Panel regressions to estimate low-flow response to rainfall variability in ungaged basins

    NASA Astrophysics Data System (ADS)

    Bassiouni, Maoya; Vogel, Richard M.; Archfield, Stacey A.

    2016-12-01

    Multicollinearity and omitted-variable bias are major limitations to developing multiple linear regression models to estimate streamflow characteristics in ungaged areas and varying rainfall conditions. Panel regression is used to overcome limitations of traditional regression methods, and obtain reliable model coefficients, in particular to understand the elasticity of streamflow to rainfall. Using annual rainfall and selected basin characteristics at 86 gaged streams in the Hawaiian Islands, regional regression models for three stream classes were developed to estimate the annual low-flow duration discharges. Three panel-regression structures (random effects, fixed effects, and pooled) were compared to traditional regression methods, in which space is substituted for time. Results indicated that panel regression generally was able to reproduce the temporal behavior of streamflow and reduce the standard errors of model coefficients compared to traditional regression, even for models in which the unobserved heterogeneity between streams is significant and the variance inflation factor for rainfall is much greater than 10. This is because both spatial and temporal variability were better characterized in panel regression. In a case study, regional rainfall elasticities estimated from panel regressions were applied to ungaged basins on Maui, using available rainfall projections to estimate plausible changes in surface-water availability and usable stream habitat for native species. The presented panel-regression framework is shown to offer benefits over existing traditional hydrologic regression methods for developing robust regional relations to investigate streamflow response in a changing climate.

  7. Testing Multivariate Adaptive Regression Splines (MARS) as a Method of Land Cover Classification of TERRA-ASTER Satellite Images.

    PubMed

    Quirós, Elia; Felicísimo, Angel M; Cuartero, Aurora

    2009-01-01

    This work proposes a new method to classify multi-spectral satellite images based on multivariate adaptive regression splines (MARS) and compares this classification system with the more common parallelepiped and maximum likelihood (ML) methods. We apply the classification methods to the land cover classification of a test zone located in southwestern Spain. The basis of the MARS method and its associated procedures are explained in detail, and the area under the ROC curve (AUC) is compared for the three methods. The results show that the MARS method provides better results than the parallelepiped method in all cases, and it provides better results than the maximum likelihood method in 13 cases out of 17. These results demonstrate that the MARS method can be used in isolation or in combination with other methods to improve the accuracy of soil cover classification. The improvement is statistically significant according to the Wilcoxon signed rank test.

  8. Locally-constrained Boundary Regression for Segmentation of Prostate and Rectum in the Planning CT Images

    PubMed Central

    Shao, Yeqin; Gao, Yaozong; Wang, Qian; Yang, Xin; Shen, Dinggang

    2015-01-01

    Automatic and accurate segmentation of the prostate and rectum in planning CT images is a challenging task due to low image contrast, unpredictable organ (relative) position, and uncertain existence of bowel gas across different patients. Recently, regression forest was adopted for organ deformable segmentation on 2D medical images by training one landmark detector for each point on the shape model. However, it seems impractical for regression forest to guide 3D deformable segmentation as a landmark detector, due to large number of vertices in the 3D shape model as well as the difficulty in building accurate 3D vertex correspondence for each landmark detector. In this paper, we propose a novel boundary detection method by exploiting the power of regression forest for prostate and rectum segmentation. The contributions of this paper are as follows: 1) we introduce regression forest as a local boundary regressor to vote the entire boundary of a target organ, which avoids training a large number of landmark detectors and building an accurate 3D vertex correspondence for each landmark detector; 2) an auto-context model is integrated with regression forest to improve the accuracy of the boundary regression; 3) we further combine a deformable segmentation method with the proposed local boundary regressor for the final organ segmentation by integrating organ shape priors. Our method is evaluated on a planning CT image dataset with 70 images from 70 different patients. The experimental results show that our proposed boundary regression method outperforms the conventional boundary classification method in guiding the deformable model for prostate and rectum segmentations. Compared with other state-of-the-art methods, our method also shows a competitive performance. PMID:26439938

  9. A comparison of three methods of assessing differential item functioning (DIF) in the Hospital Anxiety Depression Scale: ordinal logistic regression, Rasch analysis and the Mantel chi-square procedure.

    PubMed

    Cameron, Isobel M; Scott, Neil W; Adler, Mats; Reid, Ian C

    2014-12-01

    It is important for clinical practice and research that measurement scales of well-being and quality of life exhibit only minimal differential item functioning (DIF). DIF occurs where different groups of people endorse items in a scale to different extents after being matched by the intended scale attribute. We investigate the equivalence or otherwise of common methods of assessing DIF. Three methods of measuring age- and sex-related DIF (ordinal logistic regression, Rasch analysis and Mantel χ(2) procedure) were applied to Hospital Anxiety Depression Scale (HADS) data pertaining to a sample of 1,068 patients consulting primary care practitioners. Three items were flagged by all three approaches as having either age- or sex-related DIF with a consistent direction of effect; a further three items identified did not meet stricter criteria for important DIF using at least one method. When applying strict criteria for significant DIF, ordinal logistic regression was slightly less sensitive. Ordinal logistic regression, Rasch analysis and contingency table methods yielded consistent results when identifying DIF in the HADS depression and HADS anxiety scales. Regardless of methods applied, investigators should use a combination of statistical significance, magnitude of the DIF effect and investigator judgement when interpreting the results.

  10. Interpreting Regression Results: beta Weights and Structure Coefficients are Both Important.

    ERIC Educational Resources Information Center

    Thompson, Bruce

    Various realizations have led to less frequent use of the "OVA" methods (analysis of variance--ANOVA--among others) and to more frequent use of general linear model approaches such as regression. However, too few researchers understand all the various coefficients produced in regression. This paper explains these coefficients and their…

  11. Simultaneous Estimation of Regression Functions for Marine Corps Technical Training Specialties.

    ERIC Educational Resources Information Center

    Dunbar, Stephen B.; And Others

    This paper considers the application of Bayesian techniques for simultaneous estimation to the specification of regression weights for selection tests used in various technical training courses in the Marine Corps. Results of a method for m-group regression developed by Molenaar and Lewis (1979) suggest that common weights for training courses…

  12. Regression Analysis: Legal Applications in Institutional Research

    ERIC Educational Resources Information Center

    Frizell, Julie A.; Shippen, Benjamin S., Jr.; Luna, Andrew L.

    2008-01-01

    This article reviews multiple regression analysis, describes how its results should be interpreted, and instructs institutional researchers on how to conduct such analyses using an example focused on faculty pay equity between men and women. The use of multiple regression analysis will be presented as a method with which to compare salaries of…

  13. Clustering performance comparison using K-means and expectation maximization algorithms.

    PubMed

    Jung, Yong Gyu; Kang, Min Soo; Heo, Jun

    2014-11-14

    Clustering is an important means of data mining based on separating data categories by similar features. Unlike the classification algorithm, clustering belongs to the unsupervised type of algorithms. Two representatives of the clustering algorithms are the K -means and the expectation maximization (EM) algorithm. Linear regression analysis was extended to the category-type dependent variable, while logistic regression was achieved using a linear combination of independent variables. To predict the possibility of occurrence of an event, a statistical approach is used. However, the classification of all data by means of logistic regression analysis cannot guarantee the accuracy of the results. In this paper, the logistic regression analysis is applied to EM clusters and the K -means clustering method for quality assessment of red wine, and a method is proposed for ensuring the accuracy of the classification results.

  14. Highlighting differences between conditional and unconditional quantile regression approaches through an application to assess medication adherence.

    PubMed

    Borah, Bijan J; Basu, Anirban

    2013-09-01

    The quantile regression (QR) framework provides a pragmatic approach in understanding the differential impacts of covariates along the distribution of an outcome. However, the QR framework that has pervaded the applied economics literature is based on the conditional quantile regression method. It is used to assess the impact of a covariate on a quantile of the outcome conditional on specific values of other covariates. In most cases, conditional quantile regression may generate results that are often not generalizable or interpretable in a policy or population context. In contrast, the unconditional quantile regression method provides more interpretable results as it marginalizes the effect over the distributions of other covariates in the model. In this paper, the differences between these two regression frameworks are highlighted, both conceptually and econometrically. Additionally, using real-world claims data from a large US health insurer, alternative QR frameworks are implemented to assess the differential impacts of covariates along the distribution of medication adherence among elderly patients with Alzheimer's disease. Copyright © 2013 John Wiley & Sons, Ltd.

  15. Rank-Optimized Logistic Matrix Regression toward Improved Matrix Data Classification.

    PubMed

    Zhang, Jianguang; Jiang, Jianmin

    2018-02-01

    While existing logistic regression suffers from overfitting and often fails in considering structural information, we propose a novel matrix-based logistic regression to overcome the weakness. In the proposed method, 2D matrices are directly used to learn two groups of parameter vectors along each dimension without vectorization, which allows the proposed method to fully exploit the underlying structural information embedded inside the 2D matrices. Further, we add a joint [Formula: see text]-norm on two parameter matrices, which are organized by aligning each group of parameter vectors in columns. This added co-regularization term has two roles-enhancing the effect of regularization and optimizing the rank during the learning process. With our proposed fast iterative solution, we carried out extensive experiments. The results show that in comparison to both the traditional tensor-based methods and the vector-based regression methods, our proposed solution achieves better performance for matrix data classifications.

  16. Feature Selection for Ridge Regression with Provable Guarantees.

    PubMed

    Paul, Saurabh; Drineas, Petros

    2016-04-01

    We introduce single-set spectral sparsification as a deterministic sampling-based feature selection technique for regularized least-squares classification, which is the classification analog to ridge regression. The method is unsupervised and gives worst-case guarantees of the generalization power of the classification function after feature selection with respect to the classification function obtained using all features. We also introduce leverage-score sampling as an unsupervised randomized feature selection method for ridge regression. We provide risk bounds for both single-set spectral sparsification and leverage-score sampling on ridge regression in the fixed design setting and show that the risk in the sampled space is comparable to the risk in the full-feature space. We perform experiments on synthetic and real-world data sets; a subset of TechTC-300 data sets, to support our theory. Experimental results indicate that the proposed methods perform better than the existing feature selection methods.

  17. Multiplication factor versus regression analysis in stature estimation from hand and foot dimensions.

    PubMed

    Krishan, Kewal; Kanchan, Tanuj; Sharma, Abhilasha

    2012-05-01

    Estimation of stature is an important parameter in identification of human remains in forensic examinations. The present study is aimed to compare the reliability and accuracy of stature estimation and to demonstrate the variability in estimated stature and actual stature using multiplication factor and regression analysis methods. The study is based on a sample of 246 subjects (123 males and 123 females) from North India aged between 17 and 20 years. Four anthropometric measurements; hand length, hand breadth, foot length and foot breadth taken on the left side in each subject were included in the study. Stature was measured using standard anthropometric techniques. Multiplication factors were calculated and linear regression models were derived for estimation of stature from hand and foot dimensions. Derived multiplication factors and regression formula were applied to the hand and foot measurements in the study sample. The estimated stature from the multiplication factors and regression analysis was compared with the actual stature to find the error in estimated stature. The results indicate that the range of error in estimation of stature from regression analysis method is less than that of multiplication factor method thus, confirming that the regression analysis method is better than multiplication factor analysis in stature estimation. Copyright © 2012 Elsevier Ltd and Faculty of Forensic and Legal Medicine. All rights reserved.

  18. A Comparison of Rule-based Analysis with Regression Methods in Understanding the Risk Factors for Study Withdrawal in a Pediatric Study.

    PubMed

    Haghighi, Mona; Johnson, Suzanne Bennett; Qian, Xiaoning; Lynch, Kristian F; Vehik, Kendra; Huang, Shuai

    2016-08-26

    Regression models are extensively used in many epidemiological studies to understand the linkage between specific outcomes of interest and their risk factors. However, regression models in general examine the average effects of the risk factors and ignore subgroups with different risk profiles. As a result, interventions are often geared towards the average member of the population, without consideration of the special health needs of different subgroups within the population. This paper demonstrates the value of using rule-based analysis methods that can identify subgroups with heterogeneous risk profiles in a population without imposing assumptions on the subgroups or method. The rules define the risk pattern of subsets of individuals by not only considering the interactions between the risk factors but also their ranges. We compared the rule-based analysis results with the results from a logistic regression model in The Environmental Determinants of Diabetes in the Young (TEDDY) study. Both methods detected a similar suite of risk factors, but the rule-based analysis was superior at detecting multiple interactions between the risk factors that characterize the subgroups. A further investigation of the particular characteristics of each subgroup may detect the special health needs of the subgroup and lead to tailored interventions.

  19. Acoustic-articulatory mapping in vowels by locally weighted regression

    PubMed Central

    McGowan, Richard S.; Berger, Michael A.

    2009-01-01

    A method for mapping between simultaneously measured articulatory and acoustic data is proposed. The method uses principal components analysis on the articulatory and acoustic variables, and mapping between the domains by locally weighted linear regression, or loess [Cleveland, W. S. (1979). J. Am. Stat. Assoc. 74, 829–836]. The latter method permits local variation in the slopes of the linear regression, assuming that the function being approximated is smooth. The methodology is applied to vowels of four speakers in the Wisconsin X-ray Microbeam Speech Production Database, with formant analysis. Results are examined in terms of (1) examples of forward (articulation-to-acoustics) mappings and inverse mappings, (2) distributions of local slopes and constants, (3) examples of correlations among slopes and constants, (4) root-mean-square error, and (5) sensitivity of formant frequencies to articulatory change. It is shown that the results are qualitatively correct and that loess performs better than global regression. The forward mappings show different root-mean-square error properties than the inverse mappings indicating that this method is better suited for the forward mappings than the inverse mappings, at least for the data chosen for the current study. Some preliminary results on sensitivity of the first two formant frequencies to the two most important articulatory principal components are presented. PMID:19813812

  20. The application of artificial neural networks and support vector regression for simultaneous spectrophotometric determination of commercial eye drop contents

    NASA Astrophysics Data System (ADS)

    Valizadeh, Maryam; Sohrabi, Mahmoud Reza

    2018-03-01

    In the present study, artificial neural networks (ANNs) and support vector regression (SVR) as intelligent methods coupled with UV spectroscopy for simultaneous quantitative determination of Dorzolamide (DOR) and Timolol (TIM) in eye drop. Several synthetic mixtures were analyzed for validating the proposed methods. At first, neural network time series, which one type of network from the artificial neural network was employed and its efficiency was evaluated. Afterwards, the radial basis network was applied as another neural network. Results showed that the performance of this method is suitable for predicting. Finally, support vector regression was proposed to construct the Zilomole prediction model. Also, root mean square error (RMSE) and mean recovery (%) were calculated for SVR method. Moreover, the proposed methods were compared to the high-performance liquid chromatography (HPLC) as a reference method. One way analysis of variance (ANOVA) test at the 95% confidence level applied to the comparison results of suggested and reference methods that there were no significant differences between them. Also, the effect of interferences was investigated in spike solutions.

  1. New analysis methods to push the boundaries of diagnostic techniques in the environmental sciences

    NASA Astrophysics Data System (ADS)

    Lungaroni, M.; Murari, A.; Peluso, E.; Gelfusa, M.; Malizia, A.; Vega, J.; Talebzadeh, S.; Gaudio, P.

    2016-04-01

    In the last years, new and more sophisticated measurements have been at the basis of the major progress in various disciplines related to the environment, such as remote sensing and thermonuclear fusion. To maximize the effectiveness of the measurements, new data analysis techniques are required. First data processing tasks, such as filtering and fitting, are of primary importance, since they can have a strong influence on the rest of the analysis. Even if Support Vector Regression is a method devised and refined at the end of the 90s, a systematic comparison with more traditional non parametric regression methods has never been reported. In this paper, a series of systematic tests is described, which indicates how SVR is a very competitive method of non-parametric regression that can usefully complement and often outperform more consolidated approaches. The performance of Support Vector Regression as a method of filtering is investigated first, comparing it with the most popular alternative techniques. Then Support Vector Regression is applied to the problem of non-parametric regression to analyse Lidar surveys for the environments measurement of particulate matter due to wildfires. The proposed approach has given very positive results and provides new perspectives to the interpretation of the data.

  2. Remote-sensing data processing with the multivariate regression analysis method for iron mineral resource potential mapping: a case study in the Sarvian area, central Iran

    NASA Astrophysics Data System (ADS)

    Mansouri, Edris; Feizi, Faranak; Jafari Rad, Alireza; Arian, Mehran

    2018-03-01

    This paper uses multivariate regression to create a mathematical model for iron skarn exploration in the Sarvian area, central Iran, using multivariate regression for mineral prospectivity mapping (MPM). The main target of this paper is to apply multivariate regression analysis (as an MPM method) to map iron outcrops in the northeastern part of the study area in order to discover new iron deposits in other parts of the study area. Two types of multivariate regression models using two linear equations were employed to discover new mineral deposits. This method is one of the reliable methods for processing satellite images. ASTER satellite images (14 bands) were used as unique independent variables (UIVs), and iron outcrops were mapped as dependent variables for MPM. According to the results of the probability value (p value), coefficient of determination value (R2) and adjusted determination coefficient (Radj2), the second regression model (which consistent of multiple UIVs) fitted better than other models. The accuracy of the model was confirmed by iron outcrops map and geological observation. Based on field observation, iron mineralization occurs at the contact of limestone and intrusive rocks (skarn type).

  3. Cocaine Dependence Treatment Data: Methods for Measurement Error Problems With Predictors Derived From Stationary Stochastic Processes

    PubMed Central

    Guan, Yongtao; Li, Yehua; Sinha, Rajita

    2011-01-01

    In a cocaine dependence treatment study, we use linear and nonlinear regression models to model posttreatment cocaine craving scores and first cocaine relapse time. A subset of the covariates are summary statistics derived from baseline daily cocaine use trajectories, such as baseline cocaine use frequency and average daily use amount. These summary statistics are subject to estimation error and can therefore cause biased estimators for the regression coefficients. Unlike classical measurement error problems, the error we encounter here is heteroscedastic with an unknown distribution, and there are no replicates for the error-prone variables or instrumental variables. We propose two robust methods to correct for the bias: a computationally efficient method-of-moments-based method for linear regression models and a subsampling extrapolation method that is generally applicable to both linear and nonlinear regression models. Simulations and an application to the cocaine dependence treatment data are used to illustrate the efficacy of the proposed methods. Asymptotic theory and variance estimation for the proposed subsampling extrapolation method and some additional simulation results are described in the online supplementary material. PMID:21984854

  4. Inferring gene regression networks with model trees

    PubMed Central

    2010-01-01

    Background Novel strategies are required in order to handle the huge amount of data produced by microarray technologies. To infer gene regulatory networks, the first step is to find direct regulatory relationships between genes building the so-called gene co-expression networks. They are typically generated using correlation statistics as pairwise similarity measures. Correlation-based methods are very useful in order to determine whether two genes have a strong global similarity but do not detect local similarities. Results We propose model trees as a method to identify gene interaction networks. While correlation-based methods analyze each pair of genes, in our approach we generate a single regression tree for each gene from the remaining genes. Finally, a graph from all the relationships among output and input genes is built taking into account whether the pair of genes is statistically significant. For this reason we apply a statistical procedure to control the false discovery rate. The performance of our approach, named REGNET, is experimentally tested on two well-known data sets: Saccharomyces Cerevisiae and E.coli data set. First, the biological coherence of the results are tested. Second the E.coli transcriptional network (in the Regulon database) is used as control to compare the results to that of a correlation-based method. This experiment shows that REGNET performs more accurately at detecting true gene associations than the Pearson and Spearman zeroth and first-order correlation-based methods. Conclusions REGNET generates gene association networks from gene expression data, and differs from correlation-based methods in that the relationship between one gene and others is calculated simultaneously. Model trees are very useful techniques to estimate the numerical values for the target genes by linear regression functions. They are very often more precise than linear regression models because they can add just different linear regressions to separate areas of the search space favoring to infer localized similarities over a more global similarity. Furthermore, experimental results show the good performance of REGNET. PMID:20950452

  5. Advantages of continuous genotype values over genotype classes for GWAS in higher polyploids: a comparative study in hexaploid chrysanthemum.

    PubMed

    Grandke, Fabian; Singh, Priyanka; Heuven, Henri C M; de Haan, Jorn R; Metzler, Dirk

    2016-08-24

    Association studies are an essential part of modern plant breeding, but are limited for polyploid crops. The increased number of possible genotype classes complicates the differentiation between them. Available methods are limited with respect to the ploidy level or data producing technologies. While genotype classification is an established noise reduction step in diploids, it gains complexity with increasing ploidy levels. Eventually, the errors produced by misclassifications exceed the benefits of genotype classes. Alternatively, continuous genotype values can be used for association analysis in higher polyploids. We associated continuous genotypes to three different traits and compared the results to the output of the genotype caller SuperMASSA. Linear, Bayesian and partial least squares regression were applied, to determine if the use of continuous genotypes is limited to a specific method. A disease, a flowering and a growth trait with h (2) of 0.51, 0.78 and 0.91 were associated with a hexaploid chrysanthemum genotypes. The data set consisted of 55,825 probes and 228 samples. We were able to detect associating probes using continuous genotypes for multiple traits, using different regression methods. The identified probe sets were overlapping, but not identical between the methods. Baysian regression was the most restrictive method, resulting in ten probes for one trait and none for the others. Linear and partial least squares regression led to numerous associating probes. Association based on genotype classes resulted in similar values, but missed several significant probes. A simulation study was used to successfully validate the number of associating markers. Association of various phenotypic traits with continuous genotypes is successful with both uni- and multivariate regression methods. Genotype calling does not improve the association and shows no advantages in this study. Instead, use of continuous genotypes simplifies the analysis, saves computational time and results more potential markers.

  6. Implementing informative priors for heterogeneity in meta-analysis using meta-regression and pseudo data.

    PubMed

    Rhodes, Kirsty M; Turner, Rebecca M; White, Ian R; Jackson, Dan; Spiegelhalter, David J; Higgins, Julian P T

    2016-12-20

    Many meta-analyses combine results from only a small number of studies, a situation in which the between-study variance is imprecisely estimated when standard methods are applied. Bayesian meta-analysis allows incorporation of external evidence on heterogeneity, providing the potential for more robust inference on the effect size of interest. We present a method for performing Bayesian meta-analysis using data augmentation, in which we represent an informative conjugate prior for between-study variance by pseudo data and use meta-regression for estimation. To assist in this, we derive predictive inverse-gamma distributions for the between-study variance expected in future meta-analyses. These may serve as priors for heterogeneity in new meta-analyses. In a simulation study, we compare approximate Bayesian methods using meta-regression and pseudo data against fully Bayesian approaches based on importance sampling techniques and Markov chain Monte Carlo (MCMC). We compare the frequentist properties of these Bayesian methods with those of the commonly used frequentist DerSimonian and Laird procedure. The method is implemented in standard statistical software and provides a less complex alternative to standard MCMC approaches. An importance sampling approach produces almost identical results to standard MCMC approaches, and results obtained through meta-regression and pseudo data are very similar. On average, data augmentation provides closer results to MCMC, if implemented using restricted maximum likelihood estimation rather than DerSimonian and Laird or maximum likelihood estimation. The methods are applied to real datasets, and an extension to network meta-analysis is described. The proposed method facilitates Bayesian meta-analysis in a way that is accessible to applied researchers. © 2016 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd. © 2016 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd.

  7. Linear regression techniques for use in the EC tracer method of secondary organic aerosol estimation

    NASA Astrophysics Data System (ADS)

    Saylor, Rick D.; Edgerton, Eric S.; Hartsell, Benjamin E.

    A variety of linear regression techniques and simple slope estimators are evaluated for use in the elemental carbon (EC) tracer method of secondary organic carbon (OC) estimation. Linear regression techniques based on ordinary least squares are not suitable for situations where measurement uncertainties exist in both regressed variables. In the past, regression based on the method of Deming [1943. Statistical Adjustment of Data. Wiley, London] has been the preferred choice for EC tracer method parameter estimation. In agreement with Chu [2005. Stable estimate of primary OC/EC ratios in the EC tracer method. Atmospheric Environment 39, 1383-1392], we find that in the limited case where primary non-combustion OC (OC non-comb) is assumed to be zero, the ratio of averages (ROA) approach provides a stable and reliable estimate of the primary OC-EC ratio, (OC/EC) pri. In contrast with Chu [2005. Stable estimate of primary OC/EC ratios in the EC tracer method. Atmospheric Environment 39, 1383-1392], however, we find that the optimal use of Deming regression (and the more general York et al. [2004. Unified equations for the slope, intercept, and standard errors of the best straight line. American Journal of Physics 72, 367-375] regression) provides excellent results as well. For the more typical case where OC non-comb is allowed to obtain a non-zero value, we find that regression based on the method of York is the preferred choice for EC tracer method parameter estimation. In the York regression technique, detailed information on uncertainties in the measurement of OC and EC is used to improve the linear best fit to the given data. If only limited information is available on the relative uncertainties of OC and EC, then Deming regression should be used. On the other hand, use of ROA in the estimation of secondary OC, and thus the assumption of a zero OC non-comb value, generally leads to an overestimation of the contribution of secondary OC to total measured OC.

  8. Epidemiologic programs for computers and calculators. A microcomputer program for multiple logistic regression by unconditional and conditional maximum likelihood methods.

    PubMed

    Campos-Filho, N; Franco, E L

    1989-02-01

    A frequent procedure in matched case-control studies is to report results from the multivariate unmatched analyses if they do not differ substantially from the ones obtained after conditioning on the matching variables. Although conceptually simple, this rule requires that an extensive series of logistic regression models be evaluated by both the conditional and unconditional maximum likelihood methods. Most computer programs for logistic regression employ only one maximum likelihood method, which requires that the analyses be performed in separate steps. This paper describes a Pascal microcomputer (IBM PC) program that performs multiple logistic regression by both maximum likelihood estimation methods, which obviates the need for switching between programs to obtain relative risk estimates from both matched and unmatched analyses. The program calculates most standard statistics and allows factoring of categorical or continuous variables by two distinct methods of contrast. A built-in, descriptive statistics option allows the user to inspect the distribution of cases and controls across categories of any given variable.

  9. Inter-class sparsity based discriminative least square regression.

    PubMed

    Wen, Jie; Xu, Yong; Li, Zuoyong; Ma, Zhongli; Xu, Yuanrong

    2018-06-01

    Least square regression is a very popular supervised classification method. However, two main issues greatly limit its performance. The first one is that it only focuses on fitting the input features to the corresponding output labels while ignoring the correlations among samples. The second one is that the used label matrix, i.e., zero-one label matrix is inappropriate for classification. To solve these problems and improve the performance, this paper presents a novel method, i.e., inter-class sparsity based discriminative least square regression (ICS_DLSR), for multi-class classification. Different from other methods, the proposed method pursues that the transformed samples have a common sparsity structure in each class. For this goal, an inter-class sparsity constraint is introduced to the least square regression model such that the margins of samples from the same class can be greatly reduced while those of samples from different classes can be enlarged. In addition, an error term with row-sparsity constraint is introduced to relax the strict zero-one label matrix, which allows the method to be more flexible in learning the discriminative transformation matrix. These factors encourage the method to learn a more compact and discriminative transformation for regression and thus has the potential to perform better than other methods. Extensive experimental results show that the proposed method achieves the best performance in comparison with other methods for multi-class classification. Copyright © 2018 Elsevier Ltd. All rights reserved.

  10. An automated ranking platform for machine learning regression models for meat spoilage prediction using multi-spectral imaging and metabolic profiling.

    PubMed

    Estelles-Lopez, Lucia; Ropodi, Athina; Pavlidis, Dimitris; Fotopoulou, Jenny; Gkousari, Christina; Peyrodie, Audrey; Panagou, Efstathios; Nychas, George-John; Mohareb, Fady

    2017-09-01

    Over the past decade, analytical approaches based on vibrational spectroscopy, hyperspectral/multispectral imagining and biomimetic sensors started gaining popularity as rapid and efficient methods for assessing food quality, safety and authentication; as a sensible alternative to the expensive and time-consuming conventional microbiological techniques. Due to the multi-dimensional nature of the data generated from such analyses, the output needs to be coupled with a suitable statistical approach or machine-learning algorithms before the results can be interpreted. Choosing the optimum pattern recognition or machine learning approach for a given analytical platform is often challenging and involves a comparative analysis between various algorithms in order to achieve the best possible prediction accuracy. In this work, "MeatReg", a web-based application is presented, able to automate the procedure of identifying the best machine learning method for comparing data from several analytical techniques, to predict the counts of microorganisms responsible of meat spoilage regardless of the packaging system applied. In particularly up to 7 regression methods were applied and these are ordinary least squares regression, stepwise linear regression, partial least square regression, principal component regression, support vector regression, random forest and k-nearest neighbours. MeatReg" was tested with minced beef samples stored under aerobic and modified atmosphere packaging and analysed with electronic nose, HPLC, FT-IR, GC-MS and Multispectral imaging instrument. Population of total viable count, lactic acid bacteria, pseudomonads, Enterobacteriaceae and B. thermosphacta, were predicted. As a result, recommendations of which analytical platforms are suitable to predict each type of bacteria and which machine learning methods to use in each case were obtained. The developed system is accessible via the link: www.sorfml.com. Copyright © 2017 Elsevier Ltd. All rights reserved.

  11. Retrieving relevant factors with exploratory SEM and principal-covariate regression: A comparison.

    PubMed

    Vervloet, Marlies; Van den Noortgate, Wim; Ceulemans, Eva

    2018-02-12

    Behavioral researchers often linearly regress a criterion on multiple predictors, aiming to gain insight into the relations between the criterion and predictors. Obtaining this insight from the ordinary least squares (OLS) regression solution may be troublesome, because OLS regression weights show only the effect of a predictor on top of the effects of other predictors. Moreover, when the number of predictors grows larger, it becomes likely that the predictors will be highly collinear, which makes the regression weights' estimates unstable (i.e., the "bouncing beta" problem). Among other procedures, dimension-reduction-based methods have been proposed for dealing with these problems. These methods yield insight into the data by reducing the predictors to a smaller number of summarizing variables and regressing the criterion on these summarizing variables. Two promising methods are principal-covariate regression (PCovR) and exploratory structural equation modeling (ESEM). Both simultaneously optimize reduction and prediction, but they are based on different frameworks. The resulting solutions have not yet been compared; it is thus unclear what the strengths and weaknesses are of both methods. In this article, we focus on the extents to which PCovR and ESEM are able to extract the factors that truly underlie the predictor scores and can predict a single criterion. The results of two simulation studies showed that for a typical behavioral dataset, ESEM (using the BIC for model selection) in this regard is successful more often than PCovR. Yet, in 93% of the datasets PCovR performed equally well, and in the case of 48 predictors, 100 observations, and large differences in the strengths of the factors, PCovR even outperformed ESEM.

  12. Use of partial least squares regression to impute SNP genotypes in Italian cattle breeds.

    PubMed

    Dimauro, Corrado; Cellesi, Massimo; Gaspa, Giustino; Ajmone-Marsan, Paolo; Steri, Roberto; Marras, Gabriele; Macciotta, Nicolò P P

    2013-06-05

    The objective of the present study was to test the ability of the partial least squares regression technique to impute genotypes from low density single nucleotide polymorphisms (SNP) panels i.e. 3K or 7K to a high density panel with 50K SNP. No pedigree information was used. Data consisted of 2093 Holstein, 749 Brown Swiss and 479 Simmental bulls genotyped with the Illumina 50K Beadchip. First, a single-breed approach was applied by using only data from Holstein animals. Then, to enlarge the training population, data from the three breeds were combined and a multi-breed analysis was performed. Accuracies of genotypes imputed using the partial least squares regression method were compared with those obtained by using the Beagle software. The impact of genotype imputation on breeding value prediction was evaluated for milk yield, fat content and protein content. In the single-breed approach, the accuracy of imputation using partial least squares regression was around 90 and 94% for the 3K and 7K platforms, respectively; corresponding accuracies obtained with Beagle were around 85% and 90%. Moreover, computing time required by the partial least squares regression method was on average around 10 times lower than computing time required by Beagle. Using the partial least squares regression method in the multi-breed resulted in lower imputation accuracies than using single-breed data. The impact of the SNP-genotype imputation on the accuracy of direct genomic breeding values was small. The correlation between estimates of genetic merit obtained by using imputed versus actual genotypes was around 0.96 for the 7K chip. Results of the present work suggested that the partial least squares regression imputation method could be useful to impute SNP genotypes when pedigree information is not available.

  13. Robust analysis of trends in noisy tokamak confinement data using geodesic least squares regression

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Verdoolaege, G., E-mail: geert.verdoolaege@ugent.be; Laboratory for Plasma Physics, Royal Military Academy, B-1000 Brussels; Shabbir, A.

    Regression analysis is a very common activity in fusion science for unveiling trends and parametric dependencies, but it can be a difficult matter. We have recently developed the method of geodesic least squares (GLS) regression that is able to handle errors in all variables, is robust against data outliers and uncertainty in the regression model, and can be used with arbitrary distribution models and regression functions. We here report on first results of application of GLS to estimation of the multi-machine scaling law for the energy confinement time in tokamaks, demonstrating improved consistency of the GLS results compared to standardmore » least squares.« less

  14. Quantitative laser-induced breakdown spectroscopy data using peak area step-wise regression analysis: an alternative method for interpretation of Mars science laboratory results

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Clegg, Samuel M; Barefield, James E; Wiens, Roger C

    2008-01-01

    The ChemCam instrument on the Mars Science Laboratory (MSL) will include a laser-induced breakdown spectrometer (LIBS) to quantify major and minor elemental compositions. The traditional analytical chemistry approach to calibration curves for these data regresses a single diagnostic peak area against concentration for each element. This approach contrasts with a new multivariate method in which elemental concentrations are predicted by step-wise multiple regression analysis based on areas of a specific set of diagnostic peaks for each element. The method is tested on LIBS data from igneous and metamorphosed rocks. Between 4 and 13 partial regression coefficients are needed to describemore » each elemental abundance accurately (i.e., with a regression line of R{sup 2} > 0.9995 for the relationship between predicted and measured elemental concentration) for all major and minor elements studied. Validation plots suggest that the method is limited at present by the small data set, and will work best for prediction of concentration when a wide variety of compositions and rock types has been analyzed.« less

  15. A Comparative Study of Pairwise Learning Methods Based on Kernel Ridge Regression.

    PubMed

    Stock, Michiel; Pahikkala, Tapio; Airola, Antti; De Baets, Bernard; Waegeman, Willem

    2018-06-12

    Many machine learning problems can be formulated as predicting labels for a pair of objects. Problems of that kind are often referred to as pairwise learning, dyadic prediction, or network inference problems. During the past decade, kernel methods have played a dominant role in pairwise learning. They still obtain a state-of-the-art predictive performance, but a theoretical analysis of their behavior has been underexplored in the machine learning literature. In this work we review and unify kernel-based algorithms that are commonly used in different pairwise learning settings, ranging from matrix filtering to zero-shot learning. To this end, we focus on closed-form efficient instantiations of Kronecker kernel ridge regression. We show that independent task kernel ridge regression, two-step kernel ridge regression, and a linear matrix filter arise naturally as a special case of Kronecker kernel ridge regression, implying that all these methods implicitly minimize a squared loss. In addition, we analyze universality, consistency, and spectral filtering properties. Our theoretical results provide valuable insights into assessing the advantages and limitations of existing pairwise learning methods.

  16. Quantifying the statistical importance of utilizing regression over classic energy intensity calculations for tracking efficiency improvements in industry

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Nimbalkar, Sachin U.; Wenning, Thomas J.; Guo, Wei

    In the United States, manufacturing facilities account for about 32% of total domestic energy consumption in 2014. Robust energy tracking methodologies are critical to understanding energy performance in manufacturing facilities. Due to its simplicity and intuitiveness, the classic energy intensity method (i.e. the ratio of total energy use over total production) is the most widely adopted. However, the classic energy intensity method does not take into account the variation of other relevant parameters (i.e. product type, feed stock type, weather, etc.). Furthermore, the energy intensity method assumes that the facilities’ base energy consumption (energy use at zero production) is zero,more » which rarely holds true. Therefore, it is commonly recommended to utilize regression models rather than the energy intensity approach for tracking improvements at the facility level. Unfortunately, many energy managers have difficulties understanding why regression models are statistically better than utilizing the classic energy intensity method. While anecdotes and qualitative information may convince some, many have major reservations about the accuracy of regression models and whether it is worth the time and effort to gather data and build quality regression models. This paper will explain why regression models are theoretically and quantitatively more accurate for tracking energy performance improvements. Based on the analysis of data from 114 manufacturing plants over 12 years, this paper will present quantitative results on the importance of utilizing regression models over the energy intensity methodology. This paper will also document scenarios where regression models do not have significant relevance over the energy intensity method.« less

  17. Simultaneous spectrophotometric determination of salbutamol and bromhexine in tablets.

    PubMed

    Habib, I H I; Hassouna, M E M; Zaki, G A

    2005-03-01

    Typical anti-mucolytic drugs called salbutamol hydrochloride and bromhexine sulfate encountered in tablets were determined simultaneously either by using linear regression at zero-crossing wavelengths of the first derivation of UV-spectra or by application of multiple linear partial least squares regression method. The results obtained by the two proposed mathematical methods were compared with those obtained by the HPLC technique.

  18. The Application of Censored Regression Models in Low Streamflow Analyses

    NASA Astrophysics Data System (ADS)

    Kroll, C.; Luz, J.

    2003-12-01

    Estimation of low streamflow statistics at gauged and ungauged river sites is often a daunting task. This process is further confounded by the presence of intermittent streamflows, where streamflow is sometimes reported as zero, within a region. Streamflows recorded as zero may be zero, or may be less than the measurement detection limit. Such data is often referred to as censored data. Numerous methods have been developed to characterize intermittent streamflow series. Logit regression has been proposed to develop regional models of the probability annual lowflows series (such as 7-day lowflows) are zero. In addition, Tobit regression, a method of regression that allows for censored dependent variables, has been proposed for lowflow regional regression models in regions where the lowflow statistic of interest estimated as zero at some sites in the region. While these methods have been proposed, their use in practice has been limited. Here a delete-one jackknife simulation is presented to examine the performance of Logit and Tobit models of 7-day annual minimum flows in 6 USGS water resource regions in the United States. For the Logit model, an assessment is made of whether sites are correctly classified as having at least 10% of 7-day annual lowflows equal to zero. In such a situation, the 7-day, 10-year lowflow (Q710), a commonly employed low streamflow statistic, would be reported as zero. For the Tobit model, a comparison is made between results from the Tobit model, and from performing either ordinary least squares (OLS) or principal component regression (PCR) after the zero sites are dropped from the analysis. Initial results for the Logit model indicate this method to have a high probability of correctly classifying sites into groups with Q710s as zero and non-zero. Initial results also indicate the Tobit model produces better results than PCR and OLS when more than 5% of the sites in the region have Q710 values calculated as zero.

  19. A subagging regression method for estimating the qualitative and quantitative state of groundwater

    NASA Astrophysics Data System (ADS)

    Jeong, J.; Park, E.; Choi, J.; Han, W. S.; Yun, S. T.

    2016-12-01

    A subagging regression (SBR) method for the analysis of groundwater data pertaining to the estimation of trend and the associated uncertainty is proposed. The SBR method is validated against synthetic data competitively with other conventional robust and non-robust methods. From the results, it is verified that the estimation accuracies of the SBR method are consistent and superior to those of the other methods and the uncertainties are reasonably estimated where the others have no uncertainty analysis option. To validate further, real quantitative and qualitative data are employed and analyzed comparatively with Gaussian process regression (GPR). For all cases, the trend and the associated uncertainties are reasonably estimated by SBR, whereas the GPR has limitations in representing the variability of non-Gaussian skewed data. From the implementations, it is determined that the SBR method has potential to be further developed as an effective tool of anomaly detection or outlier identification in groundwater state data.

  20. Optimizing methods for linking cinematic features to fMRI data.

    PubMed

    Kauttonen, Janne; Hlushchuk, Yevhen; Tikka, Pia

    2015-04-15

    One of the challenges of naturalistic neurosciences using movie-viewing experiments is how to interpret observed brain activations in relation to the multiplicity of time-locked stimulus features. As previous studies have shown less inter-subject synchronization across viewers of random video footage than story-driven films, new methods need to be developed for analysis of less story-driven contents. To optimize the linkage between our fMRI data collected during viewing of a deliberately non-narrative silent film 'At Land' by Maya Deren (1944) and its annotated content, we combined the method of elastic-net regularization with the model-driven linear regression and the well-established data-driven independent component analysis (ICA) and inter-subject correlation (ISC) methods. In the linear regression analysis, both IC and region-of-interest (ROI) time-series were fitted with time-series of a total of 36 binary-valued and one real-valued tactile annotation of film features. The elastic-net regularization and cross-validation were applied in the ordinary least-squares linear regression in order to avoid over-fitting due to the multicollinearity of regressors, the results were compared against both the partial least-squares (PLS) regression and the un-regularized full-model regression. Non-parametric permutation testing scheme was applied to evaluate the statistical significance of regression. We found statistically significant correlation between the annotation model and 9 ICs out of 40 ICs. Regression analysis was also repeated for a large set of cubic ROIs covering the grey matter. Both IC- and ROI-based regression analyses revealed activations in parietal and occipital regions, with additional smaller clusters in the frontal lobe. Furthermore, we found elastic-net based regression more sensitive than PLS and un-regularized regression since it detected a larger number of significant ICs and ROIs. Along with the ISC ranking methods, our regression analysis proved a feasible method for ordering the ICs based on their functional relevance to the annotated cinematic features. The novelty of our method is - in comparison to the hypothesis-driven manual pre-selection and observation of some individual regressors biased by choice - in applying data-driven approach to all content features simultaneously. We found especially the combination of regularized regression and ICA useful when analyzing fMRI data obtained using non-narrative movie stimulus with a large set of complex and correlated features. Copyright © 2015. Published by Elsevier Inc.

  1. Two levels ARIMAX and regression models for forecasting time series data with calendar variation effects

    NASA Astrophysics Data System (ADS)

    Suhartono, Lee, Muhammad Hisyam; Prastyo, Dedy Dwi

    2015-12-01

    The aim of this research is to develop a calendar variation model for forecasting retail sales data with the Eid ul-Fitr effect. The proposed model is based on two methods, namely two levels ARIMAX and regression methods. Two levels ARIMAX and regression models are built by using ARIMAX for the first level and regression for the second level. Monthly men's jeans and women's trousers sales in a retail company for the period January 2002 to September 2009 are used as case study. In general, two levels of calendar variation model yields two models, namely the first model to reconstruct the sales pattern that already occurred, and the second model to forecast the effect of increasing sales due to Eid ul-Fitr that affected sales at the same and the previous months. The results show that the proposed two level calendar variation model based on ARIMAX and regression methods yields better forecast compared to the seasonal ARIMA model and Neural Networks.

  2. Comparison of methods for the prediction of human clearance from hepatocyte intrinsic clearance for a set of reference compounds and an external evaluation set.

    PubMed

    Yamagata, Tetsuo; Zanelli, Ugo; Gallemann, Dieter; Perrin, Dominique; Dolgos, Hugues; Petersson, Carl

    2017-09-01

    1. We compared direct scaling, regression model equation and the so-called "Poulin et al." methods to scale clearance (CL) from in vitro intrinsic clearance (CL int ) measured in human hepatocytes using two sets of compounds. One reference set comprised of 20 compounds with known elimination pathways and one external evaluation set based on 17 compounds development in Merck (MS). 2. A 90% prospective confidence interval was calculated using the reference set. This interval was found relevant for the regression equation method. The three outliers identified were justified on the basis of their elimination mechanism. 3. The direct scaling method showed a systematic underestimation of clearance in both the reference and evaluation sets. The "Poulin et al." and the regression equation methods showed no obvious bias in either the reference or evaluation sets. 4. The regression model equation was slightly superior to the "Poulin et al." method in the reference set and showed a better absolute average fold error (AAFE) of value 1.3 compared to 1.6. A larger difference was observed in the evaluation set were the regression method and "Poulin et al." resulted in an AAFE of 1.7 and 2.6, respectively (removing the three compounds with known issues mentioned above). A similar pattern was observed for the correlation coefficient. Based on these data we suggest the regression equation method combined with a prospective confidence interval as the first choice for the extrapolation of human in vivo hepatic metabolic clearance from in vitro systems.

  3. Comparison of integrated clustering methods for accurate and stable prediction of building energy consumption data

    DOE PAGES

    Hsu, David

    2015-09-27

    Clustering methods are often used to model energy consumption for two reasons. First, clustering is often used to process data and to improve the predictive accuracy of subsequent energy models. Second, stable clusters that are reproducible with respect to non-essential changes can be used to group, target, and interpret observed subjects. However, it is well known that clustering methods are highly sensitive to the choice of algorithms and variables. This can lead to misleading assessments of predictive accuracy and mis-interpretation of clusters in policymaking. This paper therefore introduces two methods to the modeling of energy consumption in buildings: clusterwise regression,more » also known as latent class regression, which integrates clustering and regression simultaneously; and cluster validation methods to measure stability. Using a large dataset of multifamily buildings in New York City, clusterwise regression is compared to common two-stage algorithms that use K-means and model-based clustering with linear regression. Predictive accuracy is evaluated using 20-fold cross validation, and the stability of the perturbed clusters is measured using the Jaccard coefficient. These results show that there seems to be an inherent tradeoff between prediction accuracy and cluster stability. This paper concludes by discussing which clustering methods may be appropriate for different analytical purposes.« less

  4. The intermediate endpoint effect in logistic and probit regression

    PubMed Central

    MacKinnon, DP; Lockwood, CM; Brown, CH; Wang, W; Hoffman, JM

    2010-01-01

    Background An intermediate endpoint is hypothesized to be in the middle of the causal sequence relating an independent variable to a dependent variable. The intermediate variable is also called a surrogate or mediating variable and the corresponding effect is called the mediated, surrogate endpoint, or intermediate endpoint effect. Clinical studies are often designed to change an intermediate or surrogate endpoint and through this intermediate change influence the ultimate endpoint. In many intermediate endpoint clinical studies the dependent variable is binary, and logistic or probit regression is used. Purpose The purpose of this study is to describe a limitation of a widely used approach to assessing intermediate endpoint effects and to propose an alternative method, based on products of coefficients, that yields more accurate results. Methods The intermediate endpoint model for a binary outcome is described for a true binary outcome and for a dichotomization of a latent continuous outcome. Plots of true values and a simulation study are used to evaluate the different methods. Results Distorted estimates of the intermediate endpoint effect and incorrect conclusions can result from the application of widely used methods to assess the intermediate endpoint effect. The same problem occurs for the proportion of an effect explained by an intermediate endpoint, which has been suggested as a useful measure for identifying intermediate endpoints. A solution to this problem is given based on the relationship between latent variable modeling and logistic or probit regression. Limitations More complicated intermediate variable models are not addressed in the study, although the methods described in the article can be extended to these more complicated models. Conclusions Researchers are encouraged to use an intermediate endpoint method based on the product of regression coefficients. A common method based on difference in coefficient methods can lead to distorted conclusions regarding the intermediate effect. PMID:17942466

  5. Lung nodule malignancy prediction using multi-task convolutional neural network

    NASA Astrophysics Data System (ADS)

    Li, Xiuli; Kao, Yueying; Shen, Wei; Li, Xiang; Xie, Guotong

    2017-03-01

    In this paper, we investigated the problem of diagnostic lung nodule malignancy prediction using thoracic Computed Tomography (CT) screening. Unlike most existing studies classify the nodules into two types benign and malignancy, we interpreted the nodule malignancy prediction as a regression problem to predict continuous malignancy level. We proposed a joint multi-task learning algorithm using Convolutional Neural Network (CNN) to capture nodule heterogeneity by extracting discriminative features from alternatingly stacked layers. We trained a CNN regression model to predict the nodule malignancy, and designed a multi-task learning mechanism to simultaneously share knowledge among 9 different nodule characteristics (Subtlety, Calcification, Sphericity, Margin, Lobulation, Spiculation, Texture, Diameter and Malignancy), and improved the final prediction result. Each CNN would generate characteristic-specific feature representations, and then we applied multi-task learning on the features to predict the corresponding likelihood for that characteristic. We evaluated the proposed method on 2620 nodules CT scans from LIDC-IDRI dataset with the 5-fold cross validation strategy. The multitask CNN regression result for regression RMSE and mapped classification ACC were 0.830 and 83.03%, while the results for single task regression RMSE 0.894 and mapped classification ACC 74.9%. Experiments show that the proposed method could predict the lung nodule malignancy likelihood effectively and outperforms the state-of-the-art methods. The learning framework could easily be applied in other anomaly likelihood prediction problem, such as skin cancer and breast cancer. It demonstrated the possibility of our method facilitating the radiologists for nodule staging assessment and individual therapeutic planning.

  6. A controlled experiment in ground water flow model calibration

    USGS Publications Warehouse

    Hill, M.C.; Cooley, R.L.; Pollock, D.W.

    1998-01-01

    Nonlinear regression was introduced to ground water modeling in the 1970s, but has been used very little to calibrate numerical models of complicated ground water systems. Apparently, nonlinear regression is thought by many to be incapable of addressing such complex problems. With what we believe to be the most complicated synthetic test case used for such a study, this work investigates using nonlinear regression in ground water model calibration. Results of the study fall into two categories. First, the study demonstrates how systematic use of a well designed nonlinear regression method can indicate the importance of different types of data and can lead to successive improvement of models and their parameterizations. Our method differs from previous methods presented in the ground water literature in that (1) weighting is more closely related to expected data errors than is usually the case; (2) defined diagnostic statistics allow for more effective evaluation of the available data, the model, and their interaction; and (3) prior information is used more cautiously. Second, our results challenge some commonly held beliefs about model calibration. For the test case considered, we show that (1) field measured values of hydraulic conductivity are not as directly applicable to models as their use in some geostatistical methods imply; (2) a unique model does not necessarily need to be identified to obtain accurate predictions; and (3) in the absence of obvious model bias, model error was normally distributed. The complexity of the test case involved implies that the methods used and conclusions drawn are likely to be powerful in practice.Nonlinear regression was introduced to ground water modeling in the 1970s, but has been used very little to calibrate numerical models of complicated ground water systems. Apparently, nonlinear regression is thought by many to be incapable of addressing such complex problems. With what we believe to be the most complicated synthetic test case used for such a study, this work investigates using nonlinear regression in ground water model calibration. Results of the study fall into two categories. First, the study demonstrates how systematic use of a well designed nonlinear regression method can indicate the importance of different types of data and can lead to successive improvement of models and their parameterizations. Our method differs from previous methods presented in the ground water literature in that (1) weighting is more closely related to expected data errors than is usually the case; (2) defined diagnostic statistics allow for more effective evaluation of the available data, the model, and their interaction; and (3) prior information is used more cautiously. Second, our results challenge some commonly held beliefs about model calibration. For the test case considered, we show that (1) field measured values of hydraulic conductivity are not as directly applicable to models as their use in some geostatistical methods imply; (2) a unique model does not necessarily need to be identified to obtain accurate predictions; and (3) in the absence of obvious model bias, model error was normally distributed. The complexity of the test case involved implies that the methods used and conclusions drawn are likely to be powerful in practice.

  7. Estimating interaction on an additive scale between continuous determinants in a logistic regression model.

    PubMed

    Knol, Mirjam J; van der Tweel, Ingeborg; Grobbee, Diederick E; Numans, Mattijs E; Geerlings, Mirjam I

    2007-10-01

    To determine the presence of interaction in epidemiologic research, typically a product term is added to the regression model. In linear regression, the regression coefficient of the product term reflects interaction as departure from additivity. However, in logistic regression it refers to interaction as departure from multiplicativity. Rothman has argued that interaction estimated as departure from additivity better reflects biologic interaction. So far, literature on estimating interaction on an additive scale using logistic regression only focused on dichotomous determinants. The objective of the present study was to provide the methods to estimate interaction between continuous determinants and to illustrate these methods with a clinical example. and results From the existing literature we derived the formulas to quantify interaction as departure from additivity between one continuous and one dichotomous determinant and between two continuous determinants using logistic regression. Bootstrapping was used to calculate the corresponding confidence intervals. To illustrate the theory with an empirical example, data from the Utrecht Health Project were used, with age and body mass index as risk factors for elevated diastolic blood pressure. The methods and formulas presented in this article are intended to assist epidemiologists to calculate interaction on an additive scale between two variables on a certain outcome. The proposed methods are included in a spreadsheet which is freely available at: http://www.juliuscenter.nl/additive-interaction.xls.

  8. Comparison of a Full Food-Frequency Questionnaire with the Three-Day Unweighted Food Records in Young Polish Adult Women: Implications for Dietary Assessment

    PubMed Central

    Kowalkowska, Joanna; Slowinska, Malgorzata A.; Slowinski, Dariusz; Dlugosz, Anna; Niedzwiedzka, Ewa; Wadolowska, Lidia

    2013-01-01

    The food frequency questionnaire (FFQ) and the food record (FR) are among the most common methods used in dietary research. It is important to know that is it possible to use both methods simultaneously in dietary assessment and prepare a single, comprehensive interpretation. The aim of this study was to compare the energy and nutritional value of diets, determined by the FFQ and by the three-day food records of young women. The study involved 84 female students aged 21–26 years (mean of 22.2 ± 0.8 years). Completing the FFQ was preceded by obtaining unweighted food records covering three consecutive days. Energy and nutritional value of diets was assessed for both methods (FFQ-crude, FR-crude). Data obtained for FFQ-crude were adjusted with beta-coefficient equaling 0.5915 (FFQ-adjusted) and regression analysis (FFQ-regressive). The FFQ-adjusted was calculated as FR-crude/FFQ-crude ratio of mean daily energy intake. FFQ-regressive was calculated for energy and each nutrient separately using regression equation, including FFQ-crude and FR-crude as covariates. For FR-crude and FFQ-crude the energy value of diets was standardized to 2000 kcal (FR-standardized, FFQ-standardized). Methods of statistical comparison included a dependent samples t-test, a chi-square test, and the Bland-Altman method. The mean energy intake in FFQ-crude was significantly higher than FR-crude (2740.5 kcal vs. 1621.0 kcal, respectively). For FR-standardized and FFQ-standardized, significance differences were found in the mean intake of 18 out of 31 nutrients, for FR-crude and FFQ-adjusted in 13 out of 31 nutrients and FR-crude and FFQ-regressive in 11 out of 31 nutrients. The Bland-Altman method showed an overestimation of energy and nutrient intake by FFQ-crude in comparison to FR-crude, e.g., total protein was overestimated by 34.7 g/day (95% Confidence Interval, CI: −29.6, 99.0 g/day) and fat by 48.6 g/day (95% CI: −36.4, 133.6 g/day). After regressive transformation of FFQ, the absolute difference between FFQ-regressive and FR-crude equaled 0.0 g/day and 95% CI were much better (e.g., for total protein 95% CI: −32.7, 32.7 g/day, for fat 95% CI: −49.6, 49.6 g/day). In conclusion, differences in nutritional value of diets resulted from overestimating energy intake by the FFQ in comparison to the three-day unweighted food records. Adjustment of energy and nutrient intake applied for the FFQ using various methods, particularly regression equations, significantly improved the agreement between results obtained by both methods and dietary assessment. To obtain the most accurate results in future studies using this FFQ, energy and nutrient intake should be adjusted by the regression equations presented in this paper. PMID:23877089

  9. Prediction of unwanted pregnancies using logistic regression, probit regression and discriminant analysis

    PubMed Central

    Ebrahimzadeh, Farzad; Hajizadeh, Ebrahim; Vahabi, Nasim; Almasian, Mohammad; Bakhteyar, Katayoon

    2015-01-01

    Background: Unwanted pregnancy not intended by at least one of the parents has undesirable consequences for the family and the society. In the present study, three classification models were used and compared to predict unwanted pregnancies in an urban population. Methods: In this cross-sectional study, 887 pregnant mothers referring to health centers in Khorramabad, Iran, in 2012 were selected by the stratified and cluster sampling; relevant variables were measured and for prediction of unwanted pregnancy, logistic regression, discriminant analysis, and probit regression models and SPSS software version 21 were used. To compare these models, indicators such as sensitivity, specificity, the area under the ROC curve, and the percentage of correct predictions were used. Results: The prevalence of unwanted pregnancies was 25.3%. The logistic and probit regression models indicated that parity and pregnancy spacing, contraceptive methods, household income and number of living male children were related to unwanted pregnancy. The performance of the models based on the area under the ROC curve was 0.735, 0.733, and 0.680 for logistic regression, probit regression, and linear discriminant analysis, respectively. Conclusion: Given the relatively high prevalence of unwanted pregnancies in Khorramabad, it seems necessary to revise family planning programs. Despite the similar accuracy of the models, if the researcher is interested in the interpretability of the results, the use of the logistic regression model is recommended. PMID:26793655

  10. Estimation Methods for Non-Homogeneous Regression - Minimum CRPS vs Maximum Likelihood

    NASA Astrophysics Data System (ADS)

    Gebetsberger, Manuel; Messner, Jakob W.; Mayr, Georg J.; Zeileis, Achim

    2017-04-01

    Non-homogeneous regression models are widely used to statistically post-process numerical weather prediction models. Such regression models correct for errors in mean and variance and are capable to forecast a full probability distribution. In order to estimate the corresponding regression coefficients, CRPS minimization is performed in many meteorological post-processing studies since the last decade. In contrast to maximum likelihood estimation, CRPS minimization is claimed to yield more calibrated forecasts. Theoretically, both scoring rules used as an optimization score should be able to locate a similar and unknown optimum. Discrepancies might result from a wrong distributional assumption of the observed quantity. To address this theoretical concept, this study compares maximum likelihood and minimum CRPS estimation for different distributional assumptions. First, a synthetic case study shows that, for an appropriate distributional assumption, both estimation methods yield to similar regression coefficients. The log-likelihood estimator is slightly more efficient. A real world case study for surface temperature forecasts at different sites in Europe confirms these results but shows that surface temperature does not always follow the classical assumption of a Gaussian distribution. KEYWORDS: ensemble post-processing, maximum likelihood estimation, CRPS minimization, probabilistic temperature forecasting, distributional regression models

  11. Hypothesis Testing Using Factor Score Regression

    PubMed Central

    Devlieger, Ines; Mayer, Axel; Rosseel, Yves

    2015-01-01

    In this article, an overview is given of four methods to perform factor score regression (FSR), namely regression FSR, Bartlett FSR, the bias avoiding method of Skrondal and Laake, and the bias correcting method of Croon. The bias correcting method is extended to include a reliable standard error. The four methods are compared with each other and with structural equation modeling (SEM) by using analytic calculations and two Monte Carlo simulation studies to examine their finite sample characteristics. Several performance criteria are used, such as the bias using the unstandardized and standardized parameterization, efficiency, mean square error, standard error bias, type I error rate, and power. The results show that the bias correcting method, with the newly developed standard error, is the only suitable alternative for SEM. While it has a higher standard error bias than SEM, it has a comparable bias, efficiency, mean square error, power, and type I error rate. PMID:29795886

  12. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Dierauf, Timothy; Kurtz, Sarah; Riley, Evan

    This paper provides a recommended method for evaluating the AC capacity of a photovoltaic (PV) generating station. It also presents companion guidance on setting the facilitys capacity guarantee value. This is a principles-based approach that incorporates plant fundamental design parameters such as loss factors, module coefficients, and inverter constraints. This method has been used to prove contract guarantees for over 700 MW of installed projects. The method is transparent, and the results are deterministic. In contrast, current industry practices incorporate statistical regression where the empirical coefficients may only characterize the collected data. Though these methods may work well when extrapolationmore » is not required, there are other situations where the empirical coefficients may not adequately model actual performance.This proposed Fundamentals Approach method provides consistent results even where regression methods start to lose fidelity.« less

  13. Regression and direct methods do not give different estimates of digestible and metabolizable energy values of barley, sorghum, and wheat for pigs.

    PubMed

    Bolarinwa, O A; Adeola, O

    2016-02-01

    Direct or indirect methods can be used to determine the DE and ME of feed ingredients for pigs. In situations when only the indirect approach is suitable, the regression method presents a robust indirect approach. Three experiments were conducted to compare the direct and regression methods for determining the DE and ME values of barley, sorghum, and wheat for pigs. In each experiment, 24 barrows with an average initial BW of 31, 32, and 33 kg were assigned to 4 diets in a randomized complete block design. The 4 diets consisted of 969 g barley, sorghum, or wheat/kg plus minerals and vitamins for the direct method; a corn-soybean meal reference diet (RD); the RD + 300 g barley, sorghum, or wheat/kg; and the RD + 600 g barley, sorghum, or wheat/kg. The 3 corn-soybean meal diets were used for the regression method. Each diet was fed to 6 barrows in individual metabolism crates for a 5-d acclimation followed by a 5-d period of total but separate collection of feces and urine in each experiment. Graded substitution of barley or wheat, but not sorghum, into the RD linearly reduced ( < 0.05) dietary DE and ME. The direct method-derived DE and ME for barley were 3,669 and 3,593 kcal/kg DM, respectively. The regressions of barley contribution to DE and ME in kilocalories against the quantity of barley DMI in kilograms generated 3,746 kcal DE/kg DM and 3,647 kcal ME/kg DM. The DE and ME for sorghum by the direct method were 4,097 and 4,042 kcal/kg DM, respectively; the corresponding regression-derived estimates were 4,145 and 4,066 kcal/kg DM. Using the direct method, energy values for wheat were 3,953 kcal DE/kg DM and 3,889 kcal ME/kg DM. The regressions of wheat contribution to DE and ME in kilocalories against the quantity of wheat DMI in kilograms generated 3,960 kcal DE/kg DM and 3,874 kcal ME/kg DM. The DE and ME of barley using the direct method were not different (0.3 < < 0.4) from those obtained using the regression method (3,669 vs. 3,746 and 3,593 vs. 3,647 kcal/kg DM, respectively). The direct method-derived DE and ME of sorghum were not different (0.5 < < 0.7) from those obtained using the regression method (4,097 vs. 4,145 and 4,042 vs. 4,066 kcal/kg DM, respectively). The direct method- and regression method-derived DE (3,953 and 3,960 kcal/kg DM, respectively) and ME (3,889 and 3,874 kcal/kg DM, respectively) of wheat were not different (0.8 < < 0.9). Results of these 3 experiments suggest that regression and direct methods do not give different estimates of DE and ME in barley, sorghum, and wheat for pigs.

  14. Simulation of CO2 Solubility in Polystyrene-b-Polybutadieneb-Polystyrene (SEBS) by artificial intelligence network (ANN) method

    NASA Astrophysics Data System (ADS)

    Sharudin, R. W.; AbdulBari Ali, S.; Zulkarnain, M.; Shukri, M. A.

    2018-05-01

    This study reports on the integration of Artificial Neural Network (ANNs) with experimental data in predicting the solubility of carbon dioxide (CO2) blowing agent in SEBS by generating highest possible value for Regression coefficient (R2). Basically, foaming of thermoplastic elastomer with CO2 is highly affected by the CO2 solubility. The ability of ANN in predicting interpolated data of CO2 solubility was investigated by comparing training results via different method of network training. Regards to the final prediction result for CO2 solubility by ANN, the prediction trend (output generate) was corroborated with the experimental results. The obtained result of different method of training showed the trend of output generated by Gradient Descent with Momentum & Adaptive LR (traingdx) required longer training time and required more accurate input to produce better output with final Regression Value of 0.88. However, it goes vice versa with Levenberg-Marquardt (trainlm) technique as it produced better output in quick detention time with final Regression Value of 0.91.

  15. A novel multi-target regression framework for time-series prediction of drug efficacy.

    PubMed

    Li, Haiqing; Zhang, Wei; Chen, Ying; Guo, Yumeng; Li, Guo-Zheng; Zhu, Xiaoxin

    2017-01-18

    Excavating from small samples is a challenging pharmacokinetic problem, where statistical methods can be applied. Pharmacokinetic data is special due to the small samples of high dimensionality, which makes it difficult to adopt conventional methods to predict the efficacy of traditional Chinese medicine (TCM) prescription. The main purpose of our study is to obtain some knowledge of the correlation in TCM prescription. Here, a novel method named Multi-target Regression Framework to deal with the problem of efficacy prediction is proposed. We employ the correlation between the values of different time sequences and add predictive targets of previous time as features to predict the value of current time. Several experiments are conducted to test the validity of our method and the results of leave-one-out cross-validation clearly manifest the competitiveness of our framework. Compared with linear regression, artificial neural networks, and partial least squares, support vector regression combined with our framework demonstrates the best performance, and appears to be more suitable for this task.

  16. A novel multi-target regression framework for time-series prediction of drug efficacy

    PubMed Central

    Li, Haiqing; Zhang, Wei; Chen, Ying; Guo, Yumeng; Li, Guo-Zheng; Zhu, Xiaoxin

    2017-01-01

    Excavating from small samples is a challenging pharmacokinetic problem, where statistical methods can be applied. Pharmacokinetic data is special due to the small samples of high dimensionality, which makes it difficult to adopt conventional methods to predict the efficacy of traditional Chinese medicine (TCM) prescription. The main purpose of our study is to obtain some knowledge of the correlation in TCM prescription. Here, a novel method named Multi-target Regression Framework to deal with the problem of efficacy prediction is proposed. We employ the correlation between the values of different time sequences and add predictive targets of previous time as features to predict the value of current time. Several experiments are conducted to test the validity of our method and the results of leave-one-out cross-validation clearly manifest the competitiveness of our framework. Compared with linear regression, artificial neural networks, and partial least squares, support vector regression combined with our framework demonstrates the best performance, and appears to be more suitable for this task. PMID:28098186

  17. A self-trained classification technique for producing 30 m percent-water maps from Landsat data

    USGS Publications Warehouse

    Rover, Jennifer R.; Wylie, Bruce K.; Ji, Lei

    2010-01-01

    Small bodies of water can be mapped with moderate-resolution satellite data using methods where water is mapped as subpixel fractions using field measurements or high-resolution images as training datasets. A new method, developed from a regression-tree technique, uses a 30 m Landsat image for training the regression tree that, in turn, is applied to the same image to map subpixel water. The self-trained method was evaluated by comparing the percent-water map with three other maps generated from established percent-water mapping methods: (1) a regression-tree model trained with a 5 m SPOT 5 image, (2) a regression-tree model based on endmembers and (3) a linear unmixing classification technique. The results suggest that subpixel water fractions can be accurately estimated when high-resolution satellite data or intensively interpreted training datasets are not available, which increases our ability to map small water bodies or small changes in lake size at a regional scale.

  18. Spectral Regression Discriminant Analysis for Hyperspectral Image Classification

    NASA Astrophysics Data System (ADS)

    Pan, Y.; Wu, J.; Huang, H.; Liu, J.

    2012-08-01

    Dimensionality reduction algorithms, which aim to select a small set of efficient and discriminant features, have attracted great attention for Hyperspectral Image Classification. The manifold learning methods are popular for dimensionality reduction, such as Locally Linear Embedding, Isomap, and Laplacian Eigenmap. However, a disadvantage of many manifold learning methods is that their computations usually involve eigen-decomposition of dense matrices which is expensive in both time and memory. In this paper, we introduce a new dimensionality reduction method, called Spectral Regression Discriminant Analysis (SRDA). SRDA casts the problem of learning an embedding function into a regression framework, which avoids eigen-decomposition of dense matrices. Also, with the regression based framework, different kinds of regularizes can be naturally incorporated into our algorithm which makes it more flexible. It can make efficient use of data points to discover the intrinsic discriminant structure in the data. Experimental results on Washington DC Mall and AVIRIS Indian Pines hyperspectral data sets demonstrate the effectiveness of the proposed method.

  19. Birthweight Related Factors in Northwestern Iran: Using Quantile Regression Method

    PubMed Central

    Fallah, Ramazan; Kazemnejad, Anoshirvan; Zayeri, Farid; Shoghli, Alireza

    2016-01-01

    Introduction: Birthweight is one of the most important predicting indicators of the health status in adulthood. Having a balanced birthweight is one of the priorities of the health system in most of the industrial and developed countries. This indicator is used to assess the growth and health status of the infants. The aim of this study was to assess the birthweight of the neonates by using quantile regression in Zanjan province. Methods: This analytical descriptive study was carried out using pre-registered (March 2010 - March 2012) data of neonates in urban/rural health centers of Zanjan province using multiple-stage cluster sampling. Data were analyzed using multiple linear regressions andquantile regression method and SAS 9.2 statistical software. Results: From 8456 newborn baby, 4146 (49%) were female. The mean age of the mothers was 27.1±5.4 years. The mean birthweight of the neonates was 3104 ± 431 grams. Five hundred and seventy-three patients (6.8%) of the neonates were less than 2500 grams. In all quantiles, gestational age of neonates (p<0.05), weight and educational level of the mothers (p<0.05) showed a linear significant relationship with the i of the neonates. However, sex and birth rank of the neonates, mothers age, place of residence (urban/rural) and career were not significant in all quantiles (p>0.05). Conclusion: This study revealed the results of multiple linear regression and quantile regression were not identical. We strictly recommend the use of quantile regression when an asymmetric response variable or data with outliers is available. PMID:26925889

  20. Evaluating differential effects using regression interactions and regression mixture models

    PubMed Central

    Van Horn, M. Lee; Jaki, Thomas; Masyn, Katherine; Howe, George; Feaster, Daniel J.; Lamont, Andrea E.; George, Melissa R. W.; Kim, Minjung

    2015-01-01

    Research increasingly emphasizes understanding differential effects. This paper focuses on understanding regression mixture models, a relatively new statistical methods for assessing differential effects by comparing results to using an interactive term in linear regression. The research questions which each model answers, their formulation, and their assumptions are compared using Monte Carlo simulations and real data analysis. The capabilities of regression mixture models are described and specific issues to be addressed when conducting regression mixtures are proposed. The paper aims to clarify the role that regression mixtures can take in the estimation of differential effects and increase awareness of the benefits and potential pitfalls of this approach. Regression mixture models are shown to be a potentially effective exploratory method for finding differential effects when these effects can be defined by a small number of classes of respondents who share a typical relationship between a predictor and an outcome. It is also shown that the comparison between regression mixture models and interactions becomes substantially more complex as the number of classes increases. It is argued that regression interactions are well suited for direct tests of specific hypotheses about differential effects and regression mixtures provide a useful approach for exploring effect heterogeneity given adequate samples and study design. PMID:26556903

  1. Ecologic regression analysis and the study of the influence of air quality on mortality.

    PubMed Central

    Selvin, S; Merrill, D; Wong, L; Sacks, S T

    1984-01-01

    This presentation focuses entirely on the use and evaluation of regression analysis applied to ecologic data as a method to study the effects of ambient air pollution on mortality rates. Using extensive national data on mortality, air quality and socio-economic status regression analyses are used to study the influence of air quality on mortality. The analytic methods and data are selected in such a way that direct comparisons can be made with other ecologic regression studies of mortality and air quality. Analyses are performed by use of two types of geographic areas, age-specific mortality of both males and females and three pollutants (total suspended particulates, sulfur dioxide and nitrogen dioxide). The overall results indicate no persuasive evidence exists of a link between air quality and general mortality levels. Additionally, a lack of consistency between the present results and previous published work is noted. Overall, it is concluded that linear regression analysis applied to nationally collected ecologic data cannot be used to usefully infer a causal relationship between air quality and mortality which is in direct contradiction to other major published studies. PMID:6734568

  2. Combining logistic regression with classification and regression tree to predict quality of care in a home health nursing data set.

    PubMed

    Guo, Huey-Ming; Shyu, Yea-Ing Lotus; Chang, Her-Kun

    2006-01-01

    In this article, the authors provide an overview of a research method to predict quality of care in home health nursing data set. The results of this study can be visualized through classification an regression tree (CART) graphs. The analysis was more effective, and the results were more informative since the home health nursing dataset was analyzed with a combination of the logistic regression and CART, these two techniques complete each other. And the results more informative that more patients' characters were related to quality of care in home care. The results contributed to home health nurse predict patient outcome in case management. Improved prediction is needed for interventions to be appropriately targeted for improved patient outcome and quality of care.

  3. A 2-step penalized regression method for family-based next-generation sequencing association studies.

    PubMed

    Ding, Xiuhua; Su, Shaoyong; Nandakumar, Kannabiran; Wang, Xiaoling; Fardo, David W

    2014-01-01

    Large-scale genetic studies are often composed of related participants, and utilizing familial relationships can be cumbersome and computationally challenging. We present an approach to efficiently handle sequencing data from complex pedigrees that incorporates information from rare variants as well as common variants. Our method employs a 2-step procedure that sequentially regresses out correlation from familial relatedness and then uses the resulting phenotypic residuals in a penalized regression framework to test for associations with variants within genetic units. The operating characteristics of this approach are detailed using simulation data based on a large, multigenerational cohort.

  4. Soil Particle Size Analysis by Laser Diffractometry: Result Comparison with Pipette Method

    NASA Astrophysics Data System (ADS)

    Šinkovičová, Miroslava; Igaz, Dušan; Kondrlová, Elena; Jarošová, Miriam

    2017-10-01

    Soil texture as the basic soil physical property provides a basic information on the soil grain size distribution as well as grain size fraction representation. Currently, there are several methods of particle dimension measurement available that are based on different physical principles. Pipette method based on the different sedimentation velocity of particles with different diameter is considered to be one of the standard methods of individual grain size fraction distribution determination. Following the technical advancement, optical methods such as laser diffraction can be also used nowadays for grain size distribution determination in the soil. According to the literature review of domestic as well as international sources related to this topic, it is obvious that the results obtained by laser diffractometry do not correspond with the results obtained by pipette method. The main aim of this paper was to analyse 132 samples of medium fine soil, taken from the Nitra River catchment in Slovakia, from depths of 15-20 cm and 40-45 cm, respectively, using laser analysers: ANALYSETTE 22 MicroTec plus (Fritsch GmbH) and Mastersizer 2000 (Malvern Instruments Ltd). The results obtained by laser diffractometry were compared with pipette method and the regression relationships using linear, exponential, power and polynomial trend were derived. Regressions with the three highest regression coefficients (R2) were further investigated. The fit with the highest tightness was observed for the polynomial regression. In view of the results obtained, we recommend using the estimate of the representation of the clay fraction (<0.01 mm) polynomial regression, to achieve a highest confidence value R2 at the depths of 15-20 cm 0.72 (Analysette 22 MicroTec plus) and 0.95 (Mastersizer 2000), from a depth of 40-45 cm 0.90 (Analysette 22 MicroTec plus) and 0.96 (Mastersizer 2000). Since the percentage representation of clayey particles (2nd fraction according to the methodology of Complex Soil Survey done in Slovakia) in soil is the determinant for soil type specification, we recommend using the derived relationships in soil science when the soil texture analysis is done according to laser diffractometry. The advantages of laser diffraction method comprise the short analysis time, usage of small sample amount, application for the various grain size fraction and soil type classification systems, and a wide range of determined fractions. Therefore, it is necessary to focus on this issue further to address the needs of soil science research and attempt to replace the standard pipette method with more progressive laser diffraction method.

  5. Flood-frequency prediction methods for unregulated streams of Tennessee, 2000

    USGS Publications Warehouse

    Law, George S.; Tasker, Gary D.

    2003-01-01

    Up-to-date flood-frequency prediction methods for unregulated, ungaged rivers and streams of Tennessee have been developed. Prediction methods include the regional-regression method and the newer region-of-influence method. The prediction methods were developed using stream-gage records from unregulated streams draining basins having from 1 percent to about 30 percent total impervious area. These methods, however, should not be used in heavily developed or storm-sewered basins with impervious areas greater than 10 percent. The methods can be used to estimate 2-, 5-, 10-, 25-, 50-, 100-, and 500-year recurrence-interval floods of most unregulated rural streams in Tennessee. A computer application was developed that automates the calculation of flood frequency for unregulated, ungaged rivers and streams of Tennessee. Regional-regression equations were derived by using both single-variable and multivariable regional-regression analysis. Contributing drainage area is the explanatory variable used in the single-variable equations. Contributing drainage area, main-channel slope, and a climate factor are the explanatory variables used in the multivariable equations. Deleted-residual standard error for the single-variable equations ranged from 32 to 65 percent. Deleted-residual standard error for the multivariable equations ranged from 31 to 63 percent. These equations are included in the computer application to allow easy comparison of results produced by the different methods. The region-of-influence method calculates multivariable regression equations for each ungaged site and recurrence interval using basin characteristics from 60 similar sites selected from the study area. Explanatory variables that may be used in regression equations computed by the region-of-influence method include contributing drainage area, main-channel slope, a climate factor, and a physiographic-region factor. Deleted-residual standard error for the region-of-influence method tended to be only slightly smaller than those for the regional-regression method and ranged from 27 to 62 percent.

  6. Creep-Rupture Data Analysis - Engineering Application of Regression Techniques. Ph.D. Thesis - North Carolina State Univ.

    NASA Technical Reports Server (NTRS)

    Rummler, D. R.

    1976-01-01

    The results are presented of investigations to apply regression techniques to the development of methodology for creep-rupture data analysis. Regression analysis techniques are applied to the explicit description of the creep behavior of materials for space shuttle thermal protection systems. A regression analysis technique is compared with five parametric methods for analyzing three simulated and twenty real data sets, and a computer program for the evaluation of creep-rupture data is presented.

  7. Estimation of Fine Particulate Matter in Taipei Using Landuse Regression and Bayesian Maximum Entropy Methods

    PubMed Central

    Yu, Hwa-Lung; Wang, Chih-Hsih; Liu, Ming-Che; Kuo, Yi-Ming

    2011-01-01

    Fine airborne particulate matter (PM2.5) has adverse effects on human health. Assessing the long-term effects of PM2.5 exposure on human health and ecology is often limited by a lack of reliable PM2.5 measurements. In Taipei, PM2.5 levels were not systematically measured until August, 2005. Due to the popularity of geographic information systems (GIS), the landuse regression method has been widely used in the spatial estimation of PM concentrations. This method accounts for the potential contributing factors of the local environment, such as traffic volume. Geostatistical methods, on other hand, account for the spatiotemporal dependence among the observations of ambient pollutants. This study assesses the performance of the landuse regression model for the spatiotemporal estimation of PM2.5 in the Taipei area. Specifically, this study integrates the landuse regression model with the geostatistical approach within the framework of the Bayesian maximum entropy (BME) method. The resulting epistemic framework can assimilate knowledge bases including: (a) empirical-based spatial trends of PM concentration based on landuse regression, (b) the spatio-temporal dependence among PM observation information, and (c) site-specific PM observations. The proposed approach performs the spatiotemporal estimation of PM2.5 levels in the Taipei area (Taiwan) from 2005–2007. PMID:21776223

  8. Estimation of fine particulate matter in Taipei using landuse regression and bayesian maximum entropy methods.

    PubMed

    Yu, Hwa-Lung; Wang, Chih-Hsih; Liu, Ming-Che; Kuo, Yi-Ming

    2011-06-01

    Fine airborne particulate matter (PM2.5) has adverse effects on human health. Assessing the long-term effects of PM2.5 exposure on human health and ecology is often limited by a lack of reliable PM2.5 measurements. In Taipei, PM2.5 levels were not systematically measured until August, 2005. Due to the popularity of geographic information systems (GIS), the landuse regression method has been widely used in the spatial estimation of PM concentrations. This method accounts for the potential contributing factors of the local environment, such as traffic volume. Geostatistical methods, on other hand, account for the spatiotemporal dependence among the observations of ambient pollutants. This study assesses the performance of the landuse regression model for the spatiotemporal estimation of PM2.5 in the Taipei area. Specifically, this study integrates the landuse regression model with the geostatistical approach within the framework of the Bayesian maximum entropy (BME) method. The resulting epistemic framework can assimilate knowledge bases including: (a) empirical-based spatial trends of PM concentration based on landuse regression, (b) the spatio-temporal dependence among PM observation information, and (c) site-specific PM observations. The proposed approach performs the spatiotemporal estimation of PM2.5 levels in the Taipei area (Taiwan) from 2005-2007.

  9. Composite marginal quantile regression analysis for longitudinal adolescent body mass index data.

    PubMed

    Yang, Chi-Chuan; Chen, Yi-Hau; Chang, Hsing-Yi

    2017-09-20

    Childhood and adolescenthood overweight or obesity, which may be quantified through the body mass index (BMI), is strongly associated with adult obesity and other health problems. Motivated by the child and adolescent behaviors in long-term evolution (CABLE) study, we are interested in individual, family, and school factors associated with marginal quantiles of longitudinal adolescent BMI values. We propose a new method for composite marginal quantile regression analysis for longitudinal outcome data, which performs marginal quantile regressions at multiple quantile levels simultaneously. The proposed method extends the quantile regression coefficient modeling method introduced by Frumento and Bottai (Biometrics 2016; 72:74-84) to longitudinal data accounting suitably for the correlation structure in longitudinal observations. A goodness-of-fit test for the proposed modeling is also developed. Simulation results show that the proposed method can be much more efficient than the analysis without taking correlation into account and the analysis performing separate quantile regressions at different quantile levels. The application to the longitudinal adolescent BMI data from the CABLE study demonstrates the practical utility of our proposal. Copyright © 2017 John Wiley & Sons, Ltd. Copyright © 2017 John Wiley & Sons, Ltd.

  10. Selection of vegetation indices for mapping the sugarcane condition around the oil and gas field of North West Java Basin, Indonesia

    NASA Astrophysics Data System (ADS)

    Muji Susantoro, Tri; Wikantika, Ketut; Saepuloh, Asep; Handoyo Harsolumakso, Agus

    2018-05-01

    Selection of vegetation indices in plant mapping is needed to provide the best information of plant conditions. The methods used in this research are the standard deviation and the linear regression. This research tried to determine the vegetation indices used for mapping the sugarcane conditions around oil and gas fields. The data used in this study is Landsat 8 OLI/TIRS. The standard deviation analysis on the 23 vegetation indices with 27 samples has resulted in the six highest standard deviations of vegetation indices, termed as GRVI, SR, NLI, SIPI, GEMI and LAI. The standard deviation values are 0.47; 0.43; 0.30; 0.17; 0.16 and 0.13. Regression correlation analysis on the 23 vegetation indices with 280 samples has resulted in the six vegetation indices, termed as NDVI, ENDVI, GDVI, VARI, LAI and SIPI. This was performed based on regression correlation with the lowest value R2 than 0,8. The combined analysis of the standard deviation and the regression correlation has obtained the five vegetation indices, termed as NDVI, ENDVI, GDVI, LAI and SIPI. The results of the analysis of both methods show that a combination of two methods needs to be done to produce a good analysis of sugarcane conditions. It has been clarified through field surveys and showed good results for the prediction of microseepages.

  11. Estimating the magnitude of peak flows at selected recurrence intervals for streams in Idaho

    USGS Publications Warehouse

    Berenbrock, Charles

    2002-01-01

    The region-of-influence method is not recommended for use in determining flood-frequency estimates for ungaged sites in Idaho because the results, overall, are less accurate and the calculations are more complex than those of regional regression equations. The regional regression equations were considered to be the primary method of estimating the magnitude and frequency of peak flows for ungaged sites in Idaho.

  12. Latin hypercube approach to estimate uncertainty in ground water vulnerability

    USGS Publications Warehouse

    Gurdak, J.J.; McCray, J.E.; Thyne, G.; Qi, S.L.

    2007-01-01

    A methodology is proposed to quantify prediction uncertainty associated with ground water vulnerability models that were developed through an approach that coupled multivariate logistic regression with a geographic information system (GIS). This method uses Latin hypercube sampling (LHS) to illustrate the propagation of input error and estimate uncertainty associated with the logistic regression predictions of ground water vulnerability. Central to the proposed method is the assumption that prediction uncertainty in ground water vulnerability models is a function of input error propagation from uncertainty in the estimated logistic regression model coefficients (model error) and the values of explanatory variables represented in the GIS (data error). Input probability distributions that represent both model and data error sources of uncertainty were simultaneously sampled using a Latin hypercube approach with logistic regression calculations of probability of elevated nonpoint source contaminants in ground water. The resulting probability distribution represents the prediction intervals and associated uncertainty of the ground water vulnerability predictions. The method is illustrated through a ground water vulnerability assessment of the High Plains regional aquifer. Results of the LHS simulations reveal significant prediction uncertainties that vary spatially across the regional aquifer. Additionally, the proposed method enables a spatial deconstruction of the prediction uncertainty that can lead to improved prediction of ground water vulnerability. ?? 2007 National Ground Water Association.

  13. Meta-analytical synthesis of regression coefficients under different categorization scheme of continuous covariates.

    PubMed

    Yoneoka, Daisuke; Henmi, Masayuki

    2017-11-30

    Recently, the number of clinical prediction models sharing the same regression task has increased in the medical literature. However, evidence synthesis methodologies that use the results of these regression models have not been sufficiently studied, particularly in meta-analysis settings where only regression coefficients are available. One of the difficulties lies in the differences between the categorization schemes of continuous covariates across different studies. In general, categorization methods using cutoff values are study specific across available models, even if they focus on the same covariates of interest. Differences in the categorization of covariates could lead to serious bias in the estimated regression coefficients and thus in subsequent syntheses. To tackle this issue, we developed synthesis methods for linear regression models with different categorization schemes of covariates. A 2-step approach to aggregate the regression coefficient estimates is proposed. The first step is to estimate the joint distribution of covariates by introducing a latent sampling distribution, which uses one set of individual participant data to estimate the marginal distribution of covariates with categorization. The second step is to use a nonlinear mixed-effects model with correction terms for the bias due to categorization to estimate the overall regression coefficients. Especially in terms of precision, numerical simulations show that our approach outperforms conventional methods, which only use studies with common covariates or ignore the differences between categorization schemes. The method developed in this study is also applied to a series of WHO epidemiologic studies on white blood cell counts. Copyright © 2017 John Wiley & Sons, Ltd.

  14. Application of principal component regression and partial least squares regression in ultraviolet spectrum water quality detection

    NASA Astrophysics Data System (ADS)

    Li, Jiangtong; Luo, Yongdao; Dai, Honglin

    2018-01-01

    Water is the source of life and the essential foundation of all life. With the development of industrialization, the phenomenon of water pollution is becoming more and more frequent, which directly affects the survival and development of human. Water quality detection is one of the necessary measures to protect water resources. Ultraviolet (UV) spectral analysis is an important research method in the field of water quality detection, which partial least squares regression (PLSR) analysis method is becoming predominant technology, however, in some special cases, PLSR's analysis produce considerable errors. In order to solve this problem, the traditional principal component regression (PCR) analysis method was improved by using the principle of PLSR in this paper. The experimental results show that for some special experimental data set, improved PCR analysis method performance is better than PLSR. The PCR and PLSR is the focus of this paper. Firstly, the principal component analysis (PCA) is performed by MATLAB to reduce the dimensionality of the spectral data; on the basis of a large number of experiments, the optimized principal component is extracted by using the principle of PLSR, which carries most of the original data information. Secondly, the linear regression analysis of the principal component is carried out with statistic package for social science (SPSS), which the coefficients and relations of principal components can be obtained. Finally, calculating a same water spectral data set by PLSR and improved PCR, analyzing and comparing two results, improved PCR and PLSR is similar for most data, but improved PCR is better than PLSR for data near the detection limit. Both PLSR and improved PCR can be used in Ultraviolet spectral analysis of water, but for data near the detection limit, improved PCR's result better than PLSR.

  15. Mental chronometry with simple linear regression.

    PubMed

    Chen, J Y

    1997-10-01

    Typically, mental chronometry is performed by means of introducing an independent variable postulated to affect selectively some stage of a presumed multistage process. However, the effect could be a global one that spreads proportionally over all stages of the process. Currently, there is no method to test this possibility although simple linear regression might serve the purpose. In the present study, the regression approach was tested with tasks (memory scanning and mental rotation) that involved a selective effect and with a task (word superiority effect) that involved a global effect, by the dominant theories. The results indicate (1) the manipulation of the size of a memory set or of angular disparity affects the intercept of the regression function that relates the times for memory scanning with different set sizes or for mental rotation with different angular disparities and (2) the manipulation of context affects the slope of the regression function that relates the times for detecting a target character under word and nonword conditions. These ratify the regression approach as a useful method for doing mental chronometry.

  16. A New Hybrid-Multiscale SSA Prediction of Non-Stationary Time Series

    NASA Astrophysics Data System (ADS)

    Ghanbarzadeh, Mitra; Aminghafari, Mina

    2016-02-01

    Singular spectral analysis (SSA) is a non-parametric method used in the prediction of non-stationary time series. It has two parameters, which are difficult to determine and very sensitive to their values. Since, SSA is a deterministic-based method, it does not give good results when the time series is contaminated with a high noise level and correlated noise. Therefore, we introduce a novel method to handle these problems. It is based on the prediction of non-decimated wavelet (NDW) signals by SSA and then, prediction of residuals by wavelet regression. The advantages of our method are the automatic determination of parameters and taking account of the stochastic structure of time series. As shown through the simulated and real data, we obtain better results than SSA, a non-parametric wavelet regression method and Holt-Winters method.

  17. Three methods to construct predictive models using logistic regression and likelihood ratios to facilitate adjustment for pretest probability give similar results.

    PubMed

    Chan, Siew Foong; Deeks, Jonathan J; Macaskill, Petra; Irwig, Les

    2008-01-01

    To compare three predictive models based on logistic regression to estimate adjusted likelihood ratios allowing for interdependency between diagnostic variables (tests). This study was a review of the theoretical basis, assumptions, and limitations of published models; and a statistical extension of methods and application to a case study of the diagnosis of obstructive airways disease based on history and clinical examination. Albert's method includes an offset term to estimate an adjusted likelihood ratio for combinations of tests. Spiegelhalter and Knill-Jones method uses the unadjusted likelihood ratio for each test as a predictor and computes shrinkage factors to allow for interdependence. Knottnerus' method differs from the other methods because it requires sequencing of tests, which limits its application to situations where there are few tests and substantial data. Although parameter estimates differed between the models, predicted "posttest" probabilities were generally similar. Construction of predictive models using logistic regression is preferred to the independence Bayes' approach when it is important to adjust for dependency of tests errors. Methods to estimate adjusted likelihood ratios from predictive models should be considered in preference to a standard logistic regression model to facilitate ease of interpretation and application. Albert's method provides the most straightforward approach.

  18. Propensity score estimation: machine learning and classification methods as alternatives to logistic regression

    PubMed Central

    Westreich, Daniel; Lessler, Justin; Funk, Michele Jonsson

    2010-01-01

    Summary Objective Propensity scores for the analysis of observational data are typically estimated using logistic regression. Our objective in this Review was to assess machine learning alternatives to logistic regression which may accomplish the same goals but with fewer assumptions or greater accuracy. Study Design and Setting We identified alternative methods for propensity score estimation and/or classification from the public health, biostatistics, discrete mathematics, and computer science literature, and evaluated these algorithms for applicability to the problem of propensity score estimation, potential advantages over logistic regression, and ease of use. Results We identified four techniques as alternatives to logistic regression: neural networks, support vector machines, decision trees (CART), and meta-classifiers (in particular, boosting). Conclusion While the assumptions of logistic regression are well understood, those assumptions are frequently ignored. All four alternatives have advantages and disadvantages compared with logistic regression. Boosting (meta-classifiers) and to a lesser extent decision trees (particularly CART) appear to be most promising for use in the context of propensity score analysis, but extensive simulation studies are needed to establish their utility in practice. PMID:20630332

  19. Onset patterns in autism: Variation across informants, methods, and timing.

    PubMed

    Ozonoff, Sally; Gangi, Devon; Hanzel, Elise P; Hill, Alesha; Hill, Monique M; Miller, Meghan; Schwichtenberg, A J; Steinfeld, Mary Beth; Parikh, Chandni; Iosif, Ana-Maria

    2018-05-01

    While previous studies suggested that regressive forms of onset were not common in autism spectrum disorder (ASD), more recent investigations suggest that the rates are quite high and may be under-reported using certain methods. The current study undertook a systematic investigation of how rates of regression differed by measurement method. Infants with (n = 147) and without a family history of ASD (n = 83) were seen prospectively for up to 7 visits in the first three years of life. Reports of symptom onset were collected using four measures that systematically varied the informant (examiner vs. parent), the decision type (categorical [regression absent or present] vs. dimensional [frequency of social behaviors]), and the timing of the assessment (retrospective vs. prospective). Latent class growth models were used to classify individual trajectories to see whether regressive onset patterns were infrequent or widespread within the ASD group. A majority of the sample was classified as having a regressive onset using either examiner (88%) or parent (69%) prospective dimensional ratings. Rates of regression were much lower using retrospective or categorical measures (from 29 to 47%). Agreement among different measurement methods was low. Declining trajectories of development, consistent with a regressive onset pattern, are common in children with ASD and may be more the rule than the exception. The accuracy of widely used methods of measuring onset is questionable and the present findings argue against their widespread use. Autism Res 2018, 11: 788-797. © 2018 International Society for Autism Research, Wiley Periodicals, Inc. This study examines different ways of measuring the onset of symptoms in autism spectrum disorder (ASD). The present findings suggest that declining developmental skills, consistent with a regressive onset pattern, are common in children with ASD and may be more the rule than the exception. The results question the accuracy of widely used methods of measuring symptom onset and argue against their widespread use. © 2018 International Society for Autism Research, Wiley Periodicals, Inc.

  20. Cox regression analysis with missing covariates via nonparametric multiple imputation.

    PubMed

    Hsu, Chiu-Hsieh; Yu, Mandi

    2018-01-01

    We consider the situation of estimating Cox regression in which some covariates are subject to missing, and there exists additional information (including observed event time, censoring indicator and fully observed covariates) which may be predictive of the missing covariates. We propose to use two working regression models: one for predicting the missing covariates and the other for predicting the missing probabilities. For each missing covariate observation, these two working models are used to define a nearest neighbor imputing set. This set is then used to non-parametrically impute covariate values for the missing observation. Upon the completion of imputation, Cox regression is performed on the multiply imputed datasets to estimate the regression coefficients. In a simulation study, we compare the nonparametric multiple imputation approach with the augmented inverse probability weighted (AIPW) method, which directly incorporates the two working models into estimation of Cox regression, and the predictive mean matching imputation (PMM) method. We show that all approaches can reduce bias due to non-ignorable missing mechanism. The proposed nonparametric imputation method is robust to mis-specification of either one of the two working models and robust to mis-specification of the link function of the two working models. In contrast, the PMM method is sensitive to misspecification of the covariates included in imputation. The AIPW method is sensitive to the selection probability. We apply the approaches to a breast cancer dataset from Surveillance, Epidemiology and End Results (SEER) Program.

  1. Applications of Some Artificial Intelligence Methods to Satellite Soundings

    NASA Technical Reports Server (NTRS)

    Munteanu, M. J.; Jakubowicz, O.

    1985-01-01

    Hard clustering of temperature profiles and regression temperature retrievals were used to refine the method using the probabilities of membership of each pattern vector in each of the clusters derived with discriminant analysis. In hard clustering the maximum probability is taken and the corresponding cluster as the correct cluster are considered discarding the rest of the probabilities. In fuzzy partitioned clustering these probabilities are kept and the final regression retrieval is a weighted regression retrieval of several clusters. This method was used in the clustering of brightness temperatures where the purpose was to predict tropopause height. A further refinement is the division of temperature profiles into three major regions for classification purposes. The results are summarized in the tables total r.m.s. errors are displayed. An approach based on fuzzy logic which is intimately related to artificial intelligence methods is recommended.

  2. Application of least median of squared orthogonal distance (LMD) and LMD-based reweighted least squares (RLS) methods on the stock-recruitment relationship

    NASA Astrophysics Data System (ADS)

    Wang, Yan-Jun; Liu, Qun

    1999-03-01

    Analysis of stock-recruitment (SR) data is most often done by fitting various SR relationship curves to the data. Fish population dynamics data often have stochastic variations and measurement errors, which usually result in a biased regression analysis. This paper presents a robust regression method, least median of squared orthogonal distance (LMD), which is insensitive to abnormal values in the dependent and independent variables in a regression analysis. Outliers that have significantly different variance from the rest of the data can be identified in a residual analysis. Then, the least squares (LS) method is applied to the SR data with defined outliers being down weighted. The application of LMD and LMD-based Reweighted Least Squares (RLS) method to simulated and real fisheries SR data is explored.

  3. Tools to Support Interpreting Multiple Regression in the Face of Multicollinearity

    PubMed Central

    Kraha, Amanda; Turner, Heather; Nimon, Kim; Zientek, Linda Reichwein; Henson, Robin K.

    2012-01-01

    While multicollinearity may increase the difficulty of interpreting multiple regression (MR) results, it should not cause undue problems for the knowledgeable researcher. In the current paper, we argue that rather than using one technique to investigate regression results, researchers should consider multiple indices to understand the contributions that predictors make not only to a regression model, but to each other as well. Some of the techniques to interpret MR effects include, but are not limited to, correlation coefficients, beta weights, structure coefficients, all possible subsets regression, commonality coefficients, dominance weights, and relative importance weights. This article will review a set of techniques to interpret MR effects, identify the elements of the data on which the methods focus, and identify statistical software to support such analyses. PMID:22457655

  4. Tools to support interpreting multiple regression in the face of multicollinearity.

    PubMed

    Kraha, Amanda; Turner, Heather; Nimon, Kim; Zientek, Linda Reichwein; Henson, Robin K

    2012-01-01

    While multicollinearity may increase the difficulty of interpreting multiple regression (MR) results, it should not cause undue problems for the knowledgeable researcher. In the current paper, we argue that rather than using one technique to investigate regression results, researchers should consider multiple indices to understand the contributions that predictors make not only to a regression model, but to each other as well. Some of the techniques to interpret MR effects include, but are not limited to, correlation coefficients, beta weights, structure coefficients, all possible subsets regression, commonality coefficients, dominance weights, and relative importance weights. This article will review a set of techniques to interpret MR effects, identify the elements of the data on which the methods focus, and identify statistical software to support such analyses.

  5. Detection of Cutting Tool Wear using Statistical Analysis and Regression Model

    NASA Astrophysics Data System (ADS)

    Ghani, Jaharah A.; Rizal, Muhammad; Nuawi, Mohd Zaki; Haron, Che Hassan Che; Ramli, Rizauddin

    2010-10-01

    This study presents a new method for detecting the cutting tool wear based on the measured cutting force signals. A statistical-based method called Integrated Kurtosis-based Algorithm for Z-Filter technique, called I-kaz was used for developing a regression model and 3D graphic presentation of I-kaz 3D coefficient during machining process. The machining tests were carried out using a CNC turning machine Colchester Master Tornado T4 in dry cutting condition. A Kistler 9255B dynamometer was used to measure the cutting force signals, which were transmitted, analyzed, and displayed in the DasyLab software. Various force signals from machining operation were analyzed, and each has its own I-kaz 3D coefficient. This coefficient was examined and its relationship with flank wear lands (VB) was determined. A regression model was developed due to this relationship, and results of the regression model shows that the I-kaz 3D coefficient value decreases as tool wear increases. The result then is used for real time tool wear monitoring.

  6. A Simulation-Based Comparison of Several Stochastic Linear Regression Methods in the Presence of Outliers.

    ERIC Educational Resources Information Center

    Rule, David L.

    Several regression methods were examined within the framework of weighted structural regression (WSR), comparing their regression weight stability and score estimation accuracy in the presence of outlier contamination. The methods compared are: (1) ordinary least squares; (2) WSR ridge regression; (3) minimum risk regression; (4) minimum risk 2;…

  7. Bringing a Perspective from Outside the Field: A Commentary on Davis et al.'s (2010) Use of a Modified Regression Discontinuity Design to Evaluate a Gifted Program

    ERIC Educational Resources Information Center

    Adelson, Jill L.; Kelcey, Benjamin

    2016-01-01

    In this commentary of "Evaluating the Gifted Program of an Urban School District Using a Modified Regression Discontinuity Design" by Davis, Engberg, Epple, Sieg, and Zimmer, we examine the background of the study, critique the methods used, and discuss the results and implications. The study used a fuzzy regression discontinuity design…

  8. Threshold regression to accommodate a censored covariate.

    PubMed

    Qian, Jing; Chiou, Sy Han; Maye, Jacqueline E; Atem, Folefac; Johnson, Keith A; Betensky, Rebecca A

    2018-06-22

    In several common study designs, regression modeling is complicated by the presence of censored covariates. Examples of such covariates include maternal age of onset of dementia that may be right censored in an Alzheimer's amyloid imaging study of healthy subjects, metabolite measurements that are subject to limit of detection censoring in a case-control study of cardiovascular disease, and progressive biomarkers whose baseline values are of interest, but are measured post-baseline in longitudinal neuropsychological studies of Alzheimer's disease. We propose threshold regression approaches for linear regression models with a covariate that is subject to random censoring. Threshold regression methods allow for immediate testing of the significance of the effect of a censored covariate. In addition, they provide for unbiased estimation of the regression coefficient of the censored covariate. We derive the asymptotic properties of the resulting estimators under mild regularity conditions. Simulations demonstrate that the proposed estimators have good finite-sample performance, and often offer improved efficiency over existing methods. We also derive a principled method for selection of the threshold. We illustrate the approach in application to an Alzheimer's disease study that investigated brain amyloid levels in older individuals, as measured through positron emission tomography scans, as a function of maternal age of dementia onset, with adjustment for other covariates. We have developed an R package, censCov, for implementation of our method, available at CRAN. © 2018, The International Biometric Society.

  9. Application of Semiparametric Spline Regression Model in Analyzing Factors that In uence Population Density in Central Java

    NASA Astrophysics Data System (ADS)

    Sumantari, Y. D.; Slamet, I.; Sugiyanto

    2017-06-01

    Semiparametric regression is a statistical analysis method that consists of parametric and nonparametric regression. There are various approach techniques in nonparametric regression. One of the approach techniques is spline. Central Java is one of the most densely populated province in Indonesia. Population density in this province can be modeled by semiparametric regression because it consists of parametric and nonparametric component. Therefore, the purpose of this paper is to determine the factors that in uence population density in Central Java using the semiparametric spline regression model. The result shows that the factors which in uence population density in Central Java is Family Planning (FP) active participants and district minimum wage.

  10. Simulating land-use changes by incorporating spatial autocorrelation and self-organization in CLUE-S modeling: a case study in Zengcheng District, Guangzhou, China

    NASA Astrophysics Data System (ADS)

    Mei, Zhixiong; Wu, Hao; Li, Shiyun

    2018-06-01

    The Conversion of Land Use and its Effects at Small regional extent (CLUE-S), which is a widely used model for land-use simulation, utilizes logistic regression to estimate the relationships between land use and its drivers, and thus, predict land-use change probabilities. However, logistic regression disregards possible spatial autocorrelation and self-organization in land-use data. Autologistic regression can depict spatial autocorrelation but cannot address self-organization, while logistic regression by considering only self-organization (NElogistic regression) fails to capture spatial autocorrelation. Therefore, this study developed a regression (NE-autologistic regression) method, which incorporated both spatial autocorrelation and self-organization, to improve CLUE-S. The Zengcheng District of Guangzhou, China was selected as the study area. The land-use data of 2001, 2005, and 2009, as well as 10 typical driving factors, were used to validate the proposed regression method and the improved CLUE-S model. Then, three future land-use scenarios in 2020: the natural growth scenario, ecological protection scenario, and economic development scenario, were simulated using the improved model. Validation results showed that NE-autologistic regression performed better than logistic regression, autologistic regression, and NE-logistic regression in predicting land-use change probabilities. The spatial allocation accuracy and kappa values of NE-autologistic-CLUE-S were higher than those of logistic-CLUE-S, autologistic-CLUE-S, and NE-logistic-CLUE-S for the simulations of two periods, 2001-2009 and 2005-2009, which proved that the improved CLUE-S model achieved the best simulation and was thereby effective to a certain extent. The scenario simulation results indicated that under all three scenarios, traffic land and residential/industrial land would increase, whereas arable land and unused land would decrease during 2009-2020. Apparent differences also existed in the simulated change sizes and locations of each land-use type under different scenarios. The results not only demonstrate the validity of the improved model but also provide a valuable reference for relevant policy-makers.

  11. New machine-learning algorithms for prediction of Parkinson's disease

    NASA Astrophysics Data System (ADS)

    Mandal, Indrajit; Sairam, N.

    2014-03-01

    This article presents an enhanced prediction accuracy of diagnosis of Parkinson's disease (PD) to prevent the delay and misdiagnosis of patients using the proposed robust inference system. New machine-learning methods are proposed and performance comparisons are based on specificity, sensitivity, accuracy and other measurable parameters. The robust methods of treating Parkinson's disease (PD) includes sparse multinomial logistic regression, rotation forest ensemble with support vector machines and principal components analysis, artificial neural networks, boosting methods. A new ensemble method comprising of the Bayesian network optimised by Tabu search algorithm as classifier and Haar wavelets as projection filter is used for relevant feature selection and ranking. The highest accuracy obtained by linear logistic regression and sparse multinomial logistic regression is 100% and sensitivity, specificity of 0.983 and 0.996, respectively. All the experiments are conducted over 95% and 99% confidence levels and establish the results with corrected t-tests. This work shows a high degree of advancement in software reliability and quality of the computer-aided diagnosis system and experimentally shows best results with supportive statistical inference.

  12. The extinction law from photometric data: linear regression methods

    NASA Astrophysics Data System (ADS)

    Ascenso, J.; Lombardi, M.; Lada, C. J.; Alves, J.

    2012-04-01

    Context. The properties of dust grains, in particular their size distribution, are expected to differ from the interstellar medium to the high-density regions within molecular clouds. Since the extinction at near-infrared wavelengths is caused by dust, the extinction law in cores should depart from that found in low-density environments if the dust grains have different properties. Aims: We explore methods to measure the near-infrared extinction law produced by dense material in molecular cloud cores from photometric data. Methods: Using controlled sets of synthetic and semi-synthetic data, we test several methods for linear regression applied to the specific problem of deriving the extinction law from photometric data. We cover the parameter space appropriate to this type of observations. Results: We find that many of the common linear-regression methods produce biased results when applied to the extinction law from photometric colors. We propose and validate a new method, LinES, as the most reliable for this effect. We explore the use of this method to detect whether or not the extinction law of a given reddened population has a break at some value of extinction. Based on observations collected at the European Organisation for Astronomical Research in the Southern Hemisphere, Chile (ESO programmes 069.C-0426 and 074.C-0728).

  13. Improving regression-model-based streamwater constituent load estimates derived from serially correlated data

    USGS Publications Warehouse

    Aulenbach, Brent T.

    2013-01-01

    A regression-model based approach is a commonly used, efficient method for estimating streamwater constituent load when there is a relationship between streamwater constituent concentration and continuous variables such as streamwater discharge, season and time. A subsetting experiment using a 30-year dataset of daily suspended sediment observations from the Mississippi River at Thebes, Illinois, was performed to determine optimal sampling frequency, model calibration period length, and regression model methodology, as well as to determine the effect of serial correlation of model residuals on load estimate precision. Two regression-based methods were used to estimate streamwater loads, the Adjusted Maximum Likelihood Estimator (AMLE), and the composite method, a hybrid load estimation approach. While both methods accurately and precisely estimated loads at the model’s calibration period time scale, precisions were progressively worse at shorter reporting periods, from annually to monthly. Serial correlation in model residuals resulted in observed AMLE precision to be significantly worse than the model calculated standard errors of prediction. The composite method effectively improved upon AMLE loads for shorter reporting periods, but required a sampling interval of at least 15-days or shorter, when the serial correlations in the observed load residuals were greater than 0.15. AMLE precision was better at shorter sampling intervals and when using the shortest model calibration periods, such that the regression models better fit the temporal changes in the concentration–discharge relationship. The models with the largest errors typically had poor high flow sampling coverage resulting in unrepresentative models. Increasing sampling frequency and/or targeted high flow sampling are more efficient approaches to ensure sufficient sampling and to avoid poorly performing models, than increasing calibration period length.

  14. A comparison of five methods to predict genomic breeding values of dairy bulls from genome-wide SNP markers

    PubMed Central

    2009-01-01

    Background Genomic selection (GS) uses molecular breeding values (MBV) derived from dense markers across the entire genome for selection of young animals. The accuracy of MBV prediction is important for a successful application of GS. Recently, several methods have been proposed to estimate MBV. Initial simulation studies have shown that these methods can accurately predict MBV. In this study we compared the accuracies and possible bias of five different regression methods in an empirical application in dairy cattle. Methods Genotypes of 7,372 SNP and highly accurate EBV of 1,945 dairy bulls were used to predict MBV for protein percentage (PPT) and a profit index (Australian Selection Index, ASI). Marker effects were estimated by least squares regression (FR-LS), Bayesian regression (Bayes-R), random regression best linear unbiased prediction (RR-BLUP), partial least squares regression (PLSR) and nonparametric support vector regression (SVR) in a training set of 1,239 bulls. Accuracy and bias of MBV prediction were calculated from cross-validation of the training set and tested against a test team of 706 young bulls. Results For both traits, FR-LS using a subset of SNP was significantly less accurate than all other methods which used all SNP. Accuracies obtained by Bayes-R, RR-BLUP, PLSR and SVR were very similar for ASI (0.39-0.45) and for PPT (0.55-0.61). Overall, SVR gave the highest accuracy. All methods resulted in biased MBV predictions for ASI, for PPT only RR-BLUP and SVR predictions were unbiased. A significant decrease in accuracy of prediction of ASI was seen in young test cohorts of bulls compared to the accuracy derived from cross-validation of the training set. This reduction was not apparent for PPT. Combining MBV predictions with pedigree based predictions gave 1.05 - 1.34 times higher accuracies compared to predictions based on pedigree alone. Some methods have largely different computational requirements, with PLSR and RR-BLUP requiring the least computing time. Conclusions The four methods which use information from all SNP namely RR-BLUP, Bayes-R, PLSR and SVR generate similar accuracies of MBV prediction for genomic selection, and their use in the selection of immediate future generations in dairy cattle will be comparable. The use of FR-LS in genomic selection is not recommended. PMID:20043835

  15. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kwon, Deukwoo; Little, Mark P.; Miller, Donald L.

    Purpose: To determine more accurate regression formulas for estimating peak skin dose (PSD) from reference air kerma (RAK) or kerma-area product (KAP). Methods: After grouping of the data from 21 procedures into 13 clinically similar groups, assessments were made of optimal clustering using the Bayesian information criterion to obtain the optimal linear regressions of (log-transformed) PSD vs RAK, PSD vs KAP, and PSD vs RAK and KAP. Results: Three clusters of clinical groups were optimal in regression of PSD vs RAK, seven clusters of clinical groups were optimal in regression of PSD vs KAP, and six clusters of clinical groupsmore » were optimal in regression of PSD vs RAK and KAP. Prediction of PSD using both RAK and KAP is significantly better than prediction of PSD with either RAK or KAP alone. The regression of PSD vs RAK provided better predictions of PSD than the regression of PSD vs KAP. The partial-pooling (clustered) method yields smaller mean squared errors compared with the complete-pooling method.Conclusion: PSD distributions for interventional radiology procedures are log-normal. Estimates of PSD derived from RAK and KAP jointly are most accurate, followed closely by estimates derived from RAK alone. Estimates of PSD derived from KAP alone are the least accurate. Using a stochastic search approach, it is possible to cluster together certain dissimilar types of procedures to minimize the total error sum of squares.« less

  16. Reduced rank regression via adaptive nuclear norm penalization

    PubMed Central

    Chen, Kun; Dong, Hongbo; Chan, Kung-Sik

    2014-01-01

    Summary We propose an adaptive nuclear norm penalization approach for low-rank matrix approximation, and use it to develop a new reduced rank estimation method for high-dimensional multivariate regression. The adaptive nuclear norm is defined as the weighted sum of the singular values of the matrix, and it is generally non-convex under the natural restriction that the weight decreases with the singular value. However, we show that the proposed non-convex penalized regression method has a global optimal solution obtained from an adaptively soft-thresholded singular value decomposition. The method is computationally efficient, and the resulting solution path is continuous. The rank consistency of and prediction/estimation performance bounds for the estimator are established for a high-dimensional asymptotic regime. Simulation studies and an application in genetics demonstrate its efficacy. PMID:25045172

  17. Order Selection for General Expression of Nonlinear Autoregressive Model Based on Multivariate Stepwise Regression

    NASA Astrophysics Data System (ADS)

    Shi, Jinfei; Zhu, Songqing; Chen, Ruwen

    2017-12-01

    An order selection method based on multiple stepwise regressions is proposed for General Expression of Nonlinear Autoregressive model which converts the model order problem into the variable selection of multiple linear regression equation. The partial autocorrelation function is adopted to define the linear term in GNAR model. The result is set as the initial model, and then the nonlinear terms are introduced gradually. Statistics are chosen to study the improvements of both the new introduced and originally existed variables for the model characteristics, which are adopted to determine the model variables to retain or eliminate. So the optimal model is obtained through data fitting effect measurement or significance test. The simulation and classic time-series data experiment results show that the method proposed is simple, reliable and can be applied to practical engineering.

  18. Application of near-infrared spectroscopy in the detection of fat-soluble vitamins in premix feed

    NASA Astrophysics Data System (ADS)

    Jia, Lian Ping; Tian, Shu Li; Zheng, Xue Cong; Jiao, Peng; Jiang, Xun Peng

    2018-02-01

    Vitamin is the organic compound and necessary for animal physiological maintenance. The rapid determination of the content of different vitamins in premix feed can help to achieve accurate diets and efficient feeding. Compared with high-performance liquid chromatography and other wet chemical methods, near-infrared spectroscopy is a fast, non-destructive, non-polluting method. 168 samples of premix feed were collected and the contents of vitamin A, vitamin E and vitamin D3 were detected by the standard method. The near-infrared spectra of samples ranging from 10 000 to 4 000 cm-1 were obtained. Partial least squares regression (PLSR) and support vector machine regression (SVMR) were used to construct the quantitative model. The results showed that the RMSEP of PLSR model of vitamin A, vitamin E and vitamin D3 were 0.43×107 IU/kg, 0.09×105 IU/kg and 0.17×107 IU/kg, respectively. The RMSEP of SVMR model was 0.45×107 IU/kg, 0.11×105 IU/kg and 0.18×107 IU/kg. Compared with nonlinear regression method (SVMR), linear regression method (PLSR) is more suitable for the quantitative analysis of vitamins in premix feed.

  19. Meta-analysis of haplotype-association studies: comparison of methods and empirical evaluation of the literature

    PubMed Central

    2011-01-01

    Background Meta-analysis is a popular methodology in several fields of medical research, including genetic association studies. However, the methods used for meta-analysis of association studies that report haplotypes have not been studied in detail. In this work, methods for performing meta-analysis of haplotype association studies are summarized, compared and presented in a unified framework along with an empirical evaluation of the literature. Results We present multivariate methods that use summary-based data as well as methods that use binary and count data in a generalized linear mixed model framework (logistic regression, multinomial regression and Poisson regression). The methods presented here avoid the inflation of the type I error rate that could be the result of the traditional approach of comparing a haplotype against the remaining ones, whereas, they can be fitted using standard software. Moreover, formal global tests are presented for assessing the statistical significance of the overall association. Although the methods presented here assume that the haplotypes are directly observed, they can be easily extended to allow for such an uncertainty by weighting the haplotypes by their probability. Conclusions An empirical evaluation of the published literature and a comparison against the meta-analyses that use single nucleotide polymorphisms, suggests that the studies reporting meta-analysis of haplotypes contain approximately half of the included studies and produce significant results twice more often. We show that this excess of statistically significant results, stems from the sub-optimal method of analysis used and, in approximately half of the cases, the statistical significance is refuted if the data are properly re-analyzed. Illustrative examples of code are given in Stata and it is anticipated that the methods developed in this work will be widely applied in the meta-analysis of haplotype association studies. PMID:21247440

  20. Two Paradoxes in Linear Regression Analysis.

    PubMed

    Feng, Ge; Peng, Jing; Tu, Dongke; Zheng, Julia Z; Feng, Changyong

    2016-12-25

    Regression is one of the favorite tools in applied statistics. However, misuse and misinterpretation of results from regression analysis are common in biomedical research. In this paper we use statistical theory and simulation studies to clarify some paradoxes around this popular statistical method. In particular, we show that a widely used model selection procedure employed in many publications in top medical journals is wrong. Formal procedures based on solid statistical theory should be used in model selection.

  1. Evaluation of Penalized and Nonpenalized Methods for Disease Prediction with Large-Scale Genetic Data.

    PubMed

    Won, Sungho; Choi, Hosik; Park, Suyeon; Lee, Juyoung; Park, Changyi; Kwon, Sunghoon

    2015-01-01

    Owing to recent improvement of genotyping technology, large-scale genetic data can be utilized to identify disease susceptibility loci and this successful finding has substantially improved our understanding of complex diseases. However, in spite of these successes, most of the genetic effects for many complex diseases were found to be very small, which have been a big hurdle to build disease prediction model. Recently, many statistical methods based on penalized regressions have been proposed to tackle the so-called "large P and small N" problem. Penalized regressions including least absolute selection and shrinkage operator (LASSO) and ridge regression limit the space of parameters, and this constraint enables the estimation of effects for very large number of SNPs. Various extensions have been suggested, and, in this report, we compare their accuracy by applying them to several complex diseases. Our results show that penalized regressions are usually robust and provide better accuracy than the existing methods for at least diseases under consideration.

  2. Hybrid ICA-Regression: Automatic Identification and Removal of Ocular Artifacts from Electroencephalographic Signals.

    PubMed

    Mannan, Malik M Naeem; Jeong, Myung Y; Kamran, Muhammad A

    2016-01-01

    Electroencephalography (EEG) is a portable brain-imaging technique with the advantage of high-temporal resolution that can be used to record electrical activity of the brain. However, it is difficult to analyze EEG signals due to the contamination of ocular artifacts, and which potentially results in misleading conclusions. Also, it is a proven fact that the contamination of ocular artifacts cause to reduce the classification accuracy of a brain-computer interface (BCI). It is therefore very important to remove/reduce these artifacts before the analysis of EEG signals for applications like BCI. In this paper, a hybrid framework that combines independent component analysis (ICA), regression and high-order statistics has been proposed to identify and eliminate artifactual activities from EEG data. We used simulated, experimental and standard EEG signals to evaluate and analyze the effectiveness of the proposed method. Results demonstrate that the proposed method can effectively remove ocular artifacts as well as it can preserve the neuronal signals present in EEG data. A comparison with four methods from literature namely ICA, regression analysis, wavelet-ICA (wICA), and regression-ICA (REGICA) confirms the significantly enhanced performance and effectiveness of the proposed method for removal of ocular activities from EEG, in terms of lower mean square error and mean absolute error values and higher mutual information between reconstructed and original EEG.

  3. Hybrid ICA—Regression: Automatic Identification and Removal of Ocular Artifacts from Electroencephalographic Signals

    PubMed Central

    Mannan, Malik M. Naeem; Jeong, Myung Y.; Kamran, Muhammad A.

    2016-01-01

    Electroencephalography (EEG) is a portable brain-imaging technique with the advantage of high-temporal resolution that can be used to record electrical activity of the brain. However, it is difficult to analyze EEG signals due to the contamination of ocular artifacts, and which potentially results in misleading conclusions. Also, it is a proven fact that the contamination of ocular artifacts cause to reduce the classification accuracy of a brain-computer interface (BCI). It is therefore very important to remove/reduce these artifacts before the analysis of EEG signals for applications like BCI. In this paper, a hybrid framework that combines independent component analysis (ICA), regression and high-order statistics has been proposed to identify and eliminate artifactual activities from EEG data. We used simulated, experimental and standard EEG signals to evaluate and analyze the effectiveness of the proposed method. Results demonstrate that the proposed method can effectively remove ocular artifacts as well as it can preserve the neuronal signals present in EEG data. A comparison with four methods from literature namely ICA, regression analysis, wavelet-ICA (wICA), and regression-ICA (REGICA) confirms the significantly enhanced performance and effectiveness of the proposed method for removal of ocular activities from EEG, in terms of lower mean square error and mean absolute error values and higher mutual information between reconstructed and original EEG. PMID:27199714

  4. Comparison of Adaline and Multiple Linear Regression Methods for Rainfall Forecasting

    NASA Astrophysics Data System (ADS)

    Sutawinaya, IP; Astawa, INGA; Hariyanti, NKD

    2018-01-01

    Heavy rainfall can cause disaster, therefore need a forecast to predict rainfall intensity. Main factor that cause flooding is there is a high rainfall intensity and it makes the river become overcapacity. This will cause flooding around the area. Rainfall factor is a dynamic factor, so rainfall is very interesting to be studied. In order to support the rainfall forecasting, there are methods that can be used from Artificial Intelligence (AI) to statistic. In this research, we used Adaline for AI method and Regression for statistic method. The more accurate forecast result shows the method that used is good for forecasting the rainfall. Through those methods, we expected which is the best method for rainfall forecasting here.

  5. The comparison between several robust ridge regression estimators in the presence of multicollinearity and multiple outliers

    NASA Astrophysics Data System (ADS)

    Zahari, Siti Meriam; Ramli, Norazan Mohamed; Moktar, Balkiah; Zainol, Mohammad Said

    2014-09-01

    In the presence of multicollinearity and multiple outliers, statistical inference of linear regression model using ordinary least squares (OLS) estimators would be severely affected and produces misleading results. To overcome this, many approaches have been investigated. These include robust methods which were reported to be less sensitive to the presence of outliers. In addition, ridge regression technique was employed to tackle multicollinearity problem. In order to mitigate both problems, a combination of ridge regression and robust methods was discussed in this study. The superiority of this approach was examined when simultaneous presence of multicollinearity and multiple outliers occurred in multiple linear regression. This study aimed to look at the performance of several well-known robust estimators; M, MM, RIDGE and robust ridge regression estimators, namely Weighted Ridge M-estimator (WRM), Weighted Ridge MM (WRMM), Ridge MM (RMM), in such a situation. Results of the study showed that in the presence of simultaneous multicollinearity and multiple outliers (in both x and y-direction), the RMM and RIDGE are more or less similar in terms of superiority over the other estimators, regardless of the number of observation, level of collinearity and percentage of outliers used. However, when outliers occurred in only single direction (y-direction), the WRMM estimator is the most superior among the robust ridge regression estimators, by producing the least variance. In conclusion, the robust ridge regression is the best alternative as compared to robust and conventional least squares estimators when dealing with simultaneous presence of multicollinearity and outliers.

  6. Local polynomial estimation of heteroscedasticity in a multivariate linear regression model and its applications in economics.

    PubMed

    Su, Liyun; Zhao, Yanyong; Yan, Tianshun; Li, Fenglan

    2012-01-01

    Multivariate local polynomial fitting is applied to the multivariate linear heteroscedastic regression model. Firstly, the local polynomial fitting is applied to estimate heteroscedastic function, then the coefficients of regression model are obtained by using generalized least squares method. One noteworthy feature of our approach is that we avoid the testing for heteroscedasticity by improving the traditional two-stage method. Due to non-parametric technique of local polynomial estimation, it is unnecessary to know the form of heteroscedastic function. Therefore, we can improve the estimation precision, when the heteroscedastic function is unknown. Furthermore, we verify that the regression coefficients is asymptotic normal based on numerical simulations and normal Q-Q plots of residuals. Finally, the simulation results and the local polynomial estimation of real data indicate that our approach is surely effective in finite-sample situations.

  7. Influence diagnostics in meta-regression model.

    PubMed

    Shi, Lei; Zuo, ShanShan; Yu, Dalei; Zhou, Xiaohua

    2017-09-01

    This paper studies the influence diagnostics in meta-regression model including case deletion diagnostic and local influence analysis. We derive the subset deletion formulae for the estimation of regression coefficient and heterogeneity variance and obtain the corresponding influence measures. The DerSimonian and Laird estimation and maximum likelihood estimation methods in meta-regression are considered, respectively, to derive the results. Internal and external residual and leverage measure are defined. The local influence analysis based on case-weights perturbation scheme, responses perturbation scheme, covariate perturbation scheme, and within-variance perturbation scheme are explored. We introduce a method by simultaneous perturbing responses, covariate, and within-variance to obtain the local influence measure, which has an advantage of capable to compare the influence magnitude of influential studies from different perturbations. An example is used to illustrate the proposed methodology. Copyright © 2017 John Wiley & Sons, Ltd.

  8. Role of regression analysis and variation of rheological data in calculation of pressure drop for sludge pipelines.

    PubMed

    Farno, E; Coventry, K; Slatter, P; Eshtiaghi, N

    2018-06-15

    Sludge pumps in wastewater treatment plants are often oversized due to uncertainty in calculation of pressure drop. This issue costs millions of dollars for industry to purchase and operate the oversized pumps. Besides costs, higher electricity consumption is associated with extra CO 2 emission which creates huge environmental impacts. Calculation of pressure drop via current pipe flow theory requires model estimation of flow curve data which depends on regression analysis and also varies with natural variation of rheological data. This study investigates impact of variation of rheological data and regression analysis on variation of pressure drop calculated via current pipe flow theories. Results compare the variation of calculated pressure drop between different models and regression methods and suggest on the suitability of each method. Copyright © 2018 Elsevier Ltd. All rights reserved.

  9. TEMPERATURE SCENARIO DEVELOPMENT USING REGRESSION METHODS

    EPA Science Inventory

    A method of developing scenarios of future temperature conditions resulting from climatic change is presented. he method is straightforward and can be used to provide information about daily temperature variations and diurnal ranges, monthly average high, and low temperatures, an...

  10. Statistical downscaling modeling with quantile regression using lasso to estimate extreme rainfall

    NASA Astrophysics Data System (ADS)

    Santri, Dewi; Wigena, Aji Hamim; Djuraidah, Anik

    2016-02-01

    Rainfall is one of the climatic elements with high diversity and has many negative impacts especially extreme rainfall. Therefore, there are several methods that required to minimize the damage that may occur. So far, Global circulation models (GCM) are the best method to forecast global climate changes include extreme rainfall. Statistical downscaling (SD) is a technique to develop the relationship between GCM output as a global-scale independent variables and rainfall as a local- scale response variable. Using GCM method will have many difficulties when assessed against observations because GCM has high dimension and multicollinearity between the variables. The common method that used to handle this problem is principal components analysis (PCA) and partial least squares regression. The new method that can be used is lasso. Lasso has advantages in simultaneuosly controlling the variance of the fitted coefficients and performing automatic variable selection. Quantile regression is a method that can be used to detect extreme rainfall in dry and wet extreme. Objective of this study is modeling SD using quantile regression with lasso to predict extreme rainfall in Indramayu. The results showed that the estimation of extreme rainfall (extreme wet in January, February and December) in Indramayu could be predicted properly by the model at quantile 90th.

  11. Bivariate least squares linear regression: Towards a unified analytic formalism. I. Functional models

    NASA Astrophysics Data System (ADS)

    Caimmi, R.

    2011-08-01

    Concerning bivariate least squares linear regression, the classical approach pursued for functional models in earlier attempts ( York, 1966, 1969) is reviewed using a new formalism in terms of deviation (matrix) traces which, for unweighted data, reduce to usual quantities leaving aside an unessential (but dimensional) multiplicative factor. Within the framework of classical error models, the dependent variable relates to the independent variable according to the usual additive model. The classes of linear models considered are regression lines in the general case of correlated errors in X and in Y for weighted data, and in the opposite limiting situations of (i) uncorrelated errors in X and in Y, and (ii) completely correlated errors in X and in Y. The special case of (C) generalized orthogonal regression is considered in detail together with well known subcases, namely: (Y) errors in X negligible (ideally null) with respect to errors in Y; (X) errors in Y negligible (ideally null) with respect to errors in X; (O) genuine orthogonal regression; (R) reduced major-axis regression. In the limit of unweighted data, the results determined for functional models are compared with their counterparts related to extreme structural models i.e. the instrumental scatter is negligible (ideally null) with respect to the intrinsic scatter ( Isobe et al., 1990; Feigelson and Babu, 1992). While regression line slope and intercept estimators for functional and structural models necessarily coincide, the contrary holds for related variance estimators even if the residuals obey a Gaussian distribution, with the exception of Y models. An example of astronomical application is considered, concerning the [O/H]-[Fe/H] empirical relations deduced from five samples related to different stars and/or different methods of oxygen abundance determination. For selected samples and assigned methods, different regression models yield consistent results within the errors (∓ σ) for both heteroscedastic and homoscedastic data. Conversely, samples related to different methods produce discrepant results, due to the presence of (still undetected) systematic errors, which implies no definitive statement can be made at present. A comparison is also made between different expressions of regression line slope and intercept variance estimators, where fractional discrepancies are found to be not exceeding a few percent, which grows up to about 20% in the presence of large dispersion data. An extension of the formalism to structural models is left to a forthcoming paper.

  12. Alternative Regression Equations for Estimation of Annual Peak-Streamflow Frequency for Undeveloped Watersheds in Texas using PRESS Minimization

    USGS Publications Warehouse

    Asquith, William H.; Thompson, David B.

    2008-01-01

    The U.S. Geological Survey, in cooperation with the Texas Department of Transportation and in partnership with Texas Tech University, investigated a refinement of the regional regression method and developed alternative equations for estimation of peak-streamflow frequency for undeveloped watersheds in Texas. A common model for estimation of peak-streamflow frequency is based on the regional regression method. The current (2008) regional regression equations for 11 regions of Texas are based on log10 transformations of all regression variables (drainage area, main-channel slope, and watershed shape). Exclusive use of log10-transformation does not fully linearize the relations between the variables. As a result, some systematic bias remains in the current equations. The bias results in overestimation of peak streamflow for both the smallest and largest watersheds. The bias increases with increasing recurrence interval. The primary source of the bias is the discernible curvilinear relation in log10 space between peak streamflow and drainage area. Bias is demonstrated by selected residual plots with superimposed LOWESS trend lines. To address the bias, a statistical framework based on minimization of the PRESS statistic through power transformation of drainage area is described and implemented, and the resulting regression equations are reported. Compared to log10-exclusive equations, the equations derived from PRESS minimization have PRESS statistics and residual standard errors less than the log10 exclusive equations. Selected residual plots for the PRESS-minimized equations are presented to demonstrate that systematic bias in regional regression equations for peak-streamflow frequency estimation in Texas can be reduced. Because the overall error is similar to the error associated with previous equations and because the bias is reduced, the PRESS-minimized equations reported here provide alternative equations for peak-streamflow frequency estimation.

  13. An interactive website for analytical method comparison and bias estimation.

    PubMed

    Bahar, Burak; Tuncel, Ayse F; Holmes, Earle W; Holmes, Daniel T

    2017-12-01

    Regulatory standards mandate laboratories to perform studies to ensure accuracy and reliability of their test results. Method comparison and bias estimation are important components of these studies. We developed an interactive website for evaluating the relative performance of two analytical methods using R programming language tools. The website can be accessed at https://bahar.shinyapps.io/method_compare/. The site has an easy-to-use interface that allows both copy-pasting and manual entry of data. It also allows selection of a regression model and creation of regression and difference plots. Available regression models include Ordinary Least Squares, Weighted-Ordinary Least Squares, Deming, Weighted-Deming, Passing-Bablok and Passing-Bablok for large datasets. The server processes the data and generates downloadable reports in PDF or HTML format. Our website provides clinical laboratories a practical way to assess the relative performance of two analytical methods. Copyright © 2017 The Canadian Society of Clinical Chemists. Published by Elsevier Inc. All rights reserved.

  14. Estimation of Monthly Near Surface Air Temperature Using Geographically Weighted Regression in China

    NASA Astrophysics Data System (ADS)

    Wang, M. M.; He, G. J.; Zhang, Z. M.; Zhang, Z. J.; Liu, X. G.

    2018-04-01

    Near surface air temperature (NSAT) is a primary descriptor of terrestrial environment conditions. The availability of NSAT with high spatial resolution is deemed necessary for several applications such as hydrology, meteorology and ecology. In this study, a regression-based NSAT mapping method is proposed. This method is combined remote sensing variables with geographical variables, and uses geographically weighted regression to estimate NSAT. The altitude was selected as geographical variable; and the remote sensing variables include land surface temperature (LST) and Normalized Difference vegetation index (NDVI). The performance of the proposed method was assessed by predict monthly minimum, mean, and maximum NSAT from point station measurements in China, a domain with a large area, complex topography, and highly variable station density, and the NSAT maps were validated against the meteorology observations. Validation results with meteorological data show the proposed method achieved an accuracy of 1.58 °C. It is concluded that the proposed method for mapping NSAT is very operational and has good precision.

  15. Spectral-Spatial Shared Linear Regression for Hyperspectral Image Classification.

    PubMed

    Haoliang Yuan; Yuan Yan Tang

    2017-04-01

    Classification of the pixels in hyperspectral image (HSI) is an important task and has been popularly applied in many practical applications. Its major challenge is the high-dimensional small-sized problem. To deal with this problem, lots of subspace learning (SL) methods are developed to reduce the dimension of the pixels while preserving the important discriminant information. Motivated by ridge linear regression (RLR) framework for SL, we propose a spectral-spatial shared linear regression method (SSSLR) for extracting the feature representation. Comparing with RLR, our proposed SSSLR has the following two advantages. First, we utilize a convex set to explore the spatial structure for computing the linear projection matrix. Second, we utilize a shared structure learning model, which is formed by original data space and a hidden feature space, to learn a more discriminant linear projection matrix for classification. To optimize our proposed method, an efficient iterative algorithm is proposed. Experimental results on two popular HSI data sets, i.e., Indian Pines and Salinas demonstrate that our proposed methods outperform many SL methods.

  16. A subagging regression method for estimating the qualitative and quantitative state of groundwater

    NASA Astrophysics Data System (ADS)

    Jeong, Jina; Park, Eungyu; Han, Weon Shik; Kim, Kue-Young

    2017-08-01

    A subsample aggregating (subagging) regression (SBR) method for the analysis of groundwater data pertaining to trend-estimation-associated uncertainty is proposed. The SBR method is validated against synthetic data competitively with other conventional robust and non-robust methods. From the results, it is verified that the estimation accuracies of the SBR method are consistent and superior to those of other methods, and the uncertainties are reasonably estimated; the others have no uncertainty analysis option. To validate further, actual groundwater data are employed and analyzed comparatively with Gaussian process regression (GPR). For all cases, the trend and the associated uncertainties are reasonably estimated by both SBR and GPR regardless of Gaussian or non-Gaussian skewed data. However, it is expected that GPR has a limitation in applications to severely corrupted data by outliers owing to its non-robustness. From the implementations, it is determined that the SBR method has the potential to be further developed as an effective tool of anomaly detection or outlier identification in groundwater state data such as the groundwater level and contaminant concentration.

  17. Factor complexity of crash occurrence: An empirical demonstration using boosted regression trees.

    PubMed

    Chung, Yi-Shih

    2013-12-01

    Factor complexity is a characteristic of traffic crashes. This paper proposes a novel method, namely boosted regression trees (BRT), to investigate the complex and nonlinear relationships in high-variance traffic crash data. The Taiwanese 2004-2005 single-vehicle motorcycle crash data are used to demonstrate the utility of BRT. Traditional logistic regression and classification and regression tree (CART) models are also used to compare their estimation results and external validities. Both the in-sample cross-validation and out-of-sample validation results show that an increase in tree complexity provides improved, although declining, classification performance, indicating a limited factor complexity of single-vehicle motorcycle crashes. The effects of crucial variables including geographical, time, and sociodemographic factors explain some fatal crashes. Relatively unique fatal crashes are better approximated by interactive terms, especially combinations of behavioral factors. BRT models generally provide improved transferability than conventional logistic regression and CART models. This study also discusses the implications of the results for devising safety policies. Copyright © 2012 Elsevier Ltd. All rights reserved.

  18. Modeling Longitudinal Data Containing Non-Normal Within Subject Errors

    NASA Technical Reports Server (NTRS)

    Feiveson, Alan; Glenn, Nancy L.

    2013-01-01

    The mission of the National Aeronautics and Space Administration’s (NASA) human research program is to advance safe human spaceflight. This involves conducting experiments, collecting data, and analyzing data. The data are longitudinal and result from a relatively few number of subjects; typically 10 – 20. A longitudinal study refers to an investigation where participant outcomes and possibly treatments are collected at multiple follow-up times. Standard statistical designs such as mean regression with random effects and mixed–effects regression are inadequate for such data because the population is typically not approximately normally distributed. Hence, more advanced data analysis methods are necessary. This research focuses on four such methods for longitudinal data analysis: the recently proposed linear quantile mixed models (lqmm) by Geraci and Bottai (2013), quantile regression, multilevel mixed–effects linear regression, and robust regression. This research also provides computational algorithms for longitudinal data that scientists can directly use for human spaceflight and other longitudinal data applications, then presents statistical evidence that verifies which method is best for specific situations. This advances the study of longitudinal data in a broad range of applications including applications in the sciences, technology, engineering and mathematics fields.

  19. Monitoring heavy metal Cr in soil based on hyperspectral data using regression analysis

    NASA Astrophysics Data System (ADS)

    Zhang, Ningyu; Xu, Fuyun; Zhuang, Shidong; He, Changwei

    2016-10-01

    Heavy metal pollution in soils is one of the most critical problems in the global ecology and environment safety nowadays. Hyperspectral remote sensing and its application is capable of high speed, low cost, less risk and less damage, and provides a good method for detecting heavy metals in soil. This paper proposed a new idea of applying regression analysis of stepwise multiple regression between the spectral data and monitoring the amount of heavy metal Cr by sample points in soil for environmental protection. In the measurement, a FieldSpec HandHeld spectroradiometer is used to collect reflectance spectra of sample points over the wavelength range of 325-1075 nm. Then the spectral data measured by the spectroradiometer is preprocessed to reduced the influence of the external factors, and the preprocessed methods include first-order differential equation, second-order differential equation and continuum removal method. The algorithms of stepwise multiple regression are established accordingly, and the accuracy of each equation is tested. The results showed that the accuracy of first-order differential equation works best, which makes it feasible to predict the content of heavy metal Cr by using stepwise multiple regression.

  20. Comparing lagged linear correlation, lagged regression, Granger causality, and vector autoregression for uncovering associations in EHR data.

    PubMed

    Levine, Matthew E; Albers, David J; Hripcsak, George

    2016-01-01

    Time series analysis methods have been shown to reveal clinical and biological associations in data collected in the electronic health record. We wish to develop reliable high-throughput methods for identifying adverse drug effects that are easy to implement and produce readily interpretable results. To move toward this goal, we used univariate and multivariate lagged regression models to investigate associations between twenty pairs of drug orders and laboratory measurements. Multivariate lagged regression models exhibited higher sensitivity and specificity than univariate lagged regression in the 20 examples, and incorporating autoregressive terms for labs and drugs produced more robust signals in cases of known associations among the 20 example pairings. Moreover, including inpatient admission terms in the model attenuated the signals for some cases of unlikely associations, demonstrating how multivariate lagged regression models' explicit handling of context-based variables can provide a simple way to probe for health-care processes that confound analyses of EHR data.

  1. Analysis of training sample selection strategies for regression-based quantitative landslide susceptibility mapping methods

    NASA Astrophysics Data System (ADS)

    Erener, Arzu; Sivas, A. Abdullah; Selcuk-Kestel, A. Sevtap; Düzgün, H. Sebnem

    2017-07-01

    All of the quantitative landslide susceptibility mapping (QLSM) methods requires two basic data types, namely, landslide inventory and factors that influence landslide occurrence (landslide influencing factors, LIF). Depending on type of landslides, nature of triggers and LIF, accuracy of the QLSM methods differs. Moreover, how to balance the number of 0 (nonoccurrence) and 1 (occurrence) in the training set obtained from the landslide inventory and how to select which one of the 1's and 0's to be included in QLSM models play critical role in the accuracy of the QLSM. Although performance of various QLSM methods is largely investigated in the literature, the challenge of training set construction is not adequately investigated for the QLSM methods. In order to tackle this challenge, in this study three different training set selection strategies along with the original data set is used for testing the performance of three different regression methods namely Logistic Regression (LR), Bayesian Logistic Regression (BLR) and Fuzzy Logistic Regression (FLR). The first sampling strategy is proportional random sampling (PRS), which takes into account a weighted selection of landslide occurrences in the sample set. The second method, namely non-selective nearby sampling (NNS), includes randomly selected sites and their surrounding neighboring points at certain preselected distances to include the impact of clustering. Selective nearby sampling (SNS) is the third method, which concentrates on the group of 1's and their surrounding neighborhood. A randomly selected group of landslide sites and their neighborhood are considered in the analyses similar to NNS parameters. It is found that LR-PRS, FLR-PRS and BLR-Whole Data set-ups, with order, yield the best fits among the other alternatives. The results indicate that in QLSM based on regression models, avoidance of spatial correlation in the data set is critical for the model's performance.

  2. Two Paradoxes in Linear Regression Analysis

    PubMed Central

    FENG, Ge; PENG, Jing; TU, Dongke; ZHENG, Julia Z.; FENG, Changyong

    2016-01-01

    Summary Regression is one of the favorite tools in applied statistics. However, misuse and misinterpretation of results from regression analysis are common in biomedical research. In this paper we use statistical theory and simulation studies to clarify some paradoxes around this popular statistical method. In particular, we show that a widely used model selection procedure employed in many publications in top medical journals is wrong. Formal procedures based on solid statistical theory should be used in model selection. PMID:28638214

  3. Regression models for analyzing costs and their determinants in health care: an introductory review.

    PubMed

    Gregori, Dario; Petrinco, Michele; Bo, Simona; Desideri, Alessandro; Merletti, Franco; Pagano, Eva

    2011-06-01

    This article aims to describe the various approaches in multivariable modelling of healthcare costs data and to synthesize the respective criticisms as proposed in the literature. We present regression methods suitable for the analysis of healthcare costs and then apply them to an experimental setting in cardiovascular treatment (COSTAMI study) and an observational setting in diabetes hospital care. We show how methods can produce different results depending on the degree of matching between the underlying assumptions of each method and the specific characteristics of the healthcare problem. The matching of healthcare cost models to the analytic objectives and characteristics of the data available to a study requires caution. The study results and interpretation can be heavily dependent on the choice of model with a real risk of spurious results and conclusions.

  4. Prediction model for the return to work of workers with injuries in Hong Kong.

    PubMed

    Xu, Yanwen; Chan, Chetwyn C H; Lo, Karen Hui Yu-Ling; Tang, Dan

    2008-01-01

    This study attempts to formulate a prediction model of return to work for a group of workers who have been suffering from chronic pain and physical injury while also being out of work in Hong Kong. The study used Case-based Reasoning (CBR) method, and compared the result with the statistical method of logistic regression model. The database of the algorithm of CBR was composed of 67 cases who were also used in the logistic regression model. The testing cases were 32 participants who had a similar background and characteristics to those in the database. The methods of setting constraints and Euclidean distance metric were used in CBR to search the closest cases to the trial case based on the matrix. The usefulness of the algorithm was tested on 32 new participants, and the accuracy of predicting return to work outcomes was 62.5%, which was no better than the 71.2% accuracy derived from the logistic regression model. The results of the study would enable us to have a better understanding of the CBR applied in the field of occupational rehabilitation by comparing with the conventional regression analysis. The findings would also shed light on the development of relevant interventions for the return-to-work process of these workers.

  5. Regression Methods for Categorical Dependent Variables: Effects on a Model of Student College Choice

    ERIC Educational Resources Information Center

    Rapp, Kelly E.

    2012-01-01

    The use of categorical dependent variables with the classical linear regression model (CLRM) violates many of the model's assumptions and may result in biased estimates (Long, 1997; O'Connell, Goldstein, Rogers, & Peng, 2008). Many dependent variables of interest to educational researchers (e.g., professorial rank, educational attainment) are…

  6. Accounting for spatial effects in land use regression for urban air pollution modeling.

    PubMed

    Bertazzon, Stefania; Johnson, Markey; Eccles, Kristin; Kaplan, Gilaad G

    2015-01-01

    In order to accurately assess air pollution risks, health studies require spatially resolved pollution concentrations. Land-use regression (LUR) models estimate ambient concentrations at a fine spatial scale. However, spatial effects such as spatial non-stationarity and spatial autocorrelation can reduce the accuracy of LUR estimates by increasing regression errors and uncertainty; and statistical methods for resolving these effects--e.g., spatially autoregressive (SAR) and geographically weighted regression (GWR) models--may be difficult to apply simultaneously. We used an alternate approach to address spatial non-stationarity and spatial autocorrelation in LUR models for nitrogen dioxide. Traditional models were re-specified to include a variable capturing wind speed and direction, and re-fit as GWR models. Mean R(2) values for the resulting GWR-wind models (summer: 0.86, winter: 0.73) showed a 10-20% improvement over traditional LUR models. GWR-wind models effectively addressed both spatial effects and produced meaningful predictive models. These results suggest a useful method for improving spatially explicit models. Copyright © 2015 The Authors. Published by Elsevier Ltd.. All rights reserved.

  7. Fast determination of total ginsenosides content in ginseng powder by near infrared reflectance spectroscopy

    NASA Astrophysics Data System (ADS)

    Chen, Hua-cai; Chen, Xing-dan; Lu, Yong-jun; Cao, Zhi-qiang

    2006-01-01

    Near infrared (NIR) reflectance spectroscopy was used to develop a fast determination method for total ginsenosides in Ginseng (Panax Ginseng) powder. The spectra were analyzed with multiplicative signal correction (MSC) correlation method. The best correlative spectra region with the total ginsenosides content was 1660 nm~1880 nm and 2230nm~2380 nm. The NIR calibration models of ginsenosides were built with multiple linear regression (MLR), principle component regression (PCR) and partial least squares (PLS) regression respectively. The results showed that the calibration model built with PLS combined with MSC and the optimal spectrum region was the best one. The correlation coefficient and the root mean square error of correction validation (RMSEC) of the best calibration model were 0.98 and 0.15% respectively. The optimal spectrum region for calibration was 1204nm~2014nm. The result suggested that using NIR to rapidly determinate the total ginsenosides content in ginseng powder were feasible.

  8. Evaluating the perennial stream using logistic regression in central Taiwan

    NASA Astrophysics Data System (ADS)

    Ruljigaljig, T.; Cheng, Y. S.; Lin, H. I.; Lee, C. H.; Yu, T. T.

    2014-12-01

    This study produces a perennial stream head potential map, based on a logistic regression method with a Geographic Information System (GIS). Perennial stream initiation locations, indicates the location of the groundwater and surface contact, were identified in the study area from field survey. The perennial stream potential map in central Taiwan was constructed using the relationship between perennial stream and their causative factors, such as Catchment area, slope gradient, aspect, elevation, groundwater recharge and precipitation. Here, the field surveys of 272 streams were determined in the study area. The areas under the curve for logistic regression methods were calculated as 0.87. The results illustrate the importance of catchment area and groundwater recharge as key factors within the model. The results obtained from the model within the GIS were then used to produce a map of perennial stream and estimate the location of perennial stream head.

  9. An application of robust ridge regression model in the presence of outliers to real data problem

    NASA Astrophysics Data System (ADS)

    Shariff, N. S. Md.; Ferdaos, N. A.

    2017-09-01

    Multicollinearity and outliers are often leads to inconsistent and unreliable parameter estimates in regression analysis. The well-known procedure that is robust to multicollinearity problem is the ridge regression method. This method however is believed are affected by the presence of outlier. The combination of GM-estimation and ridge parameter that is robust towards both problems is on interest in this study. As such, both techniques are employed to investigate the relationship between stock market price and macroeconomic variables in Malaysia due to curiosity of involving the multicollinearity and outlier problem in the data set. There are four macroeconomic factors selected for this study which are Consumer Price Index (CPI), Gross Domestic Product (GDP), Base Lending Rate (BLR) and Money Supply (M1). The results demonstrate that the proposed procedure is able to produce reliable results towards the presence of multicollinearity and outliers in the real data.

  10. Using Time-Series Regression to Predict Academic Library Circulations.

    ERIC Educational Resources Information Center

    Brooks, Terrence A.

    1984-01-01

    Four methods were used to forecast monthly circulation totals in 15 midwestern academic libraries: dummy time-series regression, lagged time-series regression, simple average (straight-line forecasting), monthly average (naive forecasting). In tests of forecasting accuracy, dummy regression method and monthly mean method exhibited smallest average…

  11. A simple linear regression method for quantitative trait loci linkage analysis with censored observations.

    PubMed

    Anderson, Carl A; McRae, Allan F; Visscher, Peter M

    2006-07-01

    Standard quantitative trait loci (QTL) mapping techniques commonly assume that the trait is both fully observed and normally distributed. When considering survival or age-at-onset traits these assumptions are often incorrect. Methods have been developed to map QTL for survival traits; however, they are both computationally intensive and not available in standard genome analysis software packages. We propose a grouped linear regression method for the analysis of continuous survival data. Using simulation we compare this method to both the Cox and Weibull proportional hazards models and a standard linear regression method that ignores censoring. The grouped linear regression method is of equivalent power to both the Cox and Weibull proportional hazards methods and is significantly better than the standard linear regression method when censored observations are present. The method is also robust to the proportion of censored individuals and the underlying distribution of the trait. On the basis of linear regression methodology, the grouped linear regression model is computationally simple and fast and can be implemented readily in freely available statistical software.

  12. Improved accuracy in quantitative laser-induced breakdown spectroscopy using sub-models

    USGS Publications Warehouse

    Anderson, Ryan; Clegg, Samuel M.; Frydenvang, Jens; Wiens, Roger C.; McLennan, Scott M.; Morris, Richard V.; Ehlmann, Bethany L.; Dyar, M. Darby

    2017-01-01

    Accurate quantitative analysis of diverse geologic materials is one of the primary challenges faced by the Laser-Induced Breakdown Spectroscopy (LIBS)-based ChemCam instrument on the Mars Science Laboratory (MSL) rover. The SuperCam instrument on the Mars 2020 rover, as well as other LIBS instruments developed for geochemical analysis on Earth or other planets, will face the same challenge. Consequently, part of the ChemCam science team has focused on the development of improved multivariate analysis calibrations methods. Developing a single regression model capable of accurately determining the composition of very different target materials is difficult because the response of an element’s emission lines in LIBS spectra can vary with the concentration of other elements. We demonstrate a conceptually simple “sub-model” method for improving the accuracy of quantitative LIBS analysis of diverse target materials. The method is based on training several regression models on sets of targets with limited composition ranges and then “blending” these “sub-models” into a single final result. Tests of the sub-model method show improvement in test set root mean squared error of prediction (RMSEP) for almost all cases. The sub-model method, using partial least squares regression (PLS), is being used as part of the current ChemCam quantitative calibration, but the sub-model method is applicable to any multivariate regression method and may yield similar improvements.

  13. Deep ensemble learning of sparse regression models for brain disease diagnosis.

    PubMed

    Suk, Heung-Il; Lee, Seong-Whan; Shen, Dinggang

    2017-04-01

    Recent studies on brain imaging analysis witnessed the core roles of machine learning techniques in computer-assisted intervention for brain disease diagnosis. Of various machine-learning techniques, sparse regression models have proved their effectiveness in handling high-dimensional data but with a small number of training samples, especially in medical problems. In the meantime, deep learning methods have been making great successes by outperforming the state-of-the-art performances in various applications. In this paper, we propose a novel framework that combines the two conceptually different methods of sparse regression and deep learning for Alzheimer's disease/mild cognitive impairment diagnosis and prognosis. Specifically, we first train multiple sparse regression models, each of which is trained with different values of a regularization control parameter. Thus, our multiple sparse regression models potentially select different feature subsets from the original feature set; thereby they have different powers to predict the response values, i.e., clinical label and clinical scores in our work. By regarding the response values from our sparse regression models as target-level representations, we then build a deep convolutional neural network for clinical decision making, which thus we call 'Deep Ensemble Sparse Regression Network.' To our best knowledge, this is the first work that combines sparse regression models with deep neural network. In our experiments with the ADNI cohort, we validated the effectiveness of the proposed method by achieving the highest diagnostic accuracies in three classification tasks. We also rigorously analyzed our results and compared with the previous studies on the ADNI cohort in the literature. Copyright © 2017 Elsevier B.V. All rights reserved.

  14. Deep ensemble learning of sparse regression models for brain disease diagnosis

    PubMed Central

    Suk, Heung-Il; Lee, Seong-Whan; Shen, Dinggang

    2018-01-01

    Recent studies on brain imaging analysis witnessed the core roles of machine learning techniques in computer-assisted intervention for brain disease diagnosis. Of various machine-learning techniques, sparse regression models have proved their effectiveness in handling high-dimensional data but with a small number of training samples, especially in medical problems. In the meantime, deep learning methods have been making great successes by outperforming the state-of-the-art performances in various applications. In this paper, we propose a novel framework that combines the two conceptually different methods of sparse regression and deep learning for Alzheimer’s disease/mild cognitive impairment diagnosis and prognosis. Specifically, we first train multiple sparse regression models, each of which is trained with different values of a regularization control parameter. Thus, our multiple sparse regression models potentially select different feature subsets from the original feature set; thereby they have different powers to predict the response values, i.e., clinical label and clinical scores in our work. By regarding the response values from our sparse regression models as target-level representations, we then build a deep convolutional neural network for clinical decision making, which thus we call ‘ Deep Ensemble Sparse Regression Network.’ To our best knowledge, this is the first work that combines sparse regression models with deep neural network. In our experiments with the ADNI cohort, we validated the effectiveness of the proposed method by achieving the highest diagnostic accuracies in three classification tasks. We also rigorously analyzed our results and compared with the previous studies on the ADNI cohort in the literature. PMID:28167394

  15. Pan evaporation modeling using six different heuristic computing methods in different climates of China

    NASA Astrophysics Data System (ADS)

    Wang, Lunche; Kisi, Ozgur; Zounemat-Kermani, Mohammad; Li, Hui

    2017-01-01

    Pan evaporation (Ep) plays important roles in agricultural water resources management. One of the basic challenges is modeling Ep using limited climatic parameters because there are a number of factors affecting the evaporation rate. This study investigated the abilities of six different soft computing methods, multi-layer perceptron (MLP), generalized regression neural network (GRNN), fuzzy genetic (FG), least square support vector machine (LSSVM), multivariate adaptive regression spline (MARS), adaptive neuro-fuzzy inference systems with grid partition (ANFIS-GP), and two regression methods, multiple linear regression (MLR) and Stephens and Stewart model (SS) in predicting monthly Ep. Long-term climatic data at various sites crossing a wide range of climates during 1961-2000 are used for model development and validation. The results showed that the models have different accuracies in different climates and the MLP model performed superior to the other models in predicting monthly Ep at most stations using local input combinations (for example, the MAE (mean absolute errors), RMSE (root mean square errors), and determination coefficient (R2) are 0.314 mm/day, 0.405 mm/day and 0.988, respectively for HEB station), while GRNN model performed better in Tibetan Plateau (MAE, RMSE and R2 are 0.459 mm/day, 0.592 mm/day and 0.932, respectively). The accuracies of above models ranked as: MLP, GRNN, LSSVM, FG, ANFIS-GP, MARS and MLR. The overall results indicated that the soft computing techniques generally performed better than the regression methods, but MLR and SS models can be more preferred at some climatic zones instead of complex nonlinear models, for example, the BJ (Beijing), CQ (Chongqing) and HK (Haikou) stations. Therefore, it can be concluded that Ep could be successfully predicted using above models in hydrological modeling studies.

  16. Partitioning sources of variation in vertebrate species richness

    USGS Publications Warehouse

    Boone, R.B.; Krohn, W.B.

    2000-01-01

    Aim: To explore biogeographic patterns of terrestrial vertebrates in Maine, USA using techniques that would describe local and spatial correlations with the environment. Location: Maine, USA. Methods: We delineated the ranges within Maine (86,156 km2) of 275 species using literature and expert review. Ranges were combined into species richness maps, and compared to geomorphology, climate, and woody plant distributions. Methods were adapted that compared richness of all vertebrate classes to each environmental correlate, rather than assessing a single explanatory theory. We partitioned variation in species richness into components using tree and multiple linear regression. Methods were used that allowed for useful comparisons between tree and linear regression results. For both methods we partitioned variation into broad-scale (spatially autocorrelated) and fine-scale (spatially uncorrelated) explained and unexplained components. By partitioning variance, and using both tree and linear regression in analyses, we explored the degree of variation in species richness for each vertebrate group that Could be explained by the relative contribution of each environmental variable. Results: In tree regression, climate variation explained richness better (92% of mean deviance explained for all species) than woody plant variation (87%) and geomorphology (86%). Reptiles were highly correlated with environmental variation (93%), followed by mammals, amphibians, and birds (each with 84-82% deviance explained). In multiple linear regression, climate was most closely associated with total vertebrate richness (78%), followed by woody plants (67%) and geomorphology (56%). Again, reptiles were closely correlated with the environment (95%), followed by mammals (73%), amphibians (63%) and birds (57%). Main conclusions: Comparing variation explained using tree and multiple linear regression quantified the importance of nonlinear relationships and local interactions between species richness and environmental variation, identifying the importance of linear relationships between reptiles and the environment, and nonlinear relationships between birds and woody plants, for example. Conservation planners should capture climatic variation in broad-scale designs; temperatures may shift during climate change, but the underlying correlations between the environment and species richness will presumably remain.

  17. Application of NIRS coupled with PLS regression as a rapid, non-destructive alternative method for quantification of KBA in Boswellia sacra

    NASA Astrophysics Data System (ADS)

    Al-Harrasi, Ahmed; Rehman, Najeeb Ur; Mabood, Fazal; Albroumi, Muhammaed; Ali, Liaqat; Hussain, Javid; Hussain, Hidayat; Csuk, René; Khan, Abdul Latif; Alam, Tanveer; Alameri, Saif

    2017-09-01

    In the present study, for the first time, NIR spectroscopy coupled with PLS regression as a rapid and alternative method was developed to quantify the amount of Keto-β-Boswellic Acid (KBA) in different plant parts of Boswellia sacra and the resin exudates of the trunk. NIR spectroscopy was used for the measurement of KBA standards and B. sacra samples in absorption mode in the wavelength range from 700-2500 nm. PLS regression model was built from the obtained spectral data using 70% of KBA standards (training set) in the range from 0.1 ppm to 100 ppm. The PLS regression model obtained was having R-square value of 98% with 0.99 corelationship value and having good prediction with RMSEP value 3.2 and correlation of 0.99. It was then used to quantify the amount of KBA in the samples of B. sacra. The results indicated that the MeOH extract of resin has the highest concentration of KBA (0.6%) followed by essential oil (0.1%). However, no KBA was found in the aqueous extract. The MeOH extract of the resin was subjected to column chromatography to get various sub-fractions at different polarity of organic solvents. The sub-fraction at 4% MeOH/CHCl3 (4.1% of KBA) was found to contain the highest percentage of KBA followed by another sub-fraction at 2% MeOH/CHCl3 (2.2% of KBA). The present results also indicated that KBA is only present in the gum-resin of the trunk and not in all parts of the plant. These results were further confirmed through HPLC analysis and therefore it is concluded that NIRS coupled with PLS regression is a rapid and alternate method for quantification of KBA in Boswellia sacra. It is non-destructive, rapid, sensitive and uses simple methods of sample preparation.

  18. Parameter estimation of multivariate multiple regression model using bayesian with non-informative Jeffreys’ prior distribution

    NASA Astrophysics Data System (ADS)

    Saputro, D. R. S.; Amalia, F.; Widyaningsih, P.; Affan, R. C.

    2018-05-01

    Bayesian method is a method that can be used to estimate the parameters of multivariate multiple regression model. Bayesian method has two distributions, there are prior and posterior distributions. Posterior distribution is influenced by the selection of prior distribution. Jeffreys’ prior distribution is a kind of Non-informative prior distribution. This prior is used when the information about parameter not available. Non-informative Jeffreys’ prior distribution is combined with the sample information resulting the posterior distribution. Posterior distribution is used to estimate the parameter. The purposes of this research is to estimate the parameters of multivariate regression model using Bayesian method with Non-informative Jeffreys’ prior distribution. Based on the results and discussion, parameter estimation of β and Σ which were obtained from expected value of random variable of marginal posterior distribution function. The marginal posterior distributions for β and Σ are multivariate normal and inverse Wishart. However, in calculation of the expected value involving integral of a function which difficult to determine the value. Therefore, approach is needed by generating of random samples according to the posterior distribution characteristics of each parameter using Markov chain Monte Carlo (MCMC) Gibbs sampling algorithm.

  19. GIS-based spatial regression and prediction of water quality in river networks: A case study in Iowa

    USGS Publications Warehouse

    Yang, X.; Jin, W.

    2010-01-01

    Nonpoint source pollution is the leading cause of the U.S.'s water quality problems. One important component of nonpoint source pollution control is an understanding of what and how watershed-scale conditions influence ambient water quality. This paper investigated the use of spatial regression to evaluate the impacts of watershed characteristics on stream NO3NO2-N concentration in the Cedar River Watershed, Iowa. An Arc Hydro geodatabase was constructed to organize various datasets on the watershed. Spatial regression models were developed to evaluate the impacts of watershed characteristics on stream NO3NO2-N concentration and predict NO3NO2-N concentration at unmonitored locations. Unlike the traditional ordinary least square (OLS) method, the spatial regression method incorporates the potential spatial correlation among the observations in its coefficient estimation. Study results show that NO3NO2-N observations in the Cedar River Watershed are spatially correlated, and by ignoring the spatial correlation, the OLS method tends to over-estimate the impacts of watershed characteristics on stream NO3NO2-N concentration. In conjunction with kriging, the spatial regression method not only makes better stream NO3NO2-N concentration predictions than the OLS method, but also gives estimates of the uncertainty of the predictions, which provides useful information for optimizing the design of stream monitoring network. It is a promising tool for better managing and controlling nonpoint source pollution. ?? 2010 Elsevier Ltd.

  20. Comparison of methods for the analysis of relatively simple mediation models.

    PubMed

    Rijnhart, Judith J M; Twisk, Jos W R; Chinapaw, Mai J M; de Boer, Michiel R; Heymans, Martijn W

    2017-09-01

    Statistical mediation analysis is an often used method in trials, to unravel the pathways underlying the effect of an intervention on a particular outcome variable. Throughout the years, several methods have been proposed, such as ordinary least square (OLS) regression, structural equation modeling (SEM), and the potential outcomes framework. Most applied researchers do not know that these methods are mathematically equivalent when applied to mediation models with a continuous mediator and outcome variable. Therefore, the aim of this paper was to demonstrate the similarities between OLS regression, SEM, and the potential outcomes framework in three mediation models: 1) a crude model, 2) a confounder-adjusted model, and 3) a model with an interaction term for exposure-mediator interaction. Secondary data analysis of a randomized controlled trial that included 546 schoolchildren. In our data example, the mediator and outcome variable were both continuous. We compared the estimates of the total, direct and indirect effects, proportion mediated, and 95% confidence intervals (CIs) for the indirect effect across OLS regression, SEM, and the potential outcomes framework. OLS regression, SEM, and the potential outcomes framework yielded the same effect estimates in the crude mediation model, the confounder-adjusted mediation model, and the mediation model with an interaction term for exposure-mediator interaction. Since OLS regression, SEM, and the potential outcomes framework yield the same results in three mediation models with a continuous mediator and outcome variable, researchers can continue using the method that is most convenient to them.

  1. A simulation study on Bayesian Ridge regression models for several collinearity levels

    NASA Astrophysics Data System (ADS)

    Efendi, Achmad; Effrihan

    2017-12-01

    When analyzing data with multiple regression model if there are collinearities, then one or several predictor variables are usually omitted from the model. However, there sometimes some reasons, for instance medical or economic reasons, the predictors are all important and should be included in the model. Ridge regression model is not uncommon in some researches to use to cope with collinearity. Through this modeling, weights for predictor variables are used for estimating parameters. The next estimation process could follow the concept of likelihood. Furthermore, for the estimation nowadays the Bayesian version could be an alternative. This estimation method does not match likelihood one in terms of popularity due to some difficulties; computation and so forth. Nevertheless, with the growing improvement of computational methodology recently, this caveat should not at the moment become a problem. This paper discusses about simulation process for evaluating the characteristic of Bayesian Ridge regression parameter estimates. There are several simulation settings based on variety of collinearity levels and sample sizes. The results show that Bayesian method gives better performance for relatively small sample sizes, and for other settings the method does perform relatively similar to the likelihood method.

  2. Prescription-drug-related risk in driving: comparing conventional and lasso shrinkage logistic regressions.

    PubMed

    Avalos, Marta; Adroher, Nuria Duran; Lagarde, Emmanuel; Thiessard, Frantz; Grandvalet, Yves; Contrand, Benjamin; Orriols, Ludivine

    2012-09-01

    Large data sets with many variables provide particular challenges when constructing analytic models. Lasso-related methods provide a useful tool, although one that remains unfamiliar to most epidemiologists. We illustrate the application of lasso methods in an analysis of the impact of prescribed drugs on the risk of a road traffic crash, using a large French nationwide database (PLoS Med 2010;7:e1000366). In the original case-control study, the authors analyzed each exposure separately. We use the lasso method, which can simultaneously perform estimation and variable selection in a single model. We compare point estimates and confidence intervals using (1) a separate logistic regression model for each drug with a Bonferroni correction and (2) lasso shrinkage logistic regression analysis. Shrinkage regression had little effect on (bias corrected) point estimates, but led to less conservative results, noticeably for drugs with moderate levels of exposure. Carbamates, carboxamide derivative and fatty acid derivative antiepileptics, drugs used in opioid dependence, and mineral supplements of potassium showed stronger associations. Lasso is a relevant method in the analysis of databases with large number of exposures and can be recommended as an alternative to conventional strategies.

  3. Higher-order Multivariable Polynomial Regression to Estimate Human Affective States

    NASA Astrophysics Data System (ADS)

    Wei, Jie; Chen, Tong; Liu, Guangyuan; Yang, Jiemin

    2016-03-01

    From direct observations, facial, vocal, gestural, physiological, and central nervous signals, estimating human affective states through computational models such as multivariate linear-regression analysis, support vector regression, and artificial neural network, have been proposed in the past decade. In these models, linear models are generally lack of precision because of ignoring intrinsic nonlinearities of complex psychophysiological processes; and nonlinear models commonly adopt complicated algorithms. To improve accuracy and simplify model, we introduce a new computational modeling method named as higher-order multivariable polynomial regression to estimate human affective states. The study employs standardized pictures in the International Affective Picture System to induce thirty subjects’ affective states, and obtains pure affective patterns of skin conductance as input variables to the higher-order multivariable polynomial model for predicting affective valence and arousal. Experimental results show that our method is able to obtain efficient correlation coefficients of 0.98 and 0.96 for estimation of affective valence and arousal, respectively. Moreover, the method may provide certain indirect evidences that valence and arousal have their brain’s motivational circuit origins. Thus, the proposed method can serve as a novel one for efficiently estimating human affective states.

  4. Higher-order Multivariable Polynomial Regression to Estimate Human Affective States

    PubMed Central

    Wei, Jie; Chen, Tong; Liu, Guangyuan; Yang, Jiemin

    2016-01-01

    From direct observations, facial, vocal, gestural, physiological, and central nervous signals, estimating human affective states through computational models such as multivariate linear-regression analysis, support vector regression, and artificial neural network, have been proposed in the past decade. In these models, linear models are generally lack of precision because of ignoring intrinsic nonlinearities of complex psychophysiological processes; and nonlinear models commonly adopt complicated algorithms. To improve accuracy and simplify model, we introduce a new computational modeling method named as higher-order multivariable polynomial regression to estimate human affective states. The study employs standardized pictures in the International Affective Picture System to induce thirty subjects’ affective states, and obtains pure affective patterns of skin conductance as input variables to the higher-order multivariable polynomial model for predicting affective valence and arousal. Experimental results show that our method is able to obtain efficient correlation coefficients of 0.98 and 0.96 for estimation of affective valence and arousal, respectively. Moreover, the method may provide certain indirect evidences that valence and arousal have their brain’s motivational circuit origins. Thus, the proposed method can serve as a novel one for efficiently estimating human affective states. PMID:26996254

  5. The prediction of intelligence in preschool children using alternative models to regression.

    PubMed

    Finch, W Holmes; Chang, Mei; Davis, Andrew S; Holden, Jocelyn E; Rothlisberg, Barbara A; McIntosh, David E

    2011-12-01

    Statistical prediction of an outcome variable using multiple independent variables is a common practice in the social and behavioral sciences. For example, neuropsychologists are sometimes called upon to provide predictions of preinjury cognitive functioning for individuals who have suffered a traumatic brain injury. Typically, these predictions are made using standard multiple linear regression models with several demographic variables (e.g., gender, ethnicity, education level) as predictors. Prior research has shown conflicting evidence regarding the ability of such models to provide accurate predictions of outcome variables such as full-scale intelligence (FSIQ) test scores. The present study had two goals: (1) to demonstrate the utility of a set of alternative prediction methods that have been applied extensively in the natural sciences and business but have not been frequently explored in the social sciences and (2) to develop models that can be used to predict premorbid cognitive functioning in preschool children. Predictions of Stanford-Binet 5 FSIQ scores for preschool-aged children is used to compare the performance of a multiple regression model with several of these alternative methods. Results demonstrate that classification and regression trees provided more accurate predictions of FSIQ scores than does the more traditional regression approach. Implications of these results are discussed.

  6. A novel simple QSAR model for the prediction of anti-HIV activity using multiple linear regression analysis.

    PubMed

    Afantitis, Antreas; Melagraki, Georgia; Sarimveis, Haralambos; Koutentis, Panayiotis A; Markopoulos, John; Igglessi-Markopoulou, Olga

    2006-08-01

    A quantitative-structure activity relationship was obtained by applying Multiple Linear Regression Analysis to a series of 80 1-[2-hydroxyethoxy-methyl]-6-(phenylthio) thymine (HEPT) derivatives with significant anti-HIV activity. For the selection of the best among 37 different descriptors, the Elimination Selection Stepwise Regression Method (ES-SWR) was utilized. The resulting QSAR model (R (2) (CV) = 0.8160; S (PRESS) = 0.5680) proved to be very accurate both in training and predictive stages.

  7. Evaluation of the confusion matrix method in the validation of an automated system for measuring feeding behaviour of cattle.

    PubMed

    Ruuska, Salla; Hämäläinen, Wilhelmiina; Kajava, Sari; Mughal, Mikaela; Matilainen, Pekka; Mononen, Jaakko

    2018-03-01

    The aim of the present study was to evaluate empirically confusion matrices in device validation. We compared the confusion matrix method to linear regression and error indices in the validation of a device measuring feeding behaviour of dairy cattle. In addition, we studied how to extract additional information on classification errors with confusion probabilities. The data consisted of 12 h behaviour measurements from five dairy cows; feeding and other behaviour were detected simultaneously with a device and from video recordings. The resulting 216 000 pairs of classifications were used to construct confusion matrices and calculate performance measures. In addition, hourly durations of each behaviour were calculated and the accuracy of measurements was evaluated with linear regression and error indices. All three validation methods agreed when the behaviour was detected very accurately or inaccurately. Otherwise, in the intermediate cases, the confusion matrix method and error indices produced relatively concordant results, but the linear regression method often disagreed with them. Our study supports the use of confusion matrix analysis in validation since it is robust to any data distribution and type of relationship, it makes a stringent evaluation of validity, and it offers extra information on the type and sources of errors. Copyright © 2018 Elsevier B.V. All rights reserved.

  8. Use and interpretation of logistic regression in habitat-selection studies

    USGS Publications Warehouse

    Keating, Kim A.; Cherry, Steve

    2004-01-01

     Logistic regression is an important tool for wildlife habitat-selection studies, but the method frequently has been misapplied due to an inadequate understanding of the logistic model, its interpretation, and the influence of sampling design. To promote better use of this method, we review its application and interpretation under 3 sampling designs: random, case-control, and use-availability. Logistic regression is appropriate for habitat use-nonuse studies employing random sampling and can be used to directly model the conditional probability of use in such cases. Logistic regression also is appropriate for studies employing case-control sampling designs, but careful attention is required to interpret results correctly. Unless bias can be estimated or probability of use is small for all habitats, results of case-control studies should be interpreted as odds ratios, rather than probability of use or relative probability of use. When data are gathered under a use-availability design, logistic regression can be used to estimate approximate odds ratios if probability of use is small, at least on average. More generally, however, logistic regression is inappropriate for modeling habitat selection in use-availability studies. In particular, using logistic regression to fit the exponential model of Manly et al. (2002:100) does not guarantee maximum-likelihood estimates, valid probabilities, or valid likelihoods. We show that the resource selection function (RSF) commonly used for the exponential model is proportional to a logistic discriminant function. Thus, it may be used to rank habitats with respect to probability of use and to identify important habitat characteristics or their surrogates, but it is not guaranteed to be proportional to probability of use. Other problems associated with the exponential model also are discussed. We describe an alternative model based on Lancaster and Imbens (1996) that offers a method for estimating conditional probability of use in use-availability studies. Although promising, this model fails to converge to a unique solution in some important situations. Further work is needed to obtain a robust method that is broadly applicable to use-availability studies.

  9. Engine With Regression and Neural Network Approximators Designed

    NASA Technical Reports Server (NTRS)

    Patnaik, Surya N.; Hopkins, Dale A.

    2001-01-01

    At the NASA Glenn Research Center, the NASA engine performance program (NEPP, ref. 1) and the design optimization testbed COMETBOARDS (ref. 2) with regression and neural network analysis-approximators have been coupled to obtain a preliminary engine design methodology. The solution to a high-bypass-ratio subsonic waverotor-topped turbofan engine, which is shown in the preceding figure, was obtained by the simulation depicted in the following figure. This engine is made of 16 components mounted on two shafts with 21 flow stations. The engine is designed for a flight envelope with 47 operating points. The design optimization utilized both neural network and regression approximations, along with the cascade strategy (ref. 3). The cascade used three algorithms in sequence: the method of feasible directions, the sequence of unconstrained minimizations technique, and sequential quadratic programming. The normalized optimum thrusts obtained by the three methods are shown in the following figure: the cascade algorithm with regression approximation is represented by a triangle, a circle is shown for the neural network solution, and a solid line indicates original NEPP results. The solutions obtained from both approximate methods lie within one standard deviation of the benchmark solution for each operating point. The simulation improved the maximum thrust by 5 percent. The performance of the linear regression and neural network methods as alternate engine analyzers was found to be satisfactory for the analysis and operation optimization of air-breathing propulsion engines (ref. 4).

  10. Optical scatterometry of quarter-micron patterns using neural regression

    NASA Astrophysics Data System (ADS)

    Bischoff, Joerg; Bauer, Joachim J.; Haak, Ulrich; Hutschenreuther, Lutz; Truckenbrodt, Horst

    1998-06-01

    With shrinking dimensions and increasing chip areas, a rapid and non-destructive full wafer characterization after every patterning cycle is an inevitable necessity. In former publications it was shown that Optical Scatterometry (OS) has the potential to push the attainable feature limits of optical techniques from 0.8 . . . 0.5 microns for imaging methods down to 0.1 micron and below. Thus the demands of future metrology can be met. Basically being a nonimaging method, OS combines light scatter (or diffraction) measurements with modern data analysis schemes to solve the inverse scatter issue. For very fine patterns with lambda-to-pitch ratios grater than one, the specular reflected light versus the incidence angle is recorded. Usually, the data analysis comprises two steps -- a training cycle connected the a rigorous forward modeling and the prediction itself. Until now, two data analysis schemes are usually applied -- the multivariate regression based Partial Least Squares method (PLS) and a look-up-table technique which is also referred to as Minimum Mean Square Error approach (MMSE). Both methods are afflicted with serious drawbacks. On the one hand, the prediction accuracy of multivariate regression schemes degrades with larger parameter ranges due to the linearization properties of the method. On the other hand, look-up-table methods are rather time consuming during prediction thus prolonging the processing time and reducing the throughput. An alternate method is an Artificial Neural Network (ANN) based regression which combines the advantages of multivariate regression and MMSE. Due to the versatility of a neural network, not only can its structure be adapted more properly to the scatter problem, but also the nonlinearity of the neuronal transfer functions mimic the nonlinear behavior of optical diffraction processes more adequately. In spite of these pleasant properties, the prediction speed of ANN regression is comparable with that of the PLS-method. In this paper, the viability and performance of ANN-regression will be demonstrated with the example of sub-quarter-micron resist metrology. To this end, 0.25 micrometer line/space patterns have been printed in positive photoresist by means of DUV projection lithography. In order to evaluate the total metrology chain from light scatter measurement through data analysis, a thorough modeling has been performed. Assuming a trapezoidal shape of the developed resist profile, a training data set was generated by means of the Rigorous Coupled Wave Approach (RCWA). After training the model, a second data set was computed and deteriorated by Gaussian noise to imitate real measuring conditions. Then, these data have been fed into the models established before resulting in a Standard Error of Prediction (SEP) which corresponds to the measuring accuracy. Even with putting only little effort in the design of a back-propagation network, the ANN is clearly superior to the PLS-method. Depending on whether a network with one or two hidden layers was used, accuracy gains between 2 and 5 can be achieved compared with PLS regression. Furthermore, the ANN is less noise sensitive, for there is only a doubling of the SEP at 5% noise for ANN whereas for PLS the accuracy degrades rapidly with increasing noise. The accuracy gain also depends on the light polarization and on the measured parameters. Finally, these results have been proven experimentally, where the OS-results are in good accordance with the profiles obtained from cross- sectioning micrographs.

  11. Automating approximate Bayesian computation by local linear regression.

    PubMed

    Thornton, Kevin R

    2009-07-07

    In several biological contexts, parameter inference often relies on computationally-intensive techniques. "Approximate Bayesian Computation", or ABC, methods based on summary statistics have become increasingly popular. A particular flavor of ABC based on using a linear regression to approximate the posterior distribution of the parameters, conditional on the summary statistics, is computationally appealing, yet no standalone tool exists to automate the procedure. Here, I describe a program to implement the method. The software package ABCreg implements the local linear-regression approach to ABC. The advantages are: 1. The code is standalone, and fully-documented. 2. The program will automatically process multiple data sets, and create unique output files for each (which may be processed immediately in R), facilitating the testing of inference procedures on simulated data, or the analysis of multiple data sets. 3. The program implements two different transformation methods for the regression step. 4. Analysis options are controlled on the command line by the user, and the program is designed to output warnings for cases where the regression fails. 5. The program does not depend on any particular simulation machinery (coalescent, forward-time, etc.), and therefore is a general tool for processing the results from any simulation. 6. The code is open-source, and modular.Examples of applying the software to empirical data from Drosophila melanogaster, and testing the procedure on simulated data, are shown. In practice, the ABCreg simplifies implementing ABC based on local-linear regression.

  12. Region of influence regression for estimating the 50-year flood at ungaged sites

    USGS Publications Warehouse

    Tasker, Gary D.; Hodge, S.A.; Barks, C.S.

    1996-01-01

    Five methods of developing regional regression models to estimate flood characteristics at ungaged sites in Arkansas are examined. The methods differ in the manner in which the State is divided into subrogions. Each successive method (A to E) is computationally more complex than the previous method. Method A makes no subdivision. Methods B and C define two and four geographic subrogions, respectively. Method D uses cluster/discriminant analysis to define subrogions on the basis of similarities in watershed characteristics. Method E, the new region of influence method, defines a unique subregion for each ungaged site. Split-sample results indicate that, in terms of root-mean-square error, method E (38 percent error) is best. Methods C and D (42 and 41 percent error) were in a virtual tie for second, and methods B (44 percent error) and A (49 percent error) were fourth and fifth best.

  13. Prediction of hourly PM2.5 using a space-time support vector regression model

    NASA Astrophysics Data System (ADS)

    Yang, Wentao; Deng, Min; Xu, Feng; Wang, Hang

    2018-05-01

    Real-time air quality prediction has been an active field of research in atmospheric environmental science. The existing methods of machine learning are widely used to predict pollutant concentrations because of their enhanced ability to handle complex non-linear relationships. However, because pollutant concentration data, as typical geospatial data, also exhibit spatial heterogeneity and spatial dependence, they may violate the assumptions of independent and identically distributed random variables in most of the machine learning methods. As a result, a space-time support vector regression model is proposed to predict hourly PM2.5 concentrations. First, to address spatial heterogeneity, spatial clustering is executed to divide the study area into several homogeneous or quasi-homogeneous subareas. To handle spatial dependence, a Gauss vector weight function is then developed to determine spatial autocorrelation variables as part of the input features. Finally, a local support vector regression model with spatial autocorrelation variables is established for each subarea. Experimental data on PM2.5 concentrations in Beijing are used to verify whether the results of the proposed model are superior to those of other methods.

  14. Local-scale spatial modelling for interpolating climatic temperature variables to predict agricultural plant suitability

    NASA Astrophysics Data System (ADS)

    Webb, Mathew A.; Hall, Andrew; Kidd, Darren; Minansy, Budiman

    2016-05-01

    Assessment of local spatial climatic variability is important in the planning of planting locations for horticultural crops. This study investigated three regression-based calibration methods (i.e. traditional versus two optimized methods) to relate short-term 12-month data series from 170 temperature loggers and 4 weather station sites with data series from nearby long-term Australian Bureau of Meteorology climate stations. The techniques trialled to interpolate climatic temperature variables, such as frost risk, growing degree days (GDDs) and chill hours, were regression kriging (RK), regression trees (RTs) and random forests (RFs). All three calibration methods produced accurate results, with the RK-based calibration method delivering the most accurate validation measures: coefficients of determination ( R 2) of 0.92, 0.97 and 0.95 and root-mean-square errors of 1.30, 0.80 and 1.31 °C, for daily minimum, daily maximum and hourly temperatures, respectively. Compared with the traditional method of calibration using direct linear regression between short-term and long-term stations, the RK-based calibration method improved R 2 and reduced root-mean-square error (RMSE) by at least 5 % and 0.47 °C for daily minimum temperature, 1 % and 0.23 °C for daily maximum temperature and 3 % and 0.33 °C for hourly temperature. Spatial modelling indicated insignificant differences between the interpolation methods, with the RK technique tending to be the slightly better method due to the high degree of spatial autocorrelation between logger sites.

  15. Does Nonlinear Modeling Play a Role in Plasmid Bioprocess Monitoring Using Fourier Transform Infrared Spectra?

    PubMed

    Lopes, Marta B; Calado, Cecília R C; Figueiredo, Mário A T; Bioucas-Dias, José M

    2017-06-01

    The monitoring of biopharmaceutical products using Fourier transform infrared (FT-IR) spectroscopy relies on calibration techniques involving the acquisition of spectra of bioprocess samples along the process. The most commonly used method for that purpose is partial least squares (PLS) regression, under the assumption that a linear model is valid. Despite being successful in the presence of small nonlinearities, linear methods may fail in the presence of strong nonlinearities. This paper studies the potential usefulness of nonlinear regression methods for predicting, from in situ near-infrared (NIR) and mid-infrared (MIR) spectra acquired in high-throughput mode, biomass and plasmid concentrations in Escherichia coli DH5-α cultures producing the plasmid model pVAX-LacZ. The linear methods PLS and ridge regression (RR) are compared with their kernel (nonlinear) versions, kPLS and kRR, as well as with the (also nonlinear) relevance vector machine (RVM) and Gaussian process regression (GPR). For the systems studied, RR provided better predictive performances compared to the remaining methods. Moreover, the results point to further investigation based on larger data sets whenever differences in predictive accuracy between a linear method and its kernelized version could not be found. The use of nonlinear methods, however, shall be judged regarding the additional computational cost required to tune their additional parameters, especially when the less computationally demanding linear methods herein studied are able to successfully monitor the variables under study.

  16. Using methods from the data mining and machine learning literature for disease classification and prediction: A case study examining classification of heart failure sub-types

    PubMed Central

    Austin, Peter C.; Tu, Jack V.; Ho, Jennifer E.; Levy, Daniel; Lee, Douglas S.

    2014-01-01

    Objective Physicians classify patients into those with or without a specific disease. Furthermore, there is often interest in classifying patients according to disease etiology or subtype. Classification trees are frequently used to classify patients according to the presence or absence of a disease. However, classification trees can suffer from limited accuracy. In the data-mining and machine learning literature, alternate classification schemes have been developed. These include bootstrap aggregation (bagging), boosting, random forests, and support vector machines. Study design and Setting We compared the performance of these classification methods with those of conventional classification trees to classify patients with heart failure according to the following sub-types: heart failure with preserved ejection fraction (HFPEF) vs. heart failure with reduced ejection fraction (HFREF). We also compared the ability of these methods to predict the probability of the presence of HFPEF with that of conventional logistic regression. Results We found that modern, flexible tree-based methods from the data mining literature offer substantial improvement in prediction and classification of heart failure sub-type compared to conventional classification and regression trees. However, conventional logistic regression had superior performance for predicting the probability of the presence of HFPEF compared to the methods proposed in the data mining literature. Conclusion The use of tree-based methods offers superior performance over conventional classification and regression trees for predicting and classifying heart failure subtypes in a population-based sample of patients from Ontario. However, these methods do not offer substantial improvements over logistic regression for predicting the presence of HFPEF. PMID:23384592

  17. A Ranking Approach to Genomic Selection.

    PubMed

    Blondel, Mathieu; Onogi, Akio; Iwata, Hiroyoshi; Ueda, Naonori

    2015-01-01

    Genomic selection (GS) is a recent selective breeding method which uses predictive models based on whole-genome molecular markers. Until now, existing studies formulated GS as the problem of modeling an individual's breeding value for a particular trait of interest, i.e., as a regression problem. To assess predictive accuracy of the model, the Pearson correlation between observed and predicted trait values was used. In this paper, we propose to formulate GS as the problem of ranking individuals according to their breeding value. Our proposed framework allows us to employ machine learning methods for ranking which had previously not been considered in the GS literature. To assess ranking accuracy of a model, we introduce a new measure originating from the information retrieval literature called normalized discounted cumulative gain (NDCG). NDCG rewards more strongly models which assign a high rank to individuals with high breeding value. Therefore, NDCG reflects a prerequisite objective in selective breeding: accurate selection of individuals with high breeding value. We conducted a comparison of 10 existing regression methods and 3 new ranking methods on 6 datasets, consisting of 4 plant species and 25 traits. Our experimental results suggest that tree-based ensemble methods including McRank, Random Forests and Gradient Boosting Regression Trees achieve excellent ranking accuracy. RKHS regression and RankSVM also achieve good accuracy when used with an RBF kernel. Traditional regression methods such as Bayesian lasso, wBSR and BayesC were found less suitable for ranking. Pearson correlation was found to correlate poorly with NDCG. Our study suggests two important messages. First, ranking methods are a promising research direction in GS. Second, NDCG can be a useful evaluation measure for GS.

  18. Feature selection using probabilistic prediction of support vector regression.

    PubMed

    Yang, Jian-Bo; Ong, Chong-Jin

    2011-06-01

    This paper presents a new wrapper-based feature selection method for support vector regression (SVR) using its probabilistic predictions. The method computes the importance of a feature by aggregating the difference, over the feature space, of the conditional density functions of the SVR prediction with and without the feature. As the exact computation of this importance measure is expensive, two approximations are proposed. The effectiveness of the measure using these approximations, in comparison to several other existing feature selection methods for SVR, is evaluated on both artificial and real-world problems. The result of the experiments show that the proposed method generally performs better than, or at least as well as, the existing methods, with notable advantage when the dataset is sparse.

  19. Using Classification and Regression Trees (CART) and random forests to analyze attrition: Results from two simulations.

    PubMed

    Hayes, Timothy; Usami, Satoshi; Jacobucci, Ross; McArdle, John J

    2015-12-01

    In this article, we describe a recent development in the analysis of attrition: using classification and regression trees (CART) and random forest methods to generate inverse sampling weights. These flexible machine learning techniques have the potential to capture complex nonlinear, interactive selection models, yet to our knowledge, their performance in the missing data analysis context has never been evaluated. To assess the potential benefits of these methods, we compare their performance with commonly employed multiple imputation and complete case techniques in 2 simulations. These initial results suggest that weights computed from pruned CART analyses performed well in terms of both bias and efficiency when compared with other methods. We discuss the implications of these findings for applied researchers. (c) 2015 APA, all rights reserved).

  20. Using Classification and Regression Trees (CART) and Random Forests to Analyze Attrition: Results From Two Simulations

    PubMed Central

    Hayes, Timothy; Usami, Satoshi; Jacobucci, Ross; McArdle, John J.

    2016-01-01

    In this article, we describe a recent development in the analysis of attrition: using classification and regression trees (CART) and random forest methods to generate inverse sampling weights. These flexible machine learning techniques have the potential to capture complex nonlinear, interactive selection models, yet to our knowledge, their performance in the missing data analysis context has never been evaluated. To assess the potential benefits of these methods, we compare their performance with commonly employed multiple imputation and complete case techniques in 2 simulations. These initial results suggest that weights computed from pruned CART analyses performed well in terms of both bias and efficiency when compared with other methods. We discuss the implications of these findings for applied researchers. PMID:26389526

  1. A Method of Trigonometric Modelling of Seasonal Variation Demonstrated with Multiple Sclerosis Relapse Data.

    PubMed

    Spelman, Tim; Gray, Orla; Lucas, Robyn; Butzkueven, Helmut

    2015-12-09

    This report describes a novel Stata-based application of trigonometric regression modelling to 55 years of multiple sclerosis relapse data from 46 clinical centers across 20 countries located in both hemispheres. Central to the success of this method was the strategic use of plot analysis to guide and corroborate the statistical regression modelling. Initial plot analysis was necessary for establishing realistic hypotheses regarding the presence and structural form of seasonal and latitudinal influences on relapse probability and then testing the performance of the resultant models. Trigonometric regression was then necessary to quantify these relationships, adjust for important confounders and provide a measure of certainty as to how plausible these associations were. Synchronization of graphing techniques with regression modelling permitted a systematic refinement of models until best-fit convergence was achieved, enabling novel inferences to be made regarding the independent influence of both season and latitude in predicting relapse onset timing in MS. These methods have the potential for application across other complex disease and epidemiological phenomena suspected or known to vary systematically with season and/or geographic location.

  2. Recent trends in counts of migrant hawks from northeastern North America

    USGS Publications Warehouse

    Titus, K.; Fuller, M.R.

    1990-01-01

    Using simple regression, pooled-sites route-regression, and nonparametric rank-trend analyses, we evaluated trends in counts of hawks migrating past 6 eastern hawk lookouts from 1972 to 1987. The indexing variable was the total count for a season. Bald eagle (Haliaeetus leucocephalus), peregrine falcon (Falco peregrinus), merlin (F. columbarius), osprey (Pandion haliaetus), and Cooper's hawk (Accipiter cooperii) counts increased using route-regression and nonparametric methods (P 0.10). We found no consistent trends (P > 0.10) in counts of sharp-shinned hawks (A. striatus), northern goshawks (A. gentilis) red-shouldered hawks (Buteo lineatus), red-tailed hawks (B. jamaicensis), rough-legged hawsk (B. lagopus), and American kestrels (F. sparverius). Broad-winged hawk (B. platypterus) counts declined (P < 0.05) based on the route-regression method. Empirical comparisons of our results with those for well-studied species such as the peregrine falcon, bald eagle, and osprey indicated agreement with nesting surveys. We suggest that counts of migrant hawks are a useful and economical method for detecting long-term trends in species across regions, particularly for species that otherwise cannot be easily surveyed.

  3. Screening and clustering of sparse regressions with finite non-Gaussian mixtures.

    PubMed

    Zhang, Jian

    2017-06-01

    This article proposes a method to address the problem that can arise when covariates in a regression setting are not Gaussian, which may give rise to approximately mixture-distributed errors, or when a true mixture of regressions produced the data. The method begins with non-Gaussian mixture-based marginal variable screening, followed by fitting a full but relatively smaller mixture regression model to the selected data with help of a new penalization scheme. Under certain regularity conditions, the new screening procedure is shown to possess a sure screening property even when the population is heterogeneous. We further prove that there exists an elbow point in the associated scree plot which results in a consistent estimator of the set of active covariates in the model. By simulations, we demonstrate that the new procedure can substantially improve the performance of the existing procedures in the content of variable screening and data clustering. By applying the proposed procedure to motif data analysis in molecular biology, we demonstrate that the new method holds promise in practice. © 2016, The International Biometric Society.

  4. Interactive Visual Least Absolutes Method: Comparison with the Least Squares and the Median Methods

    ERIC Educational Resources Information Center

    Kim, Myung-Hoon; Kim, Michelle S.

    2016-01-01

    A visual regression analysis using the least absolutes method (LAB) was developed, utilizing an interactive approach of visually minimizing the sum of the absolute deviations (SAB) using a bar graph in Excel; the results agree very well with those obtained from nonvisual LAB using a numerical Solver in Excel. These LAB results were compared with…

  5. Robust Methods for Moderation Analysis with a Two-Level Regression Model.

    PubMed

    Yang, Miao; Yuan, Ke-Hai

    2016-01-01

    Moderation analysis has many applications in social sciences. Most widely used estimation methods for moderation analysis assume that errors are normally distributed and homoscedastic. When these assumptions are not met, the results from a classical moderation analysis can be misleading. For more reliable moderation analysis, this article proposes two robust methods with a two-level regression model when the predictors do not contain measurement error. One method is based on maximum likelihood with Student's t distribution and the other is based on M-estimators with Huber-type weights. An algorithm for obtaining the robust estimators is developed. Consistent estimates of standard errors of the robust estimators are provided. The robust approaches are compared against normal-distribution-based maximum likelihood (NML) with respect to power and accuracy of parameter estimates through a simulation study. Results show that the robust approaches outperform NML under various distributional conditions. Application of the robust methods is illustrated through a real data example. An R program is developed and documented to facilitate the application of the robust methods.

  6. Development of an anaerobic threshold (HRLT, HRVT) estimation equation using the heart rate threshold (HRT) during the treadmill incremental exercise test

    PubMed Central

    Ham, Joo-ho; Park, Hun-Young; Kim, Youn-ho; Bae, Sang-kon; Ko, Byung-hoon

    2017-01-01

    [Purpose] The purpose of this study was to develop a regression model to estimate the heart rate at the lactate threshold (HRLT) and the heart rate at the ventilatory threshold (HRVT) using the heart rate threshold (HRT), and to test the validity of the regression model. [Methods] We performed a graded exercise test with a treadmill in 220 normal individuals (men: 112, women: 108) aged 20–59 years. HRT, HRLT, and HRVT were measured in all subjects. A regression model was developed to estimate HRLT and HRVT using HRT with 70% of the data (men: 79, women: 76) through randomization (7:3), with the Bernoulli trial. The validity of the regression model developed with the remaining 30% of the data (men: 33, women: 32) was also examined. [Results] Based on the regression coefficient, we found that the independent variable HRT was a significant variable in all regression models. The adjusted R2 of the developed regression models averaged about 70%, and the standard error of estimation of the validity test results was 11 bpm, which is similar to that of the developed model. [Conclusion] These results suggest that HRT is a useful parameter for predicting HRLT and HRVT. PMID:29036765

  7. Methods for estimating population density in data-limited areas: evaluating regression and tree-based models in Peru.

    PubMed

    Anderson, Weston; Guikema, Seth; Zaitchik, Ben; Pan, William

    2014-01-01

    Obtaining accurate small area estimates of population is essential for policy and health planning but is often difficult in countries with limited data. In lieu of available population data, small area estimate models draw information from previous time periods or from similar areas. This study focuses on model-based methods for estimating population when no direct samples are available in the area of interest. To explore the efficacy of tree-based models for estimating population density, we compare six different model structures including Random Forest and Bayesian Additive Regression Trees. Results demonstrate that without information from prior time periods, non-parametric tree-based models produced more accurate predictions than did conventional regression methods. Improving estimates of population density in non-sampled areas is important for regions with incomplete census data and has implications for economic, health and development policies.

  8. Detecting sea-level hazards: Simple regression-based methods for calculating the acceleration of sea level

    USGS Publications Warehouse

    Doran, Kara S.; Howd, Peter A.; Sallenger,, Asbury H.

    2016-01-04

    Recent studies, and most of their predecessors, use tide gage data to quantify SL acceleration, ASL(t). In the current study, three techniques were used to calculate acceleration from tide gage data, and of those examined, it was determined that the two techniques based on sliding a regression window through the time series are more robust compared to the technique that fits a single quadratic form to the entire time series, particularly if there is temporal variation in the magnitude of the acceleration. The single-fit quadratic regression method has been the most commonly used technique in determining acceleration in tide gage data. The inability of the single-fit method to account for time-varying acceleration may explain some of the inconsistent findings between investigators. Properly quantifying ASL(t) from field measurements is of particular importance in evaluating numerical models of past, present, and future SLR resulting from anticipated climate change.

  9. Methods for Estimating Population Density in Data-Limited Areas: Evaluating Regression and Tree-Based Models in Peru

    PubMed Central

    Anderson, Weston; Guikema, Seth; Zaitchik, Ben; Pan, William

    2014-01-01

    Obtaining accurate small area estimates of population is essential for policy and health planning but is often difficult in countries with limited data. In lieu of available population data, small area estimate models draw information from previous time periods or from similar areas. This study focuses on model-based methods for estimating population when no direct samples are available in the area of interest. To explore the efficacy of tree-based models for estimating population density, we compare six different model structures including Random Forest and Bayesian Additive Regression Trees. Results demonstrate that without information from prior time periods, non-parametric tree-based models produced more accurate predictions than did conventional regression methods. Improving estimates of population density in non-sampled areas is important for regions with incomplete census data and has implications for economic, health and development policies. PMID:24992657

  10. Distributed collaborative probabilistic design for turbine blade-tip radial running clearance using support vector machine of regression

    NASA Astrophysics Data System (ADS)

    Fei, Cheng-Wei; Bai, Guang-Chen

    2014-12-01

    To improve the computational precision and efficiency of probabilistic design for mechanical dynamic assembly like the blade-tip radial running clearance (BTRRC) of gas turbine, a distribution collaborative probabilistic design method-based support vector machine of regression (SR)(called as DCSRM) is proposed by integrating distribution collaborative response surface method and support vector machine regression model. The mathematical model of DCSRM is established and the probabilistic design idea of DCSRM is introduced. The dynamic assembly probabilistic design of aeroengine high-pressure turbine (HPT) BTRRC is accomplished to verify the proposed DCSRM. The analysis results reveal that the optimal static blade-tip clearance of HPT is gained for designing BTRRC, and improving the performance and reliability of aeroengine. The comparison of methods shows that the DCSRM has high computational accuracy and high computational efficiency in BTRRC probabilistic analysis. The present research offers an effective way for the reliability design of mechanical dynamic assembly and enriches mechanical reliability theory and method.

  11. Experimental variability and data pre-processing as factors affecting the discrimination power of some chemometric approaches (PCA, CA and a new algorithm based on linear regression) applied to (+/-)ESI/MS and RPLC/UV data: Application on green tea extracts.

    PubMed

    Iorgulescu, E; Voicu, V A; Sârbu, C; Tache, F; Albu, F; Medvedovici, A

    2016-08-01

    The influence of the experimental variability (instrumental repeatability, instrumental intermediate precision and sample preparation variability) and data pre-processing (normalization, peak alignment, background subtraction) on the discrimination power of multivariate data analysis methods (Principal Component Analysis -PCA- and Cluster Analysis -CA-) as well as a new algorithm based on linear regression was studied. Data used in the study were obtained through positive or negative ion monitoring electrospray mass spectrometry (+/-ESI/MS) and reversed phase liquid chromatography/UV spectrometric detection (RPLC/UV) applied to green tea extracts. Extractions in ethanol and heated water infusion were used as sample preparation procedures. The multivariate methods were directly applied to mass spectra and chromatograms, involving strictly a holistic comparison of shapes, without assignment of any structural identity to compounds. An alternative data interpretation based on linear regression analysis mutually applied to data series is also discussed. Slopes, intercepts and correlation coefficients produced by the linear regression analysis applied on pairs of very large experimental data series successfully retain information resulting from high frequency instrumental acquisition rates, obviously better defining the profiles being compared. Consequently, each type of sample or comparison between samples produces in the Cartesian space an ellipsoidal volume defined by the normal variation intervals of the slope, intercept and correlation coefficient. Distances between volumes graphically illustrates (dis)similarities between compared data. The instrumental intermediate precision had the major effect on the discrimination power of the multivariate data analysis methods. Mass spectra produced through ionization from liquid state in atmospheric pressure conditions of bulk complex mixtures resulting from extracted materials of natural origins provided an excellent data basis for multivariate analysis methods, equivalent to data resulting from chromatographic separations. The alternative evaluation of very large data series based on linear regression analysis produced information equivalent to results obtained through application of PCA an CA. Copyright © 2016 Elsevier B.V. All rights reserved.

  12. GLOBALLY ADAPTIVE QUANTILE REGRESSION WITH ULTRA-HIGH DIMENSIONAL DATA

    PubMed Central

    Zheng, Qi; Peng, Limin; He, Xuming

    2015-01-01

    Quantile regression has become a valuable tool to analyze heterogeneous covaraite-response associations that are often encountered in practice. The development of quantile regression methodology for high dimensional covariates primarily focuses on examination of model sparsity at a single or multiple quantile levels, which are typically prespecified ad hoc by the users. The resulting models may be sensitive to the specific choices of the quantile levels, leading to difficulties in interpretation and erosion of confidence in the results. In this article, we propose a new penalization framework for quantile regression in the high dimensional setting. We employ adaptive L1 penalties, and more importantly, propose a uniform selector of the tuning parameter for a set of quantile levels to avoid some of the potential problems with model selection at individual quantile levels. Our proposed approach achieves consistent shrinkage of regression quantile estimates across a continuous range of quantiles levels, enhancing the flexibility and robustness of the existing penalized quantile regression methods. Our theoretical results include the oracle rate of uniform convergence and weak convergence of the parameter estimators. We also use numerical studies to confirm our theoretical findings and illustrate the practical utility of our proposal. PMID:26604424

  13. How many stakes are required to measure the mass balance of a glacier?

    USGS Publications Warehouse

    Fountain, A.G.; Vecchia, A.

    1999-01-01

    Glacier mass balance is estimated for South Cascade Glacier and Maclure Glacier using a one-dimensional regression of mass balance with altitude as an alternative to the traditional approach of contouring mass balance values. One attractive feature of regression is that it can be applied to sparse data sets where contouring is not possible and can provide an objective error of the resulting estimate. Regression methods yielded mass balance values equivalent to contouring methods. The effect of the number of mass balance measurements on the final value for the glacier showed that sample sizes as small as five stakes provided reasonable estimates, although the error estimates were greater than for larger sample sizes. Different spatial patterns of measurement locations showed no appreciable influence on the final value as long as different surface altitudes were intermittently sampled over the altitude range of the glacier. Two different regression equations were examined, a quadratic, and a piecewise linear spline, and comparison of results showed little sensitivity to the type of equation. These results point to the dominant effect of the gradient of mass balance with altitude of alpine glaciers compared to transverse variations. The number of mass balance measurements required to determine the glacier balance appears to be scale invariant for small glaciers and five to ten stakes are sufficient.

  14. Fuel Regression Rate Behavior of CAMUI Hybrid Rocket

    NASA Astrophysics Data System (ADS)

    Kaneko, Yudai; Itoh, Mitsunori; Kakikura, Akihito; Mori, Kazuhiro; Uejima, Kenta; Nakashima, Takuji; Wakita, Masashi; Totani, Tsuyoshi; Oshima, Nobuyuki; Nagata, Harunori

    A series of static firing tests was conducted to investigate the fuel regression characteristics of a Cascaded Multistage Impinging-jet (CAMUI) type hybrid rocket motor. A CAMUI type hybrid rocket uses the combination of liquid oxygen and a fuel grain made of polyethylene as a propellant. The collision distance divided by the port diameter, H/D, was varied to investigate the effect of the grain geometry on the fuel regression rate. As a result, the H/D geometry has little effect on the regression rate near the stagnation point, where the heat transfer coefficient is high. On the contrary, the fuel regression rate decreases near the circumference of the forward-end face and the backward-end face of fuel blocks. Besides the experimental approaches, a method of computational fluid dynamics clarified the heat transfer distribution on the grain surface with various H/D geometries. The calculation shows the decrease of the flow velocity due to the increase of H/D on the area where the fuel regression rate decreases with the increase of H/D. To estimate the exact fuel consumption, which is necessary to design a fuel grain, real-time measurement by an ultrasonic pulse-echo method was performed.

  15. Optimized multiple linear mappings for single image super-resolution

    NASA Astrophysics Data System (ADS)

    Zhang, Kaibing; Li, Jie; Xiong, Zenggang; Liu, Xiuping; Gao, Xinbo

    2017-12-01

    Learning piecewise linear regression has been recognized as an effective way for example learning-based single image super-resolution (SR) in literature. In this paper, we employ an expectation-maximization (EM) algorithm to further improve the SR performance of our previous multiple linear mappings (MLM) based SR method. In the training stage, the proposed method starts with a set of linear regressors obtained by the MLM-based method, and then jointly optimizes the clustering results and the low- and high-resolution subdictionary pairs for regression functions by using the metric of the reconstruction errors. In the test stage, we select the optimal regressor for SR reconstruction by accumulating the reconstruction errors of m-nearest neighbors in the training set. Thorough experimental results carried on six publicly available datasets demonstrate that the proposed SR method can yield high-quality images with finer details and sharper edges in terms of both quantitative and perceptual image quality assessments.

  16. Determinant of securitization asset pricing in Malaysia

    NASA Astrophysics Data System (ADS)

    Bakri, M. H.; Ali, R.; Ismail, S.; Sufian, F.; Baharom, A. H.

    2014-12-01

    Malaysian firms have been reported involve in Asset Back Securities since 1986s where Cagamas is a pioneer. This research aims to examine the factor influencing primary market spread. Least square method and regression analysis are applied for the study period 2004-2012. The result shows one determinants in internal regression model and three determinants in external regression influence or contribute to the primary market spread and are statistically significant in developing the securitization in Malaysia. It can be concluded that transaction size significantly contribute to the determinant primary market spread in internal regression model while liquidity, transaction size and crisis is significant in both regression model. From five hypotheses, three hypotheses support that the determinants have a relationship with primary market spread.

  17. BAYESIAN LARGE-SCALE MULTIPLE REGRESSION WITH SUMMARY STATISTICS FROM GENOME-WIDE ASSOCIATION STUDIES1

    PubMed Central

    Zhu, Xiang; Stephens, Matthew

    2017-01-01

    Bayesian methods for large-scale multiple regression provide attractive approaches to the analysis of genome-wide association studies (GWAS). For example, they can estimate heritability of complex traits, allowing for both polygenic and sparse models; and by incorporating external genomic data into the priors, they can increase power and yield new biological insights. However, these methods require access to individual genotypes and phenotypes, which are often not easily available. Here we provide a framework for performing these analyses without individual-level data. Specifically, we introduce a “Regression with Summary Statistics” (RSS) likelihood, which relates the multiple regression coefficients to univariate regression results that are often easily available. The RSS likelihood requires estimates of correlations among covariates (SNPs), which also can be obtained from public databases. We perform Bayesian multiple regression analysis by combining the RSS likelihood with previously proposed prior distributions, sampling posteriors by Markov chain Monte Carlo. In a wide range of simulations RSS performs similarly to analyses using the individual data, both for estimating heritability and detecting associations. We apply RSS to a GWAS of human height that contains 253,288 individuals typed at 1.06 million SNPs, for which analyses of individual-level data are practically impossible. Estimates of heritability (52%) are consistent with, but more precise, than previous results using subsets of these data. We also identify many previously unreported loci that show evidence for association with height in our analyses. Software is available at https://github.com/stephenslab/rss. PMID:29399241

  18. Effects of past and recent blood pressure and cholesterol level on coronary heart disease and stroke mortality, accounting for measurement error.

    PubMed

    Boshuizen, Hendriek C; Lanti, Mariapaola; Menotti, Alessandro; Moschandreas, Joanna; Tolonen, Hanna; Nissinen, Aulikki; Nedeljkovic, Srecko; Kafatos, Anthony; Kromhout, Daan

    2007-02-15

    The authors aimed to quantify the effects of current systolic blood pressure (SBP) and serum total cholesterol on the risk of mortality in comparison with SBP or serum cholesterol 25 years previously, taking measurement error into account. The authors reanalyzed 35-year follow-up data on mortality due to coronary heart disease and stroke among subjects aged 65 years or more from nine cohorts of the Seven Countries Study. The two-step method of Tsiatis et al. (J Am Stat Assoc 1995;90:27-37) was used to adjust for regression dilution bias, and results were compared with those obtained using more commonly applied methods of adjustment for regression dilution bias. It was found that the commonly used univariate adjustment for regression dilution bias overestimates the effects of both SBP and cholesterol compared with multivariate methods. Also, the two-step method makes better use of the information available, resulting in smaller confidence intervals. Results comparing recent and past exposure indicated that past SBP is more important than recent SBP in terms of its effect on coronary heart disease mortality, while both recent and past values seem to be important for effects of cholesterol on coronary heart disease mortality and effects of SBP on stroke mortality. Associations between serum cholesterol concentration and risk of stroke mortality are weak.

  19. Locally linear regression for pose-invariant face recognition.

    PubMed

    Chai, Xiujuan; Shan, Shiguang; Chen, Xilin; Gao, Wen

    2007-07-01

    The variation of facial appearance due to the viewpoint (/pose) degrades face recognition systems considerably, which is one of the bottlenecks in face recognition. One of the possible solutions is generating virtual frontal view from any given nonfrontal view to obtain a virtual gallery/probe face. Following this idea, this paper proposes a simple, but efficient, novel locally linear regression (LLR) method, which generates the virtual frontal view from a given nonfrontal face image. We first justify the basic assumption of the paper that there exists an approximate linear mapping between a nonfrontal face image and its frontal counterpart. Then, by formulating the estimation of the linear mapping as a prediction problem, we present the regression-based solution, i.e., globally linear regression. To improve the prediction accuracy in the case of coarse alignment, LLR is further proposed. In LLR, we first perform dense sampling in the nonfrontal face image to obtain many overlapped local patches. Then, the linear regression technique is applied to each small patch for the prediction of its virtual frontal patch. Through the combination of all these patches, the virtual frontal view is generated. The experimental results on the CMU PIE database show distinct advantage of the proposed method over Eigen light-field method.

  20. Load forecast method of electric vehicle charging station using SVR based on GA-PSO

    NASA Astrophysics Data System (ADS)

    Lu, Kuan; Sun, Wenxue; Ma, Changhui; Yang, Shenquan; Zhu, Zijian; Zhao, Pengfei; Zhao, Xin; Xu, Nan

    2017-06-01

    This paper presents a Support Vector Regression (SVR) method for electric vehicle (EV) charging station load forecast based on genetic algorithm (GA) and particle swarm optimization (PSO). Fuzzy C-Means (FCM) clustering is used to establish similar day samples. GA is used for global parameter searching and PSO is used for a more accurately local searching. Load forecast is then regressed using SVR. The practical load data of an EV charging station were taken to illustrate the proposed method. The result indicates an obvious improvement in the forecasting accuracy compared with SVRs based on PSO and GA exclusively.

  1. Iterative integral parameter identification of a respiratory mechanics model.

    PubMed

    Schranz, Christoph; Docherty, Paul D; Chiew, Yeong Shiong; Möller, Knut; Chase, J Geoffrey

    2012-07-18

    Patient-specific respiratory mechanics models can support the evaluation of optimal lung protective ventilator settings during ventilation therapy. Clinical application requires that the individual's model parameter values must be identified with information available at the bedside. Multiple linear regression or gradient-based parameter identification methods are highly sensitive to noise and initial parameter estimates. Thus, they are difficult to apply at the bedside to support therapeutic decisions. An iterative integral parameter identification method is applied to a second order respiratory mechanics model. The method is compared to the commonly used regression methods and error-mapping approaches using simulated and clinical data. The clinical potential of the method was evaluated on data from 13 Acute Respiratory Distress Syndrome (ARDS) patients. The iterative integral method converged to error minima 350 times faster than the Simplex Search Method using simulation data sets and 50 times faster using clinical data sets. Established regression methods reported erroneous results due to sensitivity to noise. In contrast, the iterative integral method was effective independent of initial parameter estimations, and converged successfully in each case tested. These investigations reveal that the iterative integral method is beneficial with respect to computing time, operator independence and robustness, and thus applicable at the bedside for this clinical application.

  2. Incorporating Measurement Error from Modeled Air Pollution Exposures into Epidemiological Analyses.

    PubMed

    Samoli, Evangelia; Butland, Barbara K

    2017-12-01

    Outdoor air pollution exposures used in epidemiological studies are commonly predicted from spatiotemporal models incorporating limited measurements, temporal factors, geographic information system variables, and/or satellite data. Measurement error in these exposure estimates leads to imprecise estimation of health effects and their standard errors. We reviewed methods for measurement error correction that have been applied in epidemiological studies that use model-derived air pollution data. We identified seven cohort studies and one panel study that have employed measurement error correction methods. These methods included regression calibration, risk set regression calibration, regression calibration with instrumental variables, the simulation extrapolation approach (SIMEX), and methods under the non-parametric or parameter bootstrap. Corrections resulted in small increases in the absolute magnitude of the health effect estimate and its standard error under most scenarios. Limited application of measurement error correction methods in air pollution studies may be attributed to the absence of exposure validation data and the methodological complexity of the proposed methods. Future epidemiological studies should consider in their design phase the requirements for the measurement error correction method to be later applied, while methodological advances are needed under the multi-pollutants setting.

  3. Hypothesis Testing Using Factor Score Regression: A Comparison of Four Methods

    ERIC Educational Resources Information Center

    Devlieger, Ines; Mayer, Axel; Rosseel, Yves

    2016-01-01

    In this article, an overview is given of four methods to perform factor score regression (FSR), namely regression FSR, Bartlett FSR, the bias avoiding method of Skrondal and Laake, and the bias correcting method of Croon. The bias correcting method is extended to include a reliable standard error. The four methods are compared with each other and…

  4. Regression trees for predicting mortality in patients with cardiovascular disease: What improvement is achieved by using ensemble-based methods?

    PubMed Central

    Austin, Peter C; Lee, Douglas S; Steyerberg, Ewout W; Tu, Jack V

    2012-01-01

    In biomedical research, the logistic regression model is the most commonly used method for predicting the probability of a binary outcome. While many clinical researchers have expressed an enthusiasm for regression trees, this method may have limited accuracy for predicting health outcomes. We aimed to evaluate the improvement that is achieved by using ensemble-based methods, including bootstrap aggregation (bagging) of regression trees, random forests, and boosted regression trees. We analyzed 30-day mortality in two large cohorts of patients hospitalized with either acute myocardial infarction (N = 16,230) or congestive heart failure (N = 15,848) in two distinct eras (1999–2001 and 2004–2005). We found that both the in-sample and out-of-sample prediction of ensemble methods offered substantial improvement in predicting cardiovascular mortality compared to conventional regression trees. However, conventional logistic regression models that incorporated restricted cubic smoothing splines had even better performance. We conclude that ensemble methods from the data mining and machine learning literature increase the predictive performance of regression trees, but may not lead to clear advantages over conventional logistic regression models for predicting short-term mortality in population-based samples of subjects with cardiovascular disease. PMID:22777999

  5. Improved accuracy in quantitative laser-induced breakdown spectroscopy using sub-models

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Anderson, Ryan B.; Clegg, Samuel M.; Frydenvang, Jens

    We report that accurate quantitative analysis of diverse geologic materials is one of the primary challenges faced by the Laser-Induced Breakdown Spectroscopy (LIBS)-based ChemCam instrument on the Mars Science Laboratory (MSL) rover. The SuperCam instrument on the Mars 2020 rover, as well as other LIBS instruments developed for geochemical analysis on Earth or other planets, will face the same challenge. Consequently, part of the ChemCam science team has focused on the development of improved multivariate analysis calibrations methods. Developing a single regression model capable of accurately determining the composition of very different target materials is difficult because the response ofmore » an element’s emission lines in LIBS spectra can vary with the concentration of other elements. We demonstrate a conceptually simple “submodel” method for improving the accuracy of quantitative LIBS analysis of diverse target materials. The method is based on training several regression models on sets of targets with limited composition ranges and then “blending” these “sub-models” into a single final result. Tests of the sub-model method show improvement in test set root mean squared error of prediction (RMSEP) for almost all cases. Lastly, the sub-model method, using partial least squares regression (PLS), is being used as part of the current ChemCam quantitative calibration, but the sub-model method is applicable to any multivariate regression method and may yield similar improvements.« less

  6. Improved accuracy in quantitative laser-induced breakdown spectroscopy using sub-models

    DOE PAGES

    Anderson, Ryan B.; Clegg, Samuel M.; Frydenvang, Jens; ...

    2016-12-15

    We report that accurate quantitative analysis of diverse geologic materials is one of the primary challenges faced by the Laser-Induced Breakdown Spectroscopy (LIBS)-based ChemCam instrument on the Mars Science Laboratory (MSL) rover. The SuperCam instrument on the Mars 2020 rover, as well as other LIBS instruments developed for geochemical analysis on Earth or other planets, will face the same challenge. Consequently, part of the ChemCam science team has focused on the development of improved multivariate analysis calibrations methods. Developing a single regression model capable of accurately determining the composition of very different target materials is difficult because the response ofmore » an element’s emission lines in LIBS spectra can vary with the concentration of other elements. We demonstrate a conceptually simple “submodel” method for improving the accuracy of quantitative LIBS analysis of diverse target materials. The method is based on training several regression models on sets of targets with limited composition ranges and then “blending” these “sub-models” into a single final result. Tests of the sub-model method show improvement in test set root mean squared error of prediction (RMSEP) for almost all cases. Lastly, the sub-model method, using partial least squares regression (PLS), is being used as part of the current ChemCam quantitative calibration, but the sub-model method is applicable to any multivariate regression method and may yield similar improvements.« less

  7. Determination of water pH using absorption-based optical sensors: evaluation of different calculation methods

    NASA Astrophysics Data System (ADS)

    Wang, Hongliang; Liu, Baohua; Ding, Zhongjun; Wang, Xiangxin

    2017-02-01

    Absorption-based optical sensors have been developed for the determination of water pH. In this paper, based on the preparation of a transparent sol-gel thin film with a phenol red (PR) indicator, several calculation methods, including simple linear regression analysis, quadratic regression analysis and dual-wavelength absorbance ratio analysis, were used to calculate water pH. Results of MSSRR show that dual-wavelength absorbance ratio analysis can improve the calculation accuracy of water pH in long-term measurement.

  8. Geodesic regression for image time-series.

    PubMed

    Niethammer, Marc; Huang, Yang; Vialard, François-Xavier

    2011-01-01

    Registration of image-time series has so far been accomplished (i) by concatenating registrations between image pairs, (ii) by solving a joint estimation problem resulting in piecewise geodesic paths between image pairs, (iii) by kernel based local averaging or (iv) by augmenting the joint estimation with additional temporal irregularity penalties. Here, we propose a generative model extending least squares linear regression to the space of images by using a second-order dynamic formulation for image registration. Unlike previous approaches, the formulation allows for a compact representation of an approximation to the full spatio-temporal trajectory through its initial values. The method also opens up possibilities to design image-based approximation algorithms. The resulting optimization problem is solved using an adjoint method.

  9. [New method of mixed gas infrared spectrum analysis based on SVM].

    PubMed

    Bai, Peng; Xie, Wen-Jun; Liu, Jun-Hua

    2007-07-01

    A new method of infrared spectrum analysis based on support vector machine (SVM) for mixture gas was proposed. The kernel function in SVM was used to map the seriously overlapping absorption spectrum into high-dimensional space, and after transformation, the high-dimensional data could be processed in the original space, so the regression calibration model was established, then the regression calibration model with was applied to analyze the concentration of component gas. Meanwhile it was proved that the regression calibration model with SVM also could be used for component recognition of mixture gas. The method was applied to the analysis of different data samples. Some factors such as scan interval, range of the wavelength, kernel function and penalty coefficient C that affect the model were discussed. Experimental results show that the component concentration maximal Mean AE is 0.132%, and the component recognition accuracy is higher than 94%. The problems of overlapping absorption spectrum, using the same method for qualitative and quantitative analysis, and limit number of training sample, were solved. The method could be used in other mixture gas infrared spectrum analyses, promising theoretic and application values.

  10. Determination of boiling point of petrochemicals by gas chromatography-mass spectrometry and multivariate regression analysis of structural activity relationship.

    PubMed

    Fakayode, Sayo O; Mitchell, Breanna S; Pollard, David A

    2014-08-01

    Accurate understanding of analyte boiling points (BP) is of critical importance in gas chromatographic (GC) separation and crude oil refinery operation in petrochemical industries. This study reported the first combined use of GC separation and partial-least-square (PLS1) multivariate regression analysis of petrochemical structural activity relationship (SAR) for accurate BP determination of two commercially available (D3710 and MA VHP) calibration gas mix samples. The results of the BP determination using PLS1 multivariate regression were further compared with the results of traditional simulated distillation method of BP determination. The developed PLS1 regression was able to correctly predict analytes BP in D3710 and MA VHP calibration gas mix samples, with a root-mean-square-%-relative-error (RMS%RE) of 6.4%, and 10.8% respectively. In contrast, the overall RMS%RE of 32.9% and 40.4%, respectively obtained for BP determination in D3710 and MA VHP using a traditional simulated distillation method were approximately four times larger than the corresponding RMS%RE of BP prediction using MRA, demonstrating the better predictive ability of MRA. The reported method is rapid, robust, and promising, and can be potentially used routinely for fast analysis, pattern recognition, and analyte BP determination in petrochemical industries. Copyright © 2014 Elsevier B.V. All rights reserved.

  11. Modeling Tetanus Neonatorum case using the regression of negative binomial and zero-inflated negative binomial

    NASA Astrophysics Data System (ADS)

    Amaliana, Luthfatul; Sa'adah, Umu; Wayan Surya Wardhani, Ni

    2017-12-01

    Tetanus Neonatorum is an infectious disease that can be prevented by immunization. The number of Tetanus Neonatorum cases in East Java Province is the highest in Indonesia until 2015. Tetanus Neonatorum data contain over dispersion and big enough proportion of zero-inflation. Negative Binomial (NB) regression is an alternative method when over dispersion happens in Poisson regression. However, the data containing over dispersion and zero-inflation are more appropriately analyzed by using Zero-Inflated Negative Binomial (ZINB) regression. The purpose of this study are: (1) to model Tetanus Neonatorum cases in East Java Province with 71.05 percent proportion of zero-inflation by using NB and ZINB regression, (2) to obtain the best model. The result of this study indicates that ZINB is better than NB regression with smaller AIC.

  12. Forecasting daily patient volumes in the emergency department.

    PubMed

    Jones, Spencer S; Thomas, Alun; Evans, R Scott; Welch, Shari J; Haug, Peter J; Snow, Gregory L

    2008-02-01

    Shifts in the supply of and demand for emergency department (ED) resources make the efficient allocation of ED resources increasingly important. Forecasting is a vital activity that guides decision-making in many areas of economic, industrial, and scientific planning, but has gained little traction in the health care industry. There are few studies that explore the use of forecasting methods to predict patient volumes in the ED. The goals of this study are to explore and evaluate the use of several statistical forecasting methods to predict daily ED patient volumes at three diverse hospital EDs and to compare the accuracy of these methods to the accuracy of a previously proposed forecasting method. Daily patient arrivals at three hospital EDs were collected for the period January 1, 2005, through March 31, 2007. The authors evaluated the use of seasonal autoregressive integrated moving average, time series regression, exponential smoothing, and artificial neural network models to forecast daily patient volumes at each facility. Forecasts were made for horizons ranging from 1 to 30 days in advance. The forecast accuracy achieved by the various forecasting methods was compared to the forecast accuracy achieved when using a benchmark forecasting method already available in the emergency medicine literature. All time series methods considered in this analysis provided improved in-sample model goodness of fit. However, post-sample analysis revealed that time series regression models that augment linear regression models by accounting for serial autocorrelation offered only small improvements in terms of post-sample forecast accuracy, relative to multiple linear regression models, while seasonal autoregressive integrated moving average, exponential smoothing, and artificial neural network forecasting models did not provide consistently accurate forecasts of daily ED volumes. This study confirms the widely held belief that daily demand for ED services is characterized by seasonal and weekly patterns. The authors compared several time series forecasting methods to a benchmark multiple linear regression model. The results suggest that the existing methodology proposed in the literature, multiple linear regression based on calendar variables, is a reasonable approach to forecasting daily patient volumes in the ED. However, the authors conclude that regression-based models that incorporate calendar variables, account for site-specific special-day effects, and allow for residual autocorrelation provide a more appropriate, informative, and consistently accurate approach to forecasting daily ED patient volumes.

  13. Method and Excel VBA Algorithm for Modeling Master Recession Curve Using Trigonometry Approach.

    PubMed

    Posavec, Kristijan; Giacopetti, Marco; Materazzi, Marco; Birk, Steffen

    2017-11-01

    A new method was developed and implemented into an Excel Visual Basic for Applications (VBAs) algorithm utilizing trigonometry laws in an innovative way to overlap recession segments of time series and create master recession curves (MRCs). Based on a trigonometry approach, the algorithm horizontally translates succeeding recession segments of time series, placing their vertex, that is, the highest recorded value of each recession segment, directly onto the appropriate connection line defined by measurement points of a preceding recession segment. The new method and algorithm continues the development of methods and algorithms for the generation of MRC, where the first published method was based on a multiple linear/nonlinear regression model approach (Posavec et al. 2006). The newly developed trigonometry-based method was tested on real case study examples and compared with the previously published multiple linear/nonlinear regression model-based method. The results show that in some cases, that is, for some time series, the trigonometry-based method creates narrower overlaps of the recession segments, resulting in higher coefficients of determination R 2 , while in other cases the multiple linear/nonlinear regression model-based method remains superior. The Excel VBA algorithm for modeling MRC using the trigonometry approach is implemented into a spreadsheet tool (MRCTools v3.0 written by and available from Kristijan Posavec, Zagreb, Croatia) containing the previously published VBA algorithms for MRC generation and separation. All algorithms within the MRCTools v3.0 are open access and available free of charge, supporting the idea of running science on available, open, and free of charge software. © 2017, National Ground Water Association.

  14. a Geographic Weighted Regression for Rural Highways Crashes Modelling Using the Gaussian and Tricube Kernels: a Case Study of USA Rural Highways

    NASA Astrophysics Data System (ADS)

    Aghayari, M.; Pahlavani, P.; Bigdeli, B.

    2017-09-01

    Based on world health organization (WHO) report, driving incidents are counted as one of the eight initial reasons for death in the world. The purpose of this paper is to develop a method for regression on effective parameters of highway crashes. In the traditional methods, it was assumed that the data are completely independent and environment is homogenous while the crashes are spatial events which are occurring in geographic space and crashes have spatial data. Spatial data have spatial features such as spatial autocorrelation and spatial non-stationarity in a way working with them is going to be a bit difficult. The proposed method has implemented on a set of records of fatal crashes that have been occurred in highways connecting eight east states of US. This data have been recorded between the years 2007 and 2009. In this study, we have used GWR method with two Gaussian and Tricube kernels. The Number of casualties has been considered as dependent variable and number of persons in crash, road alignment, number of lanes, pavement type, surface condition, road fence, light condition, vehicle type, weather, drunk driver, speed limitation, harmful event, road profile, and junction type have been considered as explanatory variables according to previous studies in using GWR method. We have compered the results of implementation with OLS method. Results showed that R2 for OLS method is 0.0654 and for the proposed method is 0.9196 that implies the proposed GWR is better method for regression in rural highway crashes.

  15. Advantages of the net benefit regression framework for economic evaluations of interventions in the workplace: a case study of the cost-effectiveness of a collaborative mental health care program for people receiving short-term disability benefits for psychiatric disorders.

    PubMed

    Hoch, Jeffrey S; Dewa, Carolyn S

    2014-04-01

    Economic evaluations commonly accompany trials of new treatments or interventions; however, regression methods and their corresponding advantages for the analysis of cost-effectiveness data are not well known. To illustrate regression-based economic evaluation, we present a case study investigating the cost-effectiveness of a collaborative mental health care program for people receiving short-term disability benefits for psychiatric disorders. We implement net benefit regression to illustrate its strengths and limitations. Net benefit regression offers a simple option for cost-effectiveness analyses of person-level data. By placing economic evaluation in a regression framework, regression-based techniques can facilitate the analysis and provide simple solutions to commonly encountered challenges. Economic evaluations of person-level data (eg, from a clinical trial) should use net benefit regression to facilitate analysis and enhance results.

  16. Estimation of diffusion coefficients from voltammetric signals by support vector and gaussian process regression

    PubMed Central

    2014-01-01

    Background Support vector regression (SVR) and Gaussian process regression (GPR) were used for the analysis of electroanalytical experimental data to estimate diffusion coefficients. Results For simulated cyclic voltammograms based on the EC, Eqr, and EqrC mechanisms these regression algorithms in combination with nonlinear kernel/covariance functions yielded diffusion coefficients with higher accuracy as compared to the standard approach of calculating diffusion coefficients relying on the Nicholson-Shain equation. The level of accuracy achieved by SVR and GPR is virtually independent of the rate constants governing the respective reaction steps. Further, the reduction of high-dimensional voltammetric signals by manual selection of typical voltammetric peak features decreased the performance of both regression algorithms compared to a reduction by downsampling or principal component analysis. After training on simulated data sets, diffusion coefficients were estimated by the regression algorithms for experimental data comprising voltammetric signals for three organometallic complexes. Conclusions Estimated diffusion coefficients closely matched the values determined by the parameter fitting method, but reduced the required computational time considerably for one of the reaction mechanisms. The automated processing of voltammograms according to the regression algorithms yields better results than the conventional analysis of peak-related data. PMID:24987463

  17. Estimation of 1RM for knee extension based on the maximal isometric muscle strength and body composition.

    PubMed

    Kanada, Yoshikiyo; Sakurai, Hiroaki; Sugiura, Yoshito; Arai, Tomoaki; Koyama, Soichiro; Tanabe, Shigeo

    2017-11-01

    [Purpose] To create a regression formula in order to estimate 1RM for knee extensors, based on the maximal isometric muscle strength measured using a hand-held dynamometer and data regarding the body composition. [Subjects and Methods] Measurement was performed in 21 healthy males in their twenties to thirties. Single regression analysis was performed, with measurement values representing 1RM and the maximal isometric muscle strength as dependent and independent variables, respectively. Furthermore, multiple regression analysis was performed, with data regarding the body composition incorporated as another independent variable, in addition to the maximal isometric muscle strength. [Results] Through single regression analysis with the maximal isometric muscle strength as an independent variable, the following regression formula was created: 1RM (kg)=0.714 + 0.783 × maximal isometric muscle strength (kgf). On multiple regression analysis, only the total muscle mass was extracted. [Conclusion] A highly accurate regression formula to estimate 1RM was created based on both the maximal isometric muscle strength and body composition. Using a hand-held dynamometer and body composition analyzer, it was possible to measure these items in a short time, and obtain clinically useful results.

  18. Use of statistical study methods for the analysis of the results of the imitation modeling of radiation transfer

    NASA Astrophysics Data System (ADS)

    Alekseenko, M. A.; Gendrina, I. Yu.

    2017-11-01

    Recently, due to the abundance of various types of observational data in the systems of vision through the atmosphere and the need for their processing, the use of various methods of statistical research in the study of such systems as correlation-regression analysis, dynamic series, variance analysis, etc. is actual. We have attempted to apply elements of correlation-regression analysis for the study and subsequent prediction of the patterns of radiation transfer in these systems same as in the construction of radiation models of the atmosphere. In this paper, we present some results of statistical processing of the results of numerical simulation of the characteristics of vision systems through the atmosphere obtained with the help of a special software package.1

  19. Nonlinear Trimodal Regression Analysis of Radiodensitometric Distributions to Quantify Sarcopenic and Sequelae Muscle Degeneration

    PubMed Central

    Árnadóttir, Í.; Gíslason, M. K.; Carraro, U.

    2016-01-01

    Muscle degeneration has been consistently identified as an independent risk factor for high mortality in both aging populations and individuals suffering from neuromuscular pathology or injury. While there is much extant literature on its quantification and correlation to comorbidities, a quantitative gold standard for analyses in this regard remains undefined. Herein, we hypothesize that rigorously quantifying entire radiodensitometric distributions elicits more muscle quality information than average values reported in extant methods. This study reports the development and utility of a nonlinear trimodal regression analysis method utilized on radiodensitometric distributions of upper leg muscles from CT scans of a healthy young adult, a healthy elderly subject, and a spinal cord injury patient. The method was then employed with a THA cohort to assess pre- and postsurgical differences in their healthy and operative legs. Results from the initial representative models elicited high degrees of correlation to HU distributions, and regression parameters highlighted physiologically evident differences between subjects. Furthermore, results from the THA cohort echoed physiological justification and indicated significant improvements in muscle quality in both legs following surgery. Altogether, these results highlight the utility of novel parameters from entire HU distributions that could provide insight into the optimal quantification of muscle degeneration. PMID:28115982

  20. Complications after procedures of photorefractive keratectomy

    NASA Astrophysics Data System (ADS)

    Gierek-Ciaciura, Stanislawa

    1998-10-01

    Purpose: The aim of this study was to investigate the saveness of the PRK procedures. Material and method: 151 eyes after PRK for correction of myopia and 112 after PRK for correction of myopic astigmatism were examined. All PRK procedures have been performed with an excimer laser manufactured by Aesculap Meditec. Results: Haze, regression, decentration infection and overcorrection were found. Conclusions: The most often complication is regression. Corneal inflammation in the early postoperative period may cause the regression or haze. The greater corrected refractive error the greater haze degree. Haze decreases with time.

  1. Nonlinear least squares regression for single image scanning electron microscope signal-to-noise ratio estimation.

    PubMed

    Sim, K S; Norhisham, S

    2016-11-01

    A new method based on nonlinear least squares regression (NLLSR) is formulated to estimate signal-to-noise ratio (SNR) of scanning electron microscope (SEM) images. The estimation of SNR value based on NLLSR method is compared with the three existing methods of nearest neighbourhood, first-order interpolation and the combination of both nearest neighbourhood and first-order interpolation. Samples of SEM images with different textures, contrasts and edges were used to test the performance of NLLSR method in estimating the SNR values of the SEM images. It is shown that the NLLSR method is able to produce better estimation accuracy as compared to the other three existing methods. According to the SNR results obtained from the experiment, the NLLSR method is able to produce approximately less than 1% of SNR error difference as compared to the other three existing methods. © 2016 The Authors Journal of Microscopy © 2016 Royal Microscopical Society.

  2. Development and validation of multivariate calibration methods for simultaneous estimation of Paracetamol, Enalapril maleate and hydrochlorothiazide in pharmaceutical dosage form

    NASA Astrophysics Data System (ADS)

    Singh, Veena D.; Daharwal, Sanjay J.

    2017-01-01

    Three multivariate calibration spectrophotometric methods were developed for simultaneous estimation of Paracetamol (PARA), Enalapril maleate (ENM) and Hydrochlorothiazide (HCTZ) in tablet dosage form; namely multi-linear regression calibration (MLRC), trilinear regression calibration method (TLRC) and classical least square (CLS) method. The selectivity of the proposed methods were studied by analyzing the laboratory prepared ternary mixture and successfully applied in their combined dosage form. The proposed methods were validated as per ICH guidelines and good accuracy; precision and specificity were confirmed within the concentration range of 5-35 μg mL- 1, 5-40 μg mL- 1 and 5-40 μg mL- 1of PARA, HCTZ and ENM, respectively. The results were statistically compared with reported HPLC method. Thus, the proposed methods can be effectively useful for the routine quality control analysis of these drugs in commercial tablet dosage form.

  3. New strategy for determination of anthocyanins, polyphenols and antioxidant capacity of Brassica oleracea liquid extract using infrared spectroscopies and multivariate regression

    NASA Astrophysics Data System (ADS)

    de Oliveira, Isadora R. N.; Roque, Jussara V.; Maia, Mariza P.; Stringheta, Paulo C.; Teófilo, Reinaldo F.

    2018-04-01

    A new method was developed to determine the antioxidant properties of red cabbage extract (Brassica oleracea) by mid (MID) and near (NIR) infrared spectroscopies and partial least squares (PLS) regression. A 70% (v/v) ethanolic extract of red cabbage was concentrated to 9° Brix and further diluted (12 to 100%) in water. The dilutions were used as external standards for the building of PLS models. For the first time, this strategy was applied for building multivariate regression models. Reference analyses and spectral data were obtained from diluted extracts. The determinate properties were total and monomeric anthocyanins, total polyphenols and antioxidant capacity by ABTS (2,2-azino-bis(3-ethyl-benzothiazoline-6-sulfonate)) and DPPH (2,2-diphenyl-1-picrylhydrazyl) methods. Ordered predictors selection (OPS) and genetic algorithm (GA) were used for feature selection before PLS regression (PLS-1). In addition, a PLS-2 regression was applied to all properties simultaneously. PLS-1 models provided more predictive models than did PLS-2 regression. PLS-OPS and PLS-GA models presented excellent prediction results with a correlation coefficient higher than 0.98. However, the best models were obtained using PLS and variable selection with the OPS algorithm and the models based on NIR spectra were considered more predictive for all properties. Then, these models provided a simple, rapid and accurate method for determination of red cabbage extract antioxidant properties and its suitability for use in the food industry.

  4. Methods for estimating peak-flow frequencies at ungaged sites in Montana based on data through water year 2011: Chapter F in Montana StreamStats

    USGS Publications Warehouse

    Sando, Roy; Sando, Steven K.; McCarthy, Peter M.; Dutton, DeAnn M.

    2016-04-05

    The U.S. Geological Survey (USGS), in cooperation with the Montana Department of Natural Resources and Conservation, completed a study to update methods for estimating peak-flow frequencies at ungaged sites in Montana based on peak-flow data at streamflow-gaging stations through water year 2011. The methods allow estimation of peak-flow frequencies (that is, peak-flow magnitudes, in cubic feet per second, associated with annual exceedance probabilities of 66.7, 50, 42.9, 20, 10, 4, 2, 1, 0.5, and 0.2 percent) at ungaged sites. The annual exceedance probabilities correspond to 1.5-, 2-, 2.33-, 5-, 10-, 25-, 50-, 100-, 200-, and 500-year recurrence intervals, respectively.Regional regression analysis is a primary focus of Chapter F of this Scientific Investigations Report, and regression equations for estimating peak-flow frequencies at ungaged sites in eight hydrologic regions in Montana are presented. The regression equations are based on analysis of peak-flow frequencies and basin characteristics at 537 streamflow-gaging stations in or near Montana and were developed using generalized least squares regression or weighted least squares regression.All of the data used in calculating basin characteristics that were included as explanatory variables in the regression equations were developed for and are available through the USGS StreamStats application (http://water.usgs.gov/osw/streamstats/) for Montana. StreamStats is a Web-based geographic information system application that was created by the USGS to provide users with access to an assortment of analytical tools that are useful for water-resource planning and management. The primary purpose of the Montana StreamStats application is to provide estimates of basin characteristics and streamflow characteristics for user-selected ungaged sites on Montana streams. The regional regression equations presented in this report chapter can be conveniently solved using the Montana StreamStats application.Selected results from this study were compared with results of previous studies. For most hydrologic regions, the regression equations reported for this study had lower mean standard errors of prediction (in percent) than the previously reported regression equations for Montana. The equations presented for this study are considered to be an improvement on the previously reported equations primarily because this study (1) included 13 more years of peak-flow data; (2) included 35 more streamflow-gaging stations than previous studies; (3) used a detailed geographic information system (GIS)-based definition of the regulation status of streamflow-gaging stations, which allowed better determination of the unregulated peak-flow records that are appropriate for use in the regional regression analysis; (4) included advancements in GIS and remote-sensing technologies, which allowed more convenient calculation of basin characteristics and investigation of many more candidate basin characteristics; and (5) included advancements in computational and analytical methods, which allowed more thorough and consistent data analysis.This report chapter also presents other methods for estimating peak-flow frequencies at ungaged sites. Two methods for estimating peak-flow frequencies at ungaged sites located on the same streams as streamflow-gaging stations are described. Additionally, envelope curves relating maximum recorded annual peak flows to contributing drainage area for each of the eight hydrologic regions in Montana are presented and compared to a national envelope curve. In addition to providing general information on characteristics of large peak flows, the regional envelope curves can be used to assess the reasonableness of peak-flow frequency estimates determined using the regression equations.

  5. Carbon dioxide stripping in aquaculture -- part III: model verification

    USGS Publications Warehouse

    Colt, John; Watten, Barnaby; Pfeiffer, Tim

    2012-01-01

    Based on conventional mass transfer models developed for oxygen, the use of the non-linear ASCE method, 2-point method, and one parameter linear-regression method were evaluated for carbon dioxide stripping data. For values of KLaCO2 < approximately 1.5/h, the 2-point or ASCE method are a good fit to experimental data, but the fit breaks down at higher values of KLaCO2. How to correct KLaCO2 for gas phase enrichment remains to be determined. The one-parameter linear regression model was used to vary the C*CO2 over the test, but it did not result in a better fit to the experimental data when compared to the ASCE or fixed C*CO2 assumptions.

  6. The relationship between venture capital investment and macro economic variables via statistical computation method

    NASA Astrophysics Data System (ADS)

    Aygunes, Gunes

    2017-07-01

    The objective of this paper is to survey and determine the macroeconomic factors affecting the level of venture capital (VC) investments in a country. The literary depends on venture capitalists' quality and countries' venture capital investments. The aim of this paper is to give relationship between venture capital investment and macro economic variables via statistical computation method. We investigate the countries and macro economic variables. By using statistical computation method, we derive correlation between venture capital investments and macro economic variables. According to method of logistic regression model (logit regression or logit model), macro economic variables are correlated with each other in three group. Venture capitalists regard correlations as a indicator. Finally, we give correlation matrix of our results.

  7. Optimization of isotherm models for pesticide sorption on biopolymer-nanoclay composite by error analysis.

    PubMed

    Narayanan, Neethu; Gupta, Suman; Gajbhiye, V T; Manjaiah, K M

    2017-04-01

    A carboxy methyl cellulose-nano organoclay (nano montmorillonite modified with 35-45 wt % dimethyl dialkyl (C 14 -C 18 ) amine (DMDA)) composite was prepared by solution intercalation method. The prepared composite was characterized by infrared spectroscopy (FTIR), X-Ray diffraction spectroscopy (XRD) and scanning electron microscopy (SEM). The composite was utilized for its pesticide sorption efficiency for atrazine, imidacloprid and thiamethoxam. The sorption data was fitted into Langmuir and Freundlich isotherms using linear and non linear methods. The linear regression method suggested best fitting of sorption data into Type II Langmuir and Freundlich isotherms. In order to avoid the bias resulting from linearization, seven different error parameters were also analyzed by non linear regression method. The non linear error analysis suggested that the sorption data fitted well into Langmuir model rather than in Freundlich model. The maximum sorption capacity, Q 0 (μg/g) was given by imidacloprid (2000) followed by thiamethoxam (1667) and atrazine (1429). The study suggests that the degree of determination of linear regression alone cannot be used for comparing the best fitting of Langmuir and Freundlich models and non-linear error analysis needs to be done to avoid inaccurate results. Copyright © 2017 Elsevier Ltd. All rights reserved.

  8. An open-access CMIP5 pattern library for temperature and precipitation: Description and methodology

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lynch, Cary D.; Hartin, Corinne A.; Bond-Lamberty, Benjamin

    Pattern scaling is used to efficiently emulate general circulation models and explore uncertainty in climate projections under multiple forcing scenarios. Pattern scaling methods assume that local climate changes scale with a global mean temperature increase, allowing for spatial patterns to be generated for multiple models for any future emission scenario. For uncertainty quantification and probabilistic statistical analysis, a library of patterns with descriptive statistics for each file would be beneficial, but such a library does not presently exist. Of the possible techniques used to generate patterns, the two most prominent are the delta and least squared regression methods. We exploremore » the differences and statistical significance between patterns generated by each method and assess performance of the generated patterns across methods and scenarios. Differences in patterns across seasons between methods and epochs were largest in high latitudes (60-90°N/S). Bias and mean errors between modeled and pattern predicted output from the linear regression method were smaller than patterns generated by the delta method. Across scenarios, differences in the linear regression method patterns were more statistically significant, especially at high latitudes. We found that pattern generation methodologies were able to approximate the forced signal of change to within ≤ 0.5°C, but choice of pattern generation methodology for pattern scaling purposes should be informed by user goals and criteria. As a result, this paper describes our library of least squared regression patterns from all CMIP5 models for temperature and precipitation on an annual and sub-annual basis, along with the code used to generate these patterns.« less

  9. An open-access CMIP5 pattern library for temperature and precipitation: Description and methodology

    DOE PAGES

    Lynch, Cary D.; Hartin, Corinne A.; Bond-Lamberty, Benjamin; ...

    2017-05-15

    Pattern scaling is used to efficiently emulate general circulation models and explore uncertainty in climate projections under multiple forcing scenarios. Pattern scaling methods assume that local climate changes scale with a global mean temperature increase, allowing for spatial patterns to be generated for multiple models for any future emission scenario. For uncertainty quantification and probabilistic statistical analysis, a library of patterns with descriptive statistics for each file would be beneficial, but such a library does not presently exist. Of the possible techniques used to generate patterns, the two most prominent are the delta and least squared regression methods. We exploremore » the differences and statistical significance between patterns generated by each method and assess performance of the generated patterns across methods and scenarios. Differences in patterns across seasons between methods and epochs were largest in high latitudes (60-90°N/S). Bias and mean errors between modeled and pattern predicted output from the linear regression method were smaller than patterns generated by the delta method. Across scenarios, differences in the linear regression method patterns were more statistically significant, especially at high latitudes. We found that pattern generation methodologies were able to approximate the forced signal of change to within ≤ 0.5°C, but choice of pattern generation methodology for pattern scaling purposes should be informed by user goals and criteria. As a result, this paper describes our library of least squared regression patterns from all CMIP5 models for temperature and precipitation on an annual and sub-annual basis, along with the code used to generate these patterns.« less

  10. Eye pupil detection system using an ensemble of regression forest and fast radial symmetry transform with a near infrared camera

    NASA Astrophysics Data System (ADS)

    Jeong, Mira; Nam, Jae-Yeal; Ko, Byoung Chul

    2017-09-01

    In this paper, we focus on pupil center detection in various video sequences that include head poses and changes in illumination. To detect the pupil center, we first find four eye landmarks in each eye by using cascade local regression based on a regression forest. Based on the rough location of the pupil, a fast radial symmetric transform is applied using the previously found pupil location to rearrange the fine pupil center. As the final step, the pupil displacement is estimated between the previous frame and the current frame to maintain the level of accuracy against a false locating result occurring in a particular frame. We generated a new face dataset, called Keimyung University pupil detection (KMUPD), with infrared camera. The proposed method was successfully applied to the KMUPD dataset, and the results indicate that its pupil center detection capability is better than that of other methods and with a shorter processing time.

  11. Investigation on the effect of diaphragm on the combustion characteristics of solid-fuel ramjet

    NASA Astrophysics Data System (ADS)

    Gong, Lunkun; Chen, Xiong; Yang, Haitao; Li, Weixuan; Zhou, Changsheng

    2017-10-01

    The flow field characteristics and the regression rate distribution of solid-fuel ramjet with three-hole diaphragm were investigated by numerical and experimental methods. The experimental data were obtained by burning high-density polyethylene using a connected-pipe facility to validate the numerical model and analyze the combustion efficiency of the solid-fuel ramjet. The three-dimensional code developed in the present study adopted three-order MUSCL and central difference schemes, AUSMPW + flux vector splitting method, and second-order moment turbulence-chemistry model, together with k-ω shear stress transport (SST) turbulence model. The solid fuel surface temperature was calculated with fluid-solid heat coupling method. The numerical results show that strong circumferential flow exists in the region upstream of the diaphragm. The diaphragm can enhance the regression rate of the solid fuel in the region downstream of the diaphragm significantly, which mainly results from the increase of turbulent viscosity. As the diaphragm port area decreases, the regression rate of the solid fuel downstream of the diaphragm increases. The diaphragm can result in more sufficient mixing between the incoming air and fuel pyrolysis gases, while inevitably producing some pressure loss. The experimental results indicate that the effect of the diaphragm on the combustion efficiency of hydrocarbon fuels is slightly negative. It is conjectured that the diaphragm may have some positive effects on the combustion efficiency of the solid fuel with metal particles.

  12. Comparison and validation of statistical methods for predicting power outage durations in the event of hurricanes.

    PubMed

    Nateghi, Roshanak; Guikema, Seth D; Quiring, Steven M

    2011-12-01

    This article compares statistical methods for modeling power outage durations during hurricanes and examines the predictive accuracy of these methods. Being able to make accurate predictions of power outage durations is valuable because the information can be used by utility companies to plan their restoration efforts more efficiently. This information can also help inform customers and public agencies of the expected outage times, enabling better collective response planning, and coordination of restoration efforts for other critical infrastructures that depend on electricity. In the long run, outage duration estimates for future storm scenarios may help utilities and public agencies better allocate risk management resources to balance the disruption from hurricanes with the cost of hardening power systems. We compare the out-of-sample predictive accuracy of five distinct statistical models for estimating power outage duration times caused by Hurricane Ivan in 2004. The methods compared include both regression models (accelerated failure time (AFT) and Cox proportional hazard models (Cox PH)) and data mining techniques (regression trees, Bayesian additive regression trees (BART), and multivariate additive regression splines). We then validate our models against two other hurricanes. Our results indicate that BART yields the best prediction accuracy and that it is possible to predict outage durations with reasonable accuracy. © 2011 Society for Risk Analysis.

  13. Prediction models for clustered data: comparison of a random intercept and standard regression model

    PubMed Central

    2013-01-01

    Background When study data are clustered, standard regression analysis is considered inappropriate and analytical techniques for clustered data need to be used. For prediction research in which the interest of predictor effects is on the patient level, random effect regression models are probably preferred over standard regression analysis. It is well known that the random effect parameter estimates and the standard logistic regression parameter estimates are different. Here, we compared random effect and standard logistic regression models for their ability to provide accurate predictions. Methods Using an empirical study on 1642 surgical patients at risk of postoperative nausea and vomiting, who were treated by one of 19 anesthesiologists (clusters), we developed prognostic models either with standard or random intercept logistic regression. External validity of these models was assessed in new patients from other anesthesiologists. We supported our results with simulation studies using intra-class correlation coefficients (ICC) of 5%, 15%, or 30%. Standard performance measures and measures adapted for the clustered data structure were estimated. Results The model developed with random effect analysis showed better discrimination than the standard approach, if the cluster effects were used for risk prediction (standard c-index of 0.69 versus 0.66). In the external validation set, both models showed similar discrimination (standard c-index 0.68 versus 0.67). The simulation study confirmed these results. For datasets with a high ICC (≥15%), model calibration was only adequate in external subjects, if the used performance measure assumed the same data structure as the model development method: standard calibration measures showed good calibration for the standard developed model, calibration measures adapting the clustered data structure showed good calibration for the prediction model with random intercept. Conclusion The models with random intercept discriminate better than the standard model only if the cluster effect is used for predictions. The prediction model with random intercept had good calibration within clusters. PMID:23414436

  14. The Calibration of AVHRR/3 Visible Dual Gain Using Meteosat-8 as a MODIS Calibration Transfer Medium

    NASA Technical Reports Server (NTRS)

    Avey, Lance; Garber, Donald; Nguyen, Louis; Minnis, Patrick

    2007-01-01

    This viewgraph presentation reviews the NOAA-17 AVHRR visible channels calibrated against MET-8/MODIS using dual gain regression methods. The topics include: 1) Motivation; 2) Methodology; 3) Dual Gain Regression Methods; 4) Examples of Regression methods; 5) AVHRR/3 Regression Strategy; 6) Cross-Calibration Method; 7) Spectral Response Functions; 8) MET8/NOAA-17; 9) Example of gain ratio adjustment; 10) Effect of mixed low/high count FOV; 11) Monitor dual gains over time; and 12) Conclusions

  15. Principal components and iterative regression analysis of geophysical series: Application to Sunspot number (1750 2004)

    NASA Astrophysics Data System (ADS)

    Nordemann, D. J. R.; Rigozo, N. R.; de Souza Echer, M. P.; Echer, E.

    2008-11-01

    We present here an implementation of a least squares iterative regression method applied to the sine functions embedded in the principal components extracted from geophysical time series. This method seems to represent a useful improvement for the non-stationary time series periodicity quantitative analysis. The principal components determination followed by the least squares iterative regression method was implemented in an algorithm written in the Scilab (2006) language. The main result of the method is to obtain the set of sine functions embedded in the series analyzed in decreasing order of significance, from the most important ones, likely to represent the physical processes involved in the generation of the series, to the less important ones that represent noise components. Taking into account the need of a deeper knowledge of the Sun's past history and its implication to global climate change, the method was applied to the Sunspot Number series (1750-2004). With the threshold and parameter values used here, the application of the method leads to a total of 441 explicit sine functions, among which 65 were considered as being significant and were used for a reconstruction that gave a normalized mean squared error of 0.146.

  16. Visual tracking using objectness-bounding box regression and correlation filters

    NASA Astrophysics Data System (ADS)

    Mbelwa, Jimmy T.; Zhao, Qingjie; Lu, Yao; Wang, Fasheng; Mbise, Mercy

    2018-03-01

    Visual tracking is a fundamental problem in computer vision with extensive application domains in surveillance and intelligent systems. Recently, correlation filter-based tracking methods have shown a great achievement in terms of robustness, accuracy, and speed. However, such methods have a problem of dealing with fast motion (FM), motion blur (MB), illumination variation (IV), and drifting caused by occlusion (OCC). To solve this problem, a tracking method that integrates objectness-bounding box regression (O-BBR) model and a scheme based on kernelized correlation filter (KCF) is proposed. The scheme based on KCF is used to improve the tracking performance of FM and MB. For handling drift problem caused by OCC and IV, we propose objectness proposals trained in bounding box regression as prior knowledge to provide candidates and background suppression. Finally, scheme KCF as a base tracker and O-BBR are fused to obtain a state of a target object. Extensive experimental comparisons of the developed tracking method with other state-of-the-art trackers are performed on some of the challenging video sequences. Experimental comparison results show that our proposed tracking method outperforms other state-of-the-art tracking methods in terms of effectiveness, accuracy, and robustness.

  17. Semisupervised Clustering by Iterative Partition and Regression with Neuroscience Applications

    PubMed Central

    Qian, Guoqi; Wu, Yuehua; Ferrari, Davide; Qiao, Puxue; Hollande, Frédéric

    2016-01-01

    Regression clustering is a mixture of unsupervised and supervised statistical learning and data mining method which is found in a wide range of applications including artificial intelligence and neuroscience. It performs unsupervised learning when it clusters the data according to their respective unobserved regression hyperplanes. The method also performs supervised learning when it fits regression hyperplanes to the corresponding data clusters. Applying regression clustering in practice requires means of determining the underlying number of clusters in the data, finding the cluster label of each data point, and estimating the regression coefficients of the model. In this paper, we review the estimation and selection issues in regression clustering with regard to the least squares and robust statistical methods. We also provide a model selection based technique to determine the number of regression clusters underlying the data. We further develop a computing procedure for regression clustering estimation and selection. Finally, simulation studies are presented for assessing the procedure, together with analyzing a real data set on RGB cell marking in neuroscience to illustrate and interpret the method. PMID:27212939

  18. Sample size determination for logistic regression on a logit-normal distribution.

    PubMed

    Kim, Seongho; Heath, Elisabeth; Heilbrun, Lance

    2017-06-01

    Although the sample size for simple logistic regression can be readily determined using currently available methods, the sample size calculation for multiple logistic regression requires some additional information, such as the coefficient of determination ([Formula: see text]) of a covariate of interest with other covariates, which is often unavailable in practice. The response variable of logistic regression follows a logit-normal distribution which can be generated from a logistic transformation of a normal distribution. Using this property of logistic regression, we propose new methods of determining the sample size for simple and multiple logistic regressions using a normal transformation of outcome measures. Simulation studies and a motivating example show several advantages of the proposed methods over the existing methods: (i) no need for [Formula: see text] for multiple logistic regression, (ii) available interim or group-sequential designs, and (iii) much smaller required sample size.

  19. Poisson Regression Analysis of Illness and Injury Surveillance Data

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Frome E.L., Watkins J.P., Ellis E.D.

    2012-12-12

    The Department of Energy (DOE) uses illness and injury surveillance to monitor morbidity and assess the overall health of the work force. Data collected from each participating site include health events and a roster file with demographic information. The source data files are maintained in a relational data base, and are used to obtain stratified tables of health event counts and person time at risk that serve as the starting point for Poisson regression analysis. The explanatory variables that define these tables are age, gender, occupational group, and time. Typical response variables of interest are the number of absences duemore » to illness or injury, i.e., the response variable is a count. Poisson regression methods are used to describe the effect of the explanatory variables on the health event rates using a log-linear main effects model. Results of fitting the main effects model are summarized in a tabular and graphical form and interpretation of model parameters is provided. An analysis of deviance table is used to evaluate the importance of each of the explanatory variables on the event rate of interest and to determine if interaction terms should be considered in the analysis. Although Poisson regression methods are widely used in the analysis of count data, there are situations in which over-dispersion occurs. This could be due to lack-of-fit of the regression model, extra-Poisson variation, or both. A score test statistic and regression diagnostics are used to identify over-dispersion. A quasi-likelihood method of moments procedure is used to evaluate and adjust for extra-Poisson variation when necessary. Two examples are presented using respiratory disease absence rates at two DOE sites to illustrate the methods and interpretation of the results. In the first example the Poisson main effects model is adequate. In the second example the score test indicates considerable over-dispersion and a more detailed analysis attributes the over-dispersion to extra-Poisson variation. The R open source software environment for statistical computing and graphics is used for analysis. Additional details about R and the data that were used in this report are provided in an Appendix. Information on how to obtain R and utility functions that can be used to duplicate results in this report are provided.« less

  20. Logistic regression applied to natural hazards: rare event logistic regression with replications

    NASA Astrophysics Data System (ADS)

    Guns, M.; Vanacker, V.

    2012-06-01

    Statistical analysis of natural hazards needs particular attention, as most of these phenomena are rare events. This study shows that the ordinary rare event logistic regression, as it is now commonly used in geomorphologic studies, does not always lead to a robust detection of controlling factors, as the results can be strongly sample-dependent. In this paper, we introduce some concepts of Monte Carlo simulations in rare event logistic regression. This technique, so-called rare event logistic regression with replications, combines the strength of probabilistic and statistical methods, and allows overcoming some of the limitations of previous developments through robust variable selection. This technique was here developed for the analyses of landslide controlling factors, but the concept is widely applicable for statistical analyses of natural hazards.

  1. Association between Stereotactic Radiotherapy and Death from Brain Metastases of Epithelial Ovarian Cancer: a Gliwice Data Re-Analysis with Penalization

    PubMed

    Tukiendorf, Andrzej; Mansournia, Mohammad Ali; Wydmański, Jerzy; Wolny-Rokicka, Edyta

    2017-04-01

    Background: Clinical datasets for epithelial ovarian cancer brain metastatic patients are usually small in size. When adequate case numbers are lacking, resulting estimates of regression coefficients may demonstrate bias. One of the direct approaches to reduce such sparse-data bias is based on penalized estimation. Methods: A re- analysis of formerly reported hazard ratios in diagnosed patients was performed using penalized Cox regression with a popular SAS package providing additional software codes for a statistical computational procedure. Results: It was found that the penalized approach can readily diminish sparse data artefacts and radically reduce the magnitude of estimated regression coefficients. Conclusions: It was confirmed that classical statistical approaches may exaggerate regression estimates or distort study interpretations and conclusions. The results support the thesis that penalization via weak informative priors and data augmentation are the safest approaches to shrink sparse data artefacts frequently occurring in epidemiological research. Creative Commons Attribution License

  2. Effect of Malmquist bias on correlation studies with IRAS data base

    NASA Technical Reports Server (NTRS)

    Verter, Frances

    1993-01-01

    The relationships between galaxy properties in the sample of Trinchieri et al. (1989) are reexamined with corrections for Malmquist bias. The linear correlations are tested and linear regressions are fit for log-log plots of L(FIR), L(H-alpha), and L(B) as well as ratios of these quantities. The linear correlations for Malmquist bias are corrected using the method of Verter (1988), in which each galaxy observation is weighted by the inverse of its sampling volume. The linear regressions are corrected for Malmquist bias by a new method invented here in which each galaxy observation is weighted by its sampling volume. The results of correlation and regressions among the sample are significantly changed in the anticipated sense that the corrected correlation confidences are lower and the corrected slopes of the linear regressions are lower. The elimination of Malmquist bias eliminates the nonlinear rise in luminosity that has caused some authors to hypothesize additional components of FIR emission.

  3. Improving Prediction Accuracy for WSN Data Reduction by Applying Multivariate Spatio-Temporal Correlation

    PubMed Central

    Carvalho, Carlos; Gomes, Danielo G.; Agoulmine, Nazim; de Souza, José Neuman

    2011-01-01

    This paper proposes a method based on multivariate spatial and temporal correlation to improve prediction accuracy in data reduction for Wireless Sensor Networks (WSN). Prediction of data not sent to the sink node is a technique used to save energy in WSNs by reducing the amount of data traffic. However, it may not be very accurate. Simulations were made involving simple linear regression and multiple linear regression functions to assess the performance of the proposed method. The results show a higher correlation between gathered inputs when compared to time, which is an independent variable widely used for prediction and forecasting. Prediction accuracy is lower when simple linear regression is used, whereas multiple linear regression is the most accurate one. In addition to that, our proposal outperforms some current solutions by about 50% in humidity prediction and 21% in light prediction. To the best of our knowledge, we believe that we are probably the first to address prediction based on multivariate correlation for WSN data reduction. PMID:22346626

  4. OPC modeling by genetic algorithm

    NASA Astrophysics Data System (ADS)

    Huang, W. C.; Lai, C. M.; Luo, B.; Tsai, C. K.; Tsay, C. S.; Lai, C. W.; Kuo, C. C.; Liu, R. G.; Lin, H. T.; Lin, B. J.

    2005-05-01

    Optical proximity correction (OPC) is usually used to pre-distort mask layouts to make the printed patterns as close to the desired shapes as possible. For model-based OPC, a lithographic model to predict critical dimensions after lithographic processing is needed. The model is usually obtained via a regression of parameters based on experimental data containing optical proximity effects. When the parameters involve a mix of the continuous (optical and resist models) and the discrete (kernel numbers) sets, the traditional numerical optimization method may have difficulty handling model fitting. In this study, an artificial-intelligent optimization method was used to regress the parameters of the lithographic models for OPC. The implemented phenomenological models were constant-threshold models that combine diffused aerial image models with loading effects. Optical kernels decomposed from Hopkin"s equation were used to calculate aerial images on the wafer. Similarly, the numbers of optical kernels were treated as regression parameters. This way, good regression results were obtained with different sets of optical proximity effect data.

  5. Comparison of cranial sex determination by discriminant analysis and logistic regression.

    PubMed

    Amores-Ampuero, Anabel; Alemán, Inmaculada

    2016-04-05

    Various methods have been proposed for estimating dimorphism. The objective of this study was to compare sex determination results from cranial measurements using discriminant analysis or logistic regression. The study sample comprised 130 individuals (70 males) of known sex, age, and cause of death from San José cemetery in Granada (Spain). Measurements of 19 neurocranial dimensions and 11 splanchnocranial dimensions were subjected to discriminant analysis and logistic regression, and the percentages of correct classification were compared between the sex functions obtained with each method. The discriminant capacity of the selected variables was evaluated with a cross-validation procedure. The percentage accuracy with discriminant analysis was 78.2% for the neurocranium (82.4% in females and 74.6% in males) and 73.7% for the splanchnocranium (79.6% in females and 68.8% in males). These percentages were higher with logistic regression analysis: 85.7% for the neurocranium (in both sexes) and 94.1% for the splanchnocranium (100% in females and 91.7% in males).

  6. Controlling Type I Error Rates in Assessing DIF for Logistic Regression Method Combined with SIBTEST Regression Correction Procedure and DIF-Free-Then-DIF Strategy

    ERIC Educational Resources Information Center

    Shih, Ching-Lin; Liu, Tien-Hsiang; Wang, Wen-Chung

    2014-01-01

    The simultaneous item bias test (SIBTEST) method regression procedure and the differential item functioning (DIF)-free-then-DIF strategy are applied to the logistic regression (LR) method simultaneously in this study. These procedures are used to adjust the effects of matching true score on observed score and to better control the Type I error…

  7. Developing a predictive tropospheric ozone model for Tabriz

    NASA Astrophysics Data System (ADS)

    Khatibi, Rahman; Naghipour, Leila; Ghorbani, Mohammad A.; Smith, Michael S.; Karimi, Vahid; Farhoudi, Reza; Delafrouz, Hadi; Arvanaghi, Hadi

    2013-04-01

    Predictive ozone models are becoming indispensable tools by providing a capability for pollution alerts to serve people who are vulnerable to the risks. We have developed a tropospheric ozone prediction capability for Tabriz, Iran, by using the following five modeling strategies: three regression-type methods: Multiple Linear Regression (MLR), Artificial Neural Networks (ANNs), and Gene Expression Programming (GEP); and two auto-regression-type models: Nonlinear Local Prediction (NLP) to implement chaos theory and Auto-Regressive Integrated Moving Average (ARIMA) models. The regression-type modeling strategies explain the data in terms of: temperature, solar radiation, dew point temperature, and wind speed, by regressing present ozone values to their past values. The ozone time series are available at various time intervals, including hourly intervals, from August 2010 to March 2011. The results for MLR, ANN and GEP models are not overly good but those produced by NLP and ARIMA are promising for the establishing a forecasting capability.

  8. A robust and efficient stepwise regression method for building sparse polynomial chaos expansions

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Abraham, Simon, E-mail: Simon.Abraham@ulb.ac.be; Raisee, Mehrdad; Ghorbaniasl, Ghader

    2017-03-01

    Polynomial Chaos (PC) expansions are widely used in various engineering fields for quantifying uncertainties arising from uncertain parameters. The computational cost of classical PC solution schemes is unaffordable as the number of deterministic simulations to be calculated grows dramatically with the number of stochastic dimension. This considerably restricts the practical use of PC at the industrial level. A common approach to address such problems is to make use of sparse PC expansions. This paper presents a non-intrusive regression-based method for building sparse PC expansions. The most important PC contributions are detected sequentially through an automatic search procedure. The variable selectionmore » criterion is based on efficient tools relevant to probabilistic method. Two benchmark analytical functions are used to validate the proposed algorithm. The computational efficiency of the method is then illustrated by a more realistic CFD application, consisting of the non-deterministic flow around a transonic airfoil subject to geometrical uncertainties. To assess the performance of the developed methodology, a detailed comparison is made with the well established LAR-based selection technique. The results show that the developed sparse regression technique is able to identify the most significant PC contributions describing the problem. Moreover, the most important stochastic features are captured at a reduced computational cost compared to the LAR method. The results also demonstrate the superior robustness of the method by repeating the analyses using random experimental designs.« less

  9. EMD-regression for modelling multi-scale relationships, and application to weather-related cardiovascular mortality

    NASA Astrophysics Data System (ADS)

    Masselot, Pierre; Chebana, Fateh; Bélanger, Diane; St-Hilaire, André; Abdous, Belkacem; Gosselin, Pierre; Ouarda, Taha B. M. J.

    2018-01-01

    In a number of environmental studies, relationships between natural processes are often assessed through regression analyses, using time series data. Such data are often multi-scale and non-stationary, leading to a poor accuracy of the resulting regression models and therefore to results with moderate reliability. To deal with this issue, the present paper introduces the EMD-regression methodology consisting in applying the empirical mode decomposition (EMD) algorithm on data series and then using the resulting components in regression models. The proposed methodology presents a number of advantages. First, it accounts of the issues of non-stationarity associated to the data series. Second, this approach acts as a scan for the relationship between a response variable and the predictors at different time scales, providing new insights about this relationship. To illustrate the proposed methodology it is applied to study the relationship between weather and cardiovascular mortality in Montreal, Canada. The results shed new knowledge concerning the studied relationship. For instance, they show that the humidity can cause excess mortality at the monthly time scale, which is a scale not visible in classical models. A comparison is also conducted with state of the art methods which are the generalized additive models and distributed lag models, both widely used in weather-related health studies. The comparison shows that EMD-regression achieves better prediction performances and provides more details than classical models concerning the relationship.

  10. Comparison of modeling methods to predict the spatial distribution of deep-sea coral and sponge in the Gulf of Alaska

    NASA Astrophysics Data System (ADS)

    Rooper, Christopher N.; Zimmermann, Mark; Prescott, Megan M.

    2017-08-01

    Deep-sea coral and sponge ecosystems are widespread throughout most of Alaska's marine waters, and are associated with many different species of fishes and invertebrates. These ecosystems are vulnerable to the effects of commercial fishing activities and climate change. We compared four commonly used species distribution models (general linear models, generalized additive models, boosted regression trees and random forest models) and an ensemble model to predict the presence or absence and abundance of six groups of benthic invertebrate taxa in the Gulf of Alaska. All four model types performed adequately on training data for predicting presence and absence, with regression forest models having the best overall performance measured by the area under the receiver-operating-curve (AUC). The models also performed well on the test data for presence and absence with average AUCs ranging from 0.66 to 0.82. For the test data, ensemble models performed the best. For abundance data, there was an obvious demarcation in performance between the two regression-based methods (general linear models and generalized additive models), and the tree-based models. The boosted regression tree and random forest models out-performed the other models by a wide margin on both the training and testing data. However, there was a significant drop-off in performance for all models of invertebrate abundance ( 50%) when moving from the training data to the testing data. Ensemble model performance was between the tree-based and regression-based methods. The maps of predictions from the models for both presence and abundance agreed very well across model types, with an increase in variability in predictions for the abundance data. We conclude that where data conforms well to the modeled distribution (such as the presence-absence data and binomial distribution in this study), the four types of models will provide similar results, although the regression-type models may be more consistent with biological theory. For data with highly zero-inflated distributions and non-normal distributions such as the abundance data from this study, the tree-based methods performed better. Ensemble models that averaged predictions across the four model types, performed better than the GLM or GAM models but slightly poorer than the tree-based methods, suggesting ensemble models might be more robust to overfitting than tree methods, while mitigating some of the disadvantages in predictive performance of regression methods.

  11. A novel method of estimation of lipophilicity using distance-based topological indices: dominating role of equalized electronegativity.

    PubMed

    Agrawal, Vijay K; Gupta, Madhu; Singh, Jyoti; Khadikar, Padmakar V

    2005-03-15

    Attempt is made to propose yet another method of estimating lipophilicity of a heterogeneous set of 223 compounds. The method is based on the use of equalized electronegativity along with topological indices. It was observed that excellent results are obtained in multiparametric regression upon introduction of indicator parameters. The results are discussed critically on the basis various statistical parameters.

  12. Homogeneity Pursuit

    PubMed Central

    Ke, Tracy; Fan, Jianqing; Wu, Yichao

    2014-01-01

    This paper explores the homogeneity of coefficients in high-dimensional regression, which extends the sparsity concept and is more general and suitable for many applications. Homogeneity arises when regression coefficients corresponding to neighboring geographical regions or a similar cluster of covariates are expected to be approximately the same. Sparsity corresponds to a special case of homogeneity with a large cluster of known atom zero. In this article, we propose a new method called clustering algorithm in regression via data-driven segmentation (CARDS) to explore homogeneity. New mathematics are provided on the gain that can be achieved by exploring homogeneity. Statistical properties of two versions of CARDS are analyzed. In particular, the asymptotic normality of our proposed CARDS estimator is established, which reveals better estimation accuracy for homogeneous parameters than that without homogeneity exploration. When our methods are combined with sparsity exploration, further efficiency can be achieved beyond the exploration of sparsity alone. This provides additional insights into the power of exploring low-dimensional structures in high-dimensional regression: homogeneity and sparsity. Our results also shed lights on the properties of the fussed Lasso. The newly developed method is further illustrated by simulation studies and applications to real data. Supplementary materials for this article are available online. PMID:26085701

  13. Building Regression Models: The Importance of Graphics.

    ERIC Educational Resources Information Center

    Dunn, Richard

    1989-01-01

    Points out reasons for using graphical methods to teach simple and multiple regression analysis. Argues that a graphically oriented approach has considerable pedagogic advantages in the exposition of simple and multiple regression. Shows that graphical methods may play a central role in the process of building regression models. (Author/LS)

  14. Differentially private distributed logistic regression using private and public data

    PubMed Central

    2014-01-01

    Background Privacy protecting is an important issue in medical informatics and differential privacy is a state-of-the-art framework for data privacy research. Differential privacy offers provable privacy against attackers who have auxiliary information, and can be applied to data mining models (for example, logistic regression). However, differentially private methods sometimes introduce too much noise and make outputs less useful. Given available public data in medical research (e.g. from patients who sign open-consent agreements), we can design algorithms that use both public and private data sets to decrease the amount of noise that is introduced. Methodology In this paper, we modify the update step in Newton-Raphson method to propose a differentially private distributed logistic regression model based on both public and private data. Experiments and results We try our algorithm on three different data sets, and show its advantage over: (1) a logistic regression model based solely on public data, and (2) a differentially private distributed logistic regression model based on private data under various scenarios. Conclusion Logistic regression models built with our new algorithm based on both private and public datasets demonstrate better utility than models that trained on private or public datasets alone without sacrificing the rigorous privacy guarantee. PMID:25079786

  15. Regression to the Mean and Changes in Risk Behavior following Study Enrollment in a Cohort of US Women at Risk for HIV

    PubMed Central

    Hughes, James P.; Haley, Danielle F.; Frew, Paula M.; Golin, Carol E.; Adimora, Adaora A; Kuo, Irene; Justman, Jessica; Soto-Torres, Lydia; Wang, Jing; Hodder, Sally

    2015-01-01

    Purpose Reductions in risk behaviors are common following enrollment in HIV prevention studies. We develop methods to quantify the proportion of change in risk behaviors that can be attributed to regression to the mean versus study participation and other factors. Methods A novel model that incorporates both regression to the mean and study participation effects is developed for binary measures. The model is used to estimate the proportion of change in the prevalence of “unprotected sex in the past 6 months” that can be attributed to study participation versus regression to the mean in a longitudinal cohort of women at risk for HIV infection who were recruited from ten US communities with high rates of HIV and poverty. HIV risk behaviors were evaluated using audio computer-assisted self-interviews at baseline and every 6 months for up to 12 months. Results The prevalence of “unprotected sex in the past 6 months” declined from 96% at baseline to 77% at 12 months. However, this change could be almost completely explained by regression to the mean. Conclusions Analyses that examine changes over time in cohorts selected for high or low risk behaviors should account for regression to the mean effects. PMID:25883065

  16. Generating patient specific pseudo-CT of the head from MR using atlas-based regression

    NASA Astrophysics Data System (ADS)

    Sjölund, J.; Forsberg, D.; Andersson, M.; Knutsson, H.

    2015-01-01

    Radiotherapy planning and attenuation correction of PET images require simulation of radiation transport. The necessary physical properties are typically derived from computed tomography (CT) images, but in some cases, including stereotactic neurosurgery and combined PET/MR imaging, only magnetic resonance (MR) images are available. With these applications in mind, we describe how a realistic, patient-specific, pseudo-CT of the head can be derived from anatomical MR images. We refer to the method as atlas-based regression, because of its similarity to atlas-based segmentation. Given a target MR and an atlas database comprising MR and CT pairs, atlas-based regression works by registering each atlas MR to the target MR, applying the resulting displacement fields to the corresponding atlas CTs and, finally, fusing the deformed atlas CTs into a single pseudo-CT. We use a deformable registration algorithm known as the Morphon and augment it with a certainty mask that allows a tailoring of the influence certain regions are allowed to have on the registration. Moreover, we propose a novel method of fusion, wherein the collection of deformed CTs is iteratively registered to their joint mean and find that the resulting mean CT becomes more similar to the target CT. However, the voxelwise median provided even better results; at least as good as earlier work that required special MR imaging techniques. This makes atlas-based regression a good candidate for clinical use.

  17. Adjusting for Confounding in Early Postlaunch Settings: Going Beyond Logistic Regression Models.

    PubMed

    Schmidt, Amand F; Klungel, Olaf H; Groenwold, Rolf H H

    2016-01-01

    Postlaunch data on medical treatments can be analyzed to explore adverse events or relative effectiveness in real-life settings. These analyses are often complicated by the number of potential confounders and the possibility of model misspecification. We conducted a simulation study to compare the performance of logistic regression, propensity score, disease risk score, and stabilized inverse probability weighting methods to adjust for confounding. Model misspecification was induced in the independent derivation dataset. We evaluated performance using relative bias confidence interval coverage of the true effect, among other metrics. At low events per coefficient (1.0 and 0.5), the logistic regression estimates had a large relative bias (greater than -100%). Bias of the disease risk score estimates was at most 13.48% and 18.83%. For the propensity score model, this was 8.74% and >100%, respectively. At events per coefficient of 1.0 and 0.5, inverse probability weighting frequently failed or reduced to a crude regression, resulting in biases of -8.49% and 24.55%. Coverage of logistic regression estimates became less than the nominal level at events per coefficient ≤5. For the disease risk score, inverse probability weighting, and propensity score, coverage became less than nominal at events per coefficient ≤2.5, ≤1.0, and ≤1.0, respectively. Bias of misspecified disease risk score models was 16.55%. In settings with low events/exposed subjects per coefficient, disease risk score methods can be useful alternatives to logistic regression models, especially when propensity score models cannot be used. Despite better performance of disease risk score methods than logistic regression and propensity score models in small events per coefficient settings, bias, and coverage still deviated from nominal.

  18. A comparison of model-based imputation methods for handling missing predictor values in a linear regression model: A simulation study

    NASA Astrophysics Data System (ADS)

    Hasan, Haliza; Ahmad, Sanizah; Osman, Balkish Mohd; Sapri, Shamsiah; Othman, Nadirah

    2017-08-01

    In regression analysis, missing covariate data has been a common problem. Many researchers use ad hoc methods to overcome this problem due to the ease of implementation. However, these methods require assumptions about the data that rarely hold in practice. Model-based methods such as Maximum Likelihood (ML) using the expectation maximization (EM) algorithm and Multiple Imputation (MI) are more promising when dealing with difficulties caused by missing data. Then again, inappropriate methods of missing value imputation can lead to serious bias that severely affects the parameter estimates. The main objective of this study is to provide a better understanding regarding missing data concept that can assist the researcher to select the appropriate missing data imputation methods. A simulation study was performed to assess the effects of different missing data techniques on the performance of a regression model. The covariate data were generated using an underlying multivariate normal distribution and the dependent variable was generated as a combination of explanatory variables. Missing values in covariate were simulated using a mechanism called missing at random (MAR). Four levels of missingness (10%, 20%, 30% and 40%) were imposed. ML and MI techniques available within SAS software were investigated. A linear regression analysis was fitted and the model performance measures; MSE, and R-Squared were obtained. Results of the analysis showed that MI is superior in handling missing data with highest R-Squared and lowest MSE when percent of missingness is less than 30%. Both methods are unable to handle larger than 30% level of missingness.

  19. Practical aspects of estimating energy components in rodents

    PubMed Central

    van Klinken, Jan B.; van den Berg, Sjoerd A. A.; van Dijk, Ko Willems

    2013-01-01

    Recently there has been an increasing interest in exploiting computational and statistical techniques for the purpose of component analysis of indirect calorimetry data. Using these methods it becomes possible to dissect daily energy expenditure into its components and to assess the dynamic response of the resting metabolic rate (RMR) to nutritional and pharmacological manipulations. To perform robust component analysis, however, is not straightforward and typically requires the tuning of parameters and the preprocessing of data. Moreover the degree of accuracy that can be attained by these methods depends on the configuration of the system, which must be properly taken into account when setting up experimental studies. Here, we review the methods of Kalman filtering, linear, and penalized spline regression, and minimal energy expenditure estimation in the context of component analysis and discuss their results on high resolution datasets from mice and rats. In addition, we investigate the effect of the sample time, the accuracy of the activity sensor, and the washout time of the chamber on the estimation accuracy. We found that on the high resolution data there was a strong correlation between the results of Kalman filtering and penalized spline (P-spline) regression, except for the activity respiratory quotient (RQ). For low resolution data the basal metabolic rate (BMR) and resting RQ could still be estimated accurately with P-spline regression, having a strong correlation with the high resolution estimate (R2 > 0.997; sample time of 9 min). In contrast, the thermic effect of food (TEF) and activity related energy expenditure (AEE) were more sensitive to a reduction in the sample rate (R2 > 0.97). In conclusion, for component analysis on data generated by single channel systems with continuous data acquisition both Kalman filtering and P-spline regression can be used, while for low resolution data from multichannel systems P-spline regression gives more robust results. PMID:23641217

  20. A Unified and Comprehensible View of Parametric and Kernel Methods for Genomic Prediction with Application to Rice.

    PubMed

    Jacquin, Laval; Cao, Tuong-Vi; Ahmadi, Nourollah

    2016-01-01

    One objective of this study was to provide readers with a clear and unified understanding of parametric statistical and kernel methods, used for genomic prediction, and to compare some of these in the context of rice breeding for quantitative traits. Furthermore, another objective was to provide a simple and user-friendly R package, named KRMM, which allows users to perform RKHS regression with several kernels. After introducing the concept of regularized empirical risk minimization, the connections between well-known parametric and kernel methods such as Ridge regression [i.e., genomic best linear unbiased predictor (GBLUP)] and reproducing kernel Hilbert space (RKHS) regression were reviewed. Ridge regression was then reformulated so as to show and emphasize the advantage of the kernel "trick" concept, exploited by kernel methods in the context of epistatic genetic architectures, over parametric frameworks used by conventional methods. Some parametric and kernel methods; least absolute shrinkage and selection operator (LASSO), GBLUP, support vector machine regression (SVR) and RKHS regression were thereupon compared for their genomic predictive ability in the context of rice breeding using three real data sets. Among the compared methods, RKHS regression and SVR were often the most accurate methods for prediction followed by GBLUP and LASSO. An R function which allows users to perform RR-BLUP of marker effects, GBLUP and RKHS regression, with a Gaussian, Laplacian, polynomial or ANOVA kernel, in a reasonable computation time has been developed. Moreover, a modified version of this function, which allows users to tune kernels for RKHS regression, has also been developed and parallelized for HPC Linux clusters. The corresponding KRMM package and all scripts have been made publicly available.

  1. RRegrs: an R package for computer-aided model selection with multiple regression models.

    PubMed

    Tsiliki, Georgia; Munteanu, Cristian R; Seoane, Jose A; Fernandez-Lozano, Carlos; Sarimveis, Haralambos; Willighagen, Egon L

    2015-01-01

    Predictive regression models can be created with many different modelling approaches. Choices need to be made for data set splitting, cross-validation methods, specific regression parameters and best model criteria, as they all affect the accuracy and efficiency of the produced predictive models, and therefore, raising model reproducibility and comparison issues. Cheminformatics and bioinformatics are extensively using predictive modelling and exhibit a need for standardization of these methodologies in order to assist model selection and speed up the process of predictive model development. A tool accessible to all users, irrespectively of their statistical knowledge, would be valuable if it tests several simple and complex regression models and validation schemes, produce unified reports, and offer the option to be integrated into more extensive studies. Additionally, such methodology should be implemented as a free programming package, in order to be continuously adapted and redistributed by others. We propose an integrated framework for creating multiple regression models, called RRegrs. The tool offers the option of ten simple and complex regression methods combined with repeated 10-fold and leave-one-out cross-validation. Methods include Multiple Linear regression, Generalized Linear Model with Stepwise Feature Selection, Partial Least Squares regression, Lasso regression, and Support Vector Machines Recursive Feature Elimination. The new framework is an automated fully validated procedure which produces standardized reports to quickly oversee the impact of choices in modelling algorithms and assess the model and cross-validation results. The methodology was implemented as an open source R package, available at https://www.github.com/enanomapper/RRegrs, by reusing and extending on the caret package. The universality of the new methodology is demonstrated using five standard data sets from different scientific fields. Its efficiency in cheminformatics and QSAR modelling is shown with three use cases: proteomics data for surface-modified gold nanoparticles, nano-metal oxides descriptor data, and molecular descriptors for acute aquatic toxicity data. The results show that for all data sets RRegrs reports models with equal or better performance for both training and test sets than those reported in the original publications. Its good performance as well as its adaptability in terms of parameter optimization could make RRegrs a popular framework to assist the initial exploration of predictive models, and with that, the design of more comprehensive in silico screening applications.Graphical abstractRRegrs is a computer-aided model selection framework for R multiple regression models; this is a fully validated procedure with application to QSAR modelling.

  2. Post-processing through linear regression

    NASA Astrophysics Data System (ADS)

    van Schaeybroeck, B.; Vannitsem, S.

    2011-03-01

    Various post-processing techniques are compared for both deterministic and ensemble forecasts, all based on linear regression between forecast data and observations. In order to evaluate the quality of the regression methods, three criteria are proposed, related to the effective correction of forecast error, the optimal variability of the corrected forecast and multicollinearity. The regression schemes under consideration include the ordinary least-square (OLS) method, a new time-dependent Tikhonov regularization (TDTR) method, the total least-square method, a new geometric-mean regression (GM), a recently introduced error-in-variables (EVMOS) method and, finally, a "best member" OLS method. The advantages and drawbacks of each method are clarified. These techniques are applied in the context of the 63 Lorenz system, whose model version is affected by both initial condition and model errors. For short forecast lead times, the number and choice of predictors plays an important role. Contrarily to the other techniques, GM degrades when the number of predictors increases. At intermediate lead times, linear regression is unable to provide corrections to the forecast and can sometimes degrade the performance (GM and the best member OLS with noise). At long lead times the regression schemes (EVMOS, TDTR) which yield the correct variability and the largest correlation between ensemble error and spread, should be preferred.

  3. Mapping urban environmental noise: a land use regression method.

    PubMed

    Xie, Dan; Liu, Yi; Chen, Jining

    2011-09-01

    Forecasting and preventing urban noise pollution are major challenges in urban environmental management. Most existing efforts, including experiment-based models, statistical models, and noise mapping, however, have limited capacity to explain the association between urban growth and corresponding noise change. Therefore, these conventional methods can hardly forecast urban noise at a given outlook of development layout. This paper, for the first time, introduces a land use regression method, which has been applied for simulating urban air quality for a decade, to construct an urban noise model (LUNOS) in Dalian Municipality, Northwest China. The LUNOS model describes noise as a dependent variable of surrounding various land areas via a regressive function. The results suggest that a linear model performs better in fitting monitoring data, and there is no significant difference of the LUNOS's outputs when applied to different spatial scales. As the LUNOS facilitates a better understanding of the association between land use and urban environmental noise in comparison to conventional methods, it can be regarded as a promising tool for noise prediction for planning purposes and aid smart decision-making.

  4. Regression relation for pure quantum states and its implications for efficient computing.

    PubMed

    Elsayed, Tarek A; Fine, Boris V

    2013-02-15

    We obtain a modified version of the Onsager regression relation for the expectation values of quantum-mechanical operators in pure quantum states of isolated many-body quantum systems. We use the insights gained from this relation to show that high-temperature time correlation functions in many-body quantum systems can be controllably computed without complete diagonalization of the Hamiltonians, using instead the direct integration of the Schrödinger equation for randomly sampled pure states. This method is also applicable to quantum quenches and other situations describable by time-dependent many-body Hamiltonians. The method implies exponential reduction of the computer memory requirement in comparison with the complete diagonalization. We illustrate the method by numerically computing infinite-temperature correlation functions for translationally invariant Heisenberg chains of up to 29 spins 1/2. Thereby, we also test the spin diffusion hypothesis and find it in a satisfactory agreement with the numerical results. Both the derivation of the modified regression relation and the justification of the computational method are based on the notion of quantum typicality.

  5. Predicting MHC-II binding affinity using multiple instance regression

    PubMed Central

    EL-Manzalawy, Yasser; Dobbs, Drena; Honavar, Vasant

    2011-01-01

    Reliably predicting the ability of antigen peptides to bind to major histocompatibility complex class II (MHC-II) molecules is an essential step in developing new vaccines. Uncovering the amino acid sequence correlates of the binding affinity of MHC-II binding peptides is important for understanding pathogenesis and immune response. The task of predicting MHC-II binding peptides is complicated by the significant variability in their length. Most existing computational methods for predicting MHC-II binding peptides focus on identifying a nine amino acids core region in each binding peptide. We formulate the problems of qualitatively and quantitatively predicting flexible length MHC-II peptides as multiple instance learning and multiple instance regression problems, respectively. Based on this formulation, we introduce MHCMIR, a novel method for predicting MHC-II binding affinity using multiple instance regression. We present results of experiments using several benchmark datasets that show that MHCMIR is competitive with the state-of-the-art methods for predicting MHC-II binding peptides. An online web server that implements the MHCMIR method for MHC-II binding affinity prediction is freely accessible at http://ailab.cs.iastate.edu/mhcmir. PMID:20855923

  6. Stochastic search, optimization and regression with energy applications

    NASA Astrophysics Data System (ADS)

    Hannah, Lauren A.

    Designing clean energy systems will be an important task over the next few decades. One of the major roadblocks is a lack of mathematical tools to economically evaluate those energy systems. However, solutions to these mathematical problems are also of interest to the operations research and statistical communities in general. This thesis studies three problems that are of interest to the energy community itself or provide support for solution methods: R&D portfolio optimization, nonparametric regression and stochastic search with an observable state variable. First, we consider the one stage R&D portfolio optimization problem to avoid the sequential decision process associated with the multi-stage. The one stage problem is still difficult because of a non-convex, combinatorial decision space and a non-convex objective function. We propose a heuristic solution method that uses marginal project values---which depend on the selected portfolio---to create a linear objective function. In conjunction with the 0-1 decision space, this new problem can be solved as a knapsack linear program. This method scales well to large decision spaces. We also propose an alternate, provably convergent algorithm that does not exploit problem structure. These methods are compared on a solid oxide fuel cell R&D portfolio problem. Next, we propose Dirichlet Process mixtures of Generalized Linear Models (DPGLM), a new method of nonparametric regression that accommodates continuous and categorical inputs, and responses that can be modeled by a generalized linear model. We prove conditions for the asymptotic unbiasedness of the DP-GLM regression mean function estimate. We also give examples for when those conditions hold, including models for compactly supported continuous distributions and a model with continuous covariates and categorical response. We empirically analyze the properties of the DP-GLM and why it provides better results than existing Dirichlet process mixture regression models. We evaluate DP-GLM on several data sets, comparing it to modern methods of nonparametric regression like CART, Bayesian trees and Gaussian processes. Compared to existing techniques, the DP-GLM provides a single model (and corresponding inference algorithms) that performs well in many regression settings. Finally, we study convex stochastic search problems where a noisy objective function value is observed after a decision is made. There are many stochastic search problems whose behavior depends on an exogenous state variable which affects the shape of the objective function. Currently, there is no general purpose algorithm to solve this class of problems. We use nonparametric density estimation to take observations from the joint state-outcome distribution and use them to infer the optimal decision for a given query state. We propose two solution methods that depend on the problem characteristics: function-based and gradient-based optimization. We examine two weighting schemes, kernel-based weights and Dirichlet process-based weights, for use with the solution methods. The weights and solution methods are tested on a synthetic multi-product newsvendor problem and the hour-ahead wind commitment problem. Our results show that in some cases Dirichlet process weights offer substantial benefits over kernel based weights and more generally that nonparametric estimation methods provide good solutions to otherwise intractable problems.

  7. Estimation of genetic effects in the presence of multicollinearity in multibreed beef cattle evaluation.

    PubMed

    Roso, V M; Schenkel, F S; Miller, S P; Schaeffer, L R

    2005-08-01

    Breed additive, dominance, and epistatic loss effects are of concern in the genetic evaluation of a multibreed population. Multiple regression equations used for fitting these effects may show a high degree of multicollinearity among predictor variables. Typically, when strong linear relationships exist, the regression coefficients have large SE and are sensitive to changes in the data file and to the addition or deletion of variables in the model. Generalized ridge regression methods were applied to obtain stable estimates of direct and maternal breed additive, dominance, and epistatic loss effects in the presence of multicollinearity among predictor variables. Preweaning weight gains of beef calves in Ontario, Canada, from 1986 to 1999 were analyzed. The genetic model included fixed direct and maternal breed additive, dominance, and epistatic loss effects, fixed environmental effects of age of the calf, contemporary group, and age of the dam x sex of the calf, random additive direct and maternal genetic effects, and random maternal permanent environment effect. The degree and the nature of the multicollinearity were identified and ridge regression methods were used as an alternative to ordinary least squares (LS). Ridge parameters were obtained using two different objective methods: 1) generalized ridge estimator of Hoerl and Kennard (R1); and 2) bootstrap in combination with cross-validation (R2). Both ridge regression methods outperformed the LS estimator with respect to mean squared error of predictions (MSEP) and variance inflation factors (VIF) computed over 100 bootstrap samples. The MSEP of R1 and R2 were similar, and they were 3% less than the MSEP of LS. The average VIF of LS, R1, and R2 were equal to 26.81, 6.10, and 4.18, respectively. Ridge regression methods were particularly effective in decreasing the multicollinearity involving predictor variables of breed additive effects. Because of a high degree of confounding between estimates of maternal dominance and direct epistatic loss effects, it was not possible to compare the relative importance of these effects with a high level of confidence. The inclusion of epistatic loss effects in the additive-dominance model did not cause noticeable reranking of sires, dams, and calves based on across-breed EBV. More precise estimates of breed effects as a result of this study may result in more stable across-breed estimated breeding values over the years.

  8. Feature Grouping and Selection Over an Undirected Graph.

    PubMed

    Yang, Sen; Yuan, Lei; Lai, Ying-Cheng; Shen, Xiaotong; Wonka, Peter; Ye, Jieping

    2012-01-01

    High-dimensional regression/classification continues to be an important and challenging problem, especially when features are highly correlated. Feature selection, combined with additional structure information on the features has been considered to be promising in promoting regression/classification performance. Graph-guided fused lasso (GFlasso) has recently been proposed to facilitate feature selection and graph structure exploitation, when features exhibit certain graph structures. However, the formulation in GFlasso relies on pairwise sample correlations to perform feature grouping, which could introduce additional estimation bias. In this paper, we propose three new feature grouping and selection methods to resolve this issue. The first method employs a convex function to penalize the pairwise l ∞ norm of connected regression/classification coefficients, achieving simultaneous feature grouping and selection. The second method improves the first one by utilizing a non-convex function to reduce the estimation bias. The third one is the extension of the second method using a truncated l 1 regularization to further reduce the estimation bias. The proposed methods combine feature grouping and feature selection to enhance estimation accuracy. We employ the alternating direction method of multipliers (ADMM) and difference of convex functions (DC) programming to solve the proposed formulations. Our experimental results on synthetic data and two real datasets demonstrate the effectiveness of the proposed methods.

  9. Geodesic least squares regression for scaling studies in magnetic confinement fusion

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Verdoolaege, Geert

    In regression analyses for deriving scaling laws that occur in various scientific disciplines, usually standard regression methods have been applied, of which ordinary least squares (OLS) is the most popular. However, concerns have been raised with respect to several assumptions underlying OLS in its application to scaling laws. We here discuss a new regression method that is robust in the presence of significant uncertainty on both the data and the regression model. The method, which we call geodesic least squares regression (GLS), is based on minimization of the Rao geodesic distance on a probabilistic manifold. We demonstrate the superiority ofmore » the method using synthetic data and we present an application to the scaling law for the power threshold for the transition to the high confinement regime in magnetic confinement fusion devices.« less

  10. New method for calculating a mathematical expression for streamflow recession

    USGS Publications Warehouse

    Rutledge, Albert T.

    1991-01-01

    An empirical method has been devised to calculate the master recession curve, which is a mathematical expression for streamflow recession during times of negligible direct runoff. The method is based on the assumption that the storage-delay factor, which is the time per log cycle of streamflow recession, varies linearly with the logarithm of streamflow. The resulting master recession curve can be nonlinear. The method can be executed by a computer program that reads a data file of daily mean streamflow, then allows the user to select several near-linear segments of streamflow recession. The storage-delay factor for each segment is one of the coefficients of the equation that results from linear least-squares regression. Using results for each recession segment, a mathematical expression of the storage-delay factor as a function of the log of streamflow is determined by linear least-squares regression. The master recession curve, which is a second-order polynomial expression for time as a function of log of streamflow, is then derived using the coefficients of this function.

  11. A comparison of fitness-case sampling methods for genetic programming

    NASA Astrophysics Data System (ADS)

    Martínez, Yuliana; Naredo, Enrique; Trujillo, Leonardo; Legrand, Pierrick; López, Uriel

    2017-11-01

    Genetic programming (GP) is an evolutionary computation paradigm for automatic program induction. GP has produced impressive results but it still needs to overcome some practical limitations, particularly its high computational cost, overfitting and excessive code growth. Recently, many researchers have proposed fitness-case sampling methods to overcome some of these problems, with mixed results in several limited tests. This paper presents an extensive comparative study of four fitness-case sampling methods, namely: Interleaved Sampling, Random Interleaved Sampling, Lexicase Selection and Keep-Worst Interleaved Sampling. The algorithms are compared on 11 symbolic regression problems and 11 supervised classification problems, using 10 synthetic benchmarks and 12 real-world data-sets. They are evaluated based on test performance, overfitting and average program size, comparing them with a standard GP search. Comparisons are carried out using non-parametric multigroup tests and post hoc pairwise statistical tests. The experimental results suggest that fitness-case sampling methods are particularly useful for difficult real-world symbolic regression problems, improving performance, reducing overfitting and limiting code growth. On the other hand, it seems that fitness-case sampling cannot improve upon GP performance when considering supervised binary classification.

  12. Neural network and multiple linear regression to predict school children dimensions for ergonomic school furniture design.

    PubMed

    Agha, Salah R; Alnahhal, Mohammed J

    2012-11-01

    The current study investigates the possibility of obtaining the anthropometric dimensions, critical to school furniture design, without measuring all of them. The study first selects some anthropometric dimensions that are easy to measure. Two methods are then used to check if these easy-to-measure dimensions can predict the dimensions critical to the furniture design. These methods are multiple linear regression and neural networks. Each dimension that is deemed necessary to ergonomically design school furniture is expressed as a function of some other measured anthropometric dimensions. Results show that out of the five dimensions needed for chair design, four can be related to other dimensions that can be measured while children are standing. Therefore, the method suggested here would definitely save time and effort and avoid the difficulty of dealing with students while measuring these dimensions. In general, it was found that neural networks perform better than multiple linear regression in the current study. Copyright © 2012 Elsevier Ltd and The Ergonomics Society. All rights reserved.

  13. Comparison of enzyme-linked immunosorbent assay and gas chromatography procedures for the detection of cyanazine and metolachlor in surface water samples

    USGS Publications Warehouse

    Schraer, S.M.; Shaw, D.R.; Boyette, M.; Coupe, R.H.; Thurman, E.M.

    2000-01-01

    Enzyme-linked immunosorbent assay (ELISA) data from surface water reconnaissance were compared to data from samples analyzed by gas chromatography for the pesticide residues cyanazine (2-[[4-chloro-6-(ethylamino)-l,3,5-triazin-2-yl]amino]-2-methylpropanenitrile ) and metolachlor (2-chloro-N-(2-ethyl-6-methylphenyl)-N-(2-methoxy-1-methylethyl)acetamide). When ELISA analyses were duplicated, cyanazine and metolachlor detection was found to have highly reproducible results; adjusted R2s were 0.97 and 0.94, respectively. When ELISA results for cyanazine were regressed against gas chromatography results, the models effectively predicted cyanazine concentrations from ELISA analyses (adjusted R2s ranging from 0.76 to 0.81). The intercepts and slopes for these models were not different from 0 and 1, respectively. This indicates that cyanazine analysis by ELISA is expected to give the same results as analysis by gas chromatography. However, regressing ELISA analyses for metolachlor against gas chromatography data provided more variable results (adjusted R2s ranged from 0.67 to 0.94). Regression models for metolachlor analyses had two of three intercepts that were not different from 0. Slopes for all metolachlor regression models were significantly different from 1. This indicates that as metolachlor concentrations increase, ELISA will over- or under-estimate metolachlor concentration, depending on the method of comparison. ELISA can be effectively used to detect cyanazine and metolachlor in surface water samples. However, when detections of metolachlor have significant consequences or implications it may be necessary to use other analytical methods.

  14. Regression modeling of ground-water flow

    USGS Publications Warehouse

    Cooley, R.L.; Naff, R.L.

    1985-01-01

    Nonlinear multiple regression methods are developed to model and analyze groundwater flow systems. Complete descriptions of regression methodology as applied to groundwater flow models allow scientists and engineers engaged in flow modeling to apply the methods to a wide range of problems. Organization of the text proceeds from an introduction that discusses the general topic of groundwater flow modeling, to a review of basic statistics necessary to properly apply regression techniques, and then to the main topic: exposition and use of linear and nonlinear regression to model groundwater flow. Statistical procedures are given to analyze and use the regression models. A number of exercises and answers are included to exercise the student on nearly all the methods that are presented for modeling and statistical analysis. Three computer programs implement the more complex methods. These three are a general two-dimensional, steady-state regression model for flow in an anisotropic, heterogeneous porous medium, a program to calculate a measure of model nonlinearity with respect to the regression parameters, and a program to analyze model errors in computed dependent variables such as hydraulic head. (USGS)

  15. Comparison between Two Linear Supervised Learning Machines' Methods with Principle Component Based Methods for the Spectrofluorimetric Determination of Agomelatine and Its Degradants.

    PubMed

    Elkhoudary, Mahmoud M; Naguib, Ibrahim A; Abdel Salam, Randa A; Hadad, Ghada M

    2017-05-01

    Four accurate, sensitive and reliable stability indicating chemometric methods were developed for the quantitative determination of Agomelatine (AGM) whether in pure form or in pharmaceutical formulations. Two supervised learning machines' methods; linear artificial neural networks (PC-linANN) preceded by principle component analysis and linear support vector regression (linSVR), were compared with two principle component based methods; principle component regression (PCR) as well as partial least squares (PLS) for the spectrofluorimetric determination of AGM and its degradants. The results showed the benefits behind using linear learning machines' methods and the inherent merits of their algorithms in handling overlapped noisy spectral data especially during the challenging determination of AGM alkaline and acidic degradants (DG1 and DG2). Relative mean squared error of prediction (RMSEP) for the proposed models in the determination of AGM were 1.68, 1.72, 0.68 and 0.22 for PCR, PLS, SVR and PC-linANN; respectively. The results showed the superiority of supervised learning machines' methods over principle component based methods. Besides, the results suggested that linANN is the method of choice for determination of components in low amounts with similar overlapped spectra and narrow linearity range. Comparison between the proposed chemometric models and a reported HPLC method revealed the comparable performance and quantification power of the proposed models.

  16. Investigating the Performance of Alternate Regression Weights by Studying All Possible Criteria in Regression Models with a Fixed Set of Predictors

    ERIC Educational Resources Information Center

    Waller, Niels; Jones, Jeff

    2011-01-01

    We describe methods for assessing all possible criteria (i.e., dependent variables) and subsets of criteria for regression models with a fixed set of predictors, x (where x is an n x 1 vector of independent variables). Our methods build upon the geometry of regression coefficients (hereafter called regression weights) in n-dimensional space. For a…

  17. Application of factor analysis of infrared spectra for quantitative determination of beta-tricalcium phosphate in calcium hydroxylapatite.

    PubMed

    Arsenyev, P A; Trezvov, V V; Saratovskaya, N V

    1997-01-01

    This work represents a method, which allows to determine phase composition of calcium hydroxylapatite basing on its infrared spectrum. The method uses factor analysis of the spectral data of calibration set of samples to determine minimal number of factors required to reproduce the spectra within experimental error. Multiple linear regression is applied to establish correlation between factor scores of calibration standards and their properties. The regression equations can be used to predict the property value of unknown sample. The regression model was built for determination of beta-tricalcium phosphate content in hydroxylapatite. Statistical estimation of quality of the model was carried out. Application of the factor analysis on spectral data allows to increase accuracy of beta-tricalcium phosphate determination and expand the range of determination towards its less concentration. Reproducibility of results is retained.

  18. Bias in logistic regression due to imperfect diagnostic test results and practical correction approaches.

    PubMed

    Valle, Denis; Lima, Joanna M Tucker; Millar, Justin; Amratia, Punam; Haque, Ubydul

    2015-11-04

    Logistic regression is a statistical model widely used in cross-sectional and cohort studies to identify and quantify the effects of potential disease risk factors. However, the impact of imperfect tests on adjusted odds ratios (and thus on the identification of risk factors) is under-appreciated. The purpose of this article is to draw attention to the problem associated with modelling imperfect diagnostic tests, and propose simple Bayesian models to adequately address this issue. A systematic literature review was conducted to determine the proportion of malaria studies that appropriately accounted for false-negatives/false-positives in a logistic regression setting. Inference from the standard logistic regression was also compared with that from three proposed Bayesian models using simulations and malaria data from the western Brazilian Amazon. A systematic literature review suggests that malaria epidemiologists are largely unaware of the problem of using logistic regression to model imperfect diagnostic test results. Simulation results reveal that statistical inference can be substantially improved when using the proposed Bayesian models versus the standard logistic regression. Finally, analysis of original malaria data with one of the proposed Bayesian models reveals that microscopy sensitivity is strongly influenced by how long people have lived in the study region, and an important risk factor (i.e., participation in forest extractivism) is identified that would have been missed by standard logistic regression. Given the numerous diagnostic methods employed by malaria researchers and the ubiquitous use of logistic regression to model the results of these diagnostic tests, this paper provides critical guidelines to improve data analysis practice in the presence of misclassification error. Easy-to-use code that can be readily adapted to WinBUGS is provided, enabling straightforward implementation of the proposed Bayesian models.

  19. Multiple imputation for cure rate quantile regression with censored data.

    PubMed

    Wu, Yuanshan; Yin, Guosheng

    2017-03-01

    The main challenge in the context of cure rate analysis is that one never knows whether censored subjects are cured or uncured, or whether they are susceptible or insusceptible to the event of interest. Considering the susceptible indicator as missing data, we propose a multiple imputation approach to cure rate quantile regression for censored data with a survival fraction. We develop an iterative algorithm to estimate the conditionally uncured probability for each subject. By utilizing this estimated probability and Bernoulli sample imputation, we can classify each subject as cured or uncured, and then employ the locally weighted method to estimate the quantile regression coefficients with only the uncured subjects. Repeating the imputation procedure multiple times and taking an average over the resultant estimators, we obtain consistent estimators for the quantile regression coefficients. Our approach relaxes the usual global linearity assumption, so that we can apply quantile regression to any particular quantile of interest. We establish asymptotic properties for the proposed estimators, including both consistency and asymptotic normality. We conduct simulation studies to assess the finite-sample performance of the proposed multiple imputation method and apply it to a lung cancer study as an illustration. © 2016, The International Biometric Society.

  20. Censored Hurdle Negative Binomial Regression (Case Study: Neonatorum Tetanus Case in Indonesia)

    NASA Astrophysics Data System (ADS)

    Yuli Rusdiana, Riza; Zain, Ismaini; Wulan Purnami, Santi

    2017-06-01

    Hurdle negative binomial model regression is a method that can be used for discreate dependent variable, excess zero and under- and overdispersion. It uses two parts approach. The first part estimates zero elements from dependent variable is zero hurdle model and the second part estimates not zero elements (non-negative integer) from dependent variable is called truncated negative binomial models. The discrete dependent variable in such cases is censored for some values. The type of censor that will be studied in this research is right censored. This study aims to obtain the parameter estimator hurdle negative binomial regression for right censored dependent variable. In the assessment of parameter estimation methods used Maximum Likelihood Estimator (MLE). Hurdle negative binomial model regression for right censored dependent variable is applied on the number of neonatorum tetanus cases in Indonesia. The type data is count data which contains zero values in some observations and other variety value. This study also aims to obtain the parameter estimator and test statistic censored hurdle negative binomial model. Based on the regression results, the factors that influence neonatorum tetanus case in Indonesia is the percentage of baby health care coverage and neonatal visits.

  1. Linear and nonlinear regression techniques for simultaneous and proportional myoelectric control.

    PubMed

    Hahne, J M; Biessmann, F; Jiang, N; Rehbaum, H; Farina, D; Meinecke, F C; Muller, K-R; Parra, L C

    2014-03-01

    In recent years the number of active controllable joints in electrically powered hand-prostheses has increased significantly. However, the control strategies for these devices in current clinical use are inadequate as they require separate and sequential control of each degree-of-freedom (DoF). In this study we systematically compare linear and nonlinear regression techniques for an independent, simultaneous and proportional myoelectric control of wrist movements with two DoF. These techniques include linear regression, mixture of linear experts (ME), multilayer-perceptron, and kernel ridge regression (KRR). They are investigated offline with electro-myographic signals acquired from ten able-bodied subjects and one person with congenital upper limb deficiency. The control accuracy is reported as a function of the number of electrodes and the amount and diversity of training data providing guidance for the requirements in clinical practice. The results showed that KRR, a nonparametric statistical learning method, outperformed the other methods. However, simple transformations in the feature space could linearize the problem, so that linear models could achieve similar performance as KRR at much lower computational costs. Especially ME, a physiologically inspired extension of linear regression represents a promising candidate for the next generation of prosthetic devices.

  2. Linear Regression Quantile Mapping (RQM) - A new approach to bias correction with consistent quantile trends

    NASA Astrophysics Data System (ADS)

    Passow, Christian; Donner, Reik

    2017-04-01

    Quantile mapping (QM) is an established concept that allows to correct systematic biases in multiple quantiles of the distribution of a climatic observable. It shows remarkable results in correcting biases in historical simulations through observational data and outperforms simpler correction methods which relate only to the mean or variance. Since it has been shown that bias correction of future predictions or scenario runs with basic QM can result in misleading trends in the projection, adjusted, trend preserving, versions of QM were introduced in the form of detrended quantile mapping (DQM) and quantile delta mapping (QDM) (Cannon, 2015, 2016). Still, all previous versions and applications of QM based bias correction rely on the assumption of time-independent quantiles over the investigated period, which can be misleading in the context of a changing climate. Here, we propose a novel combination of linear quantile regression (QR) with the classical QM method to introduce a consistent, time-dependent and trend preserving approach of bias correction for historical and future projections. Since QR is a regression method, it is possible to estimate quantiles in the same resolution as the given data and include trends or other dependencies. We demonstrate the performance of the new method of linear regression quantile mapping (RQM) in correcting biases of temperature and precipitation products from historical runs (1959 - 2005) of the COSMO model in climate mode (CCLM) from the Euro-CORDEX ensemble relative to gridded E-OBS data of the same spatial and temporal resolution. A thorough comparison with established bias correction methods highlights the strengths and potential weaknesses of the new RQM approach. References: A.J. Cannon, S.R. Sorbie, T.Q. Murdock: Bias Correction of GCM Precipitation by Quantile Mapping - How Well Do Methods Preserve Changes in Quantiles and Extremes? Journal of Climate, 28, 6038, 2015 A.J. Cannon: Multivariate Bias Correction of Climate Model Outputs - Matching Marginal Distributions and Inter-variable Dependence Structure. Journal of Climate, 29, 7045, 2016

  3. A comparison between univariate probabilistic and multivariate (logistic regression) methods for landslide susceptibility analysis: the example of the Febbraro valley (Northern Alps, Italy)

    NASA Astrophysics Data System (ADS)

    Rossi, M.; Apuani, T.; Felletti, F.

    2009-04-01

    The aim of this paper is to compare the results of two statistical methods for landslide susceptibility analysis: 1) univariate probabilistic method based on landslide susceptibility index, 2) multivariate method (logistic regression). The study area is the Febbraro valley, located in the central Italian Alps, where different types of metamorphic rocks croup out. On the eastern part of the studied basin a quaternary cover represented by colluvial and secondarily, by glacial deposits, is dominant. In this study 110 earth flows, mainly located toward NE portion of the catchment, were analyzed. They involve only the colluvial deposits and their extension mainly ranges from 36 to 3173 m2. Both statistical methods require to establish a spatial database, in which each landslide is described by several parameters that can be assigned using a main scarp central point of landslide. The spatial database is constructed using a Geographical Information System (GIS). Each landslide is described by several parameters corresponding to the value of main scarp central point of the landslide. Based on bibliographic review a total of 15 predisposing factors were utilized. The width of the intervals, in which the maps of the predisposing factors have to be reclassified, has been defined assuming constant intervals to: elevation (100 m), slope (5 °), solar radiation (0.1 MJ/cm2/year), profile curvature (1.2 1/m), tangential curvature (2.2 1/m), drainage density (0.5), lineament density (0.00126). For the other parameters have been used the results of the probability-probability plots analysis and the statistical indexes of landslides site. In particular slope length (0 ÷ 2, 2 ÷ 5, 5 ÷ 10, 10 ÷ 20, 20 ÷ 35, 35 ÷ 260), accumulation flow (0 ÷ 1, 1 ÷ 2, 2 ÷ 5, 5 ÷ 12, 12 ÷ 60, 60 ÷27265), Topographic Wetness Index 0 ÷ 0.74, 0.74 ÷ 1.94, 1.94 ÷ 2.62, 2.62 ÷ 3.48, 3.48 ÷ 6,00, 6.00 ÷ 9.44), Stream Power Index (0 ÷ 0.64, 0.64 ÷ 1.28, 1.28 ÷ 1.81, 1.81 ÷ 4.20, 4.20 ÷ 9.40). Geological map and land use map were also used, considering geological and land use properties as categorical variables. Appling the univariate probabilistic method the Landslide Susceptibility Index (LSI) is defined as the sum of the ratio Ra/Rb calculated for each predisposing factor, where Ra is the ratio between number of pixel of class and the total number of pixel of the study area, and Rb is the ratio between number of landslides respect to the pixel number of the interval area. From the analysis of the Ra/Rb ratio the relationship between landslide occurrence and predisposing factors were defined. Then the equation of LSI was used in GIS to trace the landslide susceptibility maps. The multivariate method for landslide susceptibility analysis, based on logistic regression, was performed starting from the density maps of the predisposing factors, calculated with the intervals defined above using the equation Rb/Rbtot, where Rbtot is a sum of all Rb values. Using stepwise forward algorithms the logistic regression was performed in two successive steps: first a univariate logistic regression is used to choose the most significant predisposing factors, then the multivariate logistic regression can be performed. The univariate regression highlighted the importance of the following factors: elevation, accumulation flow, drainage density, lineament density, geology and land use. When the multivariate regression was applied the number of controlling factors was reduced neglecting the geological properties. The resulting final susceptibility equation is: P = 1 / (1 + exp-(6.46-22.34*elevation-5.33*accumulation flow-7.99* drainage density-4.47*lineament density-17.31*land use)) and using this equation the susceptibility maps were obtained. To easy compare the results of the two methodologies, the susceptibility maps were reclassified in five susceptibility intervals (very high, high, moderate, low and very low) using natural breaks. Then the maps were validated using two cumulative distribution curves, one related to the landslides (number of landslides in each susceptibility class) and one to the basin (number of pixel covering each class). Comparing the curves for each method, it results that the two approaches (univariate and multivariate) are appropriate, providing acceptable results. In both maps the distribution of high susceptibility condition is mainly localized on the left slope of the catchment in agreement with the field evidences. The comparison between the methods was obtained by subtraction of the two maps. This operation shows that about 40% of the basin is classified by the same class of susceptibility. In general the univariate probabilistic method tends to overestimate the areal extension of the high susceptibility class with respect to the maps obtained by the logistic regression method.

  4. Estimating the causes of traffic accidents using logistic regression and discriminant analysis.

    PubMed

    Karacasu, Murat; Ergül, Barış; Altin Yavuz, Arzu

    2014-01-01

    Factors that affect traffic accidents have been analysed in various ways. In this study, we use the methods of logistic regression and discriminant analysis to determine the damages due to injury and non-injury accidents in the Eskisehir Province. Data were obtained from the accident reports of the General Directorate of Security in Eskisehir; 2552 traffic accidents between January and December 2009 were investigated regarding whether they resulted in injury. According to the results, the effects of traffic accidents were reflected in the variables. These results provide a wealth of information that may aid future measures toward the prevention of undesired results.

  5. A generalized least squares regression approach for computing effect sizes in single-case research: application examples.

    PubMed

    Maggin, Daniel M; Swaminathan, Hariharan; Rogers, Helen J; O'Keeffe, Breda V; Sugai, George; Horner, Robert H

    2011-06-01

    A new method for deriving effect sizes from single-case designs is proposed. The strategy is applicable to small-sample time-series data with autoregressive errors. The method uses Generalized Least Squares (GLS) to model the autocorrelation of the data and estimate regression parameters to produce an effect size that represents the magnitude of treatment effect from baseline to treatment phases in standard deviation units. In this paper, the method is applied to two published examples using common single case designs (i.e., withdrawal and multiple-baseline). The results from these studies are described, and the method is compared to ten desirable criteria for single-case effect sizes. Based on the results of this application, we conclude with observations about the use of GLS as a support to visual analysis, provide recommendations for future research, and describe implications for practice. Copyright © 2011 Society for the Study of School Psychology. Published by Elsevier Ltd. All rights reserved.

  6. OPLS statistical model versus linear regression to assess sonographic predictors of stroke prognosis.

    PubMed

    Vajargah, Kianoush Fathi; Sadeghi-Bazargani, Homayoun; Mehdizadeh-Esfanjani, Robab; Savadi-Oskouei, Daryoush; Farhoudi, Mehdi

    2012-01-01

    The objective of the present study was to assess the comparable applicability of orthogonal projections to latent structures (OPLS) statistical model vs traditional linear regression in order to investigate the role of trans cranial doppler (TCD) sonography in predicting ischemic stroke prognosis. The study was conducted on 116 ischemic stroke patients admitted to a specialty neurology ward. The Unified Neurological Stroke Scale was used once for clinical evaluation on the first week of admission and again six months later. All data was primarily analyzed using simple linear regression and later considered for multivariate analysis using PLS/OPLS models through the SIMCA P+12 statistical software package. The linear regression analysis results used for the identification of TCD predictors of stroke prognosis were confirmed through the OPLS modeling technique. Moreover, in comparison to linear regression, the OPLS model appeared to have higher sensitivity in detecting the predictors of ischemic stroke prognosis and detected several more predictors. Applying the OPLS model made it possible to use both single TCD measures/indicators and arbitrarily dichotomized measures of TCD single vessel involvement as well as the overall TCD result. In conclusion, the authors recommend PLS/OPLS methods as complementary rather than alternative to the available classical regression models such as linear regression.

  7. Incorporation of prior information on parameters into nonlinear regression groundwater flow models: 1. Theory

    USGS Publications Warehouse

    Cooley, Richard L.

    1982-01-01

    Prior information on the parameters of a groundwater flow model can be used to improve parameter estimates obtained from nonlinear regression solution of a modeling problem. Two scales of prior information can be available: (1) prior information having known reliability (that is, bias and random error structure) and (2) prior information consisting of best available estimates of unknown reliability. A regression method that incorporates the second scale of prior information assumes the prior information to be fixed for any particular analysis to produce improved, although biased, parameter estimates. Approximate optimization of two auxiliary parameters of the formulation is used to help minimize the bias, which is almost always much smaller than that resulting from standard ridge regression. It is shown that if both scales of prior information are available, then a combined regression analysis may be made.

  8. The association between short interpregnancy interval and preterm birth in Louisiana: a comparison of methods.

    PubMed

    Howard, Elizabeth J; Harville, Emily; Kissinger, Patricia; Xiong, Xu

    2013-07-01

    There is growing interest in the application of propensity scores (PS) in epidemiologic studies, especially within the field of reproductive epidemiology. This retrospective cohort study assesses the impact of a short interpregnancy interval (IPI) on preterm birth and compares the results of the conventional logistic regression analysis with analyses utilizing a PS. The study included 96,378 singleton infants from Louisiana birth certificate data (1995-2007). Five regression models designed for methods comparison are presented. Ten percent (10.17 %) of all births were preterm; 26.83 % of births were from a short IPI. The PS-adjusted model produced a more conservative estimate of the exposure variable compared to the conventional logistic regression method (β-coefficient: 0.21 vs. 0.43), as well as a smaller standard error (0.024 vs. 0.028), odds ratio and 95 % confidence intervals [1.15 (1.09, 1.20) vs. 1.23 (1.17, 1.30)]. The inclusion of more covariate and interaction terms in the PS did not change the estimates of the exposure variable. This analysis indicates that PS-adjusted regression may be appropriate for validation of conventional methods in a large dataset with a fairly common outcome. PS's may be beneficial in producing more precise estimates, especially for models with many confounders and effect modifiers and where conventional adjustment with logistic regression is unsatisfactory. Short intervals between pregnancies are associated with preterm birth in this population, according to either technique. Birth spacing is an issue that women have some control over. Educational interventions, including birth control, should be applied during prenatal visits and following delivery.

  9. Experimental investigation of fuel regression rate in a HTPB based lab-scale hybrid rocket motor

    NASA Astrophysics Data System (ADS)

    Li, Xintian; Tian, Hui; Yu, Nanjia; Cai, Guobiao

    2014-12-01

    The fuel regression rate is an important parameter in the design process of the hybrid rocket motor. Additives in the solid fuel may have influences on the fuel regression rate, which will affect the internal ballistics of the motor. A series of firing experiments have been conducted on lab-scale hybrid rocket motors with 98% hydrogen peroxide (H2O2) oxidizer and hydroxyl terminated polybutadiene (HTPB) based fuels in this paper. An innovative fuel regression rate analysis method is established to diminish the errors caused by start and tailing stages in a short time firing test. The effects of the metal Mg, Al, aromatic hydrocarbon anthracene (C14H10), and carbon black (C) on the fuel regression rate are investigated. The fuel regression rate formulas of different fuel components are fitted according to the experiment data. The results indicate that the influence of C14H10 on the fuel regression rate of HTPB is not evident. However, the metal additives in the HTPB fuel can increase the fuel regression rate significantly.

  10. The Precision Efficacy Analysis for Regression Sample Size Method.

    ERIC Educational Resources Information Center

    Brooks, Gordon P.; Barcikowski, Robert S.

    The general purpose of this study was to examine the efficiency of the Precision Efficacy Analysis for Regression (PEAR) method for choosing appropriate sample sizes in regression studies used for precision. The PEAR method, which is based on the algebraic manipulation of an accepted cross-validity formula, essentially uses an effect size to…

  11. Methods for identifying SNP interactions: a review on variations of Logic Regression, Random Forest and Bayesian logistic regression.

    PubMed

    Chen, Carla Chia-Ming; Schwender, Holger; Keith, Jonathan; Nunkesser, Robin; Mengersen, Kerrie; Macrossan, Paula

    2011-01-01

    Due to advancements in computational ability, enhanced technology and a reduction in the price of genotyping, more data are being generated for understanding genetic associations with diseases and disorders. However, with the availability of large data sets comes the inherent challenges of new methods of statistical analysis and modeling. Considering a complex phenotype may be the effect of a combination of multiple loci, various statistical methods have been developed for identifying genetic epistasis effects. Among these methods, logic regression (LR) is an intriguing approach incorporating tree-like structures. Various methods have built on the original LR to improve different aspects of the model. In this study, we review four variations of LR, namely Logic Feature Selection, Monte Carlo Logic Regression, Genetic Programming for Association Studies, and Modified Logic Regression-Gene Expression Programming, and investigate the performance of each method using simulated and real genotype data. We contrast these with another tree-like approach, namely Random Forests, and a Bayesian logistic regression with stochastic search variable selection.

  12. cp-R, an interface the R programming language for clinical laboratory method comparisons.

    PubMed

    Holmes, Daniel T

    2015-02-01

    Clinical scientists frequently need to compare two different bioanalytical methods as part of assay validation/monitoring. As a matter necessity, regression methods for quantitative comparison in clinical chemistry, hematology and other clinical laboratory disciplines must allow for error in both the x and y variables. Traditionally the methods popularized by 1) Deming and 2) Passing and Bablok have been recommended. While commercial tools exist, no simple open source tool is available. The purpose of this work was to develop and entirely open-source GUI-driven program for bioanalytical method comparisons capable of performing these regression methods and able to produce highly customized graphical output. The GUI is written in python and PyQt4 with R scripts performing regression and graphical functions. The program can be run from source code or as a pre-compiled binary executable. The software performs three forms of regression and offers weighting where applicable. Confidence bands of the regression are calculated using bootstrapping for Deming and Passing Bablok methods. Users can customize regression plots according to the tools available in R and can produced output in any of: jpg, png, tiff, bmp at any desired resolution or ps and pdf vector formats. Bland Altman plots and some regression diagnostic plots are also generated. Correctness of regression parameter estimates was confirmed against existing R packages. The program allows for rapid and highly customizable graphical output capable of conforming to the publication requirements of any clinical chemistry journal. Quick method comparisons can also be performed and cut and paste into spreadsheet or word processing applications. We present a simple and intuitive open source tool for quantitative method comparison in a clinical laboratory environment. Copyright © 2014 The Canadian Society of Clinical Chemists. Published by Elsevier Inc. All rights reserved.

  13. Cross-validation pitfalls when selecting and assessing regression and classification models.

    PubMed

    Krstajic, Damjan; Buturovic, Ljubomir J; Leahy, David E; Thomas, Simon

    2014-03-29

    We address the problem of selecting and assessing classification and regression models using cross-validation. Current state-of-the-art methods can yield models with high variance, rendering them unsuitable for a number of practical applications including QSAR. In this paper we describe and evaluate best practices which improve reliability and increase confidence in selected models. A key operational component of the proposed methods is cloud computing which enables routine use of previously infeasible approaches. We describe in detail an algorithm for repeated grid-search V-fold cross-validation for parameter tuning in classification and regression, and we define a repeated nested cross-validation algorithm for model assessment. As regards variable selection and parameter tuning we define two algorithms (repeated grid-search cross-validation and double cross-validation), and provide arguments for using the repeated grid-search in the general case. We show results of our algorithms on seven QSAR datasets. The variation of the prediction performance, which is the result of choosing different splits of the dataset in V-fold cross-validation, needs to be taken into account when selecting and assessing classification and regression models. We demonstrate the importance of repeating cross-validation when selecting an optimal model, as well as the importance of repeating nested cross-validation when assessing a prediction error.

  14. Simulation of land use change in the three gorges reservoir area based on CART-CA

    NASA Astrophysics Data System (ADS)

    Yuan, Min

    2018-05-01

    This study proposes a new method to simulate spatiotemporal complex multiple land uses by using classification and regression tree algorithm (CART) based CA model. In this model, we use classification and regression tree algorithm to calculate land class conversion probability, and combine neighborhood factor, random factor to extract cellular transformation rules. The overall Kappa coefficient is 0.8014 and the overall accuracy is 0.8821 in the land dynamic simulation results of the three gorges reservoir area from 2000 to 2010, and the simulation results are satisfactory.

  15. Parameter estimation in Cox models with missing failure indicators and the OPPERA study.

    PubMed

    Brownstein, Naomi C; Cai, Jianwen; Slade, Gary D; Bair, Eric

    2015-12-30

    In a prospective cohort study, examining all participants for incidence of the condition of interest may be prohibitively expensive. For example, the "gold standard" for diagnosing temporomandibular disorder (TMD) is a physical examination by a trained clinician. In large studies, examining all participants in this manner is infeasible. Instead, it is common to use questionnaires to screen for incidence of TMD and perform the "gold standard" examination only on participants who screen positively. Unfortunately, some participants may leave the study before receiving the "gold standard" examination. Within the framework of survival analysis, this results in missing failure indicators. Motivated by the Orofacial Pain: Prospective Evaluation and Risk Assessment (OPPERA) study, a large cohort study of TMD, we propose a method for parameter estimation in survival models with missing failure indicators. We estimate the probability of being an incident case for those lacking a "gold standard" examination using logistic regression. These estimated probabilities are used to generate multiple imputations of case status for each missing examination that are combined with observed data in appropriate regression models. The variance introduced by the procedure is estimated using multiple imputation. The method can be used to estimate both regression coefficients in Cox proportional hazard models as well as incidence rates using Poisson regression. We simulate data with missing failure indicators and show that our method performs as well as or better than competing methods. Finally, we apply the proposed method to data from the OPPERA study. Copyright © 2015 John Wiley & Sons, Ltd.

  16. Advantages of geographically weighted regression for modeling benthic substrate in two Greater Yellowstone Ecosystem streams

    USGS Publications Warehouse

    Sheehan, Kenneth R.; Strager, Michael P.; Welsh, Stuart A.

    2013-01-01

    Stream habitat assessments are commonplace in fish management, and often involve nonspatial analysis methods for quantifying or predicting habitat, such as ordinary least squares regression (OLS). Spatial relationships, however, often exist among stream habitat variables. For example, water depth, water velocity, and benthic substrate sizes within streams are often spatially correlated and may exhibit spatial nonstationarity or inconsistency in geographic space. Thus, analysis methods should address spatial relationships within habitat datasets. In this study, OLS and a recently developed method, geographically weighted regression (GWR), were used to model benthic substrate from water depth and water velocity data at two stream sites within the Greater Yellowstone Ecosystem. For data collection, each site was represented by a grid of 0.1 m2 cells, where actual values of water depth, water velocity, and benthic substrate class were measured for each cell. Accuracies of regressed substrate class data by OLS and GWR methods were calculated by comparing maps, parameter estimates, and determination coefficient r 2. For analysis of data from both sites, Akaike’s Information Criterion corrected for sample size indicated the best approximating model for the data resulted from GWR and not from OLS. Adjusted r 2 values also supported GWR as a better approach than OLS for prediction of substrate. This study supports GWR (a spatial analysis approach) over nonspatial OLS methods for prediction of habitat for stream habitat assessments.

  17. Use of probabilistic weights to enhance linear regression myoelectric control

    NASA Astrophysics Data System (ADS)

    Smith, Lauren H.; Kuiken, Todd A.; Hargrove, Levi J.

    2015-12-01

    Objective. Clinically available prostheses for transradial amputees do not allow simultaneous myoelectric control of degrees of freedom (DOFs). Linear regression methods can provide simultaneous myoelectric control, but frequently also result in difficulty with isolating individual DOFs when desired. This study evaluated the potential of using probabilistic estimates of categories of gross prosthesis movement, which are commonly used in classification-based myoelectric control, to enhance linear regression myoelectric control. Approach. Gaussian models were fit to electromyogram (EMG) feature distributions for three movement classes at each DOF (no movement, or movement in either direction) and used to weight the output of linear regression models by the probability that the user intended the movement. Eight able-bodied and two transradial amputee subjects worked in a virtual Fitts’ law task to evaluate differences in controllability between linear regression and probability-weighted regression for an intramuscular EMG-based three-DOF wrist and hand system. Main results. Real-time and offline analyses in able-bodied subjects demonstrated that probability weighting improved performance during single-DOF tasks (p < 0.05) by preventing extraneous movement at additional DOFs. Similar results were seen in experiments with two transradial amputees. Though goodness-of-fit evaluations suggested that the EMG feature distributions showed some deviations from the Gaussian, equal-covariance assumptions used in this experiment, the assumptions were sufficiently met to provide improved performance compared to linear regression control. Significance. Use of probability weights can improve the ability to isolate individual during linear regression myoelectric control, while maintaining the ability to simultaneously control multiple DOFs.

  18. Regression modeling and prediction of road sweeping brush load characteristics from finite element analysis and experimental results.

    PubMed

    Wang, Chong; Sun, Qun; Wahab, Magd Abdel; Zhang, Xingyu; Xu, Limin

    2015-09-01

    Rotary cup brushes mounted on each side of a road sweeper undertake heavy debris removal tasks but the characteristics have not been well known until recently. A Finite Element (FE) model that can analyze brush deformation and predict brush characteristics have been developed to investigate the sweeping efficiency and to assist the controller design. However, the FE model requires large amount of CPU time to simulate each brush design and operating scenario, which may affect its applications in a real-time system. This study develops a mathematical regression model to summarize the FE modeled results. The complex brush load characteristic curves were statistically analyzed to quantify the effects of cross-section, length, mounting angle, displacement and rotational speed etc. The data were then fitted by a multiple variable regression model using the maximum likelihood method. The fitted results showed good agreement with the FE analysis results and experimental results, suggesting that the mathematical regression model may be directly used in a real-time system to predict characteristics of different brushes under varying operating conditions. The methodology may also be used in the design and optimization of rotary brush tools. Copyright © 2015 Elsevier Ltd. All rights reserved.

  19. Fungible weights in logistic regression.

    PubMed

    Jones, Jeff A; Waller, Niels G

    2016-06-01

    In this article we develop methods for assessing parameter sensitivity in logistic regression models. To set the stage for this work, we first review Waller's (2008) equations for computing fungible weights in linear regression. Next, we describe 2 methods for computing fungible weights in logistic regression. To demonstrate the utility of these methods, we compute fungible logistic regression weights using data from the Centers for Disease Control and Prevention's (2010) Youth Risk Behavior Surveillance Survey, and we illustrate how these alternate weights can be used to evaluate parameter sensitivity. To make our work accessible to the research community, we provide R code (R Core Team, 2015) that will generate both kinds of fungible logistic regression weights. (PsycINFO Database Record (c) 2016 APA, all rights reserved).

  20. Development of hybrid genetic-algorithm-based neural networks using regression trees for modeling air quality inside a public transportation bus.

    PubMed

    Kadiyala, Akhil; Kaur, Devinder; Kumar, Ashok

    2013-02-01

    The present study developed a novel approach to modeling indoor air quality (IAQ) of a public transportation bus by the development of hybrid genetic-algorithm-based neural networks (also known as evolutionary neural networks) with input variables optimized from using the regression trees, referred as the GART approach. This study validated the applicability of the GART modeling approach in solving complex nonlinear systems by accurately predicting the monitored contaminants of carbon dioxide (CO2), carbon monoxide (CO), nitric oxide (NO), sulfur dioxide (SO2), 0.3-0.4 microm sized particle numbers, 0.4-0.5 microm sized particle numbers, particulate matter (PM) concentrations less than 1.0 microm (PM10), and PM concentrations less than 2.5 microm (PM2.5) inside a public transportation bus operating on 20% grade biodiesel in Toledo, OH. First, the important variables affecting each monitored in-bus contaminant were determined using regression trees. Second, the analysis of variance was used as a complimentary sensitivity analysis to the regression tree results to determine a subset of statistically significant variables affecting each monitored in-bus contaminant. Finally, the identified subsets of statistically significant variables were used as inputs to develop three artificial neural network (ANN) models. The models developed were regression tree-based back-propagation network (BPN-RT), regression tree-based radial basis function network (RBFN-RT), and GART models. Performance measures were used to validate the predictive capacity of the developed IAQ models. The results from this approach were compared with the results obtained from using a theoretical approach and a generalized practicable approach to modeling IAQ that included the consideration of additional independent variables when developing the aforementioned ANN models. The hybrid GART models were able to capture majority of the variance in the monitored in-bus contaminants. The genetic-algorithm-based neural network IAQ models outperformed the traditional ANN methods of the back-propagation and the radial basis function networks. The novelty of this research is the development of a novel approach to modeling vehicular indoor air quality by integration of the advanced methods of genetic algorithms, regression trees, and the analysis of variance for the monitored in-vehicle gaseous and particulate matter contaminants, and comparing the results obtained from using the developed approach with conventional artificial intelligence techniques of back propagation networks and radial basis function networks. This study validated the newly developed approach using holdout and threefold cross-validation methods. These results are of great interest to scientists, researchers, and the public in understanding the various aspects of modeling an indoor microenvironment. This methodology can easily be extended to other fields of study also.

  1. An interlaboratory comparison study on the measurement of elements in PM10

    NASA Astrophysics Data System (ADS)

    Yatkin, Sinan; Belis, Claudio A.; Gerboles, Michel; Calzolai, Giulia; Lucarelli, Franco; Cavalli, Fabrizia; Trzepla, Krystyna

    2016-01-01

    An inter-laboratory comparison study was conducted to measure elemental loadings on PM10 samples, collected in Ispra, a regional background/rural site in Italy, using three different XRF (X-ray Fluorescence) methods, namely Epsilon 5 by linear calibration, Quant'X by the standardless analysis, and PIXE (Particle Induced X-ray Emission) with linear calibration. A subset of samples was also analyzed by ICP-MS (Inductively Coupled Plasma-Mass Spectrometry). Several metrics including method detection limits (MDLs), precision, bias from a NIST standard reference material (SRM 2783) quoted values, relative absolute difference, orthogonal regression and the ratio of the absolute difference between the methods to claimed uncertainty were used to compare the laboratories. The MDLs were found to be comparable for many elements. Precision estimates were less than 10% for the majority of the elements. Absolute biases from SRM 2783 remained less than 20% for the majority of certified elements. The regression results of PM10 samples showed that the three XRF laboratories measured very similar mass loadings for S, K, Ti, Mn, Fe, Cu, Br, Sr and Pb with slopes within 20% of unity. The ICP-MS results confirmed the agreement and discrepancies between XRF laboratories for Al, K, Ca, Ti, V, Cu, Sr and Pb. The ICP-MS results are inconsistent with the XRF laboratories for Fe and Zn. The absolute differences between the XRF laboratories generally remained within their claimed uncertainties, showing a pattern generally consistent with the orthogonal regression results.

  2. Incorporating wind availability into land use regression modelling of air quality in mountainous high-density urban environment.

    PubMed

    Shi, Yuan; Lau, Kevin Ka-Lun; Ng, Edward

    2017-08-01

    Urban air quality serves as an important function of the quality of urban life. Land use regression (LUR) modelling of air quality is essential for conducting health impacts assessment but more challenging in mountainous high-density urban scenario due to the complexities of the urban environment. In this study, a total of 21 LUR models are developed for seven kinds of air pollutants (gaseous air pollutants CO, NO 2 , NO x , O 3 , SO 2 and particulate air pollutants PM 2.5 , PM 10 ) with reference to three different time periods (summertime, wintertime and annual average of 5-year long-term hourly monitoring data from local air quality monitoring network) in Hong Kong. Under the mountainous high-density urban scenario, we improved the traditional LUR modelling method by incorporating wind availability information into LUR modelling based on surface geomorphometrical analysis. As a result, 269 independent variables were examined to develop the LUR models by using the "ADDRESS" independent variable selection method and stepwise multiple linear regression (MLR). Cross validation has been performed for each resultant model. The results show that wind-related variables are included in most of the resultant models as statistically significant independent variables. Compared with the traditional method, a maximum increase of 20% was achieved in the prediction performance of annual averaged NO 2 concentration level by incorporating wind-related variables into LUR model development. Copyright © 2017 Elsevier Inc. All rights reserved.

  3. Determining Cutoff Point of Ensemble Trees Based on Sample Size in Predicting Clinical Dose with DNA Microarray Data.

    PubMed

    Yılmaz Isıkhan, Selen; Karabulut, Erdem; Alpar, Celal Reha

    2016-01-01

    Background/Aim . Evaluating the success of dose prediction based on genetic or clinical data has substantially advanced recently. The aim of this study is to predict various clinical dose values from DNA gene expression datasets using data mining techniques. Materials and Methods . Eleven real gene expression datasets containing dose values were included. First, important genes for dose prediction were selected using iterative sure independence screening. Then, the performances of regression trees (RTs), support vector regression (SVR), RT bagging, SVR bagging, and RT boosting were examined. Results . The results demonstrated that a regression-based feature selection method substantially reduced the number of irrelevant genes from raw datasets. Overall, the best prediction performance in nine of 11 datasets was achieved using SVR; the second most accurate performance was provided using a gradient-boosting machine (GBM). Conclusion . Analysis of various dose values based on microarray gene expression data identified common genes found in our study and the referenced studies. According to our findings, SVR and GBM can be good predictors of dose-gene datasets. Another result of the study was to identify the sample size of n = 25 as a cutoff point for RT bagging to outperform a single RT.

  4. Application of principal component regression and artificial neural network in FT-NIR soluble solids content determination of intact pear fruit

    NASA Astrophysics Data System (ADS)

    Ying, Yibin; Liu, Yande; Fu, Xiaping; Lu, Huishan

    2005-11-01

    The artificial neural networks (ANNs) have been used successfully in applications such as pattern recognition, image processing, automation and control. However, majority of today's applications of ANNs is back-propagate feed-forward ANN (BP-ANN). In this paper, back-propagation artificial neural networks (BP-ANN) were applied for modeling soluble solid content (SSC) of intact pear from their Fourier transform near infrared (FT-NIR) spectra. One hundred and sixty-four pear samples were used to build the calibration models and evaluate the models predictive ability. The results are compared to the classical calibration approaches, i.e. principal component regression (PCR), partial least squares (PLS) and non-linear PLS (NPLS). The effects of the optimal methods of training parameters on the prediction model were also investigated. BP-ANN combine with principle component regression (PCR) resulted always better than the classical PCR, PLS and Weight-PLS methods, from the point of view of the predictive ability. Based on the results, it can be concluded that FT-NIR spectroscopy and BP-ANN models can be properly employed for rapid and nondestructive determination of fruit internal quality.

  5. Advances in simultaneous atmospheric profile and cloud parameter regression based retrieval from high-spectral resolution radiance measurements

    NASA Astrophysics Data System (ADS)

    Weisz, Elisabeth; Smith, William L.; Smith, Nadia

    2013-06-01

    The dual-regression (DR) method retrieves information about the Earth surface and vertical atmospheric conditions from measurements made by any high-spectral resolution infrared sounder in space. The retrieved information includes temperature and atmospheric gases (such as water vapor, ozone, and carbon species) as well as surface and cloud top parameters. The algorithm was designed to produce a high-quality product with low latency and has been demonstrated to yield accurate results in real-time environments. The speed of the retrieval is achieved through linear regression, while accuracy is achieved through a series of classification schemes and decision-making steps. These steps are necessary to account for the nonlinearity of hyperspectral retrievals. In this work, we detail the key steps that have been developed in the DR method to advance accuracy in the retrieval of nonlinear parameters, specifically cloud top pressure. The steps and their impact on retrieval results are discussed in-depth and illustrated through relevant case studies. In addition to discussing and demonstrating advances made in addressing nonlinearity in a linear geophysical retrieval method, advances toward multi-instrument geophysical analysis by applying the DR to three different operational sounders in polar orbit are also noted. For any area on the globe, the DR method achieves consistent accuracy and precision, making it potentially very valuable to both the meteorological and environmental user communities.

  6. Estimating Time to Event From Longitudinal Categorical Data: An Analysis of Multiple Sclerosis Progression.

    PubMed

    Mandel, Micha; Gauthier, Susan A; Guttmann, Charles R G; Weiner, Howard L; Betensky, Rebecca A

    2007-12-01

    The expanded disability status scale (EDSS) is an ordinal score that measures progression in multiple sclerosis (MS). Progression is defined as reaching EDSS of a certain level (absolute progression) or increasing of one point of EDSS (relative progression). Survival methods for time to progression are not adequate for such data since they do not exploit the EDSS level at the end of follow-up. Instead, we suggest a Markov transitional model applicable for repeated categorical or ordinal data. This approach enables derivation of covariate-specific survival curves, obtained after estimation of the regression coefficients and manipulations of the resulting transition matrix. Large sample theory and resampling methods are employed to derive pointwise confidence intervals, which perform well in simulation. Methods for generating survival curves for time to EDSS of a certain level, time to increase of EDSS of at least one point, and time to two consecutive visits with EDSS greater than three are described explicitly. The regression models described are easily implemented using standard software packages. Survival curves are obtained from the regression results using packages that support simple matrix calculation. We present and demonstrate our method on data collected at the Partners MS center in Boston, MA. We apply our approach to progression defined by time to two consecutive visits with EDSS greater than three, and calculate crude (without covariates) and covariate-specific curves.

  7. Reader reaction to "a robust method for estimating optimal treatment regimes" by Zhang et al. (2012).

    PubMed

    Taylor, Jeremy M G; Cheng, Wenting; Foster, Jared C

    2015-03-01

    A recent article (Zhang et al., 2012, Biometrics 168, 1010-1018) compares regression based and inverse probability based methods of estimating an optimal treatment regime and shows for a small number of covariates that inverse probability weighted methods are more robust to model misspecification than regression methods. We demonstrate that using models that fit the data better reduces the concern about non-robustness for the regression methods. We extend the simulation study of Zhang et al. (2012, Biometrics 168, 1010-1018), also considering the situation of a larger number of covariates, and show that incorporating random forests into both regression and inverse probability weighted based methods improves their properties. © 2014, The International Biometric Society.

  8. Statistical Approaches for Spatiotemporal Prediction of Low Flows

    NASA Astrophysics Data System (ADS)

    Fangmann, A.; Haberlandt, U.

    2017-12-01

    An adequate assessment of regional climate change impacts on streamflow requires the integration of various sources of information and modeling approaches. This study proposes simple statistical tools for inclusion into model ensembles, which are fast and straightforward in their application, yet able to yield accurate streamflow predictions in time and space. Target variables for all approaches are annual low flow indices derived from a data set of 51 records of average daily discharge for northwestern Germany. The models require input of climatic data in the form of meteorological drought indices, derived from observed daily climatic variables, averaged over the streamflow gauges' catchments areas. Four different modeling approaches are analyzed. Basis for all pose multiple linear regression models that estimate low flows as a function of a set of meteorological indices and/or physiographic and climatic catchment descriptors. For the first method, individual regression models are fitted at each station, predicting annual low flow values from a set of annual meteorological indices, which are subsequently regionalized using a set of catchment characteristics. The second method combines temporal and spatial prediction within a single panel data regression model, allowing estimation of annual low flow values from input of both annual meteorological indices and catchment descriptors. The third and fourth methods represent non-stationary low flow frequency analyses and require fitting of regional distribution functions. Method three is subject to a spatiotemporal prediction of an index value, method four to estimation of L-moments that adapt the regional frequency distribution to the at-site conditions. The results show that method two outperforms successive prediction in time and space. Method three also shows a high performance in the near future period, but since it relies on a stationary distribution, its application for prediction of far future changes may be problematic. Spatiotemporal prediction of L-moments appeared highly uncertain for higher-order moments resulting in unrealistic future low flow values. All in all, the results promote an inclusion of simple statistical methods in climate change impact assessment.

  9. Nonparametric Methods in Astronomy: Think, Regress, Observe—Pick Any Three

    NASA Astrophysics Data System (ADS)

    Steinhardt, Charles L.; Jermyn, Adam S.

    2018-02-01

    Telescopes are much more expensive than astronomers, so it is essential to minimize required sample sizes by using the most data-efficient statistical methods possible. However, the most commonly used model-independent techniques for finding the relationship between two variables in astronomy are flawed. In the worst case they can lead without warning to subtly yet catastrophically wrong results, and even in the best case they require more data than necessary. Unfortunately, there is no single best technique for nonparametric regression. Instead, we provide a guide for how astronomers can choose the best method for their specific problem and provide a python library with both wrappers for the most useful existing algorithms and implementations of two new algorithms developed here.

  10. A new analytical method for quantification of olive and palm oil in blends with other vegetable edible oils based on the chromatographic fingerprints from the methyl-transesterified fraction.

    PubMed

    Jiménez-Carvelo, Ana M; González-Casado, Antonio; Cuadros-Rodríguez, Luis

    2017-03-01

    A new analytical method for the quantification of olive oil and palm oil in blends with other vegetable edible oils (canola, safflower, corn, peanut, seeds, grapeseed, linseed, sesame and soybean) using normal phase liquid chromatography, and applying chemometric tools was developed. The procedure for obtaining of chromatographic fingerprint from the methyl-transesterified fraction from each blend is described. The multivariate quantification methods used were Partial Least Square-Regression (PLS-R) and Support Vector Regression (SVR). The quantification results were evaluated by several parameters as the Root Mean Square Error of Validation (RMSEV), Mean Absolute Error of Validation (MAEV) and Median Absolute Error of Validation (MdAEV). It has to be highlighted that the new proposed analytical method, the chromatographic analysis takes only eight minutes and the results obtained showed the potential of this method and allowed quantification of mixtures of olive oil and palm oil with other vegetable oils. Copyright © 2016 Elsevier B.V. All rights reserved.

  11. Accurate motion parameter estimation for colonoscopy tracking using a regression method

    NASA Astrophysics Data System (ADS)

    Liu, Jianfei; Subramanian, Kalpathi R.; Yoo, Terry S.

    2010-03-01

    Co-located optical and virtual colonoscopy images have the potential to provide important clinical information during routine colonoscopy procedures. In our earlier work, we presented an optical flow based algorithm to compute egomotion from live colonoscopy video, permitting navigation and visualization of the corresponding patient anatomy. In the original algorithm, motion parameters were estimated using the traditional Least Sum of squares(LS) procedure which can be unstable in the context of optical flow vectors with large errors. In the improved algorithm, we use the Least Median of Squares (LMS) method, a robust regression method for motion parameter estimation. Using the LMS method, we iteratively analyze and converge toward the main distribution of the flow vectors, while disregarding outliers. We show through three experiments the improvement in tracking results obtained using the LMS method, in comparison to the LS estimator. The first experiment demonstrates better spatial accuracy in positioning the virtual camera in the sigmoid colon. The second and third experiments demonstrate the robustness of this estimator, resulting in longer tracked sequences: from 300 to 1310 in the ascending colon, and 410 to 1316 in the transverse colon.

  12. Value of Information Analysis for Time-lapse Seismic Data by Simulation-Regression

    NASA Astrophysics Data System (ADS)

    Dutta, G.; Mukerji, T.; Eidsvik, J.

    2016-12-01

    A novel method to estimate the Value of Information (VOI) of time-lapse seismic data in the context of reservoir development is proposed. VOI is a decision analytic metric quantifying the incremental value that would be created by collecting information prior to making a decision under uncertainty. The VOI has to be computed before collecting the information and can be used to justify its collection. Previous work on estimating the VOI of geophysical data has involved explicit approximation of the posterior distribution of reservoir properties given the data and then evaluating the prospect values for that posterior distribution of reservoir properties. Here, we propose to directly estimate the prospect values given the data by building a statistical relationship between them using regression. Various regression techniques such as Partial Least Squares Regression (PLSR), Multivariate Adaptive Regression Splines (MARS) and k-Nearest Neighbors (k-NN) are used to estimate the VOI, and the results compared. For a univariate Gaussian case, the VOI obtained from simulation-regression has been shown to be close to the analytical solution. Estimating VOI by simulation-regression is much less computationally expensive since the posterior distribution of reservoir properties given each possible dataset need not be modeled and the prospect values need not be evaluated for each such posterior distribution of reservoir properties. This method is flexible, since it does not require rigid model specification of posterior but rather fits conditional expectations non-parametrically from samples of values and data.

  13. Addressing the identification problem in age-period-cohort analysis: a tutorial on the use of partial least squares and principal components analysis.

    PubMed

    Tu, Yu-Kang; Krämer, Nicole; Lee, Wen-Chung

    2012-07-01

    In the analysis of trends in health outcomes, an ongoing issue is how to separate and estimate the effects of age, period, and cohort. As these 3 variables are perfectly collinear by definition, regression coefficients in a general linear model are not unique. In this tutorial, we review why identification is a problem, and how this problem may be tackled using partial least squares and principal components regression analyses. Both methods produce regression coefficients that fulfill the same collinearity constraint as the variables age, period, and cohort. We show that, because the constraint imposed by partial least squares and principal components regression is inherent in the mathematical relation among the 3 variables, this leads to more interpretable results. We use one dataset from a Taiwanese health-screening program to illustrate how to use partial least squares regression to analyze the trends in body heights with 3 continuous variables for age, period, and cohort. We then use another dataset of hepatocellular carcinoma mortality rates for Taiwanese men to illustrate how to use partial least squares regression to analyze tables with aggregated data. We use the second dataset to show the relation between the intrinsic estimator, a recently proposed method for the age-period-cohort analysis, and partial least squares regression. We also show that the inclusion of all indicator variables provides a more consistent approach. R code for our analyses is provided in the eAppendix.

  14. Using structural equation modeling to construct calibration equations relating PM2.5 mass concentration samplers to the federal reference method sampler

    NASA Astrophysics Data System (ADS)

    Bilonick, Richard A.; Connell, Daniel P.; Talbott, Evelyn O.; Rager, Judith R.; Xue, Tao

    2015-02-01

    The objective of this study was to remove systematic bias among fine particulate matter (PM2.5) mass concentration measurements made by different types of samplers used in the Pittsburgh Aerosol Research and Inhalation Epidemiology Study (PARIES). PARIES is a retrospective epidemiology study that aims to provide a comprehensive analysis of the associations between air quality and human health effects in the Pittsburgh, Pennsylvania, region from 1999 to 2008. Calibration was needed in order to minimize the amount of systematic error in PM2.5 exposure estimation as a result of including data from 97 different PM2.5 samplers at 47 monitoring sites. Ordinary regression often has been used for calibrating air quality measurements from pairs of measurement devices; however, this is only appropriate when one of the two devices (the "independent" variable) is free from random error, which is rarely the case. A group of methods known as "errors-in-variables" (e.g., Deming regression, reduced major axis regression) has been developed to handle calibration between two devices when both are subject to random error, but these methods require information on the relative sizes of the random errors for each device, which typically cannot be obtained from the observed data. When data from more than two devices (or repeats of the same device) are available, the additional information is not used to inform the calibration. A more general approach that often has been overlooked is the use of a measurement error structural equation model (SEM) that allows the simultaneous comparison of three or more devices (or repeats). The theoretical underpinnings of all of these approaches to calibration are described, and the pros and cons of each are discussed. In particular, it is shown that both ordinary regression (when used for calibration) and Deming regression are particular examples of SEMs but with substantial deficiencies. To illustrate the use of SEMs, the 7865 daily average PM2.5 mass concentration measurements made by seven collocated samplers at an urban monitoring site in Pittsburgh, Pennsylvania, were used. These samplers, which included three federal reference method (FRM) samplers, three speciation samplers, and a tapered element oscillating microbalance (TEOM), operated at various times during the 10-year PARIES study period. Because TEOM measurements are known to depend on temperature, the constructed SEM provided calibration equations relating the TEOM to the FRM and speciation samplers as a function of ambient temperature. It was shown that TEOM imprecision and TEOM bias (relative to the FRM) both decreased as temperature increased. It also was shown that the temperature dependency for bias was non-linear and followed a sigmoidal (logistic) pattern. The speciation samplers exhibited only small bias relative to the FRM samplers, although the FRM samplers were shown to be substantially more precise than both the TEOM and the speciation samplers. Comparison of the SEM results to pairwise simple linear regression results showed that the regression results can differ substantially from the correctly-derived calibration equations, especially if the less-precise device is used as the independent variable in the regression.

  15. Identification of immune correlates of protection in Shigella infection by application of machine learning.

    PubMed

    Arevalillo, Jorge M; Sztein, Marcelo B; Kotloff, Karen L; Levine, Myron M; Simon, Jakub K

    2017-10-01

    Immunologic correlates of protection are important in vaccine development because they give insight into mechanisms of protection, assist in the identification of promising vaccine candidates, and serve as endpoints in bridging clinical vaccine studies. Our goal is the development of a methodology to identify immunologic correlates of protection using the Shigella challenge as a model. The proposed methodology utilizes the Random Forests (RF) machine learning algorithm as well as Classification and Regression Trees (CART) to detect immune markers that predict protection, identify interactions between variables, and define optimal cutoffs. Logistic regression modeling is applied to estimate the probability of protection and the confidence interval (CI) for such a probability is computed by bootstrapping the logistic regression models. The results demonstrate that the combination of Classification and Regression Trees and Random Forests complements the standard logistic regression and uncovers subtle immune interactions. Specific levels of immunoglobulin IgG antibody in blood on the day of challenge predicted protection in 75% (95% CI 67-86). Of those subjects that did not have blood IgG at or above a defined threshold, 100% were protected if they had IgA antibody secreting cells above a defined threshold. Comparison with the results obtained by applying only logistic regression modeling with standard Akaike Information Criterion for model selection shows the usefulness of the proposed method. Given the complexity of the immune system, the use of machine learning methods may enhance traditional statistical approaches. When applied together, they offer a novel way to quantify important immune correlates of protection that may help the development of vaccines. Copyright © 2017 Elsevier Inc. All rights reserved.

  16. Design of an optimum computer vision-based automatic abalone (Haliotis discus hannai) grading algorithm.

    PubMed

    Lee, Donggil; Lee, Kyounghoon; Kim, Seonghun; Yang, Yongsu

    2015-04-01

    An automatic abalone grading algorithm that estimates abalone weights on the basis of computer vision using 2D images is developed and tested. The algorithm overcomes the problems experienced by conventional abalone grading methods that utilize manual sorting and mechanical automatic grading. To design an optimal algorithm, a regression formula and R(2) value were investigated by performing a regression analysis for each of total length, body width, thickness, view area, and actual volume against abalone weights. The R(2) value between the actual volume and abalone weight was 0.999, showing a relatively high correlation. As a result, to easily estimate the actual volumes of abalones based on computer vision, the volumes were calculated under the assumption that abalone shapes are half-oblate ellipsoids, and a regression formula was derived to estimate the volumes of abalones through linear regression analysis between the calculated and actual volumes. The final automatic abalone grading algorithm is designed using the abalone volume estimation regression formula derived from test results, and the actual volumes and abalone weights regression formula. In the range of abalones weighting from 16.51 to 128.01 g, the results of evaluation of the performance of algorithm via cross-validation indicate root mean square and worst-case prediction errors of are 2.8 and ±8 g, respectively. © 2015 Institute of Food Technologists®

  17. Estimation and Selection via Absolute Penalized Convex Minimization And Its Multistage Adaptive Applications

    PubMed Central

    Huang, Jian; Zhang, Cun-Hui

    2013-01-01

    The ℓ1-penalized method, or the Lasso, has emerged as an important tool for the analysis of large data sets. Many important results have been obtained for the Lasso in linear regression which have led to a deeper understanding of high-dimensional statistical problems. In this article, we consider a class of weighted ℓ1-penalized estimators for convex loss functions of a general form, including the generalized linear models. We study the estimation, prediction, selection and sparsity properties of the weighted ℓ1-penalized estimator in sparse, high-dimensional settings where the number of predictors p can be much larger than the sample size n. Adaptive Lasso is considered as a special case. A multistage method is developed to approximate concave regularized estimation by applying an adaptive Lasso recursively. We provide prediction and estimation oracle inequalities for single- and multi-stage estimators, a general selection consistency theorem, and an upper bound for the dimension of the Lasso estimator. Important models including the linear regression, logistic regression and log-linear models are used throughout to illustrate the applications of the general results. PMID:24348100

  18. Estimation of Standard Error of Regression Effects in Latent Regression Models Using Binder's Linearization. Research Report. ETS RR-07-09

    ERIC Educational Resources Information Center

    Li, Deping; Oranje, Andreas

    2007-01-01

    Two versions of a general method for approximating standard error of regression effect estimates within an IRT-based latent regression model are compared. The general method is based on Binder's (1983) approach, accounting for complex samples and finite populations by Taylor series linearization. In contrast, the current National Assessment of…

  19. Comparison of dimensionality reduction methods to predict genomic breeding values for carcass traits in pigs.

    PubMed

    Azevedo, C F; Nascimento, M; Silva, F F; Resende, M D V; Lopes, P S; Guimarães, S E F; Glória, L S

    2015-10-09

    A significant contribution of molecular genetics is the direct use of DNA information to identify genetically superior individuals. With this approach, genome-wide selection (GWS) can be used for this purpose. GWS consists of analyzing a large number of single nucleotide polymorphism markers widely distributed in the genome; however, because the number of markers is much larger than the number of genotyped individuals, and such markers are highly correlated, special statistical methods are widely required. Among these methods, independent component regression, principal component regression, partial least squares, and partial principal components stand out. Thus, the aim of this study was to propose an application of the methods of dimensionality reduction to GWS of carcass traits in an F2 (Piau x commercial line) pig population. The results show similarities between the principal and the independent component methods and provided the most accurate genomic breeding estimates for most carcass traits in pigs.

  20. Introduction to the use of regression models in epidemiology.

    PubMed

    Bender, Ralf

    2009-01-01

    Regression modeling is one of the most important statistical techniques used in analytical epidemiology. By means of regression models the effect of one or several explanatory variables (e.g., exposures, subject characteristics, risk factors) on a response variable such as mortality or cancer can be investigated. From multiple regression models, adjusted effect estimates can be obtained that take the effect of potential confounders into account. Regression methods can be applied in all epidemiologic study designs so that they represent a universal tool for data analysis in epidemiology. Different kinds of regression models have been developed in dependence on the measurement scale of the response variable and the study design. The most important methods are linear regression for continuous outcomes, logistic regression for binary outcomes, Cox regression for time-to-event data, and Poisson regression for frequencies and rates. This chapter provides a nontechnical introduction to these regression models with illustrating examples from cancer research.

  1. Improved model of the retardance in citric acid coated ferrofluids using stepwise regression

    NASA Astrophysics Data System (ADS)

    Lin, J. F.; Qiu, X. R.

    2017-06-01

    Citric acid (CA) coated Fe3O4 ferrofluids (FFs) have been conducted for biomedical application. The magneto-optical retardance of CA coated FFs was measured by a Stokes polarimeter. Optimization and multiple regression of retardance in FFs were executed by Taguchi method and Microsoft Excel previously, and the F value of regression model was large enough. However, the model executed by Excel was not systematic. Instead we adopted the stepwise regression to model the retardance of CA coated FFs. From the results of stepwise regression by MATLAB, the developed model had highly predictable ability owing to F of 2.55897e+7 and correlation coefficient of one. The average absolute error of predicted retardances to measured retardances was just 0.0044%. Using the genetic algorithm (GA) in MATLAB, the optimized parametric combination was determined as [4.709 0.12 39.998 70.006] corresponding to the pH of suspension, molar ratio of CA to Fe3O4, CA volume, and coating temperature. The maximum retardance was found as 31.712°, close to that obtained by evolutionary solver in Excel and a relative error of -0.013%. Above all, the stepwise regression method was successfully used to model the retardance of CA coated FFs, and the maximum global retardance was determined by the use of GA.

  2. Multivariate Linear Regression and CART Regression Analysis of TBM Performance at Abu Hamour Phase-I Tunnel

    NASA Astrophysics Data System (ADS)

    Jakubowski, J.; Stypulkowski, J. B.; Bernardeau, F. G.

    2017-12-01

    The first phase of the Abu Hamour drainage and storm tunnel was completed in early 2017. The 9.5 km long, 3.7 m diameter tunnel was excavated with two Earth Pressure Balance (EPB) Tunnel Boring Machines from Herrenknecht. TBM operation processes were monitored and recorded by Data Acquisition and Evaluation System. The authors coupled collected TBM drive data with available information on rock mass properties, cleansed, completed with secondary variables and aggregated by weeks and shifts. Correlations and descriptive statistics charts were examined. Multivariate Linear Regression and CART regression tree models linking TBM penetration rate (PR), penetration per revolution (PPR) and field penetration index (FPI) with TBM operational and geotechnical characteristics were performed for the conditions of the weak/soft rock of Doha. Both regression methods are interpretable and the data were screened with different computational approaches allowing enriched insight. The primary goal of the analysis was to investigate empirical relations between multiple explanatory and responding variables, to search for best subsets of explanatory variables and to evaluate the strength of linear and non-linear relations. For each of the penetration indices, a predictive model coupling both regression methods was built and validated. The resultant models appeared to be stronger than constituent ones and indicated an opportunity for more accurate and robust TBM performance predictions.

  3. Application of seemingly unrelated regression in medical data with intermittently observed time-dependent covariates.

    PubMed

    Keshavarzi, Sareh; Ayatollahi, Seyyed Mohammad Taghi; Zare, Najaf; Pakfetrat, Maryam

    2012-01-01

    BACKGROUND. In many studies with longitudinal data, time-dependent covariates can only be measured intermittently (not at all observation times), and this presents difficulties for standard statistical analyses. This situation is common in medical studies, and methods that deal with this challenge would be useful. METHODS. In this study, we performed the seemingly unrelated regression (SUR) based models, with respect to each observation time in longitudinal data with intermittently observed time-dependent covariates and further compared these models with mixed-effect regression models (MRMs) under three classic imputation procedures. Simulation studies were performed to compare the sample size properties of the estimated coefficients for different modeling choices. RESULTS. In general, the proposed models in the presence of intermittently observed time-dependent covariates showed a good performance. However, when we considered only the observed values of the covariate without any imputations, the resulted biases were greater. The performances of the proposed SUR-based models in comparison with MRM using classic imputation methods were nearly similar with approximately equal amounts of bias and MSE. CONCLUSION. The simulation study suggests that the SUR-based models work as efficiently as MRM in the case of intermittently observed time-dependent covariates. Thus, it can be used as an alternative to MRM.

  4. Spatial prediction of landslides using a hybrid machine learning approach based on Random Subspace and Classification and Regression Trees

    NASA Astrophysics Data System (ADS)

    Pham, Binh Thai; Prakash, Indra; Tien Bui, Dieu

    2018-02-01

    A hybrid machine learning approach of Random Subspace (RSS) and Classification And Regression Trees (CART) is proposed to develop a model named RSSCART for spatial prediction of landslides. This model is a combination of the RSS method which is known as an efficient ensemble technique and the CART which is a state of the art classifier. The Luc Yen district of Yen Bai province, a prominent landslide prone area of Viet Nam, was selected for the model development. Performance of the RSSCART model was evaluated through the Receiver Operating Characteristic (ROC) curve, statistical analysis methods, and the Chi Square test. Results were compared with other benchmark landslide models namely Support Vector Machines (SVM), single CART, Naïve Bayes Trees (NBT), and Logistic Regression (LR). In the development of model, ten important landslide affecting factors related with geomorphology, geology and geo-environment were considered namely slope angles, elevation, slope aspect, curvature, lithology, distance to faults, distance to rivers, distance to roads, and rainfall. Performance of the RSSCART model (AUC = 0.841) is the best compared with other popular landslide models namely SVM (0.835), single CART (0.822), NBT (0.821), and LR (0.723). These results indicate that performance of the RSSCART is a promising method for spatial landslide prediction.

  5. Understanding the Impact of School Factors on School Counselor Burnout: A Mixed-Methods Study

    ERIC Educational Resources Information Center

    Bardhoshi, Gerta; Schweinle, Amy; Duncan, Kelly

    2014-01-01

    This mixed-methods study investigated the relationship between burnout and performing noncounseling duties among a national sample of professional school counselors, while identifying school factors that could attenuate this relationship. Results of regression analyses indicate that performing noncounseling duties significantly predicted burnout…

  6. DOA Finding with Support Vector Regression Based Forward-Backward Linear Prediction.

    PubMed

    Pan, Jingjing; Wang, Yide; Le Bastard, Cédric; Wang, Tianzhen

    2017-05-27

    Direction-of-arrival (DOA) estimation has drawn considerable attention in array signal processing, particularly with coherent signals and a limited number of snapshots. Forward-backward linear prediction (FBLP) is able to directly deal with coherent signals. Support vector regression (SVR) is robust with small samples. This paper proposes the combination of the advantages of FBLP and SVR in the estimation of DOAs of coherent incoming signals with low snapshots. The performance of the proposed method is validated with numerical simulations in coherent scenarios, in terms of different angle separations, numbers of snapshots, and signal-to-noise ratios (SNRs). Simulation results show the effectiveness of the proposed method.

  7. Face Hallucination with Linear Regression Model in Semi-Orthogonal Multilinear PCA Method

    NASA Astrophysics Data System (ADS)

    Asavaskulkiet, Krissada

    2018-04-01

    In this paper, we propose a new face hallucination technique, face images reconstruction in HSV color space with a semi-orthogonal multilinear principal component analysis method. This novel hallucination technique can perform directly from tensors via tensor-to-vector projection by imposing the orthogonality constraint in only one mode. In our experiments, we use facial images from FERET database to test our hallucination approach which is demonstrated by extensive experiments with high-quality hallucinated color faces. The experimental results assure clearly demonstrated that we can generate photorealistic color face images by using the SO-MPCA subspace with a linear regression model.

  8. Improving power and robustness for detecting genetic association with extreme-value sampling design.

    PubMed

    Chen, Hua Yun; Li, Mingyao

    2011-12-01

    Extreme-value sampling design that samples subjects with extremely large or small quantitative trait values is commonly used in genetic association studies. Samples in such designs are often treated as "cases" and "controls" and analyzed using logistic regression. Such a case-control analysis ignores the potential dose-response relationship between the quantitative trait and the underlying trait locus and thus may lead to loss of power in detecting genetic association. An alternative approach to analyzing such data is to model the dose-response relationship by a linear regression model. However, parameter estimation from this model can be biased, which may lead to inflated type I errors. We propose a robust and efficient approach that takes into consideration of both the biased sampling design and the potential dose-response relationship. Extensive simulations demonstrate that the proposed method is more powerful than the traditional logistic regression analysis and is more robust than the linear regression analysis. We applied our method to the analysis of a candidate gene association study on high-density lipoprotein cholesterol (HDL-C) which includes study subjects with extremely high or low HDL-C levels. Using our method, we identified several SNPs showing a stronger evidence of association with HDL-C than the traditional case-control logistic regression analysis. Our results suggest that it is important to appropriately model the quantitative traits and to adjust for the biased sampling when dose-response relationship exists in extreme-value sampling designs. © 2011 Wiley Periodicals, Inc.

  9. Comparison of different wind data interpolation methods for a region with complex terrain in Central Asia

    NASA Astrophysics Data System (ADS)

    Reinhardt, Katja; Samimi, Cyrus

    2018-01-01

    While climatological data of high spatial resolution are largely available in most developed countries, the network of climatological stations in many other regions of the world still constitutes large gaps. Especially for those regions, interpolation methods are important tools to fill these gaps and to improve the data base indispensible for climatological research. Over the last years, new hybrid methods of machine learning and geostatistics have been developed which provide innovative prospects in spatial predictive modelling. This study will focus on evaluating the performance of 12 different interpolation methods for the wind components \\overrightarrow{u} and \\overrightarrow{v} in a mountainous region of Central Asia. Thereby, a special focus will be on applying new hybrid methods on spatial interpolation of wind data. This study is the first evaluating and comparing the performance of several of these hybrid methods. The overall aim of this study is to determine whether an optimal interpolation method exists, which can equally be applied for all pressure levels, or whether different interpolation methods have to be used for the different pressure levels. Deterministic (inverse distance weighting) and geostatistical interpolation methods (ordinary kriging) were explored, which take into account only the initial values of \\overrightarrow{u} and \\overrightarrow{v} . In addition, more complex methods (generalized additive model, support vector machine and neural networks as single methods and as hybrid methods as well as regression-kriging) that consider additional variables were applied. The analysis of the error indices revealed that regression-kriging provided the most accurate interpolation results for both wind components and all pressure heights. At 200 and 500 hPa, regression-kriging is followed by the different kinds of neural networks and support vector machines and for 850 hPa it is followed by the different types of support vector machine and ordinary kriging. Overall, explanatory variables improve the interpolation results.

  10. Assessment of the relative merits of a few methods to detect evolutionary trends.

    PubMed

    Laurin, Michel

    2010-12-01

    Some of the most basic questions about the history of life concern evolutionary trends. These include determining whether or not metazoans have become more complex over time, whether or not body size tends to increase over time (the Cope-Depéret rule), or whether or not brain size has increased over time in various taxa, such as mammals and birds. Despite the proliferation of studies on such topics, assessment of the reliability of results in this field is hampered by the variability of techniques used and the lack of statistical validation of these methods. To solve this problem, simulations are performed using a variety of evolutionary models (gradual Brownian motion, speciational Brownian motion, and Ornstein-Uhlenbeck), with or without a drift of variable amplitude, with variable variance of tips, and with bounds placed close or far from the starting values and final means of simulated characters. These are used to assess the relative merits (power, Type I error rate, bias, and mean absolute value of error on slope estimate) of several statistical methods that have recently been used to assess the presence of evolutionary trends in comparative data. Results show widely divergent performance of the methods. The simple, nonphylogenetic regression (SR) and variance partitioning using phylogenetic eigenvector regression (PVR) with a broken stick selection procedure have greatly inflated Type I error rate (0.123-0.180 at a 0.05 threshold), which invalidates their use in this context. However, they have the greatest power. Most variants of Felsenstein's independent contrasts (FIC; five of which are presented) have adequate Type I error rate, although two have a slightly inflated Type I error rate with at least one of the two reference trees (0.064-0.090 error rate at a 0.05 threshold). The power of all contrast-based methods is always much lower than that of SR and PVR, except under Brownian motion with a strong trend and distant bounds. Mean absolute value of error on slope of all FIC methods is slightly higher than that of phylogenetic generalized least squares (PGLS), SR, and PVR. PGLS performs well, with low Type I error rate, low error on regression coefficient, and power comparable with some FIC methods. Four variants of skewness analysis are examined, and a new method to assess significance of results is presented. However, all have consistently low power, except in rare combinations of trees, trend strength, and distance between final means and bounds. Globally, the results clearly show that FIC-based methods and PGLS are globally better than nonphylogenetic methods and variance partitioning with PVR. FIC methods and PGLS are sensitive to the model of evolution (and, hence, to branch length errors). Our results suggest that regressing raw character contrasts against raw geological age contrasts yields a good combination of power and Type I error rate. New software to facilitate batch analysis is presented.

  11. Testing a single regression coefficient in high dimensional linear models

    PubMed Central

    Zhong, Ping-Shou; Li, Runze; Wang, Hansheng; Tsai, Chih-Ling

    2017-01-01

    In linear regression models with high dimensional data, the classical z-test (or t-test) for testing the significance of each single regression coefficient is no longer applicable. This is mainly because the number of covariates exceeds the sample size. In this paper, we propose a simple and novel alternative by introducing the Correlated Predictors Screening (CPS) method to control for predictors that are highly correlated with the target covariate. Accordingly, the classical ordinary least squares approach can be employed to estimate the regression coefficient associated with the target covariate. In addition, we demonstrate that the resulting estimator is consistent and asymptotically normal even if the random errors are heteroscedastic. This enables us to apply the z-test to assess the significance of each covariate. Based on the p-value obtained from testing the significance of each covariate, we further conduct multiple hypothesis testing by controlling the false discovery rate at the nominal level. Then, we show that the multiple hypothesis testing achieves consistent model selection. Simulation studies and empirical examples are presented to illustrate the finite sample performance and the usefulness of the proposed method, respectively. PMID:28663668

  12. Testing a single regression coefficient in high dimensional linear models.

    PubMed

    Lan, Wei; Zhong, Ping-Shou; Li, Runze; Wang, Hansheng; Tsai, Chih-Ling

    2016-11-01

    In linear regression models with high dimensional data, the classical z -test (or t -test) for testing the significance of each single regression coefficient is no longer applicable. This is mainly because the number of covariates exceeds the sample size. In this paper, we propose a simple and novel alternative by introducing the Correlated Predictors Screening (CPS) method to control for predictors that are highly correlated with the target covariate. Accordingly, the classical ordinary least squares approach can be employed to estimate the regression coefficient associated with the target covariate. In addition, we demonstrate that the resulting estimator is consistent and asymptotically normal even if the random errors are heteroscedastic. This enables us to apply the z -test to assess the significance of each covariate. Based on the p -value obtained from testing the significance of each covariate, we further conduct multiple hypothesis testing by controlling the false discovery rate at the nominal level. Then, we show that the multiple hypothesis testing achieves consistent model selection. Simulation studies and empirical examples are presented to illustrate the finite sample performance and the usefulness of the proposed method, respectively.

  13. The Effect of Latent Binary Variables on the Uncertainty of the Prediction of a Dichotomous Outcome Using Logistic Regression Based Propensity Score Matching.

    PubMed

    Szekér, Szabolcs; Vathy-Fogarassy, Ágnes

    2018-01-01

    Logistic regression based propensity score matching is a widely used method in case-control studies to select the individuals of the control group. This method creates a suitable control group if all factors affecting the output variable are known. However, if relevant latent variables exist as well, which are not taken into account during the calculations, the quality of the control group is uncertain. In this paper, we present a statistics-based research in which we try to determine the relationship between the accuracy of the logistic regression model and the uncertainty of the dependent variable of the control group defined by propensity score matching. Our analyses show that there is a linear correlation between the fit of the logistic regression model and the uncertainty of the output variable. In certain cases, a latent binary explanatory variable can result in a relative error of up to 70% in the prediction of the outcome variable. The observed phenomenon calls the attention of analysts to an important point, which must be taken into account when deducting conclusions.

  14. A mathematical programming method for formulating a fuzzy regression model based on distance criterion.

    PubMed

    Chen, Liang-Hsuan; Hsueh, Chan-Ching

    2007-06-01

    Fuzzy regression models are useful to investigate the relationship between explanatory and response variables with fuzzy observations. Different from previous studies, this correspondence proposes a mathematical programming method to construct a fuzzy regression model based on a distance criterion. The objective of the mathematical programming is to minimize the sum of distances between the estimated and observed responses on the X axis, such that the fuzzy regression model constructed has the minimal total estimation error in distance. Only several alpha-cuts of fuzzy observations are needed as inputs to the mathematical programming model; therefore, the applications are not restricted to triangular fuzzy numbers. Three examples, adopted in the previous studies, and a larger example, modified from the crisp case, are used to illustrate the performance of the proposed approach. The results indicate that the proposed model has better performance than those in the previous studies based on either distance criterion or Kim and Bishu's criterion. In addition, the efficiency and effectiveness for solving the larger example by the proposed model are also satisfactory.

  15. A Technique of Fuzzy C-Mean in Multiple Linear Regression Model toward Paddy Yield

    NASA Astrophysics Data System (ADS)

    Syazwan Wahab, Nur; Saifullah Rusiman, Mohd; Mohamad, Mahathir; Amira Azmi, Nur; Che Him, Norziha; Ghazali Kamardan, M.; Ali, Maselan

    2018-04-01

    In this paper, we propose a hybrid model which is a combination of multiple linear regression model and fuzzy c-means method. This research involved a relationship between 20 variates of the top soil that are analyzed prior to planting of paddy yields at standard fertilizer rates. Data used were from the multi-location trials for rice carried out by MARDI at major paddy granary in Peninsular Malaysia during the period from 2009 to 2012. Missing observations were estimated using mean estimation techniques. The data were analyzed using multiple linear regression model and a combination of multiple linear regression model and fuzzy c-means method. Analysis of normality and multicollinearity indicate that the data is normally scattered without multicollinearity among independent variables. Analysis of fuzzy c-means cluster the yield of paddy into two clusters before the multiple linear regression model can be used. The comparison between two method indicate that the hybrid of multiple linear regression model and fuzzy c-means method outperform the multiple linear regression model with lower value of mean square error.

  16. Using GA-Ridge regression to select hydro-geological parameters influencing groundwater pollution vulnerability.

    PubMed

    Ahn, Jae Joon; Kim, Young Min; Yoo, Keunje; Park, Joonhong; Oh, Kyong Joo

    2012-11-01

    For groundwater conservation and management, it is important to accurately assess groundwater pollution vulnerability. This study proposed an integrated model using ridge regression and a genetic algorithm (GA) to effectively select the major hydro-geological parameters influencing groundwater pollution vulnerability in an aquifer. The GA-Ridge regression method determined that depth to water, net recharge, topography, and the impact of vadose zone media were the hydro-geological parameters that influenced trichloroethene pollution vulnerability in a Korean aquifer. When using these selected hydro-geological parameters, the accuracy was improved for various statistical nonlinear and artificial intelligence (AI) techniques, such as multinomial logistic regression, decision trees, artificial neural networks, and case-based reasoning. These results provide a proof of concept that the GA-Ridge regression is effective at determining influential hydro-geological parameters for the pollution vulnerability of an aquifer, and in turn, improves the AI performance in assessing groundwater pollution vulnerability.

  17. Continuous-variable quantum Gaussian process regression and quantum singular value decomposition of nonsparse low-rank matrices

    NASA Astrophysics Data System (ADS)

    Das, Siddhartha; Siopsis, George; Weedbrook, Christian

    2018-02-01

    With the significant advancement in quantum computation during the past couple of decades, the exploration of machine-learning subroutines using quantum strategies has become increasingly popular. Gaussian process regression is a widely used technique in supervised classical machine learning. Here we introduce an algorithm for Gaussian process regression using continuous-variable quantum systems that can be realized with technology based on photonic quantum computers under certain assumptions regarding distribution of data and availability of efficient quantum access. Our algorithm shows that by using a continuous-variable quantum computer a dramatic speedup in computing Gaussian process regression can be achieved, i.e., the possibility of exponentially reducing the time to compute. Furthermore, our results also include a continuous-variable quantum-assisted singular value decomposition method of nonsparse low rank matrices and forms an important subroutine in our Gaussian process regression algorithm.

  18. Model Robust Calibration: Method and Application to Electronically-Scanned Pressure Transducers

    NASA Technical Reports Server (NTRS)

    Walker, Eric L.; Starnes, B. Alden; Birch, Jeffery B.; Mays, James E.

    2010-01-01

    This article presents the application of a recently developed statistical regression method to the controlled instrument calibration problem. The statistical method of Model Robust Regression (MRR), developed by Mays, Birch, and Starnes, is shown to improve instrument calibration by reducing the reliance of the calibration on a predetermined parametric (e.g. polynomial, exponential, logarithmic) model. This is accomplished by allowing fits from the predetermined parametric model to be augmented by a certain portion of a fit to the residuals from the initial regression using a nonparametric (locally parametric) regression technique. The method is demonstrated for the absolute scale calibration of silicon-based pressure transducers.

  19. Linear regression based on Minimum Covariance Determinant (MCD) and TELBS methods on the productivity of phytoplankton

    NASA Astrophysics Data System (ADS)

    Gusriani, N.; Firdaniza

    2018-03-01

    The existence of outliers on multiple linear regression analysis causes the Gaussian assumption to be unfulfilled. If the Least Square method is forcedly used on these data, it will produce a model that cannot represent most data. For that, we need a robust regression method against outliers. This paper will compare the Minimum Covariance Determinant (MCD) method and the TELBS method on secondary data on the productivity of phytoplankton, which contains outliers. Based on the robust determinant coefficient value, MCD method produces a better model compared to TELBS method.

  20. Impact of Uncertainties in Exposure Assessment on Thyroid Cancer Risk among Persons in Belarus Exposed as Children or Adolescents Due to the Chernobyl Accident.

    PubMed

    Little, Mark P; Kwon, Deukwoo; Zablotska, Lydia B; Brenner, Alina V; Cahoon, Elizabeth K; Rozhko, Alexander V; Polyanskaya, Olga N; Minenko, Victor F; Golovanov, Ivan; Bouville, André; Drozdovitch, Vladimir

    2015-01-01

    The excess incidence of thyroid cancer in Ukraine and Belarus observed a few years after the Chernobyl accident is considered to be largely the result of 131I released from the reactor. Although the Belarus thyroid cancer prevalence data has been previously analyzed, no account was taken of dose measurement error. We examined dose-response patterns in a thyroid screening prevalence cohort of 11,732 persons aged under 18 at the time of the accident, diagnosed during 1996-2004, who had direct thyroid 131I activity measurement, and were resident in the most radio-actively contaminated regions of Belarus. Three methods of dose-error correction (regression calibration, Monte Carlo maximum likelihood, Bayesian Markov Chain Monte Carlo) were applied. There was a statistically significant (p<0.001) increasing dose-response for prevalent thyroid cancer, irrespective of regression-adjustment method used. Without adjustment for dose errors the excess odds ratio was 1.51 Gy- (95% CI 0.53, 3.86), which was reduced by 13% when regression-calibration adjustment was used, 1.31 Gy- (95% CI 0.47, 3.31). A Monte Carlo maximum likelihood method yielded an excess odds ratio of 1.48 Gy- (95% CI 0.53, 3.87), about 2% lower than the unadjusted analysis. The Bayesian method yielded a maximum posterior excess odds ratio of 1.16 Gy- (95% BCI 0.20, 4.32), 23% lower than the unadjusted analysis. There were borderline significant (p = 0.053-0.078) indications of downward curvature in the dose response, depending on the adjustment methods used. There were also borderline significant (p = 0.102) modifying effects of gender on the radiation dose trend, but no significant modifying effects of age at time of accident, or age at screening as modifiers of dose response (p>0.2). In summary, the relatively small contribution of unshared classical dose error in the current study results in comparatively modest effects on the regression parameters.

  1. Short-term Forecasting of the Prevalence of Trachoma: Expert Opinion, Statistical Regression, versus Transmission Models

    PubMed Central

    Liu, Fengchen; Porco, Travis C.; Amza, Abdou; Kadri, Boubacar; Nassirou, Baido; West, Sheila K.; Bailey, Robin L.; Keenan, Jeremy D.; Solomon, Anthony W.; Emerson, Paul M.; Gambhir, Manoj; Lietman, Thomas M.

    2015-01-01

    Background Trachoma programs rely on guidelines made in large part using expert opinion of what will happen with and without intervention. Large community-randomized trials offer an opportunity to actually compare forecasting methods in a masked fashion. Methods The Program for the Rapid Elimination of Trachoma trials estimated longitudinal prevalence of ocular chlamydial infection from 24 communities treated annually with mass azithromycin. Given antibiotic coverage and biannual assessments from baseline through 30 months, forecasts of the prevalence of infection in each of the 24 communities at 36 months were made by three methods: the sum of 15 experts’ opinion, statistical regression of the square-root-transformed prevalence, and a stochastic hidden Markov model of infection transmission (Susceptible-Infectious-Susceptible, or SIS model). All forecasters were masked to the 36-month results and to the other forecasts. Forecasts of the 24 communities were scored by the likelihood of the observed results and compared using Wilcoxon’s signed-rank statistic. Findings Regression and SIS hidden Markov models had significantly better likelihood than community expert opinion (p = 0.004 and p = 0.01, respectively). All forecasts scored better when perturbed to decrease Fisher’s information. Each individual expert’s forecast was poorer than the sum of experts. Interpretation Regression and SIS models performed significantly better than expert opinion, although all forecasts were overly confident. Further model refinements may score better, although would need to be tested and compared in new masked studies. Construction of guidelines that rely on forecasting future prevalence could consider use of mathematical and statistical models. PMID:26302380

  2. Vascular Disease, ESRD, and Death: Interpreting Competing Risk Analyses

    PubMed Central

    Coresh, Josef; Segev, Dorry L.; Kucirka, Lauren M.; Tighiouart, Hocine; Sarnak, Mark J.

    2012-01-01

    Summary Background and objectives Vascular disease, a common condition in CKD, is a risk factor for mortality and ESRD. Optimal patient care requires accurate estimation and ordering of these competing risks. Design, setting, participants, & measurements This is a prospective cohort study of screened (n=885) and randomized participants (n=837) in the Modification of Diet in Renal Disease study (original study enrollment, 1989–1992), evaluating the association of vascular disease with ESRD and pre-ESRD mortality using standard survival analysis and competing risk regression. Results The method of analysis resulted in markedly different estimates. Cumulative incidence by standard analysis (censoring at the competing event) implied that, with vascular disease, the 15-year incidence was 66% and 51% for ESRD and pre-ESRD death, respectively. A more accurate representation of absolute risk was estimated with competing risk regression: 15-year incidence was 54% and 29% for ESRD and pre-ESRD death, respectively. For the association of vascular disease with pre-ESRD death, estimates of relative risk by the two methods were similar (standard survival analysis adjusted hazard ratio, 1.63; 95% confidence interval, 1.20–2.20; competing risk regression adjusted subhazard ratio, 1.57; 95% confidence interval, 1.15–2.14). In contrast, the hazard and subhazard ratios differed substantially for other associations, such as GFR and pre-ESRD mortality. Conclusions When competing events exist, absolute risk is better estimated using competing risk regression, but etiologic associations by this method must be carefully interpreted. The presence of vascular disease in CKD decreases the likelihood of survival to ESRD, independent of age and other risk factors. PMID:22859747

  3. Periodicity analysis of tourist arrivals to Banda Aceh using smoothing SARIMA approach

    NASA Astrophysics Data System (ADS)

    Miftahuddin, Helida, Desri; Sofyan, Hizir

    2017-11-01

    Forecasting the number of tourist arrivals who enters a region is needed for tourism businesses, economic and industrial policies, so that the statistical modeling needs to be conducted. Banda Aceh is the capital of Aceh province more economic activity is driven by the services sector, one of which is the tourism sector. Therefore, the prediction of the number of tourist arrivals is needed to develop further policies. The identification results indicate that the data arrival of foreign tourists to Banda Aceh to contain the trend and seasonal nature. Allegedly, the number of arrivals is influenced by external factors, such as economics, politics, and the holiday season caused the structural break in the data. Trend patterns are detected by using polynomial regression with quadratic and cubic approaches, while seasonal is detected by a periodic regression polynomial with quadratic and cubic approach. To model the data that has seasonal effects, one of the statistical methods that can be used is SARIMA (Seasonal Autoregressive Integrated Moving Average). The results showed that the smoothing, a method to detect the trend pattern is cubic polynomial regression approach, with the modified model and the multiplicative periodicity of 12 months. The AIC value obtained was 70.52. While the method for detecting the seasonal pattern is a periodic regression polynomial cubic approach, with the modified model and the multiplicative periodicity of 12 months. The AIC value obtained was 73.37. Furthermore, the best model to predict the number of foreign tourist arrivals to Banda Aceh in 2017 to 2018 is SARIMA (0,1,1)(1,1,0) with MAPE is 26%.

  4. Geodesic least squares regression on information manifolds

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Verdoolaege, Geert, E-mail: geert.verdoolaege@ugent.be

    We present a novel regression method targeted at situations with significant uncertainty on both the dependent and independent variables or with non-Gaussian distribution models. Unlike the classic regression model, the conditional distribution of the response variable suggested by the data need not be the same as the modeled distribution. Instead they are matched by minimizing the Rao geodesic distance between them. This yields a more flexible regression method that is less constrained by the assumptions imposed through the regression model. As an example, we demonstrate the improved resistance of our method against some flawed model assumptions and we apply thismore » to scaling laws in magnetic confinement fusion.« less

  5. Estimate the contribution of incubation parameters influence egg hatchability using multiple linear regression analysis

    PubMed Central

    Khalil, Mohamed H.; Shebl, Mostafa K.; Kosba, Mohamed A.; El-Sabrout, Karim; Zaki, Nesma

    2016-01-01

    Aim: This research was conducted to determine the most affecting parameters on hatchability of indigenous and improved local chickens’ eggs. Materials and Methods: Five parameters were studied (fertility, early and late embryonic mortalities, shape index, egg weight, and egg weight loss) on four strains, namely Fayoumi, Alexandria, Matrouh, and Montazah. Multiple linear regression was performed on the studied parameters to determine the most influencing one on hatchability. Results: The results showed significant differences in commercial and scientific hatchability among strains. Alexandria strain has the highest significant commercial hatchability (80.70%). Regarding the studied strains, highly significant differences in hatching chick weight among strains were observed. Using multiple linear regression analysis, fertility made the greatest percent contribution (71.31%) to hatchability, and the lowest percent contributions were made by shape index and egg weight loss. Conclusion: A prediction of hatchability using multiple regression analysis could be a good tool to improve hatchability percentage in chickens. PMID:27651666

  6. Electricity Consumption in the Industrial Sector of Jordan: Application of Multivariate Linear Regression and Adaptive Neuro-Fuzzy Techniques

    NASA Astrophysics Data System (ADS)

    Samhouri, M.; Al-Ghandoor, A.; Fouad, R. H.

    2009-08-01

    In this study two techniques, for modeling electricity consumption of the Jordanian industrial sector, are presented: (i) multivariate linear regression and (ii) neuro-fuzzy models. Electricity consumption is modeled as function of different variables such as number of establishments, number of employees, electricity tariff, prevailing fuel prices, production outputs, capacity utilizations, and structural effects. It was found that industrial production and capacity utilization are the most important variables that have significant effect on future electrical power demand. The results showed that both the multivariate linear regression and neuro-fuzzy models are generally comparable and can be used adequately to simulate industrial electricity consumption. However, comparison that is based on the square root average squared error of data suggests that the neuro-fuzzy model performs slightly better for future prediction of electricity consumption than the multivariate linear regression model. Such results are in full agreement with similar work, using different methods, for other countries.

  7. Influential factors of red-light running at signalized intersection and prediction using a rare events logistic regression model.

    PubMed

    Ren, Yilong; Wang, Yunpeng; Wu, Xinkai; Yu, Guizhen; Ding, Chuan

    2016-10-01

    Red light running (RLR) has become a major safety concern at signalized intersection. To prevent RLR related crashes, it is critical to identify the factors that significantly impact the drivers' behaviors of RLR, and to predict potential RLR in real time. In this research, 9-month's RLR events extracted from high-resolution traffic data collected by loop detectors from three signalized intersections were applied to identify the factors that significantly affect RLR behaviors. The data analysis indicated that occupancy time, time gap, used yellow time, time left to yellow start, whether the preceding vehicle runs through the intersection during yellow, and whether there is a vehicle passing through the intersection on the adjacent lane were significantly factors for RLR behaviors. Furthermore, due to the rare events nature of RLR, a modified rare events logistic regression model was developed for RLR prediction. The rare events logistic regression method has been applied in many fields for rare events studies and shows impressive performance, but so far none of previous research has applied this method to study RLR. The results showed that the rare events logistic regression model performed significantly better than the standard logistic regression model. More importantly, the proposed RLR prediction method is purely based on loop detector data collected from a single advance loop detector located 400 feet away from stop-bar. This brings great potential for future field applications of the proposed method since loops have been widely implemented in many intersections and can collect data in real time. This research is expected to contribute to the improvement of intersection safety significantly. Copyright © 2016 Elsevier Ltd. All rights reserved.

  8. Improved spatial regression analysis of diffusion tensor imaging for lesion detection during longitudinal progression of multiple sclerosis in individual subjects

    NASA Astrophysics Data System (ADS)

    Liu, Bilan; Qiu, Xing; Zhu, Tong; Tian, Wei; Hu, Rui; Ekholm, Sven; Schifitto, Giovanni; Zhong, Jianhui

    2016-03-01

    Subject-specific longitudinal DTI study is vital for investigation of pathological changes of lesions and disease evolution. Spatial Regression Analysis of Diffusion tensor imaging (SPREAD) is a non-parametric permutation-based statistical framework that combines spatial regression and resampling techniques to achieve effective detection of localized longitudinal diffusion changes within the whole brain at individual level without a priori hypotheses. However, boundary blurring and dislocation limit its sensitivity, especially towards detecting lesions of irregular shapes. In the present study, we propose an improved SPREAD (dubbed improved SPREAD, or iSPREAD) method by incorporating a three-dimensional (3D) nonlinear anisotropic diffusion filtering method, which provides edge-preserving image smoothing through a nonlinear scale space approach. The statistical inference based on iSPREAD was evaluated and compared with the original SPREAD method using both simulated and in vivo human brain data. Results demonstrated that the sensitivity and accuracy of the SPREAD method has been improved substantially by adapting nonlinear anisotropic filtering. iSPREAD identifies subject-specific longitudinal changes in the brain with improved sensitivity, accuracy, and enhanced statistical power, especially when the spatial correlation is heterogeneous among neighboring image pixels in DTI.

  9. Automatic energy expenditure measurement for health science.

    PubMed

    Catal, Cagatay; Akbulut, Akhan

    2018-04-01

    It is crucial to predict the human energy expenditure in any sports activity and health science application accurately to investigate the impact of the activity. However, measurement of the real energy expenditure is not a trivial task and involves complex steps. The objective of this work is to improve the performance of existing estimation models of energy expenditure by using machine learning algorithms and several data from different sensors and provide this estimation service in a cloud-based platform. In this study, we used input data such as breathe rate, and hearth rate from three sensors. Inputs are received from a web form and sent to the web service which applies a regression model on Azure cloud platform. During the experiments, we assessed several machine learning models based on regression methods. Our experimental results showed that our novel model which applies Boosted Decision Tree Regression in conjunction with the median aggregation technique provides the best result among other five regression algorithms. This cloud-based energy expenditure system which uses a web service showed that cloud computing technology is a great opportunity to develop estimation systems and the new model which applies Boosted Decision Tree Regression with the median aggregation provides remarkable results. Copyright © 2018 Elsevier B.V. All rights reserved.

  10. Decreasing Multicollinearity: A Method for Models with Multiplicative Functions.

    ERIC Educational Resources Information Center

    Smith, Kent W.; Sasaki, M. S.

    1979-01-01

    A method is proposed for overcoming the problem of multicollinearity in multiple regression equations where multiplicative independent terms are entered. The method is not a ridge regression solution. (JKS)

  11. Survival analysis of postoperative nausea and vomiting in patients receiving patient-controlled epidural analgesia.

    PubMed

    Lee, Shang-Yi; Hung, Chih-Jen; Chen, Chih-Chieh; Wu, Chih-Cheng

    2014-11-01

    Postoperative nausea and vomiting as well as postoperative pain are two major concerns when patients undergo surgery and receive anesthetics. Various models and predictive methods have been developed to investigate the risk factors of postoperative nausea and vomiting, and different types of preventive managements have subsequently been developed. However, there continues to be a wide variation in the previously reported incidence rates of postoperative nausea and vomiting. This may have occurred because patients were assessed at different time points, coupled with the overall limitation of the statistical methods used. However, using survival analysis with Cox regression, and thus factoring in these time effects, may solve this statistical limitation and reveal risk factors related to the occurrence of postoperative nausea and vomiting in the following period. In this retrospective, observational, uni-institutional study, we analyzed the results of 229 patients who received patient-controlled epidural analgesia following surgery from June 2007 to December 2007. We investigated the risk factors for the occurrence of postoperative nausea and vomiting, and also assessed the effect of evaluating patients at different time points using the Cox proportional hazards model. Furthermore, the results of this inquiry were compared with those results using logistic regression. The overall incidence of postoperative nausea and vomiting in our study was 35.4%. Using logistic regression, we found that only sex, but not the total doses and the average dose of opioids, had significant effects on the occurrence of postoperative nausea and vomiting at some time points. Cox regression showed that, when patients consumed a higher average dose of opioids, this correlated with a higher incidence of postoperative nausea and vomiting with a hazard ratio of 1.286. Survival analysis using Cox regression showed that the average consumption of opioids played an important role in postoperative nausea and vomiting, a result not found by logistic regression. Therefore, the incidence of postoperative nausea and vomiting in patients cannot be reliably determined on the basis of a single visit at one point in time. Copyright © 2014. Published by Elsevier Taiwan.

  12. Evaluation of logistic regression models and effect of covariates for case-control study in RNA-Seq analysis.

    PubMed

    Choi, Seung Hoan; Labadorf, Adam T; Myers, Richard H; Lunetta, Kathryn L; Dupuis, Josée; DeStefano, Anita L

    2017-02-06

    Next generation sequencing provides a count of RNA molecules in the form of short reads, yielding discrete, often highly non-normally distributed gene expression measurements. Although Negative Binomial (NB) regression has been generally accepted in the analysis of RNA sequencing (RNA-Seq) data, its appropriateness has not been exhaustively evaluated. We explore logistic regression as an alternative method for RNA-Seq studies designed to compare cases and controls, where disease status is modeled as a function of RNA-Seq reads using simulated and Huntington disease data. We evaluate the effect of adjusting for covariates that have an unknown relationship with gene expression. Finally, we incorporate the data adaptive method in order to compare false positive rates. When the sample size is small or the expression levels of a gene are highly dispersed, the NB regression shows inflated Type-I error rates but the Classical logistic and Bayes logistic (BL) regressions are conservative. Firth's logistic (FL) regression performs well or is slightly conservative. Large sample size and low dispersion generally make Type-I error rates of all methods close to nominal alpha levels of 0.05 and 0.01. However, Type-I error rates are controlled after applying the data adaptive method. The NB, BL, and FL regressions gain increased power with large sample size, large log2 fold-change, and low dispersion. The FL regression has comparable power to NB regression. We conclude that implementing the data adaptive method appropriately controls Type-I error rates in RNA-Seq analysis. Firth's logistic regression provides a concise statistical inference process and reduces spurious associations from inaccurately estimated dispersion parameters in the negative binomial framework.

  13. Forecasting Daily Patient Outflow From a Ward Having No Real-Time Clinical Data

    PubMed Central

    Tran, Truyen; Luo, Wei; Phung, Dinh; Venkatesh, Svetha

    2016-01-01

    Background: Modeling patient flow is crucial in understanding resource demand and prioritization. We study patient outflow from an open ward in an Australian hospital, where currently bed allocation is carried out by a manager relying on past experiences and looking at demand. Automatic methods that provide a reasonable estimate of total next-day discharges can aid in efficient bed management. The challenges in building such methods lie in dealing with large amounts of discharge noise introduced by the nonlinear nature of hospital procedures, and the nonavailability of real-time clinical information in wards. Objective Our study investigates different models to forecast the total number of next-day discharges from an open ward having no real-time clinical data. Methods We compared 5 popular regression algorithms to model total next-day discharges: (1) autoregressive integrated moving average (ARIMA), (2) the autoregressive moving average with exogenous variables (ARMAX), (3) k-nearest neighbor regression, (4) random forest regression, and (5) support vector regression. Although the autoregressive integrated moving average model relied on past 3-month discharges, nearest neighbor forecasting used median of similar discharges in the past in estimating next-day discharge. In addition, the ARMAX model used the day of the week and number of patients currently in ward as exogenous variables. For the random forest and support vector regression models, we designed a predictor set of 20 patient features and 88 ward-level features. Results Our data consisted of 12,141 patient visits over 1826 days. Forecasting quality was measured using mean forecast error, mean absolute error, symmetric mean absolute percentage error, and root mean square error. When compared with a moving average prediction model, all 5 models demonstrated superior performance with the random forests achieving 22.7% improvement in mean absolute error, for all days in the year 2014. Conclusions In the absence of clinical information, our study recommends using patient-level and ward-level data in predicting next-day discharges. Random forest and support vector regression models are able to use all available features from such data, resulting in superior performance over traditional autoregressive methods. An intelligent estimate of available beds in wards plays a crucial role in relieving access block in emergency departments. PMID:27444059

  14. Simultaneous Force Regression and Movement Classification of Fingers via Surface EMG within a Unified Bayesian Framework.

    PubMed

    Baldacchino, Tara; Jacobs, William R; Anderson, Sean R; Worden, Keith; Rowson, Jennifer

    2018-01-01

    This contribution presents a novel methodology for myolectric-based control using surface electromyographic (sEMG) signals recorded during finger movements. A multivariate Bayesian mixture of experts (MoE) model is introduced which provides a powerful method for modeling force regression at the fingertips, while also performing finger movement classification as a by-product of the modeling algorithm. Bayesian inference of the model allows uncertainties to be naturally incorporated into the model structure. This method is tested using data from the publicly released NinaPro database which consists of sEMG recordings for 6 degree-of-freedom force activations for 40 intact subjects. The results demonstrate that the MoE model achieves similar performance compared to the benchmark set by the authors of NinaPro for finger force regression. Additionally, inherent to the Bayesian framework is the inclusion of uncertainty in the model parameters, naturally providing confidence bounds on the force regression predictions. Furthermore, the integrated clustering step allows a detailed investigation into classification of the finger movements, without incurring any extra computational effort. Subsequently, a systematic approach to assessing the importance of the number of electrodes needed for accurate control is performed via sensitivity analysis techniques. A slight degradation in regression performance is observed for a reduced number of electrodes, while classification performance is unaffected.

  15. a Comparison Between Two Ols-Based Approaches to Estimating Urban Multifractal Parameters

    NASA Astrophysics Data System (ADS)

    Huang, Lin-Shan; Chen, Yan-Guang

    Multifractal theory provides a new spatial analytical tool for urban studies, but many basic problems remain to be solved. Among various pending issues, the most significant one is how to obtain proper multifractal dimension spectrums. If an algorithm is improperly used, the parameter spectrums will be abnormal. This paper is devoted to investigating two ordinary least squares (OLS)-based approaches for estimating urban multifractal parameters. Using empirical study and comparative analysis, we demonstrate how to utilize the adequate linear regression to calculate multifractal parameters. The OLS regression analysis has two different approaches. One is that the intercept is fixed to zero, and the other is that the intercept is not limited. The results of comparative study show that the zero-intercept regression yields proper multifractal parameter spectrums within certain scale range of moment order, while the common regression method often leads to abnormal multifractal parameter values. A conclusion can be reached that fixing the intercept to zero is a more advisable regression method for multifractal parameters estimation, and the shapes of spectral curves and value ranges of fractal parameters can be employed to diagnose urban problems. This research is helpful for scientists to understand multifractal models and apply a more reasonable technique to multifractal parameter calculations.

  16. Real-time model learning using Incremental Sparse Spectrum Gaussian Process Regression.

    PubMed

    Gijsberts, Arjan; Metta, Giorgio

    2013-05-01

    Novel applications in unstructured and non-stationary human environments require robots that learn from experience and adapt autonomously to changing conditions. Predictive models therefore not only need to be accurate, but should also be updated incrementally in real-time and require minimal human intervention. Incremental Sparse Spectrum Gaussian Process Regression is an algorithm that is targeted specifically for use in this context. Rather than developing a novel algorithm from the ground up, the method is based on the thoroughly studied Gaussian Process Regression algorithm, therefore ensuring a solid theoretical foundation. Non-linearity and a bounded update complexity are achieved simultaneously by means of a finite dimensional random feature mapping that approximates a kernel function. As a result, the computational cost for each update remains constant over time. Finally, algorithmic simplicity and support for automated hyperparameter optimization ensures convenience when employed in practice. Empirical validation on a number of synthetic and real-life learning problems confirms that the performance of Incremental Sparse Spectrum Gaussian Process Regression is superior with respect to the popular Locally Weighted Projection Regression, while computational requirements are found to be significantly lower. The method is therefore particularly suited for learning with real-time constraints or when computational resources are limited. Copyright © 2012 Elsevier Ltd. All rights reserved.

  17. Complementary nonparametric analysis of covariance for logistic regression in a randomized clinical trial setting.

    PubMed

    Tangen, C M; Koch, G G

    1999-03-01

    In the randomized clinical trial setting, controlling for covariates is expected to produce variance reduction for the treatment parameter estimate and to adjust for random imbalances of covariates between the treatment groups. However, for the logistic regression model, variance reduction is not obviously obtained. This can lead to concerns about the assumptions of the logistic model. We introduce a complementary nonparametric method for covariate adjustment. It provides results that are usually compatible with expectations for analysis of covariance. The only assumptions required are based on randomization and sampling arguments. The resulting treatment parameter is a (unconditional) population average log-odds ratio that has been adjusted for random imbalance of covariates. Data from a randomized clinical trial are used to compare results from the traditional maximum likelihood logistic method with those from the nonparametric logistic method. We examine treatment parameter estimates, corresponding standard errors, and significance levels in models with and without covariate adjustment. In addition, we discuss differences between unconditional population average treatment parameters and conditional subpopulation average treatment parameters. Additional features of the nonparametric method, including stratified (multicenter) and multivariate (multivisit) analyses, are illustrated. Extensions of this methodology to the proportional odds model are also made.

  18. Comparison of 3 Methods for Identifying Dietary Patterns Associated With Risk of Disease

    PubMed Central

    DiBello, Julia R.; Kraft, Peter; McGarvey, Stephen T.; Goldberg, Robert; Campos, Hannia

    2008-01-01

    Reduced rank regression and partial least-squares regression (PLS) are proposed alternatives to principal component analysis (PCA). Using all 3 methods, the authors derived dietary patterns in Costa Rican data collected on 3,574 cases and controls in 1994–2004 and related the resulting patterns to risk of first incident myocardial infarction. Four dietary patterns associated with myocardial infarction were identified. Factor 1, characterized by high intakes of lean chicken, vegetables, fruit, and polyunsaturated oil, was generated by all 3 dietary pattern methods and was associated with a significantly decreased adjusted risk of myocardial infarction (28%–46%, depending on the method used). PCA and PLS also each yielded a pattern associated with a significantly decreased risk of myocardial infarction (31% and 23%, respectively); this pattern was characterized by moderate intake of alcohol and polyunsaturated oil and low intake of high-fat dairy products. The fourth factor derived from PCA was significantly associated with a 38% increased risk of myocardial infarction and was characterized by high intakes of coffee and palm oil. Contrary to previous studies, the authors found PCA and PLS to produce more patterns associated with cardiovascular disease than reduced rank regression. The most effective method for deriving dietary patterns related to disease may vary depending on the study goals. PMID:18945692

  19. [Correlation coefficient-based classification method of hydrological dependence variability: With auto-regression model as example].

    PubMed

    Zhao, Yu Xi; Xie, Ping; Sang, Yan Fang; Wu, Zi Yi

    2018-04-01

    Hydrological process evaluation is temporal dependent. Hydrological time series including dependence components do not meet the data consistency assumption for hydrological computation. Both of those factors cause great difficulty for water researches. Given the existence of hydrological dependence variability, we proposed a correlationcoefficient-based method for significance evaluation of hydrological dependence based on auto-regression model. By calculating the correlation coefficient between the original series and its dependence component and selecting reasonable thresholds of correlation coefficient, this method divided significance degree of dependence into no variability, weak variability, mid variability, strong variability, and drastic variability. By deducing the relationship between correlation coefficient and auto-correlation coefficient in each order of series, we found that the correlation coefficient was mainly determined by the magnitude of auto-correlation coefficient from the 1 order to p order, which clarified the theoretical basis of this method. With the first-order and second-order auto-regression models as examples, the reasonability of the deduced formula was verified through Monte-Carlo experiments to classify the relationship between correlation coefficient and auto-correlation coefficient. This method was used to analyze three observed hydrological time series. The results indicated the coexistence of stochastic and dependence characteristics in hydrological process.

  20. Alternative evaluation metrics for risk adjustment methods.

    PubMed

    Park, Sungchul; Basu, Anirban

    2018-06-01

    Risk adjustment is instituted to counter risk selection by accurately equating payments with expected expenditures. Traditional risk-adjustment methods are designed to estimate accurate payments at the group level. However, this generates residual risks at the individual level, especially for high-expenditure individuals, thereby inducing health plans to avoid those with high residual risks. To identify an optimal risk-adjustment method, we perform a comprehensive comparison of prediction accuracies at the group level, at the tail distributions, and at the individual level across 19 estimators: 9 parametric regression, 7 machine learning, and 3 distributional estimators. Using the 2013-2014 MarketScan database, we find that no one estimator performs best in all prediction accuracies. Generally, machine learning and distribution-based estimators achieve higher group-level prediction accuracy than parametric regression estimators. However, parametric regression estimators show higher tail distribution prediction accuracy and individual-level prediction accuracy, especially at the tails of the distribution. This suggests that there is a trade-off in selecting an appropriate risk-adjustment method between estimating accurate payments at the group level and lower residual risks at the individual level. Our results indicate that an optimal method cannot be determined solely on the basis of statistical metrics but rather needs to account for simulating plans' risk selective behaviors. Copyright © 2018 John Wiley & Sons, Ltd.

  1. Regional flow duration curves: Geostatistical techniques versus multivariate regression

    USGS Publications Warehouse

    Pugliese, Alessio; Farmer, William H.; Castellarin, Attilio; Archfield, Stacey A.; Vogel, Richard M.

    2016-01-01

    A period-of-record flow duration curve (FDC) represents the relationship between the magnitude and frequency of daily streamflows. Prediction of FDCs is of great importance for locations characterized by sparse or missing streamflow observations. We present a detailed comparison of two methods which are capable of predicting an FDC at ungauged basins: (1) an adaptation of the geostatistical method, Top-kriging, employing a linear weighted average of dimensionless empirical FDCs, standardised with a reference streamflow value; and (2) regional multiple linear regression of streamflow quantiles, perhaps the most common method for the prediction of FDCs at ungauged sites. In particular, Top-kriging relies on a metric for expressing the similarity between catchments computed as the negative deviation of the FDC from a reference streamflow value, which we termed total negative deviation (TND). Comparisons of these two methods are made in 182 largely unregulated river catchments in the southeastern U.S. using a three-fold cross-validation algorithm. Our results reveal that the two methods perform similarly throughout flow-regimes, with average Nash-Sutcliffe Efficiencies 0.566 and 0.662, (0.883 and 0.829 on log-transformed quantiles) for the geostatistical and the linear regression models, respectively. The differences between the reproduction of FDC's occurred mostly for low flows with exceedance probability (i.e. duration) above 0.98.

  2. The Use of Linear Instrumental Variables Methods in Health Services Research and Health Economics: A Cautionary Note

    PubMed Central

    Terza, Joseph V; Bradford, W David; Dismuke, Clara E

    2008-01-01

    Objective To investigate potential bias in the use of the conventional linear instrumental variables (IV) method for the estimation of causal effects in inherently nonlinear regression settings. Data Sources Smoking Supplement to the 1979 National Health Interview Survey, National Longitudinal Alcohol Epidemiologic Survey, and simulated data. Study Design Potential bias from the use of the linear IV method in nonlinear models is assessed via simulation studies and real world data analyses in two commonly encountered regression setting: (1) models with a nonnegative outcome (e.g., a count) and a continuous endogenous regressor; and (2) models with a binary outcome and a binary endogenous regressor. Principle Findings The simulation analyses show that substantial bias in the estimation of causal effects can result from applying the conventional IV method in inherently nonlinear regression settings. Moreover, the bias is not attenuated as the sample size increases. This point is further illustrated in the survey data analyses in which IV-based estimates of the relevant causal effects diverge substantially from those obtained with appropriate nonlinear estimation methods. Conclusions We offer this research as a cautionary note to those who would opt for the use of linear specifications in inherently nonlinear settings involving endogeneity. PMID:18546544

  3. Nonparametric methods for drought severity estimation at ungauged sites

    NASA Astrophysics Data System (ADS)

    Sadri, S.; Burn, D. H.

    2012-12-01

    The objective in frequency analysis is, given extreme events such as drought severity or duration, to estimate the relationship between that event and the associated return periods at a catchment. Neural networks and other artificial intelligence approaches in function estimation and regression analysis are relatively new techniques in engineering, providing an attractive alternative to traditional statistical models. There are, however, few applications of neural networks and support vector machines in the area of severity quantile estimation for drought frequency analysis. In this paper, we compare three methods for this task: multiple linear regression, radial basis function neural networks, and least squares support vector regression (LS-SVR). The area selected for this study includes 32 catchments in the Canadian Prairies. From each catchment drought severities are extracted and fitted to a Pearson type III distribution, which act as observed values. For each method-duration pair, we use a jackknife algorithm to produce estimated values at each site. The results from these three approaches are compared and analyzed, and it is found that LS-SVR provides the best quantile estimates and extrapolating capacity.

  4. Comparison of exact, efron and breslow parameter approach method on hazard ratio and stratified cox regression model

    NASA Astrophysics Data System (ADS)

    Fatekurohman, Mohamat; Nurmala, Nita; Anggraeni, Dian

    2018-04-01

    Lungs are the most important organ, in the case of respiratory system. Problems related to disorder of the lungs are various, i.e. pneumonia, emphysema, tuberculosis and lung cancer. Comparing all those problems, lung cancer is the most harmful. Considering about that, the aim of this research applies survival analysis and factors affecting the endurance of the lung cancer patient using comparison of exact, Efron and Breslow parameter approach method on hazard ratio and stratified cox regression model. The data applied are based on the medical records of lung cancer patients in Jember Paru-paru hospital on 2016, east java, Indonesia. The factors affecting the endurance of the lung cancer patients can be classified into several criteria, i.e. sex, age, hemoglobin, leukocytes, erythrocytes, sedimentation rate of blood, therapy status, general condition, body weight. The result shows that exact method of stratified cox regression model is better than other. On the other hand, the endurance of the patients is affected by their age and the general conditions.

  5. Comparison of safety effect estimates obtained from empirical Bayes before-after study, propensity scores-potential outcomes framework, and regression model with cross-sectional data.

    PubMed

    Wood, Jonathan S; Donnell, Eric T; Porter, Richard J

    2015-02-01

    A variety of different study designs and analysis methods have been used to evaluate the performance of traffic safety countermeasures. The most common study designs and methods include observational before-after studies using the empirical Bayes method and cross-sectional studies using regression models. The propensity scores-potential outcomes framework has recently been proposed as an alternative traffic safety countermeasure evaluation method to address the challenges associated with selection biases that can be part of cross-sectional studies. Crash modification factors derived from the application of all three methods have not yet been compared. This paper compares the results of retrospective, observational evaluations of a traffic safety countermeasure using both before-after and cross-sectional study designs. The paper describes the strengths and limitations of each method, focusing primarily on how each addresses site selection bias, which is a common issue in observational safety studies. The Safety Edge paving technique, which seeks to mitigate crashes related to roadway departure events, is the countermeasure used in the present study to compare the alternative evaluation methods. The results indicated that all three methods yielded results that were consistent with each other and with previous research. The empirical Bayes results had the smallest standard errors. It is concluded that the propensity scores with potential outcomes framework is a viable alternative analysis method to the empirical Bayes before-after study. It should be considered whenever a before-after study is not possible or practical. Copyright © 2014 Elsevier Ltd. All rights reserved.

  6. Locoregional Control of Non-Small Cell Lung Cancer in Relation to Automated Early Assessment of Tumor Regression on Cone Beam Computed Tomography

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Brink, Carsten, E-mail: carsten.brink@rsyd.dk; Laboratory of Radiation Physics, Odense University Hospital; Bernchou, Uffe

    2014-07-15

    Purpose: Large interindividual variations in volume regression of non-small cell lung cancer (NSCLC) are observable on standard cone beam computed tomography (CBCT) during fractionated radiation therapy. Here, a method for automated assessment of tumor volume regression is presented and its potential use in response adapted personalized radiation therapy is evaluated empirically. Methods and Materials: Automated deformable registration with calculation of the Jacobian determinant was applied to serial CBCT scans in a series of 99 patients with NSCLC. Tumor volume at the end of treatment was estimated on the basis of the first one third and two thirds of the scans.more » The concordance between estimated and actual relative volume at the end of radiation therapy was quantified by Pearson's correlation coefficient. On the basis of the estimated relative volume, the patients were stratified into 2 groups having volume regressions below or above the population median value. Kaplan-Meier plots of locoregional disease-free rate and overall survival in the 2 groups were used to evaluate the predictive value of tumor regression during treatment. Cox proportional hazards model was used to adjust for other clinical characteristics. Results: Automatic measurement of the tumor regression from standard CBCT images was feasible. Pearson's correlation coefficient between manual and automatic measurement was 0.86 in a sample of 9 patients. Most patients experienced tumor volume regression, and this could be quantified early into the treatment course. Interestingly, patients with pronounced volume regression had worse locoregional tumor control and overall survival. This was significant on patient with non-adenocarcinoma histology. Conclusions: Evaluation of routinely acquired CBCT images during radiation therapy provides biological information on the specific tumor. This could potentially form the basis for personalized response adaptive therapy.« less

  7. A systematic study on the influencing parameters and improvement of quantitative analysis of multi-component with single marker method using notoginseng as research subject.

    PubMed

    Wang, Chao-Qun; Jia, Xiu-Hong; Zhu, Shu; Komatsu, Katsuko; Wang, Xuan; Cai, Shao-Qing

    2015-03-01

    A new quantitative analysis of multi-component with single marker (QAMS) method for 11 saponins (ginsenosides Rg1, Rb1, Rg2, Rh1, Rf, Re and Rd; notoginsenosides R1, R4, Fa and K) in notoginseng was established, when 6 of these saponins were individually used as internal referring substances to investigate the influences of chemical structure, concentrations of quantitative components, and purities of the standard substances on the accuracy of the QAMS method. The results showed that the concentration of the analyte in sample solution was the major influencing parameter, whereas the other parameters had minimal influence on the accuracy of the QAMS method. A new method for calculating the relative correction factors by linear regression was established (linear regression method), which demonstrated to decrease standard method differences of the QAMS method from 1.20%±0.02% - 23.29%±3.23% to 0.10%±0.09% - 8.84%±2.85% in comparison with the previous method. And the differences between external standard method and the QAMS method using relative correction factors calculated by linear regression method were below 5% in the quantitative determination of Rg1, Re, R1, Rd and Fa in 24 notoginseng samples and Rb1 in 21 notoginseng samples. And the differences were mostly below 10% in the quantitative determination of Rf, Rg2, R4 and N-K (the differences of these 4 constituents bigger because their contents lower) in all the 24 notoginseng samples. The results indicated that the contents assayed by the new QAMS method could be considered as accurate as those assayed by external standard method. In addition, a method for determining applicable concentration ranges of the quantitative components assayed by QAMS method was established for the first time, which could ensure its high accuracy and could be applied to QAMS methods of other TCMs. The present study demonstrated the practicability of the application of the QAMS method for the quantitative analysis of multi-component and the quality control of TCMs and TCM prescriptions. Copyright © 2014 Elsevier B.V. All rights reserved.

  8. Probability estimation with machine learning methods for dichotomous and multicategory outcome: theory.

    PubMed

    Kruppa, Jochen; Liu, Yufeng; Biau, Gérard; Kohler, Michael; König, Inke R; Malley, James D; Ziegler, Andreas

    2014-07-01

    Probability estimation for binary and multicategory outcome using logistic and multinomial logistic regression has a long-standing tradition in biostatistics. However, biases may occur if the model is misspecified. In contrast, outcome probabilities for individuals can be estimated consistently with machine learning approaches, including k-nearest neighbors (k-NN), bagged nearest neighbors (b-NN), random forests (RF), and support vector machines (SVM). Because machine learning methods are rarely used by applied biostatisticians, the primary goal of this paper is to explain the concept of probability estimation with these methods and to summarize recent theoretical findings. Probability estimation in k-NN, b-NN, and RF can be embedded into the class of nonparametric regression learning machines; therefore, we start with the construction of nonparametric regression estimates and review results on consistency and rates of convergence. In SVMs, outcome probabilities for individuals are estimated consistently by repeatedly solving classification problems. For SVMs we review classification problem and then dichotomous probability estimation. Next we extend the algorithms for estimating probabilities using k-NN, b-NN, and RF to multicategory outcomes and discuss approaches for the multicategory probability estimation problem using SVM. In simulation studies for dichotomous and multicategory dependent variables we demonstrate the general validity of the machine learning methods and compare it with logistic regression. However, each method fails in at least one simulation scenario. We conclude with a discussion of the failures and give recommendations for selecting and tuning the methods. Applications to real data and example code are provided in a companion article (doi:10.1002/bimj.201300077). © 2014 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  9. [Application of negative binomial regression and modified Poisson regression in the research of risk factors for injury frequency].

    PubMed

    Cao, Qingqing; Wu, Zhenqiang; Sun, Ying; Wang, Tiezhu; Han, Tengwei; Gu, Chaomei; Sun, Yehuan

    2011-11-01

    To Eexplore the application of negative binomial regression and modified Poisson regression analysis in analyzing the influential factors for injury frequency and the risk factors leading to the increase of injury frequency. 2917 primary and secondary school students were selected from Hefei by cluster random sampling method and surveyed by questionnaire. The data on the count event-based injuries used to fitted modified Poisson regression and negative binomial regression model. The risk factors incurring the increase of unintentional injury frequency for juvenile students was explored, so as to probe the efficiency of these two models in studying the influential factors for injury frequency. The Poisson model existed over-dispersion (P < 0.0001) based on testing by the Lagrangemultiplier. Therefore, the over-dispersion dispersed data using a modified Poisson regression and negative binomial regression model, was fitted better. respectively. Both showed that male gender, younger age, father working outside of the hometown, the level of the guardian being above junior high school and smoking might be the results of higher injury frequencies. On a tendency of clustered frequency data on injury event, both the modified Poisson regression analysis and negative binomial regression analysis can be used. However, based on our data, the modified Poisson regression fitted better and this model could give a more accurate interpretation of relevant factors affecting the frequency of injury.

  10. Adjusting data to body size: a comparison of methods as applied to quantitative trait loci analysis of musculoskeletal phenotypes.

    PubMed

    Lang, Dean H; Sharkey, Neil A; Lionikas, Arimantas; Mack, Holly A; Larsson, Lars; Vogler, George P; Vandenbergh, David J; Blizard, David A; Stout, Joseph T; Stitt, Joseph P; McClearn, Gerald E

    2005-05-01

    The aim of this study was to compare three methods of adjusting skeletal data for body size and examine their use in QTL analyses. It was found that dividing skeletal phenotypes by body mass index induced erroneous QTL results. The preferred method of body size adjustment was multiple regression. Many skeletal studies have reported strong correlations between phenotypes for muscle, bone, and body size, and these correlations add to the difficulty in identifying genetic influence on skeletal traits that are not mediated through overall body size. Quantitative trait loci (QTL) identified for skeletal phenotypes often map to the same chromosome regions as QTLs for body size. The actions of a QTL identified as influencing BMD could therefore be mediated through the generalized actions of growth on body size or muscle mass. Three methods of adjusting skeletal phenotypes to body size were performed on morphologic, structural, and compositional measurements of the femur and tibia in 200-day-old C57BL/6J x DBA/2 (BXD) second generation (F(2)) mice (n = 400). A common method of removing the size effect has been through the use of ratios. This technique and two alternative techniques using simple and multiple regression were performed on muscle and skeletal data before QTL analyses, and the differences in QTL results were examined. The use of ratios to remove the size effect was shown to increase the size effect by inducing spurious correlations, thereby leading to inaccurate QTL results. Adjustments for body size using multiple regression eliminated these problems. Multiple regression should be used to remove the variance of co-factors related to skeletal phenotypes to allow for the study of genetic influence independent of correlated phenotypes. However, to better understand the genetic influence, adjusted and unadjusted skeletal QTL results should be compared. Additional insight can be gained by observing the difference in LOD score between the adjusted and nonadjusted phenotypes. Identifying QTLs that exert their effects on skeletal phenotypes through body size-related pathways as well as those having a more direct and independent influence on bone are equally important in deciphering the complex physiologic pathways responsible for the maintenance of bone health.

  11. Model selection for logistic regression models

    NASA Astrophysics Data System (ADS)

    Duller, Christine

    2012-09-01

    Model selection for logistic regression models decides which of some given potential regressors have an effect and hence should be included in the final model. The second interesting question is whether a certain factor is heterogeneous among some subsets, i.e. whether the model should include a random intercept or not. In this paper these questions will be answered with classical as well as with Bayesian methods. The application show some results of recent research projects in medicine and business administration.

  12. A systematic evaluation of normalization methods in quantitative label-free proteomics.

    PubMed

    Välikangas, Tommi; Suomi, Tomi; Elo, Laura L

    2018-01-01

    To date, mass spectrometry (MS) data remain inherently biased as a result of reasons ranging from sample handling to differences caused by the instrumentation. Normalization is the process that aims to account for the bias and make samples more comparable. The selection of a proper normalization method is a pivotal task for the reliability of the downstream analysis and results. Many normalization methods commonly used in proteomics have been adapted from the DNA microarray techniques. Previous studies comparing normalization methods in proteomics have focused mainly on intragroup variation. In this study, several popular and widely used normalization methods representing different strategies in normalization are evaluated using three spike-in and one experimental mouse label-free proteomic data sets. The normalization methods are evaluated in terms of their ability to reduce variation between technical replicates, their effect on differential expression analysis and their effect on the estimation of logarithmic fold changes. Additionally, we examined whether normalizing the whole data globally or in segments for the differential expression analysis has an effect on the performance of the normalization methods. We found that variance stabilization normalization (Vsn) reduced variation the most between technical replicates in all examined data sets. Vsn also performed consistently well in the differential expression analysis. Linear regression normalization and local regression normalization performed also systematically well. Finally, we discuss the choice of a normalization method and some qualities of a suitable normalization method in the light of the results of our evaluation. © The Author 2016. Published by Oxford University Press.

  13. A consensus least squares support vector regression (LS-SVR) for analysis of near-infrared spectra of plant samples.

    PubMed

    Li, Yankun; Shao, Xueguang; Cai, Wensheng

    2007-04-15

    Consensus modeling of combining the results of multiple independent models to produce a single prediction avoids the instability of single model. Based on the principle of consensus modeling, a consensus least squares support vector regression (LS-SVR) method for calibrating the near-infrared (NIR) spectra was proposed. In the proposed approach, NIR spectra of plant samples were firstly preprocessed using discrete wavelet transform (DWT) for filtering the spectral background and noise, then, consensus LS-SVR technique was used for building the calibration model. With an optimization of the parameters involved in the modeling, a satisfied model was achieved for predicting the content of reducing sugar in plant samples. The predicted results show that consensus LS-SVR model is more robust and reliable than the conventional partial least squares (PLS) and LS-SVR methods.

  14. Deciphering factors controlling groundwater arsenic spatial variability in Bangladesh

    NASA Astrophysics Data System (ADS)

    Tan, Z.; Yang, Q.; Zheng, C.; Zheng, Y.

    2017-12-01

    Elevated concentrations of geogenic arsenic in groundwater have been found in many countries to exceed 10 μg/L, the WHO's guideline value for drinking water. A common yet unexplained characteristic of groundwater arsenic spatial distribution is the extensive variability at various spatial scales. This study investigates factors influencing the spatial variability of groundwater arsenic in Bangladesh to improve the accuracy of models predicting arsenic exceedance rate spatially. A novel boosted regression tree method is used to establish a weak-learning ensemble model, which is compared to a linear model using a conventional stepwise logistic regression method. The boosted regression tree models offer the advantage of parametric interaction when big datasets are analyzed in comparison to the logistic regression. The point data set (n=3,538) of groundwater hydrochemistry with 19 parameters was obtained by the British Geological Survey in 2001. The spatial data sets of geological parameters (n=13) were from the Consortium for Spatial Information, Technical University of Denmark, University of East Anglia and the FAO, while the soil parameters (n=42) were from the Harmonized World Soil Database. The aforementioned parameters were regressed to categorical groundwater arsenic concentrations below or above three thresholds: 5 μg/L, 10 μg/L and 50 μg/L to identify respective controlling factors. Boosted regression tree method outperformed logistic regression methods in all three threshold levels in terms of accuracy, specificity and sensitivity, resulting in an improvement of spatial distribution map of probability of groundwater arsenic exceeding all three thresholds when compared to disjunctive-kriging interpolated spatial arsenic map using the same groundwater arsenic dataset. Boosted regression tree models also show that the most important controlling factors of groundwater arsenic distribution include groundwater iron content and well depth for all three thresholds. The probability of a well with iron content higher than 5mg/L to contain greater than 5 μg/L, 10 μg/L and 50 μg/L As is estimated to be more than 91%, 85% and 51%, respectively, while the probability of a well from depth more than 160m to contain more than 5 μg/L, 10 μg/L and 50 μg/L As is estimated to be less than 38%, 25% and 14%, respectively.

  15. Trend analysis of salt load and evaluation of the frequency of water-quality measurements for the Gunnison, the Colorado, and the Dolores rivers in Colorado and Utah

    USGS Publications Warehouse

    Kircher, J.E.; Dinicola, Richard S.; Middelburg, R.F.

    1984-01-01

    Monthly values were computed for water-quality constituents at four streamflow gaging stations in the Upper Colorado River basin for the determination of trends. Seasonal regression and seasonal Kendall trend analysis techniques were applied to two monthly data sets at each station site for four different time periods. A recently developed method for determining optimal water-discharge data-collection frequency was also applied to the monthly water-quality data. Trend analysis results varied with each monthly load computational method, period of record, and trend detection model used. No conclusions could be reached regarding which computational method was best to use in trend analysis. Time-period selection for analysis was found to be important with regard to intended use of the results. Seasonal Kendall procedures were found to be applicable to most data sets. Seasonal regression models were more difficult to apply and were sometimes of questionable validity; however, those results were more informative than seasonal Kendall results. The best model to use depends upon the characteristics of the data and the amount of trend information needed. The measurement-frequency optimization method had potential for application to water-quality data, but refinements are needed. (USGS)

  16. Birthweight Related Factors in Northwestern Iran: Using Quantile Regression Method.

    PubMed

    Fallah, Ramazan; Kazemnejad, Anoshirvan; Zayeri, Farid; Shoghli, Alireza

    2015-11-18

    Birthweight is one of the most important predicting indicators of the health status in adulthood. Having a balanced birthweight is one of the priorities of the health system in most of the industrial and developed countries. This indicator is used to assess the growth and health status of the infants. The aim of this study was to assess the birthweight of the neonates by using quantile regression in Zanjan province. This analytical descriptive study was carried out using pre-registered (March 2010 - March 2012) data of neonates in urban/rural health centers of Zanjan province using multiple-stage cluster sampling. Data were analyzed using multiple linear regressions andquantile regression method and SAS 9.2 statistical software. From 8456 newborn baby, 4146 (49%) were female. The mean age of the mothers was 27.1±5.4 years. The mean birthweight of the neonates was 3104 ± 431 grams. Five hundred and seventy-three patients (6.8%) of the neonates were less than 2500 grams. In all quantiles, gestational age of neonates (p<0.05), weight and educational level of the mothers (p<0.05) showed a linear significant relationship with the i of the neonates. However, sex and birth rank of the neonates, mothers age, place of residence (urban/rural) and career were not significant in all quantiles (p>0.05). This study revealed the results of multiple linear regression and quantile regression were not identical. We strictly recommend the use of quantile regression when an asymmetric response variable or data with outliers is available.

  17. Bayesian dose-response analysis for epidemiological studies with complex uncertainty in dose estimation.

    PubMed

    Kwon, Deukwoo; Hoffman, F Owen; Moroz, Brian E; Simon, Steven L

    2016-02-10

    Most conventional risk analysis methods rely on a single best estimate of exposure per person, which does not allow for adjustment for exposure-related uncertainty. Here, we propose a Bayesian model averaging method to properly quantify the relationship between radiation dose and disease outcomes by accounting for shared and unshared uncertainty in estimated dose. Our Bayesian risk analysis method utilizes multiple realizations of sets (vectors) of doses generated by a two-dimensional Monte Carlo simulation method that properly separates shared and unshared errors in dose estimation. The exposure model used in this work is taken from a study of the risk of thyroid nodules among a cohort of 2376 subjects who were exposed to fallout from nuclear testing in Kazakhstan. We assessed the performance of our method through an extensive series of simulations and comparisons against conventional regression risk analysis methods. When the estimated doses contain relatively small amounts of uncertainty, the Bayesian method using multiple a priori plausible draws of dose vectors gave similar results to the conventional regression-based methods of dose-response analysis. However, when large and complex mixtures of shared and unshared uncertainties are present, the Bayesian method using multiple dose vectors had significantly lower relative bias than conventional regression-based risk analysis methods and better coverage, that is, a markedly increased capability to include the true risk coefficient within the 95% credible interval of the Bayesian-based risk estimate. An evaluation of the dose-response using our method is presented for an epidemiological study of thyroid disease following radiation exposure. Copyright © 2015 John Wiley & Sons, Ltd.

  18. Plant selection for ethnobotanical uses on the Amalfi Coast (Southern Italy).

    PubMed

    Savo, V; Joy, R; Caneva, G; McClatchey, W C

    2015-07-15

    Many ethnobotanical studies have investigated selection criteria for medicinal and non-medicinal plants. In this paper we test several statistical methods using different ethnobotanical datasets in order to 1) define to which extent the nature of the datasets can affect the interpretation of results; 2) determine if the selection for different plant uses is based on phylogeny, or other selection criteria. We considered three different ethnobotanical datasets: two datasets of medicinal plants and a dataset of non-medicinal plants (handicraft production, domestic and agro-pastoral practices) and two floras of the Amalfi Coast. We performed residual analysis from linear regression, the binomial test and the Bayesian approach for calculating under-used and over-used plant families within ethnobotanical datasets. Percentages of agreement were calculated to compare the results of the analyses. We also analyzed the relationship between plant selection and phylogeny, chorology, life form and habitat using the chi-square test. Pearson's residuals for each of the significant chi-square analyses were examined for investigating alternative hypotheses of plant selection criteria. The three statistical analysis methods differed within the same dataset, and between different datasets and floras, but with some similarities. In the two medicinal datasets, only Lamiaceae was identified in both floras as an over-used family by all three statistical methods. All statistical methods in one flora agreed that Malvaceae was over-used and Poaceae under-used, but this was not found to be consistent with results of the second flora in which one statistical result was non-significant. All other families had some discrepancy in significance across methods, or floras. Significant over- or under-use was observed in only a minority of cases. The chi-square analyses were significant for phylogeny, life form and habitat. Pearson's residuals indicated a non-random selection of woody species for non-medicinal uses and an under-use of plants of temperate forests for medicinal uses. Our study showed that selection criteria for plant uses (including medicinal) are not always based on phylogeny. The comparison of different statistical methods (regression, binomial and Bayesian) under different conditions led to the conclusion that the most conservative results are obtained using regression analysis.

  19. Noninvasive and fast measurement of blood glucose in vivo by near infrared (NIR) spectroscopy

    NASA Astrophysics Data System (ADS)

    Jintao, Xue; Liming, Ye; Yufei, Liu; Chunyan, Li; Han, Chen

    2017-05-01

    This research was to develop a method for noninvasive and fast blood glucose assay in vivo. Near-infrared (NIR) spectroscopy, a more promising technique compared to other methods, was investigated in rats with diabetes and normal rats. Calibration models are generated by two different multivariate strategies: partial least squares (PLS) as linear regression method and artificial neural networks (ANN) as non-linear regression method. The PLS model was optimized individually by considering spectral range, spectral pretreatment methods and number of model factors, while the ANN model was studied individually by selecting spectral pretreatment methods, parameters of network topology, number of hidden neurons, and times of epoch. The results of the validation showed the two models were robust, accurate and repeatable. Compared to the ANN model, the performance of the PLS model was much better, with lower root mean square error of validation (RMSEP) of 0.419 and higher correlation coefficients (R) of 96.22%.

  20. Comparison of two occurrence risk assessment methods for collapse gully erosion ——A case study in Guangdong province

    NASA Astrophysics Data System (ADS)

    Sun, K.; Cheng, D. B.; He, J. J.; Zhao, Y. L.

    2018-02-01

    Collapse gully erosion is a specific type of soil erosion in the red soil region of southern China, and early warning and prevention of the occurrence of collapse gully erosion is very important. Based on the idea of risk assessment, this research, taking Guangdong province as an example, adopt the information acquisition analysis and the logistic regression analysis, to discuss the feasibility for collapse gully erosion risk assessment in regional scale, and compare the applicability of the different risk assessment methods. The results show that in the Guangdong province, the risk degree of collapse gully erosion occurrence is high in northeastern and western area, and relatively low in southwestern and central part. The comparing analysis of the different risk assessment methods on collapse gully also indicated that the risk distribution patterns from the different methods were basically consistent. However, the accuracy of risk map from the information acquisition analysis method was slightly better than that from the logistic regression analysis method.

  1. Applications of Monte Carlo method to nonlinear regression of rheological data

    NASA Astrophysics Data System (ADS)

    Kim, Sangmo; Lee, Junghaeng; Kim, Sihyun; Cho, Kwang Soo

    2018-02-01

    In rheological study, it is often to determine the parameters of rheological models from experimental data. Since both rheological data and values of the parameters vary in logarithmic scale and the number of the parameters is quite large, conventional method of nonlinear regression such as Levenberg-Marquardt (LM) method is usually ineffective. The gradient-based method such as LM is apt to be caught in local minima which give unphysical values of the parameters whenever the initial guess of the parameters is far from the global optimum. Although this problem could be solved by simulated annealing (SA), the Monte Carlo (MC) method needs adjustable parameter which could be determined in ad hoc manner. We suggest a simplified version of SA, a kind of MC methods which results in effective values of the parameters of most complicated rheological models such as the Carreau-Yasuda model of steady shear viscosity, discrete relaxation spectrum and zero-shear viscosity as a function of concentration and molecular weight.

  2. A modified approach to estimating sample size for simple logistic regression with one continuous covariate.

    PubMed

    Novikov, I; Fund, N; Freedman, L S

    2010-01-15

    Different methods for the calculation of sample size for simple logistic regression (LR) with one normally distributed continuous covariate give different results. Sometimes the difference can be large. Furthermore, some methods require the user to specify the prevalence of cases when the covariate equals its population mean, rather than the more natural population prevalence. We focus on two commonly used methods and show through simulations that the power for a given sample size may differ substantially from the nominal value for one method, especially when the covariate effect is large, while the other method performs poorly if the user provides the population prevalence instead of the required parameter. We propose a modification of the method of Hsieh et al. that requires specification of the population prevalence and that employs Schouten's sample size formula for a t-test with unequal variances and group sizes. This approach appears to increase the accuracy of the sample size estimates for LR with one continuous covariate.

  3. A study on industrial accident rate forecasting and program development of estimated zero accident time in Korea.

    PubMed

    Kim, Tae-gu; Kang, Young-sig; Lee, Hyung-won

    2011-01-01

    To begin a zero accident campaign for industry, the first thing is to estimate the industrial accident rate and the zero accident time systematically. This paper considers the social and technical change of the business environment after beginning the zero accident campaign through quantitative time series analysis methods. These methods include sum of squared errors (SSE), regression analysis method (RAM), exponential smoothing method (ESM), double exponential smoothing method (DESM), auto-regressive integrated moving average (ARIMA) model, and the proposed analytic function method (AFM). The program is developed to estimate the accident rate, zero accident time and achievement probability of an efficient industrial environment. In this paper, MFC (Microsoft Foundation Class) software of Visual Studio 2008 was used to develop a zero accident program. The results of this paper will provide major information for industrial accident prevention and be an important part of stimulating the zero accident campaign within all industrial environments.

  4. A Review of the Study Designs and Statistical Methods Used in the Determination of Predictors of All-Cause Mortality in HIV-Infected Cohorts: 2002–2011

    PubMed Central

    Otwombe, Kennedy N.; Petzold, Max; Martinson, Neil; Chirwa, Tobias

    2014-01-01

    Background Research in the predictors of all-cause mortality in HIV-infected people has widely been reported in literature. Making an informed decision requires understanding the methods used. Objectives We present a review on study designs, statistical methods and their appropriateness in original articles reporting on predictors of all-cause mortality in HIV-infected people between January 2002 and December 2011. Statistical methods were compared between 2002–2006 and 2007–2011. Time-to-event analysis techniques were considered appropriate. Data Sources Pubmed/Medline. Study Eligibility Criteria Original English-language articles were abstracted. Letters to the editor, editorials, reviews, systematic reviews, meta-analysis, case reports and any other ineligible articles were excluded. Results A total of 189 studies were identified (n = 91 in 2002–2006 and n = 98 in 2007–2011) out of which 130 (69%) were prospective and 56 (30%) were retrospective. One hundred and eighty-two (96%) studies described their sample using descriptive statistics while 32 (17%) made comparisons using t-tests. Kaplan-Meier methods for time-to-event analysis were commonly used in the earlier period (n = 69, 76% vs. n = 53, 54%, p = 0.002). Predictors of mortality in the two periods were commonly determined using Cox regression analysis (n = 67, 75% vs. n = 63, 64%, p = 0.12). Only 7 (4%) used advanced survival analysis methods of Cox regression analysis with frailty in which 6 (3%) were used in the later period. Thirty-two (17%) used logistic regression while 8 (4%) used other methods. There were significantly more articles from the first period using appropriate methods compared to the second (n = 80, 88% vs. n = 69, 70%, p-value = 0.003). Conclusion Descriptive statistics and survival analysis techniques remain the most common methods of analysis in publications on predictors of all-cause mortality in HIV-infected cohorts while prospective research designs are favoured. Sophisticated techniques of time-dependent Cox regression and Cox regression with frailty are scarce. This motivates for more training in the use of advanced time-to-event methods. PMID:24498313

  5. Enhancing hyperspectral spatial resolution using multispectral image fusion: A wavelet approach

    NASA Astrophysics Data System (ADS)

    Jazaeri, Amin

    High spectral and spatial resolution images have a significant impact in remote sensing applications. Because both spatial and spectral resolutions of spaceborne sensors are fixed by design and it is not possible to further increase the spatial or spectral resolution, techniques such as image fusion must be applied to achieve such goals. This dissertation introduces the concept of wavelet fusion between hyperspectral and multispectral sensors in order to enhance the spectral and spatial resolution of a hyperspectral image. To test the robustness of this concept, images from Hyperion (hyperspectral sensor) and Advanced Land Imager (multispectral sensor) were first co-registered and then fused using different wavelet algorithms. A regression-based fusion algorithm was also implemented for comparison purposes. The results show that the fused images using a combined bi-linear wavelet-regression algorithm have less error than other methods when compared to the ground truth. In addition, a combined regression-wavelet algorithm shows more immunity to misalignment of the pixels due to the lack of proper registration. The quantitative measures of average mean square error show that the performance of wavelet-based methods degrades when the spatial resolution of hyperspectral images becomes eight times less than its corresponding multispectral image. Regardless of what method of fusion is utilized, the main challenge in image fusion is image registration, which is also a very time intensive process. Because the combined regression wavelet technique is computationally expensive, a hybrid technique based on regression and wavelet methods was also implemented to decrease computational overhead. However, the gain in faster computation was offset by the introduction of more error in the outcome. The secondary objective of this dissertation is to examine the feasibility and sensor requirements for image fusion for future NASA missions in order to be able to perform onboard image fusion. In this process, the main challenge of image registration was resolved by registering the input images using transformation matrices of previously acquired data. The composite image resulted from the fusion process remarkably matched the ground truth, indicating the possibility of real time onboard fusion processing.

  6. Actual and estimated costs of disposable materials used during surgical procedures.

    PubMed

    Toyabe, Shin-Ichi; Cao, Pengyu; Kurashima, Sachiko; Nakayama, Yukiko; Ishii, Yuko; Hosoyama, Noriko; Akazawa, Kouhei

    2005-07-01

    It is difficult to estimate precisely the costs of disposable materials used during surgical operations. To evaluate the actual costs of disposable materials, we calculated the actual costs of disposable materials used in 59 operations by taking account of costs of all disposable materials used for each operation. The costs of the disposable materials varied significantly from operation to operation (US$ 38-4230 per operation), and the median [25-percentile and 75-percentile] of the sum total of disposable material costs of a single operation was found to be US$ 686 [205 and 993]. Multiple regression analysis with a stepwise regression method showed that costs of disposable materials significantly correlated only with operation time (p<0.001). Based on the results, we propose a simple method for estimating costs of disposable materials by measuring operation time, and we found that the method gives reliable results. Since costs of disposable materials used during surgical operations are considerable, precise estimation of the costs is essential for hospital cost accounting. Our method should be useful for planning hospital administration strategies.

  7. A Prediction Model for Functional Outcomes in Spinal Cord Disorder Patients Using Gaussian Process Regression.

    PubMed

    Lee, Sunghoon Ivan; Mortazavi, Bobak; Hoffman, Haydn A; Lu, Derek S; Li, Charles; Paak, Brian H; Garst, Jordan H; Razaghy, Mehrdad; Espinal, Marie; Park, Eunjeong; Lu, Daniel C; Sarrafzadeh, Majid

    2016-01-01

    Predicting the functional outcomes of spinal cord disorder patients after medical treatments, such as a surgical operation, has always been of great interest. Accurate posttreatment prediction is especially beneficial for clinicians, patients, care givers, and therapists. This paper introduces a prediction method for postoperative functional outcomes by a novel use of Gaussian process regression. The proposed method specifically considers the restricted value range of the target variables by modeling the Gaussian process based on a truncated Normal distribution, which significantly improves the prediction results. The prediction has been made in assistance with target tracking examinations using a highly portable and inexpensive handgrip device, which greatly contributes to the prediction performance. The proposed method has been validated through a dataset collected from a clinical cohort pilot involving 15 patients with cervical spinal cord disorder. The results show that the proposed method can accurately predict postoperative functional outcomes, Oswestry disability index and target tracking scores, based on the patient's preoperative information with a mean absolute error of 0.079 and 0.014 (out of 1.0), respectively.

  8. Support vector regression and artificial neural network models for stability indicating analysis of mebeverine hydrochloride and sulpiride mixtures in pharmaceutical preparation: A comparative study

    NASA Astrophysics Data System (ADS)

    Naguib, Ibrahim A.; Darwish, Hany W.

    2012-02-01

    A comparison between support vector regression (SVR) and Artificial Neural Networks (ANNs) multivariate regression methods is established showing the underlying algorithm for each and making a comparison between them to indicate the inherent advantages and limitations. In this paper we compare SVR to ANN with and without variable selection procedure (genetic algorithm (GA)). To project the comparison in a sensible way, the methods are used for the stability indicating quantitative analysis of mixtures of mebeverine hydrochloride and sulpiride in binary mixtures as a case study in presence of their reported impurities and degradation products (summing up to 6 components) in raw materials and pharmaceutical dosage form via handling the UV spectral data. For proper analysis, a 6 factor 5 level experimental design was established resulting in a training set of 25 mixtures containing different ratios of the interfering species. An independent test set consisting of 5 mixtures was used to validate the prediction ability of the suggested models. The proposed methods (linear SVR (without GA) and linear GA-ANN) were successfully applied to the analysis of pharmaceutical tablets containing mebeverine hydrochloride and sulpiride mixtures. The results manifest the problem of nonlinearity and how models like the SVR and ANN can handle it. The methods indicate the ability of the mentioned multivariate calibration models to deconvolute the highly overlapped UV spectra of the 6 components' mixtures, yet using cheap and easy to handle instruments like the UV spectrophotometer.

  9. The process and utility of classification and regression tree methodology in nursing research

    PubMed Central

    Kuhn, Lisa; Page, Karen; Ward, John; Worrall-Carter, Linda

    2014-01-01

    Aim This paper presents a discussion of classification and regression tree analysis and its utility in nursing research. Background Classification and regression tree analysis is an exploratory research method used to illustrate associations between variables not suited to traditional regression analysis. Complex interactions are demonstrated between covariates and variables of interest in inverted tree diagrams. Design Discussion paper. Data sources English language literature was sourced from eBooks, Medline Complete and CINAHL Plus databases, Google and Google Scholar, hard copy research texts and retrieved reference lists for terms including classification and regression tree* and derivatives and recursive partitioning from 1984–2013. Discussion Classification and regression tree analysis is an important method used to identify previously unknown patterns amongst data. Whilst there are several reasons to embrace this method as a means of exploratory quantitative research, issues regarding quality of data as well as the usefulness and validity of the findings should be considered. Implications for Nursing Research Classification and regression tree analysis is a valuable tool to guide nurses to reduce gaps in the application of evidence to practice. With the ever-expanding availability of data, it is important that nurses understand the utility and limitations of the research method. Conclusion Classification and regression tree analysis is an easily interpreted method for modelling interactions between health-related variables that would otherwise remain obscured. Knowledge is presented graphically, providing insightful understanding of complex and hierarchical relationships in an accessible and useful way to nursing and other health professions. PMID:24237048

  10. The process and utility of classification and regression tree methodology in nursing research.

    PubMed

    Kuhn, Lisa; Page, Karen; Ward, John; Worrall-Carter, Linda

    2014-06-01

    This paper presents a discussion of classification and regression tree analysis and its utility in nursing research. Classification and regression tree analysis is an exploratory research method used to illustrate associations between variables not suited to traditional regression analysis. Complex interactions are demonstrated between covariates and variables of interest in inverted tree diagrams. Discussion paper. English language literature was sourced from eBooks, Medline Complete and CINAHL Plus databases, Google and Google Scholar, hard copy research texts and retrieved reference lists for terms including classification and regression tree* and derivatives and recursive partitioning from 1984-2013. Classification and regression tree analysis is an important method used to identify previously unknown patterns amongst data. Whilst there are several reasons to embrace this method as a means of exploratory quantitative research, issues regarding quality of data as well as the usefulness and validity of the findings should be considered. Classification and regression tree analysis is a valuable tool to guide nurses to reduce gaps in the application of evidence to practice. With the ever-expanding availability of data, it is important that nurses understand the utility and limitations of the research method. Classification and regression tree analysis is an easily interpreted method for modelling interactions between health-related variables that would otherwise remain obscured. Knowledge is presented graphically, providing insightful understanding of complex and hierarchical relationships in an accessible and useful way to nursing and other health professions. © 2013 The Authors. Journal of Advanced Nursing Published by John Wiley & Sons Ltd.

  11. Iterative Strain-Gage Balance Calibration Data Analysis for Extended Independent Variable Sets

    NASA Technical Reports Server (NTRS)

    Ulbrich, Norbert Manfred

    2011-01-01

    A new method was developed that makes it possible to use an extended set of independent calibration variables for an iterative analysis of wind tunnel strain gage balance calibration data. The new method permits the application of the iterative analysis method whenever the total number of balance loads and other independent calibration variables is greater than the total number of measured strain gage outputs. Iteration equations used by the iterative analysis method have the limitation that the number of independent and dependent variables must match. The new method circumvents this limitation. It simply adds a missing dependent variable to the original data set by using an additional independent variable also as an additional dependent variable. Then, the desired solution of the regression analysis problem can be obtained that fits each gage output as a function of both the original and additional independent calibration variables. The final regression coefficients can be converted to data reduction matrix coefficients because the missing dependent variables were added to the data set without changing the regression analysis result for each gage output. Therefore, the new method still supports the application of the two load iteration equation choices that the iterative method traditionally uses for the prediction of balance loads during a wind tunnel test. An example is discussed in the paper that illustrates the application of the new method to a realistic simulation of temperature dependent calibration data set of a six component balance.

  12. Inference for multivariate regression model based on multiply imputed synthetic data generated via posterior predictive sampling

    NASA Astrophysics Data System (ADS)

    Moura, Ricardo; Sinha, Bimal; Coelho, Carlos A.

    2017-06-01

    The recent popularity of the use of synthetic data as a Statistical Disclosure Control technique has enabled the development of several methods of generating and analyzing such data, but almost always relying in asymptotic distributions and in consequence being not adequate for small sample datasets. Thus, a likelihood-based exact inference procedure is derived for the matrix of regression coefficients of the multivariate regression model, for multiply imputed synthetic data generated via Posterior Predictive Sampling. Since it is based in exact distributions this procedure may even be used in small sample datasets. Simulation studies compare the results obtained from the proposed exact inferential procedure with the results obtained from an adaptation of Reiters combination rule to multiply imputed synthetic datasets and an application to the 2000 Current Population Survey is discussed.

  13. Exhaustive Search for Sparse Variable Selection in Linear Regression

    NASA Astrophysics Data System (ADS)

    Igarashi, Yasuhiko; Takenaka, Hikaru; Nakanishi-Ohno, Yoshinori; Uemura, Makoto; Ikeda, Shiro; Okada, Masato

    2018-04-01

    We propose a K-sparse exhaustive search (ES-K) method and a K-sparse approximate exhaustive search method (AES-K) for selecting variables in linear regression. With these methods, K-sparse combinations of variables are tested exhaustively assuming that the optimal combination of explanatory variables is K-sparse. By collecting the results of exhaustively computing ES-K, various approximate methods for selecting sparse variables can be summarized as density of states. With this density of states, we can compare different methods for selecting sparse variables such as relaxation and sampling. For large problems where the combinatorial explosion of explanatory variables is crucial, the AES-K method enables density of states to be effectively reconstructed by using the replica-exchange Monte Carlo method and the multiple histogram method. Applying the ES-K and AES-K methods to type Ia supernova data, we confirmed the conventional understanding in astronomy when an appropriate K is given beforehand. However, we found the difficulty to determine K from the data. Using virtual measurement and analysis, we argue that this is caused by data shortage.

  14. Use of empirical likelihood to calibrate auxiliary information in partly linear monotone regression models.

    PubMed

    Chen, Baojiang; Qin, Jing

    2014-05-10

    In statistical analysis, a regression model is needed if one is interested in finding the relationship between a response variable and covariates. When the response depends on the covariate, then it may also depend on the function of this covariate. If one has no knowledge of this functional form but expect for monotonic increasing or decreasing, then the isotonic regression model is preferable. Estimation of parameters for isotonic regression models is based on the pool-adjacent-violators algorithm (PAVA), where the monotonicity constraints are built in. With missing data, people often employ the augmented estimating method to improve estimation efficiency by incorporating auxiliary information through a working regression model. However, under the framework of the isotonic regression model, the PAVA does not work as the monotonicity constraints are violated. In this paper, we develop an empirical likelihood-based method for isotonic regression model to incorporate the auxiliary information. Because the monotonicity constraints still hold, the PAVA can be used for parameter estimation. Simulation studies demonstrate that the proposed method can yield more efficient estimates, and in some situations, the efficiency improvement is substantial. We apply this method to a dementia study. Copyright © 2013 John Wiley & Sons, Ltd.

  15. Forecasting space weather: Can new econometric methods improve accuracy?

    NASA Astrophysics Data System (ADS)

    Reikard, Gordon

    2011-06-01

    Space weather forecasts are currently used in areas ranging from navigation and communication to electric power system operations. The relevant forecast horizons can range from as little as 24 h to several days. This paper analyzes the predictability of two major space weather measures using new time series methods, many of them derived from econometrics. The data sets are the A p geomagnetic index and the solar radio flux at 10.7 cm. The methods tested include nonlinear regressions, neural networks, frequency domain algorithms, GARCH models (which utilize the residual variance), state transition models, and models that combine elements of several techniques. While combined models are complex, they can be programmed using modern statistical software. The data frequency is daily, and forecasting experiments are run over horizons ranging from 1 to 7 days. Two major conclusions stand out. First, the frequency domain method forecasts the A p index more accurately than any time domain model, including both regressions and neural networks. This finding is very robust, and holds for all forecast horizons. Combining the frequency domain method with other techniques yields a further small improvement in accuracy. Second, the neural network forecasts the solar flux more accurately than any other method, although at short horizons (2 days or less) the regression and net yield similar results. The neural net does best when it includes measures of the long-term component in the data.

  16. An algebraic method for constructing stable and consistent autoregressive filters

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Harlim, John, E-mail: jharlim@psu.edu; Department of Meteorology, the Pennsylvania State University, University Park, PA 16802; Hong, Hoon, E-mail: hong@ncsu.edu

    2015-02-15

    In this paper, we introduce an algebraic method to construct stable and consistent univariate autoregressive (AR) models of low order for filtering and predicting nonlinear turbulent signals with memory depth. By stable, we refer to the classical stability condition for the AR model. By consistent, we refer to the classical consistency constraints of Adams–Bashforth methods of order-two. One attractive feature of this algebraic method is that the model parameters can be obtained without directly knowing any training data set as opposed to many standard, regression-based parameterization methods. It takes only long-time average statistics as inputs. The proposed method provides amore » discretization time step interval which guarantees the existence of stable and consistent AR model and simultaneously produces the parameters for the AR models. In our numerical examples with two chaotic time series with different characteristics of decaying time scales, we find that the proposed AR models produce significantly more accurate short-term predictive skill and comparable filtering skill relative to the linear regression-based AR models. These encouraging results are robust across wide ranges of discretization times, observation times, and observation noise variances. Finally, we also find that the proposed model produces an improved short-time prediction relative to the linear regression-based AR-models in forecasting a data set that characterizes the variability of the Madden–Julian Oscillation, a dominant tropical atmospheric wave pattern.« less

  17. Fast metabolite identification with Input Output Kernel Regression

    PubMed Central

    Brouard, Céline; Shen, Huibin; Dührkop, Kai; d'Alché-Buc, Florence; Böcker, Sebastian; Rousu, Juho

    2016-01-01

    Motivation: An important problematic of metabolomics is to identify metabolites using tandem mass spectrometry data. Machine learning methods have been proposed recently to solve this problem by predicting molecular fingerprint vectors and matching these fingerprints against existing molecular structure databases. In this work we propose to address the metabolite identification problem using a structured output prediction approach. This type of approach is not limited to vector output space and can handle structured output space such as the molecule space. Results: We use the Input Output Kernel Regression method to learn the mapping between tandem mass spectra and molecular structures. The principle of this method is to encode the similarities in the input (spectra) space and the similarities in the output (molecule) space using two kernel functions. This method approximates the spectra-molecule mapping in two phases. The first phase corresponds to a regression problem from the input space to the feature space associated to the output kernel. The second phase is a preimage problem, consisting in mapping back the predicted output feature vectors to the molecule space. We show that our approach achieves state-of-the-art accuracy in metabolite identification. Moreover, our method has the advantage of decreasing the running times for the training step and the test step by several orders of magnitude over the preceding methods. Availability and implementation: Contact: celine.brouard@aalto.fi Supplementary information: Supplementary data are available at Bioinformatics online. PMID:27307628

  18. Logistic Regression with Multiple Random Effects: A Simulation Study of Estimation Methods and Statistical Packages.

    PubMed

    Kim, Yoonsang; Choi, Young-Ku; Emery, Sherry

    2013-08-01

    Several statistical packages are capable of estimating generalized linear mixed models and these packages provide one or more of three estimation methods: penalized quasi-likelihood, Laplace, and Gauss-Hermite. Many studies have investigated these methods' performance for the mixed-effects logistic regression model. However, the authors focused on models with one or two random effects and assumed a simple covariance structure between them, which may not be realistic. When there are multiple correlated random effects in a model, the computation becomes intensive, and often an algorithm fails to converge. Moreover, in our analysis of smoking status and exposure to anti-tobacco advertisements, we have observed that when a model included multiple random effects, parameter estimates varied considerably from one statistical package to another even when using the same estimation method. This article presents a comprehensive review of the advantages and disadvantages of each estimation method. In addition, we compare the performances of the three methods across statistical packages via simulation, which involves two- and three-level logistic regression models with at least three correlated random effects. We apply our findings to a real dataset. Our results suggest that two packages-SAS GLIMMIX Laplace and SuperMix Gaussian quadrature-perform well in terms of accuracy, precision, convergence rates, and computing speed. We also discuss the strengths and weaknesses of the two packages in regard to sample sizes.

  19. A comparison of radiometric correction techniques in the evaluation of the relationship between LST and NDVI in Landsat imagery.

    PubMed

    Tan, Kok Chooi; Lim, Hwee San; Matjafri, Mohd Zubir; Abdullah, Khiruddin

    2012-06-01

    Atmospheric corrections for multi-temporal optical satellite images are necessary, especially in change detection analyses, such as normalized difference vegetation index (NDVI) rationing. Abrupt change detection analysis using remote-sensing techniques requires radiometric congruity and atmospheric correction to monitor terrestrial surfaces over time. Two atmospheric correction methods were used for this study: relative radiometric normalization and the simplified method for atmospheric correction (SMAC) in the solar spectrum. A multi-temporal data set consisting of two sets of Landsat images from the period between 1991 and 2002 of Penang Island, Malaysia, was used to compare NDVI maps, which were generated using the proposed atmospheric correction methods. Land surface temperature (LST) was retrieved using ATCOR3_T in PCI Geomatica 10.1 image processing software. Linear regression analysis was utilized to analyze the relationship between NDVI and LST. This study reveals that both of the proposed atmospheric correction methods yielded high accuracy through examination of the linear correlation coefficients. To check for the accuracy of the equation obtained through linear regression analysis for every single satellite image, 20 points were randomly chosen. The results showed that the SMAC method yielded a constant value (in terms of error) to predict the NDVI value from linear regression analysis-derived equation. The errors (average) from both proposed atmospheric correction methods were less than 10%.

  20. A robust ridge regression approach in the presence of both multicollinearity and outliers in the data

    NASA Astrophysics Data System (ADS)

    Shariff, Nurul Sima Mohamad; Ferdaos, Nur Aqilah

    2017-08-01

    Multicollinearity often leads to inconsistent and unreliable parameter estimates in regression analysis. This situation will be more severe in the presence of outliers it will cause fatter tails in the error distributions than the normal distributions. The well-known procedure that is robust to multicollinearity problem is the ridge regression method. This method however is expected to be affected by the presence of outliers due to some assumptions imposed in the modeling procedure. Thus, the robust version of existing ridge method with some modification in the inverse matrix and the estimated response value is introduced. The performance of the proposed method is discussed and comparisons are made with several existing estimators namely, Ordinary Least Squares (OLS), ridge regression and robust ridge regression based on GM-estimates. The finding of this study is able to produce reliable parameter estimates in the presence of both multicollinearity and outliers in the data.

  1. Comparative assessment of GIS-based methods and metrics for estimating long-term exposures to air pollution

    NASA Astrophysics Data System (ADS)

    Gulliver, John; de Hoogh, Kees; Fecht, Daniela; Vienneau, Danielle; Briggs, David

    2011-12-01

    The development of geographical information system techniques has opened up a wide array of methods for air pollution exposure assessment. The extent to which these provide reliable estimates of air pollution concentrations is nevertheless not clearly established. Nor is it clear which methods or metrics should be preferred in epidemiological studies. This paper compares the performance of ten different methods and metrics in terms of their ability to predict mean annual PM 10 concentrations across 52 monitoring sites in London, UK. Metrics analysed include indicators (distance to nearest road, traffic volume on nearest road, heavy duty vehicle (HDV) volume on nearest road, road density within 150 m, traffic volume within 150 m and HDV volume within 150 m) and four modelling approaches: based on the nearest monitoring site, kriging, dispersion modelling and land use regression (LUR). Measures were computed in a GIS, and resulting metrics calibrated and validated against monitoring data using a form of grouped jack-knife analysis. The results show that PM 10 concentrations across London show little spatial variation. As a consequence, most methods can predict the average without serious bias. Few of the approaches, however, show good correlations with monitored PM 10 concentrations, and most predict no better than a simple classification based on site type. Only land use regression reaches acceptable levels of correlation ( R2 = 0.47), though this can be improved by also including information on site type. This might therefore be taken as a recommended approach in many studies, though care is needed in developing meaningful land use regression models, and like any method they need to be validated against local data before their application as part of epidemiological studies.

  2. Geographically weighted regression based methods for merging satellite and gauge precipitation

    NASA Astrophysics Data System (ADS)

    Chao, Lijun; Zhang, Ke; Li, Zhijia; Zhu, Yuelong; Wang, Jingfeng; Yu, Zhongbo

    2018-03-01

    Real-time precipitation data with high spatiotemporal resolutions are crucial for accurate hydrological forecasting. To improve the spatial resolution and quality of satellite precipitation, a three-step satellite and gauge precipitation merging method was formulated in this study: (1) bilinear interpolation is first applied to downscale coarser satellite precipitation to a finer resolution (PS); (2) the (mixed) geographically weighted regression methods coupled with a weighting function are then used to estimate biases of PS as functions of gauge observations (PO) and PS; and (3) biases of PS are finally corrected to produce a merged precipitation product. Based on the above framework, eight algorithms, a combination of two geographically weighted regression methods and four weighting functions, are developed to merge CMORPH (CPC MORPHing technique) precipitation with station observations on a daily scale in the Ziwuhe Basin of China. The geographical variables (elevation, slope, aspect, surface roughness, and distance to the coastline) and a meteorological variable (wind speed) were used for merging precipitation to avoid the artificial spatial autocorrelation resulting from traditional interpolation methods. The results show that the combination of the MGWR and BI-square function (MGWR-BI) has the best performance (R = 0.863 and RMSE = 7.273 mm/day) among the eight algorithms. The MGWR-BI algorithm was then applied to produce hourly merged precipitation product. Compared to the original CMORPH product (R = 0.208 and RMSE = 1.208 mm/hr), the quality of the merged data is significantly higher (R = 0.724 and RMSE = 0.706 mm/hr). The developed merging method not only improves the spatial resolution and quality of the satellite product but also is easy to implement, which is valuable for hydrological modeling and other applications.

  3. QSRR modeling for diverse drugs using different feature selection methods coupled with linear and nonlinear regressions.

    PubMed

    Goodarzi, Mohammad; Jensen, Richard; Vander Heyden, Yvan

    2012-12-01

    A Quantitative Structure-Retention Relationship (QSRR) is proposed to estimate the chromatographic retention of 83 diverse drugs on a Unisphere poly butadiene (PBD) column, using isocratic elutions at pH 11.7. Previous work has generated QSRR models for them using Classification And Regression Trees (CART). In this work, Ant Colony Optimization is used as a feature selection method to find the best molecular descriptors from a large pool. In addition, several other selection methods have been applied, such as Genetic Algorithms, Stepwise Regression and the Relief method, not only to evaluate Ant Colony Optimization as a feature selection method but also to investigate its ability to find the important descriptors in QSRR. Multiple Linear Regression (MLR) and Support Vector Machines (SVMs) were applied as linear and nonlinear regression methods, respectively, giving excellent correlation between the experimental, i.e. extrapolated to a mobile phase consisting of pure water, and predicted logarithms of the retention factors of the drugs (logk(w)). The overall best model was the SVM one built using descriptors selected by ACO. Copyright © 2012 Elsevier B.V. All rights reserved.

  4. Analyzing hospitalization data: potential limitations of Poisson regression.

    PubMed

    Weaver, Colin G; Ravani, Pietro; Oliver, Matthew J; Austin, Peter C; Quinn, Robert R

    2015-08-01

    Poisson regression is commonly used to analyze hospitalization data when outcomes are expressed as counts (e.g. number of days in hospital). However, data often violate the assumptions on which Poisson regression is based. More appropriate extensions of this model, while available, are rarely used. We compared hospitalization data between 206 patients treated with hemodialysis (HD) and 107 treated with peritoneal dialysis (PD) using Poisson regression and compared results from standard Poisson regression with those obtained using three other approaches for modeling count data: negative binomial (NB) regression, zero-inflated Poisson (ZIP) regression and zero-inflated negative binomial (ZINB) regression. We examined the appropriateness of each model and compared the results obtained with each approach. During a mean 1.9 years of follow-up, 183 of 313 patients (58%) were never hospitalized (indicating an excess of 'zeros'). The data also displayed overdispersion (variance greater than mean), violating another assumption of the Poisson model. Using four criteria, we determined that the NB and ZINB models performed best. According to these two models, patients treated with HD experienced similar hospitalization rates as those receiving PD {NB rate ratio (RR): 1.04 [bootstrapped 95% confidence interval (CI): 0.49-2.20]; ZINB summary RR: 1.21 (bootstrapped 95% CI 0.60-2.46)}. Poisson and ZIP models fit the data poorly and had much larger point estimates than the NB and ZINB models [Poisson RR: 1.93 (bootstrapped 95% CI 0.88-4.23); ZIP summary RR: 1.84 (bootstrapped 95% CI 0.88-3.84)]. We found substantially different results when modeling hospitalization data, depending on the approach used. Our results argue strongly for a sound model selection process and improved reporting around statistical methods used for modeling count data. © The Author 2015. Published by Oxford University Press on behalf of ERA-EDTA. All rights reserved.

  5. Regression and multivariate models for predicting particulate matter concentration level.

    PubMed

    Nazif, Amina; Mohammed, Nurul Izma; Malakahmad, Amirhossein; Abualqumboz, Motasem S

    2018-01-01

    The devastating health effects of particulate matter (PM 10 ) exposure by susceptible populace has made it necessary to evaluate PM 10 pollution. Meteorological parameters and seasonal variation increases PM 10 concentration levels, especially in areas that have multiple anthropogenic activities. Hence, stepwise regression (SR), multiple linear regression (MLR) and principal component regression (PCR) analyses were used to analyse daily average PM 10 concentration levels. The analyses were carried out using daily average PM 10 concentration, temperature, humidity, wind speed and wind direction data from 2006 to 2010. The data was from an industrial air quality monitoring station in Malaysia. The SR analysis established that meteorological parameters had less influence on PM 10 concentration levels having coefficient of determination (R 2 ) result from 23 to 29% based on seasoned and unseasoned analysis. While, the result of the prediction analysis showed that PCR models had a better R 2 result than MLR methods. The results for the analyses based on both seasoned and unseasoned data established that MLR models had R 2 result from 0.50 to 0.60. While, PCR models had R 2 result from 0.66 to 0.89. In addition, the validation analysis using 2016 data also recognised that the PCR model outperformed the MLR model, with the PCR model for the seasoned analysis having the best result. These analyses will aid in achieving sustainable air quality management strategies.

  6. A comparison between standard methods and structural nested modelling when bias from a healthy worker survivor effect is suspected: an iron-ore mining cohort study.

    PubMed

    Björ, Ove; Damber, Lena; Jonsson, Håkan; Nilsson, Tohr

    2015-07-01

    Iron-ore miners are exposed to extremely dusty and physically arduous work environments. The demanding activities of mining select healthier workers with longer work histories (ie, the Healthy Worker Survivor Effect (HWSE)), and could have a reversing effect on the exposure-response association. The objective of this study was to evaluate an iron-ore mining cohort to determine whether the effect of respirable dust was confounded by the presence of an HWSE. When an HWSE exists, standard modelling methods, such as Cox regression analysis, produce biased results. We compared results from g-estimation of accelerated failure-time modelling adjusted for HWSE with corresponding unadjusted Cox regression modelling results. For all-cause mortality when adjusting for the HWSE, cumulative exposure from respirable dust was associated with a 6% decrease of life expectancy if exposed ≥15 years, compared with never being exposed. Respirable dust continued to be associated with mortality after censoring outcomes known to be associated with dust when adjusting for the HWSE. In contrast, results based on Cox regression analysis did not support that an association was present. The adjustment for the HWSE made a difference when estimating the risk of mortality from respirable dust. The results of this study, therefore, support the recommendation that standard methods of analysis should be complemented with structural modelling analysis techniques, such as g-estimation of accelerated failure-time modelling, to adjust for the HWSE. Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions.

  7. Poisson regression models outperform the geometrical model in estimating the peak-to-trough ratio of seasonal variation: a simulation study.

    PubMed

    Christensen, A L; Lundbye-Christensen, S; Dethlefsen, C

    2011-12-01

    Several statistical methods of assessing seasonal variation are available. Brookhart and Rothman [3] proposed a second-order moment-based estimator based on the geometrical model derived by Edwards [1], and reported that this estimator is superior in estimating the peak-to-trough ratio of seasonal variation compared with Edwards' estimator with respect to bias and mean squared error. Alternatively, seasonal variation may be modelled using a Poisson regression model, which provides flexibility in modelling the pattern of seasonal variation and adjustments for covariates. Based on a Monte Carlo simulation study three estimators, one based on the geometrical model, and two based on log-linear Poisson regression models, were evaluated in regards to bias and standard deviation (SD). We evaluated the estimators on data simulated according to schemes varying in seasonal variation and presence of a secular trend. All methods and analyses in this paper are available in the R package Peak2Trough[13]. Applying a Poisson regression model resulted in lower absolute bias and SD for data simulated according to the corresponding model assumptions. Poisson regression models had lower bias and SD for data simulated to deviate from the corresponding model assumptions than the geometrical model. This simulation study encourages the use of Poisson regression models in estimating the peak-to-trough ratio of seasonal variation as opposed to the geometrical model. Copyright © 2011 Elsevier Ireland Ltd. All rights reserved.

  8. A pocket-sized metabolic analyzer for assessment of resting energy expenditure.

    PubMed

    Zhao, Di; Xian, Xiaojun; Terrera, Mirna; Krishnan, Ranganath; Miller, Dylan; Bridgeman, Devon; Tao, Kevin; Zhang, Lihua; Tsow, Francis; Forzani, Erica S; Tao, Nongjian

    2014-04-01

    The assessment of metabolic parameters related to energy expenditure has a proven value for weight management; however these measurements remain too difficult and costly for monitoring individuals at home. The objective of this study is to evaluate the accuracy of a new pocket-sized metabolic analyzer device for assessing energy expenditure at rest (REE) and during sedentary activities (EE). The new device performs indirect calorimetry by measuring an individual's oxygen consumption (VO2) and carbon dioxide production (VCO2) rates, which allows the determination of resting- and sedentary activity-related energy expenditure. VO2 and VCO2 values of 17 volunteer adult subjects were measured during resting and sedentary activities in order to compare the metabolic analyzer with the Douglas bag method. The Douglas bag method is considered the Gold Standard method for indirect calorimetry. Metabolic parameters of VO2, VCO2, and energy expenditure were compared using linear regression analysis, paired t-tests, and Bland-Altman plots. Linear regression analysis of measured VO2 and VCO2 values, as well as calculated energy expenditure assessed with the new analyzer and Douglas bag method, had the following linear regression parameters (linear regression slope LRS0, and R-squared coefficient, r(2)) with p = 0: LRS0 (SD) = 1.00 (0.01), r(2) = 0.9933 for VO2; LRS0 (SD) = 1.00 (0.01), r(2) = 0.9929 for VCO2; and LRS0 (SD) = 1.00 (0.01), r(2) = 0.9942 for energy expenditure. In addition, results from paired t-tests did not show statistical significant difference between the methods with a significance level of α = 0.05 for VO2, VCO2, REE, and EE. Furthermore, the Bland-Altman plot for REE showed good agreement between methods with 100% of the results within ±2SD, which was equivalent to ≤10% error. The findings demonstrate that the new pocket-sized metabolic analyzer device is accurate for determining VO2, VCO2, and energy expenditure. Copyright © 2013 Elsevier Ltd and European Society for Clinical Nutrition and Metabolism. All rights reserved.

  9. On using summary statistics from an external calibration sample to correct for covariate measurement error.

    PubMed

    Guo, Ying; Little, Roderick J; McConnell, Daniel S

    2012-01-01

    Covariate measurement error is common in epidemiologic studies. Current methods for correcting measurement error with information from external calibration samples are insufficient to provide valid adjusted inferences. We consider the problem of estimating the regression of an outcome Y on covariates X and Z, where Y and Z are observed, X is unobserved, but a variable W that measures X with error is observed. Information about measurement error is provided in an external calibration sample where data on X and W (but not Y and Z) are recorded. We describe a method that uses summary statistics from the calibration sample to create multiple imputations of the missing values of X in the regression sample, so that the regression coefficients of Y on X and Z and associated standard errors can be estimated using simple multiple imputation combining rules, yielding valid statistical inferences under the assumption of a multivariate normal distribution. The proposed method is shown by simulation to provide better inferences than existing methods, namely the naive method, classical calibration, and regression calibration, particularly for correction for bias and achieving nominal confidence levels. We also illustrate our method with an example using linear regression to examine the relation between serum reproductive hormone concentrations and bone mineral density loss in midlife women in the Michigan Bone Health and Metabolism Study. Existing methods fail to adjust appropriately for bias due to measurement error in the regression setting, particularly when measurement error is substantial. The proposed method corrects this deficiency.

  10. Vesicular stomatitis forecasting based on Google Trends

    PubMed Central

    Lu, Yi; Zhou, GuangYa; Chen, Qin

    2018-01-01

    Background Vesicular stomatitis (VS) is an important viral disease of livestock. The main feature of VS is irregular blisters that occur on the lips, tongue, oral mucosa, hoof crown and nipple. Humans can also be infected with vesicular stomatitis and develop meningitis. This study analyses 2014 American VS outbreaks in order to accurately predict vesicular stomatitis outbreak trends. Methods American VS outbreaks data were collected from OIE. The data for VS keywords were obtained by inputting 24 disease-related keywords into Google Trends. After calculating the Pearson and Spearman correlation coefficients, it was found that there was a relationship between outbreaks and keywords derived from Google Trends. Finally, the predicted model was constructed based on qualitative classification and quantitative regression. Results For the regression model, the Pearson correlation coefficients between the predicted outbreaks and actual outbreaks are 0.953 and 0.948, respectively. For the qualitative classification model, we constructed five classification predictive models and chose the best classification predictive model as the result. The results showed, SN (sensitivity), SP (specificity) and ACC (prediction accuracy) values of the best classification predictive model are 78.52%,72.5% and 77.14%, respectively. Conclusion This study applied Google search data to construct a qualitative classification model and a quantitative regression model. The results show that the method is effective and that these two models obtain more accurate forecast. PMID:29385198

  11. Uncertainty of streamwater solute fluxes in five contrasting headwater catchments including model uncertainty and natural variability (Invited)

    NASA Astrophysics Data System (ADS)

    Aulenbach, B. T.; Burns, D. A.; Shanley, J. B.; Yanai, R. D.; Bae, K.; Wild, A.; Yang, Y.; Dong, Y.

    2013-12-01

    There are many sources of uncertainty in estimates of streamwater solute flux. Flux is the product of discharge and concentration (summed over time), each of which has measurement uncertainty of its own. Discharge can be measured almost continuously, but concentrations are usually determined from discrete samples, which increases uncertainty dependent on sampling frequency and how concentrations are assigned for the periods between samples. Gaps between samples can be estimated by linear interpolation or by models that that use the relations between concentration and continuously measured or known variables such as discharge, season, temperature, and time. For this project, developed in cooperation with QUEST (Quantifying Uncertainty in Ecosystem Studies), we evaluated uncertainty for three flux estimation methods and three different sampling frequencies (monthly, weekly, and weekly plus event). The constituents investigated were dissolved NO3, Si, SO4, and dissolved organic carbon (DOC), solutes whose concentration dynamics exhibit strongly contrasting behavior. The evaluation was completed for a 10-year period at five small, forested watersheds in Georgia, New Hampshire, New York, Puerto Rico, and Vermont. Concentration regression models were developed for each solute at each of the three sampling frequencies for all five watersheds. Fluxes were then calculated using (1) a linear interpolation approach, (2) a regression-model method, and (3) the composite method - which combines the regression-model method for estimating concentrations and the linear interpolation method for correcting model residuals to the observed sample concentrations. We considered the best estimates of flux to be derived using the composite method at the highest sampling frequencies. We also evaluated the importance of sampling frequency and estimation method on flux estimate uncertainty; flux uncertainty was dependent on the variability characteristics of each solute and varied for different reporting periods (e.g. 10-year, study period vs. annually vs. monthly). The usefulness of the two regression model based flux estimation approaches was dependent upon the amount of variance in concentrations the regression models could explain. Our results can guide the development of optimal sampling strategies by weighing sampling frequency with improvements in uncertainty in stream flux estimates for solutes with particular characteristics of variability. The appropriate flux estimation method is dependent on a combination of sampling frequency and the strength of concentration regression models. Sites: Biscuit Brook (Frost Valley, NY), Hubbard Brook Experimental Forest and LTER (West Thornton, NH), Luquillo Experimental Forest and LTER (Luquillo, Puerto Rico), Panola Mountain (Stockbridge, GA), Sleepers River Research Watershed (Danville, VT)

  12. Quality optimization of H.264/AVC video transmission over noisy environments using a sparse regression framework

    NASA Astrophysics Data System (ADS)

    Pandremmenou, K.; Tziortziotis, N.; Paluri, S.; Zhang, W.; Blekas, K.; Kondi, L. P.; Kumar, S.

    2015-03-01

    We propose the use of the Least Absolute Shrinkage and Selection Operator (LASSO) regression method in order to predict the Cumulative Mean Squared Error (CMSE), incurred by the loss of individual slices in video transmission. We extract a number of quality-relevant features from the H.264/AVC video sequences, which are given as input to the LASSO. This method has the benefit of not only keeping a subset of the features that have the strongest effects towards video quality, but also produces accurate CMSE predictions. Particularly, we study the LASSO regression through two different architectures; the Global LASSO (G.LASSO) and Local LASSO (L.LASSO). In G.LASSO, a single regression model is trained for all slice types together, while in L.LASSO, motivated by the fact that the values for some features are closely dependent on the considered slice type, each slice type has its own regression model, in an e ort to improve LASSO's prediction capability. Based on the predicted CMSE values, we group the video slices into four priority classes. Additionally, we consider a video transmission scenario over a noisy channel, where Unequal Error Protection (UEP) is applied to all prioritized slices. The provided results demonstrate the efficiency of LASSO in estimating CMSE with high accuracy, using only a few features. les that typically contain high-entropy data, producing a footprint that is far less conspicuous than existing methods. The system uses a local web server to provide a le system, user interface and applications through an web architecture.

  13. Evaluation of interpolation techniques for the creation of gridded daily precipitation (1 × 1 km2); Cyprus, 1980-2010

    NASA Astrophysics Data System (ADS)

    Camera, Corrado; Bruggeman, Adriana; Hadjinicolaou, Panos; Pashiardis, Stelios; Lange, Manfred A.

    2014-01-01

    High-resolution gridded daily data sets are essential for natural resource management and the analyses of climate changes and their effects. This study aims to evaluate the performance of 15 simple or complex interpolation techniques in reproducing daily precipitation at a resolution of 1 km2 over topographically complex areas. Methods are tested considering two different sets of observation densities and different rainfall amounts. We used rainfall data that were recorded at 74 and 145 observational stations, respectively, spread over the 5760 km2 of the Republic of Cyprus, in the Eastern Mediterranean. Regression analyses utilizing geographical copredictors and neighboring interpolation techniques were evaluated both in isolation and combined. Linear multiple regression (LMR) and geographically weighted regression methods (GWR) were tested. These included a step-wise selection of covariables, as well as inverse distance weighting (IDW), kriging, and 3D-thin plate splines (TPS). The relative rank of the different techniques changes with different station density and rainfall amounts. Our results indicate that TPS performs well for low station density and large-scale events and also when coupled with regression models. It performs poorly for high station density. The opposite is observed when using IDW. Simple IDW performs best for local events, while a combination of step-wise GWR and IDW proves to be the best method for large-scale events and high station density. This study indicates that the use of step-wise regression with a variable set of geographic parameters can improve the interpolation of large-scale events because it facilitates the representation of local climate dynamics.

  14. Assessing response of sediment load variation to climate change and human activities with six different approaches.

    PubMed

    Zhao, Guangju; Mu, Xingmin; Jiao, Juying; Gao, Peng; Sun, Wenyi; Li, Erhui; Wei, Yanhong; Huang, Jiacong

    2018-05-23

    Understanding the relative contributions of climate change and human activities to variations in sediment load is of great importance for regional soil, and river basin management. Considerable studies have investigated spatial-temporal variation of sediment load within the Loess Plateau; however, contradictory findings exist among methods used. This study systematically reviewed six quantitative methods: simple linear regression, double mass curve, sediment identity factor analysis, dam-sedimentation based method, the Sediment Delivery Distributed (SEDD) model, and the Soil Water Assessment Tool (SWAT) model. The calculation procedures and merits for each method were systematically explained. A case study in the Huangfuchuan watershed on the northern Loess Plateau has been undertaken. The results showed that sediment load had been reduced by 70.5% during the changing period from 1990 to 2012 compared to that of the baseline period from 1955 to 1989. Human activities accounted for an average of 93.6 ± 4.1% of the total decline in sediment load, whereas climate change contributed 6.4 ± 4.1%. Five methods produced similar estimates, but the linear regression yielded relatively different results. The results of this study provide a good reference for assessing the effects of climate change and human activities on sediment load variation by using different methods. Copyright © 2018. Published by Elsevier B.V.

  15. Regression: The Apple Does Not Fall Far From the Tree.

    PubMed

    Vetter, Thomas R; Schober, Patrick

    2018-05-15

    Researchers and clinicians are frequently interested in either: (1) assessing whether there is a relationship or association between 2 or more variables and quantifying this association; or (2) determining whether 1 or more variables can predict another variable. The strength of such an association is mainly described by the correlation. However, regression analysis and regression models can be used not only to identify whether there is a significant relationship or association between variables but also to generate estimations of such a predictive relationship between variables. This basic statistical tutorial discusses the fundamental concepts and techniques related to the most common types of regression analysis and modeling, including simple linear regression, multiple regression, logistic regression, ordinal regression, and Poisson regression, as well as the common yet often underrecognized phenomenon of regression toward the mean. The various types of regression analysis are powerful statistical techniques, which when appropriately applied, can allow for the valid interpretation of complex, multifactorial data. Regression analysis and models can assess whether there is a relationship or association between 2 or more observed variables and estimate the strength of this association, as well as determine whether 1 or more variables can predict another variable. Regression is thus being applied more commonly in anesthesia, perioperative, critical care, and pain research. However, it is crucial to note that regression can identify plausible risk factors; it does not prove causation (a definitive cause and effect relationship). The results of a regression analysis instead identify independent (predictor) variable(s) associated with the dependent (outcome) variable. As with other statistical methods, applying regression requires that certain assumptions be met, which can be tested with specific diagnostics.

  16. Restoration of Monotonicity Respecting in Dynamic Regression

    PubMed Central

    Huang, Yijian

    2017-01-01

    Dynamic regression models, including the quantile regression model and Aalen’s additive hazards model, are widely adopted to investigate evolving covariate effects. Yet lack of monotonicity respecting with standard estimation procedures remains an outstanding issue. Advances have recently been made, but none provides a complete resolution. In this article, we propose a novel adaptive interpolation method to restore monotonicity respecting, by successively identifying and then interpolating nearest monotonicity-respecting points of an original estimator. Under mild regularity conditions, the resulting regression coefficient estimator is shown to be asymptotically equivalent to the original. Our numerical studies have demonstrated that the proposed estimator is much more smooth and may have better finite-sample efficiency than the original as well as, when available as only in special cases, other competing monotonicity-respecting estimators. Illustration with a clinical study is provided. PMID:29430068

  17. Ordinary Least Squares and Quantile Regression: An Inquiry-Based Learning Approach to a Comparison of Regression Methods

    ERIC Educational Resources Information Center

    Helmreich, James E.; Krog, K. Peter

    2018-01-01

    We present a short, inquiry-based learning course on concepts and methods underlying ordinary least squares (OLS), least absolute deviation (LAD), and quantile regression (QR). Students investigate squared, absolute, and weighted absolute distance functions (metrics) as location measures. Using differential calculus and properties of convex…

  18. Statistical relations among earthquake magnitude, surface rupture length, and surface fault displacement

    USGS Publications Warehouse

    Bonilla, M.G.; Mark, R.K.; Lienkaemper, J.J.

    1984-01-01

    In order to refine correlations of surface-wave magnitude, fault rupture length at the ground surface, and fault displacement at the surface by including the uncertainties in these variables, the existing data were critically reviewed and a new data base was compiled. Earthquake magnitudes were redetermined as necessary to make them as consistent as possible with the Gutenberg methods and results, which necessarily make up much of the data base. Measurement errors were estimated for the three variables for 58 moderate to large shallow-focus earthquakes. Regression analyses were then made utilizing the estimated measurement errors. The regression analysis demonstrates that the relations among the variables magnitude, length, and displacement are stochastic in nature. The stochastic variance, introduced in part by incomplete surface expression of seismogenic faulting, variation in shear modulus, and regional factors, dominates the estimated measurement errors. Thus, it is appropriate to use ordinary least squares for the regression models, rather than regression models based upon an underlying deterministic relation with the variance resulting from measurement errors. Significant differences exist in correlations of certain combinations of length, displacement, and magnitude when events are qrouped by fault type or by region, including attenuation regions delineated by Evernden and others. Subdivision of the data results in too few data for some fault types and regions, and for these only regressions using all of the data as a group are reported. Estimates of the magnitude and the standard deviation of the magnitude of a prehistoric or future earthquake associated with a fault can be made by correlating M with the logarithms of rupture length, fault displacement, or the product of length and displacement. Fault rupture area could be reliably estimated for about 20 of the events in the data set. Regression of MS on rupture area did not result in a marked improvement over regressions that did not involve rupture area. Because no subduction-zone earthquakes are included in this study, the reported results do not apply to such zones.

  19. Evaluation of conventional and alternative monitoring methods for a recreational marine beach with nonpoint source of fecal contamination.

    PubMed

    Shibata, Tomoyuki; Solo-Gabriele, Helena M; Sinigalliano, Christopher D; Gidley, Maribeth L; Plano, Lisa R W; Fleisher, Jay M; Wang, John D; Elmir, Samir M; He, Guoqing; Wright, Mary E; Abdelzaher, Amir M; Ortega, Cristina; Wanless, David; Garza, Anna C; Kish, Jonathan; Scott, Troy; Hollenbeck, Julie; Backer, Lorraine C; Fleming, Lora E

    2010-11-01

    The objectives of this work were to compare enterococci (ENT) measurements based on the membrane filter, ENT(MF) with alternatives that can provide faster results including alternative enterococci methods (e.g., chromogenic substrate (CS), and quantitative polymerase chain reaction (qPCR)), and results from regression models based upon environmental parameters that can be measured in real-time. ENT(MF) were also compared to source tracking markers (Staphylococcus aureus, Bacteroidales human and dog markers, and Catellicoccus gull marker) in an effort to interpret the variability of the signal. Results showed that concentrations of enterococci based upon MF (<2 to 3320 CFU/100 mL) were significantly different from the CS and qPCR methods (p < 0.01). The correlations between MF and CS (r = 0.58, p < 0.01) were stronger than between MF and qPCR (r ≤ 0.36, p < 0.01). Enterococci levels by MF, CS, and qPCR methods were positively correlated with turbidity and tidal height. Enterococci by MF and CS were also inversely correlated with solar radiation but enterococci by qPCR was not. The regression model based on environmental variables provided fair qualitative predictions of enterococci by MF in real-time, for daily geometric mean levels, but not for individual samples. Overall, ENT(MF) was not significantly correlated with source tracking markers with the exception of samples collected during one storm event. The inability of the regression model to predict ENT(MF) levels for individual samples is likely due to the different sources of ENT impacting the beach at any given time, making it particularly difficult to to predict short-term variability of ENT(MF) for environmental parameters.

  20. Face Alignment via Regressing Local Binary Features.

    PubMed

    Ren, Shaoqing; Cao, Xudong; Wei, Yichen; Sun, Jian

    2016-03-01

    This paper presents a highly efficient and accurate regression approach for face alignment. Our approach has two novel components: 1) a set of local binary features and 2) a locality principle for learning those features. The locality principle guides us to learn a set of highly discriminative local binary features for each facial landmark independently. The obtained local binary features are used to jointly learn a linear regression for the final output. This approach achieves the state-of-the-art results when tested on the most challenging benchmarks to date. Furthermore, because extracting and regressing local binary features are computationally very cheap, our system is much faster than previous methods. It achieves over 3000 frames per second (FPS) on a desktop or 300 FPS on a mobile phone for locating a few dozens of landmarks. We also study a key issue that is important but has received little attention in the previous research, which is the face detector used to initialize alignment. We investigate several face detectors and perform quantitative evaluation on how they affect alignment accuracy. We find that an alignment friendly detector can further greatly boost the accuracy of our alignment method, reducing the error up to 16% relatively. To facilitate practical usage of face detection/alignment methods, we also propose a convenient metric to measure how good a detector is for alignment initialization.

  1. Hyperspectral Analysis of Soil Nitrogen, Carbon, Carbonate, and Organic Matter Using Regression Trees

    PubMed Central

    Gmur, Stephan; Vogt, Daniel; Zabowski, Darlene; Moskal, L. Monika

    2012-01-01

    The characterization of soil attributes using hyperspectral sensors has revealed patterns in soil spectra that are known to respond to mineral composition, organic matter, soil moisture and particle size distribution. Soil samples from different soil horizons of replicated soil series from sites located within Washington and Oregon were analyzed with the FieldSpec Spectroradiometer to measure their spectral signatures across the electromagnetic range of 400 to 1,000 nm. Similarity rankings of individual soil samples reveal differences between replicate series as well as samples within the same replicate series. Using classification and regression tree statistical methods, regression trees were fitted to each spectral response using concentrations of nitrogen, carbon, carbonate and organic matter as the response variables. Statistics resulting from fitted trees were: nitrogen R2 0.91 (p < 0.01) at 403, 470, 687, and 846 nm spectral band widths, carbonate R2 0.95 (p < 0.01) at 531 and 898 nm band widths, total carbon R2 0.93 (p < 0.01) at 400, 409, 441 and 907 nm band widths, and organic matter R2 0.98 (p < 0.01) at 300, 400, 441, 832 and 907 nm band widths. Use of the 400 to 1,000 nm electromagnetic range utilizing regression trees provided a powerful, rapid and inexpensive method for assessing nitrogen, carbon, carbonate and organic matter for upper soil horizons in a nondestructive method. PMID:23112620

  2. Support vector regression to predict porosity and permeability: Effect of sample size

    NASA Astrophysics Data System (ADS)

    Al-Anazi, A. F.; Gates, I. D.

    2012-02-01

    Porosity and permeability are key petrophysical parameters obtained from laboratory core analysis. Cores, obtained from drilled wells, are often few in number for most oil and gas fields. Porosity and permeability correlations based on conventional techniques such as linear regression or neural networks trained with core and geophysical logs suffer poor generalization to wells with only geophysical logs. The generalization problem of correlation models often becomes pronounced when the training sample size is small. This is attributed to the underlying assumption that conventional techniques employing the empirical risk minimization (ERM) inductive principle converge asymptotically to the true risk values as the number of samples increases. In small sample size estimation problems, the available training samples must span the complexity of the parameter space so that the model is able both to match the available training samples reasonably well and to generalize to new data. This is achieved using the structural risk minimization (SRM) inductive principle by matching the capability of the model to the available training data. One method that uses SRM is support vector regression (SVR) network. In this research, the capability of SVR to predict porosity and permeability in a heterogeneous sandstone reservoir under the effect of small sample size is evaluated. Particularly, the impact of Vapnik's ɛ-insensitivity loss function and least-modulus loss function on generalization performance was empirically investigated. The results are compared to the multilayer perception (MLP) neural network, a widely used regression method, which operates under the ERM principle. The mean square error and correlation coefficients were used to measure the quality of predictions. The results demonstrate that SVR yields consistently better predictions of the porosity and permeability with small sample size than the MLP method. Also, the performance of SVR depends on both kernel function type and loss functions used.

  3. A new improved graphical and quantitative method for detecting bias in meta-analysis.

    PubMed

    Furuya-Kanamori, Luis; Barendregt, Jan J; Doi, Suhail A R

    2018-04-04

    Detection of publication and related biases remains suboptimal and threatens the validity and interpretation of meta-analytical findings. When bias is present, it usually differentially affects small and large studies manifesting as an association between precision and effect size and therefore visual asymmetry of conventional funnel plots. This asymmetry can be quantified and Egger's regression is, by far, the most widely used statistical measure for quantifying funnel plot asymmetry. However, concerns have been raised about both the visual appearance of funnel plots and the sensitivity of Egger's regression to detect such asymmetry, particularly when the number of studies is small. In this article, we propose a new graphical method, the Doi plot, to visualize asymmetry and also a new measure, the LFK index, to detect and quantify asymmetry of study effects in Doi plots. We demonstrate that the visual representation of asymmetry was better for the Doi plot when compared with the funnel plot. We also show that the diagnostic accuracy of the LFK index in discriminating between asymmetry due to simulated publication bias versus chance or no asymmetry was also better with the LFK index which had areas under the receiver operating characteristic curve of 0.74-0.88 with simulations of meta-analyses with five, 10, 15, and 20 studies. The Egger's regression result had lower areas under the receiver operating characteristic curve values of 0.58-0.75 across the same simulations. The LFK index also had a higher sensitivity (71.3-72.1%) than the Egger's regression result (18.5-43.0%). We conclude that the methods proposed in this article can markedly improve the ability of researchers to detect bias in meta-analysis.

  4. Multidisciplinary Design Optimization for Aeropropulsion Engines and Solid Modeling/Animation via the Integrated Forced Methods

    NASA Technical Reports Server (NTRS)

    2004-01-01

    The grant closure report is organized in the following four chapters: Chapter describes the two research areas Design optimization and Solid mechanics. Ten journal publications are listed in the second chapter. Five highlights is the subject matter of chapter three. CHAPTER 1. The Design Optimization Test Bed CometBoards. CHAPTER 2. Solid Mechanics: Integrated Force Method of Analysis. CHAPTER 3. Five Highlights: Neural Network and Regression Methods Demonstrated in the Design Optimization of a Subsonic Aircraft. Neural Network and Regression Soft Model Extended for PX-300 Aircraft Engine. Engine with Regression and Neural Network Approximators Designed. Cascade Optimization Strategy with Neural network and Regression Approximations Demonstrated on a Preliminary Aircraft Engine Design. Neural Network and Regression Approximations Used in Aircraft Design.

  5. Effects of land cover, topography, and built structure on seasonal water quality at multiple spatial scales.

    PubMed

    Pratt, Bethany; Chang, Heejun

    2012-03-30

    The relationship among land cover, topography, built structure and stream water quality in the Portland Metro region of Oregon and Clark County, Washington areas, USA, is analyzed using ordinary least squares (OLS) and geographically weighted (GWR) multiple regression models. Two scales of analysis, a sectional watershed and a buffer, offered a local and a global investigation of the sources of stream pollutants. Model accuracy, measured by R(2) values, fluctuated according to the scale, season, and regression method used. While most wet season water quality parameters are associated with urban land covers, most dry season water quality parameters are related topographic features such as elevation and slope. GWR models, which take into consideration local relations of spatial autocorrelation, had stronger results than OLS regression models. In the multiple regression models, sectioned watershed results were consistently better than the sectioned buffer results, except for dry season pH and stream temperature parameters. This suggests that while riparian land cover does have an effect on water quality, a wider contributing area needs to be included in order to account for distant sources of pollutants. Copyright © 2012 Elsevier B.V. All rights reserved.

  6. Logistic regression for dichotomized counts.

    PubMed

    Preisser, John S; Das, Kalyan; Benecha, Habtamu; Stamm, John W

    2016-12-01

    Sometimes there is interest in a dichotomized outcome indicating whether a count variable is positive or zero. Under this scenario, the application of ordinary logistic regression may result in efficiency loss, which is quantifiable under an assumed model for the counts. In such situations, a shared-parameter hurdle model is investigated for more efficient estimation of regression parameters relating to overall effects of covariates on the dichotomous outcome, while handling count data with many zeroes. One model part provides a logistic regression containing marginal log odds ratio effects of primary interest, while an ancillary model part describes the mean count of a Poisson or negative binomial process in terms of nuisance regression parameters. Asymptotic efficiency of the logistic model parameter estimators of the two-part models is evaluated with respect to ordinary logistic regression. Simulations are used to assess the properties of the models with respect to power and Type I error, the latter investigated under both misspecified and correctly specified models. The methods are applied to data from a randomized clinical trial of three toothpaste formulations to prevent incident dental caries in a large population of Scottish schoolchildren. © The Author(s) 2014.

  7. Combustion performance and scale effect from N2O/HTPB hybrid rocket motor simulations

    NASA Astrophysics Data System (ADS)

    Shan, Fanli; Hou, Lingyun; Piao, Ying

    2013-04-01

    HRM code for the simulation of N2O/HTPB hybrid rocket motor operation and scale effect analysis has been developed. This code can be used to calculate motor thrust and distributions of physical properties inside the combustion chamber and nozzle during the operational phase by solving the unsteady Navier-Stokes equations using a corrected compressible difference scheme and a two-step, five species combustion model. A dynamic fuel surface regression technique and a two-step calculation method together with the gas-solid coupling are applied in the calculation of fuel regression and the determination of combustion chamber wall profile as fuel regresses. Both the calculated motor thrust from start-up to shut-down mode and the combustion chamber wall profile after motor operation are in good agreements with experimental data. The fuel regression rate equation and the relation between fuel regression rate and axial distance have been derived. Analysis of results suggests improvements in combustion performance to the current hybrid rocket motor design and explains scale effects in the variation of fuel regression rate with combustion chamber diameter.

  8. Effects of Individual Development Accounts (IDAs) on Household Wealth and Saving Taste

    ERIC Educational Resources Information Center

    Huang, Jin

    2010-01-01

    This study examines effects of individual development accounts (IDAs) on household wealth of low-income participants. Methods: This study uses longitudinal survey data from the American Dream Demonstration (ADD) involving experimental design (treatment group = 537, control group = 566). Results: Results from quantile regression analysis indicate…

  9. A Study of the Effect of the Front-End Styling of Sport Utility Vehicles on Pedestrian Head Injuries

    PubMed Central

    Qin, Qin; Chen, Zheng; Bai, Zhonghao; Cao, Libo

    2018-01-01

    Background The number of sport utility vehicles (SUVs) on China market is continuously increasing. It is necessary to investigate the relationships between the front-end styling features of SUVs and head injuries at the styling design stage for improving the pedestrian protection performance and product development efficiency. Methods Styling feature parameters were extracted from the SUV side contour line. And simplified finite element models were established based on the 78 SUV side contour lines. Pedestrian headform impact simulations were performed and validated. The head injury criterion of 15 ms (HIC15) at four wrap-around distances was obtained. A multiple linear regression analysis method was employed to describe the relationships between the styling feature parameters and the HIC15 at each impact point. Results The relationship between the selected styling features and the HIC15 showed reasonable correlations, and the regression models and the selected independent variables showed statistical significance. Conclusions The regression equations obtained by multiple linear regression can be used to assess the performance of SUV styling in protecting pedestrians' heads and provide styling designers with technical guidance regarding their artistic creations.

  10. A locally adaptive kernel regression method for facies delineation

    NASA Astrophysics Data System (ADS)

    Fernàndez-Garcia, D.; Barahona-Palomo, M.; Henri, C. V.; Sanchez-Vila, X.

    2015-12-01

    Facies delineation is defined as the separation of geological units with distinct intrinsic characteristics (grain size, hydraulic conductivity, mineralogical composition). A major challenge in this area stems from the fact that only a few scattered pieces of hydrogeological information are available to delineate geological facies. Several methods to delineate facies are available in the literature, ranging from those based only on existing hard data, to those including secondary data or external knowledge about sedimentological patterns. This paper describes a methodology to use kernel regression methods as an effective tool for facies delineation. The method uses both the spatial and the actual sampled values to produce, for each individual hard data point, a locally adaptive steering kernel function, self-adjusting the principal directions of the local anisotropic kernels to the direction of highest local spatial correlation. The method is shown to outperform the nearest neighbor classification method in a number of synthetic aquifers whenever the available number of hard data is small and randomly distributed in space. In the case of exhaustive sampling, the steering kernel regression method converges to the true solution. Simulations ran in a suite of synthetic examples are used to explore the selection of kernel parameters in typical field settings. It is shown that, in practice, a rule of thumb can be used to obtain suboptimal results. The performance of the method is demonstrated to significantly improve when external information regarding facies proportions is incorporated. Remarkably, the method allows for a reasonable reconstruction of the facies connectivity patterns, shown in terms of breakthrough curves performance.

  11. Construction of mathematical model for measuring material concentration by colorimetric method

    NASA Astrophysics Data System (ADS)

    Liu, Bing; Gao, Lingceng; Yu, Kairong; Tan, Xianghua

    2018-06-01

    This paper use the method of multiple linear regression to discuss the data of C problem of mathematical modeling in 2017. First, we have established a regression model for the concentration of 5 substances. But only the regression model of the substance concentration of urea in milk can pass through the significance test. The regression model established by the second sets of data can pass the significance test. But this model exists serious multicollinearity. We have improved the model by principal component analysis. The improved model is used to control the system so that it is possible to measure the concentration of material by direct colorimetric method.

  12. Predicting Quantitative Traits With Regression Models for Dense Molecular Markers and Pedigree

    PubMed Central

    de los Campos, Gustavo; Naya, Hugo; Gianola, Daniel; Crossa, José; Legarra, Andrés; Manfredi, Eduardo; Weigel, Kent; Cotes, José Miguel

    2009-01-01

    The availability of genomewide dense markers brings opportunities and challenges to breeding programs. An important question concerns the ways in which dense markers and pedigrees, together with phenotypic records, should be used to arrive at predictions of genetic values for complex traits. If a large number of markers are included in a regression model, marker-specific shrinkage of regression coefficients may be needed. For this reason, the Bayesian least absolute shrinkage and selection operator (LASSO) (BL) appears to be an interesting approach for fitting marker effects in a regression model. This article adapts the BL to arrive at a regression model where markers, pedigrees, and covariates other than markers are considered jointly. Connections between BL and other marker-based regression models are discussed, and the sensitivity of BL with respect to the choice of prior distributions assigned to key parameters is evaluated using simulation. The proposed model was fitted to two data sets from wheat and mouse populations, and evaluated using cross-validation methods. Results indicate that inclusion of markers in the regression further improved the predictive ability of models. An R program that implements the proposed model is freely available. PMID:19293140

  13. Application of software technology to automatic test data analysis

    NASA Technical Reports Server (NTRS)

    Stagner, J. R.

    1991-01-01

    The verification process for a major software subsystem was partially automated as part of a feasibility demonstration. The methods employed are generally useful and applicable to other types of subsystems. The effort resulted in substantial savings in test engineer analysis time and offers a method for inclusion of automatic verification as a part of regression testing.

  14. Parent Support for Breakfast after the Bell Programs in a State with Very Low Breakfast Participation

    ERIC Educational Resources Information Center

    Haines, Amanda; Spruance, Lori Andersen

    2018-01-01

    Purpose/Objectives: The study aim was to evaluate parent support for breakfast after the bell programs (BABPs). Methods: Data were collected through an online survey from parents (n=488) of school-aged children enrolled in public schools in Utah. Data were analyzed using generalized estimating equation (GEE) regression methods. Results: Parents…

  15. Use of a New Set of Linguistic Features to Improve Automatic Assessment of Text Readability

    ERIC Educational Resources Information Center

    Yoshimi, Takehiko; Kotani, Katsunori; Isahara, Hitoshi

    2012-01-01

    The present paper proposes and evaluates a readability assessment method designed for Japanese learners of EFL (English as a foreign language). The proposed readability assessment method is constructed by a regression algorithm using a new set of linguistic features that were employed separately in previous studies. The results showed that the…

  16. Occurrence of Conotruncal Heart Birth Defects in Texas: A Comparison of Urban/Rural Classifications

    ERIC Educational Resources Information Center

    Langlois, Peter H.; Jandle, Leigh; Scheuerle, Angela; Horel, Scott A.; Carozza, Susan E.

    2010-01-01

    Purpose: (1) Determine if there is an association between 3 conotruncal heart birth defects and urban/rural residence of mother. (2) Compare results using different methods of measuring urban/rural status. Methods: Data were taken from the Texas Birth Defects Registry, 1999-2003. Poisson regression was used to compare crude and adjusted birth…

  17. Probabilistic composite micromechanics

    NASA Technical Reports Server (NTRS)

    Stock, T. A.; Bellini, P. X.; Murthy, P. L. N.; Chamis, C. C.

    1988-01-01

    Probabilistic composite micromechanics methods are developed that simulate expected uncertainties in unidirectional fiber composite properties. These methods are in the form of computational procedures using Monte Carlo simulation. A graphite/epoxy unidirectional composite (ply) is studied to demonstrate fiber composite material properties at the micro level. Regression results are presented to show the relative correlation between predicted and response variables in the study.

  18. A Robust Shape Reconstruction Method for Facial Feature Point Detection.

    PubMed

    Tan, Shuqiu; Chen, Dongyi; Guo, Chenggang; Huang, Zhiqi

    2017-01-01

    Facial feature point detection has been receiving great research advances in recent years. Numerous methods have been developed and applied in practical face analysis systems. However, it is still a quite challenging task because of the large variability in expression and gestures and the existence of occlusions in real-world photo shoot. In this paper, we present a robust sparse reconstruction method for the face alignment problems. Instead of a direct regression between the feature space and the shape space, the concept of shape increment reconstruction is introduced. Moreover, a set of coupled overcomplete dictionaries termed the shape increment dictionary and the local appearance dictionary are learned in a regressive manner to select robust features and fit shape increments. Additionally, to make the learned model more generalized, we select the best matched parameter set through extensive validation tests. Experimental results on three public datasets demonstrate that the proposed method achieves a better robustness over the state-of-the-art methods.

  19. Porosity estimation by semi-supervised learning with sparsely available labeled samples

    NASA Astrophysics Data System (ADS)

    Lima, Luiz Alberto; Görnitz, Nico; Varella, Luiz Eduardo; Vellasco, Marley; Müller, Klaus-Robert; Nakajima, Shinichi

    2017-09-01

    This paper addresses the porosity estimation problem from seismic impedance volumes and porosity samples located in a small group of exploratory wells. Regression methods, trained on the impedance as inputs and the porosity as output labels, generally suffer from extremely expensive (and hence sparsely available) porosity samples. To optimally make use of the valuable porosity data, a semi-supervised machine learning method was proposed, Transductive Conditional Random Field Regression (TCRFR), showing good performance (Görnitz et al., 2017). TCRFR, however, still requires more labeled data than those usually available, which creates a gap when applying the method to the porosity estimation problem in realistic situations. In this paper, we aim to fill this gap by introducing two graph-based preprocessing techniques, which adapt the original TCRFR for extremely weakly supervised scenarios. Our new method outperforms the previous automatic estimation methods on synthetic data and provides a comparable result to the manual labored, time-consuming geostatistics approach on real data, proving its potential as a practical industrial tool.

  20. Deep learning architecture for air quality predictions.

    PubMed

    Li, Xiang; Peng, Ling; Hu, Yuan; Shao, Jing; Chi, Tianhe

    2016-11-01

    With the rapid development of urbanization and industrialization, many developing countries are suffering from heavy air pollution. Governments and citizens have expressed increasing concern regarding air pollution because it affects human health and sustainable development worldwide. Current air quality prediction methods mainly use shallow models; however, these methods produce unsatisfactory results, which inspired us to investigate methods of predicting air quality based on deep architecture models. In this paper, a novel spatiotemporal deep learning (STDL)-based air quality prediction method that inherently considers spatial and temporal correlations is proposed. A stacked autoencoder (SAE) model is used to extract inherent air quality features, and it is trained in a greedy layer-wise manner. Compared with traditional time series prediction models, our model can predict the air quality of all stations simultaneously and shows the temporal stability in all seasons. Moreover, a comparison with the spatiotemporal artificial neural network (STANN), auto regression moving average (ARMA), and support vector regression (SVR) models demonstrates that the proposed method of performing air quality predictions has a superior performance.

  1. Using Simulation Technique to overcome the multi-collinearity problem for estimating fuzzy linear regression parameters.

    NASA Astrophysics Data System (ADS)

    Mansoor Gorgees, Hazim; Hilal, Mariam Mohammed

    2018-05-01

    Fatigue cracking is one of the common types of pavement distresses and is an indicator of structural failure; cracks allow moisture infiltration, roughness, may further deteriorate to a pothole. Some causes of pavement deterioration are: traffic loading; environment influences; drainage deficiencies; materials quality problems; construction deficiencies and external contributors. Many researchers have made models that contain many variables like asphalt content, asphalt viscosity, fatigue life, stiffness of asphalt mixture, temperature and other parameters that affect the fatigue life. For this situation, a fuzzy linear regression model was employed and analyzed by using the traditional methods and our proposed method in order to overcome the multi-collinearity problem. The total spread error was used as a criterion to compare the performance of the studied methods. Simulation program was used to obtain the required results.

  2. Two-dimensional advective transport in ground-water flow parameter estimation

    USGS Publications Warehouse

    Anderman, E.R.; Hill, M.C.; Poeter, E.P.

    1996-01-01

    Nonlinear regression is useful in ground-water flow parameter estimation, but problems of parameter insensitivity and correlation often exist given commonly available hydraulic-head and head-dependent flow (for example, stream and lake gain or loss) observations. To address this problem, advective-transport observations are added to the ground-water flow, parameter-estimation model MODFLOWP using particle-tracking methods. The resulting model is used to investigate the importance of advective-transport observations relative to head-dependent flow observations when either or both are used in conjunction with hydraulic-head observations in a simulation of the sewage-discharge plume at Otis Air Force Base, Cape Cod, Massachusetts, USA. The analysis procedure for evaluating the probable effect of new observations on the regression results consists of two steps: (1) parameter sensitivities and correlations calculated at initial parameter values are used to assess the model parameterization and expected relative contributions of different types of observations to the regression; and (2) optimal parameter values are estimated by nonlinear regression and evaluated. In the Cape Cod parameter-estimation model, advective-transport observations did not significantly increase the overall parameter sensitivity; however: (1) inclusion of advective-transport observations decreased parameter correlation enough for more unique parameter values to be estimated by the regression; (2) realistic uncertainties in advective-transport observations had a small effect on parameter estimates relative to the precision with which the parameters were estimated; and (3) the regression results and sensitivity analysis provided insight into the dynamics of the ground-water flow system, especially the importance of accurate boundary conditions. In this work, advective-transport observations improved the calibration of the model and the estimation of ground-water flow parameters, and use of regression and related techniques produced significant insight into the physical system.

  3. Biodiversity patterns along ecological gradients: unifying β-diversity indices.

    PubMed

    Szava-Kovats, Robert C; Pärtel, Meelis

    2014-01-01

    Ecologists have developed an abundance of conceptions and mathematical expressions to define β-diversity, the link between local (α) and regional-scale (γ) richness, in order to characterize patterns of biodiversity along ecological (i.e., spatial and environmental) gradients. These patterns are often realized by regression of β-diversity indices against one or more ecological gradients. This practice, however, is subject to two shortcomings that can undermine the validity of the biodiversity patterns. First, many β-diversity indices are constrained to range between fixed lower and upper limits. As such, regression analysis of β-diversity indices against ecological gradients can result in regression curves that extend beyond these mathematical constraints, thus creating an interpretational dilemma. Second, despite being a function of the same measured α- and γ-diversity, the resultant biodiversity pattern depends on the choice of β-diversity index. We propose a simple logistic transformation that rids beta-diversity indices of their mathematical constraints, thus eliminating the possibility of an uninterpretable regression curve. Moreover, this transformation results in identical biodiversity patterns for three commonly used classical beta-diversity indices. As a result, this transformation eliminates the difficulties of both shortcomings, while allowing the researcher to use whichever beta-diversity index deemed most appropriate. We believe this method can help unify the study of biodiversity patterns along ecological gradients.

  4. Biodiversity Patterns along Ecological Gradients: Unifying β-Diversity Indices

    PubMed Central

    Szava-Kovats, Robert C.; Pärtel, Meelis

    2014-01-01

    Ecologists have developed an abundance of conceptions and mathematical expressions to define β-diversity, the link between local (α) and regional-scale (γ) richness, in order to characterize patterns of biodiversity along ecological (i.e., spatial and environmental) gradients. These patterns are often realized by regression of β-diversity indices against one or more ecological gradients. This practice, however, is subject to two shortcomings that can undermine the validity of the biodiversity patterns. First, many β-diversity indices are constrained to range between fixed lower and upper limits. As such, regression analysis of β-diversity indices against ecological gradients can result in regression curves that extend beyond these mathematical constraints, thus creating an interpretational dilemma. Second, despite being a function of the same measured α- and γ-diversity, the resultant biodiversity pattern depends on the choice of β-diversity index. We propose a simple logistic transformation that rids beta-diversity indices of their mathematical constraints, thus eliminating the possibility of an uninterpretable regression curve. Moreover, this transformation results in identical biodiversity patterns for three commonly used classical beta-diversity indices. As a result, this transformation eliminates the difficulties of both shortcomings, while allowing the researcher to use whichever beta-diversity index deemed most appropriate. We believe this method can help unify the study of biodiversity patterns along ecological gradients. PMID:25330181

  5. Multilayer Perceptron for Robust Nonlinear Interval Regression Analysis Using Genetic Algorithms

    PubMed Central

    2014-01-01

    On the basis of fuzzy regression, computational models in intelligence such as neural networks have the capability to be applied to nonlinear interval regression analysis for dealing with uncertain and imprecise data. When training data are not contaminated by outliers, computational models perform well by including almost all given training data in the data interval. Nevertheless, since training data are often corrupted by outliers, robust learning algorithms employed to resist outliers for interval regression analysis have been an interesting area of research. Several approaches involving computational intelligence are effective for resisting outliers, but the required parameters for these approaches are related to whether the collected data contain outliers or not. Since it seems difficult to prespecify the degree of contamination beforehand, this paper uses multilayer perceptron to construct the robust nonlinear interval regression model using the genetic algorithm. Outliers beyond or beneath the data interval will impose slight effect on the determination of data interval. Simulation results demonstrate that the proposed method performs well for contaminated datasets. PMID:25110755

  6. Multilayer perceptron for robust nonlinear interval regression analysis using genetic algorithms.

    PubMed

    Hu, Yi-Chung

    2014-01-01

    On the basis of fuzzy regression, computational models in intelligence such as neural networks have the capability to be applied to nonlinear interval regression analysis for dealing with uncertain and imprecise data. When training data are not contaminated by outliers, computational models perform well by including almost all given training data in the data interval. Nevertheless, since training data are often corrupted by outliers, robust learning algorithms employed to resist outliers for interval regression analysis have been an interesting area of research. Several approaches involving computational intelligence are effective for resisting outliers, but the required parameters for these approaches are related to whether the collected data contain outliers or not. Since it seems difficult to prespecify the degree of contamination beforehand, this paper uses multilayer perceptron to construct the robust nonlinear interval regression model using the genetic algorithm. Outliers beyond or beneath the data interval will impose slight effect on the determination of data interval. Simulation results demonstrate that the proposed method performs well for contaminated datasets.

  7. Comparison of regression coefficient and GIS-based methodologies for regional estimates of forest soil carbon stocks.

    PubMed

    Campbell, J Elliott; Moen, Jeremie C; Ney, Richard A; Schnoor, Jerald L

    2008-03-01

    Estimates of forest soil organic carbon (SOC) have applications in carbon science, soil quality studies, carbon sequestration technologies, and carbon trading. Forest SOC has been modeled using a regression coefficient methodology that applies mean SOC densities (mass/area) to broad forest regions. A higher resolution model is based on an approach that employs a geographic information system (GIS) with soil databases and satellite-derived landcover images. Despite this advancement, the regression approach remains the basis of current state and federal level greenhouse gas inventories. Both approaches are analyzed in detail for Wisconsin forest soils from 1983 to 2001, applying rigorous error-fixing algorithms to soil databases. Resulting SOC stock estimates are 20% larger when determined using the GIS method rather than the regression approach. Average annual rates of increase in SOC stocks are 3.6 and 1.0 million metric tons of carbon per year for the GIS and regression approaches respectively.

  8. Random Survival Forest in practice: a method for modelling complex metabolomics data in time to event analysis.

    PubMed

    Dietrich, Stefan; Floegel, Anna; Troll, Martina; Kühn, Tilman; Rathmann, Wolfgang; Peters, Anette; Sookthai, Disorn; von Bergen, Martin; Kaaks, Rudolf; Adamski, Jerzy; Prehn, Cornelia; Boeing, Heiner; Schulze, Matthias B; Illig, Thomas; Pischon, Tobias; Knüppel, Sven; Wang-Sattler, Rui; Drogan, Dagmar

    2016-10-01

    The application of metabolomics in prospective cohort studies is statistically challenging. Given the importance of appropriate statistical methods for selection of disease-associated metabolites in highly correlated complex data, we combined random survival forest (RSF) with an automated backward elimination procedure that addresses such issues. Our RSF approach was illustrated with data from the European Prospective Investigation into Cancer and Nutrition (EPIC)-Potsdam study, with concentrations of 127 serum metabolites as exposure variables and time to development of type 2 diabetes mellitus (T2D) as outcome variable. Out of this data set, Cox regression with a stepwise selection method was recently published. Replication of methodical comparison (RSF and Cox regression) was conducted in two independent cohorts. Finally, the R-code for implementing the metabolite selection procedure into the RSF-syntax is provided. The application of the RSF approach in EPIC-Potsdam resulted in the identification of 16 incident T2D-associated metabolites which slightly improved prediction of T2D when used in addition to traditional T2D risk factors and also when used together with classical biomarkers. The identified metabolites partly agreed with previous findings using Cox regression, though RSF selected a higher number of highly correlated metabolites. The RSF method appeared to be a promising approach for identification of disease-associated variables in complex data with time to event as outcome. The demonstrated RSF approach provides comparable findings as the generally used Cox regression, but also addresses the problem of multicollinearity and is suitable for high-dimensional data. © The Author 2016; all rights reserved. Published by Oxford University Press on behalf of the International Epidemiological Association.

  9. Logistic regression for circular data

    NASA Astrophysics Data System (ADS)

    Al-Daffaie, Kadhem; Khan, Shahjahan

    2017-05-01

    This paper considers the relationship between a binary response and a circular predictor. It develops the logistic regression model by employing the linear-circular regression approach. The maximum likelihood method is used to estimate the parameters. The Newton-Raphson numerical method is used to find the estimated values of the parameters. A data set from weather records of Toowoomba city is analysed by the proposed methods. Moreover, a simulation study is considered. The R software is used for all computations and simulations.

  10. [Evaluation of estimation of prevalence ratio using bayesian log-binomial regression model].

    PubMed

    Gao, W L; Lin, H; Liu, X N; Ren, X W; Li, J S; Shen, X P; Zhu, S L

    2017-03-10

    To evaluate the estimation of prevalence ratio ( PR ) by using bayesian log-binomial regression model and its application, we estimated the PR of medical care-seeking prevalence to caregivers' recognition of risk signs of diarrhea in their infants by using bayesian log-binomial regression model in Openbugs software. The results showed that caregivers' recognition of infant' s risk signs of diarrhea was associated significantly with a 13% increase of medical care-seeking. Meanwhile, we compared the differences in PR 's point estimation and its interval estimation of medical care-seeking prevalence to caregivers' recognition of risk signs of diarrhea and convergence of three models (model 1: not adjusting for the covariates; model 2: adjusting for duration of caregivers' education, model 3: adjusting for distance between village and township and child month-age based on model 2) between bayesian log-binomial regression model and conventional log-binomial regression model. The results showed that all three bayesian log-binomial regression models were convergence and the estimated PRs were 1.130(95 %CI : 1.005-1.265), 1.128(95 %CI : 1.001-1.264) and 1.132(95 %CI : 1.004-1.267), respectively. Conventional log-binomial regression model 1 and model 2 were convergence and their PRs were 1.130(95 % CI : 1.055-1.206) and 1.126(95 % CI : 1.051-1.203), respectively, but the model 3 was misconvergence, so COPY method was used to estimate PR , which was 1.125 (95 %CI : 1.051-1.200). In addition, the point estimation and interval estimation of PRs from three bayesian log-binomial regression models differed slightly from those of PRs from conventional log-binomial regression model, but they had a good consistency in estimating PR . Therefore, bayesian log-binomial regression model can effectively estimate PR with less misconvergence and have more advantages in application compared with conventional log-binomial regression model.

  11. An improved strategy for regression of biophysical variables and Landsat ETM+ data.

    Treesearch

    Warren B. Cohen; Thomas K. Maiersperger; Stith T. Gower; David P. Turner

    2003-01-01

    Empirical models are important tools for relating field-measured biophysical variables to remote sensing data. Regression analysis has been a popular empirical method of linking these two types of data to provide continuous estimates for variables such as biomass, percent woody canopy cover, and leaf area index (LAI). Traditional methods of regression are not...

  12. Using Robust Standard Errors to Combine Multiple Regression Estimates with Meta-Analysis

    ERIC Educational Resources Information Center

    Williams, Ryan T.

    2012-01-01

    Combining multiple regression estimates with meta-analysis has continued to be a difficult task. A variety of methods have been proposed and used to combine multiple regression slope estimates with meta-analysis, however, most of these methods have serious methodological and practical limitations. The purpose of this study was to explore the use…

  13. The Necessity-Concerns-Framework: A Multidimensional Theory Benefits from Multidimensional Analysis

    PubMed Central

    Phillips, L. Alison; Diefenbach, Michael; Kronish, Ian M.; Negron, Rennie M.; Horowitz, Carol R.

    2014-01-01

    Background Patients’ medication-related concerns and necessity-beliefs predict adherence. Evaluation of the potentially complex interplay of these two dimensions has been limited because of methods that reduce them to a single dimension (difference scores). Purpose We use polynomial regression to assess the multidimensional effect of stroke-event survivors’ medication-related concerns and necessity-beliefs on their adherence to stroke-prevention medication. Methods Survivors (n=600) rated their concerns, necessity-beliefs, and adherence to medication. Confirmatory and exploratory polynomial regression determined the best-fitting multidimensional model. Results As posited by the Necessity-Concerns Framework (NCF), the greatest and lowest adherence was reported by those with strong necessity-beliefs/weak concerns and strong concerns/weak necessity-beliefs, respectively. However, as could not be assessed using a difference-score model, patients with ambivalent beliefs were less adherent than those exhibiting indifference. Conclusions Polynomial regression allows for assessment of the multidimensional nature of the NCF. Clinicians/Researchers should be aware that concerns and necessity dimensions are not polar opposites. PMID:24500078

  14. Durbin-Watson partial least-squares regression applied to MIR data on adulteration with edible oils of different origins.

    PubMed

    Jović, Ozren

    2016-12-15

    A novel method for quantitative prediction and variable-selection on spectroscopic data, called Durbin-Watson partial least-squares regression (dwPLS), is proposed in this paper. The idea is to inspect serial correlation in infrared data that is known to consist of highly correlated neighbouring variables. The method selects only those variables whose intervals have a lower Durbin-Watson statistic (dw) than a certain optimal cutoff. For each interval, dw is calculated on a vector of regression coefficients. Adulteration of cold-pressed linseed oil (L), a well-known nutrient beneficial to health, is studied in this work by its being mixed with cheaper oils: rapeseed oil (R), sesame oil (Se) and sunflower oil (Su). The samples for each botanical origin of oil vary with respect to producer, content and geographic origin. The results obtained indicate that MIR-ATR, combined with dwPLS could be implemented to quantitative determination of edible-oil adulteration. Copyright © 2016 Elsevier Ltd. All rights reserved.

  15. Performance characterization of complex fuel port geometries for hybrid rocket fuel grains

    NASA Astrophysics Data System (ADS)

    Bath, Andrew

    This research investigated the 3D printing and burning of fuel grains with complex geometry and the development of software capable of modeling and predicting the regression of a cross-section of these complex fuel grains. The software developed did predict the geometry to a fair degree of accuracy, especially when enhanced corner rounding was turned on. The model does have some drawbacks, notably being relatively slow, and does not perfectly predict the regression. If corner rounding is turned off, however, the model does become much faster; although less accurate, this method does still predict a relatively accurate resulting burn geometry, and is fast enough to be used for performance-tuning or genetic algorithms. In addition to the modeling method, preliminary investigations into the burning behavior of fuel grains with a helical flow path were performed. The helix fuel grains have a regression rate of nearly 3 times that of any other fuel grain geometry, primarily due to the enhancement of the friction coefficient between the flow and flow path.

  16. Predictions of biochar production and torrefaction performance from sugarcane bagasse using interpolation and regression analysis.

    PubMed

    Chen, Wei-Hsin; Hsu, Hung-Jen; Kumar, Gopalakrishnan; Budzianowski, Wojciech M; Ong, Hwai Chyuan

    2017-12-01

    This study focuses on the biochar formation and torrefaction performance of sugarcane bagasse, and they are predicted using the bilinear interpolation (BLI), inverse distance weighting (IDW) interpolation, and regression analysis. It is found that the biomass torrefied at 275°C for 60min or at 300°C for 30min or longer is appropriate to produce biochar as alternative fuel to coal with low carbon footprint, but the energy yield from the torrefaction at 300°C is too low. From the biochar yield, enhancement factor of HHV, and energy yield, the results suggest that the three methods are all feasible for predicting the performance, especially for the enhancement factor. The power parameter of unity in the IDW method provides the best predictions and the error is below 5%. The second order in regression analysis gives a more reasonable approach than the first order, and is recommended for the predictions. Copyright © 2017 Elsevier Ltd. All rights reserved.

  17. Cascade Optimization Strategy with Neural Network and Regression Approximations Demonstrated on a Preliminary Aircraft Engine Design

    NASA Technical Reports Server (NTRS)

    Hopkins, Dale A.; Patnaik, Surya N.

    2000-01-01

    A preliminary aircraft engine design methodology is being developed that utilizes a cascade optimization strategy together with neural network and regression approximation methods. The cascade strategy employs different optimization algorithms in a specified sequence. The neural network and regression methods are used to approximate solutions obtained from the NASA Engine Performance Program (NEPP), which implements engine thermodynamic cycle and performance analysis models. The new methodology is proving to be more robust and computationally efficient than the conventional optimization approach of using a single optimization algorithm with direct reanalysis. The methodology has been demonstrated on a preliminary design problem for a novel subsonic turbofan engine concept that incorporates a wave rotor as a cycle-topping device. Computations of maximum thrust were obtained for a specific design point in the engine mission profile. The results (depicted in the figure) show a significant improvement in the maximum thrust obtained using the new methodology in comparison to benchmark solutions obtained using NEPP in a manual design mode.

  18. Establishing a Mathematical Equations and Improving the Production of L-tert-Leucine by Uniform Design and Regression Analysis.

    PubMed

    Jiang, Wei; Xu, Chao-Zhen; Jiang, Si-Zhi; Zhang, Tang-Duo; Wang, Shi-Zhen; Fang, Bai-Shan

    2017-04-01

    L-tert-Leucine (L-Tle) and its derivatives are extensively used as crucial building blocks for chiral auxiliaries, pharmaceutically active ingredients, and ligands. Combining with formate dehydrogenase (FDH) for regenerating the expensive coenzyme NADH, leucine dehydrogenase (LeuDH) is continually used for synthesizing L-Tle from α-keto acid. A multilevel factorial experimental design was executed for research of this system. In this work, an efficient optimization method for improving the productivity of L-Tle was developed. And the mathematical model between different fermentation conditions and L-Tle yield was also determined in the form of the equation by using uniform design and regression analysis. The multivariate regression equation was conveniently implemented in water, with a space time yield of 505.9 g L -1  day -1 and an enantiomeric excess value of >99 %. These results demonstrated that this method might become an ideal protocol for industrial production of chiral compounds and unnatural amino acids such as chiral drug intermediates.

  19. Robust, Adaptive Functional Regression in Functional Mixed Model Framework.

    PubMed

    Zhu, Hongxiao; Brown, Philip J; Morris, Jeffrey S

    2011-09-01

    Functional data are increasingly encountered in scientific studies, and their high dimensionality and complexity lead to many analytical challenges. Various methods for functional data analysis have been developed, including functional response regression methods that involve regression of a functional response on univariate/multivariate predictors with nonparametrically represented functional coefficients. In existing methods, however, the functional regression can be sensitive to outlying curves and outlying regions of curves, so is not robust. In this paper, we introduce a new Bayesian method, robust functional mixed models (R-FMM), for performing robust functional regression within the general functional mixed model framework, which includes multiple continuous or categorical predictors and random effect functions accommodating potential between-function correlation induced by the experimental design. The underlying model involves a hierarchical scale mixture model for the fixed effects, random effect and residual error functions. These modeling assumptions across curves result in robust nonparametric estimators of the fixed and random effect functions which down-weight outlying curves and regions of curves, and produce statistics that can be used to flag global and local outliers. These assumptions also lead to distributions across wavelet coefficients that have outstanding sparsity and adaptive shrinkage properties, with great flexibility for the data to determine the sparsity and the heaviness of the tails. Together with the down-weighting of outliers, these within-curve properties lead to fixed and random effect function estimates that appear in our simulations to be remarkably adaptive in their ability to remove spurious features yet retain true features of the functions. We have developed general code to implement this fully Bayesian method that is automatic, requiring the user to only provide the functional data and design matrices. It is efficient enough to handle large data sets, and yields posterior samples of all model parameters that can be used to perform desired Bayesian estimation and inference. Although we present details for a specific implementation of the R-FMM using specific distributional choices in the hierarchical model, 1D functions, and wavelet transforms, the method can be applied more generally using other heavy-tailed distributions, higher dimensional functions (e.g. images), and using other invertible transformations as alternatives to wavelets.

  20. Robust, Adaptive Functional Regression in Functional Mixed Model Framework

    PubMed Central

    Zhu, Hongxiao; Brown, Philip J.; Morris, Jeffrey S.

    2012-01-01

    Functional data are increasingly encountered in scientific studies, and their high dimensionality and complexity lead to many analytical challenges. Various methods for functional data analysis have been developed, including functional response regression methods that involve regression of a functional response on univariate/multivariate predictors with nonparametrically represented functional coefficients. In existing methods, however, the functional regression can be sensitive to outlying curves and outlying regions of curves, so is not robust. In this paper, we introduce a new Bayesian method, robust functional mixed models (R-FMM), for performing robust functional regression within the general functional mixed model framework, which includes multiple continuous or categorical predictors and random effect functions accommodating potential between-function correlation induced by the experimental design. The underlying model involves a hierarchical scale mixture model for the fixed effects, random effect and residual error functions. These modeling assumptions across curves result in robust nonparametric estimators of the fixed and random effect functions which down-weight outlying curves and regions of curves, and produce statistics that can be used to flag global and local outliers. These assumptions also lead to distributions across wavelet coefficients that have outstanding sparsity and adaptive shrinkage properties, with great flexibility for the data to determine the sparsity and the heaviness of the tails. Together with the down-weighting of outliers, these within-curve properties lead to fixed and random effect function estimates that appear in our simulations to be remarkably adaptive in their ability to remove spurious features yet retain true features of the functions. We have developed general code to implement this fully Bayesian method that is automatic, requiring the user to only provide the functional data and design matrices. It is efficient enough to handle large data sets, and yields posterior samples of all model parameters that can be used to perform desired Bayesian estimation and inference. Although we present details for a specific implementation of the R-FMM using specific distributional choices in the hierarchical model, 1D functions, and wavelet transforms, the method can be applied more generally using other heavy-tailed distributions, higher dimensional functions (e.g. images), and using other invertible transformations as alternatives to wavelets. PMID:22308015

Top