Partial covariate adjusted regression
Şentürk, Damla; Nguyen, Danh V.
2008-01-01
Covariate adjusted regression (CAR) is a recently proposed adjustment method for regression analysis where both the response and predictors are not directly observed (Şentürk and Müller, 2005). The available data has been distorted by unknown functions of an observable confounding covariate. CAR provides consistent estimators for the coefficients of the regression between the variables of interest, adjusted for the confounder. We develop a broader class of partial covariate adjusted regression (PCAR) models to accommodate both distorted and undistorted (adjusted/unadjusted) predictors. The PCAR model allows for unadjusted predictors, such as age, gender and demographic variables, which are common in the analysis of biomedical and epidemiological data. The available estimation and inference procedures for CAR are shown to be invalid for the proposed PCAR model. We propose new estimators and develop new inference tools for the more general PCAR setting. In particular, we establish the asymptotic normality of the proposed estimators and propose consistent estimators of their asymptotic variances. Finite sample properties of the proposed estimators are investigated using simulation studies and the method is also illustrated with a Pima Indians diabetes data set. PMID:20126296
Multiple linear regression.
Eberly, Lynn E
2007-01-01
This chapter describes multiple linear regression, a statistical approach used to describe the simultaneous associations of several variables with one continuous outcome. Important steps in using this approach include estimation and inference, variable selection in model building, and assessing model fit. The special cases of regression with interactions among the variables, polynomial regression, regressions with categorical (grouping) variables, and separate slopes models are also covered. Examples in microbiology are used throughout. PMID:18450050
Fast Censored Linear Regression
HUANG, YIJIAN
2013-01-01
Weighted log-rank estimating function has become a standard estimation method for the censored linear regression model, or the accelerated failure time model. Well established statistically, the estimator defined as a consistent root has, however, rather poor computational properties because the estimating function is neither continuous nor, in general, monotone. We propose a computationally efficient estimator through an asymptotics-guided Newton algorithm, in which censored quantile regression methods are tailored to yield an initial consistent estimate and a consistent derivative estimate of the limiting estimating function. We also develop fast interval estimation with a new proposal for sandwich variance estimation. The proposed estimator is asymptotically equivalent to the consistent root estimator and barely distinguishable in samples of practical size. However, computation time is typically reduced by two to three orders of magnitude for point estimation alone. Illustrations with clinical applications are provided. PMID:24347802
Correlation and simple linear regression.
Eberly, Lynn E
2007-01-01
This chapter highlights important steps in using correlation and simple linear regression to address scientific questions about the association of two continuous variables with each other. These steps include estimation and inference, assessing model fit, the connection between regression and ANOVA, and study design. Examples in microbiology are used throughout. This chapter provides a framework that is helpful in understanding more complex statistical techniques, such as multiple linear regression, linear mixed effects models, logistic regression, and proportional hazards regression. PMID:18450049
Recursive Algorithm For Linear Regression
NASA Technical Reports Server (NTRS)
Varanasi, S. V.
1988-01-01
Order of model determined easily. Linear-regression algorithm includes recursive equations for coefficients of model of increased order. Algorithm eliminates duplicative calculations and facilitates search for minimum order of linear-regression model that fits set of data satisfactorily.
Multiple linear regression analysis
NASA Technical Reports Server (NTRS)
Edwards, T. R.
1980-01-01
Program rapidly selects best-suited set of coefficients. User supplies only vectors of independent and dependent data and specifies confidence level required. Program uses stepwise statistical procedure for relating minimal set of variables to set of observations; final regression contains only most statistically significant coefficients. Program is written in FORTRAN IV for batch execution and has been implemented on NOVA 1200.
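The stepwise idea the program implements can be sketched in a few lines of modern code. The greedy forward search and the stopping rule below are illustrative assumptions, not the FORTRAN program's exact procedure.

```python
import numpy as np

n, p = 100, 6
rng = np.random.default_rng(9)
X = rng.normal(size=(n, p))
# Only columns 1 and 4 actually drive the response in this toy example.
y = 3.0 * X[:, 1] - 2.0 * X[:, 4] + rng.normal(size=n)

def rss(cols):
    """Residual sum of squares of the OLS fit with intercept + given columns."""
    A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return r @ r

selected, remaining = [], list(range(p))
while remaining:
    best = min(remaining, key=lambda c: rss(selected + [c]))
    # Hypothetical stopping rule: keep a variable only if it buys a
    # non-trivial RSS reduction (a stand-in for a proper F-to-enter test).
    if rss(selected) - rss(selected + [best]) < 2.0 * rss(selected + [best]) / n:
        break
    selected.append(best)
    remaining.remove(best)
```

The strongly predictive columns are picked up first; the final regression keeps only variables that pass the entry rule.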
Practical Session: Simple Linear Regression
NASA Astrophysics Data System (ADS)
Clausel, M.; Grégoire, G.
2014-12-01
Two exercises are proposed to illustrate simple linear regression. The first one is based on Galton's famous data set on heredity. We use the lm R command and obtain coefficient estimates, the residual standard error, R2, and residuals. In the second example, devoted to data on the vapor tension of mercury, we fit a simple linear regression, predict values, and anticipate multiple linear regression. This practical session is an excerpt from practical exercises proposed by A. Dalalyan at ENPC (see Exercises 1 and 2 of http://certis.enpc.fr/~dalalyan/Download/TP_ENPC_4.pdf).
Linear regression in astronomy. II
NASA Technical Reports Server (NTRS)
Feigelson, Eric D.; Babu, Gutti J.
1992-01-01
A wide variety of least-squares linear regression procedures used in observational astronomy, particularly investigations of the cosmic distance scale, are presented and discussed. The classes of linear models considered are (1) unweighted regression lines, with bootstrap and jackknife resampling; (2) regression solutions when measurement error, in one or both variables, dominates the scatter; (3) methods to apply a calibration line to new data; (4) truncated regression models, which apply to flux-limited data sets; and (5) censored regression models, which apply when nondetections are present. For the calibration problem we develop two new procedures: a formula for the intercept offset between two parallel data sets, which propagates slope errors from one regression to the other; and a generalization of the Working-Hotelling confidence bands to nonstandard least-squares lines. They can provide improved error analysis for Faber-Jackson, Tully-Fisher, and similar cosmic distance scale relations.
Linear regression in astronomy. I
NASA Technical Reports Server (NTRS)
Isobe, Takashi; Feigelson, Eric D.; Akritas, Michael G.; Babu, Gutti Jogesh
1990-01-01
Five methods for obtaining linear regression fits to bivariate data with unknown or insignificant measurement errors are discussed: ordinary least-squares (OLS) regression of Y on X, OLS regression of X on Y, the bisector of the two OLS lines, orthogonal regression, and 'reduced major-axis' regression. These methods have been used by various researchers in observational astronomy, most importantly in cosmic distance scale applications. Formulas for calculating the slope and intercept coefficients and their uncertainties are given for all the methods, including a new general form of the OLS variance estimates. The accuracy of the formulas was confirmed using numerical simulations. The applicability of the procedures is discussed with respect to their mathematical properties, the nature of the astronomical data under consideration, and the scientific purpose of the regression. It is found that, for problems needing symmetrical treatment of the variables, the OLS bisector performs significantly better than orthogonal or reduced major-axis regression.
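For the symmetric case the abstract favours, the OLS bisector is cheap to compute from the two ordinary regressions. A minimal numpy sketch follows; the bisector slope formula is quoted from memory and should be checked against the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated bivariate data with scatter in the relation itself.
x = rng.normal(0.0, 1.0, 200)
y = 2.0 * x + rng.normal(0.0, 1.0, 200)

def ols_slope(u, v):
    """Slope of the OLS regression of v on u."""
    u = u - u.mean()
    v = v - v.mean()
    return (u * v).sum() / (u * u).sum()

b1 = ols_slope(x, y)            # OLS(Y|X)
b2 = 1.0 / ols_slope(y, x)      # OLS(X|Y), expressed as a slope in the (x, y) plane
# OLS bisector slope (assumed form of the Isobe et al. expression).
b3 = (b1 * b2 - 1.0 + np.sqrt((1.0 + b1**2) * (1.0 + b2**2))) / (b1 + b2)
a3 = y.mean() - b3 * x.mean()   # bisector passes through the centroid
```

Because sample correlation is below one, OLS(Y|X) and OLS(X|Y) disagree, and the bisector lands between them.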
Linearly Adjustable International Portfolios
Fonseca, R. J.; Kuhn, D.; Rustem, B.
2010-09-30
We present an approach to multi-stage international portfolio optimization based on the imposition of a linear structure on the recourse decisions. Multiperiod decision problems are traditionally formulated as stochastic programs. Scenario tree based solutions however can become intractable as the number of stages increases. By restricting the space of decision policies to linear rules, we obtain a conservative tractable approximation to the original problem. Local asset prices and foreign exchange rates are modelled separately, which allows for a direct measure of their impact on the final portfolio value.
Practical Session: Multiple Linear Regression
NASA Astrophysics Data System (ADS)
Clausel, M.; Grégoire, G.
2014-12-01
Three exercises are proposed to illustrate multiple linear regression. The first one investigates the influence of several factors on atmospheric pollution; it was proposed by D. Chessel and A.B. Dufour at Lyon 1 (see Sect. 6 of http://pbil.univ-lyon1.fr/R/pdf/tdr33.pdf) and is based on data from 20 U.S. cities. Exercise 2 is an introduction to model selection, whereas Exercise 3 provides a first example of analysis of variance. Exercises 2 and 3 were proposed by A. Dalalyan at ENPC (see Exercises 2 and 3 of http://certis.enpc.fr/~dalalyan/Download/TP_ENPC_5.pdf).
LRGS: Linear Regression by Gibbs Sampling
NASA Astrophysics Data System (ADS)
Mantz, Adam B.
2016-02-01
LRGS (Linear Regression by Gibbs Sampling) implements a Gibbs sampler to solve the problem of multivariate linear regression with uncertainties in all measured quantities and intrinsic scatter. LRGS extends an algorithm by Kelly (2007) that used Gibbs sampling for performing linear regression in fairly general cases in two ways: generalizing the procedure for multiple response variables, and modeling the prior distribution of covariates using a Dirichlet process.
Weather adjustment using seemingly unrelated regression
Noll, T.A.
1995-05-01
Seemingly unrelated regression (SUR) is a system estimation technique that accounts for time-contemporaneous correlation between individual equations within a system of equations. SUR is suited to weather adjustment estimations when the estimation is: (1) composed of a system of equations and (2) the system of equations represents either different weather stations, different sales sectors or a combination of different weather stations and different sales sectors. SUR utilizes the cross-equation error values to develop more accurate estimates of the system coefficients than are obtained using ordinary least-squares (OLS) estimation. SUR estimates can be generated using a variety of statistical software packages including MicroTSP and SAS.
Three-Dimensional Modeling in Linear Regression.
ERIC Educational Resources Information Center
Herman, James D.
Linear regression examines the relationship between one or more independent (predictor) variables and a dependent variable. By using a particular formula, regression determines the weights needed to minimize the error term for a given set of predictors. With one predictor variable, the relationship between the predictor and the dependent variable…
A Constrained Linear Estimator for Multiple Regression
ERIC Educational Resources Information Center
Davis-Stober, Clintin P.; Dana, Jason; Budescu, David V.
2010-01-01
"Improper linear models" (see Dawes, Am. Psychol. 34:571-582, "1979"), such as equal weighting, have garnered interest as alternatives to standard regression models. We analyze the general circumstances under which these models perform well by recasting a class of "improper" linear models as "proper" statistical models with a single predictor. We…
Multiple Linear Regression: A Realistic Reflector.
ERIC Educational Resources Information Center
Nutt, A. T.; Batsell, R. R.
Examples of the use of Multiple Linear Regression (MLR) techniques are presented. This is done to show how MLR aids data processing and decision-making by providing the decision-maker with freedom in phrasing questions and by accurately reflecting the data on hand. A brief overview of the rationale underlying MLR is given, some basic definitions…
Moving the Bar: Transformations in Linear Regression.
ERIC Educational Resources Information Center
Miranda, Janet
The assumption that is most important to the hypothesis testing procedure of multiple linear regression is the assumption that the residuals are normally distributed, but this assumption is not always tenable given the realities of some data sets. When normal distribution of the residuals is not met, an alternative method can be initiated. As an…
A tutorial on Bayesian Normal linear regression
NASA Astrophysics Data System (ADS)
Klauenberg, Katy; Wübbeler, Gerd; Mickan, Bodo; Harris, Peter; Elster, Clemens
2015-12-01
Regression is a common task in metrology and often applied to calibrate instruments, evaluate inter-laboratory comparisons or determine fundamental constants, for example. Yet, a regression model cannot be uniquely formulated as a measurement function, and consequently the Guide to the Expression of Uncertainty in Measurement (GUM) and its supplements are not applicable directly. Bayesian inference, however, is well suited to regression tasks, and has the advantage of accounting for additional a priori information, which typically robustifies analyses. Furthermore, it is anticipated that future revisions of the GUM shall also embrace the Bayesian view. Guidance on Bayesian inference for regression tasks is largely lacking in metrology. For linear regression models with Gaussian measurement errors this tutorial gives explicit guidance. Divided into three steps, the tutorial first illustrates how a priori knowledge, which is available from previous experiments, can be translated into prior distributions from a specific class. These prior distributions have the advantage of yielding analytical, closed form results, thus avoiding the need to apply numerical methods such as Markov Chain Monte Carlo. Secondly, formulas for the posterior results are given, explained and illustrated, and software implementations are provided. In the third step, Bayesian tools are used to assess the assumptions behind the suggested approach. These three steps (prior elicitation, posterior calculation, and robustness to prior uncertainty and model adequacy) are critical to Bayesian inference. The general guidance given here for Normal linear regression tasks is accompanied by a simple, but real-world, metrological example. The calibration of a flow device serves as a running example and illustrates the three steps. It is shown that prior knowledge from previous calibrations of the same sonic nozzle enables robust predictions even for extrapolations.
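The closed-form conjugate update the tutorial builds on can be sketched as follows, assuming a Gaussian prior on the coefficients and a known noise standard deviation (a simplification of the tutorial's prior class, for illustration only).

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.5                                        # assumed known noise sd
x = np.linspace(0.0, 10.0, 30)
X = np.column_stack([np.ones_like(x), x])          # design matrix [1, x]
beta_true = np.array([1.0, 0.3])
y = X @ beta_true + rng.normal(0.0, sigma, x.size)

prior_mean = np.zeros(2)
prior_cov = np.eye(2) * 10.0                       # weakly informative Gaussian prior

# Conjugate posterior of the coefficients: Gaussian with these moments.
post_prec = np.linalg.inv(prior_cov) + X.T @ X / sigma**2
post_cov = np.linalg.inv(post_prec)
post_mean = post_cov @ (np.linalg.inv(prior_cov) @ prior_mean + X.T @ y / sigma**2)
```

No Markov Chain Monte Carlo is needed: the posterior mean and covariance come out in closed form, which is exactly the advantage of the prior class the tutorial recommends.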
Mental chronometry with simple linear regression.
Chen, J Y
1997-10-01
Typically, mental chronometry is performed by introducing an independent variable postulated to affect selectively some stage of a presumed multistage process. However, the effect could be a global one that spreads proportionally over all stages of the process. Currently, there is no method to test this possibility, although simple linear regression might serve the purpose. In the present study, the regression approach was tested with tasks (memory scanning and mental rotation) that, according to the dominant theories, involve a selective effect, and with a task (word superiority effect) that involves a global effect. The results indicate that (1) manipulating the size of a memory set or the angular disparity affects the intercept of the regression function that relates the times for memory scanning with different set sizes or for mental rotation with different angular disparities, and (2) manipulating context affects the slope of the regression function that relates the times for detecting a target character under word and nonword conditions. These results ratify the regression approach as a useful method for mental chronometry. PMID:9347535
Estimation of adjusted rate differences using additive negative binomial regression.
Donoghoe, Mark W; Marschner, Ian C
2016-08-15
Rate differences are an important effect measure in biostatistics and provide an alternative perspective to rate ratios. When the data are event counts observed during an exposure period, adjusted rate differences may be estimated using an identity-link Poisson generalised linear model, also known as additive Poisson regression. A problem with this approach is that the assumption of equality of mean and variance rarely holds in real data, which often show overdispersion. An additive negative binomial model is the natural alternative to account for this; however, standard model-fitting methods are often unable to cope with the constrained parameter space arising from the non-negativity restrictions of the additive model. In this paper, we propose a novel solution to this problem using a variant of the expectation-conditional maximisation-either algorithm. Our method provides a reliable way to fit an additive negative binomial regression model and also permits flexible generalisations using semi-parametric regression functions. We illustrate the method using a placebo-controlled clinical trial of fenofibrate treatment in patients with type II diabetes, where the outcome is the number of laser therapy courses administered to treat diabetic retinopathy. An R package is available that implements the proposed method. Copyright © 2016 John Wiley & Sons, Ltd. PMID:27073156
Multiple linear regression for isotopic measurements
NASA Astrophysics Data System (ADS)
Garcia Alonso, J. I.
2012-04-01
There are two typical applications of isotopic measurements: the detection of natural variations in isotopic systems and the detection of man-made variations using enriched isotopes as indicators. Both types of measurement require accurate and precise isotope ratio measurements. For the so-called non-traditional stable isotopes, multicollector ICP-MS instruments are usually applied. In many cases, chemical separation procedures are required before accurate isotope measurements can be performed. The off-line separation of Rb and Sr or Nd and Sm is the classical procedure employed to eliminate isobaric interferences before multicollector ICP-MS measurement of Sr and Nd isotope ratios. This procedure also provides the matrix separation needed to obtain precise and accurate Sr and Nd isotope ratios. In our laboratory we have evaluated the separation of Rb-Sr and Nd-Sm isobars by liquid chromatography with on-line multicollector ICP-MS detection. The combination of this chromatographic procedure with multiple linear regression of the raw chromatographic data yielded Sr and Nd isotope ratios with precisions and accuracies typical of off-line sample preparation procedures. On the other hand, methods for labelling individual organisms (such as a given plant, fish or animal) are required for population studies. We have developed a dual isotope labelling procedure which can be unique to a given individual, can be inherited in living organisms, and is stable. Detection of the isotopic signature is also based on multiple linear regression. The labelling of fish and its detection in otoliths by laser ablation ICP-MS will be discussed using trout and salmon as examples. In conclusion, isotope measurement procedures based on multiple linear regression can be a viable alternative in multicollector ICP-MS measurements.
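The deconvolution step can be illustrated with a toy multiple linear regression: treat the measured signal at each mass as a linear mixture of pure-element isotope patterns and solve for the contributions by least squares. The pattern values below are invented for illustration, not real isotope abundances.

```python
import numpy as np

rng = np.random.default_rng(5)
# Rows are signals at three masses; columns are the (invented) isotope
# patterns of two elements that overlap at the second mass.
patterns = np.array([
    [0.72, 0.00],
    [0.28, 0.10],   # isobaric overlap: both elements contribute here
    [0.00, 0.90],
])
true_amounts = np.array([2.0, 5.0])
signal = patterns @ true_amounts + rng.normal(0.0, 0.01, 3)

# Multiple linear regression resolves the overlapped contributions.
amounts, *_ = np.linalg.lstsq(patterns, signal, rcond=None)
```

Even with the shared mass, the regression separates the two elemental contributions, which is the essence of the on-line isobar-resolution approach described above.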
Double linear regression classification for face recognition
NASA Astrophysics Data System (ADS)
Feng, Qingxiang; Zhu, Qi; Tang, Lin-Lin; Pan, Jeng-Shyang
2015-02-01
A new classifier, designed on the basis of the linear regression classification (LRC) classifier and the simple-fast representation-based classifier (SFR) and named the double linear regression classification (DLRC) classifier, is proposed for image recognition in this paper. The traditional LRC classifier uses only the distance between the test image vector and the predicted image vector of each class subspace for classification, and the SFR classifier classifies the test sample using the test image vector and the nearest image vectors of the class subspace. The DLRC classifier, by contrast, computes the predicted image vectors of each class subspace and uses all the predicted vectors to construct a novel robust global space; it then uses this global space to obtain new predicted vectors for each class for classification. A large number of experiments on the AR, JAFFE, Yale, Extended YaleB, and PIE face databases are used to evaluate the performance of the proposed classifier. The experimental results show that it achieves a better recognition rate than the LRC classifier, the SFR classifier, and several other classifiers.
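The baseline LRC step referenced above, projecting the test vector onto each class subspace and keeping the class with the smallest residual, can be sketched as a minimal generic implementation (not the authors' DLRC code):

```python
import numpy as np

def lrc_predict(class_mats, y):
    """Linear regression classification (LRC): least-squares-project the test
    vector onto each class's training subspace; return the label with the
    smallest reconstruction residual."""
    best, best_err = None, np.inf
    for label, Xc in class_mats.items():
        beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
        err = np.linalg.norm(y - Xc @ beta)
        if err < best_err:
            best, best_err = label, err
    return best

# Toy demo: two synthetic "classes" spanning different random subspaces.
rng = np.random.default_rng(2)
A = rng.normal(size=(50, 5))             # class "a": 5 training vectors as columns
B = rng.normal(size=(50, 5))             # class "b"
test = A @ rng.normal(size=5) + 0.01 * rng.normal(size=50)
label = lrc_predict({"a": A, "b": B}, test)
```

A test vector generated inside class "a"'s subspace reconstructs almost perfectly there and poorly in class "b", so LRC recovers the right label.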
Sparse brain network using penalized linear regression
NASA Astrophysics Data System (ADS)
Lee, Hyekyoung; Lee, Dong Soo; Kang, Hyejin; Kim, Boong-Nyun; Chung, Moo K.
2011-03-01
Sparse partial correlation is a useful connectivity measure for brain networks when it is difficult to compute the exact partial correlation in the small-n, large-p setting. In this paper, we formulate the problem of estimating partial correlation as a sparse linear regression with an l1-norm penalty. The method is applied to a brain network consisting of parcellated regions of interest (ROIs) obtained from FDG-PET images of autism spectrum disorder (ASD) children and pediatric control (PedCon) subjects. To validate the results, we check the reproducibility of the obtained brain networks by leave-one-out cross validation and compare the clustered structures derived from the brain networks of ASD and PedCon.
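The neighbourhood-regression idea, fitting each node on all the others with an l1 penalty so that weak conditional links are zeroed out, can be sketched with a minimal coordinate-descent lasso (a generic implementation, not the authors' code):

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Minimal coordinate-descent lasso:
    minimizes 0.5*||y - X b||^2 + lam*||b||_1 via soft-thresholding."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]       # partial residual excluding j
            rho = X[:, j] @ r
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return b

# Toy "network": node 0 depends on node 1 only; nodes 2 and 3 are noise.
rng = np.random.default_rng(3)
Z = rng.normal(size=(100, 4))
Z[:, 0] = 0.9 * Z[:, 1] + 0.1 * rng.normal(size=100)
coef = lasso_cd(Z[:, 1:], Z[:, 0], lam=5.0)      # regress node 0 on nodes 1-3
```

The penalty keeps the genuine edge (node 0 to node 1) and drives the spurious coefficients to zero, which is what makes the resulting network sparse.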
A Gibbs sampler for multivariate linear regression
NASA Astrophysics Data System (ADS)
Mantz, Adam B.
2016-04-01
Kelly described an efficient algorithm, using Gibbs sampling, for performing linear regression in the fairly general case where non-zero measurement errors exist for both the covariates and response variables, where these measurements may be correlated (for the same data point), where the response variable is affected by intrinsic scatter in addition to measurement error, and where the prior distribution of covariates is modelled by a flexible mixture of Gaussians rather than assumed to be uniform. Here, I extend the Kelly algorithm in two ways. First, the procedure is generalized to the case of multiple response variables. Secondly, I describe how to model the prior distribution of covariates using a Dirichlet process, which can be thought of as a Gaussian mixture where the number of mixture components is learned from the data. I present an example of multivariate regression using the extended algorithm, namely fitting scaling relations of the gas mass, temperature, and luminosity of dynamically relaxed galaxy clusters as a function of their mass and redshift. An implementation of the Gibbs sampler in the R language, called LRGS, is provided.
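The Gibbs idea can be illustrated on the simplest case, ordinary linear regression with conjugate full conditionals; the extended algorithm's measurement-error handling and Dirichlet-process covariate prior are omitted from this toy sketch.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 2.0])
y = X @ beta_true + rng.normal(0.0, 1.0, n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                 # OLS, the conditional mean under a flat prior

sigma2 = 1.0
draws = []
for it in range(2000):
    # beta | sigma2, y  ~  Normal(beta_hat, sigma2 * (X'X)^-1)
    beta = rng.multivariate_normal(beta_hat, sigma2 * XtX_inv)
    # sigma2 | beta, y: scaled inverse chi-square, drawn via a gamma variate
    resid = y - X @ beta
    sigma2 = 1.0 / rng.gamma(n / 2.0, 2.0 / (resid @ resid))
    if it >= 500:                            # discard burn-in
        draws.append(beta)
posterior_mean = np.mean(draws, axis=0)
```

Alternating between the two full conditionals yields draws from the joint posterior; the multivariate extension in the paper cycles through analogous conditionals for each response variable.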
Fuzzy multiple linear regression: A computational approach
NASA Technical Reports Server (NTRS)
Juang, C. H.; Huang, X. H.; Fleming, J. W.
1992-01-01
This paper presents a new computational approach for performing fuzzy regression. In contrast to Bardossy's approach, the new approach, while dealing with fuzzy variables, closely follows the conventional regression technique. In this approach, treatment of fuzzy input is more 'computational' than 'symbolic.' The following sections first outline the formulation of the new approach, then deal with the implementation and computational scheme, and this is followed by examples to illustrate the new procedure.
Augmenting Data with Published Results in Bayesian Linear Regression
ERIC Educational Resources Information Center
de Leeuw, Christiaan; Klugkist, Irene
2012-01-01
In most research, linear regression analyses are performed without taking into account published results (i.e., reported summary statistics) of similar previous studies. Although the prior density in Bayesian linear regression could accommodate such prior knowledge, formal models for doing so are absent from the literature. The goal of this…
Who Will Win?: Predicting the Presidential Election Using Linear Regression
ERIC Educational Resources Information Center
Lamb, John H.
2007-01-01
This article outlines a linear regression activity that engages learners, uses technology, and fosters cooperation. Students generated least-squares linear regression equations using TI-83 Plus[TM] graphing calculators, Microsoft[C] Excel, and paper-and-pencil calculations using derived normal equations to predict the 2004 presidential election.…
Compound Identification Using Penalized Linear Regression on Metabolomics
Liu, Ruiqi; Wu, Dongfeng; Zhang, Xiang; Kim, Seongho
2014-01-01
Compound identification is often achieved by matching the experimental mass spectra to the mass spectra stored in a reference library on the basis of mass spectral similarity. Because the number of compounds in the reference library is much larger than the range of mass-to-charge ratio (m/z) values, the data are high dimensional and suffer from singularity. For this reason, penalized linear regressions such as ridge regression and the lasso are used instead of ordinary least squares regression. Furthermore, two-step approaches using the dot product and Pearson’s correlation along with the penalized linear regression are proposed in this study. PMID:27212894
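The ridge variant mentioned above has a closed form that remains well defined even when the design matrix is singular. A toy high-dimensional sketch, with random matrices standing in for real library spectra:

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimate (X'X + lam*I)^-1 X'y; the penalty makes the
    singular, more-columns-than-rows system solvable."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Hypothetical setup: 20 m/z bins, 50 library spectra (p > n, so OLS fails).
rng = np.random.default_rng(11)
X = rng.normal(size=(20, 50))
true = np.zeros(50)
true[3] = 1.0                            # the query matches library compound 3
y = X @ true + 0.01 * rng.normal(size=20)
coef = ridge(X, y, lam=1.0)
```

The matching compound receives a clearly non-zero coefficient, and increasing the penalty shrinks all coefficients toward zero, trading bias for stability.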
A VBA-based Simulation for Teaching Simple Linear Regression
ERIC Educational Resources Information Center
Jones, Gregory Todd; Hagtvedt, Reidar; Jones, Kari
2004-01-01
In spite of the name, simple linear regression presents a number of conceptual difficulties, particularly for introductory students. This article describes a simulation tool that provides a hands-on method for illuminating the relationship between parameters and sample statistics.
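The point such a simulation makes, that sample slope estimates scatter around the true parameter and tighten as the sample size grows, can be reproduced without VBA; a short Python analogue:

```python
import numpy as np

rng = np.random.default_rng(6)

def slope_sd(n, reps=2000, beta=2.0):
    """Fit many simulated samples of size n from a known line and return
    the empirical standard deviation of the slope estimates."""
    slopes = np.empty(reps)
    for r in range(reps):
        x = rng.uniform(0.0, 1.0, n)
        y = beta * x + rng.normal(0.0, 1.0, n)
        slopes[r] = np.polyfit(x, y, 1)[0]   # fitted slope
    return slopes.std()

sd_small, sd_large = slope_sd(20), slope_sd(200)
```

Plotting histograms of the two sets of slopes is the hands-on version of the same lesson: the statistic varies from sample to sample, and its spread shrinks roughly like 1/sqrt(n).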
A SEMIPARAMETRIC BAYESIAN MODEL FOR CIRCULAR-LINEAR REGRESSION
We present a Bayesian approach to regress a circular variable on a linear predictor. The regression coefficients are assumed to have a nonparametric distribution with a Dirichlet process prior. The semiparametric Bayesian approach gives added flexibility to the model and is usefu...
Linear regression analysis of survival data with missing censoring indicators.
Wang, Qihua; Dinse, Gregg E
2011-04-01
Linear regression analysis has been studied extensively in a random censorship setting, but typically all of the censoring indicators are assumed to be observed. In this paper, we develop synthetic data methods for estimating regression parameters in a linear model when some censoring indicators are missing. We define estimators based on regression calibration, imputation, and inverse probability weighting techniques, and we prove all three estimators are asymptotically normal. The finite-sample performance of each estimator is evaluated via simulation. We illustrate our methods by assessing the effects of sex and age on the time to non-ambulatory progression for patients in a brain cancer clinical trial. PMID:20559722
Use of probabilistic weights to enhance linear regression myoelectric control
NASA Astrophysics Data System (ADS)
Smith, Lauren H.; Kuiken, Todd A.; Hargrove, Levi J.
2015-12-01
Objective. Clinically available prostheses for transradial amputees do not allow simultaneous myoelectric control of degrees of freedom (DOFs). Linear regression methods can provide simultaneous myoelectric control, but frequently also result in difficulty with isolating individual DOFs when desired. This study evaluated the potential of using probabilistic estimates of categories of gross prosthesis movement, which are commonly used in classification-based myoelectric control, to enhance linear regression myoelectric control. Approach. Gaussian models were fit to electromyogram (EMG) feature distributions for three movement classes at each DOF (no movement, or movement in either direction) and used to weight the output of linear regression models by the probability that the user intended the movement. Eight able-bodied and two transradial amputee subjects worked in a virtual Fitts’ law task to evaluate differences in controllability between linear regression and probability-weighted regression for an intramuscular EMG-based three-DOF wrist and hand system. Main results. Real-time and offline analyses in able-bodied subjects demonstrated that probability weighting improved performance during single-DOF tasks (p < 0.05) by preventing extraneous movement at additional DOFs. Similar results were seen in experiments with two transradial amputees. Though goodness-of-fit evaluations suggested that the EMG feature distributions showed some deviations from the Gaussian, equal-covariance assumptions used in this experiment, the assumptions were sufficiently met to provide improved performance compared to linear regression control. Significance. Use of probability weights can improve the ability to isolate individual DOFs during linear regression myoelectric control, while maintaining the ability to simultaneously control multiple DOFs.
A Bayesian approach to linear regression in astronomy
NASA Astrophysics Data System (ADS)
Sereno, Mauro
2016-01-01
Linear regression is common in astronomical analyses. I discuss a Bayesian hierarchical modelling of data with heteroscedastic and possibly correlated measurement errors and intrinsic scatter. The method fully accounts for time evolution. The slope, the normalization, and the intrinsic scatter of the relation can evolve with the redshift. The intrinsic distribution of the independent variable is approximated using a mixture of Gaussian distributions whose means and standard deviations depend on time. The method can address scatter in the measured independent variable (a kind of Eddington bias), selection effects in the response variable (Malmquist bias), and departure from linearity in form of a knee. I tested the method with toy models and simulations and quantified the effect of biases and inefficient modelling. The R-package LIRA (LInear Regression in Astronomy) is made available to perform the regression.
Construction cost estimation of municipal incinerators by fuzzy linear regression
Chang, N.B.; Chen, Y.L.; Yang, H.H.
1996-12-31
Regression analysis has been widely used in engineering cost estimation. It is recognized that fuzzy structure in cost estimation is a different type of uncertainty from the measurement error addressed in least-squares regression modeling. Hence, the uncertainties encountered in many construction and operating cost estimation and prediction tasks cannot be fully depicted by conventional least-squares regression models. This paper presents a construction cost analysis of municipal incinerators using the techniques of fuzzy linear regression. A thorough investigation of construction costs in the Taiwan Resource Recovery Project was conducted based on design parameters such as design capacity, type of grate system, and the selected air pollution control process. The focus is placed upon the methodology for dealing with the heterogeneity of the set of observations over which the regression is evaluated.
Direction of Effects in Multiple Linear Regression Models.
Wiedermann, Wolfgang; von Eye, Alexander
2015-01-01
Previous studies analyzed asymmetric properties of the Pearson correlation coefficient using higher than second order moments. These asymmetric properties can be used to determine the direction of dependence in a linear regression setting (i.e., establish which of two variables is more likely to be on the outcome side) within the framework of cross-sectional observational data. Extant approaches are restricted to the bivariate regression case. The present contribution extends the direction of dependence methodology to a multiple linear regression setting by analyzing distributional properties of residuals of competing multiple regression models. It is shown that, under certain conditions, the third central moments of estimated regression residuals can be used to decide upon direction of effects. In addition, three different approaches for statistical inference are discussed: a combined D'Agostino normality test, a skewness difference test, and a bootstrap difference test. Type I error and power of the procedures are assessed using Monte Carlo simulations, and an empirical example is provided for illustrative purposes. In the discussion, issues concerning the quality of psychological data, possible extensions of the proposed methods to the fourth central moment of regression residuals, and potential applications are addressed. PMID:26609741
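The core diagnostic, comparing distributional asymmetry of residuals from the two competing regression directions, can be sketched as follows. This is a toy illustration of the principle, not the paper's multiple-regression setting or its inference procedures.

```python
import numpy as np

def resid_skew(x, y):
    """Third standardized moment of the residuals from regressing y on x."""
    b = np.polyfit(x, y, 1)
    r = y - np.polyval(b, x)
    return ((r - r.mean()) ** 3).mean() / r.std() ** 3

# With a skewed "cause" and symmetric noise, residuals of the correctly
# specified model (y on x) stay nearly symmetric, while the reversed
# model (x on y) inherits skewness from x.
rng = np.random.default_rng(7)
x = rng.exponential(1.0, 5000)                 # skewed predictor
y = 0.8 * x + rng.normal(0.0, 1.0, 5000)       # symmetric error
skew_correct = abs(resid_skew(x, y))
skew_reversed = abs(resid_skew(y, x))
```

The model whose residuals look more symmetric is the more plausible causal direction, which is the intuition the formal tests in the paper build on.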
Multiple Linear Regression as a Technique for Predicting College Enrollment.
ERIC Educational Resources Information Center
Clegg, Ambrose A.; And Others
The application of multiple linear regression to the problem of identifying appropriate criterion variables and predicting enrollment in college courses during a period of rapid decline was studied. Data were gathered on course enrollments for 1972-78 at Kent State University, and five independent variables were selected to determine the…
Rethinking the linear regression model for spatial ecological data.
Wagner, Helene H
2013-11-01
The linear regression model, with its numerous extensions including multivariate ordination, is fundamental to quantitative research in many disciplines. However, spatial or temporal structure in the data may invalidate the regression assumption of independent residuals. Spatial structure at any spatial scale can be modeled flexibly based on a set of uncorrelated component patterns (e.g., Moran's eigenvector maps, MEM) that is derived from the spatial relationships between sampling locations as defined in a spatial weight matrix. Spatial filtering thus addresses spatial autocorrelation in the residuals by adding such component patterns (spatial eigenvectors) as predictors to the regression model. However, space is not an ecologically meaningful predictor, and commonly used tests for selecting significant component patterns do not take into account the specific nature of these variables. This paper proposes "spatial component regression" (SCR) as a new way of integrating the linear regression model with Moran's eigenvector maps. In its unconditioned form, SCR decomposes the relationship between response and predictors by component patterns, whereas conditioned SCR provides an alternative method of spatial filtering, taking into account the statistical properties of component patterns in the design of statistical hypothesis tests. Application to the well-known multivariate mite data set illustrates how SCR may be used to condition for significant residual spatial structure and to identify additional predictors associated with residual spatial structure. Finally, I argue that all variance is spatially structured, hence spatial independence is best characterized by a lack of excess variance at any spatial scale, i.e., spatial white noise. PMID:24400490
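The spatial-filtering idea above can be sketched numerically. This is a hedged illustration on an invented transect of sites, not the mite data: eigenvectors of the doubly centred spatial weight matrix (Moran's eigenvector maps) are added as predictors, which removes residual spatial autocorrelation that the plain regression leaves behind.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

# Binary spatial weight matrix for sites along a transect (adjacent sites are neighbours).
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0

# Moran's eigenvector maps: eigenvectors of the doubly centred weight matrix.
C = np.eye(n) - np.ones((n, n)) / n
evals, evecs = np.linalg.eigh(C @ W @ C)
mem = evecs[:, np.argsort(evals)[::-1][:5]]  # 5 large-scale positive-autocorrelation patterns

# Response with a smooth spatial trend that the non-spatial model cannot capture.
x = rng.normal(size=n)
trend = np.sin(np.linspace(0, 2 * np.pi, n))
y = 1.5 * x + trend + 0.2 * rng.normal(size=n)

def moran_i(r, W):
    r = r - r.mean()
    return n / W.sum() * (r @ W @ r) / (r @ r)

def resid(X, y):
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

r_plain = resid(x, y)                          # no spatial filters
r_filt = resid(np.column_stack([x, mem]), y)   # MEM eigenvectors added as predictors
print(moran_i(r_plain, W), moran_i(r_filt, W))
```

The point of SCR, as the abstract notes, is that such component patterns need statistical treatment of their own; the sketch only shows the mechanics of eigenvector filtering.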
A linear regression solution to the spatial autocorrelation problem
NASA Astrophysics Data System (ADS)
Griffith, Daniel A.
The Moran Coefficient spatial autocorrelation index can be decomposed into orthogonal map pattern components. This decomposition relates it directly to standard linear regression, in which the corresponding eigenvectors can be used as predictors. This paper reports comparative results between these linear regressions and their auto-Gaussian counterparts for the following georeferenced data sets: Columbus (Ohio) crime, Ottawa-Hull median family income, Toronto population density, southwest Ohio unemployment, Syracuse pediatric lead poisoning, and Glasgow standard mortality rates, as well as a small remotely sensed image of the High Peak district. This methodology is extended to auto-logistic and auto-Poisson situations, with selected data analyses including percentage of urban population across Puerto Rico and the frequency of SIDS cases across North Carolina. These data analytic results suggest that this approach to georeferenced data analysis offers considerable promise.
HIGH RESOLUTION FOURIER ANALYSIS WITH AUTO-REGRESSIVE LINEAR PREDICTION
Barton, J.; Shirley, D.A.
1984-04-01
Auto-regressive linear prediction is adapted to double the resolution of Angle-Resolved Photoemission Extended Fine Structure (ARPEFS) Fourier transforms. Even with the optimal taper (weighting function), the commonly used taper-and-transform Fourier method has limited resolution: it assumes the signal is zero beyond the limits of the measurement. By seeking the Fourier spectrum of an infinite extent oscillation consistent with the measurements but otherwise having maximum entropy, the errors caused by finite data range can be reduced. Our procedure developed to implement this concept adapts auto-regressive linear prediction to extrapolate the signal in an effective and controllable manner. Difficulties encountered when processing actual ARPEFS data are discussed. A key feature of this approach is the ability to convert improved measurements (signal-to-noise or point density) into improved Fourier resolution.
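The extrapolation step can be sketched as follows. This is an illustrative stand-in, not ARPEFS data: a short two-frequency oscillation is fitted with autoregressive (linear prediction) coefficients by least squares, then extended to double its measured extent, which is what sharpens the subsequent Fourier transform.

```python
import numpy as np

# A short noiseless two-frequency signal standing in for a measured oscillation.
t = np.arange(128)
sig = np.sin(0.31 * t) + 0.5 * np.sin(0.35 * t)

# Fit AR coefficients by least squares on lagged samples (linear prediction).
p = 20
rows = np.array([sig[i - p:i][::-1] for i in range(p, len(sig))])  # most recent lag first
a, *_ = np.linalg.lstsq(rows, sig[p:], rcond=None)

# Extrapolate the signal to double its extent by iterating the predictor.
ext = list(sig)
for _ in range(len(sig)):
    ext.append(np.dot(a, ext[-1:-p - 1:-1]))
ext = np.array(ext)

# Compare the extrapolation against the true continuation of the oscillation.
true = np.sin(0.31 * np.arange(256)) + 0.5 * np.sin(0.35 * np.arange(256))
print(np.max(np.abs(ext[128:] - true[128:])))
```

With noisy data the order p and the fitting span must be chosen with care, which is exactly the "effective and controllable" extrapolation the abstract refers to.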
Comparison of Logistic Regression and Linear Regression in Modeling Percentage Data
Zhao, Lihui; Chen, Yuhuan; Schaffner, Donald W.
2001-01-01
Percentage is widely used to describe different results in food microbiology, e.g., probability of microbial growth, percent inactivated, and percent of positive samples. Four sets of percentage data, percent-growth-positive, germination extent, probability for one cell to grow, and maximum fraction of positive tubes, were obtained from our own experiments and the literature. These data were modeled using linear and logistic regression. Five methods were used to compare the goodness of fit of the two models: percentage of predictions closer to observations, range of the differences (predicted value minus observed value), deviation of the model, linear regression between the observed and predicted values, and bias and accuracy factors. Logistic regression was a better predictor of at least 78% of the observations in all four data sets. In all cases, the deviation of logistic models was much smaller. The linear correlation between observations and logistic predictions was always stronger. Validation (accomplished using part of one data set) also demonstrated that the logistic model was more accurate in predicting new data points. Bias and accuracy factors were found to be less informative when evaluating models developed for percentage data, since neither of these indices can compare predictions at zero. Model simplification for the logistic model was demonstrated with one data set. The simplified model was as powerful in making predictions as the full linear model, and it also gave clearer insight in determining the key experimental factors. PMID:11319091
Assessing Longitudinal Change: Adjustment for Regression to the Mean Effects
ERIC Educational Resources Information Center
Rocconi, Louis M.; Ethington, Corinna A.
2009-01-01
Pascarella (J Coll Stud Dev 47:508-520, 2006) has called for an increase in use of longitudinal data with pretest-posttest design when studying effects on college students. However, such designs that use multiple measures to document change are vulnerable to an important threat to internal validity, regression to the mean. Herein, we discuss a…
Procedures for adjusting regional regression models of urban-runoff quality using local data
Hoos, Anne B.; Lizarraga, Joy S.
1996-01-01
Statistical operations termed model-adjustment procedures can be used to incorporate local data into existing regression models to improve the prediction of urban-runoff quality. Each procedure is a form of regression analysis in which the local data base is used as a calibration data set; the resulting adjusted regression models can then be used to predict storm-runoff quality at unmonitored sites. Statistical tests of the calibration data set guide selection among proposed procedures.
Coercively Adjusted Auto Regression Model for Forecasting in Epilepsy EEG
Kim, Sun-Hee; Faloutsos, Christos; Yang, Hyung-Jeong
2013-01-01
Recently, data with complex characteristics, such as epilepsy electroencephalography (EEG) time series, have emerged. Epilepsy EEG data have special characteristics including nonlinearity, nonnormality, and nonperiodicity. Therefore, it is important to find a suitable forecasting method that accommodates these special characteristics. In this paper, we propose a coercively adjusted autoregression (CA-AR) method that forecasts future values from a multivariable epilepsy EEG time series. We use the technique of random coefficients, which forcefully adjusts the coefficients to −1 and 1. The fractal dimension is used to determine the order of the CA-AR model. We applied the CA-AR method, which reflects the special characteristics of the data, to forecast future values of epilepsy EEG data. Experimental results show that, compared to previous methods, the proposed method can forecast faster and more accurately. PMID:23710252
Conditional local influence in case-weights linear regression.
Poon, W Y; Poon, Y S
2001-05-01
The local influence approach proposed by Cook (1986) makes use of the normal curvature and the direction achieving the maximum curvature to assess the local influence of minor perturbation of statistical models. When the approach is applied to the linear regression model, the result provides information concerning the data structure different from that contributed by Cook's distance. One of the main advantages of the local influence approach is its ability to handle the simultaneous effect of several cases, namely, the ability to address the problem of 'masking'. However, Lawrance (1995) points out that there are two notions of 'masking' effects, the joint influence and the conditional influence, which are distinct in nature. The normal curvature and the direction of maximum curvature are capable of addressing effects under the category of joint influences but not conditional influences. We construct a new measure to define and detect conditional local influences and use the linear regression model for illustration. Several reported data sets are used to demonstrate that new information can be revealed by this proposed measure. PMID:11393899
The extinction law from photometric data: linear regression methods
NASA Astrophysics Data System (ADS)
Ascenso, J.; Lombardi, M.; Lada, C. J.; Alves, J.
2012-04-01
Context. The properties of dust grains, in particular their size distribution, are expected to differ from the interstellar medium to the high-density regions within molecular clouds. Since the extinction at near-infrared wavelengths is caused by dust, the extinction law in cores should depart from that found in low-density environments if the dust grains have different properties. Aims: We explore methods to measure the near-infrared extinction law produced by dense material in molecular cloud cores from photometric data. Methods: Using controlled sets of synthetic and semi-synthetic data, we test several methods for linear regression applied to the specific problem of deriving the extinction law from photometric data. We cover the parameter space appropriate to this type of observation. Results: We find that many of the common linear-regression methods produce biased results when applied to the extinction law from photometric colors. We propose and validate a new method, LinES, as the most reliable for this purpose. We explore the use of this method to detect whether or not the extinction law of a given reddened population has a break at some value of extinction. Based on observations collected at the European Organisation for Astronomical Research in the Southern Hemisphere, Chile (ESO programmes 069.C-0426 and 074.C-0728).
Adjustment of regional regression equations for urban storm-runoff quality using at-site data
Barks, C.S.
1996-01-01
Regional regression equations have been developed to estimate urban storm-runoff loads and mean concentrations using a national data base. Four statistical methods using at-site data to adjust the regional equation predictions were developed to provide better local estimates. The four adjustment procedures are a single-factor adjustment, a regression of the observed data against the predicted values, a regression of the observed values against the predicted values and additional local independent variables, and a weighted combination of a local regression with the regional prediction. Data collected at five representative storm-runoff sites during 22 storms in Little Rock, Arkansas, were used to verify and, when appropriate, adjust the regional regression equation predictions. Comparison of observed values of storm-runoff loads and mean concentrations to the predicted values from the regional regression equations for nine constituents (chemical oxygen demand, suspended solids, total nitrogen as N, total ammonia plus organic nitrogen as N, total phosphorus as P, dissolved phosphorus as P, total recoverable copper, total recoverable lead, and total recoverable zinc) showed large prediction errors, ranging from 63 percent to more than several thousand percent. Prediction errors for 6 of the 18 regional regression equations were less than 100 percent and could be considered reasonable for water-quality prediction equations. The regression adjustment procedure was used to adjust five of the regional equation predictions to improve the predictive accuracy. For seven of the regional equations the observed and the predicted values are not significantly correlated. Thus, neither the unadjusted regional equations nor any of the adjustments were appropriate. The mean of the observed values was used as a simple estimator when the regional equation predictions and adjusted predictions were not appropriate.
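The single-factor and regression adjustments described above can be sketched numerically. The data below are invented stand-ins for at-site calibration measurements (22 "storms", a regional equation that overpredicts at this site), not the Little Rock data:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical regional regression predictions, biased high at this site.
true_load = rng.lognormal(mean=2.0, sigma=0.5, size=22)         # one value per storm
regional_pred = 1.8 * true_load * rng.lognormal(0.0, 0.1, 22)   # ~80% overprediction

observed = true_load  # at-site monitoring data used as the calibration set

# Single-factor adjustment: scale predictions by the ratio of means.
factor = observed.mean() / regional_pred.mean()
adj_single = factor * regional_pred

# Regression adjustment: regress observed on predicted, use the fitted line.
A = np.column_stack([np.ones(22), regional_pred])
coef = np.linalg.lstsq(A, observed, rcond=None)[0]
adj_regr = A @ coef

def rmse(est):
    return np.sqrt(np.mean((est - observed) ** 2))

print(rmse(regional_pred), rmse(adj_single), rmse(adj_regr))
```

Both adjustments remove the systematic site bias; as the abstract notes, when observed and predicted values are uncorrelated, neither adjustment is appropriate and the mean of the observed values is the fallback estimator.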
Precipitation interpolation in mountainous regions using multiple linear regression
Hay, L.; Viger, R.; McCabe, G.
1998-01-01
Multiple linear regression (MLR) was used to spatially interpolate precipitation for simulating runoff in the Animas River basin of southwestern Colorado. MLR equations were defined for each time step using measured precipitation as dependent variables. Explanatory variables used in each MLR were derived for the dependent variable locations from a digital elevation model (DEM) using a geographic information system. The same explanatory variables were defined for a 5 × 5 km grid of the DEM. For each time step, the best MLR equation was chosen and used to interpolate precipitation onto the 5 × 5 km grid. The gridded values of precipitation provide a physically-based estimate of the spatial distribution of precipitation and result in reliable simulations of daily runoff in the Animas River basin.
The Dantzig Selector for Censored Linear Regression Models
Li, Yi; Dicker, Lee; Zhao, Sihai Dave
2013-01-01
The Dantzig variable selector has recently emerged as a powerful tool for fitting regularized regression models. To our knowledge, most work involving the Dantzig selector has been performed with fully-observed response variables. This paper proposes a new class of adaptive Dantzig variable selectors for linear regression models when the response variable is subject to right censoring. This is motivated by a clinical study to identify genes predictive of event-free survival in newly diagnosed multiple myeloma patients. Under some mild conditions, we establish the theoretical properties of our procedures, including consistency in model selection (i.e. the right subset model will be identified with a probability tending to 1) and the optimal efficiency of estimation (i.e. the asymptotic distribution of the estimates is the same as that when the true subset model is known a priori). The practical utility of the proposed adaptive Dantzig selectors is verified via extensive simulations. We apply our new methods to the aforementioned myeloma clinical trial and identify important predictive genes. PMID:24478569
Modeling Pan Evaporation for Kuwait by Multiple Linear Regression
Almedeij, Jaber
2012-01-01
Evaporation is an important parameter for many projects related to hydrology and water resources systems. This paper constitutes the first study conducted in Kuwait to obtain empirical relations for the estimation of daily and monthly pan evaporation as functions of available meteorological data of temperature, relative humidity, and wind speed. The data used here for the modeling are daily measurements with substantially continuous coverage over a period of 17 years, between January 1993 and December 2009, which can be considered representative of the desert climate of the urban zone of the country. A multiple linear regression technique is used with a procedure of variable selection for fitting the best model forms. The correlations of evaporation with temperature and relative humidity are also transformed in order to linearize the existing curvilinear patterns of the data by using power and exponential functions, respectively. The evaporation models suggested with the best variable combinations were shown to produce results in reasonable agreement with observed values. PMID:23226984
The Allometry of Coarse Root Biomass: Log-Transformed Linear Regression or Nonlinear Regression?
Lai, Jiangshan; Yang, Bo; Lin, Dunmei; Kerkhoff, Andrew J.; Ma, Keping
2013-01-01
Precise estimation of root biomass is important for understanding carbon stocks and dynamics in forests. Traditionally, biomass estimates are based on allometric scaling relationships between stem diameter and coarse root biomass calculated using linear regression (LR) on log-transformed data. Recently, it has been suggested that nonlinear regression (NLR) is a preferable fitting method for scaling relationships. But while this claim has been contested on both theoretical and empirical grounds, and statistical methods have been developed to aid in choosing between the two methods in particular cases, few studies have examined the ramifications of erroneously applying NLR. Here, we use direct measurements of 159 trees belonging to three locally dominant species in east China to compare the LR and NLR models of diameter-root biomass allometry. We then contrast model predictions by estimating stand coarse root biomass based on census data from the nearby 24-ha Gutianshan forest plot and by testing the ability of the models to predict known root biomass values measured on multiple tropical species at the Pasoh Forest Reserve in Malaysia. Based on likelihood estimates for model error distributions, as well as the accuracy of extrapolative predictions, we find that LR on log-transformed data is superior to NLR for fitting diameter-root biomass scaling models. More importantly, inappropriately using NLR leads to grossly inaccurate stand biomass estimates, especially for stands dominated by smaller trees. PMID:24116197
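The LR-versus-NLR comparison can be sketched on synthetic allometric data. The parameter values, the multiplicative error assumption, and the 1-D search used for the nonlinear fit are illustrative choices, not the study's measurements: under lognormal (multiplicative) error, log-log linear regression recovers the scaling exponent directly.

```python
import numpy as np

rng = np.random.default_rng(4)

# Power-law allometry M = a * D^b with multiplicative (lognormal) error.
n = 159
D = rng.uniform(5, 50, n)                             # stem diameter
a_true, b_true = 0.05, 2.4
M = a_true * D**b_true * rng.lognormal(0.0, 0.4, n)   # coarse root biomass

# Method 1: linear regression (LR) on log-transformed data.
slope, intercept = np.polyfit(np.log(D), np.log(M), 1)
b_lr = slope

# Method 2: nonlinear least squares (NLR) on the raw scale.
# For a fixed exponent b the optimal coefficient a is analytic,
# so a one-dimensional search over b suffices for this sketch.
def sse(b):
    a = np.sum(M * D**b) / np.sum(D ** (2 * b))
    return np.sum((M - a * D**b) ** 2)

bs = np.linspace(1.5, 3.5, 2001)
b_nlr = bs[np.argmin([sse(b) for b in bs])]

print(b_lr, b_nlr)
```

Because NLR weights the largest trees most heavily on the raw scale, its exponent estimate is driven by a few big individuals, which is the mechanism behind the biased stand-level extrapolations the abstract reports.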
Outlier Detection in Linear Regression Using the Standard Parity Space Approach
NASA Astrophysics Data System (ADS)
Mustafa Durdag, Utkan; Hekimoglu, Serif
2013-04-01
Despite all technological advancements, outliers may occur due to mistakes in engineering measurements. Before the unknown parameters are estimated, these outliers must be detected and removed from the measurements. There are two main outlier detection approaches: conventional tests based on the least-squares approach (e.g., Baarda's and Pope's tests) and robust methods (e.g., Huber's and Hampel's estimators) are used to identify outliers in a set of measurements. The standard parity space approach is an important model-based Fault Detection and Isolation (FDI) technique commonly used in control engineering. In this study, the standard parity space method is applied to outlier detection in linear regression. Our main goal is to compare, through Monte Carlo simulation, the success of the standard parity space method with that of the conventional tests in linear regression. The least-squares estimator, the most common estimator, minimizes the sum of squared residuals. In the standard parity space approach, to eliminate the unknown vector, the measurement vector is projected onto the left null space of the coefficient matrix. Thus, the orthogonality condition of the parity vector is satisfied and only the effects of the noise vector remain. The residual vector is derived for two cases: the absence of an outlier and the occurrence of an outlier. Its likelihood function is used to determine the detection decision function for the global test. A localization decision function is calculated for each column of the parity matrix, and the measurement corresponding to the maximum of these values is flagged as an outlier. Results were obtained for two outlier magnitude intervals, between 3σ and 6σ (small outliers) and between 6σ and 12σ (large outliers), with the number of unknown parameters chosen as 2 and 3. The mean success rates (MSR) of Baarda's method are better than those of the standard parity space method when the confidence intervals are
Linear regression techniques for use in the EC tracer method of secondary organic aerosol estimation
NASA Astrophysics Data System (ADS)
Saylor, Rick D.; Edgerton, Eric S.; Hartsell, Benjamin E.
A variety of linear regression techniques and simple slope estimators are evaluated for use in the elemental carbon (EC) tracer method of secondary organic carbon (OC) estimation. Linear regression techniques based on ordinary least squares are not suitable for situations where measurement uncertainties exist in both regressed variables. In the past, regression based on the method of Deming [1943. Statistical Adjustment of Data. Wiley, London] has been the preferred choice for EC tracer method parameter estimation. In agreement with Chu [2005. Stable estimate of primary OC/EC ratios in the EC tracer method. Atmospheric Environment 39, 1383-1392], we find that in the limited case where primary non-combustion OC (OC non-comb) is assumed to be zero, the ratio of averages (ROA) approach provides a stable and reliable estimate of the primary OC-EC ratio, (OC/EC) pri. In contrast with Chu [2005. Stable estimate of primary OC/EC ratios in the EC tracer method. Atmospheric Environment 39, 1383-1392], however, we find that the optimal use of Deming regression (and the more general York et al. [2004. Unified equations for the slope, intercept, and standard errors of the best straight line. American Journal of Physics 72, 367-375] regression) provides excellent results as well. For the more typical case where OC non-comb is allowed to obtain a non-zero value, we find that regression based on the method of York is the preferred choice for EC tracer method parameter estimation. In the York regression technique, detailed information on uncertainties in the measurement of OC and EC is used to improve the linear best fit to the given data. If only limited information is available on the relative uncertainties of OC and EC, then Deming regression should be used. On the other hand, use of ROA in the estimation of secondary OC, and thus the assumption of a zero OC non-comb value, generally leads to an overestimation of the contribution of secondary OC to total measured OC.
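Deming regression with a known error-variance ratio has a closed-form slope, which makes the contrast with OLS easy to sketch. The EC/OC values, noise levels, and the true primary ratio below are invented for illustration; the point is that OLS attenuates the slope when the x variable is also measured with error, while Deming regression does not.

```python
import numpy as np

rng = np.random.default_rng(5)

def deming_slope(x, y, delta=1.0):
    # Deming regression slope; delta = var(error in y) / var(error in x).
    sxx = np.var(x, ddof=1)
    syy = np.var(y, ddof=1)
    sxy = np.cov(x, y, ddof=1)[0, 1]
    d = syy - delta * sxx
    return (d + np.sqrt(d * d + 4 * delta * sxy**2)) / (2 * sxy)

# Both EC and OC measured with error; true primary ratio (OC/EC)_pri = 2.5.
ec_true = rng.uniform(0.5, 5.0, 300)
oc_true = 2.5 * ec_true + 0.4                # small non-combustion OC intercept
ec = ec_true + 0.3 * rng.normal(size=300)    # measurement error in x as well
oc = oc_true + 0.3 * rng.normal(size=300)

ols_slope = np.cov(ec, oc, ddof=1)[0, 1] / np.var(ec, ddof=1)
dem_slope = deming_slope(ec, oc, delta=1.0)
print(ols_slope, dem_slope)
```

York regression generalizes this by using point-by-point measurement uncertainties; as the abstract notes, Deming regression is the fallback when only the overall relative uncertainties of OC and EC are known.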
Forecasting Groundwater Temperature with Linear Regression Models Using Historical Data.
Figura, Simon; Livingstone, David M; Kipfer, Rolf
2015-01-01
Although temperature is an important determinant of many biogeochemical processes in groundwater, very few studies have attempted to forecast the response of groundwater temperature to future climate warming. Using a composite linear regression model based on the lagged relationship between historical groundwater and regional air temperature data, empirical forecasts were made of groundwater temperature in several aquifers in Switzerland up to the end of the current century. The model was fed with regional air temperature projections calculated for greenhouse-gas emissions scenarios A2, A1B, and RCP3PD. Model evaluation revealed that the approach taken is adequate only when the data used to calibrate the models are sufficiently long and contain sufficient variability. These conditions were satisfied for three aquifers, all fed by riverbank infiltration. The forecasts suggest that with respect to the reference period 1980 to 2009, groundwater temperature in these aquifers will most likely increase by 1.1 to 3.8 K by the end of the current century, depending on the greenhouse-gas emissions scenario employed. PMID:25412761
Robust linear regression with broad distributions of errors
NASA Astrophysics Data System (ADS)
Postnikov, Eugene B.; Sokolov, Igor M.
2015-09-01
We consider the problem of linear fitting of noisy data in the case of broad (say α-stable) distributions of random impacts ("noise"), which can lack even the first moment. This situation, common in the statistical physics of small systems, in Earth sciences, in network science, and in econophysics, does not allow for the application of conventional Gaussian maximum-likelihood estimators resulting in the usual least-squares fits. Such fits lead to large deviations of fitted parameters from their true values due to the presence of outliers. The approaches discussed here aim at minimizing the width of the distribution of residuals. This width can be defined either via the interquantile distance of the corresponding distribution or via the scale parameter in its characteristic function. The methods provide robust regression even in the case of short samples with large outliers, and are equivalent to the normal least-squares fit for Gaussian noise. Our discussion is illustrated by numerical examples.
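One of the width-based ideas above, minimizing the interquantile distance of the residuals, can be sketched as follows. The Cauchy noise, grid search, and quartile choice are illustrative assumptions: with heavy-tailed errors the least-squares slope is unreliable, while the slope minimizing the interquartile width of the residuals stays close to the true value (the intercept drops out of the width).

```python
import numpy as np

rng = np.random.default_rng(6)

# Linear data with heavy-tailed (Cauchy) noise: the error variance does not exist.
x = np.linspace(-5, 5, 400)
y = 1.0 + 2.0 * x + rng.standard_cauchy(400)

# Ordinary least squares, thrown off by the large outliers.
A = np.column_stack([np.ones_like(x), x])
slope_ols = np.linalg.lstsq(A, y, rcond=None)[0][1]

# Robust alternative: pick the slope minimizing the interquartile
# width of the residual distribution.
def iqr_width(b):
    r = y - b * x
    q1, q3 = np.percentile(r, [25, 75])
    return q3 - q1

bs = np.linspace(0, 4, 4001)
slope_iqr = bs[np.argmin([iqr_width(b) for b in bs])]
print(slope_ols, slope_iqr)
```

For Gaussian noise the two estimates coincide in expectation, which is the equivalence property the abstract mentions.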
Comparison of the Properties of Regression and Categorical Risk-Adjustment Models
Averill, Richard F.; Muldoon, John H.; Hughes, John S.
2016-01-01
Clinical risk-adjustment, the ability to standardize the comparison of individuals with different health needs, is based upon 2 main alternative approaches: regression models and clinical categorical models. In this article, we examine the impact of the differences in the way these models are constructed on end user applications. PMID:26945302
ERIC Educational Resources Information Center
Olejnik, Stephen; Mills, Jamie; Keselman, Harvey
2000-01-01
Evaluated the use of Mallows' C(p) and Wherry's adjusted R squared (R. Wherry, 1931) statistics to select a final model from a pool of model solutions using computer-generated data. Neither statistic identified the underlying regression model any better than, and usually less well than, the stepwise selection method, which itself was poor for…
Comparison between Linear and Nonlinear Regression in a Laboratory Heat Transfer Experiment
ERIC Educational Resources Information Center
Gonçalves, Carine Messias; Schwaab, Marcio; Pinto, José Carlos
2013-01-01
In order to interpret laboratory experimental data, undergraduate students are used to perform linear regression through linearized versions of nonlinear models. However, the use of linearized models can lead to statistically biased parameter estimates. Even so, it is not an easy task to introduce nonlinear regression and show for the students…
NASA Astrophysics Data System (ADS)
Kheloufi, N.; Kahlouche, S.; Lamara, R. Ait Ahmed
2009-04-01
The resolution of MREs (Multiple Regression Equations) is an important tool for fitting different geodetic networks. Nevertheless, in various fields of engineering and earth science, certain cases need more accuracy, and ordinary (linear) least squares proves to be limited. Thus, we have to use new numerical resolution methods that can provide greater efficiency in polynomial modeling. In geodesy, the accuracy of coordinate determination and network adjustment is very important; that is why, instead of being limited to linear models, we apply nonlinear least-squares resolution to the transformation problem between geodetic systems. This need appears especially in the case of the Nord-Sahara datum (Algeria), for which linear models are not very appropriate because of the lack of information about the geoid's undulation. In this paper, our main aim is to demonstrate the importance of using nonlinear least squares to improve the quality of geodetic adjustment and coordinate transformation, and the extent of its use. The algorithms implemented concern the application of two models: a three-dimensional (global transformation) model and a two-dimensional (local transformation) model over a huge area (Algeria). We compute coordinate transformation parameters and their RMS values by both ordinary least squares and the new algorithms, then perform a statistical analysis to compare the linear adjustment with its two variants (local and global) on the one hand, and the nonlinear one on the other. In this context, a set of 16 benchmarks has been integrated to compute the transformation parameters (3D and 2D). Different nonlinear optimization algorithms (Newton's algorithm, steepest descent, and Levenberg-Marquardt) have been implemented to solve the transformation problem. Conclusions and recommendations are given with respect to the suitability, accuracy, and efficiency of each method. Key words: MREs, Nord Sahara, global
Interpreting Multiple Linear Regression: A Guidebook of Variable Importance
ERIC Educational Resources Information Center
Nathans, Laura L.; Oswald, Frederick L.; Nimon, Kim
2012-01-01
Multiple regression (MR) analyses are commonly employed in social science fields. It is also common for interpretation of results to reflect overreliance on beta weights, often resulting in very limited interpretations of variable importance. It appears that few researchers employ other methods to obtain a fuller understanding of what…
Sample Sizes when Using Multiple Linear Regression for Prediction
ERIC Educational Resources Information Center
Knofczynski, Gregory T.; Mundfrom, Daniel
2008-01-01
When using multiple regression for prediction purposes, the issue of minimum required sample size often needs to be addressed. Using a Monte Carlo simulation, models with varying numbers of independent variables were examined and minimum sample sizes were determined for multiple scenarios at each number of independent variables. The scenarios…
Procedures for adjusting regional regression models of urban-runoff quality using local data
Hoos, A.B.; Sisolak, J.K.
1993-01-01
Statistical operations termed model-adjustment procedures (MAPs) can be used to incorporate local data into existing regression models to improve the prediction of urban-runoff quality. Each MAP is a form of regression analysis in which the local data base is used as a calibration data set. Regression coefficients are determined from the local data base, and the resulting 'adjusted' regression models can then be used to predict storm-runoff quality at unmonitored sites. The response variable in the regression analyses is the observed load or mean concentration of a constituent in storm runoff for a single storm. The set of explanatory variables used in the regression analyses is different for each MAP, but always includes the predicted value of load or mean concentration from a regional regression model. The four MAPs examined in this study were: single-factor regression against the regional model prediction, Pr (termed MAP-1F-Pr); regression against Pr (termed MAP-R-Pr); regression against Pr and additional local variables (termed MAP-R-Pr+nV); and a weighted combination of Pr and a local-regression prediction (termed MAP-W). The procedures were tested by means of split-sample analysis, using data from three cities included in the Nationwide Urban Runoff Program: Denver, Colorado; Bellevue, Washington; and Knoxville, Tennessee. The MAP that provided the greatest predictive accuracy for the verification data set differed among the three test data bases and among model types (MAP-W for Denver and Knoxville, MAP-1F-Pr and MAP-R-Pr for Bellevue load models, and MAP-R-Pr+nV for Bellevue concentration models) and, in many cases, was not clearly indicated by the values of standard error of estimate for the calibration data set. A scheme to guide MAP selection, based on exploratory data analysis of the calibration data set, is presented and tested. The MAPs were tested for sensitivity to the size of a calibration data set. As expected, predictive accuracy of all MAPs for
ERIC Educational Resources Information Center
Hecht, Jeffrey B.
The analysis of regression residuals and detection of outliers are discussed, with emphasis on determining how deviant an individual data point must be to be considered an outlier and the impact that multiple suspected outlier data points have on the process of outlier determination and treatment. Only bivariate (one dependent and one independent)…
Evaluation of preservative systems in a sunscreen formula by linear regression method.
Bou-Chacra, Nádia A; Pinto, Terezinha de Jesus A; Ohara, Mitsuko Taba
2003-01-01
A sunscreen formula with eight different preservative systems was evaluated by linear regression, pharmacopeial, and the CTFA (Cosmetic, Toiletry and Fragrance Association) methods. The preparations were tested against Staphylococcus aureus, Burkholderia cepacia, Shewanella putrefaciens, Escherichia coli, and Bacillus sp. The linear regression method proved to be useful in the selection of the most effective preservative system used in cosmetic formulation. PMID:12688287
NASA Astrophysics Data System (ADS)
Ciupak, Maurycy; Ozga-Zielinski, Bogdan; Adamowski, Jan; Quilty, John; Khalil, Bahaa
2015-11-01
A novel implementation of Dynamic Linear Bayesian Models (DLBM), using either a Varying Coefficient Regression (VCR) or a Discount Weighted Regression (DWR) algorithm was used in the hydrological modeling of annual hydrographs as well as 1-, 2-, and 3-day lead time stream flow forecasting. Using hydrological data (daily discharge, rainfall, and mean, maximum and minimum air temperatures) from the Upper Narew River watershed in Poland, the forecasting performance of DLBM was compared to that of traditional multiple linear regression (MLR) and more recent artificial neural network (ANN) based models. Model performance was ranked DLBM-DWR > DLBM-VCR > MLR > ANN for both annual hydrograph modeling and 1-, 2-, and 3-day lead forecasting, indicating that the DWR and VCR algorithms, operating in a DLBM framework, represent promising new methods for both annual hydrograph modeling and short-term stream flow forecasting.
NASA Astrophysics Data System (ADS)
Liu, Pudong; Shi, Runhe; Wang, Hong; Bai, Kaixu; Gao, Wei
2014-10-01
Leaf pigments are key elements for plant photosynthesis and growth. Traditional manual sampling of these pigments is labor-intensive and costly, and has difficulty capturing their temporal and spatial characteristics. The aim of this work is to estimate photosynthetic pigments at large scale by remote sensing. For this purpose, inverse models were proposed with the aid of stepwise multiple linear regression (SMLR) analysis. A leaf radiative transfer model (the PROSPECT model) was employed to simulate leaf reflectance at wavelengths from 400 to 780 nm at 1 nm intervals, and these values were treated as data from remote sensing observations. Simulated chlorophyll concentration (Cab), carotenoid concentration (Car) and their ratio (Cab/Car) were each taken as the target for building a regression model. In this study, a total of 4000 samples were simulated via PROSPECT with different Cab, Car and leaf mesophyll structures; 70% of these samples were used for training and the remaining 30% for model validation. Reflectance (r) and its mathematical transformations (1/r and log(1/r)) were each used to build regression models. Results showed fair agreement between pigments and simulated reflectance, with all adjusted coefficients of determination (R2) larger than 0.8 when six wavebands were selected to build the SMLR model. The largest R2 values for Cab, Car and Cab/Car were 0.8845, 0.876 and 0.8765, respectively. Mathematical transformations of reflectance showed little influence on regression accuracy. We conclude that it is feasible to estimate chlorophyll, carotenoids and their ratio from a statistical model with leaf reflectance data.
Divergent estimation error in portfolio optimization and in linear regression
NASA Astrophysics Data System (ADS)
Kondor, I.; Varga-Haszonits, I.
2008-08-01
The problem of estimation error in portfolio optimization is discussed, in the limit where the portfolio size N and the sample size T go to infinity such that their ratio is fixed. The estimation error strongly depends on the ratio N/T and diverges at a critical value of this parameter. This divergence is the manifestation of an algorithmic phase transition; it is accompanied by a number of critical phenomena and displays universality. As the structure of a large number of multidimensional regression and modelling problems is very similar to portfolio optimization, the scope of these observations extends far beyond finance, covering a large number of problems in operations research, machine learning, bioinformatics, medical science, economics, and technology.
Two biased estimation techniques in linear regression: Application to aircraft
NASA Technical Reports Server (NTRS)
Klein, Vladislav
1988-01-01
Several ways for detection and assessment of collinearity in measured data are discussed. Because data collinearity usually results in poor least squares estimates, two estimation techniques which can limit a damaging effect of collinearity are presented. These two techniques, the principal components regression and mixed estimation, belong to a class of biased estimation techniques. Detection and assessment of data collinearity and the two biased estimation techniques are demonstrated in two examples using flight test data from longitudinal maneuvers of an experimental aircraft. The eigensystem analysis and parameter variance decomposition appeared to be a promising tool for collinearity evaluation. The biased estimators had far better accuracy than the results from the ordinary least squares technique.
SPReM: Sparse Projection Regression Model For High-dimensional Linear Regression *
Sun, Qiang; Zhu, Hongtu; Liu, Yufeng; Ibrahim, Joseph G.
2014-01-01
The aim of this paper is to develop a sparse projection regression modeling (SPReM) framework to perform multivariate regression modeling with a large number of responses and a multivariate covariate of interest. We propose two novel heritability ratios to simultaneously perform dimension reduction, response selection, estimation, and testing, while explicitly accounting for correlations among multivariate responses. Our SPReM is devised to specifically address the low statistical power of many standard statistical approaches, such as Hotelling’s T2 test statistic or a mass univariate analysis, for high-dimensional data. We formulate the estimation problem of SPReM as a novel sparse unit rank projection (SURP) problem and propose a fast optimization algorithm for SURP. Furthermore, we extend SURP to the sparse multi-rank projection (SMURP) by adopting a sequential SURP approximation. Theoretically, we have systematically investigated the convergence properties of SURP and the convergence rate of SURP estimates. Our simulation results and real data analysis show that SPReM outperforms other state-of-the-art methods. PMID:26527844
Identifying predictors of physics item difficulty: A linear regression approach
NASA Astrophysics Data System (ADS)
Mesic, Vanes; Muratovic, Hasnija
2011-06-01
Large-scale assessments of student achievement in physics are often approached with an intention to discriminate students based on the attained level of their physics competencies. Therefore, for purposes of test design, it is important that items display acceptable discriminatory behavior. To that end, it is recommended to avoid extraordinarily difficult and very easy items. Knowing the factors that influence physics item difficulty makes it possible to model the item difficulty even before the first pilot study is conducted. Thus, by identifying predictors of physics item difficulty, we can improve the test-design process. Furthermore, we get additional qualitative feedback regarding the basic aspects of student cognitive achievement in physics that are directly responsible for the obtained quantitative test results. In this study, we conducted a secondary analysis of data from two large-scale assessments of student physics achievement at the end of compulsory education in Bosnia and Herzegovina. First, we explored the concept of “physics competence” and performed a content analysis of 123 physics items that were included within the above-mentioned assessments. Thereafter, an item database was created. Items were described by variables which reflect some basic cognitive aspects of physics competence. For each of the assessments, Rasch item difficulties were calculated in separate analyses. In order to make the item difficulties from different assessments comparable, a virtual test equating procedure had to be implemented. Finally, a regression model of physics item difficulty was created. It has been shown that 61.2% of item difficulty variance can be explained by factors which reflect the automaticity, complexity, and modality of the knowledge structure that is relevant for generating the most probable correct solution, as well as by the divergence of required thinking and interference effects between intuitive and formal physics knowledge.
Simultaneous Determination of Cobalt, Copper, and Nickel by Multivariate Linear Regression.
ERIC Educational Resources Information Center
Dado, Greg; Rosenthal, Jeffrey
1990-01-01
Presented is an experiment where the concentrations of three metal ions in a solution are simultaneously determined by ultraviolet-vis spectroscopy. Availability of the computer program used for statistically analyzing data using a multivariate linear regression is listed. (KR)
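The underlying calculation is a Beer's-law least-squares fit: absorbances measured at several wavelengths are regressed on known molar absorptivities to recover the three concentrations simultaneously. A self-contained sketch with hypothetical absorptivities (the real experiment would use measured calibration values):

```python
# Hypothetical molar absorptivities (per mM, 1 cm path) of Co, Cu, Ni
# at five wavelengths; Beer's law: A(w) = sum_i eps[w][i] * c[i].
eps = [
    [0.60, 0.05, 0.10],
    [0.20, 0.70, 0.05],
    [0.10, 0.30, 0.50],
    [0.05, 0.10, 0.65],
    [0.40, 0.20, 0.20],
]
c_true = [2.0, 1.0, 3.0]  # mM, used only to fabricate the "measured" A
A = [sum(e * c for e, c in zip(row, c_true)) for row in eps]

# Normal equations (E^T E) c = E^T A give the least-squares estimate.
n = 3
ete = [[sum(eps[w][i] * eps[w][j] for w in range(len(eps)))
        for j in range(n)] for i in range(n)]
eta = [sum(eps[w][i] * A[w] for w in range(len(eps))) for i in range(n)]

# Solve the 3x3 system by Gaussian elimination with partial pivoting.
M = [row[:] + [b] for row, b in zip(ete, eta)]
for col in range(n):
    piv = max(range(col, n), key=lambda r: abs(M[r][col]))
    M[col], M[piv] = M[piv], M[col]
    for r in range(col + 1, n):
        f = M[r][col] / M[col][col]
        M[r] = [a - f * b for a, b in zip(M[r], M[col])]
c_hat = [0.0] * n
for i in reversed(range(n)):
    c_hat[i] = (M[i][n] - sum(M[i][j] * c_hat[j]
                              for j in range(i + 1, n))) / M[i][i]

print([round(c, 3) for c in c_hat])
```

With noise-free absorbances the fit recovers the true concentrations exactly; with real spectra, using more wavelengths than analytes averages out measurement noise, which is the point of the multivariate approach.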
As a fast and effective technique, the multiple linear regression (MLR) method has been widely used in modeling and prediction of beach bacteria concentrations. Among previous works on this subject, however, several issues were insufficiently or inconsistently addressed. Those is...
Pérez-Rodríguez, Paulino; Gianola, Daniel; González-Camacho, Juan Manuel; Crossa, José; Manès, Yann; Dreisigacker, Susanne
2012-01-01
In genome-enabled prediction, parametric, semi-parametric, and non-parametric regression models have been used. This study assessed the predictive ability of linear and non-linear models using dense molecular markers. The linear models were linear on marker effects and included the Bayesian LASSO, Bayesian ridge regression, Bayes A, and Bayes B. The non-linear models (this refers to non-linearity on markers) were reproducing kernel Hilbert space (RKHS) regression, Bayesian regularized neural networks (BRNN), and radial basis function neural networks (RBFNN). These statistical models were compared using 306 elite wheat lines from CIMMYT genotyped with 1717 diversity array technology (DArT) markers and two traits, days to heading (DTH) and grain yield (GY), measured in each of 12 environments. It was found that the three non-linear models had better overall prediction accuracy than the linear regression specification. Results showed a consistent superiority of RKHS and RBFNN over the Bayesian LASSO, Bayesian ridge regression, Bayes A, and Bayes B models. PMID:23275882
Graphical Description of Johnson-Neyman Outcomes for Linear and Quadratic Regression Surfaces.
ERIC Educational Resources Information Center
Schafer, William D.; Wang, Yuh-Yin
A modification of the usual graphical representation of heterogeneous regressions is described that can aid in interpreting significant regions for linear or quadratic surfaces. The standard Johnson-Neyman graph is a bivariate plot with the criterion variable on the ordinate and the predictor variable on the abscissa. Regression surfaces are drawn…
NASA Astrophysics Data System (ADS)
Gao, Xiangyun; An, Haizhong; Fang, Wei; Huang, Xuan; Li, Huajiao; Zhong, Weiqiong; Ding, Yinghui
2014-07-01
The linear regression parameters between two time series can differ under different lengths of the observation period. If we study the whole period through a sliding window of a short period, the change in the linear regression parameters is a process of dynamic transmission over time. We present a simple and efficient computational scheme: a linear regression patterns transmission algorithm, which transforms linear regression patterns into directed and weighted networks. The linear regression patterns (nodes) are defined by the combination of intervals of the linear regression parameters and the results of significance testing under different sizes of the sliding window. The transmissions between adjacent patterns are defined as edges, and the weights of the edges are the frequencies of the transmissions. The major patterns, the distance, and the medium in the process of transmission can be captured. The statistical results of weighted out-degree and betweenness centrality are mapped onto timelines, which shows the features of the distribution of the results. Many measurements in different areas that involve two related time series variables could take advantage of this algorithm to characterize the dynamic relationships between the time series from a new perspective.
NASA Astrophysics Data System (ADS)
Zhu, Dazhou; Ji, Baoping; Meng, Chaoying; Shi, Bolin; Tu, Zhenhua; Qing, Zhaoshen
Hybrid linear analysis (HLA), partial least-squares (PLS) regression, and the linear least squares support vector machine (LSSVM) were used to determine the soluble solids content (SSC) of apple by Fourier transform near-infrared (FT-NIR) spectroscopy. The performance of these three linear regression methods was compared. Results showed that HLA could be used for the analysis of complex solid samples such as apple. The predictive ability of the SSC model constructed by HLA was comparable to that of PLS. HLA was sensitive to outliers, so outliers should be eliminated before HLA calibration. Linear LSSVM performed better than PLS and HLA. Direct orthogonal signal correction (DOSC) pretreatment was effective for PLS and linear LSSVM, but not suitable for HLA. The combination of DOSC and linear LSSVM had good generalization ability and was not sensitive to outliers, so it is a promising method for linear multivariate calibration.
2014-01-01
Background In biomedical research, response variables are often encountered which have bounded support on the open unit interval - (0,1). Traditionally, researchers have attempted to estimate covariate effects on these types of response data using linear regression. Alternative modelling strategies may include: beta regression, variable-dispersion beta regression, and fractional logit regression models. This study employs a Monte Carlo simulation design to compare the statistical properties of the linear regression model to that of the more novel beta regression, variable-dispersion beta regression, and fractional logit regression models. Methods In the Monte Carlo experiment we assume a simple two sample design. We assume observations are realizations of independent draws from their respective probability models. The randomly simulated draws from the various probability models are chosen to emulate average proportion/percentage/rate differences of pre-specified magnitudes. Following simulation of the experimental data we estimate average proportion/percentage/rate differences. We compare the estimators in terms of bias, variance, type-1 error and power. Estimates of Monte Carlo error associated with these quantities are provided. Results If response data are beta distributed with constant dispersion parameters across the two samples, then all models are unbiased and have reasonable type-1 error rates and power profiles. If the response data in the two samples have different dispersion parameters, then the simple beta regression model is biased. When the sample size is small (N0 = N1 = 25) linear regression has superior type-1 error rates compared to the other models. Small sample type-1 error rates can be improved in beta regression models using bias correction/reduction methods. In the power experiments, variable-dispersion beta regression and fractional logit regression models have slightly elevated power compared to linear regression models. Similar
Algamal, Zakariya Yahya; Lee, Muhammad Hisyam
2015-12-01
Cancer classification and gene selection in high-dimensional data have been popular research topics in genetics and molecular biology. Recently, adaptive regularized logistic regression using the elastic net regularization, called the adaptive elastic net, has been successfully applied in high-dimensional cancer classification to tackle both estimating the gene coefficients and performing gene selection simultaneously. The adaptive elastic net originally used elastic net estimates as the initial weight; however, using this weight may not be preferable for certain reasons: first, the elastic net estimator is biased in selecting genes; second, it does not perform well when the pairwise correlations between variables are not high. Adjusted adaptive regularized logistic regression (AAElastic) is proposed to address these issues and encourage grouping effects simultaneously. The real data results indicate that AAElastic is significantly more consistent in selecting genes than the three competitor regularization methods. Additionally, the classification performance of AAElastic is comparable to the adaptive elastic net and better than the other regularization methods. Thus, we conclude that AAElastic is a reliable adaptive regularized logistic regression method in the field of high-dimensional cancer classification. PMID:26520484
An improved multiple linear regression and data analysis computer program package
NASA Technical Reports Server (NTRS)
Sidik, S. M.
1972-01-01
NEWRAP, an improved version of a previous multiple linear regression program called RAPIER, CREDUC, and CRSPLT, allows for a complete regression analysis including cross plots of the independent and dependent variables, correlation coefficients, regression coefficients, analysis of variance tables, t-statistics and their probability levels, rejection of independent variables, plots of residuals against the independent and dependent variables, and a canonical reduction of quadratic response functions useful in optimum seeking experimentation. A major improvement over RAPIER is that all regression calculations are done in double precision arithmetic.
ERIC Educational Resources Information Center
Preacher, Kristopher J.; Curran, Patrick J.; Bauer, Daniel J.
2006-01-01
Simple slopes, regions of significance, and confidence bands are commonly used to evaluate interactions in multiple linear regression (MLR) models, and the use of these techniques has recently been extended to multilevel or hierarchical linear modeling (HLM) and latent curve analysis (LCA). However, conducting these tests and plotting the…
Application of wavelet-based multiple linear regression model to rainfall forecasting in Australia
NASA Astrophysics Data System (ADS)
He, X.; Guan, H.; Zhang, X.; Simmons, C.
2013-12-01
In this study, a wavelet-based multiple linear regression model is applied to forecast monthly rainfall in Australia by using monthly historical rainfall data and climate indices as inputs. The wavelet-based model is constructed by incorporating the multi-resolution analysis (MRA) with the discrete wavelet transform and multiple linear regression (MLR) model. The standardized monthly rainfall anomaly and large-scale climate index time series are decomposed using MRA into a certain number of component subseries at different temporal scales. The hierarchical lag relationship between the rainfall anomaly and each potential predictor is identified by cross correlation analysis with a lag time of at least one month at different temporal scales. The components of predictor variables with known lag times are then screened with a stepwise linear regression algorithm to be selectively included into the final forecast model. The MRA-based rainfall forecasting method is examined with 255 stations over Australia, and compared to the traditional multiple linear regression model based on the original time series. The models are trained with data from the 1959-1995 period and then tested in the 1996-2008 period for each station. The performance is compared with observed rainfall values, and evaluated by common statistics of relative absolute error and correlation coefficient. The results show that the wavelet-based regression model provides considerably more accurate monthly rainfall forecasts for all of the selected stations over Australia than the traditional regression model.
Comparison of Linear and Non-Linear Regression Models to Estimate Leaf Area Index of Dryland Shrubs.
NASA Astrophysics Data System (ADS)
Dashti, H.; Glenn, N. F.; Ilangakoon, N. T.; Mitchell, J.; Dhakal, S.; Spaete, L.
2015-12-01
Leaf area index (LAI) is a key parameter in global ecosystem studies. LAI is considered a forcing variable in land surface processing models since ecosystem dynamics are highly correlated to LAI. In response to environmental limitations, plants in semiarid ecosystems have smaller leaf area, making accurate estimation of LAI by remote sensing a challenging issue. Optical remote sensing (400-2500 nm) techniques to estimate LAI are based either on radiative transfer models (RTMs) or statistical approaches. Considering the complex radiation field of dry ecosystems, simple 1-D RTMs lead to poor results, and on the other hand, inversion of more complex 3-D RTMs is a demanding task which requires the specification of many variables. A good alternative to physical approaches is using methods based on statistics. Similar to many natural phenomena, there is a non-linear relationship between LAI and top of canopy electromagnetic waves reflected to optical sensors. Non-linear regression models can better capture this relationship. However, considering the problem of a few numbers of observations in comparison to the feature space (n
linear models. In this study, linear versus non-linear regression techniques were investigated to estimate LAI. Our study area is located in southwestern Idaho, in the Great Basin. Sagebrush (Artemisia tridentata spp.) serves a critical role in maintaining the structure of this ecosystem. Using a leaf area meter (Accupar LP-80), LAI values were measured in the field. Linear partial least squares (PLS) regression and non-linear, tree-based Random Forest regression were implemented to estimate the LAI of sagebrush from hyperspectral data (AVIRIS-ng) collected in late summer 2014. Cross-validation results indicate that PLS can provide results comparable to Random Forest.
Li, Li; Brumback, Babette A; Weppelmann, Thomas A; Morris, J Glenn; Ali, Afsar
2016-08-15
Motivated by an investigation of the effect of surface water temperature on the presence of Vibrio cholerae in water samples collected from different fixed surface water monitoring sites in Haiti in different months, we investigated methods to adjust for unmeasured confounding due to either of the two crossed factors site and month. In the process, we extended previous methods that adjust for unmeasured confounding due to one nesting factor (such as site, which nests the water samples from different months) to the case of two crossed factors. First, we developed a conditional pseudolikelihood estimator that eliminates fixed effects for the levels of each of the crossed factors from the estimating equation. Using the theory of U-Statistics for independent but non-identically distributed vectors, we show that our estimator is consistent and asymptotically normal, but that its variance depends on the nuisance parameters and thus cannot be easily estimated. Consequently, we apply our estimator in conjunction with a permutation test, and we investigate use of the pigeonhole bootstrap and the jackknife for constructing confidence intervals. We also incorporate our estimator into a diagnostic test for a logistic mixed model with crossed random effects and no unmeasured confounding. For comparison, we investigate between-within models extended to two crossed factors. These generalized linear mixed models include covariate means for each level of each factor in order to adjust for the unmeasured confounding. We conduct simulation studies, and we apply the methods to the Haitian data. Copyright © 2016 John Wiley & Sons, Ltd. PMID:26892025
Linear and nonlinear regression techniques for simultaneous and proportional myoelectric control.
Hahne, J M; Biessmann, F; Jiang, N; Rehbaum, H; Farina, D; Meinecke, F C; Muller, K-R; Parra, L C
2014-03-01
In recent years the number of active controllable joints in electrically powered hand-prostheses has increased significantly. However, the control strategies for these devices in current clinical use are inadequate as they require separate and sequential control of each degree-of-freedom (DoF). In this study we systematically compare linear and nonlinear regression techniques for an independent, simultaneous and proportional myoelectric control of wrist movements with two DoF. These techniques include linear regression, mixture of linear experts (ME), multilayer-perceptron, and kernel ridge regression (KRR). They are investigated offline with electro-myographic signals acquired from ten able-bodied subjects and one person with congenital upper limb deficiency. The control accuracy is reported as a function of the number of electrodes and the amount and diversity of training data providing guidance for the requirements in clinical practice. The results showed that KRR, a nonparametric statistical learning method, outperformed the other methods. However, simple transformations in the feature space could linearize the problem, so that linear models could achieve similar performance as KRR at much lower computational costs. Especially ME, a physiologically inspired extension of linear regression represents a promising candidate for the next generation of prosthetic devices. PMID:24608685
Kleinman, Lawrence C; Norton, Edward C
2009-01-01
Objective To develop and validate a general method (called regression risk analysis) to estimate adjusted risk measures from logistic and other nonlinear multiple regression models. We show how to estimate standard errors for these estimates. These measures could supplant various approximations (e.g., adjusted odds ratio [AOR]) that may diverge, especially when outcomes are common. Study Design Regression risk analysis estimates were compared with internal standards as well as with Mantel–Haenszel estimates, Poisson and log-binomial regressions, and a widely used (but flawed) equation to calculate adjusted risk ratios (ARR) from AOR. Data Collection Data sets produced using Monte Carlo simulations. Principal Findings Regression risk analysis accurately estimates ARR and differences directly from multiple regression models, even when confounders are continuous, distributions are skewed, outcomes are common, and effect size is large. It is statistically sound and intuitive, and has properties favoring it over other methods in many cases. Conclusions Regression risk analysis should be the new standard for presenting findings from multiple regression analysis of dichotomous outcomes for cross-sectional, cohort, and population-based case–control studies, particularly when outcomes are common or effect size is large. PMID:18793213
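The predictive-margins computation behind regression risk analysis — predict every subject's risk from the fitted logistic model with exposure set to 1 and then to 0, and average each set of predictions — is straightforward. The coefficients and covariate values below are hypothetical, not from the paper's simulations:

```python
from math import exp

def inv_logit(z):
    return 1 / (1 + exp(-z))

# Hypothetical fitted model: logit P(Y=1) = b0 + b1*exposed + b2*age
b0, b1, b2 = -2.0, 0.8, 0.03

# Observed ages for the study sample (confounder values are kept as-is).
ages = [25, 40, 55, 60, 35, 50, 45, 70]

# Average predicted risk with everyone set exposed, then unexposed.
risk1 = sum(inv_logit(b0 + b1 + b2 * a) for a in ages) / len(ages)
risk0 = sum(inv_logit(b0 + b2 * a) for a in ages) / len(ages)

arr = risk1 / risk0       # adjusted risk ratio
ard = risk1 - risk0       # adjusted risk difference
print(round(arr, 2), round(ard, 3))
```

Note the divergence the abstract warns about: the adjusted odds ratio here is exp(0.8) ≈ 2.23, while the adjusted risk ratio is only about 1.5, because the outcome is common in this toy sample.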
Solution of the linear regression problem using matrix correction methods in the l1 metric
NASA Astrophysics Data System (ADS)
Gorelik, V. A.; Trembacheva (Barkalova), O. S.
2016-02-01
The linear regression problem is considered as an improper interpolation problem. The l1 metric is used to correct (approximate) all the initial data. A probabilistic justification of this metric in the case of an exponential noise distribution is given. The original improper interpolation problem is reduced to a set of a finite number of linear programming problems. The corresponding computational algorithms are implemented in MATLAB.
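The reduction to linear programming is not reproduced here, but in the one-coefficient case (fitting y = beta*x) the l1 problem has a closed-form solution: a weighted median of the ratios y_i/x_i with weights |x_i|. That makes the metric's robustness under heavy-tailed noise easy to see; the data are invented:

```python
# l1 (least absolute deviations) fit of y = beta * x. Minimizing
# sum |y_i - beta*x_i| = sum |x_i| * |y_i/x_i - beta| is solved by
# the weighted median of the ratios y_i/x_i with weights |x_i|.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 2.0, 3.2, 4.1, 25.0]   # last point is a gross outlier

pairs = sorted((yi / xi, abs(xi)) for xi, yi in zip(x, y))
half = sum(w for _, w in pairs) / 2
acc = 0.0
for ratio, w in pairs:
    acc += w
    if acc >= half:
        beta_l1 = ratio
        break

# Least squares, by contrast, is dragged toward the outlier.
beta_ls = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
print(round(beta_l1, 3), round(beta_ls, 3))
```

The l1 fit stays near the slope of the clean points while the least-squares slope nearly triples, which is the behavior that motivates the exponential-noise justification in the abstract.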
NASA Astrophysics Data System (ADS)
Tan, C. H.; Matjafri, M. Z.; Lim, H. S.
2015-10-01
This paper presents prediction models which analyze and compute CO2 emissions in Malaysia. Each prediction model for CO2 emissions is analyzed for one of three main groups: transportation; electricity and heat production; and residential buildings together with commercial and public services. The prediction models were generated using data obtained from World Bank Open Data. The best-subset method was used to remove irrelevant predictors, followed by multiple linear regression to produce the prediction models. From the results, high R-squared (prediction) values were obtained, implying that the models are reliable for predicting CO2 emissions from these data. In addition, the CO2 emissions from these three groups are forecasted using trend analysis plots for observation purposes.
Madarang, Krish J; Kang, Joo-Hyon
2014-06-01
Stormwater runoff has been identified as a source of pollution for the environment, especially for receiving waters. In order to quantify and manage the impacts of stormwater runoff on the environment, predictive models and mathematical models have been developed. Predictive tools such as regression models have been widely used to predict stormwater discharge characteristics. Storm event characteristics, such as antecedent dry days (ADD), have been related to response variables, such as pollutant loads and concentrations. However, it has been a controversial issue among many studies to consider ADD as an important variable in predicting stormwater discharge characteristics. In this study, we examined the accuracy of general linear regression models in predicting discharge characteristics of roadway runoff. A total of 17 storm events were monitored in two highway segments, located in Gwangju, Korea. Data from the monitoring were used to calibrate the United States Environmental Protection Agency's Storm Water Management Model (SWMM). The calibrated SWMM was simulated for 55 storm events, and the results for total suspended solid (TSS) discharge loads and event mean concentrations (EMC) were extracted. From these data, linear regression models were developed. R2 and p-values of the regression on ADD for both TSS loads and EMCs were investigated. Results showed that pollutant loads were better predicted than pollutant EMCs in the multiple regression models. Regression may not capture the true effect of site-specific characteristics, due to uncertainty in the data. PMID:25079842
ERIC Educational Resources Information Center
Nelson, Dean
2009-01-01
Following the Guidelines for Assessment and Instruction in Statistics Education (GAISE) recommendation to use real data, an example is presented in which simple linear regression is used to evaluate the effect of the Montreal Protocol on atmospheric concentration of chlorofluorocarbons. This simple set of data, obtained from a public archive, can…
A Comparison of Robust and Nonparametric Estimators under the Simple Linear Regression Model.
ERIC Educational Resources Information Center
Nevitt, Jonathan; Tam, Hak P.
This study investigates parameter estimation under the simple linear regression model for situations in which the underlying assumptions of ordinary least squares estimation are untenable. Classical nonparametric estimation methods are directly compared against some robust estimation methods for conditions in which varying degrees of outliers are…
ERIC Educational Resources Information Center
Yan, Jun; Aseltine, Robert H., Jr.; Harel, Ofer
2013-01-01
Comparing regression coefficients between models when one model is nested within another is of great practical interest when two explanations of a given phenomenon are specified as linear models. The statistical problem is whether the coefficients associated with a given set of covariates change significantly when other covariates are added into…
A Simple and Convenient Method of Multiple Linear Regression to Calculate Iodine Molecular Constants
ERIC Educational Resources Information Center
Cooper, Paul D.
2010-01-01
A new procedure using a student-friendly least-squares multiple linear-regression technique utilizing a function within Microsoft Excel is described that enables students to calculate molecular constants from the vibronic spectrum of iodine. This method is advantageous pedagogically as it calculates molecular constants for ground and excited…
Point Estimates and Confidence Intervals for Variable Importance in Multiple Linear Regression
ERIC Educational Resources Information Center
Thomas, D. Roland; Zhu, PengCheng; Decady, Yves J.
2007-01-01
The topic of variable importance in linear regression is reviewed, and a measure first justified theoretically by Pratt (1987) is examined in detail. Asymptotic variance estimates are used to construct individual and simultaneous confidence intervals for these importance measures. A simulation study of their coverage properties is reported, and an…
Calibrated Peer Review for Interpreting Linear Regression Parameters: Results from a Graduate Course
ERIC Educational Resources Information Center
Enders, Felicity B.; Jenkins, Sarah; Hoverman, Verna
2010-01-01
Biostatistics is traditionally a difficult subject for students to learn. While the mathematical aspects are challenging, it can also be demanding for students to learn the exact language to use to correctly interpret statistical results. In particular, correctly interpreting the parameters from linear regression is both a vital tool and a…
Due to the complexity of the processes contributing to beach bacteria concentrations, many researchers rely on statistical modeling, among which multiple linear regression (MLR) modeling is most widely used. Despite its ease of use and interpretation, there may be time dependence...
Predicting recycling behaviour: Comparison of a linear regression model and a fuzzy logic model.
Vesely, Stepan; Klöckner, Christian A; Dohnal, Mirko
2016-03-01
In this paper we demonstrate that fuzzy logic can provide a better tool for predicting recycling behaviour than the customarily used linear regression. To show this, we take a set of empirical data on recycling behaviour (N=664), which we randomly divide into two halves. The first half is used to estimate a linear regression model of recycling behaviour, and to develop a fuzzy logic model of recycling behaviour. As the first comparison, the fit of both models to the data included in estimation of the models (N=332) is evaluated. As the second comparison, predictive accuracy of both models for "new" cases (hold-out data not included in building the models, N=332) is assessed. In both cases, the fuzzy logic model significantly outperforms the regression model in terms of fit. To conclude, when accurate predictions of recycling and possibly other environmental behaviours are needed, fuzzy logic modelling seems to be a promising technique. PMID:26774211
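The split-half protocol (estimate on one half, score the hold-out half) can be sketched as follows. The data are synthetic stand-ins for the recycling survey, and only the linear-regression side of the comparison is shown, not the fuzzy logic model.

```python
import random

def fit_ols(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def r_squared(xs, ys, a, b):
    my = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

rng = random.Random(0)
# synthetic stand-in: an "attitude score" loosely driving recycling frequency
xs = [rng.uniform(0, 10) for _ in range(200)]
data = [(x, 0.8 * x + 1.0 + rng.gauss(0, 0.5)) for x in xs]
rng.shuffle(data)
train, holdout = data[:100], data[100:]   # random halves, as in the paper

a, b = fit_ols([p[0] for p in train], [p[1] for p in train])
r2_train = r_squared([p[0] for p in train], [p[1] for p in train], a, b)
r2_hold = r_squared([p[0] for p in holdout], [p[1] for p in holdout], a, b)
```

Scoring on cases not used in estimation (r2_hold) is the honest measure of predictive accuracy; in-sample fit (r2_train) is optimistic by construction.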
Ong, Hong Choon; Alih, Ekele
2015-01-01
The tendency for experimental and industrial variables to include a certain proportion of outliers has become a rule rather than an exception. These clusters of outliers, if left undetected, have the capability to distort the mean and the covariance matrix of the Hotelling's T² multivariate control charts constructed to monitor individual quality characteristics. The effect of this distortion is that the control chart constructed from it becomes unreliable, as it exhibits masking and swamping, phenomena in which an out-of-control process is erroneously declared in-control or an in-control process is erroneously declared out-of-control. To handle these problems, this article proposes a control chart that is based on cluster-regression adjustment for retrospective monitoring of individual quality characteristics in a multivariate setting. The performance of the proposed method is investigated through Monte Carlo simulation experiments and historical datasets. Results obtained indicate that the proposed method is an improvement over state-of-the-art methods in terms of outlier detection as well as keeping masking and swamping rates under control. PMID:25923739
NASA Technical Reports Server (NTRS)
McKissick, Burnell T. (Technical Monitor); Plassman, Gerald E.; Mall, Gerald H.; Quagliano, John R.
2005-01-01
Linear multivariable regression models for predicting day and night Eddy Dissipation Rate (EDR) from available meteorological data sources are defined and validated. Model definition is based on a combination of 1997-2000 Dallas/Fort Worth (DFW) data sources, EDR from Aircraft Vortex Spacing System (AVOSS) deployment data, and regression variables primarily from corresponding Automated Surface Observation System (ASOS) data. Model validation is accomplished through EDR predictions on a similar combination of 1994-1995 Memphis (MEM) AVOSS and ASOS data. Model forms include an intercept plus a single term of fixed optimal power for each of these regression variables: 30-minute forward-averaged mean and variance of near-surface wind speed and temperature, variance of wind direction, and a discrete cloud cover metric. Distinct day and night models, regressing on EDR and the natural log of EDR respectively, yield best performance and avoid model discontinuity over day/night data boundaries.
A General Linear Model Approach to Adjusting the Cumulative GPA.
ERIC Educational Resources Information Center
Young, John W.
A general linear model (GLM), using least-squares techniques, was used to develop a criterion measure to replace freshman year grade point average (GPA) in college admission predictive validity studies. Problems with the use of GPA include those associated with the combination of grades from different courses and disciplines into a single measure,…
Hoffman, Haydn; Lee, Sunghoon I; Garst, Jordan H; Lu, Derek S; Li, Charles H; Nagasawa, Daniel T; Ghalehsari, Nima; Jahanforouz, Nima; Razaghy, Mehrdad; Espinal, Marie; Ghavamrezaii, Amir; Paak, Brian H; Wu, Irene; Sarrafzadeh, Majid; Lu, Daniel C
2015-09-01
This study introduces the use of multivariate linear regression (MLR) and support vector regression (SVR) models to predict postoperative outcomes in a cohort of patients who underwent surgery for cervical spondylotic myelopathy (CSM). Currently, predicting outcomes after surgery for CSM remains a challenge. We recruited patients who had a diagnosis of CSM and required decompressive surgery with or without fusion. Fine motor function was tested preoperatively and postoperatively with a handgrip-based tracking device that has been previously validated, yielding mean absolute accuracy (MAA) results for two tracking tasks (sinusoidal and step). All patients completed Oswestry disability index (ODI) and modified Japanese Orthopaedic Association questionnaires preoperatively and postoperatively. Preoperative data was utilized in MLR and SVR models to predict postoperative ODI. Predictions were compared to the actual ODI scores with the coefficient of determination (R²) and mean absolute difference (MAD). From this, 20 patients met the inclusion criteria and completed follow-up at least 3 months after surgery. With the MLR model, a combination of the preoperative ODI score, preoperative MAA (step function), and symptom duration yielded the best prediction of postoperative ODI (R² = 0.452; MAD = 0.0887; p = 1.17 × 10⁻³). With the SVR model, a combination of preoperative ODI score, preoperative MAA (sinusoidal function), and symptom duration yielded the best prediction of postoperative ODI (R² = 0.932; MAD = 0.0283; p = 5.73 × 10⁻¹²). The SVR model was more accurate than the MLR model. The SVR can be used preoperatively in risk/benefit analysis and the decision to operate. PMID:26115898
CANFIS: A non-linear regression procedure to produce statistical air-quality forecast models
Burrows, W.R.; Montpetit, J.; Pudykiewicz, J.
1997-12-31
Statistical models for forecasts of environmental variables can provide a good trade-off between significance and precision in return for substantial saving of computer execution time. Recent non-linear regression techniques give significantly increased accuracy compared to traditional linear regression methods. Two are Classification and Regression Trees (CART) and the Neuro-Fuzzy Inference System (NFIS). Both can model predictand distributions, including the tails, with much better accuracy than linear regression. Given a learning data set of matched predictands and predictors, CART regression produces a non-linear, tree-based, piecewise-continuous model of the predictand data. Its variance-minimizing procedure optimizes the task of predictor selection, often greatly reducing initial data dimensionality. NFIS reduces dimensionality by a procedure known as subtractive clustering, but it does not of itself eliminate predictors. Overlapping coverage in predictor space is enhanced by NFIS with a Gaussian membership function for each cluster component. Coefficients for a continuous response model based on the fuzzified cluster centers are obtained by a least-squares estimation procedure. CANFIS is a two-stage data-modeling technique that combines the strength of CART, which optimizes the process of selecting predictors from a large pool of potential predictors, with the modeling strength of NFIS. A CANFIS model requires negligible computer time to run. CANFIS models for ground-level O₃, particulates, and other pollutants will be produced for each of about 100 Canadian sites. The air-quality models will run twice daily using a small number of predictors isolated from a large pool of upstream and local Lagrangian potential predictors.
About the multiple linear regressions applied in studying the solvatochromic effects.
Dorohoi, Dana-Ortansa
2010-03-01
Statistical analysis is applied to the study of solvatochromic effects, using the solvent parameters (regressors) that influence the spectral shifts in electronic spectra. The procedure identifies and eliminates the non-significant parameters and the aberrant points (for which supplemental interactions were neglected in the underlying theories) from the multi-linear regression. A BASIC program allows these steps to be followed one by one. To illustrate the regression procedure, the wavenumbers of the maximum pi-pi* absorption band of three benzene derivatives in various solvents were used. PMID:20089443
NASA Astrophysics Data System (ADS)
Kozubek, M.; Rozanov, E.; Krizan, P.
2014-09-01
The stratosphere is influenced by many external forcings, natural and anthropogenic. Many studies have focused on this problem, which allows us to compare our results with theirs. This study focuses on the variability and trends of temperature and circulation characteristics (zonal and meridional wind components) in connection with the variation of different phenomena in the stratosphere and lower mesosphere. We consider the interactions between the troposphere-stratosphere-lower mesosphere system and external and internal phenomena, e.g. the solar cycle, QBO, NAO or ENSO, using multiple linear regression techniques. The analysis was applied to the period 1979-2012 based on current reanalysis data, mainly the MERRA (Modern Era Retrospective-analysis for Research and Applications) dataset for pressure levels 1000-0.1 hPa. We do not find a strong temperature signal for solar flux over the tropics at about 30 hPa (an ERA-40 result), but a strong positive signal is observed near the stratopause over almost the whole analyzed area. This could indicate that solar forcing is not well represented at the higher pressure levels in MERRA. The analysis of ENSO and ENSO Modoki shows that more than one ENSO index should be taken into account in similar analyses. Previous studies show that volcanic activity is an important parameter; however, the signal of volcanic activity in MERRA is very weak and insignificant.
Liu, Dawei; Lin, Xihong; Ghosh, Debashis
2007-12-01
We consider a semiparametric regression model that relates a normal outcome to covariates and a genetic pathway, where the covariate effects are modeled parametrically and the pathway effect of multiple gene expressions is modeled parametrically or nonparametrically using least-squares kernel machines (LSKMs). This unified framework allows a flexible function for the joint effect of multiple genes within a pathway by specifying a kernel function and allows for the possibility that each gene expression effect might be nonlinear and the genes within the same pathway are likely to interact with each other in a complicated way. This semiparametric model also makes it possible to test for the overall genetic pathway effect. We show that the LSKM semiparametric regression can be formulated using a linear mixed model. Estimation and inference hence can proceed within the linear mixed model framework using standard mixed model software. Both the regression coefficients of the covariate effects and the LSKM estimator of the genetic pathway effect can be obtained using the best linear unbiased predictor in the corresponding linear mixed model formulation. The smoothing parameter and the kernel parameter can be estimated as variance components using restricted maximum likelihood. A score test is developed to test for the genetic pathway effect. Model/variable selection within the LSKM framework is discussed. The methods are illustrated using a prostate cancer data set and evaluated using simulations. PMID:18078480
Adjusting for matching and covariates in linear discriminant analysis
Asafu-Adjei, Josephine K.; Sampson, Allan R.; Sweet, Robert A.; Lewis, David A.
2013-01-01
In studies that compare several diagnostic or treatment groups, subjects may not only be measured on a certain set of feature variables, but also be matched on a number of demographic characteristics and measured on additional covariates. Linear discriminant analysis (LDA) is sometimes used to identify which feature variables best discriminate among groups, while accounting for the dependencies among the feature variables. We present a new approach to LDA for multivariate normal data that accounts for the subject matching used in a particular study design, as well as covariates not used in the matching. Applications are given for post-mortem tissue data with the aim of comparing neurobiological characteristics of subjects with schizophrenia with those of normal controls, and for a post-mortem tissue primate study comparing brain biomarker measurements across three treatment groups. We also investigate the performance of our approach using a simulation study. PMID:23640791
Island embryo regression driven by a beam of self-ions in the linear regime
NASA Astrophysics Data System (ADS)
Flynn, C. P.
2010-10-01
The kinetics of island growth and regression are discussed under the approximation of linear response, including the Gibbs-Thomson potential, for a reacting assembly of adatoms and advacancies (thermal defects) on a surface irradiated with a beam of self-ions. First the quasistatic growth or shrinkage rate, for islands of size n less than the critical size n̂, is calculated for the driven system, exactly, for linear response. This result is employed to determine successively: (i) the regression rate of driven embryo islands with n < n̂; and (ii) the structure of the steady-state decay chain established when embryos of a particular size n₀ < n̂ are created by ion-beam impacts. The changed embryo distribution caused by irradiation differs markedly from the embryo populations at equilibrium.
User's Guide to the Weighted-Multiple-Linear Regression Program (WREG version 1.0)
Eng, Ken; Chen, Yin-Yu; Kiang, Julie E.
2009-01-01
Streamflow is not measured at every location in a stream network. Yet hydrologists, State and local agencies, and the general public still seek to know streamflow characteristics, such as mean annual flow or flood flows with different exceedance probabilities, at ungaged basins. The goals of this guide are to introduce and familiarize the user with the weighted multiple-linear regression (WREG) program, and to also provide the theoretical background for program features. The program is intended to be used to develop a regional estimation equation for streamflow characteristics that can be applied at an ungaged basin, or to improve the corresponding estimate at continuous-record streamflow gages with short records. The regional estimation equation results from a multiple-linear regression that relates the observable basin characteristics, such as drainage area, to streamflow characteristics.
SERF: A Simple, Effective, Robust, and Fast Image Super-Resolver From Cascaded Linear Regression.
Hu, Yanting; Wang, Nannan; Tao, Dacheng; Gao, Xinbo; Li, Xuelong
2016-09-01
Example learning-based image super-resolution techniques estimate a high-resolution image from a low-resolution input image by relying on high- and low-resolution image pairs. An important issue for these techniques is how to model the relationship between high- and low-resolution image patches: most existing complex models either generalize poorly to diverse natural images or require a lot of time for model training, while simple models have limited representation capability. In this paper, we propose a simple, effective, robust, and fast (SERF) image super-resolver. The proposed super-resolver is based on a series of linear least-squares functions, namely cascaded linear regression. It has few parameters to control the model and is thus able to robustly adapt to different image data sets and experimental settings. The linear least-squares functions lead to closed-form solutions and therefore achieve computationally efficient implementations. To effectively decrease the gap between estimated patches and the ground truth, we group image patches into clusters via the k-means algorithm and learn a linear regressor for each cluster at each iteration. The cascaded learning process gradually decreases the high-frequency detail gap between the estimated high-resolution image patch and the ground truth image patch and simultaneously obtains the linear regression parameters. Experimental results show that the proposed method achieves superior performance with lower time consumption than the state-of-the-art methods. PMID:27323364
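A minimal sketch of the cluster-then-regress idea behind SERF, on a 1-D toy problem rather than image patches (all data invented): a tiny two-centre k-means splits the feature axis, and a separate linear regressor per cluster captures a piecewise-linear mapping that a single global line cannot.

```python
def fit_ols(pts):
    n = len(pts)
    mx = sum(p[0] for p in pts) / n
    my = sum(p[1] for p in pts) / n
    b = sum((p[0] - mx) * (p[1] - my) for p in pts) / sum((p[0] - mx) ** 2 for p in pts)
    return my - b * mx, b

def mse(pts, a, b):
    return sum((p[1] - (a + b * p[0])) ** 2 for p in pts) / len(pts)

# 1-D stand-in for "patch" data: slope 1 below x = 5, slope 3 above (continuous at 5)
pts = [(x / 10.0, x / 10.0 if x < 50 else 3 * x / 10.0 - 10.0) for x in range(100)]

# two-centre k-means on the scalar feature
c0, c1 = 0.0, 10.0
for _ in range(10):
    g0 = [p for p in pts if abs(p[0] - c0) <= abs(p[0] - c1)]
    g1 = [p for p in pts if abs(p[0] - c0) > abs(p[0] - c1)]
    c0 = sum(p[0] for p in g0) / len(g0)
    c1 = sum(p[0] for p in g1) / len(g1)

err_global = mse(pts, *fit_ols(pts))                          # one regressor for all
err_cluster = (mse(g0, *fit_ols(g0)) * len(g0)
               + mse(g1, *fit_ols(g1)) * len(g1)) / len(pts)  # one per cluster
```

Here the per-cluster fit is essentially exact while the single global line is not, which is the motivation for learning a regressor per cluster at each cascade stage.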
Speaker adaptation of HMMs using evolutionary strategy-based linear regression
NASA Astrophysics Data System (ADS)
Selouani, Sid-Ahmed; O'Shaughnessy, Douglas
2002-05-01
A new framework for speaker adaptation of continuous-density hidden Markov models (HMMs) is introduced. It aims to improve the robustness of speech recognizers by adapting HMM parameters to new conditions (e.g., from new speakers). It describes an optimization technique using an evolutionary strategy for linear regression-based spectral transformation. In classical iterative maximum likelihood linear regression (MLLR), a global transform matrix is estimated to make a general model better match particular target conditions. To permit adaptation on a small amount of data, a regression tree classification is performed. However, an important drawback of MLLR is that the number of regression classes is fixed. The new approach allows the degree of freedom of the global transform to be implicitly variable, as the evolutionary optimization permits the survival of only active classes. The fitness function is evaluated by the phoneme correctness through the evolution steps. The implementation requirements such as chromosome representation, selection function, genetic operators, and evaluation function have been chosen in order to lend more reliability to the global transformation matrix. Triphone experiments used the TIMIT and ARPA-RM1 databases. For new speakers, the new technique achieves 8 percent fewer word errors than the basic MLLR method.
Kondric, Miran; Trajkovski, Biljana; Strbad, Maja; Foretić, Nikola; Zenić, Natasa
2013-12-01
There is an evident lack of studies investigating the morphological influence on physical fitness (PF) among preschool children. The aim of this study was to (1) calculate and interpret linear and nonlinear relationships between simple anthropometric predictors and PF criteria among preschoolers of both genders, and (2) find critical values of the anthropometric predictors which should be recognized as the breakpoint of the negative influence on PF. The sample of subjects consisted of 413 preschoolers aged 4 to 6 (mean age, 5.08 years; 176 girls and 237 boys) from Rijeka, Croatia. The anthropometric variables included body height (BH), body weight (BW), sum of triceps and subscapular skinfolds (SUMSF), and calculated BMI (BMI = BW (kg)/BH (m)²). PF was screened through tests of flexibility, repetitive strength, explosive strength, and agility. Linear and nonlinear (general quadratic model y = a + bx + cx²) regressions were calculated and interpreted simultaneously. BH and BW are far better predictors of physical fitness status than BMI and SUMSF. In all calculated regressions excluding the flexibility criterion, linear and nonlinear prediction of PF from BH and BW reached statistical significance, indicating an influence of advancing maturity status on the PF variables. Differences between linear and nonlinear regressions are smaller in males than in females. There are some indications that the age of 4 to 6 years is a critical period in the prevention of obesity, mostly because the extensively studied and proven negative influence of overweight and adiposity on PF tests is not yet evident. In some cases we found evident regression breakpoints (approximately 25 kg in boys), which should be interpreted as critical values of the anthropometric measures for the studied sample of subjects. PMID:24611341
Christophersen, A; McKinley-McKee, J S
1984-01-01
An interactive program for analysing enzyme activity-time data using non-linear regression analysis is described. Protection studies can also be dealt with. The program computes inactivation rates, dissociation constants and promotion or inhibition parameters with their standard errors. It can also be used to distinguish different inactivation models. The program is written in SIMULA and is menu-oriented for refining or correcting data at the different levels of computing. PMID:6546558
González-Aparicio, I; Hidalgo, J; Baklanov, A; Padró, A; Santa-Coloma, O
2013-07-01
There is extensive evidence of the negative impacts on health linked to the rise of the regional background of particulate matter (PM10) levels. These levels are often increased over urban areas, becoming one of the main air pollution concerns, as is the case in the Bilbao metropolitan area, Spain. This study describes a data-driven model to diagnose PM10 levels in Bilbao at hourly intervals. The model is built with a training period of 7 years of historical data covering different urban environments (inland, city centre and coastal sites). The explanatory variables are quantitative (log[NO2], temperature, short-wave incoming radiation, wind speed and direction, specific humidity, hour and vehicle intensity) and qualitative (working days/weekends, season (winter/summer), the hour (from 00 to 23 UTC) and precipitation/no precipitation). Three different linear regression models are compared: simple linear regression; linear regression with interaction terms (INT); and linear regression with interaction terms following Sawa's Bayesian Information Criterion (INT-BIC). Each type of model is calculated over two different periods: the training dataset (6 years) and the testing dataset (1 year). The results show that the INT-BIC-based model (R² = 0.42) is the best. Results were R of 0.65, 0.63 and 0.60 for the city centre, inland and coastal sites, respectively, a level of confidence similar to state-of-the-art methodology. The related error calculated for longer time intervals (monthly or seasonal means) diminished significantly (R of 0.75-0.80 for monthly means and R of 0.80-0.98 for seasonal means) with respect to shorter periods. PMID:23247520
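Comparing a main-effects model against an interaction model via an information criterion can be sketched as follows. Note the hedges: the data are synthetic (not the Bilbao series), and the criterion is the common n·ln(RSS/n) + k·ln(n) BIC form rather than Sawa's BIC used in the study.

```python
import math
import random

def solve(A, b):
    """Gaussian elimination with partial pivoting for small linear systems."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def ols_rss(X, y):
    """Least squares via normal equations; returns the residual sum of squares."""
    n, k = len(X), len(X[0])
    XtX = [[sum(X[i][p] * X[i][q] for i in range(n)) for q in range(k)] for p in range(k)]
    Xty = [sum(X[i][p] * y[i] for i in range(n)) for p in range(k)]
    beta = solve(XtX, Xty)
    return sum((y[i] - sum(X[i][p] * beta[p] for p in range(k))) ** 2 for i in range(n))

def bic(rss, n, k):
    # common BIC form, not Sawa's criterion used in the paper
    return n * math.log(rss / n) + k * math.log(n)

rng = random.Random(1)
n = 300
x1 = [rng.uniform(0, 1) for _ in range(n)]
x2 = [rng.uniform(0, 1) for _ in range(n)]
# invented truth with a genuine interaction effect
y = [1 + 2 * u + 3 * v + 4 * u * v + rng.gauss(0, 0.1) for u, v in zip(x1, x2)]

rss_main = ols_rss([[1.0, u, v] for u, v in zip(x1, x2)], y)
rss_int = ols_rss([[1.0, u, v, u * v] for u, v in zip(x1, x2)], y)
bic_main, bic_int = bic(rss_main, n, 3), bic(rss_int, n, 4)
```

When a real interaction is present, the drop in RSS outweighs the extra-parameter penalty and the interaction model wins on BIC, which is the logic behind the study's INT-BIC variant.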
Dhanya, S; Kumari Roshni, V S
2016-01-01
Textures play an important role in image classification. This paper proposes a high performance texture classification method using a combination of multiresolution analysis tool and linear regression modelling by channel elimination. The correlation between different frequency regions has been validated as a sort of effective texture characteristic. This method is motivated by the observation that there exists a distinctive correlation between the image samples belonging to the same kind of texture, at different frequency regions obtained by a wavelet transform. Experimentally, it is observed that this correlation differs across textures. The linear regression modelling is employed to analyze this correlation and extract texture features that characterize the samples. Our method considers not only the frequency regions but also the correlation between these regions. This paper primarily focuses on applying the Dual Tree Complex Wavelet Packet Transform and the Linear Regression model for classification of the obtained texture features. Additionally the paper also presents a comparative assessment of the classification results obtained from the above method with two more types of wavelet transform methods namely the Discrete Wavelet Transform and the Discrete Wavelet Packet Transform. PMID:26835234
Yang, Xiaowei; Nie, Kun
2008-03-15
Longitudinal data sets in biomedical research often consist of large numbers of repeated measures. In many cases, the trajectories do not look globally linear or polynomial, making it difficult to summarize the data or test hypotheses using standard longitudinal data analysis based on various linear models. An alternative approach is to apply the approaches of functional data analysis, which directly target the continuous nonlinear curves underlying discretely sampled repeated measures. For the purposes of data exploration, many functional data analysis strategies have been developed based on various schemes of smoothing, but fewer options are available for making causal inferences regarding predictor-outcome relationships, a common task seen in hypothesis-driven medical studies. To compare groups of curves, two testing strategies with good power have been proposed for high-dimensional analysis of variance: the Fourier-based adaptive Neyman test and the wavelet-based thresholding test. Using a smoking cessation clinical trial data set, this paper demonstrates how to extend the strategies for hypothesis testing into the framework of functional linear regression models (FLRMs) with continuous functional responses and categorical or continuous scalar predictors. The analysis procedure consists of three steps: first, apply the Fourier or wavelet transform to the original repeated measures; then fit a multivariate linear model in the transformed domain; and finally, test the regression coefficients using either adaptive Neyman or thresholding statistics. Since a FLRM can be viewed as a natural extension of the traditional multiple linear regression model, the development of this model and computational tools should enhance the capacity of medical statistics for longitudinal data. PMID:17610294
2013-01-01
Background: Integrase inhibitors (INI) form a new drug class in the treatment of HIV-1 patients. We developed a linear regression modeling approach to make a quantitative raltegravir (RAL) resistance phenotype prediction, as fold change in IC50 against a wild-type virus, from mutations in the integrase genotype. Methods: We developed a clonal genotype-phenotype database with 991 clones from 153 clinical isolates of INI-naïve and RAL-treated patients, and 28 site-directed mutants. We developed the RAL linear regression model in two stages, employing a genetic algorithm (GA) to select integrase mutations by consensus. First, we ran multiple GAs to generate first-order linear regression models (GA models) that were stochastically optimized to reach a target R² accuracy, and consisted of a fixed-length subset of integrase mutations to estimate INI resistance. Secondly, we derived a consensus linear regression model in a forward stepwise regression procedure, considering integrase mutations or mutation pairs by descending prevalence in the GA models. Results: The most frequently occurring mutations in the GA models were 92Q, 97A, 143R and 155H (all 100%), 143G (90%), 148H/R (89%), 148K (88%), 151I (81%), 121Y (75%), 143C (72%), and 74M (69%). The RAL second-order model contained 30 single mutations and five mutation pairs (p < 0.01): 143C/R&97A, 155H&97A/151I and 74M&151I. The R² performance of this model on the clonal training data was 0.97, and 0.78 on an unseen population genotype-phenotype dataset of 171 clinical isolates from RAL-treated and INI-naïve patients. Conclusions: We describe a systematic approach to derive a model for predicting INI resistance from a limited amount of clonal samples. Our RAL second-order model is made available as an Additional file for calculating a resistance phenotype as the sum of integrase mutations and mutation pairs. PMID:23282253
Distributed Monitoring of the R(sup 2) Statistic for Linear Regression
NASA Technical Reports Server (NTRS)
Bhaduri, Kanishka; Das, Kamalika; Giannella, Chris R.
2011-01-01
The problem of monitoring a multivariate linear regression model is relevant in studying the evolving relationship between a set of input variables (features) and one or more dependent target variables. This problem becomes challenging for large-scale data in a distributed computing environment when only a subset of instances is available at individual nodes and the local data changes frequently. Data centralization and periodic model recomputation can add high overhead to tasks like anomaly detection in such dynamic settings. Therefore, the goal is to develop techniques for monitoring and updating the model over the union of all nodes' data in a communication-efficient fashion. Correctness guarantees on such techniques are also often highly desirable, especially in safety-critical application scenarios. In this paper we develop DReMo, a distributed algorithm with very low resource overhead, for monitoring the quality of a regression model in terms of its coefficient of determination (R² statistic). When the nodes collectively determine that R² has dropped below a fixed threshold, the linear regression model is recomputed via a network-wide convergecast and the updated model is broadcast back to all nodes. We show empirically, using both synthetic and real data, that our proposed method is highly communication-efficient and scalable, and also provide theoretical guarantees on correctness.
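The threshold-triggered recomputation can be sketched in a centralized toy form (the actual DReMo protocol is distributed and communication-efficient; the batch data here are invented): score the current model's R² on each new batch and refit only when it drops below the threshold.

```python
def fit_ols(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def r_squared(xs, ys, a, b):
    my = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

THRESHOLD = 0.8
batches = [
    [(float(x), 2.0 * x) for x in range(20)],          # initial fit
    [(float(x), 2.0 * x + 0.1) for x in range(20)],    # same relationship: keep model
    [(float(x), -1.5 * x + 30.0) for x in range(20)],  # relationship drifts: refit
]
xs0, ys0 = [p[0] for p in batches[0]], [p[1] for p in batches[0]]
model = fit_ols(xs0, ys0)
refits = 0
for batch in batches[1:]:
    xs, ys = [p[0] for p in batch], [p[1] for p in batch]
    if r_squared(xs, ys, *model) < THRESHOLD:
        model = fit_ols(xs, ys)  # stands in for the network-wide convergecast recompute
        refits += 1
```

The point of the monitoring criterion is exactly this laziness: the model is only recomputed when its explanatory power on current data actually degrades.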
Model Averaging Methods for Weight Trimming in Generalized Linear Regression Models
Elliott, Michael R.
2012-01-01
In sample surveys where units have unequal probabilities of inclusion, associations between the inclusion probability and the statistic of interest can induce bias in unweighted estimates. This is true even in regression models, where the estimates of the population slope may be biased if the underlying mean model is misspecified or the sampling is nonignorable. Weights equal to the inverse of the probability of inclusion are often used to counteract this bias. Highly disproportional sample designs have highly variable weights; weight trimming reduces large weights to a maximum value, reducing variability but introducing bias. Most standard approaches are ad hoc in that they do not use the data to optimize bias-variance trade-offs. This article uses Bayesian model averaging to create “data driven” weight trimming estimators. We extend previous results for linear regression models (Elliott 2008) to generalized linear regression models, developing robust models that approximate fully-weighted estimators when bias correction is of greatest importance, and approximate unweighted estimators when variance reduction is critical. PMID:23275683
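For illustration, here is the basic weight-trimming operation the article improves upon: survey weights are capped at a maximum value. Redistributing the trimmed mass uniformly is one common variant, not the article's Bayesian model-averaging estimator; the numbers are hypothetical.

```python
def trim_weights(weights, cap):
    """Cap weights at `cap`; redistribute the trimmed mass uniformly
    so the total weight is preserved (one common convention)."""
    trimmed = [min(w, cap) for w in weights]
    excess = sum(weights) - sum(trimmed)
    return [w + excess / len(weights) for w in trimmed]

weights = [1.0, 2.0, 50.0, 3.0]        # one highly disproportional weight
capped = trim_weights(weights, cap=10.0)
print(capped)                          # total weight is unchanged
```

Trimming reduces the variance contributed by the extreme weight at the cost of some bias, which is exactly the trade-off the model-averaging estimators are designed to navigate.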
Multiple regression technique for Pth degree polynomials with and without linear cross products
NASA Technical Reports Server (NTRS)
Davis, J. W.
1973-01-01
A multiple regression technique was developed by which the nonlinear behavior of specified independent variables can be related to a given dependent variable. The polynomial expression can be of Pth degree and can incorporate N independent variables. Two cases are treated such that mathematical models can be studied both with and without linear cross products. The resulting surface fits can be used to summarize trends for a given phenomenon and provide a mathematical relationship for subsequent analysis. To implement this technique, separate computer programs were developed for the case without linear cross products and for the case incorporating such cross products, which evaluate the various constants in the model regression equation. In addition, the significance of the estimated regression equation is considered, and the standard deviation, the F statistic, the maximum absolute percent error, and the average of the absolute values of the percent error are evaluated. The computer programs and their manner of utilization are described. Sample problems are included to illustrate the use and capability of the technique, which show the output formats and typical plots comparing computer results to each set of input data.
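A minimal sketch of the design-matrix construction this technique implies: pure powers up to degree P per variable, with optional pairwise linear cross products (hypothetical data; not the original programs).

```python
import numpy as np
from itertools import combinations

def design_matrix(X, degree, cross_products=False):
    """Columns: intercept; x_j, x_j^2, ..., x_j^degree for each variable;
    optionally the pairwise linear cross products x_i * x_j."""
    n, m = X.shape
    cols = [np.ones(n)]
    for j in range(m):
        for p in range(1, degree + 1):
            cols.append(X[:, j] ** p)
    if cross_products:
        for i, j in combinations(range(m), 2):
            cols.append(X[:, i] * X[:, j])
    return np.column_stack(cols)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 1 + 2 * X[:, 0] - X[:, 1] ** 2 + 0.5 * X[:, 0] * X[:, 1]  # noiseless test

A = design_matrix(X, degree=2, cross_products=True)
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(beta, 3))   # recovers [1, 2, 0, 0, -1, 0.5]
```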
Barks, C.S.
1995-01-01
Storm-runoff water-quality data were used to verify and, when appropriate, adjust regional regression models previously developed to estimate urban storm-runoff loads and mean concentrations in Little Rock, Arkansas. Data collected at 5 representative sites during 22 storms from June 1992 through January 1994 compose the Little Rock data base. Comparison of observed values (O) of storm-runoff loads and mean concentrations to the predicted values (Pu) from the regional regression models for nine constituents (chemical oxygen demand, suspended solids, total nitrogen, total ammonia plus organic nitrogen as nitrogen, total phosphorus, dissolved phosphorus, total recoverable copper, total recoverable lead, and total recoverable zinc) shows large prediction errors ranging from 63 to several thousand percent. Prediction errors for six of the regional regression models are less than 100 percent, and can be considered reasonable for water-quality models. Differences between O and Pu are due to variability in the Little Rock data base and error in the regional models. Where applicable, a model adjustment procedure (termed MAP-R-P) based upon regression of O against Pu was applied to improve predictive accuracy. For 11 of the 18 regional water-quality models, O and Pu are significantly correlated, that is, much of the variation in O is explained by the regional models. Five of these 11 regional models consistently overestimate O; therefore, MAP-R-P can be used to provide a better estimate. For the remaining seven regional models, O and Pu are not significantly correlated, thus neither the unadjusted regional models nor the MAP-R-P is appropriate. A simple estimator, such as the mean of the observed values, may be used if the regression models are not appropriate. Standard error of estimate of the adjusted models ranges from 48 to 130 percent. Calibration results may be biased due to the limited data set sizes in the Little Rock data base. The relatively large values of
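A minimal sketch of the core MAP-R-P idea: regress observed values O on regional-model predictions Pu, then use the fitted line to adjust new predictions. The numbers below are hypothetical, not from the Little Rock data base.

```python
# Hypothetical observed loads O and regional-model predictions Pu
# for a constituent whose regional model consistently overestimates.
O  = [12.0, 30.0, 45.0, 80.0, 150.0]
Pu = [20.0, 55.0, 90.0, 160.0, 310.0]

# Simple linear regression of O against Pu (least squares by hand).
n = len(O)
mean_p, mean_o = sum(Pu) / n, sum(O) / n
slope = sum((p - mean_p) * (o - mean_o) for p, o in zip(Pu, O)) / \
        sum((p - mean_p) ** 2 for p in Pu)
intercept = mean_o - slope * mean_p

def adjusted(p):
    """MAP-style adjusted estimate for a new regional prediction p."""
    return intercept + slope * p

print(round(slope, 3), round(intercept, 3), round(adjusted(100.0), 1))
```

Because the fitted slope is well below 1, the adjusted estimate pulls the overestimating regional prediction down toward the locally observed scale.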
Aboveground biomass and carbon stocks modelling using non-linear regression model
NASA Astrophysics Data System (ADS)
Ain Mohd Zaki, Nurul; Abd Latif, Zulkiflee; Nazip Suratman, Mohd; Zainee Zainal, Mohd
2016-06-01
Aboveground biomass (AGB) is an important source of uncertainty in carbon estimation for tropical forests due to the variation in species biodiversity and the complex structure of the tropical rain forest. Nevertheless, the tropical rainforest constitutes the most extensive forest in the world, with a vast diversity of trees with layered canopies. The use of optical sensors integrated with empirical models is a common way to assess AGB. Through regression, a linkage between remote sensing and biophysical parameters of the forest may be made. Therefore, this paper examines the accuracy of a non-linear regression equation of quadratic form for estimating the AGB and carbon stocks of the tropical lowland Dipterocarp forest of the Ayer Hitam forest reserve, Selangor. The main aim of this investigation is to obtain the relationship between biophysical parameters from field plots and remotely-sensed data using a nonlinear regression model. The results showed a good relationship between crown projection area (CPA) and carbon stocks (CS), with a significant Pearson correlation (p < 0.01) and a coefficient of correlation (r) of 0.671. The study concluded that the integration of Worldview-3 imagery with a LiDAR-based canopy height model (CHM) raster was useful for quantifying the AGB and carbon stocks over a larger sample area of the lowland Dipterocarp forest.
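A minimal sketch of fitting a quadratic relationship of the CPA-versus-carbon-stock kind described above, using synthetic numbers rather than the Ayer Hitam field plots:

```python
import numpy as np

# Hypothetical crown projection area (m^2) and carbon stock values
cpa = np.array([5, 10, 15, 20, 25, 30, 35, 40], dtype=float)
cs = 2.0 + 1.5 * cpa + 0.05 * cpa ** 2   # synthetic quadratic relationship

coeffs = np.polyfit(cpa, cs, deg=2)      # highest power first: [c2, c1, c0]
cs_hat = np.polyval(coeffs, cpa)
r = np.corrcoef(cs, cs_hat)[0, 1]        # correlation of fitted vs. observed
print(np.round(coeffs, 3), round(r, 3))
```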
NASA Astrophysics Data System (ADS)
Deglint, Jason; Kazemzadeh, Farnoud; Wong, Alexander; Clausi, David A.
2015-09-01
One method to acquire multispectral images is to sequentially capture a series of images where each image contains information from a different bandwidth of light. Another method is to use a series of beamsplitters and dichroic filters to guide different bandwidths of light onto different cameras. However, these methods are very time consuming and expensive and perform poorly in dynamic scenes or when observing transient phenomena. An alternative strategy to capturing multispectral data is to infer this data using sparse spectral reflectance measurements captured using an imaging device with overlapping bandpass filters, such as a consumer digital camera using a Bayer filter pattern. Currently the only method of inferring dense reflectance spectra is the Wiener adaptive filter, which makes Gaussian assumptions about the data. However, these assumptions may not always hold true for all data. We propose a new technique to infer dense reflectance spectra from sparse spectral measurements through the use of a non-linear regression model. The non-linear regression model used in this technique is the random forest model, which is an ensemble of decision trees and trained via the spectral characterization of the optical imaging system and spectral data pair generation. This model is then evaluated by spectrally characterizing different patches on the Macbeth color chart, as well as by reconstructing inferred multispectral images. Results show that the proposed technique can produce inferred dense reflectance spectra that correlate well with the true dense reflectance spectra, which illustrates the merits of the technique.
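The paper's non-linear regression model is a random forest. As a minimal, stdlib-only stand-in for that idea, here is a bagged ensemble of regression stumps (bootstrap resampling plus prediction averaging), demonstrated on toy one-dimensional data rather than spectral measurements:

```python
import random

def fit_stump(xs, ys):
    """Best single-threshold regression stump (1-D input, squared error)."""
    best = None
    for t in sorted(set(xs)):
        left  = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:                       # degenerate bootstrap sample
        m = sum(ys) / len(ys)
        return lambda x: m
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def fit_forest(xs, ys, n_trees=50, seed=0):
    """Bagged stumps: train each on a bootstrap resample, average predictions."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(s(x) for s in stumps) / len(stumps)

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.2, 0.1, 2.0, 2.1, 1.9]        # step-like non-linear response
forest = fit_forest(xs, ys)
print(round(forest(0.5), 2), round(forest(4.5), 2))
```

A real random forest additionally grows deep trees and subsamples features at each split; libraries such as scikit-learn provide full implementations.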
Pagowski, M O; Grell, G A; Devenyi, D; Peckham, S E; McKeen, S A; Gong, W; Monache, L D; McHenry, J N; McQueen, J; Lee, P
2006-02-02
Forecasts from seven air quality models and surface ozone data collected over the eastern USA and southern Canada during July and August 2004 provide a unique opportunity to assess benefits of ensemble-based ozone forecasting and devise methods to improve ozone forecasts. In this investigation, past forecasts from the ensemble of models and hourly surface ozone measurements at over 350 sites are used to issue deterministic 24-h forecasts using a method based on dynamic linear regression. Forecasts of hourly ozone concentrations as well as maximum daily 8-h and 1-h averaged concentrations are considered. It is shown that the forecasts issued with the application of this method have reduced bias and root mean square error and better overall performance scores than any of the ensemble members and the ensemble average. Performance of the method is similar to another method based on linear regression described previously by Pagowski et al., but unlike the latter, the current method does not require measurements from multiple monitors since it operates on individual time series. Improvement in the forecasts can be easily implemented and requires minimal computational cost.
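A minimal sketch of a dynamic-regression-style correction for a single time series: refit an observed-on-forecast OLS line over a trailing window and apply it to the next forecast. The data are toy numbers with a constant bias; the paper's method and its ensemble inputs are more involved.

```python
def rolling_regression_forecast(obs, fcst, window=5):
    """Correct each forecast using OLS of obs on fcst over a trailing window."""
    corrected = []
    for i in range(len(fcst)):
        lo = max(0, i - window)
        xs, ys = fcst[lo:i], obs[lo:i]
        if len(xs) < 2 or len(set(xs)) < 2:
            corrected.append(fcst[i])      # not enough history: pass through
            continue
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
        a = my - b * mx
        corrected.append(a + b * fcst[i])
    return corrected

# hypothetical ozone series: forecasts carry a persistent +10 unit bias
obs  = [40.0, 45.0, 50.0, 55.0, 60.0, 65.0]
fcst = [50.0, 55.0, 60.0, 65.0, 70.0, 75.0]
out = rolling_regression_forecast(obs, fcst)
print(out)
```

Once two observations are available, the fitted line (slope 1, intercept -10) removes the bias exactly.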
NASA Technical Reports Server (NTRS)
Dawson, Terence P.; Curran, Paul J.; Kupiec, John A.
1995-01-01
link between wavelengths chosen by stepwise regression and the biochemical of interest, and this in turn has cast doubts on the use of imaging spectrometry for the estimation of foliar biochemical concentrations at sites distant from the training sites. To investigate this problem, an analysis was conducted on the variation in canopy biochemical concentrations and reflectance spectra using forced entry linear regression.
Quantile Regression Adjusting for Dependent Censoring from Semi-Competing Risks
Li, Ruosha; Peng, Limin
2014-01-01
Summary In this work, we study quantile regression when the response is an event time subject to potentially dependent censoring. We consider the semi-competing risks setting, where the time to censoring remains observable after the occurrence of the event of interest. While such a scenario frequently arises in biomedical studies, most current quantile regression methods for censored data are not applicable because they generally require the censoring time and the event time to be independent. By imposing rather mild assumptions on the association structure between the time-to-event response and the censoring time variable, we propose quantile regression procedures, which allow us to garner a comprehensive view of the covariate effects on the event time outcome as well as to examine the informativeness of censoring. An efficient and stable algorithm is provided for implementing the new method. We establish the asymptotic properties of the resulting estimators including uniform consistency and weak convergence. The theoretical development may serve as a useful template for addressing estimating settings that involve stochastic integrals. Extensive simulation studies suggest that the proposed method performs well with moderate sample sizes. We illustrate the practical utility of our proposals through an application to a bone marrow transplant trial. PMID:25574152
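Quantile regression is built on the check (pinball) loss. A minimal intercept-only sketch: minimizing the total check loss at τ = 0.5 recovers the sample median, which is robust to the outlier. This illustrates the loss itself, not the paper's dependent-censoring estimator.

```python
def pinball_loss(u, tau):
    """Check (pinball) loss underlying quantile regression."""
    return tau * u if u >= 0 else (tau - 1) * u

def fit_quantile_intercept(ys, tau, grid):
    """Intercept-only quantile regression: grid value minimizing total loss."""
    return min(grid, key=lambda q: sum(pinball_loss(y - q, tau) for y in ys))

ys = [1.0, 2.0, 3.0, 4.0, 100.0]          # heavy right tail
grid = [x / 10 for x in range(0, 1001)]   # candidate intercepts 0.0 .. 100.0
median = fit_quantile_intercept(ys, 0.5, grid)
print(median)                             # the sample median, unmoved by 100.0
```

Replacing the intercept with a linear predictor x'β gives full quantile regression; solvers then minimize the same loss over β.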
Hoos, Anne B.; Patel, Anant R.
1996-01-01
Model-adjustment procedures were applied to the combined data bases of storm-runoff quality for Chattanooga, Knoxville, and Nashville, Tennessee, to improve predictive accuracy for storm-runoff quality for urban watersheds in these three cities and throughout Middle and East Tennessee. Data for 45 storms at 15 different sites (five sites in each city) constitute the data base. Comparison of observed values of storm-runoff load and event-mean concentration to the predicted values from the regional regression models for 10 constituents shows prediction errors as large as 806,000 percent. Model-adjustment procedures, which combine the regional model predictions with local data, are applied to improve predictive accuracy. Standard error of estimate after model adjustment ranges from 67 to 322 percent. Calibration results may be biased due to sampling error in the Tennessee data base. The relatively large values of standard error of estimate for some of the constituent models, although representing significant reduction (at least 50 percent) in prediction error compared to estimation with unadjusted regional models, may be unacceptable for some applications. The user may wish to collect additional local data for these constituents and repeat the analysis, or calibrate an independent local regression model.
ERIC Educational Resources Information Center
Quinino, Roberto C.; Reis, Edna A.; Bessegato, Lupercio F.
2013-01-01
This article proposes the use of the coefficient of determination as a statistic for hypothesis testing in multiple linear regression based on distributions acquired by beta sampling. (Contains 3 figures.)
ERIC Educational Resources Information Center
Rule, David L.
Several regression methods were examined within the framework of weighted structural regression (WSR), comparing their regression weight stability and score estimation accuracy in the presence of outlier contamination. The methods compared are: (1) ordinary least squares; (2) WSR ridge regression; (3) minimum risk regression; (4) minimum risk 2;…
Covariate-Adjusted Linear Mixed Effects Model with an Application to Longitudinal Data
Nguyen, Danh V.; Şentürk, Damla; Carroll, Raymond J.
2009-01-01
Linear mixed effects (LME) models are useful for longitudinal data/repeated measurements. We propose a new class of covariate-adjusted LME models for longitudinal data that nonparametrically adjusts for a normalizing covariate. The proposed approach involves fitting a parametric LME model to the data after adjusting for the nonparametric effects of a baseline confounding covariate. In particular, the effect of the observable covariate on the response and predictors of the LME model is modeled nonparametrically via smooth unknown functions. In addition to covariate-adjusted estimation of fixed/population parameters and random effects, an estimation procedure for the variance components is also developed. Numerical properties of the proposed estimators are investigated with simulation studies. The consistency and convergence rates of the proposed estimators are also established. An application to a longitudinal data set on calcium absorption, accounting for baseline distortion from body mass index, illustrates the proposed methodology. PMID:19266053
Chicken barn climate and hazardous volatile compounds control using simple linear regression and PID
NASA Astrophysics Data System (ADS)
Abdullah, A. H.; Bakar, M. A. A.; Shukor, S. A. A.; Saad, F. S. A.; Kamis, M. S.; Mustafa, M. H.; Khalid, N. S.
2016-07-01
Hazardous volatile compounds from chicken manure in chicken barns are a potential health threat to farm animals and workers. Ammonia (NH3) and hydrogen sulphide (H2S) produced in the chicken barn are influenced by climate changes. An Electronic Nose (e-nose) is used for sampling the barn's air, temperature and humidity data. Simple Linear Regression is used to identify the correlations between temperature and humidity, humidity and ammonia, and ammonia and hydrogen sulphide. MATLAB Simulink software was used for the sample data analysis using a PID controller. Results show that a PID controller tuned using the Ziegler-Nichols technique can improve climate control in the chicken barn.
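A minimal sketch of a discrete PID loop of the kind described, with assumed gains and a toy first-order barn temperature model (not the paper's Ziegler-Nichols-tuned controller):

```python
class PID:
    """Minimal discrete PID controller (assumed gains, not the paper's tuning)."""
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_err = 0.0

    def step(self, setpoint, measured):
        err = setpoint - measured
        self.integral += err * self.dt
        deriv = (err - self.prev_err) / self.dt
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv

# toy first-order barn model: temperature responds to the actuation signal u
pid = PID(kp=2.0, ki=0.5, kd=0.1, dt=1.0)
setpoint, temp = 28.0, 35.0                  # degrees C; barn starts too hot
for _ in range(50):
    u = pid.step(setpoint, temp)
    temp += 0.1 * u - 0.05 * (temp - 30.0)   # crude plant dynamics
print(round(temp, 1))                        # settles near the setpoint
```

The integral term is what removes the steady-state offset here; a proportional-only controller would settle slightly away from the setpoint because the plant drifts toward 30 C.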
Discriminative Feature Extraction via Multivariate Linear Regression for SSVEP-Based BCI.
Wang, Haiqiang; Zhang, Yu; Waytowich, Nicholas R; Krusienski, Dean J; Zhou, Guoxu; Jin, Jing; Wang, Xingyu; Cichocki, Andrzej
2016-05-01
Many of the most widely accepted methods for reliable detection of steady-state visual evoked potentials (SSVEPs) in the electroencephalogram (EEG) utilize canonical correlation analysis (CCA). CCA uses pure sine and cosine reference templates with frequencies corresponding to the visual stimulation frequencies. These generic reference templates may not optimally reflect the natural SSVEP features obscured by the background EEG. This paper introduces a new approach that utilizes spatio-temporal feature extraction with multivariate linear regression (MLR) to learn discriminative SSVEP features for improving the detection accuracy. MLR is implemented on dimensionality-reduced EEG training data and a constructed label matrix to find optimally discriminative subspaces. Experimental results show that the proposed MLR method significantly outperforms CCA as well as several other competing methods for SSVEP detection, especially for time windows shorter than 1 second. This demonstrates that the MLR method is a promising new approach for achieving improved real-time performance of SSVEP-BCIs. PMID:26812728
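A minimal sketch of the MLR step: regress a one-hot label matrix Y on a feature matrix X and classify by the argmax of the fitted scores. The data here are synthetic and separable, not EEG features.

```python
import numpy as np

rng = np.random.default_rng(2)
# hypothetical feature matrix X (trials x features), one-hot label matrix Y
n_per, n_feat = 20, 5
X = rng.normal(size=(2 * n_per, n_feat))
X[:n_per, 0] += 3.0                        # class 0 separable on feature 0
Y = np.zeros((2 * n_per, 2))
Y[:n_per, 0] = 1.0
Y[n_per:, 1] = 1.0

# MLR with an intercept: W = (X'X)^{-1} X'Y, then argmax over class scores
Xa = np.column_stack([np.ones(len(X)), X])
W = np.linalg.solve(Xa.T @ Xa, Xa.T @ Y)
pred = np.argmax(Xa @ W, axis=1)
truth = np.array([0] * n_per + [1] * n_per)
acc = float(np.mean(pred == truth))
print(acc)
```

In the paper, the learned projection W is applied to dimensionality-reduced EEG features rather than raw data; the regression-then-argmax structure is the same.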
Rubio, Francisco J; Genton, Marc G
2016-06-30
We study Bayesian linear regression models with skew-symmetric scale mixtures of normal error distributions. These kinds of models can be used to capture departures from the usual assumption of normality of the errors in terms of heavy tails and asymmetry. We propose a general noninformative prior structure for these regression models and show that the corresponding posterior distribution is proper under mild conditions. We extend these propriety results to cases where the response variables are censored. The latter scenario is of interest in the context of accelerated failure time models, which are relevant in survival analysis. We present a simulation study that demonstrates good frequentist properties of the posterior credible intervals associated with the proposed priors. This study also sheds some light on the trade-off between increased model flexibility and the risk of over-fitting. We illustrate the performance of the proposed models with real data. Although we focus on models with univariate response variables, we also present some extensions to the multivariate case in the Supporting Information. Copyright © 2016 John Wiley & Sons, Ltd. PMID:26856806
The overlooked potential of Generalized Linear Models in astronomy, I: Binomial regression
NASA Astrophysics Data System (ADS)
de Souza, R. S.; Cameron, E.; Killedar, M.; Hilbe, J.; Vilalta, R.; Maio, U.; Biffi, V.; Ciardi, B.; Riggs, J. D.
2015-09-01
Revealing hidden patterns in astronomical data is often the path to fundamental scientific breakthroughs; meanwhile the complexity of scientific enquiry increases as more subtle relationships are sought. Contemporary data analysis problems often elude the capabilities of classical statistical techniques, suggesting the use of cutting edge statistical methods. In this light, astronomers have overlooked a whole family of statistical techniques for exploratory data analysis and robust regression, the so-called Generalized Linear Models (GLMs). In this paper, the first in a series aimed at illustrating the power of these methods in astronomical applications, we elucidate the potential of a particular class of GLMs for handling binary/binomial data, the so-called logit and probit regression techniques, from both a maximum likelihood and a Bayesian perspective. As a case in point, we present the use of these GLMs to explore the conditions of star formation activity and metal enrichment in primordial minihaloes from cosmological hydro-simulations including detailed chemistry, gas physics, and stellar feedback. We predict that for a dark mini-halo with metallicity ≈ 1.3 × 10⁻⁴ Z⨀, an increase of 1.2 × 10⁻² in the gas molecular fraction increases the probability of star formation occurrence by a factor of 75%. Finally, we highlight the use of receiver operating characteristic curves as a diagnostic for binary classifiers, and ultimately we use these to demonstrate the competitive predictive performance of GLMs against the popular technique of artificial neural networks.
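A minimal sketch of the logit member of this GLM family: logistic regression fitted by maximum likelihood via Newton-Raphson, on synthetic binary data rather than the hydro-simulation outputs.

```python
import numpy as np

def fit_logit(X, y, n_iter=25):
    """Logistic (logit) regression fit by Newton-Raphson / IRLS."""
    Xa = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(Xa.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xa @ beta))      # fitted probabilities
        W = p * (1.0 - p)                          # IRLS working weights
        H = Xa.T @ (Xa * W[:, None])               # observed information
        grad = Xa.T @ (y - p)                      # score vector
        beta = beta + np.linalg.solve(H, grad)
    return beta

rng = np.random.default_rng(3)
x = rng.normal(size=500)
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * x)))
y = (rng.uniform(size=500) < p_true).astype(float)

beta = fit_logit(x[:, None], y)
print(np.round(beta, 2))   # estimates should land near the true (0.5, 2.0)
```

Swapping the logistic CDF for the standard normal CDF in the same scheme gives probit regression.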
2014-01-01
This paper examined the efficiency of multivariate linear regression (MLR) and artificial neural network (ANN) models in the prediction of two major water quality parameters in a wastewater treatment plant. Biochemical oxygen demand (BOD) and chemical oxygen demand (COD), as indirect indicators of organic matter, are representative parameters for sewer water quality. Performance of the ANN models was evaluated using the coefficient of correlation (r), root mean square error (RMSE) and bias values. The values of BOD and COD computed by the ANN method and regression analysis were in close agreement with their respective measured values. Results showed that the ANN model performed better than the MLR model. Comparative indices of the optimized ANN with input values of temperature (T), pH, total suspended solids (TSS) and total solids (TS) for the prediction of BOD were RMSE = 25.1 mg/L and r = 0.83, and for the prediction of COD were RMSE = 49.4 mg/L and r = 0.81. It was found that the ANN model could be employed successfully in estimating the BOD and COD in the inlet of wastewater biochemical treatment plants. Moreover, sensitivity analysis showed that the pH parameter has more effect on BOD and COD prediction than the other parameters. Also, both implemented models predicted BOD better than COD. PMID:24456676
NASA Astrophysics Data System (ADS)
Urrutia, Jackie D.; Tampis, Razzcelle L.; Mercado, Joseph; Baygan, Aaron Vito M.; Baccay, Edcon B.
2016-02-01
The objective of this research is to formulate a mathematical model for the Philippines' Real Gross Domestic Product (Real GDP). The following factors are considered: Consumers' Spending (x1), Government's Spending (x2), Capital Formation (x3) and Imports (x4) as the Independent Variables that can actually influence the Real GDP in the Philippines (y). The researchers used a Normal Estimation Equation using Matrices to create the model for Real GDP and used α = 0.01. The researchers analyzed quarterly data from 1990 to 2013. The data were acquired from the National Statistical Coordination Board (NSCB), resulting in a total of 96 observations for each variable. The data, particularly the Dependent Variable (y), underwent a logarithmic transformation to satisfy all the assumptions of Multiple Linear Regression Analysis. The mathematical model for Real GDP was formulated using Matrices through MATLAB. Based on the results, only three of the Independent Variables are significant to the Dependent Variable, namely Consumers' Spending (x1), Capital Formation (x3) and Imports (x4), and hence can actually predict Real GDP (y). The regression analysis yields a coefficient of determination of 98.7%, meaning the Independent Variables explain 98.7% of the variation in the Dependent Variable. With a Paired T-Test result of 97.6%, the Predicted Values obtained from the model showed no significant difference from the Actual Values of Real GDP. This research will be essential in appraising forthcoming changes and in aiding the Government in implementing policies for the development of the economy.
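A minimal sketch of the Normal Estimation Equation in matrix form, beta = (X'X)^(-1) X'y, applied to a synthetic log-transformed response with four predictors (hypothetical data, not the NSCB series):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 96                                    # mirrors 24 years of quarterly data
X = rng.uniform(1.0, 10.0, size=(n, 4))  # four hypothetical predictors
beta_true = np.array([2.0, 0.3, 0.0, 0.25, 0.15])   # intercept + 4 slopes
Xa = np.column_stack([np.ones(n), X])
log_y = Xa @ beta_true + rng.normal(scale=0.05, size=n)  # log-scale response

# Normal estimation equations, solved as a linear system rather than by
# explicit inversion (numerically preferable, same estimator).
beta_hat = np.linalg.solve(Xa.T @ Xa, Xa.T @ log_y)
print(np.round(beta_hat, 2))
```

Note the zero slope on the second predictor, echoing the study's finding that one of the four candidate variables was not significant.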
Predicting students' success at pre-university studies using linear and logistic regressions
NASA Astrophysics Data System (ADS)
Suliman, Noor Azizah; Abidin, Basir; Manan, Norhafizah Abdul; Razali, Ahmad Mahir
2014-09-01
The study aimed to find the most suitable model for predicting students' success at the medical pre-university studies, Centre for Foundation in Science, Languages and General Studies of Cyberjaya University College of Medical Sciences (CUCMS). The predictors under investigation were achievements in the national high school exit examination, Sijil Pelajaran Malaysia (SPM), namely the Biology, Chemistry, Physics, Additional Mathematics, Mathematics, English and Bahasa Malaysia results, as well as gender and high school background factors. The outcomes showed a significant difference by gender in the final CGPA and in the Biology and Mathematics subjects at pre-university, and a difference by high school background for the Mathematics subject. In general, the correlation between academic achievements at high school and at medical pre-university is moderately significant at an α-level of 0.05, except for the language subjects. It was also found that logistic regression techniques gave better prediction models than the multiple linear regression technique for this data set. The developed logistic models were able to give probabilities that closely matched the actual cases. Hence, they could be used to identify students qualified to enter the CUCMS medical faculty before accepting any students to its foundation program.
ERIC Educational Resources Information Center
So, Tak-Shing Harry; Peng, Chao-Ying Joanne
This study compared the accuracy of predicting two-group membership obtained from K-means clustering with those derived from linear probability modeling, linear discriminant function, and logistic regression under various data properties. Multivariate normally distributed populations were simulated based on combinations of population proportions,…
Methods for Adjusting U.S. Geological Survey Rural Regression Peak Discharges in an Urban Setting
Moglen, Glenn E.; Shivers, Dorianne E.
2006-01-01
A study was conducted of 78 U.S. Geological Survey gaged streams that have been subjected to varying degrees of urbanization over the last three decades. Flood-frequency analysis coupled with nonlinear regression techniques was used to generate a set of equations for converting peak discharge estimates determined from rural regression equations to a set of peak discharge estimates that represent known urbanization. Specifically, urban regression equations for the 2-, 5-, 10-, 25-, 50-, 100-, and 500-year return periods were calibrated as a function of the corresponding rural peak discharge and the percentage of impervious area in a watershed. The results of this study indicate that two sets of equations, one set based on imperviousness and one set based on population density, performed well. Both sets of equations are dependent on rural peak discharges, a measure of development (average percentage of imperviousness or average population density), and a measure of homogeneity of development within a watershed. Average imperviousness was readily determined by using geographic information system methods and commonly available land-cover data. Similarly, average population density was easily determined from census data. Thus, a key advantage to the equations developed in this study is that they do not require field measurements of watershed characteristics as did the U.S. Geological Survey urban equations developed in an earlier investigation. During this study, the U.S. Geological Survey PeakFQ program was used as an integral tool in the calibration of all equations. The scarcity of historical land-use data, however, made exclusive use of flow records necessary for the 30-year period from 1970 to 2000. Such relatively short-duration streamflow time series required a nonstandard treatment of the historical data function of the PeakFQ program in comparison to published guidelines. Thus, the approach used during this investigation does not fully comply with the
Weichenthal, Scott; Ryswyk, Keith Van; Goldstein, Alon; Bagg, Scott; Shekkarizfard, Maryam; Hatzopoulou, Marianne
2016-04-01
Existing evidence suggests that ambient ultrafine particles (UFPs) (<0.1µm) may contribute to acute cardiorespiratory morbidity. However, few studies have examined the long-term health effects of these pollutants owing in part to a need for exposure surfaces that can be applied in large population-based studies. To address this need, we developed a land use regression model for UFPs in Montreal, Canada using mobile monitoring data collected from 414 road segments during the summer and winter months between 2011 and 2012. Two different approaches were examined for model development including standard multivariable linear regression and a machine learning approach (kernel-based regularized least squares (KRLS)) that learns the functional form of covariate impacts on ambient UFP concentrations from the data. The final models included parameters for population density, ambient temperature and wind speed, land use parameters (park space and open space), length of local roads and rail, and estimated annual average NOx emissions from traffic. The final multivariable linear regression model explained 62% of the spatial variation in ambient UFP concentrations whereas the KRLS model explained 79% of the variance. The KRLS model performed slightly better than the linear regression model when evaluated using an external dataset (R(2)=0.58 vs. 0.55) or a cross-validation procedure (R(2)=0.67 vs. 0.60). In general, our findings suggest that the KRLS approach may offer modest improvements in predictive performance compared to standard multivariable linear regression models used to estimate spatial variations in ambient UFPs. However, differences in predictive performance were not statistically significant when evaluated using the cross-validation procedure. PMID:26720396
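A minimal sketch of kernel-based regularized least squares with a Gaussian kernel, the flexible alternative the study compares against linear regression (toy one-dimensional data, not the mobile-monitoring UFP measurements):

```python
import numpy as np

def krls_fit(X, y, sigma=1.0, lam=0.1):
    """Gaussian-kernel regularized least squares: alpha = (K + lam I)^{-1} y."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krls_predict(X_train, alpha, X_new, sigma=1.0):
    d2 = np.sum((X_new[:, None, :] - X_train[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2)) @ alpha

rng = np.random.default_rng(5)
X = rng.uniform(-2, 2, size=(100, 1))
y = np.sin(2.0 * X[:, 0]) + rng.normal(scale=0.05, size=100)  # non-linear target

alpha = krls_fit(X, y, sigma=0.5, lam=0.01)
y_hat = krls_predict(X, alpha, X, sigma=0.5)
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)
print(round(r2, 3))
```

Because the kernel lets covariate effects be non-linear, KRLS can capture structure a first-order linear model misses, which is consistent with the modest R2 gain reported above.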
Wang, Ching-Yun; Dieu Tapsoba, Jean De; Duggan, Catherine; Campbell, Kristin L; McTiernan, Anne
2016-05-10
In many biomedical studies, covariates of interest may be measured with errors. However, frequently in a regression analysis, the quantiles of the exposure variable are used as the covariates. Because of measurement errors in the continuous exposure variable, there could be misclassification in the quantiles for the exposure variable. Misclassification in the quantiles could lead to biased estimation of the association between the exposure variable and the outcome variable. Adjustment for misclassification will be challenging when gold standard variables are not available. In this paper, we develop two regression calibration estimators to reduce bias in effect estimation. The first estimator is normal likelihood-based. The second estimator is linearization-based, and it provides a simple and practical correction. Finite sample performance is examined via a simulation study. We apply the methods to a four-arm randomized clinical trial that tested exercise and weight loss interventions in women aged 50-75 years. Copyright © 2015 John Wiley & Sons, Ltd. PMID:26593772
Linard, Joshua I.
2013-01-01
Mitigating the effects of salt and selenium on water quality in the Grand Valley and lower Gunnison River Basin in western Colorado is a major concern for land managers. Previous modeling indicated means to improve the models by including more detailed geospatial data and a more rigorous method for developing the models. After evaluating all possible combinations of geospatial variables, four multiple linear regression models resulted that could estimate irrigation-season salt yield, nonirrigation-season salt yield, irrigation-season selenium yield, and nonirrigation-season selenium yield. The adjusted r-squared and the residual standard error (in units of log-transformed yield) of the models were, respectively, 0.87 and 2.03 for the irrigation-season salt model, 0.90 and 1.25 for the nonirrigation-season salt model, 0.85 and 2.94 for the irrigation-season selenium model, and 0.93 and 1.75 for the nonirrigation-season selenium model. The four models were used to estimate yields and loads from contributing areas corresponding to 12-digit hydrologic unit codes in the lower Gunnison River Basin study area. Each of the 175 contributing areas was ranked according to its estimated mean seasonal yield of salt and selenium.
NASA Astrophysics Data System (ADS)
Shu, Yuqin; Lam, Nina S. N.
2011-01-01
Detailed estimates of carbon dioxide emissions at fine spatial scales are critical to both modelers and decision makers dealing with global warming and climate change. Globally, traffic-related emissions of carbon dioxide are growing rapidly. This paper presents a new method based on a multiple linear regression model to disaggregate traffic-related CO2 emission estimates from the parish-level scale to a 1 × 1 km grid scale. Considering the allocation factors (population density, urban area, income, road density) together, we used a correlation and regression analysis to determine the relationship between these factors and traffic-related CO2 emissions, and developed the best-fit model. The method was applied to downscale the traffic-related CO2 emission values by parish (i.e. county) for the State of Louisiana into 1-km2 grid cells. Among the four parishes with the highest traffic-related CO2 emissions, the biggest area with above-average CO2 emissions is found in East Baton Rouge, and the smallest area with no CO2 emissions is also in East Baton Rouge, but Orleans has the most CO2 emissions per unit area. The result reveals that high CO2 emissions are concentrated in the dense road networks of urban areas with high population density, and low CO2 emissions are distributed in rural areas with low population density and a sparse road network. The proposed method can be used to identify emission "hot spots" at a fine scale and is considered more accurate and less time-consuming than previous methods.
Adjustable Permanent Quadrupoles Using Rotating Magnet Material Rods for the Next Linear Collider
James T Volk et al.
2001-09-24
The proposed Next Linear Collider (NLC) will require over 1400 adjustable quadrupoles between the main linacs' accelerator structures. These 12.7 mm bore quadrupoles will have a range of integrated strength from 0.6 to 132 Tesla, with a maximum gradient of 135 Tesla per meter, an adjustment range of +0-20% and effective lengths from 324 mm to 972 mm. The magnetic center must remain stable to within 1 micrometer during the 20% adjustment. In an effort to reduce estimated costs and increase reliability, several designs using hybrid permanent magnets have been developed. All magnets have iron poles and use either Samarium Cobalt or Neodymium Iron to provide the magnetic fields. Two prototypes use rotating rods containing permanent magnetic material to vary the gradient. Gradient changes of 20% and center shifts of less than 20 microns have been measured. These data are compared to an equivalent electromagnet prototype.
NASA Astrophysics Data System (ADS)
Montanari, A.
2006-12-01
This contribution introduces a statistically based approach for uncertainty assessment in hydrological modeling, in an optimality context. Indeed, in several real world applications, there is the need for the user to select a model that is deemed to be the best possible choice according to a given goodness-of-fit criterion. In this case, it is extremely important to assess the model uncertainty, intended as the range around the model output within which the measured hydrological variable is expected to fall with a given probability. This indication allows the user to quantify the risk associated with a decision that is based on the model response. The technique proposed here is carried out by inferring the probability distribution of the hydrological model error through a non linear multiple regression approach, depending on an arbitrary number of selected conditioning variables. These may include the current and previous model output as well as internal state variables of the model. The purpose is to indirectly relate the model error to the sources of uncertainty, through the conditioning variables. The method can be applied to any model of arbitrary complexity, including distributed approaches. The probability distribution of the model error is derived in the Gaussian space, through a meta-Gaussian approach. The normal quantile transform is applied in order to make the marginal probability distribution of the model error and the conditioning variables Gaussian. Then the above marginal probability distributions are related through the multivariate Gaussian distribution, whose parameters are estimated via multiple regression. Application of the inverse of the normal quantile transform allows the user to derive the confidence limits of the model output for an assigned significance level. The proposed technique is valid under statistical assumptions, which are essentially those conditioning the validity of the multiple regression in the Gaussian space. Statistical tests
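The normal quantile transform underlying the meta-Gaussian approach can be sketched as a rank-based mapping to standard-normal scores (a generic illustration, not the author's implementation):

```python
import numpy as np
from scipy.stats import norm, rankdata

def normal_quantile_transform(x):
    """Map a sample to standard-normal scores via its empirical CDF.
    The Weibull plotting position rank/(n+1) keeps quantiles finite."""
    n = len(x)
    ranks = rankdata(x)              # ranks 1..n (averaged for ties)
    return norm.ppf(ranks / (n + 1.0))

# strongly skewed model errors become marginally Gaussian after the transform
rng = np.random.default_rng(1)
errors = rng.lognormal(mean=0.0, sigma=1.0, size=500)
z = normal_quantile_transform(errors)
```

The transform is monotone, so it preserves the ordering of the errors while forcing a standard-normal marginal, which is exactly what the meta-Gaussian step requires before fitting the multivariate Gaussian.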
Asquith, William H.; Roussel, Meghan C.
2009-01-01
Annual peak-streamflow frequency estimates are needed for flood-plain management; for objective assessment of flood risk; for cost-effective design of dams, levees, and other flood-control structures; and for design of roads, bridges, and culverts. Annual peak-streamflow frequency represents the peak streamflow for nine recurrence intervals of 2, 5, 10, 25, 50, 100, 200, 250, and 500 years. Common methods for estimation of peak-streamflow frequency for ungaged or unmonitored watersheds are regression equations for each recurrence interval developed for one or more regions; such regional equations are the subject of this report. The method is based on analysis of annual peak-streamflow data from U.S. Geological Survey streamflow-gaging stations (stations). Beginning in 2007, the U.S. Geological Survey, in cooperation with the Texas Department of Transportation and in partnership with Texas Tech University, began a 3-year investigation concerning the development of regional equations to estimate annual peak-streamflow frequency for undeveloped watersheds in Texas. The investigation focuses primarily on 638 stations with 8 or more years of data from undeveloped watersheds and other criteria. The general approach is explicitly limited to the use of L-moment statistics, which are used in conjunction with a technique of multi-linear regression referred to as PRESS minimization. The approach used to develop the regional equations, which was refined during the investigation, is referred to as the 'L-moment-based, PRESS-minimized, residual-adjusted approach'. For the approach, seven unique distributions are fit to the sample L-moments of the data for each of 638 stations and trimmed means of the seven results of the distributions for each recurrence interval are used to define the station specific, peak-streamflow frequency. As a first iteration of regression, nine weighted-least-squares, PRESS-minimized, multi-linear regression equations are computed using the watershed
Shayan, Zahra; Mezerji, Naser Mohammad Gholi; Shayan, Leila; Naseri, Parisa
2016-01-01
Background: Logistic regression (LR) and linear discriminant analysis (LDA) are two popular statistical models for prediction of group membership. Although they are very similar, the LDA makes more assumptions about the data. When categorical and continuous variables are used simultaneously, the optimal choice between the two models is questionable. In most studies, classification error (CE) is used to discriminate between subjects in several groups, but this index is not suitable to predict the accuracy of the outcome. The present study compared LR and LDA models using classification indices. Methods: This cross-sectional study selected 243 cancer patients. Sample sets of different sizes (n = 50, 100, 150, 200, 220) were randomly selected and the CE, B, and Q classification indices were calculated by the LR and LDA models. Results: CE revealed a lack of superiority for one model over the other, but the results showed that LR performed better than LDA for the B and Q indices in all situations. No significant effect of sample size on CE was noted in selecting an optimal model. Assessment of the accuracy of prediction of real data indicated that the B and Q indices are appropriate for selection of an optimal model. Conclusion: The results of this study showed that LR performs better in some cases and LDA in others when based on CE. The CE index is not appropriate for classification, although the B and Q indices performed better and offered more efficient criteria for comparison and discrimination between groups.
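The LR-versus-LDA comparison on classification error can be sketched with scikit-learn on synthetic data (the cancer data set itself is not available here; all names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# synthetic stand-in for the study's data set (n = 243, several predictors)
X, y = make_classification(n_samples=243, n_features=6, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)

# classification error (CE) = 1 - accuracy, the index compared in the study
ce_lr = 1.0 - lr.score(X_te, y_te)
ce_lda = 1.0 - lda.score(X_te, y_te)
```

On data like these the two errors are typically close, mirroring the study's finding that CE alone rarely separates the two models.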
Accounting for data errors discovered from an audit in multiple linear regression.
Shepherd, Bryan E; Yu, Chang
2011-09-01
A data coordinating team performed onsite audits and discovered discrepancies between the data sent to the coordinating center and that recorded at sites. We present statistical methods for incorporating audit results into analyses. This can be thought of as a measurement error problem, where the distribution of errors is a mixture with a point mass at 0. If the error rate is nonzero, then even if the mean of the discrepancy between the reported and correct values of a predictor is 0, naive estimates of the association between two continuous variables will be biased. We consider scenarios where there are (1) errors in the predictor, (2) errors in the outcome, and (3) possibly correlated errors in the predictor and outcome. We show how to incorporate the error rate and magnitude, estimated from a random subset (the audited records), to compute unbiased estimates of association and proper confidence intervals. We then extend these results to multiple linear regression where multiple covariates may be incorrect in the database and the rate and magnitude of the errors may depend on study site. We study the finite sample properties of our estimators using simulations, discuss some practical considerations, and illustrate our methods with data from 2815 HIV-infected patients in Latin America, of whom 234 had their data audited using a sequential auditing plan. PMID:21281274
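The audit-based correction can be illustrated with the classical attenuation adjustment for a single mismeasured predictor (a simplified sketch of the idea, not the authors' full estimator; all quantities are synthetic):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
x_true = rng.normal(size=n)
y = 1.5 * x_true + rng.normal(scale=0.5, size=n)

# database values: a fraction of records carry a recording error
# (mixture error with a point mass at 0, as described in the abstract)
err_rate = 0.2
has_err = rng.random(n) < err_rate
x_db = x_true + has_err * rng.normal(scale=1.0, size=n)

# audit a random subset to estimate the error variance in the database
audit = rng.choice(n, size=300, replace=False)
u_hat = x_db[audit] - x_true[audit]          # audit discrepancies
var_u = u_hat.var()

# the naive slope is attenuated toward 0; rescale by the estimated
# reliability ratio var(x_true) / var(x_db)
beta_naive = np.cov(x_db, y)[0, 1] / x_db.var()
reliability = (x_db.var() - var_u) / x_db.var()
beta_corrected = beta_naive / reliability
```

Even with a mean-zero error, the naive slope is biased (here toward zero), and the audited subset supplies exactly the error-variance estimate needed to undo the bias.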
Linear regression model for predicting interactive mixture toxicity of pesticide and ionic liquid.
Qin, Li-Tang; Wu, Jie; Mo, Ling-Yun; Zeng, Hong-Hu; Liang, Yan-Peng
2015-08-01
Most environmental contamination arises from chemical mixtures rather than from individual chemicals. Most of the existing mixture models are only valid for non-interactive mixture toxicity. Therefore, we built two simple linear regression-based concentration addition (LCA) and independent action (LIA) models that aim to predict the combined toxicities of interactive mixtures. The LCA model was built between the negative log-transformation of experimental and expected effect concentrations of concentration addition (CA), while the LIA model was developed between the negative log-transformation of experimental and expected effect concentrations of independent action (IA). Twenty-four mixtures of pesticide and ionic liquid were used to evaluate the predictive abilities of the LCA and LIA models. The models correlated well with the observed responses of the 24 binary mixtures. The values of the coefficient of determination (R²) and leave-one-out (LOO) cross-validated correlation coefficient (Q²) for the LCA and LIA models are larger than 0.99, which indicates high predictive power of the models. The results showed that the developed LCA and LIA models allow for accurately predicting the mixture toxicities of synergism, additive effect, and antagonism. The proposed LCA and LIA models may serve as a useful tool in ecotoxicological assessment. PMID:25929456
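The LCA-style model, a straight line between negative log-transformed expected and observed effect concentrations, can be sketched as follows (synthetic data; names are illustrative):

```python
import numpy as np

def fit_lca(neg_log_expected, neg_log_observed):
    """LCA-style model: regress -log10(observed EC) on the -log10(EC)
    expected under concentration addition (CA). Returns slope,
    intercept and R-squared."""
    x = np.asarray(neg_log_expected)
    y = np.asarray(neg_log_observed)
    slope, intercept = np.polyfit(x, y, 1)
    y_hat = slope * x + intercept
    r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    return slope, intercept, r2

# synthetic binary-mixture data: observed toxicity deviates mildly from CA
rng = np.random.default_rng(3)
x = rng.uniform(2.0, 6.0, size=24)          # 24 mixtures, as in the study
y = 1.05 * x - 0.1 + rng.normal(scale=0.05, size=24)
slope, intercept, r2 = fit_lca(x, y)
```

A slope near 1 with a small intercept indicates near-additive behavior; systematic departures of observed from expected values capture synergism or antagonism, which is what lets the fitted line handle interactive mixtures.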
Fuzzy clustering and soft switching of linear regression models for reversible image compression
NASA Astrophysics Data System (ADS)
Aiazzi, Bruno; Alba, Pasquale S.; Alparone, Luciano; Baronti, Stefano
1998-10-01
This paper describes an original application of fuzzy logic to reversible compression of 2D and 3D data. The compression method consists of a space-variant prediction followed by context-based classification and arithmetic coding of the outcome residuals. Prediction of a pixel to be encoded is obtained from the fuzzy switching of a set of linear regression predictors. The coefficients of each predictor are calculated so as to minimize prediction MSE for those pixels whose gray-level patterns, lying on a causal neighborhood of prefixed shape, are vectors belonging in a fuzzy sense to one cluster. In the 3D case, pixels both on the current slice and on previously encoded slices may be used. The size and shape of the causal neighborhood, as well as the number of predictors to be switched, may be chosen before running the algorithm and determine the trade-off between coding performance and computational cost. The method exhibits impressive performance, for both 2D and 3D data, mainly thanks to the optimality of the predictors, due to their skill in fitting data patterns.
Turlapaty, Anish C.; Younan, Nicolas H.; Anantharaj, Valentine G
2012-01-01
Currently, the only viable option for a global precipitation product is the merger of several precipitation products from different modalities. In this article, we develop a linear merging methodology based on spatiotemporal regression. Four high-resolution precipitation products (HRPPs), obtained through methods including the Climate Prediction Center's Morphing (CMORPH), Geostationary Operational Environmental Satellite-Based Auto-Estimator (GOES-AE), GOES-Based Hydro-Estimator (GOES-HE) and Self-Calibrating Multivariate Precipitation Retrieval (SCAMPR) algorithms, are used in this study. The merged data are evaluated against the Arkansas Red Basin River Forecast Center's (ABRFC's) ground-based rainfall product. The evaluation is performed using the Heidke skill score (HSS) for four seasons, from summer 2007 to spring 2008, and for two different rainfall detection thresholds. It is shown that the merged data outperform all the other products in seven out of eight cases. A key innovation of this machine learning method is that only 6% of the validation data are used for the initial training. The sensitivity of the algorithm to location, distribution of training data, selection of input data sets and seasons is also analysed and presented.
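The Heidke skill score used for evaluation reduces, for a binary rain/no-rain threshold, to a 2 × 2 contingency-table statistic; a minimal sketch:

```python
import numpy as np

def heidke_skill_score(forecast, observed, threshold):
    """HSS for event detection at a rainfall threshold.
    a = hits, b = false alarms, c = misses, d = correct negatives."""
    f = np.asarray(forecast) >= threshold
    o = np.asarray(observed) >= threshold
    a = np.sum(f & o)
    b = np.sum(f & ~o)
    c = np.sum(~f & o)
    d = np.sum(~f & ~o)
    denom = (a + c) * (c + d) + (a + b) * (b + d)
    return 2.0 * (a * d - b * c) / denom if denom else 0.0

obs = np.array([0.0, 2.0, 5.0, 0.0, 1.0, 3.0])
hss_perfect = heidke_skill_score(obs, obs, threshold=1.0)          # 1.0
hss_constant = heidke_skill_score(np.zeros(6), obs, threshold=1.0)  # 0.0
```

A perfect forecast scores 1 and a skill-less constant forecast scores 0, which makes HSS a natural metric for comparing products across seasons and thresholds as done in the article.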
Robust linear regression model of Ki-67 for mitotic rate in gastrointestinal stromal tumors
KEMMERLING, RALF; WEYLAND, DENIS; KIESSLICH, TOBIAS; ILLIG, ROMANA; KLIESER, ECKHARD; JÄGER, TARKAN; DIETZE, OTTO; NEUREITER, DANIEL
2014-01-01
Risk stratification of gastrointestinal stromal tumors (GISTs) by tumor size, lymph node and metastasis status is crucially affected by mitotic activity. To date, no studies have quantitatively compared mitotic activity in hematoxylin and eosin (H&E)-stained tissue sections with immunohistochemical markers, such as phosphohistone H3 (PHH3) and Ki-67. According to the TNM guidelines, the mitotic count on H&E sections and immunohistochemical PHH3-stained slides was assessed per 50 high-power fields in 154 specimens of clinically documented GIST cases. The Ki-67-associated proliferation rate was evaluated on three digitalized hot spots using image analysis. The H&E-based mitotic rate was found to correlate significantly better with Ki-67-assessed proliferation activity than with PHH3-assessed proliferation activity (r=0.780; P<0.01). A linear regression model (analysis of variance; P<0.001) allowed reliable predictions of the H&E-associated mitoses based on the Ki-67 expression alone. Additionally, Ki-67-associated proliferation had a greater, statistically significant impact on the recurrence and metastasis rates of the GIST cases than the classical H&E-based mitotic rate. The results of the present study indicated that the mitotic rate may be reliably and time-efficiently estimated by immunohistochemistry of Ki-67 using only three hot spots. PMID:24527082
The Ω Counter, a Frequency Counter Based on the Linear Regression.
Rubiola, Enrico; Lenczner, Michel; Bourgeois, Pierre-Yves; Vernotte, Francois
2016-07-01
This paper introduces the Ω counter, a frequency counter (i.e., a frequency-to-digital converter) based on the linear regression (LR) algorithm applied to time stamps. We discuss the noise of the electronics. We derive the statistical properties of the Ω counter on a rigorous mathematical basis, including the weighted measure and the frequency response. We describe an implementation based on a system on chip, under test in our laboratory, and we compare the Ω counter to the traditional Π and Λ counters. The LR exhibits the optimum rejection of white phase noise, superior to that of the Π and Λ counters. White noise is the major practical problem of wideband digital electronics, both in the instrument's internal circuits and in the fast processes we may want to measure. With a measurement time τ, the variance is proportional to 1/τ² for the Π counter, and to 1/τ³ for both the Λ and Ω counters. However, the Ω counter has the smallest possible variance, 1.25 dB smaller than that of the Λ counter. The Ω counter finds a natural application in the measurement of the parabolic variance, described in the companion article in this journal [vol. 63, no. 4, pp. 611-623, April 2016 (Special Issue on the 50th Anniversary of the Allan Variance), DOI 10.1109/TUFFC.2015.2499325]. PMID:27244731
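The LR frequency estimate at the heart of the Ω counter is simply the slope of a least-squares line fitted to phase time stamps; a minimal sketch with a synthetic 10 MHz signal (all numbers are illustrative):

```python
import numpy as np

def omega_frequency(t, phase_cycles):
    """Omega-counter-style estimate: least-squares slope of the
    accumulated phase (in cycles) versus time, i.e. frequency in Hz."""
    slope, _ = np.polyfit(t, phase_cycles, 1)
    return slope

# 10 MHz signal time-stamped over 1 ms with white phase noise
rng = np.random.default_rng(4)
t = np.linspace(0.0, 1e-3, 1000)
phase = 10e6 * t + rng.normal(scale=1e-3, size=t.size)   # cycles
f_hat = omega_frequency(t, phase)
```

Because the fit averages over all time stamps rather than just the endpoints (the Π case) or a triangular weighting (the Λ case), white phase noise is suppressed most effectively, which is the paper's central point.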
Comparison of linear discriminant analysis and logistic regression for data classification
NASA Astrophysics Data System (ADS)
Liong, Choong-Yeun; Foo, Sin-Fan
2013-04-01
Linear discriminant analysis (LDA) and logistic regression (LR) are often used for the purpose of classifying populations or groups using a set of predictor variables. Assumptions of multivariate normality and equal variance-covariance matrices across groups are required before proceeding with LDA, but such assumptions are not required for LR and hence LR is considered to be much more robust than LDA. In this paper, several real datasets which differ in terms of normality, number of independent variables and sample size are used to study the performance of both methods. The methods are compared based on the percentage of correct classification and the B index. The results show that overall, LR performs better regardless of whether the distribution of the data is normal or nonnormal. However, LR needs longer computing time than LDA as sample size increases. The performance of LDA was also tested by using various prior probabilities. The results show that the average percentage of correct classification and the B index are higher when the prior probability is set based on the group size rather than using equal probabilities for all groups.
NASA Astrophysics Data System (ADS)
Samhouri, M.; Al-Ghandoor, A.; Fouad, R. H.
2009-08-01
In this study two techniques for modeling electricity consumption of the Jordanian industrial sector are presented: (i) multivariate linear regression and (ii) neuro-fuzzy models. Electricity consumption is modeled as a function of different variables such as number of establishments, number of employees, electricity tariff, prevailing fuel prices, production outputs, capacity utilizations, and structural effects. It was found that industrial production and capacity utilization are the most important variables that have a significant effect on future electrical power demand. The results showed that both the multivariate linear regression and neuro-fuzzy models are generally comparable and can be used adequately to simulate industrial electricity consumption. However, a comparison based on the square root of the average squared error suggests that the neuro-fuzzy model performs slightly better for future prediction of electricity consumption than the multivariate linear regression model. Such results are in full agreement with similar work, using different methods, for other countries.
Ho Hoang, Khai-Long; Mombaur, Katja
2015-10-15
Dynamic modeling of the human body is an important tool to investigate the fundamentals of the biomechanics of human movement. To model the human body in terms of a multi-body system, it is necessary to know the anthropometric parameters of the body segments. For young healthy subjects, several data sets exist that are widely used in the research community, e.g. the tables provided by de Leva. No such comprehensive anthropometric parameter sets exist for elderly people. It is, however, well known that body proportions change significantly during aging, e.g. due to degenerative effects in the spine, such that parameters for young people cannot be used for realistically simulating the dynamics of elderly people. In this study, regression equations are derived from the inertial parameters, center of mass positions, and body segment lengths provided by de Leva to be adjustable to the changes in proportion of the body parts of male and female humans due to aging. Additional adjustments are made to the reference points of the parameters for the upper body segments as they are chosen in a more practicable way in the context of creating a multi-body model in a chain structure with the pelvis representing the most proximal segment. PMID:26338096
A non-linear regression method for CT brain perfusion analysis
NASA Astrophysics Data System (ADS)
Bennink, E.; Oosterbroek, J.; Viergever, M. A.; Velthuis, B. K.; de Jong, H. W. A. M.
2015-03-01
CT perfusion (CTP) imaging allows for rapid diagnosis of ischemic stroke. Generation of perfusion maps from CTP data usually involves deconvolution algorithms providing estimates for the impulse response function in the tissue. We propose the use of a fast non-linear regression (NLR) method that we postulate has similar performance to the current academic state-of-the-art method (bSVD), but that has some important advantages, including the estimation of vascular permeability, improved robustness to tracer delay, and very few tuning parameters, all of which are important in stroke assessment. The aim of this study is to evaluate the fast NLR method against bSVD and a commercial clinical state-of-the-art method. The three methods were tested against a published digital perfusion phantom earlier used to illustrate the superiority of bSVD. In addition, the NLR and clinical methods were also tested against bSVD on 20 clinical scans. Pearson correlation coefficients were calculated for each of the tested methods. All three methods showed high correlation coefficients (>0.9) with the ground truth in the phantom. With respect to the clinical scans, the NLR perfusion maps showed higher correlation with bSVD than the perfusion maps from the clinical method. Furthermore, the perfusion maps showed that the fast NLR estimates are robust to tracer delay. In conclusion, the proposed fast NLR method provides a simple and flexible way of estimating perfusion parameters from CT perfusion scans, with high correlation coefficients. This suggests that it could be a better alternative to the current clinical and academic state-of-the-art methods.
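Non-linear regression on a parametric tracer curve can be sketched with SciPy's curve_fit; the gamma-variate bolus model below is a generic stand-in for the paper's tissue model, with invented parameter values:

```python
import numpy as np
from scipy.optimize import curve_fit

def gamma_variate(t, A, t0, alpha, beta):
    """Gamma-variate bolus curve, a common parametric model for
    contrast passage in perfusion imaging. t0 is the tracer delay,
    which the NLR approach estimates directly instead of relying
    on delay-sensitive deconvolution."""
    s = np.clip(t - t0, 0.0, None)
    return A * s ** alpha * np.exp(-s / beta)

# synthetic tissue attenuation curve with noise and a tracer delay of 3 s
rng = np.random.default_rng(7)
t = np.linspace(0.0, 40.0, 81)
y = gamma_variate(t, 5.0, 3.0, 2.0, 3.0) + rng.normal(scale=0.3, size=t.size)

# non-linear least squares; a rough initial guess is usually sufficient
popt, _ = curve_fit(gamma_variate, t, y, p0=(4.0, 2.5, 1.8, 2.8),
                    bounds=(0.0, np.inf), maxfev=20000)
mse = float(np.mean((y - gamma_variate(t, *popt)) ** 2))
```

Fitting the delay as a free parameter is what makes this style of NLR robust to tracer delay, at the price of an iterative optimization per voxel.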
Bayesian Method for Support Union Recovery in Multivariate Multi-Response Linear Regression
NASA Astrophysics Data System (ADS)
Chen, Wan-Ping
Sparse modeling has become a particularly important and quickly developing topic in many applications of statistics, machine learning, and signal processing. The main objective of sparse modeling is discovering a small number of predictive patterns that would improve our understanding of the data. This paper extends the idea of sparse modeling to the variable selection problem in high dimensional linear regression, where there are multiple response vectors, and they share the same or similar subsets of predictor variables to be selected from a large set of candidate variables. In the literature, this problem is called multi-task learning, support union recovery or simultaneous sparse coding in different contexts. We present a Bayesian method for solving this problem by introducing two nested sets of binary indicator variables. In the first set of indicator variables, each indicator is associated with a predictor variable or a regressor, indicating whether this variable is active for any of the response vectors. In the second set of indicator variables, each indicator is associated with both a predictor variable and a response vector, indicating whether this variable is active for the particular response vector. The problem of variable selection is solved by sampling from the posterior distributions of the two sets of indicator variables. We develop a Gibbs sampling algorithm for posterior sampling and use the generated samples to identify the active support at both the shared and individual levels. Theoretical and simulation-based justifications are provided in the paper. The proposed algorithm is also demonstrated on real image data sets. To learn the patterns of the object in images, we treat images as different tasks. By combining images with the object in the same category, we can not only learn the shared patterns efficiently but also obtain an individual sketch of each image.
Optimization of end-members used in multiple linear regression geochemical mixing models
NASA Astrophysics Data System (ADS)
Dunlea, Ann G.; Murray, Richard W.
2015-11-01
Tracking marine sediment provenance (e.g., of dust, ash, hydrothermal material, etc.) provides insight into contemporary ocean processes and helps construct paleoceanographic records. In a simple system with only a few end-members that can be easily quantified by a unique chemical or isotopic signal, chemical ratios and normative calculations can help quantify the flux of sediment from the few sources. In a more complex system (e.g., each element comes from multiple sources), more sophisticated mixing models are required. MATLAB codes published in Pisias et al. solidified the foundation for application of a Constrained Least Squares (CLS) multiple linear regression technique that can use many elements and several end-members in a mixing model. However, rigorous sensitivity testing to check the robustness of the CLS model is time and labor intensive. MATLAB codes provided in this paper reduce the time and labor involved and facilitate finding a robust and stable CLS model. By quickly comparing the goodness of fit between thousands of different end-member combinations, users are able to identify trends in the results that reveal the CLS solution uniqueness and the end-member composition precision required for a good fit. Users can also rapidly check that they have the appropriate number and type of end-members in their model. In the end, these codes improve the user's confidence that the final CLS model(s) they select are the most reliable solutions. These advantages are demonstrated by application of the codes in two case studies of well-studied datasets (Nazca Plate and South Pacific Gyre).
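A CLS mixing model of this kind can be sketched with a nonnegative least-squares solve plus a softly enforced sum-to-one constraint (a common implementation trick; the end-member values below are invented for illustration):

```python
import numpy as np
from scipy.optimize import nnls

def cls_mixing(endmembers, sample, weight=100.0):
    """Constrained least squares: find nonnegative end-member fractions
    that sum to ~1 and reproduce a sample's element concentrations.
    The sum constraint is enforced softly by appending a heavily
    weighted equation sum(f) = 1 to the system."""
    E = np.vstack([endmembers.T, weight * np.ones(endmembers.shape[0])])
    b = np.append(sample, weight)
    f, _ = nnls(E, b)
    return f

# 3 hypothetical end-members (e.g. dust, ash, hydrothermal) x 5 elements
E = np.array([[10.0, 2.0, 0.5, 4.0, 1.0],
              [1.0,  8.0, 0.2, 2.0, 6.0],
              [0.2,  0.5, 9.0, 1.0, 0.3]])
true_f = np.array([0.5, 0.3, 0.2])
sample = true_f @ E
f_hat = cls_mixing(E, sample)
```

Wrapping this solve in a loop over candidate end-member combinations and comparing goodness of fit is the kind of brute-force sensitivity testing the paper's MATLAB codes automate.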
Fisher, Charles K; Mehta, Pankaj
2014-01-01
Human-associated microbial communities exert tremendous influence over human health and disease. With modern metagenomic sequencing methods it is now possible to follow the relative abundance of microbes in a community over time. These microbial communities exhibit rich ecological dynamics and an important goal of microbial ecology is to infer the ecological interactions between species directly from sequence data. Any algorithm for inferring ecological interactions must overcome three major obstacles: 1) a correlation between the abundances of two species does not imply that those species are interacting, 2) the sum constraint on the relative abundances obtained from metagenomic studies makes it difficult to infer the parameters in time-series models, and 3) errors due to experimental uncertainty, or mis-assignment of sequencing reads into operational taxonomic units, bias inferences of species interactions due to a statistical problem called "errors-in-variables". Here we introduce an approach, Learning Interactions from MIcrobial Time Series (LIMITS), that overcomes these obstacles. LIMITS uses sparse linear regression with bootstrap aggregation to infer a discrete-time Lotka-Volterra model for microbial dynamics. We tested LIMITS on synthetic data and showed that it could reliably infer the topology of the inter-species ecological interactions. We then used LIMITS to characterize the species interactions in the gut microbiomes of two individuals and found that the interaction networks varied significantly between individuals. Furthermore, we found that the interaction networks of the two individuals are dominated by distinct "keystone species", Bacteroides fragilis and Bacteroided stercosis, that have a disproportionate influence on the structure of the gut microbiome even though they are only found in moderate abundance. Based on our results, we hypothesize that the abundances of certain keystone species may be responsible for individuality in the human gut
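The core regression step of a LIMITS-style analysis, regressing per-step log abundance ratios on abundances and bagging over bootstrap resamples, can be sketched as follows (a simplified stand-in for the authors' sparse stepwise algorithm; all parameter values are synthetic):

```python
import numpy as np

rng = np.random.default_rng(5)

# simulate a 3-species discrete-time Lotka-Volterra community:
# log(x[t+1] / x[t]) = r + A @ x[t] (+ noise)
r = np.array([0.1, 0.08, 0.12])
A = np.array([[-0.05,  0.0, -0.02],
              [ 0.0, -0.04,  0.0],
              [-0.01,  0.0, -0.06]])
T = 200
x = np.empty((T, 3))
x[0] = [1.0, 1.5, 1.2]
for t in range(T - 1):
    growth = r + A @ x[t] + rng.normal(scale=0.01, size=3)
    x[t + 1] = x[t] * np.exp(growth)

# regression target: per-step log abundance ratios
Y = np.log(x[1:] / x[:-1])
X = np.column_stack([np.ones(T - 1), x[:-1]])

# bootstrap-aggregated least squares over resampled time steps
coefs = []
for _ in range(50):
    idx = rng.integers(0, T - 1, size=T - 1)
    beta, *_ = np.linalg.lstsq(X[idx], Y[idx], rcond=None)
    coefs.append(beta)
beta_bar = np.mean(coefs, axis=0)
A_hat = beta_bar[1:].T               # estimated interaction matrix
resid = Y - X @ beta_bar
```

The log-ratio transform linearizes the Lotka-Volterra map, turning interaction inference into an ordinary regression problem; the bagging step stabilizes the estimates against the noisy, compositional data the abstract describes.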
Technology Transfer Automated Retrieval System (TEKTRAN)
Accumulated feedlot manure negatively affects the environment. The objective was to test the validity of using EMI mapping methods combined with predictive-based sampling and ordinary linear regression for measuring spatially variable manure accumulation. A Dualem-1S EMI meter also recording GPS c...
Technology Transfer Automated Retrieval System (TEKTRAN)
Parametric non-linear regression (PNR) techniques commonly are used to develop weed seedling emergence models. Such techniques, however, require statistical assumptions that are difficult to meet. To examine and overcome these limitations, we compared PNR with a nonparametric estimation technique. F...
A New Test of Linear Hypotheses in OLS Regression under Heteroscedasticity of Unknown Form
ERIC Educational Resources Information Center
Cai, Li; Hayes, Andrew F.
2008-01-01
When the errors in an ordinary least squares (OLS) regression model are heteroscedastic, hypothesis tests involving the regression coefficients can have Type I error rates that are far from the nominal significance level. Asymptotically, this problem can be rectified with the use of a heteroscedasticity-consistent covariance matrix (HCCM)…
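The HCCM correction the abstract refers to can be sketched with the widely used HC3 estimator (a generic illustration; the paper's proposed test itself is not reproduced here):

```python
import numpy as np

def hc3_standard_errors(X, y):
    """OLS estimates with HC3 heteroscedasticity-consistent standard
    errors; X must include an intercept column. HC3 reweights each
    squared residual by 1/(1 - h_ii)^2, where h_ii is the leverage."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta
    h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)      # leverages h_ii
    omega = (e / (1.0 - h)) ** 2
    cov = XtX_inv @ (X.T * omega) @ X @ XtX_inv      # sandwich estimator
    return beta, np.sqrt(np.diag(cov))

# heteroscedastic errors: variance grows with the predictor
rng = np.random.default_rng(6)
n = 500
x = rng.uniform(0.0, 1.0, n)
y = 2.0 + 3.0 * x + rng.normal(size=n) * (0.2 + x)
X = np.column_stack([np.ones(n), x])
beta, se = hc3_standard_errors(X, y)
```

Hypothesis tests built on these sandwich standard errors keep Type I error rates close to nominal under heteroscedasticity of unknown form, which is the problem the abstract describes.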
Isolating and Examining Sources of Suppression and Multicollinearity in Multiple Linear Regression
ERIC Educational Resources Information Center
Beckstead, Jason W.
2012-01-01
The presence of suppression (and multicollinearity) in multiple regression analysis complicates interpretation of predictor-criterion relationships. The mathematical conditions that produce suppression in regression analysis have received considerable attention in the methodological literature but until now nothing in the way of an analytic…
Confidence Intervals for an Effect Size Measure in Multiple Linear Regression
ERIC Educational Resources Information Center
Algina, James; Keselman, H. J.; Penfield, Randall D.
2007-01-01
The increase in the squared multiple correlation coefficient (ΔR²) associated with a variable in a regression equation is a commonly used measure of importance in regression analysis. The coverage probability that an asymptotic and percentile bootstrap confidence interval includes Δρ² was investigated. As expected,…
An Investigation of the Median-Median Method of Linear Regression
ERIC Educational Resources Information Center
Walters, Elizabeth J.; Morrell, Christopher H.; Auer, Richard E.
2006-01-01
Least squares regression is the most common method of fitting a straight line to a set of bivariate data. Another less known method that is available on Texas Instruments graphing calculators is median-median regression. This method is proposed as a simple method that may be used with middle and high school students to motivate the idea of fitting…
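The median-median method is simple enough to state in a few lines: split the x-sorted data into three groups, take each group's median point, and combine them; a sketch following the TI calculator convention (group sizes k+1, k, k+1 when n = 3k + 2, otherwise outer groups of size k):

```python
import numpy as np

def median_median_line(x, y):
    """Median-median (Med-Med) regression: the slope comes from the
    outer groups' median points; the intercept averages the three
    groups' residual medians."""
    order = np.argsort(x)
    x, y = np.asarray(x, float)[order], np.asarray(y, float)[order]
    n = len(x)
    g = n // 3
    outer = g + (1 if n % 3 == 2 else 0)     # outer groups equal size
    xl, yl = np.median(x[:outer]), np.median(y[:outer])
    xr, yr = np.median(x[-outer:]), np.median(y[-outer:])
    xm, ym = np.median(x[outer:n - outer]), np.median(y[outer:n - outer])
    slope = (yr - yl) / (xr - xl)
    intercept = np.mean([yl - slope * xl, ym - slope * xm, yr - slope * xr])
    return slope, intercept

# a gross outlier barely moves the median-median line
x = np.arange(1.0, 10.0)                     # 9 points on y = 2x + 1
y = 2.0 * x + 1.0
y[-1] += 50.0                                # corrupt the last point
slope, intercept = median_median_line(x, y)  # still recovers 2x + 1
```

Because each summary point is a median, a single wild observation cannot drag the fitted line the way it drags a least-squares fit, which is the pedagogical point the article makes.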
ERIC Educational Resources Information Center
Aitkin, Murray A.
Fixed-width confidence intervals for a population regression line over a finite interval of x have recently been derived by Gafarian. The method is extended to provide fixed-width confidence intervals for the difference between two population regression lines, resulting in a simple procedure analogous to the Johnson-Neyman technique. (Author)
Weighted Structural Regression: A Broad Class of Adaptive Methods for Improving Linear Prediction.
ERIC Educational Resources Information Center
Pruzek, Robert M.; Lepak, Greg M.
1992-01-01
Adaptive forms of weighted structural regression are developed and discussed. Bootstrapping studies indicate that the new methods have potential to recover known population regression weights and predict criterion score values routinely better than do ordinary least squares methods. The new methods are scale free and simple to compute. (SLD)
Worachartcheewan, Apilak; Nantasenamat, Chanin; Owasirikul, Wiwat; Monnor, Teerawat; Naruepantawart, Orapan; Janyapaisarn, Sayamon; Prachayasittikul, Supaluk; Prachayasittikul, Virapong
2014-02-12
A data set of 1-adamantylthiopyridine analogs (1-19) with antioxidant activity, comprising 2,2-diphenyl-1-picrylhydrazyl (DPPH) and superoxide dismutase (SOD) activities, was used for constructing quantitative structure-activity relationship (QSAR) models. Molecular structures were geometrically optimized at the B3LYP/6-31g(d) level and subjected to further molecular descriptor calculation using Dragon software. Multiple linear regression (MLR) was employed for the development of QSAR models using 3 significant descriptors (i.e. Mor29e, F04[N-N] and GATS5v) for predicting the DPPH activity and 2 essential descriptors (i.e. EEig06r and Mor06v) for predicting the SOD activity. Such molecular descriptors accounted for the effects and positions of substituent groups (R) on the 1-adamantylthiopyridine ring. The results showed that high atomic electronegativity of a polar substituent group (R = CO2H) afforded high DPPH activity, while a substituent with high atomic van der Waals volume such as R = Br gave high SOD activity. Leave-one-out cross-validation (LOO-CV) and an external test set were used for model validation. The correlation coefficient (Q_CV) and root mean squared error (RMSE_CV) of the LOO-CV set for predicting DPPH activity were 0.5784 and 8.3440, respectively, while Q_Ext and RMSE_Ext of the external test set corresponded to 0.7353 and 4.2721, respectively. Furthermore, the Q_CV and RMSE_CV values of the LOO-CV set for predicting SOD activity were 0.7549 and 5.6380, respectively. The QSAR model's equation was then used to predict the SOD activity of tested compounds and these predictions were subsequently verified experimentally. It was observed that the experimental activity was more potent than the predicted activity. Structure-activity relationships of significant descriptors governing antioxidant activity are also discussed. The QSAR models investigated herein are anticipated to be useful in the rational design and development of novel compounds with antioxidant activity. PMID
Hu, L.; Zhang, Z.G.; Mouraux, A.; Iannetti, G.D.
2015-01-01
Transient sensory, motor or cognitive events elicit not only phase-locked event-related potentials (ERPs) in the ongoing electroencephalogram (EEG), but also induce non-phase-locked modulations of ongoing EEG oscillations. These modulations can be detected when single-trial waveforms are analysed in the time-frequency domain, and consist of stimulus-induced decreases (event-related desynchronization, ERD) or increases (event-related synchronization, ERS) of synchrony in the activity of the underlying neuronal populations. ERD and ERS reflect changes in the parameters that control oscillations in neuronal networks and, depending on the frequency at which they occur, represent neuronal mechanisms involved in cortical activation, inhibition and binding. ERD and ERS are commonly estimated by averaging the time-frequency decomposition of single trials. However, their trial-to-trial variability, which can reflect physiologically important information, is lost by across-trial averaging. Here, we aim to (1) develop novel approaches to explore single-trial parameters (including latency, frequency and magnitude) of ERP/ERD/ERS; (2) disclose the relationship between estimated single-trial parameters and other experimental factors (e.g., perceived intensity). We found that (1) stimulus-elicited ERP/ERD/ERS can be correctly separated using principal component analysis (PCA) decomposition with Varimax rotation on the single-trial time-frequency distributions; (2) time-frequency multiple linear regression with dispersion term (TF-MLRd) enhances the signal-to-noise ratio of ERP/ERD/ERS in single trials, and provides an unbiased estimation of their latency, frequency, and magnitude at single-trial level; (3) these estimates can be meaningfully correlated with each other and with other experimental factors at single-trial level (e.g., perceived stimulus intensity and ERP magnitude). The methods described in this article allow exploring fully non-phase-locked stimulus-induced cortical
NASA Technical Reports Server (NTRS)
Parker, Peter A.; Geoffrey, Vining G.; Wilson, Sara R.; Szarka, John L., III; Johnson, Nels G.
2010-01-01
The calibration of measurement systems is a fundamental but under-studied problem within industrial statistics. The origins of this problem go back to basic chemical analysis based on NIST standards. In today's world these issues extend to mechanical, electrical, and materials engineering. Often, these new scenarios do not provide "gold standards" such as the standard weights provided by NIST. This paper considers the classic "forward regression followed by inverse regression" approach. In this approach the initial experiment treats the "standards" as the regressor and the observed values as the response to calibrate the instrument. The analyst then must invert the resulting regression model in order to use the instrument to make actual measurements in practice. This paper compares this classical approach to "reverse regression," which treats the standards as the response and the observed measurements as the regressor in the calibration experiment. Such an approach is intuitively appealing because it avoids the need for the inverse regression. However, it also violates some of the basic regression assumptions.
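A minimal sketch contrasting the two approaches on hypothetical calibration data (the standards, readings, and noise level below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Calibration experiment: known standards s, noisy instrument readings r
s = np.linspace(1.0, 10.0, 20)                 # "gold standard" values
r = 2.0 * s + 1.0 + rng.normal(scale=0.2, size=s.size)

# Classical approach: forward regression r ~ s, then invert to recover s
b_f, a_f = np.polyfit(s, r, 1)                 # slope, intercept
def classical(reading):
    return (reading - a_f) / b_f

# Reverse regression: treat the standards as the response, s ~ r
b_r, a_r = np.polyfit(r, s, 1)
def reverse(reading):
    return b_r * reading + a_r

new_reading = 11.0                             # a future instrument reading
est_classical = classical(new_reading)
est_reverse = reverse(new_reading)
```

With a well-behaved linear instrument the two estimates nearly coincide; the statistical differences the paper studies concern the error structure, since reverse regression puts the measurement error on the regressor side.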
NASA Astrophysics Data System (ADS)
Alih, Ekele; Ong, Hong Choon
2014-07-01
The application of Ordinary Least Squares (OLS) to a single equation assumes, among other things, that the predictor variables are truly exogenous; that there is only one-way causation between the dependent variable yi and the predictor variables xij. If this is not true and the xij's are at the same time determined by yi, the OLS assumptions will be violated and a single-equation method will give biased and inconsistent parameter estimates. OLS also suffers a major setback in the presence of contaminated data. In order to rectify these problems, simultaneous equation models have been introduced, as has robust regression. In this paper, we construct a simultaneous equation model with variables that exhibit simultaneous dependence, and we propose a robust multivariate regression procedure for estimating the parameters of such models. The performance of the robust multivariate regression procedure was examined and compared with the OLS multivariate regression technique and the Three-Stage Least Squares procedure (3SLS) using numerical simulation experiments. The robust multivariate regression and 3SLS performed approximately equally, and better than OLS, when there was no contamination in the data. Nevertheless, when contamination occurred in the data, the robust multivariate regression outperformed both the 3SLS and OLS.
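The simultaneity bias described above can be seen in a single-equation sketch. This is hypothetical data, and a plain two-stage least squares (instrumental-variables) estimator stands in for the paper's multivariate 3SLS and robust procedures:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 1000
z = rng.normal(size=n)                            # instrument, truly exogenous
u = rng.normal(size=n)                            # common shock -> simultaneity
x = 1.0 * z + u + rng.normal(scale=0.5, size=n)   # endogenous regressor
y = 2.0 * x + u + rng.normal(scale=0.5, size=n)   # true coefficient is 2.0

# OLS is biased upward here: x is correlated with the error term (through u)
beta_ols = (x @ y) / (x @ x)

# Two-stage least squares: replace x by its projection on the instrument z
x_hat = z * ((z @ x) / (z @ z))                   # stage 1: regress x on z
beta_2sls = (x_hat @ y) / (x_hat @ x)             # stage 2
```

`beta_ols` lands well above 2.0, while `beta_2sls` recovers the structural coefficient, which is the consistency property the single-equation OLS violates.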
NASA Technical Reports Server (NTRS)
Sidik, S. M.
1975-01-01
Ridge, Marquardt's generalized inverse, shrunken, and principal components estimators are discussed in terms of the objectives of point estimation of parameters, estimation of the predictive regression function, and hypothesis testing. It is found that as the normal equations approach singularity, more consideration must be given to estimable functions of the parameters as opposed to estimation of the full parameter vector; that biased estimators all introduce constraints on the parameter space; that adoption of mean squared error as a criterion of goodness should be independent of the degree of singularity; and that ordinary least-squares subset regression is the best overall method.
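The constraint that biased estimators place on the parameter space can be sketched with the ridge estimator on synthetic, nearly collinear data (not from the report):

```python
import numpy as np

rng = np.random.default_rng(2)

# Two nearly collinear predictors make the normal equations near-singular
x1 = rng.normal(size=50)
x2 = x1 + rng.normal(scale=0.01, size=50)
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=50)

def ridge(X, y, k):
    """Ridge estimator (X'X + kI)^{-1} X'y; k = 0 recovers OLS."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

beta_ols = ridge(X, y, 0.0)
beta_ridge = ridge(X, y, 1.0)

# The ridge penalty constrains the coefficient vector: its norm shrinks
norm_ols = np.linalg.norm(beta_ols)
norm_ridge = np.linalg.norm(beta_ridge)
```

As the normal equations approach singularity, the OLS coefficients become large and unstable while their estimable sum stays well determined, which is the distinction the report draws between estimable functions and the full parameter vector.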
Lee, Eunjee; Zhu, Hongtu; Kong, Dehan; Wang, Yalin; Giovanello, Kelly Sullivan; Ibrahim, Joseph G
2015-01-01
The aim of this paper is to develop a Bayesian functional linear Cox regression model (BFLCRM) with both functional and scalar covariates. This new development is motivated by establishing the likelihood of conversion to Alzheimer’s disease (AD) in 346 patients with mild cognitive impairment (MCI) enrolled in the Alzheimer’s Disease Neuroimaging Initiative 1 (ADNI-1) and the early markers of conversion. These 346 MCI patients were followed over 48 months, with 161 MCI participants progressing to AD at 48 months. The functional linear Cox regression model was used to establish that functional covariates including hippocampus surface morphology and scalar covariates including brain MRI volumes, cognitive performance (ADAS-Cog), and APOE status can accurately predict time to onset of AD. Posterior computation proceeds via an efficient Markov chain Monte Carlo algorithm. A simulation study is performed to evaluate the finite sample performance of BFLCRM. PMID:26900412
NASA Astrophysics Data System (ADS)
Arioli, M.; Gratton, S.
2012-11-01
Minimum-variance unbiased estimates for linear regression models can be obtained by solving least-squares problems. The conjugate gradient method can be successfully used in solving the symmetric and positive definite normal equations obtained from these least-squares problems. Taking into account the results of Golub and Meurant (1997, 2009) [10,11], Hestenes and Stiefel (1952) [17], and Strakoš and Tichý (2002) [16], which make it possible to approximate the energy norm of the error during the conjugate gradient iterative process, we adapt the stopping criterion introduced by Arioli (2005) [18] to the normal equations taking into account the statistical properties of the underpinning linear regression problem. Moreover, we show how the energy norm of the error is linked to the χ2-distribution and to the Fisher-Snedecor distribution. Finally, we present the results of several numerical tests that experimentally validate the effectiveness of our stopping criteria.
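A minimal sketch of conjugate gradients applied to the normal equations (synthetic data; the stopping rule below is a plain relative-residual test, not the energy-norm criterion the paper develops):

```python
import numpy as np

def cg_normal_equations(X, y, tol=1e-10, max_iter=500):
    """Conjugate gradients on the SPD normal equations X'X b = X'y."""
    A = X.T @ X
    c = X.T @ y
    b = np.zeros(A.shape[0])
    r = c - A @ b                 # residual of the normal equations
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        b = b + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * np.linalg.norm(c):
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return b

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 4))
beta = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ beta + rng.normal(scale=0.05, size=40)

b_cg = cg_normal_equations(X, y)
b_ls, *_ = np.linalg.lstsq(X, y, rcond=None)   # reference solution
```

The paper's contribution is precisely to replace the generic tolerance above with a statistically motivated stop tied to the energy norm of the error and its chi-squared behaviour.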
Modeling protein tandem mass spectrometry data with an extended linear regression strategy.
Liu, Han; Bonner, Anthony J; Emili, Andrew
2004-01-01
Tandem mass spectrometry (MS/MS) has emerged as a cornerstone of proteomics owing in part to robust spectral interpretation algorithms. The intensity patterns presented in mass spectra provide useful information for the identification of peptides and proteins. However, widely used algorithms cannot predict the peak intensity patterns exactly. We have developed a systematic analytical approach based on a family of extended regression models, which permits routine, large-scale protein expression profile modeling. By proving an important technical result (that the regression coefficient vector is the eigenvector corresponding to the least eigenvalue of a space-transformed version of the original data), this extended regression problem can be reduced to a singular value decomposition (SVD) problem, thus gaining robustness and efficiency. To evaluate the performance of our model, we chose, from 60,960 spectra, 2,859 high-confidence, non-redundant matches as training data; for this specific problem, we derived several measures of goodness of fit to show that our modeling method is reasonable. The issues of overfitting and underfitting are also discussed. This extended regression strategy therefore offers an effective and efficient framework for in-depth investigation of complex mammalian proteomes. PMID:17270923
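The eigenvector result can be illustrated with the classical total-least-squares construction, in which the best-fitting hyperplane's normal is the eigenvector of the smallest eigenvalue of the stacked data matrix. This is a sketch on synthetic data, not the paper's exact space transformation:

```python
import numpy as np

rng = np.random.default_rng(4)

# Noisy linear data; total least squares treats predictors and response symmetrically
X = rng.normal(size=(200, 2))
y = X @ np.array([1.5, -0.5]) + rng.normal(scale=0.05, size=200)

# Stack predictors and response; the eigenvector of the smallest eigenvalue
# of Z'Z defines the best-fitting hyperplane through the point cloud
Z = np.column_stack([X, y])
eigvals, eigvecs = np.linalg.eigh(Z.T @ Z)   # ascending eigenvalues
v = eigvecs[:, 0]                            # eigenvector of the least eigenvalue

# Recover regression coefficients from the hyperplane normal, proportional to [b, -1]
beta_tls = -v[:2] / v[2]
```

Equivalently, the same vector is the last right singular vector of Z, which is why the problem reduces to an SVD.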
ERIC Educational Resources Information Center
Baker, Bruce D.; Richards, Craig E.
1999-01-01
Applies neural network methods for forecasting 1991-95 per-pupil expenditures in U.S. public elementary and secondary schools. Forecasting models included the National Center for Education Statistics' multivariate regression model and three neural architectures. Regarding prediction accuracy, neural network results were comparable or superior to…
Jen, Min-Hua; Bottle, Alex; Kirkwood, Graham; Johnston, Ron; Aylin, Paul
2011-09-01
We have previously described a system for monitoring a number of healthcare outcomes using case-mix adjustment models. It is desirable to automate the model fitting process in such a system if monitoring covers a large number of outcome measures or subgroup analyses. Our aim was to compare the performance of three different variable selection strategies: "manual"; "automated" backward elimination and re-categorisation; and including all variables at once, irrespective of their apparent importance, with automated re-categorisation. Logistic regression models for predicting in-hospital mortality and emergency readmission within 28 days were fitted to an administrative database for 78 diagnosis groups and 126 procedures from 1996 to 2006 for National Health Service hospital trusts in England. The performance of the models was assessed with Receiver Operating Characteristic (ROC) c statistics (measuring discrimination) and the Brier score (assessing average predictive accuracy). Overall, discrimination was similar for diagnoses and procedures and consistently better for mortality than for emergency readmission. Brier scores were generally low overall (showing higher accuracy) and were lower for procedures than diagnoses, with a few exceptions for emergency readmission within 28 days. Among the three variable selection strategies, the automated procedure had similar performance to the manual method in almost all cases except low-risk groups with few outcome events. For the rapid generation of multiple case-mix models we suggest applying automated modelling to reduce the time required, in particular when examining different outcomes of large numbers of procedures and diseases in routinely collected administrative health data. PMID:21556848
Creating a non-linear total sediment load formula using polynomial best subset regression model
NASA Astrophysics Data System (ADS)
Okcu, Davut; Pektas, Ali Osman; Uyumaz, Ali
2016-08-01
The aim of this study is to derive a new total sediment load formula that is more accurate and has fewer application constraints than the well-known formulae in the literature. The five best-known stream power concept sediment formulae approved by ASCE are used for benchmarking on a wide range of datasets that includes both field and flume (lab) observations. The dimensionless parameters of these widely used formulae are used as inputs in a new regression approach, called polynomial best subset regression (PBSR) analysis. The aim of the PBSR analysis is to fit and test all possible combinations of the input variables and select the best subset. All the input variables, together with their second and third powers, are included in the regression to test the possible relations between the explanatory variables and the dependent variable. In selecting the best subset, a multistep approach is used that depends on significance values and also on the degree of multicollinearity among the inputs. The new formula is compared to the others on a holdout dataset, and detailed performance investigations are conducted for the field and lab subsets within this holdout data. Different goodness-of-fit statistics are used, as they represent different perspectives on model accuracy. After the detailed comparisons were carried out, we identified the most accurate equation that is also applicable to both flume and river data. In particular, on the field dataset the prediction performance of the proposed formula outperformed the benchmark formulations.
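A minimal sketch of best-subset search over an input and its second and third powers (synthetic data and variable names; the study's significance and multicollinearity screening steps are not reproduced):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)

# Candidate pool: one input together with its second and third powers
x = rng.uniform(1.0, 3.0, size=80)
y = 2.0 * x**2 - x + rng.normal(scale=0.1, size=80)
candidates = {'x': x, 'x^2': x**2, 'x^3': x**3}

def fit_rss(cols):
    """Least-squares fit on a subset of candidate terms; returns residual SS."""
    A = np.column_stack([np.ones_like(x)] + [candidates[c] for c in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ coef) ** 2)

# Enumerate every non-empty subset and keep the best for each subset size
best = {}
for k in range(1, len(candidates) + 1):
    for cols in combinations(candidates, k):
        rss = fit_rss(list(cols))
        if k not in best or rss < best[k][1]:
            best[k] = (cols, rss)
```

In a real PBSR run the per-size winners would then be screened by term significance and multicollinearity before choosing the final formula.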
ERIC Educational Resources Information Center
Kobrin, Jennifer L.; Sinharay, Sandip; Haberman, Shelby J.; Chajewski, Michael
2011-01-01
This study examined the adequacy of a multiple linear regression model for predicting first-year college grade point average (FYGPA) using SAT[R] scores and high school grade point average (HSGPA). A variety of techniques, both graphical and statistical, were used to examine if it is possible to improve on the linear regression model. The results…
Application of Dynamic Grey-Linear Auto-regressive Model in Time Scale Calculation
NASA Astrophysics Data System (ADS)
Yuan, H. T.; Don, S. W.
2009-01-01
Because of the influence of different noise sources and other factors, the running of an atomic clock is very complex. In order to forecast the rate of an atomic clock accurately, it is necessary to study and design a model to calculate its rate in the near future. Using this rate, the clock can be used in the calculation of local atomic time and the steering of local universal time. In this paper, a new forecast model called the dynamic grey-linear auto-regressive model is studied, and the precision of the new model is given. The new model is tested with real data from the National Time Service Center.
Regression Is a Univariate General Linear Model Subsuming Other Parametric Methods as Special Cases.
ERIC Educational Resources Information Center
Vidal, Sherry
Although the concept of the general linear model (GLM) has existed since the 1960s, other univariate analyses such as the t-test and the analysis of variance models have remained popular. The GLM produces an equation that minimizes the mean differences of independent variables as they are related to a dependent variable. From a computer printout…
Qiu, Lefeng; Wang, Kai; Long, Wenli; Wang, Ke; Hu, Wei; Amable, Gabriel S.
2016-01-01
Soil cadmium (Cd) contamination has attracted a great deal of attention because of its detrimental effects on animals and humans. This study aimed to develop and compare the performances of stepwise linear regression (SLR), classification and regression tree (CART) and random forest (RF) models in the prediction and mapping of the spatial distribution of soil Cd and to identify likely sources of Cd accumulation in Fuyang County, eastern China. Soil Cd data from 276 topsoil (0–20 cm) samples were collected and randomly divided into calibration (222 samples) and validation datasets (54 samples). Auxiliary data, including detailed land use information, soil organic matter, soil pH, and topographic data, were incorporated into the models to simulate the soil Cd concentrations and further identify the main factors influencing soil Cd variation. The predictive models for soil Cd concentration exhibited acceptable overall accuracies (72.22% for SLR, 70.37% for CART, and 75.93% for RF). The SLR model exhibited the largest predicted deviation, with a mean error (ME) of 0.074 mg/kg, a mean absolute error (MAE) of 0.160 mg/kg, and a root mean squared error (RMSE) of 0.274 mg/kg, and the RF model produced the results closest to the observed values, with an ME of 0.002 mg/kg, an MAE of 0.132 mg/kg, and an RMSE of 0.198 mg/kg. The RF model also exhibited the greatest R2 value (0.772). The CART model predictions closely followed, with ME, MAE, RMSE, and R2 values of 0.013 mg/kg, 0.154 mg/kg, 0.230 mg/kg and 0.644, respectively. The three prediction maps generally exhibited similar and realistic spatial patterns of soil Cd contamination. The heavily Cd-affected areas were primarily located in the alluvial valley plain of the Fuchun River and its tributaries because of the dramatic industrialization and urbanization processes that have occurred there. The most important variable for explaining high levels of soil Cd accumulation was the presence of metal smelting industries. The
Schilling, K.E.; Wolter, C.F.
2005-01-01
Nineteen variables, including precipitation, soils and geology, land use, and basin morphologic characteristics, were evaluated to develop Iowa regression models to predict total streamflow (Q), base flow (Qb), storm flow (Qs) and base flow percentage (%Qb) in gauged and ungauged watersheds in the state. Discharge records from a set of 33 watersheds across the state for the 1980 to 2000 period were separated into Qb and Qs. Multiple linear regression found that 75.5 percent of long term average Q was explained by rainfall, sand content, and row crop percentage variables, whereas 88.5 percent of Qb was explained by these three variables plus permeability and floodplain area variables. Qs was explained by average rainfall and %Qb was a function of row crop percentage, permeability, and basin slope variables. Regional regression models developed for long term average Q and Qb were adapted to annual rainfall and showed good correlation between measured and predicted values. Combining the regression model for Q with an estimate of mean annual nitrate concentration, a map of potential nitrate loads in the state was produced. Results from this study have important implications for understanding geomorphic and land use controls on streamflow and base flow in Iowa watersheds and similar agriculture dominated watersheds in the glaciated Midwest. (JAWRA) (Copyright © 2005).
Theobald, Roddy; Freeman, Scott
2014-01-01
Although researchers in undergraduate science, technology, engineering, and mathematics education are currently using several methods to analyze learning gains from pre- and posttest data, the most commonly used approaches have significant shortcomings. Chief among these is the inability to distinguish whether differences in learning gains are due to the effect of an instructional intervention or to differences in student characteristics when students cannot be assigned to control and treatment groups at random. Using pre- and posttest scores from an introductory biology course, we illustrate how the methods currently in wide use can lead to erroneous conclusions, and how multiple linear regression offers an effective framework for distinguishing the impact of an instructional intervention from the impact of student characteristics on test score gains. In general, we recommend that researchers always use student-level regression models that control for possible differences in student ability and preparation to estimate the effect of any nonrandomized instructional intervention on student performance. PMID:24591502
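The recommended student-level regression can be sketched as follows. The scores are synthetic; the point is that regressing the posttest on both the pretest and a treatment indicator separates the intervention effect from prior ability:

```python
import numpy as np

rng = np.random.default_rng(8)

# Hypothetical pre/post test scores: the intervention adds 5 points on
# average, but students also differ in ability (captured by the pretest)
n = 400
pre = rng.normal(70.0, 10.0, size=n)
treat = rng.integers(0, 2, size=n).astype(float)
post = 10.0 + 0.8 * pre + 5.0 * treat + rng.normal(scale=5.0, size=n)

# Regress posttest on pretest and the treatment indicator: the treatment
# coefficient estimates the intervention effect net of prior preparation
A = np.column_stack([np.ones(n), pre, treat])
coef, *_ = np.linalg.lstsq(A, post, rcond=None)
effect = coef[2]          # estimated treatment effect
```

Comparing raw gain scores between nonrandomized groups would confound `effect` with any pretest imbalance; the regression above removes that confound.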
Yu, Donghai; Du, Ruobing; Xiao, Ji-Chang
2016-07-01
Ninety-six acidic phosphorus-containing molecules with pKa values from 1.88 to 6.26 were collected and divided into training and test sets by random sampling. Structural parameters were obtained by density functional theory calculations on the molecules. The relationship between the experimental pKa values and the structural parameters was obtained by multiple linear regression fitting on the training set and tested with the test set; the R² values were 0.974 and 0.966 for the training and test sets, respectively. This regression equation, which quantitatively describes the influence of structural parameters on pKa and can be used to predict pKa values of similar structures, is significant for the design of new acidic phosphorus-containing extractants. © 2016 Wiley Periodicals, Inc. PMID:27218266
Lewin, M.D.; Sarasua, S.; Jones, P.A. (Div. of Health Studies)
1999-07-01
For the purpose of examining the association between blood lead levels and household-specific soil lead levels, the authors used a multivariate linear regression model to find a slope factor relating soil lead levels to blood lead levels. They used previously collected data from the Agency for Toxic Substances and Disease Registry's (ATSDR's) multisite lead and cadmium study. The data included blood lead measurements of 1,015 children aged 6--71 months and corresponding household-specific environmental samples. The environmental samples included lead in soil, house dust, interior paint, and tap water. After adjusting for income, education of the parents, presence of a smoker in the household, sex, and dust lead, and using a double log transformation, they found a slope factor of 0.1388 with a 95% confidence interval of 0.09--0.19 for the dose-response relationship between the natural log of the soil lead level and the natural log of the blood lead level. The predicted blood lead level corresponding to a soil lead level of 500 mg/kg was 5.99 µg/kg with a 95% prediction interval of 2.08--17.29. Predicted values and their corresponding prediction intervals varied by covariate level. The model shows that an increased soil lead level is associated with elevated blood lead in children, but that predictions based on this regression model are subject to high levels of uncertainty and variability.
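A sketch of the double log transformation used to obtain a slope factor. The soil and blood lead values below are hypothetical, not ATSDR data, and the covariate adjustment is omitted:

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical soil and blood lead levels following a log-log relationship
soil = rng.uniform(50.0, 2000.0, size=300)             # mg/kg
slope_true = 0.14
blood = np.exp(0.9 + slope_true * np.log(soil)
               + rng.normal(scale=0.3, size=300))

# Double log transformation: regress ln(blood lead) on ln(soil lead);
# the fitted slope is the slope factor
slope, intercept = np.polyfit(np.log(soil), np.log(blood), 1)

# Back-transformed prediction at a soil lead level of 500 mg/kg
pred_500 = np.exp(intercept + slope * np.log(500.0))
```

The wide prediction intervals the study reports come from the residual spread on the log scale, which multiplies rather than adds after back-transformation.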
NASA Technical Reports Server (NTRS)
Lo, Ching F.
1999-01-01
Radial Basis Function Networks and Back Propagation Neural Networks have been integrated with Multiple Linear Regression to map nonlinear response surfaces over a wide range of independent variables in the process of the Modern Design of Experiments. The integrated method is capable of estimating precision intervals, including confidence and prediction intervals. The power of the method has been demonstrated by applying it to a set of wind tunnel test data to construct a response surface and estimate its precision intervals.
Jaber, Abobaker M; Ismail, Mohd Tahir; Altaher, Alsaidi M
2014-01-01
This paper forecasts the daily closing prices of stock markets. We propose a two-stage technique that combines empirical mode decomposition (EMD) with the nonparametric method of local linear quantile (LLQ) regression. We use the proposed technique, EMD-LLQ, to forecast two stock index time series. Detailed experiments are implemented for the proposed method, in which the EMD-LLQ, EMD, and Holt-Winters methods are compared. The proposed EMD-LLQ model is determined to be superior to the EMD and Holt-Winters methods in predicting the stock closing prices. PMID:25140343
NASA Astrophysics Data System (ADS)
Eghnam, Karam M.; Sheta, Alaa F.
2008-06-01
Development of accurate models is necessary in critical applications such as prediction. In this paper, a solution to the stock prediction problem for the Barents Sea capelin is introduced using Artificial Neural Network (ANN) and Multiple Linear Regression (MLR) models. The capelin stock in the Barents Sea is one of the largest in the world. It normally maintained a fishery with annual catches of up to 3 million tons. The capelin stock problem has an impact on fish stock development. The proposed prediction model was developed using an ANN with its weights adapted using a Genetic Algorithm (GA). The proposed model was compared to the traditional linear MLR model. The results showed that the ANN-GA model produced an overall accuracy 21% better than that of the MLR model.
NASA Technical Reports Server (NTRS)
Barrett, C. A.
1985-01-01
Multiple linear regression analysis was used to determine an equation for estimating hot corrosion attack for a series of Ni-base cast turbine alloys. The U transform (i.e., sin⁻¹[(%A/100)^(1/2)]) was shown to give the best estimate of the dependent variable, y. A complete second-degree equation is described for the centered weight chemistries of the elements Cr, Al, Ti, Mo, W, Cb, Ta, and Co. In addition, linear terms for the minor elements C, B, and Zr were added for a basic 47-term equation. The best reduced equation was determined by the stepwise selection method with essentially 13 terms. The Cr term was found to be the most important, accounting for 60 percent of the explained variability in hot corrosion attack.
Stevens, F. J.; Bobrovnik, S. A.; Biosciences Division; Palladin Inst. Biochemistry
2007-12-01
Physiological responses of the adaptive immune system are polyclonal in nature, whether induced by a naturally occurring infection, by vaccination to prevent infection or, in the case of animals, by challenge with antigen to generate reagents of research or commercial significance. The composition of the polyclonal response is distinct for each individual or animal and changes over time. Differences exist in the affinities of the constituents and in their relative proportions of the responsive population. In addition, some of the antibodies bind to different sites on the antigen, whereas other pairs of antibodies are sterically restricted from concurrent interaction with the antigen. Even if generation of a monoclonal antibody is the ultimate goal of a project, the quality of the resulting reagent is ultimately related to the characteristics of the initial immune response. It is probably impossible to quantitatively parse the composition of a polyclonal response to antigen. However, molecular regression allows further parameterization of a polyclonal antiserum in the context of certain simplifying assumptions. The antiserum is described as consisting of two competing populations of high and low affinity and unknown relative proportions. This simple model allows the quantitative determination of representative affinities and proportions. These parameters may be of use in evaluating responses to vaccines, in evaluating the continuity of antibody production in vaccine recipients or in animals used for the production of antisera, or in optimizing the selection of donors for the production of monoclonal antibodies.
Hassan, A. K.
2015-01-01
In this work, O/W emulsion sets were prepared using different concentrations of two nonionic surfactants. The two surfactants, Tween 80 (HLB=15.0) and Span 80 (HLB=4.3), were used in a fixed proportion of 0.55:0.45, respectively, so the HLB value of the surfactant blend was fixed at 10.185. The surfactant blend concentration ranged from 3% up to 19%. For each O/W emulsion set the conductivity was measured at room temperature (25±2°), 40, 50, 60, 70 and 80°. Applying simple linear regression (least squares) to the temperature-conductivity data determines the effective surfactant blend concentration required for preparing the most stable O/W emulsion. These results were confirmed by physical stability centrifugation testing and by phase inversion temperature range measurements. The results indicated that the relation representing the most stable O/W emulsion has the strongest direct linear relationship between temperature and conductivity; this relationship is linear up to 80°. This work shows that the most stable O/W emulsion is identified via the maximum R² value obtained when the simple linear regression least squares method is applied to the temperature-conductivity data up to 80°; in addition, the true maximum slope is given by the equation with the maximum R² value. Because the conditions would change in a more complex formulation, the method for determining the effective surfactant blend concentration was verified by applying it to a more complex formulation of 2% O/W miconazole nitrate cream, and the results indicate its reproducibility. PMID:26664063
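The R²-maximization step can be sketched as follows, with hypothetical temperature-conductivity readings for two illustrative blend concentrations (the concentrations and values are invented):

```python
import numpy as np

# Measurement temperatures used in the study design (°C)
temps = np.array([25.0, 40.0, 50.0, 60.0, 70.0, 80.0])

def r_squared(x, y):
    """R^2 of a simple least-squares line fitted to (x, y)."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    return 1.0 - np.sum(resid**2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(7)
# Hypothetical conductivity readings per surfactant blend concentration
emulsions = {
    '7%':  5.0 + 0.02 * temps**2,                                  # curved: poorer linear fit
    '11%': 12.0 + 0.55 * temps + rng.normal(scale=0.05, size=6),   # near-perfect linear trend
}

scores = {conc: r_squared(temps, cond) for conc, cond in emulsions.items()}
best_conc = max(scores, key=scores.get)   # blend with the strongest linear relation
```

Under the paper's criterion, `best_conc` marks the concentration whose temperature-conductivity relation is most strongly linear, i.e. the most stable emulsion.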
ERIC Educational Resources Information Center
Phillips, Gary W.
The usefulness of path analysis as a means of better understanding various linear models is demonstrated. First, two linear models are presented in matrix form using linear structural relations (LISREL) notation. The two models, regression and factor analysis, are shown to be identical although the research question and data matrix to which these…
NASA Astrophysics Data System (ADS)
Joshi, Deepti; St-Hilaire, André; Daigle, Anik; Ouarda, Taha B. M. J.
2013-04-01
Summary: This study compares the performance of two statistical downscaling frameworks in downscaling hydrological indices (descriptive statistics) characterizing the low flow regimes of three rivers in Eastern Canada: the Moisie, Romaine and Ouelle. The statistical models selected are the Relevance Vector Machine (RVM), an implementation of Sparse Bayesian Learning, and the Automated Statistical Downscaling tool (ASD), an implementation of Multiple Linear Regression. Inputs to both frameworks are climate variables significantly (α = 0.05) correlated with the indices. These variables were processed using Canonical Correlation Analysis, and the resulting canonical variate scores were used as input to RVM to estimate the selected low flow indices. In ASD, the significantly correlated climate variables were subjected to backward stepwise predictor selection, and the selected predictors were subsequently used to estimate the selected low flow indices using Multiple Linear Regression. With respect to the correlation between climate variables and the selected low flow indices, all indices were influenced primarily by wind components (vertical, zonal and meridional) and humidity variables (specific and relative humidity). The downscaling performance of the RVM framework was found to be better than that of ASD in terms of Relative Root Mean Square Error, Relative Mean Absolute Bias and Coefficient of Determination. In all cases, the former resulted in less variability of the performance indices between calibration and validation sets, implying better generalization ability than the latter.
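The preprocessing step described here, extracting canonical variate scores from the climate predictors and regressing a low-flow index on them, can be sketched with a QR/SVD-based canonical correlation analysis. Everything below is synthetic and illustrative, not the study's data or code:

```python
import numpy as np

def cca_x_variates(X, Y, k):
    """Return the first k canonical variate scores of X with respect to Y."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    U, s, Vt = np.linalg.svd(Qx.T @ Qy)
    return Qx @ U[:, :k]

rng = np.random.default_rng(0)
n = 200
climate = rng.normal(size=(n, 6))   # hypothetical standardized predictor fields
low_flow = climate[:, :2] @ np.array([[1.0], [-0.5]]) + 0.1 * rng.normal(size=(n, 1))

scores = cca_x_variates(climate, low_flow, k=2)
# Regress the low-flow index on the canonical variate scores.
A = np.column_stack([np.ones(n), scores])
coef, *_ = np.linalg.lstsq(A, low_flow, rcond=None)
pred = A @ coef
r2 = 1 - np.sum((low_flow - pred)**2) / np.sum((low_flow - low_flow.mean())**2)
```

Because the first canonical variate of X with respect to a one-dimensional Y spans the least-squares fitted direction, this regression recovers essentially the full linear fit of the index on the climate fields.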
NASA Technical Reports Server (NTRS)
Smith, Timothy D.; Steffen, Christopher J., Jr.; Yungster, Shaye; Keller, Dennis J.
1998-01-01
The all-rocket mode of operation is shown to be a critical factor in the overall performance of a rocket based combined cycle (RBCC) vehicle. An axisymmetric RBCC engine was used to determine specific impulse efficiency values based upon both full flow and gas generator configurations. Design of experiments methodology was used to construct a test matrix and multiple linear regression analysis was used to build parametric models. The main parameters investigated in this study were: rocket chamber pressure, rocket exit area ratio, injected secondary flow, mixer-ejector inlet area, mixer-ejector area ratio, and mixer-ejector length-to-inlet diameter ratio. A perfect gas computational fluid dynamics analysis, using both the Spalart-Allmaras and k-omega turbulence models, was performed with the NPARC code to obtain values of vacuum specific impulse. Results from the multiple linear regression analysis showed that for both the full flow and gas generator configurations increasing mixer-ejector area ratio and rocket area ratio increase performance, while increasing mixer-ejector inlet area ratio and mixer-ejector length-to-diameter ratio decrease performance. Increasing injected secondary flow increased performance for the gas generator analysis, but was not statistically significant for the full flow analysis. Chamber pressure was found to be not statistically significant.
Nucleus detection using gradient orientation information and linear least squares regression
NASA Astrophysics Data System (ADS)
Kwak, Jin Tae; Hewitt, Stephen M.; Xu, Sheng; Pinto, Peter A.; Wood, Bradford J.
2015-03-01
Computerized histopathology image analysis enables an objective, efficient, and quantitative assessment of digitized histopathology images. Such analysis often requires an accurate and efficient detection and segmentation of histological structures such as glands, cells and nuclei. The segmentation is used to characterize tissue specimens and to determine the disease status or outcomes. The segmentation of nuclei, in particular, is challenging due to the overlapping or clumped nuclei. Here, we propose a nuclei seed detection method for the individual and overlapping nuclei that utilizes the gradient orientation or direction information. The initial nuclei segmentation is provided by a multiview boosting approach. The angle of the gradient orientation is computed and traced for the nuclear boundaries. Taking the first derivative of the angle of the gradient orientation, high concavity points (junctions) are discovered. False junctions are found and removed by adopting a greedy search scheme with the goodness-of-fit statistic in a linear least squares sense. Then, the junctions determine boundary segments. Partial boundary segments belonging to the same nucleus are identified and combined by examining the overlapping area between them. Using the final set of the boundary segments, we generate the list of seeds in tissue images. The method achieved an overall precision of 0.89 and a recall of 0.88 in comparison to the manual segmentation.
A componential model of human interaction with graphs: 1. Linear regression modeling
NASA Technical Reports Server (NTRS)
Gillan, Douglas J.; Lewis, Robert
1994-01-01
Task analyses served as the basis for developing the Mixed Arithmetic-Perceptual (MA-P) model, which proposes (1) that people interacting with common graphs to answer common questions apply a set of component processes: searching for indicators, encoding the value of indicators, performing arithmetic operations on the values, making spatial comparisons among indicators, and responding; and (2) that the type of graph and the user's task determine the combination and order of the components applied (i.e., the processing steps). Two experiments investigated the prediction that response time will be linearly related to the number of processing steps according to the MA-P model. Subjects used line graphs, scatter plots, and stacked bar graphs to answer comparison questions and questions requiring arithmetic calculations. A one-parameter version of the model (with equal weights for all components) and a two-parameter version (with different weights for arithmetic and nonarithmetic processes) accounted for 76%-85% of individual subjects' variance in response time and 61%-68% of the variance taken across all subjects. The discussion addresses possible modifications of the MA-P model, alternative models, and design implications of the MA-P model.
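The one-parameter version of the model, response time proportional to the number of processing steps with a single common weight, reduces to a no-intercept linear regression. The step counts and response times below are hypothetical placeholders, not the experiments' data:

```python
import numpy as np

# Hypothetical trials: (number of MA-P processing steps, mean response time in s).
steps = np.array([3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
rt    = np.array([1.9, 2.4, 3.1, 3.5, 4.2, 4.6])

# One-parameter model: RT = w * steps (all components weighted equally).
# Least-squares solution through the origin: w = sum(s*rt) / sum(s^2).
w = np.sum(steps * rt) / np.sum(steps**2)
pred = w * steps
r2 = 1 - np.sum((rt - pred)**2) / np.sum((rt - rt.mean())**2)
```

The two-parameter version would instead regress RT on two step counts (arithmetic and nonarithmetic) with separate weights.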
NASA Astrophysics Data System (ADS)
Elliott, J.; de Souza, R. S.; Krone-Martins, A.; Cameron, E.; Ishida, E. E. O.; Hilbe, J.
2015-04-01
Machine learning techniques offer a precious tool box for use within astronomy to solve problems involving so-called big data. They provide a means to make accurate predictions about a particular system without prior knowledge of the underlying physical processes of the data. In this article, and the companion papers of this series, we present the set of Generalized Linear Models (GLMs) as a fast alternative method for tackling general astronomical problems, including the ones related to the machine learning paradigm. To demonstrate the applicability of GLMs to inherently positive and continuous physical observables, we explore their use in estimating the photometric redshifts of galaxies from their multi-wavelength photometry. Using the gamma family with a log link function we predict redshifts from the PHoto-z Accuracy Testing simulated catalogue and a subset of the Sloan Digital Sky Survey from Data Release 10. We obtain fits that result in catastrophic outlier rates as low as ∼1% for simulated and ∼2% for real data. Moreover, we can easily obtain such levels of precision within a matter of seconds on a normal desktop computer and with training sets that contain merely thousands of galaxies. Our software is made publicly available as a user-friendly package developed in Python, R and via an interactive web application. This software allows users to apply a set of GLMs to their own photometric catalogues and generates publication quality plots with minimum effort. By facilitating their ease of use to the astronomical community, this paper series aims to make GLMs widely known and to encourage their implementation in future large-scale projects, such as the Large Synoptic Survey Telescope.
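A gamma-family GLM with a log link, as used here for strictly positive observables, can be fitted by iteratively reweighted least squares; for this particular family/link pair the IRLS weights are constant, so each step is a plain least-squares solve. The "magnitude" predictor, coefficients, and data below are synthetic assumptions for illustration only (in practice one would use R's glm or Python's statsmodels):

```python
import numpy as np

def gamma_log_glm(X, y, n_iter=50):
    """Fit a gamma-family GLM with log link by IRLS. For gamma + log link the
    working weights are identically 1, so each iteration is ordinary lstsq."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu          # working response
        beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    return beta

rng = np.random.default_rng(1)
n = 500
mag = rng.uniform(18, 24, size=n)                 # hypothetical photometric magnitudes
true_eta = 0.3 + 0.05 * mag
mean_z = np.exp(true_eta)                         # strictly positive target scale
y = rng.gamma(shape=50.0, scale=mean_z / 50.0)    # gamma noise, mean = mean_z

X = np.column_stack([np.ones(n), mag])
beta = gamma_log_glm(X, y)
```

The fitted coefficients approximate the generating values (0.3, 0.05), and predictions exp(X @ beta) are guaranteed positive, which is the point of the log link.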
ERIC Educational Resources Information Center
Tipton, Elizabeth; Pustejovsky, James E.
2015-01-01
Randomized experiments are commonly used to evaluate the effectiveness of educational interventions. The goal of the present investigation is to develop small-sample corrections for multiple contrast hypothesis tests (i.e., F-tests) such as the omnibus test of meta-regression fit or a test for equality of three or more levels of a categorical…
ERIC Educational Resources Information Center
Thatcher, Greg W.; Henson, Robin K.
This study examined research in training and development to determine effect size reporting practices. It focused on the reporting of corrected effect sizes in research articles using multiple regression analyses. When possible, researchers calculated corrected effect sizes and determined whether the associated shrinkage could have impacted researcher…
Jahandideh, Sepideh; Jahandideh, Samad; Asadabadi, Ebrahim Barzegari; Askarian, Mehrdad; Movahedi, Mohammad Mehdi; Hosseini, Somayyeh; Jahandideh, Mina
2009-11-15
Prediction of the amount of hospital waste production will be helpful in the storage, transportation and disposal aspects of hospital waste management. On this basis, two predictive models, artificial neural networks (ANNs) and multiple linear regression (MLR), were applied to predict the rate of medical waste generation in total and in the separate categories of sharp, infectious and general waste. In this study, a 5-fold cross-validation procedure on a database containing a total of 50 hospitals of Fars province (Iran) was used to verify the performance of the models. Three performance measures, MAR, RMSE and R², were used to evaluate the models. The MLR, as a conventional model, obtained poor prediction performance values; however, MLR identified hospital capacity and bed occupancy as the more significant parameters. On the other hand, ANNs, a more powerful model that had not previously been applied to predicting the rate of medical waste generation, showed high performance values, in particular an R² of 0.99, confirming the good fit of the data. Such satisfactory results can be attributed to the non-linear nature of ANNs, which provides the opportunity to relate independent variables to dependent ones non-linearly. In conclusion, the obtained results show that our ANN-based approach is very promising and may play a useful role in developing a better cost-effective strategy for waste management in the future.
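The three performance measures used to compare the models (MAR, RMSE and R²) are straightforward to compute. The waste totals and the two prediction vectors below are invented for illustration; they are not the study's data:

```python
import numpy as np

def mar(y, yhat):
    """Mean absolute residual."""
    return np.mean(np.abs(y - yhat))

def rmse(y, yhat):
    """Root mean squared error."""
    return np.sqrt(np.mean((y - yhat)**2))

def r2(y, yhat):
    """Coefficient of determination."""
    return 1 - np.sum((y - yhat)**2) / np.sum((y - np.mean(y))**2)

# Hypothetical weekly waste totals (kg) for five hospitals, with predictions
# from two models of differing quality.
y        = np.array([120.0, 340.0, 95.0, 410.0, 230.0])
pred_mlr = np.array([150.0, 300.0, 130.0, 360.0, 260.0])   # cruder fit
pred_ann = np.array([123.0, 335.0, 98.0, 405.0, 228.0])    # tighter fit
```

A better model should win on all three: lower MAR and RMSE, higher R².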
Caballero, Julio; Fernández, Michael
2006-01-01
Antifungal activity was modeled for a set of 96 heterocyclic ring derivatives (2,5,6-trisubstituted benzoxazoles, 2,5-disubstituted benzimidazoles, 2-substituted benzothiazoles and 2-substituted oxazolo(4,5-b)pyridines) using multiple linear regression (MLR) and Bayesian-regularized artificial neural network (BRANN) techniques. Inhibitory activity against Candida albicans (log(1/C)) was correlated with 3D descriptors encoding the chemical structures of the heterocyclic compounds. Training and test sets were chosen by means of k-means clustering. The most appropriate variables for linear and nonlinear modeling were selected using a genetic algorithm (GA) approach. In addition to the MLR equation (MLR-GA), two nonlinear models were built: a BRANN model employing the linear variable subset, and an optimum BRANN-GA model obtained by a hybrid method combining the BRANN and GA approaches. The linear model fit the training set (n = 80) with r2 = 0.746, while BRANN and BRANN-GA gave higher values of r2 = 0.889 and r2 = 0.937, respectively. Beyond the improvement in training set fitting, the BRANN-GA model was superior to the others, describing 87% of the test set (n = 16) variance in comparison with 78% and 81% for the MLR-GA and BRANN models, respectively. Our quantitative structure-activity relationship study suggests that the distributions of atomic mass, volume and polarizability have relevant relationships with the antifungal potency of the compounds studied. Furthermore, the ability of the six nonlinearly selected variables to differentiate the data was demonstrated when the total data set was well distributed in a Kohonen self-organizing neural network (KNN). PMID:16205958
Technology Transfer Automated Retrieval System (TEKTRAN)
A stochastic/linear program Excel workbook was developed consisting of two worksheets illustrating linear and stochastic program approaches. Both approaches used the Excel Solver add-in. A published linear program problem served as an example for the ingredients, nutrients and costs and as a benchma...
NASA Astrophysics Data System (ADS)
Soares dos Santos, T.; Mendes, D.; Rodrigues Torres, R.
2016-01-01
Several studies have been devoted to dynamic and statistical downscaling for analysis of both climate variability and climate change. This paper introduces an application of artificial neural networks (ANNs) and multiple linear regression (MLR) by principal components to estimate rainfall in South America. This method is proposed for downscaling monthly precipitation time series over South America for three regions: the Amazon; northeastern Brazil; and the La Plata Basin, which is one of the regions of the planet that will be most affected by the climate change projected for the end of the 21st century. The downscaling models were developed and validated using CMIP5 model output and observed monthly precipitation. We used general circulation model (GCM) experiments for the 20th century (RCP historical; 1970-1999) and two scenarios (RCP 2.6 and 8.5; 2070-2100). The model test results indicate that the ANNs significantly outperform the MLR downscaling of monthly precipitation variability.
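The MLR-by-principal-components step used here, projecting the predictor fields onto their leading principal components and regressing precipitation on the scores, can be sketched with an SVD. The GCM fields, coefficients and noise level below are synthetic assumptions, not the study's data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 300, 10
# Hypothetical GCM predictor fields with decreasing variance per column.
gcm = rng.normal(size=(n, p)) * np.linspace(3.0, 0.5, p)
precip = 2.0 * gcm[:, 0] - 1.0 * gcm[:, 1] + 0.2 * rng.normal(size=n)

# Principal components of the centered predictors via SVD.
Xc = gcm - gcm.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 4
pcs = Xc @ Vt[:k].T            # scores on the leading k components

# Multiple linear regression of precipitation on the leading components.
A = np.column_stack([np.ones(n), pcs])
coef, *_ = np.linalg.lstsq(A, precip, rcond=None)
r2 = 1 - np.sum((precip - A @ coef)**2) / np.sum((precip - precip.mean())**2)
```

Truncating to k components regularizes the regression when the predictor fields are strongly intercorrelated, which is the usual situation with gridded climate output.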
NASA Astrophysics Data System (ADS)
dos Santos, T. S.; Mendes, D.; Torres, R. R.
2015-08-01
Several studies have been devoted to dynamic and statistical downscaling for analysis of both climate variability and climate change. This paper introduces an application of artificial neural networks (ANN) and multiple linear regression (MLR) by principal components to estimate rainfall in South America. This method is proposed for downscaling monthly precipitation time series over South America for three regions: the Amazon, Northeastern Brazil and the La Plata Basin, which is one of the regions of the planet that will be most affected by the climate change projected for the end of the 21st century. The downscaling models were developed and validated using CMIP5 model output and observed monthly precipitation. We used GCM experiments for the 20th century (RCP Historical; 1970-1999) and two scenarios (RCP 2.6 and 8.5; 2070-2100). The model test results indicate that the ANN significantly outperforms the MLR downscaling of monthly precipitation variability.
Soboyejo, W.O.; Soboyejo, A.B.O.; Ni, Y.; Mercer, C.
1997-12-31
In a recent paper, Mercer and Soboyejo demonstrated the Hall-Petch dependence of basic room- and elevated-temperature (815 C) mechanical properties (0.2% offset strength, ultimate tensile strength, plastic elongation to failure and fracture toughness) on the average equiaxed/lamellar grain size. Simple Hall-Petch behavior was shown to occur in a wide range of extruded duplex α₂+γ alloys (Ti-48Al, Ti-48Al-1.4Mn, Ti-48Al-2Mn and Ti-48Al-1.5Cr). As in steels and other materials, simple Hall-Petch equations were derived for the above properties. However, the Hall-Petch equations did not include the effect of other variables that can affect the basic mechanical properties of gamma alloys. Multiple linear regression equations for the prediction of the combined effects of several (alloying, microstructure and temperature) variables on basic mechanical properties are presented in this paper.
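A simple Hall-Petch equation, σ = σ₀ + k·d^(-1/2), is linear in d^(-1/2), so fitting it is a one-variable least-squares problem. The grain sizes and strengths below are invented round numbers for illustration, not the paper's measurements:

```python
import numpy as np

# Hypothetical 0.2% offset strength (MPa) vs average grain size d (µm).
d = np.array([10.0, 20.0, 40.0, 80.0, 160.0])
sigma = np.array([520.0, 470.0, 435.0, 410.0, 393.0])

# Hall-Petch: sigma = sigma0 + k * d**-0.5, i.e. linear in x = d**-0.5.
x = d**-0.5
A = np.column_stack([np.ones_like(x), x])
(sigma0, k), *_ = np.linalg.lstsq(A, sigma, rcond=None)
```

The multiple linear regression extension in the paper simply adds further columns (alloying, microstructure and temperature variables) to the design matrix `A`.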
Rousselot, J M; Peslin, R; Duvivier, C
1992-07-01
A potentially useful method to monitor respiratory mechanics in artificially ventilated patients consists of analyzing the relationship between tracheal pressure (P), lung volume (V), and gas flow (V̇) by multiple linear regression (MLR) using a suitable model. Contrary to other methods, it does not require any particular flow waveform and, therefore, may be used with any ventilator. This approach was evaluated in three neonates and seven young children admitted into an intensive care unit for respiratory disorders of various etiologies. P and V were measured and digitized at a sampling rate of 40 Hz for periods of 20-48 s. After correction of P for the non-linear resistance of the endotracheal tube, the data were first analyzed with the usual linear monoalveolar model: P = P₀ + E·V + R·V̇, where E and R are total respiratory elastance and resistance, and P₀ is the static recoil pressure at end-expiration. A good fit of the model to the data was seen in five of ten children. P₀, E, and R were reproducible within cycles, and consistent with the patient's age and condition; the data obtained with two ventilatory modes were highly correlated. In the five instances in which the simple model did not fit the data well, the data were reanalyzed with more sophisticated models allowing for mechanical non-homogeneity or for non-linearity of R or E. While several models substantially improved the fit, physiologically meaningful results were only obtained when R was allowed to change with lung volume. We conclude that the MLR method is adequate to monitor respiratory mechanics, even when the usual model is inadequate. PMID:1437330
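Fitting the monoalveolar model P = P₀ + E·V + R·V̇ by multiple linear regression is a three-column least-squares problem. The sketch below simulates 40 Hz pressure/volume records under assumed mechanics (P₀ = 2 cmH₂O, E = 50 cmH₂O/L, R = 20 cmH₂O/(L/s), all hypothetical values) and recovers the parameters:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
t = np.arange(n) / 40.0                          # 40 Hz sampling (s)
V = 0.02 * (1 - np.cos(2 * np.pi * 0.4 * t))     # tidal volume (L), ~0.4 Hz breathing
Vdot = np.gradient(V, t)                         # flow (L/s)

# Simulated tracheal pressure from the assumed mechanics plus measurement noise.
P = 2.0 + 50.0 * V + 20.0 * Vdot + 0.05 * rng.normal(size=n)

# Multiple linear regression with the model P = P0 + E*V + R*Vdot recovers
# end-expiratory recoil pressure, elastance and resistance.
A = np.column_stack([np.ones(n), V, Vdot])
(P0, E, R), *_ = np.linalg.lstsq(A, P, rcond=None)
```

Because V and V̇ are roughly orthogonal over whole breathing cycles, the estimates are stable regardless of the flow waveform, which is the practical advantage the abstract points out.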
2013-01-01
Background: Genome-wide association studies have become very popular in identifying genetic contributions to phenotypes. Millions of SNPs are being tested for their association with diseases and traits using linear or logistic regression models. This conceptually simple strategy encounters the following computational issues: a large number of tests and very large genotype files (many gigabytes) which cannot be directly loaded into the software memory. One of the solutions applied on a grand scale is cluster computing involving large-scale resources. We show how to speed up the computations using matrix operations in pure R code. Results: We improve speed: computation time is reduced from 6 hours to 10-15 minutes. Our approach can handle essentially an unlimited number of covariates efficiently, using projections. Data files in GWAS are vast and reading them into computer memory becomes an important issue. However, much improvement can be made if the data is structured beforehand in a way allowing for easy access to blocks of SNPs. We propose several solutions based on the R packages ff and ncdf. We adapted the semi-parallel computations for logistic regression. We show that in a typical GWAS setting, where SNP effects are very small, we do not lose any precision and our computations are a few hundred times faster than standard procedures. Conclusions: We provide very fast algorithms for GWAS written in pure R code. We also show how to rearrange SNP data for fast access. PMID:23711206
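The core "semi-parallel" idea, project covariates out of the phenotype and the genotype matrix once, then obtain every per-SNP slope and t-statistic with matrix operations instead of per-SNP model fits, translates directly from R to any array language. The sketch below is a Python/numpy rendering on synthetic data (the paper's implementation is pure R):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 1000, 2000                                      # samples, SNPs
G = rng.binomial(2, 0.3, size=(n, m)).astype(float)    # genotype dosages 0/1/2
C = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one covariate
y = 0.4 * G[:, 0] + C @ np.array([1.0, 0.5]) + rng.normal(size=n)

# Residualize the phenotype and every SNP on the covariates once, then test
# all m SNPs at once with matrix operations instead of m separate fits.
Q, _ = np.linalg.qr(C)
y_res = y - Q @ (Q.T @ y)
G_res = G - Q @ (Q.T @ G)

gg = np.sum(G_res**2, axis=0)
beta = (G_res.T @ y_res) / gg                  # per-SNP slopes, all at once
rss = np.sum(y_res**2) - beta**2 * gg          # per-SNP residual sum of squares
se = np.sqrt(rss / (n - 3) / gg)               # df: n - covariates - SNP
t = beta / se
```

Only SNP 0 carries a true effect here, so its t-statistic should stand far above the null SNPs.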
Naguib, Ibrahim A.; Abdelaleem, Eglal A.; Zaazaa, Hala E.; Hussein, Essraa A.
2015-01-01
A comparison between partial least squares regression and support vector regression chemometric models is introduced in this study. The two models are implemented to analyze cefoperazone sodium in presence of its reported impurities, 7-aminocephalosporanic acid and 5-mercapto-1-methyl-tetrazole, in pure powders and in pharmaceutical formulations through processing UV spectroscopic data. For best results, a 3-factor 4-level experimental design was used, resulting in a training set of 16 mixtures containing different ratios of interfering moieties. For method validation, an independent test set consisting of 9 mixtures was used to test predictive ability of established models. The introduced results show the capability of the two proposed models to analyze cefoperazone in presence of its impurities 7-aminocephalosporanic acid and 5-mercapto-1-methyl-tetrazole with high trueness and selectivity (101.87 ± 0.708 and 101.43 ± 0.536 for PLSR and linear SVR, resp.). Analysis results of drug products were statistically compared to a reported HPLC method showing no significant difference in trueness and precision, indicating the capability of the suggested multivariate calibration models to be reliable and adequate for routine quality control analysis of drug product. SVR offers more accurate results with lower prediction error compared to PLSR model; however, PLSR is easy to handle and fast to optimize. PMID:26664764
Piriyaprasarth, Suchada; Sriamornsak, Pornsak
2011-06-15
The aim of this study was to investigate the effect of source variation of hydroxypropyl methylcellulose (HPMC) raw material on prediction of drug release from HPMC matrix tablets. To achieve this objective, the flow ability (i.e., angle of repose and Carr's compressibility index) and apparent viscosity of HPMC from 3 sources were investigated to differentiate HPMC source variation. The physicochemical properties of the drug and the manufacturing process were also incorporated to develop the linear regression model for prediction of drug release. Specifically, the in vitro release of 18 formulations was determined according to a 2 × 3 × 3 full factorial design. Further regression analysis provided a quantitative relationship between the response and the studied independent variables. It was found that either the apparent viscosity or Carr's compressibility index of the HPMC powders, combined with the solubility and molecular weight of the drug, had a significant impact on the release behavior of the drug. Increased drug release was observed with greater drug solubility and lower drug molecular weight. Most importantly, this study has shown that HPMC having low viscosity or a high compressibility index resulted in an increase in drug release, especially in the case of poorly soluble drugs. PMID:21420475
Silva, Ana Elisa Pereira; Freitas, Corina da Costa; Dutra, Luciano Vieira; Molento, Marcelo Beltrão
2016-02-15
Fasciola hepatica is the causative agent of fasciolosis, a disease that triggers a chronic inflammatory process in the liver, affecting mainly ruminants and other animals, including humans. In Brazil, F. hepatica occurs most frequently in the southernmost state, Rio Grande do Sul. The objective of this study was to estimate areas at risk using an eight-year (2002-2010) time series of the climatic and environmental variables that best relate to the disease, applying a linear regression method to municipalities in the state of Rio Grande do Sul. The positivity index of the disease, the proportion of infected animals among slaughtered animals, was divided into three risk classes: low, medium and high. The accuracy of the known sample classification in the confusion matrix for the low, medium and high rates produced by the estimated model presented values between 39 and 88%, depending on the year. The regression analysis showed the importance of the time-based data for the construction of the model, considering the two variables from the year preceding the event (positivity index and maximum temperature). The generated data are important for epidemiological and parasite control studies, mainly because F. hepatica infection can last from months to years. PMID:26827853
Rafiei, Hamid; Khanzadeh, Marziyeh; Mozaffari, Shahla; Bostanifar, Mohammad Hassan; Avval, Zhila Mohajeri; Aalizadeh, Reza; Pourbasheer, Eslam
2016-01-01
Quantitative structure-activity relationship (QSAR) analysis has been employed for predicting the inhibitory activities of Hepatitis C virus (HCV) NS5B polymerase inhibitors. A data set consisting of 72 compounds was selected, and then different types of molecular descriptors were calculated. The whole data set was split into a training set (80% of the dataset) and a test set (20% of the dataset) using principal component analysis. The stepwise (SW) and genetic algorithm (GA) techniques were used as variable selection tools. The multiple linear regression method was then used to linearly correlate the selected descriptors with the inhibitory activities. Several validation techniques, including leave-one-out and leave-group-out cross-validation and the Y-randomization method, were used to evaluate the internal capability of the derived models. The external prediction ability of the derived models was further analyzed using modified r2 and concordance correlation coefficient values and the Golbraikh and Tropsha acceptable model criteria. Based on the derived results (GA-MLR), some new insights toward the molecular structural requirements for obtaining better inhibitory activity were obtained. PMID:27065774
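Leave-one-out cross-validation of an MLR model, one of the internal validation techniques named above, yields the Q² statistic: one minus the predictive residual sum of squares (PRESS) over the total sum of squares. The descriptors and activities below are synthetic, not the study's data set:

```python
import numpy as np

def loo_q2(X, y):
    """Leave-one-out cross-validated Q^2 for a multiple linear regression."""
    n = len(y)
    press = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        coef, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        press += (y[i] - X[i] @ coef) ** 2           # predict the held-out point
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(5)
n = 40
d1, d2 = rng.normal(size=n), rng.normal(size=n)      # hypothetical descriptors
pIC50 = 1.0 + 0.8 * d1 - 0.5 * d2 + 0.1 * rng.normal(size=n)
X = np.column_stack([np.ones(n), d1, d2])
q2 = loo_q2(X, pIC50)
```

Because each prediction is made without the held-out compound, Q² is always below the in-sample r² and is the more honest figure for small QSAR data sets.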
NASA Astrophysics Data System (ADS)
Bell, A. L.; Moore, J. N.; Greenwood, M. C.
2007-12-01
The Flathead River in Northwestern Montana drains the relatively pristine, high-mountain watersheds of Glacier-Waterton national parks and large wilderness areas, making it an excellent test-bed for hydrologic response to climate change. Flows in the North Fork and Middle Fork of the Flathead River are relatively unmodified by humans, whereas the South Fork has a large hydroelectric reservoir (Hungry Horse) in the lower end of the basin. USGS stream gage data for the North, Middle and South forks from 1940 to 2006 were analyzed for significant trends in the timing of quantiles of flow to examine climate forcing vs. direct modification of flow from the dam. The trends in timing were analyzed for climate change influences using the PRISM model output for 1940 to 2006 for the respective basin. The analysis of trends in timing employed two linear regression methods, typical least squares estimation and robust estimation using weighted least squares. Least squares estimation is the standard method employed when performing regression analysis. The power of this method is sensitive to the violation of the assumptions of normally distributed errors with constant variance (homoscedasticity). Considering that violations of these assumptions are common in hydrologic data, robust estimation was used to preserve the desired statistical power because it is not significantly affected by non-normality or heteroscedasticity. Least squares estimated trends that were found to be significant, using a 10% significance level, were typically not significant using a robust estimation method. This could have implications for interpreting the meaning of significant trends found using the least squares estimator. Utilizing robust estimation methods for analyzing hydrologic data may allow investigators to more accurately summarize any trends.
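The contrast drawn here, ordinary least squares versus robust estimation by weighted least squares, can be illustrated with a Huber-weighted iteratively reweighted least-squares line fit (one common robust estimator; the study does not specify its exact weighting scheme). The timing series below is synthetic, with a known trend and two injected outliers:

```python
import numpy as np

def fit_line_ols(x, y, w=None):
    """(Weighted) least-squares straight line; returns [intercept, slope]."""
    if w is None:
        w = np.ones_like(y)
    sw = np.sqrt(w)
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta

def fit_line_huber(x, y, k=1.345, n_iter=30):
    """Robust line fit: iteratively reweighted least squares, Huber weights."""
    beta = fit_line_ols(x, y)
    X = np.column_stack([np.ones_like(x), x])
    for _ in range(n_iter):
        resid = y - X @ beta
        scale = 1.4826 * np.median(np.abs(resid - np.median(resid)))  # MAD scale
        u = resid / scale
        w = np.where(np.abs(u) <= k, 1.0, k / np.abs(u))              # downweight outliers
        beta = fit_line_ols(x, y, w)
    return beta

# Hypothetical timing-of-flow series (day of year of a flow quantile) with a
# true trend of -0.2 d/yr and two gross outliers.
rng = np.random.default_rng(7)
year = np.arange(1940, 2007).astype(float)
doy = 150.0 - 0.2 * (year - 1940) + 3.0 * rng.normal(size=year.size)
doy[10] += 60.0
doy[20] += 60.0

b_ols = fit_line_ols(year, doy)
b_rob = fit_line_huber(year, doy)
```

The outliers drag the OLS slope away from the true -0.2 d/yr, while the Huber fit stays close, the behavior that motivated the robust analysis in the study.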
Rossi, D J; Kress, D D; Tess, M W; Burfening, P J
1992-05-01
Standard linear adjustment of weaning weight to a constant age has been shown to introduce bias in the adjusted weight due to nonlinear growth from birth to weaning of beef calves. Ten years of field records from the five strains of Beefbooster Cattle Alberta Ltd. seed stock herds were used to investigate the use of correction factors to adjust standard 180-d weight (WT180) for this bias. Statistical analyses were performed within strain and followed three steps: 1) the full data set was split into an estimation set (ES) and a validation set (VS), 2) WT180 from the ES was used to develop estimates of correction factors using a model including herd (H), year (YR), age of dam (DA), sex of calf (S), all two and three-way interactions, and any significant linear and quadratic covariates of calf age at weaning deviated from 180 d (DEVCA) and interactions between DEVCA and DA, S or DA x S, and 3) significant DEVCA coefficients were used to correct WT180 from the VS, then WT180 and the corrected weight (WTCOR) from the VS were analyzed with the same model as in Step 2 and significance of DEVCA terms were compared. Two types of data splitting were used. Adjusted R2 was calculated to describe the proportion of total variation of DEVCA terms explained for WT180 from the ES. The DEVCA terms explained .08 to 1.54% of the total variation for the five strains. Linear and quadratic correction factors were both positive and negative. Bias in WT180 from the ES within 180 +/- 35 d of age ranged from 2.8 to 21.7 kg.(ABSTRACT TRUNCATED AT 250 WORDS) PMID:1526901
Kokaly, R.F.; Clark, R.N.
1999-01-01
We develop a new method for estimating the biochemistry of plant material using spectroscopy. Normalized band depths calculated from the continuum-removed reflectance spectra of dried and ground leaves were used to estimate their concentrations of nitrogen, lignin, and cellulose. Stepwise multiple linear regression was used to select wavelengths in the broad absorption features centered at 1.73 µm, 2.10 µm, and 2.30 µm that were highly correlated with the chemistry of samples from eastern U.S. forests. Band depths of absorption features at these wavelengths were found to also be highly correlated with the chemistry of four other sites. A subset of data from the eastern U.S. forest sites was used to derive linear equations that were applied to the remaining data to successfully estimate their nitrogen, lignin, and cellulose concentrations. Correlations were highest for nitrogen (R² from 0.75 to 0.94). The consistent results indicate the possibility of establishing a single equation capable of estimating the chemical concentrations in a wide variety of species from the reflectance spectra of dried leaves. The extension of this method to remote sensing was investigated. The effects of leaf water content, sensor signal-to-noise and bandpass, atmospheric effects, and background soil exposure were examined. Leaf water was found to be the greatest challenge to extending this empirical method to the analysis of fresh whole leaves and complete vegetation canopies. The influence of leaf water on reflectance spectra must be removed to within 10%. Other effects were reduced by continuum removal and normalization of band depths. If the effects of leaf water can be compensated for, it might be possible to extend this method to remote sensing data acquired by imaging spectrometers to give estimates of nitrogen, lignin, and cellulose concentrations over large areas for use in ecosystem studies.
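The continuum-removal and band-depth calculation underlying this method is simple to sketch: fit a straight-line continuum across an absorption feature's shoulders, divide it out, and take one minus the minimum as the band depth. The spectrum below is a synthetic sloping background with a Gaussian absorption near 2.30 µm, with shoulder wavelengths chosen arbitrarily for illustration:

```python
import numpy as np

wl = np.linspace(2.0, 2.4, 81)                        # wavelength (µm)
# Hypothetical dried-leaf reflectance: sloping background times a Gaussian
# absorption feature centered at 2.30 µm with 25% maximum absorption.
background = 0.45 + 0.10 * (wl - 2.0)
reflect = background * (1 - 0.25 * np.exp(-((wl - 2.30)**2) / (2 * 0.02**2)))

# Continuum: straight line between the feature's shoulders (taken here at
# 2.20 and 2.40 µm); continuum removal divides it out.
i0, i1 = np.argmin(np.abs(wl - 2.20)), np.argmin(np.abs(wl - 2.40))
slope = (reflect[i1] - reflect[i0]) / (wl[i1] - wl[i0])
continuum = reflect[i0] + slope * (wl - wl[i0])
removed = reflect / continuum

band_depth = 1 - removed.min()                         # depth at the band center
center = wl[removed.argmin()]
```

In the paper's workflow, band depths at stepwise-selected wavelengths are then normalized (e.g., by the depth at the band center) and fed into the linear chemistry equations.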
NASA Astrophysics Data System (ADS)
Deml, Ann M.; O'Hayre, Ryan; Wolverton, Chris; Stevanović, Vladan
2016-02-01
The availability of quantitatively accurate total energies (Etot) of atoms, molecules, and solids, enabled by the development of density functional theory (DFT), has transformed solid state physics, quantum chemistry, and materials science by allowing direct calculations of measurable quantities, such as enthalpies of formation (ΔHf). Still, the ability to compute Etot and ΔHf values does not necessarily provide insights into the physical mechanisms behind their magnitudes or chemical trends. Here, we examine a large set of calculated Etot and ΔHf values obtained from the DFT+U-based fitted elemental-phase reference energies (FERE) approach [V. Stevanović, S. Lany, X. Zhang, and A. Zunger, Phys. Rev. B 85, 115104 (2012), 10.1103/PhysRevB.85.115104] to probe relationships between the Etot/ΔHf of metal-nonmetal compounds in their ground-state crystal structures and properties describing the compound compositions and their elemental constituents. From a stepwise linear regression, we develop a linear model for Etot, and consequently ΔHf, that reproduces calculated FERE values with a mean absolute error of ~80 meV/atom. The most significant contributions to the model include calculated total energies of the constituent elements in their reference phases (e.g., metallic iron or gas-phase O2), atomic ionization energies and electron affinities, Pauling electronegativity differences, and atomic electric polarizabilities. These contributions are discussed in the context of their connection to the underlying physics. We also demonstrate that our Etot/ΔHf model can be directly extended to predict the Etot and ΔHf of compounds outside the set used to develop the model.
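The stepwise regression used above can be sketched as greedy forward selection: repeatedly add the descriptor column that most reduces the residual sum of squares. The descriptor matrix and target below are synthetic stand-ins, not FERE energies or the paper's descriptors.

```python
import numpy as np

def forward_stepwise(X, y, max_terms=3):
    """Greedily add the column that most reduces the residual sum of squares."""
    n, p = X.shape
    chosen = []
    for _ in range(max_terms):
        best, best_rss = None, np.inf
        for j in range(p):
            if j in chosen:
                continue
            cols = chosen + [j]
            A = np.column_stack([np.ones(n), X[:, cols]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = np.sum((y - A @ beta) ** 2)
            if rss < best_rss:
                best, best_rss = j, rss
        chosen.append(best)
    return chosen

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))                 # six candidate descriptors
# the target depends only on descriptors 0 and 3 (plus small noise)
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(0, 0.1, 200)
print(sorted(forward_stepwise(X, y, max_terms=2)))
```

On this toy data the procedure recovers exactly the two informative descriptors, the same behaviour that lets the authors distill Etot into a handful of physically meaningful terms.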
Zheng, Fang; Zhan, Max; Huang, Xiaoqin; AbdulHameed, Mohamed Diwan M.; Zhan, Chang-Guo
2013-01-01
Butyrylcholinesterase (BChE) has been an important protein used for the development of anti-cocaine medication. Through computational design, BChE mutants with ~2000-fold improved catalytic efficiency against cocaine have been discovered in our lab. To study drug-enzyme interaction, it is important to build a mathematical model to predict molecular inhibitory activity against BChE. This report presents a neural network (NN) QSAR study, compared with multi-linear regression (MLR) and molecular docking, on a set of 93 small molecules that act as inhibitors of BChE, using the inhibitory activities (pIC50 values) of the molecules as target values. The statistical results for the linear model built from docking-generated energy descriptors were: r2 = 0.67, rmsd = 0.87, q2 = 0.65 and loormsd = 0.90; the statistical results for the ligand-based MLR model were: r2 = 0.89, rmsd = 0.51, q2 = 0.85 and loormsd = 0.58; the statistical results for the ligand-based NN model were the best: r2 = 0.95, rmsd = 0.33, q2 = 0.90 and loormsd = 0.48, demonstrating that the NN is powerful in the analysis of a set of complicated data. BChE is also an established drug target for developing new treatments for Alzheimer's disease (AD). The developed QSAR models provide tools for the rational identification of potential BChE inhibitors or selection of compounds for synthesis in the discovery of novel effective inhibitors of BChE in the future. PMID:24290065
NASA Astrophysics Data System (ADS)
Lee, C. Y.; Tippett, M. K.; Sobel, A. H.; Camargo, S. J.
2014-12-01
We are working towards the development of a new statistical-dynamical downscaling system to study the influence of climate on tropical cyclones (TCs). The first step is development of an appropriate model for TC intensity as a function of environmental variables. We approach this issue with a stochastic model consisting of a multiple linear regression model (MLR) for 12-hour intensity forecasts as a deterministic component, and a random error generator as a stochastic component. Similar to the operational Statistical Hurricane Intensity Prediction Scheme (SHIPS), MLR relates the surrounding environment to storm intensity, but with only essential predictors calculated from monthly-mean NCEP reanalysis fields (potential intensity, shear, etc.) and from persistence. The deterministic MLR is developed with data from 1981-1999 and tested with data from 2000-2012 for the Atlantic, Eastern North Pacific, Western North Pacific, Indian Ocean, and Southern Hemisphere basins. While the global MLR's skill is comparable to that of the operational statistical models (e.g., SHIPS), the distribution of the predicted maximum intensity from deterministic results has a systematic low bias compared to observations; the deterministic MLR creates almost no storms with intensities greater than 100 kt. The deterministic MLR can be significantly improved by adding the stochastic component, based on the distribution of random forecasting errors from the deterministic model compared to the training data. This stochastic component may be thought of as representing the component of TC intensification that is not linearly related to the environmental variables. We find that in order for the stochastic model to accurately capture the observed distribution of maximum storm intensities, the stochastic component must be auto-correlated across 12-hour time steps. This presentation also includes a detailed discussion of the distributions of other TC-intensity related quantities, as well as the inter
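The stochastic component described above must be auto-correlated across 12-hour steps; a standard way to model that, assumed here for illustration, is an AR(1) error process added to the deterministic MLR forecast. The persistence coefficient and error scale below are invented, not fitted SHIPS-style values.

```python
import numpy as np

def ar1_errors(n_steps, rho, sigma, rng):
    """Forecast errors correlated across successive 12-h steps (lag-1 corr = rho)."""
    e = np.empty(n_steps)
    e[0] = rng.normal(0, sigma)
    for t in range(1, n_steps):
        # innovation variance scaled so the marginal std stays sigma
        e[t] = rho * e[t - 1] + rng.normal(0, sigma * np.sqrt(1 - rho**2))
    return e

rng = np.random.default_rng(2)
rho, sigma = 0.8, 5.0                 # assumed persistence and error scale (kt)
errs = ar1_errors(100_000, rho, sigma, rng)
lag1 = np.corrcoef(errs[:-1], errs[1:])[0, 1]
print(round(lag1, 2))
```

The empirical lag-1 correlation recovers the prescribed rho, confirming the generator produces the step-to-step persistence the abstract argues is needed to match observed maximum-intensity distributions.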
Martin, L; Mezcua, M; Ferrer, C; Gil Garcia, M D; Malato, O; Fernandez-Alba, A R
2013-01-01
The main objective of this work was to establish a mathematical function that correlates pesticide residue levels in apple juice with the levels of the pesticides applied on the raw fruit, taking into account some of their physicochemical properties such as water solubility, the octanol/water partition coefficient, the organic carbon partition coefficient, vapour pressure and density. A mixture of 12 pesticides was applied to an apple tree; apples were collected 10 days after application. After harvest, apples were treated with a mixture of three post-harvest pesticides and the fruits were then processed in order to obtain apple juice following a routine industrial process. The pesticide residue levels in the apple samples were analysed using two multi-residue methods based on LC-MS/MS and GC-MS/MS. The concentration of pesticides was determined in samples derived from the different steps of processing. The processing factors (the ratio between the residue level in the processed commodity and the residue level in the commodity to be processed) obtained for the full juicing process were found to vary among the different pesticides studied. In order to investigate the relationships between the levels of pesticide residue found in apple juice samples and their physicochemical properties, principal component analysis (PCA) was performed using two sets of samples (one using experimental data obtained in this work and the other including data taken from the literature). In both cases a correlation was found between the processing factors of the pesticides in apple juice and the negative logarithms (base 10) of the water solubility, octanol/water partition coefficient and organic carbon partition coefficient. The linear correlation between these physicochemical properties and the processing factor was established using a multiple linear regression technique. PMID:23281800
NASA Astrophysics Data System (ADS)
Frecon, Jordan; Didier, Gustavo; Pustelnik, Nelly; Abry, Patrice
2016-08-01
Self-similarity is widely considered the reference framework for modeling the scaling properties of real-world data. However, most theoretical studies and their practical use have remained univariate. Operator Fractional Brownian Motion (OfBm) was recently proposed as a multivariate model for self-similarity. Yet it has remained seldom used in applications because of serious issues that appear in the joint estimation of its numerous parameters. While the univariate fractional Brownian motion requires the estimation of two parameters only, its mere bivariate extension already involves 7 parameters which are very different in nature. The present contribution proposes a method for the full identification of bivariate OfBm (i.e., the joint estimation of all parameters) through an original formulation as a non-linear wavelet regression coupled with a custom-made Branch & Bound numerical scheme. The estimation performance (consistency and asymptotic normality) is mathematically established and numerically assessed by means of Monte Carlo experiments. The impact of the parameters defining OfBm on the estimation performance as well as the associated computational costs are also thoroughly investigated.
Boulet, Sebastien; Boudot, Elsa; Houel, Nicolas
2016-05-01
Back pain is a common reason for consultation in primary healthcare clinical practice, and has effects on daily activities and posture. Relationships between the whole spine and upright posture, however, remain unknown. The aim of this study was to identify the relationship between each spinal curve and centre of pressure position as well as velocity for healthy subjects. Twenty-one male subjects performed quiet stance in natural position. Each upright posture was then recorded using an optoelectronics system (Vicon Nexus) synchronized with two force plates. At each moment, polynomial interpolations of markers attached on the spine segment were used to compute cervical lordosis, thoracic kyphosis and lumbar lordosis angle curves. The mean of centre of pressure position and velocity was then computed. Multiple stepwise linear regression analysis showed that the position and velocity of centre of pressure associated with each part of the spinal curves were the best predictors of the lumbar lordosis angle (R² = 0.45; p = 1.65×10⁻¹⁰) and the thoracic kyphosis angle (R² = 0.54; p = 4.89×10⁻¹³) of healthy subjects in quiet stance. This study showed the relationships between each of the cervical, thoracic and lumbar curvatures and the centre of pressure's fluctuation during free quiet standing, using non-invasive full spinal curve exploration. PMID:26970888
NASA Astrophysics Data System (ADS)
de Souza, R. S.; Hilbe, J. M.; Buelens, B.; Riggs, J. D.; Cameron, E.; Ishida, E. E. O.; Chies-Santos, A. L.; Killedar, M.
2015-10-01
In this paper, the third in a series illustrating the power of generalized linear models (GLMs) for the astronomical community, we elucidate the potential of the class of GLMs which handles count data. The size of a galaxy's globular cluster (GC) population (NGC) is a long-standing puzzle in the astronomical literature. It falls in the category of count data analysis, yet it is usually modelled as if it were a continuous response variable. We have developed a Bayesian negative binomial regression model to study the connection between NGC and the following galaxy properties: central black hole mass, dynamical bulge mass, bulge velocity dispersion and absolute visual magnitude. The methodology introduced herein naturally accounts for heteroscedasticity, intrinsic scatter, errors in measurements in both axes (either discrete or continuous) and allows modelling the population of GCs on their natural scale as a non-negative integer variable. Prediction intervals of 99 per cent around the trend for expected NGC comfortably envelope the data, notably including the Milky Way, which has hitherto been considered a problematic outlier. Finally, we demonstrate how random intercept models can incorporate information on each particular galaxy morphological type. Bayesian variable selection methodology allows for automatically identifying galaxy types with different productions of GCs, suggesting that on average S0 galaxies have a GC population 35 per cent smaller than other types with similar brightness.
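A toy illustration of why a negative binomial, rather than Poisson, response suits count data like GC populations: counts generated from a log-linear mean via a Poisson-gamma mixture show variance well above the mean (overdispersion), which a Poisson model cannot represent. The covariate, link coefficients, and dispersion parameter below are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(3)
log_mass = rng.normal(0.0, 1.0, 50_000)      # standardized "bulge mass" covariate
mu = np.exp(1.0 + 0.8 * log_mass)            # log link, as in a count GLM
alpha = 0.5                                  # NB dispersion parameter

# negative binomial via Poisson-gamma mixture:
# lambda ~ Gamma(shape=1/alpha, scale=alpha*mu), counts ~ Poisson(lambda)
lam = rng.gamma(1.0 / alpha, alpha * mu)
counts = rng.poisson(lam)

# for fixed mu the NB variance is mu + alpha*mu^2 > mu; check on a narrow slice
sel = np.abs(mu - 3.0) < 0.1
print(counts[sel].mean().round(1), counts[sel].var().round(1))
```

On the slice with mean near 3 the variance lands near 3 + 0.5·9 = 7.5, i.e. far above the Poisson prediction of variance = mean, which is the overdispersion the Bayesian NB model above is built to absorb.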
Shabri, Ani; Samsudin, Ruhaidah
2014-01-01
Crude oil prices do play significant role in the global economy and are a key input into option pricing formulas, portfolio allocation, and risk measurement. In this paper, a hybrid model integrating wavelet and multiple linear regressions (MLR) is proposed for crude oil price forecasting. In this model, Mallat wavelet transform is first selected to decompose an original time series into several subseries with different scale. Then, the principal component analysis (PCA) is used in processing subseries data in MLR for crude oil price forecasting. The particle swarm optimization (PSO) is used to adopt the optimal parameters of the MLR model. To assess the effectiveness of this model, daily crude oil market, West Texas Intermediate (WTI), has been used as the case study. Time series prediction capability performance of the WMLR model is compared with the MLR, ARIMA, and GARCH models using various statistics measures. The experimental results show that the proposed model outperforms the individual models in forecasting of the crude oil prices series. PMID:24895666
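A much-simplified sketch of the wavelet-MLR idea above: decompose a series with a one-level Haar transform into approximation and detail subseries, then regress the next approximation coefficient on the current pair. The price series is synthetic, and the paper's Mallat multi-level transform, PCA preprocessing, and PSO tuning are omitted for brevity.

```python
import numpy as np

def haar_level1(x):
    """One-level Haar decomposition into approximation and detail subseries."""
    x = x[: len(x) // 2 * 2]                 # drop an odd trailing sample
    a = (x[0::2] + x[1::2]) / np.sqrt(2)     # low-pass (trend) coefficients
    d = (x[0::2] - x[1::2]) / np.sqrt(2)     # high-pass (fluctuation) coefficients
    return a, d

rng = np.random.default_rng(4)
t = np.arange(400)
# synthetic "price": trend + cycle + noise
price = 50 + 0.05 * t + np.sin(t / 20) + rng.normal(0, 0.2, 400)

a, d = haar_level1(price)
# MLR step: predict the next approximation coefficient from the current (a, d)
X = np.column_stack([np.ones(len(a) - 1), a[:-1], d[:-1]])
beta, *_ = np.linalg.lstsq(X, a[1:], rcond=None)
pred = X @ beta
rmse = np.sqrt(np.mean((a[1:] - pred) ** 2))
print(rmse < 1.0)
```

Regressing on the wavelet subseries rather than the raw series is the core of the hybrid WMLR design; the full model repeats this over several scales before recombining forecasts.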
Wang, S; Huang, G H; He, L
2012-09-01
Groundwater contamination by dense non-aqueous phase liquids (DNAPLs) has become an issue of great concern in many industrialized countries due to their serious threat to human health. Dissolution and transport of DNAPLs in porous media are complicated, multidimensional and multiphase processes, which pose formidable challenges for investigation of their behaviors and implementation of effective remediation technologies. Numerical simulation models could help gain in-depth insight into complex mechanisms of DNAPLs dissolution and transport processes in the subsurface; however, they were computationally expensive, especially when a large number of runs were required, which was considered as a major obstacle for conducting further analysis. Therefore, proxy models that mimic key characteristics of a full simulation model were desired to save many orders of magnitude of computational cost. In this study, a clusterwise-linear-regression (CLR)-based forecasting system was developed for establishing a statistical relationship between DNAPL dissolution behaviors and system conditions under discrete and nonlinear complexities. The results indicated that the developed CLR-based forecasting system was capable not only of predicting DNAPL concentrations with acceptable error levels, but also of providing a significance level in each cutting/merging step such that the accuracies of the developed forecasting trees could be controlled. This study was a first attempt to apply the CLR model to characterize DNAPL dissolution and transport processes. PMID:22789814
Kjelstrom, L.C.
1995-01-01
Previously developed U.S. Geological Survey regional regression models of runoff and 11 chemical constituents were evaluated to assess their suitability for use in urban areas in Boise and Garden City. Data collected in the study area were used to develop adjusted regional models of storm-runoff volumes and mean concentrations and loads of chemical oxygen demand, dissolved and suspended solids, total nitrogen and total ammonia plus organic nitrogen as nitrogen, total and dissolved phosphorus, and total recoverable cadmium, copper, lead, and zinc. Explanatory variables used in these models were drainage area, impervious area, land-use information, and precipitation data. Mean annual runoff volume and loads at the five outfalls were estimated from 904 individual storms during 1976 through 1993. Two methods were used to compute individual storm loads. The first method used adjusted regional models of storm loads and the second used adjusted regional models for mean concentration and runoff volume. For large storms, the first method seemed to produce excessively high loads for some constituents and the second method provided more reliable results for all constituents except suspended solids. The first method provided more reliable results for large storms for suspended solids.
A consistent local linear estimator of the covariate adjusted correlation coefficient
Nguyen, Danh V.; Şentürk, Damla
2009-01-01
Consider the correlation between two random variables (X, Y), both not directly observed. One only observes X̃ = φ1(U)X + φ2(U) and Ỹ = ψ1(U)Y + ψ2(U), where all four functions {φl(·),ψl(·), l = 1, 2} are unknown/unspecified smooth functions of an observable covariate U. We consider consistent estimation of the correlation between the unobserved variables X and Y, adjusted for the above general dual additive and multiplicative effects of U, based on the observed data (X̃, Ỹ, U). PMID:21720454
Jian, Shih-Jie; Kou, Chwung-Shan; Hwang, Jennchang; Lee, Chein-Dhau; Lin, Wei-Cheng
2013-06-15
A method for controlling the pretilt angles of liquid crystals (LC) was developed. Hexamethyldisiloxane polymer films were first deposited on indium tin oxide coated glass plates using a linear atmospheric pressure plasma source. The films were subsequently treated with the rubbing method for LC alignment. Fourier transform infrared spectroscopy and X-ray photoelectron spectroscopy measurements were used to characterize the film composition, which could be varied to control the surface energy by adjusting the monomer feed rate and input power. The results of LC alignment experiments showed that the pretilt angle continuously increased from 0° to 90° with decreasing film surface energy.
2012-01-01
Background We wanted to compare growth differences between 13 Escherichia coli strains exposed to various concentrations of the growth inhibitor lactoferrin in two different types of broth (Syncase and Luria-Bertani (LB)). To carry this out, we present a simple statistical procedure that separates microbial growth curves that are due to natural random perturbations from growth curves that are more likely caused by biological differences. Bacterial growth was determined using optical density (OD) data recorded in triplicate at 620 nm for 18 hours for each strain. Each resulting growth curve was divided into three equally spaced intervals. We propose a procedure using linear spline regression with two knots to compute the slopes of each interval in the bacterial growth curves. These slopes are subsequently used to estimate a 95% confidence interval based on an appropriate statistical distribution. Slopes outside the confidence interval were considered significantly different from slopes within. We also demonstrate the use of related, but more advanced, methods known collectively as generalized additive models (GAMs) to model growth. In addition to impressive curve-fitting capabilities with corresponding confidence intervals, GAMs allow for the computation of derivatives, i.e. growth rate estimation, with respect to each time point. Results The results from our proposed procedure agreed well with the observed data. The results indicated that there were substantial growth differences between the E. coli strains. Most strains exhibited improved growth in the nutrient-rich LB broth compared to Syncase. The inhibiting effect of lactoferrin varied between the different strains. The atypical enteropathogenic aEPEC-2 grew, on average, faster in both broths than the other strains tested, while the enteroinvasive strains EIEC-6 and EIEC-7 grew slower. The enterotoxigenic ETEC-5 strain exhibited exceptional growth in Syncase broth, but slower growth in LB broth.
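The two-knot linear spline step above can be sketched directly: fit a continuous piecewise-linear curve with knots at the interval boundaries and read off one slope per interval. The growth curve below is simulated, not E. coli OD data; the knot positions follow the equal-thirds split described in the abstract.

```python
import numpy as np

def spline_slopes(t, od, k1, k2):
    """Continuous piecewise-linear fit; returns the slope in each of 3 intervals."""
    # basis: 1, t, (t-k1)_+, (t-k2)_+  (truncated-line spline basis)
    B = np.column_stack([np.ones_like(t), t,
                         np.clip(t - k1, 0, None),
                         np.clip(t - k2, 0, None)])
    b, *_ = np.linalg.lstsq(B, od, rcond=None)
    return b[1], b[1] + b[2], b[1] + b[2] + b[3]

rng = np.random.default_rng(5)
t = np.linspace(0, 18, 90)                       # hours
# lag phase, exponential-like phase, plateau (slopes 0.01, 0.08, 0.01)
od = np.piecewise(t, [t < 6, (t >= 6) & (t < 12), t >= 12],
                  [lambda t: 0.05 + 0.01 * t,
                   lambda t: 0.11 + 0.08 * (t - 6),
                   lambda t: 0.59 + 0.01 * (t - 12)])
od = od + rng.normal(0, 0.005, t.size)

s1, s2, s3 = spline_slopes(t, od, 6.0, 12.0)
print(round(s2, 2))
```

The middle-interval slope recovered from the noisy curve matches the simulated growth rate; collecting such slopes across replicates is what feeds the confidence-interval comparison the abstract proposes.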
NASA Astrophysics Data System (ADS)
Molina, Enrique; Estrada, Ernesto; Nodarse, Delvin; Torres, Luis A.; González, Humberto; Uriarte, Eugenio
Time-dependent antibacterial activity of 2-furylethylenes, described using quantum chemical, topographic, and topological indices, is characterized as inhibition of respiration in E. coli. A QSAR strategy based on the combination of linear piecewise regression and discriminant analysis is used to predict the biological activity values of strong and moderate antibacterial furylethylenes. The breakpoint in the values of the biological activity was detected. The biological activities of the compounds are described by two linear regression equations. A discriminant analysis is carried out to classify the compounds into one of the two biological activity groups. Results obtained using different kinds of descriptors were compared. In all cases the piecewise linear regression - discriminant analysis (PLR-DA) method produced significantly better QSAR models than linear regression analysis. The QSAR models were validated using an external validation set previously extracted from the original data. An analysis of predictions of reported antibacterial activity was carried out, showing a dependence between the probability of a good classification and the experimental antibacterial activity. Statistical parameters confirmed the quality of the predictions of the quantum-chemical descriptor-based models in LDA, with an accuracy of 0.9 and a C of 0.9. The best PLR-DA model explains more than 92% of the variance of the experimental activity. The models with the best prediction results were those based on quantum-chemical descriptors. An interpretation of the quantum-chemical descriptors entered in the models was carried out.
Lourenco, D A L; Tsuruta, S; Fragomeni, B O; Chen, C Y; Herring, W O; Misztal, I
2016-03-01
, except for , which was 1 percentage point less accurate. Accuracy of GEBV for number of stillborns in F1 was 0.5 for all tested genomic relationship matrices with no changes after tuning. We observed that genotyping F increased accuracies of GEBV for the same animals by up to 39% compared with having genotypes for only AA and BB. In crossbreed evaluations, accounting for breed-specific allele frequencies promoted changes in G that were not influential enough to improve accuracy of GEBV. Therefore, the best performance of ssGBLUP for crossbreed evaluations requires genotypes for pure- and crossbreeds and no breed-specific adjustments in the realized relationship matrix. PMID:27065253
Robertson, D.M.; Saad, D.A.; Heisey, D.M.
2006-01-01
Various approaches are used to subdivide large areas into regions containing streams that have similar reference or background water quality and that respond similarly to different factors. For many applications, such as establishing reference conditions, it is preferable to use physical characteristics that are not affected by human activities to delineate these regions. However, most approaches, such as ecoregion classifications, rely on land use to delineate regions or have difficulties compensating for the effects of land use. Land use not only directly affects water quality, but it is often correlated with the factors used to define the regions. In this article, we describe modifications to SPARTA (spatial regression-tree analysis), a relatively new approach applied to water-quality and environmental characteristic data to delineate zones with similar factors affecting water quality. In this modified approach, land-use-adjusted (residualized) water quality and environmental characteristics are computed for each site. Regression-tree analysis is applied to the residualized data to determine the most statistically important environmental characteristics describing the distribution of a specific water-quality constituent. Geographic information for small basins throughout the study area is then used to subdivide the area into relatively homogeneous environmental water-quality zones. For each zone, commonly used approaches are subsequently used to define its reference water quality and how its water quality responds to changes in land use. SPARTA is used to delineate zones of similar reference concentrations of total phosphorus and suspended sediment throughout the upper Midwestern part of the United States. © 2006 Springer Science+Business Media, Inc.
NASA Astrophysics Data System (ADS)
Winkler, Peter; Bergmann, Helmar; Stuecklschweiger, Georg; Guss, Helmuth
2003-05-01
Mechanical stability and precise adjustment of rotation axes, collimator and room lasers are essential for the success of radiotherapy and particularly stereotactic radiosurgery with a linear accelerator. Quality assurance procedures, at present mainly based on visual tests and radiographic film evaluations, should desirably be little time consuming and highly accurate. We present a method based on segmentation and analysis of digital images acquired with an electronic portal imaging device (EPID) that meets these objectives. The method can be employed for routine quality assurance with a square field formed by the built-in collimator jaws as well as with a circular field using an external drill hole collimator. A number of tests, performed to evaluate accuracy and reproducibility of the algorithm, yielded very satisfying results. Studies performed over a period of 18 months prove the applicability of the inspected accelerator for stereotactic radiosurgery.
NASA Astrophysics Data System (ADS)
Saltogianni, Vasso; Stiros, Stathis
2012-11-01
The adjustment of systems of highly non-linear, redundant equations, deriving from observations of certain geophysical processes and geodetic data, cannot be based on conventional least-squares techniques and instead relies on various numerical inversion techniques. Still, these techniques lead to solutions trapped in local minima, to correlated estimates, and to solutions with poor error control. To overcome these problems, we propose an alternative numerical-topological approach inspired by lighthouse beacon navigation, usually used in 2-D, low-accuracy applications. In our approach, an m-dimensional grid G of points around the real solution (an m-dimensional vector) is first specified. Then, for each equation, an uncertainty is assigned to the corresponding measurement, and the set of grid points which satisfy the condition is detected. This process is repeated for all equations, and the common section A of the sets of grid points is defined. From this set of grid points, which defines a space containing the real solution, we compute its center of weight, which corresponds to an estimate of the solution, and its variance-covariance matrix. An optimal solution can be obtained through optimization of the uncertainty in each observation. The efficiency of the overall process was assessed in comparison with conventional least-squares adjustment.
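The grid-intersection procedure above can be shown on a toy 2-D beacon problem: for each observation, keep the grid points consistent with it to within its uncertainty, intersect the sets, and take the centre of weight of the common section as the estimate. The station geometry, distances, and uncertainty below are invented for illustration.

```python
import numpy as np

# unknown 2-D point to recover
truth = np.array([2.0, 3.0])

# observations: distances from two known stations, each with uncertainty sigma
stations = np.array([[0.0, 0.0], [5.0, 0.0]])
obs = np.linalg.norm(stations - truth, axis=1)
sigma = 0.05

# grid G of candidate points around the solution
xs, ys = np.meshgrid(np.linspace(0, 5, 501), np.linspace(0, 5, 501))
pts = np.stack([xs.ravel(), ys.ravel()], axis=1)

# for each equation keep the grid points satisfying it; intersect the sets
mask = np.ones(len(pts), dtype=bool)
for s, d in zip(stations, obs):
    mask &= np.abs(np.linalg.norm(pts - s, axis=1) - d) < sigma

estimate = pts[mask].mean(axis=0)    # centre of weight of the common section A
print(np.round(estimate, 1))
```

Each observation carves an annulus out of the grid; their intersection is a small region around the true point, and its centroid is the estimate. Spread of the surviving points would give the variance-covariance information the abstract mentions.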
Cadmium-hazard mapping using a general linear regression model (Irr-Cad) for rapid risk assessment.
Simmons, Robert W; Noble, Andrew D; Pongsakul, P; Sukreeyapongse, O; Chinabut, N
2009-02-01
Research undertaken over the last 40 years has identified the irrefutable relationship between the long-term consumption of cadmium (Cd)-contaminated rice and human Cd disease. In order to protect public health and livelihood security, the ability to accurately and rapidly determine spatial Cd contamination is of high priority. During 2001-2004, a general linear regression model, Irr-Cad, was developed to predict the spatial distribution of soil Cd in a Cd/Zn co-contaminated cascading irrigated rice-based system in Mae Sot District, Tak Province, Thailand (Longitude E 98°59'-E 98°63' and Latitude N 16°67'-N 16°66'). The results indicate that Irr-Cad accounted for 98% of the variance in mean Field Order total soil Cd. Preliminary validation indicated that Irr-Cad 'predicted' mean Field Order total soil Cd was significantly (p < 0.001) correlated (R² = 0.92) with 'observed' mean Field Order total soil Cd values. Field Order is determined by a given field's proximity to primary outlets from in-field irrigation channels and subsequent inter-field irrigation flows. This in turn determines Field Order in Irrigation Sequence (Field Order(IS)). Mean Field Order total soil Cd represents the mean total soil Cd (aqua regia-digested) for a given Field Order(IS). In 2004-2005, Irr-Cad was utilized to evaluate the spatial distribution of total soil Cd in a 'high-risk' area of Mae Sot District. Secondary validation on six randomly selected field groups verified that Irr-Cad-predicted mean Field Order total soil Cd was significantly (p < 0.001) correlated with the observed mean Field Order total soil Cd, with R² values ranging from 0.89 to 0.97. The practical applicability of Irr-Cad lies in its minimal input requirements, namely the classification of fields in terms of Field Order(IS), strategic sampling of all primary fields, laboratory-based determination of total soil Cd (T-Cd(P)) and the use of a weighted coefficient for Cd (Coeff
Agogo, George O; van der Voet, Hilko; Van't Veer, Pieter; van Eeuwijk, Fred A; Boshuizen, Hendriek C
2016-07-01
Dietary questionnaires are prone to measurement error, which biases the perceived association between dietary intake and risk of disease. Short-term measurements are required to adjust for the bias in the association. For foods that are not consumed daily, the short-term measurements are often characterized by excess zeroes. Via a simulation study, the performance of a two-part calibration model that was developed for a single-replicate study design was assessed by mimicking leafy vegetable intake reports from the multicenter European Prospective Investigation into Cancer and Nutrition (EPIC) study. In part I of the fitted two-part calibration model, a logistic distribution was assumed; in part II, a gamma distribution was assumed. The model was assessed with respect to the magnitude of the correlation between the consumption probability and the consumed amount (hereafter, cross-part correlation), the number and form of covariates in the calibration model, the percentage of zero response values, and the magnitude of the measurement error in the dietary intake. From the simulation study results, transforming the dietary variable in the regression calibration to an appropriate scale was found to be the most important factor for the model performance. Reducing the number of covariates in the model could be beneficial, but was not critical in large-sample studies. The performance was remarkably robust when fitting a one-part rather than a two-part model. The model performance was minimally affected by the cross-part correlation. PMID:27003183
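The two-part structure described above can be illustrated with a simulation: part I models whether the food is consumed on a given day (Bernoulli, as a logistic model would fit), part II the consumed amount (gamma), and habitual intake is the product of the two parts. The consumption probability and gamma parameters below are invented, not EPIC estimates.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000
p_consume = 0.3                  # part I: probability of any consumption on a day
shape, scale = 2.0, 40.0         # part II: gamma for the positive amounts (g/day)

# zero-inflated daily intakes: zero on non-consumption days, gamma otherwise
consumed = rng.random(n) < p_consume
amount = np.where(consumed, rng.gamma(shape, scale, n), 0.0)

# the expectation factorizes: E[intake] = P(consume) * E[amount | consume]
expected = p_consume * shape * scale
print(round(amount.mean(), 1), expected)
```

The empirical mean of the zero-inflated intakes matches the product of the two parts (0.3 × 80 = 24), which is the quantity a two-part calibration model targets when correcting the questionnaire-based association.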
NASA Astrophysics Data System (ADS)
Denli, H. H.; Koc, Z.
2015-12-01
Standards-based valuation of real property is difficult to apply consistently across time and location. Regression analysis constructs mathematical models which describe or explain relationships that may exist between variables. The problem of identifying price differences of properties to obtain a price index can be converted into a regression problem, and standard techniques of regression analysis can be used to estimate the index. Applying regression analysis to real estate valuation, with properties presented in the current market along with their characteristics and quantifiers, helps to find the effective factors or variables in the formation of value. In this study, prices of housing for sale in Zeytinburnu, a district in Istanbul, are associated with their characteristics to find a price index, based on information received from a real estate web page. The variables used for the analysis are age, size in m², number of floors in the building, floor number of the estate, and number of rooms. The price of the estate is the dependent variable, whereas the rest are independent variables. Prices of 60 real estates were used for the analysis. Locations with the same price value were found and plotted on the map, and equivalence curves were drawn identifying the same-valued zones as lines.
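A minimal hedonic-regression sketch of the valuation idea above: regress asking price on property characteristics, then read predicted prices from the fitted coefficients. The 60 listings below are fabricated placeholders, not the Zeytinburnu records, and the coefficient values are invented.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 60
age = rng.integers(0, 40, n)                 # years
size = rng.uniform(50, 200, n)               # m^2
rooms = rng.integers(1, 6, n)
# synthetic prices: value per m^2, depreciation with age, premium per room
price = 1000 * size - 2000 * age + 15000 * rooms + rng.normal(0, 5000, n)

# ordinary least squares with an intercept
X = np.column_stack([np.ones(n), age, size, rooms])
beta, *_ = np.linalg.lstsq(X, price, rcond=None)

# predicted price for a hypothetical 10-year-old, 100 m^2, 3-room flat
pred = beta @ np.array([1, 10, 100, 3])
print(round(beta[2]))                        # estimated price per m^2
```

The fitted size coefficient recovers the per-m² value planted in the data; mapping such fitted values over locations is what produces the equal-value curves the abstract describes.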
NASA Astrophysics Data System (ADS)
Shams Amiri, Shideh
Modeling of energy consumption in buildings is essential for applications such as building energy management and establishing baselines. This makes building energy consumption estimation a key tool for achieving goals on energy consumption and emissions reduction. The energy performance of a building is complex, since it depends on several parameters related to the building characteristics, equipment and systems, weather, occupants, and sociological influences. This paper presents a new model to predict and quantify energy consumption in commercial buildings in the early stages of design. The eQUEST and DOE-2 building simulation software was used to build and simulate individual building configurations that were generated using a Monte Carlo simulation technique. Ten thousand simulations for seven building shapes were performed to create a comprehensive dataset covering the full ranges of design parameters. The study considered building materials, their thickness, building shape, and occupant schedule as design variables, since building energy performance is sensitive to these variables. The results of the energy simulations were then used to fit a set of regression equations predicting the energy consumption in each design scenario. The differences between regression-predicted and DOE-simulated annual building energy consumption are largely within 5%. It is envisioned that the developed regression models can be used to estimate energy savings in the early stages of design, when different building schemes and design concepts are being considered. Keywords: eQUEST simulation, DOE-2 simulation, Monte Carlo simulation, Regression equations, Building energy performance
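The simulate-then-regress workflow can be sketched as follows. The "simulator" here is a hypothetical analytic stand-in for DOE-2, and the design variables and coefficients are invented; only the workflow (Monte Carlo sampling, regression fit, accuracy check) mirrors the abstract.

```python
import random
random.seed(42)

def simulate_energy(insulation, glazing):
    # Hypothetical stand-in for a DOE-2 run (linear + noise).
    return 500 - 40 * insulation - 25 * glazing + random.gauss(0, 2)

# Monte Carlo sampling of two design variables.
samples = [(random.uniform(0.1, 0.5), random.uniform(0.2, 0.8)) for _ in range(200)]
energies = [simulate_energy(a, b) for a, b in samples]

# Two-predictor OLS on centered data via Cramer's rule.
m = len(samples)
ma = sum(a for a, _ in samples) / m
mb = sum(b for _, b in samples) / m
me = sum(energies) / m
saa = sum((a - ma) ** 2 for a, _ in samples)
sbb = sum((b - mb) ** 2 for _, b in samples)
sab = sum((a - ma) * (b - mb) for a, b in samples)
sae = sum((a - ma) * (e - me) for (a, _), e in zip(samples, energies))
sbe = sum((b - mb) * (e - me) for (_, b), e in zip(samples, energies))
det = saa * sbb - sab * sab
beta_a = (sae * sbb - sbe * sab) / det
beta_b = (saa * sbe - sab * sae) / det
intercept = me - beta_a * ma - beta_b * mb

# Check the metamodel against the simulator at a new design point.
pred = intercept + beta_a * 0.3 + beta_b * 0.5
actual = 500 - 40 * 0.3 - 25 * 0.5
print(abs(pred - actual) / actual < 0.05)  # within-5% check, as in the study
```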
NASA Astrophysics Data System (ADS)
Zhang, Xiaoyu; Li, Qingbo; Zhang, Guangjun
2013-11-01
In this paper, a modified single-index signal regression (mSISR) method is proposed to construct a nonlinear and practical model with high accuracy. The mSISR method takes the optimal penalty tuning parameter from P-spline signal regression (PSR) as its initial tuning parameter and chooses the number of cycles by minimizing the root mean squared error of cross-validation (RMSECV). mSISR is superior to single-index signal regression (SISR) in terms of accuracy, computation time and convergence, and it characterizes the non-linearity between spectra and responses more precisely than SISR. Two spectral data sets from basic research experiments, involving nondestructive measurement of plant chlorophyll and noninvasive measurement of human blood glucose, are employed to illustrate the advantages of mSISR. The results indicate that the mSISR method (i) obtains a smooth and informative regression coefficient vector, (ii) explicitly exhibits the type and amount of non-linearity, (iii) can exploit nonlinear features of the signals to improve prediction performance, and (iv) adapts well to complex spectral models in comparison with other calibration methods. The results validate mSISR as a promising nonlinear modeling strategy for multivariate calibration.
Brasquet, C.; Bourges, B.; Le Cloirec, P.
1999-12-01
The adsorption of 55 organic compounds is carried out onto a recently developed adsorbent, activated carbon cloth. Isotherms are modeled using the classical Freundlich model, and the large database generated allows qualitative assumptions about the adsorption mechanism. To confirm these assumptions, a quantitative structure-property relationship methodology is used to assess the correlations between an adsorbability parameter (expressed using the Freundlich parameter K) and topological indices related to the compounds' molecular structure (molecular connectivity indices, MCI). This correlation is set up by means of two different statistical tools, multiple linear regression (MLR) and neural networks (NN). A principal component analysis is carried out to generate new, uncorrelated variables. It enables the relations between the MCI to be analyzed, but the multiple linear regression assessed using the principal components (PCs) has poor statistical quality and introduces high-order PCs, too inaccurate for an explanation of the adsorption mechanism. The correlations are thus set up using the original variables (MCI), and the two statistical tools, multiple linear regression and neural network, are compared from a descriptive and predictive point of view. To compare the predictive ability of both methods, a test database of 10 organic compounds is used.
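The Freundlich model invoked above, q = K·C^(1/n), becomes linear on log scales, log q = log K + (1/n)·log C, so its parameters can be recovered by simple linear regression. A sketch on synthetic data (K = 2.0 and 1/n = 0.5 are chosen arbitrarily, not taken from the paper):

```python
import math

# Synthetic equilibrium data following q = K * C**(1/n) exactly.
C = [1.0, 4.0, 9.0, 16.0, 25.0]          # equilibrium concentrations
q = [2.0 * c ** 0.5 for c in C]          # loadings: 2, 4, 6, 8, 10

# Linear regression of log(q) on log(C).
x = [math.log(c) for c in C]
y = [math.log(v) for v in q]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) \
        / sum((a - mx) ** 2 for a in x)
logK = my - slope * mx

print(round(slope, 3), round(math.exp(logK), 3))  # → 0.5 2.0
```

On real isotherm data the points scatter around the line, and the fitted slope and intercept estimate 1/n and log K.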
NASA Technical Reports Server (NTRS)
Jacobsen, R. T.; Stewart, R. B.; Crain, R. W., Jr.; Rose, G. L.; Myers, A. F.
1976-01-01
A method was developed for establishing a rational choice of the terms to be included in an equation of state with a large number of adjustable coefficients. The methods presented were developed for use in the determination of an equation of state for oxygen and nitrogen. However, a general application of the methods is possible in studies involving the determination of an optimum polynomial equation for fitting a large number of data points. The data considered in the least squares problem are experimental thermodynamic pressure-density-temperature data. Attention is given to a description of stepwise multiple regression and the use of stepwise regression in the determination of an equation of state for oxygen and nitrogen.
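Forward stepwise term selection can be sketched as follows. This is a simplified greedy version on illustrative polynomial terms, not the oxygen/nitrogen equation-of-state procedure: at each step it adds the candidate term whose one-term, no-intercept least-squares fit to the current residuals leaves the smallest residual sum of squares.

```python
# Illustrative data generated from a model with x and x^2 terms only.
data_x = [1.0, 2.0, 3.0, 4.0, 5.0]
data_y = [2.0 * x + 0.5 * x ** 2 for x in data_x]

terms = {"x": lambda x: x, "x2": lambda x: x ** 2, "x3": lambda x: x ** 3}

def one_term_fit(xs, ys, f):
    """No-intercept least squares of ys on f(xs); returns (coef, rss)."""
    fx = [f(v) for v in xs]
    coef = sum(a * b for a, b in zip(fx, ys)) / sum(a * a for a in fx)
    rss = sum((y - coef * a) ** 2 for a, y in zip(fx, ys))
    return coef, rss

selected = []
resid = list(data_y)
candidates = dict(terms)
for _ in range(2):  # greedily choose two terms
    nm, (coef, rss) = min(
        ((nm, one_term_fit(data_x, resid, f)) for nm, f in candidates.items()),
        key=lambda item: item[1][1])
    f = candidates.pop(nm)
    selected.append(nm)
    resid = [r - coef * f(x) for x, r in zip(data_x, resid)]

print(sorted(selected))  # the spurious cubic term is never selected
```

Full stepwise regression refits all selected terms jointly at each step and applies an entry/exit criterion; the greedy residual fit above only conveys the idea of rational term choice.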
NASA Astrophysics Data System (ADS)
Grégoire, G.
2014-12-01
Logistic regression is intended to explain the relationship between the probability of an event and a set of covariates. The model's coefficients can be interpreted via the odds and odds ratio, which are presented in the introduction of the chapter. When observations are obtained individually, we speak of binary logistic regression; when they are grouped, the logistic regression is said to be binomial. Our presentation mainly focuses on the binary case. For statistical inference the main tool is the maximum likelihood methodology: we present the Wald, Rao and likelihood ratio results and their use to compare nested models. The problems addressed are essentially the same as in multiple linear regression: testing a global effect, testing individual effects, selecting variables to build a model, measuring the goodness of fit of the model, and predicting new values. The methods are demonstrated on data sets using R. Finally we briefly consider the binomial case and the situation where several events are of interest, that is, polytomous (multinomial) logistic regression and the particular case of ordinal logistic regression.
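The maximum-likelihood fit and the odds-ratio interpretation of the slope can be sketched in a few lines. The data are synthetic and the Newton-Raphson fit below is a minimal illustration, not the chapter's R-based treatment.

```python
import math

# Synthetic binary data: the event is more likely when x = 1.
xs = [0, 0, 0, 0, 1, 1, 1, 1]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

# Newton-Raphson on the log-likelihood of logit P(y=1|x) = b0 + b1*x.
b0, b1 = 0.0, 0.0
for _ in range(25):
    p = [1 / (1 + math.exp(-(b0 + b1 * x))) for x in xs]
    g0 = sum(y - pi for y, pi in zip(ys, p))                   # score, intercept
    g1 = sum(x * (y - pi) for x, y, pi in zip(xs, ys, p))      # score, slope
    w = [pi * (1 - pi) for pi in p]                            # Fisher weights
    h00 = sum(w)
    h01 = sum(x * wi for x, wi in zip(xs, w))
    h11 = sum(x * x * wi for x, wi in zip(xs, w))
    det = h00 * h11 - h01 * h01
    b0 += (h11 * g0 - h01 * g1) / det
    b1 += (h00 * g1 - h01 * g0) / det

# exp(b1) is the odds ratio comparing x = 1 to x = 0.
odds_ratio = math.exp(b1)
print(round(odds_ratio, 2))
```

Here the MLE reproduces the empirical 2x2 odds ratio, (3/1)/(1/3) = 9; the Wald, score and likelihood ratio tests mentioned above are built from the same score vector and Fisher information computed in the loop.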
Ratios as a size adjustment in morphometrics.
Albrecht, G H; Gelvin, B R; Hartman, S E
1993-08-01
Simple ratios in which a measurement variable is divided by a size variable are commonly used but known to be inadequate for eliminating size correlations from morphometric data. Deficiencies in the simple ratio can be alleviated by incorporating regression coefficients describing the bivariate relationship between the measurement and size variables. Recommendations have included: 1) subtracting the regression intercept to force the bivariate relationship through the origin (intercept-adjusted ratios); 2) exponentiating either the measurement or the size variable using an allometry coefficient to achieve linearity (allometrically adjusted ratios); or 3) both subtracting the intercept and exponentiating (fully adjusted ratios). These three strategies for deriving size-adjusted ratios imply different data models for describing the bivariate relationship between the measurement and size variables (i.e., the linear, simple allometric, and full allometric models, respectively). Algebraic rearrangement of the equation associated with each data model leads to a correctly formulated adjusted ratio whose expected value is constant (i.e., size correlation is eliminated). Alternatively, simple algebra can be used to derive an expected value function for assessing whether any proposed ratio formula is effective in eliminating size correlations. Some published ratio adjustments were incorrectly formulated as indicated by expected values that remain a function of size after ratio transformation. Regression coefficients incorporated into adjusted ratios must be estimated using least-squares regression of the measurement variable on the size variable. Use of parameters estimated by any other regression technique (e.g., major axis or reduced major axis) results in residual correlations between size and the adjusted measurement variable. Correctly formulated adjusted ratios, whose parameters are estimated by least-squares methods, do control for size correlations. The size-adjusted
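The allometrically adjusted ratio can be illustrated on synthetic, noise-free data, where least-squares estimation of the allometry coefficient removes the size correlation exactly while the simple ratio does not. The constants (a = 3.0, b = 0.75) are invented for the example.

```python
import math

# Measurement generated from the simple allometric model Y = a * X**b.
X = [10.0, 20.0, 40.0, 80.0, 160.0]      # size variable
Y = [3.0 * x ** 0.75 for x in X]         # measurement variable

# Least-squares slope of log(Y) on log(X) estimates the allometry coefficient b.
lx = [math.log(x) for x in X]
ly = [math.log(y) for y in Y]
n = len(lx)
mx, my = sum(lx) / n, sum(ly) / n
b = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) \
    / sum((u - mx) ** 2 for u in lx)

adjusted = [y / x ** b for x, y in zip(X, Y)]   # allometrically adjusted ratio
simple = [y / x for x, y in zip(X, Y)]          # naive simple ratio

print(round(b, 3))                              # → 0.75
print(max(adjusted) - min(adjusted) < 1e-9)     # adjusted ratio is constant in size
print(max(simple) - min(simple) > 0.5)          # simple ratio still varies with size
```

As the abstract stresses, b must come from least-squares regression of the measurement on size; a major-axis or reduced-major-axis slope would leave a residual size correlation.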
Unitary Response Regression Models
ERIC Educational Resources Information Center
Lipovetsky, S.
2007-01-01
The dependent variable in a regular linear regression is a numerical variable, and in a logistic regression it is a binary or categorical variable. In these models the dependent variable has varying values. However, there are problems yielding an identity output of a constant value which can also be modelled in a linear or logistic regression with…
Sharon Falcone Miller; Bruce G. Miller
2007-12-15
This paper compares the emissions factors for a suite of liquid biofuels (three animal fats, waste restaurant grease, pressed soybean oil, and a biodiesel produced from soybean oil) and four fossil fuels (natural gas, No. 2 fuel oil, No. 6 fuel oil, and pulverized coal) in Penn State's commercial water-tube boiler to assess their viability as fuels for green heat applications. The data were broken into two subsets, fossil fuels and biofuels. The regression model for the liquid biofuels (as a subset) did not perform well for all of the gases. In addition, the coefficients in the models showed the EPA method underestimating CO and NOx emissions. No relation could be studied for SO2 for the liquid biofuels, as they contain no sulfur; however, the model showed a good relationship between the two methods for SO2 in the fossil fuels. AP-42 emissions factors for the fossil fuels were also compared to the mass balance emissions factors and EPA CFR Title 40 emissions factors. Overall, the AP-42 emissions factors for the fossil fuels did not compare well with the mass balance emissions factors or the EPA CFR Title 40 emissions factors. Regression analysis of the AP-42, EPA, and mass balance emissions factors for the fossil fuels showed a significant relationship only for CO2 and SO2. However, the regression models underestimate the SO2 emissions by 33%. These tests illustrate the importance of performing material balances around boilers to obtain the most accurate emissions levels, especially when dealing with biofuels. The EPA emissions factors were very good at predicting the mass balance emissions factors for the fossil fuels and, to a lesser degree, the biofuels. While the AP-42 and EPA CFR Title 40 emissions factors are easier to obtain, especially in large, full-scale systems, this study illustrated the shortcomings of estimation techniques. 23 refs., 3 figs., 8 tabs.
Lee, Myung Hee; Liu, Yufeng
2013-12-01
The continuum regression technique provides an appealing regression framework connecting ordinary least squares, partial least squares and principal component regression in one family. It offers insight into the underlying regression model for a given application and helps provide a deeper understanding of various regression techniques. Despite this useful framework, however, the current development of continuum regression covers only linear regression. In many applications, nonlinear regression is necessary. The extension of continuum regression from linear models to nonlinear models using kernel learning is considered. The proposed kernel continuum regression technique is quite general and can handle very flexible regression model estimation. An efficient algorithm is developed for fast implementation. Numerical examples demonstrate the usefulness of the proposed technique. PMID:24058224
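Kernel continuum regression itself is not reproduced here; as a related minimal sketch, the following implements plain kernel ridge regression, one fixed point on the kernelized spectrum between the OLS-like and PCR-like extremes, with an RBF kernel in pure Python. The data, kernel width, and penalty are all invented for illustration.

```python
import math

def rbf(a, b, gamma=1.0):
    """Radial basis function (Gaussian) kernel on scalars."""
    return math.exp(-gamma * (a - b) ** 2)

xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [math.sin(x) for x in xs]     # nonlinear target
lam = 1e-3                         # ridge penalty

# Solve (K + lam*I) alpha = y by Gaussian elimination (no pivoting; the
# regularized kernel matrix is symmetric positive definite).
n = len(xs)
A = [[rbf(xs[i], xs[j]) + (lam if i == j else 0.0) for j in range(n)]
     for i in range(n)]
alpha = list(ys)
for i in range(n):
    for j in range(i + 1, n):
        f = A[j][i] / A[i][i]
        for k in range(i, n):
            A[j][k] -= f * A[i][k]
        alpha[j] -= f * alpha[i]
for i in reversed(range(n)):
    alpha[i] = (alpha[i] - sum(A[i][k] * alpha[k] for k in range(i + 1, n))) / A[i][i]

def predict(x):
    """Kernel expansion: f(x) = sum_i alpha_i * k(x, x_i)."""
    return sum(a * rbf(x, xi) for a, xi in zip(alpha, xs))

print(round(predict(0.75), 3))  # close to sin(0.75)
```

Continuum regression adds a tuning parameter that sweeps between variance-seeking and correlation-seeking directions; the kernel trick above is what makes the nonlinear extension possible.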
Mahani, Mohamad Khayatzadeh; Chaloosi, Marzieh; Maragheh, Mohamad Ghanadi; Khanchi, Ali Reza; Afzali, Daryoush
2007-09-01
The oral acute in vivo toxicity of 32 amine and amide drugs was related to their structure-dependent properties. Genetic algorithm-partial least-squares and stepwise variable selection were applied to select meaningful descriptors. Multiple linear regression (MLR), artificial neural network (ANN) and partial least squares (PLS) models were created with the selected descriptors. The predictive ability of all three models was evaluated and compared on a set of five drugs that were not used in the modeling steps. Average errors of 0.168, 0.169 and 0.259 were obtained for MLR, ANN and PLS, respectively. PMID:17878584
Ghaedi, M; Rahimi, Mahmoud Reza; Ghaedi, A M; Tyagi, Inderjeet; Agarwal, Shilpi; Gupta, Vinod Kumar
2016-01-01
Two novel and eco-friendly adsorbents, namely tin oxide nanoparticles loaded on activated carbon (SnO2-NP-AC) and activated carbon prepared from the wood of the tree Pistacia atlantica (AC-PAW), were used for the rapid removal and fast adsorption of methyl orange (MO) from the aqueous phase. The dependency of MO removal on various influential adsorption parameters was modeled and optimized using multiple linear regression (MLR) and least squares support vector regression (LSSVR). The optimal parameters for the LSSVR model were found to be γ = 0.76 and σ² = 0.15. For the test data set, a mean square error (MSE) of 0.0010 and a coefficient of determination (R²) of 0.976 were obtained for the LSSVR model, and an MSE of 0.0037 and an R² of 0.897 for the MLR model. The adsorption equilibrium and kinetic data were found to be well fitted by the Langmuir isotherm model and by the second-order and intra-particle diffusion kinetic models, respectively. Small amounts of the proposed SnO2-NP-AC and AC-PAW (0.015 g and 0.08 g, respectively) suffice for successful rapid removal of methyl orange (>95%). The maximum adsorption capacities for SnO2-NP-AC and AC-PAW were 250 mg g⁻¹ and 125 mg g⁻¹, respectively. PMID:26414425
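The Langmuir fit mentioned above can be sketched via the common linearization Ce/qe = Ce/qmax + 1/(KL·qmax), a straight line in Ce whose slope gives qmax. The equilibrium data below are synthetic: qmax echoes the reported 250 mg/g capacity, while KL is an invented value.

```python
# Generate synthetic equilibrium data from the Langmuir isotherm
# qe = qmax * KL * Ce / (1 + KL * Ce), then recover the parameters
# by linear regression on the transformed variable Ce/qe.
qmax_true, KL_true = 250.0, 0.05      # KL (L/mg) is hypothetical
Ce = [10.0, 25.0, 50.0, 100.0, 200.0]                            # mg/L
qe = [qmax_true * KL_true * c / (1 + KL_true * c) for c in Ce]   # mg/g

x = Ce
y = [c / q for c, q in zip(Ce, qe)]   # linearized response Ce/qe
n = len(x)
mx, my = sum(x) / n, sum(y) / n
slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) \
        / sum((a - mx) ** 2 for a in x)
intercept = my - slope * mx

qmax = 1 / slope              # slope = 1/qmax
KL = slope / intercept        # intercept = 1/(KL*qmax)
print(round(qmax, 1), round(KL, 3))  # → 250.0 0.05
```

On measured data the recovered qmax is the reported maximum adsorption capacity.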