NASA Astrophysics Data System (ADS)
Wang, Lijuan; Yan, Yong; Wang, Xue; Wang, Tao
2017-03-01
Input variable selection is an essential step in the development of data-driven models for environmental, biological and industrial applications. By eliminating irrelevant or redundant variables, input variable selection identifies a suitable subset of variables as the input of a model. It also simplifies the model structure and improves computational efficiency. This paper describes the procedures of input variable selection for data-driven models that measure liquid mass flowrate and gas volume fraction under two-phase flow conditions using Coriolis flowmeters. Three advanced input variable selection methods, namely partial mutual information (PMI), genetic algorithm-artificial neural network (GA-ANN) and tree-based iterative input selection (IIS), are applied in this study. Typical data-driven models incorporating support vector machine (SVM) are established individually based on the input candidates resulting from the selection methods. The validity of the selection outcomes is assessed through an output performance comparison of the SVM based data-driven models and sensitivity analysis. The validation and analysis results suggest that the input variables selected by the PMI algorithm provide more effective information for the models to measure liquid mass flowrate, while the IIS algorithm provides fewer but more effective variables for the models to predict gas volume fraction.
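As a rough illustration of information-based input ranking, the sketch below estimates plain mutual information between each candidate input and the target and sorts the candidates by it. This is not the PMI algorithm from the abstract, which additionally conditions on inputs already selected; the function names, the discretised data and the variable labels are all hypothetical.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Estimate I(X;Y) in nats from paired samples of two discrete variables."""
    n = len(x)
    px = Counter(x)
    py = Counter(y)
    pxy = Counter(zip(x, y))
    mi = 0.0
    for (a, b), count in pxy.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi


def rank_inputs(candidates, target):
    """Return candidate names sorted by decreasing mutual information with target."""
    scores = {name: mutual_information(values, target)
              for name, values in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical, already-discretised readings (illustrative values only).
target = [0, 0, 1, 1, 0, 1, 1, 0]
candidates = {
    "informative": [0, 0, 1, 1, 0, 1, 1, 0],  # copies the target
    "irrelevant": [0, 1, 0, 1, 1, 0, 1, 0],   # independent of the target in this sample
}
ranking = rank_inputs(candidates, target)
```

Continuous sensor signals would first need binning or a kernel density estimate, a choice that affects the ranking and is left out of this sketch.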
Input variable selection and calibration data selection for storm water quality regression models.
Sun, Siao; Bertrand-Krajewski, Jean-Luc
2013-01-01
Storm water quality models are useful tools in storm water management. Interest has been growing in analyzing existing data for developing models for urban storm water quality evaluations. It is important to select appropriate model inputs when many candidate explanatory variables are available. Model calibration and verification are essential steps in any storm water quality modeling. This study investigates input variable selection and calibration data selection in storm water quality regression models. The two selection problems interact with each other. A procedure is developed to carry out the two selection tasks in sequence. The procedure first selects model input variables using a cross validation method. An appropriate number of variables is identified as model inputs to ensure that a model is neither overfitted nor underfitted. Based on the model input selection results, calibration data selection is studied. Uncertainty of model performance due to calibration data selection is investigated with a random selection method. An approach using the cluster method is applied in order to enhance model calibration practice based on the principle of selecting representative data for calibration. The comparison between results from the cluster selection method and random selection shows that the former can significantly improve the performance of calibrated models. It is found that the information content of calibration data is important in addition to the size of the calibration data set.
Model selection bias and Freedman's paradox
Lukacs, P.M.; Burnham, K.P.; Anderson, D.R.
2010-01-01
In situations where limited knowledge of a system exists and the ratio of data points to variables is small, variable selection methods can often be misleading. Freedman (Am Stat 37:152-155, 1983) demonstrated how common it is to select completely unrelated variables as highly "significant" when the number of data points is similar in magnitude to the number of variables. A new type of model averaging estimator based on model selection with Akaike's AIC is used with linear regression to investigate the problems of likely inclusion of spurious effects and model selection bias, the bias introduced while using the data to select a single seemingly "best" model from an (often large) set of models employing many predictor variables. The new model averaging estimator helps reduce these problems and provides confidence interval coverage at the nominal level while traditional stepwise selection has poor inferential properties. © The Institute of Statistical Mathematics, Tokyo 2009.
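To make AIC-based model averaging concrete, here is a minimal sketch of standard Akaike weights and a weighted average of one coefficient across candidate models. It is not claimed to be the specific estimator introduced in the paper; the AIC scores and slope estimates are hypothetical, and treating a coefficient as zero in models that omit the variable is an assumption of this sketch.

```python
import math


def akaike_weights(aic_values):
    """Convert AIC scores into Akaike weights that sum to 1."""
    best = min(aic_values)
    rel_likelihood = [math.exp(-0.5 * (aic - best)) for aic in aic_values]
    total = sum(rel_likelihood)
    return [r / total for r in rel_likelihood]


def model_averaged_coefficient(estimates, weights):
    """Weighted average of a coefficient; models omitting the variable contribute 0."""
    return sum(w * b for w, b in zip(weights, estimates))


# Hypothetical AIC scores and slope estimates for three candidate models.
aic = [100.0, 102.0, 110.0]
slope = [0.8, 0.0, 0.5]  # the second model excludes the predictor
weights = akaike_weights(aic)
averaged = model_averaged_coefficient(slope, weights)
```

Because the second model carries a non-trivial weight and contributes a zero, the averaged slope is pulled toward zero relative to the estimate from the single best model, which is the kind of shrinkage that counteracts selection bias.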
Jiang, Hui; Zhang, Hang; Chen, Quansheng; Mei, Congli; Liu, Guohai
2015-01-01
The use of wavelength variable selection before partial least squares discriminant analysis (PLS-DA) for qualitative identification of solid state fermentation degree by the FT-NIR spectroscopy technique was investigated in this study. Two wavelength variable selection methods, competitive adaptive reweighted sampling (CARS) and stability competitive adaptive reweighted sampling (SCARS), were employed to select the important wavelengths. PLS-DA was applied to calibrate identification models using the wavelength variables selected by CARS and SCARS. Experimental results showed that CARS and SCARS selected 58 and 47 wavelength variables, respectively, from the 1557 original wavelength variables. Compared with the results of full-spectrum PLS-DA, both wavelength variable selection methods enhanced the performance of the identification models. Meanwhile, compared with the CARS-PLS-DA model, the SCARS-PLS-DA model achieved better results, with an identification rate of 91.43% in the validation process. The overall results demonstrate that a PLS-DA model constructed using wavelength variables selected by a proper wavelength variable selection method can identify the solid state fermentation degree more accurately. Copyright © 2015 Elsevier B.V. All rights reserved.
Fox, Eric W; Hill, Ryan A; Leibowitz, Scott G; Olsen, Anthony R; Thornbrugh, Darren J; Weber, Marc H
2017-07-01
Random forest (RF) modeling has emerged as an important statistical learning method in ecology due to its exceptional predictive performance. However, for large and complex ecological data sets, there is limited guidance on variable selection methods for RF modeling. Typically, either a preselected set of predictor variables are used or stepwise procedures are employed which iteratively remove variables according to their importance measures. This paper investigates the application of variable selection methods to RF models for predicting probable biological stream condition. Our motivating data set consists of the good/poor condition of n = 1365 stream survey sites from the 2008/2009 National Rivers and Stream Assessment, and a large set (p = 212) of landscape features from the StreamCat data set as potential predictors. We compare two types of RF models: a full variable set model with all 212 predictors and a reduced variable set model selected using a backward elimination approach. We assess model accuracy using RF's internal out-of-bag estimate, and a cross-validation procedure with validation folds external to the variable selection process. We also assess the stability of the spatial predictions generated by the RF models to changes in the number of predictors and argue that model selection needs to consider both accuracy and stability. The results suggest that RF modeling is robust to the inclusion of many variables of moderate to low importance. We found no substantial improvement in cross-validated accuracy as a result of variable reduction. Moreover, the backward elimination procedure tended to select too few variables and exhibited numerous issues such as upwardly biased out-of-bag accuracy estimates and instabilities in the spatial predictions. We use simulations to further support and generalize results from the analysis of real data. 
A main purpose of this work is to elucidate issues of model selection bias and instability to ecologists interested in using RF to develop predictive models with large environmental data sets.
Assessing the accuracy and stability of variable selection ...
Random forest (RF) modeling has emerged as an important statistical learning method in ecology due to its exceptional predictive performance. However, for large and complex ecological datasets there is limited guidance on variable selection methods for RF modeling. Typically, either a preselected set of predictor variables are used, or stepwise procedures are employed which iteratively add/remove variables according to their importance measures. This paper investigates the application of variable selection methods to RF models for predicting probable biological stream condition. Our motivating dataset consists of the good/poor condition of n=1365 stream survey sites from the 2008/2009 National Rivers and Stream Assessment, and a large set (p=212) of landscape features from the StreamCat dataset. Two types of RF models are compared: a full variable set model with all 212 predictors, and a reduced variable set model selected using a backwards elimination approach. We assess model accuracy using RF's internal out-of-bag estimate, and a cross-validation procedure with validation folds external to the variable selection process. We also assess the stability of the spatial predictions generated by the RF models to changes in the number of predictors, and argue that model selection needs to consider both accuracy and stability. The results suggest that RF modeling is robust to the inclusion of many variables of moderate to low importance. We found no substanti
Methodological development for selection of significant predictors explaining fatal road accidents.
Dadashova, Bahar; Arenas-Ramírez, Blanca; Mira-McWilliams, José; Aparicio-Izquierdo, Francisco
2016-05-01
Identification of the most relevant factors for explaining road accident occurrence is an important issue in road safety research, particularly for future decision-making processes in transport policy. However, model selection for this particular purpose is still an area of ongoing research. In this paper we propose a methodological development for model selection which addresses both explanatory variable selection and adequate model selection. A variable selection procedure, the TIM (two-input model) method, is carried out by combining neural network design and statistical approaches. The error structure of the fitted model is assumed to follow an autoregressive process. All models are estimated using the Markov Chain Monte Carlo method, where the model parameters are assigned non-informative prior distributions. The final model is built using the results of the variable selection. For the application of the proposed methodology, the number of fatal accidents in Spain during 2000-2011 was used. This indicator has experienced the maximum reduction internationally during the indicated years, thus making it an interesting time series from a road safety policy perspective. Hence the identification of the variables that have affected this reduction is of particular interest for future decision making. The results of the variable selection process show that the selected variables are main subjects of road safety policy measures. Published by Elsevier Ltd.
NASA Astrophysics Data System (ADS)
Song, Yunquan; Lin, Lu; Jian, Ling
2016-07-01
The single-index varying-coefficient model is an important mathematical modeling method for nonlinear phenomena in science and engineering. In this paper, we develop a variable selection method for high-dimensional single-index varying-coefficient models using a shrinkage idea. The proposed procedure can simultaneously select significant nonparametric components and parametric components. Under defined regularity conditions, with appropriate selection of tuning parameters, the consistency of the variable selection procedure and the oracle property of the estimators are established. Moreover, due to the robustness of the check loss function to outliers in finite samples, our proposed variable selection method is more robust than those based on the least squares criterion. Finally, the method is illustrated with numerical simulations.
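The robustness argument rests on how the check loss grows with the residual. The sketch below uses the usual quantile-regression definition of the check loss, rho_tau(u) = u * (tau - 1[u < 0]), and compares it with the squared loss on hypothetical residuals that include one outlier; the residual values and the default tau = 0.5 are assumptions for illustration.

```python
def check_loss(u, tau=0.5):
    """Quantile-regression check loss rho_tau(u) = u * (tau - 1[u < 0])."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def squared_loss(u):
    """Least squares loss for comparison."""
    return u * u


# Hypothetical residuals; the last value plays the role of an outlier.
residuals = [0.1, -0.2, 0.05, 8.0]
check_total = sum(check_loss(r) for r in residuals)
squared_total = sum(squared_loss(r) for r in residuals)
```

The outlier contributes 4.0 to the check loss total but 64.0 to the squared loss total, so it dominates a least squares fit far more than a check loss fit.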
A Time-Series Water Level Forecasting Model Based on Imputation and Variable Selection Method.
Yang, Jun-He; Cheng, Ching-Hsue; Chan, Chia-Pan
2017-01-01
Reservoirs are important for households and impact the national economy. This paper proposed a time-series forecasting model based on estimating a missing value followed by variable selection to forecast the reservoir's water level. This study collected data from the Taiwan Shimen Reservoir as well as daily atmospheric data from 2008 to 2015. The two datasets are concatenated into an integrated dataset based on ordering of the data as a research dataset. The proposed time-series forecasting model summarily has three foci. First, this study uses five imputation methods to directly delete the missing value. Second, we identified the key variable via factor analysis and then deleted the unimportant variables sequentially via the variable selection method. Finally, the proposed model uses a Random Forest to build the forecasting model of the reservoir's water level. This was done to compare with the listing method under the forecasting error. These experimental results indicate that the Random Forest forecasting model when applied to variable selection with full variables has better forecasting performance than the listing model. In addition, this experiment shows that the proposed variable selection can help determine five forecast methods used here to improve the forecasting capability.
Do bioclimate variables improve performance of climate envelope models?
Watling, James I.; Romañach, Stephanie S.; Bucklin, David N.; Speroterra, Carolina; Brandt, Laura A.; Pearlstine, Leonard G.; Mazzotti, Frank J.
2012-01-01
Climate envelope models are widely used to forecast potential effects of climate change on species distributions. A key issue in climate envelope modeling is the selection of predictor variables that most directly influence species. To determine whether model performance and spatial predictions were related to the selection of predictor variables, we compared models using bioclimate variables with models constructed from monthly climate data for twelve terrestrial vertebrate species in the southeastern USA using two different algorithms (random forests or generalized linear models), and two model selection techniques (using uncorrelated predictors or a subset of user-defined biologically relevant predictor variables). There were no differences in performance between models created with bioclimate or monthly variables, but one metric of model performance was significantly greater using the random forest algorithm compared with generalized linear models. Spatial predictions between maps using bioclimate and monthly variables were very consistent using the random forest algorithm with uncorrelated predictors, whereas we observed greater variability in predictions using generalized linear models.
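One common way to obtain a set of uncorrelated predictors is a greedy pairwise correlation filter. The sketch below is one such filter under stated assumptions: the 0.7 threshold, the priority order given by dictionary order, and the climate values are all hypothetical and are not taken from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def drop_correlated(predictors, threshold=0.7):
    """Greedily keep predictors whose |r| with every kept predictor is <= threshold."""
    kept = []
    for name in predictors:  # dict order acts as the priority order
        if all(abs(pearson(predictors[name], predictors[k])) <= threshold
               for k in kept):
            kept.append(name)
    return kept


# Hypothetical monthly climate values for five sites (illustrative numbers only).
predictors = {
    "temp_jan": [10.0, 12.0, 15.0, 11.0, 14.0],
    "temp_feb": [11.0, 13.0, 16.0, 12.0, 15.0],  # temp_jan shifted by 1, so r = 1
    "precip_jul": [80.0, 95.0, 60.0, 90.0, 100.0],
}
kept = drop_correlated(predictors)
```

Here `temp_feb` is dropped because it is perfectly correlated with `temp_jan`, while `precip_jul` is retained. The result depends on the priority order, which is one reason such filters can interact with expert-defined variable sets.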
Church, Sheri A; Livingstone, Kevin; Lai, Zhao; Kozik, Alexander; Knapp, Steven J; Michelmore, Richard W; Rieseberg, Loren H
2007-02-01
Using likelihood-based variable selection models, we determined if positive selection was acting on 523 EST sequence pairs from two lineages of sunflower and lettuce. Variable rate models are generally not used for comparisons of sequence pairs due to the limited information and the inaccuracy of estimates of specific substitution rates. However, previous studies have shown that the likelihood ratio test (LRT) is reliable for detecting positive selection, even with low numbers of sequences. These analyses identified 56 genes that show a signature of selection, of which 75% were not identified by simpler models that average selection across codons. Subsequent mapping studies in sunflower show four of five of the positively selected genes identified by these methods mapped to domestication QTLs. We discuss the validity and limitations of using variable rate models for comparisons of sequence pairs, as well as the limitations of using ESTs for identification of positively selected genes.
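The likelihood ratio test mentioned above compares the maximised log-likelihoods of a null model and a model that allows positive selection. The sketch below assumes the two nested models differ by two free parameters, which the abstract does not state; under that assumption the chi-square survival function has a closed form. The log-likelihood values are hypothetical.

```python
import math


def lrt_pvalue_df2(lnl_null, lnl_alt):
    """Likelihood ratio test for nested models differing by 2 parameters.

    For a chi-square distribution with 2 degrees of freedom the survival
    function is exp(-x / 2), so no statistics library is needed.
    """
    stat = 2.0 * (lnl_alt - lnl_null)
    stat = max(stat, 0.0)  # guard against small negative values from optimisation noise
    return stat, math.exp(-stat / 2.0)


# Hypothetical log-likelihoods for one gene under a null and a selection model.
stat, p = lrt_pvalue_df2(lnl_null=-1234.6, lnl_alt=-1229.6)
```

With hundreds of sequence pairs tested, a multiple-testing correction would normally be applied to the resulting p-values; that step is omitted here.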
Variable selection and model choice in geoadditive regression models.
Kneib, Thomas; Hothorn, Torsten; Tutz, Gerhard
2009-06-01
Model choice and variable selection are issues of major concern in practical regression analyses, arising in many biometric applications such as habitat suitability analyses, where the aim is to identify the influence of potentially many environmental conditions on certain species. We describe regression models for breeding bird communities that facilitate both model choice and variable selection, by a boosting algorithm that works within a class of geoadditive regression models comprising spatial effects, nonparametric effects of continuous covariates, interaction surfaces, and varying coefficients. The major modeling components are penalized splines and their bivariate tensor product extensions. All smooth model terms are represented as the sum of a parametric component and a smooth component with one degree of freedom to obtain a fair comparison between the model terms. A generic representation of the geoadditive model allows us to devise a general boosting algorithm that automatically performs model choice and variable selection.
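Boosting performs variable selection because each iteration updates only the single best-fitting model component. The sketch below shows componentwise L2 boosting with plain no-intercept linear base-learners, a simplification swapped in for the penalized spline and spatial base-learners described in the abstract; the step length, iteration count and data are assumptions.

```python
def componentwise_l2_boost(columns, y, n_iter=50, step=0.1):
    """Componentwise L2 boosting with one linear base-learner per column.

    columns: list of centred covariate columns; y: centred response.
    Returns the coefficient vector and how often each column was selected.
    """
    coef = [0.0] * len(columns)
    picks = [0] * len(columns)
    resid = list(y)
    for _ in range(n_iter):
        best_j, best_beta, best_gain = None, 0.0, -1.0
        for j, col in enumerate(columns):
            sxx = sum(v * v for v in col)
            sxr = sum(v * r for v, r in zip(col, resid))
            gain = sxr * sxr / sxx  # reduction in residual sum of squares
            if gain > best_gain:
                best_j, best_beta, best_gain = j, sxr / sxx, gain
        coef[best_j] += step * best_beta  # update only the winning component
        picks[best_j] += 1
        col = columns[best_j]
        resid = [r - step * best_beta * v for r, v in zip(resid, col)]
    return coef, picks


# Hypothetical centred data: the response depends on x0 only.
x0 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x1 = [1.0, -2.0, 0.0, 2.0, -1.0]
y = [3.0 * v for v in x0]
coef, picks = componentwise_l2_boost([x0, x1], y)
```

Columns that are never picked keep a coefficient of exactly zero, so stopping the algorithm early yields a sparse model. In practice the number of iterations is the main tuning parameter.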
Brandt, Laura A.; Benscoter, Allison; Harvey, Rebecca G.; Speroterra, Carolina; Bucklin, David N.; Romañach, Stephanie; Watling, James I.; Mazzotti, Frank J.
2017-01-01
Climate envelope models are widely used to describe potential future distribution of species under different climate change scenarios. It is broadly recognized that there are both strengths and limitations to using climate envelope models and that outcomes are sensitive to initial assumptions, inputs, and modeling methods. Selection of predictor variables, a central step in modeling, is one of the areas where different techniques can yield varying results. Selection of climate variables to use as predictors is often done using statistical approaches that develop correlations between occurrences and climate data. These approaches have received criticism in that they rely on the statistical properties of the data rather than directly incorporating biological information about species responses to temperature and precipitation. We evaluated and compared models and prediction maps for 15 threatened or endangered species in Florida based on two variable selection techniques: expert opinion and a statistical method. We compared model performance between these two approaches for contemporary predictions, and the spatial correlation, spatial overlap and area predicted for contemporary and future climate predictions. In general, experts identified more variables as being important than the statistical method and there was low overlap in the variable sets (<40%) between the two methods. Despite these differences in variable sets (expert versus statistical), models had high performance metrics (>0.9 for area under the curve (AUC) and >0.7 for true skill statistic (TSS)). Spatial overlap, which compares the spatial configuration between maps constructed using the different variable selection techniques, was only moderate overall (about 60%), with a great deal of variability across species. Difference in spatial overlap was even greater under future climate projections, indicating additional divergence of model outputs from different variable selection techniques.
Our work is in agreement with other studies which have found that for broad-scale species distribution modeling, using statistical methods of variable selection is a useful first step, especially when there is a need to model a large number of species or expert knowledge of the species is limited. Expert input can then be used to refine models that seem unrealistic or for species that experts believe are particularly sensitive to change. It also emphasizes the importance of using multiple models to reduce uncertainty and improve map outputs for conservation planning. Where outputs overlap or show the same direction of change there is greater certainty in the predictions. Areas of disagreement can be used for learning by asking why the models do not agree, and may highlight areas where additional on-the-ground data collection could improve the models.
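Two of the metrics named in this abstract are simple to compute from binary predictions. The sketch below gives the standard true skill statistic (sensitivity plus specificity minus one) and a map overlap measure; the overlap is written here as the share of cells predicted suitable by either map that both maps agree on, which is an assumption since the abstract does not define its overlap formula. All counts and maps are hypothetical.

```python
def true_skill_statistic(tp, fn, tn, fp):
    """TSS = sensitivity + specificity - 1, ranging from -1 to 1."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0


def spatial_overlap(map_a, map_b):
    """Share of cells predicted suitable by either map that both maps agree on."""
    both = sum(1 for a, b in zip(map_a, map_b) if a and b)
    either = sum(1 for a, b in zip(map_a, map_b) if a or b)
    return both / either if either else 1.0


# Hypothetical confusion-matrix counts for one species model.
tss = true_skill_statistic(tp=8, fn=2, tn=9, fp=1)

# Hypothetical binary suitability maps (1 = suitable) flattened to cell lists.
expert_map = [1, 1, 0, 0, 1]
statistical_map = [1, 0, 0, 1, 1]
overlap = spatial_overlap(expert_map, statistical_map)
```

The example shows why both kinds of metric are reported: two models can each score well against occurrence data while their maps still disagree on a large share of cells.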
Fan, Shu-Xiang; Huang, Wen-Qian; Li, Jiang-Bo; Guo, Zhi-Ming; Zhao, Chun-Jiang
2014-10-01
In order to detect the soluble solids content (SSC) of apple conveniently and rapidly, a ring fiber probe and a portable spectrometer were applied to obtain the spectroscopy of apple. Different wavelength variable selection methods, including uninformative variable elimination (UVE), competitive adaptive reweighted sampling (CARS) and genetic algorithm (GA), were proposed to select effective wavelength variables of the NIR spectroscopy of the SSC in apple based on PLS. The back interval LS-SVM (BiLS-SVM) and GA were used to select effective wavelength variables based on LS-SVM. Selected wavelength variables and full wavelength range were set as input variables of PLS model and LS-SVM model, respectively. The results indicated that PLS model built using GA-CARS on 50 characteristic variables selected from full-spectrum which had 1512 wavelengths achieved the optimal performance. The correlation coefficient (Rp) and root mean square error of prediction (RMSEP) for prediction sets were 0.962 and 0.403 °Brix, respectively, for SSC. The proposed method of GA-CARS could effectively simplify the portable detection model of SSC in apple based on near infrared spectroscopy and enhance the predictive precision. The study can provide a reference for the development of portable apple soluble solids content spectrometer.
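The two figures of merit reported for the prediction set, Rp and RMSEP, can be computed as in the sketch below. The measured and predicted SSC values are hypothetical and chosen only to make the arithmetic easy to follow; they are not data from the study.

```python
import math


def rmsep(measured, predicted):
    """Root mean square error of prediction, in the units of the measurement."""
    n = len(measured)
    return math.sqrt(sum((m - p) ** 2 for m, p in zip(measured, predicted)) / n)


def correlation(measured, predicted):
    """Pearson correlation coefficient between measured and predicted values."""
    n = len(measured)
    mm, mp = sum(measured) / n, sum(predicted) / n
    sxy = sum((m - mm) * (p - mp) for m, p in zip(measured, predicted))
    sxx = sum((m - mm) ** 2 for m in measured)
    syy = sum((p - mp) ** 2 for p in predicted)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical SSC values in degrees Brix for a four-sample prediction set.
measured = [10.0, 12.0, 14.0, 16.0]
predicted = [10.5, 11.5, 14.5, 15.5]
error = rmsep(measured, predicted)
rp = correlation(measured, predicted)
```

Every prediction here is off by 0.5 °Brix, so the RMSEP is 0.5 while Rp stays close to 1, which illustrates why the two metrics are usually reported together.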
VARIABLE SELECTION FOR REGRESSION MODELS WITH MISSING DATA
Garcia, Ramon I.; Ibrahim, Joseph G.; Zhu, Hongtu
2009-01-01
We consider the variable selection problem for a class of statistical models with missing data, including missing covariate and/or response data. We investigate the smoothly clipped absolute deviation penalty (SCAD) and adaptive LASSO and propose a unified model selection and estimation procedure for use in the presence of missing data. We develop a computationally attractive algorithm for simultaneously optimizing the penalized likelihood function and estimating the penalty parameters. Particularly, we propose to use a model selection criterion, called the ICQ statistic, for selecting the penalty parameters. We show that the variable selection procedure based on ICQ automatically and consistently selects the important covariates and leads to efficient estimates with oracle properties. The methodology is very general and can be applied to numerous situations involving missing data, from covariates missing at random in arbitrary regression models to nonignorably missing longitudinal responses and/or covariates. Simulations are given to demonstrate the methodology and examine the finite sample performance of the variable selection procedures. Melanoma data from a cancer clinical trial is presented to illustrate the proposed methodology. PMID:20336190
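For readers unfamiliar with the SCAD penalty, the sketch below evaluates its standard piecewise form and contrasts it with the lasso penalty. It covers only the penalty function, not the missing-data algorithm or the ICQ criterion from the abstract; the default a = 3.7 is the value commonly used in the literature and is an assumption here.

```python
def scad_penalty(theta, lam, a=3.7):
    """Smoothly clipped absolute deviation penalty for one coefficient."""
    t = abs(theta)
    if t <= lam:
        return lam * t  # behaves like the lasso near zero
    if t <= a * lam:
        return (2.0 * a * lam * t - t * t - lam * lam) / (2.0 * (a - 1.0))
    return lam * lam * (a + 1.0) / 2.0  # constant, so large effects are not shrunk further


def lasso_penalty(theta, lam):
    """L1 penalty for comparison; grows without bound."""
    return lam * abs(theta)


small = scad_penalty(0.5, lam=1.0)
large = scad_penalty(10.0, lam=1.0)
large_lasso = lasso_penalty(10.0, lam=1.0)
```

The penalty levels off for large coefficients, which is the property behind the nearly unbiased estimates associated with the oracle results, whereas the lasso keeps penalising large coefficients in proportion to their size.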
A non-linear data mining parameter selection algorithm for continuous variables
Razavi, Marianne; Brady, Sean
2017-01-01
In this article, we propose a new data mining algorithm, by which one can both capture the non-linearity in data and also find the best subset model. To produce an enhanced subset of the original variables, a preferred selection method should have the potential of adding a supplementary level of regression analysis that would capture complex relationships in the data via mathematical transformation of the predictors and exploration of synergistic effects of combined variables. The method that we present here has the potential to produce an optimal subset of variables, rendering the overall process of model selection more efficient. This algorithm introduces interpretable parameters by transforming the original inputs and also a faithful fit to the data. The core objective of this paper is to introduce a new estimation technique for the classical least square regression framework. This new automatic variable transformation and model selection method could offer an optimal and stable model that minimizes the mean square error and variability, while combining all possible subset selection methodology with the inclusion variable transformations and interactions. Moreover, this method controls multicollinearity, leading to an optimal set of explanatory variables. PMID:29131829
Selection of Variables in Cluster Analysis: An Empirical Comparison of Eight Procedures
ERIC Educational Resources Information Center
Steinley, Douglas; Brusco, Michael J.
2008-01-01
Eight different variable selection techniques for model-based and non-model-based clustering are evaluated across a wide range of cluster structures. It is shown that several methods have difficulties when non-informative variables (i.e., random noise) are included in the model. Furthermore, the distribution of the random noise greatly impacts the…
Variable Selection for Regression Models of Percentile Flows
NASA Astrophysics Data System (ADS)
Fouad, G.
2017-12-01
Percentile flows describe the flow magnitude equaled or exceeded for a given percent of time, and are widely used in water resource management. However, these statistics are normally unavailable since most basins are ungauged. Percentile flows of ungauged basins are often predicted using regression models based on readily observable basin characteristics, such as mean elevation. The number of these independent variables is too large to evaluate all possible models. A subset of models is typically evaluated using automatic procedures, like stepwise regression. This ignores a large variety of methods from the field of feature (variable) selection and physical understanding of percentile flows. A study of 918 basins in the United States was conducted to compare an automatic regression procedure to the following variable selection methods: (1) principal component analysis, (2) correlation analysis, (3) random forests, (4) genetic programming, (5) Bayesian networks, and (6) physical understanding. The automatic regression procedure only performed better than principal component analysis. Poor performance of the regression procedure was due to a commonly used filter for multicollinearity, which rejected the strongest models because they had cross-correlated independent variables. Multicollinearity did not decrease model performance in validation because of a representative set of calibration basins. Variable selection methods based strictly on predictive power (numbers 2-5 from above) performed similarly, likely indicating a limit to the predictive power of the variables. Similar performance was also reached using variables selected based on physical understanding, a finding that substantiates recent calls to emphasize physical understanding in modeling for predictions in ungauged basins. The strongest variables highlighted the importance of geology and land cover, whereas widely used topographic variables were the weakest predictors. 
Variables suffered from a high degree of multicollinearity, possibly illustrating the co-evolution of climatic and physiographic conditions. Given the ineffectiveness of many variables used here, future work should develop new variables that target specific processes associated with percentile flows.
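The definition of a percentile flow given at the start of this abstract translates directly into code. The sketch below uses a simple empirical rule without interpolation, which is an assumption; operational flow-duration analyses often use plotting-position formulas instead. The flow values are hypothetical.

```python
import math


def percentile_flow(flows, exceedance_percent):
    """Flow equaled or exceeded for `exceedance_percent` of the record."""
    ranked = sorted(flows, reverse=True)  # largest flow first
    n = len(ranked)
    # Smallest rank k such that k / n reaches the exceedance fraction.
    k = max(1, math.ceil(exceedance_percent * n / 100))
    return ranked[k - 1]


# Hypothetical daily flows (arbitrary units) for a ten-day record.
flows = [5, 1, 9, 3, 7, 2, 8, 4, 10, 6]
q90 = percentile_flow(flows, 90)  # low-flow statistic
q50 = percentile_flow(flows, 50)
q10 = percentile_flow(flows, 10)  # high-flow statistic
```

In this record the flow of 2 is equaled or exceeded on nine of ten days, so it is the 90th percentile flow. In a regression study, each such statistic would become the response variable predicted from basin characteristics.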
Variable selection in discrete survival models including heterogeneity.
Groll, Andreas; Tutz, Gerhard
2017-04-01
Several variable selection procedures are available for continuous time-to-event data. However, if time is measured in a discrete way and many ties therefore occur, models for continuous time are inadequate. We propose penalized likelihood methods that perform efficient variable selection in discrete survival modeling with explicit modeling of the heterogeneity in the population. The method is based on a combination of ridge and lasso type penalties that are tailored to the case of discrete survival. The performance is studied in simulation studies and an application to the birth of the first child.
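Discrete-time hazard models are commonly fitted as binary regressions on person-period data, where each subject contributes one row per period at risk. The sketch below shows that expansion together with a generic mix of lasso and ridge penalties. The exact penalty structure tailored to discrete survival in the paper is not given in the abstract, so the `combined_penalty` form, its `alpha` mixing parameter and the example subjects are assumptions.

```python
def person_period_rows(subject_id, time, event):
    """Expand one subject into one binary row per discrete period at risk.

    time: last observed period (1-based); event: True if the event occurred
    in that period, False if the subject was censored there.
    """
    rows = []
    for period in range(1, time + 1):
        y = 1 if (event and period == time) else 0
        rows.append({"id": subject_id, "period": period, "y": y})
    return rows


def combined_penalty(coefs, lam, alpha):
    """Mix of lasso (alpha = 1) and ridge (alpha = 0) penalties on coefficients."""
    l1 = sum(abs(b) for b in coefs)
    l2 = sum(b * b for b in coefs)
    return lam * (alpha * l1 + (1.0 - alpha) * l2)


# Hypothetical subjects: "a" has the event in period 3, "b" is censored in period 2.
rows_a = person_period_rows("a", 3, True)
rows_b = person_period_rows("b", 2, False)
```

After the expansion, ties in event times cause no difficulty because each period is modelled as its own binary outcome.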
NASA Astrophysics Data System (ADS)
Creaco, E.; Berardi, L.; Sun, Siao; Giustolisi, O.; Savic, D.
2016-04-01
The growing availability of field data, from information and communication technologies (ICTs) in "smart" urban infrastructures, allows data modeling to understand complex phenomena and to support management decisions. Among the analyzed phenomena, those related to storm water quality modeling have recently been gaining interest in the scientific literature. Nonetheless, the large amount of available data poses the problem of selecting relevant variables to describe a phenomenon and enable robust data modeling. This paper presents a procedure for the selection of relevant input variables using the multiobjective evolutionary polynomial regression (EPR-MOGA) paradigm. The procedure is based on scrutinizing the explanatory variables that appear inside the set of EPR-MOGA symbolic model expressions of increasing complexity and goodness of fit to target output. The strategy also enables the selection to be validated by engineering judgement. In such context, the multiple case study extension of EPR-MOGA, called MCS-EPR-MOGA, is adopted. The application of the proposed procedure to modeling storm water quality parameters in two French catchments shows that it was able to significantly reduce the number of explanatory variables for successive analyses. Finally, the EPR-MOGA models obtained after the input selection are compared with those obtained by using the same technique without benefitting from input selection and with those obtained in previous works where other data-modeling techniques were used on the same data. The comparison highlights the effectiveness of both EPR-MOGA and the input selection procedure.
Waite, Ian R.
2014-01-01
As part of the USGS study of nutrient enrichment of streams in agricultural regions throughout the United States, about 30 sites within each of eight study areas were selected to capture a gradient of nutrient conditions. The objective was to develop watershed disturbance predictive models for macroinvertebrate and algal metrics at national and three regional landscape scales to obtain a better understanding of important explanatory variables. Explanatory variables in models were generated from landscape data, habitat, and chemistry. Instream nutrient concentration and variables assessing the amount of disturbance to the riparian zone (e.g., percent row crops or percent agriculture) were selected as the most important explanatory variables in almost all boosted regression tree models regardless of landscape scale or assemblage. Frequently, TN and TP concentration and riparian agricultural land use variables showed a threshold type response at relatively low values to biotic metrics modeled. Some measure of habitat condition was also commonly selected in the final invertebrate models, though the variable(s) varied across regions. Results suggest national models tended to account for more general landscape/climate differences, while regional models incorporated both broad landscape scale and more specific local-scale variables.
NASA Astrophysics Data System (ADS)
Chen, Hui; Tan, Chao; Lin, Zan; Wu, Tong
2018-01-01
Milk is among the most popular nutrient sources worldwide and is of great interest due to its beneficial medicinal properties. The feasibility of classifying milk powder samples with respect to their brands and of determining protein concentration is investigated by NIR spectroscopy along with chemometrics. Two datasets were prepared for the experiments. One contains 179 samples of four brands for classification and the other contains 30 samples for quantitative analysis. Principal component analysis (PCA) was used for exploratory analysis. Based on an effective model-independent variable selection method, i.e., minimal-redundancy maximal-relevance (MRMR), only 18 variables were selected to construct a partial least-square discriminant analysis (PLS-DA) model. On the test set, the PLS-DA model based on the selected variable set was compared with the full-spectrum PLS-DA model, both of which achieved 100% accuracy. In quantitative analysis, the partial least-square regression (PLSR) model constructed from the selected subset of 260 variables significantly outperforms the full-spectrum model. It seems that the combination of NIR spectroscopy, MRMR and PLS-DA or PLSR is a powerful tool for classifying different brands of milk and determining the protein content.
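The greedy idea behind minimal-redundancy maximal-relevance selection can be sketched as follows. This is an illustrative sketch only, not the procedure used in the study: it substitutes absolute Pearson correlation for mutual information, and the names `pearson` and `mrmr_select` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def mrmr_select(columns, target, k):
    """Greedily pick k variables from `columns` (dict: name -> values).

    Assumption: |Pearson r| stands in for mutual information, both for
    relevance (variable vs. target) and redundancy (variable vs. variable).
    """
    relevance = {name: abs(pearson(col, target)) for name, col in columns.items()}
    selected = []
    remaining = list(columns)

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(
            abs(pearson(columns[name], columns[s])) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

In this sketch a near-duplicate of an already selected variable is penalised by its high redundancy, so a less relevant but complementary variable can be chosen ahead of it.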
Fine-scale habitat modeling of a top marine predator: do prey data improve predictive capacity?
Torres, Leigh G; Read, Andrew J; Halpin, Patrick
2008-10-01
Predators and prey assort themselves relative to each other, the availability of resources and refuges, and the temporal and spatial scale of their interaction. Predictive models of predator distributions often rely on these relationships by incorporating data on environmental variability and prey availability to determine predator habitat selection patterns. This approach to predictive modeling holds true in marine systems where observations of predators are logistically difficult, emphasizing the need for accurate models. In this paper, we ask whether including prey distribution data in fine-scale predictive models of bottlenose dolphin (Tursiops truncatus) habitat selection in Florida Bay, Florida, U.S.A., improves predictive capacity. Environmental characteristics are often used as predictor variables in habitat models of top marine predators with the assumption that they act as proxies of prey distribution. We examine the validity of this assumption by comparing the response of dolphin distribution and fish catch rates to the same environmental variables. Next, the predictive capacities of four models, with and without prey distribution data, are tested to determine whether dolphin habitat selection can be predicted without recourse to describing the distribution of their prey. The final analysis determines the accuracy of predictive maps of dolphin distribution produced by modeling areas of high fish catch based on significant environmental characteristics. We use spatial analysis and independent data sets to train and test the models. Our results indicate that, due to high habitat heterogeneity and the spatial variability of prey patches, fine-scale models of dolphin habitat selection in coastal habitats will be more successful if environmental variables are used as predictor variables of predator distributions rather than relying on prey data as explanatory variables. 
However, predictive modeling of prey distribution as the response variable based on environmental variability did produce high predictive performance of dolphin habitat selection, particularly foraging habitat.
Selection of latent variables for multiple mixed-outcome models
ZHOU, LING; LIN, HUAZHEN; SONG, XINYUAN; LI, YI
2014-01-01
Latent variable models have been widely used for modeling the dependence structure of multiple outcomes data. However, the formulation of a latent variable model is often unknown a priori, and misspecification will distort the dependence structure and lead to unreliable model inference. Moreover, multiple outcomes with varying types present enormous analytical challenges. In this paper, we present a class of general latent variable models that can accommodate mixed types of outcomes. We propose a novel selection approach that simultaneously selects latent variables and estimates parameters. We show that the proposed estimator is consistent, asymptotically normal and has the oracle property. The practical utility of the methods is confirmed via simulations as well as an application to the analysis of the World Values Survey, a global research project that explores peoples’ values and beliefs and the social and personal characteristics that might influence them. PMID:27642219
Covariate Selection for Multilevel Models with Missing Data
Marino, Miguel; Buxton, Orfeu M.; Li, Yi
2017-01-01
Missing covariate data hampers variable selection in multilevel regression settings. Current variable selection techniques for multiply-imputed data commonly address missingness in the predictors through list-wise deletion and stepwise-selection methods which are problematic. Moreover, most variable selection methods are developed for independent linear regression models and do not accommodate multilevel mixed effects regression models with incomplete covariate data. We develop a novel methodology that is able to perform covariate selection across multiply-imputed data for multilevel random effects models when missing data is present. Specifically, we propose to stack the multiply-imputed data sets from a multiple imputation procedure and to apply a group variable selection procedure through group lasso regularization to assess the overall impact of each predictor on the outcome across the imputed data sets. Simulations confirm the advantageous performance of the proposed method compared with the competing methods. We applied the method to reanalyze the Healthy Directions-Small Business cancer prevention study, which evaluated a behavioral intervention program targeting multiple risk-related behaviors in a working-class, multi-ethnic population. PMID:28239457
Hao, Yong; Sun, Xu-Dong; Yang, Qiang
2012-12-01
A variable selection strategy combined with local linear embedding (LLE) was introduced for the analysis of complex samples by near infrared spectroscopy (NIRS). Three methods, namely Monte Carlo uninformative variable elimination (MCUVE), the successive projections algorithm (SPA), and MCUVE combined with SPA, were used to eliminate redundant spectral variables. Partial least squares regression (PLSR) and LLE-PLSR were used for modeling complex samples. The results showed that MCUVE can both extract informative variables and improve the precision of the models. Compared with PLSR models, LLE-PLSR models can achieve more accurate analysis results. MCUVE combined with LLE-PLSR is an effective modeling method for NIRS quantitative analysis.
Safari, Parviz; Danyali, Syyedeh Fatemeh; Rahimi, Mehdi
2018-06-02
Drought is the main abiotic stress seriously influencing wheat production. Information about the inheritance of drought tolerance is necessary to determine the most appropriate strategy to develop tolerant cultivars and populations. In this study, generation means analysis to identify the genetic effects controlling grain yield inheritance in water deficit and normal conditions was considered as a model selection problem in a Bayesian framework. Stochastic search variable selection (SSVS) was applied to identify the most important genetic effects and the best fitted models using different generations obtained from two crosses applying two water regimes in two growing seasons. The SSVS is used to evaluate the effect of each variable on the dependent variable via posterior variable inclusion probabilities. The model with the highest posterior probability is selected as the best model. In this study, the grain yield was controlled by the main effects (additive and non-additive effects) and epistatic effects. The results demonstrate that breeding methods such as recurrent selection followed by the pedigree method, and hybrid production, can be useful to improve grain yield.
A Selective Review of Group Selection in High-Dimensional Models
Huang, Jian; Breheny, Patrick; Ma, Shuangge
2013-01-01
Grouping structures arise naturally in many statistical modeling problems. Several methods have been proposed for variable selection that respect grouping structure in variables. Examples include the group LASSO and several concave group selection methods. In this article, we give a selective review of group selection concerning methodological developments, theoretical properties and computational algorithms. We pay particular attention to group selection methods involving concave penalties. We address both group selection and bi-level selection methods. We describe several applications of these methods in nonparametric additive models, semiparametric regression, seemingly unrelated regressions, genomic data analysis and genome wide association studies. We also highlight some issues that require further study. PMID:24174707
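A minimal sketch of the group-level shrinkage step that underlies the group LASSO may help make the idea of selecting variables "as a group" concrete. The function name `group_soft_threshold` and the plain-list representation are hypothetical choices for illustration; this is the standard group soft-thresholding operator, not code from the review.

```python
import math


def group_soft_threshold(group, lam):
    """Shrink a whole coefficient group toward zero by `lam`.

    If the Euclidean norm of the group is at most `lam`, every coefficient
    in the group is set to zero together; otherwise all coefficients are
    scaled by the same factor (1 - lam / norm).
    """
    norm = math.sqrt(sum(b * b for b in group))
    if norm <= lam:
        return [0.0] * len(group)
    scale = 1.0 - lam / norm
    return [scale * b for b in group]
```

Because the decision depends only on the norm of the group, the coefficients of a group enter or leave the model together. Bi-level methods additionally allow individual coefficients inside a retained group to be zeroed.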
Selection of climate change scenario data for impact modelling.
Sloth Madsen, M; Maule, C Fox; MacKellar, N; Olesen, J E; Christensen, J Hesselbjerg
2012-01-01
Impact models investigating climate change effects on food safety often need detailed climate data. The aim of this study was to select climate change projection data for selected crop phenology and mycotoxin impact models. Using the ENSEMBLES database of climate model output, this study illustrates how the projected climate change signal of important variables such as temperature, precipitation and relative humidity depends on the choice of the climate model. Using climate change projections from at least two different climate models is recommended to account for model uncertainty. To make the climate projections suitable for impact analysis at the local scale, a weather generator approach was adopted. As the weather generator did not treat all the necessary variables, an ad-hoc statistical method was developed to synthesise realistic values of missing variables. The method is presented in this paper, applied to relative humidity, but it could be adapted to other variables if needed.
Random forest (RF) modeling has emerged as an important statistical learning method in ecology due to its exceptional predictive performance. However, for large and complex ecological datasets there is limited guidance on variable selection methods for RF modeling. Typically, e...
Woelmer, Whitney; Kao, Yu-Chun; Bunnell, David B.; Deines, Andrew M.; Bennion, David; Rogers, Mark W.; Brooks, Colin N.; Sayers, Michael J.; Banach, David M.; Grimm, Amanda G.; Shuchman, Robert A.
2016-01-01
Prediction of primary production of lentic water bodies (i.e., lakes and reservoirs) is valuable to researchers and resource managers alike, but is very rarely done at the global scale. With the development of remote sensing technologies, it is now feasible to gather large amounts of data across the world, including understudied and remote regions. To determine which factors were most important in explaining the variation of chlorophyll a (Chl-a), an indicator of primary production in water bodies, at global and regional scales, we first developed a geospatial database of 227 water bodies and watersheds with corresponding Chl-a, nutrient, hydrogeomorphic, and climate data. Then we used a generalized additive modeling approach and developed model selection criteria to select models that most parsimoniously related Chl-a to predictor variables for all 227 water bodies and for 51 lakes in the Laurentian Great Lakes region in the data set. Our best global model contained two hydrogeomorphic variables (water body surface area and the ratio of watershed to water body surface area) and a climate variable (average temperature in the warmest quarter) and explained ~ 30% of variation in Chl-a. Our regional model contained one hydrogeomorphic variable (flow accumulation) and the same climate variable, but explained substantially more variation (58%). Our results indicate that a regional approach to watershed modeling may be more informative to predicting Chl-a, and that nearly a third of global variability in Chl-a may be explained using hydrogeomorphic and climate variables.
Wang, Zhu; Shuangge, Ma; Wang, Ching-Yun
2017-01-01
In health services and outcome research, count outcomes are frequently encountered and often have a large proportion of zeros. The zero-inflated negative binomial (ZINB) regression model has important applications for this type of data. With many possible candidate risk factors, this paper proposes new variable selection methods for the ZINB model. We consider the maximum likelihood function plus a penalty, including the least absolute shrinkage and selection operator (LASSO), smoothly clipped absolute deviation (SCAD) and minimax concave penalty (MCP). An EM (expectation-maximization) algorithm is proposed for estimating the model parameters and conducting variable selection simultaneously. This algorithm consists of estimating penalized weighted negative binomial models and penalized logistic models via the coordinate descent algorithm. Furthermore, statistical properties including the standard error formulae are provided. A simulation study shows that the new algorithm not only gives more accurate or at least comparable estimation, but is also more robust than traditional stepwise variable selection. The proposed methods are applied to analyze the health care demand in Germany using an open-source R package mpath. PMID:26059498
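For reference, the three penalties named above can be written as functions of a single coefficient. The sketch below uses the standard textbook forms. The function names and the default shape parameters (`a=3.7` for SCAD, `gamma=3.0` for MCP) are conventional choices assumed here for illustration, not values taken from the paper or from the mpath package.

```python
def lasso_penalty(beta, lam):
    """LASSO: penalty grows linearly with |beta|."""
    return lam * abs(beta)


def scad_penalty(beta, lam, a=3.7):
    """SCAD: LASSO-like near zero, then tapering to a constant (requires a > 2)."""
    b = abs(beta)
    if b <= lam:
        return lam * b
    if b <= a * lam:
        return (2 * a * lam * b - b * b - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2


def mcp_penalty(beta, lam, gamma=3.0):
    """MCP: penalisation rate decreases linearly to zero at |beta| = gamma * lam."""
    b = abs(beta)
    if b <= gamma * lam:
        return lam * b - b * b / (2 * gamma)
    return gamma * lam * lam / 2
```

Unlike the LASSO, both concave penalties level off for large coefficients, so large effects are penalised no more heavily than moderate ones.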
Mujalli, Randa Oqab; de Oña, Juan
2011-10-01
This study describes a method for reducing the number of variables frequently considered in modeling the severity of traffic accidents. The method's efficiency is assessed by constructing Bayesian networks (BN). It is based on a two stage selection process. Several variable selection algorithms, commonly used in data mining, are applied in order to select subsets of variables. BNs are built using the selected subsets and their performance is compared with the original BN (with all the variables) using five indicators. The BNs that improve the indicators' values are further analyzed for identifying the most significant variables (accident type, age, atmospheric factors, gender, lighting, number of injured, and occupant involved). A new BN is built using these variables, where the results of the indicators indicate, in most of the cases, a statistically significant improvement with respect to the original BN. It is possible to reduce the number of variables used to model traffic accidents injury severity through BNs without reducing the performance of the model. The study provides the safety analysts a methodology that could be used to minimize the number of variables used in order to determine efficiently the injury severity of traffic accidents without reducing the performance of the model. Copyright © 2011 Elsevier Ltd. All rights reserved.
Random forest (RF) is popular in ecological and environmental modeling, in part, because of its insensitivity to correlated predictors and resistance to overfitting. Although variable selection has been proposed to improve both performance and interpretation of RF models, it is u...
Variable selection in subdistribution hazard frailty models with competing risks data
Do Ha, Il; Lee, Minjung; Oh, Seungyoung; Jeong, Jong-Hyeon; Sylvester, Richard; Lee, Youngjo
2014-01-01
The proportional subdistribution hazards model (i.e. Fine-Gray model) has been widely used for analyzing univariate competing risks data. Recently, this model has been extended to clustered competing risks data via frailty. To the best of our knowledge, however, there has been no literature on variable selection method for such competing risks frailty models. In this paper, we propose a simple but unified procedure via a penalized h-likelihood (HL) for variable selection of fixed effects in a general class of subdistribution hazard frailty models, in which random effects may be shared or correlated. We consider three penalty functions (LASSO, SCAD and HL) in our variable selection procedure. We show that the proposed method can be easily implemented using a slight modification to existing h-likelihood estimation approaches. Numerical studies demonstrate that the proposed procedure using the HL penalty performs well, providing a higher probability of choosing the true model than LASSO and SCAD methods without losing prediction accuracy. The usefulness of the new method is illustrated using two actual data sets from multi-center clinical trials. PMID:25042872
Liu, Xiang; Peng, Yingwei; Tu, Dongsheng; Liang, Hua
2012-10-30
Survival data with a sizable cure fraction are commonly encountered in cancer research. The semiparametric proportional hazards cure model has been recently used to analyze such data. As seen in the analysis of data from a breast cancer study, a variable selection approach is needed to identify important factors in predicting the cure status and risk of breast cancer recurrence. However, no specific variable selection method for the cure model is available. In this paper, we present a variable selection approach with penalized likelihood for the cure model. The estimation can be implemented easily by combining the computational methods for penalized logistic regression and the penalized Cox proportional hazards models with the expectation-maximization algorithm. We illustrate the proposed approach on data from a breast cancer study. We conducted Monte Carlo simulations to evaluate the performance of the proposed method. We used and compared different penalty functions in the simulation studies. Copyright © 2012 John Wiley & Sons, Ltd.
Diversified models for portfolio selection based on uncertain semivariance
NASA Astrophysics Data System (ADS)
Chen, Lin; Peng, Jin; Zhang, Bo; Rosyida, Isnaini
2017-02-01
Since the financial markets are complex, sometimes the future security returns are represented mainly based on experts' estimations due to lack of historical data. This paper proposes a semivariance method for diversified portfolio selection, in which the security returns are given subjective to experts' estimations and depicted as uncertain variables. In the paper, three properties of the semivariance of uncertain variables are verified. Based on the concept of semivariance of uncertain variables, two types of mean-semivariance diversified models for uncertain portfolio selection are proposed. Since the models are complex, a hybrid intelligent algorithm which is based on 99-method and genetic algorithm is designed to solve the models. In this hybrid intelligent algorithm, 99-method is applied to compute the expected value and semivariance of uncertain variables, and genetic algorithm is employed to seek the best allocation plan for portfolio selection. At last, several numerical examples are presented to illustrate the modelling idea and the effectiveness of the algorithm.
Quantifying Variability of Avian Colours: Are Signalling Traits More Variable?
Delhey, Kaspar; Peters, Anne
2008-01-01
Background Increased variability in sexually selected ornaments, a key assumption of evolutionary theory, is thought to be maintained through condition-dependence. Condition-dependent handicap models of sexual selection predict that (a) sexually selected traits show amplified variability compared to equivalent non-sexually selected traits, and since males are usually the sexually selected sex, that (b) males are more variable than females, and (c) sexually dimorphic traits more variable than monomorphic ones. So far these predictions have only been tested for metric traits. Surprisingly, they have not been examined for bright coloration, one of the most prominent sexual traits. This omission stems from computational difficulties: different types of colours are quantified on different scales precluding the use of coefficients of variation. Methodology/Principal Findings Based on physiological models of avian colour vision we develop an index to quantify the degree of discriminable colour variation as it can be perceived by conspecifics. A comparison of variability in ornamental and non-ornamental colours in six bird species confirmed (a) that those coloured patches that are sexually selected or act as indicators of quality show increased chromatic variability. However, we found no support for (b) that males generally show higher levels of variability than females, or (c) that sexual dichromatism per se is associated with increased variability. Conclusions/Significance We show that it is currently possible to realistically estimate variability of animal colours as perceived by them, something difficult to achieve with other traits. 
Increased variability of known sexually-selected/quality-indicating colours in the studied species, provides support to the predictions borne from sexual selection theory but the lack of increased overall variability in males or dimorphic colours in general indicates that sexual differences might not always be shaped by similar selective forces. PMID:18301766
Craig, Marlies H; Sharp, Brian L; Mabaso, Musawenkosi LH; Kleinschmidt, Immo
2007-01-01
Background Several malaria risk maps have been developed in recent years, many from the prevalence of infection data collated by the MARA (Mapping Malaria Risk in Africa) project, and using various environmental data sets as predictors. Variable selection is a major obstacle due to analytical problems caused by over-fitting, confounding and non-independence in the data. Testing and comparing every combination of explanatory variables in a Bayesian spatial framework remains unfeasible for most researchers. The aim of this study was to develop a malaria risk map using a systematic and practicable variable selection process for spatial analysis and mapping of historical malaria risk in Botswana. Results Of 50 potential explanatory variables from eight environmental data themes, 42 were significantly associated with malaria prevalence in univariate logistic regression and were ranked by the Akaike Information Criterion. Those correlated with higher-ranking relatives of the same environmental theme, were temporarily excluded. The remaining 14 candidates were ranked by selection frequency after running automated step-wise selection procedures on 1000 bootstrap samples drawn from the data. A non-spatial multiple-variable model was developed through step-wise inclusion in order of selection frequency. Previously excluded variables were then re-evaluated for inclusion, using further step-wise bootstrap procedures, resulting in the exclusion of another variable. Finally a Bayesian geo-statistical model using Markov Chain Monte Carlo simulation was fitted to the data, resulting in a final model of three predictor variables, namely summer rainfall, mean annual temperature and altitude. Each was independently and significantly associated with malaria prevalence after allowing for spatial correlation. This model was used to predict malaria prevalence at unobserved locations, producing a smooth risk map for the whole country. 
Conclusion We have produced a highly plausible and parsimonious model of historical malaria risk for Botswana from point-referenced data from a 1961/2 prevalence survey of malaria infection in 1–14 year old children. After starting with a list of 50 potential variables we ended with three highly plausible predictors, by applying a systematic and repeatable staged variable selection procedure that included a spatial analysis, which has application for other environmentally determined infectious diseases. All this was accomplished using general-purpose statistical software. PMID:17892584
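The bootstrap ranking step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `select` is a hypothetical placeholder for whatever automated selection procedure is run on each resample (step-wise selection in the study), and rows are represented as plain dictionaries.

```python
import random
from collections import Counter


def selection_frequency(rows, candidates, select, n_boot=1000, seed=0):
    """Fraction of bootstrap resamples in which each candidate is selected.

    rows       : list of observations (any objects `select` understands)
    candidates : names of the candidate variables
    select     : callable taking a resample and returning the selected names
                 (hypothetical stand-in for a step-wise procedure)
    """
    rng = random.Random(seed)
    counts = Counter()
    n = len(rows)
    for _ in range(n_boot):
        # Resample n observations with replacement.
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        counts.update(set(select(sample)))
    return {name: counts[name] / n_boot for name in candidates}


def rank_by_frequency(freq):
    """Order variable names by decreasing selection frequency (ties by name)."""
    return sorted(freq, key=lambda name: (-freq[name], name))
```

Variables can then be offered to a multiple-variable model in the order returned by `rank_by_frequency`, which mirrors the inclusion "in order of selection frequency" described in the abstract.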
Ni, Ai; Cai, Jianwen
2018-07-01
Case-cohort designs are commonly used in large epidemiological studies to reduce the cost associated with covariate measurement. In many such studies the number of covariates is very large. An efficient variable selection method is needed for case-cohort studies where the covariates are only observed in a subset of the sample. Current literature on this topic has been focused on the proportional hazards model. However, in many studies the additive hazards model is preferred over the proportional hazards model either because the proportional hazards assumption is violated or the additive hazards model provides more relevant information to the research question. Motivated by one such study, the Atherosclerosis Risk in Communities (ARIC) study, we investigate the properties of a regularized variable selection procedure in stratified case-cohort design under an additive hazards model with a diverging number of parameters. We establish the consistency and asymptotic normality of the penalized estimator and prove its oracle property. Simulation studies are conducted to assess the finite sample performance of the proposed method with a modified cross-validation tuning parameter selection method. We apply the variable selection procedure to the ARIC study to demonstrate its practical use.
Variable screening via quantile partial correlation
Ma, Shujie; Tsai, Chih-Ling
2016-01-01
In quantile linear regression with ultra-high dimensional data, we propose an algorithm for screening all candidate variables and subsequently selecting relevant predictors. Specifically, we first employ quantile partial correlation for screening, and then we apply the extended Bayesian information criterion (EBIC) for best subset selection. Our proposed method can successfully select predictors when the variables are highly correlated, and it can also identify variables that make a contribution to the conditional quantiles but are marginally uncorrelated or weakly correlated with the response. Theoretical results show that the proposed algorithm can yield the sure screening set. By controlling the false selection rate, model selection consistency can be achieved theoretically. In practice, we proposed using EBIC for best subset selection so that the resulting model is screening consistent. Simulation studies demonstrate that the proposed algorithm performs well, and an empirical example is presented. PMID:28943683
Tanner, Evan P; Papeş, Monica; Elmore, R Dwayne; Fuhlendorf, Samuel D; Davis, Craig A
2017-01-01
Ecological niche models (ENMs) have increasingly been used to estimate the potential effects of climate change on species' distributions worldwide. Recently, predictions of species abundance have also been obtained with such models, though knowledge about the climatic variables affecting species abundance is often lacking. To address this, we used a well-studied guild (temperate North American quail) and the Maxent modeling algorithm to compare model performance of three variable selection approaches: correlation/variable contribution (CVC), biological (i.e., variables known to affect species abundance), and random. We then applied the best approach to forecast potential distributions, under future climatic conditions, and analyze future potential distributions in light of available abundance data and presence-only occurrence data. To estimate species' distributional shifts we generated ensemble forecasts using four global circulation models, four representative concentration pathways, and two time periods (2050 and 2070). Furthermore, we present distributional shifts where 75%, 90%, and 100% of our ensemble models agreed. The CVC variable selection approach outperformed our biological approach for four of the six species. Model projections indicated species-specific effects of climate change on future distributions of temperate North American quail. The Gambel's quail (Callipepla gambelii) was the only species predicted to gain area in climatic suitability across all three scenarios of ensemble model agreement. Conversely, the scaled quail (Callipepla squamata) was the only species predicted to lose area in climatic suitability across all three scenarios of ensemble model agreement. Our models projected future loss of areas for the northern bobwhite (Colinus virginianus) and scaled quail in portions of their distributions which are currently areas of high abundance. 
Climatic variables that influence local abundance may not always scale up to influence species' distributions. Special attention should be given to selecting variables for ENMs, and tests of model performance should be used to validate the choice of variables.
Selecting the process variables for filament winding
NASA Technical Reports Server (NTRS)
Calius, E.; Springer, G. S.
1986-01-01
A model is described which can be used to determine the appropriate values of the process variables for filament winding cylinders. The process variables which can be selected by the model include the winding speed, fiber tension, initial resin degree of cure, and the temperatures applied during winding, curing, and post-curing. The effects of these process variables on the properties of the cylinder during and after manufacture are illustrated by a numerical example.
Variable selection under multiple imputation using the bootstrap in a prognostic study
Heymans, Martijn W; van Buuren, Stef; Knol, Dirk L; van Mechelen, Willem; de Vet, Henrica CW
2007-01-01
Background Missing data is a challenging problem in many prognostic studies. Multiple imputation (MI) accounts for imputation uncertainty, which allows for adequate statistical testing. We developed and tested a methodology combining MI with bootstrapping techniques for studying prognostic variable selection. Method In our prospective cohort study we merged data from three different randomized controlled trials (RCTs) to assess prognostic variables for chronicity of low back pain. Among the outcome and prognostic variables, the proportion of missing data ranged from 0 to 48.1%. We used four methods to investigate the influence of sampling and imputation variation, respectively: MI only, bootstrap only, and two methods that combine MI and bootstrapping. Variables were selected based on the inclusion frequency of each prognostic variable, i.e. the proportion of times that the variable appeared in the model. The discriminative and calibrative abilities of prognostic models developed by the four methods were assessed at different inclusion levels. Results We found that the effect of imputation variation on the inclusion frequency was larger than the effect of sampling variation. When MI and bootstrapping were combined at the range of 0% (full model) to 90% of variable selection, bootstrap corrected c-index values of 0.70 to 0.71 and slope values of 0.64 to 0.86 were found. Conclusion We recommend accounting for both imputation and sampling variation in data sets with missing values. The new procedure of combining MI with bootstrapping for variable selection results in multivariable prognostic models with good performance and is therefore attractive to apply to data sets with missing values. PMID:17629912
Muller, Benjamin J.; Cade, Brian S.; Schwarzkoph, Lin
2018-01-01
Many different factors influence animal activity. Often, the value of an environmental variable may influence significantly the upper or lower tails of the activity distribution. For describing relationships with heterogeneous boundaries, quantile regressions predict a quantile of the conditional distribution of the dependent variable. A quantile count model extends linear quantile regression methods to discrete response variables, and is useful if activity is quantified by trapping, where there may be many tied (equal) values in the activity distribution, over a small range of discrete values. Additionally, different environmental variables in combination may have synergistic or antagonistic effects on activity, so examining their effects together, in a modeling framework, is a useful approach. Thus, model selection on quantile counts can be used to determine the relative importance of different variables in determining activity, across the entire distribution of capture results. We conducted model selection on quantile count models to describe the factors affecting activity (numbers of captures) of cane toads (Rhinella marina) in response to several environmental variables (humidity, temperature, rainfall, wind speed, and moon luminosity) over eleven months of trapping. Environmental effects on activity are understudied in this pest animal. In the dry season, model selection on quantile count models suggested that rainfall positively affected activity, especially near the lower tails of the activity distribution. In the wet season, wind speed limited activity near the maximum of the distribution, while minimum activity increased with minimum temperature. This statistical methodology allowed us to explore, in depth, how environmental factors influenced activity across the entire distribution, and is applicable to any survey or trapping regime, in which environmental variables affect activity.
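The loss function that quantile regression minimises can be sketched in a few lines. This illustrates only the basic check (pinball) loss for a constant prediction. It does not implement the quantile count model used in the study, and the names `check_loss` and `best_constant_quantile` are hypothetical.

```python
def check_loss(residual, tau):
    """Check (pinball) loss for quantile level tau in (0, 1).

    Positive residuals (under-prediction) are weighted by tau,
    negative residuals (over-prediction) by 1 - tau.
    """
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def best_constant_quantile(values, tau):
    """Observed value that minimises the total check loss.

    For a model with no predictors this recovers a sample quantile,
    which is why the same loss with predictors targets a conditional quantile.
    """
    return min(
        sorted(set(values)),
        key=lambda q: sum(check_loss(v - q, tau) for v in values),
    )
```

With capture counts such as `[0, 0, 1, 2, 10]`, a quantile level of 0.5 yields the median, while a level near 1 targets the upper tail of the activity distribution, which is the region where limiting factors are expected to show.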
Lafuente, Victoria; Herrera, Luis J; Pérez, María del Mar; Val, Jesús; Negueruela, Ignacio
2015-08-15
In this work, near infrared spectroscopy (NIR) and an acoustic measure (AWETA) (two non-destructive methods) were applied in Prunus persica fruit 'Calrico' (n = 260) to predict Magness-Taylor (MT) firmness. Separate and combined use of these measures was evaluated and compared using partial least squares (PLS) and least squares support vector machine (LS-SVM) regression methods. Also, a mutual-information-based variable selection method, seeking to find the most significant variables to produce optimal accuracy of the regression models, was applied to a joint set of variables (NIR wavelengths and AWETA measure). The newly proposed combined NIR-AWETA model gave good values of the determination coefficient (R²) for PLS and LS-SVM methods (0.77 and 0.78, respectively), improving the reliability of MT firmness prediction in comparison with separate NIR and AWETA predictions. The three variables selected by the variable selection method (AWETA measure plus NIR wavelengths 675 and 697 nm) achieved R² values of 0.76 and 0.77 for PLS and LS-SVM, respectively. These results indicated that the proposed mutual-information-based variable selection algorithm was a powerful tool for the selection of the most relevant variables. © 2014 Society of Chemical Industry.
Bayesian Group Bridge for Bi-level Variable Selection.
Mallick, Himel; Yi, Nengjun
2017-06-01
A Bayesian bi-level variable selection method (BAGB: Bayesian Analysis of Group Bridge) is developed for regularized regression and classification. This new development is motivated by grouped data, where generic variables can be divided into multiple groups, with variables in the same group being mechanistically related or statistically correlated. As an alternative to frequentist group variable selection methods, BAGB incorporates structural information among predictors through a group-wise shrinkage prior. Posterior computation proceeds via an efficient MCMC algorithm. In addition to the usual ease-of-interpretation of hierarchical linear models, the Bayesian formulation produces valid standard errors, a feature that is notably absent in the frequentist framework. Empirical evidence of the attractiveness of the method is illustrated by extensive Monte Carlo simulations and real data analysis. Finally, several extensions of this new approach are presented, providing a unified framework for bi-level variable selection in general models with flexible penalties.
NASA Astrophysics Data System (ADS)
Müller, Aline Lima Hermes; Picoloto, Rochele Sogari; Mello, Paola de Azevedo; Ferrão, Marco Flores; dos Santos, Maria de Fátima Pereira; Guimarães, Regina Célia Lourenço; Müller, Edson Irineu; Flores, Erico Marlon Moraes
2012-04-01
Total sulfur concentration was determined in atmospheric residue (AR) and vacuum residue (VR) samples obtained from the petroleum distillation process by Fourier transform infrared spectroscopy with attenuated total reflectance (FT-IR/ATR) in association with chemometric methods. The calibration and prediction sets consisted of 40 and 20 samples, respectively. Calibration models were developed using two variable selection models: interval partial least squares (iPLS) and synergy interval partial least squares (siPLS). Different treatments and pre-processing steps were also evaluated for the development of models. The pre-treatment based on multiplicative scatter correction (MSC) and the mean centered data were selected for model construction. The use of siPLS as the variable selection method provided a model with root mean square error of prediction (RMSEP) values significantly better than those obtained by the PLS model using all variables. The best model was obtained using the siPLS algorithm with spectra divided into 20 intervals and combinations of 3 intervals (911-824, 823-736 and 737-650 cm⁻¹). This model produced a RMSECV of 400 mg kg⁻¹ S and RMSEP of 420 mg kg⁻¹ S, showing a correlation coefficient of 0.990.
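The search space behind "20 intervals and combinations of 3 intervals" can be enumerated directly. The sketch below only builds the candidate interval combinations; fitting a PLS model on each candidate is left out. The helper name and the number of spectral variables (1750) are assumptions chosen for illustration, not values from the abstract.

```python
from itertools import combinations


def split_into_intervals(n_variables, n_intervals):
    """Index ranges dividing n_variables into n_intervals nearly equal blocks."""
    base, extra = divmod(n_variables, n_intervals)
    bounds, start = [], 0
    for i in range(n_intervals):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first blocks
        bounds.append(range(start, start + size))
        start += size
    return bounds


# Assumed spectrum length; the real number of wavenumbers is not stated here.
intervals = split_into_intervals(n_variables=1750, n_intervals=20)

# Each siPLS candidate is a set of 3 interval indices whose variables are pooled.
candidates = list(combinations(range(20), 3))
print(len(candidates))

# Column indices used by the first candidate model.
first_columns = [i for k in candidates[0] for i in intervals[k]]
```

In a full siPLS run each candidate would be scored by cross-validation (RMSECV) and the best-scoring combination retained.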
Variable Selection in the Presence of Missing Data: Imputation-based Methods.
Zhao, Yize; Long, Qi
2017-01-01
Variable selection plays an essential role in regression analysis as it identifies important variables that are associated with outcomes and is known to improve the predictive accuracy of resulting models. Variable selection methods have been widely investigated for fully observed data. However, in the presence of missing data, methods for variable selection need to be carefully designed to account for missing data mechanisms and the statistical techniques used for handling missing data. Since imputation is arguably the most popular method for handling missing data due to its ease of use, statistical methods for variable selection that are combined with imputation are of particular interest. These methods, valid under the assumptions of missing at random (MAR) and missing completely at random (MCAR), largely fall into three general strategies. The first strategy applies existing variable selection methods to each imputed dataset and then combines variable selection results across all imputed datasets. The second strategy applies existing variable selection methods to stacked imputed datasets. The third variable selection strategy combines resampling techniques such as bootstrap with imputation. Despite recent advances, this area remains under-developed and offers fertile ground for further research.
Variable selection for marginal longitudinal generalized linear models.
Cantoni, Eva; Flemming, Joanna Mills; Ronchetti, Elvezio
2005-06-01
Variable selection is an essential part of any statistical analysis and yet has been somewhat neglected in the context of longitudinal data analysis. In this article, we propose a generalized version of Mallows's C(p) (GC(p)) suitable for use with both parametric and nonparametric models. GC(p) provides an estimate of a measure of model's adequacy for prediction. We examine its performance with popular marginal longitudinal models (fitted using GEE) and contrast results with what is typically done in practice: variable selection based on Wald-type or score-type tests. An application to real data further demonstrates the merits of our approach while at the same time emphasizing some important robust features inherent to GC(p).
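For orientation, the classical Mallows's C(p) that the proposed GC(p) generalizes can be written as follows. This is the standard linear-regression form, not the generalized criterion from the article; the symbols (RSS_p for the residual sum of squares of a submodel with p parameters, n for the sample size, and an error variance estimate usually taken from the full model) are the conventional ones rather than the article's notation.

```latex
C_p = \frac{\mathrm{RSS}_p}{\hat{\sigma}^2} - n + 2p
```

A submodel with little bias has C(p) close to p, so candidate models are typically compared by how near their C(p) falls to p and by how small it is.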
NASA Astrophysics Data System (ADS)
Shi, Jinfei; Zhu, Songqing; Chen, Ruwen
2017-12-01
An order selection method based on multiple stepwise regression is proposed for the General Expression of Nonlinear Autoregressive (GNAR) model; it converts the model order problem into variable selection for a multiple linear regression equation. The partial autocorrelation function is adopted to define the linear terms in the GNAR model. The result is set as the initial model, and then the nonlinear terms are introduced gradually. Statistics are chosen to assess how both the newly introduced and the existing variables improve the model characteristics, and these statistics determine which model variables to retain or eliminate. The optimal model is then obtained through a measurement of data fitting quality or a significance test. Results from simulations and from experiments on classic time-series data show that the proposed method is simple and reliable and can be applied to practical engineering.
Analysis of model development strategies: predicting ventral hernia recurrence.
Holihan, Julie L; Li, Linda T; Askenasy, Erik P; Greenberg, Jacob A; Keith, Jerrod N; Martindale, Robert G; Roth, J Scott; Liang, Mike K
2016-11-01
There have been many attempts to identify variables associated with ventral hernia recurrence; however, it is unclear which statistical modeling approach results in models with greatest internal and external validity. We aim to assess the predictive accuracy of models developed using five common variable selection strategies to determine variables associated with hernia recurrence. Two multicenter ventral hernia databases were used. Database 1 was randomly split into "development" and "internal validation" cohorts. Database 2 was designated "external validation". The dependent variable for model development was hernia recurrence. Five variable selection strategies were used: (1) "clinical"-variables considered clinically relevant, (2) "selective stepwise"-all variables with a P value <0.20 were assessed in a step-backward model, (3) "liberal stepwise"-all variables were included and step-backward regression was performed, (4) "restrictive internal resampling," and (5) "liberal internal resampling." Variables were included with P < 0.05 for the Restrictive model and P < 0.10 for the Liberal model. A time-to-event analysis using Cox regression was performed using these strategies. The predictive accuracy of the developed models was tested on the internal and external validation cohorts using Harrell's C-statistic where C > 0.70 was considered "reasonable". The recurrence rate was 32.9% (n = 173/526; median/range follow-up, 20/1-58 mo) for the development cohort, 36.0% (n = 95/264, median/range follow-up 20/1-61 mo) for the internal validation cohort, and 12.7% (n = 155/1224, median/range follow-up 9/1-50 mo) for the external validation cohort. Internal validation demonstrated reasonable predictive accuracy (C-statistics = 0.772, 0.760, 0.767, 0.757, 0.763), while on external validation, predictive accuracy dipped precipitously (C-statistic = 0.561, 0.557, 0.562, 0.553, 0.560). 
Predictive accuracy was equally adequate on internal validation among models; however, on external validation, all five models failed to demonstrate utility. Future studies should report multiple variable selection techniques and demonstrate predictive accuracy on external data sets for model validation. Copyright © 2016 Elsevier Inc. All rights reserved.
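Harrell's C-statistic, used above to score the models, is the fraction of usable subject pairs that a risk score orders correctly. The following is a minimal sketch under simplifying assumptions: pairs with equal follow-up times are ignored, tied risk scores count as half, and the four subjects are invented for the example.

```python
def harrell_c(times, events, risk_scores):
    """Harrell's C for right-censored data.

    A pair is usable when the subject with the shorter follow-up had the event.
    It is concordant when that subject also has the higher risk score.
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                usable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied scores count as half
    return concordant / usable


# Hypothetical follow-up in months, recurrence indicator (1 = recurred), model risk.
times = [5, 10, 15, 20]
events = [1, 0, 1, 1]
scores = [0.9, 0.4, 0.1, 0.2]
c_stat = harrell_c(times, events, scores)
print(c_stat)
```

A value of 0.5 corresponds to random ordering, which is why external-validation values near 0.56 indicate little practical utility.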
Variability aware compact model characterization for statistical circuit design optimization
NASA Astrophysics Data System (ADS)
Qiao, Ying; Qian, Kun; Spanos, Costas J.
2012-03-01
Variability modeling at the compact transistor model level can enable statistically optimized designs in view of limitations imposed by the fabrication technology. In this work we propose an efficient variability-aware compact model characterization methodology based on the linear propagation of variance. Hierarchical spatial variability patterns of selected compact model parameters are directly calculated from transistor array test structures. This methodology has been implemented and tested using transistor I-V measurements and the EKV-EPFL compact model. Calculation results compare well to full-wafer direct model parameter extractions. Further studies are done on the proper selection of both compact model parameters and electrical measurement metrics used in the method.
Wang, Zhu; Ma, Shuangge; Wang, Ching-Yun
2015-09-01
In health services and outcome research, count outcomes are frequently encountered and often have a large proportion of zeros. The zero-inflated negative binomial (ZINB) regression model has important applications for this type of data. With many possible candidate risk factors, this paper proposes new variable selection methods for the ZINB model. We consider the maximum likelihood function plus a penalty, including the least absolute shrinkage and selection operator (LASSO), smoothly clipped absolute deviation (SCAD), and minimax concave penalty (MCP). An EM (expectation-maximization) algorithm is proposed for estimating the model parameters and conducting variable selection simultaneously. This algorithm consists of estimating penalized weighted negative binomial models and penalized logistic models via the coordinate descent algorithm. Furthermore, statistical properties including the standard error formulae are provided. A simulation study shows that the new algorithm not only has more accurate or at least comparable estimation, but also is more robust than the traditional stepwise variable selection. The proposed methods are applied to analyze the health care demand in Germany using the open-source R package mpath. © 2015 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
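The three penalties named in this abstract differ in how they treat large coefficients. The sketch below evaluates each penalty for a single coefficient using the standard textbook forms; it does not reproduce the mpath package or the EM algorithm, and the default shape parameters (a = 3.7 for SCAD, gamma = 3.0 for MCP) are common conventions assumed here.

```python
def lasso_penalty(beta, lam):
    """LASSO: grows linearly with |beta| without bound."""
    return lam * abs(beta)


def scad_penalty(beta, lam, a=3.7):
    """SCAD: linear near zero, quadratic transition, then constant."""
    b = abs(beta)
    if b <= lam:
        return lam * b
    if b <= a * lam:
        return (2 * a * lam * b - b * b - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2


def mcp_penalty(beta, lam, gamma=3.0):
    """MCP: penalty slope decays linearly to zero at |beta| = gamma * lam."""
    b = abs(beta)
    if b <= gamma * lam:
        return lam * b - b * b / (2 * gamma)
    return gamma * lam * lam / 2


for beta in (0.5, 2.0, 5.0):
    print(beta, lasso_penalty(beta, 1.0), scad_penalty(beta, 1.0), mcp_penalty(beta, 1.0))
```

For large coefficients the SCAD and MCP penalties level off while the LASSO penalty keeps growing, which is the usual argument that the nonconvex penalties shrink strong effects less.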
Thomas, D.L.; Johnson, D.; Griffith, B.
2006-01-01
Modeling the probability of use of land units characterized by discrete and continuous measures, we present a Bayesian random-effects model to assess resource selection. This model provides simultaneous estimation of both individual- and population-level selection. Deviance information criterion (DIC), a Bayesian alternative to AIC that is sample-size specific, is used for model selection. Aerial radiolocation data from 76 adult female caribou (Rangifer tarandus) and calf pairs during 1 year on an Arctic coastal plain calving ground were used to illustrate models and assess population-level selection of landscape attributes, as well as individual heterogeneity of selection. Landscape attributes included elevation, NDVI (a measure of forage greenness), and land cover-type classification. Results from the first of a 2-stage model-selection procedure indicated that there is substantial heterogeneity among cow-calf pairs with respect to selection of the landscape attributes. In the second stage, selection of models with heterogeneity included indicated that at the population level, NDVI and land cover class were significant attributes for selection of different landscapes by pairs on the calving ground. Population-level selection coefficients indicate that the pairs generally select landscapes with higher levels of NDVI, but the relationship is quadratic. The highest rate of selection occurs at values of NDVI less than the maximum observed. Results for land cover-class selection coefficients indicate that wet sedge, moist sedge, herbaceous tussock tundra, and shrub tussock tundra are selected at approximately the same rate, while alpine and sparsely vegetated landscapes are selected at a lower rate. Furthermore, the variability in selection by individual caribou for moist sedge and sparsely vegetated landscapes is large relative to the variability in selection of other land cover types.
The example analysis illustrates that, while sometimes computationally intense, a Bayesian hierarchical discrete-choice model for resource selection can provide managers with 2 components of population-level inference: average population selection and variability of selection. Both components are necessary to make sound management decisions based on animal selection.
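For reference, the deviance information criterion used for model selection in this study is commonly defined as below. This is the standard definition and uses conventional notation that does not appear in the abstract: D(θ) is the deviance, D̄ its posterior mean, θ̄ the posterior mean of the parameters, and p_D the effective number of parameters.

```latex
D(\theta) = -2 \log p(y \mid \theta), \qquad
p_D = \bar{D} - D(\bar{\theta}), \qquad
\mathrm{DIC} = \bar{D} + p_D
```

Models with smaller DIC are preferred; the p_D term penalizes the extra flexibility that individual-level random effects introduce.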
Miaw, Carolina Sheng Whei; Assis, Camila; Silva, Alessandro Rangel Carolino Sales; Cunha, Maria Luísa; Sena, Marcelo Martins; de Souza, Scheilla Vitorino Carvalho
2018-07-15
Grape, orange, peach and passion fruit nectars were formulated and adulterated by dilution with syrup, apple and cashew juices at 10 levels for each adulterant. Attenuated total reflectance Fourier transform mid infrared (ATR-FTIR) spectra were obtained. Partial least squares (PLS) multivariate calibration models allied to different variable selection methods, such as interval partial least squares (iPLS), ordered predictors selection (OPS) and genetic algorithm (GA), were used to quantify the main fruits. PLS improved by iPLS-OPS variable selection showed the highest predictive capacity to quantify the main fruit contents. The selected variables in the final models varied from 72 to 100; the root mean square errors of prediction were estimated from 0.5 to 2.6%; the correlation coefficients of prediction ranged from 0.948 to 0.990; and, the mean relative errors of prediction varied from 3.0 to 6.7%. All of the developed models were validated. Copyright © 2018 Elsevier Ltd. All rights reserved.
Deng, Bai-chuan; Yun, Yong-huan; Liang, Yi-zeng; Yi, Lun-zhao
2014-10-07
In this study, a new optimization algorithm called the Variable Iterative Space Shrinkage Approach (VISSA) that is based on the idea of model population analysis (MPA) is proposed for variable selection. Unlike most of the existing optimization methods for variable selection, VISSA statistically evaluates the performance of variable space in each step of optimization. Weighted binary matrix sampling (WBMS) is proposed to generate sub-models that span the variable subspace. Two rules are highlighted during the optimization procedure. First, the variable space shrinks in each step. Second, the new variable space outperforms the previous one. The second rule, which is rarely satisfied in most of the existing methods, is the core of the VISSA strategy. Compared with some promising variable selection methods such as competitive adaptive reweighted sampling (CARS), Monte Carlo uninformative variable elimination (MCUVE) and iteratively retaining informative variables (IRIV), VISSA showed better prediction ability for the calibration of NIR data. In addition, VISSA is user-friendly; only a few insensitive parameters are needed, and the program terminates automatically without any additional conditions. The Matlab codes for implementing VISSA are freely available on the website: https://sourceforge.net/projects/multivariateanalysis/files/VISSA/.
[Measurement of Water COD Based on UV-Vis Spectroscopy Technology].
Wang, Xiao-ming; Zhang, Hai-liang; Luo, Wei; Liu, Xue-mei
2016-01-01
Ultraviolet/visible (UV/Vis) spectroscopy technology was used to measure water COD. A total of 135 water samples were collected from Zhejiang province. Raw spectra with 3 different pretreatment methods (Multiplicative Scatter Correction (MSC), Standard Normal Variate (SNV) and 1st Derivatives) were compared to determine the optimal pretreatment method for analysis. Spectral variable selection is an important strategy in spectrum modeling analysis, because it tends toward a parsimonious data representation and can lead to multivariate models with better performance. In order to simplify the calibration models, the preprocessed spectra were then used to select sensitive wavelengths by the competitive adaptive reweighted sampling (CARS), Random frog and Successive Genetic Algorithm (GA) methods. Different numbers of sensitive wavelengths were selected by the different variable selection methods with the SNV preprocessing method. Partial least squares (PLS) was used to build models with the full spectra, and Extreme Learning Machine (ELM) was applied to build models with the selected wavelength variables. The overall results showed that the ELM model performed better than the PLS model, and the ELM model with the selected wavelengths based on CARS obtained the best results, with a determination coefficient (R²), RMSEP and RPD of 0.82, 14.48 and 2.34, respectively, for the prediction set. The results indicated that UV/Vis spectroscopy with characteristic wavelengths obtained by the CARS variable selection method, combined with ELM calibration, is feasible for the rapid and accurate determination of COD in aquaculture water. Moreover, this study laid the foundation for further implementation of online analysis of aquaculture water and rapid determination of other water quality parameters.
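The RMSEP and RPD figures quoted above can be computed from a prediction set as sketched below. The definitions follow common chemometric usage and are assumptions as far as this study is concerned: RPD is taken as the sample standard deviation of the reference values (n - 1 denominator) divided by RMSEP, and the COD values are invented.

```python
import math


def rmsep(observed, predicted):
    """Root mean square error of prediction."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(observed))


def rpd(observed, predicted):
    """Ratio of the reference standard deviation to RMSEP (assumed n - 1 form)."""
    mean = sum(observed) / len(observed)
    variance = sum((o - mean) ** 2 for o in observed) / (len(observed) - 1)
    return math.sqrt(variance) / rmsep(observed, predicted)


# Hypothetical reference and predicted COD values (mg/L) for a prediction set.
reference = [10.0, 20.0, 30.0, 40.0, 50.0]
predicted = [12.0, 18.0, 33.0, 37.0, 50.0]
print(rmsep(reference, predicted), rpd(reference, predicted))
```

Because RPD scales the error by the spread of the reference values, it allows models built on data sets with different concentration ranges to be compared more fairly than RMSEP alone.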
Schnitzer, Mireille E.; Lok, Judith J.; Gruber, Susan
2015-01-01
This paper investigates the appropriateness of the integration of flexible propensity score modeling (nonparametric or machine learning approaches) in semiparametric models for the estimation of a causal quantity, such as the mean outcome under treatment. We begin with an overview of some of the issues involved in knowledge-based and statistical variable selection in causal inference and the potential pitfalls of automated selection based on the fit of the propensity score. Using a simple example, we directly show the consequences of adjusting for pure causes of the exposure when using inverse probability of treatment weighting (IPTW). Such variables are likely to be selected when using a naive approach to model selection for the propensity score. We describe how the method of Collaborative Targeted minimum loss-based estimation (C-TMLE; van der Laan and Gruber, 2010) capitalizes on the collaborative double robustness property of semiparametric efficient estimators to select covariates for the propensity score based on the error in the conditional outcome model. Finally, we compare several approaches to automated variable selection in low-and high-dimensional settings through a simulation study. From this simulation study, we conclude that using IPTW with flexible prediction for the propensity score can result in inferior estimation, while Targeted minimum loss-based estimation and C-TMLE may benefit from flexible prediction and remain robust to the presence of variables that are highly correlated with treatment. However, in our study, standard influence function-based methods for the variance underestimated the standard errors, resulting in poor coverage under certain data-generating scenarios. PMID:26226129
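The sensitivity of IPTW to covariates that strongly predict treatment can be seen from the estimator itself. The sketch below uses the normalized (Hájek) form, which is one common variant and not necessarily the one used in the paper; the function name and the toy data are hypothetical, and the propensity scores are taken as given rather than estimated.

```python
def iptw_mean_treated(outcomes, treated, propensity):
    """Normalized IPTW estimate of the mean outcome under treatment.

    Each treated subject is weighted by 1 / P(treated | covariates);
    untreated subjects receive weight 0.
    """
    weights = [t / e for t, e in zip(treated, propensity)]
    weighted_sum = sum(w * y for w, y in zip(weights, outcomes))
    return weighted_sum / sum(weights)


# Hypothetical data: outcome, treatment indicator, propensity score.
outcomes = [1.0, 3.0, 2.0, 4.0]
treated = [1, 1, 0, 0]
propensity = [0.5, 0.25, 0.5, 0.25]
estimate = iptw_mean_treated(outcomes, treated, propensity)
print(estimate)

# A propensity near 0 for a treated subject produces a very large weight,
# so a single observation can dominate the estimate.
unstable = iptw_mean_treated(outcomes, treated, [0.5, 0.001, 0.5, 0.25])
print(unstable)
```

Adding pure causes of the exposure to the propensity model tends to push fitted scores toward 0 or 1, which is one way to read the abstract's warning about selection based only on propensity score fit.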
Yoo, Jin Eun
2018-01-01
A substantial body of research has been conducted on variables relating to students' mathematics achievement with TIMSS. However, most studies have employed conventional statistical methods, and have focused on a select few indicators instead of utilizing the hundreds of variables TIMSS provides. This study aimed to find a prediction model for students' mathematics achievement using as many TIMSS student and teacher variables as possible. Elastic net, the selected machine learning technique in this study, takes advantage of both LASSO and ridge in terms of variable selection and multicollinearity, respectively. A logistic regression model was also employed to predict TIMSS 2011 Korean 4th graders' mathematics achievement. Ten-fold cross-validation with mean squared error was employed to determine the elastic net regularization parameter. Among 162 TIMSS variables explored, 12 student and 5 teacher variables were selected in the elastic net model, and the prediction accuracy, sensitivity, and specificity were 76.06, 70.23, and 80.34%, respectively. This study showed that the elastic net method can be successfully applied to educational large-scale data by selecting a subset of variables with reasonable prediction accuracy and finding new variables to predict students' mathematics achievement. Variables newly found via machine learning can shed light on the existing theories from a totally different perspective, which in turn can prompt the creation of a new theory or complement existing ones. This study also examined the current scale development convention from a machine learning perspective. PMID:29599736
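The way elastic net combines the LASSO and ridge penalties can be shown with the penalty term alone. The parameterization below (overall strength lam, mixing weight alpha, ridge part halved) follows a widely used convention and is an assumption; the abstract does not state which form or software was used, and the coefficient vector is invented.

```python
def elastic_net_penalty(coefficients, lam, alpha):
    """Elastic net penalty: alpha = 1 gives LASSO, alpha = 0 gives ridge."""
    l1 = sum(abs(b) for b in coefficients)       # encourages exact zeros (selection)
    l2_squared = sum(b * b for b in coefficients)  # stabilizes correlated predictors
    return lam * (alpha * l1 + (1 - alpha) / 2 * l2_squared)


beta = [1.0, -2.0, 0.0]  # hypothetical coefficients
print(elastic_net_penalty(beta, lam=0.5, alpha=1.0))  # pure LASSO
print(elastic_net_penalty(beta, lam=0.5, alpha=0.0))  # pure ridge
print(elastic_net_penalty(beta, lam=0.5, alpha=0.5))  # equal mix
```

In practice lam would be tuned by cross-validation, as the abstract describes, while alpha controls how strongly groups of correlated survey items are kept or dropped together.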
Novel harmonic regularization approach for variable selection in Cox's proportional hazards model.
Chu, Ge-Jin; Liang, Yong; Wang, Jia-Xuan
2014-01-01
Variable selection is an important issue in regression, and a number of variable selection methods involving nonconvex penalty functions have been proposed. In this paper, we investigate a novel harmonic regularization method, which can approximate nonconvex Lq (1/2 < q < 1) regularizations, to select key risk factors in Cox's proportional hazards model using microarray gene expression data. The harmonic regularization method can be efficiently solved using our proposed direct path seeking approach, which can produce solutions that closely approximate those for the convex loss function and the nonconvex regularization. Simulation results based on artificial datasets and four real microarray gene expression datasets, including the diffuse large B-cell lymphoma (DLBCL), lung cancer, and AML datasets, show that the harmonic regularization method can be more accurate for variable selection than existing Lasso series methods.
Johnson, Brent A
2009-10-01
We consider estimation and variable selection in the partial linear model for censored data. The partial linear model for censored data is a direct extension of the accelerated failure time model, the latter of which is a very important alternative model to the proportional hazards model. We extend rank-based lasso-type estimators to a model that may contain nonlinear effects. Variable selection in such partial linear model has direct application to high-dimensional survival analyses that attempt to adjust for clinical predictors. In the microarray setting, previous methods can adjust for other clinical predictors by assuming that clinical and gene expression data enter the model linearly in the same fashion. Here, we select important variables after adjusting for prognostic clinical variables but the clinical effects are assumed nonlinear. Our estimator is based on stratification and can be extended naturally to account for multiple nonlinear effects. We illustrate the utility of our method through simulation studies and application to the Wisconsin prognostic breast cancer data set.
NASA Astrophysics Data System (ADS)
Chen, Jie; Brissette, François P.; Lucas-Picher, Philippe
2016-11-01
Given the ever increasing number of climate change simulations being carried out, it has become impractical to use all of them to cover the uncertainty of climate change impacts. Various methods have been proposed to optimally select subsets of a large ensemble of climate simulations for impact studies. However, the behaviour of optimally-selected subsets of climate simulations for climate change impacts is unknown, since the transfer process from climate projections to the impact study world is usually highly non-linear. Consequently, this study investigates the transferability of optimally-selected subsets of climate simulations in the case of hydrological impacts. Two different methods were used for the optimal selection of subsets of climate scenarios, and both were found to be capable of adequately representing the spread of selected climate model variables contained in the original large ensemble. However, in both cases, the optimal subsets had limited transferability to hydrological impacts. To capture a similar variability in the impact model world, many more simulations have to be used than those that are needed to simply cover variability from the climate model variables' perspective. Overall, both optimal subset selection methods were better than random selection when small subsets were selected from a large ensemble for impact studies. However, as the number of selected simulations increased, random selection often performed better than the two optimal methods. To ensure adequate uncertainty coverage, the results of this study imply that selecting as many climate change simulations as possible is the best avenue. Where this was not possible, the two optimal methods were found to perform adequately.
Protein construct storage: Bayesian variable selection and prediction with mixtures.
Clyde, M A; Parmigiani, G
1998-07-01
Determining optimal conditions for protein storage while maintaining a high level of protein activity is an important question in pharmaceutical research. A designed experiment based on a space-filling design was conducted to understand the effects of factors affecting protein storage and to establish optimal storage conditions. Different model-selection strategies to identify important factors may lead to very different answers about optimal conditions. Uncertainty about which factors are important, or model uncertainty, can be a critical issue in decision-making. We use Bayesian variable selection methods for linear models to identify important variables in the protein storage data, while accounting for model uncertainty. We also use the Bayesian framework to build predictions based on a large family of models, rather than an individual model, and to evaluate the probability that certain candidate storage conditions are optimal.
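Accounting for model uncertainty, as described above, usually means summarizing over many models rather than picking one. A minimal sketch of one such summary, the posterior inclusion probability of each factor, is given below. The model probabilities and factor names are invented, and computing the posterior model probabilities themselves is outside this sketch.

```python
def posterior_inclusion_probabilities(model_probs):
    """Sum the posterior probability of every model that contains each variable.

    `model_probs` maps a frozenset of included variables to a posterior
    model probability; values are renormalized to sum to 1.
    """
    total = sum(model_probs.values())
    inclusion = {}
    for variables, prob in model_probs.items():
        for v in variables:
            inclusion[v] = inclusion.get(v, 0.0) + prob / total
    return inclusion


# Hypothetical posterior over three candidate models for protein storage.
model_probs = {
    frozenset({"pH"}): 0.5,
    frozenset({"pH", "temperature"}): 0.25,
    frozenset(): 0.25,
}
pip = posterior_inclusion_probabilities(model_probs)
print(pip)
```

Predictions can be averaged in the same way, weighting each model's prediction by its posterior probability, which is the idea behind basing conclusions on a family of models instead of a single one.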
NASA Astrophysics Data System (ADS)
Duan, Fajie; Fu, Xiao; Jiang, Jiajia; Huang, Tingting; Ma, Ling; Zhang, Cong
2018-05-01
In this work, an automatic variable selection method for quantitative analysis of soil samples using laser-induced breakdown spectroscopy (LIBS) is proposed, which is based on full spectrum correction (FSC) and modified iterative predictor weighting-partial least squares (mIPW-PLS). The method features automatic selection without manual intervention. To illustrate the feasibility and effectiveness of the method, a comparison with the genetic algorithm (GA) and the successive projections algorithm (SPA) for the detection of different elements (copper, barium and chromium) in soil was carried out. The experimental results showed that all three methods could accomplish variable selection effectively, among which FSC-mIPW-PLS required significantly shorter computation time (approximately 12 s for 40,000 initial variables) than the others. Moreover, improved quantification models were obtained with the variable selection approaches. The root mean square errors of prediction (RMSEP) of models utilizing the new method were 27.47 (copper), 37.15 (barium) and 39.70 (chromium) mg/kg, which showed prediction performance comparable to that of GA and SPA.
NASA Astrophysics Data System (ADS)
Goudarzi, Nasser
2016-04-01
In this work, two new and powerful chemometrics methods are applied for the modeling and prediction of the 19F chemical shift values of some fluorinated organic compounds. The radial basis function-partial least squares (RBF-PLS) and random forest (RF) methods are employed to construct the models to predict the 19F chemical shifts. In this study, we did not use any separate variable selection method; the RF method can be used as both a variable selection and a modeling technique. The effects of the important parameters affecting the predictive power of RF, such as the number of trees (nt) and the number of randomly selected variables to split each node (m), were investigated. The root-mean-square errors of prediction (RMSEP) for the training set and the prediction set for the RBF-PLS and RF models were 44.70, 23.86, 29.77, and 23.69, respectively. Also, the correlation coefficients of the prediction set for the RBF-PLS and RF models were 0.8684 and 0.9313, respectively. The results obtained reveal that the RF model can be used as a powerful chemometrics tool for quantitative structure-property relationship (QSPR) studies.
Müller, Aline Lima Hermes; Picoloto, Rochele Sogari; de Azevedo Mello, Paola; Ferrão, Marco Flores; de Fátima Pereira dos Santos, Maria; Guimarães, Regina Célia Lourenço; Müller, Edson Irineu; Flores, Erico Marlon Moraes
2012-04-01
Total sulfur concentration was determined in atmospheric residue (AR) and vacuum residue (VR) samples obtained from the petroleum distillation process by Fourier transform infrared spectroscopy with attenuated total reflectance (FT-IR/ATR) in association with chemometric methods. The calibration and prediction sets consisted of 40 and 20 samples, respectively. Calibration models were developed using two variable selection methods: interval partial least squares (iPLS) and synergy interval partial least squares (siPLS). Different treatments and pre-processing steps were also evaluated for the development of models. The pre-treatment based on multiplicative scatter correction (MSC) and mean-centered data were selected for model construction. The use of siPLS as the variable selection method provided a model with root mean square error of prediction (RMSEP) values significantly better than those obtained by the PLS model using all variables. The best model was obtained using the siPLS algorithm with spectra divided into 20 intervals and combinations of 3 intervals (911-824, 823-736 and 737-650 cm(-1)). This model produced an RMSECV of 400 mg kg(-1) S and an RMSEP of 420 mg kg(-1) S, showing a correlation coefficient of 0.990. Copyright © 2011 Elsevier B.V. All rights reserved.
Novel Harmonic Regularization Approach for Variable Selection in Cox's Proportional Hazards Model
Chu, Ge-Jin; Liang, Yong; Wang, Jia-Xuan
2014-01-01
Variable selection is an important issue in regression, and a number of variable selection methods involving nonconvex penalty functions have been proposed. In this paper, we investigate a novel harmonic regularization method, which can approximate nonconvex Lq (1/2 < q < 1) regularizations, to select key risk factors in Cox's proportional hazards model using microarray gene expression data. The harmonic regularization method can be efficiently solved using our proposed direct path seeking approach, which can produce solutions that closely approximate those for the convex loss function and the nonconvex regularization. Simulation results based on artificial datasets and four real microarray gene expression datasets, including the diffuse large B-cell lymphoma (DLBCL), lung cancer, and AML datasets, show that the harmonic regularization method can be more accurate for variable selection than existing Lasso series methods. PMID:25506389
Jensen, Jacob S; Egebo, Max; Meyer, Anne S
2008-05-28
Accomplishment of fast tannin measurements is receiving increased interest as tannins are important for the mouthfeel and color properties of red wines. Fourier transform mid-infrared spectroscopy allows fast measurement of different wine components, but quantification of tannins is difficult due to interferences from spectral responses of other wine components. Four different variable selection tools were investigated for the identification of the most important spectral regions which would allow quantification of tannins from the spectra using partial least-squares regression. The study included the development of a new variable selection tool, iterative backward elimination of changeable size intervals PLS. The spectral regions identified by the different variable selection methods were not identical, but all included two regions (1485-1425 and 1060-995 cm(-1)), which therefore were concluded to be particularly important for tannin quantification. The spectral regions identified from the variable selection methods were used to develop calibration models. All four variable selection methods identified regions that allowed an improved quantitative prediction of tannins (RMSEP = 69-79 mg of CE/L; r = 0.93-0.94) as compared to a calibration model developed using all variables (RMSEP = 115 mg of CE/L; r = 0.87). Only minor differences in the performance of the variable selection methods were observed.
Wright, Marvin N; Dankowski, Theresa; Ziegler, Andreas
2017-04-15
The most popular approach for analyzing survival data is the Cox regression model. The Cox model may, however, be misspecified, and its proportionality assumption may not always be fulfilled. An alternative approach for survival prediction is random forests for survival outcomes. The standard split criterion for random survival forests is the log-rank test statistic, which favors splitting variables with many possible split points. Conditional inference forests avoid this split variable selection bias. However, linear rank statistics are utilized by default in conditional inference forests to select the optimal splitting variable, which cannot detect non-linear effects in the independent variables. An alternative is to use maximally selected rank statistics for the split point selection. As in conditional inference forests, splitting variables are compared on the p-value scale. However, instead of the conditional Monte-Carlo approach used in conditional inference forests, p-value approximations are employed. We describe several p-value approximations and the implementation of the proposed random forest approach. A simulation study demonstrates that unbiased split variable selection is possible. However, there is a trade-off between unbiased split variable selection and runtime. In benchmark studies of prediction performance on simulated and real datasets, the new method performs better than random survival forests if informative dichotomous variables are combined with uninformative variables with more categories and better than conditional inference forests if non-linear covariate effects are included. In a runtime comparison, the method proves to be computationally faster than both alternatives, if a simple p-value approximation is used. Copyright © 2017 John Wiley & Sons, Ltd.
Bootstrap investigation of the stability of a Cox regression model.
Altman, D G; Andersen, P K
1989-07-01
We describe a bootstrap investigation of the stability of a Cox proportional hazards regression model resulting from the analysis of a clinical trial of azathioprine versus placebo in patients with primary biliary cirrhosis. We have considered stability to refer both to the choice of variables included in the model and, more importantly, to the predictive ability of the model. In stepwise Cox regression analyses of 100 bootstrap samples using 17 candidate variables, the most frequently selected variables were those selected in the original analysis, and no other important variable was identified. Thus there was no reason to doubt the model obtained in the original analysis. For each patient in the trial, bootstrap confidence intervals were constructed for the estimated probability of surviving two years. It is shown graphically that these intervals are markedly wider than those obtained from the original model.
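The bootstrap stability check described in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions: the selector below is a hypothetical stand-in (absolute Pearson correlation above a threshold) rather than the stepwise Cox regression used in the study, and the data, function names and threshold are invented for the example.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def select_variables(rows, y, threshold=0.3):
    """Stand-in selector (an assumption, not stepwise Cox): keep column j if |r| > threshold."""
    n_vars = len(rows[0])
    return {j for j in range(n_vars)
            if abs(pearson([r[j] for r in rows], y)) > threshold}


def bootstrap_inclusion_frequency(rows, y, n_boot=100, seed=0):
    """Fraction of bootstrap resamples in which each variable is selected."""
    rng = random.Random(seed)
    n = len(rows)
    counts = [0] * len(rows[0])
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample rows with replacement
        chosen = select_variables([rows[i] for i in idx], [y[i] for i in idx])
        for j in chosen:
            counts[j] += 1
    return [c / n_boot for c in counts]


# Synthetic demo data: only column 0 is related to the outcome.
demo_rng = random.Random(1)
rows = [[demo_rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
y = [2.0 * r[0] + demo_rng.gauss(0, 1) for r in rows]
freq = bootstrap_inclusion_frequency(rows, y)
print(freq)
```

Variables with inclusion frequencies near 1 are selected stably across resamples; frequencies near 0 suggest a variable is rarely, if ever, picked up by the selection rule.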
Ballabio, Davide; Consonni, Viviana; Mauri, Andrea; Todeschini, Roberto
2010-01-11
In multivariate regression and classification problems, variable selection is an important procedure used to select an optimal subset of variables with the aim of producing more parsimonious and possibly more predictive models. Variable selection is often necessary when dealing with methodologies that produce thousands of variables, such as Quantitative Structure-Activity Relationships (QSARs) and highly dimensional analytical procedures. In this paper a novel method for variable selection for classification purposes is introduced. This method exploits the recently proposed Canonical Measure of Correlation between two sets of variables (CMC index). The CMC index is in this case calculated for two specific sets of variables, the former being comprised of the independent variables and the latter of the unfolded class matrix. The CMC values, calculated by considering one variable at a time, can be sorted to obtain a ranking of the variables on the basis of their class discrimination capabilities. Alternatively, the CMC index can be calculated for all possible combinations of variables and the variable subset with the maximal CMC can be selected, but this procedure is computationally more demanding and the classification performance of the selected subset is not always the best one. The effectiveness of the CMC index in selecting variables with discriminative ability was compared with that of other well-known strategies for variable selection, such as Wilks' Lambda, the VIP index based on Partial Least Squares-Discriminant Analysis, and the selection provided by classification trees. A variable Forward Selection based on the CMC index was finally used in conjunction with Linear Discriminant Analysis. This approach was tested on several chemical data sets. The results obtained were encouraging.
Evaluation of variable selection methods for random forests and omics data sets.
Degenhardt, Frauke; Seifert, Stephan; Szymczak, Silke
2017-10-16
Machine learning methods and in particular random forests are promising approaches for prediction based on high dimensional omics data sets. They provide variable importance measures to rank predictors according to their predictive power. If building a prediction model is the main goal of a study, often a minimal set of variables with good prediction performance is selected. However, if the objective is the identification of involved variables to find active networks and pathways, approaches that aim to select all relevant variables should be preferred. We evaluated several variable selection procedures based on simulated data as well as publicly available experimental methylation and gene expression data. Our comparison included the Boruta algorithm, the Vita method, recurrent relative variable importance, a permutation approach and its parametric variant (Altmann) as well as recursive feature elimination (RFE). In our simulation studies, Boruta was the most powerful approach, followed closely by the Vita method. Both approaches demonstrated similar stability in variable selection, while Vita was the most robust approach under a pure null model without any predictor variables related to the outcome. In the analysis of the different experimental data sets, Vita demonstrated slightly better stability in variable selection and was less computationally intensive than Boruta. In conclusion, we recommend the Boruta and Vita approaches for the analysis of high-dimensional data sets. Vita is considerably faster than Boruta and thus more suitable for large data sets, but only Boruta can also be applied in low-dimensional settings. © The Author 2017. Published by Oxford University Press.
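The idea behind comparing real variables with permuted "shadow" copies, which underlies Boruta-style selection, can be illustrated with a deliberately simplified sketch. The assumptions are significant: absolute Pearson correlation replaces random forest importance, the real scores are therefore computed once rather than per iteration, and all names and data are hypothetical. It is not the Boruta algorithm itself.

```python
import math
import random


def abs_corr(x, y):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return abs(sxy / math.sqrt(sxx * syy))


def shadow_hit_rates(columns, y, n_iter=50, seed=0):
    """Share of iterations in which each real variable beats the best shadow variable."""
    rng = random.Random(seed)
    real_scores = [abs_corr(c, y) for c in columns]
    hits = [0] * len(columns)
    for _ in range(n_iter):
        shadow_max = 0.0
        for c in columns:
            shadow = list(c)
            rng.shuffle(shadow)  # permuting breaks any link to the outcome
            shadow_max = max(shadow_max, abs_corr(shadow, y))
        for j, score in enumerate(real_scores):
            if score > shadow_max:
                hits[j] += 1
    return [h / n_iter for h in hits]


# Synthetic demo data: column 0 is informative, columns 1 and 2 are noise.
demo_rng = random.Random(2)
columns = [[demo_rng.gauss(0, 1) for _ in range(200)] for _ in range(3)]
y = [2.0 * a + demo_rng.gauss(0, 1) for a in columns[0]]
hit_rates = shadow_hit_rates(columns, y)
print(hit_rates)
```

A variable that consistently outscores every shadow is a candidate for being relevant; the actual algorithm adds a statistical test on the hit counts and recomputes forest importances in every run.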
Model building strategy for logistic regression: purposeful selection.
Zhang, Zhongheng
2016-03-01
Logistic regression is one of the most commonly used models to account for confounders in the medical literature. The article introduces how to perform the purposeful selection model building strategy with R. I stress the use of the likelihood ratio test to see whether deleting a variable will have a significant impact on model fit. A deleted variable should also be checked for whether it is an important adjustment of the remaining covariates. Interactions should be checked to disentangle complex relationships between covariates and their synergistic effect on the response variable. The model should be checked for goodness of fit (GOF), in other words, how well the fitted model reflects the real data. The Hosmer-Lemeshow GOF test is the most widely used for logistic regression models.
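The likelihood ratio test mentioned here compares the log-likelihoods of two nested models. As a minimal sketch (in Python rather than the R used in the article, with made-up log-likelihood values and a hypothetical function name), the case of dropping a single variable, i.e. one degree of freedom, can be computed with the standard library alone:

```python
import math


def likelihood_ratio_test_1df(loglik_reduced, loglik_full):
    """LRT for nested models differing by one parameter.

    Returns the test statistic 2 * (ll_full - ll_reduced) and its p-value
    from the chi-squared distribution with 1 degree of freedom, whose
    survival function is erfc(sqrt(x / 2)).
    """
    stat = max(2.0 * (loglik_full - loglik_reduced), 0.0)
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Illustrative (invented) log-likelihoods of a full and a reduced model.
stat, p = likelihood_ratio_test_1df(loglik_reduced=-101.92, loglik_full=-100.0)
print(stat, p)
```

A small p-value indicates that removing the variable significantly worsens the fit, which argues for keeping it in the model.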
NASA Astrophysics Data System (ADS)
Rounaghi, Mohammad Mahdi; Abbaszadeh, Mohammad Reza; Arashi, Mohammad
2015-11-01
One of the most important topics of interest to investors is stock price changes. Investors whose goals are long term are sensitive to the stock price and its changes and react to them. In this regard, we used the multivariate adaptive regression splines (MARS) model and the semi-parametric splines technique for predicting stock price in this study. The MARS model, as a nonparametric method, is an adaptive method for regression and is suited to problems with high dimensions and several variables. The semi-parametric splines technique was also used in this study; smoothing splines are a nonparametric regression method. We used 40 variables (30 accounting variables and 10 economic variables) for predicting stock price using the MARS model and the semi-parametric splines technique. After investigating the models, we selected 4 accounting variables (book value per share, predicted earnings per share, P/E ratio and risk) as variables influencing the prediction of stock price using the MARS model. After fitting the semi-parametric splines technique, only 4 accounting variables (dividends, net EPS, EPS forecast and P/E ratio) were selected as variables effective in forecasting stock prices.
NASA Astrophysics Data System (ADS)
Lü, Chengxu; Jiang, Xunpeng; Zhou, Xingfan; Zhang, Yinqiao; Zhang, Naiqian; Wei, Chongfeng; Mao, Wenhua
2017-10-01
Wet gluten is a useful quality indicator for wheat, and short wave near infrared spectroscopy (NIRS) is a high performance technique with the advantages of economical, rapid and nondestructive testing. To study the feasibility of using short wave NIRS to analyze wet gluten directly from wheat seed, 54 representative wheat seed samples were collected and scanned by a spectrometer. Eight spectral pretreatment methods and the genetic algorithm (GA) variable selection method were used to optimize the analysis. Both quantitative and qualitative models of wet gluten were built by partial least squares regression and discriminant analysis. For quantitative analysis, normalization is the optimal pretreatment method, 17 wet gluten sensitive variables are selected by GA, and the GA model performs better than the all-variable model, with R2V=0.88 and RMSEV=1.47. For qualitative analysis, automatic weighted least squares baseline is the optimal pretreatment method, and the all-variable models perform better than the GA models. The correct classification rates for the three classes of <24%, 24-30% and >30% wet gluten content are 95.45, 84.52, and 90.00%, respectively. The short wave NIRS technique shows potential for both quantitative and qualitative analysis of wet gluten in wheat seed.
Error propagation of partial least squares for parameters optimization in NIR modeling.
Du, Chenzhao; Dai, Shengyun; Qiao, Yanjiang; Wu, Zhisheng
2018-03-05
A novel methodology is proposed to determine the error propagation of partial least squares (PLS) for parameter optimization in near-infrared (NIR) modeling. The parameters include spectral pretreatment, latent variables and variable selection. In this paper, an open source dataset (corn) and a complicated dataset (Gardenia) were used to establish PLS models under different modeling parameters. The error propagation of the modeling parameters for water quantity in corn and geniposide quantity in Gardenia was presented in terms of both type I and type II error. For example, when the variable importance in the projection (VIP), interval partial least squares (iPLS) and backward interval partial least squares (BiPLS) variable selection algorithms were used for geniposide in Gardenia, compared with synergy interval partial least squares (SiPLS), the error weight varied from 5% to 65%, 55% and 15%. The results demonstrated how and to what extent the different modeling parameters affect the error propagation of PLS for parameter optimization in NIR modeling. The larger the error weight, the worse the model. Finally, our trials established a powerful process for developing robust PLS models for corn and Gardenia under the optimal modeling parameters. Furthermore, it could provide significant guidance for the selection of modeling parameters of other multivariate calibration models. Copyright © 2017. Published by Elsevier B.V.
Modelling the co-evolution of indirect genetic effects and inherited variability.
Marjanovic, Jovana; Mulder, Han A; Rönnegård, Lars; Bijma, Piter
2018-03-28
When individuals interact, their phenotypes may be affected not only by their own genes but also by genes in their social partners. This phenomenon is known as Indirect Genetic Effects (IGEs). In aquaculture species and some plants, however, competition not only affects trait levels of individuals, but also inflates variability of trait values among individuals. In the field of quantitative genetics, the variability of trait values has been studied as a quantitative trait in itself, and is often referred to as inherited variability. Such studies, however, consider only the genetic effect of the focal individual on trait variability and do not make a connection to competition. Although the observed phenotypic relationship between competition and variability suggests an underlying genetic relationship, the current quantitative genetic models of IGE and inherited variability do not allow for such a relationship. The lack of quantitative genetic models that connect IGEs to inherited variability limits our understanding of the potential of variability to respond to selection, both in nature and agriculture. Models of trait levels, for example, show that IGEs may considerably change heritable variation in trait values. Currently, we lack the tools to investigate whether this result extends to variability of trait values. Here we present a model that integrates IGEs and inherited variability. In this model, the target phenotype, say growth rate, is a function of the genetic and environmental effects of the focal individual and of the difference in trait value between the social partner and the focal individual, multiplied by a regression coefficient. The regression coefficient is a genetic trait, which is a measure of cooperation; a negative value indicates competition, a positive value cooperation, and an increasing value due to selection indicates the evolution of cooperation. 
In contrast to the existing quantitative genetic models, our model allows for co-evolution of IGEs and variability, as the regression coefficient can respond to selection. Our simulations show that the model results in increased variability of body weight with increasing competition. When competition decreases, i.e., cooperation evolves, variability becomes significantly smaller. Hence, our model facilitates quantitative genetic studies on the relationship between IGEs and inherited variability. Moreover, our findings suggest that we may have been overlooking an entire level of genetic variation in variability, the one due to IGEs.
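The verbal description of the model can be condensed into a single expression. The following is only one possible reading of the abstract: the symbols are notation assumed here for illustration, and the parameterization used in the paper itself may differ.

```latex
P_i = A_i + E_i + b_i \,\bigl(z_j - z_i\bigr)
```

Here $P_i$ is the target phenotype (say, growth rate) of focal individual $i$, $A_i$ and $E_i$ are its genetic and environmental effects, $z_j - z_i$ is the difference in trait value between social partner $j$ and the focal individual, and $b_i$ is the heritable regression coefficient. Under this reading, $b_i < 0$ corresponds to competition and $b_i > 0$ to cooperation, and selection acting on $b_i$ is what allows IGEs and variability to co-evolve.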
Efficient Variable Selection Method for Exposure Variables on Binary Data
NASA Astrophysics Data System (ADS)
Ohno, Manabu; Tarumi, Tomoyuki
In this paper, we propose a new variable selection method for "robust" exposure variables. We define "robust" as the property that the same variables are selected from the original data and from perturbed data. Few studies have addressed effective methods for this kind of selection. The problem of selecting exposure variables is almost the same as the problem of extracting correlation rules without robustness. [Brin 97] suggested that correlation rules can be extracted efficiently on binary data using the chi-squared statistic of a contingency table, treated as having the monotone property. However, the chi-squared value does not have the monotone property, so as the dimension increases a variable set is easily judged to be not independent even though it is completely independent, and the method is not usable in variable selection for robust exposure variables. We assume the anti-monotone property for independent variables in order to select robust independent variables, and use the apriori algorithm to do so. The apriori algorithm is one of the algorithms that find association rules in market basket data. It uses the anti-monotone property of the support, which is defined for association rules. Independence does not strictly have the anti-monotone property with respect to the AIC of the independence probability model, but the tendency toward the anti-monotone property is strong. Therefore, variables selected using the anti-monotone property on the AIC are robust. Our method judges whether a given variable is an exposure variable for the independent variable using the preceding comparison of the AIC. Our numerical experiments show that our method can select robust exposure variables efficiently and precisely.
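The anti-monotone pruning that the apriori algorithm relies on can be shown with a small sketch. Note the assumption: this is the classic support-based version on market basket data, not the AIC-based criterion proposed in the paper, and the transactions and function names are invented for the example.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def apriori(transactions, min_support):
    """Return {frequent itemset: support}, growing itemsets one item at a time."""
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items
               if support(frozenset([i]), transactions) >= min_support]
    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s, transactions)
        current_set = set(current)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # Prune step (anti-monotone property): a candidate can only be frequent
        # if all of its k-item subsets are frequent.
        candidates = [c for c in candidates
                      if all(frozenset(sub) in current_set
                             for sub in combinations(c, k))]
        current = [c for c in candidates if support(c, transactions) >= min_support]
        k += 1
    return frequent


# Invented market basket data.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
frequent = apriori(transactions, min_support=0.5)
print(sorted((sorted(s), v) for s, v in frequent.items()))
```

Because support can only decrease when an item is added, any superset of an infrequent itemset can be discarded without counting it, which is what keeps the search tractable.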
NASA Astrophysics Data System (ADS)
De Lucia, Frank C., Jr.; Gottfried, Jennifer L.
2011-02-01
Using a series of thirteen organic materials that includes novel high-nitrogen energetic materials, conventional organic military explosives, and benign organic materials, we have demonstrated the importance of variable selection for maximizing residue discrimination with partial least squares discriminant analysis (PLS-DA). We built several PLS-DA models using different variable sets based on laser induced breakdown spectroscopy (LIBS) spectra of the organic residues on an aluminum substrate under an argon atmosphere. The model classification results for each sample are presented and the influence of the variables on these results is discussed. We found that using the whole spectra as the data input for the PLS-DA model gave the best results. However, variables due to the surrounding atmosphere and the substrate contribute to discrimination when the whole spectra are used, indicating this may not be the most robust model. Further iterative testing with additional validation data sets is necessary to determine the most robust model.
Maintenance of Genetic Variability under Strong Stabilizing Selection: A Two-Locus Model
Gavrilets, S.; Hastings, A.
1993-01-01
We study a two locus model with additive contributions to the phenotype to explore the relationship between stabilizing selection and recombination. We show that if the double heterozygote has the optimum phenotype and the contributions of the loci to the trait are different, then any symmetric stabilizing selection fitness function can maintain genetic variability provided selection is sufficiently strong relative to linkage. We present results of a detailed analysis of the quadratic fitness function which show that selection need not be extremely strong relative to recombination for the polymorphic equilibria to be stable. At these polymorphic equilibria the mean value of the trait, in general, is not equal to the optimum phenotype, there exists a large level of negative linkage disequilibrium which "hides" additive genetic variance, and different equilibria can be stable simultaneously. We analyze dependence of different characteristics of these equilibria on the location of optimum phenotype, on the difference in allelic effect, and on the strength of selection relative to recombination. Our overall result that stabilizing selection does not necessarily eliminate genetic variability is compatible with some experimental results where the lines subject to strong stabilizing selection did not have significant reductions in genetic variability. PMID:8514145
Yao, Zheng-Yang; Liu, Jian-Jun
2014-01-01
Four common greening shrub species (i.e., Ligustrum quihoui, Buxus bodinieri, Berberis xinganensis and Buxus megistophylla) in Xi'an City were selected to develop the highest-correlation and best-fit estimation models for the organ (branch, leaf and root) and total biomass against different independent variables. The results indicated that the optimal organ and total biomass models of the four shrubs were power function models (CAR model), except for the leaf biomass model of B. megistophylla, which was a logarithmic function model (VAR model). The independent variables included basal diameter, crown diameter, crown diameter multiplied by height, canopy area and canopy volume. B. megistophylla significantly differed from the other three shrub species in the independent variables selected, which were basal diameter and crown-related factors, respectively.
Graves, Tabitha A.; Royle, J. Andrew; Kendall, Katherine C.; Beier, Paul; Stetz, Jeffrey B.; Macleod, Amy C.
2012-01-01
Using multiple detection methods can increase the number, kind, and distribution of individuals sampled, which may increase accuracy and precision and reduce cost of population abundance estimates. However, when variables influencing abundance are of interest, if individuals detected via different methods are influenced by the landscape differently, separate analysis of multiple detection methods may be more appropriate. We evaluated the effects of combining two detection methods on the identification of variables important to local abundance using detections of grizzly bears with hair traps (systematic) and bear rubs (opportunistic). We used hierarchical abundance models (N-mixture models) with separate model components for each detection method. If both methods sample the same population, the use of either data set alone should (1) lead to the selection of the same variables as important and (2) provide similar estimates of relative local abundance. We hypothesized that the inclusion of 2 detection methods versus either method alone should (3) yield more support for variables identified in single method analyses (i.e. fewer variables and models with greater weight), and (4) improve precision of covariate estimates for variables selected in both separate and combined analyses because sample size is larger. As expected, joint analysis of both methods increased precision as well as certainty in variable and model selection. However, the single-method analyses identified different variables and the resulting predicted abundances had different spatial distributions. We recommend comparing single-method and jointly modeled results to identify the presence of individual heterogeneity between detection methods in N-mixture models, along with consideration of detection probabilities, correlations among variables, and tolerance to risk of failing to identify variables important to a subset of the population. 
The benefits of increased precision should be weighed against those risks. The analysis framework presented here will be useful for other species exhibiting heterogeneity by detection method.
Fernandes, David Douglas Sousa; Gomes, Adriano A; Costa, Gean Bezerra da; Silva, Gildo William B da; Véras, Germano
2011-12-15
This work evaluates the use of the visible and near-infrared (NIR) ranges, separately and combined, to determine the biodiesel content in biodiesel/diesel blends using Multiple Linear Regression (MLR) with variable selection by the Successive Projections Algorithm (SPA). Full-spectrum models employing Partial Least Squares (PLS), models with variable selection by Stepwise (SW) regression coupled with MLR, and PLS models with variable selection by Jack-Knife (Jk) were compared with the proposed methodology. Several preprocessing methods were evaluated; the Savitzky-Golay derivative with a second-order polynomial and a 17-point window was chosen for the NIR and visible-NIR ranges, with offset correction. A total of 100 blends with biodiesel content between 5 and 50% (v/v) were prepared from ten biodiesel samples. In the NIR and visible regions the best model was SPA-MLR, using only two and eight wavelengths with RMSEP of 0.6439% (v/v) and 0.5741, respectively, while in the visible-NIR region the best model was SW-MLR, using five wavelengths with an RMSEP of 0.9533% (v/v). The results indicate that both spectral ranges evaluated showed potential for developing a rapid and nondestructive method to quantify biodiesel in blends with mineral diesel. Finally, the improvement in prediction error obtained with the variable selection procedure was significant. Copyright © 2011 Elsevier B.V. All rights reserved.
Lesmerises, Rémi; St-Laurent, Martin-Hugues
2017-11-01
Habitat selection studies conducted at the population scale commonly aim to describe general patterns that could improve our understanding of the limiting factors in species-habitat relationships. Researchers often consider interindividual variation in selection patterns to control for its effects and avoid pseudoreplication by using mixed-effect models that include individuals as random factors. Here, we highlight common pitfalls and possible misinterpretations of this strategy by describing habitat selection of 21 black bears Ursus americanus. We used Bayesian mixed-effect models and compared results obtained when using random intercept (i.e., population level) versus calculating individual coefficients for each independent variable (i.e., individual level). We then related interindividual variability to individual characteristics (i.e., age, sex, reproductive status, body condition) in a multivariate analysis. The assumption of comparable behavior among individuals was verified only in 40% of the cases in our seasonal best models. Indeed, we found strong and opposite responses among sampled bears and individual coefficients were linked to individual characteristics. For some covariates, contrasted responses canceled each other out at the population level. In other cases, interindividual variability was concealed by the composition of our sample, with the majority of the bears (e.g., old individuals and bears in good physical condition) driving the population response (e.g., selection of young forest cuts). Our results stress the need to consider interindividual variability to avoid misinterpretation and uninformative results, especially for a flexible and opportunistic species. This study helps to identify some ecological drivers of interindividual variability in bear habitat selection patterns.
Empirical Modelation of Runoff in Small Watersheds Using LIDAR Data
NASA Astrophysics Data System (ADS)
Lopatin, J.; Hernández, J.; Galleguillos, M.; Mancilla, G.
2013-12-01
Hydrological models allow the simulation of natural water processes and the quantification and prediction of the effects of human impacts on runoff behavior. However, obtaining the information needed to apply these models can be costly in both time and resources, especially in large and difficult-to-access areas. The objective of this research was to integrate LiDAR data into the hydrological modeling of runoff in small watersheds, using derived hydrologic, vegetation and topography variables. The study area includes 10 small forest-covered headwater watersheds, between 2 and 16 ha, located in the south-central coastal range of Chile. In each of these watersheds, instantaneous rainfall and runoff flow were measured for a total of 15 rainfall events between August 2012 and July 2013, yielding a total of 79 observations. In March 2011 a Harrier 54/G4 Dual System was used to obtain a discrete-pulse LiDAR point cloud with an average of 4.64 points per square meter. A Digital Terrain Model (DTM) of 1 meter resolution was obtained from the point cloud, and subsequently 55 topographic variables were derived, such as physical watershed parameters and morphometric features. At the same time, 30 vegetation descriptive variables were obtained directly from the point cloud and from a Digital Canopy Model (DCM). The classification and regression "Random Forest" (RF) algorithm was used to select the most important variables in predicting water height (liters), and the "Partial Least Squares Path Modeling" (PLS-PM) algorithm was used to fit a model using the selected set of variables. Four latent variables were selected (outer model), related to climate, topography, vegetation and runoff, and each was assigned a group of the predictor variables selected by RF (inner model). The coefficient of determination (R2) and Goodness-of-Fit (GoF) of the final model were obtained.
The best results were found when modeling using only the upper 50th percentile of rainfall events. The best variables selected by the RF algorithm were three topographic variables and three vegetation-related ones. We obtained an R2 of 0.82 and a GoF of 0.87 with a 95% confidence interval. This study shows that it is possible to predict the water harvested during a rainstorm event in a forest environment using only LiDAR data. However, this type of methodology does not give good results for flow produced by low-magnitude rainfall events, as these are more influenced by the initial conditions of soil, vegetation and climate, which make their behavior slower and more erratic.
CORRELATION PURSUIT: FORWARD STEPWISE VARIABLE SELECTION FOR INDEX MODELS
Zhong, Wenxuan; Zhang, Tingting; Zhu, Yu; Liu, Jun S.
2012-01-01
In this article, a stepwise procedure, correlation pursuit (COP), is developed for variable selection under the sufficient dimension reduction framework, in which the response variable Y is influenced by the predictors X1, X2, …, Xp through an unknown function of a few linear combinations of them. Unlike linear stepwise regression, COP does not impose a special form of relationship (such as linear) between the response variable and the predictor variables. The COP procedure selects variables that attain the maximum correlation between the transformed response and the linear combination of the variables. Various asymptotic properties of the COP procedure are established, and in particular, its variable selection performance under a diverging number of predictors and sample size has been investigated. The excellent empirical performance of the COP procedure in comparison with existing methods is demonstrated by both extensive simulation studies and a real example in functional genomics. PMID:23243388
Ribic, C.A.; Miller, T.W.
1998-01-01
We investigated CART performance with a unimodal response curve for one continuous response and four continuous explanatory variables, where two variables were important (i.e., directly related to the response) and the other two were not. We explored performance under three relationship strengths and two explanatory variable conditions: equal importance and one variable four times as important as the other. We compared CART variable selection performance using three tree-selection rules ('minimum risk', 'minimum risk complexity', 'one standard error') to stepwise polynomial ordinary least squares (OLS) under four sample size conditions. The one-standard-error and minimum-risk-complexity methods performed about as well as stepwise OLS with large sample sizes when the relationship was strong. With weaker relationships, equally important explanatory variables and larger sample sizes, the one-standard-error and minimum-risk-complexity rules performed better than stepwise OLS. With weaker relationships and explanatory variables of unequal importance, tree-structured methods did not perform as well as stepwise OLS. Comparing performance within tree-structured methods, with a strong relationship and equally important explanatory variables, the one-standard-error rule was more likely to choose the correct model than were the other tree-selection rules 1) with weaker relationships and equally important explanatory variables; and 2) under all relationship strengths when explanatory variables were of unequal importance and sample sizes were lower.
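The difference between the 'minimum risk' and 'one standard error' tree-selection rules named above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Candidate` record and the numbers in `candidates` are made up, and the 'minimum risk complexity' rule is not sketched because the abstract does not define it.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """One pruned subtree from a cost-complexity sequence (hypothetical record)."""
    n_leaves: int    # tree size, used as the complexity measure
    cv_risk: float   # cross-validated risk estimate
    cv_se: float     # standard error of that estimate


def min_risk(cands: List[Candidate]) -> Candidate:
    """'Minimum risk' rule: pick the subtree with the lowest CV risk."""
    return min(cands, key=lambda c: (c.cv_risk, c.n_leaves))


def one_standard_error(cands: List[Candidate]) -> Candidate:
    """'One standard error' rule: pick the smallest subtree whose CV risk
    is within one standard error of the minimum CV risk."""
    best = min_risk(cands)
    threshold = best.cv_risk + best.cv_se
    eligible = [c for c in cands if c.cv_risk <= threshold]
    return min(eligible, key=lambda c: (c.n_leaves, c.cv_risk))


# Illustrative values only.
candidates = [
    Candidate(n_leaves=2, cv_risk=0.40, cv_se=0.03),
    Candidate(n_leaves=4, cv_risk=0.31, cv_se=0.03),
    Candidate(n_leaves=7, cv_risk=0.29, cv_se=0.03),
    Candidate(n_leaves=12, cv_risk=0.30, cv_se=0.03),
]

print(min_risk(candidates).n_leaves)            # 7
print(one_standard_error(candidates).n_leaves)  # 4
```

The one-standard-error rule trades a small, statistically indistinguishable increase in estimated risk for a simpler tree, which is why it tends to retain fewer explanatory variables.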
Origin and Function of Tuning Diversity in Macaque Visual Cortex
Goris, Robbe L.T.; Simoncelli, Eero P.; Movshon, J. Anthony
2016-01-01
SUMMARY Neurons in visual cortex vary in their orientation selectivity. We measured responses of V1 and V2 cells to orientation mixtures and fit them with a model whose stimulus selectivity arises from the combined effects of filtering, suppression, and response nonlinearity. The model explains the diversity of orientation selectivity with neuron-to-neuron variability in all three mechanisms, of which variability in the orientation bandwidth of linear filtering is the most important. The model also accounts for the cells’ diversity of spatial frequency selectivity. Tuning diversity is matched to the needs of visual encoding. The orientation content found in natural scenes is diverse, and neurons with different selectivities are adapted to different stimulus configurations. Single orientations are better encoded by highly selective neurons, while orientation mixtures are better encoded by less selective neurons. A diverse population of neurons therefore provides better overall discrimination capabilities for natural images than any homogeneous population. PMID:26549331
Development of an automated energy audit protocol for office buildings
NASA Astrophysics Data System (ADS)
Deb, Chirag
This study aims to enhance the building energy audit process and reduce the time and cost required to conduct a full physical audit. For this, a total of 5 Energy Service Companies in Singapore have collaborated and provided energy audit reports for 62 office buildings. Several statistical techniques are adopted to analyse these reports. These techniques comprise cluster analysis and development of prediction models to predict energy savings for buildings. The cluster analysis shows that there are 3 clusters of buildings experiencing different levels of energy savings. To understand the effect of building variables on the change in EUI, a robust iterative process for selecting the appropriate variables is developed. The results show that the 4 variables of GFA, non-air-conditioning energy consumption, average chiller plant efficiency and installed capacity of chillers should be taken for clustering. This analysis is extended to the development of prediction models using linear regression and artificial neural networks (ANN). An exhaustive variable selection algorithm is developed to select the input variables for the two energy saving prediction models. The results show that the ANN prediction model can predict the energy saving potential of a given building with an accuracy of +/-14.8%.
Preliminary results of spatial modeling of selected forest health variables in Georgia
Brock Stewart; Chris J. Cieszewski
2009-01-01
Variables relating to forest health monitoring, such as mortality, are difficult to predict and model. We present here the results of fitting various spatial regression models to these variables. We interpolate plot-level values compiled from the Forest Inventory and Analysis National Information Management System (FIA-NIMS) data that are related to forest health....
Clustering Words to Match Conditions: An Algorithm for Stimuli Selection in Factorial Designs
ERIC Educational Resources Information Center
Guasch, Marc; Haro, Juan; Boada, Roger
2017-01-01
With the increasing refinement of language processing models and the new discoveries about which variables can modulate these processes, stimuli selection for experiments with a factorial design is becoming a tough task. Selecting sets of words that differ in one variable, while matching these same words into dozens of other confounding variables…
Can Geostatistical Models Represent Nature's Variability? An Analysis Using Flume Experiments
NASA Astrophysics Data System (ADS)
Scheidt, C.; Fernandes, A. M.; Paola, C.; Caers, J.
2015-12-01
The lack of understanding of the Earth's geological and physical processes governing sediment deposition renders subsurface modeling subject to large uncertainty. Geostatistics is often used to model uncertainty because of its capability to stochastically generate spatially varying realizations of the subsurface. These methods can generate a range of realizations of a given pattern - but how representative are these of the full natural variability? And how can we identify the minimum set of images that represent this natural variability? Here we use this minimum set to define the geostatistical prior model: a set of training images that represent the range of patterns generated by autogenic variability in the sedimentary environment under study. The proper definition of the prior model is essential in capturing the variability of the depositional patterns. This work starts with a set of overhead images from an experimental basin that showed ongoing autogenic variability. We use the images to analyze the essential characteristics of this suite of patterns. In particular, our goal is to define a prior model (a minimal set of selected training images) such that geostatistical algorithms, when applied to this set, can reproduce the full measured variability. A necessary prerequisite is to define a measure of variability. In this study, we measure variability using a dissimilarity distance between the images. The distance indicates whether two snapshots contain similar depositional patterns. To reproduce the variability in the images, we apply an MPS algorithm to the set of selected snapshots of the sedimentary basin that serve as training images. The training images are chosen from among the initial set by using the distance measure to ensure that only dissimilar images are chosen. Preliminary investigations show that MPS can reproduce fairly accurately the natural variability of the experimental depositional system.
Furthermore, the selected training images provide process information. They fall into three basic patterns: a channelized end member, a sheet flow end member, and one intermediate case. These represent the continuum between autogenic bypass or erosion, and net deposition.
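One way to picture the selection of mutually dissimilar training images from a pairwise distance matrix is a greedy max-min (farthest-point) rule. The abstract does not say which selection rule was used, so the rule below, the function name `select_dissimilar`, and the toy 1-D "images" are all assumptions made for illustration.

```python
from typing import List, Sequence


def select_dissimilar(dist: Sequence[Sequence[float]], k: int) -> List[int]:
    """Greedily pick k items so that each new pick is as far as possible
    from the items already chosen (max-min rule; an assumed stand-in for
    the selection step, which the abstract does not specify)."""
    n = len(dist)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of images")
    # Seed with the item that has the largest total distance to all others.
    selected = [max(range(n), key=lambda i: sum(dist[i]))]
    while len(selected) < k:
        remaining = [i for i in range(n) if i not in selected]
        # Score each candidate by its distance to the nearest selected item.
        nxt = max(remaining, key=lambda i: min(dist[i][j] for j in selected))
        selected.append(nxt)
    return selected


# Toy example: each "image" is reduced to a single number, and the
# dissimilarity distance is the absolute difference between numbers.
positions = [0, 1, 5, 9, 10]
dist = [[abs(a - b) for b in positions] for a in positions]

print(select_dissimilar(dist, 3))  # [0, 4, 2]
```

In the toy data the rule returns the two extremes and the midpoint, which loosely mirrors the two end members and one intermediate case described in the abstract.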
Variable Selection for Nonparametric Quantile Regression via Smoothing Spline ANOVA
Lin, Chen-Yen; Bondell, Howard; Zhang, Hao Helen; Zou, Hui
2014-01-01
Quantile regression provides a more thorough view of the effect of covariates on a response. Nonparametric quantile regression has become a viable alternative to avoid restrictive parametric assumptions. The problem of variable selection for quantile regression is challenging, since important variables can influence various quantiles in different ways. We tackle the problem via regularization in the context of smoothing spline ANOVA models. The proposed sparse nonparametric quantile regression (SNQR) can identify important variables and provide flexible estimates for quantiles. Our numerical study suggests the promising performance of the new procedure in variable selection and function estimation. Supplementary materials for this article are available online. PMID:24554792
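For readers unfamiliar with quantile regression, the short sketch below shows the standard check (pinball) loss that quantile estimators minimize. It does not implement SNQR or any smoothing spline ANOVA penalty; the function names and the sample values are illustrative assumptions.

```python
from typing import Sequence


def check_loss(residual: float, tau: float) -> float:
    """Standard check (pinball) loss: residual * (tau - 1[residual < 0])."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def empirical_quantile(values: Sequence[float], tau: float) -> float:
    """Return the observed value that minimizes the total check loss.
    Searching only the observed values is enough for this illustration."""
    return min(
        sorted(values),
        key=lambda q: sum(check_loss(y - q, tau) for y in values),
    )


values = [1, 2, 3, 4, 100]
print(empirical_quantile(values, 0.5))  # 3   (the median ignores the outlier)
print(empirical_quantile(values, 0.9))  # 100 (the upper quantile does not)
```

Because different values of `tau` weight positive and negative residuals differently, a covariate can matter for one quantile and not for another, which is the difficulty for variable selection that the abstract points to.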
Identification of Coffee Varieties Using Laser-Induced Breakdown Spectroscopy and Chemometrics.
Zhang, Chu; Shen, Tingting; Liu, Fei; He, Yong
2017-12-31
We linked coffee quality to its different varieties. This is of interest because the identification of coffee varieties should help coffee trading and consumption. Laser-induced breakdown spectroscopy (LIBS) combined with chemometric methods was used to identify coffee varieties. Wavelet transform (WT) was used to reduce LIBS spectra noise. Partial least squares-discriminant analysis (PLS-DA), radial basis function neural network (RBFNN), and support vector machine (SVM) were used to build classification models. Loadings of principal component analysis (PCA) were used to select the spectral variables contributing most to the identification of coffee varieties. Twenty wavelength variables corresponding to C I, Mg I, Mg II, Al II, CN, H, Ca II, Fe I, K I, Na I, N I, and O I were selected. PLS-DA, RBFNN, and SVM models on selected wavelength variables showed acceptable results. SVM and RBFNN models performed better with a classification accuracy of over 80% in the prediction set, for both full spectra and the selected variables. The overall results indicated that it was feasible to use LIBS and chemometric methods to identify coffee varieties. For further studies, more samples are needed to produce robust classification models, research should be conducted on which methods to use to select spectral peaks that correspond to the elements contributing most to identification, and the methods for acquiring stable spectra should also be studied.
LQTA-QSAR: a new 4D-QSAR methodology.
Martins, João Paulo A; Barbosa, Euzébio G; Pasqualoto, Kerly F M; Ferreira, Márcia M C
2009-06-01
A novel 4D-QSAR approach which makes use of the molecular dynamics (MD) trajectories and topology information retrieved from the GROMACS package is presented in this study. This new methodology, named LQTA-QSAR (LQTA, Laboratório de Quimiometria Teórica e Aplicada), has a module (LQTAgrid) that calculates intermolecular interaction energies at each grid point considering probes and all aligned conformations resulting from MD simulations. These interaction energies are the independent variables or descriptors employed in a QSAR analysis. The comparison of the proposed methodology to other 4D-QSAR and CoMFA formalisms was performed using a set of forty-seven glycogen phosphorylase b inhibitors (data set 1) and a set of forty-four MAP p38 kinase inhibitors (data set 2). The QSAR models for both data sets were built using the ordered predictor selection (OPS) algorithm for variable selection. Model validation was carried out applying y-randomization and leave-N-out cross-validation in addition to the external validation. PLS models for data set 1 and 2 provided the following statistics: q(2) = 0.72, r(2) = 0.81 for 12 variables selected and 2 latent variables and q(2) = 0.82, r(2) = 0.90 for 10 variables selected and 5 latent variables, respectively. Visualization of the descriptors in 3D space was successfully interpreted from the chemical point of view, supporting the applicability of this new approach in rational drug design.
NASA Astrophysics Data System (ADS)
Collins, Curtis Andrew
Ordinary and weighted least squares multiple linear regression techniques were used to derive 720 models predicting Katrina-induced storm damage in cubic foot volume (outside bark) and green weight tons (outside bark). The large number of models was dictated by the use of three damage classes, three product types, and four forest type model strata. These 36 models were then fit and reported across 10 variable sets and variable set combinations for volume and ton units. Along with large model counts, potential independent variables were created using power transforms and interactions. The basis of these variables was field measured plot data, satellite (Landsat TM and ETM+) imagery, and NOAA HWIND wind data variable types. As part of the modeling process, lone variable types as well as two-type and three-type combinations were examined. By deriving models with these varying inputs, model utility is flexible as all independent variable data are not needed in future applications. The large number of potential variables led to the use of forward, sequential, and exhaustive independent variable selection techniques. After variable selection, weighted least squares techniques were often employed using weights of one over the square root of the pre-storm volume or weight of interest. This was generally successful in improving residual variance homogeneity. Finished model fits, as represented by coefficient of determination (R2), surpassed 0.5 in numerous models with values over 0.6 noted in a few cases. Given these models, an analyst is provided with a toolset to aid in risk assessment and disaster recovery should Katrina-like weather events reoccur.
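The model count in the abstract above follows from simple enumeration: 3 damage classes × 3 product types × 4 forest type strata give 36 models, and fitting each across 10 variable sets and 2 units gives 720. The sketch below reproduces that arithmetic and the stated weighting of one over the square root of the pre-storm quantity. The labels are placeholders, since the abstract gives only the counts, and `wls_weight` is a hypothetical helper name.

```python
import math
from itertools import product

# Placeholder labels: the abstract reports how many levels exist, not their names.
damage_classes = [f"damage_class_{i}" for i in range(1, 4)]    # 3 classes
product_types = [f"product_type_{i}" for i in range(1, 4)]     # 3 types
forest_strata = [f"forest_stratum_{i}" for i in range(1, 5)]   # 4 strata
variable_sets = [f"variable_set_{i}" for i in range(1, 11)]    # 10 sets
units = ["cubic_foot_volume", "green_weight_tons"]             # 2 units

strata = list(product(damage_classes, product_types, forest_strata))
all_models = list(product(strata, variable_sets, units))


def wls_weight(pre_storm_amount: float) -> float:
    """Weight of one over the square root of the pre-storm volume or weight."""
    return 1.0 / math.sqrt(pre_storm_amount)


print(len(strata), len(all_models))  # 36 720
print(wls_weight(4.0))               # 0.5
```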
NASA Astrophysics Data System (ADS)
Shoaib, Syed Abu; Marshall, Lucy; Sharma, Ashish
2018-06-01
Every model to characterise a real world process is affected by uncertainty. Selecting a suitable model is a vital aspect of engineering planning and design. Observation or input errors make the prediction of modelled responses more uncertain. By way of a recently developed attribution metric, this study is aimed at developing a method for analysing variability in model inputs together with model structure variability to quantify their relative contributions in typical hydrological modelling applications. The Quantile Flow Deviation (QFD) metric is used to assess these alternate sources of uncertainty. The Australian Water Availability Project (AWAP) precipitation data for four different Australian catchments is used to analyse the impact of spatial rainfall variability on simulated streamflow variability via the QFD. The QFD metric attributes the variability in flow ensembles to uncertainty associated with the selection of a model structure and input time series. For the case study catchments, the relative contribution of input uncertainty due to rainfall is higher than that due to potential evapotranspiration, and overall input uncertainty is significant compared to model structure and parameter uncertainty. Overall, this study investigates the propagation of input uncertainty in a daily streamflow modelling scenario and demonstrates how input errors manifest across different streamflow magnitudes.
Guo, Pi; Zeng, Fangfang; Hu, Xiaomin; Zhang, Dingmei; Zhu, Shuming; Deng, Yu; Hao, Yuantao
2015-01-01
Objectives In epidemiological studies, it is important to identify independent associations between collective exposures and a health outcome. The current stepwise selection technique ignores stochastic errors and suffers from a lack of stability. The alternative LASSO-penalized regression model can be applied to detect significant predictors from a pool of candidate variables. However, this technique is prone to false positives and tends to create excessive biases. It remains challenging to develop robust variable selection methods and enhance predictability. Material and methods Two improved algorithms denoted the two-stage hybrid and bootstrap ranking procedures, both using a LASSO-type penalty, were developed for epidemiological association analysis. The performance of the proposed procedures and other methods including conventional LASSO, Bolasso, stepwise and stability selection models were evaluated using intensive simulation. In addition, methods were compared by using an empirical analysis based on large-scale survey data of hepatitis B infection-relevant factors among Guangdong residents. Results The proposed procedures produced comparable or less biased selection results when compared to conventional variable selection models. In total, the two newly proposed procedures were stable with respect to various scenarios of simulation, demonstrating a higher power and a lower false positive rate during variable selection than the compared methods. In empirical analysis, the proposed procedures yielding a sparse set of hepatitis B infection-relevant factors gave the best predictive performance and showed that the procedures were able to select a more stringent set of factors. The individual history of hepatitis B vaccination, family and individual history of hepatitis B infection were associated with hepatitis B infection in the studied residents according to the proposed procedures. 
Conclusions The newly proposed procedures improve the identification of significant variables and enable us to derive a new insight into epidemiological association analysis. PMID:26214802
Mazerolle, M.J.
2006-01-01
In ecology, researchers frequently use observational studies to explain a given pattern, such as the number of individuals in a habitat patch, with a large number of explanatory (i.e., independent) variables. To elucidate such relationships, ecologists have long relied on hypothesis testing to include or exclude variables in regression models, although the conclusions often depend on the approach used (e.g., forward, backward, stepwise selection). Though better tools surfaced in the mid 1970s, they are still underutilized in certain fields, particularly in herpetology. This is the case for the Akaike information criterion (AIC), which is remarkably superior to hypothesis-based approaches in model selection (i.e., variable selection). It is simple to compute and easy to understand, but more importantly, for a given data set, it provides a measure of the strength of evidence for each model that represents a plausible biological hypothesis relative to the entire set of models considered. Using this approach, one can then compute a weighted average of the estimate and standard error for any given variable of interest across all the models considered. This procedure, termed model-averaging or multimodel inference, yields precise and robust estimates. In this paper, I illustrate the use of the AIC in model selection and inference, as well as the interpretation of results analysed in this framework with two real herpetological data sets. The AIC and measures derived from it should be routinely adopted by herpetologists. © Koninklijke Brill NV 2006.
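The AIC workflow summarized above (score each model, convert scores to weights, then average an estimate across models) can be sketched as follows. The formulas are the standard ones (AIC = 2k - 2 ln L, and Akaike weights from AIC differences), and the unconditional standard error uses the commonly cited Burnham and Anderson form, which is an assumption about what the paper applies. The small-sample correction (AICc) is omitted, and all numbers are made up.

```python
import math
from typing import List, Sequence


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2.0 * n_params - 2.0 * log_likelihood


def akaike_weights(aics: Sequence[float]) -> List[float]:
    """Relative support for each model, normalized to sum to 1."""
    best = min(aics)
    rel = [math.exp(-0.5 * (a - best)) for a in aics]
    total = sum(rel)
    return [r / total for r in rel]


def model_average(estimates: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of one parameter's estimate across models."""
    return sum(w * e for w, e in zip(weights, estimates))


def unconditional_se(estimates: Sequence[float], ses: Sequence[float],
                     weights: Sequence[float]) -> float:
    """Model-averaged standard error (assumed Burnham-Anderson form):
    sum_i w_i * sqrt(se_i^2 + (estimate_i - averaged_estimate)^2)."""
    avg = model_average(estimates, weights)
    return sum(w * math.sqrt(se ** 2 + (e - avg) ** 2)
               for w, e, se in zip(weights, estimates, ses))


# Illustrative values for three candidate models that share one parameter.
aics = [aic(-48.0, 2), aic(-48.0, 3), aic(-51.0, 4)]  # 100, 102, 110
weights = akaike_weights(aics)
estimates = [0.50, 0.60, 0.20]
ses = [0.10, 0.12, 0.30]

print([round(w, 3) for w in weights])
print(round(model_average(estimates, weights), 3))
print(round(unconditional_se(estimates, ses, weights), 3))
```

When a variable appears in only some of the candidate models, the weights are usually renormalized over the models that contain it before averaging; that step is left out here for brevity.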
Curve fitting and modeling with splines using statistical variable selection techniques
NASA Technical Reports Server (NTRS)
Smith, P. L.
1982-01-01
The successful application of statistical variable selection techniques to fit splines is demonstrated. Major emphasis is given to knot selection, but order determination is also discussed. Two FORTRAN backward elimination programs, using the B-spline basis, were developed. The program for knot elimination is compared in detail with two other spline-fitting methods and several statistical software packages. An example is also given for the two-variable case using a tensor product basis, with a theoretical discussion of the difficulties of their use.
Variable selection for distribution-free models for longitudinal zero-inflated count responses.
Chen, Tian; Wu, Pan; Tang, Wan; Zhang, Hui; Feng, Changyong; Kowalski, Jeanne; Tu, Xin M
2016-07-20
Zero-inflated count outcomes arise quite often in research and practice. Parametric models such as the zero-inflated Poisson and zero-inflated negative binomial are widely used to model such responses. Like most parametric models, they are quite sensitive to departures from assumed distributions. Recently, new approaches have been proposed to provide distribution-free, or semi-parametric, alternatives. These methods extend the generalized estimating equations to provide robust inference for population mixtures defined by zero-inflated count outcomes. In this paper, we propose methods to extend smoothly clipped absolute deviation (SCAD)-based variable selection methods to these new models. Variable selection has been gaining popularity in modern clinical research studies, as determining differential treatment effects of interventions for different subgroups has become the norm, rather than the exception, in the era of patient-centered outcome research. Such moderation analysis in general creates many explanatory variables in regression analysis, and the advantages of SCAD-based methods over their traditional counterparts render them a great choice for addressing this important and timely issue in clinical research. We illustrate the proposed approach with both simulated and real study data. Copyright © 2016 John Wiley & Sons, Ltd.
Species distribution model transferability and model grain size - finer may not always be better.
Manzoor, Syed Amir; Griffiths, Geoffrey; Lukac, Martin
2018-05-08
Species distribution models have been used to predict the distribution of invasive species for conservation planning. Understanding spatial transferability of niche predictions is critical to promote species-habitat conservation and forecasting areas vulnerable to invasion. Grain size of predictor variables is an important factor affecting the accuracy and transferability of species distribution models. Choice of grain size is often dependent on the type of predictor variables used, and the selection of predictors sometimes relies on data availability. This study employed the MAXENT species distribution model to investigate the effect of the grain size on model transferability for an invasive plant species. We modelled the distribution of Rhododendron ponticum in Wales, U.K. and tested model performance and transferability by varying grain size (50 m, 300 m, and 1 km). MAXENT-based models are sensitive to grain size and selection of variables. We found that over-reliance on the commonly used bioclimatic variables may lead to less accurate models as it often compromises the finer grain size of biophysical variables which may be more important determinants of species distribution at small spatial scales. Model accuracy is likely to increase with decreasing grain size. However, successful model transferability may require optimization of model grain size.
Random forest feature selection approach for image segmentation
NASA Astrophysics Data System (ADS)
Lefkovits, László; Lefkovits, Szidónia; Emerich, Simina; Vaida, Mircea Florin
2017-03-01
In the field of image segmentation, discriminative models have shown promising performance. Generally, every such model begins with the extraction of numerous features from annotated images. Most authors create their discriminative model by using many features without using any selection criteria. A more reliable model can be built by using a framework that selects the variables that are important for the classification and eliminates the unimportant ones. In this article we present a framework for feature selection and data dimensionality reduction. The methodology is built around the random forest (RF) algorithm and its variable importance evaluation. In order to deal with datasets so large as to be practically unmanageable, we propose an algorithm based on RF that reduces the dimension of the database by eliminating irrelevant features. Furthermore, this framework is applied to optimize our discriminative model for brain tumor segmentation.
Why climate change will invariably alter selection pressures on phenology.
Gienapp, Phillip; Reed, Thomas E; Visser, Marcel E
2014-10-22
The seasonal timing of lifecycle events is closely linked to individual fitness and hence, maladaptation in phenological traits may impact population dynamics. However, few studies have analysed whether and why climate change will alter selection pressures and hence possibly induce maladaptation in phenology. To fill this gap, we here use a theoretical modelling approach. In our models, the phenologies of consumer and resource are (potentially) environmentally sensitive and depend on two different but correlated environmental variables. Fitness of the consumer depends on the phenological match with the resource. Because we explicitly model the dependence of the phenologies on environmental variables, we can test how differential (heterogeneous) versus equal (homogeneous) rates of change in the environmental variables affect selection on consumer phenology. As expected, under heterogeneous change, phenotypic plasticity is insufficient and thus selection on consumer phenology arises. However, even homogeneous change leads to directional selection on consumer phenology. This is because the consumer reaction norm has historically evolved to be flatter than the resource reaction norm, owing to time lags and imperfect cue reliability. Climate change will therefore lead to increased selection on consumer phenology across a broad range of situations. © 2014 The Author(s) Published by the Royal Society. All rights reserved.
Tyler Jon Smith; Lucy Amanda Marshall
2010-01-01
Model selection is an extremely important aspect of many hydrologic modeling studies because of the complexity, variability, and uncertainty that surrounds the current understanding of watershed-scale systems. However, development and implementation of a complete precipitation-runoff modeling framework, from model selection to calibration and uncertainty analysis, are...
Variable selection based cotton bollworm odor spectroscopic detection
NASA Astrophysics Data System (ADS)
Lü, Chengxu; Gai, Shasha; Luo, Min; Zhao, Bo
2016-10-01
To support efficient, targeted pesticide application through rapid automatic pest detection, and to overcome the problem of reflectance spectral signals being covered and attenuated by the solid plant, the feasibility of detecting cotton bollworm odor by near-infrared spectroscopy (NIRS) is studied. Three cotton bollworm odor samples and 3 blank air gas samples were prepared. Different concentrations of cotton bollworm odor were prepared by mixing these gas samples, resulting in a calibration group of 62 samples and a validation group of 31 samples. The spectral collection system includes a light source, an optical fiber, a sample chamber and a spectrometer. Spectra were pretreated by baseline correction, modeled with partial least squares (PLS), and optimized by genetic algorithm (GA) and competitive adaptive reweighted sampling (CARS). Minor differences in counts are found among the spectra of different cotton bollworm odor concentrations. A PLS model using all the variables was built, giving an RMSEV of 14 and an RV2 of 0.89. Its theoretical basis is that insects emit specific odors, including pheromones and allelochemicals, which are used for intra-specific and inter-specific communication and can be detected by NIR spectroscopy. GA selects 28 sensitive variables, giving a model with an RMSEV of 14 and an RV2 of 0.90. By comparison, CARS selects 8 sensitive variables, giving a model with an RMSEV of 13 and an RV2 of 0.92. The CARS model employs only 1.5% of the variables yet has a smaller error than the all-variable model. The odor-gas-based NIR technique shows potential for cotton bollworm detection.
Zhang, Peng; Parenteau, Chantal; Wang, Lu; Holcombe, Sven; Kohoyda-Inglis, Carla; Sullivan, June; Wang, Stewart
2013-11-01
This study resulted in a model-averaging methodology that predicts crash injury risk using vehicle, demographic, and morphomic variables and assesses the importance of individual predictors. The effectiveness of this methodology was illustrated through analysis of occupant chest injuries in frontal vehicle crashes. The crash data were obtained from the International Center for Automotive Medicine (ICAM) database for calendar years 1996 to 2012. The morphomic data are quantitative measurements of variations in human body 3-dimensional anatomy. Morphomics are obtained from imaging records. In this study, morphomics were obtained from chest, abdomen, and spine CT using novel patented algorithms. A NASS-trained crash investigator with over thirty years of experience collected the in-depth crash data. There were 226 cases available with occupants involved in frontal crashes and morphomic measurements. Only cases with complete recorded data were retained for statistical analysis. Logistic regression models were fitted using all possible configurations of vehicle, demographic, and morphomic variables. Different models were ranked by the Akaike Information Criterion (AIC). An averaged logistic regression model approach was used due to the limited sample size relative to the number of variables. This approach is helpful when addressing variable selection, building prediction models, and assessing the importance of individual variables. The final predictive results were developed using this approach, based on the top 100 models in the AIC ranking. Model-averaging minimized model uncertainty, decreased the overall prediction variance, and provided an approach to evaluating the importance of individual variables. There were 17 variables investigated: four vehicle, four demographic, and nine morphomic. More than 130,000 logistic models were investigated in total. The models were characterized into four scenarios to assess individual variable contribution to injury risk. 
Scenario 1 used vehicle variables; Scenario 2, vehicle and demographic variables; Scenario 3, vehicle and morphomic variables; and Scenario 4 used all variables. AIC was used to rank the models and to address over-fitting. In each scenario, the results based on the top three models and the averages of the top 100 models were presented. The AIC and the area under the receiver operating characteristic curve (AUC) were reported in each model. The models were re-fitted after removing each variable one at a time. The increases of AIC and the decreases of AUC were then assessed to measure the contribution and importance of the individual variables in each model. The importance of the individual variables was also determined by their weighted frequencies of appearance in the top 100 selected models. Overall, the AUC was 0.58 in Scenario 1, 0.78 in Scenario 2, 0.76 in Scenario 3 and 0.82 in Scenario 4. The results showed that morphomic variables are as accurate at predicting injury risk as demographic variables. The results of this study emphasize the importance of including morphomic variables when assessing injury risk. The results also highlight the need for morphomic data in the development of human mathematical models when assessing restraint performance in frontal crashes, since morphomic variables are more "tangible" measurements compared to demographic variables such as age and gender. Copyright © 2013 Elsevier Ltd. All rights reserved.
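The AIC-based ranking and the "weighted frequency of appearance" described in this abstract can be illustrated with a small sketch. The abstract does not state how the models were weighted, so the use of standard Akaike weights here is an assumption, and the variable names and AIC values are hypothetical placeholders rather than figures from the study.

```python
import math


def akaike_weights(aic_values):
    """Convert AIC scores into normalized Akaike weights (they sum to 1)."""
    best = min(aic_values)
    # Relative likelihood of each model compared with the best-ranked one.
    raw = [math.exp(-0.5 * (aic - best)) for aic in aic_values]
    total = sum(raw)
    return [r / total for r in raw]


def variable_importance(models, aic_values):
    """Sum the weights of every model in which each variable appears."""
    weights = akaike_weights(aic_values)
    importance = {}
    for variables, weight in zip(models, weights):
        for name in variables:
            importance[name] = importance.get(name, 0.0) + weight
    return importance


# Hypothetical candidate models (tuples of variable names) and AIC scores.
models = [
    ("vehicle_var", "morphomic_var"),
    ("vehicle_var",),
    ("morphomic_var", "demographic_var"),
]
aics = [100.0, 102.0, 104.0]

weights = akaike_weights(aics)
importance = variable_importance(models, aics)
```

A variable that appears in many low-AIC models accumulates a weight close to 1, while one that appears only in poorly ranked models stays near 0.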
Bounds on internal state variables in viscoplasticity
NASA Technical Reports Server (NTRS)
Freed, Alan D.
1993-01-01
A typical viscoplastic model will introduce up to three types of internal state variables in order to properly describe transient material behavior; they are as follows: the back stress, the yield stress, and the drag strength. Different models employ different combinations of these internal variables--their selection and description of evolution being largely dependent on application and material selection. Under steady-state conditions, the internal variables cease to evolve and therefore become related to the external variables (stress and temperature) through simple functional relationships. A physically motivated hypothesis is presented that links the kinetic equation of viscoplasticity with that of creep under steady-state conditions. From this hypothesis one determines how the internal variables relate to one another at steady state, but most importantly, one obtains bounds on the magnitudes of stress and back stress, and on the yield stress and drag strength.
Calus, M P L; de Haas, Y; Veerkamp, R F
2013-10-01
Genomic selection holds the promise to be particularly beneficial for traits that are difficult or expensive to measure, such that access to phenotypes on large daughter groups of bulls is limited. Instead, cow reference populations can be generated, potentially supplemented with existing information from the same or (highly) correlated traits available on bull reference populations. The objective of this study, therefore, was to develop a model to perform genomic predictions and genome-wide association studies based on a combined cow and bull reference data set, with the accuracy of the phenotypes differing between the cow and bull genomic selection reference populations. The developed bivariate Bayesian stochastic search variable selection model allowed for an unbalanced design by imputing residuals in the residual updating scheme for all missing records. The performance of this model is demonstrated on a real data example, where the analyzed trait, being milk fat or protein yield, was either measured only on a cow or a bull reference population, or recorded on both. Our results were that the developed bivariate Bayesian stochastic search variable selection model was able to analyze 2 traits, even though animals had measurements on only 1 of 2 traits. The Bayesian stochastic search variable selection model yielded consistently higher accuracy for fat yield compared with a model without variable selection, both for the univariate and bivariate analyses, whereas the accuracy of both models was very similar for protein yield. The bivariate model identified several additional quantitative trait loci peaks compared with the single-trait models on either trait. In addition, the bivariate models showed a marginal increase in accuracy of genomic predictions for the cow traits (0.01-0.05), although a greater increase in accuracy is expected as the size of the bull population increases. 
Our results emphasize that the chosen values of priors in Bayesian genomic prediction models are especially important in small data sets. Copyright © 2013 American Dairy Science Association. Published by Elsevier Inc. All rights reserved.
Porfirio, Luciana L.; Harris, Rebecca M. B.; Lefroy, Edward C.; Hugh, Sonia; Gould, Susan F.; Lee, Greg; Bindoff, Nathaniel L.; Mackey, Brendan
2014-01-01
Choice of variables, climate models and emissions scenarios all influence the results of species distribution models under future climatic conditions. However, an overview of applied studies suggests that the uncertainty associated with these factors is not always appropriately incorporated or even considered. We examine the effects that choice of variables, climate models and emissions scenarios can have on future species distribution models using two endangered species: one a short-lived invertebrate species (Ptunarra Brown Butterfly), and the other a long-lived paleo-endemic tree species (King Billy Pine). We show the range in projected distributions that result from different variable selection, climate models and emissions scenarios. The extent to which results are affected by these choices depends on the characteristics of the species modelled, but they all have the potential to substantially alter conclusions about the impacts of climate change. We discuss implications for conservation planning and management, and provide recommendations to conservation practitioners on variable selection and accommodating uncertainty when using future climate projections in species distribution models. PMID:25420020
Rahman, Anisur; Faqeerzada, Mohammad A; Cho, Byoung-Kwan
2018-03-14
Allicin and soluble solid content (SSC) in garlic are responsible for its pungent flavor and odor. However, current conventional methods such as the use of high-pressure liquid chromatography and a refractometer have critical drawbacks in that they are time-consuming, labor-intensive and destructive procedures. The present study aimed to predict allicin and SSC in garlic using hyperspectral imaging in combination with variable selection algorithms and calibration models. Hyperspectral images of 100 garlic cloves were acquired that covered two spectral ranges, from which the mean spectra of each clove were extracted. The calibration models included partial least squares (PLS) and least squares-support vector machine (LS-SVM) regression, as well as different spectral pre-processing techniques, from which the highest performing spectral preprocessing technique and spectral range were selected. Then, variable selection methods, such as regression coefficients, variable importance in projection (VIP) and the successive projections algorithm (SPA), were evaluated for the selection of effective wavelengths (EWs). Furthermore, PLS and LS-SVM regression methods were applied to quantitatively predict the quality attributes of garlic using the selected EWs. Of the established models, the SPA-LS-SVM model obtained an R²pred of 0.90 and standard error of prediction (SEP) of 1.01% for SSC prediction, whereas the VIP-LS-SVM model produced the best result with an R²pred of 0.83 and SEP of 0.19 mg g⁻¹ for allicin prediction in the range 1000-1700 nm. Furthermore, chemical images of garlic were developed using the best predictive model to facilitate visualization of the spatial distributions of allicin and SSC. The present study clearly demonstrates that hyperspectral imaging combined with an appropriate chemometrics method can potentially be employed as a fast, non-invasive method to predict the allicin and SSC in garlic. © 2018 Society of Chemical Industry.
NASA Astrophysics Data System (ADS)
Yi, Jin; Li, Xinyu; Xiao, Mi; Xu, Junnan; Zhang, Lin
2017-01-01
Engineering design often involves different types of simulation, which results in expensive computational costs. Variable fidelity approximation-based design optimization approaches can realize effective simulation and efficiency optimization of the design space using approximation models with different levels of fidelity and have been widely used in different fields. As the foundations of variable fidelity approximation models, the selection of sample points of variable-fidelity approximation, called nested designs, is essential. In this article a novel nested maximin Latin hypercube design is constructed based on successive local enumeration and a modified novel global harmony search algorithm. In the proposed nested designs, successive local enumeration is employed to select sample points for a low-fidelity model, whereas the modified novel global harmony search algorithm is employed to select sample points for a high-fidelity model. A comparative study with multiple criteria and an engineering application are employed to verify the efficiency of the proposed nested designs approach.
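As a rough illustration of what a maximin Latin hypercube design is, the sketch below builds single-level designs and keeps the one with the largest minimum pairwise distance. It deliberately swaps in plain random restarts for the successive local enumeration and modified global harmony search used in the paper, and it does not construct the nested low- and high-fidelity sample sets; the function names and the `tries` parameter are illustrative assumptions.

```python
import math
import random


def latin_hypercube(n, d, rng):
    """Draw n points in [0, 1]^d with exactly one point per stratum in each dimension."""
    columns = []
    for _ in range(d):
        strata = list(range(n))
        rng.shuffle(strata)
        # Place each point at the centre of its assigned stratum.
        columns.append([(k + 0.5) / n for k in strata])
    return [tuple(columns[j][i] for j in range(d)) for i in range(n)]


def min_distance(points):
    """Smallest Euclidean distance between any two points of the design."""
    return min(
        math.dist(a, b)
        for i, a in enumerate(points)
        for b in points[i + 1:]
    )


def maximin_lhs(n, d, tries=200, seed=0):
    """Random-restart search: keep the candidate with the largest minimum distance."""
    rng = random.Random(seed)
    candidates = (latin_hypercube(n, d, rng) for _ in range(tries))
    return max(candidates, key=min_distance)


design = maximin_lhs(n=8, d=2)
```

Random restarts scale poorly as `n` and `d` grow, which is one reason dedicated search strategies such as those named in the abstract are used in practice.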
Lobréaux, Stéphane; Melodelima, Christelle
2015-02-01
We tested the use of Generalized Linear Mixed Models to detect associations between genetic loci and environmental variables, taking into account the population structure of sampled individuals. We used a simulation approach to generate datasets under demographically and selectively explicit models. These datasets were used to analyze and optimize GLMM capacity to detect the association between markers and selective coefficients as environmental data in terms of false and true positive rates. Different sampling strategies were tested, maximizing the number of populations sampled, sites sampled per population, or individuals sampled per site, and the effect of different selective intensities on the efficiency of the method was determined. Finally, we apply these models to an Arabidopsis thaliana SNP dataset from different accessions, looking for loci associated with spring minimal temperature. We identified 25 regions that exhibit unusual correlations with the climatic variable and contain genes with functions related to temperature stress. Copyright © 2014 Elsevier Inc. All rights reserved.
Origin and Function of Tuning Diversity in Macaque Visual Cortex.
Goris, Robbe L T; Simoncelli, Eero P; Movshon, J Anthony
2015-11-18
Neurons in visual cortex vary in their orientation selectivity. We measured responses of V1 and V2 cells to orientation mixtures and fit them with a model whose stimulus selectivity arises from the combined effects of filtering, suppression, and response nonlinearity. The model explains the diversity of orientation selectivity with neuron-to-neuron variability in all three mechanisms, of which variability in the orientation bandwidth of linear filtering is the most important. The model also accounts for the cells' diversity of spatial frequency selectivity. Tuning diversity is matched to the needs of visual encoding. The orientation content found in natural scenes is diverse, and neurons with different selectivities are adapted to different stimulus configurations. Single orientations are better encoded by highly selective neurons, while orientation mixtures are better encoded by less selective neurons. A diverse population of neurons therefore provides better overall discrimination capabilities for natural images than any homogeneous population. Copyright © 2015 Elsevier Inc. All rights reserved.
Vasconcelos, A G; Almeida, R M; Nobre, F F
2001-08-01
This paper introduces an approach that includes non-quantitative factors for the selection and assessment of multivariate complex models in health. A goodness-of-fit based methodology combined with a fuzzy multi-criteria decision-making approach is proposed for model selection. Models were obtained using the Path Analysis (PA) methodology in order to explain the interrelationship between health determinants and the post-neonatal component of infant mortality in 59 municipalities of Brazil in the year 1991. Socioeconomic and demographic factors were used as exogenous variables, and environmental, health service and agglomeration as endogenous variables. Five PA models were developed and accepted by statistical criteria of goodness-of-fit. These models were then submitted to a group of experts, seeking to characterize their preferences, according to predefined criteria that tried to evaluate model relevance and plausibility. Fuzzy set techniques were used to rank the alternative models according to the number of times a model was superior to ("dominated") the others. The best-ranked model explained more than 90% of the variation in the endogenous variables, and showed the favorable influences of income and education levels on post-neonatal mortality. It also showed the unfavorable effect on mortality of fast population growth, through precarious dwelling conditions and decreased access to sanitation. It was possible to aggregate expert opinions in model evaluation. The proposed procedure for model selection allowed the inclusion of subjective information in a clear and systematic manner.
Geng, Zhigeng; Wang, Sijian; Yu, Menggang; Monahan, Patrick O.; Champion, Victoria; Wahba, Grace
2017-01-01
In many scientific and engineering applications, covariates are naturally grouped. When the group structures are available among covariates, people are usually interested in identifying both important groups and important variables within the selected groups. Among existing successful group variable selection methods, some methods fail to conduct the within group selection. Some methods are able to conduct both group and within group selection, but the corresponding objective functions are non-convex. Such a non-convexity may require extra numerical effort. In this article, we propose a novel Log-Exp-Sum (LES) penalty for group variable selection. The LES penalty is strictly convex. It can identify important groups as well as select important variables within the group. We develop an efficient group-level coordinate descent algorithm to fit the model. We also derive non-asymptotic error bounds and asymptotic group selection consistency for our method in the high-dimensional setting where the number of covariates can be much larger than the sample size. Numerical results demonstrate the good performance of our method in both variable selection and prediction. We applied the proposed method to an American Cancer Society breast cancer survivor dataset. The findings are clinically meaningful and may help design intervention programs to improve the quality of life for breast cancer survivors. PMID:25257196
Selecting an Informative/Discriminating Multivariate Response for Inverse Prediction
DOE Office of Scientific and Technical Information (OSTI.GOV)
Thomas, Edward V.; Lewis, John R.; Anderson-Cook, Christine M.; ...
2017-11-21
Inverse prediction is important in a wide variety of scientific and engineering contexts. One might use inverse prediction to predict fundamental properties/characteristics of an object using measurements obtained from it. This can be accomplished by “inverting” parameterized forward models that relate the measurements (responses) to the properties/characteristics of interest. Sometimes forward models are science based; but often, forward models are empirically based, using the results of experimentation. For empirically-based forward models, it is important that the experiments provide a sound basis to develop accurate forward models in terms of the properties/characteristics (factors). While nature dictates the causal relationship between factors and responses, experimenters can influence control of the type, accuracy, and precision of forward models that can be constructed via selection of factors, factor levels, and the set of trials that are performed. Whether the forward models are based on science, experiments or both, researchers can influence the ability to perform inverse prediction by selecting informative response variables. By using an errors-in-variables framework for inverse prediction, this paper shows via simple analysis and examples how the capability of a multivariate response (with respect to being informative and discriminating) can vary depending on how well the various responses complement one another over the range of the factor-space of interest. Insights derived from this analysis could be useful for selecting a set of response variables among candidates in cases where the number of response variables that can be acquired is limited by difficulty, expense, and/or availability of material.
Shan, Jiajia; Wang, Xue; Zhou, Hao; Han, Shuqing; Riza, Dimas Firmanda Al; Kondo, Naoshi
2018-03-13
Synchronous fluorescence spectra combined with multivariate analysis were used to predict flavonoid content in green tea rapidly and nondestructively. This paper presents a new and efficient spectral interval selection method called clustering-based partial least squares (CL-PLS), which selects informative wavelengths by combining the clustering concept with partial least squares (PLS) methods to improve the performance of models built on synchronous fluorescence spectra. The fluorescence spectra of tea samples were obtained, k-means and Kohonen self-organizing map clustering algorithms were used to group the full spectra into several clusters, and a sub-PLS regression model was developed on each cluster. Finally, CL-PLS models consisting of gradually selected clusters were built. The correlation coefficient (R) was used to evaluate the effect on the prediction performance of the PLS models. In addition, variable influence on projection partial least squares (VIP-PLS), selectivity ratio partial least squares (SR-PLS), interval partial least squares (iPLS) models and a full-spectrum PLS model were investigated and the results were compared. The results showed that CL-PLS gave the best result for flavonoid prediction using synchronous fluorescence spectra.
Optimization Of Mean-Semivariance-Skewness Portfolio Selection Model In Fuzzy Random Environment
NASA Astrophysics Data System (ADS)
Chatterjee, Amitava; Bhattacharyya, Rupak; Mukherjee, Supratim; Kar, Samarjit
2010-10-01
The purpose of the paper is to construct a mean-semivariance-skewness portfolio selection model in a fuzzy random environment. The objective is to maximize the skewness subject to a predefined maximum risk tolerance and minimum expected return. The security returns in the objectives and constraints are assumed to be fuzzy random variables, and the vagueness of these fuzzy random variables is then transformed into fuzzy variables that are similar to trapezoidal numbers. The newly formed fuzzy model is then converted into a deterministic optimization model. The feasibility and effectiveness of the proposed method are verified by a numerical example extracted from the Bombay Stock Exchange (BSE). The exact parameters of the fuzzy membership function and probability density function are obtained through fuzzy random simulation of the past dates.
Li, Jin; Tran, Maggie; Siwabessy, Justy
2016-01-01
Spatially continuous predictions of seabed hardness are important baseline environmental information for sustainable management of Australia’s marine jurisdiction. Seabed hardness is often inferred from multibeam backscatter data with unknown accuracy and can be inferred from underwater video footage at limited locations. In this study, we classified the seabed into four classes based on two new seabed hardness classification schemes (i.e., hard90 and hard70). We developed optimal predictive models to predict seabed hardness using random forest (RF) based on the point data of hardness classes and spatially continuous multibeam data. Five feature selection (FS) methods that are variable importance (VI), averaged variable importance (AVI), knowledge informed AVI (KIAVI), Boruta and regularized RF (RRF) were tested based on predictive accuracy. Effects of highly correlated, important and unimportant predictors on the accuracy of RF predictive models were examined. Finally, spatial predictions generated using the most accurate models were visually examined and analysed. This study confirmed that: 1) hard90 and hard70 are effective seabed hardness classification schemes; 2) seabed hardness of four classes can be predicted with a high degree of accuracy; 3) the typical approach used to pre-select predictive variables by excluding highly correlated variables needs to be re-examined; 4) the identification of the important and unimportant predictors provides useful guidelines for further improving predictive models; 5) FS methods select the most accurate predictive model(s) instead of the most parsimonious ones, and AVI and Boruta are recommended for future studies; and 6) RF is an effective modelling method with high predictive accuracy for multi-level categorical data and can be applied to ‘small p and large n’ problems in environmental sciences. 
Additionally, automated computational programs for AVI need to be developed to increase its computational efficiency and caution should be taken when applying filter FS methods in selecting predictive models. PMID:26890307
A Selective Overview of Variable Selection in High Dimensional Feature Space
Fan, Jianqing
2010-01-01
High dimensional statistical problems arise from diverse fields of scientific research and technological development. Variable selection plays a pivotal role in contemporary statistical learning and scientific discoveries. The traditional idea of best subset selection methods, which can be regarded as a specific form of penalized likelihood, is computationally too expensive for many modern statistical applications. Other forms of penalized likelihood methods have been successfully developed over the last decade to cope with high dimensionality. They have been widely applied for simultaneously selecting important variables and estimating their effects in high dimensional statistical inference. In this article, we present a brief account of the recent developments of theory, methods, and implementations for high dimensional variable selection. What limits of the dimensionality such methods can handle, what the role of penalty functions is, and what the statistical properties are rapidly drive the advances of the field. The properties of non-concave penalized likelihood and its roles in high dimensional statistical modeling are emphasized. We also review some recent advances in ultra-high dimensional variable selection, with emphasis on independence screening and two-scale methods. PMID:21572976
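The independence screening mentioned in this overview can be sketched as ranking each candidate variable by the absolute value of its marginal correlation with the response and keeping the top few. The snippet below is a minimal illustration on simulated data, not the article's procedure in full; the data-generating model, the seed, and the choice `keep=5` are assumptions made only for the example.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def independence_screen(columns, y, keep):
    """Return indices of the `keep` columns with the largest |marginal correlation|."""
    scores = [abs(pearson(col, y)) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return ranked[:keep]


# Simulated data: 50 candidate predictors, only the first two drive the response.
rng = random.Random(0)
n, p = 200, 50
columns = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
y = [3.0 * columns[0][i] - 2.0 * columns[1][i] + rng.gauss(0.0, 1.0) for i in range(n)]

selected = independence_screen(columns, y, keep=5)
```

Screening of this kind only reduces the candidate set; a second-stage method such as a penalized likelihood fit would normally be applied to the retained variables.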
Igne, Benoit; Shi, Zhenqi; Drennen, James K; Anderson, Carl A
2014-02-01
The impact of raw material variability on the prediction ability of a near-infrared calibration model was studied. Calibrations, developed from a quaternary mixture design comprising theophylline anhydrous, lactose monohydrate, microcrystalline cellulose, and soluble starch, were challenged by intentional variation of raw material properties. A design with two theophylline physical forms, three lactose particle sizes, and two starch manufacturers was created to test model robustness. Further challenges to the models were accomplished through environmental conditions. Along with full-spectrum partial least squares (PLS) modeling, variable selection by dynamic backward PLS and genetic algorithms was utilized in an effort to mitigate the effects of raw material variability. In addition to evaluating models based on their prediction statistics, prediction residuals were analyzed by analyses of variance and model diagnostics (Hotelling's T² and Q residuals). Full-spectrum models were significantly affected by lactose particle size. Models developed by selecting variables gave lower prediction errors and proved to be a good approach to limit the effect of changing raw material characteristics. Hotelling's T² and Q residuals provided valuable information that was not detectable when studying only prediction trends. Diagnostic statistics were demonstrated to be critical in the appropriate interpretation of the prediction of quality parameters. © 2013 Wiley Periodicals, Inc. and the American Pharmacists Association.
Fan, X-J; Wan, X-B; Huang, Y; Cai, H-M; Fu, X-H; Yang, Z-L; Chen, D-K; Song, S-X; Wu, P-H; Liu, Q; Wang, L; Wang, J-P
2012-01-01
Background: Current imaging modalities are inadequate in preoperatively predicting regional lymph node metastasis (RLNM) status in rectal cancer (RC). Here, we designed a support vector machine (SVM) model to address this issue by integrating epithelial–mesenchymal-transition (EMT)-related biomarkers along with clinicopathological variables. Methods: Using tissue microarrays and immunohistochemistry, the expression of EMT-related biomarkers was measured in 193 RC patients. Of these, 74 patients were assigned to the training set to select robust variables for designing the SVM model. The predictive value of the SVM model was validated in the testing set (119 patients). Results: In the training set, eight variables, including six EMT-related biomarkers and two clinicopathological variables, were selected to build the SVM model. In the testing set, we identified 63 patients at high risk of RLNM and 56 patients at low risk. The sensitivity, specificity and overall accuracy of SVM in predicting RLNM were 68.3%, 81.1% and 72.3%, respectively. Importantly, multivariate logistic regression analysis showed that the SVM model was indeed an independent predictor of RLNM status (odds ratio, 11.536; 95% confidence interval, 4.113–32.361; P<0.0001). Conclusion: Our SVM-based model displayed moderately strong predictive power in defining the RLNM status in RC patients, providing an important approach to select the RLNM high-risk subgroup for neoadjuvant chemoradiotherapy. PMID:22538975
NASA Technical Reports Server (NTRS)
Leduc, S. (Principal Investigator)
1982-01-01
Models based on multiple regression were developed to estimate corn and soybean yield from weather data for agrophysical units (APU) in Iowa. The predictor variables are derived from monthly average temperature and monthly total precipitation data at meteorological stations in the cooperative network. The models are similar in form to the previous models developed for crop reporting districts (CRD). The trends and derived variables were the same and the approach to select the significant predictors was similar to that used in developing the CRD models. The APUs were selected to be more homogeneous with respect to crop production than the CRDs. The APU models are quite similar to the CRD models, with similar explained variation and number of predictor variables. The APU models are to be independently evaluated and compared to the previously evaluated CRD models. That comparison should indicate the preferred model area for this application, i.e., APU or CRD.
A Demonstration of Regression False Positive Selection in Data Mining
ERIC Educational Resources Information Center
Pinder, Jonathan P.
2014-01-01
Business analytics courses, such as marketing research, data mining, forecasting, and advanced financial modeling, have substantial predictive modeling components. The predictive modeling in these courses requires students to estimate and test many linear regressions. As a result, false positive variable selection ("type I errors") is…
ERIC Educational Resources Information Center
Manouselis, Nikos; Sampson, Demetrios
This paper focuses on the way a multi-criteria decision making methodology is applied in the case of agent-based selection of offered learning objects. The problem of selection is modeled as a decision making one, with the decision variables being the learner model and the learning objects' educational description. In this way, selection of…
Variable selection with stepwise and best subset approaches
2016-01-01
While purposeful selection is performed partly by software and partly by hand, the stepwise and best subset approaches are automatically performed by software. Two R functions, stepAIC() and bestglm(), are well designed for stepwise and best subset regression, respectively. The stepAIC() function begins with a full or null model, and methods for stepwise regression can be specified in the direction argument with character values “forward”, “backward” and “both”. The bestglm() function begins with a data frame containing explanatory variables and response variables. The response variable should be in the last column. A variety of goodness-of-fit criteria can be specified in the IC argument. The Bayesian information criterion (BIC) usually results in a more parsimonious model than the Akaike information criterion (AIC). PMID:27162786
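To illustrate the idea behind forward stepwise selection and the AIC/BIC contrast, here is a minimal Python sketch. It is not the R stepAIC() or bestglm() implementation: the function names, the Gaussian least-squares criterion (written up to an additive constant) and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def information_criterion(X, y, cols, penalty):
    """n*log(RSS/n) + penalty*k for an OLS fit on `cols` plus an intercept."""
    n = len(y)
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    return n * np.log(rss / n) + penalty * design.shape[1]


def forward_stepwise(X, y, criterion="aic"):
    """Greedily add the column that lowers the criterion most; stop when none does."""
    n, p = X.shape
    penalty = 2.0 if criterion == "aic" else float(np.log(n))  # AIC vs BIC penalty
    selected = []
    best = information_criterion(X, y, selected, penalty)
    while len(selected) < p:
        candidates = [j for j in range(p) if j not in selected]
        scores = {j: information_criterion(X, y, selected + [j], penalty)
                  for j in candidates}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best:
            break  # no candidate improves the criterion
        selected.append(j_best)
        best = scores[j_best]
    return selected


# Synthetic example (hypothetical data): only columns 0 and 3 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(size=200)

aic_sel = forward_stepwise(X, y, "aic")
bic_sel = forward_stepwise(X, y, "bic")
print("AIC selection:", aic_sel)
print("BIC selection:", bic_sel)
```

Because both criteria follow the same greedy path and BIC charges log(n) per parameter instead of 2, the BIC selection in this sketch is never longer than the AIC selection once n exceeds about 8.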
Vanderhaeghe, F; Smolders, A J P; Roelofs, J G M; Hoffmann, M
2012-03-01
Selecting an appropriate variable subset in linear multivariate methods is an important methodological issue for ecologists. Interest often exists in obtaining general predictive capacity or in finding causal inferences from predictor variables. Because of a lack of solid knowledge on a studied phenomenon, scientists explore predictor variables in order to find the most meaningful (i.e. discriminating) ones. As an example, we modelled the response of the amphibious softwater plant Eleocharis multicaulis using canonical discriminant function analysis. We asked how variables can be selected through comparison of several methods: univariate Pearson chi-square screening, principal components analysis (PCA) and step-wise analysis, as well as combinations of some methods. We expected PCA to perform best. The selected methods were evaluated through fit and stability of the resulting discriminant functions and through correlations between these functions and the predictor variables. The chi-square subset, at P < 0.05, followed by a step-wise sub-selection, gave the best results. In contrast to expectations, PCA performed poorly, as did step-wise analysis. The different chi-square subset methods all yielded ecologically meaningful variables, while probable noise variables were also selected by PCA and step-wise analysis. We advise against the simple use of PCA or step-wise discriminant analysis to obtain an ecologically meaningful variable subset; the former because it does not take into account the response variable, the latter because noise variables are likely to be selected. We suggest that univariate screening techniques are a worthwhile alternative for variable selection in ecology. © 2011 German Botanical Society and The Royal Botanical Society of the Netherlands.
A decision tool for selecting trench cap designs
DOE Office of Scientific and Technical Information (OSTI.GOV)
Paige, G.B.; Stone, J.J.; Lane, L.J.
1995-12-31
A computer based prototype decision support system (PDSS) is being developed to assist the risk manager in selecting an appropriate trench cap design for waste disposal sites. The selection of the 'best' design among feasible alternatives requires consideration of multiple and often conflicting objectives. The methodology used in the selection process consists of: selecting and parameterizing decision variables using data, simulation models, or expert opinion; selecting feasible trench cap design alternatives; ordering the decision variables and ranking the design alternatives. The decision model is based on multi-objective decision theory and uses a unique approach to order the decision variables and rank the design alternatives. Trench cap designs are evaluated based on federal regulations, hydrologic performance, cover stability and cost. Four trench cap designs, which were monitored for a four year period at Hill Air Force Base in Utah, are used to demonstrate the application of the PDSS and evaluate the results of the decision model. The results of the PDSS, using both data and simulations, illustrate the relative advantages of each of the cap designs and which cap is the "best" alternative for a given set of criteria and a particular importance order of those decision criteria.
A Model for Investigating Predictive Validity at Highly Selective Institutions.
ERIC Educational Resources Information Center
Gross, Alan L.; And Others
A statistical model for investigating predictive validity at highly selective institutions is described. When the selection ratio is small, one must typically deal with a data set containing relatively large amounts of missing data on both criterion and predictor variables. Standard statistical approaches are based on the strong assumption that…
VARIABLE SELECTION IN NONPARAMETRIC ADDITIVE MODELS
Huang, Jian; Horowitz, Joel L.; Wei, Fengrong
2010-01-01
We consider a nonparametric additive model of a conditional mean function in which the number of variables and additive components may be larger than the sample size but the number of nonzero additive components is “small” relative to the sample size. The statistical problem is to determine which additive components are nonzero. The additive components are approximated by truncated series expansions with B-spline bases. With this approximation, the problem of component selection becomes that of selecting the groups of coefficients in the expansion. We apply the adaptive group Lasso to select nonzero components, using the group Lasso to obtain an initial estimator and reduce the dimension of the problem. We give conditions under which the group Lasso selects a model whose number of components is comparable with the underlying model, and the adaptive group Lasso selects the nonzero components correctly with probability approaching one as the sample size increases and achieves the optimal rate of convergence. The results of Monte Carlo experiments show that the adaptive group Lasso procedure works well with samples of moderate size. A data example is used to illustrate the application of the proposed method. PMID:21127739
Lee, Kyu Ha; Tadesse, Mahlet G; Baccarelli, Andrea A; Schwartz, Joel; Coull, Brent A
2017-03-01
The analysis of multiple outcomes is becoming increasingly common in modern biomedical studies. It is well-known that joint statistical models for multiple outcomes are more flexible and more powerful than fitting a separate model for each outcome; they yield more powerful tests of exposure or treatment effects by taking into account the dependence among outcomes and pooling evidence across outcomes. It is, however, unlikely that all outcomes are related to the same subset of covariates. Therefore, there is interest in identifying exposures or treatments associated with particular outcomes, which we term outcome-specific variable selection. In this work, we propose a variable selection approach for multivariate normal responses that incorporates not only information on the mean model, but also information on the variance-covariance structure of the outcomes. The approach effectively leverages evidence from all correlated outcomes to estimate the effect of a particular covariate on a given outcome. To implement this strategy, we develop a Bayesian method that builds a multivariate prior for the variable selection indicators based on the variance-covariance of the outcomes. We show via simulation that the proposed variable selection strategy can boost power to detect subtle effects without increasing the probability of false discoveries. We apply the approach to the Normative Aging Study (NAS) epigenetic data and identify a subset of five genes in the asthma pathway for which gene-specific DNA methylations are associated with exposures to either black carbon, a marker of traffic pollution, or sulfate, a marker of particles generated by power plants. © 2016, The International Biometric Society.
Model-Averaged ℓ1 Regularization using Markov Chain Monte Carlo Model Composition
Fraley, Chris; Percival, Daniel
2014-01-01
Bayesian Model Averaging (BMA) is an effective technique for addressing model uncertainty in variable selection problems. However, current BMA approaches have computational difficulty dealing with data in which there are many more measurements (variables) than samples. This paper presents a method for combining ℓ1 regularization and Markov chain Monte Carlo model composition techniques for BMA. By treating the ℓ1 regularization path as a model space, we propose a method to resolve the model uncertainty issues arising in model averaging from solution path point selection. We show that this method is computationally and empirically effective for regression and classification in high-dimensional datasets. We apply our technique in simulations, as well as to some applications that arise in genomics. PMID:25642001
Genetic signatures of natural selection in a model invasive ascidian
NASA Astrophysics Data System (ADS)
Lin, Yaping; Chen, Yiyong; Yi, Changho; Fong, Jonathan J.; Kim, Won; Rius, Marc; Zhan, Aibin
2017-03-01
Invasive species represent promising models to study species’ responses to rapidly changing environments. Although local adaptation frequently occurs during contemporary range expansion, the associated genetic signatures at both population and genomic levels remain largely unknown. Here, we use genome-wide gene-associated microsatellites to investigate genetic signatures of natural selection in a model invasive ascidian, Ciona robusta. Population genetic analyses of 150 individuals sampled in Korea, New Zealand, South Africa and Spain showed significant genetic differentiation among populations. Based on outlier tests, we found high incidence of signatures of directional selection at 19 loci. Hitchhiking mapping analyses identified 12 directional selective sweep regions, and all selective sweep windows on chromosomes were narrow (~8.9 kb). Further analyses identified 132 candidate genes under selection. When we compared our genetic data and six crucial environmental variables, 16 putatively selected loci showed significant correlation with these environmental variables. This suggests that the local environmental conditions have left significant signatures of selection at both population and genomic levels. Finally, we identified “plastic” genomic regions and genes that are promising regions to investigate evolutionary responses to rapid environmental change in C. robusta.
Advanced colorectal neoplasia risk stratification by penalized logistic regression.
Lin, Yunzhi; Yu, Menggang; Wang, Sijian; Chappell, Richard; Imperiale, Thomas F
2016-08-01
Colorectal cancer is the second leading cause of death from cancer in the United States. To facilitate the efficiency of colorectal cancer screening, there is a need to stratify risk for colorectal cancer among the 90% of US residents who are considered "average risk." In this article, we investigate such risk stratification rules for advanced colorectal neoplasia (colorectal cancer and advanced, precancerous polyps). We use a recently completed large cohort study of subjects who underwent a first screening colonoscopy. Logistic regression models have been used in the literature to estimate the risk of advanced colorectal neoplasia based on quantifiable risk factors. However, logistic regression may be prone to overfitting and instability in variable selection. Since most of the risk factors in our study have several categories, it was tempting to collapse these categories into fewer risk groups. We propose a penalized logistic regression method that automatically and simultaneously selects variables, groups categories, and estimates their coefficients by penalizing the [Formula: see text]-norm of both the coefficients and their differences. Hence, it encourages sparsity in the categories, i.e. grouping of the categories, and sparsity in the variables, i.e. variable selection. We apply the penalized logistic regression method to our data. The important variables are selected, with close categories simultaneously grouped, by penalized regression models with and without the interaction terms. The models are validated with 10-fold cross-validation. The receiver operating characteristic curves of the penalized regression models dominate the receiver operating characteristic curve of naive logistic regressions, indicating a superior discriminative performance. © The Author(s) 2013.
NASA Astrophysics Data System (ADS)
Shan, Jiajia; Wang, Xue; Zhou, Hao; Han, Shuqing; Riza, Dimas Firmanda Al; Kondo, Naoshi
2018-04-01
Synchronous fluorescence spectra, combined with multivariate analysis, were used to predict flavonoids content in green tea rapidly and nondestructively. This paper presented a new and efficient spectral interval selection method called clustering based partial least squares (CL-PLS), which selected informative wavelengths by combining the clustering concept and partial least squares (PLS) methods to improve the performance of models built on synchronous fluorescence spectra. The fluorescence spectra of tea samples were obtained, k-means and Kohonen self-organizing map clustering algorithms were carried out to cluster full spectra into several clusters, and a sub-PLS regression model was developed on each cluster. Finally, CL-PLS models consisting of gradually selected clusters were built. Correlation coefficient (R) was used to evaluate the effect on prediction performance of PLS models. In addition, variable influence on projection partial least squares (VIP-PLS), selectivity ratio partial least squares (SR-PLS), interval partial least squares (iPLS) models and the full spectra PLS model were investigated and the results were compared. The results showed that CL-PLS presented the best result for flavonoids prediction using synchronous fluorescence spectra.
Treatment Selection in Depression.
Cohen, Zachary D; DeRubeis, Robert J
2018-05-07
Mental health researchers and clinicians have long sought answers to the question "What works for whom?" The goal of precision medicine is to provide evidence-based answers to this question. Treatment selection in depression aims to help each individual receive the treatment, among the available options, that is most likely to lead to a positive outcome for them. Although patient variables that are predictive of response to treatment have been identified, this knowledge has not yet translated into real-world treatment recommendations. The Personalized Advantage Index (PAI) and related approaches combine information obtained prior to the initiation of treatment into multivariable prediction models that can generate individualized predictions to help clinicians and patients select the right treatment. With increasing availability of advanced statistical modeling approaches, as well as novel predictive variables and big data, treatment selection models promise to contribute to improved outcomes in depression.
Penalized regression procedures for variable selection in the potential outcomes framework
Ghosh, Debashis; Zhu, Yeying; Coffman, Donna L.
2015-01-01
A recent topic of much interest in causal inference is model selection. In this article, we describe a framework in which to consider penalized regression approaches to variable selection for causal effects. The framework leads to a simple ‘impute, then select’ class of procedures that is agnostic to the type of imputation algorithm as well as penalized regression used. It also clarifies how model selection involves a multivariate regression model for causal inference problems, and that these methods can be applied for identifying subgroups in which treatment effects are homogeneous. Analogies and links with the literature on machine learning methods, missing data and imputation are drawn. A difference LASSO algorithm is defined, along with its multiple imputation analogues. The procedures are illustrated using a well-known right heart catheterization dataset. PMID:25628185
Linear and nonlinear variable selection in competing risks data.
Ren, Xiaowei; Li, Shanshan; Shen, Changyu; Yu, Zhangsheng
2018-06-15
The subdistribution hazard model for competing risks data has been applied extensively in clinical research. Variable selection methods of linear effects for competing risks data have been studied in the past decade. There is no existing work on selection of potential nonlinear effects for the subdistribution hazard model. We propose a two-stage procedure to select the linear and nonlinear covariate(s) simultaneously and estimate the selected covariate effect(s). We use a spectral decomposition approach to distinguish the linear and nonlinear parts of each covariate and adaptive LASSO to select each of the 2 components. Extensive numerical studies are conducted to demonstrate that the proposed procedure can achieve good selection accuracy in the first stage and small estimation biases in the second stage. The proposed method is applied to analyze a cardiovascular disease data set with competing death causes. Copyright © 2018 John Wiley & Sons, Ltd.
Theory and design of variable conductance heat pipes
NASA Technical Reports Server (NTRS)
Marcus, B. D.
1972-01-01
A comprehensive review and analysis of all aspects of heat pipe technology pertinent to the design of self-controlled, variable conductance devices for spacecraft thermal control is presented. Subjects considered include hydrostatics, hydrodynamics, heat transfer into and out of the pipe, fluid selection, materials compatibility and variable conductance control techniques. The report includes a selected bibliography of pertinent literature, analytical formulations of various models and theories describing variable conductance heat pipe behavior, and the results of numerous experiments on the steady state and transient performance of gas controlled variable conductance heat pipes. Also included is a discussion of VCHP design techniques.
Tidholm, A; Höglund, K; Häggström, J; Ljungvall, I
2015-01-01
Pulmonary hypertension (PH) is commonly associated with myxomatous mitral valve disease (MMVD). Because dogs with PH present without measurable tricuspid regurgitation (TR), it would be useful to investigate echocardiographic variables that can identify PH. To investigate associations between estimated systolic TR pressure gradient (TRPG) and dog characteristics and selected echocardiographic variables. 156 privately owned dogs. Prospective observational study comparing the estimations of TRPG with dog characteristics and selected echocardiographic variables in dogs with MMVD and measurable TR. Tricuspid regurgitation pressure gradient was significantly (P < .05) associated with body weight corrected right (RVIDDn) and left (LVIDDn) ventricular end-diastolic and systolic (LVIDSn) internal diameters, pulmonary arterial (PA) acceleration to deceleration time ratio (AT/DT), heart rate, left atrial to aortic root ratio (LA/Ao), and the presence of congestive heart failure. Four variables remained significant in the multiple regression analysis with TRPG as a dependent variable: modeled as linear variables, LA/Ao (P < .0001) and RVIDDn (P = .041); modeled as second order polynomial variables, AT/DT (P = .0039) and LVIDDn (P < .0001). The adjusted R(2) value for the final model was 0.45 and receiver operating characteristic curve analysis suggested the model's performance to predict PH, defined as 36, 45, and 55 mmHg as fair (area under the curve [AUC] = 0.80), good (AUC = 0.86), and excellent (AUC = 0.92), respectively. In dogs with MMVD, the presence of PH might be suspected with the combination of decreased PA AT/DT, increased RVIDDn and LA/Ao, and a small or great LVIDDn. Copyright © 2015 The Authors Journal of Veterinary Internal Medicine published by Wiley Periodicals, Inc. on behalf of the American College of Veterinary Internal Medicine.
NASA Astrophysics Data System (ADS)
Milovančević, Miloš; Nikolić, Vlastimir; Anđelković, Boban
2017-01-01
Vibration-based structural health monitoring is widely recognized as an attractive strategy for early damage detection in civil structures. Vibration monitoring and prediction is important for any system since it can avert many unpredictable behaviors of the system. If the vibration monitoring is properly managed, it can ensure economical and safe operation. Potentials for further improvement of vibration monitoring lie in the improvement of current control strategies. One of the options is the introduction of model predictive control. Multistep ahead predictive models of vibration are a starting point for creating a successful model predictive strategy. For the purpose of this article, predictive models are created for vibration monitoring of planetary power transmissions in pellet mills. The models were developed using the novel method based on ANFIS (adaptive neuro fuzzy inference system). The aim of this study is to investigate the potential of ANFIS for selecting the most relevant variables for predictive models of vibration monitoring of pellet mills power transmission. The vibration data are collected by PIC (Programmable Interface Controller) microcontrollers. The goal of the predictive vibration monitoring of planetary power transmissions in pellet mills is to indicate deterioration in the vibration of the power transmissions before the actual failure occurs. The ANFIS process for variable selection was implemented in order to detect the predominant variables affecting the prediction of vibration monitoring. It was also used to select the minimal input subset of variables from the initial set of input variables - current and lagged variables (up to 11 steps) of vibration. The obtained results could be used for simplification of predictive methods so as to avoid multiple input variables. It was preferable to use models with fewer inputs because of the risk of overfitting between training and testing data. While the obtained results are promising, further work is required in order to get results that could be directly applied in practice.
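The candidate input set described above (the current value plus lags of up to 11 steps) can be assembled as a lag matrix before any selection method is applied. The Python sketch below shows one way to build it; the function name, the one-step-ahead target and the toy series are assumptions for illustration, and the ANFIS selection step itself is not reproduced.

```python
import numpy as np


def lagged_matrix(series, max_lag):
    """Rows are [x_t, x_{t-1}, ..., x_{t-max_lag}] for t = max_lag .. len(series)-1."""
    x = np.asarray(series, dtype=float)
    n = len(x)
    return np.column_stack([x[max_lag - k: n - k] for k in range(max_lag + 1)])


# Toy series standing in for a vibration signal (hypothetical data).
x = np.arange(20, dtype=float)
max_lag = 11

lags = lagged_matrix(x, max_lag)   # 12 candidate inputs per row
features = lags[:-1]               # drop the last row: it has no next value
target = x[max_lag + 1:]           # one-step-ahead value to predict

print(features.shape, target.shape)
print(features[0], "->", target[0])
```

A variable selection method would then search over the 12 columns of `features` for the smallest subset that predicts `target` well.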
Pereira, Hebert Vinicius; Amador, Victória Silva; Sena, Marcelo Martins; Augusti, Rodinei; Piccin, Evandro
2016-10-12
Paper spray mass spectrometry (PS-MS) combined with partial least squares discriminant analysis (PLS-DA) was applied for the first time in a forensic context to a fast and effective differentiation of beers. Eight different brands of American standard lager beers produced by four different breweries (141 samples from 55 batches) were studied with the aim of performing a differentiation according to their market prices. The three leading brands in the Brazilian beer market, which have been subject to fraud, were modeled as the higher-price class, while the five brands most used for counterfeiting were modeled as the lower-price class. Parameters affecting the paper spray ionization were examined and optimized. The best MS signal stability and intensity was obtained while using the positive ion mode, with PS(+) mass spectra characterized by intense pairs of signals corresponding to sodium and potassium adducts of malto-oligosaccharides. Discrimination was apparent neither by visual inspection nor by principal component analysis (PCA). However, supervised classification models provided high rates of sensitivity and specificity. A PLS-DA model using full scan mass spectra was improved by variable selection with ordered predictors selection (OPS), providing a 100% reliability rate and reducing the number of variables from 1701 to 60. This model was interpreted by detecting fifteen variables as the most significant VIP (variable importance in projection) scores, which were therefore considered diagnostic ions for this type of beer counterfeit. Copyright © 2016 Elsevier B.V. All rights reserved.
Development of LACIE CCEA-1 weather/wheat yield models. [regression analysis
NASA Technical Reports Server (NTRS)
Strommen, N. D.; Sakamoto, C. M.; Leduc, S. K.; Umberger, D. E. (Principal Investigator)
1979-01-01
The advantages and disadvantages of the causal (phenological, dynamic, physiological), statistical regression, and analog approaches to modeling for grain yield are examined. Given LACIE's primary goal of estimating wheat production for the large areas of eight major wheat-growing regions, the statistical regression approach of correlating historical yield and climate data offered the Center for Climatic and Environmental Assessment the greatest potential return within the constraints of time and data sources. The basic equation for the first generation wheat-yield model is given. Topics discussed include truncation, trend variable, selection of weather variables, episodic events, strata selection, operational data flow, weighting, and model results.
Pleiotropic Models of Polygenic Variation, Stabilizing Selection, and Epistasis
Gavrilets, S.; de-Jong, G.
1993-01-01
We show that in polymorphic populations many polygenic traits pleiotropically related to fitness are expected to be under apparent "stabilizing selection" independently of the real selection acting on the population. This occurs, for example, if the genetic system is at a stable polymorphic equilibrium determined by selection and the nonadditive contributions of the loci to the trait value either are absent, or are random and independent of those to fitness. Stabilizing selection is also observed if the polygenic system is at an equilibrium determined by a balance between selection and mutation (or migration) when both additive and nonadditive contributions of the loci to the trait value are random and independent of those to fitness. We also compare different viability models that can maintain genetic variability at many loci with respect to their ability to account for the strong stabilizing selection on an additive trait. Let V(m) be the genetic variance supplied by mutation (or migration) each generation, V(g) be the genotypic variance maintained in the population, and n be the number of the loci influencing fitness. We demonstrate that in mutation (migration)-selection balance models the strength of apparent stabilizing selection is order V(m)/V(g). In the overdominant model and in the symmetric viability model the strength of apparent stabilizing selection is approximately 1/(2n) that of total selection on the whole phenotype. We show that a selection system that involves pairwise additive by additive epistasis in maintaining variability can lead to a lower genetic load and genetic variance in fitness (approximately 1/(2n) times) than an equivalent selection system that involves overdominance. We show that, in the epistatic model, the apparent stabilizing selection on an additive trait can be as strong as the total selection on the whole phenotype. PMID:8325491
NASA Astrophysics Data System (ADS)
Mao, Zhiyi; Shan, Ruifeng; Wang, Jiajun; Cai, Wensheng; Shao, Xueguang
2014-07-01
Polyphenols in plant samples have been extensively studied because phenolic compounds are ubiquitous in plants and can be used as antioxidants in promoting human health. A method for rapid determination of three phenolic compounds (chlorogenic acid, scopoletin and rutin) in plant samples using near-infrared diffuse reflectance spectroscopy (NIRDRS) is studied in this work. Partial least squares (PLS) regression was used for building the calibration models, and the effects of spectral preprocessing and variable selection on the models are investigated for optimization of the models. The results show that individual spectral preprocessing and variable selection has no or slight influence on the models, but the combination of the techniques can significantly improve the models. The combination of continuous wavelet transform (CWT) for removing the variant background, multiplicative scatter correction (MSC) for correcting the scattering effect and randomization test (RT) for selecting the informative variables was found to be the best way for building the optimal models. For validation of the models, the polyphenol contents in an independent sample set were predicted. The correlation coefficients between the predicted values and the contents determined by high performance liquid chromatography (HPLC) analysis are as high as 0.964, 0.948 and 0.934 for chlorogenic acid, scopoletin and rutin, respectively.
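Of the preprocessing steps named above, multiplicative scatter correction (MSC) is simple enough to sketch. The Python example below uses the common textbook form, in which each spectrum is regressed on a reference spectrum (here the mean spectrum, an assumption) and the fitted offset and slope are removed; the function name and the synthetic spectra are illustrative only, and the CWT and randomization-test steps are not reproduced.

```python
import numpy as np


def msc(spectra, reference=None):
    """Multiplicative scatter correction: remove per-spectrum offset and gain.

    Each spectrum s is fitted as s ~ intercept + slope * reference and
    replaced by (s - intercept) / slope.
    """
    spectra = np.asarray(spectra, dtype=float)
    ref = spectra.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(spectra)
    for i, s in enumerate(spectra):
        slope, intercept = np.polyfit(ref, s, 1)  # degree-1 fit: [slope, intercept]
        corrected[i] = (s - intercept) / slope
    return corrected


# Synthetic spectra (hypothetical): one underlying shape with different
# additive offsets and multiplicative gains mimicking scatter effects.
grid = np.linspace(0.0, 1.0, 50)
base = np.sin(2.0 * np.pi * grid) + 2.0
offsets = [0.0, 0.3, -0.2]
gains = [1.0, 1.4, 0.7]
spectra = np.array([a + b * base for a, b in zip(offsets, gains)])

corrected = msc(spectra)
print(np.ptp(spectra, axis=0).max())    # spread between raw spectra
print(np.ptp(corrected, axis=0).max())  # spread after MSC, close to zero here
```

In this idealized case the spectra differ only by offset and gain, so MSC collapses them onto the mean spectrum; real spectra also carry chemical variation, which MSC is meant to leave in place.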
Bayesian block-diagonal variable selection and model averaging
Papaspiliopoulos, O.; Rossell, D.
2018-01-01
Summary: We propose a scalable algorithmic framework for exact Bayesian variable selection and model averaging in linear models under the assumption that the Gram matrix is block-diagonal, and as a heuristic for exploring the model space for general designs. In block-diagonal designs our approach returns the most probable model of any given size without resorting to numerical integration. The algorithm also provides a novel and efficient solution to the frequentist best subset selection problem for block-diagonal designs. Posterior probabilities for any number of models are obtained by evaluating a single one-dimensional integral, and other quantities of interest such as variable inclusion probabilities and model-averaged regression estimates are obtained by an adaptive, deterministic one-dimensional numerical integration. The overall computational cost scales linearly with the number of blocks, which can be processed in parallel, and exponentially with the block size, rendering it most adequate in situations where predictors are organized in many moderately-sized blocks. For general designs, we approximate the Gram matrix by a block-diagonal matrix using spectral clustering and propose an iterative algorithm that capitalizes on the block-diagonal algorithms to explore efficiently the model space. All methods proposed in this paper are implemented in the R library mombf. PMID:29861501
The Variables Affecting the Success of Students
ERIC Educational Resources Information Center
Savas, Behsat; Gurel, Ramazan
2014-01-01
The aim of this study is to determine the variables affecting the success of students. This research, which was conducted through the relational screening model, has a sampling of students who were selected from a middle city in Turkey. The schools are classified into three as low, medium and high. A total of 3491 students are selected by using…
Selecting climate change scenarios using impact-relevant sensitivities
Julie A. Vano; John B. Kim; David E. Rupp; Philip W. Mote
2015-01-01
Climate impact studies often require the selection of a small number of climate scenarios. Ideally, a subset would have simulations that both (1) appropriately represent the range of possible futures for the variable/s most important to the impact under investigation and (2) come from global climate models (GCMs) that provide plausible results for future climate in the...
NASA Astrophysics Data System (ADS)
Seo, Seung Beom; Kim, Young-Oh; Kim, Youngil; Eum, Hyung-Il
2018-04-01
When selecting a subset of climate change scenarios (GCM models), the priority is to ensure that the subset reflects the comprehensive range of possible model results for all variables concerned. Though many studies have attempted to improve the scenario selection, there is a lack of studies that discuss methods to ensure that the results from a subset of climate models contain the same range of uncertainty in hydrologic variables as when all models are considered. We applied the Katsavounidis-Kuo-Zhang (KKZ) algorithm to select a subset of climate change scenarios and demonstrated its ability to reduce the number of GCM models in an ensemble, while the ranges of multiple climate extremes indices were preserved. First, we analyzed the role of 27 ETCCDI climate extremes indices for scenario selection and selected the representative climate extreme indices. Before the selection of a subset, we excluded a few deficient GCM models that could not represent the observed climate regime. Subsequently, we discovered that a subset of GCM models selected by the KKZ algorithm with the representative climate extreme indices could not capture the full potential range of changes in hydrologic extremes (e.g., 3-day peak flow and 7-day low flow) in some regional case studies. However, the application of the KKZ algorithm with a different set of climate indices, which are correlated to the hydrologic extremes, enabled the overcoming of this limitation. Key climate indices, dependent on the hydrologic extremes to be projected, must therefore be determined prior to the selection of a subset of GCM models.
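The max-min selection idea behind the KKZ algorithm can be sketched as follows. Each ensemble member is assumed to be a vector of (already standardized) climate indices. The seeding rule used here, starting from the member closest to the ensemble centroid, is an assumption of this sketch, as is the function name `kkz_select`. The abstract does not specify these details.

```python
import math

def kkz_select(points, k):
    """Return indices of k members chosen by a KKZ-style max-min rule."""
    dim = len(points[0])
    centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    # Assumed seeding rule: begin with the member nearest the centroid
    first = min(range(len(points)),
                key=lambda i: math.dist(points[i], centroid))
    selected = [first]
    while len(selected) < min(k, len(points)):
        remaining = [i for i in range(len(points)) if i not in selected]
        # Add the member whose nearest already-selected member is farthest away
        nxt = max(remaining,
                  key=lambda i: min(math.dist(points[i], points[j])
                                    for j in selected))
        selected.append(nxt)
    return selected

# Four hypothetical ensemble members described by two indices
members = [(0.0, 0.0), (10.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
subset = kkz_select(members, 3)
```

Because every new member maximizes its distance to the members already chosen, the subset tends to span the range of the full ensemble, which is the property the abstract relies on.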
Covariate selection with group lasso and doubly robust estimation of causal effects
Koch, Brandon; Vock, David M.; Wolfson, Julian
2017-01-01
Summary The efficiency of doubly robust estimators of the average causal effect (ACE) of a treatment can be improved by including in the treatment and outcome models only those covariates which are related to both treatment and outcome (i.e., confounders) or related only to the outcome. However, it is often challenging to identify such covariates among the large number that may be measured in a given study. In this paper, we propose GLiDeR (Group Lasso and Doubly Robust Estimation), a novel variable selection technique for identifying confounders and predictors of outcome using an adaptive group lasso approach that simultaneously performs coefficient selection, regularization, and estimation across the treatment and outcome models. The selected variables and corresponding coefficient estimates are used in a standard doubly robust ACE estimator. We provide asymptotic results showing that, for a broad class of data generating mechanisms, GLiDeR yields a consistent estimator of the ACE when either the outcome or treatment model is correctly specified. A comprehensive simulation study shows that GLiDeR is more efficient than doubly robust methods using standard variable selection techniques and has substantial computational advantages over a recently proposed doubly robust Bayesian model averaging method. We illustrate our method by estimating the causal treatment effect of bilateral versus single-lung transplant on forced expiratory volume in one year after transplant using an observational registry. PMID:28636276
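For readers unfamiliar with doubly robust estimation, the sketch below shows the widely used augmented inverse-probability-weighted (AIPW) form of the ACE estimator. Whether GLiDeR uses exactly this form is an assumption. The function and argument names are hypothetical, and the propensity scores and outcome predictions are taken as given rather than fitted.

```python
def aipw_ace(y, a, e, m1, m0):
    """AIPW estimate of the average causal effect.

    y: observed outcomes; a: treatment indicators (0 or 1);
    e: estimated propensity scores P(A=1 | X);
    m1, m0: outcome-model predictions under treatment and under control.
    """
    n = len(y)
    # Outcome-model prediction plus an inverse-probability-weighted residual
    mu1 = sum(ai * (yi - m1i) / ei + m1i
              for yi, ai, ei, m1i in zip(y, a, e, m1)) / n
    mu0 = sum((1 - ai) * (yi - m0i) / (1 - ei) + m0i
              for yi, ai, ei, m0i in zip(y, a, e, m0)) / n
    return mu1 - mu0

# Toy data in which the outcome model is exactly right
y = [3.0, 1.0, 4.0, 2.0]
a = [1, 0, 1, 0]
e = [0.5, 0.5, 0.5, 0.5]
m1 = [3.0, 3.0, 4.0, 4.0]
m0 = [1.0, 1.0, 2.0, 2.0]
ace = aipw_ace(y, a, e, m1, m0)
```

When the outcome model is correct the residual terms vanish, and when the propensity model is correct they cancel the outcome-model bias on average. This is the sense in which the estimator is consistent if either model is correctly specified.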
Covariate selection with group lasso and doubly robust estimation of causal effects.
Koch, Brandon; Vock, David M; Wolfson, Julian
2018-03-01
The efficiency of doubly robust estimators of the average causal effect (ACE) of a treatment can be improved by including in the treatment and outcome models only those covariates which are related to both treatment and outcome (i.e., confounders) or related only to the outcome. However, it is often challenging to identify such covariates among the large number that may be measured in a given study. In this article, we propose GLiDeR (Group Lasso and Doubly Robust Estimation), a novel variable selection technique for identifying confounders and predictors of outcome using an adaptive group lasso approach that simultaneously performs coefficient selection, regularization, and estimation across the treatment and outcome models. The selected variables and corresponding coefficient estimates are used in a standard doubly robust ACE estimator. We provide asymptotic results showing that, for a broad class of data generating mechanisms, GLiDeR yields a consistent estimator of the ACE when either the outcome or treatment model is correctly specified. A comprehensive simulation study shows that GLiDeR is more efficient than doubly robust methods using standard variable selection techniques and has substantial computational advantages over a recently proposed doubly robust Bayesian model averaging method. We illustrate our method by estimating the causal treatment effect of bilateral versus single-lung transplant on forced expiratory volume in one year after transplant using an observational registry. © 2017, The International Biometric Society.
Selection Index in the Study of Adaptability and Stability in Maize
Lunezzo de Oliveira, Rogério; Garcia Von Pinho, Renzo; Furtado Ferreira, Daniel; Costa Melo, Wagner Mateus
2014-01-01
This paper proposes an alternative method for evaluating the stability and adaptability of maize hybrids using a genotype-ideotype distance index (GIDI) for selection. Data from seven variables were used, obtained through evaluation of 25 maize hybrids at six sites in southern Brazil. The GIDI was estimated by means of the generalized Mahalanobis distance for each plot of the test. We then proceeded to GGE biplot analysis in order to compare the predictive accuracy of the GGE models and the grouping of environments and to select the best five hybrids. The G × E interaction was significant for both variables assessed. The GGE model with two principal components obtained a predictive accuracy (PRECORR) of 0.8913 for the GIDI and 0.8709 for yield (t ha−1). Two groups of environments were obtained upon analyzing the GIDI, whereas all the environments remained in the same group upon analyzing yield. Coincidence occurred in only two hybrids considering evaluation of the two features. The GIDI assessment provided for selection of hybrids that combine adaptability and stability in most of the variables assessed, making its use more highly recommended than analyzing each variable separately. Not all the higher-yielding hybrids were the best in the other variables assessed. PMID:24696641
Artificial neural network model for ozone concentration estimation and Monte Carlo analysis
NASA Astrophysics Data System (ADS)
Gao, Meng; Yin, Liting; Ning, Jicai
2018-07-01
Air pollution in the urban atmosphere directly affects public health; it is therefore essential to predict air pollutant concentrations. Air quality is a complex function of emissions, meteorology and topography, and artificial neural networks (ANNs) provide a sound framework for relating these variables. In this study, we investigated the feasibility of using an ANN model with meteorological parameters as input variables to predict ozone concentration in the urban area of Jinan, a metropolis in Northern China. We first found that the architecture of the network of neurons had little effect on the predictive capability of the ANN model. A parsimonious ANN model with 6 routinely monitored meteorological parameters and one temporal covariate (the category of day, i.e. working day, legal holiday and regular weekend) as input variables was identified, where the 7 input variables were selected following the forward selection procedure. Compared with the benchmark ANN model with 9 meteorological and photochemical parameters as input variables, the predictive capability of the parsimonious ANN model was acceptable. Its predictive capability was also verified in terms of warming success ratio during the pollution episodes. Finally, uncertainty and sensitivity analyses were also performed based on Monte Carlo simulations (MCS). It was concluded that the ANN could properly predict the ambient ozone level. Maximum temperature, atmospheric pressure, sunshine duration and maximum wind speed were identified as the predominant input variables significantly influencing the prediction of ambient ozone concentrations.
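The forward selection procedure mentioned in this abstract can be sketched generically. The sketch assumes a user-supplied `score(subset)` function (for example, validation skill of an ANN trained on that subset) where higher is better. The variable names and the toy scoring function below are hypothetical and are not taken from the study.

```python
def forward_selection(candidates, score, max_vars=None):
    """Greedy forward selection: repeatedly add the variable that helps most."""
    selected = []
    best_score = score(selected)
    limit = len(candidates) if max_vars is None else max_vars
    while len(selected) < limit:
        remaining = [c for c in candidates if c not in selected]
        if not remaining:
            break
        trial_scores = {c: score(selected + [c]) for c in remaining}
        best_candidate = max(trial_scores, key=trial_scores.get)
        if trial_scores[best_candidate] <= best_score:
            break  # no remaining variable improves the score
        selected.append(best_candidate)
        best_score = trial_scores[best_candidate]
    return selected, best_score

# Hypothetical stand-in for model skill: a gain per variable minus a size penalty
GAINS = {"tmax": 0.5, "pressure": 0.3, "humidity": 0.0}

def toy_score(subset):
    return sum(GAINS[v] for v in subset) - 0.05 * len(subset)

chosen, final_score = forward_selection(list(GAINS), toy_score)
```

In practice the expensive part is `score`, since each call would retrain and validate a model. The greedy loop itself needs at most on the order of the squared number of candidates such calls.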
Cao, Hongbao; Duan, Junbo; Lin, Dongdong; Shugart, Yin Yao; Calhoun, Vince; Wang, Yu-Ping
2014-11-15
Integrative analysis of multiple data types can take advantage of their complementary information and therefore may provide higher power to identify potential biomarkers that would be missed using individual data analysis. Because the data modalities differ in nature, data integration is challenging. Here we address the data integration problem by developing a generalized sparse model (GSM) using weighting factors to integrate multi-modality data for biomarker selection. As an example, we applied the GSM model to a joint analysis of two types of schizophrenia data sets: 759,075 SNPs and 153,594 functional magnetic resonance imaging (fMRI) voxels in 208 subjects (92 cases/116 controls). To solve this small-sample-large-variable problem, we developed a novel sparse representation based variable selection (SRVS) algorithm, with the primary aim to identify biomarkers associated with schizophrenia. To validate the effectiveness of the selected variables, we performed multivariate classification followed by a ten-fold cross validation. We compared our proposed SRVS algorithm with an earlier sparse model based variable selection algorithm for integrated analysis. In addition, we compared it with traditional statistical methods for univariate data analysis (Chi-squared test for SNP data and ANOVA for fMRI data). Results showed that our proposed SRVS method can identify novel biomarkers that show stronger capability in distinguishing schizophrenia patients from healthy controls. Moreover, better classification ratios were achieved using biomarkers from both types of data, suggesting the importance of integrative analysis. Copyright © 2014 Elsevier Inc. All rights reserved.
A discriminant function model for admission at undergraduate university level
NASA Astrophysics Data System (ADS)
Ali, Hamdi F.; Charbaji, Abdulrazzak; Hajj, Nada Kassim
1992-09-01
The study is aimed at predicting objective criteria based on a statistically tested model for admitting undergraduate students to Beirut University College. The University is faced with a dual problem of having to select only a fraction of an increasing number of applicants, and of trying to minimize the number of students placed on academic probation (currently 36 percent of new admissions). Out of 659 new students, a sample of 272 students (45 percent) were selected; these were all the students on the Dean's list and on academic probation. With academic performance as the dependent variable, the model included ten independent variables and their interactions. These variables included the type of high school, the language of instruction in high school, recommendations, sex, academic average in high school, score on the English Entrance Examination, the major in high school, and whether the major was originally applied for by the student. Discriminant analysis was used to evaluate the relative weight of the independent variables, and from the analysis three equations were developed, one for each academic division in the College. The predictive power of these equations was tested by using them to classify students not in the selected sample into successful and unsuccessful ones. Applicability of the model to other institutions of higher learning is discussed.
NASA Astrophysics Data System (ADS)
Bhattacharyya, Sidhakam; Bandyopadhyay, Gautam
2010-10-01
The councils of most Urban Local Bodies (ULBs) have limited scope for decision making in the absence of an appropriate financial control mechanism. Information about the expected amount of own fund during a particular period is of great importance for decision making. Therefore, in this paper, we present a set of findings and establish a model for estimating receipts from own sources and payments thereof using multiple regression analysis. Data for sixty months from a reputed ULB in West Bengal have been considered for ascertaining the regression models. This can be used as a part of the financial management and control procedure by the council to estimate the effect on own fund. In our study we have considered two models using multiple regression analysis. "Model I" comprises total adjusted receipt as the dependent variable and selected individual receipts as the independent variables. Similarly, "Model II" consists of total adjusted payments as the dependent variable and selected individual payments as independent variables. The resultant of Model I and Model II is the surplus or deficit affecting own fund. This may be applied for decision making purposes by the council.
A survey of variable selection methods in two Chinese epidemiology journals
2010-01-01
Background: Although much has been written on developing better procedures for variable selection, there is little research on how it is practiced in actual studies. This review surveys the variable selection methods reported in two high-ranking Chinese epidemiology journals. Methods: Articles published in 2004, 2006, and 2008 in the Chinese Journal of Epidemiology and the Chinese Journal of Preventive Medicine were reviewed. Five categories of methods were identified whereby variables were selected using: A - bivariate analyses; B - multivariable analysis; e.g. stepwise or individual significance testing of model coefficients; C - first bivariate analyses, followed by multivariable analysis; D - bivariate analyses or multivariable analysis; and E - other criteria like prior knowledge or personal judgment. Results: Among the 287 articles that reported using variable selection methods, 6%, 26%, 30%, 21%, and 17% were in categories A through E, respectively. One hundred sixty-three studies selected variables using bivariate analyses, 80% (130/163) via multiple significance testing at the 5% alpha-level. Of the 219 multivariable analyses, 97 (44%) used stepwise procedures, 89 (41%) tested individual regression coefficients, but 33 (15%) did not mention how variables were selected. Sixty percent (58/97) of the stepwise routines also did not specify the algorithm and/or significance levels. Conclusions: The variable selection methods reported in the two journals were limited in variety, and details were often missing. Many studies still relied on problematic techniques like stepwise procedures and/or multiple testing of bivariate associations at the 0.05 alpha-level. These deficiencies should be rectified to safeguard the scientific validity of articles published in Chinese epidemiology journals. PMID:20920252
Recurrent personality dimensions in inclusive lexical studies: indications for a big six structure.
Saucier, Gerard
2009-10-01
Previous evidence for both the Big Five and the alternative six-factor model has been drawn from lexical studies with relatively narrow selections of attributes. This study examined factors from previous lexical studies using a wider selection of attributes in 7 languages (Chinese, English, Filipino, Greek, Hebrew, Spanish, and Turkish) and found 6 recurrent factors, each with common conceptual content across most of the studies. The previous narrow-selection-based six-factor model outperformed the Big Five in capturing the content of the 6 recurrent wideband factors. Adjective markers of the 6 recurrent wideband factors showed substantial incremental prediction of important criterion variables over and above the Big Five. Correspondence between wideband 6 and narrowband 6 factors indicates they are variants of a "Big Six" model that is more general across variable-selection procedures and may be more general across languages and populations.
Is hyporheic flow an indicator for salmonid spawning site selection?
NASA Astrophysics Data System (ADS)
Benjankar, R. M.; Tonina, D.; Marzadri, A.; McKean, J. A.; Isaak, D.
2015-12-01
Several studies have investigated the role of hydraulic variables in the selection of spawning sites by salmonids. Some recent studies suggest that the intensity of the ambient hyporheic flow (that is, the flow present in the absence of a salmon egg pocket) is a cue for spawning site selection, but others have argued against it. We tested this hypothesis by using a unique dataset of field-surveyed spawning site locations and an unprecedented meter-scale resolution bathymetry of a 13.5 km long reach of Bear Valley Creek (Idaho, USA), an important Chinook salmon spawning stream. We used a two-dimensional surface water model to quantify stream hydraulics and a three-dimensional hyporheic model to quantify the hyporheic flows. Our results show that the intensity of ambient hyporheic flows is not a statistically significant variable for spawning site selection. Conversely, the intensity of the water surface curvature and the habitat quality, quantified as a function of stream hydraulics and morphology, are the most important variables for salmonid spawning site selection. KEY WORDS: Salmonid spawning habitat, pool-riffle system, habitat quality, surface water curvature, hyporheic flow
Lecours, Vincent; Brown, Craig J; Devillers, Rodolphe; Lucieer, Vanessa L; Edinger, Evan N
2016-01-01
Selecting appropriate environmental variables is a key step in ecology. Terrain attributes (e.g. slope, rugosity) are routinely used as abiotic surrogates of species distribution and to produce habitat maps that can be used in decision-making for conservation or management. Selecting appropriate terrain attributes for ecological studies may be a challenging process that can lead users to select a subjective, potentially sub-optimal combination of attributes for their applications. The objective of this paper is to assess the impacts of subjectively selecting terrain attributes for ecological applications by comparing the performance of different combinations of terrain attributes in the production of habitat maps and species distribution models. Seven different selections of terrain attributes, alone or in combination with other environmental variables, were used to map benthic habitats of German Bank (off Nova Scotia, Canada). 29 maps of potential habitats based on unsupervised classifications of biophysical characteristics of German Bank were produced, and 29 species distribution models of sea scallops were generated using MaxEnt. The performances of the 58 maps were quantified and compared to evaluate the effectiveness of the various combinations of environmental variables. One of the combinations of terrain attributes (recommended in a related study, and including a measure of relative position, slope, two measures of orientation, topographic mean and a measure of rugosity) yielded better results than the other selections for both methodologies, confirming that they together best describe terrain properties. Important differences in performance (up to 47% in accuracy measurement) and spatial outputs (up to 58% in spatial distribution of habitats) highlighted the importance of carefully selecting variables for ecological applications. This paper demonstrates that making a subjective choice of variables may reduce map accuracy and produce maps that do not adequately represent habitats and species distributions, thus having important implications when these maps are used for decision-making.
Applications of information theory, genetic algorithms, and neural models to predict oil flow
NASA Astrophysics Data System (ADS)
Ludwig, Oswaldo; Nunes, Urbano; Araújo, Rui; Schnitman, Leizer; Lepikson, Herman Augusto
2009-07-01
This work introduces a new information-theoretic methodology for choosing variables and their time lags in a prediction setting, particularly when neural networks are used in non-linear modeling. The first contribution of this work is the Cross Entropy Function (XEF) proposed to select input variables and their lags in order to compose the input vector of black-box prediction models. The proposed XEF method is more appropriate than the usually applied Cross Correlation Function (XCF) when the relationship among the input and output signals comes from a non-linear dynamic system. The second contribution is a method that minimizes the Joint Conditional Entropy (JCE) between the input and output variables by means of a Genetic Algorithm (GA). The aim is to take into account the dependence among the input variables when selecting the most appropriate set of inputs for a prediction problem. In short, these methods can be used to assist the selection of input training data that have the necessary information to predict the target data. The proposed methods are applied to a petroleum engineering problem: predicting oil production. Experimental results obtained with a real-world dataset are presented demonstrating the feasibility and effectiveness of the method.
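For context, the sketch below shows the conventional baseline that this abstract contrasts with the proposed XEF: choosing an input lag by maximizing the absolute cross-correlation (XCF) between a lagged input and the output. It is not the XEF method itself, whose definition is not given in the abstract. The function names and the toy series are hypothetical.

```python
from statistics import mean

def cross_correlation(x, y, lag):
    """Pearson correlation between x[t - lag] and y[t], for lag >= 0."""
    xs, ys = x[: len(x) - lag], y[lag:]
    mx, my = mean(xs), mean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    var_x = sum((a - mx) ** 2 for a in xs)
    var_y = sum((b - my) ** 2 for b in ys)
    return cov / (var_x * var_y) ** 0.5

def best_lag(x, y, max_lag):
    """Lag in 0..max_lag with the largest absolute cross-correlation."""
    return max(range(max_lag + 1),
               key=lambda lag: abs(cross_correlation(x, y, lag)))

# Toy series: the output repeats the input two steps later
x = [1, 3, 2, 5, 4, 7, 6, 9, 8, 10]
y = [0, 0] + x[:-2]
lag = best_lag(x, y, max_lag=3)
```

Correlation only measures linear dependence, which is why the abstract argues for an entropy-based criterion when the underlying system is non-linear.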
Bayesian dynamic modeling of time series of dengue disease case counts.
Martínez-Bello, Daniel Adyro; López-Quílez, Antonio; Torres-Prieto, Alexander
2017-07-01
The aim of this study is to model the association between weekly time series of dengue case counts and meteorological variables, in a high-incidence city of Colombia, applying Bayesian hierarchical dynamic generalized linear models over the period January 2008 to August 2015. Additionally, we evaluate the model's short-term performance for predicting dengue cases. The methodology uses dynamic Poisson log-link models including constant or time-varying coefficients for the meteorological variables. Calendar effects were modeled using constant or first- or second-order random walk time-varying coefficients. The meteorological variables were modeled using constant coefficients and first-order random walk time-varying coefficients. We applied Markov Chain Monte Carlo simulations for parameter estimation, and the deviance information criterion (DIC) for model selection. We assessed the short-term predictive performance of the selected final model at several time points within the study period using the mean absolute percentage error. The results showed the best model including first-order random walk time-varying coefficients for calendar trend and first-order random walk time-varying coefficients for the meteorological variables. Besides the computational challenges, interpreting the results implies a complete analysis of the time series of dengue with respect to the parameter estimates of the meteorological effects. We found small values of the mean absolute percentage errors for one- or two-week out-of-sample predictions for most prediction points, associated with low volatility periods in the dengue counts. We discuss the advantages and limitations of the dynamic Poisson models for studying the association between time series of dengue disease and meteorological variables. The key conclusion of the study is that dynamic Poisson models account for the dynamic nature of the variables involved in the modeling of time series of dengue disease, producing useful models for decision-making in public health.
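The mean absolute percentage error used above to assess short-term predictions can be computed as in the following minimal sketch. Skipping weeks with zero observed cases, for which the percentage error is undefined, is an assumption of this sketch rather than something stated in the abstract. The function name and the counts are hypothetical.

```python
def mape(observed, predicted):
    """Mean absolute percentage error, in percent."""
    # Assumed convention: ignore time points with zero observed cases
    pairs = [(o, p) for o, p in zip(observed, predicted) if o != 0]
    return 100.0 * sum(abs((o - p) / o) for o, p in pairs) / len(pairs)

# Hypothetical weekly case counts and one-week-ahead predictions
error = mape([10, 20, 40], [12, 18, 40])  # (20% + 10% + 0%) / 3 = 10%
```

Because the error is relative to the observed count, the same absolute miss weighs more heavily in low-count weeks.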
Delay correlation analysis and representation for VITAL compliant VHDL models
Rich, Marvin J.; Misra, Ashutosh
2004-11-09
A method and system unbind a rise/fall tuple of a VHDL generic variable and create rise time and fall time generics of each generic variable that are independent of each other. Then, according to a predetermined correlation policy, the method and system collect delay values in a VHDL standard delay file, sort the delay values, remove duplicate delay values, group the delay values into correlation sets, and output an analysis file. The correlation policy may include collecting all generic variables in a VHDL standard delay file, selecting each generic variable, and performing reductions on the set of delay values associated with each selected generic variable.
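The processing steps listed in this abstract (unbind rise/fall tuples, collect delay values, sort, remove duplicates, group into correlation sets) can be sketched as below. The data layout, the `_rise`/`_fall` naming, and the tolerance-based grouping policy are all assumptions made for illustration. The abstract does not define the correlation policy, and reading the delay file is left out.

```python
def unbind(generics):
    """Split {name: (rise, fall)} into independent rise and fall entries."""
    entries = []
    for name, (rise, fall) in generics.items():
        entries.append((f"{name}_rise", rise))  # hypothetical naming scheme
        entries.append((f"{name}_fall", fall))
    return entries

def correlate_delays(entries, tolerance=0.0):
    """Sort delay values, drop duplicates, and group them into sets."""
    unique_values = sorted({value for _, value in entries})
    groups = []
    for value in unique_values:
        # Hypothetical policy: a value joins the current set if it lies
        # within `tolerance` of that set's smallest value
        if groups and value - groups[-1][0] <= tolerance:
            groups[-1].append(value)
        else:
            groups.append([value])
    return groups

# Hypothetical generics with (rise, fall) delays in nanoseconds
generics = {"tpd_a_y": (0.12, 0.15), "tpd_b_y": (0.12, 0.30)}
sets = correlate_delays(unbind(generics), tolerance=0.05)
```

With a tolerance of zero, every distinct delay value ends up in its own set, so the grouping step reduces to deduplication.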
Massol, François; Débarre, Florence
2015-07-01
Spatiotemporal variability of the environment is bound to affect the evolution of dispersal, and yet model predictions strongly differ on this particular effect. Recent studies on the evolution of local adaptation have shown that the life cycle chosen to model the selective effects of spatiotemporal variability of the environment is a critical factor determining evolutionary outcomes. Here, we investigate the effect of the order of events in the life cycle on the evolution of unconditional dispersal in a spatially heterogeneous, temporally varying landscape. Our results show that the occurrence of intermediate singular strategies and disruptive selection are conditioned by the temporal autocorrelation of the environment and by the life cycle. Life cycles with dispersal of adults versus dispersal of juveniles, local versus global density regulation, give radically different evolutionary outcomes that include selection for total philopatry, evolutionary bistability, selection for intermediate stable states, and evolutionary branching points. Our results highlight the importance of accounting for life-cycle specifics when predicting the effects of the environment on evolutionarily selected trait values, such as dispersal, as well as the need to check the robustness of model conclusions against modifications of the life cycle. © 2015 The Author(s). Evolution © 2015 The Society for the Study of Evolution.
An Update on Statistical Boosting in Biomedicine.
Mayr, Andreas; Hofner, Benjamin; Waldmann, Elisabeth; Hepp, Tobias; Meyer, Sebastian; Gefeller, Olaf
2017-01-01
Statistical boosting algorithms have triggered a lot of research during the last decade. They combine a powerful machine learning approach with classical statistical modelling, offering various practical advantages like automated variable selection and implicit regularization of effect estimates. They are extremely flexible, as the underlying base-learners (regression functions defining the type of effect for the explanatory variables) can be combined with any kind of loss function (target function to be optimized, defining the type of regression setting). In this review article, we highlight the most recent methodological developments on statistical boosting regarding variable selection, functional regression, and advanced time-to-event modelling. Additionally, we provide a short overview on relevant applications of statistical boosting in biomedicine.
Toward a Unified Representation of Atmospheric Convection in Variable-Resolution Climate Models
DOE Office of Scientific and Technical Information (OSTI.GOV)
Walko, Robert
2016-11-07
The purpose of this project was to improve the representation of convection in atmospheric weather and climate models that employ computational grids with spatially-variable resolution. Specifically, our work targeted models whose grids are fine enough over selected regions that convection is resolved explicitly, while over other regions the grid is coarser and convection is represented as a subgrid-scale process. The working criterion for a successful scheme for representing convection over this range of grid resolution was that identical convective environments must produce very similar convective responses (i.e., the same precipitation amount, rate, and timing, and the same modification of the atmospheric profile) regardless of grid scale. The need for such a convective scheme has increased in recent years as more global weather and climate models have adopted variable resolution meshes that are often extended into the range of resolving convection in selected locations.
Continuous-time discrete-space models for animal movement
Hanks, Ephraim M.; Hooten, Mevin B.; Alldredge, Mat W.
2015-01-01
The processes influencing animal movement and resource selection are complex and varied. Past efforts to model behavioral changes over time used Bayesian statistical models with variable parameter space, such as reversible-jump Markov chain Monte Carlo approaches, which are computationally demanding and inaccessible to many practitioners. We present a continuous-time discrete-space (CTDS) model of animal movement that can be fit using standard generalized linear modeling (GLM) methods. This CTDS approach allows for the joint modeling of location-based as well as directional drivers of movement. Changing behavior over time is modeled using a varying-coefficient framework which maintains the computational simplicity of a GLM approach, and variable selection is accomplished using a group lasso penalty. We apply our approach to a study of two mountain lions (Puma concolor) in Colorado, USA.
Talent identification and selection in elite youth football: An Australian context.
O'Connor, Donna; Larkin, Paul; Mark Williams, A
2016-10-01
We identified the perceptual-cognitive skills and player history variables that differentiate players selected or not selected into an elite youth football (i.e. soccer) programme in Australia. A sample of elite youth male football players (n = 127) completed an adapted participation history questionnaire and video-based assessments of perceptual-cognitive skills. Following data collection, 22 of these players were offered a full-time scholarship for enrolment at an elite player residential programme. Participants selected for the scholarship programme recorded superior performance on the combined perceptual-cognitive skills tests compared to the non-selected group. There were no significant between group differences on the player history variables. Stepwise discriminant function analysis identified four predictor variables that resulted in the best categorization of selected and non-selected players (i.e. recent match-play performance, region, number of other sports participated, combined perceptual-cognitive performance). The effectiveness of the discriminant function is reflected by 93.7% of players being correctly classified, with the four variables accounting for 57.6% of the variance. Our discriminating model for selection may provide a greater understanding of the factors that influence elite youth talent selection and identification.
ERIC Educational Resources Information Center
Goodyear, Rodney K.; Newcomb, Micheal D.; Locke, Thomas F.
2002-01-01
Data from a community sample of 493 pregnant Latina teenagers were used to test a mediated model of mate selection with 5 classes of variables: (a) male partner characteristics (antisocial behaviors, negative relationships with women, harm risk, and relationship length), (b) young women's psychosocial variables (antisocial behaviors, drug use,…
Genetic signatures of natural selection in a model invasive ascidian
Lin, Yaping; Chen, Yiyong; Yi, Changho; Fong, Jonathan J.; Kim, Won; Rius, Marc; Zhan, Aibin
2017-01-01
Invasive species represent promising models to study species’ responses to rapidly changing environments. Although local adaptation frequently occurs during contemporary range expansion, the associated genetic signatures at both population and genomic levels remain largely unknown. Here, we use genome-wide gene-associated microsatellites to investigate genetic signatures of natural selection in a model invasive ascidian, Ciona robusta. Population genetic analyses of 150 individuals sampled in Korea, New Zealand, South Africa and Spain showed significant genetic differentiation among populations. Based on outlier tests, we found high incidence of signatures of directional selection at 19 loci. Hitchhiking mapping analyses identified 12 directional selective sweep regions, and all selective sweep windows on chromosomes were narrow (~8.9 kb). Further analyses identified 132 candidate genes under selection. When we compared our genetic data and six crucial environmental variables, 16 putatively selected loci showed significant correlation with these environmental variables. This suggests that the local environmental conditions have left significant signatures of selection at both population and genomic levels. Finally, we identified “plastic” genomic regions and genes that are promising regions to investigate evolutionary responses to rapid environmental change in C. robusta. PMID:28266616
Discrete choice modeling of shovelnose sturgeon habitat selection in the Lower Missouri River
Bonnot, T.W.; Wildhaber, M.L.; Millspaugh, J.J.; DeLonay, A.J.; Jacobson, R.B.; Bryan, J.L.
2011-01-01
Substantive changes to physical habitat in the Lower Missouri River, resulting from intensive management, have been implicated in the decline of pallid (Scaphirhynchus albus) and shovelnose (S. platorynchus) sturgeon. To aid in habitat rehabilitation efforts, we evaluated habitat selection of gravid, female shovelnose sturgeon during the spawning season in two sections (lower and upper) of the Lower Missouri River in 2005 and in the upper section in 2007. We fit discrete choice models within an information theoretic framework to identify selection of means and variability in three components of physical habitat. Characterizing habitat within divisions around fish better explained selection than habitat values at the fish locations. In general, female shovelnose sturgeon were negatively associated with mean velocity between them and the bank and positively associated with variability in surrounding depths. For example, in the upper section in 2005, a 0.5 m s⁻¹ decrease in velocity within 10 m in the bank direction increased the relative probability of selection by 70%. In the upper section fish also selected sites with surrounding structure in depth (e.g., change in relief). Differences in models between sections and years, which are reinforced by validation rates, suggest that changes in habitat due to geomorphology, hydrology, and their interactions over time need to be addressed when evaluating habitat selection. Because of the importance of variability in surrounding depths, these results support an emphasis on restoring channel complexity as an objective of habitat restoration for shovelnose sturgeon in the Lower Missouri River.
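The 70% figure quoted above can be related to a model coefficient with a short calculation. The sketch below is only an illustration and assumes a conditional-logit style discrete choice model in which the relative probability of selection scales as exp(beta * change in covariate). That functional form, the function names and the resulting coefficient are assumptions made here, not values reported in the abstract.

```python
import math

def coefficient_from_relative_change(relative_probability, delta_x):
    """Back out a log-linear coefficient from a reported relative probability.

    Assumes (hypothetically) that relative probability = exp(beta * delta_x).
    """
    return math.log(relative_probability) / delta_x

def relative_probability(beta, delta_x):
    """Relative probability of selection for a covariate change delta_x."""
    return math.exp(beta * delta_x)

# Reported effect: a 0.5 m/s *decrease* in velocity raises the relative
# probability of selection by 70%, i.e. by a factor of 1.70.
beta_velocity = coefficient_from_relative_change(1.70, -0.5)

# Under the same assumed model, the implied effect of a 0.25 m/s decrease.
effect_quarter = relative_probability(beta_velocity, -0.25)
```

Under this assumed form the implied coefficient is negative, which agrees with the negative association with velocity described in the abstract.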
Fernandez-Lozano, C.; Canto, C.; Gestal, M.; Andrade-Garda, J. M.; Rabuñal, J. R.; Dorado, J.; Pazos, A.
2013-01-01
Given the background of the use of Neural Networks in problems of apple juice classification, this paper aims to implement a newly developed method in the field of machine learning: the Support Vector Machine (SVM). A hybrid model that combines genetic algorithms and support vector machines is suggested in such a way that, when using the SVM as the fitness function of the Genetic Algorithm (GA), the most representative variables for a specific classification problem can be selected. PMID:24453933
Restricted cross-scale habitat selection by American beavers.
Francis, Robert A; Taylor, Jimmy D; Dibble, Eric; Strickland, Bronson; Petro, Vanessa M; Easterwood, Christine; Wang, Guiming
2017-12-01
Animal habitat selection, among other ecological phenomena, is spatially scale dependent. Habitat selection by American beavers Castor canadensis (hereafter, beaver) has been studied at singular spatial scales, but to date no research addresses multi-scale selection. Our objectives were to determine if beaver habitat selection was specialized to semiaquatic habitats and if variables explaining habitat selection are consistent between landscape and fine spatial scales. We built maximum entropy (MaxEnt) models to relate landscape-scale presence-only data to landscape variables, and used generalized linear mixed models to evaluate fine spatial scale habitat selection using global positioning system (GPS) relocation data. Explanatory variables between the landscape and fine spatial scale were compared for consistency. Our findings suggested that beaver habitat selection at coarse (study area) and fine (within home range) scales was congruent, and was influenced by increasing amounts of woody wetland edge density and shrub edge density, and decreasing amounts of open water edge density. Habitat suitability at the landscape scale also increased with decreasing amounts of grass frequency. As territorial, central-place foragers, beavers likely trade-off open water edge density (i.e., smaller non-forested wetlands or lodges closer to banks) for defense and shorter distances to forage and obtain construction material. Woody plants along edges and expanses of open water for predator avoidance may limit beaver fitness and subsequently determine beaver habitat selection.
Restricted cross-scale habitat selection by American beavers
Taylor, Jimmy D; Dibble, Eric; Strickland, Bronson; Petro, Vanessa M; Easterwood, Christine; Wang, Guiming
2017-01-01
Animal habitat selection, among other ecological phenomena, is spatially scale dependent. Habitat selection by American beavers Castor canadensis (hereafter, beaver) has been studied at singular spatial scales, but to date no research addresses multi-scale selection. Our objectives were to determine if beaver habitat selection was specialized to semiaquatic habitats and if variables explaining habitat selection are consistent between landscape and fine spatial scales. We built maximum entropy (MaxEnt) models to relate landscape-scale presence-only data to landscape variables, and used generalized linear mixed models to evaluate fine spatial scale habitat selection using global positioning system (GPS) relocation data. Explanatory variables between the landscape and fine spatial scale were compared for consistency. Our findings suggested that beaver habitat selection at coarse (study area) and fine (within home range) scales was congruent, and was influenced by increasing amounts of woody wetland edge density and shrub edge density, and decreasing amounts of open water edge density. Habitat suitability at the landscape scale also increased with decreasing amounts of grass frequency. As territorial, central-place foragers, beavers likely trade-off open water edge density (i.e., smaller non-forested wetlands or lodges closer to banks) for defense and shorter distances to forage and obtain construction material. Woody plants along edges and expanses of open water for predator avoidance may limit beaver fitness and subsequently determine beaver habitat selection. PMID:29492032
Sex determination of the Acadian Flycatcher using discriminant analysis
Wilson, R.R.
1999-01-01
I used five morphometric variables from 114 individuals captured in Arkansas to develop a discriminant model to predict the sex of Acadian Flycatchers (Empidonax virescens). Stepwise discriminant function analyses selected wing chord and tail length as the most parsimonious subset of variables for discriminating sex. This two-variable model correctly classified 80% of females and 97% of males used to develop the model. Validation of the model using 19 individuals from Louisiana and Virginia resulted in 100% correct classification of males and females. This model provides criteria for sexing monomorphic Acadian Flycatchers during the breeding season and possibly during the winter.
A Study of Quasar Selection in the Supernova Fields of the Dark Energy Survey
DOE Office of Scientific and Technical Information (OSTI.GOV)
Tie, S. S.; Martini, P.; Mudd, D.
In this paper, we present a study of quasar selection using the supernova fields of the Dark Energy Survey (DES). We used a quasar catalog from an overlapping portion of the SDSS Stripe 82 region to quantify the completeness and efficiency of selection methods involving color, probabilistic modeling, variability, and combinations of color/probabilistic modeling with variability. In all cases, we considered only objects that appear as point sources in the DES images. We examine color selection methods based on the Wide-field Infrared Survey Explorer (WISE) mid-IR W1-W2 color, a mixture of WISE and DES colors (g - i and i - W1), and a mixture of Vista Hemisphere Survey and DES colors (g - i and i - K). For probabilistic quasar selection, we used XDQSO, an algorithm that employs an empirical multi-wavelength flux model of quasars to assign quasar probabilities. Our variability selection uses the multi-band χ²-probability that sources are constant in the DES Year 1 griz-band light curves. The completeness and efficiency are calculated relative to an underlying sample of point sources that are detected in the required selection bands and pass our data quality and photometric error cuts. We conduct our analyses at two magnitude limits, i < 19.8 mag and i < 22 mag. For the subset of sources with W1 and W2 detections, the W1-W2 color or XDQSOz method combined with variability gives the highest completenesses of >85% for both i-band magnitude limits and efficiencies of >80% to the bright limit and >60% to the faint limit; however, the giW1 and giW1+variability methods give the highest quasar surface densities. The XDQSOz method and combinations of W1W2/giW1/XDQSOz with variability are among the better selection methods when both high completeness and high efficiency are desired. We also present the OzDES Quasar Catalog of 1263 spectroscopically confirmed quasars from three years of OzDES observation in the 30 deg² of the DES supernova fields. Finally, the catalog includes quasars with redshifts up to z ~ 4 and brighter than i = 22 mag, although the catalog is not complete up to this magnitude limit.
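The two figures of merit used in this abstract can be illustrated with a small calculation. The sketch below assumes the usual definitions: completeness is the fraction of known quasars that a selection recovers, and efficiency is the fraction of selected objects that are known quasars. The function name and the toy object identifiers are hypothetical and are not taken from the paper.

```python
def completeness_and_efficiency(selected_ids, known_quasar_ids):
    """Return (completeness, efficiency) for one selection method.

    selected_ids      -- objects picked by the selection method
    known_quasar_ids  -- reference quasars in the same underlying sample
    Both definitions are assumptions stated in the lead-in above.
    """
    selected = set(selected_ids)
    quasars = set(known_quasar_ids)
    recovered = selected & quasars  # true quasars that were selected
    completeness = len(recovered) / len(quasars) if quasars else float("nan")
    efficiency = len(recovered) / len(selected) if selected else float("nan")
    return completeness, efficiency

# Toy example: 5 known quasars, 4 selected objects, 3 in common.
comp, eff = completeness_and_efficiency(
    selected_ids=[1, 2, 3, 4],
    known_quasar_ids=[2, 3, 4, 5, 6],
)
```

In this toy case the selection recovers three of five quasars (completeness 0.6) and three of its four candidates are quasars (efficiency 0.75).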
A Study of Quasar Selection in the Supernova Fields of the Dark Energy Survey
Tie, S. S.; Martini, P.; Mudd, D.; ...
2017-02-15
In this paper, we present a study of quasar selection using the supernova fields of the Dark Energy Survey (DES). We used a quasar catalog from an overlapping portion of the SDSS Stripe 82 region to quantify the completeness and efficiency of selection methods involving color, probabilistic modeling, variability, and combinations of color/probabilistic modeling with variability. In all cases, we considered only objects that appear as point sources in the DES images. We examine color selection methods based on the Wide-field Infrared Survey Explorer (WISE) mid-IR W1-W2 color, a mixture of WISE and DES colors (g - i and i - W1), and a mixture of Vista Hemisphere Survey and DES colors (g - i and i - K). For probabilistic quasar selection, we used XDQSO, an algorithm that employs an empirical multi-wavelength flux model of quasars to assign quasar probabilities. Our variability selection uses the multi-band χ²-probability that sources are constant in the DES Year 1 griz-band light curves. The completeness and efficiency are calculated relative to an underlying sample of point sources that are detected in the required selection bands and pass our data quality and photometric error cuts. We conduct our analyses at two magnitude limits, i < 19.8 mag and i < 22 mag. For the subset of sources with W1 and W2 detections, the W1-W2 color or XDQSOz method combined with variability gives the highest completenesses of >85% for both i-band magnitude limits and efficiencies of >80% to the bright limit and >60% to the faint limit; however, the giW1 and giW1+variability methods give the highest quasar surface densities. The XDQSOz method and combinations of W1W2/giW1/XDQSOz with variability are among the better selection methods when both high completeness and high efficiency are desired. We also present the OzDES Quasar Catalog of 1263 spectroscopically confirmed quasars from three years of OzDES observation in the 30 deg² of the DES supernova fields. Finally, the catalog includes quasars with redshifts up to z ~ 4 and brighter than i = 22 mag, although the catalog is not complete up to this magnitude limit.
Yan, Zhengbing; Kuang, Te-Hui; Yao, Yuan
2017-09-01
In recent years, multivariate statistical monitoring of batch processes has become a popular research topic, wherein multivariate fault isolation is an important step aiming at the identification of the faulty variables contributing most to the detected process abnormality. Although contribution plots have been commonly used in statistical fault isolation, such methods suffer from the smearing effect between correlated variables. In particular, in batch process monitoring, the high autocorrelations and cross-correlations that exist in variable trajectories make the smearing effect unavoidable. To address this problem, a variable selection-based fault isolation method is proposed in this research, which transforms the fault isolation problem into a variable selection problem in partial least squares discriminant analysis and solves it by calculating a sparse partial least squares model. Unlike traditional methods, the proposed method emphasizes the relative importance of each process variable. Such information may help process engineers in conducting root-cause diagnosis. Copyright © 2017 ISA. Published by Elsevier Ltd. All rights reserved.
The Joint Effects of Background Selection and Genetic Recombination on Local Gene Genealogies
Zeng, Kai; Charlesworth, Brian
2011-01-01
Background selection, the effects of the continual removal of deleterious mutations by natural selection on variability at linked sites, is potentially a major determinant of DNA sequence variability. However, the joint effects of background selection and genetic recombination on the shape of the neutral gene genealogy have proved hard to study analytically. The only existing formula concerns the mean coalescent time for a pair of alleles, making it difficult to assess the importance of background selection from genome-wide data on sequence polymorphism. Here we develop a structured coalescent model of background selection with recombination and implement it in a computer program that efficiently generates neutral gene genealogies for an arbitrary sample size. We check the validity of the structured coalescent model against forward-in-time simulations and show that it accurately captures the effects of background selection. The model produces more accurate predictions of the mean coalescent time than the existing formula and supports the conclusion that the effect of background selection is greater in the interior of a deleterious region than at its boundaries. The level of linkage disequilibrium between sites is elevated by background selection, to an extent that is well summarized by a change in effective population size. The structured coalescent model is readily extendable to more realistic situations and should prove useful for analyzing genome-wide polymorphism data. PMID:21705759
The joint effects of background selection and genetic recombination on local gene genealogies.
Zeng, Kai; Charlesworth, Brian
2011-09-01
Background selection, the effects of the continual removal of deleterious mutations by natural selection on variability at linked sites, is potentially a major determinant of DNA sequence variability. However, the joint effects of background selection and genetic recombination on the shape of the neutral gene genealogy have proved hard to study analytically. The only existing formula concerns the mean coalescent time for a pair of alleles, making it difficult to assess the importance of background selection from genome-wide data on sequence polymorphism. Here we develop a structured coalescent model of background selection with recombination and implement it in a computer program that efficiently generates neutral gene genealogies for an arbitrary sample size. We check the validity of the structured coalescent model against forward-in-time simulations and show that it accurately captures the effects of background selection. The model produces more accurate predictions of the mean coalescent time than the existing formula and supports the conclusion that the effect of background selection is greater in the interior of a deleterious region than at its boundaries. The level of linkage disequilibrium between sites is elevated by background selection, to an extent that is well summarized by a change in effective population size. The structured coalescent model is readily extendable to more realistic situations and should prove useful for analyzing genome-wide polymorphism data.
Intractable Ménière's disease. Modelling of the treatment by means of statistical analysis.
Sanchez-Ferrandiz, Noelia; Fernandez-Gonzalez, Secundino; Guillen-Grima, Francisco; Perez-Fernandez, Nicolas
2010-08-01
To evaluate the value of different variables of the clinical history, auditory and vestibular tests and handicap measurements to define intractable or disabling Ménière's disease. This is a prospective study with 212 patients of which 155 were treated with intratympanic gentamicin and considered to be suffering a medically intractable Ménière's disease. Age and sex adjustments were performed with the 11 variables selected. Discriminant analysis was performed either using the aforementioned variables or following the stepwise method. Different variables needed to be sex and/or age adjusted and both data were included in the discriminant function. Two different mathematical formulas were obtained and four models were analyzed. With the model selected, diagnostic accuracy is 77.7%, sensitivity is 94.9% and specificity is 52.8%. After discriminant analysis we found that the most informative variables were the number of vertigo spells, the speech discrimination score, the time constant of the VOR and a measure of handicap, the "dizziness index". Copyright 2009 Elsevier Ireland Ltd. All rights reserved.
NASA Astrophysics Data System (ADS)
Hakkarinen, C.; Brown, D.; Callahan, J.; Hankin, S.; de Koningh, M.; Middleton-Link, D.; Wigley, T.
2001-05-01
A Web-based access system to climate model output data sets for intercomparison and analysis has been produced, using the NOAA-PMEL developed Live Access Server software as host server and Ferret as the data serving and visualization engine. Called ARCAS ("ACACIA Regional Climate-data Access System"), and publicly accessible at http://dataserver.ucar.edu/arcas, the site currently serves climate model outputs from runs of the NCAR Climate System Model for the 21st century, for Business as Usual and Stabilization of Greenhouse Gas Emission scenarios. Users can select, download, and graphically display single variables or comparisons of two variables from either or both of the CSM model runs, averaged for monthly, seasonal, or annual time resolutions. The time length of the averaging period, and the geographical domain for download and display, are fully selectable by the user. A variety of arithmetic operations on the data variables can be computed "on-the-fly", as defined by the user. Expansions of the user-selectable options for defining analysis options, and for accessing other DODS-compatible ("Distributed Ocean Data System-compatible") data sets, residing at locations other than the NCAR hardware server on which ARCAS operates, are planned for this year. These expansions are designed to allow users quick and easy-to-operate web-based access to the largest possible selection of climate model output data sets available throughout the world.
Bello, Alessandra; Bianchi, Federica; Careri, Maria; Giannetto, Marco; Mori, Giovanni; Musci, Marilena
2007-11-05
A new NIR method based on multivariate calibration for determination of ethanol in industrially packed wholemeal bread was developed and validated. GC-FID was used as the reference method for the determination of the actual ethanol concentration of different samples of wholemeal bread with proper content of added ethanol, ranging from 0 to 3.5% (w/w). Stepwise discriminant analysis was carried out on the NIR dataset in order to reduce the number of original variables by selecting those that were able to discriminate between the samples of different ethanol concentrations. With the variables so selected, a multivariate calibration model was then obtained by multiple linear regression. The prediction power of the linear model was optimized by a new "leave one out" method, so that the number of original variables was further reduced.
Tang, Rongnian; Chen, Xupeng; Li, Chuang
2018-05-01
Near-infrared spectroscopy is an efficient, low-cost technology that has potential as an accurate method for detecting the nitrogen content of natural rubber leaves. The successive projections algorithm (SPA) is a widely used variable selection method for multivariate calibration, which uses projection operations to select a variable subset with minimum multi-collinearity. However, due to the fluctuation of correlation between variables, high collinearity may still exist among non-adjacent variables of the subset obtained by basic SPA. Based on analysis of the correlation matrix of the spectral data, this paper proposes a correlation-based SPA (CB-SPA) that applies the successive projections algorithm in regions with consistent correlation. The results show that CB-SPA can select variable subsets with more valuable variables and less multi-collinearity. Meanwhile, models established with the CB-SPA subset outperform those built on basic SPA subsets in predicting nitrogen content in terms of both cross-validation and external prediction. Moreover, CB-SPA is more efficient: the time cost of its selection procedure is one-twelfth that of the basic SPA.
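The projection step that the abstract attributes to basic SPA can be sketched as follows. This is a minimal, dependency-free illustration of the basic algorithm only, not the CB-SPA variant proposed in the paper. It assumes the common formulation in which, starting from a chosen variable, each remaining column is projected onto the orthogonal complement of the most recently selected one and the column with the largest remaining norm is picked next. The function names and the toy data are hypothetical.

```python
def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def basic_spa(columns, start, n_select):
    """Select n_select column indices by successive projections.

    columns  -- list of variables, each a list of sample values
    start    -- index of the first (user-chosen) variable
    """
    proj = [list(col) for col in columns]  # working copies, updated in place
    selected = [start]
    while len(selected) < n_select:
        ref = proj[selected[-1]]
        ref_norm2 = _dot(ref, ref)
        if ref_norm2 == 0.0:
            break  # nothing left to project against
        best_index, best_norm2 = None, -1.0
        for j, col in enumerate(proj):
            if j in selected:
                continue
            # Remove from column j the component along the last selected column.
            coef = _dot(col, ref) / ref_norm2
            proj[j] = [c - coef * r for c, r in zip(col, ref)]
            norm2 = _dot(proj[j], proj[j])
            if norm2 > best_norm2:
                best_index, best_norm2 = j, norm2
        if best_index is None:
            break  # no unselected columns remain
        selected.append(best_index)
    return selected

# Toy data: variable 1 is nearly collinear with variable 0, variable 2 is not.
toy_columns = [
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.1],
    [0.0, 1.0, 0.0],
]
chosen = basic_spa(toy_columns, start=0, n_select=2)
```

In the toy data the nearly collinear variable keeps almost no norm after projection, so the uncorrelated variable is chosen second. In practice the starting variable and subset size are usually tuned against a validation error, a step omitted here.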
Moreno-Opo, Rubén; Fernández-Olalla, Mariana; Margalida, Antoni; Arredondo, Ángel; Guil, Francisco
2012-01-01
The application of science-based conservation measures requires that sampling methodologies in studies modelling similar ecological aspects produce comparable results, making their interpretation easier. We aimed to show how the choice of different methodological and ecological approaches can affect conclusions in nest-site selection studies across different Palearctic meta-populations of an indicator species. First, a multivariate analysis of the variables affecting nest-site selection in a breeding colony of cinereous vulture (Aegypius monachus) in central Spain was performed. Then, a meta-analysis was applied to establish how methodological and habitat-type factors determine differences and similarities in the results obtained by previous studies that have modelled the forest breeding habitat of the species. Our results revealed patterns in nesting-habitat modelling by the cinereous vulture throughout its whole range: steep and south-facing slopes, great cover of large trees and distance to human activities were generally selected. The ratio and situation of the studied plots (nests/random), the use of plots vs. polygons as sampling units and the number of years in the data set determined the variability explained by the model. Moreover, a greater size of the breeding colony implied that ecological and geomorphological variables at landscape level were more influential. Additionally, human activities affected colonies situated in Mediterranean forests to a greater extent. For the first time, a meta-analysis regarding the factors determining nest-site selection heterogeneity for a single species at broad scale was achieved. It is essential to homogenize and coordinate experimental design in modelling the selection of species' ecological requirements so that differences in results among studies are not due to methodological heterogeneity. This would optimize best conservation and management practices for habitats and species in a global context. PMID:22413023
Merrill, Scott C; Peairs, Frank B
2017-02-01
Models describing the effects of climate change on arthropod pest ecology are needed to help mitigate and adapt to forthcoming changes. Challenges arise because climate data are at resolutions that do not readily synchronize with arthropod biology. Here we explain how multiple sources of climate and weather data can be synthesized to quantify the effects of climate change on pest phenology. Predictions of phenological events differ substantially between models that incorporate scale-appropriate temperature variability and models that do not. As an illustrative example, we predicted adult emergence of a pest of sunflower, the sunflower stem weevil Cylindrocopturus adspersus (LeConte). Predictions of the timing of phenological events differed by an average of 11 days between models with different temperature variability inputs. Moreover, as temperature variability increases, developmental rates accelerate. Our work details a phenological modeling approach intended to help develop tools to plan for and mitigate the effects of climate change. Results show that selection of scale-appropriate temperature data is of more importance than selecting a climate change emission scenario. Predictions derived without appropriate temperature variability inputs will likely result in substantial phenological event miscalculations. Additionally, results suggest that increased temperature instability will lead to accelerated pest development. © 2016 Society of Chemical Industry.
A New Integrated Weighted Model in SNOW-V10: Verification of Categorical Variables
NASA Astrophysics Data System (ADS)
Huang, Laura X.; Isaac, George A.; Sheng, Grant
2014-01-01
This paper presents the verification results for nowcasts of seven categorical variables from an integrated weighted model (INTW) and the underlying numerical weather prediction (NWP) models. Nowcasting, or short range forecasting (0-6 h), over complex terrain with sufficient accuracy is highly desirable but a very challenging task. A weighting, evaluation, bias correction and integration system (WEBIS) for generating nowcasts by integrating NWP forecasts and high frequency observations was used during the Vancouver 2010 Olympic and Paralympic Winter Games as part of the Science of Nowcasting Olympic Weather for Vancouver 2010 (SNOW-V10) project. Forecast data from Canadian high-resolution deterministic NWP system with three nested grids (at 15-, 2.5- and 1-km horizontal grid-spacing) were selected as background gridded data for generating the integrated nowcasts. Seven forecast variables of temperature, relative humidity, wind speed, wind gust, visibility, ceiling and precipitation rate are treated as categorical variables for verifying the integrated weighted forecasts. By analyzing the verification of forecasts from INTW and the NWP models among 15 sites, the integrated weighted model was found to produce more accurate forecasts for the 7 selected forecast variables, regardless of location. This is based on the multi-categorical Heidke skill scores for the test period 12 February to 21 March 2010.
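The multi-categorical Heidke skill score mentioned above can be computed from a forecast/observation contingency table. The sketch below uses the standard textbook definition, HSS = (PC - E) / (1 - E), where PC is the proportion of correct forecasts and E is the proportion expected correct by chance from the row and column totals. The function name and the example tables are hypothetical and are not SNOW-V10 data.

```python
def heidke_skill_score(table):
    """Multi-category Heidke skill score.

    table[i][j] = number of cases forecast in category i and observed in j.
    Returns 1.0 for perfect forecasts and 0.0 for chance-level skill.
    """
    k = len(table)
    total = sum(sum(row) for row in table)
    if total == 0:
        raise ValueError("contingency table is empty")
    row_totals = [sum(table[i]) for i in range(k)]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    proportion_correct = sum(table[i][i] for i in range(k)) / total
    expected_correct = sum(
        row_totals[i] * col_totals[i] for i in range(k)
    ) / (total * total)
    if expected_correct == 1.0:
        raise ValueError("score undefined: all cases fall in one category")
    return (proportion_correct - expected_correct) / (1.0 - expected_correct)

# Hypothetical 2-category tables.
perfect = heidke_skill_score([[10, 0], [0, 10]])
no_skill = heidke_skill_score([[5, 5], [5, 5]])
```

The same function applies unchanged to tables with more than two categories, such as binned visibility or ceiling classes.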
Applied Music Teaching Behavior as a Function of Selected Personality Variables.
ERIC Educational Resources Information Center
Schmidt, Charles P.
1989-01-01
Investigates the relationships among applied music teaching behaviors and personality variables as measured by the Myers-Briggs Type Indicator (MBTI). Suggests that personality variables may be important factors underlying four applied music teaching behaviors: approvals, rate of reinforcement, teacher model/performance, and pace. (LS)
Conjoint Analysis: A Study of the Effects of Using Person Variables.
ERIC Educational Resources Information Center
Fraas, John W.; Newman, Isadore
Three statistical techniques--conjoint analysis, a multiple linear regression model, and a multiple linear regression model with a surrogate person variable--were used to estimate the relative importance of five university attributes for students in the process of selecting a college. The five attributes include: availability and variety of…
Brenn, T; Arnesen, E
1985-01-01
For comparative evaluation, discriminant analysis, logistic regression and Cox's model were used to select risk factors for total and coronary deaths among 6595 men aged 20-49 followed for 9 years. Groups with mortality between 5 and 93 per 1000 were considered. Discriminant analysis selected variable sets only marginally different from the logistic and Cox methods which always selected the same sets. A time-saving option, offered for both the logistic and Cox selection, showed no advantage compared with discriminant analysis. Analysing more than 3800 subjects, the logistic and Cox methods consumed, respectively, 80 and 10 times more computer time than discriminant analysis. When including the same set of variables in non-stepwise analyses, all methods estimated coefficients that in most cases were almost identical. In conclusion, discriminant analysis is advocated for preliminary or stepwise analysis, otherwise Cox's method should be used.
NASA Astrophysics Data System (ADS)
Dalla Rosa, Luciano; Ford, John K. B.; Trites, Andrew W.
2012-03-01
Humpback whales are common in feeding areas off British Columbia (BC) from spring to fall, and are widely distributed along the coast. Climate change and the increase in population size of North Pacific humpback whales may lead to increased anthropogenic impact and require a better understanding of species-habitat relationships. We investigated the distribution and relative abundance of humpback whales in relation to environmental variables and processes in BC waters using GIS and generalized additive models (GAMs). Six non-systematic cetacean surveys were conducted between 2004 and 2006. Whale encounter rates and environmental variables (oceanographic and remote sensing data) were recorded along transects divided into 4 km segments. A combined 3-year model and individual year models (two surveys each) were fitted with the mgcv R package. Model selection was based primarily on GCV scores. The explained deviance of our models ranged from 39% for the 3-year model to 76% for the 2004 model. Humpback whales were strongly associated with latitude and bathymetric features, including depth, slope and distance to the 100-m isobath. Distance to sea-surface-temperature fronts and salinity (climatology) were also constantly selected by the models. The shapes of smooth functions estimated for variables based on chlorophyll concentration or net primary productivity with different temporal resolutions and time lags were not consistent, even though higher numbers of whales seemed to be associated with higher primary productivity for some models. These and other selected explanatory variables may reflect areas of higher biological productivity that favor top predators. Our study confirms the presence of at least three important regions for humpback whales along the BC coast: south Dixon Entrance, middle and southwestern Hecate Strait and the area between La Perouse Bank and the southern edge of Juan de Fuca Canyon.
Selection of optimal complexity for ENSO-EMR model by minimum description length principle
NASA Astrophysics Data System (ADS)
Loskutov, E. M.; Mukhin, D.; Mukhina, A.; Gavrilov, A.; Kondrashov, D. A.; Feigin, A. M.
2012-12-01
One of the main problems arising in modeling of data taken from a natural system is finding a phase space suitable for construction of the evolution operator model. Since we usually deal with strongly high-dimensional behavior, we are forced to construct a model working in some projection of the system phase space corresponding to the time scales of interest. Selection of the optimal projection is a non-trivial problem since there are many ways to reconstruct phase variables from a given time series, especially in the case of a spatio-temporal data field. Actually, finding the optimal projection is a significant part of model selection, because, on the one hand, the transformation of data to some phase variables vector can be considered as a required component of the model. On the other hand, such an optimization of a phase space makes sense only in relation to the parametrization of the model we use, i.e. the representation of the evolution operator, so we should find an optimal structure of the model together with the phase variables vector. In this paper we propose to use the principle of minimal description length (Molkov et al., 2009) for selecting models of optimal complexity. The proposed method is applied to optimization of the Empirical Model Reduction (EMR) of the ENSO phenomenon (Kravtsov et al., 2005; Kondrashov et al., 2005). This model operates within a subset of leading EOFs constructed from the spatio-temporal field of SST in the Equatorial Pacific, and has the form of multi-level stochastic differential equations (SDE) with polynomial parameterization of the right-hand side. Optimal values for the number of EOFs, the order of the polynomial and the number of levels are estimated from the Equatorial Pacific SST dataset. References: Ya. Molkov, D. Mukhin, E. Loskutov, G. Fidelin and A. Feigin, Using the minimum description length principle for global reconstruction of dynamic systems from noisy time series, Phys. Rev. E, Vol. 80, P 046207, 2009. S. Kravtsov, D. Kondrashov and M. Ghil, 2005: Multilevel regression modeling of nonlinear processes: Derivation and applications to climatic variability. J. Climate, 18 (21), 4404-4424. D. Kondrashov, S. Kravtsov, A. W. Robertson and M. Ghil, 2005: A hierarchy of data-based ENSO models. J. Climate, 18, 4425-4444.
A site specific model and analysis of the neutral somatic mutation rate in whole-genome cancer data.
Bertl, Johanna; Guo, Qianyun; Juul, Malene; Besenbacher, Søren; Nielsen, Morten Muhlig; Hornshøj, Henrik; Pedersen, Jakob Skou; Hobolth, Asger
2018-04-19
Detailed modelling of the neutral mutational process in cancer cells is crucial for identifying driver mutations and understanding the mutational mechanisms that act during cancer development. The neutral mutational process is very complex: whole-genome analyses have revealed that the mutation rate differs between cancer types, between patients and along the genome depending on the genetic and epigenetic context. Therefore, methods that predict the number of different types of mutations in regions or specific genomic elements must consider local genomic explanatory variables. A major drawback of most methods is the need to average the explanatory variables across the entire region or genomic element. This procedure is particularly problematic if the explanatory variable varies dramatically in the element under consideration. To take into account the fine scale of the explanatory variables, we model the probabilities of different types of mutations for each position in the genome by multinomial logistic regression. We analyse 505 cancer genomes from 14 different cancer types and compare the performance in predicting mutation rate for both regional based models and site-specific models. We show that for 1000 randomly selected genomic positions, the site-specific model predicts the mutation rate much better than regional based models. We use a forward selection procedure to identify the most important explanatory variables. The procedure identifies site-specific conservation (phyloP), replication timing, and expression level as the best predictors for the mutation rate. Finally, our model confirms and quantifies certain well-known mutational signatures. We find that our site-specific multinomial regression model outperforms the regional based models. The possibility of including genomic variables on different scales and patient specific variables makes it a versatile framework for studying different mutational mechanisms. 
Our model can serve as the neutral null model for the mutational process; regions that deviate from the null model are candidates for elements that drive cancer development.
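The forward selection procedure mentioned in this abstract can be illustrated with a generic greedy loop. This is a minimal sketch, not the authors' implementation: the function name `forward_select`, the `toy_score` function and the numeric gains are all hypothetical, and in practice the score would be a cross-validated likelihood or an information criterion for the fitted multinomial model.

```python
from typing import Callable, FrozenSet, List, Sequence


def forward_select(
    candidates: Sequence[str],
    score: Callable[[FrozenSet[str]], float],
    min_gain: float = 0.0,
) -> List[str]:
    """Greedy forward selection: repeatedly add the candidate that raises
    `score` the most, stopping when no addition improves it by more than
    `min_gain`. Returns the selected names in the order they were added."""
    selected: List[str] = []
    remaining = list(candidates)
    best = score(frozenset())
    while remaining:
        # Score every one-variable extension of the current model.
        trials = [(score(frozenset(selected + [c])), c) for c in remaining]
        top_score, top_name = max(trials, key=lambda t: t[0])
        if top_score - best <= min_gain:
            break
        selected.append(top_name)
        remaining.remove(top_name)
        best = top_score
    return selected


# Hypothetical toy score: each variable contributes a fixed gain and every
# included variable pays a complexity penalty of 0.5 (values are made up).
GAIN = {"phyloP": 3.0, "replication_timing": 2.0, "expression": 1.0, "noise": 0.1}


def toy_score(subset: FrozenSet[str]) -> float:
    return sum(GAIN[name] for name in subset) - 0.5 * len(subset)


chosen = forward_select(list(GAIN), toy_score)
print(chosen)
```

With this toy score the uninformative variable is never added, because its gain does not cover the complexity penalty.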
Žuvela, Petar; Liu, J Jay; Macur, Katarzyna; Bączek, Tomasz
2015-10-06
In this work, performance of five nature-inspired optimization algorithms, genetic algorithm (GA), particle swarm optimization (PSO), artificial bee colony (ABC), firefly algorithm (FA), and flower pollination algorithm (FPA), was compared in molecular descriptor selection for development of quantitative structure-retention relationship (QSRR) models for 83 peptides that originate from eight model proteins. The matrix with 423 descriptors was used as input, and QSRR models based on selected descriptors were built using partial least squares (PLS), whereas root mean square error of prediction (RMSEP) was used as a fitness function for their selection. Three performance criteria, prediction accuracy, computational cost, and the number of selected descriptors, were used to evaluate the developed QSRR models. The results show that all five variable selection methods outperform interval PLS (iPLS), sparse PLS (sPLS), and the full PLS model, whereas GA is superior because of its lowest computational cost and higher accuracy (RMSEP of 5.534%) with a smaller number of variables (nine descriptors). The GA-QSRR model was validated initially through Y-randomization. In addition, it was successfully validated with an external testing set out of 102 peptides originating from Bacillus subtilis proteomes (RMSEP of 22.030%). Its applicability domain was defined, from which it was evident that the developed GA-QSRR exhibited strong robustness. All the sources of the model's error were identified, thus allowing for further application of the developed methodology in proteomics.
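The fitness function named in this abstract, root mean square error of prediction (RMSEP), can be written in a few lines. This is a sketch of the standard absolute RMSEP only; the abstract reports RMSEP as a percentage, and the normalisation used for that is not stated, so none is assumed here. The function name `rmsep` is illustrative.

```python
import math
from typing import Sequence


def rmsep(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean square error of prediction over a test set."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up retention values, used only to exercise the function.
print(rmsep([10.0, 12.0, 15.0], [11.0, 12.0, 13.0]))
```

A selection algorithm such as a genetic algorithm would call a function like this on each candidate descriptor subset and prefer subsets with a lower value.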
NASA Astrophysics Data System (ADS)
Ruan, John J.; Anderson, Scott F.; MacLeod, Chelsea L.; Becker, Andrew C.; Burnett, T. H.; Davenport, James R. A.; Ivezić, Željko; Kochanek, Christopher S.; Plotkin, Richard M.; Sesar, Branimir; Stuart, J. Scott
2012-11-01
We investigate the use of optical photometric variability to select and identify blazars in large-scale time-domain surveys, in part to aid in the identification of blazar counterparts to the ~30% of γ-ray sources in the Fermi 2FGL catalog still lacking reliable associations. Using data from the optical LINEAR asteroid survey, we characterize the optical variability of blazars by fitting a damped random walk model to individual light curves with two main model parameters, the characteristic timescales of variability τ, and driving amplitudes on short timescales \hat{\sigma}. Imposing cuts on minimum τ and \hat{\sigma} allows for blazar selection with high efficiency E and completeness C. To test the efficacy of this approach, we apply this method to optically variable LINEAR objects that fall within the several-arcminute error ellipses of γ-ray sources in the Fermi 2FGL catalog. Despite the extreme stellar contamination at the shallow depth of the LINEAR survey, we are able to recover previously associated optical counterparts to Fermi active galactic nuclei with E ≥ 88% and C = 88% in Fermi 95% confidence error ellipses having semimajor axis r < 8'. We find that the suggested radio counterpart to Fermi source 2FGL J1649.6+5238 has optical variability consistent with other γ-ray blazars and is likely to be the γ-ray source. Our results suggest that the variability of the non-thermal jet emission in blazars is stochastic in nature, with unique variability properties due to the effects of relativistic beaming. After correcting for beaming, we estimate that the characteristic timescale of blazar variability is ~3 years in the rest frame of the jet, in contrast with the ~320 day disk flux timescale observed in quasars. The variability-based selection method presented will be useful for blazar identification in time-domain optical surveys and is also a probe of jet physics.
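To make the two damped random walk parameters concrete, the sketch below simulates a light curve from a given τ and short-timescale driving amplitude. It is an illustration under stated assumptions, not the fitting procedure used in the paper: it assumes the common convention in which the long-run variance of the process equals the squared driving amplitude times τ / 2, and the function name `simulate_drw` and the example epochs are hypothetical.

```python
import math
import random
from typing import List, Sequence


def simulate_drw(
    times: Sequence[float],
    tau: float,
    sigma_hat: float,
    mean: float = 0.0,
    x0: float = 0.0,
    seed: int = 0,
) -> List[float]:
    """Sample a damped random walk (Ornstein-Uhlenbeck process) at the given
    increasing times, using the exact conditional update between epochs.
    `x0` is the value at times[0]."""
    rng = random.Random(seed)
    # Assumed convention: asymptotic variance = sigma_hat**2 * tau / 2.
    var_inf = sigma_hat ** 2 * tau / 2.0
    values = [x0]
    for t_prev, t_next in zip(times, times[1:]):
        decay = math.exp(-(t_next - t_prev) / tau)
        cond_mean = mean + (values[-1] - mean) * decay
        cond_std = math.sqrt(var_inf * (1.0 - decay ** 2))
        values.append(rng.gauss(cond_mean, cond_std))
    return values


# Irregularly spaced epochs in days (made up), tau in days, amplitude in mag/sqrt(day).
epochs = [0.0, 3.0, 10.0, 11.0, 40.0, 90.0]
light_curve = simulate_drw(epochs, tau=100.0, sigma_hat=0.05, mean=17.0, x0=17.0)
print(light_curve)
```

Because the update is exact for any time gap, the same function works for the irregular sampling typical of survey data.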
Zhang, Haixia; Zhao, Junkang; Gu, Caijiao; Cui, Yan; Rong, Huiying; Meng, Fanlong; Wang, Tong
2015-05-01
A study of medical expenditure and its influencing factors among students enrolled in Urban Resident Basic Medical Insurance (URBMI) in Taiyuan indicated that non-response bias and selection bias coexist in the dependent variable of the survey data. Unlike previous studies, which focused on only one missing-data mechanism, this study proposes a two-stage method that deals with two missing-data mechanisms simultaneously by combining multiple imputation with a sample selection model. A total of 1,190 questionnaires were returned by students (or their parents) selected from child care settings, schools and universities in Taiyuan by stratified cluster random sampling in 2012. Among the returned questionnaires, the dependent variable was not missing at random (NMAR) in 2.52% and missing at random (MAR) in 7.14%. First, multiple imputation was conducted for MAR using the completed data; then a sample selection model was used to correct for NMAR in the multiple imputation, and a multi-factor analysis model was established. Based on 1,000 resamplings, the best scheme for filling the random missing values was the predictive mean matching (PMM) method under the missing proportion. With this optimal scheme, a two-stage survey was conducted. Finally, it was found that the factors influencing annual medical expenditure among the students enrolled in URBMI in Taiyuan included population group, annual household gross income, affordability of medical insurance expenditure, chronic disease, seeking medical care in hospital, seeking medical care in a community health center or private clinic, hospitalization, hospitalization canceled for certain reasons, self-medication, and acceptable proportion of self-paid medical expenditure. The two-stage method combining multiple imputation with a sample selection model can deal effectively with non-response bias and selection bias in the dependent variable of the survey data.
ERIC Educational Resources Information Center
Vrieze, Scott I.
2012-01-01
This article reviews the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) in model selection and the appraisal of psychological theory. The focus is on latent variable models, given their growing use in theory testing and construction. Theoretical statistical results in regression are discussed, and more important…
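The two criteria reviewed in this article have simple closed forms: AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, where k is the number of free parameters, n the sample size and L the maximised likelihood. The sketch below applies them to two models; the log-likelihoods, parameter counts and sample size are made-up numbers chosen only to show that the criteria can disagree.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion; lower is better."""
    return 2.0 * n_params - 2.0 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion; lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


# Hypothetical fits to the same n = 200 observations.
simple_model = {"loglik": -520.0, "k": 4}
complex_model = {"loglik": -514.0, "k": 9}
n = 200

for name, m in [("simple", simple_model), ("complex", complex_model)]:
    print(name, aic(m["loglik"], m["k"]), bic(m["loglik"], m["k"], n))
```

Here AIC favours the more complex model while BIC favours the simpler one, because the BIC penalty per parameter, ln n, exceeds the AIC penalty of 2 once n is larger than about 7.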
ERIC Educational Resources Information Center
Bergee, Martin J.; Westfall, Claude R.
2005-01-01
This is the third study in a line of inquiry whose purpose has been to develop a theoretical model of selected extra musical variables' influence on solo and small-ensemble festival ratings. Authors of the second of these (Bergee & McWhirter, 2005) had used binomial logistic regression as the basis for their model-formulation strategy. Their…
Radio variability in complete samples of extragalactic radio sources at 1.4 GHz
NASA Astrophysics Data System (ADS)
Rys, S.; Machalski, J.
1990-09-01
Complete samples of extragalactic radio sources obtained in 1970-1975 and the sky survey of Condon and Broderick (1983) were used to select sources variable at 1.4 GHz, and to investigate the characteristics of variability in the whole population of sources at this frequency. The radio structures, radio spectral types, and optical identifications of the selected variables are discussed. Only compact flat-spectrum sources vary at 1.4 GHz, and all but four are identified with QSOs, BL Lacs, or other (unconfirmed spectroscopically) stellar objects. No correlation of degree of variability at 1.4 GHz with Galactic latitude or variability at 408 MHz has been found, suggesting that most of the 1.4-GHz variability is intrinsic and not caused by refractive scintillations. Numerical models of the variability have been computed.
Rosswog, Carolina; Schmidt, Rene; Oberthuer, André; Juraeva, Dilafruz; Brors, Benedikt; Engesser, Anne; Kahlert, Yvonne; Volland, Ruth; Bartenhagen, Christoph; Simon, Thorsten; Berthold, Frank; Hero, Barbara; Faldum, Andreas; Fischer, Matthias
2017-12-01
Current risk stratification systems for neuroblastoma patients consider clinical, histopathological, and genetic variables, and additional prognostic markers have been proposed in recent years. We here sought to select highly informative covariates in a multistep strategy based on consecutive Cox regression models, resulting in a risk score that integrates hazard ratios of prognostic variables. A cohort of 695 neuroblastoma patients was divided into a discovery set (n=75) for multigene predictor generation, a training set (n=411) for risk score development, and a validation set (n=209). Relevant prognostic variables were identified by stepwise multivariable L1-penalized least absolute shrinkage and selection operator (LASSO) Cox regression, followed by backward selection in multivariable Cox regression, and then integrated into a novel risk score. The variables stage, age, MYCN status, and two multigene predictors, NB-th24 and NB-th44, were selected as independent prognostic markers by LASSO Cox regression analysis. Following backward selection, only the multigene predictors were retained in the final model. Integration of these classifiers in a risk scoring system distinguished three patient subgroups that differed substantially in their outcome. The scoring system discriminated patients with diverging outcome in the validation cohort (5-year event-free survival, 84.9±3.4 vs 63.6±14.5 vs 31.0±5.4; P<.001), and its prognostic value was validated by multivariable analysis. We here propose a translational strategy for developing risk assessment systems based on hazard ratios of relevant prognostic variables. Our final neuroblastoma risk score comprised two multigene predictors only, supporting the notion that molecular properties of the tumor cells strongly impact clinical courses of neuroblastoma patients. Copyright © 2017 The Authors. Published by Elsevier Inc. All rights reserved.
Zhang, Xiaoshuai; Xue, Fuzhong; Liu, Hong; Zhu, Dianwen; Peng, Bin; Wiemels, Joseph L; Yang, Xiaowei
2014-12-10
Genome-wide Association Studies (GWAS) are typically designed to identify phenotype-associated single nucleotide polymorphisms (SNPs) individually using univariate analysis methods. Though providing valuable insights into genetic risks of common diseases, the genetic variants identified by GWAS generally account for only a small proportion of the total heritability for complex diseases. To solve this "missing heritability" problem, we implemented a strategy called integrative Bayesian Variable Selection (iBVS), which is based on a hierarchical model that incorporates an informative prior by considering the gene interrelationship as a network. It was applied here to both simulated and real data sets. Simulation studies indicated that the iBVS method was advantageous in its performance with highest AUC in both variable selection and outcome prediction, when compared to Stepwise and LASSO based strategies. In an analysis of a leprosy case-control study, iBVS selected 94 SNPs as predictors, while LASSO selected 100 SNPs. The Stepwise regression yielded a more parsimonious model with only 3 SNPs. The prediction results demonstrated that the iBVS method had comparable performance with that of LASSO, but better than Stepwise strategies. The proposed iBVS strategy is a novel and valid method for Genome-wide Association Studies, with the additional advantage in that it produces more interpretable posterior probabilities for each variable unlike LASSO and other penalized regression methods.
NASA Astrophysics Data System (ADS)
Attia, Khalid A. M.; Nassar, Mohammed W. I.; El-Zeiny, Mohamed B.; Serag, Ahmed
2017-01-01
For the first time, a new variable selection method based on swarm intelligence, namely the firefly algorithm, is coupled with three different multivariate calibration models, namely concentration residual augmented classical least squares, artificial neural network and support vector regression, in UV spectral data. A comparative study between the firefly algorithm and the well-known genetic algorithm was developed. The discussion revealed the superiority of this new algorithm over the genetic algorithm. Moreover, different statistical tests were performed and no significant differences were found between the models regarding their predictabilities. This ensures that simpler and faster models were obtained without any deterioration of the quality of the calibration.
Measurement error in epidemiologic studies of air pollution based on land-use regression models.
Basagaña, Xavier; Aguilera, Inmaculada; Rivera, Marcela; Agis, David; Foraster, Maria; Marrugat, Jaume; Elosua, Roberto; Künzli, Nino
2013-10-15
Land-use regression (LUR) models are increasingly used to estimate air pollution exposure in epidemiologic studies. These models use air pollution measurements taken at a small set of locations and modeling based on geographical covariates for which data are available at all study participant locations. The process of LUR model development commonly includes a variable selection procedure. When LUR model predictions are used as explanatory variables in a model for a health outcome, measurement error can lead to bias of the regression coefficients and to inflation of their variance. In previous studies dealing with spatial predictions of air pollution, bias was shown to be small while most of the effect of measurement error was on the variance. In this study, we show that in realistic cases where LUR models are applied to health data, bias in health-effect estimates can be substantial. This bias depends on the number of air pollution measurement sites, the number of available predictors for model selection, and the amount of explainable variability in the true exposure. These results should be taken into account when interpreting health effects from studies that used LUR models.
Cavallo, Jaime A.; Roma, Andres A.; Jasielec, Mateusz S.; Ousley, Jenny; Creamer, Jennifer; Pichert, Matthew D.; Baalman, Sara; Frisella, Margaret M.; Matthews, Brent D.
2014-01-01
Background: The purpose of this study was to evaluate the associations between patient characteristics or surgical site classifications and the histologic remodeling scores of synthetic meshes biopsied from their abdominal wall repair sites in the first attempt to generate a multivariable risk prediction model of non-constructive remodeling. Methods: Biopsies of the synthetic meshes were obtained from the abdominal wall repair sites of 51 patients during a subsequent abdominal re-exploration. Biopsies were stained with hematoxylin and eosin, and evaluated according to a semi-quantitative scoring system for remodeling characteristics (cell infiltration, cell types, extracellular matrix deposition, inflammation, fibrous encapsulation, and neovascularization) and a mean composite score (CR). Biopsies were also stained with Sirius Red and Fast Green, and analyzed to determine the collagen I:III ratio. Based on univariate analyses between subject clinical characteristics or surgical site classification and the histologic remodeling scores, cohort variables were selected for multivariable regression models using a threshold p value of ≤0.200. Results: The model selection process for the extracellular matrix score yielded two variables: subject age at time of mesh implantation, and mesh classification (c-statistic = 0.842). For CR score, the model selection process yielded two variables: subject age at time of mesh implantation and mesh classification (r² = 0.464). The model selection process for the collagen III area yielded a model with two variables: subject body mass index at time of mesh explantation and pack-year history (r² = 0.244). Conclusion: Host characteristics and surgical site assessments may predict degree of remodeling for synthetic meshes used to reinforce abdominal wall repair sites. These preliminary results constitute the first steps in generating a risk prediction model that predicts the patients and clinical circumstances for which non-constructive remodeling of an abdominal wall repair site with synthetic mesh reinforcement is most likely to occur. PMID:24442681
Surface Estimation, Variable Selection, and the Nonparametric Oracle Property.
Storlie, Curtis B; Bondell, Howard D; Reich, Brian J; Zhang, Hao Helen
2011-04-01
Variable selection for multivariate nonparametric regression is an important, yet challenging, problem due, in part, to the infinite dimensionality of the function space. An ideal selection procedure should be automatic, stable, easy to use, and have desirable asymptotic properties. In particular, we define a selection procedure to be nonparametric oracle (np-oracle) if it consistently selects the correct subset of predictors and at the same time estimates the smooth surface at the optimal nonparametric rate, as the sample size goes to infinity. In this paper, we propose a model selection procedure for nonparametric models, and explore the conditions under which the new method enjoys the aforementioned properties. Developed in the framework of smoothing spline ANOVA, our estimator is obtained via solving a regularization problem with a novel adaptive penalty on the sum of functional component norms. Theoretical properties of the new estimator are established. Additionally, numerous simulated and real examples further demonstrate that the new approach substantially outperforms other existing methods in the finite sample setting.
Surface Estimation, Variable Selection, and the Nonparametric Oracle Property
Storlie, Curtis B.; Bondell, Howard D.; Reich, Brian J.; Zhang, Hao Helen
2010-01-01
Variable selection for multivariate nonparametric regression is an important, yet challenging, problem due, in part, to the infinite dimensionality of the function space. An ideal selection procedure should be automatic, stable, easy to use, and have desirable asymptotic properties. In particular, we define a selection procedure to be nonparametric oracle (np-oracle) if it consistently selects the correct subset of predictors and at the same time estimates the smooth surface at the optimal nonparametric rate, as the sample size goes to infinity. In this paper, we propose a model selection procedure for nonparametric models, and explore the conditions under which the new method enjoys the aforementioned properties. Developed in the framework of smoothing spline ANOVA, our estimator is obtained via solving a regularization problem with a novel adaptive penalty on the sum of functional component norms. Theoretical properties of the new estimator are established. Additionally, numerous simulated and real examples further demonstrate that the new approach substantially outperforms other existing methods in the finite sample setting. PMID:21603586
Stochastic isotropic hyperelastic materials: constitutive calibration and model selection
NASA Astrophysics Data System (ADS)
Mihai, L. Angela; Woolley, Thomas E.; Goriely, Alain
2018-03-01
Biological and synthetic materials often exhibit intrinsic variability in their elastic responses under large strains, owing to microstructural inhomogeneity or when elastic data are extracted from viscoelastic mechanical tests. For these materials, although hyperelastic models calibrated to mean data are useful, stochastic representations accounting also for data dispersion carry extra information about the variability of material properties found in practical applications. We combine finite elasticity and information theories to construct homogeneous isotropic hyperelastic models with random field parameters calibrated to discrete mean values and standard deviations of either the stress-strain function or the nonlinear shear modulus, which is a function of the deformation, estimated from experimental tests. These quantities can take on different values, corresponding to possible outcomes of the experiments. As multiple models can be derived that adequately represent the observed phenomena, we apply Occam's razor by providing an explicit criterion for model selection based on Bayesian statistics. We then employ this criterion to select a model among competing models calibrated to experimental data for rubber and brain tissue under single or multiaxial loads.
Grainger, Matthew James; Aramyan, Lusine; Piras, Simone; Quested, Thomas Edward; Righi, Simone; Setti, Marco; Vittuari, Matteo; Stewart, Gavin Bruce
2018-01-01
Food waste from households contributes the greatest proportion to total food waste in developed countries. Therefore, food waste reduction requires an understanding of the socio-economic (contextual and behavioural) factors that lead to its generation within the household. Addressing such a complex subject calls for sound methodological approaches that until now have been conditioned by the large number of factors involved in waste generation, by the lack of a recognised definition, and by limited available data. This work contributes to food waste generation literature by using one of the largest available datasets that includes data on the objective amount of avoidable household food waste, along with information on a series of socio-economic factors. In order to address one aspect of the complexity of the problem, machine learning algorithms (random forests and boruta) for variable selection integrated with linear modelling, model selection and averaging are implemented. Model selection addresses model structural uncertainty, which is not routinely considered in assessments of food waste in literature. The main drivers of food waste in the home selected in the most parsimonious models include household size, the presence of fussy eaters, employment status, home ownership status, and the local authority. Results, regardless of which variable set the models are run on, point toward large households as being a key target element for food waste reduction interventions.
Aramyan, Lusine; Piras, Simone; Quested, Thomas Edward; Righi, Simone; Setti, Marco; Vittuari, Matteo; Stewart, Gavin Bruce
2018-01-01
Food waste from households contributes the greatest proportion to total food waste in developed countries. Therefore, food waste reduction requires an understanding of the socio-economic (contextual and behavioural) factors that lead to its generation within the household. Addressing such a complex subject calls for sound methodological approaches that until now have been conditioned by the large number of factors involved in waste generation, by the lack of a recognised definition, and by limited available data. This work contributes to food waste generation literature by using one of the largest available datasets that includes data on the objective amount of avoidable household food waste, along with information on a series of socio-economic factors. In order to address one aspect of the complexity of the problem, machine learning algorithms (random forests and boruta) for variable selection integrated with linear modelling, model selection and averaging are implemented. Model selection addresses model structural uncertainty, which is not routinely considered in assessments of food waste in literature. The main drivers of food waste in the home selected in the most parsimonious models include household size, the presence of fussy eaters, employment status, home ownership status, and the local authority. Results, regardless of which variable set the models are run on, point toward large households as being a key target element for food waste reduction interventions. PMID:29389949
Scale dependency of American marten (Martes americana) habitat relations [Chapter 12]
Andrew J. Shirk; Tzeidle N. Wasserman; Samuel A. Cushman; Martin G. Raphael
2012-01-01
Animals select habitat resources at multiple spatial scales; therefore, explicit attention to scale-dependency when modeling habitat relations is critical to understanding how organisms select habitat in complex landscapes. Models that evaluate habitat variables calculated at a single spatial scale (e.g., patch, home range) fail to account for the effects of...
Estimation of selection intensity under overdominance by Bayesian methods.
Buzbas, Erkan Ozge; Joyce, Paul; Abdo, Zaid
2009-01-01
A balanced pattern in the allele frequencies of polymorphic loci is a potential sign of selection, particularly of overdominance. Although this type of selection is of some interest in population genetics, there exist no likelihood-based approaches specifically tailored to make inference on selection intensity. To fill this gap, we present Bayesian methods to estimate selection intensity under k-allele models with overdominance. Our model allows for an arbitrary number of loci and alleles within a locus. The neutral and selected variability within each locus are modeled with corresponding k-allele models. To estimate the posterior distribution of the mean selection intensity in a multilocus region, a hierarchical setup between loci is used. The methods are demonstrated with data at the Human Leukocyte Antigen loci from world-wide populations.
Tian, Xin; Xin, Mingyuan; Luo, Jian; Liu, Mingyao; Jiang, Zhenran
2017-02-01
The selection of relevant genes for breast cancer metastasis is critical for the treatment and prognosis of cancer patients. Although much effort has been devoted to gene selection procedures using different statistical analysis methods or computational techniques, the interpretation of the variables in the resulting survival models has been limited so far. This article proposes a new Random Forest (RF)-based algorithm to identify important variables closely related to breast cancer metastasis, which is based on the importance scores of two variable selection algorithms: the mean decrease Gini (MDG) criterion of Random Forest and the GeneRank algorithm with protein-protein interaction (PPI) information. The new gene selection algorithm is called PPIRF. The improved prediction accuracy illustrated the reliability and high interpretability of the gene list selected by the PPIRF approach.
Developing deterioration models for Wyoming bridges.
DOT National Transportation Integrated Search
2016-05-01
Deterioration models for the Wyoming Bridge Inventory were developed using both stochastic and deterministic models. The selection of explanatory variables is investigated and a new method using LASSO regression to eliminate human bias in explana...
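LASSO regression removes explanatory variables by shrinking some coefficients exactly to zero, which is why it can stand in for manual variable choice. The sketch below is a generic cyclic coordinate descent for the LASSO objective, offered as an illustration and not as the report's method; the function names, the tiny design matrix and the penalty value are hypothetical, and the data are assumed to be centred with no intercept.

```python
from typing import List, Sequence


def soft_threshold(value: float, threshold: float) -> float:
    """Shrink `value` toward zero by `threshold`, clipping at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(
    X: Sequence[Sequence[float]],
    y: Sequence[float],
    lam: float,
    n_sweeps: int = 100,
) -> List[float]:
    """Minimise (1 / (2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic
    coordinate descent. No intercept is fitted."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            # Covariance of column j with the residual that ignores column j.
            rho = 0.0
            for i in range(n):
                partial = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
            rho /= n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam) / z if z > 0 else 0.0
    return beta


# Made-up example: the first column drives y, the second barely matters.
X_toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y_toy = [2.0, 0.1, 2.0, 0.1]
beta_toy = lasso_coordinate_descent(X_toy, y_toy, lam=0.1)
print(beta_toy)
```

In this toy case the weak second predictor is dropped (its coefficient is exactly zero) while the first is kept with a slightly shrunken coefficient.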
Variable Step-Size Selection Methods for Implicit Integration Schemes
2005-10-01
for ρ_k numerically. 4 Examples. In this section we explore this variable step-size selection method for two problems, the Lotka-Volterra model and...the Kepler problem. 4.1 The Lotka-Volterra Model. For this example we consider the Lotka-Volterra model of a simple predator-prey system from...problems. Consider this variation to the Lotka-Volterra problem: (\dot{u}, \dot{v})^T = (u^2 v(v − 2), v^2 u(1 − u))^T = f(u, v); t ∈ [0, 50
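The modified Lotka-Volterra right-hand side quoted in this excerpt can be coded directly and advanced with an implicit scheme. The sketch below uses the implicit midpoint rule with a fixed step and plain fixed-point iteration; this is a swapped-in illustration, since the excerpt does not show which implicit scheme or step-size rule the report uses. The initial state and step size in the usage lines are hypothetical.

```python
from typing import Tuple

State = Tuple[float, float]


def f(state: State) -> State:
    """Right-hand side of the modified Lotka-Volterra system in the excerpt:
    du/dt = u^2 v (v - 2), dv/dt = v^2 u (1 - u)."""
    u, v = state
    return (u * u * v * (v - 2.0), v * v * u * (1.0 - u))


def implicit_midpoint_step(state: State, h: float, n_iter: int = 50) -> State:
    """One implicit midpoint step y_new = y + h * f((y + y_new) / 2).
    The midpoint m solves m = y + (h / 2) * f(m); it is found here by
    fixed-point iteration, which is adequate only for small h."""
    u, v = state
    mid = state
    for _ in range(n_iter):
        fu, fv = f(mid)
        mid = (u + 0.5 * h * fu, v + 0.5 * h * fv)
    fu, fv = f(mid)
    return (u + h * fu, v + h * fv)


# Hypothetical initial condition and fixed step, for illustration only.
y = (0.9, 1.8)
for _ in range(10):
    y = implicit_midpoint_step(y, h=0.05)
print(y)
```

A variable step-size method would replace the fixed `h` with a value chosen at each step, and a stiff problem would need Newton iteration in place of the fixed-point loop.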
NASA Astrophysics Data System (ADS)
Kuppusamy, Sivaraman; Faris Khamidi, Mohd; Sheng, Lee Xia; Salvi Mari, Tamil
2017-12-01
The study intends to investigate sustainability knowledge using the “AKASA” model. This model comprises all the literacy levels: awareness, knowledge, attitude, skills and action. A total of 234 students from 5 selected private universities were surveyed using questionnaires. Students were specifically selected from year 2 and year 3 at private universities in the Klang Valley, Malaysia. The study investigates the environmental literacy level, specifically the knowledge variable. The parametric study was conducted with descriptive analysis, and the results show that environmental knowledge is at a high level compared to the other environmental literacy variables among year 2 students, year 3 students, and the two years combined.
Modification of the Integrated Sasang Constitutional Diagnostic Model
Nam, Jiho
2017-01-01
In 2012, the Korea Institute of Oriental Medicine proposed an objective and comprehensive physical diagnostic model to address quantification problems in the existing Sasang constitutional diagnostic method. However, certain issues have been raised regarding a revision of the proposed diagnostic model. In this paper, we propose various methodological approaches to address the problems of the previous diagnostic model. Firstly, more useful variables are selected in each component. Secondly, the least absolute shrinkage and selection operator is used to reduce multicollinearity without the modification of explanatory variables. Thirdly, proportions of SC types and age are considered to construct individual diagnostic models and classify the training set and the test set for reflecting the characteristics of the entire dataset. Finally, an integrated model is constructed with explanatory variables of individual diagnosis models. The proposed integrated diagnostic model significantly improves the sensitivities for both the male SY type (36.4% → 62.0%) and the female SE type (43.7% → 64.5%), which were areas of limitation of the previous integrated diagnostic model. The ideas of these new algorithms are expected to contribute not only to the scientific development of Sasang constitutional medicine in Korea but also to that of other diagnostic methods for traditional medicine. PMID:29317897
An Interactive Tool For Semi-automated Statistical Prediction Using Earth Observations and Models
NASA Astrophysics Data System (ADS)
Zaitchik, B. F.; Berhane, F.; Tadesse, T.
2015-12-01
We developed a semi-automated statistical prediction tool applicable to concurrent analysis or seasonal prediction of any time series variable in any geographic location. The tool was developed using Shiny, JavaScript, HTML and CSS. A user can extract a predictand by drawing a polygon over a region of interest on the provided user interface (global map). The user can select the Climatic Research Unit (CRU) precipitation or Climate Hazards Group InfraRed Precipitation with Station data (CHIRPS) as predictand. They can also upload their own predictand time series. Predictors can be extracted from sea surface temperature, sea level pressure, winds at different pressure levels, air temperature at various pressure levels, and geopotential height at different pressure levels. By default, reanalysis fields are applied as predictors, but the user can also upload their own predictors, including a wide range of compatible satellite-derived datasets. The package generates correlations of the variables selected with the predictand. The user also has the option to generate composites of the variables based on the predictand. Next, the user can extract predictors by drawing polygons over the regions that show strong correlations (composites). Then, the user can select some or all of the statistical prediction models provided. Provided models include Linear Regression models (GLM, SGLM), Tree-based models (bagging, random forest, boosting), Artificial Neural Network, and other non-linear models such as Generalized Additive Model (GAM) and Multivariate Adaptive Regression Splines (MARS). Finally, the user can download the analysis steps they used, such as the region they selected, the time period they specified, the predictand and predictors they chose and preprocessing options they used, and the model results in PDF or HTML format. Key words: Semi-automated prediction, Shiny, R, GLM, ANN, RF, GAM, MARS
Harmonize input selection for sediment transport prediction
NASA Astrophysics Data System (ADS)
Afan, Haitham Abdulmohsin; Keshtegar, Behrooz; Mohtar, Wan Hanna Melini Wan; El-Shafie, Ahmed
2017-09-01
In this paper, three modeling approaches using a Neural Network (NN), Response Surface Method (RSM) and response surface method based on Global Harmony Search (GHS) are applied to predict the daily time series of suspended sediment load. Generally, in the modeling approaches based on NN and RSM, the input variables for forecasting the suspended sediment load are manually selected based on the maximum correlations of input variables. The RSM is improved to select the input variables by using the error terms of the training data based on the GHS; the resulting modeling method is termed response surface method and global harmony search (RSM-GHS). The second-order polynomial function with cross terms is applied to calibrate the time series suspended sediment load with three, four and five input variables in the proposed RSM-GHS. The linear, square and cross corrections of twenty input variables of antecedent values of suspended sediment load and water discharge are investigated to achieve the best predictions of the RSM based on the GHS method. The performances of the NN, RSM and proposed RSM-GHS, including both accuracy and simplicity, are compared through several comparative prediction and error statistics. The results show that the proposed RSM-GHS is as uncomplicated as the RSM but performs better, with fewer errors and better correlation (R = 0.95, MAE = 18.09 (ton/day), RMSE = 25.16 (ton/day)) compared to the ANN (R = 0.91, MAE = 20.17 (ton/day), RMSE = 33.09 (ton/day)) and RSM (R = 0.91, MAE = 20.06 (ton/day), RMSE = 31.92 (ton/day)) for all types of input variables.
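The three error statistics quoted above (R, MAE, RMSE) can be computed as in the following minimal sketch. It assumes R denotes the Pearson correlation between observed and predicted loads, which the abstract does not state explicitly; the function name `error_stats` is hypothetical.

```python
import math


def error_stats(observed, predicted):
    """Return (R, MAE, RMSE) for paired observed and predicted values.

    R is the Pearson correlation coefficient (an assumption about the
    abstract's notation); it is undefined if either series is constant.
    """
    n = len(observed)
    if n != len(predicted) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    r = cov / math.sqrt(var_o * var_p)
    # Mean absolute error and root mean squared error, in the units of the data.
    mae = sum(abs(o - p) for o, p in zip(observed, predicted)) / n
    rmse = math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)
    return r, mae, rmse
```

Because RMSE squares the residuals, it penalizes a few large misses more than MAE does, which is one reason both are usually reported together.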
Marateb, Hamid Reza; Mansourian, Marjan; Adibi, Peyman; Farina, Dario
2014-01-01
Background: selecting the correct statistical test and data mining method depends highly on the measurement scale of data, type of variables, and purpose of the analysis. Different measurement scales are studied in detail, and statistical comparison, modeling, and data mining methods are studied using several medical examples. We have presented two ordinal-variable clustering examples, as the more challenging variable type in analysis, using Wisconsin Breast Cancer Data (WBCD). Ordinal-to-Interval scale conversion example: a breast cancer database of nine 10-level ordinal variables for 683 patients was analyzed by two ordinal-scale clustering methods. The performance of the clustering methods was assessed by comparison with the gold standard groups of malignant and benign cases that had been identified by clinical tests. Results: the sensitivity and accuracy of the two clustering methods were 98% and 96%, respectively. Their specificity was comparable. Conclusion: by using an appropriate clustering algorithm based on the measurement scale of the variables in the study, high performance can be achieved. Moreover, descriptive and inferential statistics in addition to the modeling approach must be selected based on the scale of the variables. PMID:24672565
Modulation Depth Estimation and Variable Selection in State-Space Models for Neural Interfaces
Hochberg, Leigh R.; Donoghue, John P.; Brown, Emery N.
2015-01-01
Rapid developments in neural interface technology are making it possible to record increasingly large signal sets of neural activity. Various factors such as asymmetrical information distribution and across-channel redundancy may, however, limit the benefit of high-dimensional signal sets, and the increased computational complexity may not yield corresponding improvement in system performance. High-dimensional system models may also lead to overfitting and lack of generalizability. To address these issues, we present a generalized modulation depth measure using the state-space framework that quantifies the tuning of a neural signal channel to relevant behavioral covariates. For a dynamical system, we develop computationally efficient procedures for estimating modulation depth from multivariate data. We show that this measure can be used to rank neural signals and select an optimal channel subset for inclusion in the neural decoding algorithm. We present a scheme for choosing the optimal subset based on model order selection criteria. We apply this method to neuronal ensemble spike-rate decoding in neural interfaces, using our framework to relate motor cortical activity with intended movement kinematics. With offline analysis of intracortical motor imagery data obtained from individuals with tetraplegia using the BrainGate neural interface, we demonstrate that our variable selection scheme is useful for identifying and ranking the most information-rich neural signals. We demonstrate that our approach offers several orders of magnitude lower complexity but virtually identical decoding performance compared to greedy search and other selection schemes. Our statistical analysis shows that the modulation depth of human motor cortical single-unit signals is well characterized by the generalized Pareto distribution. Our variable selection scheme has wide applicability in problems involving multisensor signal modeling and estimation in biomedical engineering systems. 
PMID:25265627
User's instructions for the cardiovascular Walters model
NASA Technical Reports Server (NTRS)
Croston, R. C.
1973-01-01
The model is a combined, steady-state cardiovascular and thermal model. It was originally developed for interactive use, but was converted to batch mode simulation for the Sigma 3 computer. The purpose of the model is to compute steady-state circulatory and thermal variables in response to exercise work loads and environmental factors. During a computer simulation run, several selected variables are printed at each time step. End conditions are also printed at the completion of the run.
Boykin, K.G.; Thompson, B.C.; Propeck-Gray, S.
2010-01-01
Despite widespread and long-standing efforts to model wildlife-habitat associations using remotely sensed and other spatially explicit data, there are relatively few evaluations of the performance of variables included in predictive models relative to actual features on the landscape. As part of the National Gap Analysis Program, we specifically examined physical site features at randomly selected sample locations in the Southwestern U.S. to assess degree of concordance with predicted features used in modeling vertebrate habitat distribution. Our analysis considered hypotheses about relative accuracy with respect to 30 vertebrate species selected to represent the spectrum of habitat generalist to specialist and categorization of site by relative degree of conservation emphasis accorded to the site. Overall comparison of 19 variables observed at 382 sample sites indicated ≥60% concordance for 12 variables. Directly measured or observed variables (slope, soil composition, rock outcrop) generally displayed high concordance, while variables that required judgments regarding descriptive categories (aspect, ecological system, landform) were less concordant. There were no differences detected in concordance among taxa groups, degree of specialization or generalization of selected taxa, or land conservation categorization of sample sites with respect to all sites. We found no support for the hypothesis that accuracy of habitat models is inversely related to degree of taxa specialization when model features for a habitat specialist could be more difficult to represent spatially. Likewise, we did not find support for the hypothesis that physical features will be predicted with higher accuracy on lands with greater dedication to biodiversity conservation than on other lands because of relative differences regarding available information. Accuracy generally was similar (>60%) to that observed for land cover mapping at the ecological system level.
These patterns demonstrate resilience of gap analysis deductive model processes to the type of remotely sensed or interpreted data used in habitat feature predictions. © 2010 Elsevier B.V.
Scheel, Ida; Ferkingstad, Egil; Frigessi, Arnoldo; Haug, Ola; Hinnerichsen, Mikkel; Meze-Hausken, Elisabeth
2013-01-01
Climate change will affect the insurance industry. We develop a Bayesian hierarchical statistical approach to explain and predict insurance losses due to weather events at a local geographic scale. The number of weather-related insurance claims is modelled by combining generalized linear models with spatially smoothed variable selection. Using Gibbs sampling and reversible jump Markov chain Monte Carlo methods, this model is fitted on daily weather and insurance data from each of the 319 municipalities which constitute southern and central Norway for the period 1997–2006. Precise out-of-sample predictions validate the model. Our results show interesting regional patterns in the effect of different weather covariates. In addition to being useful for insurance pricing, our model can be used for short-term predictions based on weather forecasts and for long-term predictions based on downscaled climate models. PMID:23396890
An Ensemble Successive Project Algorithm for Liquor Detection Using Near Infrared Sensor.
Qu, Fangfang; Ren, Dong; Wang, Jihua; Zhang, Zhong; Lu, Na; Meng, Lei
2016-01-11
Spectral analysis based on near infrared (NIR) sensors is a powerful tool for complex information processing and high precision recognition, and it has been widely applied to quality analysis and online inspection of agricultural products. This paper proposes a new method to address the instability of small sample sizes in the successive projections algorithm (SPA) as well as the lack of association between selected variables and the analyte. The proposed method is an evaluated bootstrap ensemble SPA method (EBSPA) based on a variable evaluation index (EI) for variable selection, and is applied to the quantitative prediction of alcohol concentrations in liquor using an NIR sensor. In the experiment, the proposed EBSPA is combined with three kinds of modeling methods to test its performance. In addition, the proposed EBSPA combined with partial least squares is compared with other state-of-the-art variable selection methods. The results show that the proposed method can solve the defects of SPA and that it has the best generalization performance and stability. Furthermore, the physical meaning of the selected variables from the near infrared sensor data is clear, which can effectively reduce the number of variables and improve prediction accuracy.
Variable mechanical ventilation
Fontela, Paula Caitano; Prestes, Renata Bernardy; Forgiarini Jr., Luiz Alberto; Friedman, Gilberto
2017-01-01
Objective To review the literature on the use of variable mechanical ventilation and the main outcomes of this technique. Methods Search, selection, and analysis of all original articles on variable ventilation, without restriction on the period of publication and language, available in the electronic databases LILACS, MEDLINE®, and PubMed, by searching the terms "variable ventilation" OR "noisy ventilation" OR "biologically variable ventilation". Results A total of 36 studies were selected. Of these, 24 were original studies, including 21 experimental studies and three clinical studies. Conclusion Several experimental studies reported the beneficial effects of distinct variable ventilation strategies on lung function using different models of lung injury and healthy lungs. Variable ventilation seems to be a viable strategy for improving gas exchange and respiratory mechanics and preventing lung injury associated with mechanical ventilation. However, further clinical studies are necessary to assess the potential of variable ventilation strategies for the clinical improvement of patients undergoing mechanical ventilation. PMID:28444076
Bolandzadeh, Niousha; Kording, Konrad; Salowitz, Nicole; Davis, Jennifer C; Hsu, Liang; Chan, Alison; Sharma, Devika; Blohm, Gunnar; Liu-Ambrose, Teresa
2015-01-01
Current research suggests that the neuropathology of dementia, including brain changes leading to memory impairment and cognitive decline, is evident years before the onset of this disease. Older adults with cognitive decline have reduced functional independence and quality of life, and are at greater risk for developing dementia. Therefore, identifying biomarkers that can be easily assessed within the clinical setting and predict cognitive decline is important. Early recognition of cognitive decline could promote timely implementation of preventive strategies. We included 89 community-dwelling adults aged 70 years and older in our study, and collected 32 measures of physical function, health status and cognitive function at baseline. We utilized an L1-L2 regularized regression model (elastic net) to identify which of the 32 baseline measures were strongly predictive of cognitive function after one year. We built three linear regression models: 1) based on baseline cognitive function, 2) based on variables consistently selected in every cross-validation loop, and 3) a full model based on all the 32 variables. Each of these models was carefully tested with nested cross-validation. Our model with the six variables consistently selected in every cross-validation loop had a mean squared prediction error of 7.47. This number was smaller than that of the full model (115.33) and the model with baseline cognitive function (7.98). Our model explained 47% of the variance in cognitive function after one year. We built a parsimonious model based on a selected set of six physical function and health status measures strongly predictive of cognitive function after one year. In addition to reducing the complexity of the model without changing the model significantly, our model with the top variables improved the mean prediction error and R-squared. These six physical function and health status measures can be easily implemented in a clinical setting.
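The rule of keeping only "variables consistently selected in every cross-validation loop" amounts to intersecting the per-fold selections. The sketch below assumes each fold's elastic net fit has already produced a list of variables with nonzero coefficients; the function name and the variable names in the usage example are hypothetical, not taken from the study.

```python
def consistently_selected(selections_per_fold):
    """Return the variables chosen in every fold, sorted for stable output.

    selections_per_fold: iterable of iterables of variable names, one per
    cross-validation loop (assumed to come from a sparse model fit).
    """
    folds = [set(s) for s in selections_per_fold]
    if not folds:
        return []
    return sorted(set.intersection(*folds))


# Hypothetical selections from three cross-validation loops.
folds = [
    ["gait_speed", "grip_strength", "age"],
    ["gait_speed", "age"],
    ["age", "gait_speed", "bmi"],
]
print(consistently_selected(folds))
```

Requiring selection in every loop is a strict criterion; a softer variant would keep variables selected in at least some fraction of the loops.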
Stochastic model search with binary outcomes for genome-wide association studies.
Russu, Alberto; Malovini, Alberto; Puca, Annibale A; Bellazzi, Riccardo
2012-06-01
The spread of case-control genome-wide association studies (GWASs) has stimulated the development of new variable selection methods and predictive models. We introduce a novel Bayesian model search algorithm, Binary Outcome Stochastic Search (BOSS), which addresses the model selection problem when the number of predictors far exceeds the number of binary responses. Our method is based on a latent variable model that links the observed outcomes to the underlying genetic variables. A Markov Chain Monte Carlo approach is used for model search and to evaluate the posterior probability of each predictor. BOSS is compared with three established methods (stepwise regression, logistic lasso, and elastic net) in a simulated benchmark. Two real case studies are also investigated: a GWAS on the genetic bases of longevity, and the type 2 diabetes study from the Wellcome Trust Case Control Consortium. Simulations show that BOSS achieves higher precisions than the reference methods while preserving good recall rates. In both experimental studies, BOSS successfully detects genetic polymorphisms previously reported to be associated with the analyzed phenotypes. BOSS outperforms the other methods in terms of F-measure on simulated data. In the two real studies, BOSS successfully detects biologically relevant features, some of which are missed by univariate analysis and the three reference techniques. The proposed algorithm is an advance in the methodology for model selection with a large number of features. Our simulated and experimental results showed that BOSS proves effective in detecting relevant markers while providing a parsimonious model.
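Precision, recall and F-measure for a variable selection method can be computed by comparing the selected predictors with the truly associated ones, which are known in a simulated benchmark. This is a minimal sketch assuming the balanced F-measure (F1); the abstract does not say which weighting was used, and the marker identifiers in the test values are made up.

```python
def precision_recall_f(selected, relevant):
    """Return (precision, recall, F1) for a set of selected predictors.

    selected: predictors chosen by the method.
    relevant: predictors that are truly associated with the outcome.
    """
    selected, relevant = set(selected), set(relevant)
    true_positives = len(selected & relevant)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

A method that selects few predictors tends toward high precision and lower recall, which is why a parsimonious search is judged on both.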
NASA Astrophysics Data System (ADS)
Gerlitz, Lars; Gafurov, Abror; Apel, Heiko; Unger-Sayesteh, Katy; Vorogushyn, Sergiy; Merz, Bruno
2016-04-01
Statistical climate forecast applications typically utilize a small set of large scale SST or climate indices, such as ENSO, PDO or AMO as predictor variables. If the predictive skill of these large scale modes is insufficient, specific predictor variables such as customized SST patterns are frequently included. Hence statistically based climate forecast models are either based on a fixed number of climate indices (and thus might not consider important predictor variables) or are highly site specific and barely transferable to other regions. With the aim of developing an operational seasonal forecast model, which is easily transferable to any region in the world, we present a generic data mining approach which automatically selects potential predictors from gridded SST observations and reanalysis derived large scale atmospheric circulation patterns and generates robust statistical relationships with posterior precipitation anomalies for user selected target regions. Potential predictor variables are derived by means of a cellwise correlation analysis of precipitation anomalies with gridded global climate variables under consideration of varying lead times. Significantly correlated grid cells are subsequently aggregated to predictor regions by means of a variability based cluster analysis. Finally for every month and lead time, an individual random forest based forecast model is automatically calibrated and evaluated by means of the preliminary generated predictor variables. The model is exemplarily applied and evaluated for selected headwater catchments in Central and South Asia. Particularly for the winter and spring precipitation (which is associated with westerly disturbances in the entire target domain) the model shows solid results with correlation coefficients up to 0.7, although the variability of precipitation rates is highly underestimated.
Likewise for the monsoonal precipitation amounts in the South Asian target areas a certain skill of the model could be detected. The skill of the model for the dry summer season in Central Asia and the transition seasons over South Asia is found to be low. A sensitivity analysis by means of well known climate indices reveals the major large scale controlling mechanisms for the seasonal precipitation climate of each target area. For the Central Asian target areas, both the El Nino Southern Oscillation and the North Atlantic Oscillation are identified as important controlling factors for precipitation totals during the moist spring season. Drought conditions are found to be triggered by a warm ENSO phase in combination with a positive phase of the NAO. For the monsoonal summer precipitation amounts over Southern Asia, the model suggests a distinct negative response to El Nino events.
Liu, Zun-lei; Yuan, Xing-wei; Yang, Lin-lin; Yan, Li-ping; Zhang, Hui; Cheng, Jia-hua
2015-02-01
Multiple hypotheses are available to explain recruitment rate. Model selection methods can be used to identify the best model that supports a particular hypothesis. However, using a single model for estimating recruitment success is often inadequate for overexploited populations because of high model uncertainty. In this study, stock-recruitment data of small yellow croaker in the East China Sea collected from fishery dependent and independent surveys between 1992 and 2012 were used to examine density-dependent effects on recruitment success. Model selection methods based on frequentist (AIC, maximum adjusted R2 and P-values) and Bayesian (Bayesian model averaging, BMA) methods were applied to identify the relationship between recruitment and environment conditions. Interannual variability of the East China Sea environment was indicated by sea surface temperature (SST), meridional wind stress (MWS), zonal wind stress (ZWS), sea surface pressure (SPP) and runoff of Changjiang River (RCR). Mean absolute error, mean squared predictive error and continuous ranked probability score were calculated to evaluate the predictive performance of recruitment success. The results showed that model structures were not consistent across the three kinds of model selection methods: the predictive variables of the models were spawning abundance and MWS by AIC, spawning abundance by P-values, and spawning abundance, MWS and RCR by maximum adjusted R2. The recruitment success decreased linearly with stock abundance (P < 0.01), suggesting that an overcompensation effect in the recruitment success might be due to cannibalism or food competition. Meridional wind intensity showed marginally significant and positive effects on the recruitment success (P = 0.06), while runoff of Changjiang River showed a marginally negative effect (P = 0.07).
Based on mean absolute error and continuous ranked probability score, predictive error associated with models obtained from BMA was the smallest amongst different approaches, while that from models selected based on the P-value of the independent variables was the highest. However, mean squared predictive error from models selected based on the maximum adjusted R2 was highest. We found that BMA method could improve the prediction of recruitment success, derive more accurate prediction interval and quantitatively evaluate model uncertainty.
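Two of the frequentist criteria named above, AIC and adjusted R2, follow standard formulas and can be sketched as below. The candidate names and log-likelihood values in the usage example are purely hypothetical and do not reproduce the study's fits; `best_by_aic` is an illustrative helper, not part of any library.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood


def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared, penalizing R-squared for the number of predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


def best_by_aic(candidates):
    """Pick the candidate with the lowest AIC.

    candidates: dict mapping model name -> (log_likelihood, n_params).
    """
    return min(candidates, key=lambda name: aic(*candidates[name]))


# Hypothetical candidates: (maximized log-likelihood, number of parameters).
models = {"m1": (-12.0, 2), "m2": (-10.0, 3), "m3": (-9.8, 4)}
print(best_by_aic(models))
```

The two criteria penalize complexity differently, which is consistent with the abstract's observation that the selected model structure depends on the selection method.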
Reduced Lung Cancer Mortality With Lower Atmospheric Pressure.
Merrill, Ray M; Frutos, Aaron
2018-01-01
Research has shown that higher altitude is associated with lower risk of lung cancer and improved survival among patients. The current study assessed the influence of county-level atmospheric pressure (a measure reflecting both altitude and temperature) on age-adjusted lung cancer mortality rates in the contiguous United States, with 2 forms of spatial regression. Ordinary least squares regression and geographically weighted regression models were used to evaluate the impact of climate and other selected variables on lung cancer mortality, based on 2974 counties. Atmospheric pressure was significantly positively associated with lung cancer mortality, after controlling for sunlight, precipitation, PM2.5 (µg/m³), current smoker, and other selected variables. Positive county-level β coefficient estimates (P < .05) for atmospheric pressure were observed throughout the United States, higher in the eastern half of the country. The spatial regression models showed that atmospheric pressure is positively associated with age-adjusted lung cancer mortality rates, after controlling for other selected variables.
Plasticity models of material variability based on uncertainty quantification techniques
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jones, Reese E.; Rizzi, Francesco; Boyce, Brad
The advent of fabrication techniques like additive manufacturing has focused attention on the considerable variability of material response due to defects and other micro-structural aspects. This variability motivates the development of an enhanced design methodology that incorporates inherent material variability to provide robust predictions of performance. In this work, we develop plasticity models capable of representing the distribution of mechanical responses observed in experiments using traditional plasticity models of the mean response and recently developed uncertainty quantification (UQ) techniques. Lastly, we demonstrate that the new method provides predictive realizations that are superior to more traditional ones, and how these UQ techniques can be used in model selection and assessing the quality of calibrated physical parameters.
Bayesian dynamic modeling of time series of dengue disease case counts
López-Quílez, Antonio; Torres-Prieto, Alexander
2017-01-01
The aim of this study is to model the association between weekly time series of dengue case counts and meteorological variables, in a high-incidence city of Colombia, applying Bayesian hierarchical dynamic generalized linear models over the period January 2008 to August 2015. Additionally, we evaluate the model’s short-term performance for predicting dengue cases. The methodology shows dynamic Poisson log link models including constant or time-varying coefficients for the meteorological variables. Calendar effects were modeled using constant or first- or second-order random walk time-varying coefficients. The meteorological variables were modeled using constant coefficients and first-order random walk time-varying coefficients. We applied Markov Chain Monte Carlo simulations for parameter estimation, and deviance information criterion statistic (DIC) for model selection. We assessed the short-term predictive performance of the selected final model, at several time points within the study period using the mean absolute percentage error. The results showed the best model including first-order random walk time-varying coefficients for calendar trend and first-order random walk time-varying coefficients for the meteorological variables. Besides the computational challenges, interpreting the results implies a complete analysis of the time series of dengue with respect to the parameter estimates of the meteorological effects. We found small values of the mean absolute percentage errors at one or two weeks out-of-sample predictions for most prediction points, associated with low volatility periods in the dengue counts. We discuss the advantages and limitations of the dynamic Poisson models for studying the association between time series of dengue disease and meteorological variables. 
The key conclusion of the study is that dynamic Poisson models account for the dynamic nature of the variables involved in the modeling of time series of dengue disease, producing useful models for decision-making in public health. PMID:28671941
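The mean absolute percentage error used above to assess short-term predictions can be computed as in this minimal sketch. The function name is hypothetical, and the sketch assumes that weeks with zero observed cases are excluded or handled separately, since the measure is undefined there.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent.

    Raises ValueError when an actual value is zero, because the
    percentage error is undefined for that observation.
    """
    if len(actual) != len(forecast) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)
```

Because each error is scaled by the observed count, the same absolute miss weighs more in low-count weeks, which is worth keeping in mind when comparing prediction points across high- and low-volatility periods.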
ERIC Educational Resources Information Center
Dolan, Conor V.; Molenaar, Peter C. M.
1994-01-01
In multigroup covariance structure analysis with structured means, the traditional latent selection model is formulated as a special case of phenotypic selection. Illustrations with real and simulated data demonstrate how one can test specific hypotheses concerning selection on latent variables. (SLD)
ERIC Educational Resources Information Center
Dong, Nianbo
2015-01-01
Researchers have become increasingly interested in programs' main and interaction effects of two variables (A and B, e.g., two treatment variables or one treatment variable and one moderator) on outcomes. A challenge for estimating main and interaction effects is to eliminate selection bias across A-by-B groups. I introduce Rubin's causal model to…
Ecological prediction with nonlinear multivariate time-frequency functional data models
Yang, Wen-Hsi; Wikle, Christopher K.; Holan, Scott H.; Wildhaber, Mark L.
2013-01-01
Time-frequency analysis has become a fundamental component of many scientific inquiries. Due to improvements in technology, the amount of high-frequency signals that are collected for ecological and other scientific processes is increasing at a dramatic rate. In order to facilitate the use of these data in ecological prediction, we introduce a class of nonlinear multivariate time-frequency functional models that can identify important features of each signal as well as the interaction of signals corresponding to the response variable of interest. Our methodology is of independent interest and utilizes stochastic search variable selection to improve model selection and performs model averaging to enhance prediction. We illustrate the effectiveness of our approach through simulation and by application to predicting spawning success of shovelnose sturgeon in the Lower Missouri River.
Attia, Khalid A M; Nassar, Mohammed W I; El-Zeiny, Mohamed B; Serag, Ahmed
2017-01-05
For the first time, a new variable selection method based on swarm intelligence namely firefly algorithm is coupled with three different multivariate calibration models namely, concentration residual augmented classical least squares, artificial neural network and support vector regression in UV spectral data. A comparative study between the firefly algorithm and the well-known genetic algorithm was developed. The discussion revealed the superiority of using this new powerful algorithm over the well-known genetic algorithm. Moreover, different statistical tests were performed and no significant differences were found between all the models regarding their predictabilities. This ensures that simpler and faster models were obtained without any deterioration of the quality of the calibration. Copyright © 2016 Elsevier B.V. All rights reserved.
Williams, Jennifer A.; Schmitter-Edgecombe, Maureen; Cook, Diane J.
2016-01-01
Introduction Reducing the amount of testing required to accurately detect cognitive impairment is clinically relevant. The aim of this research was to determine the fewest number of clinical measures required to accurately classify participants as healthy older adult, mild cognitive impairment (MCI) or dementia using a suite of classification techniques. Methods Two variable selection machine learning models (i.e., naive Bayes, decision tree), a logistic regression, and two participant datasets (i.e., clinical diagnosis, clinical dementia rating; CDR) were explored. Participants classified using clinical diagnosis criteria included 52 individuals with dementia, 97 with MCI, and 161 cognitively healthy older adults. Participants classified using CDR included 154 individuals with CDR = 0, 93 individuals with CDR = 0.5, and 25 individuals with CDR = 1.0+. Twenty-seven demographic, psychological, and neuropsychological variables were available for variable selection. Results No significant difference was observed between naive Bayes, decision tree, and logistic regression models for classification of both clinical diagnosis and CDR datasets. Participant classification (70.0 – 99.1%), geometric mean (60.9 – 98.1%), sensitivity (44.2 – 100%), and specificity (52.7 – 100%) were generally satisfactory. Unsurprisingly, the MCI/CDR = 0.5 participant group was the most challenging to classify. Through variable selection only 2 – 9 variables were required for classification and varied between datasets in a clinically meaningful way. Conclusions The current study results reveal that machine learning techniques can accurately classify cognitive impairment and reduce the number of measures required for diagnosis. PMID:26332171
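The reported classification measures can be derived from a confusion matrix, as in the sketch below. It makes two assumptions not stated in the abstract: that each measure is computed for one group against the rest (a binary view of a three-group problem), and that "geometric mean" means the square root of sensitivity times specificity. The function name and the counts in the test values are hypothetical.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, specificity and their geometric mean.

    tp, fn: positives classified correctly / incorrectly.
    tn, fp: negatives classified correctly / incorrectly.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    # Assumed definition: geometric mean of sensitivity and specificity.
    g_mean = math.sqrt(sensitivity * specificity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "g_mean": g_mean,
    }
```

Unlike accuracy, the geometric mean drops sharply when either group is classified poorly, which makes it informative for unbalanced groups such as the small CDR = 1.0+ group.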
Mathematical Model Of Variable-Polarity Plasma Arc Welding
NASA Technical Reports Server (NTRS)
Hung, R. J.
1996-01-01
Mathematical model of variable-polarity plasma arc (VPPA) welding process developed for use in predicting characteristics of welds and thus serves as guide for selection of process parameters. Parameters include welding electric currents in, and durations of, straight and reverse polarities; rates of flow of plasma and shielding gases; and sizes and relative positions of welding electrode, welding orifice, and workpiece.
Variable Selection with Prior Information for Generalized Linear Models via the Prior LASSO Method.
Jiang, Yuan; He, Yunxiao; Zhang, Heping
LASSO is a popular statistical tool often used in conjunction with generalized linear models that can simultaneously select variables and estimate parameters. When there are many variables of interest, as in current biological and biomedical studies, the power of LASSO can be limited. Fortunately, so much biological and biomedical data have been collected and they may contain useful information about the importance of certain variables. This paper proposes an extension of LASSO, namely, prior LASSO (pLASSO), to incorporate that prior information into penalized generalized linear models. The goal is achieved by adding in the LASSO criterion function an additional measure of the discrepancy between the prior information and the model. For linear regression, the whole solution path of the pLASSO estimator can be found with a procedure similar to the Least Angle Regression (LARS). Asymptotic theories and simulation results show that pLASSO provides significant improvement over LASSO when the prior information is relatively accurate. When the prior information is less reliable, pLASSO shows great robustness to the misspecification. We illustrate the application of pLASSO using a real data set from a genome-wide association study.
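The construction described above can be written schematically. This is only a sketch under assumed notation: the symbols $\ell$, $D$, $\eta$, $\lambda$ and $\hat{y}^{p}$ are illustrative and do not come from the abstract, and the paper's exact criterion may differ.

```latex
\hat{\beta}_{\mathrm{pLASSO}}
  = \arg\min_{\beta}
    \Big\{ -\tfrac{1}{n}\,\ell(\beta; X, y)
           + \eta\, D\big(\hat{y}^{p},\, X\beta\big)
           + \lambda \lVert \beta \rVert_{1} \Big\}
```

Here $\ell$ stands for the generalized linear model log-likelihood, $D$ for a measure of discrepancy between predictions implied by the prior information ($\hat{y}^{p}$) and the fitted model, $\eta \ge 0$ for the weight given to the prior, and $\lambda \ge 0$ for the usual LASSO penalty level. Setting $\eta = 0$ recovers the ordinary LASSO criterion.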
NASA Astrophysics Data System (ADS)
Agjee, Na'eem Hoosen; Ismail, Riyad; Mutanga, Onisimo
2016-10-01
Water hyacinth plants (Eichhornia crassipes) are threatening freshwater ecosystems throughout Africa. The Neochetina spp. weevils are seen as an effective solution that can combat the proliferation of the invasive alien plant. We aimed to determine if multitemporal hyperspectral data could be utilized to detect the efficacy of the biocontrol agent. The random forest (RF) algorithm was used to classify variable infestation levels for 6 weeks using: (1) all the hyperspectral bands, (2) bands selected by the recursive feature elimination (RFE) algorithm, and (3) bands selected by the Boruta algorithm. Results showed that the RF model using all the bands successfully produced low-classification errors (12.50% to 32.29%) for all 6 weeks. However, the RF model using Boruta selected bands produced lower classification errors (8.33% to 15.62%) than the RF model using all the bands or bands selected by the RFE algorithm (11.25% to 21.25%) for all 6 weeks, highlighting the utility of Boruta as an all relevant band selection algorithm. All relevant bands selected by Boruta included: 352, 754, 770, 771, 775, 781, 782, 783, 786, and 789 nm. It was concluded that RF coupled with Boruta band-selection algorithm can be utilized to undertake multitemporal monitoring of variable infestation levels on water hyacinth plants.
Influence of BMI and dietary restraint on self-selected portions of prepared meals in US women.
Labbe, David; Rytz, Andréas; Brunstrom, Jeffrey M; Forde, Ciarán G; Martin, Nathalie
2017-04-01
The rise of obesity prevalence has been attributed in part to an increase in food and beverage portion sizes selected and consumed among overweight and obese consumers. Nevertheless, evidence from observations of adults is mixed and contradictory findings might reflect the use of small or unrepresentative samples. The objective of this study was i) to determine the extent to which BMI and dietary restraint predict self-selected portion sizes for a range of commercially available prepared savoury meals and ii) to consider the importance of these variables relative to two previously established predictors of portion selection, expected satiation and expected liking. A representative sample of female consumers (N = 300, range 18-55 years) evaluated 15 frozen savoury prepared meals. For each meal, participants rated their expected satiation and expected liking, and selected their ideal portion using a previously validated computer-based task. Dietary restraint was quantified using the Dutch Eating Behaviour Questionnaire (DEBQ-R). Hierarchical multiple regression was performed on self-selected portions with age, hunger level, and meal familiarity entered as control variables in the first step of the model, expected satiation and expected liking as predictor variables in the second step, and DEBQ-R and BMI as exploratory predictor variables in the third step. The second and third steps significantly explained variance in portion size selection (18% and 4%, respectively). Larger portion selections were significantly associated with lower dietary restraint and with lower expected satiation. There was a positive relationship between BMI and portion size selection (p = 0.06) and between expected liking and portion size selection (p = 0.06). Our discussion considers future research directions, the limited variance explained by our model, and the potential for portion size underreporting by overweight participants. Copyright © 2016 Nestec S.A. Published by Elsevier Ltd. All rights reserved.
Westover, Matthew; Baxter, Jared; Baxter, Rick; Day, Casey; Jensen, Ryan; Petersen, Steve; Larsen, Randy
2016-01-01
Greater sage-grouse populations have decreased steadily since European settlement in western North America. Reduced availability of brood-rearing habitat has been identified as a limiting factor for many populations. We used radio-telemetry to acquire locations of sage-grouse broods from 1998 to 2012 in Strawberry Valley, Utah. Using these locations and remotely-sensed NAIP (National Agricultural Imagery Program) imagery, we 1) determined which characteristics of brood-rearing habitat could be used in widely available, high resolution imagery 2) assessed the spatial extent at which sage-grouse selected brood-rearing habitat, and 3) created a predictive habitat model to identify areas of preferred brood-rearing habitat. We used AIC model selection to evaluate support for a list of variables derived from remotely-sensed imagery. We examined the relationship of these explanatory variables at three spatial extents (45, 200, and 795 meter radii). Our top model included 10 variables (percent shrub, percent grass, percent tree, percent paved road, percent riparian, meters of sage/tree edge, meters of riparian/tree edge, distance to tree, distance to transmission lines, and distance to permanent structures). Variables from each spatial extent were represented in our top model with the majority being associated with the larger (795 meter) spatial extent. When applied to our study area, our top model predicted 75% of naïve brood locations suggesting reasonable success using this method and widely available NAIP imagery. We encourage application of our methodology to other sage-grouse populations and species of conservation concern.
Planillo, Aimara; Malo, Juan E
2018-01-01
Human disturbance is widespread across landscapes in the form of roads that alter wildlife populations. Knowing which road features are responsible for the species response and their relevance in comparison with environmental variables will provide useful information for effective conservation measures. We sampled relative abundance of European rabbits, a very widespread species, in motorway verges at regional scale, in an area with large variability in environmental and infrastructure conditions. Environmental variables included vegetation structure, plant productivity, distance to water sources, and altitude. Infrastructure characteristics were the type of vegetation in verges, verge width, traffic volume, and the presence of embankments. We performed a variance partitioning analysis to determine the relative importance of the two sets of variables on rabbit abundance. Additionally, we identified the most important variables and their effects using model averaging after model selection by AICc on hypothesis-based models. As a group, infrastructure features explained four times more variability in rabbit abundance than environmental variables, with the effects of the former being critical in motorway stretches located in altered landscapes with no available habitat for rabbits, such as agricultural fields. Model selection and Akaike weights showed that verge width and traffic volume are the most important variables explaining the rabbit abundance index, with positive and negative effects, respectively. In the light of these results, the response of species to the infrastructure can be modulated through the modification of motorway features, some of which are manageable in the design phase. The identification of such features leads to suggestions for improvement through low-cost corrective measures and conservation plans. As a general indication, keeping motorway verges less than 10 m wide will prevent high densities of rabbits and avoid the unwanted effects that rabbit populations can generate in some areas.
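Model selection by AICc and Akaike weights, as mentioned above, can be sketched with the standard formulas. The candidate model names, log-likelihoods, parameter counts and sample size below are hypothetical and serve only to show the arithmetic; they are not values from the study.

```python
import math

def aicc(log_likelihood, k, n):
    """Small-sample corrected AIC for a model with k estimated
    parameters fitted to n observations (requires n > k + 1)."""
    aic = 2 * k - 2 * log_likelihood
    return aic + (2 * k * (k + 1)) / (n - k - 1)

def akaike_weights(scores):
    """Convert AIC or AICc scores into Akaike weights that sum to 1."""
    best = min(scores)
    rel = [math.exp(-0.5 * (s - best)) for s in scores]
    total = sum(rel)
    return [r / total for r in rel]

# Hypothetical candidates: (label, log-likelihood, number of parameters)
candidates = [
    ("verge width + traffic volume", -120.0, 3),
    ("environmental variables only", -125.0, 5),
    ("intercept only", -131.0, 1),
]
n_obs = 60  # hypothetical sample size

scores = [aicc(ll, k, n_obs) for _, ll, k in candidates]
weights = akaike_weights(scores)
for (label, _, _), s, w in zip(candidates, scores, weights):
    print(f"{label}: AICc={s:.2f}, weight={w:.3f}")
```

Each weight can be read as the relative support for a model within the candidate set, which is what makes the weights usable for model averaging.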
He, Yan-Lin; Xu, Yuan; Geng, Zhi-Qiang; Zhu, Qun-Xiong
2016-03-01
In this paper, a hybrid robust model based on an improved functional link neural network integrated with partial least squares (IFLNN-PLS) is proposed. Firstly, an improved functional link neural network with small norm of expanded weights and high input-output correlation (SNEWHIOC-FLNN) was proposed for enhancing the generalization performance of FLNN. Unlike the traditional FLNN, the expanded variables of the original inputs are not directly used as the inputs in the proposed SNEWHIOC-FLNN model. The original inputs are attached to some small norm of expanded weights. As a result, the correlation coefficient between some of the expanded variables and the outputs is enhanced. The larger the correlation coefficient is, the more relevant the expanded variables tend to be. In the end, the expanded variables with larger correlation coefficients are selected as the inputs to improve the performance of the traditional FLNN. In order to test the proposed SNEWHIOC-FLNN model, three UCI (University of California, Irvine) regression datasets named Housing, Concrete Compressive Strength (CCS), and Yacht Hydro Dynamics (YHD) are selected. Then a hybrid model based on the improved FLNN integrated with partial least squares (IFLNN-PLS) was built. In the IFLNN-PLS model, the connection weights are calculated using the partial least squares method rather than the error back-propagation algorithm. Lastly, IFLNN-PLS was developed as an intelligent measurement model for accurately predicting the key variables in the Purified Terephthalic Acid (PTA) process and the High Density Polyethylene (HDPE) process. Simulation results illustrated that the IFLNN-PLS could significantly improve the prediction performance. Copyright © 2015 ISA. Published by Elsevier Ltd. All rights reserved.
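The correlation-based selection step described above can be illustrated with a simplified sketch. It covers only the idea of ranking expanded variables by their correlation with the output. The trigonometric expansion, the function names and the toy data are assumptions for illustration, and the small-norm expanded weights of SNEWHIOC-FLNN are not modelled here.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def expand(x):
    """Hypothetical functional expansion of one scalar input."""
    return {
        "x": x,
        "sin(pi x)": math.sin(math.pi * x),
        "cos(pi x)": math.cos(math.pi * x),
    }

def select_expanded(xs, ys, top_k):
    """Rank expanded variables by |correlation| with the output
    and return the names of the top_k variables."""
    columns = {}
    for x in xs:
        for name, value in expand(x).items():
            columns.setdefault(name, []).append(value)
    ranked = sorted(columns,
                    key=lambda name: abs(pearson(columns[name], ys)),
                    reverse=True)
    return ranked[:top_k]

# Toy data: the output is exactly one of the expanded variables
xs = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
ys = [math.sin(math.pi * x) for x in xs]
selected = select_expanded(xs, ys, top_k=1)
print(selected)
```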
NASA Astrophysics Data System (ADS)
Kamaruddin, Ainur Amira; Ali, Zalila; Noor, Norlida Mohd.; Baharum, Adam; Ahmad, Wan Muhamad Amir W.
2014-07-01
Logistic regression analysis examines the influence of various factors on a dichotomous outcome by estimating the probability of the event's occurrence. Logistic regression, also called a logit model, is a statistical procedure used to model dichotomous outcomes. In the logit model the log odds of the dichotomous outcome is modeled as a linear combination of the predictor variables. The log odds ratio in logistic regression provides a description of the probabilistic relationship of the variables and the outcome. In conducting logistic regression, selection procedures are used in selecting important predictor variables, diagnostics are used to check that assumptions are valid which include independence of errors, linearity in the logit for continuous variables, absence of multicollinearity, and lack of strongly influential outliers and a test statistic is calculated to determine the aptness of the model. This study used the binary logistic regression model to investigate overweight and obesity among rural secondary school students on the basis of their demographics profile, medical history, diet and lifestyle. The results indicate that overweight and obesity of students are influenced by obesity in family and the interaction between a student's ethnicity and routine meals intake. The odds of a student being overweight and obese are higher for a student having a family history of obesity and for a non-Malay student who frequently takes routine meals as compared to a Malay student.
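The relationship between log odds, probabilities and odds ratios described above can be sketched as follows. The coefficient values and predictor coding are hypothetical and are not estimates from the study.

```python
import math

def predicted_probability(intercept, coefs, x):
    """Logit model: the log odds are a linear combination of the
    predictors; the inverse logit turns them into a probability."""
    log_odds = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-log_odds))

def odds_ratio(coef):
    """Odds ratio for a one-unit increase in a predictor."""
    return math.exp(coef)

# Hypothetical coefficients: family history of obesity (0/1) and
# an ethnicity-by-meal-frequency interaction indicator (0/1)
intercept = -1.0
coefs = [0.8, 0.5]

p = predicted_probability(intercept, coefs, [1, 1])
print(f"probability = {p:.3f}")
print(f"odds ratio for family history = {odds_ratio(coefs[0]):.3f}")
```

An odds ratio above 1 means the odds of the outcome increase with the predictor, which is how statements such as "the odds are higher for a student having a family history of obesity" are read from fitted coefficients.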
de Almeida, Valber Elias; de Araújo Gomes, Adriano; de Sousa Fernandes, David Douglas; Goicoechea, Héctor Casimiro; Galvão, Roberto Kawakami Harrop; Araújo, Mario Cesar Ugulino
2018-05-01
This paper proposes a new variable selection method for nonlinear multivariate calibration, combining the Successive Projections Algorithm for interval selection (iSPA) with the Kernel Partial Least Squares (Kernel-PLS) modelling technique. The proposed iSPA-Kernel-PLS algorithm is employed in a case study involving a Vis-NIR spectrometric dataset with complex nonlinear features. The analytical problem consists of determining Brix and sucrose content in samples from a sugar production system, on the basis of transflectance spectra. As compared to full-spectrum Kernel-PLS, the iSPA-Kernel-PLS models involve a smaller number of variables and display statistically significant superiority in terms of accuracy and/or bias in the predictions. Published by Elsevier B.V.
Spatio-temporal Bayesian model selection for disease mapping
Carroll, R; Lawson, AB; Faes, C; Kirby, RS; Aregay, M; Watjou, K
2016-01-01
Spatio-temporal analysis of small area health data often involves choosing a fixed set of predictors prior to the final model fit. In this paper, we propose a spatio-temporal approach of Bayesian model selection to implement model selection for certain areas of the study region as well as certain years in the study time line. Here, we examine the usefulness of this approach by way of a large-scale simulation study accompanied by a case study. Our results suggest that a special case of the model selection methods, a mixture model allowing a weight parameter to indicate if the appropriate linear predictor is spatial, spatio-temporal, or a mixture of the two, offers the best option to fitting these spatio-temporal models. In addition, the case study illustrates the effectiveness of this mixture model within the model selection setting by easily accommodating lifestyle, socio-economic, and physical environmental variables to select a predominantly spatio-temporal linear predictor. PMID:28070156
Wu, Jing-zhu; Wang, Feng-zhu; Wang, Li-li; Zhang, Xiao-chao; Mao, Wen-hua
2015-01-01
In order to improve the accuracy and robustness of detecting the nitrogen content of tomato seedlings by near-infrared (NIR) spectroscopy, four characteristic spectrum selection methods were studied in the present paper, i.e. competitive adaptive reweighted sampling (CARS), Monte Carlo uninformative variables elimination (MCUVE), backward interval partial least squares (BiPLS) and synergy interval partial least squares (SiPLS). A total of 60 tomato seedlings were cultivated at 10 different nitrogen-treatment levels (urea concentration from 0 to 120 mg·L⁻¹), with 6 samples at each level. They represent different degrees of nitrogen status: excess nitrogen, moderate nitrogen, nitrogen deficiency and no nitrogen. Leaves of each sample were collected and scanned by near-infrared spectroscopy from 12 500 to 3 600 cm⁻¹. Quantitative models based on the above four methods were established. According to the experimental results, the calibration models based on the CARS and MCUVE selection methods show better performance than those based on the BiPLS and SiPLS selection methods, but their prediction ability is much lower than that of the latter. Among them, the model built with BiPLS has the best prediction performance: the correlation coefficient (r), root mean square error of prediction (RMSEP) and ratio of performance to standard deviation (RPD) are 0.9527, 0.1183 and 3.291, respectively. Therefore, NIR technology combined with characteristic spectrum selection methods can improve model performance, but the characteristic spectrum selection methods are not universal. A model built on single-wavelength variable selection is more sensitive and is therefore more suitable for uniform objects, whereas a model built on wavelength-interval selection has much stronger anti-interference ability and is more suitable for uneven objects with poor reproducibility. Therefore, characteristic spectrum selection will only play a better role in model building when the sample state and the model indexes are also taken into account.
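The three figures of merit quoted above (r, RMSEP and RPD) can be computed as in the sketch below. It assumes the common convention that RPD is the standard deviation of the reference values divided by RMSEP. The function name and the toy reference and predicted values are hypothetical.

```python
import math

def prediction_metrics(y_true, y_pred):
    """Return (r, RMSEP, RPD) for a prediction set."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    ss_t = sum((t - mean_t) ** 2 for t in y_true)
    ss_p = sum((p - mean_p) ** 2 for p in y_pred)
    r = cov / math.sqrt(ss_t * ss_p)                 # correlation coefficient
    rmsep = math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    sd = math.sqrt(ss_t / (n - 1))                   # sample SD of references
    rpd = sd / rmsep                                 # assumed RPD definition
    return r, rmsep, rpd

# Toy values for illustration only
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.0]
r, rmsep, rpd = prediction_metrics(y_true, y_pred)
print(f"r={r:.4f} RMSEP={rmsep:.4f} RPD={rpd:.3f}")
```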
Economic evaluation of genomic selection in small ruminants: a sheep meat breeding program.
Shumbusho, F; Raoul, J; Astruc, J M; Palhiere, I; Lemarié, S; Fugeray-Scarbel, A; Elsen, J M
2016-06-01
Recent genomic evaluation studies using real data and predicting genetic gain by modeling breeding programs have reported moderate expected benefits from the replacement of classic selection schemes by genomic selection (GS) in small ruminants. The objectives of this study were to compare the cost, monetary genetic gain and economic efficiency of classic selection and GS schemes in the meat sheep industry. Deterministic methods were used to model selection based on multi-trait indices from a sheep meat breeding program. Decisional variables related to male selection candidates and progeny testing were optimized to maximize the annual monetary genetic gain (AMGG), that is, a weighted sum of meat and maternal traits annual genetic gains. For GS, a reference population of 2000 individuals was assumed and genomic information was available for evaluation of male candidates only. In the classic selection scheme, males breeding values were estimated from own and offspring phenotypes. In GS, different scenarios were considered, differing by the information used to select males (genomic only, genomic+own performance, genomic+offspring phenotypes). The results showed that all GS scenarios were associated with higher total variable costs than classic selection (if the cost of genotyping was 123 euros/animal). In terms of AMGG and economic returns, GS scenarios were found to be superior to classic selection only if genomic information was combined with their own meat phenotypes (GS-Pheno) or with their progeny test information. The predicted economic efficiency, defined as returns (proportional to number of expressions of AMGG in the nucleus and commercial flocks) minus total variable costs, showed that the best GS scenario (GS-Pheno) was up to 15% more efficient than classic selection. For all selection scenarios, optimization increased the overall AMGG, returns and economic efficiency. 
As a conclusion, our study shows that some forms of GS strategies are more advantageous than classic selection, provided that GS is already initiated (i.e. the initial reference population is available). Optimizing decisional variables of the classic selection scheme could be of greater benefit than including genomic information in optimized designs.
Steen, Valerie A.; Powell, Abby N.
2012-01-01
We examined wetland selection by the Black Tern (Chlidonias niger), a species that breeds primarily in the prairie pothole region, has experienced population declines, and is difficult to manage because of low site fidelity. To characterize its selection of wetlands in this region, we surveyed 589 wetlands throughout North and South Dakota. We documented breeding at 5% and foraging at 17% of wetlands. We created predictive habitat models with a machine-learning algorithm, Random Forests, to explore the relative role of local wetland characteristics and those of the surrounding landscape and to evaluate which characteristics were important to predicting breeding versus foraging. We also examined area-dependent wetland selection while addressing the passive sampling bias by replacing occurrence of terns in the models with an index of density. Local wetland variables were more important than landscape variables in predictions of occurrence of breeding and foraging. Wetland size was more important to prediction of foraging than of breeding locations, while floating matted vegetation was more important to prediction of breeding than of foraging locations. The amount of seasonal wetland in the landscape was the only landscape variable important to prediction of both foraging and breeding. Models based on a density index indicated that wetland selection by foraging terns may be more area dependent than that by breeding terns. Our study provides some of the first evidence for differential breeding and foraging wetland selection by Black Terns and for a more limited role of landscape effects and area sensitivity than has been previously shown.
Frank, Laurence E; Heiser, Willem J
2008-05-01
A set of features is the basis for the network representation of proximity data achieved by feature network models (FNMs). Features are binary variables that characterize the objects in an experiment, with some measure of proximity as response variable. Sometimes features are provided by theory and play an important role in the construction of the experimental conditions. In some research settings, the features are not known a priori. This paper shows how to generate features in this situation and how to select an adequate subset of features that takes into account a good compromise between model fit and model complexity, using a new version of least angle regression that restricts coefficients to be non-negative, called the Positive Lasso. It will be shown that features can be generated efficiently with Gray codes that are naturally linked to the FNMs. The model selection strategy makes use of the fact that FNM can be considered as univariate multiple regression model. A simulation study shows that the proposed strategy leads to satisfactory results if the number of objects is less than or equal to 22. If the number of objects is larger than 22, the number of features selected by our method exceeds the true number of features in some conditions.
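The use of Gray codes to enumerate candidate binary features can be illustrated with the standard binary-reflected Gray code. This is a sketch of the general technique, not necessarily the authors' generation procedure; the function name is hypothetical. Each feature is a 0/1 vector over the objects, and consecutive features differ in exactly one object, so quantities derived from one feature can in principle be updated incrementally for the next.

```python
def gray_code_features(n_objects):
    """Yield all non-empty binary features over n_objects in
    binary-reflected Gray code order."""
    for i in range(1, 2 ** n_objects):
        g = i ^ (i >> 1)  # i-th binary-reflected Gray code
        # Bit j of g says whether object j possesses the feature
        yield tuple((g >> j) & 1 for j in range(n_objects))

features = list(gray_code_features(3))
for f in features:
    print(f)
```

The number of candidate features grows as 2^n - 1 with the number of objects n, which is why a subset selection step such as the Positive Lasso is needed afterwards.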
NASA Astrophysics Data System (ADS)
Attia, Khalid A. M.; Nassar, Mohammed W. I.; El-Zeiny, Mohamed B.; Serag, Ahmed
2016-03-01
Different chemometric models were applied for the quantitative analysis of amoxicillin (AMX), and flucloxacillin (FLX) in their binary mixtures, namely, partial least squares (PLS), spectral residual augmented classical least squares (SRACLS), concentration residual augmented classical least squares (CRACLS) and artificial neural networks (ANNs). All methods were applied with and without variable selection procedure (genetic algorithm GA). The methods were used for the quantitative analysis of the drugs in laboratory prepared mixtures and real market sample via handling the UV spectral data. Robust and simpler models were obtained by applying GA. The proposed methods were found to be rapid, simple and required no preliminary separation steps.
NASA Astrophysics Data System (ADS)
Juszczyk, Michał; Leśniak, Agnieszka; Zima, Krzysztof
2013-06-01
Conceptual cost estimation is important for construction projects. Either underestimation or overestimation of the cost of raising a building may lead to failure of a project. In the paper, the authors present an application of multicriteria comparative analysis (MCA) to select factors influencing the cost of raising a residential building. The aim of the analysis is to indicate key factors useful in conceptual cost estimation at the early design stage. Key factors are investigated on the basis of elementary information about the function, form and structure of the building, and primary assumptions about the technological and organizational solutions applied in the construction process. These factors are considered as variables of a model whose aim is to make conceptual cost estimation fast and satisfactorily accurate. The whole analysis included three steps: preliminary research, choice of a set of potential variables, and reduction of this set to select the final set of variables. Multicriteria comparative analysis is applied to solve the problem. The analysis made it possible to select a group of factors, defined well enough at the conceptual stage of the design process, to be used as the describing variables of the model.
O'Malley, A James; Cotterill, Philip; Schermerhorn, Marc L; Landon, Bruce E
2011-12-01
When 2 treatment approaches are available, there are likely to be unmeasured confounders that influence choice of procedure, which complicates estimation of the causal effect of treatment on outcomes using observational data. To estimate the effect of endovascular (endo) versus open surgical (open) repair, including possible modification by institutional volume, on survival after treatment for abdominal aortic aneurysm, accounting for observed and unobserved confounding variables. Observational study of data from the Medicare program using a joint model of treatment selection and survival given treatment to estimate the effects of type of surgery and institutional volume on survival. We studied 61,414 eligible repairs of intact abdominal aortic aneurysms during 2001 to 2004. The outcome, perioperative death, is defined as in-hospital death or death within 30 days of operation. The key predictors are use of endo, transformed endo and open volume, and endo-volume interactions. There is strong evidence of nonrandom selection of treatment with potential confounding variables including institutional volume and procedure date, variables not typically adjusted for in clinical trials. The best fitting model included heterogeneous transformations of endo volume for endo cases and open volume for open cases as predictors. Consistent with our hypothesis, accounting for unmeasured selection reduced the mortality benefit of endo. The effect of endo versus open surgery varies nonlinearly with endo and open volume. Accounting for institutional experience and unmeasured selection enables better decision-making by physicians making treatment referrals, investigators evaluating treatments, and policy makers.
A Short Guide to the Climatic Variables of the Last Glacial Maximum for Biogeographers
Varela, Sara; Lima-Ribeiro, Matheus S.; Terribile, Levi Carina
2015-01-01
Ecological niche models are widely used for mapping the distribution of species during the last glacial maximum (LGM). Although the selection of the variables and General Circulation Models (GCMs) used for constructing those maps determine the model predictions, we still lack a discussion about which variables and which GCM should be included in the analysis and why. Here, we analyzed the climatic predictions for the LGM of 9 different GCMs in order to help biogeographers to select their GCMs and climatic layers for mapping the species ranges in the LGM. We 1) map the discrepancies between the climatic predictions of the nine GCMs available for the LGM, 2) analyze the similarities and differences between the GCMs and group them to help researchers choose the appropriate GCMs for calibrating and projecting their ecological niche models (ENM) during the LGM, and 3) quantify the agreement of the predictions for each bioclimatic variable to help researchers avoid the environmental variables with a poor consensus between models. Our results indicate that, in absolute values, GCMs have a strong disagreement in their temperature predictions for temperate areas, while the uncertainties for the precipitation variables are in the tropics. In spite of the discrepancies between model predictions, temperature variables (BIO1-BIO11) are highly correlated between models. Precipitation variables (BIO12- BIO19) show no correlation between models, and specifically, BIO14 (precipitation of the driest month) and BIO15 (Precipitation Seasonality (Coefficient of Variation)) show the highest level of discrepancy between GCMs. Following our results, we strongly recommend the use of different GCMs for constructing or projecting ENMs, particularly when predicting the distribution of species that inhabit the tropics and the temperate areas of the Northern and Southern Hemispheres, because climatic predictions for those areas vary greatly among GCMs. 
We also recommend the exclusion of BIO14 and BIO15 from ENMs because those variables show a high level of discrepancy between GCMs. Thus, by excluding them, we decrease the level of uncertainty of our predictions. All the climatic layers produced for this paper are freely available in http://ecoclimate.org/. PMID:26068930
González Costa, J J; Reigosa, M J; Matías, J M; Covelo, E F
2017-09-01
The aim of this study was to model the sorption and retention of Cd, Cu, Ni, Pb and Zn in soils. To that extent, the sorption and retention of these metals were studied and the soil characterization was performed separately. Multiple stepwise regression was used to produce multivariate models with linear techniques and with support vector machines, all of which included 15 explanatory variables characterizing soils. When the R-squared values are represented, two different groups are noticed. Cr, Cu and Pb sorption and retention show a higher R-squared; the most explanatory variables being humified organic matter, Al oxides and, in some cases, cation-exchange capacity (CEC). The other group of metals (Cd, Ni and Zn) shows a lower R-squared, and clays are the most explanatory variables, including a percentage of vermiculite and slime. In some cases, quartz, plagioclase or hematite percentages also show some explanatory capacity. Support Vector Machine (SVM) regression shows that the different models are not as regular as in multiple regression in terms of number of variables, the regression for nickel adsorption being the one with the highest number of variables in its optimal model. On the other hand, there are cases where the most explanatory variables are the same for two metals, as it happens with Cd and Cr adsorption. A similar adsorption mechanism is thus postulated. These patterns of the introduction of variables in the model allow us to create explainability sequences. Those which are the most similar to the selectivity sequences obtained by Covelo (2005) are Mn oxides in multiple regression and change capacity in SVM. Among all the variables, the only one that is explanatory for all the metals after applying the maximum parsimony principle is the percentage of sand in the retention process. 
In the competitive model arising from the aforementioned sequences, the most intense competitiveness for the adsorption and retention of different metals appears between Cr and Cd, Cu and Zn in multiple regression; and between Cr and Cd in SVM regression. Copyright © 2017 Elsevier B.V. All rights reserved.
Huntsman, Brock M; Falke, Jeffrey A; Savereide, James W; Bennett, Katrina E
2017-01-01
Density-dependent (DD) and density-independent (DI) habitat selection is strongly linked to a species' evolutionary history. Determining the relative importance of each is necessary because declining populations are not always the result of altered DI mechanisms but can often be the result of DD via a reduced carrying capacity. We developed spatially and temporally explicit models throughout the Chena River, Alaska to predict important DI mechanisms that influence Chinook salmon spawning success. We used resource-selection functions to predict suitable spawning habitat based on geomorphic characteristics, a semi-distributed water-and-energy balance hydrologic model to generate stream flow metrics, and modeled stream temperature as a function of climatic variables. Spawner counts were predicted throughout the core and periphery spawning sections of the Chena River from escapement estimates (DD) and DI variables. Additionally, we used isodar analysis to identify whether spawners actively defend spawning habitat or follow an ideal free distribution along the riverscape. Aerial counts were best explained by escapement and reference to the core or periphery, while no models with DI variables were supported in the candidate set. Furthermore, isodar plots indicated habitat selection was best explained by ideal free distributions, although there was strong evidence for active defense of core spawning habitat. Our results are surprising, given salmon commonly defend spawning resources, and are likely due to competition occurring at finer spatial scales than addressed in this study.
Huntsman, Brock M.; Falke, Jeffrey A.; Savereide, James W.; ...
2017-05-22
Density-dependent (DD) and density-independent (DI) habitat selection is strongly linked to a species’ evolutionary history. Determining the relative importance of each is necessary because declining populations are not always the result of altered DI mechanisms but can often be the result of DD via a reduced carrying capacity. Here, we developed spatially and temporally explicit models throughout the Chena River, Alaska to predict important DI mechanisms that influence Chinook salmon spawning success. We used resource-selection functions to predict suitable spawning habitat based on geomorphic characteristics, a semi-distributed water-and-energy balance hydrologic model to generate stream flow metrics, and modeled stream temperature as a function of climatic variables. Spawner counts were predicted throughout the core and periphery spawning sections of the Chena River from escapement estimates (DD) and DI variables. In addition, we used isodar analysis to identify whether spawners actively defend spawning habitat or follow an ideal free distribution along the riverscape. Aerial counts were best explained by escapement and reference to the core or periphery, while no models with DI variables were supported in the candidate set. Moreover, isodar plots indicated habitat selection was best explained by ideal free distributions, although there was strong evidence for active defense of core spawning habitat. These results are surprising, given salmon commonly defend spawning resources, and are likely due to competition occurring at finer spatial scales than addressed in this study.
Effects of baseline conditions on the simulated hydrologic response to projected climate change
Koczot, Kathryn M.; Markstrom, Steven L.; Hay, Lauren E.
2011-01-01
Changes in temperature and precipitation projected from five general circulation models, using one late-twentieth-century and three twenty-first-century emission scenarios, were downscaled to three different baseline conditions. Baseline conditions are periods of measured temperature and precipitation data selected to represent twentieth-century climate. The hydrologic effects of the climate projections are evaluated using the Precipitation-Runoff Modeling System (PRMS), which is a watershed hydrology simulation model. The Almanor Catchment in the North Fork of the Feather River basin, California, is used as a case study. Differences and similarities between PRMS simulations of hydrologic components (i.e., snowpack formation and melt, evapotranspiration, and streamflow) are examined, and results indicate that the selection of a specific time period used for baseline conditions has a substantial effect on some, but not all, hydrologic variables. This effect seems to be amplified in hydrologic variables, which accumulate over time, such as soil-moisture content. Results also indicate that uncertainty related to the selection of baseline conditions should be evaluated using a range of different baseline conditions. This is particularly important for studies in basins with highly variable climate, such as the Almanor Catchment.
NASA Technical Reports Server (NTRS)
Epperson, David L.; Davis, Jerry M.; Bloomfield, Peter; Karl, Thomas R.; Mcnab, Alan L.; Gallo, Kevin P.
1995-01-01
Multiple regression techniques were used to predict surface shelter temperatures based on the time period 1986-89 using upper-air data from the European Centre for Medium-Range Weather Forecasts (ECMWF) to represent the background climate and site-specific data to represent the local landscape. Global monthly mean temperature models were developed using data from over 5000 stations available in the Global Historical Climate Network (GHCN). Monthly maximum, mean, and minimum temperature models for the United States were also developed using data from over 1000 stations available in the U.S. Cooperative (COOP) Network and comparative monthly mean temperature models were developed using over 1150 U.S. stations in the GHCN. Three-, six-, and full-variable models were developed for comparative purposes. Inferences about the variables selected for the various models were easier for the GHCN models, which displayed month-to-month consistency in which variables were selected, than for the COOP models, which were assigned a different list of variables for nearly every month. These and other results suggest that global calibration is preferred because data from the global spectrum of physical processes that control surface temperatures are incorporated in a global model. All of the models that were developed in this study validated relatively well, especially the global models. Recalibration of the models with validation data resulted in only slightly poorer regression statistics, indicating that the calibration list of variables was valid. Predictions using data from the validation dataset in the calibrated equation were better for the GHCN models, and the globally calibrated GHCN models generally provided better U.S. predictions than the U.S.-calibrated COOP models. Overall, the GHCN and COOP models explained approximately 64%-95% of the total variance of surface shelter temperatures, depending on the month and the number of model variables. 
In addition, root-mean-square errors (RMSEs) were over 3 °C for GHCN models and over 2 °C for COOP models for winter months, and near 2 °C for GHCN models and near 1.5 °C for COOP models for summer months.
Ecological and personal predictors of science achievement in an urban center
NASA Astrophysics Data System (ADS)
Guidubaldi, John Michael
This study sought to examine selected personal and environmental factors that predict urban students' achievement test scores on the science subject area of the Ohio standardized test. Variables examined were in the general categories of teacher/classroom, student, and parent/home. It assumed that these clusters might add independent variance to a best predictor model, and that discovering relative strength of different predictors might lead to better selection of intervention strategies to improve student performance. This study was conducted in an urban school district and was comprised of teachers and students enrolled in ninth grade science in three of this district's high schools. Consenting teachers (9), students (196), and parents (196) received written surveys with questions designed to examine the predictive power of each variable cluster. Regression analyses were used to determine which factors best correlate with student scores and classroom science grades. Selected factors were then compiled into a best predictive model, predicting success on standardized science tests. Students t tests of gender and racial subgroups confirmed that there were racial differences in OPT scores, and both gender and racial differences in science grades. Additional examinations were therefore conducted for all 12 variables to determine whether gender and race had an impact on the strength of individual variable predictions and on the final best predictor model. Of the 15 original OPT and cluster variable hypotheses, eight showed significant positive relationships that occurred in the expected direction. However, when more broadly based end-of-the-year science class grade was used as a criterion, 13 of the 15 hypotheses showed significant relationships in the expected direction. With both criteria, significant gender and racial differences were observed in the strength of individual predictors and in the composition of best predictor models.
Rúa-Uribe, Guillermo L; Suárez-Acosta, Carolina; Chauca, José; Ventosilla, Palmira; Almanza, Rita
2013-09-01
Dengue fever is a vector-borne disease with a major impact on public health, and its transmission is influenced by entomological, sociocultural and economic factors. Additionally, climate variability plays an important role in the transmission dynamics. A broad scientific consensus indicates that the strong association between climatic variables and the disease could be used to develop models to explain its incidence. The objective was to develop a model that provides a better understanding of dengue transmission dynamics in Medellin and predicts increases in the incidence of the disease. The incidence of dengue fever was used as the dependent variable, and weekly climatic factors (maximum, mean and minimum temperature, relative humidity and precipitation) as independent variables. Expert Modeler was used to develop a model to better explain the behavior of the disease. Climatic variables with a significant association with the dependent variable were selected through ARIMA models. The model explains 34% of the observed variability. Precipitation was the climatic variable showing a statistically significant association with the incidence of dengue fever, but with a 20-week delay. In Medellin, the transmission of dengue fever was influenced by climate variability, especially precipitation. The strong association between dengue fever and precipitation allowed the construction of a model to help understand dengue transmission dynamics. This information will be useful to develop appropriate and timely strategies for dengue control.
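The idea of a delayed association between a climatic driver and incidence can be illustrated with a simple lagged correlation. This is only a sketch on synthetic data, not the ARIMA procedure used in the study: the series, the noise levels, the 30-week search window and the function names are all assumptions made for illustration.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def lagged_correlations(driver, response, max_lag):
    """Correlate driver[t - lag] with response[t] for lag = 0..max_lag."""
    out = {}
    for lag in range(max_lag + 1):
        x = driver[: len(driver) - lag]   # driver values, shifted back by `lag`
        y = response[lag:]                # response values they are paired with
        out[lag] = pearson(x, y)
    return out

# Synthetic (hypothetical) weekly series: incidence follows precipitation
# from 20 weeks earlier, plus independent noise.
rng = random.Random(0)
weeks = 260
precip = [rng.gauss(0.0, 1.0) for _ in range(weeks)]
incidence = [
    (precip[t - 20] if t >= 20 else 0.0) + rng.gauss(0.0, 0.5)
    for t in range(weeks)
]

corrs = lagged_correlations(precip, incidence, max_lag=30)
best_lag = max(corrs, key=lambda k: abs(corrs[k]))
print("lag with strongest correlation:", best_lag)
```

With real, autocorrelated series a raw cross-correlation like this can be misleading, which is one reason time-series models such as ARIMA are typically preferred for selecting lagged predictors.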
Boosted structured additive regression for Escherichia coli fed-batch fermentation modeling.
Melcher, Michael; Scharl, Theresa; Luchner, Markus; Striedner, Gerald; Leisch, Friedrich
2017-02-01
The quality of biopharmaceuticals and patients' safety are of highest priority and there are tremendous efforts to replace empirical production process designs by knowledge-based approaches. Main challenge in this context is that real-time access to process variables related to product quality and quantity is severely limited. To date comprehensive on- and offline monitoring platforms are used to generate process data sets that allow for development of mechanistic and/or data driven models for real-time prediction of these important quantities. Ultimate goal is to implement model based feed-back control loops that facilitate online control of product quality. In this contribution, we explore structured additive regression (STAR) models in combination with boosting as a variable selection tool for modeling the cell dry mass, product concentration, and optical density on the basis of online available process variables and two-dimensional fluorescence spectroscopic data. STAR models are powerful extensions of linear models allowing for inclusion of smooth effects or interactions between predictors. Boosting constructs the final model in a stepwise manner and provides a variable importance measure via predictor selection frequencies. Our results show that the cell dry mass can be modeled with a relative error of about ±3%, the optical density with ±6%, the soluble protein with ±16%, and the insoluble product with an accuracy of ±12%. Biotechnol. Bioeng. 2017;114: 321-334. © 2016 Wiley Periodicals, Inc.
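The statement that boosting builds the model stepwise and yields selection frequencies as an importance measure can be sketched with componentwise L2 boosting. This is a minimal illustration that assumes one linear base-learner per predictor (the STAR models in the abstract also allow smooth and interaction terms); the toy design matrix, the step length `nu` and the number of iterations are hypothetical choices.

```python
def componentwise_boost(X, y, n_iter=40, nu=0.1):
    """Componentwise L2 boosting with one linear base-learner per predictor.

    X is a list of rows, y a list of responses.
    Returns (intercept, coefficients, selection_counts).
    """
    n, p = len(X), len(X[0])
    # Centre each column so that a slope-only base-learner is sufficient.
    means = [sum(row[j] for row in X) / n for j in range(p)]
    cols = [[row[j] - means[j] for row in X] for j in range(p)]
    ss = [sum(v * v for v in col) for col in cols]

    intercept = sum(y) / n
    resid = [yi - intercept for yi in y]
    coef = [0.0] * p
    counts = [0] * p

    for _ in range(n_iter):
        best_j, best_b, best_gain = None, 0.0, -1.0
        for j in range(p):
            if ss[j] == 0.0:
                continue  # constant column carries no information
            # Least-squares slope of the current residuals on predictor j.
            b = sum(c * r for c, r in zip(cols[j], resid)) / ss[j]
            gain = b * b * ss[j]  # drop in residual sum of squares for a full step
            if gain > best_gain:
                best_j, best_b, best_gain = j, b, gain
        if best_j is None:
            break
        # Take only a small step (nu) towards the best single-predictor fit.
        coef[best_j] += nu * best_b
        counts[best_j] += 1
        resid = [r - nu * best_b * c for r, c in zip(resid, cols[best_j])]

    # Convert the intercept back to the uncentred scale.
    intercept -= sum(coef[j] * means[j] for j in range(p))
    return intercept, coef, counts

# Hypothetical toy data: three mutually orthogonal predictors, of which
# the first has a strong effect, the second a weak one, the third none.
x0 = [1, -1, 1, -1, 1, -1, 1, -1]
x1 = [1, 1, -1, -1, 1, 1, -1, -1]
x2 = [1, 1, 1, 1, -1, -1, -1, -1]
X = [list(row) for row in zip(x0, x1, x2)]
y = [1.0 + 2.0 * a + 0.5 * b for a, b in zip(x0, x1)]

intercept, coef, counts = componentwise_boost(X, y)
print("selection counts:", counts)
```

Stopping after a limited number of iterations matters here: the selection counts are informative because irrelevant predictors are rarely or never chosen before the run ends.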
NASA Astrophysics Data System (ADS)
Hofer, Marlis; Mölg, Thomas; Marzeion, Ben; Kaser, Georg
2010-05-01
Recently initiated observation networks in the Cordillera Blanca provide temporally high-resolution, yet short-term atmospheric data. The aim of this study is to extend the existing time series into the past. We present an empirical-statistical downscaling (ESD) model that links 6-hourly NCEP/NCAR reanalysis data to the local target variables, measured at the tropical glacier Artesonraju (Northern Cordillera Blanca). The approach is particular in the context of ESD for two reasons. First, the observational time series for model calibration are short (only about two years). Second, unlike most ESD studies in climate research, we focus on variables at a high temporal resolution (i.e., six-hourly values). Our target variables are two important drivers in the surface energy balance of tropical glaciers; air temperature and specific humidity. The selection of predictor fields from the reanalysis data is based on regression analyses and climatologic considerations. The ESD modelling procedure includes combined empirical orthogonal function and multiple regression analyses. Principal component screening is based on cross-validation using the Akaike Information Criterion as model selection criterion. Double cross-validation is applied for model evaluation. Potential autocorrelation in the time series is considered by defining the block length in the resampling procedure. Apart from the selection of predictor fields, the modelling procedure is automated and does not include subjective choices. We assess the ESD model sensitivity to the predictor choice by using both single- and mixed-field predictors of the variables air temperature (1000 hPa), specific humidity (1000 hPa), and zonal wind speed (500 hPa). The chosen downscaling domain ranges from 80 to 50 degrees west and from 0 to 20 degrees south. Statistical transfer functions are derived individually for different months and times of day (month/hour-models). 
The forecast skill of the month/hour-models largely depends on month and time of day, ranging from 0 to 0.8, but the mixed-field predictors generally perform better than the single-field predictors. At all time scales, the ESD model shows added value against two simple reference models; (i) the direct use of reanalysis grid point values, and (ii) mean diurnal and seasonal cycles over the calibration period. The ESD model forecast 1960 to 2008 clearly reflects interannual variability related to the El Niño/Southern Oscillation, but is sensitive to the chosen predictor type. So far, we have not assessed the performance of NCEP/NCAR reanalysis data against other reanalysis products. The developed ESD model is computationally cheap and applicable wherever measurements are available for model calibration.
Red-shouldered hawk nesting habitat preference in south Texas
Strobel, Bradley N.; Boal, Clint W.
2010-01-01
We examined nesting habitat preference by red-shouldered hawks Buteo lineatus using conditional logistic regression on characteristics measured at 27 occupied nest sites and 68 unused sites in 2005–2009 in south Texas. We measured vegetation characteristics of individual trees (nest trees and unused trees) and corresponding 0.04-ha plots. We evaluated the importance of tree and plot characteristics to nesting habitat selection by comparing a priori tree-specific and plot-specific models using Akaike's information criterion. Models with only plot variables carried 14% more weight than models with only center tree variables. The model-averaged odds ratios indicated red-shouldered hawks selected to nest in taller trees and in areas with higher average diameter at breast height than randomly available within the forest stand. Relative to randomly selected areas, each 1-m increase in nest tree height and 1-cm increase in the plot average diameter at breast height increased the probability of selection by 85% and 10%, respectively. Our results indicate that red-shouldered hawks select nesting habitat based on vegetation characteristics of individual trees as well as the 0.04-ha area surrounding the tree. Our results indicate forest management practices resulting in tall forest stands with large average diameter at breast height would benefit red-shouldered hawks in south Texas.
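The per-unit effects quoted above compound multiplicatively when a covariate changes by more than one unit. The short sketch below shows the arithmetic, assuming the reported 85% and 10% increases correspond to odds ratios of 1.85 per metre of tree height and 1.10 per centimetre of plot-average diameter; the function names and the example differences (3 m, 5 cm) are hypothetical.

```python
import math

def odds_multiplier(pct_increase_per_unit, delta):
    """Factor by which the odds change when the covariate changes by `delta` units."""
    odds_ratio = 1.0 + pct_increase_per_unit / 100.0
    return odds_ratio ** delta

def log_odds_coefficient(pct_increase_per_unit):
    """Regression coefficient (log odds ratio) implied by a per-unit percentage change."""
    return math.log(1.0 + pct_increase_per_unit / 100.0)

# A tree 3 m taller than a reference tree (assumed odds ratio 1.85 per metre).
height_factor = odds_multiplier(85, 3)
# A plot whose average diameter at breast height is 5 cm larger (assumed 1.10 per cm).
dbh_factor = odds_multiplier(10, 5)

print(f"3 m taller: odds multiplied by {height_factor:.2f}")
print(f"5 cm larger average DBH: odds multiplied by {dbh_factor:.2f}")
print(f"implied coefficient for height: {log_odds_coefficient(85):.3f} per m")
```

The multiplier applies to the relative odds of selection, not directly to a probability, so a factor above 1 does not translate into the same percentage change in probability.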
[Study on Application of NIR Spectral Information Screening in Identification of Maca Origin].
Wang, Yuan-zhong; Zhao, Yan-li; Zhang, Ji; Jin, Hang
2016-02-01
The medicinal and edible plant Maca is rich in various nutrients and has great medicinal value. Based on near infrared diffuse reflectance spectra, 139 Maca samples collected from Peru and Yunnan were used to identify their geographical origins. Multiplicative signal correction (MSC) coupled with second derivative (SD) and Norris derivative filter (ND) was employed in spectral pretreatment. The spectrum range (7,500-4,061 cm⁻¹) was chosen by spectrum standard deviation. Combined with principal component analysis-Mahalanobis distance (PCA-MD), the appropriate number of principal components was selected as 5. Based on the spectrum range and the number of principal components selected, two abnormal samples were eliminated by the modular group iterative singular sample diagnosis method. Then, four methods were used to filter spectral variable information: competitive adaptive reweighted sampling (CARS), Monte Carlo-uninformative variable elimination (MC-UVE), genetic algorithm (GA) and subwindow permutation analysis (SPA). The filtered spectral variable information was evaluated by model population analysis (MPA). The results showed that RMSECV(SPA) > RMSECV(CARS) > RMSECV(MC-UVE) > RMSECV(GA), with values of 2.14, 2.05, 2.02, and 1.98, and the numbers of spectral variables were 250, 240, 250 and 70, respectively. According to the filtered spectral variables, partial least squares discriminant analysis (PLS-DA) was used to build the model, with random selection of 97 samples as the training set and the other 40 samples as the validation set. The results showed that, R²: GA > MC-UVE > CARS > SPA, RMSEC and RMSEP: GA < MC-UVE < CARS
Variability-aware compact modeling and statistical circuit validation on SRAM test array
NASA Astrophysics Data System (ADS)
Qiao, Ying; Spanos, Costas J.
2016-03-01
Variability modeling at the compact transistor model level can enable statistically optimized designs in view of limitations imposed by the fabrication technology. In this work we propose a variability-aware compact model characterization methodology based on stepwise parameter selection. Transistor I-V measurements are obtained from bit transistor accessible SRAM test array fabricated using a collaborating foundry's 28nm FDSOI technology. Our in-house customized Monte Carlo simulation bench can incorporate these statistical compact models; and simulation results on SRAM writability performance are very close to measurements in distribution estimation. Our proposed statistical compact model parameter extraction methodology also has the potential of predicting non-Gaussian behavior in statistical circuit performances through mixtures of Gaussian distributions.
Lorenzo-Seva, Urbano; Ferrando, Pere J
2011-03-01
We provide an SPSS program that implements currently recommended techniques and recent developments for selecting variables in multiple linear regression analysis via the relative importance of predictors. The approach consists of: (1) optimally splitting the data for cross-validation, (2) selecting the final set of predictors to be retained in the regression equation, and (3) assessing the behavior of the chosen model using standard indices and procedures. The SPSS syntax, a short manual, and data files related to this article are available as supplemental materials from brm.psychonomic-journals.org/content/supplemental.
Cross Validation of Selection of Variables in Multiple Regression.
1979-12-01
CROSS VALIDATION OF SELECTION OF VARIABLES IN MULTIPLE REGRESSION. I. Introduction. Background. Long term DoD planning goals... had adequate predictive capabilities; the other two models (the...
Balabin, Roman M; Smirnov, Sergey V
2011-04-29
During the past several years, near-infrared (near-IR/NIR) spectroscopy has increasingly been adopted as an analytical tool in various fields from petroleum to biomedical sectors. The NIR spectrum (above 4000 cm(-1)) of a sample is typically measured by modern instruments at a few hundred of wavelengths. Recently, considerable effort has been directed towards developing procedures to identify variables (wavelengths) that contribute useful information. Variable selection (VS) or feature selection, also called frequency selection or wavelength selection, is a critical step in data analysis for vibrational spectroscopy (infrared, Raman, or NIRS). In this paper, we compare the performance of 16 different feature selection methods for the prediction of properties of biodiesel fuel, including density, viscosity, methanol content, and water concentration. The feature selection algorithms tested include stepwise multiple linear regression (MLR-step), interval partial least squares regression (iPLS), backward iPLS (BiPLS), forward iPLS (FiPLS), moving window partial least squares regression (MWPLS), (modified) changeable size moving window partial least squares (CSMWPLS/MCSMWPLSR), searching combination moving window partial least squares (SCMWPLS), successive projections algorithm (SPA), uninformative variable elimination (UVE, including UVE-SPA), simulated annealing (SA), back-propagation artificial neural networks (BP-ANN), Kohonen artificial neural network (K-ANN), and genetic algorithms (GAs, including GA-iPLS). Two linear techniques for calibration model building, namely multiple linear regression (MLR) and partial least squares regression/projection to latent structures (PLS/PLSR), are used for the evaluation of biofuel properties. A comparison with a non-linear calibration model, artificial neural networks (ANN-MLP), is also provided. Discussion of gasoline, ethanol-gasoline (bioethanol), and diesel fuel data is presented. 
The results of other spectroscopic techniques application, such as Raman, ultraviolet-visible (UV-vis), or nuclear magnetic resonance (NMR) spectroscopies, can be greatly improved by an appropriate feature selection choice. Copyright © 2011 Elsevier B.V. All rights reserved.
Maloney, Kelly O.; Schmid, Matthias; Weller, Donald E.
2012-01-01
Issues with ecological data (e.g. non-normality of errors, nonlinear relationships and autocorrelation of variables) and modelling (e.g. overfitting, variable selection and prediction) complicate regression analyses in ecology. Flexible models, such as generalized additive models (GAMs), can address data issues, and machine learning techniques (e.g. gradient boosting) can help resolve modelling issues. Gradient boosted GAMs do both. Here, we illustrate the advantages of this technique using data on benthic macroinvertebrates and fish from 1573 small streams in Maryland, USA.
Growth models of Rhizophora mangle L. seedlings in tropical southwestern Atlantic
NASA Astrophysics Data System (ADS)
Lima, Karen Otoni de Oliveira; Tognella, Mônica Maria Pereira; Cunha, Simone Rabelo; Andrade, Humber Agrelli de
2018-07-01
The present study selected and compared regression models that best describe the growth curves of Rhizophora mangle seedlings based on the height (cm) and time (days) variables. The Linear, Exponential, Power Law, Monomolecular, Logistic, and Gompertz models were fitted with non-linear formulations by minimizing the sum of squared residuals. The Akaike Information Criterion was used to select the best model for each seedling. After this selection, the determination coefficient, which evaluates how well a model describes height variation as a function of time, was inspected. Differing from the classic population ecology studies, the Monomolecular, Three-parameter Logistic, and Gompertz models presented the best performance in describing growth, suggesting they are the most adequate options for long-term studies. The different growth curves reflect the complexity of stem growth at the seedling stage for R. mangle. The analysis of the joint distribution of the parameters (initial height, growth rate, and asymptotic size) allowed the study of the species' ecological attributes and the observation of its intraspecific variability in each model. Our results provide a basis for interpreting the dynamics of seedling growth during establishment in a mature forest, as well as its regeneration processes.
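Model selection by the Akaike Information Criterion for least-squares fits can be sketched as follows. The residual sums of squares, the sample size and the parameter counts below are made-up illustrative values, not results from the study, and the sketch assumes Gaussian errors with the error variance counted as one estimated parameter.

```python
import math

def aic_least_squares(rss, n, k):
    """AIC for a least-squares fit with Gaussian errors (additive constants dropped).

    rss: residual sum of squares, n: number of observations,
    k: number of estimated parameters (here including the error variance).
    """
    return n * math.log(rss / n) + 2 * k

def akaike_weights(aic_by_model):
    """Relative support for each candidate model, normalised to sum to 1."""
    best = min(aic_by_model.values())
    rel = {m: math.exp(-0.5 * (a - best)) for m, a in aic_by_model.items()}
    total = sum(rel.values())
    return {m: v / total for m, v in rel.items()}

# Hypothetical fits for one seedling: model -> (RSS, number of parameters).
n_obs = 30
fits = {
    "Linear": (48.0, 3),
    "Monomolecular": (21.0, 4),
    "Logistic": (19.5, 4),
    "Gompertz": (18.0, 4),
}

aics = {m: aic_least_squares(rss, n_obs, k) for m, (rss, k) in fits.items()}
weights = akaike_weights(aics)
best_model = min(aics, key=aics.get)

for model in sorted(aics, key=aics.get):
    print(f"{model:14s} AIC = {aics[model]:7.2f}  weight = {weights[model]:.3f}")
```

Because only AIC differences matter, the dropped constants do not affect the ranking, provided every candidate is fitted to the same observations.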
ERIC Educational Resources Information Center
Collazo, Andres; And Others
Since a great number of variables influence future educational outcomes, forecasting possible trends is a complex task. One such model, the cross-impact matrix, has been developed. The use of this matrix in forecasting future values of social indicators of educational outcomes is described. Variables associated with educational outcomes are used…
Data-driven process decomposition and robust online distributed modelling for large-scale processes
NASA Astrophysics Data System (ADS)
Shu, Zhang; Lijuan, Li; Lijuan, Yao; Shipin, Yang; Tao, Zou
2018-02-01
With the increasing attention paid to networked control, system decomposition and distributed models are of significant importance in implementing model-based control strategies. In this paper, a data-driven system decomposition and online distributed subsystem modelling algorithm is proposed for large-scale chemical processes. The key controlled variables are first partitioned into several clusters by the affinity propagation clustering algorithm. Each cluster can be regarded as a subsystem. The inputs of each subsystem are then selected by offline canonical correlation analysis between all process variables and its controlled variables. Process decomposition is realised after the screening of input and output variables. Once the system decomposition is finished, online subsystem modelling can be carried out by recursively renewing the samples block-wise. The proposed algorithm was applied to the Tennessee Eastman process and its validity was verified.
Community models for wildlife impact assessment: a review of concepts and approaches
Schroeder, Richard L.
1987-01-01
The first two sections of this paper are concerned with defining and bounding communities, and describing those attributes of the community that are quantifiable and suitable for wildlife impact assessment purposes. Prior to the development or use of a community model, it is important to have a clear understanding of the concept of a community and a knowledge of the types of community attributes that can serve as outputs for the development of models. Clearly defined, unambiguous model outputs are essential for three reasons: (1) to ensure that the measured community attributes relate to the wildlife resource objectives of the study; (2) to allow testing of the outputs in experimental studies, to determine accuracy, and to allow for improvements based on such testing; and (3) to enable others to clearly understand the community attribute that has been measured. The third section of this paper describes input variables that may be used to predict various community attributes. These input variables do not include direct measures of wildlife populations. Most impact assessments involve projects that result in drastic changes in habitat, such as changes in land use, vegetation, or available area. Therefore, the model input variables described in this section deal primarily with habitat related features. Several existing community models are described in the fourth section of this paper. A general description of each model is provided, including the nature of the input variables and the model output. The logic and assumptions of each model are discussed, along with data requirements needed to use the model. The fifth section provides guidance on the selection and development of community models. Identification of the community attribute that is of concern will determine the type of model most suitable for a particular application. 
This section provides guidelines on selecting an existing model, as well as a discussion of the major steps to be followed in modifying an existing model or developing a new model. Considerations associated with the use of community models with the Habitat Evaluation Procedures are also discussed. The final section of the paper summarizes major findings of interest to field biologists and provides recommendations concerning the implementation of selected concepts in wildlife community analyses.
Initial proposition of kinematics model for selected karate actions analysis
NASA Astrophysics Data System (ADS)
Hachaj, Tomasz; Koptyra, Katarzyna; Ogiela, Marek R.
2017-03-01
The motivation for this paper is to propose and evaluate two new kinematics models that were developed to describe motion capture (MoCap) data of karate techniques. We developed this novel proposition to create a model that is capable of handling action descriptions from both multimedia and professional MoCap hardware. For evaluation purposes we used 25-joint data with recordings of karate techniques acquired with Kinect version 2. The data consist of MoCap recordings of two professional sport (black belt) instructors and masters of Oyama Karate. We selected the following actions for initial analysis: left-handed furi-uchi punch, right leg hiza-geri kick, right leg yoko-geri kick and left-handed jodan-uke block. Based on our evaluation, we conclude that both proposed kinematics models seem to be convenient methods for describing karate actions. Of the two proposed variable models, the global one seems likely to be more useful for further work. We think so because, for the punches considered, its variables seem to be less correlated, and they might also be easier to interpret because of the single reference coordinate system. Principal components analysis also proved to be a reliable way to examine the quality of kinematics models, and plotting the variables in principal components space clearly presents the dependences between variables.
Nordborg, Magnus; Innan, Hideki
2003-01-01
A stochastic model for the genealogy of a sample of recombining sequences containing one or more sites subject to selection in a subdivided population is described. Selection is incorporated by dividing the population into allelic classes and then conditioning on the past sizes of these classes. The past allele frequencies at the selected sites are thus treated as parameters rather than as random variables. The purpose of the model is not to investigate the dynamics of selection, but to investigate effects of linkage to the selected sites on the genealogy of the surrounding chromosomal region. This approach is useful for modeling strong selection, when it is natural to parameterize the past allele frequencies at the selected sites. Several models of strong balancing selection are used as examples, and the effects on the pattern of neutral polymorphism in the chromosomal region are discussed. We focus in particular on the statistical power to detect balancing selection when it is present. PMID:12663556
Nordborg, Magnus; Innan, Hideki
2003-03-01
A stochastic model for the genealogy of a sample of recombining sequences containing one or more sites subject to selection in a subdivided population is described. Selection is incorporated by dividing the population into allelic classes and then conditioning on the past sizes of these classes. The past allele frequencies at the selected sites are thus treated as parameters rather than as random variables. The purpose of the model is not to investigate the dynamics of selection, but to investigate effects of linkage to the selected sites on the genealogy of the surrounding chromosomal region. This approach is useful for modeling strong selection, when it is natural to parameterize the past allele frequencies at the selected sites. Several models of strong balancing selection are used as examples, and the effects on the pattern of neutral polymorphism in the chromosomal region are discussed. We focus in particular on the statistical power to detect balancing selection when it is present.
Stochastic model search with binary outcomes for genome-wide association studies
Malovini, Alberto; Puca, Annibale A; Bellazzi, Riccardo
2012-01-01
Objective The spread of case–control genome-wide association studies (GWASs) has stimulated the development of new variable selection methods and predictive models. We introduce a novel Bayesian model search algorithm, Binary Outcome Stochastic Search (BOSS), which addresses the model selection problem when the number of predictors far exceeds the number of binary responses. Materials and methods Our method is based on a latent variable model that links the observed outcomes to the underlying genetic variables. A Markov Chain Monte Carlo approach is used for model search and to evaluate the posterior probability of each predictor. Results BOSS is compared with three established methods (stepwise regression, logistic lasso, and elastic net) in a simulated benchmark. Two real case studies are also investigated: a GWAS on the genetic bases of longevity, and the type 2 diabetes study from the Wellcome Trust Case Control Consortium. Simulations show that BOSS achieves higher precisions than the reference methods while preserving good recall rates. In both experimental studies, BOSS successfully detects genetic polymorphisms previously reported to be associated with the analyzed phenotypes. Discussion BOSS outperforms the other methods in terms of F-measure on simulated data. In the two real studies, BOSS successfully detects biologically relevant features, some of which are missed by univariate analysis and the three reference techniques. Conclusion The proposed algorithm is an advance in the methodology for model selection with a large number of features. Our simulated and experimental results showed that BOSS proves effective in detecting relevant markers while providing a parsimonious model. PMID:22534080
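The precision, recall and F-measure used to compare the methods above can be computed as in the following minimal sketch. It assumes that each method returns a set of selected predictors and that the truly associated predictors are known, as in a simulated benchmark; the function name and the marker names are hypothetical and do not come from the paper.

```python
def precision_recall_f1(selected, truly_associated):
    """Score a set of selected predictors against the known relevant ones."""
    selected, truly_associated = set(selected), set(truly_associated)
    true_pos = len(selected & truly_associated)
    precision = true_pos / len(selected) if selected else 0.0
    recall = true_pos / len(truly_associated) if truly_associated else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical marker names, for illustration only.
p, r, f = precision_recall_f1(
    ["snp1", "snp2", "snp7"], ["snp1", "snp2", "snp3", "snp4"]
)
print(p, r, f)
```

A parsimonious method that selects few markers tends to score high on precision, which is why the F-measure is reported alongside recall.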
Olivera, André Rodrigues; Roesler, Valter; Iochpe, Cirano; Schmidt, Maria Inês; Vigo, Álvaro; Barreto, Sandhi Maria; Duncan, Bruce Bartholow
2017-01-01
Type 2 diabetes is a chronic disease associated with a wide range of serious health complications that have a major impact on overall health. The aims here were to develop and validate predictive models for detecting undiagnosed diabetes using data from the Longitudinal Study of Adult Health (ELSA-Brasil) and to compare the performance of different machine-learning algorithms in this task. Comparison of machine-learning algorithms to develop predictive models using data from ELSA-Brasil. After selecting a subset of 27 candidate variables from the literature, models were built and validated in four sequential steps: (i) parameter tuning with tenfold cross-validation, repeated three times; (ii) automatic variable selection using forward selection, a wrapper strategy with four different machine-learning algorithms and tenfold cross-validation (repeated three times), to evaluate each subset of variables; (iii) error estimation of model parameters with tenfold cross-validation, repeated ten times; and (iv) generalization testing on an independent dataset. The models were created with the following machine-learning algorithms: logistic regression, artificial neural network, naïve Bayes, K-nearest neighbor and random forest. The best models were created using artificial neural networks and logistic regression. These achieved mean areas under the curve of, respectively, 75.24% and 74.98% in the error estimation step and 74.17% and 74.41% in the generalization testing step. Most of the predictive models produced similar results, and demonstrated the feasibility of identifying individuals with highest probability of having undiagnosed diabetes, through easily-obtained clinical data.
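The resampling scheme named in steps (i) and (ii), tenfold cross-validation repeated three times, can be sketched as follows. This is a minimal illustration rather than the authors' code; the function name `repeated_kfold`, the seed and the sample size are hypothetical.

```python
import random


def repeated_kfold(n_samples, n_folds=10, n_repeats=3, seed=0):
    """Yield (train_indices, test_indices) for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random partition for every repeat
        for fold in range(n_folds):
            test = indices[fold::n_folds]  # every n_folds-th shuffled index
            held_out = set(test)
            train = [i for i in indices if i not in held_out]
            yield train, test


splits = list(repeated_kfold(25))
print(len(splits))  # 10 folds x 3 repeats
```

In a wrapper strategy, each candidate subset of variables would be scored by averaging a performance measure over all of these splits.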
Qin, Li-Tang; Liu, Shu-Shen; Liu, Hai-Ling
2010-02-01
A five-variable model (model M2) was developed for the bioconcentration factors (BCFs) of nonpolar organic compounds (NPOCs) by using molecular electronegativity distance vector (MEDV) to characterize the structures of NPOCs and variable selection and modeling based on prediction (VSMP) to select the optimum descriptors. The estimated correlation coefficient (r^2) and the leave-one-out cross-validation correlation coefficient (q^2) of model M2 were 0.9271 and 0.9171, respectively. The model was externally validated by splitting the whole data set into a representative training set of 85 chemicals and a validation set of 29 chemicals. The results show that the main structural factors influencing the BCFs of NPOCs are -cCc, cCcc, -Cl, and -Br (where "-" refers to a single bond and "c" refers to a conjugated bond). The quantitative structure-property relationship (QSPR) model can effectively predict the BCFs of NPOCs, and the predictions of the model can also extend the current BCF database of experimental values.
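The difference between the fitted r^2 and the leave-one-out q^2 can be illustrated with a small sketch. It assumes a single descriptor and ordinary least squares for brevity (the model above uses five descriptors), and it takes q^2 = 1 - PRESS/SS_tot with SS_tot computed around the mean of all observations, which is one common convention. The data values are made up.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one descriptor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r2_and_q2(xs, ys):
    """Return (r^2, leave-one-out q^2) for a one-descriptor linear model."""
    n = len(ys)
    mean_y = sum(ys) / n
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    slope, intercept = fit_line(xs, ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    press = 0.0  # predictive residual sum of squares
    for i in range(n):
        # Refit without observation i, then predict it.
        s, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (s * xs[i] + b)) ** 2
    return 1 - ss_res / ss_tot, 1 - press / ss_tot


# Made-up descriptor and response values, for illustration only.
r2, q2 = r2_and_q2([1.0, 2.0, 3.0, 4.0, 5.0], [1.1, 1.9, 3.2, 3.9, 5.1])
print(r2, q2)
```

For least-squares fits q^2 never exceeds r^2, so a small gap between the two, as reported above, suggests the model is not heavily overfitted.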
NASA Astrophysics Data System (ADS)
Kim, Junhan; Marrone, Daniel P.; Chan, Chi-Kwan; Medeiros, Lia; Özel, Feryal; Psaltis, Dimitrios
2016-12-01
The Event Horizon Telescope (EHT) is a millimeter-wavelength, very-long-baseline interferometry (VLBI) experiment that is capable of observing black holes with horizon-scale resolution. Early observations have revealed variable horizon-scale emission in the Galactic Center black hole, Sagittarius A* (Sgr A*). Comparing such observations to time-dependent general relativistic magnetohydrodynamic (GRMHD) simulations requires statistical tools that explicitly consider the variability in both the data and the models. We develop here a Bayesian method to compare time-resolved simulation images to variable VLBI data, in order to infer model parameters and perform model comparisons. We use mock EHT data based on GRMHD simulations to explore the robustness of this Bayesian method and contrast it to approaches that do not consider the effects of variability. We find that time-independent models lead to offset values of the inferred parameters with artificially reduced uncertainties. Moreover, neglecting the variability in the data and the models often leads to erroneous model selections. We finally apply our method to the early EHT data on Sgr A*.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kim, Junhan; Marrone, Daniel P.; Chan, Chi-Kwan
2016-12-01
The Event Horizon Telescope (EHT) is a millimeter-wavelength, very-long-baseline interferometry (VLBI) experiment that is capable of observing black holes with horizon-scale resolution. Early observations have revealed variable horizon-scale emission in the Galactic Center black hole, Sagittarius A* (Sgr A*). Comparing such observations to time-dependent general relativistic magnetohydrodynamic (GRMHD) simulations requires statistical tools that explicitly consider the variability in both the data and the models. We develop here a Bayesian method to compare time-resolved simulation images to variable VLBI data, in order to infer model parameters and perform model comparisons. We use mock EHT data based on GRMHD simulations to explore the robustness of this Bayesian method and contrast it to approaches that do not consider the effects of variability. We find that time-independent models lead to offset values of the inferred parameters with artificially reduced uncertainties. Moreover, neglecting the variability in the data and the models often leads to erroneous model selections. We finally apply our method to the early EHT data on Sgr A*.
On the Accretion Rates of SW Sextantis Nova-like Variables
NASA Astrophysics Data System (ADS)
Ballouz, Ronald-Louis; Sion, Edward M.
2009-06-01
We present accretion rates for selected samples of nova-like variables having IUE archival spectra and distances uniformly determined using an infrared method by Knigge. A comparison with accretion rates derived independently with a multiparametric optimization modeling approach by Puebla et al. is carried out. The accretion rates of SW Sextantis nova-like systems are compared with the accretion rates of non-SW Sextantis systems in the Puebla et al. sample and in our sample, which was selected in the orbital period range of three to four and a half hours, with all systems having distances using the method of Knigge. Based upon the two independent modeling approaches, we find no significant difference between the accretion rates of SW Sextantis systems and non-SW Sextantis nova-like systems insofar as optically thick disk models are appropriate. We find little evidence to suggest that the SW Sex stars have higher accretion rates than other nova-like cataclysmic variables (CVs) above the period gap within the same range of orbital periods.
Hall, S. A.; Burke, I.C.; Box, D. O.; Kaufmann, M. R.; Stoker, Jason M.
2005-01-01
The ponderosa pine forests of the Colorado Front Range, USA, have historically been subjected to wildfires. Recent large burns have increased public interest in fire behavior and effects, and scientific interest in the carbon consequences of wildfires. Remote sensing techniques can provide spatially explicit estimates of stand structural characteristics. Some of these characteristics can be used as inputs to fire behavior models, increasing our understanding of the effect of fuels on fire behavior. Others provide estimates of carbon stocks, allowing us to quantify the carbon consequences of fire. Our objective was to use discrete-return lidar to estimate such variables, including stand height, total aboveground biomass, foliage biomass, basal area, tree density, canopy base height and canopy bulk density. We developed 39 metrics from the lidar data, and used them in limited combinations in regression models, which we fit to field estimates of the stand structural variables. We used an information–theoretic approach to select the best model for each variable, and to select the subset of lidar metrics with most predictive potential. Observed versus predicted values of stand structure variables were highly correlated, with r2 ranging from 57% to 87%. The most parsimonious linear models for the biomass structure variables, based on a restricted dataset, explained between 35% and 58% of the observed variability. Our results provide us with useful estimates of stand height, total aboveground biomass, foliage biomass and basal area. There is promise for using this sensor to estimate tree density, canopy base height and canopy bulk density, though more research is needed to generate robust relationships. We selected 14 lidar metrics that showed the most potential as predictors of stand structure. 
We suggest that the focus of future lidar studies should broaden to include low density forests, particularly systems where the vertical structure of the canopy is important, such as fire prone forests.
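An information-theoretic comparison of candidate regression models can be sketched as below. The abstract does not name the criterion, so the use of the Akaike information criterion (AIC) and Akaike weights here is an assumption, as are the function names and the numbers in the example.

```python
import math


def aic_least_squares(rss, n_obs, n_params):
    """AIC for a least-squares fit with Gaussian errors, constant terms dropped.

    n_params should count every estimated parameter, including the
    error variance.
    """
    return n_obs * math.log(rss / n_obs) + 2 * n_params


def akaike_weights(aic_values):
    """Relative support for each candidate model; the weights sum to 1."""
    best = min(aic_values)
    relative = [math.exp(-0.5 * (a - best)) for a in aic_values]
    total = sum(relative)
    return [value / total for value in relative]


# Hypothetical residual sums of squares for three candidate models
# fitted to the same 40 plots with 3, 4 and 6 parameters.
aics = [aic_least_squares(rss, 40, k) for rss, k in [(12.0, 3), (11.5, 4), (11.4, 6)]]
print(akaike_weights(aics))
```

The penalty term 2 * n_params is what favours the more parsimonious combinations of lidar metrics when extra predictors reduce the residual error only slightly.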
A Permutation Approach for Selecting the Penalty Parameter in Penalized Model Selection
Sabourin, Jeremy A; Valdar, William; Nobel, Andrew B
2015-01-01
Summary We describe a simple, computationally efficient, permutation-based procedure for selecting the penalty parameter in LASSO penalized regression. The procedure, permutation selection, is intended for applications where variable selection is the primary focus, and can be applied in a variety of structural settings, including that of generalized linear models. We briefly discuss connections between permutation selection and existing theory for the LASSO. In addition, we present a simulation study and an analysis of real biomedical data sets in which permutation selection is compared with selection based on the following: cross-validation (CV), the Bayesian information criterion (BIC), Scaled Sparse Linear Regression, and a selection method based on recently developed testing procedures for the LASSO. PMID:26243050
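One way to read a permutation-based choice of the LASSO penalty is sketched below. The abstract gives no algorithmic detail, so the specifics are assumptions: the response is permuted to break any real association, the smallest penalty that excludes every predictor is recorded for each permutation, and the median of those values is used as the penalty. The objective is assumed to be (1/(2n))||y - Xb||^2 + lambda*||b||_1 with an unpenalized intercept; all names and data are hypothetical.

```python
import random
import statistics


def lambda_max(columns, y):
    """Smallest penalty at which the LASSO selects no predictor.

    Under the assumed objective this equals max_j |x_j . y| / n after
    centering each column x_j and the response y.
    """
    n = len(y)
    y_mean = sum(y) / n
    y_centered = [v - y_mean for v in y]
    largest = 0.0
    for col in columns:
        col_mean = sum(col) / n
        dot = sum((c - col_mean) * v for c, v in zip(col, y_centered))
        largest = max(largest, abs(dot) / n)
    return largest


def permutation_lambda(columns, y, n_perm=200, seed=0):
    """Median over permutations of the penalty that excludes all predictors."""
    rng = random.Random(seed)
    y_perm = list(y)
    values = []
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # permuting y removes any true signal
        values.append(lambda_max(columns, y_perm))
    return statistics.median(values)


# Synthetic data: only the first of five predictors drives the response.
rng = random.Random(1)
columns = [[rng.gauss(0, 1) for _ in range(50)] for _ in range(5)]
y = [2.0 * a + rng.gauss(0, 0.1) for a in columns[0]]
print(permutation_lambda(columns, y), lambda_max(columns, y))
```

In this synthetic case the permutation-based penalty falls well below the penalty needed to exclude the true predictor, so the real signal would survive while noise-level associations are screened out.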
NASA Astrophysics Data System (ADS)
Yáñez, Marco A.; Baettig, Ricardo; Cornejo, Jorge; Zamudio, Francisco; Guajardo, Jorge; Fica, Rodrigo
2017-07-01
Air pollution is one of the major global environmental problems affecting human health and life quality. Many cities of Chile are heavily polluted with PM2.5 and PM10, mainly in the cold season, and there is little understanding of how the variation in particle matter differs between cities and how this is affected by the meteorological conditions. The objective of this study was to assess the effect of meteorological variables on respirable particulate matter (PM) of the main cities in the central-south valley of Chile during the cold season (May to August) between 2014 and 2016. We used hourly PM2.5 and PMcoarse (PM10- PM2.5) information along with wind speed, temperature and relative humidity, and other variables derived from meteorological parameters. Generalized additive models (GAMs) were fitted for each of the eight cities selected, covering a latitudinal range of 929 km, from Santiago to Osorno. Great variation in PM was found between cities during the cold months, and that variation exhibited a marked latitudinal pattern. Overall, the more northerly cities tended to be less polluted in PM2.5 and more polluted in PMcoarse than the more southerly cities, and vice versa. The results show that other derived variables from meteorology were better related with PM than the use of traditional daily means. The main variables selected with regard to PM2.5 content were mean wind speed and minimum temperature (negative relationship). Otherwise, the main variables selected with regard to PMcoarse content were mean wind speed (negative), and the daily range in temperature (positive). Variables derived from relative humidity contributed differently to the models, having a higher effect on PMcoarse than PM2.5, and exhibiting both negative and positive effects. For the different cities the deviance explained by the GAMs ranged from 37.6 to 79.1% for PM2.5 and from 18.5 to 63.7% for PMcoarse. 
The percentage of deviance explained by the models for PM2.5 exhibited a latitudinal pattern, which was not observed in PMcoarse. This highlights the greater predictability of PM2.5 according to meteorological parameters in the cities to the south. Southern cities located spatially close to one another had similar patterns in both the selected variables for the models and the trends. The meteorological factor influencing the cities had a major impact on PM concentrations. The findings of this study may aid understanding of PM variation across the country, in the way of improving forecasting models.
Fan, Shu-xiang; Huang, Wen-qian; Li, Jiang-bo; Zhao, Chun-jiang; Zhang, Bao-hua
2014-08-01
This study aimed to improve the precision and robustness of the NIR model of the soluble solid content (SSC) of pear. A total of 160 pears were used for calibration (n=120) and prediction (n=40). Different spectral pretreatment methods, including standard normal variate (SNV) and multiplicative scatter correction (MSC), were used before further analysis. A combination of genetic algorithm (GA) and successive projections algorithm (SPA) was proposed to select the most effective wavelengths after uninformative variable elimination (UVE) from original spectra, SNV pretreated spectra and MSC pretreated spectra respectively. The selected variables were used as the inputs of a least squares-support vector machine (LS-SVM) model to build models for determining the SSC of pear. The results indicated that the LS-SVM model built using SNVE-UVE-GA-SPA on 30 characteristic wavelengths, selected from the full spectrum of 3112 wavelengths, achieved the optimal performance. The correlation coefficient (Rp) and root mean square error of prediction (RMSEP) for the prediction set were 0.956 and 0.271 for SSC. The model is reliable and the predicted result is effective. The method can meet the requirement of quick measurement of the SSC of pear and might be important for the development of portable instruments and online monitoring.
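Two of the quantities named above, the SNV pretreatment and the RMSEP, are simple enough to sketch directly. The code assumes the usual definitions: SNV centers each spectrum on its own mean and scales it by its own sample standard deviation, and RMSEP is the root of the mean squared difference between predicted and reference values. The spectrum values are made up.

```python
import math
import statistics


def snv(spectrum):
    """Standard normal variate: center and scale one spectrum by its own statistics."""
    mean = statistics.fmean(spectrum)
    sd = statistics.stdev(spectrum)  # sample standard deviation (n - 1)
    return [(value - mean) / sd for value in spectrum]


def rmsep(predicted, reference):
    """Root mean square error of prediction."""
    squared = sum((p - r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(squared / len(reference))


# Made-up absorbance values at five wavelengths.
spectrum = [0.10, 0.12, 0.15, 0.11, 0.09]
print(snv(spectrum))
print(rmsep([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Because each spectrum is normalized independently, SNV removes additive offsets and multiplicative scaling between samples, which is the scatter effect it is meant to correct.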
Shen, Chung-Wei; Chen, Yi-Hau
2018-03-13
We propose a model selection criterion for semiparametric marginal mean regression based on generalized estimating equations. The work is motivated by a longitudinal study on the physical frailty outcome in the elderly, where the cluster size, that is, the number of the observed outcomes in each subject, is "informative" in the sense that it is related to the frailty outcome itself. The new proposal, called Resampling Cluster Information Criterion (RCIC), is based on the resampling idea utilized in the within-cluster resampling method (Hoffman, Sen, and Weinberg, 2001, Biometrika 88, 1121-1134) and accommodates informative cluster size. The implementation of RCIC, however, is free of performing actual resampling of the data and hence is computationally convenient. Compared with the existing model selection methods for marginal mean regression, the RCIC method incorporates an additional component accounting for variability of the model over within-cluster subsampling, and leads to remarkable improvements in selecting the correct model, regardless of whether the cluster size is informative or not. Applying the RCIC method to the longitudinal frailty study, we identify being female, old age, low income and life satisfaction, and chronic health conditions as significant risk factors for physical frailty in the elderly. © 2018, The International Biometric Society.
Estimation and Model Selection for Finite Mixtures of Latent Interaction Models
ERIC Educational Resources Information Center
Hsu, Jui-Chen
2011-01-01
Latent interaction models and mixture models have received considerable attention in social science research recently, but little is known about how to handle if unobserved population heterogeneity exists in the endogenous latent variables of the nonlinear structural equation models. The current study estimates a mixture of latent interaction…
Nunes, Karen M; Andrade, Marcus Vinícius O; Santos Filho, Antônio M P; Lasmar, Marcelo C; Sena, Marcelo M
2016-08-15
Concerns about meat authenticity are increasing recently, due to great fraud scandals. This paper analysed real samples (43 adulterated and 12 controls) originated from criminal networks dismantled by the Brazilian Police. This fraud consisted of injecting solutions of non-meat ingredients (NaCl, phosphates, carrageenan, maltodextrin) in bovine meat, aiming to increase its water holding capacity. Five physico-chemical variables were determined, protein, ash, chloride, sodium, phosphate. Additionally, infrared spectra were recorded. Supervised classification PLS-DA models were built with each data set individually, but the best model was obtained with data fusion, correctly detecting 91% of the adulterated samples. From this model, a variable selection based on the highest VIPscores was performed and a new data fusion model was built with only one chemical variable, providing slightly lower predictions, but a good cost/performance ratio. Finally, some of the selected infrared bands were specifically associated to the presence of adulterants NaCl, tripolyphosphate and carrageenan. Copyright © 2016 Elsevier Ltd. All rights reserved.
Naccarato, Attilio; Furia, Emilia; Sindona, Giovanni; Tagarelli, Antonio
2016-09-01
Four class-modeling techniques (soft independent modeling of class analogy (SIMCA), unequal dispersed classes (UNEQ), potential functions (PF), and multivariate range modeling (MRM)) were applied to multielement distribution to build chemometric models able to authenticate chili pepper samples grown in Calabria respect to those grown outside of Calabria. The multivariate techniques were applied by considering both all the variables (32 elements, Al, As, Ba, Ca, Cd, Ce, Co, Cr, Cs, Cu, Dy, Fe, Ga, La, Li, Mg, Mn, Na, Nd, Ni, Pb, Pr, Rb, Sc, Se, Sr, Tl, Tm, V, Y, Yb, Zn) and variables selected by means of stepwise linear discriminant analysis (S-LDA). In the first case, satisfactory and comparable results in terms of CV efficiency are obtained with the use of SIMCA and MRM (82.3 and 83.2% respectively), whereas MRM performs better than SIMCA in terms of forced model efficiency (96.5%). The selection of variables by S-LDA permitted to build models characterized, in general, by a higher efficiency. MRM provided again the best results for CV efficiency (87.7% with an effective balance of sensitivity and specificity) as well as forced model efficiency (96.5%). Copyright © 2016 Elsevier Ltd. All rights reserved.
NASA Astrophysics Data System (ADS)
Colette, Augustin; Bessagnet, Bertrand; Dangiola, Ariela; D'Isidoro, Massimo; Gauss, Michael; Granier, Claire; Hodnebrog, Øivind; Jakobs, Hermann; Kanakidou, Maria; Khokhar, Fahim; Law, Kathy; Maurizi, Alberto; Meleux, Frederik; Memmesheimer, Michael; Nyiri, Agnes; Rouil, Laurence; Stordal, Frode; Tampieri, Francesco
2010-05-01
With the growth of urban agglomerations, assessing the drivers of variability of air quality in and around the main anthropogenic emission hotspots has become a major societal concern as well as a scientific challenge. These drivers include emission changes and meteorological variability; both of them can be investigated by means of numerical modelling of trends over the past few years. A collaborative effort has been developed in the framework of the CityZen European project to address this question. Several chemistry and transport models (CTMs) are deployed in this activity: four regional models (BOLCHEM, CHIMERE, EMEP and EURAD) and three global models (CTM2, MOZART, and TM4). The period from 1998 to 2007 has been selected for the historic reconstruction. The focus for the present preliminary presentation is Europe. A consistent set of emissions is used by all partners (EMEP for the European domain and IPCC-AR5 beyond) while a variety of meteorological forcing is used to gain robustness in the ensemble spread amongst models. The results of this experiment will be investigated to address the following questions: - Is the envelope of models able to reproduce the observed trends of the key chemical constituents? - How does the variability amongst models change in time and space, and what does it tell us about the processes driving the observed trends? - Did chemical regimes and aerosol formation processes change in selected hotspots? Answering the above questions will contribute to fulfilling the ultimate goal of the present study: distinguishing the respective contributions of meteorological variability and emission changes to air quality trends in major anthropogenic emission hotspots.
Gan, Zhaoyu; Diao, Feici; Wei, Qinling; Wu, Xiaoli; Cheng, Minfeng; Guan, Nianhong; Zhang, Ming; Zhang, Jinbei
2011-11-01
A correct timely diagnosis of bipolar depression remains a big challenge for clinicians. This study aimed to develop a clinical characteristic based model to predict the diagnosis of bipolar disorder among patients with current major depressive episodes. A prospective study was carried out on 344 patients with current major depressive episodes, with 268 completing 1-year follow-up. Data were collected through structured interviews. Univariate binary logistic regression was conducted to select potential predictive variables among 19 initial variables, and then multivariate binary logistic regression was performed to analyze the combination of risk factors and build a predictive model. Receiver operating characteristic (ROC) curve was plotted. Of 19 initial variables, 13 variables were preliminarily selected, and then forward stepwise exercise produced a final model consisting of 6 variables: age at first onset, maximum duration of depressive episodes, somatalgia, hypersomnia, diurnal variation of mood, irritability. The correct prediction rate of this model was 78% (95%CI: 75%-86%) and the area under the ROC curve was 0.85 (95%CI: 0.80-0.90). The cut-off point for age at first onset was 28.5 years old, while the cut-off point for maximum duration of depressive episode was 7.5 months. The limitations of this study include small sample size, relatively short follow-up period and lack of treatment information. Our predictive models based on six clinical characteristics of major depressive episodes prove to be robust and can help differentiate bipolar depression from unipolar depression. Copyright © 2011 Elsevier B.V. All rights reserved.
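The area under the ROC curve reported above can be computed from predicted probabilities without plotting the curve, using its rank interpretation: the AUC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below assumes binary labels coded 1 and 0; the scores are made up and are not study data.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up predicted probabilities; 1 marks a later bipolar diagnosis.
scores = [0.9, 0.8, 0.35, 0.6, 0.4, 0.2]
labels = [1, 1, 1, 0, 0, 0]
print(roc_auc(scores, labels))
```

An AUC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation of the two diagnostic groups.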
Binder, Harald; Sauerbrei, Willi; Royston, Patrick
2013-06-15
In observational studies, many continuous or categorical covariates may be related to an outcome. Various spline-based procedures or the multivariable fractional polynomial (MFP) procedure can be used to identify important variables and functional forms for continuous covariates. This is the main aim of an explanatory model, as opposed to a model only for prediction. The type of analysis often guides the complexity of the final model. Spline-based procedures and MFP have tuning parameters for choosing the required complexity. To compare model selection approaches, we perform a simulation study in the linear regression context based on a data structure intended to reflect realistic biomedical data. We vary the sample size, variance explained and complexity parameters for model selection. We consider 15 variables. A sample size of 200 (1000) and R^2 = 0.2 (0.8) is the scenario with the smallest (largest) amount of information. For assessing performance, we consider prediction error, correct and incorrect inclusion of covariates, qualitative measures for judging selected functional forms and further novel criteria. From limited information, a suitable explanatory model cannot be obtained. Prediction performance from all types of models is similar. With a medium amount of information, MFP performs better than splines on several criteria. MFP better recovers simpler functions, whereas splines better recover more complex functions. For a large amount of information and no local structure, MFP and the spline procedures often select similar explanatory models. Copyright © 2012 John Wiley & Sons, Ltd.
Design Optimization of a Centrifugal Fan with Splitter Blades
NASA Astrophysics Data System (ADS)
Heo, Man-Woong; Kim, Jin-Hyuk; Kim, Kwang-Yong
2015-05-01
Multi-objective optimization of a centrifugal fan with additionally installed splitter blades was performed to simultaneously maximize the efficiency and pressure rise using three-dimensional Reynolds-averaged Navier-Stokes equations and hybrid multi-objective evolutionary algorithm. Two design variables defining the location of splitter, and the height ratio between inlet and outlet of impeller were selected for the optimization. In addition, the aerodynamic characteristics of the centrifugal fan were investigated with the variation of design variables in the design space. Latin hypercube sampling was used to select the training points, and response surface approximation models were constructed as surrogate models of the objective functions. With the optimization, both the efficiency and pressure rise of the centrifugal fan with splitter blades were improved considerably compared to the reference model.
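Latin hypercube sampling of a two-variable design space can be sketched as follows. The variable ranges are hypothetical normalized bounds, not values from the study, and the function name is an assumption. Each variable's range is split into as many equal strata as there are training points, and every stratum is used exactly once per variable.

```python
import random


def latin_hypercube(n_points, bounds, seed=0):
    """Draw n_points samples so each variable hits every one of its n_points strata once."""
    rng = random.Random(seed)
    samples = [[0.0] * len(bounds) for _ in range(n_points)]
    for dim, (low, high) in enumerate(bounds):
        strata = list(range(n_points))
        rng.shuffle(strata)  # random pairing of strata across variables
        for i, stratum in enumerate(strata):
            u = (stratum + rng.random()) / n_points  # uniform within the stratum
            samples[i][dim] = low + u * (high - low)
    return samples


# Two hypothetical design variables: splitter location and height ratio.
points = latin_hypercube(10, [(0.0, 1.0), (0.5, 1.5)])
for point in points:
    print(point)
```

The resulting points would then be evaluated with the flow solver and used to fit the response surface approximation models.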
NASA Astrophysics Data System (ADS)
Hadi, Sinan Jasim; Tombul, Mustafa
2018-06-01
Streamflow is an essential component of the hydrologic cycle in the regional and global scale and the main source of fresh water supply. It is highly associated with natural disasters, such as droughts and floods. Therefore, accurate streamflow forecasting is essential. Forecasting streamflow in general and monthly streamflow in particular is a complex process that cannot be handled by data-driven models (DDMs) only and requires pre-processing. Wavelet transformation is a pre-processing technique; however, application of continuous wavelet transformation (CWT) produces many scales that cause deterioration in the performance of any DDM because of the high number of redundant variables. This study proposes multigene genetic programming (MGGP) as a selection tool. After the CWT analysis, it selects important scales to be imposed into the artificial neural network (ANN). A basin located in the southeast of Turkey is selected as case study to prove the forecasting ability of the proposed model. One month ahead downstream flow is used as output, and downstream flow, upstream, rainfall, temperature, and potential evapotranspiration with associated lags are used as inputs. Before modeling, wavelet coherence transformation (WCT) analysis was conducted to analyze the relationship between variables in the time-frequency domain. Several combinations were developed to investigate the effect of the variables on streamflow forecasting. The results indicated a high localized correlation between the streamflow and other variables, especially the upstream. In the models of the standalone layout where the data were entered to ANN and MGGP without CWT, the performance is found poor. In the best-scale layout, where the best scale of the CWT identified as the highest correlated scale is chosen and enters to ANN and MGGP, the performance increased slightly. 
Using the proposed model, the performance improved dramatically particularly in forecasting the peak values because of the inclusion of several scales in which seasonality and irregularity can be captured. Using hydrological and meteorological variables also improved the ability to forecast the streamflow.
Obtaining the variance of gametic diversity with genomic models
USDA-ARS's Scientific Manuscript database
It may be possible to use information about the variability among gametes (spermatozoa and ova) to select parents that are more likely than average to produce offspring with extremely high or low breeding values. In this study, statistical formulae were developed to calculate variability among gamet...
TRANPLAN and GIS support for agencies in Alabama
DOT National Transportation Integrated Search
2001-08-06
Travel demand models are computerized programs intended to forecast future roadway traffic volumes for a community based on selected socioeconomic variables and travel behavior algorithms. Software to operate these travel demand models is currently a...
Beatty, William S.; Webb, Elisabeth B.; Kesler, Dylan C.; Raedeke, Andrew H.; Naylor, Luke W.; Humburg, Dale D.
2014-01-01
Previous studies that evaluated effects of landscape-scale habitat heterogeneity on migratory waterbird distributions were spatially limited and temporally restricted to one major life-history phase. However, effects of landscape-scale habitat heterogeneity on long-distance migratory waterbirds can be studied across the annual cycle using new technologies, including global positioning system satellite transmitters. We used Bayesian discrete choice models to examine the influence of local habitats and landscape composition on habitat selection by a generalist dabbling duck, the mallard (Anas platyrhynchos), in the midcontinent of North America during the non-breeding period. Using a previously published empirical movement metric, we separated the non-breeding period into three seasons, including autumn migration, winter, and spring migration. We defined spatial scales based on movement patterns such that movements >0.25 and <30.00 km were classified as local scale and movements >30.00 km were classified as relocation scale. Habitat selection at the local scale was generally influenced by local and landscape-level variables across all seasons. Variables in top models at the local scale included proximities to cropland, emergent wetland, open water, and woody wetland. Similarly, variables associated with area of cropland, emergent wetland, open water, and woody wetland were also included at the local scale. At the relocation scale, mallards selected resource units based on more generalized variables, including proximity to wetlands and total wetland area. Our results emphasize the role of landscape composition in waterbird habitat selection and provide further support for local wetland landscapes to be considered functional units of waterbird conservation and management.
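The scale definition in this abstract can be written as a small classifier. The Python sketch below is illustrative only: the function name `classify_movement` and the handling of distances outside the two stated ranges (returned as `None`) are assumptions, since the abstract does not say how movements of 0.25 km or less, or of exactly 30.00 km, were treated.

```python
from typing import Optional

LOCAL_MIN_KM = 0.25    # lower bound for local-scale movements (exclusive)
RELOCATION_KM = 30.00  # boundary between local and relocation scales


def classify_movement(distance_km: float) -> Optional[str]:
    """Return the spatial scale of one movement, or None if it is undefined."""
    if LOCAL_MIN_KM < distance_km < RELOCATION_KM:
        return "local"
    if distance_km > RELOCATION_KM:
        return "relocation"
    # Not covered by the abstract's definitions (assumption: leave unclassified).
    return None
```

Returning `None` keeps the undefined cases visible instead of silently assigning them to one of the two scales.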
Artificial neural networks modelling the prednisolone nanoprecipitation in microfluidic reactors.
Ali, Hany S M; Blagden, Nicholas; York, Peter; Amani, Amir; Brook, Toni
2009-06-28
This study employs artificial neural networks (ANNs) to create a model to identify relationships between variables affecting drug nanoprecipitation using microfluidic reactors. The input variables examined were saturation levels of prednisolone, solvent and antisolvent flow rates, microreactor inlet angles and internal diameters, while particle size was the single output. ANNs software was used to analyse a set of data obtained by random selection of the variables. The developed model was then assessed using a separate set of validation data and provided good agreement with the observed results. The antisolvent flow rate was found to have the dominant role on determining final particle size.
USDA-ARS's Scientific Manuscript database
Bacterial cold water disease (BCWD) causes significant economic losses in salmonid aquaculture. At the National Center for Cool and Cold Water Aquaculture (NCCCWA), we have pursued selective breeding to increase rainbow trout genetic resistance against BCWD and found that post-challenge survival is ...
Efficient robust doubly adaptive regularized regression with applications.
Karunamuni, Rohana J; Kong, Linglong; Tu, Wei
2018-01-01
We consider the problem of estimation and variable selection for general linear regression models. Regularized regression procedures have been widely used for variable selection, but most existing methods perform poorly in the presence of outliers. We construct a new penalized procedure that simultaneously attains full efficiency and maximum robustness. Furthermore, the proposed procedure satisfies the oracle properties. The new procedure is designed to achieve sparse and robust solutions by imposing adaptive weights on both the decision loss and the penalty function. The proposed method of estimation and variable selection attains full efficiency when the model is correct and, at the same time, achieves maximum robustness when outliers are present. We examine the robustness properties using the finite-sample breakdown point and an influence function. We show that the proposed estimator attains the maximum breakdown point. Furthermore, there is no loss in efficiency when there are no outliers or the error distribution is normal. For practical implementation of the proposed method, we present a computational algorithm. We examine the finite-sample and robustness properties using Monte Carlo studies. Two datasets are also analyzed.
Nelson, Jon P
2014-01-01
Precise estimates of price elasticities are important for alcohol tax policy. Using meta-analysis, this paper corrects average beer elasticities for heterogeneity, dependence, and publication selection bias. A sample of 191 estimates is obtained from 114 primary studies. Simple and weighted means are reported. Dependence is addressed by restricting number of estimates per study, author-restricted samples, and author-specific variables. Publication bias is addressed using funnel graph, trim-and-fill, and Egger's intercept model. Heterogeneity and selection bias are examined jointly in meta-regressions containing moderator variables for econometric methodology, primary data, and precision of estimates. Results for fixed- and random-effects regressions are reported. Country-specific effects and sample time periods are unimportant, but several methodology variables help explain the dispersion of estimates. In models that correct for selection bias and heterogeneity, the average beer price elasticity is about -0.20, which is less elastic by 50% compared to values commonly used in alcohol tax policy simulations. Copyright © 2013 Elsevier B.V. All rights reserved.
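The difference between a simple and a weighted mean of elasticity estimates can be illustrated with a short Python sketch. The abstract does not state which weights were used; inverse-variance weights (1/SE²) are assumed here because they are a common choice in meta-analysis, and the function names are hypothetical.

```python
def simple_mean(estimates):
    """Unweighted average of the study estimates."""
    return sum(estimates) / len(estimates)


def inverse_variance_mean(estimates, standard_errors):
    """Average weighted by 1 / SE^2, so precise estimates count more.

    Assumption: this weighting scheme is illustrative and is not
    confirmed by the abstract.
    """
    weights = [1.0 / (se ** 2) for se in standard_errors]
    weighted_sum = sum(w * e for w, e in zip(weights, estimates))
    return weighted_sum / sum(weights)
```

With toy inputs such as estimates of -0.2 and -0.4 with standard errors 0.1 and 0.2, the weighted mean is pulled toward the more precise estimate, while the simple mean treats both equally.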
Voss, Frank D.; Mastin, Mark C.
2012-01-01
A database was developed to automate model execution and to provide users with Internet access to voluminous data products ranging from summary figures to model output timeseries. Database-enabled Internet tools were developed to allow users to create interactive graphs of output results based on their analysis needs. For example, users were able to create graphs by selecting time intervals, greenhouse gas emission scenarios, general circulation models, and specific hydrologic variables.
NASA Astrophysics Data System (ADS)
El Naqa, I.; Suneja, G.; Lindsay, P. E.; Hope, A. J.; Alaly, J. R.; Vicic, M.; Bradley, J. D.; Apte, A.; Deasy, J. O.
2006-11-01
Radiotherapy treatment outcome models are a complicated function of treatment, clinical and biological factors. Our objective is to provide clinicians and scientists with an accurate, flexible and user-friendly software tool to explore radiotherapy outcomes data and build statistical tumour control or normal tissue complications models. The software tool, called the dose response explorer system (DREES), is based on Matlab, and uses a named-field structure array data type. DREES/Matlab in combination with another open-source tool (CERR) provides an environment for analysing treatment outcomes. DREES provides many radiotherapy outcome modelling features, including (1) fitting of analytical normal tissue complication probability (NTCP) and tumour control probability (TCP) models, (2) combined modelling of multiple dose-volume variables (e.g., mean dose, max dose, etc) and clinical factors (age, gender, stage, etc) using multi-term regression modelling, (3) manual or automated selection of logistic or actuarial model variables using bootstrap statistical resampling, (4) estimation of uncertainty in model parameters, (5) performance assessment of univariate and multivariate analyses using Spearman's rank correlation and chi-square statistics, boxplots, nomograms, Kaplan-Meier survival plots, and receiver operating characteristics curves, and (6) graphical capabilities to visualize NTCP or TCP prediction versus selected variable models using various plots. DREES provides clinical researchers with a tool customized for radiotherapy outcome modelling. DREES is freely distributed. We expect to continue developing DREES based on user feedback.
Van Steen, Kristel; Curran, Desmond; Kramer, Jocelyn; Molenberghs, Geert; Van Vreckem, Ann; Bottomley, Andrew; Sylvester, Richard
2002-12-30
Clinical and quality of life (QL) variables from an EORTC clinical trial of first line chemotherapy in advanced breast cancer were used in a prognostic factor analysis of survival and response to chemotherapy. For response, different final multivariate models were obtained from forward and backward selection methods, suggesting a disconcerting instability. Quality of life was measured using the EORTC QLQ-C30 questionnaire completed by patients. Subscales on the questionnaire are known to be highly correlated, and therefore it was hypothesized that multicollinearity contributed to model instability. A correlation matrix indicated that global QL was highly correlated with 7 out of 11 variables. In a first attempt to explore multicollinearity, we used global QL as dependent variable in a regression model with other QL subscales as predictors. Afterwards, standard diagnostic tests for multicollinearity were performed. An exploratory principal components analysis and factor analysis of the QL subscales identified at most three important components and indicated that inclusion of global QL made minimal difference to the loadings on each component, suggesting that it is redundant in the model. In a second approach, we advocate a bootstrap technique to assess the stability of the models. Based on these analyses and since global QL exacerbates problems of multicollinearity, we therefore recommend that global QL be excluded from prognostic factor analyses using the QLQ-C30. The prognostic factor analysis was rerun without global QL in the model, and selected the same significant prognostic factors as before. Copyright 2002 John Wiley & Sons, Ltd.
NASA Astrophysics Data System (ADS)
Hayes, Catherine
2005-07-01
This study sought to identify a variable or variables predictive of attrition among baccalaureate nursing students. The study was quantitative in design, and multivariate correlational statistics and discriminant statistical analysis were used to identify a model for prediction of attrition. The analysis then weighted variables according to their predictive value to determine the most parsimonious model with the greatest predictive value. Three public university nursing education programs in Mississippi offering a Bachelors Degree in Nursing were selected for the study. The population consisted of students accepted and enrolled in these three programs for the years 2001 and 2002 and graduating in the years 2003 and 2004 (N = 195). The categorical dependent variable was attrition (includes academic failure or withdrawal) from the program of nursing education. The ten independent variables selected for the study and considered to have possible predictive value were: Grade Point Average for Pre-requisite Course Work; ACT Composite Score, ACT Reading Subscore, and ACT Mathematics Subscore; Letter Grades in the Courses: Anatomy & Physiology and Lab I, Algebra I, English I (101), Chemistry & Lab I, and Microbiology & Lab I; and Number of Institutions Attended (Universities, Colleges, Junior Colleges or Community Colleges). Descriptive analysis was performed and the means of each of the ten independent variables were compared for students who attrited and those who were retained in the population. The discriminant statistical analysis performed created a matrix using the ten variable model that was able to correctly predict attrition in the study's population in 77.6% of the cases. Variables were then combined and recombined to produce the most efficient and parsimonious model for prediction. A six variable model resulted which weighted each variable according to predictive value: GPA for Prerequisite Coursework, ACT Composite, English I, Chemistry & Lab I, Microbiology & Lab I, and Number of Institutions Attended. Results of the study indicate that it is possible to predict attrition among students enrolled in baccalaureate nursing education programs and that additional investigation on the subject is warranted.
Developing a Model for Forecasting Road Traffic Accident (RTA) Fatalities in Yemen
NASA Astrophysics Data System (ADS)
Karim, Fareed M. A.; Abdo Saleh, Ali; Taijoobux, Aref; Ševrović, Marko
2017-12-01
The aim of this paper is to develop a model for forecasting RTA fatalities in Yemen. The yearly fatalities were modeled as the dependent variable, while the independent variables included the population, number of vehicles, GNP, GDP and real GDP per capita. It was determined that all these variables are highly correlated (correlation coefficient r ≈ 0.9); in order to avoid multicollinearity in the model, a single variable with the highest r value was selected (real GDP per capita). A simple regression model was developed; the model was very good (R² = 0.916); however, the residuals were serially correlated. The Prais-Winsten procedure was used to overcome this violation of the regression assumption. The data for a 20-year period from 1991-2010 were analyzed to build the model; the model was validated by using data for the years 2011-2013; the historical fit for the period 1991-2011 was very good. Also, the validation for 2011-2013 proved accurate.
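The workflow described here (keep only the candidate predictor with the highest correlation, fit a simple regression, then check the residuals for serial correlation) can be sketched in plain Python. This is a minimal illustration under stated assumptions: all function names are hypothetical, the Durbin-Watson statistic is included as one common diagnostic although the abstract does not say which test was used, and the Prais-Winsten correction itself is not implemented.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def select_predictor(candidates, y):
    """Return the name of the candidate with the largest |r| against y."""
    return max(candidates, key=lambda name: abs(pearson_r(candidates[name], y)))


def fit_simple_regression(x, y):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (intercept, slope, r_squared).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope, pearson_r(x, y) ** 2


def durbin_watson(x, y, intercept, slope):
    """Durbin-Watson statistic of the regression residuals.

    Values near 2 suggest little first-order serial correlation.
    Assumption: shown as one possible diagnostic, not the paper's method.
    """
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    return numerator / sum(e ** 2 for e in residuals)
```

Selecting a single predictor avoids multicollinearity by construction, at the cost of discarding whatever independent information the other candidates carry.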
Uniting Statistical and Individual-Based Approaches for Animal Movement Modelling
Latombe, Guillaume; Parrott, Lael; Basille, Mathieu; Fortin, Daniel
2014-01-01
The dynamic nature of their internal states and the environment directly shape animals' spatial behaviours and give rise to emergent properties at broader scales in natural systems. However, integrating these dynamic features into habitat selection studies remains challenging, due to practically impossible field work to access internal states and the inability of current statistical models to produce dynamic outputs. To address these issues, we developed a robust method, which combines statistical and individual-based modelling. Using a statistical technique for forward modelling of the IBM has the advantage of being faster for parameterization than a pure inverse modelling technique and allows for robust selection of parameters. Using GPS locations from caribou monitored in Québec, caribou movements were modelled based on generative mechanisms accounting for dynamic variables at a low level of emergence. These variables were accessed by replicating real individuals' movements in parallel sub-models, and movement parameters were then empirically parameterized using Step Selection Functions. The final IBM model was validated using both k-fold cross-validation and emergent patterns validation and was tested for two different scenarios, with varying hardwood encroachment. Our results highlighted a functional response in habitat selection, which suggests that our method was able to capture the complexity of the natural system, and adequately provided projections on future possible states of the system in response to different management plans. This is especially relevant for testing the long-term impact of scenarios corresponding to environmental configurations that have yet to be observed in real systems. PMID:24979047
NASA Astrophysics Data System (ADS)
Moll, Andreas; Stegert, Christoph
2007-01-01
This paper outlines an approach to couple a structured zooplankton population model with state variables for eggs, nauplii, two copepodite stages and adults adapted to Pseudocalanus elongatus into the complex marine ecosystem model ECOHAM2 with 13 state variables resolving the carbon and nitrogen cycle. Different temperature and food scenarios derived from laboratory culture studies were examined to improve the process parameterisation for copepod stage dependent development processes. To study annual cycles under realistic weather and hydrographic conditions, the coupled ecosystem-zooplankton model is applied to a water column in the northern North Sea. The main ecosystem state variables were validated against observed monthly mean values. Then vertical profiles of selected state variables were compared to the physical forcing to study differences between zooplankton as one biomass state variable or partitioned into five population state variables. Simulated generation times are more affected by temperature than food conditions except during the spring phytoplankton bloom. Up to six generations within the annual cycle can be discerned in the simulation.
Variable selection in a flexible parametric mixture cure model with interval-censored data.
Scolas, Sylvie; El Ghouch, Anouar; Legrand, Catherine; Oulhaj, Abderrahim
2016-03-30
In standard survival analysis, it is generally assumed that every individual will experience someday the event of interest. However, this is not always the case, as some individuals may not be susceptible to this event. Also, in medical studies, it is frequent that patients come to scheduled interviews and that the time to the event is only known to occur between two visits. That is, the data are interval-censored with a cure fraction. Variable selection in such a setting is of outstanding interest. Covariates impacting the survival are not necessarily the same as those impacting the probability to experience the event. The objective of this paper is to develop a parametric but flexible statistical model to analyze data that are interval-censored and include a fraction of cured individuals when the number of potential covariates may be large. We use the parametric mixture cure model with an accelerated failure time regression model for the survival, along with the extended generalized gamma for the error term. To overcome the issue of non-stable and non-continuous variable selection procedures, we extend the adaptive LASSO to our model. By means of simulation studies, we show good performance of our method and discuss the behavior of estimates with varying cure and censoring proportion. Lastly, our proposed method is illustrated with a real dataset studying the time until conversion to mild cognitive impairment, a possible precursor of Alzheimer's disease. © 2015 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd.
Development and evaluation of height diameter at breast models for native Chinese Metasequoia.
Liu, Mu; Feng, Zhongke; Zhang, Zhixiang; Ma, Chenghui; Wang, Mingming; Lian, Bo-Ling; Sun, Renjie; Zhang, Li
2017-01-01
Accurate tree height and diameter at breast height (dbh) are important input variables for growth and yield models. A total of 5503 Chinese Metasequoia trees were used in this study. We studied 53 fitted models, of which 7 were linear models and 46 were non-linear models. These models were divided into two groups of single models and multivariate models according to the number of independent variables. The results show that the allometry equation of tree height which has diameter at breast height as independent variable can better reflect the change of tree height; in addition, the prediction accuracy of the multivariate composite models is higher than that of the single variable models. Although tree age is not the most important variable in the study of the relationship between tree height and dbh, the consideration of tree age when choosing models and parameters in model selection can make the prediction of tree height more accurate. The amount of data is also an important parameter that can improve the reliability of models. Other variables such as tree height, main dbh and altitude, etc. can also affect models. In this study, the method of developing the recommended models for predicting the tree height of native Metasequoias aged 50–485 years is statistically reliable and can be used for reference in predicting the growth and production of mature native Metasequoia. PMID:28817600
Motamarri, Srinivas; Boccelli, Dominic L
2012-09-15
Users of recreational waters may be exposed to elevated pathogen levels through various point/non-point sources. Typical daily notifications rely on microbial analysis of indicator organisms (e.g., Escherichia coli) that require 18, or more, hours to provide an adequate response. Modeling approaches, such as multivariate linear regression (MLR) and artificial neural networks (ANN), have been utilized to provide quick predictions of microbial concentrations for classification purposes, but generally suffer from high false negative rates. This study introduces the use of learning vector quantization (LVQ)--a direct classification approach--for comparison with MLR and ANN approaches and integrates input selection for model development with respect to primary and secondary water quality standards within the Charles River Basin (Massachusetts, USA) using meteorologic, hydrologic, and microbial explanatory variables. Integrating input selection into model development showed that discharge variables were the most important explanatory variables while antecedent rainfall and time since previous events were also important. With respect to classification, all three models adequately represented the non-violated samples (>90%). The MLR approach had the highest false negative rates associated with classifying violated samples (41-62% vs 13-43% (ANN) and <16% (LVQ)) when using five or more explanatory variables. The ANN performance was more similar to LVQ when a larger number of explanatory variables were utilized, but the ANN performance degraded toward MLR performance as explanatory variables were removed. Overall, the use of LVQ as a direct classifier provided the best overall classification ability with respect to violated/non-violated samples for both standards. Copyright © 2012 Elsevier Ltd. All rights reserved.
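The false negative rates compared in this abstract refer to violated samples that a model classifies as non-violated. A minimal Python sketch of that metric follows; the function name and the boolean encoding (`True` = standard violated) are assumptions made for illustration.

```python
def false_negative_rate(observed, predicted):
    """Fraction of observed violations that were predicted as non-violations.

    observed, predicted: equal-length sequences of booleans,
    where True means the water quality standard was violated.
    """
    violated = [(o, p) for o, p in zip(observed, predicted) if o]
    if not violated:
        raise ValueError("no observed violations; the rate is undefined")
    missed = sum(1 for _, p in violated if not p)
    return missed / len(violated)
```

Because the denominator counts only violated samples, a model can classify more than 90% of non-violated samples correctly and still have a high false negative rate.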
UNCERTAINTY ANALYSIS IN WATER QUALITY MODELING USING QUAL2E
A strategy for incorporating uncertainty analysis techniques (sensitivity analysis, first order error analysis, and Monte Carlo simulation) into the mathematical water quality model QUAL2E is described. The model, named QUAL2E-UNCAS, automatically selects the input variables or p...
A solution to the static frame validation challenge problem using Bayesian model selection
Grigoriu, M. D.; Field, R. V.
2007-12-23
Within this paper, we provide a solution to the static frame validation challenge problem (see this issue) in a manner that is consistent with the guidelines provided by the Validation Challenge Workshop tasking document. The static frame problem is constructed such that variability in material properties is known to be the only source of uncertainty in the system description, but there is ignorance on the type of model that best describes this variability. Hence both types of uncertainty, aleatoric and epistemic, are present and must be addressed. Our approach is to consider a collection of competing probabilistic models for the material properties, and calibrate these models to the information provided; models of different levels of complexity and numerical efficiency are included in the analysis. A Bayesian formulation is used to select the optimal model from the collection, which is then used for the regulatory assessment. Lastly, Bayesian credible intervals are used to provide a measure of confidence to our regulatory assessment.
King, M.D.; Burkardt, N.; Clark, B.T.
2006-01-01
Recent literature on the diffusion of innovations concentrates either specifically on public adoption of policy, where social or environmental conditions are the dependent variables for adoption, or on private adoption of an innovation, where emphasis is placed on the characteristics of the innovation itself. This article uses both the policy diffusion literature and the diffusion of innovation literature to assess watershed management councils' decisions to adopt, or not adopt, scientific models. Watershed management councils are a relevant case study because they possess both public and private attributes. We report on a survey of councils in the United States that was conducted to determine the criteria used when selecting scientific models for studying watershed conditions. We found that specific variables from each body of literature play a role in explaining the choice to adopt scientific models by these quasi-public organizations. The diffusion of innovation literature contributes to an understanding of how organizations select models by confirming the importance of a model's ability to provide better data. Variables from the policy diffusion literature showed that watershed management councils that employ consultants are more likely to use scientific models. We found a gap between those who create scientific models and those who use these models. We recommend shrinking this gap through more communication between these actors and advancing the need for developers to provide more technical assistance.
NASA Astrophysics Data System (ADS)
Urry, C. Megan
1997-01-01
This grant was awarded to Dr. C. Megan Urry of the Space Telescope Science Institute in response to two successful ADP proposals to use archival Ginga and Rosat X-ray data for 'Testing the Pairs-Reflection model with X-Ray Spectral Variability' (in collaboration with Paola Grandi, now at the University of Rome) and 'X-Ray Properties of Complete Samples of Radio-Selected BL Lacertae Objects' (in collaboration with then-graduate student Rita Sambruna, now a post-doc at Goddard Space Flight Center). In addition, post-docs Joseph Pesce and Elena Pian, and graduate student Matthew O'Dowd, have worked on several aspects of these projects. The grant was originally awarded on 3/01/94; this report covers the full period, through May 1997. We have completed our project on the X-ray properties of radio-selected BL Lacs.
Zheng, Qi; Peng, Limin
2016-01-01
Quantile regression provides a flexible platform for evaluating covariate effects on different segments of the conditional distribution of response. As the effects of covariates may change with quantile level, contemporaneously examining a spectrum of quantiles is expected to have a better capacity to identify variables with either partial or full effects on the response distribution, as compared to focusing on a single quantile. Under this motivation, we study a general adaptively weighted LASSO penalization strategy in the quantile regression setting, where a continuum of quantile index is considered and coefficients are allowed to vary with quantile index. We establish the oracle properties of the resulting estimator of coefficient function. Furthermore, we formally investigate a BIC-type uniform tuning parameter selector and show that it can ensure consistent model selection. Our numerical studies confirm the theoretical findings and illustrate an application of the new variable selection procedure. PMID:28008212
A Rapid Approach to Modeling Species-Habitat Relationships
NASA Technical Reports Server (NTRS)
Carter, Geoffrey M.; Breinger, David R.; Stolen, Eric D.
2005-01-01
A growing number of species require conservation or management efforts. Success of these activities requires knowledge of the species' occurrence pattern. Species-habitat models developed from GIS data sources are commonly used to predict species occurrence, but commonly used data sources are often developed for purposes other than predicting species occurrence and are of inappropriate scale. The techniques used to extract predictor variables are also often time consuming, cannot be repeated easily, and thus cannot efficiently reflect changing conditions. We used digital orthophotographs and a grid cell classification scheme to develop an efficient technique to extract predictor variables. We combined our classification scheme with a priori hypothesis development using expert knowledge and a previously published habitat suitability index and used an objective model selection procedure to choose candidate models. We were able to classify a large area (57,000 ha) in a fraction of the time that would be required to map vegetation and were able to test models at varying scales using a windowing process. Interpretation of the selected models confirmed existing knowledge of factors important to Florida scrub-jay habitat occupancy. The potential uses and advantages of using a grid cell classification scheme in conjunction with expert knowledge or a habitat suitability index (HSI) and an objective model selection procedure are discussed.
Decision tree modeling using R.
Zhang, Zhongheng
2016-08-01
In the machine learning field, the decision tree learner is powerful and easy to interpret. It employs a recursive binary partitioning algorithm that splits the sample on the partitioning variable with the strongest association with the response variable. The process continues until some stopping criteria are met. In the example, I focus on the conditional inference tree, which incorporates tree-structured regression models into conditional inference procedures. Because a single tree is sensitive to small changes in the training data, the random forests procedure is introduced to address this problem. The sources of diversity for random forests come from the random sampling and the restricted set of input variables to be selected. Finally, I introduce R functions to perform model-based recursive partitioning. This method incorporates recursive partitioning into conventional parametric model building.
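The single partitioning step mentioned in the abstract above can be sketched as follows. This is only an illustration in Python rather than the R functions the article covers, and it swaps in a simpler criterion: it picks the threshold that minimises the summed squared error of the two child nodes (a CART-style rule), not the association tests used by conditional inference trees. The function names are hypothetical.

```python
def sum_squared_error(values):
    """Squared deviation of the values around their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, error) for the split x <= threshold with lowest error.

    If no split improves on the unsplit node, threshold is None.
    """
    best = (None, sum_squared_error(y))
    # The largest value is skipped because it would leave the right child empty.
    for threshold in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        error = sum_squared_error(left) + sum_squared_error(right)
        if error < best[1]:
            best = (threshold, error)
    return best
```

A full tree would apply `best_split` recursively to each child until a stopping criterion (for example a minimum node size) is met.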
Rice, Mindy B; Rossi, Liza G; Apa, Anthony D
2016-01-01
Fragmentation of the sagebrush (Artemisia spp.) ecosystem has led to concern about a variety of sagebrush obligates including the greater sage-grouse (Centrocercus urophasianus). Given the increase of energy development within greater sage-grouse habitats, mapping seasonal habitats in pre-development populations is critical. The North Park population in Colorado is one of the largest and most stable in the state and provides a unique case study for investigating resource selection at a relatively low level of energy development compared to other populations both within and outside the state. We used locations from 117 radio-marked female greater sage-grouse in North Park, Colorado to develop seasonal resource selection models. We then added energy development variables to the base models at both a landscape and local scale to determine if energy variables improved the fit of the seasonal models. The base models for breeding and winter resource selection predicted greater use in large expanses of sagebrush whereas the base summer model predicted greater use along the edge of riparian areas. Energy development variables did not improve the winter or the summer models at either scale of analysis, but distance to oil/gas roads slightly improved model fit at both scales in the breeding season, albeit in opposite ways. At the landscape scale, greater sage-grouse were closer to oil/gas roads whereas they were further from oil/gas roads at the local scale during the breeding season. Although we found limited effects from low level energy development in the breeding season, the scale of analysis can influence the interpretation of effects. The lack of strong effects from energy development may indicate that energy development at current levels is not impacting greater sage-grouse in North Park. Our baseline seasonal resource selection maps can be used for conservation to help identify ways of minimizing the effects of energy development.
Jiao, Shengwu; Guo, Yumin; Huettmann, Falk; Lei, Guangchun
2014-07-01
Avian nest-site selection is an important research and management subject. The hooded crane (Grus monacha) is a vulnerable (VU) species according to the IUCN Red List. Here, we present the first long-term Chinese legacy nest data for this species (1993-2010) with publicly available metadata. Further, we provide the first study that reports findings on multivariate nest habitat preference using such long-term field data for this species. Our work was carried out in Northeastern China, where we found and measured 24 nests and 81 randomly selected control plots and their environmental parameters in a vast landscape. We used machine learning (stochastic boosted regression trees) to quantify nest selection. Our analysis further included varclust (R Hmisc) and TreeNet to address statistical correlations and two-way interactions. We found that from an initial list of 14 measured field variables, water area (+), water depth (+) and shrub coverage (-) were the main explanatory variables that contributed to hooded crane nest-site selection. Agricultural sites played a smaller role in the selection of these nests. Our results are important for the conservation management of cranes all over East Asia and constitute a defensible and quantitative basis for predictive models.
Application of neural networks and sensitivity analysis to improved prediction of trauma survival.
Hunter, A; Kennedy, L; Henry, J; Ferguson, I
2000-05-01
The performance of trauma departments is widely audited by applying predictive models that assess probability of survival, and examining the rate of unexpected survivals and deaths. Although the TRISS methodology, a logistic regression modelling technique, is still the de facto standard, it is known that neural network models perform better. A key issue when applying neural network models is the selection of input variables. This paper proposes a novel form of sensitivity analysis, which is simpler to apply than existing techniques, and can be used for both numeric and nominal input variables. The technique is applied to the audit survival problem, and used to analyse the TRISS variables. The conclusions discuss the implications for the design of further improved scoring schemes and predictive models.
Modelling municipal solid waste generation: a review.
Beigl, Peter; Lebersorger, Sandra; Salhofer, Stefan
2008-01-01
The objective of this paper is to review previously published models of municipal solid waste generation and to propose an implementation guideline which will provide a compromise between information gain and cost-efficient model development. The 45 modelling approaches identified in a systematic literature review aim at explaining or estimating the present or future waste generation using economic, socio-demographic or management-orientated data. A classification was developed in order to categorise these highly heterogeneous models according to the following criteria--the regional scale, the modelled waste streams, the hypothesised independent variables and the modelling method. A procedural practice guideline was derived from a discussion of the underlying models in order to propose beneficial design options concerning regional sampling (i.e., number and size of observed areas), waste stream definition and investigation, selection of independent variables and model validation procedures. The practical application of the findings was demonstrated with two case studies performed on different regional scales, i.e., on a household and on a city level. The findings of this review are finally summarised in the form of a relevance tree for methodology selection.
The variability of software scoring of the CDMAM phantom associated with a limited number of images
NASA Astrophysics Data System (ADS)
Yang, Chang-Ying J.; Van Metter, Richard
2007-03-01
Software scoring approaches provide an attractive alternative to human evaluation of CDMAM images from digital mammography systems, particularly for annual quality control testing as recommended by the European Protocol for the Quality Control of the Physical and Technical Aspects of Mammography Screening (EPQCM). Methods for correlating CDCOM-based results with human observer performance have been proposed. A common feature of all methods is the use of a small number (at most eight) of CDMAM images to evaluate the system. This study focuses on the potential variability in the estimated system performance that is associated with these methods. Sets of 36 CDMAM images were acquired under carefully controlled conditions from three different digital mammography systems. The threshold visibility thickness (TVT) for each disk diameter was determined using previously reported post-analysis methods from the CDCOM scorings for a randomly selected group of eight images for one measurement trial. This random selection process was repeated 3000 times to estimate the variability in the resulting TVT values for each disk diameter. The results from using different post-analysis methods, different random selection strategies and different digital systems were compared. Additional variability of the 0.1 mm disk diameter was explored by comparing the results from two different image data sets acquired under the same conditions from the same system. The magnitude and the type of error estimated for experimental data was explained through modeling. The modeled results also suggest a limitation in the current phantom design for the 0.1 mm diameter disks. Through modeling, it was also found that, because of the binomial statistic nature of the CDMAM test, the true variability of the test could be underestimated by the commonly used method of random re-sampling.
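The repeated random-selection procedure in the abstract above (groups of eight drawn from 36 images, 3000 times) can be sketched as below. The per-image scores and the use of a plain mean as the summary statistic are stand-ins chosen for illustration; the study's actual threshold visibility thickness analysis is not reproduced here, and the function name is hypothetical.

```python
import random
import statistics


def resample_spread(scores, group_size=8, trials=3000, seed=0):
    """Draw groups without replacement and summarise the spread of their means.

    Returns (mean of the trial estimates, standard deviation of the trial
    estimates). The mean is only a placeholder for a real scoring analysis.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        subset = rng.sample(scores, group_size)
        estimates.append(statistics.mean(subset))
    return statistics.mean(estimates), statistics.stdev(estimates)
```

Because every trial draws from the same 36 values, the spread obtained this way reflects only the variability within that pool, which is consistent with the abstract's remark that re-sampling can underestimate the true variability of the test.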
Abidi, Mustufa Haider; Al-Ahmari, Abdulrahman; Ahmad, Ali
2018-01-01
Advanced graphics capabilities have enabled the use of virtual reality as an efficient design technique. The integration of virtual reality in the design phase still faces impediment because of issues linked to the integration of CAD and virtual reality software. A set of empirical tests using the selected conversion parameters was found to yield properly represented virtual reality models. The reduced model yields an R-sq (pred) value of 72.71% and an R-sq (adjusted) value of 86.64%, indicating that 86.64% of the response variability can be explained by the model. The R-sq (pred) is 67.45%, which is not very high, indicating that the model should be further reduced by eliminating insignificant terms. The reduced model yields an R-sq (pred) value of 73.32% and an R-sq (adjusted) value of 79.49%, indicating that 79.49% of the response variability can be explained by the model. Using the optimization software MODE Frontier (Optimization, MOGA-II, 2014), four types of response surfaces for the three considered response variables were tested for the data of DOE. The parameter values obtained using the proposed experimental design methodology result in better graphics quality, and other necessary design attributes.
Stekel, Dov J.; Sarti, Donatella; Trevino, Victor; Zhang, Lihong; Salmon, Mike; Buckley, Chris D.; Stevens, Mark; Pallen, Mark J.; Penn, Charles; Falciani, Francesco
2005-01-01
A key step in the analysis of microarray data is the selection of genes that are differentially expressed. Ideally, such experiments should be properly replicated in order to infer both technical and biological variability, and the data should be subjected to rigorous hypothesis tests to identify the differentially expressed genes. However, in microarray experiments involving the analysis of very large numbers of biological samples, replication is not always practical. Therefore, there is a need for a method to select differentially expressed genes in a rational way from insufficiently replicated data. In this paper, we describe a simple method that uses bootstrapping to generate an error model from a replicated pilot study that can be used to identify differentially expressed genes in subsequent large-scale studies on the same platform, but in which there may be no replicated arrays. The method builds a stratified error model that includes array-to-array variability, feature-to-feature variability and the dependence of error on signal intensity. We apply this model to the characterization of the host response in a model of bacterial infection of human intestinal epithelial cells. We demonstrate the effectiveness of error model based microarray experiments and propose this as a general strategy for a microarray-based screening of large collections of biological samples. PMID:15800204
Forecasting of cyanobacterial density in Torrão reservoir using artificial neural networks.
Torres, Rita; Pereira, Elisa; Vasconcelos, Vítor; Teles, Luís Oliva
2011-06-01
The ability of general regression neural networks (GRNN) to forecast the density of cyanobacteria in the Torrão reservoir (Tâmega river, Portugal), in a period of 15 days, based on three years of collected physical and chemical data, was assessed. Several models were developed and 176 were selected based on their correlation values for the verification series. A time lag of 11 was used, equivalent to one sample (periods of 15 days in the summer and 30 days in the winter). Several combinations of the series were used. Input and output data collected from three depths of the reservoir were applied (surface, euphotic zone limit and bottom). The model that presented a higher average correlation value presented the correlations 0.991; 0.843; 0.978 for training, verification and test series. This model had the three series independent in time: first test series, then verification series and, finally, training series. Only six input variables were considered significant to the performance of this model: ammonia, phosphates, dissolved oxygen, water temperature, pH and water evaporation, physical and chemical parameters referring to the three depths of the reservoir. These variables are common to the next four best models produced and, although these included other input variables, their performance was not better than the selected best model.
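For readers unfamiliar with general regression neural networks, the prediction step can be sketched as a Gaussian-kernel weighted average of the training targets. This is a minimal illustration only: the function name, the single shared smoothing parameter `sigma`, and the toy inputs are assumptions, and nothing here reproduces the reservoir models or their input variables.

```python
import math


def grnn_predict(train_x, train_y, query, sigma):
    """Kernel-weighted average of training targets.

    Each training pattern contributes with weight
    exp(-squared_distance / (2 * sigma^2)), so nearby patterns dominate.
    """
    weights = []
    for pattern in train_x:
        squared_distance = sum((a - b) ** 2 for a, b in zip(pattern, query))
        weights.append(math.exp(-squared_distance / (2.0 * sigma ** 2)))
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, train_y)) / total
```

A small `sigma` makes the prediction follow the nearest training pattern closely, while a large `sigma` pulls it toward the overall mean of the targets.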
Characteristic analysis-1981: Final program and a possible discovery
McCammon, R.B.; Botbol, J.M.; Sinding-Larsen, R.; Bowen, R.W.
1983-01-01
The latest or newest version of the characteristic analysis (NCHARAN) computer program offers the exploration geologist a wide variety of options for integrating regionalized multivariate data. The options include the selection of regional cells for characterizing deposit models, the selection of variables that constitute the models, and the choice of logical combinations of variables that best represent these models. Moreover, the program provides for the display of results which, in turn, makes possible review, reselection, and refinement of a model. Most important, the performance of the above-mentioned steps in an interactive computing mode can result in a timely and meaningful interpretation of the data available to the exploration geologist. The most recent application of characteristic analysis has resulted in the possible discovery of economic sulfide mineralization in the Grong area in central Norway. Exploration data for 27 geophysical, geological, and geochemical variables were used to construct a mineralized and a lithogeochemical model for an area that contained a known massive sulfide deposit. The models were applied to exploration data collected from the Gjersvik area in the Grong mining district and resulted in the identification of two localities of possible mineralization. Detailed field examination revealed the presence of a sulfide vein system and a partially inverted stratigraphic sequence indicating the possible presence of a massive sulfide deposit at depth. © 1983 Plenum Publishing Corporation.
Baxter, Suzanne Domel; Royer, Julie A.; Hitchcock, David B.
2013-01-01
BACKGROUND A positive relationship exists between children’s body mass index (BMI) and energy intake at school-provided meals. To help explain this relationship, we investigated 7 outcome variables concerning aspects of school-provided meals—energy content of items selected, number of meal components selected, number of meal components eaten, amounts eaten of standardized school-meal portions, energy intake from flavored milk, energy intake received in trades, and energy content given in trades. METHODS We observed children in grade 4 (N=465) eating school-provided breakfast and lunch on one to 4 days per child. We measured children’s weight and height. For daily values at school meals, a generalized linear model was fit with BMI (dependent variable) and the 7 outcome variables, sex, and age (independent variables). RESULTS BMI was positively related to amounts eaten of standardized school-meal portions (p < .0001) and increased 8.45 kg/m2 per serving, controlling for other variables in the model. BMI was positively related to energy intake from flavored milk (p = .0041) and increased 0.347 kg/m2 for every 100-kcal consumed. BMI was negatively related to energy intake received in trades (p = .0003) and decreased 0.468 kg/m2 for every 100-kcal received. BMI was not significantly related to 4 outcome variables. CONCLUSIONS Knowing that relationships between BMI and actual consumption, not selection, at school-provided meals explained the (previously found) positive relationship between BMI and energy intake at school-provided meals is helpful for school-based obesity interventions. PMID:23517000
Variable Selection for Road Segmentation in Aerial Images
NASA Astrophysics Data System (ADS)
Warnke, S.; Bulatov, D.
2017-05-01
For extraction of road pixels from combined image and elevation data, Wegner et al. (2015) proposed classification of superpixels into road and non-road, after which a refinement of the classification results using minimum cost paths and non-local optimization methods took place. We believed that the variable set used for classification was to a certain extent suboptimal, because many variables were redundant while several features known to be useful in Photogrammetry and Remote Sensing were missing. This motivated us to implement a variable selection approach which builds a model for classification using portions of training data and subsets of features, evaluates this model, updates the feature set, and terminates when a stopping criterion is satisfied. The choice of classifier is flexible; however, we tested the approach with Logistic Regression and Random Forests, and tailored the evaluation module to the chosen classifier. To guarantee a fair comparison, we kept the segment-based approach and most of the variables from the related work, but we extended them by additional, mostly higher-level features. Applying these superior features, removing the redundant ones, and using more accurately acquired 3D data allowed us to keep the misclassification error stable or even to reduce it on a challenging dataset.
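The build, evaluate, update, and stop loop described in the abstract above can be sketched as a greedy forward selection wrapper. This is one possible reading, not the authors' procedure: the update rule (add the single best feature per round), the `min_gain` stopping criterion, and the caller-supplied `evaluate` function are all assumptions made for illustration.

```python
def forward_selection(candidates, evaluate, min_gain=1e-6):
    """Greedy wrapper: keep adding the feature that most lowers the error.

    `evaluate(features)` is a hypothetical caller-supplied function that
    trains any classifier on the given feature names and returns a
    validation error. Selection stops when the best candidate improves
    the error by less than `min_gain`.
    """
    selected = []
    best_error = evaluate(selected)
    remaining = list(candidates)
    while remaining:
        trial_errors = {f: evaluate(selected + [f]) for f in remaining}
        best_feature = min(trial_errors, key=trial_errors.get)
        if best_error - trial_errors[best_feature] < min_gain:
            break  # stopping criterion: no candidate helps enough
        selected.append(best_feature)
        remaining.remove(best_feature)
        best_error = trial_errors[best_feature]
    return selected, best_error
```

Keeping the classifier behind `evaluate` mirrors the abstract's point that the choice of classifier is flexible: swapping Logistic Regression for Random Forests would change only that function.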
Impact of auditory selective attention on verbal short-term memory and vocabulary development.
Majerus, Steve; Heiligenstein, Lucie; Gautherot, Nathalie; Poncelet, Martine; Van der Linden, Martial
2009-05-01
This study investigated the role of auditory selective attention capacities as a possible mediator of the well-established association between verbal short-term memory (STM) and vocabulary development. A total of 47 6- and 7-year-olds were administered verbal immediate serial recall and auditory attention tasks. Both task types probed processing of item and serial order information because recent studies have shown this distinction to be critical when exploring relations between STM and lexical development. Multiple regression and variance partitioning analyses highlighted two variables as determinants of vocabulary development: (a) a serial order processing variable shared by STM order recall and a selective attention task for sequence information and (b) an attentional variable shared by selective attention measures targeting item or sequence information. The current study highlights the need for integrative STM models, accounting for conjoined influences of attentional capacities and serial order processing capacities on STM performance and the establishment of the lexical language network.
Zawbaa, Hossam M.; Szlȩk, Jakub; Grosan, Crina; Jachowicz, Renata; Mendyk, Aleksander
2016-01-01
Poly-lactide-co-glycolide (PLGA) is a copolymer of lactic and glycolic acid. Drug release from PLGA microspheres depends not only on polymer properties but also on drug type, particle size, morphology of microspheres, release conditions, etc. Selecting a subset of relevant properties for PLGA is a challenging machine learning task as there are over three hundred features to consider. In this work, we formulate the selection of critical attributes for PLGA as a multiobjective optimization problem with the aim of minimizing the error of predicting the dissolution profile while reducing the number of attributes selected. Four bio-inspired optimization algorithms: antlion optimization, binary version of antlion optimization, grey wolf optimization, and social spider optimization are used to select the optimal feature set for predicting the dissolution profile of PLGA. Besides these, LASSO algorithm is also used for comparisons. Selection of crucial variables is performed under the assumption that both predictability and model simplicity are of equal importance to the final result. During the feature selection process, a set of input variables is employed to find minimum generalization error across different predictive models and their settings/architectures. The methodology is evaluated using predictive modeling for which various tools are chosen, such as Cubist, random forests, artificial neural networks (monotonic MLP, deep learning MLP), multivariate adaptive regression splines, classification and regression tree, and hybrid systems of fuzzy logic and evolutionary computations (fugeR). The experimental results are compared with the results reported by Szlȩk. We obtain a normalized root mean square error (NRMSE) of 15.97% versus 15.4%, and the number of selected input features is smaller, nine versus eleven. PMID:27315205
Effects of parceling on model selection: Parcel-allocation variability in model ranking.
Sterba, Sonya K; Rights, Jason D
2017-03-01
Research interest often lies in comparing structural model specifications implying different relationships among latent factors. In this context parceling is commonly accepted, assuming the item-level measurement structure is well known and, conservatively, assuming items are unidimensional in the population. Under these assumptions, researchers compare competing structural models, each specified using the same parcel-level measurement model. However, little is known about consequences of parceling for model selection in this context, including whether and when model ranking could vary across alternative item-to-parcel allocations within-sample. This article first provides a theoretical framework that predicts the occurrence of parcel-allocation variability (PAV) in model selection index values and its consequences for PAV in ranking of competing structural models. These predictions are then investigated via simulation. We show that conditions known to manifest PAV in absolute fit of a single model may or may not manifest PAV in model ranking. Thus, one cannot assume that low PAV in absolute fit implies a lack of PAV in ranking, and vice versa. PAV in ranking is shown to occur under a variety of conditions, including large samples. To provide an empirically supported strategy for selecting a model when PAV in ranking exists, we draw on relationships between structural model rankings in parcel- versus item-solutions. This strategy employs the across-allocation modal ranking. We developed software tools for implementing this strategy in practice, and illustrate them with an example. Even if a researcher has substantive reason to prefer one particular allocation, investigating PAV in ranking within-sample still provides an informative sensitivity analysis. (PsycINFO Database Record (c) 2017 APA, all rights reserved).
DiMagno, Matthew J; Spaete, Joshua P; Ballard, Darren D; Wamsteker, Erik-Jan; Saini, Sameer D
2013-08-01
We investigated which variables were independently associated with protection against or development of post-endoscopic retrograde cholangiopancreatography (ERCP) pancreatitis (PEP) and with severity of PEP. Subsequently, we derived predictive risk models for PEP. In a case-control design, 6505 patients had 8264 ERCPs, 211 patients had PEP, and 22 patients had severe PEP. We randomly selected 348 non-PEP controls. We examined 7 established and 9 investigational variables. In univariate analysis, 7 variables predicted PEP: younger age, female sex, suspected sphincter of Oddi dysfunction (SOD), pancreatic sphincterotomy, moderate-difficult cannulation (MDC), pancreatic stent placement, and lower Charlson score. Protective variables were current smoking, former drinking, diabetes, and chronic liver disease (CLD, biliary/transplant complications). Multivariate analysis identified seven independent variables for PEP, three protective (current smoking, CLD-biliary, CLD-transplant/hepatectomy complications) and 4 predictive (younger age, suspected SOD, pancreatic sphincterotomy, MDC). Pre- and post-ERCP risk models of 7 variables have a C-statistic of 0.74. Removing age (seventh variable) did not significantly affect the predictive value (C-statistic of 0.73) and reduced model complexity. Severity of PEP was not associated with any variables by multivariate analysis. By using the newly identified protective variables with 3 predictive variables, we derived 2 risk models with a higher predictive value for PEP compared to prior studies.
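The C-statistic quoted in the abstract above is the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen control, with ties counted as one half. A minimal sketch of that computation follows; the function name and the toy scores in the note below are illustrative and are not data from the study.

```python
def c_statistic(scores, outcomes):
    """Concordance between predicted risks and binary outcomes (1 = case).

    Compares every case with every control: a pair counts 1 if the case
    has the higher score, 0.5 if the scores tie, and 0 otherwise.
    """
    cases = [s for s, o in zip(scores, outcomes) if o == 1]
    controls = [s for s, o in zip(scores, outcomes) if o == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    concordant = 0.0
    for case in cases:
        for control in controls:
            if case > control:
                concordant += 1.0
            elif case == control:
                concordant += 0.5
    return concordant / (len(cases) * len(controls))
```

A value of 0.5 corresponds to a model that ranks cases and controls no better than chance, and 1.0 to perfect separation, which gives some context for reported values such as 0.74.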
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ghazali, Amirul Syafiq Mohd; Ali, Zalila; Noor, Norlida Mohd
Multinomial logistic regression is widely used to model the outcomes of a polytomous response variable, a categorical dependent variable with more than two categories. The model assumes that the conditional mean of the dependent categorical variables is the logistic function of an affine combination of predictor variables. Its procedure gives a number of logistic regression models that make specific comparisons of the response categories. When there are q categories of the response variable, the model consists of q-1 logit equations which are fitted simultaneously. The model is validated by variable selection procedures, tests of regression coefficients, a significance test of the overall model, goodness-of-fit measures, and validation of predicted probabilities using odds ratio. This study used the multinomial logistic regression model to investigate obesity and overweight among primary school students in a rural area on the basis of their demographic profiles, lifestyles, and diet and food intake. The results indicated that obesity and overweight of students are related to gender, religion, sleep duration, time spent on electronic games, breakfast intake in a week, with whom meals are taken, protein intake, and also, the interaction between breakfast intake in a week with sleep duration, and the interaction between gender and protein intake.
NASA Astrophysics Data System (ADS)
Ghazali, Amirul Syafiq Mohd; Ali, Zalila; Noor, Norlida Mohd; Baharum, Adam
2015-10-01
Multinomial logistic regression is widely used to model the outcomes of a polytomous response variable, a categorical dependent variable with more than two categories. The model assumes that the conditional mean of the dependent categorical variables is the logistic function of an affine combination of predictor variables. Its procedure gives a number of logistic regression models that make specific comparisons of the response categories. When there are q categories of the response variable, the model consists of q-1 logit equations which are fitted simultaneously. The model is validated by variable selection procedures, tests of regression coefficients, a significance test of the overall model, goodness-of-fit measures, and validation of predicted probabilities using odds ratio. This study used the multinomial logistic regression model to investigate obesity and overweight among primary school students in a rural area on the basis of their demographic profiles, lifestyles, and diet and food intake. The results indicated that obesity and overweight of students are related to gender, religion, sleep duration, time spent on electronic games, breakfast intake in a week, with whom meals are taken, protein intake, and also, the interaction between breakfast intake in a week with sleep duration, and the interaction between gender and protein intake.
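The statement that q response categories lead to q-1 logit equations can be made concrete with a short sketch that turns those equations into category probabilities, using one category as the reference. The function name, the coefficient layout, and the convention of listing the reference category first are assumptions for illustration; no fitted values from the study are used.

```python
import math


def category_probabilities(x, coefficients):
    """Probabilities for q categories from q-1 logit equations.

    `coefficients` holds one (intercept, slopes) pair per non-reference
    category. The reference category has its logit fixed at zero and is
    returned first.
    """
    logits = [
        intercept + sum(b * xi for b, xi in zip(slopes, x))
        for intercept, slopes in coefficients
    ]
    denominator = 1.0 + sum(math.exp(value) for value in logits)
    reference = 1.0 / denominator
    return [reference] + [math.exp(value) / denominator for value in logits]
```

Each non-reference logit is the log odds of that category against the reference, which is why exponentiated coefficients are read as odds ratios.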
Impact of multicollinearity on small sample hydrologic regression models
NASA Astrophysics Data System (ADS)
Kroll, Charles N.; Song, Peter
2013-06-01
Often hydrologic regression models are developed with ordinary least squares (OLS) procedures. The use of OLS with highly correlated explanatory variables produces multicollinearity, which creates highly sensitive parameter estimators with inflated variances and improper model selection. It is not clear how best to address multicollinearity in hydrologic regression models. Here a Monte Carlo simulation is developed to compare four techniques to address multicollinearity: OLS, OLS with variance inflation factor screening (VIF), principal component regression (PCR), and partial least squares regression (PLS). The performance of these four techniques was observed for varying sample sizes, correlation coefficients between the explanatory variables, and model error variances consistent with hydrologic regional regression models. The negative effects of multicollinearity are magnified at smaller sample sizes, higher correlations between the variables, and larger model error variances (smaller R²). The Monte Carlo simulation indicates that if the true model is known, multicollinearity is present, and the estimation and statistical testing of regression parameters are of interest, then PCR or PLS should be employed. If the model is unknown, or if the interest is solely in model predictions, it is recommended that OLS be employed, since using more complicated techniques did not produce any improvement in model performance. A leave-one-out cross-validation case study was also performed using low-streamflow data sets from the eastern United States. Results indicate that OLS with stepwise selection generally produces models across study regions with varying levels of multicollinearity that are as good as biased regression techniques such as PCR and PLS.
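The variance inflation factor used for screening in the abstract above is VIF = 1 / (1 - R²), where R² comes from regressing one explanatory variable on the others. The sketch below covers only the special case of two explanatory variables, where that R² equals the squared Pearson correlation between them. The function names are hypothetical, and no screening cutoff is implied because the abstract does not state one.

```python
def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    covariance = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return covariance / (var_a * var_b) ** 0.5


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor model.

    With only two explanatory variables, the R^2 of regressing one on the
    other is r^2, so VIF = 1 / (1 - r^2).
    """
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)
```

Uncorrelated predictors give a VIF of 1, and the value grows without bound as the correlation approaches plus or minus one, which is the variance inflation the abstract refers to.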
The Assessment of Climatological Impacts on Agricultural Production and Residential Energy Demand
NASA Astrophysics Data System (ADS)
Cooter, Ellen Jean
The assessment of climatological impacts on selected economic activities is presented as a multi-step, interdisciplinary problem. The assessment process which is addressed explicitly in this report focuses on (1) user identification, (2) direct impact model selection, (3) methodological development, (4) product development and (5) product communication. Two user groups of major economic importance were selected for study: agriculture and gas utilities. The broad agricultural sector is further defined as U.S.A. corn production. The general category of utilities is narrowed to Oklahoma residential gas heating demand. The CERES physiological growth model was selected as the process model for corn production. The statistical analysis for corn production suggests that (1) although this is a statistically complex model, it can yield useful impact information, (2) as a result of output distributional biases, traditional statistical techniques are not adequate analytical tools, (3) the model yield distribution as a whole is probably non-Gaussian, particularly in the tails and (4) there appear to be identifiable weekly patterns of forecasted yields throughout the growing season. Agricultural quantities developed include point yield impact estimates and distributional characteristics, geographic corn weather distributions, return period estimates, decision making criteria (confidence limits) and time series of indices. These products were communicated in economic terms through the use of a Bayesian decision example and an econometric model. The NBSLD energy load model was selected to represent residential gas heating consumption. A cursory statistical analysis suggests relationships among weather variables across the Oklahoma study sites. No linear trend in "technology-free" modeled energy demand or input weather variables which would correspond to that contained in observed state-level residential energy use was detected.
It is suggested that this trend is largely the result of non-weather factors such as population and home usage patterns rather than regional climate change. Year-to-year changes in modeled residential heating demand on the order of 10^6 Btu per household were determined and later related to state-level components of the Oklahoma economy. Products developed include the definition of regional forecast areas, likelihood estimates of extreme seasonal conditions and an energy/climate index. This information is communicated in economic terms through an input/output model which is used to estimate changes in Gross State Product and Household income attributable to weather variability.
Chakraborty, Debojyoti; Wang, Tongli; Andre, Konrad; Konnert, Monika; Lexer, Manfred J; Matulla, Christoph; Schueler, Silvio
2015-01-01
Identifying populations within tree species potentially adapted to future climatic conditions is an important requirement for reforestation and assisted migration programmes. Such populations can be identified either by empirical response functions based on correlations of quantitative traits with climate variables or by climate envelope models that compare the climate of seed sources and potential growing areas. In the present study, we analyzed the intraspecific variation in climate growth response of Douglas-fir planted within the non-analogous climate conditions of Central and continental Europe. With data from 50 common garden trials, we developed Universal Response Functions (URF) for tree height and mean basal area and compared the growth performance of the selected best performing populations with that of populations identified through a climate envelope approach. Climate variables of the trial location were found to be stronger predictors of growth performance than climate variables of the population origin. Although the precipitation regime of the population sources varied strongly none of the precipitation related climate variables of population origin was found to be significant within the models. Overall, the URFs explained more than 88% of variation in growth performance. Populations identified by the URF models originate from western Cascades and coastal areas of Washington and Oregon and show significantly higher growth performance than populations identified by the climate envelope approach under both current and climate change scenarios. The URFs predict decreasing growth performance at low and middle elevations of the case study area, but increasing growth performance on high elevation sites. 
Our analysis suggests that population recommendations based on empirical approaches should be preferred, and population selections by climate envelope models without considering climatic constraints of growth performance should be carefully appraised before transferring populations to planting locations with novel or dissimilar climate.
Paradowska, Katarzyna; Jamróz, Marta Katarzyna; Kobyłka, Mariola; Gowin, Ewelina; Maczka, Paulina; Skibiński, Robert; Komsta, Łukasz
2012-01-01
This paper presents a preliminary study in building discriminant models from solid-state NMR spectrometry data to detect the presence of acetaminophen in over-the-counter pharmaceutical formulations. The dataset, containing 11 spectra of pure substances and 21 spectra of various formulations, was processed by partial least squares discriminant analysis (PLS-DA). The model found coped with the discrimination, and its quality parameters were acceptable. It was found that standard normal variate preprocessing had almost no influence on unsupervised investigation of the dataset. The influence of variable selection with the uninformative variable elimination by PLS method was studied, reducing the dataset from 7601 variables to around 300 informative variables, but not improving the model performance. The results showed the possibility to construct well-working PLS-DA models from such small datasets without a full experimental design.
[Multivariate Adaptive Regression Splines (MARS), an alternative for the analysis of time series].
Vanegas, Jairo; Vásquez, Fabián
Multivariate Adaptive Regression Splines (MARS) is a non-parametric modelling method that extends the linear model, incorporating nonlinearities and interactions between variables. It is a flexible tool that automates the construction of predictive models: selecting relevant variables, transforming the predictor variables, processing missing values and preventing overshooting using a self-test. It is also able to predict, taking into account structural factors that might influence the outcome variable, thereby generating hypothetical models. The end result could identify relevant cut-off points in data series. It is rarely used in health, so it is proposed as a tool for the evaluation of relevant public health indicators. For demonstrative purposes, data series regarding the mortality of children under 5 years of age in Costa Rica were used, comprising the period 1978-2008. Copyright © 2016 SESPAS. Publicado por Elsevier España, S.L.U. All rights reserved.
McFarland, Kent P.; Rimmer, Christopher C.; Goetz, James E.; Aubry, Yves; Wunderle, Joseph M.; Sutton, Anne; Townsend, Jason M.; Sosa, Alejandro Llanes; Kirkconnell, Arturo
2013-01-01
Conservation planning and implementation require identifying pertinent habitats and locations where protection and management may improve viability of targeted species. The winter range of Bicknell’s Thrush (Catharus bicknelli), a threatened Nearctic-Neotropical migratory songbird, is restricted to the Greater Antilles. We analyzed winter records from the mid-1970s to 2009 to quantitatively evaluate winter distribution and habitat selection. Additionally, we conducted targeted surveys in Jamaica (n = 433), Cuba (n = 363), Dominican Republic (n = 1,000), Haiti (n = 131) and Puerto Rico (n = 242) yielding 179 sites with thrush presence. We modeled Bicknell’s Thrush winter habitat selection and distribution in the Greater Antilles in Maxent version 3.3.1 using environmental predictors represented in 30 arc second study area rasters. These included nine landform, land cover and climatic variables that were thought a priori to have potentially high predictive power. We used the average training gain from ten model runs to select the best subset of predictors. Total winter precipitation, aspect and land cover, particularly broadleaf forests, emerged as important variables. A five-variable model that contained land cover, winter precipitation, aspect, slope, and elevation was the most parsimonious and not significantly different from the models with more variables. We used the best fitting model to depict potential winter habitat. Using the 10 percentile threshold (>0.25), we estimated winter habitat to cover 33,170 km², nearly 10% of the study area. The Dominican Republic contained half of all potential habitat (51%), followed by Cuba (15.1%), Jamaica (13.5%), Haiti (10.6%), and Puerto Rico (9.9%). Nearly one-third of the range was found to be in protected areas.
By providing the first detailed predictive map of Bicknell’s Thrush winter distribution, our study provides a useful tool to prioritize and direct conservation planning for this and other wet, broadleaf forest specialists in the Greater Antilles. PMID:23326554
DOT National Transportation Integrated Search
2016-06-01
The flashing yellow arrow (FYA) signal display creates an opportunity to enhance the left-turn phase with a variable mode that can be changed on demand. The previously developed decision support system (DSS) in phase I facilitated the selection o...
Simulating tracer transport in variably saturated soils and shallow groundwater
USDA-ARS's Scientific Manuscript database
The objective of this study was to develop a realistic model to simulate the complex processes of flow and tracer transport in variably saturated soils and to compare simulation results with the detailed monitoring observations. The USDA-ARS OPE3 field site was selected for the case study due to ava...
Regularization Paths for Conditional Logistic Regression: The clogitL1 Package.
Reid, Stephen; Tibshirani, Rob
2014-07-01
We apply the cyclic coordinate descent algorithm of Friedman, Hastie, and Tibshirani (2010) to the fitting of a conditional logistic regression model with lasso (ℓ1) and elastic net penalties. The sequential strong rules of Tibshirani, Bien, Hastie, Friedman, Taylor, Simon, and Tibshirani (2012) are also used in the algorithm and it is shown that these offer a considerable speed up over the standard coordinate descent algorithm with warm starts. Once implemented, the algorithm is used in simulation studies to compare the variable selection and prediction performance of the conditional logistic regression model against that of its unconditional (standard) counterpart. We find that the conditional model performs admirably on datasets drawn from a suitable conditional distribution, outperforming its unconditional counterpart at variable selection. The conditional model is also fit to a small real world dataset, demonstrating how we obtain regularization paths for the parameters of the model and how we apply cross validation for this method where natural unconditional prediction rules are hard to come by.
Regularization Paths for Conditional Logistic Regression: The clogitL1 Package
Reid, Stephen; Tibshirani, Rob
2014-01-01
We apply the cyclic coordinate descent algorithm of Friedman, Hastie, and Tibshirani (2010) to the fitting of a conditional logistic regression model with lasso (ℓ1) and elastic net penalties. The sequential strong rules of Tibshirani, Bien, Hastie, Friedman, Taylor, Simon, and Tibshirani (2012) are also used in the algorithm and it is shown that these offer a considerable speed up over the standard coordinate descent algorithm with warm starts. Once implemented, the algorithm is used in simulation studies to compare the variable selection and prediction performance of the conditional logistic regression model against that of its unconditional (standard) counterpart. We find that the conditional model performs admirably on datasets drawn from a suitable conditional distribution, outperforming its unconditional counterpart at variable selection. The conditional model is also fit to a small real world dataset, demonstrating how we obtain regularization paths for the parameters of the model and how we apply cross validation for this method where natural unconditional prediction rules are hard to come by. PMID:26257587
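The coordinate-wise update at the heart of cyclic coordinate descent can be sketched in a much simpler setting than the one in the abstract. The block below is a minimal illustration for an ordinary least-squares loss with a lasso penalty, not the conditional logistic likelihood and not the clogitL1 implementation; the function names, the toy design matrix and the penalty value are all assumptions made for the example.

```python
def soft_threshold(z, gamma):
    """Soft-thresholding operator used in lasso coordinate updates."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Minimise (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta because beta starts at zero
    for _ in range(n_sweeps):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            col_sq = sum(c * c for c in col) / n
            if col_sq == 0.0:
                continue  # an all-zero column carries no information
            # Covariance of column j with the partial residual (beta_j's own fit added back)
            rho = sum(c * r for c, r in zip(col, residual)) / n + col_sq * beta[j]
            new_beta = soft_threshold(rho, lam) / col_sq
            delta = new_beta - beta[j]
            if delta != 0.0:
                residual = [r - delta * c for r, c in zip(residual, col)]
                beta[j] = new_beta
    return beta


# Toy orthogonal design (hypothetical): the response depends on the first column only
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [2.0, -2.0, 2.0, -2.0]
beta = lasso_coordinate_descent(X, y, lam=0.5)
```

On this toy problem the penalty shrinks the first coefficient from 2 to 1.5 and leaves the irrelevant second coefficient at exactly zero, which is the variable-selection behaviour the abstract relies on.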
[Spatial differentiation and impact factors of Yutian Oasis's soil surface salt based on GWR model].
Yuan, Yu Yun; Wahap, Halik; Guan, Jing Yun; Lu, Long Hui; Zhang, Qin Qin
2016-10-01
In this paper, topsoil salinity data gathered from 24 sampling sites in the Yutian Oasis were used, and nine environmental variables closely related to soil salinity were selected as influencing factors. The spatial distribution characteristics of topsoil salinity and the spatial heterogeneity of the influencing factors were then analyzed by combining spatial autocorrelation with traditional regression analysis and a geographically weighted regression (GWR) model. Results showed that the topsoil salinity in Yutian Oasis was not randomly distributed but had strong spatial dependence, and the spatial autocorrelation index for topsoil salinity was 0.479. Groundwater salinity, groundwater depth, elevation and temperature were the main factors influencing topsoil salt accumulation in arid land oases, and they were spatially heterogeneous. All nine selected environmental variables except soil pH had significant influences on topsoil salinity with spatial disparity. The GWR model was superior to the OLS model in interpreting and estimating spatially non-stationary data, and also had a remarkable advantage in the visualization of modeling parameters.
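The abstract does not say which spatial autocorrelation index was used. As a sketch only, the block below assumes a global Moran's I, one common choice, computed from a user-supplied spatial weight matrix; the four-site chain and its values are invented for illustration and are not the Yutian Oasis data.

```python
def morans_i(values, weights):
    """Global Moran's I for site values and a square spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return (n / w_sum) * cross / sum(d * d for d in dev)


# Hypothetical example: four sites along a line, each neighbouring the next
salinity = [1.0, 2.0, 3.0, 4.0]
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
index = morans_i(salinity, adjacency)
```

Values near zero indicate a spatially random pattern, while positive values such as the one produced here indicate that neighbouring sites tend to have similar salinity.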
Joost, Stéphane; Kalbermatten, Michael; Bezault, Etienne; Seehausen, Ole
2012-01-01
When searching for loci possibly under selection in the genome, an alternative to population genetics theoretical models is to establish allele distribution models (ADM) for each locus to directly correlate allelic frequencies and environmental variables such as precipitation, temperature, or sun radiation. Such an approach, which runs multiple logistic regression models in parallel, was implemented within a computing program named MATSAM. Recently, this application was improved in order to support qualitative environmental predictors as well as to permit the identification of associations between genomic variation and individual phenotypes, allowing the detection of loci involved in the genetic architecture of polymorphic characters. Here, we present the corresponding methodological developments and compare the results produced by software implementing population genetics theoretical models (DFDIST and BAYESCAN) and ADM (MATSAM) in an empirical context to detect signatures of genomic divergence associated with speciation in Lake Victoria cichlid fishes.
Structure-activity relationships between sterols and their thermal stability in oil matrix.
Hu, Yinzhou; Xu, Junli; Huang, Weisu; Zhao, Yajing; Li, Maiquan; Wang, Mengmeng; Zheng, Lufei; Lu, Baiyi
2018-08-30
Structure-activity relationships between 20 sterols and their thermal stabilities were studied in a model oil system. All sterol degradations were found to be consistent with a first-order kinetic model, with coefficients of determination (R²) higher than 0.9444. The number of double bonds in the sterol structure was negatively correlated with the thermal stability of sterol, whereas the length of the branch chain was positively correlated with the thermal stability of sterol. A quantitative structure-activity relationship (QSAR) model to predict thermal stability of sterol was developed by using partial least squares regression (PLSR) combined with genetic algorithm (GA). A regression model was built with R² of 0.806. Almost all sterol degradation constants can be predicted accurately, with a cross-validation R² of 0.680. Four important variables were selected in the optimal QSAR model, and the selected variables were observed to be related to information indices, RDF descriptors, and 3D-MoRSE descriptors. Copyright © 2018 Elsevier Ltd. All rights reserved.
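A first-order kinetic model implies C(t) = C0·exp(−kt), so the degradation constant k can be estimated from the slope of ln C against time, and R² measures how well that straight line fits. The sketch below shows one way to do this; the function name, time points and concentrations are hypothetical and are not taken from the paper.

```python
import math


def fit_first_order(times, concentrations):
    """Fit ln C = ln C0 - k*t by least squares; return (k, C0, r_squared)."""
    logs = [math.log(c) for c in concentrations]
    n = len(times)
    t_mean = sum(times) / n
    y_mean = sum(logs) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, logs))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    ss_tot = sum((y - y_mean) ** 2 for y in logs)
    ss_res = sum((y - (intercept + slope * t)) ** 2 for t, y in zip(times, logs))
    r_squared = 1.0 - ss_res / ss_tot
    return -slope, math.exp(intercept), r_squared


# Hypothetical noise-free data generated with k = 0.1 per hour and C0 = 100
times = [0.0, 2.0, 4.0, 6.0, 8.0]
concentrations = [100.0 * math.exp(-0.1 * t) for t in times]
k, c0, r2 = fit_first_order(times, concentrations)
```

With real measurements the recovered R² would fall below 1, and a threshold such as the 0.9444 quoted in the abstract would indicate how closely each sterol follows first-order behaviour.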
2012-01-01
Background: An important question in the analysis of biochemical data is that of identifying subsets of molecular variables that may jointly influence a biological response. Statistical variable selection methods have been widely used for this purpose. In many settings, it may be important to incorporate ancillary biological information concerning the variables of interest. Pathway and network maps are one example of a source of such information. However, although ancillary information is increasingly available, it is not always clear how it should be used nor how it should be weighted in relation to primary data. Results: We put forward an approach in which biological knowledge is incorporated using informative prior distributions over variable subsets, with prior information selected and weighted in an automated, objective manner using an empirical Bayes formulation. We employ continuous, linear models with interaction terms and exploit biochemically-motivated sparsity constraints to permit exact inference. We show an example of priors for pathway- and network-based information and illustrate our proposed method on both synthetic response data and by an application to cancer drug response data. Comparisons are also made to alternative Bayesian and frequentist penalised-likelihood methods for incorporating network-based information. Conclusions: The empirical Bayes method proposed here can aid prior elicitation for Bayesian variable selection studies and help to guard against mis-specification of priors. Empirical Bayes, together with the proposed pathway-based priors, results in an approach with a competitive variable selection performance. In addition, the overall procedure is fast, deterministic, and has very few user-set parameters, yet is capable of capturing interplay between molecular players. The approach presented is general and readily applicable in any setting with multiple sources of biological prior knowledge. PMID:22578440
NASA Astrophysics Data System (ADS)
La Cour, Brian R.
2017-07-01
An experiment has recently been performed to demonstrate quantum nonlocality by establishing contextuality in one of a pair of photons encoding four qubits; however, low detection efficiencies and use of the fair-sampling hypothesis leave these results open to possible criticism due to the detection loophole. In this Letter, a physically motivated local hidden-variable model is considered as a possible mechanism for explaining the experimentally observed results. The model, though not intrinsically contextual, acquires this quality upon post-selection of coincident detections.
On using surface-source downhole-receiver logging to determine seismic slownesses
Boore, D.M.; Thompson, E.M.
2007-01-01
We present a method to solve for slowness models from surface-source downhole-receiver seismic travel-times. The method estimates the slownesses in a single inversion of the travel-times from all receiver depths and accounts for refractions at layer boundaries. The number and location of layer interfaces in the model can be selected based on lithologic changes or linear trends in the travel-time data. The interfaces based on linear trends in the data can be picked manually or by an automated algorithm. We illustrate the method with example sites for which geologic descriptions of the subsurface materials and independent slowness measurements are available. At each site we present slowness models that result from different interpretations of the data. The examples were carefully selected to address the reliability of interface-selection and the ability of the inversion to identify thin layers, large slowness contrasts, and slowness gradients. Additionally, we compare the models in terms of ground-motion amplification. These plots illustrate the sensitivity of site amplifications to the uncertainties in the slowness model. We show that one-dimensional site amplifications are insensitive to thin layers in the slowness models; although slowness is variable over short ranges of depth, this variability has little effect on ground-motion amplification at frequencies up to 5 Hz.
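The forward relation that such an inversion fits can be illustrated in its simplest form. The sketch below assumes a vertical ray (zero source offset), so it deliberately omits the refraction at layer boundaries that the published method accounts for; the function name and the two-layer model are hypothetical.

```python
def vertical_travel_time(thicknesses, slownesses, receiver_depth):
    """Travel time (s) from the surface to receiver_depth along a vertical ray.

    thicknesses are layer thicknesses in metres (top down) and slownesses are
    the matching layer slownesses in s/m. A receiver below the last layer is
    treated as sitting at the base of the model.
    """
    time, top = 0.0, 0.0
    for h, s in zip(thicknesses, slownesses):
        if receiver_depth <= top:
            break
        path = min(h, receiver_depth - top)  # part of this layer above the receiver
        time += s * path
        top += h
    return time


# Hypothetical model: 10 m of slow soil (200 m/s) over 20 m of stiffer material (500 m/s)
thicknesses = [10.0, 20.0]
slownesses = [0.005, 0.002]
t_20m = vertical_travel_time(thicknesses, slownesses, receiver_depth=20.0)
```

Because travel time is linear in the layer slownesses under this assumption, travel times at many receiver depths form a linear system, which is why constant-slowness layers show up as linear trends in the travel-time data.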
The Evaluation and Selection of Adequate Causal Models: A Compensatory Education Example.
ERIC Educational Resources Information Center
Tanaka, Jeffrey S.
1982-01-01
Implications of model evaluation (using traditional chi square goodness of fit statistics, incremental fit indices for covariance structure models, and latent variable coefficients of determination) on substantive conclusions are illustrated with an example examining the effects of participation in a compensatory education program on posttreatment…
Fogarty, Dillon T; Elmore, R Dwayne; Fuhlendorf, Samuel D; Loss, Scott R
2017-08-01
Habitat selection by animals is influenced by and mitigates the effects of predation and environmental extremes. For birds, nest site selection is crucial to offspring production because nests are exposed to extreme weather and predation pressure. Predators that forage using olfaction often dominate nest predator communities; therefore, factors that influence olfactory detection (e.g., airflow and weather variables, including turbulence and moisture) should influence nest site selection and survival. However, few studies have assessed the importance of olfactory cover for habitat selection and survival. We assessed whether ground-nesting birds select nest sites based on visual and/or olfactory cover. Additionally, we assessed the importance of visual cover and airflow and weather variables associated with olfactory cover in influencing nest survival. In managed grasslands in Oklahoma, USA, we monitored nests of Northern Bobwhite (Colinus virginianus), Eastern Meadowlark (Sturnella magna), and Grasshopper Sparrow (Ammodramus savannarum) during 2015 and 2016. To assess nest site selection, we compared cover variables between nests and random points. To assess factors influencing nest survival, we used visual cover and olfactory-related measurements (i.e., airflow and weather variables) to model daily nest survival. For nest site selection, nest sites had greater overhead visual cover than random points, but no other significant differences were found. Weather variables hypothesized to influence olfactory detection, specifically precipitation and relative humidity, were the best predictors of and were positively related to daily nest survival. Selection for overhead cover likely contributed to mitigation of thermal extremes and possibly reduced detectability of nests.
For daily nest survival, we hypothesize that major nest predators focused on prey other than the monitored species' nests during high moisture conditions, thus increasing nest survival on these days. Our study highlights how mechanistic approaches to studying cover inform which dimensions are perceived and selected by animals and which dimensions confer fitness-related benefits.
NASA Astrophysics Data System (ADS)
Hu, Haixin
This dissertation consists of two parts. The first part studies the sample selection and spatial models of housing price index using transaction data on detached single-family houses of two California metropolitan areas from 1990 through 2008. House prices are often spatially correlated due to shared amenities, or when the properties are viewed as close substitutes in a housing submarket. There have been many studies that address spatial correlation in the context of housing markets. However, to the best of my knowledge, none has used spatial models to construct housing price indexes at zip code level for the entire time period analyzed in this dissertation. In this paper, I study a first-order autoregressive spatial model with four different weighting matrix schemes. Four sets of housing price indexes are constructed accordingly. Gatzlaff and Haurin (1997, 1998) study the sample selection problem in housing index by using Heckman's two-step method. This method, however, is generally inefficient and can cause a multicollinearity problem. Also, it requires data on unsold houses in order to carry out the first-step probit regression. The maximum likelihood (ML) method can be used to estimate a truncated incidental model which allows one to correct for sample selection based on transaction data only. However, convergence problems are very prevalent in practice. In this paper I adopt Lewbel's (2007) sample selection correction method which does not require one to model or estimate the selection model, except for some very general assumptions. I then extend this method to correct for spatial correlation. In the second part, I analyze the U.S. gasoline market with a disequilibrium model that allows lagged-latent variables, endogenous prices, and panel data with fixed effects. Most existing studies (see the survey of Espey, 1998, Energy Economics) of the gasoline market assume equilibrium. In practice, however, prices do not always adjust fast enough to clear the market.
Equilibrium assumptions greatly simplify statistical inference, but are very restrictive and can produce conflicting estimates. For example, econometric models of markets that assume equilibrium often produce more elastic demand price elasticity than their disequilibrium counterparts (Holt and Johnson, 1989, Review of Economics and Statistics, Oczkowski, 1998, Economics Letters). The few studies that allow disequilibrium, however, have been limited to macroeconomic time-series data without lagged-latent variables. While time series data allows one to investigate national trends, it cannot be used to identify and analyze regional differences and the role of local markets. Exclusion of the lagged-latent variables is also undesirable because such variables capture adjustment costs and inter-temporal spillovers. Simulation methods offer tractable solutions to dynamic and panel data disequilibrium models (Lee, 1997, Journal of Econometrics), but assume normally distributed errors. This paper compares estimates of price/income elasticity and excess supply/demand across time periods, regions, and model specifications, using both equilibrium and disequilibrium methods. In the equilibrium model, I compare the within group estimator with Anderson and Hsiao's first-difference 2SLS estimator. In the disequilibrium model, I extend Amemiya's 2SLS by using Newey's efficient estimator with optimal instruments.
DOE Office of Scientific and Technical Information (OSTI.GOV)
LaFarge, R.A.
1990-05-01
MCPRAM (Monte Carlo PReprocessor for AMEER), a computer program that uses Monte Carlo techniques to create an input file for the AMEER trajectory code, has been developed for the Sandia National Laboratories VAX and Cray computers. Users can select the number of trajectories to compute, which AMEER variables to investigate, and the type of probability distribution for each variable. Any legal AMEER input variable can be investigated anywhere in the input run stream with either a normal, uniform, or Rayleigh distribution. Users also have the option to use covariance matrices for the investigation of certain correlated variables such as booster pre-reentry errors and wind, axial force, and atmospheric models. In conjunction with MCPRAM, AMEER was modified to include the variables introduced by the covariance matrices and to include provisions for six types of fuze models. The new fuze models and the new AMEER variables are described in this report.
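The sampling step described above, drawing each investigated variable from a normal, uniform, or Rayleigh distribution for a chosen number of trajectories, can be sketched as follows. This is an illustration of the general technique only, not the MCPRAM code or the AMEER input format; the variable names, parameter values and function names are hypothetical, and correlated sampling through covariance matrices is left out.

```python
import math
import random


def draw(rng, spec):
    """Draw one value for a variable described by a (distribution, parameters) pair."""
    kind, params = spec
    if kind == "normal":
        mean, std = params
        return rng.gauss(mean, std)
    if kind == "uniform":
        low, high = params
        return rng.uniform(low, high)
    if kind == "rayleigh":
        (sigma,) = params
        # Inverse-CDF sampling; 1 - U lies in (0, 1], so the logarithm is defined
        return sigma * math.sqrt(-2.0 * math.log(1.0 - rng.random()))
    raise ValueError(f"unknown distribution: {kind}")


def make_trajectory_inputs(specs, n_trajectories, seed=0):
    """Return one dict of sampled input values per trajectory."""
    rng = random.Random(seed)
    return [{name: draw(rng, spec) for name, spec in specs.items()}
            for _ in range(n_trajectories)]


# Hypothetical variable names and parameters, chosen only for illustration
specs = {
    "launch_angle_deg": ("normal", (45.0, 0.5)),
    "drag_scale": ("uniform", (0.95, 1.05)),
    "wind_speed_mps": ("rayleigh", (3.0,)),
}
runs = make_trajectory_inputs(specs, n_trajectories=1000)
```

Seeding the generator makes a batch of sampled inputs reproducible, which is useful when individual trajectories need to be rerun.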
Multi-omics facilitated variable selection in Cox-regression model for cancer prognosis prediction.
Liu, Cong; Wang, Xujun; Genchev, Georgi Z; Lu, Hui
2017-07-15
New developments in high-throughput genomic technologies have enabled the measurement of diverse types of omics biomarkers in a cost-efficient and clinically-feasible manner. Developing computational methods and tools for analysis and translation of such genomic data into clinically-relevant information is an ongoing and active area of investigation. For example, several studies have utilized an unsupervised learning framework to cluster patients by integrating omics data. Despite such recent advances, predicting cancer prognosis using integrated omics biomarkers remains a challenge. There is also a shortage of computational tools for predicting cancer prognosis by using supervised learning methods. The current standard approach is to fit a Cox regression model by concatenating the different types of omics data in a linear manner, while penalty could be added for feature selection. A more powerful approach, however, would be to incorporate data by considering relationships among omics datatypes. Here we developed two methods: a SKI-Cox method and a wLASSO-Cox method to incorporate the association among different types of omics data. Both methods fit the Cox proportional hazards model and predict a risk score based on mRNA expression profiles. SKI-Cox borrows the information generated by these additional types of omics data to guide variable selection, while wLASSO-Cox incorporates this information as a penalty factor during model fitting. We show that SKI-Cox and wLASSO-Cox models select more true variables than a LASSO-Cox model in simulation studies. We assess the performance of SKI-Cox and wLASSO-Cox using TCGA glioblastoma multiforme and lung adenocarcinoma data. In each case, mRNA expression, methylation, and copy number variation data are integrated to predict the overall survival time of cancer patients. Our methods achieve better performance in predicting patients' survival in glioblastoma and lung adenocarcinoma. Copyright © 2017. Published by Elsevier Inc.
Selection of Optimal Auxiliary Soil Nutrient Variables for Cokriging Interpolation
Song, Genxin; Zhang, Jing; Wang, Ke
2014-01-01
In order to explore the selection of the best auxiliary variables (BAVs) when using the Cokriging method for soil attribute interpolation, this paper investigated the selection of BAVs from terrain parameters, soil trace elements, and soil nutrient attributes when applying Cokriging interpolation to soil nutrients (organic matter, total N, available P, and available K). In total, 670 soil samples were collected in Fuyang, and the nutrient and trace element attributes of the soil samples were determined. Based on the spatial autocorrelation of soil attributes, the Digital Elevation Model (DEM) data for Fuyang was combined to explore the coordinate relationship among terrain parameters, trace elements, and soil nutrient attributes. Variables with a high correlation to soil nutrient attributes were selected as BAVs for Cokriging interpolation of soil nutrients, and variables with poor correlation were selected as poor auxiliary variables (PAVs). The results of Cokriging interpolations using BAVs and PAVs were then compared. The results indicated that Cokriging interpolation with BAVs yielded more accurate results than Cokriging interpolation with PAVs (the mean absolute error of BAV interpolation results for organic matter, total N, available P, and available K were 0.020, 0.002, 7.616, and 12.4702, respectively, and the mean absolute error of PAV interpolation results were 0.052, 0.037, 15.619, and 0.037, respectively). The results indicated that Cokriging interpolation with BAVs can significantly improve the accuracy of Cokriging interpolation for soil nutrient attributes. This study provides meaningful guidance and reference for the selection of auxiliary parameters for the application of Cokriging interpolation to soil nutrient attributes. PMID:24927129
NASA Technical Reports Server (NTRS)
Sidik, S. M.
1972-01-01
A sequential adaptive experimental design procedure for a related problem is studied. It is assumed that a finite set of potential linear models relating certain controlled variables to an observed variable is postulated, and that exactly one of these models is correct. The problem is to sequentially design most informative experiments so that the correct model equation can be determined with as little experimentation as possible. Discussion includes: structure of the linear models; prerequisite distribution theory; entropy functions and the Kullback-Leibler information function; the sequential decision procedure; and computer simulation results. An example of application is given.
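The quantities named in the abstract can be illustrated for the discrete case. The sketch below shows the Shannon entropy of the current model probabilities, the Kullback-Leibler information between two distributions, and a Bayes update of the probability that each candidate model is the correct one. It is a generic illustration under assumed inputs, not the report's decision procedure, and the likelihood values are invented.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def kl_divergence(p, q):
    """Kullback-Leibler information sum_i p_i * ln(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def update_model_probabilities(prior, likelihoods):
    """Bayes update of the probability that each candidate model is correct."""
    weighted = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(weighted)
    return [w / total for w in weighted]


# Two candidate models, equally likely before the experiment (hypothetical numbers)
prior = [0.5, 0.5]
likelihoods = [0.8, 0.2]  # probability of the observed outcome under each model
posterior = update_model_probabilities(prior, likelihoods)
information_gain = entropy(prior) - entropy(posterior)
```

An informative experiment is one whose expected reduction in entropy is large, so a sequential design would compare candidate experiments on that basis before running the next one.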
Climate variability slows evolutionary responses of Colias butterflies to recent climate change.
Kingsolver, Joel G; Buckley, Lauren B
2015-03-07
How do recent climate warming and climate variability alter fitness, phenotypic selection and evolution in natural populations? We combine biophysical, demographic and evolutionary models with recent climate data to address this question for the subalpine and alpine butterfly, Colias meadii, in the southern Rocky Mountains. We focus on predicting patterns of selection and evolution for a key thermoregulatory trait, melanin (solar absorptivity) on the posterior ventral hindwings, which affects patterns of body temperature, flight activity, adult and egg survival, and reproductive success in Colias. Both mean annual summer temperatures and thermal variability within summers have increased during the past 60 years at subalpine and alpine sites. At the subalpine site, predicted directional selection on wing absorptivity has shifted from generally positive (favouring increased wing melanin) to generally negative during the past 60 years, but there is substantial variation among years in the predicted magnitude and direction of selection and the optimal absorptivity. The predicted magnitude of directional selection at the alpine site declined during the past 60 years and varies substantially among years, but selection has generally been positive at this site. Predicted evolutionary responses to mean climate warming at the subalpine site since 1980 are small, because of the variability in selection and asymmetry of the fitness function. At both sites, the predicted effects of adaptive evolution on mean population fitness are much smaller than the fluctuations in mean fitness due to climate variability among years. Our analyses suggest that variation in climate within and among years may strongly limit evolutionary responses of ectotherms to mean climate warming in these habitats. © 2015 The Author(s) Published by the Royal Society. All rights reserved.
Rocha, R R A; Thomaz, S M; Carvalho, P; Gomes, L C
2009-06-01
The need for prediction is widely recognized in limnology. In this study, data from 25 lakes of the Upper Paraná River floodplain were used to build models to predict chlorophyll-a and dissolved oxygen concentrations. Akaike's information criterion (AIC) was used as a criterion for model selection. Models were validated with independent data obtained in the same lakes in 2001. Predictor variables that significantly explained chlorophyll-a concentration were pH, electrical conductivity, total seston (positive correlation) and nitrate (negative correlation). This model explained 52% of chlorophyll variability. Variables that significantly explained dissolved oxygen concentration were pH, lake area and nitrate (all positive correlations); water temperature and electrical conductivity were negatively correlated with oxygen. This model explained 54% of oxygen variability. Validation with independent data showed that both models had the potential to predict algal biomass and dissolved oxygen concentration in these lakes. These findings suggest that multiple regression models are valuable and practical tools for understanding the dynamics of ecosystems and that predictive limnology may still be considered a powerful approach in aquatic ecology.
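As a rough illustration of AIC-based model selection of the kind this abstract describes, the sketch below compares an intercept-only model with a one-predictor least-squares model. The data values and variable names (`ph`, `chl`) are invented for illustration and are not taken from the study; the AIC formula used is the usual Gaussian least-squares form, up to an additive constant.

```python
import math

def fit_simple_ols(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b, residual sum of squares)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return a, b, rss

def aic_least_squares(rss, n, k):
    """AIC for a Gaussian least-squares model, up to an additive constant.

    k counts the estimated regression coefficients plus the error variance.
    """
    return n * math.log(rss / n) + 2 * k

# Hypothetical lake data (illustrative only): pH and chlorophyll-a.
ph = [6.1, 6.4, 6.8, 7.0, 7.3, 7.9, 8.2, 8.6]
chl = [2.0, 2.9, 3.1, 4.2, 4.4, 5.9, 6.1, 7.2]
n = len(chl)

# Candidate 1: intercept-only model (mean of the response).
mean_chl = sum(chl) / n
rss_null = sum((c - mean_chl) ** 2 for c in chl)
aic_null = aic_least_squares(rss_null, n, k=2)

# Candidate 2: chlorophyll-a regressed on pH.
_, _, rss_ph = fit_simple_ols(ph, chl)
aic_ph = aic_least_squares(rss_ph, n, k=3)

# The candidate with the lower AIC is preferred.
best = "pH model" if aic_ph < aic_null else "intercept-only model"
print(best, round(aic_null, 2), round(aic_ph, 2))
```

With several candidate predictors, the same comparison would be repeated over each candidate subset, keeping the subset with the lowest AIC.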
Some Implications of a Behavioral Analysis of Verbal Behavior for Logic and Mathematics
2013-01-01
The evident power and utility of the formal models of logic and mathematics pose a puzzle: Although such models are instances of verbal behavior, they are also essentialistic. But behavioral terms, and indeed all products of selection contingencies, are intrinsically variable and in this respect appear to be incommensurate with essentialism. A distinctive feature of verbal contingencies resolves this puzzle: The control of behavior by the nonverbal environment is often mediated by the verbal behavior of others, and behavior under control of verbal stimuli is blind to the intrinsic variability of the stimulating environment. Thus, words and sentences serve as filters of variability and thereby facilitate essentialistic model building and the formal structures of logic, mathematics, and science. Autoclitic frames, verbal chains interrupted by interchangeable variable terms, are ubiquitous in verbal behavior. Variable terms can be substituted in such frames almost without limit, a feature fundamental to formal models. Consequently, our fluency with autoclitic frames fosters generalization to formal models, which in turn permit deduction and other kinds of logical and mathematical inference. PMID:28018038
Chouchane, Hatem; Krol, Maarten S; Hoekstra, Arjen Y
2018-02-01
Growing water demands put increasing pressure on local water resources, especially in water-short countries. Virtual water trade can play a key role in filling the gap between local demand and supply of water-intensive commodities. This study aims to analyse the dynamics in virtual water trade of Tunisia in relation to environmental and socio-economic factors such as GDP, irrigated land, precipitation, population and water scarcity. The water footprint of crop production is estimated using AquaCrop for six crops over the period 1981-2010. Net virtual water import (NVWI) is quantified on a yearly basis. Regression models are used to investigate dynamics in NVWI in relation to the selected factors. The results show that NVWI during the study period for the selected crops is not influenced by blue water scarcity. NVWI correlates in two alternative models to either population and precipitation (model I) or to GDP and irrigated area (model II). The models are better in explaining NVWI of staple crops (wheat, barley, potatoes) than NVWI of cash crops (dates, olives, tomatoes). Using model I, we are able to explain both trends and inter-annual variability for rain-fed crops. Model II performs better for irrigated crops and is able to explain trends significantly; no significant relation is found, however, with variables hypothesized to represent inter-annual variability. Copyright © 2017 Elsevier B.V. All rights reserved.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zhao, Kaiguang; Valle, Denis; Popescu, Sorin
2013-05-15
Model specification remains challenging in spectroscopy of plant biochemistry, as exemplified by the availability of various spectral indices or band combinations for estimating the same biochemical. This lack of consensus in model choice across applications argues for a paradigm shift in hyperspectral methods to address model uncertainty and misspecification. We demonstrated one such method using Bayesian model averaging (BMA), which performs variable/band selection and quantifies the relative merits of many candidate models to synthesize a weighted average model with improved predictive performances. The utility of BMA was examined using a portfolio of 27 foliage spectral–chemical datasets representing over 80 species across the globe to estimate multiple biochemical properties, including nitrogen, hydrogen, carbon, cellulose, lignin, chlorophyll (a or b), carotenoid, polar and nonpolar extractives, leaf mass per area, and equivalent water thickness. We also compared BMA with partial least squares (PLS) and stepwise multiple regression (SMR). Results showed that all the biochemicals except carotenoid were accurately estimated from hyperspectral data with R2 values > 0.80.
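The idea of a weighted average over candidate models can be sketched as follows. This is a minimal illustration that assumes equal prior model probabilities and uses the common BIC approximation to posterior model weights; the abstract does not say how the authors computed their weights, and the BIC values and predictions below are invented.

```python
import math

def bma_weights(bic_values):
    """Approximate posterior model weights from BIC values.

    Assumes equal prior probability for every candidate model; weights are
    proportional to exp(-0.5 * (BIC_i - BIC_min)).
    """
    best = min(bic_values)
    raw = [math.exp(-0.5 * (b - best)) for b in bic_values]
    total = sum(raw)
    return [r / total for r in raw]

def bma_predict(predictions, weights):
    """Model-averaged prediction: weighted sum of per-model predictions."""
    return sum(w * p for w, p in zip(weights, predictions))

# Hypothetical candidate models (e.g. different band combinations).
bics = [100.0, 102.0, 110.0]   # illustrative BIC of each candidate
preds = [2.10, 2.40, 3.00]     # illustrative predictions for one leaf sample

weights = bma_weights(bics)
avg = bma_predict(preds, weights)
print([round(w, 3) for w in weights], round(avg, 3))
```

Candidates with much higher BIC receive weights close to zero, so the averaged model is dominated by the few best-supported band combinations.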
Asymmetric patch size distribution leads to disruptive selection on dispersal.
Massol, François; Duputié, Anne; David, Patrice; Jarne, Philippe
2011-02-01
Numerous models have been designed to understand how dispersal ability evolves when organisms live in a fragmented landscape. Most of them predict a single dispersal rate at evolutionary equilibrium, and when diversification of dispersal rates has been predicted, it occurs as a response to perturbation or environmental fluctuation regimes. Yet abundant variation in dispersal ability is observed in natural populations and communities, even in relatively stable environments. We show that this diversification can operate in a simple island model without temporal variability: disruptive selection on dispersal occurs when the environment consists of many small and few large patches, a common feature in natural spatial systems. This heterogeneity in patch size results in a high variability in the number of related patch mates by individual, which, in turn, triggers disruptive selection through a high per capita variance of inclusive fitness. Our study provides a likely, parsimonious and testable explanation for the diversity of dispersal rates encountered in nature. It also suggests that biological conservation policies aiming at preserving ecological communities should strive to keep the distribution of patch size sufficiently asymmetric and variable. © 2010 The Author(s). Evolution© 2010 The Society for the Study of Evolution.
Improving permafrost distribution modelling using feature selection algorithms
NASA Astrophysics Data System (ADS)
Deluigi, Nicola; Lambiel, Christophe; Kanevski, Mikhail
2016-04-01
The availability of an increasing number of spatial data on the occurrence of mountain permafrost allows the employment of machine learning (ML) classification algorithms for modelling the distribution of the phenomenon. One of the major problems when dealing with high-dimensional datasets is the number of input features (variables) involved. Application of ML classification algorithms to this large number of variables leads to the risk of overfitting, with the consequence of a poor generalization/prediction. For this reason, applying feature selection (FS) techniques helps reduce the number of factors required and improves knowledge of the adopted features and their relation with the studied phenomenon. Moreover, taking away irrelevant or redundant variables from the dataset effectively improves the quality of the ML prediction. This research deals with a comparative analysis of permafrost distribution models supported by FS variable importance assessment. The input dataset (dimension = 20-25, 10 m spatial resolution) was constructed using landcover maps, climate data and DEM derived variables (altitude, aspect, slope, terrain curvature, solar radiation, etc.). It was completed with permafrost evidence (geophysical and thermal data and rock glacier inventories) that serves as training permafrost data. The FS algorithms used identified variables that appeared less statistically important for permafrost presence/absence. Three different algorithms were compared: Information Gain (IG), Correlation-based Feature Selection (CFS) and Random Forest (RF). IG is a filter technique that evaluates the worth of a predictor by measuring the information gain with respect to the permafrost presence/absence. Conversely, CFS is a wrapper technique that evaluates the worth of a subset of predictors by considering the individual predictive ability of each variable along with the degree of redundancy between them. 
Finally, RF is a ML algorithm that performs FS as part of its overall operation. It operates by constructing a large collection of decorrelated classification trees, and then predicts the permafrost occurrence through a majority vote. With the so-called out-of-bag (OOB) error estimate, the classification of permafrost data can be validated and the contribution of each predictor assessed. The performance of the compared permafrost distribution models (computed on independent testing sets) increased when FS algorithms were applied to the original dataset and irrelevant or redundant variables were removed. As a consequence, the process provided faster and more cost-effective predictors and a better understanding of the underlying structures residing in permafrost data. Our work demonstrates the usefulness of a feature selection step prior to applying a machine learning algorithm. In fact, permafrost predictors could be ranked not only based on their heuristic and subjective importance (expert knowledge), but also based on their statistical relevance in relation to the permafrost distribution.
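The information gain filter mentioned in this abstract can be illustrated with a small sketch. It assumes categorical (here binary) predictors; continuous variables such as altitude would first need to be discretized. The feature names and values are hypothetical and are not the study's data.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Reduction in label entropy obtained by splitting on a categorical feature."""
    n = len(labels)
    groups = {}
    for f, y in zip(feature, labels):
        groups.setdefault(f, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Hypothetical presence/absence labels and binary predictors.
permafrost = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "high_altitude": [1, 1, 1, 1, 0, 0, 0, 0],   # perfectly informative
    "north_aspect": [1, 1, 1, 0, 1, 0, 0, 0],    # partly informative
    "landcover_flag": [1, 0, 1, 0, 1, 0, 1, 0],  # uninformative
}

# Rank predictors from most to least informative.
gains = {name: information_gain(col, permafrost) for name, col in features.items()}
ranking = sorted(gains, key=gains.get, reverse=True)
print(ranking)
```

A filter of this kind scores each predictor on its own, which is why it cannot detect redundancy between predictors; subset-based methods are needed for that.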
Giovenzana, Valentina; Beghi, Roberto; Parisi, Simone; Brancadoro, Lucio; Guidetti, Riccardo
2018-03-01
Increasing attention is being paid to non-destructive methods for water status real time monitoring as a potential solution to replace the tedious conventional techniques which are time consuming and not easy to perform directly in the field. The objective of this study was to test the potential effectiveness of two portable optical devices (visible/near infrared (vis/NIR) and near infrared (NIR) spectrophotometers) for the rapid and non-destructive evaluation of the water status of grapevine leaves. Moreover, a variable selection methodology was proposed to determine a set of candidate variables for the prediction of water potential (Ψ, MPa) related to leaf water status in view of a simplified optical device. The statistics of the partial least square (PLS) models showed validation R2 values between 0.67 and 0.77 for models arising from vis/NIR spectra, and between 0.77 and 0.85 for the NIR region. The overall performance of the multiple linear regression (MLR) models from selected wavelengths was slightly worse than that of the PLS models. Regarding the NIR range, acceptable MLR models were obtained using only 14 effective variables (R2 range 0.63-0.69). To address the market demand for portable optical devices and heading towards the trend of miniaturization and low cost of the devices, individual wavelengths could be useful for the design of a simplified and low-cost handheld system providing useful information for better irrigation scheduling. © 2017 Society of Chemical Industry.
Using the NANA toolkit at home to predict older adults' future depression.
Andrews, J A; Harrison, R F; Brown, L J E; MacLean, L M; Hwang, F; Smith, T; Williams, E A; Timon, C; Adlam, T; Khadra, H; Astell, A J
2017-04-15
Depression is currently underdiagnosed among older adults. As part of the Novel Assessment of Nutrition and Aging (NANA) validation study, 40 older adults self-reported their mood using a touchscreen computer over three, one-week periods. Here, we demonstrate the potential of these data to predict future depression status. We analysed data from the NANA validation study using a machine learning approach. We applied the least absolute shrinkage and selection operator with a logistic model to averages of six measures of mood, with depression status according to the Geriatric Depression Scale 10 weeks later as the outcome variable. We tested multiple values of the selection parameter in order to produce a model with low deviance. We used a cross-validation framework to avoid overspecialisation, and receiver operating characteristic (ROC) curve analysis to determine the quality of the fitted model. The model we report contained coefficients for two variables: sadness and tiredness, as well as a constant. The cross-validated area under the ROC curve for this model was 0.88 (CI: 0.69-0.97). While results are based on a small sample, the methodology for the selection of variables appears suitable for the problem at hand, suggesting promise for a wider study and ultimate deployment with older adults at increased risk of depression. We have identified self-reported scales of sadness and tiredness as sensitive measures which have the potential to predict future depression status in older adults, partially addressing the problem of underdiagnosis. Copyright © 2017 The Authors. Published by Elsevier B.V. All rights reserved.
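To make the evaluation step concrete, the sketch below scores a two-predictor logistic model and computes the area under the ROC curve as the probability that a randomly chosen positive case is ranked above a randomly chosen negative one. The coefficients and participant data are invented for illustration; they are not the fitted values from the study, and no cross-validation is performed here.

```python
import math

def predicted_risk(sadness, tiredness, b0=-3.0, b_sad=0.8, b_tired=0.5):
    """Logistic model with a constant and two predictors.

    The coefficient values are hypothetical placeholders, not study results.
    """
    z = b0 + b_sad * sadness + b_tired * tiredness
    return 1.0 / (1.0 + math.exp(-z))

def roc_auc(scores, labels):
    """AUC via pairwise comparison of positives and negatives (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical participants: (mean sadness, mean tiredness, later depression status).
data = [(1, 1, 0), (2, 1, 0), (1, 3, 0), (4, 2, 1),
        (3, 4, 1), (2, 2, 0), (5, 5, 1), (1, 2, 1)]

scores = [predicted_risk(s, t) for s, t, _ in data]
labels = [y for _, _, y in data]
auc = roc_auc(scores, labels)
print(auc)
```

In a cross-validated setting, the scores passed to `roc_auc` would come from models fitted without the participant being scored.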
Suchting, Robert; Gowin, Joshua L; Green, Charles E; Walss-Bass, Consuelo; Lane, Scott D
2018-01-01
Rationale: Given datasets with a large or diverse set of predictors of aggression, machine learning (ML) provides efficient tools for identifying the most salient variables and building a parsimonious statistical model. ML techniques permit efficient exploration of data, have not been widely used in aggression research, and may have utility for those seeking prediction of aggressive behavior. Objectives: The present study examined predictors of aggression and constructed an optimized model using ML techniques. Predictors were derived from a dataset that included demographic, psychometric and genetic predictors, specifically FK506 binding protein 5 (FKBP5) polymorphisms, which have been shown to alter response to threatening stimuli, but have not been tested as predictors of aggressive behavior in adults. Methods: The data analysis approach utilized component-wise gradient boosting and model reduction via backward elimination to: (a) select variables from an initial set of 20 to build a model of trait aggression; and then (b) reduce that model to maximize parsimony and generalizability. Results: From a dataset of N = 47 participants, component-wise gradient boosting selected 8 of 20 possible predictors to model Buss-Perry Aggression Questionnaire (BPAQ) total score, with R2 = 0.66. This model was simplified using backward elimination, retaining six predictors: smoking status, psychopathy (interpersonal manipulation and callous affect), childhood trauma (physical abuse and neglect), and the FKBP5_13 gene (rs1360780). The six-factor model approximated the initial eight-factor model at 99.4% of R2. Conclusions: Using an inductive data science approach, the gradient boosting model identified predictors consistent with previous experimental work in aggression; specifically psychopathy and trauma exposure. 
Additionally, allelic variants in FKBP5 were identified for the first time, but the relatively small sample size limits generality of results and calls for replication. This approach provides utility for the prediction of aggressive behavior, particularly in the context of large multivariate datasets.
Dempsey, Steven J; Gese, Eric M; Kluever, Bryan M; Lonsinger, Robert C; Waits, Lisette P
2015-01-01
Development and evaluation of noninvasive methods for monitoring species distribution and abundance is a growing area of ecological research. While noninvasive methods have the advantage of reduced risk of negative factors associated with capture, comparisons to methods using more traditional invasive sampling are lacking. Historically, kit foxes (Vulpes macrotis) occupied the desert and semi-arid regions of southwestern North America. Once the most abundant carnivore in the Great Basin Desert of Utah, the species is now considered rare. In recent decades, attempts have been made to model the environmental variables influencing kit fox distribution. Using noninvasive scat deposition surveys for determination of kit fox presence, we modeled resource selection functions to predict kit fox distribution using three popular techniques (Maxent, fixed-effects, and mixed-effects generalized linear models) and compared these with similar models developed from invasive sampling (telemetry locations from radio-collared foxes). Resource selection functions were developed using a combination of landscape variables including elevation, slope, aspect, vegetation height, and soil type. All models were tested against subsequent scat collections as a method of model validation. We demonstrate the importance of comparing multiple model types for development of resource selection functions used to predict a species distribution, and evaluating the importance of environmental variables on species distribution. All models we examined showed a large effect of elevation on kit fox presence, followed by slope and vegetation height. However, the invasive sampling method (i.e., radio-telemetry) appeared to be better at determining resource selection, and therefore may be more robust in predicting kit fox distribution. 
In contrast, the distribution maps created from the noninvasive sampling (i.e., scat transects) were significantly different than the invasive method, thus scat transects may be appropriate when used in an occupancy framework to predict species distribution. We concluded that while scat deposition transects may be useful for monitoring kit fox abundance and possibly occupancy, they do not appear to be appropriate for determining resource selection. On our study area, scat transects were biased to roadways, while data collected using radio-telemetry was dictated by movements of the kit foxes themselves. We recommend that future studies applying noninvasive scat sampling should consider a more robust random sampling design across the landscape (e.g., random transects or more complete road coverage) that would then provide a more accurate and unbiased depiction of resource selection useful to predict kit fox distribution.
Individual treatment selection for patients with posttraumatic stress disorder.
Deisenhofer, Anne-Katharina; Delgadillo, Jaime; Rubel, Julian A; Böhnke, Jan R; Zimmermann, Dirk; Schwartz, Brian; Lutz, Wolfgang
2018-04-16
Trauma-focused cognitive behavioral therapy (Tf-CBT) and eye movement desensitization and reprocessing (EMDR) are two highly effective treatment options for posttraumatic stress disorder (PTSD). Yet, on an individual level, PTSD patients vary substantially in treatment response. The aim of the paper is to test the application of a treatment selection method based on a personalized advantage index (PAI). The study used clinical data for patients accessing treatment for PTSD in a primary care mental health service in the north of England. PTSD patients received either EMDR (N = 75) or Tf-CBT (N = 242). The Patient Health Questionnaire (PHQ-9) was used as an outcome measure for depressive symptoms associated with PTSD. Variables predicting differential treatment response were identified using an automated variable selection approach (genetic algorithm) and afterwards included in regression models, allowing the calculation of each patient's PAI. Age, employment status, gender, and functional impairment were identified as relevant variables for Tf-CBT. For EMDR, baseline depressive symptoms as well as prescribed antidepressant medication were selected as predictor variables. Fifty-six percent of the patients (n = 125) had a PAI equal or higher than one standard deviation. From those patients, 62 (50%) did not receive their model-predicted treatment and could have benefited from a treatment assignment based on the PAI. Using a PAI-based algorithm has the potential to improve clinical decision making and to enhance individual patient outcomes, although further replication is necessary before such an approach can be implemented in prospective studies. © 2018 Wiley Periodicals, Inc.
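The personalized advantage index can be read as the difference between a patient's predicted outcomes under the two treatments. The sketch below assumes one linear prediction model per treatment, built on the predictor sets named in the abstract; all coefficient values, the variable coding, and the example patient are hypothetical.

```python
def predict(patient, model):
    """Predicted post-treatment PHQ-9 score from a linear model."""
    return model["intercept"] + sum(
        weight * patient[name] for name, weight in model["weights"].items()
    )

# Hypothetical coefficients; only the predictor names follow the abstract.
TF_CBT = {"intercept": 8.0,
          "weights": {"age": 0.05, "unemployed": 2.0, "impairment": 0.3}}
EMDR = {"intercept": 6.0,
        "weights": {"baseline_phq9": 0.4, "antidepressant": 1.5}}

def personalized_advantage_index(patient):
    """Predicted Tf-CBT outcome minus predicted EMDR outcome.

    Lower PHQ-9 is better, so a positive value favours EMDR and a
    negative value favours Tf-CBT.
    """
    return predict(patient, TF_CBT) - predict(patient, EMDR)

def recommended_treatment(patient):
    return "EMDR" if personalized_advantage_index(patient) > 0 else "Tf-CBT"

# Hypothetical patient record.
patient = {"age": 40, "unemployed": 1, "impairment": 10,
           "baseline_phq9": 15, "antidepressant": 0}

pai = personalized_advantage_index(patient)
print(round(pai, 2), recommended_treatment(patient))
```

In practice the size of the index matters as well as its sign: a small absolute value suggests that neither treatment is predicted to be clearly better for that patient.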
Real-time predictive seasonal influenza model in Catalonia, Spain
Basile, Luca; Oviedo de la Fuente, Manuel; Torner, Nuria; Martínez, Ana; Jané, Mireia
2018-01-01
Influenza surveillance is critical to monitoring the situation during epidemic seasons, and predictive mathematical models may aid the early detection of epidemic patterns. The objective of this study was to design a real-time spatial predictive model of ILI (Influenza Like Illness) incidence rate in Catalonia using one- and two-week forecasts. The available data sources used to select explanatory variables to include in the model were the statutory reporting disease system and the sentinel surveillance system in Catalonia for influenza incidence rates, the official climate service in Catalonia for meteorological data, laboratory data and Google Flu Trends. Time series for every explanatory variable with data from the last 4 seasons (from 2010–2011 to 2013–2014) were created. A pilot test was conducted during the 2014–2015 season to select the explanatory variables to be included in the model and the type of model to be applied. During the 2015–2016 season a real-time model was applied weekly, obtaining the intensity level and predicted incidence rates with 95% confidence levels one and two weeks away for each health region. At the end of the season, the confidence interval success rate (CISR) and intensity level success rate (ILSR) were analysed. For the 2015–2016 season a CISR of 85.3% at one week and 87.1% at two weeks and an ILSR of 82.9% and 82% were observed, respectively. The model described is a useful tool although it is hard to evaluate due to uncertainty. The accuracy of prediction at one and two weeks was above 80% globally, but was lower during the peak epidemic period. In order to improve the predictive power, new explanatory variables should be included. PMID:29513710
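The two success rates can be illustrated with a short sketch. It assumes that the CISR is the share of weeks in which the observed incidence falls inside the forecast interval and that the ILSR is the share of weeks in which the predicted intensity level matches the observed one; the abstract does not give the formal definitions, and the numbers below are invented.

```python
def interval_success_rate(observed, lower, upper):
    """Fraction of observations that fall inside their forecast interval."""
    hits = sum(1 for o, lo, hi in zip(observed, lower, upper) if lo <= o <= hi)
    return hits / len(observed)

def level_success_rate(predicted_levels, observed_levels):
    """Fraction of forecasts whose intensity level matches the observed level."""
    matches = sum(1 for p, o in zip(predicted_levels, observed_levels) if p == o)
    return matches / len(observed_levels)

# Hypothetical weekly ILI incidence rates and forecast intervals.
observed = [12.0, 30.5, 80.2, 150.0, 95.0]
lower = [8.0, 25.0, 60.0, 100.0, 70.0]
upper = [20.0, 40.0, 100.0, 140.0, 120.0]

# Hypothetical intensity levels for the same weeks.
predicted_levels = ["low", "low", "medium", "medium", "medium"]
observed_levels = ["low", "low", "medium", "high", "medium"]

cisr = interval_success_rate(observed, lower, upper)
ilsr = level_success_rate(predicted_levels, observed_levels)
print(cisr, ilsr)
```

In this toy example both rates drop in the peak week, mirroring the abstract's remark that accuracy was lower during the peak epidemic period.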
Torija, Antonio J; Ruiz, Diego P
2015-02-01
The prediction of environmental noise in urban environments requires the solution of a complex and non-linear problem, since there are complex relationships among the multitude of variables involved in the characterization and modelling of environmental noise and environmental-noise magnitudes. Moreover, the inclusion of the great spatial heterogeneity characteristic of urban environments seems to be essential in order to achieve an accurate environmental-noise prediction in cities. This problem is addressed in this paper, where a procedure based on feature-selection techniques and machine-learning regression methods is proposed and applied to this environmental problem. Three machine-learning regression methods, which are considered very robust in solving non-linear problems, are used to estimate the energy-equivalent sound-pressure level descriptor (LAeq). These three methods are: (i) multilayer perceptron (MLP), (ii) sequential minimal optimisation (SMO), and (iii) Gaussian processes for regression (GPR). In addition, because of the high number of input variables involved in environmental-noise modelling and estimation in urban environments, which make LAeq prediction models quite complex and costly in terms of time and resources for application to real situations, three different techniques are used to approach feature selection or data reduction. The feature-selection techniques used are (i) correlation-based feature-subset selection (CFS) and (ii) wrapper for feature-subset selection (WFS); the data-reduction technique is principal-component analysis (PCA). The subsequent analysis leads to a proposal of different schemes, depending on the needs regarding data collection and accuracy. The use of WFS as the feature-selection technique with the implementation of SMO or GPR as regression algorithm provides the best LAeq estimation (R2 = 0.94 and mean absolute error (MAE) = 1.14-1.16 dB(A)). Copyright © 2014 Elsevier B.V. All rights reserved.
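The two accuracy measures quoted in this abstract can be computed as in the following sketch. The measured and predicted LAeq values are invented for illustration and are unrelated to the study's results.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between measured and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical LAeq values in dB(A).
measured = [60.0, 65.0, 70.0, 75.0]
predicted = [61.0, 64.0, 71.5, 74.0]

mae = mean_absolute_error(measured, predicted)
r2 = r_squared(measured, predicted)
print(mae, round(r2, 3))
```

MAE is expressed in the units of the response, here dB(A), whereas R2 is unitless, which is why the two are usually reported together.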
Nikolić, Biljana; Martinović, Jelena; Matić, Milan; Stefanović, Đorđe
2018-05-29
Different variables determine the performance of cyclists, which raises the question of how these parameters may help in their classification by specialty. The aim of the study was to determine differences in cardiorespiratory parameters of male cyclists according to their specialty (flat rider (N=21), hill rider (N=35) and sprinter (N=20)), and to obtain a multivariate model for further classification of cyclists by specialty, based on selected variables. Seventeen variables were measured at submaximal and maximum load on the cycle ergometer Cosmed E 400HK (Cosmed, Rome, Italy) (initial 100W with 25W increase, 90-100 rpm). Multivariate discriminant analysis was used to determine which variables group cyclists within their specialty, and to predict which variables can direct cyclists to a particular specialty. Among nine variables that statistically contribute to the discriminant power of the model, achieved power on the anaerobic threshold and the produced CO2 had the biggest impact. The obtained discriminatory model correctly classified 91.43% of flat riders and 85.71% of hill riders, while sprinters were all classified correctly (100%); i.e. 92.10% of examinees were correctly classified, which indicates the strength of the discriminatory model. Respiratory indicators mostly contribute to the discriminant power of the model, which may significantly contribute to training practice and laboratory tests in future.
Reulen, Holger; Kneib, Thomas
2016-04-01
One important goal in multi-state modelling is to explore information about conditional transition-type-specific hazard rate functions by estimating influencing effects of explanatory variables. This may be performed using single transition-type-specific models if these covariate effects are assumed to be different across transition-types. To investigate whether this assumption holds or whether one of the effects is equal across several transition-types (cross-transition-type effect), a combined model has to be applied, for instance with the use of a stratified partial likelihood formulation. Here, prior knowledge about the underlying covariate effect mechanisms is often sparse, especially about ineffectivenesses of transition-type-specific or cross-transition-type effects. As a consequence, data-driven variable selection is an important task: a large number of estimable effects has to be taken into account if joint modelling of all transition-types is performed. A related but subsequent task is model choice: is an effect satisfactorily estimated assuming linearity, or is the true underlying nature strongly deviating from linearity? This article introduces component-wise Functional Gradient Descent Boosting (short boosting) for multi-state models, an approach performing unsupervised variable selection and model choice simultaneously within a single estimation run. We demonstrate that features and advantages in the application of boosting introduced and illustrated in classical regression scenarios remain present in the transfer to multi-state models. As a consequence, boosting provides an effective means to answer questions about ineffectiveness and non-linearity of single transition-type-specific or cross-transition-type effects.
Strategies for minimizing sample size for use in airborne LiDAR-based forest inventory
Junttila, Virpi; Finley, Andrew O.; Bradford, John B.; Kauranne, Tuomo
2013-01-01
Recently airborne Light Detection And Ranging (LiDAR) has emerged as a highly accurate remote sensing modality to be used in operational scale forest inventories. Inventories conducted with the help of LiDAR are most often model-based, i.e. they use variables derived from LiDAR point clouds as the predictive variables that are to be calibrated using field plots. The measurement of the necessary field plots is a time-consuming and statistically sensitive process. Because of this, current practice often presumes hundreds of plots to be collected. But since these plots are only used to calibrate regression models, it should be possible to minimize the number of plots needed by carefully selecting the plots to be measured. In the current study, we compare several systematic and random methods for calibration plot selection, with the specific aim that they be used in LiDAR based regression models for forest parameters, especially above-ground biomass. The primary criteria compared are based on both spatial representativity as well as on their coverage of the variability of the forest features measured. In the former case, it is important also to take into account spatial auto-correlation between the plots. The results indicate that choosing the plots in a way that ensures ample coverage of both spatial and feature space variability improves the performance of the corresponding models, and that adequate coverage of the variability in the feature space is the most important condition that should be met by the set of plots collected.
Investigating Organizational Alienation Behavior in Terms of Some Variables
ERIC Educational Resources Information Center
Dagli, Abidin; Averbek, Emel
2017-01-01
The aim of this study is to detect the perceptions of public primary school teachers regarding organizational alienation behaviors in terms of some variables (gender, marital status and seniority). Survey model was used in this study. The research sample consists of randomly selected 346 teachers from 40 schools in the central district of Mardin,…
ERIC Educational Resources Information Center
Bazan-Ramirez, Aldo; Castellanos-Simons, Doris; Lopez-Valenzuela, Mercedes
2010-01-01
This paper aims at analysing the structural relationships among some latent and observed variables related to the assessment of written language performance in 139 fourth grade students of Elementary School selected from nine public schools of the northwest of Mexico. Questionnaires were also applied to the children's parents and teachers. The…
James H. Miller
1998-01-01
Available research is reviewed on the interactions of application variables, herbicides, and species. Objectives of this review are to gain insights into why variation occurs with herbicide performance, how current knowledge might be applied to enhance efficacy and consistency, and research pathways that should foster integration of application-efficacy models. A...
Mediation in dyadic data at the level of the dyads: a Structural Equation Modeling approach.
Ledermann, Thomas; Macho, Siegfried
2009-10-01
An extended version of the Common Fate Model (CFM) is presented to estimate and test mediation in dyadic data. The model can be used for distinguishable dyad members (e.g., heterosexual couples) or indistinguishable dyad members (e.g., homosexual couples) if (a) the variables measure characteristics of the dyadic relationship or shared external influences that affect both partners; if (b) the causal associations between the variables should be analyzed at the dyadic level; and if (c) the measured variables are reliable indicators of the latent variables. To assess mediation using Structural Equation Modeling, a general three-step procedure is suggested. The first is a selection of a good fitting model, the second a test of the direct effects, and the third a test of the mediating effect by means of bootstrapping. The application of the model along with the procedure for assessing mediation is illustrated using data from 184 couples on marital problems, communication, and marital quality. Differences with the Actor-Partner Interdependence Model and the analysis of longitudinal mediation by using the CFM are discussed.
Variable-intercept panel model for deformation zoning of a super-high arch dam.
Shi, Zhongwen; Gu, Chongshi; Qin, Dong
2016-01-01
This study determines dam deformation similarity indexes based on an analysis of deformation zoning features and panel data clustering theory, taking into account the actual deformation law of super-high arch dams and the spatial-temporal features of dam deformation. Measurement methods of these indexes are studied. Based on the established deformation similarity criteria, the principle used to determine the number of dam deformation zones is constructed through the entropy weight method. This study proposes a deformation zoning method for super-high arch dams and its implementation steps, analyzes the effect of special influencing factors of different dam zones on the deformation, introduces dummy variables that represent the special effect of dam deformation, and establishes a variable-intercept panel model for deformation zoning of super-high arch dams. Based on different patterns of the special effect in the variable-intercept panel model, two panel analysis models were established to monitor fixed and random effects of dam deformation. The Hausman test for model selection and a method for assessing model effectiveness are discussed. Finally, the effectiveness of the established models is verified through a case study.
Baldwin, Austin K.; Graczyk, David J.; Robertson, Dale M.; Saad, David A.; Magruder, Christopher
2012-01-01
The models to estimate chloride concentrations all used specific conductance as the explanatory variable, except for the model for the Little Menomonee River near Freistadt, which used both specific conductance and turbidity as explanatory variables. Adjusted R2 values for the chloride models ranged from 0.74 to 0.97. Models to estimate total suspended solids and total phosphorus used turbidity as the only explanatory variable. Adjusted R2 values ranged from 0.77 to 0.94 for the total suspended solids models and from 0.55 to 0.75 for the total phosphorus models. Models to estimate indicator bacteria used water temperature and turbidity as the explanatory variables, with adjusted R2 values from 0.54 to 0.69 for Escherichia coli bacteria models and from 0.54 to 0.74 for fecal coliform bacteria models. Dissolved oxygen was not used in any of the final models. These models may help managers measure the effects of land-use changes and improvement projects, establish total maximum daily loads, estimate important water-quality indicators such as bacteria concentrations, and enable informed decision making in the future.
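The adjusted R2 values quoted above penalize the ordinary R2 for the number of explanatory variables. A minimal sketch of the standard formula follows; the sample size and R2 in the usage note are hypothetical and are not taken from the report.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1).

    r2           -- ordinary coefficient of determination
    n_obs        -- number of observations n
    n_predictors -- number of explanatory variables p (intercept not counted)
    """
    dof = n_obs - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more observations than fitted parameters")
    return 1.0 - (1.0 - r2) * (n_obs - 1) / dof
```

For example, with a hypothetical R2 of 0.80 from 30 samples, `adjusted_r2(0.80, 30, 1)` is larger than `adjusted_r2(0.80, 30, 2)`, so a second explanatory variable such as turbidity has to raise R2 by enough to offset the penalty.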
Heinrichs, Julie; Aldridge, Cameron L.; O'Donnell, Michael; Schumaker, Nathan
2017-01-01
Prioritizing habitats for conservation is a challenging task, particularly for species with fluctuating populations and seasonally dynamic habitat needs. Although the use of resource selection models to identify and prioritize habitat for conservation is increasingly common, their ability to characterize important long-term habitats for dynamic populations is variable. To examine how habitats might be prioritized differently if resource selection was directly and dynamically linked with population fluctuations and movement limitations among seasonal habitats, we constructed a spatially explicit individual-based model for a dramatically fluctuating population requiring temporally varying resources. Using greater sage-grouse (Centrocercus urophasianus) in Wyoming as a case study, we used resource selection function (RSF) maps to guide seasonal movement and habitat selection, but emergent population dynamics and simulated movement limitations modified long-term habitat occupancy. We compared priority habitats in RSF maps to long-term simulated habitat use. We examined the circumstances under which the explicit consideration of movement limitations, in combination with population fluctuations and trends, is likely to alter predictions of important habitats. In doing so, we assessed the future occupancy of protected areas under alternative population and habitat conditions. Habitat prioritizations based on resource selection models alone predicted high use in isolated parcels of habitat and in areas with low connectivity among seasonal habitats. In contrast, results based on more biologically informed simulations emphasized central and connected areas near high-density populations, some of which were predicted to have low selection value. Dynamic models of habitat use can provide additional biological realism that can extend, and in some cases contradict, habitat use predictions generated from short-term or static resource selection analyses.
The explicit inclusion of population dynamics and movement propensities via spatial simulation modeling frameworks may provide an informative means of predicting long-term habitat use, particularly for fluctuating populations with complex seasonal habitat needs. Importantly, our results indicate the possible need to consider habitat selection models as a starting point rather than the common end point for refining and prioritizing habitats for protection for cyclic and highly variable populations.
Guinn, Caroline H; Baxter, Suzanne D; Royer, Julie A; Hitchcock, David B
2013-05-01
A 2010 publication showed a positive relationship between children's body mass index (BMI) and energy intake at school-provided meals (as assessed by direct meal observations). To help explain that relationship, we investigated 7 outcome variables concerning aspects of school-provided meals: energy content of items selected, number of meal components selected, number of meal components eaten, amounts eaten of standardized school-meal portions, energy intake from flavored milk, energy intake received in trades, and energy content given in trades. Fourth-grade children (N = 465) from Columbia, SC, were observed eating school-provided breakfast and lunch on 1 to 4 days per child. Researchers measured children's weight and height. For daily values at school meals, a generalized linear model was fit with BMI (dependent variable) and the 7 outcome variables, sex, and age (independent variables). BMI was positively related to amounts eaten of standardized school-meal portions (p < .0001) and increased 8.45 kg/m² per serving, controlling for other variables in the model. BMI was positively related to energy intake from flavored milk (p = .0041) and increased 0.347 kg/m² for every 100 kcal consumed. BMI was negatively related to energy intake received in trades (p = .0003) and decreased 0.468 kg/m² for every 100 kcal received. BMI was not significantly related to 4 outcome variables. Knowing that relationships between BMI and actual consumption, not selection, at school-provided meals explained the (previously found) positive relationship between BMI and energy intake at school-provided meals is helpful for school-based obesity interventions. © 2013, American School Health Association.
Shirk, Andrew J; Landguth, Erin L; Cushman, Samuel A
2018-01-01
Anthropogenic migration barriers fragment many populations and limit the ability of species to respond to climate-induced biome shifts. Conservation actions designed to conserve habitat connectivity and mitigate barriers are needed to unite fragmented populations into larger, more viable metapopulations, and to allow species to track their climate envelope over time. Landscape genetic analysis provides an empirical means to infer landscape factors influencing gene flow and thereby inform such conservation actions. However, there are currently many methods available for model selection in landscape genetics, and considerable uncertainty as to which provide the greatest accuracy in identifying the true landscape model influencing gene flow among competing alternative hypotheses. In this study, we used population genetic simulations to evaluate the performance of seven regression-based model selection methods on a broad array of landscapes that varied by the number and type of variables contributing to resistance, the magnitude and cohesion of resistance, as well as the functional relationship between variables and resistance. We also assessed the effect of transformations designed to linearize the relationship between genetic and landscape distances. We found that linear mixed effects models had the highest accuracy in every way we evaluated model performance; however, other methods also performed well in many circumstances, particularly when landscape resistance was high and the correlation among competing hypotheses was limited. Our results provide guidance for which regression-based model selection methods provide the most accurate inferences in landscape genetic analysis and thereby best inform connectivity conservation actions. Published 2017. This article is a U.S. Government work and is in the public domain in the USA.
A Bibliography of Selected Publications: Project Air Force, 5th Edition
1989-05-01
Dyna - R-3028-AF. A Dynamic Retention Model for Air Force Officers: METRIC’s DL and Pipeline Variability. M. J. Carrillo. Theory and Estimates. G...Theorem and Dyna - and Support. METRIC’s Demand and Pipeline Variability. R-3255-AF. Aircraft Airframe Cost Estimating Relationships: N-2283/1-AF...U). 1970-1985. N-2409-AF. Tanker Splitting Across the SIOP Bomber Force R-3389-AF. Dyna-METRIC Version 4: Modeling Worldwide (U). Logistics Support of
Prediction of municipal solid waste generation using nonlinear autoregressive network.
Younes, Mohammad K; Nopiah, Z M; Basri, N E Ahmad; Basri, H; Abushammala, Mohammed F M; Maulud, K N A
2015-12-01
Most developing countries have solid waste management problems. Solid waste strategic planning requires accurate prediction of the quality and quantity of the generated waste. In developing countries, such as Malaysia, the solid waste generation rate is increasing rapidly, due to population growth and new consumption trends that characterize society. This paper proposes an artificial neural network (ANN) approach using a feedforward nonlinear autoregressive network with exogenous inputs (NARX) to predict annual solid waste generation in relation to demographic and economic variables like population number, gross domestic product, electricity demand per capita and employment and unemployment numbers. In addition, variable selection procedures are also developed to select a significant explanatory variable. The model evaluation was performed using the coefficient of determination (R²) and mean square error (MSE). The optimum model that produced the lowest testing MSE (2.46) and the highest R² (0.97) had three inputs (gross domestic product, population and employment), eight neurons and one lag in the hidden layer, and used Fletcher-Powell's conjugate gradient as the training algorithm.
Cataclysmic variables and related objects
NASA Technical Reports Server (NTRS)
Hack, Margherita; Ladous, Constanze; Jordan, Stuart D. (Editor); Thomas, Richard N. (Editor); Goldberg, Leo; Pecker, Jean-Claude
1993-01-01
This volume begins with an introductory chapter on general properties of cataclysmic variables. Chapters 2 through 5 of Part 1 are devoted to observations and interpretation of dwarf novae and nova-like stars. Chapters 6 through 10, Part 2, discuss the general observational properties of classical and recurrent novae, the theoretical models, and the characteristics and models for some well observed classical novae and recurrent novae. Chapters 11 through 14 of Part 3 are devoted to an overview of the observations of symbiotic stars, to a description of the various models proposed for explaining the symbiotic phenomenon, and to a discussion of a few selected objects, respectively. Chapter 15 briefly examines the many unsolved problems posed by the observations of the different classes of cataclysmic variables and symbiotic stars.
Ahmadi, Mehdi; Shahlaei, Mohsen
2015-01-01
P2X7 antagonist activity for a set of 49 molecules of the P2X7 receptor antagonists, derivatives of purine, was modeled with the aid of chemometric and artificial intelligence techniques. The activity of these compounds was estimated by means of a combination of principal component analysis (PCA), as a well-known data reduction method, genetic algorithm (GA), as a variable selection technique, and artificial neural network (ANN), as a non-linear modeling method. First, a linear regression combined with PCA (principal component regression) was used to model the structure–activity relationships, and afterwards a combination of PCA and the ANN algorithm was employed to accurately predict the biological activity of the P2X7 antagonist. PCA preserves as much as possible of the information contained in the original data set. The seven principal components (PCs) most relevant to the studied activity were selected as the inputs of the ANN by GA, an efficient variable selection method. The best computational neural network model was a fully-connected, feed-forward model with 7−7−1 architecture. The developed ANN model was fully evaluated by different validation techniques, including internal and external validation, and chemical applicability domain. All validations showed that the constructed quantitative structure–activity relationship model is robust and satisfactory. PMID:26600858
Modelling Ecuador's rainfall distribution according to geographical characteristics.
NASA Astrophysics Data System (ADS)
Tobar, Vladimiro; Wyseure, Guido
2017-04-01
It is known that rainfall is affected by terrain characteristics, and some studies have focused on its distribution over complex terrain. Ecuador's temporal and spatial rainfall distribution is affected by its location on the ITCZ, the marine currents in the Pacific, the Amazon rainforest, and the Andes mountain range. Although all these factors are important, we think that the latter may hold a key for modelling the spatial and temporal distribution of rainfall. The study considered 30 years of monthly data from 319 rainfall stations having at least 10 years of data available. The relatively low density of stations and their location in accessible sites near main roads or rivers leave large and important areas ungauged, making it inappropriate to rely on traditional interpolation techniques to estimate regional rainfall for water balance. The aim of this research was to develop a useful model for seasonal rainfall distribution in Ecuador based on geographical characteristics to allow its spatial generalization. The target for modelling was the seasonal rainfall, characterized by nine percentiles for each of the 12 months of the year, which results in 108 response variables, later reduced to four principal components (PCs) comprising 94% of the total variability. Predictor variables for the model were: geographic coordinates, elevation, main wind effects from the Amazon and Coast, Valley and Hill indexes, and average and maximum elevation above the selected rainfall station to the east and to the west, for each of 18 directions (50-135°, by 5°), adding up to 79 predictors. A multiple linear regression model fitted by the Elastic-net algorithm with cross-validation was applied with each of the PCs as the response to select the most important of the 79 predictor variables. The Elastic-net algorithm deals well with collinearity problems, while allowing variable selection in a blended approach between Ridge and Lasso regression.
The model fitting produced explained variances of 59%, 81%, 49% and 17% for PC1, PC2, PC3 and PC4, respectively, backing up the hypothesis of good correlation between geographical characteristics and seasonal rainfall patterns (comprised in the four principal components). With the obtained coefficients from the regression, the 108 rainfall percentiles for each station were back estimated giving very good results when compared with the original ones, with an overall 60% explained variance.
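How an elastic-net penalty blends Ridge and Lasso and drops unhelpful predictors can be shown with a small coordinate-descent sketch. This is an illustration under stated assumptions, not the authors' implementation: the objective form, the names `alpha` and `l1_ratio`, and the toy data are chosen here for the example, predictors and response are assumed to be centred, and the cross-validation used in the study to tune the penalty is not shown.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; exactly zero inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net(X, y, alpha, l1_ratio, n_sweeps=200, tol=1e-10):
    """Coordinate descent for
        (1/2n)*||y - Xw||^2 + alpha*l1_ratio*||w||_1
                            + 0.5*alpha*(1 - l1_ratio)*||w||_2^2
    X is a list of rows; columns of X and y are assumed centred (no intercept).
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)                      # y - Xw with w = 0
    l1 = alpha * l1_ratio                   # Lasso part of the penalty
    l2 = alpha * (1.0 - l1_ratio)           # Ridge part of the penalty
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue                    # constant column carries no signal
            # correlation of column j with the residual that excludes its own term
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, l1) / (col_sq[j] + l2)
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_w
            max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return w


# Toy data: the response depends only on the first (centred) predictor.
X_demo = [[-2.0, 1.0], [-1.0, -2.0], [0.0, 0.0], [1.0, 2.0], [2.0, -1.0]]
y_demo = [-6.0, -3.0, 0.0, 3.0, 6.0]
w_demo = elastic_net(X_demo, y_demo, alpha=0.5, l1_ratio=0.5)
```

In the toy data the second predictor is uncorrelated with the response, so the Lasso part of the penalty sets its coefficient to zero, while the Ridge part shrinks the first coefficient below its least-squares value of 3.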
Model selection with multiple regression on distance matrices leads to incorrect inferences.
Franckowiak, Ryan P; Panasci, Michael; Jarvis, Karl J; Acuña-Rodriguez, Ian S; Landguth, Erin L; Fortin, Marie-Josée; Wagner, Helene H
2017-01-01
In landscape genetics, model selection procedures based on Information Theoretic and Bayesian principles have been used with multiple regression on distance matrices (MRM) to test the relationship between multiple vectors of pairwise genetic, geographic, and environmental distance. Using Monte Carlo simulations, we examined the ability of model selection criteria based on Akaike's information criterion (AIC), its small-sample correction (AICc), and the Bayesian information criterion (BIC) to reliably rank candidate models when applied with MRM while varying the sample size. The results showed a serious problem: all three criteria exhibit a systematic bias toward selecting unnecessarily complex models containing spurious random variables and erroneously suggest a high level of support for the incorrectly ranked best model. These problems effectively increased with increasing sample size. The failure of AIC, AICc, and BIC was likely driven by the inflated sample size and different sum-of-squares partitioned by MRM, and the resulting effect on delta values. Based on these findings, we strongly discourage the continued application of AIC, AICc, and BIC for model selection with MRM.
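The abstract attributes part of the failure to the inflated sample size that arises when pairwise distances are treated as observations. A minimal sketch of that one mechanism follows, using the usual least-squares forms of the three criteria (up to an additive constant). It is not a reproduction of the authors' simulations; the function names and the numbers in the usage note are hypothetical.

```python
import math


def information_criteria(rss, n, k):
    """Least-squares AIC, AICc and BIC, up to an additive constant.

    rss -- residual sum of squares of the fitted model
    n   -- number of observations supplied to the regression
    k   -- number of estimated parameters (how k is counted is an assumption
           of this sketch; it must be counted the same way for all models)
    """
    fit_term = n * math.log(rss / n)
    aic = fit_term + 2 * k
    aicc = aic + (2 * k * (k + 1)) / (n - k - 1)
    bic = fit_term + k * math.log(n)
    return aic, aicc, bic


def n_pairwise(n_individuals):
    """Number of pairwise distances among N sampled individuals or populations."""
    return n_individuals * (n_individuals - 1) // 2


def delta_aic_extra_term(n, rss_ratio):
    """AIC change from adding one parameter that multiplies RSS by rss_ratio."""
    return n * math.log(rss_ratio) + 2
```

With 50 sampled units there are `n_pairwise(50)` = 1225 pairwise distances. A spurious variable that lowers the residual sum of squares by only 1% gives `delta_aic_extra_term(50, 0.99)` above zero (the simpler model is preferred) but `delta_aic_extra_term(1225, 0.99)` well below zero (the more complex model is preferred), which matches the direction of the bias described above.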
Generalized structural equations improve sexual-selection analyses
Santini, Giacomo; Marchetti, Giovanni Maria; Focardi, Stefano
2017-01-01
Sexual selection is an intense evolutionary force, which operates through competition for access to breeding resources. There are many cases where male copulatory success is highly asymmetric, and few males are able to sire most females. Two main hypotheses were proposed to explain this asymmetry: “female choice” and “male dominance”. The literature reports contrasting results. This variability may reflect actual differences among studied populations, but it may also be generated by methodological differences and statistical shortcomings in data analysis. A review of the statistical methods used so far in lek studies shows a prevalence of Linear Models (LMs) and Generalized Linear Models (GLMs), which may be affected by problems in inferring cause-effect relationships, multi-collinearity among explanatory variables, and erroneous handling of non-normal and non-continuous distributions of the response variable. In lek breeding, selective pressure is maximal, because large numbers of males and females congregate in small arenas. We used a dataset on lekking fallow deer (Dama dama) to contrast the methods and procedures employed so far, and we propose a novel approach based on Generalized Structural Equations Models (GSEMs). GSEMs combine the power and flexibility of both SEM and GLM in a unified modeling framework. We showed that LMs fail to identify several important predictors of male copulatory success and yield very imprecise parameter estimates. Minor variations in data transformation yield wide changes in results, and the method appears unreliable. GLMs improved the analysis, but GSEMs provided better results, because the use of latent variables decreases the impact of measurement errors. Using GSEMs, we were able to test contrasting hypotheses and calculate both direct and indirect effects, and we reached a high precision of the estimates, which implies a high predictive ability.
In synthesis, we recommend the use of GSEMs in studies on lekking behaviour, and we provide guidelines to implement these models. PMID:28809923
Madden, M; Batey, P W J
1983-05-01
Some problems associated with demographic-economic forecasting include finding models appropriate for a declining economy with unemployment, using a multiregional approach in an interregional model, finding a way to show differential consumption while endogenizing unemployment, and avoiding unemployment inconsistencies. The solution to these problems involves the construction of an activity-commodity framework, locating it within a group of forecasting models, and indicating possible ratios towards dynamization of the framework. The authors demonstrate the range of impact multipliers that can be derived from the framework and show how these multipliers relate to Leontief input-output multipliers. It is shown that desired population distribution may be obtained by selecting instruments from the economic sphere to produce, through the constraints vector of an activity-commodity framework, targets selected from demographic activities. The next step in this process, empirical exploitation, was carried out by the authors in the United Kingdom, linking an input-output model with a wide selection of demographic and demographic-economic variables. The generally tenuous control which government has over any variables in systems of this type, especially in market economies, makes application in the policy field of the optimization approach a partly conjectural exercise, although the analytic capacity of the approach can provide clear indications of policy directions.
Selection of Representative Models for Decision Analysis Under Uncertainty
NASA Astrophysics Data System (ADS)
Meira, Luis A. A.; Coelho, Guilherme P.; Santos, Antonio Alberto S.; Schiozer, Denis J.
2016-03-01
The decision-making process in oil fields includes a step of risk analysis associated with the uncertainties present in the variables of the problem. Such uncertainties lead to hundreds, even thousands, of possible scenarios that are supposed to be analyzed so an effective production strategy can be selected. Given this high number of scenarios, a technique to reduce this set to a smaller, feasible subset of representative scenarios is imperative. The selected scenarios must be representative of the original set and also free of optimistic and pessimistic bias. This paper proposes an assisted methodology to identify representative models in oil fields. To do so, first a mathematical function was developed to model the representativeness of a subset of models with respect to the full set that characterizes the problem. Then, an optimization tool was implemented to identify the representative models of any problem, considering not only the cross-plots of the main output variables, but also the risk curves and the probability distribution of the attribute-levels of the problem. The proposed technique was applied to two benchmark cases and the results, evaluated by experts in the field, indicate that the obtained solutions are richer than those identified by previously adopted manual approaches. The program bytecode is available upon request.
ERIC Educational Resources Information Center
van der Ven, Sanne H. G.; Boom, Jan; Kroesbergen, Evelyn H.; Leseman, Paul P. M.
2012-01-01
Variability in strategy selection is an important characteristic of learning new skills such as mathematical skills. Strategies gradually come and go during this development. In 1996, Siegler described this phenomenon as "overlapping waves." In the current microgenetic study, we attempted to model these overlapping waves statistically. In…
NASA Astrophysics Data System (ADS)
Unruh, Y. C.; Krivova, N. A.; Solanki, S. K.; Harder, J. W.; Kopp, G.
2008-07-01
Aims: We test the reliability of the observed and calculated spectral irradiance variations between 200 and 1600 nm over a time span of three solar rotations in 2004. Methods: We compare our model calculations to spectral irradiance observations taken with SORCE/SIM, SoHO/VIRGO, and UARS/SUSIM. The calculations assume LTE and are based on the SATIRE (Spectral And Total Irradiance REconstruction) model. We analyse the variability as a function of wavelength and present time series in a number of selected wavelength regions covering the UV to the NIR. We also show the facular and spot contributions to the total calculated variability. Results: In most wavelength regions, the variability agrees well between all sets of observations and the model calculations. The model does particularly well between 400 and 1300 nm, but fails below 220 nm, as well as for some of the strong NUV lines. Our calculations clearly show the shift from faculae-dominated variability in the NUV to spot-dominated variability above approximately 400 nm. We also discuss some of the remaining problems, such as the low sensitivity of SUSIM and SORCE for wavelengths between approximately 310 and 350 nm, where currently the model calculations still provide the best estimates of solar variability.
Svenning, J.-C.; Engelbrecht, B.M.J.; Kinner, D.A.; Kursar, T.A.; Stallard, R.F.; Wright, S.J.
2006-01-01
We used regression models and information-theoretic model selection to assess the relative importance of environment, local dispersal and historical contingency as controls of the distributions of 26 common plant species in tropical forest on Barro Colorado Island (BCI), Panama. We censused eighty-eight 0.09-ha plots scattered across the landscape. Environmental control, local dispersal and historical contingency were represented by environmental variables (soil moisture, slope, soil type, distance to shore, old-forest presence), a spatial autoregressive parameter, and four spatial trend variables, respectively. We built regression models, representing all combinations of the three hypotheses, for each species. The probability that the best model included the environmental variables, the spatial trend variables and the spatial autoregressive parameter averaged 33%, 64% and 50% across the study species, respectively. The environmental variables, the spatial trend variables, the spatial autoregressive parameter, and a simple intercept model received the strongest support for 4, 15, 5 and 2 species, respectively. Comparing the model results to information on species traits showed that species with strong spatial trends produced few and heavy diaspores, while species with strong soil moisture relationships were particularly drought-sensitive. In conclusion, history and local dispersal appeared to be the dominant controls of the distributions of common plant species on BCI. Copyright © 2006 Cambridge University Press.
NASA Astrophysics Data System (ADS)
Zarindast, Atousa; Seyed Hosseini, Seyed Mohamad; Pishvaee, Mir Saman
2017-06-01
A robust supplier selection problem is proposed in a scenario-based approach, in which demand and exchange rates are subject to uncertainty. First, a deterministic multi-objective mixed integer linear programming model is developed; then, the robust counterpart of the proposed mixed integer linear programming model is presented using a recent extension of robust optimization theory. The decision variables are determined, in turn, by a two-stage stochastic planning model, by a robust stochastic optimization planning model that integrates the worst-case scenario into the modeling approach, and by an equivalent deterministic planning model. An experimental study is carried out to compare the performance of the three models. The robust model resulted in remarkable cost savings, illustrating that such uncertainties should be considered in advance in planning. In our case study, different suppliers were selected because of these uncertainties, and since supplier selection is a strategic decision, it is crucial to consider these uncertainties in the planning approach.
Clustering and variable selection in the presence of mixed variable types and missing data.
Storlie, C B; Myers, S M; Katusic, S K; Weaver, A L; Voigt, R G; Croarkin, P E; Stoeckel, R E; Port, J D
2018-05-17
We consider the problem of model-based clustering in the presence of many correlated, mixed continuous, and discrete variables, some of which may have missing values. Discrete variables are treated with a latent continuous variable approach, and the Dirichlet process is used to construct a mixture model with an unknown number of components. Variable selection is also performed to identify the variables that are most influential for determining cluster membership. The work is motivated by the need to cluster patients thought to potentially have autism spectrum disorder on the basis of many cognitive and/or behavioral test scores. There are a modest number of patients (486) in the data set along with many (55) test score variables (many of which are discrete valued and/or missing). The goal of the work is to (1) cluster these patients into similar groups to help identify those with similar clinical presentation and (2) identify a sparse subset of tests that inform the clusters in order to eliminate unnecessary testing. The proposed approach compares very favorably with other methods via simulation of problems of this type. The results of the autism spectrum disorder analysis suggested 3 clusters to be most likely, while only 4 test scores had high (>0.5) posterior probability of being informative. This will result in much more efficient and informative testing. The need to cluster observations on the basis of many correlated, continuous/discrete variables with missing values is a common problem in the health sciences as well as in many other disciplines. Copyright © 2018 John Wiley & Sons, Ltd.
Natural Selection Reduced Diversity on Human Y Chromosomes
Wilson Sayres, Melissa A.; Lohmueller, Kirk E.; Nielsen, Rasmus
2014-01-01
The human Y chromosome exhibits surprisingly low levels of genetic diversity. This could result from neutral processes if the effective population size of males is reduced relative to females due to a higher variance in the number of offspring from males than from females. Alternatively, selection acting on new mutations, and affecting linked neutral sites, could reduce variability on the Y chromosome. Here, using genome-wide analyses of X, Y, autosomal and mitochondrial DNA, in combination with extensive population genetic simulations, we show that low observed Y chromosome variability is not consistent with a purely neutral model. Instead, we show that models of purifying selection are consistent with observed Y diversity. Further, the number of sites estimated to be under purifying selection greatly exceeds the number of Y-linked coding sites, suggesting the importance of the highly repetitive ampliconic regions. While we show that purifying selection removing deleterious mutations can explain the low diversity on the Y chromosome, we cannot exclude the possibility that positive selection acting on beneficial mutations could have also reduced diversity in linked neutral regions, and may have contributed to lowering human Y chromosome diversity. Because the functional significance of the ampliconic regions is poorly understood, our findings should motivate future research in this area. PMID:24415951
Transformation Abilities: A Reanalysis and Confirmation of SOI Theory.
ERIC Educational Resources Information Center
Khattab, Ali-Maher; And Others
1987-01-01
Confirmatory factor analysis was used to reanalyze correlational data from selected variables in Guilford's Aptitudes Research Project. Results indicated Guilford's model reproduced the original correlation matrix more closely than other models. Most of Guilford's tests indicated high loadings on their hypothesized factors. (GDC)
Eyler, Lauren; Hubbard, Alan; Juillard, Catherine
2016-10-01
Low and middle-income countries (LMICs) and the world's poor bear a disproportionate share of the global burden of injury. Data regarding disparities in injury are vital to inform injury prevention and trauma systems strengthening interventions targeted towards vulnerable populations, but are limited in LMICs. We aim to facilitate injury disparities research by generating a standardized methodology for assessing economic status in resource-limited country trauma registries where complex metrics such as income, expenditures, and wealth index are infeasible to assess. To address this need, we developed a cluster analysis-based algorithm for generating simple population-specific metrics of economic status using nationally representative Demographic and Health Surveys (DHS) household assets data. For a limited number of variables, g, our algorithm performs weighted k-medoids clustering of the population using all combinations of g asset variables and selects the combination of variables and number of clusters that maximize average silhouette width (ASW). In simulated datasets containing both randomly distributed variables and "true" population clusters defined by correlated categorical variables, the algorithm selected the correct variable combination and appropriate cluster numbers unless variable correlation was very weak. When used with 2011 Cameroonian DHS data, our algorithm identified twenty economic clusters with ASW 0.80, indicating well-defined population clusters. This economic model for assessing health disparities will be used in the new Cameroonian six-hospital centralized trauma registry. By describing our standardized methodology and algorithm for generating economic clustering models, we aim to facilitate measurement of health disparities in other trauma registries in resource-limited countries. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
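The search described above (cluster on every combination of g asset variables, then keep the combination and cluster count with the highest average silhouette width) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it omits the survey weights, uses a Hamming distance on categorical codes, and all function names and the toy data are hypothetical.

```python
from itertools import combinations

def hamming(a, b, cols):
    # Number of selected categorical variables on which two households differ.
    return sum(a[c] != b[c] for c in cols)

def k_medoids(rows, cols, k, iters=20):
    # Unweighted k-medoids (assumption: the paper's version is weighted).
    n = len(rows)
    d = [[hamming(rows[i], rows[j], cols) for j in range(n)] for i in range(n)]
    medoids = [0]
    while len(medoids) < k:  # farthest-point initialisation
        medoids.append(max(range(n), key=lambda i: min(d[i][m] for m in medoids)))
    labels = [0] * n
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: d[i][medoids[c]]) for i in range(n)]
        updated = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if members:
                updated.append(min(members, key=lambda i: sum(d[i][j] for j in members)))
            else:
                updated.append(medoids[c])  # keep the medoid of an empty cluster
        if updated == medoids:
            break
        medoids = updated
    return labels, d

def average_silhouette_width(labels, d):
    n = len(labels)
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        return -1.0  # silhouette is undefined for a single cluster
    total = 0.0
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            continue  # singleton clusters contribute 0
        a = sum(d[i][j] for j in own) / len(own)
        b = min(
            sum(d[i][j] for j in members) / len(members)
            for members in (
                [j for j in range(n) if labels[j] == c]
                for c in clusters if c != labels[i]
            )
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / n

def best_model(rows, n_vars, g, k_values):
    # Exhaustive search over variable combinations and cluster counts.
    best = None
    for cols in combinations(range(n_vars), g):
        for k in k_values:
            labels, d = k_medoids(rows, cols, k)
            asw = average_silhouette_width(labels, d)
            if best is None or asw > best[0]:
                best = (asw, cols, k)
    return best

# Toy data: variables 0 and 1 define two "true" clusters, variable 2 is noise.
rows = [(0, 0, i % 3) for i in range(6)] + [(1, 1, i % 3) for i in range(6)]
best = best_model(rows, n_vars=3, g=2, k_values=(2, 3))
```

On the toy data the search settles on the two correlated variables and two clusters; with real DHS data the distance measure and weighting would need to follow the published method.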
Shi, Weimin; Zhang, Xiaoya; Shen, Qi
2010-01-01
Quantitative structure-activity relationship (QSAR) studies of the chemokine receptor 5 (CCR5) binding affinity of substituted 1-(3,3-diphenylpropyl)-piperidinyl amides and ureas and of the toxicity of aromatic compounds have been performed. Gene expression programming (GEP) was used to select variables and, simultaneously, to produce nonlinear QSAR models from the selected variables. In our GEP implementation, a simple and convenient method was proposed to infer the K-expression from the number of arguments of the function in a gene, without building the expression tree. The results were compared to those obtained by artificial neural network (ANN) and support vector machine (SVM). It has been demonstrated that GEP is a useful tool for QSAR modeling. Copyright 2009 Elsevier Masson SAS. All rights reserved.
Concave 1-norm group selection
Jiang, Dingfeng; Huang, Jian
2015-01-01
Grouping structures arise naturally in many high-dimensional problems. Incorporation of such information can improve model fitting and variable selection. Existing group selection methods, such as the group Lasso, require correct membership. However, in practice it can be difficult to correctly specify group membership of all variables. Thus, it is important to develop group selection methods that are robust against group mis-specification. Also, it is desirable to select groups as well as individual variables in many applications. We propose a class of concave 1-norm group penalties that is robust to grouping structure and can perform bi-level selection. A coordinate descent algorithm is developed to calculate solutions of the proposed group selection method. Theoretical convergence of the algorithm is proved under certain regularity conditions. Comparison with other methods suggests the proposed method is the most robust approach under membership mis-specification. Simulation studies and real data application indicate that the 1-norm concave group selection approach achieves better control of false discovery rates. An R package grppenalty implementing the proposed method is available at CRAN. PMID:25417206
Applying causal mediation analysis to personality disorder research.
Walters, Glenn D
2018-01-01
This article is designed to address fundamental issues in the application of causal mediation analysis to research on personality disorders. Causal mediation analysis is used to identify mechanisms of effect by testing variables as putative links between the independent and dependent variables. As such, it would appear to have relevance to personality disorder research. It is argued that proper implementation of causal mediation analysis requires that investigators take several factors into account. These factors are discussed under 5 headings: variable selection, model specification, significance evaluation, effect size estimation, and sensitivity testing. First, care must be taken when selecting the independent, dependent, mediator, and control variables for a mediation analysis. Some variables make better mediators than others and all variables should be based on reasonably reliable indicators. Second, the mediation model needs to be properly specified. This requires that the data for the analysis be prospectively or historically ordered and possess proper causal direction. Third, it is imperative that the significance of the identified pathways be established, preferably with a nonparametric bootstrap resampling approach. Fourth, effect size estimates should be computed or competing pathways compared. Finally, investigators employing the mediation method are advised to perform a sensitivity analysis. Additional topics covered in this article include parallel and serial multiple mediation designs, moderation, and the relationship between mediation and moderation. (PsycINFO Database Record (c) 2018 APA, all rights reserved).
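The significance step recommended above, a nonparametric bootstrap of the mediated pathway, can be sketched for a single-mediator linear model. This is a minimal sketch under assumptions not taken from the article: one mediator, ordinary least squares for both regressions, a percentile interval, and simulated data. All names are hypothetical.

```python
import random

def indirect_effect(x, m, y):
    # a: slope of M on X; b: slope of Y on M controlling for X.
    n = len(x)
    mx, mm, my = sum(x) / n, sum(m) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    smm = sum((mi - mm) ** 2 for mi in m)
    sxm = sum((xi - mx) * (mi - mm) for xi, mi in zip(x, m))
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    smy = sum((mi - mm) * (yi - my) for mi, yi in zip(m, y))
    a = sxm / sxx
    b = (sxx * smy - sxm * sxy) / (sxx * smm - sxm ** 2)
    return a * b

def bootstrap_ci(x, m, y, n_boot=1000, alpha=0.05, seed=0):
    # Percentile confidence interval from resampling cases with replacement.
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(indirect_effect(
            [x[i] for i in idx], [m[i] for i in idx], [y[i] for i in idx]))
    estimates.sort()
    lower = estimates[int(alpha / 2 * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Simulated data with a true indirect effect of 0.5 * 0.4 = 0.2 (illustrative only).
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(300)]
m = [0.5 * xi + rng.gauss(0, 1) for xi in x]
y = [0.4 * mi + 0.2 * xi + rng.gauss(0, 1) for xi, mi in zip(x, m)]

point = indirect_effect(x, m, y)
lower, upper = bootstrap_ci(x, m, y)
```

An interval that excludes zero is the usual evidence for a mediated pathway; effect size estimation and sensitivity analysis, the fourth and fifth points above, are separate steps that this sketch does not cover.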
Grant, Edward M.; Young, Deborah Rohm; Wu, Tong Tong
2015-01-01
We examined associations among longitudinal, multilevel variables and girls’ physical activity to determine the important predictors for physical activity change at different adolescent ages. The Trial of Activity for Adolescent Girls 2 study (Maryland) contributed participants from 8th (2009) to 11th grade (2011) (n=561). Questionnaires were used to obtain demographic and psychosocial information (individual- and social-level variables); height, weight, and triceps skinfold to assess body composition; interviews and surveys for school-level data; and self-report for neighborhood-level variables. Moderate to vigorous physical activity minutes were assessed from accelerometers. A doubly regularized linear mixed effects model was used for the longitudinal multilevel data to identify the most important covariates for physical activity. Three fixed effects at the individual level and one random effect at the school level were chosen from an initial total of 66 variables, consisting of 47 fixed effects and 19 random effects variables, in addition to the time effect. Self-management strategies, perceived barriers, and social support from friends were the three selected fixed effects, and whether intramural or interscholastic programs were offered in middle school was the selected random effect. Psychosocial factors and friend support, plus a school’s physical activity environment, affect adolescent girls’ moderate to vigorous physical activity longitudinally. PMID:25928064
Piovesan, Chaiana; Ardenghi, Thiago Machado; Mendes, Fausto Medeiros; Agostini, Bernardo Antonio; Michel-Crosato, Edgard
2017-03-30
The effect of contextual factors on dental care utilization was evaluated after adjustment for individual characteristics of Brazilian preschool children. This cross-sectional study assessed 639 preschool children aged 1 to 5 years from Santa Maria, a town in Rio Grande do Sul State, located in southern Brazil. Participants were randomly selected from children attending the National Children's Vaccination Day and 15 health centers were selected for this research. Visual examinations followed the ICDAS criteria. Parents answered a questionnaire about demographic and socioeconomic characteristics. Contextual influences on children's dental care utilization were obtained from two community-related variables: presence of dentists and presence of workers' associations in the neighborhood. Unadjusted and adjusted multilevel logistic regression models were used to describe the association between outcome and predictor variables. A prevalence of 21.6% was found for regular use of dental services. The unadjusted assessment of the associations of dental health care utilization with individual and contextual factors included children's ages, family income, parents' schooling, mothers' participation in their children's school activities, dental caries, and presence of workers' associations in the neighborhood as the main outcome covariates. Individual variables remained associated with the outcome after adding contextual variables in the model. In conclusion, individual and contextual variables were associated with dental health care utilization by preschool children.
Genomic selection for slaughter age in pigs using the Cox frailty model.
Santos, V S; Martins Filho, S; Resende, M D V; Azevedo, C F; Lopes, P S; Guimarães, S E F; Glória, L S; Silva, F F
2015-10-19
The aim of this study was to compare genomic selection methodologies using a linear mixed model and the Cox survival model. We used data from an F2 population of pigs, in which the response variable was the time in days from birth to the culling of the animal and the covariates were 238 markers [237 single nucleotide polymorphism (SNP) plus the halothane gene]. The data were corrected for fixed effects, and the accuracy of the method was determined based on the correlation of the ranks of predicted genomic breeding values (GBVs) in both models with the corrected phenotypic values. The analysis was repeated with a subset of SNP markers with largest absolute effects. The results were in agreement with the GBV prediction and the estimation of marker effects for both models for uncensored data and for normality. However, when considering censored data, the Cox model with a normal random effect (S1) was more appropriate. Since there was no agreement between the linear mixed model and the imputed data (L2) for the prediction of genomic values and the estimation of marker effects, the model S1 was considered superior as it took into account the latent variable and the censored data. Marker selection increased correlations between the ranks of predicted GBVs by the linear and Cox frailty models and the corrected phenotypic values, and 120 markers were required to increase the predictive ability for the characteristic analyzed.
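The accuracy measure described above, a correlation between the ranks of predicted GBVs and the corrected phenotypic values, can be illustrated with a Spearman rank correlation. The abstract does not name the exact statistic, so treating it as Spearman's coefficient is an assumption, and the numbers below are made up for illustration.

```python
def ranks(values):
    # Average ranks (1-based), with ties sharing their mean rank.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = average
        i = j + 1
    return result

def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5

def spearman(a, b):
    # Spearman's coefficient is the Pearson correlation of the ranks.
    return pearson(ranks(a), ranks(b))

# Hypothetical predicted GBVs and corrected slaughter ages (days).
predicted_gbv = [0.10, 0.40, 0.35, 0.80]
corrected_age = [150.0, 170.0, 160.0, 190.0]
rho = spearman(predicted_gbv, corrected_age)
```

Comparing this coefficient across the linear mixed model and the Cox frailty model, with and without marker selection, mirrors the comparison reported in the abstract.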
Hemmateenejad, Bahram; Yazdani, Mahdieh
2009-02-16
Steroids are widely distributed in nature and are found in plants, animals, and fungi in abundance. A data set consisting of a diverse set of steroids was used to develop quantitative structure-electrochemistry relationship (QSER) models for their half-wave reduction potential. Modeling was established by means of multiple linear regression (MLR) and principal component regression (PCR) analyses. In MLR analysis, the QSPR models were constructed by first grouping descriptors and then stepwise selection of variables from each group (MLR1) and by stepwise selection of predictor variables from the pool of all calculated descriptors (MLR2). A similar procedure was used in PCR analysis, so that the principal components (or features) were extracted from different groups of descriptors (PCR1) and from the entire set of descriptors (PCR2). The resulting models were evaluated using cross-validation, chance correlation, application to the prediction of the reduction potential of some test samples, and assessment of the applicability domain. Both MLR approaches gave accurate results; however, the QSPR model found by MLR1 was statistically more significant. The PCR1 approach produced a model as accurate as the MLR approaches, whereas less accurate results were obtained by the PCR2 approach. Overall, the correlation coefficients of cross-validation and prediction of the QSPR models resulting from the MLR1, MLR2 and PCR1 approaches were higher than 90%, which shows the high ability of the models to predict the reduction potential of the studied steroids.
Quantifying natural delta variability using a multiple-point geostatistics prior uncertainty model
NASA Astrophysics Data System (ADS)
Scheidt, Céline; Fernandes, Anjali M.; Paola, Chris; Caers, Jef
2016-10-01
We address the question of quantifying uncertainty associated with autogenic pattern variability in a channelized transport system by means of a modern geostatistical method. This question has considerable relevance for practical subsurface applications as well, particularly those related to uncertainty quantification relying on Bayesian approaches. Specifically, we show how the autogenic variability in a laboratory experiment can be represented and reproduced by a multiple-point geostatistical prior uncertainty model. The latter geostatistical method requires selection of a limited set of training images from which a possibly infinite set of geostatistical model realizations, mimicking the training image patterns, can be generated. To that end, we investigate two methods to determine how many training images and what training images should be provided to reproduce natural autogenic variability. The first method relies on distance-based clustering of overhead snapshots of the experiment; the second method relies on a rate of change quantification by means of a computer vision algorithm termed the demon algorithm. We show quantitatively that with either training image selection method, we can statistically reproduce the natural variability of the delta formed in the experiment. In addition, we study the nature of the patterns represented in the set of training images as a representation of the "eigenpatterns" of the natural system. The eigenpatterns in the training image sets display patterns consistent with previous physical interpretations of the fundamental modes of this type of delta system: a highly channelized, incisional mode; a poorly channelized, depositional mode; and an intermediate mode between the two.
Prediction of Baseflow Index of Catchments using Machine Learning Algorithms
NASA Astrophysics Data System (ADS)
Yadav, B.; Hatfield, K.
2017-12-01
We present the results of eight machine learning techniques for predicting the baseflow index (BFI) of ungauged basins using a surrogate of catchment scale climate and physiographic data. The tested algorithms include ordinary least squares, ridge regression, least absolute shrinkage and selection operator (lasso), elastic net, support vector machine, gradient boosted regression trees, random forests, and extremely randomized trees. Our work seeks to identify the dominant controls of BFI that can be readily obtained from ancillary geospatial databases and remote sensing measurements, such that the developed techniques can be extended to ungauged catchments. More than 800 gauged catchments spanning the continental United States were selected to develop the general methodology. The BFI calculation was based on the baseflow separated from the daily streamflow hydrograph using the HYSEP filter. The surrogate catchment attributes were compiled from multiple sources including digital elevation model, soil, land use, and climate data, as well as other publicly available ancillary and geospatial data. 80% of the catchments were used to train the ML algorithms, and the remaining 20% of the catchments were used as an independent test set to measure the generalization performance of fitted models. A k-fold cross-validation using exhaustive grid search was used to fit the hyperparameters of each model. Initial model development was based on 19 independent variables, but after variable selection and feature ranking, we generated revised sparse models of BFI prediction that are based on only six catchment attributes. These key predictive variables selected after the careful evaluation of bias-variance tradeoff include average catchment elevation, slope, fraction of sand, permeability, temperature, and precipitation. 
The most promising algorithms exceeding an accuracy score (r-square) of 0.7 on test data include support vector machine, gradient boosted regression trees, random forests, and extremely randomized trees. Considering both the accuracy and the computational complexity of these algorithms, we identify the extremely randomized trees as the best performing algorithm for BFI prediction in ungauged basins.
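The evaluation workflow in this abstract (an 80/20 train/test split, k-fold cross-validation with an exhaustive grid search over hyperparameters, then an r-square score on the held-out 20%) can be sketched in a few lines. To stay self-contained, the sketch swaps in a one-predictor ridge regression without intercept for the tree ensembles used in the study; the synthetic data, the penalty grid and all names are assumptions for illustration only.

```python
import random

def fit_ridge(xs, ys, lam):
    # Closed-form ridge slope for a single predictor without intercept.
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(w, xs, ys):
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cv_score(xs, ys, lam, k=5):
    # Mean validation error over k folds (fold i holds out every k-th sample).
    n = len(xs)
    errors = []
    for fold in range(k):
        held_out = set(range(fold, n, k))
        train = [i for i in range(n) if i not in held_out]
        w = fit_ridge([xs[i] for i in train], [ys[i] for i in train], lam)
        errors.append(mse(w, [xs[i] for i in held_out], [ys[i] for i in held_out]))
    return sum(errors) / k

def r_squared(w, xs, ys):
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - w * x) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Synthetic stand-in for catchment data.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(200)]
ys = [2.0 * x + rng.gauss(0, 0.3) for x in xs]

split = int(0.8 * len(xs))                     # 80% train, 20% independent test
train_x, test_x = xs[:split], xs[split:]
train_y, test_y = ys[:split], ys[split:]

grid = [0.0, 0.1, 1.0, 10.0, 100.0]            # exhaustive grid over the penalty
best_lam = min(grid, key=lambda lam: cv_score(train_x, train_y, lam))
w = fit_ridge(train_x, train_y, best_lam)
test_r2 = r_squared(w, test_x, test_y)
```

The test set is touched only once, after the hyperparameter has been chosen on the training folds, which is what makes the final score a measure of generalization.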
Genome-Wide Association Analysis of Adaptation Using Environmentally Predicted Traits.
van Heerwaarden, Joost; van Zanten, Martijn; Kruijer, Willem
2015-10-01
Current methods for studying the genetic basis of adaptation evaluate genetic associations with ecologically relevant traits or single environmental variables, under the implicit assumption that natural selection imposes correlations between phenotypes, environments and genotypes. In practice, observed trait and environmental data are manifestations of unknown selective forces and are only indirectly associated with adaptive genetic variation. In theory, improved estimation of these forces could enable more powerful detection of loci under selection. Here we present an approach in which we approximate adaptive variation by modeling phenotypes as a function of the environment and using the predicted trait in multivariate and univariate genome-wide association analysis (GWAS). Based on computer simulations and published flowering time data from the model plant Arabidopsis thaliana, we find that environmentally predicted traits lead to higher recovery of functional loci in multivariate GWAS and are more strongly correlated to allele frequencies at adaptive loci than individual environmental variables. Our results provide an example of the use of environmental data to obtain independent and meaningful information on adaptive genetic variation.
The evolution of parental cooperation in birds.
Remeš, Vladimír; Freckleton, Robert P; Tökölyi, Jácint; Liker, András; Székely, Tamás
2015-11-03
Parental care is one of the most variable social behaviors and it is an excellent model system to understand cooperation between unrelated individuals. Three major hypotheses have been proposed to explain the extent of parental cooperation: sexual selection, social environment, and environmental harshness. Using the most comprehensive dataset on parental care that includes 659 bird species from 113 families covering both uniparental and biparental taxa, we show that the degree of parental cooperation is associated with both sexual selection and social environment. Consistent with recent theoretical models, parental cooperation decreases with the intensity of sexual selection and with skewed adult sex ratios. These effects are additive and robust to the influence of life-history variables. However, parental cooperation is unrelated to environmental factors (measured at the scale of whole species ranges) as indicated by a lack of consistent relationship with ambient temperature, rainfall or their fluctuations within and between years. These results highlight the significance of social effects for parental cooperation and suggest that several parental strategies may coexist in a given ambient environment.
Using maximum entropy modeling for optimal selection of sampling sites for monitoring networks
Stohlgren, Thomas J.; Kumar, Sunil; Barnett, David T.; Evangelista, Paul H.
2011-01-01
Environmental monitoring programs must efficiently describe state shifts. We propose using maximum entropy modeling to select dissimilar sampling sites to capture environmental variability at low cost, and demonstrate a specific application: sample site selection for the Central Plains domain (453,490 km2) of the National Ecological Observatory Network (NEON). We relied on four environmental factors: mean annual temperature and precipitation, elevation, and vegetation type. A “sample site” was defined as a 20 km × 20 km area (equal to NEON’s airborne observation platform [AOP] footprint), within which each 1 km2 cell was evaluated for each environmental factor. After each model run, the most environmentally dissimilar site was selected from all potential sample sites. The iterative selection of eight sites captured approximately 80% of the environmental envelope of the domain, an improvement over stratified random sampling and simple random designs for sample site selection. This approach can be widely used for cost-efficient selection of survey and monitoring sites.
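The iterative idea in this abstract, repeatedly adding the candidate site most dissimilar from those already chosen, can be illustrated with a much simpler stand-in. The sketch below is not maximum entropy modeling: it uses a greedy max-min Euclidean distance rule on standardized environmental factors, and the starting rule, function name and toy coordinates are assumptions.

```python
def select_sites(sites, n_select):
    # sites: list of tuples of standardized environmental factors per candidate site.
    def distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

    # Start from the site closest to the domain average (an arbitrary choice).
    centroid = tuple(sum(column) / len(sites) for column in zip(*sites))
    chosen = [min(range(len(sites)), key=lambda i: distance(sites[i], centroid))]

    # Repeatedly add the candidate farthest from its nearest chosen site.
    while len(chosen) < n_select:
        candidates = [i for i in range(len(sites)) if i not in chosen]
        chosen.append(max(
            candidates,
            key=lambda i: min(distance(sites[i], sites[j]) for j in chosen)))
    return chosen

# Toy candidates described by two standardized factors.
sites = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (-5.0, 5.0), (0.0, -6.0)]
picked = select_sites(sites, 3)
```

In the toy example the near-duplicate site at (0.1, 0.0) is skipped in favour of sites at the edges of the environmental space, which is the behaviour the abstract attributes to dissimilarity-based selection.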
A practical approach to Sasang constitutional diagnosis using vocal features
2013-01-01
Background Sasang constitutional medicine (SCM) is a type of tailored medicine that divides human beings into four Sasang constitutional (SC) types. Diagnosis of SC types is crucial to proper treatment in SCM. Voice characteristics have been used as an essential clue for diagnosing SC types. In the past, many studies tried to extract quantitative vocal features to make diagnosis models; however, these studies were flawed by limited data collected from one or a few sites, long recording time, and low accuracy. We propose a practical diagnosis model having only a few variables, which decreases model complexity. This in turn, makes our model appropriate for clinical applications. Methods A total of 2,341 participants’ voice recordings were used in making a SC classification model and to test the generalization ability of the model. Although the voice data consisted of five vowels and two repeated sentences per participant, we used only the sentence part for our study. A total of 21 features were extracted, and an advanced feature selection method—the least absolute shrinkage and selection operator (LASSO)—was applied to reduce the number of variables for classifier learning. A SC classification model was developed using multinomial logistic regression via LASSO. Results We compared the proposed classification model to the previous study, which used both sentences and five vowels from the same patient’s group. The classification accuracies for the test set were 47.9% and 40.4% for male and female, respectively. Our result showed that the proposed method was superior to the previous study in that it required shorter voice recordings, is more applicable to practical use, and had better generalization performance. Conclusions We proposed a practical SC classification method and showed that our model having fewer variables outperformed the model having many variables in the generalization test. 
We attempted to reduce the number of variables in two ways: 1) the initial number of candidate features was decreased by considering shorter voice recording, and 2) LASSO was introduced for reducing model complexity. The proposed method is suitable for an actual clinical environment. Moreover, we expect it to yield more stable results because of the model’s simplicity. PMID:24200041
NASA Astrophysics Data System (ADS)
Yasa, I. B. A.; Parnata, I. K.; Susilawati, N. L. N. A. S.
2018-01-01
This study aims to apply an analytical review model to analyze the influence of GCG, accounting conservatism, financial distress models and company size on the good and poor financial performance of LPDs in Bangli Regency. Ordinal regression analysis is used to perform the analytical review, yielding the influence of and the relationships between the variables to be considered for further audit. The respondents in this study were the LPDs in Bangli Regency, which number 159; of these, 100 LPDs were randomly selected as the sample. The test results show that GCG and company size have a significant effect on both good and poor financial performance, while conservatism and the financial distress model have no significant effect. The four variables together account for 58.8% of overall financial performance, while the remaining 41.2% is influenced by other variables. Size, FDM and accounting conservatism are the variables recommended for further audit.
NASA Technical Reports Server (NTRS)
Abbas, Khaled A.; Fattah, Nabil Abdel; Reda, Hala R.
2003-01-01
This research is concerned with developing passenger demand models for international aviation from/to Egypt. In this context, the aviation sector in Egypt is represented by its biggest and main airport, namely Cairo airport, as well as by the main Egyptian international air carrier, namely Egyptair. The developed models utilize two variables to represent aviation demand, namely the total number of international flights originating from and attracted to Cairo airport and the total number of passengers using Egyptair international flights originating from and attracted to Cairo airport. These demand variables were related, using different functional forms, to several explanatory variables including population, GDP and number of foreign tourists. Finally, two models were selected based on their logical acceptability, best fit and statistical significance. To demonstrate the usefulness of the developed models, they were used to forecast future demand patterns.
NASA Astrophysics Data System (ADS)
Dieppois, B.; Pohl, B.; Eden, J.; Crétat, J.; Rouault, M.; Keenlyside, N.; New, M. G.
2017-12-01
The water management community has hitherto neglected or underestimated many of the uncertainties in climate impact scenarios, in particular, uncertainties associated with decadal climate variability. Uncertainty in the state-of-the-art global climate models (GCMs) is time-scale-dependent, e.g. stronger at decadal than at interannual timescales, in response to the different parameterizations and to internal climate variability. In addition, non-stationarity in statistical downscaling is widely recognized as a key problem, in which time-scale dependency of predictors plays an important role. As with global climate modelling, therefore, the selection of downscaling methods must proceed with caution to avoid unintended consequences of over-correcting the noise in GCMs (e.g. interpreting internal climate variability as a model bias). GCM outputs from the Coupled Model Intercomparison Project 5 (CMIP5) have therefore first been selected based on their ability to reproduce southern African summer rainfall variability and their teleconnections with Pacific sea-surface temperature across the dominant timescales. In observations, southern African summer rainfall has recently been shown to exhibit significant periodicities at the interannual (2-8 years), quasi-decadal (8-13 years) and inter-decadal (15-28 years) timescales, which can be interpreted as the signature of ENSO, the IPO, and the PDO over the region. Most CMIP5 GCMs underestimate southern African summer rainfall variability and their teleconnections with Pacific SSTs at these three timescales. In addition, according to a more in-depth analysis of historical and pi-control runs, this bias might result from internal climate variability in some of the CMIP5 GCMs, suggesting potential for bias-corrected prediction based on empirical statistical downscaling. 
A multi-timescale regression based downscaling procedure, which determines the predictors across the different timescales, has thus been used to simulate southern African summer rainfall. This multi-timescale procedure shows much better skill in simulating decadal timescales of variability compared to commonly used statistical downscaling approaches.
Johnson, Jason K.; Oyen, Diane Adele; Chertkov, Michael; ...
2016-12-01
Inference and learning of graphical models are both well-studied problems in statistics and machine learning that have found many applications in science and engineering. However, exact inference is intractable in general graphical models, which suggests the problem of seeking the best approximation to a collection of random variables within some tractable family of graphical models. In this paper, we focus on the class of planar Ising models, for which exact inference is tractable using techniques of statistical physics. Based on these techniques and recent methods for planarity testing and planar embedding, we propose a greedy algorithm for learning the best planar Ising model to approximate an arbitrary collection of binary random variables (possibly from sample data). Given the set of all pairwise correlations among variables, we select a planar graph and optimal planar Ising model defined on this graph to best approximate that set of correlations. Finally, we demonstrate our method in simulations and for two applications: modeling senate voting records and identifying geo-chemical depth trends from Mars rover data.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Johnson, Jason K.; Oyen, Diane Adele; Chertkov, Michael
Inference and learning of graphical models are both well-studied problems in statistics and machine learning that have found many applications in science and engineering. However, exact inference is intractable in general graphical models, which suggests the problem of seeking the best approximation to a collection of random variables within some tractable family of graphical models. In this paper, we focus on the class of planar Ising models, for which exact inference is tractable using techniques of statistical physics. Based on these techniques and recent methods for planarity testing and planar embedding, we propose a greedy algorithm for learning the best planar Ising model to approximate an arbitrary collection of binary random variables (possibly from sample data). Given the set of all pairwise correlations among variables, we select a planar graph and optimal planar Ising model defined on this graph to best approximate that set of correlations. Finally, we demonstrate our method in simulations and for two applications: modeling senate voting records and identifying geo-chemical depth trends from Mars rover data.
NASA Astrophysics Data System (ADS)
Boutsia, K.; Leibundgut, B.; Trevese, D.; Vagnetti, F.
2009-04-01
Context: Supermassive black holes with masses of 10^5-10^9 M⊙ are believed to inhabit most, if not all, nuclear regions of galaxies, and both observational evidence and theoretical models suggest a scenario where galaxy and black hole evolution are tightly related. Luminous AGNs are usually selected by their non-stellar colours or their X-ray emission. Colour selection cannot be used to select low-luminosity AGNs, since their emission is dominated by the host galaxy. Objects with low X-ray to optical ratio escape even the deepest X-ray surveys performed so far. In a previous study we presented a sample of candidates selected through optical variability in the Chandra Deep Field South, where repeated optical observations were performed in the framework of the STRESS supernova survey. Aims: The analysis is devoted to breaking down the sample into AGNs, starburst galaxies, and low-ionisation narrow-emission line objects, to providing new information about the possible dependence of the emission mechanisms on nuclear luminosity and black-hole mass, and eventually studying the evolution in cosmic time of the different populations. Methods: We obtained new optical spectroscopy for a sample of variability selected candidates with the ESO NTT telescope. We analysed the new spectra, together with those existing in the literature and studied the distribution of the objects in U-B and B-V colours, optical and X-ray luminosity, and variability amplitude. Results: A large fraction (17/27) of the observed candidates are broad-line luminous AGNs, confirming the efficiency of variability in detecting quasars. We detect: i) extended objects which would have escaped the colour selection and ii) objects of very low X-ray to optical ratio, in a few cases without any X-ray detection at all. Several objects turned out to be narrow-emission line galaxies where variability indicates nuclear activity, while no emission lines were detected in others.
Some of these galaxies have variability and X-ray to optical ratio close to active galactic nuclei, while others have much lower variability and X-ray to optical ratio. This result can be explained by the dilution of the nuclear light due to the host galaxy. Conclusions: Our results demonstrate the effectiveness of supernova search programmes to detect large samples of low-luminosity AGNs. A sizable fraction of the AGN in our variability sample had escaped X-ray detection (5/47) and/or colour selection (9/48). Spectroscopic follow-up to fainter flux limits is strongly encouraged. Based on observations collected at the European Southern Observatory, Chile, 080.B-0187(A).
Mueller, Martina; Wagner, Carol L; Annibale, David J; Knapp, Rebecca G; Hulsey, Thomas C; Almeida, Jonas S
2006-03-01
Approximately 30% of intubated preterm infants with respiratory distress syndrome (RDS) will fail attempted extubation, requiring reintubation and mechanical ventilation. Although ventilator technology and monitoring of premature infants have improved over time, optimal extubation remains challenging. Furthermore, extubation decisions for premature infants require complex informational processing, techniques implicitly learned through clinical practice. Computer-aided decision-support tools would benefit inexperienced clinicians, especially during peak neonatal intensive care unit (NICU) census. A five-step procedure was developed to identify predictive variables. Clinical expert (CE) thought processes comprised one model. Variables from that model were used to develop two mathematical models for the decision-support tool: an artificial neural network (ANN) and a multivariate logistic regression model (MLR). The ranking of the variables in the three models was compared using the Wilcoxon Signed Rank Test. The best performing model was used in a web-based decision-support tool with a user interface implemented in Hypertext Markup Language (HTML) and the mathematical model employing the ANN. CEs identified 51 potentially predictive variables for extubation decisions for an infant on mechanical ventilation. Comparisons of the three models showed a significant difference between the ANN and the CE (p = 0.0006). Of the original 51 potentially predictive variables, the 13 most predictive variables were used to develop an ANN as a web-based decision-tool. The ANN processes user-provided data and returns the prediction 0-1 score and a novelty index. The user then selects the most appropriate threshold for categorizing the prediction as a success or failure. 
Furthermore, the novelty index, indicating the similarity of the test case to the training case, allows the user to assess the confidence level of the prediction with regard to how much the new data differ from the data originally used for the development of the prediction tool. State-of-the-art machine-learning methods can be employed for the development of sophisticated tools to aid clinicians' decisions. We identified numerous variables considered relevant for extubation decisions for mechanically ventilated premature infants with RDS. We then developed a web-based decision-support tool for clinicians which can be made widely available and potentially improve patient care worldwide.
From Metaphors to Formalism: A Heuristic Approach to Holistic Assessments of Ecosystem Health.
Fock, Heino O; Kraus, Gerd
2016-01-01
Environmental policies employ metaphoric objectives such as ecosystem health, resilience and sustainable provision of ecosystem services, which influence corresponding sustainability assessments by means of normative settings such as assumptions on system description, indicator selection, aggregation of information and target setting. A heuristic approach is developed for sustainability assessments to avoid ambiguity and applications to the EU Marine Strategy Framework Directive (MSFD) and OSPAR assessments are presented. For MSFD, nineteen different assessment procedures have been proposed, but at present no agreed assessment procedure is available. The heuristic assessment framework is a functional-holistic approach comprising an ex-ante/ex-post assessment framework with specifically defined normative and systemic dimensions (EAEPNS). The outer normative dimension defines the ex-ante/ex-post framework, of which the latter branch delivers one measure of ecosystem health based on indicators and the former allows to account for the multi-dimensional nature of sustainability (social, economic, ecological) in terms of modeling approaches. For MSFD, the ex-ante/ex-post framework replaces the current distinction between assessments based on pressure and state descriptors. The ex-ante and the ex-post branch each comprise an inner normative and a systemic dimension. The inner normative dimension in the ex-post branch considers additive utility models and likelihood functions to standardize variables normalized with Bayesian modeling. Likelihood functions allow precautionary target setting. The ex-post systemic dimension considers a posteriori indicator selection by means of analysis of indicator space to avoid redundant indicator information as opposed to a priori indicator selection in deconstructive-structural approaches. Indicator information is expressed in terms of ecosystem variability by means of multivariate analysis procedures. 
The application to the OSPAR assessment for the southern North Sea showed that, with the selected 36 indicators, 48% of ecosystem variability could be explained. Tools for the ex-ante branch are risk and ecosystem models with the capability to analyze trade-offs, generating model output for each of the pressure chains to allow for a phasing-out of human pressures. The Bayesian measure of ecosystem health is sensitive to trends in environmental features, but robust to ecosystem variability in line with state space models. The combination of the ex-ante and ex-post branch is essential to evaluate ecosystem resilience and to adopt adaptive management. Based on requirements of the heuristic approach, three possible developments of this concept can be envisioned, i.e. a governance-driven approach built upon participatory processes, a science-driven functional-holistic approach requiring extensive monitoring to analyze complete ecosystem variability, and an approach with emphasis on ex-ante modeling and ex-post assessment of well-studied subsystems.
From Metaphors to Formalism: A Heuristic Approach to Holistic Assessments of Ecosystem Health
Kraus, Gerd
2016-01-01
Environmental policies employ metaphoric objectives such as ecosystem health, resilience and sustainable provision of ecosystem services, which influence corresponding sustainability assessments by means of normative settings such as assumptions on system description, indicator selection, aggregation of information and target setting. A heuristic approach is developed for sustainability assessments to avoid ambiguity and applications to the EU Marine Strategy Framework Directive (MSFD) and OSPAR assessments are presented. For MSFD, nineteen different assessment procedures have been proposed, but at present no agreed assessment procedure is available. The heuristic assessment framework is a functional-holistic approach comprising an ex-ante/ex-post assessment framework with specifically defined normative and systemic dimensions (EAEPNS). The outer normative dimension defines the ex-ante/ex-post framework, of which the latter branch delivers one measure of ecosystem health based on indicators and the former allows to account for the multi-dimensional nature of sustainability (social, economic, ecological) in terms of modeling approaches. For MSFD, the ex-ante/ex-post framework replaces the current distinction between assessments based on pressure and state descriptors. The ex-ante and the ex-post branch each comprise an inner normative and a systemic dimension. The inner normative dimension in the ex-post branch considers additive utility models and likelihood functions to standardize variables normalized with Bayesian modeling. Likelihood functions allow precautionary target setting. The ex-post systemic dimension considers a posteriori indicator selection by means of analysis of indicator space to avoid redundant indicator information as opposed to a priori indicator selection in deconstructive-structural approaches. Indicator information is expressed in terms of ecosystem variability by means of multivariate analysis procedures. 
The application to the OSPAR assessment for the southern North Sea showed that, with the selected 36 indicators, 48% of ecosystem variability could be explained. Tools for the ex-ante branch are risk and ecosystem models with the capability to analyze trade-offs, generating model output for each of the pressure chains to allow for a phasing-out of human pressures. The Bayesian measure of ecosystem health is sensitive to trends in environmental features, but robust to ecosystem variability in line with state space models. The combination of the ex-ante and ex-post branch is essential to evaluate ecosystem resilience and to adopt adaptive management. Based on requirements of the heuristic approach, three possible developments of this concept can be envisioned, i.e. a governance-driven approach built upon participatory processes, a science-driven functional-holistic approach requiring extensive monitoring to analyze complete ecosystem variability, and an approach with emphasis on ex-ante modeling and ex-post assessment of well-studied subsystems. PMID:27509185
Xue, Hongqi; Wu, Shuang; Wu, Yichao; Ramirez Idarraga, Juan C; Wu, Hulin
2018-05-02
Mechanism-driven low-dimensional ordinary differential equation (ODE) models are often used to model viral dynamics at cellular levels and epidemics of infectious diseases. However, low-dimensional mechanism-based ODE models are limited for modeling infectious diseases at molecular levels such as transcriptomic or proteomic levels, which is critical to understand pathogenesis of diseases. Although linear ODE models have been proposed for gene regulatory networks (GRNs), nonlinear regulations are common in GRNs. The reconstruction of large-scale nonlinear networks from time-course gene expression data remains an unresolved issue. Here, we use high-dimensional nonlinear additive ODEs to model GRNs and propose a 4-step procedure to efficiently perform variable selection for nonlinear ODEs. To tackle the challenge of high dimensionality, we couple the 2-stage smoothing-based estimation method for ODEs and a nonlinear independence screening method to perform variable selection for the nonlinear ODE models. We have shown that our method possesses the sure screening property and it can handle problems with non-polynomial dimensionality. Numerical performance of the proposed method is illustrated with simulated data and a real data example for identifying the dynamic GRN of Saccharomyces cerevisiae. Copyright © 2018 John Wiley & Sons, Ltd.
Arnold, Megan A; Newland, M Christopher
2018-06-16
Behavioral inflexibility is often assessed using reversal learning tasks, which require a relatively low degree of response variability. No studies have assessed sensitivity to reinforcement contingencies that specifically select highly variable response patterns in mice, let alone in models of neurodevelopmental disorders involving limited response variation. Operant variability and incremental repeated acquisition (IRA) were used to assess unique aspects of behavioral variability of two mouse strains: BALB/c, a model of some deficits in ASD, and C57Bl/6. On the operant variability task, BALB/c mice responded more repetitively during adolescence than C57Bl/6 mice when reinforcement did not require variability but responded more variably when reinforcement required variability. During IRA testing in adulthood, both strains acquired an unchanging performance sequence equally well. Strain differences emerged, however, after novel learning sequences began alternating with the performance sequence: BALB/c mice substantially outperformed C57Bl/6 mice. Using litter-mate controls, it was found that adolescent experience with variability did not affect either learning or performance on the IRA task in adulthood. These findings constrain the use of BALB/c mice as a model of ASD, but once again reveal that this strain is highly sensitive to reinforcement contingencies and that these mice are fast and robust learners. Copyright © 2018. Published by Elsevier B.V.
Data driven model generation based on computational intelligence
NASA Astrophysics Data System (ADS)
Gemmar, Peter; Gronz, Oliver; Faust, Christophe; Casper, Markus
2010-05-01
The simulation of discharges at a local gauge and the modeling of large-scale river catchments are central to estimation and decision tasks in hydrological research and in practical applications such as flood prediction or water resource management. However, modeling such processes using analytical or conceptual approaches is made difficult by both the complexity of the process relations and the heterogeneity of the processes. It has been shown many times that unknown or assumed process relations can in principle be described by computational methods, and that system models can be derived automatically from observed behavior or measured process data. This study describes the development of hydrological process models using computational methods, including Fuzzy logic and artificial neural networks (ANN), in a comprehensive and automated manner. Methods: We consider a closed concept for data-driven development of hydrological models based on measured (experimental) data. The concept is centered on a Fuzzy system using rules of Takagi-Sugeno-Kang type, which formulate the input-output relation in a generic structure like R_i: IF q(t) = low AND ... THEN q(t+Δt) = a_i0 + a_i1 q(t) + a_i2 p(t-Δt_i1) + a_i3 p(t+Δt_i2) + .... The rule's premise part (IF) describes process states involving available process information, e.g. current outlet q(t) is low, where low is one of several Fuzzy sets defined over variable q(t). The rule's conclusion (THEN) estimates the expected outlet q(t+Δt) by a linear function over selected system variables, e.g. current outlet q(t) and previous and/or forecasted precipitation p(t ± Δt_ik). In the case of river catchment modeling we use head gauges, tributary and upriver gauges in the conclusion part as well. In addition, we consider temperature and temporal (season) information in the premise part. By creating a set of rules R = {R_i | i = 1,...,N}, the space of process states can be covered as concisely as necessary. Model adaptation is achieved by finding an optimal set A = (a_ij) of conclusion parameters with respect to a defined rating function and experimental data. To find A, we use, for example, a linear equation solver and an RMSE function. In practical process models, the number of Fuzzy sets and the corresponding number of rules is fairly low. Nevertheless, creating the optimal model requires some experience. Therefore, we improved this development step with methods for the automatic generation of Fuzzy sets, rules, and conclusions. Model performance depends to a great extent on the selection of the conclusion variables. The aim is that the variables with the most influence on the system reaction are considered and superfluous ones are neglected. At first, we use Kohonen maps, a specialized ANN, to identify relevant input variables from the large set of available system variables. A greedy algorithm selects a comprehensive set of dominant and uncorrelated variables. Next, the premise variables are analyzed with clustering methods (e.g. Fuzzy-C-means), and Fuzzy sets are then derived from cluster centers and outlines. The rule base is automatically constructed by permutation of the Fuzzy sets of the premise variables. Finally, the conclusion parameters are calculated and the total coverage of the input space is iteratively tested with experimental data: rarely firing rules are combined, and coarse coverage of sensitive process states results in refined Fuzzy sets and rules. Results: The described methods were implemented and integrated in a development system for process models. A series of models has already been built, e.g. for rainfall-runoff modeling or for flood prediction (up to 72 hours) in river catchments. The models required significantly less development effort and showed advanced simulation results compared to conventional models. The models can be used operationally, and simulation takes only some minutes on a standard PC, e.g. for a gauge forecast (up to 72 hours) for the whole Mosel (Germany) river catchment.
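The rule structure described in the abstract above can be illustrated with a minimal sketch of Takagi-Sugeno-Kang inference. Everything concrete here is an assumption made for illustration and is not taken from the study: the Gaussian membership functions, the two rules, their centers and widths, and the coefficient values are hypothetical. Each rule's premise yields a firing strength from the current outlet q(t), each conclusion is a linear function of q(t) and precipitation p, and the prediction is the firing-strength-weighted average of the rule conclusions.

```python
import math

def gauss(x, center, width):
    """Gaussian membership degree of x in a fuzzy set (assumed shape)."""
    return math.exp(-0.5 * ((x - center) / width) ** 2)

# Hypothetical rule base: each rule has one fuzzy set over q(t), given as
# (center, width), and conclusion coefficients (a_i0, a_i1, a_i2).
RULES = [
    {"set": (1.0, 1.0), "coef": (0.1, 0.9, 0.05)},  # IF q(t) is low
    {"set": (5.0, 1.5), "coef": (0.5, 0.8, 0.30)},  # IF q(t) is high
]

def tsk_predict(q, p, rules=RULES):
    """Predict q(t + dt) as the weighted average of linear rule conclusions."""
    weights = [gauss(q, *rule["set"]) for rule in rules]
    outputs = [a0 + a1 * q + a2 * p
               for (a0, a1, a2) in (rule["coef"] for rule in rules)]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, outputs)) / total

if __name__ == "__main__":
    # Example call with an arbitrary outlet value and precipitation value.
    print(tsk_predict(3.0, 2.0))
```

Because the conclusions are linear in the parameters a_ij once the firing strengths are fixed, fitting them to data reduces to a linear least-squares problem, which is consistent with the abstract's mention of a linear equation solver.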
A Robust Adaptive Autonomous Approach to Optimal Experimental Design
NASA Astrophysics Data System (ADS)
Gu, Hairong
Experimentation is the fundamental tool of scientific inquiry for understanding the laws governing nature and human behavior. In many complex real-world experimental scenarios, particularly those in quest of prediction accuracy, it is difficult to conduct experiments using an existing experimental procedure, for the following two reasons. First, the existing experimental procedures require a parametric model to serve as the proxy of the latent data structure or data-generating mechanism at the beginning of an experiment. However, for those experimental scenarios of concern, a sound model is often unavailable before an experiment. Second, those experimental scenarios usually contain a large number of design variables, which potentially leads to a lengthy and costly data collection cycle. Moreover, the existing experimental procedures are unable to optimize large-scale experiments so as to minimize the experimental length and cost. Facing the two challenges in those experimental scenarios, the aim of the present study is to develop a new experimental procedure that allows an experiment to be conducted without the assumption of a parametric model while still achieving satisfactory prediction, and performs optimization of experimental designs to improve the efficiency of an experiment. The new experimental procedure developed in the present study is named robust adaptive autonomous system (RAAS). RAAS is a procedure for sequential experiments composed of multiple experimental trials, which performs function estimation, variable selection, reverse prediction and design optimization on each trial.
Directly addressing the challenges in those experimental scenarios of concern, function estimation and variable selection are performed by data-driven modeling methods to generate a predictive model from data collected during the course of an experiment, thus exempting the requirement of a parametric model at the beginning of an experiment; design optimization is performed to select experimental designs on the fly of an experiment based on their usefulness so that fewest designs are needed to reach useful inferential conclusions. Technically, function estimation is realized by Bayesian P-splines, variable selection is realized by Bayesian spike-and-slab prior, reverse prediction is realized by grid-search and design optimization is realized by the concepts of active learning. The present study demonstrated that RAAS achieves statistical robustness by making accurate predictions without the assumption of a parametric model serving as the proxy of latent data structure while the existing procedures can draw poor statistical inferences if a misspecified model is assumed; RAAS also achieves inferential efficiency by taking fewer designs to acquire useful statistical inferences than non-optimal procedures. Thus, RAAS is expected to be a principled solution to real-world experimental scenarios pursuing robust prediction and efficient experimentation.
ERIC Educational Resources Information Center
Denton, Holly M.
A study tested a model of organizational variables that earlier research had identified as important in influencing what model(s) of public relations an organization selects. Models of public relations (as outlined by J. Grunig and Hunt in 1984) are defined as either press agentry, public information, two-way asymmetrical, or two-way symmetrical.…
Zhu, Hongyan; Chu, Bingquan; Fan, Yangyang; Tao, Xiaoya; Yin, Wenxin; He, Yong
2017-08-10
We investigated the feasibility and potentiality of determining firmness, soluble solids content (SSC), and pH in kiwifruits using hyperspectral imaging, combined with variable selection methods and calibration models. The images were acquired by a push-broom hyperspectral reflectance imaging system covering two spectral ranges. Weighted regression coefficients (BW), successive projections algorithm (SPA) and genetic algorithm-partial least square (GAPLS) were compared and evaluated for the selection of effective wavelengths. Moreover, multiple linear regression (MLR), partial least squares regression and least squares support vector machine (LS-SVM) were developed to predict quality attributes quantitatively using effective wavelengths. The established models, particularly SPA-MLR, SPA-LS-SVM and GAPLS-LS-SVM, performed well. The SPA-MLR models for firmness (R pre = 0.9812, RPD = 5.17) and SSC (R pre = 0.9523, RPD = 3.26) at 380-1023 nm showed excellent performance, whereas GAPLS-LS-SVM was the optimal model at 874-1734 nm for predicting pH (R pre = 0.9070, RPD = 2.60). Image processing algorithms were developed to transfer the predictive model in every pixel to generate prediction maps that visualize the spatial distribution of firmness and SSC. Hence, the results clearly demonstrated that hyperspectral imaging has the potential as a fast and non-invasive method to predict the quality attributes of kiwifruits.
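The RPD values quoted in the abstract above can be made concrete with a short sketch. It assumes the common chemometrics definition, in which RPD (residual predictive deviation) is the standard deviation of the reference values divided by the root mean square error of prediction (RMSEP). The abstract does not state which definition the authors used, and some authors use the population rather than the sample standard deviation. The data below are invented for illustration.

```python
import math
import statistics

def rmsep(y_true, y_pred):
    """Root mean square error of prediction."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def rpd(y_true, y_pred):
    """Sample standard deviation of the reference values divided by RMSEP."""
    return statistics.stdev(y_true) / rmsep(y_true, y_pred)

# Invented reference and predicted values, purely for illustration.
reference = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.1, 3.9, 5.1]

if __name__ == "__main__":
    print(rmsep(reference, predicted), rpd(reference, predicted))
```

Under this definition a larger RPD means the prediction error is small relative to the natural spread of the measured attribute.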
NASA Astrophysics Data System (ADS)
Thanos, Konstantinos-Georgios; Thomopoulos, Stelios C. A.
2016-05-01
wayGoo is a fully functional application whose main functionalities include content geolocation, event scheduling, and indoor navigation. However, significant information about events does not reach users' attention, either because of the size of this information or because some information comes from real-time data sources. The purpose of this work is to facilitate event management operations by prioritizing the presented events based on users' interests, using both static and real-time data. Through the wayGoo interface, users select conceptual topics that interest them. These topics constitute a browsing behavior vector which is used for learning users' interests implicitly, without being intrusive. Then, the system estimates user preferences and returns an events list sorted from the most preferred one to the least. User preferences are modeled via a Naïve Bayesian Network which consists of: a) the `decision' random variable, corresponding to users' decision on attending an event; b) the `distance' random variable, modeled by a linear regression that estimates the probability that the distance between a user and each event destination is not discouraging; c) the `seat availability' random variable, modeled by a linear regression, which estimates the probability that the seat availability is encouraging; and d) the `relevance' random variable, modeled by a clustering-based collaborative filtering, which determines the relevance of each event to users' interests. Finally, experimental results show that the proposed system contributes substantially to assisting users in browsing and selecting events to attend.
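The ranking step described above can be sketched as follows. This is not the wayGoo implementation. It only assumes that the three factors (distance, seat availability, relevance) are treated as conditionally independent given the decision and are multiplied with a prior into one unnormalised score. The class, field names, prior and probability values are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Event:
    name: str
    p_distance: float   # P(distance is not discouraging), hypothetical value
    p_seats: float      # P(seat availability is encouraging), hypothetical value
    p_relevance: float  # relevance to the user's interests in [0, 1], hypothetical

def attend_score(event, prior=0.5):
    """Unnormalised naive-Bayes style score: prior times the product of factors."""
    return prior * event.p_distance * event.p_seats * event.p_relevance

def rank_events(events):
    """Return events sorted from the most preferred to the least."""
    return sorted(events, key=attend_score, reverse=True)

events = [
    Event("concert", 0.9, 0.4, 0.8),
    Event("lecture", 0.6, 0.9, 0.7),
    Event("expo", 0.3, 0.8, 0.2),
]
ranked = [event.name for event in rank_events(events)]

if __name__ == "__main__":
    print(ranked)
```

Multiplying the factors means that a single discouraging factor, such as few remaining seats, can pull an otherwise relevant event down the list.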
Ning, Jing; Chen, Yong; Piao, Jin
2017-07-01
Publication bias occurs when the published research results are systematically unrepresentative of the population of studies that have been conducted, and is a potential threat to meaningful meta-analysis. The Copas selection model provides a flexible framework for correcting estimates and offers considerable insight into the publication bias. However, maximizing the observed likelihood under the Copas selection model is challenging because the observed data contain very little information on the latent variable. In this article, we study a Copas-like selection model and propose an expectation-maximization (EM) algorithm for estimation based on the full likelihood. Empirical simulation studies show that the EM algorithm and its associated inferential procedure perform well and avoid the non-convergence problem when maximizing the observed likelihood. © The Author 2017. Published by Oxford University Press. All rights reserved.
L3 Syntactic Transfer Selectivity and Typological Determinacy: The Typological Primacy Model
ERIC Educational Resources Information Center
Rothman, Jason
2011-01-01
The present article addresses the following question: what variables condition syntactic transfer? Evidence is provided in support of the position that third language (L3) transfer is selective, whereby, at least under certain conditions, it is driven by the typological proximity of the target L3 measured against the other previously acquired…
A study of commuter airline economics
NASA Technical Reports Server (NTRS)
Summerfield, J. R.
1976-01-01
Variables are defined and cost relationships developed that describe the direct and indirect operating costs of commuter airlines. The study focused on costs for new aircraft and new aircraft technology when applied to the commuter airline industry. With proper judgement and selection of input variables, the operating costs model was shown to be capable of providing economic insight into other commuter airline system evaluations.
Katherine J. Elliott; Barton D. Clinton
1993-01-01
Allometric equations were developed to predict aboveground dry weight of herbaceous and woody species on prescribe-burned sites in the Southern Appalachians. Best-fit least-squares regression models were developed using diameter, height, or both, as the independent variables and dry weight as the dependent variable. Coefficients of determination for the selected total...
Bhongsatiern, Jiraganya; Stockmann, Chris; Yu, Tian; Constance, Jonathan E; Moorthy, Ganesh; Spigarelli, Michael G; Desai, Pankaj B; Sherwin, Catherine M T
2016-05-01
Growth and maturational changes have been identified as significant covariates in describing variability in clearance of renally excreted drugs such as vancomycin. Because of immaturity of clearance mechanisms, quantification of renal function in neonates is of importance. Several serum creatinine (SCr)-based renal function descriptors have been developed in adults and children, but none are selectively derived for neonates. This review summarizes development of the neonatal kidney and discusses assessment of the renal function regarding estimation of glomerular filtration rate using renal function descriptors. Furthermore, identification of the renal function descriptors that best describe the variability of vancomycin clearance was performed in a sample study of a septic neonatal cohort. Population pharmacokinetic models were developed applying a combination of age-weight, renal function descriptors, or SCr alone. In addition to age and weight, SCr or renal function descriptors significantly reduced variability of vancomycin clearance. The population pharmacokinetic models with Léger and modified Schwartz formulas were selected as the optimal final models, although the other renal function descriptors and SCr provided reasonably good fit to the data, suggesting further evaluation of the final models using external data sets and cross validation. The present study supports incorporation of renal function descriptors in the estimation of vancomycin clearance in neonates. © 2015, The American College of Clinical Pharmacology.
Potts, Richard; Faith, J Tyler
2015-10-01
Interaction of orbital insolation cycles defines a predictive model of alternating phases of high- and low-climate variability for tropical East Africa over the past 5 million years. This model, which is described in terms of climate variability stages, implies repeated increases in landscape/resource instability and intervening periods of stability in East Africa. It predicts eight prolonged (>192 kyr) eras of intensified habitat instability (high variability stages) in which hominin evolutionary innovations are likely to have occurred, potentially by variability selection. The prediction that repeated shifts toward high climate variability affected paleoenvironments and evolution is tested in three ways. In the first test, deep-sea records of northeast African terrigenous dust flux (Sites 721/722) and eastern Mediterranean sapropels (Site 967A) show increased and decreased variability in concert with predicted shifts in climate variability. These regional measurements of climate dynamics are complemented by stratigraphic observations in five basins with lengthy stratigraphic and paleoenvironmental records: the mid-Pleistocene Olorgesailie Basin, the Plio-Pleistocene Turkana and Olduvai Basins, and the Pliocene Tugen Hills sequence and Hadar Basin--all of which show that highly variable landscapes inhabited by hominin populations were indeed concentrated in predicted stages of prolonged high climate variability. Second, stringent null-model tests demonstrate a significant association of currently known first and last appearance datums (FADs and LADs) of the major hominin lineages, suites of technological behaviors, and dispersal events with the predicted intervals of prolonged high climate variability. Palynological study in the Nihewan Basin, China, provides a third test, which shows the occupation of highly diverse habitats in eastern Asia, consistent with the predicted increase in adaptability in dispersing Oldowan hominins. 
Integration of fossil, archeological, sedimentary, and paleolandscape evidence illustrates the potential influence of prolonged high variability on the origin and spread of critical adaptations and lineages in the evolution of Homo. The growing body of data concerning environmental dynamics supports the idea that the evolution of adaptability in response to climate and overall ecological instability represents a unifying theme in hominin evolutionary history. Published by Elsevier Ltd.
NASA Astrophysics Data System (ADS)
Hanan, Lu; Qiushi, Li; Shaobin, Li
2016-12-01
This paper presents an integrated optimization design method in which uniform design, response surface methodology and a genetic algorithm are used in combination. In detail, uniform design is used to select the experimental sampling points in the experimental domain and the system performance is evaluated by means of computational fluid dynamics to construct a database. After that, response surface methodology is employed to generate a surrogate mathematical model relating the optimization objective and the design variables. Subsequently, the genetic algorithm is applied to the surrogate model to find the optimal solution subject to the given constraints. The method has been applied to the optimization design of an axisymmetric diverging duct, dealing with three design variables including one qualitative variable and two quantitative variables. The modeling and optimization design method performs well in improving the duct aerodynamic performance. By reducing design time and computational cost, it can also be applied to wider fields of mechanical design as a useful tool for engineering designers.
Kepler AutoRegressive Planet Search: Motivation & Methodology
NASA Astrophysics Data System (ADS)
Caceres, Gabriel; Feigelson, Eric; Jogesh Babu, G.; Bahamonde, Natalia; Bertin, Karine; Christen, Alejandra; Curé, Michel; Meza, Cristian
2015-08-01
The Kepler AutoRegressive Planet Search (KARPS) project uses statistical methodology associated with autoregressive (AR) processes to model Kepler lightcurves in order to improve exoplanet transit detection in systems with high stellar variability. We also introduce a planet-search algorithm to detect transits in time-series residuals after application of the AR models. One of the main obstacles in detecting faint planetary transits is the intrinsic stellar variability of the host star. The variability displayed by many stars may have autoregressive properties, wherein later flux values are correlated with previous ones in some manner. Auto-Regressive Moving-Average (ARMA) models, Generalized Auto-Regressive Conditional Heteroskedasticity (GARCH), and related models are flexible, phenomenological methods used with great success to model stochastic temporal behaviors in many fields of study, particularly econometrics. Powerful statistical methods are implemented in the public statistical software environment R and its many packages. Modeling involves maximum likelihood fitting, model selection, and residual analysis. These techniques provide a useful framework to model stellar variability and are used in KARPS with the objective of reducing stellar noise to enhance opportunities to find as-yet-undiscovered planets. Our analysis procedure consists of three steps: pre-processing of the data to remove discontinuities, gaps and outliers; ARMA-type model selection and fitting; and transit signal search of the residuals using a new Transit Comb Filter (TCF) that replaces traditional box-finding algorithms. We apply the procedures to simulated Kepler-like time series with known stellar and planetary signals to evaluate the effectiveness of the KARPS procedures. The ARMA-type modeling is effective at reducing stellar noise, but also reduces and transforms the transit signal into ingress/egress spikes. 
A periodogram based on the TCF is constructed to concentrate the signal of these periodic spikes. When a periodic transit is found, the model is displayed on a standard period-folded averaged light curve. We also illustrate the efficient coding in R.
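The remark that ARMA-type modeling turns a box-shaped transit into ingress/egress spikes can be illustrated with the simplest relative of these models, first-order differencing. The following is only a minimal sketch in Python (the abstract states that the project itself is coded in R); the function name and the toy light curve are hypothetical and are not taken from KARPS.

```python
def difference(flux):
    """First-order differencing: returns x[t] - x[t-1] for t = 1..n-1."""
    return [later - earlier for earlier, later in zip(flux, flux[1:])]

# Toy light curve (illustrative values only): constant flux of 1.0 with a
# box-shaped transit of depth 0.01 occupying indices 5 to 7 inclusive.
flux = [1.0] * 5 + [0.99] * 3 + [1.0] * 5
residuals = difference(flux)

# After differencing, the flat parts of the curve vanish and only two
# spikes remain: a downward one at ingress and an upward one at egress.
spikes = [(i, round(r, 4)) for i, r in enumerate(residuals) if abs(r) > 1e-9]
print(spikes)
```

A periodic transit would therefore appear in the residuals as a repeating down/up pair of spikes, which is the kind of pattern a comb-shaped filter is suited to pick up.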
Steensels, Machteld; Maltz, Ephraim; Bahr, Claudia; Berckmans, Daniel; Antler, Aharon; Halachmi, Ilan
2017-05-01
The objective of this study was to design and validate a mathematical model to detect post-calving ketosis. The validation was conducted in four commercial dairy farms in Israel, on a total of 706 multiparous Holstein dairy cows: 203 cows clinically diagnosed with ketosis and 503 healthy cows. A logistic binary regression model was developed, where the dependent variable is categorical (healthy/diseased) and a set of explanatory variables were measured with existing commercial sensors: rumination duration, activity and milk yield of each individual cow. In a first validation step (within-farm), the model was calibrated on the database of each farm separately. Two thirds of the sick cows and an equal number of healthy cows were randomly selected for model validation. The remaining one third of the cows, which did not participate in the model validation, were used for model calibration. In order to overcome the random selection effect, this procedure was repeated 100 times. In a second (between-farms) validation step, the model was calibrated on one farm and validated on another farm. Within-farm accuracy, ranging from 74 to 79%, was higher than between-farm accuracy, ranging from 49 to 72%, in all farms. The within-farm sensitivities ranged from 78 to 90%, and specificities ranged from 71 to 74%. The between-farms sensitivities ranged from 65 to 95%. The developed model can be improved in future research, by employing other variables that can be added; or by exploring other models to achieve greater sensitivity and specificity.
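The accuracy, sensitivity and specificity figures quoted above are all derived from the same confusion-matrix counts. A minimal Python sketch of those definitions follows; the function name and the counts in the usage example are hypothetical and are not data from the study.

```python
def classification_rates(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts.

    tp: sick cows flagged as sick        fn: sick cows flagged as healthy
    tn: healthy cows flagged as healthy  fp: healthy cows flagged as sick
    """
    sensitivity = tp / (tp + fn)   # share of sick cows that were detected
    specificity = tn / (tn + fp)   # share of healthy cows correctly cleared
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

# Hypothetical validation set with equal numbers of sick and healthy cows.
sens, spec, acc = classification_rates(tp=54, fn=14, tn=50, fp=18)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} accuracy={acc:.2f}")
```

With balanced classes, as in this example, accuracy is simply the mean of sensitivity and specificity.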
Five Guidelines for Selecting Hydrological Signatures
NASA Astrophysics Data System (ADS)
McMillan, H. K.; Westerberg, I.; Branger, F.
2017-12-01
Hydrological signatures are index values derived from observed or modeled series of hydrological data such as rainfall, flow or soil moisture. They are designed to extract relevant information about hydrological behavior, such as to identify dominant processes, and to determine the strength, speed and spatiotemporal variability of the rainfall-runoff response. Hydrological signatures play an important role in model evaluation. They allow us to test whether particular model structures or parameter sets accurately reproduce the runoff generation processes within the watershed of interest. Most modeling studies use a selection of different signatures to capture different aspects of the catchment response, for example evaluating overall flow distribution as well as high and low flow extremes and flow timing. Such studies often choose their own set of signatures, or may borrow subsets of signatures used in multiple other works. The link between signature values and hydrological processes is not always straightforward, leading to uncertainty and variability in hydrologists' signature choices. In this presentation, we aim to encourage a more rigorous approach to hydrological signature selection, which considers the ability of signatures to represent hydrological behavior and underlying processes for the catchment and application in question. To this end, we propose a set of guidelines for selecting hydrological signatures. We describe five criteria that any hydrological signature should conform to: Identifiability, Robustness, Consistency, Representativeness, and Discriminatory Power. We describe an example of the design process for a signature, assessing possible signature designs against the guidelines above. Due to their ubiquity, we chose a signature related to the Flow Duration Curve, selecting the FDC mid-section slope as a proposed signature to quantify catchment overall behavior and flashiness. 
We demonstrate how assessment against each guideline could be used to compare or choose between alternative signature definitions. We believe that reaching a consensus on selection criteria for hydrological signatures will assist modelers to choose between competing signatures, facilitate comparison between hydrological studies, and help hydrologists to fully evaluate their models.
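As an illustration of what a Flow Duration Curve (FDC) mid-section slope signature might look like in practice, the sketch below uses one commonly seen definition: the difference in log flow between the 33% and 66% exceedance points, divided by the difference in exceedance probability. The abstract does not state which definition the authors used, so the percentiles, the plotting-position formula and the function names here are all assumptions.

```python
import math

def flow_at_exceedance(flows, p):
    """Flow exceeded a fraction p of the time.

    Uses Weibull plotting positions (rank / (n + 1)) on the flows sorted
    from largest to smallest, with linear interpolation between ranks.
    """
    ranked = sorted(flows, reverse=True)
    n = len(ranked)
    pos = p * (n + 1) - 1                  # fractional 0-based index
    pos = min(max(pos, 0.0), n - 1.0)      # clamp to the observed range
    lo = int(math.floor(pos))
    hi = min(lo + 1, n - 1)
    return ranked[lo] + (pos - lo) * (ranked[hi] - ranked[lo])

def fdc_mid_slope(flows, p_high_flow=0.33, p_low_flow=0.66):
    """Slope of the log-flow FDC between two exceedance probabilities.

    Assumes all flows are strictly positive (the logarithm is undefined
    for zero flow, one of the robustness issues a signature must handle).
    """
    q_high = flow_at_exceedance(flows, p_high_flow)
    q_low = flow_at_exceedance(flows, p_low_flow)
    return (math.log(q_high) - math.log(q_low)) / (p_low_flow - p_high_flow)

# Hypothetical flow series: a flashy catchment and a damped one.
flashy = [1, 2, 4, 8, 16, 32, 64, 128, 256]
damped = [8, 9, 10, 11, 12, 13, 14, 15, 16]
print(fdc_mid_slope(flashy), fdc_mid_slope(damped))
```

Under this definition a steeper slope indicates a more variable, flashier flow regime, and a constant flow series gives a slope of zero.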
Shrinkage Estimation of Varying Covariate Effects Based On Quantile Regression
Peng, Limin; Xu, Jinfeng; Kutner, Nancy
2013-01-01
Varying covariate effects often manifest meaningful heterogeneity in covariate-response associations. In this paper, we adopt a quantile regression model that assumes linearity at a continuous range of quantile levels as a tool to explore such data dynamics. The consideration of potential non-constancy of covariate effects necessitates a new perspective for variable selection, which, under the assumed quantile regression model, is to retain variables that have effects on all quantiles of interest as well as those that influence only part of quantiles considered. Current work on l1-penalized quantile regression either does not concern varying covariate effects or may not produce consistent variable selection in the presence of covariates with partial effects, a practical scenario of interest. In this work, we propose a shrinkage approach by adopting a novel uniform adaptive LASSO penalty. The new approach enjoys easy implementation without requiring smoothing. Moreover, it can consistently identify the true model (uniformly across quantiles) and achieve the oracle estimation efficiency. We further extend the proposed shrinkage method to the case where responses are subject to random right censoring. Numerical studies confirm the theoretical results and support the utility of our proposals. PMID:25332515
Proxies for soil organic carbon derived from remote sensing
NASA Astrophysics Data System (ADS)
Rasel, S. M. M.; Groen, T. A.; Hussin, Y. A.; Diti, I. J.
2017-07-01
The possibility of carbon storage in soils is of interest because soil contains more carbon than vegetation. Estimation of soil carbon through remote sensing based techniques can be a cost effective approach, but is limited by available methods. This study aims to develop a model based on remotely sensed variables (elevation, forest type and above ground biomass) to estimate soil carbon stocks. Field observations on soil organic carbon, species composition, and above ground biomass were recorded in the subtropical forest of Chitwan, Nepal. These variables were also estimated using LiDAR data and a WorldView 2 image. Above ground biomass was estimated from the LiDAR image using a novel approach where the image was segmented to identify individual trees, and for these trees estimates of DBH and height were made. Based on AIC (Akaike Information Criterion) a regression model with above ground biomass derived from LiDAR data, and forest type derived from WorldView 2 imagery was selected to estimate soil organic carbon (SOC) stocks. The selected model had a coefficient of determination (R2) of 0.69. This shows the scope of estimating SOC with remote sensing derived variables in sub-tropical forests.
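AIC-based selection among candidate regression models can be sketched as follows. For least-squares fits with Gaussian errors, AIC can be written, up to an additive constant that is the same for every model fitted to the same data, as n ln(RSS/n) + 2k. The residual sums of squares, the sample size and the candidate names below are hypothetical illustrations, not values from the study.

```python
import math

def aic_least_squares(rss, n, k):
    """AIC for a least-squares fit, up to a constant shared by all models.

    rss: residual sum of squares, n: number of observations,
    k: number of estimated parameters (counted the same way for each model).
    """
    return n * math.log(rss / n) + 2 * k

n_obs = 50  # hypothetical number of field plots
# Hypothetical candidates: name -> (residual sum of squares, parameter count)
candidates = {
    "biomass": (40.0, 2),
    "biomass + forest type": (25.0, 3),
    "biomass + forest type + elevation": (24.8, 4),
}
scores = {name: aic_least_squares(rss, n_obs, k)
          for name, (rss, k) in candidates.items()}
best = min(scores, key=scores.get)
print(best, round(scores[best], 2))
```

In this toy case the third predictor lowers the residual sum of squares only slightly, so the 2k penalty outweighs the gain and the two-predictor model has the lowest AIC.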
NASA Astrophysics Data System (ADS)
Olson, R.; Evans, J. P.; Fan, Y.
2015-12-01
NARCliM (NSW/ACT Regional Climate Modelling Project) is a regional climate project for Australia and the surrounding region. It dynamically downscales 4 General Circulation Models (GCMs) using three Regional Climate Models (RCMs) to provide climate projections for the CORDEX-AustralAsia region at 50 km resolution, and for south-east Australia at 10 km resolution. The project differs from previous work in the level of sophistication of model selection. Specifically, the selection process for GCMs included (i) conducting literature review to evaluate model performance, (ii) analysing model independence, and (iii) selecting models that span future temperature and precipitation change space. RCMs for downscaling the GCMs were chosen based on their performance for several precipitation events over South-East Australia, and on model independence. Bayesian Model Averaging (BMA) provides a statistically consistent framework for weighing the models based on their likelihood given the available observations. These weights are used to provide probability distribution functions (pdfs) for model projections. We develop a BMA framework for constructing probabilistic climate projections for spatially-averaged variables from the NARCliM project. The first step in the procedure is smoothing model output in order to exclude the influence of internal climate variability. Our statistical model for model-observations residuals is a homoskedastic iid process. Comparing RCMs with Australian Water Availability Project (AWAP) observations is used to determine model weights through Monte Carlo integration. Posterior pdfs of statistical parameters of model-data residuals are obtained using Markov Chain Monte Carlo. The uncertainty in the properties of the model-data residuals is fully accounted for when constructing the projections. We present the preliminary results of the BMA analysis for yearly maximum temperature for New South Wales state planning regions for the period 2060-2079.
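The idea of weighting models by their likelihood given observations can be sketched in a much simplified form. The Python example below assumes equal prior weights, Gaussian iid residuals and a fixed, known residual standard deviation; the abstract instead integrates over uncertain residual parameters with Monte Carlo methods, so this is an illustration of the weighting principle only, with hypothetical function names and residual values.

```python
import math

def gaussian_loglik(residuals, sigma):
    """Log-likelihood of iid zero-mean Gaussian residuals with known sigma."""
    n = len(residuals)
    return (-0.5 * n * math.log(2 * math.pi * sigma ** 2)
            - sum(r * r for r in residuals) / (2 * sigma ** 2))

def bma_weights(logliks):
    """Posterior model weights under equal priors: w_i proportional to L_i.

    Subtracting the maximum log-likelihood before exponentiating avoids
    numerical underflow without changing the normalized weights.
    """
    top = max(logliks)
    unnormalized = [math.exp(l - top) for l in logliks]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Hypothetical model-minus-observation residuals (degrees C) for three models.
residual_sets = [
    [0.2, -0.1, 0.3, -0.2],
    [0.6, 0.5, -0.4, 0.7],
    [1.0, -1.2, 0.9, 1.1],
]
weights = bma_weights([gaussian_loglik(r, sigma=0.5) for r in residual_sets])
print([round(w, 3) for w in weights])
```

The model with the smallest residuals receives the largest weight, and the weights sum to one so that they can be used to mix the individual model projections into a single probability distribution.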
Variability Selected Low-Luminosity Active Galactic Nuclei in the 4 Ms Chandra Deep Field-South
NASA Technical Reports Server (NTRS)
Young, M.; Brandt, W. N.; Xue, Y. Q.; Paolillo, D. M.; Alexander, F. E.; Bauer, F. E.; Lehmer, B. D.; Luo, B.; Shemmer, O.; Schneider, D. P.;
2012-01-01
The 4 Ms Chandra Deep Field-South (CDF-S) and other deep X-ray surveys have been highly effective at selecting active galactic nuclei (AGN). However, cosmologically distant low-luminosity AGN (LLAGN) have remained a challenge to identify due to significant contribution from the host galaxy. We identify long-term X-ray variability (approx. month-years, observed frame) in 20 of 92 CDF-S galaxies spanning redshifts of approx. 0.08 - 1.02 that do not meet other AGN selection criteria. We show that the observed variability cannot be explained by X-ray binary populations or ultraluminous X-ray sources, so the variability is most likely caused by accretion onto a supermassive black hole. The variable galaxies are not heavily obscured in general, with a stacked effective power-law photon index of Gamma(sub Stack) approx. equal to 1.93 +/- 0.13, and are therefore likely LLAGN. The LLAGN tend to lie a factor of approx. 6-89 below the extrapolated linear variability-luminosity relation measured for luminous AGN. This may be explained by their lower accretion rates. Variability-independent black-hole mass and accretion-rate estimates for variable galaxies show that they sample a significantly different black hole mass-accretion-rate space, with masses a factor of 2.4 lower and accretion rates a factor of 22.5 lower than variable luminous AGNs at the same redshift. We find that an empirical model based on a universal broken power-law power spectral density function, where the break frequency depends on SMBH mass and accretion rate, roughly reproduces the shape, but not the normalization, of the variability-luminosity trends measured for variable galaxies and more luminous AGNs.
Bean, William T.; Stafford, Robert; Butterfield, H. Scott; Brashares, Justin S.
2014-01-01
Species distributions are known to be limited by biotic and abiotic factors at multiple temporal and spatial scales. Species distribution models, however, frequently assume a population at equilibrium in both time and space. Studies of habitat selection have repeatedly shown the difficulty of estimating resource selection if the scale or extent of analysis is incorrect. Here, we present a multi-step approach to estimate the realized and potential distribution of the endangered giant kangaroo rat. First, we estimate the potential distribution by modeling suitability at a range-wide scale using static bioclimatic variables. We then examine annual changes in extent at a population-level. We define “available” habitat based on the total suitable potential distribution at the range-wide scale. Then, within the available habitat, model changes in population extent driven by multiple measures of resource availability. By modeling distributions for a population with robust estimates of population extent through time, and ecologically relevant predictor variables, we improved the predictive ability of SDMs, as well as revealed an unanticipated relationship between population extent and precipitation at multiple scales. At a range-wide scale, the best model indicated the giant kangaroo rat was limited to areas that received little to no precipitation in the summer months. In contrast, the best model for shorter time scales showed a positive relation with resource abundance, driven by precipitation, in the current and previous year. These results suggest that the distribution of the giant kangaroo rat was limited to the wettest parts of the drier areas within the study region. 
This multi-step approach reinforces the differing relationship species may have with environmental variables at different scales, provides a novel method for defining “available” habitat in habitat selection studies, and suggests a way to create distribution models at spatial and temporal scales relevant to theoretical and applied ecologists. PMID:25237807
Marvuglia, Antonino; Kanevski, Mikhail; Benetto, Enrico
2015-10-01
Toxicity characterization of chemical emissions in Life Cycle Assessment (LCA) is a complex task which usually proceeds via multimedia (fate, exposure and effect) models attached to models of dose-response relationships to assess the effects on target. Different models and approaches do exist, but all require a vast amount of data on the properties of the chemical compounds being assessed, which are hard to collect or hardly publicly available (especially for thousands of less common or newly developed chemicals), therefore hampering in practice the assessment in LCA. An example is USEtox, a consensual model for the characterization of human toxicity and freshwater ecotoxicity. This paper places itself in a line of research aiming at providing a methodology to reduce the number of input parameters necessary to run multimedia fate models, focusing in particular on the application of the USEtox toxicity model. By focusing on USEtox, in this paper two main goals are pursued: 1) performing an extensive exploratory analysis (using dimensionality reduction techniques) of the input space constituted by the substance-specific properties, with the aim of detecting particular patterns in the data manifold and estimating the dimension of the subspace in which the data manifold actually lies; and 2) exploring the application of a set of linear models, based on partial least squares (PLS) regression, as well as a nonlinear model (general regression neural network--GRNN), in search of an automatic selection strategy for the most informative variables according to the modelled output (USEtox factor). After extensive analysis, the intrinsic dimension of the input manifold has been identified between three and four. 
The variables selected as most informative may vary according to the output modelled and the model used, but for the toxicity factors modelled in this paper the input variables selected as most informative are coherent with prior expectations based on scientific knowledge of toxicity factors modelling. Thus the outcomes of the analysis are promising for the future application of the approach to other portions of the model, affected by important data gaps, e.g., to the calculation of human health effect factors. Copyright © 2015. Published by Elsevier Ltd.
Model-assisted survey regression estimation with the lasso
Kelly S. McConville; F. Jay Breidt; Thomas C. M. Lee; Gretchen G. Moisen
2017-01-01
In the U.S. Forest Service's Forest Inventory and Analysis (FIA) program, as in other natural resource surveys, many auxiliary variables are available for use in model-assisted inference about finite population parameters. Some of this auxiliary information may be extraneous, and therefore model selection is appropriate to improve the efficiency of the survey...
Hippert, Henrique S; Taylor, James W
2010-04-01
Artificial neural networks have frequently been proposed for electricity load forecasting because of their capabilities for the nonlinear modelling of large multivariate data sets. Modelling with neural networks is not an easy task though; two of the main challenges are defining the appropriate level of model complexity, and choosing the input variables. This paper evaluates techniques for automatic neural network modelling within a Bayesian framework, as applied to six samples containing daily load and weather data for four different countries. We analyse input selection as carried out by the Bayesian 'automatic relevance determination', and the usefulness of the Bayesian 'evidence' for the selection of the best structure (in terms of number of neurones), as compared to methods based on cross-validation. Copyright 2009 Elsevier Ltd. All rights reserved.
NASA Astrophysics Data System (ADS)
Morales, M. B.; Traba, J.; Carriles, E.; Delgado, M. P.; de la Morena, E. L. García
2008-11-01
We examined sexual differences in patterns of vegetation structure selection in the sexually dimorphic little bustard. Differences in vegetation structure between male, female and non-used locations during reproduction were examined and used to build a presence/absence model for each sex. Ten variables were measured in each location, extracting two PCA factors (PC1: a visibility-shelter gradient; PC2: a gradient in food availability) used as response variables in GLM explanatory models. Both factors significantly differed between female, male and control locations. Neither study site nor phenology was significant. Logistic regression was used to model male and female presence/absence. Female presence was positively associated to cover of ground by vegetation litter, as well as overall vegetation cover, and negatively to vegetation density over 30 cm above ground. Male presence was positively related to litter cover and short vegetation and negatively to vegetation density over 30 cm above ground. Models showed good global performance and robustness. Female microhabitat selection and distribution seems to be related to the balance between shelter and visibility for surveillance. Male microhabitat selection would be related mainly to the need of conspicuousness for courtship. Accessibility to food resources seems to be equally important for both sexes. Differences suggest ecological sexual segregation resulting from different ecological constraints. These are the first detailed results on vegetation structure selection in both male and female little bustards, and are useful in designing management measures addressing vegetation structure irrespective of landscape composition. Similar microhabitat approaches can be applied to manage the habitat of many declining farmland birds.
Genetic-evolution-based optimization methods for engineering design
NASA Technical Reports Server (NTRS)
Rao, S. S.; Pan, T. S.; Dhingra, A. K.; Venkayya, V. B.; Kumar, V.
1990-01-01
This paper presents the applicability of a biological model, based on genetic evolution, for engineering design optimization. Algorithms embodying the ideas of reproduction, crossover, and mutation are developed and applied to solve different types of structural optimization problems. Both continuous and discrete variable optimization problems are solved. A two-bay truss for maximum fundamental frequency is considered to demonstrate the continuous variable case. The selection of locations of actuators in an actively controlled structure, for minimum energy dissipation, is considered to illustrate the discrete variable case.
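The three operators named in the abstract (reproduction, crossover and mutation) can be sketched for a discrete-variable problem with a short Python example. This is a generic textbook-style genetic algorithm on bit strings, not the authors' implementation; the tournament selection scheme, the parameter defaults and the toy fitness function are all assumptions chosen for illustration.

```python
import random

def genetic_algorithm(fitness, n_bits, pop_size=20, generations=40,
                      crossover_rate=0.9, mutation_rate=0.02, seed=0):
    """Maximize `fitness` over bit strings of length n_bits (n_bits >= 2)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)[:]

    def pick():
        # Reproduction: binary tournament, the fitter of two random members.
        a, b = rng.choice(pop), rng.choice(pop)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = []
        while len(children) < pop_size:
            p1, p2 = pick(), pick()
            if rng.random() < crossover_rate:
                # Crossover: swap the tails of the parents at a random cut.
                cut = rng.randint(1, n_bits - 1)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            else:
                c1, c2 = p1[:], p2[:]
            for child in (c1, c2):
                # Mutation: flip each bit with a small probability.
                for i in range(n_bits):
                    if rng.random() < mutation_rate:
                        child[i] = 1 - child[i]
                children.append(child)
        pop = children[:pop_size]
        gen_best = max(pop, key=fitness)
        if fitness(gen_best) > fitness(best):
            best = gen_best[:]   # remember the best design seen so far
    return best

# Toy fitness: number of ones. In a placement problem each bit could instead
# flag whether a candidate location is used, scored by a structural analysis.
solution = genetic_algorithm(fitness=sum, n_bits=16)
print(solution, sum(solution))
```

Because the best individual seen so far is stored separately, the returned design is never worse than the best member of the initial random population.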
Labbe, D; Rytz, A; Godinot, N; Ferrage, A; Martin, N
2017-01-01
Increasing portion sizes over the last 30 years are considered to be one of the factors underlying overconsumption. Past research on the drivers of portion selection for foods showed that larger portions are selected for foods delivering low expected satiation. However, the respective contribution of expected satiation vs. two other potential drivers of portion size selection, i.e. perceived healthfulness and expected tastiness, has never been explored. In this study, we conjointly explored the role of expected satiation, perceived healthfulness and expected tastiness when selecting portions within a range of six commercial pizzas varying in their toppings and brands. For each product, 63 pizza consumers selected a portion size that would satisfy them for lunch and scored their expected satiation, perceived healthfulness and expected tastiness. As six participants selected an entire pizza as ideal portion independently of topping or brand, their data sets were not considered in the data analyses completed on responses from 57 participants. Hierarchical multiple regression analyses showed that portion size variance was predicted by perceived healthiness and expected tastiness variables. Two sub-groups of participants with different portion size patterns across pizzas were identified through post-hoc exploratory analysis. The explanatory power of the regression model was significantly improved by adding interaction terms between sub-group and expected satiation variables and between sub-group and perceived healthfulness variables to the model. Analysis at a sub-group level showed either positive or negative association between portion size and expected satiation depending on sub-groups. For one group, portion size selection was more health-driven and for the other, more hedonic-driven. These results showed that even when considering a well-liked product category, perceived healthfulness can be an important factor influencing portion size decision. 
Copyright © 2016 Nestec S.A. Published by Elsevier Ltd. All rights reserved.
Estimating the spatial scales of landscape effects on abundance
Richard Chandler; Jeffrey Hepinstall-Cymerman
2016-01-01
Spatial variation in abundance is influenced by local- and landscape-level environmental variables, but modeling landscape effects is challenging because the spatial scales of the relationships are unknown. Current approaches involve buffering survey locations with polygons of various sizes and using model selection to identify the best scale. The buffering...
Numerical modeling of eastern Connecticut's visual resources
Daniel L. Civco
1979-01-01
A numerical model capable of accurately predicting the preference for landscape photographs of selected points in eastern Connecticut is presented. A function of the social attitudes expressed toward thirty-two salient visual landscape features serves as the independent variable in predicting preferences. A technique for objectively assigning adjectives to landscape...
Lin, Wei; Feng, Rui; Li, Hongzhe
2014-01-01
In genetical genomics studies, it is important to jointly analyze gene expression data and genetic variants in exploring their associations with complex traits, where the dimensionality of gene expressions and genetic variants can both be much larger than the sample size. Motivated by such modern applications, we consider the problem of variable selection and estimation in high-dimensional sparse instrumental variables models. To overcome the difficulty of high dimensionality and unknown optimal instruments, we propose a two-stage regularization framework for identifying and estimating important covariate effects while selecting and estimating optimal instruments. The methodology extends the classical two-stage least squares estimator to high dimensions by exploiting sparsity using sparsity-inducing penalty functions in both stages. The resulting procedure is efficiently implemented by coordinate descent optimization. For the representative L1 regularization and a class of concave regularization methods, we establish estimation, prediction, and model selection properties of the two-stage regularized estimators in the high-dimensional setting where the dimensionality of covariates and instruments are both allowed to grow exponentially with the sample size. The practical performance of the proposed method is evaluated by simulation studies and its usefulness is illustrated by an analysis of mouse obesity data. Supplementary materials for this article are available online. PMID:26392642
Stone, Wesley W.; Gilliom, Robert J.; Crawford, Charles G.
2008-01-01
Regression models were developed for predicting annual maximum and selected annual maximum moving-average concentrations of atrazine in streams using the Watershed Regressions for Pesticides (WARP) methodology developed by the National Water-Quality Assessment Program (NAWQA) of the U.S. Geological Survey (USGS). The current effort builds on the original WARP models, which were based on the annual mean and selected percentiles of the annual frequency distribution of atrazine concentrations. Estimates of annual maximum and annual maximum moving-average concentrations for selected durations are needed to characterize the levels of atrazine and other pesticides for comparison to specific water-quality benchmarks for evaluation of potential concerns regarding human health or aquatic life. Separate regression models were derived for the annual maximum and annual maximum 21-day, 60-day, and 90-day moving-average concentrations. Development of the regression models used the same explanatory variables, transformations, model development data, model validation data, and regression methods as those used in the original development of WARP. The models accounted for 72 to 75 percent of the variability in the concentration statistics among the 112 sampling sites used for model development. Predicted concentration statistics from the four models were within a factor of 10 of the observed concentration statistics for most of the model development and validation sites. Overall, performance of the models for the development and validation sites supports the application of the WARP models for predicting annual maximum and selected annual maximum moving-average atrazine concentration in streams and provides a framework to interpret the predictions in terms of uncertainty. 
For streams with inadequate direct measurements of atrazine concentrations, the WARP model predictions for the annual maximum and the annual maximum moving-average atrazine concentrations can be used to characterize the probable levels of atrazine for comparison to specific water-quality benchmarks. Sites with a high probability of exceeding a benchmark for human health or aquatic life can be prioritized for monitoring.
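The concentration statistics that these models predict (for example the annual maximum 21-day moving-average concentration) are straightforward to compute when a complete daily series is available. The Python sketch below shows only the definition of the statistic; it is not part of the WARP regression methodology, and the function name and the short example series are hypothetical.

```python
def max_moving_average(daily, window):
    """Largest mean over any `window` consecutive values in `daily`."""
    if window <= 0 or window > len(daily):
        raise ValueError("window must be between 1 and len(daily)")
    running = sum(daily[:window])      # sum of the first window
    best = running
    for i in range(window, len(daily)):
        # Slide the window one day: add the new value, drop the oldest.
        running += daily[i] - daily[i - window]
        best = max(best, running)
    return best / window

# Hypothetical daily concentrations (micrograms per liter) around one pulse.
series = [0, 0, 3, 6, 3, 0, 0]
print(max_moving_average(series, 1), max_moving_average(series, 3))
```

Longer averaging windows smooth out short concentration pulses, so the maximum moving average can only stay equal or decrease as the window grows from 1 day toward 21, 60 or 90 days.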
Salas, Eric Ariel L; Valdez, Raul; Michel, Stefan
2017-11-01
We modeled summer and winter habitat suitability of Marco Polo argali in the Pamir Mountains in southeastern Tajikistan using these statistical algorithms: Generalized Linear Model, Random Forest, Boosted Regression Tree, Maxent, and Multivariate Adaptive Regression Splines. Using sheep occurrence data collected from 2009 to 2015 and a set of selected habitat predictors, we produced summer and winter habitat suitability maps and determined the important habitat suitability predictors for both seasons. Our results demonstrated that argali selected proximity to riparian areas and greenness as the two most relevant variables for summer, and the degree of slope (gentler slopes between 0° and 20°) and Landsat temperature band for winter. The terrain roughness was also among the most important variables in summer and winter models. Aspect was only significant for winter habitat, with argali preferring south-facing mountain slopes. We evaluated various measures of model performance such as the Area Under the Curve (AUC) and the True Skill Statistic (TSS). Comparing the five algorithms, the AUC scored highest for Boosted Regression Tree in summer (AUC = 0.94) and winter model runs (AUC = 0.94). In contrast, Random Forest underperformed in both model runs.
Row, Jeffrey R.; Knick, Steven T.; Oyler-McCance, Sara J.; Lougheed, Stephen C.; Fedy, Bradley C.
2017-01-01
Dispersal can impact population dynamics and geographic variation, and thus, genetic approaches that can establish which landscape factors influence population connectivity have ecological and evolutionary importance. Mixed models that account for the error structure of pairwise datasets are increasingly used to compare models relating genetic differentiation to pairwise measures of landscape resistance. A model selection framework based on information criteria metrics or explained variance may help disentangle the ecological and landscape factors influencing genetic structure, yet there is currently no consensus on the best protocols. Here, we develop landscape-directed simulations and test a series of replicates that emulate independent empirical datasets of two species with different life history characteristics (greater sage-grouse; eastern foxsnake). We determined that in our simulated scenarios, AIC and BIC were the best model selection indices and that marginal R2 values were biased toward more complex models. The model coefficients for landscape variables generally reflected the underlying dispersal model with confidence intervals that did not overlap with zero across the entire model set. When we controlled for geographic distance, variables not in the underlying dispersal models (i.e., nontrue) typically overlapped zero. Our study helps establish methods for using linear mixed models to identify the features underlying patterns of dispersal across a variety of landscapes.
Thompson, Cynthia L
2016-05-01
Intraspecific variability in social systems is gaining increased recognition in primatology. Many primate species display variability in pair-living social organizations through incorporating extra adults into the group. While numerous models exist to explain primate pair-living, our tools to assess how and why variation in this trait occurs are currently limited. Here I outline an approach which: (i) utilizes conceptual models to identify the selective forces driving pair-living; (ii) outlines novel possible causes for variability in social organization; and (iii) conducts a holistic species-level analysis of social behavior to determine the factors contributing to variation in pair-living. A case study on white-faced sakis (Pithecia pithecia) is used to exemplify this approach. This species lives in either male-female pairs or groups incorporating "extra" adult males and/or females. Various conceptual models of pair-living suggest that high same-sex aggression toward extra-group individuals is a key component of the white-faced saki social system. Variable pair-living in white-faced sakis likely represents alternative strategies to achieve competency in this competition, in which animals experience conflicting selection pressures between achieving successful group defense and maintaining sole reproductive access to mates. Additionally, independent decisions by individuals may generate social variation by preventing other animals from adopting a social organization that maximizes fitness. White-faced saki inter-individual relationships and demographic patterns also lend conciliatory support to this conclusion. By utilizing both model-level and species-level approaches, with a consideration for potential sources of variation, researchers can gain insight into the factors generating variation in pair-living social organizations. © 2014 The Authors. American Journal of Primatology published by Wiley Periodicals, Inc.
Bayesian effect estimation accounting for adjustment uncertainty.
Wang, Chi; Parmigiani, Giovanni; Dominici, Francesca
2012-09-01
Model-based estimation of the effect of an exposure on an outcome is generally sensitive to the choice of which confounding factors are included in the model. We propose a new approach, which we call Bayesian adjustment for confounding (BAC), to estimate the effect of an exposure of interest on the outcome, while accounting for the uncertainty in the choice of confounders. Our approach is based on specifying two models: (1) the outcome as a function of the exposure and the potential confounders (the outcome model); and (2) the exposure as a function of the potential confounders (the exposure model). We consider Bayesian variable selection on both models and link the two by introducing a dependence parameter, ω, denoting the prior odds of including a predictor in the outcome model, given that the same predictor is in the exposure model. In the absence of dependence (ω = 1), BAC reduces to traditional Bayesian model averaging (BMA). In simulation studies, we show that BAC, with ω > 1, estimates the exposure effect with smaller bias than traditional BMA, and improved coverage. We then compare BAC, a recent approach of Crainiceanu, Dominici, and Parmigiani (2008, Biometrika 95, 635-651), and traditional BMA in a time series data set of hospital admissions, air pollution levels, and weather variables in Nassau, NY for the period 1999-2005. Using each approach, we estimate the short-term effects of air pollution on emergency admissions for cardiovascular diseases, accounting for confounding. This application illustrates the potentially significant pitfalls of misusing variable selection methods in the context of adjustment uncertainty. © 2012, The International Biometric Society.
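The dependence parameter described in the abstract above can be written out as a prior odds ratio. The indicator notation here is an assumption made for illustration and is not taken from the abstract: \alpha^{X}_{m} = 1 means potential confounder m is included in the exposure model, and \alpha^{Y}_{m} = 1 means it is included in the outcome model.

```latex
\[
  \frac{P\left(\alpha^{Y}_{m} = 1 \mid \alpha^{X}_{m} = 1\right)}
       {P\left(\alpha^{Y}_{m} = 0 \mid \alpha^{X}_{m} = 1\right)} = \omega
\]
```

Under this reading, ω = 1 gives even prior odds, so membership in the exposure model carries no extra weight, which matches the abstract's statement that BAC then reduces to BMA. With ω > 1, a predictor that appears in the exposure model is a priori more likely to be retained in the outcome model.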
NASA Astrophysics Data System (ADS)
Beguet, Benoit; Guyon, Dominique; Boukir, Samia; Chehata, Nesrine
2014-10-01
The main goal of this study is to design a method to describe the structure of forest stands from Very High Resolution satellite imagery, relying on some typical variables such as crown diameter, tree height, trunk diameter, tree density and tree spacing. The emphasis is placed on automating the identification of the most relevant image features for the forest structure retrieval task, exploiting both spectral and spatial information. Our approach is based on linear regressions between the forest structure variables to be estimated and various spectral and Haralick's texture features. The main drawback of this well-known texture representation is its underlying parameters, which are extremely difficult to set due to the spatial complexity of the forest structure. To tackle this major issue, an automated feature selection process is proposed which is based on statistical modeling, exploring a wide range of parameter values. It provides texture measures of diverse spatial parameters, hence implicitly inducing a multi-scale texture analysis. A new feature selection technique, which we call Random PRiF, is proposed. It relies on random sampling in feature space and carefully addresses the multicollinearity issue in multiple-linear regression while ensuring accurate prediction of forest variables. Our automated forest variable estimation scheme was tested on Quickbird and Pléiades panchromatic and multispectral images, acquired at different periods on the maritime pine stands of two sites in South-Western France. It outperforms two well-established variable subset selection techniques. It has been successfully applied to identify the best texture features in modeling the five considered forest structure variables. The RMSE of all predicted forest variables is improved by combining multispectral and panchromatic texture features, with various parameterizations, highlighting the potential of a multi-resolution approach for retrieving forest structure variables from VHR satellite images. Thus an average prediction error of ~1.1 m is expected on crown diameter, ~0.9 m on tree spacing, ~3 m on height and ~0.06 m on diameter at breast height.
Dooley, Helen; Flajnik, Martin F; Porter, Andrew J
2003-09-01
The novel immunoglobulin isotype novel antigen receptor (IgNAR) is found in cartilaginous fish and is composed of a heavy-chain homodimer that does not associate with light chains. The variable regions of IgNAR function as independent domains similar to those found in the heavy-chain immunoglobulins of Camelids. Here, we describe the successful cloning and generation of a phage-displayed, single-domain library based upon the variable domain of IgNAR. Selection of such a library generated from nurse sharks (Ginglymostoma cirratum) immunized with the model antigen hen egg-white lysozyme (HEL) enabled the successful isolation of intact antigen-specific binders matured in vivo. The selected variable domains were shown to be functionally expressed in Escherichia coli, extremely stable, and bind to antigen specifically with an affinity in the nanomolar range. This approach can therefore be considered as an alternative route for the isolation of minimal antigen-binding fragments with favorable characteristics.
De la Fuente, Jesus; Zapata, Lucía; Martínez-Vicente, Jose M.; Sander, Paul; Cardelle-Elawar, María
2014-01-01
The present investigation examines how personal self-regulation (presage variable) and regulatory teaching (process variable of teaching) relate to learning approaches, strategies for coping with stress, and self-regulated learning (process variables of learning) and, finally, how they relate to performance and satisfaction with the learning process (product variables). The objective was to clarify the associative and predictive relations between these variables, as contextualized in two different models that use the presage-process-product paradigm (the Biggs and DEDEPRO models). A total of 1101 university students participated in the study. The design was cross-sectional and retrospective with attributional (or selection) variables, using correlations and structural analysis. The results provide consistent and significant empirical evidence for the relationships hypothesized, incorporating variables that are part of and influence the teaching–learning process in Higher Education. Findings confirm the importance of interactive relationships within the teaching–learning process, where personal self-regulation is assumed to take place in connection with regulatory teaching. Variables that are involved in the relationships validated here reinforce the idea that both personal factors and teaching and learning factors should be taken into consideration when dealing with a formal teaching–learning context at university. PMID:25964764
Xu, Rengyi; Mesaros, Clementina; Weng, Liwei; Snyder, Nathaniel W; Vachani, Anil; Blair, Ian A; Hwang, Wei-Ting
2017-07-01
We compared three statistical methods for selecting a panel of serum lipid biomarkers for mesothelioma and asbestos exposure. Serum samples from mesothelioma patients, asbestos-exposed subjects and controls (40 per group) were analyzed. Three variable selection methods were considered: top-ranked predictors from univariate models, stepwise selection, and the least absolute shrinkage and selection operator. The cross-validated area under the receiver operating characteristic curve was used to compare prediction performance. Lipids with a high cross-validated area under the curve were identified. The lipid with a mass-to-charge ratio of 372.31 was selected by all three methods when comparing mesothelioma versus control. Lipids with mass-to-charge ratios of 1464.80 and 329.21 were selected by two models for asbestos exposure versus control. Different methods selected a similar set of serum lipids. Combining candidate biomarkers can improve prediction.
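The cross-validated AUC used above to compare selection methods can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the helper names (`auc`, `cross_validated_auc`, `fit_direction`) and the toy one-variable "model" are assumptions introduced here, and the fold assignment is a simple interleaved split.

```python
def auc(scores, labels):
    """Area under the ROC curve via the pairwise (Mann-Whitney) identity."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


def cross_validated_auc(x, y, fit, k=5):
    """Pool out-of-fold scores over k interleaved folds, then compute one AUC."""
    n = len(x)
    out_of_fold = [0.0] * n
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        test = [i for i in range(n) if i % k == fold]
        predict = fit([x[i] for i in train], [y[i] for i in train])
        for i in test:
            out_of_fold[i] = predict(x[i])
    return auc(out_of_fold, y)


def fit_direction(x_train, y_train):
    """Hypothetical one-variable model: score is the value, signed so cases score higher."""
    pos = [v for v, lab in zip(x_train, y_train) if lab == 1]
    neg = [v for v, lab in zip(x_train, y_train) if lab == 0]
    sign = 1.0 if sum(pos) / len(pos) >= sum(neg) / len(neg) else -1.0
    return lambda v: sign * v
```

Each candidate panel would be scored by the same routine, so the comparison between selection methods rests on predictions for samples the model did not see during fitting.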
Uncertain programming models for portfolio selection with uncertain returns
NASA Astrophysics Data System (ADS)
Zhang, Bo; Peng, Jin; Li, Shengguo
2015-10-01
In an indeterminate economic environment, experts' knowledge about the returns of securities involves uncertainty rather than randomness. This paper discusses the portfolio selection problem in an uncertain environment in which security returns cannot be well reflected by historical data but can be evaluated by experts. In the paper, returns of securities are assumed to be given by uncertain variables. According to various decision criteria, the portfolio selection problem in an uncertain environment is formulated as an expected-variance-chance model and a chance-expected-variance model by using uncertain programming. Within the framework of uncertainty theory, some crisp equivalents are discussed under different conditions to make the models easier to solve. In addition, a hybrid intelligent algorithm is designed to provide a general method for solving the new models in general cases. Finally, two numerical examples are provided to show the performance and applications of the models and algorithm.
Huang, Tao; Li, Xiao-yu; Xu, Meng-ling; Jin, Rui; Ku, Jing; Xu, Sen-miao; Wu, Zhen-zhong
2015-01-01
The quality of potatoes is directly related to their edible and industrial value. Hollow heart of potato, a physiological disorder that occurs inside the tuber, is difficult to detect. This paper puts forward a non-destructive detection method that uses semi-transmission hyperspectral imaging with a support vector machine (SVM) to detect hollow heart of potato. Compared with reflection and transmission hyperspectral images, semi-transmission hyperspectral images are clearer and contain the internal quality information of agricultural products. In this study, 224 potato samples (149 normal samples and 75 hollow samples) were selected as the research object, and a semi-transmission hyperspectral image acquisition system was constructed to acquire the hyperspectral images (390-1040 nm) of the potato samples; the average spectrum of the region of interest was then extracted for spectral characteristics analysis. Normalization was used to preprocess the original spectrum, and a prediction model was developed based on SVM using all wave bands; the recognition accuracy on the test set was only 87.5%. In order to simplify the model, the competitive adaptive reweighted sampling algorithm (CARS) and the successive projection algorithm (SPA) were used to select important variables from all 520 spectral variables, and 8 variables were selected (454, 601, 639, 664, 748, 827, 874 and 936 nm). A test-set recognition accuracy of 94.64% was obtained by using the 8 variables to develop the SVM model. Parameter optimization algorithms, including the artificial fish swarm algorithm (AFSA), genetic algorithm (GA) and grid search algorithm, were used to optimize the SVM model parameters: penalty parameter c and kernel parameter g. After comparative analysis, AFSA, a new bionic optimization algorithm based on the foraging behavior of fish swarms, was shown to give the optimal model parameters (c = 10.6591, g = 0.3497), and a recognition accuracy of 100% was obtained for the AFSA-SVM model. The results indicate that combining semi-transmission hyperspectral imaging technology with CARS-SPA and AFSA-SVM can accurately detect hollow heart of potato, and can also provide technical support for rapid non-destructive detection of hollow heart of potato.
Longobardi, F; Ventrella, A; Bianco, A; Catucci, L; Cafagna, I; Gallo, V; Mastrorilli, P; Agostiano, A
2013-12-01
In this study, non-targeted ¹H NMR fingerprinting was used in combination with multivariate statistical techniques for the classification of Italian sweet cherries based on their different geographical origins (Emilia Romagna and Puglia). As classification techniques, Soft Independent Modelling of Class Analogy (SIMCA), Partial Least Squares Discriminant Analysis (PLS-DA), and Linear Discriminant Analysis (LDA) were carried out and the results were compared. For LDA, before performing a refined selection of the number/combination of variables, two different strategies for a preliminary reduction of the variable number were tested. The best average recognition and CV prediction abilities (both 100.0%) were obtained for all the LDA models, although PLS-DA also showed remarkable performance (94.6%). All the statistical models were validated by observing the prediction abilities with respect to an external set of cherry samples. The best result (94.9%) was obtained with LDA by performing a best subset selection procedure on a set of 30 principal components previously selected by a stepwise decorrelation. The metabolites that contributed most to the classification performance of this LDA model were found to be malate, glucose, fructose, glutamine and succinate. Copyright © 2013 Elsevier Ltd. All rights reserved.
NASA Technical Reports Server (NTRS)
1979-01-01
A nonlinear, maximum likelihood, parameter identification computer program (NLSCIDNT) is described which evaluates rotorcraft stability and control coefficients from flight test data. The optimal estimates of the parameters (stability and control coefficients) are determined (identified) by minimizing the negative log likelihood cost function. The minimization technique is the Levenberg-Marquardt method, which behaves like the steepest descent method when it is far from the minimum and behaves like the modified Newton-Raphson method when it is nearer the minimum. Twenty-one states and 40 measurement variables are modeled, and any subset may be selected. States which are not integrated may be fixed at an input value, or time history data may be substituted for the state in the equations of motion. Any aerodynamic coefficient may be expressed as a nonlinear polynomial function of selected 'expansion variables'.
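The switching behavior of the Levenberg-Marquardt method described above (gradient-descent-like far from the minimum, Newton-like near it) comes from a damping factor that grows after a rejected step and shrinks after an accepted one. The sketch below is illustrative only and is not NLSCIDNT: it assumes a least-squares cost (which a negative log likelihood reduces to under Gaussian errors with known, equal variances) and a hypothetical two-parameter model y = a·exp(b·t) chosen for brevity.

```python
import math


def levenberg_marquardt(t, y, a, b, lam=1e-2, iters=100):
    """Fit y = a * exp(b * t) by damped Gauss-Newton steps (Marquardt scaling)."""

    def cost(a_, b_):
        return sum((yi - a_ * math.exp(b_ * ti)) ** 2 for ti, yi in zip(t, y))

    current = cost(a, b)
    for _ in range(iters):
        # Accumulate J^T J and J^T r, where r = y - f and J = df/d(a, b).
        jaa = jab = jbb = ga = gb = 0.0
        for ti, yi in zip(t, y):
            e = math.exp(b * ti)
            r = yi - a * e
            da = e             # df/da
            db = a * ti * e    # df/db
            jaa += da * da
            jab += da * db
            jbb += db * db
            ga += da * r
            gb += db * r
        # Damp the diagonal: large lam -> short gradient-like step,
        # small lam -> nearly a Gauss-Newton step.
        haa = jaa * (1.0 + lam)
        hbb = jbb * (1.0 + lam)
        det = haa * hbb - jab * jab
        if det == 0.0:
            lam *= 10.0
            continue
        step_a = (hbb * ga - jab * gb) / det
        step_b = (haa * gb - jab * ga) / det
        trial = cost(a + step_a, b + step_b)
        if trial < current:
            a, b, current = a + step_a, b + step_b, trial
            lam /= 10.0        # trust the quadratic model more
        else:
            lam *= 10.0        # reject the step and damp harder
    return a, b
```

Fixing a state at an input value, as the abstract describes, would correspond here to dropping that parameter's row and column from the normal equations.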
Aspinall, Richard
2004-08-01
This paper develops an approach to modelling land use change that links model selection and multi-model inference with empirical models and GIS. Land use change is frequently studied, and understanding gained, through a process of modelling that is an empirical analysis of documented changes in land cover or land use patterns. The approach here is based on analysis and comparison of multiple models of land use patterns using model selection and multi-model inference. The approach is illustrated with a case study of rural housing as it has developed for part of Gallatin County, Montana, USA. A GIS contains the location of rural housing on a yearly basis from 1860 to 2000. The database also documents a variety of environmental and socio-economic conditions. A general model of settlement development describes the evolution of drivers of land use change and their impacts in the region. This model is used to develop a series of different models reflecting drivers of change at different periods in the history of the study area. These period specific models represent a series of multiple working hypotheses describing (a) the effects of spatial variables as a representation of social, economic and environmental drivers of land use change, and (b) temporal changes in the effects of the spatial variables as the drivers of change evolve over time. Logistic regression is used to calibrate and interpret these models and the models are then compared and evaluated with model selection techniques. Results show that different models are 'best' for the different periods. The different models for different periods demonstrate that models are not invariant over time which presents challenges for validation and testing of empirical models. 
The research demonstrates (i) model selection as a mechanism for rating among many plausible models that describe land cover or land use patterns, (ii) inference from a set of models rather than from a single model, (iii) that models can be developed based on hypothesised relationships based on consideration of underlying and proximate causes of change, and (iv) that models are not invariant over time.
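Multi-model inference of the kind described above is often carried out with information-criterion weights. The abstract does not name the criterion used, so the following is a sketch under the assumption that AIC and Akaike weights are the ranking tool; the function names are illustrative.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion for a fitted model."""
    return 2.0 * n_params - 2.0 * log_likelihood


def akaike_weights(aic_values):
    """Relative support for each model in a candidate set (weights sum to 1)."""
    best = min(aic_values)
    relative = [math.exp(-0.5 * (value - best)) for value in aic_values]
    total = sum(relative)
    return [r / total for r in relative]
```

Computing the weights separately for each historical period would show whether the best-supported model changes over time, and inference can then draw on the weighted set rather than on the single top-ranked model.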
Development of a Robust Identifier for NPPs Transients Combining ARIMA Model and EBP Algorithm
NASA Astrophysics Data System (ADS)
Moshkbar-Bakhshayesh, Khalil; Ghofrani, Mohammad B.
2014-08-01
This study introduces a novel identification method for recognition of nuclear power plant (NPP) transients by combining the autoregressive integrated moving-average (ARIMA) model and the neural network with error backpropagation (EBP) learning algorithm. The proposed method consists of three steps. First, an EBP-based identifier is adopted to distinguish the plant normal states from the faulty ones. In the second step, ARIMA models use the integrated (I) process to convert non-stationary data of the selected variables into stationary data. Subsequently, ARIMA processes, including autoregressive (AR), moving-average (MA), or autoregressive moving-average (ARMA), are used to forecast time series of the selected plant variables. In the third step, to identify the type of transient, the forecasted time series are fed to the modular identifier, which has been developed using the latest advances of the EBP learning algorithm. Bushehr nuclear power plant (BNPP) transients are probed to analyze the ability of the proposed identifier. Recognition of a transient is based on the similarity of its statistical properties to the reference one, rather than on the values of input patterns. Greater robustness against noisy data and an improved balance between memorization and generalization are salient advantages of the proposed identifier. Reduction of false identification, sole dependency of identification on the sign of each output signal, selection of the plant variables for transients training independent of each other, and extendibility for identification of more transients without unfavorable effects are other merits of the proposed identifier.
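The second step above (differencing to reach stationarity, then an autoregressive forecast) can be illustrated with the simplest such model, ARIMA(1,1,0). This is a toy sketch, not the authors' identifier: the function names are assumptions, and the AR coefficient is fitted by ordinary least squares rather than by a full ARIMA estimation routine.

```python
def difference(series):
    """First difference: the 'integrated' step with d = 1."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1]."""
    prev, nxt = series[:-1], series[1:]
    n = len(prev)
    mean_prev = sum(prev) / n
    mean_next = sum(nxt) / n
    sxx = sum((p - mean_prev) ** 2 for p in prev)
    sxy = sum((p - mean_prev) * (q - mean_next) for p, q in zip(prev, nxt))
    phi = sxy / sxx if sxx else 0.0   # constant input: fall back to the mean
    c = mean_next - phi * mean_prev
    return c, phi


def forecast_arima_110(series, steps):
    """Forecast by modeling the differences as AR(1), then re-integrating."""
    diffs = difference(series)
    c, phi = fit_ar1(diffs)
    level, last_diff = series[-1], diffs[-1]
    forecast = []
    for _ in range(steps):
        last_diff = c + phi * last_diff
        level += last_diff            # undo the differencing
        forecast.append(level)
    return forecast
```

In the scheme the abstract describes, forecasts like these for each selected plant variable would form the input to the EBP-trained identifier.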
Peikert, Tobias; Duan, Fenghai; Rajagopalan, Srinivasan; Karwoski, Ronald A; Clay, Ryan; Robb, Richard A; Qin, Ziling; Sicks, JoRean; Bartholmai, Brian J; Maldonado, Fabien
2018-01-01
Optimization of the clinical management of screen-detected lung nodules is needed to avoid unnecessary diagnostic interventions. Herein we demonstrate the potential value of a novel radiomics-based approach for the classification of screen-detected indeterminate nodules. Independent quantitative variables assessing various radiologic nodule features such as sphericity, flatness, elongation, spiculation, lobulation and curvature were developed from the NLST dataset using 726 indeterminate nodules (all ≥ 7 mm, benign, n = 318 and malignant, n = 408). Multivariate analysis was performed using the least absolute shrinkage and selection operator (LASSO) method for variable selection and regularization in order to enhance the prediction accuracy and interpretability of the multivariate model. The bootstrapping method was then applied for internal validation, and the optimism-corrected AUC was reported for the final model. Eight of the originally considered 57 quantitative radiologic features were selected by LASSO multivariate modeling. These 8 features include variables capturing Location: vertical location (Offset carina centroid z), Size: volume estimate (Minimum enclosing brick), Shape: flatness, Density: texture analysis (Score Indicative of Lesion/Lung Aggression/Abnormality (SILA) texture), and surface characteristics: surface complexity (Maximum shape index and Average shape index), and estimates of surface curvature (Average positive mean curvature and Minimum mean curvature), all with P<0.01. The optimism-corrected AUC for these 8 features is 0.939. Our novel radiomic LDCT-based approach for indeterminate screen-detected nodule characterization appears extremely promising; however, independent external validation is needed.
Genetic diversity in the interference selection limit.
Good, Benjamin H; Walczak, Aleksandra M; Neher, Richard A; Desai, Michael M
2014-03-01
Pervasive natural selection can strongly influence observed patterns of genetic variation, but these effects remain poorly understood when multiple selected variants segregate in nearby regions of the genome. Classical population genetics fails to account for interference between linked mutations, which grows increasingly severe as the density of selected polymorphisms increases. Here, we describe a simple limit that emerges when interference is common, in which the fitness effects of individual mutations play a relatively minor role. Instead, similar to models of quantitative genetics, molecular evolution is determined by the variance in fitness within the population, defined over an effectively asexual segment of the genome (a "linkage block"). We exploit this insensitivity in a new "coarse-grained" coalescent framework, which approximates the effects of many weakly selected mutations with a smaller number of strongly selected mutations that create the same variance in fitness. This approximation generates accurate and efficient predictions for silent site variability when interference is common. However, these results suggest that there is reduced power to resolve individual selection pressures when interference is sufficiently widespread, since a broad range of parameters possess nearly identical patterns of silent site variability.
Mental health courts and their selection processes: modeling variation for consistency.
Wolff, Nancy; Fabrikant, Nicole; Belenko, Steven
2011-10-01
Admission into mental health courts is based on a complicated and often variable decision-making process that involves multiple parties representing different expertise and interests. To the extent that eligibility criteria of mental health courts are more suggestive than deterministic, selection bias can be expected. Very little research has focused on the selection processes underpinning problem-solving courts even though such processes may dominate the performance of these interventions. This article describes a qualitative study designed to deconstruct the selection and admission processes of mental health courts. In this article, we describe a multi-stage, complex process for screening and admitting clients into mental health courts. The selection filtering model that is described has three eligibility screening stages: initial, assessment, and evaluation. The results of this study suggest that clients selected by mental health courts are shaped by the formal and informal selection criteria, as well as by the local treatment system.
Hydrological excitation of polar motion by different variables of the GLDAS models
NASA Astrophysics Data System (ADS)
Wińska, Małgorzata; Nastula, Jolanta
Continental hydrological loading by land water, snow, and ice must be taken into account for a full understanding of the excitation of polar motion. In this study we compute different estimates of hydrological excitation functions of polar motion (Hydrological Angular Momentum, HAM) using various variables from the Global Land Data Assimilation System (GLDAS) models of the land hydrosphere. The main aim of this study is to show the influence of different variables, for example total evapotranspiration, runoff, snowmelt and soil moisture, on polar motion excitation at annual and short-term scales. We employ several realizations of the GLDAS model: the GLDAS Common Land Model (CLM), the GLDAS Mosaic Model, the GLDAS National Centers for Environmental Prediction/Oregon State University/Air Force/Hydrologic Research Lab Model (Noah), and the GLDAS Variable Infiltration Capacity (VIC) Model. Hydrological excitation functions of polar motion, both global and regional, are determined by using selected variables of these GLDAS realizations. First, we compare the timing, spectra and phase diagrams of the different regional and global HAMs with each other. Next, we estimate the hydrological signal in geodetically observed polar motion excitation by subtracting the atmospheric (AAM: pressure + wind) and oceanic (OAM: bottom pressure + currents) contributions. Finally, the hydrological excitations are compared with this hydrological signal in the observed polar motion excitation series. The results help us understand which variables of the considered hydrological models are the most important for polar motion excitation and how well we can close the polar motion excitation budget in the seasonal and inter-annual spectral ranges.
Chakraborty, Debojyoti; Wang, Tongli; Andre, Konrad; Konnert, Monika; Lexer, Manfred J.; Matulla, Christoph; Schueler, Silvio
2015-01-01
Identifying populations within tree species potentially adapted to future climatic conditions is an important requirement for reforestation and assisted migration programmes. Such populations can be identified either by empirical response functions based on correlations of quantitative traits with climate variables or by climate envelope models that compare the climate of seed sources and potential growing areas. In the present study, we analyzed the intraspecific variation in climate growth response of Douglas-fir planted within the non-analogous climate conditions of Central and continental Europe. With data from 50 common garden trials, we developed Universal Response Functions (URF) for tree height and mean basal area and compared the growth performance of the selected best performing populations with that of populations identified through a climate envelope approach. Climate variables of the trial location were found to be stronger predictors of growth performance than climate variables of the population origin. Although the precipitation regime of the population sources varied strongly none of the precipitation related climate variables of population origin was found to be significant within the models. Overall, the URFs explained more than 88% of variation in growth performance. Populations identified by the URF models originate from western Cascades and coastal areas of Washington and Oregon and show significantly higher growth performance than populations identified by the climate envelope approach under both current and climate change scenarios. The URFs predict decreasing growth performance at low and middle elevations of the case study area, but increasing growth performance on high elevation sites. 
Our analysis suggests that population recommendations based on empirical approaches should be preferred, and that population selections by climate envelope models that do not consider climatic constraints on growth performance should be carefully appraised before transferring populations to planting locations with novel or dissimilar climate. PMID:26288363
Tufto, Jarle
2015-08-01
Adaptive responses to autocorrelated environmental fluctuations through evolution in mean reaction norm elevation and slope and an independent component of the phenotypic variance are analyzed using a quantitative genetic model. Analytic approximations expressing the mutual dependencies between all three response modes are derived and solved for the joint evolutionary outcome. Both genetic evolution in reaction norm elevation and plasticity are favored by slow temporal fluctuations, with plasticity, in the absence of microenvironmental variability, being the dominant evolutionary outcome for reasonable parameter values. For fast fluctuations, tracking of the optimal phenotype through genetic evolution and plasticity is limited. If residual fluctuations in the optimal phenotype are large and stabilizing selection is strong, selection then acts to increase the phenotypic variance (bet-hedging adaptive). Otherwise, canalizing selection occurs. If the phenotypic variance increases with plasticity through the effect of microenvironmental variability, this shifts the joint evolutionary balance away from plasticity in favor of genetic evolution. If microenvironmental deviations experienced by each individual at the time of development and selection are correlated, however, more plasticity evolves. The adaptive significance of evolutionary fluctuations in plasticity and the phenotypic variance, transient evolution, and the validity of the analytic approximations are investigated using simulations. © 2015 The Author(s). Evolution © 2015 The Society for the Study of Evolution.
Constrained Stochastic Extended Redundancy Analysis.
DeSarbo, Wayne S; Hwang, Heungsun; Stadler Blank, Ashley; Kappe, Eelco
2015-06-01
We devise a new statistical methodology called constrained stochastic extended redundancy analysis (CSERA) to examine the comparative impact of various conceptual factors, or drivers, as well as the specific predictor variables that contribute to each driver on designated dependent variable(s). The technical details of the proposed methodology, the maximum likelihood estimation algorithm, and model selection heuristics are discussed. A sports marketing consumer psychology application is provided in a Major League Baseball (MLB) context where the effects of six conceptual drivers of game attendance and their defining predictor variables are estimated. Results compare favorably to those obtained using traditional extended redundancy analysis (ERA).
NASA Astrophysics Data System (ADS)
Rosero-Vlasova, O.; Borini Alves, D.; Vlassova, L.; Perez-Cabello, F.; Montorio Lloveria, R.
2017-10-01
Deforestation in the Amazon basin due, among other factors, to frequent wildfires demands continuous post-fire monitoring of soil and vegetation. Thus, the study posed two objectives: (1) evaluate the capacity of Visible - Near InfraRed - ShortWave InfraRed (VIS-NIR-SWIR) spectroscopy to estimate soil organic matter (SOM) in fire-affected soils, and (2) assess the feasibility of SOM mapping from satellite images. For this purpose, 30 soil samples (surface layer) were collected in 2016 in areas of grass and riparian vegetation of Campos Amazonicos National Park, Brazil, repeatedly affected by wildfires. Standard laboratory procedures were applied to determine SOM. Reflectance spectra of soils were obtained in controlled laboratory conditions using a Fieldspec4 spectroradiometer (spectral range 350-2500 nm). Measured spectra were resampled to simulate reflectances for Landsat-8, Sentinel-2 and EnMap spectral bands, which were used as predictors in SOM models developed using Partial Least Squares regression and a step-down variable selection algorithm (PLSR-SD). The best fit was achieved with models based on reflectances simulated for EnMap bands (R2=0.93; R2cv=0.82 and NMSE=0.07; NMSEcv=0.19). The model uses only 8 out of 244 predictors (bands) chosen by the step-down variable selection algorithm. The least reliable estimates (R2=0.55 and R2cv=0.40 and NMSE=0.43; NMSEcv=0.60) resulted from the Landsat model, while the Sentinel-2 model showed R2=0.68 and R2cv=0.63; NMSE=0.31 and NMSEcv=0.38. The results confirm the high potential of VIS-NIR-SWIR spectroscopy for SOM estimation. Application of step-down selection produces sparser and better-fitting models. Finally, SOM can be estimated with an acceptable accuracy (NMSE 0.35) from EnMap and Sentinel-2 data, enabling mapping and analysis of impacts of repeated wildfires on soils in the study area.
Zhang, Xuan; Li, Wei; Yin, Bin; Chen, Weizhong; Kelly, Declan P; Wang, Xiaoxin; Zheng, Kaiyi; Du, Yiping
2013-10-01
Coffee is the most heavily consumed beverage in the world after water, and its quality is a key consideration in commercial trade. Caffeine content, which has a significant effect on the final quality of coffee products, therefore needs to be determined quickly and reliably by new analytical techniques. The main purpose of this work was to establish a powerful and practical analytical method based on near infrared spectroscopy (NIRS) and chemometrics for quantitative determination of caffeine content in roasted Arabica coffees. Ground coffee samples covering a wide range of roast levels were analyzed by NIR, while their caffeine contents were quantitatively determined by the commonly used HPLC-UV method to provide reference values. Calibration models were then developed from chemometric analyses of the NIR spectral data and the reference concentrations of the coffee samples. Partial least squares (PLS) regression was used to construct the models. Furthermore, diverse spectral pretreatment and variable selection techniques were applied in order to obtain robust and reliable reduced-spectrum regression models. Among the models constructed, the combination of second derivative pretreatment and stability competitive adaptive reweighted sampling (SCARS) variable selection provided a notably improved regression model, with a root mean square error of cross validation (RMSECV) of 0.375 mg/g and a correlation coefficient (R) of 0.918 with 7 PLS factors. An independent test set was used to assess the model, giving a root mean square error of prediction (RMSEP) of 0.378 mg/g, a mean relative error of 1.976% and a mean relative standard deviation (RSD) of 1.707%.
Thus, the results of the calibration model demonstrate the feasibility of NIR spectroscopy for at-line prediction of the caffeine content of unknown roasted coffee samples, thanks to the short analysis time of a few seconds and the non-destructive nature of NIRS.
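To illustrate why second derivative pretreatment helps, the sketch below applies a plain central second difference to a toy spectrum. This is a simplified stand-in: the abstract does not say how the derivative was computed, and NIR work commonly uses a smoothed derivative instead. The example shows that a sloped linear baseline vanishes after differentiation, and it includes a root mean square error helper of the kind behind RMSECV and RMSEP. All names and values are hypothetical.

```python
from math import sqrt


def second_derivative(spectrum):
    """Central second difference of an evenly sampled spectrum.

    The result is two points shorter than the input (no edge padding).
    """
    return [spectrum[i - 1] - 2 * spectrum[i] + spectrum[i + 1]
            for i in range(1, len(spectrum) - 1)]


def rmse(reference, predicted):
    """Root mean square error between reference and predicted values."""
    squared = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return sqrt(sum(squared) / len(squared))


# Toy spectrum: a curved feature plus a sloped baseline (made-up numbers).
feature = [i ** 2 for i in range(6)]      # 0, 1, 4, 9, 16, 25
baseline = [5 + 3 * i for i in range(6)]  # linear offset and slope
raw = [f + b for f, b in zip(feature, baseline)]

print(second_derivative(feature))  # [2, 2, 2, 2]
print(second_derivative(raw))      # [2, 2, 2, 2] -> baseline removed
```

Applied to predictions made on left-out calibration samples, `rmse` gives a cross-validation error; applied to an independent test set, it gives a prediction error.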
Bellamy, Chloe; Altringham, John
2015-01-01
Conservation increasingly operates at the landscape scale. For this to be effective, we need landscape scale information on species distributions and the environmental factors that underpin them. Species records are becoming increasingly available via data centres and online portals, but they are often patchy and biased. We demonstrate how such data can yield useful habitat suitability models, using bat roost records as an example. We analysed the effects of environmental variables at eight spatial scales (500 m - 6 km) on roost selection by eight bat species (Pipistrellus pipistrellus, P. pygmaeus, Nyctalus noctula, Myotis mystacinus, M. brandtii, M. nattereri, M. daubentonii, and Plecotus auritus) using the presence-only modelling software MaxEnt. Modelling was carried out on a selection of 418 data centre roost records from the Lake District National Park, UK. Target group pseudoabsences were selected to reduce the impact of sampling bias. Multi-scale models, combining variables measured at their best performing spatial scales, were used to predict roosting habitat suitability, yielding models with useful predictive abilities. Small areas of deciduous woodland consistently increased roosting habitat suitability, but other habitat associations varied between species and scales. Pipistrellus were positively related to built environments at small scales, and depended on large-scale woodland availability. The other, more specialist, species were highly sensitive to human-altered landscapes, avoiding even small rural towns. The strength of many relationships at large scales suggests that bats are sensitive to habitat modifications far from the roost itself. The fine resolution, large extent maps will aid targeted decision-making by conservationists and planners. We have made available an ArcGIS toolbox that automates the production of multi-scale variables, to facilitate the application of our methods to other taxa and locations. 
Habitat suitability modelling has the potential to become a standard tool for supporting landscape-scale decision-making as relevant data and open source, user-friendly, and peer-reviewed software become widely available.
Meguid, Robert A; Bronsert, Michael R; Juarez-Colunga, Elizabeth; Hammermeister, Karl E; Henderson, William G
2016-07-01
To develop parsimonious prediction models for postoperative mortality, overall morbidity, and 6 complication clusters applicable to a broad range of surgical operations in adult patients. Quantitative risk assessment tools are not routinely used for preoperative patient assessment, shared decision making, informed consent, and preoperative patient optimization, likely due in part to the burden of data collection and the complexity of incorporation into routine surgical practice. Multivariable forward selection stepwise logistic regression analyses were used to develop predictive models for 30-day mortality, overall morbidity, and 6 postoperative complication clusters, using 40 preoperative variables from 2,275,240 surgical cases in the American College of Surgeons National Surgical Quality Improvement Program data set, 2005 to 2012. For the mortality and overall morbidity outcomes, prediction models were compared with and without preoperative laboratory variables, and generic models (based on all of the data from 9 surgical specialties) were compared with specialty-specific models. In each model, the cumulative c-index was used to examine the contribution of each added predictor variable. C-indexes, Hosmer-Lemeshow analyses, and Brier scores were used to compare discrimination and calibration between models. For the mortality and overall morbidity outcomes, the prediction models without the preoperative laboratory variables performed as well as the models with the laboratory variables, and the generic models performed as well as the specialty-specific models. The c-indexes were 0.938 for mortality, 0.810 for overall morbidity, and for the 6 complication clusters ranged from 0.757 for infectious to 0.897 for pulmonary complications. Across the 8 prediction models, the first 7 to 11 variables entered accounted for at least 99% of the c-index of the full model (using up to 28 nonlaboratory predictor variables). 
Our results suggest that it will be possible to develop parsimonious models to predict 8 important postoperative outcomes for a broad surgical population, without the need for surgeon specialty-specific models or inclusion of laboratory variables.
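The discrimination and calibration measures named above can be sketched in a few lines. This is a minimal illustration, assuming a binary outcome, a c-index computed as pairwise concordance with ties counted as one half, and a Brier score taken as the mean squared difference between predicted risk and outcome. The helper `variables_needed` is a hypothetical name for the "share of the full-model c-index" idea, and all numbers are made up rather than taken from the study.

```python
def c_index(outcomes, risks):
    """Share of (event, non-event) pairs where the event case has the
    higher predicted risk; tied risks count as half a concordant pair."""
    events = [r for y, r in zip(outcomes, risks) if y == 1]
    non_events = [r for y, r in zip(outcomes, risks) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    score = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                score += 1.0
            elif e == n:
                score += 0.5
    return score / (len(events) * len(non_events))


def brier_score(outcomes, risks):
    """Mean squared difference between predicted risk and observed outcome."""
    return sum((r - y) ** 2 for y, r in zip(outcomes, risks)) / len(outcomes)


def variables_needed(cumulative_c, share=0.99):
    """Smallest number of entered variables whose cumulative c-index
    reaches `share` of the full-model value (last list element)."""
    target = share * cumulative_c[-1]
    for k, c in enumerate(cumulative_c, start=1):
        if c >= target:
            return k
    return len(cumulative_c)


# Toy data for illustration only.
died = [0, 0, 1, 1]
predicted_risk = [0.1, 0.4, 0.35, 0.8]
print(c_index(died, predicted_risk))      # 0.75
print(brier_score(died, predicted_risk))

# Hypothetical cumulative c-indexes after each forward-selection step.
print(variables_needed([0.70, 0.82, 0.88, 0.895, 0.90]))  # 4
```

The pairwise loop is quadratic in the number of cases, which is fine for a toy example but would need a rank-based formulation for a data set of millions of records.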