Boosted regression tree, table, and figure data
Spreadsheets are included here to support the manuscript Boosted Regression Tree Models to Explain Watershed Nutrient Concentrations and Biological Condition. This dataset is associated with the following publication: Golden, H., C. Lane, A. Prues, and E. D'Amico. Boosted Regression Tree Models to Explain Watershed Nutrient Concentrations and Biological Condition. JAWRA. American Water Resources Association, Middleburg, VA, USA, 52(5): 1251-1274, (2016).
Boosted Regression Tree Models to Explain Watershed Nutrient Concentrations and Biological Condition
Boosted regression tree (BRT) models were developed to quantify the nonlinear relationships between landscape variables and nutrient concentrations in a mesoscale mixed land cover watershed during base-flow conditions. Factors that affect instream biological components, based on ...
Finding structure in data using multivariate tree boosting
Miller, Patrick J.; Lubke, Gitta H.; McArtor, Daniel B.; Bergeman, C. S.
2016-01-01
Technology and collaboration enable dramatic increases in the size of psychological and psychiatric data collections, but finding structure in these large data sets with many collected variables is challenging. Decision tree ensembles such as random forests (Strobl, Malley, & Tutz, 2009) are a useful tool for finding structure, but are difficult to interpret with multiple outcome variables which are often of interest in psychology. To find and interpret structure in data sets with multiple outcomes and many predictors (possibly exceeding the sample size), we introduce a multivariate extension to a decision tree ensemble method called gradient boosted regression trees (Friedman, 2001). Our extension, multivariate tree boosting, is a method for nonparametric regression that is useful for identifying important predictors, detecting predictors with nonlinear effects and interactions without specification of such effects, and for identifying predictors that cause two or more outcome variables to covary. We provide the R package ‘mvtboost’ to estimate, tune, and interpret the resulting model, which extends the implementation of univariate boosting in the R package ‘gbm’ (Ridgeway et al., 2015) to continuous, multivariate outcomes. To illustrate the approach, we analyze predictors of psychological well-being (Ryff & Keyes, 1995). Simulations verify that our approach identifies predictors with nonlinear effects and achieves high prediction accuracy, exceeding or matching the performance of (penalized) multivariate multiple regression and multivariate decision trees over a wide range of conditions. PMID:27918183
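As a rough illustration of the univariate gradient boosted regression trees that the abstract above builds on, the following minimal sketch fits depth-one trees (stumps) to the residuals of a squared-error loss. It is not the 'mvtboost' or 'gbm' implementation; the function names, the toy data, and the default number of trees and learning rate are all assumptions chosen for the example.

```python
from statistics import mean


def fit_stump(x, residuals):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    best = None
    # Candidate thresholds exclude the largest x so the right side is never empty.
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_value, right_value = mean(left), mean(right)
        sse = sum((r - left_value) ** 2 for r in left)
        sse += sum((r - right_value) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    return best[1:]


def boost(x, y, n_trees=50, learning_rate=0.1):
    """Start from the mean of y, then repeatedly fit a stump to the residuals."""
    base = mean(y)
    fitted = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        # For squared error, the negative gradient is simply the residual.
        residuals = [yi - fi for yi, fi in zip(y, fitted)]
        threshold, left_value, right_value = fit_stump(x, residuals)
        stumps.append((threshold, left_value, right_value))
        fitted = [
            fi + learning_rate * (left_value if xi <= threshold else right_value)
            for xi, fi in zip(x, fitted)
        ]
    return {"base": base, "rate": learning_rate, "stumps": stumps}


def predict(model, xi):
    """Sum the shrunken contributions of all stumps for one input value."""
    total = model["base"]
    for threshold, left_value, right_value in model["stumps"]:
        total += model["rate"] * (left_value if xi <= threshold else right_value)
    return total


# Hypothetical toy data: a step function that a linear model would fit poorly.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0.0] * 5 + [1.0] * 5
model = boost(x, y)
```

The learning rate shrinks each stump's contribution, so many small steps are needed to approach the step function; this is the usual trade-off between the number of trees and the amount of shrinkage.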
Shin, Yoonseok
2015-01-01
Among the recent data mining techniques available, the boosting approach has attracted a great deal of attention because of its effective learning algorithm and strong boundaries in terms of its generalization performance. However, the boosting approach has yet to be used in regression problems within the construction domain, including cost estimations, but has been actively utilized in other domains. Therefore, a boosting regression tree (BRT) is applied to cost estimations at the early stage of a construction project to examine the applicability of the boosting approach to a regression problem within the construction domain. To evaluate the performance of the BRT model, its performance was compared with that of a neural network (NN) model, which has been proven to have a high performance in cost estimation domains. The BRT model has shown results similar to those of NN model using 234 actual cost datasets of a building construction project. In addition, the BRT model can provide additional information such as the importance plot and structure model, which can support estimators in comprehending the decision making process. Consequently, the boosting approach has potential applicability in preliminary cost estimations in a building construction project. PMID:26339227
E. Freeman; G. Moisen; J. Coulston; B. Wilson
2014-01-01
Random forests (RF) and stochastic gradient boosting (SGB), both involving an ensemble of classification and regression trees, are compared for modeling tree canopy cover for the 2011 National Land Cover Database (NLCD). The objectives of this study were twofold. First, sensitivity of RF and SGB to choices in tuning parameters was explored. Second, performance of the...
Factor complexity of crash occurrence: An empirical demonstration using boosted regression trees.
Chung, Yi-Shih
2013-12-01
Factor complexity is a characteristic of traffic crashes. This paper proposes a novel method, namely boosted regression trees (BRT), to investigate the complex and nonlinear relationships in high-variance traffic crash data. The Taiwanese 2004-2005 single-vehicle motorcycle crash data are used to demonstrate the utility of BRT. Traditional logistic regression and classification and regression tree (CART) models are also used to compare their estimation results and external validities. Both the in-sample cross-validation and out-of-sample validation results show that an increase in tree complexity improves classification performance, although with diminishing gains, indicating a limited factor complexity of single-vehicle motorcycle crashes. The effects of crucial variables including geographical, time, and sociodemographic factors explain some fatal crashes. Relatively unique fatal crashes are better approximated by interactive terms, especially combinations of behavioral factors. BRT models generally provide better transferability than conventional logistic regression and CART models. This study also discusses the implications of the results for devising safety policies. Copyright © 2012 Elsevier Ltd. All rights reserved.
Westreich, Daniel; Lessler, Justin; Funk, Michele Jonsson
2010-08-01
Propensity scores for the analysis of observational data are typically estimated using logistic regression. Our objective in this review was to assess machine learning alternatives to logistic regression, which may accomplish the same goals but with fewer assumptions or greater accuracy. We identified alternative methods for propensity score estimation and/or classification from the public health, biostatistics, discrete mathematics, and computer science literature, and evaluated these algorithms for applicability to the problem of propensity score estimation, potential advantages over logistic regression, and ease of use. We identified four techniques as alternatives to logistic regression: neural networks, support vector machines, decision trees (classification and regression trees [CART]), and meta-classifiers (in particular, boosting). Although the assumptions of logistic regression are well understood, those assumptions are frequently ignored. All four alternatives have advantages and disadvantages compared with logistic regression. Boosting (meta-classifiers) and, to a lesser extent, decision trees (particularly CART), appear to be most promising for use in the context of propensity score analysis, but extensive simulation studies are needed to establish their utility in practice. Copyright (c) 2010 Elsevier Inc. All rights reserved.
Pashaei, Elnaz; Ozen, Mustafa; Aydin, Nizamettin
2015-08-01
Improving the accuracy of supervised classification algorithms in biomedical applications is an active area of research. In this study, we improve the performance of Particle Swarm Optimization (PSO) combined with the C4.5 decision tree (PSO+C4.5) classifier by applying the Boosted C5.0 decision tree as the fitness function. To evaluate the effectiveness of the proposed method, it is applied to 1 microarray dataset and 5 different medical datasets obtained from the UCI machine learning databases. Moreover, the results of the PSO + Boosted C5.0 implementation are compared to eight well-known benchmark classification methods (PSO+C4.5, support vector machine with a Radial Basis Function kernel, Classification And Regression Tree (CART), C4.5 decision tree, C5.0 decision tree, Boosted C5.0 decision tree, Naive Bayes and Weighted K-Nearest Neighbor). Repeated five-fold cross-validation was used to assess the performance of the classifiers. Experimental results show that the proposed method not only improves the performance of PSO+C4.5 but also obtains higher classification accuracy than the other classification methods.
Deciphering factors controlling groundwater arsenic spatial variability in Bangladesh
NASA Astrophysics Data System (ADS)
Tan, Z.; Yang, Q.; Zheng, C.; Zheng, Y.
2017-12-01
Elevated concentrations of geogenic arsenic in groundwater have been found in many countries to exceed 10 μg/L, the WHO's guideline value for drinking water. A common yet unexplained characteristic of groundwater arsenic spatial distribution is the extensive variability at various spatial scales. This study investigates factors influencing the spatial variability of groundwater arsenic in Bangladesh to improve the accuracy of models predicting arsenic exceedance rate spatially. A novel boosted regression tree method is used to establish a weak-learning ensemble model, which is compared to a linear model using a conventional stepwise logistic regression method. The boosted regression tree models offer the advantage of parametric interaction when big datasets are analyzed in comparison to the logistic regression. The point data set (n=3,538) of groundwater hydrochemistry with 19 parameters was obtained by the British Geological Survey in 2001. The spatial data sets of geological parameters (n=13) were from the Consortium for Spatial Information, Technical University of Denmark, University of East Anglia and the FAO, while the soil parameters (n=42) were from the Harmonized World Soil Database. The aforementioned parameters were regressed to categorical groundwater arsenic concentrations below or above three thresholds: 5 μg/L, 10 μg/L and 50 μg/L to identify respective controlling factors. Boosted regression tree method outperformed logistic regression methods in all three threshold levels in terms of accuracy, specificity and sensitivity, resulting in an improvement of spatial distribution map of probability of groundwater arsenic exceeding all three thresholds when compared to disjunctive-kriging interpolated spatial arsenic map using the same groundwater arsenic dataset. Boosted regression tree models also show that the most important controlling factors of groundwater arsenic distribution include groundwater iron content and well depth for all three thresholds. 
The probability of a well with iron content higher than 5 mg/L to contain greater than 5 μg/L, 10 μg/L and 50 μg/L As is estimated to be more than 91%, 85% and 51%, respectively, while the probability of a well from depth more than 160 m to contain more than 5 μg/L, 10 μg/L and 50 μg/L As is estimated to be less than 38%, 25% and 14%, respectively.
Westreich, Daniel; Lessler, Justin; Funk, Michele Jonsson
2010-01-01
Summary. Objective: Propensity scores for the analysis of observational data are typically estimated using logistic regression. Our objective in this Review was to assess machine learning alternatives to logistic regression which may accomplish the same goals but with fewer assumptions or greater accuracy. Study Design and Setting: We identified alternative methods for propensity score estimation and/or classification from the public health, biostatistics, discrete mathematics, and computer science literature, and evaluated these algorithms for applicability to the problem of propensity score estimation, potential advantages over logistic regression, and ease of use. Results: We identified four techniques as alternatives to logistic regression: neural networks, support vector machines, decision trees (CART), and meta-classifiers (in particular, boosting). Conclusion: While the assumptions of logistic regression are well understood, those assumptions are frequently ignored. All four alternatives have advantages and disadvantages compared with logistic regression. Boosting (meta-classifiers) and to a lesser extent decision trees (particularly CART) appear to be most promising for use in the context of propensity score analysis, but extensive simulation studies are needed to establish their utility in practice. PMID:20630332
Gretchen G. Moisen; Elizabeth A. Freeman; Jock A. Blackard; Tracey S. Frescino; Niklaus E. Zimmermann; Thomas C. Edwards
2006-01-01
Many efforts are underway to produce broad-scale forest attribute maps by modelling forest class and structure variables collected in forest inventories as functions of satellite-based and biophysical information. Typically, variants of classification and regression trees implemented in Rulequest's© See5 and Cubist (for binary and continuous responses,...
Austin, Peter C; Lee, Douglas S; Steyerberg, Ewout W; Tu, Jack V
2012-01-01
In biomedical research, the logistic regression model is the most commonly used method for predicting the probability of a binary outcome. While many clinical researchers have expressed an enthusiasm for regression trees, this method may have limited accuracy for predicting health outcomes. We aimed to evaluate the improvement that is achieved by using ensemble-based methods, including bootstrap aggregation (bagging) of regression trees, random forests, and boosted regression trees. We analyzed 30-day mortality in two large cohorts of patients hospitalized with either acute myocardial infarction (N = 16,230) or congestive heart failure (N = 15,848) in two distinct eras (1999–2001 and 2004–2005). We found that both the in-sample and out-of-sample prediction of ensemble methods offered substantial improvement in predicting cardiovascular mortality compared to conventional regression trees. However, conventional logistic regression models that incorporated restricted cubic smoothing splines had even better performance. We conclude that ensemble methods from the data mining and machine learning literature increase the predictive performance of regression trees, but may not lead to clear advantages over conventional logistic regression models for predicting short-term mortality in population-based samples of subjects with cardiovascular disease. PMID:22777999
Naghibi, Seyed Amir; Pourghasemi, Hamid Reza; Dixon, Barnali
2016-01-01
Groundwater is considered one of the most valuable fresh water resources. The main objective of this study was to produce groundwater spring potential maps in the Koohrang Watershed, Chaharmahal-e-Bakhtiari Province, Iran, using three machine learning models: boosted regression tree (BRT), classification and regression tree (CART), and random forest (RF). Thirteen hydrological-geological-physiographical (HGP) factors that influence locations of springs were considered in this research. These factors include slope degree, slope aspect, altitude, topographic wetness index (TWI), slope length (LS), plan curvature, profile curvature, distance to rivers, distance to faults, lithology, land use, drainage density, and fault density. Subsequently, groundwater spring potential was modeled and mapped using CART, RF, and BRT algorithms. The predicted results from the three models were validated using the receiver operating characteristics curve (ROC). From 864 springs identified, 605 (≈70 %) locations were used for the spring potential mapping, while the remaining 259 (≈30 %) springs were used for the model validation. The area under the curve (AUC) for the BRT model was calculated as 0.8103 and for CART and RF the AUC were 0.7870 and 0.7119, respectively. Therefore, it was concluded that the BRT model produced the best prediction results while predicting locations of springs followed by CART and RF models, respectively. Geospatially integrated BRT, CART, and RF methods proved to be useful in generating the spring potential map (SPM) with reasonable accuracy.
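The validation step described above (a roughly 70/30 split of 864 springs, scored by the area under the ROC curve) can be sketched as follows. This is a generic illustration rather than the authors' workflow: the `auc` function uses the pairwise-comparison definition of AUC, and the labels and scores are made-up values for the example.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Split sizes reported in the abstract: 864 springs, about 70% for training.
n_springs = 864
n_train = round(0.7 * n_springs)
n_validation = n_springs - n_train

# Hypothetical validation labels (1 = spring present) and model scores.
example_labels = [1, 1, 0, 0]
example_scores = [0.9, 0.4, 0.6, 0.1]
example_auc = auc(example_labels, example_scores)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported values of 0.8103, 0.7870, and 0.7119 should be read.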
NASA Astrophysics Data System (ADS)
Rooper, Christopher N.; Zimmermann, Mark; Prescott, Megan M.
2017-08-01
Deep-sea coral and sponge ecosystems are widespread throughout most of Alaska's marine waters, and are associated with many different species of fishes and invertebrates. These ecosystems are vulnerable to the effects of commercial fishing activities and climate change. We compared four commonly used species distribution models (general linear models, generalized additive models, boosted regression trees and random forest models) and an ensemble model to predict the presence or absence and abundance of six groups of benthic invertebrate taxa in the Gulf of Alaska. All four model types performed adequately on training data for predicting presence and absence, with regression forest models having the best overall performance measured by the area under the receiver-operating-curve (AUC). The models also performed well on the test data for presence and absence with average AUCs ranging from 0.66 to 0.82. For the test data, ensemble models performed the best. For abundance data, there was an obvious demarcation in performance between the two regression-based methods (general linear models and generalized additive models), and the tree-based models. The boosted regression tree and random forest models out-performed the other models by a wide margin on both the training and testing data. However, there was a significant drop-off in performance for all models of invertebrate abundance ( 50%) when moving from the training data to the testing data. Ensemble model performance was between the tree-based and regression-based methods. The maps of predictions from the models for both presence and abundance agreed very well across model types, with an increase in variability in predictions for the abundance data. 
We conclude that where data conforms well to the modeled distribution (such as the presence-absence data and binomial distribution in this study), the four types of models will provide similar results, although the regression-type models may be more consistent with biological theory. For data with highly zero-inflated distributions and non-normal distributions such as the abundance data from this study, the tree-based methods performed better. Ensemble models that averaged predictions across the four model types, performed better than the GLM or GAM models but slightly poorer than the tree-based methods, suggesting ensemble models might be more robust to overfitting than tree methods, while mitigating some of the disadvantages in predictive performance of regression methods.
Automatic energy expenditure measurement for health science.
Catal, Cagatay; Akbulut, Akhan
2018-04-01
It is crucial to predict human energy expenditure accurately in any sports activity and health science application in order to investigate the impact of the activity. However, measurement of the real energy expenditure is not a trivial task and involves complex steps. The objective of this work is to improve the performance of existing estimation models of energy expenditure by using machine learning algorithms and data from several different sensors, and to provide this estimation service on a cloud-based platform. In this study, we used input data such as breath rate and heart rate from three sensors. Inputs are received from a web form and sent to the web service, which applies a regression model on the Azure cloud platform. During the experiments, we assessed several machine learning models based on regression methods. Our experimental results showed that our novel model, which applies Boosted Decision Tree Regression in conjunction with the median aggregation technique, provides the best result among the five other regression algorithms. This cloud-based energy expenditure system, which uses a web service, showed that cloud computing technology is a great opportunity to develop estimation systems, and that the new model which applies Boosted Decision Tree Regression with the median aggregation provides remarkable results. Copyright © 2018 Elsevier B.V. All rights reserved.
Shafizadeh-Moghadam, Hossein; Valavi, Roozbeh; Shahabi, Himan; Chapi, Kamran; Shirzadi, Ataollah
2018-07-01
In this research, eight individual machine learning and statistical models are implemented and compared, and based on their results, seven ensemble models for flood susceptibility assessment are introduced. The individual models included artificial neural networks, classification and regression trees, flexible discriminant analysis, generalized linear model, generalized additive model, boosted regression trees, multivariate adaptive regression splines, and maximum entropy, and the ensemble models were Ensemble Model committee averaging (EMca), Ensemble Model confidence interval Inferior (EMciInf), Ensemble Model confidence interval Superior (EMciSup), Ensemble Model to estimate the coefficient of variation (EMcv), Ensemble Model to estimate the mean (EMmean), Ensemble Model to estimate the median (EMmedian), and Ensemble Model based on weighted mean (EMwmean). The data set covered 201 flood events in the Haraz watershed (Mazandaran province in Iran) and 10,000 randomly selected non-occurrence points. Among the individual models, the highest Area Under the Receiver Operating Characteristic (AUROC) value belonged to boosted regression trees (0.975) and the lowest value was recorded for the generalized linear model (0.642). On the other hand, the proposed EMmedian resulted in the highest accuracy (0.976) among all models. In spite of the outstanding performance of some models, variability among the predictions of individual models was considerable. Therefore, to reduce uncertainty and create more generalizable, more stable, and less sensitive models, ensemble forecasting approaches, and in particular the EMmedian, are recommended for flood susceptibility assessment. Copyright © 2018 Elsevier Ltd. All rights reserved.
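To make the ensemble types named above more concrete, here is a minimal sketch of how several of them could be computed for one map cell from the individual models' predicted probabilities. The abstract does not give the formulas, so everything here is an assumption: the 0.5 threshold for committee averaging, the use of the population standard deviation for the coefficient of variation, the weights, and the function name `ensemble_summaries`.

```python
from statistics import mean, median, pstdev


def ensemble_summaries(probabilities, weights, threshold=0.5):
    """Combine per-model probabilities for a single location.

    `weights` would typically reflect each model's validation score;
    here they are hypothetical inputs supplied by the caller.
    """
    average = mean(probabilities)
    weighted = sum(w * p for w, p in zip(weights, probabilities)) / sum(weights)
    # Committee averaging: share of models voting "flood" after thresholding.
    votes = sum(1 for p in probabilities if p >= threshold)
    return {
        "mean": average,
        "median": median(probabilities),
        "weighted_mean": weighted,
        "committee_average": votes / len(probabilities),
        "coefficient_of_variation": pstdev(probabilities) / average,
    }


# Hypothetical probabilities from four individual models at one location.
summary = ensemble_summaries([0.2, 0.6, 0.7, 0.9], weights=[1, 1, 1, 1])
```

The median is less affected than the mean by a single model that disagrees strongly with the rest, which is one plausible reason a median-based ensemble can be more stable.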
NASA Astrophysics Data System (ADS)
Stas, Michiel; Dong, Qinghan; Heremans, Stien; Zhang, Beier; Van Orshoven, Jos
2016-08-01
This paper compares two machine learning techniques to predict regional winter wheat yields. The models, based on Boosted Regression Trees (BRT) and Support Vector Machines (SVM), are constructed of Normalized Difference Vegetation Indices (NDVI) derived from low resolution SPOT VEGETATION satellite imagery. Three types of NDVI-related predictors were used: Single NDVI, Incremental NDVI and Targeted NDVI. BRT and SVM were first used to select features with high relevance for predicting the yield. Although the exact selections differed between the prefectures, certain periods with high influence scores for multiple prefectures could be identified. The same period of high influence stretching from March to June was detected by both machine learning methods. After feature selection, BRT and SVM models were applied to the subset of selected features for actual yield forecasting. Whereas both machine learning methods returned very low prediction errors, BRT seems to slightly but consistently outperform SVM.
Hart, Carl R; Reznicek, Nathan J; Wilson, D Keith; Pettit, Chris L; Nykaza, Edward T
2016-05-01
Many outdoor sound propagation models exist, ranging from highly complex physics-based simulations to simplified engineering calculations, and more recently, highly flexible statistical learning methods. Several engineering and statistical learning models are evaluated by using a particular physics-based model, namely, a Crank-Nicolson parabolic equation (CNPE), as a benchmark. Narrowband transmission loss values predicted with the CNPE, based upon a simulated data set of meteorological, boundary, and source conditions, act as simulated observations. In the simulated data set sound propagation conditions span from downward refracting to upward refracting, for acoustically hard and soft boundaries, and low frequencies. Engineering models used in the comparisons include the ISO 9613-2 method, Harmonoise, and Nord2000 propagation models. Statistical learning methods used in the comparisons include bagged decision tree regression, random forest regression, boosting regression, and artificial neural network models. Computed skill scores are relative to sound propagation in a homogeneous atmosphere over a rigid ground. Overall skill scores for the engineering noise models are 0.6%, -7.1%, and 83.8% for the ISO 9613-2, Harmonoise, and Nord2000 models, respectively. Overall skill scores for the statistical learning models are 99.5%, 99.5%, 99.6%, and 99.6% for bagged decision tree, random forest, boosting, and artificial neural network regression models, respectively.
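The abstract above reports skill scores relative to a reference prediction but does not state the formula. One common definition, assumed here purely for illustration, is one minus the ratio of the model's mean squared error to the reference's mean squared error; the transmission loss values below are invented.

```python
def mean_squared_error(predicted, observed):
    """Average squared difference between predictions and observations."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)


def skill_score(model, reference, observed):
    """Assumed definition: 1 - MSE(model) / MSE(reference).

    1.0 means a perfect model, 0.0 means no better than the reference,
    and negative values mean worse than the reference.
    """
    return 1.0 - mean_squared_error(model, observed) / mean_squared_error(reference, observed)


# Hypothetical transmission loss values in dB.
observed = [10.0, 20.0, 30.0]
reference = [12.0, 18.0, 33.0]
worse_model = [15.0, 14.0, 38.0]
```

Under this definition a negative score, such as the -7.1% reported for one engineering model, would indicate larger errors than the homogeneous-atmosphere reference.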
Austin, Peter C; Lee, Douglas S
2011-01-01
Purpose: Classification trees are increasingly being used to classify patients according to the presence or absence of a disease or health outcome. A limitation of classification trees is their limited predictive accuracy. In the data-mining and machine learning literature, boosting has been developed to improve classification. Boosting with classification trees iteratively grows classification trees in a sequence of reweighted datasets. In a given iteration, subjects that were misclassified in the previous iteration are weighted more highly than subjects that were correctly classified. Classifications from each of the classification trees in the sequence are combined through a weighted majority vote to produce a final classification. The authors' objective was to examine whether boosting improved the accuracy of classification trees for predicting outcomes in cardiovascular patients. Methods: We examined the utility of boosting classification trees for classifying 30-day mortality outcomes in patients hospitalized with either acute myocardial infarction or congestive heart failure. Results: Improvements in the misclassification rate using boosted classification trees were at best minor compared to when conventional classification trees were used. Minor to modest improvements to sensitivity were observed, with only a negligible reduction in specificity. For predicting cardiovascular mortality, boosted classification trees had high specificity, but low sensitivity. Conclusions: Gains in predictive accuracy for predicting cardiovascular outcomes were less impressive than gains in performance observed in the data mining literature. PMID:22254181
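The reweighting and weighted majority vote described above correspond to the general AdaBoost scheme. The sketch below uses one-split trees (stumps) on a single predictor; it is an illustrative reconstruction and not the authors' software, and the toy data, function names, and number of rounds are assumptions.

```python
import math


def stump_predict(xi, threshold, polarity):
    """Predict +1 or -1 from one threshold; polarity flips the direction."""
    return polarity if xi <= threshold else -polarity


def fit_stump(x, y, weights):
    """Return (weighted_error, threshold, polarity) with the lowest weighted error."""
    best = None
    for threshold in sorted(set(x)):
        for polarity in (1, -1):
            error = sum(
                w for xi, yi, w in zip(x, y, weights)
                if stump_predict(xi, threshold, polarity) != yi
            )
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost(x, y, n_rounds):
    """Labels must be +1 or -1. Returns a list of (alpha, threshold, polarity)."""
    weights = [1.0 / len(x)] * len(x)
    ensemble = []
    for _ in range(n_rounds):
        error, threshold, polarity = fit_stump(x, y, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # this stump's vote weight
        ensemble.append((alpha, threshold, polarity))
        # Misclassified subjects get larger weights in the next round.
        weights = [
            w * math.exp(-alpha * yi * stump_predict(xi, threshold, polarity))
            for xi, yi, w in zip(x, y, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def classify(ensemble, xi):
    """Weighted majority vote across all stumps."""
    vote = sum(a * stump_predict(xi, t, p) for a, t, p in ensemble)
    return 1 if vote >= 0 else -1


# Hypothetical data: no single threshold separates the two classes.
x = [1, 2, 3, 4, 5, 6]
y = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(x, y, n_rounds=3)
```

On this toy pattern any single stump misclassifies at least two of the six points, while the vote of three reweighted stumps classifies all of them correctly, which shows why boosting can help when individual trees are weak.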
Salas, Eric Ariel L; Valdez, Raul; Michel, Stefan
2017-11-01
We modeled summer and winter habitat suitability of Marco Polo argali in the Pamir Mountains in southeastern Tajikistan using these statistical algorithms: Generalized Linear Model, Random Forest, Boosted Regression Tree, Maxent, and Multivariate Adaptive Regression Splines. Using sheep occurrence data collected from 2009 to 2015 and a set of selected habitat predictors, we produced summer and winter habitat suitability maps and determined the important habitat suitability predictors for both seasons. Our results demonstrated that argali selected proximity to riparian areas and greenness as the two most relevant variables for summer, and the degree of slope (gentler slopes between 0° and 20°) and Landsat temperature band for winter. The terrain roughness was also among the most important variables in summer and winter models. Aspect was only significant for winter habitat, with argali preferring south-facing mountain slopes. We evaluated various measures of model performance such as the Area Under the Curve (AUC) and the True Skill Statistic (TSS). Comparing the five algorithms, the AUC scored highest for Boosted Regression Tree in summer (AUC = 0.94) and winter model runs (AUC = 0.94). In contrast, Random Forest underperformed in both model runs.
Yılmaz Isıkhan, Selen; Karabulut, Erdem; Alpar, Celal Reha
2016-01-01
Background/Aim. Evaluating the success of dose prediction based on genetic or clinical data has substantially advanced recently. The aim of this study is to predict various clinical dose values from DNA gene expression datasets using data mining techniques. Materials and Methods. Eleven real gene expression datasets containing dose values were included. First, important genes for dose prediction were selected using iterative sure independence screening. Then, the performances of regression trees (RTs), support vector regression (SVR), RT bagging, SVR bagging, and RT boosting were examined. Results. The results demonstrated that a regression-based feature selection method substantially reduced the number of irrelevant genes from raw datasets. Overall, the best prediction performance in nine of 11 datasets was achieved using SVR; the second most accurate performance was provided using a gradient-boosting machine (GBM). Conclusion. Analysis of various dose values based on microarray gene expression data identified common genes found in our study and the referenced studies. According to our findings, SVR and GBM can be good predictors of dose-gene datasets. Another result of the study was to identify the sample size of n = 25 as a cutoff point for RT bagging to outperform a single RT.
Sankari, E Siva; Manimegalai, D
2017-12-21
Predicting membrane protein types is an important and challenging research area in bioinformatics and proteomics. Traditional biophysical methods are used to classify membrane protein types. Because of the large number of uncharacterized protein sequences in databases, traditional methods are very time consuming, expensive and susceptible to errors. Hence, it is highly desirable to develop a robust, reliable, and efficient method to predict membrane protein types. Imbalanced datasets and large datasets are often handled well by decision tree classifiers. Since the datasets used here are imbalanced, the performance of various decision tree classifiers such as Decision Tree (DT), Classification And Regression Tree (CART), C4.5, Random tree, and REP (Reduced Error Pruning) tree, and of ensemble methods such as Adaboost, RUS (Random Under Sampling) boost, Rotation forest and Random forest, is analysed. Among the various decision tree classifiers, Random forest performs well in less time, with a good accuracy of 96.35%. Another inference is that the RUS boost decision tree classifier is able to classify one or two samples in the class with very few samples, while the other classifiers such as DT, Adaboost, Rotation forest and Random forest are not sensitive to the classes with fewer samples. The performance of the decision tree classifiers is also compared with SVM (Support Vector Machine) and Naive Bayes classifiers. Copyright © 2017 Elsevier Ltd. All rights reserved.
NASA Astrophysics Data System (ADS)
Tang, Jie; Liu, Rong; Zhang, Yue-Li; Liu, Mou-Ze; Hu, Yong-Fang; Shao, Ming-Jie; Zhu, Li-Jun; Xin, Hua-Wen; Feng, Gui-Wen; Shang, Wen-Jun; Meng, Xiang-Guang; Zhang, Li-Rong; Ming, Ying-Zi; Zhang, Wei
2017-02-01
Tacrolimus has a narrow therapeutic window and considerable variability in clinical use. Our goal was to compare the performance of multiple linear regression (MLR) and eight machine learning techniques in pharmacogenetic algorithm-based prediction of tacrolimus stable dose (TSD) in a large Chinese cohort. A total of 1,045 renal transplant patients were recruited, 80% of which were randomly selected as the “derivation cohort” to develop dose-prediction algorithm, while the remaining 20% constituted the “validation cohort” to test the final selected algorithm. MLR, artificial neural network (ANN), regression tree (RT), multivariate adaptive regression splines (MARS), boosted regression tree (BRT), support vector regression (SVR), random forest regression (RFR), lasso regression (LAR) and Bayesian additive regression trees (BART) were applied and their performances were compared in this work. Among all the machine learning models, RT performed best in both derivation [0.71 (0.67-0.76)] and validation cohorts [0.73 (0.63-0.82)]. In addition, the ideal rate of RT was 4% higher than that of MLR. To our knowledge, this is the first study to use machine learning models to predict TSD, which will further facilitate personalized medicine in tacrolimus administration in the future.
Finch, Holmes W; Davis, Andrew; Dean, Raymond S
2015-03-01
The accurate and early identification of individuals with pervasive conditions such as attention deficit hyperactivity disorder (ADHD) is crucial to ensuring that they receive appropriate and timely assistance and treatment. Heretofore, identification of such individuals has proven somewhat difficult, typically involving clinical decision making based on descriptions and observations of behavior, in conjunction with the administration of cognitive assessments. The present study reports on the use of a sensory motor battery in conjunction with a recursive partitioning computer algorithm, boosted trees, to develop a prediction heuristic for identifying individuals with ADHD. Results of the study demonstrate that this method is able to do so with accuracy rates of over 95%, much higher than the popular logistic regression model against which it was compared. Implications of these results for practice are provided.
Hitt, Nathaniel P.; Floyd, Michael; Compton, Michael; McDonald, Kenneth
2016-01-01
Chrosomus cumberlandensis (Blackside Dace [BSD]) and Etheostoma spilotum (Kentucky Arrow Darter [KAD]) are fish species of conservation concern due to their fragmented distributions, their low population sizes, and threats from anthropogenic stressors in the southeastern United States. We evaluated the relationship between fish abundance and stream conductivity, an index of environmental quality and potential physiological stressor. We modeled occurrence and abundance of KAD in the upper Kentucky River basin (208 samples) and BSD in the upper Cumberland River basin (294 samples) for sites sampled between 2003 and 2013. Segmented regression indicated a conductivity change-point for BSD abundance at 343 μS/cm (95% CI: 123–563 μS/cm) and for KAD abundance at 261 μS/cm (95% CI: 151–370 μS/cm). In both cases, abundances were negligible above estimated conductivity change-points. Post-hoc randomizations accounted for variance in estimated change points due to unequal sample sizes across the conductivity gradients. Boosted regression-tree analysis indicated stronger effects of conductivity than other natural and anthropogenic factors known to influence stream fishes. Boosted regression trees further indicated threshold responses of BSD and KAD occurrence to conductivity gradients in support of segmented regression results. We suggest that the observed conductivity relationship may indicate energetic limitations for insectivorous fishes due to changes in benthic macroinvertebrate community composition.
Prediction of Baseflow Index of Catchments using Machine Learning Algorithms
NASA Astrophysics Data System (ADS)
Yadav, B.; Hatfield, K.
2017-12-01
We present the results of eight machine learning techniques for predicting the baseflow index (BFI) of ungauged basins using a surrogate of catchment scale climate and physiographic data. The tested algorithms include ordinary least squares, ridge regression, least absolute shrinkage and selection operator (lasso), elasticnet, support vector machine, gradient boosted regression trees, random forests, and extremely randomized trees. Our work seeks to identify the dominant controls of BFI that can be readily obtained from ancillary geospatial databases and remote sensing measurements, such that the developed techniques can be extended to ungauged catchments. More than 800 gauged catchments spanning the continental United States were selected to develop the general methodology. The BFI calculation was based on the baseflow separated from the daily streamflow hydrograph using the HYSEP filter. The surrogate catchment attributes were compiled from multiple sources, including digital elevation model, soil, land use, and climate data, and other publicly available ancillary and geospatial data. 80% of the catchments were used to train the ML algorithms, and the remaining 20% were used as an independent test set to measure the generalization performance of the fitted models. A k-fold cross-validation using exhaustive grid search was used to fit the hyperparameters of each model. Initial model development was based on 19 independent variables, but after variable selection and feature ranking, we generated revised sparse models of BFI prediction that are based on only six catchment attributes. These key predictive variables, selected after careful evaluation of the bias-variance tradeoff, include average catchment elevation, slope, fraction of sand, permeability, temperature, and precipitation.
The most promising algorithms exceeding an accuracy score (r-square) of 0.7 on test data include support vector machine, gradient boosted regression trees, random forests, and extremely randomized trees. Considering both the accuracy and the computational complexity of these algorithms, we identify the extremely randomized trees as the best performing algorithm for BFI prediction in ungauged basins.
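The 80%/20% split and the r-square score described above can be illustrated with a minimal, standard-library-only sketch. The function names and the seed are hypothetical and chosen for illustration; the abstract does not say which software the authors used.

```python
import random


def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle the indices 0..n-1 and split them into train and test lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n * test_fraction))
    return indices[n_test:], indices[:n_test]


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical example: 800 catchments, 80% for training and 20% held out.
train_idx, test_idx = train_test_split_indices(800)
```

A model that always predicts the mean of the observed values scores 0 on this measure, which is one way to read the 0.7 threshold reported above.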
2015-06-30
7. Building Statistical Metamodels using Simulation Experimental Designs. 7.1. Statistical Design...system design drivers across several different domain models, our methodology uses statistical metamodeling to approximate the simulations’ behavior. A...output. We build metamodels using a number of statistical methods that include stepwise regression, boosted trees, neural nets, and bootstrap forest
Austin, Peter C.; Tu, Jack V.; Ho, Jennifer E.; Levy, Daniel; Lee, Douglas S.
2014-01-01
Objective Physicians classify patients into those with or without a specific disease. Furthermore, there is often interest in classifying patients according to disease etiology or subtype. Classification trees are frequently used to classify patients according to the presence or absence of a disease. However, classification trees can suffer from limited accuracy. In the data-mining and machine learning literature, alternate classification schemes have been developed. These include bootstrap aggregation (bagging), boosting, random forests, and support vector machines. Study design and Setting We compared the performance of these classification methods with those of conventional classification trees to classify patients with heart failure according to the following sub-types: heart failure with preserved ejection fraction (HFPEF) vs. heart failure with reduced ejection fraction (HFREF). We also compared the ability of these methods to predict the probability of the presence of HFPEF with that of conventional logistic regression. Results We found that modern, flexible tree-based methods from the data mining literature offer substantial improvement in prediction and classification of heart failure sub-type compared to conventional classification and regression trees. However, conventional logistic regression had superior performance for predicting the probability of the presence of HFPEF compared to the methods proposed in the data mining literature. Conclusion The use of tree-based methods offers superior performance over conventional classification and regression trees for predicting and classifying heart failure subtypes in a population-based sample of patients from Ontario. However, these methods do not offer substantial improvements over logistic regression for predicting the presence of HFPEF. PMID:23384592
Application of XGBoost algorithm in hourly PM2.5 concentration prediction
NASA Astrophysics Data System (ADS)
Pan, Bingyue
2018-02-01
In view of prediction techniques of hourly PM2.5 concentration in China, this paper applied the XGBoost(Extreme Gradient Boosting) algorithm to predict hourly PM2.5 concentration. The monitoring data of air quality in Tianjin city was analyzed by using XGBoost algorithm. The prediction performance of the XGBoost method is evaluated by comparing observed and predicted PM2.5 concentration using three measures of forecast accuracy. The XGBoost method is also compared with the random forest algorithm, multiple linear regression, decision tree regression and support vector machines for regression models using computational results. The results demonstrate that the XGBoost algorithm outperforms other data mining methods.
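The abstract does not name its three measures of forecast accuracy. As an assumption for illustration only, the sketch below uses three common choices for comparing observed and predicted concentrations: root-mean-square error, mean absolute error, and the Pearson correlation coefficient.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def pearson_r(observed, predicted):
    """Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)
```

RMSE penalises large misses more heavily than MAE, so reporting both gives a rough sense of whether errors are dominated by a few extreme hours.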
NASA Astrophysics Data System (ADS)
Zabret, Katarina; Rakovec, Jože; Šraj, Mojca
2018-03-01
Rainfall partitioning is an important part of the ecohydrological cycle, influenced by numerous variables. Rainfall partitioning for pine (Pinus nigra Arnold) and birch (Betula pendula Roth.) trees was measured from January 2014 to June 2017 in an urban area of Ljubljana, Slovenia. 180 events from more than three years of observations were analyzed, focusing on 13 meteorological variables, including the number of raindrops, their diameter, and velocity. Regression tree and boosted regression tree analyses were performed to evaluate the influence of the variables on rainfall interception loss, throughfall, and stemflow in different phenoseasons. The amount of rainfall was recognized as the most influential variable, followed by rainfall intensity and the number of raindrops. Higher rainfall amount, intensity, and the number of drops decreased percentage of rainfall interception loss. Rainfall amount and intensity were the most influential on interception loss by birch and pine trees during the leafed and leafless periods, respectively. Lower wind speed was found to increase throughfall, whereas wind direction had no significant influence. Consideration of drop size spectrum properties proved to be important, since the number of drops, drop diameter, and median volume diameter were often recognized as important influential variables.
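For readers unfamiliar with how a boosted regression tree differs from a single regression tree, the following is a minimal sketch of gradient boosting under squared-error loss, using one-split trees (stumps) on a single predictor. It is not the authors' implementation; the function names, the learning rate, and the toy data are all assumptions made for illustration.

```python
def fit_stump(x, residuals):
    """Find the single split on x that minimises the squared error of the residuals."""
    best = None
    cut_points = sorted(set(x))
    for lo, hi in zip(cut_points, cut_points[1:]):
        threshold = (lo + hi) / 2.0
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]


def fit_boosted_stumps(x, y, n_trees=200, learning_rate=0.1):
    """Fit stumps one after another, each to the residuals of the current ensemble."""
    baseline = sum(y) / len(y)
    predictions = [baseline] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        threshold, left_mean, right_mean = fit_stump(x, residuals)
        stumps.append((threshold, left_mean, right_mean))
        predictions = [
            pi + learning_rate * (left_mean if xi <= threshold else right_mean)
            for xi, pi in zip(x, predictions)
        ]
    return {"baseline": baseline, "learning_rate": learning_rate, "stumps": stumps}


def predict(model, xi):
    """Sum the baseline and the shrunken contribution of every stump."""
    total = model["baseline"]
    for threshold, left_mean, right_mean in model["stumps"]:
        total += model["learning_rate"] * (left_mean if xi <= threshold else right_mean)
    return total


# Made-up toy data with a step-shaped response; not measurements from any study.
x = [float(v) for v in range(1, 21)]
y = [50.0 if v <= 10 else 20.0 for v in x]
model = fit_boosted_stumps(x, y)
```

A single regression tree would stop after the first split; boosting keeps adding small corrections, which is what lets the ensemble follow nonlinear responses when there are many predictors.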
A Predictive Model of Daily Seismic Activity Induced by Mining, Developed with Data Mining Methods
NASA Astrophysics Data System (ADS)
Jakubowski, Jacek
2014-12-01
The article presents the development and evaluation of a predictive classification model of daily seismic energy emissions induced by longwall mining in sector XVI of the Piast coal mine in Poland. The model uses data on tremor energy, basic characteristics of the longwall face and mined output in this sector over the period from July 1987 to March 2011. The predicted binary variable is the occurrence of a daily sum of tremor seismic energies in a longwall that is greater than or equal to the threshold value of 10^5 J. Three data mining analytical methods were applied: logistic regression, neural networks, and stochastic gradient boosted trees. The boosted trees model was chosen as the best for the purposes of the prediction. The validation sample results showed its good predictive capability, taking the complex nature of the phenomenon into account. This may indicate the applied model's suitability for a sequential, short-term prediction of mining induced seismic activity.
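The binary target described above (whether a day's summed tremor energy reaches a threshold) can be built in a few lines. This is a sketch only: the function name, the dates, and the energy values are hypothetical, and the threshold is passed in by the caller.

```python
def daily_exceedance_labels(daily_tremor_energies_j, threshold_j):
    """Map each day to 1 if its summed tremor energy is >= threshold_j, else 0."""
    return {
        day: int(sum(energies) >= threshold_j)
        for day, energies in daily_tremor_energies_j.items()
    }


# Hypothetical energies in joules for three days (illustrative values only).
example_days = {
    "day-1": [2.0e4, 9.0e4],   # sums to 1.1e5
    "day-2": [3.0e3],          # sums to 3.0e3
    "day-3": [],               # no tremors recorded
}
labels = daily_exceedance_labels(example_days, threshold_j=1.0e5)
```

Summing before thresholding matters here: a day with several moderate tremors can be a positive case even if no single tremor reaches the threshold.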
Boosted Regression Tree Models to Explain Watershed ...
Boosted regression tree (BRT) models were developed to quantify the nonlinear relationships between landscape variables and nutrient concentrations in a mesoscale mixed land cover watershed during base-flow conditions. Factors that affect instream biological components, based on the Index of Biotic Integrity (IBI), were also analyzed. Seasonal BRT models at two spatial scales (watershed and riparian buffered area [RBA]) for nitrite-nitrate (NO2-NO3), total Kjeldahl nitrogen, and total phosphorus (TP) and annual models for the IBI score were developed. Two primary factors — location within the watershed (i.e., geographic position, stream order, and distance to a downstream confluence) and percentage of urban land cover (both scales) — emerged as important predictor variables. Latitude and longitude interacted with other factors to explain the variability in summer NO2-NO3 concentrations and IBI scores. BRT results also suggested that location might be associated with indicators of sources (e.g., land cover), runoff potential (e.g., soil and topographic factors), and processes not easily represented by spatial data indicators. Runoff indicators (e.g., Hydrological Soil Group D and Topographic Wetness Indices) explained a substantial portion of the variability in nutrient concentrations as did point sources for TP in the summer months. The results from our BRT approach can help prioritize areas for nutrient management in mixed-use and heavily impacted watershed
Unravelling the limits to tree height: a major role for water and nutrient trade-offs.
Cramer, Michael D
2012-05-01
Competition for light has driven forest trees to grow exceedingly tall, but the lack of a single universal limit to tree height indicates multiple interacting environmental limitations. Because soil nutrient availability is determined by both nutrient concentrations and soil water, water and nutrient availabilities may interact in determining realised nutrient availability and consequently tree height. In SW Australia, which is characterised by nutrient impoverished soils that support some of the world's tallest forests, total [P] and water availability were independently correlated with tree height (r = 0.42 and 0.39, respectively). However, interactions between water availability and each of total [P], pH and [Mg] contributed to a multiple linear regression model of tree height (r = 0.72). A boosted regression tree model showed that maximum tree height was correlated with water availability (24%), followed by soil properties including total P (11%), Mg (10%) and total N (9%), amongst others, and that there was an interaction between water availability and total [P] in determining maximum tree height. These interactions indicated a trade-off between water and P availability in determining maximum tree height in SW Australia. This is enabled by a species assemblage capable of growing tall and surviving (some) disturbances. The mechanism for this trade-off is suggested to be through water enabling mass-flow and diffusive mobility of P, particularly of relatively mobile organic P, although water interactions with microbial activity could also play a role.
Gbm.auto: A software tool to simplify spatial modelling and Marine Protected Area planning
Officer, Rick; Clarke, Maurice; Reid, David G.; Brophy, Deirdre
2017-01-01
Boosted Regression Trees: excellent for data-poor spatial management but hard to use. Marine resource managers and scientists often advocate spatial approaches to manage data-poor species. Existing spatial prediction and management techniques are either insufficiently robust, struggle with sparse input data, or make suboptimal use of multiple explanatory variables. Boosted Regression Trees feature excellent performance and are well suited to modelling the distribution of data-limited species, but are extremely complicated and time-consuming to learn and use, hindering access for a wide potential user base and therefore limiting uptake and usage. BRTs automated and simplified for accessible general use with rich feature set: We have built a software suite in R which integrates pre-existing functions with new tailor-made functions to automate the processing and predictive mapping of species abundance data: by automating and greatly simplifying Boosted Regression Tree spatial modelling, the gbm.auto R package suite makes this powerful statistical modelling technique more accessible to potential users in the ecological and modelling communities. The package and its documentation allow the user to generate maps of predicted abundance, visualise the representativeness of those abundance maps and to plot the relative influence of explanatory variables and their relationship to the response variables. Databases of the processed model objects and a report explaining all the steps taken within the model are also generated. The package includes a previously unavailable Decision Support Tool which combines estimated escapement biomass (the percentage of an exploited population which must be retained each year to conserve it) with the predicted abundance maps to generate maps showing the location and size of habitat that should be protected to conserve the target stocks (candidate MPAs), based on stakeholder priorities, such as the minimisation of fishing effort displacement.
Gbm.auto for management in various settings: By bridging the gap between advanced statistical methods for species distribution modelling and conservation science, management and policy, these tools can allow improved spatial abundance predictions, and therefore better management, decision-making, and conservation. Although this package was built to support spatial management of a data-limited marine elasmobranch fishery, it should be equally applicable to spatial abundance modelling, area protection, and stakeholder engagement in various scenarios. PMID:29216310
NASA Astrophysics Data System (ADS)
Lombardo, L.; Cama, M.; Maerker, M.; Parisi, L.; Rotigliano, E.
2014-12-01
This study aims at comparing the performances of Binary Logistic Regression (BLR) and Boosted Regression Trees (BRT) methods in assessing landslide susceptibility for multiple-occurrence regional landslide events within the Mediterranean region. A test area was selected in the north-eastern sector of Sicily (southern Italy), corresponding to the catchments of the Briga and the Giampilieri streams both stretching for few kilometres from the Peloritan ridge (eastern Sicily, Italy) to the Ionian sea. This area was struck on the 1st October 2009 by an extreme climatic event resulting in thousands of rapid shallow landslides, mainly of debris flows and debris avalanches types involving the weathered layer of a low to high grade metamorphic bedrock. Exploiting the same set of predictors and the 2009 landslide archive, BLR- and BRT-based susceptibility models were obtained for the two catchments separately, adopting a random partition (RP) technique for validation; besides, the models trained in one of the two catchments (Briga) were tested in predicting the landslide distribution in the other (Giampilieri), adopting a spatial partition (SP) based validation procedure. All the validation procedures were based on multi-folds tests so to evaluate and compare the reliability of the fitting, the prediction skill, the coherence in the predictor selection and the precision of the susceptibility estimates. All the obtained models for the two methods produced very high predictive performances, with a general congruence between BLR and BRT in the predictor importance. In particular, the research highlighted that BRT-models reached a higher prediction performance with respect to BLR-models, for RP based modelling, whilst for the SP-based models the difference in predictive skills between the two methods dropped drastically, converging to an analogous excellent performance. 
However, when looking at the precision of the probability estimates, BLR proved to produce more robust models in terms of selected predictors and coefficients, as well as in terms of the dispersion of the estimated probabilities around the mean value for each mapped pixel. The difference in behaviour could be interpreted as the result of overfitting effects, which affect decision tree classification more heavily than logistic regression techniques.
Ließ, Mareike; Schmidt, Johannes; Glaser, Bruno
2016-01-01
Tropical forests are significant carbon sinks and their soils' carbon storage potential is immense. However, little is known about the soil organic carbon (SOC) stocks of tropical mountain areas whose complex soil-landscape and difficult accessibility pose a challenge to spatial analysis. The choice of methodology for spatial prediction is of high importance to improve the expected poor model results in case of low predictor-response correlations. Four aspects were considered to improve model performance in predicting SOC stocks of the organic layer of a tropical mountain forest landscape: different spatial predictor settings, predictor selection strategies, various machine learning algorithms and model tuning. Five machine learning algorithms (random forests, artificial neural networks, multivariate adaptive regression splines, boosted regression trees and support vector machines) were trained and tuned to predict SOC stocks from predictors derived from a digital elevation model and satellite image. Topographical predictors were calculated with a GIS search radius of 45 to 615 m. Finally, three predictor selection strategies were applied to the total set of 236 predictors. All machine learning algorithms, including the model tuning and predictor selection, were compared via five repetitions of a tenfold cross-validation. The boosted regression tree algorithm resulted in the overall best model. SOC stocks ranged from 0.2 to 17.7 kg m^-2, displaying a huge variability with diffuse insolation and curvatures of different scale guiding the spatial pattern. Predictor selection and model tuning improved the models' predictive performance in all five machine learning algorithms. The rather low number of selected predictors favours forward compared to backward selection procedures. Choosing predictors on the basis of their individual performance was outperformed by the two procedures which accounted for predictor interaction. PMID:27128736
Wendling, T; Jung, K; Callahan, A; Schuler, A; Shah, N H; Gallego, B
2018-06-03
There is growing interest in using routinely collected data from health care databases to study the safety and effectiveness of therapies in "real-world" conditions, as it can provide complementary evidence to that of randomized controlled trials. Causal inference from health care databases is challenging because the data are typically noisy, high dimensional, and most importantly, observational. It requires methods that can estimate heterogeneous treatment effects while controlling for confounding in high dimensions. Bayesian additive regression trees, causal forests, causal boosting, and causal multivariate adaptive regression splines are off-the-shelf methods that have shown good performance for estimation of heterogeneous treatment effects in observational studies of continuous outcomes. However, it is not clear how these methods would perform in health care database studies where outcomes are often binary and rare and data structures are complex. In this study, we evaluate these methods in simulation studies that recapitulate key characteristics of comparative effectiveness studies. We focus on the conditional average effect of a binary treatment on a binary outcome using the conditional risk difference as an estimand. To emulate health care database studies, we propose a simulation design where real covariate and treatment assignment data are used and only outcomes are simulated based on nonparametric models of the real outcomes. We apply this design to 4 published observational studies that used records from 2 major health care databases in the United States. Our results suggest that Bayesian additive regression trees and causal boosting consistently provide low bias in conditional risk difference estimates in the context of health care database studies. Copyright © 2018 John Wiley & Sons, Ltd.
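The estimand named above can be written out explicitly. The following is a sketch in standard potential-outcomes notation; the symbols (binary treatment T, covariates X, potential outcomes Y(1) and Y(0), and the name tau) are illustrative labels and are not taken from the paper.

```latex
\tau(x) \;=\; \mathbb{E}\bigl[\,Y(1) - Y(0) \mid X = x\,\bigr]
        \;=\; \Pr\bigl(Y(1) = 1 \mid X = x\bigr) - \Pr\bigl(Y(0) = 1 \mid X = x\bigr)
```

If one additionally assumes no unmeasured confounding given X (an assumption stated here, not a result of the study), the same quantity can be expressed in terms of observables as \Pr(Y = 1 \mid T = 1, X = x) - \Pr(Y = 1 \mid T = 0, X = x). With rare binary outcomes both probabilities are small, which is part of why estimating their difference is difficult.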
Tighe, Patrick J.; Harle, Christopher A.; Hurley, Robert W.; Aytug, Haldun; Boezaart, Andre P.; Fillingim, Roger B.
2015-01-01
Background Given their ability to process highly dimensional datasets with hundreds of variables, machine learning algorithms may offer one solution to the vexing challenge of predicting postoperative pain. Methods Here, we report on the application of machine learning algorithms to predict postoperative pain outcomes in a retrospective cohort of 8071 surgical patients using 796 clinical variables. Five algorithms were compared in terms of their ability to forecast moderate to severe postoperative pain: Least Absolute Shrinkage and Selection Operator (LASSO), gradient-boosted decision tree, support vector machine, neural network, and k-nearest neighbor, with logistic regression included for baseline comparison. Results In forecasting moderate to severe postoperative pain for postoperative day (POD) 1, the LASSO algorithm, using all 796 variables, had the highest accuracy with an area under the receiver-operating curve (ROC) of 0.704. Next, the gradient-boosted decision tree had an ROC of 0.665 and the k-nearest neighbor algorithm had an ROC of 0.643. For POD 3, the LASSO algorithm, using all variables, again had the highest accuracy, with an ROC of 0.727. Logistic regression had a lower ROC of 0.5 for predicting pain outcomes on POD 1 and 3. Conclusions Machine learning algorithms, when combined with complex and heterogeneous data from electronic medical record systems, can forecast acute postoperative pain outcomes with accuracies similar to methods that rely only on variables specifically collected for pain outcome prediction. PMID:26031220
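The area under the ROC curve reported above can be computed directly from predicted scores and true labels. A minimal sketch, using the equivalence between AUC and the probability that a randomly chosen positive case is scored above a randomly chosen negative case (ties count one half); the function name and example values are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Illustrative values only: 1 = moderate to severe pain, 0 = otherwise.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

On this scale 0.5 corresponds to ranking cases no better than chance, which puts the value reported for logistic regression in context.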
NASA Astrophysics Data System (ADS)
Lawrence, R.; Landenburger, L.; Jewett, J.
2007-12-01
Whitebark pine seeds have long been identified as the most significant vegetative food source for grizzly bears in the Greater Yellowstone Ecosystem (GYE) and, hence, a crucial element of suitable grizzly bear habitat. The overall health and status of whitebark pine in the GYE is currently threatened by mountain pine beetle infestations and the spread of white pine blister rust. Whitebark pine distribution (presence/absence) was mapped for the GYE using Landsat 7 Enhanced Thematic Mapper (ETM+) imagery and topographic data as part of a long-term inter-agency monitoring program. Logistic regression was compared with classification tree analysis (CTA) with and without boosting. Overall comparative classification accuracies for the central portion of the GYE covering three ETM+ images along a single path ranged from 91.6% using logistic regression to 95.8% with See5's CTA algorithm with the maximum 99 boosts. The analysis is being extended to the entire northern Rocky Mountain Ecosystem and extended over decadal time scales.
NASA Astrophysics Data System (ADS)
Koestel, John; Bechtold, Michel; Jorda, Helena; Jarvis, Nicholas
2015-04-01
The saturated and near-saturated hydraulic conductivity of soil is of key importance for modelling water and solute fluxes in the vadose zone. Hydraulic conductivity measurements are cumbersome at the Darcy scale and practically impossible at larger scales where water and solute transport models are mostly applied. Hydraulic conductivity must therefore be estimated from proxy variables. Such pedotransfer functions are known to work reasonably well for, e.g., water retention curves but rather poorly for near-saturated and saturated hydraulic conductivities. Recently, Weynants et al. (2009, Revisiting Vereecken pedotransfer functions: Introducing a closed-form hydraulic model. Vadose Zone Journal, 8, 86-95) reported a coefficient of determination of 0.25 (validation with an independent data set) for the saturated hydraulic conductivity from lab measurements of Belgian soil samples. In our study, we trained boosted regression trees on a global meta-database containing tension-disk infiltrometer data (see Jarvis et al. 2013. Influence of soil, land use and climatic factors on the hydraulic conductivity of soil. Hydrology & Earth System Sciences, 17, 5185-5195) to predict the saturated hydraulic conductivity (Ks) and the conductivity at a tension of 10 cm (K10). We found coefficients of determination of 0.39 and 0.62 under a simple 10-fold cross-validation for Ks and K10. When carrying out the validation folded over the data sources, i.e. the source publications, we found that the corresponding coefficients of determination reduced to 0.15 and 0.36, respectively. We conclude that the stricter source-wise cross-validation should be applied in future pedotransfer studies to prevent overly optimistic validation results. The boosted regression trees also allowed for an investigation of relevant predictors for estimating the near-saturated hydraulic conductivity. We found that land use and bulk density were most important to predict Ks.
We also observed that Ks is large in fine and coarse textured soils and smaller in medium textured soils. Completely different predictors were important for appraising K10, where the soil macropore system is air-filled and therefore inactive. Here, the average annual temperature and precipitation were most important. The reasons for this are unclear and require further research. The clay content and the organic matter content were also important predictors of K10. We suggest that a larger and more complete database may help to improve the prediction of K10, whereas it may be more fruitful to estimate Ks statistics of sampling sites instead of individual values since the Ks is highly variable over very short distances.
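The difference between a simple k-fold split and the stricter source-wise split described above comes down to how samples are assigned to folds. A minimal sketch, assuming each sample carries a label for the publication it came from; the function name and the round-robin assignment are illustrative choices, not the authors' procedure.

```python
def source_wise_folds(sources, k):
    """Assign whole sources to folds so that no source is split across folds."""
    unique_sources = sorted(set(sources))
    # Round-robin assignment of sources (not samples) to the k folds.
    fold_of_source = {s: i % k for i, s in enumerate(unique_sources)}
    folds = [[] for _ in range(k)]
    for index, source in enumerate(sources):
        folds[fold_of_source[source]].append(index)
    return folds


# Hypothetical source labels for eight samples from four publications.
sample_sources = ["A", "A", "B", "B", "B", "C", "D", "D"]
folds = source_wise_folds(sample_sources, k=2)
```

Because samples from one publication tend to share soils, methods, and operators, holding out whole sources tests how well the model transfers to a new study, which is consistent with the lower coefficients of determination reported for that scheme.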
Chen, Xiao Yu; Ma, Li Zhuang; Chu, Na; Zhou, Min; Hu, Yiyang
2013-01-01
Chronic hepatitis B (CHB) is a serious public health problem, and Traditional Chinese Medicine (TCM) plays an important role in the control and treatment for CHB. In the treatment of TCM, zheng discrimination is the most important step. In this paper, an approach based on CFS-GA (Correlation based Feature Selection and Genetic Algorithm) and C5.0 boost decision tree is used for zheng classification and progression in the TCM treatment of CHB. The CFS-GA performs better than the typical method of CFS. By CFS-GA, the acquired attribute subset is classified by C5.0 boost decision tree for TCM zheng classification of CHB, and C5.0 decision tree outperforms two typical decision trees of NBTree and REPTree on CFS-GA, CFS, and nonselection in comparison. Based on the critical indicators from C5.0 decision tree, important lab indicators in zheng progression are obtained by the method of stepwise discriminant analysis for expressing TCM zhengs in CHB, and alterations of the important indicators are also analyzed in zheng progression. In conclusion, all the three decision trees perform better on CFS-GA than on CFS and nonselection, and C5.0 decision tree outperforms the two typical decision trees both on attribute selection and nonselection.
Extensions and applications of ensemble-of-trees methods in machine learning
NASA Astrophysics Data System (ADS)
Bleich, Justin
Ensemble-of-trees algorithms have emerged to the forefront of machine learning due to their ability to generate high forecasting accuracy for a wide array of regression and classification problems. Classic ensemble methodologies such as random forests (RF) and stochastic gradient boosting (SGB) rely on algorithmic procedures to generate fits to data. In contrast, more recent ensemble techniques such as Bayesian Additive Regression Trees (BART) and Dynamic Trees (DT) focus on an underlying Bayesian probability model to generate the fits. These new probability model-based approaches show much promise versus their algorithmic counterparts, but also offer substantial room for improvement. The first part of this thesis focuses on methodological advances for ensemble-of-trees techniques with an emphasis on the more recent Bayesian approaches. In particular, we focus on extensions of BART in four distinct ways. First, we develop a more robust implementation of BART for both research and application. We then develop a principled approach to variable selection for BART as well as the ability to naturally incorporate prior information on important covariates into the algorithm. Next, we propose a method for handling missing data that relies on the recursive structure of decision trees and does not require imputation. Last, we relax the assumption of homoskedasticity in the BART model to allow for parametric modeling of heteroskedasticity. The second part of this thesis returns to the classic algorithmic approaches in the context of classification problems with asymmetric costs of forecasting errors. First we consider the performance of RF and SGB more broadly and demonstrate its superiority to logistic regression for applications in criminology with asymmetric costs. Next, we use RF to forecast unplanned hospital readmissions upon patient discharge with asymmetric costs taken into account. 
Finally, we explore the construction of stable decision trees for forecasts of violence during probation hearings in court systems.
Rieger, Isaak; Kowarik, Ingo; Cherubini, Paolo; Cierjacks, Arne
2017-01-01
Aboveground carbon (C) sequestration in trees is important in global C dynamics, but reliable techniques for its modeling in highly productive and heterogeneous ecosystems are limited. We applied an extended dendrochronological approach to disentangle the functioning of drivers from the atmosphere (temperature, precipitation), the lithosphere (sedimentation rate), the hydrosphere (groundwater table, river water level fluctuation), the biosphere (tree characteristics), and the anthroposphere (dike construction). Carbon sequestration in aboveground biomass of riparian Quercus robur L. and Fraxinus excelsior L. was modeled (1) over time using boosted regression tree analysis (BRT) on cross-datable trees characterized by equal annual growth ring patterns and (2) across space using a subsequent classification and regression tree analysis (CART) on cross-datable and not cross-datable trees. While C sequestration of cross-datable Q. robur responded to precipitation and temperature, cross-datable F. excelsior also responded to a low Danube river water level. However, CART revealed that C sequestration over time is governed by tree height and parameters that vary over space (magnitude of fluctuation in the groundwater table, vertical distance to mean river water level, and longitudinal distance to upstream end of the study area). Thus, a uniform response to climatic drivers of aboveground C sequestration in Q. robur was only detectable in trees of an intermediate height class and in taller trees (>21.8m) on sites where the groundwater table fluctuated little (≤0.9m). The detection of climatic drivers and the river water level in F. excelsior depended on sites at lower altitudes above the mean river water level (≤2.7m) and along a less dynamic downstream section of the study area. Our approach indicates unexploited opportunities of understanding the interplay of different environmental drivers in aboveground C sequestration. 
Results may support species-specific and locally adapted forest management plans to increase carbon dioxide sequestration from the atmosphere in trees. Copyright © 2016 Elsevier B.V. All rights reserved.
Policy Implications and Suggestions on Administrative Measures of Urban Flood
NASA Astrophysics Data System (ADS)
Lee, S. V.; Lee, M. J.; Lee, C.; Yoon, J. H.; Chae, S. H.
2017-12-01
The frequency and intensity of floods are increasing worldwide as climate change progresses. Flood management in urban municipalities should be policy-oriented because urban areas are prone to heavy damage. The purpose of this study is therefore to prepare a flood susceptibility map using data mining models and to make policy suggestions on administrative measures for urban floods. We constructed a spatial database by collecting relevant factors, including topography, geology, soil and land use data, for a representative city, Seoul, the capital city of Korea. The flood susceptibility map was constructed by applying the random forest and boosted tree data mining models to the input data and to flooded-area data from 2010. The susceptibility map was validated using the 2011 flooded-area data, which were not used for training. The predictor importance value of each factor was calculated in this process. The distance from water, DEM and geology showed high predictor importance values, which suggests that they should be given high priority in flood preparation policy. In the receiver operating characteristic (ROC) analysis, the random forest model showed 78.78% and 79.18% accuracy for regression and classification, and the boosted tree model showed 77.55% and 77.26% accuracy for regression and classification, respectively. The results show that the flood susceptibility maps can be applied to flood prevention and management, and that they can help determine the priority areas for flood mitigation policy by providing useful information to policy makers.
Cheong, Yoon Ling; Leitão, Pedro J; Lakes, Tobia
2014-07-01
The transmission of dengue disease is influenced by complex interactions among vector, host and virus. Land use such as water bodies or certain agricultural practices have been identified as likely risk factors for dengue because of the provision of suitable habitats for the vector. Many studies have focused on the land use factors of dengue vector abundance in small areas but have not yet studied the relationship between land use factors and dengue cases for large regions. This study aims to clarify if land use factors other than human settlements, e.g. different types of agricultural land use, water bodies and forest are associated with reported dengue cases from 2008 to 2010 in the state of Selangor, Malaysia. From the correlative relationship, we aim to generate a prediction risk map. We used Boosted Regression Trees (BRT) to account for nonlinearities and interactions between the factors with high predictive accuracies. Our model with a cross-validated performance score (Area Under the Receiver Operator Characteristic Curve, ROC AUC) of 0.81 showed that the most important land use factors are human settlements (model importance of 39.2%), followed by water bodies (16.1%), mixed horticulture (8.7%), open land (7.5%) and neglected grassland (6.7%). A risk map after 100 model runs with a cross-validated ROC AUC mean of 0.81 (±0.001 s.d.) is presented. Our findings may be an important asset for improving surveillance and control interventions for dengue. Copyright © 2014 The Authors. Published by Elsevier Ltd. All rights reserved.
Cerasoli, Francesco; Iannella, Mattia; D'Alessandro, Paola; Biondi, Maurizio
2017-01-01
Boosted Regression Trees (BRT) is one of the modelling techniques most recently applied to biodiversity conservation and it can be implemented with presence-only data through the generation of artificial absences (pseudo-absences). In this paper, three pseudo-absences generation techniques are compared, namely the generation of pseudo-absences within target-group background (TGB), testing both the weighted (WTGB) and unweighted (UTGB) scheme, and the generation at random (RDM), evaluating their performance and applicability in distribution modelling and species conservation. The choice of the target group fell on amphibians, because of their rapid decline worldwide and the frequent lack of guidelines for conservation strategies and regional-scale planning, which instead could be provided through an appropriate implementation of SDMs. Bufo bufo, Salamandrina perspicillata and Triturus carnifex were considered as target species, in order to perform our analysis with species having different ecological and distributional characteristics. The study area is the "Gran Sasso-Monti della Laga" National Park, which hosts 15 Natura 2000 sites and represents one of the most important biodiversity hotspots in Europe. Our results show that the model calibration improves when using the target-group based pseudo-absences compared to the random ones, especially when applying the WTGB. In contrast, model discrimination did not significantly vary in a consistent way among the three approaches with respect to the three target species. Both WTGB and RDM clearly isolate the highly contributing variables, supplying many relevant indications for species conservation actions. Moreover, the assessment of pairwise variable interactions and their three-dimensional visualization further increase the amount of useful information for protected areas' managers.
Finally, we suggest the use of RDM as an admissible alternative when it is not possible to identify a suitable set of species as a representative target-group from which the pseudo-absences can be generated.
Mi, Chunrong; Huettmann, Falk; Guo, Yumin; Han, Xuesong; Wen, Lijia
2017-01-01
Species distribution models (SDMs) have become an essential tool in ecology, biogeography, evolution and, more recently, in conservation biology. How to generalize species distributions in large undersampled areas, especially with few samples, is a fundamental issue of SDMs. In order to explore this issue, we used the best available presence records for the Hooded Crane (Grus monacha, n = 33), White-naped Crane (Grus vipio, n = 40), and Black-necked Crane (Grus nigricollis, n = 75) in China as three case studies, employing four powerful and commonly used machine learning algorithms to map the breeding distributions of the three species: TreeNet (Stochastic Gradient Boosting, Boosted Regression Tree Model), Random Forest, CART (Classification and Regression Tree) and Maxent (Maximum Entropy Models). In addition, we developed an ensemble forecast by averaging the predicted probabilities of the above four models. Commonly used model performance metrics (Area under ROC (AUC) and true skill statistic (TSS)) were employed to evaluate model accuracy. The latest satellite tracking data and compiled literature data were used as two independent testing datasets to confront model predictions. We found Random Forest demonstrated the best performance for most assessment methods, provided a better model fit to the testing data, and achieved better species range maps for each crane species in undersampled areas. Random Forest has been generally available for more than 20 years and has been known to perform extremely well in ecological predictions. However, while increasingly on the rise, its potential is still widely underused in conservation, (spatial) ecological applications and for inference. Our results show that it informs ecological and biogeographical theories as well as being suitable for conservation applications, specifically when the study area is undersampled.
This method helps to save model-selection time and effort, and allows robust and rapid assessments and decisions for efficient conservation. PMID:28097060
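The ensemble forecast and the two evaluation metrics named in this abstract can be illustrated with a short sketch. It is not the authors' workflow: the function names, the two stand-in models, the labels and the 0.45 threshold are all made up for illustration. AUC is computed from its pairwise definition (the share of presence/absence pairs in which the presence receives the higher score), and TSS is sensitivity plus specificity minus one.

```python
def roc_auc(labels, scores):
    """AUC: probability that a random presence outscores a random absence (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def true_skill_statistic(labels, scores, threshold):
    """TSS = sensitivity + specificity - 1 at a given probability threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn) + tn / (tn + fp) - 1.0

def ensemble_mean(*model_scores):
    """Average the predicted probabilities of several models, site by site."""
    return [sum(site) / len(site) for site in zip(*model_scores)]

# Made-up test data: 1 = presence, 0 = absence; two hypothetical models.
labels = [1, 1, 1, 0, 0, 0]
model_a = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
model_b = [0.7, 0.6, 0.6, 0.3, 0.4, 0.1]
ensemble = ensemble_mean(model_a, model_b)

auc_a = roc_auc(labels, model_a)
auc_ensemble = roc_auc(labels, ensemble)
tss_ensemble = true_skill_statistic(labels, ensemble, threshold=0.45)
```

Unlike AUC, TSS depends on the threshold used to turn probabilities into presence/absence calls, so the threshold should be reported alongside it.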
Hayashi, Yoshihiro; Oishi, Takuya; Shirotori, Kaede; Marumo, Yuki; Kosugi, Atsushi; Kumada, Shungo; Hirai, Daijiro; Takayama, Kozo; Onuki, Yoshinori
2018-07-01
The aim of this study was to explore the potential of boosted tree (BT) to develop a correlation model between active pharmaceutical ingredient (API) characteristics and a tensile strength (TS) of tablets as critical quality attributes. First, we evaluated 81 kinds of API characteristics, such as particle size distribution, bulk density, tapped density, Hausner ratio, moisture content, elastic recovery, molecular weight, and partition coefficient. Next, we prepared tablets containing 50% API, 49% microcrystalline cellulose, and 1% magnesium stearate using direct compression at 6, 8, and 10 kN, and measured TS. Then, we applied BT to our dataset to develop a correlation model. Finally, the constructed BT model was validated using k-fold cross-validation. Results showed that the BT model achieved high-performance statistics, whereas multiple regression analysis resulted in poor estimations. Sensitivity analysis of the BT model revealed that diameter of powder particles at the 10th percentile of the cumulative percentage size distribution was the most crucial factor for TS. In addition, the influences of moisture content, partition coefficients, and modal diameter were appreciably meaningful factors. This study demonstrates that BT model could provide comprehensive understanding of the latent structure underlying APIs and TS of tablets.
Using CART to segment road images
NASA Astrophysics Data System (ADS)
Davies, Bob; Lienhart, Rainer
2006-01-01
The 2005 DARPA Grand Challenge is a 132 mile race through the desert with autonomous robotic vehicles. Lasers mounted on the car roof provide a map of the road up to 20 meters ahead of the car but the car needs to see further in order to go fast enough to win the race. Computer vision can extend that map of the road ahead but desert road is notoriously similar to the surrounding desert. The CART algorithm (Classification and Regression Trees) provided a machine learning boost to find road while at the same time measuring when that road could not be distinguished from surrounding desert.
Boosted Multivariate Trees for Longitudinal Data
Pande, Amol; Li, Liang; Rajeswaran, Jeevanantham; Ehrlinger, John; Kogalur, Udaya B.; Blackstone, Eugene H.; Ishwaran, Hemant
2017-01-01
Machine learning methods provide a powerful approach for analyzing longitudinal data in which repeated measurements are observed for a subject over time. We boost multivariate trees to fit a novel flexible semi-nonparametric marginal model for longitudinal data. In this model, features are assumed to be nonparametric, while feature-time interactions are modeled semi-nonparametrically utilizing P-splines with estimated smoothing parameter. In order to avoid overfitting, we describe a relatively simple in-sample cross-validation method which can be used to estimate the optimal boosting iteration and which has the surprising added benefit of stabilizing certain parameter estimates. Our new multivariate tree boosting method is shown to be highly flexible, robust to covariance misspecification and unbalanced designs, and resistant to overfitting in high dimensions. Feature selection can be used to identify important features and feature-time interactions. An application to longitudinal data of forced 1-second lung expiratory volume (FEV1) for lung transplant patients identifies an important feature-time interaction and illustrates the ease with which our method can find complex relationships in longitudinal data. PMID:29249866
Van Boeckel, Thomas P.; Thanapongtharm, Weerapong; Robinson, Timothy; Biradar, Chandrashekhar M.; Xiao, Xiangming; Gilbert, Marius
2012-01-01
Since 1996 when Highly Pathogenic Avian Influenza type H5N1 first emerged in southern China, numerous studies sought risk factors and produced risk maps based on environmental and anthropogenic predictors. However little attention has been paid to the link between the level of intensification of poultry production and the risk of outbreak. This study revised H5N1 risk mapping in Central and Western Thailand during the second wave of the 2004 epidemic. Production structure was quantified using a disaggregation methodology based on the number of poultry per holding. Population densities of extensively- and intensively-raised ducks and chickens were derived both at the sub-district and at the village levels. LandSat images were used to derive another previously neglected potential predictor of HPAI H5N1 risk: the proportion of water in the landscape resulting from floods. We used Monte Carlo simulation of Boosted Regression Trees models of predictor variables to characterize the risk of HPAI H5N1. Maps of mean risk and uncertainty were derived both at the sub-district and the village levels. The overall accuracy of Boosted Regression Trees models was comparable to that of logistic regression approaches. The proportion of area flooded made the highest contribution to predicting the risk of outbreak, followed by the densities of intensively-raised ducks, extensively-raised ducks and human population. Our results showed that as little as 15% of flooded land in villages is sufficient to reach the maximum level of risk associated with this variable. The spatial pattern of predicted risk is similar to previous work: areas at risk are mainly located along the flood plain of the Chao Phraya river and to the south-east of Bangkok. Using high-resolution village-level poultry census data, rather than sub-district data, the spatial accuracy of predictions was enhanced to highlight local variations in risk. Such maps provide useful information to guide intervention. PMID:23185352
Using decision trees to understand structure in missing data
Tierney, Nicholas J; Harden, Fiona A; Harden, Maurice J; Mengersen, Kerrie L
2015-01-01
Objectives: Demonstrate the application of decision trees, namely classification and regression trees (CARTs) and their cousins, boosted regression trees (BRTs), to understand structure in missing data. Setting: Data taken from employees at 3 different industrial sites in Australia. Participants: 7915 observations were included. Materials and methods: The approach was evaluated using an occupational health data set comprising results of questionnaires, medical tests and environmental monitoring. Statistical methods included standard statistical tests and the ‘rpart’ and ‘gbm’ packages for CART and BRT analyses, respectively, from the statistical software ‘R’. A simulation study was conducted to explore the capability of decision tree models in describing data with missingness artificially introduced. Results: CART and BRT models were effective in highlighting a missingness structure in the data, related to the type of data (medical or environmental), the site in which it was collected, the number of visits, and the presence of extreme values. The simulation study revealed that CART models were able to identify variables and values responsible for inducing missingness. There was greater variation in variable importance for unstructured as compared to structured missingness. Discussion: Both CART and BRT models were effective in describing structural missingness in data. CART models may be preferred over BRT models for exploratory analysis of missing data, and selecting variables important for predicting missingness. BRT models can show how values of other variables influence missingness, which may prove useful for researchers. Conclusions: Researchers are encouraged to use CART and BRT models to explore and understand missing data. PMID:26124509
NASA Astrophysics Data System (ADS)
Tesoriero, A. J.; Terziotti, S.
2014-12-01
Nitrate trends in streams often do not match expectations based on recent nitrogen source loadings to the land surface. Groundwater discharge with long travel times has been suggested as the likely cause for these observations. The fate of nitrate in groundwater depends to a large extent on the occurrence of denitrification along flow paths. Because denitrification in groundwater is inhibited when dissolved oxygen (DO) concentrations are high, defining the oxic-suboxic interface has been critical in determining pathways for nitrate transport in groundwater and to streams at the local scale. Predicting redox conditions on a regional scale is complicated by the spatial variability of reaction rates. In this study, logistic regression and boosted classification tree analysis were used to predict the probability of oxic water in groundwater in the Chesapeake Bay watershed. The probability of oxic water (DO > 2 mg/L) was predicted by relating DO concentrations in over 3,000 groundwater samples to indicators of residence time and/or electron donor availability. Variables that describe position in the flow system (e.g., depth to top of the open interval), soil drainage and surficial geology were the most important predictors of oxic water. Logistic regression and boosted classification tree analysis correctly predicted the presence or absence of oxic conditions in over 75% of the samples in both training and validation data sets. Predictions of the percentages of oxic wells in deciles of risk were very accurate (r² > 0.9) in both the training and validation data sets. Depth to the bottom of the oxic layer was predicted and is being used to estimate the effect that groundwater denitrification has on stream nitrate concentrations and the time lag between the application of nitrogen at the land surface and its effect on streams.
Boosting bonsai trees for handwritten/printed text discrimination
NASA Astrophysics Data System (ADS)
Ricquebourg, Yann; Raymond, Christian; Poirriez, Baptiste; Lemaitre, Aurélie; Coüasnon, Bertrand
2013-12-01
Boosting over decision stumps has proved its efficiency in Natural Language Processing, essentially with symbolic features, and its good properties (fast, few and non-critical parameters, not sensitive to over-fitting) could be of great interest in the numeric world of pixel images. In this article we investigated the use of boosting over small decision trees, in image classification processing, for the discrimination of handwritten/printed text. We then conducted experiments to compare it to the usual SVM-based classification, revealing convincing results with very close performance, but with faster predictions and behaving far less as a black box. These promising results encourage the use of this classifier in more complex recognition tasks such as multiclass problems.
NASA Astrophysics Data System (ADS)
Schaeben, Helmut; Semmler, Georg
2016-09-01
The objective of prospectivity modeling is prediction of the conditional probability of the presence T = 1 or absence T = 0 of a target T given favorable or prohibitive predictors B, or construction of a two classes 0,1 classification of T. A special case of logistic regression called weights-of-evidence (WofE) is geologists' favorite method of prospectivity modeling due to its apparent simplicity. However, the numerical simplicity is deceiving as it is implied by the severe mathematical modeling assumption of joint conditional independence of all predictors given the target. General weights of evidence are explicitly introduced which are as simple to estimate as conventional weights, i.e., by counting, but do not require conditional independence. Complementary to the regression view is the classification view on prospectivity modeling. Boosting is the construction of a strong classifier from a set of weak classifiers. From the regression point of view it is closely related to logistic regression. Boost weights-of-evidence (BoostWofE) was introduced into prospectivity modeling to counterbalance violations of the assumption of conditional independence even though relaxation of modeling assumptions with respect to weak classifiers was not the (initial) purpose of boosting. In the original publication of BoostWofE a fabricated dataset was used to "validate" this approach. Using the same fabricated dataset it is shown that BoostWofE cannot generally compensate lacking conditional independence whatever the consecutively processing order of predictors. Thus the alleged features of BoostWofE are disproved by way of counterexamples, while theoretical findings are confirmed that logistic regression including interaction terms can exactly compensate violations of joint conditional independence if the predictors are indicators.
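The conventional weights of evidence that this abstract says are estimated "by counting" can be written out as a short sketch. This is an illustration under stated assumptions, not the authors' code: the function names and the ten-cell dataset are made up, both target and predictor are binary indicators, and no cell count is zero (a zero count would need smoothing, which is omitted here). The weights are W+ = ln[P(B=1|T=1)/P(B=1|T=0)] and W- = ln[P(B=0|T=1)/P(B=0|T=0)], and they are added to the prior log-odds.

```python
import math

def weights_of_evidence(target, predictor):
    """Return (W_plus, W_minus) for one binary predictor, estimated by counting."""
    n11 = sum(1 for t, b in zip(target, predictor) if t == 1 and b == 1)
    n10 = sum(1 for t, b in zip(target, predictor) if t == 1 and b == 0)
    n01 = sum(1 for t, b in zip(target, predictor) if t == 0 and b == 1)
    n00 = sum(1 for t, b in zip(target, predictor) if t == 0 and b == 0)
    # W+ = ln[P(B=1|T=1) / P(B=1|T=0)], W- = ln[P(B=0|T=1) / P(B=0|T=0)]
    w_plus = math.log((n11 / (n11 + n10)) / (n01 / (n01 + n00)))
    w_minus = math.log((n10 / (n11 + n10)) / (n00 / (n01 + n00)))
    return w_plus, w_minus

def posterior_probability(prior, weights, observed):
    """Add weights to the prior logit; exact only under joint conditional independence."""
    logit = math.log(prior / (1.0 - prior))
    for (w_plus, w_minus), b in zip(weights, observed):
        logit += w_plus if b == 1 else w_minus
    return 1.0 / (1.0 + math.exp(-logit))

# Made-up cells: T = 1 where the target is present, B = 1 where the predictor is present.
T = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
B = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
w = weights_of_evidence(T, B)
prior = sum(T) / len(T)
p_given_b1 = posterior_probability(prior, [w], [1])
```

With a single predictor the result is simply Bayes' rule and matches the observed frequency of T = 1 among cells with B = 1. The conditional independence assumption only matters once the weights of several predictors are summed, which is the situation the abstract is concerned with.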
Extracting Baseline Electricity Usage Using Gradient Tree Boosting
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kim, Taehoon; Lee, Dongeun; Choi, Jaesik
To understand how specific interventions affect a process observed over time, we need to control for the other factors that influence outcomes. Such a model that captures all factors other than the one of interest is generally known as a baseline. In our study of how different pricing schemes affect residential electricity consumption, the baseline would need to capture the impact of outdoor temperature along with many other factors. In this work, we examine a number of different data mining techniques and demonstrate Gradient Tree Boosting (GTB) to be an effective method to build the baseline. We train GTB on data prior to the introduction of new pricing schemes, and apply the known temperature following the introduction of new pricing schemes to predict electricity usage with the expected temperature correction. Our experiments and analyses show that the baseline models generated by GTB capture the core characteristics over the two years with the new pricing schemes. In contrast to the majority of regression based techniques which fail to capture the lag between the peak of daily temperature and the peak of electricity usage, the GTB generated baselines are able to correctly capture the delay between the temperature peak and the electricity peak. Furthermore, subtracting this temperature-adjusted baseline from the observed electricity usage, we find that the resulting values are more amenable to interpretation, which demonstrates that the temperature-adjusted baseline is indeed effective.
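The baseline-and-subtract idea in this abstract can be sketched with a deliberately small gradient boosting model. This is not the authors' model: it assumes a single feature (temperature), depth-one trees (stumps), squared-error loss, and made-up numbers, and every function name is hypothetical. Each stump is fitted to the residuals of the current prediction, the model is trained only on pre-intervention data, and the predicted baseline is then subtracted from post-intervention observations.

```python
def fit_stump(x, residuals):
    """Find the single split on x that minimises squared error of the residuals."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # dropping the maximum keeps both sides non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def fit_boosted_stumps(x, y, n_trees=200, learning_rate=0.1):
    """Gradient boosting for squared loss: each stump is fitted to the current residuals."""
    intercept = sum(y) / len(y)
    prediction = [intercept] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, prediction)]
        threshold, left_mean, right_mean = fit_stump(x, residuals)
        stumps.append((threshold, left_mean, right_mean))
        prediction = [p + learning_rate * (left_mean if xi <= threshold else right_mean)
                      for xi, p in zip(x, prediction)]
    return intercept, learning_rate, stumps

def predict(model, x):
    """Sum the shrunken stump outputs on top of the intercept."""
    intercept, learning_rate, stumps = model
    return [intercept + learning_rate * sum(l if xi <= t else r for t, l, r in stumps)
            for xi in x]

# Made-up pre-intervention data: usage (kWh) rises with outdoor temperature (deg C).
temp_before = [18, 20, 22, 24, 26, 28, 30, 32, 34, 36]
usage_before = [1.0, 1.0, 1.1, 1.3, 1.6, 2.0, 2.5, 3.1, 3.8, 4.6]
baseline = fit_boosted_stumps(temp_before, usage_before)

# Post-intervention: observed usage minus the temperature-adjusted baseline.
temp_after, usage_after = [24, 30, 36], [1.2, 2.1, 3.9]
effect = [obs - base for obs, base in zip(usage_after, predict(baseline, temp_after))]
```

A realistic baseline would add further features, such as hour of day or lagged temperature, so that the delay between the temperature peak and the usage peak mentioned in the abstract can be represented; those are left out here to keep the sketch short.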
Pre-operative prediction of surgical morbidity in children: comparison of five statistical models.
Cooper, Jennifer N; Wei, Lai; Fernandez, Soledad A; Minneci, Peter C; Deans, Katherine J
2015-02-01
The accurate prediction of surgical risk is important to patients and physicians. Logistic regression (LR) models are typically used to estimate these risks. However, in the fields of data mining and machine learning, many alternative classification and prediction algorithms have been developed. This study aimed to compare the performance of LR to several data mining algorithms for predicting 30-day surgical morbidity in children. We used the 2012 National Surgical Quality Improvement Program-Pediatric dataset to compare the performance of (1) an LR model that assumed linearity and additivity (simple LR model), (2) an LR model incorporating restricted cubic splines and interactions (flexible LR model), (3) a support vector machine, (4) a random forest, and (5) boosted classification trees for predicting surgical morbidity. The ensemble-based methods showed significantly higher accuracy, sensitivity, specificity, PPV, and NPV than the simple LR model. However, none of the models performed better than the flexible LR model in terms of the aforementioned measures or in model calibration or discrimination. Support vector machines, random forests, and boosted classification trees do not show better performance than LR for predicting pediatric surgical morbidity. After further validation, the flexible LR model derived in this study could be used to assist with clinical decision-making based on patient-specific surgical risks. Copyright © 2014 Elsevier Ltd. All rights reserved.
An Update on Statistical Boosting in Biomedicine.
Mayr, Andreas; Hofner, Benjamin; Waldmann, Elisabeth; Hepp, Tobias; Meyer, Sebastian; Gefeller, Olaf
2017-01-01
Statistical boosting algorithms have triggered a lot of research during the last decade. They combine a powerful machine learning approach with classical statistical modelling, offering various practical advantages like automated variable selection and implicit regularization of effect estimates. They are extremely flexible, as the underlying base-learners (regression functions defining the type of effect for the explanatory variables) can be combined with any kind of loss function (target function to be optimized, defining the type of regression setting). In this review article, we highlight the most recent methodological developments on statistical boosting regarding variable selection, functional regression, and advanced time-to-event modelling. Additionally, we provide a short overview on relevant applications of statistical boosting in biomedicine.
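The "automated variable selection" mentioned above can be illustrated with component-wise L2 boosting, in which each step updates only the single best-fitting base-learner. The sketch below assumes simple linear base-learners and squared-error loss; the function name, step length and toy data are hypothetical choices, not taken from the review.

```python
def componentwise_l2_boost(X, y, n_steps=200, nu=0.1):
    """Component-wise L2 boosting with one linear base-learner per column.

    In every step each column is fitted to the current residuals by least
    squares, and only the column that reduces the squared error most is
    updated (by a fraction nu of its fitted slope). Columns that are never
    selected keep a coefficient of exactly zero.
    """
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    cols = [[row[j] - col_means[j] for row in X] for j in range(p)]
    y_mean = sum(y) / n
    resid = [yi - y_mean for yi in y]
    coef = [0.0] * p
    for _ in range(n_steps):
        best_j, best_slope, best_gain = None, 0.0, -1.0
        for j, col in enumerate(cols):
            sxx = sum(v * v for v in col)
            if sxx == 0.0:
                continue  # constant column carries no information
            slope = sum(v * r for v, r in zip(col, resid)) / sxx
            gain = slope * slope * sxx  # reduction in residual sum of squares
            if gain > best_gain:
                best_j, best_slope, best_gain = j, slope, gain
        if best_j is None:
            break
        coef[best_j] += nu * best_slope
        resid = [r - nu * best_slope * v for r, v in zip(resid, cols[best_j])]
    intercept = y_mean - sum(c * m for c, m in zip(coef, col_means))
    return intercept, coef

# Hypothetical data: y depends only on the first column
X = [[0, 1], [1, -1], [2, 0], [3, 0], [4, -1], [5, 1]]
y = [1 + 2 * row[0] for row in X]
intercept, coef = componentwise_l2_boost(X, y)
```

Stopping the loop early (a smaller `n_steps`) shrinks the coefficients toward zero, which is the implicit regularization the abstract refers to.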
Ensemble habitat mapping of invasive plant species
Stohlgren, T.J.; Ma, P.; Kumar, S.; Rocca, M.; Morisette, J.T.; Jarnevich, C.S.; Benson, N.
2010-01-01
Ensemble species distribution models combine the strengths of several species environmental matching models, while minimizing the weakness of any one model. Ensemble models may be particularly useful in risk analysis of recently arrived, harmful invasive species because species may not yet have spread to all suitable habitats, leaving species-environment relationships difficult to determine. We tested five individual models (logistic regression, boosted regression trees, random forest, multivariate adaptive regression splines (MARS), and maximum entropy model or Maxent) and ensemble modeling for selected nonnative plant species in Yellowstone and Grand Teton National Parks, Wyoming; Sequoia and Kings Canyon National Parks, California, and areas of interior Alaska. The models are based on field data provided by the park staffs, combined with topographic, climatic, and vegetation predictors derived from satellite data. For the four invasive plant species tested, ensemble models were the only models that ranked in the top three models for both field validation and test data. Ensemble models may be more robust than individual species-environment matching models for risk analysis. © 2010 Society for Risk Analysis.
Segmentation of optic disc and optic cup in retinal fundus images using shape regression.
Sedai, Suman; Roy, Pallab K; Mahapatra, Dwarikanath; Garnavi, Rahil
2016-08-01
Glaucoma is one of the leading causes of blindness. The manual examination of optic cup and disc is a standard procedure used for detecting glaucoma. This paper presents a fully automatic regression-based method which accurately segments the optic cup and disc in retinal colour fundus images. First, we roughly segment the optic disc using a circular Hough transform. The approximated optic disc is then used to compute the initial optic disc and cup shapes. We propose a robust and efficient cascaded shape regression method which iteratively learns the final shape of the optic cup and disc from a given initial shape. Gradient boosted regression trees are employed to learn each regressor in the cascade. A novel data augmentation approach is proposed to improve the regressors' performance by generating synthetic training data. The proposed optic cup and disc segmentation method is applied to an image set of 50 patients and demonstrates high segmentation accuracy for optic cup and disc, with Dice metrics of 0.95 and 0.85 respectively. A comparative study shows that our proposed method outperforms state-of-the-art optic cup and disc segmentation methods.
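The Dice metric quoted above measures the overlap between a predicted and a reference segmentation as twice the intersection divided by the sum of the two areas. A minimal sketch on binary masks follows; the function name and the tiny 3 × 3 masks are hypothetical.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice overlap of two binary masks given as nested lists of 0/1."""
    a = {(r, c) for r, row in enumerate(mask_a) for c, v in enumerate(row) if v}
    b = {(r, c) for r, row in enumerate(mask_b) for c, v in enumerate(row) if v}
    if not a and not b:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return 2 * len(a & b) / (len(a) + len(b))

predicted = [[0, 1, 1],
             [0, 1, 1],
             [0, 0, 0]]
reference = [[0, 1, 1],
             [0, 1, 0],
             [0, 0, 0]]
score = dice_coefficient(predicted, reference)  # 2 * 3 / (4 + 3)
```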
Creating silvopastures: some considerations when planting trees in pastures
John Fike; Adam Downing; John Munsell; Gregory E. Frey; Kelly Mercier; Gabriel Pent; Chris Teutsch; J.B. Daniel; Jason Fisher; Miller Adams; Todd Groh
2017-01-01
Silvopastures (integrated tree-forage-livestock production systems) have the potential to boost farm resource use and income. These systems take advantage of the beneficial interactions...
Lei, Tailong; Sun, Huiyong; Kang, Yu; Zhu, Feng; Liu, Hui; Zhou, Wenfang; Wang, Zhe; Li, Dan; Li, Youyong; Hou, Tingjun
2017-11-06
Xenobiotic chemicals and their metabolites are mainly excreted out of our bodies by the urinary tract through the urine. Chemical-induced urinary tract toxicity is one of the main reasons that cause failure during drug development, and it is a common adverse event for medications, natural supplements, and environmental chemicals. Despite its importance, there are only a few in silico models for assessing urinary tract toxicity for a large number of compounds with diverse chemical structures. Here, we developed a series of qualitative and quantitative structure-activity relationship (QSAR) models for predicting urinary tract toxicity. In our study, the recursive feature elimination method incorporated with random forests (RFE-RF) was used for dimension reduction, and then eight machine learning approaches were used for QSAR modeling, i.e., relevance vector machine (RVM), support vector machine (SVM), regularized random forest (RRF), C5.0 trees, eXtreme gradient boosting (XGBoost), AdaBoost.M1, SVM boosting (SVMBoost), and RVM boosting (RVMBoost). For building classification models, the synthetic minority oversampling technique was used to handle the imbalanced data set problem. Among all the machine learning approaches, SVMBoost based on the RBF kernel achieves both the best quantitative (q²ext = 0.845) and qualitative predictions for the test set (MCC of 0.787, AUC of 0.893, sensitivity of 89.6%, specificity of 94.1%, and global accuracy of 90.8%). The application domains were then analyzed, and all of the tested chemicals fall within the application domain coverage. We also examined the structure features of the chemicals with large prediction errors. In brief, both the regression and classification models developed by the SVMBoost approach have reliable prediction capability for assessing chemical-induced urinary tract toxicity.
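For readers less familiar with the qualitative measures reported above, the sketch below derives sensitivity, specificity, global accuracy and the Matthews correlation coefficient (MCC) from the four cells of a confusion matrix. The counts are made up for illustration and are not the study's data.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, accuracy and MCC from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "mcc": mcc}

# Hypothetical test-set counts
metrics = classification_metrics(tp=40, fp=5, tn=45, fn=10)
```

Unlike accuracy, MCC uses all four cells symmetrically, which is why it is often preferred for imbalanced data sets.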
Liu, Rong; Li, Xi; Zhang, Wei; Zhou, Hong-Hao
2015-01-01
Objective: Multiple linear regression (MLR) and machine learning techniques in pharmacogenetic algorithm-based warfarin dosing have been reported. However, the performances of these algorithms in racially diverse groups have never been objectively evaluated and compared. In this literature-based study, we compared the performances of eight machine learning techniques with those of MLR in a large, racially diverse cohort. Methods: MLR, artificial neural network (ANN), regression tree (RT), multivariate adaptive regression splines (MARS), boosted regression tree (BRT), support vector regression (SVR), random forest regression (RFR), lasso regression (LAR) and Bayesian additive regression trees (BART) were applied in warfarin dose algorithms in a cohort from the International Warfarin Pharmacogenetics Consortium database. Covariates obtained by stepwise regression from 80% of randomly selected patients were used to develop algorithms. To compare the performances of these algorithms, the mean percentage of patients whose predicted dose fell within 20% of the actual dose (mean percentage within 20%) and the mean absolute error (MAE) were calculated in the remaining 20% of patients. The performances of these techniques in different races, as well as across the dose ranges of therapeutic warfarin, were compared. Robust results were obtained after 100 rounds of resampling. Results: BART, MARS and SVR were statistically indistinguishable and significantly outperformed all the other approaches in the whole cohort (MAE: 8.84–8.96 mg/week, mean percentage within 20%: 45.88%–46.35%). In the White population, MARS and BART showed higher mean percentage within 20% and lower mean MAE than those of MLR (all p values < 0.05). In the Asian population, SVR, BART, MARS and LAR performed the same as MLR. MLR and LAR performed best among the Black population.
When patients were grouped in terms of warfarin dose range, all machine learning techniques except ANN and LAR showed significantly higher mean percentage within 20%, and lower MAE (all p values < 0.05) than MLR in the low- and high-dose ranges. Conclusion: Overall, the machine learning-based techniques BART, MARS and SVR performed better than MLR in warfarin pharmacogenetic dosing. Differences in the algorithms’ performances exist among the races. Moreover, machine learning-based algorithms tended to perform better in the low- and high-dose ranges than MLR. PMID:26305568
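The two evaluation measures used in this comparison are simple to compute. The sketch below shows one plausible reading of "percentage within 20%" (absolute error no larger than 20% of the actual dose) alongside the mean absolute error; the function name and the doses are hypothetical.

```python
def dose_accuracy(predicted, actual, tolerance=0.20):
    """Return (percentage of predictions within tolerance of actual, MAE)."""
    n = len(actual)
    within = sum(1 for p, a in zip(predicted, actual)
                 if abs(p - a) <= tolerance * a)
    mae = sum(abs(p - a) for p, a in zip(predicted, actual)) / n
    return 100.0 * within / n, mae

# Hypothetical weekly doses in mg/week
actual_dose = [20, 35, 50, 70]
predicted_dose = [23, 30, 62, 71]
pct_within_20, mae = dose_accuracy(predicted_dose, actual_dose)
```

In the study these quantities were averaged over 100 resampling rounds; the sketch covers a single round only.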
NASA Astrophysics Data System (ADS)
Sayegh, Arwa; Tate, James E.; Ropkins, Karl
2016-02-01
Oxides of nitrogen (NOx) are a major component of photochemical smog and their constituents are considered principal traffic-related pollutants affecting human health. This study investigates the influence of background concentrations of NOx, traffic density, and prevailing meteorological conditions on roadside concentrations of NOx at UK urban, open motorway, and motorway tunnel sites using the statistical approach Boosted Regression Trees (BRT). BRT models have been fitted using hourly concentration, traffic, and meteorological data for each site. The models predict, rank, and visualise the relationship between model variables and roadside NOx concentrations. A strong relationship between roadside NOx and monitored local background concentrations is demonstrated. Relationships between roadside NOx and other model variables have been shown to be strongly influenced by the quality and resolution of background concentrations of NOx, i.e. whether they are based on monitored data or modelled predictions. The paper proposes a direct method of using site-specific fundamental diagrams for splitting traffic data into four traffic states: free-flow, busy-flow, congested, and severely congested. Using BRT models, the density of traffic (vehicles per kilometre) was observed to have a proportional influence on the concentrations of roadside NOx, with different fitted regression line slopes for the different traffic states. When other influences are conditioned out, the relationship between roadside concentrations and ambient air temperature suggests NOx concentrations reach a minimum at around 22 °C, with high concentrations at low ambient air temperatures which could be associated with restricted atmospheric dispersion and/or with changes in road traffic exhaust emission characteristics at low ambient air temperatures. This paper uses BRT models to study how different critical factors, and their relative importance, influence the variation of roadside NOx concentrations. 
The paper highlights the importance of either setting up local background continuous monitors or improving the quality and resolution of modelled UK background maps and the need to further investigate the influence of ambient air temperature on NOx emissions and roadside NOx concentrations.
Using Boosting Decision Trees in Gravitational Wave Searches triggered by Gamma-ray Bursts
NASA Astrophysics Data System (ADS)
Zuraw, Sarah; LIGO Collaboration
2015-04-01
The search for gravitational wave bursts requires the ability to distinguish weak signals from background detector noise. Gravitational wave bursts are characterized by their transient nature, making them particularly difficult to detect as they are similar to non-Gaussian noise fluctuations in the detector. The Boosted Decision Tree method is a powerful machine learning algorithm which uses Multivariate Analysis techniques to explore high-dimensional data sets in order to distinguish between gravitational wave signal and background detector noise. It does so by training with known noise events and simulated gravitational wave events. The method is tested using waveform models and compared with the performance of the standard gravitational wave burst search pipeline for Gamma-ray Bursts. It is shown that the method is able to effectively distinguish between signal and background events under a variety of conditions and over multiple Gamma-ray Burst events. This example demonstrates the usefulness and robustness of the Boosted Decision Tree and Multivariate Analysis techniques as a detection method for gravitational wave bursts. LIGO, UMass, PREP, NEGAP.
A Ranking Approach to Genomic Selection.
Blondel, Mathieu; Onogi, Akio; Iwata, Hiroyoshi; Ueda, Naonori
2015-01-01
Genomic selection (GS) is a recent selective breeding method which uses predictive models based on whole-genome molecular markers. Until now, existing studies formulated GS as the problem of modeling an individual's breeding value for a particular trait of interest, i.e., as a regression problem. To assess predictive accuracy of the model, the Pearson correlation between observed and predicted trait values was used. In this paper, we propose to formulate GS as the problem of ranking individuals according to their breeding value. Our proposed framework allows us to employ machine learning methods for ranking which had previously not been considered in the GS literature. To assess ranking accuracy of a model, we introduce a new measure originating from the information retrieval literature called normalized discounted cumulative gain (NDCG). NDCG rewards more strongly models which assign a high rank to individuals with high breeding value. Therefore, NDCG reflects a prerequisite objective in selective breeding: accurate selection of individuals with high breeding value. We conducted a comparison of 10 existing regression methods and 3 new ranking methods on 6 datasets, consisting of 4 plant species and 25 traits. Our experimental results suggest that tree-based ensemble methods including McRank, Random Forests and Gradient Boosting Regression Trees achieve excellent ranking accuracy. RKHS regression and RankSVM also achieve good accuracy when used with an RBF kernel. Traditional regression methods such as Bayesian lasso, wBSR and BayesC were found less suitable for ranking. Pearson correlation was found to correlate poorly with NDCG. Our study suggests two important messages. First, ranking methods are a promising research direction in GS. Second, NDCG can be a useful evaluation measure for GS.
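To make the NDCG measure concrete, the sketch below ranks individuals by predicted value, accumulates their observed values with a logarithmic position discount, and normalizes by the best achievable ordering. It assumes a linear gain on non-negative observed values and a base-2 logarithmic discount, which is one common variant and not necessarily the exact definition used in the paper; all names and numbers are hypothetical.

```python
import math

def dcg(values):
    """Discounted cumulative gain with linear gain and log2 position discount."""
    return sum(v / math.log2(i + 2) for i, v in enumerate(values))

def ndcg_at_k(observed, predicted, k):
    """NDCG@k: DCG of the top-k by predicted score over the ideal DCG."""
    order = sorted(range(len(observed)), key=lambda i: predicted[i], reverse=True)
    ranked = [observed[i] for i in order[:k]]
    ideal = sorted(observed, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical observed breeding values and two sets of model scores
observed = [3.0, 1.0, 2.0]
good_scores = [0.9, 0.2, 0.5]   # same ordering as the observed values
bad_scores = [0.1, 0.9, 0.5]    # best individual ranked last
ndcg_good = ndcg_at_k(observed, good_scores, k=3)
ndcg_bad = ndcg_at_k(observed, bad_scores, k=3)
```

Because of the position discount, errors near the top of the ranking cost more than errors near the bottom, which matches the breeding objective of selecting the best individuals.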
Lara, Mark J; Genet, Hélène; McGuire, Anthony D; Euskirchen, Eugénie S; Zhang, Yujin; Brown, Dana R N; Jorgenson, Mark T; Romanovsky, Vladimir; Breen, Amy; Bolton, William R
2016-02-01
Lowland boreal forest ecosystems in Alaska are dominated by wetlands comprised of a complex mosaic of fens, collapse-scar bogs, low shrub/scrub, and forests growing on elevated ice-rich permafrost soils. Thermokarst has affected the lowlands of the Tanana Flats in central Alaska for centuries, as thawing permafrost collapses forests that transition to wetlands. Located within the discontinuous permafrost zone, this region has significantly warmed over the past half-century, and much of these carbon-rich permafrost soils are now within ~0.5 °C of thawing. Increased permafrost thaw in lowland boreal forests in response to warming may have consequences for the climate system. This study evaluates the trajectories and potential drivers of 60 years of forest change in a landscape subjected to permafrost thaw in unburned dominant forest types (paper birch and black spruce) associated with location on elevated permafrost plateau and across multiple time periods (1949, 1978, 1986, 1998, and 2009) using historical and contemporary aerial and satellite images for change detection. We developed (i) a deterministic statistical model to evaluate the potential climatic controls on forest change using gradient boosting and regression tree analysis, and (ii) a 30 × 30 m land cover map of the Tanana Flats to estimate the potential landscape-level losses of forest area due to thermokarst from 1949 to 2009. Over the 60-year period, we observed a nonlinear loss of birch forests and a relatively continuous gain of spruce forest associated with thermokarst and forest succession, while gradient boosting/regression tree models identify precipitation and forest fragmentation as the primary factors controlling birch and spruce forest change, respectively. Between 1950 and 2009, landscape-level analysis estimates a transition of ~15 km² or ~7% of birch forests to wetlands, where the greatest change followed warm periods. 
This work highlights that the vulnerability and resilience of lowland ice-rich permafrost ecosystems to climate changes depend on forest type. © 2015 John Wiley & Sons Ltd.
You, Ming P.; Rensing, Kelly; Renton, Michael; Barbetti, Martin J.
2017-01-01
Subterranean clover (Trifolium subterraneum) is a critical pasture legume in Mediterranean regions of southern Australia and elsewhere, including Mediterranean-type climatic regions in Africa, Asia, Australia, Europe, North America, and South America. Pythium damping-off and root disease caused by Pythium irregulare is a significant threat to subterranean clover in Australia and a study was conducted to define how environmental factors (viz. temperature, soil type, moisture and nutrition), as well as variety, influence the extent of damping-off and root disease as well as subterranean clover productivity under challenge by this pathogen. Relationships were statistically modeled using linear and generalized linear models and boosted regression trees. Modeling found complex relationships between explanatory variables and the extent of Pythium damping-off and root rot. Linear modeling identified high-level (4 or 5-way) significant interactions for each dependent variable (dry shoot and root weight, emergence, tap and lateral root disease index). Furthermore, all explanatory variables (temperature, soil, moisture, nutrition, variety) were found significant as part of some interaction within these models. A significant five-way interaction between all explanatory variables was found for both dry shoot and root dry weights, and a four-way interaction between temperature, soil, moisture, and nutrition was found for both tap and lateral root disease index. A second approach to modeling using boosted regression trees provided support for and helped clarify the complex nature of the relationships found in linear models. All explanatory variables showed at least 5% relative influence on each of the five dependent variables. 
All models indicated differences due to soil type, with the sand-based soil having either higher weights, greater emergence, or lower disease indices; while lowest weights and less emergence, as well as higher disease indices, were found for loam soil and low temperature. There was more severe tap and lateral root rot disease in higher moisture situations. PMID:29184544
Prediction of fishing effort distributions using boosted regression trees.
Soykan, Candan U; Eguchi, Tomoharu; Kohin, Suzanne; Dewar, Heidi
2014-01-01
Concerns about bycatch of protected species have become a dominant factor shaping fisheries management. However, efforts to mitigate bycatch are often hindered by a lack of data on the distributions of fishing effort and protected species. One approach to overcoming this problem has been to overlay the distribution of past fishing effort with known locations of protected species, often obtained through satellite telemetry and occurrence data, to identify potential bycatch hotspots. This approach, however, generates static bycatch risk maps, calling into question their ability to forecast into the future, particularly when dealing with spatiotemporally dynamic fisheries and highly migratory bycatch species. In this study, we use boosted regression trees to model the spatiotemporal distribution of fishing effort for two distinct fisheries in the North Pacific Ocean, the albacore (Thunnus alalunga) troll fishery and the California drift gillnet fishery that targets swordfish (Xiphias gladius). Our results suggest that it is possible to accurately predict fishing effort using < 10 readily available predictor variables (cross-validated correlations between model predictions and observed data ~0.6). Although the two fisheries are quite different in their gears and fishing areas, their respective models had high predictive ability, even when input data sets were restricted to a fraction of the full time series. The implications for conservation and management are encouraging: Across a range of target species, fishing methods, and spatial scales, even a relatively short time series of fisheries data may suffice to accurately predict the location of fishing effort into the future. In combination with species distribution modeling of bycatch species, this approach holds promise as a mitigation tool when observer data are limited. 
Even in data-rich regions, modeling fishing effort and bycatch may provide more accurate estimates of bycatch risk than partial observer coverage for fisheries and bycatch species that are heavily influenced by dynamic oceanographic conditions.
Lara, M.; Genet, Helene; McGuire, A. David; Euskirchen, Eugénie S.; Zhang, Yujin; Brown, Dana R. N.; Jorgenson, M.T.; Romanovsky, V.; Breen, Amy L.; Bolton, W.R.
2016-01-01
Lowland boreal forest ecosystems in Alaska are dominated by wetlands comprised of a complex mosaic of fens, collapse-scar bogs, low shrub/scrub, and forests growing on elevated ice-rich permafrost soils. Thermokarst has affected the lowlands of the Tanana Flats in central Alaska for centuries, as thawing permafrost collapses forests that transition to wetlands. Located within the discontinuous permafrost zone, this region has significantly warmed over the past half-century, and much of these carbon-rich permafrost soils are now within ~0.5 °C of thawing. Increased permafrost thaw in lowland boreal forests in response to warming may have consequences for the climate system. This study evaluates the trajectories and potential drivers of 60 years of forest change in a landscape subjected to permafrost thaw in unburned dominant forest types (paper birch and black spruce) associated with location on elevated permafrost plateau and across multiple time periods (1949, 1978, 1986, 1998, and 2009) using historical and contemporary aerial and satellite images for change detection. We developed (i) a deterministic statistical model to evaluate the potential climatic controls on forest change using gradient boosting and regression tree analysis, and (ii) a 30 × 30 m land cover map of the Tanana Flats to estimate the potential landscape-level losses of forest area due to thermokarst from 1949 to 2009. Over the 60-year period, we observed a nonlinear loss of birch forests and a relatively continuous gain of spruce forest associated with thermokarst and forest succession, while gradient boosting/regression tree models identify precipitation and forest fragmentation as the primary factors controlling birch and spruce forest change, respectively. Between 1950 and 2009, landscape-level analysis estimates a transition of ~15 km² or ~7% of birch forests to wetlands, where the greatest change followed warm periods. 
This work highlights that the vulnerability and resilience of lowland ice-rich permafrost ecosystems to climate changes depend on forest type.
Learning-based Wind Estimation using Distant Soundings for Unguided Aerial Delivery
NASA Astrophysics Data System (ADS)
Plyler, M.; Cahoy, K.; Angermueller, K.; Chen, D.; Markuzon, N.
2016-12-01
Delivering unguided, parachuted payloads from aircraft requires accurate knowledge of the wind field inside an operational zone. Usually, a dropsonde released from the aircraft over the drop zone gives a more accurate wind estimate than a forecast. Mission objectives occasionally demand releasing the dropsonde away from the drop zone, but still require accuracy and precision. Barnes interpolation and many other assimilation methods do poorly when the forecast error is inconsistent in a forecast grid. A machine learning approach can better leverage non-linear relations between different weather patterns and thus provide a better wind estimate at the target drop zone when using data collected up to 100 km away. This study uses the 13 km resolution Rapid Refresh (RAP) dataset available through NOAA and subsamples to an area around Yuma, AZ and up to approximately 10km AMSL. RAP forecast grids are updated with simulated dropsondes taken from analysis (historical weather maps). We train models using different data mining and machine learning techniques, most notably boosted regression trees, that can accurately assimilate the distant dropsonde. The model takes a forecast grid and simulated remote dropsonde data as input and produces an estimate of the wind stick over the drop zone. Using ballistic winds as a defining metric, we show our data driven approach does better than Barnes interpolation under some conditions, most notably when the forecast error is different between the two locations, on test data previously unseen by the model. We study and evaluate the model's performance depending on the size, the time lag, the drop altitude, and the geographic location of the training set, and identify parameters most contributing to the accuracy of the wind estimation. This study demonstrates a new approach for assimilating remotely released dropsondes, based on boosted regression trees, and shows improvement in wind estimation over currently used methods.
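For context on the comparison method, Barnes interpolation estimates a value at a target point as a distance-weighted average of observations using Gaussian weights. The sketch below covers only the first pass (the full scheme adds successive correction passes), and the coordinates, wind values and smoothing parameter `kappa` are hypothetical.

```python
import math

def barnes_first_pass(observations, target, kappa):
    """First pass of Barnes interpolation.

    observations: list of (x_km, y_km, value) tuples
    target: (x_km, y_km) location to estimate
    kappa: smoothing parameter in km^2; larger values give smoother fields
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, value in observations:
        dist_sq = (x - target[0]) ** 2 + (y - target[1]) ** 2
        weight = math.exp(-dist_sq / kappa)  # Gaussian distance weight
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total

# Hypothetical wind speeds (m/s) from two soundings 100 km apart
soundings = [(0.0, 0.0, 10.0), (100.0, 0.0, 20.0)]
at_first_sounding = barnes_first_pass(soundings, (0.0, 0.0), kappa=2500.0)
at_midpoint = barnes_first_pass(soundings, (50.0, 0.0), kappa=2500.0)
```

Because the weights depend only on distance, the scheme cannot account for forecast errors that differ between the sounding location and the drop zone, which is the situation in which the abstract reports an advantage for the learned model.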
Altartouri, Anas; Nurminen, Leena; Jolma, Ari
2014-01-01
Phragmites australis, a native helophyte in coastal areas of the Baltic Sea, has significantly spread on the Finnish coast in the last decades raising ecological questions and social interest and concern due to the important role it plays in the ecosystem dynamics of shallow coastal areas. Despite its important implications on the planning and management of the area, predictive modeling of Phragmites distribution is not well studied. We examined the prevalence and progression of Phragmites in four sites along the Southern Finnish coast in multiple time frames in relation to a number of predictors. We also analyzed patterns of neighborhood effect on the expansion and disappearance of Phragmites in a cellular data model. We developed boosted regression trees models to predict Phragmites occurrences and produce maps of habitat suitability. Various Phragmites spread figures were observed in different areas and time periods, with a minimum annual expansion rate of 1% and a maximum of 8%. The water depth, shore openness, and proximity to river mouths were found influential in Phragmites distribution. The neighborhood configuration partially explained the dynamics of Phragmites colonies. The boosted regression trees method was successfully used to interpolate and extrapolate Phragmites distributions in the study sites highlighting its potential for assessing habitat suitability for Phragmites along the Finnish coast. Our findings are useful for a number of applications. With variables easily available, delineation of areas susceptible for Phragmites colonization allows early management plans to be made. Given the influence of reed beds on the littoral species and ecosystem, these results can be useful for the ecological studies of coastal areas. 
We provide estimates of habitat suitability and quantification of Phragmites expansion in a form suitable for dynamic modeling, which would be useful for predicting future Phragmites distribution under different scenarios of land cover change and Phragmites spatial configuration. PMID:24772277
Kabaria, Caroline W; Gilbert, Marius; Noor, Abdisalan M; Snow, Robert W; Linard, Catherine
2017-01-26
Although malaria has been traditionally regarded as less of a problem in urban areas compared to neighbouring rural areas, the risk of malaria infection continues to exist in densely populated, urban areas of Africa. Despite the recognition that urbanization influences the epidemiology of malaria, there is little consensus on urbanization relevant for malaria parasite mapping. Previous studies examining the relationship between urbanization and malaria transmission have used products defining urbanization at global/continental scales developed in the early 2000s, that overestimate actual urban extents while the population estimates are over 15 years old and estimated at administrative unit level. This study sought to discriminate an urbanization definition that is most relevant for malaria parasite mapping using individual level malaria infection data obtained from nationally representative household-based surveys. Boosted regression tree (BRT) modelling was used to determine the effect of urbanization on malaria transmission and if this effect varied with urbanization definition. In addition, the most recent high resolution population distribution data was used to determine whether population density had significant effect on malaria parasite prevalence and if so, could population density replace urban classifications in modelling malaria transmission patterns. The risk of malaria infection was shown to decline from rural areas through peri-urban settlements to urban central areas. Population density was found to be an important predictor of malaria risk. The final boosted regression trees (BRT) model with urbanization and population density gave the best model fit (Tukey test p value <0.05) compared to the models with urbanization only. Given the challenges in uniformly classifying urban areas across different countries, population density provides a reliable metric to adjust for the patterns of malaria risk in densely populated urban areas. 
Future malaria risk models can, therefore, be improved by including both population density and urbanization which have both been shown to have significant impact on malaria risk in this study.
Comparison of stream invertebrate response models for bioassessment metric
Waite, Ian R.; Kennen, Jonathan G.; May, Jason T.; Brown, Larry R.; Cuffney, Thomas F.; Jones, Kimberly A.; Orlando, James L.
2012-01-01
We aggregated invertebrate data from various sources to assemble data for modeling in two ecoregions in Oregon and one in California. Our goal was to compare the performance of models developed using multiple linear regression (MLR) techniques with models developed using three relatively new techniques: classification and regression trees (CART), random forest (RF), and boosted regression trees (BRT). We used tolerance of taxa based on richness (RICHTOL) and ratio of observed to expected taxa (O/E) as response variables and land use/land cover as explanatory variables. Responses were generally linear; therefore, there was little improvement to the MLR models when compared to models using CART and RF. In general, the four modeling techniques (MLR, CART, RF, and BRT) consistently selected the same primary explanatory variables for each region. However, results from the BRT models showed significant improvement over the MLR models for each region, with increases in R² from 0.09 to 0.20. The O/E metric that was derived from models specifically calibrated for Oregon consistently had lower R² values than RICHTOL for the two regions tested. Modeled O/E R² values were between 0.06 and 0.10 lower for each of the four modeling methods applied in the Willamette Valley and were between 0.19 and 0.36 points lower for the Blue Mountains. As a result, BRT models may indeed represent a good alternative to MLR for modeling species distribution relative to environmental variables.
Jin, Mingwu; Deng, Weishu
2018-05-15
There is a spectrum of progression from healthy control (HC) to mild cognitive impairment (MCI) without conversion to Alzheimer's disease (AD), to MCI with conversion to AD (cMCI), and to AD. This study aims to predict the different disease stages using brain structural information provided by magnetic resonance imaging (MRI) data. Neighborhood component analysis (NCA) is applied to select the most powerful features for prediction. An ensemble decision tree classifier is built to predict which group the subject belongs to. The best features and model parameters are determined by cross validation of the training data. Our results show that 16 out of a total of 429 features were selected by NCA using 240 training subjects, including MMSE score and structural measures in memory-related regions. The boosting tree model with NCA features can achieve prediction accuracy of 56.25% on 160 test subjects. Principal component analysis (PCA) and sequential feature selection (SFS) are used as alternative feature selection methods, while support vector machine (SVM) is used as an alternative classifier. The boosting tree model with NCA features outperforms all other combinations of feature selection and classification methods. The results suggest that NCA is a better feature selection strategy than PCA and SFS for the data used in this study, and that an ensemble tree classifier with boosting is more powerful than SVM for predicting the subject group. However, more advanced feature selection and classification methods or additional measures besides structural MRI may be needed to improve the prediction performance. Copyright © 2018 Elsevier B.V. All rights reserved.
Using "The Giving Tree" To Teach Literary Criticism.
ERIC Educational Resources Information Center
Remler, Nancy Lawson
2000-01-01
Argues that introducing students to literary criticism while introducing them to literature boosts their confidence and abilities to analyze literature, and increases their interest in discussing it. Describes how the author, in her college-level introductory literature course, used Shel Silverstein's "The Giving Tree" (a children's…
Elizabeth A. Freeman; Gretchen G. Moisen; John W. Coulston; Barry T. (Ty) Wilson
2015-01-01
As part of the development of the 2011 National Land Cover Database (NLCD) tree canopy cover layer, a pilot project was launched to test the use of high-resolution photography coupled with extensive ancillary data to map the distribution of tree canopy cover over four study regions in the conterminous US. Two stochastic modeling techniques, random forests (RF...
Plieninger, Tobias; Levers, Christian; Mantel, Martin; Costa, Augusta; Schaich, Harald; Kuemmerle, Tobias
2015-01-01
Scattered trees support high levels of farmland biodiversity and ecosystem services in agricultural landscapes, but they are threatened by agricultural intensification, urbanization, and land abandonment. This study aimed to map and quantify the decline of orchard meadows (scattered fruit trees of high nature conservation value) for a region in Southwestern Germany for the 1968-2009 period and to identify the driving forces of this decline. We derived orchard meadow loss from 1968 and 2009 aerial images and used a boosted regression trees modelling framework to assess the relative importance of 18 environmental, demographic, and socio-economic variables to test five alternative hypotheses explaining orchard meadow loss. We found that orchard meadow loss occurred in flatter areas, in areas where smaller plot sizes and fragmented orchard meadows prevailed, and in areas near settlements and infrastructure. The analysis did not confirm that orchard meadow loss was higher in areas where agricultural intensification was stronger and in areas of lower implementation levels of conservation policies. Our results demonstrated that the influential drivers of orchard meadow loss were those that reduce economic profitability and increase opportunity costs for orchards, providing incentives for converting orchard meadows to other, more profitable land uses. These insights could be taken up by local- and regional-level conservation policies to identify the sites of persistent orchard meadows in agricultural landscapes that would be prioritized in conservation efforts. PMID:25932914
Robust boosting via convex optimization
NASA Astrophysics Data System (ADS)
Rätsch, Gunnar
2001-12-01
In this work we consider statistical learning problems. A learning machine aims to extract information from a set of training examples such that it is able to predict the associated label on unseen examples. We consider the case where the resulting classification or regression rule is a combination of simple rules, also called base hypotheses. The so-called boosting algorithms iteratively find a weighted linear combination of base hypotheses that predict well on unseen data. We address the following issues:
o The statistical learning theory framework for analyzing boosting methods. We study learning theoretic guarantees on the prediction performance on unseen examples. Recently, large margin classification techniques emerged as a practical result of the theory of generalization, in particular Boosting and Support Vector Machines. A large margin implies a good generalization performance. Hence, we analyze how large the margins in boosting are and find an improved algorithm that is able to generate the maximum margin solution.
o How can boosting methods be related to mathematical optimization techniques? To analyze the properties of the resulting classification or regression rule, it is of high importance to understand whether and under which conditions boosting converges. We show that boosting can be used to solve large scale constrained optimization problems, whose solutions are well characterizable. To show this, we relate boosting methods to methods known from mathematical optimization, and derive convergence guarantees for a quite general family of boosting algorithms.
o How to make Boosting noise robust? One of the problems of current boosting techniques is that they are sensitive to noise in the training sample. In order to make boosting robust, we transfer the soft margin idea from support vector learning to boosting. We develop theoretically motivated regularized algorithms that exhibit a high noise robustness.
o How to adapt boosting to regression problems? Boosting methods are originally designed for classification problems. To extend the boosting idea to regression problems, we use the previous convergence results and relations to semi-infinite programming to design boosting-like algorithms for regression problems. We show that these leveraging algorithms have desirable theoretical and practical properties.
o Can boosting techniques be useful in practice? The presented theoretical results are guided by simulation results either to illustrate properties of the proposed algorithms or to show that they work well in practice. We report on successful applications in a non-intrusive power monitoring system, chaotic time series analysis and a drug discovery process.
Note: The author received the Michelson Prize, awarded by the Faculty of Mathematics and Natural Sciences of the University of Potsdam, for the best doctoral thesis of 2001/2002.
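The idea of iteratively building a weighted linear combination of base hypotheses, as described in the abstract above, can be sketched with plain discrete AdaBoost on threshold rules. This is the textbook algorithm, not the regularized or maximum-margin variants developed in the thesis, and the one-dimensional toy data are invented. The helper names (`stump_predict`, `best_stump`, `adaboost`, `predict`) are assumptions made for the illustration.

```python
import math

def stump_predict(xi, threshold, polarity):
    """Base hypothesis: +polarity above the threshold, -polarity otherwise."""
    return polarity if xi > threshold else -polarity

def best_stump(x, y, w):
    """Threshold rule with the smallest weighted training error."""
    values = sorted(set(x))
    thresholds = [values[0] - 1.0] + [(a + b) / 2.0
                                      for a, b in zip(values, values[1:])]
    best = None
    for t in thresholds:
        for polarity in (1, -1):
            err = sum(wi for xi, yi, wi in zip(x, y, w)
                      if stump_predict(xi, t, polarity) != yi)
            if best is None or err < best[0]:
                best = (err, t, polarity)
    return best

def adaboost(x, y, rounds):
    """Return a list of (alpha, threshold, polarity) weighted base rules."""
    n = len(x)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, t, polarity = best_stump(x, y, w)
        err = min(max(err, 1e-12), 1.0 - 1e-12)   # guard against log(0)
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, t, polarity))
        # Up-weight the examples the new rule gets wrong, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, t, polarity))
             for xi, yi, wi in zip(x, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, xi):
    """Sign of the weighted vote of all base rules."""
    score = sum(alpha * stump_predict(xi, t, polarity)
                for alpha, t, polarity in ensemble)
    return 1 if score >= 0 else -1

# Invented data: the positive class is an interval, so no single
# threshold rule can separate it, but a weighted vote of three can.
x = list(range(10))
y = [-1, -1, -1, 1, 1, 1, 1, -1, -1, -1]
model = adaboost(x, y, rounds=3)
```

On this toy set three rounds are enough to classify every training point; noisy labels are exactly the case where this unregularized form tends to struggle, which is the motivation the abstract gives for soft-margin variants.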
NASA Astrophysics Data System (ADS)
Tahernezhad-Javazm, Farajollah; Azimirad, Vahid; Shoaran, Maryam
2018-04-01
Objective. Considering the importance and the near-future development of noninvasive brain-machine interface (BMI) systems, this paper presents a comprehensive theoretical-experimental survey on the classification and evolutionary methods for BMI-based systems in which EEG signals are used. Approach. The paper is divided into two main parts. In the first part, a wide range of different types of the base and combinatorial classifiers including boosting and bagging classifiers and evolutionary algorithms are reviewed and investigated. In the second part, these classifiers and evolutionary algorithms are assessed and compared based on two types of relatively widely used BMI systems, sensory motor rhythm-BMI and event-related potentials-BMI. Moreover, in the second part, some of the improved evolutionary algorithms as well as bi-objective algorithms are experimentally assessed and compared. Main results. In this study two databases are used, and cross-validation accuracy (CVA) and stability to data volume (SDV) are considered as the evaluation criteria for the classifiers. According to the experimental results on both databases, regarding the base classifiers, linear discriminant analysis and support vector machines with respect to CVA evaluation metric, and naive Bayes with respect to SDV demonstrated the best performances. Among the combinatorial classifiers, four classifiers, Bagg-DT (bagging decision tree), LogitBoost, and GentleBoost with respect to CVA, and Bagging-LR (bagging logistic regression) and AdaBoost (adaptive boosting) with respect to SDV had the best performances. Finally, regarding the evolutionary algorithms, single-objective invasive weed optimization (IWO) and bi-objective nondominated sorting IWO algorithms demonstrated the best performances. Significance. 
We present a general survey on the base and the combinatorial classification methods for EEG signals (sensory motor rhythm and event-related potentials) as well as their optimization methods through the evolutionary algorithms. In addition, experimental and statistical significance tests are carried out to study the applicability and effectiveness of the reviewed methods.
Boosting structured additive quantile regression for longitudinal childhood obesity data.
Fenske, Nora; Fahrmeir, Ludwig; Hothorn, Torsten; Rzehak, Peter; Höhle, Michael
2013-07-25
Childhood obesity and the investigation of its risk factors has become an important public health issue. Our work is based on and motivated by a German longitudinal study including 2,226 children with up to ten measurements on their body mass index (BMI) and risk factors from birth to the age of 10 years. We introduce boosting of structured additive quantile regression as a novel distribution-free approach for longitudinal quantile regression. The quantile-specific predictors of our model include conventional linear population effects, smooth nonlinear functional effects, varying-coefficient terms, and individual-specific effects, such as intercepts and slopes. Estimation is based on boosting, a computer intensive inference method for highly complex models. We propose a component-wise functional gradient descent boosting algorithm that allows for penalized estimation of the large variety of different effects, particularly leading to individual-specific effects shrunken toward zero. This concept allows us to flexibly estimate the nonlinear age curves of upper quantiles of the BMI distribution, both on population and on individual-specific level, adjusted for further risk factors and to detect age-varying effects of categorical risk factors. Our model approach can be regarded as the quantile regression analog of Gaussian additive mixed models (or structured additive mean regression models), and we compare both model classes with respect to our obesity data.
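A component-wise functional gradient descent step for quantile regression, the estimation idea named in this abstract, can be sketched as follows. This is a simplified illustration under stated assumptions: the base learners are unpenalized straight lines (the paper uses penalized effects of several kinds), the data are invented, and the step length `nu` and number of steps are arbitrary. The loss is the standard check (pinball) loss for quantile level `tau`.

```python
def pinball_loss(y, f, tau):
    """Mean check (pinball) loss of predictions f at quantile level tau."""
    return sum((tau if yi > fi else tau - 1.0) * (yi - fi)
               for yi, fi in zip(y, f)) / len(y)

def negative_gradient(y, f, tau):
    """Negative (sub)gradient of the check loss with respect to f."""
    return [tau if yi > fi else tau - 1.0 for yi, fi in zip(y, f)]

def least_squares_line(x, u):
    """Intercept and slope of the least-squares line of u on x."""
    mx, mu = sum(x) / len(x), sum(u) / len(u)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (ui - mu) for xi, ui in zip(x, u)) / sxx
    return mu - slope * mx, slope

def boost_quantile(columns, y, tau, steps=300, nu=0.1):
    """Component-wise boosting: each step updates only the best-fitting predictor."""
    n = len(y)
    offset = sorted(y)[n // 2]          # start from (roughly) the median
    f = [offset] * n
    slopes = [0.0] * len(columns)
    for _ in range(steps):
        u = negative_gradient(y, f, tau)
        best = None
        for j, x in enumerate(columns):
            a, b = least_squares_line(x, u)
            rss = sum((ui - a - b * xi) ** 2 for xi, ui in zip(x, u))
            if best is None or rss < best[0]:
                best = (rss, j, a, b)
        _, j, a, b = best
        slopes[j] += nu * b
        f = [fi + nu * (a + b * xi) for fi, xi in zip(f, columns[j])]
    return slopes, f

# Invented data: the spread of y grows with x1; x2 is an unrelated pattern.
x1 = [i / 10.0 for i in range(30)]
x2 = [float((2 * i) % 5) for i in range(30)]
y = [x1[i] * (2.0 + 0.3 * ((i % 3) - 1)) for i in range(30)]

tau = 0.9
start = [sorted(y)[len(y) // 2]] * len(y)
initial_loss = pinball_loss(y, start, tau)
slopes, fitted = boost_quantile([x1, x2], y, tau)
final_loss = pinball_loss(y, fitted, tau)
```

Because only one component is updated per step, predictors that never give the best fit to the gradient keep a zero coefficient, which is how this style of boosting performs variable selection; stopping early plays the role of shrinkage.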
Detection of chewing from piezoelectric film sensor signals using ensemble classifiers.
Farooq, Muhammad; Sazonov, Edward
2016-08-01
The selection and use of pattern recognition algorithms is application dependent. In this work, we explored the use of several ensembles of weak classifiers to classify signals captured from a wearable sensor system to detect food intake based on chewing. Three sensor signals (piezoelectric sensor, accelerometer, and hand-to-mouth gesture) were collected from 12 subjects in free-living conditions for 24 hours. Sensor signals were divided into 10-second epochs, and for each epoch a combination of time and frequency domain features was computed. We present a comparison of three different ensemble techniques: boosting (AdaBoost), bootstrap aggregation (bagging) and stacking, each trained with 3 different weak classifiers (Decision Trees, Linear Discriminant Analysis (LDA) and Logistic Regression). The type of feature normalization used can also impact the classification results. For each ensemble method, three feature normalization techniques (no normalization, z-score normalization, and min-max normalization) were tested. A 12-fold cross-validation scheme was used to evaluate the performance of each model in terms of precision, recall, and accuracy. The best results achieved here show an improvement of about 4% over our previous algorithms.
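The two feature normalizations and the three evaluation measures mentioned in this abstract have standard definitions, sketched below on invented numbers. The function names are assumptions for the illustration, and the z-score uses the population standard deviation; the paper does not say which variant was used.

```python
from statistics import mean, pstdev

def z_score(values):
    """Centre on the mean and scale by the (population) standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def min_max(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def precision_recall_accuracy(y_true, y_pred):
    """Binary metrics with 1 as the positive (e.g. chewing) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp / (tp + fp), tp / (tp + fn), (tp + tn) / len(pairs)

feature = [2, 4, 4, 4, 5, 5, 7, 9]            # invented feature values
z = z_score(feature)
scaled = min_max(feature)
precision, recall, accuracy = precision_recall_accuracy(
    [1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0])   # invented labels and predictions
```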
Dipnall, Joanna F; Pasco, Julie A; Berk, Michael; Williams, Lana J; Dodd, Seetal; Jacka, Felice N; Meyer, Denny
2016-01-01
Depression is commonly comorbid with many other somatic diseases and symptoms. Identification of individuals in clusters with comorbid symptoms may reveal new pathophysiological mechanisms and treatment targets. The aim of this research was to combine machine-learning (ML) algorithms with traditional regression techniques by utilising self-reported medical symptoms to identify and describe clusters of individuals with increased rates of depression from a large cross-sectional community based population epidemiological study. A multi-staged methodology utilising ML and traditional statistical techniques was performed using the community based population National Health and Nutrition Examination Study (2009-2010) (N = 3,922). A Self-organised Mapping (SOM) ML algorithm, combined with hierarchical clustering, was performed to create participant clusters based on 68 medical symptoms. Binary logistic regression, controlling for sociodemographic confounders, was then used to identify the key clusters of participants with higher levels of depression (PHQ-9≥10, n = 377). Finally, a Multiple Additive Regression Tree boosted ML algorithm was run to identify the important medical symptoms for each key cluster within 17 broad categories: heart, liver, thyroid, respiratory, diabetes, arthritis, fractures and osteoporosis, skeletal pain, blood pressure, blood transfusion, cholesterol, vision, hearing, psoriasis, weight, bowels and urinary. Five clusters of participants, based on medical symptoms, were identified to have significantly increased rates of depression compared to the cluster with the lowest rate: odds ratios ranged from 2.24 (95% CI 1.56, 3.24) to 6.33 (95% CI 1.67, 24.02). The ML boosted regression algorithm identified three key medical condition categories as being significantly more common in these clusters: bowel, pain and urinary symptoms. Bowel-related symptoms were found to dominate the relative importance of symptoms within the five key clusters.
This methodology shows promise for the identification of conditions in general populations and supports the current focus on the potential importance of bowel symptoms and the gut in mental health research.
Identifying pollution sources and predicting urban air quality using ensemble learning methods
NASA Astrophysics Data System (ADS)
Singh, Kunwar P.; Gupta, Shikha; Rai, Premanjali
2013-12-01
In this study, principal components analysis (PCA) was performed to identify air pollution sources and tree based ensemble learning models were constructed to predict the urban air quality of Lucknow (India) using the air quality and meteorological databases pertaining to a period of five years. PCA identified vehicular emissions and fuel combustion as major air pollution sources. The air quality indices showed that air quality was unhealthy during the summer and winter. Ensemble models were constructed to discriminate between the seasonal air qualities, to identify the factors responsible for the discrimination, and to predict the air quality indices. Accordingly, single decision tree (SDT), decision tree forest (DTF), and decision treeboost (DTB) models were constructed and their generalization and predictive performance was evaluated in terms of several statistical parameters and compared with a conventional machine learning benchmark, support vector machines (SVM). The DT and SVM models discriminated the seasonal air quality with misclassification rates (MR) of 8.32% (SDT), 4.12% (DTF), 5.62% (DTB), and 6.18% (SVM), respectively, in complete data. The AQI and CAQI regression models yielded a correlation between measured and predicted values and root mean squared error of 0.901, 6.67 and 0.825, 9.45 (SDT); 0.951, 4.85 and 0.922, 6.56 (DTF); 0.959, 4.38 and 0.929, 6.30 (DTB); 0.890, 7.00 and 0.836, 9.16 (SVR) in complete data. The DTF and DTB models outperformed the SVM both in classification and regression, which could be attributed to the incorporation of the bagging and boosting algorithms in these models. The proposed ensemble models successfully predicted the urban ambient air quality and can be used as effective tools for its management.
Cost-sensitive AdaBoost algorithm for ordinal regression based on extreme learning machine.
Riccardi, Annalisa; Fernández-Navarro, Francisco; Carloni, Sante
2014-10-01
In this paper, the well known stagewise additive modeling using a multiclass exponential (SAMME) boosting algorithm is extended to address problems where there exists a natural order in the targets using a cost-sensitive approach. The proposed ensemble model uses an extreme learning machine (ELM) model as a base classifier (with the Gaussian kernel and the additional regularization parameter). The closed form of the derived weighted least squares problem is provided, and it is employed to estimate analytically the parameters connecting the hidden layer to the output layer at each iteration of the boosting algorithm. Compared to the state-of-the-art boosting algorithms, in particular those using ELM as base classifier, the suggested technique does not require the generation of a new training dataset at each iteration. The adoption of the weighted least squares formulation of the problem has been presented as an unbiased and alternative approach to the already existing ELM boosting techniques. Moreover, the addition of a cost model for weighting the patterns, according to the order of the targets, enables the classifier to tackle ordinal regression problems further. The proposed method has been validated by an experimental study by comparing it with already existing ensemble methods and ELM techniques for ordinal regression, showing competitive results.
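For orientation, the classifier weight used by the standard SAMME algorithm that this abstract extends can be written in a few lines. The weight formula is the published SAMME one; the absolute-difference cost matrix is only a common, hypothetical choice for ordinal targets and is not taken from the paper, whose cost model is not given in the abstract.

```python
import math

def samme_alpha(weighted_error, n_classes):
    """SAMME weight of a base classifier with the given weighted error.

    The extra log(K - 1) term keeps the weight positive whenever the
    classifier beats random guessing among K classes (error < 1 - 1/K).
    """
    return (math.log((1.0 - weighted_error) / weighted_error)
            + math.log(n_classes - 1.0))

def absolute_cost_matrix(n_classes):
    """Hypothetical ordinal cost: predicting class j for true class i costs |i - j|."""
    return [[abs(i - j) for j in range(n_classes)] for i in range(n_classes)]

alpha_random = samme_alpha(2.0 / 3.0, 3)   # random guessing with 3 classes
alpha_weak = samme_alpha(0.6, 3)           # worse than 50% but still useful
costs = absolute_cost_matrix(3)
```

With two classes the log(K - 1) term vanishes and the expression reduces to the usual AdaBoost weight up to a constant factor.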
van Wilgen, Nicola J; Richardson, David M
2012-04-01
We developed a method to predict the potential of non-native reptiles and amphibians (herpetofauna) to establish populations. This method may inform efforts to prevent the introduction of invasive non-native species. We used boosted regression trees to determine whether nine variables influence establishment success of introduced herpetofauna in California and Florida. We used an independent data set to assess model performance. Propagule pressure was the variable most strongly associated with establishment success. Species with short juvenile periods and species with phylogenetically more distant relatives in regional biotas were more likely to establish than species that start breeding later and those that have close relatives. Average climate match (the similarity of climate between native and non-native range) and life form were also important. Frogs and lizards were the taxonomic groups most likely to establish, whereas a much lower proportion of snakes and turtles established. We used results from our best model to compile a spreadsheet-based model for easy use and interpretation. Probability scores obtained from the spreadsheet model were strongly correlated with establishment success as were probabilities predicted for independent data by the boosted regression tree model. However, the error rate for predictions made with independent data was much higher than with cross validation using training data. This difference in predictive power does not preclude use of the model to assess the probability of establishment of herpetofauna because (1) the independent data had no information for two variables (meaning the full predictive capacity of the model could not be realized) and (2) the model structure is consistent with the recent literature on the primary determinants of establishment success for herpetofauna. 
It may still be difficult to predict the establishment probability of poorly studied taxa, but it is clear that non-native species (especially lizards and frogs) that mature early and come from environments similar to that of the introduction region have the highest probability of establishment. ©2012 Society for Conservation Biology.
Multivariate Boosting for Integrative Analysis of High-Dimensional Cancer Genomic Data
Xiong, Lie; Kuan, Pei-Fen; Tian, Jianan; Keles, Sunduz; Wang, Sijian
2015-01-01
In this paper, we propose a novel multivariate component-wise boosting method for fitting multivariate response regression models under the high-dimension, low sample size setting. Our method is motivated by modeling the association among different biological molecules based on multiple types of high-dimensional genomic data. In particular, we are interested in two applications: studying the influence of DNA copy number alterations on RNA transcript levels and investigating the association between DNA methylation and gene expression. For this purpose, we model the dependence of the RNA expression levels on DNA copy number alterations and the dependence of gene expression on DNA methylation through multivariate regression models and utilize a boosting-type method to handle the high dimensionality as well as to model the possible nonlinear associations. The performance of the proposed method is demonstrated through simulation studies. Finally, our multivariate boosting method is applied to two breast cancer studies. PMID:26609213
NASA Astrophysics Data System (ADS)
Dozier, J.; Bair, N.; Calfa, A. A.; Skalka, C.; Tolle, K.; Bongard, J.
2015-12-01
The task is to produce spatiotemporally distributed estimates of snow water equivalent (SWE) in snow-dominated mountain environments, including those that lack on-the-ground measurements such as the Hindu Kush range in Afghanistan. During the snow season, we can use two measurements: (1) passive microwave estimates of SWE, which generally underestimate in the mountains; (2) fractional snow-covered area from MODIS. Once the snow has melted, we can reconstruct the accumulated SWE back to the last significant snowfall by calculating the energy used in melt. The reconstructed SWE values provide a training set for predictions from the passive microwave SWE and snow-covered area. We examine several machine learning methods (regression-boosted decision trees, bagged trees, neural networks, and genetic programming) to estimate SWE. All methods work reasonably well, with R2 values greater than 0.8. Predictors built with multiple years of data reduce the bias that usually appears if we predict one year from just one other year's training set. Genetic programming tends to produce results that additionally provide physical insight. Adding precipitation estimates from the Global Precipitation Measurements mission is in progress.
Selkowitz, D.J.
2010-01-01
Shrub cover appears to be increasing across many areas of the Arctic tundra biome, and increasing shrub cover in the Arctic has the potential to significantly impact global carbon budgets and the global climate system. For most of the Arctic, however, there is no existing baseline inventory of shrub canopy cover, as existing maps of Arctic vegetation provide little information about the density of shrub cover at a moderate spatial resolution across the region. Remotely-sensed fractional shrub canopy maps can provide this necessary baseline inventory of shrub cover. In this study, we compare the accuracy of fractional shrub canopy (> 0.5 m tall) maps derived from multi-spectral, multi-angular, and multi-temporal datasets from Landsat imagery at 30 m spatial resolution, Moderate Resolution Imaging SpectroRadiometer (MODIS) imagery at 250 m and 500 m spatial resolution, and MultiAngle Imaging Spectroradiometer (MISR) imagery at 275 m spatial resolution for a 1067 km² study area in Arctic Alaska. The study area is centered at 69°N, ranges in elevation from 130 to 770 m, is composed primarily of rolling topography with gentle slopes less than 10°, and is free of glaciers and perennial snow cover. Shrubs > 0.5 m in height cover 2.9% of the study area and are primarily confined to patches associated with specific landscape features. Reference fractional shrub canopy is determined from in situ shrub canopy measurements and a high spatial resolution IKONOS image swath. Regression tree models are constructed to estimate fractional canopy cover at 250 m using different combinations of input data from Landsat, MODIS, and MISR. Results indicate that multi-spectral data provide substantially more accurate estimates of fractional shrub canopy cover than multi-angular or multi-temporal data.
Higher spatial resolution datasets also provide more accurate estimates of fractional shrub canopy cover (aggregated to moderate spatial resolutions) than lower spatial resolution datasets, an expected result for a study area where most shrub cover is concentrated in narrow patches associated with rivers, drainages, and slopes. Including the middle infrared bands available from Landsat and MODIS in the regression tree models (in addition to the four standard visible and near-infrared spectral bands) typically results in a slight boost in accuracy. Including the multi-angular red band data available from MISR in the regression tree models, however, typically boosts accuracy more substantially, resulting in moderate resolution fractional shrub canopy estimates approaching the accuracy of estimates derived from the much higher spatial resolution Landsat sensor. Given the poor availability of snow and cloud-free Landsat scenes in many areas of the Arctic and the promising results demonstrated here by the MISR sensor, MISR may be the best choice for large area fractional shrub canopy mapping in the Alaskan Arctic for the period 2000-2009.
Toward Predicting Social Support Needs in Online Health Social Networks.
Choi, Min-Je; Kim, Sung-Hee; Lee, Sukwon; Kwon, Bum Chul; Yi, Ji Soo; Choo, Jaegul; Huh, Jina
2017-08-02
While online health social networks (OHSNs) serve as an effective platform for patients to fulfill their various social support needs, predicting the needs of users and providing tailored information remains a challenge. The objective of this study was to discriminate important features for identifying users' social support needs based on knowledge gathered from survey data. This study also provides guidelines for a technical framework, which can be used to predict users' social support needs based on raw data collected from OHSNs. We initially conducted a Web-based survey with 184 OHSN users. From this survey data, we extracted 34 features based on 5 categories: (1) demographics, (2) reading behavior, (3) posting behavior, (4) perceived roles in OHSNs, and (5) values sought in OHSNs. Features from the first 4 categories were used as variables for binary classification. For the prediction outcomes, we used features from the last category: the needs for emotional support, experience-based information, unconventional information, and medical facts. We compared 5 binary classifier algorithms: gradient boosting tree, random forest, decision tree, support vector machines, and logistic regression. We then calculated the scores of the area under the receiver operating characteristic (ROC) curve (AUC) to understand the comparative effectiveness of the used features. The best performance was AUC scores of 0.89 for predicting users seeking emotional support, 0.86 for experience-based information, 0.80 for unconventional information, and 0.83 for medical facts. With the gradient boosting tree as our best performing model, we analyzed the strength of individual features in predicting one's social support need. Among other discoveries, we found that users seeking emotional support tend to post more in OHSNs compared with others. We developed an initial framework for automatically predicting social support needs in OHSNs using survey data. 
Future work should involve nonsurvey data to evaluate the feasibility of the framework. Our study contributes to providing personalized social support in OHSNs. ©Min-Je Choi, Sung-Hee Kim, Sukwon Lee, Bum Chul Kwon, Ji Soo Yi, Jaegul Choo, Jina Huh. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 02.08.2017.
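The AUC scores used to compare the five classifiers in this abstract can be computed without plotting a ROC curve, since the AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below implements that pairwise definition directly; the labels and scores are invented and the function name is an assumption.

```python
def roc_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))

# Invented example: one of the four positive/negative pairs is misordered.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise loop is quadratic in the number of cases, which is fine for a survey of 184 users; larger data sets would normally use a rank-based formula instead.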
Moisen, Gretchen G.; Freeman, E.A.; Blackard, J.A.; Frescino, T.S.; Zimmermann, N.E.; Edwards, T.C.
2006-01-01
Many efforts are underway to produce broad-scale forest attribute maps by modelling forest class and structure variables collected in forest inventories as functions of satellite-based and biophysical information. Typically, variants of classification and regression trees implemented in Rulequest's See5 and Cubist (for binary and continuous responses, respectively) are the tools of choice in many of these applications. These tools are widely used in large remote sensing applications, but are not easily interpretable, do not have ties with survey estimation methods, and use proprietary unpublished algorithms. Consequently, three alternative modelling techniques were compared for mapping presence and basal area of 13 species located in the mountain ranges of Utah, USA. The modelling techniques compared included the widely used See5/Cubist, generalized additive models (GAMs), and stochastic gradient boosting (SGB). Model performance was evaluated using independent test data sets. Evaluation criteria for mapping species presence included specificity, sensitivity, Kappa, and area under the curve (AUC). Evaluation criteria for the continuous basal area variables included correlation and relative mean squared error. For predicting species presence (setting thresholds to maximize Kappa), SGB had higher values for the majority of the species for specificity and Kappa, while GAMs had higher values for the majority of the species for sensitivity. In evaluating resultant AUC values, GAM and/or SGB models had significantly better results than the See5 models where significant differences could be detected between models. For nine out of 13 species, basal area prediction results for all modelling techniques were poor (correlations less than 0.5 and relative mean squared errors greater than 0.8), but SGB provided the most stable predictions in these instances. 
SGB and Cubist performed equally well for modelling basal area for three species with moderate prediction success, while all three modelling tools produced comparably good predictions (correlation of 0.68 and relative mean squared error of 0.56) for one species. © 2006 Elsevier B.V. All rights reserved.
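The presence/absence criteria named in this abstract (sensitivity, specificity, Kappa, and a threshold chosen to maximize Kappa) can be illustrated with a short sketch. It assumes binary labels and a predicted probability of presence per plot; the data and all function names are hypothetical and are not taken from the paper.

```python
def confusion(labels, scores, threshold):
    """Count true/false positives and negatives at a given score threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def sensitivity(tp, fp, tn, fn):
    return tp / (tp + fn)


def specificity(tp, fp, tn, fn):
    return tn / (tn + fp)


def cohen_kappa(tp, fp, tn, fn):
    """Agreement between prediction and truth, corrected for chance agreement."""
    n = tp + fp + tn + fn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    if expected == 1.0:
        return 0.0
    return (observed - expected) / (1.0 - expected)


def best_kappa_threshold(labels, scores):
    """Try each distinct score as a threshold and keep the one with highest Kappa."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: cohen_kappa(*confusion(labels, scores, t)))


# Hypothetical plots: 3 with the species present, 5 without.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05]
threshold = best_kappa_threshold(labels, scores)
counts = confusion(labels, scores, threshold)
kappa = cohen_kappa(*counts)
```

Because the threshold is tuned on the same data it is scored on in this toy example, the resulting Kappa is optimistic; the study evaluated models on independent test data sets.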
NASA Astrophysics Data System (ADS)
Basye, Austin T.
A matrix element method analysis of the Standard Model Higgs boson, produced in association with two top quarks decaying to the lepton-plus-jets channel, is presented. Based on 20.3 fb⁻¹ of √s = 8 TeV data, produced at the Large Hadron Collider and collected by the ATLAS detector, this analysis utilizes multiple advanced techniques to search for ttH signatures with a 125 GeV Higgs boson decaying to two b-quarks. After categorizing selected events based on their jet and b-tag multiplicities, signal rich regions are analyzed using the matrix element method. Resulting variables are then propagated to two parallel multivariate analyses utilizing Neural Networks and Boosted Decision Trees respectively. As no significant excess is found, an observed (expected) limit of 3.4 (2.2) times the Standard Model cross-section is determined at 95% confidence, using the CLs method, for the Neural Network analysis. For the Boosted Decision Tree analysis, an observed (expected) limit of 5.2 (2.7) times the Standard Model cross-section is determined at 95% confidence, using the CLs method. Corresponding unconstrained fits of the Higgs boson signal strength to the observed data result in a ratio of the measured signal cross-section to the Standard Model cross-section prediction of μ = 1.2 ± 1.3 (total) ± 0.7 (stat.) for the Neural Network analysis, and μ = 2.9 ± 1.4 (total) ± 0.8 (stat.) for the Boosted Decision Tree analysis.
Optical diagnosis of cervical cancer by higher order spectra and boosting
NASA Astrophysics Data System (ADS)
Pratiher, Sawon; Mukhopadhyay, Sabyasachi; Barman, Ritwik; Pratiher, Souvik; Pradhan, Asima; Ghosh, Nirmalya; Panigrahi, Prasanta K.
2017-03-01
In this contribution, we report the application of higher order statistical moments, using decision tree and ensemble based learning methodology, to the development of diagnostic algorithms for the optical diagnosis of cancer. The classification results were compared to those obtained with independent feature extractors such as linear discriminant analysis (LDA). Classification with boosting on higher order statistics achieves higher specificity and sensitivity while being much faster than other time-frequency domain based methods.
NASA Astrophysics Data System (ADS)
Johnson, Nicholas E.; Bonczak, Bartosz; Kontokosta, Constantine E.
2018-07-01
The increased availability and improved quality of new sensing technologies have catalyzed a growing body of research to evaluate and leverage these tools in order to quantify and describe urban environments. Air quality, in particular, has received greater attention because of the well-established links to serious respiratory illnesses and the unprecedented levels of air pollution in developed and developing countries and cities around the world. Though numerous laboratory and field evaluation studies have begun to explore the use and potential of low-cost air quality monitoring devices, the performance and stability of these tools has not been adequately evaluated in complex urban environments, and further research is needed. In this study, we present the design of a low-cost air quality monitoring platform based on the Shinyei PPD42 aerosol monitor and examine the suitability of the sensor for deployment in a dense heterogeneous urban environment. We assess the sensor's performance during a field calibration campaign from February 7th to March 25th 2017 with a reference instrument in New York City, and present a novel calibration approach using a machine learning method that incorporates publicly available meteorological data in order to improve overall sensor performance. We find that while the PPD42 performs well in relation to the reference instrument using linear regression (R2 = 0.36-0.51), a gradient boosting regression tree model can significantly improve device calibration (R2 = 0.68-0.76). We discuss the sensor's performance and reliability when deployed in a dense, heterogeneous urban environment during a period of significant variation in weather conditions, and important considerations when using machine learning techniques to improve the performance of low-cost air quality monitors.
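To make the contrast between a linear calibration and a gradient boosting regression tree calibration concrete, here is a minimal sketch in plain Python. It assumes squared-error loss, depth-one trees (stumps), a single predictor, and a synthetic threshold-shaped response. The data, the learning rate, and every function name are illustrative assumptions; the sketch does not reproduce the study's model, its meteorological covariates, or its R² values.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx


def fit_stump(x, residuals):
    """Find the split on x that minimises squared error of the residuals."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0
        left = [r for a, r in zip(x, residuals) if a <= thr]
        right = [r for a, r in zip(x, residuals) if a > thr]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lv, rv)
    return best[1], best[2], best[3]


def boost(x, y, n_rounds=20, learning_rate=0.5):
    """Gradient boosting for squared error: each stump fits the current residuals."""
    pred = [sum(y) / len(y)] * len(y)  # start from the mean
    for _ in range(n_rounds):
        residuals = [t - p for t, p in zip(y, pred)]
        thr, lv, rv = fit_stump(x, residuals)
        pred = [p + learning_rate * (lv if a <= thr else rv) for a, p in zip(x, pred)]
    return pred


# Synthetic data: the "reference" value jumps once the raw signal reaches 10.
raw_signal = [float(i) for i in range(20)]
reference = [0.0 if a < 10 else 1.0 for a in raw_signal]

slope, intercept = fit_line(raw_signal, reference)
r2_linear = r_squared(reference, [slope * a + intercept for a in raw_signal])
r2_boosted = r_squared(reference, boost(raw_signal, reference))
```

On this toy response the straight line cannot follow the jump, while the boosted stumps can. Both scores are computed on the training data, so they say nothing about out-of-sample calibration quality.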
Zhang, Yiyan; Xin, Yi; Li, Qin; Ma, Jianshe; Li, Shuai; Lv, Xiaodan; Lv, Weiqi
2017-11-02
New data mining algorithms continue to emerge as related disciplines develop. These algorithms differ in their scope of application and in their performance. Hence, finding a suitable algorithm for a dataset is becoming an important concern for biomedical researchers who need to solve practical problems promptly. In this paper, seven mature and actively used algorithms, namely, C4.5, support vector machine, AdaBoost, k-nearest neighbor, naïve Bayes, random forest, and logistic regression, were selected as the research objects. The seven algorithms were applied to the 12 top-click UCI public datasets with the task of classification, and their performances were compared through induction and analysis. The sample size, number of attributes, number of missing values, sample size of each class, correlation coefficients between variables, class entropy of the task variable, and the ratio of the sample size of the largest class to that of the smallest class were calculated to characterize the 12 research datasets. The two ensemble algorithms reach high classification accuracy on most datasets. Moreover, random forest performs better than AdaBoost on the unbalanced dataset of the multi-class task. Simple algorithms, such as the naïve Bayes and logistic regression models, are suitable for a small dataset with high correlation between the task and other non-task attribute variables. K-nearest neighbor and C4.5 decision tree algorithms perform well on binary- and multi-class task datasets. Support vector machine is more adept on the balanced small dataset of the binary-class task. No algorithm can maintain the best performance in all datasets. The applicability of the seven data mining algorithms on datasets with different characteristics was summarized to provide a reference for biomedical researchers or beginners in different fields.
NASA Technical Reports Server (NTRS)
Barrett, K.; Kasischke, E. S.; McGuire, A. D.; Turetsky, M. R.; Kane, E. S.
2010-01-01
Biomass burning in the Alaskan interior is already a major disturbance and source of carbon emissions, and is likely to increase in response to the warming and drying predicted for the future climate. In addition to quantifying changes to the spatial and temporal patterns of burned areas, observing variations in severity is the key to studying the impact of changes to the fire regime on carbon cycling, energy budgets, and post-fire succession. Remote sensing indices of fire severity have not consistently been well-correlated with in situ observations of important severity characteristics in Alaskan black spruce stands, including depth of burning of the surface organic layer. The incorporation of ancillary data such as in situ observations and GIS layers with spectral data from Landsat TM/ETM+ greatly improved efforts to map the reduction of the organic layer in burned black spruce stands. Using a regression tree approach, the R² of the organic layer depth reduction models was 0.60 and 0.55 (p < 0.01) for relative and absolute depth reduction, respectively. All of the independent variables used by the regression tree to estimate burn depth can be obtained independently of field observations. Implementation of a gradient boosting algorithm improved the R² to 0.80 and 0.79 (p < 0.01) for absolute and relative organic layer depth reduction, respectively. Independent variables used in the regression tree model of burn depth included topographic position, remote sensing indices related to soil and vegetation characteristics, timing of the fire event, and meteorological data. Post-fire organic layer depth characteristics are determined for a large (>200,000 ha) fire to identify areas that are potentially vulnerable to a shift in post-fire succession. This application showed that 12% of this fire event experienced fire severe enough to support a change in post-fire succession. 
We conclude that non-parametric models and ancillary data are useful in the modeling of the surface organic layer fire depth. Because quantitative differences in post-fire surface characteristics do not directly influence spectral properties, these modeling techniques provide better information than the use of remote sensing data alone.
Brett A. Huggett; Paul G. Schaberg; Gary J. Hawley; Christopher Eager
2007-01-01
We surveyed and wounded forest-grown sugar maple (Acer saccharum Marsh.) trees in a long-term, replicated Ca manipulation study at the Hubbard Brook Experimental Forest in New Hampshire, USA. Plots received applications of Ca (to boost Ca availability above depleted ambient levels) or Al (to compete with Ca uptake and further reduce Ca availability...
NASA Astrophysics Data System (ADS)
Yang, Wei; Zhang, Su; Li, Wenying; Chen, Yaqing; Lu, Hongtao; Chen, Wufan; Chen, Yazhu
2010-04-01
Various computerized features extracted from breast ultrasound images are useful in assessing the malignancy of breast tumors. However, the underlying relationship between the computerized features and tumor malignancy may not be linear in nature. We use the decision tree ensemble trained by the cost-sensitive boosting algorithm to approximate the target function for malignancy assessment and to reflect this relationship qualitatively. Partial dependence plots are employed to explore and visualize the effect of features on the output of the decision tree ensemble. In the experiments, 31 image features are extracted to quantify the sonographic characteristics of breast tumors. Patient age is used as an external feature because of its high clinical importance. The area under the receiver-operating characteristic curve of the tree ensembles can reach 0.95 with sensitivity of 0.95 (61/64) at the associated specificity 0.74 (77/104). The partial dependence plots of the four most important features are demonstrated to show the influence of the features on malignancy, and they are in accord with the empirical observations. The results can provide visual and qualitative references on the computerized image features for physicians, and can be useful for enhancing the interpretability of computer-aided diagnosis systems for breast ultrasound.
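A partial dependence curve, as used in this abstract to visualize feature effects, can be computed by fixing one feature at each grid value for every case and averaging the model output. The sketch below assumes an arbitrary prediction function and uses a toy stand-in for the tree ensemble; `partial_dependence`, `toy_model`, and the data are hypothetical and unrelated to the ultrasound features in the study.

```python
def partial_dependence(predict, rows, feature_index, grid):
    """Average prediction over all rows with one feature forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)            # copy so the input data stays unchanged
            modified[feature_index] = value
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


def toy_model(row):
    """Stand-in for a fitted ensemble: nonlinear in feature 0, linear in feature 1."""
    return row[0] ** 2 + 3.0 * row[1]


rows = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
curve = partial_dependence(toy_model, rows, feature_index=0, grid=[0.0, 1.0, 2.0])
```

For this toy model the curve equals the squared grid value plus three times the mean of feature 1, which shows how the averaging isolates the shape of one feature's effect.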
ERIC Educational Resources Information Center
Ritchie, Meabh
2010-01-01
This article is an informal description of a forest school outdoor programme designed to boost emotional literacy, inclusion and attainment of secondary school pupils with a range of learning and behavioural difficulties. (Contains 2 online resources.)
Learning Instance-Specific Predictive Models
Visweswaran, Shyam; Cooper, Gregory F.
2013-01-01
This paper introduces a Bayesian algorithm for constructing predictive models from data that are optimized to predict a target variable well for a particular instance. This algorithm learns Markov blanket models, carries out Bayesian model averaging over a set of models to predict a target variable of the instance at hand, and employs an instance-specific heuristic to locate a set of suitable models to average over. We call this method the instance-specific Markov blanket (ISMB) algorithm. The ISMB algorithm was evaluated on 21 UCI data sets using five different performance measures and its performance was compared to that of several commonly used predictive algorithms, including naïve Bayes, C4.5 decision tree, logistic regression, neural networks, k-Nearest Neighbor, Lazy Bayesian Rules, and AdaBoost. Over all the data sets, the ISMB algorithm performed better on average on all performance measures against all the comparison algorithms. PMID:25045325
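The Bayesian model averaging step mentioned in this abstract can be sketched generically: each candidate model's prediction is weighted by its posterior probability, which is proportional to its prior times its marginal likelihood. This is a textbook illustration under assumed inputs, not the ISMB algorithm itself; the function name, the two models, and their evidence values are hypothetical.

```python
import math


def bma_predict(model_probs, log_evidence, log_prior=None):
    """Posterior-weighted average of per-model predicted probabilities.

    model_probs:  each model's predicted probability for the instance at hand
    log_evidence: log marginal likelihood of the training data under each model
    log_prior:    optional log prior per model (uniform if omitted)
    """
    if log_prior is None:
        log_prior = [0.0] * len(model_probs)
    log_post = [e + p for e, p in zip(log_evidence, log_prior)]
    shift = max(log_post)                          # subtract max for numerical stability
    unnormalised = [math.exp(v - shift) for v in log_post]
    total = sum(unnormalised)
    weights = [u / total for u in unnormalised]
    prediction = sum(w * p for w, p in zip(weights, model_probs))
    return prediction, weights


# Two hypothetical models; the second is three times as well supported by the data.
model_probs = [0.2, 0.6]
log_evidence = [math.log(1e-5), math.log(3e-5)]
prediction, weights = bma_predict(model_probs, log_evidence)
```

With a uniform prior the weights here come out near 0.25 and 0.75, so the averaged prediction sits closer to the better-supported model.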
NASA Astrophysics Data System (ADS)
Ghosh, S. M.; Behera, M. D.
2017-12-01
Forest aboveground biomass (AGB) is an important input to global policy decisions that address the impact of climate change. Several previous studies have concluded that remote sensing methods are more suitable for estimating forest biomass at regional scale. Among the available remote sensing data and methods, Synthetic Aperture Radar (SAR) data in combination with decision tree based machine learning algorithms have shown better promise in estimating higher biomass values. Few studies have estimated biomass for dense Indian tropical forests with high biomass density. In this study, aboveground biomass was estimated for two major tree species, Sal (Shorea robusta) and Teak (Tectona grandis), of Katerniaghat Wildlife Sanctuary, a tropical forest situated in northern India. Biomass was estimated by combining C-band SAR data from the Sentinel-1A satellite, vegetation indices produced using Sentinel-2A data, and ground inventory plots. Along with SAR backscatter values, SAR texture images were also used as input, as earlier studies had found that image texture correlates with vegetation biomass. Decision tree based nonlinear machine learning algorithms were used in place of parametric regression models to establish the relationship between field-measured values and remotely sensed parameters. A random forest model with a combination of vegetation indices and SAR backscatter as predictor variables gave the best result for Sal forest, with a coefficient of determination of 0.71 and an RMSE of 105.027 t/ha. For teak forest, the best result was also obtained with the same combination of predictors, but with a stochastic gradient boosted model, with a coefficient of determination of 0.6 and an RMSE of 79.45 t/ha. These results are mostly better than those of other studies of similar forests. 
This study shows that Sentinel series satellite data has exceptional capabilities in estimating dense forest AGB and machine learning algorithms are better means to do so than parametric regression models.
Comparing ensemble learning methods based on decision tree classifiers for protein fold recognition.
Bardsiri, Mahshid Khatibi; Eftekhari, Mahdi
2014-01-01
In this paper, some methods for ensemble learning of protein fold recognition based on a decision tree (DT) are compared and contrasted against each other over three datasets taken from the literature. According to previously reported studies, the features of the datasets are divided into some groups. Then, for each of these groups, three ensemble classifiers, namely, random forest, rotation forest and AdaBoost.M1 are employed. Also, some fusion methods are introduced for combining the ensemble classifiers obtained in the previous step. After this step, three classifiers are produced based on the combination of classifiers of types random forest, rotation forest and AdaBoost.M1. Finally, the three different classifiers achieved are combined to make an overall classifier. Experimental results show that the overall classifier obtained by the genetic algorithm (GA) weighting fusion method, is the best one in comparison to previously applied methods in terms of classification accuracy.
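The weighting fusion described in this abstract combines the outputs of several ensemble classifiers using per-classifier weights. A minimal sketch of weighted voting follows. The weights are fixed by hand here purely for illustration, whereas the paper obtains them with a genetic algorithm, and the class labels and function name are hypothetical.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight across classifiers."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Hypothetical fold labels predicted by three ensembles for one protein.
predictions = ["alpha", "beta", "beta"]   # e.g. random forest, rotation forest, AdaBoost.M1
weights = [0.6, 0.25, 0.15]               # assumed weights, not values from the paper
fused = weighted_vote(predictions, weights)
unweighted = weighted_vote(predictions, [1.0, 1.0, 1.0])
```

The example is built so that the weighted and unweighted votes disagree, which is the situation in which learning the weights matters.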
Cosmic string detection with tree-based machine learning
NASA Astrophysics Data System (ADS)
Vafaei Sadr, A.; Farhang, M.; Movahed, S. M. S.; Bassett, B.; Kunz, M.
2018-07-01
We explore the use of random forest and gradient boosting, two powerful tree-based machine learning algorithms, for the detection of cosmic strings in maps of the cosmic microwave background (CMB), through their unique Gott-Kaiser-Stebbins effect on the temperature anisotropies. The information in the maps is compressed into feature vectors before being passed to the learning units. The feature vectors contain various statistical measures of the processed CMB maps that boost cosmic string detectability. Our proposed classifiers, after training, give results similar to or better than claimed detectability levels from other methods for string tension, Gμ. They can make 3σ detection of strings with Gμ ≳ 2.1 × 10⁻¹⁰ for noise-free, 0.9′-resolution CMB observations. The minimum detectable tension increases to Gμ ≳ 3.0 × 10⁻⁸ for a more realistic, CMB S4-like (II) strategy, improving over previous results.
Cosmic String Detection with Tree-Based Machine Learning
NASA Astrophysics Data System (ADS)
Vafaei Sadr, A.; Farhang, M.; Movahed, S. M. S.; Bassett, B.; Kunz, M.
2018-05-01
We explore the use of random forest and gradient boosting, two powerful tree-based machine learning algorithms, for the detection of cosmic strings in maps of the cosmic microwave background (CMB), through their unique Gott-Kaiser-Stebbins effect on the temperature anisotropies. The information in the maps is compressed into feature vectors before being passed to the learning units. The feature vectors contain various statistical measures of the processed CMB maps that boost cosmic string detectability. Our proposed classifiers, after training, give results similar to or better than claimed detectability levels from other methods for string tension, Gμ. They can make 3σ detection of strings with Gμ ≳ 2.1 × 10⁻¹⁰ for noise-free, 0.9′-resolution CMB observations. The minimum detectable tension increases to Gμ ≳ 3.0 × 10⁻⁸ for a more realistic, CMB S4-like (II) strategy, improving over previous results.
Fraccaro, Paolo; Nicolo, Massimo; Bonetto, Monica; Giacomini, Mauro; Weller, Peter; Traverso, Carlo Enrico; Prosperi, Mattia; O'Sullivan, Dympna
2015-01-27
To investigate machine learning methods, ranging from simpler interpretable techniques to complex (non-linear) "black-box" approaches, for automated diagnosis of Age-related Macular Degeneration (AMD). Data from healthy subjects and patients diagnosed with AMD or other retinal diseases were collected during routine visits via an Electronic Health Record (EHR) system. Patients' attributes included demographics and, for each eye, presence/absence of major AMD-related clinical signs (soft drusen, retinal pigment epithelium, defects/pigment mottling, depigmentation area, subretinal haemorrhage, subretinal fluid, macula thickness, macular scar, subretinal fibrosis). Interpretable techniques known as white box methods, including logistic regression and decision trees, as well as less interpretable techniques known as black box methods, such as support vector machines (SVM), random forests and AdaBoost, were used to develop models (trained and validated on unseen data) to diagnose AMD. The gold standard was confirmed diagnosis of AMD by physicians. Sensitivity, specificity and area under the receiver operating characteristic curve (AUC) were used to assess performance. Study population included 487 patients (912 eyes). In terms of AUC, random forests, logistic regression and AdaBoost showed a mean performance of 0.92, followed by SVM and decision trees (0.90). All machine learning models identified soft drusen and age as the most discriminating variables in clinicians' decision pathways to diagnose AMD. Both black-box and white box methods performed well in identifying diagnoses of AMD and their decision pathways. Machine learning models developed through the proposed approach, relying on clinical signs identified by retinal specialists, could be embedded into EHR to provide physicians with real time (interpretable) support.
An efficient ensemble learning method for gene microarray classification.
Osareh, Alireza; Shadgar, Bita
2013-01-01
Gene microarray analysis and classification have proven to be an effective approach to the diagnosis of diseases and cancers. However, it has also been shown that basic classification techniques have intrinsic drawbacks in achieving accurate gene classification and cancer diagnosis. On the other hand, classifier ensembles have received increasing attention in various applications. Here, we address the gene classification issue using the RotBoost ensemble methodology. This method is a combination of the Rotation Forest and AdaBoost techniques, which in turn preserves both desirable features of an ensemble architecture, that is, accuracy and diversity. To select a concise subset of informative genes, 5 different feature selection algorithms are considered. To assess the efficiency of RotBoost, other nonensemble/ensemble techniques including Decision Trees, Support Vector Machines, Rotation Forest, AdaBoost, and Bagging are also deployed. Experimental results have revealed that the combination of the fast correlation-based feature selection method with the ICA-based RotBoost ensemble is highly effective for gene classification. In fact, the proposed method can create ensemble classifiers which outperform not only the classifiers produced by conventional machine learning but also the classifiers generated by two widely used conventional ensemble learning methods, that is, Bagging and AdaBoost.
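For readers unfamiliar with the AdaBoost component named in this abstract, the sketch below shows one reweighting round of the standard discrete AdaBoost scheme for binary labels: misclassified samples gain weight so the next weak learner concentrates on them. It is a generic illustration under assumed inputs; it does not implement RotBoost or the rotation step, and the function name and sample data are hypothetical.

```python
import math


def adaboost_round(weights, correct):
    """One discrete AdaBoost update for binary classification.

    weights: current sample weights, summing to 1
    correct: booleans, True where the weak learner classified the sample correctly
    Returns the learner's vote weight (alpha) and the renormalised sample weights.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if not 0.0 < error < 0.5:
        raise ValueError("weighted error must lie strictly between 0 and 0.5")
    alpha = 0.5 * math.log((1.0 - error) / error)
    updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Five samples with uniform weights; the weak learner gets the third one wrong.
weights = [0.2] * 5
correct = [True, True, False, True, True]
alpha, new_weights = adaboost_round(weights, correct)
```

After this round the single misclassified sample carries half of the total weight, and the four correctly classified samples share the other half equally.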
NASA Astrophysics Data System (ADS)
Pinkerton, Matt H.; Smith, Adam N. H.; Raymond, Ben; Hosie, Graham W.; Sharp, Ben; Leathwick, John R.; Bradford-Grieve, Janet M.
2010-04-01
We applied a multivariate statistical modelling technique called boosted regression trees to derive relationships between environmental conditions and the distribution of the adult stage of the cyclopoid copepod Oithona similis in the Southern Ocean. Nearly 20 000 samples from the Southern Ocean Continuous Plankton Recorder survey (87% from East Antarctica) were used to model the probability of detection (presence) and relative abundance of adults of this zooplankton species in surface waters. We demonstrate that it is possible to obtain reasonable models for both the presence (area under the Receiver Operating Characteristic curve of 0.77) and relative abundance (28-35% variance explained) of adult O. similis between November and March in much of the Southern Ocean. No investigation was possible where the environmental characteristics were not well represented by the SO-CPR dataset, namely, the Argentine shelf, Weddell Sea, and the frontal region north of the Amundsen Sea, or under sea-ice. Our analyses support the hypothesis that adult O. similis abundance is related to environmental conditions in a broadly similar way throughout the Southern Ocean. Compared to a compilation of net-haul data from the literature, the abundance model explained 34% of the variance in surface concentrations of adult stages of this species, and 23-59% of the variance in depth-integrated abundance of copepodite and adult stages combined. The models show higher occurrence and elevated abundances in a broad circumpolar band between the Antarctic Polar Front and the southern boundary of the Antarctic Circumpolar Current (approximately 54-64°S). Evidence of diel vertical migration by adults of this species north of 65°S was found, with surface abundances 20% higher at night than during the day. There was no evidence of diel migration south of 65°S. Five potential "hotspots" of adult O. 
similis were identified: in the southern Scotia Sea, two areas off east Antarctica, in the frontal zone north of the Amundsen Sea, and a small area in the outer Bellingshausen Sea. We recommend that a database of all available net-haul data on Oithona similis in the Southern Ocean be created to facilitate further investigations on the circumpolar distribution of this species.
NASA Astrophysics Data System (ADS)
Shabani, Farzin; Kumar, Lalit; Solhjouy-fard, Samaneh
2017-08-01
The aim of this study was to compare and evaluate the capabilities of correlative and mechanistic modeling processes, applied to the projection of future distributions of date palm in novel environments, and to establish a method of minimizing uncertainty in the projections of differing techniques. The study area comprises Middle Eastern countries. We compared the mechanistic model CLIMEX (CL) with the correlative models MaxEnt (MX), Boosted Regression Trees (BRT), and Random Forests (RF) to project current and future distributions of date palm (Phoenix dactylifera L.). The CSIRO-Mk3.0 (CS) Global Climate Model (GCM), run under the A2 emissions scenario, was selected for making projections. Both indigenous and alien distribution data of the species were utilized in the modeling process. The common areas predicted by MX, BRT, RF, and CL from the CS GCM were extracted and compared to ascertain projection uncertainty levels of each individual technique. The common areas identified by all four modeling techniques were used to produce a map indicating suitable and unsuitable areas for date palm cultivation for Middle Eastern countries, for the present and the year 2100. The four different modeling approaches predict fairly different distributions. Projections from CL were more conservative than from MX. The BRT and RF were the most conservative methods in terms of projections for the current time. The combination of the final CL and MX projections for the present and 2100 provides higher certainty concerning those areas that will become highly suitable for future date palm cultivation. According to the four models, cold, hot, and wet stress, with differences on a regional basis, appear to be the major restrictions on future date palm distribution. The results demonstrate variances in the projections, resulting from different techniques. 
The assessment and interpretation of model projections requires caution, especially for correlative models such as MX, BRT, and RF. Intersections between different techniques may decrease uncertainty in future distribution projections. However, readers should bear in mind that the uncertainties arise mostly because future GHG emission scenarios cannot be known with sufficient precision. Suggestions towards methodology and processing for improving projections are included.
Buston, Peter M; Elith, Jane
2011-05-01
1. Central questions of behavioural and evolutionary ecology are what factors influence the reproductive success of dominant breeders and subordinate nonbreeders within animal societies? A complete understanding of any society requires that these questions be answered for all individuals. 2. The clown anemonefish, Amphiprion percula, forms simple societies that live in close association with sea anemones, Heteractis magnifica. Here, we use data from a well-studied population of A. percula to determine the major predictors of reproductive success of dominant pairs in this species. 3. We analyse the effect of multiple predictors on four components of reproductive success, using a relatively new technique from the field of statistical learning: boosted regression trees (BRTs). BRTs have the potential to model complex relationships in ways that give powerful insight. 4. We show that the reproductive success of dominant pairs is unrelated to the presence, number or phenotype of nonbreeders. This is consistent with the observation that nonbreeders do not help or hinder breeders in any way, confirming and extending the results of a previous study. 5. Primarily, reproductive success is negatively related to male growth and positively related to breeding experience. It is likely that these effects are interrelated because males that grow a lot have little breeding experience. These effects are indicative of a trade-off between male growth and parental investment. 6. Secondarily, reproductive success is positively related to female growth and size. In this population, female size is positively related to group size and anemone size, also. These positive correlations among traits likely are caused by variation in site quality and are suggestive of a silver-spoon effect. 7. Noteworthily, whereas reproductive success is positively related to female size, it is unrelated to male size. 
This observation provides support for the size advantage hypothesis for sex change: both individuals maximize their reproductive success when the larger individual adopts the female tactic. 8. This study provides the most complete picture to date of the factors that predict the reproductive success of dominant pairs of clown anemonefish and illustrates the utility of BRTs for analysis of complex behavioural and evolutionary ecology data. © 2011 The Authors. Journal of Animal Ecology © 2011 British Ecological Society.
Using Predictive Analytics to Predict Power Outages from Severe Weather
NASA Astrophysics Data System (ADS)
Wanik, D. W.; Anagnostou, E. N.; Hartman, B.; Frediani, M. E.; Astitha, M.
2015-12-01
The distribution of reliable power is essential to businesses, public services, and our daily lives. With the growing abundance of data being collected and created by industry (e.g. outage data), government agencies (e.g. land cover), and academia (e.g. weather forecasts), we can begin to tackle problems that previously seemed too complex to solve. In this session, we will present newly developed tools to aid decision-support challenges at electric distribution utilities that must mitigate, prepare for, respond to and recover from severe weather. We will show a performance evaluation of outage predictive models built for Eversource Energy (formerly Connecticut Light & Power) for storms of all types (e.g. blizzards, thunderstorms and hurricanes) and magnitudes (from 20 to >15,000 outages). High resolution weather simulations (simulated with the Weather Research and Forecasting Model) were joined with utility outage data to calibrate four types of models: a decision tree (DT), random forest (RF), boosted gradient tree (BT) and an ensemble (ENS) decision tree regression that combined predictions from DT, RF and BT. The study shows that the ENS model forced with weather, infrastructure and land cover data was superior to the other models we evaluated, especially in terms of predicting the spatial distribution of outages. This research has the potential to be used for other critical infrastructure systems (such as telecommunications, drinking water and gas distribution networks), and can be readily expanded to the entire New England region to facilitate better planning and coordination among decision-makers when severe weather strikes.
Ventriculogram segmentation using boosted decision trees
NASA Astrophysics Data System (ADS)
McDonald, John A.; Sheehan, Florence H.
2004-05-01
Left ventricular status, reflected in ejection fraction or end systolic volume, is a powerful prognostic indicator in heart disease. Quantitative analysis of these and other parameters from ventriculograms (cine xrays of the left ventricle) is infrequently performed due to the labor required for manual segmentation. None of the many methods developed for automated segmentation has achieved clinical acceptance. We present a method for semi-automatic segmentation of ventriculograms based on a very accurate two-stage boosted decision-tree pixel classifier. The classifier determines which pixels are inside the ventricle at key ED (end-diastole) and ES (end-systole) frames. The test misclassification rate is about 1%. The classifier is semi-automatic, requiring a user to select 3 points in each frame: the endpoints of the aortic valve and the apex. The first classifier stage is 2 boosted decision-trees, trained using features such as gray-level statistics (e.g. median brightness) and image geometry (e.g. coordinates relative to user supplied 3 points). Second stage classifiers are trained using the same features as the first, plus the output of the first stage. Border pixels are determined from the segmented images using dilation and erosion. A curve is then fit to the border pixels, minimizing a penalty function that trades off fidelity to the border pixels with smoothness. ED and ES volumes, and ejection fraction are estimated from border curves using standard area-length formulas. On independent test data, the differences between automatic and manual volumes (and ejection fractions) are similar in size to the differences between two human observers.
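The last two steps of the pipeline described above (border pixels from a segmented image, then ejection fraction from the ED and ES volumes) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the exact morphological operators, so the sketch takes the border as the mask minus its 4-neighbour erosion, and the function names `border_pixels` and `ejection_fraction` are hypothetical. The ejection fraction uses the standard definition (EDV - ESV) / EDV; the area-length volume formulas themselves are not reproduced here.

```python
def border_pixels(mask):
    """Return the (row, col) pixels of a binary mask that touch the outside.

    Assumption: "border" means mask minus its erosion by a 4-connected
    structuring element; the paper's exact operators are not given.
    """
    rows, cols = len(mask), len(mask[0])

    def inside(r, c):
        # Pixels outside the image count as background.
        return 0 <= r < rows and 0 <= c < cols and mask[r][c] == 1

    border = set()
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != 1:
                continue
            neighbours = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            # A pixel survives erosion only if all four neighbours are inside.
            if not all(inside(nr, nc) for nr, nc in neighbours):
                border.add((r, c))
    return border


def ejection_fraction(ed_volume, es_volume):
    """Standard definition: fraction of the end-diastolic volume ejected."""
    if ed_volume <= 0:
        raise ValueError("end-diastolic volume must be positive")
    return (ed_volume - es_volume) / ed_volume


# Hypothetical 5x5 segmentation with a 3x3 "ventricle" in the middle.
demo_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
demo_border = border_pixels(demo_mask)
```

A curve-fitting step with a smoothness penalty, as the abstract describes, would then take `demo_border` as its input; that step is omitted here.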
Anantha M. Prasad; Louis R. Iverson; Andy Liaw
2006-01-01
We evaluated four statistical models - Regression Tree Analysis (RTA), Bagging Trees (BT), Random Forests (RF), and Multivariate Adaptive Regression Splines (MARS) - for predictive vegetation mapping under current and future climate scenarios according to the Canadian Climate Centre global circulation model.
The process and utility of classification and regression tree methodology in nursing research
Kuhn, Lisa; Page, Karen; Ward, John; Worrall-Carter, Linda
2014-01-01
Aim: This paper presents a discussion of classification and regression tree analysis and its utility in nursing research. Background: Classification and regression tree analysis is an exploratory research method used to illustrate associations between variables not suited to traditional regression analysis. Complex interactions are demonstrated between covariates and variables of interest in inverted tree diagrams. Design: Discussion paper. Data sources: English language literature was sourced from eBooks, Medline Complete and CINAHL Plus databases, Google and Google Scholar, hard copy research texts and retrieved reference lists for terms including classification and regression tree* and derivatives and recursive partitioning from 1984–2013. Discussion: Classification and regression tree analysis is an important method used to identify previously unknown patterns amongst data. Whilst there are several reasons to embrace this method as a means of exploratory quantitative research, issues regarding quality of data as well as the usefulness and validity of the findings should be considered. Implications for Nursing Research: Classification and regression tree analysis is a valuable tool to guide nurses to reduce gaps in the application of evidence to practice. With the ever-expanding availability of data, it is important that nurses understand the utility and limitations of the research method. Conclusion: Classification and regression tree analysis is an easily interpreted method for modelling interactions between health-related variables that would otherwise remain obscured. Knowledge is presented graphically, providing insightful understanding of complex and hierarchical relationships in an accessible and useful way to nursing and other health professions. PMID:24237048
Data mining: Potential applications in research on nutrition and health.
Batterham, Marijka; Neale, Elizabeth; Martin, Allison; Tapsell, Linda
2017-02-01
Data mining enables further insights from nutrition-related research, but caution is required. The aim of this analysis was to demonstrate and compare the utility of data mining methods in classifying a categorical outcome derived from a nutrition-related intervention. Baseline data (23 variables, 8 categorical) on participants (n = 295) in an intervention trial were used to classify participants in terms of meeting the criteria of achieving 10 000 steps per day. Results from classification and regression trees (CARTs), random forests, adaptive boosting, logistic regression, support vector machines and neural networks were compared using area under the curve (AUC) and error assessments. The CART produced the best model when considering the AUC (0.703), overall error (18%) and within class error (28%). Logistic regression also performed reasonably well compared to the other models (AUC 0.675, overall error 23%, within class error 36%). All the methods gave different rankings of variables' importance. CART found that body fat, quality of life using the SF-12 Physical Component Summary (PCS) and the cholesterol: HDL ratio were the most important predictors of meeting the 10 000 steps criteria, while logistic regression showed the SF-12PCS, glucose levels and level of education to be the most significant predictors (P ≤ 0.01). Differing outcomes suggest caution is required with a single data mining method, particularly in a dataset with nonlinear relationships and outliers and when exploring relationships that were not the primary outcomes of the research. © 2017 Dietitians Association of Australia.
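The two comparison measures named in this abstract, area under the curve (AUC) and overall error, can be computed directly from labels and scores. The sketch below is a minimal illustration and not the authors' code: it uses the rank-based (Mann-Whitney) form of the AUC, and the function names `auc` and `overall_error` are hypothetical.

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def overall_error(labels, predictions):
    """Fraction of cases whose predicted class differs from the true class."""
    wrong = sum(1 for l, p in zip(labels, predictions) if l != p)
    return wrong / len(labels)
```

The pairwise loop is quadratic in the number of cases, which is acceptable for a dataset of a few hundred participants; a sort-based version would be preferable for larger data.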
1992-01-01
boost plenum which houses the camshaft. The compressed mixture is metered by a throttle to intake valves of the engine. The engine is constructed from...difficulties associated with a time-tagged fault tree. In particular, recent work indicates that the multi-layer perceptron architecture can give good fdi...Abstract: In the past decade, wastepaper recycling has gained a wider acceptance. Depletion of tree stocks, waste water treatment demands and
A new approach to enhance the performance of decision tree for classifying gene expression data.
Hassan, Md; Kotagiri, Ramamohanarao
2013-12-20
Gene expression data classification is a challenging task due to the large dimensionality and very small number of samples. The decision tree is one of the popular machine learning approaches to address such classification problems. However, the existing decision tree algorithms use a single gene feature at each node to split the data into its child nodes and hence might suffer from poor performance, especially when classifying gene expression datasets. By using a new decision tree algorithm where each node of the tree consists of more than one gene, we enhance the classification performance of traditional decision tree classifiers. Our method selects suitable genes that are combined using a linear function to form a derived composite feature. To determine the structure of the tree we use the area under the Receiver Operating Characteristics curve (AUC). Experimental analysis demonstrates higher classification accuracy using the new decision tree compared to the other existing decision trees in the literature. We experimentally compare the effect of our scheme against other well known decision tree techniques. Experiments show that our algorithm can substantially boost the classification performance of the decision tree.
Liu, Kevin; Warnow, Tandy J; Holder, Mark T; Nelesen, Serita M; Yu, Jiaye; Stamatakis, Alexandros P; Linder, C Randal
2012-01-01
Highly accurate estimation of phylogenetic trees for large data sets is difficult, in part because multiple sequence alignments must be accurate for phylogeny estimation methods to be accurate. Coestimation of alignments and trees has been attempted but currently only SATé estimates reasonably accurate trees and alignments for large data sets in practical time frames (Liu K., Raghavan S., Nelesen S., Linder C.R., Warnow T. 2009b. Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science. 324:1561-1564). Here, we present a modification to the original SATé algorithm that improves upon SATé (which we now call SATé-I) in terms of speed and of phylogenetic and alignment accuracy. SATé-II uses a different divide-and-conquer strategy than SATé-I and so produces smaller more closely related subsets than SATé-I; as a result, SATé-II produces more accurate alignments and trees, can analyze larger data sets, and runs more efficiently than SATé-I. Generally, SATé is a metamethod that takes an existing multiple sequence alignment method as an input parameter and boosts the quality of that alignment method. SATé-II-boosted alignment methods are significantly more accurate than their unboosted versions, and trees based upon these improved alignments are more accurate than trees based upon the original alignments. Because SATé-I used maximum likelihood (ML) methods that treat gaps as missing data to estimate trees and because we found a correlation between the quality of tree/alignment pairs and ML scores, we explored the degree to which SATé's performance depends on using ML with gaps treated as missing data to determine the best tree/alignment pair. We present two lines of evidence that using ML with gaps treated as missing data to optimize the alignment and tree produces very poor results. 
First, we show that the optimization problem where a set of unaligned DNA sequences is given and the output is the tree and alignment of those sequences that maximize likelihood under the Jukes-Cantor model is uninformative in the worst possible sense. For all inputs, all trees optimize the likelihood score. Second, we show that a greedy heuristic that uses GTR+Gamma ML to optimize the alignment and the tree can produce very poor alignments and trees. Therefore, the excellent performance of SATé-II and SATé-I is not because ML is used as an optimization criterion for choosing the best tree/alignment pair but rather due to the particular divide-and-conquer realignment techniques employed.
NASA Astrophysics Data System (ADS)
Xiao, Guoqiang; Jiang, Yang; Song, Gang; Jiang, Jianmin
2010-12-01
We propose a support-vector-machine (SVM) tree to hierarchically learn from domain knowledge represented by low-level features toward automatic classification of sports videos. The proposed SVM tree adopts a binary tree structure to exploit the nature of SVM's binary classification, where each internal node is a single SVM learning unit, and each external node represents the classified output type. Such an SVM tree presents a number of advantages, which include: 1. low computing cost; 2. integrated learning and classification while preserving individual SVM's learning strength; and 3. flexibility in both structure and learning modules, where different numbers of nodes and features can be added to address specific learning requirements, and various learning models can be added as individual nodes, such as neural networks, AdaBoost, hidden Markov models, dynamic Bayesian networks, etc. Experiments support that the proposed SVM tree achieves good performance in sports video classification.
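The binary tree structure described in this abstract, with a binary learner at each internal node and an output class at each external node, can be sketched as a small data structure. This is a hedged illustration only: the `decide` callables stand in for trained SVMs, which are not implemented here, and the class names, thresholds, and sport labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Union


@dataclass
class Leaf:
    """External node: the classified output type."""
    label: str


@dataclass
class Node:
    """Internal node: one binary decision unit (a stand-in for a trained SVM)."""
    decide: Callable[[Sequence[float]], bool]
    if_true: Union["Node", Leaf]
    if_false: Union["Node", Leaf]


def classify(tree, features):
    """Route a feature vector from the root down to a leaf label."""
    while isinstance(tree, Node):
        tree = tree.if_true if tree.decide(features) else tree.if_false
    return tree.label


# Hypothetical two-level tree over two low-level features.
demo_tree = Node(
    decide=lambda f: f[0] > 0.5,
    if_true=Leaf("field sport"),
    if_false=Node(
        decide=lambda f: f[1] > 0.5,
        if_true=Leaf("court sport"),
        if_false=Leaf("other"),
    ),
)
```

Because each node only needs a callable returning a boolean, a different learner (for example a neural network) could be dropped into any node, which is the flexibility the abstract refers to.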
Assessing the influence of multiple stressors on stream diatom metrics in the upper Midwest, USA
Munn, Mark D.; Waite, Ian R.; Konrad, Christopher P.
2018-01-01
Water resource managers face increasing challenges in identifying what physical and chemical stressors are responsible for the alteration of biological conditions in streams. The objective of this study was to assess the comparative influence of multiple stressors on benthic diatoms at 98 sites that spanned a range of stressors in an agriculturally dominated region in the upper Midwest, USA. The primary stressors of interest included: nutrients, herbicides and fungicides, sediment, and streamflow; although the influence of physical habitat was incorporated in the assessment. Boosted Regression Tree was used to examine both the sensitivity of various diatom metrics and the relative importance of the primary stressors. Percent Sensitive Taxa, percent Highly Motile Taxa, and percent High Phosphorus Taxa had the strongest response to stressors. Habitat and total phosphorus were the most common discriminators of diatom metrics, with herbicides as secondary factors. A Classification and Regression Tree (CART) model was used to examine conditional relations among stressors and indicated that fine-grain streams had a lower percentage of Sensitive Taxa than coarse-grain streams, with Sensitive Taxa decreasing further with increased water temperature (>30 °C) and triazine concentrations (>1500 ng/L). In contrast, streams dominated by coarse-grain substrate contained a higher percentage of Sensitive Taxa, with relative abundance increasing with lower water temperatures (<29 °C) and shallower water depth (<0.3 m). Quantile regression indicated that maximum water temperature appears to be a major limiting factor in Midwest streams; whereas both total phosphorus and percent fines showed a slight subsidy-stress response. 
While using benthic algae for assessing stream quality can be challenging, field-based studies can elucidate stressor effects and interactions when the response variables are appropriate, sufficient stressor resolution is achieved, and the number and type of sites represent a gradient of stressor conditions and at least a quasi-factorial design.
Variable selection and model choice in geoadditive regression models.
Kneib, Thomas; Hothorn, Torsten; Tutz, Gerhard
2009-06-01
Model choice and variable selection are issues of major concern in practical regression analyses, arising in many biometric applications such as habitat suitability analyses, where the aim is to identify the influence of potentially many environmental conditions on certain species. We describe regression models for breeding bird communities that facilitate both model choice and variable selection, by a boosting algorithm that works within a class of geoadditive regression models comprising spatial effects, nonparametric effects of continuous covariates, interaction surfaces, and varying coefficients. The major modeling components are penalized splines and their bivariate tensor product extensions. All smooth model terms are represented as the sum of a parametric component and a smooth component with one degree of freedom to obtain a fair comparison between the model terms. A generic representation of the geoadditive model allows us to devise a general boosting algorithm that automatically performs model choice and variable selection.
Maloney, Kelly O.; Schmid, Matthias; Weller, Donald E.
2012-01-01
Issues with ecological data (e.g. non-normality of errors, nonlinear relationships and autocorrelation of variables) and modelling (e.g. overfitting, variable selection and prediction) complicate regression analyses in ecology. Flexible models, such as generalized additive models (GAMs), can address data issues, and machine learning techniques (e.g. gradient boosting) can help resolve modelling issues. Gradient boosted GAMs do both. Here, we illustrate the advantages of this technique using data on benthic macroinvertebrates and fish from 1573 small streams in Maryland, USA.
NASA Astrophysics Data System (ADS)
Chugh, Saryu; Arivu Selvan, K.; Nadesh, RK
2017-11-01
Numerous harmful factors, such as hypertension, smoking, obesity, and inappropriate use of medication, affect the functioning of the human body and cause many different diseases, such as diabetes, thyroid disorders, strokes, and coronary disease. The instability and poor state of the environment are also a cause of coronary disease. The structure of Apache Spark relies on the evolution which requires gathering of the data. To analyze the importance of application software focused on data structure, Apache Spark should be used; it offers various advantages, as it is fast because it uses in-memory processing. Apache Spark runs in a distributed environment and breaks the data down into batches, giving a high productivity rate. The use of mining techniques in the diagnosis of coronary disease has been examined extensively, showing acceptable levels of accuracy. Decision trees, neural networks, and gradient boosting algorithms are among the Apache Spark capabilities that help in collecting the information.
Automated Proton Track Identification in MicroBooNE Using Gradient Boosted Decision Trees
DOE Office of Scientific and Technical Information (OSTI.GOV)
Woodruff, Katherine
MicroBooNE is a liquid argon time projection chamber (LArTPC) neutrino experiment that is currently running in the Booster Neutrino Beam at Fermilab. LArTPC technology allows for high-resolution, three-dimensional representations of neutrino interactions. A wide variety of software tools for automated reconstruction and selection of particle tracks in LArTPCs are actively being developed. Short, isolated proton tracks, the signal for low-momentum-transfer neutral current (NC) elastic events, are easily hidden in a large cosmic background. Detecting these low-energy tracks will allow us to probe interesting regions of the proton's spin structure. An effective method for selecting NC elastic events is to combine a highly efficient track reconstruction algorithm to find all candidate tracks with highly accurate particle identification using a machine learning algorithm. We present our work on particle track classification using gradient tree boosting software (XGBoost) and the performance on simulated neutrino data.
Models of Marine Fish Biodiversity: Assessing Predictors from Three Habitat Classification Schemes
Yates, Katherine L.; Mellin, Camille; Caley, M. Julian; Radford, Ben T.; Meeuwig, Jessica J.
2016-01-01
Prioritising biodiversity conservation requires knowledge of where biodiversity occurs. Such knowledge, however, is often lacking. New technologies for collecting biological and physical data coupled with advances in modelling techniques could help address these gaps and facilitate improved management outcomes. Here we examined the utility of environmental data, obtained using different methods, for developing models of both uni- and multivariate biodiversity metrics. We tested which biodiversity metrics could be predicted best and evaluated the performance of predictor variables generated from three types of habitat data: acoustic multibeam sonar imagery, predicted habitat classification, and direct observer habitat classification. We used boosted regression trees (BRT) to model metrics of fish species richness, abundance and biomass, and multivariate regression trees (MRT) to model biomass and abundance of fish functional groups. We compared model performance using different sets of predictors and estimated the relative influence of individual predictors. Models of total species richness and total abundance performed best; those developed for endemic species performed worst. Abundance models performed substantially better than corresponding biomass models. In general, BRT and MRTs developed using predicted habitat classifications performed less well than those using multibeam data. The most influential individual predictor was the abiotic categorical variable from direct observer habitat classification and models that incorporated predictors from direct observer habitat classification consistently outperformed those that did not. Our results show that while remotely sensed data can offer considerable utility for predictive modelling, the addition of direct observer habitat classification data can substantially improve model performance. 
Thus it appears that there are aspects of marine habitats that are important for modelling metrics of fish biodiversity that are not fully captured by remotely sensed data. As such, the use of remotely sensed data to model biodiversity represents a compromise between model performance and data availability. PMID:27333202
Travis Woolley; David C. Shaw; Lisa M. Ganio; Stephen Fitzgerald
2012-01-01
Logistic regression models used to predict tree mortality are critical to post-fire management, planning prescribed burns and understanding disturbance ecology. We review literature concerning post-fire mortality prediction using logistic regression models for coniferous tree species in the western USA. We include synthesis and review of: methods to develop, evaluate...
Kaufmann, Liane; Huber, Stefan; Mayer, Daniel; Moeller, Korbinian; Marksteiner, Josef
2018-04-01
Adverse effects of heavy drinking on cognition have frequently been reported. In the present study, we systematically examined for the first time whether clinical neuropsychological assessments may be sensitive to alcohol abuse in elderly patients with suspected minor neurocognitive disorder. A total of 144 elderly with and without alcohol abuse (each group n=72; mean age 66.7 years) were selected from a patient pool of n=738 by applying propensity score matching (a statistical method allowing to match participants in experimental and control group by balancing various covariates to reduce selection bias). Accordingly, study groups were almost perfectly matched regarding age, education, gender, and Mini Mental State Examination score. Neuropsychological performance was measured using the CERAD (Consortium to Establish a Registry for Alzheimer's Disease). Classification analyses (i.e., decision tree and boosted trees models) were conducted to examine whether CERAD variables or total score contributed to group classification. Decision tree models disclosed that groups could be reliably classified based on the CERAD variables "Word List Discriminability" (tapping verbal recognition memory, 64% classification accuracy) and "Trail Making Test A" (measuring visuo-motor speed, 59% classification accuracy). Boosted tree analyses further indicated the sensitivity of "Word List Recall" (measuring free verbal recall) for discriminating elderly with versus without a history of alcohol abuse. This indicates that specific CERAD variables seem to be sensitive to alcohol-related cognitive dysfunctions in elderly patients with suspected minor neurocognitive disorder. (JINS, 2018, 24, 360-371).
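The matching step mentioned in this abstract can be illustrated with a small sketch. It assumes that a propensity score has already been estimated for every participant (for example from a logistic regression on age, education, gender, and MMSE score, which is not shown) and performs greedy 1:1 nearest-neighbour matching within a caliper. The abstract does not say which matching algorithm was used, so the greedy strategy, the function name `match_by_propensity`, and the caliper value are all assumptions.

```python
def match_by_propensity(treated, controls, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching on precomputed propensity scores.

    treated, controls: dicts mapping participant id -> propensity score.
    Returns a list of (treated_id, control_id) pairs; each control is used
    at most once, and pairs further apart than the caliper are dropped.
    """
    available = dict(controls)
    pairs = []
    # Process treated participants in order of increasing score.
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]
    return pairs


# Hypothetical scores for illustration only.
demo_pairs = match_by_propensity(
    {"a": 0.30, "b": 0.62},
    {"x": 0.28, "y": 0.60, "z": 0.90},
)
```

Greedy matching is order-dependent; optimal matching would minimise the total score distance across all pairs, at higher computational cost.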
Comparative Study of Vibration Condition Indicators for Detecting Cracks in Spur Gears
NASA Technical Reports Server (NTRS)
Nenadic, Nenad; Ardis, Paul; Hood, Adrian; Thurston, Michael; Ghoshal, Anindya; Lewicki, David
2013-01-01
This paper reports the results of an empirical study on the tooth breakage failure mode in spur gears. Of four dominant gear failure modes (breakage, wear, pitting, and scoring), tooth breakage is the most precipitous and often leads to catastrophic failures. The cracks were initiated using a fatigue tester and a custom-designed single-tooth bending fixture to simulate over-load conditions, instead of traditional notching using wire electrical discharge machining (EDM). The cracks were then propagated on a dynamometer. The ground truth of damage level during crack propagation was monitored with crack-propagation sensors. Ten crack propagations have been performed to compare the existing condition indicators (CIs) with respect to their ability to detect a crack, ability to assess the damage, and sensitivity to sensor placement. Of more than thirty computed CIs, this paper compares five commonly used ones: raw RMS, FM0, NA4, raw kurtosis, and NP4. The performance of combined CIs was also investigated, using feature fusion based on linear regression, logistic regression, and boosted regression trees.
Fienen, Michael N.; Nolan, Bernard T.; Feinstein, Daniel T.
2016-01-01
For decision support, the insights and predictive power of numerical process models can be hampered by insufficient expertise and computational resources required to evaluate system response to new stresses. An alternative is to emulate the process model with a statistical “metamodel.” Built on a dataset of collocated numerical model input and output, a groundwater flow model was emulated using a Bayesian Network, an Artificial Neural Network, and a Gradient Boosted Regression Tree. The response of interest was surface water depletion expressed as the source of water-to-wells. The results have application for managing allocation of groundwater. Each technique was tuned using cross validation and further evaluated using a held-out dataset. A numerical MODFLOW-USG model of the Lake Michigan Basin, USA, was used for the evaluation. The performance and interpretability of the techniques were compared, pointing to advantages of each technique. The metamodel can extend to unmodeled areas.
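To make the gradient boosted regression tree idea concrete, the sketch below fits an additive model of one-split regression trees (stumps) to the residuals of the current prediction, using squared-error loss and a single predictor. It is a teaching sketch under stated assumptions, not the configuration used in the study: real implementations grow deeper trees over many predictors, and the names `fit_stump`, `fit_boosted_stumps`, `predict`, `n_rounds`, and `learning_rate` are hypothetical.

```python
def fit_stump(x, residuals):
    """Find the single split on x that minimises squared error of the residuals.

    Assumes x contains at least two distinct values.
    """
    best = None
    values = sorted(set(x))
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2.0
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]


def fit_boosted_stumps(x, y, n_rounds=50, learning_rate=0.1):
    """Gradient boosting for squared error: each stump fits current residuals."""
    base = sum(y) / len(y)
    current = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - ci for yi, ci in zip(y, current)]
        threshold, left_mean, right_mean = fit_stump(x, residuals)
        stumps.append((threshold, left_mean, right_mean))
        # Shrink each stump's contribution by the learning rate.
        current = [
            ci + learning_rate * (left_mean if xi <= threshold else right_mean)
            for xi, ci in zip(x, current)
        ]
    return base, learning_rate, stumps


def predict(model, xi):
    """Sum the base value and the shrunken contribution of every stump."""
    base, learning_rate, stumps = model
    total = base
    for threshold, left_mean, right_mean in stumps:
        total += learning_rate * (left_mean if xi <= threshold else right_mean)
    return total


def mean_squared_error(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)
```

In a workflow like the one described, `n_rounds` and `learning_rate` would be the quantities tuned by cross validation, with `mean_squared_error` then computed once on the held-out dataset.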
Mozer, M C; Wolniewicz, R; Grimes, D B; Johnson, E; Kaushansky, H
2000-01-01
Competition in the wireless telecommunications industry is fierce. To maintain profitability, wireless carriers must control churn, which is the loss of subscribers who switch from one carrier to another. We explore techniques from statistical machine learning to predict churn and, based on these predictions, to determine what incentives should be offered to subscribers to improve retention and maximize profitability to the carrier. The techniques include logit regression, decision trees, neural networks, and boosting. Our experiments are based on a database of nearly 47,000 U.S. domestic subscribers and include information about their usage, billing, credit, application, and complaint history. Our experiments show that under a wide variety of assumptions concerning the cost of intervention and the retention rate resulting from intervention, using predictive techniques to identify potential churners and offering incentives can yield significant savings to a carrier. We also show the importance of a data representation crafted by domain experts. Finally, we report on a real-world test of the techniques that validates our simulation experiments.
ANNz2: Photometric Redshift and Probability Distribution Function Estimation using Machine Learning
NASA Astrophysics Data System (ADS)
Sadeh, I.; Abdalla, F. B.; Lahav, O.
2016-10-01
We present ANNz2, a new implementation of the public software for photometric redshift (photo-z) estimation of Collister & Lahav, which now includes generation of full probability distribution functions (PDFs). ANNz2 utilizes multiple machine learning methods, such as artificial neural networks and boosted decision/regression trees. The objective of the algorithm is to optimize the performance of the photo-z estimation, to properly derive the associated uncertainties, and to produce both single-value solutions and PDFs. In addition, estimators are made available, which mitigate possible problems of non-representative or incomplete spectroscopic training samples. ANNz2 has already been used as part of the first weak lensing analysis of the Dark Energy Survey, and is included in the experiment's first public data release. Here we illustrate the functionality of the code using data from the tenth data release of the Sloan Digital Sky Survey and the Baryon Oscillation Spectroscopic Survey. The code is available for download at http://github.com/IftachSadeh/ANNZ.
Developing a dengue forecast model using machine learning: A case study in China.
Guo, Pi; Liu, Tao; Zhang, Qin; Wang, Li; Xiao, Jianpeng; Zhang, Qingying; Luo, Ganfeng; Li, Zhihao; He, Jianfeng; Zhang, Yonghui; Ma, Wenjun
2017-10-01
In China, dengue remains an important public health issue with expanded areas and increased incidence recently. Accurate and timely forecasts of dengue incidence in China are still lacking. We aimed to use the state-of-the-art machine learning algorithms to develop an accurate predictive model of dengue. Weekly dengue cases, Baidu search queries and climate factors (mean temperature, relative humidity and rainfall) during 2011-2014 in Guangdong were gathered. A dengue search index was constructed for developing the predictive models in combination with climate factors. The observed year and week were also included in the models to control for the long-term trend and seasonality. Several machine learning algorithms, including the support vector regression (SVR) algorithm, step-down linear regression model, gradient boosted regression tree algorithm (GBM), negative binomial regression model (NBM), least absolute shrinkage and selection operator (LASSO) linear regression model and generalized additive model (GAM), were used as candidate models to predict dengue incidence. Performance and goodness of fit of the models were assessed using the root-mean-square error (RMSE) and R-squared measures. The residuals of the models were examined using the autocorrelation and partial autocorrelation function analyses to check the validity of the models. The models were further validated using dengue surveillance data from five other provinces. The epidemics during the last 12 weeks and the peak of the 2014 large outbreak were accurately forecasted by the SVR model selected by a cross-validation technique. Moreover, the SVR model had the consistently smallest prediction error rates for tracking the dynamics of dengue and forecasting the outbreaks in other areas in China. The proposed SVR model achieved a superior performance in comparison with other forecasting techniques assessed in this study. The findings can help the government and community respond early to dengue epidemics.
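The two fit measures named above can be sketched as follows. This is an illustrative sketch only: the abstract does not state which definition of R-squared was used, so the version below (one minus the ratio of residual to total sum of squares) is an assumption, and the function names are hypothetical.

```python
from math import sqrt


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return sqrt(sum(squared_errors) / len(observed))


def r_squared(observed, predicted):
    """Coefficient of determination as 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot
```

With made-up weekly counts `observed = [2, 4, 6, 8]` and `predicted = [3, 4, 5, 8]`, the RMSE is about 0.71 and R-squared is 0.9. Lower RMSE and higher R-squared both indicate a closer fit, which is how candidate models would be ranked.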
Mitchell, Matthew G E; Johansen, Kasper; Maron, Martine; McAlpine, Clive A; Wu, Dan; Rhodes, Jonathan R
2018-05-01
Urban areas are sources of land use change and CO2 emissions that contribute to global climate change. Despite this, assessments of urban vegetation carbon stocks often fail to identify important landscape-scale drivers of variation in urban carbon, especially the potential effects of landscape structure variables at different spatial scales. We combined field measurements with Light Detection And Ranging (LiDAR) data to build high-resolution models of woody plant aboveground carbon across the urban portion of Brisbane, Australia, and then identified landscape scale drivers of these carbon stocks. First, we used LiDAR data to quantify the extent and vertical structure of vegetation across the city at high resolution (5×5 m). Next, we paired this data with aboveground carbon measurements at 219 sites to create boosted regression tree models and map aboveground carbon across the city. We then used these maps to determine how spatial variation in land cover/land use and landscape structure affects these carbon stocks. Foliage densities above 5 m height, tree canopy height, and the presence of ground openings had the strongest relationships with aboveground carbon. Using these fine-scale relationships, we estimate that 2.2±0.4 Tg C are stored aboveground in the urban portion of Brisbane, with mean densities of 32.6±5.8 Mg C ha⁻¹ calculated across the entire urban land area, and 110.9±19.7 Mg C ha⁻¹ calculated within treed areas. Predicted carbon densities within treed areas showed strong positive relationships with the proportion of surrounding tree cover and how clumped that tree cover was at both 1 km² and 1 ha resolutions. Our models predict that even dense urban areas with low tree cover can have high carbon densities at fine scales. We conclude that actions and policies aimed at increasing urban carbon should focus on those areas where urban tree cover is most fragmented. Copyright © 2017 Elsevier B.V. All rights reserved.
Duncan, Dustin T; Kawachi, Ichiro; Kum, Susan; Aldstadt, Jared; Piras, Gianfranco; Matthews, Stephen A; Arbia, Giuseppe; Castro, Marcia C; White, Kellee; Williams, David R
2014-04-01
The racial/ethnic and income composition of neighborhoods often influences local amenities, including the potential spatial distribution of trees, which are important for population health and community wellbeing, particularly in urban areas. This ecological study used spatial analytical methods to assess the relationship between neighborhood socio-demographic characteristics (i.e. minority racial/ethnic composition and poverty) and tree density at the census tract level in Boston, Massachusetts (US). We examined spatial autocorrelation with the Global Moran's I for all study variables and in the ordinary least squares (OLS) regression residuals as well as computed Spearman correlations non-adjusted and adjusted for spatial autocorrelation between socio-demographic characteristics and tree density. Next, we fit traditional regressions (i.e. OLS regression models) and spatial regressions (i.e. spatial simultaneous autoregressive models), as appropriate. We found significant positive spatial autocorrelation for all neighborhood socio-demographic characteristics (Global Moran's I range from 0.24 to 0.86, all P=0.001), for tree density (Global Moran's I=0.452, P=0.001), and in the OLS regression residuals (Global Moran's I range from 0.32 to 0.38, all P<0.001). Therefore, we fit the spatial simultaneous autoregressive models. There was a negative correlation between neighborhood percent non-Hispanic Black and tree density (rS=-0.19; conventional P-value=0.016; spatially adjusted P-value=0.299) as well as a negative correlation between predominantly non-Hispanic Black (over 60% Black) neighborhoods and tree density (rS=-0.18; conventional P-value=0.019; spatially adjusted P-value=0.180).
While the conventional OLS regression model found a marginally significant inverse relationship between Black neighborhoods and tree density, we found no statistically significant relationship between neighborhood socio-demographic composition and tree density in the spatial regression models. Methodologically, our study suggests the need to take into account spatial autocorrelation as findings/conclusions can change when the spatial autocorrelation is ignored. Substantively, our findings suggest no need for policy intervention vis-à-vis trees in Boston, though we hasten to add that replication studies, and more nuanced data on tree quality, age and diversity are needed.
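The Global Moran's I statistic referred to above can be sketched with the standard textbook formula, I = (n / W) · Σᵢ Σⱼ wᵢⱼ (xᵢ − x̄)(xⱼ − x̄) / Σᵢ (xᵢ − x̄)², where W is the sum of all spatial weights. The sketch below is illustrative only: the weights matrix, the function name, and the toy values are assumptions, and the study's actual neighbor definition and significance testing are not reproduced.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a square spatial weights matrix.

    weights[i][j] is the spatial weight between units i and j (0 if not neighbours).
    """
    n = len(values)
    mean_value = sum(values) / n
    dev = [v - mean_value for v in values]
    total_weight = sum(sum(row) for row in weights)
    # Weighted sum of cross-products of deviations over all pairs of units.
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return (n / total_weight) * cross / sum(d * d for d in dev)


# Hypothetical example: four tracts in a row, each adjacent to its neighbours.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
smooth = morans_i([1, 2, 3, 4], chain)       # similar values sit side by side
alternating = morans_i([1, 4, 1, 4], chain)  # neighbours always differ
```

Here `smooth` is positive (about 0.33) and `alternating` is -1.0. Positive values of this kind in regression residuals are what motivate switching from OLS to a spatial model.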
Jarnevich, Catherine S.; Talbert, Marian; Morisette, Jeffrey T.; Aldridge, Cameron L.; Brown, Cynthia; Kumar, Sunil; Manier, Daniel; Talbert, Colin; Holcombe, Tracy R.
2017-01-01
Evaluating the conditions where a species can persist is an important question in ecology both to understand tolerances of organisms and to predict distributions across landscapes. Presence data combined with background or pseudo-absence locations are commonly used with species distribution modeling to develop these relationships. However, there is not a standard method to generate background or pseudo-absence locations, and method choice affects model outcomes. We evaluated combinations of both model algorithms (simple and complex generalized linear models, multivariate adaptive regression splines, Maxent, boosted regression trees, and random forest) and background methods (random, minimum convex polygon, and continuous and binary kernel density estimator (KDE)) to assess the sensitivity of model outcomes to choices made. We evaluated six questions related to model results, including five beyond the common comparison of model accuracy assessment metrics (biological interpretability of response curves, cross-validation robustness, independent data accuracy and robustness, and prediction consistency). For our case study with cheatgrass in the western US, random forest was least sensitive to background choice and the binary KDE method was least sensitive to model algorithm choice. While this outcome may not hold for other locations or species, the methods we used can be implemented to help determine appropriate methodologies for particular research questions.
Predicting arsenic in drinking water wells of the Central Valley, California
Ayotte, Joseph; Nolan, Bernard T.; Gronberg, JoAnn M.
2016-01-01
Probabilities of arsenic in groundwater at depths used for domestic and public supply in the Central Valley of California are predicted using weak-learner ensemble models (boosted regression trees, BRT) and more traditional linear models (logistic regression, LR). Both methods captured major processes that affect arsenic concentrations, such as the chemical evolution of groundwater, redox differences, and the influence of aquifer geochemistry. Inferred flow-path length was the most important variable but near-surface-aquifer geochemical data also were significant. A unique feature of this study was that previously predicted nitrate concentrations in three dimensions were themselves predictive of arsenic and indicated an important redox effect at >10 μg/L, indicating low arsenic where nitrate was high. Additionally, a variable representing three-dimensional aquifer texture from the Central Valley Hydrologic Model was an important predictor, indicating high arsenic associated with fine-grained aquifer sediment. BRT outperformed LR at the 5 μg/L threshold in all five predictive performance measures and at 10 μg/L in four out of five measures. BRT yielded higher prediction sensitivity (39%) than LR (18%) at the 10 μg/L threshold–a useful outcome because a major objective of the modeling was to improve our ability to predict high arsenic areas.
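Prediction sensitivity at a concentration threshold, as reported above, is the share of wells that truly exceed the threshold and are also flagged by the model. The sketch below is a hedged illustration: the probability cutoff of 0.5, the function name, and the sample values are assumptions and are not taken from the study.

```python
def sensitivity(observed_conc, predicted_prob, conc_threshold=10.0, prob_cutoff=0.5):
    """Fraction of true exceedances (observed > threshold) that the model flags.

    conc_threshold is in the same units as observed_conc (here assumed ug/L);
    prob_cutoff is a hypothetical decision cutoff on the predicted probability.
    """
    probs_for_exceedances = [
        p for c, p in zip(observed_conc, predicted_prob) if c > conc_threshold
    ]
    true_positives = sum(1 for p in probs_for_exceedances if p >= prob_cutoff)
    return true_positives / len(probs_for_exceedances)


# Made-up data: three wells exceed 10 ug/L, and the model flags two of them.
example = sensitivity([2, 12, 15, 8, 30], [0.1, 0.7, 0.3, 0.6, 0.9])
```

In this toy case `example` is 2/3. A model can have high overall accuracy and still low sensitivity when exceedances are rare, which is why the measure is reported separately.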
NASA Astrophysics Data System (ADS)
Zafari, Jaber; Jouni, Fatemeh Javani; Ahmadvand, Ali; Abdolmaleki, Parviz; Soodi, Malihe; Zendehdel, Rezvan
2017-02-01
A model was set up to predict the differentiation patterns based on the data extracted from FTIR spectroscopy. For this reason, bone marrow stem cells (BMSCs) were differentiated to primordial germ cells (PGCs). Changes in cellular macromolecules in the time of 0, 24, 48, 72, and 96 h of differentiation, as different steps of the differentiation procedure were investigated by using FTIR spectroscopy. Also, the expression of pluripotency (Oct-4, Nanog and c-Myc) and specific genes (Mvh, Stella and Fragilis) were investigated by real-time PCR. However, the expression of genes in five steps of differentiation was predicted by FTIR spectroscopy. FTIR spectra showed changes in the template of band intensities at different differentiation steps. There are increasing changes in the stepwise differentiation procedure for the ratio area of CH2, which is symmetric to CH2 asymmetric stretching. An ensemble of expert methods, including regression tree (RT), boosting algorithm (BA), and generalized regression neural network (GRNN), was the best method to predict the gene expression by FTIR spectroscopy. In conclusion, the model was able to distinguish the pattern of different steps from cell differentiation by using some useful features extracted from FTIR spectra.
Wakie, Tewodros; Kumar, Sunil; Senay, Gabriel; Takele, Abera; Lencho, Alemu
2016-01-01
A number of studies have reported the presence of wheat septoria leaf blotch (Septoria tritici; SLB) disease in Ethiopia. However, the environmental factors associated with SLB disease, and areas under risk of SLB disease, have not been studied. Here, we tested the hypothesis that environmental variables can adequately explain observed SLB disease severity levels in West Shewa, Central Ethiopia. Specifically, we identified 50 environmental variables and assessed their relationships with SLB disease severity. Geographically referenced disease severity data were obtained from the field, and linear regression and Boosted Regression Trees (BRT) modeling approaches were used for developing spatial models. Moderate-resolution imaging spectroradiometer (MODIS) derived vegetation indices and land surface temperature (LST) variables highly influenced SLB model predictions. Soil and topographic variables did not sufficiently explain observed SLB disease severity variation in this study. Our results show that wheat growing areas in Central Ethiopia, including highly productive districts, are at risk of SLB disease. The study demonstrates the integration of field data with modeling approaches such as BRT for predicting the spatial patterns of severity of a pathogenic wheat disease in Central Ethiopia. Our results can aid Ethiopia's wheat disease monitoring efforts, while our methods can be replicated for testing related hypotheses elsewhere.
Evaluating the performance of low cost chemical sensors for air pollution research.
Lewis, Alastair C; Lee, James D; Edwards, Peter M; Shaw, Marvin D; Evans, Mat J; Moller, Sarah J; Smith, Katie R; Buckley, Jack W; Ellis, Matthew; Gillot, Stefan R; White, Andrew
2016-07-18
Low cost pollution sensors have been widely publicized, in principle offering increased information on the distribution of air pollution and a democratization of air quality measurements to amateur users. We report a laboratory study of commonly-used electrochemical sensors and quantify a number of cross-interferences with other atmospheric chemicals, some of which become significant at typical suburban air pollution concentrations. We highlight that artefact signals from co-sampled pollutants such as CO2 can be greater than the electrochemical sensor signal generated by the measurand. We subsequently tested in ambient air, over a period of three weeks, twenty identical commercial sensor packages alongside standard measurements and report on the degree of agreement between references and sensors. We then explore potential experimental approaches to improve sensor performance, enhancing outputs from qualitative to quantitative, focusing on low cost VOC photoionization sensors. Careful signal handling, for example, was seen to improve limits of detection by one order of magnitude. The quantity, magnitude and complexity of analytical interferences that must be characterised to convert a signal into a quantitative observation, with known uncertainties, make standard individual parameter regression inappropriate. We show that one potential solution to this problem is the application of supervised machine learning approaches such as boosted regression trees and Gaussian processes emulation.
Louis R Iverson; Anantha M. Prasad; Mark W. Schwartz
2005-01-01
We predict current distribution and abundance for tree species present in eastern North America, and subsequently estimate potential suitable habitat for those species under a changed climate with 2 x CO2. We used a series of statistical models (i.e., Regression Tree Analysis (RTA), Multivariate Adaptive Regression Splines (MARS), Bagging Trees (...
Reef flattening effects on total richness and species responses in the Caribbean.
Newman, Steven P; Meesters, Erik H; Dryden, Charlie S; Williams, Stacey M; Sanchez, Cristina; Mumby, Peter J; Polunin, Nicholas V C
2015-11-01
There has been ongoing flattening of Caribbean coral reefs with the loss of habitat having severe implications for these systems. Complexity and its structural components are important to fish species richness and community composition, but little is known about its role for other taxa or species-specific responses. This study reveals the importance of reef habitat complexity and structural components to different taxa of macrofauna, total species richness, and individual coral and fish species in the Caribbean. Species presence and richness of different taxa were visually quantified in one hundred 25-m² plots in three marine reserves in the Caribbean. Sampling was evenly distributed across five levels of visually estimated reef complexity, with five structural components also recorded: the number of corals, number of large corals, slope angle, maximum sponge and maximum octocoral height. Taking advantage of natural heterogeneity in structural complexity within a particular coral reef habitat (Orbicella reefs) and discrete environmental envelope, thus minimizing other sources of variability, the relative importance of reef complexity and structural components was quantified for different taxa and individual fish and coral species on Caribbean coral reefs using boosted regression trees (BRTs). Boosted regression tree models performed very well when explaining variability in total (82.3%), coral (80.6%) and fish species richness (77.3%), for which the greatest declines in richness occurred below intermediate reef complexity levels. Complexity accounted for very little of the variability in octocorals, sponges, arthropods, annelids or anemones. BRTs revealed species-specific variability and importance for reef complexity and structural components. Coral and fish species occupancy generally declined at low complexity levels, with the exception of two coral species (Pseudodiploria strigosa and Porites divaricata) and four fish species (Halichoeres bivittatus, H. maculipinna, Malacoctenus triangulatus and Stegastes partitus) more common at lower reef complexity levels. A significant interaction between country and reef complexity revealed a non-additive decline in species richness in areas of low complexity and the reserve in Puerto Rico. Flattening of Caribbean coral reefs will result in substantial species losses, with few winners. Individual structural components have considerable value to different species, and their loss may have profound impacts on population responses of coral and fish due to identity effects of key species, which underpin population richness and resilience and may affect essential ecosystem processes and services. © 2015 The Authors. Journal of Animal Ecology © 2015 British Ecological Society.
Charles E. Rose; Thomas B. Lynch
2001-01-01
A method was developed for estimating parameters in an individual tree basal area growth model using a system of equations based on dbh rank classes. The estimation method developed is a compromise between an individual tree and a stand level basal area growth model that accounts for the correlation between trees within a plot by using seemingly unrelated regression (...
Susan L. King
2003-01-01
The performance of two classifiers, logistic regression and neural networks, are compared for modeling noncatastrophic individual tree mortality for 21 species of trees in West Virginia. The output of the classifier is usually a continuous number between 0 and 1. A threshold is selected between 0 and 1 and all of the trees below the threshold are classified as...
NASA Astrophysics Data System (ADS)
Novitski, Linda Nicole
Accurate and cost-effective assessment of water quality is necessary for proper management and restoration of inland water bodies susceptible to algal bloom conditions. Landsat and MODIS satellite images were used to create chlorophyll and Secchi depth predictive models for algal assessment of Great Lakes and other lakes of the United States. Boosted regression tree (BRT) models using satellite imagery are both easy to use and can have high predictive performance. BRT models inferred chlorophyll and Secchi depth more accurately than linear regression models for all study locations. Inferred chlorophyll of inner Saginaw Bay was subsequently used in ecological models to help understand the ecological drivers of algal blooms in this ecosystem. For small lakes (non-Great Lakes), the best national Landsat model for ln-transformed chlorophyll was the BRT model and had a cross-validation R2 of 0.44 and a 0.76 ln-transformed μg/L RMSE. The best national Landsat model for Secchi depth was also a BRT model that had an adjusted R2 of 0.52 and a 0.80 m RMSE. We assessed the applicability of the national chlorophyll model for ecological analysis by comparing the total phosphorus-chlorophyll relationship with chlorophyll determined from sampling or remote sensing, which showed the total phosphorus-chlorophyll relationship had an adjusted R2 = 0.58 and 1.02 ln-transformed μg/L RMSE with sampled chlorophyll versus an adjusted R2 = 0.56 and 1.04 ln-transformed μg/L RMSE with chlorophyll determined by the boosted regression tree remote sensing model. For Great Lakes models, the MODIS BRT model predicted chlorophyll most accurately of the three BRT models and compared well to other models in the literature. BRT models for Landsat ETM+ and TM more accurately predicted chlorophyll than the MSS model and all Landsat models had favorable results when compared to the literature.
BRT chlorophyll predictive models are useful in helping to understand historical, long-term chlorophyll trends and to inform us of how climate change may alter ecosystems in the future. In inner Saginaw Bay, annual average and upper quartile Landsat-derived chlorophyll decreased from 7.44 to 6.62 and 8.38 to 7.38 μg/L between 1973-1982, and annual upper quartile of 8-day phosphorus loads increased from 5.29 to 6.79 kg between 1973-2012. Simple linear and multiple regression models and Wilcoxon rank test results for MODIS and Landsat-derived chlorophyll indicate that distance from the Saginaw River mouth influences chlorophyll concentration in Saginaw Bay; Landsat-derived surface water temperature and phosphorus loads to a lesser extent. Mixed-effect models for MODIS and Landsat-derived chlorophyll were related to chlorophyll better than simple linear or multiple regressions, with random effects of pixel and sample date contributing substantially to predictive power (NSE=0.35-70), though phosphorus loads, distance to Saginaw River mouth, and water were significant fixed effects in most models. Water quality changes in Saginaw Bay between 1972-2012 were influenced by phosphorus loading and distance to the Saginaw River's mouth. Landsat and MODIS imagery are complementary platforms because of the long history of Landsat operation and the finer spectral resolution and image frequency of MODIS. Remote sensing water quality assessment tools can be valuable for limnological study, ecological assessment, and water resource management.
Duncan, Dustin T.; Kawachi, Ichiro; Kum, Susan; Aldstadt, Jared; Piras, Gianfranco; Matthews, Stephen A.; Arbia, Giuseppe; Castro, Marcia C.; White, Kellee; Williams, David R.
2017-01-01
The racial/ethnic and income composition of neighborhoods often influences local amenities, including the potential spatial distribution of trees, which are important for population health and community wellbeing, particularly in urban areas. This ecological study used spatial analytical methods to assess the relationship between neighborhood socio-demographic characteristics (i.e. minority racial/ethnic composition and poverty) and tree density at the census tract level in Boston, Massachusetts (US). We examined spatial autocorrelation with the Global Moran’s I for all study variables and in the ordinary least squares (OLS) regression residuals as well as computed Spearman correlations non-adjusted and adjusted for spatial autocorrelation between socio-demographic characteristics and tree density. Next, we fit traditional regressions (i.e. OLS regression models) and spatial regressions (i.e. spatial simultaneous autoregressive models), as appropriate. We found significant positive spatial autocorrelation for all neighborhood socio-demographic characteristics (Global Moran’s I range from 0.24 to 0.86, all P=0.001), for tree density (Global Moran’s I=0.452, P=0.001), and in the OLS regression residuals (Global Moran’s I range from 0.32 to 0.38, all P<0.001). Therefore, we fit the spatial simultaneous autoregressive models. There was a negative correlation between neighborhood percent non-Hispanic Black and tree density (rS=−0.19; conventional P-value=0.016; spatially adjusted P-value=0.299) as well as a negative correlation between predominantly non-Hispanic Black (over 60% Black) neighborhoods and tree density (rS=−0.18; conventional P-value=0.019; spatially adjusted P-value=0.180).
While the conventional OLS regression model found a marginally significant inverse relationship between Black neighborhoods and tree density, we found no statistically significant relationship between neighborhood socio-demographic composition and tree density in the spatial regression models. Methodologically, our study suggests the need to take into account spatial autocorrelation as findings/conclusions can change when the spatial autocorrelation is ignored. Substantively, our findings suggest no need for policy intervention vis-à-vis trees in Boston, though we hasten to add that replication studies, and more nuanced data on tree quality, age and diversity are needed. PMID:29354668
Huang, C.; Townshend, J.R.G.
2003-01-01
A stepwise regression tree (SRT) algorithm was developed for approximating complex nonlinear relationships. Based on the regression tree of Breiman et al. (BRT) and a stepwise linear regression (SLR) method, this algorithm represents an improvement over SLR in that it can approximate nonlinear relationships and over BRT in that it gives more realistic predictions. The applicability of this method to estimating subpixel forest was demonstrated using three test data sets, on all of which it gave more accurate predictions than SLR and BRT. SRT also generated more compact trees and performed better than or at least as well as BRT at all 10 equal forest proportion intervals ranging from 0 to 100%. This method is appealing for estimating subpixel land cover over large areas.
NASA Astrophysics Data System (ADS)
Tian, Fang; Cao, Xianyong; Dallmeyer, Anne; Zhao, Yan; Ni, Jian; Herzschuh, Ulrike
2017-01-01
Temporal and spatial stability of the vegetation-climate relationship is a basic ecological assumption for pollen-based quantitative inferences of past climate change and for predicting future vegetation. We explore this assumption for the Holocene in eastern continental Asia (China, Mongolia). Boosted regression trees (BRT) between fossil pollen taxa percentages (Abies, Artemisia, Betula, Chenopodiaceae, Cyperaceae, Ephedra, Picea, Pinus, Poaceae and Quercus) and climate model outputs of mean annual precipitation (Pann) and mean temperature of the warmest month (Mtwa) for 9 and 6 ka (ka = thousand years before present) were set up and results compared to those obtained from relating modern pollen to modern climate. Overall, our results reveal only slight temporal differences in the pollen-climate relationships. Our analyses suggest that the importance of Pann compared with Mtwa for taxa distribution is higher today than it was at 6 ka and 9 ka. In particular, the relevance of Pann for Picea and Pinus increases and has become the main determinant. This change in the climate-tree pollen relationship parallels a widespread tree pollen decrease in north-central China and the eastern Tibetan Plateau. We assume that this is at least partly related to vegetation-climate disequilibrium originating from human impact. Increased atmospheric CO2 concentration may have permitted the expansion of moisture-loving herb taxa (Cyperaceae and Poaceae) during the late Holocene into arid/semi-arid areas. We furthermore find that the pollen-climate relationship between north-central China and the eastern Tibetan Plateau is generally similar, but that regional differences are larger than temporal differences. In summary, vegetation-climate relationships in China are generally stable in space and time, and pollen-based climate reconstructions can be applied to the Holocene. Regional differences imply the calibration-set should be restricted spatially.
ERIC Educational Resources Information Center
Koon, Sharon; Petscher, Yaacov
2015-01-01
The purpose of this report was to explicate the use of logistic regression and classification and regression tree (CART) analysis in the development of early warning systems. It was motivated by state education leaders' interest in maintaining high classification accuracy while simultaneously improving practitioner understanding of the rules by…
Eric H. Wharton; Tiberius Cunia
1987-01-01
Proceedings of a workshop co-sponsored by the USDA Forest Service, the State University of New York, and the Society of American Foresters. Presented were papers on the methodology of sample tree selection, tree biomass measurement, construction of biomass tables and estimation of their error, and combining the error of biomass tables with that of the sample plots or...
Chen, Guangchao; Li, Xuehua; Chen, Jingwen; Zhang, Ya-Nan; Peijnenburg, Willie J G M
2014-12-01
Biodegradation is the principal environmental dissipation process of chemicals. As such, it is a dominant factor determining the persistence and fate of organic chemicals in the environment, and is therefore of critical importance to chemical management and regulation. In the present study, the authors developed in silico methods assessing biodegradability based on a large heterogeneous set of 825 organic compounds, using the techniques of the C4.5 decision tree, the functional inner regression tree, and logistic regression. External validation was subsequently carried out by 2 independent test sets of 777 and 27 chemicals. As a result, the functional inner regression tree exhibited the best predictability with predictive accuracies of 81.5% and 81.0%, respectively, on the training set (825 chemicals) and test set I (777 chemicals). Performance of the developed models on the 2 test sets was subsequently compared with that of the Estimation Program Interface (EPI) Suite Biowin 5 and Biowin 6 models, which also showed a better predictability of the functional inner regression tree model. The model built in the present study exhibits a reasonable predictability compared with existing models while possessing a transparent algorithm. Interpretation of the mechanisms of biodegradation was also carried out based on the models developed. © 2014 SETAC.
Filgueiras, Paulo R; Terra, Luciana A; Castro, Eustáquio V R; Oliveira, Lize M S L; Dias, Júlio C M; Poppi, Ronei J
2015-09-01
This paper aims to estimate the temperature equivalent to 10% (T10%), 50% (T50%) and 90% (T90%) of distilled volume in crude oils using (1)H NMR and support vector regression (SVR). Confidence intervals for the predicted values were calculated using a boosting-type ensemble method in a procedure called ensemble support vector regression (eSVR). The estimated confidence intervals obtained by eSVR were compared with previously accepted calculations from partial least squares (PLS) models and a boosting-type ensemble applied in the PLS method (ePLS). By using the proposed boosting strategy, it was possible to identify outliers in the T10% property dataset. The eSVR procedure improved the accuracy of the distillation temperature predictions in relation to standard PLS, ePLS and SVR. For T10%, a root mean square error of prediction (RMSEP) of 11.6°C was obtained in comparison with 15.6°C for PLS, 15.1°C for ePLS and 28.4°C for SVR. The RMSEPs for T50% were 24.2°C, 23.4°C, 22.8°C and 14.4°C for PLS, ePLS, SVR and eSVR, respectively. For T90%, the values of RMSEP were 39.0°C, 39.9°C and 39.9°C for PLS, ePLS, SVR and eSVR, respectively. The confidence intervals calculated by the proposed boosting methodology presented acceptable values for the three properties analyzed; however, they were lower than those calculated by the standard methodology for PLS. Copyright © 2015 Elsevier B.V. All rights reserved.
Suchetana, Bihu; Rajagopalan, Balaji; Silverstein, JoAnn
2017-11-15
A regression tree-based diagnostic approach is developed to evaluate factors affecting US wastewater treatment plant compliance with ammonia discharge permit limits using Discharge Monthly Report (DMR) data from a sample of 106 municipal treatment plants for the period of 2004-2008. Predictor variables used to fit the regression tree are selected using random forests, and consist of the previous month's effluent ammonia, influent flow rates and plant capacity utilization. The tree models are first used to evaluate compliance with existing ammonia discharge standards at each facility and then applied assuming more stringent discharge limits, which are under consideration in many states. The model predicts that the ability to meet both current and future limits depends primarily on the previous month's treatment performance. With more stringent discharge limits, the predicted ammonia concentration relative to the discharge limit increases. In-sample validation shows that the regression trees can provide a median classification accuracy of >70%. The regression tree model is validated using ammonia discharge data from an operating wastewater treatment plant and is able to accurately predict the observed ammonia discharge category approximately 80% of the time, indicating that the regression tree model can be applied to predict compliance for individual treatment plants, providing practical guidance for utilities and regulators with an interest in controlling ammonia discharges. The proposed methodology is also used to demonstrate how to delineate reliable sources of demand and supply in a point source-to-point source nutrient credit trading scheme, as well as how planners and decision makers can set reasonable discharge limits in the future. Copyright © 2017 Elsevier B.V. All rights reserved.
Dynamic travel time estimation using regression trees.
DOT National Transportation Integrated Search
2008-10-01
This report presents a methodology for travel time estimation by using regression trees. The dissemination of travel time information has become crucial for effective traffic management, especially under congested road conditions. In the absence of c...
Jose F. Negron
1998-01-01
Infested and uninfested areas within Douglas-fir, Pseudotsuga menziesii (Mirb.) Franco, stands affected by the Douglas-fir beetle, Dendroctonus pseudotsugae Hopk., were sampled in the Colorado Front Range, CO. Classification tree models were built to predict probabilities of infestation. Regression trees and linear regression analysis were used to model amount of tree...
Using nonlinear quantile regression to estimate the self-thinning boundary curve
Quang V. Cao; Thomas J. Dean
2015-01-01
The relationship between tree size (quadratic mean diameter) and tree density (number of trees per unit area) has been a topic of research and discussion for many decades. Starting with Reineke in 1933, the maximum size-density relationship, on a log-log scale, has been assumed to be linear. Several techniques, including linear quantile regression, have been employed...
Shrinkage Degree in $L_{2}$-Rescale Boosting for Regression.
Xu, Lin; Lin, Shaobo; Wang, Yao; Xu, Zongben
2017-08-01
$L_{2}$-rescale boosting ($L_{2}$-RBoosting) is a variant of $L_{2}$-Boosting, which can essentially improve the generalization performance of $L_{2}$-Boosting. The key feature of $L_{2}$-RBoosting lies in introducing a shrinkage degree to rescale the ensemble estimate in each iteration. Thus, the shrinkage degree determines the performance of $L_{2}$-RBoosting. The aim of this paper is to develop a concrete analysis concerning how to determine the shrinkage degree in $L_{2}$-RBoosting. We propose two feasible ways to select the shrinkage degree. The first one is to parameterize the shrinkage degree and the other one is to develop a data-driven approach. After rigorously analyzing the importance of the shrinkage degree in $L_{2}$-RBoosting, we compare the pros and cons of the proposed methods. We find that although these approaches can reach the same learning rates, the structure of the final estimator of the parameterized approach is better, which sometimes yields a better generalization capability when the number of samples is finite. With this, we recommend parameterizing the shrinkage degree of $L_{2}$-RBoosting. We also present an adaptive parameter-selection strategy for the shrinkage degree and verify its feasibility through both theoretical analysis and numerical verification. The obtained results enhance the understanding of $L_{2}$-RBoosting and give guidance on how to use it for regression tasks.
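The abstract does not spell out the update rule, so the following is only a minimal sketch of what "rescaling the ensemble estimate in each iteration" could look like. It assumes a single-column least-squares learner, a centered response, and a parameterized shrinkage degree alpha_k = 2 / (k + u) with u > 1; the function name `rescale_boost`, the parameter `u` and this schedule are illustrative assumptions, not the authors' implementation.

```python
def rescale_boost(X, y, n_iter=100, u=2.0):
    """Sketch of boosting with a rescaling step (assumed form, not the paper's code).

    In iteration k the current estimate is first shrunk by (1 - alpha_k),
    using the assumed schedule alpha_k = 2 / (k + u); a single-column
    least-squares fit to the resulting residual is then added.
    X is a list of rows, y a centered response; returns the coefficients.
    """
    n, p = len(X), len(X[0])
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    coef = [0.0] * p
    fit = [0.0] * n
    for k in range(1, n_iter + 1):
        alpha = 2.0 / (k + u)                     # shrinkage degree (assumed schedule)
        coef = [(1.0 - alpha) * c for c in coef]  # rescale the ensemble estimate
        fit = [(1.0 - alpha) * f for f in fit]
        resid = [y[i] - fit[i] for i in range(n)]
        best_j, best_b, best_gain = None, 0.0, 0.0
        for j in range(p):
            if col_ss[j] == 0.0:
                continue
            xr = sum(X[i][j] * resid[i] for i in range(n))
            gain = xr * xr / col_ss[j]            # drop in residual sum of squares
            if gain > best_gain:
                best_j, best_b, best_gain = j, xr / col_ss[j], gain
        if best_j is None:
            break
        coef[best_j] += best_b                    # add the selected learner
        fit = [fit[i] + best_b * X[i][best_j] for i in range(n)]
    return coef
```

Because alpha_k shrinks toward zero as k grows, later iterations rescale the estimate less and less, which is one way to read the role the abstract assigns to the shrinkage degree.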
NASA Astrophysics Data System (ADS)
Bourrel, Luc; Brodu, Nicolas; Frappart, Frédéric
2016-04-01
Remotely sensed images allow frequent monitoring of land cover variations at regional and global scales. The recently launched Sentinel-1 satellite offers global coverage of land areas at unprecedented spatial (20 m) and temporal (6 days at the Equator) resolution. We propose here to compare the performance of commonly used supervised classification techniques (i.e., k-nearest neighbors, linear and Gaussian support vector machines, naive Bayes, linear and quadratic discriminant analyses, adaptive boosting, logit regression, ridge regression with one-vs-one voting, random forest, extremely randomized trees) for land cover applications in the Guayas Basin, the largest river basin on the Pacific coast of Ecuador (area ~32,000 km²). This region was chosen for its importance to the Ecuadorian economy: its watershed represents 13% of the total area of Ecuador and is home to 40% of the Ecuadorian population. It is also the most productive region of Ecuador for agriculture and aquaculture. Fifty percent of the country's shrimp farming production comes from this watershed and, together with agriculture, represents the largest source of revenue of the country. Similar comparisons are also performed using ENVISAT ASAR images acquired in global mode (1 km spatial resolution). The accuracy of the results will be assessed using a land cover map derived from multi-spectral images.
NASA Astrophysics Data System (ADS)
Drzewiecki, Wojciech
2017-12-01
We evaluated the performance of nine machine learning regression algorithms and their ensembles for sub-pixel estimation of impervious area coverage from Landsat imagery. The accuracy of imperviousness mapping at individual time points was assessed based on RMSE, MAE and R². These measures were also used for the assessment of imperviousness change intensity estimations. The applicability for detection of relevant changes in impervious area coverage at sub-pixel level was evaluated using overall accuracy, F-measure and ROC Area Under Curve. The results showed that the Cubist algorithm may be advised for Landsat-based mapping of imperviousness for single dates. Stochastic gradient boosting of regression trees (GBM) may also be considered for this purpose. However, the Random Forest algorithm is endorsed for both imperviousness change detection and mapping of its intensity. In all applications the heterogeneous model ensembles performed at least as well as the best individual models or better. They may be recommended for improving the quality of sub-pixel imperviousness and imperviousness change mapping. The study also revealed limitations of the investigated methodology for detection of subtle changes of imperviousness inside the pixel. None of the tested approaches was able to reliably classify changed and non-changed pixels if the relevant change threshold was set at one or three percent. Also, for a five percent change threshold most of the algorithms did not ensure that the accuracy of the change map is higher than the accuracy of a random classifier. For a relevant change threshold of ten percent, all approaches performed satisfactorily.
Meteorological Factors for Dengue Fever Control and Prevention in South China.
Gu, Haogao; Leung, Ross Ka-Kit; Jing, Qinlong; Zhang, Wangjian; Yang, Zhicong; Lu, Jiahai; Hao, Yuantao; Zhang, Dingmei
2016-08-31
Dengue fever (DF) is endemic in Guangzhou and has been circulating for decades, causing significant economic loss. DF prevention mainly relies on mosquito control and change in lifestyle. However, alert fatigue may partially limit the success of these countermeasures. This study investigated the delayed effect of meteorological factors, as well as the relationships between five climatic variables and the risk for DF by boosted regression trees (BRT) over the period of 2005-2011, to determine the best timing and strategy for adapting such preventive measures. The most important meteorological factor was daily average temperature. We used BRT to investigate the lagged relationship between dengue clinical burden and climatic variables, with the 58 and 62 day lag models attaining the largest area under the curve. The climatic factors presented similar patterns between these two lag models, which can be used as references for DF prevention in the early stage. Our results facilitate the development of the Mosquito Breeding Risk Index for early warning systems. The availability of meteorological data and modeling methods enables the extension of the application to other vector-borne diseases endemic in tropical and subtropical countries.
NASA Astrophysics Data System (ADS)
Painter Jones, Matilda; Green, Mattias; Gove, Jamison; Williams, Gareth
2017-04-01
The ocean is saturated with internal waves at tidal frequency. The energy associated with conversion from barotropic to baroclinic can enhance mixing and upwelling at sites of generation and dissipation, which in turn can drive primary production. Hotspots of internal wave generation are located at sudden changes in topography, with the Hawaiian archipelago identified as an area of intense internal wave activity. The role of internal waves as a driver of benthic reef communities is unexplored and could be key to coral reef survival in the unknown future. Using a Pacific-wide map of internal wave flux and barotropic-to-baroclinic conversion at an unprecedented 1/30th degree resolution, energy budgets were developed for four islands to evaluate dissipation and generation of internal waves. Spatiotemporal variations in benthic community structure were plotted around each island and related to changes in internal wave energetics using a boosted regression tree. Contrasting spatial patterns and species assemblages were seen around islands with distinct internal wave regimes. The relative importance and influence of internal waves on coral reef ecosystems is evaluated.
NASA Astrophysics Data System (ADS)
Hu, Yao; Quinn, Christopher J.; Cai, Ximing; Garfinkle, Noah W.
2017-11-01
For agent-based modeling, the major challenges in deriving agents' behavioral rules arise from agents' bounded rationality and data scarcity. This study proposes a "gray box" approach to address the challenge by incorporating expert domain knowledge (i.e., human intelligence) with machine learning techniques (i.e., machine intelligence). Specifically, we propose using directed information graph (DIG), boosted regression trees (BRT), and domain knowledge to infer causal factors and identify behavioral rules from data. A case study is conducted to investigate farmers' pumping behavior in the Midwest, U.S.A. Results show that four factors identified by the DIG algorithm (corn price, underlying groundwater level, monthly mean temperature and precipitation) have main causal influences on agents' decisions on monthly groundwater irrigation depth. The agent-based model is then developed based on the behavioral rules represented by three DIGs and modeled by BRTs, and coupled with a physically-based groundwater model to investigate the impacts of agents' pumping behavior on the underlying groundwater system in the context of coupled human and environmental systems.
Predicting the limits to tree height using statistical regressions of leaf traits.
Burgess, Stephen S O; Dawson, Todd E
2007-01-01
Leaf morphology and physiological functioning demonstrate considerable plasticity within tree crowns, with various leaf traits often exhibiting pronounced vertical gradients in very tall trees. It has been proposed that the trajectory of these gradients, as determined by regression methods, could be used in conjunction with theoretical biophysical limits to estimate the maximum height to which trees can grow. Here, we examined this approach using published and new experimental data from tall conifer and angiosperm species. We showed that height predictions were sensitive to tree-to-tree variation in the shape of the regression and to the biophysical endpoints selected. We examined the suitability of proposed end-points and their theoretical validity. We also noted that site and environment influenced height predictions considerably. Use of leaf mass per unit area or leaf water potential coupled with vulnerability of twigs to cavitation poses a number of difficulties for predicting tree height. Photosynthetic rate and carbon isotope discrimination show more promise, but in the second case, the complex relationship between light, water availability, photosynthetic capacity and internal conductance to CO₂ must first be characterized.
Lucini, Filipe R; S Fogliatto, Flavio; C da Silveira, Giovani J; L Neyeloff, Jeruza; Anzanello, Michel J; de S Kuchenbecker, Ricardo; D Schaan, Beatriz
2017-04-01
Emergency department (ED) overcrowding is a serious issue for hospitals. Early information on short-term inward bed demand from patients receiving care at the ED may reduce the overcrowding problem, and optimize the use of hospital resources. In this study, we use text mining methods to process data from early ED patient records using the SOAP framework, and predict future hospitalizations and discharges. We try different approaches for pre-processing text records and for predicting hospitalization. Sets-of-words are obtained via binary representation, term frequency, and term frequency-inverse document frequency. Unigrams, bigrams and trigrams are tested for feature formation. Feature selection is based on χ² and F-score metrics. In the prediction module, eight text mining methods are tested: Decision Tree, Random Forest, Extremely Randomized Tree, AdaBoost, Logistic Regression, Multinomial Naïve Bayes, Support Vector Machine (Kernel linear) and Nu-Support Vector Machine (Kernel linear). Prediction performance is evaluated by F1-scores. Precision and Recall values are also reported for all text mining methods tested. Nu-Support Vector Machine was the text mining method with the best overall performance. Its average F1-score in predicting hospitalization was 77.70%, with a standard deviation (SD) of 0.66%. The method could be used to manage daily routines in EDs such as capacity planning and resource allocation. Text mining could provide valuable information and facilitate decision-making by inward bed management teams. Copyright © 2017 Elsevier Ireland Ltd. All rights reserved.
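The three text representations named in this abstract (binary, term frequency, and term frequency-inverse document frequency over n-grams) can be sketched in a few lines. This is an illustration only: the helper names `ngrams` and `featurize`, whitespace tokenization, and the weighting idf = log(N / df) are assumptions, since the abstract does not state which variant the authors used, and the example notes are invented.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def featurize(docs, n=1, scheme="tfidf"):
    """Turn each document into a {term: weight} dict.

    scheme is "binary" (term present), "tf" (raw count) or "tfidf"
    (count times log(N / document frequency), one common variant).
    """
    counts = [Counter(ngrams(d.lower().split(), n)) for d in docs]
    doc_freq = Counter(term for c in counts for term in c)
    n_docs = len(docs)
    features = []
    for c in counts:
        if scheme == "binary":
            features.append({t: 1.0 for t in c})
        elif scheme == "tf":
            features.append({t: float(k) for t, k in c.items()})
        else:
            features.append({t: k * math.log(n_docs / doc_freq[t])
                             for t, k in c.items()})
    return features

# Hypothetical triage notes, used only to exercise the functions.
notes = ["chest pain chest", "pain fever"]
unigram_tfidf = featurize(notes, n=1, scheme="tfidf")
bigram_binary = featurize(notes, n=2, scheme="binary")
```

With this weighting, a term that occurs in every document gets weight zero, which is why such representations are usually followed by a feature selection step like the one the abstract describes.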
L2-Boosting algorithm applied to high-dimensional problems in genomic selection.
González-Recio, Oscar; Weigel, Kent A; Gianola, Daniel; Naya, Hugo; Rosa, Guilherme J M
2010-06-01
The L2-Boosting algorithm is one of the most promising machine-learning techniques that has appeared in recent decades. It may be applied to high-dimensional problems such as whole-genome studies, and it is relatively simple from a computational point of view. In this study, we used this algorithm in a genomic selection context to make predictions of yet to be observed outcomes. Two data sets were used: (1) productive lifetime predicted transmitting abilities from 4702 Holstein sires genotyped for 32 611 single nucleotide polymorphisms (SNPs) derived from the Illumina BovineSNP50 BeadChip, and (2) progeny averages of food conversion rate, pre-corrected by environmental and mate effects, in 394 broilers genotyped for 3481 SNPs. Each of these data sets was split into training and testing sets, the latter comprising dairy or broiler sires whose ancestors were in the training set. Two weak learners, ordinary least squares (OLS) and non-parametric (NP) regression were used for the L2-Boosting algorithm, to provide a stringent evaluation of the procedure. This algorithm was compared with BL [Bayesian LASSO (least absolute shrinkage and selection operator)] and BayesA regression. Learning tasks were carried out in the training set, whereas validation of the models was performed in the testing set. Pearson correlations between predicted and observed responses in the dairy cattle (broiler) data set were 0.65 (0.33), 0.53 (0.37), 0.66 (0.26) and 0.63 (0.27) for OLS-Boosting, NP-Boosting, BL and BayesA, respectively. The smallest bias and mean-squared errors (MSEs) were obtained with OLS-Boosting in both the dairy cattle (0.08 and 1.08, respectively) and broiler (-0.011 and 0.006) data sets. In the dairy cattle data set, the BL was more accurate (bias=0.10 and MSE=1.10) than BayesA (bias=1.26 and MSE=2.81), whereas no differences between these two methods were found in the broiler data set. L2-Boosting with a suitable learner was found to be a competitive alternative for genomic selection applications, providing high accuracy and low bias in genomic-assisted evaluations with a relatively short computational time.
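As a rough illustration of boosting with a least-squares weak learner, the sketch below fits the current residuals with the single best predictor in each iteration and adds a shrunken version of that fit. The function names, the shrinkage factor `nu` and the iteration count are assumptions chosen for illustration and are not taken from the study.

```python
def l2_boost(X, y, n_iter=200, nu=0.1):
    """Componentwise L2-Boosting sketch (illustrative, not the study's code).

    Each iteration regresses the current residuals on every column
    separately (no-intercept least squares), keeps the column that
    reduces the residual sum of squares most, and moves its coefficient
    by a fraction nu of the fitted slope.
    """
    n, p = len(X), len(X[0])
    intercept = sum(y) / n
    coef = [0.0] * p
    resid = [yi - intercept for yi in y]
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_iter):
        best_j, best_b, best_gain = None, 0.0, 0.0
        for j in range(p):
            if col_ss[j] == 0.0:
                continue
            xr = sum(X[i][j] * resid[i] for i in range(n))
            gain = xr * xr / col_ss[j]       # RSS reduction of a full step
            if gain > best_gain:
                best_j, best_b, best_gain = j, xr / col_ss[j], gain
        if best_j is None:                   # residuals orthogonal to all columns
            break
        coef[best_j] += nu * best_b
        resid = [resid[i] - nu * best_b * X[i][best_j] for i in range(n)]
    return intercept, coef

def predict(intercept, coef, X):
    """Linear prediction from the boosted coefficients."""
    return [intercept + sum(c * v for c, v in zip(coef, row)) for row in X]
```

Because only one coefficient changes per iteration, stopping early leaves most coefficients at zero, which is what makes this kind of learner usable when predictors (for example SNPs) far outnumber observations.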
L.R. Iverson; A.M. Prasad; A. Liaw
2004-01-01
More and better machine learning tools are becoming available for landscape ecologists to aid in understanding species-environment relationships and to map probable species occurrence now and potentially into the future. To that end, we evaluated three statistical models: Regression Tree Analysis (RTA), Bagging Trees (BT) and Random Forest (RF) for their utility in...
Equations for predicting biomass in 2- to 6-year-old Eucalyptus saligna in Hawaii
Craig D. Whitesell; Susan C. Miyasaka; Robert F. Strand; Thomas H. Schubert; Katharine E. McDuffie
1988-01-01
Eucalyptus saligna trees grown in short-rotation plantations on the island of Hawaii were measured, harvested, and weighed to provide data for developing regression equations using non-destructive stand measurements. Regression analysis of the data from 190 trees in the 2.0- to 3.5-year range and 96 trees in the 4- to 6-year range related stem-only...
Bianca N.I. Eskelson; Hailemariam Temesgen; Tara M. Barrett
2009-01-01
Cavity tree and snag abundance data are highly variable and contain many zero observations. We predict cavity tree and snag abundance from variables that are readily available from forest cover maps or remotely sensed data using negative binomial (NB), zero-inflated NB, and zero-altered NB (ZANB) regression models as well as nearest neighbor (NN) imputation methods....
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wong, Philip; Lambert, Christine; Agnihotram, Ramanakumar V.
Purpose: Local recurrence (LR) of ductal carcinoma in situ (DCIS) is reduced by whole-breast irradiation after breast-conserving surgery (BCS). However, the benefit of adding a radiotherapy boost to the surgical cavity for DCIS is unclear. We sought to determine the impact of the boost on LR in patients with DCIS treated at the McGill University Health Centre. Methods and Materials: A total of 220 consecutive cases of DCIS treated with BCS and radiotherapy between January 2000 and December 2006 were reviewed. Of the patients, 36% received a radiotherapy boost to the surgical cavity. Median follow-up was 46 months for the boost and no-boost groups. Kaplan-Meier survival analyses and Cox regression analyses were performed. Results: Compared with the no-boost group, patients in the boost group more frequently had positive and <0.1-cm margins (48% vs. 8%) (p < 0.0001) and more frequently were in higher-risk categories as defined by the Van Nuys Prognostic (VNP) index (p = 0.006). Despite being at higher risk for LR, none (0/79) of the patients who received a boost experienced LR, whereas 8 of 141 patients who did not receive a boost experienced an in-breast LR (log-rank p = 0.03). Univariate analysis of prognostic factors (age, tumor size, margin status, histological grade, necrosis, and VNP risk category) revealed only the presence of necrosis to significantly correlate with LR (log-rank p = 0.003). The whole-breast irradiation dose and fractionation schedule did not affect LR rate. Conclusions: Our results suggest that the use of a radiotherapy boost improves local control in DCIS and may outweigh the poor prognostic effect of necrosis.
Deploying a quantum annealing processor to detect tree cover in aerial imagery of California
Basu, Saikat; Ganguly, Sangram; Michaelis, Andrew; Mukhopadhyay, Supratik; Nemani, Ramakrishna R.
2017-01-01
Quantum annealing is an experimental and potentially breakthrough computational technology for handling hard optimization problems, including problems of computer vision. We present a case study in training a production-scale classifier of tree cover in remote sensing imagery, using early-generation quantum annealing hardware built by D-Wave Systems, Inc. Beginning within a known boosting framework, we train decision stumps on texture features and vegetation indices extracted from four-band, one-meter-resolution aerial imagery from the state of California. We then impose a regulated quadratic training objective to select an optimal voting subset from among these stumps. The votes of the subset define the classifier. For optimization, the logical variables in the objective function map to quantum bits in the hardware device, while quadratic couplings encode as the strength of physical interactions between the quantum bits. Hardware design limits the number of couplings between these basic physical entities to five or six. To account for this limitation in mapping large problems to the hardware architecture, we propose a truncation and rescaling of the training objective through a trainable metaparameter. The boosting process on our basic 108- and 508-variable problems, thus constituted, returns classifiers that incorporate a diverse range of color- and texture-based metrics and discriminate tree cover with accuracies as high as 92% in validation and 90% on a test scene encompassing the open space preserves and dense suburban build of Mill Valley, CA. PMID:28241028
Improving LHC searches for dark photons using lepton-jet substructure
NASA Astrophysics Data System (ADS)
Barello, G.; Chang, Spencer; Newby, Christopher A.; Ostdiek, Bryan
2017-03-01
Collider signals of dark photons are an exciting probe for new gauge forces and are characterized by events with boosted lepton jets. Existing techniques are efficient in searching for muonic lepton jets but due to substantial backgrounds have difficulty constraining lepton jets containing only electrons. This is unfortunate since upcoming intensity frontier experiments are sensitive to dark photon masses which only allow electron decays. Analyzing a recently proposed model of kinetic mixing, with new scalar particles decaying into dark photons, we find that existing techniques for electron jets can be substantially improved. We show that using lepton-jet-substructure variables, in association with a boosted decision tree, improves background rejection, significantly increasing the LHC's reach for dark photons in this region of parameter space.
Lo, Benjamin W Y; Fukuda, Hitoshi; Angle, Mark; Teitelbaum, Jeanne; Macdonald, R Loch; Farrokhyar, Forough; Thabane, Lehana; Levine, Mitchell A H
2016-01-01
Classification and regression tree analysis involves the creation of a decision tree by recursive partitioning of a dataset into more homogeneous subgroups. Thus far, there is scarce literature on using this technique to create clinical prediction tools for aneurysmal subarachnoid hemorrhage (SAH). The classification and regression tree analysis technique was applied to the multicenter Tirilazad database (3551 patients) in order to create the decision-making algorithm. In order to elucidate prognostic subgroups in aneurysmal SAH, neurologic, systemic, and demographic factors were taken into account. The dependent variable used for analysis was the dichotomized Glasgow Outcome Score at 3 months. Classification and regression tree analysis revealed seven prognostic subgroups. Neurological grade, occurrence of post-admission stroke, occurrence of post-admission fever, and age represented the explanatory nodes of this decision tree. Split sample validation revealed classification accuracy of 79% for the training dataset and 77% for the testing dataset. In addition, the occurrence of fever at 1-week post-aneurysmal SAH is associated with increased odds of post-admission stroke (odds ratio: 1.83, 95% confidence interval: 1.56-2.45, P < 0.01). A clinically useful classification tree was generated, which serves as a prediction tool to guide bedside prognostication and clinical treatment decision making. This prognostic decision-making algorithm also shed light on the complex interactions between a number of risk factors in determining outcome after aneurysmal SAH.
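Recursive partitioning into more homogeneous subgroups, as described in this abstract, can be sketched as follows. The sketch assumes binary outcomes, numeric predictors and Gini impurity as the split criterion (the abstract does not name the criterion used); the function names and any data passed to them are hypothetical and unrelated to the Tirilazad database.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 2.0 * p1 * (1.0 - p1)

def best_split(rows, labels):
    """Return (feature, threshold, weighted_gini) of the best split x[f] <= t.

    feature is None when no split lowers the impurity of the node.
    """
    n = len(rows)
    best = (None, None, gini(labels))
    for f in range(len(rows[0])):
        values = sorted(set(r[f] for r in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0              # midpoint between neighbouring values
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (f, t, score)
    return best

def grow(rows, labels, max_depth=3):
    """Recursively partition the data; leaves hold the majority class."""
    f, t, _ = best_split(rows, labels)
    if max_depth == 0 or f is None:
        return int(2 * sum(labels) >= len(labels))
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {"feature": f, "threshold": t,
            "left": grow([r for r, _ in left], [y for _, y in left], max_depth - 1),
            "right": grow([r for r, _ in right], [y for _, y in right], max_depth - 1)}

def predict(tree, row):
    """Follow the splits from the root until a leaf (an int) is reached."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree
```

Each internal node of the returned dictionary corresponds to one explanatory node of the kind the abstract lists (for example neurological grade or age), and each leaf to one prognostic subgroup.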
Assessment of timber availability from forest restoration within the Blue Mountains of Oregon.
Robert Rainville; Rachel White; Jamie Barbour
2008-01-01
Changes in forest management have detrimentally affected the economic health of small communities in the Blue Mountain region of Oregon over the past few decades. A build-up of small trees threatens the ecological health of these forests and increases wildland fire hazard. Hoping to boost their economies and also restore these forests, local leaders are interested in...
NASA Astrophysics Data System (ADS)
Zhao, Fengjun; Liu, Junting; Qu, Xiaochao; Xu, Xianhui; Chen, Xueli; Yang, Xiang; Cao, Feng; Liang, Jimin; Tian, Jie
2014-12-01
To solve the multicollinearity issue and unequal contribution of vascular parameters for the quantification of angiogenesis, we developed a quantitative evaluation method of vascular parameters for angiogenesis based on in vivo micro-CT imaging of hindlimb ischemic model mice. Taking vascular volume as the ground truth parameter, nine vascular parameters were first assembled into sparse principal components (PCs) to reduce the multicollinearity issue. Aggregated boosted trees (ABTs) were then employed to analyze the importance of vascular parameters for the quantification of angiogenesis via the loadings of sparse PCs. The results demonstrated that vascular volume was mainly characterized by vascular area, vascular junction, connectivity density, segment number and vascular length, which indicated they were the key vascular parameters for the quantification of angiogenesis. The proposed quantitative evaluation method was compared with both the ABTs directly using the nine vascular parameters and Pearson correlation, which were consistent. In contrast to the ABTs directly using the vascular parameters, the proposed method can select all the key vascular parameters simultaneously, because all the key vascular parameters were assembled into the sparse PCs with the highest relative importance.
Semanjski, Ivana; Gautama, Sidharta
2015-07-03
Mobility management represents one of the most important parts of the smart city concept. The way we travel, at what time of the day, for what purposes and with what transportation modes, have a pertinent impact on the overall quality of life in cities. To manage this process, detailed and comprehensive information on individuals' behaviour is needed as well as effective feedback/communication channels. In this article, we explore the applicability of crowdsourced data for this purpose. We apply a gradient boosting trees algorithm to model individuals' mobility decision making processes (particularly concerning what transportation mode they are likely to use). To accomplish this we rely on data collected from three sources: a dedicated smartphone application, a geographic information systems-based web interface and weather forecast data collected over a period of six months. The applicability of the developed model is seen as a potential platform for personalized mobility management in smart cities and a communication tool between the city (to steer the users towards more sustainable behaviour by additionally weighting preferred suggestions) and users (who can give feedback on the acceptability of the provided suggestions, by accepting or rejecting them, providing an additional input to the learning process).
Amini, Payam; Maroufizadeh, Saman; Samani, Reza Omani; Hamidi, Omid; Sepidarkish, Mahdi
2017-06-01
Preterm birth (PTB) is a leading cause of neonatal death and the second biggest cause of death in children under five years of age. The objective of this study was to determine the prevalence of PTB and its associated factors using logistic regression and decision tree classification methods. This cross-sectional study was conducted on 4,415 pregnant women in Tehran, Iran, from July 6-21, 2015. Data were collected by a researcher-developed questionnaire through interviews with mothers and review of their medical records. To evaluate the accuracy of the logistic regression and decision tree methods, several indices such as sensitivity, specificity, and the area under the curve were used. The PTB rate was 5.5% in this study. The logistic regression outperformed the decision tree for the classification of PTB based on risk factors. Logistic regression showed that multiple pregnancies, mothers with preeclampsia, and those who conceived with assisted reproductive technology had an increased risk for PTB (p < 0.05). Identifying and training mothers at risk as well as improving prenatal care may reduce the PTB rate. We also recommend that statisticians utilize the logistic regression model for the classification of risk groups for PTB.
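The accuracy indices mentioned in this abstract can be computed directly from true labels and model outputs. The sketch below is generic: the function names are illustrative, and the AUC is computed as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for 0/1 labels and 0/1 predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Sensitivity and specificity depend on the cut-off used to turn scores into class predictions, whereas the AUC summarizes ranking quality over all cut-offs, which matters when the outcome is as rare as the 5.5% PTB rate reported here.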
Developing a dengue forecast model using machine learning: A case study in China
Zhang, Qin; Wang, Li; Xiao, Jianpeng; Zhang, Qingying; Luo, Ganfeng; Li, Zhihao; He, Jianfeng; Zhang, Yonghui; Ma, Wenjun
2017-01-01
Background In China, dengue remains an important public health issue with expanded areas and increased incidence recently. Accurate and timely forecasts of dengue incidence in China are still lacking. We aimed to use the state-of-the-art machine learning algorithms to develop an accurate predictive model of dengue. Methodology/Principal findings Weekly dengue cases, Baidu search queries and climate factors (mean temperature, relative humidity and rainfall) during 2011–2014 in Guangdong were gathered. A dengue search index was constructed for developing the predictive models in combination with climate factors. The observed year and week were also included in the models to control for the long-term trend and seasonality. Several machine learning algorithms, including the support vector regression (SVR) algorithm, step-down linear regression model, gradient boosted regression tree algorithm (GBM), negative binomial regression model (NBM), least absolute shrinkage and selection operator (LASSO) linear regression model and generalized additive model (GAM), were used as candidate models to predict dengue incidence. Performance and goodness of fit of the models were assessed using the root-mean-square error (RMSE) and R-squared measures. The residuals of the models were examined using the autocorrelation and partial autocorrelation function analyses to check the validity of the models. The models were further validated using dengue surveillance data from five other provinces. The epidemics during the last 12 weeks and the peak of the 2014 large outbreak were accurately forecasted by the SVR model selected by a cross-validation technique. Moreover, the SVR model had the consistently smallest prediction error rates for tracking the dynamics of dengue and forecasting the outbreaks in other areas in China. Conclusion and significance The proposed SVR model achieved a superior performance in comparison with other forecasting techniques assessed in this study. 
The findings can help the government and community respond early to dengue epidemics. PMID:29036169
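The two goodness-of-fit measures named in the abstract above, RMSE and R-squared, can be sketched as follows. This is a minimal illustration only: the function names and the weekly case counts are hypothetical and are not taken from the study.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between two equal-length sequences."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination, computed as 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical weekly case counts, for illustration only.
observed = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]
print(rmse(observed, predicted), r_squared(observed, predicted))
```

A lower RMSE and a higher R-squared on held-out weeks would favor one candidate model over another, which is the kind of comparison the abstract describes.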
Differences in Risk Factors for Rotator Cuff Tears between Elderly Patients and Young Patients.
Watanabe, Akihisa; Ono, Qana; Nishigami, Tomohiko; Hirooka, Takahiko; Machida, Hirohisa
2018-02-01
It has been unclear whether the risk factors for rotator cuff tears are the same at all ages or differ between young and older populations. In this study, we examined the risk factors for rotator cuff tears using classification and regression tree analysis as methods of nonlinear regression analysis. There were 65 patients in the rotator cuff tears group and 45 patients in the intact rotator cuff group. Classification and regression tree analysis was performed to predict rotator cuff tears. The target factor was rotator cuff tears; explanatory variables were age, sex, trauma, and critical shoulder angle≥35°. In the results of classification and regression tree analysis, the tree was divided at age 64. For patients aged≥64, the tree was divided at trauma. For patients aged<64, the tree was divided at critical shoulder angle≥35°. The odds ratio for critical shoulder angle≥35° was significant for all ages (5.89), and for patients aged<64 (10.3) while trauma was only a significant factor for patients aged≥64 (5.13). Age, trauma, and critical shoulder angle≥35° were related to rotator cuff tears in this study. However, these risk factors showed different trends according to age group, not a linear relationship.
Bayesian Ensemble Trees (BET) for Clustering and Prediction in Heterogeneous Data
Duan, Leo L.; Clancy, John P.; Szczesniak, Rhonda D.
2016-01-01
We propose a novel “tree-averaging” model that utilizes the ensemble of classification and regression trees (CART). Each constituent tree is estimated with a subset of similar data. We treat this grouping of subsets as Bayesian Ensemble Trees (BET) and model them as a Dirichlet process. We show that BET determines the optimal number of trees by adapting to the data heterogeneity. Compared with the other ensemble methods, BET requires much fewer trees and shows equivalent prediction accuracy using weighted averaging. Moreover, each tree in BET provides variable selection criterion and interpretation for each subset. We developed an efficient estimating procedure with improved estimation strategies in both CART and mixture models. We demonstrate these advantages of BET with simulations and illustrate the approach with a real-world data example involving regression of lung function measurements obtained from patients with cystic fibrosis. Supplemental materials are available online. PMID:27524872
Mani, Ashutosh; Rao, Marepalli; James, Kelley; Bhattacharya, Amit
2015-01-01
The purpose of this study was to explore data-driven models, based on decision trees, to develop practical and easy to use predictive models for early identification of firefighters who are likely to cross the threshold of hyperthermia during live-fire training. Predictive models were created for three consecutive live-fire training scenarios. The final predicted outcome was a categorical variable: will a firefighter cross the upper threshold of hyperthermia - Yes/No. Two tiers of models were built, one with and one without taking into account the outcome (whether a firefighter crossed hyperthermia or not) from the previous training scenario. First tier of models included age, baseline heart rate and core body temperature, body mass index, and duration of training scenario as predictors. The second tier of models included the outcome of the previous scenario in the prediction space, in addition to all the predictors from the first tier of models. Classification and regression trees were used independently for prediction. The response variable for the regression tree was the quantitative variable: core body temperature at the end of each scenario. The predicted quantitative variable from regression trees was compared to the upper threshold of hyperthermia (38°C) to predict whether a firefighter would enter hyperthermia. The performance of classification and regression tree models was satisfactory for the second (success rate = 79%) and third (success rate = 89%) training scenarios but not for the first (success rate = 43%). Data-driven models based on decision trees can be a useful tool for predicting physiological response without modeling the underlying physiological systems. Early prediction of heat stress coupled with proactive interventions, such as pre-cooling, can help reduce heat stress in firefighters.
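The step in which a regression tree's predicted core temperature is compared with the 38°C threshold can be sketched as below. The strict "greater than" comparison, the function names, and the sample values are assumptions made for illustration; the abstract does not specify them.

```python
HYPERTHERMIA_THRESHOLD_C = 38.0  # upper threshold stated in the abstract


def crosses_threshold(predicted_temp_c, threshold_c=HYPERTHERMIA_THRESHOLD_C):
    """Turn a predicted core temperature into a Yes/No outcome.

    Assumption: a strict comparison (>) counts as crossing the threshold.
    """
    return predicted_temp_c > threshold_c


def success_rate(predicted_temps_c, actual_outcomes):
    """Fraction of cases where the thresholded prediction matches the outcome."""
    predictions = [crosses_threshold(t) for t in predicted_temps_c]
    correct = sum(p == a for p, a in zip(predictions, actual_outcomes))
    return correct / len(actual_outcomes)


# Hypothetical regression-tree outputs and observed outcomes.
predicted_temps_c = [37.6, 38.4, 38.1, 37.9]
actual_outcomes = [False, True, False, False]
print(success_rate(predicted_temps_c, actual_outcomes))
```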
Building Diversified Multiple Trees for classification in high dimensional noisy biomedical data.
Li, Jiuyong; Liu, Lin; Liu, Jixue; Green, Ryan
2017-12-01
A trained classification model is often applied to operating data that deviate from the training data because of noise. This paper tests an ensemble method, Diversified Multiple Tree (DMT), on its capability for classifying instances from a new laboratory using a classifier built on the instances of another laboratory. DMT is tested on three real-world biomedical data sets from different laboratories in comparison with four benchmark ensemble methods: AdaBoost, Bagging, Random Forests, and Random Trees. Experiments have also been conducted to study the limitations of DMT and its possible variations. Experimental results show that DMT is significantly more accurate than the benchmark ensemble classifiers when classifying new instances from a laboratory other than the one whose instances were used to build the classifier. This paper demonstrates that an ensemble classifier, DMT, is more robust in classifying noisy data than other widely used ensemble methods. DMT works on data sets that support multiple simple trees.
A self-trained classification technique for producing 30 m percent-water maps from Landsat data
Rover, Jennifer R.; Wylie, Bruce K.; Ji, Lei
2010-01-01
Small bodies of water can be mapped with moderate-resolution satellite data using methods where water is mapped as subpixel fractions using field measurements or high-resolution images as training datasets. A new method, developed from a regression-tree technique, uses a 30 m Landsat image for training the regression tree that, in turn, is applied to the same image to map subpixel water. The self-trained method was evaluated by comparing the percent-water map with three other maps generated from established percent-water mapping methods: (1) a regression-tree model trained with a 5 m SPOT 5 image, (2) a regression-tree model based on endmembers and (3) a linear unmixing classification technique. The results suggest that subpixel water fractions can be accurately estimated when high-resolution satellite data or intensively interpreted training datasets are not available, which increases our ability to map small water bodies or small changes in lake size at a regional scale.
Scalable Regression Tree Learning on Hadoop using OpenPlanet
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yin, Wei; Simmhan, Yogesh; Prasanna, Viktor
As scientific and engineering domains attempt to effectively analyze the deluge of data arriving from sensors and instruments, machine learning is becoming a key data mining tool to build prediction models. Regression tree is a popular learning model that combines decision trees and linear regression to forecast numerical target variables based on a set of input features. MapReduce is well suited for addressing such data-intensive learning applications, and a proprietary regression tree algorithm, PLANET, using MapReduce has been proposed earlier. In this paper, we describe an open source implementation of this algorithm, OpenPlanet, on the Hadoop framework using a hybrid approach. Further, we evaluate the performance of OpenPlanet using real-world datasets from the Smart Power Grid domain to perform energy use forecasting, and propose tuning strategies of Hadoop parameters to improve the performance of the default configuration by 75% for a training dataset of 17 million tuples on a 64-core Hadoop cluster on FutureGrid.
Wang, Kung-Jeng; Makond, Bunjira; Wang, Kung-Min
2013-11-09
Background Breast cancer is one of the most critical cancers and is a major cause of cancer death among women. It is essential to know the survivability of the patients in order to ease the decision making process regarding medical treatment and financial preparation. Recently, the breast cancer data sets have been imbalanced (i.e., the number of survival patients outnumbers the number of non-survival patients) whereas the standard classifiers are not applicable for the imbalanced data sets. The methods to improve survivability prognosis of breast cancer need further study. Methods Two well-known five-year prognosis models/classifiers [i.e., logistic regression (LR) and decision tree (DT)] are constructed by combining synthetic minority over-sampling technique (SMOTE), cost-sensitive classifier technique (CSC), under-sampling, bagging, and boosting. The feature selection method is used to select relevant variables, while the pruning technique is applied to obtain low information-burden models. These methods are applied on data obtained from the Surveillance, Epidemiology, and End Results database. The improvements of survivability prognosis of breast cancer are investigated based on the experimental results. Results Experimental results confirm that the DT and LR models combined with SMOTE, CSC, and under-sampling generate higher predictive performance consecutively than the original ones. Most of the time, DT and LR models combined with SMOTE and CSC use less informative burden/features when a feature selection method and a pruning technique are applied. Conclusions LR is found to have better statistical power than DT in predicting five-year survivability. CSC is superior to SMOTE, under-sampling, bagging, and boosting to improve the prognostic performance of DT and LR. PMID:24207108
Riparian vegetation structure under desertification scenarios
NASA Astrophysics Data System (ADS)
Rosário Fernandes, M.; Segurado, Pedro; Jauch, Eduardo; Ferreira, M. Teresa
2015-04-01
Riparian areas are responsible for many ecological and ecosystem services, including the filtering function, that are considered crucial to the preservation of water quality and social benefits. The main goal of this study is to quantify and understand the riparian variability under desertification scenario(s) and identify the optimal riparian indicators for water scarcity and droughts (WS&D), thereby improving river basin management. This study was performed in the Iberian Tâmega basin, using riparian woody patches, mapped by visual interpretation on Google Earth imagery, along 130 Sampling Units of 250 m long river stretches. Eight riparian structural indicators, related to lateral dimension, weighted area and shape complexity of riparian patches, were calculated using the Patch Analyst extension for ArcGIS 10. A set of 29 hydrological, climatic, and hydrogeomorphological variables were computed by a water modelling system (MOHID), using monthly meteorological data between 2008 and 2014. Land-use classes were also calculated, in a 250 m buffer surrounding each sampling unit, using a classification system based on Corine Land Cover. Boosted Regression Trees identified Mean-width (MW) as the optimal riparian indicator for water scarcity and drought, followed by the Weighted Class Area (WCA) (classification accuracy = 0.79 and 0.69, respectively). Average Flow and Strahler number were consistently selected, by all boosted models, as the most important explanatory variables. However, a combined effect of hydrogeomorphology and land use can explain the high variability found in the riparian width, mainly in Tâmega tributaries. Riparian patches are larger towards the Tâmega river mouth although with lower shape complexity, probably related to more continuous and almost monospecific stands.
Climatic, hydrological and land use scenarios, singly and combined, were used to quantify the riparian variability responding to these changes, and to assess the loss of riparian functions such as nutrient incorporation and sediment flux alterations.
Method for estimating potential tree-grade distributions for northeastern forest species
Daniel A. Yaussy
1993-01-01
Generalized logistic regression was used to distribute trees into four potential tree grades for 20 northeastern species groups. The potential tree grade is defined as the tree grade based on the length and amount of clear cuttings and defects only, disregarding minimum grading diameter. The algorithms described use site index and tree diameter as the predictive...
Additivity of nonlinear biomass equations
Bernard R. Parresol
2001-01-01
Two procedures that guarantee the property of additivity among the components of tree biomass and total tree biomass utilizing nonlinear functions are developed. Procedure 1 is a simple combination approach, and procedure 2 is based on nonlinear joint-generalized regression (nonlinear seemingly unrelated regressions) with parameter restrictions. Statistical theory is...
Park, Ji Hyun; Kim, Hyeon-Young; Lee, Hanna; Yun, Eun Kyoung
2015-12-01
This study compares the performance of the logistic regression and decision tree analysis methods for assessing the risk factors for infection in cancer patients undergoing chemotherapy. The subjects were 732 cancer patients who were receiving chemotherapy at K university hospital in Seoul, Korea. The data were collected between March 2011 and February 2013 and were processed for descriptive analysis, logistic regression and decision tree analysis using the IBM SPSS Statistics 19 and Modeler 15.1 programs. The most common risk factors for infection in cancer patients receiving chemotherapy were identified as alkylating agents, vinca alkaloid and underlying diabetes mellitus. The logistic regression explained 66.7% of the variation in the data in terms of sensitivity and 88.9% in terms of specificity. The decision tree analysis accounted for 55.0% of the variation in the data in terms of sensitivity and 89.0% in terms of specificity. As for the overall classification accuracy, the logistic regression explained 88.0% and the decision tree analysis explained 87.2%. The logistic regression analysis showed a higher degree of sensitivity and classification accuracy. Therefore, logistic regression analysis is concluded to be the more effective and useful method for establishing an infection prediction model for patients undergoing chemotherapy. Copyright © 2015 Elsevier Ltd. All rights reserved.
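The sensitivity, specificity, and overall classification accuracy used above to compare the two methods are all derived from a confusion matrix. The sketch below shows the arithmetic; the function name and the counts are hypothetical and are not the study's data.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts.

    tp/fn: infected patients classified correctly/incorrectly.
    tn/fp: non-infected patients classified correctly/incorrectly.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Hypothetical counts, for illustration only.
sens, spec, acc = classification_metrics(tp=30, fn=10, tn=50, fp=10)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} accuracy={acc:.3f}")
```

When one class is much more common than the other, overall accuracy is dominated by that class, so it can stay high even if sensitivity is low. That is one reason to report all three measures.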
Gu, Yingxin; Wylie, Bruce K.; Boyte, Stephen; Picotte, Joshua J.; Howard, Danny; Smith, Kelcy; Nelson, Kurtis
2016-01-01
Regression tree models have been widely used for remote sensing-based ecosystem mapping. Improper use of the sample data (model training and testing data) may cause overfitting and underfitting effects in the model. The goal of this study is to develop an optimal sampling data usage strategy for any dataset and identify an appropriate number of rules in the regression tree model that will improve its accuracy and robustness. Landsat 8 data and Moderate-Resolution Imaging Spectroradiometer-scaled Normalized Difference Vegetation Index (NDVI) were used to develop regression tree models. A Python procedure was designed to generate random replications of model parameter options across a range of model development data sizes and rule number constraints. The mean absolute difference (MAD) between the predicted and actual NDVI (scaled NDVI, value from 0–200) and its variability across the different randomized replications were calculated to assess the accuracy and stability of the models. In our case study, a six-rule regression tree model developed from 80% of the sample data had the lowest MAD (MADtraining = 2.5 and MADtesting = 2.4), which was suggested as the optimal model. This study demonstrates how the training data and rule number selections impact model accuracy and provides important guidance for future remote-sensing-based ecosystem modeling.
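The evaluation loop described above (random replications of a train/test split, scored by mean absolute difference) can be sketched as follows. This is not the authors' Python procedure: the function names, the synthetic records, and the mean-of-training baseline used in place of a regression tree are all assumptions for illustration.

```python
import random
import statistics


def mean_absolute_difference(predicted, actual):
    """MAD between predicted and actual values of equal length."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def split_sample(records, train_fraction, rng):
    """Randomly split records into training and testing subsets."""
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def replicate_mad(records, train_fraction, n_replications, seed=0):
    """Mean and spread of testing MAD across random replications.

    Stand-in model (assumption): predict the mean training target.
    """
    rng = random.Random(seed)
    mads = []
    for _ in range(n_replications):
        train, test = split_sample(records, train_fraction, rng)
        baseline = statistics.mean([y for _, y in train])
        actual = [y for _, y in test]
        mads.append(mean_absolute_difference([baseline] * len(actual), actual))
    return statistics.mean(mads), statistics.pstdev(mads)


# Synthetic (predictor, scaled NDVI) pairs on the 0-200 scale, illustration only.
records = [(x, 100 + 5 * x) for x in range(10)]
print(replicate_mad(records, train_fraction=0.8, n_replications=20))
```

Repeating the same loop over several training fractions and model complexities, then comparing the mean and variability of MAD, mirrors the selection strategy the abstract outlines.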
Appelt, Ane L; Vogelius, Ivan R; Pløen, John; Rafaelsen, Søren R; Lindebjerg, Jan; Havelund, Birgitte M; Bentzen, Søren M; Jakobsen, Anders
2014-01-01
Purpose/Objective(s) Mature data on tumor control and survival are presented from a randomized trial of the addition of a brachytherapy boost to long-course neoadjuvant chemoradiation (CRT) for locally advanced rectal cancer. Methods and Materials Between March 2005 and November 2008, 248 patients with T3-4N0-2M0 rectal cancer were prospectively randomized to either long-course preoperative CRT (50.4 Gy in 28 fractions, peroral UFT and L-leucovorin) alone or the same CRT schedule plus a brachytherapy boost (10 Gy in 2 fractions). Primary trial endpoint was pathological complete response (pCR) at time of surgery; secondary endpoints included overall survival (OS), progression-free survival (PFS) and freedom from locoregional failure. Results Results for the primary endpoint have previously been reported. This analysis presents survival data for the 224 patients in the Danish part of the trial. 221 patients (111 control arm, 110 brachytherapy boost arm) had data available for analysis, with a median follow-up of 5.4 years. Despite a significant increase in tumor response at the time of surgery, no differences in 5-year OS (70.6% vs 63.6%, HR=1.24, p=0.34) and PFS (63.9% vs 52.0%, HR=1.22, p=0.32) were observed. Freedom from locoregional failure at 5 years was 93.9% and 85.7% (HR=2.60, 1.00–6.73, p=0.06) in the standard and in the brachytherapy arm, respectively. There was no difference in the prevalence of stoma. Explorative analysis based on stratification for tumor regression grade and resection margin status indicated the presence of response migration. Conclusions Despite increased pathological tumor regression at the time of surgery, we observed no benefit on late outcome. Improved tumor regression does not necessarily lead to a relevant clinical benefit when the neoadjuvant treatment is followed by high-quality surgery. PMID:25015203
Double-charming Higgs boson identification using machine-learning assisted jet shapes
NASA Astrophysics Data System (ADS)
Lenz, Alexander; Spannowsky, Michael; Tetlalmatzi-Xolocotzi, Gilberto
2018-01-01
We study the possibility of identifying a boosted resonance that decays into a charm pair against different sources of background using QCD event shapes, which are promoted to jet shapes. Using a set of jet shapes as input to a boosted decision tree, we find that observables utilizing the simultaneous presence of two charm quarks can access complementary information compared to approaches relying on two independent charm tags. Focusing on Higgs associated production with subsequent $H \to c\bar{c}$ decay and on a $CP$-odd scalar $A$ with $m_A \le 10$ GeV we obtain the limits $\mathrm{Br}(H \to c\bar{c}) \le 6.48\%$ and $\mathrm{Br}(H \to A(\to c\bar{c})Z) \le 0.01\%$ at 95% C.L.
USDA-ARS's Scientific Manuscript database
Incomplete meteorological data has been a problem in environmental modeling studies. The objective of this work was to develop a technique to reconstruct missing daily precipitation data in the central part of Chesapeake Bay Watershed using regression trees (RT) and artificial neural networks (ANN)....
USDA-ARS's Scientific Manuscript database
Missing meteorological data have to be estimated for agricultural and environmental modeling. The objective of this work was to develop a technique to reconstruct the missing daily precipitation data in the central part of the Chesapeake Bay Watershed using regression trees (RT) and artificial neura...
Hayes, Mark A.; Cryan, Paul M.; Wunder, Michael B.
2015-01-01
Understanding seasonal distribution and movement patterns of animals that migrate long distances is an essential part of monitoring and conserving their populations. Compared to migratory birds and other more conspicuous migrants, we know very little about the movement patterns of many migratory bats. Hoary bats (Lasiurus cinereus), a cryptic, wide-ranging, long-distance migrant, comprise a substantial proportion of the tens to hundreds of thousands of bat fatalities estimated to occur each year at wind turbines in North America. We created seasonally-dynamic species distribution models (SDMs) from 2,753 museum occurrence records collected over five decades in North America to better understand the seasonal geographic distributions of hoary bats. We used 5 SDM approaches: logistic regression, multivariate adaptive regression splines, boosted regression trees, random forest, and maximum entropy and consolidated outputs to generate ensemble maps. These maps represent the first formal hypotheses for sex- and season-specific hoary bat distributions. Our results suggest that North American hoary bats winter in regions with relatively long growing seasons where temperatures are moderated by proximity to oceans, and then move to the continental interior for the summer. SDMs suggested that hoary bats are most broadly distributed in autumn—the season when they are most susceptible to mortality from wind turbines; this season contains the greatest overlap between potentially suitable habitat and wind energy facilities. Comparing wind-turbine fatality data to model outputs could test many predictions, such as ‘risk from turbines is highest in habitats between hoary bat summering and wintering grounds’. Although future field studies are needed to validate the SDMs, this study generated well-justified and testable hypotheses of hoary bat migration patterns and seasonal distribution. PMID:26208098
NASA Astrophysics Data System (ADS)
Rogers, Jeffrey N.; Parrish, Christopher E.; Ward, Larry G.; Burdick, David M.
2018-03-01
Salt marsh vegetation tends to increase vertical uncertainty in light detection and ranging (lidar) derived elevation data, often causing the data to become ineffective for analysis of topographic features governing tidal inundation or vegetation zonation. Previous attempts at improving lidar data collected in salt marsh environments range from simply computing and subtracting the global elevation bias to more complex methods such as computing vegetation-specific, constant correction factors. The vegetation specific corrections can be used along with an existing habitat map to apply separate corrections to different areas within a study site. It is hypothesized here that correcting salt marsh lidar data by applying location-specific, point-by-point corrections, which are computed from lidar waveform-derived features, tidal-datum based elevation, distance from shoreline and other lidar digital elevation model based variables, using nonparametric regression will produce better results. The methods were developed and tested using full-waveform lidar and ground truth for three marshes in Cape Cod, Massachusetts, U.S.A. Five different model algorithms for nonparametric regression were evaluated, with TreeNet's stochastic gradient boosting algorithm consistently producing better regression and classification results. Additionally, models were constructed to predict the vegetative zone (high marsh and low marsh). The predictive modeling methods used in this study estimated ground elevation with a mean bias of 0.00 m and a standard deviation of 0.07 m (0.07 m root mean square error). These methods appear very promising for correction of salt marsh lidar data and, importantly, do not require an existing habitat map, biomass measurements, or image based remote sensing data such as multi/hyperspectral imagery.
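The three error statistics reported above are linked: when the standard deviation is computed as a population value, RMSE squared equals bias squared plus standard deviation squared, so a mean bias of 0.00 m and a standard deviation of 0.07 m give an RMSE of 0.07 m. The sketch below checks that identity on hypothetical elevation errors; the function name and values are illustrative assumptions.

```python
import math


def error_summary(errors):
    """Return (mean bias, population standard deviation, RMSE) of errors."""
    n = len(errors)
    bias = sum(errors) / n
    sd = math.sqrt(sum((e - bias) ** 2 for e in errors) / n)
    rmse = math.sqrt(sum(e ** 2 for e in errors) / n)
    return bias, sd, rmse


# Hypothetical lidar-minus-ground elevation errors in metres.
errors = [0.05, -0.05, 0.10, -0.10]
bias, sd, rmse = error_summary(errors)

# RMSE^2 = bias^2 + sd^2 holds exactly for the population standard deviation.
print(bias, sd, rmse, math.isclose(rmse ** 2, bias ** 2 + sd ** 2))
```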
USDA-ARS's Scientific Manuscript database
Illegal use of nitrogen-rich melamine (C3H6N6) to boost perceived protein content of food products such as milk, infant formula, frozen yogurt, pet food, biscuits, and coffee drinks has caused serious food safety problems. Conventional methods to detect melamine in foods, such as Enzyme-linked immun...
Tree Stem and Canopy Biomass Estimates from Terrestrial Laser Scanning Data
NASA Astrophysics Data System (ADS)
Olofsson, K.; Holmgren, J.
2017-10-01
In this study an automatic method for estimating both the tree stem and the tree canopy biomass is presented. The point cloud tree extraction techniques operate on TLS data and model the biomass using the estimated stem and canopy volume as independent variables. The regression model fit error is less than 5 kg, which gives a relative model error of about 5 % for the stem estimate and 10-15 % for the spruce and pine canopy biomass estimates. The canopy biomass estimate was improved by separating the models by tree species, which indicates that the method is allometry dependent and that the regression models need to be recomputed for different areas with different climate and different vegetation.
Chakraborty, Somsubhra; Weindorf, David C; Morgan, Cristine L S; Ge, Yufeng; Galbraith, John M; Li, Bin; Kahlon, Charanjit S
2010-01-01
In the United States, petroleum extraction, refinement, and transportation present countless opportunities for spillage mishaps. A method for rapid field appraisal and mapping of petroleum hydrocarbon-contaminated soils for environmental cleanup purposes would be useful. Visible near-infrared (VisNIR, 350-2500 nm) diffuse reflectance spectroscopy (DRS) is a rapid, nondestructive, proximal-sensing technique that has proven adept at quantifying soil properties in situ. The objective of this study was to determine the prediction accuracy of VisNIR DRS in quantifying petroleum hydrocarbons in contaminated soils. Forty-six soil samples (including both contaminated and reference samples) were collected from six different parishes in Louisiana. Each soil sample was scanned using VisNIR DRS at three combinations of moisture content and pretreatment: (i) field-moist intact aggregates, (ii) air-dried intact aggregates, and (iii) air-dried ground soil (sieved through a 2-mm sieve). The VisNIR spectra of soil samples were used to predict total petroleum hydrocarbon (TPH) content in the soil using partial least squares (PLS) regression and boosted regression tree (BRT) models. Each model was validated with 30% of the samples that were randomly selected and not used in the calibration model. The field-moist intact scan proved best for predicting TPH content with a validation r2 of 0.64 and relative percent difference (RPD) of 1.70. Because VisNIR DRS was promising for rapidly predicting soil petroleum hydrocarbon content, future research is warranted to evaluate the methodology for identifying petroleum contaminated soils.
Anderson, Weston; Guikema, Seth; Zaitchik, Ben; Pan, William
2014-01-01
Obtaining accurate small area estimates of population is essential for policy and health planning but is often difficult in countries with limited data. In lieu of available population data, small area estimate models draw information from previous time periods or from similar areas. This study focuses on model-based methods for estimating population when no direct samples are available in the area of interest. To explore the efficacy of tree-based models for estimating population density, we compare six different model structures including Random Forest and Bayesian Additive Regression Trees. Results demonstrate that without information from prior time periods, non-parametric tree-based models produced more accurate predictions than did conventional regression methods. Improving estimates of population density in non-sampled areas is important for regions with incomplete census data and has implications for economic, health and development policies. PMID:24992657
Min, Xiaobo; Wang, Yangyang; Chai, Liyuan; Yang, Zhihui; Liao, Qi
2017-09-01
To explore how heavy metal contamination in Chromite Ore Processing Residue (COPR) disposal sites determines the dissimilarities of indigenous microbial communities, 16S rRNA gene MiSeq sequencing and advanced statistical methods were applied. Thirteen soil samples were collected from three COPR disposal sites: Mouding in southwestern, Shangnan in northwestern, and Yima in central China. The results of analyses of variance (ANOVA), similarities (ANOSIM), and non-metric multidimensional scaling (NMDS) showed that the structural diversity of the microbial communities in the samples with high total chromium (Cr) content (more than 300 mg kg⁻¹; High group) was significantly lower than in the Low group (less than 90 mg kg⁻¹) regardless of their geographical distribution. However, their diversity had virtually recovered under the pressures of long-term metal contamination. Furthermore, the similarity percentage (SIMPER) analysis indicated that the major dissimilarity contributors Micrococcaceae, Delftia, and Streptophyta, possibly having Cr(VI)-resistant and/or Cr(VI)-reducing capability, were dominant in the High group, while Ramlibacter and Gemmatimonas with potential resistances to other heavy metals were prevalent in the Low group. In addition, the multivariate regression tree (MRT), aggregated boosted tree (ABT), and Mantel test revealed that total Cr content affiliated with Cr(VI) was the principal factor shaping the dissimilarities between the soil microbial communities in the COPR sites. Our findings provide insight into the influence of these heavy metals on the microbial communities in the COPR disposal sites and will facilitate bioremediation at such sites. Copyright © 2017 Elsevier Ltd. All rights reserved.
Rejecting Non-MIP-Like Tracks using Boosted Decision Trees with the T2K Pi-Zero Subdetector
NASA Astrophysics Data System (ADS)
Hogan, Matthew; Schwehr, Jacklyn; Cherdack, Daniel; Wilson, Robert; T2K Collaboration
2016-03-01
Tokai-to-Kamioka (T2K) is a long-baseline neutrino experiment with a narrow band energy spectrum peaked at 600 MeV. The Pi-Zero detector (PØD) is a plastic scintillator-based detector located in the off-axis near detector complex 280 meters from the beam origin. It is designed to constrain neutral-current induced π0 production background at the far detector using the water target which is interleaved between scintillator layers. A PØD-based measurement of charged-current (CC) single charged pion (1π+) production on water is being developed which will have expanded phase space coverage as compared to the previous analysis. The signal channel for this analysis, which for T2K is dominated by Δ production, is defined as events that produce a single muon, single charged pion, and any number of nucleons in the final state. The analysis will employ machine learning algorithms to enhance CC1π+ selection by studying topological observables that characterize signal well. Important observables for this analysis are those that discriminate a minimum ionizing particle (MIP) like a muon or pion from a proton at the T2K energies. This work describes the development of a discriminator using Boosted Decision Trees to reject non-MIP-like PØD tracks.
Generalized and synthetic regression estimators for randomized branch sampling
David L. R. Affleck; Timothy G. Gregoire
2015-01-01
In felled-tree studies, ratio and regression estimators are commonly used to convert more readily measured branch characteristics to dry crown mass estimates. In some cases, data from multiple trees are pooled to form these estimates. This research evaluates the utility of both tactics in the estimation of crown biomass following randomized branch sampling (...
Cloud-Free Satellite Image Mosaics with Regression Trees and Histogram Matching.
E.H. Helmer; B. Ruefenacht
2005-01-01
Cloud-free optical satellite imagery simplifies remote sensing, but land-cover phenology limits existing solutions to persistent cloudiness to compositing temporally resolute, spatially coarser imagery. Here, a new strategy for developing cloud-free imagery at finer resolution permits simple automatic change detection. The strategy uses regression trees to predict...
Regression estimators for late-instar gypsy moth larvae at low population densities
W.E. Wallner; A.S. Devito; Stanley J. Zarnoch
1989-01-01
Two regression estimators were developed for determining densities of late-instar gypsy moth, Lymantria dispar (Lepidoptera: Lymantriidae), larvae from burlap band and pyrethrin spray counts on oak trees in Vermont, Massachusetts, Connecticut, and New York. Studies were conducted by marking larvae on individual burlap banded trees within 15...
What Satisfies Students?: Mining Student-Opinion Data with Regression and Decision Tree Analysis
ERIC Educational Resources Information Center
Thomas, Emily H.; Galambos, Nora
2004-01-01
To investigate how students' characteristics and experiences affect satisfaction, this study uses regression and decision tree analysis with the CHAID algorithm to analyze student-opinion data. A data mining approach identifies the specific aspects of students' university experience that most influence three measures of general satisfaction. The…
Abiotic and biotic determinants of coarse woody productivity in temperate mixed forests.
Yuan, Zuoqiang; Ali, Arshad; Wang, Shaopeng; Gazol, Antonio; Freckleton, Robert; Wang, Xugao; Lin, Fei; Ye, Ji; Zhou, Li; Hao, Zhanqing; Loreau, Michel
2018-07-15
Forests play an important role in regulating the global carbon cycle. Yet, how abiotic (i.e. soil nutrients) and biotic (i.e. tree diversity, stand structure and initial biomass) factors simultaneously contribute to aboveground biomass (coarse woody) productivity, and how the relative importance of these factors changes over succession remain poorly studied. Coarse woody productivity (CWP) was estimated as the annual aboveground biomass gain of stems using 10-year census data in old growth and secondary forests (25-ha and 4.8-ha, respectively) in northeast China. A boosted regression tree (BRT) model was used to evaluate the relative contribution of multiple metrics of tree diversity (taxonomic, functional and phylogenetic diversity and trait composition as well as stand structure attributes), stand initial biomass and soil nutrients on productivity in the studied forests. Our results showed that community-weighted mean of leaf phosphorus content, initial stand biomass and soil nutrients were the three most important individual predictors for CWP in secondary forest. In contrast, initial stand biomass, rather than diversity and functional trait composition (vegetation quality), was the most parsimonious predictor of CWP in old growth forest. By comparing the results from secondary and old growth forest, the summed relative contribution of trait composition and soil nutrients on productivity decreased as those of diversity indices and initial biomass increased, suggesting the stronger effect of diversity and vegetation quantity over time. Vegetation quantity, rather than diversity and soil nutrients, is the main driver of forest productivity in temperate mixed forest. Our results imply that the diversity effect on productivity in natural forests may not be as important as often suggested, at least not during the later stage of forest succession.
This finding suggests that, as the importance of different drivers of productivity changes, environmentally driven filtering decreases and competitively driven niche differentiation increases with forest succession. Copyright © 2018 Elsevier B.V. All rights reserved.
Crase, Beth; Vesk, Peter A; Liedloff, Adam; Wintle, Brendan A
2015-08-01
Dominant species influence the composition and abundance of other species present in ecosystems. However, forecasts of distributional change under future climates have predominantly focused on changes in species distribution and ignored possible changes in spatial and temporal patterns of dominance. We develop forecasts of spatial changes for the distribution of species dominance, defined in terms of basal area, and for species occurrence, in response to sea level rise for three tree taxa within an extensive mangrove ecosystem in northern Australia. Three new metrics are provided, indicating the area expected to be suitable under future conditions (Eoccupied ), the instability of suitable area (Einstability ) and the overlap between the current and future spatial distribution (Eoverlap ). The current dominance and occurrence were modelled in relation to a set of environmental variables using boosted regression tree (BRT) models, under two scenarios of seedling establishment: unrestricted and highly restricted. While forecasts of spatial change were qualitatively similar for species occurrence and dominance, the models of species dominance exhibited higher metrics of model fit and predictive performance, and the spatial pattern of future dominance was less similar to the current pattern than was the case for the distributions of species occurrence. This highlights the possibility of greater changes in the spatial patterning of mangrove tree species dominance under future sea level rise. Under the restricted seedling establishment scenario, the area occupied by or dominated by a species declined between 42.1% and 93.8%, while for unrestricted seedling establishment, the area suitable for dominance or occurrence of each species varied from a decline of 68.4% to an expansion of 99.5%. 
As changes in the spatial patterning of dominance are likely to cause a cascade of effects throughout the ecosystem, forecasting spatial changes in dominance provides new and complementary information in addition to that provided by forecasts of species occurrence. © 2015 John Wiley & Sons Ltd.
Freitas, Alex A; Limbu, Kriti; Ghafourian, Taravat
2015-01-01
Volume of distribution is an important pharmacokinetic property that indicates the extent of a drug's distribution in the body tissues. This paper addresses the problem of how to estimate the apparent volume of distribution at steady state (Vss) of chemical compounds in the human body using decision tree-based regression methods from the area of data mining (or machine learning). Hence, the pros and cons of several different types of decision tree-based regression methods have been discussed. The regression methods predict Vss using, as predictive features, both the compounds' molecular descriptors and the compounds' tissue:plasma partition coefficients (Kt:p) - often used in physiologically-based pharmacokinetics. Therefore, this work has assessed whether the data mining-based prediction of Vss can be made more accurate by using as input not only the compounds' molecular descriptors but also (a subset of) their predicted Kt:p values. Comparison of the models that used only molecular descriptors, in particular, the Bagging decision tree (mean fold error of 2.33), with those employing predicted Kt:p values in addition to the molecular descriptors, such as the Bagging decision tree using adipose Kt:p (mean fold error of 2.29), indicated that the use of predicted Kt:p values as descriptors may be beneficial for accurate prediction of Vss using decision trees if prior feature selection is applied. Decision tree based models presented in this work have an accuracy that is reasonable and similar to the accuracy of reported Vss inter-species extrapolations in the literature. The estimation of Vss for new compounds in drug discovery will benefit from methods that are able to integrate large and varied sources of data and flexible non-linear data mining methods such as decision trees, which can produce interpretable models. Graphical abstract: Decision trees for the prediction of tissue partition coefficient and volume of distribution of drugs.
NASA Astrophysics Data System (ADS)
Zack, J. W.
2015-12-01
Predictions from Numerical Weather Prediction (NWP) models are the foundation for wind power forecasts for day-ahead and longer forecast horizons. The NWP models directly produce three-dimensional wind forecasts on their respective computational grids. These can be interpolated to the location and time of interest. However, these direct predictions typically contain significant systematic errors ("biases"). This is due to a variety of factors including the limited space-time resolution of the NWP models and shortcomings in the model's representation of physical processes. It has become common practice to attempt to improve the raw NWP forecasts by statistically adjusting them through a procedure that is widely known as Model Output Statistics (MOS). The challenge is to identify complex patterns of systematic errors and then use this knowledge to adjust the NWP predictions. The MOS-based improvements are the basis for much of the value added by commercial wind power forecast providers. There are an enormous number of statistical approaches that can be used to generate the MOS adjustments to the raw NWP forecasts. In order to obtain insight into the potential value of some of the newer and more sophisticated statistical techniques often referred to as "machine learning methods" a MOS-method comparison experiment has been performed for wind power generation facilities in 6 wind resource areas of California. The underlying NWP models that provided the raw forecasts were the two primary operational models of the US National Weather Service: the GFS and NAM models. The focus was on 1- and 2-day ahead forecasts of the hourly wind-based generation. 
The statistical methods evaluated included: (1) screening multiple linear regression, which served as a baseline method, (2) artificial neural networks, (3) a decision-tree approach called random forests, (4) gradient boosted regression based upon a decision-tree algorithm, (5) support vector regression and (6) analog ensemble, which is a case-matching scheme. The presentation will provide (1) an overview of each method and the experimental design, (2) performance comparisons based on standard metrics such as bias, MAE and RMSE, (3) a summary of the performance characteristics of each approach and (4) a preview of further experiments to be conducted.
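The three comparison metrics named above (bias, MAE and RMSE) have standard definitions, sketched here for paired forecast and observed values. The function name and the sample numbers are hypothetical and are not taken from the study.

```python
import math


def forecast_errors(forecast, observed):
    """Return (bias, MAE, RMSE) for paired forecast and observed values."""
    if len(forecast) != len(observed) or not forecast:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [f - o for f, o in zip(forecast, observed)]
    n = len(errors)
    bias = sum(errors) / n                            # mean signed error
    mae = sum(abs(e) for e in errors) / n             # mean absolute error
    rmse = math.sqrt(sum(e * e for e in errors) / n)  # root mean squared error
    return bias, mae, rmse


# Made-up hourly generation values (MW), for illustration only.
raw_forecast = [12.0, 15.0, 9.0, 20.0]
observed = [10.0, 14.0, 11.0, 16.0]
bias, mae, rmse = forecast_errors(raw_forecast, observed)
```

Bias captures the systematic component that a MOS adjustment targets, while MAE and RMSE also reflect the remaining scatter; RMSE penalizes large misses more heavily than MAE.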
Ratliff, John K; Balise, Ray; Veeravagu, Anand; Cole, Tyler S; Cheng, Ivan; Olshen, Richard A; Tian, Lu
2016-05-18
Postoperative metrics are increasingly important in determining standards of quality for physicians and hospitals. Although complications following spinal surgery have been described, procedural and patient variables have yet to be incorporated into a predictive model of adverse-event occurrence. We sought to develop a predictive model of complication occurrence after spine surgery. We used longitudinal prospective data from a national claims database and developed a predictive model incorporating complication type and frequency of occurrence following spine surgery procedures. We structured our model to assess the impact of features such as preoperative diagnosis, patient comorbidities, location in the spine, anterior versus posterior approach, whether fusion had been performed, whether instrumentation had been used, number of levels, and use of bone morphogenetic protein (BMP). We assessed a variety of adverse events. Prediction models were built using logistic regression with additive main effects and logistic regression with main effects as well as all 2 and 3-factor interactions. Least absolute shrinkage and selection operator (LASSO) regularization was used to select features. Competing approaches included boosted additive trees and the classification and regression trees (CART) algorithm. The final prediction performance was evaluated by estimating the area under a receiver operating characteristic curve (AUC) as predictions were applied to independent validation data and compared with the Charlson comorbidity score. The model was developed from 279,135 records of patients with a minimum duration of follow-up of 30 days. Preliminary assessment showed an adverse-event rate of 13.95%, well within norms reported in the literature. We used the first 80% of the records for training (to predict adverse events) and the remaining 20% of the records for validation. 
There was remarkable similarity among methods, with an AUC of 0.70 for predicting the occurrence of adverse events. The AUC using the Charlson comorbidity score was 0.61. The described model was more accurate than Charlson scoring (p < 0.01). We present a modeling effort based on administrative claims data that predicts the occurrence of complications after spine surgery. We believe that the development of a predictive modeling tool illustrating the risk of complication occurrence after spine surgery will aid in patient counseling and improve the accuracy of risk modeling strategies. Copyright © 2016 by The Journal of Bone and Joint Surgery, Incorporated.
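The evaluation described above (train on the first 80% of records, then estimate the AUC on the remaining 20%) can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names and toy scores are hypothetical, and the AUC is computed from its pairwise-ranking definition rather than with any particular statistics package.

```python
def split_first_fraction(records, fraction=0.8):
    """Split records in their existing order: first part trains, rest validates."""
    cut = int(len(records) * fraction)
    return records[:cut], records[cut:]


def auc(scores, labels):
    """Probability that a random positive case outranks a random negative case.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical record identifiers, kept in their original order.
training, validation = split_first_fraction(list(range(10)))

# Made-up predicted risks and observed adverse-event labels.
scores = [0.9, 0.4, 0.5, 0.1]
labels = [1, 1, 0, 0]
validation_auc = auc(scores, labels)
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking, which gives context for the reported values of 0.70 and 0.61. The pairwise loop is quadratic in the number of cases; for data sets as large as the one described, a rank-based computation would be the practical choice.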
Regression analysis using dependent Polya trees.
Schörgendorfer, Angela; Branscum, Adam J
2013-11-30
Many commonly used models for linear regression analysis force overly simplistic shape and scale constraints on the residual structure of data. We propose a semiparametric Bayesian model for regression analysis that produces data-driven inference by using a new type of dependent Polya tree prior to model arbitrary residual distributions that are allowed to evolve across increasing levels of an ordinal covariate (e.g., time, in repeated measurement studies). By modeling residual distributions at consecutive covariate levels or time points using separate, but dependent Polya tree priors, distributional information is pooled while allowing for broad pliability to accommodate many types of changing residual distributions. We can use the proposed dependent residual structure in a wide range of regression settings, including fixed-effects and mixed-effects linear and nonlinear models for cross-sectional, prospective, and repeated measurement data. A simulation study illustrates the flexibility of our novel semiparametric regression model to accurately capture evolving residual distributions. In an application to immune development data on immunoglobulin G antibodies in children, our new model outperforms several contemporary semiparametric regression models based on a predictive model selection criterion. Copyright © 2013 John Wiley & Sons, Ltd.
DIF Trees: Using Classification Trees to Detect Differential Item Functioning
ERIC Educational Resources Information Center
Vaughn, Brandon K.; Wang, Qiu
2010-01-01
A nonparametric tree classification procedure is used to detect differential item functioning for items that are dichotomously scored. Classification trees are shown to be an alternative procedure to detect differential item functioning other than the use of traditional Mantel-Haenszel and logistic regression analysis. A nonparametric…
Spatial prediction of soil texture in region Centre (France) from summary data
NASA Astrophysics Data System (ADS)
Dobarco, Mercedes Roman; Saby, Nicolas; Paroissien, Jean-Baptiste; Orton, Tom G.
2015-04-01
Soil texture is a key controlling factor of important soil functions like water and nutrient holding capacity, retention of pollutants, drainage, soil biodiversity, and C cycling. High resolution soil texture maps enhance our understanding of the spatial distribution of soil properties and provide valuable information for decision making and crop management, environmental protection, and hydrological planning. We predicted the soil texture of agricultural topsoils in the Region Centre (France) combining regression and area-to-point kriging. Soil texture data were collected from the French soil-test database (BDAT), which is populated with soil analyses performed at farmers' request. To protect the anonymity of the farms, the data were aggregated by commune. In a first step, summary statistics of environmental covariates by commune were used to develop prediction models with Cubist, boosted regression trees, and random forests. In a second step, the residuals of each individual observation were summarized by commune and kriged following the method developed by Orton et al. (2012). This approach allowed the inclusion of non-linear relationships among covariates and soil texture while accounting for the uncertainty on areal means in the area-to-point kriging step. Independent validation of the models was done using data from the systematic soil monitoring network of French soils. Future work will compare the performance of these models with a non-stationary variance geostatistical model using the most important covariates and summary statistics of texture data. The results will indicate whether the latter, statistically more challenging approach significantly improves texture predictions or whether the simpler area-to-point regression kriging can offer satisfactory results.
The application of area-to-point regression kriging at national level using BDAT data has the potential to improve soil texture predictions for agricultural topsoils, especially when combined with existing maps (i.e., model ensemble).
Neighborhood Influences on Vehicle-Pedestrian Crash Severity.
Toran Pour, Alireza; Moridpour, Sara; Tay, Richard; Rajabifard, Abbas
2017-12-01
Socioeconomic factors are known to be contributing factors for vehicle-pedestrian crashes. Although several studies have examined the socioeconomic factors related to the location of the crashes, few studies have considered the socioeconomic factors of the neighborhood where the road users live in vehicle-pedestrian crash modelling. This research aims to identify the socioeconomic factors related to both the neighborhoods where the road users live and where crashes occur that have an influence on vehicle-pedestrian crash severity. Data on vehicle-pedestrian crashes that occurred at mid-blocks in Melbourne, Australia, were analyzed. Neighborhood factors associated with road users' residences and the location of the crash were investigated using boosted regression tree (BRT). Furthermore, partial dependence plots were applied to illustrate the interactions between these factors. We found that socioeconomic factors accounted for 60% of the 20 top contributing factors to vehicle-pedestrian crashes. This research reveals that socioeconomic factors of the neighborhoods where the road users live and where the crashes occur are important in determining the severity of the crashes, with the former having a greater influence. Hence, road safety countermeasures, especially those focussing on the road users, should be targeted at these high-risk neighborhoods.
NASA Astrophysics Data System (ADS)
Santoro, R.; Ingraffea, A. R.
2015-12-01
Previous modeling (Ingraffea et al., PNAS, 2014) indicated roughly twofold higher cumulative risk for wellbore impairment in unconventional wells, relative to conventional wells, and large spatial variation in risk for oil and gas wells drilled in the state of Pennsylvania. Impairment risk for wells in the northeast portion of the state was found to be 8.5 times greater than that of wells drilled in the rest of the state. Here, we set out to explain this apparent regional variability through Boosted Regression Tree (BRT) analysis of geographic, developmental, and general well attributes. We find that regional variability is largely driven by the nature of the development, i.e. whether conventional or unconventional development is dominant. Oil and natural gas market prices and total well depths emerge as major influences on wellbore impairment, with moderate influences from well densities and geologic factors. The figure depicts influence paths for predictors of impairments for the state (top left), SW region (top right), unconventional/NE region (bottom left) and conventional/NW region (bottom right) models. Influences are scaled to reflect percent contributions in explaining variability in the model.
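To illustrate how a boosted tree model can yield influences scaled to percent contributions, the following is a minimal sketch of gradient boosting with one-split trees (stumps) under squared-error loss. It is not the authors' implementation. All names and the toy data are hypothetical, and influence is assumed here to be each predictor's share of the total squared-error reduction accumulated over its splits, which is similar in spirit to the relative influence reported by BRT software.

```python
def fit_stump(rows, residuals):
    """Find the single split that most reduces the squared error of the residuals."""
    n = len(rows)
    total = sum(residuals)
    best = None  # (gain, feature, threshold, left_mean, right_mean)
    for j in range(len(rows[0])):
        order = sorted(range(n), key=lambda i: rows[i][j])
        left_sum = 0.0
        for k in range(1, n):
            left_sum += residuals[order[k - 1]]
            lo, hi = rows[order[k - 1]][j], rows[order[k]][j]
            if lo == hi:
                continue  # no threshold separates equal values
            right_sum = total - left_sum
            # Squared-error reduction achieved by this split.
            gain = left_sum ** 2 / k + right_sum ** 2 / (n - k) - total ** 2 / n
            if best is None or gain > best[0]:
                best = (gain, j, (lo + hi) / 2, left_sum / k, right_sum / (n - k))
    return best


def boost(rows, y, n_trees=50, learning_rate=0.1):
    """Fit boosted stumps; return (intercept, stumps, percent influence per feature)."""
    intercept = sum(y) / len(y)
    pred = [intercept] * len(y)
    influence = [0.0] * len(rows[0])
    stumps = []
    for _ in range(n_trees):
        residuals = [t - p for t, p in zip(y, pred)]
        stump = fit_stump(rows, residuals)
        if stump is None:
            break  # no feature can be split
        gain, j, threshold, left, right = stump
        influence[j] += gain
        stumps.append((j, threshold, left, right))
        # Shrink each tree's contribution by the learning rate.
        pred = [p + learning_rate * (left if row[j] <= threshold else right)
                for p, row in zip(pred, rows)]
    total = sum(influence)
    percent = [100.0 * g / total for g in influence] if total > 0 else influence
    return intercept, stumps, percent


# Toy data: a strong step effect in feature 0 and a weak linear effect in feature 1.
rows = [(i, (i * 7) % 5) for i in range(20)]
y = [(10.0 if x0 >= 10 else 0.0) + 0.5 * x1 for x0, x1 in rows]
intercept, stumps, percent = boost(rows, y)
```

In this toy example the percentages sum to 100 and the first feature receives most of the influence, because its step effect accounts for most of the variance in `y`. Production BRT software typically adds deeper trees, subsampling, and other loss functions, none of which are shown here.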
Identifying multiple coral reef regimes and their drivers across the Hawaiian archipelago
Jouffray, Jean-Baptiste; Nyström, Magnus; Norström, Albert V.; Williams, Ivor D.; Wedding, Lisa M.; Kittinger, John N.; Williams, Gareth J.
2015-01-01
Loss of coral reef resilience can lead to dramatic changes in benthic structure, often called regime shifts, which significantly alter ecosystem processes and functioning. In the face of global change and increasing direct human impacts, there is an urgent need to anticipate and prevent undesirable regime shifts and, conversely, to reverse shifts in already degraded reef systems. Such challenges require a better understanding of the human and natural drivers that support or undermine different reef regimes. The Hawaiian archipelago extends across a wide gradient of natural and anthropogenic conditions and provides us a unique opportunity to investigate the relationships between multiple reef regimes, their dynamics and potential drivers. We applied a combination of exploratory ordination methods and inferential statistics to one of the most comprehensive coral reef datasets available in order to detect, visualize and define potential multiple ecosystem regimes. This study demonstrates the existence of three distinct reef regimes dominated by hard corals, turf algae or macroalgae. Results from boosted regression trees show nonlinear patterns among predictors that help to explain the occurrence of these regimes, and highlight herbivore biomass as the key driver in addition to effluent, latitude and depth.
Ayanu, Yohannes; Conrad, Christopher; Jentsch, Anke; Koellner, Thomas
2015-01-01
The worldwide demand for food has been increasing due to the rapidly growing global population, and agricultural lands have increased in extent to produce more food crops. The pattern of cropland varies among different regions depending on the traditional knowledge of farmers and availability of uncultivated land. Satellite images can be used to map cropland in open areas but have limitations for detecting undergrowth inside forests. Classification results are often biased and need to be supplemented with field observations. Undercover cropland inside forests in the Bale Mountains of Ethiopia was assessed using field observed percentage cover of land use/land cover classes, and topographic and location parameters. The most influential factors were identified using Boosted Regression Trees and used to map undercover cropland area. Elevation, slope, easterly aspect, distance to settlements, and distance to national park were found to be the most influential factors determining undercover cropland area. When there is very high demand for growing food crops, constrained under restricted rights for clearing forest, cultivation could take place within forests as an undercover. Further research on the impact of undercover cropland on ecosystem services and challenges in sustainable management is thus essential. PMID:26098107
The CMS Level-1 Calorimeter Trigger for LHC Run II
NASA Astrophysics Data System (ADS)
Sinthuprasith, Tutanon
2017-01-01
The phase-1 upgrades of the CMS Level-1 calorimeter trigger have been completed. The Level-1 trigger has been fully commissioned and it will be used by CMS to collect data starting from the 2016 data run. The new trigger has been designed to improve the performance at high luminosity and large number of simultaneous inelastic collisions per crossing (pile-up). For this purpose it uses a novel design, the Time Multiplexed Design, which enables the data from an event to be processed by a single trigger processor at full granularity over several bunch crossings. The TMT design is a modular design based on the uTCA standard. The architecture is flexible and the number of trigger processors can be expanded according to the physics needs of CMS. Intelligent, more complex, and innovative algorithms are now the core of the first decision layer of CMS: the upgraded trigger system implements pattern recognition and MVA (Boosted Decision Tree) regression techniques in the trigger processors for pT assignment, pile-up subtraction, and isolation requirements for electrons and taus. The performance of the TMT design, the latency measurements, and the algorithm performance measured using data are also presented here.
Jiao, Shengwu; Guo, Yumin; Huettmann, Falk; Lei, Guangchun
2014-07-01
Avian nest-site selection is an important research and management subject. The hooded crane (Grus monacha) is a vulnerable (VU) species according to the IUCN Red List. Here, we present the first long-term Chinese legacy nest data for this species (1993-2010) with publicly available metadata. Further, we provide the first study that reports findings on multivariate nest habitat preference using such long-term field data for this species. Our work was carried out in Northeastern China, where we found and measured 24 nests and 81 randomly selected control plots and their environmental parameters in a vast landscape. We used machine learning (stochastic boosted regression trees) to quantify nest selection. Our analysis further included varclus (R Hmisc) and TreeNet to address statistical correlations and two-way interactions. We found that from an initial list of 14 measured field variables, water area (+), water depth (+) and shrub coverage (-) were the main explanatory variables that contributed to hooded crane nest-site selection. Agricultural sites played a smaller role in the selection of these nests. Our results are important for the conservation management of cranes all over East Asia and constitute a defensible and quantitative basis for predictive models.
Waite, Ian R.
2014-01-01
As part of the USGS study of nutrient enrichment of streams in agricultural regions throughout the United States, about 30 sites within each of eight study areas were selected to capture a gradient of nutrient conditions. The objective was to develop watershed disturbance predictive models for macroinvertebrate and algal metrics at national and three regional landscape scales to obtain a better understanding of important explanatory variables. Explanatory variables in models were generated from landscape data, habitat, and chemistry. Instream nutrient concentration and variables assessing the amount of disturbance to the riparian zone (e.g., percent row crops or percent agriculture) were selected as the most important explanatory variables in almost all boosted regression tree models regardless of landscape scale or assemblage. Frequently, TN and TP concentration and riparian agricultural land use variables showed a threshold-type response at relatively low values to the biotic metrics modeled. Some measure of habitat condition was also commonly selected in the final invertebrate models, though the variable(s) varied across regions. Results suggest national models tended to account for more general landscape/climate differences, while regional models incorporated both broad landscape scale and more specific local-scale variables.
Studies of the DIII-D disruption database using Machine Learning algorithms
NASA Astrophysics Data System (ADS)
Rea, Cristina; Granetz, Robert; Meneghini, Orso
2017-10-01
A Random Forests Machine Learning algorithm, trained on a large database of both disruptive and non-disruptive DIII-D discharges, predicts disruptive behavior in DIII-D with about 90% accuracy. Several algorithms have been tested and Random Forests was found to perform best for this particular task. Over 40 plasma parameters are included in the database, with data for each of the parameters taken from 500k time slices. We focused on a subset of non-dimensional plasma parameters, deemed to be good predictors based on physics considerations. Both binary (disruptive/non-disruptive) and multi-label (label based on the elapsed time before disruption) classification problems are investigated. The Random Forests algorithm provides insight into the available dataset by ranking the relative importance of the input features. It is found that q95 and Greenwald density fraction (n/nG) are the most relevant parameters for discriminating between DIII-D disruptive and non-disruptive discharges. A comparison with the Gradient Boosted Trees algorithm is shown and the first results coming from the application of regression algorithms are presented. Work supported by the US Department of Energy under DE-FC02-04ER54698, DE-SC0014264 and DE-FG02-95ER54309.
Ayanu, Yohannes; Conrad, Christopher; Jentsch, Anke; Koellner, Thomas
2015-01-01
The worldwide demand for food has been increasing due to the rapidly growing global population, and agricultural lands have increased in extent to produce more food crops. The pattern of cropland varies among different regions depending on the traditional knowledge of farmers and availability of uncultivated land. Satellite images can be used to map cropland in open areas but have limitations for detecting undergrowth inside forests. Classification results are often biased and need to be supplemented with field observations. Undercover cropland inside forests in the Bale Mountains of Ethiopia was assessed using field observed percentage cover of land use/land cover classes, and topographic and location parameters. The most influential factors were identified using Boosted Regression Trees and used to map undercover cropland area. Elevation, slope, easterly aspect, distance to settlements, and distance to national park were found to be the most influential factors determining undercover cropland area. When there is very high demand for growing food crops, constrained under restricted rights for clearing forest, cultivation could take place within forests as an undercover. Further research on the impact of undercover cropland on ecosystem services and challenges in sustainable management is thus essential.
Design of Probabilistic Random Forests with Applications to Anticancer Drug Sensitivity Prediction
Rahman, Raziur; Haider, Saad; Ghosh, Souparno; Pal, Ranadip
2015-01-01
Random forests consisting of an ensemble of regression trees with equal weights are frequently used for design of predictive models. In this article, we consider an extension of the methodology by representing the regression trees in the form of probabilistic trees and analyzing the nature of heteroscedasticity. The probabilistic tree representation allows for analytical computation of confidence intervals (CIs), and the tree weight optimization is expected to provide stricter CIs with comparable performance in mean error. We approached the ensemble of probabilistic trees’ prediction from the perspectives of a mixture distribution and as a weighted sum of correlated random variables. We applied our methodology to the drug sensitivity prediction problem on synthetic and cancer cell line encyclopedia dataset and illustrated that tree weights can be selected to reduce the average length of the CI without increase in mean error. PMID:27081304
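The two perspectives named in this abstract (a mixture distribution and a weighted sum of correlated random variables) correspond to standard moment identities, sketched below. The symbols are assumptions introduced here for illustration and are not taken from the article: tree i has prediction Y_i with mean \mu_i and variance \sigma_i^2, and the tree weights w_i are non-negative and sum to one.

```latex
% Mixture view: the ensemble prediction Y is drawn from tree i with probability w_i.
\mathbb{E}[Y] = \sum_{i=1}^{T} w_i \mu_i,
\qquad
\operatorname{Var}[Y] = \sum_{i=1}^{T} w_i \left( \sigma_i^2 + \mu_i^2 \right)
                        - \left( \sum_{i=1}^{T} w_i \mu_i \right)^{2}

% Weighted-sum view: the ensemble prediction is S = \sum_i w_i Y_i.
\mathbb{E}[S] = \sum_{i=1}^{T} w_i \mu_i,
\qquad
\operatorname{Var}[S] = \sum_{i=1}^{T} \sum_{j=1}^{T} w_i w_j \operatorname{Cov}(Y_i, Y_j)
```

Both views give the same mean but generally different variances, which is one reason the choice of weights can change the length of a confidence interval without changing the mean error.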
Buchner, Florian; Wasem, Jürgen; Schillo, Sonja
2017-01-01
Risk equalization formulas have been refined since their introduction about two decades ago. Because of the complexity and the abundance of possible interactions between the variables used, hardly any interactions are considered. A regression tree is used to systematically search for interactions, a methodologically new approach in risk equalization. Analyses are based on a data set of nearly 2.9 million individuals from a major German social health insurer. A two-step approach is applied: In the first step a regression tree is built on the basis of the learning data set. Terminal nodes characterized by more than one morbidity-group-split represent interaction effects of different morbidity groups. In the second step the 'traditional' weighted least squares regression equation is expanded by adding interaction terms for all interactions detected by the tree, and regression coefficients are recalculated. The resulting risk adjustment formula shows an improvement in the adjusted R² from 25.43% to 25.81% on the evaluation data set. Predictive ratios are calculated for subgroups affected by the interactions. The R² improvement detected is only marginal. According to the sample level performance measures used, omitting a considerable number of morbidity interactions causes no relevant loss in accuracy. Copyright © 2015 John Wiley & Sons, Ltd.
Chen, Suduan; Goo, Yeong-Jia James; Shen, Zone-De
2014-01-01
As fraudulent financial statements become an increasingly serious problem, establishing a valid model for forecasting the fraudulent financial statements of an enterprise has become an important question for academic research and financial practice. After screening the important variables using stepwise regression, the study uses logistic regression, support vector machine, and decision tree methods to construct classification models for comparison. The study adopts financial and nonfinancial variables to assist in establishing the forecasting model. The research objects are companies with fraudulent and nonfraudulent financial statements between 1998 and 2012. The findings are that financial and nonfinancial information can be used effectively to distinguish fraudulent financial statements, and that the C5.0 decision tree has the best classification performance, 85.71%. PMID:25302338
Lisa M. Ganio; Robert A. Progar
2017-01-01
Wild and prescribed fire-induced injury to forest trees can produce immediate or delayed tree mortality but fire-injured trees can also survive. Land managers use logistic regression models that incorporate tree-injury variables to discriminate between fatally injured trees and those that will survive. We used data from 4024 ponderosa pine (Pinus ponderosa...
Using Evidence-Based Decision Trees Instead of Formulas to Identify At-Risk Readers. REL 2014-036
ERIC Educational Resources Information Center
Koon, Sharon; Petscher, Yaacov; Foorman, Barbara R.
2014-01-01
This study examines whether the classification and regression tree (CART) model improves the early identification of students at risk for reading comprehension difficulties compared with the more difficult to interpret logistic regression model. CART is a type of predictive modeling that relies on nonparametric techniques. It presents results in…
ERIC Educational Resources Information Center
Brabant, Marie-Eve; Hebert, Martine; Chagnon, Francois
2013-01-01
This study explored the clinical profiles of 77 female teenager survivors of sexual abuse and examined the association of abuse-related and personal variables with suicidal ideations. Analyses revealed that 64% of participants experienced suicidal ideations. Findings from classification and regression tree analysis indicated that depression,…
Forest type mapping of the Interior West
Bonnie Ruefenacht; Gretchen G. Moisen; Jock A. Blackard
2004-01-01
This paper develops techniques for the mapping of forest types in Arizona, New Mexico, and Wyoming. The methods involve regression-tree modeling using a variety of remote sensing and GIS layers along with Forest Inventory Analysis (FIA) point data. Regression-tree modeling is a fast and efficient technique of estimating variables for large data sets with high accuracy...
ERIC Educational Resources Information Center
Thomas, Emily H.; Galambos, Nora
To investigate how students' characteristics and experiences affect satisfaction, this study used regression and decision-tree analysis with the CHAID algorithm to analyze student opinion data from a sample of 1,783 college students. A data-mining approach identifies the specific aspects of students' university experience that most influence three…
ERIC Educational Resources Information Center
Cohen, Ira L.; Liu, Xudong; Hudson, Melissa; Gillis, Jennifer; Cavalari, Rachel N. S.; Romanczyk, Raymond G.; Karmel, Bernard Z.; Gardner, Judith M.
2016-01-01
In order to improve discrimination accuracy between Autism Spectrum Disorder (ASD) and similar neurodevelopmental disorders, a data mining procedure, Classification and Regression Trees (CART), was used on a large multi-site sample of PDD Behavior Inventory (PDDBI) forms on children with and without ASD. Discrimination accuracy exceeded 80%,…
Zhao, Yang; Zheng, Wei; Zhuo, Daisy Y; Lu, Yuefeng; Ma, Xiwen; Liu, Hengchang; Zeng, Zhen; Laird, Glen
2017-10-11
Personalized medicine, or tailored therapy, has been an active and important topic in recent medical research. Many methods have been proposed in the literature for predictive biomarker detection and subgroup identification. In this article, we propose a novel decision tree-based approach applicable in randomized clinical trials. We model the prognostic effects of the biomarkers using additive regression trees and the biomarker-by-treatment effect using a single regression tree. A Bayesian approach is utilized to periodically revise the split variables and the split rules of the decision trees, which provides a better overall fit. A Gibbs sampler is implemented in the MCMC procedure, which updates the prognostic trees and the interaction tree separately. We use the posterior distribution of the interaction tree to construct the predictive scores of the biomarkers and to identify the subgroup where the treatment is superior to the control. Numerical simulations show that our proposed method performs well under various settings compared to existing methods. We also demonstrate an application of our method in a real clinical trial.
Spam comments prediction using stacking with ensemble learning
NASA Astrophysics Data System (ADS)
Mehmood, Arif; On, Byung-Won; Lee, Ingyu; Ashraf, Imran; Choi, Gyu Sang
2018-01-01
Deceptive comments about products or services mislead people in decision making. Current methodologies for predicting deceptive comments focus on feature design with a single training model. Indigenous features can show some linguistic phenomena but hardly reveal the latent semantic meaning of the comments. We propose a prediction model on general features of documents using stacking with ensemble learning. Term Frequency/Inverse Document Frequency (TF/IDF) features are the inputs to a stack of Random Forest and Gradient Boosted Trees, and the outputs of the base learners are combined by a decision tree for the final training of the model. The results show that our approach achieves an accuracy of 92.19%, which outperforms the state-of-the-art method.
ERIC Educational Resources Information Center
Albaiz, Tahany
2016-01-01
Teaching English to ESL teachers is a challenging task for a number of reasons, the lack of connection between the target language and the native one being one of the most challenging factors (Ferlazzo & Sypnieski, 2013). Therefore, teachers are supposed to be innovators in creating the tools that could boost the learning process, as well as…
Composition Studies with the Telescope Array Surface Detector
NASA Astrophysics Data System (ADS)
Kuznetsov, Mikhail; Piskunov, Maxim; Rubtsov, Grigory; Troitsky, Sergey; Zhezher, Yana
The results on ultra-high-energy cosmic-ray chemical composition based on the data from the Telescope Array surface-detector are presented. The method is based on the multivariate boosted decision tree (BDT) analysis which uses surface-detector observables. The results on average atomic mass in the energy range 10^18.0-10^20.0 eV are presented. A comparison with the Telescope Array hybrid results and the Pierre Auger Observatory surface detector results is shown.
Harvesting southern pines with taproots is economic way to boost tonnage per acre 20 percent
P. Koch
1977-01-01
At the Southern Forest Experiment Station, we've been trying to extend the pulpwood resource by bringing more of each pine tree to the mill yard. The taproot of a 15- to 30-year-old southern pine weighs about 20% as much as the merchantable stem (Table I). Harvesting and pulping this wasted material would greatly increase pulpwood tonnage yield per acre. But is it...
NASA Astrophysics Data System (ADS)
Manikumari, N.; Murugappan, A.; Vinodhini, G.
2017-07-01
Time series forecasting has attracted remarkable interest from researchers in the last few decades. Neural network-based time series forecasting has been employed in various application areas. Reference Evapotranspiration (ETO) is one of the most important components of the hydrologic cycle and its precise assessment is vital in water balance and crop yield estimation, water resources system design and management. This work aimed at achieving accurate time series forecasts of ETO using a combination of neural network approaches. This work was carried out using data collected in the command area of VEERANAM Tank during the period 2004 - 2014 in India. In this work, the Neural Network (NN) models were combined by ensemble learning in order to improve the accuracy of forecasting daily ETO (for the year 2015). Bagged Neural Network (Bagged-NN) and Boosted Neural Network (Boosted-NN) ensemble learning were employed. It has been proved that Bagged-NN and Boosted-NN ensemble models are better than individual NN models in terms of accuracy. Among the ensemble models, Boosted-NN reduces the forecasting errors compared to Bagged-NN and individual NNs. Regression coefficient, Mean Absolute Deviation, Mean Absolute Percentage Error and Root Mean Square Error also confirm that Boosted-NN leads to improved ETO forecasting performance.
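Three of the error measures named in this abstract can be computed as in the minimal sketch below. It assumes that "Mean Absolute Deviation" means the mean absolute forecast error, which the abstract does not define; the function name and the ETO values are hypothetical and are not data from the study.

```python
import math


def forecast_errors(observed, predicted):
    """Return MAD, MAPE (in percent) and RMSE for paired observed/predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    residuals = [o - p for o, p in zip(observed, predicted)]
    n = len(residuals)
    # Mean absolute deviation of the forecasts from the observations.
    mad = sum(abs(r) for r in residuals) / n
    # MAPE divides by the observed value, so it is undefined for zero observations.
    mape = 100.0 * sum(abs(r / o) for r, o in zip(residuals, observed)) / n
    # Root mean square error penalises large misses more than MAD does.
    rmse = math.sqrt(sum(r * r for r in residuals) / n)
    return {"MAD": mad, "MAPE": mape, "RMSE": rmse}


# Hypothetical daily ETO values in mm/day (illustrative only).
observed = [4.0, 5.0, 6.0, 5.0]
predicted = [4.4, 4.5, 6.3, 5.0]
errors = forecast_errors(observed, predicted)
print(errors)
```

Comparing these values across an individual NN, a bagged ensemble and a boosted ensemble on the same held-out year is one way to make the kind of ranking the abstract reports.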
Henrard, S; Speybroeck, N; Hermans, C
2015-11-01
Haemophilia is a rare genetic haemorrhagic disease characterized by partial or complete deficiency of coagulation factor VIII, for haemophilia A, or IX, for haemophilia B. As in any other medical research domain, the field of haemophilia research is increasingly concerned with finding factors associated with binary or continuous outcomes through multivariable models. Traditional models include multiple logistic regressions, for binary outcomes, and multiple linear regressions for continuous outcomes. Yet these regression models are at times difficult to implement, especially for non-statisticians, and can be difficult to interpret. The present paper sought to didactically explain how, why, and when to use classification and regression tree (CART) analysis for haemophilia research. The CART method is non-parametric and non-linear, based on the repeated partitioning of a sample into subgroups based on a certain criterion. Breiman developed this method in 1984. Classification trees (CTs) are used to analyse categorical outcomes and regression trees (RTs) to analyse continuous ones. The CART methodology has become increasingly popular in the medical field, yet only a few examples of studies using this methodology specifically in haemophilia have to date been published. Two previously published examples using CART analysis in this field are didactically explained in detail. There is increasing interest in using CART analysis in the health domain, primarily due to its ease of implementation, use, and interpretation, thus facilitating medical decision-making. This method should be promoted for analysing continuous or categorical outcomes in haemophilia, when applicable. © 2015 John Wiley & Sons Ltd.
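The "repeated partitioning of a sample into subgroups based on a certain criterion" can be illustrated with a single partitioning step for a regression tree. The sketch below assumes the common least-squares criterion (the split that minimises the summed squared error of the two subgroups); classification trees typically use an impurity measure such as the Gini index instead. The function names and data are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, cost) for the split on x that minimises the summed SSE."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold separates identical predictor values
        # Candidate threshold halfway between two neighbouring predictor values.
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [yv for _, yv in pairs[:i]]
        right = [yv for _, yv in pairs[i:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best


# Hypothetical data: a continuous outcome that jumps once x exceeds 3.
x = [1, 2, 3, 4, 5, 6]
y = [10.0, 11.0, 10.5, 20.0, 21.0, 19.5]
threshold, cost = best_split(x, y)
print(threshold, cost)
```

A full tree repeats this search within each resulting subgroup, over every candidate predictor, until a stopping rule (for example a minimum subgroup size) is reached.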
Gmur, Stephan; Vogt, Daniel; Zabowski, Darlene; Moskal, L. Monika
2012-01-01
The characterization of soil attributes using hyperspectral sensors has revealed patterns in soil spectra that are known to respond to mineral composition, organic matter, soil moisture and particle size distribution. Soil samples from different soil horizons of replicated soil series from sites located within Washington and Oregon were analyzed with the FieldSpec Spectroradiometer to measure their spectral signatures across the electromagnetic range of 400 to 1,000 nm. Similarity rankings of individual soil samples reveal differences between replicate series as well as samples within the same replicate series. Using classification and regression tree statistical methods, regression trees were fitted to each spectral response using concentrations of nitrogen, carbon, carbonate and organic matter as the response variables. Statistics resulting from fitted trees were: nitrogen R² 0.91 (p < 0.01) at 403, 470, 687, and 846 nm spectral band widths, carbonate R² 0.95 (p < 0.01) at 531 and 898 nm band widths, total carbon R² 0.93 (p < 0.01) at 400, 409, 441 and 907 nm band widths, and organic matter R² 0.98 (p < 0.01) at 300, 400, 441, 832 and 907 nm band widths. Use of the 400 to 1,000 nm electromagnetic range utilizing regression trees provided a powerful, rapid and inexpensive method for assessing nitrogen, carbon, carbonate and organic matter for upper soil horizons in a nondestructive method. PMID:23112620
NASA Astrophysics Data System (ADS)
Herguido, Estela; Pulido, Manuel; Francisco Lavado Contador, Joaquín; Schnabel, Susanne
2017-04-01
In Iberian dehesas and montados, the lack of tree recruitment compromises their long-term sustainability. However, in marginal areas of dehesas shrub encroachment facilitates tree recruitment while altering the distinctive physiognomic and cultural characteristics of the system. These are ongoing processes that should be considered when designing afforestation measures and policies. Based on spatial variables, we modeled the proneness of a piece of land to undergo tree recruitment and the results were related to the afforestation measures carried out under the EU First Afforestation Agricultural Land Program between 1992 and 2008. We analyzed the temporal tree population dynamics in 800 randomly selected plots of 100 m radius (2,510 ha in total) in dehesas and treeless pasturelands of Extremadura (hereafter rangelands). Tree changes were revealed by comparing aerial images taken in 1956 with orthophotographs and infrared ones from 2012. Spatial models that predict the areas prone either to lack tree recruitment or with recruitment were developed based on three data mining algorithms: MARS (Multivariate Adaptive Regression Splines), Random Forest (RF) and Stochastic Gradient Boosting (Tree-Net, TN). Recruited-tree locations (1) vs. locations of places with no recruitment (0) (randomly selected from the study areas) were used as the binary dependent variable. Five percent of the data were used as a test data set. As candidate explanatory variables we used 51 different topographic, climatic, bioclimatic, land cover-related and edaphic ones. The statistical models developed were extrapolated to the spatial context of the afforested areas in the region and also to the whole Extremenian rangelands, and the percentage of area modelled as prone to tree recruitment was calculated for each case. A total of 46,674.63 ha were afforested with holm oak (Quercus ilex) or cork oak (Quercus suber) in the studied rangelands under the EU First Afforestation Agricultural Land Program. 
In the sampled plots, 16,747 trees were detected as recruited, while 47,058 and 12,803 were present in both dates and lost during the studied period, respectively. Based on the Area Under the ROC Curve (AUC), all the data mining models considered showed a good fit (MARS AUC= 0.86; TN AUC= 0.92; RF AUC= 0.95) and low misclassification rates. Correctly predicted test samples for absence and presence of tree recruitment amounted respectively to 78.3% and 76.8% when using MARS, 90.8% and 90.8% using TN and 88.9% and 89.1% using RF. The spatial patterns of the different models were similar. However, considering only the percentage of area prone to tree recruitment, marked differences were observed among models for the total surface of rangelands (36.03% in MARS, 22.88% in TN and 6.72 % in RF). Despite these differences, when comparing the results with those of the afforested surfaces (31.73% in MARS, 20.70% in TN and 5.63 % in RF) the three algorithms pointed to similar conclusions, i.e. the afforestations performed in rangelands of Extremadura under the EU First Afforestation Agricultural Land Program barely discriminated between areas with or without natural regeneration. In conclusion, data mining techniques are suitable to develop high-performance spatial models of vegetation dynamics. These models could be useful for policy and decision makers aimed at assessing the implementation of afforestation measures and the selection of more adequate locations.
Computer-aided diagnosis of lung nodule using gradient tree boosting and Bayesian optimization.
Nishio, Mizuho; Nishizawa, Mitsuo; Sugiyama, Osamu; Kojima, Ryosuke; Yakami, Masahiro; Kuroda, Tomohiro; Togashi, Kaori
2018-01-01
We aimed to evaluate a computer-aided diagnosis (CADx) system for lung nodule classification focussing on (i) usefulness of the conventional CADx system (hand-crafted imaging feature + machine learning algorithm), (ii) comparison between support vector machine (SVM) and gradient tree boosting (XGBoost) as machine learning algorithms, and (iii) effectiveness of parameter optimization using Bayesian optimization and random search. Data on 99 lung nodules (62 lung cancers and 37 benign lung nodules) were included from public databases of CT images. A variant of the local binary pattern was used for calculating a feature vector. SVM or XGBoost was trained using the feature vector and its corresponding label. Tree Parzen Estimator (TPE) was used as Bayesian optimization for parameters of SVM and XGBoost. Random search was done for comparison with TPE. Leave-one-out cross-validation was used for optimizing and evaluating the performance of our CADx system. Performance was evaluated using area under the curve (AUC) of receiver operating characteristic analysis. AUC was calculated 10 times, and its average was obtained. The best averaged AUC of SVM and XGBoost was 0.850 and 0.896, respectively; both were obtained using TPE. XGBoost was generally superior to SVM. Optimal parameters for achieving high AUC were obtained with fewer numbers of trials when using TPE, compared with random search. Bayesian optimization of SVM and XGBoost parameters was more efficient than random search. Based on observer study, AUC values of two board-certified radiologists were 0.898 and 0.822. The results show that diagnostic accuracy of our CADx system was comparable to that of radiologists with respect to classifying lung nodules.
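The AUC used as the performance measure above can be computed directly from held-out scores, as in the following sketch. It assumes one score per nodule, as leave-one-out cross-validation would produce when each case is scored by a model trained on all the others; how the authors pooled their scores is not stated, and the function name and values are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    A tie between a positive and a negative case counts as half a win.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical held-out scores, one per nodule (1 = cancer, 0 = benign).
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.1]
auc = roc_auc(labels, scores)
print(auc)
```

This pairwise form is quadratic in the number of cases, which is acceptable for a data set of about a hundred nodules; rank-based formulas give the same value more efficiently on larger samples.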
Which Factors Affect the Success or Failure of Eradication Campaigns against Alien Species?
Pluess, Therese; Jarošík, Vojtěch; Pyšek, Petr; Cannon, Ray; Pergl, Jan; Breukers, Annemarie; Bacher, Sven
2012-01-01
Although issues related to the management of invasive alien species are receiving increasing attention, little is known about which factors affect the likelihood of success of management measures. We applied two data mining techniques, classification trees and boosted trees, to identify factors that relate to the success of management campaigns aimed at eradicating invasive alien invertebrates, plants and plant pathogens. We assembled a dataset of 173 different eradication campaigns against 94 species worldwide, about a half of which (50.9%) were successful. Eradications in man-made habitats, greenhouses in particular, were more likely to succeed than those in (semi-)natural habitats. In man-made habitats the probability of success was generally high in Australasia, while in Europe and the Americas it was higher for local infestations that are easier to deal with, and for international campaigns that are likely to profit from cross-border cooperation. In (semi-) natural habitats, eradication campaigns were more likely to succeed for plants introduced as an ornamental and escaped from cultivation prior to invasion. Averaging out all other factors in boosted trees, pathogens, bacteria and viruses were most, and fungi the least likely to be eradicated; for plants and invertebrates the probability was intermediate. Our analysis indicates that initiating the campaign before the extent of infestation reaches the critical threshold, starting to eradicate within the first four years since the problem has been noticed, paying special attention to species introduced by the cultivation pathway, and applying sanitary measures can substantially increase the probability of eradication success. Our investigations also revealed that information on socioeconomic factors, which are often considered to be crucial for eradication success, is rarely available, and thus their relative importance cannot be evaluated. 
Future campaigns should carefully document socioeconomic factors to enable tests of their importance. PMID:23110197
NASA Astrophysics Data System (ADS)
Peek, R.; Viers, J.; Yarnell, S. M.
2012-12-01
Climate change can affect sensitive species and ecosystems in many ways, yet sparse data and the inability to apply various climate models at functional spatial scales often prevents relevant research from being utilized in conservation management plans. Climate change has been linked to declines and disturbances in a multitude of species and habitats, and in California, one of the greatest climatic concerns is the predicted reduction in mountain snowpack and associated snowmelt. These decreases in natural storage of water as snow in mountain regions can affect the timing and variability of critical snowmelt runoff periods—important seasonal signals that species in montane ecosystems have evolved life history strategies around—leading to greater intra-annual variability and diminished summer and fall stream flows. Although many species distribution models exist, few provide ways to integrate continually updated and revised Global Climate Models (GCMs), hydrologic data unique to a watershed, and ecological responses that can be incorporated into conservation strategies. This study documents a novel and applicable method of combining boosted regression tree (BRT) modeling and species distributions with hydroclimatic data as a potential management tool for conservation. Boosted regression trees are suitable for ecological distribution modeling because they can reduce both bias and variance, as well as handle sharp discontinuities common in sparsely sampled species or large study areas. This approach was used to quantify the effects of hydroclimatic changes on the distribution of key riparian-associated amphibian species in montane meadow habitats in the Sierra Nevada at the sub-watershed level. Based on modeling using current species range maps in conjunction with three climate scenarios (near, mid, and far), extreme range contractions were observed for all sensitive species (southern long-toed salamander, mountain yellow-legged frog, Yosemite toad) by the year 2100. 
Among many environmental and hydroclimatic variables used in the model, snowpack and snowmelt (runoff) variables were consistently among the most informative in predicting species occupancy. Few sub-watersheds contained greater than 50% probability of species occupancy throughout the modeled time period; however several core areas were identified as more resilient to climate change for each species. There was overlap among species in areas that were predicted to remain hydroclimatically stable, particularly in sub-watersheds that contain high meadow density. Quantifying these areas of habitat stability, or "resiliency", may ultimately be the most useful outcome of BRT modeling, with the flexibility to utilize multiple GCMs at varying scales. Ultimately managers need to consider both short term and long term conservation goals by identifying and protecting suitable habitat areas most resilient to climate change to give multiple species the best chance to persist. This approach provides a unique tool for conservation management which can be easily applied to a variety of data and species, and provides useful knowledge at both near and long term time scales.
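The claim that boosted regression trees can handle sharp discontinuities can be illustrated with a minimal sketch of gradient boosting for squared error, in which each one-split tree (stump) is fitted to the residuals of the current ensemble. This is not the study's model: BRT software typically uses deeper trees and random subsampling, and every name, parameter value and data point below is a hypothetical chosen for illustration.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression tree to the residuals by least squares."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    best = None
    for k in range(1, len(order)):
        lo, hi = x[order[k - 1]], x[order[k]]
        if lo == hi:
            continue  # no threshold separates identical values
        left = [residuals[i] for i in order[:k]]
        right = [residuals[i] for i in order[k:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        cost = (sum((r - left_mean) ** 2 for r in left)
                + sum((r - right_mean) ** 2 for r in right))
        if best is None or cost < best[0]:
            best = (cost, (lo + hi) / 2, left_mean, right_mean)
    if best is None:
        raise ValueError("predictor has no variation to split on")
    return best[1:]  # (threshold, left_mean, right_mean)


def boost(x, y, n_trees=50, learning_rate=0.1):
    """Start from the mean of y, then add shrunken stumps fitted to residuals."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        threshold, left_mean, right_mean = fit_stump(x, residuals)
        stumps.append((threshold, left_mean, right_mean))
        predictions = [
            pi + learning_rate * (left_mean if xi <= threshold else right_mean)
            for xi, pi in zip(x, predictions)
        ]
    return base, learning_rate, stumps


def predict(model, xi):
    """Sum the base value and the shrunken contribution of every stump."""
    base, learning_rate, stumps = model
    return base + sum(
        learning_rate * (left if xi <= threshold else right)
        for threshold, left, right in stumps
    )


# Hypothetical occupancy (0/1) that switches abruptly along a snowpack gradient.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
model = boost(x, y)
print(predict(model, 2), predict(model, 7))
```

Because each stump is piecewise constant, the fitted response reproduces the jump between x = 4 and x = 5 instead of smoothing it, while the learning rate controls how quickly the residuals shrink.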
Risk Factors of Falls in Community-Dwelling Older Adults: Logistic Regression Tree Analysis
ERIC Educational Resources Information Center
Yamashita, Takashi; Noe, Douglas A.; Bailer, A. John
2012-01-01
Purpose of the Study: A novel logistic regression tree-based method was applied to identify fall risk factors and possible interaction effects of those risk factors. Design and Methods: A nationally representative sample of American older adults aged 65 years and older (N = 9,592) in the Health and Retirement Study 2004 and 2006 modules was used.…
Louys, Julien; Meloro, Carlo; Elton, Sarah; Ditchfield, Peter; Bishop, Laura C
2015-01-01
We test the performance of two models that use mammalian communities to reconstruct multivariate palaeoenvironments. While both models exploit the correlation between mammal communities (defined in terms of functional groups) and arboreal heterogeneity, the first uses a multiple multivariate regression of community structure and arboreal heterogeneity, while the second uses a linear regression of the principal components of each ecospace. The success of these methods means the palaeoenvironment of a particular locality can be reconstructed in terms of the proportions of heavy, moderate, light, and absent tree canopy cover. The linear regression is less biased, and more precisely and accurately reconstructs heavy tree canopy cover than the multiple multivariate model. However, the multiple multivariate model performs better than the linear regression for all other canopy cover categories. Both models consistently perform better than randomly generated reconstructions. We apply both models to the palaeocommunity of the Upper Laetolil Beds, Tanzania. Our reconstructions indicate that there was very little heavy tree cover at this site (likely less than 10%), with the palaeo-landscape instead comprising a mixture of light and absent tree cover. These reconstructions help resolve the previous conflicting palaeoecological reconstructions made for this site. Copyright © 2014 Elsevier Ltd. All rights reserved.
Modeling vertebrate diversity in Oregon using satellite imagery
NASA Astrophysics Data System (ADS)
Cablk, Mary Elizabeth
Vertebrate diversity was modeled for the state of Oregon using a parametric approach to regression tree analysis. This exploratory data analysis effectively modeled the non-linear relationships between vertebrate richness and phenology, terrain, and climate. Phenology was derived from time-series NOAA-AVHRR satellite imagery for the year 1992 using two methods: principal component analysis and derivation of EROS data center greenness metrics. These two measures of spatial and temporal vegetation condition incorporated the critical temporal element in this analysis. The first three principal components were shown to contain spatial and temporal information about the landscape and discriminated phenologically distinct regions in Oregon. Principal components 2 and 3, 6 greenness metrics, elevation, slope, aspect, annual precipitation, and annual seasonal temperature difference were investigated as correlates to amphibians, birds, all vertebrates, reptiles, and mammals. The variation explained by the regression tree for each taxon was: amphibians (91%), birds (67%), all vertebrates (66%), reptiles (57%), and mammals (55%). Spatial statistics were used to quantify the pattern of each taxon and assess the validity of the resulting predictions from the regression tree models. Regression tree analysis was relatively robust against spatial autocorrelation in the response data, and graphical results indicated that the models were well fit to the data.
Häberle, Lothar; Hack, Carolin C; Heusinger, Katharina; Wagner, Florian; Jud, Sebastian M; Uder, Michael; Beckmann, Matthias W; Schulz-Wendtland, Rüdiger; Wittenberg, Thomas; Fasching, Peter A
2017-08-30
Tumors in radiologically dense breasts are overlooked on mammograms more often than tumors in low-density breasts. A fast, reproducible, and automated method of assessing percentage mammographic density (PMD) would be desirable to support decisions on whether ultrasonography should be provided for women in addition to mammography in diagnostic mammography units. PMD assessment has still not been included in clinical routine work, as there are issues of interobserver variability and the procedure is quite time consuming. This study investigated whether fully automatically generated texture features of mammograms can replace time-consuming semi-automatic PMD assessment to predict a patient's risk of having an invasive breast tumor that is visible on ultrasound but masked on mammography (mammography failure). This observational study included 1334 women with invasive breast cancer treated at a hospital-based diagnostic mammography unit. Ultrasound was available for the entire cohort as part of routine diagnosis. Computer-based threshold PMD assessments ("observed PMD") were carried out and 363 texture features were obtained from each mammogram. Several variable selection and regression techniques (univariate selection, lasso, boosting, random forest) were applied to predict PMD from the texture features. The predicted PMD values were each used as a new predictor for masking in logistic regression models together with clinical predictors. These four logistic regression models with predicted PMD were compared among themselves and with a logistic regression model with observed PMD. The most accurate masking prediction was determined by cross-validation. About 120 of the 363 texture features were selected for predicting PMD. Density predictions with boosting were the best substitute for observed PMD to predict masking. 
Overall, the corresponding logistic regression model performed better (cross-validated AUC, 0.747) than one without mammographic density (0.734), but less well than the one with the observed PMD (0.753). However, in patients with an assigned mammography failure risk >10%, covering about half of all masked tumors, the boosting-based model performed at least as accurately as the original PMD model. Automatically generated texture features can replace semi-automatically determined PMD in a prediction model for mammography failure, such that more than 50% of masked tumors could be discovered.
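The AUC values quoted above compare how well each model ranks masked tumors above non-masked ones. A minimal sketch of that statistic follows, assuming binary labels (1 = event, 0 = no event) and computing the AUC as the share of event/non-event pairs that are ranked correctly, with ties counted as half. The function name and the toy scores are illustrative and do not come from the study, and the sketch does not reproduce its cross-validation procedure.

```python
def auc(scores, labels):
    """Area under the ROC curve as the fraction of correctly ordered (event, non-event) pairs."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Hypothetical predicted risks and observed outcomes.
print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 3 of 4 pairs ordered correctly -> 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so differences such as 0.747 versus 0.753 describe small changes in ranking quality.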
Distribution of cavity trees in midwestern old-growth and second-growth forests
Zhaofei Fan; Stephen R. Shifley; Martin A. Spetich; Frank R. Thompson; David R. Larsen
2003-01-01
We used classification and regression tree analysis to determine the primary variables associated with the occurrence of cavity trees and the hierarchical structure among those variables. We applied that information to develop logistic models predicting cavity tree probability as a function of diameter, species group, and decay class. Inventories of cavity abundance in...
Distribution of cavity trees in midwestern old-growth and second-growth forests
Zhaofei Fan; Stephen R. Shifley; Martin A. Spetich; Frank R., III Thompson; David R. Larsen
2003-01-01
We used classification and regression tree analysis to determine the primary variables associated with the occurrence of cavity trees and the hierarchical structure among those variables. We applied that information to develop logistic models predicting cavity tree probability as a function of diameter, species group, and decay class. Inventories of cavity abundance in...
A hierarchical linear model for tree height prediction.
Vicente J. Monleon
2003-01-01
Measuring tree height is a time-consuming process. Often, tree diameter is measured and height is estimated from a published regression model. Trees used to develop these models are clustered into stands, but this structure is ignored and independence is assumed. In this study, hierarchical linear models that account explicitly for the clustered structure of the data...
Modeling individual tree survival
Quang V. Cao
2016-01-01
Information provided by growth and yield models is the basis for forest managers to make decisions on how to manage their forests. Among different types of growth models, whole-stand models offer predictions at stand level, whereas individual-tree models give detailed information at tree level. The well-known logistic regression is commonly used to predict tree...
DOE Office of Scientific and Technical Information (OSTI.GOV)
Appelt, Ane L., E-mail: ane.lindegaard.appelt@rsyd.dk; Faculty of Health Sciences, University of Southern Denmark, Odense; Vogelius, Ivan R.
Purpose/Objective(s): Mature data on tumor control and survival are presented from a randomized trial of the addition of a brachytherapy boost to long-course neoadjuvant chemoradiation therapy (CRT) for locally advanced rectal cancer. Methods and Materials: Between March 2005 and November 2008, 248 patients with T3-4N0-2M0 rectal cancer were prospectively randomized to either long-course preoperative CRT (50.4 Gy in 28 fractions, per oral tegafur-uracil and L-leucovorin) alone or the same CRT schedule plus a brachytherapy boost (10 Gy in 2 fractions). The primary trial endpoint was pathologic complete response (pCR) at the time of surgery; secondary endpoints included overall survival (OS), progression-free survival (PFS), and freedom from locoregional failure. Results: Results for the primary endpoint have previously been reported. This analysis presents survival data for the 224 patients in the Danish part of the trial. In all, 221 patients (111 control arm, 110 brachytherapy boost arm) had data available for analysis, with a median follow-up time of 5.4 years. Despite a significant increase in tumor response at the time of surgery, no differences in 5-year OS (70.6% vs 63.6%, hazard ratio [HR] = 1.24, P=.34) and PFS (63.9% vs 52.0%, HR=1.22, P=.32) were observed. Freedom from locoregional failure at 5 years was 93.9% and 85.7% (HR=2.60, P=.06) in the standard and in the brachytherapy arms, respectively. There was no difference in the prevalence of stoma. Explorative analysis based on stratification for tumor regression grade and resection margin status indicated the presence of response migration. Conclusions: Despite increased pathologic tumor regression at the time of surgery, we observed no benefit on late outcome. Improved tumor regression does not necessarily lead to a relevant clinical benefit when the neoadjuvant treatment is followed by high-quality surgery.
[Hyperspectral Estimation of Apple Tree Canopy LAI Based on SVM and RF Regression].
Han, Zhao-ying; Zhu, Xi-cun; Fang, Xian-yi; Wang, Zhuo-yuan; Wang, Ling; Zhao, Geng-Xing; Jiang, Yuan-mao
2016-03-01
Leaf area index (LAI) is a dynamic index of crop population size. Hyperspectral technology can be used to estimate apple canopy LAI rapidly and nondestructively, and can provide a reference for monitoring tree growth and estimating yield. Red Fuji apple trees in the full fruit-bearing period were the research objects. The canopy spectral reflectance and LAI values of ninety apple trees were measured with the ASD Fieldspec3 spectrometer and the LAI-2200 in thirty orchards over two consecutive years in the Qixia research area of Shandong Province. The optimal vegetation indices were selected by correlation analysis of the original spectral reflectance and the vegetation indices. Models for predicting LAI were built with two multivariate regression methods, support vector machine (SVM) and random forest (RF). The new vegetation indices GNDVI527, NDVI676, RVI682, FD-NVI656 and GRVI517 and the two main existing vegetation indices, NDVI670 and NDVI705, are in accordance with LAI. In the RF regression model, the calibration set coefficient of determination (C-R2) of 0.920 and the validation set coefficient of determination (V-R2) of 0.889 are higher than those of the SVM regression model by 0.045 and 0.033, respectively. The calibration set root mean square error (C-RMSE) of 0.249 and the validation set root mean square error (V-RMSE) of 0.236 are lower than those of the SVM regression model by 0.054 and 0.058, respectively. The relative analysis errors of the calibration set (C-RPD) and validation set (V-RPD) reached 3.363 and 2.520, higher than those of the SVM regression model by 0.598 and 0.262, respectively. The slopes of the trend lines in the scatterplots of measured versus predicted values for the calibration and validation sets (C-S and V-S) are close to 1. The RF regression model estimates LAI better than the SVM model and can be used to estimate the LAI of Red Fuji apple trees in the full fruit period.
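The fit statistics reported in this abstract (R2, RMSE, RPD) can be computed from paired measured and predicted values. The sketch below is a minimal illustration under stated assumptions: R2 is taken as one minus the ratio of residual to total sum of squares, and RPD as the sample standard deviation of the measured values divided by the RMSE, which is a common definition that the abstract itself does not spell out. The function name and the toy values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def regression_metrics(observed, predicted):
    """Return R2, RMSE and RPD for paired observed and predicted values."""
    observed_mean = mean(observed)
    residual_ss = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    total_ss = sum((o - observed_mean) ** 2 for o in observed)
    rmse = sqrt(residual_ss / len(observed))
    return {
        "r2": 1 - residual_ss / total_ss,
        "rmse": rmse,
        "rpd": stdev(observed) / rmse,  # assumed definition: SD of observations / RMSE
    }

# Hypothetical measured and predicted LAI values.
metrics = regression_metrics([1, 2, 3, 4, 5], [1.1, 1.9, 3.2, 3.8, 5.0])
print({name: round(value, 3) for name, value in metrics.items()})
```

Under these definitions a higher R2 and RPD and a lower RMSE all point the same way, which matches how the RF and SVM models are compared in the abstract.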
Log and tree sawing times for hardwood mills
Everette D. Rast
1974-01-01
Data on 6,850 logs and 1,181 trees were analyzed to predict sawing times. For both logs and trees, regression equations were derived that express (in minutes) sawing time per log or tree and per Mbf. For trees, merchantable height is expressed in number of logs as well as in feet. One of the major uses for the tables of average sawing times is as a bench mark against...
Logistic regression trees for initial selection of interesting loci in case-control studies
Nickolov, Radoslav Z; Milanov, Valentin B
2007-01-01
Modern genetic epidemiology faces the challenge of dealing with hundreds of thousands of genetic markers. The selection of a small initial subset of interesting markers for further investigation can greatly facilitate genetic studies. In this contribution we suggest the use of a logistic regression tree algorithm known as logistic tree with unbiased selection. Using the simulated data provided for Genetic Analysis Workshop 15, we show how this algorithm, with incorporation of multifactor dimensionality reduction method, can reduce an initial large pool of markers to a small set that includes the interesting markers with high probability. PMID:18466557
Optimization of Adaboost Algorithm for Sonar Target Detection in a Multi-Stage ATR System
NASA Technical Reports Server (NTRS)
Lin, Tsung Han (Hank)
2011-01-01
JPL has developed a multi-stage Automated Target Recognition (ATR) system to locate objects in images. First, input images are preprocessed and sent to a Grayscale Optical Correlator (GOC) filter to identify possible regions-of-interest (ROIs). Second, feature extraction operations are performed using Texton filters and Principal Component Analysis (PCA). Finally, the features are fed to a classifier, to identify ROIs that contain the targets. Previous work used the Feed-forward Back-propagation Neural Network for classification. In this project we investigate a version of Adaboost as a classifier for comparison. The version we used is known as GentleBoost. We used the boosted decision tree as the weak classifier. We have tested our ATR system against real-world sonar images using the Adaboost approach. Results indicate an improvement in performance over a single Neural Network design.
Andrew T. Hudak; Nicholas L. Crookston; Jeffrey S. Evans; Michael K. Falkowski; Alistair M. S. Smith; Paul E. Gessler; Penelope Morgan
2006-01-01
We compared the utility of discrete-return light detection and ranging (lidar) data and multispectral satellite imagery, and their integration, for modeling and mapping basal area and tree density across two diverse coniferous forest landscapes in north-central Idaho. We applied multiple linear regression models subset from a suite of 26 predictor variables derived...
ERIC Educational Resources Information Center
Kitsantas, Anastasia; Kitsantas, Panagiota; Kitsantas, Thomas
2012-01-01
The purpose of this exploratory study was to assess the relative importance of a number of variables in predicting students' interest in math and/or computer science. Classification and regression trees (CART) were employed in the analysis of survey data collected from 276 college students enrolled in two U.S. and Greek universities. The results…
Wills, Christopher; Harms, Kyle E; Wiegand, Thorsten; Punchi-Manage, Ruwan; Gilbert, Gregory S; Erickson, David; Kress, W John; Hubbell, Stephen P; Gunatilleke, C V Savitri; Gunatilleke, I A U Nimal
2016-01-01
Studies of forest dynamics plots (FDPs) have revealed a variety of negative density-dependent (NDD) demographic interactions, especially among conspecific trees. These interactions can affect growth rate, recruitment and mortality, and they play a central role in the maintenance of species diversity in these complex ecosystems. Here we use an equal area annulus (EAA) point-pattern method to comprehensively analyze data from two tropical FDPs, Barro Colorado Island in Panama and Sinharaja in Sri Lanka. We show that these NDD interactions also influence the continued evolutionary diversification of even distantly related tree species in these FDPs. We examine the details of a wide range of these interactions between individual trees and the trees that surround them. All these interactions, and their cumulative effects, are strongest among conspecific focal and surrounding tree species in both FDPs. They diminish in magnitude with increasing phylogenetic distance between heterospecific focal and surrounding trees, but do not disappear or change the pattern of their dependence on size, density, frequency or physical distance even among the most distantly related trees. The phylogenetic persistence of all these effects provides evidence that interactions between tree species that share an ecosystem may continue to promote adaptive divergence even after the species' gene pools have become separated. Adaptive divergence among taxa would operate in stark contrast to an alternative possibility that has previously been suggested, that distantly related species with dispersal-limited distributions and confronted with unpredictable neighbors will tend to converge on common strategies of resource use. In addition, we have also uncovered a positive density-dependent effect: growth rates of large trees are boosted in the presence of a smaller basal area of surrounding trees. 
We also show that many of the NDD interactions switch sign rapidly as focal trees grow in size, and that their cumulative effect can strongly influence the distributions and species composition of the trees that surround the focal trees during the focal trees' lifetimes.
Identification of extremely premature infants at high risk of rehospitalization.
Ambalavanan, Namasivayam; Carlo, Waldemar A; McDonald, Scott A; Yao, Qing; Das, Abhik; Higgins, Rosemary D
2011-11-01
Extremely low birth weight infants often require rehospitalization during infancy. Our objective was to identify at the time of discharge which extremely low birth weight infants are at higher risk for rehospitalization. Data from extremely low birth weight infants in Eunice Kennedy Shriver National Institute of Child Health and Human Development Neonatal Research Network centers from 2002-2005 were analyzed. The primary outcome was rehospitalization by the 18- to 22-month follow-up, and secondary outcome was rehospitalization for respiratory causes in the first year. Using variables and odds ratios identified by stepwise logistic regression, scoring systems were developed with scores proportional to odds ratios. Classification and regression-tree analysis was performed by recursive partitioning and automatic selection of optimal cutoff points of variables. A total of 3787 infants were evaluated (mean ± SD birth weight: 787 ± 136 g; gestational age: 26 ± 2 weeks; 48% male, 42% black). Forty-five percent of the infants were rehospitalized by 18 to 22 months; 14.7% were rehospitalized for respiratory causes in the first year. Both regression models (area under the curve: 0.63) and classification and regression-tree models (mean misclassification rate: 40%-42%) were moderately accurate. Predictors for the primary outcome by regression were shunt surgery for hydrocephalus, hospital stay of >120 days for pulmonary reasons, necrotizing enterocolitis stage II or higher or spontaneous gastrointestinal perforation, higher fraction of inspired oxygen at 36 weeks, and male gender. By classification and regression-tree analysis, infants with hospital stays of >120 days for pulmonary reasons had a 66% rehospitalization rate compared with 42% without such a stay. The scoring systems and classification and regression-tree analysis models identified infants at higher risk of rehospitalization and might assist planning for care after discharge.
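The "automatic selection of optimal cutoff points" in classification and regression-tree analysis can be sketched for a single variable. The example assumes Gini impurity as the split criterion, which is a common choice for classification trees but is not stated in the abstract; the function names and the toy data (length of stay in days against a 0/1 rehospitalization flag) are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_cutoff(values, labels):
    """Return (cutoff, weighted_impurity) for the split 'value <= cutoff' with the lowest impurity."""
    n = len(values)
    best_cut, best_impurity = None, float("inf")
    for cut in sorted(set(values))[:-1]:  # the largest value would leave the right side empty
        left = [y for v, y in zip(values, labels) if v <= cut]
        right = [y for v, y in zip(values, labels) if v > cut]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best_impurity:
            best_cut, best_impurity = cut, impurity
    return best_cut, best_impurity

# Hypothetical data: hospital stay in days and whether the infant was rehospitalized.
days = [30, 60, 90, 110, 125, 140, 160, 200]
rehospitalized = [0, 0, 1, 0, 1, 1, 1, 1]
print(best_cutoff(days, rehospitalized))  # (110, 0.1875)
```

Recursive partitioning repeats this search within each resulting group, over all candidate variables, until a stopping rule is met.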
Identification of Extremely Premature Infants at High Risk of Rehospitalization
Carlo, Waldemar A.; McDonald, Scott A.; Yao, Qing; Das, Abhik; Higgins, Rosemary D.
2011-01-01
OBJECTIVE: Extremely low birth weight infants often require rehospitalization during infancy. Our objective was to identify at the time of discharge which extremely low birth weight infants are at higher risk for rehospitalization. METHODS: Data from extremely low birth weight infants in Eunice Kennedy Shriver National Institute of Child Health and Human Development Neonatal Research Network centers from 2002–2005 were analyzed. The primary outcome was rehospitalization by the 18- to 22-month follow-up, and secondary outcome was rehospitalization for respiratory causes in the first year. Using variables and odds ratios identified by stepwise logistic regression, scoring systems were developed with scores proportional to odds ratios. Classification and regression-tree analysis was performed by recursive partitioning and automatic selection of optimal cutoff points of variables. RESULTS: A total of 3787 infants were evaluated (mean ± SD birth weight: 787 ± 136 g; gestational age: 26 ± 2 weeks; 48% male, 42% black). Forty-five percent of the infants were rehospitalized by 18 to 22 months; 14.7% were rehospitalized for respiratory causes in the first year. Both regression models (area under the curve: 0.63) and classification and regression-tree models (mean misclassification rate: 40%–42%) were moderately accurate. Predictors for the primary outcome by regression were shunt surgery for hydrocephalus, hospital stay of >120 days for pulmonary reasons, necrotizing enterocolitis stage II or higher or spontaneous gastrointestinal perforation, higher fraction of inspired oxygen at 36 weeks, and male gender. By classification and regression-tree analysis, infants with hospital stays of >120 days for pulmonary reasons had a 66% rehospitalization rate compared with 42% without such a stay. 
CONCLUSIONS: The scoring systems and classification and regression-tree analysis models identified infants at higher risk of rehospitalization and might assist planning for care after discharge. PMID:22007016
Kohrt, Holbrook E; Olshen, Richard A; Bermas, Honnie R; Goodson, William H; Wood, Douglas J; Henry, Solomon; Rouse, Robert V; Bailey, Lisa; Philben, Vicki J; Dirbas, Frederick M; Dunn, Jocelyn J; Johnson, Denise L; Wapnir, Irene L; Carlson, Robert W; Stockdale, Frank E; Hansen, Nora M; Jeffrey, Stefanie S
2008-03-04
Current practice is to perform a completion axillary lymph node dissection (ALND) for breast cancer patients with tumor-involved sentinel lymph nodes (SLNs), although fewer than half will have non-sentinel node (NSLN) metastasis. Our goal was to develop new models to quantify the risk of NSLN metastasis in SLN-positive patients and to compare predictive capabilities to another widely used model. We constructed three models to predict NSLN status: recursive partitioning with receiver operating characteristic curves (RP-ROC), boosted Classification and Regression Trees (CART), and multivariate logistic regression (MLR) informed by CART. Data were compiled from a multicenter Northern California and Oregon database of 784 patients who prospectively underwent SLN biopsy and completion ALND. We compared the predictive abilities of our best model and the Memorial Sloan-Kettering Breast Cancer Nomogram (Nomogram) in our dataset and an independent dataset from Northwestern University. 285 patients had positive SLNs, of which 213 had known angiolymphatic invasion status and 171 had complete pathologic data including hormone receptor status. 264 (93%) patients had limited SLN disease (micrometastasis, 70%, or isolated tumor cells, 23%). 101 (35%) of all SLN-positive patients had tumor-involved NSLNs. Three variables (tumor size, angiolymphatic invasion, and SLN metastasis size) predicted risk in all our models. RP-ROC and boosted CART stratified patients into four risk levels. MLR informed by CART was most accurate. Using two composite predictors calculated from three variables, MLR informed by CART was more accurate than the Nomogram computed using eight predictors. In our dataset, area under ROC curve (AUC) was 0.83/0.85 for MLR (n = 213/n = 171) and 0.77 for Nomogram (n = 171). When applied to an independent dataset (n = 77), AUC was 0.74 for our model and 0.62 for Nomogram. 
The composite predictors in our model were the product of angiolymphatic invasion and size of SLN metastasis, and the product of tumor size and square of SLN metastasis size. We present a new model developed from a community-based SLN database that uses only three rather than eight variables to achieve higher accuracy than the Nomogram for predicting NSLN status in two different datasets.
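The two composite predictors described above are simple products of the three underlying variables, and a logistic model turns them into a risk estimate. The sketch below shows only that structure: the function names are hypothetical, the units and coding of the inputs are assumptions (invasion coded as 0 or 1), and the intercept and coefficients are placeholders, because the abstract does not report the fitted values.

```python
from math import exp

def composite_predictors(tumor_size, angiolymphatic_invasion, sln_metastasis_size):
    """Build the two composite predictors named in the abstract.

    angiolymphatic_invasion is assumed to be coded 0 (absent) or 1 (present).
    """
    invasion_by_sln_size = angiolymphatic_invasion * sln_metastasis_size
    tumor_by_sln_size_squared = tumor_size * sln_metastasis_size ** 2
    return invasion_by_sln_size, tumor_by_sln_size_squared

def logistic_risk(c1, c2, intercept, beta1, beta2):
    """Logistic model on the two composites; the coefficients must come from a fitted model."""
    linear_predictor = intercept + beta1 * c1 + beta2 * c2
    return 1.0 / (1.0 + exp(-linear_predictor))

# Hypothetical patient and placeholder coefficients, for illustration only.
c1, c2 = composite_predictors(tumor_size=2.0, angiolymphatic_invasion=1, sln_metastasis_size=3.0)
print(c1, c2)  # 3.0 18.0
print(logistic_risk(c1, c2, intercept=-2.0, beta1=0.3, beta2=0.05))
```

Using products rather than the raw variables lets a model with only two terms capture interactions that would otherwise need several separate predictors.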
Jon C. Regelbrugge
1993-01-01
Abstract. We modeled tree mortality occurring two years following wildfire in Pinus ponderosa forests using data from 1275 trees in 25 stands burned during the 1987 Stanislaus Complex fires. We used logistic regression analysis to develop models relating the probability of wildfire-induced mortality with tree size and fire severity for Pinus ponderosa, Calocedrus...
Large-Scale Variations in Lumber Value Recovery of Yellow Birch and Sugar Maple in Quebec, Canada
Hassegawa, Mariana; Havreljuk, Filip; Ouimet, Rock; Auty, David; Pothier, David; Achim, Alexis
2015-01-01
Silvicultural restoration measures have been implemented in the northern hardwoods forests of southern Quebec, Canada, but their financial applicability is often hampered by the depleted state of the resource. To help identify sites most suited for the production of high quality timber, where the potential return on silvicultural investments should be the highest, this study assessed the impact of stand and site characteristics on timber quality in sugar maple (Acer saccharum Marsh.) and yellow birch (Betula alleghaniensis Britt.). For this purpose, lumber value recovery (LVR), an estimate of the summed value of boards contained in a unit volume of round wood, was used as an indicator of timber quality. Predictions of LVR were made for yellow birch and sugar maple trees contained in a network of more than 22000 temporary sample plots across the Province. Next, stand-level variables were selected and models to predict LVR were built using the boosted regression trees method. Finally, the occurrence of spatial clusters was verified by a hotspot analysis. Results showed that in both species LVR was positively correlated with the stand age and structural diversity index, and negatively correlated with the number of merchantable stems. Yellow birch had higher LVR in areas with shallower soils, whereas sugar maple had higher LVR in regions with deeper soils. The hotspot analysis indicated that clusters of high and low LVR exist across the province for both species. Although it remains uncertain to what extent the variability of LVR may result from variations in past management practices or in inherent site quality, we argue that efforts to produce high quality timber should be prioritized in sites where LVR is predicted to be the highest. PMID:26313689
Large-Scale Variations in Lumber Value Recovery of Yellow Birch and Sugar Maple in Quebec, Canada.
Hassegawa, Mariana; Havreljuk, Filip; Ouimet, Rock; Auty, David; Pothier, David; Achim, Alexis
2015-01-01
Silvicultural restoration measures have been implemented in the northern hardwoods forests of southern Quebec, Canada, but their financial applicability is often hampered by the depleted state of the resource. To help identify sites most suited for the production of high quality timber, where the potential return on silvicultural investments should be the highest, this study assessed the impact of stand and site characteristics on timber quality in sugar maple (Acer saccharum Marsh.) and yellow birch (Betula alleghaniensis Britt.). For this purpose, lumber value recovery (LVR), an estimate of the summed value of boards contained in a unit volume of round wood, was used as an indicator of timber quality. Predictions of LVR were made for yellow birch and sugar maple trees contained in a network of more than 22000 temporary sample plots across the Province. Next, stand-level variables were selected and models to predict LVR were built using the boosted regression trees method. Finally, the occurrence of spatial clusters was verified by a hotspot analysis. Results showed that in both species LVR was positively correlated with the stand age and structural diversity index, and negatively correlated with the number of merchantable stems. Yellow birch had higher LVR in areas with shallower soils, whereas sugar maple had higher LVR in regions with deeper soils. The hotspot analysis indicated that clusters of high and low LVR exist across the province for both species. Although it remains uncertain to what extent the variability of LVR may result from variations in past management practices or in inherent site quality, we argue that efforts to produce high quality timber should be prioritized in sites where LVR is predicted to be the highest.
Machine Learning Principles Can Improve Hip Fracture Prediction.
Kruse, Christian; Eiken, Pia; Vestergaard, Peter
2017-04-01
Apply machine learning principles to predict hip fractures and estimate predictor importance in Dual-energy X-ray absorptiometry (DXA)-scanned men and women. Dual-energy X-ray absorptiometry data from two Danish regions between 1996 and 2006 were combined with national Danish patient data to comprise 4722 women and 717 men with 5 years of follow-up time (original cohort n = 6606 men and women). Twenty-four statistical models were built on 75% of data points through 5-fold (k = 5), 5-repeat cross-validation, and then validated on the remaining 25% of data points to calculate area under the curve (AUC) and calibrate probability estimates. The best models were retrained with restricted predictor subsets to estimate the best subsets. For women, bootstrap aggregated flexible discriminant analysis ("bagFDA") performed best with a test AUC of 0.92 [0.89; 0.94] and well-calibrated probabilities following Naïve Bayes adjustments. A "bagFDA" model limited to 11 predictors (among them bone mineral densities (BMD), biochemical glucose measurements, general practitioner and dentist use) achieved a test AUC of 0.91 [0.88; 0.93]. For men, eXtreme Gradient Boosting ("xgbTree") performed best with a test AUC of 0.89 [0.82; 0.95], but with poor calibration in higher probabilities. A ten-predictor subset (BMD, biochemical cholesterol and liver function tests, penicillin use and osteoarthritis diagnoses) achieved a test AUC of 0.86 [0.78; 0.94] using an "xgbTree" model. Machine learning can improve hip fracture prediction beyond logistic regression using ensemble models. Compiling data from international cohorts of longer follow-up and performing similar machine learning procedures has the potential to further improve discrimination and calibration.
Predicting biological condition in southern California streams
Brown, Larry R.; May, Jason T.; Rehn, Andrew C.; Ode, Peter R.; Waite, Ian R.; Kennen, Jonathan G.
2012-01-01
As understanding of the complex relations among environmental stressors and biological responses improves, a logical next step is predictive modeling of biological condition at unsampled sites. We developed a boosted regression tree (BRT) model of biological condition, as measured by a benthic macroinvertebrate index of biotic integrity (B-IBI), for streams in urbanized Southern Coastal California. We also developed a multiple linear regression (MLR) model as a benchmark for comparison with the BRT model. The BRT model explained 66% of the variance in B-IBI, identifying watershed population density and combined percentage agricultural and urban land cover in the riparian buffer as the most important predictors of B-IBI, but with watershed mean precipitation and watershed density of manmade channels also important. The MLR model explained 48% of the variance in B-IBI and included watershed population density and combined percentage agricultural and urban land cover in the riparian buffer. For a verification data set, the BRT model correctly classified 75% of impaired sites (B-IBI < 40) and 78% of unimpaired sites (B-IBI ≥ 40). For the same verification data set, the MLR model correctly classified 69% of impaired sites and 87% of unimpaired sites. The BRT model should not be used to predict B-IBI for specific sites; however, the model can be useful for general applications such as identifying and prioritizing regions for monitoring, remediation or preservation, stratifying new bioassessments according to anticipated biological condition, or assessing the potential for change in stream biological condition based on anticipated changes in population density and development in stream buffers.
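The verification step described above reduces to thresholding observed and predicted index scores and counting agreements within each class. The following minimal sketch treats scores below 40 as impaired and all other scores as unimpaired; the function name and the toy scores are hypothetical and are not data from the study.

```python
def classification_rates(observed, predicted, threshold=40):
    """Return the fractions of impaired and unimpaired sites that the predictions classify correctly."""
    impaired = [(o, p) for o, p in zip(observed, predicted) if o < threshold]
    unimpaired = [(o, p) for o, p in zip(observed, predicted) if o >= threshold]
    impaired_correct = sum(p < threshold for _, p in impaired) / len(impaired)
    unimpaired_correct = sum(p >= threshold for _, p in unimpaired) / len(unimpaired)
    return impaired_correct, unimpaired_correct

# Hypothetical observed and model-predicted index scores for eight verification sites.
observed = [20, 35, 38, 30, 45, 60, 72, 41]
predicted = [25, 42, 30, 33, 50, 39, 70, 44]
print(classification_rates(observed, predicted))  # (0.75, 0.75)
```

Reporting the two rates separately, as the abstract does, shows whether a model errs more often by missing impaired sites or by flagging unimpaired ones.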
Modeling time-to-event (survival) data using classification tree analysis.
Linden, Ariel; Yarnold, Paul R
2017-12-01
Time to the occurrence of an event is often studied in health research. Survival analysis differs from other designs in that follow-up times for individuals who do not experience the event by the end of the study (called censored) are accounted for in the analysis. Cox regression is the standard method for analysing censored data, but the assumptions required of these models are easily violated. In this paper, we introduce classification tree analysis (CTA) as a flexible alternative for modelling censored data. Classification tree analysis is a "decision-tree"-like classification model that provides parsimonious, transparent (ie, easy to visually display and interpret) decision rules that maximize predictive accuracy, derives exact P values via permutation tests, and evaluates model cross-generalizability. Using empirical data, we identify all statistically valid, reproducible, longitudinally consistent, and cross-generalizable CTA survival models and then compare their predictive accuracy to estimates derived via Cox regression and an unadjusted naïve model. Model performance is assessed using integrated Brier scores and a comparison between estimated survival curves. The Cox regression model best predicts average incidence of the outcome over time, whereas CTA survival models best predict either relatively high, or low, incidence of the outcome over time. Classification tree analysis survival models offer many advantages over Cox regression, such as explicit maximization of predictive accuracy, parsimony, statistical robustness, and transparency. Therefore, researchers interested in accurate prognoses and clear decision rules should consider developing models using the CTA-survival framework. © 2017 John Wiley & Sons, Ltd.
Liu, Yang; Lü, Yi-he; Zheng, Hai-feng; Chen, Li-ding
2010-05-01
Based on the 10-day SPOT VEGETATION NDVI data and the daily meteorological data from 1998 to 2007 in Yan'an City, the main meteorological variables affecting the annual and interannual variations of NDVI were determined using a regression tree. The effects of the tested meteorological variables on the variability of NDVI differed with season and time lag. Temperature and precipitation were the most important meteorological variables affecting the annual variation of NDVI, and the average highest temperature was the most important meteorological variable affecting the interannual variation of NDVI. The regression tree was very effective at identifying the key meteorological variables affecting NDVI variation, but it could not build quantitative relations between NDVI and the meteorological variables, which limited its further and wider application.
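A regression tree ranks variables by how well a split on each one separates the response. The sketch below shows a single split search that minimizes the within-group sum of squared errors, which is one common criterion and is assumed here because the abstract does not state the one used. The variable names and values are invented.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(rows, target):
    """Find the (variable, threshold) whose binary split minimizes total SSE.

    rows   -- list of dicts mapping variable name to a numeric value
    target -- list of response values, one per row
    """
    best = None
    for var in rows[0]:
        levels = sorted({r[var] for r in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(levels, levels[1:]):
            cut = (lo + hi) / 2
            left = [y for r, y in zip(rows, target) if r[var] <= cut]
            right = [y for r, y in zip(rows, target) if r[var] > cut]
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, var, cut)
    return best

# Toy data (invented): the response jumps once temperature exceeds about 15.
rows = [
    {"temperature": 5, "precipitation": 30},
    {"temperature": 10, "precipitation": 10},
    {"temperature": 20, "precipitation": 25},
    {"temperature": 25, "precipitation": 12},
]
ndvi = [0.2, 0.2, 0.6, 0.6]
score, variable, cut = best_split(rows, ndvi)
print(variable, cut, score)
```

A full tree repeats this search recursively within each resulting group. The output is a set of thresholds and group means rather than a continuous functional relation, which is consistent with the limitation noted in the abstract.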
Partitioning sources of variation in vertebrate species richness
Boone, R.B.; Krohn, W.B.
2000-01-01
Aim: To explore biogeographic patterns of terrestrial vertebrates in Maine, USA using techniques that would describe local and spatial correlations with the environment. Location: Maine, USA. Methods: We delineated the ranges within Maine (86,156 km²) of 275 species using literature and expert review. Ranges were combined into species richness maps, and compared to geomorphology, climate, and woody plant distributions. Methods were adapted that compared richness of all vertebrate classes to each environmental correlate, rather than assessing a single explanatory theory. We partitioned variation in species richness into components using tree and multiple linear regression. Methods were used that allowed for useful comparisons between tree and linear regression results. For both methods we partitioned variation into broad-scale (spatially autocorrelated) and fine-scale (spatially uncorrelated) explained and unexplained components. By partitioning variance, and using both tree and linear regression in analyses, we explored the degree of variation in species richness for each vertebrate group that could be explained by the relative contribution of each environmental variable. Results: In tree regression, climate variation explained richness better (92% of mean deviance explained for all species) than woody plant variation (87%) and geomorphology (86%). Reptiles were highly correlated with environmental variation (93%), followed by mammals, amphibians, and birds (each with 84-82% deviance explained). In multiple linear regression, climate was most closely associated with total vertebrate richness (78%), followed by woody plants (67%) and geomorphology (56%). Again, reptiles were closely correlated with the environment (95%), followed by mammals (73%), amphibians (63%) and birds (57%).
Main conclusions: Comparing variation explained using tree and multiple linear regression quantified the importance of nonlinear relationships and local interactions between species richness and environmental variation, identifying the importance of linear relationships between reptiles and the environment, and nonlinear relationships between birds and woody plants, for example. Conservation planners should capture climatic variation in broad-scale designs; temperatures may shift during climate change, but the underlying correlations between the environment and species richness will presumably remain.
Wills, Christopher; Harms, Kyle E.; Wiegand, Thorsten; Punchi-Manage, Ruwan; Gilbert, Gregory S.; Erickson, David; Kress, W. John; Hubbell, Stephen P.; Gunatilleke, C. V. Savitri; Gunatilleke, I. A. U. Nimal
2016-01-01
Studies of forest dynamics plots (FDPs) have revealed a variety of negative density-dependent (NDD) demographic interactions, especially among conspecific trees. These interactions can affect growth rate, recruitment and mortality, and they play a central role in the maintenance of species diversity in these complex ecosystems. Here we use an equal area annulus (EAA) point-pattern method to comprehensively analyze data from two tropical FDPs, Barro Colorado Island in Panama and Sinharaja in Sri Lanka. We show that these NDD interactions also influence the continued evolutionary diversification of even distantly related tree species in these FDPs. We examine the details of a wide range of these interactions between individual trees and the trees that surround them. All these interactions, and their cumulative effects, are strongest among conspecific focal and surrounding tree species in both FDPs. They diminish in magnitude with increasing phylogenetic distance between heterospecific focal and surrounding trees, but do not disappear or change the pattern of their dependence on size, density, frequency or physical distance even among the most distantly related trees. The phylogenetic persistence of all these effects provides evidence that interactions between tree species that share an ecosystem may continue to promote adaptive divergence even after the species’ gene pools have become separated. Adaptive divergence among taxa would operate in stark contrast to an alternative possibility that has previously been suggested, that distantly related species with dispersal-limited distributions and confronted with unpredictable neighbors will tend to converge on common strategies of resource use. In addition, we have also uncovered a positive density-dependent effect: growth rates of large trees are boosted in the presence of a smaller basal area of surrounding trees. 
We also show that many of the NDD interactions switch sign rapidly as focal trees grow in size, and that their cumulative effect can strongly influence the distributions and species composition of the trees that surround the focal trees during the focal trees’ lifetimes. PMID:27305092
Ensemble of trees approaches to risk adjustment for evaluating a hospital's performance.
Liu, Yang; Traskin, Mikhail; Lorch, Scott A; George, Edward I; Small, Dylan
2015-03-01
A commonly used method for evaluating a hospital's performance on an outcome is to compare the hospital's observed outcome rate to the hospital's expected outcome rate given its patient (case) mix and service. The process of calculating the hospital's expected outcome rate given its patient mix and service is called risk adjustment (Iezzoni 1997). Risk adjustment is critical for accurately evaluating and comparing hospitals' performances since we would not want to unfairly penalize a hospital just because it treats sicker patients. The key to risk adjustment is accurately estimating the probability of an outcome given patient characteristics. For cases with binary outcomes, the method that is commonly used in risk adjustment is logistic regression. In this paper, we consider ensemble of trees methods as alternatives for risk adjustment, including random forests and Bayesian additive regression trees (BART). Both random forests and BART are modern machine learning methods that have been shown recently to have excellent performance for prediction of outcomes in many settings. We apply these methods to carry out risk adjustment for the performance of neonatal intensive care units (NICU). We show that these ensemble of trees methods outperform logistic regression in predicting mortality among babies treated in NICU, and provide a superior method of risk adjustment compared to logistic regression.
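The observed-versus-expected comparison described in this abstract can be illustrated with a short sketch. It assumes that some fitted model (logistic regression, random forest, BART, or another) has already supplied a predicted outcome probability for each patient. The function name and the numbers are hypothetical.

```python
def observed_to_expected(outcomes, predicted_probs):
    """Return (observed rate, expected rate, O/E ratio) for one hospital.

    outcomes        -- 0/1 outcome for each patient treated at the hospital
    predicted_probs -- model-based outcome probability for each patient,
                       given that patient's characteristics
    """
    n = len(outcomes)
    observed = sum(outcomes) / n
    expected = sum(predicted_probs) / n
    return observed, expected, observed / expected

# Invented example: a unit that treats relatively high-risk patients.
outcomes = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
probs = [0.5, 0.25, 0.25, 0.5, 0.25, 0.25, 0.125, 0.125, 0.125, 0.125]
obs, exp, ratio = observed_to_expected(outcomes, probs)
print(obs, exp, ratio)
```

A ratio below 1 suggests fewer adverse outcomes than the case mix would predict. The comparison is only as good as the predicted probabilities, which is why the accuracy of the underlying prediction model matters.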
Comparison of methods for the implementation of genome-assisted evaluation of Spanish dairy cattle.
Jiménez-Montero, J A; González-Recio, O; Alenda, R
2013-01-01
The aim of this study was to evaluate methods for genomic evaluation of the Spanish Holstein population as an initial step toward the implementation of routine genomic evaluations. This study provides a description of the population structure of progeny tested bulls in Spain at the genomic level and compares different genomic evaluation methods with regard to accuracy and bias. Two Bayesian linear regression models, Bayes-A and Bayesian-LASSO (B-LASSO), as well as a machine learning algorithm, Random-Boosting (R-Boost), and BLUP using a realized genomic relationship matrix (G-BLUP), were compared. Five traits that are currently under selection in the Spanish Holstein population were used: milk yield, fat yield, protein yield, fat percentage, and udder depth. In total, genotypes from 1859 progeny tested bulls were used. The training sets were composed of bulls born before 2005, including 1601 bulls for production and 1574 bulls for type, whereas the testing sets contained 258 and 235 bulls born in 2005 or later for production and type, respectively. Deregressed proofs (DRP) from the January 2009 Interbull (Uppsala, Sweden) evaluation were used as the dependent variables for bulls in the training sets, whereas DRP from the December 2011 Interbull evaluation were used to compare genomic predictions with progeny test results for bulls in the testing set. Genomic predictions were more accurate than traditional pedigree indices for predicting future progeny test results of young bulls. The gain in accuracy due to inclusion of genomic data varied by trait and ranged from 0.04 to 0.42 Pearson correlation units. Results averaged across traits showed that B-LASSO had the highest accuracy with an advantage of 0.01, 0.03 and 0.03 points in Pearson correlation compared with R-Boost, Bayes-A, and G-BLUP, respectively.
The B-LASSO predictions also showed the least bias (0.02, 0.03 and 0.10 SD units less than Bayes-A, R-Boost and G-BLUP, respectively) as measured by mean difference between genomic predictions and progeny test results. The R-Boosting algorithm provided genomic predictions with regression coefficients closer to unity, which is an alternative measure of bias, for 4 out of 5 traits and also resulted in mean squared error estimates that were 2%, 10%, and 12% smaller than B-LASSO, Bayes-A, and G-BLUP, respectively. The observed prediction accuracy obtained with these methods was within the range of values expected for a population of similar size, suggesting that the prediction method and reference population described herein are appropriate for implementation of routine genome-assisted evaluations in Spanish dairy cattle. R-Boost is a competitive marker regression methodology in terms of predictive ability that can accommodate large data sets. Copyright © 2013 American Dairy Science Association. Published by Elsevier Inc. All rights reserved.
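The accuracy and bias measures named in this abstract (Pearson correlation, mean difference, regression coefficient, mean squared error) can be computed as in the sketch below. The regression is taken here as observed results regressed on predictions, which is an assumption because the abstract does not state the direction. The function name and numbers are illustrative.

```python
from math import sqrt

def prediction_metrics(predicted, observed):
    """Summarize agreement between predictions and later observed results."""
    n = len(predicted)
    mp = sum(predicted) / n
    mo = sum(observed) / n
    spp = sum((p - mp) ** 2 for p in predicted)
    soo = sum((o - mo) ** 2 for o in observed)
    spo = sum((p - mp) * (o - mo) for p, o in zip(predicted, observed))
    return {
        "pearson_r": spo / sqrt(spp * soo),   # accuracy
        "mean_difference": mp - mo,           # bias as a mean shift
        "slope": spo / spp,                   # observed regressed on predicted
        "mse": sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n,
    }

# Invented values: predictions rank the animals perfectly but are too compressed.
predicted = [1.0, 2.0, 3.0, 4.0]
observed = [2.0, 4.0, 6.0, 8.0]
m = prediction_metrics(predicted, observed)
print(m)
```

In this toy case the correlation is 1 while the slope is 2, which shows why a regression coefficient close to unity carries information about bias that the correlation alone does not.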
ERIC Educational Resources Information Center
McCaffrey, Daniel F.; Ridgeway, Greg; Morral, Andrew R.
2004-01-01
Causal effect modeling with naturalistic rather than experimental data is challenging. In observational studies participants in different treatment conditions may also differ on pretreatment characteristics that influence outcomes. Propensity score methods can theoretically eliminate these confounds for all observed covariates, but accurate…
Modeling Caribbean tree stem diameters from tree height and crown width measurements
Thomas Brandeis; KaDonna Randolph; Mike Strub
2009-01-01
Regression models to predict diameter at breast height (DBH) as a function of tree height and maximum crown radius were developed for Caribbean forests based on data collected by the U.S. Forest Service in the Commonwealth of Puerto Rico and Territory of the U.S. Virgin Islands. The model predicting DBH from tree height fit reasonably well (R2 = 0.7110), with...
Perceived Organizational Support for Enhancing Welfare at Work: A Regression Tree Model
Giorgi, Gabriele; Dubin, David; Perez, Javier Fiz
2016-01-01
When trying to examine outcomes such as welfare and well-being, research tends to focus on main effects and take into account limited numbers of variables at a time. There are a number of techniques that may help address this problem. For example, many statistical packages available in R provide easy-to-use methods of modeling complicated analyses such as classification and regression trees (i.e., recursive partitioning). The present research illustrates the value of recursive partitioning in the prediction of perceived organizational support in a sample of more than 6000 Italian bankers. Utilizing the tree function of the party package in R, we estimated a regression tree model predicting perceived organizational support from a multitude of job characteristics including job demand, lack of job control, lack of supervisor support, training, etc. The resulting model appears particularly helpful in pointing out several interactions in the prediction of perceived organizational support. In particular, training is the dominant factor. Another dimension that seems to influence organizational support is reporting (perceived communication about safety and stress concerns). Results are discussed from a theoretical and methodological point of view. PMID:28082924
Prediction of strontium bromide laser efficiency using cluster and decision tree analysis
NASA Astrophysics Data System (ADS)
Iliev, Iliycho; Gocheva-Ilieva, Snezhana; Kulin, Chavdar
2018-01-01
The subject of investigation is a new high-powered strontium bromide (SrBr2) vapor laser emitting in a multiline region of wavelengths. The laser is an alternative to atomic strontium lasers and free-electron lasers, especially at the 6.45 μm line, which is used in surgery for processing biological tissues and bones with minimal damage. In this paper the experimental data from measurements of the operational and output characteristics of the laser are processed statistically by means of cluster analysis and tree-based regression techniques. The aim is to extract from the available data the more important relationships and dependences that influence the increase in overall laser efficiency. A set of cluster models is constructed and analyzed. Using different cluster methods, it is shown that the seven investigated operational characteristics (laser tube diameter, length, supplied electrical power, and others) and laser efficiency are combined in 2 clusters. Regression tree models built with the Classification and Regression Trees (CART) technique yield dependences that predict the values of efficiency, and especially the maximum efficiency, with over 95% accuracy.
Digression and Value Concatenation to Enable Privacy-Preserving Regression.
Li, Xiao-Bai; Sarkar, Sumit
2014-09-01
Regression techniques can be used not only for legitimate data analysis, but also to infer private information about individuals. In this paper, we demonstrate that regression trees, a popular data-analysis and data-mining technique, can be used to effectively reveal individuals' sensitive data. This problem, which we call a "regression attack," has not been addressed in the data privacy literature, and existing privacy-preserving techniques are not appropriate in coping with this problem. We propose a new approach to counter regression attacks. To protect against privacy disclosure, our approach introduces a novel measure, called digression, which assesses the sensitive value disclosure risk in the process of building a regression tree model. Specifically, we develop an algorithm that uses the measure for pruning the tree to limit disclosure of sensitive data. We also propose a dynamic value-concatenation method for anonymizing data, which better preserves data utility than a user-defined generalization scheme commonly used in existing approaches. Our approach can be used for anonymizing both numeric and categorical data. An experimental study is conducted using real-world financial, economic and healthcare data. The results of the experiments demonstrate that the proposed approach is very effective in protecting data privacy while preserving data quality for research and analysis.
Brachytherapy Boost Utilization and Survival in Unfavorable-risk Prostate Cancer.
Johnson, Skyler B; Lester-Coll, Nataniel H; Kelly, Jacqueline R; Kann, Benjamin H; Yu, James B; Nath, Sameer K
2017-11-01
There are limited comparative survival data for prostate cancer (PCa) patients managed with a low-dose rate brachytherapy (LDR-B) boost and dose-escalated external-beam radiotherapy (DE-EBRT) alone. To compare overall survival (OS) for men with unfavorable PCa between LDR-B and DE-EBRT groups. Using the National Cancer Data Base, we identified men with unfavorable PCa treated between 2004 and 2012 with androgen suppression (AS) and either EBRT followed by LDR-B or DE-EBRT (75.6-86.4Gy). Treatment selection was evaluated using logistic regression and annual percentage proportions. OS was analyzed using the Kaplan-Meier method, log-rank test, Cox proportional hazards, and propensity score matching. We identified 25038 men between 2004 and 2012, during which LDR-B boost utilization decreased from 29% to 14%. LDR-B was associated with better OS on univariate (7-yr OS: 82% vs 73%; p<0.001) and multivariate analyses (hazard ratio [HR] 0.70, 95% confidence interval [CI] 0.64-0.77). Propensity score matching verified an OS benefit associated with LDR-B boost (HR 0.74, 95% CI 0.66-0.89). The OS benefit of LDR-B boost persisted when limited to men aged <60 yr with no comorbidities. On subset analysis, there was no interaction between treatment and age, risk group, or radiation dose. Limitations include the retrospective design, nonrandomized selection bias, and the absence of treatment toxicity, hormone duration, and cancer-specific outcomes. Between 2004 and 2012, LDR-B boost utilization declined and was associated with better OS compared to DE-EBRT alone. LDR-B boost is probably the ideal treatment option for men with unfavorable PCa, pending long-term results of randomized trials. We compared radiotherapy utilization and survival for prostate cancer (PCa) patients using a national database. 
We found that low-dose rate brachytherapy (LDR-B) boost, a method being used less frequently, was associated with better overall survival when compared to dose-escalated external-beam radiotherapy alone for men with unfavorable PCa. Randomized trials are needed to confirm that LDR-B boost is the ideal treatment. Copyright © 2017 European Association of Urology. Published by Elsevier B.V. All rights reserved.
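The Kaplan-Meier method mentioned in this abstract estimates survival while accounting for patients whose follow-up ends before the event is observed. The sketch below is a minimal product-limit estimator. The function name and the follow-up times are invented and have no connection to the database analyzed in the study.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times  -- follow-up time for each individual
    events -- 1 if the event occurred at that time, 0 if censored
    """
    survival = 1.0
    curve = []
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        # Individuals still under observation just before time t.
        at_risk = sum(1 for u in times if u >= t)
        died = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1.0 - died / at_risk
        curve.append((t, survival))
    return curve

# Invented follow-up data: five individuals, two of them censored.
times = [1, 2, 2, 3, 4]
events = [1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
print(curve)
```

Censored individuals contribute to the at-risk count up to their last follow-up time and then drop out without counting as events, which is how incomplete follow-up enters the estimate.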
Decision trees in epidemiological research.
Venkatasubramaniam, Ashwini; Wolfson, Julian; Mitchell, Nathan; Barnes, Timothy; JaKa, Meghan; French, Simone
2017-01-01
In many studies, it is of interest to identify population subgroups that are relatively homogeneous with respect to an outcome. The nature of these subgroups can provide insight into effect mechanisms and suggest targets for tailored interventions. However, identifying relevant subgroups can be challenging with standard statistical methods. We review the literature on decision trees, a family of techniques for partitioning the population, on the basis of covariates, into distinct subgroups who share similar values of an outcome variable. We compare two decision tree methods, the popular Classification and Regression tree (CART) technique and the newer Conditional Inference tree (CTree) technique, assessing their performance in a simulation study and using data from the Box Lunch Study, a randomized controlled trial of a portion size intervention. Both CART and CTree identify homogeneous population subgroups and offer improved prediction accuracy relative to regression-based approaches when subgroups are truly present in the data. An important distinction between CART and CTree is that the latter uses a formal statistical hypothesis testing framework in building decision trees, which simplifies the process of identifying and interpreting the final tree model. We also introduce a novel way to visualize the subgroups defined by decision trees. Our novel graphical visualization provides a more scientifically meaningful characterization of the subgroups identified by decision trees. Decision trees are a useful tool for identifying homogeneous subgroups defined by combinations of individual characteristics. While all decision tree techniques generate subgroups, we advocate the use of the newer CTree technique due to its simplicity and ease of interpretation.
Indicators of Terrorism Vulnerability in Africa
2015-03-26
the terror threat and vulnerabilities across Africa. Key words: Terrorism, Africa, Negative Binomial Regression, Classification Tree
NASA Astrophysics Data System (ADS)
Di, Nur Faraidah Muhammad; Satari, Siti Zanariah
2017-05-01
Outlier detection in linear data sets has been studied extensively, but only a small amount of work has been done on outlier detection in circular data. In this study, we propose a method for detecting multiple outliers in circular regression models based on a clustering algorithm. Clustering techniques use a distance measure to define the distance between data points. Here, we introduce a similarity distance based on Euclidean distance for the circular model and obtain a cluster tree using the single linkage clustering algorithm. Then, a stopping rule for the cluster tree based on the mean direction and circular standard deviation of the tree height is proposed. We classify the cluster group that exceeds the stopping rule as potential outliers. Our aim is to demonstrate the effectiveness of the proposed algorithms with the similarity distances in detecting the outliers. The proposed methods perform well and are applicable to circular regression models.
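The two circular summaries named in the stopping rule, mean direction and circular standard deviation, have standard definitions that can be sketched as follows. The code shows only those textbook quantities. How the authors combine them into a cut-off for the cluster tree is not given in the abstract and is not reproduced here. The function name is hypothetical.

```python
from math import atan2, cos, hypot, log, pi, sin, sqrt

def circular_summary(angles):
    """Return (mean direction, circular standard deviation) for angles in radians.

    Undefined when the angles cancel out exactly (resultant length of zero).
    """
    n = len(angles)
    c = sum(cos(a) for a in angles) / n
    s = sum(sin(a) for a in angles) / n
    # Mean resultant length, capped at 1 to guard against rounding error.
    r = min(1.0, hypot(c, s))
    mean_direction = atan2(s, c) % (2 * pi)
    circular_sd = sqrt(-2.0 * log(r))
    return mean_direction, circular_sd

# Two directions a quarter turn apart.
mean_dir, circ_sd = circular_summary([0.0, pi / 2])
print(mean_dir, circ_sd)
```

Averaging sines and cosines rather than raw angles avoids the wrap-around problem at 0 and 2π, which is what separates circular statistics from their linear counterparts.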
Kadiyala, Akhil; Kaur, Devinder; Kumar, Ashok
2013-02-01
The present study developed a novel approach to modeling indoor air quality (IAQ) of a public transportation bus by the development of hybrid genetic-algorithm-based neural networks (also known as evolutionary neural networks) with input variables optimized using regression trees, referred to as the GART approach. This study validated the applicability of the GART modeling approach in solving complex nonlinear systems by accurately predicting the monitored contaminants of carbon dioxide (CO2), carbon monoxide (CO), nitric oxide (NO), sulfur dioxide (SO2), 0.3-0.4 μm sized particle numbers, 0.4-0.5 μm sized particle numbers, particulate matter (PM) concentrations less than 1.0 μm (PM1.0), and PM concentrations less than 2.5 μm (PM2.5) inside a public transportation bus operating on 20% grade biodiesel in Toledo, OH. First, the important variables affecting each monitored in-bus contaminant were determined using regression trees. Second, the analysis of variance was used as a complementary sensitivity analysis to the regression tree results to determine a subset of statistically significant variables affecting each monitored in-bus contaminant. Finally, the identified subsets of statistically significant variables were used as inputs to develop three artificial neural network (ANN) models. The models developed were regression tree-based back-propagation network (BPN-RT), regression tree-based radial basis function network (RBFN-RT), and GART models. Performance measures were used to validate the predictive capacity of the developed IAQ models. The results from this approach were compared with the results obtained from using a theoretical approach and a generalized practicable approach to modeling IAQ that included the consideration of additional independent variables when developing the aforementioned ANN models. The hybrid GART models were able to capture the majority of the variance in the monitored in-bus contaminants.
The genetic-algorithm-based neural network IAQ models outperformed the traditional ANN methods of the back-propagation and the radial basis function networks. The novelty of this research is the development of a novel approach to modeling vehicular indoor air quality by integration of the advanced methods of genetic algorithms, regression trees, and the analysis of variance for the monitored in-vehicle gaseous and particulate matter contaminants, and comparing the results obtained from using the developed approach with conventional artificial intelligence techniques of back propagation networks and radial basis function networks. This study validated the newly developed approach using holdout and threefold cross-validation methods. These results are of great interest to scientists, researchers, and the public in understanding the various aspects of modeling an indoor microenvironment. This methodology can easily be extended to other fields of study also.
NASA Astrophysics Data System (ADS)
Yosipof, Abraham; Guedes, Rita C.; García-Sosa, Alfonso T.
2018-05-01
Data mining approaches can uncover underlying patterns in chemical and pharmacological property space decisive for drug discovery and development. Two of the most common approaches are visualization and machine learning methods. Visualization methods use dimensionality reduction techniques in order to reduce multi-dimension data into 2D or 3D representations with a minimal loss of information. Machine learning attempts to find correlations between specific activities or classifications for a set of compounds and their features by means of recurring mathematical models. Both models take advantage of the different and deep relationships that can exist between features of compounds, and helpfully provide classification of compounds based on such features. Drug-likeness has been studied from several viewpoints, but here we provide the first implementation in chemoinformatics of the t-Distributed Stochastic Neighbor Embedding (t-SNE) method for the visualization and the representation of chemical space, and the use of different machine learning methods separately and together to form a new ensemble learning method called AL Boost. The models obtained from AL Boost synergistically combine decision tree, random forests (RF), support vector machine (SVM), artificial neuronal network (ANN), k nearest neighbors (kNN), and logistic regression models. In this work, we show that together they form a predictive model that not only improves the predictive force but also decreases bias. This resulted in a corrected classification rate of over 0.81, as well as higher sensitivity and specificity rates for the models. In addition, separation and good models were also achieved for disease categories such as antineoplastic compounds and nervous system diseases, among others. 
Such models can be used to guide decision on the feature landscape of compounds and their likeness to either drugs or other characteristics, such as specific or multiple disease-category(ies) or organ(s) of action of a molecule.
Yosipof, Abraham; Guedes, Rita C; García-Sosa, Alfonso T
2018-01-01
Data mining approaches can uncover underlying patterns in chemical and pharmacological property space decisive for drug discovery and development. Two of the most common approaches are visualization and machine learning methods. Visualization methods use dimensionality reduction techniques in order to reduce multi-dimension data into 2D or 3D representations with a minimal loss of information. Machine learning attempts to find correlations between specific activities or classifications for a set of compounds and their features by means of recurring mathematical models. Both models take advantage of the different and deep relationships that can exist between features of compounds, and helpfully provide classification of compounds based on such features or in case of visualization methods uncover underlying patterns in the feature space. Drug-likeness has been studied from several viewpoints, but here we provide the first implementation in chemoinformatics of the t-Distributed Stochastic Neighbor Embedding (t-SNE) method for the visualization and the representation of chemical space, and the use of different machine learning methods separately and together to form a new ensemble learning method called AL Boost. The models obtained from AL Boost synergistically combine decision tree, random forests (RF), support vector machine (SVM), artificial neural network (ANN), k nearest neighbors (kNN), and logistic regression models. In this work, we show that together they form a predictive model that not only improves the predictive force but also decreases bias. This resulted in a corrected classification rate of over 0.81, as well as higher sensitivity and specificity rates for the models. In addition, separation and good models were also achieved for disease categories such as antineoplastic compounds and nervous system diseases, among others. 
Such models can be used to guide decision on the feature landscape of compounds and their likeness to either drugs or other characteristics, such as specific or multiple disease-category(ies) or organ(s) of action of a molecule.
Developing Models to Forecast Sales of Natural Christmas Trees
Lawrence D. Garrett; Thomas H. Pendleton
1977-01-01
A study of practices for marketing Christmas trees in Winston-Salem, North Carolina, and Denver, Colorado, revealed that such factors as retail lot competition, tree price, consumer traffic, and consumer income were very important in determining a particular retailer's sales. Analyses of 4 years of market data were used in developing regression models for...
Comprehensive database of diameter-based biomass regressions for North American tree species
Jennifer C. Jenkins; David C. Chojnacky; Linda S. Heath; Richard A. Birdsey
2004-01-01
A database consisting of 2,640 equations was compiled from the literature for predicting the biomass of trees and tree components from diameter measurements of species found in North America. Bibliographic information, geographic locations, diameter limits, diameter and biomass units, equation forms, statistical errors, and coefficients are provided for each equation,...
Evaluation of digital soil mapping approaches with large sets of environmental covariates
NASA Astrophysics Data System (ADS)
Nussbaum, Madlene; Spiess, Kay; Baltensweiler, Andri; Grob, Urs; Keller, Armin; Greiner, Lucie; Schaepman, Michael E.; Papritz, Andreas
2018-01-01
The spatial assessment of soil functions requires maps of basic soil properties. Unfortunately, these are either missing for many regions or are not available at the desired spatial resolution or down to the required soil depth. The field-based generation of large soil datasets and conventional soil maps remains costly. Meanwhile, legacy soil data and comprehensive sets of spatial environmental data are available for many regions. Digital soil mapping (DSM) approaches relating soil data (responses) to environmental data (covariates) face the challenge of building statistical models from large sets of covariates originating, for example, from airborne imaging spectroscopy or multi-scale terrain analysis. We evaluated six approaches for DSM in three study regions in Switzerland (Berne, Greifensee, ZH forest) by mapping the effective soil depth available to plants (SD), pH, soil organic matter (SOM), effective cation exchange capacity (ECEC), clay, silt, gravel content and fine fraction bulk density for four soil depths (totalling 48 responses). Models were built from 300-500 environmental covariates by selecting linear models through (1) grouped lasso and (2) an ad hoc stepwise procedure for robust external-drift kriging (georob). For (3) geoadditive models we selected penalized smoothing spline terms by component-wise gradient boosting (geoGAM). We further used two tree-based methods: (4) boosted regression trees (BRTs) and (5) random forest (RF). Lastly, we computed (6) weighted model averages (MAs) from the predictions obtained from methods 1-5. Lasso, georob and geoGAM successfully selected strongly reduced sets of covariates (subsets of 3-6 % of all covariates). Differences in predictive performance, tested on independent validation data, were mostly small and did not reveal a single best method for 48 responses. Nevertheless, RF was often the best among methods 1-5 (28 of 48 responses), but was outcompeted by MA for 14 of these 28 responses. 
RF tended to over-fit the data. The performance of BRT was slightly worse than RF. GeoGAM performed poorly on some responses and was the best only for 7 of 48 responses. The prediction accuracy of lasso was intermediate. All models generally had small bias. Only the computationally very efficient lasso had slightly larger bias because it tended to under-fit the data. Summarizing, although differences were small, the frequencies of the best and worst performance clearly favoured RF if a single method is applied and MA if multiple prediction models can be developed.
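The weighted model averaging step (method 6) can be illustrated with a minimal sketch. The abstract does not say how the weights were chosen, so the inverse-MSE rule below is an assumption, and the function names, model labels, and pH values are all made up for illustration.

```python
def weighted_model_average(predictions, weights):
    """Combine per-model predictions into one averaged prediction per site.

    predictions: dict mapping model name -> list of predicted values
    weights:     dict mapping model name -> non-negative weight
    """
    total = sum(weights[name] for name in predictions)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_sites = len(next(iter(predictions.values())))
    averaged = []
    for i in range(n_sites):
        value = sum(weights[name] * preds[i] for name, preds in predictions.items())
        averaged.append(value / total)
    return averaged


def inverse_mse_weights(validation_mse):
    """Hypothetical weighting rule (an assumption): weight = 1 / validation MSE."""
    return {name: 1.0 / mse for name, mse in validation_mse.items()}


# Invented pH predictions at three sites from three of the methods.
predictions = {
    "lasso": [6.0, 5.0, 7.0],
    "brt": [6.4, 5.2, 6.8],
    "rf": [6.2, 5.4, 6.6],
}
validation_mse = {"lasso": 0.5, "brt": 0.25, "rf": 0.25}  # illustrative only

weights = inverse_mse_weights(validation_mse)
ma = weighted_model_average(predictions, weights)
print(ma)
```

With these invented numbers the two tree-based models each receive twice the weight of lasso, so the averaged prediction is pulled toward them.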
Identifying taxonomic and functional surrogates for spring biodiversity conservation.
Jyväsjärvi, Jussi; Virtanen, Risto; Ilmonen, Jari; Paasivirta, Lauri; Muotka, Timo
2018-02-27
Surrogate approaches are widely used to estimate overall taxonomic diversity for conservation planning. Surrogate taxa are frequently selected based on rarity or charisma, whereas selection through statistical modeling has been applied rarely. We used boosted-regression-tree models (BRT) fitted to biological data from 165 springs to identify bryophyte and invertebrate surrogates for taxonomic and functional diversity of boreal springs. We focused on these 2 groups because they are well known and abundant in most boreal springs. The best indicators of taxonomic versus functional diversity differed. The bryophyte Bryum weigelii and the chironomid larva Paratrichocladius skirwithensis best indicated taxonomic diversity, whereas the isopod Asellus aquaticus and the chironomid Macropelopia spp. were the best surrogates of functional diversity. In a scoring algorithm for priority-site selection, taxonomic surrogates performed only slightly better than random selection for all spring-dwelling taxa, but they were very effective in representing spring specialists, providing a distinct improvement over random solutions. However, the surrogates for taxonomic diversity represented functional diversity poorly and vice versa. When combined with cross-taxon complementarity analyses, surrogate selection based on statistical modeling provides a promising approach for identifying groundwater-dependent ecosystems of special conservation value, a key requirement of the EU Water Framework Directive. © 2018 Society for Conservation Biology.
Spatial models reveal the microclimatic buffering capacity of old-growth forests
Frey, Sarah J. K.; Hadley, Adam S.; Johnson, Sherri L.; Schulze, Mark; Jones, Julia A.; Betts, Matthew G.
2016-01-01
Climate change is predicted to cause widespread declines in biodiversity, but these predictions are derived from coarse-resolution climate models applied at global scales. Such models lack the capacity to incorporate microclimate variability, which is critical to biodiversity microrefugia. In forested montane regions, microclimate is thought to be influenced by combined effects of elevation, microtopography, and vegetation, but their relative effects at fine spatial scales are poorly known. We used boosted regression trees to model the spatial distribution of fine-scale, under-canopy air temperatures in mountainous terrain. Spatial models predicted observed independent test data well (r = 0.87). As expected, elevation strongly predicted temperatures, but vegetation and microtopography also exerted critical effects. Old-growth vegetation characteristics, measured using LiDAR (light detection and ranging), appeared to have an insulating effect; maximum spring monthly temperatures decreased by 2.5°C across the observed gradient in old-growth structure. These cooling effects across a gradient in forest structure are of similar magnitude to 50-year forecasts of the Intergovernmental Panel on Climate Change and therefore have the potential to mitigate climate warming at local scales. Management strategies to conserve old-growth characteristics and to curb current rates of primary forest loss could maintain microrefugia, enhancing biodiversity persistence in mountainous systems under climate warming. PMID:27152339
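For readers unfamiliar with boosted regression trees, the sketch below shows the core idea in plain Python: depth-1 trees (stumps) are fitted one after another to the residuals of the current model and added with a small learning rate. This is not the authors' implementation; the squared-error loss, the stump base learner, the parameter defaults, and the elevation/canopy data are assumptions chosen for illustration.

```python
def fit_stump(x_rows, residuals):
    """Find the one-feature threshold split that minimises squared error."""
    best = None
    for j in range(len(x_rows[0])):
        values = sorted(set(row[j] for row in x_rows))
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2.0
            left = [r for row, r in zip(x_rows, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(x_rows, residuals) if row[j] > threshold]
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = (sum((r - left_mean) ** 2 for r in left)
                   + sum((r - right_mean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("no feature has more than one distinct value")
    return best[1:]


def predict_stump(stump, row):
    j, threshold, left_mean, right_mean = stump
    return left_mean if row[j] <= threshold else right_mean


def fit_brt(x_rows, y, n_trees=200, learning_rate=0.1):
    """Gradient boosting for squared error: each stump fits current residuals."""
    base = sum(y) / len(y)
    current = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - ci for yi, ci in zip(y, current)]
        stump = fit_stump(x_rows, residuals)
        stumps.append(stump)
        current = [ci + learning_rate * predict_stump(stump, row)
                   for ci, row in zip(current, x_rows)]
    return base, learning_rate, stumps


def predict_brt(model, row):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, row) for s in stumps)


# Invented data: temperature falls with elevation (m) and canopy cover (0-1).
rows = [(elev, cover) for elev in (400, 700, 1000, 1300, 1600)
        for cover in (0.0, 0.5, 1.0)]
temps = [25.0 - 0.006 * elev - 2.5 * cover for elev, cover in rows]

model = fit_brt(rows, temps)
mean_temp = sum(temps) / len(temps)
mse_base = sum((t - mean_temp) ** 2 for t in temps) / len(temps)
mse_fit = sum((t - predict_brt(model, r)) ** 2 for t, r in zip(temps, rows)) / len(temps)
print(mse_base, mse_fit)
```

The training error of the boosted model should fall well below that of the constant (mean-only) baseline; on real data, performance would be judged on independent test data, as in the abstract.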
Kotta, Jonne; Möller, Tiia; Orav-Kotta, Helen; Pärnoja, Merli
2014-12-01
Little is known about how organisms might respond to multiple climate stressors and this lack of knowledge limits our ability to manage coastal ecosystems under contemporary climate change. Ecological models provide managers and decision makers with greater certainty that the systems affected by their decisions are accurately represented. In this study Boosted Regression Trees modelling was used to relate the cover of submerged aquatic vegetation to the abiotic environment in the brackish Baltic Sea. The analyses showed that the majority of the studied submerged aquatic species are most sensitive to changes in water temperature, current velocity and winter ice scour. Surprisingly, water salinity, turbidity and eutrophication have little impact on the distributional pattern of the studied biota. Both small and large scale environmental variability contributes to the variability of submerged aquatic vegetation. When modelling species distribution under the projected influences of climate change, all of the studied submerged aquatic species appear to be very resilient to a broad range of environmental perturbation and biomass gains are expected when seawater temperature increases. This is mainly because vegetation develops faster in spring and has a longer growing season under the projected climate change scenario. Copyright © 2014 Elsevier Ltd. All rights reserved.
Liu, Zhihua
2016-11-18
Understanding the influence of climate variability and fire characteristics in shaping postfire vegetation recovery will help to predict future ecosystem trajectories in boreal forests. In this study, I asked: (1) which remotely-sensed vegetation index (VI) is a good proxy for vegetation recovery? and (2) what are the relative influences of climate and fire in controlling postfire vegetation recovery in a Siberian larch forest, a globally important but poorly understood ecosystem type? Analysis showed that the shortwave infrared (SWIR) VI is a good indicator of postfire vegetation recovery in boreal larch forests. A boosted regression tree analysis showed that postfire recovery was collectively controlled by processes that controlled seed availability, as well as by site conditions and climate variability. Fire severity and its spatial variability played a dominant role in determining vegetation recovery, indicating seed availability as the primary mechanism affecting postfire forest resilience. Environmental and immediate postfire climatic conditions appear to be less important, but interact strongly with fire severity to influence postfire recovery. If future warming and fire regimes manifest as expected in this region, seed limitation and climate-induced regeneration failure will become more prevalent and severe, which may cause forests to shift to alternative stable states.
NASA Astrophysics Data System (ADS)
Nijland, Wiebe; Nielsen, Scott E.; Coops, Nicholas C.; Wulder, Michael A.; Stenhouse, Gordon B.
2014-01-01
Food and habitat resources are critical components of wildlife management and conservation efforts. The grizzly bear (Ursus arctos) has diverse diets and habitat requirements particularly for understory plant species, which are impacted by human developments and forest management activities. We use light detection and ranging (LiDAR) data to predict the occurrence of 14 understory plant species relevant to bear forage and compare our predictions with more conventional climate- and land cover-based models. We use boosted regression trees to model each of the 14 understory species across 4435 km2 using occurrence (presence-absence) data from 1941 field plots. Three sets of models were fitted: climate only, climate and basic land and forest covers from Landsat 30-m imagery, and a climate- and LiDAR-derived model describing both the terrain and forest canopy. Resulting model accuracies varied widely among species. Overall, 8 of 14 species models were improved by including the LiDAR-derived variables. For climate-only models, mean annual precipitation and frost-free periods were the most important variables. With inclusion of LiDAR-derived attributes, depth-to-water table, terrain-intercepted annual radiation, and elevation were most often selected. This suggests that fine-scale terrain conditions affect the distribution of the studied species more than canopy conditions.
Chen-Ying Hung; Wei-Chen Chen; Po-Tsun Lai; Ching-Heng Lin; Chi-Chun Lee
2017-07-01
Electronic medical claims (EMCs) can be used to accurately predict the occurrence of a variety of diseases, which can contribute to precise medical interventions. While there is growing interest in applying machine learning (ML) techniques to clinical problems, the use of deep learning in healthcare has only recently gained attention. Deep learning, such as the deep neural network (DNN), has achieved impressive results in speech recognition, computer vision, and natural language processing in recent years. However, deep learning is often difficult to comprehend because of the complexity of its framework. Furthermore, this method has not yet been shown to outperform conventional ML algorithms in disease prediction tasks using EMCs. In this study, we use a large population-based EMC database of around 800,000 patients to compare DNN with three other ML approaches for predicting 5-year stroke occurrence. The results show that DNN and gradient boosting decision tree (GBDT) achieve similarly high prediction accuracies, which are better than those of logistic regression (LR) and support vector machine (SVM) approaches. Meanwhile, DNN achieves optimal results using smaller amounts of patient data than the GBDT method.
Fishing, fast growth and climate variability increase the risk of collapse
Pinsky, Malin L.; Byler, David
2015-01-01
Species around the world have suffered collapses, and a key question is why some populations are more vulnerable than others. Traditional conservation biology and evidence from terrestrial species suggest that slow-growing populations are most at risk, but interactions between climate variability and harvest dynamics may alter or even reverse this pattern. Here, we test this hypothesis globally. We use boosted regression trees to analyse the influences of harvesting, species traits and climate variability on the risk of collapse (decline below a fixed threshold) across 154 marine fish populations around the world. The most important factor explaining collapses was the magnitude of overfishing, while the duration of overfishing best explained long-term depletion. However, fast growth was the next most important risk factor. Fast-growing populations and those in variable environments were especially sensitive to overfishing, and the risk of collapse was more than tripled for fast-growing when compared with slow-growing species that experienced overfishing. We found little evidence that, in the absence of overfishing, climate variability or fast growth rates alone drove population collapse over the last six decades. Expanding efforts to rapidly adjust harvest pressure to account for climate-driven lows in productivity could help to avoid future collapses, particularly among fast-growing species. PMID:26246548
Liu, Zhihua
2016-01-01
Understanding the influence of climate variability and fire characteristics in shaping postfire vegetation recovery will help to predict future ecosystem trajectories in boreal forests. In this study, I asked: (1) which remotely-sensed vegetation index (VI) is a good proxy for vegetation recovery? and (2) what are the relative influences of climate and fire in controlling postfire vegetation recovery in a Siberian larch forest, a globally important but poorly understood ecosystem type? Analysis showed that the shortwave infrared (SWIR) VI is a good indicator of postfire vegetation recovery in boreal larch forests. A boosted regression tree analysis showed that postfire recovery was collectively controlled by processes that controlled seed availability, as well as by site conditions and climate variability. Fire severity and its spatial variability played a dominant role in determining vegetation recovery, indicating seed availability as the primary mechanism affecting postfire forest resilience. Environmental and immediate postfire climatic conditions appear to be less important, but interact strongly with fire severity to influence postfire recovery. If future warming and fire regimes manifest as expected in this region, seed limitation and climate-induced regeneration failure will become more prevalent and severe, which may cause forests to shift to alternative stable states. PMID:27857204
Liu, Zhihua; Yang, Jian; He, Hong S.
2013-01-01
The relative importance of fuel, topography, and weather on fire spread varies at different spatial scales, but how the relative importance of these controls respond to changing spatial scales is poorly understood. We designed a “moving window” resampling technique that allowed us to quantify the relative importance of controls on fire spread at continuous spatial scales using boosted regression trees methods. This quantification allowed us to identify the threshold value for fire size at which the dominant control switches from fuel at small sizes to weather at large sizes. Topography had a fluctuating effect on fire spread across the spatial scales, explaining 20–30% of relative importance. With increasing fire size, the dominant control switched from bottom-up controls (fuel and topography) to top-down controls (weather). Our analysis suggested that there is a threshold for fire size, above which fires are driven primarily by weather and more likely lead to larger fire size. We suggest that this threshold, which may be ecosystem-specific, can be identified using our “moving window” resampling technique. Although the threshold derived from this analytical method may rely heavily on the sampling technique, our study introduced an easily implemented approach to identify scale thresholds in wildfire regimes. PMID:23383247
Automatic red eye correction and its quality metric
NASA Astrophysics Data System (ADS)
Safonov, Ilia V.; Rychagov, Michael N.; Kang, KiMin; Kim, Sang Ho
2008-01-01
Red eye artifacts are a troublesome defect of amateur photos. Correcting red eyes during printing without user intervention, and thereby making photos more pleasant for an observer, is an important task. A novel, efficient technique for automatic red eye correction, aimed at photo printers, is proposed. The algorithm is independent of face orientation and is capable of detecting paired as well as single red eyes. The approach is based on 3D tables with typicalness levels for red eyes and human skin tones, and on directional edge detection filters for processing the redness image. Machine learning is applied for feature selection. For classification of red eye regions, a cascade of classifiers including a Gentle AdaBoost committee of Classification and Regression Trees (CART) is applied. The retouching stage includes desaturation, darkening, and blending with the initial image. Several versions of the implementation are possible, trading off detection and correction quality, processing time, and memory volume. A numeric quality criterion for automatic red eye correction is proposed. This quality metric is constructed by applying the Analytic Hierarchy Process (AHP) to consumer opinions about correction outcomes. The proposed metric helped to choose algorithm parameters via an optimization procedure. Experimental results demonstrate the high accuracy and efficiency of the proposed algorithm in comparison with existing solutions.
Seligman, D A; Pullinger, A G
2000-01-01
Confusion about the relationship of occlusion to temporomandibular disorders (TMD) persists. This study attempted to identify occlusal and attrition factors plus age that would characterize asymptomatic normal female subjects. A total of 124 female patients with intracapsular TMD were compared with 47 asymptomatic female controls for associations to 9 occlusal factors, 3 attrition severity measures, and age using classification tree, multiple stepwise logistic regression, and univariate analyses. Models were tested for accuracy (sensitivity and specificity) and total contribution to the variance. The classification tree model had 4 terminal nodes that used only anterior attrition and age. "Normals" were mainly characterized by low attrition levels, whereas patients had higher attrition and tended to be younger. The tree model was only moderately useful (sensitivity 63%, specificity 94%) in predicting normals. The logistic regression model incorporated unilateral posterior crossbite and mediotrusive attrition severity in addition to the 2 factors in the tree, but was slightly less accurate than the tree (sensitivity 51%, specificity 90%). When only occlusal factors were considered in the analysis, normals were additionally characterized by a lack of anterior open bite, smaller overjet, and smaller RCP-ICP slides. The log likelihood accounted for was similar for both the tree (pseudo R² = 29.38%; mean deviance = 0.95) and the multiple logistic regression (Cox Snell R² = 30.3%, mean deviance = 0.84) models. The occlusal and attrition factors studied were only moderately useful in differentiating normals from TMD patients.
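The accuracy measures quoted above follow directly from a confusion matrix. The sketch below treats "normal" as the positive class, as the abstract does when it speaks of predicting normals. The counts are invented for illustration and are not the study's data.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of actual positives (here: normals) that the model identifies."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of actual negatives (here: patients) that the model identifies."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts, not taken from the study.
true_pos, false_neg = 30, 20   # normals classified as normal / as patient
true_neg, false_pos = 90, 10   # patients classified as patient / as normal

sens = sensitivity(true_pos, false_neg)
spec = specificity(true_neg, false_pos)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

A model with high specificity but moderate sensitivity, as reported for the tree, rarely labels a patient as normal but misses a sizeable share of true normals.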
Towards lidar-based mapping of tree age at the Arctic forest tundra ecotone.
NASA Astrophysics Data System (ADS)
Jensen, J.; Maguire, A.; Oelkers, R.; Andreu-Hayles, L.; Boelman, N.; D'Arrigo, R.; Griffin, K. L.; Jennewein, J. S.; Hiers, E.; Meddens, A. J.; Russell, M.; Vierling, L. A.; Eitel, J.
2017-12-01
Climate change may cause spatial shifts in the forest-tundra ecotone (FTE). To improve our ability to study these spatial shifts, information on tree demography along the FTE is needed. The objective of this study was to assess the suitability of lidar-derived tree heights as a surrogate for tree age. We calculated individual tree age from 48 tree cores collected at basal height from white spruce (Picea glauca) within the FTE in northern Alaska. Tree height was obtained from terrestrial lidar scans (<1 cm spatial resolution). The relationship between age and height was examined using a linear regression model forced through the origin. We found a very strong predictive relationship between tree height and age (R2 = 0.90, RMSE = 19.34 years) for trees that ranged from 14 to 230 years in age. Separate regression models were also developed for small (height < 3 m) and large trees (height >= 3 m), yielding strong predictive relationships between height and age (R2 = 0.86, RMSE = 12.21 years, and R2 = 0.93, RMSE = 25.16 years, respectively). The slope coefficients for the small and large tree models (16.83 and 12.98 years/m, respectively) indicate that small trees grow 1.3 times faster than large trees at these FTE study sites. Although a strong, predictive relationship between age and height is uncommon in light-limited forest environments, our findings suggest that the sparseness of trees within the FTE may explain the strong tree height-age relationships found herein. Further analysis of 36 additional tree cores recently collected within the FTE near Inuvik, Canada will be performed. Our preliminary analysis suggests that lidar-derived tree height could be a reliable proxy for tree age at the FTE, thereby establishing a new technique for scaling tree structure and demographics across larger portions of this sensitive ecotone.
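A linear regression forced through the origin has a closed-form slope, sum(x*y) / sum(x*x). The sketch below computes that slope and the RMSE for a few invented height-age pairs (not the study's cores) and then reproduces the ratio of the two slope coefficients quoted in the abstract.

```python
import math


def fit_through_origin(x, y):
    """Least-squares slope for the model y = slope * x (no intercept)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def rmse(x, y, slope):
    """Root-mean-square error of the through-origin fit."""
    return math.sqrt(sum((b - slope * a) ** 2 for a, b in zip(x, y)) / len(x))


# Invented example values: tree height (m) and age (years).
heights = [1.0, 2.0, 4.0]
ages = [14.0, 30.0, 52.0]

slope = fit_through_origin(heights, ages)
error = rmse(heights, ages, slope)
print(f"slope = {slope:.2f} years/m, RMSE = {error:.2f} years")

# Ratio of the small-tree and large-tree slopes reported in the abstract.
slope_ratio = 16.83 / 12.98
print(f"slope ratio = {slope_ratio:.2f}")
```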
Rifai, Sami W; Urquiza Muñoz, José D; Negrón-Juárez, Robinson I; Ramírez Arévalo, Fredy R; Tello-Espinoza, Rodil; Vanderwel, Mark C; Lichstein, Jeremy W; Chambers, Jeffrey Q; Bohlman, Stephanie A
2016-10-01
Wind disturbance can create large forest blowdowns, which greatly reduces live biomass and adds uncertainty to the strength of the Amazon carbon sink. Observational studies from within the central Amazon have quantified blowdown size and estimated total mortality but have not determined which trees are most likely to die from a catastrophic wind disturbance. Also, the impact of spatial dependence upon tree mortality from wind disturbance has seldom been quantified, which is important because wind disturbance often kills clusters of trees due to large treefalls killing surrounding neighbors. We examine (1) the causes of differential mortality between adult trees from a 300-ha blowdown event in the Peruvian region of the northwestern Amazon, (2) how accounting for spatial dependence affects mortality predictions, and (3) how incorporating both differential mortality and spatial dependence affect the landscape level estimation of necromass produced from the blowdown. Standard regression and spatial regression models were used to estimate how stem diameter, wood density, elevation, and a satellite-derived disturbance metric influenced the probability of tree death from the blowdown event. The model parameters regarding tree characteristics, topography, and spatial autocorrelation of the field data were then used to determine the consequences of non-random mortality for landscape production of necromass through a simulation model. Tree mortality was highly non-random within the blowdown, where tree mortality rates were highest for trees that were large, had low wood density, and were located at high elevation. Of the differential mortality models, the non-spatial models overpredicted necromass, whereas the spatial model slightly underpredicted necromass. 
When parameterized from the same field data, the spatial regression model with differential mortality estimated only 7.5% more dead trees across the entire blowdown than the random mortality model, yet it estimated 51% greater necromass. We suggest that predictions of forest carbon loss from wind disturbance are sensitive to not only the underlying spatial dependence of observations, but also the biological differences between individuals that promote differential levels of mortality. © 2016 by the Ecological Society of America.
O’Connor, Christopher D.; Lynch, Ann M.
2016-01-01
A significant concern about Metabolic Scaling Theory (MST) in real forests relates to consistent differences between the values of power law scaling exponents of tree primary size measures used to estimate mass and those predicted by MST. Here we consider why observed scaling exponents for diameter and height relationships deviate from MST predictions across three semi-arid conifer forests in relation to: (1) tree condition and physical form, (2) the level of inter-tree competition (e.g. open vs closed stand structure), (3) increasing tree age, and (4) differences in site productivity. Scaling exponent values derived from non-linear least-squares regression for trees in excellent condition (n = 381) were above the MST prediction at the 95% confidence level, while the exponent for trees in good condition were no different than MST (n = 926). Trees that were in fair or poor condition, characterized as diseased, leaning, or sparsely crowned had exponent values below MST predictions (n = 2,058), as did recently dead standing trees (n = 375). Exponent value of the mean-tree model that disregarded tree condition (n = 3,740) was consistent with other studies that reject MST scaling. Ostensibly, as stand density and competition increase trees exhibited greater morphological plasticity whereby the majority had characteristically fair or poor growth forms. Fitting by least-squares regression biases the mean-tree model scaling exponent toward values that are below MST idealized predictions. For 368 trees from Arizona with known establishment dates, increasing age had no significant impact on expected scaling. We further suggest height to diameter ratios below MST relate to vertical truncation caused by limitation in plant water availability. Even with environmentally imposed height limitation, proportionality between height and diameter scaling exponents were consistent with the predictions of MST. PMID:27391084
Swetnam, Tyson L; O'Connor, Christopher D; Lynch, Ann M
2016-01-01
A significant concern about Metabolic Scaling Theory (MST) in real forests relates to consistent differences between the values of power law scaling exponents of tree primary size measures used to estimate mass and those predicted by MST. Here we consider why observed scaling exponents for diameter and height relationships deviate from MST predictions across three semi-arid conifer forests in relation to: (1) tree condition and physical form, (2) the level of inter-tree competition (e.g. open vs closed stand structure), (3) increasing tree age, and (4) differences in site productivity. Scaling exponent values derived from non-linear least-squares regression for trees in excellent condition (n = 381) were above the MST prediction at the 95% confidence level, while the exponent for trees in good condition were no different than MST (n = 926). Trees that were in fair or poor condition, characterized as diseased, leaning, or sparsely crowned had exponent values below MST predictions (n = 2,058), as did recently dead standing trees (n = 375). Exponent value of the mean-tree model that disregarded tree condition (n = 3,740) was consistent with other studies that reject MST scaling. Ostensibly, as stand density and competition increase trees exhibited greater morphological plasticity whereby the majority had characteristically fair or poor growth forms. Fitting by least-squares regression biases the mean-tree model scaling exponent toward values that are below MST idealized predictions. For 368 trees from Arizona with known establishment dates, increasing age had no significant impact on expected scaling. We further suggest height to diameter ratios below MST relate to vertical truncation caused by limitation in plant water availability. Even with environmentally imposed height limitation, proportionality between height and diameter scaling exponents were consistent with the predictions of MST.
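A scaling exponent for a power law y = a * x^b can be estimated in several ways. The abstract uses non-linear least squares; the sketch below swaps in the simpler ordinary least squares on log-transformed values, which is a different estimator and can give different results on noisy data. The diameters, the coefficient, and the exponent of 2/3 are illustrative values, not the study's measurements.

```python
import math


def fit_power_law(x, y):
    """Estimate (a, b) in y = a * x**b by least squares on log(x), log(y)."""
    log_x = [math.log(v) for v in x]
    log_y = [math.log(v) for v in y]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    exponent = sxy / sxx
    coefficient = math.exp(mean_y - exponent * mean_x)
    return coefficient, exponent


# Noise-free synthetic data: height = 2.0 * diameter ** (2/3).
diameters = [5.0, 10.0, 20.0, 40.0, 80.0]
tree_heights = [2.0 * d ** (2.0 / 3.0) for d in diameters]

coefficient, exponent = fit_power_law(diameters, tree_heights)
print(f"coefficient = {coefficient:.3f}, exponent = {exponent:.3f}")
```

Fitting subsets of trees separately (for example by condition class) and comparing the resulting exponents mirrors the kind of comparison the abstract describes.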
Boosted structured additive regression for Escherichia coli fed-batch fermentation modeling.
Melcher, Michael; Scharl, Theresa; Luchner, Markus; Striedner, Gerald; Leisch, Friedrich
2017-02-01
The quality of biopharmaceuticals and patients' safety are of highest priority, and there are tremendous efforts to replace empirical production process designs with knowledge-based approaches. The main challenge in this context is that real-time access to process variables related to product quality and quantity is severely limited. To date, comprehensive on- and offline monitoring platforms are used to generate process data sets that allow for the development of mechanistic and/or data-driven models for real-time prediction of these important quantities. The ultimate goal is to implement model-based feedback control loops that facilitate online control of product quality. In this contribution, we explore structured additive regression (STAR) models in combination with boosting as a variable selection tool for modeling the cell dry mass, product concentration, and optical density on the basis of online available process variables and two-dimensional fluorescence spectroscopic data. STAR models are powerful extensions of linear models allowing for inclusion of smooth effects or interactions between predictors. Boosting constructs the final model in a stepwise manner and provides a variable importance measure via predictor selection frequencies. Our results show that the cell dry mass can be modeled with a relative error of about ±3%, the optical density with ±6%, the soluble protein with ±16%, and the insoluble product with an accuracy of ±12%. Biotechnol. Bioeng. 2017;114: 321-334. © 2016 Wiley Periodicals, Inc.
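The stepwise construction and the selection-frequency importance measure can be sketched with component-wise L2 boosting. This is a simplified stand-in: it uses one linear base learner per predictor rather than the smooth or interaction terms of a STAR model, and the predictor names, data, step count, and step size are assumptions for illustration.

```python
def componentwise_boost(columns, y, n_steps=100, nu=0.1):
    """Component-wise L2 boosting with linear base learners.

    columns: dict mapping predictor name -> list of centred values
    y:       list of responses
    Returns the offset, the coefficients, and selection frequencies.
    """
    offset = sum(y) / len(y)
    residuals = [yi - offset for yi in y]
    coefs = {name: 0.0 for name in columns}
    counts = {name: 0 for name in columns}
    for _ in range(n_steps):
        best_name, best_beta, best_sse = None, 0.0, None
        for name, x in columns.items():
            sxx = sum(v * v for v in x)
            if sxx == 0:
                continue  # a constant predictor carries no information
            beta = sum(v * r for v, r in zip(x, residuals)) / sxx
            sse = sum((r - beta * v) ** 2 for v, r in zip(x, residuals))
            if best_sse is None or sse < best_sse:
                best_name, best_beta, best_sse = name, beta, sse
        # Update only the best-fitting predictor, shrunk by the step size nu.
        coefs[best_name] += nu * best_beta
        counts[best_name] += 1
        x = columns[best_name]
        residuals = [r - nu * best_beta * v for v, r in zip(x, residuals)]
    freqs = {name: c / n_steps for name, c in counts.items()}
    return offset, coefs, freqs


# Invented data: the response depends on "signal" only; "noise" is irrelevant.
columns = {
    "signal": [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0],
    "noise": [1.0, -1.0, 1.0, 0.0, -1.0, 1.0, -1.0],
}
y = [5.0 + 2.0 * s for s in columns["signal"]]

offset, coefs, freqs = componentwise_boost(columns, y)
print(coefs, freqs)
```

Because only one predictor is updated per step, predictors that are never selected keep a zero coefficient, which is what makes the procedure usable for variable selection.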
Automated Detection of Driver Fatigue Based on AdaBoost Classifier with EEG Signals
Hu, Jianfeng
2017-01-01
Purpose: Driving fatigue has become an important cause of road accidents, and many studies have analyzed driver fatigue. EEG is becoming increasingly useful in measuring the fatigue state. Manual interpretation of EEG signals is impossible, so an effective method for automatic detection of EEG signals is crucially needed. Method: In order to evaluate the complex, unstable, and non-linear characteristics of EEG signals, four feature sets were computed from EEG signals, in which fuzzy entropy (FE), sample entropy (SE), approximate entropy (AE), spectral entropy (PE), and combined entropies (FE + SE + AE + PE) were included. All these feature sets were used as the input vectors of an AdaBoost classifier, a boosting method which is fast and highly accurate. To assess our method, several experiments including parameter setting and classifier comparison were conducted on 28 subjects. For comparison, Decision Trees (DT), Support Vector Machine (SVM) and Naive Bayes (NB) classifiers were used. Results: The proposed method (combination of FE and AdaBoost) outperforms the other schemes. Using the FE feature extractor, AdaBoost achieves an area under the receiver operating curve (AUC) of 0.994, error rate (ERR) of 0.024, Precision of 0.969, Recall of 0.984, F1 score of 0.976, and Matthews correlation coefficient (MCC) of 0.952, compared to SVM (ERR at 0.035, Precision of 0.957, Recall of 0.974, F1 score of 0.966, and MCC of 0.930 with AUC of 0.990), DT (ERR at 0.142, Precision of 0.857, Recall of 0.859, F1 score of 0.966, and MCC of 0.716 with AUC of 0.916) and NB (ERR at 0.405, Precision of 0.646, Recall of 0.434, F1 score of 0.519, and MCC of 0.203 with AUC of 0.606). The results show that the FE feature set and the combined feature set outperform the other feature sets.
AdaBoost seems to be more robust to changes in the ratio of test samples to all samples and in the number of subjects, which might therefore aid in the real-time detection of driver fatigue through the classification of EEG signals. Conclusion: By using a combination of FE features and an AdaBoost classifier to detect EEG-based driver fatigue, this paper ensured confidence in exploring the inherent physiological mechanisms and wearable application. PMID:28824409
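The evaluation metrics named in this abstract (error rate, precision, recall, F1 score, and Matthews correlation coefficient) all follow from the four cells of a binary confusion matrix. The sketch below shows the standard definitions; the counts are hypothetical and are not taken from the study, and the function names are illustrative only.

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Return common binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    error_rate = (fp + fn) / total
    # Matthews correlation coefficient: +1 is perfect, 0 is chance level.
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "error_rate": error_rate,
        "mcc": mcc,
    }

def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts, not taken from the study.
metrics = binary_metrics(tp=90, fp=10, fn=5, tn=95)

# Consistency check using the AdaBoost precision and recall quoted in the abstract.
adaboost_f1 = f1_from(0.969, 0.984)
```

With the AdaBoost precision (0.969) and recall (0.984) quoted above, `f1_from` gives approximately 0.976, which agrees with the reported F1 score.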
Landenburger, L.; Lawrence, R.L.; Podruzny, S.; Schwartz, C.C.
2008-01-01
Moderate resolution satellite imagery traditionally has been thought to be inadequate for mapping vegetation at the species level. This has made comprehensive mapping of regional distributions of sensitive species, such as whitebark pine, either impractical or extremely time consuming. We sought to determine whether using a combination of moderate resolution satellite imagery (Landsat Enhanced Thematic Mapper Plus), extensive stand data collected by land management agencies for other purposes, and modern statistical classification techniques (boosted classification trees) could result in successful mapping of whitebark pine. Overall classification accuracies exceeded 90%, with similar individual class accuracies. Accuracies on a localized basis varied based on elevation. Accuracies also varied among administrative units, although we were not able to determine whether these differences related to inherent spatial variations or differences in the quality of available reference data.
ANSYS UIDL-Based CAE Development of Axial Support System for Optical Mirror
NASA Astrophysics Data System (ADS)
Yang, De-Hua; Shao, Liang
2008-09-01
The Whiffle-tree type axial support mechanism is widely adopted for relatively large optical mirrors. Based on the secondary development tools offered by the commonly used Finite Element Analysis (FEA) software ANSYS, the ANSYS Parametric Design Language (APDL) is used to create a parameter-driven FEA model of the mirror, and the ANSYS User Interface Design Language (UIDL) is used to generate custom interactive menus, whereby a relatively independent, dedicated Computer Aided Engineering (CAE) module is embedded in ANSYS for calculation and optimization of the axial Whiffle-tree support of optical mirrors. An example is also described to illustrate the intuitive and effective usage of the dedicated module, which boosts work efficiency and reduces the engineering knowledge required of the user. The philosophy of building secondary-developed special modules on commonly used software also suggests itself for product development in other industries.
Multivariate regression model for partitioning tree volume of white oak into round-product classes
Daniel A. Yaussy; David L. Sonderman
1984-01-01
Describes the development of multivariate equations that predict the expected cubic volume of four round-product classes from independent variables composed of individual tree-quality characteristics. Although the model has limited application at this time, it does demonstrate the feasibility of partitioning total tree cubic volume into round-product classes based on...
Kathleen L. Kavanaugh; Matthew B. Dickinson; Anthony S. Bova
2010-01-01
Current operational methods for predicting tree mortality from fire injury are regression-based models that only indirectly consider underlying causes and, thus, have limited generality. A better understanding of the physiological consequences of tree heating and injury is needed to develop biophysical process models that can make predictions under changing or novel...
Height-age relationships for regeneration-size trees in the northern Rocky Mountains, USA
Dennis E. Ferguson; Clinton E. Carlson
2010-01-01
Regression equations were developed to predict heights of 10 conifer species in regenerating stands in central and northern Idaho, western Montana, and eastern Washington. Most sample trees were natural regeneration that became established after conventional harvest and site preparation methods. Heights are predicted as a function of tree age, residual overstory density...
Louis R. Iverson; Anantha M. Prasad
2002-01-01
Global climate change could have profound effects on the Earth's biota, including large redistributions of tree species and forest types. We used DISTRIB, a deterministic regression tree analysis model, to examine environmental drivers related to current forest-species distributions and then model potential suitable habitat under five climate change scenarios...
A prediction model of short-term ionospheric foF2 based on AdaBoost
NASA Astrophysics Data System (ADS)
Zhao, Xiukuan; Ning, Baiqi; Liu, Libo; Song, Gangbing
2014-02-01
In this paper, the AdaBoost-BP algorithm is used to construct a new model to predict the critical frequency of the ionospheric F2-layer (foF2) one hour ahead. Different indices were used to characterize ionospheric diurnal and seasonal variations and their dependence on solar and geomagnetic activity. These indices, together with the current observed foF2 value, were input into the prediction model, and the foF2 value one hour ahead was output. We analyzed twenty-two years of foF2 data from nine ionosonde stations in the East-Asian sector in this work. The first eleven years of data were used as a training dataset and the second eleven years of data were used as a testing dataset. The results show that AdaBoost-BP performs better than the BP Neural Network (BPNN), Support Vector Regression (SVR) and the IRI model. For example, the AdaBoost-BP prediction absolute error of foF2 at Irkutsk station (a middle latitude station) is 0.32 MHz, which is better than 0.34 MHz from BPNN and 0.35 MHz from SVR, and also significantly outperforms the IRI model, whose absolute error is 0.64 MHz. Meanwhile, the AdaBoost-BP prediction absolute error at Taipei station (a low latitude station) is 0.78 MHz, which is better than 0.81 MHz from BPNN, 0.81 MHz from SVR and 1.37 MHz from the IRI model. Finally, the variation of the AdaBoost-BP prediction error with season, solar activity and latitude is also discussed in the paper.
Reulen, Holger; Kneib, Thomas
2016-04-01
One important goal in multi-state modelling is to explore information about conditional transition-type-specific hazard rate functions by estimating influencing effects of explanatory variables. This may be performed using single transition-type-specific models if these covariate effects are assumed to be different across transition-types. To investigate whether this assumption holds or whether one of the effects is equal across several transition-types (cross-transition-type effect), a combined model has to be applied, for instance with the use of a stratified partial likelihood formulation. Here, prior knowledge about the underlying covariate effect mechanisms is often sparse, especially about the ineffectiveness of transition-type-specific or cross-transition-type effects. As a consequence, data-driven variable selection is an important task: a large number of estimable effects has to be taken into account if joint modelling of all transition-types is performed. A related but subsequent task is model choice: is an effect satisfactorily estimated assuming linearity, or does the true underlying nature deviate strongly from linearity? This article introduces component-wise Functional Gradient Descent Boosting (boosting for short) for multi-state models, an approach performing unsupervised variable selection and model choice simultaneously within a single estimation run. We demonstrate that features and advantages in the application of boosting introduced and illustrated in classical regression scenarios remain present in the transfer to multi-state models. As a consequence, boosting provides an effective means to answer questions about ineffectiveness and non-linearity of single transition-type-specific or cross-transition-type effects.
Fish habitat regression under water scarcity scenarios in the Douro River basin
NASA Astrophysics Data System (ADS)
Segurado, Pedro; Jauch, Eduardo; Neves, Ramiro; Ferreira, Teresa
2015-04-01
Climate change will predictably alter hydrological patterns and processes at the catchment scale, with impacts on habitat conditions for fish. The main goals of this study are to identify the stream reaches that will undergo more pronounced flow reduction under different climate change scenarios and to assess which fish species will be more affected by the consequent regression of suitable habitats. The interplay between changes in flow and temperature and the presence of transversal artificial obstacles (dams and weirs) is analysed. The results will contribute to river management and impact mitigation actions under climate change. This study was carried out in the Tâmega catchment of the Douro basin. A set of 29 hydrological, climatic, and hydrogeomorphological variables were modelled using a water modelling system (MOHID), based on meteorological data recorded monthly between 2008 and 2014. The same variables were modelled considering future climate change scenarios. The resulting variables were used in empirical habitat models of a set of key species (brown trout Salmo trutta fario, barbel Barbus bocagei, and nase Pseudochondrostoma duriense) using boosted regression trees. The stream segments between tributaries were used as spatial sampling units. Models were developed for the whole Douro basin using 401 fish sampling sites, although the modelled probabilities of species occurrence for each stream segment were predicted only for the Tâmega catchment. These probabilities of occurrence were used to classify stream segments into suitable and unsuitable habitat for each fish species, considering the future climate change scenario. The stream reaches that were predicted to undergo longer flow interruptions were identified and crossed with the resulting predictive maps of habitat suitability to compute the total area of habitat loss per species.
Among the target species, the brown trout was predicted to be the most sensitive to habitat regression due to the interplay of flow reduction, increase of temperature and transversal barriers. This species is therefore a good indicator of climate change impacts in rivers, and we recommend using it as a target of monitoring programs to be implemented in the context of climate change adaptation strategies.
NASA Astrophysics Data System (ADS)
Künne, A.; Fink, M.; Kipka, H.; Krause, P.; Flügel, W.-A.
2012-06-01
In this paper, a method is presented to estimate excess nitrogen on large scales considering single field processes. The approach was implemented by using the physically based model J2000-S to simulate the nitrogen balance as well as the hydrological dynamics within meso-scale test catchments. The model input data, the parameterization, the results and a detailed system understanding were used to generate the regression tree models with GUIDE (Loh, 2002). For each landscape type in the federal state of Thuringia a regression tree was calibrated and validated using the model data and results of excess nitrogen from the test catchments. Hydrological parameters such as precipitation and evapotranspiration were also used to predict excess nitrogen by the regression tree model. Hence they had to be calculated and regionalized as well for the state of Thuringia. Here the model J2000g was used to simulate the water balance on the macro scale. With the regression trees the excess nitrogen was regionalized for each landscape type of Thuringia. The approach allows calculating the potential nitrogen input into the streams of the drainage area. The results show that the applied methodology was able to transfer the detailed model results of the meso-scale catchments to the entire state of Thuringia by low computing time without losing the detailed knowledge from the nitrogen transport modeling. This was validated with modeling results from Fink (2004) in a catchment lying in the regionalization area. The regionalized and modeled excess nitrogen correspond with 94%. The study was conducted within the framework of a project in collaboration with the Thuringian Environmental Ministry, whose overall aim was to assess the effect of agro-environmental measures regarding load reduction in the water bodies of Thuringia to fulfill the requirements of the European Water Framework Directive (Bäse et al., 2007; Fink, 2006; Fink et al., 2007).
Predicting Diameter at Breast Height from Stump Diameters for Northeastern Tree Species
Eric H. Wharton
1984-01-01
Presents equations to predict diameter at breast height from stump diameter measurements for 17 northeastern tree species. Simple linear regression was used to develop the equations. Application of the equations is discussed.
The US EPA is developing assessment tools to evaluate the effectiveness of green infrastructure (GI) applied in stormwater best management practices (BMPs) at the small watershed (HUC12 or finer) scale. Based on analysis of historical monitoring data using boosted regression tre...
Narratives Boost Entrepreneurial Attitudes: Making an Entrepreneurial Career Attractive?
ERIC Educational Resources Information Center
Fellnhofer, Katharina
2018-01-01
This article analyses the impact of narratives on entrepreneurial attitudes and intentions. To this end, a quasi-experiment was conducted to evaluate web-based entrepreneurial narratives. The paired-sample tests and regression analysis use a sample of 466 people from Austria, Finland, and Greece and indicate that individuals' perceptions of the…
Guo, Huey-Ming; Shyu, Yea-Ing Lotus; Chang, Her-Kun
2006-01-01
In this article, the authors provide an overview of a research method to predict quality of care in a home health nursing data set. The results of this study can be visualized through classification and regression tree (CART) graphs. The analysis was more effective, and the results were more informative, because the home health nursing data set was analyzed with a combination of logistic regression and CART; the two techniques complement each other. The results were also more informative in that more patient characteristics were found to be related to quality of care in home care. The results help home health nurses predict patient outcomes in case management. Improved prediction is needed for interventions to be appropriately targeted for improved patient outcome and quality of care.
Huang, Li-Shan; Myers, Gary J.; Davidson, Philip W.; Cox, Christopher; Xiao, Fenyuan; Thurston, Sally W.; Cernichiari, Elsa; Shamlaye, Conrad F.; Sloane-Reeves, Jean; Georger, Lesley; Clarkson, Thomas W.
2007-01-01
Studies of the association between prenatal methylmercury exposure from maternal fish consumption during pregnancy and neurodevelopmental test scores in the Seychelles Child Development Study have found no consistent pattern of associations through age nine years. The analyses for the most recent nine-year data examined the population effects of prenatal exposure, but did not address the possibility of non-homogeneous susceptibility. This paper presents a regression tree approach: covariate effects are treated nonlinearly and non-additively and non-homogeneous effects of prenatal methylmercury exposure are permitted among the covariate clusters identified by the regression tree. The approach allows us to address whether children in the lower or higher ends of the developmental spectrum differ in susceptibility to subtle exposure effects. Of twenty-one endpoints available at age nine years, we chose the Wechsler Full Scale IQ and its associated covariates to construct the regression tree. The prenatal mercury effect in each of the nine resulting clusters was assessed linearly and non-homogeneously. In addition we reanalyzed five other nine-year endpoints that in the linear analysis had a two-tailed p-value <0.2 for the effect of prenatal exposure. In this analysis, motor proficiency and activity level improved significantly with increasing MeHg for 53% of the children who had an average home environment. Motor proficiency significantly decreased with increasing prenatal MeHg exposure in 7% of the children whose home environment was below average. The regression tree results support previous analyses of outcomes in this cohort. However, this analysis raises the intriguing possibility that an effect may be non-homogeneous among children with different backgrounds and IQ levels. PMID:17942158
Huang, Li-Shan; Myers, Gary J; Davidson, Philip W; Cox, Christopher; Xiao, Fenyuan; Thurston, Sally W; Cernichiari, Elsa; Shamlaye, Conrad F; Sloane-Reeves, Jean; Georger, Lesley; Clarkson, Thomas W
2007-11-01
Studies of the association between prenatal methylmercury exposure from maternal fish consumption during pregnancy and neurodevelopmental test scores in the Seychelles Child Development Study have found no consistent pattern of associations through age 9 years. The analyses for the most recent 9-year data examined the population effects of prenatal exposure, but did not address the possibility of non-homogeneous susceptibility. This paper presents a regression tree approach: covariate effects are treated non-linearly and non-additively and non-homogeneous effects of prenatal methylmercury exposure are permitted among the covariate clusters identified by the regression tree. The approach allows us to address whether children in the lower or higher ends of the developmental spectrum differ in susceptibility to subtle exposure effects. Of 21 endpoints available at age 9 years, we chose the Wechsler Full Scale IQ and its associated covariates to construct the regression tree. The prenatal mercury effect in each of the nine resulting clusters was assessed linearly and non-homogeneously. In addition we reanalyzed five other 9-year endpoints that in the linear analysis had a two-tailed p-value <0.2 for the effect of prenatal exposure. In this analysis, motor proficiency and activity level improved significantly with increasing MeHg for 53% of the children who had an average home environment. Motor proficiency significantly decreased with increasing prenatal MeHg exposure in 7% of the children whose home environment was below average. The regression tree results support previous analyses of outcomes in this cohort. However, this analysis raises the intriguing possibility that an effect may be non-homogeneous among children with different backgrounds and IQ levels.
Kim, So-Ra; Kwak, Doo-Ahn; Lee, Woo-Kyun; Son, Yowhan; Bae, Sang-Won; Kim, Choonsig; Yoo, Seongjin
2010-07-01
The objective of this study was to estimate the carbon storage capacity of Pinus densiflora stands using remotely sensed data by combining digital aerial photography with light detection and ranging (LiDAR) data. A digital canopy model (DCM), generated from the LiDAR data, was combined with aerial photography for segmenting crowns of individual trees. To eliminate errors in over and under-segmentation, the combined image was smoothed using a Gaussian filtering method. The processed image was then segmented into individual trees using a marker-controlled watershed segmentation method. After measuring the crown area from the segmented individual trees, the individual tree diameter at breast height (DBH) was estimated using a regression function developed from the relationship observed between the field-measured DBH and crown area. The above ground biomass of individual trees could be calculated by an image-derived DBH using a regression function developed by the Korea Forest Research Institute. The carbon storage, based on individual trees, was estimated by simple multiplication using the carbon conversion index (0.5), as suggested in guidelines from the Intergovernmental Panel on Climate Change. The mean carbon storage per individual tree was estimated and then compared with the field-measured value. This study suggested that the biomass and carbon storage in a large forest area can be effectively estimated using aerial photographs and LiDAR data.
Yip, T C-F; Ma, A J; Wong, V W-S; Tse, Y-K; Chan, H L-Y; Yuen, P-C; Wong, G L-H
2017-08-01
Non-alcoholic fatty liver disease (NAFLD) affects 20%-40% of the general population in developed countries and is an increasingly important cause of hepatocellular carcinoma. Electronic medical records facilitate large-scale epidemiological studies, but existing NAFLD scores often require clinical and anthropometric parameters that may not be captured in those databases. We aimed to develop and validate a laboratory parameter-based machine learning model to detect NAFLD for the general population. We randomly divided 922 subjects from a population screening study into training and validation groups; NAFLD was diagnosed by proton-magnetic resonance spectroscopy. On the basis of machine learning from 23 routine clinical and laboratory parameters after elastic net regularisation, we evaluated the logistic regression, ridge regression, AdaBoost and decision tree models. The areas under the receiver-operating characteristic curve (AUROC) of the models in the validation group were compared. Six predictors including alanine aminotransferase, high-density lipoprotein cholesterol, triglyceride, haemoglobin A1c, white blood cell count and the presence of hypertension were selected. The NAFLD ridge score achieved AUROC of 0.87 (95% CI 0.83-0.90) and 0.88 (0.84-0.91) in the training and validation groups respectively. Using dual cut-offs of 0.24 and 0.44, the NAFLD ridge score achieved 92% (86%-96%) sensitivity and 90% (86%-93%) specificity with corresponding negative and positive predictive values of 96% (91%-98%) and 69% (59%-78%), and 87% overall accuracy among the 70% of classifiable subjects in the validation group; 30% of subjects remained indeterminate. The NAFLD ridge score is a simple and robust reference comparable to existing NAFLD scores to exclude NAFLD patients in epidemiological studies. © 2017 John Wiley & Sons Ltd.
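The dual cut-off rule described in this abstract splits subjects into three groups, which is why a share of them remains indeterminate. The sketch below illustrates the idea with the two cut-offs quoted above (0.24 and 0.44). It does not compute the score itself, since the abstract does not give the model coefficients, and the handling of scores that fall exactly on a cut-off is an assumption, as are the function name and labels.

```python
def classify_by_dual_cutoff(score, low=0.24, high=0.44):
    """Map a risk score to one of three groups using two cut-offs.

    Assumption: scores exactly equal to a cut-off are treated as
    indeterminate; the abstract does not state the boundary convention.
    """
    if score < low:
        return "rule out"
    if score > high:
        return "rule in"
    return "indeterminate"

# Hypothetical scores, not taken from the study.
example_scores = [0.10, 0.24, 0.30, 0.44, 0.80]
labels = [classify_by_dual_cutoff(s) for s in example_scores]
```

Sensitivity and specificity reported under such a rule are computed only among subjects who fall outside the indeterminate band, so they should be read together with the fraction of subjects that could be classified.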
An Interactive Tool For Semi-automated Statistical Prediction Using Earth Observations and Models
NASA Astrophysics Data System (ADS)
Zaitchik, B. F.; Berhane, F.; Tadesse, T.
2015-12-01
We developed a semi-automated statistical prediction tool applicable to concurrent analysis or seasonal prediction of any time series variable in any geographic location. The tool was developed using Shiny, JavaScript, HTML and CSS. A user can extract a predictand by drawing a polygon over a region of interest on the provided user interface (global map). The user can select the Climatic Research Unit (CRU) precipitation or Climate Hazards Group InfraRed Precipitation with Station data (CHIRPS) as predictand. They can also upload their own predictand time series. Predictors can be extracted from sea surface temperature, sea level pressure, winds at different pressure levels, air temperature at various pressure levels, and geopotential height at different pressure levels. By default, reanalysis fields are applied as predictors, but the user can also upload their own predictors, including a wide range of compatible satellite-derived datasets. The package generates correlations of the variables selected with the predictand. The user also has the option to generate composites of the variables based on the predictand. Next, the user can extract predictors by drawing polygons over the regions that show strong correlations (composites). Then, the user can select some or all of the statistical prediction models provided. Provided models include Linear Regression models (GLM, SGLM), Tree-based models (bagging, random forest, boosting), Artificial Neural Network, and other non-linear models such as Generalized Additive Model (GAM) and Multivariate Adaptive Regression Splines (MARS). Finally, the user can download the analysis steps they used, such as the region they selected, the time period they specified, the predictand and predictors they chose and preprocessing options they used, and the model results in PDF or HTML format. Key words: Semi-automated prediction, Shiny, R, GLM, ANN, RF, GAM, MARS
Gieswein, Alexander; Hering, Daniel; Feld, Christian K
2017-09-01
Freshwater ecosystems are impacted by a range of stressors arising from diverse human-caused land and water uses. Identifying the relative importance of single stressors and understanding how multiple stressors interact and jointly affect biology is crucial for River Basin Management. This study addressed multiple human-induced stressors and their effects on the aquatic flora and fauna based on data from standard WFD monitoring schemes. For altogether 1095 sites within a mountainous catchment, we used 12 stressor variables covering three different stressor groups: riparian land use, physical habitat quality and nutrient enrichment. Twenty-one biological metrics calculated from taxa lists of three organism groups (fish, benthic invertebrates and aquatic macrophytes) served as response variables. Stressor and response variables were subjected to Boosted Regression Tree (BRT) analysis to identify stressor hierarchy and stressor interactions and subsequently to Generalised Linear Regression Modelling (GLM) to quantify the stressors' standardised effect sizes. Our results show that riverine habitat degradation was the dominant stressor group for the river fauna, notably the bed physical habitat structure. Overall, the explained variation in benthic invertebrate metrics was higher than it was in fish and macrophyte metrics. In particular, general integrative (aggregate) metrics such as % Ephemeroptera, Plecoptera and Trichoptera (EPT) taxa performed better than ecological traits (e.g. % feeding types). Overall, additive stressor effects dominated, while significant and meaningful stressor interactions were generally rare and weak. We concluded that, given the type of stressor and ecological response variables addressed in this study, river basin managers do not need to bother much about complex stressor interactions, but can focus on the prevailing stressors according to the hierarchy identified. Copyright © 2017 Elsevier B.V. All rights reserved.
West, Amanda M.; Evangelista, Paul H.; Jarnevich, Catherine S.; Young, Nicholas E.; Stohlgren, Thomas J.; Talbert, Colin; Talbert, Marian; Morisette, Jeffrey; Anderson, Ryan
2016-01-01
Early detection of invasive plant species is vital for the management of natural resources and protection of ecosystem processes. The use of satellite remote sensing for mapping the distribution of invasive plants is becoming more common; however, conventional imaging software and classification methods have been shown to be unreliable. In this study, we test and evaluate the use of five species distribution model techniques fit with satellite remote sensing data to map invasive tamarisk (Tamarix spp.) along the Arkansas River in Southeastern Colorado. The models tested included boosted regression trees (BRT), Random Forest (RF), multivariate adaptive regression splines (MARS), generalized linear model (GLM), and Maxent. These analyses were conducted using a newly developed software package called the Software for Assisted Habitat Modeling (SAHM). All models were trained with 499 presence points, 10,000 pseudo-absence points, and predictor variables acquired from the Landsat 5 Thematic Mapper (TM) sensor over an eight-month period to distinguish tamarisk from native riparian vegetation using detection of phenological differences. From the Landsat scenes, we used individual bands and calculated Normalized Difference Vegetation Index (NDVI), Soil-Adjusted Vegetation Index (SAVI), and tasseled cap transformations. All five models identified current tamarisk distribution on the landscape successfully based on threshold-independent and threshold-dependent evaluation metrics with independent location data. To account for model-specific differences, we produced an ensemble of all five models with map output highlighting areas of agreement and areas of uncertainty. Our results demonstrate the usefulness of species distribution models in analyzing remotely sensed data and the utility of ensemble mapping, and showcase the capability of SAHM in pre-processing and executing multiple complex models.
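Two of the predictor variables named in this abstract, NDVI and SAVI, are simple band ratios of near-infrared (NIR) and red reflectance. The sketch below uses the standard textbook definitions; it is not taken from SAHM or from the study's processing chain. The reflectance values are hypothetical, and the soil adjustment factor of 0.5 is the commonly used default rather than a value stated in the abstract. For Landsat 5 TM, red is band 3 and NIR is band 4.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index from NIR and red reflectance."""
    return (nir - red) / (nir + red)

def savi(nir, red, soil_factor=0.5):
    """Soil-Adjusted Vegetation Index; soil_factor is the adjustment term L."""
    return (1.0 + soil_factor) * (nir - red) / (nir + red + soil_factor)

# Hypothetical surface reflectances for one vegetated pixel (unitless, 0 to 1).
nir_reflectance = 0.5
red_reflectance = 0.1

pixel_ndvi = ndvi(nir_reflectance, red_reflectance)
pixel_savi = savi(nir_reflectance, red_reflectance)
print(f"NDVI = {pixel_ndvi:.3f}, SAVI = {pixel_savi:.3f}")
```

Computing these indices for several acquisition dates gives a per-pixel time series, which is the kind of phenological signal the abstract describes using to separate tamarisk from native riparian vegetation.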
NASA Astrophysics Data System (ADS)
Naghibi, Seyed Amir; Pourghasemi, Hamid Reza; Abbaspour, Karim
2018-02-01
Considering the unstable condition of water resources in Iran and many other countries in arid and semi-arid regions, groundwater studies are very important. Therefore, the aim of this study is to model groundwater potential by qanat locations as indicators and ten advanced and soft computing models applied to the Beheshtabad Watershed, Iran. Qanat is a man-made underground construction which gathers groundwater from higher altitudes and transmits it to low land areas where it can be used for different purposes. For this purpose, at first, the location of the qanats was detected using extensive field surveys. These qanats were classified into two datasets including training (70%) and validation (30%). Then, 14 influence factors depicting the region's physical, morphological, lithological, and hydrological features were identified to model groundwater potential. Linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), flexible discriminant analysis (FDA), penalized discriminant analysis (PDA), boosted regression tree (BRT), random forest (RF), artificial neural network (ANN), K-nearest neighbor (KNN), multivariate adaptive regression splines (MARS), and support vector machine (SVM) models were applied in R scripts to produce groundwater potential maps. For evaluation of the performance accuracies of the developed models, ROC curve and kappa index were implemented. According to the results, RF had the best performance, followed by SVM and BRT models. Our results showed that qanat locations could be used as a good indicator for groundwater potential. Furthermore, altitude, slope, plan curvature, and profile curvature were found to be the most important influence factors. On the other hand, lithology, land use, and slope aspect were the least significant factors. The methodology in the current study could be used by land use and terrestrial planners and water resource managers to reduce the costs of groundwater resource discovery.
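The ROC-based evaluation mentioned in this abstract can be sketched in a few lines. The example below computes the area under the ROC curve as the fraction of (presence, absence) pairs that a model ranks correctly, which is equivalent to the usual curve-integration definition. The function name and the sample labels and scores are hypothetical; the abstract does not describe how the authors computed the statistic in R.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise ranking.

    labels: 1 for a validation presence (e.g. a qanat), 0 for absence.
    scores: model-predicted potential for the same locations.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # correctly ranked pair
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Made-up validation labels and scores, for illustration only.
    print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is how competing models such as RF, SVM and BRT can be ordered on the same validation set.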
West, Amanda M; Evangelista, Paul H; Jarnevich, Catherine S; Young, Nicholas E; Stohlgren, Thomas J; Talbert, Colin; Talbert, Marian; Morisette, Jeffrey; Anderson, Ryan
2016-10-11
Early detection of invasive plant species is vital for the management of natural resources and protection of ecosystem processes. The use of satellite remote sensing for mapping the distribution of invasive plants is becoming more common; however, conventional imaging software and classification methods have been shown to be unreliable. In this study, we test and evaluate the use of five species distribution model techniques fit with satellite remote sensing data to map invasive tamarisk (Tamarix spp.) along the Arkansas River in Southeastern Colorado. The models tested included boosted regression trees (BRT), Random Forest (RF), multivariate adaptive regression splines (MARS), generalized linear model (GLM), and Maxent. These analyses were conducted using a newly developed software package called the Software for Assisted Habitat Modeling (SAHM). All models were trained with 499 presence points, 10,000 pseudo-absence points, and predictor variables acquired from the Landsat 5 Thematic Mapper (TM) sensor over an eight-month period to distinguish tamarisk from native riparian vegetation using detection of phenological differences. From the Landsat scenes, we used individual bands and calculated Normalized Difference Vegetation Index (NDVI), Soil-Adjusted Vegetation Index (SAVI), and tasseled cap transformations. All five models identified current tamarisk distribution on the landscape successfully based on threshold-independent and threshold-dependent evaluation metrics with independent location data. To account for model-specific differences, we produced an ensemble of all five models with map output highlighting areas of agreement and areas of uncertainty. Our results demonstrate the usefulness of species distribution models in analyzing remotely sensed data and the utility of ensemble mapping, and showcase the capability of SAHM in pre-processing and executing multiple complex models.
Louis R. Iverson; Anantha Prasad; Mark W. Schwartz
1999-01-01
We are using a deterministic regression tree analysis model (DISTRIB) and a stochastic migration model (SHIFT) to examine potential distributions of ~66 individual species of eastern US trees under a 2 x CO2 climate change scenario. This process is demonstrated for Virginia pine (Pinus virginiana).
Potential Changes in Tree Species Richness and Forest Community Types following Climate Change
Louis R. Iverson; Anantha M. Prasad
2001-01-01
Potential changes in tree species richness and forest community types were evaluated for the eastern United States according to five scenarios of future climate change resulting from a doubling of atmospheric carbon dioxide (CO2). DISTRIB, an empirical model that uses a regression tree analysis approach, was used to generate suitable habitat, or potential future...
Estimating tree crown widths for the primary Acadian species in Maine
Matthew B. Russell; Aaron R. Weiskittel
2012-01-01
In this analysis, data for seven conifer and eight hardwood species were gathered from across the state of Maine for estimating tree crown widths. Maximum and largest crown width equations were developed using tree diameter at breast height as the primary predicting variable. Quantile regression techniques were used to estimate the maximum crown width and a constrained...
Du, Hua Qiang; Sun, Xiao Yan; Han, Ning; Mao, Fang Jie
2017-10-01
By combining object-based image analysis (OBIA) with classification and regression tree (CART) methods, the distribution, stand indexes (diameter at breast height, tree height, and crown closure), and aboveground carbon storage (AGC) of moso bamboo forest in Shanchuan Town, Anji County, Zhejiang Province were investigated. The results showed that the moso bamboo forest could be accurately delineated by integrating multi-scale image segmentation in OBIA with CART, which connected the image objects at various scales, giving a producer's accuracy of 89.1%. The indexes estimated by a regression tree model built on features extracted from the image objects reached normal or better accuracy; the crown closure model achieved the best estimation accuracy, 67.9%. The estimation accuracy for diameter at breast height and tree height was relatively low, consistent with the conclusion that estimating these two indexes from optical remote sensing cannot achieve satisfactory results. Estimation of AGC reached relatively high accuracy, exceeding 80% in high-value regions.
Zhao, Cai-Yun; Xu, Jing; Liu, Xiao-Yan
2017-01-01
Globalization increases the opportunities for unintentionally introduced invasive alien species, especially insects, and most of these species could damage ecosystems and cause economic loss in China. In this study, we analyzed drivers of the distribution of unintentionally introduced invasive alien insects. Based on the number of unintentionally introduced invasive alien insects and their presence/absence records in each province in mainland China, regression trees were built to elucidate the roles of environmental and anthropogenic factors in the number distribution and similarity of species composition of these insects. Classification and regression trees indicated that climatic suitability (the mean temperature in January) and human economic activity (sum of total freight) are the primary drivers of the number distribution pattern of unintentionally introduced invasive alien insects at provincial scale, while only environmental factors (the mean January temperature, the annual precipitation and the areas of provinces) significantly affect the similarity of species composition, according to the multivariate regression trees. PMID:28973576
Bevilacqua, M; Ciarapica, F E; Giacchetta, G
2008-07-01
This work is an attempt to apply classification tree methods to data regarding accidents in a medium-sized refinery, so as to identify the important relationships between the variables, which can be considered as decision-making rules when adopting any measures for improvement. The results obtained using the CART (Classification And Regression Trees) method proved to be the most precise and, in general, they are encouraging concerning the use of tree diagrams as preliminary explorative techniques for the assessment of the ergonomic, management and operational parameters which influence high accident risk situations. The Occupational Injury analysis carried out in this paper was planned as a dynamic process and can be repeated systematically. The CART technique, which considers a very wide set of objective and predictive variables, shows new cause-effect correlations in occupational safety which had never been previously described, highlighting possible injury risk groups and supporting decision-making in these areas. The use of classification trees must not, however, be seen as an attempt to supplant other techniques, but as a complementary method which can be integrated into traditional types of analysis.
Acid rain, air pollution, and tree growth in southeastern New York
Puckett, L.J.
1982-01-01
Whether dendroecological analyses could be used to detect changes in the relationship of tree growth to climate that might have resulted from chronic exposure to components of the acid rain-air pollution complex was determined. Tree-ring indices of white pine (Pinus strobus L.), eastern hemlock (Tsuga canadensis (L.) Carr.), pitch pine (Pinus rigida Mill.), and chestnut oak (Quercus prinus L.) were regressed against orthogonally transformed values of temperature and precipitation in order to derive a response-function relationship. Results of the regression analyses for three time periods, 1901–1920, 1926–1945, and 1954–1973, suggest that the relationship of tree growth to climate has been altered. Statistical tests of the temperature and precipitation data suggest that this change was nonclimatic. Temporally, the shift in growth response appears to correspond with the suspected increase in acid rain and air pollution in the Shawangunk Mountain area of southeastern New York in the early 1950's. This change could be the result of physiological stress induced by components of the acid rain-air pollution complex, causing climatic conditions to be more limiting to tree growth.
Teng, Ju-Hsi; Lin, Kuan-Chia; Ho, Bin-Shenq
2007-10-01
A community-based aboriginal study was conducted and analysed to explore the application of classification tree and logistic regression. A total of 1066 aboriginal residents in Yilan County were screened during 2003-2004. The independent variables include demographic characteristics, physical examinations, geographic location, health behaviours, dietary habits and family hereditary disease history. Risk factors of cardiovascular diseases were selected as the dependent variables in further analysis. The completion rate for the health interview is 88.9%. The classification tree results show that if body mass index is higher than 25.72 kg/m² and the age is above 51 years, the predicted probability of having ≥3 cardiovascular risk factors is 73.6% and the population is 322. If body mass index is higher than 26.35 kg/m² and the geographical latitude of the village is lower than 24°22.8', the predicted probability of having ≥4 cardiovascular risk factors is 60.8% and the population is 74. The logistic regression results indicate that body mass index, drinking habit and menopause are the top three significant independent variables. The classification tree model specifically shows the discrimination paths and interactions between the risk groups. The logistic regression model presents and analyses the statistically independent factors of cardiovascular risks. Applying both models to specific situations will provide a different angle for the design and management of future health intervention plans after community-based study.
Exploiting graph kernels for high performance biomedical relation extraction.
Panyam, Nagesh C; Verspoor, Karin; Cohn, Trevor; Ramamohanarao, Kotagiri
2018-01-30
Relation extraction from biomedical publications is an important task in the area of semantic mining of text. Kernel methods for supervised relation extraction are often preferred over manual feature engineering methods, when classifying highly ordered structures such as trees and graphs obtained from syntactic parsing of a sentence. Tree kernels such as the Subset Tree Kernel and Partial Tree Kernel have been shown to be effective for classifying constituency parse trees and basic dependency parse graphs of a sentence. Graph kernels such as the All Path Graph kernel (APG) and Approximate Subgraph Matching (ASM) kernel have been shown to be suitable for classifying general graphs with cycles, such as the enhanced dependency parse graph of a sentence. In this work, we present a high performance Chemical-Induced Disease (CID) relation extraction system. We present a comparative study of kernel methods for the CID task and also extend our study to the Protein-Protein Interaction (PPI) extraction task, an important biomedical relation extraction task. We discuss novel modifications to the ASM kernel to boost its performance and a method to apply graph kernels for extracting relations expressed in multiple sentences. Our system for CID relation extraction attains an F-score of 60%, without using external knowledge sources or task specific heuristic or rules. In comparison, the state of the art Chemical-Disease Relation Extraction system achieves an F-score of 56% using an ensemble of multiple machine learning methods, which is then boosted to 61% with a rule based system employing task specific post processing rules. For the CID task, graph kernels outperform tree kernels substantially, and the best performance is obtained with APG kernel that attains an F-score of 60%, followed by the ASM kernel at 57%. The performance difference between the ASM and APG kernels for CID sentence level relation extraction is not significant. 
In our evaluation of ASM for the PPI task, ASM performed better than APG kernel for the BioInfer dataset, in the Area Under Curve (AUC) measure (74% vs 69%). However, for all the other PPI datasets, namely AIMed, HPRD50, IEPA and LLL, ASM is substantially outperformed by the APG kernel in F-score and AUC measures. We demonstrate a high performance Chemical Induced Disease relation extraction, without employing external knowledge sources or task specific heuristics. Our work shows that graph kernels are effective in extracting relations that are expressed in multiple sentences. We also show that the graph kernels, namely the ASM and APG kernels, substantially outperform the tree kernels. Among the graph kernels, we showed the ASM kernel as effective for biomedical relation extraction, with comparable performance to the APG kernel for datasets such as the CID-sentence level relation extraction and BioInfer in PPI. Overall, the APG kernel is shown to be significantly more accurate than the ASM kernel, achieving better performance on most datasets.
NASA Astrophysics Data System (ADS)
Ruske, Simon; Topping, David O.; Foot, Virginia E.; Kaye, Paul H.; Stanley, Warren R.; Crawford, Ian; Morse, Andrew P.; Gallagher, Martin W.
2017-03-01
Characterisation of bioaerosols has important implications within environment and public health sectors. Recent developments in ultraviolet light-induced fluorescence (UV-LIF) detectors such as the Wideband Integrated Bioaerosol Spectrometer (WIBS) and the newly introduced Multiparameter Bioaerosol Spectrometer (MBS) have allowed for the real-time collection of fluorescence, size and morphology measurements for the purpose of discriminating between bacteria, fungal spores and pollen. This new generation of instruments has enabled ever larger data sets to be compiled with the aim of studying more complex environments. In real world data sets, particularly those from an urban environment, the population may be dominated by non-biological fluorescent interferents, bringing into question the accuracy of measurements of quantities such as concentrations. It is therefore imperative that we validate the performance of different algorithms which can be used for the task of classification. For unsupervised learning we tested hierarchical agglomerative clustering with various different linkages. For supervised learning, 11 methods were tested, including decision trees, ensemble methods (random forests, gradient boosting and AdaBoost), two implementations for support vector machines (libsvm and liblinear) and Gaussian methods (Gaussian naïve Bayesian, quadratic and linear discriminant analysis, the k-nearest neighbours algorithm and artificial neural networks). The methods were applied to two different data sets produced using the new MBS, which provides multichannel UV-LIF fluorescence signatures for single airborne biological particles. The first data set contained mixed PSLs and the second contained a variety of laboratory-generated aerosol. Clustering in general performs slightly worse than the supervised learning methods, correctly classifying, at best, only 67.6 and 91.1 % for the two data sets respectively.
For supervised learning the gradient boosting algorithm was found to be the most effective, on average correctly classifying 82.8 and 98.27 % of the testing data, respectively, across the two data sets. A possible alternative to gradient boosting is neural networks. We do however note that this method requires much more user input than the other methods, and we suggest that further research should be conducted using this method, especially using parallelised hardware such as the GPU, which would allow for larger networks to be trained, which could possibly yield better results. We also saw that some methods, such as clustering, failed to utilise the additional shape information provided by the instrument, whilst for others, such as the decision trees, ensemble methods and neural networks, improved performance could be attained with the inclusion of such information.
Jose Negron
1997-01-01
Classification trees and linear regression analysis were used to build models to predict probabilities of infestation and amount of tree mortality in terms of basal area resulting from roundheaded pine beetle, Dendroctonus adjunctus Blandford, activity in ponderosa pine, Pinus ponderosa Laws., in the Sacramento Mountains, New Mexico. Classification trees were built for...
An Extension of CART's Pruning Algorithm. Program Statistics Research Technical Report No. 91-11.
ERIC Educational Resources Information Center
Kim, Sung-Ho
Among the computer-based methods used for the construction of trees such as AID, THAID, CART, and FACT, the only one that uses an algorithm that first grows a tree and then prunes the tree is CART. The pruning component of CART is analogous in spirit to the backward elimination approach in regression analysis. This idea provides a tool in…
STX--Fortran-4 program for estimates of tree populations from 3P sample-tree-measurements
L. R. Grosenbaugh
1967-01-01
Describes how to use an improved and greatly expanded version of an earlier computer program (1964) that converts dendrometer measurements of 3P-sample trees to population values in terms of whatever units user desires. Many new options are available, including that of obtaining a product-yield and appraisal report based on regression coefficients supplied by user....
Portable Language-Independent Adaptive Translation from OCR. Phase 1
2009-04-01
including brute-force k-Nearest Neighbors (kNN), fast approximate kNN using hashed k-d trees, classification and regression trees, and locality...achieved by refinements in ground-truthing protocols. Recent algorithmic improvements to our approximate kNN classifier using hashed k-D trees allows...recent years discriminative training has been shown to outperform phonetic HMMs estimated using ML for speech recognition. Standard ML estimation
High-Quality School-Based Pre-K Can Boost Early Learning for Children with Special Needs
ERIC Educational Resources Information Center
Phillips, Deborah A.; Meloy, Mary E.
2012-01-01
This article assesses the effects of Tulsa, Oklahoma's school-based prekindergarten program on the school readiness of children with special needs using a regression discontinuity design. Participation in the pre-K program was associated with significant gains for children with special needs in early literacy scores, but not in math scores. These…
Improving Cluster Analysis with Automatic Variable Selection Based on Trees
2014-12-01
regression trees Daisy DISsimilAritY PAM partitioning around medoids PMA penalized multivariate analysis SPC sparse principal components UPGMA unweighted...unweighted pair-group average method (UPGMA). This method measures dissimilarities between all objects in two clusters and takes the average value
Predicting U.S. Army Reserve Unit Manning Using Market Demographics
2015-06-01
develops linear regression, classification tree, and logistic regression models to determine the ability of the location to support manning requirements...logistic regression model delivers predictive results that allow decision-makers to identify locations with a high probability of meeting unit...manning requirements. The recommendation of this thesis is that the USAR implement the logistic regression model.
Dipnall, Joanna F.
2016-01-01
Background: Atheoretical large-scale data mining techniques using machine learning algorithms have promise in the analysis of large epidemiological datasets. This study illustrates the use of a hybrid methodology for variable selection that took account of missing data and complex survey design to identify key biomarkers associated with depression from a large epidemiological study. Methods: The study used a three-step methodology amalgamating multiple imputation, a machine learning boosted regression algorithm and logistic regression, to identify key biomarkers associated with depression in the National Health and Nutrition Examination Study (2009–2010). Depression was measured using the Patient Health Questionnaire-9 and 67 biomarkers were analysed. Covariates in this study included gender, age, race, smoking, food security, Poverty Income Ratio, Body Mass Index, physical activity, alcohol use, medical conditions and medications. The final imputed weighted multiple logistic regression model included possible confounders and moderators. Results: After the creation of 20 imputation data sets from multiple chained regression sequences, machine learning boosted regression initially identified 21 biomarkers associated with depression. Using traditional logistic regression methods, including controlling for possible confounders and moderators, a final set of three biomarkers was selected. The final three biomarkers from the novel hybrid variable selection methodology were red cell distribution width (OR 1.15; 95% CI 1.01, 1.30), serum glucose (OR 1.01; 95% CI 1.00, 1.01) and total bilirubin (OR 0.12; 95% CI 0.05, 0.28). Significant interactions were found between total bilirubin and the Mexican American/Hispanic group (p = 0.016), and current smokers (p<0.001).
Conclusion: The systematic use of a hybrid methodology for variable selection, fusing data mining techniques using a machine learning algorithm with traditional statistical modelling, accounted for missing data and complex survey sampling methodology and was demonstrated to be a useful tool for detecting three biomarkers associated with depression for future hypothesis generation: red cell distribution width, serum glucose and total bilirubin. PMID:26848571
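For readers less familiar with how the reported odds ratios and intervals relate to a fitted logistic regression, the following minimal sketch converts a coefficient and its standard error into an odds ratio with a Wald-type 95% confidence interval. The function name and the input values are hypothetical, and the abstract does not say whether the authors used this exact interval construction on their imputed, survey-weighted model.

```python
import math


def odds_ratio_ci(beta, se, z=1.96):
    """Return (odds ratio, lower bound, upper bound).

    beta: logistic regression coefficient (log-odds per unit change).
    se:   its standard error.
    z:    normal quantile; 1.96 gives an approximate 95% interval.
    """
    return (
        math.exp(beta),
        math.exp(beta - z * se),
        math.exp(beta + z * se),
    )


if __name__ == "__main__":
    # Made-up coefficient and standard error, for illustration only.
    odds, low, high = odds_ratio_ci(0.14, 0.06)
    print(f"OR {odds:.2f}; 95% CI {low:.2f}, {high:.2f}")
```

An interval that excludes 1 indicates an association at roughly the 5% level, which is how the three retained biomarkers in the abstract can be read.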
Dipnall, Joanna F; Pasco, Julie A; Berk, Michael; Williams, Lana J; Dodd, Seetal; Jacka, Felice N; Meyer, Denny
2016-01-01
Atheoretical large-scale data mining techniques using machine learning algorithms have promise in the analysis of large epidemiological datasets. This study illustrates the use of a hybrid methodology for variable selection that took account of missing data and complex survey design to identify key biomarkers associated with depression from a large epidemiological study. The study used a three-step methodology amalgamating multiple imputation, a machine learning boosted regression algorithm and logistic regression, to identify key biomarkers associated with depression in the National Health and Nutrition Examination Study (2009-2010). Depression was measured using the Patient Health Questionnaire-9 and 67 biomarkers were analysed. Covariates in this study included gender, age, race, smoking, food security, Poverty Income Ratio, Body Mass Index, physical activity, alcohol use, medical conditions and medications. The final imputed weighted multiple logistic regression model included possible confounders and moderators. After the creation of 20 imputation data sets from multiple chained regression sequences, machine learning boosted regression initially identified 21 biomarkers associated with depression. Using traditional logistic regression methods, including controlling for possible confounders and moderators, a final set of three biomarkers were selected. The final three biomarkers from the novel hybrid variable selection methodology were red cell distribution width (OR 1.15; 95% CI 1.01, 1.30), serum glucose (OR 1.01; 95% CI 1.00, 1.01) and total bilirubin (OR 0.12; 95% CI 0.05, 0.28). Significant interactions were found between total bilirubin with Mexican American/Hispanic group (p = 0.016), and current smokers (p<0.001). 
The systematic use of a hybrid methodology for variable selection, fusing data mining techniques using a machine learning algorithm with traditional statistical modelling, accounted for missing data and complex survey sampling methodology and was demonstrated to be a useful tool for detecting three biomarkers associated with depression for future hypothesis generation: red cell distribution width, serum glucose and total bilirubin.
Simulation of land use change in the Three Gorges Reservoir area based on CART-CA
NASA Astrophysics Data System (ADS)
Yuan, Min
2018-05-01
This study proposes a new method to simulate spatiotemporally complex multiple land uses with a cellular automata (CA) model based on the classification and regression tree (CART) algorithm. In this model, we use CART to calculate land class conversion probability, and combine it with a neighborhood factor and a random factor to extract cellular transformation rules. In the land dynamic simulation of the Three Gorges Reservoir area from 2000 to 2010, the overall Kappa coefficient is 0.8014 and the overall accuracy is 0.8821, so the simulation results are satisfactory.
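The two agreement statistics reported in this abstract can be computed from a confusion matrix of simulated versus observed land classes. The sketch below shows the standard definitions of overall accuracy and Cohen's kappa; the function names and the example matrix are hypothetical and are not taken from the study.

```python
def overall_accuracy(matrix):
    """Share of cells on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def cohen_kappa(matrix):
    """Agreement corrected for chance, from the same confusion matrix.

    Rows are observed classes, columns are simulated classes.
    """
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = overall_accuracy(matrix)
    # Chance agreement: product of row and column marginals per class.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(size)
    ) / (total * total)
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Made-up two-class matrix (e.g. cropland vs. built-up cell counts).
    example = [[40, 10], [5, 45]]
    print(overall_accuracy(example), round(cohen_kappa(example), 4))
```

Kappa is always at or below overall accuracy when agreement exceeds chance, which is consistent with the pair of values quoted in the abstract.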
CADDIS Volume 4. Data Analysis: Basic Analyses
Use of statistical tests to determine if an observation is outside the normal range of expected values. Details of CART, regression analysis, use of quantile regression analysis, CART in causal analysis, simplifying or pruning resulting trees.
Batterham, Philip J; Christensen, Helen; Mackinnon, Andrew J
2009-11-22
Relative to physical health conditions such as cardiovascular disease, little is known about risk factors that predict the prevalence of depression. The present study investigates the expected effects of a reduction of these risks over time, using the decision tree method favoured in assessing cardiovascular disease risk. The PATH through Life cohort was used for the study, comprising 2,105 20-24 year olds, 2,323 40-44 year olds and 2,177 60-64 year olds sampled from the community in the Canberra region, Australia. A decision tree methodology was used to predict the presence of major depressive disorder after four years of follow-up. The decision tree was compared with a logistic regression analysis using ROC curves. The decision tree was found to distinguish and delineate a wide range of risk profiles. Previous depressive symptoms were most highly predictive of depression after four years; however, modifiable risk factors such as substance use and employment status played significant roles in assessing the risk of depression. The decision tree was found to have better sensitivity and specificity than a logistic regression using identical predictors. The decision tree method was useful in assessing the risk of major depressive disorder over four years. Application of the model to the development of a predictive tool for tailored interventions is discussed.
A study of Solar-ENSO correlation with southern Brazil tree ring index (1955-1991)
NASA Astrophysics Data System (ADS)
Rigozo, N.; Nordemann, D.; Vieira, L.; Echer, E.
The effects of solar activity and El Niño-Southern Oscillation (ENSO) on tree growth in Southern Brazil were studied by correlation analysis. Trees for this study were native Araucaria (Araucaria angustifolia) from four locations in Rio Grande do Sul State, in Southern Brazil: Canela (29°18'S, 50°51'W, 790 m asl), Nova Petropolis (29°2'S, 51°10'W, 579 m asl), Sao Francisco de Paula (29°25'S, 50°24'W, 930 m asl) and Sao Martinho da Serra (29°30'S, 53°53'W, 484 m asl). From these four sites, an average tree ring index for the region was derived for the period 1955-1991. Linear correlations were computed on annual and 10-year running averages of this tree ring index, the sunspot number Rz and the Southern Oscillation Index (SOI). For annual averages, the correlation coefficients were low, and the multiple regression of the tree ring index on SOI and Rz indicates that 20% of the variance in tree rings was explained by solar activity and ENSO variability. However, for the 10-year running averages, the correlation coefficients were much higher. A clear anticorrelation is observed between SOI and the index (r=-0.81), whereas Rz and the index show a positive correlation (r=0.67). The multiple regression of 10-year running averages indicates that 76% of the variance in the tree ring index was explained by solar activity and ENSO. These results indicate that the effects of solar activity and ENSO on tree rings are better seen on long timescales.
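The smoothing-then-correlation procedure described in this abstract can be sketched as follows: a running mean over a fixed window, followed by a Pearson correlation between two smoothed series. The function names and the toy series are hypothetical; only the 10-year window comes from the abstract, and the example uses a shorter window so the numbers are easy to check by hand.

```python
import math


def running_mean(values, window):
    """Unweighted running average; output is shorter by window - 1."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def pearson_r(x, y):
    """Pearson linear correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Toy annual series standing in for a ring index and a climate index.
    ring_index = [1.0, 1.2, 0.9, 1.1, 1.3, 1.0, 0.8]
    climate = [-0.5, -0.8, 0.2, -0.3, -1.0, 0.1, 0.6]
    smooth_ring = running_mean(ring_index, 3)
    smooth_climate = running_mean(climate, 3)
    print(pearson_r(smooth_ring, smooth_climate))
```

Note that running averages reduce the number of independent samples, so correlations on smoothed series tend to be larger in magnitude and should be interpreted with fewer effective degrees of freedom than the series length suggests.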
Effectiveness of repeated examination to diagnose enterobiasis in nursery school groups.
Remm, Mare; Remm, Kalle
2009-09-01
The aim of this study was to estimate the benefit from repeated examinations in the diagnosis of enterobiasis in nursery school groups, and to test the effectiveness of individual-based risk predictions using different methods. A total of 604 children were examined using double, and 96 using triple, anal swab examinations. The questionnaires for parents, structured observations, and interviews with supervisors were used to identify factors of possible infection risk. In order to model the risk of enterobiasis at individual level, a similarity-based machine learning and prediction software Constud was compared with data mining methods in the Statistica 8 Data Miner software package. Prevalence according to a single examination was 22.5%; the increase as a result of double examinations was 8.2%. Single swabs resulted in an estimated prevalence of 20.1% among children examined 3 times; double swabs increased this by 10.1%, and triple swabs by 7.3%. Random forest classification, boosting classification trees, and Constud correctly predicted about 2/3 of the results of the second examination. Constud estimated a mean prevalence of 31.5% in groups. Constud was able to yield the highest overall fit of individual-based predictions while boosting classification tree and random forest models were more effective in recognizing Enterobius positive persons. As a rule, the actual prevalence of enterobiasis is higher than indicated by a single examination. We suggest using either the values of the mean increase in prevalence after double examinations compared to single examinations or group estimations deduced from individual-level modelled risk predictions.
PMID:19724696
Carneiro, Gustavo; Georgescu, Bogdan; Good, Sara; Comaniciu, Dorin
2008-09-01
We propose a novel method for the automatic detection and measurement of fetal anatomical structures in ultrasound images. This problem offers a myriad of challenges, including: difficulty of modeling the appearance variations of the visual object of interest, robustness to speckle noise and signal dropout, and large search space of the detection procedure. Previous solutions typically rely on the explicit encoding of prior knowledge and formulation of the problem as a perceptual grouping task solved through clustering or variational approaches. These methods are constrained by the validity of the underlying assumptions and are usually insufficient to capture the complex appearances of fetal anatomies. We propose a novel system for fast automatic detection and measurement of fetal anatomies that directly exploits a large database of expert annotated fetal anatomical structures in ultrasound images. Our method learns automatically to distinguish between the appearance of the object of interest and background by training a constrained probabilistic boosting tree classifier. This system is able to produce the automatic segmentation of several fetal anatomies using the same basic detection algorithm. We show results on fully automatic measurement of biparietal diameter (BPD), head circumference (HC), abdominal circumference (AC), femur length (FL), humerus length (HL), and crown rump length (CRL). Note that our approach is the first in the literature to deal with the HL and CRL measurements. Extensive experiments (with clinical validation) show that our system is, on average, close to the accuracy of experts in terms of segmentation and obstetric measurements. Finally, this system runs in under half a second on a standard dual-core PC.
James W. Flewelling
2009-01-01
Remotely sensed data can be used to make digital maps showing individual tree crowns (ITC) for entire forests. Attributes of the ITCs may include area, shape, height, and color. The crown map is sampled in a way that provides an unbiased linkage between ITCs and identifiable trees measured on the ground. Methods of avoiding edge bias are given. In an example from a...
Austin Troy; J. Morgan Grove; Jarlath O'Neil-Dunne
2012-01-01
The extent to which urban tree cover influences crime is in debate in the literature. This research took advantage of geocoded crime point data and high resolution tree canopy data to address this question in Baltimore City and County, MD, an area that includes a significant urban-rural gradient. Using ordinary least squares and spatially adjusted regression and...
Spatial Assessment of Model Errors from Four Regression Techniques
Lianjun Zhang; Jeffrey H. Gove
2005-01-01
Forest modelers have attempted to account for the spatial autocorrelations among trees in growth and yield models by applying alternative regression techniques such as linear mixed models (LMM), generalized additive models (GAM), and geographically weighted regression (GWR). However, the model errors are commonly assessed using average errors across the entire study...
Steen, Paul J.; Passino-Reader, Dora R.; Wiley, Michael J.
2006-01-01
As a part of the Great Lakes Regional Aquatic Gap Analysis Project, we evaluated methodologies for modeling associations between fish species and habitat characteristics at a landscape scale. To do this, we created brook trout Salvelinus fontinalis presence and absence models based on four different techniques: multiple linear regression, logistic regression, neural networks, and classification trees. The models were tested in two ways: by application to an independent validation database and cross-validation using the training data, and by visual comparison of statewide distribution maps with historically recorded occurrences from the Michigan Fish Atlas. Although differences in the accuracy of our models were slight, the logistic regression model predicted with the least error, followed by multiple regression, then classification trees, then the neural networks. These models will provide natural resource managers a way to identify habitats requiring protection for the conservation of fish species.
Chen, Carla Chia-Ming; Schwender, Holger; Keith, Jonathan; Nunkesser, Robin; Mengersen, Kerrie; Macrossan, Paula
2011-01-01
Due to advancements in computational ability, enhanced technology and a reduction in the price of genotyping, more data are being generated for understanding genetic associations with diseases and disorders. However, with the availability of large data sets come the inherent challenges of new methods of statistical analysis and modeling. Considering that a complex phenotype may be the effect of a combination of multiple loci, various statistical methods have been developed for identifying genetic epistasis effects. Among these methods, logic regression (LR) is an intriguing approach incorporating tree-like structures. Various methods have built on the original LR to improve different aspects of the model. In this study, we review four variations of LR, namely Logic Feature Selection, Monte Carlo Logic Regression, Genetic Programming for Association Studies, and Modified Logic Regression-Gene Expression Programming, and investigate the performance of each method using simulated and real genotype data. We contrast these with another tree-like approach, namely Random Forests, and a Bayesian logistic regression with stochastic search variable selection.
Model-informed risk assessment for Zika virus outbreaks in the Asia-Pacific regions.
Teng, Yue; Bi, Dehua; Xie, Guigang; Jin, Yuan; Huang, Yong; Lin, Baihan; An, Xiaoping; Tong, Yigang; Feng, Dan
2017-05-01
Recently, Zika virus (ZIKV) has been recognized as a significant threat to global public health. The disease was present in large parts of the Americas, the Caribbean, and also the western Pacific area and southern Asia during 2015 and 2016. However, little is known about the factors affecting the transmission of ZIKV. We used Gradient Boosted Regression Tree models to investigate the effects of various potential explanatory variables on the spread of ZIKV, and used current and historical information from a range of sources to assess the risks of future ZIKV outbreaks. Our results indicated that the probability of ZIKV outbreaks increases with vapor pressure, the occurrence of Dengue virus, and population density but decreases with health expenditure, GDP, and numbers of travelers. The predictive results revealed the countries at potential risk of ZIKV infection in the Asia-Pacific regions between October 2016 and January 2017. We believe that the high-risk conditions would continue in South Asia and Australia over this period. By integrating information on eco-environmental, socio-economic, and ZIKV-related niche factors, this study estimated the probability of locally acquired mosquito-borne ZIKV infections in the Asia-Pacific region and improves the ability to forecast, and possibly even prevent, future outbreaks of ZIKV. Copyright © 2017 The British Infection Association. Published by Elsevier Ltd. All rights reserved.
Tran, Chinh C; Yanagida, John F; Saksena, Sumeet; Fox, Jefferson
2016-02-06
This study addresses the tradeoff between Vietnam's national poultry vaccination program, which implemented an annual two-round HPAI H5N1 vaccination program for the entire geographical area of the Red River Delta during the period from 2005-2010, and an alternative vaccination program which would involve vaccination for every production cycle at the recommended poultry age in high risk areas within the Delta. The ex ante analysis framework was applied to identify the location of areas with high probability of HPAI H5N1 occurrence for the alternative vaccination program by using boosted regression trees (BRT) models, followed by weighted overlay operations. Cost-effectiveness of the vaccination programs was then estimated to measure the tradeoff between the past national poultry vaccination program and the alternative vaccination program. Ex ante analysis showed that the focus areas for the alternative vaccination program included 1137 communes, corresponding to 50.6% of total communes in the Delta, and located primarily in the coastal areas to the east and south of Hanoi. The cost-effectiveness analysis suggested that the alternative vaccination program would have been more successful in reducing the rate of disease occurrence and the total cost of vaccinations, as compared to the national poultry vaccination program.
NASA Astrophysics Data System (ADS)
Rice, Joshua S.; Emanuel, Ryan E.; Vose, James M.; Nelson, Stacy A. C.
2015-08-01
Changes in streamflow are an important area of ongoing research in the hydrologic sciences. To better understand spatial patterns in past changes in streamflow, we examined relationships between watershed-scale spatial characteristics and trends in streamflow. Trends in streamflow were identified by analyzing mean daily flow observations between 1940 and 2009 from 967 U.S. Geological Survey stream gages. Results indicated that streamflow across the continental U.S., as a whole, increased while becoming less extreme between 1940 and 2009. However, substantial departures from the continental U.S. (CONUS) scale pattern occurred at the regional scale, including increased annual maxima, decreased annual minima, overall drying trends, and changes in streamflow variability. A subset of watersheds belonging to a reference data set exhibited significantly smaller trend magnitudes than those observed in nonreference watersheds. Boosted regression tree models were applied to examine the influence of watershed characteristics on streamflow trend magnitudes at both the CONUS and regional scale. Geographic location was found to be of particular importance at the CONUS scale while local variability in hydroclimate and topography tended to have a strong influence on regional-scale patterns in streamflow trends. This methodology facilitates detailed, data-driven analyses of how the characteristics of individual watersheds interact with large-scale hydroclimate forces to influence how changes in streamflow manifest.
Tran, Chinh C.; Yanagida, John F.; Saksena, Sumeet; Fox, Jefferson
2016-01-01
This study addresses the tradeoff between Vietnam’s national poultry vaccination program, which implemented an annual two-round HPAI H5N1 vaccination program for the entire geographical area of the Red River Delta during the period from 2005–2010, and an alternative vaccination program which would involve vaccination for every production cycle at the recommended poultry age in high risk areas within the Delta. The ex ante analysis framework was applied to identify the location of areas with high probability of HPAI H5N1 occurrence for the alternative vaccination program by using boosted regression trees (BRT) models, followed by weighted overlay operations. Cost-effectiveness of the vaccination programs was then estimated to measure the tradeoff between the past national poultry vaccination program and the alternative vaccination program. Ex ante analysis showed that the focus areas for the alternative vaccination program included 1137 communes, corresponding to 50.6% of total communes in the Delta, and located primarily in the coastal areas to the east and south of Hanoi. The cost-effectiveness analysis suggested that the alternative vaccination program would have been more successful in reducing the rate of disease occurrence and the total cost of vaccinations, as compared to the national poultry vaccination program. PMID:29056716
The effect of topography on arctic-alpine aboveground biomass and NDVI patterns
NASA Astrophysics Data System (ADS)
Riihimäki, Henri; Heiskanen, Janne; Luoto, Miska
2017-04-01
Topography is a key factor affecting numerous environmental phenomena, including Arctic and alpine aboveground biomass (AGB) distribution. A digital elevation model (DEM) is a source of topographic information that can be linked to local growing conditions. Here, we investigated the effect of DEM-derived variables, namely elevation, topographic position, radiation and wetness, on AGB and Normalized Difference Vegetation Index (NDVI) in a Fennoscandian forest-alpine tundra ecotone. Boosted regression trees were used to derive non-parametric response curves and relative influences of the explanatory variables. Elevation and potential incoming solar radiation were the most important explanatory variables for both AGB and NDVI. In the NDVI models, the response curves were smooth compared with those of the AGB models. This might be caused by the large contribution of the field and shrub layers to NDVI, especially at the treeline. Furthermore, radiation and elevation had a significant interaction, showing that the highest NDVI and biomass values are found at low-elevation, high-radiation sites, typically on the south-southwest facing valley slopes. Topographic wetness had a minor influence on AGB and NDVI. Topographic position had generally weak effects on AGB and NDVI, although a protected topographic position seemed to be more favorable below the treeline. The explanatory power of the topographic variables, particularly elevation and radiation, demonstrates that DEM-derived land surface parameters can be used for exploring biomass distribution resulting from landform control on local growing conditions.
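To make the idea of boosted regression trees and "relative influence" concrete, the sketch below boosts one-split trees (stumps) under squared error and credits each predictor with the error reduction of the splits made on it, rescaled to sum to 100. It is an illustrative simplification under stated assumptions, not the software used in the study: the function names, the learning rate, the number of trees, and the toy data are all hypothetical.

```python
def fit_stump(X, residuals):
    """Best single split under squared error.

    Returns (gain, feature, threshold, left_mean, right_mean), or None
    when no feature takes more than one distinct value.
    """
    n = len(residuals)
    mean = sum(residuals) / n
    base_sse = sum((r - mean) ** 2 for r in residuals)
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        # Candidate thresholds: every observed value except the largest.
        for t in values[:-1]:
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lm = sum(left) / len(left)
            rm = sum(right) / len(right)
            sse = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            gain = base_sse - sse
            if best is None or gain > best[0]:
                best = (gain, j, t, lm, rm)
    return best


def boost(X, y, n_trees=200, learning_rate=0.1):
    """Fit boosted stumps; return the model with relative influences (%)."""
    intercept = sum(y) / len(y)
    pred = [intercept] * len(y)
    stumps = []
    influence = [0.0] * len(X[0])
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        gain, j, t, lm, rm = stump
        influence[j] += gain  # credit the split variable with its improvement
        stumps.append((j, t, lm, rm))
        pred = [p + learning_rate * (lm if row[j] <= t else rm)
                for p, row in zip(pred, X)]
    total = sum(influence)
    relative = [100.0 * v / total if total > 0 else 0.0 for v in influence]
    return {"intercept": intercept, "learning_rate": learning_rate,
            "stumps": stumps, "relative_influence": relative}


def predict(model, row):
    """Sum the shrunken stump outputs for one observation."""
    out = model["intercept"]
    for j, t, lm, rm in model["stumps"]:
        out += model["learning_rate"] * (lm if row[j] <= t else rm)
    return out


# Toy data (hypothetical): the response depends on feature 0 only.
X_toy = [[i, (i * 7) % 5] for i in range(20)]
y_toy = [0.0 if i < 10 else 10.0 for i in range(20)]
toy_model = boost(X_toy, y_toy)
```

In practice one would use an established implementation with deeper trees, subsampling, and cross-validated tree counts; the stump version is shown only because it keeps the bookkeeping behind relative influence visible.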
Main, Anson R; Michel, Nicole L; Headley, John V; Peru, Kerry M; Morrissey, Christy A
2015-07-21
Neonicotinoids are commonly used seed treatments on Canada's major prairie crops. Transported via surface and subsurface runoff into wetlands, their ultimate aquatic fate remains largely unknown. Biotic and abiotic wetland characteristics likely affect neonicotinoid presence and environmental persistence, but concentrations vary widely between wetlands that appear ecologically (e.g., plant composition) and physically (e.g., depth) similar for reasons that remain unclear. We conducted intensive surveys of 238 wetlands, and documented 59 wetland (e.g., dominant plant species) and landscape (e.g., surrounding crop) characteristics as part of a novel rapid wetland assessment system. We used boosted regression tree (BRT) analysis to predict both probability of neonicotinoid analytical detection and concentration. BRT models effectively predicted the deviance in neonicotinoid detection (62.4%) and concentration (74.7%) from 21 and 23 variables, respectively. Detection was best explained by shallow marsh plant species identity (34.8%) and surrounding crop (13.9%). Neonicotinoid concentration was best explained by shallow marsh plant species identity (14.9%) and wetland depth (14.2%). Our research revealed that plant composition is a key indicator and/or driver of neonicotinoid presence and concentration in Prairie wetlands. We recommend wetland buffers consisting of diverse native vegetation be retained or restored to minimize neonicotinoid transport and retention in wetlands, thereby limiting their potential effects on wetland-dependent organisms.
Effects of pond draining on biodiversity and water quality of farm ponds.
Usio, Nisikawa; Imada, Miho; Nakagawa, Megumi; Akasaka, Munemitsu; Takamura, Noriko
2013-12-01
Farm ponds have high conservation value because they contribute significantly to regional biodiversity and ecosystem services. In Japan pond draining is a traditional management method that is widely believed to improve water quality and eradicate invasive fish. In addition, fishing by means of pond draining has significant cultural value for local people, serving as a social event. However, there is a widespread belief that pond draining reduces freshwater biodiversity through the extirpation of aquatic animals, but scientific evaluation of the effectiveness of pond draining is lacking. We conducted a large-scale field study to evaluate the effects of pond draining on invasive animal control, water quality, and aquatic biodiversity relative to different pond-management practices, pond physicochemistry, and surrounding land use. The results of boosted regression-tree models and analyses of similarity showed that pond draining had little effect on invasive fish control, water quality, or aquatic biodiversity. Draining even facilitated the colonization of farm ponds by invasive red swamp crayfish (Procambarus clarkii), which in turn may have detrimental effects on the biodiversity and water quality of farm ponds. Our results highlight the need for reconsidering current pond management and developing management plans with respect to multifunctionality of such ponds. © 2013 Society for Conservation Biology.
Machine learning derived risk prediction of anorexia nervosa.
Guo, Yiran; Wei, Zhi; Keating, Brendan J; Hakonarson, Hakon
2016-01-20
Anorexia nervosa (AN) is a complex psychiatric disease with a moderate to strong genetic contribution. In addition to conventional genome wide association (GWA) studies, researchers have been using machine learning methods in conjunction with genomic data to predict risk of diseases in which genetics play an important role. In this study, we collected whole genome genotyping data on 3940 AN cases and 9266 controls from the Genetic Consortium for Anorexia Nervosa (GCAN), the Wellcome Trust Case Control Consortium 3 (WTCCC3), Price Foundation Collaborative Group and the Children's Hospital of Philadelphia (CHOP), and applied machine learning methods for predicting AN disease risk. The prediction performance is measured by area under the receiver operating characteristic curve (AUC), indicating how well the model distinguishes cases from unaffected control subjects. A logistic regression model with the lasso penalty generated an AUC of 0.693, while Support Vector Machines and Gradient Boosted Trees reached AUCs of 0.691 and 0.623, respectively. Using different sample sizes, our results suggest that larger datasets are required to optimize the machine learning models and achieve higher AUC values. To our knowledge, this is the first attempt to assess AN risk based on genome wide genotype level data. Future integration of genomic, environmental and family-based information is likely to improve the AN risk evaluation process, eventually benefitting AN patients and families in the clinical setting.
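For readers unfamiliar with the AUC metric mentioned above, the sketch below computes it directly from its pairwise definition: the fraction of (case, control) pairs in which the case receives the higher score, with ties counted as one half. The function name and the example scores are hypothetical and are not taken from the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve from the pairwise (Mann-Whitney) definition.

    labels: 1 for cases, 0 for controls; scores: higher means more case-like.
    """
    cases = [s for lab, s in zip(labels, scores) if lab == 1]
    controls = [s for lab, s in zip(labels, scores) if lab == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5  # ties count as half a correctly ordered pair
    return wins / (len(cases) * len(controls))


# Hypothetical scores for two controls followed by two cases.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

An AUC of 0.5 corresponds to random ordering and 1.0 to perfect separation, which is the scale on which the reported values of 0.623 to 0.693 should be read. The double loop is quadratic in sample size; rank-based formulas give the same value faster on large cohorts.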
Zhang, Ling Yu; Liu, Zhao Gang
2017-12-01
Based on data collected from 108 permanent plots of the forest resources survey in Maoershan Experimental Forest Farm during 2004-2016, this study investigated the spatial distribution of recruitment trees in natural secondary forest using global Poisson regression and geographically weighted Poisson regression (GWPR) with four bandwidths of 2.5, 5, 10 and 15 km. The fit of the five regressions and the factors influencing recruitment trees in stands were analyzed, and the spatial autocorrelation of the regression residuals was described at global and local levels using Moran's I. The results showed that the spatial distribution of the number of recruitment trees in natural secondary forest was significantly influenced by stand and topographic factors, especially average DBH. The GWPR model at the small scale (2.5 km) fitted the data with high accuracy, generated a large range of model parameter estimates, and captured the localized spatial distribution of the model parameters. The GWPR models at small scales (2.5 and 5 km) produced a small range of model residuals, and the stability of the model was improved. The global spatial autocorrelation of the GWPR model residuals at the small scale (2.5 km) was the lowest, and the local spatial autocorrelation was significantly reduced, forming an ideal spatial distribution pattern of small clusters with different observations. The local model at the small scale (2.5 km) was much better than the global model at simulating the spatial distribution of the number of recruitment trees.
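The global Moran's I statistic used above to check residual autocorrelation has a compact closed form, I = (n / W) * sum_ij w_ij (x_i - m)(x_j - m) / sum_i (x_i - m)^2, where W is the sum of all weights and m the mean. The sketch below evaluates it for a dense weight matrix; the function name, the chain-shaped neighbour structure, and the residual values are assumptions made for illustration.

```python
def morans_i(values, weights):
    """Global Moran's I for `values` under a square spatial weight matrix."""
    n = len(values)
    if len(weights) != n or any(len(row) != n for row in weights):
        raise ValueError("weights must be an n x n matrix")
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    w_sum = sum(sum(row) for row in weights)
    if denom == 0 or w_sum == 0:
        raise ValueError("Moran's I is undefined for constant values or zero weights")
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    return (n / w_sum) * (num / denom)


# Hypothetical example: four plots along a line, neighbours share an edge.
chain = [[0, 1, 0, 0],
         [1, 0, 1, 0],
         [0, 1, 0, 1],
         [0, 0, 1, 0]]
trend_i = morans_i([1.0, 2.0, 3.0, 4.0], chain)         # smooth gradient
alternating_i = morans_i([1.0, -1.0, 1.0, -1.0], chain)  # checkerboard
```

Positive values indicate that neighbouring residuals resemble each other, negative values that they alternate, so a value near the expectation under no autocorrelation, -1/(n-1), is what a well-specified local model aims for.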
Random Bits Forest: a Strong Classifier/Regressor for Big Data
NASA Astrophysics Data System (ADS)
Wang, Yi; Li, Yi; Pu, Weilin; Wen, Kathryn; Shugart, Yin Yao; Xiong, Momiao; Jin, Li
2016-07-01
Efficiency, memory consumption, and robustness are common problems with many popular methods for data analysis. As a solution, we present Random Bits Forest (RBF), a classification and regression algorithm that integrates neural networks (for depth), boosting (for width), and random forests (for prediction accuracy). Through a gradient boosting scheme, it first generates and selects ~10,000 small, 3-layer random neural networks. These networks are then fed into a modified random forest algorithm to obtain predictions. Testing with datasets from the UCI (University of California, Irvine) Machine Learning Repository shows that RBF outperforms other popular methods in both accuracy and robustness, especially with large datasets (N > 1000). The algorithm also performed highly in testing with an independent data set, a real psoriasis genome-wide association study (GWAS).
Shi, Huilan; Jia, Junya; Li, Dong; Wei, Li; Shang, Wenya; Zheng, Zhenfeng
2018-02-09
Precise renal histopathological diagnosis will guide therapy strategy in patients with lupus nephritis. Blood oxygen level dependent (BOLD) magnetic resonance imaging (MRI) has been an applicable noninvasive technique in renal disease. This study was performed to explore whether BOLD MRI could contribute to the diagnosis of renal pathological patterns. Adult patients with a renal pathological diagnosis of lupus nephritis were recruited for this study. Renal biopsy tissues were assessed based on the lupus nephritis ISN/RPS 2003 classification. BOLD MRI was used to obtain the functional magnetic resonance parameter, R2* values. Several functions of R2* values were calculated and used to construct algorithmic models for renal pathological patterns. In addition, the algorithmic models were compared as to their diagnostic capability. Both histopathology and BOLD MRI were used to examine a total of twelve patients. Renal pathological patterns included five class III (including 3 as class III + V) and seven class IV (including 4 as class IV + V). Three algorithmic models, including decision tree, linear discriminant, and logistic regression, were constructed to distinguish the renal pathological pattern of class III and class IV. The sensitivity of the decision tree model was better than that of the linear discriminant model (71.87% vs 59.48%, P < 0.001) and inferior to that of the logistic regression model (71.87% vs 78.71%, P < 0.001). The specificity of the decision tree model was equivalent to that of the linear discriminant model (63.87% vs 63.73%, P = 0.939) and higher than that of the logistic regression model (63.87% vs 38.0%, P < 0.001). The area under the ROC curve (AUROCC) of the decision tree model was greater than that of the linear discriminant model (0.765 vs 0.629, P < 0.001) and the logistic regression model (0.765 vs 0.662, P < 0.001).
BOLD MRI is a useful non-invasive imaging technique for the evaluation of lupus nephritis. Decision tree models constructed using functions of R2* values may facilitate the prediction of renal pathological patterns.
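The sensitivity and specificity figures compared above are both simple ratios from a confusion matrix. The sketch below shows the computation for a binary classifier; the function name and the label vectors are hypothetical, and which class counts as "positive" (for example class IV) is an assumption the analyst must state.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for binary labels.

    sensitivity = TP / (TP + FN), specificity = TN / (TN + FP).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fn = tn = fp = 0
    for truth, guess in zip(y_true, y_pred):
        if truth == positive:
            if guess == positive:
                tp += 1
            else:
                fn += 1
        else:
            if guess == positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical labels: 1 = positive class, 0 = negative class.
sens, spec = sensitivity_specificity([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
```

Because the two ratios trade off against each other as the decision threshold moves, a model can win on one and lose on the other, which is the pattern reported for the decision tree versus logistic regression.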
Lin, Lei; Wang, Qian; Sadek, Adel W
2016-06-01
The duration of freeway traffic accidents is an important factor that affects traffic congestion, environmental pollution, and secondary accidents. Among previous studies, the M5P algorithm has been shown to be an effective tool for predicting incident duration. M5P builds a tree-based model, like the traditional classification and regression tree (CART) method, but with multiple linear regression models as its leaves. The problem with M5P for accident duration prediction, however, is that whereas linear regression assumes that the conditional distribution of accident durations is normal, the distribution for a "time-to-an-event" is almost certainly nonsymmetrical. A hazard-based duration model (HBDM) is a better choice for this kind of "time-to-event" modeling scenario, and given this, HBDMs have previously been applied to analyze and predict traffic accident duration. Previous research, however, has not yet applied HBDMs for accident duration prediction in association with clustering or classification of the dataset to minimize data heterogeneity. The current paper proposes a novel approach for accident duration prediction, which improves on the original M5P tree algorithm through the construction of an M5P-HBDM model, in which the leaves of the M5P tree model are HBDMs instead of linear regression models. Such a model offers the advantage of minimizing data heterogeneity through dataset classification, and avoids the need for the incorrect assumption of normality for traffic accident durations. The proposed model was then tested on two freeway accident datasets. For each dataset, the first 500 records were used to train the following three models: (1) an M5P tree; (2) an HBDM; and (3) the proposed M5P-HBDM, and the remainder of the data was used for testing. The results show that the proposed M5P-HBDM managed to identify more significant and meaningful variables than either M5P or HBDMs.
Moreover, the M5P-HBDM had the lowest overall mean absolute percentage error (MAPE). Copyright © 2016 Elsevier Ltd. All rights reserved.
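The MAPE used above to rank the three models averages the absolute error of each prediction relative to the observed value. A minimal sketch follows; the function name and the durations in the usage line are hypothetical.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return 100.0 * sum(abs((a - p) / a)
                       for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical accident durations in minutes (observed vs predicted).
example_mape = mape([100.0, 200.0], [110.0, 180.0])
```

Because each error is divided by the observed duration, short incidents weigh heavily in the average, which is worth keeping in mind when comparing models on skewed duration data.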
Observed Methods for Felling Hardwood Trees with Chain Saws
Jerry L. Koger
1983-01-01
The angles and lengths of the cutting surfaces made by chain saw operators on hardwood tree stumps are described by means, standard deviations, ranges, and regression equations. Recommended felling guidelines are compared with observed felling methods used by experienced timber cutters in the southern Appalachian Mountains.
Kakkar, Fatima; Boucoiran, Isabelle; Lamarre, Valerie; Ducruet, Thierry; Amre, Devendra; Soudeyns, Hugo; Lapointe, Normand; Boucher, Marc
2015-01-01
Background: The risk of pre-term birth (PTB) associated with the use of protease inhibitors (PIs) during pregnancy remains a subject of debate. Recent data suggest that ritonavir boosting of PIs may play a specific role in the initiation of PTB, through an effect on the maternal–fetal adrenal axis. The primary objective of this study is to compare the risk of PTB among women treated with boosted PIs versus non-boosted PIs during pregnancy. Methods: Between 1988 and 2011, 705 HIV-positive women were enrolled into the Centre Maternel et Infantile sur le SIDA mother–infant cohort at Centre Hospitalier Universitaire Sainte-Justine in Montreal, Canada. Inclusion criteria for the study were: 1) attendance at a minimum of two antenatal obstetric visits and 2) singleton live birth at 24 weeks gestational age or older. The association between PTB (defined as delivery at <37 weeks gestational age), antiretroviral drug exposure and maternal risk factors was assessed retrospectively using logistic regression. Results: A total of 525 mother–infant pairs were included in the analysis. Among them, PI-based combination anti-retroviral therapy was used in 37.4%, boosted PI based in 24.4%, non-nucleoside reverse transcriptase inhibitor (NNRTI) or nucleoside reverse transcriptase inhibitor (NRTI) based in 28.1%, and no treatment was given in 10.0% of cases. Overall, 13.5% of women experienced PTB. Among women treated with antiretroviral therapy, the risk of PTB was significantly higher among women who received boosted versus non-boosted PIs (OR 2.01, 95% CI 1.02–3.97). This remained significant after adjusting for maternal age, delivery CD4 count, hepatitis C co-infection, history of previous PTB, and parity (aOR 2.17, 95% CI 1.05–4.51). There was no increased risk of PTB with the use of unboosted PIs as compared to NNRTI- or NRTI-based regimens.
Conclusion: While previous studies on the association between PTB and PI use have generally considered all PIs the same, our results would indicate a possible role of ritonavir boosting as a risk factor for PTB. Further work is needed to understand the pathophysiologic mechanisms involved, and to identify the safest ARV regimens to be used in pregnancy. PMID:26051165
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rupšys, P.
A system of stochastic differential equations (SDE) with mixed-effects parameters and a multivariate normal copula density function were used to develop a tree height model for Scots pine trees in Lithuania. A two-step maximum likelihood parameter estimation method is used and computational guidelines are given. After fitting the conditional probability density functions to outside bark diameter at breast height and total tree height, a bivariate normal copula distribution model was constructed. Predictions from the mixed-effects parameters SDE tree height model calculated during this research were compared to the regression tree height equations. The results are implemented in the symbolic computational language MAPLE.
Wylie, Bruce K.; Howard, Daniel; Dahal, Devendra; Gilmanov, Tagir; Ji, Lei; Zhang, Li; Smith, Kelcy
2016-01-01
This paper presents the methodology and results of two ecologically based net ecosystem production (NEP) regression tree models capable of upscaling measurements made at various flux tower sites throughout the U.S. Great Plains. Separate grassland and cropland NEP regression tree models were trained using various remote sensing data and other biogeophysical data, along with 15 flux towers contributing to the grassland model and 15 flux towers for the cropland model. The models yielded weekly mean daily grassland and cropland NEP maps of the U.S. Great Plains at 250 m resolution for 2000–2008. The grassland and cropland NEP maps were spatially summarized and statistically compared. The results of this study indicate that grassland and cropland ecosystems generally performed as weak net carbon (C) sinks, absorbing more C from the atmosphere than they released from 2000 to 2008. Grasslands demonstrated higher carbon sink potential (139 g C·m−2·year−1) than non-irrigated croplands. A closer look into the weekly time series reveals the C fluctuation through time and space for each land cover type.
Zhao, Cai-Yun; Li, Jun-Sheng; Xu, Jing; Liu, Xiao-Yan
2017-05-01
Globalization increases the opportunities for unintentionally introduced invasive alien species, especially for insects, and most of these species could damage ecosystems and cause economic loss in China. In this study, we analyzed drivers of the distribution of unintentionally introduced invasive alien insects. Based on the number of unintentionally introduced invasive alien insects and their presence/absence records in each province in mainland China, regression trees were built to elucidate the roles of environmental and anthropogenic factors on the number distribution and similarity of species composition of these insects. Classification and regression trees indicated climatic suitability (the mean temperature in January) and human economic activity (sum of total freight) are primary drivers for the number distribution pattern of unintentionally introduced invasive alien insects at provincial scale, while only environmental factors (the mean January temperature, the annual precipitation and the areas of provinces) significantly affect the similarity of them based on the multivariate regression trees. © The Authors 2017. Published by Oxford University Press on behalf of Entomological Society of America.
Brito-Rocha, E; Schilling, A C; Dos Anjos, L; Piotto, D; Dalmolin, A C; Mielke, M S
2016-01-01
Individual leaf area (LA) is a key variable in studies of tree ecophysiology because it directly influences light interception, photosynthesis and evapotranspiration of adult trees and seedlings. We analyzed the leaf dimensions (length - L and width - W) of seedlings and adults of seven Neotropical rainforest tree species (Brosimum rubescens, Manilkara maxima, Pouteria caimito, Pouteria torta, Psidium cattleyanum, Symphonia globulifera and Tabebuia stenocalyx) with the objective to test the feasibility of single regression models to estimate LA of both adults and seedlings. In southern Bahia, Brazil, a first set of data was collected between March and October 2012. From the seven species analyzed, only two (P. cattleyanum and T. stenocalyx) had very similar relationships between LW and LA in both ontogenetic stages. For these two species, a second set of data was collected in August 2014, in order to validate the single models encompassing adult and seedlings. Our results show the possibility of development of models for predicting individual leaf area encompassing different ontogenetic stages for tropical tree species. The development of these models was more dependent on the species than the differences in leaf size between seedlings and adults.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Andrew G. Peterson; J. Timothy Ball; Yiqi Luo
1998-09-25
Estimation of leaf photosynthetic rate (A) from leaf nitrogen content (N) is both conceptually and numerically important in models of plant, ecosystem and biosphere responses to global change. The relationship between A and N has been studied extensively at ambient CO2 but much less at elevated CO2. This study was designed to (1) assess whether the A-N relationship was more similar for species within than between community and vegetation types, and (2) examine how growth at elevated CO2 affects the A-N relationship. Data were obtained for 39 C3 species grown at ambient CO2 and 10 C3 species grown at ambient and elevated CO2. A regression model was applied to each species as well as to species pooled within different community and vegetation types. Cluster analysis of the regression coefficients indicated that species measured at ambient CO2 did not separate into distinct groups matching community or vegetation type. Instead, most community and vegetation types shared the same general parameter space for regression coefficients. Growth at elevated CO2 increased photosynthetic nitrogen use efficiency for pines and deciduous trees. When species were pooled by vegetation type, the A-N relationship for deciduous trees expressed on a leaf-mass basis was not altered by elevated CO2, while the intercept increased for pines. When regression coefficients were averaged to give mean responses for different vegetation types, elevated CO2 increased the intercept and the slope for deciduous trees but increased only the intercept for pines. There were no statistical differences between the pines and deciduous trees for the effect of CO2. Generalizations about the effect of elevated CO2 on the A-N relationship, and differences between pines and deciduous trees will be enhanced as more data become available.
Influence of Elevation Data Resolution on Spatial Prediction of Colluvial Soils in a Luvisol Region
Penížek, Vít; Zádorová, Tereza; Kodešová, Radka; Vaněk, Aleš
2016-01-01
The development of a soil cover is a dynamic process. Soil cover can be altered within a few decades, which requires updating of the legacy soil maps. Soil erosion is one of the most important processes quickly altering soil cover on agricultural land. Colluvial soils develop in concave parts of the landscape as a consequence of sedimentation of eroded material. Colluvial soils are recognised as important soil units because they are a vast sink of soil organic carbon. Terrain derivatives have become an important tool in digital soil mapping and are among the most popular auxiliary data used for quantitative spatial prediction. Prediction success rates are often directly dependent on raster resolution. In our study, we tested how raster resolution (1, 2, 3, 5, 10, 20 and 30 meters) influences spatial prediction of colluvial soils. Terrain derivatives (altitude, slope, plane curvature, topographic position index, LS factor and convergence index) were calculated for the given raster resolutions. Four models were applied (boosted tree, neural network, random forest and Classification/Regression Tree) to spatially predict the soil cover over a 77 ha study plot. Model training and validation were based on 111 soil profiles surveyed on a regular sampling grid. Moreover, the predicted real extent and shape of the colluvial soil area were examined. In general, no clear trend in the accuracy prediction was found without the given raster resolution range. Higher maximum prediction accuracy for colluvial soil, compared to prediction accuracy of total soil cover of the study plot, can be explained by the choice of terrain derivatives that were best for differentiating colluvial soils from other soil units. Regarding the character of the predicted colluvial soil area, maps of 2 to 10 m resolution provided reasonable delineation of the colluvial soil as part of the cover over the study area. PMID:27846230
NASA Astrophysics Data System (ADS)
Rao, M.; Vuong, H.
2013-12-01
The overall objective of this study is to develop a method for estimating total aboveground biomass of redwood stands in Jackson Demonstration State Forest, Mendocino, California using airborne LiDAR data. LiDAR data, owing to their vertical and horizontal accuracy, are increasingly being used to characterize landscape features including ground surface elevation and canopy height. These LiDAR-derived metrics, which capture structural signatures at higher precision and accuracy, can help better understand ecological processes at various spatial scales. Our study is focused on two major species of the forest: redwood (Sequoia sempervirens [D.Don] Engl.) and Douglas-fir (Pseudotsuga menziesii [Mirb.] Franco). Specifically, the objectives included fitting linear regression models of tree diameter at breast height (dbh) to LiDAR-derived height for each species. From 23 random points on the study area, field measurements (dbh and tree coordinates) were collected for more than 500 redwood and Douglas-fir trees over 0.2-ha plots. The USFS-FUSION application software along with its LiDAR Data Viewer (LDV) was used to extract a Canopy Height Model (CHM) from which tree heights would be derived. Based on the LiDAR-derived height and ground-based dbh, a linear regression model was developed to predict dbh. The predicted dbh was used to estimate the biomass at the single tree level using the Jenkins formula (Jenkins et al. 2003). The linear regression models were able to explain 65% of the variability associated with redwood dbh and 80% of that associated with Douglas-fir dbh.
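The two-step estimate described in this abstract (height to dbh by linear regression, then dbh to biomass by an allometric equation) can be sketched as follows. This is a minimal illustration, not the authors' code: the height/dbh pairs are made up, and the coefficients B0 and B1 are placeholders rather than published values. Only the functional form biomass = exp(b0 + b1 ln(dbh)) follows the Jenkins-style equations.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def biomass_kg(dbh_cm, b0, b1):
    """Jenkins-style allometric form: biomass = exp(b0 + b1 * ln(dbh))."""
    return math.exp(b0 + b1 * math.log(dbh_cm))

# Made-up (LiDAR height in m, field dbh in cm) pairs, for illustration only.
heights = [10.0, 20.0, 30.0, 40.0]
dbhs = [25.0, 45.0, 65.0, 85.0]

slope, intercept = fit_line(heights, dbhs)
predicted_dbh = intercept + slope * 25.0  # dbh predicted for a 25 m tree

# Placeholder coefficients (assumption); substitute species-group values.
B0, B1 = -2.0, 2.4
tree_biomass = biomass_kg(predicted_dbh, B0, B1)
```

Plot-level biomass would then be the sum of the single-tree estimates over all trees detected in the canopy height model.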
Dispersion patterns and sampling plans for Diaphorina citri (Hemiptera: Psyllidae) in citrus.
Sétamou, Mamoudou; Flores, Daniel; French, J Victor; Hall, David G
2008-08-01
The abundance and spatial dispersion of Diaphorina citri Kuwayama (Hemiptera: Psyllidae) were studied in 34 grapefruit (Citrus paradisi Macfad.) and six sweet orange [Citrus sinensis (L.) Osbeck] orchards from March to August 2006 when the pest is more abundant in southern Texas. Although flush shoot infestation levels did not vary with host plant species, densities of D. citri eggs, nymphs, and adults were significantly higher on sweet orange than on grapefruit. D. citri immatures also were found in significantly higher numbers in the southeastern quadrant of trees than other parts of the canopy. The spatial distribution of D. citri nymphs and adults was analyzed using Iwao's patchiness regression and Taylor's power law. Taylor's power law fitted the data better than Iwao's model. Based on both regression models, the field dispersion patterns of D. citri nymphs and adults were aggregated among flush shoots in individual trees as indicated by the regression slopes that were significantly >1. For the average density of each life stage obtained during our surveys, the minimum number of flush shoots per tree needed to estimate D. citri densities varied from eight for eggs to four flush shoots for adults. Projections indicated that a sampling plan consisting of 10 trees and eight flush shoots per tree would provide density estimates of the three developmental stages of D. citri acceptable enough for population studies and management decisions. A presence-absence sampling plan with a fixed precision level was developed and can be used to provide a quick estimation of D. citri populations in citrus orchards.
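Both dispersion models named in this abstract reduce to a straight-line fit, and a slope greater than 1 is read as aggregation. The sketch below assumes the usual textbook forms: Taylor's power law as log(s^2) = log(a) + b log(m), and the patchiness regression as m* = alpha + beta m with Lloyd's mean crowding m* = m + s^2/m - 1. The mean/variance values are synthetic, not taken from the study.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def taylor_power_law(means, variances):
    """Fit log10(s^2) = log10(a) + b * log10(m); returns (a, b)."""
    b, log_a = fit_line([math.log10(m) for m in means],
                        [math.log10(v) for v in variances])
    return 10 ** log_a, b

def patchiness_regression(means, variances):
    """Fit m* = alpha + beta * m, where m* = m + s^2/m - 1; returns (alpha, beta)."""
    crowding = [m + v / m - 1 for m, v in zip(means, variances)]
    beta, alpha = fit_line(means, crowding)
    return alpha, beta

# Synthetic per-sample means and variances generated from s^2 = 2 * m^1.5.
means = [0.5, 1.0, 2.0, 4.0, 8.0]
variances = [2 * m ** 1.5 for m in means]

a, b = taylor_power_law(means, variances)
alpha, beta = patchiness_regression(means, variances)
```

With these synthetic inputs both slopes (b and beta) exceed 1, which is the pattern the abstract interprets as an aggregated distribution.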
1985-12-01
consists of the node t and all descendants of t in T. (3) Definition 3. Pruning a branch Tt from a tree T consists of deleting from T all... The default is 1.0 so that actually, this keyword did not need to appear in the above file. (5) DELETE. This keyword does not appear in our example, but... when it is used associated with some variable names, it indicates that we want to delete these variables from the regression. If this keyword is
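The branch and pruning definitions quoted in this fragment can be illustrated with a small sketch. It assumes the common CART convention that pruning the branch rooted at t removes every descendant of t and leaves t itself as a terminal node; the excerpt is truncated, so that reading is an assumption. The dictionary representation and the function names are hypothetical.

```python
def branch(children, t):
    """Return the nodes of the branch rooted at t: t plus all of its descendants."""
    nodes, stack = set(), [t]
    while stack:
        node = stack.pop()
        nodes.add(node)
        stack.extend(children.get(node, ()))
    return nodes

def prune(children, t):
    """Return a copy of the tree with all descendants of t removed; t becomes a leaf."""
    doomed = branch(children, t) - {t}
    return {node: kids for node, kids in children.items()
            if node != t and node not in doomed}

# Tree stored as {internal node: (left child, right child)}; leaves have no entry.
tree = {1: (2, 3), 2: (4, 5), 3: (6, 7), 4: (8, 9)}
pruned = prune(tree, 2)
```

Here node 2 stays in the tree as a child of node 1 but no longer has children of its own.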
NASA Astrophysics Data System (ADS)
Morimoto, M.; Juday, G. P.; Huettmann, F.
2016-12-01
Following forest disturbance, the stand initiation stage decisively influences future forest structure. Understanding post-harvest regeneration, especially under climate change, is essential to predicting future carbon stores in this extensive forest biome. We apply IPCC B1, A1B, and A2 climate scenarios to generate plausible future forest conditions under different management. We recorded presence of white spruce, birch, and aspen in 726 plots on 30 state forest white spruce harvest units. We built spatially explicit models and scenarios of species presence/absence using TreeNet (Stochastic Gradient Boosting). Post-harvest tree regeneration predictions in calibration data closely matched the validation set, indicating tree regeneration scenarios are reliable. Early stage post-harvest regeneration is similar to post-fire regeneration and matches the pattern of long-term natural vegetation distribution, confirming that site environmental factors are more important than management practices. Post-harvest natural regeneration of tree species increases under moderate warming scenarios, but fails under strong warming scenarios in landscape positions with high temperatures and low precipitation. Under all warming scenarios, the most successful regenerating species following white spruce harvest is white spruce. Birch experiences about 30% regeneration failure under A2 scenario by 2050. White spruce and aspen are projected to regenerate more successfully when site preparation is applied. Although white spruce has been the major managed species, birch may require more intensive management. Sites likely to experience regeneration failure of current tree species apparently will experience biome shift, although adaptive migration of existing or new species might be an option. Our scenario modeling tool allows resource managers to forecast tree regeneration on productive managed sites that have made a disproportionate contribution to carbon flux in a critical region.
Jovanovic, Milos; Radovanovic, Sandro; Vukicevic, Milan; Van Poucke, Sven; Delibasic, Boris
2016-09-01
Quantification and early identification of unplanned readmission risk have the potential to improve the quality of care during hospitalization and after discharge. However, high dimensionality, sparsity, and class imbalance of electronic health data and the complexity of risk quantification, challenge the development of accurate predictive models. Predictive models require a certain level of interpretability in order to be applicable in real settings and create actionable insights. This paper aims to develop accurate and interpretable predictive models for readmission in a general pediatric patient population, by integrating a data-driven model (sparse logistic regression) and domain knowledge based on the international classification of diseases 9th-revision clinical modification (ICD-9-CM) hierarchy of diseases. Additionally, we propose a way to quantify the interpretability of a model and inspect the stability of alternative solutions. The analysis was conducted on >66,000 pediatric hospital discharge records from California, State Inpatient Databases, Healthcare Cost and Utilization Project between 2009 and 2011. We incorporated domain knowledge based on the ICD-9-CM hierarchy in a data driven, Tree-Lasso regularized logistic regression model, providing the framework for model interpretation. This approach was compared with traditional Lasso logistic regression resulting in models that are easier to interpret by fewer high-level diagnoses, with comparable prediction accuracy. The results revealed that the use of a Tree-Lasso model was as competitive in terms of accuracy (measured by area under the receiver operating characteristic curve-AUC) as the traditional Lasso logistic regression, but integration with the ICD-9-CM hierarchy of diseases provided more interpretable models in terms of high-level diagnoses. Additionally, interpretations of models are in accordance with existing medical understanding of pediatric readmission. 
The best-performing models have similar performance, reaching AUC values of 0.783 and 0.779 for traditional Lasso and Tree-Lasso, respectively. However, the information loss of Lasso models is 0.35 bits higher than that of the Tree-Lasso model. We propose a method for building predictive models applicable to the detection of readmission risk based on electronic health records. Integration of domain knowledge (in the form of the ICD-9-CM taxonomy) and a data-driven, sparse predictive algorithm (Tree-Lasso logistic regression) resulted in an increase in the interpretability of the resulting model. The models are interpreted for the readmission prediction problem in the general pediatric population in California, as well as several important subpopulations, and the interpretations of models comply with existing medical understanding of pediatric readmission. Finally, a quantitative assessment of the interpretability of the models is given that goes beyond simple counts of selected low-level features. Copyright © 2016 Elsevier B.V. All rights reserved.
Peng, Hui; Zheng, Yi; Blumenstein, Michael; Tao, Dacheng; Li, Jinyan
2018-04-16
The CRISPR/Cas9 system is a widely used genome editing tool. A prediction problem of great interest for this system is how to select optimal single guide RNAs (sgRNAs) such that cleavage efficiency is high while the off-target effect is low. This work proposed a two-step averaging method (TSAM) for the regression of cleavage efficiencies of a set of sgRNAs by averaging the predicted efficiency scores of a boosting algorithm and those of a support vector machine (SVM). We also proposed to use profiled Markov properties as novel features to capture the global characteristics of sgRNAs. These new features are combined with the outstanding features ranked by the boosting algorithm for the training of the SVM regressor. TSAM improved the mean Spearman correlation coefficients compared with the state-of-the-art performance on benchmark datasets containing thousands of human, mouse and zebrafish sgRNAs. Our method can also be converted to make binary distinctions between efficient and inefficient sgRNAs with superior performance to the existing methods. The analysis reveals that highly efficient sgRNAs have lower melting temperature at the middle of the spacer, cut at parts of the genome closer to the 5'-end, and contain more 'A' but less 'G' compared with inefficient ones. Comprehensive further analysis also demonstrates that our tool can predict an sgRNA's cutting efficiency with consistently good performance whether it is expressed from a U6 promoter in cells or from a T7 promoter in vitro. Online tool is available at http://www.aai-bioinfo.com/CRISPR/. Python and Matlab source codes are freely available at https://github.com/penn-hui/TSAM. Contact: Jinyan.Li@uts.edu.au. Supplementary data are available at Bioinformatics online.
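The averaging step and the Spearman evaluation mentioned in this abstract can be sketched in a few lines. This is not the TSAM implementation: the two prediction lists merely stand in for the outputs of a boosting model and an SVM regressor, all numbers are made up, and the Spearman formula used here assumes no tied values.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation for tie-free data: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))

def average_scores(first, second):
    """Element-wise mean of two lists of predicted scores."""
    return [(x + y) / 2 for x, y in zip(first, second)]

# Made-up efficiencies and predictions, for illustration only.
observed = [0.10, 0.35, 0.50, 0.70, 0.90]
boost_pred = [0.20, 0.30, 0.65, 0.60, 0.80]  # stand-in for the boosting model
svm_pred = [0.15, 0.45, 0.40, 0.75, 0.85]    # stand-in for the SVM regressor

combined = average_scores(boost_pred, svm_pred)
rho_boost = spearman(observed, boost_pred)
rho_svm = spearman(observed, svm_pred)
rho_combined = spearman(observed, combined)
```

In this toy case each model misorders a different pair, so the averaged scores rank the guides better than either model alone; real data need not behave this way.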
Error analysis of leaf area estimates made from allometric regression models
NASA Technical Reports Server (NTRS)
Feiveson, A. H.; Chhikara, R. S.
1986-01-01
Biological net productivity, measured in terms of the change in biomass with time, affects global productivity and the quality of life through biochemical and hydrological cycles and by its effect on the overall energy balance. Estimating leaf area for large ecosystems is one of the more important means of monitoring this productivity. For a particular forest plot, the leaf area is often estimated by a two-stage process. In the first stage, known as dimension analysis, a small number of trees are felled so that their areas can be measured as accurately as possible. These leaf areas are then related to non-destructive, easily-measured features such as bole diameter and tree height, by using a regression model. In the second stage, the non-destructive features are measured for all or for a sample of trees in the plots and then used as input into the regression model to estimate the total leaf area. Because both stages of the estimation process are subject to error, it is difficult to evaluate the accuracy of the final plot leaf area estimates. This paper illustrates how a complete error analysis can be made, using an example from a study made on aspen trees in northern Minnesota. The study was a joint effort by NASA and the University of California at Santa Barbara known as COVER (Characterization of Vegetation with Remote Sensing).
DOE Office of Scientific and Technical Information (OSTI.GOV)
Johnson, D.W.; Safai, C.; Goffinet, D.R.
Eleven patients with obstructive jaundice from unresectable cholangiocarcinoma, metastatic porta hepatis adenopathy, or direct compression from a pancreatic malignancy were treated at the Stanford University Medical Center from 1978-1983 with an external drainage procedure followed by high-dose external-beam radiotherapy and by an intracavitary boost to the site of obstruction with Iridium-192 (Ir-192). A median dose of 5000 cGy was delivered with 4-6 MV photons to the tumor bed and regional lymphatics in 9 patients, 1 patient received 2100 cGy to the liver in accelerated fractions because of extensive intrahepatic disease, and 1 patient received 7000 equivalent cGy to his pancreatic tumor bed and regional lymphatics with neon heavy particles. An Ir-192 wire source later delivered a 3100-10,647 cGy boost to the site of biliary obstruction in each patient, for a mean combined dose of 10,202 cGy to a point 5 mm from the line source. Few acute complications were noted, but 3/11 patients (27%) subsequently developed upper gastrointestinal bleeding from duodenitis or frank duodenal ulceration 4 weeks, 4 months, and 7.5 months following treatment. Eight patients died - 5 with local recurrence +/- distant metastasis, 2 with sepsis, and 1 with widespread systemic metastasis. Autopsies revealed no evidence of biliary tree obstruction in 3/3 patients. Evolution of radiation treatment techniques for biliary obstruction in the literature is reviewed. High-dose external-beam therapy followed by high-dose Ir-192 intracavitary boost is well tolerated and provides significant palliation.
Inferring gene regression networks with model trees
2010-01-01
Background Novel strategies are required in order to handle the huge amount of data produced by microarray technologies. To infer gene regulatory networks, the first step is to find direct regulatory relationships between genes, building the so-called gene co-expression networks. They are typically generated using correlation statistics as pairwise similarity measures. Correlation-based methods are very useful in order to determine whether two genes have a strong global similarity but do not detect local similarities. Results We propose model trees as a method to identify gene interaction networks. While correlation-based methods analyze each pair of genes, in our approach we generate a single regression tree for each gene from the remaining genes. Finally, a graph from all the relationships among output and input genes is built taking into account whether the pair of genes is statistically significant. For this reason we apply a statistical procedure to control the false discovery rate. The performance of our approach, named REGNET, is experimentally tested on two well-known data sets: the Saccharomyces cerevisiae and E. coli data sets. First, the biological coherence of the results is tested. Second, the E. coli transcriptional network (in the Regulon database) is used as control to compare the results to that of a correlation-based method. This experiment shows that REGNET performs more accurately at detecting true gene associations than the Pearson and Spearman zeroth and first-order correlation-based methods. Conclusions REGNET generates gene association networks from gene expression data, and differs from correlation-based methods in that the relationship between one gene and others is calculated simultaneously. Model trees are very useful techniques to estimate the numerical values for the target genes by linear regression functions.
They are very often more precise than linear regression models because they can adjust different linear regressions to separate areas of the search space, favoring the inference of localized similarities over a more global similarity. Furthermore, experimental results show the good performance of REGNET. PMID:20950452
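The claim that a model tree can beat a single linear regression by fitting separate lines to separate regions can be illustrated with a one-split sketch. This is not REGNET and not a full model-tree learner: it tries every split of one predictor, fits a least-squares line on each side, and keeps the split with the lowest total squared error. The data are synthetic and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def sse(xs, ys, slope, intercept):
    """Sum of squared residuals of a fitted line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

def fit_model_stump(xs, ys, min_leaf=2):
    """One-split model tree: returns (total SSE, threshold, [(slope, intercept), ...])."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        left, right = pairs[:i], pairs[i:]
        total, models = 0.0, []
        for part in (left, right):
            px, py = zip(*part)
            slope, intercept = fit_line(px, py)
            models.append((slope, intercept))
            total += sse(px, py, slope, intercept)
        threshold = (left[-1][0] + right[0][0]) / 2
        if best is None or total < best[0]:
            best = (total, threshold, models)
    return best

# Synthetic expression-like data with two local linear regimes.
xs = [float(x) for x in range(10)]
ys = [2 * x if x < 5 else 20 - x for x in xs]

global_slope, global_intercept = fit_line(xs, ys)
global_sse = sse(xs, ys, global_slope, global_intercept)
tree_sse, threshold, leaf_models = fit_model_stump(xs, ys)
```

On this data the single global line leaves a large residual, while the split at 4.5 recovers both local lines exactly.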
More Trees, More Poverty? The Socioeconomic Effects of Tree Plantations in Chile, 2001-2011
NASA Astrophysics Data System (ADS)
Andersson, Krister; Lawrence, Duncan; Zavaleta, Jennifer; Guariguata, Manuel R.
2016-01-01
Tree plantations play a controversial role in many nations' efforts to balance goals for economic development, ecological conservation, and social justice. This paper seeks to contribute to this debate by analyzing the socioeconomic impact of such plantations. We focus our study on Chile, a country that has experienced extraordinary growth of industrial tree plantations. Our analysis draws on a unique dataset with longitudinal observations collected in 180 municipal territories during 2001-2011. Employing panel data regression techniques, we find that growth in plantation area is associated with higher than average rates of poverty during this period.
Fatigue design of a cellular phone folder using regression model-based multi-objective optimization
NASA Astrophysics Data System (ADS)
Kim, Young Gyun; Lee, Jongsoo
2016-08-01
In a folding cellular phone, the folding device is repeatedly opened and closed by the user, which eventually results in fatigue damage, particularly to the front of the folder. Hence, it is important to improve the safety and endurance of the folder while also reducing its weight. This article presents an optimal design for the folder front that maximizes its fatigue endurance while minimizing its thickness. Design data for analysis and optimization were obtained experimentally using a test jig. Multi-objective optimization was carried out using a nonlinear regression model. Three regression methods were employed: back-propagation neural networks, logistic regression and support vector machines. The AdaBoost ensemble technique was also used to improve the approximation. Two-objective Pareto-optimal solutions were identified using the non-dominated sorting genetic algorithm (NSGA-II). Finally, a numerically optimized solution was validated against experimental product data, in terms of both fatigue endurance and thickness index.
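The notion of a two-objective Pareto-optimal set used in this abstract can be sketched with a simple non-dominated filter. This is only the dominance test that underlies NSGA-II, not the genetic algorithm itself, and the (endurance, thickness) values are made up for illustration.

```python
def dominates(a, b):
    """True if design a is at least as good as b in both objectives and better in one.

    Each design is (fatigue_endurance, thickness): endurance is maximized,
    thickness is minimized.
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    better = a[0] > b[0] or a[1] < b[1]
    return no_worse and better

def pareto_front(designs):
    """Keep the designs that no other design dominates."""
    return [d for d in designs if not any(dominates(o, d) for o in designs)]

# Made-up candidate designs: (endurance in thousand cycles, thickness in mm).
designs = [(100, 1.0), (120, 1.2), (90, 1.1), (120, 1.5), (150, 2.0)]
front = pareto_front(designs)
```

A designer would then pick one point from the front according to how much thickness is acceptable for a given gain in endurance.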
The wisdom of the commons: ensemble tree classifiers for prostate cancer prognosis.
Koziol, James A; Feng, Anne C; Jia, Zhenyu; Wang, Yipeng; Goodison, Seven; McClelland, Michael; Mercola, Dan
2009-01-01
Classification and regression trees have long been used for cancer diagnosis and prognosis. Nevertheless, instability and variable selection bias, as well as overfitting, are well-known problems of tree-based methods. In this article, we investigate whether ensemble tree classifiers can ameliorate these difficulties, using data from two recent studies of radical prostatectomy in prostate cancer. Using time to progression following prostatectomy as the relevant clinical endpoint, we found that ensemble tree classifiers robustly and reproducibly identified three subgroups of patients in the two clinical datasets: non-progressors, early progressors and late progressors. Moreover, the consensus classifications were independent predictors of time to progression compared to known clinical prognostic factors.
Waldron, A; Justicia, R; Smith, L E
2015-03-01
The twin United Nations' Millennium Development Goals of biodiversity preservation and poverty reduction both strongly depend on actions in the tropics. In particular, traditional agroforestry could be critical to both biological conservation and human livelihoods in human-altered rainforest areas. However, traditional agroforestry is rapidly disappearing, because the system itself is economically precarious, and because the forest trees that shade traditional crops are now perceived to be overly detrimental to agricultural yield. Here, we show a case where the commonly used agroforestry shade metric, canopy cover, would indeed suggest complete removal of shade trees to maximize yield, with strongly negative biodiversity and climate implications. However, a yield over 50% higher was achievable if approximately 100 shade trees per hectare were planted in a spatially organized fashion, a win-win for biodiversity and the smallholder. The higher yield option was detected by optimizing simultaneously for canopy cover, and a second shade metric, neighboring tree density, which was designed to better capture the yield value of ecological services flowing from forest trees. Nevertheless, even a 50% yield increase may prove insufficient to stop farmers converting away from traditional agroforestry. To further increase agroforestry rents, we apply our results to the design of a sustainable certification (eco-labelling) scheme for cocoa-based products in a biodiversity hotspot, and consider their implications for the use of the United Nations REDD (reducing emissions from deforestation and forest degradation) program in agroforestry systems. Combining yield boost, certification, and REDD has the potential to incentivize eco-friendly agroforestry and lift smallholders out of poverty, simultaneously.
Beating the Odds: Trees to Success in Different Countries
ERIC Educational Resources Information Center
Finch, W. Holmes; Marchant, Gregory J.
2017-01-01
A recursive partitioning model approach in the form of classification and regression trees (CART) was used with 2012 PISA data for five countries (Canada, Finland, Germany, Singapore-China, and the United States). The objective of the study was to determine demographic and educational variables that differentiated between low SES students that were…
Geospatial relationships of tree species damage caused by Hurricane Katrina in south Mississippi
Mark W. Garrigues; Zhaofei Fan; David L. Evans; Scott D. Roberts; William H. Cooke III
2012-01-01
Hurricane Katrina generated substantial impacts on the forests and biological resources of the affected area in Mississippi. This study seeks to use classification tree analysis (CTA) to determine which variables are significant in predicting hurricane damage (shear or windthrow) in the Southeast Mississippi Institute for Forest Inventory District. Logistic regressions...
Using Classification Trees to Predict Alumni Giving for Higher Education
ERIC Educational Resources Information Center
Weerts, David J.; Ronca, Justin M.
2009-01-01
As the relative level of public support for higher education declines, colleges and universities aim to maximize alumni-giving to keep their programs competitive. Anchored in a utility maximization framework, this study employs the classification and regression tree methodology to examine characteristics of alumni donors and non-donors at a…
Updated generalized biomass equations for North American tree species
David C. Chojnacky; Linda S. Heath; Jennifer C. Jenkins
2014-01-01
Historically, tree biomass at large scales has been estimated by applying dimensional analysis techniques and field measurements such as diameter at breast height (dbh) in allometric regression equations. Equations often have been developed using differing methods and applied only to certain species or isolated areas. We previously had compiled and combined (in meta-...
The microcomputer scientific software series 5: the BIOMASS user's guide.
George E. Host; Stephen C. Westin; William G. Cole; Kurt S. Pregitzer
1989-01-01
BIOMASS is an interactive microcomputer program that uses allometric regression equations to calculate aboveground biomass of common tree species of the Lake States. The equations are species-specific and most use both diameter and height as independent variables. The program accommodates fixed area and variable radius sample designs and produces both individual tree...
Northern Arkansas Spring Precipitation Reconstructed from Tree Rings, 1023-1992 A.D.
Malcolm K. Cleaveland
2001-01-01
Three baldcypress (Taxodium distichum (L.) Rich.) tree-ring chronologies in northeastern Arkansas and southeastern Missouri respond strongly to April-June (spring) rainfall in northern Arkansas. I used regression to reconstruct an average of spring rainfall in the three climatic divisions of northern Arkansas since 1023 A.D. The reconstruction was...
Kobayashi, Nobuaki; Hong, Choongman; Klinman, Dennis M.; Shirota, Hidekazu
2012-01-01
The primary goal of cancer immunotherapy is to elicit an immune response capable of eliminating the tumor. One approach towards accomplishing that goal utilizes general (rather than tumor-specific) immunomodulatory agents to boost the number and activity of pre-existing cytotoxic T lymphocytes. We find that the intra-tumoral injection of poly-G ODN has such an effect, boosting anti-tumor immunity and promoting tumor regression. The anti-tumor activity of polyguanosine (poly-G) oligonucleotides (ODN) was mediated through CD8 T cells in a TLR9 independent manner. Mechanistically, poly-G ODN directly induced the phosphorylation of Lck (an essential element of the T cell signaling pathway), thereby enhancing the production of IL-2 and CD8 T cell proliferation. These findings establish poly-G ODN as a novel type of cancer immunotherapy. PMID:23296706
Weather Impact on Airport Arrival Meter Fix Throughput
NASA Technical Reports Server (NTRS)
Wang, Yao
2017-01-01
Time-based flow management provides arrival aircraft schedules based on arrival airport conditions, airport capacity, required spacing, and weather conditions. In order to meet a scheduled time at which arrival aircraft can cross an airport arrival meter fix prior to entering the airport terminal airspace, air traffic controllers regulate air traffic. Severe weather may create an airport arrival bottleneck if one or more of the airport arrival meter fixes are partially or completely blocked by the weather and the arrival demand has not been reduced accordingly. Under these conditions, aircraft are frequently put in holding patterns until they can be rerouted. A model that predicts the weather-impacted meter fix throughput may help air traffic controllers direct arrival flows into the airport more efficiently, minimizing arrival meter fix congestion. This paper presents an analysis of air traffic flows across arrival meter fixes at the Newark Liberty International Airport (EWR). Several scenarios of weather-impacted EWR arrival fix flows are described. Furthermore, multiple linear regression and regression tree ensemble learning approaches for translating multiple sector Weather Impacted Traffic Indexes (WITI) to EWR arrival meter fix throughputs are examined. These weather translation models are developed and validated using the EWR arrival flight and weather data for the period of April-September in 2014. This study also compares the performance of the regression tree ensemble with traditional multiple linear regression models for estimating the weather-impacted throughputs at each of the EWR arrival meter fixes. For all meter fixes investigated, the results from the regression tree ensemble weather translation models show a stronger correlation between model outputs and observed meter fix throughputs than those produced by the multiple linear regression method.
Shi, K-Q; Zhou, Y-Y; Yan, H-D; Li, H; Wu, F-L; Xie, Y-Y; Braddock, M; Lin, X-Y; Zheng, M-H
2017-02-01
At present, there is no ideal model for predicting the short-term outcome of patients with acute-on-chronic hepatitis B liver failure (ACHBLF). This study aimed to establish and validate a prognostic model by using classification and regression tree (CART) analysis. A total of 1047 patients with suspected ACHBLF from two separate medical centres were screened and assigned to the derivation cohort and the validation cohort, respectively. CART analysis was applied to predict the 3-month mortality of patients with ACHBLF. The accuracy of the CART model was tested using the area under the receiver operating characteristic curve, which was compared with the model for end-stage liver disease (MELD) score and a new logistic regression model. CART analysis identified four variables as prognostic factors of ACHBLF: total bilirubin, age, serum sodium and INR, and three distinct risk groups: low risk (4.2%), intermediate risk (30.2%-53.2%) and high risk (81.4%-96.9%). The new logistic regression model was constructed with four independent factors, including age, total bilirubin, serum sodium and prothrombin activity, by multivariate logistic regression analysis. The performance of the CART model (0.896), similar to that of the logistic regression model (0.914, P=.382), exceeded that of the MELD score (0.667, P<.001). The results were confirmed in the validation cohort. We have developed and validated a novel CART model superior to MELD for predicting three-month mortality of patients with ACHBLF. Thus, the CART model could facilitate medical decision-making and provide clinicians with a validated practical bedside tool for ACHBLF risk stratification. © 2016 John Wiley & Sons Ltd.
NASA Astrophysics Data System (ADS)
Wilson, Barry T.; Knight, Joseph F.; McRoberts, Ronald E.
2018-03-01
Imagery from the Landsat Program has been used frequently as a source of auxiliary data for modeling land cover, as well as a variety of attributes associated with tree cover. With ready access to all scenes in the archive since 2008 due to the USGS Landsat Data Policy, new approaches to deriving such auxiliary data from dense Landsat time series are required. Several methods have previously been developed for use with finer temporal resolution imagery (e.g. AVHRR and MODIS), including image compositing and harmonic regression using Fourier series. The manuscript presents a study, using Minnesota, USA during the years 2009-2013 as the study area and timeframe. The study examined the relative predictive power of land cover models, in particular those related to tree cover, using predictor variables based solely on composite imagery versus those using estimated harmonic regression coefficients. The study used two common non-parametric modeling approaches (i.e. k-nearest neighbors and random forests) for fitting classification and regression models of multiple attributes measured on USFS Forest Inventory and Analysis plots using all available Landsat imagery for the study area and timeframe. The estimated Fourier coefficients developed by harmonic regression of tasseled cap transformation time series data were shown to be correlated with land cover, including tree cover. Regression models using estimated Fourier coefficients as predictor variables showed a two- to threefold increase in explained variance for a small set of continuous response variables, relative to comparable models using monthly image composites. Similarly, the overall accuracies of classification models using the estimated Fourier coefficients were approximately 10-20 percentage points higher than the models using the image composites, with corresponding individual class accuracies between six and 45 percentage points higher.
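The harmonic regression mentioned in this abstract can be illustrated with a minimal sketch. It fits a single annual harmonic, y(t) = a0 + a1 cos(2πt/T) + b1 sin(2πt/T), to irregularly spaced observations by ordinary least squares. The first-order model, the 365.25-day period, the function names, and the synthetic values are all illustrative assumptions; the abstract does not state the order of the Fourier series or the software the authors used.

```python
import math


def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_harmonic(days, values, period=365.25):
    """Return [a0, a1, b1] for a first-order harmonic fitted by least squares."""
    rows = [
        [1.0, math.cos(2 * math.pi * d / period), math.sin(2 * math.pi * d / period)]
        for d in days
    ]
    k = 3
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, values)) for i in range(k)]
    return solve_linear(xtx, xty)


# Hypothetical, irregularly spaced acquisition days and a synthetic index series.
days = [5, 21, 53, 101, 133, 165, 197, 229, 261, 309, 341]
values = [
    0.3 + 0.2 * math.cos(2 * math.pi * d / 365.25) + 0.1 * math.sin(2 * math.pi * d / 365.25)
    for d in days
]
coefficients = fit_harmonic(days, values)
print(coefficients)
```

In a workflow like the one described, the fitted coefficients for each pixel, rather than the raw composites, would serve as predictor variables for the downstream classification or regression model.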
Gradient boosting machine for modeling the energy consumption of commercial buildings
Touzani, Samir; Granderson, Jessica; Fernandes, Samuel
2017-11-26
Accurate savings estimations are important to promote energy efficiency projects and demonstrate their cost-effectiveness. The increasing presence of advanced metering infrastructure (AMI) in commercial buildings has resulted in a rising availability of high frequency interval data. These data can be used for a variety of energy efficiency applications such as demand response, fault detection and diagnosis, and heating, ventilation, and air conditioning (HVAC) optimization. This large amount of data has also opened the door to the use of advanced statistical learning models, which hold promise for providing accurate building baseline energy consumption predictions, and thus accurate saving estimations. The gradient boosting machine is a powerful machine learning algorithm that is gaining considerable traction in a wide range of data driven applications, such as ecology, computer vision, and biology. In the present work an energy consumption baseline modeling method based on a gradient boosting machine was proposed. To assess the performance of this method, a recently published testing procedure was used on a large dataset of 410 commercial buildings. The model training periods were varied and several prediction accuracy metrics were used to evaluate the model's performance. The results show that using the gradient boosting machine model improved the R-squared prediction accuracy and the CV(RMSE) in more than 80 percent of the cases, when compared to an industry best practice model that is based on piecewise linear regression, and to a random forest algorithm.
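The two accuracy metrics named in this abstract, R-squared and CV(RMSE), can be sketched as follows. This is a minimal illustration that assumes the plain definitions (R-squared as one minus the ratio of residual to total sum of squares, and CV(RMSE) as the root mean squared error divided by the mean observed value). Some guidelines use a degrees-of-freedom correction in the RMSE denominator, and the abstract does not say which variant the authors used. The meter readings are hypothetical.

```python
import math


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def cv_rmse(observed, predicted):
    """Coefficient of variation of the RMSE (uncorrected form, an assumption)."""
    n = len(observed)
    rmse = math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)
    return rmse / (sum(observed) / n)


# Hypothetical interval energy readings (kWh) and baseline-model predictions.
observed = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 12.0, 13.0, 16.0]
print(r_squared(observed, predicted))  # about 0.9
print(cv_rmse(observed, predicted))    # about 0.054
```

A higher R-squared and a lower CV(RMSE) both indicate a better baseline fit, which is why the abstract reports improvement on both.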
Gradient boosting machine for modeling the energy consumption of commercial buildings
DOE Office of Scientific and Technical Information (OSTI.GOV)
Touzani, Samir; Granderson, Jessica; Fernandes, Samuel
Accurate savings estimations are important to promote energy efficiency projects and demonstrate their cost-effectiveness. The increasing presence of advanced metering infrastructure (AMI) in commercial buildings has resulted in a rising availability of high frequency interval data. These data can be used for a variety of energy efficiency applications such as demand response, fault detection and diagnosis, and heating, ventilation, and air conditioning (HVAC) optimization. This large amount of data has also opened the door to the use of advanced statistical learning models, which hold promise for providing accurate building baseline energy consumption predictions, and thus accurate saving estimations. The gradient boosting machine is a powerful machine learning algorithm that is gaining considerable traction in a wide range of data driven applications, such as ecology, computer vision, and biology. In the present work an energy consumption baseline modeling method based on a gradient boosting machine was proposed. To assess the performance of this method, a recently published testing procedure was used on a large dataset of 410 commercial buildings. The model training periods were varied and several prediction accuracy metrics were used to evaluate the model's performance. The results show that using the gradient boosting machine model improved the R-squared prediction accuracy and the CV(RMSE) in more than 80 percent of the cases, when compared to an industry best practice model that is based on piecewise linear regression, and to a random forest algorithm.
de Seny, Dominique; Fillet, Marianne; Meuwis, Marie-Alice; Geurts, Pierre; Lutteri, Laurence; Ribbens, Clio; Bours, Vincent; Wehenkel, Louis; Piette, Jacques; Malaise, Michel; Merville, Marie-Paule
2005-12-01
To identify serum protein biomarkers specific for rheumatoid arthritis (RA), using surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF-MS) technology. A total of 103 serum samples from patients and healthy controls were analyzed. Thirty-four of the patients had a diagnosis of RA, based on the American College of Rheumatology criteria. The inflammation control group comprised 20 patients with psoriatic arthritis (PsA), 9 with asthma, and 10 with Crohn's disease. The noninflammation control group comprised 14 patients with knee osteoarthritis and 16 healthy control subjects. Serum protein profiles were obtained by SELDI-TOF-MS and compared in order to identify new biomarkers specific for RA. Data were analyzed by a machine learning algorithm called decision tree boosting, according to different preprocessing steps. The most discriminative mass/charge (m/z) values serving as potential biomarkers for RA were identified on arrays for both patients with RA versus controls and patients with RA versus patients with PsA. From among several candidates, the following peaks were highlighted: m/z values of 2,924 (RA versus controls on H4 arrays), 10,832 and 11,632 (RA versus controls on CM10 arrays), 4,824 (RA versus PsA on H4 arrays), and 4,666 (RA versus PsA on CM10 arrays). Positive results of proteomic analysis were associated with positive results of the anti-cyclic citrullinated peptide test. Our observations suggested that the 10,832 peak could represent myeloid-related protein 8. SELDI-TOF-MS technology allows rapid analysis of many serum samples, and use of decision tree boosting analysis as the main statistical method allowed us to propose a pattern of protein peaks specific for RA.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Möller, A.; Ruhlmann-Kleider, V.; Leloup, C.
In the era of large astronomical surveys, photometric classification of supernovae (SNe) has become an important research field due to limited spectroscopic resources for candidate follow-up and classification. In this work, we present a method to photometrically classify type Ia supernovae based on machine learning with redshifts that are derived from the SN light-curves. This method is implemented on real data from the SNLS deferred pipeline, a purely photometric pipeline that identifies SNe Ia at high-redshifts (0.2 < z < 1.1). Our method consists of two stages: feature extraction (obtaining the SN redshift from photometry and estimating light-curve shape parameters) and machine learning classification. We study the performance of different algorithms such as Random Forest and Boosted Decision Trees. We evaluate the performance using SN simulations and real data from the first 3 years of the Supernova Legacy Survey (SNLS), which contains large spectroscopically and photometrically classified type Ia samples. Using the Area Under the Curve (AUC) metric, where perfect classification is given by 1, we find that our best-performing classifier (Extreme Gradient Boosting Decision Tree) has an AUC of 0.98. We show that it is possible to obtain a large photometrically selected type Ia SN sample with an estimated contamination of less than 5%. When applied to data from the first three years of SNLS, we obtain 529 events. We investigate the differences between classifying simulated SNe, and real SN survey data. In particular, we find that applying a thorough set of selection cuts to the SN sample is essential for good classification. This work demonstrates for the first time the feasibility of machine learning classification in a high-z SN survey with application to real SN data.
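The AUC metric used in this abstract can be computed directly from classifier scores. The sketch below uses the pairwise (Mann-Whitney) formulation: the AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The function name and the toy labels and scores are hypothetical and unrelated to the SNLS data.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(positives) * len(negatives))


# Hypothetical example: 1 = type Ia, 0 = other.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(auc(labels, scores))  # 0.75
```

The quadratic loop is fine for illustration; for large samples a rank-based computation gives the same value in O(n log n).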
DOE Office of Scientific and Technical Information (OSTI.GOV)
Singh, Kunwar P., E-mail: kpsingh_52@yahoo.com; Gupta, Shikha
Ensemble learning approach based decision treeboost (DTB) and decision tree forest (DTF) models are introduced in order to establish quantitative structure–toxicity relationship (QSTR) for the prediction of toxicity of 1450 diverse chemicals. Eight non-quantum mechanical molecular descriptors were derived. Structural diversity of the chemicals was evaluated using Tanimoto similarity index. DTB and DTF models supplemented with stochastic gradient boosting and bagging algorithms were constructed for classification and function optimization problems using the toxicity end-point in T. pyriformis. Special attention was drawn to prediction ability and robustness of the models, investigated both in external and 10-fold cross validation processes. In complete data, optimal DTB and DTF models rendered accuracies of 98.90%, 98.83% in two-category and 98.14%, 98.14% in four-category toxicity classifications. Both the models further yielded classification accuracies of 100% in external toxicity data of T. pyriformis. The constructed regression models (DTB and DTF) using five descriptors yielded correlation coefficients (R2) of 0.945, 0.944 between the measured and predicted toxicities with mean squared errors (MSEs) of 0.059, and 0.064 in complete T. pyriformis data. The T. pyriformis regression models (DTB and DTF) applied to the external toxicity data sets yielded R2 and MSE values of 0.637, 0.655; 0.534, 0.507 (marine bacteria) and 0.741, 0.691; 0.155, 0.173 (algae). The results suggest wide applicability of the inter-species models in predicting toxicity of new chemicals for regulatory purposes. These approaches provide useful strategy and robust tools in the screening of ecotoxicological risk or environmental hazard potential of chemicals. - Graphical abstract: Importance of input variables in DTB and DTF classification models for (a) two-category, and (b) four-category toxicity intervals in T. pyriformis data. Generalization and predictive abilities of the constructed (c) DTB and (d) DTF regression models to predict the T. pyriformis toxicity of diverse chemicals. - Highlights: • Ensemble learning (EL) based models constructed for toxicity prediction of chemicals • Predictive models used a few simple non-quantum mechanical molecular descriptors. • EL-based DTB/DTF models successfully discriminated toxic and non-toxic chemicals. • DTB/DTF regression models precisely predicted toxicity of chemicals in multi-species. • Proposed EL based models can be used as tool to predict toxicity of new chemicals.
Nolan, Bernard T.; Fienen, Michael N.; Lorenz, David L.
2015-01-01
We used a statistical learning framework to evaluate the ability of three machine-learning methods to predict nitrate concentration in shallow groundwater of the Central Valley, California: boosted regression trees (BRT), artificial neural networks (ANN), and Bayesian networks (BN). Machine learning methods can learn complex patterns in the data but because of overfitting may not generalize well to new data. The statistical learning framework involves cross-validation (CV) training and testing data and a separate hold-out data set for model evaluation, with the goal of optimizing predictive performance by controlling for model overfit. The order of prediction performance according to both CV testing R2 and that for the hold-out data set was BRT > BN > ANN. For each method we identified two models based on CV testing results: that with maximum testing R2 and a version with R2 within one standard error of the maximum (the 1SE model). The former yielded CV training R2 values of 0.94–1.0. Cross-validation testing R2 values indicate predictive performance, and these were 0.22–0.39 for the maximum R2 models and 0.19–0.36 for the 1SE models. Evaluation with hold-out data suggested that the 1SE BRT and ANN models predicted better for an independent data set compared with the maximum R2 versions, which is relevant to extrapolation by mapping. Scatterplots of predicted vs. observed hold-out data obtained for final models helped identify prediction bias, which was fairly pronounced for ANN and BN. Lastly, the models were compared with multiple linear regression (MLR) and a previous random forest regression (RFR) model. Whereas BRT results were comparable to RFR, MLR had low hold-out R2 (0.07) and explained less than half the variation in the training data. Spatial patterns of predictions by the final, 1SE BRT model agreed reasonably well with previously observed patterns of nitrate occurrence in groundwater of the Central Valley.
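The "1SE model" selection described in this abstract can be sketched as a one-standard-error rule. The version below picks, among candidates whose mean cross-validation testing R2 lies within one standard error of the maximum, the least complex one. Treating the number of trees as the complexity measure, preferring the simplest eligible candidate, and the candidate values themselves are assumptions made for illustration; the abstract only states that the 1SE model has R2 within one standard error of the maximum.

```python
def select_one_se(candidates):
    """Return the least complex candidate within one standard error of the best mean R2.

    Each candidate is a dict with keys 'n_trees' (complexity, hypothetical),
    'mean_r2' (mean CV testing R2) and 'se_r2' (its standard error).
    """
    best = max(candidates, key=lambda c: c["mean_r2"])
    threshold = best["mean_r2"] - best["se_r2"]
    eligible = [c for c in candidates if c["mean_r2"] >= threshold]
    return min(eligible, key=lambda c: c["n_trees"])


# Hypothetical cross-validation summary for four boosted-tree settings.
candidates = [
    {"n_trees": 500, "mean_r2": 0.30, "se_r2": 0.04},
    {"n_trees": 1000, "mean_r2": 0.36, "se_r2": 0.03},
    {"n_trees": 2000, "mean_r2": 0.39, "se_r2": 0.04},
    {"n_trees": 4000, "mean_r2": 0.38, "se_r2": 0.04},
]
print(select_one_se(candidates)["n_trees"])  # 1000
```

Here the maximum-R2 candidate uses 2000 trees, but the 1000-tree candidate is within one standard error of it and is simpler, which mirrors the trade-off against overfitting that the abstract describes.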
Nattee, Cholwich; Khamsemanan, Nirattaya; Lawtrakul, Luckhana; Toochinda, Pisanu; Hannongbua, Supa
2017-01-01
Malaria is still one of the most serious diseases in tropical regions. This is due in part to the high resistance against available drugs for the inhibition of parasites, Plasmodium, the cause of the disease. New potent compounds with high clinical utility are urgently needed. In this work, we created a novel model using a regression tree to study structure-activity relationships and predict the inhibition constant, Ki, of three different antimalarial analogues (Trimethoprim, Pyrimethamine, and Cycloguanil) based on their molecular descriptors. To the best of our knowledge, this work is the first attempt to study the structure-activity relationships of all three analogues combined. The most relevant descriptors and appropriate parameters of the regression tree are harvested using extremely randomized trees. These descriptors are water accessible surface area, Log of the aqueous solubility, total hydrophobic van der Waals surface area, and molecular refractivity. Out of all possible combinations of these selected parameters and descriptors, the tree with the strongest coefficient of determination is selected to be our prediction model. Predicted Ki values from the proposed model show a strong coefficient of determination, R2 = 0.996, to experimental Ki values. From the structure of the regression tree, compounds with high accessible surface area of all hydrophobic atoms (ASA_H) and low aqueous solubility of inhibitors (Log S) generally possess low Ki values. Our prediction model can also be utilized as a screening test for new antimalarial drug compounds which may reduce the time and expenses for new drug development. New compounds with high predicted Ki should be excluded from further drug development. It is also our inference that a threshold of ASA_H greater than 575.80 and Log S less than or equal to -4.36 is a sufficient condition for a new compound to possess a low Ki. Copyright © 2016 Elsevier Inc. All rights reserved.
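The threshold rule stated at the end of this abstract can be written as a simple screening check. This is only a sketch of that one sentence: the function and constant names are hypothetical, the descriptor values in the example are invented, and the rule is a sufficient condition, so a compound that fails it is undetermined rather than predicted to have a high inhibition constant.

```python
# Thresholds quoted in the abstract; descriptor units follow the source software (not stated).
ASA_H_THRESHOLD = 575.80
LOG_S_THRESHOLD = -4.36


def meets_low_ki_condition(asa_h, log_s):
    """True if ASA_H > 575.80 and Log S <= -4.36 (sufficient condition for low Ki)."""
    return asa_h > ASA_H_THRESHOLD and log_s <= LOG_S_THRESHOLD


# Hypothetical candidate compounds: (name, ASA_H, Log S).
compounds = [("A", 600.0, -5.0), ("B", 550.0, -5.0), ("C", 600.0, -4.0)]
for name, asa_h, log_s in compounds:
    print(name, meets_low_ki_condition(asa_h, log_s))
```

Only compound A satisfies both inequalities; B and C would need the full regression tree to obtain a predicted value.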
Koch, George W; Sillett, Stephen C; Jennings, Gregory M; Davis, Stephen D
2004-04-22
Trees grow tall where resources are abundant, stresses are minor, and competition for light places a premium on height growth. The height to which trees can grow and the biophysical determinants of maximum height are poorly understood. Some models predict heights of up to 120 m in the absence of mechanical damage, but there are historical accounts of taller trees. Current hypotheses of height limitation focus on increasing water transport constraints in taller trees and the resulting reductions in leaf photosynthesis. We studied redwoods (Sequoia sempervirens), including the tallest known tree on Earth (112.7 m), in wet temperate forests of northern California. Our regression analyses of height gradients in leaf functional characteristics estimate a maximum tree height of 122-130 m barring mechanical damage, similar to the tallest recorded trees of the past. As trees grow taller, increasing leaf water stress due to gravity and path length resistance may ultimately limit leaf expansion and photosynthesis for further height growth, even with ample soil moisture.
Comparison of Sub-Pixel Classification Approaches for Crop-Specific Mapping
This paper examined two non-linear models, Multilayer Perceptron (MLP) regression and Regression Tree (RT), for estimating sub-pixel crop proportions using time-series MODIS-NDVI data. The sub-pixel proportions were estimated for three major crop types including corn, soybean, a...
Forterre, Patrick
2017-01-01
The eocyte hypothesis, in which Eukarya emerged from within Archaea, has been boosted by the description of a new candidate archaeal phylum, “Lokiarchaeota”, from metagenomic data. Eukarya branch within Lokiarchaeota in a tree reconstructed from the concatenation of 36 universal proteins. However, individual phylogenies revealed that lokiarchaeal protein sequences have different evolutionary histories. The individual marker phylogenies revealed at least two subsets of proteins, supporting either the Woese or the Eocyte tree of life. Strikingly, removal of a single protein, the elongation factor EF2, is sufficient to break the Eukaryotes-Lokiarchaea affiliation. Our analysis suggests that the three lokiarchaeal EF2 proteins have a chimeric organization that could be due to contamination and/or homologous recombination with patches of eukaryotic sequences. A robust phylogenetic analysis of RNA polymerases with a new dataset indicates that Lokiarchaeota and related phyla of the Asgard superphylum are sister group to Euryarchaeota, not to Eukarya, and supports the monophyly of Archaea with their rooting in the branch leading to Thaumarchaeota. PMID:28604769
Authentication of Ginkgo biloba herbal dietary supplements using DNA barcoding.
Little, Damon P
2014-09-01
Ginkgo biloba L. (known as ginkgo or maidenhair tree) is a phylogenetically isolated, charismatic, gymnosperm tree. Herbal dietary supplements, prepared from G. biloba leaves, are consumed to boost cognitive capacity via improved blood perfusion and mitochondrial function. A novel DNA mini-barcode assay was designed and validated for the authentication of G. biloba in herbal dietary supplements (n = 22; sensitivity = 1.00, 95% CI = 0.59-1.00; specificity = 1.00, 95% CI = 0.64-1.00). This assay was further used to estimate the frequency of mislabeled ginkgo herbal dietary supplements on the market in the United States of America: DNA amenable to PCR could not be extracted from three (7.5%) of the 40 supplements sampled, 31 of 37 (83.8%) assayable supplements contained identifiable G. biloba DNA, and six supplements (16.2%) contained fillers without any detectable G. biloba DNA. It is hoped that this assay will be used by supplement manufacturers to ensure that their supplements contain G. biloba.
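The market-survey percentages in this abstract use two different denominators, which is easy to misread. The short sketch below reproduces the arithmetic from the counts given: the share without PCR-amenable DNA is taken over all 40 sampled supplements, while the two remaining shares are taken over the 37 assayable ones. The variable and function names are illustrative.

```python
sampled = 40                              # supplements sampled
no_amplifiable_dna = 3                    # DNA amenable to PCR could not be extracted
assayable = sampled - no_amplifiable_dna  # 37 supplements that could be assayed
with_ginkgo = 31                          # identifiable G. biloba DNA
without_ginkgo = assayable - with_ginkgo  # fillers only, no detectable G. biloba DNA


def percent(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


print(percent(no_amplifiable_dna, sampled))  # 7.5
print(percent(with_ginkgo, assayable))       # 83.8
print(percent(without_ginkgo, assayable))    # 16.2
```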
"Mad or bad?": burden on caregivers of patients with personality disorders.
Bauer, Rita; Döring, Antje; Schmidt, Tanja; Spießl, Hermann
2012-12-01
The burden on caregivers of patients with personality disorders is often greatly underestimated or completely disregarded. Possibilities for caregiver support have rarely been assessed. Thirty interviews were conducted with caregivers of such patients to assess illness-related burden. Responses were analyzed with a mixed method of qualitative and quantitative analysis in a sequential design. Patient and caregiver data, including sociodemographic and disease-related variables, were evaluated with regression analysis and regression trees. Caregiver statements (n = 404) were summarized into 44 global statements. The most frequent global statements were worries about the burden on other family members (70.0%), poor cooperation with clinical centers and other institutions (60.0%), financial burden (56.7%), worry about the patient's future (53.3%), and dissatisfaction with the patient's treatment and rehabilitation (53.3%). Linear regression and regression tree analysis identified predictors for more burdened caregivers. Caregivers of patients with personality disorders experience a variety of burdens, some disorder specific. Yet these caregivers often receive little attention or support.
NASA Astrophysics Data System (ADS)
Niemeijer, Meindert; Dumitrescu, Alina V.; van Ginneken, Bram; Abrámoff, Michael D.
2011-03-01
Parameters extracted from the vasculature on the retina are correlated with various conditions such as diabetic retinopathy and cardiovascular diseases such as stroke. Segmentation of the vasculature on the retina has been a topic that has received much attention in the literature over the past decade. Analysis of the segmentation result, however, has only received limited attention with most works describing methods to accurately measure the width of the vessels. Analyzing the connectedness of the vascular network is an important step towards the characterization of the complete vascular tree. The retinal vascular tree, from an image interpretation point of view, originates at the optic disc and spreads out over the retina. The tree bifurcates and the vessels also cross each other. The points where this happens form the key to determining the connectedness of the complete tree. We present a supervised method to detect the bifurcations and crossing points of the vasculature of the retina. The method uses features extracted from the vasculature as well as the image in a location regression approach to find those locations of the segmented vascular tree where the bifurcation or crossing occurs (from here, POI, points of interest). We evaluate the method on the publicly available DRIVE database in which an ophthalmologist has marked the POI.
Ultrasonographic Diagnosis of Biliary Atresia Based on a Decision-Making Tree Model.
Lee, So Mi; Cheon, Jung-Eun; Choi, Young Hun; Kim, Woo Sun; Cho, Hyun-Hae; Cho, Hyun-Hye; Kim, In-One; You, Sun Kyoung
2015-01-01
To assess the diagnostic value of various ultrasound (US) findings and to make a decision-tree model for US diagnosis of biliary atresia (BA). From March 2008 to January 2014, the following US findings were retrospectively evaluated in 100 infants with cholestatic jaundice (BA, n = 46; non-BA, n = 54): length and morphology of the gallbladder, triangular cord thickness, hepatic artery and portal vein diameters, and visualization of the common bile duct. Logistic regression analyses were performed to determine the features that would be useful in predicting BA. Conditional inference tree analysis was used to generate a decision-making tree for classifying patients into the BA or non-BA groups. Multivariate logistic regression analysis showed that abnormal gallbladder morphology and greater triangular cord thickness were significant predictors of BA (p = 0.003 and 0.001; adjusted odds ratio: 345.6 and 65.6, respectively). In the decision-making tree using conditional inference tree analysis, gallbladder morphology and triangular cord thickness (optimal cutoff value of triangular cord thickness, 3.4 mm) were also selected as significant discriminators for differential diagnosis of BA, and gallbladder morphology was the first discriminator. The diagnostic performance of the decision-making tree was excellent, with sensitivity of 100% (46/46), specificity of 94.4% (51/54), and overall accuracy of 97% (97/100). Abnormal gallbladder morphology and greater triangular cord thickness (> 3.4 mm) were the most useful predictors of BA on US. We suggest that the gallbladder morphology should be evaluated first and that triangular cord thickness should be evaluated subsequently in cases with normal gallbladder morphology.
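The diagnostic performance figures reported for the decision-making tree follow directly from the counts in the abstract: 46 of 46 BA cases and 51 of 54 non-BA cases were classified correctly. A minimal sketch of the calculation, with a hypothetical function name, is:

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity and overall accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Counts implied by the abstract: BA n = 46 (all detected), non-BA n = 54 (3 misclassified).
metrics = diagnostic_metrics(tp=46, fn=0, tn=51, fp=3)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.1f}%")
```

This prints 100.0%, 94.4%, and 97.0%, matching the sensitivity, specificity, and overall accuracy quoted above.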
A cosmetic evaluation of breast cancer treatment: A randomized study of radiotherapy boost technique
DOE Office of Scientific and Technical Information (OSTI.GOV)
Vass, Sylvie; Bairati, Isabelle
2005-08-01
Purpose: To compare cosmetic results of two different radiotherapy (RT) boost techniques used in the treatment of breast cancer after whole breast radiotherapy and to identify factors affecting cosmetic outcomes. Methods and Materials: Between 1996 and 1998, 142 patients with Stage I and II breast cancer were treated with breast conservative surgery and adjuvant RT. Patients were then randomly assigned to receive a boost dose of 15 Gy delivered to the tumor bed either by iridium 192, or a combination of photons and electrons. Cosmetic evaluations were done on a 6-month basis, with a final evaluation at 36 months after RT. The evaluations were done using a panel of global and specific subjective scores, a digitized scoring system using the breast retraction assessment (BRA) measurement, and a patient's self-assessment evaluation. As cosmetic results were graded according to severity, the comparison of boost techniques was done using the ordinal logistic regression model. Adjusted odds ratios (OR) and their 95% confidence intervals (CI) are presented. Results: At 36 months of follow-up, there was no significant difference between the two groups with respect to the global subjective cosmetic outcome (OR = 1.40; 95%CI = 0.69-2.85, p = 0.35). Good to excellent scores were observed in 65% of implant patients and 62% of photon/electron patients. At 24 months and beyond, telangiectasia was more severe in the implant group with an OR of 9.64 (95%CI = 4.05-22.92, p < 0.0001) at 36 months. The only variable associated with a worse global cosmetic outcome was the presence of concomitant chemotherapy (OR = 3.87; 95%CI = 1.74-8.62). The BRA value once adjusted for age, concomitant chemotherapy, and boost volume showed a positive association with the boost technique. The BRA value was significantly greater in the implant group (p 0.03). There was no difference in the patient's final self-assessment score between the two groups. Three variables were statistically associated with an adverse self-evaluation: an inferior quadrant tumor localization, postoperative hematoma, and concomitant chemotherapy. Conclusions: Although this trial showed that at 36 months of follow-up, there were no significant differences in the overall global cosmetic scores between the implant boost group and the photon/electron boost group, telangiectasia was more severe and the BRA value was greater in the implant group.
NASA Astrophysics Data System (ADS)
Bradshaw, Tyler; Fu, Rau; Bowen, Stephen; Zhu, Jun; Forrest, Lisa; Jeraj, Robert
2015-07-01
Dose painting relies on the ability of functional imaging to identify resistant tumor subvolumes to be targeted for additional boosting. This work assessed the ability of FDG, FLT, and Cu-ATSM PET imaging to predict the locations of residual FDG PET in canine tumors following radiotherapy. Nineteen canines with spontaneous sinonasal tumors underwent PET/CT imaging with radiotracers FDG, FLT, and Cu-ATSM prior to hypofractionated radiotherapy. Therapy consisted of 10 fractions of 4.2 Gy to the sinonasal cavity with or without an integrated boost of 0.8 Gy to the GTV. Patients had an additional FLT PET/CT scan after fraction 2, a Cu-ATSM PET/CT scan after fraction 3, and follow-up FDG PET/CT scans after radiotherapy. Following image registration, simple and multiple linear and logistic voxel regressions were performed to assess how well pre- and mid-treatment PET imaging predicted post-treatment FDG uptake. R2 and pseudo R2 were used to assess the goodness of fits. For simple linear regression models, regression coefficients for all pre- and mid-treatment PET images were significantly positive across the population (P < 0.05). However, there was large variability among patients in goodness of fits: R2 ranged from 0.00 to 0.85, with a median of 0.12. Results for logistic regression models were similar. Multiple linear regression models resulted in better fits (median R2 = 0.31), but there was still large variability between patients in R2. The R2 from regression models for different predictor variables were highly correlated across patients (R ≈ 0.8), indicating tumors that were poorly predicted with one tracer were also poorly predicted by other tracers. In conclusion, the high inter-patient variability in goodness of fits indicates that PET was able to predict locations of residual tumor in some patients, but not others. This suggests not all patients would be good candidates for dose painting based on a single biological target.
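The goodness-of-fit measure used above can be illustrated with a minimal sketch of simple linear regression and its R2 for one patient. The voxel values are invented for illustration, and the function name is hypothetical; the study's own registration and regression pipeline is not reproduced here.

```python
def simple_linear_r2(x, y):
    """Ordinary least squares fit of y on x; returns (slope, intercept, R^2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Hypothetical pre-treatment and post-treatment uptake values for five voxels.
pre_uptake = [1.0, 2.0, 3.0, 4.0, 5.0]
post_uptake = [1.2, 1.9, 3.2, 3.8, 5.1]
slope, intercept, r2 = simple_linear_r2(pre_uptake, post_uptake)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, R2 = {r2:.2f}")
```

Running the same function per patient and comparing the resulting R2 values is one way to see the inter-patient variability the abstract describes.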
Anderson, S.C.; Kupfer, J.A.; Wilson, R.R.; Cooper, R.J.
2000-01-01
The purpose of this research was to develop a model that could be used to provide a spatial representation of uneven-aged silvicultural treatments on forest crown area. We began by developing species-specific linear regression equations relating tree DBH to crown area for eight bottomland tree species at White River National Wildlife Refuge, Arkansas, USA. The relationships were highly significant for all species, with coefficients of determination (r²) ranging from 0.37 for Ulmus crassifolia to nearly 0.80 for Quercus nuttallii and Taxodium distichum. We next located and measured the diameters of more than 4000 stumps from a single tree-group selection timber harvest. Stump locations were recorded with respect to an established grid point system and entered into a Geographic Information System (ARC/INFO). The area occupied by the crown of each logged individual was then estimated by using the stump dimensions (adjusted to DBHs) and the regression equations relating tree DBH to crown area. Our model projected that the selection cuts removed roughly 300 m² of basal area from the logged sites, resulting in the loss of approximately 55 000 m² of crown area. The model developed in this research represents a tool that can be used in conjunction with remote sensing applications to assist in forest inventory and management, as well as to estimate the impacts of selective timber harvest on wildlife.
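The stump-to-crown calculation described above can be sketched as follows. All coefficients, the stump-to-DBH adjustment factor, and the function names are hypothetical placeholders; the published species equations are not given in the abstract.

```python
# Hypothetical (intercept, slope) pairs for crown_area_m2 = a + b * dbh_cm.
CROWN_EQUATIONS = {
    "Quercus nuttallii": (-5.0, 2.0),
    "Taxodium distichum": (2.0, 1.0),
}

def stump_to_dbh(stump_diameter_cm, adjustment=0.9):
    """Adjust a stump diameter to DBH with an assumed constant ratio."""
    return stump_diameter_cm * adjustment

def crown_area(species, dbh_cm):
    """Apply the species-specific linear equation; never return a negative area."""
    intercept, slope = CROWN_EQUATIONS[species]
    return max(0.0, intercept + slope * dbh_cm)

def total_crown_area_removed(stumps):
    """Sum estimated crown area over (species, stump_diameter_cm) records."""
    return sum(crown_area(sp, stump_to_dbh(d)) for sp, d in stumps)

harvest = [("Quercus nuttallii", 50.0), ("Taxodium distichum", 40.0)]
print(f"Estimated crown area removed: {total_crown_area_removed(harvest):.1f} m2")
```

In a GIS workflow the per-stump areas would be attached to the recorded stump coordinates rather than simply summed.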
Scollo, Annalisa; Gottardo, Flaviana; Contiero, Barbara; Edwards, Sandra A
2017-10-01
Tail biting in pigs has been an identified behavioural, welfare and economic problem for decades, and requires appropriate but sometimes difficult on-farm interventions. The aim of the paper is to introduce the Classification and Regression Tree (CRT) methodologies to develop a tool for prevention of acute tail biting lesions in pigs on-farm. A sample of 60 commercial farms rearing heavy pigs was involved; an on-farm visit and an interview with the farmer collected data on general management, herd health, disease prevention, climate control, feeding and production traits. Results suggest a value for the CRT analysis in managing the risk factors behind tail biting on a farm-specific level, showing 86.7% sensitivity for the Classification Tree and a correlation of 0.7 between observed and predicted prevalence of tail biting obtained with the Regression Tree. CRT analysis showed five main variables (stocking density, ammonia levels, number of pigs per stockman, type of floor and timeliness in feed supply) as critical predictors of acute tail biting lesions, which demonstrate different importance in different farm subgroups. The model might have reliable and practical applications for the support and implementation of tail biting prevention interventions, especially in the case of subgroups of pigs with higher risk, helping farmers and veterinarians to assess the risk on their own farm and to manage their predisposing variables in order to reduce acute tail biting lesions.
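The two performance figures quoted above (sensitivity for the classification tree, observed-versus-predicted correlation for the regression tree) are computed as in the sketch below. The counts and prevalence values are invented; the abstract does not report the underlying confusion matrix.

```python
import math

def sensitivity(true_positives, false_negatives):
    """Share of truly affected farms that the classifier flags."""
    return true_positives / (true_positives + false_negatives)

def pearson_r(observed, predicted):
    """Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    sd_o = math.sqrt(sum((o - mean_o) ** 2 for o in observed))
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    return cov / (sd_o * sd_p)

# Hypothetical counts: 26 of 30 affected farms correctly identified.
print(f"sensitivity = {sensitivity(26, 4):.3f}")
# Hypothetical observed and predicted prevalence (%) on four farms.
print(f"r = {pearson_r([1.0, 3.0, 2.0, 6.0], [1.5, 2.5, 3.0, 5.0]):.2f}")
```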
Annual Tree Growth Predictions From Periodic Measurements
Quang V. Cao
2004-01-01
Data from annual measurements of a loblolly pine (Pinus taeda L.) plantation were available for this study. Regression techniques were employed to model annual changes of individual trees in terms of diameters, heights, and survival probabilities. Subsets of the data that include measurements every 2, 3, 4, 5, and 6 years were used to fit the same...
Understory response following varying levels of overstory removal in mixed conifer stands
Fabian C.C. Uzoh; Leroy K. Dolph; John R. Anstead
1997-01-01
Diameter growth rates of understory trees were measured for periods both before and after overstory removal on six study areas in northern California. All the species responded with increased diameter growth after adjusting to their new environments. Linear regression equations that predict post treatment diameter growth increment of the residual trees are presented...
Delayed conifer tree mortality following fire in California
Sharon M. Hood; Sheri L. Smith; Daniel R. Cluck
2007-01-01
Fire injury was characterized and survival monitored for 5,246 trees from five wildfires in California that occurred between 1999 and 2002. Logistic regression models for predicting the probability of mortality were developed for incense-cedar, Jeffrey pine, ponderosa pine, red fir and white fir. Two-year post-fire preliminary models were developed for incense-cedar,...
Estimating leaf area and leaf biomass of open-grown deciduous urban trees
David J. Nowak
1996-01-01
Logarithmic regression equations were developed to predict leaf area and leaf biomass for open-grown deciduous urban trees based on stem diameter and crown parameters. Equations based on crown parameters produced more reliable estimates. The equations can be used to help quantify forest structure and functions, particularly in urbanizing and urban/suburban areas.
Kirk M. Stueve; Dawna L. Cerney; Regina M. Rochefort; Laurie L. Kurth
2009-01-01
We performed classification analysis of 1970 satellite imagery and 2003 aerial photography to delineate establishment. Local site conditions were calculated from a LIDAR-based DEM, ancillary climate data, and 1970 tree locations in a GIS. We used logistic regression on a spatially weighted landscape matrix to rank variables.
Biomass of Yellow-Poplar in Natural Stands in Western North Carolina
Alexander Clark; James G. Schroeder
1977-01-01
Aboveground biomass was determined for yellow-poplar (Liriodendron tulipifera L.) trees 6 to 28 inches d.b.h. growing in natural, uneven-aged mountain cove stands in western North Carolina. Specific gravity, moisture content, and green weight per cubic foot are presented for the total tree and its components. Tables developed from regression equations show weight and...
The wisdom of the commons: ensemble tree classifiers for prostate cancer prognosis
Koziol, James A.; Feng, Anne C.; Jia, Zhenyu; Wang, Yipeng; Goodison, Seven; McClelland, Michael; Mercola, Dan
2009-01-01
Motivation: Classification and regression trees have long been used for cancer diagnosis and prognosis. Nevertheless, instability and variable selection bias, as well as overfitting, are well-known problems of tree-based methods. In this article, we investigate whether ensemble tree classifiers can ameliorate these difficulties, using data from two recent studies of radical prostatectomy in prostate cancer. Results: Using time to progression following prostatectomy as the relevant clinical endpoint, we found that ensemble tree classifiers robustly and reproducibly identified three subgroups of patients in the two clinical datasets: non-progressors, early progressors and late progressors. Moreover, the consensus classifications were independent predictors of time to progression compared to known clinical prognostic factors. Contact: dmercola@uci.edu PMID:18628288
Decision tree modeling using R.
Zhang, Zhongheng
2016-08-01
In machine learning, decision tree learners are powerful and easy to interpret. They employ a recursive binary partitioning algorithm that splits the sample on the partitioning variable with the strongest association with the response variable. The process continues until some stopping criteria are met. In the example I focus on the conditional inference tree, which incorporates tree-structured regression models into conditional inference procedures. Because a single tree is sensitive to small changes in the training data, the random forests procedure is introduced to address this problem. The sources of diversity for random forests come from random sampling and the restricted set of input variables available for selection. Finally, I introduce R functions to perform model-based recursive partitioning. This method incorporates recursive partitioning into conventional parametric model building.
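Recursive binary partitioning can be sketched in a few lines. Note the substitution: this sketch picks splits by minimising the within-node sum of squared errors (the CART-style criterion) on a single predictor, not by the permutation tests that conditional inference trees use, and it is written in Python rather than R. The function names and the depth-based stopping rule are assumptions for illustration.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, total SSE) of the best split, or (None, SSE) if none helps."""
    pairs = sorted(zip(x, y))
    best = (None, sse(y))
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold fits between identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        total = sse([b for _, b in pairs[:i]]) + sse([b for _, b in pairs[i:]])
        if total < best[1]:
            best = (threshold, total)
    return best

def grow(x, y, depth=0, max_depth=2):
    """Split recursively until max_depth is reached or no split reduces SSE."""
    threshold, _ = best_split(x, y)
    if depth >= max_depth or threshold is None:
        return {"value": sum(y) / len(y)}
    left = [(a, b) for a, b in zip(x, y) if a <= threshold]
    right = [(a, b) for a, b in zip(x, y) if a > threshold]
    return {
        "threshold": threshold,
        "left": grow([a for a, _ in left], [b for _, b in left], depth + 1, max_depth),
        "right": grow([a for a, _ in right], [b for _, b in right], depth + 1, max_depth),
    }

def predict(node, value):
    """Walk from the root to a leaf and return the leaf mean."""
    while "value" not in node:
        node = node["left"] if value <= node["threshold"] else node["right"]
    return node["value"]

tree = grow([1, 2, 3, 10, 11, 12], [1.0, 1.0, 1.0, 5.0, 5.0, 5.0])
print(tree["threshold"], predict(tree, 2), predict(tree, 11))
```

A random forest would repeat `grow` on bootstrap samples with a random subset of predictors considered at each split, then average the predictions.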
Jose F. Negron; Willis C. Schaupp; Kenneth E. Gibson; John Anhold; Dawn Hansen; Ralph Thier; Phil Mocettini
1999-01-01
Data collected from Douglas-fir stands infected by the Douglas-fir beetle in Wyoming, Montana, Idaho, and Utah, were used to develop models to estimate amount of mortality in terms of basal area killed. Models were built using stepwise linear regression and regression tree approaches. Linear regression models using initial Douglas-fir basal area were built for all...
Am I who I say I am? Unobtrusive self-representation and personality recognition on Facebook
Hall, Margeret; Caton, Simon
2017-01-01
Across social media platforms users (sub)consciously represent themselves in a way which is appropriate for their intended audience. This has unknown impacts on studies with unobtrusive designs based on digital (social) platforms, and studies of contemporary social phenomena in online settings. A lack of appropriate methods to identify, control for, and mitigate the effects of self-representation, the propensity to express socially responding characteristics or self-censorship in digital settings, hinders the ability of researchers to confidently interpret and generalize their findings. This article proposes applying boosted regression modelling to fill this research gap. A case study of paid Amazon Mechanical Turk workers (n = 509) is presented where workers completed psychometric surveys and provided anonymized access to their Facebook timelines. Our research finds indicators of self-representation on Facebook, facilitating suggestions for its mitigation. We validate the use of LIWC for Facebook personality studies, as well as find discrepancies with extant literature about the use of LIWC-only approaches in unobtrusive designs. Using survey data and LIWC sentiment categories as predictors, the boosted regression model classified the Five Factor personality model with an average accuracy of 74.6%. The contribution of this work is an accurate prediction of psychometric information based on short, informal text. PMID:28926569
Biomass expansion factor and root-to-shoot ratio for Pinus in Brazil.
Sanquetta, Carlos R; Corte, Ana Pd; da Silva, Fernando
2011-09-24
The Biomass Expansion Factor (BEF) and the Root-to-Shoot Ratio (R) are variables used to quantify carbon stock in forests. They are often considered as constant or species/area specific values in most studies. This study aimed to show the dependence of BEF and R on tree size and age and proposed equations to improve estimates of forest biomass and carbon stock. Data from 70 sample Pinus spp. trees grown in southern Brazil, in different diameter classes and ages, were used to demonstrate the correlation between BEF and R and forest inventory data, such as DBH, tree height and age. Total dry biomass, carbon stock and CO2 equivalent were simulated using the IPCC default values of BEF and R, the corresponding averages calculated from the data used in this study, and the values estimated by regression equations. The mean values of BEF and R calculated in this study were 1.47 and 0.17, respectively. BEF and R were inversely related to the tree measurement variables, with negative exponential behavior. Simulations indicated that use of fixed values of BEF and R, either IPCC default or current average data, may lead to unreliable estimates of carbon stock inventories and CDM projects. It was concluded that accounting for the variations in BEF and R and using regression equations to relate them to DBH, tree height and age, is fundamental in obtaining reliable estimates of forest tree biomass, carbon sink and CO2 equivalent.
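The role of BEF and R in a carbon estimate can be sketched as below, using the usual bookkeeping: aboveground biomass = stem biomass × BEF, belowground biomass = aboveground biomass × R, and CO2 equivalent = carbon × 44/12. The carbon fraction of 0.5, the coefficients of the negative exponential curve, and all function names are assumptions for illustration; only the mean values 1.47 and 0.17 come from the abstract.

```python
import math

CO2_PER_C = 44.0 / 12.0  # molecular mass of CO2 over atomic mass of C

def carbon_budget(stem_biomass_t, bef, root_shoot, carbon_fraction=0.5):
    """Expand stem dry biomass (t) to total biomass, carbon and CO2 equivalent."""
    aboveground = stem_biomass_t * bef
    belowground = aboveground * root_shoot
    total = aboveground + belowground
    carbon = total * carbon_fraction  # carbon_fraction is an assumed default
    return {"total_biomass": total, "carbon": carbon, "co2_eq": carbon * CO2_PER_C}

def bef_from_dbh(dbh_cm, floor=1.2, amplitude=1.0, rate=0.1):
    """Hypothetical negative exponential decline of BEF with tree diameter."""
    return floor + amplitude * math.exp(-rate * dbh_cm)

fixed = carbon_budget(100.0, bef=1.47, root_shoot=0.17)
sized = carbon_budget(100.0, bef=bef_from_dbh(30.0), root_shoot=0.17)
print(f"fixed BEF: {fixed['co2_eq']:.1f} t CO2e, size-dependent BEF: {sized['co2_eq']:.1f} t CO2e")
```

Comparing the two results for stands of different mean diameter shows why a single fixed BEF can misstate the carbon stock.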
Fire frequency in the Interior Columbia River Basin: Building regional models from fire history data
McKenzie, D.; Peterson, D.L.; Agee, James K.
2000-01-01
Fire frequency affects vegetation composition and successional pathways; thus it is essential to understand fire regimes in order to manage natural resources at broad spatial scales. Fire history data are lacking for many regions for which fire management decisions are being made, so models are needed to estimate past fire frequency where local data are not yet available. We developed multiple regression models and tree-based (classification and regression tree, or CART) models to predict fire return intervals across the interior Columbia River basin at 1-km resolution, using georeferenced fire history, potential vegetation, cover type, and precipitation databases. The models combined semiqualitative methods and rigorous statistics. The fire history data are of uneven quality; some estimates are based on only one tree, and many are not cross-dated. Therefore, we weighted the models based on data quality and performed a sensitivity analysis of the effects on the models of estimation errors that are due to lack of cross-dating. The regression models predict fire return intervals from 1 to 375 yr for forested areas, whereas the tree-based models predict a range of 8 to 150 yr. Both types of models predict latitudinal and elevational gradients of increasing fire return intervals. Examination of regional-scale output suggests that, although the tree-based models explain more of the variation in the original data, the regression models are less likely to produce extrapolation errors. Thus, the models serve complementary purposes in elucidating the relationships among fire frequency, the predictor variables, and spatial scale. The models can provide local managers with quantitative information and provide data to initialize coarse-scale fire-effects models, although predictions for individual sites should be treated with caution because of the varying quality and uneven spatial coverage of the fire history database. 
The models also demonstrate the integration of qualitative and quantitative methods when requisite data for fully quantitative models are unavailable. They can be tested by comparing new, independent fire history reconstructions against their predictions and can be continually updated, as better fire history data become available.
National scale biomass estimators for United States tree species
Jennifer C. Jenkins; David C. Chojnacky; Linda S. Heath; Richard A. Birdsey
2003-01-01
Estimates of national-scale forest carbon (C) stocks and fluxes are typically based on allometric regression equations developed using dimensional analysis techniques. However, the literature is inconsistent and incomplete with respect to large-scale forest C estimation. We compiled all available diameter-based allometric regression equations for estimating total...
A prediction model of short-term ionospheric foF2 Based on AdaBoost
NASA Astrophysics Data System (ADS)
Zhao, Xiukuan; Liu, Libo; Ning, Baiqi
Accurate specifications of spatial and temporal variations of the ionosphere during geomagnetic quiet and disturbed conditions are critical for applications, such as HF communications, satellite positioning and navigation, power grids, pipelines, etc. Therefore, developing empirical models to forecast the ionospheric perturbations is of high priority in real applications. The critical frequency of the F2 layer, foF2, is an important ionospheric parameter, especially for radio wave propagation applications. In this paper, the AdaBoost-BP algorithm is used to construct a new model to predict the critical frequency of the ionospheric F2-layer one hour ahead. Different indices were used to characterize ionospheric diurnal and seasonal variations and their dependence on solar and geomagnetic activity. These indices, together with the current observed foF2 value, were input into the prediction model and the foF2 value at one hour ahead was output. We analyzed twenty-two years’ foF2 data from nine ionosonde stations in the East-Asian sector in this work. The first eleven years’ data were used as a training dataset and the second eleven years’ data were used as a testing dataset. The results show that the performance of AdaBoost-BP is better than those of BP Neural Network (BPNN), Support Vector Regression (SVR) and the IRI model. For example, the AdaBoost-BP prediction absolute error of foF2 at Irkutsk station (a middle latitude station) is 0.32 MHz, which is better than 0.34 MHz from BPNN and 0.35 MHz from SVR, and also significantly outperforms the IRI model, whose absolute error is 0.64 MHz. Meanwhile, the AdaBoost-BP prediction absolute error at Taipei station (a low latitude station) is 0.78 MHz, which is better than 0.81 MHz from BPNN, 0.81 MHz from SVR and 1.37 MHz from the IRI model. Finally, the variation of the AdaBoost-BP prediction error with season, solar activity and latitude was also discussed in the paper.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bantema-Joppe, Enja J.; Schilstra, Cornelis; Bock, Geertruida H. de
Purpose: To evaluate toxicity and cosmetic outcome (CO) in breast cancer survivors treated with three-dimensional conformal radiotherapy with a hypofractionated, simultaneous integrated boost (3D-CRT-SIB) and to identify risk factors for toxicity, with special focus on the impact of age. Methods and Materials: Included were 940 consecutive disease-free patients treated for breast cancer (Stage 0-III) with 3D-CRT-SIB, after breast-conserving surgery, from 2005 to 2010. Physician-rated toxicity (Common Terminology Criteria for Adverse Events version 3.0) and CO were prospectively assessed during yearly follow-up, up to 5 years after radiotherapy. Multivariate logistic regression analyses using a bootstrapping method were performed. Results: At 3 years, toxicity scores of 436 patients were available. Grade ≥2 fibrosis in the boost area was observed in 8.5%, non-boost fibrosis in 49.4%, pain to the chest wall in 6.7%, and fair/poor CO in 39.7% of cases. Radiotherapy before chemotherapy was significantly associated with grade ≥2 boost fibrosis at 3 years (odds ratio [OR] 2.8, 95% confidence interval [CI] 1.3-6.0). Non-boost fibrosis was associated with re-resection (OR 2.2, 95% CI 1.2-4.0) and larger tumors (OR 1.1, 95% CI 1.0-1.1). At 1 year, chest wall pain was significantly associated with high boost dosage (OR 2.1, 95% CI 1.2-3.7) and younger age (OR 0.4, 95% CI 0.2-0.7). A fair/poor CO was observed more often after re-resection (OR 4.5, 95% CI 2.4-8.5), after regional radiotherapy (OR 2.9, 95% CI 1.2-7.1), and in larger tumors (OR 1.1, 95% CI 1.0-1.1). Conclusions: Toxicity and CO are not impaired after 3D-CRT-SIB. Fibrosis was not significantly associated with radiotherapy parameters. Independent risk factors for fibrosis were chemotherapy after radiotherapy, re-resection, and larger tumor size. Re-resection was most predictive for worse CO. Age had an impact on chest wall pain occurrence.
Spatial modeling and classification of corneal shape.
Marsolo, Keith; Twa, Michael; Bullimore, Mark A; Parthasarathy, Srinivasan
2007-03-01
One of the most promising applications of data mining is in biomedical data used in patient diagnosis. Any method of data analysis intended to support the clinical decision-making process should meet several criteria: it should capture clinically relevant features, be computationally feasible, and provide easily interpretable results. In an initial study, we examined the feasibility of using Zernike polynomials to represent biomedical instrument data in conjunction with a decision tree classifier to distinguish between the diseased and non-diseased eyes. Here, we provide a comprehensive follow-up to that work, examining a second representation, pseudo-Zernike polynomials, to determine whether they provide any increase in classification accuracy. We compare the fidelity of both methods using residual root-mean-square (rms) error and evaluate accuracy using several classifiers: neural networks, C4.5 decision trees, Voting Feature Intervals, and Naïve Bayes. We also examine the effect of several meta-learning strategies: boosting, bagging, and Random Forests (RFs). We present results comparing accuracy as it relates to dataset and transformation resolution over a larger, more challenging, multi-class dataset. They show that classification accuracy is similar for both data transformations, but differs by classifier. We find that the Zernike polynomials provide better feature representation than the pseudo-Zernikes and that the decision trees yield the best balance of classification accuracy and interpretability.
Lung tumor diagnosis and subtype discovery by gene expression profiling.
Wang, Lu-yong; Tu, Zhuowen
2006-01-01
The optimal treatment of patients with complex diseases, such as cancers, depends on accurate diagnosis using a combination of clinical and histopathological data. In many scenarios, this becomes tremendously difficult because of the limitations in clinical presentation and histopathology. To accurately diagnose complex diseases, molecular classification based on gene or protein expression profiles is indispensable for modern medicine. Moreover, many heterogeneous diseases consist of various potential subtypes on a molecular basis and differ remarkably in their response to therapies. It is critical to accurately predict subgroups from disease gene expression profiles. More fundamental knowledge of the molecular basis and classification of disease could aid in the prediction of patient outcome, the informed selection of therapies, and the identification of novel molecular targets for therapy. In this paper, we propose a new disease diagnostic method, the probabilistic boosting tree (PB tree) method, applied to gene expression profiles of lung tumors. It enables accurate disease classification and subtype discovery in disease. It automatically constructs a tree in which each node combines a number of weak classifiers into a strong classifier. Also, subtype discovery is naturally embedded in the learning process. Our algorithm achieves excellent diagnostic performance, and meanwhile it is capable of detecting the disease subtype based on the gene expression profile.
Du, Ning; Fan, Jintu; Chen, Shuo; Liu, Yang
2008-07-21
Although recent investigations [Ryan, M.G., Yoder, B.J., 1997. Hydraulic limits to tree height and tree growth. Bioscience 47, 235-242; Koch, G.W., Sillett, S.C.,Jennings, G.M.,Davis, S.D., 2004. The limits to tree height. Nature 428, 851-854; Niklas, K.J., Spatz, H., 2004. Growth and hydraulic (not mechanical) constraints govern the scaling of tree height and mass. Proc. Natl Acad. Sci. 101, 15661-15663; Ryan, M.G., Phillips, N., Bond, B.J., 2006. Hydraulic limitation hypothesis revisited. Plant Cell Environ. 29, 367-381; Niklas, K.J., 2007. Maximum plant height and the biophysical factors that limit it. Tree Physiol. 27, 433-440; Burgess, S.S.O., Dawson, T.E., 2007. Predicting the limits to tree height using statistical regressions of leaf traits. New Phytol. 174, 626-636] suggested that the hydraulic limitation hypothesis (HLH) is the most plausible theory to explain the biophysical limits to maximum tree height and the decline in tree growth rate with age, the analysis is largely qualitative or based on statistical regression. Here we present an integrated biophysical model based on the principle that trees develop physiological compensations (e.g. the declined leaf water potential and the tapering of conduits with heights [West, G.B., Brown, J.H., Enquist, B.J., 1999. A general model for the structure and allometry of plant vascular systems. Nature 400, 664-667]) to resist the increasing water stress with height, the classical HLH and the biochemical limitations on photosynthesis [von Caemmerer, S., 2000. Biochemical Models of Leaf Photosynthesis. CSIRO Publishing, Australia]. The model has been applied to the tallest trees in the world (viz. Coast redwood (Sequoia sempervirens)). Xylem water potential, leaf carbon isotope composition, leaf mass to area ratio at different heights derived from the model show good agreements with the experimental measurements of Koch et al. [2004. The limits to tree height. Nature 428, 851-854]. 
The model also well explains the universal trend of declining growth rate with age.
Estimating tree species diversity in the savannah using NDVI and woody canopy cover
NASA Astrophysics Data System (ADS)
Madonsela, Sabelo; Cho, Moses Azong; Ramoelo, Abel; Mutanga, Onisimo; Naidoo, Laven
2018-04-01
Remote sensing applications in biodiversity research often rely on the establishment of relationships between spectral information from the image and tree species diversity measured in the field. Most studies have used normalized difference vegetation index (NDVI) to estimate tree species diversity on the basis that it is sensitive to primary productivity which defines spatial variation in plant diversity. The NDVI signal is influenced by photosynthetically active vegetation which, in the savannah, includes woody canopy foliage and grasses. The question is whether the relationship between NDVI and tree species diversity in the savanna depends on the woody cover percentage. This study explored the relationship between woody canopy cover (WCC) and tree species diversity in the savannah woodland of southern Africa and also investigated whether there is a significant interaction between seasonal NDVI and WCC in the factorial model when estimating tree species diversity. To fulfil our aim, we followed stratified random sampling approach and surveyed tree species in 68 plots of 90 m × 90 m across the study area. Within each plot, all trees with diameter at breast height of >10 cm were sampled and Shannon index - a common measure of species diversity which considers both species richness and abundance - was used to quantify tree species diversity. We then extracted WCC in each plot from existing fractional woody cover product produced from Synthetic Aperture Radar (SAR) data. Factorial regression model was used to determine the interaction effect between NDVI and WCC when estimating tree species diversity. 
Results from regression analysis showed that (i) WCC has a highly significant relationship with tree species diversity (r2 = 0.21; p < 0.01), (ii) the interaction between the NDVI and WCC is not significant, however, the factorial model significantly reduced the error of prediction (RMSE = 0.47, p < 0.05) compared to NDVI (RMSE = 0.49) or WCC (RMSE = 0.49) model during the senescence period. The result justifies our assertion that combining NDVI with WCC will be optimal for biodiversity estimation during the senescence period.
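The Shannon index used above to quantify tree species diversity is H' = -Σ p_i ln(p_i), where p_i is the proportion of individuals belonging to species i. A minimal sketch follows; the plot tally is invented and the function name is hypothetical.

```python
import math
from collections import Counter

def shannon_index(species_labels):
    """Shannon diversity H' from a list with one species label per sampled tree."""
    counts = Counter(species_labels)
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Hypothetical tally for one 90 m x 90 m plot.
plot = ["Acacia"] * 6 + ["Combretum"] * 3 + ["Sclerocarya"] * 1
print(f"H' = {shannon_index(plot):.3f}")
```

The index rises both when more species are present and when individuals are spread more evenly among them, which is why it reflects richness and abundance together.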
Carlson Mazur, Martha L.; Kowalski, Kurt P.; Galbraith, David
2014-01-01
In the Laurentian Great Lakes, the invasive form of Phragmites australis (common reed) poses a threat to highly productive coastal wetlands and shorelines by forming impenetrable stands that outcompete native plants. Large, dominant stands can derail efforts to restore wetland ecosystems degraded by other stressors. To be proactive, landscape-level management of Phragmites requires information on the current spatial distribution of the species and a characterization of areas suitable for future colonization. Using a recent basin-scale map of this invasive plant’s distribution in the U.S. coastal zone of the Great Lakes, environmental data (e.g., soils, nutrients, disturbance, climate, topography), and climate predictions, we performed analyses of current and predicted suitable coastal habitat using boosted regression trees, a type of species distribution modeling. We also investigated differential influences of environmental variables in the upper lakes (Lakes Superior, Michigan, and Huron) and lower lakes (Lakes St. Clair, Erie, and Ontario). Basin-wide results showed that the coastal areas most vulnerable to Phragmites expansion were in close proximity to developed lands and had minimal topographic relief, poorly drained soils, and dense road networks. Elevated nutrients and proximity to agriculture also influenced the distribution of Phragmites. Climate predictions indicated an increase in suitable habitat in coastal Lakes Huron and Michigan in particular. The results of this study, combined with a publicly available online decision support tool, will enable resource managers and restoration practitioners to target and prioritize Phragmites control efforts in the Great Lakes coastal zone.
NASA Astrophysics Data System (ADS)
Nieto, Karen; Xu, Yi; Teo, Steven L. H.; McClatchie, Sam; Holmes, John
2017-01-01
We used satellite sea surface temperature (SST) data to characterize coastal fronts and then tested the effects of the fronts and other environmental variables on the distribution of the albacore tuna (Thunnus alalunga) catches in the coastal areas (from the coast to 200 nm offshore) of the Northeast Pacific Ocean. A boosted regression tree (BRT) model was used to explain the spatial and temporal patterns in albacore tuna catch per unit effort (CPUE) (1988-2011), using frontal features (distance to the front and temperature gradient), and other environmental variables like SST, surface chlorophyll concentration (chlorophyll), and geostrophic currents as explanatory variables. Based on over two decades of high-resolution data, the modeled results confirmed previous findings that albacore CPUE distribution is strongly influenced by SST and chlorophyll at fishing locations, and the distance of fronts from the coast (DFRONT-COAST), albeit with substantial seasonal and interannual variation. Albacore CPUEs were higher near warm, low chlorophyll oceanic waters, and near SST fronts. We performed sequential leave-one-year-out cross-validations for all years and found that the relationships in the BRT models were robust for the entire study period. Spatial distributions of model-predicted albacore CPUE were similar to observations, but the model was unable to predict very high CPUEs in some areas. These results help to explain previously observed variability in albacore CPUE and will likely help improve international fisheries management in the face of environmental changes.
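The leave-one-year-out cross-validation described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the helper name, the toy CPUE values, and the use of a training-set mean in place of a fitted BRT model are all assumptions.

```python
import math


def leave_one_year_out(years):
    """Yield (held_out_year, train_idx, test_idx) for each distinct year."""
    for held_out in sorted(set(years)):
        train_idx = [i for i, y in enumerate(years) if y != held_out]
        test_idx = [i for i, y in enumerate(years) if y == held_out]
        yield held_out, train_idx, test_idx


# Hypothetical records: the year of each observation and its CPUE value.
years = [1988, 1988, 1989, 1989, 1990]
cpue = [2.0, 4.0, 3.0, 5.0, 6.0]

for year, train_idx, test_idx in leave_one_year_out(years):
    # Placeholder "model": predict the training mean for every held-out record.
    train_mean = sum(cpue[i] for i in train_idx) / len(train_idx)
    rmse = math.sqrt(
        sum((cpue[i] - train_mean) ** 2 for i in test_idx) / len(test_idx)
    )
    print(year, round(rmse, 3))
```

Holding out whole years, rather than random records, tests whether relationships learned from some years transfer to a year the model has never seen.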
Risk Distribution of Human Infections with Avian Influenza H7N9 and H5N1 virus in China
Li, Xin-Lou; Yang, Yang; Sun, Ye; Chen, Wan-Jun; Sun, Ruo-Xi; Liu, Kun; Ma, Mai-Juan; Liang, Song; Yao, Hong-Wu; Gray, Gregory C.; Fang, Li-Qun; Cao, Wu-Chun
2015-01-01
It has been documented that the epidemiological characteristics of human infections with H7N9 differ significantly from those of H5N1. However, potential factors that may explain the different spatial distributions remain unexplored. We use boosted regression tree (BRT) models to explore the association of agro-ecological, environmental and meteorological variables with the occurrence of human cases of H7N9 and H5N1, and map the probabilities of occurrence of human cases. Live poultry markets, human population density, coverage of built-up land, relative humidity and precipitation were significant predictors for both. In addition, density of poultry, coverage of shrub and temperature played important roles for human H7N9 infection, whereas human H5N1 infection was associated with coverage of forest and water body. Based on the risks and distribution of ecological characteristics which may facilitate the circulation of the two viruses, we found Yangtze River Delta and Pearl River Delta, along with a few spots on the southeast coastline, to be the high risk areas for H7N9 and H5N1. Additional H5N1 risk spots were identified in eastern Sichuan and southern Yunnan Provinces. Surveillance of the two viruses needs to be enhanced in these high risk areas to reduce the risk of future epidemics of avian influenza in China. PMID:26691585
Multistressor predictive models of invertebrate condition in the Corn Belt, USA
Waite, Ian R.; Van Metre, Peter C.
2017-01-01
Understanding the complex relations between multiple environmental stressors and ecological conditions in streams can help guide resource-management decisions. During 14 weeks in spring/summer 2013, personnel from the US Geological Survey and the US Environmental Protection Agency sampled 98 wadeable streams across the Midwest Corn Belt region of the USA for water and sediment quality, physical and habitat characteristics, and ecological communities. We used these data to develop independent predictive disturbance models for 3 macroinvertebrate metrics and a multimetric index. We developed the models based on boosted regression trees (BRT) for 3 stressor categories, land use/land cover (geographic information system [GIS]), all in-stream stressors combined (nutrients, habitat, and contaminants), and for GIS plus in-stream stressors. The GIS plus in-stream stressor models had the best overall performance with an average cross-validation R2 across all models of 0.41. The models were generally consistent in the explanatory variables selected within each stressor group across the 4 invertebrate metrics modeled. Variables related to riparian condition, substrate size or embeddedness, velocity and channel shape, nutrients (primarily NH3), and contaminants (pyrethroid degradates) were important descriptors of the invertebrate metrics. Models based on all measured in-stream stressors performed comparably to models based on GIS landscape variables, suggesting that the in-stream stressor characterization reasonably represents the dominant factors affecting invertebrate communities and that GIS variables are acting as surrogates for in-stream stressors that directly affect in-stream biota.
Modelling spatial patterns of urban growth in Africa
Linard, Catherine; Tatem, Andrew J.; Gilbert, Marius
2013-01-01
The population of Africa is predicted to double over the next 40 years, driving exceptionally high urban expansion rates that will induce significant socio-economic, environmental and health changes. In order to prepare for these changes, it is important to better understand urban growth dynamics in Africa and better predict the spatial pattern of rural-urban conversions. Previous work on urban expansion has been carried out at the city level or at the global level with a relatively coarse 5–10 km resolution. The main objective of the present paper was to develop a modelling approach at an intermediate scale in order to identify factors that influence spatial patterns of urban expansion in Africa. Boosted Regression Tree models were developed to predict the spatial pattern of rural-urban conversions in every large African city. Urban change data between circa 1990 and circa 2000 available for 20 large cities across Africa were used as training data. Results showed that the urban land in a 1 km neighbourhood and the accessibility to the city centre were the most influential variables. Results obtained were generally more accurate than results obtained using a distance-based urban expansion model and showed that the spatial patterns of small, compact and fast growing cities were easier to simulate than those of cities with lower population densities and a lower growth rate. The simulation method developed here will allow the production of spatially detailed urban expansion forecasts for 2020 and 2025 for Africa, data that are increasingly required by global change modellers. PMID:25152552
Harris, Ted D.; Graham, Jennifer L.
2017-01-01
Cyanobacterial blooms degrade water quality in drinking water supply reservoirs by producing toxic and taste-and-odor causing secondary metabolites, which ultimately cause public health concerns and lead to increased treatment costs for water utilities. There have been numerous attempts to create models that predict cyanobacteria and their secondary metabolites, most using linear models; however, linear models are limited by assumptions about the data and have had limited success as predictive tools. Thus, lake and reservoir managers need improved modeling techniques that can accurately predict large bloom events that have the highest impact on recreational activities and drinking-water treatment processes. In this study, we compared 12 unique linear and nonlinear regression modeling techniques to predict cyanobacterial abundance and the cyanobacterial secondary metabolites microcystin and geosmin using 14 years of physiochemical water quality data collected from Cheney Reservoir, Kansas. Support vector machine (SVM), random forest (RF), boosted tree (BT), and Cubist modeling techniques were the most predictive of the compared modeling approaches. SVM, RF, and BT modeling techniques were able to successfully predict cyanobacterial abundance, microcystin, and geosmin concentrations <60,000 cells/mL, 2.5 µg/L, and 20 ng/L, respectively. Only Cubist modeling predicted maxima concentrations of cyanobacteria and geosmin; no modeling technique was able to predict maxima microcystin concentrations. Because maxima concentrations are a primary concern for lake and reservoir managers, Cubist modeling may help predict the largest and most noxious concentrations of cyanobacteria and their secondary metabolites.
NASA Astrophysics Data System (ADS)
Zeng, Zhenzhong; Chen, Anping; Piao, Shilong; Rabin, Sam; Shen, Zehao
2014-07-01
The distributions of tropical ecosystems are rapidly being altered by climate change and anthropogenic activities. One possible trend—the loss of tropical forests and replacement by savannas—could result in significant shifts in ecosystem services and biodiversity loss. However, the influence and the relative importance of environmental factors in regulating the distribution of tropical forest and savanna biomes are still poorly understood, which makes it difficult to predict future tropical forest and savanna distributions in the context of climate change. Here we use boosted regression trees to quantitatively evaluate the importance of environmental predictors—mainly climatic, edaphic, and fire factors—for the tropical forest-savanna distribution at a mesoscale across the tropics (between 15°N and 35°S). Our results demonstrate that climate alone can explain most of the distribution of tropical forest and savanna at the scale considered; dry season average precipitation is the single most important determinant across tropical Asia-Australia, Africa, and South America. Given the strong tendency of increased seasonality and decreased dry season precipitation predicted by global climate models, we estimate that about 28% of what is now tropical forest would likely be lost to savanna by the late 21st century under the future scenario considered. This study highlights the importance of climate seasonality and interannual variability in predicting the distribution of tropical forest and savanna, supporting the climate as the primary driver in the savanna biogeography.
Climatic and Landscape Influences on Fire Regimes from 1984 to 2010 in the Western United States
Liu, Zhihua; Wimberly, Michael C.
2015-01-01
An improved understanding of the relative influences of climatic and landscape controls on multiple fire regime components is needed to enhance our understanding of modern fire regimes and how they will respond to future environmental change. To address this need, we analyzed the spatio-temporal patterns of fire occurrence, size, and severity of large fires (> 405 ha) in the western United States from 1984–2010. We assessed the associations of these fire regime components with environmental variables, including short-term climate anomalies, vegetation type, topography, and human influences, using boosted regression tree analysis. Results showed that large fire occurrence, size, and severity each exhibited distinctive spatial and spatio-temporal patterns, which were controlled by different sets of climate and landscape factors. Antecedent climate anomalies had the strongest influences on fire occurrence, resulting in the highest spatial synchrony. In contrast, climatic variability had weaker influences on fire size and severity and vegetation types were the most important environmental determinants of these fire regime components. Topography had moderately strong effects on both fire occurrence and severity, and human influence variables were most strongly associated with fire size. These results suggest a potential for the emergence of novel fire regimes due to the responses of fire regime components to multiple drivers at different spatial and temporal scales. Next-generation approaches for projecting future fire regimes should incorporate indirect climate effects on vegetation type changes as well as other landscape effects on multiple components of fire regimes. PMID:26465959
Kotta, Jonne; Oganjan, Katarina; Lauringson, Velda; Pärnoja, Merli; Kaasik, Ants; Rohtla, Liisa; Kotta, Ilmar; Orav-Kotta, Helen
2015-01-01
Benthic suspension feeding mussels are an important functional guild in coastal and estuarine ecosystems. To date we lack information on how various environmental gradients and biotic interactions separately and interactively shape the distribution patterns of mussels in non-tidal environments. In contrast to tidal environments, mussels inhabit solely the subtidal zone in non-tidal waterbodies and, therefore, the driving factors for mussel populations are expected to differ from those in tidal areas. In the present study, we used boosted regression tree (BRT) modelling, an ensemble method combining statistical techniques and machine learning, to explain the distribution and biomass of the suspension feeding mussel Mytilus trossulus in the non-tidal Baltic Sea. BRT models suggested that (1) distribution patterns of M. trossulus are largely driven by separate effects of direct environmental gradients and partly by interactive effects of resource gradients with direct environmental gradients. (2) Within its suitable habitat range, however, resource gradients had an important role in shaping the biomass distribution of M. trossulus. (3) Contrary to tidal areas, mussels were not competitively superior to macrophytes, with patterns indicating either facilitative interactions between mussels and macrophytes or co-variance due to a common stressor. To conclude, direct environmental gradients seem to define the distribution pattern of M. trossulus, and within the favourable distribution range, resource gradients in interaction with direct environmental gradients are expected to set the biomass level of mussels. PMID:26317668
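Since several of these records rely on boosted regression trees, a minimal sketch of the underlying idea may help. It is a simplified illustration under stated assumptions: a single predictor, squared-error loss, depth-one trees (stumps), and hypothetical function names and toy data. It is not the implementation used in any of the cited studies, which typically rely on dedicated packages.

```python
def fit_stump(x, residuals):
    """Return (threshold, left_mean, right_mean) minimizing squared error."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2.0
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all x values identical: fall back to a constant
        mean = sum(residuals) / len(residuals)
        return values[0], mean, mean
    return best[1], best[2], best[3]


def fit_brt(x, y, n_trees=200, learning_rate=0.1):
    """Fit stumps sequentially, each one to the current residuals."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        threshold, left_mean, right_mean = fit_stump(x, residuals)
        stumps.append((threshold, left_mean, right_mean))
        # Shrink each stump's contribution by the learning rate.
        predictions = [
            p + learning_rate * (left_mean if xi <= threshold else right_mean)
            for xi, p in zip(x, predictions)
        ]
    return base, learning_rate, stumps


def predict_brt(model, xi):
    """Sum the shrunken stump outputs on top of the base prediction."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(
        left if xi <= t else right for t, left, right in stumps
    )


# Toy nonlinear response (hypothetical data).
x = [float(i) for i in range(10)]
y = [xi ** 2 for xi in x]
model = fit_brt(x, y)
print([round(predict_brt(model, xi), 1) for xi in x])
```

Each stump is a weak learner; adding many of them with a small learning rate lets the ensemble approximate nonlinear responses without specifying their form in advance, which is the property these studies exploit.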
Xu, Chi; Holmgren, Milena; Van Nes, Egbert H; Hirota, Marina; Chapin, F Stuart; Scheffer, Marten
2015-01-01
Publicly available remote sensing products have boosted science in many ways. The openness of these data sources suggests high reproducibility. However, as we show here, results may be specific to versions of the data products that can become unavailable as new versions are posted. We focus on remotely-sensed tree cover. Recent studies have used this public resource to detect multi-modality in tree cover in the tropical and boreal biomes. Such patterns suggest alternative stable states separated by critical tipping points. This has important implications for the potential response of these ecosystems to global climate change. For the boreal region, four distinct ecosystem states (i.e., treeless, sparse and dense woodland, and boreal forest) were previously identified by using the Collection 3 data of MODIS Vegetation Continuous Fields (VCF). Since then, the MODIS VCF product has been updated to Collection 5; and a Landsat VCF product of global tree cover at a fine spatial resolution of 30 meters has been developed. Here we compare these different remote-sensing products of tree cover to show that identification of alternative stable states in the boreal biome partly depends on the data source used. The updated MODIS data and the newer Landsat data consistently demonstrate three distinct modes around similar tree-cover values. Our analysis suggests that the boreal region has three modes: one sparsely vegetated state (treeless), one distinct 'savanna-like' state and one forest state, which could be alternative stable states. Our analysis illustrates that qualitative outcomes of studies may change fundamentally as new versions of remote sensing products are used. Scientific reproducibility thus requires that old versions remain publicly available.
Hayes, Mark A.; Ozenberger, Katharine; Cryan, Paul M.; Wunder, Michael B.
2015-01-01
Bat specimens held in natural history museum collections can provide insights into the distribution of species. However, there are several important sources of spatial error associated with natural history specimens that may influence the analysis and mapping of bat species distributions. We analyzed the importance of geographic referencing and error correction in species distribution modeling (SDM) using occurrence records of hoary bats (Lasiurus cinereus). This species is known to migrate long distances and is a species of increasing concern due to fatalities documented at wind energy facilities in North America. We used 3,215 museum occurrence records collected from 1950–2000 for hoary bats in North America. We compared SDM performance using five approaches: generalized linear models, multivariate adaptive regression splines, boosted regression trees, random forest, and maximum entropy models. We evaluated results using three SDM performance metrics (AUC, sensitivity, and specificity) and two data sets: one comprised of the original occurrence data, and a second data set consisting of these same records after the locations were adjusted to correct for identifiable spatial errors. The increase in precision improved the mean estimated spatial error associated with hoary bat records from 5.11 km to 1.58 km, and this reduction in error resulted in a slight increase in all three SDM performance metrics. These results provide insights into the importance of geographic referencing and the value of correcting spatial errors in modeling the distribution of a wide-ranging bat species. We conclude that the considerable time and effort invested in carefully increasing the precision of the occurrence locations in this data set was not worth the marginal gains in improved SDM performance, and it seems likely that gains would be similar for other bat species that range across large areas of the continent, migrate, and are habitat generalists.
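The three performance metrics named in this abstract (AUC, sensitivity, and specificity) can be computed as in the sketch below. The function names, the 0.5 threshold, and the example scores are assumptions for illustration; the abstract does not say how the authors thresholded their predictions.

```python
def sensitivity_specificity(labels, scores, threshold=0.5):
    """Return (sensitivity, specificity) for 0/1 labels at a score threshold."""
    tp = sum(1 for l, s in zip(labels, scores) if l == 1 and s >= threshold)
    fn = sum(1 for l, s in zip(labels, scores) if l == 1 and s < threshold)
    tn = sum(1 for l, s in zip(labels, scores) if l == 0 and s < threshold)
    fp = sum(1 for l, s in zip(labels, scores) if l == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auc(labels, scores):
    """Probability that a random presence outscores a random absence.

    Ties count as half. This pairwise form is O(n_pos * n_neg), which is
    fine for a small illustration.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical presences (1) and absences (0) with predicted suitability.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
print(auc(labels, scores), sensitivity_specificity(labels, scores))
```

AUC is threshold-free, whereas sensitivity and specificity depend on the cut-off chosen, so the three metrics can move differently when occurrence locations are corrected.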
Iannella, Mattia; Cerasoli, Francesco; D'Alessandro, Paola; Console, Giulia; Biondi, Maurizio
2018-01-01
The pond turtle Emys trinacris is an endangered endemic species of Sicily showing a fragmented distribution throughout the main island. In this study, we applied "Ensemble Niche Modelling", combining more classical statistical techniques such as Generalized Linear Models and Multivariate Adaptive Regression Splines with machine-learning approaches such as Boosted Regression Trees and Maxent, to model the potential distribution of the species under current and future climatic conditions. Moreover, a "gap analysis" performed on both the species' presence sites and the predictions from the Ensemble Models is proposed to integrate outputs from these models, in order to assess the conservation status of this threatened species in the context of biodiversity management. To this end, four "Representative Concentration Pathways", corresponding to different greenhouse gas emission trajectories, were considered to project the obtained models to both 2050 and 2070. Areas lost, gained or remaining stable for the target species in the projected models were calculated. The potential distribution of E. trinacris was significantly dependent upon precipitation-linked variables, mainly precipitation of the wettest and coldest quarters. Future negative effects for the conservation of this species, because of more unstable precipitation patterns and extreme meteorological events, emerged from our analyses. Further, more than half of the sites currently inhabited by E. trinacris lie outside the Protected Areas network, highlighting an inadequate management of the species by the authorities responsible for its protection. Our results, therefore, suggest that in the near future the Sicilian pond turtle will need the utmost attention from the scientific community to avoid the imminent risk of extinction.
Finally, the gap analysis performed in a GIS environment proved to be a very informative post-modeling technique, potentially applicable to the management of species at risk and to Protected Areas' planning in many contexts.
Using Historical Atlas Data to Develop High-Resolution Distribution Models of Freshwater Fishes
Huang, Jian; Frimpong, Emmanuel A.
2015-01-01
Understanding the spatial pattern of species distributions is fundamental in biogeography, and conservation and resource management applications. Most species distribution models (SDMs) require or prefer species presence and absence data for adequate estimation of model parameters. However, observations with unreliable or unreported species absences dominate and limit the implementation of SDMs. Presence-only models generally yield less accurate predictions of species distribution, and make it difficult to incorporate spatial autocorrelation. The availability of large amounts of historical presence records for freshwater fishes of the United States provides an opportunity for deriving reliable absences from data reported as presence-only, when sampling was predominantly community-based. In this study, we used boosted regression trees (BRT), logistic regression, and MaxEnt models to assess the performance of a historical metacommunity database with inferred absences, for modeling fish distributions, investigating the effect of model choice and data properties thereby. With models of the distribution of 76 native, non-game fish species of varied traits and rarity attributes in four river basins across the United States, we show that model accuracy depends on data quality (e.g., sample size, location precision), species’ rarity, statistical modeling technique, and consideration of spatial autocorrelation. The cross-validation area under the receiver-operating-characteristic curve (AUC) tended to be high in the spatial presence-absence models at the highest level of resolution for species with large geographic ranges and small local populations. Prevalence affected training but not validation AUC. The key habitat predictors identified and the fish-habitat relationships evaluated through partial dependence plots corroborated most previous studies. 
The community-based SDM framework broadens our capability to model species distributions by innovatively removing the constraint of lack of species absence data, thus providing a robust prediction of distribution for stream fishes in other regions where historical data exist, and for other taxa (e.g., benthic macroinvertebrates, birds) usually observed by community-based sampling designs. PMID:26075902
Environmental Controls on Above-Ground Biomass in the Taita Hills, Kenya
NASA Astrophysics Data System (ADS)
Adhikari, H.; Heiskanen, J.; Siljander, M.; Maeda, E. E.; Heikinheimo, V.; Pellikka, P.
2016-12-01
Tropical forests are globally significant ecosystems which maintain high biodiversity and provide valuable ecosystem services, including carbon sinks and climate change mitigation and adaptation. These ecosystems have been severely degraded for decades. However, the magnitude and spatial patterns of the above-ground biomass (AGB) in tropical forest-agriculture landscapes are highly variable, even under the same climatic conditions and land use. This work aims 1) to generate a wall-to-wall map of AGB density for the Taita Hills in Kenya based on field measurements and airborne laser scanning (ALS) and 2) to examine environmental controls on AGB using geospatial data sets on topography, soils, climate and land use, and statistical modelling. The study area (67000 ha) is located in the northernmost part of the Eastern Arc Mountains of Kenya and Tanzania, and the highest hilltops reach over 2200 m in elevation. Most of the forest area has been cleared for croplands and agroforestry, and the hills are surrounded by semi-arid scrublands and dry savannah at an elevation of 600-900 m a.s.l. As a result, the current land cover is a mosaic of various types of land cover and land use. The field measurements were carried out in a total of 216 plots in 2013-2015 for AGB computations, and ALS flights were conducted in 2014-2015. An AGB map at 30 m x 30 m resolution was produced using multiple linear regression based on ALS variables derived from the point cloud, namely canopy cover and 25th percentile height of ALS returns (R2 = 0.88). Boosted regression trees (BRT) were used for examining the relationship between AGB and explanatory variables, which were derived from an ALS-based high resolution DEM (2 m resolution), a soil database, downscaled climate data and land cover/use maps based on satellite image analysis. The results of these analyses will be presented at the conference.
Fabian C.C. Uzoh; Martin W. Ritchie
1996-01-01
The equations presented predict crown area for 13 species of trees and shrubs which may be found growing in competition with commercial conifers during early stages of stand development. The equations express crown area as a function of basal area and height. Parameters were estimated for each species individually using weighted nonlinear least square regression.
Biomass equations for major tree species of the Northeast
Louise M. Tritton; James W. Hornbeck
1982-01-01
Regression equations are used in both forestry and ecosystem studies to estimate tree biomass from field measurements of dbh (diameter at breast height) or a combination of dbh and height. Literature on biomass is reviewed, and 178 sets of published equations for 25 species common to the Northeastern United States are listed. On the basis of these equations, estimates of...
Stand basal-area and tree-diameter growth in red spruce-fir forests in Maine, 1960-80
S.J. Zarnoch; D.A. Gansner; D.S. Powell; T.A. Birch
1990-01-01
Stand basal-area change and individual surviving red spruce d.b.h. growth from 1960 to 1980 were analyzed for red spruce-fir stands in Maine. Regression modeling was used to relate these measures of growth to stand and tree conditions and to compare growth throughout the period. Results indicate a decline in growth.
Examination of the Arborsonic Decay Detector for Detecting Bacterial Wetwood in Red Oaks
Zicai Xu; Theodor D. Leininger; James G. Williams; Frank H. Tainter
2000-01-01
The Arborsonic Decay Detector (ADD; Fujikura Europe Limited, Wiltshire, England) was used to measure the time it took an ultrasound wave to cross 280 diameters in red oak trees with varying degrees of bacterial wetwood or heartwood decay. Linear regressions derived from the ADD readings of trees in Mississippi and South Carolina with wetwood and heartwood decay...
NASA Astrophysics Data System (ADS)
Kisi, Ozgur; Parmar, Kulwinder Singh
2016-03-01
This study investigates the accuracy of least square support vector machine (LSSVM), multivariate adaptive regression splines (MARS) and M5 model tree (M5Tree) in modeling river water pollution. Various combinations of water quality parameters, Free Ammonia (AMM), Total Kjeldahl Nitrogen (TKN), Water Temperature (WT), Total Coliform (TC), Fecal Coliform (FC) and Potential of Hydrogen (pH) monitored at Nizamuddin, Delhi Yamuna River in India were used as inputs to the applied models. Results indicated that the LSSVM and MARS models had almost the same accuracy and that they performed better than the M5Tree model in modeling monthly chemical oxygen demand (COD). Using the MARS model decreased the average root mean square error (RMSE) relative to the LSSVM and M5Tree models by 1.47% and 19.1%, respectively. Adding the TC input to the models did not increase their accuracy in modeling COD, while adding the FC and pH inputs generally decreased the accuracy. The overall results indicated that the MARS and LSSVM models could be successfully used in estimating monthly river water pollution level by using AMM, TKN and WT parameters as inputs.
Avelino, Jacques; Cabut, Sandrine; Barboza, Bernardo; Barquero, Miguel; Alfaro, Ronny; Esquivel, César; Durand, Jean-François; Cilas, Christian
2007-12-01
We monitored the development of American leaf spot of coffee, a disease caused by the gemmiferous fungus Mycena citricolor, in 57 plots in Costa Rica for 1 or 2 years in order to gain a clearer understanding of conditions conducive to the disease and improve its control. During the investigation, characteristics of the coffee trees, crop management, and the environment were recorded. For the analyses, we used partial least-squares regression via the spline functions (PLSS), which is a nonlinear extension to partial least-squares regression (PLS). The fungus developed well in areas located between approximately 1,100 and 1,550 m above sea level. Slopes were conducive to its development, but eastern-facing slopes were less affected than the others, probably because they were more exposed to sunlight, especially in the rainy season. The distance between planting rows, the shade percentage, coffee tree height, the type of shade, and the pruning system explained disease intensity due to their effects on coffee tree shading and, possibly, on the humidity conditions in the plot. Forest trees and fruit trees intercropped with coffee provided particularly propitious conditions. Apparently, fertilization was unfavorable for the disease, probably due to dilution phenomena associated with faster coffee tree growth. Finally, series of wet spells interspersed with dry spells, which were frequent in the middle of the rainy season, were critical for the disease, probably because they affected the production and release of gemmae and their viability. These results could be used to draw up a map of epidemic risks taking topographical factors into account. To reduce those risks and improve chemical control, our results suggested that farmers should space planting rows further apart, maintain light shading in the plantation, and prune their coffee trees.
NASA Astrophysics Data System (ADS)
Kukunda, Collins B.; Duque-Lazo, Joaquín; González-Ferreiro, Eduardo; Thaden, Hauke; Kleinn, Christoph
2018-03-01
Distinguishing tree species is relevant in many contexts of remote sensing assisted forest inventory. Accurate tree species maps support management and conservation planning, pest and disease control and biomass estimation. This study evaluated the performance of applying ensemble techniques with the goal of automatically distinguishing Pinus sylvestris L. and Pinus uncinata Mill. Ex Mirb within a 1.3 km2 mountainous area in Barcelonnette (France). Three modelling schemes were examined, based on: (1) high-density LiDAR data (160 returns m-2), (2) Worldview-2 multispectral imagery, and (3) Worldview-2 and LiDAR in combination. Variables related to the crown structure and height of individual trees were extracted from the normalized LiDAR point cloud at individual-tree level, after performing individual tree crown (ITC) delineation. Vegetation indices and the Haralick texture indices were derived from Worldview-2 images and served as independent spectral variables. Selection of the best predictor subset was done after a comparison of three variable selection procedures: (1) Random Forests with cross validation (AUCRFcv), (2) Akaike Information Criterion (AIC) and (3) Bayesian Information Criterion (BIC). To classify the species, 9 regression techniques were combined using ensemble models. Predictions were evaluated using cross validation and an independent dataset. Integration of datasets and models improved individual tree species classification (True Skills Statistic, TSS; from 0.67 to 0.81) over individual techniques and maintained strong predictive power (Relative Operating Characteristic, ROC = 0.91). Assemblage of regression models and integration of the datasets provided more reliable species distribution maps and associated tree-scale mapping uncertainties. Our study highlights the potential of model and data assemblage at improving species classifications needed in present-day forest planning and management.
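The True Skill Statistic reported above is conventionally computed as sensitivity plus specificity minus one. The following is a minimal sketch of that calculation, not the authors' code; the confusion-matrix counts are hypothetical and only illustrate the arithmetic.

```python
def true_skill_statistic(tp, fn, tn, fp):
    """TSS = sensitivity + specificity - 1, ranging from -1 to 1."""
    sensitivity = tp / (tp + fn)  # share of true positives that were detected
    specificity = tn / (tn + fp)  # share of true negatives that were detected
    return sensitivity + specificity - 1.0


# Hypothetical counts for a two-species classification (not from the study).
tss = true_skill_statistic(tp=45, fn=5, tn=40, fp=10)
print(round(tss, 3))
```

A TSS of 0 indicates performance no better than chance, and 1 indicates perfect separation of the two classes.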
Chilling and heat requirements for flowering in temperate fruit trees
NASA Astrophysics Data System (ADS)
Guo, Liang; Dai, Junhu; Ranjitkar, Sailesh; Yu, Haiying; Xu, Jianchu; Luedeling, Eike
2014-08-01
Climate change has affected the rates of chilling and heat accumulation, which are vital for flowering and production, in temperate fruit trees, but few studies have been conducted in the cold-winter climates of East Asia. To evaluate tree responses to variation in chill and heat accumulation rates, partial least squares regression was used to correlate first flowering dates of chestnut (Castanea mollissima Blume) and jujube (Zizyphus jujube Mill.) in Beijing, China, with daily chill and heat accumulation between 1963 and 2008. The Dynamic Model and the Growing Degree Hour Model were used to convert daily records of minimum and maximum temperature into horticulturally meaningful metrics. Regression analyses identified the chilling and forcing periods for chestnut and jujube. The forcing periods started when half the chilling requirements were fulfilled. Over the past 50 years, heat accumulation during tree dormancy increased significantly, while chill accumulation remained relatively stable for both species. Heat accumulation was the main driver of bloom timing, with effects of variation in chill accumulation negligible in Beijing's cold-winter climate. It does not seem likely that reductions in chill will have a major effect on the studied species in Beijing in the near future. Such problems are much more likely for trees grown in locations that are substantially warmer than their native habitats, such as temperate species in the subtropics and tropics.
Thin Cloud Detection Method by Linear Combination Model of Cloud Image
NASA Astrophysics Data System (ADS)
Liu, L.; Li, J.; Wang, Y.; Xiao, Y.; Zhang, W.; Zhang, S.
2018-04-01
Existing cloud detection methods in photogrammetry often extract image features directly from remote sensing images and then use them to classify image regions as cloud or non-cloud. When the cloud is thin and small, however, these methods become inaccurate. In this paper, a linear combination model of cloud images is proposed; with this model, the underlying surface information of remote sensing images can be removed, so the cloud detection result becomes more accurate. First, the automatic cloud detection program uses the linear combination model to separate the cloud information from the surface information in transparent cloud images, and then uses different image features to recognize the cloud parts. For computational efficiency, an AdaBoost classifier was introduced to combine the different features into a cloud classifier. The AdaBoost classifier can select the most effective features from many ordinary features, so the calculation time is greatly reduced. Finally, we compared the proposed method with a cloud detection method based on tree structure and a multiple-feature detection method using an SVM classifier; the experimental data show that the proposed cloud detection program has high accuracy and fast calculation speed.
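The way AdaBoost combines several weak, single-feature rules into one classifier can be sketched as follows. This is a generic textbook AdaBoost with one-feature threshold rules (decision stumps), not the authors' implementation; the two features (brightness, texture) and all sample values are hypothetical.

```python
import math


def train_stump(X, y, w):
    """Find the (error, feature, threshold, polarity) with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for pol in (1, -1):
                pred = [pol if row[j] >= thr else -pol for row in X]
                err = sum(wi for wi, p, t in zip(w, pred, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best


def stump_predict(stump, row):
    _, j, thr, pol = stump
    return pol if row[j] >= thr else -pol


def adaboost(X, y, rounds=5):
    """Return a list of (alpha, stump) pairs; labels must be +1 or -1."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)      # weight of this weak rule
        model.append((alpha, stump))
        # Increase the weight of misclassified samples, then renormalize.
        w = [wi * math.exp(-alpha * t * stump_predict(stump, row))
             for wi, row, t in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def predict(model, row):
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in model)
    return 1 if score >= 0 else -1


# Hypothetical samples: [brightness, texture]; +1 = cloud, -1 = clear surface.
X = [[0.9, 0.2], [0.8, 0.3], [0.7, 0.1], [0.3, 0.6], [0.2, 0.7], [0.1, 0.5]]
y = [1, 1, 1, -1, -1, -1]
model = adaboost(X, y, rounds=5)
print([predict(model, row) for row in X])
```

Because each stump uses a single feature, the features chosen across rounds give a rough indication of which ones the ensemble found most useful, which is the feature-selection behavior the abstract refers to.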
Testing naturalness at 100 TeV
NASA Astrophysics Data System (ADS)
Chen, Chuan-Ren; Hajer, Jan; Liu, Tao; Low, Ian; Zhang, Hao
2017-09-01
Solutions to the electroweak hierarchy problem typically introduce a new symmetry to stabilize the quadratic ultraviolet sensitivity in the self-energy of the Higgs boson. The new symmetry is either broken softly or collectively, as for example in supersymmetric and little Higgs theories. At low energies such theories contain naturalness partners of the Standard Model fields which are responsible for canceling the quadratic divergence in the squared Higgs mass. Post the discovery of any partner-like particles, we propose to test the aforementioned cancellation by measuring relevant Higgs couplings. Using the fermionic top partners in little Higgs theories as an illustration, we construct a simplified model for naturalness and initiate a study on testing naturalness. After electroweak symmetry breaking, naturalness in the top sector requires $a_T = -\lambda_t^2$ at leading order, where $\lambda_t$ and $a_T$ are the Higgs couplings to a pair of top quarks and top partners, respectively. Using a multivariate method of Boosted Decision Tree to tag boosted particles in the Standard Model, we show that, with a luminosity of 30 ab$^{-1}$ at a 100 TeV pp-collider, naturalness could be tested with a precision of 10% for a top partner mass up to 2.5 TeV.
NASA Astrophysics Data System (ADS)
Bouffon, T.; Rice, R.; Bales, R.
2006-12-01
The spatial distributions of snow water equivalent (SWE) and snow depth within a 1, 4, and 16 km2 grid element around two automated snow pillows in a forested and an open-forested region of the Upper Merced River Basin (2,800 km2) of Yosemite National Park were characterized using field observations and analyzed using binary regression trees. Snow surveys occurred at the forested site during the accumulation and ablation seasons, while at the open-forest site a survey was performed only during the accumulation season. An average of 130 snow depth and 7 snow density measurements were made on each survey, within the 4 km2 grid. Snow depth was distributed using binary regression trees and geostatistical methods using the physiographic parameters (e.g. elevation, slope, vegetation, aspect). Results in the forest region indicate that the snow pillow overestimated average SWE within the 1, 4, and 16 km2 areas by 34 percent during ablation. During accumulation the snow pillow provided a good estimate of the modeled mean SWE grid value, although it is suspected that the snow pillow was underestimating SWE. At the open-forest site, during accumulation, the snow pillow value was 28 percent greater than the modeled grid-element mean. In addition, the binary regression trees indicate that the independent variables of vegetation, slope, and aspect are the most influential parameters of snow depth distribution. The binary regression tree and multivariate linear regression models explain about 60 percent of the initial variance for snow depth and 80 percent for density, respectively. This short-term study provides motivation and direction for the installation of a distributed snow measurement network to fill the information gap in basin-wide SWE and snow depth measurements.
Guided by these results, a distributed snow measurement network was installed in the Fall 2006 at Gin Flat in the Upper Merced River Basin with the specific objective of measuring accumulation and ablation across topographic variables with the aim of providing guidance for future larger scale observation network designs.
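A binary regression tree of the kind described above is grown by repeatedly choosing the split that most reduces the within-group sum of squared errors. The sketch below shows that single step and the resulting share of variance explained; it is an illustration under assumed data, not the study's model, and the elevation and depth values are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_binary_split(x, y):
    """Return (threshold, sse_after) for the split on x that minimizes total SSE."""
    pairs = sorted(zip(x, y))
    best_thr, best_sse = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical predictor values
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best_sse:
            best_thr, best_sse = thr, total
    return best_thr, best_sse


# Hypothetical survey points: elevation (m) and snow depth (cm).
elevation = [2100, 2150, 2200, 2600, 2650, 2700]
depth = [80, 85, 90, 150, 155, 160]

threshold, sse_after = best_binary_split(elevation, depth)
explained = 1.0 - sse_after / sse(depth)  # fraction of initial variance explained
print(threshold, round(explained, 3))
```

A full tree applies the same search recursively within each branch and across all predictors (elevation, slope, vegetation, aspect), stopping when further splits no longer reduce the error meaningfully.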
DOE Office of Scientific and Technical Information (OSTI.GOV)
Teguh, David N.; Levendag, Peter C.; Noever, Inge
2008-11-15
Purpose: To assess the relationship for oropharyngeal (OP) cancer and nasopharyngeal (NP) cancer between the dose received by the swallowing structures and the dysphagia related quality of life (QoL). Methods and Materials: Between 2000 and 2005, 85 OP and 47 NP cancer patients were treated by radiation therapy. After 46 Gy, OP cancer is boosted by intensity-modulated radiation therapy (IMRT), brachytherapy (BT), or frameless stereotactic radiation/cyberknife (CBK). After 46 Gy, the NP cancer was boosted with parallel-opposed fields or IMRT to a total dose of 70 Gy; subsequently, a second boost was given by either BT (11 Gy) or stereotactic radiation (SRT)/CBK (11.2 Gy). Sixty OP and 21 NP cancer patients responded to functional and QoL questionnaires (i.e., the Performance Status Scales, European Organization for Research and Treatment of Cancer H and N35, and M.D. Anderson Dysphagia Inventory). The swallowing muscles were delineated and the mean dose calculated using the original three-dimensional computed tomography-based treatment plans. Univariate analyses were performed using logistic regression analysis. Results: Most dysphagia problems were observed in the base of tongue tumors. For OP cancer, boosting with IMRT resulted in more dysphagia as opposed to BT or SRT/CBK. For NPC patients, in contrast to the first booster dose (46-70 Gy), no additional increase of dysphagia by the second boost was observed. Conclusions: The lowest mean doses of radiation to the swallowing muscles were achieved when using BT as opposed to SRT/CBK or IMRT. For the 81 patients alive with no evidence of disease for at least 1 year, a dose-effect relationship was observed between the dose in the superior constrictor muscle and the 'normalcy of diet' (Performance Status Scales) or 'swallowing scale' (H and N35) scores (p < 0.01).
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hepel, Jaroslaw T., E-mail: jhepel@lifespan.org; Department of Radiation Oncology, Tufts Medical Center, Tufts University, Boston, Massachusetts; Leonard, Kara Lynne
Purpose: Stereotactic body radiation therapy (SBRT) boost to primary and nodal disease after chemoradiation has potential to improve outcomes for advanced non-small cell lung cancer (NSCLC). A dose escalation study was initiated to evaluate the maximum tolerated dose (MTD). Methods and Materials: Eligible patients received chemoradiation to a dose of 50.4 Gy in 28 fractions and had primary and nodal volumes appropriate for SBRT boost (<120 cc and <60 cc, respectively). SBRT was delivered in 2 fractions after chemoradiation. Dose was escalated from 16 to 28 Gy in 2 Gy/fraction increments, resulting in 4 dose cohorts. MTD was defined when ≥2 of 6 patients per cohort experienced any treatment-related grade 3 to 5 toxicity within 4 weeks of treatment or the maximum dose was reached. Late toxicity, disease control, and survival were also evaluated. Results: Twelve patients (3 per dose level) underwent treatment. All treatment plans met predetermined dose-volume constraints. The mean age was 64 years. Most patients had stage III disease (92%) and were medically inoperable (92%). The maximum dose level was reached with no grade 3 to 5 acute toxicities. At a median follow-up time of 16 months, 1-year local-regional control (LRC) was 78%. LRC was 50% at <24 Gy and 100% at ≥24 Gy (P=.02). Overall survival at 1 year was 67%. Late toxicity (grade 3-5) was seen in only 1 patient who experienced fatal bronchopulmonary hemorrhage (grade 5). There were no predetermined dose constraints for the proximal bronchial-vascular tree (PBV) in this study. This patient's 4-cc PBV dose was substantially higher than that received by other patients in all 4 cohorts and was associated with the toxicity observed: 20.3 Gy (P<.05) and 73.5 Gy (P=.07) for SBRT boost and total treatment, respectively. Conclusions: SBRT boost to both primary and nodal disease after chemoradiation is feasible and well tolerated. Local control rates are encouraging, especially at doses ≥24 Gy in 2 fractions.
Toxicity at the PBV is a concern but potentially can be avoided with strict dose-volume constraints.
Yamashita, Takashi; Kart, Cary S; Noe, Douglas A
2012-12-01
Type 2 diabetes is known to contribute to health disparities in the U.S. and failure to adhere to recommended self-care behaviors is a contributing factor. Intervention programs face difficulties as a result of patient diversity and limited resources. With data from the 2005 Behavioral Risk Factor Surveillance System, this study employs a logistic regression tree algorithm to identify characteristics of sub-populations with type 2 diabetes according to their reported frequency of adherence to four recommended diabetes self-care behaviors including blood glucose monitoring, foot examination, eye examination and HbA1c testing. Using Andersen's health behavior model, need factors appear to dominate the definition of which sub-groups were at greatest risk for low as well as high adherence. Findings demonstrate the utility of easily interpreted tree diagrams to design specific culturally appropriate intervention programs targeting sub-populations of diabetes patients who need to improve their self-care behaviors. Limitations and contributions of the study are discussed.
NASA Astrophysics Data System (ADS)
Pham, Binh Thai; Prakash, Indra; Tien Bui, Dieu
2018-02-01
A hybrid machine learning approach combining Random Subspace (RSS) and Classification And Regression Trees (CART) is proposed to develop a model named RSSCART for spatial prediction of landslides. This model combines the RSS method, which is known as an efficient ensemble technique, with CART, which is a state-of-the-art classifier. The Luc Yen district of Yen Bai province, a prominent landslide-prone area of Viet Nam, was selected for the model development. Performance of the RSSCART model was evaluated through the Receiver Operating Characteristic (ROC) curve, statistical analysis methods, and the Chi Square test. Results were compared with other benchmark landslide models, namely Support Vector Machines (SVM), single CART, Naïve Bayes Trees (NBT), and Logistic Regression (LR). In developing the model, ten important landslide-affecting factors related to geomorphology, geology and geo-environment were considered, namely slope angles, elevation, slope aspect, curvature, lithology, distance to faults, distance to rivers, distance to roads, and rainfall. Performance of the RSSCART model (AUC = 0.841) is the best compared with the other popular landslide models, namely SVM (0.835), single CART (0.822), NBT (0.821), and LR (0.723). These results indicate that RSSCART is a promising method for spatial landslide prediction.
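The AUC values used above to rank the models can be read as the probability that a randomly chosen landslide location receives a higher predicted score than a randomly chosen non-landslide location. The sketch below computes AUC directly from that definition; it is not the authors' evaluation code, and the scores and labels are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical susceptibility scores; 1 = landslide, 0 = no landslide.
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc(scores, labels)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the reported gap between RSSCART (0.841) and LR (0.723) is considerably larger than the gap between RSSCART and SVM (0.835).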
Random Forest as a Predictive Analytics Alternative to Regression in Institutional Research
ERIC Educational Resources Information Center
He, Lingjun; Levine, Richard A.; Fan, Juanjuan; Beemer, Joshua; Stronach, Jeanne
2018-01-01
In institutional research, modern data mining approaches are seldom considered to address predictive analytics problems. The goal of this paper is to highlight the advantages of tree-based machine learning algorithms over classic (logistic) regression methods for data-informed decision making in higher education problems, and stress the success of…
Tree allometry and improved estimation of carbon stocks and balance in tropical forests.
Chave, J; Andalo, C; Brown, S; Cairns, M A; Chambers, J Q; Eamus, D; Fölster, H; Fromard, F; Higuchi, N; Kira, T; Lescure, J-P; Nelson, B W; Ogawa, H; Puig, H; Riéra, B; Yamakura, T
2005-08-01
Tropical forests hold large stores of carbon, yet uncertainty remains regarding their quantitative contribution to the global carbon cycle. One approach to quantifying carbon biomass stores consists in inferring changes from long-term forest inventory plots. Regression models are used to convert inventory data into an estimate of aboveground biomass (AGB). We provide a critical reassessment of the quality and the robustness of these models across tropical forest types, using a large dataset of 2,410 trees ≥ 5 cm diameter, directly harvested in 27 study sites across the tropics. Proportional relationships between aboveground biomass and the product of wood density, trunk cross-sectional area, and total height are constructed. We also develop a regression model involving wood density and stem diameter only. Our models were tested for secondary and old-growth forests, for dry, moist and wet forests, for lowland and montane forests, and for mangrove forests. The most important predictors of AGB of a tree were, in decreasing order of importance, its trunk diameter, wood specific gravity, total height, and forest type (dry, moist, or wet). Overestimates prevailed, giving a bias of 0.5-6.5% when errors were averaged across all stands. Our regression models can be used reliably to predict aboveground tree biomass across a broad range of tropical forests. Because they are based on an unprecedented dataset, these models should improve the quality of tropical biomass estimates, and bring consensus about the contribution of the tropical forest biome and tropical deforestation to the global carbon cycle.
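The proportional relationship described above (biomass proportional to wood density times trunk cross-sectional area times height) can be sketched as below. The proportionality constant `form_factor` is a hypothetical placeholder, not a coefficient from the paper, and the tree measurements are invented for illustration.

```python
import math


def aboveground_biomass_kg(wood_density, diameter_cm, height_m, form_factor=0.6):
    """AGB proportional to density x cross-sectional area x height.

    wood_density is in g/cm^3, diameter in cm, height in m.
    form_factor is a HYPOTHETICAL proportionality constant (assumption).
    """
    area_cm2 = math.pi * diameter_cm ** 2 / 4.0   # trunk cross-sectional area
    volume_cm3 = area_cm2 * height_m * 100.0      # cylinder volume, height in cm
    mass_g = form_factor * wood_density * volume_cm3
    return mass_g / 1000.0


# Hypothetical tree: density 0.6 g/cm^3, 30 cm diameter, 25 m tall.
agb = aboveground_biomass_kg(0.6, 30.0, 25.0)
print(round(agb, 1))
```

Under this form, doubling height doubles the estimate while doubling diameter quadruples it, which is consistent with trunk diameter being reported as the most important predictor.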
Vegetation Continuous Fields--Transitioning from MODIS to VIIRS
NASA Astrophysics Data System (ADS)
DiMiceli, C.; Townshend, J. R.; Sohlberg, R. A.; Kim, D. H.; Kelly, M.
2015-12-01
Measurements of fractional vegetation cover are critical for accurate and consistent monitoring of global deforestation rates. They also provide important parameters for land surface, climate and carbon models and vital background data for research into fire, hydrological and ecosystem processes. MODIS Vegetation Continuous Fields (VCF) products provide four complementary layers of fractional cover: tree cover, non-tree vegetation, bare ground, and surface water. MODIS VCF products are currently produced globally and annually at 250m resolution for 2000 to the present. Additionally, annual VCF products at 1/20° resolution derived from AVHRR and MODIS Long-Term Data Records are in development to provide Earth System Data Records of fractional vegetation cover for 1982 to the present. In order to provide continuity of these valuable products, we are extending the VCF algorithms to create Suomi NPP/VIIRS VCF products. This presentation will highlight the first VIIRS fractional cover product: global percent tree cover at 1 km resolution. To create this product, phenological and physiological metrics were derived from each complete year of VIIRS 8-day surface reflectance products. A supervised regression tree method was applied to the metrics, using training derived from Landsat data supplemented by high-resolution data from Ikonos, RapidEye and QuickBird. The regression tree model was then applied globally to produce fractional tree cover. In our presentation we will detail our methods for creating the VIIRS VCF product. We will compare the new VIIRS VCF product to our current MODIS VCF products and demonstrate continuity between instruments. Finally, we will outline future VIIRS VCF development plans.
Tanaka, Tomohiro; Voigt, Michael D
2018-03-01
Non-melanoma skin cancer (NMSC) is the most common de novo malignancy in liver transplant (LT) recipients; it behaves more aggressively and it increases mortality. We used decision tree analysis to develop a tool to stratify and quantify risk of NMSC in LT recipients. We performed Cox regression analysis to identify which predictive variables to enter into the decision tree analysis. Data were from the Organ Procurement Transplant Network (OPTN) STAR files of September 2016 (n = 102984). NMSC developed in 4556 of the 105984 recipients, a mean of 5.6 years after transplant. The 5/10/20-year rates of NMSC were 2.9/6.3/13.5%, respectively. Cox regression identified male gender, Caucasian race, age, body mass index (BMI) at LT, and sirolimus use as key predictive or protective factors for NMSC. These factors were entered into a decision tree analysis. The final tree stratified non-Caucasians as low risk (0.8%), and Caucasian males > 47 years, BMI < 40 who did not receive sirolimus, as high risk (7.3% cumulative incidence of NMSC). The predictions in the derivation set were almost identical to those in the validation set (r² = 0.971, p < 0.0001). Cumulative incidence of NMSC in low, moderate and high risk groups at 5/10/20 year was 0.5/1.2/3.3, 2.1/4.8/11.7 and 5.6/11.6/23.1% (p < 0.0001). The decision tree model accurately stratifies the risk of developing NMSC in the long-term after LT.
NASA Astrophysics Data System (ADS)
Park, Seonyoung; Im, Jungho; Park, Sumin; Rhee, Jinyoung
2017-04-01
Soil moisture is one of the most important keys for understanding regional and global climate systems. Soil moisture is directly related to agricultural processes as well as hydrological processes because soil moisture highly influences vegetation growth and determines water supply in the agroecosystem. Accurate monitoring of the spatiotemporal pattern of soil moisture is important. Soil moisture has been generally provided through in situ measurements at stations. Although field survey from in situ measurements provides accurate soil moisture with high temporal resolution, it requires high cost and does not provide the spatial distribution of soil moisture over large areas. Approaches based on microwave satellite sensors (e.g., the advanced Microwave Scanning Radiometer on the Earth Observing System (AMSR2), the Advanced Scatterometer (ASCAT), and Soil Moisture Active Passive (SMAP)) and numerical models such as the Global Land Data Assimilation System (GLDAS) and the Modern-Era Retrospective Analysis for Research and Applications (MERRA) provide spatiotemporally continuous soil moisture products at global scale. However, since those global soil moisture products have coarse spatial resolution (25-40 km), their applications for agriculture and water resources at local and regional scales are very limited. Thus, soil moisture downscaling is needed to overcome the limitation of the spatial resolution of soil moisture products. In this study, GLDAS soil moisture data were downscaled up to 1 km spatial resolution through the integration of AMSR2 and ASCAT soil moisture data, Shuttle Radar Topography Mission (SRTM) Digital Elevation Model (DEM), and Moderate Resolution Imaging Spectroradiometer (MODIS) data (Land Surface Temperature, Normalized Difference Vegetation Index, and Land cover) using modified regression trees over East Asia from 2013 to 2015. Modified regression trees were implemented using Cubist, a commercial software tool based on machine learning.
An optimization based on pruning of rules derived from the modified regression trees was conducted. Root Mean Square Error (RMSE) and correlation coefficients (r) were used to optimize the rules, and finally 59 rules from modified regression trees were produced. The results show high validation r (0.79) and low validation RMSE (0.0556 m3/m3). The 1 km downscaled soil moisture was evaluated using ground soil moisture data at 14 stations, and both soil moisture data sets showed similar temporal patterns (average r=0.51 and average RMSE=0.041). The spatial distribution of the 1 km downscaled soil moisture corresponded well with that of GLDAS soil moisture, capturing both extremely dry and wet regions. Correlation between GLDAS and the 1 km downscaled soil moisture during growing season was positive (mean r=0.35) in most regions.
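The two validation metrics named above, RMSE and the correlation coefficient r, can be computed as in the following sketch. These are the standard definitions, not the authors' code; the downscaled and station soil moisture values (in m3/m3) are hypothetical.

```python
import math


def rmse(predicted, observed):
    """Root mean square error between two equal-length sequences."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical soil moisture series (m3/m3) at one station.
downscaled = [0.20, 0.25, 0.30, 0.35]
station = [0.22, 0.24, 0.33, 0.36]

error = rmse(downscaled, station)
r = pearson_r(downscaled, station)
print(round(error, 4), round(r, 3))
```

RMSE measures the typical size of the disagreement in the units of the data, while r measures only how well the two series vary together, so a product can show a low RMSE and a moderate r at the same time, as in the station comparison reported above.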
Diameter-growth model across shortleaf pine range using regression tree analysis
Daniel Yaussy; Louis Iverson; Anantha Prasad
1999-01-01
Diameter growth of a tree in most gap-phase models is limited by light, nutrients, moisture, and temperature. Growing-season temperature is represented by growing degree days (gdd), which is the sum of the average daily temperatures above a baseline temperature. Gap-phase models determine the north-south range of a species by the gdd limits at the north and south...
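The growing degree day definition given above (the sum of average daily temperatures above a baseline) can be sketched as follows. The daily average is taken here as the midpoint of the daily minimum and maximum, and the 5 °C baseline is a hypothetical value; both are assumptions for illustration rather than details from the paper.

```python
def growing_degree_days(daily_min, daily_max, base=5.0):
    """Sum of (daily mean temperature - base) over days when the mean exceeds base.

    The daily mean is approximated as (min + max) / 2 (assumption).
    base is a HYPOTHETICAL baseline temperature in degrees C.
    """
    total = 0.0
    for tmin, tmax in zip(daily_min, daily_max):
        mean = (tmin + tmax) / 2.0
        total += max(0.0, mean - base)  # days at or below the base contribute nothing
    return total


# Hypothetical three-day record of minimum and maximum temperatures (degrees C).
gdd = growing_degree_days([2, 8, 10], [6, 16, 20], base=5.0)
print(gdd)
```

In this example the first day has a mean of 4 °C and contributes nothing, while the other two days contribute 7 and 10 degree days.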
Eric J. Gustafson
2014-01-01
Regression models developed in the upper Midwest (United States) to predict drought-induced tree mortality from measures of drought (Palmer Drought Severity Index) were tested in the northeastern United States and found inadequate. The most likely cause of this result is that long drought events were rare in the Northeast during the period when inventory data were...
W. Henry McNab; David L. Loftis; Callie J. Schweitzer; Raymond Sheffield
2004-01-01
We used tree indicator species occurring on 438 plots in the Plateau counties of Tennessee to test the uniqueness of four conterminous ecoregions. Multinomial logistic regression indicated that the presence of 14 tree species allowed classification of sample plots according to ecoregion with an average overall accuracy of 75 percent (range 45 to 94 percent). Additional...
Dating tree mortality using log decay in the White Mountains of New Hampshire
Andrew J. Fast; Mark J. Ducey; Jeffrey H. Gove; William B. Leak
2008-01-01
Coarse woody material (CWM) is an important component of forest ecosystems. To meet specific CWM management objectives, it is important to understand rates of decay. We present results from a silvicultural trial at the Bartlett Experimental Forest, in which time of death is known for a large sample of trees. Either a simple table or regression equations that use...
Development of post-fire crown damage mortality thresholds in ponderosa pine
James F. Fowler; Carolyn Hull Sieg; Joel McMillin; Kurt K. Allen; Jose F. Negron; Linda L. Wadleigh; John A. Anhold; Ken E. Gibson
2010-01-01
Previous research has shown that crown scorch volume and crown consumption volume are the major predictors of post-fire mortality in ponderosa pine. In this study, we use piecewise logistic regression models of crown scorch data from 6633 trees in five wildfires from the Intermountain West to locate a mortality threshold at 88% scorch by volume for trees with no crown...