Anantha M. Prasad; Louis R. Iverson; Andy Liaw; Andy Liaw
2006-01-01
We evaluated four statistical models - Regression Tree Analysis (RTA), Bagging Trees (BT), Random Forests (RF), and Multivariate Adaptive Regression Splines (MARS) - for predictive vegetation mapping under current and future climate scenarios according to the Canadian Climate Centre global circulation model.
L.R. Iverson; A.M. Prasad; A. Liaw
2004-01-01
More and better machine learning tools are becoming available for landscape ecologists to aid in understanding species-environment relationships and to map probable species occurrence now and potentially into the future. To thal end, we evaluated three statistical models: Regression Tree Analybib (RTA), Bagging Trees (BT) and Random Forest (RF) for their utility in...
Generalized and synthetic regression estimators for randomized branch sampling
David L. R. Affleck; Timothy G. Gregoire
2015-01-01
In felled-tree studies, ratio and regression estimators are commonly used to convert more readily measured branch characteristics to dry crown mass estimates. In some cases, data from multiple trees are pooled to form these estimates. This research evaluates the utility of both tactics in the estimation of crown biomass following randomized branch sampling (...
Design of Probabilistic Random Forests with Applications to Anticancer Drug Sensitivity Prediction
Rahman, Raziur; Haider, Saad; Ghosh, Souparno; Pal, Ranadip
2015-01-01
Random forests consisting of an ensemble of regression trees with equal weights are frequently used for design of predictive models. In this article, we consider an extension of the methodology by representing the regression trees in the form of probabilistic trees and analyzing the nature of heteroscedasticity. The probabilistic tree representation allows for analytical computation of confidence intervals (CIs), and the tree weight optimization is expected to provide stricter CIs with comparable performance in mean error. We approached the ensemble of probabilistic trees’ prediction from the perspectives of a mixture distribution and as a weighted sum of correlated random variables. We applied our methodology to the drug sensitivity prediction problem on synthetic and cancer cell line encyclopedia dataset and illustrated that tree weights can be selected to reduce the average length of the CI without increase in mean error. PMID:27081304
E. Freeman; G. Moisen; J. Coulston; B. Wilson
2014-01-01
Random forests (RF) and stochastic gradient boosting (SGB), both involving an ensemble of classification and regression trees, are compared for modeling tree canopy cover for the 2011 National Land Cover Database (NLCD). The objectives of this study were twofold. First, sensitivity of RF and SGB to choices in tuning parameters was explored. Second, performance of the...
NASA Astrophysics Data System (ADS)
Tang, Jie; Liu, Rong; Zhang, Yue-Li; Liu, Mou-Ze; Hu, Yong-Fang; Shao, Ming-Jie; Zhu, Li-Jun; Xin, Hua-Wen; Feng, Gui-Wen; Shang, Wen-Jun; Meng, Xiang-Guang; Zhang, Li-Rong; Ming, Ying-Zi; Zhang, Wei
2017-02-01
Tacrolimus has a narrow therapeutic window and considerable variability in clinical use. Our goal was to compare the performance of multiple linear regression (MLR) and eight machine learning techniques in pharmacogenetic algorithm-based prediction of tacrolimus stable dose (TSD) in a large Chinese cohort. A total of 1,045 renal transplant patients were recruited, 80% of which were randomly selected as the “derivation cohort” to develop dose-prediction algorithm, while the remaining 20% constituted the “validation cohort” to test the final selected algorithm. MLR, artificial neural network (ANN), regression tree (RT), multivariate adaptive regression splines (MARS), boosted regression tree (BRT), support vector regression (SVR), random forest regression (RFR), lasso regression (LAR) and Bayesian additive regression trees (BART) were applied and their performances were compared in this work. Among all the machine learning models, RT performed best in both derivation [0.71 (0.67-0.76)] and validation cohorts [0.73 (0.63-0.82)]. In addition, the ideal rate of RT was 4% higher than that of MLR. To our knowledge, this is the first study to use machine learning models to predict TSD, which will further facilitate personalized medicine in tacrolimus administration in the future.
Gu, Yingxin; Wylie, Bruce K.; Boyte, Stephen; Picotte, Joshua J.; Howard, Danny; Smith, Kelcy; Nelson, Kurtis
2016-01-01
Regression tree models have been widely used for remote sensing-based ecosystem mapping. Improper use of the sample data (model training and testing data) may cause overfitting and underfitting effects in the model. The goal of this study is to develop an optimal sampling data usage strategy for any dataset and identify an appropriate number of rules in the regression tree model that will improve its accuracy and robustness. Landsat 8 data and Moderate-Resolution Imaging Spectroradiometer-scaled Normalized Difference Vegetation Index (NDVI) were used to develop regression tree models. A Python procedure was designed to generate random replications of model parameter options across a range of model development data sizes and rule number constraints. The mean absolute difference (MAD) between the predicted and actual NDVI (scaled NDVI, value from 0–200) and its variability across the different randomized replications were calculated to assess the accuracy and stability of the models. In our case study, a six-rule regression tree model developed from 80% of the sample data had the lowest MAD (MADtraining = 2.5 and MADtesting = 2.4), which was suggested as the optimal model. This study demonstrates how the training data and rule number selections impact model accuracy and provides important guidance for future remote-sensing-based ecosystem modeling.
Austin, Peter C; Lee, Douglas S; Steyerberg, Ewout W; Tu, Jack V
2012-01-01
In biomedical research, the logistic regression model is the most commonly used method for predicting the probability of a binary outcome. While many clinical researchers have expressed an enthusiasm for regression trees, this method may have limited accuracy for predicting health outcomes. We aimed to evaluate the improvement that is achieved by using ensemble-based methods, including bootstrap aggregation (bagging) of regression trees, random forests, and boosted regression trees. We analyzed 30-day mortality in two large cohorts of patients hospitalized with either acute myocardial infarction (N = 16,230) or congestive heart failure (N = 15,848) in two distinct eras (1999–2001 and 2004–2005). We found that both the in-sample and out-of-sample prediction of ensemble methods offered substantial improvement in predicting cardiovascular mortality compared to conventional regression trees. However, conventional logistic regression models that incorporated restricted cubic smoothing splines had even better performance. We conclude that ensemble methods from the data mining and machine learning literature increase the predictive performance of regression trees, but may not lead to clear advantages over conventional logistic regression models for predicting short-term mortality in population-based samples of subjects with cardiovascular disease. PMID:22777999
Decision tree modeling using R.
Zhang, Zhongheng
2016-08-01
In machine learning field, decision tree learner is powerful and easy to interpret. It employs recursive binary partitioning algorithm that splits the sample in partitioning variable with the strongest association with the response variable. The process continues until some stopping criteria are met. In the example I focus on conditional inference tree, which incorporates tree-structured regression models into conditional inference procedures. While growing a single tree is subject to small changes in the training data, random forests procedure is introduced to address this problem. The sources of diversity for random forests come from the random sampling and restricted set of input variables to be selected. Finally, I introduce R functions to perform model based recursive partitioning. This method incorporates recursive partitioning into conventional parametric model building.
Su, Xiaogang; Peña, Annette T; Liu, Lei; Levine, Richard A
2018-04-29
Assessing heterogeneous treatment effects is a growing interest in advancing precision medicine. Individualized treatment effects (ITEs) play a critical role in such an endeavor. Concerning experimental data collected from randomized trials, we put forward a method, termed random forests of interaction trees (RFIT), for estimating ITE on the basis of interaction trees. To this end, we propose a smooth sigmoid surrogate method, as an alternative to greedy search, to speed up tree construction. The RFIT outperforms the "separate regression" approach in estimating ITE. Furthermore, standard errors for the estimated ITE via RFIT are obtained with the infinitesimal jackknife method. We assess and illustrate the use of RFIT via both simulation and the analysis of data from an acupuncture headache trial. Copyright © 2018 John Wiley & Sons, Ltd.
ERIC Educational Resources Information Center
Strobl, Carolin; Malley, James; Tutz, Gerhard
2009-01-01
Recursive partitioning methods have become popular and widely used tools for nonparametric regression and classification in many scientific fields. Especially random forests, which can deal with large numbers of predictor variables even in the presence of complex interactions, have been applied successfully in genetics, clinical medicine, and…
Ensemble of trees approaches to risk adjustment for evaluating a hospital's performance.
Liu, Yang; Traskin, Mikhail; Lorch, Scott A; George, Edward I; Small, Dylan
2015-03-01
A commonly used method for evaluating a hospital's performance on an outcome is to compare the hospital's observed outcome rate to the hospital's expected outcome rate given its patient (case) mix and service. The process of calculating the hospital's expected outcome rate given its patient mix and service is called risk adjustment (Iezzoni 1997). Risk adjustment is critical for accurately evaluating and comparing hospitals' performances since we would not want to unfairly penalize a hospital just because it treats sicker patients. The key to risk adjustment is accurately estimating the probability of an Outcome given patient characteristics. For cases with binary outcomes, the method that is commonly used in risk adjustment is logistic regression. In this paper, we consider ensemble of trees methods as alternatives for risk adjustment, including random forests and Bayesian additive regression trees (BART). Both random forests and BART are modern machine learning methods that have been shown recently to have excellent performance for prediction of outcomes in many settings. We apply these methods to carry out risk adjustment for the performance of neonatal intensive care units (NICU). We show that these ensemble of trees methods outperform logistic regression in predicting mortality among babies treated in NICU, and provide a superior method of risk adjustment compared to logistic regression.
Sankari, E Siva; Manimegalai, D
2017-12-21
Predicting membrane protein types is an important and challenging research area in bioinformatics and proteomics. Traditional biophysical methods are used to classify membrane protein types. Due to large exploration of uncharacterized protein sequences in databases, traditional methods are very time consuming, expensive and susceptible to errors. Hence, it is highly desirable to develop a robust, reliable, and efficient method to predict membrane protein types. Imbalanced datasets and large datasets are often handled well by decision tree classifiers. Since imbalanced datasets are taken, the performance of various decision tree classifiers such as Decision Tree (DT), Classification And Regression Tree (CART), C4.5, Random tree, REP (Reduced Error Pruning) tree, ensemble methods such as Adaboost, RUS (Random Under Sampling) boost, Rotation forest and Random forest are analysed. Among the various decision tree classifiers Random forest performs well in less time with good accuracy of 96.35%. Another inference is RUS boost decision tree classifier is able to classify one or two samples in the class with very less samples while the other classifiers such as DT, Adaboost, Rotation forest and Random forest are not sensitive for the classes with fewer samples. Also the performance of decision tree classifiers is compared with SVM (Support Vector Machine) and Naive Bayes classifier. Copyright © 2017 Elsevier Ltd. All rights reserved.
Rifai, Sami W; Urquiza Muñoz, José D; Negrón-Juárez, Robinson I; Ramírez Arévalo, Fredy R; Tello-Espinoza, Rodil; Vanderwel, Mark C; Lichstein, Jeremy W; Chambers, Jeffrey Q; Bohlman, Stephanie A
2016-10-01
Wind disturbance can create large forest blowdowns, which greatly reduces live biomass and adds uncertainty to the strength of the Amazon carbon sink. Observational studies from within the central Amazon have quantified blowdown size and estimated total mortality but have not determined which trees are most likely to die from a catastrophic wind disturbance. Also, the impact of spatial dependence upon tree mortality from wind disturbance has seldom been quantified, which is important because wind disturbance often kills clusters of trees due to large treefalls killing surrounding neighbors. We examine (1) the causes of differential mortality between adult trees from a 300-ha blowdown event in the Peruvian region of the northwestern Amazon, (2) how accounting for spatial dependence affects mortality predictions, and (3) how incorporating both differential mortality and spatial dependence affect the landscape level estimation of necromass produced from the blowdown. Standard regression and spatial regression models were used to estimate how stem diameter, wood density, elevation, and a satellite-derived disturbance metric influenced the probability of tree death from the blowdown event. The model parameters regarding tree characteristics, topography, and spatial autocorrelation of the field data were then used to determine the consequences of non-random mortality for landscape production of necromass through a simulation model. Tree mortality was highly non-random within the blowdown, where tree mortality rates were highest for trees that were large, had low wood density, and were located at high elevation. Of the differential mortality models, the non-spatial models overpredicted necromass, whereas the spatial model slightly underpredicted necromass. When parameterized from the same field data, the spatial regression model with differential mortality estimated only 7.5% more dead trees across the entire blowdown than the random mortality model, yet it estimated 51% greater necromass. We suggest that predictions of forest carbon loss from wind disturbance are sensitive to not only the underlying spatial dependence of observations, but also the biological differences between individuals that promote differential levels of mortality. © 2016 by the Ecological Society of America.
Prediction of Baseflow Index of Catchments using Machine Learning Algorithms
NASA Astrophysics Data System (ADS)
Yadav, B.; Hatfield, K.
2017-12-01
We present the results of eight machine learning techniques for predicting the baseflow index (BFI) of ungauged basins using a surrogate of catchment scale climate and physiographic data. The tested algorithms include ordinary least squares, ridge regression, least absolute shrinkage and selection operator (lasso), elasticnet, support vector machine, gradient boosted regression trees, random forests, and extremely randomized trees. Our work seeks to identify the dominant controls of BFI that can be readily obtained from ancillary geospatial databases and remote sensing measurements, such that the developed techniques can be extended to ungauged catchments. More than 800 gauged catchments spanning the continental United States were selected to develop the general methodology. The BFI calculation was based on the baseflow separated from daily streamflow hydrograph using HYSEP filter. The surrogate catchment attributes were compiled from multiple sources including digital elevation model, soil, landuse, climate data, other publicly available ancillary and geospatial data. 80% catchments were used to train the ML algorithms, and the remaining 20% of the catchments were used as an independent test set to measure the generalization performance of fitted models. A k-fold cross-validation using exhaustive grid search was used to fit the hyperparameters of each model. Initial model development was based on 19 independent variables, but after variable selection and feature ranking, we generated revised sparse models of BFI prediction that are based on only six catchment attributes. These key predictive variables selected after the careful evaluation of bias-variance tradeoff include average catchment elevation, slope, fraction of sand, permeability, temperature, and precipitation. The most promising algorithms exceeding an accuracy score (r-square) of 0.7 on test data include support vector machine, gradient boosted regression trees, random forests, and extremely randomized trees. Considering both the accuracy and the computational complexity of these algorithms, we identify the extremely randomized trees as the best performing algorithm for BFI prediction in ungauged basins.
An application of quantile random forests for predictive mapping of forest attributes
E.A. Freeman; G.G. Moisen
2015-01-01
Increasingly, random forest models are used in predictive mapping of forest attributes. Traditional random forests output the mean prediction from the random trees. Quantile regression forests (QRF) is an extension of random forests developed by Nicolai Meinshausen that provides non-parametric estimates of the median predicted value as well as prediction quantiles. It...
Assessing the predictive capability of randomized tree-based ensembles in streamflow modelling
NASA Astrophysics Data System (ADS)
Galelli, S.; Castelletti, A.
2013-02-01
Combining randomization methods with ensemble prediction is emerging as an effective option to balance accuracy and computational efficiency in data-driven modeling. In this paper we investigate the prediction capability of extremely randomized trees (Extra-Trees), in terms of accuracy, explanation ability and computational efficiency, in a streamflow modeling exercise. Extra-Trees are a totally randomized tree-based ensemble method that (i) alleviates the poor generalization property and tendency to overfitting of traditional standalone decision trees (e.g. CART); (ii) is computationally very efficient; and, (iii) allows to infer the relative importance of the input variables, which might help in the ex-post physical interpretation of the model. The Extra-Trees potential is analyzed on two real-world case studies (Marina catchment (Singapore) and Canning River (Western Australia)) representing two different morphoclimatic contexts comparatively with other tree-based methods (CART and M5) and parametric data-driven approaches (ANNs and multiple linear regression). Results show that Extra-Trees perform comparatively well to the best of the benchmarks (i.e. M5) in both the watersheds, while outperforming the other approaches in terms of computational requirement when adopted on large datasets. In addition, the ranking of the input variable provided can be given a physically meaningful interpretation.
Assessing the predictive capability of randomized tree-based ensembles in streamflow modelling
NASA Astrophysics Data System (ADS)
Galelli, S.; Castelletti, A.
2013-07-01
Combining randomization methods with ensemble prediction is emerging as an effective option to balance accuracy and computational efficiency in data-driven modelling. In this paper, we investigate the prediction capability of extremely randomized trees (Extra-Trees), in terms of accuracy, explanation ability and computational efficiency, in a streamflow modelling exercise. Extra-Trees are a totally randomized tree-based ensemble method that (i) alleviates the poor generalisation property and tendency to overfitting of traditional standalone decision trees (e.g. CART); (ii) is computationally efficient; and, (iii) allows to infer the relative importance of the input variables, which might help in the ex-post physical interpretation of the model. The Extra-Trees potential is analysed on two real-world case studies - Marina catchment (Singapore) and Canning River (Western Australia) - representing two different morphoclimatic contexts. The evaluation is performed against other tree-based methods (CART and M5) and parametric data-driven approaches (ANNs and multiple linear regression). Results show that Extra-Trees perform comparatively well to the best of the benchmarks (i.e. M5) in both the watersheds, while outperforming the other approaches in terms of computational requirement when adopted on large datasets. In addition, the ranking of the input variable provided can be given a physically meaningful interpretation.
Chen, Carla Chia-Ming; Schwender, Holger; Keith, Jonathan; Nunkesser, Robin; Mengersen, Kerrie; Macrossan, Paula
2011-01-01
Due to advancements in computational ability, enhanced technology and a reduction in the price of genotyping, more data are being generated for understanding genetic associations with diseases and disorders. However, with the availability of large data sets comes the inherent challenges of new methods of statistical analysis and modeling. Considering a complex phenotype may be the effect of a combination of multiple loci, various statistical methods have been developed for identifying genetic epistasis effects. Among these methods, logic regression (LR) is an intriguing approach incorporating tree-like structures. Various methods have built on the original LR to improve different aspects of the model. In this study, we review four variations of LR, namely Logic Feature Selection, Monte Carlo Logic Regression, Genetic Programming for Association Studies, and Modified Logic Regression-Gene Expression Programming, and investigate the performance of each method using simulated and real genotype data. We contrast these with another tree-like approach, namely Random Forests, and a Bayesian logistic regression with stochastic search variable selection.
Anderson, Weston; Guikema, Seth; Zaitchik, Ben; Pan, William
2014-01-01
Obtaining accurate small area estimates of population is essential for policy and health planning but is often difficult in countries with limited data. In lieu of available population data, small area estimate models draw information from previous time periods or from similar areas. This study focuses on model-based methods for estimating population when no direct samples are available in the area of interest. To explore the efficacy of tree-based models for estimating population density, we compare six different model structures including Random Forest and Bayesian Additive Regression Trees. Results demonstrate that without information from prior time periods, non-parametric tree-based models produced more accurate predictions than did conventional regression methods. Improving estimates of population density in non-sampled areas is important for regions with incomplete census data and has implications for economic, health and development policies.
Anderson, Weston; Guikema, Seth; Zaitchik, Ben; Pan, William
2014-01-01
Obtaining accurate small area estimates of population is essential for policy and health planning but is often difficult in countries with limited data. In lieu of available population data, small area estimate models draw information from previous time periods or from similar areas. This study focuses on model-based methods for estimating population when no direct samples are available in the area of interest. To explore the efficacy of tree-based models for estimating population density, we compare six different model structures including Random Forest and Bayesian Additive Regression Trees. Results demonstrate that without information from prior time periods, non-parametric tree-based models produced more accurate predictions than did conventional regression methods. Improving estimates of population density in non-sampled areas is important for regions with incomplete census data and has implications for economic, health and development policies. PMID:24992657
NASA Astrophysics Data System (ADS)
Rooper, Christopher N.; Zimmermann, Mark; Prescott, Megan M.
2017-08-01
Deep-sea coral and sponge ecosystems are widespread throughout most of Alaska's marine waters, and are associated with many different species of fishes and invertebrates. These ecosystems are vulnerable to the effects of commercial fishing activities and climate change. We compared four commonly used species distribution models (general linear models, generalized additive models, boosted regression trees and random forest models) and an ensemble model to predict the presence or absence and abundance of six groups of benthic invertebrate taxa in the Gulf of Alaska. All four model types performed adequately on training data for predicting presence and absence, with regression forest models having the best overall performance measured by the area under the receiver-operating-curve (AUC). The models also performed well on the test data for presence and absence with average AUCs ranging from 0.66 to 0.82. For the test data, ensemble models performed the best. For abundance data, there was an obvious demarcation in performance between the two regression-based methods (general linear models and generalized additive models), and the tree-based models. The boosted regression tree and random forest models out-performed the other models by a wide margin on both the training and testing data. However, there was a significant drop-off in performance for all models of invertebrate abundance ( 50%) when moving from the training data to the testing data. Ensemble model performance was between the tree-based and regression-based methods. The maps of predictions from the models for both presence and abundance agreed very well across model types, with an increase in variability in predictions for the abundance data. We conclude that where data conforms well to the modeled distribution (such as the presence-absence data and binomial distribution in this study), the four types of models will provide similar results, although the regression-type models may be more consistent with biological theory. For data with highly zero-inflated distributions and non-normal distributions such as the abundance data from this study, the tree-based methods performed better. Ensemble models that averaged predictions across the four model types, performed better than the GLM or GAM models but slightly poorer than the tree-based methods, suggesting ensemble models might be more robust to overfitting than tree methods, while mitigating some of the disadvantages in predictive performance of regression methods.
Naghibi, Seyed Amir; Pourghasemi, Hamid Reza; Dixon, Barnali
2016-01-01
Groundwater is considered one of the most valuable fresh water resources. The main objective of this study was to produce groundwater spring potential maps in the Koohrang Watershed, Chaharmahal-e-Bakhtiari Province, Iran, using three machine learning models: boosted regression tree (BRT), classification and regression tree (CART), and random forest (RF). Thirteen hydrological-geological-physiographical (HGP) factors that influence locations of springs were considered in this research. These factors include slope degree, slope aspect, altitude, topographic wetness index (TWI), slope length (LS), plan curvature, profile curvature, distance to rivers, distance to faults, lithology, land use, drainage density, and fault density. Subsequently, groundwater spring potential was modeled and mapped using CART, RF, and BRT algorithms. The predicted results from the three models were validated using the receiver operating characteristics curve (ROC). From 864 springs identified, 605 (≈70 %) locations were used for the spring potential mapping, while the remaining 259 (≈30 %) springs were used for the model validation. The area under the curve (AUC) for the BRT model was calculated as 0.8103 and for CART and RF the AUC were 0.7870 and 0.7119, respectively. Therefore, it was concluded that the BRT model produced the best prediction results while predicting locations of springs followed by CART and RF models, respectively. Geospatially integrated BRT, CART, and RF methods proved to be useful in generating the spring potential map (SPM) with reasonable accuracy.
Simulation of land use change in the three gorges reservoir area based on CART-CA
NASA Astrophysics Data System (ADS)
Yuan, Min
2018-05-01
This study proposes a new method to simulate spatiotemporal complex multiple land uses by using classification and regression tree algorithm (CART) based CA model. In this model, we use classification and regression tree algorithm to calculate land class conversion probability, and combine neighborhood factor, random factor to extract cellular transformation rules. The overall Kappa coefficient is 0.8014 and the overall accuracy is 0.8821 in the land dynamic simulation results of the three gorges reservoir area from 2000 to 2010, and the simulation results are satisfactory.
Hayes, Timothy; Usami, Satoshi; Jacobucci, Ross; McArdle, John J
2015-12-01
In this article, we describe a recent development in the analysis of attrition: using classification and regression trees (CART) and random forest methods to generate inverse sampling weights. These flexible machine learning techniques have the potential to capture complex nonlinear, interactive selection models, yet to our knowledge, their performance in the missing data analysis context has never been evaluated. To assess the potential benefits of these methods, we compare their performance with commonly employed multiple imputation and complete case techniques in 2 simulations. These initial results suggest that weights computed from pruned CART analyses performed well in terms of both bias and efficiency when compared with other methods. We discuss the implications of these findings for applied researchers. (c) 2015 APA, all rights reserved).
Hayes, Timothy; Usami, Satoshi; Jacobucci, Ross; McArdle, John J.
2016-01-01
In this article, we describe a recent development in the analysis of attrition: using classification and regression trees (CART) and random forest methods to generate inverse sampling weights. These flexible machine learning techniques have the potential to capture complex nonlinear, interactive selection models, yet to our knowledge, their performance in the missing data analysis context has never been evaluated. To assess the potential benefits of these methods, we compare their performance with commonly employed multiple imputation and complete case techniques in 2 simulations. These initial results suggest that weights computed from pruned CART analyses performed well in terms of both bias and efficiency when compared with other methods. We discuss the implications of these findings for applied researchers. PMID:26389526
Improving ensemble decision tree performance using Adaboost and Bagging
NASA Astrophysics Data System (ADS)
Hasan, Md. Rajib; Siraj, Fadzilah; Sainin, Mohd Shamrie
2015-12-01
Ensemble classifier systems are considered as one of the most promising in medical data classification and the performance of deceision tree classifier can be increased by the ensemble method as it is proven to be better than single classifiers. However, in a ensemble settings the performance depends on the selection of suitable base classifier. This research employed two prominent esemble s namely Adaboost and Bagging with base classifiers such as Random Forest, Random Tree, j48, j48grafts and Logistic Model Regression (LMT) that have been selected independently. The empirical study shows that the performance varries when different base classifiers are selected and even some places overfitting issue also been noted. The evidence shows that ensemble decision tree classfiers using Adaboost and Bagging improves the performance of selected medical data sets.
Suchetana, Bihu; Rajagopalan, Balaji; Silverstein, JoAnn
2017-11-15
A regression tree-based diagnostic approach is developed to evaluate factors affecting US wastewater treatment plant compliance with ammonia discharge permit limits using Discharge Monthly Report (DMR) data from a sample of 106 municipal treatment plants for the period of 2004-2008. Predictor variables used to fit the regression tree are selected using random forests, and consist of the previous month's effluent ammonia, influent flow rates and plant capacity utilization. The tree models are first used to evaluate compliance with existing ammonia discharge standards at each facility and then applied assuming more stringent discharge limits, under consideration in many states. The model predicts that the ability to meet both current and future limits depends primarily on the previous month's treatment performance. With more stringent discharge limits predicted ammonia concentration relative to the discharge limit, increases. In-sample validation shows that the regression trees can provide a median classification accuracy of >70%. The regression tree model is validated using ammonia discharge data from an operating wastewater treatment plant and is able to accurately predict the observed ammonia discharge category approximately 80% of the time, indicating that the regression tree model can be applied to predict compliance for individual treatment plants providing practical guidance for utilities and regulators with an interest in controlling ammonia discharges. The proposed methodology is also used to demonstrate how to delineate reliable sources of demand and supply in a point source-to-point source nutrient credit trading scheme, as well as how planners and decision makers can set reasonable discharge limits in future. Copyright © 2017 Elsevier B.V. All rights reserved.
Zhao, Yang; Zheng, Wei; Zhuo, Daisy Y; Lu, Yuefeng; Ma, Xiwen; Liu, Hengchang; Zeng, Zhen; Laird, Glen
2017-10-11
Personalized medicine, or tailored therapy, has been an active and important topic in recent medical research. Many methods have been proposed in the literature for predictive biomarker detection and subgroup identification. In this article, we propose a novel decision tree-based approach applicable in randomized clinical trials. We model the prognostic effects of the biomarkers using additive regression trees and the biomarker-by-treatment effect using a single regression tree. Bayesian approach is utilized to periodically revise the split variables and the split rules of the decision trees, which provides a better overall fitting. Gibbs sampler is implemented in the MCMC procedure, which updates the prognostic trees and the interaction tree separately. We use the posterior distribution of the interaction tree to construct the predictive scores of the biomarkers and to identify the subgroup where the treatment is superior to the control. Numerical simulations show that our proposed method performs well under various settings comparing to existing methods. We also demonstrate an application of our method in a real clinical trial.
Random Forest as a Predictive Analytics Alternative to Regression in Institutional Research
ERIC Educational Resources Information Center
He, Lingjun; Levine, Richard A.; Fan, Juanjuan; Beemer, Joshua; Stronach, Jeanne
2018-01-01
In institutional research, modern data mining approaches are seldom considered to address predictive analytics problems. The goal of this paper is to highlight the advantages of tree-based machine learning algorithms over classic (logistic) regression methods for data-informed decision making in higher education problems, and stress the success of…
Hu, Chen; Steingrimsson, Jon Arni
2018-01-01
A crucial component of making individualized treatment decisions is to accurately predict each patient's disease risk. In clinical oncology, disease risks are often measured through time-to-event data, such as overall survival and progression/recurrence-free survival, and are often subject to censoring. Risk prediction models based on recursive partitioning methods are becoming increasingly popular largely due to their ability to handle nonlinear relationships, higher-order interactions, and/or high-dimensional covariates. The most popular recursive partitioning methods are versions of the Classification and Regression Tree (CART) algorithm, which builds a simple interpretable tree structured model. With the aim of increasing prediction accuracy, the random forest algorithm averages multiple CART trees, creating a flexible risk prediction model. Risk prediction models used in clinical oncology commonly use both traditional demographic and tumor pathological factors as well as high-dimensional genetic markers and treatment parameters from multimodality treatments. In this article, we describe the most commonly used extensions of the CART and random forest algorithms to right-censored outcomes. We focus on how they differ from the methods for noncensored outcomes, and how the different splitting rules and methods for cost-complexity pruning impact these algorithms. We demonstrate these algorithms by analyzing a randomized Phase III clinical trial of breast cancer. We also conduct Monte Carlo simulations to compare the prediction accuracy of survival forests with more commonly used regression models under various scenarios. These simulation studies aim to evaluate how sensitive the prediction accuracy is to the underlying model specifications, the choice of tuning parameters, and the degrees of missing covariates.
Louys, Julien; Meloro, Carlo; Elton, Sarah; Ditchfield, Peter; Bishop, Laura C
2015-01-01
We test the performance of two models that use mammalian communities to reconstruct multivariate palaeoenvironments. While both models exploit the correlation between mammal communities (defined in terms of functional groups) and arboreal heterogeneity, the first uses a multiple multivariate regression of community structure and arboreal heterogeneity, while the second uses a linear regression of the principal components of each ecospace. The success of these methods means the palaeoenvironment of a particular locality can be reconstructed in terms of the proportions of heavy, moderate, light, and absent tree canopy cover. The linear regression is less biased, and more precisely and accurately reconstructs heavy tree canopy cover than the multiple multivariate model. However, the multiple multivariate model performs better than the linear regression for all other canopy cover categories. Both models consistently perform better than randomly generated reconstructions. We apply both models to the palaeocommunity of the Upper Laetolil Beds, Tanzania. Our reconstructions indicate that there was very little heavy tree cover at this site (likely less than 10%), with the palaeo-landscape instead comprising a mixture of light and absent tree cover. These reconstructions help resolve the previous conflicting palaeoecological reconstructions made for this site. Copyright © 2014 Elsevier Ltd. All rights reserved.
Austin, Peter C.; Tu, Jack V.; Ho, Jennifer E.; Levy, Daniel; Lee, Douglas S.
2014-01-01
Objective Physicians classify patients into those with or without a specific disease. Furthermore, there is often interest in classifying patients according to disease etiology or subtype. Classification trees are frequently used to classify patients according to the presence or absence of a disease. However, classification trees can suffer from limited accuracy. In the data-mining and machine learning literature, alternate classification schemes have been developed. These include bootstrap aggregation (bagging), boosting, random forests, and support vector machines. Study design and Setting We compared the performance of these classification methods with those of conventional classification trees to classify patients with heart failure according to the following sub-types: heart failure with preserved ejection fraction (HFPEF) vs. heart failure with reduced ejection fraction (HFREF). We also compared the ability of these methods to predict the probability of the presence of HFPEF with that of conventional logistic regression. Results We found that modern, flexible tree-based methods from the data mining literature offer substantial improvement in prediction and classification of heart failure sub-type compared to conventional classification and regression trees. However, conventional logistic regression had superior performance for predicting the probability of the presence of HFPEF compared to the methods proposed in the data mining literature. Conclusion The use of tree-based methods offers superior performance over conventional classification and regression trees for predicting and classifying heart failure subtypes in a population-based sample of patients from Ontario. However, these methods do not offer substantial improvements over logistic regression for predicting the presence of HFPEF. PMID:23384592
NASA Astrophysics Data System (ADS)
Pham, Binh Thai; Prakash, Indra; Tien Bui, Dieu
2018-02-01
A hybrid machine learning approach of Random Subspace (RSS) and Classification And Regression Trees (CART) is proposed to develop a model named RSSCART for spatial prediction of landslides. This model is a combination of the RSS method which is known as an efficient ensemble technique and the CART which is a state of the art classifier. The Luc Yen district of Yen Bai province, a prominent landslide prone area of Viet Nam, was selected for the model development. Performance of the RSSCART model was evaluated through the Receiver Operating Characteristic (ROC) curve, statistical analysis methods, and the Chi Square test. Results were compared with other benchmark landslide models namely Support Vector Machines (SVM), single CART, Naïve Bayes Trees (NBT), and Logistic Regression (LR). In the development of model, ten important landslide affecting factors related with geomorphology, geology and geo-environment were considered namely slope angles, elevation, slope aspect, curvature, lithology, distance to faults, distance to rivers, distance to roads, and rainfall. Performance of the RSSCART model (AUC = 0.841) is the best compared with other popular landslide models namely SVM (0.835), single CART (0.822), NBT (0.821), and LR (0.723). These results indicate that performance of the RSSCART is a promising method for spatial landslide prediction.
Red-shouldered hawk nesting habitat preference in south Texas
Strobel, Bradley N.; Boal, Clint W.
2010-01-01
We examined nesting habitat preference by red-shouldered hawks Buteo lineatus using conditional logistic regression on characteristics measured at 27 occupied nest sites and 68 unused sites in 2005–2009 in south Texas. We measured vegetation characteristics of individual trees (nest trees and unused trees) and corresponding 0.04-ha plots. We evaluated the importance of tree and plot characteristics to nesting habitat selection by comparing a priori tree-specific and plot-specific models using Akaike's information criterion. Models with only plot variables carried 14% more weight than models with only center tree variables. The model-averaged odds ratios indicated red-shouldered hawks selected to nest in taller trees and in areas with higher average diameter at breast height than randomly available within the forest stand. Relative to randomly selected areas, each 1-m increase in nest tree height and 1-cm increase in the plot average diameter at breast height increased the probability of selection by 85% and 10%, respectively. Our results indicate that red-shouldered hawks select nesting habitat based on vegetation characteristics of individual trees as well as the 0.04-ha area surrounding the tree. Our results indicate forest management practices resulting in tall forest stands with large average diameter at breast height would benefit red-shouldered hawks in south Texas.
NASA Astrophysics Data System (ADS)
Dokuchaev, P. M.; Meshalkina, J. L.; Yaroslavtsev, A. M.
2018-01-01
Comparative analysis of soils geospatial modeling using multinomial logistic regression, decision trees, random forest, regression trees and support vector machines algorithms was conducted. The visual interpretation of the digital maps obtained and their comparison with the existing map, as well as the quantitative assessment of the individual soil groups detection overall accuracy and of the models kappa showed that multiple logistic regression, support vector method, and random forest models application with spatial prediction of the conditional soil groups distribution can be reliably used for mapping of the study area. It has shown the most accurate detection for sod-podzolics soils (Phaeozems Albic) lightly eroded and moderately eroded soils. In second place, according to the mean overall accuracy of the prediction, there are sod-podzolics soils - non-eroded and warp one, as well as sod-gley soils (Umbrisols Gleyic) and alluvial soils (Fluvisols Dystric, Umbric). Heavy eroded sod-podzolics and gray forest soils (Phaeozems Albic) were detected by methods of automatic classification worst of all.
Nattee, Cholwich; Khamsemanan, Nirattaya; Lawtrakul, Luckhana; Toochinda, Pisanu; Hannongbua, Supa
2017-01-01
Malaria is still one of the most serious diseases in tropical regions. This is due in part to the high resistance against available drugs for the inhibition of parasites, Plasmodium, the cause of the disease. New potent compounds with high clinical utility are urgently needed. In this work, we created a novel model using a regression tree to study structure-activity relationships and predict the inhibition constant, K i of three different antimalarial analogues (Trimethoprim, Pyrimethamine, and Cycloguanil) based on their molecular descriptors. To the best of our knowledge, this work is the first attempt to study the structure-activity relationships of all three analogues combined. The most relevant descriptors and appropriate parameters of the regression tree are harvested using extremely randomized trees. These descriptors are water accessible surface area, Log of the aqueous solubility, total hydrophobic van der Waals surface area, and molecular refractivity. Out of all possible combinations of these selected parameters and descriptors, the tree with the strongest coefficient of determination is selected to be our prediction model. Predicted K i values from the proposed model show a strong coefficient of determination, R 2 =0.996, to experimental K i values. From the structure of the regression tree, compounds with high accessible surface area of all hydrophobic atoms (ASA_H) and low aqueous solubility of inhibitors (Log S) generally possess low K i values. Our prediction model can also be utilized as a screening test for new antimalarial drug compounds which may reduce the time and expenses for new drug development. New compounds with high predicted K i should be excluded from further drug development. It is also our inference that a threshold of ASA_H greater than 575.80 and Log S less than or equal to -4.36 is a sufficient condition for a new compound to possess a low K i . Copyright © 2016 Elsevier Inc. All rights reserved.
Finding structure in data using multivariate tree boosting
Miller, Patrick J.; Lubke, Gitta H.; McArtor, Daniel B.; Bergeman, C. S.
2016-01-01
Technology and collaboration enable dramatic increases in the size of psychological and psychiatric data collections, but finding structure in these large data sets with many collected variables is challenging. Decision tree ensembles such as random forests (Strobl, Malley, & Tutz, 2009) are a useful tool for finding structure, but are difficult to interpret with multiple outcome variables which are often of interest in psychology. To find and interpret structure in data sets with multiple outcomes and many predictors (possibly exceeding the sample size), we introduce a multivariate extension to a decision tree ensemble method called gradient boosted regression trees (Friedman, 2001). Our extension, multivariate tree boosting, is a method for nonparametric regression that is useful for identifying important predictors, detecting predictors with nonlinear effects and interactions without specification of such effects, and for identifying predictors that cause two or more outcome variables to covary. We provide the R package ‘mvtboost’ to estimate, tune, and interpret the resulting model, which extends the implementation of univariate boosting in the R package ‘gbm’ (Ridgeway et al., 2015) to continuous, multivariate outcomes. To illustrate the approach, we analyze predictors of psychological well-being (Ryff & Keyes, 1995). Simulations verify that our approach identifies predictors with nonlinear effects and achieves high prediction accuracy, exceeding or matching the performance of (penalized) multivariate multiple regression and multivariate decision trees over a wide range of conditions. PMID:27918183
Ramírez, J; Górriz, J M; Segovia, F; Chaves, R; Salas-Gonzalez, D; López, M; Alvarez, I; Padilla, P
2010-03-19
This letter shows a computer aided diagnosis (CAD) technique for the early detection of the Alzheimer's disease (AD) by means of single photon emission computed tomography (SPECT) image classification. The proposed method is based on partial least squares (PLS) regression model and a random forest (RF) predictor. The challenge of the curse of dimensionality is addressed by reducing the large dimensionality of the input data by downscaling the SPECT images and extracting score features using PLS. A RF predictor then forms an ensemble of classification and regression tree (CART)-like classifiers being its output determined by a majority vote of the trees in the forest. A baseline principal component analysis (PCA) system is also developed for reference. The experimental results show that the combined PLS-RF system yields a generalization error that converges to a limit when increasing the number of trees in the forest. Thus, the generalization error is reduced when using PLS and depends on the strength of the individual trees in the forest and the correlation between them. Moreover, PLS feature extraction is found to be more effective for extracting discriminative information from the data than PCA yielding peak sensitivity, specificity and accuracy values of 100%, 92.7%, and 96.9%, respectively. Moreover, the proposed CAD system outperformed several other recently developed AD CAD systems. Copyright 2010 Elsevier Ireland Ltd. All rights reserved.
Hart, Carl R; Reznicek, Nathan J; Wilson, D Keith; Pettit, Chris L; Nykaza, Edward T
2016-05-01
Many outdoor sound propagation models exist, ranging from highly complex physics-based simulations to simplified engineering calculations, and more recently, highly flexible statistical learning methods. Several engineering and statistical learning models are evaluated by using a particular physics-based model, namely, a Crank-Nicholson parabolic equation (CNPE), as a benchmark. Narrowband transmission loss values predicted with the CNPE, based upon a simulated data set of meteorological, boundary, and source conditions, act as simulated observations. In the simulated data set sound propagation conditions span from downward refracting to upward refracting, for acoustically hard and soft boundaries, and low frequencies. Engineering models used in the comparisons include the ISO 9613-2 method, Harmonoise, and Nord2000 propagation models. Statistical learning methods used in the comparisons include bagged decision tree regression, random forest regression, boosting regression, and artificial neural network models. Computed skill scores are relative to sound propagation in a homogeneous atmosphere over a rigid ground. Overall skill scores for the engineering noise models are 0.6%, -7.1%, and 83.8% for the ISO 9613-2, Harmonoise, and Nord2000 models, respectively. Overall skill scores for the statistical learning models are 99.5%, 99.5%, 99.6%, and 99.6% for bagged decision tree, random forest, boosting, and artificial neural network regression models, respectively.
Decision trees in epidemiological research.
Venkatasubramaniam, Ashwini; Wolfson, Julian; Mitchell, Nathan; Barnes, Timothy; JaKa, Meghan; French, Simone
2017-01-01
In many studies, it is of interest to identify population subgroups that are relatively homogeneous with respect to an outcome. The nature of these subgroups can provide insight into effect mechanisms and suggest targets for tailored interventions. However, identifying relevant subgroups can be challenging with standard statistical methods. We review the literature on decision trees, a family of techniques for partitioning the population, on the basis of covariates, into distinct subgroups who share similar values of an outcome variable. We compare two decision tree methods, the popular Classification and Regression tree (CART) technique and the newer Conditional Inference tree (CTree) technique, assessing their performance in a simulation study and using data from the Box Lunch Study, a randomized controlled trial of a portion size intervention. Both CART and CTree identify homogeneous population subgroups and offer improved prediction accuracy relative to regression-based approaches when subgroups are truly present in the data. An important distinction between CART and CTree is that the latter uses a formal statistical hypothesis testing framework in building decision trees, which simplifies the process of identifying and interpreting the final tree model. We also introduce a novel way to visualize the subgroups defined by decision trees. Our novel graphical visualization provides a more scientifically meaningful characterization of the subgroups identified by decision trees. Decision trees are a useful tool for identifying homogeneous subgroups defined by combinations of individual characteristics. While all decision tree techniques generate subgroups, we advocate the use of the newer CTree technique due to its simplicity and ease of interpretation.
Kropat, Georg; Bochud, Francois; Jaboyedoff, Michel; Laedermann, Jean-Pascal; Murith, Christophe; Palacios Gruson, Martha; Baechler, Sébastien
2015-09-01
According to estimations around 230 people die as a result of radon exposure in Switzerland. This public health concern makes reliable indoor radon prediction and mapping methods necessary in order to improve risk communication to the public. The aim of this study was to develop an automated method to classify lithological units according to their radon characteristics and to develop mapping and predictive tools in order to improve local radon prediction. About 240 000 indoor radon concentration (IRC) measurements in about 150 000 buildings were available for our analysis. The automated classification of lithological units was based on k-medoids clustering via pair-wise Kolmogorov distances between IRC distributions of lithological units. For IRC mapping and prediction we used random forests and Bayesian additive regression trees (BART). The automated classification groups lithological units well in terms of their IRC characteristics. Especially the IRC differences in metamorphic rocks like gneiss are well revealed by this method. The maps produced by random forests soundly represent the regional difference of IRCs in Switzerland and improve the spatial detail compared to existing approaches. We could explain 33% of the variations in IRC data with random forests. Additionally, the influence of a variable evaluated by random forests shows that building characteristics are less important predictors for IRCs than spatial/geological influences. BART could explain 29% of IRC variability and produced maps that indicate the prediction uncertainty. Ensemble regression trees are a powerful tool to model and understand the multidimensional influences on IRCs. Automatic clustering of lithological units complements this method by facilitating the interpretation of radon properties of rock types. This study provides an important element for radon risk communication. Future approaches should consider taking into account further variables like soil gas radon measurements as well as more detailed geological information. Copyright © 2015 Elsevier Ltd. All rights reserved.
Salas, Eric Ariel L; Valdez, Raul; Michel, Stefan
2017-11-01
We modeled summer and winter habitat suitability of Marco Polo argali in the Pamir Mountains in southeastern Tajikistan using these statistical algorithms: Generalized Linear Model, Random Forest, Boosted Regression Tree, Maxent, and Multivariate Adaptive Regression Splines. Using sheep occurrence data collected from 2009 to 2015 and a set of selected habitat predictors, we produced summer and winter habitat suitability maps and determined the important habitat suitability predictors for both seasons. Our results demonstrated that argali selected proximity to riparian areas and greenness as the two most relevant variables for summer, and the degree of slope (gentler slopes between 0° to 20°) and Landsat temperature band for winter. The terrain roughness was also among the most important variables in summer and winter models. Aspect was only significant for winter habitat, with argali preferring south-facing mountain slopes. We evaluated various measures of model performance such as the Area Under the Curve (AUC) and the True Skill Statistic (TSS). Comparing the five algorithms, the AUC scored highest for Boosted Regression Tree in summer (AUC = 0.94) and winter model runs (AUC = 0.94). In contrast, Random Forest underperformed in both model runs.
Nunes, Matheus Henrique
2016-01-01
Tree stem form in native tropical forests is very irregular, posing a challenge to establishing taper equations that can accurately predict the diameter at any height along the stem and subsequently merchantable volume. Artificial intelligence approaches can be useful techniques in minimizing estimation errors within complex variations of vegetation. We evaluated the performance of Random Forest® regression tree and Artificial Neural Network procedures in modelling stem taper. Diameters and volume outside bark were compared to a traditional taper-based equation across a tropical Brazilian savanna, a seasonal semi-deciduous forest and a rainforest. Neural network models were found to be more accurate than the traditional taper equation. Random forest showed trends in the residuals from the diameter prediction and provided the least precise and accurate estimations for all forest types. This study provides insights into the superiority of a neural network, which provided advantages regarding the handling of local effects. PMID:27187074
Nunes, Matheus Henrique; Görgens, Eric Bastos
2016-01-01
Tree stem form in native tropical forests is very irregular, posing a challenge to establishing taper equations that can accurately predict the diameter at any height along the stem and subsequently merchantable volume. Artificial intelligence approaches can be useful techniques in minimizing estimation errors within complex variations of vegetation. We evaluated the performance of Random Forest® regression tree and Artificial Neural Network procedures in modelling stem taper. Diameters and volume outside bark were compared to a traditional taper-based equation across a tropical Brazilian savanna, a seasonal semi-deciduous forest and a rainforest. Neural network models were found to be more accurate than the traditional taper equation. Random forest showed trends in the residuals from the diameter prediction and provided the least precise and accurate estimations for all forest types. This study provides insights into the superiority of a neural network, which provided advantages regarding the handling of local effects.
Alghamdi, Manal; Al-Mallah, Mouaz; Keteyian, Steven; Brawner, Clinton; Ehrman, Jonathan; Sakr, Sherif
2017-01-01
Machine learning is becoming a popular and important approach in the field of medical research. In this study, we investigate the relative performance of various machine learning methods such as Decision Tree, Naïve Bayes, Logistic Regression, Logistic Model Tree and Random Forests for predicting incident diabetes using medical records of cardiorespiratory fitness. In addition, we apply different techniques to uncover potential predictors of diabetes. This FIT project study used data of 32,555 patients who are free of any known coronary artery disease or heart failure who underwent clinician-referred exercise treadmill stress testing at Henry Ford Health Systems between 1991 and 2009 and had a complete 5-year follow-up. At the completion of the fifth year, 5,099 of those patients have developed diabetes. The dataset contained 62 attributes classified into four categories: demographic characteristics, disease history, medication use history, and stress test vital signs. We developed an Ensembling-based predictive model using 13 attributes that were selected based on their clinical importance, Multiple Linear Regression, and Information Gain Ranking methods. The negative effect of the imbalance class of the constructed model was handled by Synthetic Minority Oversampling Technique (SMOTE). The overall performance of the predictive model classifier was improved by the Ensemble machine learning approach using the Vote method with three Decision Trees (Naïve Bayes Tree, Random Forest, and Logistic Model Tree) and achieved high accuracy of prediction (AUC = 0.92). The study shows the potential of ensembling and SMOTE approaches for predicting incident diabetes using cardiorespiratory fitness data.
2013-01-01
Objectives. Global perceptions of stress (GPS) have major implications for mental and physical health, and stress in midlife may influence adaptation in later life. Thus, it is important to determine the unique and interactive effects of diverse influences of role stress (at work or in personal relationships), loneliness, life events, time pressure, caregiving, finances, discrimination, and neighborhood circumstances on these GPS. Method. Exploratory regression trees and random forests were used to examine complex interactions among myriad events and chronic stressors in middle-aged participants’ (N = 410; mean age = 52.12) GPS. Results. Different role and domain stressors were influential at high and low levels of loneliness. Varied combinations of these stressors resulting in similar levels of perceived stress are also outlined as examples of equifinality. Loneliness emerged as an important predictor across trees. Discussion. Exploring multiple stressors simultaneously provides insights into the diversity of stressor combinations across individuals—even those with similar levels of global perceived stress—and answers theoretical mandates to better understand the influence of stress by sampling from many domain and role stressors. Further, the unique influences of each predictor relative to the others inform theory and applied work. Finally, examples of equifinality and multifinality call for targeted interventions. PMID:23341437
[Hyperspectral Estimation of Apple Tree Canopy LAI Based on SVM and RF Regression].
Han, Zhao-ying; Zhu, Xi-cun; Fang, Xian-yi; Wang, Zhuo-yuan; Wang, Ling; Zhao, Geng-Xing; Jiang, Yuan-mao
2016-03-01
Leaf area index (LAI) is the dynamic index of crop population size. Hyperspectral technology can be used to estimate apple canopy LAI rapidly and nondestructively. It can be provide a reference for monitoring the tree growing and yield estimation. The Red Fuji apple trees of full bearing fruit are the researching objects. Ninety apple trees canopies spectral reflectance and LAI values were measured by the ASD Fieldspec3 spectrometer and LAI-2200 in thirty orchards in constant two years in Qixia research area of Shandong Province. The optimal vegetation indices were selected by the method of correlation analysis of the original spectral reflectance and vegetation indices. The models of predicting the LAI were built with the multivariate regression analysis method of support vector machine (SVM) and random forest (RF). The new vegetation indices, GNDVI527, ND-VI676, RVI682, FD-NVI656 and GRVI517 and the previous two main vegetation indices, NDVI670 and NDVI705, are in accordance with LAI. In the RF regression model, the calibration set decision coefficient C-R2 of 0.920 and validation set decision coefficient V-R2 of 0.889 are higher than the SVM regression model by 0.045 and 0.033 respectively. The root mean square error of calibration set C-RMSE of 0.249, the root mean square error validation set V-RMSE of 0.236 are lower than that of the SVM regression model by 0.054 and 0.058 respectively. Relative analysis of calibrating error C-RPD and relative analysis of validation set V-RPD reached 3.363 and 2.520, 0.598 and 0.262, respectively, which were higher than the SVM regression model. The measured and predicted the scatterplot trend line slope of the calibration set and validation set C-S and V-S are close to 1. The estimation result of RF regression model is better than that of the SVM. RF regression model can be used to estimate the LAI of red Fuji apple trees in full fruit period.
NASA Astrophysics Data System (ADS)
Muller, Sybrand Jacobus; van Niekerk, Adriaan
2016-07-01
Soil salinity often leads to reduced crop yield and quality and can render soils barren. Irrigated areas are particularly at risk due to intensive cultivation and secondary salinization caused by waterlogging. Regular monitoring of salt accumulation in irrigation schemes is needed to keep its negative effects under control. The dynamic spatial and temporal characteristics of remote sensing can provide a cost-effective solution for monitoring salt accumulation at irrigation scheme level. This study evaluated a range of pan-fused SPOT-5 derived features (spectral bands, vegetation indices, image textures and image transformations) for classifying salt-affected areas in two distinctly different irrigation schemes in South Africa, namely Vaalharts and Breede River. The relationship between the input features and electro conductivity measurements were investigated using regression modelling (stepwise linear regression, partial least squares regression, curve fit regression modelling) and supervised classification (maximum likelihood, nearest neighbour, decision tree analysis, support vector machine and random forests). Classification and regression trees and random forest were used to select the most important features for differentiating salt-affected and unaffected areas. The results showed that the regression analyses produced weak models (<0.4 R squared). Better results were achieved using the supervised classifiers, but the algorithms tend to over-estimate salt-affected areas. A key finding was that none of the feature sets or classification algorithms stood out as being superior for monitoring salt accumulation at irrigation scheme level. This was attributed to the large variations in the spectral responses of different crops types at different growing stages, coupled with their individual tolerances to saline conditions.
NASA Astrophysics Data System (ADS)
Deng, Chengbin; Wu, Changshan
2013-12-01
Urban impervious surface information is essential for urban and environmental applications at the regional/national scales. As a popular image processing technique, spectral mixture analysis (SMA) has rarely been applied to coarse-resolution imagery due to the difficulty of deriving endmember spectra using traditional endmember selection methods, particularly within heterogeneous urban environments. To address this problem, we derived endmember signatures through a least squares solution (LSS) technique with known abundances of sample pixels, and integrated these endmember signatures into SMA for mapping large-scale impervious surface fraction. In addition, with the same sample set, we carried out objective comparative analyses among SMA (i.e. fully constrained and unconstrained SMA) and machine learning (i.e. Cubist regression tree and Random Forests) techniques. Analysis of results suggests three major conclusions. First, with the extrapolated endmember spectra from stratified random training samples, the SMA approaches performed relatively well, as indicated by small MAE values. Second, Random Forests yields more reliable results than Cubist regression tree, and its accuracy is improved with increased sample sizes. Finally, comparative analyses suggest a tentative guide for selecting an optimal approach for large-scale fractional imperviousness estimation: unconstrained SMA might be a favorable option with a small number of samples, while Random Forests might be preferred if a large number of samples are available.
Liu, Rong; Li, Xi; Zhang, Wei; Zhou, Hong-Hao
2015-01-01
Objective Multiple linear regression (MLR) and machine learning techniques in pharmacogenetic algorithm-based warfarin dosing have been reported. However, performances of these algorithms in racially diverse group have never been objectively evaluated and compared. In this literature-based study, we compared the performances of eight machine learning techniques with those of MLR in a large, racially-diverse cohort. Methods MLR, artificial neural network (ANN), regression tree (RT), multivariate adaptive regression splines (MARS), boosted regression tree (BRT), support vector regression (SVR), random forest regression (RFR), lasso regression (LAR) and Bayesian additive regression trees (BART) were applied in warfarin dose algorithms in a cohort from the International Warfarin Pharmacogenetics Consortium database. Covariates obtained by stepwise regression from 80% of randomly selected patients were used to develop algorithms. To compare the performances of these algorithms, the mean percentage of patients whose predicted dose fell within 20% of the actual dose (mean percentage within 20%) and the mean absolute error (MAE) were calculated in the remaining 20% of patients. The performances of these techniques in different races, as well as the dose ranges of therapeutic warfarin were compared. Robust results were obtained after 100 rounds of resampling. Results BART, MARS and SVR were statistically indistinguishable and significantly out performed all the other approaches in the whole cohort (MAE: 8.84–8.96 mg/week, mean percentage within 20%: 45.88%–46.35%). In the White population, MARS and BART showed higher mean percentage within 20% and lower mean MAE than those of MLR (all p values < 0.05). In the Asian population, SVR, BART, MARS and LAR performed the same as MLR. MLR and LAR optimally performed among the Black population. When patients were grouped in terms of warfarin dose range, all machine learning techniques except ANN and LAR showed significantly higher mean percentage within 20%, and lower MAE (all p values < 0.05) than MLR in the low- and high- dose ranges. Conclusion Overall, machine learning-based techniques, BART, MARS and SVR performed superior than MLR in warfarin pharmacogenetic dosing. Differences of algorithms’ performances exist among the races. Moreover, machine learning-based algorithms tended to perform better in the low- and high- dose ranges than MLR. PMID:26305568
NASA Astrophysics Data System (ADS)
Mangla, Rohit; Kumar, Shashi; Nandy, Subrata
2016-05-01
SAR and LiDAR remote sensing have already shown the potential of active sensors for forest parameter retrieval. SAR sensor in its fully polarimetric mode has an advantage to retrieve scattering property of different component of forest structure and LiDAR has the capability to measure structural information with very high accuracy. This study was focused on retrieval of forest aboveground biomass (AGB) using Terrestrial Laser Scanner (TLS) based point clouds and scattering property of forest vegetation obtained from decomposition modelling of RISAT-1 fully polarimetric SAR data. TLS data was acquired for 14 plots of Timli forest range, Uttarakhand, India. The forest area is dominated by Sal trees and random sampling with plot size of 0.1 ha (31.62m*31.62m) was adopted for TLS and field data collection. RISAT-1 data was processed to retrieve SAR data based variables and TLS point clouds based 3D imaging was done to retrieve LiDAR based variables. Surface scattering, double-bounce scattering, volume scattering, helix and wire scattering were the SAR based variables retrieved from polarimetric decomposition. Tree heights and stem diameters were used as LiDAR based variables retrieved from single tree vertical height and least square circle fit methods respectively. All the variables obtained for forest plots were used as an input in a machine learning based Random Forest Regression Model, which was developed in this study for forest AGB estimation. Modelled output for forest AGB showed reliable accuracy (RMSE = 27.68 t/ha) and a good coefficient of determination (0.63) was obtained through the linear regression between modelled AGB and field-estimated AGB. The sensitivity analysis showed that the model was more sensitive for the major contributed variables (stem diameter and volume scattering) and these variables were measured from two different remote sensing techniques. This study strongly recommends the integration of SAR and LiDAR data for forest AGB estimation.
Static terrestrial laser scanning of juvenile understory trees for field phenotyping
NASA Astrophysics Data System (ADS)
Wang, Huanhuan; Lin, Yi
2014-11-01
This study was to attempt the cutting-edge 3D remote sensing technique of static terrestrial laser scanning (TLS) for parametric 3D reconstruction of juvenile understory trees. The data for test was collected with a Leica HDS6100 TLS system in a single-scan way. The geometrical structures of juvenile understory trees are extracted by model fitting. Cones are used to model trunks and branches. Principal component analysis (PCA) is adopted to calculate their major axes. Coordinate transformation and orthogonal projection are used to estimate the parameters of the cones. Then, AutoCAD is utilized to simulate the morphological characteristics of the understory trees, and to add secondary branches and leaves in a random way. Comparison of the reference values and the estimated values gives the regression equation and shows that the proposed algorithm of extracting parameters is credible. The results have basically verified the applicability of TLS for field phenotyping of juvenile understory trees.
Arevalillo, Jorge M; Sztein, Marcelo B; Kotloff, Karen L; Levine, Myron M; Simon, Jakub K
2017-10-01
Immunologic correlates of protection are important in vaccine development because they give insight into mechanisms of protection, assist in the identification of promising vaccine candidates, and serve as endpoints in bridging clinical vaccine studies. Our goal is the development of a methodology to identify immunologic correlates of protection using the Shigella challenge as a model. The proposed methodology utilizes the Random Forests (RF) machine learning algorithm as well as Classification and Regression Trees (CART) to detect immune markers that predict protection, identify interactions between variables, and define optimal cutoffs. Logistic regression modeling is applied to estimate the probability of protection and the confidence interval (CI) for such a probability is computed by bootstrapping the logistic regression models. The results demonstrate that the combination of Classification and Regression Trees and Random Forests complements the standard logistic regression and uncovers subtle immune interactions. Specific levels of immunoglobulin IgG antibody in blood on the day of challenge predicted protection in 75% (95% CI 67-86). Of those subjects that did not have blood IgG at or above a defined threshold, 100% were protected if they had IgA antibody secreting cells above a defined threshold. Comparison with the results obtained by applying only logistic regression modeling with standard Akaike Information Criterion for model selection shows the usefulness of the proposed method. Given the complexity of the immune system, the use of machine learning methods may enhance traditional statistical approaches. When applied together, they offer a novel way to quantify important immune correlates of protection that may help the development of vaccines. Copyright © 2017 Elsevier Inc. All rights reserved.
Marcus V. Warwell; Gerald E. Rehfeldt; Nicholas L. Crookston
2010-01-01
The Random Forests multiple regression tree was used to develop an empirically based bioclimatic model of the presence-absence of species occupying small geographic distributions in western North America. The species assessed were subalpine larch (Larix lyallii), smooth Arizona cypress (Cupressus arizonica ssp. glabra...
NASA Astrophysics Data System (ADS)
Rao, M.; Vuong, H.
2013-12-01
The overall objective of this study is to develop a method for estimating total aboveground biomass of redwood stands in Jackson Demonstration State Forest, Mendocino, California using airborne LiDAR data. LiDAR data owing to its vertical and horizontal accuracy are increasingly being used to characterize landscape features including ground surface elevation and canopy height. These LiDAR-derived metrics involving structural signatures at higher precision and accuracy can help better understand ecological processes at various spatial scales. Our study is focused on two major species of the forest: redwood (Sequoia semperirens [D.Don] Engl.) and Douglas-fir (Pseudotsuga mensiezii [Mirb.] Franco). Specifically, the objectives included linear regression models fitting tree diameter at breast height (dbh) to LiDAR derived height for each species. From 23 random points on the study area, field measurement (dbh and tree coordinate) were collected for more than 500 trees of Redwood and Douglas-fir over 0.2 ha- plots. The USFS-FUSION application software along with its LiDAR Data Viewer (LDV) were used to to extract Canopy Height Model (CHM) from which tree heights would be derived. Based on the LiDAR derived height and ground based dbh, a linear regression model was developed to predict dbh. The predicted dbh was used to estimate the biomass at the single tree level using Jenkin's formula (Jenkin et al 2003). The linear regression models were able to explain 65% of the variability associated with Redwood's dbh and 80% of that associated with Douglas-fir's dbh.
Application of XGBoost algorithm in hourly PM2.5 concentration prediction
NASA Astrophysics Data System (ADS)
Pan, Bingyue
2018-02-01
In view of prediction techniques of hourly PM2.5 concentration in China, this paper applied the XGBoost(Extreme Gradient Boosting) algorithm to predict hourly PM2.5 concentration. The monitoring data of air quality in Tianjin city was analyzed by using XGBoost algorithm. The prediction performance of the XGBoost method is evaluated by comparing observed and predicted PM2.5 concentration using three measures of forecast accuracy. The XGBoost method is also compared with the random forest algorithm, multiple linear regression, decision tree regression and support vector machines for regression models using computational results. The results demonstrate that the XGBoost algorithm outperforms other data mining methods.
Extensions and applications of ensemble-of-trees methods in machine learning
NASA Astrophysics Data System (ADS)
Bleich, Justin
Ensemble-of-trees algorithms have emerged to the forefront of machine learning due to their ability to generate high forecasting accuracy for a wide array of regression and classification problems. Classic ensemble methodologies such as random forests (RF) and stochastic gradient boosting (SGB) rely on algorithmic procedures to generate fits to data. In contrast, more recent ensemble techniques such as Bayesian Additive Regression Trees (BART) and Dynamic Trees (DT) focus on an underlying Bayesian probability model to generate the fits. These new probability model-based approaches show much promise versus their algorithmic counterparts, but also offer substantial room for improvement. The first part of this thesis focuses on methodological advances for ensemble-of-trees techniques with an emphasis on the more recent Bayesian approaches. In particular, we focus on extensions of BART in four distinct ways. First, we develop a more robust implementation of BART for both research and application. We then develop a principled approach to variable selection for BART as well as the ability to naturally incorporate prior information on important covariates into the algorithm. Next, we propose a method for handling missing data that relies on the recursive structure of decision trees and does not require imputation. Last, we relax the assumption of homoskedasticity in the BART model to allow for parametric modeling of heteroskedasticity. The second part of this thesis returns to the classic algorithmic approaches in the context of classification problems with asymmetric costs of forecasting errors. First we consider the performance of RF and SGB more broadly and demonstrate its superiority to logistic regression for applications in criminology with asymmetric costs. Next, we use RF to forecast unplanned hospital readmissions upon patient discharge with asymmetric costs taken into account. Finally, we explore the construction of stable decision trees for forecasts of violence during probation hearings in court systems.
Large unbalanced credit scoring using Lasso-logistic regression ensemble.
Wang, Hong; Xu, Qingsong; Zhou, Lifeng
2015-01-01
Recently, various ensemble learning methods with different base classifiers have been proposed for credit scoring problems. However, for various reasons, there has been little research using logistic regression as the base classifier. In this paper, given large unbalanced data, we consider the plausibility of ensemble learning using regularized logistic regression as the base classifier to deal with credit scoring problems. In this research, the data is first balanced and diversified by clustering and bagging algorithms. Then we apply a Lasso-logistic regression learning ensemble to evaluate the credit risks. We show that the proposed algorithm outperforms popular credit scoring models such as decision tree, Lasso-logistic regression and random forests in terms of AUC and F-measure. We also provide two importance measures for the proposed model to identify important variables in the data.
Empirical analyses of plant-climate relationships for the western United States
Gerald E. Rehfeldt; Nicholas L. Crookston; Marcus V. Warwell; Jeffrey S. Evans
2006-01-01
The Random Forests multiple-regression tree was used to model climate profiles of 25 biotic communities of the western United States and nine of their constituent species. Analyses of the communities were based on a gridded sample of ca. 140,000 points, while those for the species used presence-absence data from ca. 120,000 locations. Independent variables included 35...
Stemflow estimation in a redwood forest using model-based stratified random sampling
Jack Lewis
2003-01-01
Model-based stratified sampling is illustrated by a case study of stemflow volume in a redwood forest. The approach is actually a model-assisted sampling design in which auxiliary information (tree diameter) is utilized in the design of stratum boundaries to optimize the efficiency of a regression or ratio estimator. The auxiliary information is utilized in both the...
Mi, Chunrong; Huettmann, Falk; Guo, Yumin; Han, Xuesong; Wen, Lijia
2017-01-01
Species distribution models (SDMs) have become an essential tool in ecology, biogeography, evolution and, more recently, in conservation biology. How to generalize species distributions in large undersampled areas, especially with few samples, is a fundamental issue of SDMs. In order to explore this issue, we used the best available presence records for the Hooded Crane ( Grus monacha , n = 33), White-naped Crane ( Grus vipio , n = 40), and Black-necked Crane ( Grus nigricollis , n = 75) in China as three case studies, employing four powerful and commonly used machine learning algorithms to map the breeding distributions of the three species: TreeNet (Stochastic Gradient Boosting, Boosted Regression Tree Model), Random Forest, CART (Classification and Regression Tree) and Maxent (Maximum Entropy Models). In addition, we developed an ensemble forecast by averaging predicted probability of the above four models results. Commonly used model performance metrics (Area under ROC (AUC) and true skill statistic (TSS)) were employed to evaluate model accuracy. The latest satellite tracking data and compiled literature data were used as two independent testing datasets to confront model predictions. We found Random Forest demonstrated the best performance for the most assessment method, provided a better model fit to the testing data, and achieved better species range maps for each crane species in undersampled areas. Random Forest has been generally available for more than 20 years and has been known to perform extremely well in ecological predictions. However, while increasingly on the rise, its potential is still widely underused in conservation, (spatial) ecological applications and for inference. Our results show that it informs ecological and biogeographical theories as well as being suitable for conservation applications, specifically when the study area is undersampled. This method helps to save model-selection time and effort, and allows robust and rapid assessments and decisions for efficient conservation.
Mi, Chunrong; Huettmann, Falk; Han, Xuesong; Wen, Lijia
2017-01-01
Species distribution models (SDMs) have become an essential tool in ecology, biogeography, evolution and, more recently, in conservation biology. How to generalize species distributions in large undersampled areas, especially with few samples, is a fundamental issue of SDMs. In order to explore this issue, we used the best available presence records for the Hooded Crane (Grus monacha, n = 33), White-naped Crane (Grus vipio, n = 40), and Black-necked Crane (Grus nigricollis, n = 75) in China as three case studies, employing four powerful and commonly used machine learning algorithms to map the breeding distributions of the three species: TreeNet (Stochastic Gradient Boosting, Boosted Regression Tree Model), Random Forest, CART (Classification and Regression Tree) and Maxent (Maximum Entropy Models). In addition, we developed an ensemble forecast by averaging predicted probability of the above four models results. Commonly used model performance metrics (Area under ROC (AUC) and true skill statistic (TSS)) were employed to evaluate model accuracy. The latest satellite tracking data and compiled literature data were used as two independent testing datasets to confront model predictions. We found Random Forest demonstrated the best performance for the most assessment method, provided a better model fit to the testing data, and achieved better species range maps for each crane species in undersampled areas. Random Forest has been generally available for more than 20 years and has been known to perform extremely well in ecological predictions. However, while increasingly on the rise, its potential is still widely underused in conservation, (spatial) ecological applications and for inference. Our results show that it informs ecological and biogeographical theories as well as being suitable for conservation applications, specifically when the study area is undersampled. This method helps to save model-selection time and effort, and allows robust and rapid assessments and decisions for efficient conservation. PMID:28097060
NASA Astrophysics Data System (ADS)
Wilson, Barry T.; Knight, Joseph F.; McRoberts, Ronald E.
2018-03-01
Imagery from the Landsat Program has been used frequently as a source of auxiliary data for modeling land cover, as well as a variety of attributes associated with tree cover. With ready access to all scenes in the archive since 2008 due to the USGS Landsat Data Policy, new approaches to deriving such auxiliary data from dense Landsat time series are required. Several methods have previously been developed for use with finer temporal resolution imagery (e.g. AVHRR and MODIS), including image compositing and harmonic regression using Fourier series. The manuscript presents a study, using Minnesota, USA during the years 2009-2013 as the study area and timeframe. The study examined the relative predictive power of land cover models, in particular those related to tree cover, using predictor variables based solely on composite imagery versus those using estimated harmonic regression coefficients. The study used two common non-parametric modeling approaches (i.e. k-nearest neighbors and random forests) for fitting classification and regression models of multiple attributes measured on USFS Forest Inventory and Analysis plots using all available Landsat imagery for the study area and timeframe. The estimated Fourier coefficients developed by harmonic regression of tasseled cap transformation time series data were shown to be correlated with land cover, including tree cover. Regression models using estimated Fourier coefficients as predictor variables showed a two- to threefold increase in explained variance for a small set of continuous response variables, relative to comparable models using monthly image composites. Similarly, the overall accuracies of classification models using the estimated Fourier coefficients were approximately 10-20 percentage points higher than the models using the image composites, with corresponding individual class accuracies between six and 45 percentage points higher.
Shafizadeh-Moghadam, Hossein; Valavi, Roozbeh; Shahabi, Himan; Chapi, Kamran; Shirzadi, Ataollah
2018-07-01
In this research, eight individual machine learning and statistical models are implemented and compared, and based on their results, seven ensemble models for flood susceptibility assessment are introduced. The individual models included artificial neural networks, classification and regression trees, flexible discriminant analysis, generalized linear model, generalized additive model, boosted regression trees, multivariate adaptive regression splines, and maximum entropy, and the ensemble models were Ensemble Model committee averaging (EMca), Ensemble Model confidence interval Inferior (EMciInf), Ensemble Model confidence interval Superior (EMciSup), Ensemble Model to estimate the coefficient of variation (EMcv), Ensemble Model to estimate the mean (EMmean), Ensemble Model to estimate the median (EMmedian), and Ensemble Model based on weighted mean (EMwmean). The data set covered 201 flood events in the Haraz watershed (Mazandaran province in Iran) and 10,000 randomly selected non-occurrence points. Among the individual models, the Area Under the Receiver Operating Characteristic (AUROC), which showed the highest value, belonged to boosted regression trees (0.975) and the lowest value was recorded for generalized linear model (0.642). On the other hand, the proposed EMmedian resulted in the highest accuracy (0.976) among all models. In spite of the outstanding performance of some models, nevertheless, variability among the prediction of individual models was considerable. Therefore, to reduce uncertainty, creating more generalizable, more stable, and less sensitive models, ensemble forecasting approaches and in particular the EMmedian is recommended for flood susceptibility assessment. Copyright © 2018 Elsevier Ltd. All rights reserved.
Zhu, K; Lou, Z; Zhou, J; Ballester, N; Kong, N; Parikh, P
2015-01-01
This article is part of the Focus Theme of Methods of Information in Medicine on "Big Data and Analytics in Healthcare". Hospital readmissions raise healthcare costs and cause significant distress to providers and patients. It is, therefore, of great interest to healthcare organizations to predict what patients are at risk to be readmitted to their hospitals. However, current logistic regression based risk prediction models have limited prediction power when applied to hospital administrative data. Meanwhile, although decision trees and random forests have been applied, they tend to be too complex to understand among the hospital practitioners. Explore the use of conditional logistic regression to increase the prediction accuracy. We analyzed an HCUP statewide inpatient discharge record dataset, which includes patient demographics, clinical and care utilization data from California. We extracted records of heart failure Medicare beneficiaries who had inpatient experience during an 11-month period. We corrected the data imbalance issue with under-sampling. In our study, we first applied standard logistic regression and decision tree to obtain influential variables and derive practically meaning decision rules. We then stratified the original data set accordingly and applied logistic regression on each data stratum. We further explored the effect of interacting variables in the logistic regression modeling. We conducted cross validation to assess the overall prediction performance of conditional logistic regression (CLR) and compared it with standard classification models. The developed CLR models outperformed several standard classification models (e.g., straightforward logistic regression, stepwise logistic regression, random forest, support vector machine). For example, the best CLR model improved the classification accuracy by nearly 20% over the straightforward logistic regression model. Furthermore, the developed CLR models tend to achieve better sensitivity of more than 10% over the standard classification models, which can be translated to correct labeling of additional 400 - 500 readmissions for heart failure patients in the state of California over a year. Lastly, several key predictor identified from the HCUP data include the disposition location from discharge, the number of chronic conditions, and the number of acute procedures. It would be beneficial to apply simple decision rules obtained from the decision tree in an ad-hoc manner to guide the cohort stratification. It could be potentially beneficial to explore the effect of pairwise interactions between influential predictors when building the logistic regression models for different data strata. Judicious use of the ad-hoc CLR models developed offers insights into future development of prediction models for hospital readmissions, which can lead to better intuition in identifying high-risk patients and developing effective post-discharge care strategies. Lastly, this paper is expected to raise the awareness of collecting data on additional markers and developing necessary database infrastructure for larger-scale exploratory studies on readmission risk prediction.
Marcus V. Warwell; Gerald E. Rehfeldt; Nicholas L. Crookston
2006-01-01
The Random Forests multiple regression tree was used to develop an empirically-based bioclimate model for the distribution of Pinus albicaulis (whitebark pine) in western North America, latitudes 31° to 51° N and longitudes 102° to 125° W. Independent variables included 35 simple expressions of temperature and precipitation and their interactions....
Large Unbalanced Credit Scoring Using Lasso-Logistic Regression Ensemble
Wang, Hong; Xu, Qingsong; Zhou, Lifeng
2015-01-01
Recently, various ensemble learning methods with different base classifiers have been proposed for credit scoring problems. However, for various reasons, there has been little research using logistic regression as the base classifier. In this paper, given large unbalanced data, we consider the plausibility of ensemble learning using regularized logistic regression as the base classifier to deal with credit scoring problems. In this research, the data is first balanced and diversified by clustering and bagging algorithms. Then we apply a Lasso-logistic regression learning ensemble to evaluate the credit risks. We show that the proposed algorithm outperforms popular credit scoring models such as decision tree, Lasso-logistic regression and random forests in terms of AUC and F-measure. We also provide two importance measures for the proposed model to identify important variables in the data. PMID:25706988
What contributes to perceived stress in later life? A recursive partitioning approach.
Scott, Stacey B; Jackson, Brenda R; Bergeman, C S
2011-12-01
One possible explanation for the individual differences in outcomes of stress is the diversity of inputs that produce perceptions of being stressed. The current study examines how combinations of contextual features (e.g., social isolation, neighborhood quality, health problems, age discrimination, financial concerns, and recent life events) of later life contribute to overall feelings of stress. Recursive partitioning techniques (regression trees and random forests) were used to examine unique interrelations between predictors of perceived stress in a sample of 282 community-dwelling adults. Trees provided possible examples of equifinality (i.e., subsets of people with similar levels of perceived stress but different predictors) as well as identification both of contextual combinations that separated participants with very high and very low perceived stress. Random forest analyses aggregated across many trees based on permuted versions of the data and predictors; loneliness, financial strain, neighborhood strain, ageism, and to some extent life events emerged as important predictors. Interviews with a subsample of participants provided both thick description of the complex relationships identified in the trees, as well as additional risks not appearing in the survey results. Together, the analyses highlight what may be missed when stress is used as a simple unidimensional construct and can guide differential intervention efforts.
What contributes to perceived stress in later life? A recursive partitioning approach
Scott, Stacey B.; Jackson, Brenda R.; Bergeman, C. S.
2011-01-01
One possible explanation for the individual differences in outcomes of stress is the diversity of inputs that produce perceptions of being stressed. The current study examines how combinations of contextual features (e.g., social isolation, neighborhood quality, health problems, age discrimination, financial concerns, and recent life events) of later life contribute to overall feelings of stress. Recursive partitioning techniques (regression trees and random forests) were used to examine unique interrelations between predictors of perceived stress in a sample of 282 community-dwelling adults. Trees provided possible examples of equifinality (i.e., subsets of people with similar levels of perceived stress but different predictors) as well as for the identification both of contextual combinations that separated participants with very high and very low perceived stress. Random forest analyses aggregated across many trees based on permuted versions of the data and predictors; loneliness, financial strain, neighborhood strain, ageism, and to some extent life events emerged as important predictors. Interviews with a subsample of participants provided both thick description of the complex relationships identified in the trees, as well as additional risks not appearing in the survey results. Together, the analyses highlight what may be missed when stress is used as a simple unidimensional construct and can guide differential intervention efforts. PMID:21604885
Fokkema, M; Smits, N; Zeileis, A; Hothorn, T; Kelderman, H
2017-10-25
Identification of subgroups of patients for whom treatment A is more effective than treatment B, and vice versa, is of key importance to the development of personalized medicine. Tree-based algorithms are helpful tools for the detection of such interactions, but none of the available algorithms allow for taking into account clustered or nested dataset structures, which are particularly common in psychological research. Therefore, we propose the generalized linear mixed-effects model tree (GLMM tree) algorithm, which allows for the detection of treatment-subgroup interactions, while accounting for the clustered structure of a dataset. The algorithm uses model-based recursive partitioning to detect treatment-subgroup interactions, and a GLMM to estimate the random-effects parameters. In a simulation study, GLMM trees show higher accuracy in recovering treatment-subgroup interactions, higher predictive accuracy, and lower type II error rates than linear-model-based recursive partitioning and mixed-effects regression trees. Also, GLMM trees show somewhat higher predictive accuracy than linear mixed-effects models with pre-specified interaction effects, on average. We illustrate the application of GLMM trees on an individual patient-level data meta-analysis on treatments for depression. We conclude that GLMM trees are a promising exploratory tool for the detection of treatment-subgroup interactions in clustered datasets.
NASA Astrophysics Data System (ADS)
Kukunda, Collins B.; Duque-Lazo, Joaquín; González-Ferreiro, Eduardo; Thaden, Hauke; Kleinn, Christoph
2018-03-01
Distinguishing tree species is relevant in many contexts of remote sensing assisted forest inventory. Accurate tree species maps support management and conservation planning, pest and disease control and biomass estimation. This study evaluated the performance of applying ensemble techniques with the goal of automatically distinguishing Pinus sylvestris L. and Pinus uncinata Mill. Ex Mirb within a 1.3 km2 mountainous area in Barcelonnette (France). Three modelling schemes were examined, based on: (1) high-density LiDAR data (160 returns m-2), (2) Worldview-2 multispectral imagery, and (3) Worldview-2 and LiDAR in combination. Variables related to the crown structure and height of individual trees were extracted from the normalized LiDAR point cloud at individual-tree level, after performing individual tree crown (ITC) delineation. Vegetation indices and the Haralick texture indices were derived from Worldview-2 images and served as independent spectral variables. Selection of the best predictor subset was done after a comparison of three variable selection procedures: (1) Random Forests with cross validation (AUCRFcv), (2) Akaike Information Criterion (AIC) and (3) Bayesian Information Criterion (BIC). To classify the species, 9 regression techniques were combined using ensemble models. Predictions were evaluated using cross validation and an independent dataset. Integration of datasets and models improved individual tree species classification (True Skills Statistic, TSS; from 0.67 to 0.81) over individual techniques and maintained strong predictive power (Relative Operating Characteristic, ROC = 0.91). Assemblage of regression models and integration of the datasets provided more reliable species distribution maps and associated tree-scale mapping uncertainties. Our study highlights the potential of model and data assemblage at improving species classifications needed in present-day forest planning and management.
Schmid, Matthias; Küchenhoff, Helmut; Hoerauf, Achim; Tutz, Gerhard
2016-02-28
Survival trees are a popular alternative to parametric survival modeling when there are interactions between the predictor variables or when the aim is to stratify patients into prognostic subgroups. A limitation of classical survival tree methodology is that most algorithms for tree construction are designed for continuous outcome variables. Hence, classical methods might not be appropriate if failure time data are measured on a discrete time scale (as is often the case in longitudinal studies where data are collected, e.g., quarterly or yearly). To address this issue, we develop a method for discrete survival tree construction. The proposed technique is based on the result that the likelihood of a discrete survival model is equivalent to the likelihood of a regression model for binary outcome data. Hence, we modify tree construction methods for binary outcomes such that they result in optimized partitions for the estimation of discrete hazard functions. By applying the proposed method to data from a randomized trial in patients with filarial lymphedema, we demonstrate how discrete survival trees can be used to identify clinically relevant patient groups with similar survival behavior. Copyright © 2015 John Wiley & Sons, Ltd.
Singh, Gyanendra; Sachdeva, S N; Pal, Mahesh
2016-11-01
This work examines the application of M5 model tree and conventionally used fixed/random effect negative binomial (FENB/RENB) regression models for accident prediction on non-urban sections of highway in Haryana (India). Road accident data for a period of 2-6 years on different sections of 8 National and State Highways in Haryana was collected from police records. Data related to road geometry, traffic and road environment related variables was collected through field studies. Total two hundred and twenty two data points were gathered by dividing highways into sections with certain uniform geometric characteristics. For prediction of accident frequencies using fifteen input parameters, two modeling approaches: FENB/RENB regression and M5 model tree were used. Results suggest that both models perform comparably well in terms of correlation coefficient and root mean square error values. M5 model tree provides simple linear equations that are easy to interpret and provide better insight, indicating that this approach can effectively be used as an alternative to RENB approach if the sole purpose is to predict motor vehicle crashes. Sensitivity analysis using M5 model tree also suggests that its results reflect the physical conditions. Both models clearly indicate that to improve safety on Indian highways minor accesses to the highways need to be properly designed and controlled, the service roads to be made functional and dispersion of speeds is to be brought down. Copyright © 2016 Elsevier Ltd. All rights reserved.
Ozge, C; Toros, F; Bayramkaya, E; Camdeviren, H; Sasmaz, T
2006-08-01
The purpose of this study is to evaluate the most important sociodemographic factors on smoking status of high school students using a broad randomised epidemiological survey. Using in-class, self administered questionnaire about their sociodemographic variables and smoking behaviour, a representative sample of total 3304 students of preparatory, 9th, 10th, and 11th grades, from 22 randomly selected schools of Mersin, were evaluated and discriminative factors have been determined using appropriate statistics. In addition to binary logistic regression analysis, the study evaluated combined effects of these factors using classification and regression tree methodology, as a new statistical method. The data showed that 38% of the students reported lifetime smoking and 16.9% of them reported current smoking with a male predominancy and increasing prevalence by age. Second hand smoking was reported at a 74.3% frequency with father predominance (56.6%). The significantly important factors that affect current smoking in these age groups were increased by household size, late birth rank, certain school types, low academic performance, increased second hand smoking, and stress (especially reported as separation from a close friend or because of violence at home). Classification and regression tree methodology showed the importance of some neglected sociodemographic factors with a good classification capacity. It was concluded that, as closely related with sociocultural factors, smoking was a common problem in this young population, generating important academic and social burden in youth life and with increasing data about this behaviour and using new statistical methods, effective coping strategies could be composed.
The process and utility of classification and regression tree methodology in nursing research
Kuhn, Lisa; Page, Karen; Ward, John; Worrall-Carter, Linda
2014-01-01
Aim This paper presents a discussion of classification and regression tree analysis and its utility in nursing research. Background Classification and regression tree analysis is an exploratory research method used to illustrate associations between variables not suited to traditional regression analysis. Complex interactions are demonstrated between covariates and variables of interest in inverted tree diagrams. Design Discussion paper. Data sources English language literature was sourced from eBooks, Medline Complete and CINAHL Plus databases, Google and Google Scholar, hard copy research texts and retrieved reference lists for terms including classification and regression tree* and derivatives and recursive partitioning from 1984–2013. Discussion Classification and regression tree analysis is an important method used to identify previously unknown patterns amongst data. Whilst there are several reasons to embrace this method as a means of exploratory quantitative research, issues regarding quality of data as well as the usefulness and validity of the findings should be considered. Implications for Nursing Research Classification and regression tree analysis is a valuable tool to guide nurses to reduce gaps in the application of evidence to practice. With the ever-expanding availability of data, it is important that nurses understand the utility and limitations of the research method. Conclusion Classification and regression tree analysis is an easily interpreted method for modelling interactions between health-related variables that would otherwise remain obscured. Knowledge is presented graphically, providing insightful understanding of complex and hierarchical relationships in an accessible and useful way to nursing and other health professions. PMID:24237048
The process and utility of classification and regression tree methodology in nursing research.
Kuhn, Lisa; Page, Karen; Ward, John; Worrall-Carter, Linda
2014-06-01
This paper presents a discussion of classification and regression tree analysis and its utility in nursing research. Classification and regression tree analysis is an exploratory research method used to illustrate associations between variables not suited to traditional regression analysis. Complex interactions are demonstrated between covariates and variables of interest in inverted tree diagrams. Discussion paper. English language literature was sourced from eBooks, Medline Complete and CINAHL Plus databases, Google and Google Scholar, hard copy research texts and retrieved reference lists for terms including classification and regression tree* and derivatives and recursive partitioning from 1984-2013. Classification and regression tree analysis is an important method used to identify previously unknown patterns amongst data. Whilst there are several reasons to embrace this method as a means of exploratory quantitative research, issues regarding quality of data as well as the usefulness and validity of the findings should be considered. Classification and regression tree analysis is a valuable tool to guide nurses to reduce gaps in the application of evidence to practice. With the ever-expanding availability of data, it is important that nurses understand the utility and limitations of the research method. Classification and regression tree analysis is an easily interpreted method for modelling interactions between health-related variables that would otherwise remain obscured. Knowledge is presented graphically, providing insightful understanding of complex and hierarchical relationships in an accessible and useful way to nursing and other health professions. © 2013 The Authors. Journal of Advanced Nursing Published by John Wiley & Sons Ltd.
Estimating tree species diversity in the savannah using NDVI and woody canopy cover
NASA Astrophysics Data System (ADS)
Madonsela, Sabelo; Cho, Moses Azong; Ramoelo, Abel; Mutanga, Onisimo; Naidoo, Laven
2018-04-01
Remote sensing applications in biodiversity research often rely on the establishment of relationships between spectral information from the image and tree species diversity measured in the field. Most studies have used normalized difference vegetation index (NDVI) to estimate tree species diversity on the basis that it is sensitive to primary productivity which defines spatial variation in plant diversity. The NDVI signal is influenced by photosynthetically active vegetation which, in the savannah, includes woody canopy foliage and grasses. The question is whether the relationship between NDVI and tree species diversity in the savanna depends on the woody cover percentage. This study explored the relationship between woody canopy cover (WCC) and tree species diversity in the savannah woodland of southern Africa and also investigated whether there is a significant interaction between seasonal NDVI and WCC in the factorial model when estimating tree species diversity. To fulfil our aim, we followed stratified random sampling approach and surveyed tree species in 68 plots of 90 m × 90 m across the study area. Within each plot, all trees with diameter at breast height of >10 cm were sampled and Shannon index - a common measure of species diversity which considers both species richness and abundance - was used to quantify tree species diversity. We then extracted WCC in each plot from existing fractional woody cover product produced from Synthetic Aperture Radar (SAR) data. Factorial regression model was used to determine the interaction effect between NDVI and WCC when estimating tree species diversity. Results from regression analysis showed that (i) WCC has a highly significant relationship with tree species diversity (r2 = 0.21; p < 0.01), (ii) the interaction between the NDVI and WCC is not significant, however, the factorial model significantly reduced the error of prediction (RMSE = 0.47, p < 0.05) compared to NDVI (RMSE = 0.49) or WCC (RMSE = 0.49) model during the senescence period. The result justifies our assertion that combining NDVI with WCC will be optimal for biodiversity estimation during the senescence period.
Doyle, Suzanne R.; Donovan, Dennis M.
2014-01-01
Aims The purpose of this study was to explore the selection of predictor variables in the evaluation of drug treatment completion using an ensemble approach with classification trees. The basic methodology is reviewed and the subagging procedure of random subsampling is applied. Methods Among 234 individuals with stimulant use disorders randomized to a 12-Step facilitative intervention shown to increase stimulant use abstinence, 67.52% were classified as treatment completers. A total of 122 baseline variables were used to identify factors associated with completion. Findings The number of types of self-help activity involvement prior to treatment was the predominant predictor. Other effective predictors included better coping self-efficacy for substance use in high-risk situations, more days of prior meeting attendance, greater acceptance of the Disease model, higher confidence for not resuming use following discharge, lower ASI Drug and Alcohol composite scores, negative urine screens for cocaine or marijuana, and fewer employment problems. Conclusions The application of an ensemble subsampling regression tree method utilizes the fact that classification trees are unstable but, on average, produce an improved prediction of the completion of drug abuse treatment. The results support the notion there are early indicators of treatment completion that may allow for modification of approaches more tailored to fitting the needs of individuals and potentially provide more successful treatment engagement and improved outcomes. PMID:25134038
Mocellin, Simone; Thompson, John F; Pasquali, Sandro; Montesco, Maria C; Pilati, Pierluigi; Nitti, Donato; Saw, Robyn P; Scolyer, Richard A; Stretch, Jonathan R; Rossi, Carlo R
2009-12-01
To improve selection for sentinel node (SN) biopsy (SNB) in patients with cutaneous melanoma using statistical models predicting SN status. About 80% of patients currently undergoing SNB are node negative. In the absence of conclusive evidence of a SNBassociated survival benefit, these patients may be over-treated. Here, we tested the efficiency of 4 different models in predicting SN status. The clinicopathologic data (age, gender, tumor thickness, Clark level, regression, ulceration, histologic subtype, and mitotic index) of 1132 melanoma patients who had undergone SNB at institutions in Italy and Australia were analyzed. Logistic regression, classification tree, random forest, and support vector machine models were fitted to the data. The predictive models were built with the aim of maximizing the negative predictive value (NPV) and reducing the rate of SNB procedures though minimizing the error rate. After cross-validation logistic regression, classification tree, random forest, and support vector machine predictive models obtained clinically relevant NPV (93.6%, 94.0%, 97.1%, and 93.0%, respectively), SNB reduction (27.5%, 29.8%, 18.2%, and 30.1%, respectively), and error rates (1.8%, 1.8%, 0.5%, and 2.1%, respectively). Using commonly available clinicopathologic variables, predictive models can preoperatively identify a proportion of patients ( approximately 25%) who might be spared SNB, with an acceptable (1%-2%) error. If validated in large prospective series, these models might be implemented in the clinical setting for improved patient selection, which ultimately would lead to better quality of life for patients and optimization of resource allocation for the health care system.
Validating automatic semantic annotation of anatomy in DICOM CT images
NASA Astrophysics Data System (ADS)
Pathak, Sayan D.; Criminisi, Antonio; Shotton, Jamie; White, Steve; Robertson, Duncan; Sparks, Bobbi; Munasinghe, Indeera; Siddiqui, Khan
2011-03-01
In the current health-care environment, the time available for physicians to browse patients' scans is shrinking due to the rapid increase in the sheer number of images. This is further aggravated by mounting pressure to become more productive in the face of decreasing reimbursement. Hence, there is an urgent need to deliver technology which enables faster and effortless navigation through sub-volume image visualizations. Annotating image regions with semantic labels such as those derived from the RADLEX ontology can vastly enhance image navigation and sub-volume visualization. This paper uses random regression forests for efficient, automatic detection and localization of anatomical structures within DICOM 3D CT scans. A regression forest is a collection of decision trees which are trained to achieve direct mapping from voxels to organ location and size in a single pass. This paper focuses on comparing automated labeling with expert-annotated ground-truth results on a database of 50 highly variable CT scans. Initial investigations show that regression forest derived localization errors are smaller and more robust than those achieved by state-of-the-art global registration approaches. The simplicity of the algorithm's context-rich visual features yield typical runtimes of less than 10 seconds for a 5123 voxel DICOM CT series on a single-threaded, single-core machine running multiple trees; each tree taking less than a second. Furthermore, qualitative evaluation demonstrates that using the detected organs' locations as index into the image volume improves the efficiency of the navigational workflow in all the CT studies.
Özge, C; Toros, F; Bayramkaya, E; Çamdeviren, H; Şaşmaz, T
2006-01-01
Background The purpose of this study is to evaluate the most important sociodemographic factors on smoking status of high school students using a broad randomised epidemiological survey. Methods Using in‐class, self administered questionnaire about their sociodemographic variables and smoking behaviour, a representative sample of total 3304 students of preparatory, 9th, 10th, and 11th grades, from 22 randomly selected schools of Mersin, were evaluated and discriminative factors have been determined using appropriate statistics. In addition to binary logistic regression analysis, the study evaluated combined effects of these factors using classification and regression tree methodology, as a new statistical method. Results The data showed that 38% of the students reported lifetime smoking and 16.9% of them reported current smoking with a male predominancy and increasing prevalence by age. Second hand smoking was reported at a 74.3% frequency with father predominance (56.6%). The significantly important factors that affect current smoking in these age groups were increased by household size, late birth rank, certain school types, low academic performance, increased second hand smoking, and stress (especially reported as separation from a close friend or because of violence at home). Classification and regression tree methodology showed the importance of some neglected sociodemographic factors with a good classification capacity. Conclusions It was concluded that, as closely related with sociocultural factors, smoking was a common problem in this young population, generating important academic and social burden in youth life and with increasing data about this behaviour and using new statistical methods, effective coping strategies could be composed. PMID:16891446
Higuchi, P; Silva, A C; Louzada, J N C; Machado, E L M
2010-05-01
The objectives of this study were to evaluate the influence of propagules source and the implication of tree size class on the spatial pattern of Xylopia brasiliensis Spreng. individuals growing under the canopy of an experimental plantation of eucalyptus. To this end, all individuals of Xylopia brasiliensis with diameter at soil height (dsh) > 1 cm were mapped in the understory of a 3.16 ha Eucalyptus spp. and Corymbia spp. plantation, located in the municipality of Lavras, SE Brazil. The largest nearby mature tree of X. brasiliensis was considered as the propagules source. Linear regressions were used to assess the influence of the distance of propagules source on the population parameters (density, basal area and height). The spatial pattern of trees was assessed through the Ripley K function. The overall pattern showed that the propagules source distance had strong influence over spatial distribution of trees, mainly the small ones, indicating that the closer the distance from the propagules source, the higher the tree density and the lower the mean tree height. The population showed different spatial distribution patterns according to the spatial scale and diameter class considered. While small trees tended to be aggregated up to around 80 m, the largest individuals were randomly distributed in the area. A plausible explanation for observed patterns might be limited seed rain and intra-population competition.
Boosted regression tree, table, and figure data
Spreadsheets are included here to support the manuscript Boosted Regression Tree Models to Explain Watershed Nutrient Concentrations and Biological Condition. This dataset is associated with the following publication:Golden , H., C. Lane , A. Prues, and E. D'Amico. Boosted Regression Tree Models to Explain Watershed Nutrient Concentrations and Biological Condition. JAWRA. American Water Resources Association, Middleburg, VA, USA, 52(5): 1251-1274, (2016).
Travis Woolley; David C. Shaw; Lisa M. Ganio; Stephen Fitzgerald
2012-01-01
Logistic regression models used to predict tree mortality are critical to post-fire management, planning prescribed bums and understanding disturbance ecology. We review literature concerning post-fire mortality prediction using logistic regression models for coniferous tree species in the western USA. We include synthesis and review of: methods to develop, evaluate...
Hitt, Nathaniel P.; Floyd, Michael; Compton, Michael; McDonald, Kenneth
2016-01-01
Chrosomus cumberlandensis (Blackside Dace [BSD]) and Etheostoma spilotum (Kentucky Arrow Darter [KAD]) are fish species of conservation concern due to their fragmented distributions, their low population sizes, and threats from anthropogenic stressors in the southeastern United States. We evaluated the relationship between fish abundance and stream conductivity, an index of environmental quality and potential physiological stressor. We modeled occurrence and abundance of KAD in the upper Kentucky River basin (208 samples) and BSD in the upper Cumberland River basin (294 samples) for sites sampled between 2003 and 2013. Segmented regression indicated a conductivity change-point for BSD abundance at 343 μS/cm (95% CI: 123–563 μS/cm) and for KAD abundance at 261 μS/cm (95% CI: 151–370 μS/cm). In both cases, abundances were negligible above estimated conductivity change-points. Post-hoc randomizations accounted for variance in estimated change points due to unequal sample sizes across the conductivity gradients. Boosted regression-tree analysis indicated stronger effects of conductivity than other natural and anthropogenic factors known to influence stream fishes. Boosted regression trees further indicated threshold responses of BSD and KAD occurrence to conductivity gradients in support of segmented regression results. We suggest that the observed conductivity relationship may indicate energetic limitations for insectivorous fishes due to changes in benthic macroinvertebrate community composition.
Liu, Feng; Tan, Chang; Lei, Pi-Feng
2014-11-01
Taking Wugang forest farm in Xuefeng Mountain as the research object, using the airborne light detection and ranging (LiDAR) data under leaf-on condition and field data of concomitant plots, this paper assessed the ability of using LiDAR technology to estimate aboveground biomass of the mid-subtropical forest. A semi-automated individual tree LiDAR cloud point segmentation was obtained by using condition random fields and optimization methods. Spatial structure, waveform characteristics and topography were calculated as LiDAR metrics from the segmented objects. Then statistical models between aboveground biomass from field data and these LiDAR metrics were built. The individual tree recognition rates were 93%, 86% and 60% for coniferous, broadleaf and mixed forests, respectively. The adjusted coefficients of determination (R(2)adj) and the root mean squared errors (RMSE) for the three types of forest were 0.83, 0.81 and 0.74, and 28.22, 29.79 and 32.31 t · hm(-2), respectively. The estimation capability of model based on canopy geometric volume, tree percentile height, slope and waveform characteristics was much better than that of traditional regression model based on tree height. Therefore, LiDAR metrics from individual tree could facilitate better performance in biomass estimation.
Duncan, Dustin T; Kawachi, Ichiro; Kum, Susan; Aldstadt, Jared; Piras, Gianfranco; Matthews, Stephen A; Arbia, Giuseppe; Castro, Marcia C; White, Kellee; Williams, David R
2014-04-01
The racial/ethnic and income composition of neighborhoods often influences local amenities, including the potential spatial distribution of trees, which are important for population health and community wellbeing, particularly in urban areas. This ecological study used spatial analytical methods to assess the relationship between neighborhood socio-demographic characteristics (i.e. minority racial/ethnic composition and poverty) and tree density at the census tact level in Boston, Massachusetts (US). We examined spatial autocorrelation with the Global Moran's I for all study variables and in the ordinary least squares (OLS) regression residuals as well as computed Spearman correlations non-adjusted and adjusted for spatial autocorrelation between socio-demographic characteristics and tree density. Next, we fit traditional regressions (i.e. OLS regression models) and spatial regressions (i.e. spatial simultaneous autoregressive models), as appropriate. We found significant positive spatial autocorrelation for all neighborhood socio-demographic characteristics (Global Moran's I range from 0.24 to 0.86, all P =0.001), for tree density (Global Moran's I =0.452, P =0.001), and in the OLS regression residuals (Global Moran's I range from 0.32 to 0.38, all P <0.001). Therefore, we fit the spatial simultaneous autoregressive models. There was a negative correlation between neighborhood percent non-Hispanic Black and tree density (r S =-0.19; conventional P -value=0.016; spatially adjusted P -value=0.299) as well as a negative correlation between predominantly non-Hispanic Black (over 60% Black) neighborhoods and tree density (r S =-0.18; conventional P -value=0.019; spatially adjusted P -value=0.180). While the conventional OLS regression model found a marginally significant inverse relationship between Black neighborhoods and tree density, we found no statistically significant relationship between neighborhood socio-demographic composition and tree density in the spatial regression models. Methodologically, our study suggests the need to take into account spatial autocorrelation as findings/conclusions can change when the spatial autocorrelation is ignored. Substantively, our findings suggest no need for policy intervention vis-à-vis trees in Boston, though we hasten to add that replication studies, and more nuanced data on tree quality, age and diversity are needed.
Louis R Iverson; Anantha M. Prasad; Mark W. Schwartz; Mark W. Schwartz
2005-01-01
We predict current distribution and abundance for tree species present in eastern North America, and subsequently estimate potential suitable habitat for those species under a changed climate with 2 x CO2. We used a series of statistical models (i.e., Regression Tree Analysis (RTA), Multivariate Adaptive Regression Splines (MARS), Bagging Trees (...
The influence of tree morphology on stemflow generation in a tropical lowland rainforest
NASA Astrophysics Data System (ADS)
Uber, Magdalena; Levia, Delphis F.; Zimmermann, Beate; Zimmermann, Alexander
2014-05-01
Even though stemflow usually accounts for only a small proportion of rainfall, it is an important point source of water and ion input to forest floors and may, for instance, influence soil moisture patterns and groundwater recharge. Previous studies showed that the generation of stemflow depends on a multitude of meteorological and biological factors. Interestingly, despite the tremendous progress in stemflow research during the last decades it is still largely unknown which combination of tree characteristics determines stemflow volumes in species-rich tropical forests. This knowledge gap motivated us to analyse the influence of tree characteristics on stemflow volumes in a 1 hectare plot located in a Panamanian lowland rainforest. Our study comprised stemflow measurements in six randomly selected 10 m by 10 m subplots. In each subplot we measured stemflow of all trees with a diameter at breast height (DBH) > 5 cm on an event-basis for a period of six weeks. Additionally, we identified all tree species and determined a set of tree characteristics including DBH, crown diameter, bark roughness, bark furrowing, epiphyte coverage, tree architecture, stem inclination, and crown position. During the sampling period, we collected 985 L of stemflow (0.98 % of total rainfall). Based on regression analyses and comparisons among plant functional groups we show that palms were most efficient in yielding stemflow due to their large inclined fronds. Trees with large emergent crowns also produced relatively large amounts of stemflow. Due to their abundance, understory trees contribute much to stemflow yield not on individual but on the plot scale. Even though parameters such as crown diameter, branch inclination and position of the crown influence stemflow generation to some extent, these parameters explain less than 30 % of the variation in stemflow volumes. In contrast to published results from temperate forests, we did not detect a negative correlation between bark roughness and stemflow volume. This is because other parameters such as crown diameter obscured this relationship. Due to multicollinearity and poor correlations between single tree characteristics with stemflow volume, an assessment of stemflow volumes based on forest characteristics remains cumbersome in highly diverse ecosystems. Instead of relying on regression relationships, we therefore advocate a total sampling of trees in several plots to determine stand-scale stemflow yield in tropical forests.
Charles E. Rose; Thomas B. Lynch
2001-01-01
A method was developed for estimating parameters in an individual tree basal area growth model using a system of equations based on dbh rank classes. The estimation method developed is a compromise between an individual tree and a stand level basal area growth model that accounts for the correlation between trees within a plot by using seemingly unrelated regression (...
Susan L. King
2003-01-01
The performance of two classifiers, logistic regression and neural networks, are compared for modeling noncatastrophic individual tree mortality for 21 species of trees in West Virginia. The output of the classifier is usually a continuous number between 0 and 1. A threshold is selected between 0 and 1 and all of the trees below the threshold are classified as...
Random forests on Hadoop for genome-wide association studies of multivariate neuroimaging phenotypes
2013-01-01
Motivation Multivariate quantitative traits arise naturally in recent neuroimaging genetics studies, in which both structural and functional variability of the human brain is measured non-invasively through techniques such as magnetic resonance imaging (MRI). There is growing interest in detecting genetic variants associated with such multivariate traits, especially in genome-wide studies. Random forests (RFs) classifiers, which are ensembles of decision trees, are amongst the best performing machine learning algorithms and have been successfully employed for the prioritisation of genetic variants in case-control studies. RFs can also be applied to produce gene rankings in association studies with multivariate quantitative traits, and to estimate genetic similarities measures that are predictive of the trait. However, in studies involving hundreds of thousands of SNPs and high-dimensional traits, a very large ensemble of trees must be inferred from the data in order to obtain reliable rankings, which makes the application of these algorithms computationally prohibitive. Results We have developed a parallel version of the RF algorithm for regression and genetic similarity learning tasks in large-scale population genetic association studies involving multivariate traits, called PaRFR (Parallel Random Forest Regression). Our implementation takes advantage of the MapReduce programming model and is deployed on Hadoop, an open-source software framework that supports data-intensive distributed applications. Notable speed-ups are obtained by introducing a distance-based criterion for node splitting in the tree estimation process. PaRFR has been applied to a genome-wide association study on Alzheimer's disease (AD) in which the quantitative trait consists of a high-dimensional neuroimaging phenotype describing longitudinal changes in the human brain structure. PaRFR provides a ranking of SNPs associated to this trait, and produces pair-wise measures of genetic proximity that can be directly compared to pair-wise measures of phenotypic proximity. Several known AD-related variants have been identified, including APOE4 and TOMM40. We also present experimental evidence supporting the hypothesis of a linear relationship between the number of top-ranked mutated states, or frequent mutation patterns, and an indicator of disease severity. Availability The Java codes are freely available at http://www2.imperial.ac.uk/~gmontana. PMID:24564704
Wang, Yue; Goh, Wilson; Wong, Limsoon; Montana, Giovanni
2013-01-01
Multivariate quantitative traits arise naturally in recent neuroimaging genetics studies, in which both structural and functional variability of the human brain is measured non-invasively through techniques such as magnetic resonance imaging (MRI). There is growing interest in detecting genetic variants associated with such multivariate traits, especially in genome-wide studies. Random forests (RFs) classifiers, which are ensembles of decision trees, are amongst the best performing machine learning algorithms and have been successfully employed for the prioritisation of genetic variants in case-control studies. RFs can also be applied to produce gene rankings in association studies with multivariate quantitative traits, and to estimate genetic similarities measures that are predictive of the trait. However, in studies involving hundreds of thousands of SNPs and high-dimensional traits, a very large ensemble of trees must be inferred from the data in order to obtain reliable rankings, which makes the application of these algorithms computationally prohibitive. We have developed a parallel version of the RF algorithm for regression and genetic similarity learning tasks in large-scale population genetic association studies involving multivariate traits, called PaRFR (Parallel Random Forest Regression). Our implementation takes advantage of the MapReduce programming model and is deployed on Hadoop, an open-source software framework that supports data-intensive distributed applications. Notable speed-ups are obtained by introducing a distance-based criterion for node splitting in the tree estimation process. PaRFR has been applied to a genome-wide association study on Alzheimer's disease (AD) in which the quantitative trait consists of a high-dimensional neuroimaging phenotype describing longitudinal changes in the human brain structure. PaRFR provides a ranking of SNPs associated to this trait, and produces pair-wise measures of genetic proximity that can be directly compared to pair-wise measures of phenotypic proximity. Several known AD-related variants have been identified, including APOE4 and TOMM40. We also present experimental evidence supporting the hypothesis of a linear relationship between the number of top-ranked mutated states, or frequent mutation patterns, and an indicator of disease severity. The Java codes are freely available at http://www2.imperial.ac.uk/~gmontana.
López-Sampson, Arlene; Cernusak, Lucas A; Page, Tony
2017-05-01
Physiological traits are frequently used as indicators of tree productivity. Aquilaria species growing in a research planting were studied to investigate relationships between leaf-productivity traits and tree growth. Twenty-eight trees were selected to measure isotopic composition of carbon (δ13C) and nitrogen (δ15N) and monitor six leaf attributes. Trees were sampled randomly within each of four diametric classes (at 150 mm above ground level) ensuring the variability in growth of the whole population was represented. A model averaging technique based on the Akaike's information criterion was computed to identify whether leaf traits could assist in diameter prediction. Regression analysis was performed to test for relationships between carbon isotope values and diameter and leaf traits. Approximately one new leaf per week was produced by a shoot. The rate of leaf expansion was estimated as 1.45 mm day-1. The range of δ13C values in leaves of Aquilaria species was from -25.5‰ to -31‰, with an average of -28.4 ‰ (±1.5‰ SD). A moderate negative correlation (R2 = 0.357) between diameter and δ13C in leaf dry matter indicated that individuals with high intercellular CO2 concentrations (low δ13C) and associated low water-use efficiency sustained rapid growth. Analysis of the 95% confidence of best-ranked regression models indicated that the predictors that could best explain growth in Aquilaria species were δ13C, δ15N, petiole length, number of new leaves produced per week and specific leaf area. The model constructed with these variables explained 55% (R2 = 0.55) of the variability in stem diameter. This demonstrates that leaf traits can assist in the early selection of high-productivity trees in Aquilaria species. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
Kim, Dong Wook; Kim, Hwiyoung; Nam, Woong; Kim, Hyung Jun; Cha, In-Ho
2018-04-23
The aim of this study was to build and validate five types of machine learning models that can predict the occurrence of BRONJ associated with dental extraction in patients taking bisphosphonates for the management of osteoporosis. A retrospective review of the medical records was conducted to obtain cases and controls for the study. Total 125 patients consisting of 41 cases and 84 controls were selected for the study. Five machine learning prediction algorithms including multivariable logistic regression model, decision tree, support vector machine, artificial neural network, and random forest were implemented. The outputs of these models were compared with each other and also with conventional methods, such as serum CTX level. Area under the receiver operating characteristic (ROC) curve (AUC) was used to compare the results. The performance of machine learning models was significantly superior to conventional statistical methods and single predictors. The random forest model yielded the best performance (AUC = 0.973), followed by artificial neural network (AUC = 0.915), support vector machine (AUC = 0.882), logistic regression (AUC = 0.844), decision tree (AUC = 0.821), drug holiday alone (AUC = 0.810), and CTX level alone (AUC = 0.630). Machine learning methods showed superior performance in predicting BRONJ associated with dental extraction compared to conventional statistical methods using drug holiday and serum CTX level. Machine learning can thus be applied in a wide range of clinical studies. Copyright © 2017. Published by Elsevier Inc.
NASA Astrophysics Data System (ADS)
Beguet, Benoit; Guyon, Dominique; Boukir, Samia; Chehata, Nesrine
2014-10-01
The main goal of this study is to design a method to describe the structure of forest stands from Very High Resolution satellite imagery, relying on some typical variables such as crown diameter, tree height, trunk diameter, tree density and tree spacing. The emphasis is placed on the automatization of the process of identification of the most relevant image features for the forest structure retrieval task, exploiting both spectral and spatial information. Our approach is based on linear regressions between the forest structure variables to be estimated and various spectral and Haralick's texture features. The main drawback of this well-known texture representation is the underlying parameters which are extremely difficult to set due to the spatial complexity of the forest structure. To tackle this major issue, an automated feature selection process is proposed which is based on statistical modeling, exploring a wide range of parameter values. It provides texture measures of diverse spatial parameters hence implicitly inducing a multi-scale texture analysis. A new feature selection technique, we called Random PRiF, is proposed. It relies on random sampling in feature space, carefully addresses the multicollinearity issue in multiple-linear regression while ensuring accurate prediction of forest variables. Our automated forest variable estimation scheme was tested on Quickbird and Pléiades panchromatic and multispectral images, acquired at different periods on the maritime pine stands of two sites in South-Western France. It outperforms two well-established variable subset selection techniques. It has been successfully applied to identify the best texture features in modeling the five considered forest structure variables. The RMSE of all predicted forest variables is improved by combining multispectral and panchromatic texture features, with various parameterizations, highlighting the potential of a multi-resolution approach for retrieving forest structure variables from VHR satellite images. Thus an average prediction error of ˜ 1.1 m is expected on crown diameter, ˜ 0.9 m on tree spacing, ˜ 3 m on height and ˜ 0.06 m on diameter at breast height.
Duncan, Dustin T.; Kawachi, Ichiro; Kum, Susan; Aldstadt, Jared; Piras, Gianfranco; Matthews, Stephen A.; Arbia, Giuseppe; Castro, Marcia C.; White, Kellee; Williams, David R.
2017-01-01
The racial/ethnic and income composition of neighborhoods often influences local amenities, including the potential spatial distribution of trees, which are important for population health and community wellbeing, particularly in urban areas. This ecological study used spatial analytical methods to assess the relationship between neighborhood socio-demographic characteristics (i.e. minority racial/ethnic composition and poverty) and tree density at the census tact level in Boston, Massachusetts (US). We examined spatial autocorrelation with the Global Moran’s I for all study variables and in the ordinary least squares (OLS) regression residuals as well as computed Spearman correlations non-adjusted and adjusted for spatial autocorrelation between socio-demographic characteristics and tree density. Next, we fit traditional regressions (i.e. OLS regression models) and spatial regressions (i.e. spatial simultaneous autoregressive models), as appropriate. We found significant positive spatial autocorrelation for all neighborhood socio-demographic characteristics (Global Moran’s I range from 0.24 to 0.86, all P=0.001), for tree density (Global Moran’s I=0.452, P=0.001), and in the OLS regression residuals (Global Moran’s I range from 0.32 to 0.38, all P<0.001). Therefore, we fit the spatial simultaneous autoregressive models. There was a negative correlation between neighborhood percent non-Hispanic Black and tree density (rS=−0.19; conventional P-value=0.016; spatially adjusted P-value=0.299) as well as a negative correlation between predominantly non-Hispanic Black (over 60% Black) neighborhoods and tree density (rS=−0.18; conventional P-value=0.019; spatially adjusted P-value=0.180). While the conventional OLS regression model found a marginally significant inverse relationship between Black neighborhoods and tree density, we found no statistically significant relationship between neighborhood socio-demographic composition and tree density in the spatial regression models. Methodologically, our study suggests the need to take into account spatial autocorrelation as findings/conclusions can change when the spatial autocorrelation is ignored. Substantively, our findings suggest no need for policy intervention vis-à-vis trees in Boston, though we hasten to add that replication studies, and more nuanced data on tree quality, age and diversity are needed. PMID:29354668
Huang, C.; Townshend, J.R.G.
2003-01-01
A stepwise regression tree (SRT) algorithm was developed for approximating complex nonlinear relationships. Based on the regression tree of Breiman et al . (BRT) and a stepwise linear regression (SLR) method, this algorithm represents an improvement over SLR in that it can approximate nonlinear relationships and over BRT in that it gives more realistic predictions. The applicability of this method to estimating subpixel forest was demonstrated using three test data sets, on all of which it gave more accurate predictions than SLR and BRT. SRT also generated more compact trees and performed better than or at least as well as BRT at all 10 equal forest proportion interval ranging from 0 to 100%. This method is appealing to estimating subpixel land cover over large areas.
ERIC Educational Resources Information Center
Koon, Sharon; Petscher, Yaacov
2015-01-01
The purpose of this report was to explicate the use of logistic regression and classification and regression tree (CART) analysis in the development of early warning systems. It was motivated by state education leaders' interest in maintaining high classification accuracy while simultaneously improving practitioner understanding of the rules by…
Westreich, Daniel; Lessler, Justin; Funk, Michele Jonsson
2010-08-01
Propensity scores for the analysis of observational data are typically estimated using logistic regression. Our objective in this review was to assess machine learning alternatives to logistic regression, which may accomplish the same goals but with fewer assumptions or greater accuracy. We identified alternative methods for propensity score estimation and/or classification from the public health, biostatistics, discrete mathematics, and computer science literature, and evaluated these algorithms for applicability to the problem of propensity score estimation, potential advantages over logistic regression, and ease of use. We identified four techniques as alternatives to logistic regression: neural networks, support vector machines, decision trees (classification and regression trees [CART]), and meta-classifiers (in particular, boosting). Although the assumptions of logistic regression are well understood, those assumptions are frequently ignored. All four alternatives have advantages and disadvantages compared with logistic regression. Boosting (meta-classifiers) and, to a lesser extent, decision trees (particularly CART), appear to be most promising for use in the context of propensity score analysis, but extensive simulation studies are needed to establish their utility in practice. Copyright (c) 2010 Elsevier Inc. All rights reserved.
Eric H. Wharton; Tiberius Cunia
1987-01-01
Proceedings of a workshop co-sponsored by the USDA Forest Service, the State University of New York, and the Society of American Foresters. Presented were papers on the methodology of sample tree selection, tree biomass measurement, construction of biomass tables and estimation of their error, and combining the error of biomass tables with that of the sample plots or...
Chen, Guangchao; Li, Xuehua; Chen, Jingwen; Zhang, Ya-Nan; Peijnenburg, Willie J G M
2014-12-01
Biodegradation is the principal environmental dissipation process of chemicals. As such, it is a dominant factor determining the persistence and fate of organic chemicals in the environment, and is therefore of critical importance to chemical management and regulation. In the present study, the authors developed in silico methods assessing biodegradability based on a large heterogeneous set of 825 organic compounds, using the techniques of the C4.5 decision tree, the functional inner regression tree, and logistic regression. External validation was subsequently carried out by 2 independent test sets of 777 and 27 chemicals. As a result, the functional inner regression tree exhibited the best predictability with predictive accuracies of 81.5% and 81.0%, respectively, on the training set (825 chemicals) and test set I (777 chemicals). Performance of the developed models on the 2 test sets was subsequently compared with that of the Estimation Program Interface (EPI) Suite Biowin 5 and Biowin 6 models, which also showed a better predictability of the functional inner regression tree model. The model built in the present study exhibits a reasonable predictability compared with existing models while possessing a transparent algorithm. Interpretation of the mechanisms of biodegradation was also carried out based on the models developed. © 2014 SETAC.
Cerasoli, Francesco; Iannella, Mattia; D'Alessandro, Paola; Biondi, Maurizio
2017-01-01
Boosted Regression Trees (BRT) is one of the modelling techniques most recently applied to biodiversity conservation and it can be implemented with presence-only data through the generation of artificial absences (pseudo-absences). In this paper, three pseudo-absences generation techniques are compared, namely the generation of pseudo-absences within target-group background (TGB), testing both the weighted (WTGB) and unweighted (UTGB) scheme, and the generation at random (RDM), evaluating their performance and applicability in distribution modelling and species conservation. The choice of the target group fell on amphibians, because of their rapid decline worldwide and the frequent lack of guidelines for conservation strategies and regional-scale planning, which instead could be provided through an appropriate implementation of SDMs. Bufo bufo, Salamandrina perspicillata and Triturus carnifex were considered as target species, in order to perform our analysis with species having different ecological and distributional characteristics. The study area is the "Gran Sasso-Monti della Laga" National Park, which hosts 15 Natura 2000 sites and represents one of the most important biodiversity hotspots in Europe. Our results show that the model calibration ameliorates when using the target-group based pseudo-absences compared to the random ones, especially when applying the WTGB. Contrarily, model discrimination did not significantly vary in a consistent way among the three approaches with respect to the tree target species. Both WTGB and RDM clearly isolate the highly contributing variables, supplying many relevant indications for species conservation actions. Moreover, the assessment of pairwise variable interactions and their three-dimensional visualization further increase the amount of useful information for protected areas' managers. Finally, we suggest the use of RDM as an admissible alternative when it is not possible to individuate a suitable set of species as a representative target-group from which the pseudo-absences can be generated.
Factor complexity of crash occurrence: An empirical demonstration using boosted regression trees.
Chung, Yi-Shih
2013-12-01
Factor complexity is a characteristic of traffic crashes. This paper proposes a novel method, namely boosted regression trees (BRT), to investigate the complex and nonlinear relationships in high-variance traffic crash data. The Taiwanese 2004-2005 single-vehicle motorcycle crash data are used to demonstrate the utility of BRT. Traditional logistic regression and classification and regression tree (CART) models are also used to compare their estimation results and external validities. Both the in-sample cross-validation and out-of-sample validation results show that an increase in tree complexity provides improved, although declining, classification performance, indicating a limited factor complexity of single-vehicle motorcycle crashes. The effects of crucial variables including geographical, time, and sociodemographic factors explain some fatal crashes. Relatively unique fatal crashes are better approximated by interactive terms, especially combinations of behavioral factors. BRT models generally provide improved transferability than conventional logistic regression and CART models. This study also discusses the implications of the results for devising safety policies. Copyright © 2012 Elsevier Ltd. All rights reserved.
Dynamic travel time estimation using regression trees.
DOT National Transportation Integrated Search
2008-10-01
This report presents a methodology for travel time estimation by using regression trees. The dissemination of travel time information has become crucial for effective traffic management, especially under congested road conditions. In the absence of c...
Jose F. Negron
1998-01-01
Infested and uninfested areas within Douglas fir, Pseudotsuga menziesii Mirb.. Franco, stands affected by the Douglas-fir beetle, Dendroctonus pseudotsugae Hopk. were sampled in the Colorado Front Range, CO. Classification tree models were built to predict probabilities of infestation. Regression trees and linear regression analysis were used to model amount of tree...
Using nonlinear quantile regression to estimate the self-thinning boundary curve
Quang V. Cao; Thomas J. Dean
2015-01-01
The relationship between tree size (quadratic mean diameter) and tree density (number of trees per unit area) has been a topic of research and discussion for many decades. Starting with Reineke in 1933, the maximum size-density relationship, on a log-log scale, has been assumed to be linear. Several techniques, including linear quantile regression, have been employed...
Ejlskov, Linda; Wulff, Jesper; Bøggild, Henrik; Kuh, Diana; Stafford, Mai
2017-09-08
Improving the design and targeting of interventions is important for alleviating loneliness among older adults. This requires identifying which correlates are the most important predictors of loneliness. This study demonstrates the use of recursive partitioning in exploring the characteristics and assessing the relative importance of correlates of loneliness in older adults. Using exploratory regression trees and random forests, we examined combinations and the relative importance of 42 correlates in relation to loneliness at age 68 among 2453 participants from the birth cohort study the MRC National Survey of Health and Development. Positive mental well-being, personal mastery, identifying the spouse as the closest confidant, being extrovert and informal social contact were the most important correlates of lower loneliness levels. Participation in organised groups and demographic correlates were poor identifiers of loneliness. The regression tree suggested that loneliness was not raised among those with poor mental wellbeing if they identified their partner as closest confidante and had frequent social contact. Recursive partitioning can identify which combinations of experiences and circumstances characterise high-risk groups. Poor mental wellbeing and sparse social contact emerged as especially important and classical demographic factors as insufficient in identifying high loneliness levels among older adults.
A Ranking Approach to Genomic Selection.
Blondel, Mathieu; Onogi, Akio; Iwata, Hiroyoshi; Ueda, Naonori
2015-01-01
Genomic selection (GS) is a recent selective breeding method which uses predictive models based on whole-genome molecular markers. Until now, existing studies formulated GS as the problem of modeling an individual's breeding value for a particular trait of interest, i.e., as a regression problem. To assess predictive accuracy of the model, the Pearson correlation between observed and predicted trait values was used. In this paper, we propose to formulate GS as the problem of ranking individuals according to their breeding value. Our proposed framework allows us to employ machine learning methods for ranking which had previously not been considered in the GS literature. To assess ranking accuracy of a model, we introduce a new measure originating from the information retrieval literature called normalized discounted cumulative gain (NDCG). NDCG rewards more strongly models which assign a high rank to individuals with high breeding value. Therefore, NDCG reflects a prerequisite objective in selective breeding: accurate selection of individuals with high breeding value. We conducted a comparison of 10 existing regression methods and 3 new ranking methods on 6 datasets, consisting of 4 plant species and 25 traits. Our experimental results suggest that tree-based ensemble methods including McRank, Random Forests and Gradient Boosting Regression Trees achieve excellent ranking accuracy. RKHS regression and RankSVM also achieve good accuracy when used with an RBF kernel. Traditional regression methods such as Bayesian lasso, wBSR and BayesC were found less suitable for ranking. Pearson correlation was found to correlate poorly with NDCG. Our study suggests two important messages. First, ranking methods are a promising research direction in GS. Second, NDCG can be a useful evaluation measure for GS.
Can tree species diversity be assessed with Landsat data in a temperate forest?
Arekhi, Maliheh; Yılmaz, Osman Yalçın; Yılmaz, Hatice; Akyüz, Yaşar Feyza
2017-10-28
The diversity of forest trees as an indicator of ecosystem health can be assessed using the spectral characteristics of plant communities through remote sensing data. The objectives of this study were to investigate alpha and beta tree diversity using Landsat data for six dates in the Gönen dam watershed of Turkey. We used richness and the Shannon and Simpson diversity indices to calculate tree alpha diversity. We also represented the relationship between beta diversity and remotely sensed data using species composition similarity and spectral distance similarity of sampling plots via quantile regression. A total of 99 sampling units, each 20 m × 20 m, were selected using geographically stratified random sampling method. Within each plot, the tree species were identified, and all of the trees with a diameter at breast height (dbh) larger than 7 cm were measured. Presence/absence and abundance data (tree species number and tree species basal area) of tree species were used to determine the relationship between richness and the Shannon and Simpson diversity indices, which were computed with ground field data, and spectral variables derived (2 × 2 pixels and 3 × 3 pixels) from Landsat 8 OLI data. The Shannon-Weiner index had the highest correlation. For all six dates, NDVI (normalized difference vegetation index) was the spectral variable most strongly correlated with the Shannon index and the tree diversity variables. The Ratio of green to red (VI) was the spectral variable least correlated with the tree diversity variables and the Shannon basal area. In both beta diversity curves, the slope of the OLS regression was low, while in the upper quantile, it was approximately twice the lower quantiles. The Jaccard index is closed to one with little difference in both two beta diversity approaches. This result is due to increasing the similarity between the sampling plots when they are located close to each other. The intercept differences between two investigated beta diversity were strongly related to the development stage of a number of sampling plots in the tree species basal area method. To obtain beta diversity, the tree basal area method indicates better result than the tree species number method at representing similarity of regions which are located close together. In conclusion, NDVI is helpful for estimating the alpha diversity of trees over large areas when the vegetation is at the maximum growing season. Beta diversity could be obtained with the spectral heterogeneity of Landsat data. Future tree diversity studies using remote sensing data should select data sets when vegetation is at the maximum growing season. Also, forest tree diversity investigations can be identified by using higher-resolution remote sensing data such as ESA Sentinel 2 data which is freely available since June 2015.
Wheeler, David C.; Burstyn, Igor; Vermeulen, Roel; Yu, Kai; Shortreed, Susan M.; Pronk, Anjoeka; Stewart, Patricia A.; Colt, Joanne S.; Baris, Dalsu; Karagas, Margaret R.; Schwenn, Molly; Johnson, Alison; Silverman, Debra T.; Friesen, Melissa C.
2014-01-01
Objectives Evaluating occupational exposures in population-based case-control studies often requires exposure assessors to review each study participants' reported occupational information job-by-job to derive exposure estimates. Although such assessments likely have underlying decision rules, they usually lack transparency, are time-consuming and have uncertain reliability and validity. We aimed to identify the underlying rules to enable documentation, review, and future use of these expert-based exposure decisions. Methods Classification and regression trees (CART, predictions from a single tree) and random forests (predictions from many trees) were used to identify the underlying rules from the questionnaire responses and an expert's exposure assignments for occupational diesel exhaust exposure for several metrics: binary exposure probability and ordinal exposure probability, intensity, and frequency. Data were split into training (n=10,488 jobs), testing (n=2,247), and validation (n=2,248) data sets. Results The CART and random forest models' predictions agreed with 92–94% of the expert's binary probability assignments. For ordinal probability, intensity, and frequency metrics, the two models extracted decision rules more successfully for unexposed and highly exposed jobs (86–90% and 57–85%, respectively) than for low or medium exposed jobs (7–71%). Conclusions CART and random forest models extracted decision rules and accurately predicted an expert's exposure decisions for the majority of jobs and identified questionnaire response patterns that would require further expert review if the rules were applied to other jobs in the same or different study. This approach makes the exposure assessment process in case-control studies more transparent and creates a mechanism to efficiently replicate exposure decisions in future studies. PMID:23155187
Predicting the limits to tree height using statistical regressions of leaf traits.
Burgess, Stephen S O; Dawson, Todd E
2007-01-01
Leaf morphology and physiological functioning demonstrate considerable plasticity within tree crowns, with various leaf traits often exhibiting pronounced vertical gradients in very tall trees. It has been proposed that the trajectory of these gradients, as determined by regression methods, could be used in conjunction with theoretical biophysical limits to estimate the maximum height to which trees can grow. Here, we examined this approach using published and new experimental data from tall conifer and angiosperm species. We showed that height predictions were sensitive to tree-to-tree variation in the shape of the regression and to the biophysical endpoints selected. We examined the suitability of proposed end-points and their theoretical validity. We also noted that site and environment influenced height predictions considerably. Use of leaf mass per unit area or leaf water potential coupled with vulnerability of twigs to cavitation poses a number of difficulties for predicting tree height. Photosynthetic rate and carbon isotope discrimination show more promise, but in the second case, the complex relationship between light, water availability, photosynthetic capacity and internal conductance to CO(2) must first be characterized.
Policy Implications and Suggestions on Administrative Measures of Urban Flood
NASA Astrophysics Data System (ADS)
Lee, S. V.; Lee, M. J.; Lee, C.; Yoon, J. H.; Chae, S. H.
2017-12-01
The frequency and intensity of floods are increasing worldwide as recent climate change progresses gradually. Flood management should be policy-oriented in urban municipalities due to the characteristics of urban areas with a lot of damage. Therefore, the purpose of this study is to prepare a flood susceptibility map by using data mining model and make a policy suggestion on administrative measures of urban flood. Therefore, we constructed a spatial database by collecting relevant factors including the topography, geology, soil and land use data of the representative city, Seoul, the capital city of Korea. Flood susceptibility map was constructed by applying the data mining models of random forest and boosted tree model to input data and existing flooded area data in 2010. The susceptibility map has been validated using the 2011 flood area data which was not used for training. The predictor importance value of each factor to the results was calculated in this process. The distance from the water, DEM and geology showed a high predictor importance value which means to be a high priority for flood preparation policy. As a result of receiver operating characteristic (ROC), random forest model showed 78.78% and 79.18% accuracy of regression and classification and boosted tree model showed 77.55% and 77.26% accuracy of regression and classification, respectively. The results show that the flood susceptibility maps can be applied to flood prevention and management, and it also can help determine the priority areas for flood mitigation policy by providing useful information to policy makers.
Fraccaro, Paolo; Nicolo, Massimo; Bonetto, Monica; Giacomini, Mauro; Weller, Peter; Traverso, Carlo Enrico; Prosperi, Mattia; OSullivan, Dympna
2015-01-27
To investigate machine learning methods, ranging from simpler interpretable techniques to complex (non-linear) "black-box" approaches, for automated diagnosis of Age-related Macular Degeneration (AMD). Data from healthy subjects and patients diagnosed with AMD or other retinal diseases were collected during routine visits via an Electronic Health Record (EHR) system. Patients' attributes included demographics and, for each eye, presence/absence of major AMD-related clinical signs (soft drusen, retinal pigment epitelium, defects/pigment mottling, depigmentation area, subretinal haemorrhage, subretinal fluid, macula thickness, macular scar, subretinal fibrosis). Interpretable techniques known as white box methods including logistic regression and decision trees as well as less interpreitable techniques known as black box methods, such as support vector machines (SVM), random forests and AdaBoost, were used to develop models (trained and validated on unseen data) to diagnose AMD. The gold standard was confirmed diagnosis of AMD by physicians. Sensitivity, specificity and area under the receiver operating characteristic (AUC) were used to assess performance. Study population included 487 patients (912 eyes). In terms of AUC, random forests, logistic regression and adaboost showed a mean performance of (0.92), followed by SVM and decision trees (0.90). All machine learning models identified soft drusen and age as the most discriminating variables in clinicians' decision pathways to diagnose AMD. Both black-box and white box methods performed well in identifying diagnoses of AMD and their decision pathways. Machine learning models developed through the proposed approach, relying on clinical signs identified by retinal specialists, could be embedded into EHR to provide physicians with real time (interpretable) support.
Equations for predicting biomass in 2- to 6-year-old Eucalyptus saligna in Hawaii
Craig D. Whitesell; Susan C. Miyasaka; Robert F. Strand; Thomas H. Schubert; Katharine E. McDuffie
1988-01-01
Eucalyptus saligna trees grown in short-rotation plantations on the island of Hawaii were measured, harvested, and weighed to provide data for developing regression equations using non-destructive stand measurements. Regression analysis of the data from 190 trees in the 2.0- to 3.5-year range and 96 trees in the 4- to 6-year range related stem-only...
Bianca N.I. Eskelson; Hailemariam Temesgen; Tara M. Barrett
2009-01-01
Cavity tree and snag abundance data are highly variable and contain many zero observations. We predict cavity tree and snag abundance from variables that are readily available from forest cover maps or remotely sensed data using negative binomial (NB), zero-inflated NB, and zero-altered NB (ZANB) regression models as well as nearest neighbor (NN) imputation methods....
Lo, Benjamin W Y; Fukuda, Hitoshi; Angle, Mark; Teitelbaum, Jeanne; Macdonald, R Loch; Farrokhyar, Forough; Thabane, Lehana; Levine, Mitchell A H
2016-01-01
Classification and regression tree analysis involves the creation of a decision tree by recursive partitioning of a dataset into more homogeneous subgroups. Thus far, there is scarce literature on using this technique to create clinical prediction tools for aneurysmal subarachnoid hemorrhage (SAH). The classification and regression tree analysis technique was applied to the multicenter Tirilazad database (3551 patients) in order to create the decision-making algorithm. In order to elucidate prognostic subgroups in aneurysmal SAH, neurologic, systemic, and demographic factors were taken into account. The dependent variable used for analysis was the dichotomized Glasgow Outcome Score at 3 months. Classification and regression tree analysis revealed seven prognostic subgroups. Neurological grade, occurrence of post-admission stroke, occurrence of post-admission fever, and age represented the explanatory nodes of this decision tree. Split sample validation revealed classification accuracy of 79% for the training dataset and 77% for the testing dataset. In addition, the occurrence of fever at 1-week post-aneurysmal SAH is associated with increased odds of post-admission stroke (odds ratio: 1.83, 95% confidence interval: 1.56-2.45, P < 0.01). A clinically useful classification tree was generated, which serves as a prediction tool to guide bedside prognostication and clinical treatment decision making. This prognostic decision-making algorithm also shed light on the complex interactions between a number of risk factors in determining outcome after aneurysmal SAH.
Using histograms to introduce randomization in the generation of ensembles of decision trees
Kamath, Chandrika; Cantu-Paz, Erick; Littau, David
2005-02-22
A system for decision tree ensembles that includes a module to read the data, a module to create a histogram, a module to evaluate a potential split according to some criterion using the histogram, a module to select a split point randomly in an interval around the best split, a module to split the data, and a module to combine multiple decision trees in ensembles. The decision tree method includes the steps of reading the data; creating a histogram; evaluating a potential split according to some criterion using the histogram, selecting a split point randomly in an interval around the best split, splitting the data, and combining multiple decision trees in ensembles.
Eskelson, Bianca N.I.; Hagar, Joan; Temesgen, Hailemariam
2012-01-01
Snags (standing dead trees) are an essential structural component of forests. Because wildlife use of snags depends on size and decay stage, snag density estimation without any information about snag quality attributes is of little value for wildlife management decision makers. Little work has been done to develop models that allow multivariate estimation of snag density by snag quality class. Using climate, topography, Landsat TM data, stand age and forest type collected for 2356 forested Forest Inventory and Analysis plots in western Washington and western Oregon, we evaluated two multivariate techniques for their abilities to estimate density of snags by three decay classes. The density of live trees and snags in three decay classes (D1: recently dead, little decay; D2: decay, without top, some branches and bark missing; D3: extensive decay, missing bark and most branches) with diameter at breast height (DBH) ≥ 12.7 cm was estimated using a nonparametric random forest nearest neighbor imputation technique (RF) and a parametric two-stage model (QPORD), for which the number of trees per hectare was estimated with a Quasipoisson model in the first stage and the probability of belonging to a tree status class (live, D1, D2, D3) was estimated with an ordinal regression model in the second stage. The presence of large snags with DBH ≥ 50 cm was predicted using a logistic regression and RF imputation. Because of the more homogenous conditions on private forest lands, snag density by decay class was predicted with higher accuracies on private forest lands than on public lands, while presence of large snags was more accurately predicted on public lands, owing to the higher prevalence of large snags on public lands. RF outperformed the QPORD model in terms of percent accurate predictions, while QPORD provided smaller root mean square errors in predicting snag density by decay class. The logistic regression model achieved more accurate presence/absence classification of large snags than the RF imputation approach. Adjusting the decision threshold to account for unequal size for presence and absence classes is more straightforward for the logistic regression than for the RF imputation approach. Overall, model accuracies were poor in this study, which can be attributed to the poor predictive quality of the explanatory variables and the large range of forest types and geographic conditions observed in the data.
Teale, Stephen A; Letkowski, Steven; Matusick, George; Stehman, Stephen V; Castello, John D
2009-08-01
Beech scale, Cryptococcus fagisuga Lindinger, is a non-native invasive insect associated with beech bark disease. A quantitative method of measuring viable scale density at the levels of the individual tree and localized bark patches was developed. Bark patches (10 cm(2)) were removed at 0, 1, and 2 m above the ground and at the four cardinal directions from 13 trees in northern New York and 12 trees in northern Michigan. Digital photographs of each patch were made, and the wax mass area was measured from two random 1-cm(2) subsamples on each bark patch using image analysis software. Viable scale insects were counted after removing the wax under a dissecting microscope. Separate regression analyses at the whole tree level for the New York and Michigan sites each showed a strong positive relationship of wax mass area with the number of underlying viable scale insects. The relationships for the New York and Michigan data were not significantly different from each other, and when pooling data from the two sites, there was still a significant positive relationship between wax mass area and the number of scale insects. The relationships between viable scale insects and wax mass area were different at the 0-, 1-, and 2-m sampling heights but do not seem to affect the relationship. This method does not disrupt the insect or its interactions with the host tree.
Amini, Payam; Maroufizadeh, Saman; Samani, Reza Omani; Hamidi, Omid; Sepidarkish, Mahdi
2017-06-01
Preterm birth (PTB) is a leading cause of neonatal death and the second biggest cause of death in children under five years of age. The objective of this study was to determine the prevalence of PTB and its associated factors using logistic regression and decision tree classification methods. This cross-sectional study was conducted on 4,415 pregnant women in Tehran, Iran, from July 6-21, 2015. Data were collected by a researcher-developed questionnaire through interviews with mothers and review of their medical records. To evaluate the accuracy of the logistic regression and decision tree methods, several indices such as sensitivity, specificity, and the area under the curve were used. The PTB rate was 5.5% in this study. The logistic regression outperformed the decision tree for the classification of PTB based on risk factors. Logistic regression showed that multiple pregnancies, mothers with preeclampsia, and those who conceived with assisted reproductive technology had an increased risk for PTB ( p < 0.05). Identifying and training mothers at risk as well as improving prenatal care may reduce the PTB rate. We also recommend that statisticians utilize the logistic regression model for the classification of risk groups for PTB.
Jarnevich, Catherine S.; Talbert, Marian; Morisette, Jeffrey T.; Aldridge, Cameron L.; Brown, Cynthia; Kumar, Sunil; Manier, Daniel; Talbert, Colin; Holcombe, Tracy R.
2017-01-01
Evaluating the conditions where a species can persist is an important question in ecology both to understand tolerances of organisms and to predict distributions across landscapes. Presence data combined with background or pseudo-absence locations are commonly used with species distribution modeling to develop these relationships. However, there is not a standard method to generate background or pseudo-absence locations, and method choice affects model outcomes. We evaluated combinations of both model algorithms (simple and complex generalized linear models, multivariate adaptive regression splines, Maxent, boosted regression trees, and random forest) and background methods (random, minimum convex polygon, and continuous and binary kernel density estimator (KDE)) to assess the sensitivity of model outcomes to choices made. We evaluated six questions related to model results, including five beyond the common comparison of model accuracy assessment metrics (biological interpretability of response curves, cross-validation robustness, independent data accuracy and robustness, and prediction consistency). For our case study with cheatgrass in the western US, random forest was least sensitive to background choice and the binary KDE method was least sensitive to model algorithm choice. While this outcome may not hold for other locations or species, the methods we used can be implemented to help determine appropriate methodologies for particular research questions.
Estimation of retinal vessel caliber using model fitting and random forests
NASA Astrophysics Data System (ADS)
Araújo, Teresa; Mendonça, Ana Maria; Campilho, Aurélio
2017-03-01
Retinal vessel caliber changes are associated with several major diseases, such as diabetes and hypertension. These caliber changes can be evaluated using eye fundus images. However, the clinical assessment is tiresome and prone to errors, motivating the development of automatic methods. An automatic method based on vessel crosssection intensity profile model fitting for the estimation of vessel caliber in retinal images is herein proposed. First, vessels are segmented from the image, vessel centerlines are detected and individual segments are extracted and smoothed. Intensity profiles are extracted perpendicularly to the vessel, and the profile lengths are determined. Then, model fitting is applied to the smoothed profiles. A novel parametric model (DoG-L7) is used, consisting on a Difference-of-Gaussians multiplied by a line which is able to describe profile asymmetry. Finally, the parameters of the best-fit model are used for determining the vessel width through regression using ensembles of bagged regression trees with random sampling of the predictors (random forests). The method is evaluated on the REVIEW public dataset. A precision close to the observers is achieved, outperforming other state-of-the-art methods. The method is robust and reliable for width estimation in images with pathologies and artifacts, with performance independent of the range of diameters.
Boosted Regression Tree Models to Explain Watershed Nutrient Concentrations and Biological Condition
Boosted regression tree (BRT) models were developed to quantify the nonlinear relationships between landscape variables and nutrient concentrations in a mesoscale mixed land cover watershed during base-flow conditions. Factors that affect instream biological components, based on ...
Differences in Risk Factors for Rotator Cuff Tears between Elderly Patients and Young Patients.
Watanabe, Akihisa; Ono, Qana; Nishigami, Tomohiko; Hirooka, Takahiko; Machida, Hirohisa
2018-02-01
It has been unclear whether the risk factors for rotator cuff tears are the same at all ages or differ between young and older populations. In this study, we examined the risk factors for rotator cuff tears using classification and regression tree analysis as methods of nonlinear regression analysis. There were 65 patients in the rotator cuff tears group and 45 patients in the intact rotator cuff group. Classification and regression tree analysis was performed to predict rotator cuff tears. The target factor was rotator cuff tears; explanatory variables were age, sex, trauma, and critical shoulder angle≥35°. In the results of classification and regression tree analysis, the tree was divided at age 64. For patients aged≥64, the tree was divided at trauma. For patients aged<64, the tree was divided at critical shoulder angle≥35°. The odds ratio for critical shoulder angle≥35° was significant for all ages (5.89), and for patients aged<64 (10.3) while trauma was only a significant factor for patients aged≥64 (5.13). Age, trauma, and critical shoulder angle≥35° were related to rotator cuff tears in this study. However, these risk factors showed different trends according to age group, not a linear relationship.
Hao, Xu; Yujun, Sun; Xinjie, Wang; Jin, Wang; Yao, Fu
2015-01-01
A multiple linear model was developed for individual tree crown width of Cunninghamia lanceolata (Lamb.) Hook in Fujian province, southeast China. Data were obtained from 55 sample plots of pure China-fir plantation stands. An Ordinary Linear Least Squares (OLS) regression was used to establish the crown width model. To adjust for correlations between observations from the same sample plots, we developed one level linear mixed-effects (LME) models based on the multiple linear model, which take into account the random effects of plots. The best random effects combinations for the LME models were determined by the Akaike's information criterion, the Bayesian information criterion and the -2logarithm likelihood. Heteroscedasticity was reduced by three residual variance functions: the power function, the exponential function and the constant plus power function. The spatial correlation was modeled by three correlation structures: the first-order autoregressive structure [AR(1)], a combination of first-order autoregressive and moving average structures [ARMA(1,1)], and the compound symmetry structure (CS). Then, the LME model was compared to the multiple linear model using the absolute mean residual (AMR), the root mean square error (RMSE), and the adjusted coefficient of determination (adj-R2). For individual tree crown width models, the one level LME model showed the best performance. An independent dataset was used to test the performance of the models and to demonstrate the advantage of calibrating LME models.
Bayesian Ensemble Trees (BET) for Clustering and Prediction in Heterogeneous Data
Duan, Leo L.; Clancy, John P.; Szczesniak, Rhonda D.
2016-01-01
We propose a novel “tree-averaging” model that utilizes the ensemble of classification and regression trees (CART). Each constituent tree is estimated with a subset of similar data. We treat this grouping of subsets as Bayesian Ensemble Trees (BET) and model them as a Dirichlet process. We show that BET determines the optimal number of trees by adapting to the data heterogeneity. Compared with the other ensemble methods, BET requires much fewer trees and shows equivalent prediction accuracy using weighted averaging. Moreover, each tree in BET provides variable selection criterion and interpretation for each subset. We developed an efficient estimating procedure with improved estimation strategies in both CART and mixture models. We demonstrate these advantages of BET with simulations and illustrate the approach with a real-world data example involving regression of lung function measurements obtained from patients with cystic fibrosis. Supplemental materials are available online. PMID:27524872
Mani, Ashutosh; Rao, Marepalli; James, Kelley; Bhattacharya, Amit
2015-01-01
The purpose of this study was to explore data-driven models, based on decision trees, to develop practical and easy to use predictive models for early identification of firefighters who are likely to cross the threshold of hyperthermia during live-fire training. Predictive models were created for three consecutive live-fire training scenarios. The final predicted outcome was a categorical variable: will a firefighter cross the upper threshold of hyperthermia - Yes/No. Two tiers of models were built, one with and one without taking into account the outcome (whether a firefighter crossed hyperthermia or not) from the previous training scenario. First tier of models included age, baseline heart rate and core body temperature, body mass index, and duration of training scenario as predictors. The second tier of models included the outcome of the previous scenario in the prediction space, in addition to all the predictors from the first tier of models. Classification and regression trees were used independently for prediction. The response variable for the regression tree was the quantitative variable: core body temperature at the end of each scenario. The predicted quantitative variable from regression trees was compared to the upper threshold of hyperthermia (38°C) to predict whether a firefighter would enter hyperthermia. The performance of classification and regression tree models was satisfactory for the second (success rate = 79%) and third (success rate = 89%) training scenarios but not for the first (success rate = 43%). Data-driven models based on decision trees can be a useful tool for predicting physiological response without modeling the underlying physiological systems. Early prediction of heat stress coupled with proactive interventions, such as pre-cooling, can help reduce heat stress in firefighters.
Cade, Brian S.; Noon, Barry R.; Scherer, Rick D.; Keane, John J.
2017-01-01
Counts of avian fledglings, nestlings, or clutch size that are bounded below by zero and above by some small integer form a discrete random variable distribution that is not approximated well by conventional parametric count distributions such as the Poisson or negative binomial. We developed a logistic quantile regression model to provide estimates of the empirical conditional distribution of a bounded discrete random variable. The logistic quantile regression model requires that counts are randomly jittered to a continuous random variable, logit transformed to bound them between specified lower and upper values, then estimated in conventional linear quantile regression, repeating the 3 steps and averaging estimates. Back-transformation to the original discrete scale relies on the fact that quantiles are equivariant to monotonic transformations. We demonstrate this statistical procedure by modeling 20 years of California Spotted Owl fledgling production (0−3 per territory) on the Lassen National Forest, California, USA, as related to climate, demographic, and landscape habitat characteristics at territories. Spotted Owl fledgling counts increased nonlinearly with decreasing precipitation in the early nesting period, in the winter prior to nesting, and in the prior growing season; with increasing minimum temperatures in the early nesting period; with adult compared to subadult parents; when there was no fledgling production in the prior year; and when percentage of the landscape surrounding nesting sites (202 ha) with trees ≥25 m height increased. Changes in production were primarily driven by changes in the proportion of territories with 2 or 3 fledglings. Average variances of the discrete cumulative distributions of the estimated fledgling counts indicated that temporal changes in climate and parent age class explained 18% of the annual variance in owl fledgling production, which was 34% of the total variance. Prior fledgling production explained as much of the variance in the fledgling counts as climate, parent age class, and landscape habitat predictors. Our logistic quantile regression model can be used for any discrete response variables with fixed upper and lower bounds.
Ensemble habitat mapping of invasive plant species
Stohlgren, T.J.; Ma, P.; Kumar, S.; Rocca, M.; Morisette, J.T.; Jarnevich, C.S.; Benson, N.
2010-01-01
Ensemble species distribution models combine the strengths of several species environmental matching models, while minimizing the weakness of any one model. Ensemble models may be particularly useful in risk analysis of recently arrived, harmful invasive species because species may not yet have spread to all suitable habitats, leaving species-environment relationships difficult to determine. We tested five individual models (logistic regression, boosted regression trees, random forest, multivariate adaptive regression splines (MARS), and maximum entropy model or Maxent) and ensemble modeling for selected nonnative plant species in Yellowstone and Grand Teton National Parks, Wyoming; Sequoia and Kings Canyon National Parks, California, and areas of interior Alaska. The models are based on field data provided by the park staffs, combined with topographic, climatic, and vegetation predictors derived from satellite data. For the four invasive plant species tested, ensemble models were the only models that ranked in the top three models for both field validation and test data. Ensemble models may be more robust than individual species-environment matching models for risk analysis. ?? 2010 Society for Risk Analysis.
A self-trained classification technique for producing 30 m percent-water maps from Landsat data
Rover, Jennifer R.; Wylie, Bruce K.; Ji, Lei
2010-01-01
Small bodies of water can be mapped with moderate-resolution satellite data using methods where water is mapped as subpixel fractions using field measurements or high-resolution images as training datasets. A new method, developed from a regression-tree technique, uses a 30 m Landsat image for training the regression tree that, in turn, is applied to the same image to map subpixel water. The self-trained method was evaluated by comparing the percent-water map with three other maps generated from established percent-water mapping methods: (1) a regression-tree model trained with a 5 m SPOT 5 image, (2) a regression-tree model based on endmembers and (3) a linear unmixing classification technique. The results suggest that subpixel water fractions can be accurately estimated when high-resolution satellite data or intensively interpreted training datasets are not available, which increases our ability to map small water bodies or small changes in lake size at a regional scale.
Scalable Regression Tree Learning on Hadoop using OpenPlanet
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yin, Wei; Simmhan, Yogesh; Prasanna, Viktor
As scientific and engineering domains attempt to effectively analyze the deluge of data arriving from sensors and instruments, machine learning is becoming a key data mining tool to build prediction models. Regression tree is a popular learning model that combines decision trees and linear regression to forecast numerical target variables based on a set of input features. Map Reduce is well suited for addressing such data intensive learning applications, and a proprietary regression tree algorithm, PLANET, using MapReduce has been proposed earlier. In this paper, we describe an open source implement of this algorithm, OpenPlanet, on the Hadoop framework usingmore » a hybrid approach. Further, we evaluate the performance of OpenPlanet using realworld datasets from the Smart Power Grid domain to perform energy use forecasting, and propose tuning strategies of Hadoop parameters to improve the performance of the default configuration by 75% for a training dataset of 17 million tuples on a 64-core Hadoop cluster on FutureGrid.« less
NASA Astrophysics Data System (ADS)
Herguido, Estela; Pulido, Manuel; Francisco Lavado Contador, Joaquín; Schnabel, Susanne
2017-04-01
In Iberian dehesas and montados, the lack of tree recruitment compromises its long-term sustainability. However, in marginal areas of dehesas shrub encroachment facilitates tree recruitment while altering the distinctive physiognomic and cultural characteristics of the system. These are ongoing processes that should be considered when designing afforestation measures and policies. Based on spatial variables, we modeled the proneness of a piece of land to undergo tree recruitment and the results were related with the afforestation measures carried out under the UE First Afforestation Agricultural Land Program between 1992 and 2008. We analyzed the temporal tree population dynamics in 800 randomly selected plots of 100 m radius (2,510 ha in total) in dehesas and treeless pasturelands of Extremadura (hereafter rangelands). Tree changes were revealed by comparing aerial images taken in 1956 with orthophotographs and infrared ones from 2012. Spatial models that predict the areas prone either to lack tree recruitment or with recruitment were developed and based on three data mining algorithms: MARS (Multivariate Adaptive Regression Splines), Random Forest (RF) and Stochastic Gradient Boosting (Tree-Net, TN). Recruited-tree locations (1) vs. locations of places with no recruitment (0) (randomly selected from the study areas) were used as the binary dependent variable. A 5% of the data were used as test data set. As candidate explanatory variables we used 51 different topographic, climatic, bioclimatic, land cover-related and edaphic ones. The statistical models developed were extrapolated to the spatial context of the afforested areas in the region and also to the whole Extremenian rangelands, and the percentage of area modelled as prone to tree recruitment was calculated for each case. A total of 46,674.63 ha were afforested with holm oak (Quercus ilex) or cork oak (Quercus suber) in the studied rangelands under the UE First Afforestation Agricultural Land Program. In the sampled plots, 16,747 trees were detected as recruited, while 47,058 and 12,803 were present in both dates and lost during the studied period, respectively. Based on the Area Under the ROC Curve (AUC), all the data mining models considered showed a high fitness (MARS AUC= 0.86; TN AUC= 0.92; RF AUC= 0.95) and low misclassification rates. Correctly predicted test samples for absence and presence of tree recruitment accounted respectively to 78.3% and 76.8% when using MARS, 90.8% and 90.8% using TN and 88.9% and 89.1% using RF. The spatial patterns of the different models were similar. However, attending only the percentage of area prone to tree recruitment, outstanding differences were observed among models considering the total surface of rangelands (36.03% in MARS, 22.88% in TN and 6.72 % in RF). Despite these differences, when comparing the results with those of the afforested surfaces (31.73% in MARS, 20.70% in TN and 5.63 % in RF) the three algorithms pointed to similar conclusions, i.e. the afforestations performed in rangelands of Extremadura under UE First Afforestation Agricultural Land Program, barely discriminate between areas with or without natural regeneration. In conclusion, data mining technics are suitable to develop high-performance spatial models of vegetation dynamics. These models could be useful for policy and decision makers aimed at assessing the implementation of afforestation measures and the selection of more adequate locations.
Comparison of stream invertebrate response models for bioassessment metric
Waite, Ian R.; Kennen, Jonathan G.; May, Jason T.; Brown, Larry R.; Cuffney, Thomas F.; Jones, Kimberly A.; Orlando, James L.
2012-01-01
We aggregated invertebrate data from various sources to assemble data for modeling in two ecoregions in Oregon and one in California. Our goal was to compare the performance of models developed using multiple linear regression (MLR) techniques with models developed using three relatively new techniques: classification and regression trees (CART), random forest (RF), and boosted regression trees (BRT). We used tolerance of taxa based on richness (RICHTOL) and ratio of observed to expected taxa (O/E) as response variables and land use/land cover as explanatory variables. Responses were generally linear; therefore, there was little improvement to the MLR models when compared to models using CART and RF. In general, the four modeling techniques (MLR, CART, RF, and BRT) consistently selected the same primary explanatory variables for each region. However, results from the BRT models showed significant improvement over the MLR models for each region; increases in R2 from 0.09 to 0.20. The O/E metric that was derived from models specifically calibrated for Oregon consistently had lower R2 values than RICHTOL for the two regions tested. Modeled O/E R2 values were between 0.06 and 0.10 lower for each of the four modeling methods applied in the Willamette Valley and were between 0.19 and 0.36 points lower for the Blue Mountains. As a result, BRT models may indeed represent a good alternative to MLR for modeling species distribution relative to environmental variables.
Qiu, Lefeng; Wang, Kai; Long, Wenli; Wang, Ke; Hu, Wei; Amable, Gabriel S.
2016-01-01
Soil cadmium (Cd) contamination has attracted a great deal of attention because of its detrimental effects on animals and humans. This study aimed to develop and compare the performances of stepwise linear regression (SLR), classification and regression tree (CART) and random forest (RF) models in the prediction and mapping of the spatial distribution of soil Cd and to identify likely sources of Cd accumulation in Fuyang County, eastern China. Soil Cd data from 276 topsoil (0–20 cm) samples were collected and randomly divided into calibration (222 samples) and validation datasets (54 samples). Auxiliary data, including detailed land use information, soil organic matter, soil pH, and topographic data, were incorporated into the models to simulate the soil Cd concentrations and further identify the main factors influencing soil Cd variation. The predictive models for soil Cd concentration exhibited acceptable overall accuracies (72.22% for SLR, 70.37% for CART, and 75.93% for RF). The SLR model exhibited the largest predicted deviation, with a mean error (ME) of 0.074 mg/kg, a mean absolute error (MAE) of 0.160 mg/kg, and a root mean squared error (RMSE) of 0.274 mg/kg, and the RF model produced the results closest to the observed values, with an ME of 0.002 mg/kg, an MAE of 0.132 mg/kg, and an RMSE of 0.198 mg/kg. The RF model also exhibited the greatest R2 value (0.772). The CART model predictions closely followed, with ME, MAE, RMSE, and R2 values of 0.013 mg/kg, 0.154 mg/kg, 0.230 mg/kg and 0.644, respectively. The three prediction maps generally exhibited similar and realistic spatial patterns of soil Cd contamination. The heavily Cd-affected areas were primarily located in the alluvial valley plain of the Fuchun River and its tributaries because of the dramatic industrialization and urbanization processes that have occurred there. The most important variable for explaining high levels of soil Cd accumulation was the presence of metal smelting industries. The good performance of the RF model was attributable to its ability to handle the non-linear and hierarchical relationships between soil Cd and environmental variables. These results confirm that the RF approach is promising for the prediction and spatial distribution mapping of soil Cd at the regional scale. PMID:26964095
Qiu, Lefeng; Wang, Kai; Long, Wenli; Wang, Ke; Hu, Wei; Amable, Gabriel S
2016-01-01
Soil cadmium (Cd) contamination has attracted a great deal of attention because of its detrimental effects on animals and humans. This study aimed to develop and compare the performances of stepwise linear regression (SLR), classification and regression tree (CART) and random forest (RF) models in the prediction and mapping of the spatial distribution of soil Cd and to identify likely sources of Cd accumulation in Fuyang County, eastern China. Soil Cd data from 276 topsoil (0-20 cm) samples were collected and randomly divided into calibration (222 samples) and validation datasets (54 samples). Auxiliary data, including detailed land use information, soil organic matter, soil pH, and topographic data, were incorporated into the models to simulate the soil Cd concentrations and further identify the main factors influencing soil Cd variation. The predictive models for soil Cd concentration exhibited acceptable overall accuracies (72.22% for SLR, 70.37% for CART, and 75.93% for RF). The SLR model exhibited the largest predicted deviation, with a mean error (ME) of 0.074 mg/kg, a mean absolute error (MAE) of 0.160 mg/kg, and a root mean squared error (RMSE) of 0.274 mg/kg, and the RF model produced the results closest to the observed values, with an ME of 0.002 mg/kg, an MAE of 0.132 mg/kg, and an RMSE of 0.198 mg/kg. The RF model also exhibited the greatest R2 value (0.772). The CART model predictions closely followed, with ME, MAE, RMSE, and R2 values of 0.013 mg/kg, 0.154 mg/kg, 0.230 mg/kg and 0.644, respectively. The three prediction maps generally exhibited similar and realistic spatial patterns of soil Cd contamination. The heavily Cd-affected areas were primarily located in the alluvial valley plain of the Fuchun River and its tributaries because of the dramatic industrialization and urbanization processes that have occurred there. The most important variable for explaining high levels of soil Cd accumulation was the presence of metal smelting industries. The good performance of the RF model was attributable to its ability to handle the non-linear and hierarchical relationships between soil Cd and environmental variables. These results confirm that the RF approach is promising for the prediction and spatial distribution mapping of soil Cd at the regional scale.
Method for estimating potential tree-grade distributions for northeastern forest species
Daniel A. Yaussy; Daniel A. Yaussy
1993-01-01
Generalized logistic regression was used to distribute trees into four potential tree grades for 20 northeastern species groups. The potential tree grade is defined as the tree grade based on the length and amount of clear cuttings and defects only, disregarding minimum grading diameter. The algorithms described use site index and tree diameter as the predictive...
Additivity of nonlinear biomass equations
Bernard R. Parresol
2001-01-01
Two procedures that guarantee the property of additivity among the components of tree biomass and total tree biomass utilizing nonlinear functions are developed. Procedure 1 is a simple combination approach, and procedure 2 is based on nonlinear joint-generalized regression (nonlinear seemingly unrelated regressions) with parameter restrictions. Statistical theory is...
Park, Ji Hyun; Kim, Hyeon-Young; Lee, Hanna; Yun, Eun Kyoung
2015-12-01
This study compares the performance of the logistic regression and decision tree analysis methods for assessing the risk factors for infection in cancer patients undergoing chemotherapy. The subjects were 732 cancer patients who were receiving chemotherapy at K university hospital in Seoul, Korea. The data were collected between March 2011 and February 2013 and were processed for descriptive analysis, logistic regression and decision tree analysis using the IBM SPSS Statistics 19 and Modeler 15.1 programs. The most common risk factors for infection in cancer patients receiving chemotherapy were identified as alkylating agents, vinca alkaloid and underlying diabetes mellitus. The logistic regression explained 66.7% of the variation in the data in terms of sensitivity and 88.9% in terms of specificity. The decision tree analysis accounted for 55.0% of the variation in the data in terms of sensitivity and 89.0% in terms of specificity. As for the overall classification accuracy, the logistic regression explained 88.0% and the decision tree analysis explained 87.2%. The logistic regression analysis showed a higher degree of sensitivity and classification accuracy. Therefore, logistic regression analysis is concluded to be the more effective and useful method for establishing an infection prediction model for patients undergoing chemotherapy. Copyright © 2015 Elsevier Ltd. All rights reserved.
2011-01-01
Background Dementia and cognitive impairment associated with aging are a major medical and social concern. Neuropsychological testing is a key element in the diagnostic procedures of Mild Cognitive Impairment (MCI), but has presently a limited value in the prediction of progression to dementia. We advance the hypothesis that newer statistical classification methods derived from data mining and machine learning methods like Neural Networks, Support Vector Machines and Random Forests can improve accuracy, sensitivity and specificity of predictions obtained from neuropsychological testing. Seven non parametric classifiers derived from data mining methods (Multilayer Perceptrons Neural Networks, Radial Basis Function Neural Networks, Support Vector Machines, CART, CHAID and QUEST Classification Trees and Random Forests) were compared to three traditional classifiers (Linear Discriminant Analysis, Quadratic Discriminant Analysis and Logistic Regression) in terms of overall classification accuracy, specificity, sensitivity, Area under the ROC curve and Press'Q. Model predictors were 10 neuropsychological tests currently used in the diagnosis of dementia. Statistical distributions of classification parameters obtained from a 5-fold cross-validation were compared using the Friedman's nonparametric test. Results Press' Q test showed that all classifiers performed better than chance alone (p < 0.05). Support Vector Machines showed the larger overall classification accuracy (Median (Me) = 0.76) an area under the ROC (Me = 0.90). However this method showed high specificity (Me = 1.0) but low sensitivity (Me = 0.3). Random Forest ranked second in overall accuracy (Me = 0.73) with high area under the ROC (Me = 0.73) specificity (Me = 0.73) and sensitivity (Me = 0.64). Linear Discriminant Analysis also showed acceptable overall accuracy (Me = 0.66), with acceptable area under the ROC (Me = 0.72) specificity (Me = 0.66) and sensitivity (Me = 0.64). The remaining classifiers showed overall classification accuracy above a median value of 0.63, but for most sensitivity was around or even lower than a median value of 0.5. Conclusions When taking into account sensitivity, specificity and overall classification accuracy Random Forests and Linear Discriminant analysis rank first among all the classifiers tested in prediction of dementia using several neuropsychological tests. These methods may be used to improve accuracy, sensitivity and specificity of Dementia predictions from neuropsychological testing. PMID:21849043
USDA-ARS?s Scientific Manuscript database
Incomplete meteorological data has been a problem in environmental modeling studies. The objective of this work was to develop a technique to reconstruct missing daily precipitation data in the central part of Chesapeake Bay Watershed using regression trees (RT) and artificial neural networks (ANN)....
USDA-ARS?s Scientific Manuscript database
Missing meteorological data have to be estimated for agricultural and environmental modeling. The objective of this work was to develop a technique to reconstruct the missing daily precipitation data in the central part of the Chesapeake Bay Watershed using regression trees (RT) and artificial neura...
Mazenq, Julie; Dubus, Jean-Christophe; Gaudart, Jean; Charpin, Denis; Viudes, Gilles; Noel, Guilhem
2017-11-01
Particulate matter, nitrogen dioxide (NO 2 ) and ozone are recognized as the three pollutants that most significantly affect human health. Asthma is a multifactorial disease. However, the place of residence has rarely been investigated. We compared the impact of air pollution, measured near patients' homes, on emergency department (ED) visits for asthma or trauma (controls) within the Provence-Alpes-Côte-d'Azur region. Variables were selected using classification and regression trees on asthmatic and control population, 3-99 years, visiting ED from January 1 to December 31, 2013. Then in a nested case control study, randomization was based on the day of ED visit and on defined age groups. Pollution, meteorological, pollens and viral data measured that day were linked to the patient's ZIP code. A total of 794,884 visits were reported including 6250 for asthma and 278,192 for trauma. Factors associated with an excess risk of emergency visit for asthma included short-term exposure to NO 2 , female gender, high viral load and a combination of low temperature and high humidity. Short-term exposures to high NO 2 concentrations, as assessed close to the homes of the patients, were significantly associated with asthma-related ED visits in children and adults. Copyright © 2017 Elsevier Ltd. All rights reserved.
Ostrand, William D.; Gotthardt, Tracey A.; Howlin, Shay; Robards, Martin D.
2005-01-01
We modeled habitat selection by Pacific sand lance (Ammodytes hexapterus) by examining their distribution in relation to water depth, distance to shore, bottom slope, bottom type, distance from sand bottom, and shoreline type. Through both logistic regression and classification tree models, we compared the characteristics of 29 known sand lance locations to 58 randomly selected sites. The best models indicated a strong selection of shallow water by sand lance, with weaker association between sand lance distribution and beach shorelines, sand bottoms, distance to shore, bottom slope, and distance to the nearest sand bottom. We applied an information-theoretic approach to the interpretation of the logistic regression analysis and determined importance values of 0.99, 0.54, 0.52, 0.44, 0.39, and 0.25 for depth, beach shorelines, sand bottom, distance to shore, gradual bottom slope, and distance to the nearest sand bottom, respectively. The classification tree model indicated that sand lance selected shallow-water habitats and remained near sand bottoms when located in habitats with depths between 40 and 60 m. All sand lance locations were at depths <60 m and 93% occurred at depths <40 m. Probable reasons for the modeled relationships between the distribution of sand lance and the independent variables are discussed.
Goloboff, Pablo A
2014-10-01
Three different types of data sets, for which the uniquely most parsimonious tree can be known exactly but is hard to find with heuristic tree search methods, are studied. Tree searches are complicated more by the shape of the tree landscape (i.e. the distribution of homoplasy on different trees) than by the sheer abundance of homoplasy or character conflict. Data sets of Type 1 are those constructed by Radel et al. (2013). Data sets of Type 2 present a very rugged landscape, with narrow peaks and valleys, but relatively low amounts of homoplasy. For such a tree landscape, subjecting the trees to TBR and saving suboptimal trees produces much better results when the sequence of clipping for the tree branches is randomized instead of fixed. An unexpected finding for data sets of Types 1 and 2 is that starting a search from a random tree instead of a random addition sequence Wagner tree may increase the probability that the search finds the most parsimonious tree; a small artificial example where these probabilities can be calculated exactly is presented. Data sets of Type 3, the most difficult data sets studied here, comprise only congruent characters, and a single island with only one most parsimonious tree. Even if there is a single island, missing entries create a very flat landscape which is difficult to traverse with tree search algorithms because the number of equally parsimonious trees that need to be saved and swapped to effectively move around the plateaus is too large. Minor modifications of the parameters of tree drifting, ratchet, and sectorial searches allow travelling around these plateaus much more efficiently than saving and swapping large numbers of equally parsimonious trees with TBR. For these data sets, two new related criteria for selecting taxon addition sequences in Wagner trees (the "selected" and "informative" addition sequences) produce much better results than the standard random or closest addition sequences. These new methods for Wagner trees and for moving around plateaus can be useful when analyzing phylogenomic data sets formed by concatenation of genes with uneven taxon representation ("sparse" supermatrices), which are likely to present a tree landscape with extensive plateaus. Copyright © 2014 Elsevier Inc. All rights reserved.
Mohammadi, Seyed-Farzad; Sabbaghi, Mostafa; Z-Mehrjardi, Hadi; Hashemi, Hassan; Alizadeh, Somayeh; Majdi, Mercede; Taee, Farough
2012-03-01
To apply artificial intelligence models to predict the occurrence of posterior capsule opacification (PCO) after phacoemulsification. Farabi Eye Hospital, Tehran, Iran. Clinical-based cross-sectional study. The posterior capsule status of eyes operated on for age-related cataract and the need for laser capsulotomy were determined. After a literature review, data polishing, and expert consultation, 10 input variables were selected. The QUEST algorithm was used to develop a decision tree. Three back-propagation artificial neural networks were constructed with 4, 20, and 40 neurons in 2 hidden layers and trained with the same transfer functions (log-sigmoid and linear transfer) and training protocol with randomly selected eyes. They were then tested on the remaining eyes and the networks compared for their performance. Performance indices were used to compare resultant models with the results of logistic regression analysis. The models were trained using 282 randomly selected eyes and then tested using 70 eyes. Laser capsulotomy for clinically significant PCO was indicated or had been performed 2 years postoperatively in 40 eyes. A sample decision tree was produced with accuracy of 50% (likelihood ratio 0.8). The best artificial neural network, which showed 87% accuracy and a positive likelihood ratio of 8, was achieved with 40 neurons. The area under the receiver-operating-characteristic curve was 0.71. In comparison, logistic regression reached accuracy of 80%; however, the likelihood ratio was not measurable because the sensitivity was zero. A prototype artificial neural network was developed that predicted posterior capsule status (requiring capsulotomy) with reasonable accuracy. No author has a financial or proprietary interest in any material or method mentioned. Copyright © 2012 ASCRS and ESCRS. Published by Elsevier Inc. All rights reserved.
Point-Sampling and Line-Sampling Probability Theory, Geometric Implications, Synthesis
L.R. Grosenbaugh
1958-01-01
Foresters concerned with measuring tree populations on definite areas have long employed two well-known methods of representative sampling. In list or enumerative sampling the entire tree population is tallied with a known proportion being randomly selected and measured for volume or other variables. In area sampling all trees on randomly located plots or strips...
Fast image interpolation via random forests.
Huang, Jun-Jie; Siu, Wan-Chi; Liu, Tian-Rui
2015-10-01
This paper proposes a two-stage framework for fast image interpolation via random forests (FIRF). The proposed FIRF method gives high accuracy, as well as requires low computation. The underlying idea of this proposed work is to apply random forests to classify the natural image patch space into numerous subspaces and learn a linear regression model for each subspace to map the low-resolution image patch to high-resolution image patch. The FIRF framework consists of two stages. Stage 1 of the framework removes most of the ringing and aliasing artifacts in the initial bicubic interpolated image, while Stage 2 further refines the Stage 1 interpolated image. By varying the number of decision trees in the random forests and the number of stages applied, the proposed FIRF method can realize computationally scalable image interpolation. Extensive experimental results show that the proposed FIRF(3, 2) method achieves more than 0.3 dB improvement in peak signal-to-noise ratio over the state-of-the-art nonlocal autoregressive modeling (NARM) method. Moreover, the proposed FIRF(1, 1) obtains similar or better results as NARM while only takes its 0.3% computational time.
Tree STEM and Canopy Biomass Estimates from Terrestrial Laser Scanning Data
NASA Astrophysics Data System (ADS)
Olofsson, K.; Holmgren, J.
2017-10-01
In this study an automatic method for estimating both the tree stem and the tree canopy biomass is presented. The point cloud tree extraction techniques operate on TLS data and models the biomass using the estimated stem and canopy volume as independent variables. The regression model fit error is of the order of less than 5 kg, which gives a relative model error of about 5 % for the stem estimate and 10-15 % for the spruce and pine canopy biomass estimates. The canopy biomass estimate was improved by separating the models by tree species which indicates that the method is allometry dependent and that the regression models need to be recomputed for different areas with different climate and different vegetation.
Amirabadizadeh, Alireza; Nezami, Hossein; Vaughn, Michael G; Nakhaee, Samaneh; Mehrpour, Omid
2018-05-12
Substance abuse exacts considerable social and health care burdens throughout the world. The aim of this study was to create a prediction model to better identify risk factors for drug use. A prospective cross-sectional study was conducted in South Khorasan Province, Iran. Of the total of 678 eligible subjects, 70% (n: 474) were randomly selected to provide a training set for constructing decision tree and multiple logistic regression (MLR) models. The remaining 30% (n: 204) were employed in a holdout sample to test the performance of the decision tree and MLR models. Predictive performance of different models was analyzed by the receiver operating characteristic (ROC) curve using the testing set. Independent variables were selected from demographic characteristics and history of drug use. For the decision tree model, the sensitivity and specificity for identifying people at risk for drug abuse were 66% and 75%, respectively, while the MLR model was somewhat less effective at 60% and 73%. Key independent variables in the analyses included first substance experience, age at first drug use, age, place of residence, history of cigarette use, and occupational and marital status. While study findings are exploratory and lack generalizability they do suggest that the decision tree model holds promise as an effective classification approach for identifying risk factors for drug use. Convergent with prior research in Western contexts is that age of drug use initiation was a critical factor predicting a substance use disorder.
NASA Astrophysics Data System (ADS)
Lombardo, L.; Cama, M.; Maerker, M.; Parisi, L.; Rotigliano, E.
2014-12-01
This study aims at comparing the performances of Binary Logistic Regression (BLR) and Boosted Regression Trees (BRT) methods in assessing landslide susceptibility for multiple-occurrence regional landslide events within the Mediterranean region. A test area was selected in the north-eastern sector of Sicily (southern Italy), corresponding to the catchments of the Briga and the Giampilieri streams both stretching for few kilometres from the Peloritan ridge (eastern Sicily, Italy) to the Ionian sea. This area was struck on the 1st October 2009 by an extreme climatic event resulting in thousands of rapid shallow landslides, mainly of debris flows and debris avalanches types involving the weathered layer of a low to high grade metamorphic bedrock. Exploiting the same set of predictors and the 2009 landslide archive, BLR- and BRT-based susceptibility models were obtained for the two catchments separately, adopting a random partition (RP) technique for validation; besides, the models trained in one of the two catchments (Briga) were tested in predicting the landslide distribution in the other (Giampilieri), adopting a spatial partition (SP) based validation procedure. All the validation procedures were based on multi-folds tests so to evaluate and compare the reliability of the fitting, the prediction skill, the coherence in the predictor selection and the precision of the susceptibility estimates. All the obtained models for the two methods produced very high predictive performances, with a general congruence between BLR and BRT in the predictor importance. In particular, the research highlighted that BRT-models reached a higher prediction performance with respect to BLR-models, for RP based modelling, whilst for the SP-based models the difference in predictive skills between the two methods dropped drastically, converging to an analogous excellent performance. However, when looking at the precision of the probability estimates, BLR demonstrated to produce more robust models in terms of selected predictors and coefficients, as well as of dispersion of the estimated probabilities around the mean value for each mapped pixel. The difference in the behaviour could be interpreted as the result of overfitting effects, which heavily affect decision tree classification more than logistic regression techniques.
Creating ensembles of decision trees through sampling
Kamath, Chandrika; Cantu-Paz, Erick
2005-08-30
A system for decision tree ensembles that includes a module to read the data, a module to sort the data, a module to evaluate a potential split of the data according to some criterion using a random sample of the data, a module to split the data, and a module to combine multiple decision trees in ensembles. The decision tree method is based on statistical sampling techniques and includes the steps of reading the data; sorting the data; evaluating a potential split according to some criterion using a random sample of the data, splitting the data, and combining multiple decision trees in ensembles.
Cloud-Free Satellite Image Mosaics with Regression Trees and Histogram Matching.
E.H. Helmer; B. Ruefenacht
2005-01-01
Cloud-free optical satellite imagery simplifies remote sensing, but land-cover phenology limits existing solutions to persistent cloudiness to compositing temporally resolute, spatially coarser imagery. Here, a new strategy for developing cloud-free imagery at finer resolution permits simple automatic change detection. The strategy uses regression trees to predict...
Regression estimators for late-instar gypsy moth larvae at low pupulation densities
W.E. Wallnr; A.S. Devito; Stanley J. Zarnoch
1989-01-01
Two regression estimators were developed for determining densities of late-instar gypsy moth, Lymantria dispar (Lepidoptera: Lymantriidae), larvae from burlap band and pyrethrin spray counts on oak trees in Vermont, Massachusetts, Connecticut, and New York. Studies were conducted by marking larvae on individual burlap banded trees within 15...
What Satisfies Students?: Mining Student-Opinion Data with Regression and Decision Tree Analysis
ERIC Educational Resources Information Center
Thomas, Emily H.; Galambos, Nora
2004-01-01
To investigate how students' characteristics and experiences affect satisfaction, this study uses regression and decision tree analysis with the CHAID algorithm to analyze student-opinion data. A data mining approach identifies the specific aspects of students' university experience that most influence three measures of general satisfaction. The…
2018-01-01
Background Many studies have tried to develop predictors for return-to-work (RTW). However, since complex factors have been demonstrated to predict RTW, it is difficult to use them practically. This study investigated whether factors used in previous studies could predict whether an individual had returned to his/her original work by four years after termination of the worker's recovery period. Methods An initial logistic regression analysis of 1,567 participants of the fourth Panel Study of Worker's Compensation Insurance yielded odds ratios. The participants were divided into two subsets, a training dataset and a test dataset. Using the training dataset, logistic regression, decision tree, random forest, and support vector machine models were established, and important variables of each model were identified. The predictive abilities of the different models were compared. Results The analysis showed that only earned income and company-related factors significantly affected return-to-original-work (RTOW). The random forest model showed the best accuracy among the tested machine learning models; however, the difference was not prominent. Conclusion It is possible to predict a worker's probability of RTOW using machine learning techniques with moderate accuracy. PMID:29736160
Ishwaran, Hemant; Lu, Min
2018-06-04
Random forests are a popular nonparametric tree ensemble procedure with broad applications to data analysis. While its widespread popularity stems from its prediction performance, an equally important feature is that it provides a fully nonparametric measure of variable importance (VIMP). A current limitation of VIMP, however, is that no systematic method exists for estimating its variance. As a solution, we propose a subsampling approach that can be used to estimate the variance of VIMP and for constructing confidence intervals. The method is general enough that it can be applied to many useful settings, including regression, classification, and survival problems. Using extensive simulations, we demonstrate the effectiveness of the subsampling estimator and in particular find that the delete-d jackknife variance estimator, a close cousin, is especially effective under low subsampling rates due to its bias correction properties. These 2 estimators are highly competitive when compared with the .164 bootstrap estimator, a modified bootstrap procedure designed to deal with ties in out-of-sample data. Most importantly, subsampling is computationally fast, thus making it especially attractive for big data settings. Copyright © 2018 John Wiley & Sons, Ltd.
Spectrum of walk matrix for Koch network and its application
NASA Astrophysics Data System (ADS)
Xie, Pinchen; Lin, Yuan; Zhang, Zhongzhi
2015-06-01
Various structural and dynamical properties of a network are encoded in the eigenvalues of walk matrix describing random walks on the network. In this paper, we study the spectra of walk matrix of the Koch network, which displays the prominent scale-free and small-world features. Utilizing the particular architecture of the network, we obtain all the eigenvalues and their corresponding multiplicities. Based on the link between the eigenvalues of walk matrix and random target access time defined as the expected time for a walker going from an arbitrary node to another one selected randomly according to the steady-state distribution, we then derive an explicit solution to the random target access time for random walks on the Koch network. Finally, we corroborate our computation for the eigenvalues by enumerating spanning trees in the Koch network, using the connection governing eigenvalues and spanning trees, where a spanning tree of a network is a subgraph of the network, that is, a tree containing all the nodes.
Quantifying post-fire fallen trees using multi-temporal lidar
NASA Astrophysics Data System (ADS)
Bohlin, Inka; Olsson, Håkan; Bohlin, Jonas; Granström, Anders
2017-12-01
Massive tree-felling due to root damage is a common fire effect on burnt areas in Scandinavia, but has so far not been analyzed in detail. Here we explore if pre- and post-fire lidar data can be used to estimate the proportion of fallen trees. The study was carried out within a large (14,000 ha) area in central Sweden burnt in August 2014, where we had access to airborne lidar data from both 2011 and 2015. Three data-sets of predictor variables were tested: POST (post-fire lidar metrics), DIF (difference between post- and pre-fire lidar metrics) and combination of those two (POST_DIF). Fractional logistic regression was used to predict the proportion of fallen trees. Training data consisted of 61 plots, where the number of fallen and standing trees was calculated both in the field and with interpretation of drone images. The accuracy of the best model was tested based on 100 randomly selected validation plots with a size of 25 × 25 m. Our results showed that multi-temporal lidar together with field-collected training data can be used for quantifying post-fire tree felling over large areas. Several height-, density- and intensity metrics correlated with the proportion of fallen trees. The best model combined metrics from both datasets (POST_DIF), resulting in a RMSE of 0.11. Results were slightly poorer in the validation plots with RMSE of 0.18 using pixel size of 12.5 m and RMSE of 0.15 using pixel size of 6.25 m. Our model performed least well for stands that had been exposed to high-intensity crown fire. This was likely due to the low amount of echoes from the standing black tree skeletons. Wall-to-wall maps produced with this model can be used for landscape level analysis of fire effects and to explore the relationship between fallen trees and forest structure, soil type, fire intensity or topography.
Can Predictive Modeling Identify Head and Neck Oncology Patients at Risk for Readmission?
Manning, Amy M; Casper, Keith A; Peter, Kay St; Wilson, Keith M; Mark, Jonathan R; Collar, Ryan M
2018-05-01
Objective Unplanned readmission within 30 days is a contributor to health care costs in the United States. The use of predictive modeling during hospitalization to identify patients at risk for readmission offers a novel approach to quality improvement and cost reduction. Study Design Two-phase study including retrospective analysis of prospectively collected data followed by prospective longitudinal study. Setting Tertiary academic medical center. Subjects and Methods Prospectively collected data for patients undergoing surgical treatment for head and neck cancer from January 2013 to January 2015 were used to build predictive models for readmission within 30 days of discharge using logistic regression, classification and regression tree (CART) analysis, and random forests. One model (logistic regression) was then placed prospectively into the discharge workflow from March 2016 to May 2016 to determine the model's ability to predict which patients would be readmitted within 30 days. Results In total, 174 admissions had descriptive data. Thirty-two were excluded due to incomplete data. Logistic regression, CART, and random forest predictive models were constructed using the remaining 142 admissions. When applied to 106 consecutive prospective head and neck oncology patients at the time of discharge, the logistic regression model predicted readmissions with a specificity of 94%, a sensitivity of 47%, a negative predictive value of 90%, and a positive predictive value of 62% (odds ratio, 14.9; 95% confidence interval, 4.02-55.45). Conclusion Prospectively collected head and neck cancer databases can be used to develop predictive models that can accurately predict which patients will be readmitted. This offers valuable support for quality improvement initiatives and readmission-related cost reduction in head and neck cancer care.
Association of tRNA methyltransferase NSUN2/IGF-II molecular signature with ovarian cancer survival.
Yang, Jia-Cheng; Risch, Eric; Zhang, Meiqin; Huang, Chan; Huang, Huatian; Lu, Lingeng
2017-09-01
To investigate the association between NSUN2/IGF-II signature and ovarian cancer survival. Using a publicly accessible dataset of RNA sequencing and clinical follow-up data, we performed Classification and Regression Tree and survival analyses. Patients with NSUN2 high IGF-II low had significantly superior overall and disease progression-free survival, followed by NSUN2 low IGF-II low , NSUN2 high IGF-II high and NSUN2 low IGF-II high (p < 0.0001 for overall, p = 0.0024 for progression-free survival, respectively). The associations of NSUN2/IGF-II signature with the risks of death and relapse remained significant in multivariate Cox regression models. Random-effects meta-analyses show the upregulated NSUN2 and IGF-II expression in ovarian cancer versus normal tissues. The NSUN2/IGF-II signature associates with heterogeneous outcome and may have clinical implications in managing ovarian cancer.
Rumor Processes in Random Environment on and on Galton-Watson Trees
NASA Astrophysics Data System (ADS)
Bertacchi, Daniela; Zucca, Fabio
2013-11-01
The aim of this paper is to study rumor processes in random environment. In a rumor process a signal starts from the stations of a fixed vertex (the root) and travels on a graph from vertex to vertex. We consider two rumor processes. In the firework process each station, when reached by the signal, transmits it up to a random distance. In the reverse firework process, on the other hand, stations do not send any signal but they “listen” for it up to a random distance. The first random environment that we consider is the deterministic 1-dimensional tree with a random number of stations on each vertex; in this case the root is the origin of . We give conditions for the survival/extinction on almost every realization of the sequence of stations. Later on, we study the processes on Galton-Watson trees with random number of stations on each vertex. We show that if the probability of survival is positive, then there is survival on almost every realization of the infinite tree such that there is at least one station at the root. We characterize the survival of the process in some cases and we give sufficient conditions for survival/extinction.
Freitas, Alex A; Limbu, Kriti; Ghafourian, Taravat
2015-01-01
Volume of distribution is an important pharmacokinetic property that indicates the extent of a drug's distribution in the body tissues. This paper addresses the problem of how to estimate the apparent volume of distribution at steady state (Vss) of chemical compounds in the human body using decision tree-based regression methods from the area of data mining (or machine learning). Hence, the pros and cons of several different types of decision tree-based regression methods have been discussed. The regression methods predict Vss using, as predictive features, both the compounds' molecular descriptors and the compounds' tissue:plasma partition coefficients (Kt:p) - often used in physiologically-based pharmacokinetics. Therefore, this work has assessed whether the data mining-based prediction of Vss can be made more accurate by using as input not only the compounds' molecular descriptors but also (a subset of) their predicted Kt:p values. Comparison of the models that used only molecular descriptors, in particular, the Bagging decision tree (mean fold error of 2.33), with those employing predicted Kt:p values in addition to the molecular descriptors, such as the Bagging decision tree using adipose Kt:p (mean fold error of 2.29), indicated that the use of predicted Kt:p values as descriptors may be beneficial for accurate prediction of Vss using decision trees if prior feature selection is applied. Decision tree based models presented in this work have an accuracy that is reasonable and similar to the accuracy of reported Vss inter-species extrapolations in the literature. The estimation of Vss for new compounds in drug discovery will benefit from methods that are able to integrate large and varied sources of data and flexible non-linear data mining methods such as decision trees, which can produce interpretable models. Graphical AbstractDecision trees for the prediction of tissue partition coefficient and volume of distribution of drugs.
Random trinomial tree models and vanilla options
NASA Astrophysics Data System (ADS)
Ganikhodjaev, Nasir; Bayram, Kamola
2013-09-01
In this paper we introduce and study random trinomial model. The usual trinomial model is prescribed by triple of numbers (u, d, m). We call the triple (u, d, m) an environment of the trinomial model. A triple (Un, Dn, Mn), where {Un}, {Dn} and {Mn} are the sequences of independent, identically distributed random variables with 0 < Dn < 1 < Un and Mn = 1 for all n, is called a random environment and trinomial tree model with random environment is called random trinomial model. The random trinomial model is considered to produce more accurate results than the random binomial model or usual trinomial model.
Regression analysis using dependent Polya trees.
Schörgendorfer, Angela; Branscum, Adam J
2013-11-30
Many commonly used models for linear regression analysis force overly simplistic shape and scale constraints on the residual structure of data. We propose a semiparametric Bayesian model for regression analysis that produces data-driven inference by using a new type of dependent Polya tree prior to model arbitrary residual distributions that are allowed to evolve across increasing levels of an ordinal covariate (e.g., time, in repeated measurement studies). By modeling residual distributions at consecutive covariate levels or time points using separate, but dependent Polya tree priors, distributional information is pooled while allowing for broad pliability to accommodate many types of changing residual distributions. We can use the proposed dependent residual structure in a wide range of regression settings, including fixed-effects and mixed-effects linear and nonlinear models for cross-sectional, prospective, and repeated measurement data. A simulation study illustrates the flexibility of our novel semiparametric regression model to accurately capture evolving residual distributions. In an application to immune development data on immunoglobulin G antibodies in children, our new model outperforms several contemporary semiparametric regression models based on a predictive model selection criterion. Copyright © 2013 John Wiley & Sons, Ltd.
Computer simulations of melts of randomly branching polymers
NASA Astrophysics Data System (ADS)
Rosa, Angelo; Everaers, Ralf
2016-10-01
Randomly branching polymers with annealed connectivity are model systems for ring polymers and chromosomes. In this context, the branched structure represents transient folding induced by topological constraints. Here we present computer simulations of melts of annealed randomly branching polymers of 3 ≤ N ≤ 1800 segments in d = 2 and d = 3 dimensions. In all cases, we perform a detailed analysis of the observed tree connectivities and spatial conformations. Our results are in excellent agreement with an asymptotic scaling of the average tree size of R ˜ N1/d, suggesting that the trees behave as compact, territorial fractals. The observed swelling relative to the size of ideal trees, R ˜ N1/4, demonstrates that excluded volume interactions are only partially screened in melts of annealed trees. Overall, our results are in good qualitative agreement with the predictions of Flory theory. In particular, we find that the trees swell by the combination of modified branching and path stretching. However, the former effect is subdominant and difficult to detect in d = 3 dimensions.
DIF Trees: Using Classification Trees to Detect Differential Item Functioning
ERIC Educational Resources Information Center
Vaughn, Brandon K.; Wang, Qiu
2010-01-01
A nonparametric tree classification procedure is used to detect differential item functioning for items that are dichotomously scored. Classification trees are shown to be an alternative procedure to detect differential item functioning other than the use of traditional Mantel-Haenszel and logistic regression analysis. A nonparametric…
Arano, Ichiro; Sugimoto, Tomoyuki; Hamasaki, Toshimitsu; Ohno, Yuko
2010-04-23
Survival analysis methods such as the Kaplan-Meier method, log-rank test, and Cox proportional hazards regression (Cox regression) are commonly used to analyze data from randomized withdrawal studies in patients with major depressive disorder. However, unfortunately, such common methods may be inappropriate when a long-term censored relapse-free time appears in data as the methods assume that if complete follow-up were possible for all individuals, each would eventually experience the event of interest. In this paper, to analyse data including such a long-term censored relapse-free time, we discuss a semi-parametric cure regression (Cox cure regression), which combines a logistic formulation for the probability of occurrence of an event with a Cox proportional hazards specification for the time of occurrence of the event. In specifying the treatment's effect on disease-free survival, we consider the fraction of long-term survivors and the risks associated with a relapse of the disease. In addition, we develop a tree-based method for the time to event data to identify groups of patients with differing prognoses (cure survival CART). Although analysis methods typically adapt the log-rank statistic for recursive partitioning procedures, the method applied here used a likelihood ratio (LR) test statistic from a fitting of cure survival regression assuming exponential and Weibull distributions for the latency time of relapse. The method is illustrated using data from a sertraline randomized withdrawal study in patients with major depressive disorder. We concluded that Cox cure regression reveals facts on who may be cured, and how the treatment and other factors effect on the cured incidence and on the relapse time of uncured patients, and that cure survival CART output provides easily understandable and interpretable information, useful both in identifying groups of patients with differing prognoses and in utilizing Cox cure regression models leading to meaningful interpretations.
NASA Astrophysics Data System (ADS)
Mayr, G. J.; Kneringer, P.; Dietz, S. J.; Zeileis, A.
2016-12-01
Low visibility or low cloud ceiling reduce the capacity of airports by requiring special low visibility procedures (LVP) for incoming/departing aircraft. Probabilistic forecasts when such procedures will become necessary help to mitigate delays and economic losses.We compare the performance of probabilistic nowcasts with two statistical methods: ordered logistic regression, and trees and random forests. These models harness historic and current meteorological measurements in the vicinity of the airport and LVP states, and incorporate diurnal and seasonal climatological information via generalized additive models (GAM). The methods are applied at Vienna International Airport (Austria). The performance is benchmarked against climatology, persistence and human forecasters.
Buchner, Florian; Wasem, Jürgen; Schillo, Sonja
2017-01-01
Risk equalization formulas have been refined since their introduction about two decades ago. Because of the complexity and the abundance of possible interactions between the variables used, hardly any interactions are considered. A regression tree is used to systematically search for interactions, a methodologically new approach in risk equalization. Analyses are based on a data set of nearly 2.9 million individuals from a major German social health insurer. A two-step approach is applied: In the first step a regression tree is built on the basis of the learning data set. Terminal nodes characterized by more than one morbidity-group-split represent interaction effects of different morbidity groups. In the second step the 'traditional' weighted least squares regression equation is expanded by adding interaction terms for all interactions detected by the tree, and regression coefficients are recalculated. The resulting risk adjustment formula shows an improvement in the adjusted R 2 from 25.43% to 25.81% on the evaluation data set. Predictive ratios are calculated for subgroups affected by the interactions. The R 2 improvement detected is only marginal. According to the sample level performance measures used, not involving a considerable number of morbidity interactions forms no relevant loss in accuracy. Copyright © 2015 John Wiley & Sons, Ltd. Copyright © 2015 John Wiley & Sons, Ltd.
Elizabeth A. Freeman; Gretchen G. Moisen; John W. Coulston; Barry T. (Ty) Wilson
2015-01-01
As part of the development of the 2011 National Land Cover Database (NLCD) tree canopy cover layer, a pilot project was launched to test the use of high-resolution photography coupled with extensive ancillary data to map the distribution of tree canopy cover over four study regions in the conterminous US. Two stochastic modeling techniques, random forests (RF...
A soil map of a large watershed in China: applying digital soil mapping in a data sparse region
NASA Astrophysics Data System (ADS)
Barthold, F.; Blank, B.; Wiesmeier, M.; Breuer, L.; Frede, H.-G.
2009-04-01
Prediction of soil classes in data sparse regions is a major research challenge. With the advent of machine learning the possibilities to spatially predict soil classes have increased tremendously and given birth to new possibilities in soil mapping. Digital soil mapping is a research field that has been established during the last decades and has been accepted widely. We now need to develop tools to reduce the uncertainty in soil predictions. This is especially challenging in data sparse regions. One approach to do this is to implement soil taxonomic distance as a classification error criterion in classification and regression trees (CART) as suggested by Minasny et al. (Geoderma 142 (2007) 285-293). This approach assumes that the classification error should be larger between soils that are more dissimilar, i.e. differ in a larger number of soil properties, and smaller between more similar soils. Our study area is the Xilin River Basin, which is located in central Inner Mongolia in China. It is characterized by semi arid climate conditions and is representative for the natural occurring steppe ecosystem. The study area comprises 3600 km2. We applied a random, stratified sampling design after McKenzie and Ryan (Geoderma 89 (1999) 67-94) with landuse and topography as stratifying variables. We defined 10 sampling classes, from each class 14 replicates were randomly drawn and sampled. The dataset was split into 100 soil profiles for training and 40 soil profiles for validation. We then applied classification and regression trees (CART) to quantify the relationships between soil classes and environmental covariates. The classification tree explained 75.5% of the variance with land use and geology as most important predictor variables. Among the 8 soil classes that we predicted, the Kastanozems cover most of the area. They are predominantly found in steppe areas. However, even some of the soils at sand dune sites, which were thought to show only little soil formation, can be classified as Kastanozems. Besides the Kastanozems, Regosols are most common at the sand dune sites as well as at sites that are defined as bare soil which are characterized by little or no vegetation. Gleysols are mostly found at sites in the vicinity of the Xilin river, which are connected to the groundwater. They can also be found in small valleys or depressions where sub-surface waters from neighboring areas collect. The richest soils are found in mountain meadow areas. Pedogenetic conditions here are most favorable and lead to the formation of Chernozems with deep humic Ah horizons. Other soil types that occur in the study area are Arenosols, Calcisols, Cambisol and Phaeozems. In addition, soil taxonomic distance is implemented into the decision tree procedure as a measure of classification error. The results of incorporating taxonomic distance as a loss function in the decision tree will be compared with the standard application of the decision tree.
Stratification of the severity of critically ill patients with classification trees
2009-01-01
Background Development of three classification trees (CT) based on the CART (Classification and Regression Trees), CHAID (Chi-Square Automatic Interaction Detection) and C4.5 methodologies for the calculation of probability of hospital mortality; the comparison of the results with the APACHE II, SAPS II and MPM II-24 scores, and with a model based on multiple logistic regression (LR). Methods Retrospective study of 2864 patients. Random partition (70:30) into a Development Set (DS) n = 1808 and Validation Set (VS) n = 808. Their properties of discrimination are compared with the ROC curve (AUC CI 95%), Percent of correct classification (PCC CI 95%); and the calibration with the Calibration Curve and the Standardized Mortality Ratio (SMR CI 95%). Results CTs are produced with a different selection of variables and decision rules: CART (5 variables and 8 decision rules), CHAID (7 variables and 15 rules) and C4.5 (6 variables and 10 rules). The common variables were: inotropic therapy, Glasgow, age, (A-a)O2 gradient and antecedent of chronic illness. In VS: all the models achieved acceptable discrimination with AUC above 0.7. CT: CART (0.75(0.71-0.81)), CHAID (0.76(0.72-0.79)) and C4.5 (0.76(0.73-0.80)). PCC: CART (72(69-75)), CHAID (72(69-75)) and C4.5 (76(73-79)). Calibration (SMR) better in the CT: CART (1.04(0.95-1.31)), CHAID (1.06(0.97-1.15) and C4.5 (1.08(0.98-1.16)). Conclusion With different methodologies of CTs, trees are generated with different selection of variables and decision rules. The CTs are easy to interpret, and they stratify the risk of hospital mortality. The CTs should be taken into account for the classification of the prognosis of critically ill patients. PMID:20003229
Goo, Yeong-Jia James; Shen, Zone-De
2014-01-01
As the fraudulent financial statement of an enterprise is increasingly serious with each passing day, establishing a valid forecasting fraudulent financial statement model of an enterprise has become an important question for academic research and financial practice. After screening the important variables using the stepwise regression, the study also matches the logistic regression, support vector machine, and decision tree to construct the classification models to make a comparison. The study adopts financial and nonfinancial variables to assist in establishment of the forecasting fraudulent financial statement model. Research objects are the companies to which the fraudulent and nonfraudulent financial statement happened between years 1998 to 2012. The findings are that financial and nonfinancial information are effectively used to distinguish the fraudulent financial statement, and decision tree C5.0 has the best classification effect 85.71%. PMID:25302338
Chen, Suduan; Goo, Yeong-Jia James; Shen, Zone-De
2014-01-01
As the fraudulent financial statement of an enterprise is increasingly serious with each passing day, establishing a valid forecasting fraudulent financial statement model of an enterprise has become an important question for academic research and financial practice. After screening the important variables using the stepwise regression, the study also matches the logistic regression, support vector machine, and decision tree to construct the classification models to make a comparison. The study adopts financial and nonfinancial variables to assist in establishment of the forecasting fraudulent financial statement model. Research objects are the companies to which the fraudulent and nonfraudulent financial statement happened between years 1998 to 2012. The findings are that financial and nonfinancial information are effectively used to distinguish the fraudulent financial statement, and decision tree C5.0 has the best classification effect 85.71%.
Lisa M. Ganio; Robert A. Progar
2017-01-01
Wild and prescribed fire-induced injury to forest trees can produce immediate or delayed tree mortality but fire-injured trees can also survive. Land managers use logistic regression models that incorporate tree-injury variables to discriminate between fatally injured trees and those that will survive. We used data from 4024 ponderosa pine (Pinus ponderosa...
Cavity tree selection by red-cockaded woodpeckers in relation to tree age
D. Craig Rudolph; Richard N. Conner
1991-01-01
We aged over 1350 Red-cockaded Woodpecker (Picoides borealis) cavity trees and a comparable number of randomly selected trees. Resulting data strongly support the hypothesis that Red-cockaded Woodpeckers preferentially select older trees. Ages of recently initiated cavity trees in the Texas study areas generally were similar to those of cavity trees...
Using Evidence-Based Decision Trees Instead of Formulas to Identify At-Risk Readers. REL 2014-036
ERIC Educational Resources Information Center
Koon, Sharon; Petscher, Yaacov; Foorman, Barbara R.
2014-01-01
This study examines whether the classification and regression tree (CART) model improves the early identification of students at risk for reading comprehension difficulties compared with the more difficult to interpret logistic regression model. CART is a type of predictive modeling that relies on nonparametric techniques. It presents results in…
ERIC Educational Resources Information Center
Brabant, Marie-Eve; Hebert, Martine; Chagnon, Francois
2013-01-01
This study explored the clinical profiles of 77 female teenager survivors of sexual abuse and examined the association of abuse-related and personal variables with suicidal ideations. Analyses revealed that 64% of participants experienced suicidal ideations. Findings from classification and regression tree analysis indicated that depression,…
Forest type mapping of the Interior West
Bonnie Ruefenacht; Gretchen G. Moisen; Jock A. Blackard
2004-01-01
This paper develops techniques for the mapping of forest types in Arizona, New Mexico, and Wyoming. The methods involve regression-tree modeling using a variety of remote sensing and GIS layers along with Forest Inventory Analysis (FIA) point data. Regression-tree modeling is a fast and efficient technique of estimating variables for large data sets with high accuracy...
ERIC Educational Resources Information Center
Thomas, Emily H.; Galambos, Nora
To investigate how students' characteristics and experiences affect satisfaction, this study used regression and decision-tree analysis with the CHAID algorithm to analyze student opinion data from a sample of 1,783 college students. A data-mining approach identifies the specific aspects of students' university experience that most influence three…
ERIC Educational Resources Information Center
Cohen, Ira L.; Liu, Xudong; Hudson, Melissa; Gillis, Jennifer; Cavalari, Rachel N. S.; Romanczyk, Raymond G.; Karmel, Bernard Z.; Gardner, Judith M.
2016-01-01
In order to improve discrimination accuracy between Autism Spectrum Disorder (ASD) and similar neurodevelopmental disorders, a data mining procedure, Classification and Regression Trees (CART), was used on a large multi-site sample of PDD Behavior Inventory (PDDBI) forms on children with and without ASD. Discrimination accuracy exceeded 80%,…
Henrard, S; Speybroeck, N; Hermans, C
2015-11-01
Haemophilia is a rare genetic haemorrhagic disease characterized by partial or complete deficiency of coagulation factor VIII, for haemophilia A, or IX, for haemophilia B. As in any other medical research domain, the field of haemophilia research is increasingly concerned with finding factors associated with binary or continuous outcomes through multivariable models. Traditional models include multiple logistic regressions, for binary outcomes, and multiple linear regressions for continuous outcomes. Yet these regression models are at times difficult to implement, especially for non-statisticians, and can be difficult to interpret. The present paper sought to didactically explain how, why, and when to use classification and regression tree (CART) analysis for haemophilia research. The CART method is non-parametric and non-linear, based on the repeated partitioning of a sample into subgroups based on a certain criterion. Breiman developed this method in 1984. Classification trees (CTs) are used to analyse categorical outcomes and regression trees (RTs) to analyse continuous ones. The CART methodology has become increasingly popular in the medical field, yet only a few examples of studies using this methodology specifically in haemophilia have to date been published. Two examples using CART analysis and previously published in this field are didactically explained in details. There is increasing interest in using CART analysis in the health domain, primarily due to its ease of implementation, use, and interpretation, thus facilitating medical decision-making. This method should be promoted for analysing continuous or categorical outcomes in haemophilia, when applicable. © 2015 John Wiley & Sons Ltd.
Gmur, Stephan; Vogt, Daniel; Zabowski, Darlene; Moskal, L. Monika
2012-01-01
The characterization of soil attributes using hyperspectral sensors has revealed patterns in soil spectra that are known to respond to mineral composition, organic matter, soil moisture and particle size distribution. Soil samples from different soil horizons of replicated soil series from sites located within Washington and Oregon were analyzed with the FieldSpec Spectroradiometer to measure their spectral signatures across the electromagnetic range of 400 to 1,000 nm. Similarity rankings of individual soil samples reveal differences between replicate series as well as samples within the same replicate series. Using classification and regression tree statistical methods, regression trees were fitted to each spectral response using concentrations of nitrogen, carbon, carbonate and organic matter as the response variables. Statistics resulting from fitted trees were: nitrogen R2 0.91 (p < 0.01) at 403, 470, 687, and 846 nm spectral band widths, carbonate R2 0.95 (p < 0.01) at 531 and 898 nm band widths, total carbon R2 0.93 (p < 0.01) at 400, 409, 441 and 907 nm band widths, and organic matter R2 0.98 (p < 0.01) at 300, 400, 441, 832 and 907 nm band widths. Use of the 400 to 1,000 nm electromagnetic range utilizing regression trees provided a powerful, rapid and inexpensive method for assessing nitrogen, carbon, carbonate and organic matter for upper soil horizons in a nondestructive method. PMID:23112620
Modelling Variable Fire Severity in Boreal Forests: Effects of Fire Intensity and Stand Structure
Miquelajauregui, Yosune; Cumming, Steven G.; Gauthier, Sylvie
2016-01-01
It is becoming clear that fires in boreal forests are not uniformly stand-replacing. On the contrary, marked variation in fire severity, measured as tree mortality, has been found both within and among individual fires. It is important to understand the conditions under which this variation can arise. We integrated forest sample plot data, tree allometries and historical forest fire records within a diameter class-structured model of 1.0 ha patches of mono-specific black spruce and jack pine stands in northern Québec, Canada. The model accounts for crown fire initiation and vertical spread into the canopy. It uses empirical relations between fire intensity, scorch height, the percent of crown scorched and tree mortality to simulate fire severity, specifically the percent reduction in patch basal area due to fire-caused mortality. A random forest and a regression tree analysis of a large random sample of simulated fires were used to test for an effect of fireline intensity, stand structure, species composition and pyrogeographic regions on resultant severity. Severity increased with intensity and was lower for jack pine stands. The proportion of simulated fires that burned at high severity (e.g. >75% reduction in patch basal area) was 0.80 for black spruce and 0.11 for jack pine. We identified thresholds in intensity below which there was a marked sensitivity of simulated fire severity to stand structure, and to interactions between intensity and structure. We found no evidence for a residual effect of pyrogeographic region on simulated severity, after the effects of stand structure and species composition were accounted for. The model presented here was able to produce variation in fire severity under a range of fire intensity conditions. This suggests that variation in stand structure is one of the factors causing the observed variation in boreal fire severity. PMID:26919456
Modelling Variable Fire Severity in Boreal Forests: Effects of Fire Intensity and Stand Structure.
Miquelajauregui, Yosune; Cumming, Steven G; Gauthier, Sylvie
2016-01-01
It is becoming clear that fires in boreal forests are not uniformly stand-replacing. On the contrary, marked variation in fire severity, measured as tree mortality, has been found both within and among individual fires. It is important to understand the conditions under which this variation can arise. We integrated forest sample plot data, tree allometries and historical forest fire records within a diameter class-structured model of 1.0 ha patches of mono-specific black spruce and jack pine stands in northern Québec, Canada. The model accounts for crown fire initiation and vertical spread into the canopy. It uses empirical relations between fire intensity, scorch height, the percent of crown scorched and tree mortality to simulate fire severity, specifically the percent reduction in patch basal area due to fire-caused mortality. A random forest and a regression tree analysis of a large random sample of simulated fires were used to test for an effect of fireline intensity, stand structure, species composition and pyrogeographic regions on resultant severity. Severity increased with intensity and was lower for jack pine stands. The proportion of simulated fires that burned at high severity (e.g. >75% reduction in patch basal area) was 0.80 for black spruce and 0.11 for jack pine. We identified thresholds in intensity below which there was a marked sensitivity of simulated fire severity to stand structure, and to interactions between intensity and structure. We found no evidence for a residual effect of pyrogeographic region on simulated severity, after the effects of stand structure and species composition were accounted for. The model presented here was able to produce variation in fire severity under a range of fire intensity conditions. This suggests that variation in stand structure is one of the factors causing the observed variation in boreal fire severity.
Stacked Denoising Autoencoders Applied to Star/Galaxy Classification
NASA Astrophysics Data System (ADS)
Qin, Hao-ran; Lin, Ji-ming; Wang, Jun-yi
2017-04-01
In recent years, the deep learning algorithm, with the characteristics of strong adaptability, high accuracy, and structural complexity, has become more and more popular, but it has not yet been used in astronomy. In order to solve the problem that the star/galaxy classification accuracy is high for the bright source set, but low for the faint source set of the Sloan Digital Sky Survey (SDSS) data, we introduced the new deep learning algorithm, namely the SDA (stacked denoising autoencoder) neural network and the dropout fine-tuning technique, which can greatly improve the robustness and antinoise performance. We randomly selected respectively the bright source sets and faint source sets from the SDSS DR12 and DR7 data with spectroscopic measurements, and made preprocessing on them. Then, we randomly selected respectively the training sets and testing sets without replacement from the bright source sets and faint source sets. At last, using these training sets we made the training to obtain the SDA models of the bright sources and faint sources in the SDSS DR7 and DR12, respectively. We compared the test result of the SDA model on the DR12 testing set with the test results of the Library for Support Vector Machines (LibSVM), J48 decision tree, Logistic Model Tree (LMT), Support Vector Machine (SVM), Logistic Regression, and Decision Stump algorithm, and compared the test result of the SDA model on the DR7 testing set with the test results of six kinds of decision trees. The experiments show that the SDA has a better classification accuracy than other machine learning algorithms for the faint source sets of DR7 and DR12. Especially, when the completeness function is used as the evaluation index, compared with the decision tree algorithms, the correctness rate of SDA has improved about 15% for the faint source set of SDSS-DR7.
Risk Factors of Falls in Community-Dwelling Older Adults: Logistic Regression Tree Analysis
ERIC Educational Resources Information Center
Yamashita, Takashi; Noe, Douglas A.; Bailer, A. John
2012-01-01
Purpose of the Study: A novel logistic regression tree-based method was applied to identify fall risk factors and possible interaction effects of those risk factors. Design and Methods: A nationally representative sample of American older adults aged 65 years and older (N = 9,592) in the Health and Retirement Study 2004 and 2006 modules was used.…
Modeling vertebrate diversity in Oregon using satellite imagery
NASA Astrophysics Data System (ADS)
Cablk, Mary Elizabeth
Vertebrate diversity was modeled for the state of Oregon using a parametric approach to regression tree analysis. This exploratory data analysis effectively modeled the non-linear relationships between vertebrate richness and phenology, terrain, and climate. Phenology was derived from time-series NOAA-AVHRR satellite imagery for the year 1992 using two methods: principal component analysis and derivation of EROS data center greenness metrics. These two measures of spatial and temporal vegetation condition incorporated the critical temporal element in this analysis. The first three principal components were shown to contain spatial and temporal information about the landscape and discriminated phenologically distinct regions in Oregon. Principal components 2 and 3, 6 greenness metrics, elevation, slope, aspect, annual precipitation, and annual seasonal temperature difference were investigated as correlates to amphibians, birds, all vertebrates, reptiles, and mammals. Variation explained for each regression tree by taxa were: amphibians (91%), birds (67%), all vertebrates (66%), reptiles (57%), and mammals (55%). Spatial statistics were used to quantify the pattern of each taxa and assess validity of resulting predictions from regression tree models. Regression tree analysis was relatively robust against spatial autocorrelation in the response data and graphical results indicated models were well fit to the data.
The contribution of competition to tree mortality in old-growth coniferous forests
Das, A.; Battles, J.; Stephenson, N.L.; van Mantgem, P.J.
2011-01-01
Competition is a well-documented contributor to tree mortality in temperate forests, with numerous studies documenting a relationship between tree death and the competitive environment. Models frequently rely on competition as the only non-random mechanism affecting tree mortality. However, for mature forests, competition may cease to be the primary driver of mortality.We use a large, long-term dataset to study the importance of competition in determining tree mortality in old-growth forests on the western slope of the Sierra Nevada of California, U.S.A. We make use of the comparative spatial configuration of dead and live trees, changes in tree spatial pattern through time, and field assessments of contributors to an individual tree's death to quantify competitive effects.Competition was apparently a significant contributor to tree mortality in these forests. Trees that died tended to be in more competitive environments than trees that survived, and suppression frequently appeared as a factor contributing to mortality. On the other hand, based on spatial pattern analyses, only three of 14 plots demonstrated compelling evidence that competition was dominating mortality. Most of the rest of the plots fell within the expectation for random mortality, and three fit neither the random nor the competition model. These results suggest that while competition is often playing a significant role in tree mortality processes in these forests it only infrequently governs those processes. In addition, the field assessments indicated a substantial presence of biotic mortality agents in trees that died.While competition is almost certainly important, demographics in these forests cannot accurately be characterized without a better grasp of other mortality processes. In particular, we likely need a better understanding of biotic agents and their interactions with one another and with competition. ?? 2011.
Wiström, Björn; Busse Nielsen, Anders
2017-07-01
After two major storms, the Swedish Transport Administration was granted permission in 2008 to expand the railroad corridor from 10 to 20 m from the rail banks, and to clear the forest edges in the expanded area. In order to evaluate the possibilities for managers to promote and control the species composition of the woody regrowth so that a forest edge with a graded profile develops over time, this study mapped the woody regrowth and environmental variables at 78 random sites along the 610-km railroad between Stockholm and Malmö four growing seasons after the clearing was implemented. Through different clustering approaches, dominant tree species to be controlled and future building block species for management were identified. Using multivariate regression trees, the most decisive environmental variables were identified and used to develop a regrowth typology and to calculate species indicator values. Five regrowth types and ten indicator species were identified along the environmental gradients of soil moisture, soil fertility, and altitude. Six tree species dominated the regrowth across the regrowth types, but clustering showed that if these were controlled by selective thinning, lower tree and shrub species were generally present so they could form the "building blocks" for development of a graded edge. We concluded that selective thinning targeted at controlling a few dominant tree species, here named Functional Species Control, is a simple and easily implemented management concept to promote a wide range of suitable species, because it does not require field staff with specialist taxonomic knowledge.
Overstory structure and soil nutrients effect on plant diversity in unmanaged moist tropical forest
NASA Astrophysics Data System (ADS)
Gautam, Mukesh Kumar; Manhas, Rajesh Kumar; Tripathi, Ashutosh Kumar
2016-08-01
Forests with intensive management past are kept unmanaged to restore diversity and ecosystem functioning. Before perpetuating abandonment after protracted restitution, understanding its effect on forest vegetation is desirable. We studied plant diversity and its relation with environmental variables and stand structure in northern Indian unmanaged tropical moist deciduous forest. We hypothesized that post-abandonment species richness would have increased, and the structure of contemporary forest would be heterogeneous. Vegetation structure, composition, and diversity were recorded, in forty 0.1 ha plots selected randomly in four forest ranges. Three soil samples per 0.1 ha were assessed for physicochemistry, fine sand, and clay mineralogy. Contemporary forest had less species richness than pre-abandonment reference period. Fourteen species were recorded as either seedling or sapling, suggesting reappearance or immigration. For most species, regeneration was either absent or impaired. Ordination and multiple regression results showed that exchangeable base cations and phosphorous affected maximum tree diversity and structure variables. Significant correlations between soil moisture and temperature, and shrub layer was observed, besides tree layer correspondence with shrub richness, suggesting that dense overstory resulting from abandonment through its effect on soil conditions, is responsible for dense shrub layer. Herb layer diversity was negatively associated with tree layer and shrub overgrowth (i.e. Mallotus spp.). Protracted abandonment may not reinforce species richness and heterogeneity; perhaps result in high tree and shrub density in moist deciduous forests, which can impede immigrating or reappearing plant species establishment. This can be overcome by density/basal area reduction strategies, albeit for both tree and shrub layer.
Distribution of cavity trees in midwestern old-growth and second-growth forests
Zhaofei Fan; Stephen R. Shifley; Martin A. Spetich; Frank R. Thompson; David R. Larsen
2003-01-01
We used classification and regression tree analysis to determine the primary variables associated with the occurrence of cavity trees and the hierarchical structure among those variables. We applied that information to develop logistic models predicting cavity tree probability as a function of diameter, species group, and decay class. Inventories of cavity abundance in...
Distribution of cavity trees in midwesternold-growth and second-growth forests
Zhaofei Fan; Stephen R. Shifley; Martin A. Spetich; Frank R., III Thompson; David R. Larsen
2003-01-01
We used classification and regression tree analysis to determine the primary variables associated with the occurrence of cavity trees and the hierarchical structure among those variables. We applied that information to develop logistic models predicting cavity tree probability as a function of diameter, species group, and decay class. Inventories of cavity abundance in...
A hierarchical linear model for tree height prediction.
Vicente J. Monleon
2003-01-01
Measuring tree height is a time-consuming process. Often, tree diameter is measured and height is estimated from a published regression model. Trees used to develop these models are clustered into stands, but this structure is ignored and independence is assumed. In this study, hierarchical linear models that account explicitly for the clustered structure of the data...
Modeling individual tree survial
Quang V. Cao
2016-01-01
Information provided by growth and yield models is the basis for forest managers to make decisions on how to manage their forests. Among different types of growth models, whole-stand models offer predictions at stand level, whereas individual-tree models give detailed information at tree level. The well-known logistic regression is commonly used to predict tree...
Extending the Distributed Lag Model framework to handle chemical mixtures.
Bello, Ghalib A; Arora, Manish; Austin, Christine; Horton, Megan K; Wright, Robert O; Gennings, Chris
2017-07-01
Distributed Lag Models (DLMs) are used in environmental health studies to analyze the time-delayed effect of an exposure on an outcome of interest. Given the increasing need for analytical tools for evaluation of the effects of exposure to multi-pollutant mixtures, this study attempts to extend the classical DLM framework to accommodate and evaluate multiple longitudinally observed exposures. We introduce 2 techniques for quantifying the time-varying mixture effect of multiple exposures on an outcome of interest. Lagged WQS, the first technique, is based on Weighted Quantile Sum (WQS) regression, a penalized regression method that estimates mixture effects using a weighted index. We also introduce Tree-based DLMs, a nonparametric alternative for assessment of lagged mixture effects. This technique is based on the Random Forest (RF) algorithm, a nonparametric, tree-based estimation technique that has shown excellent performance in a wide variety of domains. In a simulation study, we tested the feasibility of these techniques and evaluated their performance in comparison to standard methodology. Both methods exhibited relatively robust performance, accurately capturing pre-defined non-linear functional relationships in different simulation settings. Further, we applied these techniques to data on perinatal exposure to environmental metal toxicants, with the goal of evaluating the effects of exposure on neurodevelopment. Our methods identified critical neurodevelopmental windows showing significant sensitivity to metal mixtures. Copyright © 2017 Elsevier Inc. All rights reserved.
Anderson localization for radial tree-like random quantum graphs
NASA Astrophysics Data System (ADS)
Hislop, Peter D.; Post, Olaf
We prove that certain random models associated with radial, tree-like, rooted quantum graphs exhibit Anderson localization at all energies. The two main examples are the random length model (RLM) and the random Kirchhoff model (RKM). In the RLM, the lengths of each generation of edges form a family of independent, identically distributed random variables (iid). For the RKM, the iid random variables are associated with each generation of vertices and moderate the current flow through the vertex. We consider extensions to various families of decorated graphs and prove stability of localization with respect to decoration. In particular, we prove Anderson localization for the random necklace model.
NASA Astrophysics Data System (ADS)
Freeman, Mary Pyott
ABSTRACT An Analysis of Tree Mortality Using High Resolution Remotely-Sensed Data for Mixed-Conifer Forests in San Diego County by Mary Pyott Freeman The montane mixed-conifer forests of San Diego County are currently experiencing extensive tree mortality, which is defined as dieback where whole stands are affected. This mortality is likely the result of the complex interaction of many variables, such as altered fire regimes, climatic conditions such as drought, as well as forest pathogens and past management strategies. Conifer tree mortality and its spatial pattern and change over time were examined in three components. In component 1, two remote sensing approaches were compared for their effectiveness in delineating dead trees, a spatial contextual approach and an OBIA (object based image analysis) approach, utilizing various dates and spatial resolutions of airborne image data. For each approach transforms and masking techniques were explored, which were found to improve classifications, and an object-based assessment approach was tested. In component 2, dead tree maps produced by the most effective techniques derived from component 1 were utilized for point pattern and vector analyses to further understand spatio-temporal changes in tree mortality for the years 1997, 2000, 2002, and 2005 for three study areas: Palomar, Volcan and Laguna mountains. Plot-based fieldwork was conducted to further assess mortality patterns. Results indicate that conifer mortality was significantly clustered, increased substantially between 2002 and 2005, and was non-random with respect to tree species and diameter class sizes. In component 3, multiple environmental variables were used in Generalized Linear Model (GLM-logistic regression) and decision tree classifier model development, revealing the importance of climate and topographic factors such as precipitation and elevation, in being able to predict areas of high risk for tree mortality. The results from this study highlight the importance of multi-scale spatial as well as temporal analyses, in order to understand mixed-conifer forest structure, dynamics, and processes of decline, which can lead to more sustainable management of forests with continued natural and anthropogenic disturbance.
Log and tree sawing times for hardwood mills
Everette D. Rast
1974-01-01
Data on 6,850 logs and 1,181 trees were analyzed to predict sawing times. For both logs and trees, regression equations were derived that express (in minutes) sawing time per log or tree and per Mbf. For trees, merchantable height is expressed in number of logs as well as in feet. One of the major uses for the tables of average sawing times is as a bench mark against...
NASA Astrophysics Data System (ADS)
Sirait, Kamson; Tulus; Budhiarti Nababan, Erna
2017-12-01
Clustering methods that have high accuracy and time efficiency are necessary for the filtering process. One method that has been known and applied in clustering is K-Means Clustering. In its application, the determination of the begining value of the cluster center greatly affects the results of the K-Means algorithm. This research discusses the results of K-Means Clustering with starting centroid determination with a random and KD-Tree method. The initial determination of random centroid on the data set of 1000 student academic data to classify the potentially dropout has a sse value of 952972 for the quality variable and 232.48 for the GPA, whereas the initial centroid determination by KD-Tree has a sse value of 504302 for the quality variable and 214,37 for the GPA variable. The smaller sse values indicate that the result of K-Means Clustering with initial KD-Tree centroid selection have better accuracy than K-Means Clustering method with random initial centorid selection.
Logistic regression trees for initial selection of interesting loci in case-control studies
Nickolov, Radoslav Z; Milanov, Valentin B
2007-01-01
Modern genetic epidemiology faces the challenge of dealing with hundreds of thousands of genetic markers. The selection of a small initial subset of interesting markers for further investigation can greatly facilitate genetic studies. In this contribution we suggest the use of a logistic regression tree algorithm known as logistic tree with unbiased selection. Using the simulated data provided for Genetic Analysis Workshop 15, we show how this algorithm, with incorporation of multifactor dimensionality reduction method, can reduce an initial large pool of markers to a small set that includes the interesting markers with high probability. PMID:18466557
Andrew T. Hudak; Nicholas L. Crookston; Jeffrey S. Evans; Michael K. Falkowski; Alistair M. S. Smith; Paul E. Gessler; Penelope Morgan
2006-01-01
We compared the utility of discrete-return light detection and ranging (lidar) data and multispectral satellite imagery, and their integration, for modeling and mapping basal area and tree density across two diverse coniferous forest landscapes in north-central Idaho. We applied multiple linear regression models subset from a suite of 26 predictor variables derived...
ERIC Educational Resources Information Center
Kitsantas, Anastasia; Kitsantas, Panagiota; Kitsantas, Thomas
2012-01-01
The purpose of this exploratory study was to assess the relative importance of a number of variables in predicting students' interest in math and/or computer science. Classification and regression trees (CART) were employed in the analysis of survey data collected from 276 college students enrolled in two U.S. and Greek universities. The results…
Identification of extremely premature infants at high risk of rehospitalization.
Ambalavanan, Namasivayam; Carlo, Waldemar A; McDonald, Scott A; Yao, Qing; Das, Abhik; Higgins, Rosemary D
2011-11-01
Extremely low birth weight infants often require rehospitalization during infancy. Our objective was to identify at the time of discharge which extremely low birth weight infants are at higher risk for rehospitalization. Data from extremely low birth weight infants in Eunice Kennedy Shriver National Institute of Child Health and Human Development Neonatal Research Network centers from 2002-2005 were analyzed. The primary outcome was rehospitalization by the 18- to 22-month follow-up, and secondary outcome was rehospitalization for respiratory causes in the first year. Using variables and odds ratios identified by stepwise logistic regression, scoring systems were developed with scores proportional to odds ratios. Classification and regression-tree analysis was performed by recursive partitioning and automatic selection of optimal cutoff points of variables. A total of 3787 infants were evaluated (mean ± SD birth weight: 787 ± 136 g; gestational age: 26 ± 2 weeks; 48% male, 42% black). Forty-five percent of the infants were rehospitalized by 18 to 22 months; 14.7% were rehospitalized for respiratory causes in the first year. Both regression models (area under the curve: 0.63) and classification and regression-tree models (mean misclassification rate: 40%-42%) were moderately accurate. Predictors for the primary outcome by regression were shunt surgery for hydrocephalus, hospital stay of >120 days for pulmonary reasons, necrotizing enterocolitis stage II or higher or spontaneous gastrointestinal perforation, higher fraction of inspired oxygen at 36 weeks, and male gender. By classification and regression-tree analysis, infants with hospital stays of >120 days for pulmonary reasons had a 66% rehospitalization rate compared with 42% without such a stay. The scoring systems and classification and regression-tree analysis models identified infants at higher risk of rehospitalization and might assist planning for care after discharge.
Identification of Extremely Premature Infants at High Risk of Rehospitalization
Carlo, Waldemar A.; McDonald, Scott A.; Yao, Qing; Das, Abhik; Higgins, Rosemary D.
2011-01-01
OBJECTIVE: Extremely low birth weight infants often require rehospitalization during infancy. Our objective was to identify at the time of discharge which extremely low birth weight infants are at higher risk for rehospitalization. METHODS: Data from extremely low birth weight infants in Eunice Kennedy Shriver National Institute of Child Health and Human Development Neonatal Research Network centers from 2002–2005 were analyzed. The primary outcome was rehospitalization by the 18- to 22-month follow-up, and secondary outcome was rehospitalization for respiratory causes in the first year. Using variables and odds ratios identified by stepwise logistic regression, scoring systems were developed with scores proportional to odds ratios. Classification and regression-tree analysis was performed by recursive partitioning and automatic selection of optimal cutoff points of variables. RESULTS: A total of 3787 infants were evaluated (mean ± SD birth weight: 787 ± 136 g; gestational age: 26 ± 2 weeks; 48% male, 42% black). Forty-five percent of the infants were rehospitalized by 18 to 22 months; 14.7% were rehospitalized for respiratory causes in the first year. Both regression models (area under the curve: 0.63) and classification and regression-tree models (mean misclassification rate: 40%–42%) were moderately accurate. Predictors for the primary outcome by regression were shunt surgery for hydrocephalus, hospital stay of >120 days for pulmonary reasons, necrotizing enterocolitis stage II or higher or spontaneous gastrointestinal perforation, higher fraction of inspired oxygen at 36 weeks, and male gender. By classification and regression-tree analysis, infants with hospital stays of >120 days for pulmonary reasons had a 66% rehospitalization rate compared with 42% without such a stay. CONCLUSIONS: The scoring systems and classification and regression-tree analysis models identified infants at higher risk of rehospitalization and might assist planning for care after discharge. PMID:22007016
Jon C. Regelbrugge
1993-01-01
Abstract. We modeled tree mortality occurring two years following wildfire in Pinus ponderosa forests using data from 1275 trees in 25 stands burned during the 1987 Stanislaus Complex fires. We used logistic regression analysis to develop models relating the probability of wildfire-induced mortality with tree size and fire severity for Pinus ponderosa, Calocedrus...
Pre-operative prediction of surgical morbidity in children: comparison of five statistical models.
Cooper, Jennifer N; Wei, Lai; Fernandez, Soledad A; Minneci, Peter C; Deans, Katherine J
2015-02-01
The accurate prediction of surgical risk is important to patients and physicians. Logistic regression (LR) models are typically used to estimate these risks. However, in the fields of data mining and machine-learning, many alternative classification and prediction algorithms have been developed. This study aimed to compare the performance of LR to several data mining algorithms for predicting 30-day surgical morbidity in children. We used the 2012 National Surgical Quality Improvement Program-Pediatric dataset to compare the performance of (1) a LR model that assumed linearity and additivity (simple LR model) (2) a LR model incorporating restricted cubic splines and interactions (flexible LR model) (3) a support vector machine, (4) a random forest and (5) boosted classification trees for predicting surgical morbidity. The ensemble-based methods showed significantly higher accuracy, sensitivity, specificity, PPV, and NPV than the simple LR model. However, none of the models performed better than the flexible LR model in terms of the aforementioned measures or in model calibration or discrimination. Support vector machines, random forests, and boosted classification trees do not show better performance than LR for predicting pediatric surgical morbidity. After further validation, the flexible LR model derived in this study could be used to assist with clinical decision-making based on patient-specific surgical risks. Copyright © 2014 Elsevier Ltd. All rights reserved.
Modeling time-to-event (survival) data using classification tree analysis.
Linden, Ariel; Yarnold, Paul R
2017-12-01
Time to the occurrence of an event is often studied in health research. Survival analysis differs from other designs in that follow-up times for individuals who do not experience the event by the end of the study (called censored) are accounted for in the analysis. Cox regression is the standard method for analysing censored data, but the assumptions required of these models are easily violated. In this paper, we introduce classification tree analysis (CTA) as a flexible alternative for modelling censored data. Classification tree analysis is a "decision-tree"-like classification model that provides parsimonious, transparent (ie, easy to visually display and interpret) decision rules that maximize predictive accuracy, derives exact P values via permutation tests, and evaluates model cross-generalizability. Using empirical data, we identify all statistically valid, reproducible, longitudinally consistent, and cross-generalizable CTA survival models and then compare their predictive accuracy to estimates derived via Cox regression and an unadjusted naïve model. Model performance is assessed using integrated Brier scores and a comparison between estimated survival curves. The Cox regression model best predicts average incidence of the outcome over time, whereas CTA survival models best predict either relatively high, or low, incidence of the outcome over time. Classification tree analysis survival models offer many advantages over Cox regression, such as explicit maximization of predictive accuracy, parsimony, statistical robustness, and transparency. Therefore, researchers interested in accurate prognoses and clear decision rules should consider developing models using the CTA-survival framework. © 2017 John Wiley & Sons, Ltd.
Liu, Yang; Lü, Yi-he; Zheng, Hai-feng; Chen, Li-ding
2010-05-01
Based on the 10-day SPOT VEGETATION NDVI data and the daily meteorological data from 1998 to 2007 in Yan' an City, the main meteorological variables affecting the annual and interannual variations of NDVI were determined by using regression tree. It was found that the effects of test meteorological variables on the variability of NDVI differed with seasons and time lags. Temperature and precipitation were the most important meteorological variables affecting the annual variation of NDVI, and the average highest temperature was the most important meteorological variable affecting the inter-annual variation of NDVI. Regression tree was very powerful in determining the key meteorological variables affecting NDVI variation, but could not build quantitative relations between NDVI and meteorological variables, which limited its further and wider application.
Modeling groundwater nitrate concentrations in private wells in Iowa
Wheeler, David C.; Nolan, Bernard T.; Flory, Abigail R.; DellaValle, Curt T.; Ward, Mary H.
2015-01-01
Contamination of drinking water by nitrate is a growing problem in many agricultural areas of the country. Ingested nitrate can lead to the endogenous formation of N-nitroso compounds, potent carcinogens. We developed a predictive model for nitrate concentrations in private wells in Iowa. Using 34,084 measurements of nitrate in private wells, we trained and tested random forest models to predict log nitrate levels by systematically assessing the predictive performance of 179 variables in 36 thematic groups (well depth, distance to sinkholes, location, land use, soil characteristics, nitrogen inputs, meteorology, and other factors). The final model contained 66 variables in 17 groups. Some of the most important variables were well depth, slope length within 1 km of the well, year of sample, and distance to nearest animal feeding operation. The correlation between observed and estimated nitrate concentrations was excellent in the training set (r-square = 0.77) and was acceptable in the testing set (r-square = 0.38). The random forest model had substantially better predictive performance than a traditional linear regression model or a regression tree. Our model will be used to investigate the association between nitrate levels in drinking water and cancer risk in the Iowa participants of the Agricultural Health Study cohort.
Modeling groundwater nitrate concentrations in private wells in Iowa.
Wheeler, David C; Nolan, Bernard T; Flory, Abigail R; DellaValle, Curt T; Ward, Mary H
2015-12-01
Contamination of drinking water by nitrate is a growing problem in many agricultural areas of the country. Ingested nitrate can lead to the endogenous formation of N-nitroso compounds, potent carcinogens. We developed a predictive model for nitrate concentrations in private wells in Iowa. Using 34,084 measurements of nitrate in private wells, we trained and tested random forest models to predict log nitrate levels by systematically assessing the predictive performance of 179 variables in 36 thematic groups (well depth, distance to sinkholes, location, land use, soil characteristics, nitrogen inputs, meteorology, and other factors). The final model contained 66 variables in 17 groups. Some of the most important variables were well depth, slope length within 1 km of the well, year of sample, and distance to nearest animal feeding operation. The correlation between observed and estimated nitrate concentrations was excellent in the training set (r-square=0.77) and was acceptable in the testing set (r-square=0.38). The random forest model had substantially better predictive performance than a traditional linear regression model or a regression tree. Our model will be used to investigate the association between nitrate levels in drinking water and cancer risk in the Iowa participants of the Agricultural Health Study cohort. Copyright © 2015 Elsevier B.V. All rights reserved.
Harry T. Valentine
2002-01-01
Randomized branch sampling (RBS) is a special application of multistage probability sampling (see Sampling, environmental), which was developed originally by Jessen [3] to estimate fruit counts on individual orchard trees. In general, the method can be used to obtain estimates of many different attributes of trees or other branched plants. The usual objective of RBS is...
Partitioning sources of variation in vertebrate species richness
Boone, R.B.; Krohn, W.B.
2000-01-01
Aim: To explore biogeographic patterns of terrestrial vertebrates in Maine, USA using techniques that would describe local and spatial correlations with the environment. Location: Maine, USA. Methods: We delineated the ranges within Maine (86,156 km2) of 275 species using literature and expert review. Ranges were combined into species richness maps, and compared to geomorphology, climate, and woody plant distributions. Methods were adapted that compared richness of all vertebrate classes to each environmental correlate, rather than assessing a single explanatory theory. We partitioned variation in species richness into components using tree and multiple linear regression. Methods were used that allowed for useful comparisons between tree and linear regression results. For both methods we partitioned variation into broad-scale (spatially autocorrelated) and fine-scale (spatially uncorrelated) explained and unexplained components. By partitioning variance, and using both tree and linear regression in analyses, we explored the degree of variation in species richness for each vertebrate group that Could be explained by the relative contribution of each environmental variable. Results: In tree regression, climate variation explained richness better (92% of mean deviance explained for all species) than woody plant variation (87%) and geomorphology (86%). Reptiles were highly correlated with environmental variation (93%), followed by mammals, amphibians, and birds (each with 84-82% deviance explained). In multiple linear regression, climate was most closely associated with total vertebrate richness (78%), followed by woody plants (67%) and geomorphology (56%). Again, reptiles were closely correlated with the environment (95%), followed by mammals (73%), amphibians (63%) and birds (57%). Main conclusions: Comparing variation explained using tree and multiple linear regression quantified the importance of nonlinear relationships and local interactions between species richness and environmental variation, identifying the importance of linear relationships between reptiles and the environment, and nonlinear relationships between birds and woody plants, for example. Conservation planners should capture climatic variation in broad-scale designs; temperatures may shift during climate change, but the underlying correlations between the environment and species richness will presumably remain.
Tree mortality rates and tree population projections in Baltimore, Maryland, USA
David J. Nowak; Miki Kuroda; Daniel E. Crane
2004-01-01
Based on re-measurements (1999 and 2001) of randomly-distributed permanent plots within the city boundaries of Baltimore, Maryland, trees are estimated to have an annual mortality rate of 6.6% with an overall annual net change in the number of live trees of -4.2%. Tree mortality rates were significantly different based on tree size, condition, species, and Land use....
Hostettler, Isabel Charlotte; Muroi, Carl; Richter, Johannes Konstantin; Schmid, Josef; Neidert, Marian Christoph; Seule, Martin; Boss, Oliver; Pangalu, Athina; Germans, Menno Robbert; Keller, Emanuela
2018-01-19
OBJECTIVE The aim of this study was to create prediction models for outcome parameters by decision tree analysis based on clinical and laboratory data in patients with aneurysmal subarachnoid hemorrhage (aSAH). METHODS The database consisted of clinical and laboratory parameters of 548 patients with aSAH who were admitted to the Neurocritical Care Unit, University Hospital Zurich. To examine the model performance, the cohort was randomly divided into a derivation cohort (60% [n = 329]; training data set) and a validation cohort (40% [n = 219]; test data set). The classification and regression tree prediction algorithm was applied to predict death, functional outcome, and ventriculoperitoneal (VP) shunt dependency. Chi-square automatic interaction detection was applied to predict delayed cerebral infarction on days 1, 3, and 7. RESULTS The overall mortality was 18.4%. The accuracy of the decision tree models was good for survival on day 1 and favorable functional outcome at all time points, with a difference between the training and test data sets of < 5%. Prediction accuracy for survival on day 1 was 75.2%. The most important differentiating factor was the interleukin-6 (IL-6) level on day 1. Favorable functional outcome, defined as Glasgow Outcome Scale scores of 4 and 5, was observed in 68.6% of patients. Favorable functional outcome at all time points had a prediction accuracy of 71.1% in the training data set, with procalcitonin on day 1 being the most important differentiating factor at all time points. A total of 148 patients (27%) developed VP shunt dependency. The most important differentiating factor was hyperglycemia on admission. CONCLUSIONS The multiple variable analysis capability of decision trees enables exploration of dependent variables in the context of multiple changing influences over the course of an illness. The decision tree currently generated increases awareness of the early systemic stress response, which is seemingly pertinent for prognostication.
Villamor, Grace B.; Nyarko, Benjamin Kofi; Wala, Kperkouma; Akpagana, Koffi
2018-01-01
Vitellaria paradoxa (Gaertn C. F.), or shea tree, remains one of the most valuable trees for farmers in the Atacora district of northern Benin, where rural communities depend on shea products for both food and income. To optimize productivity and management of shea agroforestry systems, or "parklands," accurate and up-to-date data are needed. For this purpose, we monitored120 fruiting shea trees for two years under three land-use scenarios and different soil groups in Atacora, coupled with a farm household survey to elicit information on decision making and management practices. To examine the local pattern of shea tree productivity and relationships between morphological factors and yields, we used a randomized branch sampling method and applied a regression analysis to build a shea yield model based on dendrometric, soil and land-use variables. We also compared potential shea yields based on farm household socio-economic characteristics and management practices derived from the survey data. Soil and land-use variables were the most important determinants of shea fruit yield. In terms of land use, shea trees growing on farmland plots exhibited the highest yields (i.e., fruit quantity and mass) while trees growing on Lixisols performed better than those of the other soil group. Contrary to our expectations, dendrometric parameters had weak relationships with fruit yield regardless of land-use and soil group. There is an inter-annual variability in fruit yield in both soil groups and land-use type. In addition to observed inter-annual yield variability, there was a high degree of variability in production among individual shea trees. Furthermore, household socioeconomic characteristics such as road accessibility, landholding size, and gross annual income influence shea fruit yield. The use of fallow areas is an important land management practice in the study area that influences both conservation and shea yield. PMID:29346406
Aleza, Koutchoukalo; Villamor, Grace B; Nyarko, Benjamin Kofi; Wala, Kperkouma; Akpagana, Koffi
2018-01-01
Vitellaria paradoxa (Gaertn C. F.), or shea tree, remains one of the most valuable trees for farmers in the Atacora district of northern Benin, where rural communities depend on shea products for both food and income. To optimize productivity and management of shea agroforestry systems, or "parklands," accurate and up-to-date data are needed. For this purpose, we monitored120 fruiting shea trees for two years under three land-use scenarios and different soil groups in Atacora, coupled with a farm household survey to elicit information on decision making and management practices. To examine the local pattern of shea tree productivity and relationships between morphological factors and yields, we used a randomized branch sampling method and applied a regression analysis to build a shea yield model based on dendrometric, soil and land-use variables. We also compared potential shea yields based on farm household socio-economic characteristics and management practices derived from the survey data. Soil and land-use variables were the most important determinants of shea fruit yield. In terms of land use, shea trees growing on farmland plots exhibited the highest yields (i.e., fruit quantity and mass) while trees growing on Lixisols performed better than those of the other soil group. Contrary to our expectations, dendrometric parameters had weak relationships with fruit yield regardless of land-use and soil group. There is an inter-annual variability in fruit yield in both soil groups and land-use type. In addition to observed inter-annual yield variability, there was a high degree of variability in production among individual shea trees. Furthermore, household socioeconomic characteristics such as road accessibility, landholding size, and gross annual income influence shea fruit yield. The use of fallow areas is an important land management practice in the study area that influences both conservation and shea yield.
SUNPLIN: Simulation with Uncertainty for Phylogenetic Investigations
2013-01-01
Background Phylogenetic comparative analyses usually rely on a single consensus phylogenetic tree in order to study evolutionary processes. However, most phylogenetic trees are incomplete with regard to species sampling, which may critically compromise analyses. Some approaches have been proposed to integrate non-molecular phylogenetic information into incomplete molecular phylogenies. An expanded tree approach consists of adding missing species to random locations within their clade. The information contained in the topology of the resulting expanded trees can be captured by the pairwise phylogenetic distance between species and stored in a matrix for further statistical analysis. Thus, the random expansion and processing of multiple phylogenetic trees can be used to estimate the phylogenetic uncertainty through a simulation procedure. Because of the computational burden required, unless this procedure is efficiently implemented, the analyses are of limited applicability. Results In this paper, we present efficient algorithms and implementations for randomly expanding and processing phylogenetic trees so that simulations involved in comparative phylogenetic analysis with uncertainty can be conducted in a reasonable time. We propose algorithms for both randomly expanding trees and calculating distance matrices. We made available the source code, which was written in the C++ language. The code may be used as a standalone program or as a shared object in the R system. The software can also be used as a web service through the link: http://purl.oclc.org/NET/sunplin/. Conclusion We compare our implementations to similar solutions and show that significant performance gains can be obtained. Our results open up the possibility of accounting for phylogenetic uncertainty in evolutionary and ecological analyses of large datasets. PMID:24229408
SUNPLIN: simulation with uncertainty for phylogenetic investigations.
Martins, Wellington S; Carmo, Welton C; Longo, Humberto J; Rosa, Thierson C; Rangel, Thiago F
2013-11-15
Phylogenetic comparative analyses usually rely on a single consensus phylogenetic tree in order to study evolutionary processes. However, most phylogenetic trees are incomplete with regard to species sampling, which may critically compromise analyses. Some approaches have been proposed to integrate non-molecular phylogenetic information into incomplete molecular phylogenies. An expanded tree approach consists of adding missing species to random locations within their clade. The information contained in the topology of the resulting expanded trees can be captured by the pairwise phylogenetic distance between species and stored in a matrix for further statistical analysis. Thus, the random expansion and processing of multiple phylogenetic trees can be used to estimate the phylogenetic uncertainty through a simulation procedure. Because of the computational burden required, unless this procedure is efficiently implemented, the analyses are of limited applicability. In this paper, we present efficient algorithms and implementations for randomly expanding and processing phylogenetic trees so that simulations involved in comparative phylogenetic analysis with uncertainty can be conducted in a reasonable time. We propose algorithms for both randomly expanding trees and calculating distance matrices. We made available the source code, which was written in the C++ language. The code may be used as a standalone program or as a shared object in the R system. The software can also be used as a web service through the link: http://purl.oclc.org/NET/sunplin/. We compare our implementations to similar solutions and show that significant performance gains can be obtained. Our results open up the possibility of accounting for phylogenetic uncertainty in evolutionary and ecological analyses of large datasets.
Schmidt, Johannes; Glaser, Bruno
2016-01-01
Tropical forests are significant carbon sinks and their soils’ carbon storage potential is immense. However, little is known about the soil organic carbon (SOC) stocks of tropical mountain areas whose complex soil-landscape and difficult accessibility pose a challenge to spatial analysis. The choice of methodology for spatial prediction is of high importance to improve the expected poor model results in case of low predictor-response correlations. Four aspects were considered to improve model performance in predicting SOC stocks of the organic layer of a tropical mountain forest landscape: Different spatial predictor settings, predictor selection strategies, various machine learning algorithms and model tuning. Five machine learning algorithms: random forests, artificial neural networks, multivariate adaptive regression splines, boosted regression trees and support vector machines were trained and tuned to predict SOC stocks from predictors derived from a digital elevation model and satellite image. Topographical predictors were calculated with a GIS search radius of 45 to 615 m. Finally, three predictor selection strategies were applied to the total set of 236 predictors. All machine learning algorithms—including the model tuning and predictor selection—were compared via five repetitions of a tenfold cross-validation. The boosted regression tree algorithm resulted in the overall best model. SOC stocks ranged between 0.2 to 17.7 kg m-2, displaying a huge variability with diffuse insolation and curvatures of different scale guiding the spatial pattern. Predictor selection and model tuning improved the models’ predictive performance in all five machine learning algorithms. The rather low number of selected predictors favours forward compared to backward selection procedures. Choosing predictors due to their indiviual performance was vanquished by the two procedures which accounted for predictor interaction. PMID:27128736
Ließ, Mareike; Schmidt, Johannes; Glaser, Bruno
2016-01-01
Tropical forests are significant carbon sinks and their soils' carbon storage potential is immense. However, little is known about the soil organic carbon (SOC) stocks of tropical mountain areas whose complex soil-landscape and difficult accessibility pose a challenge to spatial analysis. The choice of methodology for spatial prediction is of high importance to improve the expected poor model results in case of low predictor-response correlations. Four aspects were considered to improve model performance in predicting SOC stocks of the organic layer of a tropical mountain forest landscape: Different spatial predictor settings, predictor selection strategies, various machine learning algorithms and model tuning. Five machine learning algorithms: random forests, artificial neural networks, multivariate adaptive regression splines, boosted regression trees and support vector machines were trained and tuned to predict SOC stocks from predictors derived from a digital elevation model and satellite image. Topographical predictors were calculated with a GIS search radius of 45 to 615 m. Finally, three predictor selection strategies were applied to the total set of 236 predictors. All machine learning algorithms-including the model tuning and predictor selection-were compared via five repetitions of a tenfold cross-validation. The boosted regression tree algorithm resulted in the overall best model. SOC stocks ranged between 0.2 to 17.7 kg m-2, displaying a huge variability with diffuse insolation and curvatures of different scale guiding the spatial pattern. Predictor selection and model tuning improved the models' predictive performance in all five machine learning algorithms. The rather low number of selected predictors favours forward compared to backward selection procedures. Choosing predictors due to their indiviual performance was vanquished by the two procedures which accounted for predictor interaction.
Hutton, Eileen K; Simioni, Julia C; Thabane, Lehana
2017-08-01
Among women with a fetus with a non-cephalic presentation, external cephalic version (ECV) has been shown to reduce the rate of breech presentation at birth and cesarean birth. Compared with ECV at term, beginning ECV prior to 37 weeks' gestation decreases the number of infants in a non-cephalic presentation at birth. The purpose of this secondary analysis was to investigate factors associated with a successful ECV procedure and to present this in a clinically useful format. Data were collected as part of the Early ECV Pilot and Early ECV2 Trials, which randomized 1776 women with a fetus in breech presentation to either early ECV (34-36 weeks' gestation) or delayed ECV (at or after 37 weeks). The outcome of interest was successful ECV, defined as the fetus being in a cephalic presentation immediately following the procedure, as well as at the time of birth. The importance of several factors in predicting successful ECV was investigated using two statistical methods: logistic regression and classification and regression tree (CART) analyses. Among nulliparas, non-engagement of the presenting part and an easily palpable fetal head were independently associated with success. Among multiparas, non-engagement of the presenting part, gestation less than 37 weeks and an easily palpable fetal head were found to be independent predictors of success. These findings were consistent with results of the CART analyses. Regardless of parity, descent of the presenting part was the most discriminating factor in predicting successful ECV and cephalic presentation at birth. © 2017 Nordic Federation of Societies of Obstetrics and Gynecology.
Wendling, T; Jung, K; Callahan, A; Schuler, A; Shah, N H; Gallego, B
2018-06-03
There is growing interest in using routinely collected data from health care databases to study the safety and effectiveness of therapies in "real-world" conditions, as it can provide complementary evidence to that of randomized controlled trials. Causal inference from health care databases is challenging because the data are typically noisy, high dimensional, and most importantly, observational. It requires methods that can estimate heterogeneous treatment effects while controlling for confounding in high dimensions. Bayesian additive regression trees, causal forests, causal boosting, and causal multivariate adaptive regression splines are off-the-shelf methods that have shown good performance for estimation of heterogeneous treatment effects in observational studies of continuous outcomes. However, it is not clear how these methods would perform in health care database studies where outcomes are often binary and rare and data structures are complex. In this study, we evaluate these methods in simulation studies that recapitulate key characteristics of comparative effectiveness studies. We focus on the conditional average effect of a binary treatment on a binary outcome using the conditional risk difference as an estimand. To emulate health care database studies, we propose a simulation design where real covariate and treatment assignment data are used and only outcomes are simulated based on nonparametric models of the real outcomes. We apply this design to 4 published observational studies that used records from 2 major health care databases in the United States. Our results suggest that Bayesian additive regression trees and causal boosting consistently provide low bias in conditional risk difference estimates in the context of health care database studies. Copyright © 2018 John Wiley & Sons, Ltd.
van der Ploeg, Tjeerd; Nieboer, Daan; Steyerberg, Ewout W
2016-10-01
Prediction of medical outcomes may potentially benefit from using modern statistical modeling techniques. We aimed to externally validate modeling strategies for prediction of 6-month mortality of patients suffering from traumatic brain injury (TBI) with predictor sets of increasing complexity. We analyzed individual patient data from 15 different studies including 11,026 TBI patients. We consecutively considered a core set of predictors (age, motor score, and pupillary reactivity), an extended set with computed tomography scan characteristics, and a further extension with two laboratory measurements (glucose and hemoglobin). With each of these sets, we predicted 6-month mortality using default settings with five statistical modeling techniques: logistic regression (LR), classification and regression trees, random forests (RFs), support vector machines (SVM) and neural nets. For external validation, a model developed on one of the 15 data sets was applied to each of the 14 remaining sets. This process was repeated 15 times for a total of 630 validations. The area under the receiver operating characteristic curve (AUC) was used to assess the discriminative ability of the models. For the most complex predictor set, the LR models performed best (median validated AUC value, 0.757), followed by RF and support vector machine models (median validated AUC value, 0.735 and 0.732, respectively). With each predictor set, the classification and regression trees models showed poor performance (median validated AUC value, <0.7). The variability in performance across the studies was smallest for the RF- and LR-based models (inter quartile range for validated AUC values from 0.07 to 0.10). In the area of predicting mortality from TBI, nonlinear and nonadditive effects are not pronounced enough to make modern prediction methods beneficial. Copyright © 2016 Elsevier Inc. All rights reserved.
Modeling Caribbean tree stem diameters from tree height and crown width measurements
Thomas Brandeis; KaDonna Randolph; Mike Strub
2009-01-01
Regression models to predict diameter at breast height (DBH) as a function of tree height and maximum crown radius were developed for Caribbean forests based on data collected by the U.S. Forest Service in the Commonwealth of Puerto Rico and Territory of the U.S. Virgin Islands. The model predicting DBH from tree height fit reasonably well (R2 = 0.7110), with...
Perceived Organizational Support for Enhancing Welfare at Work: A Regression Tree Model
Giorgi, Gabriele; Dubin, David; Perez, Javier Fiz
2016-01-01
When trying to examine outcomes such as welfare and well-being, research tends to focus on main effects and take into account limited numbers of variables at a time. There are a number of techniques that may help address this problem. For example, many statistical packages available in R provide easy-to-use methods of modeling complicated analysis such as classification and tree regression (i.e., recursive partitioning). The present research illustrates the value of recursive partitioning in the prediction of perceived organizational support in a sample of more than 6000 Italian bankers. Utilizing the tree function party package in R, we estimated a regression tree model predicting perceived organizational support from a multitude of job characteristics including job demand, lack of job control, lack of supervisor support, training, etc. The resulting model appears particularly helpful in pointing out several interactions in the prediction of perceived organizational support. In particular, training is the dominant factor. Another dimension that seems to influence organizational support is reporting (perceived communication about safety and stress concerns). Results are discussed from a theoretical and methodological point of view. PMID:28082924
Prediction of strontium bromide laser efficiency using cluster and decision tree analysis
NASA Astrophysics Data System (ADS)
Iliev, Iliycho; Gocheva-Ilieva, Snezhana; Kulin, Chavdar
2018-01-01
Subject of investigation is a new high-powered strontium bromide (SrBr2) vapor laser emitting in multiline region of wavelengths. The laser is an alternative to the atom strontium lasers and electron free lasers, especially at the line 6.45 μm which line is used in surgery for medical processing of biological tissues and bones with minimal damage. In this paper the experimental data from measurements of operational and output characteristics of the laser are statistically processed by means of cluster analysis and tree-based regression techniques. The aim is to extract the more important relationships and dependences from the available data which influence the increase of the overall laser efficiency. There are constructed and analyzed a set of cluster models. It is shown by using different cluster methods that the seven investigated operational characteristics (laser tube diameter, length, supplied electrical power, and others) and laser efficiency are combined in 2 clusters. By the built regression tree models using Classification and Regression Trees (CART) technique there are obtained dependences to predict the values of efficiency, and especially the maximum efficiency with over 95% accuracy.
Digression and Value Concatenation to Enable Privacy-Preserving Regression.
Li, Xiao-Bai; Sarkar, Sumit
2014-09-01
Regression techniques can be used not only for legitimate data analysis, but also to infer private information about individuals. In this paper, we demonstrate that regression trees, a popular data-analysis and data-mining technique, can be used to effectively reveal individuals' sensitive data. This problem, which we call a "regression attack," has not been addressed in the data privacy literature, and existing privacy-preserving techniques are not appropriate in coping with this problem. We propose a new approach to counter regression attacks. To protect against privacy disclosure, our approach introduces a novel measure, called digression , which assesses the sensitive value disclosure risk in the process of building a regression tree model. Specifically, we develop an algorithm that uses the measure for pruning the tree to limit disclosure of sensitive data. We also propose a dynamic value-concatenation method for anonymizing data, which better preserves data utility than a user-defined generalization scheme commonly used in existing approaches. Our approach can be used for anonymizing both numeric and categorical data. An experimental study is conducted using real-world financial, economic and healthcare data. The results of the experiments demonstrate that the proposed approach is very effective in protecting data privacy while preserving data quality for research and analysis.
C.D. Nelson; Thomas L. Kubisiak; M. Stine; W.L. Nance
1994-01-01
Eight megagametophyte DNA samples from a single longleaf pine (Pinus palustris Mill.) tree were used to screen 576 oligonucleotide primers for random amplified polymorphic DNA (RAPD) fragments. Primers amplifying repeatable polymorphic fragments were further characterized within a sample of 72 megagametophytes from the same tree. Fragments...
DOE Office of Scientific and Technical Information (OSTI.GOV)
Menzel, M.A.; Carter, T.C.; Ford, W.M.
Radiotraction of six eastern red bats, six seminole bats and twenty-four evening bats to 55, 61, and 65 day roosts during 1996 to 1997 in the Upper Coastal Plain of South Carolina. For each species, testing was done for differences between used roost trees and randomly located trees. Also tested for differences between habitat characteristics surrounding roost trees and randomly located trees. Eastern Red and Seminole bats generally roosted in canopies of hardwood and pine while clinging to foilage and small branches. Evening bats roosted in cavities or under exfoliating bark in pines and dead snags. Forest management strategies namedmore » within the study should be beneficial for providing roosts in the Upper Coastal Plain of South Carolina.« less
Indicators of Terrorism Vulnerability in Africa
2015-03-26
the terror threat and vulnerabilities across Africa. Key words: Terrorism, Africa, Negative Binomial Regression, Classification Tree iv I would like...31 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 Log -likelihood...70 viii Page 5.3 Classification Tree Description
NASA Astrophysics Data System (ADS)
Di, Nur Faraidah Muhammad; Satari, Siti Zanariah
2017-05-01
Outlier detection in linear data sets has been done vigorously but only a small amount of work has been done for outlier detection in circular data. In this study, we proposed multiple outliers detection in circular regression models based on the clustering algorithm. Clustering technique basically utilizes distance measure to define distance between various data points. Here, we introduce the similarity distance based on Euclidean distance for circular model and obtain a cluster tree using the single linkage clustering algorithm. Then, a stopping rule for the cluster tree based on the mean direction and circular standard deviation of the tree height is proposed. We classify the cluster group that exceeds the stopping rule as potential outlier. Our aim is to demonstrate the effectiveness of proposed algorithms with the similarity distances in detecting the outliers. It is found that the proposed methods are performed well and applicable for circular regression model.
Kadiyala, Akhil; Kaur, Devinder; Kumar, Ashok
2013-02-01
The present study developed a novel approach to modeling indoor air quality (IAQ) of a public transportation bus by the development of hybrid genetic-algorithm-based neural networks (also known as evolutionary neural networks) with input variables optimized from using the regression trees, referred as the GART approach. This study validated the applicability of the GART modeling approach in solving complex nonlinear systems by accurately predicting the monitored contaminants of carbon dioxide (CO2), carbon monoxide (CO), nitric oxide (NO), sulfur dioxide (SO2), 0.3-0.4 microm sized particle numbers, 0.4-0.5 microm sized particle numbers, particulate matter (PM) concentrations less than 1.0 microm (PM10), and PM concentrations less than 2.5 microm (PM2.5) inside a public transportation bus operating on 20% grade biodiesel in Toledo, OH. First, the important variables affecting each monitored in-bus contaminant were determined using regression trees. Second, the analysis of variance was used as a complimentary sensitivity analysis to the regression tree results to determine a subset of statistically significant variables affecting each monitored in-bus contaminant. Finally, the identified subsets of statistically significant variables were used as inputs to develop three artificial neural network (ANN) models. The models developed were regression tree-based back-propagation network (BPN-RT), regression tree-based radial basis function network (RBFN-RT), and GART models. Performance measures were used to validate the predictive capacity of the developed IAQ models. The results from this approach were compared with the results obtained from using a theoretical approach and a generalized practicable approach to modeling IAQ that included the consideration of additional independent variables when developing the aforementioned ANN models. The hybrid GART models were able to capture majority of the variance in the monitored in-bus contaminants. The genetic-algorithm-based neural network IAQ models outperformed the traditional ANN methods of the back-propagation and the radial basis function networks. The novelty of this research is the development of a novel approach to modeling vehicular indoor air quality by integration of the advanced methods of genetic algorithms, regression trees, and the analysis of variance for the monitored in-vehicle gaseous and particulate matter contaminants, and comparing the results obtained from using the developed approach with conventional artificial intelligence techniques of back propagation networks and radial basis function networks. This study validated the newly developed approach using holdout and threefold cross-validation methods. These results are of great interest to scientists, researchers, and the public in understanding the various aspects of modeling an indoor microenvironment. This methodology can easily be extended to other fields of study also.
Statistical Methods in Ai: Rare Event Learning Using Associative Rules and Higher-Order Statistics
NASA Astrophysics Data System (ADS)
Iyer, V.; Shetty, S.; Iyengar, S. S.
2015-07-01
Rare event learning has not been actively researched since lately due to the unavailability of algorithms which deal with big samples. The research addresses spatio-temporal streams from multi-resolution sensors to find actionable items from a perspective of real-time algorithms. This computing framework is independent of the number of input samples, application domain, labelled or label-less streams. A sampling overlap algorithm such as Brooks-Iyengar is used for dealing with noisy sensor streams. We extend the existing noise pre-processing algorithms using Data-Cleaning trees. Pre-processing using ensemble of trees using bagging and multi-target regression showed robustness to random noise and missing data. As spatio-temporal streams are highly statistically correlated, we prove that a temporal window based sampling from sensor data streams converges after n samples using Hoeffding bounds. Which can be used for fast prediction of new samples in real-time. The Data-cleaning tree model uses a nonparametric node splitting technique, which can be learned in an iterative way which scales linearly in memory consumption for any size input stream. The improved task based ensemble extraction is compared with non-linear computation models using various SVM kernels for speed and accuracy. We show using empirical datasets the explicit rule learning computation is linear in time and is only dependent on the number of leafs present in the tree ensemble. The use of unpruned trees (t) in our proposed ensemble always yields minimum number (m) of leafs keeping pre-processing computation to n × t log m compared to N2 for Gram Matrix. We also show that the task based feature induction yields higher Qualify of Data (QoD) in the feature space compared to kernel methods using Gram Matrix.
Developing Models to Forcast Sales of Natural Christmas Trees
Lawrence D. Garrett; Thomas H. Pendleton
1977-01-01
A study of practices for marketing Christmas trees in Winston-Salem, North Carolina, and Denver, Colorado, revealed that such factors as retail lot competition, tree price, consumer traffic, and consumer income were very important in determining a particular retailer's sales. Analyses of 4 years of market data were used in developing regression models for...
Comprehensive database of diameter-based biomass regressions for North American tree species
Jennifer C. Jenkins; David C. Chojnacky; Linda S. Heath; Richard A. Birdsey
2004-01-01
A database consisting of 2,640 equations compiled from the literature for predicting the biomass of trees and tree components from diameter measurements of species found in North America. Bibliographic information, geographic locations, diameter limits, diameter and biomass units, equation forms, statistical errors, and coefficients are provided for each equation,...
Seligman, D A; Pullinger, A G
2000-01-01
Confusion about the relationship of occlusion to temporomandibular disorders (TMD) persists. This study attempted to identify occlusal and attrition factors plus age that would characterize asymptomatic normal female subjects. A total of 124 female patients with intracapsular TMD were compared with 47 asymptomatic female controls for associations to 9 occlusal factors, 3 attrition severity measures, and age using classification tree, multiple stepwise logistic regression, and univariate analyses. Models were tested for accuracy (sensitivity and specificity) and total contribution to the variance. The classification tree model had 4 terminal nodes that used only anterior attrition and age. "Normals" were mainly characterized by low attrition levels, whereas patients had higher attrition and tended to be younger. The tree model was only moderately useful (sensitivity 63%, specificity 94%) in predicting normals. The logistic regression model incorporated unilateral posterior crossbite and mediotrusive attrition severity in addition to the 2 factors in the tree, but was slightly less accurate than the tree (sensitivity 51%, specificity 90%). When only occlusal factors were considered in the analysis, normals were additionally characterized by a lack of anterior open bite, smaller overjet, and smaller RCP-ICP slides. The log likelihood accounted for was similar for both the tree (pseudo R(2) = 29.38%; mean deviance = 0.95) and the multiple logistic regression (Cox Snell R(2) = 30.3%, mean deviance = 0.84) models. The occlusal and attrition factors studied were only moderately useful in differentiating normals from TMD patients.
Towards lidar-based mapping of tree age at the Arctic forest tundra ecotone.
NASA Astrophysics Data System (ADS)
Jensen, J.; Maguire, A.; Oelkers, R.; Andreu-Hayles, L.; Boelman, N.; D'Arrigo, R.; Griffin, K. L.; Jennewein, J. S.; Hiers, E.; Meddens, A. J.; Russell, M.; Vierling, L. A.; Eitel, J.
2017-12-01
Climate change may cause spatial shifts in the forest-tundra ecotone (FTE). To improve our ability to study these spatial shifts, information on tree demography along the FTE is needed. The objective of this study was to assess the suitability of lidar derived tree heights as a surrogate for tree age. We calculated individual tree age from 48 tree cores collected at basal height from white spruce (Picea glauca) within the FTE in northern Alaska. Tree height was obtained from terrestrial lidar scans (<1cm spatial resolution). The relationship between age and height was examined using a linear regression model forced through the origin. We found a very strong predictive relationship between tree height and age (R2 = 0.90, RMSE = 19.34 years) for trees that ranged between 14 to 230 years. Separate regression models were also developed for small (height < 3 m) and large trees (height >= 3 m), yielding strong predictive relationships between height and age (R2 = 0.86, RMSE 12.21 years, and R2 = 0.93, RMSE = 25.16 years, respectively). The slope coefficient for small and large tree models (16.83 and 12.98 years/m, respectively) indicate that small trees grow 1.3 times faster than large trees at these FTE study sites. Although a strong, predictive relationship between age and height is uncommon in light-limited forest environments, our findings suggest that the sparseness of trees within the FTE may explain the strong tree height-age relationships found herein. Further analysis of 36 additional tree cores recently collected within the FTE near Inuvik, Canada will be performed. Our preliminary analysis suggests that lidar derived tree height could be a reliable proxy for tree age at the FTE, thereby establishing a new technique for scaling tree structure and demographics across larger portions of this sensitive ecotone.
Housworth, E A; Martins, E P
2001-01-01
Statistical randomization tests in evolutionary biology often require a set of random, computer-generated trees. For example, earlier studies have shown how large numbers of computer-generated trees can be used to conduct phylogenetic comparative analyses even when the phylogeny is uncertain or unknown. These methods were limited, however, in that (in the absence of molecular sequence or other data) they allowed users to assume that no phylogenetic information was available or that all possible trees were known. Intermediate situations where only a taxonomy or other limited phylogenetic information (e.g., polytomies) are available are technically more difficult. The current study describes a procedure for generating random samples of phylogenies while incorporating limited phylogenetic information (e.g., four taxa belong together in a subclade). The procedure can be used to conduct comparative analyses when the phylogeny is only partially resolved or can be used in other randomization tests in which large numbers of possible phylogenies are needed.
Fitting and Calibrating a Multilevel Mixed-Effects Stem Taper Model for Maritime Pine in NW Spain
Arias-Rodil, Manuel; Castedo-Dorado, Fernando; Cámara-Obregón, Asunción; Diéguez-Aranda, Ulises
2015-01-01
Stem taper data are usually hierarchical (several measurements per tree, and several trees per plot), making application of a multilevel mixed-effects modelling approach essential. However, correlation between trees in the same plot/stand has often been ignored in previous studies. Fitting and calibration of a variable-exponent stem taper function were conducted using data from 420 trees felled in even-aged maritime pine (Pinus pinaster Ait.) stands in NW Spain. In the fitting step, the tree level explained much more variability than the plot level, and therefore calibration at plot level was omitted. Several stem heights were evaluated for measurement of the additional diameter needed for calibration at tree level. Calibration with an additional diameter measured at between 40 and 60% of total tree height showed the greatest improvement in volume and diameter predictions. If additional diameter measurement is not available, the fixed-effects model fitted by the ordinary least squares technique should be used. Finally, we also evaluated how the expansion of parameters with random effects affects the stem taper prediction, as we consider this a key question when applying the mixed-effects modelling approach to taper equations. The results showed that correlation between random effects should be taken into account when assessing the influence of random effects in stem taper prediction. PMID:26630156
O’Connor, Christopher D.; Lynch, Ann M.
2016-01-01
A significant concern about Metabolic Scaling Theory (MST) in real forests relates to consistent differences between the values of power law scaling exponents of tree primary size measures used to estimate mass and those predicted by MST. Here we consider why observed scaling exponents for diameter and height relationships deviate from MST predictions across three semi-arid conifer forests in relation to: (1) tree condition and physical form, (2) the level of inter-tree competition (e.g. open vs closed stand structure), (3) increasing tree age, and (4) differences in site productivity. Scaling exponent values derived from non-linear least-squares regression for trees in excellent condition (n = 381) were above the MST prediction at the 95% confidence level, while the exponent for trees in good condition were no different than MST (n = 926). Trees that were in fair or poor condition, characterized as diseased, leaning, or sparsely crowned had exponent values below MST predictions (n = 2,058), as did recently dead standing trees (n = 375). Exponent value of the mean-tree model that disregarded tree condition (n = 3,740) was consistent with other studies that reject MST scaling. Ostensibly, as stand density and competition increase trees exhibited greater morphological plasticity whereby the majority had characteristically fair or poor growth forms. Fitting by least-squares regression biases the mean-tree model scaling exponent toward values that are below MST idealized predictions. For 368 trees from Arizona with known establishment dates, increasing age had no significant impact on expected scaling. We further suggest height to diameter ratios below MST relate to vertical truncation caused by limitation in plant water availability. Even with environmentally imposed height limitation, proportionality between height and diameter scaling exponents were consistent with the predictions of MST. PMID:27391084
Swetnam, Tyson L; O'Connor, Christopher D; Lynch, Ann M
2016-01-01
A significant concern about Metabolic Scaling Theory (MST) in real forests relates to consistent differences between the values of power law scaling exponents of tree primary size measures used to estimate mass and those predicted by MST. Here we consider why observed scaling exponents for diameter and height relationships deviate from MST predictions across three semi-arid conifer forests in relation to: (1) tree condition and physical form, (2) the level of inter-tree competition (e.g. open vs closed stand structure), (3) increasing tree age, and (4) differences in site productivity. Scaling exponent values derived from non-linear least-squares regression for trees in excellent condition (n = 381) were above the MST prediction at the 95% confidence level, while the exponent for trees in good condition were no different than MST (n = 926). Trees that were in fair or poor condition, characterized as diseased, leaning, or sparsely crowned had exponent values below MST predictions (n = 2,058), as did recently dead standing trees (n = 375). Exponent value of the mean-tree model that disregarded tree condition (n = 3,740) was consistent with other studies that reject MST scaling. Ostensibly, as stand density and competition increase trees exhibited greater morphological plasticity whereby the majority had characteristically fair or poor growth forms. Fitting by least-squares regression biases the mean-tree model scaling exponent toward values that are below MST idealized predictions. For 368 trees from Arizona with known establishment dates, increasing age had no significant impact on expected scaling. We further suggest height to diameter ratios below MST relate to vertical truncation caused by limitation in plant water availability. Even with environmentally imposed height limitation, proportionality between height and diameter scaling exponents were consistent with the predictions of MST.
Predictive equations for dimensions and leaf area of coastal Southern California street trees
P.J. Peper; E.G. McPherson; S.M. Mori
2001-01-01
Tree height, crown height, crown width, diameter at breast height (dbh), and leaf area were measured for 16 species of commonly planted street trees in the coastal southern California city of Santa Monica, USA. The randomly sampled trees were planted from 1 to 44 years ago. Using number of years after planting or dbh as explanatory variables, mean values of dbh, tree...
Esmaily, Habibollah; Tayefi, Maryam; Doosti, Hassan; Ghayour-Mobarhan, Majid; Nezami, Hossein; Amirabadizadeh, Alireza
2018-04-24
We aimed to identify the associated risk factors of type 2 diabetes mellitus (T2DM) using data mining approach, decision tree and random forest techniques using the Mashhad Stroke and Heart Atherosclerotic Disorders (MASHAD) Study program. A cross-sectional study. The MASHAD study started in 2010 and will continue until 2020. Two data mining tools, namely decision trees, and random forests, are used for predicting T2DM when some other characteristics are observed on 9528 subjects recruited from MASHAD database. This paper makes a comparison between these two models in terms of accuracy, sensitivity, specificity and the area under ROC curve. The prevalence rate of T2DM was 14% among these subjects. The decision tree model has 64.9% accuracy, 64.5% sensitivity, 66.8% specificity, and area under the ROC curve measuring 68.6%, while the random forest model has 71.1% accuracy, 71.3% sensitivity, 69.9% specificity, and area under the ROC curve measuring 77.3% respectively. The random forest model, when used with demographic, clinical, and anthropometric and biochemical measurements, can provide a simple tool to identify associated risk factors for type 2 diabetes. Such identification can substantially use for managing the health policy to reduce the number of subjects with T2DM .
Disassortativity of random critical branching trees
NASA Astrophysics Data System (ADS)
Kim, J. S.; Kahng, B.; Kim, D.
2009-06-01
Random critical branching trees (CBTs) are generated by the multiplicative branching process, where the branching number is determined stochastically, independent of the degree of their ancestor. Here we show analytically that despite this stochastic independence, there exists the degree-degree correlation (DDC) in the CBT and it is disassortative. Moreover, the skeletons of fractal networks, the maximum spanning trees formed by the edge betweenness centrality, behave similarly to the CBT in the DDC. This analytic solution and observation support the argument that the fractal scaling in complex networks originates from the disassortativity in the DDC.
NASA Astrophysics Data System (ADS)
Ghosh, S. M.; Behera, M. D.
2017-12-01
Forest aboveground biomass (AGB) is an important factor for preparation of global policy making decisions to tackle the impact of climate change. Several previous studies has concluded that remote sensing methods are more suitable for estimating forest biomass on regional scale. Among all available remote sensing data and methods, Synthetic Aperture Radar (SAR) data in combination with decision tree based machine learning algorithms has shown better promise in estimating higher biomass values. There aren't many studies done for biomass estimation of dense Indian tropical forests with high biomass density. In this study aboveground biomass was estimated for two major tree species, Sal (Shorea robusta) and Teak (Tectona grandis), of Katerniaghat Wildlife Sanctuary, a tropical forest situated in northern India. Biomass was estimated by combining C-band SAR data from Sentinel-1A satellite, vegetation indices produced using Sentinel-2A data and ground inventory plots. Along with SAR backscatter value, SAR texture images were also used as input as earlier studies had found that image texture has a correlation with vegetation biomass. Decision tree based nonlinear machine learning algorithms were used in place of parametric regression models for establishing relationship between fields measured values and remotely sensed parameters. Using random forest model with a combination of vegetation indices with SAR backscatter as predictor variables shows best result for Sal forest, with a coefficient of determination value of 0.71 and a RMSE value of 105.027 t/ha. In teak forest also best result can be found in the same combination but for stochastic gradient boosted model with a coefficient of determination value of 0.6 and a RMSE value of 79.45 t/ha. These results are mostly better than the results of other studies done for similar kind of forests. This study shows that Sentinel series satellite data has exceptional capabilities in estimating dense forest AGB and machine learning algorithms are better means to do so than parametric regression models.
Sabel, Michael S; Rice, John D; Griffith, Kent A; Lowe, Lori; Wong, Sandra L; Chang, Alfred E; Johnson, Timothy M; Taylor, Jeremy M G
2012-01-01
To identify melanoma patients at sufficiently low risk of nodal metastases who could avoid sentinel lymph node biopsy (SLNB), several statistical models have been proposed based upon patient/tumor characteristics, including logistic regression, classification trees, random forests, and support vector machines. We sought to validate recently published models meant to predict sentinel node status. We queried our comprehensive, prospectively collected melanoma database for consecutive melanoma patients undergoing SLNB. Prediction values were estimated based upon four published models, calculating the same reported metrics: negative predictive value (NPV), rate of negative predictions (RNP), and false-negative rate (FNR). Logistic regression performed comparably with our data when considering NPV (89.4 versus 93.6%); however, the model's specificity was not high enough to significantly reduce the rate of biopsies (SLN reduction rate of 2.9%). When applied to our data, the classification tree produced NPV and reduction in biopsy rates that were lower (87.7 versus 94.1 and 29.8 versus 14.3, respectively). Two published models could not be applied to our data due to model complexity and the use of proprietary software. Published models meant to reduce the SLNB rate among patients with melanoma either underperformed when applied to our larger dataset, or could not be validated. Differences in selection criteria and histopathologic interpretation likely resulted in underperformance. Statistical predictive models must be developed in a clinically applicable manner to allow for both validation and ultimately clinical utility.
Sabel, Michael S.; Rice, John D.; Griffith, Kent A.; Lowe, Lori; Wong, Sandra L.; Chang, Alfred E.; Johnson, Timothy M.; Taylor, Jeremy M.G.
2013-01-01
Introduction To identify melanoma patients at sufficiently low risk of nodal metastases who could avoid SLN biopsy (SLNB). Several statistical models have been proposed based upon patient/tumor characteristics, including logistic regression, classification trees, random forests and support vector machines. We sought to validate recently published models meant to predict sentinel node status. Methods We queried our comprehensive, prospectively-collected melanoma database for consecutive melanoma patients undergoing SLNB. Prediction values were estimated based upon 4 published models, calculating the same reported metrics: negative predictive value (NPV), rate of negative predictions (RNP), and false negative rate (FNR). Results Logistic regression performed comparably with our data when considering NPV (89.4% vs. 93.6%); however the model’s specificity was not high enough to significantly reduce the rate of biopsies (SLN reduction rate of 2.9%). When applied to our data, the classification tree produced NPV and reduction in biopsies rates that were lower 87.7% vs. 94.1% and 29.8% vs. 14.3%, respectively. Two published models could not be applied to our data due to model complexity and the use of proprietary software. Conclusions Published models meant to reduce the SLNB rate among patients with melanoma either underperformed when applied to our larger dataset, or could not be validated. Differences in selection criteria and histopathologic interpretation likely resulted in underperformance. Development of statistical predictive models must be created in a clinically applicable manner to allow for both validation and ultimately clinical utility. PMID:21822550
NASA Astrophysics Data System (ADS)
Drzewiecki, Wojciech
2017-12-01
We evaluated the performance of nine machine learning regression algorithms and their ensembles for sub-pixel estimation of impervious areas coverages from Landsat imagery. The accuracy of imperviousness mapping in individual time points was assessed based on RMSE, MAE and R2. These measures were also used for the assessment of imperviousness change intensity estimations. The applicability for detection of relevant changes in impervious areas coverages at sub-pixel level was evaluated using overall accuracy, F-measure and ROC Area Under Curve. The results proved that Cubist algorithm may be advised for Landsat-based mapping of imperviousness for single dates. Stochastic gradient boosting of regression trees (GBM) may be also considered for this purpose. However, Random Forest algorithm is endorsed for both imperviousness change detection and mapping of its intensity. In all applications the heterogeneous model ensembles performed at least as well as the best individual models or better. They may be recommended for improving the quality of sub-pixel imperviousness and imperviousness change mapping. The study revealed also limitations of the investigated methodology for detection of subtle changes of imperviousness inside the pixel. None of the tested approaches was able to reliably classify changed and non-changed pixels if the relevant change threshold was set as one or three percent. Also for fi ve percent change threshold most of algorithms did not ensure that the accuracy of change map is higher than the accuracy of random classifi er. For the threshold of relevant change set as ten percent all approaches performed satisfactory.
Red Cedar Invasion Along the Missouri River, South Dakota: Cause and Consequence
NASA Astrophysics Data System (ADS)
Greene, S.; Knox, J. C.
2012-12-01
This research evaluates drivers of and ecosystem response to red cedar (Juniperus virginiana) invasion of riparian surfaces downstream of Gavin's Point Dam on the Missouri River. Gavin's Point Dam changed the downstream geomorphology and hydrology of the river and its floodplain by reducing scouring floods and flood-deposited sediment. The native cottonwood species (Populus deltoides) favors cleared surfaces with little to no competitors to establish. Now that there are infrequent erosive floods along the riparian surfaces to remove competitor seeds and seedlings, other vegetation is able to establish. Red cedar is invading the understory of established cottonwood stands and post-dam riparian surfaces. To assess reasons and spatial patterns for the recent invasion of red cedar, a stratified random sampling of soil, tree density and frequency by species, and tree age of 14 forest stands was undertaken along 59 river kilometers of riparian habitat. Soil particle size was determined using laser diffraction and tree ages were estimated from ring counts of tree cores. As an indicator of ecosystem response to invasion, we measured organic matter content in soil collected beneath red cedar and cottonwood trees at three different depths. Of 565 red cedars, only two trees were established before the dam was built. We applied a multiple regression model of red cedar density as a function of cottonwood density and percent sand (63-1000 microns in diameter) in StatPlus© statistical software. Cottonwood density and percent sand are strongly correlated with invasion of red cedar along various riparian surfaces (n = 59, R2 = 0.42, p-values < 0.05). No significant differences exist between organic matter content of soil beneath red cedar and cottonwood trees (p-value > 0.05 for all depths). These findings suggest that the dam's minimization of downstream high-stage flows opened up new habitat for red cedar to establish. Fluvial geomorphic surfaces reflect soil type and cottonwood density and, in turn, predict susceptibility of a surface to red cedar invasion. Nonetheless, soils underlying red cedar and cottonwood trees are functionally similar with regard to soil organic matter content.
Unravelling the limits to tree height: a major role for water and nutrient trade-offs.
Cramer, Michael D
2012-05-01
Competition for light has driven forest trees to grow exceedingly tall, but the lack of a single universal limit to tree height indicates multiple interacting environmental limitations. Because soil nutrient availability is determined by both nutrient concentrations and soil water, water and nutrient availabilities may interact in determining realised nutrient availability and consequently tree height. In SW Australia, which is characterised by nutrient impoverished soils that support some of the world's tallest forests, total [P] and water availability were independently correlated with tree height (r = 0.42 and 0.39, respectively). However, interactions between water availability and each of total [P], pH and [Mg] contributed to a multiple linear regression model of tree height (r = 0.72). A boosted regression tree model showed that maximum tree height was correlated with water availability (24%), followed by soil properties including total P (11%), Mg (10%) and total N (9%), amongst others, and that there was an interaction between water availability and total [P] in determining maximum tree height. These interactions indicated a trade-off between water and P availability in determining maximum tree height in SW Australia. This is enabled by a species assemblage capable of growing tall and surviving (some) disturbances. The mechanism for this trade-off is suggested to be through water enabling mass-flow and diffusive mobility of P, particularly of relatively mobile organic P, although water interactions with microbial activity could also play a role.
Neither fixed nor random: weighted least squares meta-regression.
Stanley, T D; Doucouliagos, Hristos
2017-03-01
Our study revisits and challenges two core conventional meta-regression estimators: the prevalent use of 'mixed-effects' or random-effects meta-regression analysis and the correction of standard errors that defines fixed-effects meta-regression analysis (FE-MRA). We show how and explain why an unrestricted weighted least squares MRA (WLS-MRA) estimator is superior to conventional random-effects (or mixed-effects) meta-regression when there is publication (or small-sample) bias that is as good as FE-MRA in all cases and better than fixed effects in most practical applications. Simulations and statistical theory show that WLS-MRA provides satisfactory estimates of meta-regression coefficients that are practically equivalent to mixed effects or random effects when there is no publication bias. When there is publication selection bias, WLS-MRA always has smaller bias than mixed effects or random effects. In practical applications, an unrestricted WLS meta-regression is likely to give practically equivalent or superior estimates to fixed-effects, random-effects, and mixed-effects meta-regression approaches. However, random-effects meta-regression remains viable and perhaps somewhat preferable if selection for statistical significance (publication bias) can be ruled out and when random, additive normal heterogeneity is known to directly affect the 'true' regression coefficient. Copyright © 2016 John Wiley & Sons, Ltd. Copyright © 2016 John Wiley & Sons, Ltd.
Multivariate regression model for partitioning tree volume of white oak into round-product classes
Daniel A. Yaussy; David L. Sonderman
1984-01-01
Describes the development of multivariate equations that predict the expected cubic volume of four round-product classes from independent variables composed of individual tree-quality characteristics. Although the model has limited application at this time, it does demonstrate the feasibility of partitioning total tree cubic volume into round-product classes based on...
Kathleen L. Kavanaugh; Matthew B. Dickinson; Anthony S. Bova
2010-01-01
Current operational methods for predicting tree mortality from fire injury are regression-based models that only indirectly consider underlying causes and, thus, have limited generality. A better understanding of the physiological consequences of tree heating and injury are needed to develop biophysical process models that can make predictions under changing or novel...
Height-age relationships for regeneration-size trees in the northern Rocky Mountains, USA
Dennis E. Ferguson; Clinton E. Carlson
2010-01-01
Regression equations were developed to predict heights of 10 conifer species inregenerating stands in central and northern Idaho, western Montana, and eastern Washington. Most sample trees were natural regeneration that became established after conventional harvest and site preparation methods. Heights are predicted as a function of tree age, residual overstory density...
Louis R. Iverson; Anantha M. Prasad; Anantha M. Prasad
2002-01-01
Global climate change could have profound effects on the Earth's biota, including large redistributions of tree species and forest types. We used DISTRIB, a deterministic regression tree analysis model, to examine environmental drivers related to current forest-species distributions and then model potential suitable habitat under five climate change scenarios...
Viguiliouk, Effie; Kendall, Cyril W C; Blanco Mejia, Sonia; Cozma, Adrian I; Ha, Vanessa; Mirrahimi, Arash; Jayalath, Viranda H; Augustin, Livia S A; Chiavaroli, Laura; Leiter, Lawrence A; de Souza, Russell J; Jenkins, David J A; Sievenpiper, John L
2014-01-01
Tree nut consumption has been associated with reduced diabetes risk, however, results from randomized trials on glycemic control have been inconsistent. To provide better evidence for diabetes guidelines development, we conducted a systematic review and meta-analysis of randomized controlled trials to assess the effects of tree nuts on markers of glycemic control in individuals with diabetes. MEDLINE, EMBASE, CINAHL, and Cochrane databases through 6 April 2014. Randomized controlled trials ≥3 weeks conducted in individuals with diabetes that compare the effect of diets emphasizing tree nuts to isocaloric diets without tree nuts on HbA1c, fasting glucose, fasting insulin, and HOMA-IR. Two independent reviewer's extracted relevant data and assessed study quality and risk of bias. Data were pooled by the generic inverse variance method and expressed as mean differences (MD) with 95% CI's. Heterogeneity was assessed (Cochran Q-statistic) and quantified (I2). Twelve trials (n = 450) were included. Diets emphasizing tree nuts at a median dose of 56 g/d significantly lowered HbA1c (MD = -0.07% [95% CI:-0.10, -0.03%]; P = 0.0003) and fasting glucose (MD = -0.15 mmol/L [95% CI: -0.27, -0.02 mmol/L]; P = 0.03) compared with control diets. No significant treatment effects were observed for fasting insulin and HOMA-IR, however the direction of effect favoured tree nuts. Majority of trials were of short duration and poor quality. Pooled analyses show that tree nuts improve glycemic control in individuals with type 2 diabetes, supporting their inclusion in a healthy diet. Owing to the uncertainties in our analyses there is a need for longer, higher quality trials with a focus on using nuts to displace high-glycemic index carbohydrates. ClinicalTrials.gov NCT01630980.
Viguiliouk, Effie; Kendall, Cyril W. C.; Blanco Mejia, Sonia; Cozma, Adrian I.; Ha, Vanessa; Mirrahimi, Arash; Jayalath, Viranda H.; Augustin, Livia S. A.; Chiavaroli, Laura; Leiter, Lawrence A.; de Souza, Russell J.; Jenkins, David J. A.; Sievenpiper, John L.
2014-01-01
Background Tree nut consumption has been associated with reduced diabetes risk, however, results from randomized trials on glycemic control have been inconsistent. Objective To provide better evidence for diabetes guidelines development, we conducted a systematic review and meta-analysis of randomized controlled trials to assess the effects of tree nuts on markers of glycemic control in individuals with diabetes. Data Sources MEDLINE, EMBASE, CINAHL, and Cochrane databases through 6 April 2014. Study Selection Randomized controlled trials ≥3 weeks conducted in individuals with diabetes that compare the effect of diets emphasizing tree nuts to isocaloric diets without tree nuts on HbA1c, fasting glucose, fasting insulin, and HOMA-IR. Data Extraction and Synthesis Two independent reviewer’s extracted relevant data and assessed study quality and risk of bias. Data were pooled by the generic inverse variance method and expressed as mean differences (MD) with 95% CI’s. Heterogeneity was assessed (Cochran Q-statistic) and quantified (I2). Results Twelve trials (n = 450) were included. Diets emphasizing tree nuts at a median dose of 56 g/d significantly lowered HbA1c (MD = −0.07% [95% CI:−0.10, −0.03%]; P = 0.0003) and fasting glucose (MD = −0.15 mmol/L [95% CI: −0.27, −0.02 mmol/L]; P = 0.03) compared with control diets. No significant treatment effects were observed for fasting insulin and HOMA-IR, however the direction of effect favoured tree nuts. Limitations Majority of trials were of short duration and poor quality. Conclusions Pooled analyses show that tree nuts improve glycemic control in individuals with type 2 diabetes, supporting their inclusion in a healthy diet. Owing to the uncertainties in our analyses there is a need for longer, higher quality trials with a focus on using nuts to displace high-glycemic index carbohydrates. Trial Registration ClinicalTrials.gov NCT01630980 PMID:25076495
Modelling the ecological consequences of whole tree harvest for bioenergy production
NASA Astrophysics Data System (ADS)
Skår, Silje; Lange, Holger; Sogn, Trine
2013-04-01
There is an increasing demand for energy from biomass as a substitute to fossil fuels worldwide, and the Norwegian government plans to double the production of bioenergy to 9% of the national energy production or to 28 TWh per year by 2020. A large part of this increase may come from forests, which have a great potential with respect to biomass supply as forest growth increasingly has exceeded harvest in the last decades. One feasible option is the utilization of forest residues (needles, twigs and branches) in addition to stems, known as Whole Tree Harvest (WTH). As opposed to WTH, the residues are traditionally left in the forest with Conventional Timber Harvesting (CH). However, the residues contain a large share of the treés nutrients, indicating that WTH may possibly alter the supply of nutrients and organic matter to the soil and the forest ecosystem. This may potentially lead to reduced tree growth. Other implications can be nutrient imbalance, loss of carbon from the soil and changes in species composition and diversity. This study aims to identify key factors and appropriate strategies for ecologically sustainable WTH in Norway spruce (Picea abies) and Scots pine (Pinus sylvestris) forest stands in Norway. We focus on identifying key factors driving soil organic matter, nutrients, biomass, biodiversity etc. Simulations of the effect on the carbon and nitrogen budget with the two harvesting methods will also be conducted. Data from field trials and long-term manipulation experiments are used to obtain a first overview of key variables. The relationships between the variables are hitherto unknown, but it is by no means obvious that they could be assumed as linear; thus, an ordinary multiple linear regression approach is expected to be insufficient. Here we apply two advanced and highly flexible modelling frameworks which hardly have been used in the context of tree growth, nutrient balances and biomass removal so far: Generalized Additive Models (GAMs) and Random Forests. Results obtained for GAMs so far show that there are differences between WTH and CH in two directions: both the significance of drivers and the shape of the response functions differ. GAMs turn out to be a flexible and powerful alternative to multivariate linear regression. The restriction to linear relationships seems to be unjustified in the present case. We use Random Forests as a highly efficient classifier which gives reliable estimates for the importance of each driver variable in determining the diameter growth for the two different harvesting treatments. Based on the final results of these two modelling approaches, the study contributes to find appropriate strategies and suitable regions (in Norway) where WTH may be sustainable performed.
NASA Astrophysics Data System (ADS)
Künne, A.; Fink, M.; Kipka, H.; Krause, P.; Flügel, W.-A.
2012-06-01
In this paper, a method is presented to estimate excess nitrogen on large scales considering single field processes. The approach was implemented by using the physically based model J2000-S to simulate the nitrogen balance as well as the hydrological dynamics within meso-scale test catchments. The model input data, the parameterization, the results and a detailed system understanding were used to generate the regression tree models with GUIDE (Loh, 2002). For each landscape type in the federal state of Thuringia a regression tree was calibrated and validated using the model data and results of excess nitrogen from the test catchments. Hydrological parameters such as precipitation and evapotranspiration were also used to predict excess nitrogen by the regression tree model. Hence they had to be calculated and regionalized as well for the state of Thuringia. Here the model J2000g was used to simulate the water balance on the macro scale. With the regression trees the excess nitrogen was regionalized for each landscape type of Thuringia. The approach allows calculating the potential nitrogen input into the streams of the drainage area. The results show that the applied methodology was able to transfer the detailed model results of the meso-scale catchments to the entire state of Thuringia by low computing time without losing the detailed knowledge from the nitrogen transport modeling. This was validated with modeling results from Fink (2004) in a catchment lying in the regionalization area. The regionalized and modeled excess nitrogen correspond with 94%. The study was conducted within the framework of a project in collaboration with the Thuringian Environmental Ministry, whose overall aim was to assess the effect of agro-environmental measures regarding load reduction in the water bodies of Thuringia to fulfill the requirements of the European Water Framework Directive (Bäse et al., 2007; Fink, 2006; Fink et al., 2007).
Predicting Diameter at Breast Height from Stump Diameters for Northeastern Tree Species
Eric H. Wharton; Eric H. Wharton
1984-01-01
Presents equations to predict diameter at breast height from stump diameter measurements for 17 northeastern tree species. Simple linear regression was used to develop the equations. Application of the equations is discussed.
Summer roosting by adult male seminole bats in the Ouachita Mountains, Arkansas
Roger W. Perry; Ronald E. Thill
2007-01-01
We used radiotelemetry to locate 51 diurnal roosts for 17 male Seminole bats (Lasiurus seminolus) during late spring and early summer, 2000â2005. We quantified characteristics of roost trees and sites surrounding roosts and compared those measurements with random trees and random locations. All but two roosts were located in the foliage of large...
Urban tree cover change in Detroit and Atlanta, USA, 1951-2010
Krista Merry; Jacek Siry; Pete Bettinger; J.M. Bowker
2014-01-01
We assessed tree cover using random points and polygons distributed within the administrative boundaries of Detroit, MI and Atlanta, GA. Two approaches were tested, a point-based approach using 1000 randomly located sample points, and polygon-based approach using 250 circular areas, 200 m in radius (12.56 ha). In the case of Atlanta, both approaches arrived at similar...
Microhabitat selection by three common bird species of montane farmlands in Northern Greece.
Tsiakiris, Rigas; Stara, Kalliopi; Pantis, John; Sgardelis, Stefanos
2009-11-01
Common farmland birds are declining throughout Europe; however, marginal farmlands that escaped intensification or land abandonment remain a haven for farmland species in some Mediterranean mountains. The purpose of this study is to identify the most important anthropogenic microhabitat characteristics for Red-Backed Shrike (Lanius collurio), Corn Bunting (Miliaria calandra) and Common Whitethroat (Sylvia communis) in three such areas within the newly established Northern Pindos National Park. We compare land use structural and physiognomic characteristics of the habitat within 133 plots containing birds paired with randomly selected "non-bird" plots. Using logistic regression and classification-tree models we identify the specific habitat requirements for each of the three birds. The three species show a preference for agricultural mosaics dominated by rangelands with scattered shrub or short trees mixed with arable land. Areas with dikes and dirt roads are preferred by all three species, while the presence of fences and periodically burned bushes and hedges are of particular importance for Red-Backed Shrike. Across the gradient of vegetation density and height, M. calandra is mostly found in grasslands with few dwarf shrubs and short trees, S. communis in places with more dense and tall vegetation of shrub, trees and hedges, and L. collurio, being a typical bird of ecotones, occurs in both habitats and in intermediate situations. In all cases those requirements are associated with habitat features maintained either directly or indirectly by the traditional agricultural activities in the area and particularly by the long established extensive controlled grazing that prevent shrub expansion.
Microhabitat Selection by Three Common Bird Species of Montane Farmlands in Northern Greece
NASA Astrophysics Data System (ADS)
Tsiakiris, Rigas; Stara, Kalliopi; Pantis, John; Sgardelis, Stefanos
2009-11-01
Common farmland birds are declining throughout Europe; however, marginal farmlands that escaped intensification or land abandonment remain a haven for farmland species in some Mediterranean mountains. The purpose of this study is to identify the most important anthropogenic microhabitat characteristics for Red-Backed Shrike ( Lanius collurio), Corn Bunting ( Miliaria calandra) and Common Whitethroat ( Sylvia communis) in three such areas within the newly established Northern Pindos National Park. We compare land use structural and physiognomic characteristics of the habitat within 133 plots containing birds paired with randomly selected “non-bird” plots. Using logistic regression and classification-tree models we identify the specific habitat requirements for each of the three birds. The three species show a preference for agricultural mosaics dominated by rangelands with scattered shrub or short trees mixed with arable land. Areas with dikes and dirt roads are preferred by all three species, while the presence of fences and periodically burned bushes and hedges are of particular importance for Red-Backed Shrike. Across the gradient of vegetation density and height, M. calandra is mostly found in grasslands with few dwarf shrubs and short trees, S. communis in places with more dense and tall vegetation of shrub, trees and hedges, and L. collurio, being a typical bird of ecotones, occurs in both habitats and in intermediate situations. In all cases those requirements are associated with habitat features maintained either directly or indirectly by the traditional agricultural activities in the area and particularly by the long established extensive controlled grazing that prevent shrub expansion.
Guo, Huey-Ming; Shyu, Yea-Ing Lotus; Chang, Her-Kun
2006-01-01
In this article, the authors provide an overview of a research method to predict quality of care in home health nursing data set. The results of this study can be visualized through classification an regression tree (CART) graphs. The analysis was more effective, and the results were more informative since the home health nursing dataset was analyzed with a combination of the logistic regression and CART, these two techniques complete each other. And the results more informative that more patients' characters were related to quality of care in home care. The results contributed to home health nurse predict patient outcome in case management. Improved prediction is needed for interventions to be appropriately targeted for improved patient outcome and quality of care.
Lucini, Filipe R; S Fogliatto, Flavio; C da Silveira, Giovani J; L Neyeloff, Jeruza; Anzanello, Michel J; de S Kuchenbecker, Ricardo; D Schaan, Beatriz
2017-04-01
Emergency department (ED) overcrowding is a serious issue for hospitals. Early information on short-term inward bed demand from patients receiving care at the ED may reduce the overcrowding problem, and optimize the use of hospital resources. In this study, we use text mining methods to process data from early ED patient records using the SOAP framework, and predict future hospitalizations and discharges. We try different approaches for pre-processing of text records and to predict hospitalization. Sets-of-words are obtained via binary representation, term frequency, and term frequency-inverse document frequency. Unigrams, bigrams and trigrams are tested for feature formation. Feature selection is based on χ 2 and F-score metrics. In the prediction module, eight text mining methods are tested: Decision Tree, Random Forest, Extremely Randomized Tree, AdaBoost, Logistic Regression, Multinomial Naïve Bayes, Support Vector Machine (Kernel linear) and Nu-Support Vector Machine (Kernel linear). Prediction performance is evaluated by F1-scores. Precision and Recall values are also informed for all text mining methods tested. Nu-Support Vector Machine was the text mining method with the best overall performance. Its average F1-score in predicting hospitalization was 77.70%, with a standard deviation (SD) of 0.66%. The method could be used to manage daily routines in EDs such as capacity planning and resource allocation. Text mining could provide valuable information and facilitate decision-making by inward bed management teams. Copyright © 2017 Elsevier Ireland Ltd. All rights reserved.
On Tree-Based Phylogenetic Networks.
Zhang, Louxin
2016-07-01
A large class of phylogenetic networks can be obtained from trees by the addition of horizontal edges between the tree edges. These networks are called tree-based networks. We present a simple necessary and sufficient condition for tree-based networks and prove that a universal tree-based network exists for any number of taxa that contains as its base every phylogenetic tree on the same set of taxa. This answers two problems posted by Francis and Steel recently. A byproduct is a computer program for generating random binary phylogenetic networks under the uniform distribution model.
Markov-random-field-based super-resolution mapping for identification of urban trees in VHR images
NASA Astrophysics Data System (ADS)
Ardila, Juan P.; Tolpekin, Valentyn A.; Bijker, Wietske; Stein, Alfred
2011-11-01
Identification of tree crowns from remote sensing requires detailed spectral information and submeter spatial resolution imagery. Traditional pixel-based classification techniques do not fully exploit the spatial and spectral characteristics of remote sensing datasets. We propose a contextual and probabilistic method for detection of tree crowns in urban areas using a Markov random field based super resolution mapping (SRM) approach in very high resolution images. Our method defines an objective energy function in terms of the conditional probabilities of panchromatic and multispectral images and it locally optimizes the labeling of tree crown pixels. Energy and model parameter values are estimated from multiple implementations of SRM in tuning areas and the method is applied in QuickBird images to produce a 0.6 m tree crown map in a city of The Netherlands. The SRM output shows an identification rate of 66% and commission and omission errors in small trees and shrub areas. The method outperforms tree crown identification results obtained with maximum likelihood, support vector machines and SRM at nominal resolution (2.4 m) approaches.
Huang, Li-Shan; Myers, Gary J.; Davidson, Philip W.; Cox, Christopher; Xiao, Fenyuan; Thurston, Sally W.; Cernichiari, Elsa; Shamlaye, Conrad F.; Sloane-Reeves, Jean; Georger, Lesley; Clarkson, Thomas W.
2007-01-01
Studies of the association between prenatal methylmercury exposure from maternal fish consumption during pregnancy and neurodevelopmental test scores in the Seychelles Child Development Study have found no consistent pattern of associations through age nine years. The analyses for the most recent nine-year data examined the population effects of prenatal exposure, but did not address the possibility of non-homogeneous susceptibility. This paper presents a regression tree approach: covariate effects are treated nonlinearly and non-additively and non-homogeneous effects of prenatal methylmercury exposure are permitted among the covariate clusters identified by the regression tree. The approach allows us to address whether children in the lower or higher ends of the developmental spectrum differ in susceptibility to subtle exposure effects. Of twenty-one endpoints available at age nine years, we chose the Weschler Full Scale IQ and its associated covariates to construct the regression tree. The prenatal mercury effect in each of the nine resulting clusters was assessed linearly and non-homogeneously. In addition we reanalyzed five other nine-year endpoints that in the linear analysis has a two-tailed p-value <0.2 for the effect of prenatal exposure. In this analysis, motor proficiency and activity level improved significantly with increasing MeHg for 53% of the children who had an average home environment. Motor proficiency significantly decreased with increasing prenatal MeHg exposure in 7% of the children whose home environment was below average. The regression tree results support previous analyses of outcomes in this cohort. However, this analysis raises the intriguing possibility that an effect may be non-homogeneous among children with different backgrounds and IQ levels. PMID:17942158
Huang, Li-Shan; Myers, Gary J; Davidson, Philip W; Cox, Christopher; Xiao, Fenyuan; Thurston, Sally W; Cernichiari, Elsa; Shamlaye, Conrad F; Sloane-Reeves, Jean; Georger, Lesley; Clarkson, Thomas W
2007-11-01
Studies of the association between prenatal methylmercury exposure from maternal fish consumption during pregnancy and neurodevelopmental test scores in the Seychelles Child Development Study have found no consistent pattern of associations through age 9 years. The analyses for the most recent 9-year data examined the population effects of prenatal exposure, but did not address the possibility of non-homogeneous susceptibility. This paper presents a regression tree approach: covariate effects are treated non-linearly and non-additively and non-homogeneous effects of prenatal methylmercury exposure are permitted among the covariate clusters identified by the regression tree. The approach allows us to address whether children in the lower or higher ends of the developmental spectrum differ in susceptibility to subtle exposure effects. Of 21 endpoints available at age 9 years, we chose the Weschler Full Scale IQ and its associated covariates to construct the regression tree. The prenatal mercury effect in each of the nine resulting clusters was assessed linearly and non-homogeneously. In addition we reanalyzed five other 9-year endpoints that in the linear analysis had a two-tailed p-value <0.2 for the effect of prenatal exposure. In this analysis, motor proficiency and activity level improved significantly with increasing MeHg for 53% of the children who had an average home environment. Motor proficiency significantly decreased with increasing prenatal MeHg exposure in 7% of the children whose home environment was below average. The regression tree results support previous analyses of outcomes in this cohort. However, this analysis raises the intriguing possibility that an effect may be non-homogeneous among children with different backgrounds and IQ levels.
Kim, So-Ra; Kwak, Doo-Ahn; Lee, Woo-Kyun; oLee, Woo-Kyun; Son, Yowhan; Bae, Sang-Won; Kim, Choonsig; Yoo, Seongjin
2010-07-01
The objective of this study was to estimate the carbon storage capacity of Pinus densiflora stands using remotely sensed data by combining digital aerial photography with light detection and ranging (LiDAR) data. A digital canopy model (DCM), generated from the LiDAR data, was combined with aerial photography for segmenting crowns of individual trees. To eliminate errors in over and under-segmentation, the combined image was smoothed using a Gaussian filtering method. The processed image was then segmented into individual trees using a marker-controlled watershed segmentation method. After measuring the crown area from the segmented individual trees, the individual tree diameter at breast height (DBH) was estimated using a regression function developed from the relationship observed between the field-measured DBH and crown area. The above ground biomass of individual trees could be calculated by an image-derived DBH using a regression function developed by the Korea Forest Research Institute. The carbon storage, based on individual trees, was estimated by simple multiplication using the carbon conversion index (0.5), as suggested in guidelines from the Intergovernmental Panel on Climate Change. The mean carbon storage per individual tree was estimated and then compared with the field-measured value. This study suggested that the biomass and carbon storage in a large forest area can be effectively estimated using aerial photographs and LiDAR data.
Studies of the DIII-D disruption database using Machine Learning algorithms
NASA Astrophysics Data System (ADS)
Rea, Cristina; Granetz, Robert; Meneghini, Orso
2017-10-01
A Random Forests Machine Learning algorithm, trained on a large database of both disruptive and non-disruptive DIII-D discharges, predicts disruptive behavior in DIII-D with about 90% of accuracy. Several algorithms have been tested and Random Forests was found superior in performances for this particular task. Over 40 plasma parameters are included in the database, with data for each of the parameters taken from 500k time slices. We focused on a subset of non-dimensional plasma parameters, deemed to be good predictors based on physics considerations. Both binary (disruptive/non-disruptive) and multi-label (label based on the elapsed time before disruption) classification problems are investigated. The Random Forests algorithm provides insight on the available dataset by ranking the relative importance of the input features. It is found that q95 and Greenwald density fraction (n/nG) are the most relevant parameters for discriminating between DIII-D disruptive and non-disruptive discharges. A comparison with the Gradient Boosted Trees algorithm is shown and the first results coming from the application of regression algorithms are presented. Work supported by the US Department of Energy under DE-FC02-04ER54698, DE-SC0014264 and DE-FG02-95ER54309.
Developing Data-driven models for quantifying Cochlodinium polykrikoides in Coastal Waters
NASA Astrophysics Data System (ADS)
Kwon, Yongsung; Jang, Eunna; Im, Jungho; Baek, Seungho; Park, Yongeun; Cho, Kyunghwa
2017-04-01
Harmful algal blooms have been worldwide problems because it leads to serious dangers to human health and aquatic ecosystems. Especially, fish killing red tide blooms by one of dinoflagellate, Cochlodinium polykrikoides (C. polykrikoides), have caused critical damage to mariculture in the Korean coastal waters. In this work, multiple linear regression (MLR), regression tree (RT), and random forest (RF) models were constructed and applied to estimate C. polykrikoides blooms in coastal waters. Five different types of input dataset were carried out to test the performance of three models. To train and validate the three models, observed number of C. polykrikoides cells from National institute of fisheries science (NIFS) and remote sensing reflectance data from Geostationary Ocean Color Imager (GOCI) images for 3 years from 2013 to 2015 were used. The RT model showed the best prediction performance when using 4 bands and 3 band ratios data were used as input data simultaneously. Results obtained from iterative model development with randomly chosen input data indicated that the recognition of patterns in training data caused a variation in prediction performance. This work provided useful tools for reliably estimate the number of C. polykrikoides cells by using reasonable input reflectance dataset in coastal waters. It is expected that the RT model is easily accessed and manipulated by administrators and decision-makers working with coastal waters.
NASA Astrophysics Data System (ADS)
Bourrel, Luc; Brodu, Nicolas; Frappart, Frédéric
2016-04-01
Remotely sensed images allow a frequent monitoring of land cover variations at regional and global scale. Recently launched Sentinel-1 satellite offers a global cover of land areas at an unprecedented spatial (20 m) and temporal (6 days at the Equator). We propose here to compare the performances of commonly used supervised classification techniques (i.e., k-nearest neighbors, linear and Gaussian support vector machines, naive Bayes, linear and quadratic discriminant analyzes, adaptative boosting, loggit regression, ridge regression with one-vs-one voting, random forest, extremely randomized trees) for land cover applications in the Guayas Basin, the largest river basin of the Pacific coast of Ecuator (area ~32,000 km²). The reason of this choice is the importance of this region in Ecuatorian economy as its watershed represents 13% of the total area of Ecuador where 40% of the Ecuadorian population lives. It also corresponds to the most productive region of Ecuador for agriculture and aquaculture. Fifty percents of the country shrimp farming production comes from this watershed, and represents with agriculture the largest source of revenue of the country. Similar comparisons are also performed using ENVISAT ASAR images acquired in global mode (1 km of spatial resolution). Accuracy of the results will be achieved using land cover map derived from multi-spectral images.
Karin L. Riley; Isaac C. Grenfell; Mark A. Finney
2015-01-01
Mapping the number, size, and species of trees in forests across the western United States has utility for a number of research endeavors, ranging from estimation of terrestrial carbon resources to tree mortality following wildfires. For landscape fire and forest simulations that use the Forest Vegetation Simulator (FVS), a tree-level dataset, or âtree listâ, is a...
Louis R. Iverson; Anantha Prasad; Mark W. Schwartz; Mark W. Schwartz
1999-01-01
We are using a deterministic regression tree analysis model (DISTRIB) and a stochastic migration model (SHIFT) to examine potential distributions of ~66 individual species of eastern US trees under a 2 x CO2 climate change scenario. This process is demonstrated for Virginia pine (Pinus virginiana).
Potential Changes in Tree Species Richness and Forest Community Types following Climate Change
Louis R. Iverson; Anantha M. Prasad
2001-01-01
Potential changes in tree species richness and forest community types were evaluated for the eastern United States according to five scenarios of future climate change resulting from a doubling of atmospheric carbon dioxide (CO2). DISTRIB, an empirical model that uses a regression tree analysis approach, was used to generate suitable habitat, or potential future...
Estimating tree crown widths for the primary Acadian species in Maine
Matthew B. Russell; Aaron R. Weiskittel
2012-01-01
In this analysis, data for seven conifer and eight hardwood species were gathered from across the state of Maine for estimating tree crown widths. Maximum and largest crown width equations were developed using tree diameter at breast height as the primary predicting variable. Quantile regression techniques were used to estimate the maximum crown width and a constrained...
Du, Hua Qiang; Sun, Xiao Yan; Han, Ning; Mao, Fang Jie
2017-10-01
By synergistically using the object-based image analysis (OBIA) and the classification and regression tree (CART) methods, the distribution information, the indexes (including diameter at breast, tree height, and crown closure), and the aboveground carbon storage (AGC) of moso bamboo forest in Shanchuan Town, Anji County, Zhejiang Province were investigated. The results showed that the moso bamboo forest could be accurately delineated by integrating the multi-scale ima ge segmentation in OBIA technique and CART, which connected the image objects at various scales, with a pretty good producer's accuracy of 89.1%. The investigation of indexes estimated by regression tree model that was constructed based on the features extracted from the image objects reached normal or better accuracy, in which the crown closure model archived the best estimating accuracy of 67.9%. The estimating accuracy of diameter at breast and tree height was relatively low, which was consistent with conclusion that estimating diameter at breast and tree height using optical remote sensing could not achieve satisfactory results. Estimation of AGC reached relatively high accuracy, and accuracy of the region of high value achieved above 80%.
Zhao, Cai-Yun; Xu, Jing; Liu, Xiao-Yan
2017-01-01
Abstract Globalization increases the opportunities for unintentionally introduced invasive alien species, especially for insects, and most of these species could damage ecosystems and cause economic loss in China. In this study, we analyzed drivers of the distribution of unintentionally introduced invasive alien insects. Based on the number of unintentionally introduced invasive alien insects and their presence/absence records in each province in mainland China, regression trees were built to elucidate the roles of environmental and anthropogenic factors on the number distribution and similarity of species composition of these insects. Classification and regression trees indicated climatic suitability (the mean temperature in January) and human economic activity (sum of total freight) are primary drivers for the number distribution pattern of unintentionally introduced invasive alien insects at provincial scale, while only environmental factors (the mean January temperature, the annual precipitation and the areas of provinces) significantly affect the similarity of them based on the multivariate regression trees. PMID:28973576
Silvis, Alexander; Ford, W. Mark; Britzke, Eric R.
2015-01-01
Bat day-roost selection often is described through comparisons of day-roosts with randomly selected, and assumed unused, trees. Relatively few studies, however, look at patterns of multi-year selection or compare day-roosts used across years. We explored day-roost selection using 2 years of roost selection data for female northern long-eared bats (Myotis septentrionalis) on the Fort Knox Military Reservation, Kentucky, USA. We compared characteristics of randomly selected non-roost trees and day-roosts using a multinomial logistic model and day-roost species selection using chi-squared tests. We found that factors differentiating day-roosts from non-roosts and day-roosts between years varied. Day-roosts differed from non-roosts in the first year of data in all measured factors, but only in size and decay stage in the second year. Between years, day-roosts differed in size and canopy position, but not decay stage. Day-roost species selection was non-random and did not differ between years. Although bats used multiple trees, our results suggest that there were additional unused trees that were suitable as roosts at any time. Day-roost selection pattern descriptions will be inadequate if based only on a single year of data, and inferences of roost selection based only on comparisons of roost to non-roosts should be limited.
Silvis, Alexander; Ford, W. Mark; Britzke, Eric R.
2015-01-01
Bat day-roost selection often is described through comparisons of day-roosts with randomly selected, and assumed unused, trees. Relatively few studies, however, look at patterns of multi-year selection or compare day-roosts used across years. We explored day-roost selection using 2 years of roost selection data for female northern long-eared bats (Myotis septentrionalis) on the Fort Knox Military Reservation, Kentucky, USA. We compared characteristics of randomly selected non-roost trees and day-roosts using a multinomial logistic model and day-roost species selection using chi-squared tests. We found that factors differentiating day-roosts from non-roosts and day-roosts between years varied. Day-roosts differed from non-roosts in the first year of data in all measured factors, but only in size and decay stage in the second year. Between years, day-roosts differed in size and canopy position, but not decay stage. Day-roost species selection was non-random and did not differ between years. Although bats used multiple trees, our results suggest that there were additional unused trees that were suitable as roosts at any time. Day-roost selection pattern descriptions will be inadequate if based only on a single year of data, and inferences of roost selection based only on comparisons of roost to non-roosts should be limited.
The coalescent of a sample from a binary branching process.
Lambert, Amaury
2018-04-25
At time 0, start a time-continuous binary branching process, where particles give birth to a single particle independently (at a possibly time-dependent rate) and die independently (at a possibly time-dependent and age-dependent rate). A particular case is the classical birth-death process. Stop this process at time T>0. It is known that the tree spanned by the N tips alive at time T of the tree thus obtained (called a reduced tree or coalescent tree) is a coalescent point process (CPP), which basically means that the depths of interior nodes are independent and identically distributed (iid). Now select each of the N tips independently with probability y (Bernoulli sample). It is known that the tree generated by the selected tips, which we will call the Bernoulli sampled CPP, is again a CPP. Now instead, select exactly k tips uniformly at random among the N tips (a k-sample). We show that the tree generated by the selected tips is a mixture of Bernoulli sampled CPPs with the same parent CPP, over some explicit distribution of the sampling probability y. An immediate consequence is that the genealogy of a k-sample can be obtained by the realization of k random variables, first the random sampling probability Y and then the k-1 node depths which are iid conditional on Y=y. Copyright © 2018. Published by Elsevier Inc.
Bevilacqua, M; Ciarapica, F E; Giacchetta, G
2008-07-01
This work is an attempt to apply classification tree methods to data regarding accidents in a medium-sized refinery, so as to identify the important relationships between the variables, which can be considered as decision-making rules when adopting any measures for improvement. The results obtained using the CART (Classification And Regression Trees) method proved to be the most precise and, in general, they are encouraging concerning the use of tree diagrams as preliminary explorative techniques for the assessment of the ergonomic, management and operational parameters which influence high accident risk situations. The Occupational Injury analysis carried out in this paper was planned as a dynamic process and can be repeated systematically. The CART technique, which considers a very wide set of objective and predictive variables, shows new cause-effect correlations in occupational safety which had never been previously described, highlighting possible injury risk groups and supporting decision-making in these areas. The use of classification trees must not, however, be seen as an attempt to supplant other techniques, but as a complementary method which can be integrated into traditional types of analysis.
Acid rain, air pollution, and tree growth in southeastern New York
Puckett, L.J.
1982-01-01
Whether dendroecological analyses could be used to detect changes in the relationship of tree growth to climate that might have resulted from chronic exposure to components of the acid rain-air pollution complex was determined. Tree-ring indices of white pine (Pinus strobus L.), eastern hemlock (Tsuga canadensis (L.) Cart.), pitch pine (Pinus rigida Mill.), and chestnut oak (Quercus prinus L.) were regressed against orthogonally transformed values of temperature and precipitation in order to derive a response-function relationship. Results of the regression analyses for three time periods, 1901–1920, 1926–1945, and 1954–1973 suggest that the relationship of tree growth to climate has been altered. Statistical tests of the temperature and precipitation data suggest that this change was nonclimatic. Temporally, the shift in growth response appears to correspond with the suspected increase in acid rain and air pollution in the Shawangunk Mountain area of southeastern New York in the early 1950's. This change could be the result of physiological stress induced by components of the acid rain-air pollution complex, causing climatic conditions to be more limiting to tree growth.
Teng, Ju-Hsi; Lin, Kuan-Chia; Ho, Bin-Shenq
2007-10-01
A community-based aboriginal study was conducted and analysed to explore the application of classification tree and logistic regression. A total of 1066 aboriginal residents in Yilan County were screened during 2003-2004. The independent variables include demographic characteristics, physical examinations, geographic location, health behaviours, dietary habits and family hereditary diseases history. Risk factors of cardiovascular diseases were selected as the dependent variables in further analysis. The completion rate for heath interview is 88.9%. The classification tree results find that if body mass index is higher than 25.72 kg m(-2) and the age is above 51 years, the predicted probability for number of cardiovascular risk factors > or =3 is 73.6% and the population is 322. If body mass index is higher than 26.35 kg m(-2) and geographical latitude of the village is lower than 24 degrees 22.8', the predicted probability for number of cardiovascular risk factors > or =4 is 60.8% and the population is 74. As the logistic regression results indicate that body mass index, drinking habit and menopause are the top three significant independent variables. The classification tree model specifically shows the discrimination paths and interactions between the risk groups. The logistic regression model presents and analyses the statistical independent factors of cardiovascular risks. Applying both models to specific situations will provide a different angle for the design and management of future health intervention plans after community-based study.
Westreich, Daniel; Lessler, Justin; Funk, Michele Jonsson
2010-01-01
Summary Objective Propensity scores for the analysis of observational data are typically estimated using logistic regression. Our objective in this Review was to assess machine learning alternatives to logistic regression which may accomplish the same goals but with fewer assumptions or greater accuracy. Study Design and Setting We identified alternative methods for propensity score estimation and/or classification from the public health, biostatistics, discrete mathematics, and computer science literature, and evaluated these algorithms for applicability to the problem of propensity score estimation, potential advantages over logistic regression, and ease of use. Results We identified four techniques as alternatives to logistic regression: neural networks, support vector machines, decision trees (CART), and meta-classifiers (in particular, boosting). Conclusion While the assumptions of logistic regression are well understood, those assumptions are frequently ignored. All four alternatives have advantages and disadvantages compared with logistic regression. Boosting (meta-classifiers) and to a lesser extent decision trees (particularly CART) appear to be most promising for use in the context of propensity score analysis, but extensive simulation studies are needed to establish their utility in practice. PMID:20630332
Tree species exhibit complex patterns of distribution in bottomland hardwood forests
Luben D Dimov; Jim L Chambers; Brian R. Lockhart
2013-01-01
& Context Understanding tree interactions requires an insight into their spatial distribution. & Aims We looked for presence and extent of tree intraspecific spatial point pattern (random, aggregated, or overdispersed) and interspecific spatial point pattern (independent, aggregated, or segregated). & Methods We established twelve 0.64-ha plots in natural...
Use of DNA markers in forest tree improvement research
D.B. Neale; M.E. Devey; K.D. Jermstad; M.R. Ahuja; M.C. Alosi; K.A. Marshall
1992-01-01
DNA markers are rapidly being developed for forest trees. The most important markers are restriction fragment length polymorphisms (RFLPs), polymerase chain reaction- (PCR) based markers such as random amplified polymorphic DNA (RAPD), and fingerprinting markers. DNA markers can supplement isozyme markers for monitoring tree improvement activities such as; estimating...
Parallel Mitogenome Sequencing Alleviates Random Rooting Effect in Phylogeography.
Hirase, Shotaro; Takeshima, Hirohiko; Nishida, Mutsumi; Iwasaki, Wataru
2016-04-28
Reliably rooted phylogenetic trees play irreplaceable roles in clarifying diversification in the patterns of species and populations. However, such trees are often unavailable in phylogeographic studies, particularly when the focus is on rapidly expanded populations that exhibit star-like trees. A fundamental bottleneck is known as the random rooting effect, where a distant outgroup tends to root an unrooted tree "randomly." We investigated whether parallel mitochondrial genome (mitogenome) sequencing alleviates this effect in phylogeography using a case study on the Sea of Japan lineage of the intertidal goby Chaenogobius annularis Eighty-three C. annularis individuals were collected and their mitogenomes were determined by high-throughput and low-cost parallel sequencing. Phylogenetic analysis of these mitogenome sequences was conducted to root the Sea of Japan lineage, which has a star-like phylogeny and had not been reliably rooted. The topologies of the bootstrap trees were investigated to determine whether the use of mitogenomes alleviated the random rooting effect. The mitogenome data successfully rooted the Sea of Japan lineage by alleviating the effect, which hindered phylogenetic analysis that used specific gene sequences. The reliable rooting of the lineage led to the discovery of a novel, northern lineage that expanded during an interglacial period with high bootstrap support. Furthermore, the finding of this lineage suggested the existence of additional glacial refugia and provided a new recent calibration point that revised the divergence time estimation between the Sea of Japan and Pacific Ocean lineages. This study illustrates the effectiveness of parallel mitogenome sequencing for solving the random rooting problem in phylogeographic studies. © The Author 2016. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
Gretchen G. Moisen; Elizabeth A. Freeman; Jock A. Blackard; Tracey S. Frescino; Niklaus E. Zimmermann; Thomas C. Edwards
2006-01-01
Many efforts are underway to produce broad-scale forest attribute maps by modelling forest class and structure variables collected in forest inventories as functions of satellite-based and biophysical information. Typically, variants of classification and regression trees implemented in Rulequest's© See5 and Cubist (for binary and continuous responses,...
Jose Negron
1997-01-01
Classification trees and linear regression analysis were used to build models to predict probabilities of infestation and amount of tree mortality in terms of basal area resulting from roundheaded pine beetle, Dendroctonus adjunctus Blandford, activity in ponderosa pine, Pinus ponderosa Laws., in the Sacramento Mountains, New Mexico. Classification trees were built for...
An Extension of CART's Pruning Algorithm. Program Statistics Research Technical Report No. 91-11.
ERIC Educational Resources Information Center
Kim, Sung-Ho
Among the computer-based methods used for the construction of trees such as AID, THAID, CART, and FACT, the only one that uses an algorithm that first grows a tree and then prunes the tree is CART. The pruning component of CART is analogous in spirit to the backward elimination approach in regression analysis. This idea provides a tool in…
STX--Fortran-4 program for estimates of tree populations from 3P sample-tree-measurements
L. R. Grosenbaugh
1967-01-01
Describes how to use an improved and greatly expanded version of an earlier computer program (1964) that converts dendrometer measurements of 3P-sample trees to population values in terms of whatever units user desires. Many new options are available, including that of obtaining a product-yield and appraisal report based on regression coefficients supplied by user....
Portable Language-Independent Adaptive Translation from OCR. Phase 1
2009-04-01
including brute-force k-Nearest Neighbors ( kNN ), fast approximate kNN using hashed k-d trees, classification and regression trees, and locality...achieved by refinements in ground-truthing protocols. Recent algorithmic improvements to our approximate kNN classifier using hashed k-D trees allows...recent years discriminative training has been shown to outperform phonetic HMMs estimated using ML for speech recognition. Standard ML estimation
Improving Cluster Analysis with Automatic Variable Selection Based on Trees
2014-12-01
regression trees Daisy DISsimilAritY PAM partitioning around medoids PMA penalized multivariate analysis SPC sparse principal components UPGMA unweighted...unweighted pair-group average method ( UPGMA ). This method measures dissimilarities between all objects in two clusters and takes the average value
Predicting U.S. Army Reserve Unit Manning Using Market Demographics
2015-06-01
develops linear regression , classification tree, and logistic regression models to determine the ability of the location to support manning requirements... logistic regression model delivers predictive results that allow decision-makers to identify locations with a high probability of meeting unit...manning requirements. The recommendation of this thesis is that the USAR implement the logistic regression model. 14. SUBJECT TERMS U.S
CADDIS Volume 4. Data Analysis: Basic Analyses
Use of statistical tests to determine if an observation is outside the normal range of expected values. Details of CART, regression analysis, use of quantile regression analysis, CART in causal analysis, simplifying or pruning resulting trees.
Comparison of texture synthesis methods for content generation in ultrasound simulation for training
NASA Astrophysics Data System (ADS)
Mattausch, Oliver; Ren, Elizabeth; Bajka, Michael; Vanhoey, Kenneth; Goksel, Orcun
2017-03-01
Navigation and interpretation of ultrasound (US) images require substantial expertise, the training of which can be aided by virtual-reality simulators. However, a major challenge in creating plausible simulated US images is the generation of realistic ultrasound speckle. Since typical ultrasound speckle exhibits many properties of Markov Random Fields, it is conceivable to use texture synthesis for generating plausible US appearance. In this work, we investigate popular classes of texture synthesis methods for generating realistic US content. In a user study, we evaluate their performance for reproducing homogeneous tissue regions in B-mode US images from small image samples of similar tissue and report the best-performing synthesis methods. We further show that regression trees can be used on speckle texture features to learn a predictor for US realism.
Batterham, Philip J; Christensen, Helen; Mackinnon, Andrew J
2009-11-22
Relative to physical health conditions such as cardiovascular disease, little is known about risk factors that predict the prevalence of depression. The present study investigates the expected effects of a reduction of these risks over time, using the decision tree method favoured in assessing cardiovascular disease risk. The PATH through Life cohort was used for the study, comprising 2,105 20-24 year olds, 2,323 40-44 year olds and 2,177 60-64 year olds sampled from the community in the Canberra region, Australia. A decision tree methodology was used to predict the presence of major depressive disorder after four years of follow-up. The decision tree was compared with a logistic regression analysis using ROC curves. The decision tree was found to distinguish and delineate a wide range of risk profiles. Previous depressive symptoms were most highly predictive of depression after four years, however, modifiable risk factors such as substance use and employment status played significant roles in assessing the risk of depression. The decision tree was found to have better sensitivity and specificity than a logistic regression using identical predictors. The decision tree method was useful in assessing the risk of major depressive disorder over four years. Application of the model to the development of a predictive tool for tailored interventions is discussed.
A study of Solar-Enso correlation with southern Brazil tree ring index (1955- 1991)
NASA Astrophysics Data System (ADS)
Rigozo, N.; Nordemann, D.; Vieira, L.; Echer, E.
The effects of solar activity and El Niño-Southern Oscillation on tree growth in Southern Brazil were studied by correlation analysis. Trees for this study were native Araucaria (Araucaria Angustifolia)from four locations in Rio Grande do Sul State, in Southern Brazil: Canela (29o18`S, 50o51`W, 790 m asl), Nova Petropolis (29o2`S, 51o10`W, 579 m asl), Sao Francisco de Paula (29o25`S, 50o24`W, 930 m asl) and Sao Martinho da Serra (29o30`S, 53o53`W, 484 m asl). From these four sites, an average tree ring Index for this region was derived, for the period 1955-1991. Linear correlations were made on annual and 10 year running averages of this tree ring Index, of sunspot number Rz and SOI. For annual averages, the correlation coefficients were low, and the multiple regression between tree ring and SOI and Rz indicates that 20% of the variance in tree rings was explained by solar activity and ENSO variability. However, when the 10 year running averages correlations were made, the coefficient correlations were much higher. A clear anticorrelation is observed between SOI and Index (r=-0.81) whereas Rz and Index show a positive correlation (r=0.67). The multiple regression of 10 year running averages indicates that 76% of the variance in tree ring INdex was explained by solar activity and ENSO. These results indicate that the effects of solar activity and ENSO on tree rings are better seen on long timescales.
Berchialla, Paola; Scarinzi, Cecilia; Snidero, Silvia; Gregori, Dario
2016-08-01
Risk Assessment is the systematic study of decisions subject to uncertain consequences. An increasing interest has been focused on modeling techniques like Bayesian Networks since their capability of (1) combining in the probabilistic framework different type of evidence including both expert judgments and objective data; (2) overturning previous beliefs in the light of the new information being received and (3) making predictions even with incomplete data. In this work, we proposed a comparison among Bayesian Networks and other classical Quantitative Risk Assessment techniques such as Neural Networks, Classification Trees, Random Forests and Logistic Regression models. Hybrid approaches, combining both Classification Trees and Bayesian Networks, were also considered. Among Bayesian Networks, a clear distinction between purely data-driven approach and combination of expert knowledge with objective data is made. The aim of this paper consists in evaluating among this models which best can be applied, in the framework of Quantitative Risk Assessment, to assess the safety of children who are exposed to the risk of inhalation/insertion/aspiration of consumer products. The issue of preventing injuries in children is of paramount importance, in particular where product design is involved: quantifying the risk associated to product characteristics can be of great usefulness in addressing the product safety design regulation. Data of the European Registry of Foreign Bodies Injuries formed the starting evidence for risk assessment. Results showed that Bayesian Networks appeared to have both the ease of interpretability and accuracy in making prediction, even if simpler models like logistic regression still performed well. © The Author(s) 2013.
Boligon, A A; Baldi, F; Mercadante, M E Z; Lobo, R B; Pereira, R J; Albuquerque, L G
2011-06-28
We quantified the potential increase in accuracy of expected breeding value for weights of Nelore cattle, from birth to mature age, using multi-trait and random regression models on Legendre polynomials and B-spline functions. A total of 87,712 weight records from 8144 females were used, recorded every three months from birth to mature age from the Nelore Brazil Program. For random regression analyses, all female weight records from birth to eight years of age (data set I) were considered. From this general data set, a subset was created (data set II), which included only nine weight records: at birth, weaning, 365 and 550 days of age, and 2, 3, 4, 5, and 6 years of age. Data set II was analyzed using random regression and multi-trait models. The model of analysis included the contemporary group as fixed effects and age of dam as a linear and quadratic covariable. In the random regression analyses, average growth trends were modeled using a cubic regression on orthogonal polynomials of age. Residual variances were modeled by a step function with five classes. Legendre polynomials of fourth and sixth order were utilized to model the direct genetic and animal permanent environmental effects, respectively, while third-order Legendre polynomials were considered for maternal genetic and maternal permanent environmental effects. Quadratic polynomials were applied to model all random effects in random regression models on B-spline functions. Direct genetic and animal permanent environmental effects were modeled using three segments or five coefficients, and genetic maternal and maternal permanent environmental effects were modeled with one segment or three coefficients in the random regression models on B-spline functions. For both data sets (I and II), animals ranked differently according to expected breeding value obtained by random regression or multi-trait models. With random regression models, the highest gains in accuracy were obtained at ages with a low number of weight records. The results indicate that random regression models provide more accurate expected breeding values than the traditionally finite multi-trait models. Thus, higher genetic responses are expected for beef cattle growth traits by replacing a multi-trait model with random regression models for genetic evaluation. B-spline functions could be applied as an alternative to Legendre polynomials to model covariance functions for weights from birth to mature age.
Pedological memory in forest soil development
Jonathan D. Phillips; Daniel A. Marion
2004-01-01
Individual trees may have significant impacts on soil morphology. If these impacts are non-random such that some microsites are repeatedly preferentially affected by trees, complex local spatial variability of soils would result. A model of self-reinforcing pedologic influences of trees (SRPIT) is proposed to explain patterns of soil variability in the Ouachita...
Influence of tree spatial pattern and sample plot type and size on inventory
John-Pascall Berrill; Kevin L. O' Hara
2012-01-01
Sampling with different plot types and sizes was simulated using tree location maps and data collected in three even-aged coast redwood (Sequoia sempervirens) stands selected to represent uniform, random, and clumped spatial patterns of tree locations. Fixed-radius circular plots, belt transects, and variable-radius plots were installed by...
Blomberg, S
2000-11-01
Currently available programs for the comparative analysis of phylogenetic data do not perform optimally when the phylogeny is not completely specified (i.e. the phylogeny contains polytomies). Recent literature suggests that a better way to analyse the data would be to create random trees from the known phylogeny that are fully-resolved but consistent with the known tree. A computer program is presented, Fels-Rand, that performs such analyses. A randomisation procedure is used to generate trees that are fully resolved but whose structure is consistent with the original tree. Statistics are then calculated on a large number of these randomly-generated trees. Fels-Rand uses the object-oriented features of Xlisp-Stat to manipulate internal tree representations. Xlisp-Stat's dynamic graphing features are used to provide heuristic tools to aid in analysis, particularly outlier analysis. The usefulness of Xlisp-Stat as a system for phylogenetic computation is discussed. Available from the author or at http://www.uq.edu.au/~ansblomb/Fels-Rand.sit.hqx. Xlisp-Stat is available from http://stat.umn.edu/~luke/xls/xlsinfo/xlsinfo.html. s.blomberg@abdn.ac.uk
Local random configuration-tree theory for string repetition and facilitated dynamics of glass
NASA Astrophysics Data System (ADS)
Lam, Chi-Hang
2018-02-01
We derive a microscopic theory of glassy dynamics based on the transport of voids by micro-string motions, each of which involves particles arranged in a line hopping simultaneously displacing one another. Disorder is modeled by a random energy landscape quenched in the configuration space of distinguishable particles, but transient in the physical space as expected for glassy fluids. We study the evolution of local regions with m coupled voids. At a low temperature, energetically accessible local particle configurations can be organized into a random tree with nodes and edges denoting configurations and micro-string propagations respectively. Such trees defined in the configuration space naturally describe systems defined in two- or three-dimensional physical space. A micro-string propagation initiated by a void can facilitate similar motions by other voids via perturbing the random energy landscape, realizing path interactions between voids or equivalently string interactions. We obtain explicit expressions of the particle diffusion coefficient and a particle return probability. Under our approximation, as temperature decreases, random trees of energetically accessible configurations exhibit a sequence of percolation transitions in the configuration space, with local regions containing fewer coupled voids entering the non-percolating immobile phase first. Dynamics is dominated by coupled voids of an optimal group size, which increases as temperature decreases. Comparison with a distinguishable-particle lattice model (DPLM) of glass shows very good quantitative agreements using only two adjustable parameters related to typical energy fluctuations and the interaction range of the micro-strings.
James W. Flewelling
2009-01-01
Remotely sensed data can be used to make digital maps showing individual tree crowns (ITC) for entire forests. Attributes of the ITCs may include area, shape, height, and color. The crown map is sampled in a way that provides an unbiased linkage between ITCs and identifiable trees measured on the ground. Methods of avoiding edge bias are given. In an example from a...
Austin Troy; J. Morgan Grove; Jarlath O' Neill-Dunne
2012-01-01
The extent to which urban tree cover influences crime is in debate in the literature. This research took advantage of geocoded crime point data and high resolution tree canopy data to address this question in Baltimore City and County, MD, an area that includes a significant urban-rural gradient. Using ordinary least squares and spatially adjusted regression and...
Khan, Salah Uddin; Gurley, Emily S.; Hossain, M. Jahangir; Nahar, Nazmun; Sharker, M. A. Yushuf; Luby, Stephen P.
2012-01-01
Background Drinking raw date palm sap is a risk factor for human Nipah virus (NiV) infection. Fruit bats, the natural reservoir of NiV, commonly contaminate raw sap with saliva by licking date palm’s sap producing surface. We evaluated four types of physical barriers that may prevent bats from contacting sap. Methods During 2009, we used a crossover design and randomly selected 20 date palm sap producing trees and observed each tree for 2 nights: one night with a bamboo skirt intervention applied and one night without the intervention. During 2010, we selected 120 trees and randomly assigned four types of interventions to 15 trees each: bamboo, dhoincha (local plant), jute stick and polythene skirts covering the shaved part, sap stream, tap and collection pot. We enrolled the remaining 60 trees as controls. We used motion sensor activated infrared cameras to examine bat contact with sap. Results During 2009 bats contacted date palm sap in 85% of observation nights when no intervention was used compared with 35% of nights when the intervention was used [p<0.001]. Bats were able to contact the sap when the skirt did not entirely cover the sap producing surface. Therefore, in 2010 we requested the sap harvesters to use larger skirts. During 2010 bats contacted date palm sap [2% vs. 83%, p<0.001] less frequently in trees protected with skirts compared to control trees. No bats contacted sap in trees with bamboo (p<0.001 compared to control), dhoincha skirt (p<0.001) or polythene covering (p<0.001), but bats did contact sap during one night (7%) with the jute stick skirt (p<0.001). Conclusion Bamboo, dhoincha, jute stick and polythene skirts covering the sap producing areas of a tree effectively prevented bat-sap contact. Community interventions should promote applying these skirts to prevent occasional Nipah spillovers to human. PMID:22905160
Spatial Assessment of Model Errors from Four Regression Techniques
Lianjun Zhang; Jeffrey H. Gove; Jeffrey H. Gove
2005-01-01
Fomst modelers have attempted to account for the spatial autocorrelations among trees in growth and yield models by applying alternative regression techniques such as linear mixed models (LMM), generalized additive models (GAM), and geographicalIy weighted regression (GWR). However, the model errors are commonly assessed using average errors across the entire study...
Steen, Paul J.; Passino-Reader, Dora R.; Wiley, Michael J.
2006-01-01
As a part of the Great Lakes Regional Aquatic Gap Analysis Project, we evaluated methodologies for modeling associations between fish species and habitat characteristics at a landscape scale. To do this, we created brook trout Salvelinus fontinalis presence and absence models based on four different techniques: multiple linear regression, logistic regression, neural networks, and classification trees. The models were tested in two ways: by application to an independent validation database and cross-validation using the training data, and by visual comparison of statewide distribution maps with historically recorded occurrences from the Michigan Fish Atlas. Although differences in the accuracy of our models were slight, the logistic regression model predicted with the least error, followed by multiple regression, then classification trees, then the neural networks. These models will provide natural resource managers a way to identify habitats requiring protection for the conservation of fish species.
Demographic predictors of peanut, tree nut, fish, shellfish, and sesame allergy in Canada.
Ben-Shoshan, M; Harrington, D W; Soller, L; Fragapane, J; Joseph, L; Pierre, Y St; Godefroy, S B; Elliott, S J; Clarke, A E
2012-01-01
Background. Studies suggest that the rising prevalence of food allergy during recent decades may have stabilized. Although genetics undoubtedly contribute to the emergence of food allergy, it is likely that other factors play a crucial role in mediating such short-term changes. Objective. To identify potential demographic predictors of food allergies. Methods. We performed a cross-Canada, random telephone survey. Criteria for food allergy were self-report of convincing symptoms and/or physician diagnosis of allergy. Multivariate logistic regressions were used to assess potential determinants. Results. Of 10,596 households surveyed in 2008/2009, 3666 responded, representing 9667 individuals. Peanut, tree nut, and sesame allergy were more common in children (odds ratio (OR) 2.24 (95% CI, 1.40, 3.59), 1.73 (95% CI, 1.11, 2.68), and 5.63 (95% CI, 1.39, 22.87), resp.) while fish and shellfish allergy were less common in children (OR 0.17 (95% CI, 0.04, 0.72) and 0.29 (95% CI, 0.14, 0.61)). Tree nut and shellfish allergy were less common in males (OR 0.55 (95% CI, 0.36, 0.83) and 0.63 (95% CI, 0.43, 0.91)). Shellfish allergy was more common in urban settings (OR 1.55 (95% CI, 1.04, 2.31)). There was a trend for most food allergies to be more prevalent in the more educated (tree nut OR 1.90 (95% CI, 1.18, 3.04)) and less prevalent in immigrants (shellfish OR 0.49 (95% CI, 0.26, 0.95)), but wide CIs preclude definitive conclusions for most foods. Conclusions. Our results reveal that in addition to age and sex, place of residence, socioeconomic status, and birth place may influence the development of food allergy.
Davrieux, Fabrice; Allal, François; Piombo, Georges; Kelly, Bokary; Okulo, John B; Thiam, Massamba; Diallo, Ousmane B; Bouvet, Jean-Marc
2010-07-14
The Shea tree (Vitellaria paradoxa) is a major tree species in African agroforestry systems. Butter extracted from its nuts offers an opportunity for sustainable development in Sudanian countries and an attractive potential for the food and cosmetics industries. The purpose of this study was to develop near-infrared spectroscopy (NIRS) calibrations to characterize Shea nut fat profiles. Powders prepared from nuts collected from 624 trees in five African countries (Senegal, Mali, Burkina Faso, Ghana and Uganda) were analyzed for moisture content, fat content using solvent extraction, and fatty acid profiles using gas chromatography. Results confirmed the differences between East and West African Shea nut fat composition: eastern nuts had significantly higher fat and oleic acid contents. Near infrared reflectance spectra were recorded for each sample. Ten percent of the samples were randomly selected for validation and the remaining samples used for calibration. For each constituent, calibration equations were developed using modified partial least squares (MPLS) regression. The equation performances were evaluated using the ratio performance to deviation (RPD(p)) and R(p)(2) parameters, obtained by comparison of the validation set NIR predictions and corresponding laboratory values. Moisture (RPD(p) = 4.45; R(p)(2) = 0.95) and fat (RPD(p) = 5.6; R(p)(2) = 0.97) calibrations enabled accurate determination of these traits. NIR models for stearic (RPD(p) = 6.26; R(p)(2) = 0.98) and oleic (RPD(p) = 7.91; R(p)(2) = 0.99) acids were highly efficient and enabled sharp characterization of these two major Shea butter fatty acids. This study demonstrated the ability of near-infrared spectroscopy for high-throughput phenotyping of Shea nuts.
Evolving optimised decision rules for intrusion detection using particle swarm paradigm
NASA Astrophysics Data System (ADS)
Sivatha Sindhu, Siva S.; Geetha, S.; Kannan, A.
2012-12-01
The aim of this article is to construct a practical intrusion detection system (IDS) that properly analyses the statistics of network traffic pattern and classify them as normal or anomalous class. The objective of this article is to prove that the choice of effective network traffic features and a proficient machine-learning paradigm enhances the detection accuracy of IDS. In this article, a rule-based approach with a family of six decision tree classifiers, namely Decision Stump, C4.5, Naive Baye's Tree, Random Forest, Random Tree and Representative Tree model to perform the detection of anomalous network pattern is introduced. In particular, the proposed swarm optimisation-based approach selects instances that compose training set and optimised decision tree operate over this trained set producing classification rules with improved coverage, classification capability and generalisation ability. Experiment with the Knowledge Discovery and Data mining (KDD) data set which have information on traffic pattern, during normal and intrusive behaviour shows that the proposed algorithm produces optimised decision rules and outperforms other machine-learning algorithm.
Downer, Mary Kathryn; Gea, Alfredo; Stampfer, Meir; Sánchez-Tainta, Ana; Corella, Dolores; Salas-Salvadó, Jordi; Ros, Emilio; Estruch, Ramón; Fitó, Montserrat; Gómez-Gracia, Enrique; Arós, Fernando; Fiol, Miquel; De-la-Corte, Francisco Jose Garcia; Serra-Majem, Lluís; Pinto, Xavier; Basora, Josep; Sorlí, José V; Vinyoles, Ernest; Zazpe, Itziar; Martínez-González, Miguel-Ángel
2016-06-14
Dietary intervention success requires strong participant adherence, but very few studies have examined factors related to both short-term and long-term adherence. A better understanding of predictors of adherence is necessary to improve the design and execution of dietary intervention trials. This study was designed to identify participant characteristics at baseline and study features that predict short-term and long-term adherence with interventions promoting the Mediterranean-type diet (MedDiet) in the PREvención con DIeta MEDiterránea (PREDIMED) randomized trial. Analyses included men and women living in Spain aged 55-80 at high risk for cardiovascular disease. Participants were randomized to the MedDiet supplemented with either complementary extra-virgin olive oil (EVOO) or tree nuts. The control group and participants with insufficient information on adherence were excluded. PREDIMED began in 2003 and ended in 2010. Investigators assessed covariates at baseline and dietary information was updated yearly throughout follow-up. Adherence was measured with a validated 14-point Mediterranean-type diet adherence score. Logistic regression was used to examine associations between baseline characteristics and adherence at one and four years of follow-up. Participants were randomized to the MedDiet supplemented with EVOO (n = 2,543; 1,962 after exclusions) or tree nuts (n = 2,454; 2,236 after exclusions). A higher number of cardiovascular risk factors, larger waist circumference, lower physical activity levels, lower total energy intake, poorer baseline adherence to the 14-point adherence score, and allocation to MedDiet + EVOO each independently predicted poorer adherence. Participants from PREDIMED recruiting centers with a higher total workload (measured as total number of persons-years of follow-up) achieved better adherence. No adverse events or side effects were reported. To maximize dietary adherence in dietary interventions, additional efforts to promote adherence should be used for participants with lower baseline adherence to the intended diet and poorer health status. The design of multicenter nutrition trials should prioritize few large centers with more participants in each, rather than many small centers. This study was registered at controlled-trials.com (http://www.controlled-trials. com/ISRCTN35739639). International Standard Randomized Controlled Trial Number (ISRCTN): 35739639. Registration date: 5 October 2005. parallel randomized trial.
Zhang, Ling Yu; Liu, Zhao Gang
2017-12-01
Based on the data collected from 108 permanent plots of the forest resources survey in Maoershan Experimental Forest Farm during 2004-2016, this study investigated the spatial distribution of recruitment trees in natural secondary forest by global Poisson regression and geographically weighted Poisson regression (GWPR) with four bandwidths of 2.5, 5, 10 and 15 km. The simulation effects of the 5 regressions and the factors influencing the recruitment trees in stands were analyzed, a description was given to the spatial autocorrelation of the regression residuals on global and local levels using Moran's I. The results showed that the spatial distribution of the number of natural secondary forest recruitment was significantly influenced by stands and topographic factors, especially average DBH. The GWPR model with small scale (2.5 km) had high accuracy of model fitting, a large range of model parameter estimates was generated, and the localized spatial distribution effect of the model parameters was obtained. The GWPR model at small scale (2.5 and 5 km) had produced a small range of model residuals, and the stability of the model was improved. The global spatial auto-correlation of the GWPR model residual at the small scale (2.5 km) was the lowe-st, and the local spatial auto-correlation was significantly reduced, in which an ideal spatial distribution pattern of small clusters with different observations was formed. The local model at small scale (2.5 km) was much better than the global model in the simulation effect on the spatial distribution of recruitment tree number.
Random forests as cumulative effects models: A case study of lakes and rivers in Muskoka, Canada.
Jones, F Chris; Plewes, Rachel; Murison, Lorna; MacDougall, Mark J; Sinclair, Sarah; Davies, Christie; Bailey, John L; Richardson, Murray; Gunn, John
2017-10-01
Cumulative effects assessment (CEA) - a type of environmental appraisal - lacks effective methods for modeling cumulative effects, evaluating indicators of ecosystem condition, and exploring the likely outcomes of development scenarios. Random forests are an extension of classification and regression trees, which model response variables by recursive partitioning. Random forests were used to model a series of candidate ecological indicators that described lakes and rivers from a case study watershed (The Muskoka River Watershed, Canada). Suitability of the candidate indicators for use in cumulative effects assessment and watershed monitoring was assessed according to how well they could be predicted from natural habitat features and how sensitive they were to human land-use. The best models explained 75% of the variation in a multivariate descriptor of lake benthic-macroinvertebrate community structure, and 76% of the variation in the conductivity of river water. Similar results were obtained by cross-validation. Several candidate indicators detected a simulated doubling of urban land-use in their catchments, and a few were able to detect a simulated doubling of agricultural land-use. The paper demonstrates that random forests can be used to describe the combined and singular effects of multiple stressors and natural environmental factors, and furthermore, that random forests can be used to evaluate the performance of monitoring indicators. The numerical methods presented are applicable to any ecosystem and indicator type, and therefore represent a step forward for CEA. Crown Copyright © 2017. Published by Elsevier Ltd. All rights reserved.
Shi, Huilan; Jia, Junya; Li, Dong; Wei, Li; Shang, Wenya; Zheng, Zhenfeng
2018-02-09
Precise renal histopathological diagnosis will guide therapy strategy in patients with lupus nephritis. Blood oxygen level dependent (BOLD) magnetic resonance imaging (MRI) has been applicable noninvasive technique in renal disease. This current study was performed to explore whether BOLD MRI could contribute to diagnose renal pathological pattern. Adult patients with lupus nephritis renal pathological diagnosis were recruited for this study. Renal biopsy tissues were assessed based on the lupus nephritis ISN/RPS 2003 classification. The Blood oxygen level dependent magnetic resonance imaging (BOLD-MRI) was used to obtain functional magnetic resonance parameter, R2* values. Several functions of R2* values were calculated and used to construct algorithmic models for renal pathological patterns. In addition, the algorithmic models were compared as to their diagnostic capability. Both Histopathology and BOLD MRI were used to examine a total of twelve patients. Renal pathological patterns included five classes III (including 3 as class III + V) and seven classes IV (including 4 as class IV + V). Three algorithmic models, including decision tree, line discriminant, and logistic regression, were constructed to distinguish the renal pathological pattern of class III and class IV. The sensitivity of the decision tree model was better than that of the line discriminant model (71.87% vs 59.48%, P < 0.001) and inferior to that of the Logistic regression model (71.87% vs 78.71%, P < 0.001). The specificity of decision tree model was equivalent to that of the line discriminant model (63.87% vs 63.73%, P = 0.939) and higher than that of the logistic regression model (63.87% vs 38.0%, P < 0.001). The Area under the ROC curve (AUROCC) of the decision tree model was greater than that of the line discriminant model (0.765 vs 0.629, P < 0.001) and logistic regression model (0.765 vs 0.662, P < 0.001). BOLD MRI is a useful non-invasive imaging technique for the evaluation of lupus nephritis. Decision tree models constructed using functions of R2* values may facilitate the prediction of renal pathological patterns.
Lin, Lei; Wang, Qian; Sadek, Adel W
2016-06-01
The duration of freeway traffic accidents duration is an important factor, which affects traffic congestion, environmental pollution, and secondary accidents. Among previous studies, the M5P algorithm has been shown to be an effective tool for predicting incident duration. M5P builds a tree-based model, like the traditional classification and regression tree (CART) method, but with multiple linear regression models as its leaves. The problem with M5P for accident duration prediction, however, is that whereas linear regression assumes that the conditional distribution of accident durations is normally distributed, the distribution for a "time-to-an-event" is almost certainly nonsymmetrical. A hazard-based duration model (HBDM) is a better choice for this kind of a "time-to-event" modeling scenario, and given this, HBDMs have been previously applied to analyze and predict traffic accidents duration. Previous research, however, has not yet applied HBDMs for accident duration prediction, in association with clustering or classification of the dataset to minimize data heterogeneity. The current paper proposes a novel approach for accident duration prediction, which improves on the original M5P tree algorithm through the construction of a M5P-HBDM model, in which the leaves of the M5P tree model are HBDMs instead of linear regression models. Such a model offers the advantage of minimizing data heterogeneity through dataset classification, and avoids the need for the incorrect assumption of normality for traffic accident durations. The proposed model was then tested on two freeway accident datasets. For each dataset, the first 500 records were used to train the following three models: (1) an M5P tree; (2) a HBDM; and (3) the proposed M5P-HBDM, and the remainder of data were used for testing. The results show that the proposed M5P-HBDM managed to identify more significant and meaningful variables than either M5P or HBDMs. Moreover, the M5P-HBDM had the lowest overall mean absolute percentage error (MAPE). Copyright © 2016 Elsevier Ltd. All rights reserved.
Observed Methods for Felling Hardwood Trees with Chain Saws
Jerry L. Koger
1983-01-01
The angles and lengths of the cutting surfaces made by chain saw operators on hardwood tree stumps are described by means, standard deviations, ranges, and regression equations. Recommended felling guidelines are compared with observed felling methods used by experienced timber cutters in the southern Appalachian Mountains.
Utilizing random forests imputation of forest plot data for landscape-level wildfire analyses
Karin L. Riley; Isaac C. Grenfell; Mark A. Finney; Nicholas L. Crookston
2014-01-01
Maps of the number, size, and species of trees in forests across the United States are desirable for a number of applications. For landscape-level fire and forest simulations that use the Forest Vegetation Simulator (FVS), a spatial tree-level dataset, or âtree listâ, is a necessity. FVS is widely used at the stand level for simulating fire effects on tree mortality,...
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rupšys, P.
A system of stochastic differential equations (SDE) with mixed-effects parameters and multivariate normal copula density function were used to develop tree height model for Scots pine trees in Lithuania. A two-step maximum likelihood parameter estimation method is used and computational guidelines are given. After fitting the conditional probability density functions to outside bark diameter at breast height, and total tree height, a bivariate normal copula distribution model was constructed. Predictions from the mixed-effects parameters SDE tree height model calculated during this research were compared to the regression tree height equations. The results are implemented in the symbolic computational language MAPLE.
Roosting habitat use and selection by northern spotted owls during natal dispersal
Sovern, Stan G.; Forsman, Eric D.; Dugger, Catherine M.; Taylor, Margaret
2015-01-01
We studied habitat selection by northern spotted owls (Strix occidentalis caurina) during natal dispersal in Washington State, USA, at both the roost site and landscape scales. We used logistic regression to obtain parameters for an exponential resource selection function based on vegetation attributes in roost and random plots in 76 forest stands that were used for roosting. We used a similar analysis to evaluate selection of landscape habitat attributes based on 301 radio-telemetry relocations and random points within our study area. We found no evidence of within-stand selection for any of the variables examined, but 78% of roosts were in stands with at least some large (>50 cm dbh) trees. At the landscape scale, owls selected for stands with high canopy cover (>70%). Dispersing owls selected vegetation types that were more similar to habitat selected by adult owls than habitat that would result from following guidelines previously proposed to maintain dispersal habitat. Our analysis indicates that juvenile owls select stands for roosting that have greater canopy cover than is recommended in current agency guidelines.
Wylie, Bruce K.; Howard, Daniel; Dahal, Devendra; Gilmanov, Tagir; Ji, Lei; Zhang, Li; Smith, Kelcy
2016-01-01
This paper presents the methodology and results of two ecological-based net ecosystem production (NEP) regression tree models capable of up scaling measurements made at various flux tower sites throughout the U.S. Great Plains. Separate grassland and cropland NEP regression tree models were trained using various remote sensing data and other biogeophysical data, along with 15 flux towers contributing to the grassland model and 15 flux towers for the cropland model. The models yielded weekly mean daily grassland and cropland NEP maps of the U.S. Great Plains at 250 m resolution for 2000–2008. The grassland and cropland NEP maps were spatially summarized and statistically compared. The results of this study indicate that grassland and cropland ecosystems generally performed as weak net carbon (C) sinks, absorbing more C from the atmosphere than they released from 2000 to 2008. Grasslands demonstrated higher carbon sink potential (139 g C·m−2·year−1) than non-irrigated croplands. A closer look into the weekly time series reveals the C fluctuation through time and space for each land cover type.
Zhao, Cai-Yun; Li, Jun-Sheng; Xu, Jing; Liu, Xiao-Yan
2017-05-01
Globalization increases the opportunities for unintentionally introduced invasive alien species, especially for insects, and most of these species could damage ecosystems and cause economic loss in China. In this study, we analyzed drivers of the distribution of unintentionally introduced invasive alien insects. Based on the number of unintentionally introduced invasive alien insects and their presence/absence records in each province in mainland China, regression trees were built to elucidate the roles of environmental and anthropogenic factors on the number distribution and similarity of species composition of these insects. Classification and regression trees indicated climatic suitability (the mean temperature in January) and human economic activity (sum of total freight) are primary drivers for the number distribution pattern of unintentionally introduced invasive alien insects at provincial scale, while only environmental factors (the mean January temperature, the annual precipitation and the areas of provinces) significantly affect the similarity of them based on the multivariate regression trees. © The Authors 2017. Published by Oxford University Press on behalf of Entomological Society of America.
Taper models for commercial tree species in the northeastern United States
James A. Westfall; Charles T. Scott
2010-01-01
A new taper model was developed based on the switching taper model of Valentine and Gregoire; the most substantial changes were reformulation to incorporate estimated join points and modification of a switching function. Random-effects parameters were included that account for within-tree correlations and allow for customized calibration to each individual tree. The...
Accurate Diabetes Risk Stratification Using Machine Learning: Role of Missing Value and Outliers.
Maniruzzaman, Md; Rahman, Md Jahanur; Al-MehediHasan, Md; Suri, Harman S; Abedin, Md Menhazul; El-Baz, Ayman; Suri, Jasjit S
2018-04-10
Diabetes mellitus is a group of metabolic diseases in which blood sugar levels are too high. About 8.8% of the world was diabetic in 2017. It is projected that this will reach nearly 10% by 2045. The major challenge is that when machine learning-based classifiers are applied to such data sets for risk stratification, leads to lower performance. Thus, our objective is to develop an optimized and robust machine learning (ML) system under the assumption that missing values or outliers if replaced by a median configuration will yield higher risk stratification accuracy. This ML-based risk stratification is designed, optimized and evaluated, where: (i) the features are extracted and optimized from the six feature selection techniques (random forest, logistic regression, mutual information, principal component analysis, analysis of variance, and Fisher discriminant ratio) and combined with ten different types of classifiers (linear discriminant analysis, quadratic discriminant analysis, naïve Bayes, Gaussian process classification, support vector machine, artificial neural network, Adaboost, logistic regression, decision tree, and random forest) under the hypothesis that both missing values and outliers when replaced by computed medians will improve the risk stratification accuracy. Pima Indian diabetic dataset (768 patients: 268 diabetic and 500 controls) was used. Our results demonstrate that on replacing the missing values and outliers by group median and median values, respectively and further using the combination of random forest feature selection and random forest classification technique yields an accuracy, sensitivity, specificity, positive predictive value, negative predictive value and area under the curve as: 92.26%, 95.96%, 79.72%, 91.14%, 91.20%, and 0.93, respectively. This is an improvement of 10% over previously developed techniques published in literature. The system was validated for its stability and reliability. RF-based model showed the best performance when outliers are replaced by median values.
Biodiversity mapping in a tropical West African forest with airborne hyperspectral data.
Vaglio Laurin, Gaia; Cheung-Wai Chan, Jonathan; Chen, Qi; Lindsell, Jeremy A; Coomes, David A; Guerriero, Leila; Del Frate, Fabio; Miglietta, Franco; Valentini, Riccardo
2014-01-01
Tropical forests are major repositories of biodiversity, but are fast disappearing as land is converted to agriculture. Decision-makers need to know which of the remaining forests to prioritize for conservation, but the only spatial information on forest biodiversity has, until recently, come from a sparse network of ground-based plots. Here we explore whether airborne hyperspectral imagery can be used to predict the alpha diversity of upper canopy trees in a West African forest. The abundance of tree species were collected from 64 plots (each 1250 m(2) in size) within a Sierra Leonean national park, and Shannon-Wiener biodiversity indices were calculated. An airborne spectrometer measured reflectances of 186 bands in the visible and near-infrared spectral range at 1 m(2) resolution. The standard deviations of these reflectance values and their first-order derivatives were calculated for each plot from the c. 1250 pixels of hyperspectral information within them. Shannon-Wiener indices were then predicted from these plot-based reflectance statistics using a machine-learning algorithm (Random Forest). The regression model fitted the data well (pseudo-R(2) = 84.9%), and we show that standard deviations of green-band reflectances and infra-red region derivatives had the strongest explanatory powers. Our work shows that airborne hyperspectral sensing can be very effective at mapping canopy tree diversity, because its high spatial resolution allows within-plot heterogeneity in reflectance to be characterized, making it an effective tool for monitoring forest biodiversity over large geographic scales.
Using Predictive Analytics to Predict Power Outages from Severe Weather
NASA Astrophysics Data System (ADS)
Wanik, D. W.; Anagnostou, E. N.; Hartman, B.; Frediani, M. E.; Astitha, M.
2015-12-01
The distribution of reliable power is essential to businesses, public services, and our daily lives. With the growing abundance of data being collected and created by industry (i.e. outage data), government agencies (i.e. land cover), and academia (i.e. weather forecasts), we can begin to tackle problems that previously seemed too complex to solve. In this session, we will present newly developed tools to aid decision-support challenges at electric distribution utilities that must mitigate, prepare for, respond to and recover from severe weather. We will show a performance evaluation of outage predictive models built for Eversource Energy (formerly Connecticut Light & Power) for storms of all types (i.e. blizzards, thunderstorms and hurricanes) and magnitudes (from 20 to >15,000 outages). High resolution weather simulations (simulated with the Weather and Research Forecast Model) were joined with utility outage data to calibrate four types of models: a decision tree (DT), random forest (RF), boosted gradient tree (BT) and an ensemble (ENS) decision tree regression that combined predictions from DT, RF and BT. The study shows that the ENS model forced with weather, infrastructure and land cover data was superior to the other models we evaluated, especially in terms of predicting the spatial distribution of outages. This research has the potential to be used for other critical infrastructure systems (such as telecommunications, drinking water and gas distribution networks), and can be readily expanded to the entire New England region to facilitate better planning and coordination among decision-makers when severe weather strikes.
Biodiversity Mapping in a Tropical West African Forest with Airborne Hyperspectral Data
Vaglio Laurin, Gaia; Chan, Jonathan Cheung-Wai; Chen, Qi; Lindsell, Jeremy A.; Coomes, David A.; Guerriero, Leila; Frate, Fabio Del; Miglietta, Franco; Valentini, Riccardo
2014-01-01
Tropical forests are major repositories of biodiversity, but are fast disappearing as land is converted to agriculture. Decision-makers need to know which of the remaining forests to prioritize for conservation, but the only spatial information on forest biodiversity has, until recently, come from a sparse network of ground-based plots. Here we explore whether airborne hyperspectral imagery can be used to predict the alpha diversity of upper canopy trees in a West African forest. The abundance of tree species were collected from 64 plots (each 1250 m2 in size) within a Sierra Leonean national park, and Shannon-Wiener biodiversity indices were calculated. An airborne spectrometer measured reflectances of 186 bands in the visible and near-infrared spectral range at 1 m2 resolution. The standard deviations of these reflectance values and their first-order derivatives were calculated for each plot from the c. 1250 pixels of hyperspectral information within them. Shannon-Wiener indices were then predicted from these plot-based reflectance statistics using a machine-learning algorithm (Random Forest). The regression model fitted the data well (pseudo-R2 = 84.9%), and we show that standard deviations of green-band reflectances and infra-red region derivatives had the strongest explanatory powers. Our work shows that airborne hyperspectral sensing can be very effective at mapping canopy tree diversity, because its high spatial resolution allows within-plot heterogeneity in reflectance to be characterized, making it an effective tool for monitoring forest biodiversity over large geographic scales. PMID:24937407
Tisné, Sébastien; Pomiès, Virginie; Riou, Virginie; Syahputra, Indra; Cochard, Benoît; Denis, Marie
2017-01-01
Multi-parental populations are promising tools for identifying quantitative disease resistance loci. Stem rot caused by Ganoderma boninense is a major threat to palm oil production, with yield losses of up to 80% prompting premature replantation of palms. There is evidence of genetic resistance sources, but the genetic architecture of Ganoderma resistance has not yet been investigated. This study aimed to identify Ganoderma resistance loci using an oil palm multi-parental population derived from nine major founders of ongoing breeding programs. A total of 1200 palm trees of the multi-parental population was planted in plots naturally infected by Ganoderma, and their health status was assessed biannually over 25 yr. The data were treated as survival data, and modeled using the Cox regression model, including a spatial effect to take the spatial component in the spread of Ganoderma into account. Based on the genotypes of 757 palm trees out of the 1200 planted, and on pedigree information, resistance loci were identified using a random effect with identity-by-descent kinship matrices as covariance matrices in the Cox model. Four Ganoderma resistance loci were identified, two controlling the occurrence of the first Ganoderma symptoms, and two the death of palm trees, while favorable haplotypes were identified among a major gene pool for ongoing breeding programs. This study implemented an efficient and flexible QTL mapping approach, and generated unique valuable information for the selection of oil palm varieties resistant to Ganoderma disease. PMID:28592650
NASA Astrophysics Data System (ADS)
Zabret, Katarina; Rakovec, Jože; Šraj, Mojca
2018-03-01
Rainfall partitioning is an important part of the ecohydrological cycle, influenced by numerous variables. Rainfall partitioning for pine (Pinus nigra Arnold) and birch (Betula pendula Roth.) trees was measured from January 2014 to June 2017 in an urban area of Ljubljana, Slovenia. 180 events from more than three years of observations were analyzed, focusing on 13 meteorological variables, including the number of raindrops, their diameter, and velocity. Regression tree and boosted regression tree analyses were performed to evaluate the influence of the variables on rainfall interception loss, throughfall, and stemflow in different phenoseasons. The amount of rainfall was recognized as the most influential variable, followed by rainfall intensity and the number of raindrops. Higher rainfall amount, intensity, and the number of drops decreased percentage of rainfall interception loss. Rainfall amount and intensity were the most influential on interception loss by birch and pine trees during the leafed and leafless periods, respectively. Lower wind speed was found to increase throughfall, whereas wind direction had no significant influence. Consideration of drop size spectrum properties proved to be important, since the number of drops, drop diameter, and median volume diameter were often recognized as important influential variables.
Brito-Rocha, E; Schilling, A C; Dos Anjos, L; Piotto, D; Dalmolin, A C; Mielke, M S
2016-01-01
Individual leaf area (LA) is a key variable in studies of tree ecophysiology because it directly influences light interception, photosynthesis and evapotranspiration of adult trees and seedlings. We analyzed the leaf dimensions (length - L and width - W) of seedlings and adults of seven Neotropical rainforest tree species (Brosimum rubescens, Manilkara maxima, Pouteria caimito, Pouteria torta, Psidium cattleyanum, Symphonia globulifera and Tabebuia stenocalyx) with the objective to test the feasibility of single regression models to estimate LA of both adults and seedlings. In southern Bahia, Brazil, a first set of data was collected between March and October 2012. From the seven species analyzed, only two (P. cattleyanum and T. stenocalyx) had very similar relationships between LW and LA in both ontogenetic stages. For these two species, a second set of data was collected in August 2014, in order to validate the single models encompassing adult and seedlings. Our results show the possibility of development of models for predicting individual leaf area encompassing different ontogenetic stages for tropical tree species. The development of these models was more dependent on the species than the differences in leaf size between seedlings and adults.
NASA Astrophysics Data System (ADS)
Seibert, Mathias; Merz, Bruno; Apel, Heiko
2017-03-01
The Limpopo Basin in southern Africa is prone to droughts which affect the livelihood of millions of people in South Africa, Botswana, Zimbabwe and Mozambique. Seasonal drought early warning is thus vital for the whole region. In this study, the predictability of hydrological droughts during the main runoff period from December to May is assessed using statistical approaches. Three methods (multiple linear models, artificial neural networks, random forest regression trees) are compared in terms of their ability to forecast streamflow with up to 12 months of lead time. The following four main findings result from the study. 1. There are stations in the basin at which standardised streamflow is predictable with lead times up to 12 months. The results show high inter-station differences of forecast skill but reach a coefficient of determination as high as 0.73 (cross validated). 2. A large range of potential predictors is considered in this study, comprising well-established climate indices, customised teleconnection indices derived from sea surface temperatures and antecedent streamflow as a proxy of catchment conditions. El Niño and customised indices, representing sea surface temperature in the Atlantic and Indian oceans, prove to be important teleconnection predictors for the region. Antecedent streamflow is a strong predictor in small catchments (with median 42 % explained variance), whereas teleconnections exert a stronger influence in large catchments. 3. Multiple linear models show the best forecast skill in this study and the greatest robustness compared to artificial neural networks and random forest regression trees, despite their capabilities to represent nonlinear relationships. 4. Employed in early warning, the models can be used to forecast a specific drought level. Even if the coefficient of determination is low, the forecast models have a skill better than a climatological forecast, which is shown by analysis of receiver operating characteristics (ROCs). Seasonal statistical forecasts in the Limpopo show promising results, and thus it is recommended to employ them as complementary to existing forecasts in order to strengthen preparedness for droughts.
Deciphering factors controlling groundwater arsenic spatial variability in Bangladesh
NASA Astrophysics Data System (ADS)
Tan, Z.; Yang, Q.; Zheng, C.; Zheng, Y.
2017-12-01
Elevated concentrations of geogenic arsenic in groundwater have been found in many countries to exceed 10 μg/L, the WHO's guideline value for drinking water. A common yet unexplained characteristic of groundwater arsenic spatial distribution is the extensive variability at various spatial scales. This study investigates factors influencing the spatial variability of groundwater arsenic in Bangladesh to improve the accuracy of models predicting arsenic exceedance rate spatially. A novel boosted regression tree method is used to establish a weak-learning ensemble model, which is compared to a linear model using a conventional stepwise logistic regression method. The boosted regression tree models offer the advantage of parametric interaction when big datasets are analyzed in comparison to the logistic regression. The point data set (n=3,538) of groundwater hydrochemistry with 19 parameters was obtained by the British Geological Survey in 2001. The spatial data sets of geological parameters (n=13) were from the Consortium for Spatial Information, Technical University of Denmark, University of East Anglia and the FAO, while the soil parameters (n=42) were from the Harmonized World Soil Database. The aforementioned parameters were regressed to categorical groundwater arsenic concentrations below or above three thresholds: 5 μg/L, 10 μg/L and 50 μg/L to identify respective controlling factors. Boosted regression tree method outperformed logistic regression methods in all three threshold levels in terms of accuracy, specificity and sensitivity, resulting in an improvement of spatial distribution map of probability of groundwater arsenic exceeding all three thresholds when compared to disjunctive-kriging interpolated spatial arsenic map using the same groundwater arsenic dataset. Boosted regression tree models also show that the most important controlling factors of groundwater arsenic distribution include groundwater iron content and well depth for all three thresholds. The probability of a well with iron content higher than 5mg/L to contain greater than 5 μg/L, 10 μg/L and 50 μg/L As is estimated to be more than 91%, 85% and 51%, respectively, while the probability of a well from depth more than 160m to contain more than 5 μg/L, 10 μg/L and 50 μg/L As is estimated to be less than 38%, 25% and 14%, respectively.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Andrew G. Peterson; J. Timothy Ball; Yiqi Luo
1998-09-25
Estimation of leaf photosynthetic rate (A) from leaf nitrogen content (N) is both conceptually and numerically important in models of plant, ecosystem and biosphere responses to global change. The relationship between A and N has been studied extensively at ambient CO{sub 2} but much less at elevated CO{sub 2}. This study was designed to (1) assess whether the A-N relationship was more similar for species within than between community and vegetation types, and (2) examine how growth at elevated CO{sub 2} affects the A-N relationship. Data were obtained for 39 C{sub 3} species grown at ambient CO{sub 2} and 10more » C{sub 3} species grown at ambient and elevated CO{sub 2}. A regression model was applied to each species as well as to species pooled within different community and vegetation types. Cluster analysis of the regression coefficients indicated that species measured at ambient CO{sub 2} did not separate into distinct groups matching community or vegetation type. Instead, most community and vegetation types shared the same general parameter space for regression coefficients. Growth at elevated CO{sub 2} increased photosynthetic nitrogen use efficiency for pines and deciduous trees. When species were pooled by vegetation type, the A-N relationship for deciduous trees expressed on a leaf-mass bask was not altered by elevated CO{sub 2}, while the intercept increased for pines. When regression coefficients were averaged to give mean responses for different vegetation types, elevated CO{sub 2} increased the intercept and the slope for deciduous trees but increased only the intercept for pines. There were no statistical differences between the pines and deciduous trees for the effect of CO{sub 2}. Generalizations about the effect of elevated CO{sub 2} on the A-N relationship, and differences between pines and deciduous trees will be enhanced as more data become available.« less
Dispersion patterns and sampling plans for Diaphorina citri (Hemiptera: Psyllidae) in citrus.
Sétamou, Mamoudou; Flores, Daniel; French, J Victor; Hall, David G
2008-08-01
The abundance and spatial dispersion of Diaphorina citri Kuwayama (Hemiptera: Psyllidae) were studied in 34 grapefruit (Citrus paradisi Macfad.) and six sweet orange [Citrus sinensis (L.) Osbeck] orchards from March to August 2006 when the pest is more abundant in southern Texas. Although flush shoot infestation levels did not vary with host plant species, densities of D. citri eggs, nymphs, and adults were significantly higher on sweet orange than on grapefruit. D. citri immatures also were found in significantly higher numbers in the southeastern quadrant of trees than other parts of the canopy. The spatial distribution of D. citri nymphs and adults was analyzed using Iowa's patchiness regression and Taylor's power law. Taylor's power law fitted the data better than Iowa's model. Based on both regression models, the field dispersion patterns of D. citri nymphs and adults were aggregated among flush shoots in individual trees as indicated by the regression slopes that were significantly >1. For the average density of each life stage obtained during our surveys, the minimum number of flush shoots per tree needed to estimate D. citri densities varied from eight for eggs to four flush shoots for adults. Projections indicated that a sampling plan consisting of 10 trees and eight flush shoots per tree would provide density estimates of the three developmental stages of D. citri acceptable enough for population studies and management decisions. A presence-absence sampling plan with a fixed precision level was developed and can be used to provide a quick estimation of D. citri populations in citrus orchards.
2012-03-01
with each SVM discriminating between a pair of the N total speakers in the data set. The (( + 1))/2 classifiers then vote on the final...classification of a test sample. The Random Forest classifier is an ensemble classifier that votes amongst decision trees generated with each node using...Forest vote , and the effects of overtraining will be mitigated by the fact that each decision tree is overtrained differently (due to the random
1985-12-01
consists of the node t and all descendants of t in T. (3) Definition 3. Pruning a branch Tt from a tree T con- sists of deleting from T all...The default is 1.0 so that actually, this keyword did not need to appear in the above file. (5) DELETE . This keyword does not appear in our example, but...when it is used associated with some variable names, it indicates that we want to delete these vari- ables from the regression. If this keyword is
Jovanovic, Milos; Radovanovic, Sandro; Vukicevic, Milan; Van Poucke, Sven; Delibasic, Boris
2016-09-01
Quantification and early identification of unplanned readmission risk have the potential to improve the quality of care during hospitalization and after discharge. However, high dimensionality, sparsity, and class imbalance of electronic health data and the complexity of risk quantification, challenge the development of accurate predictive models. Predictive models require a certain level of interpretability in order to be applicable in real settings and create actionable insights. This paper aims to develop accurate and interpretable predictive models for readmission in a general pediatric patient population, by integrating a data-driven model (sparse logistic regression) and domain knowledge based on the international classification of diseases 9th-revision clinical modification (ICD-9-CM) hierarchy of diseases. Additionally, we propose a way to quantify the interpretability of a model and inspect the stability of alternative solutions. The analysis was conducted on >66,000 pediatric hospital discharge records from California, State Inpatient Databases, Healthcare Cost and Utilization Project between 2009 and 2011. We incorporated domain knowledge based on the ICD-9-CM hierarchy in a data driven, Tree-Lasso regularized logistic regression model, providing the framework for model interpretation. This approach was compared with traditional Lasso logistic regression resulting in models that are easier to interpret by fewer high-level diagnoses, with comparable prediction accuracy. The results revealed that the use of a Tree-Lasso model was as competitive in terms of accuracy (measured by area under the receiver operating characteristic curve-AUC) as the traditional Lasso logistic regression, but integration with the ICD-9-CM hierarchy of diseases provided more interpretable models in terms of high-level diagnoses. Additionally, interpretations of models are in accordance with existing medical understanding of pediatric readmission. Best performing models have similar performances reaching AUC values 0.783 and 0.779 for traditional Lasso and Tree-Lasso, respectfully. However, information loss of Lasso models is 0.35 bits higher compared to Tree-Lasso model. We propose a method for building predictive models applicable for the detection of readmission risk based on Electronic Health records. Integration of domain knowledge (in the form of ICD-9-CM taxonomy) and a data-driven, sparse predictive algorithm (Tree-Lasso Logistic Regression) resulted in an increase of interpretability of the resulting model. The models are interpreted for the readmission prediction problem in general pediatric population in California, as well as several important subpopulations, and the interpretations of models comply with existing medical understanding of pediatric readmission. Finally, quantitative assessment of the interpretability of the models is given, that is beyond simple counts of selected low-level features. Copyright © 2016 Elsevier B.V. All rights reserved.
Sedgwick, James A.; Knopf, Fritz L.
1990-01-01
We examined habitat relationships and nest site characteristics for 6 species of cavity-nesting birds--American kestrel (Falco sparverius), northern flicker (Colaptes auratus), red-headed woodpecker (Melanerpes erythrocephalus), black-capped chickadee (Parus atricapillus), house wren (Troglodytes aedon), and European starling (Sturnus vulgaris)--in a mature plains cottonwood (Populus sargentii) bottomland along the South Platte River in northeastern Colorado in 1985 and 1986. We examined characteristics of cavities, nest trees, and the habitat surrounding nest trees. Density of large trees (>69 cm dbh), total length of dead limbs ≥10 cm diameter (TDLL), and cavity density were the most important habitat variables; dead limb length (DLL), dbh, and species were the most important tree variables; and cavity height, cavity entrance diameter, and substrate condition at the cavity (live vs. dead) were the most important cavity variables in segregating cavity nesters along habitat, tree, and cavity dimensions, respectively. Random sites differed most from cavity-nesting bird sites on the basis of dbh, DLL, limb tree density (trees with ≥1 m dead limbs ≥10 cm diameter), and cavity density. Habitats of red-headed woodpeckers and American kestrels were the most unique, differing most from random sites. Based on current trends in cottonwood demography, densities of cavity-nesting birds will probably decline gradually along the South Platte River, paralleling a decline in DLL, limb tree density, snag density, and the concurrent lack of cottonwood regeneration.
Heritability estimations for inner muscular fat in Hereford cattle using random regressions
USDA-ARS?s Scientific Manuscript database
Random regressions make possible to make genetic predictions and parameters estimation across a gradient of environments, allowing a more accurate and beneficial use of animals as breeders in specific environments. The objective of this study was to use random regression models to estimate heritabil...
Cherry, Kevin M; Peplinski, Brandon; Kim, Lauren; Wang, Shijun; Lu, Le; Zhang, Weidong; Liu, Jianfei; Wei, Zhuoshi; Summers, Ronald M
2015-01-01
Given the potential importance of marginal artery localization in automated registration in computed tomography colonography (CTC), we have devised a semi-automated method of marginal vessel detection employing sequential Monte Carlo tracking (also known as particle filtering tracking) by multiple cue fusion based on intensity, vesselness, organ detection, and minimum spanning tree information for poorly enhanced vessel segments. We then employed a random forest algorithm for intelligent cue fusion and decision making which achieved high sensitivity and robustness. After applying a vessel pruning procedure to the tracking results, we achieved statistically significantly improved precision compared to a baseline Hessian detection method (2.7% versus 75.2%, p<0.001). This method also showed statistically significantly improved recall rate compared to a 2-cue baseline method using fewer vessel cues (30.7% versus 67.7%, p<0.001). These results demonstrate that marginal artery localization on CTC is feasible by combining a discriminative classifier (i.e., random forest) with a sequential Monte Carlo tracking mechanism. In so doing, we present the effective application of an anatomical probability map to vessel pruning as well as a supplementary spatial coordinate system for colonic segmentation and registration when this task has been confounded by colon lumen collapse. Published by Elsevier B.V.
Error analysis of leaf area estimates made from allometric regression models
NASA Technical Reports Server (NTRS)
Feiveson, A. H.; Chhikara, R. S.
1986-01-01
Biological net productivity, measured in terms of the change in biomass with time, affects global productivity and the quality of life through biochemical and hydrological cycles and by its effect on the overall energy balance. Estimating leaf area for large ecosystems is one of the more important means of monitoring this productivity. For a particular forest plot, the leaf area is often estimated by a two-stage process. In the first stage, known as dimension analysis, a small number of trees are felled so that their areas can be measured as accurately as possible. These leaf areas are then related to non-destructive, easily-measured features such as bole diameter and tree height, by using a regression model. In the second stage, the non-destructive features are measured for all or for a sample of trees in the plots and then used as input into the regression model to estimate the total leaf area. Because both stages of the estimation process are subject to error, it is difficult to evaluate the accuracy of the final plot leaf area estimates. This paper illustrates how a complete error analysis can be made, using an example from a study made on aspen trees in northern Minnesota. The study was a joint effort by NASA and the University of California at Santa Barbara known as COVER (Characterization of Vegetation with Remote Sensing).
NASA Astrophysics Data System (ADS)
Webb, Mathew A.; Hall, Andrew; Kidd, Darren; Minansy, Budiman
2016-05-01
Assessment of local spatial climatic variability is important in the planning of planting locations for horticultural crops. This study investigated three regression-based calibration methods (i.e. traditional versus two optimized methods) to relate short-term 12-month data series from 170 temperature loggers and 4 weather station sites with data series from nearby long-term Australian Bureau of Meteorology climate stations. The techniques trialled to interpolate climatic temperature variables, such as frost risk, growing degree days (GDDs) and chill hours, were regression kriging (RK), regression trees (RTs) and random forests (RFs). All three calibration methods produced accurate results, with the RK-based calibration method delivering the most accurate validation measures: coefficients of determination ( R 2) of 0.92, 0.97 and 0.95 and root-mean-square errors of 1.30, 0.80 and 1.31 °C, for daily minimum, daily maximum and hourly temperatures, respectively. Compared with the traditional method of calibration using direct linear regression between short-term and long-term stations, the RK-based calibration method improved R 2 and reduced root-mean-square error (RMSE) by at least 5 % and 0.47 °C for daily minimum temperature, 1 % and 0.23 °C for daily maximum temperature and 3 % and 0.33 °C for hourly temperature. Spatial modelling indicated insignificant differences between the interpolation methods, with the RK technique tending to be the slightly better method due to the high degree of spatial autocorrelation between logger sites.
Data mining: Potential applications in research on nutrition and health.
Batterham, Marijka; Neale, Elizabeth; Martin, Allison; Tapsell, Linda
2017-02-01
Data mining enables further insights from nutrition-related research, but caution is required. The aim of this analysis was to demonstrate and compare the utility of data mining methods in classifying a categorical outcome derived from a nutrition-related intervention. Baseline data (23 variables, 8 categorical) on participants (n = 295) in an intervention trial were used to classify participants in terms of meeting the criteria of achieving 10 000 steps per day. Results from classification and regression trees (CARTs), random forests, adaptive boosting, logistic regression, support vector machines and neural networks were compared using area under the curve (AUC) and error assessments. The CART produced the best model when considering the AUC (0.703), overall error (18%) and within class error (28%). Logistic regression also performed reasonably well compared to the other models (AUC 0.675, overall error 23%, within class error 36%). All the methods gave different rankings of variables' importance. CART found that body fat, quality of life using the SF-12 Physical Component Summary (PCS) and the cholesterol: HDL ratio were the most important predictors of meeting the 10 000 steps criteria, while logistic regression showed the SF-12PCS, glucose levels and level of education to be the most significant predictors (P ≤ 0.01). Differing outcomes suggest caution is required with a single data mining method, particularly in a dataset with nonlinear relationships and outliers and when exploring relationships that were not the primary outcomes of the research. © 2017 Dietitians Association of Australia.
Rieger, Isaak; Kowarik, Ingo; Cherubini, Paolo; Cierjacks, Arne
2017-01-01
Aboveground carbon (C) sequestration in trees is important in global C dynamics, but reliable techniques for its modeling in highly productive and heterogeneous ecosystems are limited. We applied an extended dendrochronological approach to disentangle the functioning of drivers from the atmosphere (temperature, precipitation), the lithosphere (sedimentation rate), the hydrosphere (groundwater table, river water level fluctuation), the biosphere (tree characteristics), and the anthroposphere (dike construction). Carbon sequestration in aboveground biomass of riparian Quercus robur L. and Fraxinus excelsior L. was modeled (1) over time using boosted regression tree analysis (BRT) on cross-datable trees characterized by equal annual growth ring patterns and (2) across space using a subsequent classification and regression tree analysis (CART) on cross-datable and not cross-datable trees. While C sequestration of cross-datable Q. robur responded to precipitation and temperature, cross-datable F. excelsior also responded to a low Danube river water level. However, CART revealed that C sequestration over time is governed by tree height and parameters that vary over space (magnitude of fluctuation in the groundwater table, vertical distance to mean river water level, and longitudinal distance to upstream end of the study area). Thus, a uniform response to climatic drivers of aboveground C sequestration in Q. robur was only detectable in trees of an intermediate height class and in taller trees (>21.8m) on sites where the groundwater table fluctuated little (≤0.9m). The detection of climatic drivers and the river water level in F. excelsior depended on sites at lower altitudes above the mean river water level (≤2.7m) and along a less dynamic downstream section of the study area. Our approach indicates unexploited opportunities of understanding the interplay of different environmental drivers in aboveground C sequestration. Results may support species-specific and locally adapted forest management plans to increase carbon dioxide sequestration from the atmosphere in trees. Copyright © 2016 Elsevier B.V. All rights reserved.
Contrasting natural regeneration and tree planting in fourteen North American cities
David J. Nowak
2012-01-01
Field data from randomly located plots in 12 cities in the United States and Canada were used to estimate the proportion of the existing tree population that was planted or occurred via natural regeneration. In addition, two cities (Baltimore and Syracuse) were recently re-sampled to estimate the proportion of newly established trees that were planted. Results for the...
Karin Riley; Isaac C. Grenfell; Mark A. Finney
2016-01-01
Maps of the number, size, and species of trees in forests across the western United States are desirable for many applications such as estimating terrestrial carbon resources, predicting tree mortality following wildfires, and for forest inventory. However, detailed mapping of trees for large areas is not feasible with current technologies, but statistical...
Urban forest structure, ecosystem services and change in Syracuse, NY
David J. Nowak; Robert E. Hoehn; Allison R. Bodine; Eric J. Greenfield; Jarlath O' Neil-Dunne
2013-01-01
The tree population within the City of Syracuse was assessed using a random sampling of plots in 1999, 2001 and 2009 to determine how the population and the ecosystem services these trees provide have changed over time. Ecosystem services and values for carbon sequestration, air pollution removal and changes in building energy use were derived using the i-Tree Eco...
Deborah G. McCullough; Therese M. Poland; Andrea C. Anulewicz; Phillip Lewis; David. Cappaert
2011-01-01
Effective methods are needed to protect ash trees (Fraxinus spp.) from emerald ash borer, Agrilus planipennis Fairmaire (Coleoptera: Buprestidae), an invasive buprestid that has killed millions of North American ash (Fraxinus spp.) trees. We randomly assigned 175 ash trees (11.5-48.1 cm in diameter) in 25 blocks...
Beyond Flory theory: Distribution functions for interacting lattice trees
NASA Astrophysics Data System (ADS)
Rosa, Angelo; Everaers, Ralf
2017-01-01
While Flory theories [J. Isaacson and T. C. Lubensky, J. Physique Lett. 41, 469 (1980), 10.1051/jphyslet:019800041019046900; M. Daoud and J. F. Joanny, J. Physique 42, 1359 (1981), 10.1051/jphys:0198100420100135900; A. M. Gutin et al., Macromolecules 26, 1293 (1993), 10.1021/ma00058a016] provide an extremely useful framework for understanding the behavior of interacting, randomly branching polymers, the approach is inherently limited. Here we use a combination of scaling arguments and computer simulations to go beyond a Gaussian description. We analyze distribution functions for a wide variety of quantities characterizing the tree connectivities and conformations for the four different statistical ensembles, which we have studied numerically in [A. Rosa and R. Everaers, J. Phys. A: Math. Theor. 49, 345001 (2016), 10.1088/1751-8113/49/34/345001 and J. Chem. Phys. 145, 164906 (2016), 10.1063/1.4965827]: (a) ideal randomly branching polymers, (b) 2 d and 3 d melts of interacting randomly branching polymers, (c) 3 d self-avoiding trees with annealed connectivity, and (d) 3 d self-avoiding trees with quenched ideal connectivity. In particular, we investigate the distributions (i) pN(n ) of the weight, n , of branches cut from trees of mass N by severing randomly chosen bonds; (ii) pN(l ) of the contour distances, l , between monomers; (iii) pN(r ⃗) of spatial distances, r ⃗, between monomers, and (iv) pN(r ⃗|l ) of the end-to-end distance of paths of length l . Data for different tree sizes superimpose, when expressed as functions of suitably rescaled observables x ⃗=r ⃗/√{
Kück, Patrick; Meusemann, Karen; Dambach, Johannes; Thormann, Birthe; von Reumont, Björn M; Wägele, Johann W; Misof, Bernhard
2010-03-31
Methods of alignment masking, which refers to the technique of excluding alignment blocks prior to tree reconstructions, have been successful in improving the signal-to-noise ratio in sequence alignments. However, the lack of formally well defined methods to identify randomness in sequence alignments has prevented a routine application of alignment masking. In this study, we compared the effects on tree reconstructions of the most commonly used profiling method (GBLOCKS) which uses a predefined set of rules in combination with alignment masking, with a new profiling approach (ALISCORE) based on Monte Carlo resampling within a sliding window, using different data sets and alignment methods. While the GBLOCKS approach excludes variable sections above a certain threshold which choice is left arbitrary, the ALISCORE algorithm is free of a priori rating of parameter space and therefore more objective. ALISCORE was successfully extended to amino acids using a proportional model and empirical substitution matrices to score randomness in multiple sequence alignments. A complex bootstrap resampling leads to an even distribution of scores of randomly similar sequences to assess randomness of the observed sequence similarity. Testing performance on real data, both masking methods, GBLOCKS and ALISCORE, helped to improve tree resolution. The sliding window approach was less sensitive to different alignments of identical data sets and performed equally well on all data sets. Concurrently, ALISCORE is capable of dealing with different substitution patterns and heterogeneous base composition. ALISCORE and the most relaxed GBLOCKS gap parameter setting performed best on all data sets. Correspondingly, Neighbor-Net analyses showed the most decrease in conflict. Alignment masking improves signal-to-noise ratio in multiple sequence alignments prior to phylogenetic reconstruction. Given the robust performance of alignment profiling, alignment masking should routinely be used to improve tree reconstructions. Parametric methods of alignment profiling can be easily extended to more complex likelihood based models of sequence evolution which opens the possibility of further improvements.
Inferring gene regression networks with model trees
2010-01-01
Background Novel strategies are required in order to handle the huge amount of data produced by microarray technologies. To infer gene regulatory networks, the first step is to find direct regulatory relationships between genes building the so-called gene co-expression networks. They are typically generated using correlation statistics as pairwise similarity measures. Correlation-based methods are very useful in order to determine whether two genes have a strong global similarity but do not detect local similarities. Results We propose model trees as a method to identify gene interaction networks. While correlation-based methods analyze each pair of genes, in our approach we generate a single regression tree for each gene from the remaining genes. Finally, a graph from all the relationships among output and input genes is built taking into account whether the pair of genes is statistically significant. For this reason we apply a statistical procedure to control the false discovery rate. The performance of our approach, named REGNET, is experimentally tested on two well-known data sets: Saccharomyces Cerevisiae and E.coli data set. First, the biological coherence of the results are tested. Second the E.coli transcriptional network (in the Regulon database) is used as control to compare the results to that of a correlation-based method. This experiment shows that REGNET performs more accurately at detecting true gene associations than the Pearson and Spearman zeroth and first-order correlation-based methods. Conclusions REGNET generates gene association networks from gene expression data, and differs from correlation-based methods in that the relationship between one gene and others is calculated simultaneously. Model trees are very useful techniques to estimate the numerical values for the target genes by linear regression functions. They are very often more precise than linear regression models because they can add just different linear regressions to separate areas of the search space favoring to infer localized similarities over a more global similarity. Furthermore, experimental results show the good performance of REGNET. PMID:20950452
Semiautomated landscape feature extraction and modeling
NASA Astrophysics Data System (ADS)
Wasilewski, Anthony A.; Faust, Nickolas L.; Ribarsky, William
2001-08-01
We have developed a semi-automated procedure for generating correctly located 3D tree objects form overhead imagery. Cross-platform software partitions arbitrarily large, geocorrected and geolocated imagery into management sub- images. The user manually selected tree areas from one or more of these sub-images. Tree group blobs are then narrowed to lines using a special thinning algorithm which retains the topology of the blobs, and also stores the thickness of the parent blob. Maxima along these thinned tree grous are found, and used as individual tree locations within the tree group. Magnitudes of the local maxima are used to scale the radii of the tree objects. Grossly overlapping trees are culled based on a comparison of tree-tree distance to combined radii. Tree color is randomly selected based on the distribution of sample tree pixels, and height is estimated form tree radius. The final tree objects are then inserted into a terrain database which can be navigated by VGIS, a high-resolution global terrain visualization system developed at Georgia Tech.
More Trees, More Poverty? The Socioeconomic Effects of Tree Plantations in Chile, 2001-2011
NASA Astrophysics Data System (ADS)
Andersson, Krister; Lawrence, Duncan; Zavaleta, Jennifer; Guariguata, Manuel R.
2016-01-01
Tree plantations play a controversial role in many nations' efforts to balance goals for economic development, ecological conservation, and social justice. This paper seeks to contribute to this debate by analyzing the socioeconomic impact of such plantations. We focus our study on Chile, a country that has experienced extraordinary growth of industrial tree plantations. Our analysis draws on a unique dataset with longitudinal observations collected in 180 municipal territories during 2001-2011. Employing panel data regression techniques, we find that growth in plantation area is associated with higher than average rates of poverty during this period.
More Trees, More Poverty? The Socioeconomic Effects of Tree Plantations in Chile, 2001-2011.
Andersson, Krister; Lawrence, Duncan; Zavaleta, Jennifer; Guariguata, Manuel R
2016-01-01
Tree plantations play a controversial role in many nations' efforts to balance goals for economic development, ecological conservation, and social justice. This paper seeks to contribute to this debate by analyzing the socioeconomic impact of such plantations. We focus our study on Chile, a country that has experienced extraordinary growth of industrial tree plantations. Our analysis draws on a unique dataset with longitudinal observations collected in 180 municipal territories during 2001-2011. Employing panel data regression techniques, we find that growth in plantation area is associated with higher than average rates of poverty during this period.
The wisdom of the commons: ensemble tree classifiers for prostate cancer prognosis.
Koziol, James A; Feng, Anne C; Jia, Zhenyu; Wang, Yipeng; Goodison, Seven; McClelland, Michael; Mercola, Dan
2009-01-01
Classification and regression trees have long been used for cancer diagnosis and prognosis. Nevertheless, instability and variable selection bias, as well as overfitting, are well-known problems of tree-based methods. In this article, we investigate whether ensemble tree classifiers can ameliorate these difficulties, using data from two recent studies of radical prostatectomy in prostate cancer. Using time to progression following prostatectomy as the relevant clinical endpoint, we found that ensemble tree classifiers robustly and reproducibly identified three subgroups of patients in the two clinical datasets: non-progressors, early progressors and late progressors. Moreover, the consensus classifications were independent predictors of time to progression compared to known clinical prognostic factors.
Psychological correlates of loneliness in the older adult.
Walton, C G; Shultz, C M; Beck, C M; Walls, R C
1991-06-01
Loneliness is the emotional response to the discrepancy between desired and available relationships. As people grow old, the likelihood of experiencing age-related losses increases. Such losses may impede the maintenance or acquisition of desired relationships, resulting in a higher incidence of loneliness. This pilot study examines how loneliness relates to age-related losses, hopelessness, self-transcendence, and spiritual well-being in a convenience sample of 107 adults aged 65 years or older. The collective utility of the independent variables in predicting loneliness was investigated by means of a regression decision tree with an automatic random subset crossvalidation procedure. This procedure explained 46% of the variance. Higher scores for age-related losses and hopelessness were associated with higher loneliness scores. Higher scores for self-transcendence and existential spiritual well-being were associated with lower loneliness scores.
Finding paths in tree graphs with a quantum walk
NASA Astrophysics Data System (ADS)
Koch, Daniel; Hillery, Mark
2018-01-01
We analyze the potential for different types of searches using the formalism of scattering random walks on quantum computers. Given a particular type of graph consisting of nodes and connections, a "tree maze," we would like to find a selected final node as quickly as possible, faster than any classical search algorithm. We show that this can be done using a quantum random walk, both through numerical calculations as well as by using the eigenvectors and eigenvalues of the quantum system.
Beating the Odds: Trees to Success in Different Countries
ERIC Educational Resources Information Center
Finch, W. Holmes; Marchant, Gregory J.
2017-01-01
A recursive partitioning model approach in the form of classification and regression trees (CART) was used with 2012 PISA data for five countries (Canada, Finland, Germany, Singapore-China, and the Unites States). The objective of the study was to determine demographic and educational variables that differentiated between low SES student that were…
Geospatial relationships of tree species damage caused by Hurricane Katrina in south Mississippi
Mark W. Garrigues; Zhaofei Fan; David L. Evans; Scott D. Roberts; William H. Cooke III
2012-01-01
Hurricane Katrina generated substantial impacts on the forests and biological resources of the affected area in Mississippi. This study seeks to use classification tree analysis (CTA) to determine which variables are significant in predicting hurricane damage (shear or windthrow) in the Southeast Mississippi Institute for Forest Inventory District. Logistic regressions...
Using Classification Trees to Predict Alumni Giving for Higher Education
ERIC Educational Resources Information Center
Weerts, David J.; Ronca, Justin M.
2009-01-01
As the relative level of public support for higher education declines, colleges and universities aim to maximize alumni-giving to keep their programs competitive. Anchored in a utility maximization framework, this study employs the classification and regression tree methodology to examine characteristics of alumni donors and non-donors at a…
Updated generalized biomass equations for North American tree species
David C. Chojnacky; Linda S. Heath; Jennifer C. Jenkins
2014-01-01
Historically, tree biomass at large scales has been estimated by applying dimensional analysis techniques and field measurements such as diameter at breast height (dbh) in allometric regression equations. Equations often have been developed using differing methods and applied only to certain species or isolated areas. We previously had compiled and combined (in meta-...
The microcomputer scientific software series 5: the BIOMASS user's guide.
George E. Host; Stephen C. Westin; William G. Cole; Kurt S. Pregitzer
1989-01-01
BIOMASS is an interactive microcomputer program that uses allometric regression equations to calculate aboveground biomass of common tree species of the Lake States. The equations are species-specific and most use both diameter and height as independent variables. The program accommodates fixed area and variable radius sample designs and produces both individual tree...
Northern Arkansas Spring Precipitation Reconstructed from Tree Rings, 1023-1992 A.D.
Malcolm K. Cleaveland
2001-01-01
Three baldcypress (Taxodium distichum (L.) Rich.) tree-ring chronologies in northeastern Arkansas and southeastern Missouri respond strongly to April-June (spring) rainfall in northern Arkansas. I used regression to reconstruct an average of spring rainfall in the three climatic divisions of northern Arkansas since 1023 A.D. The reconstruction was...
Mohana, G S; Shaanker, R U; Ganeshaiah, K N; Dayanandan, S
2001-07-01
Dalbergia sissoo, a wind-dispersed tropical tree, exhibits high intrafruit seed abortion. Of the four to five ovules in the flower, generally one and occasionally two or three develop to maturity. It has been proposed that the seed abortion is a consequence of intense sibling competition for maternal resources and that this competition occurs as an inverse function of the genetic relatedness among the developing seeds. Accordingly, developing seeds compete intensely when they are genetically less related but tend to develop together when genetically more related. We tested this hypothesis by comparing the genetic similarity among the pairs of seeds developing within a pod with that among (a) random pairs from the pool of all seeds, (b) random pairs from single-seeded pods, and (c) random pairs from two-seeded pods, using both randomly amplified polymorphic DNA (RAPD) and isozymes in five trees. We found that the pairs of seeds developing within a pod are genetically more similar than any random pairs of seeds in a tree. Thus the formation of two-seeded pods appear to be associated with increased genetic relatedness among the developing seeds. We discuss the results in the context of possible fitness advantages and then discuss the possible mechanisms that promote tolerance among related seeds.
Weather Impact on Airport Arrival Meter Fix Throughput
NASA Technical Reports Server (NTRS)
Wang, Yao
2017-01-01
Time-based flow management provides arrival aircraft schedules based on arrival airport conditions, airport capacity, required spacing, and weather conditions. In order to meet a scheduled time at which arrival aircraft can cross an airport arrival meter fix prior to entering the airport terminal airspace, air traffic controllers make regulations on air traffic. Severe weather may create an airport arrival bottleneck if one or more of airport arrival meter fixes are partially or completely blocked by the weather and the arrival demand has not been reduced accordingly. Under these conditions, aircraft are frequently being put in holding patterns until they can be rerouted. A model that predicts the weather impacted meter fix throughput may help air traffic controllers direct arrival flows into the airport more efficiently, minimizing arrival meter fix congestion. This paper presents an analysis of air traffic flows across arrival meter fixes at the Newark Liberty International Airport (EWR). Several scenarios of weather impacted EWR arrival fix flows are described. Furthermore, multiple linear regression and regression tree ensemble learning approaches for translating multiple sector Weather Impacted Traffic Indexes (WITI) to EWR arrival meter fix throughputs are examined. These weather translation models are developed and validated using the EWR arrival flight and weather data for the period of April-September in 2014. This study also compares the performance of the regression tree ensemble with traditional multiple linear regression models for estimating the weather impacted throughputs at each of the EWR arrival meter fixes. For all meter fixes investigated, the results from the regression tree ensemble weather translation models show a stronger correlation between model outputs and observed meter fix throughputs than that produced from multiple linear regression method.
Shi, K-Q; Zhou, Y-Y; Yan, H-D; Li, H; Wu, F-L; Xie, Y-Y; Braddock, M; Lin, X-Y; Zheng, M-H
2017-02-01
At present, there is no ideal model for predicting the short-term outcome of patients with acute-on-chronic hepatitis B liver failure (ACHBLF). This study aimed to establish and validate a prognostic model by using the classification and regression tree (CART) analysis. A total of 1047 patients from two separate medical centres with suspected ACHBLF were screened in the study, which were recognized as derivation cohort and validation cohort, respectively. CART analysis was applied to predict the 3-month mortality of patients with ACHBLF. The accuracy of the CART model was tested using the area under the receiver operating characteristic curve, which was compared with the model for end-stage liver disease (MELD) score and a new logistic regression model. CART analysis identified four variables as prognostic factors of ACHBLF: total bilirubin, age, serum sodium and INR, and three distinct risk groups: low risk (4.2%), intermediate risk (30.2%-53.2%) and high risk (81.4%-96.9%). The new logistic regression model was constructed with four independent factors, including age, total bilirubin, serum sodium and prothrombin activity by multivariate logistic regression analysis. The performances of the CART model (0.896), similar to the logistic regression model (0.914, P=.382), exceeded that of MELD score (0.667, P<.001). The results were confirmed in the validation cohort. We have developed and validated a novel CART model superior to MELD for predicting three-month mortality of patients with ACHBLF. Thus, the CART model could facilitate medical decision-making and provide clinicians with a validated practical bedside tool for ACHBLF risk stratification. © 2016 John Wiley & Sons Ltd.
Fine-scale spatial genetic dynamics over the life cycle of the tropical tree Prunus africana.
Berens, D G; Braun, C; González-Martínez, S C; Griebeler, E M; Nathan, R; Böhning-Gaese, K
2014-11-01
Studying fine-scale spatial genetic patterns across life stages is a powerful approach to identify ecological processes acting within tree populations. We investigated spatial genetic dynamics across five life stages in the insect-pollinated and vertebrate-dispersed tropical tree Prunus africana in Kakamega Forest, Kenya. Using six highly polymorphic microsatellite loci, we assessed genetic diversity and spatial genetic structure (SGS) from seed rain and seedlings, and different sapling stages to adult trees. We found significant SGS in all stages, potentially caused by limited seed dispersal and high recruitment rates in areas with high light availability. SGS decreased from seed and early seedling stages to older juvenile stages. Interestingly, SGS was stronger in adults than in late juveniles. The initial decrease in SGS was probably driven by both random and non-random thinning of offspring clusters during recruitment. Intergenerational variation in SGS could have been driven by variation in gene flow processes, overlapping generations in the adult stage or local selection. Our study shows that complex sequential processes during recruitment contribute to SGS of tree populations.
Anomalous scaling in an age-dependent branching model.
Keller-Schmidt, Stephanie; Tuğrul, Murat; Eguíluz, Víctor M; Hernández-García, Emilio; Klemm, Konstantin
2015-02-01
We introduce a one-parametric family of tree growth models, in which branching probabilities decrease with branch age τ as τ(-α). Depending on the exponent α, the scaling of tree depth with tree size n displays a transition between the logarithmic scaling of random trees and an algebraic growth. At the transition (α=1) tree depth grows as (logn)(2). This anomalous scaling is in good agreement with the trend observed in evolution of biological species, thus providing a theoretical support for age-dependent speciation and associating it to the occurrence of a critical point.
A Metric on Phylogenetic Tree Shapes
Plazzotta, G.
2018-01-01
Abstract The shapes of evolutionary trees are influenced by the nature of the evolutionary process but comparisons of trees from different processes are hindered by the challenge of completely describing tree shape. We present a full characterization of the shapes of rooted branching trees in a form that lends itself to natural tree comparisons. We use this characterization to define a metric, in the sense of a true distance function, on tree shapes. The metric distinguishes trees from random models known to produce different tree shapes. It separates trees derived from tropical versus USA influenza A sequences, which reflect the differing epidemiology of tropical and seasonal flu. We describe several metrics based on the same core characterization, and illustrate how to extend the metric to incorporate trees’ branch lengths or other features such as overall imbalance. Our approach allows us to construct addition and multiplication on trees, and to create a convex metric on tree shapes which formally allows computation of average tree shapes. PMID:28472435
Frequency and severity of decay in street tree maples in four upstate New York cities Arboric
Christopher Luley; David Nowak; Eric Greenfield
2009-01-01
A proportional random selection of street tree Norway, silver, and sugar maples, and other species among four diameter classes were surveyed in the U.S. New York cities of Albany, Buffalo, Rochester, and Syracuse for decay incidence and severity. Decay was determined by drilling sampled trees with a Resistograph and calculating the ratio of sound wood to radius....
Don C. Bragg; Jeffrey L. Kershner
2004-01-01
Riparian large woody debris (LWD) recruitment simulations have traditionally applied a random angle of tree fall from two well-forested stream banks. We used a riparian LWD recruitment model (CWD, version 1.4) to test the validity these assumptions. Both the number of contributing forest banks and predominant tree fall direction significantly influenced simulated...
Mining Health App Data to Find More and Less Successful Weight Loss Subgroups
2016-01-01
Background More than half of all smartphone app downloads involve weight, diet, and exercise. If successful, these lifestyle apps may have far-reaching effects for disease prevention and health cost-savings, but few researchers have analyzed data from these apps. Objective The purposes of this study were to analyze data from a commercial health app (Lose It!) in order to identify successful weight loss subgroups via exploratory analyses and to verify the stability of the results. Methods Cross-sectional, de-identified data from Lose It! were analyzed. This dataset (n=12,427,196) was randomly split into 24 subsamples, and this study used 3 subsamples (combined n=972,687). Classification and regression tree methods were used to explore groupings of weight loss with one subsample, with descriptive analyses to examine other group characteristics. Data mining validation methods were conducted with 2 additional subsamples. Results In subsample 1, 14.96% of users lost 5% or more of their starting body weight. Classification and regression tree analysis identified 3 distinct subgroups: “the occasional users” had the lowest proportion (4.87%) of individuals who successfully lost weight; “the basic users” had 37.61% weight loss success; and “the power users” achieved the highest percentage of weight loss success at 72.70%. Behavioral factors delineated the subgroups, though app-related behavioral characteristics further distinguished them. Results were replicated in further analyses with separate subsamples. Conclusions This study demonstrates that distinct subgroups can be identified in “messy” commercial app data and the identified subgroups can be replicated in independent samples. Behavioral factors and use of custom app features characterized the subgroups. Targeting and tailoring information to particular subgroups could enhance weight loss success. Future studies should replicate data mining analyses to increase methodology rigor. PMID:27301853
NASA Astrophysics Data System (ADS)
Yao, W.; Poleswki, P.; Krzystek, P.
2016-06-01
The recent success of deep convolutional neural networks (CNN) on a large number of applications can be attributed to large amounts of available training data and increasing computing power. In this paper, a semantic pixel labelling scheme for urban areas using multi-resolution CNN and hand-crafted spatial-spectral features of airborne remotely sensed data is presented. Both CNN and hand-crafted features are applied to image/DSM patches to produce per-pixel class probabilities with a L1-norm regularized logistical regression classifier. The evidence theory infers a degree of belief for pixel labelling from different sources to smooth regions by handling the conflicts present in the both classifiers while reducing the uncertainty. The aerial data used in this study were provided by ISPRS as benchmark datasets for 2D semantic labelling tasks in urban areas, which consists of two data sources from LiDAR and color infrared camera. The test sites are parts of a city in Germany which is assumed to consist of typical object classes including impervious surfaces, trees, buildings, low vegetation, vehicles and clutter. The evaluation is based on the computation of pixel-based confusion matrices by random sampling. The performance of the strategy with respect to scene characteristics and method combination strategies is analyzed and discussed. The competitive classification accuracy could be not only explained by the nature of input data sources: e.g. the above-ground height of nDSM highlight the vertical dimension of houses, trees even cars and the nearinfrared spectrum indicates vegetation, but also attributed to decision-level fusion of CNN's texture-based approach with multichannel spatial-spectral hand-crafted features based on the evidence combination theory.
Koch, George W; Sillett, Stephen C; Jennings, Gregory M; Davis, Stephen D
2004-04-22
Trees grow tall where resources are abundant, stresses are minor, and competition for light places a premium on height growth. The height to which trees can grow and the biophysical determinants of maximum height are poorly understood. Some models predict heights of up to 120 m in the absence of mechanical damage, but there are historical accounts of taller trees. Current hypotheses of height limitation focus on increasing water transport constraints in taller trees and the resulting reductions in leaf photosynthesis. We studied redwoods (Sequoia sempervirens), including the tallest known tree on Earth (112.7 m), in wet temperate forests of northern California. Our regression analyses of height gradients in leaf functional characteristics estimate a maximum tree height of 122-130 m barring mechanical damage, similar to the tallest recorded trees of the past. As trees grow taller, increasing leaf water stress due to gravity and path length resistance may ultimately limit leaf expansion and photosynthesis for further height growth, even with ample soil moisture.
Comparison of Sub-Pixel Classification Approaches for Crop-Specific Mapping
This paper examined two non-linear models, Multilayer Perceptron (MLP) regression and Regression Tree (RT), for estimating sub-pixel crop proportions using time-series MODIS-NDVI data. The sub-pixel proportions were estimated for three major crop types including corn, soybean, a...
"Mad or bad?": burden on caregivers of patients with personality disorders.
Bauer, Rita; Döring, Antje; Schmidt, Tanja; Spießl, Hermann
2012-12-01
The burden on caregivers of patients with personality disorders is often greatly underestimated or completely disregarded. Possibilities for caregiver support have rarely been assessed. Thirty interviews were conducted with caregivers of such patients to assess illness-related burden. Responses were analyzed with a mixed method of qualitative and quantitative analysis in a sequential design. Patient and caregiver data, including sociodemographic and disease-related variables, were evaluated with regression analysis and regression trees. Caregiver statements (n = 404) were summarized into 44 global statements. The most frequent global statements were worries about the burden on other family members (70.0%), poor cooperation with clinical centers and other institutions (60.0%), financial burden (56.7%), worry about the patient's future (53.3%), and dissatisfaction with the patient's treatment and rehabilitation (53.3%). Linear regression and regression tree analysis identified predictors for more burdened caregivers. Caregivers of patients with personality disorders experience a variety of burdens, some disorder specific. Yet these caregivers often receive little attention or support.
NASA Astrophysics Data System (ADS)
Niemeijer, Meindert; Dumitrescu, Alina V.; van Ginneken, Bram; Abrámoff, Michael D.
2011-03-01
Parameters extracted from the vasculature on the retina are correlated with various conditions such as diabetic retinopathy and cardiovascular diseases such as stroke. Segmentation of the vasculature on the retina has been a topic that has received much attention in the literature over the past decade. Analysis of the segmentation result, however, has only received limited attention with most works describing methods to accurately measure the width of the vessels. Analyzing the connectedness of the vascular network is an important step towards the characterization of the complete vascular tree. The retinal vascular tree, from an image interpretation point of view, originates at the optic disc and spreads out over the retina. The tree bifurcates and the vessels also cross each other. The points where this happens form the key to determining the connectedness of the complete tree. We present a supervised method to detect the bifurcations and crossing points of the vasculature of the retina. The method uses features extracted from the vasculature as well as the image in a location regression approach to find those locations of the segmented vascular tree where the bifurcation or crossing occurs (from here, POI, points of interest). We evaluate the method on the publicly available DRIVE database in which an ophthalmologist has marked the POI.
Ultrasonographic Diagnosis of Biliary Atresia Based on a Decision-Making Tree Model.
Lee, So Mi; Cheon, Jung-Eun; Choi, Young Hun; Kim, Woo Sun; Cho, Hyun-Hae; Cho, Hyun-Hye; Kim, In-One; You, Sun Kyoung
2015-01-01
To assess the diagnostic value of various ultrasound (US) findings and to make a decision-tree model for US diagnosis of biliary atresia (BA). From March 2008 to January 2014, the following US findings were retrospectively evaluated in 100 infants with cholestatic jaundice (BA, n = 46; non-BA, n = 54): length and morphology of the gallbladder, triangular cord thickness, hepatic artery and portal vein diameters, and visualization of the common bile duct. Logistic regression analyses were performed to determine the features that would be useful in predicting BA. Conditional inference tree analysis was used to generate a decision-making tree for classifying patients into the BA or non-BA groups. Multivariate logistic regression analysis showed that abnormal gallbladder morphology and greater triangular cord thickness were significant predictors of BA (p = 0.003 and 0.001; adjusted odds ratio: 345.6 and 65.6, respectively). In the decision-making tree using conditional inference tree analysis, gallbladder morphology and triangular cord thickness (optimal cutoff value of triangular cord thickness, 3.4 mm) were also selected as significant discriminators for differential diagnosis of BA, and gallbladder morphology was the first discriminator. The diagnostic performance of the decision-making tree was excellent, with sensitivity of 100% (46/46), specificity of 94.4% (51/54), and overall accuracy of 97% (97/100). Abnormal gallbladder morphology and greater triangular cord thickness (> 3.4 mm) were the most useful predictors of BA on US. We suggest that the gallbladder morphology should be evaluated first and that triangular cord thickness should be evaluated subsequently in cases with normal gallbladder morphology.
Anderson, S.C.; Kupfer, J.A.; Wilson, R.R.; Cooper, R.J.
2000-01-01
The purpose of this research was to develop a model that could be used to provide a spatial representation of uneven-aged silvicultural treatments on forest crown area. We began by developing species-specific linear regression equations relating tree DBH to crown area for eight bottomland tree species at White River National Wildlife Refuge, Arkansas, USA. The relationships were highly significant for all species, with coefficients of determination (r(2)) ranging from 0.37 for Ulmus crassifolia to nearly 0.80 for Quercus nuttalliii and Taxodium distichum. We next located and measured the diameters of more than 4000 stumps from a single tree-group selection timber harvest. Stump locations were recorded with respect to an established gl id point system and entered into a Geographic Information System (ARC/INFO). The area occupied by the crown of each logged individual was then estimated by using the stump dimensions (adjusted to DBHs) and the regression equations relating tree DBH to crown area. Our model projected that the selection cuts removed roughly 300 m(2) of basal area from the logged sites resulting in the loss of approximate to 55 000 m(2) of crown area. The model developed in this research represents a tool that can be used in conjunction with remote sensing applications to assist in forest inventory and management, as well as to estimate the impacts of selective timber harvest on wildlife.
Scollo, Annalisa; Gottardo, Flaviana; Contiero, Barbara; Edwards, Sandra A
2017-10-01
Tail biting in pigs has been an identified behavioural, welfare and economic problem for decades, and requires appropriate but sometimes difficult on-farm interventions. The aim of the paper is to introduce the Classification and Regression Tree (CRT) methodologies to develop a tool for prevention of acute tail biting lesions in pigs on-farm. A sample of 60 commercial farms rearing heavy pigs were involved; an on-farm visit and an interview with the farmer collected data on general management, herd health, disease prevention, climate control, feeding and production traits. Results suggest a value for the CRT analysis in managing the risk factors behind tail biting on a farm-specific level, showing 86.7% sensitivity for the Classification Tree and a correlation of 0.7 between observed and predicted prevalence of tail biting obtained with the Regression Tree. CRT analysis showed five main variables (stocking density, ammonia levels, number of pigs per stockman, type of floor and timeliness in feed supply) as critical predictors of acute tail biting lesions, which demonstrate different importance in different farms subgroups. The model might have reliable and practical applications for the support and implementation of tail biting prevention interventions, especially in case of subgroups of pigs with higher risk, helping farmers and veterinarians to assess the risk in their own farm and to manage their predisposing variables in order to reduce acute tail biting lesions. Copyright © 2017 Elsevier B.V. All rights reserved.
NASA Astrophysics Data System (ADS)
Naidoo, L.; Cho, M. A.; Mathieu, R.; Asner, G.
2012-04-01
The accurate classification and mapping of individual trees at species level in the savanna ecosystem can provide numerous benefits for the managerial authorities. Such benefits include the mapping of economically useful tree species, which are a key source of food production and fuel wood for the local communities, and of problematic alien invasive and bush encroaching species, which can threaten the integrity of the environment and livelihoods of the local communities. Species level mapping is particularly challenging in African savannas which are complex, heterogeneous, and open environments with high intra-species spectral variability due to differences in geology, topography, rainfall, herbivory and human impacts within relatively short distances. Savanna vegetation are also highly irregular in canopy and crown shape, height and other structural dimensions with a combination of open grassland patches and dense woody thicket - a stark contrast to the more homogeneous forest vegetation. This study classified eight common savanna tree species in the Greater Kruger National Park region, South Africa, using a combination of hyperspectral and Light Detection and Ranging (LiDAR)-derived structural parameters, in the form of seven predictor datasets, in an automated Random Forest modelling approach. The most important predictors, which were found to play an important role in the different classification models and contributed to the success of the hybrid dataset model when combined, were species tree height; NDVI; the chlorophyll b wavelength (466 nm) and a selection of raw, continuum removed and Spectral Angle Mapper (SAM) bands. It was also concluded that the hybrid predictor dataset Random Forest model yielded the highest classification accuracy and prediction success for the eight savanna tree species with an overall classification accuracy of 87.68% and KHAT value of 0.843.
Annual Tree Growth Predictions From Periodic Measurements
Quang V. Cao
2004-01-01
Data from annual measurements of a loblolly pine (Pinus taeda L.) plantation were available for this study. Regression techniques were employed to model annual changes of individual trees in terms of diameters, heights, and survival probabilities. Subsets of the data that include measurements every 2, 3, 4, 5, and 6 years were used to fit the same...
Understory response following varying levels of overstory removal in mixed conifer stands
Fabian C.C. Uzoh; Leroy K. Dolph; John R. Anstead
1997-01-01
Diameter growth rates of understory trees were measured for periods both before and after overstory removal on six study areas in northern California. All the species responded with increased diameter growth after adjusting to their new environments. Linear regression equations that predict post treatment diameter growth increment of the residual trees are presented...
Delayed conifer tree mortality following fire in California
Sharon M. Hood; Sheri L. Smith; Daniel R. Cluck
2007-01-01
Fire injury was characterized and survival monitored for 5,246 trees from five wildfires in California that occurred between 1999 and 2002. Logistic regression models for predicting the probability of mortality were developed for incense-cedar, Jeffrey pine, ponderosa pine, red fir and white fir. Two-year post-fire preliminary models were developed for incense-cedar,...
Estimating leaf area and leaf biomass of open-grown deciduous urban trees
David J. Nowak
1996-01-01
Logarithmic regression equations were developed to predict leaf area and leaf biomass for open-grown deciduous urban trees based on stem diameter and crown parameters. Equations based on crown parameters produced more reliable estimates. The equations can be used to help quantify forest structure and functions, particularly in urbanizing and urban/suburban areas.
Kirk M. Stueve; Dawna L. Cerney; Regina M. Rochefort; Laurie L. Kurth
2009-01-01
We performed classification analysis of 1970 satellite imagery and 2003 aerial photography to delineate establishment. Local site conditions were calculated from a LIDAR-based DEM, ancillary climate data, and 1970 tree locations in a GIS. We used logistic regression on a spatially weighted landscape matrix to rank variables.
Biomass of Yellow-Poplar in Natural Stands in Western North Carolina
Alexander Clark; James G. Schroeder
1977-01-01
Aboveground biomass was determined for yellow-poplar(Liriodendron tulipifera L.) trees 6 to 28 inches d. b. h. growingin natural, uneven-aged mountaincovestandsin western North Carolina.Specific gravity, moisture content, and green weight per cubic foot are presented for the total tree and its components. Tables developed from regression equations show weight and...
The wisdom of the commons: ensemble tree classifiers for prostate cancer prognosis
Koziol, James A.; Feng, Anne C.; Jia, Zhenyu; Wang, Yipeng; Goodison, Seven; McClelland, Michael; Mercola, Dan
2009-01-01
Motivation: Classification and regression trees have long been used for cancer diagnosis and prognosis. Nevertheless, instability and variable selection bias, as well as overfitting, are well-known problems of tree-based methods. In this article, we investigate whether ensemble tree classifiers can ameliorate these difficulties, using data from two recent studies of radical prostatectomy in prostate cancer. Results: Using time to progression following prostatectomy as the relevant clinical endpoint, we found that ensemble tree classifiers robustly and reproducibly identified three subgroups of patients in the two clinical datasets: non-progressors, early progressors and late progressors. Moreover, the consensus classifications were independent predictors of time to progression compared to known clinical prognostic factors. Contact: dmercola@uci.edu PMID:18628288
Jose F. Negron; Willis C. Schaupp; Kenneth E. Gibson; John Anhold; Dawn Hansen; Ralph Thier; Phil Mocettini
1999-01-01
Data collected from Douglas-fir stands infected by the Douglas-fir beetle in Wyoming, Montana, Idaho, and Utah, were used to develop models to estimate amount of mortality in terms of basal area killed. Models were built using stepwise linear regression and regression tree approaches. Linear regression models using initial Douglas-fir basal area were built for all...
Carroll B. Williams; David L. Azuma; George T. Ferrell
1992-01-01
Approximately 3.200 trees in young mixed-conifer stands were examined for pest activity and human-caused or mechanical injuries, and approximately 25 percent of these trees were randomly selected for stem analyses. The examination of trees felled for stem analyses showed that 409 (47 percent) were free of pests and 466 (53 percent) had one or more pest categories....
Analysis of the pattern of potential woody cover in Texas savanna
NASA Astrophysics Data System (ADS)
Yang, Xuebin; Crews, Kelley A.; Yan, Bowei
2016-10-01
While woody plant encroachment has been observed worldwide in savannas and adversely affected the ecosystem structure and function, a thorough understanding of the nature of this phenomenon is urgently required for savanna management and restoration. Among others, potential woody cover (the maximum realizable woody cover that a given site can support), especially its variation over environment has huge implication on the encroachment management in particular, and on tree-grass interactions in general. This project was designed to explore the pattern of potential woody cover in Texas savanna, an ecosystem with a large rainfall gradient in west-east direction. Substantial random pixels were sampled across the study area from MODIS Vegetation Continuous Fields (VCF) tree cover layer (250 m). Since potential woody cover is suggested to be limited by water availability, a nonlinear 99th quantile regression was performed between the observed woody cover and mean annual precipitation (MAP) to model the pattern of potential woody cover. Research result suggests a segmented relationship between potential woody cover and MAP at MODIS scale. Potential biases as well as the practical and theoretical implications were discussed. Through this study, the hypothesis about the primary role of water availability in determining savanna woody cover was further confirmed in a relatively understudied US-located savanna.
NASA Astrophysics Data System (ADS)
Dung Nguyen, The; Kappas, Martin
2017-04-01
In the last several years, the interest in forest biomass and carbon stock estimation has increased due to its importance for forest management, modelling carbon cycle, and other ecosystem services. However, no estimates of biomass and carbon stocks of deferent forest cover types exist throughout in the Xuan Lien Nature Reserve, Thanh Hoa, Viet Nam. This study investigates the relationship between above ground carbon stock and different vegetation indices and to identify the most likely vegetation index that best correlate with forest carbon stock. The terrestrial inventory data come from 380 sample plots that were randomly sampled. Individual tree parameters such as DBH and tree height were collected to calculate the above ground volume, biomass and carbon for different forest types. The SPOT6 2013 satellite data was used in the study to obtain five vegetation indices NDVI, RDVI, MSR, RVI, and EVI. The relationships between the forest carbon stock and vegetation indices were investigated using a multiple linear regression analysis. R-square, RMSE values and cross-validation were used to measure the strength and validate the performance of the models. The methodology presented here demonstrates the possibility of estimating forest volume, biomass and carbon stock. It can also be further improved by addressing more spectral bands data and/or elevation.
A Metric on Phylogenetic Tree Shapes.
Colijn, C; Plazzotta, G
2018-01-01
The shapes of evolutionary trees are influenced by the nature of the evolutionary process but comparisons of trees from different processes are hindered by the challenge of completely describing tree shape. We present a full characterization of the shapes of rooted branching trees in a form that lends itself to natural tree comparisons. We use this characterization to define a metric, in the sense of a true distance function, on tree shapes. The metric distinguishes trees from random models known to produce different tree shapes. It separates trees derived from tropical versus USA influenza A sequences, which reflect the differing epidemiology of tropical and seasonal flu. We describe several metrics based on the same core characterization, and illustrate how to extend the metric to incorporate trees' branch lengths or other features such as overall imbalance. Our approach allows us to construct addition and multiplication on trees, and to create a convex metric on tree shapes which formally allows computation of average tree shapes. © The Author(s) 2017. Published by Oxford University Press, on behalf of the Society of Systematic Biologists.
NIMEFI: gene regulatory network inference using multiple ensemble feature importance algorithms.
Ruyssinck, Joeri; Huynh-Thu, Vân Anh; Geurts, Pierre; Dhaene, Tom; Demeester, Piet; Saeys, Yvan
2014-01-01
One of the long-standing open challenges in computational systems biology is the topology inference of gene regulatory networks from high-throughput omics data. Recently, two community-wide efforts, DREAM4 and DREAM5, have been established to benchmark network inference techniques using gene expression measurements. In these challenges the overall top performer was the GENIE3 algorithm. This method decomposes the network inference task into separate regression problems for each gene in the network in which the expression values of a particular target gene are predicted using all other genes as possible predictors. Next, using tree-based ensemble methods, an importance measure for each predictor gene is calculated with respect to the target gene and a high feature importance is considered as putative evidence of a regulatory link existing between both genes. The contribution of this work is twofold. First, we generalize the regression decomposition strategy of GENIE3 to other feature importance methods. We compare the performance of support vector regression, the elastic net, random forest regression, symbolic regression and their ensemble variants in this setting to the original GENIE3 algorithm. To create the ensemble variants, we propose a subsampling approach which allows us to cast any feature selection algorithm that produces a feature ranking into an ensemble feature importance algorithm. We demonstrate that the ensemble setting is key to the network inference task, as only ensemble variants achieve top performance. As second contribution, we explore the effect of using rankwise averaged predictions of multiple ensemble algorithms as opposed to only one. We name this approach NIMEFI (Network Inference using Multiple Ensemble Feature Importance algorithms) and show that this approach outperforms all individual methods in general, although on a specific network a single method can perform better. An implementation of NIMEFI has been made publicly available.
Multiple filters affect tree species assembly in mid-latitude forest communities.
Kubota, Y; Kusumoto, B; Shiono, T; Ulrich, W
2018-05-01
Species assembly patterns of local communities are shaped by the balance between multiple abiotic/biotic filters and dispersal that both select individuals from species pools at the regional scale. Knowledge regarding functional assembly can provide insight into the relative importance of the deterministic and stochastic processes that shape species assembly. We evaluated the hierarchical roles of the α niche and β niches by analyzing the influence of environmental filtering relative to functional traits on geographical patterns of tree species assembly in mid-latitude forests. Using forest plot datasets, we examined the α niche traits (leaf and wood traits) and β niche properties (cold/drought tolerance) of tree species, and tested non-randomness (clustering/over-dispersion) of trait assembly based on null models that assumed two types of species pools related to biogeographical regions. For most plots, species assembly patterns fell within the range of random expectation. However, particularly for cold/drought tolerance-related β niche properties, deviation from randomness was frequently found; non-random clustering was predominant in higher latitudes with harsh climates. Our findings demonstrate that both randomness and non-randomness in trait assembly emerged as a result of the α and β niches, although we suggest the potential role of dispersal processes and/or species equalization through trait similarities in generating the prevalence of randomness. Clustering of β niche traits along latitudinal climatic gradients provides clear evidence of species sorting by filtering particular traits. Our results reveal that multiple filters through functional niches and stochastic processes jointly shape geographical patterns of species assembly across mid-latitude forests.
Hybrid detection of lung nodules on CT scan images
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lu, Lin; Tan, Yongqiang; Schwartz, Lawrence H.
Purpose: The diversity of lung nodules poses difficulty for the current computer-aided diagnostic (CAD) schemes for lung nodule detection on computed tomography (CT) scan images, especially in large-scale CT screening studies. We proposed a novel CAD scheme based on a hybrid method to address the challenges of detection in diverse lung nodules. Methods: The hybrid method proposed in this paper integrates several existing and widely used algorithms in the field of nodule detection, including morphological operation, dot-enhancement based on Hessian matrix, fuzzy connectedness segmentation, local density maximum algorithm, geodesic distance map, and regression tree classification. All of the adopted algorithmsmore » were organized into tree structures with multi-nodes. Each node in the tree structure aimed to deal with one type of lung nodule. Results: The method has been evaluated on 294 CT scans from the Lung Image Database Consortium (LIDC) dataset. The CT scans were randomly divided into two independent subsets: a training set (196 scans) and a test set (98 scans). In total, the 294 CT scans contained 631 lung nodules, which were annotated by at least two radiologists participating in the LIDC project. The sensitivity and false positive per scan for the training set were 87% and 2.61%. The sensitivity and false positive per scan for the testing set were 85.2% and 3.13%. Conclusions: The proposed hybrid method yielded high performance on the evaluation dataset and exhibits advantages over existing CAD schemes. We believe that the present method would be useful for a wide variety of CT imaging protocols used in both routine diagnosis and screening studies.« less
Tisné, Sébastien; Pomiès, Virginie; Riou, Virginie; Syahputra, Indra; Cochard, Benoît; Denis, Marie
2017-06-07
Multi-parental populations are promising tools for identifying quantitative disease resistance loci. Stem rot caused by Ganoderma boninense is a major threat to palm oil production, with yield losses of up to 80% prompting premature replantation of palms. There is evidence of genetic resistance sources, but the genetic architecture of Ganoderma resistance has not yet been investigated. This study aimed to identify Ganoderma resistance loci using an oil palm multi-parental population derived from nine major founders of ongoing breeding programs. A total of 1200 palm trees of the multi-parental population was planted in plots naturally infected by Ganoderma , and their health status was assessed biannually over 25 yr. The data were treated as survival data, and modeled using the Cox regression model, including a spatial effect to take the spatial component in the spread of Ganoderma into account. Based on the genotypes of 757 palm trees out of the 1200 planted, and on pedigree information, resistance loci were identified using a random effect with identity-by-descent kinship matrices as covariance matrices in the Cox model. Four Ganoderma resistance loci were identified, two controlling the occurrence of the first Ganoderma symptoms, and two the death of palm trees, while favorable haplotypes were identified among a major gene pool for ongoing breeding programs. This study implemented an efficient and flexible QTL mapping approach, and generated unique valuable information for the selection of oil palm varieties resistant to Ganoderma disease. Copyright © 2017 Tisné et al.
Biomass expansion factor and root-to-shoot ratio for Pinus in Brazil.
Sanquetta, Carlos R; Corte, Ana Pd; da Silva, Fernando
2011-09-24
The Biomass Expansion Factor (BEF) and the Root-to-Shoot Ratio (R) are variables used to quantify carbon stock in forests. They are often considered as constant or species/area specific values in most studies. This study aimed at showing tree size and age dependence upon BEF and R and proposed equations to improve forest biomass and carbon stock. Data from 70 sample Pinus spp. grown in southern Brazil trees in different diameter classes and ages were used to demonstrate the correlation between BEF and R, and forest inventory data, such as DBH, tree height and age. Total dry biomass, carbon stock and CO2 equivalent were simulated using the IPCC default values of BEF and R, corresponding average calculated from data used in this study, as well as the values estimated by regression equations. The mean values of BEF and R calculated in this study were 1.47 and 0.17, respectively. The relationship between BEF and R and the tree measurement variables were inversely related with negative exponential behavior. Simulations indicated that use of fixed values of BEF and R, either IPCC default or current average data, may lead to unreliable estimates of carbon stock inventories and CDM projects. It was concluded that accounting for the variations in BEF and R and using regression equations to relate them to DBH, tree height and age, is fundamental in obtaining reliable estimates of forest tree biomass, carbon sink and CO2 equivalent.
Building Diversified Multiple Trees for classification in high dimensional noisy biomedical data.
Li, Jiuyong; Liu, Lin; Liu, Jixue; Green, Ryan
2017-12-01
It is common that a trained classification model is applied to the operating data that is deviated from the training data because of noise. This paper will test an ensemble method, Diversified Multiple Tree (DMT), on its capability for classifying instances in a new laboratory using the classifier built on the instances of another laboratory. DMT is tested on three real world biomedical data sets from different laboratories in comparison with four benchmark ensemble methods, AdaBoost, Bagging, Random Forests, and Random Trees. Experiments have also been conducted on studying the limitation of DMT and its possible variations. Experimental results show that DMT is significantly more accurate than other benchmark ensemble classifiers on classifying new instances of a different laboratory from the laboratory where instances are used to build the classifier. This paper demonstrates that an ensemble classifier, DMT, is more robust in classifying noisy data than other widely used ensemble methods. DMT works on the data set that supports multiple simple trees.
2014-01-01
Background Meta-regression is becoming increasingly used to model study level covariate effects. However this type of statistical analysis presents many difficulties and challenges. Here two methods for calculating confidence intervals for the magnitude of the residual between-study variance in random effects meta-regression models are developed. A further suggestion for calculating credible intervals using informative prior distributions for the residual between-study variance is presented. Methods Two recently proposed and, under the assumptions of the random effects model, exact methods for constructing confidence intervals for the between-study variance in random effects meta-analyses are extended to the meta-regression setting. The use of Generalised Cochran heterogeneity statistics is extended to the meta-regression setting and a Newton-Raphson procedure is developed to implement the Q profile method for meta-analysis and meta-regression. WinBUGS is used to implement informative priors for the residual between-study variance in the context of Bayesian meta-regressions. Results Results are obtained for two contrasting examples, where the first example involves a binary covariate and the second involves a continuous covariate. Intervals for the residual between-study variance are wide for both examples. Conclusions Statistical methods, and R computer software, are available to compute exact confidence intervals for the residual between-study variance under the random effects model for meta-regression. These frequentist methods are almost as easily implemented as their established counterparts for meta-analysis. Bayesian meta-regressions are also easily performed by analysts who are comfortable using WinBUGS. Estimates of the residual between-study variance in random effects meta-regressions should be routinely reported and accompanied by some measure of their uncertainty. Confidence and/or credible intervals are well-suited to this purpose. PMID:25196829
Fire frequency in the Interior Columbia River Basin: Building regional models from fire history data
McKenzie, D.; Peterson, D.L.; Agee, James K.
2000-01-01
Fire frequency affects vegetation composition and successional pathways; thus it is essential to understand fire regimes in order to manage natural resources at broad spatial scales. Fire history data are lacking for many regions for which fire management decisions are being made, so models are needed to estimate past fire frequency where local data are not yet available. We developed multiple regression models and tree-based (classification and regression tree, or CART) models to predict fire return intervals across the interior Columbia River basin at 1-km resolution, using georeferenced fire history, potential vegetation, cover type, and precipitation databases. The models combined semiqualitative methods and rigorous statistics. The fire history data are of uneven quality; some estimates are based on only one tree, and many are not cross-dated. Therefore, we weighted the models based on data quality and performed a sensitivity analysis of the effects on the models of estimation errors that are due to lack of cross-dating. The regression models predict fire return intervals from 1 to 375 yr for forested areas, whereas the tree-based models predict a range of 8 to 150 yr. Both types of models predict latitudinal and elevational gradients of increasing fire return intervals. Examination of regional-scale output suggests that, although the tree-based models explain more of the variation in the original data, the regression models are less likely to produce extrapolation errors. Thus, the models serve complementary purposes in elucidating the relationships among fire frequency, the predictor variables, and spatial scale. The models can provide local managers with quantitative information and provide data to initialize coarse-scale fire-effects models, although predictions for individual sites should be treated with caution because of the varying quality and uneven spatial coverage of the fire history database. The models also demonstrate the integration of qualitative and quantitative methods when requisite data for fully quantitative models are unavailable. They can be tested by comparing new, independent fire history reconstructions against their predictions and can be continually updated, as better fire history data become available.
Chen, Xuexia; Liu, Shuguang; Zhu, Zhiliang; Vogelmann, James E.; Li, Zhengpeng; Ohlen, Donald O.
2011-01-01
The concentrations of CO2 and other greenhouse gases in the atmosphere have been increasing and greatly affecting global climate and socio-economic systems. Actively growing forests are generally considered to be a major carbon sink, but forest wildfires lead to large releases of biomass carbon into the atmosphere. Aboveground forest biomass carbon (AFBC), an important ecological indicator, and fire-induced carbon emissions at regional scales are highly relevant to forest sustainable management and climate change. It is challenging to accurately estimate the spatial distribution of AFBC across large areas because of the spatial heterogeneity of forest cover types and canopy structure. In this study, Forest Inventory and Analysis (FIA) data, Landsat, and Landscape Fire and Resource Management Planning Tools Project (LANDFIRE) data were integrated in a regression tree model for estimating AFBC at a 30-m resolution in the Utah High Plateaus. AFBC were calculated from 225 FIA field plots and used as the dependent variable in the model. Of these plots, 10% were held out for model evaluation with stratified random sampling, and the other 90% were used as training data to develop the regression tree model. Independent variable layers included Landsat imagery and the derived spectral indicators, digital elevation model (DEM) data and derivatives, biophysical gradient data, existing vegetation cover type and vegetation structure. The cross-validation correlation coefficient (r value) was 0.81 for the training model. Independent validation using withheld plot data was similar with r value of 0.82. This validated regression tree model was applied to map AFBC in the Utah High Plateaus and then combined with burn severity information to estimate loss of AFBC in the Longston fire of Zion National Park in 2001. The final dataset represented 24 forest cover types for a 4 million ha forested area. We estimated a total of 353 Tg AFBC with an average of 87 MgC/ha in the Utah High Plateaus. We also estimated that 8054 Mg AFBC were released from 2.24 km2 burned forest area in the Longston fire. These results demonstrate that an AFBC spatial map and estimated biomass carbon consumption can readily be generated using existing database. The methodology provides a consistent, practical, and inexpensive way for estimating AFBC at 30-m resolution over large areas throughout the United States.
National scale biomass estimators for United States tree species
Jennifer C. Jenkins; David C. Chojnacky; Linda S. Heath; Richard A. Birdsey
2003-01-01
Estimates of national-scale forest carbon (C) stocks and fluxes are typically based on allometric regression equations developed using dimensional analysis techniques. However, the literature is inconsistent and incomplete with respect to large-scale forest C estimation. We compiled all available diameter-based allometric regression equations for estimating total...
NASA Astrophysics Data System (ADS)
Dobrowski, S. Z.; Greenberg, J. A.; Schladow, G.
2006-12-01
There is evidence from the Sierra Nevada that sub-alpine and alpine environments are currently experiencing landscape-mediated changes in growth and recruitment due to recent climate change. Understanding the biophysical controls of forest structure, growth, and recruitment in these environments is critical for interpreting and predicting the direction and magnitude of biotic responses to climate shift. We examined the abiotic controls of forest biomass within a 305 km2 region of the Carson Range on the eastern shore of Lake Tahoe, CA USA using estimates of forest structure and biophysical drivers developed continuously over the landscape. The study area ranged from 1900 m to 3400 m a.s.l. and encompassed montane, sub-alpine, and alpine environments. From hyperspatial optical imagery (IKONOS), we derived per-tree positions and crown sizes using a template matching approach applied to a pre-classified image of sunlit and shadowed vegetation pixels. From this remote sensing derived stem map, we calculated plot-level estimates of stem density, tree cover and average crown size. Additionally, we developed high resolution (30 m) estimates of climate variables within the study area using meteorological station data, topographic data, and a combination of empirical and mechanistic modeling approaches. From these climate surfaces, digital elevation data, and soil survey data, we derived estimates of direct and indirect biophysical drivers including heat loading, reference evapotranspiration, water deficit, solar radiation, topographic convergence, soil depth, and soil water holding capacity. Using these data sets, we conducted a regression tree analysis with stem density, tree cover, and average tree size as response and biophysical drivers as predictors. Trees were fit using half of the dataset randomly sampled (168,000 samples) and pruned using cost-complexity pruning based on 10-fold cross- validation. Predictions from pruned trees were then assessed against the hold-out data. Preliminary results from this analysis suggest that: 1) the relative importance and dependencies of biophysical drivers on forest structure are contingent upon the position of these forests along gradients of a limiting resource, 2) stem density shows a stronger dependence on water availability than tree size and 3) that the predictive power of abiotic variables are limited with our best models accounting for only 36-40 percent of the variance in the response. These results suggest that the response of forest structure to climate change may be highly idiosyncratic and difficult to predict using abiotic drivers alone.
Du, Ning; Fan, Jintu; Chen, Shuo; Liu, Yang
2008-07-21
Although recent investigations [Ryan, M.G., Yoder, B.J., 1997. Hydraulic limits to tree height and tree growth. Bioscience 47, 235-242; Koch, G.W., Sillett, S.C.,Jennings, G.M.,Davis, S.D., 2004. The limits to tree height. Nature 428, 851-854; Niklas, K.J., Spatz, H., 2004. Growth and hydraulic (not mechanical) constraints govern the scaling of tree height and mass. Proc. Natl Acad. Sci. 101, 15661-15663; Ryan, M.G., Phillips, N., Bond, B.J., 2006. Hydraulic limitation hypothesis revisited. Plant Cell Environ. 29, 367-381; Niklas, K.J., 2007. Maximum plant height and the biophysical factors that limit it. Tree Physiol. 27, 433-440; Burgess, S.S.O., Dawson, T.E., 2007. Predicting the limits to tree height using statistical regressions of leaf traits. New Phytol. 174, 626-636] suggested that the hydraulic limitation hypothesis (HLH) is the most plausible theory to explain the biophysical limits to maximum tree height and the decline in tree growth rate with age, the analysis is largely qualitative or based on statistical regression. Here we present an integrated biophysical model based on the principle that trees develop physiological compensations (e.g. the declined leaf water potential and the tapering of conduits with heights [West, G.B., Brown, J.H., Enquist, B.J., 1999. A general model for the structure and allometry of plant vascular systems. Nature 400, 664-667]) to resist the increasing water stress with height, the classical HLH and the biochemical limitations on photosynthesis [von Caemmerer, S., 2000. Biochemical Models of Leaf Photosynthesis. CSIRO Publishing, Australia]. The model has been applied to the tallest trees in the world (viz. Coast redwood (Sequoia sempervirens)). Xylem water potential, leaf carbon isotope composition, leaf mass to area ratio at different heights derived from the model show good agreements with the experimental measurements of Koch et al. [2004. The limits to tree height. Nature 428, 851-854]. The model also well explains the universal trend of declining growth rate with age.
NASA Astrophysics Data System (ADS)
Assouline, Dan; Mohajeri, Nahid; Scartezzini, Jean-Louis
2017-04-01
Solar energy is clean, widely available, and arguably the most promising renewable energy resource. Taking full advantage of solar power, however, requires a deep understanding of its patterns and dependencies in space and time. The recent advances in Machine Learning brought powerful algorithms to estimate the spatio-temporal variations of solar irradiance (the power per unit area received from the Sun, W/m2), using local weather and terrain information. Such algorithms include Deep Learning (e.g. Artificial Neural Networks), or kernel methods (e.g. Support Vector Machines). However, most of these methods have some disadvantages, as they: (i) are complex to tune, (ii) are mainly used as a black box and offering no interpretation on the variables contributions, (iii) often do not provide uncertainty predictions (Assouline et al., 2016). To provide a reasonable solar mapping with good accuracy, these gaps would ideally need to be filled. We present here simple steps using one ensemble learning algorithm namely, Random Forests (Breiman, 2001) to (i) estimate monthly solar potential with good accuracy, (ii) provide information on the contribution of each feature in the estimation, and (iii) offer prediction intervals for each point estimate. We have selected Switzerland as an example. Using a Digital Elevation Model (DEM) along with monthly solar irradiance time series and weather data, we build monthly solar maps for Global Horizontal Irradiance (GHI), Diffuse Horizontal Irradiance (GHI), and Extraterrestrial Irradiance (EI). The weather data include monthly values for temperature, precipitation, sunshine duration, and cloud cover. In order to explain the impact of each feature on the solar irradiance of each point estimate, we extend the contribution method (Kuz'min et al., 2011) to a regression setting. Contribution maps for all features can then be computed for each solar map. This provides precious information on the spatial variation of the features impact all across Switzerland maps. Finally, as RFs are based on bootstrap samples of the training data, they can produce prediction intervals by looking at the trees estimates distribution, instead of taking the mean estimate. To do so, a simple idea is to grow all trees fully so that each leaf has exactly one value, that is, a training sample value. Then, for each point estimate, we compute percentiles of the trees estimates data to build a prediction interval. Two issues arise from this process: (i) growing the trees fully is not always possible, and (ii) there is a risk of over-fitting. We show how to solve them. These steps can be used for any type of environmental mapping so as to extract useful information on uncertainty and feature impact interpretation. References -Assouline, D., Mohajeri, N., & Scartezzini, J. L. (2017). Quantifying rooftop photovoltaic solar energy potential: A machine learning approach. Solar Energy, 141, 278-296. -Breiman, L. (2001). Random forests. Machine learning, 45(1), 5-32. -Kuz'min, V. E., Polishchuk, P. G., Artemenko, A. G., & Andronati, S. A. (2011). Interpretation of QSAR models based on random forest methods. Molecular Informatics, 30(6-7), 593-603.
Ponderosa pine mortality resulting from a mountain pine beetle outbreak
William F. McCambridge; Frank G. Hawksworth; Carleton B. Edminster; John G. Laut
1982-01-01
From 1965 to 1978, mountain pine beetles killed 25% of the pines taller than 4.5 feet in a study area in north-central Colorado. Average basal area was reduced from 92 to 58 square feet per acre. Mortality increased with tree diameter up to about 9 inches d.b.h. Larger trees appeared to be killed at random. Mortality was directly related to number of trees per acre and...
Fabian C.C. Uzoh; Martin W. Ritchie
1996-01-01
The equations presented predict crown area for 13 species of trees and shrubs which may be found growing in competition with commercial conifers during early stages of stand development. The equations express crown area as a function of basal area and height. Parameters were estimated for each species individually using weighted nonlinear least square regression.
Biomass equations for major tree species of the Northeast
Louise M. Tritton; James W. Hornbeck
1982-01-01
Regression equations are used in both forestry and ecosystem studies to estimate tree biomass from field measurements of dbh (diameter at breast height) or a combination of dbh and height. Literature on biomass is reviewed, and 178 sets of publish equation for 25 species common to the Northeastern Unites States are listed. On the basis of these equations, estimates of...
Stand basal-area and tree-diameter growth in red spruce-fir forests in Maine, 1960-80
S.J. Zarnoch; D.A. Gansner; D.S. Powell; T.A. Birch; T.A. Birch
1990-01-01
Stand basal-area change and individual surviving red spruce d.b.h. growth from 1960 to 1980 were analyzed for red spruce-fir stands in Maine. Regression modeling was used to relate these measures of growth to stand and tree conditions and to compare growth throughout the period. Results indicate a decline in growth.
Examination of the Arborsonic Decay Detector for Detecting Bacterial Wetwood in Red Oaks
Zicai Xu; Theodor D. Leininger; James G. Williams; Frank H. Tainter
2000-01-01
The Arborsonic Decay Detector (ADD; Fujikura Europe Limited, Wiltshire, England) was used to measure the time it took an ultrasound wave to cross 280 diameters in red oak trees with varying degrees of bacterial wetwood or heartwood decay. Linear regressions derived from the ADD readings of trees in Mississippi and South Carolina with wetwood and heartwood decay...
A machine learning system to improve heart failure patient assistance.
Guidi, Gabriele; Pettenati, Maria Chiara; Melillo, Paolo; Iadanza, Ernesto
2014-11-01
In this paper, we present a clinical decision support system (CDSS) for the analysis of heart failure (HF) patients, providing various outputs such as an HF severity evaluation, HF-type prediction, as well as a management interface that compares the different patients' follow-ups. The whole system is composed of a part of intelligent core and of an HF special-purpose management tool also providing the function to act as interface for the artificial intelligence training and use. To implement the smart intelligent functions, we adopted a machine learning approach. In this paper, we compare the performance of a neural network (NN), a support vector machine, a system with fuzzy rules genetically produced, and a classification and regression tree and its direct evolution, which is the random forest, in analyzing our database. Best performances in both HF severity evaluation and HF-type prediction functions are obtained by using the random forest algorithm. The management tool allows the cardiologist to populate a "supervised database" suitable for machine learning during his or her regular outpatient consultations. The idea comes from the fact that in literature there are a few databases of this type, and they are not scalable to our case.
NASA Astrophysics Data System (ADS)
Kisi, Ozgur; Parmar, Kulwinder Singh
2016-03-01
This study investigates the accuracy of least square support vector machine (LSSVM), multivariate adaptive regression splines (MARS) and M5 model tree (M5Tree) in modeling river water pollution. Various combinations of water quality parameters, Free Ammonia (AMM), Total Kjeldahl Nitrogen (TKN), Water Temperature (WT), Total Coliform (TC), Fecal Coliform (FC) and Potential of Hydrogen (pH) monitored at Nizamuddin, Delhi Yamuna River in India were used as inputs to the applied models. Results indicated that the LSSVM and MARS models had almost same accuracy and they performed better than the M5Tree model in modeling monthly chemical oxygen demand (COD). The average root mean square error (RMSE) of the LSSVM and M5Tree models was decreased by 1.47% and 19.1% using MARS model, respectively. Adding TC input to the models did not increase their accuracy in modeling COD while adding FC and pH inputs to the models generally decreased the accuracy. The overall results indicated that the MARS and LSSVM models could be successfully used in estimating monthly river water pollution level by using AMM, TKN and WT parameters as inputs.
Avelino, Jacques; Cabut, Sandrine; Barboza, Bernardo; Barquero, Miguel; Alfaro, Ronny; Esquivel, César; Durand, Jean-François; Cilas, Christian
2007-12-01
ABSTRACT We monitored the development of American leaf spot of coffee, a disease caused by the gemmiferous fungus Mycena citricolor, in 57 plots in Costa Rica for 1 or 2 years in order to gain a clearer understanding of conditions conducive to the disease and improve its control. During the investigation, characteristics of the coffee trees, crop management, and the environment were recorded. For the analyses, we used partial least-squares regression via the spline functions (PLSS), which is a nonlinear extension to partial least-squares regression (PLS). The fungus developed well in areas located between approximately 1,100 and 1,550 m above sea level. Slopes were conducive to its development, but eastern-facing slopes were less affected than the others, probably because they were more exposed to sunlight, especially in the rainy season. The distance between planting rows, the shade percentage, coffee tree height, the type of shade, and the pruning system explained disease intensity due to their effects on coffee tree shading and, possibly, on the humidity conditions in the plot. Forest trees and fruit trees intercropped with coffee provided particularly propitious conditions. Apparently, fertilization was unfavorable for the disease, probably due to dilution phenomena associated with faster coffee tree growth. Finally, series of wet spells interspersed with dry spells, which were frequent in the middle of the rainy season, were critical for the disease, probably because they affected the production and release of gemmae and their viability. These results could be used to draw up a map of epidemic risks taking topographical factors into account. To reduce those risks and improve chemical control, our results suggested that farmers should space planting rows further apart, maintain light shading in the plantation, and prune their coffee trees.
Chilling and heat requirements for flowering in temperate fruit trees
NASA Astrophysics Data System (ADS)
Guo, Liang; Dai, Junhu; Ranjitkar, Sailesh; Yu, Haiying; Xu, Jianchu; Luedeling, Eike
2014-08-01
Climate change has affected the rates of chilling and heat accumulation, which are vital for flowering and production, in temperate fruit trees, but few studies have been conducted in the cold-winter climates of East Asia. To evaluate tree responses to variation in chill and heat accumulation rates, partial least squares regression was used to correlate first flowering dates of chestnut ( Castanea mollissima Blume) and jujube ( Zizyphus jujube Mill.) in Beijing, China, with daily chill and heat accumulation between 1963 and 2008. The Dynamic Model and the Growing Degree Hour Model were used to convert daily records of minimum and maximum temperature into horticulturally meaningful metrics. Regression analyses identified the chilling and forcing periods for chestnut and jujube. The forcing periods started when half the chilling requirements were fulfilled. Over the past 50 years, heat accumulation during tree dormancy increased significantly, while chill accumulation remained relatively stable for both species. Heat accumulation was the main driver of bloom timing, with effects of variation in chill accumulation negligible in Beijing's cold-winter climate. It does not seem likely that reductions in chill will have a major effect on the studied species in Beijing in the near future. Such problems are much more likely for trees grown in locations that are substantially warmer than their native habitats, such as temperate species in the subtropics and tropics.
Chilling and heat requirements for flowering in temperate fruit trees.
Guo, Liang; Dai, Junhu; Ranjitkar, Sailesh; Yu, Haiying; Xu, Jianchu; Luedeling, Eike
2014-08-01
Climate change has affected the rates of chilling and heat accumulation, which are vital for flowering and production, in temperate fruit trees, but few studies have been conducted in the cold-winter climates of East Asia. To evaluate tree responses to variation in chill and heat accumulation rates, partial least squares regression was used to correlate first flowering dates of chestnut (Castanea mollissima Blume) and jujube (Zizyphus jujube Mill.) in Beijing, China, with daily chill and heat accumulation between 1963 and 2008. The Dynamic Model and the Growing Degree Hour Model were used to convert daily records of minimum and maximum temperature into horticulturally meaningful metrics. Regression analyses identified the chilling and forcing periods for chestnut and jujube. The forcing periods started when half the chilling requirements were fulfilled. Over the past 50 years, heat accumulation during tree dormancy increased significantly, while chill accumulation remained relatively stable for both species. Heat accumulation was the main driver of bloom timing, with effects of variation in chill accumulation negligible in Beijing’s cold-winter climate. It does not seem likely that reductions in chill will have a major effect on the studied species in Beijing in the near future. Such problems are much more likely for trees grown in locations that are substantially warmer than their native habitats, such as temperate species in the subtropics and tropics.
Mehra, Lucky K; Cowger, Christina; Gross, Kevin; Ojiambo, Peter S
2016-01-01
Pre-planting factors have been associated with the late-season severity of Stagonospora nodorum blotch (SNB), caused by the fungal pathogen Parastagonospora nodorum, in winter wheat (Triticum aestivum). The relative importance of these factors in the risk of SNB has not been determined and this knowledge can facilitate disease management decisions prior to planting of the wheat crop. In this study, we examined the performance of multiple regression (MR) and three machine learning algorithms namely artificial neural networks, categorical and regression trees, and random forests (RF), in predicting the pre-planting risk of SNB in wheat. Pre-planting factors tested as potential predictor variables were cultivar resistance, latitude, longitude, previous crop, seeding rate, seed treatment, tillage type, and wheat residue. Disease severity assessed at the end of the growing season was used as the response variable. The models were developed using 431 disease cases (unique combinations of predictors) collected from 2012 to 2014 and these cases were randomly divided into training, validation, and test datasets. Models were evaluated based on the regression of observed against predicted severity values of SNB, sensitivity-specificity ROC analysis, and the Kappa statistic. A strong relationship was observed between late-season severity of SNB and specific pre-planting factors in which latitude, longitude, wheat residue, and cultivar resistance were the most important predictors. The MR model explained 33% of variability in the data, while machine learning models explained 47 to 79% of the total variability. Similarly, the MR model correctly classified 74% of the disease cases, while machine learning models correctly classified 81 to 83% of these cases. Results show that the RF algorithm, which explained 79% of the variability within the data, was the most accurate in predicting the risk of SNB, with an accuracy rate of 93%. The RF algorithm could allow early assessment of the risk of SNB, facilitating sound disease management decisions prior to planting of wheat.
Over- and undersupply in home care: a representative multicenter correlational study.
Lahmann, Nils A; Suhr, Ralf; Kuntz, Simone; Kottner, Jan
2015-04-01
Quality assurance and funding of care become a major challenge against the background of demographic changes in western societies. The primary aim of the study was to identify possible misclassification, respectively over and undersupply of care by comparing the Barthel Index of clients of home care service with the level of care (Stage 0, I, II, III) according to the statutory German long-term care insurance. In 2012, a multi-center point prevalence study of 878 randomly selected clients of 100 randomly selected home care services across Germany was conducted. According to a standardized study protocol, demographics, the Barthel Index and the nurses' professional judgment-whether a client requires more nursing care-were assessed. Associations of the Barthel items and professional judgment were analyzed using univariate (Chi-square) and multivariate (logistic regression and classification-regression-tree-models) statistics. In each level of care, the Barthel Index showed large variability e.g. in level II ranging from 0 to 100 points. Multivariate logistic regression regarding possible under- and oversupply revealed occasionally fecal incontinence (2.1; 95 % CI 1.2-3.7), urinary incontinence (2.0; 95 % CI 1.1-3.6), feeding (1.7; 95 % CI 1.0-2.9), immobility (0.2; 95 % CI 0.1-0.6) and to be female (1.8; 95 % CI 1.2-2.6) to be statistically significantly associated. The variability in Barthel Index in each level of care found in this study indicated a large general misclassification of home care clients according to their actual need of care. Professional caregivers identified occasional incontinence, help with eating and drinking and mobility (especially in female clients) as areas of possible under- and oversupply of care. The statutory German long-term care insurance classification should be modified according to the above finding to increase the quality of care in home care clients.
Efforts are increasingly being made to classify the world’s wetland resources, an important ecosystem and habitat that is diminishing in abundance. There are multiple remote sensing classification methods, including a suite of nonparametric classifiers such as decision-tree...
Measurement variability error for estimates of volume change
James A. Westfall; Paul L. Patterson
2007-01-01
Using quality assurance data, measurement variability distributions were developed for attributes that affect tree volume prediction. Random deviations from the measurement variability distributions were applied to 19381 remeasured sample trees in Maine. The additional error due to measurement variation and measurement bias was estimated via a simulation study for...
The fast algorithm of spark in compressive sensing
NASA Astrophysics Data System (ADS)
Xie, Meihua; Yan, Fengxia
2017-01-01
Compressed Sensing (CS) is an advanced theory on signal sampling and reconstruction. In CS theory, the reconstruction condition of signal is an important theory problem, and spark is a good index to study this problem. But the computation of spark is NP hard. In this paper, we study the problem of computing spark. For some special matrixes, for example, the Gaussian random matrix and 0-1 random matrix, we obtain some conclusions. Furthermore, for Gaussian random matrix with fewer rows than columns, we prove that its spark equals to the number of its rows plus one with probability 1. For general matrix, two methods are given to compute its spark. One is the method of directly searching and the other is the method of dual-tree searching. By simulating 24 Gaussian random matrixes and 18 0-1 random matrixes, we tested the computation time of these two methods. Numerical results showed that the dual-tree searching method had higher efficiency than directly searching, especially for those matrixes which has as much as rows and columns.
NASA Astrophysics Data System (ADS)
Bouffon, T.; Rice, R.; Bales, R.
2006-12-01
The spatial distributions of snow water equivalent (SWE) and snow depth within a 1, 4, and 16 km2 grid element around two automated snow pillows in a forested and open- forested region of the Upper Merced River Basin (2,800 km2) of Yosemite National Park were characterized using field observations and analyzed using binary regression trees. Snow surveys occurred at the forested site during the accumulation and ablation seasons, while at the open-forest site a survey was performed only during the accumulation season. An average of 130 snow depth and 7 snow density measurements were made on each survey, within the 4 km2 grid. Snow depth was distributed using binary regression trees and geostatistical methods using the physiographic parameters (e.g. elevation, slope, vegetation, aspect). Results in the forest region indicate that the snow pillow overestimated average SWE within the 1, 4, and 16 km2 areas by 34 percent during ablation, but during accumulation the snow pillow provides a good estimate of the modeled mean SWE grid value, however it is suspected that the snow pillow was underestimating SWE. However, at the open forest site, during accumulation, the snow pillow was 28 percent greater than the mean modeled grid element. In addition, the binary regression trees indicate that the independent variables of vegetation, slope, and aspect are the most influential parameters of snow depth distribution. The binary regression tree and multivariate linear regression models explain about 60 percent of the initial variance for snow depth and 80 percent for density, respectively. This short-term study provides motivation and direction for the installation of a distributed snow measurement network to fill the information gap in basin-wide SWE and snow depth measurements. Guided by these results, a distributed snow measurement network was installed in the Fall 2006 at Gin Flat in the Upper Merced River Basin with the specific objective of measuring accumulation and ablation across topographic variables with the aim of providing guidance for future larger scale observation network designs.
Ensemble Pruning for Glaucoma Detection in an Unbalanced Data Set.
Adler, Werner; Gefeller, Olaf; Gul, Asma; Horn, Folkert K; Khan, Zardad; Lausen, Berthold
2016-12-07
Random forests are successful classifier ensemble methods consisting of typically 100 to 1000 classification trees. Ensemble pruning techniques reduce the computational cost, especially the memory demand, of random forests by reducing the number of trees without relevant loss of performance or even with increased performance of the sub-ensemble. The application to the problem of an early detection of glaucoma, a severe eye disease with low prevalence, based on topographical measurements of the eye background faces specific challenges. We examine the performance of ensemble pruning strategies for glaucoma detection in an unbalanced data situation. The data set consists of 102 topographical features of the eye background of 254 healthy controls and 55 glaucoma patients. We compare the area under the receiver operating characteristic curve (AUC), and the Brier score on the total data set, in the majority class, and in the minority class of pruned random forest ensembles obtained with strategies based on the prediction accuracy of greedily grown sub-ensembles, the uncertainty weighted accuracy, and the similarity between single trees. To validate the findings and to examine the influence of the prevalence of glaucoma in the data set, we additionally perform a simulation study with lower prevalences of glaucoma. In glaucoma classification all three pruning strategies lead to improved AUC and smaller Brier scores on the total data set with sub-ensembles as small as 30 to 80 trees compared to the classification results obtained with the full ensemble consisting of 1000 trees. In the simulation study, we were able to show that the prevalence of glaucoma is a critical factor and lower prevalence decreases the performance of our pruning strategies. The memory demand for glaucoma classification in an unbalanced data situation based on random forests could effectively be reduced by the application of pruning strategies without loss of performance in a population with increased risk of glaucoma.
On the error probability of general tree and trellis codes with applications to sequential decoding
NASA Technical Reports Server (NTRS)
Johannesson, R.
1973-01-01
An upper bound on the average error probability for maximum-likelihood decoding of the ensemble of random binary tree codes is derived and shown to be independent of the length of the tree. An upper bound on the average error probability for maximum-likelihood decoding of the ensemble of random L-branch binary trellis codes of rate R = 1/n is derived which separates the effects of the tail length T and the memory length M of the code. It is shown that the bound is independent of the length L of the information sequence. This implication is investigated by computer simulations of sequential decoding utilizing the stack algorithm. These simulations confirm the implication and further suggest an empirical formula for the true undetected decoding error probability with sequential decoding.
Unbiased feature selection in learning random forests for high-dimensional data.
Nguyen, Thanh-Tung; Huang, Joshua Zhexue; Nguyen, Thuy Thi
2015-01-01
Random forests (RFs) have been widely used as a powerful classification method. However, with the randomization in both bagging samples and feature selection, the trees in the forest tend to select uninformative features for node splitting. This makes RFs have poor accuracy when working with high-dimensional data. Besides that, RFs have bias in the feature selection process where multivalued features are favored. Aiming at debiasing feature selection in RFs, we propose a new RF algorithm, called xRF, to select good features in learning RFs for high-dimensional data. We first remove the uninformative features using p-value assessment, and the subset of unbiased features is then selected based on some statistical measures. This feature subset is then partitioned into two subsets. A feature weighting sampling technique is used to sample features from these two subsets for building trees. This approach enables one to generate more accurate trees, while allowing one to reduce dimensionality and the amount of data needed for learning RFs. An extensive set of experiments has been conducted on 47 high-dimensional real-world datasets including image datasets. The experimental results have shown that RFs with the proposed approach outperformed the existing random forests in increasing the accuracy and the AUC measures.
Yamashita, Takashi; Kart, Cary S; Noe, Douglas A
2012-12-01
Type 2 diabetes is known to contribute to health disparities in the U.S. and failure to adhere to recommended self-care behaviors is a contributing factor. Intervention programs face difficulties as a result of patient diversity and limited resources. With data from the 2005 Behavioral Risk Factor Surveillance System, this study employs a logistic regression tree algorithm to identify characteristics of sub-populations with type 2 diabetes according to their reported frequency of adherence to four recommended diabetes self-care behaviors including blood glucose monitoring, foot examination, eye examination and HbA1c testing. Using Andersen's health behavior model, need factors appear to dominate the definition of which sub-groups were at greatest risk for low as well as high adherence. Findings demonstrate the utility of easily interpreted tree diagrams to design specific culturally appropriate intervention programs targeting sub-populations of diabetes patients who need to improve their self-care behaviors. Limitations and contributions of the study are discussed.
Tree allometry and improved estimation of carbon stocks and balance in tropical forests.
Chave, J; Andalo, C; Brown, S; Cairns, M A; Chambers, J Q; Eamus, D; Fölster, H; Fromard, F; Higuchi, N; Kira, T; Lescure, J-P; Nelson, B W; Ogawa, H; Puig, H; Riéra, B; Yamakura, T
2005-08-01
Tropical forests hold large stores of carbon, yet uncertainty remains regarding their quantitative contribution to the global carbon cycle. One approach to quantifying carbon biomass stores consists in inferring changes from long-term forest inventory plots. Regression models are used to convert inventory data into an estimate of aboveground biomass (AGB). We provide a critical reassessment of the quality and the robustness of these models across tropical forest types, using a large dataset of 2,410 trees >or= 5 cm diameter, directly harvested in 27 study sites across the tropics. Proportional relationships between aboveground biomass and the product of wood density, trunk cross-sectional area, and total height are constructed. We also develop a regression model involving wood density and stem diameter only. Our models were tested for secondary and old-growth forests, for dry, moist and wet forests, for lowland and montane forests, and for mangrove forests. The most important predictors of AGB of a tree were, in decreasing order of importance, its trunk diameter, wood specific gravity, total height, and forest type (dry, moist, or wet). Overestimates prevailed, giving a bias of 0.5-6.5% when errors were averaged across all stands. Our regression models can be used reliably to predict aboveground tree biomass across a broad range of tropical forests. Because they are based on an unprecedented dataset, these models should improve the quality of tropical biomass estimates, and bring consensus about the contribution of the tropical forest biome and tropical deforestation to the global carbon cycle.
NASA Astrophysics Data System (ADS)
Burkholder, Aaron
This project investigated the spectral separability of the invasive species Ailanthus altissima, commonly called tree of heaven, and four other native species. Leaves were collected from Ailanthus and four native tree species from May 13 through August 24, 2008, and spectral reflectance factor measurements were gathered for each tree using an ASD (Boulder, Colorado) FieldSpec Pro full-range spectroradiometer. The original data covered the range from 350-2500 nm, with one reflectance measurement collected per one nm wavelength. To reduce dimensionality, the measurements were resampled to the actual resolution of the spectrometer's sensors, and regions of atmospheric absorption were removed. Continuum removal was performed on the reflectance data, resulting in a second dataset. For both the reflectance and continuum removed datasets, least angle regression (LARS) and random forest classification were used to identify a single set of optimal wavelengths across all sampled dates, a set of optimal wavelengths for each date, and the dates for which Ailanthus is most separable from other species. It was found that classification accuracy varies both with dates and bands used. Contrary to expectations that early spring would provide the best separability, the lowest classification error was observed on July 22 for the reflectance data, and on May 13, July 11 and August 1 for the continuum removed data. This suggests that July and August are also potentially good months for species differentiation. Applying continuum removal in many cases reduced classification error, although not consistently. Band selection seems to be more important for reflectance data in that it results in greater improvement in classification accuracy, and LARS appears to be an effective band selection tool. The optimal spectral bands were selected from across the spectrum, often with bands from the blue (401-431 nm), NIR (1115 nm) and SWIR (1985-1995 nm), suggesting that hyperspectral sensors with broad wavelength sensitivity are important for mapping and identification of Ailanthus.
Teets, Aaron; Fraver, Shawn; Weiskittel, Aaron R; Hollinger, David Y
2018-03-11
A range of environmental factors regulate tree growth; however, climate is generally thought to most strongly influence year-to-year variability in growth. Numerous dendrochronological (tree-ring) studies have identified climate factors that influence year-to-year variability in growth for given tree species and location. However, traditional dendrochronology methods have limitations that prevent them from adequately assessing stand-level (as opposed to species-level) growth. We argue that stand-level growth analyses provide a more meaningful assessment of forest response to climate fluctuations, as well as the management options that may be employed to sustain forest productivity. Working in a mature, mixed-species stand at the Howland Research Forest of central Maine, USA, we used two alternatives to traditional dendrochronological analyses by (1) selecting trees for coring using a stratified (by size and species), random sampling method that ensures a representative sample of the stand, and (2) converting ring widths to biomass increments, which once summed, produced a representation of stand-level growth, while maintaining species identities or canopy position if needed. We then tested the relative influence of seasonal climate variables on year-to-year variability in the biomass increment using generalized least squares regression, while accounting for temporal autocorrelation. Our results indicate that stand-level growth responded most strongly to previous summer and current spring climate variables, resulting from a combination of individualistic climate responses occurring at the species- and canopy-position level. Our climate models were better fit to stand-level biomass increment than to species-level or canopy-position summaries. The relative growth responses (i.e., percent change) predicted from the most influential climate variables indicate stand-level growth varies less from to year-to-year than species-level or canopy-position growth responses. By assessing stand-level growth response to climate, we provide an alternative perspective on climate-growth relationships of forests, improving our understanding of forest growth dynamics under a fluctuating climate. © 2018 John Wiley & Sons Ltd.
Prediction models for clustered data: comparison of a random intercept and standard regression model
2013-01-01
Background When study data are clustered, standard regression analysis is considered inappropriate and analytical techniques for clustered data need to be used. For prediction research in which the interest of predictor effects is on the patient level, random effect regression models are probably preferred over standard regression analysis. It is well known that the random effect parameter estimates and the standard logistic regression parameter estimates are different. Here, we compared random effect and standard logistic regression models for their ability to provide accurate predictions. Methods Using an empirical study on 1642 surgical patients at risk of postoperative nausea and vomiting, who were treated by one of 19 anesthesiologists (clusters), we developed prognostic models either with standard or random intercept logistic regression. External validity of these models was assessed in new patients from other anesthesiologists. We supported our results with simulation studies using intra-class correlation coefficients (ICC) of 5%, 15%, or 30%. Standard performance measures and measures adapted for the clustered data structure were estimated. Results The model developed with random effect analysis showed better discrimination than the standard approach, if the cluster effects were used for risk prediction (standard c-index of 0.69 versus 0.66). In the external validation set, both models showed similar discrimination (standard c-index 0.68 versus 0.67). The simulation study confirmed these results. For datasets with a high ICC (≥15%), model calibration was only adequate in external subjects, if the used performance measure assumed the same data structure as the model development method: standard calibration measures showed good calibration for the standard developed model, calibration measures adapting the clustered data structure showed good calibration for the prediction model with random intercept. Conclusion The models with random intercept discriminate better than the standard model only if the cluster effect is used for predictions. The prediction model with random intercept had good calibration within clusters. PMID:23414436
Bouwmeester, Walter; Twisk, Jos W R; Kappen, Teus H; van Klei, Wilton A; Moons, Karel G M; Vergouwe, Yvonne
2013-02-15
When study data are clustered, standard regression analysis is considered inappropriate and analytical techniques for clustered data need to be used. For prediction research in which the interest of predictor effects is on the patient level, random effect regression models are probably preferred over standard regression analysis. It is well known that the random effect parameter estimates and the standard logistic regression parameter estimates are different. Here, we compared random effect and standard logistic regression models for their ability to provide accurate predictions. Using an empirical study on 1642 surgical patients at risk of postoperative nausea and vomiting, who were treated by one of 19 anesthesiologists (clusters), we developed prognostic models either with standard or random intercept logistic regression. External validity of these models was assessed in new patients from other anesthesiologists. We supported our results with simulation studies using intra-class correlation coefficients (ICC) of 5%, 15%, or 30%. Standard performance measures and measures adapted for the clustered data structure were estimated. The model developed with random effect analysis showed better discrimination than the standard approach, if the cluster effects were used for risk prediction (standard c-index of 0.69 versus 0.66). In the external validation set, both models showed similar discrimination (standard c-index 0.68 versus 0.67). The simulation study confirmed these results. For datasets with a high ICC (≥15%), model calibration was only adequate in external subjects, if the used performance measure assumed the same data structure as the model development method: standard calibration measures showed good calibration for the standard developed model, calibration measures adapting the clustered data structure showed good calibration for the prediction model with random intercept. The models with random intercept discriminate better than the standard model only if the cluster effect is used for predictions. The prediction model with random intercept had good calibration within clusters.
Vegetation Continuous Fields--Transitioning from MODIS to VIIRS
NASA Astrophysics Data System (ADS)
DiMiceli, C.; Townshend, J. R.; Sohlberg, R. A.; Kim, D. H.; Kelly, M.
2015-12-01
Measurements of fractional vegetation cover are critical for accurate and consistent monitoring of global deforestation rates. They also provide important parameters for land surface, climate and carbon models and vital background data for research into fire, hydrological and ecosystem processes. MODIS Vegetation Continuous Fields (VCF) products provide four complementary layers of fractional cover: tree cover, non-tree vegetation, bare ground, and surface water. MODIS VCF products are currently produced globally and annually at 250m resolution for 2000 to the present. Additionally, annual VCF products at 1/20° resolution derived from AVHRR and MODIS Long-Term Data Records are in development to provide Earth System Data Records of fractional vegetation cover for 1982 to the present. In order to provide continuity of these valuable products, we are extending the VCF algorithms to create Suomi NPP/VIIRS VCF products. This presentation will highlight the first VIIRS fractional cover product: global percent tree cover at 1 km resolution. To create this product, phenological and physiological metrics were derived from each complete year of VIIRS 8-day surface reflectance products. A supervised regression tree method was applied to the metrics, using training derived from Landsat data supplemented by high-resolution data from Ikonos, RapidEye and QuickBird. The regression tree model was then applied globally to produce fractional tree cover. In our presentation we will detail our methods for creating the VIIRS VCF product. We will compare the new VIIRS VCF product to our current MODIS VCF products and demonstrate continuity between instruments. Finally, we will outline future VIIRS VCF development plans.
Tanaka, Tomohiro; Voigt, Michael D
2018-03-01
Non-melanoma skin cancer (NMSC) is the most common de novo malignancy in liver transplant (LT) recipients; it behaves more aggressively and it increases mortality. We used decision tree analysis to develop a tool to stratify and quantify risk of NMSC in LT recipients. We performed Cox regression analysis to identify which predictive variables to enter into the decision tree analysis. Data were from the Organ Procurement Transplant Network (OPTN) STAR files of September 2016 (n = 102984). NMSC developed in 4556 of the 105984 recipients, a mean of 5.6 years after transplant. The 5/10/20-year rates of NMSC were 2.9/6.3/13.5%, respectively. Cox regression identified male gender, Caucasian race, age, body mass index (BMI) at LT, and sirolimus use as key predictive or protective factors for NMSC. These factors were entered into a decision tree analysis. The final tree stratified non-Caucasians as low risk (0.8%), and Caucasian males > 47 years, BMI < 40 who did not receive sirolimus, as high risk (7.3% cumulative incidence of NMSC). The predictions in the derivation set were almost identical to those in the validation set (r 2 = 0.971, p < 0.0001). Cumulative incidence of NMSC in low, moderate and high risk groups at 5/10/20 year was 0.5/1.2/3.3, 2.1/4.8/11.7 and 5.6/11.6/23.1% (p < 0.0001). The decision tree model accurately stratifies the risk of developing NMSC in the long-term after LT.
NASA Astrophysics Data System (ADS)
Shabani, Farzin; Kumar, Lalit; Solhjouy-fard, Samaneh
2017-08-01
The aim of this study was to have a comparative investigation and evaluation of the capabilities of correlative and mechanistic modeling processes, applied to the projection of future distributions of date palm in novel environments and to establish a method of minimizing uncertainty in the projections of differing techniques. The location of this study on a global scale is in Middle Eastern Countries. We compared the mechanistic model CLIMEX (CL) with the correlative models MaxEnt (MX), Boosted Regression Trees (BRT), and Random Forests (RF) to project current and future distributions of date palm ( Phoenix dactylifera L.). The Global Climate Model (GCM), the CSIRO-Mk3.0 (CS) using the A2 emissions scenario, was selected for making projections. Both indigenous and alien distribution data of the species were utilized in the modeling process. The common areas predicted by MX, BRT, RF, and CL from the CS GCM were extracted and compared to ascertain projection uncertainty levels of each individual technique. The common areas identified by all four modeling techniques were used to produce a map indicating suitable and unsuitable areas for date palm cultivation for Middle Eastern countries, for the present and the year 2100. The four different modeling approaches predict fairly different distributions. Projections from CL were more conservative than from MX. The BRT and RF were the most conservative methods in terms of projections for the current time. The combination of the final CL and MX projections for the present and 2100 provide higher certainty concerning those areas that will become highly suitable for future date palm cultivation. According to the four models, cold, hot, and wet stress, with differences on a regional basis, appears to be the major restrictions on future date palm distribution. The results demonstrate variances in the projections, resulting from different techniques. The assessment and interpretation of model projections requires reservations, especially in correlative models such as MX, BRT, and RF. Intersections between different techniques may decrease uncertainty in future distribution projections. However, readers should not miss the fact that the uncertainties are mostly because the future GHG emission scenarios are unknowable with sufficient precision. Suggestions towards methodology and processing for improving projections are included.
NASA Astrophysics Data System (ADS)
Park, Seonyoung; Im, Jungho; Park, Sumin; Rhee, Jinyoung
2017-04-01
Soil moisture is one of the most important keys for understanding regional and global climate systems. Soil moisture is directly related to agricultural processes as well as hydrological processes because soil moisture highly influences vegetation growth and determines water supply in the agroecosystem. Accurate monitoring of the spatiotemporal pattern of soil moisture is important. Soil moisture has been generally provided through in situ measurements at stations. Although field survey from in situ measurements provides accurate soil moisture with high temporal resolution, it requires high cost and does not provide the spatial distribution of soil moisture over large areas. Microwave satellite (e.g., advanced Microwave Scanning Radiometer on the Earth Observing System (AMSR2), the Advanced Scatterometer (ASCAT), and Soil Moisture Active Passive (SMAP)) -based approaches and numerical models such as Global Land Data Assimilation System (GLDAS) and Modern- Era Retrospective Analysis for Research and Applications (MERRA) provide spatial-temporalspatiotemporally continuous soil moisture products at global scale. However, since those global soil moisture products have coarse spatial resolution ( 25-40 km), their applications for agriculture and water resources at local and regional scales are very limited. Thus, soil moisture downscaling is needed to overcome the limitation of the spatial resolution of soil moisture products. In this study, GLDAS soil moisture data were downscaled up to 1 km spatial resolution through the integration of AMSR2 and ASCAT soil moisture data, Shuttle Radar Topography Mission (SRTM) Digital Elevation Model (DEM), and Moderate Resolution Imaging Spectroradiometer (MODIS) data—Land Surface Temperature, Normalized Difference Vegetation Index, and Land cover—using modified regression trees over East Asia from 2013 to 2015. Modified regression trees were implemented using Cubist, a commercial software tool based on machine learning. An optimization based on pruning of rules derived from the modified regression trees was conducted. Root Mean Square Error (RMSE) and Correlation coefficients (r) were used to optimize the rules, and finally 59 rules from modified regression trees were produced. The results show high validation r (0.79) and low validation RMSE (0.0556m3/m3). The 1 km downscaled soil moisture was evaluated using ground soil moisture data at 14 stations, and both soil moisture data showed similar temporal patterns (average r=0.51 and average RMSE=0.041). The spatial distribution of the 1 km downscaled soil moisture well corresponded with GLDAS soil moisture that caught both extremely dry and wet regions. Correlation between GLDAS and the 1 km downscaled soil moisture during growing season was positive (mean r=0.35) in most regions.
Diameter-growth model across shortleaf pine range using regression tree analysis
Daniel Yaussy; Louis Iverson; Anantha Prasad
1999-01-01
Diameter growth of a tree in most gap-phase models is limited by light, nutrients, moisture, and temperature. Growing-season temperature is represented by growing degree days (gdd), which is the sum of the average daily temperatures above a baseline temperature. Gap-phase models determine the north-south range of a species by the gdd limits at the north and south...
Eric J. Gustafson
2014-01-01
Regression models developed in the upper Midwest (United States) to predict drought-induced tree mortality from measures of drought (Palmer Drought Severity Index) were tested in the northeastern United States and found inadequate. The most likely cause of this result is that long drought events were rare in the Northeast during the period when inventory data were...
W. Henry McNab; David L. Loftis; Callie J. Schweitzer; Raymond Sheffield
2004-01-01
We used tree indicator species occurring on 438 plots in the Plateau counties of Tennessee to test the uniqueness of four conterminous ecoregions. Multinomial logistic regression indicated that the presence of 14 tree species allowed classification of sample plots according to ecoregion with an average overall accuracy of 75 percent (range 45 to 94 percent). Additional...
Dating tree mortality using log decay in the White Mountains of New Hampshire
Andrew J. Fast; Mark J. Ducey; Jeffrey H. Gove; William B. Leak
2008-01-01
Coarse woody material (CWM) is an important component of forest ecosystems. To meet specific CWM management objectives, it is important to understand rates of decay. We present results from a silvicultural trial at the Bartlett Experimental Forest, in which time of death is known for a large sample of trees. Either a simple table or regression equations that use...
Development of post-fire crown damage mortality thresholds in ponderosa pine
James F. Fowler; Carolyn Hull Sieg; Joel McMillin; Kurt K. Allen; Jose F. Negron; Linda L. Wadleigh; John A. Anhold; Ken E. Gibson
2010-01-01
Previous research has shown that crown scorch volume and crown consumption volume are the major predictors of post-fire mortality in ponderosa pine. In this study, we use piecewise logistic regression models of crown scorch data from 6633 trees in five wildfires from the Intermountain West to locate a mortality threshold at 88% scorch by volume for trees with no crown...
Equations relating compacted and uncompacted live crown ratio for common tree species in the South
KaDonna C. Randolph
2010-01-01
Species-specific equations to predict uncompacted crown ratio (UNCR) from compacted live crown ratio (CCR), tree length, and stem diameter were developed for 24 species and 12 genera in the southern United States. Using data from the US Forest Service Forest Inventory and Analysis program, nonlinear regression was used to model UNCR with a logistic function. Model...
Scott M. Bretthauer; George Z. Gertner; Gary L. Rolfe; Jeffery O. Dawson
2003-01-01
Tree species diversity increases and dominance decreases with proximity to forest border in two 60-year-old successional forest stands developed on abandoned agricultural land in Piatt County, Illinois. A regression equation allowed us to quantify an increase in diversity with closeness to forest border for one of the forest stands. Shingle oak is the most dominant...
Regeneration of Douglas-fir cutblocks on the Six Rivers National Forest in northwestern California
R. O. Strothmann
1979-01-01
A survey of 61 cutblocks planted since 1964 evaluated stocking of conifers (trees 1 foot tall or taller) on 2-milacre quadrats. Overall stocking percentage averaged 42.2 and ranged from 15 to 8 1. Overall number of trees per acre averaged 396. In the regression model, based on 36 cutblocks, better stocking was associated with high site class, northerly aspect,...
Smitley, D R; Rebek, E J; Royalty, R N; Davis, T W; Newhouse, K F
2010-02-01
We conducted field trials at five different locations over a period of 6 yr to investigate the efficacy of imidacloprid applied each spring as a basal soil drench for protection against emerald ash borer, Agrilus planipennis Fairmaire (Coleoptera: Buprestidae). Canopy thinning and emerald ash borer larval density were used to evaluate efficacy for 3-4 yr at each location while treatments continued. Test sites included small urban trees (5-15 cm diameter at breast height [dbh]), medium to large (15-65 cm dbh) trees at golf courses, and medium to large street trees. Annual basal drenches with imidacloprid gave complete protection of small ash trees for three years. At three sites where the size of trees ranged from 23 to 37 cm dbh, we successfully protected all ash trees beginning the test with <60% canopy thinning. Regression analysis of data from two sites reveals that tree size explains 46% of the variation in efficacy of imidacloprid drenches. The smallest trees (<30 cm dbh) remained in excellent condition for 3 yr, whereas most of the largest trees (>38 cm dbh) declined to a weakened state and undesirable appearance. The five-fold increase in trunk and branch surface area of ash trees as the tree dbh doubles may account for reduced efficacy on larger trees, and suggests a need to increase treatment rates for larger trees.
Jansa, Václav
2017-01-01
Height to crown base (HCB) of a tree is an important variable often included as a predictor in various forest models that serve as the fundamental tools for decision-making in forestry. We developed spatially explicit and spatially inexplicit mixed-effects HCB models using measurements from a total 19,404 trees of Norway spruce (Picea abies (L.) Karst.) and European beech (Fagus sylvatica L.) on the permanent sample plots that are located across the Czech Republic. Variables describing site quality, stand density or competition, and species mixing effects were included into the HCB model with use of dominant height (HDOM), basal area of trees larger in diameters than a subject tree (BAL- spatially inexplicit measure) or Hegyi’s competition index (HCI—spatially explicit measure), and basal area proportion of a species of interest (BAPOR), respectively. The parameters describing sample plot-level random effects were included into the HCB model by applying the mixed-effects modelling approach. Among several functional forms evaluated, the logistic function was found most suited to our data. The HCB model for Norway spruce was tested against the data originated from different inventory designs, but model for European beech was tested using partitioned dataset (a part of the main dataset). The variance heteroscedasticity in the residuals was substantially reduced through inclusion of a power variance function into the HCB model. The results showed that spatially explicit model described significantly a larger part of the HCB variations [R2adj = 0.86 (spruce), 0.85 (beech)] than its spatially inexplicit counterpart [R2adj = 0.84 (spruce), 0.83 (beech)]. The HCB increased with increasing competitive interactions described by tree-centered competition measure: BAL or HCI, and species mixing effects described by BAPOR. A test of the mixed-effects HCB model with the random effects estimated using at least four trees per sample plot in the validation data confirmed that the model was precise enough for the prediction of HCB for a range of site quality, tree size, stand density, and stand structure. We therefore recommend measuring of HCB on four randomly selected trees of a species of interest on each sample plot for localizing the mixed-effects model and predicting HCB of the remaining trees on the plot. Growth simulations can be made from the data that lack the values for either crown ratio or HCB using the HCB models. PMID:29049391
A Millennial-length Reconstruction of the Western Pacific Pattern with Associated Paleoclimate
NASA Astrophysics Data System (ADS)
Wright, W. E.; Guan, B. T.; Wei, K.
2010-12-01
The Western Pacific Pattern (WP) is a lesser known 500 hPa pressure pattern similar to the NAO or PNA. As defined, the poles of the WP index are centered on 60°N over the Kamchatka peninsula and the neighboring Pacific and on 32.5°N over the western north Pacific. However, the area of influence for the southern half of the dipole includes a wide swath from East Asia, across Taiwan, through the Philippine Sea, to the western north Pacific. Tree rings of Taiwanese Chamaecyparis obtusa var. formosana in this extended region show significant correlation with the WP, and with local temperature. The WP is also significantly correlated with atmospheric temperatures over Taiwan, especially at 850hPa and 700 hPa, pressure levels that bracket the tree site. Spectral analysis indicates that variations in the WP occur at relatively high frequency, with most power at less than 5 years. Simple linear regression against high frequency variants of the tree-ring chronology yielded the most significant correlation coefficients. Two reconstructions are presented. The first uses a tree-ring time series produced as the first intrinsic mode function (IMF) from an Ensemble Empirical Mode Decomposition (EEMD), based on the Hilbert-Huang Transform. The significance of the regression using the EEMD-derived time series was much more significant than time series produced using traditional high pass filtering. The second also uses the first IMF of a tree-ring time series, but the dataset was first sorted and partitioned at a specified quantile prior to EEMD decomposition, with the mean of the partitioned data forming the input to the EEMD. The partitioning was done to filter out the less climatically sensitive tree rings, a common problem with shade tolerant trees. Time series statistics indicate that the first reconstruction is reliable to 1241 of the Common Era. Reliability of the second reconstruction is dependent on the development of statistics related to the quantile partitioning, and the consequent reduction in sample depth. However, the correlation coefficients from regressions over the instrumental period greatly exceed those from any other method of chronology generation, and so the technique holds promise. Additional atmospheric parameters having significant correlations against the WPO and tree ring time series with similar spatial patterns are also presented. These include vertical wind shear (850hPa-700hPa) over the northern Philippines and the Philippine Sea, surface Omega and 850hPa v-winds over the East China Sea, Japan and Taiwan. Possible links to changes in the subtropical jet stream will also be discussed.
Alternative methods to evaluate trial level surrogacy.
Abrahantes, Josè Cortiñas; Shkedy, Ziv; Molenberghs, Geert
2008-01-01
The evaluation and validation of surrogate endpoints have been extensively studied in the last decade. Prentice [1] and Freedman, Graubard and Schatzkin [2] laid the foundations for the evaluation of surrogate endpoints in randomized clinical trials. Later, Buyse et al. [5] proposed a meta-analytic methodology, producing different methods for different settings, which was further studied by Alonso and Molenberghs [9], in their unifying approach based on information theory. In this article, we focus our attention on the trial-level surrogacy and propose alternative procedures to evaluate such surrogacy measure, which do not pre-specify the type of association. A promising correction based on cross-validation is investigated. As well as the construction of confidence intervals for this measure. In order to avoid making assumption about the type of relationship between the treatment effects and its distribution, a collection of alternative methods, based on regression trees, bagging, random forests, and support vector machines, combined with bootstrap-based confidence interval and, should one wish, in conjunction with a cross-validation based correction, will be proposed and applied. We apply the various strategies to data from three clinical studies: in opthalmology, in advanced colorectal cancer, and in schizophrenia. The results obtained for the three case studies are compared; they indicate that using random forest or bagging models produces larger estimated values for the surrogacy measure, which are in general stabler and the confidence interval narrower than linear regression and support vector regression. For the advanced colorectal cancer studies, we even found the trial-level surrogacy is considerably different from what has been reported. In general the alternative methods are more computationally demanding, and specially the calculation of the confidence intervals, require more computational time that the delta-method counterpart. First, more flexible modeling techniques can be used, allowing for other type of association. Second, when no cross-validation-based correction is applied, overly optimistic trial-level surrogacy estimates will be found, thus cross-validation is highly recommendable. Third, the use of the delta method to calculate confidence intervals is not recommendable since it makes assumptions valid only in very large samples. It may also produce range-violating limits. We therefore recommend alternatives: bootstrap methods in general. Also, the information-theoretic approach produces comparable results with the bagging and random forest approaches, when cross-validation correction is applied. It is also important to observe that, even for the case in which the linear model might be a good option too, bagging methods perform well too, and their confidence intervals were more narrow.
Logging utilization in Idaho: Current and past trends
Eric A. Simmons; Todd A. Morgan; Erik C. Berg; Stanley J. Zarnoch; Steven W. Hayes; Mike T. Thompson
2014-01-01
A study of commercial timber-harvesting activities in Idaho was conducted during 2008 and 2011 to characterize current tree utilization, logging operations, and changes from previous Idaho logging utilization studies. A two-stage simple random sampling design was used to select sites and felled trees for measurement within active logging sites. Thirty-three logging...
Invasion Percolation and Global Optimization
NASA Astrophysics Data System (ADS)
Barabási, Albert-László
1996-05-01
Invasion bond percolation (IBP) is mapped exactly into Prim's algorithm for finding the shortest spanning tree of a weighted random graph. Exploring this mapping, which is valid for arbitrary dimensions and lattices, we introduce a new IBP model that belongs to the same universality class as IBP and generates the minimal energy tree spanning the IBP cluster.
Adaptive economic and ecological forest management under risk
Joseph Buongiorno; Mo Zhou
2015-01-01
Background: Forest managers must deal with inherently stochastic ecological and economic processes. The future growth of trees is uncertain, and so is their value. The randomness of low-impact, high frequency or rare catastrophic shocks in forest growth has significant implications in shaping the mix of tree species and the forest landscape...
Natural cavities used by wood ducks in north-central Minnesota
Gilmer, D.S.; Ball, I.J.; Cowardin, L.M.; Mathisen, J.
1978-01-01
Radio telemetry was used to locate 31 wood duck (Aix sponsa) nest cavity sites in 16 forest stands. Stands were of 2 types: (1) mature (mean = 107 years) northern hardwoods (10 nest sites), and (2) mature (mean = 68 years) quaking aspen (Populus tremuloides) (21 nest sites). Aspen was the most important cavity-producing tree used by wood ducks and accounted for 57 percent of 28 cavities inspected. In stands used by wood ducks, the average density of suitable cavities was about 4 per hectare. Trees containing nests were closer to water areas (P < 0.05) and the nearest forest canopy openings (P < 0.01) than was a random sample of trees from the same stands. A significant (P < 0.005) relationship existed between the orientation of the cavity entrance and the nearest canopy opening. Potential wood duck cavities usually were clustered within a stand rather than randomly distributed. Selection of trees by woodpeckers for nest hole construction probably influenced the availability of cavities used by wood ducks. A plan for managing forests to benefit wood ducks and other wildlife dependent on old-growth timber is discussed.
Nesting habitat and productivity of Swainson's Hawks in southeastern Arizona
Nishida, Catherine; Boal, Clint W.; DeStefano, Stephen; Hobbs, Royden J.
2013-01-01
We studied Swainson's Hawks (Buteo swainsoni) in southeastern Arizona to assess the status of the local breeding population. Nest success (≥1 young fledged) was 44.4% in 1999 with an average of 1.43 ± 0.09 (SE) young produced per successful pair. Productivity was similar in 2000, with 58.2% nesting success and 1.83 ± 0.09 fledglings per successful pair. Mesquite (Prosopis velutina) and cottonwood (Populus fremontii) accounted for >50% of 167 nest trees. Nest trees were taller than surrounding trees and random trees, and overall there was more vegetative cover at nest sites than random sites. This apparent requirement for cover around nest sites could be important for management of the species in Arizona. However, any need for cover at nest sites must be balanced with the need for open areas for foraging. Density of nesting Swainson's Hawks was higher in agriculture than in grasslands and desert scrub. Breeding pairs had similar success in agricultural and nonagricultural areas, but the effect of rapid and widespread land-use change on breeding distribution and productivity continues to be a concern throughout the range of the species.
Foster, Jane R.; D'Amato, Anthony W.; Bradford, John B.
2014-01-01
Forest biomass growth is almost universally assumed to peak early in stand development, near canopy closure, after which it will plateau or decline. The chronosequence and plot remeasurement approaches used to establish the decline pattern suffer from limitations and coarse temporal detail. We combined annual tree ring measurements and mortality models to address two questions: first, how do assumptions about tree growth and mortality influence reconstructions of biomass growth? Second, under what circumstances does biomass production follow the model that peaks early, then declines? We integrated three stochastic mortality models with a census tree-ring data set from eight temperate forest types to reconstruct stand-level biomass increments (in Minnesota, USA). We compared growth patterns among mortality models, forest types and stands. Timing of peak biomass growth varied significantly among mortality models, peaking 20–30 years earlier when mortality was random with respect to tree growth and size, than when mortality favored slow-growing individuals. Random or u-shaped mortality (highest in small or large trees) produced peak growth 25–30 % higher than the surviving tree sample alone. Growth trends for even-aged, monospecific Pinus banksiana or Acer saccharum forests were similar to the early peak and decline expectation. However, we observed continually increasing biomass growth in older, low-productivity forests of Quercus rubra, Fraxinus nigra, and Thuja occidentalis. Tree-ring reconstructions estimated annual changes in live biomass growth and identified more diverse development patterns than previous methods. These detailed, long-term patterns of biomass development are crucial for detecting recent growth responses to global change and modeling future forest dynamics.
SLE as a Mating of Trees in Euclidean Geometry
NASA Astrophysics Data System (ADS)
Holden, Nina; Sun, Xin
2018-05-01
The mating of trees approach to Schramm-Loewner evolution (SLE) in the random geometry of Liouville quantum gravity (LQG) has been recently developed by Duplantier et al. (Liouville quantum gravity as a mating of trees, 2014. arXiv:1409.7055). In this paper we consider the mating of trees approach to SLE in Euclidean geometry. Let {η} be a whole-plane space-filling SLE with parameter {κ > 4} , parameterized by Lebesgue measure. The main observable in the mating of trees approach is the contour function, a two-dimensional continuous process describing the evolution of the Minkowski content of the left and right frontier of {η} . We prove regularity properties of the contour function and show that (as in the LQG case) it encodes all the information about the curve {η} . We also prove that the uniform spanning tree on {Z^2} converges to SLE8 in the natural topology associated with the mating of trees approach.
Simple chained guide trees give high-quality protein multiple sequence alignments
Boyce, Kieran; Sievers, Fabian; Higgins, Desmond G.
2014-01-01
Guide trees are used to decide the order of sequence alignment in the progressive multiple sequence alignment heuristic. These guide trees are often the limiting factor in making large alignments, and considerable effort has been expended over the years in making these quickly or accurately. In this article we show that, at least for protein families with large numbers of sequences that can be benchmarked with known structures, simple chained guide trees give the most accurate alignments. These also happen to be the fastest and simplest guide trees to construct, computationally. Such guide trees have a striking effect on the accuracy of alignments produced by some of the most widely used alignment packages. There is a marked increase in accuracy and a marked decrease in computational time, once the number of sequences goes much above a few hundred. This is true, even if the order of sequences in the guide tree is random. PMID:25002495
Zong, Shengwei; Wu, Zhengfang; Xu, Jiawei; Li, Ming; Gao, Xiaofeng; He, Hongshi; Du, Haibo; Wang, Lei
2014-01-01
Tree line ecotone in the Changbai Mountains has undergone large changes in the past decades. Tree locations show variations on the four sides of the mountains, especially on the northern and western sides, which has not been fully explained. Previous studies attributed such variations to the variations in temperature. However, in this study, we hypothesized that topographic controls were responsible for causing the variations in the tree locations in tree line ecotone of the Changbai Mountains. To test the hypothesis, we used IKONOS images and WorldView-1 image to identify the tree locations and developed a logistic regression model using topographical variables to identify the dominant controls of the tree locations. The results showed that aspect, wetness, and slope were dominant controls for tree locations on western side of the mountains, whereas altitude, SPI, and aspect were the dominant factors on northern side. The upmost altitude a tree can currently reach was 2140 m asl on the northern side and 2060 m asl on western side. The model predicted results showed that habitats above the current tree line on the both sides were available for trees. Tree recruitments under the current tree line may take advantage of the available habitats at higher elevations based on the current tree location. Our research confirmed the controlling effects of topography on the tree locations in the tree line ecotone of Changbai Mountains and suggested that it was essential to assess the tree response to topography in the research of tree line ecotone. PMID:25170918
Zong, Shengwei; Wu, Zhengfang; Xu, Jiawei; Li, Ming; Gao, Xiaofeng; He, Hongshi; Du, Haibo; Wang, Lei
2014-01-01
Tree line ecotone in the Changbai Mountains has undergone large changes in the past decades. Tree locations show variations on the four sides of the mountains, especially on the northern and western sides, which has not been fully explained. Previous studies attributed such variations to the variations in temperature. However, in this study, we hypothesized that topographic controls were responsible for causing the variations in the tree locations in tree line ecotone of the Changbai Mountains. To test the hypothesis, we used IKONOS images and WorldView-1 image to identify the tree locations and developed a logistic regression model using topographical variables to identify the dominant controls of the tree locations. The results showed that aspect, wetness, and slope were dominant controls for tree locations on western side of the mountains, whereas altitude, SPI, and aspect were the dominant factors on northern side. The upmost altitude a tree can currently reach was 2140 m asl on the northern side and 2060 m asl on western side. The model predicted results showed that habitats above the current tree line on the both sides were available for trees. Tree recruitments under the current tree line may take advantage of the available habitats at higher elevations based on the current tree location. Our research confirmed the controlling effects of topography on the tree locations in the tree line ecotone of Changbai Mountains and suggested that it was essential to assess the tree response to topography in the research of tree line ecotone.
Lanier, Hayley C; Knowles, L Lacey
2015-02-01
Coalescent-based methods for species-tree estimation are becoming a dominant approach for reconstructing species histories from multi-locus data, with most of the studies examining these methodologies focused on recently diverged species. However, deeper phylogenies, such as the datasets that comprise many Tree of Life (ToL) studies, also exhibit gene-tree discordance. This discord may also arise from the stochastic sorting of gene lineages during the speciation process (i.e., reflecting the random coalescence of gene lineages in ancestral populations). It remains unknown whether guidelines regarding methodologies and numbers of loci established by simulation studies at shallow tree depths translate into accurate species relationships for deeper phylogenetic histories. We address this knowledge gap and specifically identify the challenges and limitations of species-tree methods that account for coalescent variance for deeper phylogenies. Using simulated data with characteristics informed by empirical studies, we evaluate both the accuracy of estimated species trees and the characteristics associated with recalcitrant nodes, with a specific focus on whether coalescent variance is generally responsible for the lack of resolution. By determining the proportion of coalescent genealogies that support a particular node, we demonstrate that (1) species-tree methods account for coalescent variance at deep nodes and (2) mutational variance - not gene-tree discord arising from the coalescent - posed the primary challenge for accurate reconstruction across the tree. For example, many nodes were accurately resolved despite predicted discord from the random coalescence of gene lineages and nodes with poor support were distributed across a range of depths (i.e., they were not restricted to a particular recent divergences). Given their broad taxonomic scope and large sampling of taxa, deep level phylogenies pose several potential methodological complications including difficulties with MCMC convergence and estimation of requisite population genetic parameters for coalescent-based approaches. Despite these difficulties, the findings generally support the utility of species-tree analyses for the estimation of species relationships throughout the ToL. We discuss strategies for successful application of species-tree approaches to deep phylogenies. Copyright © 2014 Elsevier Inc. All rights reserved.
Dyer, Betsey D.; Kahn, Michael J.; LeBlanc, Mark D.
2008-01-01
Classification and regression tree (CART) analysis was applied to genome-wide tetranucleotide frequencies (genomic signatures) of 195 archaea and bacteria. Although genomic signatures have typically been used to classify evolutionary divergence, in this study, convergent evolution was the focus. Temperature optima for most of the organisms examined could be distinguished by CART analyses of tetranucleotide frequencies. This suggests that pervasive (nonlinear) qualities of genomes may reflect certain environmental conditions (such as temperature) in which those genomes evolved. The predominant use of GAGA and AGGA as the discriminating tetramers in CART models suggests that purine-loading and codon biases of thermophiles may explain some of the results. PMID:19054742
Brabant, Marie-Eve; Hébert, Martine; Chagnon, François
2013-01-01
This study explored the clinical profiles of 77 female teenager survivors of sexual abuse and examined the association of abuse-related and personal variables with suicidal ideations. Analyses revealed that 64% of participants experienced suicidal ideations. Findings from classification and regression tree analysis indicated that depression, posttraumatic stress symptoms, and hopelessness discriminated profiles of suicidal and nonsuicidal survivors. The elevated prevalence of suicidal ideations among adolescent survivors of sexual abuse underscores the importance of investigating the presence of suicidal ideations in sexual abuse survivors. However, suicidal ideation is not the sole variable that needs to be investigated; depression, hopelessness and posttraumatic stress symptoms are also related to suicidal ideations in survivors and could therefore guide interventions.
Simulation of design-unbiased point-to-particle sampling compared to alternatives on plantation rows
Thomas B. Lynch; David Hamlin; Mark J. Ducey
2016-01-01
Total quantities of tree attributes can be estimated in plantations by sampling on plantation rows using several methods. At random sample points on a row, either fixed row lengths or variable row lengths with a fixed number of sample trees can be assessed. Ratio of means or mean of ratios estimators can be developed for the fixed number of trees option but are not...
Charles A. Gresham
2010-01-01
In May 2005, 12 plots approximately 21 by 27 m each were established on the exterior bank of a dredge spoil area in Georgetown, SC. The tree cover was primarily Chinese tallowtree, but there were also live oak trees in each plot. Also in May 2005, Chinese tallowtree s (d.b.h. > 2.4 cm) in three randomly selected plots received a âhack and squirtâ application of...
Topology of correlation-based minimal spanning trees in real and model markets
NASA Astrophysics Data System (ADS)
Bonanno, Giovanni; Caldarelli, Guido; Lillo, Fabrizio; Mantegna, Rosario N.
2003-10-01
We compare the topological properties of the minimal spanning tree obtained from a large group of stocks traded at the New York Stock Exchange during a 12-year trading period with the one obtained from surrogated data simulated by using simple market models. We find that the empirical tree has features of a complex network that cannot be reproduced, even as a first approximation, by a random market model and by the widespread one-factor model.
Sumithran, P; Purcell, K; Kuyruk, S; Proietto, J; Prendergast, L A
2018-02-01
Consistent, strong predictors of obesity treatment outcomes have not been identified. It has been suggested that broadening the range of predictor variables examined may be valuable. We explored methods to predict outcomes of a very-low-energy diet (VLED)-based programme in a clinically comparable setting, using a wide array of pre-intervention biological and psychosocial participant data. A total of 61 women and 39 men (mean ± standard deviation [SD] body mass index: 39.8 ± 7.3 kg/m 2 ) underwent an 8-week VLED and 12-month follow-up. At baseline, participants underwent a blood test and assessment of psychological, social and behavioural factors previously associated with treatment outcomes. Logistic regression, linear discriminant analysis, decision trees and random forests were used to model outcomes from baseline variables. Of the 100 participants, 88 completed the VLED and 42 attended the Week 60 visit. Overall prediction rates for weight loss of ≥10% at weeks 8 and 60, and attrition at Week 60, using combined data were between 77.8 and 87.6% for logistic regression, and lower for other methods. When logistic regression analyses included only baseline demographic and anthropometric variables, prediction rates were 76.2-86.1%. In this population, considering a wide range of biological and psychosocial data did not improve outcome prediction compared to simply-obtained baseline characteristics. © 2017 World Obesity Federation.
Evaluation and prediction of shrub cover in coastal Oregon forests (USA)
Becky K. Kerns; Janet L. Ohmann
2004-01-01
We used data from regional forest inventories and research programs, coupled with mapped climatic and topographic information, to explore relationships and develop multiple linear regression (MLR) and regression tree models for total and deciduous shrub cover in the Oregon coastal province. Results from both types of models indicate that forest structure variables were...
Weighted linear regression using D2H and D2 as the independent variables
Hans T. Schreuder; Michael S. Williams
1998-01-01
Several error structures for weighted regression equations used for predicting volume were examined for 2 large data sets of felled and standing loblolly pine trees (Pinus taeda L.). The generally accepted model with variance of error proportional to the value of the covariate squared ( D2H = diameter squared times height or D...
An Intelligent Decision Support System for Workforce Forecast
2011-01-01
ARIMA ) model to forecast the demand for construction skills in Hong Kong. This model was based...Decision Trees ARIMA Rule Based Forecasting Segmentation Forecasting Regression Analysis Simulation Modeling Input-Output Models LP and NLP Markovian...data • When results are needed as a set of easily interpretable rules 4.1.4 ARIMA Auto-regressive, integrated, moving-average ( ARIMA ) models
Phylogenetic mixtures and linear invariants for equal input models.
Casanellas, Marta; Steel, Mike
2017-04-01
The reconstruction of phylogenetic trees from molecular sequence data relies on modelling site substitutions by a Markov process, or a mixture of such processes. In general, allowing mixed processes can result in different tree topologies becoming indistinguishable from the data, even for infinitely long sequences. However, when the underlying Markov process supports linear phylogenetic invariants, then provided these are sufficiently informative, the identifiability of the tree topology can be restored. In this paper, we investigate a class of processes that support linear invariants once the stationary distribution is fixed, the 'equal input model'. This model generalizes the 'Felsenstein 1981' model (and thereby the Jukes-Cantor model) from four states to an arbitrary number of states (finite or infinite), and it can also be described by a 'random cluster' process. We describe the structure and dimension of the vector spaces of phylogenetic mixtures and of linear invariants for any fixed phylogenetic tree (and for all trees-the so called 'model invariants'), on any number n of leaves. We also provide a precise description of the space of mixtures and linear invariants for the special case of [Formula: see text] leaves. By combining techniques from discrete random processes and (multi-) linear algebra, our results build on a classic result that was first established by James Lake (Mol Biol Evol 4:167-191, 1987).
Mladenoff, D.J.; Dahir, S.E.; Nordheim, E.V.; Schulte, L.A.; Guntenspergen, G.R.
2002-01-01
Historical data have increasingly become appreciated for insight into the past conditions of ecosystems. Uses of such data include assessing the extent of ecosystem change; deriving ecological baselines for management, restoration, and modeling; and assessing the importance of past conditions on the composition and function of current systems. One historical data set of this type is the Public Land Survey (PLS) of the United States General Land Office, which contains data on multiple tree species, sizes, and distances recorded at each survey point, located at half-mile (0.8 km) intervals on a 1-mi (1.6 km) grid. This survey method was begun in the 1790s on US federal lands extending westward from Ohio. Thus, the data have the potential of providing a view of much of the US landscape from the mid-1800s, and they have been used extensively for this purpose. However, historical data sources, such as those describing the species composition of forests, can often be limited in the detail recorded and the reliability of the data, since the information was often not originally recorded for ecological purposes. Forest trees are sometimes recorded ambiguously, using generic or obscure common names. For the PLS data of northern Wisconsin, USA, we developed a method to classify ambiguously identified tree species using logistic regression analysis, using data on trees that were clearly identified to species and a set of independent predictor variables to build the models. The models were first created on partial data sets for each species and then tested for fit against the remaining data. Validations were conducted using repeated, random subsets of the data. Model prediction accuracy ranged from 81% to 96% in differentiating congeneric species among oak, pine, ash, maple, birch, and elm. Major predictor variables were tree size, associated species, landscape classes indicative of soil type, and spatial location within the study region. Results help to clarify ambiguities formerly present in maps of historic ecosystems for the region and can be applied to PLS datasets elsewhere, as well as other sources of ambiguous historical data. Mapping the newly classified data with ecological land units provides additional information on the distribution, abundance, and associations of tree species, as well as their relationships to environmental gradients before the industrial period, and clarifies the identities of species formerly mapped only to genus. We offer some caveats on the appropriate use of data derived in this way, as well as describing their potential.
NASA Astrophysics Data System (ADS)
Thomas, N.; Rueda, X.; Lambin, E.; Mendenhall, C. D.
2012-12-01
Large intact forested regions of the world are known to be critical to maintaining Earth's climate, ecosystem health, and human livelihoods. Remote sensing has been successfully implemented as a tool to monitor forest cover and landscape dynamics over broad regions. Much of this work has been done using coarse resolution sensors such as AVHRR and MODIS in combination with moderate resolution sensors, particularly Landsat. Finer scale analysis of heterogeneous and fragmented landscapes is commonly performed with medium resolution data and has had varying success depending on many factors including the level of fragmentation, variability of land cover types, patch size, and image availability. Fine scale tree cover in mixed agricultural areas can have a major impact on biodiversity and ecosystem sustainability but may often be inadequately captured with the global to regional (coarse resolution and moderate resolution) satellite sensors and processing techniques widely used to detect land use and land cover changes. This study investigates whether advanced remote sensing methods are able to assess and monitor percent tree canopy cover in spatially complex human-dominated agricultural landscapes that prove challenging for traditional mapping techniques. Our study areas are in high altitude, mixed agricultural coffee-growing regions in Costa Rica and the Colombian Andes. We applied Random Forests regression tree analysis to Landsat data along with additional spectral, environmental, and spatial variables to predict percent tree canopy cover at 30m resolution. Image object-based texture, shape, and neighborhood metrics were generated at the Landsat scale using eCognition and included in the variable suite. Training and validation data was generated using high resolution imagery from digital aerial photography at 1m to 2.5 m resolution. Our results are promising with Pearson's correlation coefficients between observed and predicted percent tree canopy cover of .86 (Costa Rica) and .83 (Colombia). The tree cover mapping developed here supports two distinct projects on sustaining biodiversity and natural and human capital: in Costa Rica the tree canopy cover map is utilized to predict bird community composition; and in Colombia the mapping is performed for two time periods and used to assess the impact of coffee eco-certification programs on the landscape. This research identifies ways to leverage readily available, high quality, and cost-free Landsat data or other medium resolution satellite data sources in combination with high resolution data, such as that frequently available through Google Earth, to monitor and support sustainability efforts in fragmented and heterogeneous landscapes.
Inferring Epidemic Contact Structure from Phylogenetic Trees
Leventhal, Gabriel E.; Kouyos, Roger; Stadler, Tanja; von Wyl, Viktor; Yerly, Sabine; Böni, Jürg; Cellerai, Cristina; Klimkait, Thomas; Günthard, Huldrych F.; Bonhoeffer, Sebastian
2012-01-01
Contact structure is believed to have a large impact on epidemic spreading and consequently using networks to model such contact structure continues to gain interest in epidemiology. However, detailed knowledge of the exact contact structure underlying real epidemics is limited. Here we address the question whether the structure of the contact network leaves a detectable genetic fingerprint in the pathogen population. To this end we compare phylogenies generated by disease outbreaks in simulated populations with different types of contact networks. We find that the shape of these phylogenies strongly depends on contact structure. In particular, measures of tree imbalance allow us to quantify to what extent the contact structure underlying an epidemic deviates from a null model contact network and illustrate this in the case of random mixing. Using a phylogeny from the Swiss HIV epidemic, we show that this epidemic has a significantly more unbalanced tree than would be expected from random mixing. PMID:22412361
Predictors of suicidal ideation in older people: a decision tree analysis.
Handley, Tonelle E; Hiles, Sarah A; Inder, Kerry J; Kay-Lambkin, Frances J; Kelly, Brian J; Lewin, Terry J; McEvoy, Mark; Peel, Roseanne; Attia, John R
2014-11-01
Suicide among older adults is a major public health issue worldwide. Although studies have identified psychological, physical, and social contributors to suicidal thoughts in older adults, few have explored the specific interactions between these factors. This article used a novel statistical approach to explore predictors of suicidal ideation in a community-based sample of older adults. Prospective cohort study. Participants aged 55-85 years were randomly selected from the Hunter Region, a large regional center in New South Wales, Australia. Baseline psychological, physical, and social factors, including psychological distress, physical functioning, and social support, were used to predict suicidal ideation at the 5-year follow-up. Classification and regression tree modeling was used to determine specific risk profiles for participants depending on their individual well-being in each of these key areas. Psychological distress was the strongest predictor, with 25% of people with high distress reporting suicidal ideation. Within high psychological distress, lower physical functioning significantly increased the likelihood of suicidal ideation, with high distress and low functioning being associated with ideation in 50% of cases. A substantial subgroup reported suicidal ideation in the absence of psychological distress; dissatisfaction with social support was the most important predictor among this group. The performance of the model was high (area under the curve: 0.81). Decision tree modeling enabled individualized "risk" profiles for suicidal ideation to be determined. Although psychological factors are important for predicting suicidal ideation, both physical and social factors significantly improved the predictive ability of the model. Assessing these factors may enhance identification of older people at risk of suicidal ideation. Copyright © 2014. Published by Elsevier Inc.
Estimating Mixed Broadleaves Forest Stand Volume Using Dsm Extracted from Digital Aerial Images
NASA Astrophysics Data System (ADS)
Sohrabi, H.
2012-07-01
In mixed old growth broadleaves of Hyrcanian forests, it is difficult to estimate stand volume at plot level by remotely sensed data while LiDar data is absent. In this paper, a new approach has been proposed and tested for estimating stand forest volume. The approach is based on this idea that forest volume can be estimated by variation of trees height at plots. In the other word, the more the height variation in plot, the more the stand volume would be expected. For testing this idea, 120 circular 0.1 ha sample plots with systematic random design has been collected in Tonekaon forest located in Hyrcanian zone. Digital surface model (DSM) measure the height values of the first surface on the ground including terrain features, trees, building etc, which provides a topographic model of the earth's surface. The DSMs have been extracted automatically from aerial UltraCamD images so that ground pixel size for extracted DSM varied from 1 to 10 m size by 1m span. DSMs were checked manually for probable errors. Corresponded to ground samples, standard deviation and range of DSM pixels have been calculated. For modeling, non-linear regression method was used. The results showed that standard deviation of plot pixels with 5 m resolution was the most appropriate data for modeling. Relative bias and RMSE of estimation was 5.8 and 49.8 percent, respectively. Comparing to other approaches for estimating stand volume based on passive remote sensing data in mixed broadleaves forests, these results are more encouraging. One big problem in this method occurs when trees canopy cover is totally closed. In this situation, the standard deviation of height is low while stand volume is high. In future studies, applying forest stratification could be studied.
Baldi, F; Alencar, M M; Albuquerque, L G
2010-12-01
The objective of this work was to estimate covariance functions using random regression models on B-splines functions of animal age, for weights from birth to adult age in Canchim cattle. Data comprised 49,011 records on 2435 females. The model of analysis included fixed effects of contemporary groups, age of dam as quadratic covariable and the population mean trend taken into account by a cubic regression on orthogonal polynomials of animal age. Residual variances were modelled through a step function with four classes. The direct and maternal additive genetic effects, and animal and maternal permanent environmental effects were included as random effects in the model. A total of seventeen analyses, considering linear, quadratic and cubic B-splines functions and up to seven knots, were carried out. B-spline functions of the same order were considered for all random effects. Random regression models on B-splines functions were compared to a random regression model on Legendre polynomials and with a multitrait model. Results from different models of analyses were compared using the REML form of the Akaike Information criterion and Schwarz' Bayesian Information criterion. In addition, the variance components and genetic parameters estimated for each random regression model were also used as criteria to choose the most adequate model to describe the covariance structure of the data. A model fitting quadratic B-splines, with four knots or three segments for direct additive genetic effect and animal permanent environmental effect and two knots for maternal additive genetic effect and maternal permanent environmental effect, was the most adequate to describe the covariance structure of the data. Random regression models using B-spline functions as base functions fitted the data better than Legendre polynomials, especially at mature ages, but higher number of parameters need to be estimated with B-splines functions. © 2010 Blackwell Verlag GmbH.
Price, B; Gomez, A; Mathys, L; Gardi, O; Schellenberger, A; Ginzler, C; Thürig, E
2017-03-01
Trees outside forest (TOF) can perform a variety of social, economic and ecological functions including carbon sequestration. However, detailed quantification of tree biomass is usually limited to forest areas. Taking advantage of structural information available from stereo aerial imagery and airborne laser scanning (ALS), this research models tree biomass using national forest inventory data and linear least-square regression and applies the model both inside and outside of forest to create a nationwide model for tree biomass (above ground and below ground). Validation of the tree biomass model against TOF data within settlement areas shows relatively low model performance (R 2 of 0.44) but still a considerable improvement on current biomass estimates used for greenhouse gas inventory and carbon accounting. We demonstrate an efficient and easily implementable approach to modelling tree biomass across a large heterogeneous nationwide area. The model offers significant opportunity for improved estimates on land use combination categories (CC) where tree biomass has either not been included or only roughly estimated until now. The ALS biomass model also offers the advantage of providing greater spatial resolution and greater within CC spatial variability compared to the current nationwide estimates.
Sah, Jay P.; Ross, Michael S.; Snyder, James R.; Ogurcak, Danielle E.
2010-01-01
In fire-dependent forests, managers are interested in predicting the consequences of prescribed burning on postfire tree mortality. We examined the effects of prescribed fire on tree mortality in Florida Keys pine forests, using a factorial design with understory type, season, and year of burn as factors. We also used logistic regression to model the effects of burn season, fire severity, and tree dimensions on individual tree mortality. Despite limited statistical power due to problems in carrying out the full suite of planned experimental burns, associations with tree and fire variables were observed. Post-fire pine tree mortality was negatively correlated with tree size and positively correlated with char height and percent crown scorch. Unlike post-fire mortality, tree mortality associated with storm surge from Hurricane Wilma was greater in the large size classes. Due to their influence on population structure and fuel dynamics, the size-selective mortality patterns following fire and storm surge have practical importance for using fire as a management tool in Florida Keys pinelands in the future, particularly when the threats to their continued existence from tropical storms and sea level rise are expected to increase.
A note on variance estimation in random effects meta-regression.
Sidik, Kurex; Jonkman, Jeffrey N
2005-01-01
For random effects meta-regression inference, variance estimation for the parameter estimates is discussed. Because estimated weights are used for meta-regression analysis in practice, the assumed or estimated covariance matrix used in meta-regression is not strictly correct, due to possible errors in estimating the weights. Therefore, this note investigates the use of a robust variance estimation approach for obtaining variances of the parameter estimates in random effects meta-regression inference. This method treats the assumed covariance matrix of the effect measure variables as a working covariance matrix. Using an example of meta-analysis data from clinical trials of a vaccine, the robust variance estimation approach is illustrated in comparison with two other methods of variance estimation. A simulation study is presented, comparing the three methods of variance estimation in terms of bias and coverage probability. We find that, despite the seeming suitability of the robust estimator for random effects meta-regression, the improved variance estimator of Knapp and Hartung (2003) yields the best performance among the three estimators, and thus may provide the best protection against errors in the estimated weights.
Node degree distribution in spanning trees
NASA Astrophysics Data System (ADS)
Pozrikidis, C.
2016-03-01
A method is presented for computing the number of spanning trees involving one link or a specified group of links, and excluding another link or a specified group of links, in a network described by a simple graph in terms of derivatives of the spanning-tree generating function defined with respect to the eigenvalues of the Kirchhoff (weighted Laplacian) matrix. The method is applied to deduce the node degree distribution in a complete or randomized set of spanning trees of an arbitrary network. An important feature of the proposed method is that the explicit construction of spanning trees is not required. It is shown that the node degree distribution in the spanning trees of the complete network is described by the binomial distribution. Numerical results are presented for the node degree distribution in square, triangular, and honeycomb lattices.
Presence of indicator plant species as a predictor of wetland vegetation integrity
Stapanian, Martin A.; Adams, Jean V.; Gara, Brian
2013-01-01
We fit regression and classification tree models to vegetation data collected from Ohio (USA) wetlands to determine (1) which species best predict Ohio vegetation index of biotic integrity (OVIBI) score and (2) which species best predict high-quality wetlands (OVIBI score >75). The simplest regression tree model predicted OVIBI score based on the occurrence of three plant species: skunk-cabbage (Symplocarpus foetidus), cinnamon fern (Osmunda cinnamomea), and swamp rose (Rosa palustris). The lowest OVIBI scores were best predicted by the absence of the selected plant species rather than by the presence of other species. The simplest classification tree model predicted high-quality wetlands based on the occurrence of two plant species: skunk-cabbage and marsh-fern (Thelypteris palustris). The overall misclassification rate from this tree was 13 %. Again, low-quality wetlands were better predicted than high-quality wetlands by the absence of selected species rather than the presence of other species using the classification tree model. Our results suggest that a species’ wetland status classification and coefficient of conservatism are of little use in predicting wetland quality. A simple, statistically derived species checklist such as the one created in this study could be used by field biologists to quickly and efficiently identify wetland sites likely to be regulated as high-quality, and requiring more intensive field assessments. Alternatively, it can be used for advanced determinations of low-quality wetlands. Agencies can save considerable money by screening wetlands for the presence/absence of such “indicator” species before issuing permits.
Automatic energy expenditure measurement for health science.
Catal, Cagatay; Akbulut, Akhan
2018-04-01
It is crucial to predict the human energy expenditure in any sports activity and health science application accurately to investigate the impact of the activity. However, measurement of the real energy expenditure is not a trivial task and involves complex steps. The objective of this work is to improve the performance of existing estimation models of energy expenditure by using machine learning algorithms and several data from different sensors and provide this estimation service in a cloud-based platform. In this study, we used input data such as breathe rate, and hearth rate from three sensors. Inputs are received from a web form and sent to the web service which applies a regression model on Azure cloud platform. During the experiments, we assessed several machine learning models based on regression methods. Our experimental results showed that our novel model which applies Boosted Decision Tree Regression in conjunction with the median aggregation technique provides the best result among other five regression algorithms. This cloud-based energy expenditure system which uses a web service showed that cloud computing technology is a great opportunity to develop estimation systems and the new model which applies Boosted Decision Tree Regression with the median aggregation provides remarkable results. Copyright © 2018 Elsevier B.V. All rights reserved.
Wu, Jian-qiang; Wang, Yi-xiang; Yang, Yi; Zhu, Ting-ting; Zhu, Xu-dan
2015-02-01
Crop trees were selected in a 26-year-old even-aged Cunninghamia lanceolata plantation in Lin' an, and compared in plots that were released and unreleased to examine growth and structure responses for 3 years after thinning. Crop tree release significantly increased the mean increments of diameter and volume of individual tree by 1.30 and 1.25 times relative to trees in control stands, respectively. The increments of diameter and volume of crop trees were significantly higher than those of general trees in thinning plots, crop trees and general trees in control plots, which suggested that the responses from different tree types to crop tree release treatment were different. Crop tree release increased the average distances of crop trees to the nearest neighboring trees, reducing competition among crop trees by about 68.2%. 3-year stand volume increment for thinning stands had no significant difference with that of control stands although the number of trees was only 81.5% of the control. Crop trees in thinned plots with diameters over than 14 cm reached 18.0% over 3 years, compared with 12.0% for trees without thinning, suggesting that crop tree release benefited the larger individual trees. The pattern of tree locations in thinning plots tended to be random, complying with the rule that tree distribution pattern changes with growth. Crop tree release in C. lanceolata plantation not only promoted the stand growth, but also optimized the stand structure, benefiting crop trees sustained rapid growth and larger diameter trees production.
Mukherjee, Arideep; Agrawal, Madhoolika
2018-05-15
Responses of urban vegetation to air pollution stress in relation to their tolerance and sensitivity have been extensively studied, however, studies related to air pollution responses based on different leaf functional traits and tree characteristics are limited. In this paper, we have tried to assess combined and individual effects of major air pollutants PM 10 (particulate matter ≤ 10 µm), TSP (total suspended particulate matter), SO 2 (sulphur dioxide), NO 2 (nitrogen dioxide) and O 3 (ozone) on thirteen tropical tree species in relation to fifteen leaf functional traits and different tree characteristics. Stepwise linear regression a general linear modelling approach was used to quantify the pollution response of trees against air pollutants. The study was performed for six successive seasons for two years in three distinct urban areas (traffic, industrial and residential) of Varanasi city in India. At all the study sites, concentrations of air pollutants, specifically PM (particulate matter) and NO 2 were above the specified standards. Distinct variations were recorded in all the fifteen leaf functional traits with pollution load. Caesalpinia sappan was identified as most tolerant species followed by Psidium guajava, Dalbergia sissoo and Albizia lebbeck. Stepwise regression analysis identified maximum response of Eucalyptus citriodora and P. guajava to air pollutants explaining overall 59% and 58% variability's in leaf functional traits, respectively. Among leaf functional traits, maximum effect of air pollutants was observed on non-enzymatic antioxidants followed by photosynthetic pigments and leaf water status. Among the pollutants, PM was identified as the major stress factor followed by O 3 explaining 47% and 33% variability's in leaf functional traits. Tolerance and pollution response were regulated by different tree characteristics such as height, canopy size, leaf from, texture and nature of tree. Outcomes of this study will help in urban forest development by selection of specific pollutant tolerant tree species and leaf traits, which is suitable as air pollution mitigation measure. Copyright © 2018 Elsevier Inc. All rights reserved.
Soil salinity study in Northern Great Plains sodium affected soil
NASA Astrophysics Data System (ADS)
Kharel, Tulsi P.
Climate and land-use changes when combined with the marine sediments that underlay portions of the Northern Great Plains have increased the salinization and sodification risks. The objectives of this dissertation were to compare three chemical amendments (calcium chloride, sulfuric acid and gypsum) remediation strategies on water permeability and sodium (Na) transport in undisturbed soil columns and to develop a remote sensing technique to characterize salinization in South Dakota soils. Forty-eight undisturbed soil columns (30 cm x 15 cm) collected from White Lake, Redfield, and Pierpont were used to assess the chemical remediation strategies. In this study the experimental design was a completely randomized design and each treatment was replicated four times. Following the application of chemical remediation strategies, 45.2 cm of water was leached through these columns. The leachate was separated into 120- ml increments and analyzed for Na and electrical conductivity (EC). Sulfuric acid increased Na leaching, whereas gypsum and CaCl2 increased water permeability. Our results further indicate that to maintain effective water permeability, ratio between soil EC and sodium absorption ratio (SAR) should be considered. In the second study, soil samples from 0-15 cm depth in 62 x 62 m grid spacing were taken from the South Dakota Pierpont (65 ha) and Redfield (17 ha) sites. Saturated paste EC was measured on each soil sample. At each sampling points reflectance and derived indices (Landsat 5, 7, 8 images), elevation, slope and aspect (LiDAR) were extracted. Regression models based on multiple linear regression, classification and regression tree, cubist, and random forest techniques were developed and their ability to predict soil EC were compared. Results showed that: 1) Random forest method was found to be the most effective method because of its ability to capture spatially correlated variation, 2) the short wave infrared (1.5 -2.29 mum) and near infrared (0.75-0.90 mum) were very sensitive to soil salinity; 3) EC prediction model using all 3 season (spring, summer and fall) images was better on state wide validation dataset compared to individual season model. Finally, in eastern South Dakota, the model predicted that from 2008 to 2012, EC increased in 569,165 ha or 13.4% of the land seeded to corn (Zea mays L.) or soybeans (Glycine max L).
Wilson, Jordan L; Samaranayake, V A; Limmer, Matthew A; Schumacher, John G; Burken, Joel G
2017-12-19
Contaminated sites pose ecological and human-health risks through exposure to contaminated soil and groundwater. Whereas we can readily locate, monitor, and track contaminants in groundwater, it is harder to perform these tasks in the vadose zone. In this study, tree-core samples were collected at a Superfund site to determine if the sample-collection location around a particular tree could reveal the subsurface location, or direction, of soil and soil-gas contaminant plumes. Contaminant-centroid vectors were calculated from tree-core data to reveal contaminant distributions in directional tree samples at a higher resolution, and vectors were correlated with soil-gas characterization collected using conventional methods. Results clearly demonstrated that directional tree coring around tree trunks can indicate gradients in soil and soil-gas contaminant plumes, and the strength of the correlations were directly proportionate to the magnitude of tree-core concentration gradients (spearman's coefficient of -0.61 and -0.55 in soil and tree-core gradients, respectively). Linear regression indicates agreement between the concentration-centroid vectors is significantly affected by in planta and soil concentration gradients and when concentration centroids in soil are closer to trees. Given the existing link between soil-gas and vapor intrusion, this study also indicates that directional tree coring might be applicable in vapor intrusion assessment.
Wilson, Jordan L.; Samaranayake, V.A.; Limmer, Matthew A.; Schumacher, John G.; Burken, Joel G.
2017-01-01
Contaminated sites pose ecological and human-health risks through exposure to contaminated soil and groundwater. Whereas we can readily locate, monitor, and track contaminants in groundwater, it is harder to perform these tasks in the vadose zone. In this study, tree-core samples were collected at a Superfund site to determine if the sample-collection location around a particular tree could reveal the subsurface location, or direction, of soil and soil-gas contaminant plumes. Contaminant-centroid vectors were calculated from tree-core data to reveal contaminant distributions in directional tree samples at a higher resolution, and vectors were correlated with soil-gas characterization collected using conventional methods. Results clearly demonstrated that directional tree coring around tree trunks can indicate gradients in soil and soil-gas contaminant plumes, and the strength of the correlations were directly proportionate to the magnitude of tree-core concentration gradients (spearman’s coefficient of -0.61 and -0.55 in soil and tree-core gradients, respectively). Linear regression indicates agreement between the concentration-centroid vectors is significantly affected by in-planta and soil concentration gradients and when concentration centroids in soil are closer to trees. Given the existing link between soil-gas and vapor intrusion, this study also indicates that directional tree coring might be applicable in vapor intrusion assessment.
NASA Astrophysics Data System (ADS)
Vaglio Laurin, Gaia; Puletti, Nicola; Chen, Qi; Corona, Piermaria; Papale, Dario; Valentini, Riccardo
2016-10-01
Estimates of forest aboveground biomass are fundamental for carbon monitoring and accounting; delivering information at very high spatial resolution is especially valuable for local management, conservation and selective logging purposes. In tropical areas, hosting large biomass and biodiversity resources which are often threatened by unsustainable anthropogenic pressures, frequent forest resources monitoring is needed. Lidar is a powerful tool to estimate aboveground biomass at fine resolution; however its application in tropical forests has been limited, with high variability in the accuracy of results. Lidar pulses scan the forest vertical profile, and can provide structure information which is also linked to biodiversity. In the last decade the remote sensing of biodiversity has received great attention, but few studies focused on the use of lidar for assessing tree species richness in tropical forests. This research aims at estimating aboveground biomass and tree species richness using discrete return airborne lidar in Ghana forests. We tested an advanced statistical technique, Multivariate Adaptive Regression Splines (MARS), which does not require assumptions on data distribution or on the relationships between variables, being suitable for studying ecological variables. We compared the MARS regression results with those obtained by multilinear regression and found that both algorithms were effective, but MARS provided higher accuracy either for biomass (R2 = 0.72) and species richness (R2 = 0.64). We also noted strong correlation between biodiversity and biomass field values. Even if the forest areas under analysis are limited in extent and represent peculiar ecosystems, the preliminary indications produced by our study suggest that instrument such as lidar, specifically useful for pinpointing forest structure, can also be exploited as a support for tree species richness assessment.
NASA Astrophysics Data System (ADS)
Rocha, Alby D.; Groen, Thomas A.; Skidmore, Andrew K.; Darvishzadeh, Roshanak; Willemen, Louise
2017-11-01
The growing number of narrow spectral bands in hyperspectral remote sensing improves the capacity to describe and predict biological processes in ecosystems. But it also poses a challenge to fit empirical models based on such high dimensional data, which often contain correlated and noisy predictors. As sample sizes, to train and validate empirical models, seem not to be increasing at the same rate, overfitting has become a serious concern. Overly complex models lead to overfitting by capturing more than the underlying relationship, and also through fitting random noise in the data. Many regression techniques claim to overcome these problems by using different strategies to constrain complexity, such as limiting the number of terms in the model, by creating latent variables or by shrinking parameter coefficients. This paper is proposing a new method, named Naïve Overfitting Index Selection (NOIS), which makes use of artificially generated spectra, to quantify the relative model overfitting and to select an optimal model complexity supported by the data. The robustness of this new method is assessed by comparing it to a traditional model selection based on cross-validation. The optimal model complexity is determined for seven different regression techniques, such as partial least squares regression, support vector machine, artificial neural network and tree-based regressions using five hyperspectral datasets. The NOIS method selects less complex models, which present accuracies similar to the cross-validation method. The NOIS method reduces the chance of overfitting, thereby avoiding models that present accurate predictions that are only valid for the data used, and too complex to make inferences about the underlying process.
Barrett A. Garrison; Christopher D. otahal; Matthew L. Triggs
2002-01-01
Age structure and growth of California black oak (Quercus kelloggii) was determined from tagged trees at four 26.1-acre study stands in Placer County, California. Stands were dominated by large diameter (>20 inch dbh) California black oak and ponderosa pine (Pinus ponderosa). Randomly selected trees were tagged in June-August...
Greedy algorithms in disordered systems
NASA Astrophysics Data System (ADS)
Duxbury, P. M.; Dobrin, R.
1999-08-01
We discuss search, minimal path and minimal spanning tree algorithms and their applications to disordered systems. Greedy algorithms solve these problems exactly, and are related to extremal dynamics in physics. Minimal cost path (Dijkstra) and minimal cost spanning tree (Prim) algorithms provide extremal dynamics for a polymer in a random medium (the KPZ universality class) and invasion percolation (without trapping) respectively.
Demographic trends in Claremont California’s street tree population
Natalie S. van Doorn; E. Gregory McPherson
2018-01-01
The aim of this study was to quantify street tree population dynamics in the city of Claremont, CA. A repeated measures survey (2000 and 2014) based on a stratified random sampling approach across size classes and for the most abundant 21 species was analyzed to calculate removal, growth, and replacement planting rates. Demographic rates were estimated using a...
An individual-tree dbh-total height model with random plot effects for shortleaf pine
Chakra B. Budhathoki; Thomas B. Lynch; James M. Guldin
2007-01-01
Individual tree measurements were available from over 200 permanent plots established during 1985-1987 and later remeasured in naturally regenerated stands of shortleaf pine (Pinus echinata Mill.) in western Arkansas and eastern Oklahoma. The objective of this study was to model shortleaf pine growth in natural stands for the region. As a major...
Estimating potential habitat for 134 eastern US tree species under six climate scenarios
Louis R. Iverson; Anantha M. Prasad; Stephen N. Matthews; Matthew Peters
2008-01-01
We modeled and mapped, using the predictive data mining tool Random Forests, 134 tree species from the eastern United States for potential response to several scenarios of climate change. Each species was modeled individually to show current and potential future habitats according to two emission scenarios (high emissions on current trajectory and reasonable...
Quantitative Trait Inheritance in a Forty-Year-Old Longleaf Pine Partial Diallel Test
Michael Stine; Jim Roberds; C. Dana Nelson; David P. Gwaze; Todd Shupe; Les Groom
2002-01-01
A longleaf pine (Pinus palustris Mill.) 13 parent partial diallel field experiment was established at two locations on the Harrison Experimental Forest in 1960. Parent trees were randomly selected from a natural population growing on the Harrison Experimental Forest, near Gulfport, Miss. Distance between trees chosen as parents ranged from 13 to 357...