Sample records for random forests regression

  1. An application of quantile random forests for predictive mapping of forest attributes

    Treesearch

    E.A. Freeman; G.G. Moisen

    2015-01-01

    Increasingly, random forest models are used in predictive mapping of forest attributes. Traditional random forests output the mean prediction from the random trees. Quantile regression forests (QRF) is an extension of random forests developed by Nicolai Meinshausen that provides non-parametric estimates of the median predicted value as well as prediction quantiles. It...

  2. Approximating prediction uncertainty for random forest regression models

    Treesearch

    John W. Coulston; Christine E. Blinn; Valerie A. Thomas; Randolph H. Wynne

    2016-01-01

    Machine learning approaches such as random forest have increased for the spatial modeling and mapping of continuous variables. Random forest is a non-parametric ensemble approach, and unlike traditional regression approaches there is no direct quantification of prediction error. Understanding prediction uncertainty is important when using model-based continuous maps as...

  3. Newer classification and regression tree techniques: Bagging and Random Forests for ecological prediction

    Treesearch

    Anantha M. Prasad; Louis R. Iverson; Andy Liaw

    2006-01-01

    We evaluated four statistical models - Regression Tree Analysis (RTA), Bagging Trees (BT), Random Forests (RF), and Multivariate Adaptive Regression Splines (MARS) - for predictive vegetation mapping under current and future climate scenarios according to the Canadian Climate Centre global circulation model.

  4. Calibrating random forests for probability estimation.

    PubMed

    Dankowski, Theresa; Ziegler, Andreas

    2016-09-30

    Probabilities can be consistently estimated using random forests. It is, however, unclear how random forests should be updated to make predictions for other centers or at different time points. In this work, we present two approaches for updating random forests for probability estimation. The first method has been proposed by Elkan and may be used for updating any machine learning approach yielding consistent probabilities, so-called probability machines. The second approach is a new strategy specifically developed for random forests. Using the terminal nodes, which represent conditional probabilities, the random forest is first translated to logistic regression models. These are, in turn, used for re-calibration. The two updating strategies were compared in a simulation study and are illustrated with data from the German Stroke Study Collaboration. In most simulation scenarios, both methods led to similar improvements. In the simulation scenario in which the stricter assumptions of Elkan's method were not met, the logistic regression-based re-calibration approach for random forests outperformed Elkan's method. It also performed better on the stroke data than Elkan's method. The strength of Elkan's method is its general applicability to any probability machine. However, if the strict assumptions underlying this approach are not met, the logistic regression-based approach is preferable for updating random forests for probability estimation. © 2016 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd.

  5. Subpixel urban land cover estimation: comparing cubist, random forests, and support vector regression

    Treesearch

    Jeffrey T. Walton

    2008-01-01

    Three machine learning subpixel estimation methods (Cubist, Random Forests, and support vector regression) were applied to estimate urban cover. Urban forest canopy cover and impervious surface cover were estimated from Landsat-7 ETM+ imagery using a higher resolution cover map resampled to 30 m as training and reference data. Three different band combinations (...

  6. Comparing spatial regression to random forests for large environmental data sets

    EPA Science Inventory

    Environmental data may be “large” due to number of records, number of covariates, or both. Random forests has a reputation for good predictive performance when using many covariates, whereas spatial regression, when using reduced rank methods, has a reputatio...

  7. Novel solutions for an old disease: diagnosis of acute appendicitis with random forest, support vector machines, and artificial neural networks.

    PubMed

    Hsieh, Chung-Ho; Lu, Ruey-Hwa; Lee, Nai-Hsin; Chiu, Wen-Ta; Hsu, Min-Huei; Li, Yu-Chuan Jack

    2011-01-01

    Diagnosing acute appendicitis clinically is still difficult. We developed random forests, support vector machines, and artificial neural network models to diagnose acute appendicitis. Between January 2006 and December 2008, patients who had a consultation session with surgeons for suspected acute appendicitis were enrolled. Seventy-five percent of the data set was used to construct models including random forest, support vector machines, artificial neural networks, and logistic regression. Twenty-five percent of the data set was withheld to evaluate model performance. The area under the receiver operating characteristic curve (AUC) was used to evaluate performance, which was compared with that of the Alvarado score. Data from a total of 180 patients were collected, 135 used for training and 45 for testing. The mean age of patients was 39.4 years (range, 16-85). Final diagnosis revealed 115 patients with and 65 without appendicitis. The AUC of random forest, support vector machines, artificial neural networks, logistic regression, and Alvarado was 0.98, 0.96, 0.91, 0.87, and 0.77, respectively. The sensitivity, specificity, positive, and negative predictive values of random forest were 94%, 100%, 100%, and 87%, respectively. Random forest performed better than artificial neural networks, logistic regression, and Alvarado. We demonstrated that random forest can predict acute appendicitis with good accuracy and, deployed appropriately, can be an effective tool in clinical decision making. Copyright © 2011 Mosby, Inc. All rights reserved.
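
    The four test-set figures reported for random forest above all derive from one confusion matrix. A minimal sketch of that arithmetic follows. The counts are hypothetical: they were chosen only so that they sum to the 45 test patients and round to the reported percentages, and they are not taken from the paper. The function name `diagnostic_metrics` is likewise an assumption.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, PPV and NPV from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positives among all diseased
        "specificity": tn / (tn + fp),  # true negatives among all non-diseased
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }


# Hypothetical counts for a 45-patient test set (NOT reported in the abstract).
metrics = diagnostic_metrics(tp=30, fp=0, tn=13, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.0%}")
```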

  8. Learning accurate and interpretable models based on regularized random forests regression

    PubMed Central

    2014-01-01

    Background Many biology related research works combine data from multiple sources in an effort to understand the underlying problems. It is important to find and interpret the most important information from these sources. Thus it will be beneficial to have an effective algorithm that can simultaneously extract decision rules and select critical features for good interpretation while preserving the prediction performance. Methods In this study, we focus on regression problems for biological data where target outcomes are continuous. In general, models constructed from linear regression approaches are relatively easy to interpret. However, many practical biological applications are nonlinear in essence where we can hardly find a direct linear relationship between input and output. Nonlinear regression techniques can reveal nonlinear relationship of data, but are generally hard for human to interpret. We propose a rule based regression algorithm that uses 1-norm regularized random forests. The proposed approach simultaneously extracts a small number of rules from generated random forests and eliminates unimportant features. Results We tested the approach on some biological data sets. The proposed approach is able to construct a significantly smaller set of regression rules using a subset of attributes while achieving prediction performance comparable to that of random forests regression. Conclusion It demonstrates high potential in aiding prediction and interpretation of nonlinear relationships of the subject being studied. PMID:25350120

  9. An Introduction to Recursive Partitioning: Rationale, Application, and Characteristics of Classification and Regression Trees, Bagging, and Random Forests

    ERIC Educational Resources Information Center

    Strobl, Carolin; Malley, James; Tutz, Gerhard

    2009-01-01

    Recursive partitioning methods have become popular and widely used tools for nonparametric regression and classification in many scientific fields. Especially random forests, which can deal with large numbers of predictor variables even in the presence of complex interactions, have been applied successfully in genetics, clinical medicine, and…

  10. New machine learning tools for predictive vegetation mapping after climate change: Bagging and Random Forest perform better than Regression Tree Analysis

    Treesearch

    L.R. Iverson; A.M. Prasad; A. Liaw

    2004-01-01

    More and better machine learning tools are becoming available for landscape ecologists to aid in understanding species-environment relationships and to map probable species occurrence now and potentially into the future. To that end, we evaluated three statistical models: Regression Tree Analysis (RTA), Bagging Trees (BT) and Random Forest (RF) for their utility in...

  11. Random forest models to predict aqueous solubility.

    PubMed

    Palmer, David S; O'Boyle, Noel M; Glen, Robert C; Mitchell, John B O

    2007-01-01

    Random Forest regression (RF), Partial-Least-Squares (PLS) regression, Support Vector Machines (SVM), and Artificial Neural Networks (ANN) were used to develop QSPR models for the prediction of aqueous solubility, based on experimental data for 988 organic molecules. The Random Forest regression model predicted aqueous solubility more accurately than those created by PLS, SVM, and ANN and offered methods for automatic descriptor selection, an assessment of descriptor importance, and an in-parallel measure of predictive ability, all of which serve to recommend its use. The prediction of log molar solubility for an external test set of 330 molecules that are solid at 25 degrees C gave an r2 = 0.89 and RMSE = 0.69 log S units. For a standard data set selected from the literature, the model performed well with respect to other documented methods. Finally, the diversity of the training and test sets are compared to the chemical space occupied by molecules in the MDL drug data report, on the basis of molecular descriptors selected by the regression analysis.
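
    The two test-set statistics quoted above (r2 and RMSE) can be computed as sketched below. The abstract does not say which definition of r2 was used; this sketch assumes the coefficient of determination, 1 - SS_res/SS_tot, and the log S values are toy numbers, not data from the study.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Toy log S values, for illustration only.
observed_log_s = [-1.0, -2.0, -3.0, -4.0]
predicted_log_s = [-1.5, -2.0, -2.5, -4.0]
print(r_squared(observed_log_s, predicted_log_s), rmse(observed_log_s, predicted_log_s))
```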

  12. 3D statistical shape models incorporating 3D random forest regression voting for robust CT liver segmentation

    NASA Astrophysics Data System (ADS)

    Norajitra, Tobias; Meinzer, Hans-Peter; Maier-Hein, Klaus H.

    2015-03-01

    During image segmentation, 3D Statistical Shape Models (SSM) usually conduct a limited search for target landmarks within one-dimensional search profiles perpendicular to the model surface. In addition, landmark appearance is modeled only locally based on linear profiles and weak learners, altogether leading to segmentation errors from landmark ambiguities and limited search coverage. We present a new method for 3D SSM segmentation based on 3D Random Forest Regression Voting. For each surface landmark, a Random Regression Forest is trained that learns a 3D spatial displacement function between the according reference landmark and a set of surrounding sample points, based on an infinite set of non-local randomized 3D Haar-like features. Landmark search is then conducted omni-directionally within 3D search spaces, where voxelwise forest predictions on landmark position contribute to a common voting map which reflects the overall position estimate. Segmentation experiments were conducted on a set of 45 CT volumes of the human liver, of which 40 images were randomly chosen for training and 5 for testing. Without parameter optimization, using a simple candidate selection and a single resolution approach, excellent results were achieved, while faster convergence and better concavity segmentation were observed, altogether underlining the potential of our approach in terms of increased robustness from distinct landmark detection and from better search coverage.

  13. Characterizing stand-level forest canopy cover and height using Landsat time series, samples of airborne LiDAR, and the Random Forest algorithm

    NASA Astrophysics Data System (ADS)

    Ahmed, Oumer S.; Franklin, Steven E.; Wulder, Michael A.; White, Joanne C.

    2015-03-01

    Many forest management activities, including the development of forest inventories, require spatially detailed forest canopy cover and height data. Among the various remote sensing technologies, LiDAR (Light Detection and Ranging) offers the most accurate and consistent means for obtaining reliable canopy structure measurements. A potential solution to reduce the cost of LiDAR data is to integrate transects (samples) of LiDAR data with frequently acquired and spatially comprehensive optical remotely sensed data. Although multiple regression is commonly used for such modeling, often it does not fully capture the complex relationships between forest structure variables. This study investigates the potential of Random Forest (RF), a machine learning technique, to estimate LiDAR measured canopy structure using a time series of Landsat imagery. The study is implemented over a 2600 ha area of industrially managed coastal temperate forests on Vancouver Island, British Columbia, Canada. We implemented a trajectory-based approach to time series analysis that generates time since disturbance (TSD) and disturbance intensity information for each pixel and we used this information to stratify the forest land base into two strata: mature forests and young forests. Canopy cover and height for three forest classes (i.e. mature, young and mature and young (combined)) were modeled separately using multiple regression and Random Forest (RF) techniques. For all forest classes, the RF models provided improved estimates relative to the multiple regression models. The lowest validation error was obtained for the mature forest stratum in an RF model (R2 = 0.88, RMSE = 2.39 m and bias = -0.16 for canopy height; R2 = 0.72, RMSE = 0.068% and bias = -0.0049 for canopy cover). This study demonstrates the value of using disturbance and successional history to inform estimates of canopy structure and obtain improved estimates of forest canopy cover and height using the RF algorithm.

  14. Visible and near infrared spectroscopy coupled to random forest to quantify some soil quality parameters

    NASA Astrophysics Data System (ADS)

    de Santana, Felipe Bachion; de Souza, André Marcelo; Poppi, Ronei Jesus

    2018-02-01

    This study evaluates the use of visible and near infrared spectroscopy (Vis-NIRS) combined with multivariate regression based on random forest to quantify some soil quality parameters. The parameters analyzed were soil cation exchange capacity (CEC), sum of exchange bases (SB), organic matter (OM), clay and sand present in the soils of several regions of Brazil. Current methods for evaluating these parameters are laborious, time-consuming and require various wet analytical methods that are not adequate for use in precision agriculture, where faster and automatic responses are required. The random forest regression models were statistically better than PLS regression models for CEC, OM, clay and sand, demonstrating resistance to overfitting, attenuating the effect of outlier samples and indicating the most important variables for the model. The methodology demonstrates the potential of the Vis-NIR as an alternative for determination of CEC, SB, OM, sand and clay, making it possible to develop a fast and automatic analytical procedure.

  15. Comparison of the Predictive Performance and Interpretability of Random Forest and Linear Models on Benchmark Data Sets.

    PubMed

    Marchese Robinson, Richard L; Palczewska, Anna; Palczewski, Jan; Kidley, Nathan

    2017-08-28

    The ability to interpret the predictions made by quantitative structure-activity relationships (QSARs) offers a number of advantages. While QSARs built using nonlinear modeling approaches, such as the popular Random Forest algorithm, might sometimes be more predictive than those built using linear modeling approaches, their predictions have been perceived as difficult to interpret. However, a growing number of approaches have been proposed for interpreting nonlinear QSAR models in general and Random Forest in particular. In the current work, we compare the performance of Random Forest to those of two widely used linear modeling approaches: linear Support Vector Machines (SVMs) (or Support Vector Regression (SVR)) and partial least-squares (PLS). We compare their performance in terms of their predictivity as well as the chemical interpretability of the predictions using novel scoring schemes for assessing heat map images of substructural contributions. We critically assess different approaches for interpreting Random Forest models as well as for obtaining predictions from the forest. We assess the models on a large number of widely employed public-domain benchmark data sets corresponding to regression and binary classification problems of relevance to hit identification and toxicology. We conclude that Random Forest typically yields comparable or possibly better predictive performance than the linear modeling approaches and that its predictions may also be interpreted in a chemically and biologically meaningful way. In contrast to earlier work looking at interpretation of nonlinear QSAR models, we directly compare two methodologically distinct approaches for interpreting Random Forest models. The approaches for interpreting Random Forest assessed in our article were implemented using open-source programs that we have made available to the community. 
    These programs are the rfFC package ( https://r-forge.r-project.org/R/?group_id=1725 ) for the R statistical programming language and the Python program HeatMapWrapper [ https://doi.org/10.5281/zenodo.495163 ] for heat map generation.

  16. Hierarchical Bayesian spatial models for predicting multiple forest variables using waveform LiDAR, hyperspectral imagery, and large inventory datasets

    USGS Publications Warehouse

    Finley, Andrew O.; Banerjee, Sudipto; Cook, Bruce D.; Bradford, John B.

    2013-01-01

    In this paper we detail a multivariate spatial regression model that couples LiDAR, hyperspectral and forest inventory data to predict forest outcome variables at a high spatial resolution. The proposed model is used to analyze forest inventory data collected on the US Forest Service Penobscot Experimental Forest (PEF), ME, USA. In addition to helping meet the regression model's assumptions, results from the PEF analysis suggest that the addition of multivariate spatial random effects improves model fit and predictive ability, compared with two commonly applied modeling approaches. This improvement results from explicitly modeling the covariation among forest outcome variables and spatial dependence among observations through the random effects. Direct application of such multivariate models to even moderately large datasets is often computationally infeasible because of cubic order matrix algorithms involved in estimation. We apply a spatial dimension reduction technique to help overcome this computational hurdle without sacrificing richness in modeling.

  17. Random forest regression modelling for forest aboveground biomass estimation using RISAT-1 PolSAR and terrestrial LiDAR data

    NASA Astrophysics Data System (ADS)

    Mangla, Rohit; Kumar, Shashi; Nandy, Subrata

    2016-05-01

    SAR and LiDAR remote sensing have already shown the potential of active sensors for forest parameter retrieval. SAR sensor in its fully polarimetric mode has an advantage to retrieve scattering property of different component of forest structure and LiDAR has the capability to measure structural information with very high accuracy. This study was focused on retrieval of forest aboveground biomass (AGB) using Terrestrial Laser Scanner (TLS) based point clouds and scattering property of forest vegetation obtained from decomposition modelling of RISAT-1 fully polarimetric SAR data. TLS data was acquired for 14 plots of Timli forest range, Uttarakhand, India. The forest area is dominated by Sal trees and random sampling with plot size of 0.1 ha (31.62m*31.62m) was adopted for TLS and field data collection. RISAT-1 data was processed to retrieve SAR data based variables and TLS point clouds based 3D imaging was done to retrieve LiDAR based variables. Surface scattering, double-bounce scattering, volume scattering, helix and wire scattering were the SAR based variables retrieved from polarimetric decomposition. Tree heights and stem diameters were used as LiDAR based variables retrieved from single tree vertical height and least square circle fit methods respectively. All the variables obtained for forest plots were used as an input in a machine learning based Random Forest Regression Model, which was developed in this study for forest AGB estimation. Modelled output for forest AGB showed reliable accuracy (RMSE = 27.68 t/ha) and a good coefficient of determination (0.63) was obtained through the linear regression between modelled AGB and field-estimated AGB. The sensitivity analysis showed that the model was more sensitive for the major contributed variables (stem diameter and volume scattering) and these variables were measured from two different remote sensing techniques. 
    This study strongly recommends the integration of SAR and LiDAR data for forest AGB estimation.

  18. Random Bits Forest: a Strong Classifier/Regressor for Big Data

    NASA Astrophysics Data System (ADS)

    Wang, Yi; Li, Yi; Pu, Weilin; Wen, Kathryn; Shugart, Yin Yao; Xiong, Momiao; Jin, Li

    2016-07-01

    Efficiency, memory consumption, and robustness are common problems with many popular methods for data analysis. As a solution, we present Random Bits Forest (RBF), a classification and regression algorithm that integrates neural networks (for depth), boosting (for width), and random forests (for prediction accuracy). Through a gradient boosting scheme, it first generates and selects ~10,000 small, 3-layer random neural networks. These networks are then fed into a modified random forest algorithm to obtain predictions. Testing with datasets from the UCI (University of California, Irvine) Machine Learning Repository shows that RBF outperforms other popular methods in both accuracy and robustness, especially with large datasets (N > 1000). The algorithm also performed highly in testing with an independent data set, a real psoriasis genome-wide association study (GWAS).

  19. Modeling urban coastal flood severity from crowd-sourced flood reports using Poisson regression and Random Forest

    NASA Astrophysics Data System (ADS)

    Sadler, J. M.; Goodall, J. L.; Morsy, M. M.; Spencer, K.

    2018-04-01

    Sea level rise has already caused more frequent and severe coastal flooding and this trend will likely continue. Flood prediction is an essential part of a coastal city's capacity to adapt to and mitigate this growing problem. Complex coastal urban hydrological systems however, do not always lend themselves easily to physically-based flood prediction approaches. This paper presents a method for using a data-driven approach to estimate flood severity in an urban coastal setting using crowd-sourced data, a non-traditional but growing data source, along with environmental observation data. Two data-driven models, Poisson regression and Random Forest regression, are trained to predict the number of flood reports per storm event as a proxy for flood severity, given extensive environmental data (i.e., rainfall, tide, groundwater table level, and wind conditions) as input. The method is demonstrated using data from Norfolk, Virginia USA from September 2010 to October 2016. Quality-controlled, crowd-sourced street flooding reports ranging from 1 to 159 per storm event for 45 storm events are used to train and evaluate the models. Random Forest performed better than Poisson regression at predicting the number of flood reports and had a lower false negative rate. From the Random Forest model, total cumulative rainfall was by far the most dominant input variable in predicting flood severity, followed by low tide and lower low tide. These methods serve as a first step toward using data-driven methods for spatially and temporally detailed coastal urban flood prediction.

  20. LiDAR based prediction of forest biomass using hierarchical models with spatially varying coefficients

    USGS Publications Warehouse

    Babcock, Chad; Finley, Andrew O.; Bradford, John B.; Kolka, Randall K.; Birdsey, Richard A.; Ryan, Michael G.

    2015-01-01

    Many studies and production inventory systems have shown the utility of coupling covariates derived from Light Detection and Ranging (LiDAR) data with forest variables measured on georeferenced inventory plots through regression models. The objective of this study was to propose and assess the use of a Bayesian hierarchical modeling framework that accommodates both residual spatial dependence and non-stationarity of model covariates through the introduction of spatial random effects. We explored this objective using four forest inventory datasets that are part of the North American Carbon Program, each comprising point-referenced measures of above-ground forest biomass and discrete LiDAR. For each dataset, we considered at least five regression model specifications of varying complexity. Models were assessed based on goodness of fit criteria and predictive performance using a 10-fold cross-validation procedure. Results showed that the addition of spatial random effects to the regression model intercept improved fit and predictive performance in the presence of substantial residual spatial dependence. Additionally, in some cases, allowing either some or all regression slope parameters to vary spatially, via the addition of spatial random effects, further improved model fit and predictive performance. In other instances, models showed improved fit but decreased predictive performance—indicating over-fitting and underscoring the need for cross-validation to assess predictive ability. The proposed Bayesian modeling framework provided access to pixel-level posterior predictive distributions that were useful for uncertainty mapping, diagnosing spatial extrapolation issues, revealing missing model covariates, and discovering locally significant parameters.
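
    The 10-fold cross-validation procedure mentioned above partitions the plots into ten held-out groups, each used once for assessing predictions. A minimal sketch of the index bookkeeping follows. It is not the authors' code: the function name `k_fold_indices`, the seed, and the sample size of 25 are assumptions for illustration.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k disjoint, near-equal folds
    for held_out, test_idx in enumerate(folds):
        train_idx = [
            j
            for fold_number, fold in enumerate(folds)
            if fold_number != held_out
            for j in fold
        ]
        yield train_idx, test_idx


# Hypothetical data set of 25 inventory plots.
splits = list(k_fold_indices(n_samples=25, k=10))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

    Each observation appears in exactly one test fold, so every prediction used for assessment comes from a model that never saw that observation.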

  1. Comparing spatial regression to random forests for large ...

    EPA Pesticide Factsheets

    Environmental data may be “large” due to number of records, number of covariates, or both. Random forests has a reputation for good predictive performance when using many covariates, whereas spatial regression, when using reduced rank methods, has a reputation for good predictive performance when using many records. In this study, we compare these two techniques using a data set containing the macroinvertebrate multimetric index (MMI) at 1859 stream sites with over 200 landscape covariates. Our primary goal is predicting MMI at over 1.1 million perennial stream reaches across the USA. For spatial regression modeling, we develop two new methods to accommodate large data: (1) a procedure that estimates optimal Box-Cox transformations to linearize covariate relationships; and (2) a computationally efficient covariate selection routine that takes into account spatial autocorrelation. We show that our new methods lead to cross-validated performance similar to random forests, but that there is an advantage for spatial regression when quantifying the uncertainty of the predictions. Simulations are used to clarify advantages for each method. This research investigates different approaches for modeling and mapping national stream condition. We use MMI data from the EPA's National Rivers and Streams Assessment and predictors from StreamCat (Hill et al., 2015). Previous studies have focused on modeling the MMI condition classes (i.e., good, fair, and po

  2. Predicting attention-deficit/hyperactivity disorder severity from psychosocial stress and stress-response genes: a random forest regression approach

    PubMed Central

    van der Meer, D; Hoekstra, P J; van Donkelaar, M; Bralten, J; Oosterlaan, J; Heslenfeld, D; Faraone, S V; Franke, B; Buitelaar, J K; Hartman, C A

    2017-01-01

    Identifying genetic variants contributing to attention-deficit/hyperactivity disorder (ADHD) is complicated by the involvement of numerous common genetic variants with small effects, interacting with each other as well as with environmental factors, such as stress exposure. Random forest regression is well suited to explore this complexity, as it allows for the analysis of many predictors simultaneously, taking into account any higher-order interactions among them. Using random forest regression, we predicted ADHD severity, measured by Conners’ Parent Rating Scales, from 686 adolescents and young adults (of which 281 were diagnosed with ADHD). The analysis included 17 374 single-nucleotide polymorphisms (SNPs) across 29 genes previously linked to hypothalamic–pituitary–adrenal (HPA) axis activity, together with information on exposure to 24 individual long-term difficulties or stressful life events. The model explained 12.5% of variance in ADHD severity. The most important SNP, which also showed the strongest interaction with stress exposure, was located in a region regulating the expression of telomerase reverse transcriptase (TERT). Other high-ranking SNPs were found in or near NPSR1, ESR1, GABRA6, PER3, NR3C2 and DRD4. Chronic stressors were more influential than single, severe, life events. Top hits were partly shared with conduct problems. We conclude that random forest regression may be used to investigate how multiple genetic and environmental factors jointly contribute to ADHD. It is able to implicate novel SNPs of interest, interacting with stress exposure, and may explain inconsistent findings in ADHD genetics. This exploratory approach may be best combined with more hypothesis-driven research; top predictors and their interactions with one another should be replicated in independent samples. PMID:28585928

  3. Simple to complex modeling of breathing volume using a motion sensor.

    PubMed

    John, Dinesh; Staudenmayer, John; Freedson, Patty

    2013-06-01

    To compare simple and complex modeling techniques to estimate categories of low, medium, and high ventilation (VE) from ActiGraph™ activity counts. Vertical axis ActiGraph™ GT1M activity counts, oxygen consumption and VE were measured during treadmill walking and running, sports, household chores and labor-intensive employment activities. Categories of low (<19.3 l/min), medium (19.3 to 35.4 l/min) and high (>35.4 l/min) VEs were derived from activity intensity classifications (light <2.9 METs, moderate 3.0 to 5.9 METs and vigorous >6.0 METs). We examined the accuracy of two simple techniques (multiple regression and activity count cut-point analyses) and one complex (random forest technique) modeling technique in predicting VE from activity counts. Prediction accuracy of the complex random forest technique was marginally better than the simple multiple regression method. Both techniques accurately predicted VE categories almost 80% of the time. The multiple regression and random forest techniques were more accurate (85 to 88%) in predicting medium VE. Both techniques predicted the high VE (70 to 73%) with greater accuracy than low VE (57 to 60%). ActiGraph™ cut-points for low, medium and high VEs were <1381, 1381 to 3660 and >3660 cpm. There were minor differences in prediction accuracy between the multiple regression and the random forest technique. This study provides methods to objectively estimate VE categories using activity monitors that can easily be deployed in the field. Objective estimates of VE should provide a better understanding of the dose-response relationship between internal exposure to pollutants and disease. Copyright © 2013 Elsevier B.V. All rights reserved.
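
    The cut-point technique described above amounts to a threshold lookup on activity counts. A minimal sketch using the cut-points quoted in the abstract follows. The function name `ve_category` is hypothetical, and treating both boundary values (1381 and 3660 cpm) as medium is an assumption read from the stated ranges.

```python
def ve_category(counts_per_minute):
    """Map ActiGraph activity counts (cpm) to a ventilation (VE) category."""
    if counts_per_minute < 1381:
        return "low"
    if counts_per_minute <= 3660:
        return "medium"
    return "high"


for cpm in (500, 1381, 3660, 5000):
    print(cpm, ve_category(cpm))
```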

  4. Random forests and stochastic gradient boosting for predicting tree canopy cover: Comparing tuning processes and model performance

    Treesearch

    E. Freeman; G. Moisen; J. Coulston; B. Wilson

    2014-01-01

    Random forests (RF) and stochastic gradient boosting (SGB), both involving an ensemble of classification and regression trees, are compared for modeling tree canopy cover for the 2011 National Land Cover Database (NLCD). The objectives of this study were twofold. First, sensitivity of RF and SGB to choices in tuning parameters was explored. Second, performance of the...

  5. Exploring prediction uncertainty of spatial data in geostatistical and machine learning Approaches

    NASA Astrophysics Data System (ADS)

    Klump, J. F.; Fouedjio, F.

    2017-12-01

    Geostatistical methods such as kriging with external drift as well as machine learning techniques such as quantile regression forest have been intensively used for modelling spatial data. In addition to providing predictions for target variables, both approaches are able to deliver a quantification of the uncertainty associated with the prediction at a target location. Geostatistical approaches are, by essence, adequate for providing such prediction uncertainties and their behaviour is well understood. However, they often require significant data pre-processing and rely on assumptions that are rarely met in practice. Machine learning algorithms such as random forest regression, on the other hand, require less data pre-processing and are non-parametric. This makes the application of machine learning algorithms to geostatistical problems an attractive proposition. The objective of this study is to compare kriging with external drift and quantile regression forest with respect to their ability to deliver reliable prediction uncertainties of spatial data. In our comparison we use both simulated and real world datasets. Apart from classical performance indicators, comparisons make use of accuracy plots, probability interval width plots, and the visual examinations of the uncertainty maps provided by the two approaches. By comparing random forest regression to kriging we found that both methods produced comparable maps of estimated values for our variables of interest. However, the measure of uncertainty provided by random forest seems to be quite different to the measure of uncertainty provided by kriging. In particular, the lack of spatial context can give misleading results in areas without ground truth data. These preliminary results raise questions about assessing the risks associated with decisions based on the predictions from geostatistical and machine learning algorithms in a spatial context, e.g. mineral exploration.

  6. Estimating the impact of mineral aerosols on crop yields in food insecure regions using statistical crop models

    NASA Astrophysics Data System (ADS)

    Hoffman, A.; Forest, C. E.; Kemanian, A.

    2016-12-01

    A significant number of food-insecure nations exist in regions of the world where dust plays a large role in the climate system. While the impacts of common climate variables (e.g. temperature, precipitation, ozone, and carbon dioxide) on crop yields are relatively well understood, the impact of mineral aerosols on yields has not yet been thoroughly investigated. This research aims to develop the data and tools to progress our understanding of mineral aerosol impacts on crop yields. Suspended dust affects crop yields by altering the amount and type of radiation reaching the plant, modifying local temperature and precipitation, while dust events (i.e. dust storms) affect crop yields by depleting the soil of nutrients or by defoliation via particle abrasion. The impact of dust on yields is modeled statistically because we are uncertain which impacts will dominate the response on national and regional scales considered in this study. Multiple linear regression is used in a number of large-scale statistical crop modeling studies to estimate yield responses to various climate variables. In alignment with previous work, we develop linear crop models, but build upon this simple method of regression with machine-learning techniques (e.g. random forests) to identify important statistical predictors and isolate how dust affects yields on the scales of interest. To perform this analysis, we develop a crop-climate dataset for maize, soybean, groundnut, sorghum, rice, and wheat for the regions of West Africa, East Africa, South Africa, and the Sahel. Random forest regression models consistently model historic crop yields better than the linear models. In several instances, the random forest models accurately capture the temperature and precipitation threshold behavior in crops. Additionally, improving agricultural technology has caused a well-documented positive trend that dominates time series of global and regional yields. This trend is often removed before regression with traditional crop models, but likely at the cost of removing climate information. Our random forest models consistently discover the positive trend without removing any additional data. The application of random forests as a statistical crop model provides insight into understanding the impact of dust on yields in marginal food producing regions.

  7. Comparative analysis of used car price evaluation models

    NASA Astrophysics Data System (ADS)

    Chen, Chuancan; Hao, Lulu; Xu, Cong

    2017-05-01

    An accurate used car price evaluation is a catalyst for the healthy development of the used car market. Data mining has been applied to predict used car prices in several articles. However, little work has compared different algorithms for used car price estimation. This paper collects more than 100,000 used car dealing records throughout China for an empirical comparison of two algorithms: linear regression and random forest. These two algorithms are used to predict used car price in three different models: a model for a certain car make, a model for a certain car series and a universal model. Results show that random forest has a stable but not ideal effect in the price evaluation model for a certain car make, but it shows great advantage in the universal model compared with linear regression. This indicates that random forest is an optimal algorithm when handling complex models with a large number of variables and samples, yet it shows no obvious advantage when coping with simple models with fewer variables.

  8. Extrapolating intensified forest inventory data to the surrounding landscape using landsat

    Treesearch

    Evan B. Brooks; John W. Coulston; Valerie A. Thomas; Randolph H. Wynne

    2015-01-01

    In 2011, a collection of spatially intensified plots was established on three of the Experimental Forests and Ranges (EFRs) sites with the intent of facilitating FIA program objectives for regional extrapolation. Characteristic coefficients from harmonic regression (HR) analysis of associated Landsat stacks are used as inputs into a conditional random forests model to...

  9. Comparison of Random Forest and Parametric Imputation Models for Imputing Missing Data Using MICE: A CALIBER Study

    PubMed Central

    Shah, Anoop D.; Bartlett, Jonathan W.; Carpenter, James; Nicholas, Owen; Hemingway, Harry

    2014-01-01

    Multivariate imputation by chained equations (MICE) is commonly used for imputing missing data in epidemiologic research. The “true” imputation model may contain nonlinearities which are not included in default imputation models. Random forest imputation is a machine learning technique which can accommodate nonlinearities and interactions and does not require a particular regression model to be specified. We compared parametric MICE with a random forest-based MICE algorithm in 2 simulation studies. The first study used 1,000 random samples of 2,000 persons drawn from the 10,128 stable angina patients in the CALIBER database (Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; 2001–2010) with complete data on all covariates. Variables were artificially made “missing at random,” and the bias and efficiency of parameter estimates obtained using different imputation methods were compared. Both MICE methods produced unbiased estimates of (log) hazard ratios, but random forest was more efficient and produced narrower confidence intervals. The second study used simulated data in which the partially observed variable depended on the fully observed variables in a nonlinear way. Parameter estimates were less biased using random forest MICE, and confidence interval coverage was better. This suggests that random forest imputation may be useful for imputing complex epidemiologic data sets in which some patients have missing data. PMID:24589914

  10. Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study.

    PubMed

    Shah, Anoop D; Bartlett, Jonathan W; Carpenter, James; Nicholas, Owen; Hemingway, Harry

    2014-03-15

    Multivariate imputation by chained equations (MICE) is commonly used for imputing missing data in epidemiologic research. The "true" imputation model may contain nonlinearities which are not included in default imputation models. Random forest imputation is a machine learning technique which can accommodate nonlinearities and interactions and does not require a particular regression model to be specified. We compared parametric MICE with a random forest-based MICE algorithm in 2 simulation studies. The first study used 1,000 random samples of 2,000 persons drawn from the 10,128 stable angina patients in the CALIBER database (Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; 2001-2010) with complete data on all covariates. Variables were artificially made "missing at random," and the bias and efficiency of parameter estimates obtained using different imputation methods were compared. Both MICE methods produced unbiased estimates of (log) hazard ratios, but random forest was more efficient and produced narrower confidence intervals. The second study used simulated data in which the partially observed variable depended on the fully observed variables in a nonlinear way. Parameter estimates were less biased using random forest MICE, and confidence interval coverage was better. This suggests that random forest imputation may be useful for imputing complex epidemiologic data sets in which some patients have missing data.

  11. Methods for identifying SNP interactions: a review on variations of Logic Regression, Random Forest and Bayesian logistic regression.

    PubMed

    Chen, Carla Chia-Ming; Schwender, Holger; Keith, Jonathan; Nunkesser, Robin; Mengersen, Kerrie; Macrossan, Paula

    2011-01-01

    Due to advancements in computational ability, enhanced technology and a reduction in the price of genotyping, more data are being generated for understanding genetic associations with diseases and disorders. However, with the availability of large data sets comes the inherent challenges of new methods of statistical analysis and modeling. Considering a complex phenotype may be the effect of a combination of multiple loci, various statistical methods have been developed for identifying genetic epistasis effects. Among these methods, logic regression (LR) is an intriguing approach incorporating tree-like structures. Various methods have built on the original LR to improve different aspects of the model. In this study, we review four variations of LR, namely Logic Feature Selection, Monte Carlo Logic Regression, Genetic Programming for Association Studies, and Modified Logic Regression-Gene Expression Programming, and investigate the performance of each method using simulated and real genotype data. We contrast these with another tree-like approach, namely Random Forests, and a Bayesian logistic regression with stochastic search variable selection.

  12. Mathematical models application for mapping soils spatial distribution on the example of the farm from the North of Udmurt Republic of Russia

    NASA Astrophysics Data System (ADS)

    Dokuchaev, P. M.; Meshalkina, J. L.; Yaroslavtsev, A. M.

    2018-01-01

    A comparative analysis of geospatial soil modeling using multinomial logistic regression, decision trees, random forest, regression trees and support vector machine algorithms was conducted. Visual interpretation of the resulting digital maps and their comparison with the existing map, together with quantitative assessment of the overall accuracy of detecting individual soil groups and of the models' kappa, showed that multiple logistic regression, support vector machine, and random forest models with spatial prediction of the conditional soil group distribution can be reliably used for mapping the study area. Detection was most accurate for lightly eroded and moderately eroded sod-podzolic soils (Phaeozems Albic). Second, by mean overall prediction accuracy, were non-eroded and warp sod-podzolic soils, as well as sod-gley soils (Umbrisols Gleyic) and alluvial soils (Fluvisols Dystric, Umbric). Heavily eroded sod-podzolic and gray forest soils (Phaeozems Albic) were detected worst of all by the automatic classification methods.

  13. Comparison of Logistic Regression and Random Forests techniques for shallow landslide susceptibility assessment in Giampilieri (NE Sicily, Italy)

    NASA Astrophysics Data System (ADS)

    Trigila, Alessandro; Iadanza, Carla; Esposito, Carlo; Scarascia-Mugnozza, Gabriele

    2015-11-01

    The aim of this work is to define reliable susceptibility models for shallow landslides using Logistic Regression and Random Forests multivariate statistical techniques. The study area, located in North-East Sicily, was hit on October 1st 2009 by a severe rainstorm (225 mm of cumulative rainfall in 7 h) which caused flash floods and more than 1000 landslides. Several small villages, such as Giampilieri, were hit with 31 fatalities, 6 missing persons and damage to buildings and transportation infrastructures. Landslides, mainly types such as earth and debris translational slides evolving into debris flows, were triggered on steep slopes and involved colluvium and regolith materials which cover the underlying metamorphic bedrock. The work has been carried out with the following steps: i) realization of a detailed event landslide inventory map through field surveys coupled with observation of high resolution aerial colour orthophoto; ii) identification of landslide source areas; iii) data preparation of landslide controlling factors and descriptive statistics based on a bivariate method (Frequency Ratio) to get an initial overview on existing relationships between causative factors and shallow landslide source areas; iv) choice of criteria for the selection and sizing of the mapping unit; v) implementation of 5 multivariate statistical susceptibility models based on Logistic Regression and Random Forests techniques and focused on landslide source areas; vi) evaluation of the influence of sample size and type of sampling on results and performance of the models; vii) evaluation of the predictive capabilities of the models using ROC curve, AUC and contingency tables; viii) comparison of model results and obtained susceptibility maps; and ix) analysis of temporal variation of landslide susceptibility related to input parameter changes. Models based on Logistic Regression and Random Forests have demonstrated excellent predictive capabilities. Land use and wildfire variables were found to have a strong control on the occurrence of very rapid shallow landslides.

  14. Design of Probabilistic Random Forests with Applications to Anticancer Drug Sensitivity Prediction

    PubMed Central

    Rahman, Raziur; Haider, Saad; Ghosh, Souparno; Pal, Ranadip

    2015-01-01

    Random forests consisting of an ensemble of regression trees with equal weights are frequently used for design of predictive models. In this article, we consider an extension of the methodology by representing the regression trees in the form of probabilistic trees and analyzing the nature of heteroscedasticity. The probabilistic tree representation allows for analytical computation of confidence intervals (CIs), and the tree weight optimization is expected to provide stricter CIs with comparable performance in mean error. We approached the ensemble of probabilistic trees’ prediction from the perspectives of a mixture distribution and as a weighted sum of correlated random variables. We applied our methodology to the drug sensitivity prediction problem on synthetic and cancer cell line encyclopedia dataset and illustrated that tree weights can be selected to reduce the average length of the CI without increase in mean error. PMID:27081304

  15. Forecasting Daily Patient Outflow From a Ward Having No Real-Time Clinical Data

    PubMed Central

    Tran, Truyen; Luo, Wei; Phung, Dinh; Venkatesh, Svetha

    2016-01-01

    Background: Modeling patient flow is crucial in understanding resource demand and prioritization. We study patient outflow from an open ward in an Australian hospital, where currently bed allocation is carried out by a manager relying on past experiences and looking at demand. Automatic methods that provide a reasonable estimate of total next-day discharges can aid in efficient bed management. The challenges in building such methods lie in dealing with large amounts of discharge noise introduced by the nonlinear nature of hospital procedures, and the nonavailability of real-time clinical information in wards. Objective: Our study investigates different models to forecast the total number of next-day discharges from an open ward having no real-time clinical data. Methods: We compared 5 popular regression algorithms to model total next-day discharges: (1) autoregressive integrated moving average (ARIMA), (2) the autoregressive moving average with exogenous variables (ARMAX), (3) k-nearest neighbor regression, (4) random forest regression, and (5) support vector regression. Although the autoregressive integrated moving average model relied on past 3-month discharges, nearest neighbor forecasting used median of similar discharges in the past in estimating next-day discharge. In addition, the ARMAX model used the day of the week and number of patients currently in ward as exogenous variables. For the random forest and support vector regression models, we designed a predictor set of 20 patient features and 88 ward-level features. Results: Our data consisted of 12,141 patient visits over 1826 days. Forecasting quality was measured using mean forecast error, mean absolute error, symmetric mean absolute percentage error, and root mean square error. When compared with a moving average prediction model, all 5 models demonstrated superior performance with the random forests achieving 22.7% improvement in mean absolute error, for all days in the year 2014. Conclusions: In the absence of clinical information, our study recommends using patient-level and ward-level data in predicting next-day discharges. Random forest and support vector regression models are able to use all available features from such data, resulting in superior performance over traditional autoregressive methods. An intelligent estimate of available beds in wards plays a crucial role in relieving access block in emergency departments. PMID:27444059
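
    The four forecast-quality measures named in this abstract can be written out in a few lines. The sketch below is illustrative only: the function names are hypothetical, the sign convention (forecast minus actual) is an assumption, and the symmetric mean absolute percentage error (sMAPE) has several variants in the literature, so the form used here (absolute error divided by the mean of the absolute actual and forecast values, as a percentage) may not be the one the authors used.

```python
import math


def forecast_metrics(actual, forecast):
    """Return mean forecast error, MAE, sMAPE (%) and RMSE for paired series.

    Assumptions: error = forecast - actual; sMAPE divides each absolute
    error by (|actual| + |forecast|) / 2 and skips pairs where both are zero.
    """
    if len(actual) != len(forecast) or not actual:
        raise ValueError("series must be non-empty and of equal length")
    n = len(actual)
    errors = [f - a for a, f in zip(actual, forecast)]
    smape_terms = [
        abs(f - a) / ((abs(a) + abs(f)) / 2)
        for a, f in zip(actual, forecast)
        if abs(a) + abs(f) > 0
    ]
    return {
        "mfe": sum(errors) / n,
        "mae": sum(abs(e) for e in errors) / n,
        "smape": 100 * sum(smape_terms) / n,
        "rmse": math.sqrt(sum(e * e for e in errors) / n),
    }


def relative_improvement(baseline_error, model_error):
    """Fractional reduction in error relative to a baseline model."""
    return (baseline_error - model_error) / baseline_error


if __name__ == "__main__":
    # Made-up daily discharge counts, for illustration only.
    print(forecast_metrics([10, 20, 30], [12, 18, 33]))
```

    An improvement figure such as the one quoted for mean absolute error would, under this reading, be `relative_improvement(mae_moving_average, mae_random_forest)` expressed as a percentage.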

  16. Analysis of Machine Learning Techniques for Heart Failure Readmissions.

    PubMed

    Mortazavi, Bobak J; Downing, Nicholas S; Bucholz, Emily M; Dharmarajan, Kumar; Manhapra, Ajay; Li, Shu-Xia; Negahban, Sahand N; Krumholz, Harlan M

    2016-11-01

    The current ability to predict readmissions in patients with heart failure is modest at best. It is unclear whether machine learning techniques that address higher dimensional, nonlinear relationships among variables would enhance prediction. We sought to compare the effectiveness of several machine learning algorithms for predicting readmissions. Using data from the Telemonitoring to Improve Heart Failure Outcomes trial, we compared the effectiveness of random forests, boosting, random forests combined hierarchically with support vector machines or logistic regression (LR), and Poisson regression against traditional LR to predict 30- and 180-day all-cause readmissions and readmissions because of heart failure. We randomly selected 50% of patients for a derivation set, and a validation set comprised the remaining patients, validated using 100 bootstrapped iterations. We compared C statistics for discrimination and distributions of observed outcomes in risk deciles for predictive range. In 30-day all-cause readmission prediction, the best performing machine learning model, random forests, provided a 17.8% improvement over LR (mean C statistics, 0.628 and 0.533, respectively). For readmissions because of heart failure, boosting improved the C statistic by 24.9% over LR (mean C statistic 0.678 and 0.543, respectively). For 30-day all-cause readmission, the observed readmission rates in the lowest and highest deciles of predicted risk with random forests (7.8% and 26.2%, respectively) showed a much wider separation than LR (14.2% and 16.4%, respectively). Machine learning methods improved the prediction of readmission after hospitalization for heart failure compared with LR and provided the greatest predictive range in observed readmission rates. © 2016 American Heart Association, Inc.
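
    The percentage improvements quoted in this abstract are consistent with a relative change in the mean C statistic, which can be checked directly. The sketch below also includes a pairwise C statistic for a binary outcome, as a reminder of what the quantity measures. Both function names are hypothetical, and the small score list in the usage lines is made up for illustration.

```python
def relative_improvement(new_c, reference_c):
    """Relative change of one C statistic over a reference value."""
    return (new_c - reference_c) / reference_c


def c_statistic(scores, outcomes):
    """Concordance for a binary outcome: the share of (event, non-event)
    pairs in which the event case has the higher predicted risk.
    Ties count as half-concordant."""
    events = [s for s, y in zip(scores, outcomes) if y == 1]
    non_events = [s for s, y in zip(scores, outcomes) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    concordant = sum(
        1.0 if e > n else 0.5 if e == n else 0.0
        for e in events
        for n in non_events
    )
    return concordant / (len(events) * len(non_events))


if __name__ == "__main__":
    # Values quoted in the abstract: random forests vs. LR, boosting vs. LR.
    print(round(100 * relative_improvement(0.628, 0.533), 1))
    print(round(100 * relative_improvement(0.678, 0.543), 1))
    # Made-up risk scores and outcomes, for illustration only.
    print(c_statistic([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0]))
```

    A C statistic of 0.5 corresponds to chance-level discrimination, which is why a value of 0.533 for logistic regression is described as modest.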

  17. The Past, Present and Future of the Meteorological Phenomena Identification Near the Ground (mPING) Project

    NASA Astrophysics Data System (ADS)

    Elmore, K. L.

    2016-12-01

    The Meteorological Phenomena Identification Near the Ground (mPING) project is an example of a crowd-sourced, citizen science effort to gather data of sufficient quality and quantity needed by new post-processing methods that use machine learning. Transportation and infrastructure are particularly sensitive to precipitation type in winter weather. We extract attributes from operational numerical forecast models and use them in a random forest to generate forecast winter precipitation types. We find that random forests applied to forecast soundings are effective at generating skillful forecasts of surface ptype with considerably more skill than the current algorithms, especially for ice pellets and freezing rain. We also find that three very different forecast models yield similar overall results, showing that random forests are able to extract essentially equivalent information from different forecast models. We also show that the random forest for each model, and each profile type, is unique to the particular forecast model and that the random forests developed using a particular model suffer significant degradation when given attributes derived from a different model. This implies that no single algorithm can perform well across all forecast models. Clearly, random forests extract information unavailable to "physically based" methods because the physical information in the models does not appear as we expect. One interesting result is that results from the classic "warm nose" sounding profile are, by far, the most sensitive to the particular forecast model, but this profile is also the one for which random forests are most skillful. Finally, a method for calibrating probabilities for each different ptype using multinomial logistic regression is shown.

  18. Using Classification and Regression Trees (CART) and random forests to analyze attrition: Results from two simulations.

    PubMed

    Hayes, Timothy; Usami, Satoshi; Jacobucci, Ross; McArdle, John J

    2015-12-01

    In this article, we describe a recent development in the analysis of attrition: using classification and regression trees (CART) and random forest methods to generate inverse sampling weights. These flexible machine learning techniques have the potential to capture complex nonlinear, interactive selection models, yet to our knowledge, their performance in the missing data analysis context has never been evaluated. To assess the potential benefits of these methods, we compare their performance with commonly employed multiple imputation and complete case techniques in 2 simulations. These initial results suggest that weights computed from pruned CART analyses performed well in terms of both bias and efficiency when compared with other methods. We discuss the implications of these findings for applied researchers. (c) 2015 APA, all rights reserved.
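
    The idea of inverse sampling weights can be illustrated with a short sketch. It assumes that some model (a pruned CART or a random forest in the article) has already produced, for each participant who remained in the study, an estimated probability of remaining. The probabilities, the function names, and the clipping threshold below are all hypothetical and are not taken from the article.

```python
def inverse_probability_weights(p_remain, floor=0.01):
    """Weight each retained case by 1 / P(remaining in the study).

    `floor` clips very small probabilities so that no single case receives
    an extreme weight; the value 0.01 is an arbitrary illustrative choice.
    """
    return [1.0 / max(p, floor) for p in p_remain]


def weighted_mean(values, weights):
    """Weighted average of the observed outcomes."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


if __name__ == "__main__":
    # Made-up outcomes and retention probabilities for three completers.
    outcomes = [10.0, 20.0, 30.0]
    p_remain = [0.5, 0.25, 0.5]
    weights = inverse_probability_weights(p_remain)
    print(weights, weighted_mean(outcomes, weights))
```

    Cases that resemble dropouts (low estimated probability of remaining) receive larger weights, so the retained sample is re-balanced toward the original sample.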

  19. Using Classification and Regression Trees (CART) and Random Forests to Analyze Attrition: Results From Two Simulations

    PubMed Central

    Hayes, Timothy; Usami, Satoshi; Jacobucci, Ross; McArdle, John J.

    2016-01-01

    In this article, we describe a recent development in the analysis of attrition: using classification and regression trees (CART) and random forest methods to generate inverse sampling weights. These flexible machine learning techniques have the potential to capture complex nonlinear, interactive selection models, yet to our knowledge, their performance in the missing data analysis context has never been evaluated. To assess the potential benefits of these methods, we compare their performance with commonly employed multiple imputation and complete case techniques in 2 simulations. These initial results suggest that weights computed from pruned CART analyses performed well in terms of both bias and efficiency when compared with other methods. We discuss the implications of these findings for applied researchers. PMID:26389526

  20. Modeling species’ realized climatic niche space and predicting their response to global warming for several western forest species with small geographic distributions

    Treesearch

    Marcus V. Warwell; Gerald E. Rehfeldt; Nicholas L. Crookston

    2010-01-01

    The Random Forests multiple regression tree was used to develop an empirically based bioclimatic model of the presence-absence of species occupying small geographic distributions in western North America. The species assessed were subalpine larch (Larix lyallii), smooth Arizona cypress (Cupressus arizonica ssp. glabra...

  1. Random Forest as a Predictive Analytics Alternative to Regression in Institutional Research

    ERIC Educational Resources Information Center

    He, Lingjun; Levine, Richard A.; Fan, Juanjuan; Beemer, Joshua; Stronach, Jeanne

    2018-01-01

    In institutional research, modern data mining approaches are seldom considered to address predictive analytics problems. The goal of this paper is to highlight the advantages of tree-based machine learning algorithms over classic (logistic) regression methods for data-informed decision making in higher education problems, and stress the success of…

  2. Development of a hybrid proximal sensing method for rapid identification of petroleum contaminated soils.

    PubMed

    Chakraborty, Somsubhra; Weindorf, David C; Li, Bin; Ali Aldabaa, Abdalsamad Abdalsatar; Ghosh, Rakesh Kumar; Paul, Sathi; Nasim Ali, Md

    2015-05-01

    Using 108 petroleum contaminated soil samples, this pilot study proposed a new analytical approach of combining visible near-infrared diffuse reflectance spectroscopy (VisNIR DRS) and portable X-ray fluorescence spectrometry (PXRF) for rapid and improved quantification of soil petroleum contamination. Results indicated that an advanced fused model, in which VisNIR DRS spectra-based penalized spline regression (PSR) was used to predict total petroleum hydrocarbon and PXRF elemental data-based random forest regression was then used to model the PSR residuals, outperformed (R²=0.78, residual prediction deviation (RPD)=2.19) all other models tested, even producing better generalization than using VisNIR DRS alone (RPDs of 1.64, 1.86, and 1.96 for random forest, penalized spline regression, and partial least squares regression, respectively). Additionally, unsupervised principal component analysis using the PXRF+VisNIR DRS system qualitatively separated contaminated soils from control samples. Fusion of PXRF elemental data and VisNIR derivative spectra produced an optimized model for total petroleum hydrocarbon quantification in soils. Copyright © 2015 Elsevier B.V. All rights reserved.
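
    Two pieces of this abstract lend themselves to a short sketch: the way a residual model is combined with a first-stage model, and the residual prediction deviation (RPD). The code below assumes one common definition of RPD, the sample standard deviation of the observed values divided by the root mean square error of prediction. The abstract does not state which definition the authors used, and the function names and numbers are hypothetical.

```python
import math
import statistics


def fused_prediction(first_stage_pred, predicted_residuals):
    """Add a second model's predicted residuals to first-stage predictions."""
    return [p + r for p, r in zip(first_stage_pred, predicted_residuals)]


def rpd(observed, predicted):
    """Residual prediction deviation under the assumed definition:
    sample SD of observed values / RMSE of the predictions."""
    n = len(observed)
    rmse = math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)
    return statistics.stdev(observed) / rmse


if __name__ == "__main__":
    # Made-up values: first-stage predictions plus modeled residuals.
    observed = [1.0, 2.0, 3.0, 4.0, 5.0]
    stage_one = [1.5, 2.5, 2.0, 3.0, 5.5]
    residual_model = [0.0, 0.0, 0.5, 0.5, 0.0]
    fused = fused_prediction(stage_one, residual_model)
    print(rpd(observed, stage_one), rpd(observed, fused))
```

    Under this definition a larger RPD means the prediction error is small relative to the spread of the data, so modeling the residuals helps only if it lowers the RMSE.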

  3. Unbiased split variable selection for random survival forests using maximally selected rank statistics.

    PubMed

    Wright, Marvin N; Dankowski, Theresa; Ziegler, Andreas

    2017-04-15

    The most popular approach for analyzing survival data is the Cox regression model. The Cox model may, however, be misspecified, and its proportionality assumption may not always be fulfilled. An alternative approach for survival prediction is random forests for survival outcomes. The standard split criterion for random survival forests is the log-rank test statistic, which favors splitting variables with many possible split points. Conditional inference forests avoid this split variable selection bias. However, linear rank statistics are utilized by default in conditional inference forests to select the optimal splitting variable, which cannot detect non-linear effects in the independent variables. An alternative is to use maximally selected rank statistics for the split point selection. As in conditional inference forests, splitting variables are compared on the p-value scale. However, instead of the conditional Monte-Carlo approach used in conditional inference forests, p-value approximations are employed. We describe several p-value approximations and the implementation of the proposed random forest approach. A simulation study demonstrates that unbiased split variable selection is possible. However, there is a trade-off between unbiased split variable selection and runtime. In benchmark studies of prediction performance on simulated and real datasets, the new method performs better than random survival forests if informative dichotomous variables are combined with uninformative variables with more categories and better than conditional inference forests if non-linear covariate effects are included. In a runtime comparison, the method proves to be computationally faster than both alternatives, if a simple p-value approximation is used. Copyright © 2017 John Wiley & Sons, Ltd.

  4. Statistical-learning strategies generate only modestly performing predictive models for urinary symptoms following external beam radiotherapy of the prostate: A comparison of conventional and machine-learning methods

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Yahya, Noorazrul, E-mail: noorazrul.yahya@research.uwa.edu.au; Ebert, Martin A.; Bulsara, Max

    Purpose: Given the paucity of available data concerning radiotherapy-induced urinary toxicity, it is important to ensure derivation of the most robust models with superior predictive performance. This work explores multiple statistical-learning strategies for prediction of urinary symptoms following external beam radiotherapy of the prostate. Methods: The performance of logistic regression, elastic-net, support-vector machine, random forest, neural network, and multivariate adaptive regression splines (MARS) to predict urinary symptoms was analyzed using data from 754 participants accrued by TROG03.04-RADAR. Predictive features included dose-surface data, comorbidities, and medication-intake. Four symptoms were analyzed: dysuria, haematuria, incontinence, and frequency, each with three definitions (grade ≥ 1, grade ≥ 2 and longitudinal) with event rate between 2.3% and 76.1%. Repeated cross-validations producing matched models were implemented. A synthetic minority oversampling technique was utilized in endpoints with rare events. Parameter optimization was performed on the training data. Area under the receiver operating characteristic curve (AUROC) was used to compare performance using sample size to detect differences of ≥0.05 at the 95% confidence level. Results: Logistic regression, elastic-net, random forest, MARS, and support-vector machine were the highest-performing statistical-learning strategies in 3, 3, 3, 2, and 1 endpoints, respectively. Logistic regression, MARS, elastic-net, random forest, neural network, and support-vector machine were the best, or were not significantly worse than the best, in 7, 7, 5, 5, 3, and 1 endpoints. The best-performing statistical model was for dysuria grade ≥ 1 with AUROC ± standard deviation of 0.649 ± 0.074 using MARS. For longitudinal frequency and dysuria grade ≥ 1, all strategies produced AUROC>0.6 while all haematuria endpoints and longitudinal incontinence models produced AUROC<0.6. Conclusions: Logistic regression and MARS were most likely to be the best-performing strategy for the prediction of urinary symptoms with elastic-net and random forest producing competitive results. The predictive power of the models was modest and endpoint-dependent. New features, including spatial dose maps, may be necessary to achieve better models.
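
The abstract above compares models by AUROC. As a minimal sketch of what that metric measures, the following computes it from the rank-based (Mann-Whitney) definition. The function name `auroc` and the toy labels and scores are hypothetical and are not taken from the study, whose actual software is not described in this record.

```python
def auroc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney U statistic.

    Equals the probability that a randomly chosen positive case receives a
    higher score than a randomly chosen negative case (ties count as 1/2).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (illustration only, not study data).
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
print(auroc(y_true, y_score))
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which helps put the reported values around 0.6 in context.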

  5. Artificial Intelligence Procedures for Tree Taper Estimation within a Complex Vegetation Mosaic in Brazil

    PubMed Central

    Nunes, Matheus Henrique

    2016-01-01

    Tree stem form in native tropical forests is very irregular, posing a challenge to establishing taper equations that can accurately predict the diameter at any height along the stem and subsequently merchantable volume. Artificial intelligence approaches can be useful techniques in minimizing estimation errors within complex variations of vegetation. We evaluated the performance of Random Forest® regression tree and Artificial Neural Network procedures in modelling stem taper. Diameters and volume outside bark were compared to a traditional taper-based equation across a tropical Brazilian savanna, a seasonal semi-deciduous forest and a rainforest. Neural network models were found to be more accurate than the traditional taper equation. Random forest showed trends in the residuals from the diameter prediction and provided the least precise and accurate estimations for all forest types. This study provides insights into the superiority of a neural network, which provided advantages regarding the handling of local effects. PMID:27187074

  6. Artificial Intelligence Procedures for Tree Taper Estimation within a Complex Vegetation Mosaic in Brazil.

    PubMed

    Nunes, Matheus Henrique; Görgens, Eric Bastos

    2016-01-01

    Tree stem form in native tropical forests is very irregular, posing a challenge to establishing taper equations that can accurately predict the diameter at any height along the stem and subsequently merchantable volume. Artificial intelligence approaches can be useful techniques in minimizing estimation errors within complex variations of vegetation. We evaluated the performance of Random Forest® regression tree and Artificial Neural Network procedures in modelling stem taper. Diameters and volume outside bark were compared to a traditional taper-based equation across a tropical Brazilian savanna, a seasonal semi-deciduous forest and a rainforest. Neural network models were found to be more accurate than the traditional taper equation. Random forest showed trends in the residuals from the diameter prediction and provided the least precise and accurate estimations for all forest types. This study provides insights into the superiority of a neural network, which provided advantages regarding the handling of local effects.

  7. Fast image interpolation via random forests.

    PubMed

    Huang, Jun-Jie; Siu, Wan-Chi; Liu, Tian-Rui

    2015-10-01

    This paper proposes a two-stage framework for fast image interpolation via random forests (FIRF). The proposed FIRF method gives high accuracy while requiring little computation. The underlying idea of this proposed work is to apply random forests to classify the natural image patch space into numerous subspaces and learn a linear regression model for each subspace to map the low-resolution image patch to the high-resolution image patch. The FIRF framework consists of two stages. Stage 1 of the framework removes most of the ringing and aliasing artifacts in the initial bicubic interpolated image, while Stage 2 further refines the Stage 1 interpolated image. By varying the number of decision trees in the random forests and the number of stages applied, the proposed FIRF method can realize computationally scalable image interpolation. Extensive experimental results show that the proposed FIRF(3, 2) method achieves more than 0.3 dB improvement in peak signal-to-noise ratio over the state-of-the-art nonlocal autoregressive modeling (NARM) method. Moreover, the proposed FIRF(1, 1) obtains results similar to or better than NARM while taking only 0.3% of its computational time.

  8. Stemflow estimation in a redwood forest using model-based stratified random sampling

    Treesearch

    Jack Lewis

    2003-01-01

    Model-based stratified sampling is illustrated by a case study of stemflow volume in a redwood forest. The approach is actually a model-assisted sampling design in which auxiliary information (tree diameter) is utilized in the design of stratum boundaries to optimize the efficiency of a regression or ratio estimator. The auxiliary information is utilized in both the...

  9. Improved predictive mapping of indoor radon concentrations using ensemble regression trees based on automatic clustering of geological units.

    PubMed

    Kropat, Georg; Bochud, Francois; Jaboyedoff, Michel; Laedermann, Jean-Pascal; Murith, Christophe; Palacios Gruson, Martha; Baechler, Sébastien

    2015-09-01

    According to estimates, around 230 people die as a result of radon exposure in Switzerland. This public health concern makes reliable indoor radon prediction and mapping methods necessary in order to improve risk communication to the public. The aim of this study was to develop an automated method to classify lithological units according to their radon characteristics and to develop mapping and predictive tools in order to improve local radon prediction. About 240 000 indoor radon concentration (IRC) measurements in about 150 000 buildings were available for our analysis. The automated classification of lithological units was based on k-medoids clustering via pair-wise Kolmogorov distances between IRC distributions of lithological units. For IRC mapping and prediction we used random forests and Bayesian additive regression trees (BART). The automated classification groups lithological units well in terms of their IRC characteristics. In particular, the IRC differences in metamorphic rocks like gneiss are well revealed by this method. The maps produced by random forests soundly represent the regional difference of IRCs in Switzerland and improve the spatial detail compared to existing approaches. We could explain 33% of the variations in IRC data with random forests. Additionally, the influence of a variable evaluated by random forests shows that building characteristics are less important predictors for IRCs than spatial/geological influences. BART could explain 29% of IRC variability and produced maps that indicate the prediction uncertainty. Ensemble regression trees are a powerful tool to model and understand the multidimensional influences on IRCs. Automatic clustering of lithological units complements this method by facilitating the interpretation of radon properties of rock types. This study provides an important element for radon risk communication. Future approaches should consider taking into account further variables like soil gas radon measurements as well as more detailed geological information. Copyright © 2015 Elsevier Ltd. All rights reserved.

  10. Mortality risk prediction in burn injury: Comparison of logistic regression with machine learning approaches.

    PubMed

    Stylianou, Neophytos; Akbarov, Artur; Kontopantelis, Evangelos; Buchan, Iain; Dunn, Ken W

    2015-08-01

    Predicting mortality from burn injury has traditionally employed logistic regression models. Alternative machine learning methods have been introduced in some areas of clinical prediction as the necessary software and computational facilities have become accessible. Here we compare logistic regression and machine learning predictions of mortality from burn. An established logistic mortality model was compared to machine learning methods (artificial neural network, support vector machine, random forests and naïve Bayes) using a population-based (England & Wales) case-cohort registry. Predictive evaluation used: area under the receiver operating characteristic curve; sensitivity; specificity; positive predictive value and Youden's index. All methods had comparable discriminatory abilities, similar sensitivities, specificities and positive predictive values. Although some machine learning methods performed marginally better than logistic regression the differences were seldom statistically significant and clinically insubstantial. Random forests were marginally better for high positive predictive value and reasonable sensitivity. Neural networks yielded slightly better prediction overall. Logistic regression gives an optimal mix of performance and interpretability. The established logistic regression model of burn mortality performs well against more complex alternatives. Clinical prediction with a small set of strong, stable, independent predictors is unlikely to gain much from machine learning outside specialist research contexts. Copyright © 2015 Elsevier Ltd and ISBI. All rights reserved.
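
The evaluation measures listed in the abstract above (sensitivity, specificity, positive predictive value and Youden's index) all follow from a 2x2 confusion matrix. The sketch below shows the standard definitions. The function name `binary_metrics` and the toy vectors are hypothetical and do not reproduce the registry data.

```python
def binary_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV and Youden's J from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0 or tp + fp == 0:
        raise ValueError("a metric denominator is zero for these inputs")
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)           # positive predictive value
    youden_j = sensitivity + specificity - 1.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "ppv": ppv, "youden_j": youden_j}


# Hypothetical toy outcomes (1 = died) and thresholded predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
print(binary_metrics(y_true, y_pred))
```

Youden's index ranges from 0 for an uninformative classifier to 1 for a perfect one, so it summarizes sensitivity and specificity in a single number at a given threshold.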

  11. Benchmarking dairy herd health status using routinely recorded herd summary data.

    PubMed

    Parker Gaddis, K L; Cole, J B; Clay, J S; Maltecca, C

    2016-02-01

    Genetic improvement of dairy cattle health through the use of producer-recorded data has been determined to be feasible. Low estimated heritabilities indicate that genetic progress will be slow. Variation observed in lowly heritable traits can largely be attributed to nongenetic factors, such as the environment. More rapid improvement of dairy cattle health may be attainable if herd health programs incorporate environmental and managerial aspects. More than 1,100 herd characteristics are regularly recorded on farm test-days. We combined these data with producer-recorded health event data, and parametric and nonparametric models were used to benchmark herd and cow health status. Health events were grouped into 3 categories for analyses: mastitis, reproductive, and metabolic. Both herd incidence and individual incidence were used as dependent variables. Models implemented included stepwise logistic regression, support vector machines, and random forests. At both the herd and individual levels, random forest models attained the highest accuracy for predicting health status in all health event categories when evaluated with 10-fold cross-validation. Accuracy (SD) ranged from 0.61 (0.04) to 0.63 (0.04) when using random forest models at the herd level. Accuracy of prediction (SD) at the individual cow level ranged from 0.87 (0.06) to 0.93 (0.001) with random forest models. Highly significant variables and key words from logistic regression and random forest models were also investigated. All models identified several of the same key factors for each health event category, including movement out of the herd, size of the herd, and weather-related variables. We concluded that benchmarking health status using routinely collected herd data is feasible. Nonparametric models were better suited to handle this complex data with numerous variables. These data mining techniques were able to perform prediction of health status and could add evidence to personal experience in herd management. Copyright © 2016 American Dairy Science Association. Published by Elsevier Inc. All rights reserved.

  12. Personalized Risk Prediction in Clinical Oncology Research: Applications and Practical Issues Using Survival Trees and Random Forests.

    PubMed

    Hu, Chen; Steingrimsson, Jon Arni

    2018-01-01

    A crucial component of making individualized treatment decisions is to accurately predict each patient's disease risk. In clinical oncology, disease risks are often measured through time-to-event data, such as overall survival and progression/recurrence-free survival, and are often subject to censoring. Risk prediction models based on recursive partitioning methods are becoming increasingly popular largely due to their ability to handle nonlinear relationships, higher-order interactions, and/or high-dimensional covariates. The most popular recursive partitioning methods are versions of the Classification and Regression Tree (CART) algorithm, which builds a simple interpretable tree structured model. With the aim of increasing prediction accuracy, the random forest algorithm averages multiple CART trees, creating a flexible risk prediction model. Risk prediction models used in clinical oncology commonly use both traditional demographic and tumor pathological factors as well as high-dimensional genetic markers and treatment parameters from multimodality treatments. In this article, we describe the most commonly used extensions of the CART and random forest algorithms to right-censored outcomes. We focus on how they differ from the methods for noncensored outcomes, and how the different splitting rules and methods for cost-complexity pruning impact these algorithms. We demonstrate these algorithms by analyzing a randomized Phase III clinical trial of breast cancer. We also conduct Monte Carlo simulations to compare the prediction accuracy of survival forests with more commonly used regression models under various scenarios. These simulation studies aim to evaluate how sensitive the prediction accuracy is to the underlying model specifications, the choice of tuning parameters, and the degrees of missing covariates.

  13. The use of single-date MODIS imagery for estimating large-scale urban impervious surface fraction with spectral mixture analysis and machine learning techniques

    NASA Astrophysics Data System (ADS)

    Deng, Chengbin; Wu, Changshan

    2013-12-01

    Urban impervious surface information is essential for urban and environmental applications at the regional/national scales. As a popular image processing technique, spectral mixture analysis (SMA) has rarely been applied to coarse-resolution imagery due to the difficulty of deriving endmember spectra using traditional endmember selection methods, particularly within heterogeneous urban environments. To address this problem, we derived endmember signatures through a least squares solution (LSS) technique with known abundances of sample pixels, and integrated these endmember signatures into SMA for mapping large-scale impervious surface fraction. In addition, with the same sample set, we carried out objective comparative analyses among SMA (i.e. fully constrained and unconstrained SMA) and machine learning (i.e. Cubist regression tree and Random Forests) techniques. Analysis of results suggests three major conclusions. First, with the extrapolated endmember spectra from stratified random training samples, the SMA approaches performed relatively well, as indicated by small MAE values. Second, Random Forests yields more reliable results than Cubist regression tree, and its accuracy is improved with increased sample sizes. Finally, comparative analyses suggest a tentative guide for selecting an optimal approach for large-scale fractional imperviousness estimation: unconstrained SMA might be a favorable option with a small number of samples, while Random Forests might be preferred if a large number of samples are available.

  14. Effects of road network on diversiform forest cover changes in the highest coverage region in China: An analysis of sampling strategies.

    PubMed

    Hu, Xisheng; Wu, Zhilong; Wu, Chengzhen; Ye, Limin; Lan, Chaofeng; Tang, Kun; Xu, Lu; Qiu, Rongzu

    2016-09-15

    Forest cover changes are of global concern due to their roles in global warming and biodiversity. However, many previous studies have ignored the fact that forest loss and forest gain are different processes that may respond to distinct factors by stressing forest loss more than gain or viewing forest cover change as a whole. It behooves us to carefully examine the patterns and drivers of the change by subdividing it into several categories. Our study includes areas of forest loss (4.8% of the study area), forest gain (1.3% of the study area) and forest loss and gain (2.0% of the study area) from 2000 to 2012 in Fujian Province, China. In the study area, approximately 65% and 90% of these changes occurred within 2000 m of the nearest road and under road densities of 0.6 km/km², respectively. We compared two sampling techniques (systematic sampling and random sampling) and four intensities for each technique to investigate the driving patterns underlying the changes using multinomial logistic regression. The results indicated the lack of pronounced differences in the regressions between the two sampling designs, although the sample size had a great impact on the regression outcome. The application of multi-model inference indicated that the low level road density had a negative significant association with forest loss and forest loss and gain, the expressway density had a positive significant impact on forest loss, and the road network was insignificantly related to forest gain. The model including socioeconomic and biophysical variables illuminated potentially different predictors of the different forest change categories. Moreover, the multiple comparisons tested by Fisher's least significant difference (LSD) were a good compensation for the multinomial logistic model to enrich the interpretation of the regression results. Copyright © 2016 Elsevier B.V. All rights reserved.

  15. Tehran Air Pollutants Prediction Based on Random Forest Feature Selection Method

    NASA Astrophysics Data System (ADS)

    Shamsoddini, A.; Aboodi, M. R.; Karami, J.

    2017-09-01

    Air pollution, as one of the most serious forms of environmental pollution, poses a huge threat to human life. Air pollution leads to environmental instability, and has harmful and undesirable effects on the environment. Modern prediction methods of the pollutant concentration are able to improve decision making and provide appropriate solutions. This study examines the performance of the Random Forest feature selection in combination with multiple-linear regression and Multilayer Perceptron Artificial Neural Networks methods, in order to achieve an efficient model to estimate carbon monoxide, nitrogen dioxide, sulfur dioxide and PM2.5 contents in the air. The results indicated that Artificial Neural Networks fed by the attributes selected by the Random Forest feature selection method performed more accurately than other models for the modeling of all pollutants. The estimation accuracy of sulfur dioxide emissions was lower than that of the other air contaminants, whereas nitrogen dioxide was predicted more accurately than the other pollutants.

  16. Random forests of interaction trees for estimating individualized treatment effects in randomized trials.

    PubMed

    Su, Xiaogang; Peña, Annette T; Liu, Lei; Levine, Richard A

    2018-04-29

    Assessing heterogeneous treatment effects is a growing interest in advancing precision medicine. Individualized treatment effects (ITEs) play a critical role in such an endeavor. Concerning experimental data collected from randomized trials, we put forward a method, termed random forests of interaction trees (RFIT), for estimating ITE on the basis of interaction trees. To this end, we propose a smooth sigmoid surrogate method, as an alternative to greedy search, to speed up tree construction. The RFIT outperforms the "separate regression" approach in estimating ITE. Furthermore, standard errors for the estimated ITE via RFIT are obtained with the infinitesimal jackknife method. We assess and illustrate the use of RFIT via both simulation and the analysis of data from an acupuncture headache trial. Copyright © 2018 John Wiley & Sons, Ltd.

  17. Predicting membrane protein types using various decision tree classifiers based on various modes of general PseAAC for imbalanced datasets.

    PubMed

    Sankari, E Siva; Manimegalai, D

    2017-12-21

    Predicting membrane protein types is an important and challenging research area in bioinformatics and proteomics. Traditional biophysical methods are used to classify membrane protein types. Due to the large number of uncharacterized protein sequences in databases, traditional methods are very time consuming, expensive and susceptible to errors. Hence, it is highly desirable to develop a robust, reliable, and efficient method to predict membrane protein types. Imbalanced datasets and large datasets are often handled well by decision tree classifiers. Since imbalanced datasets are taken, the performance of various decision tree classifiers such as Decision Tree (DT), Classification And Regression Tree (CART), C4.5, Random tree, REP (Reduced Error Pruning) tree, and ensemble methods such as Adaboost, RUS (Random Under Sampling) boost, Rotation forest and Random forest are analysed. Among the various decision tree classifiers, Random forest performs well in less time with a good accuracy of 96.35%. Another inference is that the RUS boost decision tree classifier is able to classify one or two samples in the class with very few samples, while the other classifiers such as DT, Adaboost, Rotation forest and Random forest are not sensitive to the classes with fewer samples. The performance of decision tree classifiers is also compared with SVM (Support Vector Machine) and Naive Bayes classifiers. Copyright © 2017 Elsevier Ltd. All rights reserved.

  18. Identification by random forest method of HLA class I amino acid substitutions associated with lower survival at day 100 in unrelated donor hematopoietic cell transplantation.

    PubMed

    Marino, S R; Lin, S; Maiers, M; Haagenson, M; Spellman, S; Klein, J P; Binkowski, T A; Lee, S J; van Besien, K

    2012-02-01

    The identification of important amino acid substitutions associated with low survival in hematopoietic cell transplantation (HCT) is hampered by the large number of observed substitutions compared with the small number of patients available for analysis. Random forest analysis is designed to address these limitations. We studied 2107 HCT recipients with good or intermediate risk hematological malignancies to identify HLA class I amino acid substitutions associated with reduced survival at day 100 post transplant. Random forest analysis and traditional univariate and multivariate analyses were used. Random forest analysis identified amino acid substitutions in 33 positions that were associated with reduced 100 day survival, including HLA-A 9, 43, 62, 63, 76, 77, 95, 97, 114, 116, 152, 156, 166 and 167; HLA-B 97, 109, 116 and 156; and HLA-C 6, 9, 11, 14, 21, 66, 77, 80, 95, 97, 99, 116, 156, 163 and 173. In all, 13 had been previously reported by other investigators using classical biostatistical approaches. Using the same data set, traditional multivariate logistic regression identified only five amino acid substitutions associated with lower day 100 survival. Random forest analysis is a novel statistical methodology for analysis of HLA mismatching and outcome studies, capable of identifying important amino acid substitutions missed by other methods.

  19. Comparison of machine-learning methods for above-ground biomass estimation based on Landsat imagery

    NASA Astrophysics Data System (ADS)

    Wu, Chaofan; Shen, Huanhuan; Shen, Aihua; Deng, Jinsong; Gan, Muye; Zhu, Jinxia; Xu, Hongwei; Wang, Ke

    2016-07-01

    Biomass is one significant biophysical parameter of a forest ecosystem, and accurate biomass estimation on the regional scale provides important information for carbon-cycle investigation and sustainable forest management. In this study, Landsat satellite imagery data combined with field-based measurements were integrated through comparisons of five regression approaches [stepwise linear regression, K-nearest neighbor, support vector regression, random forest (RF), and stochastic gradient boosting] with two different candidate variable strategies to implement the optimal spatial above-ground biomass (AGB) estimation. The results suggested that the RF algorithm exhibited the best performance by 10-fold cross-validation with respect to R² (0.63) and root-mean-square error (26.44 ton/ha). Consequently, the map of estimated AGB was generated with a mean value of 89.34 ton/ha in northwestern Zhejiang Province, China, with a similar pattern to the distribution mode of local forest species. This research indicates that machine-learning approaches associated with Landsat imagery provide an economical way for biomass estimation. Moreover, ensemble methods using all candidate variables, especially for Landsat images, provide an alternative for regional biomass simulation.
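
The two accuracy measures quoted in the abstract above can be sketched as follows. This assumes the coefficient-of-determination form of R² (one minus the ratio of residual to total sum of squares); the study may have used a different definition, such as a squared correlation. The function names and toy biomass values are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error, in the same units as the observations."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical held-out plot biomass (ton/ha) and model predictions.
obs = [80.0, 95.0, 110.0, 70.0]
pred = [85.0, 90.0, 100.0, 75.0]
print(r_squared(obs, pred), rmse(obs, pred))
```

In a 10-fold cross-validation these quantities would be computed on the predictions for the held-out folds rather than on the training data.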

  20. Computer aided diagnosis system for the Alzheimer's disease based on partial least squares and random forest SPECT image classification.

    PubMed

    Ramírez, J; Górriz, J M; Segovia, F; Chaves, R; Salas-Gonzalez, D; López, M; Alvarez, I; Padilla, P

    2010-03-19

    This letter shows a computer aided diagnosis (CAD) technique for the early detection of Alzheimer's disease (AD) by means of single photon emission computed tomography (SPECT) image classification. The proposed method is based on a partial least squares (PLS) regression model and a random forest (RF) predictor. The challenge of the curse of dimensionality is addressed by reducing the large dimensionality of the input data by downscaling the SPECT images and extracting score features using PLS. An RF predictor then forms an ensemble of classification and regression tree (CART)-like classifiers, with its output determined by a majority vote of the trees in the forest. A baseline principal component analysis (PCA) system is also developed for reference. The experimental results show that the combined PLS-RF system yields a generalization error that converges to a limit when increasing the number of trees in the forest. Thus, the generalization error is reduced when using PLS and depends on the strength of the individual trees in the forest and the correlation between them. Moreover, PLS feature extraction is found to be more effective for extracting discriminative information from the data than PCA, yielding peak sensitivity, specificity and accuracy values of 100%, 92.7%, and 96.9%, respectively. Moreover, the proposed CAD system outperformed several other recently developed AD CAD systems. Copyright 2010 Elsevier Ireland Ltd. All rights reserved.
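
The abstract above states that the forest's output is a majority vote over its trees. A minimal sketch of that aggregation step is given below. The function name `majority_vote` and the class labels are hypothetical, and the tie-breaking rule (first label seen wins) is an assumption of this sketch, not something the record specifies.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the most frequent class label among per-tree predictions.

    Ties are broken in favour of the label that appears first in the input
    (an assumption of this sketch).
    """
    if not tree_predictions:
        raise ValueError("need at least one tree prediction")
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical votes from five trees for one scan.
votes = ["AD", "normal", "AD", "AD", "normal"]
print(majority_vote(votes))
```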

  1. Hand pose estimation in depth image using CNN and random forest

    NASA Astrophysics Data System (ADS)

    Chen, Xi; Cao, Zhiguo; Xiao, Yang; Fang, Zhiwen

    2018-03-01

    Thanks to the availability of low-cost depth cameras, like Microsoft Kinect, 3D hand pose estimation has attracted special research attention in recent years. Due to the large variations in the hand's viewpoint and the high dimension of hand motion, 3D hand pose estimation is still challenging. In this paper we propose a two-stage framework that combines a CNN with a Random Forest to boost the performance of hand pose estimation. First, we use a standard Convolutional Neural Network (CNN) to regress the hand joints' locations. Second, we use a Random Forest to refine the joints from the first stage. In the second stage, we propose a pyramid feature which merges the information flow of the CNN. Specifically, we get the rough joint locations from the first stage, then rotate the convolutional feature maps (and image). After this, for each joint, we first map its location to each feature map (and image), then crop features from each feature map (and image) around its location, and finally feed the extracted features to the Random Forest for refinement. Experimentally, we evaluate our proposed method on the ICVL dataset and obtain a mean error of about 11 mm; our method also runs in real time on a desktop.

  2. Decision tree modeling using R.

    PubMed

    Zhang, Zhongheng

    2016-08-01

    In the machine learning field, the decision tree learner is powerful and easy to interpret. It employs a recursive binary partitioning algorithm that splits the sample on the partitioning variable with the strongest association with the response variable. The process continues until some stopping criteria are met. In the example I focus on the conditional inference tree, which incorporates tree-structured regression models into conditional inference procedures. Because a single tree is sensitive to small changes in the training data, the random forests procedure is introduced to address this problem. The sources of diversity for random forests come from the random sampling and the restricted set of input variables available for selection. Finally, I introduce R functions to perform model-based recursive partitioning. This method incorporates recursive partitioning into conventional parametric model building.
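
The two sources of diversity named in the abstract above, random sampling of rows and a restricted set of candidate variables, can be sketched as below. The article itself uses R; this Python sketch is only an illustration, and the function names, the parameter name `mtry`, and the toy sizes are hypothetical. It draws one variable subset per tree for brevity, whereas Breiman's random forest redraws the candidate subset at every split.

```python
import random


def bootstrap_indices(n_rows, rng):
    """Sample n_rows row indices with replacement (a bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def candidate_features(n_features, mtry, rng):
    """Pick mtry distinct feature indices as split candidates."""
    return sorted(rng.sample(range(n_features), mtry))


rng = random.Random(0)           # fixed seed so the sketch is repeatable
n_rows, n_features, mtry = 10, 6, 2
for tree in range(3):
    rows = bootstrap_indices(n_rows, rng)
    # Rows never drawn are "out-of-bag" and can serve as a validation set.
    out_of_bag = sorted(set(range(n_rows)) - set(rows))
    feats = candidate_features(n_features, mtry, rng)
    print(tree, rows, out_of_bag, feats)
```

Because each tree sees a different sample and different candidate variables, the trees disagree in different places, and averaging them reduces the instability of any single tree.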

  3. Comparing the efficiency of digital and conventional soil mapping to predict soil types in a semi-arid region in Iran

    NASA Astrophysics Data System (ADS)

    Zeraatpisheh, Mojtaba; Ayoubi, Shamsollah; Jafari, Azam; Finke, Peter

    2017-05-01

    The efficiency of different digital and conventional soil mapping approaches to produce categorical maps of soil types is determined by cost, sample size, accuracy and the selected taxonomic level. The efficiency of digital and conventional soil mapping approaches was examined in the semi-arid region of Borujen, central Iran. This research aimed to (i) compare two digital soil mapping approaches, multinomial logistic regression and random forest, with the conventional soil mapping approach at four soil taxonomic levels (order, suborder, great group and subgroup levels), (ii) validate the predicted soil maps by the same validation data set to determine the best method for producing the soil maps, and (iii) select the best soil taxonomic level by different approaches at three sample sizes (100, 80, and 60 point observations), in two scenarios with and without a geomorphology map as a spatial covariate. In most predicted maps, using both digital soil mapping approaches, the best results were obtained using the combination of terrain attributes and the geomorphology map, although differences between the scenarios with and without the geomorphology map were not significant. Employing the geomorphology map increased map purity and the Kappa index, and led to a decrease in the 'noisiness' of soil maps. Multinomial logistic regression had better performance at higher taxonomic levels (order and suborder levels); however, random forest showed better performance at lower taxonomic levels (great group and subgroup levels). Multinomial logistic regression was less sensitive than random forest to a decrease in the number of training observations. The conventional soil mapping method produced a map with a larger minimum polygon size because of the traditional cartographic criteria used to make the 1:100,000 geological map (on which the conventional soil map was largely based). Likewise, the conventional soil map also had a larger average polygon size, which resulted in a lower level of detail. Multinomial logistic regression at the order level (map purity of 0.80), random forest at the suborder (map purity of 0.72) and great group level (map purity of 0.60), and conventional soil mapping at the subgroup level (map purity of 0.48) produced the most accurate maps in the study area. The multinomial logistic regression method was identified as the most effective approach based on a combined index of map purity, map information content, and map production cost. The combined index also showed that a smaller sample size led to a preference for the order level, while a larger sample size led to a preference for the great group level.

  4. Random forests as cumulative effects models: A case study of lakes and rivers in Muskoka, Canada.

    PubMed

    Jones, F Chris; Plewes, Rachel; Murison, Lorna; MacDougall, Mark J; Sinclair, Sarah; Davies, Christie; Bailey, John L; Richardson, Murray; Gunn, John

    2017-10-01

    Cumulative effects assessment (CEA) - a type of environmental appraisal - lacks effective methods for modeling cumulative effects, evaluating indicators of ecosystem condition, and exploring the likely outcomes of development scenarios. Random forests are an extension of classification and regression trees, which model response variables by recursive partitioning. Random forests were used to model a series of candidate ecological indicators that described lakes and rivers from a case study watershed (The Muskoka River Watershed, Canada). Suitability of the candidate indicators for use in cumulative effects assessment and watershed monitoring was assessed according to how well they could be predicted from natural habitat features and how sensitive they were to human land-use. The best models explained 75% of the variation in a multivariate descriptor of lake benthic-macroinvertebrate community structure, and 76% of the variation in the conductivity of river water. Similar results were obtained by cross-validation. Several candidate indicators detected a simulated doubling of urban land-use in their catchments, and a few were able to detect a simulated doubling of agricultural land-use. The paper demonstrates that random forests can be used to describe the combined and singular effects of multiple stressors and natural environmental factors, and furthermore, that random forests can be used to evaluate the performance of monitoring indicators. The numerical methods presented are applicable to any ecosystem and indicator type, and therefore represent a step forward for CEA. Crown Copyright © 2017. Published by Elsevier Ltd. All rights reserved.

  5. Comparisons between physics-based, engineering, and statistical learning models for outdoor sound propagation.

    PubMed

    Hart, Carl R; Reznicek, Nathan J; Wilson, D Keith; Pettit, Chris L; Nykaza, Edward T

    2016-05-01

    Many outdoor sound propagation models exist, ranging from highly complex physics-based simulations to simplified engineering calculations, and more recently, highly flexible statistical learning methods. Several engineering and statistical learning models are evaluated by using a particular physics-based model, namely, a Crank-Nicolson parabolic equation (CNPE), as a benchmark. Narrowband transmission loss values predicted with the CNPE, based upon a simulated data set of meteorological, boundary, and source conditions, act as simulated observations. In the simulated data set, sound propagation conditions span from downward refracting to upward refracting, for acoustically hard and soft boundaries, and low frequencies. Engineering models used in the comparisons include the ISO 9613-2 method, Harmonoise, and Nord2000 propagation models. Statistical learning methods used in the comparisons include bagged decision tree regression, random forest regression, boosting regression, and artificial neural network models. Computed skill scores are relative to sound propagation in a homogeneous atmosphere over a rigid ground. Overall skill scores for the engineering noise models are 0.6%, -7.1%, and 83.8% for the ISO 9613-2, Harmonoise, and Nord2000 models, respectively. Overall skill scores for the statistical learning models are 99.5%, 99.5%, 99.6%, and 99.6% for bagged decision tree, random forest, boosting, and artificial neural network regression models, respectively.
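
The abstract reports skill scores relative to a reference case but does not state the formula. A minimal sketch, assuming the common mean-squared-error form (1 minus the ratio of model MSE to reference MSE), might look like the following. The function names and the transmission-loss values are hypothetical and are not taken from the paper.

```python
def mean_squared_error(predicted, observed):
    """Mean squared difference between two equal-length sequences."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)


def skill_score(model_pred, reference_pred, observed):
    """Assumed MSE-based skill score.

    1 means a perfect model, 0 means no better than the reference,
    and negative values mean worse than the reference.
    """
    mse_ref = mean_squared_error(reference_pred, observed)
    if mse_ref == 0:
        raise ValueError("reference is already perfect; skill is undefined")
    return 1.0 - mean_squared_error(model_pred, observed) / mse_ref


# Hypothetical transmission-loss values in dB (illustration only).
observed = [60.0, 65.0, 70.0, 75.0]    # stands in for the benchmark output
reference = [58.0, 62.0, 66.0, 70.0]   # stands in for the reference case
model = [60.5, 64.5, 70.5, 74.5]       # stands in for a candidate model

score = skill_score(model, reference, observed)
print(f"skill score: {score:.1%}")
```

Under this assumed definition, a negative score such as the -7.1% reported for Harmonoise would mean the model did worse than the reference case.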

  6. Ensemble of trees approaches to risk adjustment for evaluating a hospital's performance.

    PubMed

    Liu, Yang; Traskin, Mikhail; Lorch, Scott A; George, Edward I; Small, Dylan

    2015-03-01

    A commonly used method for evaluating a hospital's performance on an outcome is to compare the hospital's observed outcome rate to the hospital's expected outcome rate given its patient (case) mix and service. The process of calculating the hospital's expected outcome rate given its patient mix and service is called risk adjustment (Iezzoni 1997). Risk adjustment is critical for accurately evaluating and comparing hospitals' performances since we would not want to unfairly penalize a hospital just because it treats sicker patients. The key to risk adjustment is accurately estimating the probability of an outcome given patient characteristics. For cases with binary outcomes, the method that is commonly used in risk adjustment is logistic regression. In this paper, we consider ensemble of trees methods as alternatives for risk adjustment, including random forests and Bayesian additive regression trees (BART). Both random forests and BART are modern machine learning methods that have been shown recently to have excellent performance for prediction of outcomes in many settings. We apply these methods to carry out risk adjustment for the performance of neonatal intensive care units (NICU). We show that these ensemble of trees methods outperform logistic regression in predicting mortality among babies treated in NICU, and provide a superior method of risk adjustment compared to logistic regression.
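
The observed-versus-expected comparison described above can be sketched in a few lines. This sketch assumes the expected rate is the mean of per-patient predicted probabilities from some fitted model (logistic regression, random forest, or BART) and that the comparison is expressed as a ratio. Both choices are assumptions for illustration, and the patient data are made up.

```python
def observed_to_expected(outcomes, predicted_probs):
    """Return (observed rate, expected rate, O/E ratio) for one hospital.

    outcomes: 0/1 outcome for each patient.
    predicted_probs: model-estimated outcome probability for each patient.
    """
    if len(outcomes) != len(predicted_probs) or not outcomes:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(outcomes)
    observed = sum(outcomes) / n
    expected = sum(predicted_probs) / n
    if expected == 0:
        raise ValueError("expected rate is zero; ratio is undefined")
    return observed, expected, observed / expected


# Hypothetical hospital with eight patients (illustration only).
outcomes = [0, 0, 1, 0, 1, 0, 0, 0]
predicted_probs = [0.1, 0.2, 0.6, 0.1, 0.5, 0.3, 0.1, 0.1]

obs_rate, exp_rate, ratio = observed_to_expected(outcomes, predicted_probs)
print(f"observed={obs_rate:.3f} expected={exp_rate:.3f} O/E={ratio:.2f}")
```

A ratio near 1 suggests the hospital performs about as its case mix predicts. The quality of the comparison depends on how well the probability model is calibrated, which is the point the abstract makes about model choice.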

  7. Temporal changes in randomness of bird communities across Central Europe.

    PubMed

    Renner, Swen C; Gossner, Martin M; Kahl, Tiemo; Kalko, Elisabeth K V; Weisser, Wolfgang W; Fischer, Markus; Allan, Eric

    2014-01-01

    Many studies have examined whether communities are structured by random or deterministic processes, and both are likely to play a role, but relatively few studies have attempted to quantify the degree of randomness in species composition. We quantified, for the first time, the degree of randomness in forest bird communities based on an analysis of spatial autocorrelation in three regions of Germany. The compositional dissimilarity between pairs of forest patches was regressed against the distance between them. We then calculated the y-intercept of the curve, i.e. the 'nugget', which represents the compositional dissimilarity at zero spatial distance. We therefore assume, following similar work on plant communities, that this represents the degree of randomness in species composition. We then analysed how the degree of randomness in community composition varied over time and with forest management intensity, which we expected to reduce the importance of random processes by increasing the strength of environmental drivers. We found that a high portion of the bird community composition could be explained by chance (overall mean of 0.63), implying that most of the variation in local bird community composition is driven by stochastic processes. Forest management intensity did not consistently affect the mean degree of randomness in community composition, perhaps because the bird communities were relatively insensitive to management intensity. We found a high temporal variation in the degree of randomness, which may indicate temporal variation in assembly processes and in the importance of key environmental drivers. We conclude that the degree of randomness in community composition should be considered in bird community studies, and the high values we find may indicate that bird community composition is relatively hard to predict at the regional scale.
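
The 'nugget' described above is the intercept of a fit of compositional dissimilarity against distance. A minimal sketch, assuming an ordinary least-squares straight line (the abstract only says "curve", so the functional form is an assumption) and using made-up data:

```python
def nugget(distances, dissimilarities):
    """Intercept of a least-squares line of dissimilarity on distance."""
    if len(distances) != len(dissimilarities) or len(distances) < 2:
        raise ValueError("need at least two paired observations")
    n = len(distances)
    mean_x = sum(distances) / n
    mean_y = sum(dissimilarities) / n
    sxx = sum((x - mean_x) ** 2 for x in distances)
    if sxx == 0:
        raise ValueError("all distances are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(distances, dissimilarities))
    slope = sxy / sxx
    # The intercept is the fitted dissimilarity at zero spatial distance.
    return mean_y - slope * mean_x


# Hypothetical pairwise data: distance in km, dissimilarity in [0, 1].
distances = [1.0, 2.0, 3.0, 4.0, 5.0]
dissimilarities = [0.62, 0.64, 0.66, 0.68, 0.70]

estimate = nugget(distances, dissimilarities)
print(f"nugget (dissimilarity at zero distance): {estimate:.2f}")
```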

  8. Predictive Utility of Marketed Volumetric Software Tools in Subjects at Risk for Alzheimer's: Do Regions Outside the Hippocampus Matter?

    PubMed Central

    Tanpitukpongse, Teerath P.; Mazurowski, Maciej A.; Ikhena, John; Petrella, Jeffrey R.

    2016-01-01

    Background and Purpose: To assess the prognostic efficacy of individual versus combined regional volumetrics in two commercially available brain volumetric software packages for predicting conversion of patients with mild cognitive impairment to Alzheimer's disease. Materials and Methods: Data were obtained through the Alzheimer's Disease Neuroimaging Initiative. 192 subjects (mean age 74.8 years, 39% female) diagnosed with mild cognitive impairment at baseline were studied. All had T1WI MRI sequences at baseline and 3-year clinical follow-up. Analysis was performed with NeuroQuant® and Neuroreader™. Receiver operating characteristic curves assessing the prognostic efficacy of each software package were generated using a univariable approach employing individual regional brain volumes, as well as two multivariable approaches (multiple regression and random forest), combining multiple volumes. Results: On univariable analysis of 11 NeuroQuant® and 11 Neuroreader™ regional volumes, hippocampal volume had the highest area under the curve for both software packages (0.69 NeuroQuant®, 0.68 Neuroreader™), and was not significantly different (p > 0.05) between packages. Multivariable analysis did not increase the area under the curve for either package (0.63 logistic regression, 0.60 random forest NeuroQuant®; 0.65 logistic regression, 0.62 random forest Neuroreader™). Conclusion: Of the multiple regional volume measures available in FDA-cleared brain volumetric software packages, hippocampal volume remains the best single predictor of conversion of mild cognitive impairment to Alzheimer's disease at 3-year follow-up. Combining volumetrics did not add additional prognostic efficacy. Therefore, future prognostic studies in MCI, combining such tools with demographic and other biomarker measures, are justified in using hippocampal volume as the only volumetric biomarker. PMID:28057634

  9. Taxi-Out Time Prediction for Departures at Charlotte Airport Using Machine Learning Techniques

    NASA Technical Reports Server (NTRS)

    Lee, Hanbong; Malik, Waqar; Jung, Yoon C.

    2016-01-01

    Predicting the taxi-out times of departures accurately is important for improving airport efficiency and takeoff time predictability. In this paper, we attempt to apply machine learning techniques to actual traffic data at Charlotte Douglas International Airport for taxi-out time prediction. To find the key factors affecting aircraft taxi times, surface surveillance data is first analyzed. From this data analysis, several variables, including terminal concourse, spot, runway, departure fix and weight class, are selected for taxi time prediction. Then, various machine learning methods such as linear regression, support vector machines, k-nearest neighbors, random forest, and neural networks model are applied to actual flight data. Different traffic flow and weather conditions at Charlotte airport are also taken into account for more accurate prediction. The taxi-out time prediction results show that linear regression and random forest techniques can provide the most accurate prediction in terms of root-mean-square errors. We also discuss the operational complexity and uncertainties that make it difficult to predict the taxi times accurately.

  10. Summer and winter habitat suitability of Marco Polo argali in southeastern Tajikistan: A modeling approach.

    PubMed

    Salas, Eric Ariel L; Valdez, Raul; Michel, Stefan

    2017-11-01

    We modeled summer and winter habitat suitability of Marco Polo argali in the Pamir Mountains in southeastern Tajikistan using these statistical algorithms: Generalized Linear Model, Random Forest, Boosted Regression Tree, Maxent, and Multivariate Adaptive Regression Splines. Using sheep occurrence data collected from 2009 to 2015 and a set of selected habitat predictors, we produced summer and winter habitat suitability maps and determined the important habitat suitability predictors for both seasons. Our results demonstrated that argali selected proximity to riparian areas and greenness as the two most relevant variables for summer, and the degree of slope (gentler slopes between 0° to 20°) and Landsat temperature band for winter. The terrain roughness was also among the most important variables in summer and winter models. Aspect was only significant for winter habitat, with argali preferring south-facing mountain slopes. We evaluated various measures of model performance such as the Area Under the Curve (AUC) and the True Skill Statistic (TSS). Comparing the five algorithms, the AUC scored highest for Boosted Regression Tree in summer (AUC = 0.94) and winter model runs (AUC = 0.94). In contrast, Random Forest underperformed in both model runs.
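
The True Skill Statistic mentioned above is commonly defined as sensitivity plus specificity minus one. A minimal sketch of that definition, with hypothetical confusion-matrix counts that are not from the study:

```python
def true_skill_statistic(tp, fn, tn, fp):
    """TSS = sensitivity + specificity - 1, computed from confusion counts.

    tp/fn: presences predicted as present/absent.
    tn/fp: absences predicted as absent/present.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one presence and one absence")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0


# Hypothetical validation counts (illustration only).
tss = true_skill_statistic(tp=40, fn=10, tn=45, fp=5)
print(f"TSS = {tss:.2f}")
```

TSS ranges from -1 to 1, with values at or below 0 indicating performance no better than random, which makes it a useful companion to AUC when comparing habitat models.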

  11. Mapping Soil Properties of Africa at 250 m Resolution: Random Forests Significantly Improve Current Predictions

    PubMed Central

    Hengl, Tomislav; Heuvelink, Gerard B. M.; Kempen, Bas; Leenaars, Johan G. B.; Walsh, Markus G.; Shepherd, Keith D.; Sila, Andrew; MacMillan, Robert A.; Mendes de Jesus, Jorge; Tamene, Lulseged; Tondoh, Jérôme E.

    2015-01-01

    80% of arable land in Africa has low soil fertility and suffers from physical soil problems. Additionally, significant amounts of nutrients are lost every year due to unsustainable soil management practices. This is partially the result of insufficient use of soil management knowledge. To help bridge the soil information gap in Africa, the Africa Soil Information Service (AfSIS) project was established in 2008. Over the period 2008–2014, the AfSIS project compiled two point data sets: the Africa Soil Profiles (legacy) database and the AfSIS Sentinel Site database. These data sets contain over 28 thousand sampling locations and represent the most comprehensive soil sample data sets of the African continent to date. Utilizing these point data sets in combination with a large number of covariates, we have generated a series of spatial predictions of soil properties relevant to the agricultural management—organic carbon, pH, sand, silt and clay fractions, bulk density, cation-exchange capacity, total nitrogen, exchangeable acidity, Al content and exchangeable bases (Ca, K, Mg, Na). We specifically investigate differences between two predictive approaches: random forests and linear regression. Results of 5-fold cross-validation demonstrate that the random forests algorithm consistently outperforms the linear regression algorithm, with average decreases of 15–75% in Root Mean Squared Error (RMSE) across soil properties and depths. Fitting and running random forests models takes an order of magnitude more time and the modelling success is sensitive to artifacts in the input data, but as long as quality-controlled point data are provided, an increase in soil mapping accuracy can be expected. Results also indicate that globally predicted soil classes (USDA Soil Taxonomy, especially Alfisols and Mollisols) help improve continental scale soil property mapping, and are among the most important predictors. 
This indicates a promising potential for transferring pedological knowledge from data rich countries to countries with limited soil data. PMID:26110833

  12. Why choose Random Forest to predict rare species distribution with few samples in large undersampled areas? Three Asian crane species models provide supporting evidence.

    PubMed

    Mi, Chunrong; Huettmann, Falk; Guo, Yumin; Han, Xuesong; Wen, Lijia

    2017-01-01

    Species distribution models (SDMs) have become an essential tool in ecology, biogeography, evolution and, more recently, in conservation biology. How to generalize species distributions in large undersampled areas, especially with few samples, is a fundamental issue of SDMs. In order to explore this issue, we used the best available presence records for the Hooded Crane (Grus monacha, n = 33), White-naped Crane (Grus vipio, n = 40), and Black-necked Crane (Grus nigricollis, n = 75) in China as three case studies, employing four powerful and commonly used machine learning algorithms to map the breeding distributions of the three species: TreeNet (Stochastic Gradient Boosting, Boosted Regression Tree Model), Random Forest, CART (Classification and Regression Tree) and Maxent (Maximum Entropy Models). In addition, we developed an ensemble forecast by averaging the predicted probabilities of the above four models' results. Commonly used model performance metrics (Area under ROC (AUC) and true skill statistic (TSS)) were employed to evaluate model accuracy. The latest satellite tracking data and compiled literature data were used as two independent testing datasets to confront model predictions. We found Random Forest demonstrated the best performance for most assessment methods, provided a better model fit to the testing data, and achieved better species range maps for each crane species in undersampled areas. Random Forest has been generally available for more than 20 years and has been known to perform extremely well in ecological predictions. However, while increasingly on the rise, its potential is still widely underused in conservation, (spatial) ecological applications and for inference. Our results show that it informs ecological and biogeographical theories as well as being suitable for conservation applications, specifically when the study area is undersampled.
This method helps to save model-selection time and effort, and allows robust and rapid assessments and decisions for efficient conservation.
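
The ensemble forecast described above averages the predicted probabilities of the four models. A minimal sketch of that averaging step, assuming each model returns one probability per map location in the same order. The dictionary layout and the numbers are hypothetical.

```python
def ensemble_mean(model_predictions):
    """Average per-location probabilities across models.

    model_predictions: dict mapping a model name to a list of predicted
    probabilities, one per location, in the same location order.
    """
    columns = list(model_predictions.values())
    if not columns:
        raise ValueError("need at least one model")
    n_locations = len(columns[0])
    if any(len(col) != n_locations for col in columns):
        raise ValueError("all models must predict the same locations")
    # zip(*columns) groups the predictions for each location together.
    return [sum(values) / len(columns) for values in zip(*columns)]


# Hypothetical predicted probabilities at three locations.
predictions = {
    "TreeNet": [0.8, 0.2, 0.5],
    "Random Forest": [0.9, 0.1, 0.6],
    "CART": [0.7, 0.3, 0.4],
    "Maxent": [0.6, 0.2, 0.5],
}

ensemble = ensemble_mean(predictions)
print([round(p, 2) for p in ensemble])
```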

  13. Why choose Random Forest to predict rare species distribution with few samples in large undersampled areas? Three Asian crane species models provide supporting evidence

    PubMed Central

    Mi, Chunrong; Huettmann, Falk; Han, Xuesong; Wen, Lijia

    2017-01-01

    Species distribution models (SDMs) have become an essential tool in ecology, biogeography, evolution and, more recently, in conservation biology. How to generalize species distributions in large undersampled areas, especially with few samples, is a fundamental issue of SDMs. In order to explore this issue, we used the best available presence records for the Hooded Crane (Grus monacha, n = 33), White-naped Crane (Grus vipio, n = 40), and Black-necked Crane (Grus nigricollis, n = 75) in China as three case studies, employing four powerful and commonly used machine learning algorithms to map the breeding distributions of the three species: TreeNet (Stochastic Gradient Boosting, Boosted Regression Tree Model), Random Forest, CART (Classification and Regression Tree) and Maxent (Maximum Entropy Models). In addition, we developed an ensemble forecast by averaging the predicted probabilities of the above four models' results. Commonly used model performance metrics (Area under ROC (AUC) and true skill statistic (TSS)) were employed to evaluate model accuracy. The latest satellite tracking data and compiled literature data were used as two independent testing datasets to confront model predictions. We found Random Forest demonstrated the best performance for most assessment methods, provided a better model fit to the testing data, and achieved better species range maps for each crane species in undersampled areas. Random Forest has been generally available for more than 20 years and has been known to perform extremely well in ecological predictions. However, while increasingly on the rise, its potential is still widely underused in conservation, (spatial) ecological applications and for inference. Our results show that it informs ecological and biogeographical theories as well as being suitable for conservation applications, specifically when the study area is undersampled.
This method helps to save model-selection time and effort, and allows robust and rapid assessments and decisions for efficient conservation. PMID:28097060

  14. Accurate Diabetes Risk Stratification Using Machine Learning: Role of Missing Value and Outliers.

    PubMed

    Maniruzzaman, Md; Rahman, Md Jahanur; Al-MehediHasan, Md; Suri, Harman S; Abedin, Md Menhazul; El-Baz, Ayman; Suri, Jasjit S

    2018-04-10

    Diabetes mellitus is a group of metabolic diseases in which blood sugar levels are too high. About 8.8% of the world was diabetic in 2017. It is projected that this will reach nearly 10% by 2045. The major challenge is that when machine learning-based classifiers are applied to such data sets for risk stratification, their performance is reduced. Thus, our objective is to develop an optimized and robust machine learning (ML) system under the assumption that missing values or outliers, if replaced by a median configuration, will yield higher risk stratification accuracy. This ML-based risk stratification is designed, optimized and evaluated, where the features are extracted and optimized from six feature selection techniques (random forest, logistic regression, mutual information, principal component analysis, analysis of variance, and Fisher discriminant ratio) and combined with ten different types of classifiers (linear discriminant analysis, quadratic discriminant analysis, naïve Bayes, Gaussian process classification, support vector machine, artificial neural network, Adaboost, logistic regression, decision tree, and random forest) under the hypothesis that both missing values and outliers, when replaced by computed medians, will improve the risk stratification accuracy. The Pima Indian diabetic dataset (768 patients: 268 diabetic and 500 controls) was used. Our results demonstrate that replacing the missing values and outliers by group median and median values, respectively, and then combining random forest feature selection with random forest classification yields an accuracy, sensitivity, specificity, positive predictive value, negative predictive value and area under the curve of 92.26%, 95.96%, 79.72%, 91.14%, 91.20%, and 0.93, respectively. This is an improvement of 10% over previously developed techniques published in the literature. The system was validated for its stability and reliability. The RF-based model showed the best performance when outliers were replaced by median values.
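
The median-replacement idea above can be illustrated for a single feature. The abstract does not say how outliers were identified, so this sketch assumes the conventional 1.5 x IQR rule and, for brevity, uses one overall median rather than per-group medians. The function name and the sample values are hypothetical.

```python
from statistics import median, quantiles


def impute_with_median(values):
    """Replace missing values (None) and IQR outliers with the median.

    Assumption: a value is an outlier if it lies more than 1.5 * IQR
    outside the first or third quartile of the observed values.
    """
    observed = [v for v in values if v is not None]
    if len(observed) < 2:
        raise ValueError("need at least two observed values")
    q1, _, q3 = quantiles(observed, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    fill = median(observed)
    return [
        fill if v is None or not (low <= v <= high) else v
        for v in values
    ]


# Hypothetical measurements with one missing value and one extreme value.
raw = [70, 72, None, 68, 74, 300, 71]
cleaned = impute_with_median(raw)
print(cleaned)
```

A per-group variant would apply the same function separately to the diabetic and control subsets before recombining them.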

  15. Mapping Soil Properties of Africa at 250 m Resolution: Random Forests Significantly Improve Current Predictions.

    PubMed

    Hengl, Tomislav; Heuvelink, Gerard B M; Kempen, Bas; Leenaars, Johan G B; Walsh, Markus G; Shepherd, Keith D; Sila, Andrew; MacMillan, Robert A; Mendes de Jesus, Jorge; Tamene, Lulseged; Tondoh, Jérôme E

    2015-01-01

    80% of arable land in Africa has low soil fertility and suffers from physical soil problems. Additionally, significant amounts of nutrients are lost every year due to unsustainable soil management practices. This is partially the result of insufficient use of soil management knowledge. To help bridge the soil information gap in Africa, the Africa Soil Information Service (AfSIS) project was established in 2008. Over the period 2008-2014, the AfSIS project compiled two point data sets: the Africa Soil Profiles (legacy) database and the AfSIS Sentinel Site database. These data sets contain over 28 thousand sampling locations and represent the most comprehensive soil sample data sets of the African continent to date. Utilizing these point data sets in combination with a large number of covariates, we have generated a series of spatial predictions of soil properties relevant to the agricultural management--organic carbon, pH, sand, silt and clay fractions, bulk density, cation-exchange capacity, total nitrogen, exchangeable acidity, Al content and exchangeable bases (Ca, K, Mg, Na). We specifically investigate differences between two predictive approaches: random forests and linear regression. Results of 5-fold cross-validation demonstrate that the random forests algorithm consistently outperforms the linear regression algorithm, with average decreases of 15-75% in Root Mean Squared Error (RMSE) across soil properties and depths. Fitting and running random forests models takes an order of magnitude more time and the modelling success is sensitive to artifacts in the input data, but as long as quality-controlled point data are provided, an increase in soil mapping accuracy can be expected. Results also indicate that globally predicted soil classes (USDA Soil Taxonomy, especially Alfisols and Mollisols) help improve continental scale soil property mapping, and are among the most important predictors. 
This indicates a promising potential for transferring pedological knowledge from data rich countries to countries with limited soil data.

  16. Modeling Verdict Outcomes Using Social Network Measures: The Watergate and Caviar Network Cases.

    PubMed

    Masías, Víctor Hugo; Valle, Mauricio; Morselli, Carlo; Crespo, Fernando; Vargas, Augusto; Laengle, Sigifredo

    2016-01-01

    Modelling criminal trial verdict outcomes using social network measures is an emerging research area in quantitative criminology. Few studies have yet analyzed which of these measures are the most important for verdict modelling or which data classification techniques perform best for this application. To compare the performance of different techniques in classifying members of a criminal network, this article applies three different machine learning classifiers (Logistic Regression, Naïve Bayes and Random Forest) with a range of social network measures and the necessary databases to model the verdicts in two real-world cases: the U.S. Watergate Conspiracy of the 1970s and the now-defunct Canada-based international drug trafficking ring known as the Caviar Network. In both cases it was found that the Random Forest classifier did better than either Logistic Regression or Naïve Bayes, and its superior performance was statistically significant. This being so, Random Forest was used not only for classification but also to assess the importance of the measures. For the Watergate case, the most important one proved to be betweenness centrality while for the Caviar Network, it was the effective size of the network. These results are significant because they show that an approach combining machine learning with social network analysis not only can generate accurate classification models but also helps quantify the importance of social network variables in modelling verdict outcomes. We conclude our analysis with a discussion and some suggestions for future work in verdict modelling using social network measures.

  17. Producing landslide susceptibility maps by utilizing machine learning methods. The case of Finikas catchment basin, North Peloponnese, Greece.

    NASA Astrophysics Data System (ADS)

    Tsangaratos, Paraskevas; Ilia, Ioanna; Loupasakis, Constantinos; Papadakis, Michalis; Karimalis, Antonios

    2017-04-01

    The main objective of the present study was to apply two machine learning methods for the production of a landslide susceptibility map in the Finikas catchment basin, located in North Peloponnese, Greece, and to compare their results. Specifically, Logistic Regression and Random Forest were utilized, based on a database of 40 sites classified into two categories, non-landslide and landslide areas, that were separated into a training dataset (70% of the total data) and a validation dataset (remaining 30%). The identification of the areas was established by analyzing airborne imagery, extensive field investigation and the examination of previous research studies. Six landslide related variables were analyzed, namely: lithology, elevation, slope, aspect, distance to rivers and distance to faults. Within the Finikas catchment basin most of the reported landslides were located along the road network and within the residential complexes, classified as rotational and translational slides, and rockfalls, caused mainly by the physical conditions and the general geotechnical behavior of the geological formations that cover the area. Each landslide susceptibility map was reclassified by applying the Geometric Interval classification technique into five classes, namely: very low susceptibility, low susceptibility, moderate susceptibility, high susceptibility, and very high susceptibility. The comparison and validation of the outcomes of each model were achieved using statistical evaluation measures, the receiver operating characteristic and the area under the success and predictive rate curves. The computation process was carried out using RStudio, an integrated development environment for the R language, and ArcGIS 10.1 for compiling the data and producing the landslide susceptibility maps. From the outcomes of the Logistic Regression analysis it was inferred that the highest b coefficients are allocated to lithology and slope, at 2.8423 and 1.5841, respectively.
From the estimation of the mean decrease in Gini coefficient performed during the application of Random Forest and the mean decrease in accuracy, the most important variable is slope, followed by lithology, aspect, elevation, distance from river network, and distance from faults, while the most used variables during the training phase were aspect (21.45%), slope (20.53%) and lithology (19.84%). The outcomes of the analysis are consistent with previous studies concerning the area of research, which have indicated the high influence of lithology and slope in the manifestation of landslides. A high percentage of landslide occurrence has been observed in Plio-Pleistocene sediments, flysch formations, and Cretaceous limestone. The presence of landslides has also been associated with the degree of weathering and fragmentation, the orientation of the discontinuity surfaces and the intense morphological relief. The most accurate model was Random Forest, which correctly identified 92.00% of the instances during the training phase, followed by Logistic Regression with 89.00%. The same pattern of accuracy was calculated during the validation phase, in which Random Forest achieved a classification accuracy of 93.00%, while the Logistic Regression model achieved an accuracy of 91.00%. In conclusion, the outcomes of the study could be a useful cartographic product to local authorities and government agencies during the implementation of successful decision-making and land use planning strategies. Keywords: Landslide Susceptibility, Logistic Regression, Random Forest, GIS, Greece.

  18. An evaluation of supervised classifiers for indirectly detecting salt-affected areas at irrigation scheme level

    NASA Astrophysics Data System (ADS)

    Muller, Sybrand Jacobus; van Niekerk, Adriaan

    2016-07-01

    Soil salinity often leads to reduced crop yield and quality and can render soils barren. Irrigated areas are particularly at risk due to intensive cultivation and secondary salinization caused by waterlogging. Regular monitoring of salt accumulation in irrigation schemes is needed to keep its negative effects under control. The dynamic spatial and temporal characteristics of remote sensing can provide a cost-effective solution for monitoring salt accumulation at irrigation scheme level. This study evaluated a range of pan-fused SPOT-5 derived features (spectral bands, vegetation indices, image textures and image transformations) for classifying salt-affected areas in two distinctly different irrigation schemes in South Africa, namely Vaalharts and Breede River. The relationship between the input features and electrical conductivity measurements was investigated using regression modelling (stepwise linear regression, partial least squares regression, curve fit regression modelling) and supervised classification (maximum likelihood, nearest neighbour, decision tree analysis, support vector machine and random forests). Classification and regression trees and random forest were used to select the most important features for differentiating salt-affected and unaffected areas. The results showed that the regression analyses produced weak models (<0.4 R squared). Better results were achieved using the supervised classifiers, but the algorithms tended to over-estimate salt-affected areas. A key finding was that none of the feature sets or classification algorithms stood out as being superior for monitoring salt accumulation at irrigation scheme level. This was attributed to the large variations in the spectral responses of different crop types at different growing stages, coupled with their individual tolerances to saline conditions.

  19. Probability machines: consistent probability estimation using nonparametric learning machines.

    PubMed

    Malley, J D; Kruppa, J; Dasgupta, A; Malley, K G; Ziegler, A

    2012-01-01

    Most machine learning approaches only provide a classification for binary responses. However, probabilities are required for risk estimation using individual patient characteristics. It has been shown recently that every statistical learning machine known to be consistent for a nonparametric regression problem is a probability machine that is provably consistent for this estimation problem. The aim of this paper is to show how random forests and nearest neighbors can be used for consistent estimation of individual probabilities. Two random forest algorithms and two nearest neighbor algorithms are described in detail for estimation of individual probabilities. We discuss the consistency of random forests, nearest neighbors and other learning machines in detail. We conduct a simulation study to illustrate the validity of the methods. We exemplify the algorithms by analyzing two well-known data sets on the diagnosis of appendicitis and the diagnosis of diabetes in Pima Indians. Simulations demonstrate the validity of the method. With the real data application, we show the accuracy and practicality of this approach. We provide sample code from R packages in which the probability estimation is already available. This means that all calculations can be performed using existing software. Random forest algorithms as well as nearest neighbor approaches are valid machine learning methods for estimating individual probabilities for binary responses. Freely available implementations are available in R and may be used for applications.

  20. Data mining methods in the prediction of Dementia: A real-data comparison of the accuracy, sensitivity and specificity of linear discriminant analysis, logistic regression, neural networks, support vector machines, classification trees and random forests

    PubMed Central

    2011-01-01

    Background: Dementia and cognitive impairment associated with aging are a major medical and social concern. Neuropsychological testing is a key element in the diagnostic procedures of Mild Cognitive Impairment (MCI), but presently has limited value in the prediction of progression to dementia. We advance the hypothesis that newer statistical classification methods derived from data mining and machine learning, such as Neural Networks, Support Vector Machines and Random Forests, can improve the accuracy, sensitivity and specificity of predictions obtained from neuropsychological testing. Seven nonparametric classifiers derived from data mining methods (Multilayer Perceptrons Neural Networks, Radial Basis Function Neural Networks, Support Vector Machines, CART, CHAID and QUEST Classification Trees and Random Forests) were compared to three traditional classifiers (Linear Discriminant Analysis, Quadratic Discriminant Analysis and Logistic Regression) in terms of overall classification accuracy, specificity, sensitivity, area under the ROC curve and Press' Q. Model predictors were 10 neuropsychological tests currently used in the diagnosis of dementia. Statistical distributions of classification parameters obtained from a 5-fold cross-validation were compared using Friedman's nonparametric test. Results: Press' Q test showed that all classifiers performed better than chance alone (p < 0.05). Support Vector Machines showed the largest overall classification accuracy (Median (Me) = 0.76) and area under the ROC curve (Me = 0.90). However, this method showed high specificity (Me = 1.0) but low sensitivity (Me = 0.3). Random Forest ranked second in overall accuracy (Me = 0.73), with high area under the ROC curve (Me = 0.73), specificity (Me = 0.73) and sensitivity (Me = 0.64). Linear Discriminant Analysis also showed acceptable overall accuracy (Me = 0.66), with acceptable area under the ROC curve (Me = 0.72), specificity (Me = 0.66) and sensitivity (Me = 0.64).
The remaining classifiers showed overall classification accuracy above a median value of 0.63, but for most of them sensitivity was around or even lower than a median value of 0.5. Conclusions: When taking into account sensitivity, specificity and overall classification accuracy, Random Forests and Linear Discriminant Analysis rank first among all the classifiers tested in prediction of dementia using several neuropsychological tests. These methods may be used to improve accuracy, sensitivity and specificity of dementia predictions from neuropsychological testing. PMID:21849043
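
    The abstract compares classifiers on accuracy, sensitivity and specificity, and notes that a model can score well on accuracy and specificity while missing most true cases. The sketch below computes the three quantities from binary labels using their standard definitions; the function name and the toy labels are hypothetical and are not taken from the study data.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, sensitivity and specificity for binary labels (1 = case, 0 = non-case)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # cases correctly flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-cases correctly cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-cases wrongly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # cases missed
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one true case and one true non-case")
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # share of true cases detected
        "specificity": tn / (tn + fp),  # share of true non-cases correctly cleared
    }


# Hypothetical example: a cautious classifier that rarely predicts the positive class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

    In this toy case accuracy is 0.7 and specificity is 1.0, yet sensitivity is only 0.25, which mirrors the pattern the abstract reports for Support Vector Machines.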

  1. Automated retrieval of forest structure variables based on multi-scale texture analysis of VHR satellite imagery

    NASA Astrophysics Data System (ADS)

    Beguet, Benoit; Guyon, Dominique; Boukir, Samia; Chehata, Nesrine

    2014-10-01

    The main goal of this study is to design a method to describe the structure of forest stands from Very High Resolution satellite imagery, relying on some typical variables such as crown diameter, tree height, trunk diameter, tree density and tree spacing. The emphasis is placed on automating the identification of the most relevant image features for the forest structure retrieval task, exploiting both spectral and spatial information. Our approach is based on linear regressions between the forest structure variables to be estimated and various spectral and Haralick's texture features. The main drawback of this well-known texture representation is the underlying parameters, which are extremely difficult to set due to the spatial complexity of the forest structure. To tackle this major issue, an automated feature selection process is proposed which is based on statistical modeling, exploring a wide range of parameter values. It provides texture measures of diverse spatial parameters, hence implicitly inducing a multi-scale texture analysis. A new feature selection technique, which we call Random PRiF, is proposed. It relies on random sampling in feature space and carefully addresses the multicollinearity issue in multiple linear regression while ensuring accurate prediction of forest variables. Our automated forest variable estimation scheme was tested on Quickbird and Pléiades panchromatic and multispectral images, acquired at different periods on the maritime pine stands of two sites in South-Western France. It outperforms two well-established variable subset selection techniques. It has been successfully applied to identify the best texture features in modeling the five considered forest structure variables.
The RMSE of all predicted forest variables is improved by combining multispectral and panchromatic texture features, with various parameterizations, highlighting the potential of a multi-resolution approach for retrieving forest structure variables from VHR satellite images. Thus an average prediction error of ~1.1 m is expected on crown diameter, ~0.9 m on tree spacing, ~3 m on height and ~0.06 m on diameter at breast height.

  2. A Global Study of GPP focusing on Light Use Efficiency in a Random Forest Regression Model

    NASA Astrophysics Data System (ADS)

    Fang, W.; Wei, S.; Yi, C.; Hendrey, G. R.

    2016-12-01

    Light use efficiency (LUE) is at the core of mechanistic modeling of global gross primary production (GPP). However, most LUE estimates in global models are satellite-based and coarsely measured with emphasis on environmental variables. Others are from eddy covariance towers with much greater spatial and temporal data quality and emphasis on mechanistic processes, but in a limited number of sites. In this paper, we conducted a comprehensive global study of tower-based LUE from 237 FLUXNET towers, and scaled up LUEs from in-situ tower level to global biome level. We integrated key environmental and biological variables into the tower-based LUE estimates, at 0.5° × 0.5° grid-cell resolution, using a random forest regression (RFR) approach. We then developed an RFR-LUE-GPP model using the grid-cell LUE data, and compared it to a tower-LUE-GPP model by the conventional way of treating LUE as a series of biome-specific constants. In order to calibrate the LUE models, we developed a data-driven RFR-GPP model using a random forest regression method. Our results showed that LUE varies largely with latitude. We estimated a global area-weighted average of LUE at 1.21 g C m⁻² MJ⁻¹ APAR, which led to an estimated global GPP of 102.9 Gt C/year from 2000 to 2005. The tower-LUE-GPP model tended to overestimate forest GPP in tropical and boreal regions. Large uncertainties exist in GPP estimates over sparsely vegetated areas covered by savannas and woody savannas around the middle to low latitudes (e.g., 20°S to 40°S and 5°N to 15°N) due to lack of available data. Model results were improved by incorporating Köppen climate types to represent climate/meteorological information in machine learning modeling. This shed new light on the recognized issues of climate dependence of spring onset of photosynthesis and the challenges in modeling the biome GPP of evergreen broad leaf forests (EBF) accurately.
The divergent responses of GPP to temperature and precipitation at mid-high latitudes and at mid-low latitudes echoed the necessity of modeling GPP separately by latitudes. This work provided a global distribution of LUE estimate, and developed a comprehensive algorithm modeling global terrestrial carbon with high spatial and temporal resolutions.

  3. A bioavailable strontium isoscape for Western Europe: A machine learning approach

    PubMed Central

    von Holstein, Isabella C. C.; Laffoon, Jason E.; Willmes, Malte; Liu, Xiao-Ming; Davies, Gareth R.

    2018-01-01

    Strontium isotope ratios (87Sr/86Sr) are gaining considerable interest as a geolocation tool and are now widely applied in archaeology, ecology, and forensic research. However, their application for provenance requires the development of baseline models predicting surficial 87Sr/86Sr variations (“isoscapes”). A variety of empirically-based and process-based models have been proposed to build terrestrial 87Sr/86Sr isoscapes but, in their current forms, those models are not mature enough to be integrated with continuous-probability surface models used in geographic assignment. In this study, we aim to overcome those limitations and to predict 87Sr/86Sr variations across Western Europe by combining process-based models and a series of remote-sensing geospatial products into a regression framework. We find that random forest regression significantly outperforms other commonly used regression and interpolation methods, and efficiently predicts the multi-scale patterning of 87Sr/86Sr variations by accounting for geological, geomorphological and atmospheric controls. Random forest regression also provides an easily interpretable and flexible framework to integrate different types of environmental auxiliary variables required to model the multi-scale patterning of 87Sr/86Sr variability. The method is transferable to different scales and resolutions and can be applied to the large collection of geospatial data available at local and global levels. The isoscape generated in this study provides the most accurate 87Sr/86Sr predictions in bioavailable strontium for Western Europe (R2 = 0.58 and RMSE = 0.0023) to date, as well as a conservative estimate of spatial uncertainty by applying quantile regression forest. We anticipate that the method presented in this study combined with the growing numbers of bioavailable 87Sr/86Sr data and satellite geospatial products will extend the applicability of the 87Sr/86Sr geo-profiling tool in provenance applications. PMID:29847595

  4. Modeling nitrate at domestic and public-supply well depths in the Central Valley, California

    USGS Publications Warehouse

    Nolan, Bernard T.; Gronberg, JoAnn M.; Faunt, Claudia C.; Eberts, Sandra M.; Belitz, Ken

    2014-01-01

    Aquifer vulnerability models were developed to map groundwater nitrate concentration at domestic and public-supply well depths in the Central Valley, California. We compared three modeling methods for ability to predict nitrate concentration >4 mg/L: logistic regression (LR), random forest classification (RFC), and random forest regression (RFR). All three models indicated processes of nitrogen fertilizer input at the land surface, transmission through coarse-textured, well-drained soils, and transport in the aquifer to the well screen. The total percent correct predictions were similar among the three models (69–82%), but RFR had greater sensitivity (84% for shallow wells and 51% for deep wells). The results suggest that RFR can better identify areas with high nitrate concentration but that LR and RFC may better describe bulk conditions in the aquifer. A unique aspect of the modeling approach was inclusion of outputs from previous, physically based hydrologic and textural models as predictor variables, which were important to the models. Vertical water fluxes in the aquifer and percent coarse material above the well screen were ranked moderately high-to-high in the RFR models, and the average vertical water flux during the irrigation season was highly significant (p < 0.0001) in logistic regression.

  5. Statistically extracted fundamental watershed variables for estimating the loads of total nitrogen in small streams

    USGS Publications Warehouse

    Kronholm, Scott C.; Capel, Paul D.; Terziotti, Silvia

    2016-01-01

    Accurate estimation of total nitrogen loads is essential for evaluating conditions in the aquatic environment. Extrapolation of estimates beyond measured streams will greatly expand our understanding of total nitrogen loading to streams. Recursive partitioning and random forest regression were used to assess 85 geospatial, environmental, and watershed variables across 636 small (<585 km²) watersheds to determine which variables are fundamentally important to the estimation of annual loads of total nitrogen. Initial analysis led to the splitting of watersheds into three groups based on predominant land use (agricultural, developed, and undeveloped). Nitrogen application, agricultural and developed land area, and impervious or developed land in the 100-m stream buffer were commonly extracted variables by both recursive partitioning and random forest regression. A series of multiple linear regression equations utilizing the extracted variables were created and applied to the watersheds. As few as three variables explained as much as 76 % of the variability in total nitrogen loads for watersheds with predominantly agricultural land use. Catchment-scale national maps were generated to visualize the total nitrogen loads and yields across the USA. The estimates provided by these models can inform water managers and help identify areas where more in-depth monitoring may be beneficial.

  6. Application of Machine-Learning Models to Predict Tacrolimus Stable Dose in Renal Transplant Recipients

    NASA Astrophysics Data System (ADS)

    Tang, Jie; Liu, Rong; Zhang, Yue-Li; Liu, Mou-Ze; Hu, Yong-Fang; Shao, Ming-Jie; Zhu, Li-Jun; Xin, Hua-Wen; Feng, Gui-Wen; Shang, Wen-Jun; Meng, Xiang-Guang; Zhang, Li-Rong; Ming, Ying-Zi; Zhang, Wei

    2017-02-01

    Tacrolimus has a narrow therapeutic window and considerable variability in clinical use. Our goal was to compare the performance of multiple linear regression (MLR) and eight machine learning techniques in pharmacogenetic algorithm-based prediction of tacrolimus stable dose (TSD) in a large Chinese cohort. A total of 1,045 renal transplant patients were recruited, 80% of which were randomly selected as the “derivation cohort” to develop dose-prediction algorithm, while the remaining 20% constituted the “validation cohort” to test the final selected algorithm. MLR, artificial neural network (ANN), regression tree (RT), multivariate adaptive regression splines (MARS), boosted regression tree (BRT), support vector regression (SVR), random forest regression (RFR), lasso regression (LAR) and Bayesian additive regression trees (BART) were applied and their performances were compared in this work. Among all the machine learning models, RT performed best in both derivation [0.71 (0.67-0.76)] and validation cohorts [0.73 (0.63-0.82)]. In addition, the ideal rate of RT was 4% higher than that of MLR. To our knowledge, this is the first study to use machine learning models to predict TSD, which will further facilitate personalized medicine in tacrolimus administration in the future.
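
    The random 80%/20% division into derivation and validation cohorts described above can be sketched as follows. The abstract does not give the exact cohort sizes or the randomization procedure, so the rounding rule, the fixed seed and the integer patient identifiers are assumptions made for illustration.

```python
import random


def split_cohort(ids, derivation_fraction=0.8, seed=0):
    """Randomly split identifiers into a derivation set and a validation set."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(ids)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * derivation_fraction)
    return shuffled[:cut], shuffled[cut:]


# 1,045 patients as in the abstract; the identifiers themselves are hypothetical.
patient_ids = list(range(1045))
derivation, validation = split_cohort(patient_ids)
```

    With this rounding rule, 1,045 patients yield 836 derivation and 209 validation subjects. Models are fitted only on the derivation set so that the validation performance is not inflated by data already seen during training.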

  7. Minimizing effects of methodological decisions on interpretation and prediction in species distribution studies: An example with background selection

    USGS Publications Warehouse

    Jarnevich, Catherine S.; Talbert, Marian; Morisette, Jeffrey T.; Aldridge, Cameron L.; Brown, Cynthia; Kumar, Sunil; Manier, Daniel; Talbert, Colin; Holcombe, Tracy R.

    2017-01-01

    Evaluating the conditions where a species can persist is an important question in ecology both to understand tolerances of organisms and to predict distributions across landscapes. Presence data combined with background or pseudo-absence locations are commonly used with species distribution modeling to develop these relationships. However, there is not a standard method to generate background or pseudo-absence locations, and method choice affects model outcomes. We evaluated combinations of both model algorithms (simple and complex generalized linear models, multivariate adaptive regression splines, Maxent, boosted regression trees, and random forest) and background methods (random, minimum convex polygon, and continuous and binary kernel density estimator (KDE)) to assess the sensitivity of model outcomes to choices made. We evaluated six questions related to model results, including five beyond the common comparison of model accuracy assessment metrics (biological interpretability of response curves, cross-validation robustness, independent data accuracy and robustness, and prediction consistency). For our case study with cheatgrass in the western US, random forest was least sensitive to background choice and the binary KDE method was least sensitive to model algorithm choice. While this outcome may not hold for other locations or species, the methods we used can be implemented to help determine appropriate methodologies for particular research questions.

  8. Estimation of retinal vessel caliber using model fitting and random forests

    NASA Astrophysics Data System (ADS)

    Araújo, Teresa; Mendonça, Ana Maria; Campilho, Aurélio

    2017-03-01

    Retinal vessel caliber changes are associated with several major diseases, such as diabetes and hypertension. These caliber changes can be evaluated using eye fundus images. However, the clinical assessment is tiresome and prone to errors, motivating the development of automatic methods. An automatic method based on vessel cross-section intensity profile model fitting for the estimation of vessel caliber in retinal images is herein proposed. First, vessels are segmented from the image, vessel centerlines are detected and individual segments are extracted and smoothed. Intensity profiles are extracted perpendicularly to the vessel, and the profile lengths are determined. Then, model fitting is applied to the smoothed profiles. A novel parametric model (DoG-L7) is used, consisting of a Difference-of-Gaussians multiplied by a line, which is able to describe profile asymmetry. Finally, the parameters of the best-fit model are used for determining the vessel width through regression using ensembles of bagged regression trees with random sampling of the predictors (random forests). The method is evaluated on the REVIEW public dataset. A precision close to that of the observers is achieved, outperforming other state-of-the-art methods. The method is robust and reliable for width estimation in images with pathologies and artifacts, with performance independent of the range of diameters.

  9. Wildfire Selectivity for Land Cover Type: Does Size Matter?

    PubMed Central

    Barros, Ana M. G.; Pereira, José M. C.

    2014-01-01

    Previous research has shown that fires burn certain land cover types disproportionately to their abundance. We used quantile regression to study land cover proneness to fire as a function of fire size, under the hypothesis that they are inversely related, for all land cover types. Using five years of fire perimeters, we estimated conditional quantile functions for lower (avoidance) and upper (preference) quantiles of fire selectivity for five land cover types - annual crops, evergreen oak woodlands, eucalypt forests, pine forests and shrublands. The slope of significant regression quantiles describes the rate of change in fire selectivity (avoidance or preference) as a function of fire size. We used Monte-Carlo methods to randomly permute fires in order to obtain a distribution of fire selectivity due to chance. This distribution was used to test the null hypotheses that 1) mean fire selectivity does not differ from that obtained by randomly relocating observed fire perimeters; 2) land cover proneness to fire does not vary with fire size. Our results show that land cover proneness to fire is higher for shrublands and pine forests than for annual crops and evergreen oak woodlands. As fire size increases, selectivity decreases for all land cover types tested. Moreover, the rate of change in selectivity with fire size is higher for preference than for avoidance. Comparison between observed and randomized data led us to reject both null hypotheses tested (α = 0.05) and to conclude it is very unlikely the observed values of fire selectivity and change in selectivity with fire size are due to chance. PMID:24454747
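
    The Monte Carlo reasoning in this abstract, comparing an observed statistic with its distribution under random reshuffling, can be illustrated with a generic permutation test. This is not the authors' procedure, which relocates fire perimeters in space; the sketch below swaps in a simpler test on the difference in means between two groups, and the function name, the group values and the number of permutations are all hypothetical.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Monte Carlo p-value for the absolute difference in means between two groups."""
    rng = random.Random(seed)

    def mean(values):
        return sum(values) / len(values)

    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        # Reshuffling the pooled values breaks any real link between value and group.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for floating-point ties
            hits += 1
    # The +1 terms count the observed arrangement itself, so the p-value is never zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical selectivity values for two land cover types.
p_separated = permutation_p_value([10, 11, 12, 13, 14], [0, 1, 2, 3, 4])
p_identical = permutation_p_value([1, 1, 1], [1, 1, 1])
```

    A null hypothesis is rejected at the 0.05 level when the returned p-value falls below 0.05, as it does for the clearly separated groups here, whereas identical groups return a p-value of 1.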

  10. Predictive Utility of Marketed Volumetric Software Tools in Subjects at Risk for Alzheimer Disease: Do Regions Outside the Hippocampus Matter?

    PubMed

    Tanpitukpongse, T P; Mazurowski, M A; Ikhena, J; Petrella, J R

    2017-03-01

    Alzheimer disease is a prevalent neurodegenerative disease. Computer assessment of brain atrophy patterns can help predict conversion to Alzheimer disease. Our aim was to assess the prognostic efficacy of individual-versus-combined regional volumetrics in 2 commercially available brain volumetric software packages for predicting conversion of patients with mild cognitive impairment to Alzheimer disease. Data were obtained through the Alzheimer's Disease Neuroimaging Initiative. One hundred ninety-two subjects (mean age, 74.8 years; 39% female) diagnosed with mild cognitive impairment at baseline were studied. All had T1-weighted MR imaging sequences at baseline and 3-year clinical follow-up. Analysis was performed with NeuroQuant and Neuroreader. Receiver operating characteristic curves assessing the prognostic efficacy of each software package were generated with a univariable approach using individual regional brain volumes and 2 multivariable approaches (multiple regression and random forest), combining multiple volumes. On univariable analysis of 11 NeuroQuant and 11 Neuroreader regional volumes, hippocampal volume had the highest area under the curve for both software packages (0.69, NeuroQuant; 0.68, Neuroreader) and was not significantly different (P > .05) between packages. Multivariable analysis did not increase the area under the curve for either package (0.63, logistic regression; 0.60, random forest NeuroQuant; 0.65, logistic regression; 0.62, random forest Neuroreader). Of the multiple regional volume measures available in FDA-cleared brain volumetric software packages, hippocampal volume remains the best single predictor of conversion of mild cognitive impairment to Alzheimer disease at 3-year follow-up. Combining volumetrics did not add additional prognostic efficacy.
Therefore, future prognostic studies in mild cognitive impairment, combining such tools with demographic and other biomarker measures, are justified in using hippocampal volume as the only volumetric biomarker. © 2017 by American Journal of Neuroradiology.

  11. Estimating current and future streamflow characteristics at ungaged sites, central and eastern Montana, with application to evaluating effects of climate change on fish populations

    USGS Publications Warehouse

    Sando, Roy; Chase, Katherine J.

    2017-03-23

    A common statistical procedure for estimating streamflow statistics at ungaged locations is to develop a relational model between streamflow and drainage basin characteristics at gaged locations using least squares regression analysis; however, least squares regression methods are parametric and make constraining assumptions about the data distribution. The random forest regression method provides an alternative nonparametric method for estimating streamflow characteristics at ungaged sites and requires that the data meet fewer statistical conditions than least squares regression methods. Random forest regression analysis was used to develop predictive models for 89 streamflow characteristics using Precipitation-Runoff Modeling System simulated streamflow data and drainage basin characteristics at 179 sites in central and eastern Montana. The predictive models were developed from streamflow data simulated for current (baseline, water years 1982–99) conditions and three future periods (water years 2021–38, 2046–63, and 2071–88) under three different climate-change scenarios. These predictive models were then used to predict streamflow characteristics for baseline conditions and three future periods at 1,707 fish sampling sites in central and eastern Montana. The average root mean square error for all predictive models was about 50 percent. When streamflow predictions at 23 fish sampling sites were compared to nearby locations with simulated data, the mean relative percent difference was about 43 percent. When predictions were compared to streamflow data recorded at 21 U.S. Geological Survey streamflow-gaging stations outside of the calibration basins, the average mean absolute percent error was about 73 percent.
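
    The three error measures named in this abstract can be written out under their common textbook definitions. The report's exact formulas are not given here (for instance, how the root mean square error was expressed in percent), so the definitions, function names and flow values below are assumptions for illustration rather than the authors' calculations.

```python
import math


def rmse(observed, predicted):
    """Root mean square error, in the units of the data."""
    squared = [(p - o) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mean_absolute_percent_error(observed, predicted):
    """Mean of |predicted - observed| / |observed|, expressed in percent."""
    ratios = [abs(p - o) / abs(o) for o, p in zip(observed, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


def relative_percent_difference(a, b):
    """Absolute difference of two values relative to their mean, in percent."""
    return 100.0 * abs(a - b) / ((a + b) / 2.0)


# Hypothetical streamflow values (arbitrary units).
observed = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]

rmse_value = rmse(observed, predicted)
mape_value = mean_absolute_percent_error(observed, predicted)
rpd_value = relative_percent_difference(110.0, 100.0)
```

    The percent-based measures divide by the observed value or by the pair mean, so they can become very large where flows are close to zero.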

  12. ToxiM: A Toxicity Prediction Tool for Small Molecules Developed Using Machine Learning and Chemoinformatics Approaches.

    PubMed

    Sharma, Ashok K; Srivastava, Gopal N; Roy, Ankita; Sharma, Vineet K

    2017-01-01

    The experimental methods for the prediction of molecular toxicity are tedious and time-consuming tasks. Thus, the computational approaches could be used to develop alternative methods for toxicity prediction. We have developed a tool for the prediction of molecular toxicity along with the aqueous solubility and permeability of any molecule/metabolite. Using a comprehensive and curated set of toxin molecules as a training set, the different chemical and structural based features such as descriptors and fingerprints were exploited for feature selection, optimization and development of machine learning based classification and regression models. The compositional differences in the distribution of atoms were apparent between toxins and non-toxins, and hence, the molecular features were used for the classification and regression. On 10-fold cross-validation, the descriptor-based, fingerprint-based and hybrid-based classification models showed similar accuracy (93%) and Matthews's correlation coefficient (0.84). The performances of all the three models were comparable (Matthews's correlation coefficient = 0.84-0.87) on the blind dataset. In addition, the regression-based models using descriptors as input features were also compared and evaluated on the blind dataset. Random forest based regression model for the prediction of solubility performed better (R2 = 0.84) than the multi-linear regression (MLR) and partial least square regression (PLSR) models, whereas, the partial least squares based regression model for the prediction of permeability (caco-2) performed better (R2 = 0.68) in comparison to the random forest and MLR based regression models. The performance of final classification and regression models was evaluated using the two validation datasets including the known toxins and commonly used constituents of health products, which attests to its accuracy.
The ToxiM web server would be a highly useful and reliable tool for the prediction of toxicity, solubility, and permeability of small molecules.

  13. ToxiM: A Toxicity Prediction Tool for Small Molecules Developed Using Machine Learning and Chemoinformatics Approaches

    PubMed Central

    Sharma, Ashok K.; Srivastava, Gopal N.; Roy, Ankita; Sharma, Vineet K.

    2017-01-01

    The experimental methods for the prediction of molecular toxicity are tedious and time-consuming tasks. Thus, the computational approaches could be used to develop alternative methods for toxicity prediction. We have developed a tool for the prediction of molecular toxicity along with the aqueous solubility and permeability of any molecule/metabolite. Using a comprehensive and curated set of toxin molecules as a training set, the different chemical and structural based features such as descriptors and fingerprints were exploited for feature selection, optimization and development of machine learning based classification and regression models. The compositional differences in the distribution of atoms were apparent between toxins and non-toxins, and hence, the molecular features were used for the classification and regression. On 10-fold cross-validation, the descriptor-based, fingerprint-based and hybrid-based classification models showed similar accuracy (93%) and Matthews's correlation coefficient (0.84). The performances of all the three models were comparable (Matthews's correlation coefficient = 0.84–0.87) on the blind dataset. In addition, the regression-based models using descriptors as input features were also compared and evaluated on the blind dataset. Random forest based regression model for the prediction of solubility performed better (R2 = 0.84) than the multi-linear regression (MLR) and partial least square regression (PLSR) models, whereas, the partial least squares based regression model for the prediction of permeability (caco-2) performed better (R2 = 0.68) in comparison to the random forest and MLR based regression models. The performance of final classification and regression models was evaluated using the two validation datasets including the known toxins and commonly used constituents of health products, which attests to its accuracy. 
The ToxiM web server would be a highly useful and reliable tool for the prediction of toxicity, solubility, and permeability of small molecules. PMID:29249969

  14. Identification of immune correlates of protection in Shigella infection by application of machine learning.

    PubMed

    Arevalillo, Jorge M; Sztein, Marcelo B; Kotloff, Karen L; Levine, Myron M; Simon, Jakub K

    2017-10-01

    Immunologic correlates of protection are important in vaccine development because they give insight into mechanisms of protection, assist in the identification of promising vaccine candidates, and serve as endpoints in bridging clinical vaccine studies. Our goal is the development of a methodology to identify immunologic correlates of protection using the Shigella challenge as a model. The proposed methodology utilizes the Random Forests (RF) machine learning algorithm as well as Classification and Regression Trees (CART) to detect immune markers that predict protection, identify interactions between variables, and define optimal cutoffs. Logistic regression modeling is applied to estimate the probability of protection and the confidence interval (CI) for such a probability is computed by bootstrapping the logistic regression models. The results demonstrate that the combination of Classification and Regression Trees and Random Forests complements the standard logistic regression and uncovers subtle immune interactions. Specific levels of immunoglobulin IgG antibody in blood on the day of challenge predicted protection in 75% (95% CI 67-86). Of those subjects that did not have blood IgG at or above a defined threshold, 100% were protected if they had IgA antibody secreting cells above a defined threshold. Comparison with the results obtained by applying only logistic regression modeling with standard Akaike Information Criterion for model selection shows the usefulness of the proposed method. Given the complexity of the immune system, the use of machine learning methods may enhance traditional statistical approaches. When applied together, they offer a novel way to quantify important immune correlates of protection that may help the development of vaccines. Copyright © 2017 Elsevier Inc. All rights reserved.

  15. Effective search for stable segregation configurations at grain boundaries with data-mining techniques

    NASA Astrophysics Data System (ADS)

    Kiyohara, Shin; Mizoguchi, Teruyasu

    2018-03-01

    Grain boundary segregation of dopants plays a crucial role in materials properties. To investigate the dopant segregation behavior at the grain boundary, an enormous number of combinations have to be considered in the segregation of multiple dopants at the complex grain boundary structures. Here, two data mining techniques, the random-forests regression and the genetic algorithm, were applied to determine stable segregation sites at grain boundaries efficiently. Using the random-forests method, a predictive model was constructed from 2% of the segregation configurations and it has been shown that this model could determine the stable segregation configurations. Furthermore, the genetic algorithm also successfully determined the most stable segregation configuration with great efficiency. We demonstrate that these approaches are quite effective to investigate the dopant segregation behaviors at grain boundaries.

  16. Predicting Coastal Flood Severity using Random Forest Algorithm

    NASA Astrophysics Data System (ADS)

    Sadler, J. M.; Goodall, J. L.; Morsy, M. M.; Spencer, K.

    2017-12-01

    Coastal floods have become more common recently and are predicted to further increase in frequency and severity due to sea level rise. Predicting floods in coastal cities can be difficult due to the number of environmental and geographic factors which can influence flooding events. Built stormwater infrastructure and irregular urban landscapes add further complexity. This paper demonstrates the use of machine learning algorithms in predicting street flood occurrence in an urban coastal setting. The model is trained and evaluated using data from Norfolk, Virginia USA from September 2010 - October 2016. Rainfall, tide levels, water table levels, and wind conditions are used as input variables. Street flooding reports made by city workers after named and unnamed storm events, ranging from 1-159 reports per event, are the model output. Results show that Random Forest provides predictive power in estimating the number of flood occurrences given a set of environmental conditions with an out-of-bag root mean squared error of 4.3 flood reports and a mean absolute error of 0.82 flood reports. The Random Forest algorithm performed much better than Poisson regression. From the Random Forest model, total daily rainfall was by far the most important factor in flood occurrence prediction, followed by daily low tide and daily higher high tide. The model demonstrated here could be used to predict flood severity based on forecast rainfall and tide conditions and could be further enhanced using more complete street flooding data for model training.
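
    The gap between the reported root mean squared error (4.3) and mean absolute error (0.82) is what one would expect when most events are predicted closely and a few large events are missed by a wide margin. The sketch below shows the two metrics on made-up numbers; the observed and predicted counts are hypothetical and are not the Norfolk data.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: squares each miss, so large misses dominate."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error: every unit of error counts equally."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical events: 19 predicted exactly, one large storm missed by 20 reports.
observed = [0] * 19 + [40]
predicted = [0] * 19 + [20]

print(f"RMSE = {rmse(observed, predicted):.2f}, MAE = {mae(observed, predicted):.2f}")
```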

  17. GIS-based groundwater potential mapping using boosted regression tree, classification and regression tree, and random forest machine learning models in Iran.

    PubMed

    Naghibi, Seyed Amir; Pourghasemi, Hamid Reza; Dixon, Barnali

    2016-01-01

    Groundwater is considered one of the most valuable fresh water resources. The main objective of this study was to produce groundwater spring potential maps in the Koohrang Watershed, Chaharmahal-e-Bakhtiari Province, Iran, using three machine learning models: boosted regression tree (BRT), classification and regression tree (CART), and random forest (RF). Thirteen hydrological-geological-physiographical (HGP) factors that influence locations of springs were considered in this research. These factors include slope degree, slope aspect, altitude, topographic wetness index (TWI), slope length (LS), plan curvature, profile curvature, distance to rivers, distance to faults, lithology, land use, drainage density, and fault density. Subsequently, groundwater spring potential was modeled and mapped using CART, RF, and BRT algorithms. The predicted results from the three models were validated using the receiver operating characteristics curve (ROC). From 864 springs identified, 605 (≈70 %) locations were used for the spring potential mapping, while the remaining 259 (≈30 %) springs were used for the model validation. The area under the curve (AUC) for the BRT model was calculated as 0.8103 and for CART and RF the AUC were 0.7870 and 0.7119, respectively. Therefore, it was concluded that the BRT model produced the best prediction results while predicting locations of springs followed by CART and RF models, respectively. Geospatially integrated BRT, CART, and RF methods proved to be useful in generating the spring potential map (SPM) with reasonable accuracy.
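
    The area under the ROC curve used to rank the three models can be read as the probability that a randomly chosen spring location receives a higher predicted score than a randomly chosen non-spring location. A minimal sketch of that pairwise definition follows; the scores are hypothetical and the function name `auc` is chosen here for illustration.

```python
def auc(scores_pos, scores_neg):
    """Fraction of positive/negative pairs ranked correctly (ties count half)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical model scores for validation locations.
spring = [0.9, 0.8, 0.6, 0.4]
no_spring = [0.7, 0.5, 0.3, 0.2]

print(auc(spring, no_spring))
```

    The double loop is quadratic in the number of locations; for large validation sets a rank-based computation gives the same value more cheaply.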

  18. Enhancing Multimedia Imbalanced Concept Detection Using VIMP in Random Forests.

    PubMed

    Sadiq, Saad; Yan, Yilin; Shyu, Mei-Ling; Chen, Shu-Ching; Ishwaran, Hemant

    2016-07-01

    Recent developments in social media and cloud storage lead to an exponential growth in the amount of multimedia data, which increases the complexity of managing, storing, indexing, and retrieving information from such big data. Many current content-based concept detection approaches lag from successfully bridging the semantic gap. To solve this problem, a multi-stage random forest framework is proposed to generate predictor variables based on multivariate regressions using variable importance (VIMP). By fine tuning the forests and significantly reducing the predictor variables, the concept detection scores are evaluated when the concept of interest is rare and imbalanced, i.e., having little collaboration with other high level concepts. Using classical multivariate statistics, estimating the value of one coordinate using other coordinates standardizes the covariates and it depends upon the variance of the correlations instead of the mean. Thus, conditional dependence on the data being normally distributed is eliminated. Experimental results demonstrate that the proposed framework outperforms those approaches in the comparison in terms of the Mean Average Precision (MAP) values.

  19. A comparison of rule-based and machine learning approaches for classifying patient portal messages.

    PubMed

    Cronin, Robert M; Fabbri, Daniel; Denny, Joshua C; Rosenbloom, S Trent; Jackson, Gretchen Purcell

    2017-09-01

    Secure messaging through patient portals is an increasingly popular way that consumers interact with healthcare providers. The increasing burden of secure messaging can affect clinic staffing and workflows. Manual management of portal messages is costly and time consuming. Automated classification of portal messages could potentially expedite message triage and delivery of care. We developed automated patient portal message classifiers with rule-based and machine learning techniques using bag of words and natural language processing (NLP) approaches. To evaluate classifier performance, we used a gold standard of 3253 portal messages manually categorized using a taxonomy of communication types (i.e., main categories of informational, medical, logistical, social, and other communications, and subcategories including prescriptions, appointments, problems, tests, follow-up, contact information, and acknowledgement). We evaluated our classifiers' accuracies in identifying individual communication types within portal messages with area under the receiver-operator curve (AUC). Portal messages often contain more than one type of communication. To predict all communication types within single messages, we used the Jaccard Index. We extracted the variables of importance for the random forest classifiers. The best performing approaches to classification for the major communication types were: logistic regression for medical communications (AUC: 0.899); basic (rule-based) for informational communications (AUC: 0.842); and random forests for social communications and logistical communications (AUCs: 0.875 and 0.925, respectively). The best performing classification approach of classifiers for individual communication subtypes was random forests for Logistical-Contact Information (AUC: 0.963). 
The Jaccard Indices by approach were: basic classifier, Jaccard Index: 0.674; Naïve Bayes, Jaccard Index: 0.799; random forests, Jaccard Index: 0.859; and logistic regression, Jaccard Index: 0.861. For medical communications, the most predictive variables were NLP concepts (e.g., Temporal_Concept, which maps to 'morning', 'evening' and Idea_or_Concept which maps to 'appointment' and 'refill'). For logistical communications, the most predictive variables contained similar numbers of NLP variables and words (e.g., Telephone mapping to 'phone', 'insurance'). For social and informational communications, the most predictive variables were words (e.g., social: 'thanks', 'much', informational: 'question', 'mean'). This study applies automated classification methods to the content of patient portal messages and evaluates the application of NLP techniques on consumer communications in patient portal messages. We demonstrated that random forest and logistic regression approaches accurately classified the content of portal messages, although the best approach to classification varied by communication type. Words were the most predictive variables for classification of most communication types, although NLP variables were most predictive for medical communication types. As adoption of patient portals increases, automated techniques could assist in understanding and managing growing volumes of messages. Further work is needed to improve classification performance to potentially support message triage and answering. Copyright © 2017 Elsevier B.V. All rights reserved.
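
    Because a single message can carry several communication types, the Jaccard Index compares the set of predicted types with the set of true types for each message. The sketch below averages that per-message score; the example messages are hypothetical, and treating two empty label sets as a perfect match is a convention assumed here, not one stated in the abstract.

```python
def jaccard(true_labels, pred_labels):
    """Size of the intersection divided by size of the union of two label sets."""
    true_set, pred_set = set(true_labels), set(pred_labels)
    union = true_set | pred_set
    if not union:
        return 1.0  # assumed convention: two empty label sets agree fully
    return len(true_set & pred_set) / len(union)

def mean_jaccard(pairs):
    """Average Jaccard index over (true, predicted) label-set pairs."""
    return sum(jaccard(t, p) for t, p in pairs) / len(pairs)

# Hypothetical messages: (true types, predicted types).
messages = [
    ({"medical", "logistical"}, {"medical", "logistical"}),  # exact match
    ({"medical", "social"}, {"medical"}),                    # one type missed
    ({"informational"}, {"informational", "logistical"}),    # one extra type
]
score = mean_jaccard(messages)
print(f"mean Jaccard index: {score:.3f}")
```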

  20. Computed tomography synthesis from magnetic resonance images in the pelvis using multiple random forests and auto-context features

    NASA Astrophysics Data System (ADS)

    Andreasen, Daniel; Edmund, Jens M.; Zografos, Vasileios; Menze, Bjoern H.; Van Leemput, Koen

    2016-03-01

    In radiotherapy treatment planning that is only based on magnetic resonance imaging (MRI), the electron density information usually obtained from computed tomography (CT) must be derived from the MRI by synthesizing a so-called pseudo CT (pCT). This is a non-trivial task since MRI intensities are neither uniquely nor quantitatively related to electron density. Typical approaches involve either a classification or regression model requiring specialized MRI sequences to solve intensity ambiguities, or an atlas-based model necessitating multiple registrations between atlases and subject scans. In this work, we explore a machine learning approach for creating a pCT of the pelvic region from conventional MRI sequences without using atlases. We use a random forest provided with information about local texture, edges and spatial features derived from the MRI. This helps to solve intensity ambiguities. Furthermore, we use the concept of auto-context by sequentially training a number of classification forests to create and improve context features, which are finally used to train a regression forest for pCT prediction. We evaluate the pCT quality in terms of the voxel-wise error and the radiologic accuracy as measured by water-equivalent path lengths. We compare the performance of our method against two baseline pCT strategies, which either set all MRI voxels in the subject equal to the CT value of water, or in addition transfer the bone volume from the real CT. We show an improved performance compared to both baseline pCTs suggesting that our method may be useful for MRI-only radiotherapy.

  1. Studies of the DIII-D disruption database using Machine Learning algorithms

    NASA Astrophysics Data System (ADS)

    Rea, Cristina; Granetz, Robert; Meneghini, Orso

    2017-10-01

    A Random Forests Machine Learning algorithm, trained on a large database of both disruptive and non-disruptive DIII-D discharges, predicts disruptive behavior in DIII-D with about 90% accuracy. Several algorithms have been tested and Random Forests was found superior in performance for this particular task. Over 40 plasma parameters are included in the database, with data for each of the parameters taken from 500k time slices. We focused on a subset of non-dimensional plasma parameters, deemed to be good predictors based on physics considerations. Both binary (disruptive/non-disruptive) and multi-label (label based on the elapsed time before disruption) classification problems are investigated. The Random Forests algorithm provides insight on the available dataset by ranking the relative importance of the input features. It is found that q95 and Greenwald density fraction (n/nG) are the most relevant parameters for discriminating between DIII-D disruptive and non-disruptive discharges. A comparison with the Gradient Boosted Trees algorithm is shown and the first results coming from the application of regression algorithms are presented. Work supported by the US Department of Energy under DE-FC02-04ER54698, DE-SC0014264 and DE-FG02-95ER54309.

  2. Prediction of Baseflow Index of Catchments using Machine Learning Algorithms

    NASA Astrophysics Data System (ADS)

    Yadav, B.; Hatfield, K.

    2017-12-01

    We present the results of eight machine learning techniques for predicting the baseflow index (BFI) of ungauged basins using a surrogate of catchment scale climate and physiographic data. The tested algorithms include ordinary least squares, ridge regression, least absolute shrinkage and selection operator (lasso), elasticnet, support vector machine, gradient boosted regression trees, random forests, and extremely randomized trees. Our work seeks to identify the dominant controls of BFI that can be readily obtained from ancillary geospatial databases and remote sensing measurements, such that the developed techniques can be extended to ungauged catchments. More than 800 gauged catchments spanning the continental United States were selected to develop the general methodology. The BFI calculation was based on the baseflow separated from the daily streamflow hydrograph using the HYSEP filter. The surrogate catchment attributes were compiled from multiple sources including digital elevation model, soil, landuse, and climate data, and other publicly available ancillary and geospatial data. 80% of the catchments were used to train the ML algorithms, and the remaining 20% of the catchments were used as an independent test set to measure the generalization performance of fitted models. A k-fold cross-validation using exhaustive grid search was used to fit the hyperparameters of each model. Initial model development was based on 19 independent variables, but after variable selection and feature ranking, we generated revised sparse models of BFI prediction that are based on only six catchment attributes. These key predictive variables selected after the careful evaluation of bias-variance tradeoff include average catchment elevation, slope, fraction of sand, permeability, temperature, and precipitation. 
The most promising algorithms exceeding an accuracy score (r-square) of 0.7 on test data include support vector machine, gradient boosted regression trees, random forests, and extremely randomized trees. Considering both the accuracy and the computational complexity of these algorithms, we identify the extremely randomized trees as the best performing algorithm for BFI prediction in ungauged basins.
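
    The workflow described above (an 80/20 train/test split, k-fold cross-validation with an exhaustive grid search over hyperparameters, and r-square on the held-out set) can be sketched on a toy problem. The example uses a one-predictor ridge regression with a closed-form fit so that it needs no external library. The synthetic data, the penalty grid, and the interleaved fold assignment are assumptions for illustration only and do not reproduce the study's models or data.

```python
import random

def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge fit for one predictor; lam shrinks the slope."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / (sxx + lam)
    return slope, my - slope * mx

def r_squared(ys, preds):
    """Coefficient of determination: 1 - residual SS / total SS."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

def cv_score(xs, ys, lam, k=5):
    """Mean held-out r-square over k folds (fold i holds every k-th sample)."""
    scores = []
    for i in range(k):
        train = [j for j in range(len(xs)) if j % k != i]
        test = [j for j in range(len(xs)) if j % k == i]
        slope, icpt = fit_ridge_1d([xs[j] for j in train], [ys[j] for j in train], lam)
        preds = [slope * xs[j] + icpt for j in test]
        scores.append(r_squared([ys[j] for j in test], preds))
    return sum(scores) / k

# Synthetic data standing in for catchment attributes and BFI (hypothetical).
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [0.5 * x + rng.gauss(0, 0.5) for x in xs]

# 80/20 split: tune on the first 80 samples, keep the last 20 untouched.
x_train, y_train, x_test, y_test = xs[:80], ys[:80], xs[80:], ys[80:]

# Exhaustive grid search: score every candidate penalty by cross-validation.
grid = [0.0, 1.0, 10.0, 100.0, 1000.0]
best_lam = max(grid, key=lambda lam: cv_score(x_train, y_train, lam))

slope, icpt = fit_ridge_1d(x_train, y_train, best_lam)
test_r2 = r_squared(y_test, [slope * x + icpt for x in x_test])
print(f"best lambda = {best_lam}, test r-square = {test_r2:.3f}")
```

    The test set is touched only once, after the penalty has been chosen, which is what allows its r-square to serve as an estimate of generalization performance.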

  3. Application of XGBoost algorithm in hourly PM2.5 concentration prediction

    NASA Astrophysics Data System (ADS)

    Pan, Bingyue

    2018-02-01

    In view of prediction techniques of hourly PM2.5 concentration in China, this paper applied the XGBoost(Extreme Gradient Boosting) algorithm to predict hourly PM2.5 concentration. The monitoring data of air quality in Tianjin city was analyzed by using XGBoost algorithm. The prediction performance of the XGBoost method is evaluated by comparing observed and predicted PM2.5 concentration using three measures of forecast accuracy. The XGBoost method is also compared with the random forest algorithm, multiple linear regression, decision tree regression and support vector machines for regression models using computational results. The results demonstrate that the XGBoost algorithm outperforms other data mining methods.

  4. Large unbalanced credit scoring using Lasso-logistic regression ensemble.

    PubMed

    Wang, Hong; Xu, Qingsong; Zhou, Lifeng

    2015-01-01

    Recently, various ensemble learning methods with different base classifiers have been proposed for credit scoring problems. However, for various reasons, there has been little research using logistic regression as the base classifier. In this paper, given large unbalanced data, we consider the plausibility of ensemble learning using regularized logistic regression as the base classifier to deal with credit scoring problems. In this research, the data is first balanced and diversified by clustering and bagging algorithms. Then we apply a Lasso-logistic regression learning ensemble to evaluate the credit risks. We show that the proposed algorithm outperforms popular credit scoring models such as decision tree, Lasso-logistic regression and random forests in terms of AUC and F-measure. We also provide two importance measures for the proposed model to identify important variables in the data.

  5. Comparison of modeling methods to predict the spatial distribution of deep-sea coral and sponge in the Gulf of Alaska

    NASA Astrophysics Data System (ADS)

    Rooper, Christopher N.; Zimmermann, Mark; Prescott, Megan M.

    2017-08-01

    Deep-sea coral and sponge ecosystems are widespread throughout most of Alaska's marine waters, and are associated with many different species of fishes and invertebrates. These ecosystems are vulnerable to the effects of commercial fishing activities and climate change. We compared four commonly used species distribution models (general linear models, generalized additive models, boosted regression trees and random forest models) and an ensemble model to predict the presence or absence and abundance of six groups of benthic invertebrate taxa in the Gulf of Alaska. All four model types performed adequately on training data for predicting presence and absence, with regression forest models having the best overall performance measured by the area under the receiver-operating-curve (AUC). The models also performed well on the test data for presence and absence with average AUCs ranging from 0.66 to 0.82. For the test data, ensemble models performed the best. For abundance data, there was an obvious demarcation in performance between the two regression-based methods (general linear models and generalized additive models), and the tree-based models. The boosted regression tree and random forest models out-performed the other models by a wide margin on both the training and testing data. However, there was a significant drop-off in performance for all models of invertebrate abundance ( 50%) when moving from the training data to the testing data. Ensemble model performance was between the tree-based and regression-based methods. The maps of predictions from the models for both presence and abundance agreed very well across model types, with an increase in variability in predictions for the abundance data. 
We conclude that where data conforms well to the modeled distribution (such as the presence-absence data and binomial distribution in this study), the four types of models will provide similar results, although the regression-type models may be more consistent with biological theory. For data with highly zero-inflated distributions and non-normal distributions such as the abundance data from this study, the tree-based methods performed better. Ensemble models that averaged predictions across the four model types, performed better than the GLM or GAM models but slightly poorer than the tree-based methods, suggesting ensemble models might be more robust to overfitting than tree methods, while mitigating some of the disadvantages in predictive performance of regression methods.

  6. Comparison of partial least squares and random forests for evaluating relationship between phenolics and bioactivities of Neptunia oleracea.

    PubMed

    Lee, Soo Yee; Mediani, Ahmed; Maulidiani, Maulidiani; Khatib, Alfi; Ismail, Intan Safinar; Zawawi, Norhasnida; Abas, Faridah

    2018-01-01

    Neptunia oleracea is a plant consumed as a vegetable and which has been used as a folk remedy for several diseases. Herein, two regression models (partial least squares, PLS; and random forest, RF) in a metabolomics approach were compared and applied to the evaluation of the relationship between phenolics and bioactivities of N. oleracea. In addition, the effects of different extraction conditions on the phenolic constituents were assessed by pattern recognition analysis. Comparison of the PLS and RF showed that RF exhibited poorer generalization and hence poorer predictive performance. Both the regression coefficient of PLS and the variable importance of RF revealed that quercetin and kaempferol derivatives, caffeic acid and vitexin-2-O-rhamnoside were significant towards the tested bioactivities. Furthermore, principal component analysis (PCA) and partial least squares-discriminant analysis (PLS-DA) results showed that sonication and absolute ethanol are the preferable extraction method and ethanol ratio, respectively, to produce N. oleracea extracts with high phenolic levels and therefore high DPPH scavenging and α-glucosidase inhibitory activities. Both PLS and RF are useful regression models in metabolomics studies. This work provides insight into the performance of different multivariate data analysis tools and the effects of different extraction conditions on the extraction of desired phenolics from plants. © 2017 Society of Chemical Industry.

  7. Reader reaction to "a robust method for estimating optimal treatment regimes" by Zhang et al. (2012).

    PubMed

    Taylor, Jeremy M G; Cheng, Wenting; Foster, Jared C

    2015-03-01

    A recent article (Zhang et al., 2012, Biometrics 68, 1010-1018) compares regression based and inverse probability based methods of estimating an optimal treatment regime and shows for a small number of covariates that inverse probability weighted methods are more robust to model misspecification than regression methods. We demonstrate that using models that fit the data better reduces the concern about non-robustness for the regression methods. We extend the simulation study of Zhang et al. (2012, Biometrics 68, 1010-1018), also considering the situation of a larger number of covariates, and show that incorporating random forests into both regression and inverse probability weighted based methods improves their properties. © 2014, The International Biometric Society.

  8. Can Predictive Modeling Identify Head and Neck Oncology Patients at Risk for Readmission?

    PubMed

    Manning, Amy M; Casper, Keith A; Peter, Kay St; Wilson, Keith M; Mark, Jonathan R; Collar, Ryan M

    2018-05-01

    Objective: Unplanned readmission within 30 days is a contributor to health care costs in the United States. The use of predictive modeling during hospitalization to identify patients at risk for readmission offers a novel approach to quality improvement and cost reduction. Study Design: Two-phase study including retrospective analysis of prospectively collected data followed by prospective longitudinal study. Setting: Tertiary academic medical center. Subjects and Methods: Prospectively collected data for patients undergoing surgical treatment for head and neck cancer from January 2013 to January 2015 were used to build predictive models for readmission within 30 days of discharge using logistic regression, classification and regression tree (CART) analysis, and random forests. One model (logistic regression) was then placed prospectively into the discharge workflow from March 2016 to May 2016 to determine the model's ability to predict which patients would be readmitted within 30 days. Results: In total, 174 admissions had descriptive data. Thirty-two were excluded due to incomplete data. Logistic regression, CART, and random forest predictive models were constructed using the remaining 142 admissions. When applied to 106 consecutive prospective head and neck oncology patients at the time of discharge, the logistic regression model predicted readmissions with a specificity of 94%, a sensitivity of 47%, a negative predictive value of 90%, and a positive predictive value of 62% (odds ratio, 14.9; 95% confidence interval, 4.02-55.45). Conclusion: Prospectively collected head and neck cancer databases can be used to develop predictive models that can accurately predict which patients will be readmitted. This offers valuable support for quality improvement initiatives and readmission-related cost reduction in head and neck cancer care.
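
    The sensitivity, specificity, predictive values, and odds ratio reported above all derive from a single 2x2 table of predicted versus actual readmissions. The abstract does not give that table. The counts in the sketch below are an assumption: one set of integers that sums to the 106 prospective patients and reproduces the reported figures after rounding. The actual table may differ.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard 2x2 table summaries for a binary predictor."""
    return {
        "sensitivity": tp / (tp + fn),   # readmitted patients flagged by the model
        "specificity": tn / (tn + fp),   # non-readmitted patients correctly cleared
        "ppv": tp / (tp + fp),           # flagged patients who were readmitted
        "npv": tn / (tn + fn),           # cleared patients who were not readmitted
        "odds_ratio": (tp * tn) / (fp * fn),
    }

# Assumed counts (not stated in the abstract), chosen to total 106 patients.
m = diagnostic_metrics(tp=8, fp=5, fn=9, tn=84)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

    With only 17 readmissions in these assumed counts, each additional true positive moves sensitivity by about six percentage points, which is consistent with the wide confidence interval reported for the odds ratio.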

  9. Empirical analyses of plant-climate relationships for the western United States

    Treesearch

    Gerald E. Rehfeldt; Nicholas L. Crookston; Marcus V. Warwell; Jeffrey S. Evans

    2006-01-01

    The Random Forests multiple-regression tree was used to model climate profiles of 25 biotic communities of the western United States and nine of their constituent species. Analyses of the communities were based on a gridded sample of ca. 140,000 points, while those for the species used presence-absence data from ca. 120,000 locations. Independent variables included 35...

  10. Prediction of Short-Distance Aerial Movement of Phakopsora pachyrhizi Urediniospores Using Machine Learning.

    PubMed

    Wen, L; Bowen, C R; Hartman, G L

    2017-10-01

    Dispersal of urediniospores by wind is the primary means of spread for Phakopsora pachyrhizi, the cause of soybean rust. Our research focused on the short-distance movement of urediniospores from within the soybean canopy and up to 61 m from field-grown rust-infected soybean plants. Environmental variables were used to develop and compare models including the least absolute shrinkage and selection operator regression, zero-inflated Poisson/regular Poisson regression, random forest, and neural network to describe deposition of urediniospores collected in passive and active traps. All four models identified distance of trap from source, humidity, temperature, wind direction, and wind speed as the five most important variables influencing short-distance movement of urediniospores. The random forest model provided the best predictions, explaining 76.1 and 86.8% of the total variation in the passive- and active-trap datasets, respectively. The prediction accuracy based on the correlation coefficient (r) between predicted values and the true values were 0.83 (P < 0.0001) and 0.94 (P < 0.0001) for the passive and active trap datasets, respectively. Overall, multiple machine learning techniques identified the most important variables to make the most accurate predictions of movement of P. pachyrhizi urediniospores short-distance.

  11. Prediction of Return-to-original-work after an Industrial Accident Using Machine Learning and Comparison of Techniques

    PubMed Central

    2018-01-01

    Background: Many studies have tried to develop predictors for return-to-work (RTW). However, since complex factors have been demonstrated to predict RTW, it is difficult to use them practically. This study investigated whether factors used in previous studies could predict whether an individual had returned to his/her original work by four years after termination of the worker's recovery period. Methods: An initial logistic regression analysis of 1,567 participants of the fourth Panel Study of Worker's Compensation Insurance yielded odds ratios. The participants were divided into two subsets, a training dataset and a test dataset. Using the training dataset, logistic regression, decision tree, random forest, and support vector machine models were established, and important variables of each model were identified. The predictive abilities of the different models were compared. Results: The analysis showed that only earned income and company-related factors significantly affected return-to-original-work (RTOW). The random forest model showed the best accuracy among the tested machine learning models; however, the difference was not prominent. Conclusion: It is possible to predict a worker's probability of RTOW using machine learning techniques with moderate accuracy. PMID: 29736160

  12. Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival.

    PubMed

    Ishwaran, Hemant; Lu, Min

    2018-06-04

    Random forests are a popular nonparametric tree ensemble procedure with broad applications to data analysis. While its widespread popularity stems from its prediction performance, an equally important feature is that it provides a fully nonparametric measure of variable importance (VIMP). A current limitation of VIMP, however, is that no systematic method exists for estimating its variance. As a solution, we propose a subsampling approach that can be used to estimate the variance of VIMP and for constructing confidence intervals. The method is general enough that it can be applied to many useful settings, including regression, classification, and survival problems. Using extensive simulations, we demonstrate the effectiveness of the subsampling estimator and in particular find that the delete-d jackknife variance estimator, a close cousin, is especially effective under low subsampling rates due to its bias correction properties. These 2 estimators are highly competitive when compared with the .164 bootstrap estimator, a modified bootstrap procedure designed to deal with ties in out-of-sample data. Most importantly, subsampling is computationally fast, thus making it especially attractive for big data settings. Copyright © 2018 John Wiley & Sons, Ltd.
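
    The two variance estimators named in this abstract can be computed from the same set of subsamples. The sketch below is a generic illustration, not the authors' implementation. It uses the sample mean as a stand-in for VIMP, and the b/n and b/(n-b) scalings are the standard textbook forms of the subsampling and delete-d jackknife estimators, assumed here because the abstract does not state them. The data, subsample size, and function name are hypothetical.

```python
import random
import statistics

def subsample_variances(data, statistic, b, n_sub=2000, seed=0):
    """Variance estimates for statistic(data) from subsamples of size b."""
    rng = random.Random(seed)
    n = len(data)
    theta_n = statistic(data)
    # Draw subsamples without replacement and recompute the statistic on each.
    thetas = [statistic(rng.sample(data, b)) for _ in range(n_sub)]
    theta_bar = sum(thetas) / n_sub
    # Subsampling estimator: spread around the subsample mean, rescaled by b/n.
    v_sub = (b / n) * sum((t - theta_bar) ** 2 for t in thetas) / n_sub
    # Delete-d jackknife (d = n - b): spread around the full-sample value.
    v_jack = (b / (n - b)) * sum((t - theta_n) ** 2 for t in thetas) / n_sub
    return v_sub, v_jack

# Hypothetical data; the sample mean stands in for a VIMP value.
rng = random.Random(1)
data = [rng.gauss(0, 1) for _ in range(200)]
v_sub, v_jack = subsample_variances(data, statistics.fmean, b=20)
reference = statistics.variance(data) / len(data)  # textbook variance of a mean

print(f"subsampling: {v_sub:.5f}  delete-d jackknife: {v_jack:.5f}  reference: {reference:.5f}")
```

    For a real forest, `statistic` would refit the forest on the subsample and return the VIMP of one variable, which is far more expensive per call than the mean used here.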

  13. Harmonic regression of Landsat time series for modeling attributes from national forest inventory data

    NASA Astrophysics Data System (ADS)

    Wilson, Barry T.; Knight, Joseph F.; McRoberts, Ronald E.

    2018-03-01

    Imagery from the Landsat Program has been used frequently as a source of auxiliary data for modeling land cover, as well as a variety of attributes associated with tree cover. With ready access to all scenes in the archive since 2008 due to the USGS Landsat Data Policy, new approaches to deriving such auxiliary data from dense Landsat time series are required. Several methods have previously been developed for use with finer temporal resolution imagery (e.g. AVHRR and MODIS), including image compositing and harmonic regression using Fourier series. The manuscript presents a study, using Minnesota, USA during the years 2009-2013 as the study area and timeframe. The study examined the relative predictive power of land cover models, in particular those related to tree cover, using predictor variables based solely on composite imagery versus those using estimated harmonic regression coefficients. The study used two common non-parametric modeling approaches (i.e. k-nearest neighbors and random forests) for fitting classification and regression models of multiple attributes measured on USFS Forest Inventory and Analysis plots using all available Landsat imagery for the study area and timeframe. The estimated Fourier coefficients developed by harmonic regression of tasseled cap transformation time series data were shown to be correlated with land cover, including tree cover. Regression models using estimated Fourier coefficients as predictor variables showed a two- to threefold increase in explained variance for a small set of continuous response variables, relative to comparable models using monthly image composites. Similarly, the overall accuracies of classification models using the estimated Fourier coefficients were approximately 10-20 percentage points higher than the models using the image composites, with corresponding individual class accuracies between six and 45 percentage points higher.
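
    Harmonic regression fits a truncated Fourier series to a pixel's time series by least squares, and the fitted coefficients then serve as predictor variables. The sketch below fits a single annual harmonic to irregularly spaced observations by solving the normal equations. The number of harmonics, the acquisition dates, and the synthetic signal are assumptions for illustration; the abstract does not specify the series length used in the study.

```python
import math

def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def harmonic_fit(times, values, period=1.0):
    """Least-squares fit of y = a0 + a1*cos(wt) + b1*sin(wt); returns [a0, a1, b1]."""
    w = 2 * math.pi / period
    rows = [[1.0, math.cos(w * t), math.sin(w * t)] for t in times]
    # Normal equations: (X^T X) coef = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, values)) for i in range(3)]
    return solve(xtx, xty)

# Hypothetical irregular acquisition dates (fraction of a year) and a synthetic signal.
times = [0.02, 0.11, 0.19, 0.33, 0.41, 0.58, 0.66, 0.79, 0.87, 0.95]
values = [0.3 + 0.2 * math.cos(2 * math.pi * t) + 0.1 * math.sin(2 * math.pi * t) for t in times]

a0, a1, b1 = harmonic_fit(times, values)
print(f"a0 = {a0:.3f}, a1 = {a1:.3f}, b1 = {b1:.3f}")
```

    Unlike a monthly composite, the fit uses every cloud-free observation regardless of its date, which is why it suits dense but irregular Landsat series.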

  14. Prediction of body mass index status from voice signals based on machine learning for automated medical applications.

    PubMed

    Lee, Bum Ju; Kim, Keun Ho; Ku, Boncho; Jang, Jun-Su; Kim, Jong Yeol

    2013-05-01

    The body mass index (BMI) provides essential medical information related to body weight for the treatment and prognosis prediction of diseases such as cardiovascular disease, diabetes, and stroke. We propose a method for the prediction of normal, overweight, and obese classes based only on the combination of voice features that are associated with BMI status, independently of weight and height measurements. A total of 1568 subjects were divided into 4 groups according to age and gender differences. We performed statistical analyses by analysis of variance (ANOVA) and Scheffe test to find significant features in each group. We predicted BMI status (normal, overweight, and obese) by a logistic regression algorithm and two ensemble classification algorithms (bagging and random forests) based on statistically significant features. In the Female-2030 group (females aged 20-40 years), classification experiments using an imbalanced (original) data set gave area under the receiver operating characteristic curve (AUC) values of 0.569-0.731 by logistic regression, whereas experiments using a balanced data set gave AUC values of 0.893-0.994 by random forests. AUC values in Female-4050 (females aged 41-60 years), Male-2030 (males aged 20-40 years), and Male-4050 (males aged 41-60 years) groups by logistic regression in imbalanced data were 0.585-0.654, 0.581-0.614, and 0.557-0.653, respectively. AUC values in Female-4050, Male-2030, and Male-4050 groups in balanced data were 0.629-0.893 by bagging, 0.707-0.916 by random forests, and 0.695-0.854 by bagging, respectively. In each group, we found discriminatory features showing statistical differences among normal, overweight, and obese classes. The results showed that the classification models built by logistic regression in imbalanced data were better than those built by the other two algorithms, and significant features differed according to age and gender groups. 
Our results could support the development of BMI diagnosis tools for real-time monitoring; such tools are considered helpful in improving automated BMI status diagnosis in remote healthcare or telemedicine and are expected to have applications in forensic and medical science. Copyright © 2013 Elsevier B.V. All rights reserved.
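
    The AUC values quoted in this abstract can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below is only an illustration of that definition for a two-class problem; the scores and labels are made up, the function name `auc` is an assumption, and the study itself involved three BMI classes whose exact AUC procedure is not described here.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    labels: 1 for the positive class, 0 for the negative class.
    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for six subjects (not data from the study).
scores = [0.9, 0.8, 0.35, 0.6, 0.4, 0.2]
labels = [1, 1, 1, 0, 0, 0]
print(auc(scores, labels))  # 7 of the 9 positive/negative pairs are ordered correctly
```

    An AUC of 0.5 corresponds to random ordering and 1.0 to perfect separation, which is the scale on which the 0.569-0.731 and 0.893-0.994 ranges above should be compared.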

  15. A comparison of selected parametric and imputation methods for estimating snag density and snag quality attributes

    USGS Publications Warehouse

    Eskelson, Bianca N.I.; Hagar, Joan; Temesgen, Hailemariam

    2012-01-01

    Snags (standing dead trees) are an essential structural component of forests. Because wildlife use of snags depends on size and decay stage, snag density estimation without any information about snag quality attributes is of little value for wildlife management decision makers. Little work has been done to develop models that allow multivariate estimation of snag density by snag quality class. Using climate, topography, Landsat TM data, stand age and forest type collected for 2356 forested Forest Inventory and Analysis plots in western Washington and western Oregon, we evaluated two multivariate techniques for their abilities to estimate density of snags by three decay classes. The density of live trees and snags in three decay classes (D1: recently dead, little decay; D2: decay, without top, some branches and bark missing; D3: extensive decay, missing bark and most branches) with diameter at breast height (DBH) ≥ 12.7 cm was estimated using a nonparametric random forest nearest neighbor imputation technique (RF) and a parametric two-stage model (QPORD), for which the number of trees per hectare was estimated with a Quasipoisson model in the first stage and the probability of belonging to a tree status class (live, D1, D2, D3) was estimated with an ordinal regression model in the second stage. The presence of large snags with DBH ≥ 50 cm was predicted using a logistic regression and RF imputation. Because of the more homogenous conditions on private forest lands, snag density by decay class was predicted with higher accuracies on private forest lands than on public lands, while presence of large snags was more accurately predicted on public lands, owing to the higher prevalence of large snags on public lands. RF outperformed the QPORD model in terms of percent accurate predictions, while QPORD provided smaller root mean square errors in predicting snag density by decay class. 
The logistic regression model achieved more accurate presence/absence classification of large snags than the RF imputation approach. Adjusting the decision threshold to account for unequal size for presence and absence classes is more straightforward for the logistic regression than for the RF imputation approach. Overall, model accuracies were poor in this study, which can be attributed to the poor predictive quality of the explanatory variables and the large range of forest types and geographic conditions observed in the data.

  16. Large Unbalanced Credit Scoring Using Lasso-Logistic Regression Ensemble

    PubMed Central

    Wang, Hong; Xu, Qingsong; Zhou, Lifeng

    2015-01-01

    Recently, various ensemble learning methods with different base classifiers have been proposed for credit scoring problems. However, for various reasons, there has been little research using logistic regression as the base classifier. In this paper, given large unbalanced data, we consider the plausibility of ensemble learning using regularized logistic regression as the base classifier to deal with credit scoring problems. In this research, the data is first balanced and diversified by clustering and bagging algorithms. Then we apply a Lasso-logistic regression learning ensemble to evaluate the credit risks. We show that the proposed algorithm outperforms popular credit scoring models such as decision tree, Lasso-logistic regression and random forests in terms of AUC and F-measure. We also provide two importance measures for the proposed model to identify important variables in the data. PMID:25706988

  17. Sequential Monte Carlo tracking of the marginal artery by multiple cue fusion and random forest regression.

    PubMed

    Cherry, Kevin M; Peplinski, Brandon; Kim, Lauren; Wang, Shijun; Lu, Le; Zhang, Weidong; Liu, Jianfei; Wei, Zhuoshi; Summers, Ronald M

    2015-01-01

    Given the potential importance of marginal artery localization in automated registration in computed tomography colonography (CTC), we have devised a semi-automated method of marginal vessel detection employing sequential Monte Carlo tracking (also known as particle filtering tracking) by multiple cue fusion based on intensity, vesselness, organ detection, and minimum spanning tree information for poorly enhanced vessel segments. We then employed a random forest algorithm for intelligent cue fusion and decision making which achieved high sensitivity and robustness. After applying a vessel pruning procedure to the tracking results, we achieved statistically significantly improved precision compared to a baseline Hessian detection method (2.7% versus 75.2%, p<0.001). This method also showed statistically significantly improved recall rate compared to a 2-cue baseline method using fewer vessel cues (30.7% versus 67.7%, p<0.001). These results demonstrate that marginal artery localization on CTC is feasible by combining a discriminative classifier (i.e., random forest) with a sequential Monte Carlo tracking mechanism. In so doing, we present the effective application of an anatomical probability map to vessel pruning as well as a supplementary spatial coordinate system for colonic segmentation and registration when this task has been confounded by colon lumen collapse. Published by Elsevier B.V.

  18. Modeling contemporary climate profiles of whitebark pine (Pinus albicaulis) and predicting responses to global warming

    Treesearch

    Marcus V. Warwell; Gerald E. Rehfeldt; Nicholas L. Crookston

    2006-01-01

    The Random Forests multiple regression tree was used to develop an empirically-based bioclimate model for the distribution of Pinus albicaulis (whitebark pine) in western North America, latitudes 31° to 51° N and longitudes 102° to 125° W. Independent variables included 35 simple expressions of temperature and precipitation and their interactions....

  19. Use of Hundreds of Electrocardiographic Biomarkers for Prediction of Mortality in Post-Menopausal Women: The Women’s Health Initiative

    PubMed Central

    Gorodeski, Eiran Z.; Ishwaran, Hemant; Kogalur, Udaya B.; Blackstone, Eugene H.; Hsich, Eileen; Zhang, Zhu-ming; Vitolins, Mara Z.; Manson, JoAnn E.; Curb, J. David; Martin, Lisa W.; Prineas, Ronald J.; Lauer, Michael S.

    2013-01-01

    Background: Simultaneous contribution of hundreds of electrocardiographic biomarkers to prediction of long-term mortality in post-menopausal women with clinically normal resting electrocardiograms (ECGs) is unknown. Methods and Results: We analyzed ECGs and all-cause mortality in 33,144 women enrolled in Women’s Health Initiative trials, who were without baseline cardiovascular disease or cancer, and had normal ECGs by Minnesota and Novacode criteria. Four hundred and seventy-seven ECG biomarkers, encompassing global and individual ECG findings, were measured using computer algorithms. During a median follow-up of 8.1 years (range for survivors 0.5–11.2 years), 1,229 women died. For analyses, the cohort was randomly split into derivation (n=22,096, deaths=819) and validation (n=11,048, deaths=410) subsets. ECG biomarkers, demographic, and clinical characteristics were simultaneously analyzed using both traditional Cox regression and Random Survival Forest (RSF), a novel algorithmic machine-learning approach. Regression modeling failed to converge. RSF variable selection yielded 20 variables that were independently predictive of long-term mortality, 14 of which were ECG biomarkers related to autonomic tone, atrial conduction, and ventricular depolarization and repolarization. Conclusions: We identified 14 ECG biomarkers from amongst hundreds that were associated with long-term prognosis using a novel random forest variable selection methodology. These were related to autonomic tone, atrial conduction, ventricular depolarization, and ventricular repolarization. Quantitative ECG biomarkers have prognostic importance, and may be markers of subclinical disease in apparently healthy post-menopausal women. PMID:21862719

  20. Random Survival Forest in practice: a method for modelling complex metabolomics data in time to event analysis.

    PubMed

    Dietrich, Stefan; Floegel, Anna; Troll, Martina; Kühn, Tilman; Rathmann, Wolfgang; Peters, Anette; Sookthai, Disorn; von Bergen, Martin; Kaaks, Rudolf; Adamski, Jerzy; Prehn, Cornelia; Boeing, Heiner; Schulze, Matthias B; Illig, Thomas; Pischon, Tobias; Knüppel, Sven; Wang-Sattler, Rui; Drogan, Dagmar

    2016-10-01

    The application of metabolomics in prospective cohort studies is statistically challenging. Given the importance of appropriate statistical methods for selection of disease-associated metabolites in highly correlated complex data, we combined random survival forest (RSF) with an automated backward elimination procedure that addresses such issues. Our RSF approach was illustrated with data from the European Prospective Investigation into Cancer and Nutrition (EPIC)-Potsdam study, with concentrations of 127 serum metabolites as exposure variables and time to development of type 2 diabetes mellitus (T2D) as outcome variable. Out of this data set, Cox regression with a stepwise selection method was recently published. Replication of methodical comparison (RSF and Cox regression) was conducted in two independent cohorts. Finally, the R-code for implementing the metabolite selection procedure into the RSF-syntax is provided. The application of the RSF approach in EPIC-Potsdam resulted in the identification of 16 incident T2D-associated metabolites which slightly improved prediction of T2D when used in addition to traditional T2D risk factors and also when used together with classical biomarkers. The identified metabolites partly agreed with previous findings using Cox regression, though RSF selected a higher number of highly correlated metabolites. The RSF method appeared to be a promising approach for identification of disease-associated variables in complex data with time to event as outcome. The demonstrated RSF approach provides comparable findings as the generally used Cox regression, but also addresses the problem of multicollinearity and is suitable for high-dimensional data. © The Author 2016; all rights reserved. Published by Oxford University Press on behalf of the International Epidemiological Association.

  1. Evaluating Disease Severity in Chronic Pain Patients with and without Fibromyalgia: A Comparison of the Symptom Impact Questionnaire and the Polysymptomatic Distress Scale.

    PubMed

    Friend, Ronald; Bennett, Robert M

    2015-12-01

    To compare the relative effectiveness of the Polysymptomatic Distress Scale (PSD) with the Symptom Impact Questionnaire (SIQR), the disease-neutral revision of the updated Fibromyalgia Impact Questionnaire (FIQR), in their ability to assess disease activity in patients with rheumatic disorders both with and without fibromyalgia (FM). The study included 321 patients from 8 clinical practices with some 16 different chronic pain disorders. Disease severity was assessed by the Medical Outcomes Study Short Form-36 (SF-36). Univariate analyses were used to assess the magnitude of PSD and SIQR correlations with SF-36 subscales. Hierarchical stepwise regression was used to evaluate the unique contribution of the PSD and SIQR to the SF-36. Random forest regression probed the relative importance of the SIQR and PSD components as predictors of SF-36. The correlations with the SF-36 subscales were significantly higher for the SIQR (0.48 to 0.78) than the PSD (0.29 to 0.56; p < 0.001). Stepwise regression revealed that the SIQR was contributing additional unique variance on SF-36 subscales, which was not the case for the PSD. Random forest regression showed SIQR Function, Symptoms, and Global Impact subscales were more important predictors of SF-36 than the PSD. The single SIQR pain item contributed 55% of SF-36 pain variance compared to 23% with the 19-point WPI (the Widespread Pain Index component of PSD). The SIQR, the disease-neutral revision of the updated FIQ, has several important advantages over the PSD in the evaluation of disease severity in chronic pain disorders.

  2. Predicting recreational water quality advisories: A comparison of statistical methods

    USGS Publications Warehouse

    Brooks, Wesley R.; Corsi, Steven R.; Fienen, Michael N.; Carvin, Rebecca B.

    2016-01-01

    Epidemiological studies indicate that fecal indicator bacteria (FIB) in beach water are associated with illnesses among people having contact with the water. In order to mitigate public health impacts, many beaches are posted with an advisory when the concentration of FIB exceeds a beach action value. The most commonly used method of measuring FIB concentration takes 18–24 h before returning a result. In order to avoid the 24 h lag, it has become common to "nowcast" the FIB concentration using statistical regressions on environmental surrogate variables. Most commonly, nowcast models are estimated using ordinary least squares regression, but other regression methods from the statistical and machine learning literature are sometimes used. This study compares 14 regression methods across 7 Wisconsin beaches to identify which consistently produces the most accurate predictions. A random forest model is identified as the most accurate, followed by multiple regression fit using the adaptive LASSO.

  3. Combinations of Stressors in Midlife: Examining Role and Domain Stressors Using Regression Trees and Random Forests

    PubMed Central

    2013-01-01

    Objectives. Global perceptions of stress (GPS) have major implications for mental and physical health, and stress in midlife may influence adaptation in later life. Thus, it is important to determine the unique and interactive effects of diverse influences of role stress (at work or in personal relationships), loneliness, life events, time pressure, caregiving, finances, discrimination, and neighborhood circumstances on these GPS. Method. Exploratory regression trees and random forests were used to examine complex interactions among myriad events and chronic stressors in middle-aged participants’ (N = 410; mean age = 52.12) GPS. Results. Different role and domain stressors were influential at high and low levels of loneliness. Varied combinations of these stressors resulting in similar levels of perceived stress are also outlined as examples of equifinality. Loneliness emerged as an important predictor across trees. Discussion. Exploring multiple stressors simultaneously provides insights into the diversity of stressor combinations across individuals—even those with similar levels of global perceived stress—and answers theoretical mandates to better understand the influence of stress by sampling from many domain and role stressors. Further, the unique influences of each predictor relative to the others inform theory and applied work. Finally, examples of equifinality and multifinality call for targeted interventions. PMID:23341437

  4. Modeling groundwater nitrate concentrations in private wells in Iowa

    USGS Publications Warehouse

    Wheeler, David C.; Nolan, Bernard T.; Flory, Abigail R.; DellaValle, Curt T.; Ward, Mary H.

    2015-01-01

    Contamination of drinking water by nitrate is a growing problem in many agricultural areas of the country. Ingested nitrate can lead to the endogenous formation of N-nitroso compounds, potent carcinogens. We developed a predictive model for nitrate concentrations in private wells in Iowa. Using 34,084 measurements of nitrate in private wells, we trained and tested random forest models to predict log nitrate levels by systematically assessing the predictive performance of 179 variables in 36 thematic groups (well depth, distance to sinkholes, location, land use, soil characteristics, nitrogen inputs, meteorology, and other factors). The final model contained 66 variables in 17 groups. Some of the most important variables were well depth, slope length within 1 km of the well, year of sample, and distance to nearest animal feeding operation. The correlation between observed and estimated nitrate concentrations was excellent in the training set (r-square = 0.77) and was acceptable in the testing set (r-square = 0.38). The random forest model had substantially better predictive performance than a traditional linear regression model or a regression tree. Our model will be used to investigate the association between nitrate levels in drinking water and cancer risk in the Iowa participants of the Agricultural Health Study cohort.
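
    To make the comparison between a single regression tree and a forest more concrete, the following sketch fits an ensemble of one-split regression trees (stumps) on bootstrap resamples and averages their predictions. It is a deliberately simplified stand-in, not the authors' model: a full random forest grows deep trees and samples a random subset of predictors at each split, whereas this example uses a single predictor. The variable names (`depth`, `log_nitrate`) and the synthetic data are assumptions chosen only to echo the abstract.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree on a single predictor.

    Returns (threshold, left_mean, right_mean) minimising squared error.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:  # drop the max so the right side is never empty
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((y - left_mean) ** 2 for y in left) + sum(
            (y - right_mean) ** 2 for y in right
        )
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all x values identical: predict the mean everywhere
        mean = sum(ys) / len(ys)
        return (float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_forest(xs, ys, n_trees=50, seed=0):
    """Bagging: every stump is fitted to a bootstrap resample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    """Average the predictions of all stumps in the ensemble."""
    return sum(predict_stump(s, x) for s in forest) / len(forest)


# Synthetic illustration: shallow wells have higher log nitrate (assumed pattern).
rng = random.Random(1)
depth = [rng.uniform(0, 100) for _ in range(200)]
log_nitrate = [(2.0 if d < 30 else 0.5) + rng.gauss(0, 0.3) for d in depth]

forest = fit_forest(depth, log_nitrate)
print(predict_forest(forest, 10.0), predict_forest(forest, 80.0))
```

    Averaging over resamples smooths the hard jump of any single tree, which is one commonly cited reason an ensemble can generalise better than one regression tree.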

  5. Predicting Ascospore Release of Monilinia vaccinii-corymbosi of Blueberry with Machine Learning.

    PubMed

    Harteveld, Dalphy O C; Grant, Michael R; Pscheidt, Jay W; Peever, Tobin L

    2017-11-01

    Mummy berry, caused by Monilinia vaccinii-corymbosi, causes economic losses of highbush blueberry in the U.S. Pacific Northwest (PNW). Apothecia develop from mummified berries overwintering on soil surfaces and produce ascospores that infect tissue emerging from floral and vegetative buds. Disease control currently relies on fungicides applied on a calendar basis rather than inoculum availability. To establish a prediction model for ascospore release, apothecial development was tracked in three fields, one in western Oregon and two in northwestern Washington, in 2015 and 2016. Air and soil temperature, precipitation, soil moisture, leaf wetness, relative humidity, and solar radiation were monitored using in-field weather stations and Washington State University's AgWeatherNet stations. Four modeling approaches were compared: logistic regression, multivariate adaptive regression splines, artificial neural networks, and random forest. A supervised learning approach was used, with the data split into training (70%) and testing (30%) sets. The importance of environmental factors was calculated for each model separately. Soil temperature, soil moisture, and solar radiation were identified as the most important factors influencing ascospore release. Random forest models, with 78% accuracy, showed the best performance compared with the other models. Results of this research help PNW blueberry growers to optimize fungicide use and reduce production costs.
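
    The 70%/30% division into training and testing data, and the accuracy figure reported on held-out data, can be sketched as follows. This is a generic illustration under assumed names (`train_test_split`, `accuracy`) and made-up labels; the abstract does not say how the authors shuffled or stratified their records.

```python
import random


def train_test_split(records, train_fraction=0.7, seed=0):
    """Shuffle a copy of the records and cut it into training and testing parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the observed labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical record identifiers; any list of observations would work.
train, test = train_test_split(range(100))
print(len(train), len(test))

# Hypothetical release / no-release labels for four held-out days.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```

    Accuracy is computed only on the testing part, so it estimates how a fitted model would perform on days it has not seen.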

  6. Modeling groundwater nitrate concentrations in private wells in Iowa.

    PubMed

    Wheeler, David C; Nolan, Bernard T; Flory, Abigail R; DellaValle, Curt T; Ward, Mary H

    2015-12-01

    Contamination of drinking water by nitrate is a growing problem in many agricultural areas of the country. Ingested nitrate can lead to the endogenous formation of N-nitroso compounds, potent carcinogens. We developed a predictive model for nitrate concentrations in private wells in Iowa. Using 34,084 measurements of nitrate in private wells, we trained and tested random forest models to predict log nitrate levels by systematically assessing the predictive performance of 179 variables in 36 thematic groups (well depth, distance to sinkholes, location, land use, soil characteristics, nitrogen inputs, meteorology, and other factors). The final model contained 66 variables in 17 groups. Some of the most important variables were well depth, slope length within 1 km of the well, year of sample, and distance to nearest animal feeding operation. The correlation between observed and estimated nitrate concentrations was excellent in the training set (r-square=0.77) and was acceptable in the testing set (r-square=0.38). The random forest model had substantially better predictive performance than a traditional linear regression model or a regression tree. Our model will be used to investigate the association between nitrate levels in drinking water and cancer risk in the Iowa participants of the Agricultural Health Study cohort. Copyright © 2015 Elsevier B.V. All rights reserved.

  7. Modeling and Prediction of Solvent Effect on Human Skin Permeability using Support Vector Regression and Random Forest.

    PubMed

    Baba, Hiromi; Takahara, Jun-ichi; Yamashita, Fumiyoshi; Hashida, Mitsuru

    2015-11-01

    The solvent effect on skin permeability is important for assessing the effectiveness and toxicological risk of new dermatological formulations in pharmaceuticals and cosmetics development. The solvent effect occurs by diverse mechanisms, which could be elucidated by efficient and reliable prediction models. However, such prediction models have been hampered by the small variety of permeants and mixture components archived in databases and by low predictive performance. Here, we propose a solution to both problems. We first compiled a novel large database of 412 samples from 261 structurally diverse permeants and 31 solvents reported in the literature. The data were carefully screened to ensure their collection under consistent experimental conditions. To construct a high-performance predictive model, we then applied support vector regression (SVR) and random forest (RF) with greedy stepwise descriptor selection to our database. The models were internally and externally validated. The SVR achieved higher performance statistics than RF. The (externally validated) determination coefficient, root mean square error, and mean absolute error of SVR were 0.899, 0.351, and 0.268, respectively. Moreover, because all descriptors are fully computational, our method can predict as-yet unsynthesized compounds. Our high-performance prediction model offers an attractive alternative to permeability experiments for pharmaceutical and cosmetic candidate screening and optimizing skin-permeable topical formulations.
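
    The three statistics reported for the SVR model (determination coefficient, root mean square error, and mean absolute error) follow standard definitions, sketched below. The function name `regression_metrics` and the four data points are assumptions for illustration; they are not values from the study.

```python
import math


def regression_metrics(observed, predicted):
    """Return the determination coefficient, RMSE and MAE for paired values."""
    if len(observed) != len(predicted):
        raise ValueError("inputs must have the same length")
    n = len(observed)
    mean_obs = sum(observed) / n
    residuals = [o - p for o, p in zip(observed, predicted)]
    ss_res = sum(r * r for r in residuals)            # residual sum of squares
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)  # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
    }


# Made-up observed and predicted values.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(metrics)
```

    RMSE penalises large errors more heavily than MAE, so reporting both, as the abstract does, gives some indication of whether a few large misses dominate the error.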

  8. Random forest regression for magnetic resonance image synthesis.

    PubMed

    Jog, Amod; Carass, Aaron; Roy, Snehashis; Pham, Dzung L; Prince, Jerry L

    2017-01-01

    By choosing different pulse sequences and their parameters, magnetic resonance imaging (MRI) can generate a large variety of tissue contrasts. This very flexibility, however, can yield inconsistencies with MRI acquisitions across datasets or scanning sessions that can in turn cause inconsistent automated image analysis. Although image synthesis of MR images has been shown to be helpful in addressing this problem, an inability to synthesize both T2-weighted brain images that include the skull and FLuid Attenuated Inversion Recovery (FLAIR) images has been reported. The method described herein, called REPLICA, addresses these limitations. REPLICA is a supervised random forest image synthesis approach that learns a nonlinear regression to predict intensities of alternate tissue contrasts given specific input tissue contrasts. Experimental results include direct image comparisons between synthetic and real images, results from image analysis tasks on both synthetic and real images, and comparison against other state-of-the-art image synthesis methods. REPLICA is computationally fast, and is shown to be comparable to other methods on tasks they are able to perform. Additionally, REPLICA has the capability to synthesize both T2-weighted images of the full head and FLAIR images, and perform intensity standardization between different imaging datasets. Copyright © 2016 Elsevier B.V. All rights reserved.

  9. The use of space and high altitude aerial photography to classify forest land and to detect forest disturbances

    NASA Technical Reports Server (NTRS)

    Aldrich, R. C.; Greentree, W. J.; Heller, R. C.; Norick, N. X.

    1970-01-01

    In October 1969, an investigation was begun near Atlanta, Georgia, to explore the possibilities of developing predictors for forest land and stand condition classifications using space photography. It has been found that forest area can be predicted with reasonable accuracy on space photographs using ocular techniques. Infrared color film is the best single multiband sensor for this purpose. Using the Apollo 9 infrared color photographs taken in March 1969 photointerpreters were able to predict forest area for small units consistently within 5 to 10 percent of ground truth. Approximately 5,000 density data points were recorded for 14 scan lines selected at random from five study blocks. The mean densities and standard deviations were computed for 13 separate land use classes. The results indicate that forest area cannot be separated from other land uses with a high degree of accuracy using optical film density alone. If, however, densities derived by introducing red, green, and blue cutoff filters in the optical system of the microdensitometer are combined with their differences and their ratios in regression analysis techniques, there is a good possibility of discriminating forest from all other classes.

  10. Developing reservoir monthly inflow forecasts using artificial intelligence and climate phenomenon information

    NASA Astrophysics Data System (ADS)

    Yang, Tiantian; Asanjan, Ata Akbari; Welles, Edwin; Gao, Xiaogang; Sorooshian, Soroosh; Liu, Xiaomang

    2017-04-01

    Reservoirs are fundamental human-built infrastructures that collect, store, and deliver fresh surface water in a timely manner for many purposes. Efficient reservoir operation requires policy makers and operators to understand how reservoir inflows are changing under different hydrological and climatic conditions to enable forecast-informed operations. Over the last decade, the uses of Artificial Intelligence and Data Mining (AI & DM) techniques in assisting reservoir streamflow subseasonal to seasonal forecasts have been increasing. In this study, Random Forest (RF), Artificial Neural Network (ANN), and Support Vector Regression (SVR) are employed and compared with respect to their capabilities for predicting 1 month-ahead reservoir inflows for two headwater reservoirs in USA and China. Both current and lagged hydrological information and 17 known climate phenomenon indices, e.g., PDO and ENSO, are selected as predictors for simulating reservoir inflows. Results show (1) all three methods are capable of providing monthly reservoir inflows with satisfactory statistics; (2) the results obtained by Random Forest have the best statistical performances compared with the other two methods; (3) another advantage of Random Forest algorithm is its capability of interpreting raw model inputs; (4) climate phenomenon indices are useful in assisting monthly or seasonal forecasts of reservoir inflow; and (5) different climate conditions are autocorrelated with up to several months, and the climatic information and their lags are cross correlated with local hydrological conditions in our case studies.
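
    Using current and lagged information to predict inflow one month ahead amounts to rearranging the monthly series into rows of lagged predictors paired with the next month's inflow. The sketch below shows one way to build such a table; the function name `make_lagged_dataset`, the choice of two lags, and the numbers are assumptions, and the study's actual predictor set (17 climate indices plus hydrological variables) is much larger.

```python
def make_lagged_dataset(inflow, climate_index, n_lags=3):
    """Pair lagged predictors with the inflow one month ahead.

    The row for month t holds inflow[t], inflow[t-1], ... followed by
    climate_index[t], climate_index[t-1], ...; the target is inflow[t + 1].
    """
    if len(inflow) != len(climate_index):
        raise ValueError("series must have the same length")
    rows, targets = [], []
    for t in range(n_lags - 1, len(inflow) - 1):
        features = [inflow[t - k] for k in range(n_lags)]
        features += [climate_index[t - k] for k in range(n_lags)]
        rows.append(features)
        targets.append(inflow[t + 1])
    return rows, targets


# Made-up monthly inflow and a single made-up climate index.
inflow = [10, 12, 15, 11, 9, 14]
index = [0.1, 0.3, -0.2, 0.0, 0.5, 0.4]
rows, targets = make_lagged_dataset(inflow, index, n_lags=2)
print(rows[0], targets[0])
```

    Each row can then be passed to any regression method (random forest, neural network, or support vector regression) with the corresponding target as the value to predict.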

  11. Exposure assessment models for elemental components of particulate matter in an urban environment: A comparison of regression and random forest approaches

    NASA Astrophysics Data System (ADS)

    Brokamp, Cole; Jandarov, Roman; Rao, M. B.; LeMasters, Grace; Ryan, Patrick

    2017-02-01

    Exposure assessment for elemental components of particulate matter (PM) using land use modeling is a complex problem due to the high spatial and temporal variations in pollutant concentrations at the local scale. Land use regression (LUR) models may fail to capture complex interactions and non-linear relationships between pollutant concentrations and land use variables. The increasing availability of big spatial data and machine learning methods presents an opportunity for improvement in PM exposure assessment models. In this manuscript, our objective was to develop a novel land use random forest (LURF) model and compare its accuracy and precision to a LUR model for elemental components of PM in the urban city of Cincinnati, Ohio. PM smaller than 2.5 μm (PM2.5) and eleven elemental components were measured at 24 sampling stations from the Cincinnati Childhood Allergy and Air Pollution Study (CCAAPS). Over 50 different predictors associated with transportation, physical features, community socioeconomic characteristics, greenspace, land cover, and emission point sources were used to construct LUR and LURF models. Cross validation was used to quantify and compare model performance. LURF and LUR models were created for aluminum (Al), copper (Cu), iron (Fe), potassium (K), manganese (Mn), nickel (Ni), lead (Pb), sulfur (S), silicon (Si), vanadium (V), zinc (Zn), and total PM2.5 in the CCAAPS study area. LURF utilized a more diverse and greater number of predictors than LUR, and LURF models for Al, K, Mn, Pb, Si, Zn, TRAP, and PM2.5 all showed a decrease in fractional predictive error of at least 5% compared to their LUR models. LURF models for Al, Cu, Fe, K, Mn, Pb, Si, Zn, TRAP, and PM2.5 all had a cross validated fractional predictive error less than 30%. Furthermore, LUR models showed a differential exposure assessment bias and had a higher prediction error variance. Random forest and other machine learning methods may provide more accurate exposure assessment.

  12. Exposure assessment models for elemental components of particulate matter in an urban environment: A comparison of regression and random forest approaches.

    PubMed

    Brokamp, Cole; Jandarov, Roman; Rao, M B; LeMasters, Grace; Ryan, Patrick

    2017-02-01

    Exposure assessment for elemental components of particulate matter (PM) using land use modeling is a complex problem due to the high spatial and temporal variations in pollutant concentrations at the local scale. Land use regression (LUR) models may fail to capture complex interactions and non-linear relationships between pollutant concentrations and land use variables. The increasing availability of big spatial data and machine learning methods presents an opportunity for improvement in PM exposure assessment models. In this manuscript, our objective was to develop a novel land use random forest (LURF) model and compare its accuracy and precision to a LUR model for elemental components of PM in the urban city of Cincinnati, Ohio. PM smaller than 2.5 μm (PM2.5) and eleven elemental components were measured at 24 sampling stations from the Cincinnati Childhood Allergy and Air Pollution Study (CCAAPS). Over 50 different predictors associated with transportation, physical features, community socioeconomic characteristics, greenspace, land cover, and emission point sources were used to construct LUR and LURF models. Cross validation was used to quantify and compare model performance. LURF and LUR models were created for aluminum (Al), copper (Cu), iron (Fe), potassium (K), manganese (Mn), nickel (Ni), lead (Pb), sulfur (S), silicon (Si), vanadium (V), zinc (Zn), and total PM2.5 in the CCAAPS study area. LURF utilized a more diverse and greater number of predictors than LUR, and LURF models for Al, K, Mn, Pb, Si, Zn, TRAP, and PM2.5 all showed a decrease in fractional predictive error of at least 5% compared to their LUR models. LURF models for Al, Cu, Fe, K, Mn, Pb, Si, Zn, TRAP, and PM2.5 all had a cross validated fractional predictive error less than 30%. Furthermore, LUR models showed a differential exposure assessment bias and had a higher prediction error variance. Random forest and other machine learning methods may provide more accurate exposure assessment.

  13. Exposure assessment models for elemental components of particulate matter in an urban environment: A comparison of regression and random forest approaches

    PubMed Central

    Brokamp, Cole; Jandarov, Roman; Rao, M.B.; LeMasters, Grace; Ryan, Patrick

    2017-01-01

    Exposure assessment for elemental components of particulate matter (PM) using land use modeling is a complex problem due to the high spatial and temporal variations in pollutant concentrations at the local scale. Land use regression (LUR) models may fail to capture complex interactions and non-linear relationships between pollutant concentrations and land use variables. The increasing availability of big spatial data and machine learning methods presents an opportunity for improvement in PM exposure assessment models. In this manuscript, our objective was to develop a novel land use random forest (LURF) model and compare its accuracy and precision to a LUR model for elemental components of PM in the urban city of Cincinnati, Ohio. PM smaller than 2.5 μm (PM2.5) and eleven elemental components were measured at 24 sampling stations from the Cincinnati Childhood Allergy and Air Pollution Study (CCAAPS). Over 50 different predictors associated with transportation, physical features, community socioeconomic characteristics, greenspace, land cover, and emission point sources were used to construct LUR and LURF models. Cross validation was used to quantify and compare model performance. LURF and LUR models were created for aluminum (Al), copper (Cu), iron (Fe), potassium (K), manganese (Mn), nickel (Ni), lead (Pb), sulfur (S), silicon (Si), vanadium (V), zinc (Zn), and total PM2.5 in the CCAAPS study area. LURF utilized a more diverse and greater number of predictors than LUR, and LURF models for Al, K, Mn, Pb, Si, Zn, TRAP, and PM2.5 all showed a decrease in fractional predictive error of at least 5% compared to their LUR models. LURF models for Al, Cu, Fe, K, Mn, Pb, Si, Zn, TRAP, and PM2.5 all had a cross validated fractional predictive error less than 30%. Furthermore, LUR models showed a differential exposure assessment bias and had a higher prediction error variance. Random forest and other machine learning methods may provide more accurate exposure assessment. PMID:28959135

  14. A machine learning method to estimate PM2.5 concentrations across China with remote sensing, meteorological and land use information.

    PubMed

    Chen, Gongbo; Li, Shanshan; Knibbs, Luke D; Hamm, N A S; Cao, Wei; Li, Tiantian; Guo, Jianping; Ren, Hongyan; Abramson, Michael J; Guo, Yuming

    2018-09-15

    Machine learning algorithms have very high predictive ability. However, no study has used machine learning to estimate historical concentrations of PM2.5 (particulate matter with aerodynamic diameter ≤ 2.5 μm) at daily time scale in China at a national level. This study aimed to estimate daily concentrations of PM2.5 across China during 2005-2016. Daily ground-level PM2.5 data were obtained from 1479 stations across China during 2014-2016. Data on aerosol optical depth (AOD), meteorological conditions and other predictors were downloaded. A random forests model (a non-parametric machine learning algorithm) and two traditional regression models were developed to estimate ground-level PM2.5 concentrations. The best-fit model was then utilized to estimate the daily concentrations of PM2.5 across China with a resolution of 0.1° (≈10 km) during 2005-2016. The daily random forests model showed much higher predictive accuracy than the other two traditional regression models, explaining the majority of spatial variability in daily PM2.5 [10-fold cross-validation (CV) R² = 83%, root mean squared prediction error (RMSE) = 28.1 μg/m³]. At the monthly and annual time-scale, the explained variability of average PM2.5 increased up to 86% (RMSE = 10.7 μg/m³ and 6.9 μg/m³, respectively). Taking advantage of a novel application of the modeling framework and the most recent ground-level PM2.5 observations, the machine learning method showed higher predictive ability than previous studies. The random forests approach can be used to estimate historical exposure to PM2.5 in China with high accuracy. Copyright © 2018 Elsevier B.V. All rights reserved.
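
    The 10-fold cross-validated R² and RMSE quoted above can be computed along the following lines. This is a minimal sketch, not the authors' code: the helper names (`rmse`, `r_squared`, `k_fold_indices`, `cross_validated_predictions`) and the `fit` callable are hypothetical, the interleaved fold assignment is an assumption, and any learner (a random forest, for instance) could be passed in as `fit`.

```python
import math

def rmse(observed, predicted):
    """Root mean squared prediction error."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def r_squared(observed, predicted):
    """Share of the variance in the observations explained by the predictions."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists; each sample is held out exactly once."""
    for fold in range(k):
        test = list(range(fold, n_samples, k))  # interleaved folds (assumption)
        held_out = set(test)
        train = [i for i in range(n_samples) if i not in held_out]
        yield train, test

def cross_validated_predictions(xs, ys, fit, k=10):
    """Predict every sample with a model that never saw it during training.

    `fit(train_xs, train_ys)` is a hypothetical callable that returns a
    function mapping one x to one prediction.
    """
    predictions = [None] * len(ys)
    for train, test in k_fold_indices(len(ys), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in test:
            predictions[i] = model(xs[i])
    return predictions
```

    Scoring the pooled held-out predictions with `r_squared` and `rmse` gives a single cross-validated figure of the kind reported in the abstract.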

  15. Data-Driven Lead-Acid Battery Prognostics Using Random Survival Forests

    DTIC Science & Technology

    2014-10-02

    Kogalur, Blackstone, & Lauer, 2008; Ishwaran & Kogalur, 2010). Random survival forest is a survival analysis extension of Random Forests (Breiman, 2001...Statistics & probability letters, 80(13), 1056–1064. Ishwaran, H., Kogalur, U. B., Blackstone, E. H., & Lauer, M. S. (2008). Random survival forests. The

  16. A tale of two "forests": random forest machine learning aids tropical forest carbon mapping.

    PubMed

    Mascaro, Joseph; Asner, Gregory P; Knapp, David E; Kennedy-Bowdoin, Ty; Martin, Roberta E; Anderson, Christopher; Higgins, Mark; Chadwick, K Dana

    2014-01-01

    Accurate and spatially-explicit maps of tropical forest carbon stocks are needed to implement carbon offset mechanisms such as REDD+ (Reduced Deforestation and Degradation Plus). The Random Forest machine learning algorithm may aid carbon mapping applications using remotely-sensed data. However, Random Forest has never been compared to traditional and potentially more reliable techniques such as regionally stratified sampling and upscaling, and it has rarely been employed with spatial data. Here, we evaluated the performance of Random Forest in upscaling airborne LiDAR (Light Detection and Ranging)-based carbon estimates compared to the stratification approach over a 16-million hectare focal area of the Western Amazon. We considered two runs of Random Forest, with and without spatial contextual modeling, the spatial context being supplied by including x and y position directly in the model. In each case, we set aside 8 million hectares (i.e., half of the focal area) for validation; this rigorous test of Random Forest went above and beyond the internal validation normally compiled by the algorithm (the so-called "out-of-bag" validation), which proved insufficient for this spatial application. In this heterogeneous region of Northern Peru, the model with spatial context was the best performing run of Random Forest, and explained 59% of LiDAR-based carbon estimates within the validation area, compared to 37% for stratification or 43% by Random Forest without spatial context. With the 60% improvement in explained variation, RMSE against validation LiDAR samples improved from 33 to 26 Mg C ha⁻¹ when using Random Forest with spatial context. Our results suggest that spatial context should be considered when using Random Forest, and that doing so may result in substantially improved carbon stock modeling for purposes of climate change mitigation.
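
    The "60% improvement" quoted above is most naturally read as the relative gain in explained variation over stratification (37% to 59%); that reading is an assumption, since the abstract does not state the baseline. A short arithmetic sketch, with a hypothetical helper `relative_change`:

```python
def relative_change(before, after):
    """Change from `before` to `after`, expressed as a fraction of `before`."""
    return (after - before) / before

# Explained variation (percent), taken from the abstract.
gain_over_stratification = relative_change(37, 59)   # about 0.59, i.e. roughly 60%
gain_over_plain_forest = relative_change(43, 59)     # about 0.37

# RMSE (Mg C per hectare), taken from the abstract.
rmse_change = relative_change(33, 26)                # about -0.21, a 21% reduction

print(f"vs stratification:      {gain_over_stratification:+.1%}")
print(f"vs RF without context:  {gain_over_plain_forest:+.1%}")
print(f"RMSE change:            {rmse_change:+.1%}")
```

    Under this reading the gain over Random Forest without spatial context is smaller, about 37%, which is why the baseline matters when comparing the figures.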

  17. Correlation analysis between forest carbon stock and spectral vegetation indices in Xuan Lien Nature Reserve, Thanh Hoa, Viet Nam

    NASA Astrophysics Data System (ADS)

    Dung Nguyen, The; Kappas, Martin

    2017-04-01

    In the last several years, the interest in forest biomass and carbon stock estimation has increased due to its importance for forest management, modelling the carbon cycle, and other ecosystem services. However, no estimates of biomass and carbon stocks of different forest cover types exist for the Xuan Lien Nature Reserve, Thanh Hoa, Viet Nam. This study investigates the relationship between above ground carbon stock and different vegetation indices in order to identify the vegetation index that best correlates with forest carbon stock. The terrestrial inventory data come from 380 sample plots that were randomly sampled. Individual tree parameters such as DBH and tree height were collected to calculate the above ground volume, biomass and carbon for different forest types. The SPOT6 2013 satellite data was used in the study to obtain five vegetation indices: NDVI, RDVI, MSR, RVI, and EVI. The relationships between the forest carbon stock and vegetation indices were investigated using a multiple linear regression analysis. R-square, RMSE values and cross-validation were used to measure the strength and validate the performance of the models. The methodology presented here demonstrates the possibility of estimating forest volume, biomass and carbon stock. It can also be further improved by incorporating data from more spectral bands and/or elevation.

  18. Estimating aboveground forest biomass carbon and fire consumption in the U.S. Utah High Plateaus using data from the Forest Inventory and Analysis program, Landsat, and LANDFIRE

    USGS Publications Warehouse

    Chen, Xuexia; Liu, Shuguang; Zhu, Zhiliang; Vogelmann, James E.; Li, Zhengpeng; Ohlen, Donald O.

    2011-01-01

    The concentrations of CO2 and other greenhouse gases in the atmosphere have been increasing and greatly affecting global climate and socio-economic systems. Actively growing forests are generally considered to be a major carbon sink, but forest wildfires lead to large releases of biomass carbon into the atmosphere. Aboveground forest biomass carbon (AFBC), an important ecological indicator, and fire-induced carbon emissions at regional scales are highly relevant to forest sustainable management and climate change. It is challenging to accurately estimate the spatial distribution of AFBC across large areas because of the spatial heterogeneity of forest cover types and canopy structure. In this study, Forest Inventory and Analysis (FIA) data, Landsat, and Landscape Fire and Resource Management Planning Tools Project (LANDFIRE) data were integrated in a regression tree model for estimating AFBC at a 30-m resolution in the Utah High Plateaus. AFBC were calculated from 225 FIA field plots and used as the dependent variable in the model. Of these plots, 10% were held out for model evaluation with stratified random sampling, and the other 90% were used as training data to develop the regression tree model. Independent variable layers included Landsat imagery and the derived spectral indicators, digital elevation model (DEM) data and derivatives, biophysical gradient data, existing vegetation cover type and vegetation structure. The cross-validation correlation coefficient (r value) was 0.81 for the training model. Independent validation using withheld plot data was similar with r value of 0.82. This validated regression tree model was applied to map AFBC in the Utah High Plateaus and then combined with burn severity information to estimate loss of AFBC in the Longston fire of Zion National Park in 2001. The final dataset represented 24 forest cover types for a 4 million ha forested area. 
We estimated a total of 353 Tg AFBC with an average of 87 MgC/ha in the Utah High Plateaus. We also estimated that 8054 Mg AFBC were released from 2.24 km2 burned forest area in the Longston fire. These results demonstrate that an AFBC spatial map and estimated biomass carbon consumption can readily be generated using existing databases. The methodology provides a consistent, practical, and inexpensive way for estimating AFBC at 30-m resolution over large areas throughout the United States.

  19. Sentinel node status prediction by four statistical models: results from a large bi-institutional series (n = 1132).

    PubMed

    Mocellin, Simone; Thompson, John F; Pasquali, Sandro; Montesco, Maria C; Pilati, Pierluigi; Nitti, Donato; Saw, Robyn P; Scolyer, Richard A; Stretch, Jonathan R; Rossi, Carlo R

    2009-12-01

    To improve selection for sentinel node (SN) biopsy (SNB) in patients with cutaneous melanoma using statistical models predicting SN status. About 80% of patients currently undergoing SNB are node negative. In the absence of conclusive evidence of a SNB-associated survival benefit, these patients may be over-treated. Here, we tested the efficiency of 4 different models in predicting SN status. The clinicopathologic data (age, gender, tumor thickness, Clark level, regression, ulceration, histologic subtype, and mitotic index) of 1132 melanoma patients who had undergone SNB at institutions in Italy and Australia were analyzed. Logistic regression, classification tree, random forest, and support vector machine models were fitted to the data. The predictive models were built with the aim of maximizing the negative predictive value (NPV) and reducing the rate of SNB procedures while minimizing the error rate. After cross-validation, logistic regression, classification tree, random forest, and support vector machine predictive models obtained clinically relevant NPV (93.6%, 94.0%, 97.1%, and 93.0%, respectively), SNB reduction (27.5%, 29.8%, 18.2%, and 30.1%, respectively), and error rates (1.8%, 1.8%, 0.5%, and 2.1%, respectively). Using commonly available clinicopathologic variables, predictive models can preoperatively identify a proportion of patients (approximately 25%) who might be spared SNB, with an acceptable (1%-2%) error. If validated in large prospective series, these models might be implemented in the clinical setting for improved patient selection, which ultimately would lead to better quality of life for patients and optimization of resource allocation for the health care system.
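
    The three reported quantities appear to be linked by simple arithmetic. Assuming (the abstract does not define the terms) that "SNB reduction" is the share of all patients predicted node negative and that "error rate" is the share of all patients who are predicted negative but are in fact node positive, the error rate equals the SNB reduction times (1 - NPV). The sketch below checks that reading against the published figures; the function name is hypothetical.

```python
def missed_positive_rate(npv, spared_fraction):
    """Share of all patients spared SNB who are actually node positive.

    Assumes spared_fraction = (TN + FN) / N and NPV = TN / (TN + FN),
    so the product spared_fraction * (1 - NPV) equals FN / N.
    """
    return spared_fraction * (1.0 - npv)

# (NPV, SNB reduction, reported error rate), all taken from the abstract.
reported = {
    "logistic regression":    (0.936, 0.275, 0.018),
    "classification tree":    (0.940, 0.298, 0.018),
    "random forest":          (0.971, 0.182, 0.005),
    "support vector machine": (0.930, 0.301, 0.021),
}

for name, (npv, reduction, error) in reported.items():
    implied = missed_positive_rate(npv, reduction)
    print(f"{name}: implied {implied:.1%}, reported {error:.1%}")
```

    All four implied values round to the reported error rates, which supports the assumed definitions. It also shows why the random forest has the lowest error: it spares the fewest patients.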

  20. Alternative methods to evaluate trial level surrogacy.

    PubMed

    Abrahantes, Josè Cortiñas; Shkedy, Ziv; Molenberghs, Geert

    2008-01-01

    The evaluation and validation of surrogate endpoints have been extensively studied in the last decade. Prentice [1] and Freedman, Graubard and Schatzkin [2] laid the foundations for the evaluation of surrogate endpoints in randomized clinical trials. Later, Buyse et al. [5] proposed a meta-analytic methodology, producing different methods for different settings, which was further studied by Alonso and Molenberghs [9], in their unifying approach based on information theory. In this article, we focus our attention on the trial-level surrogacy and propose alternative procedures to evaluate such surrogacy measure, which do not pre-specify the type of association. A promising correction based on cross-validation is investigated, as well as the construction of confidence intervals for this measure. In order to avoid making assumptions about the type of relationship between the treatment effects and its distribution, a collection of alternative methods, based on regression trees, bagging, random forests, and support vector machines, combined with bootstrap-based confidence intervals and, should one wish, in conjunction with a cross-validation based correction, will be proposed and applied. We apply the various strategies to data from three clinical studies: in ophthalmology, in advanced colorectal cancer, and in schizophrenia. The results obtained for the three case studies are compared; they indicate that using random forest or bagging models produces larger estimated values for the surrogacy measure, which are in general more stable and have narrower confidence intervals than those from linear regression and support vector regression. For the advanced colorectal cancer studies, we even found that the trial-level surrogacy is considerably different from what has been reported. In general the alternative methods are more computationally demanding, and the calculation of the confidence intervals in particular requires more computational time than the delta-method counterpart. First, more flexible modeling techniques can be used, allowing for other types of association. Second, when no cross-validation-based correction is applied, overly optimistic trial-level surrogacy estimates will be found, thus cross-validation is highly recommended. Third, the use of the delta method to calculate confidence intervals is not recommended since it makes assumptions valid only in very large samples. It may also produce range-violating limits. We therefore recommend alternatives: bootstrap methods in general. Also, the information-theoretic approach produces comparable results with the bagging and random forest approaches, when cross-validation correction is applied. It is also important to observe that, even for the case in which the linear model might also be a good option, bagging methods perform well, and their confidence intervals were narrower.

  1. Distance error correction for time-of-flight cameras

    NASA Astrophysics Data System (ADS)

    Fuersattel, Peter; Schaller, Christian; Maier, Andreas; Riess, Christian

    2017-06-01

    The measurement accuracy of time-of-flight cameras is limited due to properties of the scene and systematic errors. These errors can accumulate to multiple centimeters, which may limit the applicability of these range sensors. In the past, different approaches have been proposed for improving the accuracy of these cameras. In this work, we propose a new method that improves two important aspects of the range calibration. First, we propose a new checkerboard which is augmented by a gray-level gradient. With this addition it becomes possible to capture the calibration features for intrinsic and distance calibration at the same time. The gradient strip makes it possible to acquire a large number of distance measurements for different surface reflectivities, which results in more meaningful training data. Second, we present multiple new features which are used as input to a random forest regressor. By using random regression forests, we circumvent the problem of finding an accurate model for the measurement error. During application, a correction value for each individual pixel is estimated with the trained forest based on a specifically tailored feature vector. With our approach the measurement error can be reduced by more than 40% for the Mesa SR4000 and by more than 30% for the Microsoft Kinect V2. In our evaluation we also investigate the impact of the individual forest parameters and illustrate the importance of the individual features.

  2. A Tale of Two “Forests”: Random Forest Machine Learning Aids Tropical Forest Carbon Mapping

    PubMed Central

    Mascaro, Joseph; Asner, Gregory P.; Knapp, David E.; Kennedy-Bowdoin, Ty; Martin, Roberta E.; Anderson, Christopher; Higgins, Mark; Chadwick, K. Dana

    2014-01-01

    Accurate and spatially-explicit maps of tropical forest carbon stocks are needed to implement carbon offset mechanisms such as REDD+ (Reduced Deforestation and Degradation Plus). The Random Forest machine learning algorithm may aid carbon mapping applications using remotely-sensed data. However, Random Forest has never been compared to traditional and potentially more reliable techniques such as regionally stratified sampling and upscaling, and it has rarely been employed with spatial data. Here, we evaluated the performance of Random Forest in upscaling airborne LiDAR (Light Detection and Ranging)-based carbon estimates compared to the stratification approach over a 16-million hectare focal area of the Western Amazon. We considered two runs of Random Forest, with and without spatial contextual modeling, the spatial context being supplied by including x and y position directly in the model. In each case, we set aside 8 million hectares (i.e., half of the focal area) for validation; this rigorous test of Random Forest went above and beyond the internal validation normally compiled by the algorithm (the so-called “out-of-bag” validation), which proved insufficient for this spatial application. In this heterogeneous region of Northern Peru, the model with spatial context was the best performing run of Random Forest, and explained 59% of LiDAR-based carbon estimates within the validation area, compared to 37% for stratification or 43% by Random Forest without spatial context. With the 60% improvement in explained variation, RMSE against validation LiDAR samples improved from 33 to 26 Mg C ha⁻¹ when using Random Forest with spatial context. Our results suggest that spatial context should be considered when using Random Forest, and that doing so may result in substantially improved carbon stock modeling for purposes of climate change mitigation. PMID:24489686

  3. Modeling Verdict Outcomes Using Social Network Measures: The Watergate and Caviar Network Cases

    PubMed Central

    2016-01-01

    Modelling criminal trial verdict outcomes using social network measures is an emerging research area in quantitative criminology. Few studies have yet analyzed which of these measures are the most important for verdict modelling or which data classification techniques perform best for this application. To compare the performance of different techniques in classifying members of a criminal network, this article applies three different machine learning classifiers (Logistic Regression, Naïve Bayes and Random Forest) with a range of social network measures and the necessary databases to model the verdicts in two real-world cases: the U.S. Watergate Conspiracy of the 1970s and the now-defunct Canada-based international drug trafficking ring known as the Caviar Network. In both cases it was found that the Random Forest classifier did better than either Logistic Regression or Naïve Bayes, and its superior performance was statistically significant. This being so, Random Forest was used not only for classification but also to assess the importance of the measures. For the Watergate case, the most important one proved to be betweenness centrality while for the Caviar Network, it was the effective size of the network. These results are significant because they show that an approach combining machine learning with social network analysis not only can generate accurate classification models but also helps quantify the importance of social network variables in modelling verdict outcomes. We conclude our analysis with a discussion and some suggestions for future work in verdict modelling using social network measures. PMID:26824351

  4. Validating automatic semantic annotation of anatomy in DICOM CT images

    NASA Astrophysics Data System (ADS)

    Pathak, Sayan D.; Criminisi, Antonio; Shotton, Jamie; White, Steve; Robertson, Duncan; Sparks, Bobbi; Munasinghe, Indeera; Siddiqui, Khan

    2011-03-01

    In the current health-care environment, the time available for physicians to browse patients' scans is shrinking due to the rapid increase in the sheer number of images. This is further aggravated by mounting pressure to become more productive in the face of decreasing reimbursement. Hence, there is an urgent need to deliver technology which enables faster and effortless navigation through sub-volume image visualizations. Annotating image regions with semantic labels such as those derived from the RADLEX ontology can vastly enhance image navigation and sub-volume visualization. This paper uses random regression forests for efficient, automatic detection and localization of anatomical structures within DICOM 3D CT scans. A regression forest is a collection of decision trees which are trained to achieve direct mapping from voxels to organ location and size in a single pass. This paper focuses on comparing automated labeling with expert-annotated ground-truth results on a database of 50 highly variable CT scans. Initial investigations show that regression forest derived localization errors are smaller and more robust than those achieved by state-of-the-art global registration approaches. The simplicity of the algorithm's context-rich visual features yields typical runtimes of less than 10 seconds for a 512³ voxel DICOM CT series on a single-threaded, single-core machine running multiple trees, with each tree taking less than a second. Furthermore, qualitative evaluation demonstrates that using the detected organs' locations as an index into the image volume improves the efficiency of the navigational workflow in all the CT studies.

  5. Assessing soil carbon vulnerability in the Western USA by geospatial modeling of pyrogenic and particulate carbon stocks

    NASA Astrophysics Data System (ADS)

    Ahmed, Zia U.; Woodbury, Peter B.; Sanderman, Jonathan; Hawke, Bruce; Jauss, Verena; Solomon, Dawit; Lehmann, Johannes

    2017-02-01

    To predict how land management practices and climate change will affect soil carbon cycling, improved understanding of factors controlling soil organic carbon fractions at large spatial scales is needed. We analyzed total soil organic carbon (SOC) as well as pyrogenic (PyC), particulate (POC), and other soil organic carbon (OOC) fractions in surface layers from 650 stratified-sampling locations throughout Colorado, Kansas, New Mexico, and Wyoming. PyC varied from 0.29 to 18.0 mg C g⁻¹ soil with a mean of 4.05 mg C g⁻¹ soil. The mean PyC was 34.6% of the SOC and ranged from 11.8 to 96.6%. Both POC and PyC were highest in forests and canyon bottoms. In the best random forest regression model, normalized difference vegetation index (NDVI), mean annual precipitation (MAP), mean annual temperature (MAT), and elevation were ranked as the top four important variables determining PyC and POC variability. Random forests regression kriging (RFK) with environmental covariables improved predictions over ordinary kriging by 20 and 7% for PyC and POC, respectively. Based on RFK, 8% of the study area was dominated (≥50% of SOC) by PyC and less than 1% was dominated by POC. Furthermore, based on spatial analysis of the ratio of POC to PyC, we estimated that about 16% of the study area is medium to highly vulnerable to SOC mineralization in surface soil. These are the first results to characterize PyC and POC stocks geospatially using a stratified sampling scheme at the scale of 1,000,000 km², and the methods are scalable to other regions.

  6. Improving Lidar-based Aboveground Biomass Estimation with Site Productivity for Central Hardwood Forests, USA

    NASA Astrophysics Data System (ADS)

    Shao, G.; Gallion, J.; Fei, S.

    2016-12-01

    Sound forest aboveground biomass estimation is required to monitor diverse forest ecosystems and their impacts on the changing climate. Lidar-based regression models have provided promising biomass estimations in most forest ecosystems. However, considerable uncertainties of biomass estimations have been reported in the temperate hardwood and hardwood-dominated mixed forests. Varied site productivities in temperate hardwood forests largely diversified height and diameter growth rates, which significantly reduced the correlation between tree height and diameter at breast height (DBH) in mature and complex forests. It is, therefore, difficult to utilize height-based lidar metrics to predict DBH-based field-measured biomass through a simple regression model that ignores the variation of site productivity. In this study, we established a multi-dimension nonlinear regression model incorporating lidar metrics and site productivity classes derived from soil features. In the regression model, lidar metrics provided horizontal and vertical structural information and productivity classes differentiated good and poor forest sites. The selection and combination of lidar metrics were discussed. Multiple regression models were employed and compared. Uncertainty analysis was applied to the best fit model. The effects of site productivity on the lidar-based biomass model were addressed.

  7. Does Sentinel multi sensor data offer synergy in Improving Accuracy of Aboveground Biomass Estimate of Dense Tropical Forest? - Utility of Decision Tree Based Machine Learning Algorithms

    NASA Astrophysics Data System (ADS)

    Ghosh, S. M.; Behera, M. D.

    2017-12-01

    Forest aboveground biomass (AGB) is an important factor in global policy decisions to tackle the impact of climate change. Several previous studies have concluded that remote sensing methods are more suitable for estimating forest biomass on a regional scale. Among all available remote sensing data and methods, Synthetic Aperture Radar (SAR) data in combination with decision tree based machine learning algorithms have shown better promise in estimating higher biomass values. Few studies have estimated biomass for dense Indian tropical forests with high biomass density. In this study aboveground biomass was estimated for two major tree species, Sal (Shorea robusta) and Teak (Tectona grandis), of Katerniaghat Wildlife Sanctuary, a tropical forest situated in northern India. Biomass was estimated by combining C-band SAR data from the Sentinel-1A satellite, vegetation indices produced using Sentinel-2A data and ground inventory plots. Along with SAR backscatter values, SAR texture images were also used as input, as earlier studies had found that image texture has a correlation with vegetation biomass. Decision tree based nonlinear machine learning algorithms were used in place of parametric regression models for establishing the relationship between field-measured values and remotely sensed parameters. A random forest model with a combination of vegetation indices and SAR backscatter as predictor variables gave the best result for Sal forest, with a coefficient of determination of 0.71 and an RMSE of 105.027 t/ha. For teak forest, the best result was obtained with the same predictor combination but with a stochastic gradient boosted model, with a coefficient of determination of 0.6 and an RMSE of 79.45 t/ha. These results are mostly better than the results of other studies done for similar kinds of forests. This study shows that Sentinel series satellite data has exceptional capabilities in estimating dense forest AGB and that machine learning algorithms are a better means to do so than parametric regression models.

  8. Modelling Biophysical Parameters of Maize Using Landsat 8 Time Series

    NASA Astrophysics Data System (ADS)

    Dahms, Thorsten; Seissiger, Sylvia; Conrad, Christopher; Borg, Erik

    2016-06-01

    Open and free access to multi-frequent high-resolution data (e.g. Sentinel-2) will fortify agricultural applications based on satellite data. The temporal and spatial resolution of these remote sensing datasets directly affects the applicability of remote sensing methods, for instance the robust retrieval of biophysical parameters over the entire growing season with very high geometric resolution. In this study we use machine learning methods to predict biophysical parameters, namely the fraction of absorbed photosynthetic radiation (FPAR), the leaf area index (LAI) and the chlorophyll content, from high resolution remote sensing. 30 Landsat 8 OLI scenes were available in our study region in Mecklenburg-Western Pomerania, Germany. In-situ data were collected weekly to bi-weekly on 18 maize plots throughout the summer season 2015. The study aims at an optimized prediction of biophysical parameters and the identification of the best explaining spectral bands and vegetation indices. For this purpose, we used the entire in-situ dataset from 24.03.2015 to 15.10.2015. Random forest and conditional inference forests were used because of their strong exploratory and predictive character. Variable importance measures allowed for analysing the relation between the biophysical parameters with respect to the spectral response, and the performance of the two approaches over the development of the plant stock. Classical random forest regression outperformed conditional inference forests, in particular when modelling the biophysical parameters over the entire growing period. For example, modelling biophysical parameters of maize for the entire vegetation period using random forests yielded: FPAR: R² = 0.85, RMSE = 0.11; LAI: R² = 0.64, RMSE = 0.9; and chlorophyll content (SPAD): R² = 0.80, RMSE = 4.9. Our results demonstrate the great potential in using machine-learning methods for the interpretation of long-term multi-frequent remote sensing datasets to model biophysical parameters.

  9. Factors influencing consumption of nutrient rich forest foods in rural Cameroon.

    PubMed

    Fungo, Robert; Muyonga, John H; Kabahenda, Margaret; Okia, Clement A; Snook, Laura

    2016-02-01

    Studies show that a number of forest foods consumed in Cameroon are highly nutritious and rich in health boosting bioactive compounds. This study assessed the knowledge and perceptions towards the nutritional and health promoting properties of forest foods among forest dependent communities. The relationship between knowledge, perceptions and socio-demographic attributes on consumption of forest foods was also determined. A total of 279 females in charge of decision making with respect to food preparation were randomly selected from 12 villages in southern and eastern Cameroon and interviewed using researcher administered questionnaires. Multivariate logistic regression analysis was used to identify the factors affecting consumption of forest foods. Baillonella toxisperma (98%) and Irvingia gabonensis (81%) were the nutrient rich forest foods best known to the respondents. About 31% of the respondents were aware of the nutritional value and health benefits of forest foods. About 10%-61% of the respondents expressed positive attitudes to questions related to health benefits of specific forest foods. Consumption of forest foods was found to be higher among polygamous families and was positively related to length of stay in the forest area and age of respondent. Education had an inverse relationship with use of forest foods. Knowledge and positive attitude towards the nutritional value of forest foods were also found to positively influence consumption of forest foods. Since knowledge was found to influence attitude and consumption, there is a need to invest in awareness campaigns to strengthen the current knowledge levels among the study population. This should positively influence the attitudes and perceptions towards increased consumption of forest foods. Copyright © 2015 Elsevier Ltd. All rights reserved.

  10. Red-shouldered hawk nesting habitat preference in south Texas

    USGS Publications Warehouse

    Strobel, Bradley N.; Boal, Clint W.

    2010-01-01

    We examined nesting habitat preference by red-shouldered hawks Buteo lineatus using conditional logistic regression on characteristics measured at 27 occupied nest sites and 68 unused sites in 2005–2009 in south Texas. We measured vegetation characteristics of individual trees (nest trees and unused trees) and corresponding 0.04-ha plots. We evaluated the importance of tree and plot characteristics to nesting habitat selection by comparing a priori tree-specific and plot-specific models using Akaike's information criterion. Models with only plot variables carried 14% more weight than models with only center tree variables. The model-averaged odds ratios indicated red-shouldered hawks selected to nest in taller trees and in areas with higher average diameter at breast height than randomly available within the forest stand. Relative to randomly selected areas, each 1-m increase in nest tree height and 1-cm increase in the plot average diameter at breast height increased the probability of selection by 85% and 10%, respectively. Our results indicate that red-shouldered hawks select nesting habitat based on vegetation characteristics of individual trees as well as the 0.04-ha area surrounding the tree. Our results indicate forest management practices resulting in tall forest stands with large average diameter at breast height would benefit red-shouldered hawks in south Texas.
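
    Because the abstract reports model-averaged odds ratios, the 85% and 10% figures are probably best read as odds ratios of 1.85 per metre of tree height and 1.10 per centimetre of plot-average diameter (an assumption; the abstract words them as changes in probability). The sketch below, with hypothetical helper names and an arbitrary 20% baseline, shows how such per-unit odds ratios compound and why the change in probability is generally smaller than the change in odds.

```python
def odds_multiplier(odds_ratio_per_unit, units):
    """Odds ratios from a logistic model compound multiplicatively per unit."""
    return odds_ratio_per_unit ** units

def probability_after(baseline_probability, multiplier):
    """Convert a baseline probability to odds, scale the odds, convert back."""
    odds = baseline_probability / (1.0 - baseline_probability)
    scaled = odds * multiplier
    return scaled / (1.0 + scaled)

# A tree 3 m taller than the reference, with odds ratio 1.85 per metre.
three_metres = odds_multiplier(1.85, 3)

# Illustrative baseline of 20% (not from the source), one extra metre.
shifted = probability_after(0.20, odds_multiplier(1.85, 1))

print(f"odds multiplier for +3 m: {three_metres:.2f}")
print(f"probability 0.20 becomes {shifted:.3f} after +1 m")
```

    With the illustrative 20% baseline, an 85% increase in odds raises the probability to about 0.32, a relative increase of roughly 58% rather than 85%.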

  11. Advances in SCA and RF-DNA Fingerprinting Through Enhanced Linear Regression Attacks and Application of Random Forest Classifiers

    DTIC Science & Technology

    2014-09-18

    Converter AES Advance Encryption Standard ANN Artificial Neural Network APS Application Support AUC Area Under the Curve CPA Correlation Power Analysis ...Importance WGN White Gaussian Noise WPAN Wireless Personal Area Networks XEnv Cross-Environment XRx Cross-Receiver xxi ADVANCES IN SCA AND RF-DNA...based tool called KillerBee was released in 2009 that increases the exposure of ZigBee and other IEEE 802.15.4-based Wireless Personal Area Networks

  12. Sample entropy analysis for the estimating depth of anaesthesia through human EEG signal at different levels of unconsciousness during surgeries.

    PubMed

    Liu, Quan; Ma, Li; Fan, Shou-Zen; Abbod, Maysam F; Shieh, Jiann-Shing

    2018-01-01

    Estimating the depth of anaesthesia (DoA) in operations has always been a challenging issue due to the underlying complexity of the brain mechanisms. Electroencephalogram (EEG) signals are undoubtedly the most widely used signals for measuring DoA. In this paper, a novel EEG-based index is proposed to evaluate DoA for 24 patients receiving general anaesthesia with different levels of unconsciousness. Sample Entropy (SampEn) algorithm was utilised in order to acquire the chaotic features of the signals. After calculating the SampEn from the EEG signals, Random Forest was utilised for developing learning regression models with Bispectral index (BIS) as the target. Correlation coefficient, mean absolute error, and area under the curve (AUC) were used to verify the perioperative performance of the proposed method. Validation comparisons with typical nonstationary signal analysis methods (i.e., recurrence analysis and permutation entropy) and regression methods (i.e., neural network and support vector machine) were conducted. To further verify the accuracy and validity of the proposed methodology, the data is divided into four unconsciousness-level groups on the basis of BIS levels. Subsequently, analysis of variance (ANOVA) was applied to the corresponding index (i.e., regression output). Results indicate that the correlation coefficient improved to 0.72 ± 0.09 after filtering and to 0.90 ± 0.05 after regression from the initial values of 0.51 ± 0.17. Similarly, the final mean absolute error dramatically declined to 5.22 ± 2.12. In addition, the ultimate AUC increased to 0.98 ± 0.02, and the ANOVA analysis indicates that each of the four groups of different anaesthetic levels demonstrated significant difference from the nearest levels. Furthermore, the Random Forest output was extensively linear in relation to BIS, thus with better DoA prediction accuracy. 
    In conclusion, the proposed method provides a concrete basis for monitoring patients' anaesthetic level during surgeries.
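
    For readers unfamiliar with Sample Entropy, a minimal sketch of the usual definition follows. It is not the authors' implementation: the embedding dimension m, the absolute tolerance r, the Chebyshev distance and the "<=" match rule are assumptions chosen for illustration, and no filtering or windowing of EEG data is attempted.

```python
import math


def sample_entropy(signal, m=2, r=0.2):
    """Return SampEn = -ln(A / B) for a 1-D sequence.

    B counts pairs of length-m templates within tolerance r (Chebyshev
    distance), A counts pairs of length-(m + 1) templates; self-matches
    are excluded. r is an absolute tolerance (assumption).
    """
    n = len(signal)

    def count_matches(length):
        # Use the same number of templates (n - m) for both lengths
        # so that the two counts are comparable.
        templates = [signal[i:i + length] for i in range(n - m)]
        count = 0
        for i in range(len(templates)):
            for j in range(i + 1, len(templates)):
                distance = max(abs(a - b) for a, b in zip(templates[i], templates[j]))
                if distance <= r:
                    count += 1
        return count

    b = count_matches(m)
    a = count_matches(m + 1)
    if a == 0 or b == 0:
        return float("inf")  # no matches: entropy is undefined/unbounded
    return -math.log(a / b)
```

    Perfectly regular input, such as a constant or strictly alternating sequence, yields a value of zero; less predictable signals give larger values.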

  13. Beyond the hype: deep neural networks outperform established methods using a ChEMBL bioactivity benchmark set.

    PubMed

    Lenselink, Eelke B; Ten Dijke, Niels; Bongers, Brandon; Papadatos, George; van Vlijmen, Herman W T; Kowalczyk, Wojtek; IJzerman, Adriaan P; van Westen, Gerard J P

    2017-08-14

    The increase of publicly available bioactivity data in recent years has fueled and catalyzed research in chemogenomics, data mining, and modeling approaches. As a direct result, over the past few years a multitude of different methods have been reported and evaluated, such as target fishing, nearest neighbor similarity-based methods, and Quantitative Structure Activity Relationship (QSAR)-based protocols. However, such studies are typically conducted on different datasets, using different validation strategies, and different metrics. In this study, different methods were compared using one single standardized dataset obtained from ChEMBL, which is made available to the public, using standardized metrics (BEDROC and Matthews Correlation Coefficient). Specifically, the performance of Naïve Bayes, Random Forests, Support Vector Machines, Logistic Regression, and Deep Neural Networks was assessed using QSAR and proteochemometric (PCM) methods. All methods were validated using both a random split validation and a temporal validation, with the latter being a more realistic benchmark of expected prospective execution. Deep Neural Networks are the top performing classifiers, highlighting the added value of Deep Neural Networks over other more conventional methods. Moreover, the best method ('DNN_PCM') performed significantly better at almost one standard deviation higher than the mean performance. Furthermore, Multi-task and PCM implementations were shown to improve performance over single task Deep Neural Networks. Conversely, target prediction performed almost two standard deviations under the mean performance. Random Forests, Support Vector Machines, and Logistic Regression performed around mean performance. Finally, using an ensemble of DNNs, alongside additional tuning, enhanced the relative performance by another 27% (compared with unoptimized 'DNN_PCM'). 
    Here, a standardized set to test and evaluate different machine learning algorithms in the context of multi-task learning is offered by providing the data and the protocols.
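
    One of the two metrics named above, the Matthews Correlation Coefficient, can be computed directly from a binary confusion matrix. The sketch below is illustrative only; the function name and the convention of returning 0.0 when a marginal is empty are assumptions, and BEDROC is not covered.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from binary confusion-matrix counts.

    Ranges from -1 (total disagreement) through 0 (chance level) to +1
    (perfect prediction).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator
```

    Unlike plain accuracy, this score stays near zero for a classifier that always predicts the majority class, which matters for imbalanced bioactivity data.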

  14. Estimating future burned areas under changing climate in the EU-Mediterranean countries.

    PubMed

    Amatulli, Giuseppe; Camia, Andrea; San-Miguel-Ayanz, Jesús

    2013-04-15

    The impacts of climate change on forest fires have received increased attention in recent years at both continental and local scales. It is widely recognized that weather plays a key role in extreme fire situations. It is therefore of great interest to analyze projected changes in fire danger under climate change scenarios and to assess the consequent impacts of forest fires. In this study we estimated burned areas in the European Mediterranean (EU-Med) countries under past and future climate conditions. Historical (1985-2004) monthly burned areas in EU-Med countries were modeled by using the Canadian Fire Weather Index (CFWI). Monthly averages of the CFWI sub-indices were used as explanatory variables to estimate the monthly burned areas in each of the five most affected countries in Europe using three different modeling approaches (Multiple Linear Regression - MLR, Random Forest - RF, Multivariate Adaptive Regression Splines - MARS). MARS outperformed the other methods. Regression equations and significant coefficients of determination were obtained, although there were noticeable differences from country to country. Climatic conditions at the end of the 21st Century were simulated using results from the runs of the regional climate model HIRHAM in the European project PRUDENCE, considering two IPCC SRES scenarios (A2-B2). The MARS models were applied to both scenarios resulting in projected burned areas in each country and in the EU-Med region. Results showed that significant increases, 66% and 140% of the total burned area, can be expected in the EU-Med region under the A2 and B2 scenarios, respectively. Copyright © 2013 Elsevier B.V. All rights reserved.

  15. Using LiDAR to Estimate Total Aboveground Biomass of Redwood Stands in the Jackson Demonstration State Forest, Mendocino, California

    NASA Astrophysics Data System (ADS)

    Rao, M.; Vuong, H.

    2013-12-01

    The overall objective of this study is to develop a method for estimating total aboveground biomass of redwood stands in Jackson Demonstration State Forest, Mendocino, California using airborne LiDAR data. LiDAR data, owing to their vertical and horizontal accuracy, are increasingly being used to characterize landscape features including ground surface elevation and canopy height. These LiDAR-derived metrics, which capture structural signatures at higher precision and accuracy, can help better understand ecological processes at various spatial scales. Our study is focused on two major species of the forest: redwood (Sequoia sempervirens [D.Don] Engl.) and Douglas-fir (Pseudotsuga menziesii [Mirb.] Franco). Specifically, the objectives included fitting linear regression models of tree diameter at breast height (dbh) to LiDAR-derived height for each species. At 23 random points in the study area, field measurements (dbh and tree coordinates) were collected for more than 500 redwood and Douglas-fir trees over 0.2-ha plots. The USFS-FUSION application software along with its LiDAR Data Viewer (LDV) were used to extract a Canopy Height Model (CHM) from which tree heights would be derived. Based on the LiDAR-derived height and ground-based dbh, a linear regression model was developed to predict dbh. The predicted dbh was used to estimate the biomass at the single tree level using Jenkins' formula (Jenkins et al. 2003). The linear regression models were able to explain 65% of the variability associated with redwood dbh and 80% of that associated with Douglas-fir dbh.
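
    The regression step described here, predicting dbh from LiDAR-derived height and reporting the share of variability explained, can be sketched with ordinary least squares. The function name and any data passed to it are hypothetical; the study's actual coefficients are not given in the abstract and none are implied.

```python
def fit_line(heights, dbhs):
    """Ordinary least-squares fit of dbh = intercept + slope * height.

    Returns (slope, intercept, r_squared), where r_squared is the
    proportion of dbh variability explained by height.
    """
    n = len(heights)
    mean_h = sum(heights) / n
    mean_d = sum(dbhs) / n
    sxx = sum((h - mean_h) ** 2 for h in heights)
    sxy = sum((h - mean_h) * (d - mean_d) for h, d in zip(heights, dbhs))
    slope = sxy / sxx
    intercept = mean_d - slope * mean_h
    ss_res = sum((d - (intercept + slope * h)) ** 2 for h, d in zip(heights, dbhs))
    ss_tot = sum((d - mean_d) ** 2 for d in dbhs)
    return slope, intercept, 1.0 - ss_res / ss_tot
```

    An r_squared of 0.65 or 0.80 from such a fit would correspond to the 65% and 80% figures quoted for the two species.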

  16. Genetic evidence for landscape effects on dispersal in the army ant Eciton burchellii.

    PubMed

    Soare, Thomas W; Kumar, Anjali; Naish, Kerry A; O'Donnell, Sean

    2014-01-01

    Inhibited dispersal, leading to reduced gene flow, threatens populations with inbreeding depression and local extinction. Fragmentation may be especially detrimental to social insects because inhibited gene flow has important consequences for cooperation and competition within and among colonies. Army ants have winged males and permanently wingless queens; these traits imply male-biased dispersal. However, army ant colonies are obligately nomadic and have the potential to traverse landscapes. Eciton burchellii, the most regularly nomadic army ant, is a forest interior species: colony raiding activities are limited in the absence of forest cover. To examine whether nomadism and landscape (forest clearing and elevation) affect population genetic structure in a montane E. burchellii population, we reconstructed queen and male genotypes from 25 colonies at seven polymorphic microsatellite loci. Pairwise genetic distances among individuals were compared to pairwise geographical and resistance distances using regressions with permutations, partial Mantel tests and random forests analyses. Although there was no significant spatial genetic structure in queens or males in montane forest, dispersal may be male-biased. We found significant isolation by landscape resistance for queens based on land cover (forest clearing), but not on elevation. Summed colony emigrations over the lifetime of the queen may contribute to gene flow in this species and forest clearing impedes these movements and subsequent gene dispersal. Further forest cover removal may increasingly inhibit Eciton burchellii colony dispersal. We recommend maintaining habitat connectivity in tropical forests to promote population persistence for this keystone species. © 2013 John Wiley & Sons Ltd.

  17. Modelling the ecological consequences of whole tree harvest for bioenergy production

    NASA Astrophysics Data System (ADS)

    Skår, Silje; Lange, Holger; Sogn, Trine

    2013-04-01

    There is an increasing demand for energy from biomass as a substitute for fossil fuels worldwide, and the Norwegian government plans to double the production of bioenergy to 9% of the national energy production or to 28 TWh per year by 2020. A large part of this increase may come from forests, which have a great potential with respect to biomass supply as forest growth increasingly has exceeded harvest in the last decades. One feasible option is the utilization of forest residues (needles, twigs and branches) in addition to stems, known as Whole Tree Harvest (WTH). As opposed to WTH, the residues are traditionally left in the forest with Conventional Timber Harvesting (CH). However, the residues contain a large share of the tree's nutrients, indicating that WTH may possibly alter the supply of nutrients and organic matter to the soil and the forest ecosystem. This may potentially lead to reduced tree growth. Other implications can be nutrient imbalance, loss of carbon from the soil and changes in species composition and diversity. This study aims to identify key factors and appropriate strategies for ecologically sustainable WTH in Norway spruce (Picea abies) and Scots pine (Pinus sylvestris) forest stands in Norway. We focus on identifying key factors driving soil organic matter, nutrients, biomass, biodiversity etc. Simulations of the effect on the carbon and nitrogen budget with the two harvesting methods will also be conducted. Data from field trials and long-term manipulation experiments are used to obtain a first overview of key variables. The relationships between the variables are hitherto unknown, but it is by no means obvious that they could be assumed as linear; thus, an ordinary multiple linear regression approach is expected to be insufficient. Here we apply two advanced and highly flexible modelling frameworks which have hardly been used in the context of tree growth, nutrient balances and biomass removal so far: Generalized Additive Models (GAMs) and Random Forests. Results obtained for GAMs so far show that there are differences between WTH and CH in two directions: both the significance of drivers and the shape of the response functions differ. GAMs turn out to be a flexible and powerful alternative to multivariate linear regression. The restriction to linear relationships seems to be unjustified in the present case. We use Random Forests as a highly efficient classifier which gives reliable estimates for the importance of each driver variable in determining the diameter growth for the two different harvesting treatments. Based on the final results of these two modelling approaches, the study helps to identify appropriate strategies and suitable regions (in Norway) where WTH may be sustainably performed.

  18. The challenge for genetic epidemiologists: how to analyze large numbers of SNPs in relation to complex diseases.

    PubMed

    Heidema, A Geert; Boer, Jolanda M A; Nagelkerke, Nico; Mariman, Edwin C M; van der A, Daphne L; Feskens, Edith J M

    2006-04-21

    Genetic epidemiologists have taken the challenge to identify genetic polymorphisms involved in the development of diseases. Many have collected data on large numbers of genetic markers but are not familiar with available methods to assess their association with complex diseases. Statistical methods have been developed for analyzing the relation between large numbers of genetic and environmental predictors to disease or disease-related variables in genetic association studies. In this commentary we discuss logistic regression analysis, neural networks, including the parameter decreasing method (PDM) and genetic programming optimized neural networks (GPNN) and several non-parametric methods, which include the set association approach, combinatorial partitioning method (CPM), restricted partitioning method (RPM), multifactor dimensionality reduction (MDR) method and the random forests approach. The relative strengths and weaknesses of these methods are highlighted. Logistic regression and neural networks can handle only a limited number of predictor variables, depending on the number of observations in the dataset. Therefore, they are less useful than the non-parametric methods to approach association studies with large numbers of predictor variables. GPNN on the other hand may be a useful approach to select and model important predictors, but its performance to select the important effects in the presence of large numbers of predictors needs to be examined. Both the set association approach and random forests approach are able to handle a large number of predictors and are useful in reducing these predictors to a subset of predictors with an important contribution to disease. The combinatorial methods give more insight in combination patterns for sets of genetic and/or environmental predictor variables that may be related to the outcome variable. 
    As the non-parametric methods have different strengths and weaknesses, we conclude that to approach genetic association studies using the case-control design, the application of a combination of several methods, including the set association approach, MDR and the random forests approach, will likely be a useful strategy to find the important genes and interaction patterns involved in complex diseases.

  19. Landslide susceptibility assessment in Lianhua County (China): A comparison between a random forest data mining technique and bivariate and multivariate statistical models

    NASA Astrophysics Data System (ADS)

    Hong, Haoyuan; Pourghasemi, Hamid Reza; Pourtaghi, Zohre Sadat

    2016-04-01

    Landslides are an important natural hazard that causes a great amount of damage around the world every year, especially during the rainy season. The Lianhua area is located in the middle of China's southern mountainous area, west of Jiangxi Province, and is known to be an area prone to landslides. The aim of this study was to evaluate and compare landslide susceptibility maps produced using the random forest (RF) data mining technique with those produced by bivariate (evidential belief function and frequency ratio) and multivariate (logistic regression) statistical models for Lianhua County, China. First, a landslide inventory map was prepared using aerial photograph interpretation, satellite images, and extensive field surveys. In total, 163 landslide events were recognized in the study area, with 114 landslides (70%) used for training and 49 landslides (30%) used for validation. Next, the landslide conditioning factors (including the slope angle, altitude, slope aspect, topographic wetness index (TWI), slope-length (LS), plan curvature, profile curvature, distance to rivers, distance to faults, distance to roads, annual precipitation, land use, normalized difference vegetation index (NDVI), and lithology) were derived from the spatial database. Finally, the landslide susceptibility maps of Lianhua County were generated in ArcGIS 10.1 based on the random forest (RF), evidential belief function (EBF), frequency ratio (FR), and logistic regression (LR) approaches and were validated using a receiver operating characteristic (ROC) curve. The ROC plot assessment results showed that for landslide susceptibility maps produced using the EBF, FR, LR, and RF models, the area under the curve (AUC) values were 0.8122, 0.8134, 0.7751, and 0.7172, respectively. Therefore, we can conclude that all four models have an AUC of more than 0.70 and can be used in landslide susceptibility mapping in the study area; meanwhile, the EBF and FR models had the best performance for Lianhua County, China. Thus, the resultant susceptibility maps will be useful for land use planning and hazard mitigation aims.
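
    The AUC values compared above have a simple pairwise interpretation: the probability that a randomly chosen landslide location receives a higher susceptibility score than a randomly chosen non-landslide location. The sketch below assumes that interpretation and uses hypothetical argument names; it does not reproduce the ArcGIS workflow or the study's validation data.

```python
def roc_auc(scores_positive, scores_negative):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the
    positive case scores higher; ties count as half a win.
    """
    wins = 0.0
    for p in scores_positive:
        for q in scores_negative:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))
```

    A value of 0.5 corresponds to a map that ranks locations no better than chance, which is why the 0.70 threshold is used above as a minimum for usefulness.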

  20. Random forests on Hadoop for genome-wide association studies of multivariate neuroimaging phenotypes

    PubMed Central

    2013-01-01

    Motivation: Multivariate quantitative traits arise naturally in recent neuroimaging genetics studies, in which both structural and functional variability of the human brain is measured non-invasively through techniques such as magnetic resonance imaging (MRI). There is growing interest in detecting genetic variants associated with such multivariate traits, especially in genome-wide studies. Random forests (RFs) classifiers, which are ensembles of decision trees, are amongst the best performing machine learning algorithms and have been successfully employed for the prioritisation of genetic variants in case-control studies. RFs can also be applied to produce gene rankings in association studies with multivariate quantitative traits, and to estimate genetic similarity measures that are predictive of the trait. However, in studies involving hundreds of thousands of SNPs and high-dimensional traits, a very large ensemble of trees must be inferred from the data in order to obtain reliable rankings, which makes the application of these algorithms computationally prohibitive. Results: We have developed a parallel version of the RF algorithm for regression and genetic similarity learning tasks in large-scale population genetic association studies involving multivariate traits, called PaRFR (Parallel Random Forest Regression). Our implementation takes advantage of the MapReduce programming model and is deployed on Hadoop, an open-source software framework that supports data-intensive distributed applications. Notable speed-ups are obtained by introducing a distance-based criterion for node splitting in the tree estimation process. PaRFR has been applied to a genome-wide association study on Alzheimer's disease (AD) in which the quantitative trait consists of a high-dimensional neuroimaging phenotype describing longitudinal changes in the human brain structure. PaRFR provides a ranking of SNPs associated with this trait, and produces pair-wise measures of genetic proximity that can be directly compared to pair-wise measures of phenotypic proximity. Several known AD-related variants have been identified, including APOE4 and TOMM40. We also present experimental evidence supporting the hypothesis of a linear relationship between the number of top-ranked mutated states, or frequent mutation patterns, and an indicator of disease severity. Availability: The Java codes are freely available at http://www2.imperial.ac.uk/~gmontana. PMID:24564704

  1. Random forests on Hadoop for genome-wide association studies of multivariate neuroimaging phenotypes.

    PubMed

    Wang, Yue; Goh, Wilson; Wong, Limsoon; Montana, Giovanni

    2013-01-01

    Multivariate quantitative traits arise naturally in recent neuroimaging genetics studies, in which both structural and functional variability of the human brain is measured non-invasively through techniques such as magnetic resonance imaging (MRI). There is growing interest in detecting genetic variants associated with such multivariate traits, especially in genome-wide studies. Random forests (RFs) classifiers, which are ensembles of decision trees, are amongst the best performing machine learning algorithms and have been successfully employed for the prioritisation of genetic variants in case-control studies. RFs can also be applied to produce gene rankings in association studies with multivariate quantitative traits, and to estimate genetic similarities measures that are predictive of the trait. However, in studies involving hundreds of thousands of SNPs and high-dimensional traits, a very large ensemble of trees must be inferred from the data in order to obtain reliable rankings, which makes the application of these algorithms computationally prohibitive. We have developed a parallel version of the RF algorithm for regression and genetic similarity learning tasks in large-scale population genetic association studies involving multivariate traits, called PaRFR (Parallel Random Forest Regression). Our implementation takes advantage of the MapReduce programming model and is deployed on Hadoop, an open-source software framework that supports data-intensive distributed applications. Notable speed-ups are obtained by introducing a distance-based criterion for node splitting in the tree estimation process. PaRFR has been applied to a genome-wide association study on Alzheimer's disease (AD) in which the quantitative trait consists of a high-dimensional neuroimaging phenotype describing longitudinal changes in the human brain structure. 
    PaRFR provides a ranking of SNPs associated with this trait, and produces pair-wise measures of genetic proximity that can be directly compared to pair-wise measures of phenotypic proximity. Several known AD-related variants have been identified, including APOE4 and TOMM40. We also present experimental evidence supporting the hypothesis of a linear relationship between the number of top-ranked mutated states, or frequent mutation patterns, and an indicator of disease severity. The Java codes are freely available at http://www2.imperial.ac.uk/~gmontana.

  2. Comparison of regression coefficient and GIS-based methodologies for regional estimates of forest soil carbon stocks.

    PubMed

    Campbell, J Elliott; Moen, Jeremie C; Ney, Richard A; Schnoor, Jerald L

    2008-03-01

    Estimates of forest soil organic carbon (SOC) have applications in carbon science, soil quality studies, carbon sequestration technologies, and carbon trading. Forest SOC has been modeled using a regression coefficient methodology that applies mean SOC densities (mass/area) to broad forest regions. A higher resolution model is based on an approach that employs a geographic information system (GIS) with soil databases and satellite-derived landcover images. Despite this advancement, the regression approach remains the basis of current state and federal level greenhouse gas inventories. Both approaches are analyzed in detail for Wisconsin forest soils from 1983 to 2001, applying rigorous error-fixing algorithms to soil databases. Resulting SOC stock estimates are 20% larger when determined using the GIS method rather than the regression approach. Average annual rates of increase in SOC stocks are 3.6 and 1.0 million metric tons of carbon per year for the GIS and regression approaches respectively.

  3. Data-Science Analysis of the Macro-scale Features Governing the Corrosion to Crack Transition in AA7050-T7451

    NASA Astrophysics Data System (ADS)

    Co, Noelle Easter C.; Brown, Donald E.; Burns, James T.

    2018-05-01

    This study applies data science approaches (random forest and logistic regression) to determine the extent to which macro-scale corrosion damage features govern the crack formation behavior in AA7050-T7451. Each corrosion morphology has a set of corresponding predictor variables (pit depth, volume, area, diameter, pit density, total fissure length, surface roughness metrics, etc.) describing the shape of the corrosion damage. The values of the predictor variables are obtained from white light interferometry, x-ray tomography, and scanning electron microscope imaging of the corrosion damage. A permutation test is employed to assess the significance of the logistic and random forest model predictions. Results indicate minimal relationship between the macro-scale corrosion feature predictor variables and fatigue crack initiation. These findings suggest that the macro-scale corrosion features and their interactions do not solely govern the crack formation behavior. While these results do not imply that the macro-features have no impact, they do suggest that additional parameters must be considered to rigorously inform the crack formation location.
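
    A permutation test of the general kind mentioned above can be sketched as follows: the labels are shuffled repeatedly and the statistic of interest is recomputed to build a null distribution. The statistic used here (absolute difference in group means of a single feature), the function names and the default settings are assumptions for illustration; the study applies the test to model predictions, which this sketch does not reproduce.

```python
import random


def mean_gap(features, labels):
    """Illustrative statistic: absolute difference between the two group means."""
    ones = [x for x, y in zip(features, labels) if y == 1]
    zeros = [x for x, y in zip(features, labels) if y == 0]
    return abs(sum(ones) / len(ones) - sum(zeros) / len(zeros))


def permutation_p_value(statistic, features, labels, n_permutations=1000, seed=0):
    """Estimate how often shuffled labels give a statistic at least as
    large as the observed one (one-sided, with the +1 correction)."""
    rng = random.Random(seed)
    observed = statistic(features, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any real feature-label association
        if statistic(features, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)
```

    A large p-value from such a test is consistent with the "minimal relationship" reported between the macro-scale features and crack initiation.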

  4. Accurate Segmentation of CT Male Pelvic Organs via Regression-based Deformable Models and Multi-task Random Forests

    PubMed Central

    Gao, Yaozong; Shao, Yeqin; Lian, Jun; Wang, Andrew Z.; Chen, Ronald C.

    2016-01-01

    Segmenting male pelvic organs from CT images is a prerequisite for prostate cancer radiotherapy. The efficacy of radiation treatment highly depends on segmentation accuracy. However, accurate segmentation of male pelvic organs is challenging due to low tissue contrast of CT images, as well as large variations of shape and appearance of the pelvic organs. Among existing segmentation methods, deformable models are the most popular, as shape prior can be easily incorporated to regularize the segmentation. Nonetheless, the sensitivity to initialization often limits their performance, especially for segmenting organs with large shape variations. In this paper, we propose a novel approach to guide deformable models, thus making them robust against arbitrary initializations. Specifically, we learn a displacement regressor, which predicts 3D displacement from any image voxel to the target organ boundary based on the local patch appearance. This regressor provides a nonlocal external force for each vertex of deformable model, thus overcoming the initialization problem suffered by the traditional deformable models. To learn a reliable displacement regressor, two strategies are particularly proposed. 1) A multi-task random forest is proposed to learn the displacement regressor jointly with the organ classifier; 2) an auto-context model is used to iteratively enforce structural information during voxel-wise prediction. Extensive experiments on 313 planning CT scans of 313 patients show that our method achieves better results than alternative classification or regression based methods, and also several other existing methods in CT pelvic organ segmentation. PMID:26800531

  5. Comparative Performance Analysis of Support Vector Machine, Random Forest, Logistic Regression and k-Nearest Neighbours in Rainbow Trout (Oncorhynchus Mykiss) Classification Using Image-Based Features

    PubMed Central

    Císař, Petr; Labbé, Laurent; Souček, Pavel; Pelissier, Pablo; Kerneis, Thierry

    2018-01-01

    The main aim of this study was to develop a new objective method for evaluating the impacts of different diets on the live fish skin using image-based features. In total, one hundred and sixty rainbow trout (Oncorhynchus mykiss) were fed either a fish-meal based diet (80 fish) or a 100% plant-based diet (80 fish) and photographed using a consumer-grade digital camera. Twenty-three colour features and four texture features were extracted. Four different classification methods were used to evaluate fish diets, including Random forest (RF), Support vector machine (SVM), Logistic regression (LR) and k-Nearest neighbours (k-NN). The SVM with radial based kernel provided the best classifier with correct classification rate (CCR) of 82% and Kappa coefficient of 0.65. Although both the LR and RF methods were less accurate than SVM, they achieved good classification with CCR 75% and 70% respectively. The k-NN was the least accurate (40%) classification model. Overall, it can be concluded that consumer-grade digital cameras could be employed as a fast, accurate and non-invasive sensor for classifying rainbow trout based on their diets. Furthermore, there was a close association between image-based features and fish diet received during cultivation. These procedures can be used as non-invasive, accurate and precise approaches for monitoring fish status during the cultivation by evaluating diet’s effects on fish skin. PMID:29596375

  6. Comparative Performance Analysis of Support Vector Machine, Random Forest, Logistic Regression and k-Nearest Neighbours in Rainbow Trout (Oncorhynchus Mykiss) Classification Using Image-Based Features.

    PubMed

    Saberioon, Mohammadmehdi; Císař, Petr; Labbé, Laurent; Souček, Pavel; Pelissier, Pablo; Kerneis, Thierry

    2018-03-29

    The main aim of this study was to develop a new objective method for evaluating the impacts of different diets on the live fish skin using image-based features. In total, one hundred and sixty rainbow trout (Oncorhynchus mykiss) were fed either a fish-meal based diet (80 fish) or a 100% plant-based diet (80 fish) and photographed using a consumer-grade digital camera. Twenty-three colour features and four texture features were extracted. Four different classification methods were used to evaluate fish diets, including Random forest (RF), Support vector machine (SVM), Logistic regression (LR) and k-Nearest neighbours (k-NN). The SVM with radial based kernel provided the best classifier with correct classification rate (CCR) of 82% and Kappa coefficient of 0.65. Although both the LR and RF methods were less accurate than SVM, they achieved good classification with CCR 75% and 70% respectively. The k-NN was the least accurate (40%) classification model. Overall, it can be concluded that consumer-grade digital cameras could be employed as a fast, accurate and non-invasive sensor for classifying rainbow trout based on their diets. Furthermore, there was a close association between image-based features and fish diet received during cultivation. These procedures can be used as non-invasive, accurate and precise approaches for monitoring fish status during the cultivation by evaluating diet's effects on fish skin.
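
    The Kappa coefficient reported alongside the correct classification rate corrects observed agreement for the agreement expected by chance. A minimal sketch, assuming Cohen's kappa computed from a square confusion matrix (rows as true classes, columns as predicted classes); the function name and the handling of the degenerate case are assumptions, and no confusion matrix from the study is reproduced.

```python
def cohens_kappa(confusion):
    """Cohen's kappa from a square confusion matrix given as a list of rows."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: share of cases on the diagonal (equals the CCR).
    observed = sum(confusion[i][i] for i in range(k)) / total
    # Chance agreement: product of matching row and column marginals.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion) for i in range(k)
    ) / (total * total)
    if expected == 1.0:
        return 0.0  # assumed convention when only one class ever occurs
    return (observed - expected) / (1.0 - expected)
```

    With two equally sized diet groups the chance agreement is 0.5, so kappa is roughly twice the margin by which the classification rate exceeds 50%.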

  7. Machine learning to predict the occurrence of bisphosphonate-related osteonecrosis of the jaw associated with dental extraction: A preliminary report.

    PubMed

    Kim, Dong Wook; Kim, Hwiyoung; Nam, Woong; Kim, Hyung Jun; Cha, In-Ho

    2018-04-23

    The aim of this study was to build and validate five types of machine learning models that can predict the occurrence of BRONJ associated with dental extraction in patients taking bisphosphonates for the management of osteoporosis. A retrospective review of the medical records was conducted to obtain cases and controls for the study. A total of 125 patients, consisting of 41 cases and 84 controls, were selected for the study. Five machine learning prediction algorithms were implemented: a multivariable logistic regression model, decision tree, support vector machine, artificial neural network, and random forest. The outputs of these models were compared with each other and also with conventional methods, such as serum CTX level. The area under the receiver operating characteristic (ROC) curve (AUC) was used to compare the results. The performance of the machine learning models was significantly superior to that of conventional statistical methods and single predictors. The random forest model yielded the best performance (AUC = 0.973), followed by artificial neural network (AUC = 0.915), support vector machine (AUC = 0.882), logistic regression (AUC = 0.844), decision tree (AUC = 0.821), drug holiday alone (AUC = 0.810), and CTX level alone (AUC = 0.630). Machine learning methods showed superior performance in predicting BRONJ associated with dental extraction compared to conventional statistical methods using drug holiday and serum CTX level. Machine learning can thus be applied in a wide range of clinical studies. Copyright © 2017. Published by Elsevier Inc.
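
    The AUC used to rank these models can be computed directly as the probability that a randomly chosen case receives a higher score than a randomly chosen control. The sketch below assumes each model outputs one numeric score per patient; the function name and the scores are hypothetical.

```python
def auc(scores_cases, scores_controls):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (case, control) pairs in which the case has the
    higher score; tied pairs count as one half.
    """
    wins = 0.0
    for c in scores_cases:
        for n in scores_controls:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(scores_cases) * len(scores_controls))


# Hypothetical model scores (not from the study).
example_auc = auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.2])
```

    An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of cases from controls.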

  8. Methods for estimating population density in data-limited areas: evaluating regression and tree-based models in Peru.

    PubMed

    Anderson, Weston; Guikema, Seth; Zaitchik, Ben; Pan, William

    2014-01-01

    Obtaining accurate small area estimates of population is essential for policy and health planning but is often difficult in countries with limited data. In lieu of available population data, small area estimate models draw information from previous time periods or from similar areas. This study focuses on model-based methods for estimating population when no direct samples are available in the area of interest. To explore the efficacy of tree-based models for estimating population density, we compare six different model structures including Random Forest and Bayesian Additive Regression Trees. Results demonstrate that without information from prior time periods, non-parametric tree-based models produced more accurate predictions than did conventional regression methods. Improving estimates of population density in non-sampled areas is important for regions with incomplete census data and has implications for economic, health and development policies.

  9. Methods for Estimating Population Density in Data-Limited Areas: Evaluating Regression and Tree-Based Models in Peru

    PubMed Central

    Anderson, Weston; Guikema, Seth; Zaitchik, Ben; Pan, William

    2014-01-01

    Obtaining accurate small area estimates of population is essential for policy and health planning but is often difficult in countries with limited data. In lieu of available population data, small area estimate models draw information from previous time periods or from similar areas. This study focuses on model-based methods for estimating population when no direct samples are available in the area of interest. To explore the efficacy of tree-based models for estimating population density, we compare six different model structures including Random Forest and Bayesian Additive Regression Trees. Results demonstrate that without information from prior time periods, non-parametric tree-based models produced more accurate predictions than did conventional regression methods. Improving estimates of population density in non-sampled areas is important for regions with incomplete census data and has implications for economic, health and development policies. PMID:24992657

  10. Applying a weighted random forests method to extract karst sinkholes from LiDAR data

    NASA Astrophysics Data System (ADS)

    Zhu, Junfeng; Pierskalla, William P.

    2016-02-01

    Detailed mapping of sinkholes provides critical information for mitigating sinkhole hazards and understanding groundwater and surface water interactions in karst terrains. LiDAR (Light Detection and Ranging) measures the earth's surface at high resolution and high density and has shown great potential to drastically improve locating and delineating sinkholes. However, processing LiDAR data to extract sinkholes requires separating sinkholes from other depressions, which can be laborious because of the sheer number of depressions commonly generated from LiDAR data. In this study, we applied random forests, a machine learning method, to automatically separate sinkholes from other depressions in a karst region in central Kentucky. The sinkhole-extraction random forest was grown on a training dataset built from an area where LiDAR-derived depressions were manually classified through a visual inspection and field verification process. Based on the geometry of depressions, as well as natural and human factors related to sinkholes, 11 parameters were selected as predictive variables to form the dataset. Because the training dataset was imbalanced, with the majority of depressions being non-sinkholes, a weighted random forests method was used to improve the accuracy of predicting sinkholes. The weighted random forest achieved an average accuracy of 89.95% for the training dataset, demonstrating that the random forest can be an effective sinkhole classifier. Testing of the random forest in another area, however, resulted in moderate success, with an average accuracy rate of 73.96%. This study suggests that an automatic sinkhole extraction procedure like the random forest classifier can significantly reduce time and labor costs and make it more tractable to map sinkholes using LiDAR data for large areas. However, the random forests method cannot totally replace manual procedures, such as visual inspection and field verification.
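
    The abstract does not say how the class weights were chosen. One common heuristic, shown here purely as an assumption, weights each class inversely to its frequency so that the rare class (sinkholes) contributes as much to training as the common one. The function name and labels are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    This is a generic inverse-frequency heuristic, not necessarily the
    weighting scheme used in the study.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical imbalanced training labels: 8 non-sinkholes, 2 sinkholes.
weights = balanced_class_weights(["non-sinkhole"] * 8 + ["sinkhole"] * 2)
```

    With these weights the minority class receives a weight four times larger than the majority class, matching the 8:2 imbalance.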

  11. Comparison of Models for the Prediction of Medical Costs of Spinal Fusion in Taiwan Diagnosis-Related Groups by Machine Learning Algorithms.

    PubMed

    Kuo, Ching-Yen; Yu, Liang-Chin; Chen, Hou-Chaung; Chan, Chien-Lung

    2018-01-01

    The aims of this study were to compare the performance of machine learning methods for the prediction of the medical costs associated with spinal fusion in terms of profit or loss in Taiwan Diagnosis-Related Groups (Tw-DRGs) and to apply these methods to explore the important factors associated with the medical costs of spinal fusion. A data set was obtained from a regional hospital in Taoyuan city in Taiwan, which contained data from 2010 to 2013 on patients of Tw-DRG49702 (posterior and other spinal fusion without complications or comorbidities). Naïve-Bayesian, support vector machines, logistic regression, C4.5 decision tree, and random forest methods were employed for prediction using WEKA 3.8.1. Five hundred thirty-two cases were categorized as belonging to the Tw-DRG49702 group. The mean medical cost was US $4,549.7, and the mean age of the patients was 62.4 years. The mean length of stay was 9.3 days. The length of stay was an important variable in terms of determining medical costs for patients undergoing spinal fusion. The random forest method had the best predictive performance in comparison to the other methods, achieving an accuracy of 84.30%, a sensitivity of 71.4%, a specificity of 92.2%, and an AUC of 0.904. Our study demonstrated that the random forest model can be employed to predict the medical costs of Tw-DRG49702, and could inform hospital strategy in terms of increasing the financial management efficiency of this operation.

  12. CURE-SMOTE algorithm and hybrid algorithm for feature selection and parameter optimization based on random forests.

    PubMed

    Ma, Li; Fan, Suohai

    2017-03-14

    The random forests algorithm is a type of classifier with prominent universality, a wide application range, and robustness for avoiding overfitting. But there are still some drawbacks to random forests. Therefore, to improve the performance of random forests, this paper seeks to improve imbalanced data processing, feature selection and parameter optimization. We propose the CURE-SMOTE algorithm for the imbalanced data classification problem. Experiments on imbalanced UCI data reveal that the combination of Clustering Using Representatives (CURE) enhances the original synthetic minority oversampling technique (SMOTE) algorithms effectively compared with the classification results on the original data using random sampling, Borderline-SMOTE1, safe-level SMOTE, C-SMOTE, and k-means-SMOTE. Additionally, the hybrid RF (random forests) algorithm has been proposed for feature selection and parameter optimization, which uses the minimum out of bag (OOB) data error as its objective function. Simulation results on binary and higher-dimensional data indicate that the proposed hybrid RF algorithms, hybrid genetic-random forests algorithm, hybrid particle swarm-random forests algorithm and hybrid fish swarm-random forests algorithm can achieve the minimum OOB error and show the best generalization ability. The training set produced from the proposed CURE-SMOTE algorithm is closer to the original data distribution because it contains minimal noise. Thus, better classification results are produced from this feasible and effective algorithm. Moreover, the hybrid algorithm's F-value, G-mean, AUC and OOB scores demonstrate that they surpass the performance of the original RF algorithm. Hence, this hybrid algorithm provides a new way to perform feature selection and parameter optimization.
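
    For orientation, the interpolation step of the original SMOTE technique that CURE-SMOTE builds on can be sketched as follows. This shows only the basic synthetic-sample step, not the CURE clustering stage proposed in the paper; the function name and data are hypothetical.

```python
import random


def smote_point(sample, neighbor, rng):
    """Create one synthetic minority sample on the segment between a
    minority sample and one of its minority-class neighbours."""
    gap = rng.random()  # uniform in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
synthetic = smote_point([1.0, 2.0], [3.0, 6.0], rng)
```

    Every synthetic point lies between the two existing minority samples, so the minority class is oversampled without duplicating records.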

  13. Predicting 30-day Hospital Readmission with Publicly Available Administrative Database. A Conditional Logistic Regression Modeling Approach.

    PubMed

    Zhu, K; Lou, Z; Zhou, J; Ballester, N; Kong, N; Parikh, P

    2015-01-01

    This article is part of the Focus Theme of Methods of Information in Medicine on "Big Data and Analytics in Healthcare". Hospital readmissions raise healthcare costs and cause significant distress to providers and patients. It is, therefore, of great interest to healthcare organizations to predict which patients are at risk of being readmitted to their hospitals. However, current logistic regression based risk prediction models have limited prediction power when applied to hospital administrative data. Meanwhile, although decision trees and random forests have been applied, they tend to be too complex for hospital practitioners to understand. The objective was to explore the use of conditional logistic regression to increase the prediction accuracy. We analyzed an HCUP statewide inpatient discharge record dataset, which includes patient demographics, clinical and care utilization data from California. We extracted records of heart failure Medicare beneficiaries who had inpatient experience during an 11-month period. We corrected the data imbalance issue with under-sampling. In our study, we first applied standard logistic regression and decision tree to obtain influential variables and derive practically meaningful decision rules. We then stratified the original data set accordingly and applied logistic regression on each data stratum. We further explored the effect of interacting variables in the logistic regression modeling. We conducted cross validation to assess the overall prediction performance of conditional logistic regression (CLR) and compared it with standard classification models. The developed CLR models outperformed several standard classification models (e.g., straightforward logistic regression, stepwise logistic regression, random forest, support vector machine). For example, the best CLR model improved the classification accuracy by nearly 20% over the straightforward logistic regression model. 
Furthermore, the developed CLR models tend to achieve sensitivity more than 10% better than that of the standard classification models, which can be translated to the correct labeling of an additional 400 - 500 readmissions for heart failure patients in the state of California over a year. Lastly, several key predictors identified from the HCUP data include the disposition location from discharge, the number of chronic conditions, and the number of acute procedures. It would be beneficial to apply simple decision rules obtained from the decision tree in an ad-hoc manner to guide the cohort stratification. It could be potentially beneficial to explore the effect of pairwise interactions between influential predictors when building the logistic regression models for different data strata. Judicious use of the ad-hoc CLR models developed offers insights into future development of prediction models for hospital readmissions, which can lead to better intuition in identifying high-risk patients and developing effective post-discharge care strategies. Lastly, this paper is expected to raise awareness of collecting data on additional markers and developing the necessary database infrastructure for larger-scale exploratory studies on readmission risk prediction.

  14. Strategies for minimizing sample size for use in airborne LiDAR-based forest inventory

    USGS Publications Warehouse

    Junttila, Virpi; Finley, Andrew O.; Bradford, John B.; Kauranne, Tuomo

    2013-01-01

    Recently airborne Light Detection And Ranging (LiDAR) has emerged as a highly accurate remote sensing modality to be used in operational scale forest inventories. Inventories conducted with the help of LiDAR are most often model-based, i.e. they use variables derived from LiDAR point clouds as the predictive variables that are to be calibrated using field plots. The measurement of the necessary field plots is a time-consuming and statistically sensitive process. Because of this, current practice often presumes hundreds of plots to be collected. But since these plots are only used to calibrate regression models, it should be possible to minimize the number of plots needed by carefully selecting the plots to be measured. In the current study, we compare several systematic and random methods for calibration plot selection, with the specific aim that they be used in LiDAR based regression models for forest parameters, especially above-ground biomass. The primary criteria compared are based on both spatial representativity as well as on their coverage of the variability of the forest features measured. In the former case, it is important also to take into account spatial auto-correlation between the plots. The results indicate that choosing the plots in a way that ensures ample coverage of both spatial and feature space variability improves the performance of the corresponding models, and that adequate coverage of the variability in the feature space is the most important condition that should be met by the set of plots collected.

  15. A machine learning calibration model using random forests to improve sensor performance for lower-cost air quality monitoring

    NASA Astrophysics Data System (ADS)

    Zimmerman, Naomi; Presto, Albert A.; Kumar, Sriniwasa P. N.; Gu, Jason; Hauryliuk, Aliaksei; Robinson, Ellis S.; Robinson, Allen L.; Subramanian, R.

    2018-01-01

    Low-cost sensing strategies hold the promise of denser air quality monitoring networks, which could significantly improve our understanding of personal air pollution exposure. Additionally, low-cost air quality sensors could be deployed to areas where limited monitoring exists. However, low-cost sensors are frequently sensitive to environmental conditions and pollutant cross-sensitivities, which have historically been poorly addressed by laboratory calibrations, limiting their utility for monitoring. In this study, we investigated different calibration models for the Real-time Affordable Multi-Pollutant (RAMP) sensor package, which measures CO, NO2, O3, and CO2. We explored three methods: (1) laboratory univariate linear regression, (2) empirical multiple linear regression, and (3) machine-learning-based calibration models using random forests (RF). Calibration models were developed for 16-19 RAMP monitors (varied by pollutant) using training and testing windows spanning August 2016 through February 2017 in Pittsburgh, PA, US. The random forest models matched (CO) or significantly outperformed (NO2, CO2, O3) the other calibration models, and their accuracy and precision were robust over time for testing windows of up to 16 weeks. Following calibration, average mean absolute error on the testing data set from the random forest models was 38 ppb for CO (14 % relative error), 10 ppm for CO2 (2 % relative error), 3.5 ppb for NO2 (29 % relative error), and 3.4 ppb for O3 (15 % relative error), and Pearson r versus the reference monitors exceeded 0.8 for most units. Model performance is explored in detail, including a quantification of model variable importance, accuracy across different concentration ranges, and performance in a range of monitoring contexts including the National Ambient Air Quality Standards (NAAQS) and the US EPA Air Sensors Guidebook recommendations of minimum data quality for personal exposure measurement. 
A key strength of the RF approach is that it accounts for pollutant cross-sensitivities. This highlights the importance of developing multipollutant sensor packages (as opposed to single-pollutant monitors); we determined this is especially critical for NO2 and CO2. The evaluation reveals that only the RF-calibrated sensors meet the US EPA Air Sensors Guidebook recommendations of minimum data quality for personal exposure measurement. We also demonstrate that the RF-model-calibrated sensors could detect differences in NO2 concentrations between a near-road site and a suburban site less than 1.5 km away. From this study, we conclude that combining RF models with carefully controlled state-of-the-art multipollutant sensor packages as in the RAMP monitors appears to be a very promising approach to address the poor performance that has plagued low-cost air quality sensors.
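
    The error figures quoted above pair a mean absolute error with a relative error. The abstract does not define the relative error; the sketch below assumes it is the MAE divided by the mean reference concentration. The function names and readings are hypothetical.

```python
def mean_absolute_error(predicted, reference):
    """Average absolute difference between calibrated and reference values."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)


def relative_error(predicted, reference):
    """MAE normalised by the mean reference value (an assumed definition)."""
    mean_ref = sum(reference) / len(reference)
    return mean_absolute_error(predicted, reference) / mean_ref


# Hypothetical CO readings in ppb (not data from the study).
calibrated = [110.0, 190.0, 320.0]
reference = [100.0, 200.0, 300.0]
mae_ppb = mean_absolute_error(calibrated, reference)
rel = relative_error(calibrated, reference)
```

    Reporting both values matters because the same absolute error is a larger fraction of the signal for pollutants with low ambient concentrations.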

  16. Regression trees for predicting mortality in patients with cardiovascular disease: What improvement is achieved by using ensemble-based methods?

    PubMed Central

    Austin, Peter C; Lee, Douglas S; Steyerberg, Ewout W; Tu, Jack V

    2012-01-01

    In biomedical research, the logistic regression model is the most commonly used method for predicting the probability of a binary outcome. While many clinical researchers have expressed an enthusiasm for regression trees, this method may have limited accuracy for predicting health outcomes. We aimed to evaluate the improvement that is achieved by using ensemble-based methods, including bootstrap aggregation (bagging) of regression trees, random forests, and boosted regression trees. We analyzed 30-day mortality in two large cohorts of patients hospitalized with either acute myocardial infarction (N = 16,230) or congestive heart failure (N = 15,848) in two distinct eras (1999–2001 and 2004–2005). We found that both the in-sample and out-of-sample prediction of ensemble methods offered substantial improvement in predicting cardiovascular mortality compared to conventional regression trees. However, conventional logistic regression models that incorporated restricted cubic smoothing splines had even better performance. We conclude that ensemble methods from the data mining and machine learning literature increase the predictive performance of regression trees, but may not lead to clear advantages over conventional logistic regression models for predicting short-term mortality in population-based samples of subjects with cardiovascular disease. PMID:22777999

  17. Deorientation of PolSAR coherency matrix for volume scattering retrieval

    NASA Astrophysics Data System (ADS)

    Kumar, Shashi; Garg, R. D.; Kushwaha, S. P. S.

    2016-05-01

    Polarimetric SAR data has proven its potential to extract scattering information for different features appearing in a single resolution cell. Several decomposition modelling approaches have been developed to retrieve scattering information from PolSAR data. During scattering power decomposition based on physical scattering models, it becomes very difficult to distinguish volume scattering resulting from randomly oriented vegetation from the scattering of oblique structures, which are responsible for double-bounce and volume scattering, because both are decomposed into the same scattering mechanism. The polarization orientation angle (POA) of an electromagnetic wave is one of the most important characteristics that change due to scattering from the geometrical structure of topographic slopes, oriented urban areas and randomly oriented features like vegetation cover. The shift in POA affects the polarimetric radar signatures. So, for accurate estimation of the scattering nature of a feature, compensation for the polarization orientation shift becomes an essential procedure. The prime objective of this work was to investigate the effect of the shift in POA on scattering information retrieval and to explore the effect of deorientation on the regression between field-estimated aboveground biomass (AGB) and volume scattering. For this study, Dudhwa National Park, U.P., India was selected as the study area, and fully polarimetric ALOS PALSAR data was used to retrieve scattering information from the forest area of Dudhwa National Park. Field data for DBH and tree height were collected for AGB estimation using stratified random sampling. AGB was estimated for 170 plots at different locations in the forest area. The Yamaguchi four-component decomposition modelling approach was utilized to retrieve surface, double-bounce, helix and volume scattering information. The shift in polarization orientation angle was estimated, and deorientation of the coherency matrix for compensation of the POA shift was performed. 
The effect of deorientation on the RGB color composite for the forest area can be easily seen. Overestimation of volume scattering and underestimation of double-bounce scattering were recorded for PolSAR decomposition without deorientation, and an increase in double-bounce scattering and a decrease in volume scattering were noticed after deorientation. This study was mainly focused on volume scattering retrieval and its relation with field-estimated AGB. The change in volume scattering after POA compensation of PolSAR data was recorded, and volume scattering values were compared for all 170 forest plots for which field data were collected. A decrease in volume scattering after deorientation was noted for all the plots. Regression between PolSAR decomposition based volume scattering and AGB was performed. Before deorientation, the coefficient of determination (R²) between volume scattering and AGB was 0.225. After deorientation, the coefficient of determination improved to 0.613. This study recommends deorientation of PolSAR data for decomposition modelling to retrieve reliable volume scattering information from forest areas.
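
    The coefficient of determination reported for the volume scattering versus AGB regression can be illustrated with a simple least-squares fit. The abstract does not specify the regression form, so a straight-line fit is assumed here; the function name and values are hypothetical.

```python
def r_squared(x, y):
    """R^2 of an ordinary least-squares straight-line fit of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual and total sums of squares.
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical volume-scattering (x) and biomass (y) values.
r2 = r_squared([1.0, 2.0, 3.0], [1.0, 3.0, 2.0])
```

    R² is the share of variance in the response explained by the fit, so a rise from 0.225 to 0.613 means the deoriented volume scattering explains far more of the variation in field-estimated AGB.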

  18. Landscape-scale consequences of differential tree mortality from catastrophic wind disturbance in the Amazon.

    PubMed

    Rifai, Sami W; Urquiza Muñoz, José D; Negrón-Juárez, Robinson I; Ramírez Arévalo, Fredy R; Tello-Espinoza, Rodil; Vanderwel, Mark C; Lichstein, Jeremy W; Chambers, Jeffrey Q; Bohlman, Stephanie A

    2016-10-01

    Wind disturbance can create large forest blowdowns, which greatly reduce live biomass and add uncertainty to the strength of the Amazon carbon sink. Observational studies from within the central Amazon have quantified blowdown size and estimated total mortality but have not determined which trees are most likely to die from a catastrophic wind disturbance. Also, the impact of spatial dependence upon tree mortality from wind disturbance has seldom been quantified, which is important because wind disturbance often kills clusters of trees due to large treefalls killing surrounding neighbors. We examine (1) the causes of differential mortality between adult trees from a 300-ha blowdown event in the Peruvian region of the northwestern Amazon, (2) how accounting for spatial dependence affects mortality predictions, and (3) how incorporating both differential mortality and spatial dependence affects the landscape-level estimation of necromass produced from the blowdown. Standard regression and spatial regression models were used to estimate how stem diameter, wood density, elevation, and a satellite-derived disturbance metric influenced the probability of tree death from the blowdown event. The model parameters regarding tree characteristics, topography, and spatial autocorrelation of the field data were then used to determine the consequences of non-random mortality for landscape production of necromass through a simulation model. Tree mortality was highly non-random within the blowdown, where tree mortality rates were highest for trees that were large, had low wood density, and were located at high elevation. Of the differential mortality models, the non-spatial models overpredicted necromass, whereas the spatial model slightly underpredicted necromass. 
When parameterized from the same field data, the spatial regression model with differential mortality estimated only 7.5% more dead trees across the entire blowdown than the random mortality model, yet it estimated 51% greater necromass. We suggest that predictions of forest carbon loss from wind disturbance are sensitive to not only the underlying spatial dependence of observations, but also the biological differences between individuals that promote differential levels of mortality. © 2016 by the Ecological Society of America.

  19. Developing a Learning Algorithm-Generated Empirical Relaxer

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Mitchell, Wayne; Kallman, Josh; Toreja, Allen

    2016-03-30

    One of the main difficulties when running Arbitrary Lagrangian-Eulerian (ALE) simulations is determining how much to relax the mesh during the Eulerian step. This determination is currently made by the user on a simulation-by-simulation basis. We present a Learning Algorithm-Generated Empirical Relaxer (LAGER) which uses a regressive random forest algorithm to automate this decision process. We also demonstrate that LAGER successfully relaxes a variety of test problems, maintains simulation accuracy, and has the potential to significantly decrease both the person-hours and computational hours needed to run a successful ALE simulation.

  20. Improving the Spatial Prediction of Soil Organic Carbon Stocks in a Complex Tropical Mountain Landscape by Methodological Specifications in Machine Learning Approaches

    PubMed Central

    Schmidt, Johannes; Glaser, Bruno

    2016-01-01

    Tropical forests are significant carbon sinks and their soils' carbon storage potential is immense. However, little is known about the soil organic carbon (SOC) stocks of tropical mountain areas whose complex soil-landscape and difficult accessibility pose a challenge to spatial analysis. The choice of methodology for spatial prediction is of high importance to improve the expected poor model results in case of low predictor-response correlations. Four aspects were considered to improve model performance in predicting SOC stocks of the organic layer of a tropical mountain forest landscape: different spatial predictor settings, predictor selection strategies, various machine learning algorithms and model tuning. Five machine learning algorithms (random forests, artificial neural networks, multivariate adaptive regression splines, boosted regression trees and support vector machines) were trained and tuned to predict SOC stocks from predictors derived from a digital elevation model and satellite image. Topographical predictors were calculated with a GIS search radius of 45 to 615 m. Finally, three predictor selection strategies were applied to the total set of 236 predictors. All machine learning algorithms, including the model tuning and predictor selection, were compared via five repetitions of a tenfold cross-validation. The boosted regression tree algorithm resulted in the overall best model. SOC stocks ranged from 0.2 to 17.7 kg m⁻², displaying a huge variability, with diffuse insolation and curvatures of different scale guiding the spatial pattern. Predictor selection and model tuning improved the models' predictive performance in all five machine learning algorithms. The rather low number of selected predictors favours forward over backward selection procedures. Choosing predictors by their individual performance was outperformed by the two procedures that accounted for predictor interaction. PMID:27128736

  1. Improving the Spatial Prediction of Soil Organic Carbon Stocks in a Complex Tropical Mountain Landscape by Methodological Specifications in Machine Learning Approaches.

    PubMed

    Ließ, Mareike; Schmidt, Johannes; Glaser, Bruno

    2016-01-01

    Tropical forests are significant carbon sinks and their soils' carbon storage potential is immense. However, little is known about the soil organic carbon (SOC) stocks of tropical mountain areas whose complex soil-landscape and difficult accessibility pose a challenge to spatial analysis. The choice of methodology for spatial prediction is of high importance to improve the expected poor model results in case of low predictor-response correlations. Four aspects were considered to improve model performance in predicting SOC stocks of the organic layer of a tropical mountain forest landscape: different spatial predictor settings, predictor selection strategies, various machine learning algorithms and model tuning. Five machine learning algorithms (random forests, artificial neural networks, multivariate adaptive regression splines, boosted regression trees and support vector machines) were trained and tuned to predict SOC stocks from predictors derived from a digital elevation model and satellite image. Topographical predictors were calculated with a GIS search radius of 45 to 615 m. Finally, three predictor selection strategies were applied to the total set of 236 predictors. All machine learning algorithms, including the model tuning and predictor selection, were compared via five repetitions of a tenfold cross-validation. The boosted regression tree algorithm resulted in the overall best model. SOC stocks ranged from 0.2 to 17.7 kg m⁻², displaying a huge variability, with diffuse insolation and curvatures of different scale guiding the spatial pattern. Predictor selection and model tuning improved the models' predictive performance in all five machine learning algorithms. The rather low number of selected predictors favours forward over backward selection procedures. Choosing predictors by their individual performance was outperformed by the two procedures that accounted for predictor interaction.
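
    The evaluation design described above, five repetitions of a tenfold cross-validation, can be sketched as an index generator. This is a generic illustration rather than the authors' code; the function name, the seed and the sample count are hypothetical.

```python
import random


def repeated_kfold(n_samples, n_folds=10, n_repeats=5, seed=0):
    """Yield (train_indices, test_indices) for repeated k-fold cross-validation.

    Each repetition reshuffles the sample indices and partitions them into
    n_folds disjoint test folds.
    """
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        for fold in range(n_folds):
            test = indices[fold::n_folds]  # every n_folds-th shuffled index
            test_set = set(test)
            train = [i for i in indices if i not in test_set]
            yield train, test


# Hypothetical data set of 23 soil samples: 5 x 10 = 50 train/test splits.
splits = list(repeated_kfold(23))
```

    Repeating the partitioning reduces the dependence of the performance estimate on any single random split, which helps when comparing several algorithms on a small data set.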

  2. Introducing two Random Forest based methods for cloud detection in remote sensing images

    NASA Astrophysics Data System (ADS)

    Ghasemian, Nafiseh; Akhoondzadeh, Mehdi

    2018-07-01

    Cloud detection is a necessary phase in satellite image processing to retrieve atmospheric and lithospheric parameters. Currently, some cloud detection methods based on the Random Forest (RF) model have been proposed, but they do not consider both spectral and textural characteristics of the image. Furthermore, they have not been tested in the presence of snow/ice. In this paper, we introduce two RF based algorithms, Feature Level Fusion Random Forest (FLFRF) and Decision Level Fusion Random Forest (DLFRF), to incorporate visible, infrared (IR) and thermal spectral and textural features (FLFRF), including Gray Level Co-occurrence Matrix (GLCM) and Robust Extended Local Binary Pattern (RELBP_CI), or visible, IR and thermal classifiers (DLFRF) for highly accurate cloud detection on remote sensing images. FLFRF first fuses visible, IR and thermal features. Thereafter, it uses the RF model to classify pixels as cloud, snow/ice and background or thick cloud, thin cloud and background. DLFRF considers visible, IR and thermal features (both spectral and textural) separately and feeds each set of features to the RF model. Then, it holds the vote matrix of each run of the model. Finally, it fuses the classifiers using the majority vote method. To demonstrate the effectiveness of the proposed algorithms, 10 Terra MODIS and 15 Landsat 8 OLI/TIRS images with different spatial resolutions are used in this paper. Quantitative analyses are based on manually selected ground truth data. Results show that cloud detection accuracy improves after adding RELBP_CI to the input feature set. Also, the average cloud kappa values of FLFRF and DLFRF on MODIS images (1 and 0.99) are higher than those of other machine learning methods, Linear Discriminant Analysis (LDA), Classification And Regression Tree (CART), K Nearest Neighbor (KNN) and Support Vector Machine (SVM) (0.96). The average snow/ice kappa values of FLFRF and DLFRF on MODIS images (1 and 0.85) are higher than those of other traditional methods.
The quantitative values on Landsat 8 images show a similar trend. Consequently, while SVM and K-nearest neighbor overestimate cloud and snow/ice pixels, our Random Forest (RF) based models can achieve higher cloud and snow/ice kappa values on MODIS images and higher thin cloud, thick cloud and snow/ice kappa values on Landsat 8 images. Our algorithms predict both thin and thick cloud on Landsat 8 images, while the existing cloud detection algorithm, Fmask, cannot discriminate them. Compared to the state-of-the-art methods, our algorithms have acquired higher average cloud and snow/ice kappa values for different spatial resolutions.
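
    The decision-level fusion step described in this abstract (one classifier per feature set, fused by majority vote) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name majority_vote, the toy labels and the tie-breaking rule are assumptions made for illustration, and the abstract does not say how ties or the per-run vote matrices are handled.

```python
from collections import Counter


def majority_vote(per_classifier_labels):
    """Fuse label sequences from several classifiers by per-pixel majority vote.

    per_classifier_labels: list of equal-length label lists, one per classifier
    (for example visible, IR and thermal). Ties go to the label that appears
    first in classifier order, which is an assumption of this sketch.
    """
    if not per_classifier_labels:
        raise ValueError("need at least one classifier")
    n = len(per_classifier_labels[0])
    if any(len(labels) != n for labels in per_classifier_labels):
        raise ValueError("all classifiers must label the same pixels")
    fused = []
    for pixel_votes in zip(*per_classifier_labels):
        counts = Counter(pixel_votes)
        best = max(counts.values())
        # Pick the first label, in classifier order, that reaches the top count.
        fused.append(next(v for v in pixel_votes if counts[v] == best))
    return fused


# Hypothetical per-pixel predictions from three separately trained classifiers.
visible = ["cloud", "snow", "background", "cloud"]
infrared = ["cloud", "cloud", "background", "snow"]
thermal = ["snow", "snow", "background", "background"]
fused = majority_vote([visible, infrared, thermal])
```

    In this toy input the last pixel is a three-way tie, so the result there depends entirely on the assumed tie-breaking rule.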

  3. [Estimating individual tree aboveground biomass of the mid-subtropical forest using airborne LiDAR technology].

    PubMed

    Liu, Feng; Tan, Chang; Lei, Pi-Feng

    2014-11-01

    Taking Wugang forest farm in Xuefeng Mountain as the research object, and using airborne light detection and ranging (LiDAR) data acquired under leaf-on conditions together with field data from concomitant plots, this paper assessed the ability of LiDAR technology to estimate aboveground biomass of the mid-subtropical forest. A semi-automated individual tree LiDAR point cloud segmentation was obtained by using conditional random fields and optimization methods. Spatial structure, waveform characteristics and topography were calculated as LiDAR metrics from the segmented objects. Then statistical models between aboveground biomass from field data and these LiDAR metrics were built. The individual tree recognition rates were 93%, 86% and 60% for coniferous, broadleaf and mixed forests, respectively. The adjusted coefficients of determination (adjusted R²) and the root mean squared errors (RMSE) for the three types of forest were 0.83, 0.81 and 0.74, and 28.22, 29.79 and 32.31 t·hm⁻², respectively. The estimation capability of the model based on canopy geometric volume, tree percentile height, slope and waveform characteristics was much better than that of the traditional regression model based on tree height. Therefore, LiDAR metrics from individual trees could facilitate better performance in biomass estimation.

  4. Quantifying Biomass from Point Clouds by Connecting Representations of Ecosystem Structure

    NASA Astrophysics Data System (ADS)

    Hendryx, S. M.; Barron-Gafford, G.

    2017-12-01

    Quantifying terrestrial ecosystem biomass is an essential part of monitoring carbon stocks and fluxes within the global carbon cycle and optimizing natural resource management. Point cloud data such as from lidar and structure from motion can be effective for quantifying biomass over large areas, but significant challenges remain in developing effective models that allow for such predictions. Inference models that estimate biomass from point clouds are established in many environments, yet, are often scale-dependent, needing to be fitted and applied at the same spatial scale and grid size at which they were developed. Furthermore, training such models typically requires large in situ datasets that are often prohibitively costly or time-consuming to obtain. We present here a scale- and sensor-invariant framework for efficiently estimating biomass from point clouds. Central to this framework, we present a new algorithm, assignPointsToExistingClusters, that has been developed for finding matches between in situ data and clusters in remotely-sensed point clouds. The algorithm can be used for assessing canopy segmentation accuracy and for training and validating machine learning models for predicting biophysical variables. We demonstrate the algorithm's efficacy by using it to train a random forest model of above ground biomass in a shrubland environment in Southern Arizona. We show that by learning a nonlinear function to estimate biomass from segmented canopy features we can reduce error, especially in the presence of inaccurate clusterings, when compared to a traditional, deterministic technique to estimate biomass from remotely measured canopies. Our random forest on cluster features model extends established methods of training random forest regressions to predict biomass of subplots but requires significantly less training data and is scale invariant. 
The random forest on cluster features model reduced mean absolute error, when evaluated on all test data in leave one out cross validation, by 40.6% from deterministic mesquite allometry and 35.9% from the inferred ecosystem-state allometric function. Our framework should allow for the inference of biomass more efficiently than common subplot methods and more accurately than individual tree segmentation methods in densely vegetated environments.

  5. Cascaded face alignment via intimacy definition feature

    NASA Astrophysics Data System (ADS)

    Li, Hailiang; Lam, Kin-Man; Chiu, Man-Yau; Wu, Kangheng; Lei, Zhibin

    2017-09-01

    Recent years have witnessed the emerging popularity of regression-based face aligners, which directly learn mappings between facial appearance and shape-increment manifolds. We propose a random-forest based, cascaded regression model for face alignment by using a locally lightweight feature, namely intimacy definition feature. This feature is more discriminative than the pose-indexed feature, more efficient than the histogram of oriented gradients feature and the scale-invariant feature transform feature, and more compact than the local binary feature (LBF). Experimental validation of our algorithm shows that our approach achieves state-of-the-art performance when testing on some challenging datasets. Compared with the LBF-based algorithm, our method achieves about twice the speed, 20% improvement in terms of alignment accuracy and saves an order of magnitude on memory requirement.

  6. Biodiversity mapping in a tropical West African forest with airborne hyperspectral data.

    PubMed

    Vaglio Laurin, Gaia; Cheung-Wai Chan, Jonathan; Chen, Qi; Lindsell, Jeremy A; Coomes, David A; Guerriero, Leila; Del Frate, Fabio; Miglietta, Franco; Valentini, Riccardo

    2014-01-01

    Tropical forests are major repositories of biodiversity, but are fast disappearing as land is converted to agriculture. Decision-makers need to know which of the remaining forests to prioritize for conservation, but the only spatial information on forest biodiversity has, until recently, come from a sparse network of ground-based plots. Here we explore whether airborne hyperspectral imagery can be used to predict the alpha diversity of upper canopy trees in a West African forest. The abundance of tree species were collected from 64 plots (each 1250 m(2) in size) within a Sierra Leonean national park, and Shannon-Wiener biodiversity indices were calculated. An airborne spectrometer measured reflectances of 186 bands in the visible and near-infrared spectral range at 1 m(2) resolution. The standard deviations of these reflectance values and their first-order derivatives were calculated for each plot from the c. 1250 pixels of hyperspectral information within them. Shannon-Wiener indices were then predicted from these plot-based reflectance statistics using a machine-learning algorithm (Random Forest). The regression model fitted the data well (pseudo-R(2) = 84.9%), and we show that standard deviations of green-band reflectances and infra-red region derivatives had the strongest explanatory powers. Our work shows that airborne hyperspectral sensing can be very effective at mapping canopy tree diversity, because its high spatial resolution allows within-plot heterogeneity in reflectance to be characterized, making it an effective tool for monitoring forest biodiversity over large geographic scales.
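
    The Shannon-Wiener index used as the response variable in this record is computed from per-plot species abundances as H' = -sum(p_i * ln(p_i)). The sketch below is only an illustration: the function name shannon_wiener, the toy abundances and the choice of natural logarithm are assumptions, since the abstract does not state the log base.

```python
import math


def shannon_wiener(abundances):
    """Return H' = -sum(p_i * ln(p_i)) over species with non-zero abundance."""
    total = sum(abundances)
    if total <= 0:
        raise ValueError("abundances must sum to a positive value")
    h = 0.0
    for count in abundances:
        if count > 0:
            p = count / total  # relative abundance of this species
            h -= p * math.log(p)
    return h


# Hypothetical plots: four equally common species vs. one dominant species.
even_plot = [10, 10, 10, 10]
skewed_plot = [37, 1, 1, 1]
h_even = shannon_wiener(even_plot)
h_skewed = shannon_wiener(skewed_plot)
```

    For a fixed number of species the index is largest when abundances are equal, which is why the even plot scores higher than the skewed one.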

  7. Biodiversity Mapping in a Tropical West African Forest with Airborne Hyperspectral Data

    PubMed Central

    Vaglio Laurin, Gaia; Chan, Jonathan Cheung-Wai; Chen, Qi; Lindsell, Jeremy A.; Coomes, David A.; Guerriero, Leila; Frate, Fabio Del; Miglietta, Franco; Valentini, Riccardo

    2014-01-01

    Tropical forests are major repositories of biodiversity, but are fast disappearing as land is converted to agriculture. Decision-makers need to know which of the remaining forests to prioritize for conservation, but the only spatial information on forest biodiversity has, until recently, come from a sparse network of ground-based plots. Here we explore whether airborne hyperspectral imagery can be used to predict the alpha diversity of upper canopy trees in a West African forest. The abundance of tree species were collected from 64 plots (each 1250 m2 in size) within a Sierra Leonean national park, and Shannon-Wiener biodiversity indices were calculated. An airborne spectrometer measured reflectances of 186 bands in the visible and near-infrared spectral range at 1 m2 resolution. The standard deviations of these reflectance values and their first-order derivatives were calculated for each plot from the c. 1250 pixels of hyperspectral information within them. Shannon-Wiener indices were then predicted from these plot-based reflectance statistics using a machine-learning algorithm (Random Forest). The regression model fitted the data well (pseudo-R2 = 84.9%), and we show that standard deviations of green-band reflectances and infra-red region derivatives had the strongest explanatory powers. Our work shows that airborne hyperspectral sensing can be very effective at mapping canopy tree diversity, because its high spatial resolution allows within-plot heterogeneity in reflectance to be characterized, making it an effective tool for monitoring forest biodiversity over large geographic scales. PMID:24937407

  8. Non-random species loss in a forest herbaceous layer following nitrogen addition

    Treesearch

    Christopher A. Walter; Mary Beth Adams; Frank S. Gilliam; William T. Peterjohn

    2017-01-01

    Nitrogen (N) additions have decreased species richness (S) in hardwood forest herbaceous layers, yet the functional mechanisms for these decreases have not been explicitly evaluated. We tested two hypothesized mechanisms, random species loss (RSL) and non-random species loss (NRSL), in the hardwood forest herbaceous layer of a long-term, plot-scale...

  9. Evaluation and prediction of shrub cover in coastal Oregon forests (USA)

    Treesearch

    Becky K. Kerns; Janet L. Ohmann

    2004-01-01

    We used data from regional forest inventories and research programs, coupled with mapped climatic and topographic information, to explore relationships and develop multiple linear regression (MLR) and regression tree models for total and deciduous shrub cover in the Oregon coastal province. Results from both types of models indicate that forest structure variables were...

  10. Forest type mapping of the Interior West

    Treesearch

    Bonnie Ruefenacht; Gretchen G. Moisen; Jock A. Blackard

    2004-01-01

    This paper develops techniques for the mapping of forest types in Arizona, New Mexico, and Wyoming. The methods involve regression-tree modeling using a variety of remote sensing and GIS layers along with Forest Inventory Analysis (FIA) point data. Regression-tree modeling is a fast and efficient technique of estimating variables for large data sets with high accuracy...

  11. Environmental adversities and psychotic symptoms: The impact of timing of trauma, abuse, and neglect.

    PubMed

    Schalinski, Inga; Breinlinger, Susanne; Hirt, Vanessa; Teicher, Martin H; Odenwald, Michael; Rockstroh, Brigitte

    2017-11-13

    Trauma and adverse childhood experiences (ACE) occur more often in mental illness, including psychosis, than in the general population. Individuals with psychosis (cases) report a higher number and severity (dose) of adversities than healthy controls. While a dose-dependent increase of adversities has been related to more severe psychopathology, the role of type and timing is still insufficiently understood on the exacerbation of positive and negative psychotic symptoms. Moreover, dissociative symptoms were examined as potential mediator between adversities and severity of psychotic symptoms. Exposure to adversities were assessed by interviews in n=180 cases and n=70 controls. In cases, symptom severities were obtained for psychotic symptoms and dissociation. Conditioned random forest regression determined the importance of type and timing of ACE for positive and negative symptom severity, and mediator analyses evaluated the role of dissociative symptoms in the relationship between adversities and psychotic symptoms. Cases experienced substantially more abuse and neglect than controls. Adversities were related in a dose-dependent manner to psychotic disorder. An array of adversities was associated with more severe positive symptoms, while the conditioned random forest regression depicted neglect at age 10 as the most important predictor. Dissociative symptoms mediated the small relation of trauma load in childhood and positive symptoms. The role of trauma and ACE on psychotic symptoms can be specified by neglect during frontocortical development in the exacerbation of positive symptoms. The mediating role of dissociation is restricted to the relation of childhood trauma and positive symptoms. Copyright © 2017 Elsevier B.V. All rights reserved.

  12. Policy Implications and Suggestions on Administrative Measures of Urban Flood

    NASA Astrophysics Data System (ADS)

    Lee, S. V.; Lee, M. J.; Lee, C.; Yoon, J. H.; Chae, S. H.

    2017-12-01

    The frequency and intensity of floods are increasing worldwide as recent climate change progresses gradually. Flood management should be policy-oriented in urban municipalities due to the characteristics of urban areas with a lot of damage. Therefore, the purpose of this study is to prepare a flood susceptibility map by using data mining model and make a policy suggestion on administrative measures of urban flood. Therefore, we constructed a spatial database by collecting relevant factors including the topography, geology, soil and land use data of the representative city, Seoul, the capital city of Korea. Flood susceptibility map was constructed by applying the data mining models of random forest and boosted tree model to input data and existing flooded area data in 2010. The susceptibility map has been validated using the 2011 flood area data which was not used for training. The predictor importance value of each factor to the results was calculated in this process. The distance from the water, DEM and geology showed a high predictor importance value which means to be a high priority for flood preparation policy. As a result of receiver operating characteristic (ROC), random forest model showed 78.78% and 79.18% accuracy of regression and classification and boosted tree model showed 77.55% and 77.26% accuracy of regression and classification, respectively. The results show that the flood susceptibility maps can be applied to flood prevention and management, and it also can help determine the priority areas for flood mitigation policy by providing useful information to policy makers.

  13. Problematic internet use (PIU): Associations with the impulsive-compulsive spectrum. An application of machine learning in psychiatry.

    PubMed

    Ioannidis, Konstantinos; Chamberlain, Samuel R; Treder, Matthias S; Kiraly, Franz; Leppink, Eric W; Redden, Sarah A; Stein, Dan J; Lochner, Christine; Grant, Jon E

    2016-12-01

    Problematic internet use is common, functionally impairing, and in need of further study. Its relationship with obsessive-compulsive and impulsive disorders is unclear. Our objective was to evaluate whether problematic internet use can be predicted from recognised forms of impulsive and compulsive traits and symptomatology. We recruited volunteers aged 18 and older using media advertisements at two sites (Chicago USA, and Stellenbosch, South Africa) to complete an extensive online survey. State-of-the-art out-of-sample evaluation of machine learning predictive models was used, which included Logistic Regression, Random Forests and Naïve Bayes. Problematic internet use was identified using the Internet Addiction Test (IAT). 2006 complete cases were analysed, of whom 181 (9.0%) had moderate/severe problematic internet use. Using Logistic Regression and Naïve Bayes we produced a classification prediction with a receiver operating characteristic area under the curve (ROC-AUC) of 0.83 (SD 0.03) whereas using a Random Forests algorithm the prediction ROC-AUC was 0.84 (SD 0.03) [all three models superior to baseline models p < 0.0001]. The models showed robust transfer between the study sites in all validation sets [p < 0.0001]. Prediction of problematic internet use was possible using specific measures of impulsivity and compulsivity in a population of volunteers. Moreover, this study offers proof-of-concept in support of using machine learning in psychiatry to demonstrate replicability of results across geographically and culturally distinct settings. Copyright © 2016 The Author(s). Published by Elsevier Ltd.. All rights reserved.
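
    The ROC-AUC values reported in this record can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes that quantity directly by pairwise comparison. The function name roc_auc and the toy labels and scores are assumptions for illustration and are unrelated to the study data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for two negatives and two positives.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
```

    The pairwise loop is quadratic in the number of cases; it is written for clarity rather than speed.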

  14. Application of random forests methods to diabetic retinopathy classification analyses.

    PubMed

    Casanova, Ramon; Saldana, Santiago; Chew, Emily Y; Danis, Ronald P; Greven, Craig M; Ambrosius, Walter T

    2014-01-01

    Diabetic retinopathy (DR) is one of the leading causes of blindness in the United States and world-wide. DR is a silent disease that may go unnoticed until it is too late for effective treatment. Therefore, early detection could improve the chances of therapeutic interventions that would alleviate its effects. Graded fundus photography and systemic data from 3443 ACCORD-Eye Study participants were used to estimate Random Forest (RF) and logistic regression classifiers. We studied the impact of sample size on classifier performance and the possibility of using RF generated class conditional probabilities as metrics describing DR risk. RF measures of variable importance are used to detect factors that affect classification performance. Both types of data were informative when discriminating participants with or without DR. RF based models produced much higher classification accuracy than those based on logistic regression. Combining both types of data did not increase accuracy but did increase statistical discrimination of healthy participants who subsequently did or did not have DR events during four years of follow-up. RF variable importance criteria revealed that microaneurysm counts in both eyes seemed to play the most important role in discrimination among the graded fundus variables, while the number of medicines and diabetes duration were the most relevant among the systemic variables. We have introduced RF methods to DR classification analyses based on fundus photography data. In addition, we propose an approach to DR risk assessment based on metrics derived from graded fundus photography and systemic data. Our results suggest that RF methods could be a valuable tool to diagnose DR and evaluate its progression.

  15. Combining macula clinical signs and patient characteristics for age-related macular degeneration diagnosis: a machine learning approach.

    PubMed

    Fraccaro, Paolo; Nicolo, Massimo; Bonetto, Monica; Giacomini, Mauro; Weller, Peter; Traverso, Carlo Enrico; Prosperi, Mattia; OSullivan, Dympna

    2015-01-27

    To investigate machine learning methods, ranging from simpler interpretable techniques to complex (non-linear) "black-box" approaches, for automated diagnosis of Age-related Macular Degeneration (AMD). Data from healthy subjects and patients diagnosed with AMD or other retinal diseases were collected during routine visits via an Electronic Health Record (EHR) system. Patients' attributes included demographics and, for each eye, presence/absence of major AMD-related clinical signs (soft drusen, retinal pigment epithelium, defects/pigment mottling, depigmentation area, subretinal haemorrhage, subretinal fluid, macula thickness, macular scar, subretinal fibrosis). Interpretable techniques known as white box methods, including logistic regression and decision trees, as well as less interpretable techniques known as black box methods, such as support vector machines (SVM), random forests and AdaBoost, were used to develop models (trained and validated on unseen data) to diagnose AMD. The gold standard was confirmed diagnosis of AMD by physicians. Sensitivity, specificity and area under the receiver operating characteristic curve (AUC) were used to assess performance. Study population included 487 patients (912 eyes). In terms of AUC, random forests, logistic regression and AdaBoost showed a mean performance of 0.92, followed by SVM and decision trees (0.90). All machine learning models identified soft drusen and age as the most discriminating variables in clinicians' decision pathways to diagnose AMD. Both black-box and white box methods performed well in identifying diagnoses of AMD and their decision pathways. Machine learning models developed through the proposed approach, relying on clinical signs identified by retinal specialists, could be embedded into EHR to provide physicians with real time (interpretable) support.

  16. Predicting diabetes mellitus using SMOTE and ensemble machine learning approach: The Henry Ford ExercIse Testing (FIT) project.

    PubMed

    Alghamdi, Manal; Al-Mallah, Mouaz; Keteyian, Steven; Brawner, Clinton; Ehrman, Jonathan; Sakr, Sherif

    2017-01-01

    Machine learning is becoming a popular and important approach in the field of medical research. In this study, we investigate the relative performance of various machine learning methods such as Decision Tree, Naïve Bayes, Logistic Regression, Logistic Model Tree and Random Forests for predicting incident diabetes using medical records of cardiorespiratory fitness. In addition, we apply different techniques to uncover potential predictors of diabetes. This FIT project study used data of 32,555 patients who are free of any known coronary artery disease or heart failure who underwent clinician-referred exercise treadmill stress testing at Henry Ford Health Systems between 1991 and 2009 and had a complete 5-year follow-up. At the completion of the fifth year, 5,099 of those patients have developed diabetes. The dataset contained 62 attributes classified into four categories: demographic characteristics, disease history, medication use history, and stress test vital signs. We developed an Ensembling-based predictive model using 13 attributes that were selected based on their clinical importance, Multiple Linear Regression, and Information Gain Ranking methods. The negative effect of the imbalance class of the constructed model was handled by Synthetic Minority Oversampling Technique (SMOTE). The overall performance of the predictive model classifier was improved by the Ensemble machine learning approach using the Vote method with three Decision Trees (Naïve Bayes Tree, Random Forest, and Logistic Model Tree) and achieved high accuracy of prediction (AUC = 0.92). The study shows the potential of ensembling and SMOTE approaches for predicting incident diabetes using cardiorespiratory fitness data.
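
    The core of SMOTE, which this record uses to handle class imbalance, is to create synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below shows only that interpolation step. The function name smote_sample, its parameters, the toy data and the random choice of base points are assumptions for illustration and do not reproduce the study's pipeline.

```python
import random


def smote_sample(minority, k=2, n_new=4, seed=0):
    """Create n_new synthetic points between minority points and their
    k nearest minority neighbours (Euclidean distance)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        # Simplification: base points are drawn at random rather than
        # visiting every minority sample in turn.
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        # Sort the remaining minority points by squared distance to the base.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Hypothetical two-feature minority class.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]
new_points = smote_sample(minority)
```

    Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the region the minority class already occupies.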

  17. Overstory structure and soil nutrients effect on plant diversity in unmanaged moist tropical forest

    NASA Astrophysics Data System (ADS)

    Gautam, Mukesh Kumar; Manhas, Rajesh Kumar; Tripathi, Ashutosh Kumar

    2016-08-01

    Forests with intensive management past are kept unmanaged to restore diversity and ecosystem functioning. Before perpetuating abandonment after protracted restitution, understanding its effect on forest vegetation is desirable. We studied plant diversity and its relation with environmental variables and stand structure in northern Indian unmanaged tropical moist deciduous forest. We hypothesized that post-abandonment species richness would have increased, and the structure of contemporary forest would be heterogeneous. Vegetation structure, composition, and diversity were recorded, in forty 0.1 ha plots selected randomly in four forest ranges. Three soil samples per 0.1 ha were assessed for physicochemistry, fine sand, and clay mineralogy. Contemporary forest had less species richness than pre-abandonment reference period. Fourteen species were recorded as either seedling or sapling, suggesting reappearance or immigration. For most species, regeneration was either absent or impaired. Ordination and multiple regression results showed that exchangeable base cations and phosphorous affected maximum tree diversity and structure variables. Significant correlations between soil moisture and temperature, and shrub layer was observed, besides tree layer correspondence with shrub richness, suggesting that dense overstory resulting from abandonment through its effect on soil conditions, is responsible for dense shrub layer. Herb layer diversity was negatively associated with tree layer and shrub overgrowth (i.e. Mallotus spp.). Protracted abandonment may not reinforce species richness and heterogeneity; perhaps result in high tree and shrub density in moist deciduous forests, which can impede immigrating or reappearing plant species establishment. This can be overcome by density/basal area reduction strategies, albeit for both tree and shrub layer.

  18. Improving ensemble decision tree performance using Adaboost and Bagging

    NASA Astrophysics Data System (ADS)

    Hasan, Md. Rajib; Siraj, Fadzilah; Sainin, Mohd Shamrie

    2015-12-01

    Ensemble classifier systems are considered among the most promising approaches in medical data classification, and the performance of a decision tree classifier can be increased by the ensemble method, as it has been shown to be better than single classifiers. However, in an ensemble setting the performance depends on the selection of a suitable base classifier. This research employed two prominent ensembles, namely Adaboost and Bagging, with base classifiers such as Random Forest, Random Tree, j48, j48grafts and Logistic Model Regression (LMT) that have been selected independently. The empirical study shows that the performance varies when different base classifiers are selected, and in some cases an overfitting issue was also noted. The evidence shows that ensemble decision tree classifiers using Adaboost and Bagging improve the performance on the selected medical data sets.

  19. Combined use of two supervised learning algorithms to model sea turtle behaviours from tri-axial acceleration data.

    PubMed

    Jeantet, L; Dell'Amico, F; Forin-Wiart, M-A; Coutant, M; Bonola, M; Etienne, D; Gresser, J; Regis, S; Lecerf, N; Lefebvre, F; de Thoisy, B; Le Maho, Y; Brucker, M; Châtelain, N; Laesser, R; Crenner, F; Handrich, Y; Wilson, R; Chevallier, D

    2018-05-23

    Accelerometers are becoming ever more important sensors in animal-attached technology, providing data that allow determination of body posture and movement and thereby helping to elucidate behaviour in animals that are difficult to observe. We sought to validate the identification of sea turtle behaviours from accelerometer signals by deploying tags on the carapace of a juvenile loggerhead (Caretta caretta), an adult hawksbill (Eretmochelys imbricata) and an adult green turtle (Chelonia mydas) at Aquarium La Rochelle, France. We recorded tri-axial acceleration at 50 Hz for each species for a full day while two fixed cameras recorded their behaviours. We identified behaviours from the acceleration data using two different supervised learning algorithms, Random Forest and Classification And Regression Tree (CART), treating the data from the adult animals as separate from the juvenile data. We achieved a global accuracy of 81.30% for the adult hawksbill and green turtle CART model and 71.63% for the juvenile loggerhead, identifying 10 and 12 different behaviours, respectively. Equivalent figures were 86.96% for the adult hawksbill and green turtle Random Forest model and 79.49% for the juvenile loggerhead, for the same behaviours. The use of Random Forest combined with CART algorithms allowed us to understand the decision rules implicated in behaviour discrimination, and thus remove or group together some 'confused' or under-represented behaviours in order to get the most accurate models. This study is the first to validate accelerometer data to identify turtle behaviours and the approach can now be tested on other captive sea turtle species. © 2018. Published by The Company of Biologists Ltd.

  20. Inside the black box: starting to uncover the underlying decision rules used in one-by-one expert assessment of occupational exposure in case-control studies

    PubMed Central

    Wheeler, David C.; Burstyn, Igor; Vermeulen, Roel; Yu, Kai; Shortreed, Susan M.; Pronk, Anjoeka; Stewart, Patricia A.; Colt, Joanne S.; Baris, Dalsu; Karagas, Margaret R.; Schwenn, Molly; Johnson, Alison; Silverman, Debra T.; Friesen, Melissa C.

    2014-01-01

    Objectives: Evaluating occupational exposures in population-based case-control studies often requires exposure assessors to review each study participant's reported occupational information job-by-job to derive exposure estimates. Although such assessments likely have underlying decision rules, they usually lack transparency, are time-consuming and have uncertain reliability and validity. We aimed to identify the underlying rules to enable documentation, review, and future use of these expert-based exposure decisions. Methods: Classification and regression trees (CART, predictions from a single tree) and random forests (predictions from many trees) were used to identify the underlying rules from the questionnaire responses and an expert's exposure assignments for occupational diesel exhaust exposure for several metrics: binary exposure probability and ordinal exposure probability, intensity, and frequency. Data were split into training (n=10,488 jobs), testing (n=2,247), and validation (n=2,248) data sets. Results: The CART and random forest models' predictions agreed with 92–94% of the expert's binary probability assignments. For ordinal probability, intensity, and frequency metrics, the two models extracted decision rules more successfully for unexposed and highly exposed jobs (86–90% and 57–85%, respectively) than for low or medium exposed jobs (7–71%). Conclusions: CART and random forest models extracted decision rules and accurately predicted an expert's exposure decisions for the majority of jobs and identified questionnaire response patterns that would require further expert review if the rules were applied to other jobs in the same or different study. This approach makes the exposure assessment process in case-control studies more transparent and creates a mechanism to efficiently replicate exposure decisions in future studies. PMID:23155187

  1. Genome analysis of Legionella pneumophila strains using a mixed-genome microarray.

    PubMed

    Euser, Sjoerd M; Nagelkerke, Nico J; Schuren, Frank; Jansen, Ruud; Den Boer, Jeroen W

    2012-01-01

    Legionella, the causative agent for Legionnaires' disease, is ubiquitous in both natural and man-made aquatic environments. The distribution of Legionella genotypes within clinical strains is significantly different from that found in environmental strains. Developing novel genotypic methods that offer the ability to distinguish clinical from environmental strains could help to focus on more relevant (virulent) Legionella species in control efforts. Mixed-genome microarray data can be used to perform a comparative-genome analysis of strain collections, and advanced statistical approaches, such as the Random Forest algorithm are available to process these data. Microarray analysis was performed on a collection of 222 Legionella pneumophila strains, which included patient-derived strains from notified cases in The Netherlands in the period 2002-2006 and the environmental strains that were collected during the source investigation for those patients within the Dutch National Legionella Outbreak Detection Programme. The Random Forest algorithm combined with a logistic regression model was used to select predictive markers and to construct a predictive model that could discriminate between strains from different origin: clinical or environmental. Four genetic markers were selected that correctly predicted 96% of the clinical strains and 66% of the environmental strains collected within the Dutch National Legionella Outbreak Detection Programme. The Random Forest algorithm is well suited for the development of prediction models that use mixed-genome microarray data to discriminate between Legionella strains from different origin. The identification of these predictive genetic markers could offer the possibility to identify virulence factors within the Legionella genome, which in the future may be implemented in the daily practice of controlling Legionella in the public health environment.

  2. An assessment of the effectiveness of a random forest classifier for land-cover classification

    NASA Astrophysics Data System (ADS)

    Rodriguez-Galiano, V. F.; Ghimire, B.; Rogan, J.; Chica-Olmo, M.; Rigol-Sanchez, J. P.

    2012-01-01

    Land cover monitoring using remotely sensed data requires robust classification methods which allow for the accurate mapping of complex land cover and land use categories. Random forest (RF) is a powerful machine learning classifier that is relatively unknown in land remote sensing and has not been evaluated thoroughly by the remote sensing community compared to more conventional pattern recognition techniques. Key advantages of RF include: their non-parametric nature; high classification accuracy; and capability to determine variable importance. However, the split rules for classification are unknown, therefore RF can be considered to be black box type classifier. RF provides an algorithm for estimating missing values; and flexibility to perform several types of data analysis, including regression, classification, survival analysis, and unsupervised learning. In this paper, the performance of the RF classifier for land cover classification of a complex area is explored. Evaluation was based on several criteria: mapping accuracy, sensitivity to data set size and noise. Landsat-5 Thematic Mapper data captured in European spring and summer were used with auxiliary variables derived from a digital terrain model to classify 14 different land categories in the south of Spain. Results show that the RF algorithm yields accurate land cover classifications, with 92% overall accuracy and a Kappa index of 0.92. RF is robust to training data reduction and noise because significant differences in kappa values were only observed for data reduction and noise addition values greater than 50 and 20%, respectively. Additionally, variables that RF identified as most important for classifying land cover coincided with expectations. A McNemar test indicates an overall better performance of the random forest model over a single decision tree at the 0.00001 significance level.
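
    The overall accuracy and Kappa index reported above are both derived from a confusion matrix. The following is a minimal sketch of that calculation, not the authors' code; the 2-class matrix and the function names are hypothetical and chosen only for illustration.

```python
def overall_accuracy(confusion):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def cohens_kappa(confusion):
    """Cohen's kappa: agreement beyond what chance alone would produce."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = overall_accuracy(confusion)
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    # Chance agreement: product of matching row and column marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical confusion matrix (rows: reference class, columns: predicted class).
example = [[45, 5],
           [10, 40]]
print(overall_accuracy(example), cohens_kappa(example))
```

    Kappa is lower than raw accuracy whenever part of the agreement could be expected by chance, which is why the two figures are usually reported together.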

  3. Implications of sampling design and sample size for national carbon accounting systems.

    PubMed

    Köhl, Michael; Lister, Andrew; Scott, Charles T; Baldauf, Thomas; Plugge, Daniel

    2011-11-08

    Countries willing to adopt a REDD regime need to establish a national Measurement, Reporting and Verification (MRV) system that provides information on forest carbon stocks and carbon stock changes. Due to the extensive areas covered by forests the information is generally obtained by sample-based surveys. Most operational sampling approaches utilize a combination of earth-observation data and in-situ field assessments as data sources. We compared the cost-efficiency of four different sampling design alternatives (simple random sampling, regression estimators, stratified sampling, 2-phase sampling with regression estimators) that have been proposed in the scope of REDD. Three of the design alternatives provide for a combination of in-situ and earth-observation data. Under different settings of remote sensing coverage, cost per field plot, cost of remote sensing imagery, correlation between attributes quantified in remote sensing and field data, as well as population variability, the percent standard error over total survey cost was calculated. The cost-efficiency of forest carbon stock assessments is driven by the sampling design chosen. Our results indicate that the cost of remote sensing imagery is decisive for the cost-efficiency of a sampling design. The variability of the sample population impairs cost-efficiency, but does not reverse the pattern of cost-efficiency of the individual design alternatives. Our results clearly indicate that it is important to consider cost-efficiency in the development of forest carbon stock assessments and the selection of remote sensing techniques. The development of MRV-systems for REDD needs to be based on a sound optimization process that compares different data sources and sampling designs with respect to their cost-efficiency. This helps to reduce the uncertainties related to the quantification of carbon stocks and to increase the financial benefits from adopting a REDD regime.
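
    The percent standard error used above as the precision measure can be illustrated for the simplest of the four designs. This is a sketch for simple random sampling only, assuming no finite population correction; the plot values and the function name are hypothetical, and the other designs in the study use different estimators.

```python
import math
import statistics


def percent_standard_error(plot_values):
    """Standard error of the sample mean, expressed as a percentage of the mean."""
    n = len(plot_values)
    mean = statistics.mean(plot_values)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    standard_error = statistics.stdev(plot_values) / math.sqrt(n)
    return 100.0 * standard_error / mean


# Hypothetical carbon stock values from five field plots.
plots = [10.0, 12.0, 14.0, 16.0, 18.0]
print(round(percent_standard_error(plots), 2))
```

    Because the standard error shrinks with the square root of the number of plots, halving the percent standard error requires roughly four times as many plots, which is what links precision to survey cost.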

  4. Sample entropy analysis for the estimating depth of anaesthesia through human EEG signal at different levels of unconsciousness during surgeries

    PubMed Central

    Fan, Shou-Zen; Abbod, Maysam F.

    2018-01-01

    Estimating the depth of anaesthesia (DoA) in operations has always been a challenging issue due to the underlying complexity of the brain mechanisms. Electroencephalogram (EEG) signals are undoubtedly the most widely used signals for measuring DoA. In this paper, a novel EEG-based index is proposed to evaluate DoA for 24 patients receiving general anaesthesia with different levels of unconsciousness. Sample Entropy (SampEn) algorithm was utilised in order to acquire the chaotic features of the signals. After calculating the SampEn from the EEG signals, Random Forest was utilised for developing learning regression models with Bispectral index (BIS) as the target. Correlation coefficient, mean absolute error, and area under the curve (AUC) were used to verify the perioperative performance of the proposed method. Validation comparisons with typical nonstationary signal analysis methods (i.e., recurrence analysis and permutation entropy) and regression methods (i.e., neural network and support vector machine) were conducted. To further verify the accuracy and validity of the proposed methodology, the data is divided into four unconsciousness-level groups on the basis of BIS levels. Subsequently, analysis of variance (ANOVA) was applied to the corresponding index (i.e., regression output). Results indicate that the correlation coefficient improved to 0.72 ± 0.09 after filtering and to 0.90 ± 0.05 after regression from the initial values of 0.51 ± 0.17. Similarly, the final mean absolute error dramatically declined to 5.22 ± 2.12. In addition, the ultimate AUC increased to 0.98 ± 0.02, and the ANOVA analysis indicates that each of the four groups of different anaesthetic levels demonstrated significant difference from the nearest levels. Furthermore, the Random Forest output was extensively linear in relation to BIS, thus with better DoA prediction accuracy. 
    In conclusion, the proposed method provides a concrete basis for monitoring patients’ anaesthetic level during surgeries. PMID:29844970
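
    Sample Entropy, the feature extracted from the EEG signals above, can be sketched in a few lines. This is a plain reference implementation under common default assumptions (embedding dimension m = 2, tolerance r = 0.2 times the standard deviation, Chebyshev distance, matches counted with "less than or equal"); the abstract does not state which settings the authors used.

```python
import math
import statistics


def sample_entropy(signal, m=2, r_factor=0.2):
    """Sample entropy: -ln(A / B), where B counts matching template pairs of
    length m and A counts matching pairs of length m + 1 (self-matches excluded)."""
    n = len(signal)
    r = r_factor * statistics.pstdev(signal)

    def count_matches(length):
        # Use the same n - m starting points for both template lengths.
        templates = [signal[i:i + length] for i in range(n - m)]
        count = 0
        for i in range(len(templates)):
            for j in range(i + 1, len(templates)):
                distance = max(abs(a - b) for a, b in zip(templates[i], templates[j]))
                if distance <= r:
                    count += 1
        return count

    b = count_matches(m)
    a = count_matches(m + 1)
    if a == 0 or b == 0:
        return math.inf  # undefined when no matches are found
    return -math.log(a / b)
```

    A perfectly periodic series yields a value of zero, while an irregular series yields a larger value, which is the property that makes the measure usable as a regression feature.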

  5. Predicting temperate forest stand types using only structural profiles from discrete return airborne lidar

    NASA Astrophysics Data System (ADS)

    Fedrigo, Melissa; Newnham, Glenn J.; Coops, Nicholas C.; Culvenor, Darius S.; Bolton, Douglas K.; Nitschke, Craig R.

    2018-02-01

    Light detection and ranging (lidar) data have been increasingly used for forest classification due to its ability to penetrate the forest canopy and provide detail about the structure of the lower strata. In this study we demonstrate forest classification approaches using airborne lidar data as inputs to random forest and linear unmixing classification algorithms. Our results demonstrated that both random forest and linear unmixing models identified a distribution of rainforest and eucalypt stands that was comparable to existing ecological vegetation class (EVC) maps based primarily on manual interpretation of high resolution aerial imagery. Rainforest stands were also identified in the region that have not previously been identified in the EVC maps. The transition between stand types was better characterised by the random forest modelling approach. In contrast, the linear unmixing model placed greater emphasis on field plots selected as endmembers which may not have captured the variability in stand structure within a single stand type. The random forest model had the highest overall accuracy (84%) and Cohen's kappa coefficient (0.62). However, the classification accuracy was only marginally better than linear unmixing. The random forest model was applied to a region in the Central Highlands of south-eastern Australia to produce maps of stand type probability, including areas of transition (the 'ecotone') between rainforest and eucalypt forest. The resulting map provided a detailed delineation of forest classes, which specifically recognised the coalescing of stand types at the landscape scale. This represents a key step towards mapping the structural and spatial complexity of these ecosystems, which is important for both their management and conservation.

  6. Logistic quantile regression provides improved estimates for bounded avian counts: A case study of California Spotted Owl fledgling production

    USGS Publications Warehouse

    Cade, Brian S.; Noon, Barry R.; Scherer, Rick D.; Keane, John J.

    2017-01-01

    Counts of avian fledglings, nestlings, or clutch size that are bounded below by zero and above by some small integer form a discrete random variable distribution that is not approximated well by conventional parametric count distributions such as the Poisson or negative binomial. We developed a logistic quantile regression model to provide estimates of the empirical conditional distribution of a bounded discrete random variable. The logistic quantile regression model requires that counts are randomly jittered to a continuous random variable, logit transformed to bound them between specified lower and upper values, then estimated in conventional linear quantile regression, repeating the 3 steps and averaging estimates. Back-transformation to the original discrete scale relies on the fact that quantiles are equivariant to monotonic transformations. We demonstrate this statistical procedure by modeling 20 years of California Spotted Owl fledgling production (0−3 per territory) on the Lassen National Forest, California, USA, as related to climate, demographic, and landscape habitat characteristics at territories. Spotted Owl fledgling counts increased nonlinearly with decreasing precipitation in the early nesting period, in the winter prior to nesting, and in the prior growing season; with increasing minimum temperatures in the early nesting period; with adult compared to subadult parents; when there was no fledgling production in the prior year; and when percentage of the landscape surrounding nesting sites (202 ha) with trees ≥25 m height increased. Changes in production were primarily driven by changes in the proportion of territories with 2 or 3 fledglings. Average variances of the discrete cumulative distributions of the estimated fledgling counts indicated that temporal changes in climate and parent age class explained 18% of the annual variance in owl fledgling production, which was 34% of the total variance. 
    Prior fledgling production explained as much of the variance in the fledgling counts as climate, parent age class, and landscape habitat predictors. Our logistic quantile regression model can be used for any discrete response variables with fixed upper and lower bounds.
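
    The jitter, logit transform, and back-transform steps described above can be sketched as follows. This is a simplified illustration, not the authors' procedure: an unconditional empirical quantile stands in for the linear quantile regression step, and the bounds (-0.001, 4.001), the number of repeats, and all function names are assumptions.

```python
import math
import random


def logit_transform(y, lower, upper):
    """Map a value in (lower, upper) onto the whole real line."""
    return math.log((y - lower) / (upper - y))


def inverse_logit(z, lower, upper):
    """Map a real value back into (lower, upper)."""
    return (lower + upper * math.exp(z)) / (1.0 + math.exp(z))


def empirical_quantile(values, tau):
    """Order-statistic quantile (no interpolation)."""
    ordered = sorted(values)
    index = max(0, math.ceil(tau * len(ordered)) - 1)
    return ordered[index]


def jittered_logit_quantile(counts, tau, lower, upper, rng):
    # Step 1: jitter counts to a continuous variable by adding U[0, 1) noise.
    jittered = [c + rng.random() for c in counts]
    # Step 2: logit transform so estimates stay within the bounds.
    transformed = [logit_transform(y, lower, upper) for y in jittered]
    # Step 3: estimate the quantile (stand-in for linear quantile regression).
    q_logit = empirical_quantile(transformed, tau)
    # Back-transform; valid because quantiles are equivariant
    # to monotonic transformations.
    return inverse_logit(q_logit, lower, upper)


def averaged_quantile(counts, tau, lower=-0.001, upper=4.001, repeats=50, seed=0):
    """Repeat the three steps with fresh jitter and average the estimates."""
    rng = random.Random(seed)
    estimates = [jittered_logit_quantile(counts, tau, lower, upper, rng)
                 for _ in range(repeats)]
    return sum(estimates) / repeats
```

    The averaged value is on the jittered continuous scale; one common convention for returning to the discrete count scale is to take the ceiling of the estimate minus one.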

  7. Predicting surface fuel models and fuel metrics using lidar and CIR imagery in a dense mixed conifer forest

    Treesearch

    Marek K. Jakubowksi; Qinghua Guo; Brandon Collins; Scott Stephens; Maggi Kelly

    2013-01-01

    We compared the ability of several classification and regression algorithms to predict forest stand structure metrics and standard surface fuel models. Our study area spans a dense, topographically complex Sierra Nevada mixed-conifer forest. We used clustering, regression trees, and support vector machine algorithms to analyze high density (average 9 pulses/m

  8. Random forests for classification in ecology

    USGS Publications Warehouse

    Cutler, D.R.; Edwards, T.C.; Beard, K.H.; Cutler, A.; Hess, K.T.; Gibson, J.; Lawler, J.J.

    2007-01-01

    Classification procedures are some of the most widely used statistical methods in ecology. Random forests (RF) is a new and powerful statistical classifier that is well established in other disciplines but is relatively unknown in ecology. Advantages of RF compared to other statistical classifiers include (1) very high classification accuracy; (2) a novel method of determining variable importance; (3) ability to model complex interactions among predictor variables; (4) flexibility to perform several types of statistical data analysis, including regression, classification, survival analysis, and unsupervised learning; and (5) an algorithm for imputing missing values. We compared the accuracies of RF and four other commonly used statistical classifiers using data on invasive plant species presence in Lava Beds National Monument, California, USA, rare lichen species presence in the Pacific Northwest, USA, and nest sites for cavity nesting birds in the Uinta Mountains, Utah, USA. We observed high classification accuracy in all applications as measured by cross-validation and, in the case of the lichen data, by independent test data, when comparing RF to other common classification methods. We also observed that the variables that RF identified as most important for classifying invasive plant species coincided with expectations based on the literature. © 2007 by the Ecological Society of America.

  9. Remote sensing-based measurement of Living Environment Deprivation: Improving classical approaches with machine learning

    PubMed Central

    2017-01-01

    This paper provides evidence on the usefulness of very high spatial resolution (VHR) imagery in gathering socioeconomic information in urban settlements. We use land cover, spectral, structure and texture features extracted from a Google Earth image of Liverpool (UK) to evaluate their potential to predict Living Environment Deprivation at a small statistical area level. We also contribute to the methodological literature on the estimation of socioeconomic indices with remote-sensing data by introducing elements from modern machine learning. In addition to classical approaches such as Ordinary Least Squares (OLS) regression and a spatial lag model, we explore the potential of the Gradient Boost Regressor and Random Forests to improve predictive performance and accuracy. In addition to novel predicting methods, we also introduce tools for model interpretation and evaluation such as feature importance and partial dependence plots, or cross-validation. Our results show that Random Forest proved to be the best model with an R2 of around 0.54, followed by Gradient Boost Regressor with 0.5. Both the spatial lag model and the OLS fall behind with significantly lower performances of 0.43 and 0.3, respectively. PMID:28464010

  10. Remote sensing-based measurement of Living Environment Deprivation: Improving classical approaches with machine learning.

    PubMed

    Arribas-Bel, Daniel; Patino, Jorge E; Duque, Juan C

    2017-01-01

    This paper provides evidence on the usefulness of very high spatial resolution (VHR) imagery in gathering socioeconomic information in urban settlements. We use land cover, spectral, structure and texture features extracted from a Google Earth image of Liverpool (UK) to evaluate their potential to predict Living Environment Deprivation at a small statistical area level. We also contribute to the methodological literature on the estimation of socioeconomic indices with remote-sensing data by introducing elements from modern machine learning. In addition to classical approaches such as Ordinary Least Squares (OLS) regression and a spatial lag model, we explore the potential of the Gradient Boost Regressor and Random Forests to improve predictive performance and accuracy. In addition to novel predicting methods, we also introduce tools for model interpretation and evaluation such as feature importance and partial dependence plots, or cross-validation. Our results show that Random Forest proved to be the best model with an R2 of around 0.54, followed by Gradient Boost Regressor with 0.5. Both the spatial lag model and the OLS fall behind with significantly lower performances of 0.43 and 0.3, respectively.
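
    The R2 values used above to rank the models follow the usual coefficient-of-determination formula. A minimal sketch, with hypothetical observed and predicted values; it makes no claim about how the authors computed their cross-validated scores.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_observed = sum(observed) / len(observed)
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_total = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_residual / ss_total


# Hypothetical deprivation scores and model predictions.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]
print(r_squared(observed, predicted))
```

    A model that always predicts the mean of the observations scores 0, and a perfect model scores 1, so an R2 of 0.54 means roughly half of the variance in the index is reproduced by the predictions.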

  11. Estimation of sleep status in sleep apnea patients using a novel head actigraphy technique.

    PubMed

    Hummel, Richard; Bradley, T Douglas; Fernie, Geoff R; Chang, S J Isaac; Alshaer, Hisham

    2015-01-01

    Polysomnography is a comprehensive modality for diagnosing sleep apnea (SA), but it is expensive and not widely available. Several technologies have been developed for portable diagnosis of SA in the home, most of which lack the ability to detect sleep status. Wrist actigraphy (accelerometry) has been adopted to cover this limitation. However, head actigraphy has not been systematically evaluated for this purpose. Therefore, the aim of this study was to evaluate the ability of head actigraphy to detect sleep/wake status. We obtained full overnight 3-axis head accelerometry data from 75 sleep apnea patient recordings. These were split into training and validation groups (2:1). Data were preprocessed and 5 features were extracted. Different feature combinations were fed into 3 different classifiers, namely support vector machine, logistic regression, and random forests, each of which was trained and validated on a different subgroup. The random forest algorithm yielded the highest performance, with an area under the receiver operating characteristic (ROC) curve of 0.81 for detection of sleep status. This shows that this technique has a very good performance in detecting sleep status in SA patients despite the specificities in this population, such as respiration related movements.
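
    The area under the ROC curve reported above equals the probability that a randomly chosen positive case (here, for illustration, a sleep epoch) receives a higher score than a randomly chosen negative case (a wake epoch). The sketch below computes it by direct pairwise comparison; the labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.
    Ties between a positive and a negative score count as half a win."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = sleep, 0 = wake) and classifier scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))
```

    The pairwise form is quadratic in the number of samples; it is used here for clarity rather than speed.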

  12. A machine learning system to improve heart failure patient assistance.

    PubMed

    Guidi, Gabriele; Pettenati, Maria Chiara; Melillo, Paolo; Iadanza, Ernesto

    2014-11-01

    In this paper, we present a clinical decision support system (CDSS) for the analysis of heart failure (HF) patients, providing various outputs such as an HF severity evaluation, HF-type prediction, as well as a management interface that compares the different patients' follow-ups. The whole system is composed of a part of intelligent core and of an HF special-purpose management tool also providing the function to act as interface for the artificial intelligence training and use. To implement the smart intelligent functions, we adopted a machine learning approach. In this paper, we compare the performance of a neural network (NN), a support vector machine, a system with fuzzy rules genetically produced, and a classification and regression tree and its direct evolution, which is the random forest, in analyzing our database. Best performances in both HF severity evaluation and HF-type prediction functions are obtained by using the random forest algorithm. The management tool allows the cardiologist to populate a "supervised database" suitable for machine learning during his or her regular outpatient consultations. The idea comes from the fact that in literature there are a few databases of this type, and they are not scalable to our case.

  13. Accelerated Changes in Cortical Thickness Measurements with Age in Military Service Members with Traumatic Brain Injury.

    PubMed

    Savjani, Ricky R; Taylor, Brian A; Acion, Laura; Wilde, Elisabeth A; Jorge, Ricardo E

    2017-11-15

    Finding objective and quantifiable imaging markers of mild traumatic brain injury (TBI) has proven challenging, especially in the military population. Changes in cortical thickness after injury have been reported in animals and in humans, but it is unclear how these alterations manifest in the chronic phase, and it is difficult to characterize accurately with imaging. We used cortical thickness measures derived from Advanced Normalization Tools (ANTs) to predict a continuous demographic variable: age. We trained four different regression models (linear regression, support vector regression, Gaussian process regression, and random forests) to predict age from healthy control brains from publicly available datasets (n = 762). We then used these models to predict brain age in military Service Members with TBI (n = 92) and military Service Members without TBI (n = 34). Our results show that all four models overpredicted age in Service Members with TBI, and the predicted age difference was significantly greater compared with military controls. These data extend previous civilian findings and show that cortical thickness measures may reveal an association of accelerated changes over time with military TBI.

  14. The experimental design of the Missouri Ozark Forest Ecosystem Project

    Treesearch

    Steven L. Sheriff; Shuoqiong He

    1997-01-01

    The Missouri Ozark Forest Ecosystem Project (MOFEP) is an experiment that examines the effects of three forest management practices on the forest community. MOFEP is designed as a randomized complete block design using nine sites divided into three blocks. Treatments of uneven-aged, even-aged, and no-harvest management were randomly assigned to sites within each block...
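
    The randomized complete block layout described above (nine sites, three blocks, three treatments assigned at random within each block) can be sketched as follows. The block and site names are hypothetical, and the code only illustrates the assignment step, not the project's actual randomization.

```python
import random

TREATMENTS = ["uneven-aged", "even-aged", "no-harvest"]


def assign_treatments(blocks, seed=None):
    """Randomly assign each treatment exactly once within every block."""
    rng = random.Random(seed)
    assignment = {}
    for sites in blocks.values():
        shuffled = TREATMENTS[:]
        rng.shuffle(shuffled)
        for site, treatment in zip(sites, shuffled):
            assignment[site] = treatment
    return assignment


# Hypothetical site labels grouped into three blocks of three.
blocks = {
    "block_1": ["site_1", "site_2", "site_3"],
    "block_2": ["site_4", "site_5", "site_6"],
    "block_3": ["site_7", "site_8", "site_9"],
}
for site, treatment in assign_treatments(blocks, seed=1).items():
    print(site, treatment)
```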

  15. Impacts of Landscape Context on Patterns of Wind Downfall Damage in a Fragmented Amazonian Landscape

    NASA Astrophysics Data System (ADS)

    Schwartz, N.; Uriarte, M.; DeFries, R. S.; Gutierrez-Velez, V. H.; Fernandes, K.; Pinedo-Vasquez, M.

    2015-12-01

    Wind is a major disturbance in the Amazon and has both short-term impacts and lasting legacies in tropical forests. Observed patterns of damage across landscapes result from differences in wind exposure and stand characteristics, such as tree stature, species traits, successional age, and fragmentation. Wind disturbance has important consequences for biomass dynamics in Amazonian forests, and understanding the spatial distribution and size of impacts is necessary to quantify the effects on carbon dynamics. In November 2013, a mesoscale convective system was observed over the study area in Ucayali, Peru, a highly human modified and fragmented forest landscape. We mapped downfall damage associated with the storm in order to ask: how does the severity of damage vary within forest patches, and across forest patches of different sizes and successional ages? We applied spectral mixture analysis to Landsat images from 2013 and 2014 to calculate the change in non-photosynthetic vegetation fraction after the storm, and combined it with C-band SAR data from the Sentinel-1 satellite to predict downfall damage measured in 30 field plots using random forest regression. We then applied this model to map damage in forests across the study area. Using a land cover classification developed in a previous study, we mapped secondary and mature forest, and compared the severity of damage in the two. We found that damage was on average higher in secondary forests, but patterns varied spatially. This study demonstrates the utility of using multiple sources of satellite data for mapping wind disturbance, and adds to our understanding of the sources of variation in wind-related damage. Ultimately, an improved ability to map wind impacts and a better understanding of their spatial patterns can contribute to better quantification of carbon dynamics in Amazonian landscapes.

  16. Forest Type and Above Ground Biomass Estimation Based on Sentinel-2A and WorldView-2 Data Evaluation of Predictor and Data Suitability

    NASA Astrophysics Data System (ADS)

    Fritz, Andreas; Enßle, Fabian; Zhang, Xiaoli; Koch, Barbara

    2016-08-01

    The present study analyses the two earth observation sensors regarding their capability of modelling forest above ground biomass and forest density. Our research is carried out at two different demonstration sites. The first is located in south-western Germany (region Karlsruhe) and the second is located in southern China in Jiangle County (Province Fujian). A set of spectral and spatial predictors are computed from both Sentinel-2A and WorldView-2 data. Window sizes in the range of 3*3 pixels to 21*21 pixels are computed in order to cover the full range of the canopy sizes of mature forest stands. Textural predictors of first and second order (grey-level-co-occurrence matrix) are calculated and are further used within a feature selection procedure. Additionally, common spectral predictors from WorldView-2 and Sentinel-2A data such as all relevant spectral bands and NDVI are integrated in the analyses. To examine the most important predictors, a predictor selection algorithm is applied to the data, whereas the entire predictor set of more than 1000 predictors is used to find the most important ones. Out of the original set only the most important predictors are then further analysed. Predictor selection is done with the Boruta package in R (Kursa and Rudnicki (2010)), whereas regression is computed with random forest. Prior to the classification and regression, a tuning of parameters is done by a repetitive model selection (100 runs), based on the .632 bootstrapping. Both are implemented in the caret R package (Kuhn et al. (2016)). To account for the variability in the data set 100 independent runs are performed. Within each run 80 percent of the data is used for training and the remaining 20 percent are used for an independent validation. With the subset of original predictors mapping of above ground biomass is performed.
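
    The validation scheme described above (100 independent runs, each with an 80/20 split into training and validation data) can be sketched as an index generator. The study itself used R packages; this is a language-neutral illustration with a hypothetical function name and sample count, and it does not cover the Boruta selection or the .632 bootstrap tuning.

```python
import random


def repeated_holdout_splits(n_samples, n_runs=100, train_fraction=0.8, seed=0):
    """Yield (train_indices, validation_indices) for each independent run."""
    rng = random.Random(seed)
    n_train = round(train_fraction * n_samples)
    for _ in range(n_runs):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # First 80 percent for training, the rest held out for validation.
        yield indices[:n_train], indices[n_train:]


# Hypothetical data set of 50 field plots.
for run, (train, validation) in enumerate(repeated_holdout_splits(50, n_runs=3)):
    print(run, len(train), len(validation))
```

    Averaging a validation metric over the runs, and inspecting its spread, shows how sensitive the model is to which plots happen to be held out.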

  17. Random Forest as an Imputation Method for Education and Psychology Research: Its Impact on Item Fit and Difficulty of the Rasch Model

    ERIC Educational Resources Information Center

    Golino, Hudson F.; Gomes, Cristiano M. A.

    2016-01-01

    This paper presents a non-parametric imputation technique, named random forest, from the machine learning field. The random forest procedure has two main tuning parameters: the number of trees grown in the prediction and the number of predictors used. Fifty experimental conditions were created in the imputation procedure, with different…

  18. Habitat interaction between two species of chipmunk in the Basin and Range Province of Nevada

    USGS Publications Warehouse

    Lowrey, Christopher; Longshore, Kathleen M.

    2013-01-01

    Interspecies interactions can affect how species are distributed, put constraints on habitat expansion, and reduce the fundamental niche of the affected species. Using logistic regression, we analyzed and compared 174 Tamias palmeri and 94 Tamias panamintinus within an isolated mountain range of the Basin and Range Province of southern Nevada. Tamias panamintinus was more likely to use pinyon/ponderosa/fir mixed forests than pinyon alone, compared to random sites. In the presence of T. palmeri, however, interaction analyses indicated T. panamintinus was less likely to occupy the mixed forests and more likely near large rocks on southern aspects. This species-by-habitat interaction data suggest that T. palmeri excludes T. panamintinus from areas of potentially suitable habitat. Climate change may adversely affect species of restricted distribution. Habitat isolation and species interactions in this region may thus increase survival risks as climate temperatures rise.

  19. Ensemble habitat mapping of invasive plant species

    USGS Publications Warehouse

    Stohlgren, T.J.; Ma, P.; Kumar, S.; Rocca, M.; Morisette, J.T.; Jarnevich, C.S.; Benson, N.

    2010-01-01

    Ensemble species distribution models combine the strengths of several species environmental matching models, while minimizing the weakness of any one model. Ensemble models may be particularly useful in risk analysis of recently arrived, harmful invasive species because species may not yet have spread to all suitable habitats, leaving species-environment relationships difficult to determine. We tested five individual models (logistic regression, boosted regression trees, random forest, multivariate adaptive regression splines (MARS), and maximum entropy model or Maxent) and ensemble modeling for selected nonnative plant species in Yellowstone and Grand Teton National Parks, Wyoming; Sequoia and Kings Canyon National Parks, California, and areas of interior Alaska. The models are based on field data provided by the park staffs, combined with topographic, climatic, and vegetation predictors derived from satellite data. For the four invasive plant species tested, ensemble models were the only models that ranked in the top three models for both field validation and test data. Ensemble models may be more robust than individual species-environment matching models for risk analysis. © 2010 Society for Risk Analysis.

  20. Using Random Forest Models to Predict Organizational Violence

    NASA Technical Reports Server (NTRS)

    Levine, Burton; Bobashev, Georgly

    2012-01-01

    We present a methodology to assess the proclivity of an organization to commit violence against nongovernment personnel. We fitted a Random Forest model using the Minority at Risk Organizational Behavior (MAROS) dataset. The MAROS data is longitudinal, so individual observations are not independent. We propose a modification to the standard Random Forest methodology to account for the violation of the independence assumption. We present the results of the model fit, an example of predicting violence for an organization; and finally, we present a summary of the forest in a "meta-tree,"

  1. Forest dynamics to precipitation and temperature in the Gulf of Mexico coastal region.

    PubMed

    Li, Tianyu; Meng, Qingmin

    2017-05-01

    The forest is one of the most significant components of the Gulf of Mexico (GOM) coast. It provides livelihood to inhabitants and is known to be sensitive to climatic fluctuations. This study focuses on examining the impacts of temperature and precipitation variations on coastal forest. Two different regression methods, ordinary least squares (OLS) and geographically weighted regression (GWR), were employed to reveal the relationship between meteorological variables and forest dynamics. OLS regression analysis shows that changes in precipitation and temperature, over a span of 12 months, are responsible for 56% of NDVI variation. The forest, which is not particularly affected by the average monthly precipitation in most months, is observed to be affected by cumulative seasonal and annual precipitation explicitly. Temperature and precipitation have almost equal impact on NDVI changes; about 50% of the NDVI variation is explained in OLS modeling, and about 74% of the NDVI variation is explained in GWR modeling. GWR analysis indicated that both precipitation and temperature characterize the spatial heterogeneity patterns of forest dynamics.

  2. Forest dynamics to precipitation and temperature in the Gulf of Mexico coastal region

    NASA Astrophysics Data System (ADS)

    Li, Tianyu; Meng, Qingmin

    2017-05-01

    The forest is one of the most significant components of the Gulf of Mexico (GOM) coast. It provides livelihood to inhabitants and is known to be sensitive to climatic fluctuations. This study focuses on examining the impacts of temperature and precipitation variations on coastal forest. Two different regression methods, ordinary least squares (OLS) and geographically weighted regression (GWR), were employed to reveal the relationship between meteorological variables and forest dynamics. OLS regression analysis shows that changes in precipitation and temperature, over a span of 12 months, are responsible for 56% of NDVI variation. The forest, which is not particularly affected by the average monthly precipitation in most months, is observed to be affected by cumulative seasonal and annual precipitation explicitly. Temperature and precipitation have almost equal impact on NDVI changes; about 50% of the NDVI variation is explained in OLS modeling, and about 74% of the NDVI variation is explained in GWR modeling. GWR analysis indicated that both precipitation and temperature characterize the spatial heterogeneity patterns of forest dynamics.

  3. A comparative study: classification vs. user-based collaborative filtering for clinical prediction.

    PubMed

    Hao, Fang; Blair, Rachael Hageman

    2016-12-08

    Recommender systems have shown tremendous value for the prediction of personalized item recommendations for individuals in a variety of settings (e.g., marketing and e-commerce). User-based collaborative filtering is a popular recommender system, which leverages an individual's prior satisfaction with items, as well as the satisfaction of individuals that are "similar". Recently, there have been applications of collaborative filtering based recommender systems for clinical risk prediction. In these applications, individuals represent patients, and items represent clinical data, which includes an outcome. Application of recommender systems to a problem of this type requires recasting a supervised learning problem as unsupervised. The rationale is that patients with similar clinical features carry a similar disease risk. As the "Big Data" era progresses, it is likely that approaches of this type will be reached for as biomedical data continues to grow in both size and complexity (e.g., electronic health records). In the present study, we set out to understand and assess the performance of recommender systems in a controlled yet realistic setting. User-based collaborative filtering recommender systems are compared to logistic regression and random forests with different types of imputation and varying amounts of missingness on four different publicly available medical data sets: National Health and Nutrition Examination Survey (NHANES, 2011-2012 on Obesity), Study to Understand Prognoses Preferences Outcomes and Risks of Treatment (SUPPORT), chronic kidney disease, and dermatology data. We also examined performance using simulated data with observations that are Missing At Random (MAR) or Missing Completely At Random (MCAR) under various degrees of missingness and levels of class imbalance in the response variable. Our results demonstrate that user-based collaborative filtering is consistently inferior to logistic regression and random forests with different imputations on real and simulated data. The results warrant caution in using collaborative filtering (CF) for clinical risk prediction when traditional classification is feasible and practical. CF may not be desirable in datasets where classification is an acceptable alternative. We describe some natural applications related to "Big Data" where CF would be preferred and conclude with some insights as to why caution may be warranted in this context.

  4. Modeling landslide susceptibility in data-scarce environments using optimized data mining and statistical methods

    NASA Astrophysics Data System (ADS)

    Lee, Jung-Hyun; Sameen, Maher Ibrahim; Pradhan, Biswajeet; Park, Hyuck-Jin

    2018-02-01

    This study evaluated the generalizability of five models to select a suitable approach for landslide susceptibility modeling in data-scarce environments. In total, 418 landslide inventories and 18 landslide conditioning factors were analyzed. Multicollinearity and factor optimization were investigated before data modeling, and two experiments were then conducted. In each experiment, five susceptibility maps were produced based on support vector machine (SVM), random forest (RF), weight-of-evidence (WoE), ridge regression (Rid_R), and robust regression (RR) models. The highest accuracy (AUC = 0.85) was achieved with the SVM model when either the full or limited landslide inventories were used. Furthermore, the RF and WoE models were severely affected when fewer landslide samples were used for training. The other models were affected slightly when the training samples were limited.

  5. A Comparative Assessment of the Influences of Human Impacts on Soil Cd Concentrations Based on Stepwise Linear Regression, Classification and Regression Tree, and Random Forest Models

    PubMed Central

    Qiu, Lefeng; Wang, Kai; Long, Wenli; Wang, Ke; Hu, Wei; Amable, Gabriel S.

    2016-01-01

    Soil cadmium (Cd) contamination has attracted a great deal of attention because of its detrimental effects on animals and humans. This study aimed to develop and compare the performances of stepwise linear regression (SLR), classification and regression tree (CART) and random forest (RF) models in the prediction and mapping of the spatial distribution of soil Cd and to identify likely sources of Cd accumulation in Fuyang County, eastern China. Soil Cd data from 276 topsoil (0–20 cm) samples were collected and randomly divided into calibration (222 samples) and validation datasets (54 samples). Auxiliary data, including detailed land use information, soil organic matter, soil pH, and topographic data, were incorporated into the models to simulate the soil Cd concentrations and further identify the main factors influencing soil Cd variation. The predictive models for soil Cd concentration exhibited acceptable overall accuracies (72.22% for SLR, 70.37% for CART, and 75.93% for RF). The SLR model exhibited the largest predicted deviation, with a mean error (ME) of 0.074 mg/kg, a mean absolute error (MAE) of 0.160 mg/kg, and a root mean squared error (RMSE) of 0.274 mg/kg, and the RF model produced the results closest to the observed values, with an ME of 0.002 mg/kg, an MAE of 0.132 mg/kg, and an RMSE of 0.198 mg/kg. The RF model also exhibited the greatest R2 value (0.772). The CART model predictions closely followed, with ME, MAE, RMSE, and R2 values of 0.013 mg/kg, 0.154 mg/kg, 0.230 mg/kg and 0.644, respectively. The three prediction maps generally exhibited similar and realistic spatial patterns of soil Cd contamination. The heavily Cd-affected areas were primarily located in the alluvial valley plain of the Fuchun River and its tributaries because of the dramatic industrialization and urbanization processes that have occurred there. The most important variable for explaining high levels of soil Cd accumulation was the presence of metal smelting industries. 
The good performance of the RF model was attributable to its ability to handle the non-linear and hierarchical relationships between soil Cd and environmental variables. These results confirm that the RF approach is promising for the prediction and spatial distribution mapping of soil Cd at the regional scale. PMID:26964095
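
    The four validation statistics reported in this abstract (ME, MAE, RMSE and R2) can be computed from paired observed and predicted values. The sketch below is only illustrative: the abstract does not give the formulas, so it assumes ME is the mean of predicted minus observed and that R2 is the coefficient of determination. The function name `error_metrics` and the sample values are hypothetical and are not the study's data.

```python
import math


def error_metrics(observed, predicted):
    """Return ME, MAE, RMSE and R2 for paired observed/predicted values.

    Assumptions (not stated in the abstract): residual = predicted - observed,
    and R2 = 1 - SS_res / SS_tot (coefficient of determination).
    """
    n = len(observed)
    residuals = [p - o for o, p in zip(observed, predicted)]
    me = sum(residuals) / n                       # mean error (bias)
    mae = sum(abs(r) for r in residuals) / n      # mean absolute error
    ss_res = sum(r * r for r in residuals)        # residual sum of squares
    rmse = math.sqrt(ss_res / n)                  # root mean squared error
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1.0 - ss_res / ss_tot
    return {"ME": me, "MAE": mae, "RMSE": rmse, "R2": r2}


# Hypothetical validation values in mg/kg, for illustration only.
metrics = error_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
```

    An ME near zero with a larger MAE, as reported for the RF model, indicates that over- and under-predictions largely cancel even though individual errors remain.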

  6. A Comparative Assessment of the Influences of Human Impacts on Soil Cd Concentrations Based on Stepwise Linear Regression, Classification and Regression Tree, and Random Forest Models.

    PubMed

    Qiu, Lefeng; Wang, Kai; Long, Wenli; Wang, Ke; Hu, Wei; Amable, Gabriel S

    2016-01-01

    Soil cadmium (Cd) contamination has attracted a great deal of attention because of its detrimental effects on animals and humans. This study aimed to develop and compare the performances of stepwise linear regression (SLR), classification and regression tree (CART) and random forest (RF) models in the prediction and mapping of the spatial distribution of soil Cd and to identify likely sources of Cd accumulation in Fuyang County, eastern China. Soil Cd data from 276 topsoil (0-20 cm) samples were collected and randomly divided into calibration (222 samples) and validation datasets (54 samples). Auxiliary data, including detailed land use information, soil organic matter, soil pH, and topographic data, were incorporated into the models to simulate the soil Cd concentrations and further identify the main factors influencing soil Cd variation. The predictive models for soil Cd concentration exhibited acceptable overall accuracies (72.22% for SLR, 70.37% for CART, and 75.93% for RF). The SLR model exhibited the largest predicted deviation, with a mean error (ME) of 0.074 mg/kg, a mean absolute error (MAE) of 0.160 mg/kg, and a root mean squared error (RMSE) of 0.274 mg/kg, and the RF model produced the results closest to the observed values, with an ME of 0.002 mg/kg, an MAE of 0.132 mg/kg, and an RMSE of 0.198 mg/kg. The RF model also exhibited the greatest R2 value (0.772). The CART model predictions closely followed, with ME, MAE, RMSE, and R2 values of 0.013 mg/kg, 0.154 mg/kg, 0.230 mg/kg and 0.644, respectively. The three prediction maps generally exhibited similar and realistic spatial patterns of soil Cd contamination. The heavily Cd-affected areas were primarily located in the alluvial valley plain of the Fuchun River and its tributaries because of the dramatic industrialization and urbanization processes that have occurred there. The most important variable for explaining high levels of soil Cd accumulation was the presence of metal smelting industries. 
The good performance of the RF model was attributable to its ability to handle the non-linear and hierarchical relationships between soil Cd and environmental variables. These results confirm that the RF approach is promising for the prediction and spatial distribution mapping of soil Cd at the regional scale.

  7. Modelling Variable Fire Severity in Boreal Forests: Effects of Fire Intensity and Stand Structure

    PubMed Central

    Miquelajauregui, Yosune; Cumming, Steven G.; Gauthier, Sylvie

    2016-01-01

    It is becoming clear that fires in boreal forests are not uniformly stand-replacing. On the contrary, marked variation in fire severity, measured as tree mortality, has been found both within and among individual fires. It is important to understand the conditions under which this variation can arise. We integrated forest sample plot data, tree allometries and historical forest fire records within a diameter class-structured model of 1.0 ha patches of mono-specific black spruce and jack pine stands in northern Québec, Canada. The model accounts for crown fire initiation and vertical spread into the canopy. It uses empirical relations between fire intensity, scorch height, the percent of crown scorched and tree mortality to simulate fire severity, specifically the percent reduction in patch basal area due to fire-caused mortality. A random forest and a regression tree analysis of a large random sample of simulated fires were used to test for an effect of fireline intensity, stand structure, species composition and pyrogeographic regions on resultant severity. Severity increased with intensity and was lower for jack pine stands. The proportion of simulated fires that burned at high severity (e.g. >75% reduction in patch basal area) was 0.80 for black spruce and 0.11 for jack pine. We identified thresholds in intensity below which there was a marked sensitivity of simulated fire severity to stand structure, and to interactions between intensity and structure. We found no evidence for a residual effect of pyrogeographic region on simulated severity, after the effects of stand structure and species composition were accounted for. The model presented here was able to produce variation in fire severity under a range of fire intensity conditions. This suggests that variation in stand structure is one of the factors causing the observed variation in boreal fire severity. PMID:26919456

  8. A Multiscale Approach Indicates a Severe Reduction in Atlantic Forest Wetlands and Highlights that São Paulo Marsh Antwren Is on the Brink of Extinction

    PubMed Central

    Del-Rio, Glaucia; Rêgo, Marco Antonio; Silveira, Luís Fábio

    2015-01-01

    Over the last 200 years the wetlands of the Upper Tietê and Upper Paraíba do Sul basins, in the southeastern Atlantic Forest, Brazil, have been almost-completely transformed by urbanization, agriculture and mining. Endemic to these river basins, the São Paulo Marsh Antwren (Formicivora paludicola) survived these impacts, but remained unknown to science until its discovery in 2005. Its population status was cause for immediate concern. In order to understand the factors imperiling the species, and provide guidelines for its conservation, we investigated both the species’ distribution and the distribution of areas of suitable habitat using a multiscale approach encompassing species distribution modeling, fieldwork surveys and occupancy models. Of six species distribution models methods used (Generalized Linear Models, Generalized Additive Models, Multivariate Adaptive Regression Splines, Classification Tree Analysis, Artificial Neural Networks and Random Forest), Random Forest showed the best fit and was utilized to guide field validation. After surveying 59 sites, our results indicated that Formicivora paludicola occurred in only 13 sites, having narrow habitat specificity, and restricted habitat availability. Additionally, historic maps, distribution models and satellite imagery showed that human occupation has resulted in a loss of more than 346 km2 of suitable habitat for this species since the early twentieth century, so that it now only occupies a severely fragmented area (area of occupancy) of 1.42 km2, and it should be considered Critically Endangered according to IUCN criteria. Furthermore, averaged occupancy models showed that marshes with lower cattail (Typha dominguensis) densities have higher probabilities of being occupied. Thus, these areas should be prioritized in future conservation efforts to protect the species, and to restore a portion of Atlantic Forest wetlands, in times of unprecedented regional water supply problems. PMID:25798608

  9. Modelling Variable Fire Severity in Boreal Forests: Effects of Fire Intensity and Stand Structure.

    PubMed

    Miquelajauregui, Yosune; Cumming, Steven G; Gauthier, Sylvie

    2016-01-01

    It is becoming clear that fires in boreal forests are not uniformly stand-replacing. On the contrary, marked variation in fire severity, measured as tree mortality, has been found both within and among individual fires. It is important to understand the conditions under which this variation can arise. We integrated forest sample plot data, tree allometries and historical forest fire records within a diameter class-structured model of 1.0 ha patches of mono-specific black spruce and jack pine stands in northern Québec, Canada. The model accounts for crown fire initiation and vertical spread into the canopy. It uses empirical relations between fire intensity, scorch height, the percent of crown scorched and tree mortality to simulate fire severity, specifically the percent reduction in patch basal area due to fire-caused mortality. A random forest and a regression tree analysis of a large random sample of simulated fires were used to test for an effect of fireline intensity, stand structure, species composition and pyrogeographic regions on resultant severity. Severity increased with intensity and was lower for jack pine stands. The proportion of simulated fires that burned at high severity (e.g. >75% reduction in patch basal area) was 0.80 for black spruce and 0.11 for jack pine. We identified thresholds in intensity below which there was a marked sensitivity of simulated fire severity to stand structure, and to interactions between intensity and structure. We found no evidence for a residual effect of pyrogeographic region on simulated severity, after the effects of stand structure and species composition were accounted for. The model presented here was able to produce variation in fire severity under a range of fire intensity conditions. This suggests that variation in stand structure is one of the factors causing the observed variation in boreal fire severity.

  10. A multiscale approach indicates a severe reduction in Atlantic Forest wetlands and highlights that São Paulo Marsh Antwren is on the brink of extinction.

    PubMed

    Del-Rio, Glaucia; Rêgo, Marco Antonio; Silveira, Luís Fábio

    2015-01-01

    Over the last 200 years the wetlands of the Upper Tietê and Upper Paraíba do Sul basins, in the southeastern Atlantic Forest, Brazil, have been almost-completely transformed by urbanization, agriculture and mining. Endemic to these river basins, the São Paulo Marsh Antwren (Formicivora paludicola) survived these impacts, but remained unknown to science until its discovery in 2005. Its population status was cause for immediate concern. In order to understand the factors imperiling the species, and provide guidelines for its conservation, we investigated both the species' distribution and the distribution of areas of suitable habitat using a multiscale approach encompassing species distribution modeling, fieldwork surveys and occupancy models. Of six species distribution models methods used (Generalized Linear Models, Generalized Additive Models, Multivariate Adaptive Regression Splines, Classification Tree Analysis, Artificial Neural Networks and Random Forest), Random Forest showed the best fit and was utilized to guide field validation. After surveying 59 sites, our results indicated that Formicivora paludicola occurred in only 13 sites, having narrow habitat specificity, and restricted habitat availability. Additionally, historic maps, distribution models and satellite imagery showed that human occupation has resulted in a loss of more than 346 km2 of suitable habitat for this species since the early twentieth century, so that it now only occupies a severely fragmented area (area of occupancy) of 1.42 km2, and it should be considered Critically Endangered according to IUCN criteria. Furthermore, averaged occupancy models showed that marshes with lower cattail (Typha dominguensis) densities have higher probabilities of being occupied. Thus, these areas should be prioritized in future conservation efforts to protect the species, and to restore a portion of Atlantic Forest wetlands, in times of unprecedented regional water supply problems.

  11. Locally-constrained Boundary Regression for Segmentation of Prostate and Rectum in the Planning CT Images

    PubMed Central

    Shao, Yeqin; Gao, Yaozong; Wang, Qian; Yang, Xin; Shen, Dinggang

    2015-01-01

    Automatic and accurate segmentation of the prostate and rectum in planning CT images is a challenging task due to low image contrast, unpredictable organ (relative) position, and uncertain existence of bowel gas across different patients. Recently, regression forest was adopted for organ deformable segmentation on 2D medical images by training one landmark detector for each point on the shape model. However, it seems impractical for regression forest to guide 3D deformable segmentation as a landmark detector, due to large number of vertices in the 3D shape model as well as the difficulty in building accurate 3D vertex correspondence for each landmark detector. In this paper, we propose a novel boundary detection method by exploiting the power of regression forest for prostate and rectum segmentation. The contributions of this paper are as follows: 1) we introduce regression forest as a local boundary regressor to vote the entire boundary of a target organ, which avoids training a large number of landmark detectors and building an accurate 3D vertex correspondence for each landmark detector; 2) an auto-context model is integrated with regression forest to improve the accuracy of the boundary regression; 3) we further combine a deformable segmentation method with the proposed local boundary regressor for the final organ segmentation by integrating organ shape priors. Our method is evaluated on a planning CT image dataset with 70 images from 70 different patients. The experimental results show that our proposed boundary regression method outperforms the conventional boundary classification method in guiding the deformable model for prostate and rectum segmentations. Compared with other state-of-the-art methods, our method also shows a competitive performance. PMID:26439938

  12. A Random Forest-based ensemble method for activity recognition.

    PubMed

    Feng, Zengtao; Mo, Lingfei; Li, Meng

    2015-01-01

    This paper presents a multi-sensor ensemble approach to human physical activity (PA) recognition, using random forest. We designed an ensemble learning algorithm, which integrates several independent Random Forest classifiers based on different sensor feature sets to build a more stable, more accurate and faster classifier for human activity recognition. To evaluate the algorithm, PA data collected from the PAMAP (Physical Activity Monitoring for Aging People), which is a standard, publicly available database, was utilized to train and test. The experimental results show that the algorithm is able to correctly recognize 19 PA types with an accuracy of 93.44%, while the training is faster than others. The ensemble classifier system based on the RF (Random Forest) algorithm can achieve high recognition accuracy and fast calculation.

  13. [Detecting the moisture content of forest surface soil based on the microwave remote sensing technology].

    PubMed

    Li, Ming Ze; Gao, Yuan Ke; Di, Xue Ying; Fan, Wen Yi

    2016-03-01

    The moisture content of forest surface soil is an important parameter in forest ecosystems. It is practically significant for forest ecosystem related research to use microwave remote sensing technology for rapid and accurate estimation of the moisture content of forest surface soil. With the aid of TDR-300 soil moisture content measuring instrument, the moisture contents of forest surface soils of 120 sample plots at Tahe Forestry Bureau of Daxing'anling region in Heilongjiang Province were measured. Taking the moisture content of forest surface soil as the dependent variable and the polarization decomposition parameters of C band Quad-pol SAR data as independent variables, two types of quantitative estimation models (multilinear regression model and BP-neural network model) for predicting moisture content of forest surface soils were developed. The spatial distribution of moisture content of forest surface soil on the regional scale was then derived with model inversion. Results showed that the model precision was 86.0% and 89.4% with RMSE of 3.0% and 2.7% for the multilinear regression model and the BP-neural network model, respectively. It indicated that the BP-neural network model had a better performance than the multilinear regression model in quantitative estimation of the moisture content of forest surface soil. The spatial distribution of forest surface soil moisture content in the study area was then obtained by using the BP neural network model simulation with the Quad-pol SAR data.

  14. Forest/non-forest mapping using inventory data and satellite imagery

    Treesearch

    Ronald E. McRoberts

    2002-01-01

    For two study areas in Minnesota, USA, one heavily forested and one sparsely forested, maps of predicted proportion forest area were created using Landsat Thematic Mapper imagery, forest inventory plot data, and two prediction techniques, logistic regression and a k-Nearest Neighbours technique. The maps were used to increase the precision of forest area estimates by...

  15. The effects of forest fragmentation on forest stand attributes

    Treesearch

    Ronald E. McRoberts; Greg C. Liknes

    2002-01-01

    For two study areas in Minnesota, USA, one heavily forested and one sparsely forested, maps of predicted proportion forest area were created using Landsat Thematic Mapper imagery, forest inventory plot data, and a logistic regression model. The maps were used to estimate quantitative indices of forest fragmentation. Correlations between the values of the indices and...

  16. A comparison of the conditional inference survival forest model to random survival forests based on a simulation study as well as on two applications with time-to-event data.

    PubMed

    Nasejje, Justine B; Mwambi, Henry; Dheda, Keertan; Lesosky, Maia

    2017-07-28

    Random survival forest (RSF) models have been identified as alternative methods to the Cox proportional hazards model in analysing time-to-event data. These methods, however, have been criticised for the bias that results from favouring covariates with many split-points and hence conditional inference forests for time-to-event data have been suggested. Conditional inference forests (CIF) are known to correct the bias in RSF models by separating the procedure for the best covariate to split on from that of the best split point search for the selected covariate. In this study, we compare the random survival forest model to the conditional inference forest (CIF) model using twenty-two simulated time-to-event datasets. We also analysed two real time-to-event datasets. The first dataset is based on the survival of children under five years of age in Uganda and it consists of categorical covariates with most of them having more than two levels (many split-points). The second dataset is based on the survival of patients with extremely drug resistant tuberculosis (XDR TB) which consists of mainly categorical covariates with two levels (few split-points). The study findings indicate that the conditional inference forest model is superior to random survival forest models in analysing time-to-event data that consists of covariates with many split-points based on the values of the bootstrap cross-validated estimates for integrated Brier scores. However, conditional inference forests perform comparably to random survival forest models in analysing time-to-event data consisting of covariates with fewer split-points. Although survival forests are promising methods in analysing time-to-event data, it is important to identify the best forest model for analysis based on the nature of covariates of the dataset in question.

  17. Soil salinity study in Northern Great Plains sodium affected soil

    NASA Astrophysics Data System (ADS)

    Kharel, Tulsi P.

    Climate and land-use changes, when combined with the marine sediments that underlie portions of the Northern Great Plains, have increased the salinization and sodification risks. The objectives of this dissertation were to compare three chemical amendment (calcium chloride, sulfuric acid and gypsum) remediation strategies on water permeability and sodium (Na) transport in undisturbed soil columns and to develop a remote sensing technique to characterize salinization in South Dakota soils. Forty-eight undisturbed soil columns (30 cm x 15 cm) collected from White Lake, Redfield, and Pierpont were used to assess the chemical remediation strategies. In this study the experimental design was a completely randomized design and each treatment was replicated four times. Following the application of chemical remediation strategies, 45.2 cm of water was leached through these columns. The leachate was separated into 120-ml increments and analyzed for Na and electrical conductivity (EC). Sulfuric acid increased Na leaching, whereas gypsum and CaCl2 increased water permeability. Our results further indicate that to maintain effective water permeability, the ratio between soil EC and sodium adsorption ratio (SAR) should be considered. In the second study, soil samples from 0-15 cm depth in 62 x 62 m grid spacing were taken from the South Dakota Pierpont (65 ha) and Redfield (17 ha) sites. Saturated paste EC was measured on each soil sample. At each sampling point, reflectance and derived indices (Landsat 5, 7, 8 images), elevation, slope and aspect (LiDAR) were extracted. Regression models based on multiple linear regression, classification and regression tree, cubist, and random forest techniques were developed and their ability to predict soil EC was compared. Results showed that: 1) the random forest method was the most effective because of its ability to capture spatially correlated variation; 2) the short wave infrared (1.5-2.29 μm) and near infrared (0.75-0.90 μm) were very sensitive to soil salinity; 3) the EC prediction model using images from all 3 seasons (spring, summer and fall) performed better on the statewide validation dataset than individual-season models. Finally, in eastern South Dakota, the model predicted that from 2008 to 2012, EC increased in 569,165 ha or 13.4% of the land seeded to corn (Zea mays L.) or soybeans (Glycine max L.).

  18. SNP selection and classification of genome-wide SNP data using stratified sampling random forests.

    PubMed

    Wu, Qingyao; Ye, Yunming; Liu, Yang; Ng, Michael K

    2012-09-01

    For high-dimensional genome-wide association (GWA) case-control data on complex disease, a large portion of single-nucleotide polymorphisms (SNPs) is usually irrelevant to the disease. A simple random sampling method in random forest, using the default mtry parameter to choose the feature subspace, will select too many subspaces without informative SNPs. An exhaustive search for an optimal mtry is often required in order to include useful and relevant SNPs and discard the vast number of non-informative SNPs. However, it is too time-consuming and not favorable in GWA for high-dimensional data. The main aim of this paper is to propose a stratified sampling method for feature subspace selection to generate decision trees in a random forest for GWA high-dimensional data. Our idea is to design an equal-width discretization scheme for informativeness to divide SNPs into multiple groups. In feature subspace selection, we randomly select the same number of SNPs from each group and combine them to form a subspace to generate a decision tree. The advantage of this stratified sampling procedure is that it ensures each subspace contains enough useful SNPs while avoiding the very high computational cost of an exhaustive search for an optimal mtry and maintaining the randomness of a random forest. We employ two genome-wide SNP data sets (Parkinson case-control data comprising 408 803 SNPs and Alzheimer case-control data comprising 380 157 SNPs) to demonstrate that the proposed stratified sampling method is effective, and that it can generate better random forests with higher accuracy and a lower error bound than those produced by Breiman's random forest generation method. For the Parkinson data, we also show some interesting genes identified by the method, which may be associated with neurological disorders, for further biological investigation.
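
    The stratified subspace selection described in this abstract (equal-width discretization of an informativeness score, then the same number of features drawn from each group) might look roughly like the sketch below. It is not the authors' implementation: the abstract does not say how informativeness is measured or how empty groups are handled, so the scores are assumed to be given, groups smaller than the quota contribute all their members, and the names `equal_width_groups` and `stratified_subspace` are hypothetical.

```python
import random


def equal_width_groups(scores, n_groups):
    """Assign each feature index to one of n_groups equal-width score bins."""
    lo, hi = min(scores), max(scores)
    # Fall back to width 1.0 when every score is identical (zero range).
    width = (hi - lo) / n_groups or 1.0
    groups = [[] for _ in range(n_groups)]
    for idx, score in enumerate(scores):
        # The maximum score lands on the upper edge, so clamp it to the last bin.
        bin_index = min(int((score - lo) / width), n_groups - 1)
        groups[bin_index].append(idx)
    return groups


def stratified_subspace(groups, per_group, rng):
    """Draw up to per_group feature indices, without replacement, from each group."""
    subspace = []
    for members in groups:
        k = min(per_group, len(members))  # assumption: small groups give what they have
        subspace.extend(rng.sample(members, k))
    return subspace


# Hypothetical informativeness scores for seven features.
scores = [0.0, 0.1, 0.2, 0.5, 0.55, 0.9, 1.0]
groups = equal_width_groups(scores, n_groups=2)
subspace = stratified_subspace(groups, per_group=2, rng=random.Random(0))
```

    Drawing a fresh subspace with a different random state for each tree keeps the trees diverse while guaranteeing that every subspace includes features from the high-informativeness bins.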

  19. Groundwater potential mapping using C5.0, random forest, and multivariate adaptive regression spline models in GIS.

    PubMed

    Golkarian, Ali; Naghibi, Seyed Amir; Kalantar, Bahareh; Pradhan, Biswajeet

    2018-02-17

    Ever-increasing demand for water resources for different purposes makes it essential to have a better understanding and knowledge of water resources. Groundwater is one of the main water resources, especially in countries with arid climatic conditions. Thus, this study seeks to provide groundwater potential maps (GPMs) employing new algorithms. Accordingly, this study aims to validate the performance of C5.0, random forest (RF), and multivariate adaptive regression splines (MARS) algorithms for generating GPMs in the eastern part of Mashhad Plain, Iran. For this purpose, a dataset was produced consisting of spring locations as the indicator and groundwater-conditioning factors (GCFs) as input. In this research, 13 GCFs were selected including altitude, slope aspect, slope angle, plan curvature, profile curvature, topographic wetness index (TWI), slope length, distance from rivers and faults, rivers and faults density, land use, and lithology. The dataset was divided into training and validation classes with 70 and 30% of the springs, respectively. Then, C5.0, RF, and MARS algorithms were employed using R statistical software, and the final values were transformed into GPMs. Finally, two evaluation criteria, Kappa and area under the receiver operating characteristics curve (AUC-ROC), were calculated. According to the findings of this research, MARS had the best performance with an AUC-ROC of 84.2%, followed by RF and C5.0 algorithms with AUC-ROC values of 79.7 and 77.3%, respectively. The results indicated that AUC-ROC values for the employed models are more than 70%, which shows their acceptable performance. In conclusion, the produced methodology could be used in other geographical areas. GPMs could be used by water resource managers and related organizations to accelerate and facilitate water resource exploitation.

  20. Can machine-learning improve cardiovascular risk prediction using routine clinical data?

    PubMed Central

    Weng, Stephen F.; Reps, Jenna; Kai, Joe; Garibaldi, Jonathan M.; Qureshi, Nadeem

    2017-01-01

    Background: Current approaches to predict cardiovascular risk fail to identify many people who would benefit from preventive treatment, while others receive unnecessary intervention. Machine-learning offers opportunity to improve accuracy by exploiting complex interactions between risk factors. We assessed whether machine-learning can improve cardiovascular risk prediction. Methods: Prospective cohort study using routine clinical data of 378,256 patients from UK family practices, free from cardiovascular disease at outset. Four machine-learning algorithms (random forest, logistic regression, gradient boosting machines, neural networks) were compared to an established algorithm (American College of Cardiology guidelines) to predict first cardiovascular event over 10-years. Predictive accuracy was assessed by area under the ‘receiver operating curve’ (AUC); and sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV) to predict 7.5% cardiovascular risk (threshold for initiating statins). Findings: 24,970 incident cardiovascular events (6.6%) occurred. Compared to the established risk prediction algorithm (AUC 0.728, 95% CI 0.723–0.735), machine-learning algorithms improved prediction: random forest +1.7% (AUC 0.745, 95% CI 0.739–0.750), logistic regression +3.2% (AUC 0.760, 95% CI 0.755–0.766), gradient boosting +3.3% (AUC 0.761, 95% CI 0.755–0.766), neural networks +3.6% (AUC 0.764, 95% CI 0.759–0.769). The highest achieving (neural networks) algorithm predicted 4,998/7,404 cases (sensitivity 67.5%, PPV 18.4%) and 53,458/75,585 non-cases (specificity 70.7%, NPV 95.7%), correctly predicting 355 (+7.6%) more patients who developed cardiovascular disease compared to the established algorithm. Conclusions: Machine-learning significantly improves accuracy of cardiovascular risk prediction, increasing the number of patients identified who could benefit from preventive treatment, while avoiding unnecessary treatment of others. PMID:28376093
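
    The sensitivity, specificity, PPV, and NPV quoted for the neural network follow directly from the counts in the abstract. The sketch below reproduces that arithmetic; the function name `diagnostic_metrics` is hypothetical, and the false-negative and false-positive counts are derived here by subtraction rather than stated in the source.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity, PPV and NPV as fractions."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Counts from the abstract: 4,998 of 7,404 cases and
# 53,458 of 75,585 non-cases were predicted correctly.
tp, cases = 4998, 7404
tn, non_cases = 53458, 75585
metrics = diagnostic_metrics(tp=tp, fn=cases - tp, tn=tn, fp=non_cases - tn)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

    Rounded to one decimal place, the four values agree with the percentages reported in the abstract (67.5%, 70.7%, 18.4%, 95.7%).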

  1. Can machine-learning improve cardiovascular risk prediction using routine clinical data?

    PubMed

    Weng, Stephen F; Reps, Jenna; Kai, Joe; Garibaldi, Jonathan M; Qureshi, Nadeem

    2017-01-01

    Current approaches to predict cardiovascular risk fail to identify many people who would benefit from preventive treatment, while others receive unnecessary intervention. Machine-learning offers opportunity to improve accuracy by exploiting complex interactions between risk factors. We assessed whether machine-learning can improve cardiovascular risk prediction. Prospective cohort study using routine clinical data of 378,256 patients from UK family practices, free from cardiovascular disease at outset. Four machine-learning algorithms (random forest, logistic regression, gradient boosting machines, neural networks) were compared to an established algorithm (American College of Cardiology guidelines) to predict first cardiovascular event over 10-years. Predictive accuracy was assessed by area under the 'receiver operating curve' (AUC); and sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV) to predict 7.5% cardiovascular risk (threshold for initiating statins). 24,970 incident cardiovascular events (6.6%) occurred. Compared to the established risk prediction algorithm (AUC 0.728, 95% CI 0.723-0.735), machine-learning algorithms improved prediction: random forest +1.7% (AUC 0.745, 95% CI 0.739-0.750), logistic regression +3.2% (AUC 0.760, 95% CI 0.755-0.766), gradient boosting +3.3% (AUC 0.761, 95% CI 0.755-0.766), neural networks +3.6% (AUC 0.764, 95% CI 0.759-0.769). The highest achieving (neural networks) algorithm predicted 4,998/7,404 cases (sensitivity 67.5%, PPV 18.4%) and 53,458/75,585 non-cases (specificity 70.7%, NPV 95.7%), correctly predicting 355 (+7.6%) more patients who developed cardiovascular disease compared to the established algorithm. Machine-learning significantly improves accuracy of cardiovascular risk prediction, increasing the number of patients identified who could benefit from preventive treatment, while avoiding unnecessary treatment of others.

  2. Application of Random Forests Methods to Diabetic Retinopathy Classification Analyses

    PubMed Central

    Casanova, Ramon; Saldana, Santiago; Chew, Emily Y.; Danis, Ronald P.; Greven, Craig M.; Ambrosius, Walter T.

    2014-01-01

    Background: Diabetic retinopathy (DR) is one of the leading causes of blindness in the United States and world-wide. DR is a silent disease that may go unnoticed until it is too late for effective treatment. Therefore, early detection could improve the chances of therapeutic interventions that would alleviate its effects. Methodology: Graded fundus photography and systemic data from 3443 ACCORD-Eye Study participants were used to estimate Random Forest (RF) and logistic regression classifiers. We studied the impact of sample size on classifier performance and the possibility of using RF generated class conditional probabilities as metrics describing DR risk. RF measures of variable importance are used to detect factors that affect classification performance. Principal Findings: Both types of data were informative when discriminating participants with or without DR. RF based models produced much higher classification accuracy than those based on logistic regression. Combining both types of data did not increase accuracy but did increase statistical discrimination of healthy participants who subsequently did or did not have DR events during four years of follow-up. RF variable importance criteria revealed that microaneurysm counts in both eyes seemed to play the most important role in discrimination among the graded fundus variables, while the number of medicines and diabetes duration were the most relevant among the systemic variables. Conclusions and Significance: We have introduced RF methods to DR classification analyses based on fundus photography data. In addition, we propose an approach to DR risk assessment based on metrics derived from graded fundus photography and systemic data. Our results suggest that RF methods could be a valuable tool to diagnose DR and evaluate its progression. PMID:24940623

  3. Classification of suicide attempters in schizophrenia using sociocultural and clinical features: A machine learning approach.

    PubMed

    Hettige, Nuwan C; Nguyen, Thai Binh; Yuan, Chen; Rajakulendran, Thanara; Baddour, Jermeen; Bhagwat, Nikhil; Bani-Fatemi, Ali; Voineskos, Aristotle N; Mallar Chakravarty, M; De Luca, Vincenzo

    2017-07-01

    Suicide is a major concern for those afflicted by schizophrenia. Identifying patients at the highest risk for future suicide attempts remains a complex problem for psychiatric interventions. Machine learning models allow for the integration of many risk factors in order to build an algorithm that predicts which patients are likely to attempt suicide. Currently it is unclear how to integrate previously identified risk factors into a clinically relevant predictive tool to estimate the probability that a patient with schizophrenia will attempt suicide. We conducted a cross-sectional assessment on a sample of 345 participants diagnosed with schizophrenia spectrum disorders. Suicide attempters and non-attempters were clearly identified using the Columbia Suicide Severity Rating Scale (C-SSRS) and the Beck Suicide Ideation Scale (BSS). We developed four classification algorithms using regularized regression, random forest, elastic net, and support vector machine models with sociocultural and clinical variables as features to train the models. All classification models performed similarly in identifying suicide attempters and non-attempters. Our regularized logistic regression model demonstrated an accuracy of 67% and an area under the curve (AUC) of 0.71, while the random forest model demonstrated 66% accuracy and an AUC of 0.67. The support vector classifier (SVC) model demonstrated an accuracy of 67% and an AUC of 0.70, and the elastic net model demonstrated an accuracy of 65% and an AUC of 0.71. Machine learning algorithms offer a relatively successful method for incorporating many clinical features to predict individuals at risk for future suicide attempts. Increased performance of these models using clinically relevant variables offers the potential to facilitate early treatment and intervention to prevent future suicide attempts. Copyright © 2017 Elsevier Inc. All rights reserved.

  4. Empirical study of seven data mining algorithms on different characteristics of datasets for biomedical classification applications.

    PubMed

    Zhang, Yiyan; Xin, Yi; Li, Qin; Ma, Jianshe; Li, Shuai; Lv, Xiaodan; Lv, Weiqi

    2017-11-02

    Various kinds of data mining algorithms are continuously proposed with the development of related disciplines. The applicable scopes and performances of these algorithms differ. Hence, finding a suitable algorithm for a dataset is becoming an important emphasis for biomedical researchers to solve practical problems promptly. In this paper, seven kinds of sophisticated active algorithms, namely, C4.5, support vector machine, AdaBoost, k-nearest neighbor, naïve Bayes, random forest, and logistic regression, were selected as the research objects. The seven algorithms were applied to the 12 top-click UCI public datasets with the task of classification, and their performances were compared through induction and analysis. The sample size, number of attributes, number of missing values, the sample size of each class, correlation coefficients between variables, class entropy of the task variable, and the ratio of the sample size of the largest class to the smallest class were calculated to characterize the 12 research datasets. The two ensemble algorithms reach high accuracy of classification on most datasets. Moreover, random forest performs better than AdaBoost on the unbalanced dataset of the multi-class task. Simple algorithms, such as the naïve Bayes and logistic regression models, are suitable for a small dataset with high correlation between the task and other non-task attribute variables. K-nearest neighbor and C4.5 decision tree algorithms perform well on binary- and multi-class task datasets. Support vector machine is more adept on the balanced small dataset of the binary-class task. No algorithm can maintain the best performance in all datasets. The applicability of the seven data mining algorithms on the datasets with different characteristics was summarized to provide a reference for biomedical researchers or beginners in different fields.

  5. Data-driven mapping of the potential mountain permafrost distribution.

    PubMed

    Deluigi, Nicola; Lambiel, Christophe; Kanevski, Mikhail

    2017-07-15

    Existing mountain permafrost distribution models generally offer a good overview of the potential extent of this phenomenon at a regional scale. They are however not always able to reproduce the high spatial discontinuity of permafrost at the micro-scale (scale of a specific landform; ten to several hundreds of meters). To overcome this lack, we tested an alternative modelling approach using three classification algorithms belonging to statistics and machine learning: Logistic regression, Support Vector Machines and Random forests. These supervised learning techniques infer a classification function from labelled training data (pixels of permafrost absence and presence) with the aim of predicting the permafrost occurrence where it is unknown. The research was carried out in a 588 km² area of the Western Swiss Alps. Permafrost evidences were mapped from ortho-image interpretation (rock glacier inventorying) and field data (mainly geoelectrical and thermal data). The relationship between selected permafrost evidences and permafrost controlling factors was computed with the mentioned techniques. Classification performances, assessed with AUROC, range between 0.81 for Logistic regression, 0.85 with Support Vector Machines and 0.88 with Random forests. The adopted machine learning algorithms have demonstrated to be efficient for permafrost distribution modelling thanks to consistent results compared to the field reality. The high resolution of the input dataset (10 m) allows elaborating maps at the micro-scale with a modelled permafrost spatial distribution less optimistic than classic spatial models. Moreover, the probability output of adopted algorithms offers a more precise overview of the potential distribution of mountain permafrost than proposing simple indexes of the permafrost favorability. These encouraging results also open the way to new possibilities of permafrost data analysis and mapping. Copyright © 2017 Elsevier B.V. All rights reserved.

  6. Spectroscopic diagnosis of laryngeal carcinoma using near-infrared Raman spectroscopy and random recursive partitioning ensemble techniques.

    PubMed

    Teh, Seng Khoon; Zheng, Wei; Lau, David P; Huang, Zhiwei

    2009-06-01

    In this work, we evaluated the diagnostic ability of near-infrared (NIR) Raman spectroscopy associated with the ensemble recursive partitioning algorithm based on random forests for identifying cancer from normal tissue in the larynx. A rapid-acquisition NIR Raman system was utilized for tissue Raman measurements at 785 nm excitation, and 50 human laryngeal tissue specimens (20 normal; 30 malignant tumors) were used for NIR Raman studies. The random forests method was introduced to develop effective diagnostic algorithms for classification of Raman spectra of different laryngeal tissues. High-quality Raman spectra in the range of 800-1800 cm⁻¹ can be acquired from laryngeal tissue within 5 seconds. Raman spectra differed significantly between normal and malignant laryngeal tissues. Classification results obtained from the random forests algorithm on tissue Raman spectra yielded a diagnostic sensitivity of 88.0% and specificity of 91.4% for laryngeal malignancy identification. The random forests technique also provided variables importance that facilitates correlation of significant Raman spectral features with cancer transformation. This study shows that NIR Raman spectroscopy in conjunction with random forests algorithm has a great potential for the rapid diagnosis and detection of malignant tumors in the larynx.

  7. Estimation of Rice Crop Yields Using Random Forests in Taiwan

    NASA Astrophysics Data System (ADS)

    Chen, C. F.; Lin, H. S.; Nguyen, S. T.; Chen, C. R.

    2017-12-01

    Rice is globally one of the most important food crops, directly feeding more people than any other crop. Rice is not only the most important commodity, but also plays a critical role in the economy of Taiwan because it provides employment and income for large rural populations. The rice harvested area and production are thus monitored yearly due to the government's initiatives. Agronomic planners need such information for more precise assessment of food production to tackle issues of national food security and policymaking. This study aimed to develop a machine-learning approach using physical parameters to estimate rice crop yields in Taiwan. We processed the data for 2014 cropping seasons, following three main steps: (1) data pre-processing to construct input layers, including soil types and weather parameters (e.g., maximum and minimum air temperature, precipitation, and solar radiation) obtained from meteorological stations across the country; (2) crop yield estimation using random forests, chosen because the method can process thousands of variables, estimate missing data, maintain the accuracy level when a large proportion of the data is missing, overcome most over-fitting problems, and run fast and efficiently when handling large datasets; and (3) error verification. To execute the model, we separated the datasets into two groups of pixels: group-1 (70% of pixels) for training the model and group-2 (30% of pixels) for testing the model. Once the model is trained to produce small and stable out-of-bag error (i.e., the mean squared error between predicted and actual values), it can be used for estimating rice yields of cropping seasons. Comparison of the results obtained from the random forests-based regression with the actual yield statistics indicated that the root mean square error (RMSE) and mean absolute error (MAE) for the first rice crop were 6.2% and 2.7%, respectively, while those for the second rice crop were 5.3% and 2.9%. Although there are several uncertainties attributed to the data quality of input layers, our study demonstrates the promising application of random forests for estimating rice crop yields at the national level in Taiwan. This approach could be transferable to other regions of the world for improving large-scale estimation of rice crop yields.

  8. Determining storm sampling requirements for improving precision of annual load estimates of nutrients from a small forested watershed.

    PubMed

    Ide, Jun'ichiro; Chiwa, Masaaki; Higashi, Naoko; Maruno, Ryoko; Mori, Yasushi; Otsuki, Kyoichi

    2012-08-01

    This study sought to determine the lowest number of storm events required for adequate estimation of annual nutrient loads from a forested watershed using the regression equation between cumulative load (∑L) and cumulative stream discharge (∑Q). Hydrological surveys were conducted for 4 years, and stream water was sampled sequentially at 15-60-min intervals during 24 h in 20 events, as well as weekly in a small forested watershed. The bootstrap sampling technique was used to determine the regression (∑L-∑Q) equations of dissolved nitrogen (DN) and phosphorus (DP), particulate nitrogen (PN) and phosphorus (PP), dissolved inorganic nitrogen (DIN), and suspended solid (SS) for each dataset of ∑L and ∑Q. For dissolved nutrients (DN, DP, DIN), the coefficient of variance (CV) in 100 replicates of 4-year average annual load estimates was below 20% with datasets composed of five storm events. For particulate nutrients (PN, PP, SS), the CV exceeded 20%, even with datasets composed of more than ten storm events. The differences in the number of storm events required for precise load estimates between dissolved and particulate nutrients were attributed to the goodness of fit of the ∑L-∑Q equations. Bootstrap simulation based on flow-stratified sampling resulted in fewer storm events than the simulation based on random sampling and showed that only three storm events were required to give a CV below 20% for dissolved nutrients. These results indicate that a sampling design considering discharge levels reduces the frequency of laborious chemical analyses of water samples required throughout the year.
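
    The bootstrap procedure described above (resample events, recompute the estimate, report the CV across replicates) can be sketched as follows. This is a simplified illustration under stated assumptions: the statistic is a plain mean of per-event loads rather than the ∑L-∑Q regression fitted in the study, and the function name `bootstrap_cv` and the data are hypothetical.

```python
import random
import statistics


def bootstrap_cv(values, n_replicates=100, seed=0):
    """CV (standard deviation / mean) of a bootstrapped mean.

    Each replicate resamples `values` with replacement and records the mean;
    the CV is then computed across the replicate means.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_replicates):
        sample = [rng.choice(values) for _ in values]
        estimates.append(statistics.mean(sample))
    return statistics.stdev(estimates) / statistics.mean(estimates)


# Hypothetical per-event loads (arbitrary units), not data from the study.
event_loads = [1.2, 0.8, 2.5, 1.9, 0.6, 3.1, 1.4, 2.2]
cv_small = bootstrap_cv(event_loads[:3])
cv_large = bootstrap_cv(event_loads)
print(f"CV with 3 events: {cv_small:.1%}; with 8 events: {cv_large:.1%}")
```

    Running the function on datasets of increasing size is one way to look for the smallest number of events that keeps the CV under a chosen threshold such as the 20% used in the abstract.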

  9. Mapping the spatial pattern of temperate forest above ground biomass by integrating airborne lidar with Radarsat-2 imagery via geostatistical models

    NASA Astrophysics Data System (ADS)

    Li, Wang; Niu, Zheng; Gao, Shuai; Wang, Cheng

    2014-11-01

    Light Detection and Ranging (LiDAR) and Synthetic Aperture Radar (SAR) are two competitive active remote sensing techniques in forest above ground biomass estimation, which is important for forest management and global climate change study. This study aims to further explore their capabilities in temperate forest above ground biomass (AGB) estimation by emphasizing the spatial auto-correlation of variables obtained from these two remote sensing tools, which is a usually overlooked aspect in remote sensing applications to vegetation studies. Remote sensing variables including airborne LiDAR metrics, backscattering coefficient for different SAR polarizations and their ratio variables for Radarsat-2 imagery were calculated. First, simple linear regression (SLR) models were established between the field-estimated above ground biomass and the remote sensing variables. Pearson's correlation coefficient (R²) was used to find which LiDAR metric showed the most significant correlation with the regression residuals and could be selected as co-variable in regression co-kriging (RCoKrig). Second, regression co-kriging was conducted by choosing the regression residuals as dependent variable and the LiDAR metric (Hmean) with the highest R² as co-variable. Third, above ground biomass over the study area was estimated using the SLR model and the RCoKrig model, respectively. The results for these two models were validated using the same ground points. Results showed that both methods achieved satisfactory prediction accuracy, while regression co-kriging showed the lower estimation error. The regression co-kriging model is shown to be feasible and effective in mapping the spatial pattern of AGB in the temperate forest using Radarsat-2 data calibrated by airborne LiDAR metrics.

  10. Automated Classification of Consumer Health Information Needs in Patient Portal Messages.

    PubMed

    Cronin, Robert M; Fabbri, Daniel; Denny, Joshua C; Jackson, Gretchen Purcell

    2015-01-01

    Patients have diverse health information needs, and secure messaging through patient portals is an emerging means by which such needs are expressed and met. As patient portal adoption increases, growing volumes of secure messages may burden healthcare providers. Automated classification could expedite portal message triage and answering. We created four automated classifiers based on word content and natural language processing techniques to identify health information needs in 1000 patient-generated portal messages. Logistic regression and random forest classifiers detected single information needs well, with area under the curves of 0.804-0.914. A logistic regression classifier accurately found the set of needs within a message, with a Jaccard index of 0.859 (95% Confidence Interval: (0.847, 0.871)). Automated classification of consumer health information needs expressed in patient portal messages is feasible and may allow direct linking to relevant resources or creation of institutional resources for commonly expressed needs.
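
    The Jaccard index used above to score the predicted set of needs in a message is the size of the intersection of two label sets divided by the size of their union. A minimal sketch follows; the function name `jaccard_index` and the category labels are illustrative assumptions, not the study's code or taxonomy.

```python
def jaccard_index(predicted, actual):
    """Jaccard index |A & B| / |A | B| between two label sets."""
    predicted, actual = set(predicted), set(actual)
    if not predicted and not actual:
        return 1.0  # convention assumed here: two empty sets match perfectly
    return len(predicted & actual) / len(predicted | actual)


# Hypothetical need categories for one message.
predicted = {"medical", "logistical"}
actual = {"medical", "logistical", "social"}
print(jaccard_index(predicted, actual))  # 2 shared labels out of 3 distinct
```

    Averaging this per-message score over all messages gives a single summary figure comparable in form to the one reported.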

  11. Automated Classification of Consumer Health Information Needs in Patient Portal Messages

    PubMed Central

    Cronin, Robert M.; Fabbri, Daniel; Denny, Joshua C.; Jackson, Gretchen Purcell

    2015-01-01

    Patients have diverse health information needs, and secure messaging through patient portals is an emerging means by which such needs are expressed and met. As patient portal adoption increases, growing volumes of secure messages may burden healthcare providers. Automated classification could expedite portal message triage and answering. We created four automated classifiers based on word content and natural language processing techniques to identify health information needs in 1000 patient-generated portal messages. Logistic regression and random forest classifiers detected single information needs well, with area under the curves of 0.804–0.914. A logistic regression classifier accurately found the set of needs within a message, with a Jaccard index of 0.859 (95% Confidence Interval: (0.847, 0.871)). Automated classification of consumer health information needs expressed in patient portal messages is feasible and may allow direct linking to relevant resources or creation of institutional resources for commonly expressed needs. PMID:26958285

  12. Predicting the dissolution kinetics of silicate glasses using machine learning

    NASA Astrophysics Data System (ADS)

    Anoop Krishnan, N. M.; Mangalathu, Sujith; Smedskjaer, Morten M.; Tandia, Adama; Burton, Henry; Bauchy, Mathieu

    2018-05-01

    Predicting the dissolution rates of silicate glasses in aqueous conditions is a complex task as the underlying mechanism(s) remain poorly understood and the dissolution kinetics can depend on a large number of intrinsic and extrinsic factors. Here, we assess the potential of data-driven models based on machine learning to predict the dissolution rates of various aluminosilicate glasses exposed to a wide range of solution pH values, from acidic to caustic conditions. Four classes of machine learning methods are investigated, namely, linear regression, support vector machine regression, random forest, and artificial neural network. We observe that, although linear methods all fail to describe the dissolution kinetics, the artificial neural network approach offers excellent predictions, thanks to its inherent ability to handle non-linear data. Overall, we suggest that a more extensive use of machine learning approaches could significantly accelerate the design of novel glasses with tailored properties.

  13. Automatic segmentation and classification of mycobacterium tuberculosis with conventional light microscopy

    NASA Astrophysics Data System (ADS)

    Xu, Chao; Zhou, Dongxiang; Zhai, Yongping; Liu, Yunhui

    2015-12-01

    This paper realizes the automatic segmentation and classification of Mycobacterium tuberculosis with conventional light microscopy. First, the candidate bacillus objects are segmented by the marker-based watershed transform. The markers are obtained by an adaptive threshold segmentation based on the adaptive scale Gaussian filter. The scale of the Gaussian filter is determined according to the color model of the bacillus objects. Then the candidate objects are extracted integrally after region merging and elimination of contaminants. Second, the shape features of the bacillus objects are characterized by the Hu moments, compactness, eccentricity, and roughness, which are used to classify the single, touching and non-bacillus objects. We evaluated logistic regression, random forest, and intersection kernel support vector machine classifiers in classifying the bacillus objects. Experimental results demonstrate that the proposed method yields high robustness and accuracy. The logistic regression classifier performs best with an accuracy of 91.68%.

  14. National scale biomass estimators for United States tree species

    Treesearch

    Jennifer C. Jenkins; David C. Chojnacky; Linda S. Heath; Richard A. Birdsey

    2003-01-01

    Estimates of national-scale forest carbon (C) stocks and fluxes are typically based on allometric regression equations developed using dimensional analysis techniques. However, the literature is inconsistent and incomplete with respect to large-scale forest C estimation. We compiled all available diameter-based allometric regression equations for estimating total...

  15. Adapting GNU random forest program for Unix and Windows

    NASA Astrophysics Data System (ADS)

    Jirina, Marcel; Krayem, M. Said; Jirina, Marcel, Jr.

    2013-10-01

    The Random Forest is a well-known method and also a program for data clustering and classification. Unfortunately, the original Random Forest program is rather difficult to use. Here we describe a new version of this program originally written in Fortran 77. The modified program in Fortran 95 needs to be compiled only once and information for different tasks is passed with help of arguments. The program was tested with 24 data sets from UCI MLR and results are available on the net.

  16. What contributes to perceived stress in later life? A recursive partitioning approach.

    PubMed

    Scott, Stacey B; Jackson, Brenda R; Bergeman, C S

    2011-12-01

    One possible explanation for the individual differences in outcomes of stress is the diversity of inputs that produce perceptions of being stressed. The current study examines how combinations of contextual features (e.g., social isolation, neighborhood quality, health problems, age discrimination, financial concerns, and recent life events) of later life contribute to overall feelings of stress. Recursive partitioning techniques (regression trees and random forests) were used to examine unique interrelations between predictors of perceived stress in a sample of 282 community-dwelling adults. Trees provided possible examples of equifinality (i.e., subsets of people with similar levels of perceived stress but different predictors) as well as identification both of contextual combinations that separated participants with very high and very low perceived stress. Random forest analyses aggregated across many trees based on permuted versions of the data and predictors; loneliness, financial strain, neighborhood strain, ageism, and to some extent life events emerged as important predictors. Interviews with a subsample of participants provided both thick description of the complex relationships identified in the trees, as well as additional risks not appearing in the survey results. Together, the analyses highlight what may be missed when stress is used as a simple unidimensional construct and can guide differential intervention efforts.

  17. What contributes to perceived stress in later life? A recursive partitioning approach

    PubMed Central

    Scott, Stacey B.; Jackson, Brenda R.; Bergeman, C. S.

    2011-01-01

    One possible explanation for the individual differences in outcomes of stress is the diversity of inputs that produce perceptions of being stressed. The current study examines how combinations of contextual features (e.g., social isolation, neighborhood quality, health problems, age discrimination, financial concerns, and recent life events) of later life contribute to overall feelings of stress. Recursive partitioning techniques (regression trees and random forests) were used to examine unique interrelations between predictors of perceived stress in a sample of 282 community-dwelling adults. Trees provided possible examples of equifinality (i.e., subsets of people with similar levels of perceived stress but different predictors) as well as for the identification both of contextual combinations that separated participants with very high and very low perceived stress. Random forest analyses aggregated across many trees based on permuted versions of the data and predictors; loneliness, financial strain, neighborhood strain, ageism, and to some extent life events emerged as important predictors. Interviews with a subsample of participants provided both thick description of the complex relationships identified in the trees, as well as additional risks not appearing in the survey results. Together, the analyses highlight what may be missed when stress is used as a simple unidimensional construct and can guide differential intervention efforts. PMID:21604885

  18. Random Forests for Global and Regional Crop Yield Predictions.

    PubMed

    Jeong, Jig Han; Resop, Jonathan P; Mueller, Nathaniel D; Fleisher, David H; Yun, Kyungdahm; Butler, Ethan E; Timlin, Dennis J; Shim, Kyo-Moon; Gerber, James S; Reddy, Vangimalla R; Kim, Soo-Hyung

    2016-01-01

    Accurate predictions of crop yield are critical for developing effective agricultural and food policies at the regional and global scales. We evaluated a machine-learning method, Random Forests (RF), for its ability to predict crop yield responses to climate and biophysical variables at global and regional scales in wheat, maize, and potato in comparison with multiple linear regressions (MLR) serving as a benchmark. We used crop yield data from various sources and regions for model training and testing: 1) gridded global wheat grain yield, 2) maize grain yield from US counties over thirty years, and 3) potato tuber and maize silage yield from the northeastern seaboard region. RF was found highly capable of predicting crop yields and outperformed MLR benchmarks in all performance statistics that were compared. For example, the root mean square errors (RMSE) ranged between 6 and 14% of the average observed yield with RF models in all test cases whereas these values ranged from 14% to 49% for MLR models. Our results show that RF is an effective and versatile machine-learning method for crop yield predictions at regional and global scales for its high accuracy and precision, ease of use, and utility in data analysis. RF may result in a loss of accuracy when predicting the extreme ends or responses beyond the boundaries of the training data.
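
    The abstract reports RMSE as a percentage of the average observed yield. The sketch below shows that normalization; the function name `relative_rmse` and the yield values are hypothetical and are not data from the study.

```python
import math


def relative_rmse(observed, predicted):
    """RMSE expressed as a fraction of the mean observed value."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    return math.sqrt(mse) / (sum(observed) / len(observed))


# Hypothetical yields (t/ha): every prediction is off by 0.5.
observed = [4.0, 5.0, 6.0, 5.0]
predicted = [4.5, 4.5, 6.5, 4.5]
print(f"{relative_rmse(observed, predicted):.1%}")  # RMSE 0.5 over mean 5.0
```

    Dividing by the mean observed yield makes errors comparable across crops and regions whose yields differ in magnitude.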

  19. Predicting human liver microsomal stability with machine learning techniques.

    PubMed

    Sakiyama, Yojiro; Yuki, Hitomi; Moriya, Takashi; Hattori, Kazunari; Suzuki, Misaki; Shimada, Kaoru; Honma, Teruki

    2008-02-01

    To ensure a continuing pipeline in pharmaceutical research, lead candidates must possess appropriate metabolic stability in the drug discovery process. In vitro ADMET (absorption, distribution, metabolism, elimination, and toxicity) screening provides us with useful information regarding the metabolic stability of compounds. However, before the synthesis stage, an efficient process is required in order to deal with the vast quantity of data from large compound libraries and high-throughput screening. Here we have derived a relationship between the chemical structure and its metabolic stability for a data set of in-house compounds by means of various in silico machine learning such as random forest, support vector machine (SVM), logistic regression, and recursive partitioning. For model building, 1952 proprietary compounds comprising two classes (stable/unstable) were used with 193 descriptors calculated by Molecular Operating Environment. The results using test compounds have demonstrated that all classifiers yielded satisfactory results (accuracy > 0.8, sensitivity > 0.9, specificity > 0.6, and precision > 0.8). Above all, classification by random forest as well as SVM yielded kappa values of approximately 0.7 in an independent validation set, slightly higher than other classification tools. These results suggest that nonlinear/ensemble-based classification methods might prove useful in the area of in silico ADME modeling.
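
    The performance measures named in the abstract above (accuracy, sensitivity, specificity, precision, and kappa) can all be derived from a two-class confusion matrix. The sketch below uses the standard definitions; the counts are hypothetical and are not the study's results.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard two-class metrics from confusion-matrix counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # positive predictive value
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (accuracy - p_chance) / (1 - p_chance)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "kappa": kappa,
    }

# Hypothetical counts ("stable" treated as the positive class).
m = binary_metrics(tp=90, fp=15, fn=10, tn=35)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

    Kappa is lower than raw accuracy here because part of the agreement would be expected by chance when one class dominates.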

  20. Application of Machine Learning Approaches for Classifying Sitting Posture Based on Force and Acceleration Sensors.

    PubMed

    Zemp, Roland; Tanadini, Matteo; Plüss, Stefan; Schnüriger, Karin; Singh, Navrag B; Taylor, William R; Lorenzetti, Silvio

    2016-01-01

    Occupational musculoskeletal disorders, particularly chronic low back pain (LBP), are ubiquitous due to prolonged static sitting or nonergonomic sitting positions. Therefore, the aim of this study was to develop an instrumented chair with force and acceleration sensors to determine the accuracy of automatically identifying the user's sitting position by applying five different machine learning methods (Support Vector Machines, Multinomial Regression, Boosting, Neural Networks, and Random Forest). Forty-one subjects were requested to sit four times in seven different prescribed sitting positions (total 1148 samples). Sixteen force sensor values and the backrest angle were used as the explanatory variables (features) for the classification. The different classification methods were compared by means of a Leave-One-Out cross-validation approach. The best performance was achieved using the Random Forest classification algorithm, producing a mean classification accuracy of 90.9% for subjects with which the algorithm was not familiar. The classification accuracy varied between 81% and 98% for the seven different sitting positions. The present study showed the possibility of accurately classifying different sitting positions by means of the introduced instrumented office chair combined with machine learning analyses. The use of such novel approaches for the accurate assessment of chair usage could offer insights into the relationships between sitting position, sitting behaviour, and the occurrence of musculoskeletal disorders.
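
    The abstract above reports accuracy for subjects unfamiliar to the algorithm, which suggests that each cross-validation fold holds out all samples of one subject. Assuming that reading (the abstract does not spell out the fold structure), the splits could be generated as in this sketch; the function name is hypothetical.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) per distinct subject."""
    for held_out in sorted(set(subject_ids)):
        train_idx = [i for i, s in enumerate(subject_ids) if s != held_out]
        test_idx = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train_idx, test_idx

# Sample layout from the abstract: 41 subjects x 4 repetitions x 7 positions.
subject_ids = [s for s in range(41) for _rep in range(4) for _pos in range(7)]
folds = list(leave_one_subject_out(subject_ids))
print(len(subject_ids), "samples,", len(folds), "folds")
```

    Grouping by subject keeps repeated measurements of one person out of the training set whenever that person is being tested.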

  1. Mortality risk score prediction in an elderly population using machine learning.

    PubMed

    Rose, Sherri

    2013-03-01

    Standard practice for prediction often relies on parametric regression methods. Interesting new methods from the machine learning literature have been introduced in epidemiologic studies, such as random forest and neural networks. However, a priori, an investigator will not know which algorithm to select and may wish to try several. Here I apply the super learner, an ensembling machine learning approach that combines multiple algorithms into a single algorithm and returns a prediction function with the best cross-validated mean squared error. Super learning is a generalization of stacking methods. I used super learning in the Study of Physical Performance and Age-Related Changes in Sonomans (SPPARCS) to predict death among 2,066 residents of Sonoma, California, aged 54 years or more during the period 1993-1999. The super learner for predicting death (risk score) improved upon all single algorithms in the collection of algorithms, although its performance was similar to that of several algorithms. Super learner outperformed the worst algorithm (neural networks) by 44% with respect to estimated cross-validated mean squared error and had an R2 value of 0.201. The improvement of super learner over random forest with respect to R2 was approximately 2-fold. Alternatives for risk score prediction include the super learner, which can provide improved performance.

  2. Land surface temperature downscaling using random forest regression: primary result and sensitivity analysis

    NASA Astrophysics Data System (ADS)

    Pan, Xin; Cao, Chen; Yang, Yingbao; Li, Xiaolong; Shan, Liangliang; Zhu, Xi

    2018-04-01

    The land surface temperature (LST) derived from thermal infrared satellite images is a meaningful variable in many remote sensing applications. However, the spatial resolution of current satellite thermal infrared sensors is too coarse to meet application needs. In this study, an LST image was downscaled using a random forest model relating LST to multiple predictors in an arid region with an oasis-desert ecotone. The proposed downscaling approach was evaluated using LST derived from the MODIS LST product of Zhangye City in the Heihe Basin. The primary result showed that the distribution of downscaled LST matched that of the oasis and desert ecosystems. According to the sensitivity analysis, the factors to which LST downscaling was most sensitive were modified normalized difference water index (MNDWI)/normalized multi-band drought index (NMDI), soil adjusted vegetation index (SAVI)/shortwave infrared reflectance (SWIR)/normalized difference vegetation index (NDVI), normalized difference building index (NDBI)/SAVI and SWIR/NDBI/MNDWI/NDWI for the water, vegetation, building and desert regions, with LST variations (at most) of 0.20/-0.22 K, 0.92/0.62/0.46 K, 0.28/-0.29 K and 3.87/-1.53/-0.64/-0.25 K under predictor perturbations of +/-0.02, respectively.

  3. Tree allometry and improved estimation of carbon stocks and balance in tropical forests.

    PubMed

    Chave, J; Andalo, C; Brown, S; Cairns, M A; Chambers, J Q; Eamus, D; Fölster, H; Fromard, F; Higuchi, N; Kira, T; Lescure, J-P; Nelson, B W; Ogawa, H; Puig, H; Riéra, B; Yamakura, T

    2005-08-01

    Tropical forests hold large stores of carbon, yet uncertainty remains regarding their quantitative contribution to the global carbon cycle. One approach to quantifying carbon biomass stores consists in inferring changes from long-term forest inventory plots. Regression models are used to convert inventory data into an estimate of aboveground biomass (AGB). We provide a critical reassessment of the quality and the robustness of these models across tropical forest types, using a large dataset of 2,410 trees ≥ 5 cm diameter, directly harvested in 27 study sites across the tropics. Proportional relationships between aboveground biomass and the product of wood density, trunk cross-sectional area, and total height are constructed. We also develop a regression model involving wood density and stem diameter only. Our models were tested for secondary and old-growth forests, for dry, moist and wet forests, for lowland and montane forests, and for mangrove forests. The most important predictors of AGB of a tree were, in decreasing order of importance, its trunk diameter, wood specific gravity, total height, and forest type (dry, moist, or wet). Overestimates prevailed, giving a bias of 0.5-6.5% when errors were averaged across all stands. Our regression models can be used reliably to predict aboveground tree biomass across a broad range of tropical forests. Because they are based on an unprecedented dataset, these models should improve the quality of tropical biomass estimates, and bring consensus about the contribution of the tropical forest biome and tropical deforestation to the global carbon cycle.

  4. Forest Aboveground Biomass Mapping and Canopy Cover Estimation from Simulated ICESat-2 Data

    NASA Astrophysics Data System (ADS)

    Narine, L.; Popescu, S. C.; Neuenschwander, A. L.

    2017-12-01

    The assessment of forest aboveground biomass (AGB) can contribute to reducing uncertainties associated with the amount and distribution of terrestrial carbon. With a planned launch date of July 2018, the Ice, Cloud and Land Elevation Satellite-2 (ICESat-2) will provide data which will offer the possibility of mapping AGB at global scales. In this study, we develop approaches for utilizing vegetation data that will be delivered in ICESat-2's land-vegetation along track product (ATL08). The specific objectives are to: (1) simulate ICESat-2 photon-counting lidar (PCL) data using airborne lidar data, (2) utilize simulated PCL data to estimate forest canopy cover and AGB and, (3) upscale AGB predictions to create a wall-to-wall AGB map at 30-m spatial resolution. Using existing airborne lidar data for Sam Houston National Forest (SHNF) located in southeastern Texas and known ICESat-2 beam locations, PCL data are simulated from discrete return lidar points. We use multiple linear regression models to relate simulated PCL metrics for 100 m segments along the ICESat-2 ground tracks to AGB from a biomass map developed using airborne lidar data and canopy cover calculated from the same. Random Forest is then used to create an AGB map from predicted estimates and explanatory data consisting of spectral metrics derived from Landsat TM imagery and land cover data from the National Land Cover Database (NLCD). Findings from this study will demonstrate how data that will be acquired by ICESat-2 can be used to estimate forest structure and characterize the spatial distribution of AGB.

  5. Mapping Deforestation area in North Korea Using Phenology-based Multi-Index and Random Forest

    NASA Astrophysics Data System (ADS)

    Jin, Y.; Sung, S.; Lee, D. K.; Jeong, S.

    2016-12-01

    Forest ecosystems provide ecological benefits to both humans and wildlife. Growing global demand for food and fiber is increasing the pressure on forest ecosystems worldwide from agriculture and logging. Between 1990 and 2015, North Korea lost almost 40% of its forests to conversion to crop fields for food production and to cutting for fuel wood. This has increased the damage caused by natural disasters, and the country is known to be one of the most forest-degraded areas in the world. The forest landscape in North Korea is complex and heterogeneous; the major landscape types in the forest are hillside farm, unstocked forest, natural forest and plateau vegetation. Remote sensing can be used for forest degradation mapping of a dynamic landscape at a broad scale of detail and spatial distribution. Confusion mostly occurred between hillside farmland and unstocked forest, but also between unstocked forest and forest. Most previous forest degradation studies focused on the classification of broad types, such as deforested area and sand, from the perspective of land cover classification. The objective of this study is to use random forest to map degraded forest in North Korea with a phenology-based vegetation index derived from MODIS products, which capture various environmental factors such as vegetation, soil and water at a regional scale, in order to improve accuracy. The random forest model achieved an overall accuracy of 91.44%. The class user's accuracies of hillside farmland and unstocked forest, the classes that indicate degraded forest, were 97.2% and 84%, respectively. Unstocked forest had relatively low user's accuracy due to misclassified hillside farmland and forest samples. The producer's accuracies of hillside farmland and unstocked forest were 85.2% and 93.3%, respectively. In this case, hillside farmland had lower producer's accuracy mainly due to confusion with field, unstocked forest and forest. Such a classification of degraded forest could supply essential information for deciding the priority of forest management and restoration in degraded forest areas.
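
    Overall, user's, and producer's accuracy as reported in the abstract above are all read off a confusion matrix. The sketch below assumes rows are mapped (predicted) classes and columns are reference classes; the matrix is hypothetical and not taken from the study.

```python
def map_accuracies(matrix):
    """Return (overall, user's, producer's) accuracy from a square confusion matrix.

    Rows are mapped classes and columns are reference classes (assumed layout).
    """
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    overall = sum(matrix[i][i] for i in range(k)) / total
    # User's accuracy: share of pixels mapped as class i that truly are class i.
    users = [matrix[i][i] / sum(matrix[i]) for i in range(k)]
    # Producer's accuracy: share of reference class i that was mapped as class i.
    producers = [matrix[i][i] / sum(matrix[r][i] for r in range(k)) for i in range(k)]
    return overall, users, producers

# Hypothetical counts for three classes, in the same order on both axes.
matrix = [
    [40, 10, 0],
    [0, 30, 10],
    [0, 0, 10],
]
overall, users, producers = map_accuracies(matrix)
print(overall, users, producers)
```

    User's accuracy measures commission error (how trustworthy a mapped label is), while producer's accuracy measures omission error (how much of a real class the map captures).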

  6. Estimating Mixed Broadleaves Forest Stand Volume Using Dsm Extracted from Digital Aerial Images

    NASA Astrophysics Data System (ADS)

    Sohrabi, H.

    2012-07-01

    In the mixed old-growth broadleaf stands of the Hyrcanian forests, it is difficult to estimate stand volume at plot level from remotely sensed data when LiDAR data are absent. In this paper, a new approach is proposed and tested for estimating forest stand volume. The approach is based on the idea that forest volume can be estimated from the variation of tree heights within plots. In other words, the greater the height variation in a plot, the greater the expected stand volume. To test this idea, 120 circular 0.1 ha sample plots were collected with a systematic random design in Tonekaon forest, located in the Hyrcanian zone. A digital surface model (DSM) measures the height values of the first surface on the ground, including terrain features, trees, buildings, etc., and provides a topographic model of the earth's surface. The DSMs were extracted automatically from aerial UltraCamD images such that the ground pixel size of the extracted DSMs varied from 1 to 10 m in 1 m steps. DSMs were checked manually for probable errors. The standard deviation and range of DSM pixels were calculated for each ground sample. A non-linear regression method was used for modeling. The results showed that the standard deviation of plot pixels at 5 m resolution was the most appropriate data for modeling. The relative bias and RMSE of estimation were 5.8 and 49.8 percent, respectively. Compared with other approaches for estimating stand volume from passive remote sensing data in mixed broadleaf forests, these results are more encouraging. One major problem with this method occurs when the tree canopy cover is totally closed. In this situation, the standard deviation of height is low while stand volume is high. Future studies could examine forest stratification.
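
    The two plot-level predictors described in the abstract above, the standard deviation and the range of DSM pixel heights, can be sketched as follows. The pixel values are invented, and the use of the population standard deviation is an assumption, since the abstract does not say which form was used.

```python
import statistics

def plot_height_metrics(dsm_heights):
    """Standard deviation and range of DSM pixel heights within one plot."""
    return {
        # Population standard deviation is assumed here.
        "std": statistics.pstdev(dsm_heights),
        "range": max(dsm_heights) - min(dsm_heights),
    }

# Hypothetical DSM pixel heights (m) for two plots.
uneven_canopy = [20.0, 24.0, 28.0, 32.0, 36.0]
closed_canopy = [30.0, 30.5, 30.0, 29.5, 30.0]
uneven = plot_height_metrics(uneven_canopy)
closed = plot_height_metrics(closed_canopy)
print(uneven, closed)
```

    The second plot illustrates the limitation noted in the abstract: a fully closed canopy yields a low height variation even when stand volume is high.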

  7. Leveraging Past and Current Measurements to Probabilistically Nowcast Low Visibility Procedures at an Airport

    NASA Astrophysics Data System (ADS)

    Mayr, G. J.; Kneringer, P.; Dietz, S. J.; Zeileis, A.

    2016-12-01

    Low visibility or low cloud ceiling reduces the capacity of airports by requiring special low visibility procedures (LVP) for incoming/departing aircraft. Probabilistic forecasts of when such procedures will become necessary help to mitigate delays and economic losses. We compare the performance of probabilistic nowcasts with two statistical methods: ordered logistic regression, and trees and random forests. These models harness historic and current meteorological measurements in the vicinity of the airport and LVP states, and incorporate diurnal and seasonal climatological information via generalized additive models (GAM). The methods are applied at Vienna International Airport (Austria). The performance is benchmarked against climatology, persistence and human forecasters.

  8. Determinants of the process and outcomes of household participation in collaborative forest management in Ghana: a quantitative test of a community resilience model.

    PubMed

    Akamani, Kofi; Hall, Troy Elizabeth

    2015-01-01

    This study tested a proposed community resilience model by investigating the role of institutions, capital assets, community and socio-demographic variables as determinants of households' participation in Ghana's collaborative forest management (CFM) program and outcomes of the program. Quantitative survey data were gathered from 209 randomly selected households from two forest-dependent communities. Regression analysis shows that households' participation in the CFM program was predicted by community location, past connections with institutions, and past bonding social capital. Community location and past capitals were the strongest predictors of the outcomes of the CFM program as judged by current levels of capitals. Participation in the CFM program also had a positive effect on human capital but had minimal impact on the other capitals influencing household well-being and resilience, suggesting that the impact of co-management on household resilience may be modest. In all, the findings highlight the need for co-management policies to pay attention to the historical context of community interaction processes influencing access to capital assets and local institutions to successfully promote equitable resilience. Copyright © 2014 Elsevier Ltd. All rights reserved.

  9. Assessing the Effects of Forest Fragmentation Using Satellite Imagery and Forest Inventory Data

    Treesearch

    Ronald E. McRoberts; Greg C. Liknes

    2005-01-01

    For a study area in the North Central region of the USA, maps of predicted proportion forest area were created using Landsat Thematic Mapper imagery, forest inventory plot data, and a logistic regression model. The maps were used to estimate quantitative indices of forest fragmentation. Correlations between the values of the indices and forest attributes observed on...

  10. Random Forest-Based Recognition of Isolated Sign Language Subwords Using Data from Accelerometers and Surface Electromyographic Sensors.

    PubMed

    Su, Ruiliang; Chen, Xiang; Cao, Shuai; Zhang, Xu

    2016-01-14

    Sign language recognition (SLR) has been widely used for communication amongst the hearing-impaired and non-verbal community. This paper proposes an accurate and robust SLR framework using an improved decision tree as the base classifier of random forests. This framework was used to recognize Chinese sign language subwords using recordings from a pair of portable devices worn on both arms consisting of accelerometers (ACC) and surface electromyography (sEMG) sensors. The experimental results demonstrated the validity of the proposed random forest-based method for recognition of Chinese sign language (CSL) subwords. With the proposed method, 98.25% average accuracy was obtained for the classification of a list of 121 frequently used CSL subwords. Moreover, the random forests method demonstrated a superior performance in resisting the impact of bad training samples. When the proportion of bad samples in the training set reached 50%, the recognition error rate of the random forest-based method was only 10.67%, while that of a single decision tree adopted in our previous work was almost 27.5%. Our study offers a practical way of realizing a robust and wearable EMG-ACC-based SLR system.

  11. Pseudo CT estimation from MRI using patch-based random forest

    NASA Astrophysics Data System (ADS)

    Yang, Xiaofeng; Lei, Yang; Shu, Hui-Kuo; Rossi, Peter; Mao, Hui; Shim, Hyunsuk; Curran, Walter J.; Liu, Tian

    2017-02-01

    MR simulators have recently gained popularity because the CT simulators used in radiation therapy planning cause unnecessary radiation exposure. We propose a method for pseudo CT estimation from MR images based on a patch-based random forest. Patient-specific anatomical features are extracted from the aligned training images and adopted as signatures for each voxel. The most robust and informative features are identified using feature selection to train the random forest. The well-trained random forest is used to predict the pseudo CT of a new patient. This prediction technique was tested with human brain images and the prediction accuracy was assessed using the original CT images. Peak signal-to-noise ratio (PSNR) and feature similarity (FSIM) indexes were used to quantify the differences between the pseudo and original CT images. The experimental results showed the proposed method could accurately generate pseudo CT images from MR images. In summary, we have developed a new pseudo CT prediction method based on patch-based random forest, demonstrated its clinical feasibility, and validated its prediction accuracy. This pseudo CT prediction technique could be a useful tool for MRI-based radiation treatment planning and attenuation correction in a PET/MRI scanner.
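
    Of the two image-quality indexes named in the abstract above, PSNR has a simple closed form, 10 * log10(MAX^2 / MSE). The sketch below applies it to flattened intensity lists; the values and the choice of `max_value` are illustrative assumptions, since the abstract does not state the intensity range used.

```python
import math

def psnr(reference, estimate, max_value):
    """Peak signal-to-noise ratio in dB between two equally sized intensity lists."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Hypothetical flattened voxel intensities on a 0-255 scale.
original_ct = [0, 100, 200, 255]
pseudo_ct = [0, 110, 190, 255]
value = psnr(original_ct, pseudo_ct, max_value=255)
print(f"PSNR: {value:.2f} dB")
```

    A higher PSNR means the pseudo image deviates less from the reference relative to the available intensity range.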

  12. Discrimination and characterization of strawberry juice based on electronic nose and tongue: comparison of different juice processing approaches by LDA, PLSR, RF, and SVM.

    PubMed

    Qiu, Shanshan; Wang, Jun; Gao, Liping

    2014-07-09

    An electronic nose (E-nose) and an electronic tongue (E-tongue) have been used to characterize five types of strawberry juices based on processing approaches (i.e., microwave pasteurization, steam blanching, high temperature short time pasteurization, frozen-thawed, and freshly squeezed). Juice quality parameters (vitamin C, pH, total soluble solid, total acid, and sugar/acid ratio) were detected by traditional measuring methods. Multivariate statistical methods (linear discriminant analysis (LDA) and partial least squares regression (PLSR)) and neural networks (Random Forest (RF) and Support Vector Machines) were employed for qualitative classification and quantitative regression. The E-tongue system reached higher accuracy rates than the E-nose did, and the simultaneous utilization did have an advantage in LDA classification and PLSR regression. According to cross-validation, RF has shown outstanding and indisputable performances in the qualitative and quantitative analysis. This work indicates that the simultaneous utilization of E-nose and E-tongue can discriminate processed fruit juices and predict quality parameters successfully for the beverage industry.

  13. Modeling Mediterranean forest structure using airborne laser scanning data

    NASA Astrophysics Data System (ADS)

    Bottalico, Francesca; Chirici, Gherardo; Giannini, Raffaello; Mele, Salvatore; Mura, Matteo; Puxeddu, Michele; McRoberts, Ronald E.; Valbuena, Ruben; Travaglini, Davide

    2017-05-01

    The conservation of biological diversity is recognized as a fundamental component of sustainable development, and forests contribute greatly to its preservation. Structural complexity increases the potential biological diversity of a forest by creating multiple niches that can host a wide variety of species. To facilitate greater understanding of the contributions of forest structure to forest biological diversity, we modeled relationships between 14 forest structure variables and airborne laser scanning (ALS) data for two Italian study areas representing two common Mediterranean forests, conifer plantations and coppice oaks subjected to irregular intervals of unplanned and non-standard silvicultural interventions. The objectives were twofold: (i) to compare model prediction accuracies when using two types of ALS metrics, echo-based metrics and canopy height model (CHM)-based metrics, and (ii) to construct inferences in the form of confidence intervals for large area structural complexity parameters. Our results showed that the effects of the two study areas on accuracies were greater than the effects of the two types of ALS metrics. In particular, accuracies were less for the more complex study area in terms of species composition and forest structure. However, accuracies achieved using the echo-based metrics were only slightly greater than when using the CHM-based metrics, thus demonstrating that both options yield reliable and comparable results. Accuracies were greatest for dominant height (Hd) (R2 = 0.91; RMSE% = 8.2%) and mean height weighted by basal area (R2 = 0.83; RMSE% = 10.5%) when using the echo-based metrics, 99th percentile of the echo height distribution and interquantile distance. For the forested area, the generalized regression (GREG) estimate of mean Hd was similar to the simple random sampling (SRS) estimate, 15.5 m for GREG and 16.2 m for SRS. Further, the GREG estimator with standard error of 0.10 m was considerably more precise than the SRS estimator with standard error of 0.69 m.
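
    The gain in precision reported above comes from combining wall-to-wall model predictions with field plots. A minimal sketch of a model-assisted regression estimator of a mean under simple random sampling follows: the mean prediction over all population units plus the mean sample residual. Finite population corrections are ignored, the numbers are invented, and the study's exact estimator may differ.

```python
import math
import statistics

def srs_estimate(sample_y):
    """Sample mean and its standard error under simple random sampling."""
    se = statistics.stdev(sample_y) / math.sqrt(len(sample_y))
    return statistics.mean(sample_y), se

def greg_estimate(population_pred, sample_y, sample_pred):
    """Mean of population predictions, corrected by the mean sample residual."""
    residuals = [y - p for y, p in zip(sample_y, sample_pred)]
    mean = statistics.mean(population_pred) + statistics.mean(residuals)
    se = statistics.stdev(residuals) / math.sqrt(len(residuals))
    return mean, se

# Hypothetical dominant heights (m): predictions for every map unit,
# plus field observations and predictions for three sampled plots.
population_pred = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
sample_y = [11.0, 15.0, 19.0]
sample_pred = [10.0, 14.0, 20.0]

srs_mean, srs_se = srs_estimate(sample_y)
greg_mean, greg_se = greg_estimate(population_pred, sample_y, sample_pred)
print(f"SRS:  {srs_mean:.2f} m (SE {srs_se:.2f})")
print(f"GREG: {greg_mean:.2f} m (SE {greg_se:.2f})")
```

    The GREG standard error depends on the spread of the residuals rather than of the observations, so a model that predicts well shrinks it.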

  14. Novel approaches to assess the quality of fertility data stored in dairy herd management software.

    PubMed

    Hermans, K; Waegeman, W; Opsomer, G; Van Ranst, B; De Koster, J; Van Eetvelde, M; Hostens, M

    2017-05-01

    Scientific journals and popular press magazines are littered with articles in which the authors use data from dairy herd management software. Almost none of such papers include data cleaning and data quality assessment in their study design despite this being a very critical step during data mining. This paper presents 2 novel data cleaning methods that permit identification of animals with good and bad data quality. The first method is a deterministic or rule-based data cleaning method. Reproduction and mutation or life-changing events such as birth and death were converted to a symbolic (alphabetical letter) representation and split into triplets (3-letter code). The triplets were manually labeled as physiologically correct, suspicious, or impossible. The deterministic data cleaning method was applied to assess the quality of data stored in dairy herd management from 26 farms enrolled in the herd health management program from the Faculty of Veterinary Medicine Ghent University, Belgium. In total, 150,443 triplets were created, 65.4% were labeled as correct, 17.4% as suspicious, and 17.2% as impossible. The second method, a probabilistic method, uses a machine learning algorithm (random forests) to predict the correctness of fertility and mutation events in an early stage of data cleaning. The prediction accuracy of the random forests algorithm was compared with a classical linear statistical method (penalized logistic regression), outperforming the latter substantially, with a superior receiver operating characteristic curve and a higher accuracy (89 vs. 72%). From those results, we conclude that the triplet method can be used to assess the quality of reproduction data stored in dairy herd management software and that a machine learning technique such as random forests is capable of predicting the correctness of fertility data. Copyright © 2017 American Dairy Science Association. Published by Elsevier Inc. All rights reserved.
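
    The rule-based method above converts each animal's event history to letters and splits it into 3-letter codes that are looked up against manual labels. The sketch below assumes overlapping (sliding-window) triplets, which the abstract does not specify; the letter codes and the label table are hypothetical.

```python
def to_triplets(event_codes):
    """Split a string of one-letter event codes into overlapping 3-letter windows."""
    return [event_codes[i:i + 3] for i in range(len(event_codes) - 2)]

def label_triplets(event_codes, labels):
    """Pair each triplet with its manual label, or 'unlabeled' if none exists."""
    return [(t, labels.get(t, "unlabeled")) for t in to_triplets(event_codes)]

# Hypothetical coding: B = birth, I = insemination, C = calving, D = death.
HYPOTHETICAL_LABELS = {
    "BIC": "correct",
    "ICI": "correct",
    "CIC": "correct",
    "CDI": "impossible",  # an insemination recorded after death
}

result = label_triplets("BICDI", HYPOTHETICAL_LABELS)
print(result)
```

    An animal's data quality could then be summarized by the share of its triplets that fall in each label.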

  15. Detecting targets hidden in random forests

    NASA Astrophysics Data System (ADS)

    Kouritzin, Michael A.; Luo, Dandan; Newton, Fraser; Wu, Biao

    2009-05-01

    Military tanks, cargo or troop carriers, missile carriers or rocket launchers often hide themselves from detection in the forests. This plagues the detection problem of locating these hidden targets. An electro-optic camera mounted on a surveillance aircraft or unmanned aerial vehicle is used to capture the images of the forests with possible hidden targets, e.g., rocket launchers. We consider random forests of longitudinal and latitudinal correlations. Specifically, foliage coverage is encoded with a binary representation (i.e., foliage or no foliage), and is correlated in adjacent regions. We address the detection problem of camouflaged targets hidden in random forests by building memory into the observations. In particular, we propose an efficient algorithm to generate random forests, ground, and camouflage of hidden targets with two dimensional correlations. The observations are a sequence of snapshots consisting of foliage-obscured ground or target. Theoretically, detection is possible because there are subtle differences in the correlations of the ground and camouflage of the rocket launcher. However, these differences are well beyond human perception. To detect the presence of hidden targets automatically, we develop a Markov representation for these sequences and modify the classical filtering equations to allow the Markov chain observation. Particle filters are used to estimate the position of the targets in combination with a novel random weighting technique. Furthermore, we give positive proof-of-concept simulations.

  16. Regression tree modeling of forest NPP using site conditions and climate variables across eastern USA

    NASA Astrophysics Data System (ADS)

    Kwon, Y.

    2013-12-01

    As evidence of global warming continues to increase, being able to predict forest response to climate changes, such as the expected rise in temperature and precipitation, will be vital for maintaining the sustainability and productivity of forests. Mapping forest species redistribution under climate change scenarios has been successful; however, most species redistribution maps lack the mechanistic understanding needed to explain why trees grow under the novel conditions of a changing climate. A distributional map is only capable of prediction under the equilibrium assumption that the communities would exist following a prolonged period under the new climate. In this context, forest NPP as a surrogate for growth rate, the most important facet determining stand dynamics, can lead to valid predictions of the transition stage to a new vegetation-climate equilibrium, as it represents changes in forest structure reflecting site conditions and climate factors. The objective of this study is to develop a forest growth map using regression tree analysis by extracting large-scale non-linear structures from both field-based FIA and remotely sensed MODIS data sets. The major issue addressed in this approach is the non-linear spatial patterns of forest attributes. Forest inventory data showed complex spatial patterns that reflect environmental states and processes originating at different spatial scales. At broad scales, non-linear spatial trends in forest attributes and a mixture of continuous and discrete types of environmental variables make traditional statistical (multivariate regression) and geostatistical (kriging) models inefficient. This calls into question some traditional underlying assumptions about spatial trends that are uncritically accepted in forest data. To resolve the controversy surrounding the suitability of forest data, regression tree analyses are performed using the software See5 and Cubist. Four publicly available data sets were obtained. First, the field-based Forest Inventory and Analysis (USDA Forest Service) data set for the 31 easternmost states of the United States. Second, 8-day composites of MODIS Land Cover, FPAR, LAI and GPP/NPP data from Jan 2001 to Dec 2004 (182 composites in total), with each product filtered by pixel-level quality assurance data to select the best-quality pixels. Third, 30-year averaged climate data collected from the National Oceanic and Atmospheric Administration (NOAA), from which five climatic variables were obtained: monthly temperature, precipitation, annual heating and cooling days, and annual frost-free days. Fourth, topographic data obtained from a digital elevation model (1 km by 1 km). This research will provide a better understanding of large-scale forest responses to environmental factors that will be beneficial for the development of important forest management applications.

  17. Screening large-scale association study data: exploiting interactions using random forests.

    PubMed

    Lunetta, Kathryn L; Hayward, L Brooke; Segal, Jonathan; Van Eerdewegh, Paul

    2004-12-10

    Genome-wide association studies for complex diseases will produce genotypes on hundreds of thousands of single nucleotide polymorphisms (SNPs). A logical first approach to dealing with massive numbers of SNPs is to use some test to screen the SNPs, retaining only those that meet some criterion for further study. For example, SNPs can be ranked by p-value, and those with the lowest p-values retained. When SNPs have large interaction effects but small marginal effects in a population, they are unlikely to be retained when univariate tests are used for screening. However, model-based screens that pre-specify interactions are impractical for data sets with thousands of SNPs. Random forest analysis is an alternative method that produces a single measure of importance for each predictor variable that takes into account interactions among variables without requiring model specification. Interactions increase the importance for the individual interacting variables, making them more likely to be given high importance relative to other variables. We test the performance of random forests as a screening procedure to identify small numbers of risk-associated SNPs from among large numbers of unassociated SNPs using complex disease models with up to 32 loci, incorporating both genetic heterogeneity and multi-locus interaction. Keeping other factors constant, if risk SNPs interact, the random forest importance measure significantly outperforms the Fisher Exact test as a screening tool. As the number of interacting SNPs increases, the improvement in performance of random forest analysis relative to Fisher Exact test for screening also increases. Random forests perform similarly to the univariate Fisher Exact test as a screening tool when SNPs in the analysis do not interact. 
In the context of large-scale genetic association studies where unknown interactions exist among true risk-associated SNPs or SNPs and environmental covariates, screening SNPs using random forest analyses can significantly reduce the number of SNPs that need to be retained for further study compared to standard univariate screening methods.
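
    The univariate baseline that the abstract compares against can be sketched as follows. This is a minimal illustration of ranking SNPs by a two-sided Fisher exact p-value, not the authors' code and not a random forest; the function name, SNP names, and all counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of seeing k in the top-left cell.
        return comb(row1, k) * comb(row2, col1 - k) / denominator

    observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more probable than the observed one.
    return sum(
        p for p in (prob(k) for k in range(low, high + 1))
        if p <= observed * (1 + 1e-9)
    )


# Hypothetical counts per SNP: (carrier cases, non-carrier cases,
#                               carrier controls, non-carrier controls).
snp_tables = {
    "snp_a": (30, 20, 15, 35),
    "snp_b": (25, 25, 24, 26),
    "snp_c": (28, 22, 20, 30),
}
p_values = {name: fisher_exact_two_sided(*t) for name, t in snp_tables.items()}
# Univariate screen: keep the SNPs with the smallest p-values.
ranked = sorted(p_values, key=p_values.get)
```

    Because each SNP is tested on its own, a screen like this only sees marginal effects, which is the limitation the abstract attributes to univariate screening.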

  18. Effects of Model Choice and Forest Structure on Inventory-Based Estimations of Puerto Rican Forest Biomass.

    Treesearch

    Thomas J. Brandeis; Maria Del Rocio Suarez Rozo

    2005-01-01

    Total aboveground live tree biomass in Puerto Rican lower montane wet, subtropical wet, subtropical moist and subtropical dry forests was estimated using data from two forest inventories and published regression equations. Multiple potentially-applicable published biomass models existed for some forested life zones, and their estimates tended to diverge with increasing...

  19. Effects of model choice and forest structure on inventory-based estimations of Puerto Rican forest biomass

    Treesearch

    Thomas J. Brandeis; Maria Del Rocio Suarez Rozo

    2005-01-01

    Total aboveground live tree biomass in Puerto Rican lower montane wet, subtropical wet, subtropical moist and subtropical dry forests was estimated using data from two forest inventories and published regression equations. Multiple potentially-applicable published biomass models existed for some forested life zones, and their estimates tended to diverge with increasing...

  20. Evaluation of open source data mining software packages

    Treesearch

    Bonnie Ruefenacht; Greg Liknes; Andrew J. Lister; Haans Fisk; Dan Wendt

    2009-01-01

    Since 2001, the USDA Forest Service (USFS) has used classification and regression-tree technology to map USFS Forest Inventory and Analysis (FIA) biomass, forest type, forest type groups, and National Forest vegetation. This prior work used Cubist/See5 software for the analyses. The objective of this project, sponsored by the Remote Sensing Steering Committee (RSSC),...

  1. Estimation of Surface Seawater Fugacity of Carbon Dioxide Using Satellite Data and Machine Learning

    NASA Astrophysics Data System (ADS)

    Jang, E.; Im, J.; Park, G.; Park, Y.

    2016-12-01

    The ocean controls the climate of Earth by absorbing and releasing CO2 through the carbon cycle. The amount of CO2 in the ocean has increased since the industrial revolution. High CO2 concentration in the ocean has a negative influence on marine organisms and reduces the ocean's ability to absorb CO2. This study estimated surface seawater fugacity of CO2 (fCO2) in the East Sea of Korea using Geostationary Ocean Color Imager (GOCI) and Moderate Resolution Imaging Spectroradiometer (MODIS) satellite data, and Hybrid Coordinate Ocean Model (HYCOM) reanalysis data. GOCI is the world's first geostationary ocean color observation satellite sensor, and it provides 8 images with 8 bands per day, hourly from 9 am to 4 pm, at 500 m resolution. Two machine learning approaches (i.e., random forest and support vector regression) were used to model fCO2 in this study. While most of the existing studies used multiple linear regression to estimate the pressure of CO2 in the ocean, machine learning may handle more complex relationships between surface seawater fCO2 and ocean parameters in a dynamic spatiotemporal environment. Five ocean-related parameters, colored dissolved organic matter (CDOM), chlorophyll-a (chla), sea surface temperature (SST), sea surface salinity (SSS), and mixed layer depth (MLD), were used as input variables. This study examined two schemes, one with GOCI-derived products and the other with MODIS-derived ones. Results show that random forest performed better than support vector regression regardless of the satellite data used. The accuracy of the GOCI-based estimation was higher than that of the MODIS-based one, possibly thanks to the better spatiotemporal resolution of GOCI data. MLD was identified as the most contributing parameter in estimating surface seawater fCO2 among the five ocean-related parameters, which might be related to active deep convection in the East Sea.
In general, the surface seawater fCO2 was higher in summer than in the other seasons, with some spatial variation, because of higher SST.

  2. Retrieval of aerosol optical depth from surface solar radiation measurements using machine learning algorithms, non-linear regression and a radiative transfer-based look-up table

    NASA Astrophysics Data System (ADS)

    Huttunen, Jani; Kokkola, Harri; Mielonen, Tero; Esa Juhani Mononen, Mika; Lipponen, Antti; Reunanen, Juha; Vilhelm Lindfors, Anders; Mikkonen, Santtu; Erkki Juhani Lehtinen, Kari; Kouremeti, Natalia; Bais, Alkiviadis; Niska, Harri; Arola, Antti

    2016-07-01

    In order to have a good estimate of the current forcing by anthropogenic aerosols, knowledge on past aerosol levels is needed. Aerosol optical depth (AOD) is a good measure for aerosol loading. However, dedicated measurements of AOD are only available from the 1990s onward. One option to lengthen the AOD time series beyond the 1990s is to retrieve AOD from surface solar radiation (SSR) measurements taken with pyranometers. In this work, we have evaluated several inversion methods designed for this task. We compared a look-up table method based on radiative transfer modelling, a non-linear regression method and four machine learning methods (Gaussian process, neural network, random forest and support vector machine) with AOD observations carried out with a sun photometer at an Aerosol Robotic Network (AERONET) site in Thessaloniki, Greece. Our results show that most of the machine learning methods produce AOD estimates comparable to the look-up table and non-linear regression methods. All of the applied methods produced AOD values that corresponded well to the AERONET observations with the lowest correlation coefficient value being 0.87 for the random forest method. While many of the methods tended to slightly overestimate low AODs and underestimate high AODs, neural network and support vector machine showed overall better correspondence for the whole AOD range. The differences in producing both ends of the AOD range seem to be caused by differences in the aerosol composition. High AODs were in most cases those with high water vapour content which might affect the aerosol single scattering albedo (SSA) through uptake of water into aerosols. Our study indicates that machine learning methods benefit from the fact that they do not constrain the aerosol SSA in the retrieval, whereas the LUT method assumes a constant value for it. 
This would also mean that machine learning methods could have potential in reproducing AOD from SSR even though SSA would have changed during the observation period.

  3. Application of lifting wavelet and random forest in compound fault diagnosis of gearbox

    NASA Astrophysics Data System (ADS)

    Chen, Tang; Cui, Yulian; Feng, Fuzhou; Wu, Chunzhi

    2018-03-01

    To address the weakness of compound fault characteristic signals in the gearbox of an armored vehicle and the difficulty of identifying fault types, a fault diagnosis method based on lifting wavelet and random forest is proposed. First, the method uses the lifting wavelet transform to decompose the original vibration signal into multiple layers and reconstructs the multi-layer low-frequency and high-frequency components obtained by the decomposition to get multiple component signals. Then the time-domain feature parameters are obtained for each component signal to form multiple feature vectors, which are input into the random forest pattern recognition classifier to determine the compound fault type. Finally, the method is verified on a variety of compound fault data from the gearbox fault analog test platform; the results show that the recognition accuracy of the fault diagnosis method combining the lifting wavelet and the random forest is up to 99.99%.

  4. 3D Semantic Labeling of ALS Data Based on Domain Adaption by Transferring and Fusing Random Forest Models

    NASA Astrophysics Data System (ADS)

    Wu, J.; Yao, W.; Zhang, J.; Li, Y.

    2018-04-01

    Labeling 3D point cloud data with traditional supervised learning methods requires a considerable number of labelled samples, the collection of which is costly and time-consuming. This work focuses on adopting the domain adaption concept to transfer existing trained random forest classifiers (based on the source domain) to new data scenes (the target domain), which aims at reducing the dependence of accurate 3D semantic labeling in point clouds on training samples from the new data scene. Firstly, two random forest classifiers were trained with existing samples previously collected for other data. They differed from each other by using two different decision tree construction algorithms: C4.5 with information gain ratio and CART with Gini index. Secondly, four random forest classifiers adapted to the target domain are derived through transferring each tree in the source random forest models with two types of operations: structure expansion and reduction (SER) and structure transfer (STRUT). Finally, points in the target domain are labelled by fusing the four newly derived random forest classifiers using a weights-of-evidence-based fusion model. To validate our method, experimental analysis was conducted using 3 datasets: one is used as the source domain data (Vaihingen data for 3D Semantic Labelling); the other two are used as the target domain data from two cities in China (Jinmen city and Dunhuang city). Overall accuracies of 85.5 % and 83.3 % for 3D labelling were achieved for Jinmen city and Dunhuang city data respectively, with only 1/3 newly labelled samples compared to the cases without domain adaption.

  5. Predicting stem total and assortment volumes in an industrial Pinus taeda L. forest plantation using airborne laser scanning data and random forest

    Treesearch

    Carlos Alberto Silva; Carine Klauberg; Andrew Thomas Hudak; Lee Alexander Vierling; Wan Shafrina Wan Mohd Jaafar; Midhun Mohan; Mariano Garcia; Antonio Ferraz; Adrian Cardil; Sassan Saatchi

    2017-01-01

    Improvements in the management of pine plantations result in multiple industrial and environmental benefits. Remote sensing techniques can dramatically increase the efficiency of plantation management by reducing or replacing time-consuming field sampling. We tested the utility and accuracy of combining field and airborne lidar data with Random Forest, a supervised...

  6. Complex mountain terrain and disturbance history drive variation in forest aboveground live carbon density in the western Oregon Cascades, USA

    PubMed Central

    Zald, Harold S.J.; Spies, Thomas A.; Seidl, Rupert; Pabst, Robert J.; Olsen, Keith A.; Steel, E. Ashley

    2016-01-01

    Forest carbon (C) density varies tremendously across space due to the inherent heterogeneity of forest ecosystems. Variation of forest C density is especially pronounced in mountainous terrain, where environmental gradients are compressed and vary at multiple spatial scales. Additionally, the influence of environmental gradients may vary with forest age and developmental stage, an important consideration as forest landscapes often have a diversity of stand ages from past management and other disturbance agents. Quantifying forest C density and its underlying environmental determinants in mountain terrain has remained challenging because many available data sources lack the spatial grain and ecological resolution needed at both stand and landscape scales. The objective of this study was to determine if environmental factors influencing aboveground live carbon (ALC) density differed between young versus old forests. We integrated aerial light detection and ranging (lidar) data with 702 field plots to map forest ALC density at a grain of 25 m across the H.J. Andrews Experimental Forest, a 6369 ha watershed in the Cascade Mountains of Oregon, USA. We used linear regressions, random forest ensemble learning (RF) and sequential autoregressive modeling (SAR) to reveal how mapped forest ALC density was related to climate, topography, soils, and past disturbance history (timber harvesting and wildfires). ALC increased with stand age in young managed forests, with much greater variation of ALC in relation to years since wildfire in old unmanaged forests. Timber harvesting was the most important driver of ALC across the entire watershed, despite occurring on only 23% of the landscape. More variation in forest ALC density was explained in models of young managed forests than in models of old unmanaged forests. 
Besides stand age, ALC density in young managed forests was driven by factors influencing site productivity, whereas variation in ALC density in old unmanaged forests was also affected by finer scale topographic conditions associated with sheltered sites. Past wildfires only had a small influence on current ALC density, which may be a result of long times since fire and/or prevalence of non-stand replacing fire. Our results indicate that forest ALC density depends on a suite of multi-scale environmental drivers mediated by complex mountain topography, and that these relationships are dependent on stand age. The high and context-dependent spatial variability of forest ALC density has implications for quantifying forest carbon stores, establishing upper bounds of potential carbon sequestration, and scaling field data to landscape and regional scales. PMID:27041818

  7. Uncertainty in Random Forests: What does it mean in a spatial context?

    NASA Astrophysics Data System (ADS)

    Klump, Jens; Fouedjio, Francky

    2017-04-01

    Geochemical surveys are an important part of exploration for mineral resources and in environmental studies. The samples and chemical analyses are often laborious and difficult to obtain and therefore come at a high cost. As a consequence, these surveys are characterised by datasets with large numbers of variables but relatively few data points when compared to conventional big data problems. With more remote sensing platforms and sensor networks being deployed, large volumes of auxiliary data of the surveyed areas are becoming available. The use of these auxiliary data has the potential to improve the prediction of chemical element concentrations over the whole study area. Kriging is a well established geostatistical method for the prediction of spatial data but requires significant pre-processing and makes some basic assumptions about the underlying distribution of the data. Some machine learning algorithms, on the other hand, may require less data pre-processing and are non-parametric. In this study we used a dataset provided by Kirkwood et al. [1] to explore the potential use of Random Forest in geochemical mapping. We chose Random Forest because it is a well understood machine learning method and has the advantage that it provides us with a measure of uncertainty. By comparing Random Forest to Kriging we found that both methods produced comparable maps of estimated values for our variables of interest. Kriging outperformed Random Forest for variables of interest with relatively strong spatial correlation. The measure of uncertainty provided by Random Forest seems to be quite different to the measure of uncertainty provided by Kriging. In particular, the lack of spatial context can give misleading results in areas without ground truth data. 
In conclusion, our preliminary results show that the model driven approach in geostatistics gives us more reliable estimates for our target variables than Random Forest for variables with relatively strong spatial correlation. However, in cases of weak spatial correlation Random Forest, as a nonparametric method, may give the better results once we have a better understanding of the meaning of its uncertainty measures in a spatial context. References [1] Kirkwood, C., M. Cave, D. Beamish, S. Grebby, and A. Ferreira (2016), A machine learning approach to geochemical mapping, Journal of Geochemical Exploration, 163, 28-40, doi:10.1016/j.gexplo.2016.05.003.

  8. Missing Value Imputation Approach for Mass Spectrometry-based Metabolomics Data.

    PubMed

    Wei, Runmin; Wang, Jingye; Su, Mingming; Jia, Erik; Chen, Shaoqiu; Chen, Tianlu; Ni, Yan

    2018-01-12

    Missing values exist widely in mass-spectrometry (MS) based metabolomics data. Various methods have been applied for handling missing values, but the selection can significantly affect following data analyses. Typically, there are three types of missing values, missing not at random (MNAR), missing at random (MAR), and missing completely at random (MCAR). Our study comprehensively compared eight imputation methods (zero, half minimum (HM), mean, median, random forest (RF), singular value decomposition (SVD), k-nearest neighbors (kNN), and quantile regression imputation of left-censored data (QRILC)) for different types of missing values using four metabolomics datasets. Normalized root mean squared error (NRMSE) and NRMSE-based sum of ranks (SOR) were applied to evaluate imputation accuracy. Principal component analysis (PCA)/partial least squares (PLS)-Procrustes analysis were used to evaluate the overall sample distribution. Student's t-test followed by correlation analysis was conducted to evaluate the effects on univariate statistics. Our findings demonstrated that RF performed the best for MCAR/MAR and QRILC was the favored one for left-censored MNAR. Finally, we proposed a comprehensive strategy and developed a public-accessible web-tool for the application of missing value imputation in metabolomics ( https://metabolomics.cc.hawaii.edu/software/MetImp/ ).
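
    A minimal sketch of how imputation accuracy might be scored with NRMSE. The normalization used here (mean squared error divided by the variance of the true values) is an assumption, since the abstract does not give the formula; the function name and the toy values are hypothetical.

```python
import math
import statistics


def nrmse(true_values, imputed_values):
    """Normalized RMSE over the entries that were imputed.

    Assumed definition: sqrt(mean((imputed - true)^2) / var(true)).
    """
    if len(true_values) != len(imputed_values):
        raise ValueError("inputs must have the same length")
    mse = statistics.fmean(
        (imp - tru) ** 2 for tru, imp in zip(true_values, imputed_values)
    )
    return math.sqrt(mse / statistics.pvariance(true_values))


# Hypothetical intensities that were masked, then imputed by two methods.
truth = [2.0, 4.0, 6.0, 8.0]
mean_imputed = [5.0, 5.0, 5.0, 5.0]   # every gap filled with the mean
close_imputed = [2.5, 3.5, 6.5, 7.5]  # a method that tracks the truth

score_mean = nrmse(truth, mean_imputed)
score_close = nrmse(truth, close_imputed)
```

    Under this definition, filling every gap with the mean scores exactly 1, so values well below 1 indicate that a method recovers more than the overall level of the data.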

  9. Ensemble classification of individual Pinus crowns from multispectral satellite imagery and airborne LiDAR

    NASA Astrophysics Data System (ADS)

    Kukunda, Collins B.; Duque-Lazo, Joaquín; González-Ferreiro, Eduardo; Thaden, Hauke; Kleinn, Christoph

    2018-03-01

    Distinguishing tree species is relevant in many contexts of remote sensing assisted forest inventory. Accurate tree species maps support management and conservation planning, pest and disease control and biomass estimation. This study evaluated the performance of applying ensemble techniques with the goal of automatically distinguishing Pinus sylvestris L. and Pinus uncinata Mill. Ex Mirb within a 1.3 km² mountainous area in Barcelonnette (France). Three modelling schemes were examined, based on: (1) high-density LiDAR data (160 returns m⁻²), (2) Worldview-2 multispectral imagery, and (3) Worldview-2 and LiDAR in combination. Variables related to the crown structure and height of individual trees were extracted from the normalized LiDAR point cloud at individual-tree level, after performing individual tree crown (ITC) delineation. Vegetation indices and the Haralick texture indices were derived from Worldview-2 images and served as independent spectral variables. Selection of the best predictor subset was done after a comparison of three variable selection procedures: (1) Random Forests with cross validation (AUCRFcv), (2) Akaike Information Criterion (AIC) and (3) Bayesian Information Criterion (BIC). To classify the species, 9 regression techniques were combined using ensemble models. Predictions were evaluated using cross validation and an independent dataset. Integration of datasets and models improved individual tree species classification (True Skills Statistic, TSS; from 0.67 to 0.81) over individual techniques and maintained strong predictive power (Relative Operating Characteristic, ROC = 0.91). Assemblage of regression models and integration of the datasets provided more reliable species distribution maps and associated tree-scale mapping uncertainties. Our study highlights the potential of model and data assemblage at improving species classifications needed in present-day forest planning and management.

  10. A Comparison of Various Estimators for Updating Forest Area Coverage Using AVHRR and Forest Inventory Data

    Treesearch

    Francis A. Roesch; Paul C. van Deusen; Zhiliang Zhu

    1995-01-01

    Various methods of adjusting low-cost and possibly biased estimates of percent forest coverage from AVHRR data with a subsample of higher-cost estimates from the USDA Forest Service's Forest Inventory and Analysis plots were investigated. Two ratio and two regression estimators were evaluated. Previous work (Zhu and Teuber, 1991) finding that the estimates from...

  11. Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies

    PubMed Central

    Theis, Fabian J.

    2017-01-01

    Epidemiological studies often utilize stratified data in which rare outcomes or exposures are artificially enriched. This design can increase precision in association tests but distorts predictions when applying classifiers on nonstratified data. Several methods correct for this so-called sample selection bias, but their performance remains unclear especially for machine learning classifiers. With an emphasis on two-phase case-control studies, we aim to assess which corrections to perform in which setting and to obtain methods suitable for machine learning techniques, especially the random forest. We propose two new resampling-based methods to resemble the original data and covariance structure: stochastic inverse-probability oversampling and parametric inverse-probability bagging. We compare all techniques for the random forest and other classifiers, both theoretically and on simulated and real data. Empirical results show that the random forest profits from only the parametric inverse-probability bagging proposed by us. For other classifiers, correction is mostly advantageous, and methods perform uniformly. We discuss consequences of inappropriate distribution assumptions and reason for different behaviors between the random forest and other classifiers. In conclusion, we provide guidance for choosing correction methods when training classifiers on biased samples. For random forests, our method outperforms state-of-the-art procedures if distribution assumptions are roughly fulfilled. We provide our implementation in the R package sambia. PMID:29312464

  12. A Comparison between Decision Tree and Random Forest in Determining the Risk Factors Associated with Type 2 Diabetes.

    PubMed

    Esmaily, Habibollah; Tayefi, Maryam; Doosti, Hassan; Ghayour-Mobarhan, Majid; Nezami, Hossein; Amirabadizadeh, Alireza

    2018-04-24

    We aimed to identify the risk factors associated with type 2 diabetes mellitus (T2DM) using a data mining approach, with decision tree and random forest techniques, in the Mashhad Stroke and Heart Atherosclerotic Disorders (MASHAD) Study program. This was a cross-sectional study. The MASHAD study started in 2010 and will continue until 2020. Two data mining tools, namely decision trees and random forests, are used for predicting T2DM when some other characteristics are observed on 9528 subjects recruited from the MASHAD database. This paper makes a comparison between these two models in terms of accuracy, sensitivity, specificity and the area under the ROC curve. The prevalence rate of T2DM was 14% among these subjects. The decision tree model has 64.9% accuracy, 64.5% sensitivity, 66.8% specificity, and an area under the ROC curve of 68.6%, while the random forest model has 71.1% accuracy, 71.3% sensitivity, 69.9% specificity, and an area under the ROC curve of 77.3%. The random forest model, when used with demographic, clinical, anthropometric and biochemical measurements, can provide a simple tool to identify associated risk factors for type 2 diabetes. Such identification can be of substantial use in managing health policy to reduce the number of subjects with T2DM.
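
    The three threshold-based measures compared in the abstract follow directly from confusion-matrix counts. The sketch below shows the arithmetic; the counts are hypothetical and chosen only to match the reported 14% prevalence, and they are not taken from the MASHAD data.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return accuracy, sensitivity, and specificity from confusion counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }


# Hypothetical counts for 1000 subjects with 14% prevalence (140 cases).
metrics = classification_metrics(tp=100, fn=40, tn=602, fp=258)
```

    With a prevalence this low, accuracy is dominated by the non-diabetic majority, which is one reason to report sensitivity and specificity alongside it.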

  13. Applications of random forest feature selection for fine-scale genetic population assignment.

    PubMed

    Sylvester, Emma V A; Bentzen, Paul; Bradbury, Ian R; Clément, Marie; Pearce, Jon; Horne, John; Beiko, Robert G

    2018-02-01

    Genetic population assignment used to inform wildlife management and conservation efforts requires panels of highly informative genetic markers and sensitive assignment tests. We explored the utility of machine-learning algorithms (random forest, regularized random forest and guided regularized random forest) compared with FST ranking for selection of single nucleotide polymorphisms (SNP) for fine-scale population assignment. We applied these methods to an unpublished SNP data set for Atlantic salmon (Salmo salar) and a published SNP data set for Alaskan Chinook salmon (Oncorhynchus tshawytscha). In each species, we identified the minimum panel size required to obtain a self-assignment accuracy of at least 90% using each method to create panels of 50-700 markers. Panels of SNPs identified using random forest-based methods performed up to 7.8 and 11.2 percentage points better than FST-selected panels of similar size for the Atlantic salmon and Chinook salmon data, respectively. Self-assignment accuracy ≥90% was obtained with panels of 670 and 384 SNPs for each data set, respectively, a level of accuracy never reached for these species using FST-selected panels. Our results demonstrate a role for machine-learning approaches in marker selection across large genomic data sets to improve assignment for management and conservation of exploited populations.

  14. Do little interactions get lost in dark random forests?

    PubMed

    Wright, Marvin N; Ziegler, Andreas; König, Inke R

    2016-03-31

    Random forests have often been claimed to uncover interaction effects. However, if and how interaction effects can be differentiated from marginal effects remains unclear. In extensive simulation studies, we investigate whether random forest variable importance measures capture or detect gene-gene interactions. With capturing interactions, we define the ability to identify a variable that acts through an interaction with another one, while detection is the ability to identify an interaction effect as such. Of the single importance measures, the Gini importance captured interaction effects in most of the simulated scenarios, however, they were masked by marginal effects in other variables. With the permutation importance, the proportion of captured interactions was lower in all cases. Pairwise importance measures performed about equal, with a slight advantage for the joint variable importance method. However, the overall fraction of detected interactions was low. In almost all scenarios the detection fraction in a model with only marginal effects was larger than in a model with an interaction effect only. Random forests are generally capable of capturing gene-gene interactions, but current variable importance measures are unable to detect them as interactions. In most of the cases, interactions are masked by marginal effects and interactions cannot be differentiated from marginal effects. Consequently, caution is warranted when claiming that random forests uncover interactions.
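
    The distinction between capturing and detecting an interaction can be illustrated with a small permutation-importance sketch. It assumes a fixed prediction rule standing in for a fitted model (no random forest is trained), and the data, rule, and function names are hypothetical.

```python
import random


def error_rate(predict, rows, labels):
    """Fraction of rows that the prediction rule misclassifies."""
    wrong = sum(predict(row) != label for row, label in zip(rows, labels))
    return wrong / len(rows)


def permutation_importance(predict, rows, labels, column, rng):
    """Increase in error rate after shuffling one column of the data."""
    baseline = error_rate(predict, rows, labels)
    shuffled = [row[column] for row in rows]
    rng.shuffle(shuffled)
    permuted = [
        row[:column] + (value,) + row[column + 1:]
        for row, value in zip(rows, shuffled)
    ]
    return error_rate(predict, permuted, labels) - baseline


rng = random.Random(0)
# Columns 0 and 1 act only through their interaction (XOR); column 2 is noise.
rows = [tuple(rng.randint(0, 1) for _ in range(3)) for _ in range(2000)]
labels = [row[0] ^ row[1] for row in rows]


def xor_rule(row):
    """Stand-in for a fitted model that has learned the interaction."""
    return row[0] ^ row[1]


importances = [
    permutation_importance(xor_rule, rows, labels, col, rng) for col in range(3)
]
```

    Both interacting columns receive a large importance even though neither has a marginal effect, which corresponds to capturing in the abstract's terminology. The single per-variable score carries no indication that the effect is an interaction, so it does not amount to detection.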

  15. Application of random survival forests in understanding the determinants of under-five child mortality in Uganda in the presence of covariates that satisfy the proportional and non-proportional hazards assumption.

    PubMed

    Nasejje, Justine B; Mwambi, Henry

    2017-09-07

    Uganda, just like any other Sub-Saharan African country, has a high under-five child mortality rate. To inform policy on intervention strategies, sound statistical methods are required to critically identify factors strongly associated with under-five child mortality rates. The Cox proportional hazards model has been a common choice in analysing data to understand factors strongly associated with high child mortality rates taking age as the time-to-event variable. However, due to its restrictive proportional hazards (PH) assumption, some covariates of interest which do not satisfy the assumption are often excluded in the analysis to avoid mis-specifying the model. Otherwise, using covariates that clearly violate the assumption would mean invalid results. Survival trees and random survival forests are becoming increasingly popular in analysing survival data, particularly in the case of large survey data, and could be attractive alternatives to models with the restrictive PH assumption. In this article, we adopt random survival forests, which have never been used in understanding factors affecting under-five child mortality rates in Uganda using Demographic and Health Survey data. Thus the first part of the analysis is based on the use of the classical Cox PH model and the second part of the analysis is based on the use of random survival forests in the presence of covariates that do not necessarily satisfy the PH assumption. Random survival forests and the Cox proportional hazards model agree that the sex of the household head, sex of the child, and number of births in the past 1 year are strongly associated with under-five child mortality in Uganda, given that all three covariates satisfy the PH assumption. Random survival forests further demonstrated that covariates that were originally excluded from the earlier analysis due to violation of the PH assumption were important in explaining under-five child mortality rates.
These covariates include the number of children under the age of five in a household, number of births in the past 5 years, wealth index, total number of children ever born and the child's birth order. The results further indicated that the predictive performance for random survival forests built using covariates including those that violate the PH assumption was higher than that for random survival forests built using only covariates that satisfy the PH assumption. Random survival forests are appealing methods in analysing public health data to understand factors strongly associated with under-five child mortality rates especially in the presence of covariates that violate the proportional hazards assumption.

  16. Mapping and imputing potential productivity of Pacific Northwest forests using climate variables

    Treesearch

    Gregory Latta; Hailemariam Temesgen; Tara Barrett

    2009-01-01

    Regional estimation of potential forest productivity is important to diverse applications, including biofuels supply, carbon sequestration, and projections of forest growth. Using PRISM (Parameter-elevation Regressions on Independent Slopes Model) climate and productivity data measured on a grid of 3356 Forest Inventory and Analysis plots in Oregon and Washington, we...

  17. Estimating Forest Canopy Heights and Aboveground Biomass with Simulated ICESat-2 Data

    NASA Astrophysics Data System (ADS)

    Malambo, L.; Narine, L.; Popescu, S. C.; Neuenschwander, A. L.; Sheridan, R.

    2016-12-01

    The Ice, Cloud and Land Elevation Satellite (ICESat) 2 is scheduled for launch in 2017 and one of its overall science objectives will be to measure vegetation heights, which can be used to estimate and monitor aboveground biomass (AGB) over large spatial scales. This study serves to develop a methodology for utilizing vegetation data collected by ICESat-2 that will be on a five-year mission from 2017, for mapping forest canopy heights and estimating aboveground forest biomass (AGB). The specific objectives are to, (1) simulate ICESat-2 photon-counting lidar (PCL) data, (2) utilize simulated PCL data to estimate forest canopy heights and propose a methodology for upscaling PCL height measurements to obtain spatially contiguous coverage and, (3) estimate and map AGB using simulated PCL data. The laser pulse from ICESat-2 will be divided into three pairs of beams spaced approximately 3 km apart, with footprints measuring approximately 14 m in diameter and with 70 cm along-track intervals. Using existing airborne lidar data (ALS) for Sam Houston National Forest (SHNF) and known ICESat-2 beam locations, footprints are generated along beam locations and PCL data are then simulated from discrete return lidar points within each footprint. By applying data processing algorithms, photons are classified into top of canopy points and ground surface elevation points to yield tree canopy height values within each ICESat-2 footprint. AGB is then estimated using simple linear regression that utilizes AGB from a biomass map generated with ALS data for SHNF and simulated PCL height metrics for 100 m segments along ICESat-2 tracks. Two approaches also investigated for upscaling AGB estimates to provide wall-to-wall coverage of AGB are (1) co-kriging and (2) Random Forest. 
    Height and AGB maps, which are the outcomes of this study, will demonstrate how data acquired by ICESat-2 can be used to measure forest parameters and, by extension, estimate forest carbon for climate change initiatives.

  18. Neither fixed nor random: weighted least squares meta-regression.

    PubMed

    Stanley, T D; Doucouliagos, Hristos

    2017-03-01

    Our study revisits and challenges two core conventional meta-regression estimators: the prevalent use of 'mixed-effects' or random-effects meta-regression analysis and the correction of standard errors that defines fixed-effects meta-regression analysis (FE-MRA). We show how and explain why an unrestricted weighted least squares MRA (WLS-MRA) estimator is superior to conventional random-effects (or mixed-effects) meta-regression when there is publication (or small-sample) bias, and why it is as good as FE-MRA in all cases and better than fixed effects in most practical applications. Simulations and statistical theory show that WLS-MRA provides satisfactory estimates of meta-regression coefficients that are practically equivalent to mixed effects or random effects when there is no publication bias. When there is publication selection bias, WLS-MRA always has smaller bias than mixed effects or random effects. In practical applications, an unrestricted WLS meta-regression is likely to give practically equivalent or superior estimates to fixed-effects, random-effects, and mixed-effects meta-regression approaches. However, random-effects meta-regression remains viable and perhaps somewhat preferable if selection for statistical significance (publication bias) can be ruled out and when random, additive normal heterogeneity is known to directly affect the 'true' regression coefficient. Copyright © 2016 John Wiley & Sons, Ltd.
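
    The contrast between fixed-effects and unrestricted WLS standard errors described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a single moderator, inverse-variance weights 1/SE^2, and that the unrestricted estimator differs from FE-MRA only in scaling the standard error by the weighted residual mean square. The function name and argument names are hypothetical.

```python
import math

def wls_meta_regression(effects, std_errors, moderator):
    """Fit effect = intercept + slope * moderator by weighted least squares.

    Weights are 1 / SE^2 (an assumption of this sketch). Returns the
    coefficients plus two standard errors for the slope: one with the
    variance scale fixed at 1 (fixed-effects style) and one scaled by the
    weighted residual mean square (unrestricted WLS style).
    """
    w = [1.0 / se ** 2 for se in std_errors]
    sw = sum(w)
    sx = sum(wi * x for wi, x in zip(w, moderator))
    sy = sum(wi * y for wi, y in zip(w, effects))
    sxx = sum(wi * x * x for wi, x in zip(w, moderator))
    sxy = sum(wi * x * y for wi, x, y in zip(w, moderator, effects))

    # Closed-form solution of the 2x2 weighted normal equations.
    det = sw * sxx - sx ** 2
    slope = (sw * sxy - sx * sy) / det
    intercept = (sxx * sy - sx * sxy) / det

    # Weighted residual mean square with n - 2 degrees of freedom.
    n = len(effects)
    mse = sum(
        wi * (y - intercept - slope * x) ** 2
        for wi, x, y in zip(w, moderator, effects)
    ) / (n - 2)

    se_fixed = math.sqrt(sw / det)               # scale forced to 1
    se_unrestricted = se_fixed * math.sqrt(mse)  # scale estimated from data
    return {
        "intercept": intercept,
        "slope": slope,
        "mse": mse,
        "se_fixed": se_fixed,
        "se_unrestricted": se_unrestricted,
    }
```

    Both variants share the same point estimates; only the reported uncertainty differs, which is why the sketch returns the two standard errors side by side.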

  19. Prediction of In-hospital Mortality in Emergency Department Patients With Sepsis: A Local Big Data-Driven, Machine Learning Approach.

    PubMed

    Taylor, R Andrew; Pare, Joseph R; Venkatesh, Arjun K; Mowafi, Hani; Melnick, Edward R; Fleischman, William; Hall, M Kennedy

    2016-03-01

    Predictive analytics in emergency care has mostly been limited to the use of clinical decision rules (CDRs) in the form of simple heuristics and scoring systems. In the development of CDRs, limitations in analytic methods and concerns with usability have generally constrained models to a preselected small set of variables judged to be clinically relevant and to rules that are easily calculated. Furthermore, CDRs frequently suffer from questions of generalizability, take years to develop, and lack the ability to be updated as new information becomes available. Newer analytic and machine learning techniques capable of harnessing the large number of variables that are already available through electronic health records (EHRs) may better predict patient outcomes and facilitate automation and deployment within clinical decision support systems. In this proof-of-concept study, a local, big data-driven, machine learning approach is compared to existing CDRs and traditional analytic methods using the prediction of sepsis in-hospital mortality as the use case. This was a retrospective study of adult ED visits admitted to the hospital meeting criteria for sepsis from October 2013 to October 2014. Sepsis was defined as meeting criteria for systemic inflammatory response syndrome with an infectious admitting diagnosis in the ED. ED visits were randomly partitioned into an 80%/20% split for training and validation. A random forest model (machine learning approach) was constructed using over 500 clinical variables from data available within the EHRs of four hospitals to predict in-hospital mortality. The machine learning prediction model was then compared to a classification and regression tree (CART) model, logistic regression model, and previously developed prediction tools on the validation data set using area under the receiver operating characteristic curve (AUC) and chi-square statistics. There were 5,278 visits among 4,676 unique patients who met criteria for sepsis. 
Of the 4,222 patients in the training group, 210 (5.0%) died during hospitalization, and of the 1,056 patients in the validation group, 50 (4.7%) died during hospitalization. The AUCs with 95% confidence intervals (CIs) for the different models were as follows: random forest model, 0.86 (95% CI = 0.82 to 0.90); CART model, 0.69 (95% CI = 0.62 to 0.77); logistic regression model, 0.76 (95% CI = 0.69 to 0.82); CURB-65, 0.73 (95% CI = 0.67 to 0.80); MEDS, 0.71 (95% CI = 0.63 to 0.77); and mREMS, 0.72 (95% CI = 0.65 to 0.79). The random forest model AUC was statistically different from all other models (p ≤ 0.003 for all comparisons). In this proof-of-concept study, a local big data-driven, machine learning approach outperformed existing CDRs as well as traditional analytic techniques for predicting in-hospital mortality of ED patients with sepsis. Future research should prospectively evaluate the effectiveness of this approach and whether it translates into improved clinical outcomes for high-risk sepsis patients. The methods developed serve as an example of a new model for predictive analytics in emergency care that can be automated, applied to other clinical outcomes of interest, and deployed in EHRs to enable locally relevant clinical predictions. © 2015 by the Society for Academic Emergency Medicine.
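
    The evaluation design in this abstract (a random 80%/20% partition followed by a comparison of models on AUC) can be sketched with the standard library alone. This is an illustrative sketch, not the study's pipeline: the function names are hypothetical, the seed is an arbitrary assumption, and the AUC is computed from its pairwise-ranking definition rather than with any particular statistics package.

```python
import random

def split_80_20(records, seed=0):
    """Randomly partition records into 80% training and 20% validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = (len(shuffled) * 4) // 5
    return shuffled[:n_train], shuffled[n_train:]

def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case (label 1)
    receives a higher score than a randomly chosen negative case (label 0),
    counting ties as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

    Applying `auc` to each model's predicted risks on the same validation set gives values comparable in kind to those listed above; the confidence intervals reported in the abstract would need an additional procedure that is not shown here.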

  20. Prediction of In-hospital Mortality in Emergency Department Patients With Sepsis: A Local Big Data–Driven, Machine Learning Approach

    PubMed Central

    Taylor, R. Andrew; Pare, Joseph R.; Venkatesh, Arjun K.; Mowafi, Hani; Melnick, Edward R.; Fleischman, William; Hall, M. Kennedy

    2018-01-01

    Objectives: Predictive analytics in emergency care has mostly been limited to the use of clinical decision rules (CDRs) in the form of simple heuristics and scoring systems. In the development of CDRs, limitations in analytic methods and concerns with usability have generally constrained models to a preselected small set of variables judged to be clinically relevant and to rules that are easily calculated. Furthermore, CDRs frequently suffer from questions of generalizability, take years to develop, and lack the ability to be updated as new information becomes available. Newer analytic and machine learning techniques capable of harnessing the large number of variables that are already available through electronic health records (EHRs) may better predict patient outcomes and facilitate automation and deployment within clinical decision support systems. In this proof-of-concept study, a local, big data–driven, machine learning approach is compared to existing CDRs and traditional analytic methods using the prediction of sepsis in-hospital mortality as the use case.
    Methods: This was a retrospective study of adult ED visits admitted to the hospital meeting criteria for sepsis from October 2013 to October 2014. Sepsis was defined as meeting criteria for systemic inflammatory response syndrome with an infectious admitting diagnosis in the ED. ED visits were randomly partitioned into an 80%/20% split for training and validation. A random forest model (machine learning approach) was constructed using over 500 clinical variables from data available within the EHRs of four hospitals to predict in-hospital mortality. The machine learning prediction model was then compared to a classification and regression tree (CART) model, logistic regression model, and previously developed prediction tools on the validation data set using area under the receiver operating characteristic curve (AUC) and chi-square statistics.
    Results: There were 5,278 visits among 4,676 unique patients who met criteria for sepsis. Of the 4,222 patients in the training group, 210 (5.0%) died during hospitalization, and of the 1,056 patients in the validation group, 50 (4.7%) died during hospitalization. The AUCs with 95% confidence intervals (CIs) for the different models were as follows: random forest model, 0.86 (95% CI = 0.82 to 0.90); CART model, 0.69 (95% CI = 0.62 to 0.77); logistic regression model, 0.76 (95% CI = 0.69 to 0.82); CURB-65, 0.73 (95% CI = 0.67 to 0.80); MEDS, 0.71 (95% CI = 0.63 to 0.77); and mREMS, 0.72 (95% CI = 0.65 to 0.79). The random forest model AUC was statistically different from all other models (p ≤ 0.003 for all comparisons).
    Conclusions: In this proof-of-concept study, a local big data–driven, machine learning approach outperformed existing CDRs as well as traditional analytic techniques for predicting in-hospital mortality of ED patients with sepsis. Future research should prospectively evaluate the effectiveness of this approach and whether it translates into improved clinical outcomes for high-risk sepsis patients. The methods developed serve as an example of a new model for predictive analytics in emergency care that can be automated, applied to other clinical outcomes of interest, and deployed in EHRs to enable locally relevant clinical predictions. PMID:26679719

  1. Factors associated with succession of abandoned agricultural lands along the Lower Missouri River, U.S.A

    USGS Publications Warehouse

    Thogmartin, W.E.; Gallagher, M.; Young, N.; Rohweder, J.J.; Knutson, M.G.

    2009-01-01

    The 1993 flood of the Missouri River led to the abandonment of agriculture on considerable land in the floodplain. This abandonment led to a restoration opportunity for the U.S. Federal Government, purchasing those lands being sold by farmers. Restoration of this floodplain is complicated, however, by an imperfect understanding of its past environmental and vegetative conditions. We examined environmental conditions associated with the current placement of young forests and wet prairies as a guide to the potential successional trajectory for abandoned agricultural land subject to flooding. We used Bayesian mixed-effects logistic regression to examine the effects of flood frequency, soil drainage, distance from the main channel, and elevation on whether a site was in wet prairie or in forest. Study site was included as a random effect, controlling for site-specific differences not measured in our study. We found, after controlling for the effect of site, that early-successional forest sites were closer to the river and at a lower elevation but occurred on drier soils than wet prairie. In a regulated river such as the lower Missouri River, wet prairie sites are relatively isolated from the main channel compared to early-successional forest, despite occurring on relatively moister soils. The modeled results from this study may be used to predict the potential successional fate of the acquired agricultural lands, and along with information on wildlife assemblages associated with wet prairie and forest can be used to predict potential benefit of these acquisitions to wildlife conservation. © 2009 Society for Ecological Restoration International.

  2. Patterns of tree species diversity and composition in old-field successional forests in central Illinois

    Treesearch

    Scott M. Bretthauer; George Z. Gertner; Gary L. Rolfe; Jeffery O. Dawson

    2003-01-01

    Tree species diversity increases and dominance decreases with proximity to forest border in two 60-year-old successional forest stands developed on abandoned agricultural land in Piatt County, Illinois. A regression equation allowed us to quantify an increase in diversity with closeness to forest border for one of the forest stands. Shingle oak is the most dominant...

  3. Mapping SOC (Soil Organic Carbon) using LiDAR-derived vegetation indices in a random forest regression model

    NASA Astrophysics Data System (ADS)

    Will, R. M.; Glenn, N. F.; Benner, S. G.; Pierce, J. L.; Spaete, L.; Li, A.

    2015-12-01

    Quantifying SOC (Soil Organic Carbon) storage in complex terrain is challenging due to high spatial variability. Generally, the challenge is met by transforming point data to the entire landscape using surrogate, spatially-distributed, variables like elevation or precipitation. In many ecosystems, remotely sensed information on above-ground vegetation (e.g. NDVI) is a good predictor of below-ground carbon stocks. In this project, we are attempting to improve this predictive method by incorporating LiDAR-derived vegetation indices. LiDAR provides a mechanism for improved characterization of aboveground vegetation by providing structural parameters such as vegetation height and biomass. In this study, a random forest model is used to predict SOC using a suite of LiDAR-derived vegetation indices as predictor variables. The Reynolds Creek Experimental Watershed (RCEW) is an ideal location for a study of this type since it encompasses a strong elevation/precipitation gradient that supports lower biomass sagebrush ecosystems at low elevations and forests with more biomass at higher elevations. Sagebrush ecosystems composed of Wyoming, Low and Mountain Sagebrush have SOC values ranging from .4 to 1% (top 30 cm), while higher biomass ecosystems composed of aspen, juniper and fir have SOC values approaching 4% (top 30 cm). Large differences in SOC have been observed between canopy and interspace locations and high resolution vegetation information is likely to explain plot scale variability in SOC. Mapping of the SOC reservoir will help identify underlying controls on SOC distribution and provide insight into which processes are most important in determining SOC in semi-arid mountainous regions. In addition, airborne LiDAR has the potential to characterize vegetation communities at a high resolution and could be a tool for improving estimates of SOC at larger scales.

  4. Pre-operative prediction of surgical morbidity in children: comparison of five statistical models.

    PubMed

    Cooper, Jennifer N; Wei, Lai; Fernandez, Soledad A; Minneci, Peter C; Deans, Katherine J

    2015-02-01

    The accurate prediction of surgical risk is important to patients and physicians. Logistic regression (LR) models are typically used to estimate these risks. However, in the fields of data mining and machine-learning, many alternative classification and prediction algorithms have been developed. This study aimed to compare the performance of LR to several data mining algorithms for predicting 30-day surgical morbidity in children. We used the 2012 National Surgical Quality Improvement Program-Pediatric dataset to compare the performance of (1) a LR model that assumed linearity and additivity (simple LR model) (2) a LR model incorporating restricted cubic splines and interactions (flexible LR model) (3) a support vector machine, (4) a random forest and (5) boosted classification trees for predicting surgical morbidity. The ensemble-based methods showed significantly higher accuracy, sensitivity, specificity, PPV, and NPV than the simple LR model. However, none of the models performed better than the flexible LR model in terms of the aforementioned measures or in model calibration or discrimination. Support vector machines, random forests, and boosted classification trees do not show better performance than LR for predicting pediatric surgical morbidity. After further validation, the flexible LR model derived in this study could be used to assist with clinical decision-making based on patient-specific surgical risks. Copyright © 2014 Elsevier Ltd. All rights reserved.

  5. Diagnosis of colorectal cancer by near-infrared optical fiber spectroscopy and random forest

    NASA Astrophysics Data System (ADS)

    Chen, Hui; Lin, Zan; Wu, Hegang; Wang, Li; Wu, Tong; Tan, Chao

    2015-01-01

    Near-infrared (NIR) spectroscopy has the advantages of being noninvasive, fast, and relatively inexpensive, and of carrying no risk of ionizing radiation. Differences in the NIR signals can reflect many physiological changes, which are in turn associated with such factors as vascularization, cellularity, oxygen consumption, or remodeling. NIR spectral differences between colorectal cancer and healthy tissues were investigated. A Fourier transform NIR spectroscopy instrument equipped with a fiber-optic probe was used to mimic in situ clinical measurements. A total of 186 spectra were collected and then preprocessed with the standard normal variate (SNV) transformation to remove unwanted background variance. All specimens and spots used for spectral collection were confirmed by staining and examination by an experienced pathologist to ensure that they were representative of the pathology. Principal component analysis (PCA) was used to uncover possible clustering. Several methods, including random forest (RF), partial least squares-discriminant analysis (PLSDA), K-nearest neighbor, and classification and regression tree (CART), were used to extract spectral features and to construct the diagnostic models. The comparison shows that, although no obvious difference in misclassified ratio (MCR) was observed between these models, RF is preferable since it is quicker, more convenient, and insensitive to over-fitting. The results indicate that NIR spectroscopy coupled with an RF model can serve as a potential tool for discriminating colorectal cancer tissues from normal ones.
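
    The SNV preprocessing step mentioned in this abstract is commonly defined per spectrum: subtract that spectrum's own mean and divide by its own standard deviation. The sketch below assumes that common definition and the sample (n - 1) standard deviation; the abstract does not state which variant the authors used, and the function name is hypothetical.

```python
import statistics

def snv(spectrum):
    """Standard normal variate transform of one spectrum.

    Each intensity is centered on the spectrum's own mean and scaled by the
    spectrum's own sample standard deviation, which removes additive offsets
    and multiplicative scaling differences between spectra.
    """
    mean = statistics.mean(spectrum)
    sd = statistics.stdev(spectrum)  # sample standard deviation (n - 1)
    return [(value - mean) / sd for value in spectrum]
```

    Because the transform is applied row by row, two spectra that differ only by a constant offset or a positive scale factor map to the same SNV-corrected curve.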

  6. Predictors of occurrence of the aquatic macrophyte Podostemum ceratophyllum in a southern Appalachian River

    USGS Publications Warehouse

    Argentina, Jane E.; Freeman, Mary C.; Freeman, Byron J.

    2010-01-01

    The aquatic macrophyte Podostemum ceratophyllum (Hornleaf Riverweed) commonly provides habitat for invertebrates and fishes in flowing-water portions of Piedmont and Appalachian streams in the eastern US. We quantified variation in percent cover by P. ceratophyllum in a 39-km reach of the Conasauga River, TN and GA, to test the hypothesis that cover decreased with increasing non-forest land use. We estimated percent P. ceratophyllum cover in quadrats (0.09 m²) placed at random coordinates within 20 randomly selected shoals. We then used hierarchical logistic regression, in an information-theoretic framework, to evaluate relative support for models incorporating alternative combinations of microhabitat and shoal-level variables to predict the occurrence of high (≥50%) P. ceratophyllum cover. As expected, bed sediment size and measures of light availability (location in the center of the channel, canopy cover) were included in best-supported models and had similar estimated-effect sizes across models. Podostemum ceratophyllum cover declined with increasing watershed size (included in 8 of 13 models in the confidence set of models); however, this decrease in cover was not well predicted by variation in land use. Focused monitoring of temporal and spatial trends in the status of P. ceratophyllum is important due to its biotic importance in fast-flowing waters and its potential sensitivity to landscape-level changes, such as declines in forested land cover and homogenization of benthic habitats.

  7. Conformal Regression for Quantitative Structure-Activity Relationship Modeling-Quantifying Prediction Uncertainty.

    PubMed

    Svensson, Fredrik; Aniceto, Natalia; Norinder, Ulf; Cortes-Ciriano, Isidro; Spjuth, Ola; Carlsson, Lars; Bender, Andreas

    2018-05-29

    Making predictions with an associated confidence is highly desirable as it facilitates decision making and resource prioritization. Conformal regression is a machine learning framework that allows the user to define the required confidence and delivers predictions that are guaranteed to be correct to the selected extent. In this study, we apply conformal regression to model molecular properties and bioactivity values and investigate different ways to scale the resultant prediction intervals to create as efficient (i.e., narrow) regressors as possible. Different algorithms to estimate the prediction uncertainty were used to normalize the prediction ranges, and the different approaches were evaluated on 29 publicly available data sets. Our results show that the most efficient conformal regressors are obtained when using the natural exponential of the ensemble standard deviation from the underlying random forest to scale the prediction intervals, but other approaches were almost as efficient. This approach afforded an average prediction range of 1.65 pIC50 units at the 80% confidence level when applied to bioactivity modeling. The choice of nonconformity function has a pronounced impact on the average prediction range with a difference of close to one log unit in bioactivity between the tightest and widest prediction range. Overall, conformal regression is a robust approach to generate bioactivity predictions with associated confidence.
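
    The interval scaling described in this abstract can be sketched as a split (inductive) conformal regressor whose nonconformity scores are residuals divided by the exponential of an uncertainty estimate. This is a sketch under stated assumptions, not the authors' implementation: the function and argument names are hypothetical, `cal_sigma` and `test_sigma` stand in for the ensemble standard deviation of an underlying random forest, and the quantile rule is the usual ceil((n + 1) * confidence) rank.

```python
import math

def conformal_half_widths(cal_y, cal_pred, cal_sigma, test_sigma, confidence):
    """Half-widths of normalized conformal prediction intervals.

    Calibration scores are |y - prediction| / exp(sigma). The score at rank
    ceil((n + 1) * confidence) is rescaled by exp(sigma) of each test object,
    so objects with larger estimated uncertainty get wider intervals.
    """
    scores = sorted(
        abs(y - p) / math.exp(s)
        for y, p, s in zip(cal_y, cal_pred, cal_sigma)
    )
    n = len(scores)
    # Small tolerance guards against floating-point noise in the product.
    rank = math.ceil((n + 1) * confidence - 1e-12)
    if rank > n:
        # Too few calibration examples for this confidence level.
        return [math.inf for _ in test_sigma]
    quantile = scores[rank - 1]
    return [quantile * math.exp(s) for s in test_sigma]
```

    The interval for a test object is then its point prediction plus or minus the returned half-width; the average of twice the half-width corresponds to the "prediction range" that the abstract uses as its efficiency measure.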

  8. Advancing individual tree biomass prediction: assessment and alternatives to the component ratio method

    Treesearch

    Aaron Weiskittel; Jereme Frank; David Walker; Phil Radtke; David Macfarlane; James Westfall

    2015-01-01

    Prediction of forest biomass and carbon is becoming an important issue in the United States. However, estimating forest biomass and carbon is difficult and relies on empirically-derived regression equations. Based on recent findings from a national gap analysis and comprehensive assessment of the USDA Forest Service Forest Inventory and Analysis (USFS-FIA) component...

  9. Comparing Forest/Nonforest Classifications of Landsat TM Imagery for Stratifying FIA Estimates of Forest Land Area

    Treesearch

    Mark D. Nelson; Ronald E. McRoberts; Greg C. Liknes; Geoffrey R. Holden

    2005-01-01

    Landsat Thematic Mapper (TM) satellite imagery and Forest Inventory and Analysis (FIA) plot data were used to construct forest/nonforest maps of Mapping Zone 41, National Land Cover Dataset 2000 (NLCD 2000). Stratification approaches resulting from Maximum Likelihood, Fuzzy Convolution, Logistic Regression, and k-Nearest Neighbors classification/prediction methods were...

  10. Regression and Geostatistical Techniques: Considerations and Observations from Experiences in NE-FIA

    Treesearch

    Rachel Riemann; Andrew Lister

    2005-01-01

    Maps of forest variables improve our understanding of the forest resource by allowing us to view and analyze it spatially. The USDA Forest Service's Northeastern Forest Inventory and Analysis unit (NE-FIA) has used geostatistical techniques, particularly stochastic simulation, to produce maps and spatial data sets of FIA variables. That work underscores the...

  11. Preliminary results of spatial modeling of selected forest health variables in Georgia

    Treesearch

    Brock Stewart; Chris J. Cieszewski

    2009-01-01

    Variables relating to forest health monitoring, such as mortality, are difficult to predict and model. We present here the results of fitting various spatial regression models to these variables. We interpolate plot-level values compiled from the Forest Inventory and Analysis National Information Management System (FIA-NIMS) data that are related to forest health....

  12. Comparison of regression and geostatistical methods for mapping Leaf Area Index (LAI) with Landsat ETM+ data over a boreal forest.

    Treesearch

    Mercedes Berterretche; Andrew T. Hudak; Warren B. Cohen; Thomas K. Maiersperger; Stith T. Gower; Jennifer Dungan

    2005-01-01

    This study compared aspatial and spatial methods of using remote sensing and field data to predict maximum growing season leaf area index (LAI) maps in a boreal forest in Manitoba, Canada. The methods tested were orthogonal regression analysis (reduced major axis, RMA) and two geostatistical techniques: kriging with an external drift (KED) and sequential Gaussian...

  13. Regression modeling and mapping of coniferous forest basal area and tree density from discrete-return lidar and multispectral data

    Treesearch

    Andrew T. Hudak; Nicholas L. Crookston; Jeffrey S. Evans; Michael K. Falkowski; Alistair M. S. Smith; Paul E. Gessler; Penelope Morgan

    2006-01-01

    We compared the utility of discrete-return light detection and ranging (lidar) data and multispectral satellite imagery, and their integration, for modeling and mapping basal area and tree density across two diverse coniferous forest landscapes in north-central Idaho. We applied multiple linear regression models subset from a suite of 26 predictor variables derived...

  14. Above ground biomass and tree species richness estimation with airborne lidar in tropical Ghana forests

    NASA Astrophysics Data System (ADS)

    Vaglio Laurin, Gaia; Puletti, Nicola; Chen, Qi; Corona, Piermaria; Papale, Dario; Valentini, Riccardo

    2016-10-01

    Estimates of forest aboveground biomass are fundamental for carbon monitoring and accounting; delivering information at very high spatial resolution is especially valuable for local management, conservation and selective logging purposes. In tropical areas, hosting large biomass and biodiversity resources which are often threatened by unsustainable anthropogenic pressures, frequent forest resources monitoring is needed. Lidar is a powerful tool to estimate aboveground biomass at fine resolution; however its application in tropical forests has been limited, with high variability in the accuracy of results. Lidar pulses scan the forest vertical profile, and can provide structure information which is also linked to biodiversity. In the last decade the remote sensing of biodiversity has received great attention, but few studies focused on the use of lidar for assessing tree species richness in tropical forests. This research aims at estimating aboveground biomass and tree species richness using discrete return airborne lidar in Ghana forests. We tested an advanced statistical technique, Multivariate Adaptive Regression Splines (MARS), which does not require assumptions on data distribution or on the relationships between variables, being suitable for studying ecological variables. We compared the MARS regression results with those obtained by multilinear regression and found that both algorithms were effective, but MARS provided higher accuracy for both biomass (R² = 0.72) and species richness (R² = 0.64). We also noted strong correlation between biodiversity and biomass field values. Even if the forest areas under analysis are limited in extent and represent peculiar ecosystems, the preliminary indications produced by our study suggest that instruments such as lidar, specifically useful for pinpointing forest structure, can also be exploited as a support for tree species richness assessment.

  15. Sensitivity of ALOS/PALSAR imagery to forest degradation by fire in northern Amazon

    NASA Astrophysics Data System (ADS)

    Martins, Flora da Silva Ramos Vieira; dos Santos, João Roberto; Galvão, Lênio Soares; Xaud, Haron Abrahim Magalhães

    2016-07-01

    We evaluated the sensitivity of the full polarimetric Phased Array type L-band Synthetic Aperture Radar (PALSAR), onboard the Advanced Land Observing Satellite (ALOS), to forest degradation caused by fires in northern Amazon, Brazil. We searched for changes in PALSAR signal and tri-dimensional polarimetric responses for different classes of fire disturbance defined by fire frequency and severity. Since the aboveground biomass (AGB) is affected by fire, multiple regression models to estimate AGB were obtained for the whole set of coherent and incoherent attributes (general model) and for each set separately (specific models). The results showed that the polarimetric L-band PALSAR attributes were sensitive to variations in canopy structure and AGB caused by forest fire. However, except for the unburned and thrice burned classes, no single PALSAR attribute was able to discriminate between the intermediate classes of forest degradation by fire. Both the coherent and incoherent polarimetric attributes were important to explain AGB variations in tropical forests affected by fire. The HV backscattering coefficient, anisotropy, double-bounce component, orientation angle, volume index and HH-VV phase difference were PALSAR attributes selected from multiple regression analysis to estimate AGB. The general regression model, combining phase and power radar metrics, presented better results than specific models using coherent or incoherent attributes. The polarimetric responses indicated the dominance of VV-oriented backscattering in primary forest and lightly burned forests. The HH-oriented backscattering predominated in heavily and frequently burned forests. The results suggested a greater contribution of horizontally arranged constituents such as fallen trunks or branches in areas severely affected by fire.

  16. Multiple metrics of diversity have different effects on temperate forest functioning over succession.

    PubMed

    Yuan, Zuoqiang; Wang, Shaopeng; Gazol, Antonio; Mellard, Jarad; Lin, Fei; Ye, Ji; Hao, Zhanqing; Wang, Xugao; Loreau, Michel

    2016-12-01

    Biodiversity can be measured by taxonomic, phylogenetic, and functional diversity. How ecosystem functioning depends on these measures of diversity can vary from site to site and depends on successional stage. Here, we measured taxonomic, phylogenetic, and functional diversity, and examined their relationship with biomass in two successional stages of the broad-leaved Korean pine forest in northeastern China. Functional diversity was calculated from six plant traits, and aboveground biomass (AGB) and coarse woody productivity (CWP) were estimated using data from three forest censuses (10 years) in two large fully mapped forest plots (25 and 5 ha). 11 of the 12 regressions between biomass variables (AGB and CWP) and indices of diversity showed significant positive relationships, especially those with phylogenetic diversity. The mean tree diversity-biomass regressions increased from 0.11 in secondary forest to 0.31 in old-growth forest, implying a stronger biodiversity effect in more mature forest. Multi-model selection results showed that models including species richness, phylogenetic diversity, and single functional traits explained more variation in forest biomass than other candidate models. The models with a single functional trait, i.e., leaf area in secondary forest and wood density in mature forest, provided better explanations for forest biomass than models that combined all six functional traits. This finding may reflect different strategies in growth and resource acquisition in secondary and old-growth forests.

  17. Diversity and composition of tropical secondary forests recovering from large-scale clearing: results from the 1990 inventory in Puerto Rico

    Treesearch

    J. Danilo Chinea; Eileen H. Helmer

    2003-01-01

    The extensive recovery from agricultural clearing of Puerto Rican forests over the past half-century provides a good opportunity to study tropical forest recovery on a landscape scale. Using ordination and regression techniques, we analyzed forest inventory data from across Puerto Rico’s moist and wet secondary forests to evaluate their species composition and whether...

  18. Diversity and composition of tropical secondary forests recovering from large-scale clearing: results from the 1990 inventory in Puerto Rico.

    Treesearch

    J. Danilo Chinea; Eileen H. Helmer

    2003-01-01

    The extensive recovery from agricultural clearing of Puerto Rican forests over the past half-century provides a good opportunity to study tropical forest recovery on a landscape scale. Using ordination and regression techniques, we analyzed forest inventory data from across Puerto Rico’s moist and wet secondary forests to evaluate their species composition and whether...

  19. Assessing the accuracy and stability of variable selection methods for random forest modeling in ecology

    EPA Science Inventory

    Random forest (RF) modeling has emerged as an important statistical learning method in ecology due to its exceptional predictive performance. However, for large and complex ecological datasets there is limited guidance on variable selection methods for RF modeling. Typically, e...

  20. The prediction of food additives in the fruit juice based on electronic nose with chemometrics.

    PubMed

    Qiu, Shanshan; Wang, Jun

    2017-09-01

    Food additives are added to products to enhance their taste, and preserve flavor or appearance. While their use should be restricted to achieve a technological benefit, the contents of food additives should also be strictly controlled. In this study, E-nose was applied as an alternative to traditional monitoring technologies for determining two food additives, namely benzoic acid and chitosan. For quantitative monitoring, support vector machine (SVM), random forest (RF), extreme learning machine (ELM) and partial least squares regression (PLSR) were applied to establish regression models between E-nose signals and the amount of food additives in fruit juices. The monitoring models based on ELM and RF reached higher correlation coefficients (R² values) and lower root mean square errors (RMSEs) than models based on PLSR and SVM. This work indicates that E-nose combined with RF or ELM can be a cost-effective, easy-to-build and rapid detection system for food additive monitoring. Copyright © 2017 Elsevier Ltd. All rights reserved.

  1. Recovering Galaxy Properties Using Gaussian Process SED Fitting

    NASA Astrophysics Data System (ADS)

    Iyer, Kartheik; Awan, Humna

    2018-01-01

    Information about physical quantities like the stellar mass, star formation rates, and ages for distant galaxies is contained in their spectral energy distributions (SEDs), obtained through photometric surveys like SDSS, CANDELS, LSST etc. However, noise in the photometric observations often is a problem, and using naive machine learning methods to estimate physical quantities can result in overfitting the noise, or converging on solutions that lie outside the physical regime of parameter space. We use Gaussian Process regression trained on a sample of SEDs corresponding to galaxies from a Semi-Analytic model (Somerville+15a) to estimate their stellar masses, and compare its performance to a variety of different methods, including simple linear regression, Random Forests, and k-Nearest Neighbours. We find that the Gaussian Process method is robust to noise and predicts not only stellar masses but also their uncertainties. The method is also robust in the cases where the distribution of the training data is not identical to the target data, which can be extremely useful when generalized to more subtle galaxy properties.

  2. Measuring carbon in forests: current status and future challenges.

    PubMed

    Brown, Sandra

    2002-01-01

    To accurately and precisely measure the carbon in forests is gaining global attention as countries seek to comply with agreements under the UN Framework Convention on Climate Change. Established methods for measuring carbon in forests exist, and are best based on permanent sample plots laid out in a statistically sound design. Measurements on trees in these plots can be readily converted to aboveground biomass using either biomass expansion factors or allometric regression equations. A compilation of existing root biomass data for upland forests of the world generated a significant regression equation that can be used to predict root biomass based on aboveground biomass only. Methods for measuring coarse dead wood have been tested in many forest types, but the methods could be improved if a non-destructive tool for measuring the density of dead wood was developed. Future measurements of carbon storage in forests may rely more on remote sensing data, and new remote data collection technologies are in development.

  3. Prostate cancer prediction using the random forest algorithm that takes into account transrectal ultrasound findings, age, and serum levels of prostate-specific antigen.

    PubMed

    Xiao, Li-Hong; Chen, Pei-Ran; Gou, Zhong-Ping; Li, Yong-Zhong; Li, Mei; Xiang, Liang-Cheng; Feng, Ping

    2017-01-01

    The aim of this study is to evaluate the ability of the random forest algorithm that combines data on transrectal ultrasound findings, age, and serum levels of prostate-specific antigen to predict prostate carcinoma. Clinico-demographic data were analyzed for 941 patients with prostate diseases treated at our hospital, including age, serum prostate-specific antigen levels, transrectal ultrasound findings, and pathology diagnosis based on ultrasound-guided needle biopsy of the prostate. These data were compared between patients with and without prostate cancer using the Chi-square test, and then entered into the random forest model to predict diagnosis. Patients with and without prostate cancer differed significantly in age and serum prostate-specific antigen levels (P < 0.001), as well as in all transrectal ultrasound characteristics (P < 0.05) except uneven echo (P = 0.609). The random forest model based on age, prostate-specific antigen and ultrasound predicted prostate cancer with an accuracy of 83.10%, sensitivity of 65.64%, and specificity of 93.83%. Positive predictive value was 86.72%, and negative predictive value was 81.64%. By integrating age, prostate-specific antigen levels and transrectal ultrasound findings, the random forest algorithm shows better diagnostic performance for prostate cancer than either diagnostic indicator on its own. This algorithm may help improve diagnosis of the disease by identifying patients at high risk for biopsy.
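
    The five performance figures quoted in this abstract all derive from a single 2x2 confusion matrix. The sketch below shows the arithmetic. The cell counts are an assumption: they were back-calculated from the reported percentages and the 941-patient total, and are not given in the abstract itself.

```python
# Counts back-calculated from the percentages reported in the abstract;
# they are an assumption, not figures stated by the authors.
TP, FN, FP, TN = 235, 123, 36, 547  # sums to the 941 patients in the study


def diagnostic_metrics(tp, fn, fp, tn):
    """Return standard diagnostic-accuracy measures as fractions."""
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
    }


metrics = diagnostic_metrics(TP, FN, FP, TN)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

    With these assumed counts the output reproduces the reported 83.10%, 65.64%, 93.83%, 86.72% and 81.64%.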

  4. The Efficiency of Random Forest Method for Shoreline Extraction from LANDSAT-8 and GOKTURK-2 Imageries

    NASA Astrophysics Data System (ADS)

    Bayram, B.; Erdem, F.; Akpinar, B.; Ince, A. K.; Bozkurt, S.; Catal Reis, H.; Seker, D. Z.

    2017-11-01

    Coastal monitoring plays a vital role in environmental planning and hazard management related issues. Since shorelines are fundamental data for environment management, disaster management, coastal erosion studies, modelling of sediment transport and coastal morphodynamics, various techniques have been developed to extract shorelines. Random Forest is one of these techniques and is used in this study for shoreline extraction. This algorithm is a machine learning method based on decision trees. Decision trees analyse classes of training data and create rules for classification. In this study, the Terkos region was chosen for the proposed method within the scope of the TUBITAK project (Project No: 115Y718) titled "Integration of Unmanned Aerial Vehicles for Sustainable Coastal Zone Monitoring Model - Three-Dimensional Automatic Coastline Extraction and Analysis: Istanbul-Terkos Example". The Random Forest algorithm was implemented to extract the shoreline of the Black Sea near the lake from LANDSAT-8 and GOKTURK-2 satellite imagery taken in 2015. The MATLAB environment was used for classification. To obtain land and water-body classes, the Random Forest method was applied to the NIR bands of the LANDSAT-8 (5th band) and GOKTURK-2 (4th band) imagery. Each image was digitized manually and shorelines were obtained for accuracy assessment. According to the accuracy assessment results, the Random Forest method is efficient for shoreline extraction from both medium and high resolution images.

  5. Determining coniferous forest cover and forest fragmentation with NOAA-9 advanced very high resolution radiometer data

    NASA Technical Reports Server (NTRS)

    Ripple, William J.

    1995-01-01

    NOAA-9 satellite data from the Advanced Very High Resolution Radiometer (AVHRR) were used in conjunction with Landsat Multispectral Scanner (MSS) data to determine the proportion of closed canopy conifer forest cover in the Cascade Range of Oregon. A closed canopy conifer map, as determined from the MSS, was registered with AVHRR pixels. Regression was used to relate closed canopy conifer forest cover to AVHRR spectral data. A two-variable (band) regression model accounted for more variance in conifer cover than the Normalized Difference Vegetation Index (NDVI). The spectral signatures of various conifer successional stages were also examined. A map of Oregon was produced showing the proportion of closed canopy conifer cover for each AVHRR pixel. The AVHRR was responsive to both the percentage of closed canopy conifer cover and the successional stage in these temperate coniferous forests in this experiment.

  6. Missouri Ozark Forest Ecosystem Project: the experiment

    Treesearch

    Steven L. Sheriff

    2002-01-01

    Missouri Ozark Forest Ecosystem Project (MOFEP) is a unique experiment to learn about the impacts of management practices on a forest system. Three forest management practices (uneven-aged management, even-aged management, and no-harvest management) as practiced by the Missouri Department of Conservation were randomly assigned to nine forest management sites using a...

  7. Online updating of context-aware landmark detectors for prostate localization in daily treatment CT images

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Dai, Xiubin; Gao, Yaozong; Shen, Dinggang, E-mail: dgshen@med.unc.edu

    2015-05-15

    Purpose: In image guided radiation therapy, it is crucial to quickly and accurately localize the prostate in the daily treatment images. To this end, the authors propose an online update scheme for landmark-guided prostate segmentation, which can fully exploit valuable patient-specific information contained in the previous treatment images and can achieve improved performance in landmark detection and prostate segmentation. Methods: To localize the prostate in the daily treatment images, the authors first automatically detect six anatomical landmarks on the prostate boundary by adopting a context-aware landmark detection method. Specifically, in this method, a two-layer regression forest is trained as a detector for each target landmark. Once all the newly detected landmarks from new treatment images are reviewed or adjusted (if necessary) by clinicians, they are further included into the training pool as new patient-specific information to update all the two-layer regression forests for the next treatment day. As more and more treatment images of the current patient are acquired, the two-layer regression forests can be continually updated by incorporating the patient-specific information into the training procedure. After all target landmarks are detected, a multiatlas random sample consensus (multiatlas RANSAC) method is used to segment the entire prostate by fusing multiple previously segmented prostates of the current patient after they are aligned to the current treatment image. Subsequently, the segmented prostate of the current treatment image is again reviewed (or even adjusted if needed) by clinicians before including it as a new shape example into the prostate shape dataset for helping localize the entire prostate in the next treatment image.
Results: The experimental results on 330 images of 24 patients show the effectiveness of the authors’ proposed online update scheme in improving the accuracies of both landmark detection and prostate segmentation. Besides, compared to the other state-of-the-art prostate segmentation methods, the authors’ method achieves the best performance. Conclusions: By appropriate use of valuable patient-specific information contained in the previous treatment images, the authors’ proposed online update scheme can obtain satisfactory results for both landmark detection and prostate segmentation.

  8. Validating genetic markers of response to recombinant human growth hormone in children with growth hormone deficiency and Turner syndrome: the PREDICT validation study

    PubMed Central

    Stevens, Adam; Murray, Philip; Wojcik, Jerome; Raelson, John; Koledova, Ekaterina; Chatelain, Pierre

    2016-01-01

    Objective: Single-nucleotide polymorphisms (SNPs) associated with the response to recombinant human growth hormone (r-hGH) have previously been identified in growth hormone deficiency (GHD) and Turner syndrome (TS) children in the PREDICT long-term follow-up (LTFU) study (Nbib699855). Here, we describe the PREDICT validation (VAL) study (Nbib1419249), which aimed to confirm these genetic associations. Design and methods: Children with GHD (n = 293) or TS (n = 132) were recruited retrospectively from 29 sites in nine countries. All children had completed 1 year of r-hGH therapy. 48 SNPs previously identified as associated with first year growth response to r-hGH were genotyped. Regression analysis was used to assess the association between genotype and growth response using clinical/auxological variables as covariates. Further analysis was undertaken using random forest classification. Results: The children were younger, and the growth response was higher in the VAL study. Direct genotype analysis did not replicate what was found in the LTFU study. However, using exploratory regression models with covariates, a consistent relationship with growth response in both VAL and LTFU was shown for four genes – SOS1 and INPPL1 in GHD and ESR1 and PTPN1 in TS. The random forest analysis demonstrated that only clinical covariates were important in the prediction of growth response in mild GHD (>4 to <10 μg/L on GH stimulation test), however, in severe GHD (≤4 μg/L) several SNPs contributed (in IGF2, GRB10, FOS, IGFBP3 and GHRHR). Conclusions: The PREDICT validation study supports, in an independent cohort, the association of four of 48 genetic markers with growth response to r-hGH treatment in both pre-pubertal GHD and TS children after controlling for clinical/auxological covariates. However, the contribution of these SNPs in a prediction model of first-year response is not sufficient for routine clinical use. PMID:27651465

  9. Flash Flood Type Identification within Catchments in Beijing Mountainous Area

    NASA Astrophysics Data System (ADS)

    Nan, W.

    2017-12-01

    Flash floods are a common type of disaster in mountainous areas. With their large flow rates, strong flushing force and destructive power, they have periodically caused loss of life and destruction of infrastructure in mountainous areas. Because Beijing is China's political, economic and cultural center, disaster prevention and control in its mountainous area has always been of wide concern. Identifying the flash flood type within a catchment, according to transport mechanism, sediment concentration and density, can provide a basis for hazard prevention and mitigation policy. Taking Beijing as the study area, this paper extracted parameters related to catchment morphology and topography. Using Bayes discriminant analysis, logistic regression and random forest, the catchments in the Beijing mountainous area were divided into water flood processes, fluvial sediment transport processes and debris flow processes. Logistic regression showed the highest accuracy, with an overall accuracy of 88.2%. Bayes discriminant analysis and random forest had poor prediction effects. This study confirmed the ability of morphological and topographic features to identify flash flood processes. The circularity ratio, elongation ratio and roughness index can be used to explain the flash flood types effectively, and the Melton ratio and elevation relief ratio also performed well in the identification, whereas the drainage density seemed not to be an issue at this level of detail. Based on the analysis of spatial patterns of flash flood types, fluvial sediment transport and debris flow were the dominant hazards, while the pure water flood process was much less common. The catchments dominated by fluvial sediment transport were mainly distributed in the Yan Mountain region, where the fault belts are relatively dense.
The debris flow process is prone to occur in the Taihang Mountain region because of the abundant coal gangue. The pure water flood catchments were mainly distributed along the transitional mountain front.

  10. Validating genetic markers of response to recombinant human growth hormone in children with growth hormone deficiency and Turner syndrome: the PREDICT validation study.

    PubMed

    Stevens, Adam; Murray, Philip; Wojcik, Jerome; Raelson, John; Koledova, Ekaterina; Chatelain, Pierre; Clayton, Peter

    2016-12-01

    Single-nucleotide polymorphisms (SNPs) associated with the response to recombinant human growth hormone (r-hGH) have previously been identified in growth hormone deficiency (GHD) and Turner syndrome (TS) children in the PREDICT long-term follow-up (LTFU) study (Nbib699855). Here, we describe the PREDICT validation (VAL) study (Nbib1419249), which aimed to confirm these genetic associations. Children with GHD (n = 293) or TS (n = 132) were recruited retrospectively from 29 sites in nine countries. All children had completed 1 year of r-hGH therapy. 48 SNPs previously identified as associated with first year growth response to r-hGH were genotyped. Regression analysis was used to assess the association between genotype and growth response using clinical/auxological variables as covariates. Further analysis was undertaken using random forest classification. The children were younger, and the growth response was higher in VAL study. Direct genotype analysis did not replicate what was found in the LTFU study. However, using exploratory regression models with covariates, a consistent relationship with growth response in both VAL and LTFU was shown for four genes - SOS1 and INPPL1 in GHD and ESR1 and PTPN1 in TS. The random forest analysis demonstrated that only clinical covariates were important in the prediction of growth response in mild GHD (>4 to <10 μg/L on GH stimulation test), however, in severe GHD (≤4 μg/L) several SNPs contributed (in IGF2, GRB10, FOS, IGFBP3 and GHRHR). The PREDICT validation study supports, in an independent cohort, the association of four of 48 genetic markers with growth response to r-hGH treatment in both pre-pubertal GHD and TS children after controlling for clinical/auxological covariates. However, the contribution of these SNPs in a prediction model of first-year response is not sufficient for routine clinical use. © 2016 European Society of Endocrinology.

  11. Using Random Forest to Improve the Downscaling of Global Livestock Census Data

    PubMed Central

    Nicolas, Gaëlle; Robinson, Timothy P.; Wint, G. R. William; Conchedda, Giulia; Cinardi, Giuseppina; Gilbert, Marius

    2016-01-01

    Large scale, high-resolution global data on farm animal distributions are essential for spatially explicit assessments of the epidemiological, environmental and socio-economic impacts of the livestock sector. This has been the major motivation behind the development of the Gridded Livestock of the World (GLW) database, which has been extensively used since its first publication in 2007. The database relies on a downscaling methodology whereby census counts of animals in sub-national administrative units are redistributed at the level of grid cells as a function of a series of spatial covariates. The recent upgrade of GLW1 to GLW2 involved automating the processing, improvement of input data, and downscaling at a spatial resolution of 1 km per cell (5 km per cell in the earlier version). The underlying statistical methodology, however, remained unchanged. In this paper, we evaluate new methods to downscale census data with a higher accuracy and increased processing efficiency. Two main factors were evaluated, based on sample census datasets of cattle in Africa and chickens in Asia. First, we implemented and evaluated Random Forest models (RF) instead of stratified regressions. Second, we investigated whether models that predicted the number of animals per rural person (per capita) could provide better downscaled estimates than the previous approach that predicted absolute densities (animals per km2). RF models consistently provided better predictions than the stratified regressions for both continents and species. The benefit of per capita over absolute density models varied according to the species and continent. In addition, different technical options were evaluated to reduce the processing time while maintaining their predictive power. Future GLW runs (GLW 3.0) will apply the new RF methodology with optimized modelling options. 
The potential benefit of per capita models will need to be further investigated with a better distinction between rural and agricultural populations. PMID:26977807
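
    The downscaling step described above, in which an administrative unit's census count is redistributed over its grid cells, can be sketched as a proportional allocation. This is a generic illustration and not the GLW implementation: the function name `downscale` and the per-cell weights (standing in for densities predicted from spatial covariates) are hypothetical.

```python
def downscale(census_count, cell_weights):
    """Spread one administrative unit's census count over its grid cells
    in proportion to non-negative per-cell weights."""
    total = sum(cell_weights)
    if total == 0:
        # No information to go on: fall back to an even split.
        return [census_count / len(cell_weights)] * len(cell_weights)
    return [census_count * w / total for w in cell_weights]


# Hypothetical weights, e.g. animal densities predicted by a model
# from spatial covariates for each of four grid cells.
weights = [0.5, 1.5, 3.0, 0.0]
cells = downscale(1000, weights)
print(cells)  # [100.0, 300.0, 600.0, 0.0]
```

    Whatever model supplies the weights, this allocation keeps the cell values summing to the original census count for the unit.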

  12. Comparison of Nine Statistical Model Based Warfarin Pharmacogenetic Dosing Algorithms Using the Racially Diverse International Warfarin Pharmacogenetic Consortium Cohort Database

    PubMed Central

    Liu, Rong; Li, Xi; Zhang, Wei; Zhou, Hong-Hao

    2015-01-01

    Objective: Multiple linear regression (MLR) and machine learning techniques in pharmacogenetic algorithm-based warfarin dosing have been reported. However, performances of these algorithms in racially diverse groups have never been objectively evaluated and compared. In this literature-based study, we compared the performances of eight machine learning techniques with those of MLR in a large, racially-diverse cohort. Methods: MLR, artificial neural network (ANN), regression tree (RT), multivariate adaptive regression splines (MARS), boosted regression tree (BRT), support vector regression (SVR), random forest regression (RFR), lasso regression (LAR) and Bayesian additive regression trees (BART) were applied in warfarin dose algorithms in a cohort from the International Warfarin Pharmacogenetics Consortium database. Covariates obtained by stepwise regression from 80% of randomly selected patients were used to develop algorithms. To compare the performances of these algorithms, the mean percentage of patients whose predicted dose fell within 20% of the actual dose (mean percentage within 20%) and the mean absolute error (MAE) were calculated in the remaining 20% of patients. The performances of these techniques in different races, as well as the dose ranges of therapeutic warfarin were compared. Robust results were obtained after 100 rounds of resampling. Results: BART, MARS and SVR were statistically indistinguishable and significantly outperformed all the other approaches in the whole cohort (MAE: 8.84–8.96 mg/week, mean percentage within 20%: 45.88%–46.35%). In the White population, MARS and BART showed higher mean percentage within 20% and lower mean MAE than those of MLR (all p values < 0.05). In the Asian population, SVR, BART, MARS and LAR performed the same as MLR. MLR and LAR optimally performed among the Black population.
When patients were grouped in terms of warfarin dose range, all machine learning techniques except ANN and LAR showed significantly higher mean percentage within 20%, and lower MAE (all p values < 0.05) than MLR in the low- and high- dose ranges. Conclusion: Overall, the machine learning-based techniques BART, MARS and SVR performed better than MLR in warfarin pharmacogenetic dosing. Differences in the algorithms’ performances exist among the races. Moreover, machine learning-based algorithms tended to perform better in the low- and high- dose ranges than MLR. PMID:26305568
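
    The two evaluation measures named in this abstract, mean absolute error and the percentage of patients whose predicted dose falls within 20% of the actual dose, can be computed as in the sketch below. The dose values are made up for illustration and are not taken from the study.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted doses."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def pct_within_20(actual, predicted):
    """Percentage of cases whose prediction lies within 20% of the actual dose."""
    hits = sum(abs(p - a) <= 0.2 * a for a, p in zip(actual, predicted))
    return 100.0 * hits / len(actual)


# Hypothetical weekly doses in mg/week (not data from the cohort).
actual = [20.0, 35.0, 50.0, 28.0]
predicted = [22.0, 30.0, 65.0, 28.0]

print(mean_absolute_error(actual, predicted))  # 5.5
print(pct_within_20(actual, predicted))        # 75.0
```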

  13. Variable selection with random forest: Balancing stability, performance, and interpretation in ecological and environmental modeling

    EPA Science Inventory

    Random forest (RF) is popular in ecological and environmental modeling, in part, because of its insensitivity to correlated predictors and resistance to overfitting. Although variable selection has been proposed to improve both performance and interpretation of RF models, it is u...

  14. Random Forests for Evaluating Pedagogy and Informing Personalized Learning

    ERIC Educational Resources Information Center

    Spoon, Kelly; Beemer, Joshua; Whitmer, John C.; Fan, Juanjuan; Frazee, James P.; Stronach, Jeanne; Bohonak, Andrew J.; Levine, Richard A.

    2016-01-01

    Random forests are presented as an analytics foundation for educational data mining tasks. The focus is on course- and program-level analytics including evaluating pedagogical approaches and interventions and identifying and characterizing at-risk students. As part of this development, the concept of individualized treatment effects (ITE) is…

  15. Employing canopy hyperspectral narrowband data and random forest algorithm to differentiate palmer amaranth from colored cotton

    USDA-ARS's Scientific Manuscript database

    Palmer amaranth (Amaranthus palmeri S. Wats.) invasion negatively impacts cotton (Gossypium hirsutum L.) production systems throughout the United States. The objective of this study was to evaluate canopy hyperspectral narrowband data as input into the random forest machine learning algorithm to dis...

  16. Statistical properties of mean stand biomass estimators in a LIDAR-based double sampling forest survey design.

    Treesearch

    H.E. Anderson; J. Breidenbach

    2007-01-01

    Airborne laser scanning (LIDAR) can be a valuable tool in double-sampling forest survey designs. LIDAR-derived forest structure metrics are often highly correlated with important forest inventory variables, such as mean stand biomass, and LIDAR-based synthetic regression estimators have the potential to be highly efficient compared to single-stage estimators, which...

  17. A Proposal for Phase 4 of the Forest Inventory and Analysis Program

    Treesearch

    Ronald E. McRoberts

    2005-01-01

    Maps of forest cover were constructed using observations from forest inventory plots, Landsat Thematic Mapper satellite imagery, and a logistic regression model. Estimates of mean proportion forest area and the variance of the mean were calculated for circular study areas with radii ranging from 1 km to 15 km. The spatial correlation among pixel predictions was...

  18. Thorough statistical comparison of machine learning regression models and their ensembles for sub-pixel imperviousness and imperviousness change mapping

    NASA Astrophysics Data System (ADS)

    Drzewiecki, Wojciech

    2017-12-01

    We evaluated the performance of nine machine learning regression algorithms and their ensembles for sub-pixel estimation of impervious area coverage from Landsat imagery. The accuracy of imperviousness mapping at individual time points was assessed based on RMSE, MAE and R². These measures were also used for the assessment of imperviousness change intensity estimations. The applicability for detection of relevant changes in impervious area coverage at sub-pixel level was evaluated using overall accuracy, F-measure and ROC Area Under Curve. The results showed that the Cubist algorithm may be advised for Landsat-based mapping of imperviousness for single dates. Stochastic gradient boosting of regression trees (GBM) may also be considered for this purpose. However, the Random Forest algorithm is endorsed for both imperviousness change detection and mapping of its intensity. In all applications the heterogeneous model ensembles performed at least as well as the best individual models or better. They may be recommended for improving the quality of sub-pixel imperviousness and imperviousness change mapping. The study also revealed limitations of the investigated methodology for detection of subtle changes of imperviousness inside the pixel. None of the tested approaches was able to reliably classify changed and non-changed pixels if the relevant change threshold was set at one or three percent. Also, for the five percent change threshold most of the algorithms did not ensure that the accuracy of the change map is higher than the accuracy of a random classifier. For a relevant change threshold of ten percent all approaches performed satisfactorily.
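
    For readers unfamiliar with two of the accuracy measures named above, the following sketch computes RMSE and R² for a set of predictions. R² is taken here as the coefficient of determination (1 minus residual over total sum of squares); that choice is an assumption, since the abstract does not say which definition the study used, and the impervious fractions are invented for illustration.

```python
import math


def rmse(actual, predicted):
    """Root mean square error of the predictions."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical sub-pixel impervious fractions (reference vs. model output).
reference = [0.10, 0.40, 0.60, 0.90]
estimated = [0.20, 0.30, 0.70, 0.80]

print(round(rmse(reference, estimated), 3))       # 0.1
print(round(r_squared(reference, estimated), 3))  # 0.882
```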

  19. Factors influencing the sustained participation of farmers in participatory forestry: a case study in central Sal forests in Bangladesh.

    PubMed

    Salam, M A; Noguchi, T; Koike, M

    2005-01-01

    Wide acceptance of sustainable development as a concept and as the goal of forest management has shifted forest management policies from a traditional to a people-oriented approach. Consequently, with its multiple new objectives, forest management has become more complex and an information gap exists between what is known and what is utilized, which hinders the sustained participation of farmers. This gap arose mainly due to an interrupted flow of information. With participatory forestry, the information flow requires a broad approach that goes beyond the forest ecosystem and includes the different stakeholders. Thus in participatory forest management strategies, policymakers, planners and project designers need to incorporate all relevant information within the context of the dynamic interaction between stakeholders and the forest environment. They should understand the impact of factors such as management policies, economics and conflicts on the sustained participation of farmers. This study aimed to use primary cross-sectional data to identify the factors that might influence the sustained participation of farmers in participatory forestry. Using stratified random sampling, 581 participants were selected to take part in this study, and data were collected through a structured questionnaire by interviewing the selected participants. To identify the dominant factors necessary for the sustained participation of farmers, logistic regression analyses were performed. The following results were observed: (a) sustained participation is positively and significantly correlated with (i) satisfaction of the participants with the tree species planted on their plots; (ii) confidence of the participants that their aspired benefits will be received; (iii) provision of training on different aspects of participatory forestry; (iv) contribution of participants' money to Tree Farming Funds.
(b) The sustained participation of farmers is negatively and significantly correlated with the disruption of local peoples' interests through implementation of participatory forestry programs, and long delays in the harvesting of trees after completion of the contractual agreement period.

  20. Old-growth and mature forests near spotted owl nests in western Oregon

    NASA Technical Reports Server (NTRS)

    Ripple, William J.; Johnson, David H.; Hershey, K. T.; Meslow, E. Charles

    1995-01-01

    We investigated how the amount of old-growth and mature forest influences the selection of nest sites by northern spotted owls (Strix occidentalis caurina) in the Central Cascade Mountains of Oregon. We used 7 different plot sizes to compare the proportion of mature and old-growth forest between 30 nest sites and 30 random sites. The proportion of old-growth and mature forest was significantly greater at nest sites than at random sites for all plot sizes (P ≤ 0.01). Thus, management of the spotted owl might require setting the percentage of old-growth and mature forest retained from harvesting at least 1 standard deviation above the mean for the 30 nest sites we examined.

  1. Tissue segmentation of computed tomography images using a Random Forest algorithm: a feasibility study

    NASA Astrophysics Data System (ADS)

    Polan, Daniel F.; Brady, Samuel L.; Kaufman, Robert A.

    2016-09-01

    There is a need for robust, fully automated whole body organ segmentation for diagnostic CT. This study investigates and optimizes a Random Forest algorithm for automated organ segmentation; explores the limitations of a Random Forest algorithm applied to the CT environment; and demonstrates segmentation accuracy in a feasibility study of pediatric and adult patients. To the best of our knowledge, this is the first study to investigate a trainable Weka segmentation (TWS) implementation using Random Forest machine-learning as a means to develop a fully automated tissue segmentation tool developed specifically for pediatric and adult examinations in a diagnostic CT environment. Current innovation in computed tomography (CT) is focused on radiomics, patient-specific radiation dose calculation, and image quality improvement using iterative reconstruction, all of which require specific knowledge of tissue and organ systems within a CT image. The purpose of this study was to develop a fully automated Random Forest classifier algorithm for segmentation of neck-chest-abdomen-pelvis CT examinations based on pediatric and adult CT protocols. Seven materials were classified: background, lung/internal air or gas, fat, muscle, solid organ parenchyma, blood/contrast enhanced fluid, and bone tissue using Matlab and the TWS plugin of FIJI. The following classifier feature filters of TWS were investigated: minimum, maximum, mean, and variance evaluated over a voxel radius of 2^n (n from 0 to 4), along with noise reduction and edge preserving filters: Gaussian, bilateral, Kuwahara, and anisotropic diffusion. The Random Forest algorithm used 200 trees with 2 features randomly selected per node. The optimized auto-segmentation algorithm resulted in 16 image features including features derived from maximum, mean, variance Gaussian and Kuwahara filters.
Dice similarity coefficient (DSC) calculations between manually segmented and Random Forest algorithm segmented images from 21 patient image sections were analyzed. The automated algorithm produced segmentation of seven material classes with a median DSC of 0.86 ± 0.03 for pediatric patient protocols, and 0.85 ± 0.04 for adult patient protocols. Additionally, 100 randomly selected patient examinations were segmented and analyzed, and a mean sensitivity of 0.91 (range: 0.82-0.98), specificity of 0.89 (range: 0.70-0.98), and accuracy of 0.90 (range: 0.76-0.98) were demonstrated. In this study, we demonstrate that this fully automated segmentation tool was able to produce fast and accurate segmentation of the neck and trunk of the body over a wide range of patient habitus and scan parameters.
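
    The agreement measures named in this abstract (DSC, sensitivity, specificity, accuracy) can be illustrated with a minimal sketch. It assumes two equal-length sequences of per-voxel class labels, one manual and one automated; the function names and the toy labels are hypothetical and are not taken from the study.

```python
def dice_coefficient(manual, automated, label):
    """Dice similarity coefficient for one class: 2|A and B| / (|A| + |B|)."""
    if len(manual) != len(automated):
        raise ValueError("label sequences must have the same length")
    in_manual = sum(1 for m in manual if m == label)
    in_auto = sum(1 for a in automated if a == label)
    overlap = sum(1 for m, a in zip(manual, automated) if m == label and a == label)
    if in_manual + in_auto == 0:
        return float("nan")  # class absent from both segmentations
    return 2.0 * overlap / (in_manual + in_auto)


def sensitivity_specificity_accuracy(manual, automated, label):
    """One-vs-rest sensitivity, specificity and accuracy, manual labels as truth."""
    tp = fn = fp = tn = 0
    for m, a in zip(manual, automated):
        if m == label and a == label:
            tp += 1
        elif m == label:
            fn += 1
        elif a == label:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return sensitivity, specificity, accuracy


# Toy labels, made up for illustration only.
manual = ["fat", "fat", "muscle", "bone", "bone", "bone"]
automated = ["fat", "muscle", "muscle", "bone", "bone", "fat"]
bone_dsc = dice_coefficient(manual, automated, "bone")
bone_stats = sensitivity_specificity_accuracy(manual, automated, "bone")
```

    In a multi-class setting such as the seven material classes above, these per-class values would typically be computed once per class and then summarized.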

  2. Security authentication with a three-dimensional optical phase code using random forest classifier: an overview

    NASA Astrophysics Data System (ADS)

    Markman, Adam; Carnicer, Artur; Javidi, Bahram

    2017-05-01

    We overview our recent work [1] on utilizing three-dimensional (3D) optical phase codes for object authentication using the random forest classifier. A simple 3D optical phase code (OPC) is generated by combining multiple diffusers and glass slides. This tag is then placed on a quick-response (QR) code, which is a barcode capable of storing information and can be scanned under non-uniform illumination conditions, rotation, and slight degradation. A coherent light source illuminates the OPC and the transmitted light is captured by a CCD to record the unique signature. Feature extraction on the signature is performed and inputted into a pre-trained random-forest classifier for authentication.

  3. Variation of Annual ET Determined from Water Budgets Across Rural Southeastern Basins Differing in Forest Types

    NASA Astrophysics Data System (ADS)

    Younger, S. E.; Jackson, C. R.

    2017-12-01

    In the Southeastern United States, evapotranspiration (ET) typically accounts for 60-70% of precipitation. Watershed and plot scale experiments show that evergreen forests have higher ET rates than hardwood forests and pastures. However, some plot experiments indicate that certain hardwood species have higher ET than paired evergreens. The complexity of factors influencing ET in mixed land cover watersheds makes identifying the relative influences difficult. Previous watershed scale studies have relied on regression to understand the influences or low flow analysis to indicate growing season differences among watersheds. Existing studies in the southeast investigating ET rates for watersheds with multiple forest cover types have failed to identify a significant forest type effect, but these studies acknowledge small sample sizes. Trends of decreasing streamflow have been recognized in the region and are generally attributed to five key factors, 1.) influences from multiple droughts, 2.) changes in distribution of precipitation, 3.) reforestation of agricultural land, 4.) increasing consumptive uses, or 5.) a combination of these and other factors. This study attempts to address the influence of forest type on long term average annual streamflow and on stream low flows. Long term annual ET rates were calculated as ET = P-Q for 46 USGS gaged basins with daily data for the 1982 - 2014 water years, >40% forest cover, and no large reservoirs. Land cover data was regressed against ET to describe the relationship between each of the forest types in the National Land Cover Database. Regression analysis indicates evergreen land cover has a positive relationship with ET while deciduous and total forest have a negative relationship with ET. Low flow analysis indicates low flows tend to be lower in watersheds with more evergreen cover, and that low flows increase with increasing deciduous cover, although these relationships are noisy. 
This work suggests considering forest cover type improves understanding of watershed scale ET at annual and seasonal levels which is consistent with historic paired watershed experiments and some plot scale data.
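
    The water-budget estimate ET = P - Q and the regression of ET on land cover can be sketched as follows. This is a minimal illustration that assumes long-term storage change is negligible and uses a single-predictor least-squares fit; the basin values are invented for the example and are not data from the study.

```python
def annual_et(precip_mm, runoff_mm):
    """Long-term water-budget ET, assuming negligible change in storage."""
    return precip_mm - runoff_mm


def ols_slope_intercept(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical basins: (precipitation mm/yr, streamflow mm/yr, evergreen cover fraction)
basins = [(1300.0, 450.0, 0.10), (1250.0, 380.0, 0.25), (1400.0, 430.0, 0.40)]
et = [annual_et(p, q) for p, q, _ in basins]
evergreen = [frac for _, _, frac in basins]
slope, intercept = ols_slope_intercept(evergreen, et)
```

    A positive slope in such a fit would correspond to the positive evergreen-ET relationship the abstract reports.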

  4. CW-SSIM kernel based random forest for image classification

    NASA Astrophysics Data System (ADS)

    Fan, Guangzhe; Wang, Zhou; Wang, Jiheng

    2010-07-01

    The complex wavelet structural similarity (CW-SSIM) index has been proposed as a powerful image similarity metric that is robust to translation, scaling and rotation of images, but how to employ it in image classification applications has not been deeply investigated. In this paper, we incorporate CW-SSIM as a kernel function into a random forest learning algorithm. This leads to a novel image classification approach that does not require a feature extraction or dimension reduction stage at the front end. We use hand-written digit recognition as an example to demonstrate our algorithm. We compare the performance of the proposed approach with random forest learning based on other kernels, including the widely adopted Gaussian and the inner product kernels. Empirical evidence shows that the proposed method is superior in its classification power. We also compared our proposed approach with the direct random forest method without kernel and the popular kernel-learning method support vector machine. Our test results based on both simulated and real-world data suggest that the proposed approach outperforms traditional methods without the feature selection procedure.

  5. Providing the Fire Risk Map in Forest Area Using a Geographically Weighted Regression Model with Gaussin Kernel and Modis Images, a Case Study: Golestan Province

    NASA Astrophysics Data System (ADS)

    Shah-Heydari pour, A.; Pahlavani, P.; Bigdeli, B.

    2017-09-01

    With the industrialization of cities and the apparent increase in pollutants and greenhouse gases, the importance of forests as the natural lungs of the earth in removing these pollutants is felt more than ever. Each year, a large part of the forests is destroyed because fires are not acted on in time. Knowing which areas are at high risk of fire, and equipping these areas by constructing access routes and allocating fire-fighting equipment, can help to prevent the destruction of the forest. In this research, the fire risk of the region was forecast and a risk map was produced from MODIS images by applying a geographically weighted regression model with a Gaussian kernel, and ordinary least squares, to the parameters that affect forest fire: distance from residential areas, distance from the river, distance from the road, height, slope, aspect, soil type, land use, average temperature, wind speed, and rainfall. After evaluation, the geographically weighted regression model with a Gaussian kernel was found to forecast 93.4% of all fire points correctly, whereas the ordinary least squares method forecast only 66% of the fire points correctly.

  6. A model-based approach to estimating forest area

    Treesearch

    Ronald E. McRoberts

    2006-01-01

    A logistic regression model based on forest inventory plot data and transformations of Landsat Thematic Mapper satellite imagery was used to predict the probability of forest for 15 study areas in Indiana, USA, and 15 in Minnesota, USA. Within each study area, model-based estimates of forest area were obtained for circular areas with radii of 5 km, 10 km, and 15 km and...

  7. Effects of lidar pulse density and sample size on a model-assisted approach to estimate forest inventory variables

    Treesearch

    Jacob Strunk; Hailemariam Temesgen; Hans-Erik Andersen; James P. Flewelling; Lisa Madsen

    2012-01-01

    Using lidar in an area-based model-assisted approach to forest inventory has the potential to increase estimation precision for some forest inventory variables. This study documents the bias and precision of a model-assisted (regression estimation) approach to forest inventory with lidar-derived auxiliary variables relative to lidar pulse density and the number of...

  8. Sediment rating curve & Co. - a contest of prediction methods

    NASA Astrophysics Data System (ADS)

    Francke, T.; Zimmermann, A.

    2012-04-01

    In spite of the recent technological progress in sediment monitoring, often the calculation of sediment yield (SSY) still relies on intermittent measurements because of the use of historic records, instrument-failure in continuous recording or financial constraints. Therefore, available measurements are usually inter- and even extrapolated using the sediment rating curve approach, which uses continuously available discharge data to predict sediment concentrations. Extending this idea by further aspects like the inclusion of other predictors (e.g. rainfall, discharge-characteristics, etc.), or the consideration of prediction uncertainty led to a variety of new methods. Now, with approaches such as Fuzzy Logic, Artificial Neural Networks, Tree-based regression, GLMs, etc., the user is left to decide which method to apply. Trying multiple approaches is usually not an option, as considerable effort and expertise may be needed for their application. To establish a helpful guideline in selecting the most appropriate method for SSY-computation, we initiated a study to compare and rank available methods. Depending on problem attributes like hydrological and sediment regime, number of samples, sampling scheme, and availability of ancillary predictors, the performance of different methods is compared. Our expertise allowed us to "register" Random Forests, Quantile Regression Forests and GLMs for the contest. To include many different methods and ensure their sophisticated use we invite scientists that are willing to benchmark their favourite method(s) with us. The more diverse the participating methods are, the more exciting the contest will be.

  9. Estimating tree biomass regressions and their error, proceedings of the workshop on tree biomass regression functions and their contribution to the error

    Treesearch

    Eric H. Wharton; Tiberius Cunia

    1987-01-01

    Proceedings of a workshop co-sponsored by the USDA Forest Service, the State University of New York, and the Society of American Foresters. Presented were papers on the methodology of sample tree selection, tree biomass measurement, construction of biomass tables and estimation of their error, and combining the error of biomass tables with that of the sample plots or...

  10. Improved high-dimensional prediction with Random Forests by the use of co-data.

    PubMed

    Te Beest, Dennis E; Mes, Steven W; Wilting, Saskia M; Brakenhoff, Ruud H; van de Wiel, Mark A

    2017-12-28

    Prediction in high dimensional settings is difficult due to the large number of variables relative to the sample size. We demonstrate how auxiliary 'co-data' can be used to improve the performance of a Random Forest in such a setting. Co-data are incorporated in the Random Forest by replacing the uniform sampling probabilities that are used to draw candidate variables by co-data moderated sampling probabilities. Co-data here are defined as any type information that is available on the variables of the primary data, but does not use its response labels. These moderated sampling probabilities are, inspired by empirical Bayes, learned from the data at hand. We demonstrate the co-data moderated Random Forest (CoRF) with two examples. In the first example we aim to predict the presence of a lymph node metastasis with gene expression data. We demonstrate how a set of external p-values, a gene signature, and the correlation between gene expression and DNA copy number can improve the predictive performance. In the second example we demonstrate how the prediction of cervical (pre-)cancer with methylation data can be improved by including the location of the probe relative to the known CpG islands, the number of CpG sites targeted by a probe, and a set of p-values from a related study. The proposed method is able to utilize auxiliary co-data to improve the performance of a Random Forest.
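
    The core mechanism described here, replacing uniform candidate-variable sampling with moderated sampling probabilities, can be sketched as below. This shows only the sampling step at a single tree node; the weights are made up, the function name is hypothetical, and the empirical-Bayes step that learns the probabilities from co-data is not reproduced.

```python
import random


def draw_candidate_variables(weights, mtry, rng):
    """Draw `mtry` distinct variable indices, each draw proportional to its weight.

    With equal weights this reduces to the uniform sampling of a standard
    random forest; unequal weights favor variables supported by co-data.
    """
    positive = sum(1 for w in weights if w > 0)
    if mtry > positive:
        raise ValueError("mtry exceeds the number of variables with positive weight")
    remaining = list(range(len(weights)))
    remaining_weights = [float(w) for w in weights]
    chosen = []
    for _ in range(mtry):
        # Weighted draw of one position, then remove it (sampling without replacement).
        pos = rng.choices(range(len(remaining)), weights=remaining_weights, k=1)[0]
        chosen.append(remaining.pop(pos))
        remaining_weights.pop(pos)
    return chosen


# Hypothetical sampling weights for six variables (e.g. informed by external p-values).
weights = [0.30, 0.05, 0.05, 0.40, 0.10, 0.10]
rng = random.Random(0)
candidates = draw_candidate_variables(weights, mtry=3, rng=rng)
```

    Drawing sequentially and removing each chosen variable keeps the candidates distinct, which matches how candidate sets are usually formed at a node.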

  11. Source localization in an ocean waveguide using supervised machine learning.

    PubMed

    Niu, Haiqiang; Reeves, Emma; Gerstoft, Peter

    2017-09-01

    Source localization in ocean acoustics is posed as a machine learning problem in which data-driven methods learn source ranges directly from observed acoustic data. The pressure received by a vertical linear array is preprocessed by constructing a normalized sample covariance matrix and used as the input for three machine learning methods: feed-forward neural networks (FNN), support vector machines (SVM), and random forests (RF). The range estimation problem is solved both as a classification problem and as a regression problem by these three machine learning algorithms. The results of range estimation for the Noise09 experiment are compared for FNN, SVM, RF, and conventional matched-field processing and demonstrate the potential of machine learning for underwater source localization.
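
    The preprocessing step can be illustrated with a short sketch. It assumes that "normalized" means each snapshot of complex array pressures is scaled to unit norm before the outer products are averaged; the abstract does not spell out the normalization, so treat this as one plausible reading, with a hypothetical function name and toy data.

```python
import math


def normalized_sample_covariance(snapshots):
    """Average of u u^H over snapshots, where u is the unit-norm pressure vector."""
    n_sensors = len(snapshots[0])
    scm = [[0j] * n_sensors for _ in range(n_sensors)]
    for pressures in snapshots:
        norm = math.sqrt(sum(abs(v) ** 2 for v in pressures))
        if norm == 0.0:
            raise ValueError("cannot normalize an all-zero snapshot")
        unit = [v / norm for v in pressures]
        for i in range(n_sensors):
            for j in range(n_sensors):
                scm[i][j] += unit[i] * unit[j].conjugate()
    count = len(snapshots)
    return [[value / count for value in row] for row in scm]


# Two toy snapshots from a hypothetical two-element array.
snapshots = [[1 + 0j, 1j], [2 + 0j, 0j]]
scm = normalized_sample_covariance(snapshots)
```

    The resulting matrix is Hermitian with unit trace, so its entries (for example the real and imaginary parts of the upper triangle) can be flattened into a fixed-length feature vector for a classifier or regressor.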

  12. Predictive modeling of cardiovascular complications in incident hemodialysis patients.

    PubMed

    Ion Titapiccolo, J; Ferrario, M; Barbieri, C; Marcelli, D; Mari, F; Gatti, E; Cerutti, S; Smyth, P; Signorini, M G

    2012-01-01

    The administration of hemodialysis (HD) treatment leads to the continuous collection of a vast quantity of medical data. Many variables related to the patient health status, to the treatment, and to dialyzer settings can be recorded and stored at each treatment session. In this study a dataset of 42 variables and 1526 patients extracted from the Fresenius Medical Care database EuCliD was used to develop and apply a random forest predictive model for the prediction of cardiovascular events in the first year of HD treatment. A ridge-lasso logistic regression algorithm was then applied to the subset of variables mostly involved in the prediction model to get insights in the mechanisms underlying the incidence of cardiovascular complications in this high risk population of patients.

  13. Prediction of Nursing Workload in Hospital.

    PubMed

    Fiebig, Madlen; Hunstein, Dirk; Bartholomeyczik, Sabine

    2018-01-01

    A dissertation project at the Witten/Herdecke University [1] is investigating which (nursing sensitive) patient characteristics are suitable for predicting a higher or lower degree of nursing workload. For this research project four predictive modelling methods were selected. In a first step, SUPPORT VECTOR MACHINE, RANDOM FOREST, and GRADIENT BOOSTING were used to identify potential predictors from the nursing sensitive patient characteristics. The results were compared via FEATURE IMPORTANCE. To predict nursing workload the predictors identified in step 1 were modelled using MULTINOMIAL LOGISTIC REGRESSION. First results from the data mining process will be presented. A prognostic determination of nursing workload can be used not only as a basis for human resource planning in hospital, but also to respond to health policy issues.

  14. Detecting understory plant invasion in urban forests using LiDAR

    NASA Astrophysics Data System (ADS)

    Singh, Kunwar K.; Davis, Amy J.; Meentemeyer, Ross K.

    2015-06-01

    Light detection and ranging (LiDAR) data are increasingly used to measure structural characteristics of urban forests but are rarely used to detect the growing problem of exotic understory plant invaders. We explored the merits of using LiDAR-derived metrics alone and through integration with spectral data to detect the spatial distribution of the exotic understory plant Ligustrum sinense, a rapidly spreading invader in the urbanizing region of Charlotte, North Carolina, USA. We analyzed regional-scale L. sinense occurrence data collected over the course of three years with LiDAR-derived metrics of forest structure that were categorized into the following groups: overstory, understory, topography, and overall vegetation characteristics, and IKONOS spectral features - optical. Using random forest (RF) and logistic regression (LR) classifiers, we assessed the relative contributions of LiDAR and IKONOS derived variables to the detection of L. sinense. We compared the top performing models developed for a smaller, nested experimental extent using RF and LR classifiers, and used the best overall model to produce a predictive map of the spatial distribution of L. sinense across our county-wide study extent. RF classification of LiDAR-derived topography metrics produced the highest mapping accuracy estimates, outperforming IKONOS data by 17.5% and the integration of LiDAR and IKONOS data by 5.3%. The top performing model from the RF classifier produced the highest kappa of 64.8%, improving on the parsimonious LR model kappa by 31.1% with a moderate gain of 6.2% over the county extent model. Our results demonstrate the superiority of LiDAR-derived metrics over spectral data and fusion of LiDAR and spectral data for accurately mapping the spatial distribution of the forest understory invader L. sinense.

  15. The influence of tree morphology on stemflow generation in a tropical lowland rainforest

    NASA Astrophysics Data System (ADS)

    Uber, Magdalena; Levia, Delphis F.; Zimmermann, Beate; Zimmermann, Alexander

    2014-05-01

    Even though stemflow usually accounts for only a small proportion of rainfall, it is an important point source of water and ion input to forest floors and may, for instance, influence soil moisture patterns and groundwater recharge. Previous studies showed that the generation of stemflow depends on a multitude of meteorological and biological factors. Interestingly, despite the tremendous progress in stemflow research during the last decades it is still largely unknown which combination of tree characteristics determines stemflow volumes in species-rich tropical forests. This knowledge gap motivated us to analyse the influence of tree characteristics on stemflow volumes in a 1 hectare plot located in a Panamanian lowland rainforest. Our study comprised stemflow measurements in six randomly selected 10 m by 10 m subplots. In each subplot we measured stemflow of all trees with a diameter at breast height (DBH) > 5 cm on an event-basis for a period of six weeks. Additionally, we identified all tree species and determined a set of tree characteristics including DBH, crown diameter, bark roughness, bark furrowing, epiphyte coverage, tree architecture, stem inclination, and crown position. During the sampling period, we collected 985 L of stemflow (0.98 % of total rainfall). Based on regression analyses and comparisons among plant functional groups we show that palms were most efficient in yielding stemflow due to their large inclined fronds. Trees with large emergent crowns also produced relatively large amounts of stemflow. Due to their abundance, understory trees contribute much to stemflow yield not on individual but on the plot scale. Even though parameters such as crown diameter, branch inclination and position of the crown influence stemflow generation to some extent, these parameters explain less than 30 % of the variation in stemflow volumes. 
In contrast to published results from temperate forests, we did not detect a negative correlation between bark roughness and stemflow volume. This is because other parameters such as crown diameter obscured this relationship. Due to multicollinearity and poor correlations between single tree characteristics with stemflow volume, an assessment of stemflow volumes based on forest characteristics remains cumbersome in highly diverse ecosystems. Instead of relying on regression relationships, we therefore advocate a total sampling of trees in several plots to determine stand-scale stemflow yield in tropical forests.

  16. Differential privacy-based evaporative cooling feature selection and classification with relief-F and random forests.

    PubMed

    Le, Trang T; Simmons, W Kyle; Misaki, Masaya; Bodurka, Jerzy; White, Bill C; Savitz, Jonathan; McKinney, Brett A

    2017-09-15

    Classification of individuals into disease or clinical categories from high-dimensional biological data with low prediction error is an important challenge of statistical learning in bioinformatics. Feature selection can improve classification accuracy but must be incorporated carefully into cross-validation to avoid overfitting. Recently, feature selection methods based on differential privacy, such as differentially private random forests and reusable holdout sets, have been proposed. However, for domains such as bioinformatics, where the number of features is much larger than the number of observations p≫n , these differential privacy methods are susceptible to overfitting. We introduce private Evaporative Cooling, a stochastic privacy-preserving machine learning algorithm that uses Relief-F for feature selection and random forest for privacy preserving classification that also prevents overfitting. We relate the privacy-preserving threshold mechanism to a thermodynamic Maxwell-Boltzmann distribution, where the temperature represents the privacy threshold. We use the thermal statistical physics concept of Evaporative Cooling of atomic gases to perform backward stepwise privacy-preserving feature selection. On simulated data with main effects and statistical interactions, we compare accuracies on holdout and validation sets for three privacy-preserving methods: the reusable holdout, reusable holdout with random forest, and private Evaporative Cooling, which uses Relief-F feature selection and random forest classification. In simulations where interactions exist between attributes, private Evaporative Cooling provides higher classification accuracy without overfitting based on an independent validation set. In simulations without interactions, thresholdout with random forest and private Evaporative Cooling give comparable accuracies. We also apply these privacy methods to human brain resting-state fMRI data from a study of major depressive disorder. 
Code available at http://insilico.utulsa.edu/software/privateEC . Contact: brett-mckinney@utulsa.edu. Supplementary data are available at Bioinformatics online. © The Author (2017). Published by Oxford University Press.

  17. Breeding value accuracy estimates for growth traits using random regression and multi-trait models in Nelore cattle.

    PubMed

    Boligon, A A; Baldi, F; Mercadante, M E Z; Lobo, R B; Pereira, R J; Albuquerque, L G

    2011-06-28

    We quantified the potential increase in accuracy of expected breeding value for weights of Nelore cattle, from birth to mature age, using multi-trait and random regression models on Legendre polynomials and B-spline functions. A total of 87,712 weight records from 8144 females were used, recorded every three months from birth to mature age from the Nelore Brazil Program. For random regression analyses, all female weight records from birth to eight years of age (data set I) were considered. From this general data set, a subset was created (data set II), which included only nine weight records: at birth, weaning, 365 and 550 days of age, and 2, 3, 4, 5, and 6 years of age. Data set II was analyzed using random regression and multi-trait models. The model of analysis included the contemporary group as fixed effects and age of dam as a linear and quadratic covariable. In the random regression analyses, average growth trends were modeled using a cubic regression on orthogonal polynomials of age. Residual variances were modeled by a step function with five classes. Legendre polynomials of fourth and sixth order were utilized to model the direct genetic and animal permanent environmental effects, respectively, while third-order Legendre polynomials were considered for maternal genetic and maternal permanent environmental effects. Quadratic polynomials were applied to model all random effects in random regression models on B-spline functions. Direct genetic and animal permanent environmental effects were modeled using three segments or five coefficients, and genetic maternal and maternal permanent environmental effects were modeled with one segment or three coefficients in the random regression models on B-spline functions. For both data sets (I and II), animals ranked differently according to expected breeding value obtained by random regression or multi-trait models. 
With random regression models, the highest gains in accuracy were obtained at ages with a low number of weight records. The results indicate that random regression models provide more accurate expected breeding values than the traditionally finite multi-trait models. Thus, higher genetic responses are expected for beef cattle growth traits by replacing a multi-trait model with random regression models for genetic evaluation. B-spline functions could be applied as an alternative to Legendre polynomials to model covariance functions for weights from birth to mature age.

  18. [Prediction and spatial distribution of recruitment trees of natural secondary forest based on geographically weighted Poisson model].

    PubMed

    Zhang, Ling Yu; Liu, Zhao Gang

    2017-12-01

    Based on the data collected from 108 permanent plots of the forest resources survey in Maoershan Experimental Forest Farm during 2004-2016, this study investigated the spatial distribution of recruitment trees in natural secondary forest by global Poisson regression and geographically weighted Poisson regression (GWPR) with four bandwidths of 2.5, 5, 10 and 15 km. The simulation effects of the 5 regressions and the factors influencing the recruitment trees in stands were analyzed, and the spatial autocorrelation of the regression residuals was described at global and local levels using Moran's I. The results showed that the spatial distribution of the number of natural secondary forest recruitment was significantly influenced by stands and topographic factors, especially average DBH. The GWPR model with small scale (2.5 km) had high accuracy of model fitting, a large range of model parameter estimates was generated, and the localized spatial distribution effect of the model parameters was obtained. The GWPR model at small scale (2.5 and 5 km) produced a small range of model residuals, and the stability of the model was improved. The global spatial auto-correlation of the GWPR model residual at the small scale (2.5 km) was the lowest, and the local spatial auto-correlation was significantly reduced, in which an ideal spatial distribution pattern of small clusters with different observations was formed. The local model at small scale (2.5 km) was much better than the global model in the simulation effect on the spatial distribution of recruitment tree number.
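
    The global Moran's I statistic used here to describe residual autocorrelation can be sketched as follows. The example assumes a simple binary neighbor matrix along a line of four plots; the weights and residual values are invented, and the study's own weighting scheme is not stated in the abstract.

```python
def morans_i(values, weights):
    """Global Moran's I for values x_i and spatial weights w_ij.

    I = (n / sum_ij w_ij) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2
    """
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    cross = sum(
        weights[i][j] * deviations[i] * deviations[j]
        for i in range(n)
        for j in range(n)
    )
    variance_sum = sum(d * d for d in deviations)
    return (n / total_weight) * (cross / variance_sum)


# Four plots on a line; each plot neighbors the adjacent plots (hypothetical layout).
neighbors = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
smooth_residuals = [1.0, 2.0, 3.0, 4.0]       # similar values sit next to each other
alternating_residuals = [1.0, -1.0, 1.0, -1.0]  # neighbors always differ
i_smooth = morans_i(smooth_residuals, neighbors)
i_alternating = morans_i(alternating_residuals, neighbors)
```

    Values near zero indicate little spatial structure left in the residuals, which is the outcome the abstract describes for the 2.5 km bandwidth.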

  19. What does it take to get family forest owners to enroll in a forest stewardship-type program?

    Treesearch

    Michael A. Kilgore; Stephanie A. Snyder; Joseph Schertz; Steven J. Taff

    2008-01-01

    We estimated the probability of enrollment and factors influencing participation in a forest stewardship-type program, Minnesota's Sustainable Forest Incentives Act, using data from a mail survey of over 1000 randomly-selected Minnesota family forest owners. Of the 15 variables tested, only five were significant predictors of a landowner's interest in...

  20. Mapping forest vegetation for the western United States using modified random forests imputation of FIA forest plots

    Treesearch

    Karin Riley; Isaac C. Grenfell; Mark A. Finney

    2016-01-01

    Maps of the number, size, and species of trees in forests across the western United States are desirable for many applications such as estimating terrestrial carbon resources, predicting tree mortality following wildfires, and for forest inventory. However, detailed mapping of trees for large areas is not feasible with current technologies, but statistical...

  1. Global Forest Canopy Height Maps Validation and Calibration for The Potential of Forest Biomass Estimation in The Southern United States

    NASA Astrophysics Data System (ADS)

    Ku, N. W.; Popescu, S. C.

    2015-12-01

    In the past few years, three global forest canopy height maps have been released. Lefsky (2010) first utilized the Geoscience Laser Altimeter System (GLAS) on the Ice, Cloud and land Elevation Satellite (ICESat) and Moderate Resolution Imaging Spectroradiometer (MODIS) data to generate a global forest canopy height map in 2010. Simard et al. (2011) integrated GLAS data and other ancillary variables, such as MODIS, Shuttle Radar Topography Mission (SRTM), and climatic data, to generate another global forest canopy height map in 2011. Los et al. (2012) also used GLAS data to create a vegetation height map in 2012. Several studies attempted to compare these global height maps to other sources of data. Bolton et al. (2013) concluded that Simard's forest canopy height map has strong agreement with airborne lidar derived heights. The Los et al. map is a coarse spatial resolution vegetation height map with a horizontal resolution of 0.5 decimal degrees, around 50 km in the US, which is not suitable for the purpose of our research. Thus, Simard's global forest canopy height map is the primary map for this research study. The main objectives of this research were to validate and calibrate Simard's map with airborne lidar data and other ancillary variables in the southern United States. The airborne lidar data was collected between 2010 and 2012 from: (1) NASA LiDAR, Hyperspectral & Thermal Image (G-LiHT) program; (2) National Ecological Observatory Network's (NEON) prototype data sharing program; (3) NSF Open Topography Facility; and (4) the Department of Ecosystem Science and Management at Texas A&M University. The airborne lidar study areas also cover a wide variety of vegetation types across the southern US. The airborne lidar data is post-processed to generate lidar-derived metrics and assigned to four different classes of point cloud data. The four classes of point cloud data are the data with ground points, above 1 m, above 3 m, and above 5 m. 
The root mean square error (RMSE) and coefficient of determination (R2) are used for examining the discrepancies of the canopy heights between the airborne lidar-derived metrics and global forest canopy height map, and the regression and random forest approaches are used to calibrate the global forest canopy height map. In summary, the research shows a calibrated forest canopy height map of the southern US.
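
    The two discrepancy measures named above can be written out in a few lines. This is a minimal sketch assuming paired canopy heights in metres, with lidar-derived heights treated as the reference; the numbers are illustrative only and do not come from the study.

```python
import math


def rmse(reference, predicted):
    """Root mean square error between paired reference and predicted values."""
    squared = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(reference, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((r - p) ** 2 for r, p in zip(reference, predicted))
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    return 1.0 - ss_res / ss_tot


# Hypothetical paired heights (m): airborne lidar versus global map values.
lidar_heights = [10.0, 20.0, 30.0]
map_heights = [12.0, 18.0, 33.0]
height_rmse = rmse(lidar_heights, map_heights)
height_r2 = r_squared(lidar_heights, map_heights)
```

    Computing both values before and after calibration gives a direct way to see how much a regression or random forest correction reduces the discrepancy.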

  2. Wildfire suppression cost forecasts from the US Forest Service

    Treesearch

    Karen L. Abt; Jeffrey P. Prestemon; Krista M. Gebert

    2009-01-01

    The US Forest Service and other land-management agencies seek better tools for anticipating future expenditures for wildfire suppression. We developed regression models for forecasting US Forest Service suppression spending at 1-, 2-, and 3-year lead times. We compared these models to another readily available forecast model, the 10-year moving average model,...

  3. Developing Data-driven models for quantifying Cochlodinium polykrikoides in Coastal Waters

    NASA Astrophysics Data System (ADS)

    Kwon, Yongsung; Jang, Eunna; Im, Jungho; Baek, Seungho; Park, Yongeun; Cho, Kyunghwa

    2017-04-01

    Harmful algal blooms are a worldwide problem because they pose serious dangers to human health and aquatic ecosystems. In particular, fish-killing red tide blooms of the dinoflagellate Cochlodinium polykrikoides (C. polykrikoides) have caused critical damage to mariculture in Korean coastal waters. In this work, multiple linear regression (MLR), regression tree (RT), and random forest (RF) models were constructed and applied to estimate C. polykrikoides blooms in coastal waters. Five different types of input dataset were used to test the performance of the three models. To train and validate the three models, observed numbers of C. polykrikoides cells from the National Institute of Fisheries Science (NIFS) and remote sensing reflectance data from Geostationary Ocean Color Imager (GOCI) images for the 3 years from 2013 to 2015 were used. The RT model showed the best prediction performance when 4 bands and 3 band ratios were used together as input data. Results obtained from iterative model development with randomly chosen input data indicated that the recognition of patterns in training data caused a variation in prediction performance. This work provides useful tools for reliably estimating the number of C. polykrikoides cells from a reasonable input reflectance dataset in coastal waters. It is expected that the RT model can be easily accessed and used by administrators and decision-makers working with coastal waters.

  4. Land cover in the Guayas Basin using SAR images from low resolution ASAR Global mode to high resolution Sentinel-1 images

    NASA Astrophysics Data System (ADS)

    Bourrel, Luc; Brodu, Nicolas; Frappart, Frédéric

    2016-04-01

    Remotely sensed images allow frequent monitoring of land cover variations at regional and global scales. The recently launched Sentinel-1 satellite offers global coverage of land areas at unprecedented spatial (20 m) and temporal (6 days at the Equator) resolution. We propose here to compare the performance of commonly used supervised classification techniques (i.e., k-nearest neighbors, linear and Gaussian support vector machines, naive Bayes, linear and quadratic discriminant analyses, adaptive boosting, logit regression, ridge regression with one-vs-one voting, random forest, extremely randomized trees) for land cover applications in the Guayas Basin, the largest river basin of the Pacific coast of Ecuador (area ~32,000 km²). The reason for this choice is the importance of this region to the Ecuadorian economy: its watershed represents 13% of the total area of Ecuador and is home to 40% of the Ecuadorian population. It also corresponds to the most productive region of Ecuador for agriculture and aquaculture. Fifty percent of the country's shrimp farming production comes from this watershed and, together with agriculture, represents the largest source of revenue of the country. Similar comparisons are also performed using ENVISAT ASAR images acquired in global mode (1 km spatial resolution). The accuracy of the results will be assessed using a land cover map derived from multi-spectral images.

  5. Detecting spatio-temporal changes in agricultural land use in Heilongjiang province, China using MODIS time-series data and a random forest regression model

    NASA Astrophysics Data System (ADS)

    Hu, Q.; Friedl, M. A.; Wu, W.

    2017-12-01

    Accurate and timely information regarding the spatial distribution of crop types and their changes is essential for acreage surveys, yield estimation, water management, and agricultural production decision-making. In recent years, increasing population, dietary shifts and climate change have driven drastic changes in China's agricultural land use. However, no maps are currently available that document the spatial and temporal patterns of these agricultural land use changes. Because of its short revisit period, rich spectral bands and global coverage, MODIS time series data has been shown to have great potential for detecting the seasonal dynamics of different crop types. However, its inherently coarse spatial resolution limits the accuracy with which crops can be identified from MODIS in regions with small fields or complex agricultural landscapes. To evaluate this more carefully and specifically understand the strengths and weaknesses of MODIS data for crop-type mapping, we used MODIS time-series imagery to map the sub-pixel fractional crop area for four major crop types (rice, corn, soybean and wheat) at 500-m spatial resolution for Heilongjiang province, one of the most important grain-production regions in China where recent agricultural land use change has been rapid and pronounced. To do this, a random forest regression (RF-g) model was constructed to estimate the percentage of each sub-pixel crop type in 2006, 2011 and 2016. Crop type maps generated through expert visual interpretation of high spatial resolution images (i.e., Landsat and SPOT data) were used to calibrate the regression model. Five different time series of vegetation indices (155 features) derived from different spectral channels of MODIS land surface reflectance (MOD09A1) data were used as candidate features for the RF-g model. An out-of-bag strategy and backward elimination approach was applied to select the optimal spectra-temporal feature subset for each crop type. The resulting crop maps were assessed in two ways: (1) wall-to-wall pixel comparison with corresponding high spatial resolution reference maps; and (2) county-level comparison with census data. Based on these derived maps, changes in crop type, total area, and spatial patterns of change in Heilongjiang province during 2006-2016 were analyzed.

  6. Quantifying scaling effects on satellite-derived forest area estimates for the conterminous USA

    Treesearch

    Daolan Zheng; L.S. Heath; M.J. Ducey; J.E. Smith

    2009-01-01

    We quantified the scaling effects on forest area estimates for the conterminous USA using regression analysis and the National Land Cover Dataset 30 m satellite-derived maps in 2001 and 1992. The original data were aggregated to: (1) broad cover types (forest vs. non-forest); and (2) coarser resolutions (1 km and 10 km). Standard errors of the model estimates were 2.3%...

  7. The Random Forests Statistical Technique: An Examination of Its Value for the Study of Reading

    ERIC Educational Resources Information Center

    Matsuki, Kazunaga; Kuperman, Victor; Van Dyke, Julie A.

    2016-01-01

    Studies investigating individual differences in reading ability often involve data sets containing a large number of collinear predictors and a small number of observations. In this article, we discuss the method of Random Forests and demonstrate its suitability for addressing the statistical concerns raised by such data sets. The method is…

  8. Random location of fuel treatments in wildland community interfaces: a percolation approach

    Treesearch

    Michael Bevers; Philip N. Omi; John G. Hof

    2004-01-01

    We explore the use of spatially correlated random treatments to reduce fuels in landscape patterns that appear somewhat natural while forming fully connected fuelbreaks between wildland forests and developed protection zones. From treatment zone maps partitioned into grids of hexagonal forest cells representing potential treatment sites, we selected cells to be treated...

  9. Road Network State Estimation Using Random Forest Ensemble Learning

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hou, Yi; Edara, Praveen; Chang, Yohan

    Network-scale travel time prediction not only enables traffic management centers (TMC) to proactively implement traffic management strategies, but also allows travelers to make informed decisions about route choices between various origins and destinations. In this paper, a random forest estimator was proposed to predict travel time in a network. The estimator was trained using two years of historical travel time data for a case study network in St. Louis, Missouri. Both temporal and spatial effects were considered in the modeling process. The random forest models predicted travel times accurately during both congested and uncongested traffic conditions. The computational times for the models were low, thus useful for real-time traffic management and traveler information applications.

  10. [Simulation of three-dimensional green biomass of urban forests in Shenyang City and the factors affecting the biomass].

    PubMed

    Liu, Chang-Fu; He, Xing-Yuan; Chen, Wei; Zhao, Gui-Ling; Xue, Wen-Duo

    2008-06-01

    Based on the fractal theory of forest growth, stepwise regression was employed to pursue a convenient and efficient method of measuring the three-dimensional green biomass (TGB) of urban forests in small areas. A total of thirteen simulation equations of TGB of urban forests in Shenyang City were derived, with the factors affecting the TGB analyzed. The results showed that the coefficients of determination (R²) of the 13 simulation equations ranged from 0.612 to 0.842. No evident pattern was shown in residual analysis, and the precisions were all higher than 87% (alpha = 0.05) and 83% (alpha = 0.01). The most convenient simulation equation was ln Y = 7.468 + 0.926 ln x1, where Y was the simulated TGB and x1 was basal area at breast height per hectare (SDB). The correlations between the standard regression coefficients of the simulation equations and 16 tree characteristics suggested that SDB was the main factor affecting the TGB of urban forests in Shenyang.
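
    The log-log equation quoted in this record, ln Y = 7.468 + 0.926 ln x1, can be evaluated directly. The sketch below is illustrative only: the function name is hypothetical, and the abstract does not state the units of Y or x1, so the inputs are treated as plain numbers.

```python
import math

INTERCEPT = 7.468  # constant term of the quoted equation
SLOPE = 0.926      # coefficient on ln(x1)


def simulated_tgb(sdb):
    """Return Y from ln Y = 7.468 + 0.926 ln x1, where x1 is the SDB value.

    Units are an assumption left to the caller; the abstract does not give them.
    """
    if sdb <= 0:
        raise ValueError("SDB must be positive for the logarithm to exist")
    return math.exp(INTERCEPT + SLOPE * math.log(sdb))


# Doubling x1 multiplies Y by 2 ** 0.926, i.e. slightly less than doubling it.
ratio = simulated_tgb(20.0) / simulated_tgb(10.0)
print(round(ratio, 3))
```

    Because the slope is below 1, the relationship is slightly sublinear: Y grows a little more slowly than x1.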

  11. Spatial properties of snow cover in the Upper Merced River Basin: implications for a distributed snow measurement network

    NASA Astrophysics Data System (ADS)

    Bouffon, T.; Rice, R.; Bales, R.

    2006-12-01

    The spatial distributions of snow water equivalent (SWE) and snow depth within 1, 4, and 16 km² grid elements around two automated snow pillows in a forested and an open-forested region of the Upper Merced River Basin (2,800 km²) of Yosemite National Park were characterized using field observations and analyzed using binary regression trees. Snow surveys occurred at the forested site during the accumulation and ablation seasons, while at the open-forest site a survey was performed only during the accumulation season. An average of 130 snow depth and 7 snow density measurements were made on each survey, within the 4 km² grid. Snow depth was distributed using binary regression trees and geostatistical methods using the physiographic parameters (e.g. elevation, slope, vegetation, aspect). Results in the forest region indicate that the snow pillow overestimated average SWE within the 1, 4, and 16 km² areas by 34 percent during ablation. During accumulation the snow pillow provided a good estimate of the modeled mean SWE grid value, although it is suspected that the snow pillow was underestimating SWE. At the open-forest site, however, during accumulation, the snow pillow was 28 percent greater than the mean modeled grid element. In addition, the binary regression trees indicate that the independent variables of vegetation, slope, and aspect are the most influential parameters of snow depth distribution. The binary regression tree and multivariate linear regression models explain about 60 percent of the initial variance for snow depth and 80 percent for density, respectively. This short-term study provides motivation and direction for the installation of a distributed snow measurement network to fill the information gap in basin-wide SWE and snow depth measurements. Guided by these results, a distributed snow measurement network was installed in the Fall 2006 at Gin Flat in the Upper Merced River Basin with the specific objective of measuring accumulation and ablation across topographic variables with the aim of providing guidance for future larger scale observation network designs.

  12. Adaptive economic and ecological forest management under risk

    Treesearch

    Joseph Buongiorno; Mo Zhou

    2015-01-01

    Background: Forest managers must deal with inherently stochastic ecological and economic processes. The future growth of trees is uncertain, and so is their value. The randomness of low-impact, high frequency or rare catastrophic shocks in forest growth has significant implications in shaping the mix of tree species and the forest landscape...

  13. Prediction of drug synergy in cancer using ensemble-based machine learning techniques

    NASA Astrophysics Data System (ADS)

    Singh, Harpreet; Rana, Prashant Singh; Singh, Urvinder

    2018-04-01

    Drug synergy prediction plays a significant role in the medical field for inhibiting specific cancer agents. It can be developed as a pre-processing tool for therapeutic successes. Different drug-drug interactions can be examined using the drug synergy score. This requires efficient regression-based machine learning approaches to minimize the prediction errors. Numerous machine learning techniques such as neural networks, support vector machines, random forests, LASSO, Elastic Nets, etc., have been used in the past to meet this requirement. However, these techniques individually do not provide significant accuracy in drug synergy score. Therefore, the primary objective of this paper is to design a neuro-fuzzy-based ensembling approach. To achieve this, nine well-known machine learning techniques have been implemented on the drug synergy data. Based on the accuracy of each model, four techniques with high accuracy are selected to develop an ensemble-based machine learning model. These models are Random forest, Fuzzy Rules Using Genetic Cooperative-Competitive Learning method (GFS.GCCL), Adaptive-Network-Based Fuzzy Inference System (ANFIS) and Dynamic Evolving Neural-Fuzzy Inference System method (DENFIS). Ensembling is achieved by evaluating the biased weighted aggregation (i.e. adding more weight to the model with a higher prediction score) of the data predicted by the selected models. The proposed and existing machine learning techniques have been evaluated on drug synergy score data. The comparative analysis reveals that the proposed method outperforms others in terms of accuracy, root mean square error and coefficient of correlation.
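
    The "biased weighted aggregation" described in this record can be illustrated with a small sketch. The abstract does not give the exact weighting formula, so the version below assumes weights proportional to each model's prediction score, normalized to sum to 1; the function name and all numbers are hypothetical.

```python
def weighted_ensemble(predictions, scores):
    """Combine per-model predictions with weights proportional to model scores.

    predictions: list of equal-length prediction lists, one per model.
    scores: one non-negative score per model (higher means more trusted).
    Assumption: weight_k = score_k / sum(scores); the paper may weight differently.
    """
    if len(predictions) != len(scores):
        raise ValueError("need exactly one score per model")
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must sum to a positive value")
    weights = [s / total for s in scores]
    n_samples = len(predictions[0])
    return [
        sum(w * model_preds[i] for w, model_preds in zip(weights, predictions))
        for i in range(n_samples)
    ]


# Two hypothetical models; the first has three times the score of the second.
combined = weighted_ensemble([[10.0, 20.0], [14.0, 24.0]], [3.0, 1.0])
print(combined)
```

    With these scores the first model receives weight 0.75 and the second 0.25, so the combined predictions sit closer to the first model's output.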

  14. Assessment of Antarctic moss health from multi-sensor UAS imagery with Random Forest Modelling

    NASA Astrophysics Data System (ADS)

    Turner, Darren; Lucieer, Arko; Malenovský, Zbyněk; King, Diana; Robinson, Sharon A.

    2018-06-01

    Moss beds are one of very few terrestrial vegetation types that can be found on the Antarctic continent and as such mapping their extent and monitoring their health is important to environmental managers. Across Antarctica, moss beds are experiencing changes in health as their environment changes. As Antarctic moss beds are spatially fragmented with relatively small extent they require very high resolution remotely sensed imagery to monitor their distribution and dynamics. This study demonstrates that multi-sensor imagery collected by an Unmanned Aircraft System (UAS) provides a novel data source for assessment of moss health. In this study, we train a Random Forest Regression Model (RFM) with long-term field quadrats at a study site in the Windmill Islands, East Antarctica and apply it to UAS RGB and 6-band multispectral imagery, derived vegetation indices, 3D topographic data, and thermal imagery to predict moss health. Our results suggest that moss health, expressed as a percentage between 0 and 100% healthy, can be estimated with a root mean squared error (RMSE) between 7 and 12%. The RFM also quantifies the importance of input variables for moss health estimation showing the multispectral sensor data was important for accurate health prediction, such information being essential for planning future field investigations. The RFM was applied to the entire moss bed, providing an extrapolation of the health assessment across a larger spatial area. With further validation the resulting maps could be used for change detection of moss health across multiple sites and seasons.

  15. A prediction scheme of tropical cyclone frequency based on lasso and random forest

    NASA Astrophysics Data System (ADS)

    Tan, Jinkai; Liu, Hexiang; Li, Mengya; Wang, Jun

    2017-07-01

    This study aims to propose a novel prediction scheme of tropical cyclone frequency (TCF) over the Western North Pacific (WNP). We considered large-scale meteorological factors including the sea surface temperature, sea level pressure, the Niño-3.4 index, the wind shear, the vorticity, the subtropical high, and the sea ice cover, since the chronic change of these factors in the context of climate change would cause a gradual variation of the annual TCF. Specifically, we focus on the correlation between the year-to-year increment of these factors and TCF. The least absolute shrinkage and selection operator (Lasso) method was used for variable selection and dimension reduction from 11 initial predictors. Then, a prediction model based on random forest (RF) was established by using the training samples (1978-2011) for calibration and the testing samples (2012-2016) for validation. The RF model reproduced the major variation and trend of TCF in the calibration period, and also fitted the observed TCF well in the validation period, though there were some deviations. Leave-one-out cross validation of the model showed that most of the predicted TCF values are consistent with the observed TCF, with a high correlation coefficient. A comparison between results of the RF model and the multiple linear regression (MLR) model suggested that the RF is more practical and capable of giving reliable results of TCF prediction over the WNP.
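
    The leave-one-out cross validation mentioned in this record can be sketched as below. To stay self-contained, the sketch swaps in a one-predictor least-squares line as a stand-in for the random forest model; that substitution, the helper names, and the toy data are all assumptions made for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (stand-in model)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def loo_predictions(xs, ys):
    """Predict each sample from a model fitted on all the other samples."""
    preds = []
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(slope * xs[i] + intercept)
    return preds


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


# Toy data (hypothetical): one predictor and a roughly linear response.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
preds = loo_predictions(xs, ys)
r = pearson(preds, ys)
print(round(r, 3))
```

    Each prediction comes from a model that never saw the sample being predicted, so the correlation between `preds` and `ys` reflects out-of-sample agreement rather than goodness of fit.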

  16. Seeing the forest for the trees: utilizing modified random forests imputation of forest plot data for landscape-level analyses

    Treesearch

    Karin L. Riley; Isaac C. Grenfell; Mark A. Finney

    2015-01-01

    Mapping the number, size, and species of trees in forests across the western United States has utility for a number of research endeavors, ranging from estimation of terrestrial carbon resources to tree mortality following wildfires. For landscape fire and forest simulations that use the Forest Vegetation Simulator (FVS), a tree-level dataset, or “tree list”, is a...

  17. High-Resolution Regional Biomass Map of Siberia from Glas, Palsar L-Band Radar and Landsat Vcf Data

    NASA Astrophysics Data System (ADS)

    Sun, G.; Ranson, K.; Montesano, P.; Zhang, Z.; Kharuk, V.

    2015-12-01

    The Arctic-Boreal zone is known to be warming at an accelerated rate relative to other biomes. The taiga or boreal forest covers over 16 × 10^6 km² of Arctic North America, Scandinavia, and Eurasia. A large part of the northern boreal forests is in Russia's Siberia, an area with recent accelerated climate warming. During the last two decades we have been working on characterization of boreal forests in north-central Siberia using field and satellite measurements. We have published results of circumpolar biomass using field plots, airborne (PALS, ACTM) and spaceborne (GLAS) lidar data with ASTER DEM, LANDSAT and MODIS land cover classification, MODIS burned area and WWF's ecoregion map. Researchers from ESA and Russia have also been working on biomass (or growing stock) mapping in Siberia. For example, they developed a pan-boreal growing stock volume map at 1-kilometer scale using hyper-temporal ENVISAT ASAR ScanSAR backscatter data. Using the annual PALSAR mosaics from 2007 to 2010, growing stock volume maps were retrieved based on a supervised random forest regression approach. This method is being used in the ESA/Russia ZAPAS project for Central Siberia biomass mapping. Spatially specific biomass maps of this region at higher resolution are desired for carbon cycle and climate change studies. In this study, our work focused on improving the resolution (50 m) of a biomass map based on PALSAR L-band data and Landsat Vegetation Canopy Fraction products. GLAS data were carefully processed and screened using land cover classification, local slope, and acquisition dates. The biomass at the remaining footprints was estimated using a model developed from field measurements at GLAS footprints. The GLAS biomass samples were then aggregated into 1 Mg/ha bins of biomass, and mean VCF and PALSAR backscatter and textures were calculated for each of these biomass bins. The resulting biomass/signature data were used to train a random forest model for biomass mapping of the entire region from 50°N to 75°N and 80°E to 145°E. The spatial patterns of the new biomass map are much better than those of the previous maps due to spatially specific mapping at high resolution. The uncertainties of the field/GLAS and GLAS/imagery models were investigated using a bootstrap procedure, and the final biomass map was compared with previous maps.

  18. Forest Cover Estimation in Ireland Using Radar Remote Sensing: A Comparative Analysis of Forest Cover Assessment Methodologies.

    PubMed

    Devaney, John; Barrett, Brian; Barrett, Frank; Redmond, John; O'Halloran, John

    2015-01-01

    Quantification of spatial and temporal changes in forest cover is an essential component of forest monitoring programs. Due to its cloud free capability, Synthetic Aperture Radar (SAR) is an ideal source of information on forest dynamics in countries with near-constant cloud-cover. However, few studies have investigated the use of SAR for forest cover estimation in landscapes with highly sparse and fragmented forest cover. In this study, the potential use of L-band SAR for forest cover estimation in two regions (Longford and Sligo) in Ireland is investigated and compared to forest cover estimates derived from three national (Forestry2010, Prime2, National Forest Inventory), one pan-European (Forest Map 2006) and one global forest cover (Global Forest Change) product. Two machine-learning approaches (Random Forests and Extremely Randomised Trees) are evaluated. Both Random Forests and Extremely Randomised Trees classification accuracies were high (98.1-98.5%), with differences between the two classifiers being minimal (<0.5%). Increasing levels of post classification filtering led to a decrease in estimated forest area and an increase in overall accuracy of SAR-derived forest cover maps. All forest cover products were evaluated using an independent validation dataset. For the Longford region, the highest overall accuracy was recorded with the Forestry2010 dataset (97.42%) whereas in Sligo, highest overall accuracy was obtained for the Prime2 dataset (97.43%), although accuracies of SAR-derived forest maps were comparable. Our findings indicate that spaceborne radar could aid inventories in regions with low levels of forest cover in fragmented landscapes. The reduced accuracies observed for the global and pan-continental forest cover maps in comparison to national and SAR-derived forest maps indicate that caution should be exercised when applying these datasets for national reporting.

  19. Forest Cover Estimation in Ireland Using Radar Remote Sensing: A Comparative Analysis of Forest Cover Assessment Methodologies

    PubMed Central

    Devaney, John; Barrett, Brian; Barrett, Frank; Redmond, John; O'Halloran, John

    2015-01-01

    Quantification of spatial and temporal changes in forest cover is an essential component of forest monitoring programs. Due to its cloud free capability, Synthetic Aperture Radar (SAR) is an ideal source of information on forest dynamics in countries with near-constant cloud-cover. However, few studies have investigated the use of SAR for forest cover estimation in landscapes with highly sparse and fragmented forest cover. In this study, the potential use of L-band SAR for forest cover estimation in two regions (Longford and Sligo) in Ireland is investigated and compared to forest cover estimates derived from three national (Forestry2010, Prime2, National Forest Inventory), one pan-European (Forest Map 2006) and one global forest cover (Global Forest Change) product. Two machine-learning approaches (Random Forests and Extremely Randomised Trees) are evaluated. Both Random Forests and Extremely Randomised Trees classification accuracies were high (98.1–98.5%), with differences between the two classifiers being minimal (<0.5%). Increasing levels of post classification filtering led to a decrease in estimated forest area and an increase in overall accuracy of SAR-derived forest cover maps. All forest cover products were evaluated using an independent validation dataset. For the Longford region, the highest overall accuracy was recorded with the Forestry2010 dataset (97.42%) whereas in Sligo, highest overall accuracy was obtained for the Prime2 dataset (97.43%), although accuracies of SAR-derived forest maps were comparable. Our findings indicate that spaceborne radar could aid inventories in regions with low levels of forest cover in fragmented landscapes. The reduced accuracies observed for the global and pan-continental forest cover maps in comparison to national and SAR-derived forest maps indicate that caution should be exercised when applying these datasets for national reporting. PMID:26262681

  20. A stepwise regression tree for nonlinear approximation: applications to estimating subpixel land cover

    USGS Publications Warehouse

    Huang, C.; Townshend, J.R.G.

    2003-01-01

    A stepwise regression tree (SRT) algorithm was developed for approximating complex nonlinear relationships. Based on the regression tree of Breiman et al. (BRT) and a stepwise linear regression (SLR) method, this algorithm represents an improvement over SLR in that it can approximate nonlinear relationships and over BRT in that it gives more realistic predictions. The applicability of this method to estimating subpixel forest was demonstrated using three test data sets, on all of which it gave more accurate predictions than SLR and BRT. SRT also generated more compact trees and performed better than or at least as well as BRT at all 10 equal forest proportion intervals ranging from 0 to 100%. This method is appealing for estimating subpixel land cover over large areas.

  1. Optimized endogenous post-stratification in forest inventories

    Treesearch

    Paul L. Patterson

    2012-01-01

    An example of endogenous post-stratification is the use of remote sensing data with a sample of ground data to build a logistic regression model to predict the probability that a plot is forested and using the predicted probabilities to form categories for post-stratification. An optimized endogenous post-stratified estimator of the proportion of forest has been...

  2. Development of a Methodology for Predicting Forest Area for Large-Area Resource Monitoring

    Treesearch

    William H. Cooke

    2001-01-01

    The U.S. Department of Agriculture, Forest Service, Southern Research Station, appointed a remote-sensing team to develop an image-processing methodology for mapping forest lands over large geographic areas. The team has presented a repeatable methodology, which is based on regression modeling of Advanced Very High Resolution Radiometer (AVHRR) and Landsat Thematic...

  3. Assessing West Virginia NIPF owner preferred forest management assistance topics and delivery methods

    Treesearch

    Daniel J. Magill; Rory F. Fraser; David W. McGill

    2003-01-01

    Four hundred and fourteen non-industrial private forest (NIPF) owners in West Virginia responded to a mail survey questionnaire assessing their forest management assistance topics and delivery methods of interest. Logistic regression was used to analyze 39 independent variables in relation to the dependent variables of wanting a specific topic of forestry assistance or...

  4. An automated ranking platform for machine learning regression models for meat spoilage prediction using multi-spectral imaging and metabolic profiling.

    PubMed

    Estelles-Lopez, Lucia; Ropodi, Athina; Pavlidis, Dimitris; Fotopoulou, Jenny; Gkousari, Christina; Peyrodie, Audrey; Panagou, Efstathios; Nychas, George-John; Mohareb, Fady

    2017-09-01

    Over the past decade, analytical approaches based on vibrational spectroscopy, hyperspectral/multispectral imaging and biomimetic sensors have started gaining popularity as rapid and efficient methods for assessing food quality, safety and authentication, as a sensible alternative to the expensive and time-consuming conventional microbiological techniques. Due to the multi-dimensional nature of the data generated from such analyses, the output needs to be coupled with a suitable statistical approach or machine-learning algorithms before the results can be interpreted. Choosing the optimum pattern recognition or machine learning approach for a given analytical platform is often challenging and involves a comparative analysis between various algorithms in order to achieve the best possible prediction accuracy. In this work, "MeatReg", a web-based application, is presented that automates the procedure of identifying the best machine learning method for comparing data from several analytical techniques, to predict the counts of microorganisms responsible for meat spoilage regardless of the packaging system applied. In particular, up to 7 regression methods were applied: ordinary least squares regression, stepwise linear regression, partial least square regression, principal component regression, support vector regression, random forest and k-nearest neighbours. "MeatReg" was tested with minced beef samples stored under aerobic and modified atmosphere packaging and analysed with electronic nose, HPLC, FT-IR, GC-MS and a multispectral imaging instrument. Populations of total viable count, lactic acid bacteria, pseudomonads, Enterobacteriaceae and B. thermosphacta were predicted. As a result, recommendations were obtained on which analytical platforms are suitable to predict each type of bacteria and which machine learning methods to use in each case. The developed system is accessible via the link: www.sorfml.com. Copyright © 2017 Elsevier Ltd. All rights reserved.

  5. Multiple filters affect tree species assembly in mid-latitude forest communities.

    PubMed

    Kubota, Y; Kusumoto, B; Shiono, T; Ulrich, W

    2018-05-01

    Species assembly patterns of local communities are shaped by the balance between multiple abiotic/biotic filters and dispersal that both select individuals from species pools at the regional scale. Knowledge regarding functional assembly can provide insight into the relative importance of the deterministic and stochastic processes that shape species assembly. We evaluated the hierarchical roles of the α niche and β niches by analyzing the influence of environmental filtering relative to functional traits on geographical patterns of tree species assembly in mid-latitude forests. Using forest plot datasets, we examined the α niche traits (leaf and wood traits) and β niche properties (cold/drought tolerance) of tree species, and tested non-randomness (clustering/over-dispersion) of trait assembly based on null models that assumed two types of species pools related to biogeographical regions. For most plots, species assembly patterns fell within the range of random expectation. However, particularly for cold/drought tolerance-related β niche properties, deviation from randomness was frequently found; non-random clustering was predominant in higher latitudes with harsh climates. Our findings demonstrate that both randomness and non-randomness in trait assembly emerged as a result of the α and β niches, although we suggest the potential role of dispersal processes and/or species equalization through trait similarities in generating the prevalence of randomness. Clustering of β niche traits along latitudinal climatic gradients provides clear evidence of species sorting by filtering particular traits. Our results reveal that multiple filters through functional niches and stochastic processes jointly shape geographical patterns of species assembly across mid-latitude forests.

  6. Mapping Migratory Bird Prevalence Using Remote Sensing Data Fusion

    PubMed Central

    Swatantran, Anu; Dubayah, Ralph; Goetz, Scott; Hofton, Michelle; Betts, Matthew G.; Sun, Mindy; Simard, Marc; Holmes, Richard

    2012-01-01

    Background: Improved maps of species distributions are important for effective management of wildlife under increasing anthropogenic pressures. Recent advances in lidar and radar remote sensing have shown considerable potential for mapping forest structure and habitat characteristics across landscapes. However, their relative efficacies and integrated use in habitat mapping remain largely unexplored. We evaluated the use of lidar, radar and multispectral remote sensing data in predicting multi-year bird detections or prevalence for 8 migratory songbird species in the unfragmented temperate deciduous forests of New Hampshire, USA. Methodology and Principal Findings: A set of 104 predictor variables describing vegetation vertical structure and variability from lidar, phenology from multispectral data and backscatter properties from radar data were derived. We tested the accuracies of these variables in predicting prevalence using Random Forests regression models. All data sets showed more than 30% predictive power with radar models having the lowest and multi-sensor synergy ("fusion") models having highest accuracies. Fusion explained between 54% and 75% variance in prevalence for all the birds considered. Stem density from discrete return lidar and phenology from multispectral data were among the best predictors. Further analysis revealed different relationships between the remote sensing metrics and bird prevalence. Spatial maps of prevalence were consistent with known habitat preferences for the bird species. Conclusion and Significance: Our results highlight the potential of integrating multiple remote sensing data sets using machine-learning methods to improve habitat mapping. Multi-dimensional habitat structure maps such as those generated from this study can significantly advance forest management and ecological research by facilitating fine-scale studies at both stand and landscape level. PMID:22235254

  7. Mapping migratory bird prevalence using remote sensing data fusion.

    PubMed

    Swatantran, Anu; Dubayah, Ralph; Goetz, Scott; Hofton, Michelle; Betts, Matthew G; Sun, Mindy; Simard, Marc; Holmes, Richard

    2012-01-01

    Improved maps of species distributions are important for effective management of wildlife under increasing anthropogenic pressures. Recent advances in lidar and radar remote sensing have shown considerable potential for mapping forest structure and habitat characteristics across landscapes. However, their relative efficacies and integrated use in habitat mapping remain largely unexplored. We evaluated the use of lidar, radar and multispectral remote sensing data in predicting multi-year bird detections or prevalence for 8 migratory songbird species in the unfragmented temperate deciduous forests of New Hampshire, USA. A set of 104 predictor variables describing vegetation vertical structure and variability from lidar, phenology from multispectral data and backscatter properties from radar data were derived. We tested the accuracies of these variables in predicting prevalence using Random Forests regression models. All data sets showed more than 30% predictive power with radar models having the lowest and multi-sensor synergy ("fusion") models having highest accuracies. Fusion explained between 54% and 75% variance in prevalence for all the birds considered. Stem density from discrete return lidar and phenology from multispectral data were among the best predictors. Further analysis revealed different relationships between the remote sensing metrics and bird prevalence. Spatial maps of prevalence were consistent with known habitat preferences for the bird species. Our results highlight the potential of integrating multiple remote sensing data sets using machine-learning methods to improve habitat mapping. Multi-dimensional habitat structure maps such as those generated from this study can significantly advance forest management and ecological research by facilitating fine-scale studies at both stand and landscape level.
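
    The "variance explained" figures quoted in this abstract can be read as a coefficient of determination. The following is a minimal sketch, assuming the common definition 1 - SSE/SST; the abstract does not state exactly how the authors computed the statistic, and the function name and data values are hypothetical.

```python
def variance_explained(observed, predicted):
    """Return 1 - SSE/SST: the fraction of variance in `observed`
    accounted for by `predicted` (assumed definition, see lead-in)."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_obs = sum(observed) / len(observed)
    # Sum of squared prediction errors.
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares around the observed mean.
    sst = sum((o - mean_obs) ** 2 for o in observed)
    if sst == 0:
        raise ValueError("observed values have zero variance")
    return 1.0 - sse / sst


# Hypothetical prevalence values for five sites (not taken from the study).
observed = [0.10, 0.30, 0.50, 0.70, 0.90]
predicted = [0.20, 0.30, 0.40, 0.70, 0.90]
r2 = variance_explained(observed, predicted)
print(f"variance explained: {r2:.1%}")
```

    For random forests, this statistic is usually evaluated on out-of-bag or held-out predictions rather than on the training fit, which would otherwise overstate accuracy.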

  8. Biogeographic distribution patterns and their correlates in the diverse frog fauna of the Atlantic Forest hotspot.

    PubMed

    Vasconcelos, Tiago S; Prado, Vitor H M; da Silva, Fernando R; Haddad, Célio F B

    2014-01-01

    Anurans are a highly diverse group in the Atlantic Forest hotspot (AF), yet distribution patterns and species richness gradients are not randomly distributed throughout the biome. Thus, we explore how anuran species are distributed in this complex and biodiverse hotspot, and hypothesize that this group can be distinguished by different cohesive regions. We used range maps of 497 species to obtain a presence/absence data grid, resolved to 50×50 km grain size, which was submitted to k-means clustering with v-fold cross-validation to determine the biogeographic regions. We also explored the extent to which current environmental variables, topography, and floristic structure of the AF are expected to identify the cluster patterns recognized by the k-means clustering. The biogeographic patterns found for amphibians are broadly congruent with ecoregions identified in the AF, but their edges, and sometimes the whole extent of some clusters, present a much less resolved pattern compared to the previous classification. We also identified that climate, topography, and vegetation structure of the AF explained a high percentage of variance of the cluster patterns identified, but the magnitude of the regression coefficients shifted regarding their importance in explaining the variance for each cluster. Specifically, we propose that the anuran fauna of the AF can be split into four biogeographic regions: a) less diverse and widely-ranged species that predominantly occur in the inland semideciduous forests; b) northern small-ranged species that presumably evolved within the Pleistocene forest refugia; c) highly diverse and small-ranged species from the southeastern Brazilian mountain chain and its adjacent semideciduous forest; and d) southern species from the Araucaria forest. 
Finally, the high congruence among the cluster patterns and previous eco-regions identified for the AF suggests that preserving the underlying habitat structure helps to preserve the historical and ecological signals that underlie the geographic distribution of AF anurans.
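
    The clustering step described in this abstract (k-means on a presence/absence grid) can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the number of clusters is fixed at two instead of being chosen by v-fold cross-validation, the starting centroids are picked by hand, and the tiny grid of cells by species is hypothetical.

```python
def kmeans(rows, centroids, max_iter=100):
    """Plain Lloyd's algorithm. `centroids` are the initial centers.
    Returns (labels, centroids) once assignments stop changing."""
    centroids = [list(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(len(centroids)),
                key=lambda j: sum((x - c) ** 2 for x, c in zip(row, centroids[j])))
            for row in rows
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(len(centroids)):
            members = [row for row, lab in zip(rows, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical grid: 6 cells (rows) x 4 species (columns), 1 = present.
cells = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 1],
]
labels, centers = kmeans(cells, centroids=[cells[0], cells[3]])
print(labels)
```

    Each centroid coordinate ends up as the fraction of member cells in which a species is present, which is what makes the resulting clusters interpretable as regions with characteristic faunas.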

  9. Measured and modeled evidence for a two-fold increase in water use efficiency at an old-growth forest site in the Pacific Northwest

    NASA Astrophysics Data System (ADS)

    Jiang, Y.; Rastogi, B.; Kim, J. B.; Voelker, S.; Meinzer, F. C.; Still, C. J.

    2017-12-01

    Water use efficiency (WUE), the ratio of carbon uptake to transpiration, has been widely recognized as an important measure of carbon and water cycling in plants, and is used to track forest ecosystem responses to climate change and rising atmospheric CO2 concentrations. In this study we used eddy covariance measurement data and Ecosystem Demography model (ED2) simulations to explore the patterns and physiological and biophysical controls of WUE at Wind River Experimental Forest, an old-growth coniferous forest in the Pacific Northwest. We characterized how observed and simulated WUE vary between wet and dry years, and explored the drivers of the differences in WUE between the wet and dry years. Through this explorative process, we evaluated the utility of various ways that WUE has been computed in the literature. Measurement-based and simulated WUE at the old-growth forest increased over twofold from 1998 to 2015. The primary driver of this trend is a decreasing trend in evapotranspiration (ET). There were significant inter-annual variations. For example, during drought years, higher air temperature drove increases in early season ET, thereby depleting soil water and decreasing gross primary productivity (GPP). Lower GPP in turn resulted in lower WUE. This mechanism might drive changes in future carbon and water budgets under a warming climate. Our evaluation of multiple WUE metrics demonstrates that each metric has a distinct sensitivity to climate anomalies, but also indicates a robust increasing trend of WUE. Statistical (multiple linear regression) and machine learning (Random Forest) analyses of flux measurements indicated that atmospheric CO2 concentration, air temperature and radiation were the most important predictors of WUE at monthly, daily and half-hourly time scales, respectively. 
In contrast, WUE mechanism was stable across all time scales in ED2 simulations: vapor pressure deficit was consistently the most important predictor of WUE at the monthly, daily and half-hourly time scales.
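
    The ratio behind this abstract's headline result can be made concrete with a short sketch. It assumes the ecosystem-level form WUE = GPP / ET, which is only one of the several metrics the abstract says were compared, and the flux values are hypothetical round numbers rather than measurements from the site.

```python
def water_use_efficiency(gpp, et):
    """Ecosystem WUE as carbon uptake per unit of water lost (GPP / ET).
    Units follow the inputs, e.g. g C per kg H2O."""
    if et <= 0:
        raise ValueError("ET must be positive")
    return gpp / et


# Hypothetical annual totals: carbon uptake is unchanged while ET halves.
wue_early = water_use_efficiency(gpp=1500.0, et=600.0)
wue_late = water_use_efficiency(gpp=1500.0, et=300.0)
print(f"early WUE = {wue_early}, late WUE = {wue_late}, "
      f"ratio = {wue_late / wue_early}")
```

    With carbon uptake held constant, halving ET doubles WUE, which is consistent with the abstract's statement that a declining ET trend was the primary driver of the increase.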

  10. Quantifying climate-growth relationships at the stand level in a mature mixed-species conifer forest.

    PubMed

    Teets, Aaron; Fraver, Shawn; Weiskittel, Aaron R; Hollinger, David Y

    2018-03-11

    A range of environmental factors regulate tree growth; however, climate is generally thought to most strongly influence year-to-year variability in growth. Numerous dendrochronological (tree-ring) studies have identified climate factors that influence year-to-year variability in growth for given tree species and location. However, traditional dendrochronology methods have limitations that prevent them from adequately assessing stand-level (as opposed to species-level) growth. We argue that stand-level growth analyses provide a more meaningful assessment of forest response to climate fluctuations, as well as the management options that may be employed to sustain forest productivity. Working in a mature, mixed-species stand at the Howland Research Forest of central Maine, USA, we used two alternatives to traditional dendrochronological analyses by (1) selecting trees for coring using a stratified (by size and species), random sampling method that ensures a representative sample of the stand, and (2) converting ring widths to biomass increments, which once summed, produced a representation of stand-level growth, while maintaining species identities or canopy position if needed. We then tested the relative influence of seasonal climate variables on year-to-year variability in the biomass increment using generalized least squares regression, while accounting for temporal autocorrelation. Our results indicate that stand-level growth responded most strongly to previous summer and current spring climate variables, resulting from a combination of individualistic climate responses occurring at the species- and canopy-position level. Our climate models were better fit to stand-level biomass increment than to species-level or canopy-position summaries. The relative growth responses (i.e., percent change) predicted from the most influential climate variables indicate stand-level growth varies less from year to year than species-level or canopy-position growth responses. 
By assessing stand-level growth response to climate, we provide an alternative perspective on climate-growth relationships of forests, improving our understanding of forest growth dynamics under a fluctuating climate. © 2018 John Wiley & Sons Ltd.

  11. Recent drought conditions in the Conterminous United States

    Treesearch

    Frank H. Koch; William D. Smith; John W. Coulston

    2013-01-01

    Droughts are common in virtually all U.S. forests, but their frequency and intensity vary widely both between and within forest ecosystems (Hanson and Weltzin 2000). Forests in the Western United States generally exhibit a pattern of annual seasonal droughts. Forests in the Eastern United States tend to exhibit one of two prevailing patterns: random occasional droughts...

  12. Relations between fish abundances, summer temperatures, and forest harvest in a northern Minnesota stream system from 1997 to 2007

    USGS Publications Warehouse

    Merten, Eric C.; Hemstad, Nathaniel A.; Eggert, L.S.; Johnson, L.B.; Kolka, R.K.; Newman, Raymond M.; Vondracek, Bruce C.

    2015-01-01

    Short-term effects of forest harvest on fish habitat have been well documented, including sediment inputs, leaf litter reductions, and stream warming. However, few studies have considered changes in local climate when examining postlogging changes in fish communities. To address this need, we examined fish abundances between 1997 and 2007 in a basin in a northern hardwood forest. Streams in the basin were subjected to experimental riparian forest harvest in fall 1997. We noted a significant decrease for fish index of biotic integrity and abundance of Salvelinus fontinalis and Phoxinus eos over the study period. However, for P. eos and Culaea inconstans, the temporal patterns in abundances were related more to summer air temperatures than to fine sediment or spring precipitation when examined using multiple regressions. Univariate regressions suggested that summer air temperatures influenced temporal patterns in fish communities more than fine sediment or spring precipitation.

  13. Epidemiology of forest malaria in Central Vietnam: the hidden parasite reservoir.

    PubMed

    Thanh, Pham Vinh; Van Hong, Nguyen; Van Van, Nguyen; Van Malderen, Carine; Obsomer, Valérie; Rosanas-Urgell, Anna; Grietens, Koen Peeters; Xa, Nguyen Xuan; Bancone, Germana; Chowwiwat, Nongnud; Duong, Tran Thanh; D'Alessandro, Umberto; Speybroeck, Niko; Erhart, Annette

    2015-02-19

    After successfully reducing the malaria burden to pre-elimination levels over the past two decades, the national malaria programme in Vietnam has recently switched from control to elimination. However, in forested areas of Central Vietnam malaria elimination is likely to be jeopardized by the high occurrence of asymptomatic and submicroscopic infections as shown by previous reports. This paper presents the results of a malaria survey carried out in a remote forested area of Central Vietnam where we evaluated malaria prevalence and risk factors for infection. After a full census (four study villages = 1,810 inhabitants), the study population was screened for malaria infections by standard microscopy and, if needed, treated according to national guidelines. An additional blood sample on filter paper was also taken in a random sample of the population for later polymerase chain reaction (PCR) and more accurate estimation of the actual burden of malaria infections. The risk factor analysis for malaria infections was done using survey multivariate logistic regression as well as the classification and regression tree method (CART). A total of 1,450 individuals were screened. Malaria prevalence by microscopy was 7.8% (ranging from 3.9 to 10.9% across villages) mostly Plasmodium falciparum (81.4%) or Plasmodium vivax (17.7%) mono-infections; a large majority (69.9%) was asymptomatic. By PCR, the prevalence was estimated at 22.6% (ranging from 16.4 to 42.5%) with a higher proportion of P. vivax mono-infections (43.2%). The proportion of sub-patent infections increased with increasing age and with decreasing prevalence across villages. The main risk factors were young age, village, house structure, and absence of bed net. This study confirmed that in Central Vietnam a substantial part of the human malaria reservoir is hidden. Additional studies are urgently needed to assess the contribution of this hidden reservoir to the maintenance of malaria transmission. 
Such evidence will be crucial for guiding elimination strategies.

  14. Heritability estimations for inner muscular fat in Hereford cattle using random regressions

    USDA-ARS's Scientific Manuscript database

    Random regressions make it possible to make genetic predictions and estimate parameters across a gradient of environments, allowing a more accurate and beneficial use of animals as breeders in specific environments. The objective of this study was to use random regression models to estimate heritabil...

  15. Stratifying to reduce bias caused by high nonresponse rates: A case study from New Mexico’s forest inventory

    Treesearch

    Sara A. Goeking; Paul L. Patterson

    2013-01-01

    The USDA Forest Service’s Forest Inventory and Analysis (FIA) Program applies specific sampling and analysis procedures to estimate a variety of forest attributes. FIA’s Interior West region uses post-stratification, where strata consist of forest/nonforest polygons based on MODIS imagery, and assumes that nonresponse plots are distributed at random across each stratum...

  16. Thermokarst rates intensify due to climate change and forest fragmentation in an Alaskan boreal forest lowland.

    PubMed

    Lara, Mark J; Genet, Hélène; McGuire, Anthony D; Euskirchen, Eugénie S; Zhang, Yujin; Brown, Dana R N; Jorgenson, Mark T; Romanovsky, Vladimir; Breen, Amy; Bolton, William R

    2016-02-01

    Lowland boreal forest ecosystems in Alaska are dominated by wetlands comprised of a complex mosaic of fens, collapse-scar bogs, low shrub/scrub, and forests growing on elevated ice-rich permafrost soils. Thermokarst has affected the lowlands of the Tanana Flats in central Alaska for centuries, as thawing permafrost collapses forests that transition to wetlands. Located within the discontinuous permafrost zone, this region has significantly warmed over the past half-century, and much of these carbon-rich permafrost soils are now within ~0.5 °C of thawing. Increased permafrost thaw in lowland boreal forests in response to warming may have consequences for the climate system. This study evaluates the trajectories and potential drivers of 60 years of forest change in a landscape subjected to permafrost thaw in unburned dominant forest types (paper birch and black spruce) associated with location on elevated permafrost plateau and across multiple time periods (1949, 1978, 1986, 1998, and 2009) using historical and contemporary aerial and satellite images for change detection. We developed (i) a deterministic statistical model to evaluate the potential climatic controls on forest change using gradient boosting and regression tree analysis, and (ii) a 30 × 30 m land cover map of the Tanana Flats to estimate the potential landscape-level losses of forest area due to thermokarst from 1949 to 2009. Over the 60-year period, we observed a nonlinear loss of birch forests and a relatively continuous gain of spruce forest associated with thermokarst and forest succession, while gradient boosting/regression tree models identify precipitation and forest fragmentation as the primary factors controlling birch and spruce forest change, respectively. Between 1950 and 2009, landscape-level analysis estimates a transition of ~15 km² or ~7% of birch forests to wetlands, where the greatest change followed warm periods. 
This work highlights that the vulnerability and resilience of lowland ice-rich permafrost ecosystems to climate changes depend on forest type. © 2015 John Wiley & Sons Ltd.

  17. Thermokarst rates intensify due to climate change and forest fragmentation in an Alaskan boreal forest lowland

    USGS Publications Warehouse

    Lara, M.; Genet, Helene; McGuire, A. David; Euskirchen, Eugénie S.; Zhang, Yujin; Brown, Dana R. N.; Jorgenson, M.T.; Romanovsky, V.; Breen, Amy L.; Bolton, W.R.

    2016-01-01

    Lowland boreal forest ecosystems in Alaska are dominated by wetlands comprised of a complex mosaic of fens, collapse-scar bogs, low shrub/scrub, and forests growing on elevated ice-rich permafrost soils. Thermokarst has affected the lowlands of the Tanana Flats in central Alaska for centuries, as thawing permafrost collapses forests that transition to wetlands. Located within the discontinuous permafrost zone, this region has significantly warmed over the past half-century, and much of these carbon-rich permafrost soils are now within ~0.5 °C of thawing. Increased permafrost thaw in lowland boreal forests in response to warming may have consequences for the climate system. This study evaluates the trajectories and potential drivers of 60 years of forest change in a landscape subjected to permafrost thaw in unburned dominant forest types (paper birch and black spruce) associated with location on elevated permafrost plateau and across multiple time periods (1949, 1978, 1986, 1998, and 2009) using historical and contemporary aerial and satellite images for change detection. We developed (i) a deterministic statistical model to evaluate the potential climatic controls on forest change using gradient boosting and regression tree analysis, and (ii) a 30 × 30 m land cover map of the Tanana Flats to estimate the potential landscape-level losses of forest area due to thermokarst from 1949 to 2009. Over the 60-year period, we observed a nonlinear loss of birch forests and a relatively continuous gain of spruce forest associated with thermokarst and forest succession, while gradient boosting/regression tree models identify precipitation and forest fragmentation as the primary factors controlling birch and spruce forest change, respectively. Between 1950 and 2009, landscape-level analysis estimates a transition of ~15 km² or ~7% of birch forests to wetlands, where the greatest change followed warm periods. 
This work highlights that the vulnerability and resilience of lowland ice-rich permafrost ecosystems to climate changes depend on forest type.

  18. RF-Phos: A Novel General Phosphorylation Site Prediction Tool Based on Random Forest.

    PubMed

    Ismail, Hamid D; Jones, Ahoi; Kim, Jung H; Newman, Robert H; Kc, Dukka B

    2016-01-01

    Protein phosphorylation is one of the most widespread regulatory mechanisms in eukaryotes. Over the past decade, phosphorylation site prediction has emerged as an important problem in the field of bioinformatics. Here, we report a new method, termed Random Forest-based Phosphosite predictor 2.0 (RF-Phos 2.0), to predict phosphorylation sites given only the primary amino acid sequence of a protein as input. RF-Phos 2.0, which uses random forest with sequence and structural features, is able to identify putative sites of phosphorylation across many protein families. In side-by-side comparisons based on 10-fold cross validation and an independent dataset, RF-Phos 2.0 compares favorably to other popular mammalian phosphosite prediction methods, such as PhosphoSVM, GPS2.1, and Musite.

  19. Kepler AutoRegressive Planet Search (KARPS)

    NASA Astrophysics Data System (ADS)

    Caceres, Gabriel

    2018-01-01

    One of the main obstacles in detecting faint planetary transits is the intrinsic stellar variability of the host star. The Kepler AutoRegressive Planet Search (KARPS) project implements statistical methodology associated with autoregressive processes (in particular, ARIMA and ARFIMA) to model stellar lightcurves in order to improve exoplanet transit detection. We also develop a novel Transit Comb Filter (TCF) applied to the AR residuals which provides a periodogram analogous to the standard Box-fitting Least Squares (BLS) periodogram. We train a random forest classifier on known Kepler Objects of Interest (KOIs) using select features from different stages of this analysis, and then use ROC curves to define and calibrate the criteria to recover the KOI planet candidates with high fidelity. These statistical methods are detailed in a contributed poster (Feigelson et al., this meeting). These procedures are applied to the full DR25 dataset of NASA’s Kepler mission. Using the classification criteria, a vast majority of known KOIs are recovered and dozens of new KARPS Candidate Planets (KCPs) discovered, including ultra-short period exoplanets. The KCPs will be briefly presented and discussed.
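
    The ROC step mentioned in this abstract can be illustrated without the classifier itself. The sketch below only shows how a ROC curve and its area are computed from classifier scores; it is not the KARPS pipeline, and the scores, labels, and function names are hypothetical.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs from thresholding at each distinct score,
    moving from the strictest threshold to the most permissive."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Trapezoidal area under a ROC curve given as ordered (FPR, TPR) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical classifier scores; label 1 marks a known planet candidate.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
area = auc(curve)
print(curve)
print(f"AUC = {area:.3f}")
```

    Choosing an operating threshold then amounts to picking the point on this curve that gives an acceptable trade-off between recovered candidates (TPR) and false alarms (FPR).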

  20. Inner and outer coronary vessel wall segmentation from CCTA using an active contour model with machine learning-based 3D voxel context-aware image force

    NASA Astrophysics Data System (ADS)

    Sivalingam, Udhayaraj; Wels, Michael; Rempfler, Markus; Grosskopf, Stefan; Suehling, Michael; Menze, Bjoern H.

    2016-03-01

    In this paper, we present a fully automated approach to coronary vessel segmentation, which involves calcification or soft plaque delineation in addition to accurate lumen delineation, from 3D Cardiac Computed Tomography Angiography data. Adequately virtualizing the coronary lumen plays a crucial role for simulating blood flow by means of fluid dynamics, while additionally identifying the outer vessel wall in the case of arteriosclerosis is a prerequisite for further plaque compartment analysis. Our method is a hybrid approach complementing Active Contour Model-based segmentation with an external image force that relies on a Random Forest Regression model generated off-line. The regression model provides a strong estimate of the distance to the true vessel surface for every surface candidate point taking into account 3D wavelet-encoded contextual image features, which are aligned with the current surface hypothesis. The associated external image force is integrated in the objective function of the active contour model, such that the overall segmentation approach benefits from the advantages associated with snakes and from the ones associated with machine learning-based regression alike. This yields an integrated approach achieving competitive results on a publicly available benchmark data collection (Rotterdam segmentation challenge).

  1. Mangroves Enhance Reef Fish Abundance at the Caribbean Regional Scale.

    PubMed

    Serafy, Joseph E; Shideler, Geoffrey S; Araújo, Rafael J; Nagelkerken, Ivan

    2015-01-01

    Several studies conducted at the scale of islands, or small sections of continental coastlines, have suggested that mangrove habitats serve to enhance fish abundances on coral reefs, mainly by providing nursery grounds for several ontogenetically-migrating species. However, evidence of such enhancement at a regional scale has not been reported, and recently, some researchers have questioned the mangrove-reef subsidy effect. In the present study, using two different regression approaches, we pursued two questions related to mangrove-reef connectivity at the Caribbean regional scale: (1) Are reef fish abundances limited by mangrove forest area?; and (2) Are mean reef fish abundances proportional to mangrove forest area after taking human population density and latitude into account? Specifically, we tested for Caribbean-wide mangrove forest area effects on the abundances of 12 reef fishes that have been previously characterized as "mangrove-dependent". Analyzed were data from an ongoing, long-term (20-year) citizen-scientist fish monitoring program; coastal human population censuses; and several wetland forest information sources. Quantile regression results supported the notion that mangrove forest area limits the abundance of eight of the 12 fishes examined. Linear mixed-effects regression results, which considered potential human (fishing and habitat degradation) and latitudinal influences, suggested that average reef fish densities of at least six of the 12 focal fishes were directly proportional to mangrove forest area. Recent work questioning the mangrove-reef fish subsidy effect likely reflects a failure to: (1) focus analyses on species that use mangroves as nurseries, (2) consider more than the mean fish abundance response to mangrove forest extent; and/or (3) quantitatively account for potentially confounding human impacts, such as fishing pressure and habitat degradation. 
Our study is the first to demonstrate at a large regional scale (i.e., the Wider Caribbean) that greater mangrove forest size generally functions to increase the densities on neighboring reefs of those fishes that use these shallow, vegetated habitats as nurseries.
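
    Quantile regression, used in this abstract to test whether mangrove area limits fish abundance, fits a chosen quantile by minimizing the check (pinball) loss instead of squared error. The sketch below shows that loss and the simplest possible case, a constant model, for which the minimizer is a sample quantile. The abundance values are hypothetical and the function names are illustrative; the study's actual models also included mangrove area as a predictor.

```python
def pinball_loss(y_true, y_pred, tau):
    """Mean check (pinball) loss for quantile level tau in (0, 1)."""
    if not 0 < tau < 1:
        raise ValueError("tau must be in (0, 1)")
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        # Under-prediction is weighted by tau, over-prediction by (1 - tau).
        total += tau * diff if diff >= 0 else (tau - 1) * diff
    return total / len(y_true)


def best_constant(y, tau):
    """Pick the observed value that minimizes pinball loss as a constant fit."""
    return min(sorted(set(y)),
               key=lambda c: pinball_loss(y, [c] * len(y), tau))


# Hypothetical fish abundances at five reef sites.
abundance = [0, 1, 2, 3, 10]
median_fit = best_constant(abundance, tau=0.5)
upper_fit = best_constant(abundance, tau=0.9)
print(median_fit, upper_fit)
```

    Fitting an upper quantile rather than the mean is what lets the analysis describe a ceiling on abundance, which is the sense in which a habitat variable can be said to limit a population.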

  2. Mangroves Enhance Reef Fish Abundance at the Caribbean Regional Scale

    PubMed Central

    Serafy, Joseph E.; Shideler, Geoffrey S.; Araújo, Rafael J.; Nagelkerken, Ivan

    2015-01-01

    Several studies conducted at the scale of islands, or small sections of continental coastlines, have suggested that mangrove habitats serve to enhance fish abundances on coral reefs, mainly by providing nursery grounds for several ontogenetically-migrating species. However, evidence of such enhancement at a regional scale has not been reported, and recently, some researchers have questioned the mangrove-reef subsidy effect. In the present study, using two different regression approaches, we pursued two questions related to mangrove-reef connectivity at the Caribbean regional scale: (1) Are reef fish abundances limited by mangrove forest area?; and (2) Are mean reef fish abundances proportional to mangrove forest area after taking human population density and latitude into account? Specifically, we tested for Caribbean-wide mangrove forest area effects on the abundances of 12 reef fishes that have been previously characterized as “mangrove-dependent”. Analyzed were data from an ongoing, long-term (20-year) citizen-scientist fish monitoring program; coastal human population censuses; and several wetland forest information sources. Quantile regression results supported the notion that mangrove forest area limits the abundance of eight of the 12 fishes examined. Linear mixed-effects regression results, which considered potential human (fishing and habitat degradation) and latitudinal influences, suggested that average reef fish densities of at least six of the 12 focal fishes were directly proportional to mangrove forest area. Recent work questioning the mangrove-reef fish subsidy effect likely reflects a failure to: (1) focus analyses on species that use mangroves as nurseries, (2) consider more than the mean fish abundance response to mangrove forest extent; and/or (3) quantitatively account for potentially confounding human impacts, such as fishing pressure and habitat degradation. 
Our study is the first to demonstrate at a large regional scale (i.e., the Wider Caribbean) that greater mangrove forest size generally functions to increase the densities on neighboring reefs of those fishes that use these shallow, vegetated habitats as nurseries. PMID:26536478

  3. Relationships between common forest metrics and realized impacts of Hurricane Katrina on forest resources in Mississippi

    Treesearch

    Sonja N. Oswalt; Christopher M. Oswalt

    2008-01-01

    This paper compares and contrasts hurricane-related damage recorded across the Mississippi landscape in the 2 years following Katrina with initial damage assessments based on modeled parameters by the USDA Forest Service. Logistic and multiple regressions are used to evaluate the influence of stand characteristics on tree damage probability. Specifically, this paper...

  4. Spatial, spectral and temporal patterns of tropical forest cover change as observed with multiple scales of optical satellite data.

    Treesearch

    D.J. Hayes; W.B. Cohen

    2006-01-01

    This article describes the development of a methodology for scaling observations of changes in tropical forest cover to large areas at high temporal frequency from coarse-resolution satellite imagery. The approach for estimating proportional forest cover change as a continuous variable is based on a regression model that relates multispectral, multitemporal Moderate...

  5. The cost of acquiring public hunting access on family forests lands

    Treesearch

    Michael A. Kilgore; Stephanie A. Snyder; Joesph M. Schertz; Steven J. Taff

    2008-01-01

    To address the issue of declining access to private forest land in the United States for hunting, over 1,000 Minnesota family forest owners were surveyed to estimate the cost of acquiring non-exclusive public hunting access rights. The results indicate landowner interest in selling access rights is extremely modest. Using binary logistic regression, the mean annual...

  6. ShapeSelectForest: a new R package for modeling landsat time series

    Treesearch

    Mary Meyer; Xiyue Liao; Gretchen Moisen; Elizabeth Freeman

    2015-01-01

    We present a new R package called ShapeSelectForest recently posted to the Comprehensive R Archive Network. The package was developed to fit nonparametric shape-restricted regression splines to time series of Landsat imagery for the purpose of modeling, mapping, and monitoring annual forest disturbance dynamics over nearly three decades. For each pixel and spectral...

  7. Estimation of crown closure from AVIRIS data using regression analysis

    NASA Technical Reports Server (NTRS)

    Staenz, K.; Williams, D. J.; Truchon, M.; Fritz, R.

    1993-01-01

    Crown closure is one of the input parameters used for forest growth and yield modelling. Preliminary work by Staenz et al. indicates that imaging spectrometer data acquired with sensors such as the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) have some potential for estimating crown closure on a stand level. The objectives of this paper are: (1) to establish a relationship between AVIRIS data and the crown closure derived from aerial photography of a forested test site within the Interior Douglas Fir biogeoclimatic zone in British Columbia, Canada; (2) to investigate the impact of atmospheric effects and the forest background on the correlation between AVIRIS data and crown closure estimates; and (3) to improve this relationship using multiple regression analysis.

  8. Electromagnetic wave extinction within a forested canopy

    NASA Technical Reports Server (NTRS)

    Karam, M. A.; Fung, A. K.

    1989-01-01

    A forested canopy is modeled by a collection of randomly oriented finite-length cylinders shaded by randomly oriented and distributed disk- or needle-shaped leaves. For a plane wave exciting the forested canopy, the extinction coefficient is formulated in terms of the extinction cross sections (ECSs) in the local frame of each forest component and the Eulerian angles of orientation (used to describe the orientation of each component). The ECSs in the local frame for the finite-length cylinders used to model the branches are obtained by using the forward-scattering theorem. ECSs in the local frame for the disk- and needle-shaped leaves are obtained by the summation of the absorption and scattering cross-sections. The behavior of the extinction coefficients with the incidence angle is investigated numerically for both deciduous and coniferous forest. The dependencies of the extinction coefficients on the orientation of the leaves are illustrated numerically.

  9. Relationship of field and LiDAR estimates of forest canopy cover with snow accumulation and melt

    Treesearch

    Mariana Dobre; William J. Elliot; Joan Q. Wu; Timothy E. Link; Brandon Glaza; Theresa B. Jain; Andrew T. Hudak

    2012-01-01

    At the Priest River Experimental Forest in northern Idaho, USA, snow water equivalent (SWE) was recorded over a period of six years on random, equally-spaced plots in ~4.5 ha small watersheds (n=10). Two watersheds were selected as controls and eight as treatments, with two watersheds randomly assigned per treatment as follows: harvest (2007) followed by mastication (...

  10. Random forests and stochastic gradient boosting for predicting tree canopy cover: Comparing tuning processes and model performance

    Treesearch

    Elizabeth A. Freeman; Gretchen G. Moisen; John W. Coulston; Barry T. (Ty) Wilson

    2015-01-01

    As part of the development of the 2011 National Land Cover Database (NLCD) tree canopy cover layer, a pilot project was launched to test the use of high-resolution photography coupled with extensive ancillary data to map the distribution of tree canopy cover over four study regions in the conterminous US. Two stochastic modeling techniques, random forests (RF...

  11. Chapter 4 - Drought patterns in the conterminous United States and Hawaii.

    Treesearch

    Frank H. Koch; William D. Smith; John W. Coulston

    2014-01-01

    Droughts are common in virtually all U.S. forests, but their frequency and intensity vary widely both between and within forest ecosystems (Hanson and Weltzin 2000). Forests in the Western United States generally exhibit a pattern of annual seasonal droughts. Forests in the Eastern United States tend to exhibit one of two prevailing patterns: random occasional droughts...

  12. A Prospectus on Restoring Late Successional Forest Structure to Eastside Pine Ecosystems Through Large-Scale, Interdisciplinary Research

    Treesearch

    Steve Zack; William F. Laudenslayer; Luke George; Carl Skinner; William Oliver

    1999-01-01

    At two different locations in northeast California, an interdisciplinary team of scientists is initiating long-term studies to quantify the effects of forest manipulations intended to accelerate and/or enhance late-successional structure of eastside pine forest ecosystems. One study, at Blacks Mountain Experimental Forest, uses a split-plot, factorial, randomized block...

  13. Probabilistic risk models for multiple disturbances: an example of forest insects and wildfires

    Treesearch

    Haiganoush K. Preisler; Alan A. Ager; Jane L. Hayes

    2010-01-01

    Building probabilistic risk models for highly random forest disturbances like wildfire and forest insect outbreaks is challenging. Modeling the interactions among natural disturbances is even more difficult. In the case of wildfire and forest insects, we looked at the probability of a large fire given an insect outbreak and also the incidence of insect outbreaks...

  14. Utilizing random forests imputation of forest plot data for landscape-level wildfire analyses

    Treesearch

    Karin L. Riley; Isaac C. Grenfell; Mark A. Finney; Nicholas L. Crookston

    2014-01-01

    Maps of the number, size, and species of trees in forests across the United States are desirable for a number of applications. For landscape-level fire and forest simulations that use the Forest Vegetation Simulator (FVS), a spatial tree-level dataset, or “tree list”, is a necessity. FVS is widely used at the stand level for simulating fire effects on tree mortality,...

  15. Modeling Forest Understory Fires in an Eastern Amazonian Landscape

    NASA Technical Reports Server (NTRS)

    Alencar, A. A. C.; Solorzano, L. A.; Nepstad, D. C.

    2004-01-01

    Forest understory fires are an increasingly important cause of forest impoverishment in Amazonia, but little is known of the landscape characteristics and climatic phenomena that determine their occurrence. We developed empirical functions relating the occurrence of understory fires to landscape features near Paragominas, a 35-yr-old ranching and logging center in eastern Amazonia. An historical sequence of maps of forest understory fire was created based on field interviews with local farmers and Landsat TM images. Several landscape features that might explain spatial variations in the occurrence of understory fires were also mapped and co-registered for each of the sample dates, including: forest fragment size and shape, forest impoverishment through logging and understory fires, source of ignition (settlements and charcoal pits), roads, forest edges, and others. The spatial relationship between forest understory fire and each landscape characteristic was tested by regression analyses. Fire probability models were then developed for various combinations of landscape characteristics. The analyses were conducted separately for years of the El Nino Southern Oscillation (ENSO), which are associated with severe drought in eastern Amazonia, and non-ENSO years. Most (91%) of the forest area that burned during the 10-yr sequence caught fire during ENSO years, when severe drought may have increased both forest flammability and the escape of agricultural management fires. Forest understory fires were associated with forest edges, as reported in previous studies from Amazonia. But the strongest predictor of forest fire was the percentage of the forest fragment that had been previously logged or burned. Forest fragment size, distance to charcoal pits, distance to agricultural settlement, proximity to forest edge, and distance to roads were also correlated with forest understory fire.
Logistic regression models using information on fragment degradation and distance to ignition sources accurately predicted the location of less than 80% of the forest fires observed during the ENSO event of 1997-1998. In this Amazon landscape, forest understory fire is a complex function of several variables that influence both the flammability and ignition exposure of the forest.
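
    The logistic form behind fire probability models of this kind can be sketched as follows. This is only an illustration: the function name, the two predictors, and every coefficient value are hypothetical and are not taken from the study.

```python
import math

def fire_probability(intercept, coefficients, predictors):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = intercept + sum(b * x for b, x in zip(coefficients, predictors))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients (NOT from the abstract): an intercept, a positive
# effect for the fraction of a fragment previously logged or burned, and a
# negative effect for distance to the nearest ignition source (km).
INTERCEPT = -2.0
COEFFICIENTS = [3.0, -0.5]

# Two invented forest fragments: heavily degraded and close to an ignition
# source versus nearly intact and far from one.
p_degraded = fire_probability(INTERCEPT, COEFFICIENTS, [0.8, 1.0])
p_intact = fire_probability(INTERCEPT, COEFFICIENTS, [0.1, 6.0])
print(f"degraded fragment: {p_degraded:.3f}, intact fragment: {p_intact:.3f}")
```

    With coefficients of these signs, the degraded fragment near an ignition source receives the higher predicted probability.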

  16. Random Forest Application for NEXRAD Radar Data Quality Control

    NASA Astrophysics Data System (ADS)

    Keem, M.; Seo, B. C.; Krajewski, W. F.

    2017-12-01

    Identification and elimination of non-meteorological radar echoes (e.g., returns from ground, wind turbines, and biological targets) are the basic data quality control steps before radar data use in quantitative applications (e.g., precipitation estimation). Although WSR-88Ds' recent upgrade to dual-polarization has enhanced this quality control and echo classification, there are still challenges to detect some non-meteorological echoes that show precipitation-like characteristics (e.g., wind turbine or anomalous propagation clutter embedded in rain). With this in mind, a new quality control method using Random Forest is proposed in this study. This classification algorithm is known to produce reliable results with less uncertainty. The method introduces randomness into sampling and feature selections and integrates consequent multiple decision trees. The multidimensional structure of the trees can characterize the statistical interactions of involved multiple features in complex situations. The authors explore the performance of the Random Forest method for NEXRAD radar data quality control. Training datasets are selected using several clear cases of precipitation and non-precipitation (but with some non-meteorological echoes). The model is structured using available candidate features (from the NEXRAD data) such as horizontal reflectivity, differential reflectivity, differential phase shift, copolar correlation coefficient, and their horizontal textures (e.g., local standard deviation). The influence of each feature on classification results is quantified by variable importance measures that are automatically estimated by the Random Forest algorithm. Therefore, the number and types of features in the final forest can be examined based on the classification accuracy.
The authors demonstrate the capability of the proposed approach using several cases ranging from distinct to complex rain/no-rain events and compare the performance with the existing algorithms (e.g., MRMS). They also discuss operational feasibility based on the observed strength and weakness of the method.
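
    The two sources of randomness described above (bootstrap sampling of the training cases and a random subset of candidate features per tree) plus the final majority vote can be sketched in a few lines. This is a deliberately reduced illustration and not the authors' implementation: each "tree" is a one-split stump, and the feature values and class labels are invented.

```python
import random
from collections import Counter

def majority(labels):
    """Most frequent label in a list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels, feature_ids):
    """One-split tree: choose the (feature, threshold) with the fewest errors."""
    best = None
    for f in feature_ids:
        for t in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            if not right:  # threshold at the maximum value splits nothing off
                continue
            l_lab, r_lab = majority(left), majority(right)
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:  # no usable split: always predict the majority class
        label = majority(labels)
        return (feature_ids[0], float("inf"), label, label)
    return best[1:]

def fit_forest(rows, labels, n_trees=25, n_features=2, seed=0):
    """Each stump sees a bootstrap sample and a random subset of features."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        feats = rng.sample(range(d), n_features)     # random feature subset
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feats))
    return forest

def predict(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(l if row[f] <= t else r for f, t, l, r in forest)
    return votes.most_common(1)[0][0]

# Invented toy data: features 0 and 1 separate the classes, feature 2 is constant.
rows = [(1.0, 2.0, 5.0), (2.0, 1.0, 5.0), (3.0, 3.0, 5.0), (4.0, 4.0, 5.0),
        (6.0, 7.0, 5.0), (7.0, 6.0, 5.0), (8.0, 9.0, 5.0), (9.0, 8.0, 5.0)]
labels = ["non_precip"] * 4 + ["precip"] * 4
forest = fit_forest(rows, labels)
print(predict(forest, (0.5, 0.5, 5.0)), predict(forest, (10.0, 10.0, 5.0)))
```

    A real forest grows full decision trees and usually redraws the feature subset at every split; the stump version only keeps the sampling-and-voting logic visible.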

  17. Estimating Forest Species Composition Using a Multi-Sensor Approach

    NASA Astrophysics Data System (ADS)

    Wolter, P. T.

    2009-12-01

    The magnitude, duration, and frequency of forest disturbance caused by the spruce budworm and forest tent caterpillar has increased over the last century due to a shift in forest species composition linked to historical fire suppression, forest management, and pesticide application that has fostered the increase in dominance of host tree species. Modeling approaches are currently being used to understand and forecast potential management effects in changing insect disturbance trends. However, detailed forest composition data needed for these efforts is often lacking. Here, we used partial least squares (PLS) regression to integrate satellite sensor data from Landsat, Radarsat-1, and PALSAR, as well as pixel-wise forest structure information derived from SPOT-5 sensor data (Wolter et al. 2009), to estimate species-level forest composition of 12 species required for modeling efforts. C-band Radarsat-1 data and L-band PALSAR data were frequently among the strongest predictors of forest composition. Pixel-level forest structure data were more important for estimating conifer rather than hardwood forest composition. The coefficients of determination for species relative basal area (RBA) ranged from 0.57 (white cedar) to 0.94 (maple) with RMSE of 8.88 to 6.44 % RBA, respectively. Receiver operating characteristic (ROC) curves were used to determine the effective lower limits of usefulness of species RBA estimates which ranged from 5.94 % (jack pine) to 39.41 % (black ash). These estimates were then used to produce a dominant forest species map for the study region with an overall accuracy of 78 %. Most notably, this approach facilitated discrimination of aspen from birch as well as spruce and fir from other conifer species which is crucial for the study of forest tent caterpillar and spruce budworm dynamics, respectively, in the Upper Midwest. 
Thus, use of PLS regression as a data fusion strategy has proven to be an effective tool for regional characterization of forest composition within spatially heterogeneous forests using large-format satellite sensor data.

  18. Modeling long-term suspended-sediment export from an undisturbed forest catchment

    NASA Astrophysics Data System (ADS)

    Zimmermann, Alexander; Francke, Till; Elsenbeer, Helmut

    2013-04-01

    Most estimates of suspended sediment yields from humid, undisturbed, and geologically stable forest environments fall within a range of 5 - 30 t km-2 a-1. These low natural erosion rates in small headwater catchments (≤ 1 km2) support the common impression that a well-developed forest cover prevents surface erosion. Interestingly, those estimates originate exclusively from areas with prevailing vertical hydrological flow paths. Forest environments dominated by (near-) surface flow paths (overland flow, pipe flow, and return flow) and a fast response to rainfall, however, are not an exceptional phenomenon, yet only very few sediment yields have been estimated for these areas. Not surprisingly, even fewer long-term (≥ 10 years) records exist. In this contribution we present our latest research which aims at quantifying long-term suspended-sediment export from an undisturbed rainforest catchment prone to frequent overland flow. A key aspect of our approach is the application of machine-learning techniques (Random Forest, Quantile Regression Forest) which allows not only the handling of non-Gaussian data, non-linear relations between predictors and response, and correlations between predictors, but also the assessment of prediction uncertainty. For the current study we provided the machine-learning algorithms exclusively with information from a high-resolution rainfall time series to reconstruct discharge and suspended sediment dynamics for a 21-year period. The significance of our results is threefold. First, our estimates clearly show that forest cover does not necessarily prevent erosion if wet antecedent conditions and large rainfalls coincide. During these situations, overland flow is widespread and sediment fluxes increase in a non-linear fashion due to the mobilization of new sediment sources. Second, our estimates indicate that annual suspended sediment yields of the undisturbed forest catchment show large fluctuations. 
Depending on the frequency of large events, annual suspended-sediment yield varies between 74 - 416 t km-2 a-1. Third, the estimated sediment yields exceed former benchmark values by an order of magnitude and provide evidence that the erosion footprint of undisturbed, forested catchments can be undistinguishable from that of sustainably managed, but hydrologically less responsive areas. Because of the susceptibility to soil loss we argue that any land use should be avoided in natural erosion hotspots.
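
    How a quantile regression forest can yield prediction quantiles, rather than only a mean, can be sketched as follows, assuming the standard weighting scheme: each tree spreads a weight of 1/(leaf size) over the training observations that share a leaf with the query, the weights are averaged over trees, and quantiles are read from the weighted empirical distribution. The training responses and leaf memberships below are invented, and the function names are hypothetical.

```python
def qrf_weights(leaf_members_per_tree, n_train):
    """Average, over trees, of 1/|leaf| for each training index in the query's leaf."""
    n_trees = len(leaf_members_per_tree)
    weights = [0.0] * n_train
    for members in leaf_members_per_tree:
        for i in members:
            weights[i] += 1.0 / (len(members) * n_trees)
    return weights

def weighted_quantile(values, weights, q):
    """Smallest value whose cumulative normalized weight reaches q (0 < q <= 1)."""
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight / total
        if cumulative >= q - 1e-12:  # small tolerance for float round-off
            return value
    return pairs[-1][0]

# Invented example: five training responses and, for one query point, the
# training indices sharing its leaf in each of two trees.
y_train = [10.0, 20.0, 30.0, 40.0, 50.0]
leaves_for_query = [[0, 1], [1, 2]]

w = qrf_weights(leaves_for_query, len(y_train))
median = weighted_quantile(y_train, w, 0.5)
upper = weighted_quantile(y_train, w, 0.9)
print(w, median, upper)
```

    The spread between a low and a high quantile then serves as a prediction interval for the query point.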

  19. Aboveground Biomass and Dynamics of Forest Attributes using LiDAR Data and Vegetation Model

    NASA Astrophysics Data System (ADS)

    V V L, P. A.

    2015-12-01

    In recent years, biomass estimation for tropical forests has received much attention because regional biomass is considered to be a critical input to climate change. Biomass almost determines the potential carbon emission that could be released to the atmosphere due to deforestation or conversion to non-forest land use. Thus, accurate biomass estimation is necessary for better understanding of deforestation impacts on global warming and environmental degradation. In this context, forest stand height inclusion in biomass estimation plays a major role in reducing the uncertainty in the estimation of biomass. The improvement in the accuracy of biomass estimates shall also help in meeting the MRV objectives of REDD+. Along with the precise estimate of biomass, it is also important to emphasize the role of vegetation models that will most likely become an important tool for assessing the effects of climate change on potential vegetation dynamics and terrestrial carbon storage and for managing terrestrial ecosystem sustainability. Remote sensing is an efficient way to estimate forest parameters over large areas, especially at regional scale where field data are limited. LIDAR (Light Detection And Ranging) provides accurate information on the vertical structure of forests. We estimated average tree canopy heights and AGB from GLAS waveform parameters by using a multiple linear regression model in the forested area of Madhya Pradesh (area: 308,245 km2), India. The derived heights from ICESat-GLAS were correlated with field measured tree canopy heights for 60 plots. Results have shown a significant correlation of R2 = 74% for top canopy heights and R2 = 57% for stand biomass. The total biomass estimate (320.17 Mt) and canopy heights were generated using the random forest algorithm. These canopy heights and biomass maps were used in vegetation models to predict changes in the biophysical/physiological characteristics of the forest according to the changing climate.
In our study we used a Dynamic Global Vegetation Model to understand the possible vegetation dynamics in the event of climate change. The vegetation represents a biogeographic regime. Simulations were carried out for a 70-year period. The model produced leaf area index and biomass for each plant functional type and biome for each grid in that region.

  20. Fault Detection of Aircraft System with Random Forest Algorithm and Similarity Measure

    PubMed Central

    Park, Wookje; Jung, Sikhang

    2014-01-01

    A fault detection algorithm was developed using a similarity measure and the random forest algorithm. The resulting algorithm was applied to an unmanned aircraft vehicle (UAV) that we prepared. The similarity measure was designed using distance information, and its usefulness was verified by proof. The fault decision was made by calculating a weighted similarity measure. Twelve available coefficients from the healthy and faulty status data groups were used to make the decision. The similarity measure weights were obtained through the random forest algorithm (RFA); RF provides data priority. To obtain a fast decision response, a limited number of coefficients was also considered. The relation between detection rate and amount of feature data was analyzed and illustrated. By repeated trials of the similarity calculation, the useful amount of data was obtained. PMID:25057508

  1. Real time forest fire warning and forest fire risk zoning: a Vietnamese case study

    NASA Astrophysics Data System (ADS)

    Chu, T.; Pham, D.; Phung, T.; Ha, A.; Paschke, M.

    2016-12-01

    Forest fires are a serious problem in Vietnam and are considered one of the major causes of forest loss and degradation. Several studies of forest fire risk warning have been conducted using the Modified Nesterov Index (MNI), but shortcomings and inaccurate predictions remain and need to be addressed urgently. In our study, several important topographic and social factors such as aspect, slope, elevation, distance to residential areas and the road system were considered "permanent" factors, while meteorological data were updated hourly using near-real-time (NRT) remotely sensed data (i.e. MODIS Terra/Aqua and TRMM) for the prediction and warning of fire. Due to the limited number of weather stations in Vietnam, data from all active stations (i.e. 178) were used with the satellite data to calibrate and upscale meteorological variables. These data with finer resolution were then used to generate the MNI. Only the significant "permanent" factors were selected as input variables, based on the correlation coefficients computed from multi-variable regression between actual fire occurrences (collected from 1/2007) and their spatial characteristics. These coefficients were also used to suggest appropriate weights for computing the forest fire risk (FR) model. The forest fire risk model was calculated from the MNI and the selected factors using fuzzy regression models (FRMs) and GIS-based multi-criteria analysis. By this approach, the FR was slightly modified from the MNI by the integrated use of various factors in our fire warning and prediction model. Multifactor-based maps of forest fire risk zones were generated by classifying FR into three potential danger levels. Fire risk maps were displayed using WebGIS technology, which makes it easy to manage data and extract reports. Fires reported thereafter were used as true values for validating the forest fire risk.
Fire probability has a strong relationship with potential danger levels (varying from 5.3% to 53.8%), indicating that the higher the potential risk, the greater the chance of fire. By adding spatial factors to continuous, daily updated remote-sensing-based meteorological data, the results are valuable both for mapping forest fire risk zones in the short and long term and for real-time fire warning in Vietnam. Key words: Near-real-time, forest fire warning, fuzzy regression model, remote sensing.

  2. A primer on stand and forest inventory designs

    Treesearch

    H. Gyde Lund; Charles E. Thomas

    1989-01-01

    Covers designs for the inventory of stands and forests in detail and with worked-out examples. For stands, random sampling, line transects, ricochet plot, systematic sampling, single plot, cluster, subjective sampling and complete enumeration are discussed. For forest inventory, the main categories are subjective sampling, inventories without prior stand mapping,...

  3. A hybrid training approach for leaf area index estimation via Cubist and random forests machine-learning

    NASA Astrophysics Data System (ADS)

    Houborg, Rasmus; McCabe, Matthew F.

    2018-01-01

    With an increasing volume and dimensionality of Earth observation data, enhanced integration of machine-learning methodologies is needed to effectively analyze and utilize these information rich datasets. In machine-learning, a training dataset is required to establish explicit associations between a suite of explanatory 'predictor' variables and the target property. The specifics of this learning process can significantly influence model validity and portability, with a higher generalization level expected with an increasing number of observable conditions being reflected in the training dataset. Here we propose a hybrid training approach for leaf area index (LAI) estimation, which harnesses synergistic attributes of scattered in-situ measurements and systematically distributed physically based model inversion results to enhance the information content and spatial representativeness of the training data. To do this, a complementary training dataset of independent LAI was derived from a regularized model inversion of RapidEye surface reflectances and subsequently used to guide the development of LAI regression models via Cubist and random forests (RF) decision tree methods. The application of the hybrid training approach to a broad set of Landsat 8 vegetation index (VI) predictor variables resulted in significantly improved LAI prediction accuracies and spatial consistencies, relative to results relying on in-situ measurements alone for model training. In comparing the prediction capacity and portability of the two machine-learning algorithms, a pair of relatively simple multi-variate regression models established by Cubist performed best, with an overall relative mean absolute deviation (rMAD) of ∼11%, determined based on a stringent scene-specific cross-validation approach.
In comparison, the portability of RF regression models was less effective (i.e., an overall rMAD of ∼15%), which was attributed partly to model saturation at high LAI in association with inherent extrapolation and transferability limitations. Explanatory VIs formed from bands in the near-infrared (NIR) and shortwave infrared domains (e.g., NDWI) were associated with the highest predictive ability, whereas Cubist models relying entirely on VIs based on NIR and red band combinations (e.g., NDVI) were associated with comparatively high uncertainties (i.e., rMAD ∼ 21%). The most transferable and best performing models were based on combinations of several predictor variables, which included both NDWI- and NDVI-like variables. In this process, prior screening of input VIs based on an assessment of variable relevance served as an effective mechanism for optimizing prediction accuracies from both Cubist and RF. While this study demonstrated benefit in combining data mining operations with physically based constraints via a hybrid training approach, the concept of transferability and portability warrants further investigations in order to realize the full potential of emerging machine-learning techniques for regression purposes.

  4. Roosting habitat use and selection by northern spotted owls during natal dispersal

    USGS Publications Warehouse

    Sovern, Stan G.; Forsman, Eric D.; Dugger, Catherine M.; Taylor, Margaret

    2015-01-01

    We studied habitat selection by northern spotted owls (Strix occidentalis caurina) during natal dispersal in Washington State, USA, at both the roost site and landscape scales. We used logistic regression to obtain parameters for an exponential resource selection function based on vegetation attributes in roost and random plots in 76 forest stands that were used for roosting. We used a similar analysis to evaluate selection of landscape habitat attributes based on 301 radio-telemetry relocations and random points within our study area. We found no evidence of within-stand selection for any of the variables examined, but 78% of roosts were in stands with at least some large (>50 cm dbh) trees. At the landscape scale, owls selected for stands with high canopy cover (>70%). Dispersing owls selected vegetation types that were more similar to habitat selected by adult owls than habitat that would result from following guidelines previously proposed to maintain dispersal habitat. Our analysis indicates that juvenile owls select stands for roosting that have greater canopy cover than is recommended in current agency guidelines.

  5. Mapping leaf nitrogen and carbon concentrations of intact and fragmented indigenous forest ecosystems using empirical modeling techniques and WorldView-2 data

    NASA Astrophysics Data System (ADS)

    Omer, Galal; Mutanga, Onisimo; Abdel-Rahman, Elfatih M.; Peerbhay, Kabir; Adam, Elhadi

    2017-09-01

    Forest nitrogen (N) and carbon (C) are among the most important biochemical components of tree organic matter, and the estimation of their concentrations can help to monitor the nutrient uptake processes and health of forest trees. Traditionally, these tree biochemical components are estimated using costly, labour intensive, time-consuming and subjective analytical protocols. The use of very high spatial resolution multispectral data and advanced machine learning regression algorithms such as support vector machines (SVM) and artificial neural networks (ANN) provide an opportunity to accurately estimate foliar N and C concentrations over intact and fragmented forest ecosystems. In the present study, the utility of spectral vegetation indices calculated from WorldView-2 (WV-2) imagery for mapping leaf N and C concentrations of fragmented and intact indigenous forest ecosystems was explored. We collected leaf samples from six tree species in the fragmented as well as intact Dukuduku indigenous forest ecosystems. Leaf samples (n = 85 for each of the fragmented and intact forests) were subjected to chemical analysis for estimating the concentrations of N and C. We used 70% of samples for training our models and 30% for validating the accuracy of our predictive empirical models. The study showed that the N concentration was significantly higher (p = 0.03) in the intact forests than in the fragmented forest. There was no significant difference (p = 0.55) in the C concentration between the intact and fragmented forest strata. The results further showed that the foliar N and C concentrations could be more accurately estimated using the fragmented stratum data compared with the intact stratum data. 
Further, SVM achieved relatively more accurate N (maximum R2Val = 0.78 and minimum RMSEVal = 1.07% of the mean) and C (maximum R2Val = 0.67 and minimum RMSEVal = 1.64% of the mean) estimates compared with ANN (maximum R2Val = 0.70 for N and 0.51 for C and minimum RMSEVal = 5.40% of the mean for N and 2.21% of the mean for C). Overall, SVM regressions achieved more accurate models for estimating forest foliar N and C concentrations in the fragmented and intact indigenous forests compared to the ANN regression method. It is concluded that the successful application of the WV-2 data integrated with SVM can provide an accurate framework for mapping the concentrations of biochemical elements in two indigenous forest ecosystems.
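
    The validation statistics quoted above can be computed as in the sketch below, which shows a random 70/30 split of sample indices, R2 on the validation set, and RMSE expressed as a percentage of the observed mean. It assumes R2 is the coefficient of determination (1 - SSres/SStot); the abstract does not say which definition was used, and the function names and toy numbers are hypothetical.

```python
import math
import random

def split_indices(n, train_fraction=0.7, seed=0):
    """Shuffle 0..n-1 and cut into training and validation index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = round(n * train_fraction)
    return idx[:cut], idx[cut:]

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def rmse_percent_of_mean(observed, predicted):
    """Root mean squared error as a percentage of the observed mean."""
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    return 100.0 * math.sqrt(mse) / (sum(observed) / len(observed))

# Invented validation values (e.g. foliar N in %), not data from the study.
observed = [2.0, 4.0]
predicted = [3.0, 3.0]
train_idx, val_idx = split_indices(10)
print(len(train_idx), len(val_idx))
print(r_squared(observed, predicted), rmse_percent_of_mean(observed, predicted))
```

    Reporting RMSE relative to the mean makes errors comparable between variables with different magnitudes, such as N and C concentrations.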

  6. Predicting tree species presence and basal area in Utah: A comparison of stochastic gradient boosting, generalized additive models, and tree-based methods

    Treesearch

    Gretchen G. Moisen; Elizabeth A. Freeman; Jock A. Blackard; Tracey S. Frescino; Niklaus E. Zimmermann; Thomas C. Edwards

    2006-01-01

    Many efforts are underway to produce broad-scale forest attribute maps by modelling forest class and structure variables collected in forest inventories as functions of satellite-based and biophysical information. Typically, variants of classification and regression trees implemented in Rulequest's© See5 and Cubist (for binary and continuous responses,...

  7. Predicting nest success from habitat features in aspen forests of the central Rocky Mountains

    Treesearch

    Heather M. Struempf; Deborah M. Finch; Gregory Hayward; Stanley Anderson

    2001-01-01

    We collected nesting data on bird use of aspen stands in the Routt and Medicine Bow National Forests between 1987 and 1989. We found active nest sites of 28 species of small nongame birds on nine study plots in undisturbed aspen forests. We compared logistic regression models predicting nest success (at least one nestling) from nest-site or stand-level habitat...

  8. Allocating Fire Mitigation Funds on the Basis of the Predicted Probabilities of Forest Wildfire

    Treesearch

    Ronald E. McRoberts; Greg C. Liknes; Mark D. Nelson; Krista M. Gebert; R. James Barbour; Susan L. Odell; Steven C. Yaddof

    2005-01-01

    A logistic regression model was used with map-based information to predict the probability of forest fire for forested areas of the United States. Model parameters were estimated using a digital layer depicting the locations of wildfires and satellite imagery depicting thermal hotspots. The area of the United States in the upper 50th percentile with respect to...

  9. Use of Forest Inventory and Analysis information in wildlife habitat modeling: a process for linking multiple scales

    Treesearch

    Thomas C. Edwards; Gretchen G. Moisen; Tracey S. Frescino; Joshua L. Lawler

    2002-01-01

    We describe our collective efforts to develop and apply methods for using FIA data to model forest resources and wildlife habitat. Our work demonstrates how flexible regression techniques, such as generalized additive models, can be linked with spatially explicit environmental information for the mapping of forest type and structure. We illustrate how these maps of...

  10. Assessment of wastewater treatment facility compliance with decreasing ammonia discharge limits using a regression tree model.

    PubMed

    Suchetana, Bihu; Rajagopalan, Balaji; Silverstein, JoAnn

    2017-11-15

    A regression tree-based diagnostic approach is developed to evaluate factors affecting US wastewater treatment plant compliance with ammonia discharge permit limits using Discharge Monthly Report (DMR) data from a sample of 106 municipal treatment plants for the period of 2004-2008. Predictor variables used to fit the regression tree are selected using random forests, and consist of the previous month's effluent ammonia, influent flow rates and plant capacity utilization. The tree models are first used to evaluate compliance with existing ammonia discharge standards at each facility and then applied assuming more stringent discharge limits, under consideration in many states. The model predicts that the ability to meet both current and future limits depends primarily on the previous month's treatment performance. With more stringent discharge limits, predicted ammonia concentration relative to the discharge limit increases. In-sample validation shows that the regression trees can provide a median classification accuracy of >70%. The regression tree model is validated using ammonia discharge data from an operating wastewater treatment plant and is able to accurately predict the observed ammonia discharge category approximately 80% of the time, indicating that the regression tree model can be applied to predict compliance for individual treatment plants providing practical guidance for utilities and regulators with an interest in controlling ammonia discharges. The proposed methodology is also used to demonstrate how to delineate reliable sources of demand and supply in a point source-to-point source nutrient credit trading scheme, as well as how planners and decision makers can set reasonable discharge limits in the future. Copyright © 2017 Elsevier B.V. All rights reserved.

  11. Forest community classification of the Porcupine River drainage, interior Alaska, and its application to forest management.

    Treesearch

    John Yarie

    1983-01-01

    The forest vegetation of 3,600,000 hectares in northeast interior Alaska was classified. A total of 365 plots located in a stratified random design were run through the ordination programs SIMORD and TWINSPAN. A total of 40 forest communities were described vegetatively and, to a limited extent, environmentally. The area covered by each community was similar, ranging...

  12. Experimental Design Considerations for Establishing an Off-Road, Habitat-Specific Bird Monitoring Program Using Point Counts

    Treesearch

    JoAnn M. Hanowski; Gerald J. Niemi

    1995-01-01

    We established bird monitoring programs in two regions of Minnesota: the Chippewa National Forest and the Superior National Forest. The experimental design defined forest cover types as strata in which samples of forest stands were randomly selected. Subsamples (3 point counts) were placed in each stand to maximize field effort and to assess within-stand and between-...

  13. Predicting live and dead tree basal area of bark beetle affected forests from discrete-return lidar

    Treesearch

    Benjamin C. Bright; Andrew T. Hudak; Robert McGaughey; Hans-Erik Andersen; Jose Negron

    2013-01-01

    Bark beetle outbreaks have killed large numbers of trees across North America in recent years. Lidar remote sensing can be used to effectively estimate forest biomass, but prediction of both live and dead standing biomass in beetle-affected forests using lidar alone has not been demonstrated. We developed Random Forest (RF) models predicting total, live, dead, and...

  14. Valuing the Recreational Benefits from the Creation of Nature Reserves in Irish Forests

    Treesearch

    Riccardo Scarpa; Susan M. Chilton; W. George Hutchinson; Joseph Buongiorno

    2000-01-01

    Data from a large-scale contingent valuation study are used to investigate the effects of forest attributes on willingness to pay for forest recreation in Ireland. In particular, the presence of a nature reserve in the forest is found to significantly increase the visitors' willingness to pay. A random utility model is used to estimate the welfare change associated...

  15. Evaluating effectiveness of down-sampling for stratified designs and unbalanced prevalence in Random Forest models of tree species distributions in Nevada

    Treesearch

    Elizabeth A. Freeman; Gretchen G. Moisen; Tracy S. Frescino

    2012-01-01

    Random Forests is frequently used to model species distributions over large geographic areas. Complications arise when data used to train the models have been collected in stratified designs that involve different sampling intensity per stratum. The modeling process is further complicated if some of the target species are relatively rare on the landscape leading to an...

  16. Unbiased feature selection in learning random forests for high-dimensional data.

    PubMed

    Nguyen, Thanh-Tung; Huang, Joshua Zhexue; Nguyen, Thuy Thi

    2015-01-01

    Random forests (RFs) have been widely used as a powerful classification method. However, with the randomization in both bagging samples and feature selection, the trees in the forest tend to select uninformative features for node splitting. As a result, RFs have poor accuracy when working with high-dimensional data. In addition, RFs are biased in the feature selection process, favoring multivalued features. Aiming at debiasing feature selection in RFs, we propose a new RF algorithm, called xRF, to select good features in learning RFs for high-dimensional data. We first remove the uninformative features using p-value assessment, and the subset of unbiased features is then selected based on some statistical measures. This feature subset is then partitioned into two subsets. A feature weighting sampling technique is used to sample features from these two subsets for building trees. This approach enables one to generate more accurate trees, while allowing one to reduce dimensionality and the amount of data needed for learning RFs. An extensive set of experiments has been conducted on 47 high-dimensional real-world datasets including image datasets. The experimental results have shown that RFs with the proposed approach outperformed the existing random forests in increasing the accuracy and the AUC measures.

  17. A random forest learning assisted "divide and conquer" approach for peptide conformation search.

    PubMed

    Chen, Xin; Yang, Bing; Lin, Zijing

    2018-06-11

    Computational determination of peptide conformations is challenging as it is a problem of finding minima in a high-dimensional space. The "divide and conquer" approach is promising for reliably reducing the search space size. A random forest learning model is proposed here to expand the scope of applicability of the "divide and conquer" approach. A random forest classification algorithm is used to characterize the distributions of the backbone φ-ψ units ("words"). A random forest supervised learning model is developed to analyze the combinations of the φ-ψ units ("grammar"). It is found that amino acid residues may be grouped as equivalent "words", while the φ-ψ combinations in low-energy peptide conformations follow a distinct "grammar". The finding of equivalent words empowers the "divide and conquer" method with the flexibility of fragment substitution. The learnt grammar is used to improve the efficiency of the "divide and conquer" method by removing unfavorable φ-ψ combinations without the need of dedicated human effort. The machine learning assisted search method is illustrated by efficiently searching the conformations of GGG/AAA/GGGG/AAAA/GGGGG through assembling the structures of GFG/GFGG. Moreover, the computational cost of the new method is shown to increase rather slowly with the peptide length.

  18. Estimating and mapping forest biomass using regression models and Spot-6 images (case study: Hyrcanian forests of north of Iran).

    PubMed

    Motlagh, Mohadeseh Ghanbari; Kafaky, Sasan Babaie; Mataji, Asadollah; Akhavan, Reza

    2018-05-21

    Hyrcanian forests of North of Iran are of great importance in terms of various economic and environmental aspects. In this study, Spot-6 satellite images and regression models were applied to estimate above-ground biomass in these forests. This research was carried out in six compartments in three climatic (semi-arid to humid) types and two altitude classes. In the first step, ground sampling methods at the compartment level were used to estimate aboveground biomass (Mg/ha). Then, by reviewing the results of other studies, the most appropriate vegetation indices were selected. In this study, three indices of NDVI, RVI, and TVI were calculated. We investigated the relationship between the vegetation indices and aboveground biomass measured at sample-plot level. Based on the results, the relationship between aboveground biomass values and vegetation indices was a linear regression with the highest level of significance for NDVI in all compartments. Since at the compartment level the correlation coefficient between NDVI and aboveground biomass was the highest, NDVI was used for mapping aboveground biomass. According to the results of this study, biomass values were highly different in various climatic and altitudinal classes with the highest biomass value observed in humid climate and high-altitude class.
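
    As a rough illustration of the workflow described in this abstract, the sketch below computes NDVI and RVI from near-infrared and red reflectance using their standard definitions, and fits a one-variable least-squares line such as biomass against NDVI. The function names and any input values are hypothetical and are not taken from the study; TVI is omitted because more than one definition is in use.

```python
def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    total = nir + red
    if total == 0:
        raise ValueError("NIR + Red must be non-zero")
    return (nir - red) / total


def rvi(nir: float, red: float) -> float:
    """Ratio Vegetation Index: NIR / Red."""
    if red == 0:
        raise ValueError("Red reflectance must be non-zero")
    return nir / red


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept (illustrative helper)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

    In practice the per-plot NDVI values would be paired with field-measured biomass and passed to a fit such as `fit_line`; the study itself may have used a different regression procedure.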

  19. Tropical forest plantation biomass estimation using RADARSAT-SAR and TM data of south china

    NASA Astrophysics Data System (ADS)

    Wang, Chenli; Niu, Zheng; Gu, Xiaoping; Guo, Zhixing; Cong, Pifu

    2005-10-01

    Forest biomass is one of the most important parameters for global carbon stock models, yet it can only be estimated with great uncertainty. Remote sensing, especially SAR data, offers the possibility of providing relatively accurate forest biomass estimates at a lower cost than inventory in tropical forest studies. The goal of this research was to compare the sensitivity of forest biomass to Landsat TM and RADARSAT-SAR data and to assess the efficiency of NDVI, EVI and other vegetation indices in the study of forest biomass, based on field survey data and GIS in South China. Based on vegetation indices and factor analysis, multiple regression and neural networks were developed for biomass estimation for each species of the plantation. For each species, the better relationship between predicted biomass and biomass measured in the field survey was obtained with a neural network developed for that species. The relationship between predicted and measured biomass derived from vegetation indices differed between species. This study concludes that single bands and many vegetation indices are weakly correlated with the selected forest biomass. The RADARSAT-SAR backscatter coefficient has a relatively good logarithmic correlation with forest biomass, but neither TM spectral bands nor vegetation indices alone are sufficient to establish an efficient model for biomass estimation because of the saturation of bands and vegetation indices; multiple regression models that combine spectral and environmental variables improve biomass estimation performance. Compared with TM, relatively good estimation results can be achieved with RADARSAT-SAR, but all had limitations in tropical forest biomass estimation. The estimation results obtained are not accurate enough for forest management purposes at the forest stand level. However, the approximate volume estimates derived by the method can be useful in areas where no other forest information is available. Therefore, this paper provides a better understanding of the relationships between remote sensing data and the forest stand parameters used in forest parameter estimation models.

  20. Seasonal forecasting of hydrological drought in the Limpopo Basin: a comparison of statistical methods

    NASA Astrophysics Data System (ADS)

    Seibert, Mathias; Merz, Bruno; Apel, Heiko

    2017-03-01

    The Limpopo Basin in southern Africa is prone to droughts which affect the livelihood of millions of people in South Africa, Botswana, Zimbabwe and Mozambique. Seasonal drought early warning is thus vital for the whole region. In this study, the predictability of hydrological droughts during the main runoff period from December to May is assessed using statistical approaches. Three methods (multiple linear models, artificial neural networks, random forest regression trees) are compared in terms of their ability to forecast streamflow with up to 12 months of lead time. The following four main findings result from the study. 1. There are stations in the basin at which standardised streamflow is predictable with lead times up to 12 months. The results show high inter-station differences of forecast skill but reach a coefficient of determination as high as 0.73 (cross validated). 2. A large range of potential predictors is considered in this study, comprising well-established climate indices, customised teleconnection indices derived from sea surface temperatures and antecedent streamflow as a proxy of catchment conditions. El Niño and customised indices, representing sea surface temperature in the Atlantic and Indian oceans, prove to be important teleconnection predictors for the region. Antecedent streamflow is a strong predictor in small catchments (with median 42 % explained variance), whereas teleconnections exert a stronger influence in large catchments. 3. Multiple linear models show the best forecast skill in this study and the greatest robustness compared to artificial neural networks and random forest regression trees, despite their capabilities to represent nonlinear relationships. 4. Employed in early warning, the models can be used to forecast a specific drought level. Even if the coefficient of determination is low, the forecast models have a skill better than a climatological forecast, which is shown by analysis of receiver operating characteristics (ROCs). Seasonal statistical forecasts in the Limpopo show promising results, and thus it is recommended to employ them as complementary to existing forecasts in order to strengthen preparedness for droughts.

  1. SU-D-204-06: Integration of Machine Learning and Bioinformatics Methods to Analyze Genome-Wide Association Study Data for Rectal Bleeding and Erectile Dysfunction Following Radiotherapy in Prostate Cancer

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Oh, J; Deasy, J; Kerns, S

    Purpose: We investigated whether integration of machine learning and bioinformatics techniques on genome-wide association study (GWAS) data can improve the performance of predictive models in predicting the risk of developing radiation-induced late rectal bleeding and erectile dysfunction in prostate cancer patients. Methods: We analyzed a GWAS dataset generated from 385 prostate cancer patients treated with radiotherapy. Using genotype information from these patients, we designed a machine learning-based predictive model of late radiation-induced toxicities: rectal bleeding and erectile dysfunction. The model building process was performed using 2/3 of samples (training) and the predictive model was tested with 1/3 of samples (validation). To identify important single nucleotide polymorphisms (SNPs), we computed the SNP importance score, resulting from our random forest regression model. We performed gene ontology (GO) enrichment analysis for nearby genes of the important SNPs. Results: After univariate analysis on the training dataset, we filtered out many SNPs with p>0.001, resulting in 749 and 367 SNPs that were used in the model building process for rectal bleeding and erectile dysfunction, respectively. On the validation dataset, our random forest regression model achieved the area under the curve (AUC)=0.70 and 0.62 for rectal bleeding and erectile dysfunction, respectively. We performed GO enrichment analysis for the top 25%, 50%, 75%, and 100% SNPs out of the select SNPs in the univariate analysis. When we used the top 50% SNPs, more plausible biological processes were obtained for both toxicities. An additional test with the top 50% SNPs improved predictive power with AUC=0.71 and 0.65 for rectal bleeding and erectile dysfunction. A better performance was achieved with AUC=0.67 when age and androgen deprivation therapy were added to the model for erectile dysfunction. Conclusion: Our approach that combines machine learning and bioinformatics techniques enabled designing better models and identifying more plausible biological processes associated with the outcomes.

  2. Predicting Survival From Large Echocardiography and Electronic Health Record Datasets: Optimization With Machine Learning.

    PubMed

    Samad, Manar D; Ulloa, Alvaro; Wehner, Gregory J; Jing, Linyuan; Hartzel, Dustin; Good, Christopher W; Williams, Brent A; Haggerty, Christopher M; Fornwalt, Brandon K

    2018-06-09

    The goal of this study was to use machine learning to more accurately predict survival after echocardiography. Predicting patient outcomes (e.g., survival) following echocardiography is primarily based on ejection fraction (EF) and comorbidities. However, there may be significant predictive information within additional echocardiography-derived measurements combined with clinical electronic health record data. Mortality was studied in 171,510 unselected patients who underwent 331,317 echocardiograms in a large regional health system. We investigated the predictive performance of nonlinear machine learning models compared with that of linear logistic regression models using 3 different inputs: 1) clinical variables, including 90 cardiovascular-relevant International Classification of Diseases, Tenth Revision, codes, and age, sex, height, weight, heart rate, blood pressures, low-density lipoprotein, high-density lipoprotein, and smoking; 2) clinical variables plus physician-reported EF; and 3) clinical variables and EF, plus 57 additional echocardiographic measurements. Missing data were imputed with a multivariate imputation by using a chained equations algorithm (MICE). We compared models versus each other and baseline clinical scoring systems by using a mean area under the curve (AUC) over 10 cross-validation folds and across 10 survival durations (6 to 60 months). Machine learning models achieved significantly higher prediction accuracy (all AUC >0.82) over common clinical risk scores (AUC = 0.61 to 0.79), with the nonlinear random forest models outperforming logistic regression (p < 0.01). The random forest model including all echocardiographic measurements yielded the highest prediction accuracy (p < 0.01 across all models and survival durations). Only 10 variables were needed to achieve 96% of the maximum prediction accuracy, with 6 of these variables being derived from echocardiography. Tricuspid regurgitation velocity was more predictive of survival than LVEF. In a subset of studies with complete data for the top 10 variables, multivariate imputation by chained equations yielded slightly reduced predictive accuracies (difference in AUC of 0.003) compared with the original data. Machine learning can fully utilize large combinations of disparate input variables to predict survival after echocardiography with superior accuracy. Copyright © 2018 American College of Cardiology Foundation. Published by Elsevier Inc. All rights reserved.

  3. Machine-learning-based classification of real-time tissue elastography for hepatic fibrosis in patients with chronic hepatitis B.

    PubMed

    Chen, Yang; Luo, Yan; Huang, Wei; Hu, Die; Zheng, Rong-Qin; Cong, Shu-Zhen; Meng, Fan-Kun; Yang, Hong; Lin, Hong-Jun; Sun, Yan; Wang, Xiu-Yan; Wu, Tao; Ren, Jie; Pei, Shu-Fang; Zheng, Ying; He, Yun; Hu, Yu; Yang, Na; Yan, Hongmei

    2017-10-01

    Hepatic fibrosis is a common middle stage of the pathological processes of chronic liver diseases. Clinical intervention during the early stages of hepatic fibrosis can slow the development of liver cirrhosis and reduce the risk of developing liver cancer. Performing a liver biopsy, the gold standard for viral liver disease management, has drawbacks such as invasiveness and a relatively high sampling error rate. Real-time tissue elastography (RTE), one of the most recently developed technologies, might be a promising imaging technology because it is noninvasive and provides accurate assessments of hepatic fibrosis. However, determining the stage of liver fibrosis from RTE images in a clinic is a challenging task. In this study, in contrast to the previous liver fibrosis index (LFI) method, which predicts the stage of diagnosis using RTE images and multiple regression analysis, we employed four classical classifiers (i.e., Support Vector Machine, Naïve Bayes, Random Forest and K-Nearest Neighbor) to build a decision-support system to improve the hepatitis B stage diagnosis performance. Eleven RTE image features were obtained from 513 subjects who underwent liver biopsies in this multicenter collaborative research. The experimental results showed that the adopted classifiers significantly outperformed the LFI method and that the Random Forest (RF) classifier provided the highest average accuracy among the four machine learning algorithms. This result suggests that sophisticated machine-learning methods can be powerful tools for evaluating the stage of hepatic fibrosis and show promise for clinical applications. Copyright © 2017 Elsevier Ltd. All rights reserved.

  4. Thallium in flowering cabbage and lettuce: Potential health risks for local residents of the Pearl River Delta, South China.

    PubMed

    Yu, Huan-Yun; Chang, Chunying; Li, Fangbai; Wang, Qi; Chen, Manjia; Zhang, Jie

    2018-06-08

    Thallium (Tl), a rare metal, is universally present in the environment with high toxicity and accumulation. Thallium's behavior and fate require further study, especially in the Pearl River Delta (PRD), where severe Tl pollution incidents have occurred. One hundred two pairs of soil and flowering cabbage samples and 91 pairs of soil and lettuce samples were collected from typical farmland protection areas and vegetable bases across the PRD, South China. The contamination levels and spatial distributions of soil and vegetable (flowering cabbages and lettuces) Tl across the PRD were investigated. The relative contributions of soil properties to the bioavailability of Tl in vegetables were evaluated using random forest. Random forest is an accurate learning algorithm and is superior to conventional and correlation-based regression analyses. In addition, the health risks posed by Tl exposure via vegetable intake for residents of the PRD were assessed. The results indicated that rapidly available potassium (K) and total K in soil were the most important factors affecting Tl bioavailability, and the competitive effect of rapidly available K on vegetable Tl uptake was confirmed in this field study. Soil weathering also contributed substantially to Tl accumulation in the vegetables. In contrast, organic matter might not be a major factor affecting the mobility of Tl in most of the lettuce soils. Fe and manganese (Mn) oxides also contributed little to the bioavailability of Tl. A risk assessment suggested that the health risks for Tl exposure through flowering cabbage or lettuce intake were minimal. Copyright © 2018 Elsevier Ltd. All rights reserved.

  5. The use of random forests in modelling short-term air pollution effects based on traffic and meteorological conditions: A case study in Wrocław.

    PubMed

    Kamińska, Joanna A

    2018-07-01

    Random forests, an advanced data mining method, are used here to model the regression relationships between concentrations of the pollutants NO2, NOx and PM2.5, and nine variables describing meteorological conditions, temporal conditions and traffic flow. The study was based on hourly values of wind speed, wind direction, temperature, air pressure and relative humidity, temporal variables, and finally traffic flow, in the two years 2015 and 2016. An air quality measurement station was selected on a main road, located a short distance (40 m) from a large intersection equipped with a traffic flow measurement system. Nine different time subsets were defined, based among other things on the climatic conditions in Wrocław. An analysis was made of the fit of models created for those subsets, and of the importance of the predictors. Both the fit and the importance of particular predictors were found to be dependent on season. The best fit was obtained for models created for the six-month warm season (April-September) and for the summer season (June-August). The most important explanatory variable in the models of concentrations of nitrogen oxides was traffic flow, while in the case of PM2.5 the most important were meteorological conditions, in particular temperature, wind speed and wind direction. Temporal variables (except for month in the case of PM2.5) were found to have no significant effect on the concentrations of the studied pollutants. Copyright © 2018 Elsevier Ltd. All rights reserved.

  6. Lower Respiratory Tract Infection and Short-Term Outcome in Patients With Acute Respiratory Distress Syndrome.

    PubMed

    Zampieri, Fernando G; Póvoa, Pedro; Salluh, Jorge I; Rodriguez, Alejandro; Valade, Sandrine; Andrade Gomes, José; Reignier, Jean; Molinos, Elena; Almirall, Jordi; Boussekey, Nicolas; Socias, Lorenzo; Ramirez, Paula; Viana, William N; Rouzé, Anahita; Nseir, Saad; Martin-Loeches, Ignacio

    2018-01-01

    To assess whether ventilator-associated lower respiratory tract infections (VA-LRTIs) are associated with mortality in critically ill patients with acute respiratory distress syndrome (ARDS). Post hoc analysis of a prospective cohort study including mechanically ventilated patients from a multicenter prospective observational study (TAVeM study); VA-LRTI was defined as either ventilator-associated tracheobronchitis (VAT) or ventilator-associated pneumonia (VAP) based on clinical criteria and microbiological confirmation. Association between intensive care unit (ICU) mortality in patients having ARDS with and without VA-LRTI was assessed through logistic regression controlling for relevant confounders. Association between VA-LRTI and duration of mechanical ventilation and ICU stay was assessed through competing risk analysis. Contribution of VA-LRTI to a mortality model over time was assessed through sequential random forest models. The cohort included 2960 patients of which 524 fulfilled criteria for ARDS; 21% had VA-LRTI (VAT = 10.3% and VAP = 10.7%). After controlling for illness severity and baseline health status, we could not find an association between VA-LRTI and ICU mortality (odds ratio: 1.07; 95% confidence interval: 0.62-1.83; P = .796); VA-LRTI was also not associated with prolonged ICU length of stay or duration of mechanical ventilation. The relative contribution of VA-LRTI to the random forest mortality model remained constant over time. The attributable VA-LRTI mortality for ARDS was higher than the attributable mortality for VA-LRTI alone. After controlling for relevant confounders, we could not find an association between occurrence of VA-LRTI and ICU mortality in patients with ARDS.

  7. What variables are important in predicting bovine viral diarrhea virus? A random forest approach.

    PubMed

    Machado, Gustavo; Mendoza, Mariana Recamonde; Corbellini, Luis Gustavo

    2015-07-24

    Bovine viral diarrhea virus (BVDV) causes one of the most economically important diseases in cattle, and the virus is found worldwide. A better understanding of the disease-associated factors is a crucial step towards the definition of strategies for control and eradication. In this study we trained a random forest (RF) prediction model and performed variable importance analysis to identify factors associated with BVDV occurrence. In addition, we assessed the influence of feature selection on RF performance and evaluated its predictive power relative to other popular classifiers and to logistic regression. We found that the RF classification model resulted in an average error rate of 32.03% for the negative class (negative for BVDV) and 36.78% for the positive class (positive for BVDV). The RF model presented an area under the ROC curve equal to 0.702. Variable importance analysis revealed that important predictors of BVDV occurrence were: a) who inseminates the animals, b) number of neighboring farms that have cattle and c) rectal palpation performed routinely. Our results suggest that the use of machine learning algorithms, especially RF, is a promising methodology for the analysis of cross-sectional studies, presenting a satisfactory predictive power and the ability to identify predictors that represent potential risk factors for BVDV investigation. We examined classical predictors and found some new and hard-to-control practices that may lead to the spread of this disease within and among farms, mainly regarding poor or neglected reproduction management, which should be considered for disease control and eradication.

  8. Development of an automated assessment tool for MedWatch reports in the FDA adverse event reporting system.

    PubMed

    Han, Lichy; Ball, Robert; Pamer, Carol A; Altman, Russ B; Proestel, Scott

    2017-09-01

    As the US Food and Drug Administration (FDA) receives over a million adverse event reports associated with medication use every year, a system is needed to aid FDA safety evaluators in identifying reports most likely to demonstrate causal relationships to the suspect medications. We combined text mining with machine learning to construct and evaluate such a system to identify medication-related adverse event reports. FDA safety evaluators assessed 326 reports for medication-related causality. We engineered features from these reports and constructed random forest, L1 regularized logistic regression, and support vector machine models. We evaluated model accuracy and further assessed utility by generating report rankings that represented a prioritized report review process. Our random forest model showed the best performance in report ranking and accuracy, with an area under the receiver operating characteristic curve of 0.66. The generated report ordering assigns reports with a higher probability of medication-related causality a higher rank and is significantly correlated to a perfect report ordering, with a Kendall's tau of 0.24 (P = .002). Our models produced prioritized report orderings that enable FDA safety evaluators to focus on reports that are more likely to contain valuable medication-related adverse event information. Applying our models to all FDA adverse event reports has the potential to streamline the manual review process and greatly reduce reviewer workload. Published by Oxford University Press on behalf of the American Medical Informatics Association 2017. This work is written by US Government employees and is in the public domain in the United States.
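
    The rank agreement reported in this abstract is measured with Kendall's tau. The sketch below shows the simplest variant (tau-a), which counts concordant and discordant pairs between two orderings; the function name is hypothetical, tied pairs are counted as neither, and the study may well have used a tie-corrected variant instead.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Pairs tied in either sequence count as neither concordant nor discordant.
    """
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("need at least two observations")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

    A value of 1 means the model ordering matches the reference ordering exactly, 0 means no association, and -1 means the ordering is fully reversed.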

  9. Random Forests to Predict Rectal Toxicity Following Prostate Cancer Radiation Therapy

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ospina, Juan D.; INSERM, U1099, Rennes; Escuela de Estadística, Universidad Nacional de Colombia Sede Medellín, Medellín

    2014-08-01

    Purpose: To propose a random forest normal tissue complication probability (RF-NTCP) model to predict late rectal toxicity following prostate cancer radiation therapy, and to compare its performance to that of classic NTCP models. Methods and Materials: Clinical data and dose-volume histograms (DVH) were collected from 261 patients who received 3-dimensional conformal radiation therapy for prostate cancer with at least 5 years of follow-up. The series was split 1000 times into training and validation cohorts. A RF was trained to predict the risk of 5-year overall rectal toxicity and bleeding. Parameters of the Lyman-Kutcher-Burman (LKB) model were identified and a logistic regression model was fit. The performance of all the models was assessed by computing the area under the receiver operating characteristic curve (AUC). Results: The 5-year grade ≥2 overall rectal toxicity and grade ≥1 and grade ≥2 rectal bleeding rates were 16%, 25%, and 10%, respectively. Predictive capabilities were obtained using the RF-NTCP model for all 3 toxicity endpoints, including both the training and validation cohorts. The age and use of anticoagulants were found to be predictors of rectal bleeding. The AUC for RF-NTCP ranged from 0.66 to 0.76, depending on the toxicity endpoint. The AUC values for the LKB-NTCP were statistically significantly inferior, ranging from 0.62 to 0.69. Conclusions: The RF-NTCP model may be a useful new tool in predicting late rectal toxicity, including variables other than DVH, and thus appears as a strong competitor to classic NTCP models.
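
    Several records in this listing compare models by AUC. As a minimal sketch of what that number means, the function below computes AUC from its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name and the 0/1 label encoding are assumptions made for illustration, not details from the study.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels are assumed to be 1 (event) or 0 (no event); ties count as 0.5.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

    This pairwise form is quadratic in the number of cases, which is acceptable for cohorts of a few hundred patients; larger datasets would normally use a sort-based implementation.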

  10. Optimal Symmetric Multimodal Templates and Concatenated Random Forests for Supervised Brain Tumor Segmentation (Simplified) with ANTsR.

    PubMed

    Tustison, Nicholas J; Shrinidhi, K L; Wintermark, Max; Durst, Christopher R; Kandel, Benjamin M; Gee, James C; Grossman, Murray C; Avants, Brian B

    2015-04-01

    Segmenting and quantifying gliomas from MRI is an important task for diagnosis, planning intervention, and for tracking tumor changes over time. However, this task is complicated by the lack of prior knowledge concerning tumor location, spatial extent, shape, possible displacement of normal tissue, and intensity signature. To accommodate such complications, we introduce a framework for supervised segmentation based on multiple modality intensity, geometry, and asymmetry feature sets. These features drive a supervised whole-brain and tumor segmentation approach based on random forest-derived probabilities. The asymmetry-related features (based on optimal symmetric multimodal templates) demonstrate excellent discriminative properties within this framework. We also gain performance by generating probability maps from random forest models and using these maps for a refining Markov random field regularized probabilistic segmentation. This strategy allows us to interface the supervised learning capabilities of the random forest model with regularized probabilistic segmentation using the recently developed ANTsR package--a comprehensive statistical and visualization interface between the popular Advanced Normalization Tools (ANTs) and the R statistical project. The reported algorithmic framework was the top-performing entry in the MICCAI 2013 Multimodal Brain Tumor Segmentation challenge. The challenge data were widely varying consisting of both high-grade and low-grade glioma tumor four-modality MRI from five different institutions. Average Dice overlap measures for the final algorithmic assessment were 0.87, 0.78, and 0.74 for "complete", "core", and "enhanced" tumor components, respectively.
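
    The Dice overlap quoted at the end of this abstract compares a predicted segmentation with a reference one as twice the size of their intersection divided by the sum of their sizes. A minimal sketch on flattened binary masks follows; the function name and the convention of returning 1.0 when both masks are empty are assumptions for illustration.

```python
def dice_overlap(mask_a, mask_b):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two flattened binary masks."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of voxels")
    size_a = sum(1 for v in mask_a if v)
    size_b = sum(1 for v in mask_b if v)
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    if size_a + size_b == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / (size_a + size_b)
```

    Values range from 0 (no overlap) to 1 (identical masks), so the reported 0.87 for the "complete" tumor component indicates substantial but imperfect agreement.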

  11. Local-scale spatial modelling for interpolating climatic temperature variables to predict agricultural plant suitability

    NASA Astrophysics Data System (ADS)

    Webb, Mathew A.; Hall, Andrew; Kidd, Darren; Minansy, Budiman

    2016-05-01

    Assessment of local spatial climatic variability is important in the planning of planting locations for horticultural crops. This study investigated three regression-based calibration methods (i.e. traditional versus two optimized methods) to relate short-term 12-month data series from 170 temperature loggers and 4 weather station sites with data series from nearby long-term Australian Bureau of Meteorology climate stations. The techniques trialled to interpolate climatic temperature variables, such as frost risk, growing degree days (GDDs) and chill hours, were regression kriging (RK), regression trees (RTs) and random forests (RFs). All three calibration methods produced accurate results, with the RK-based calibration method delivering the most accurate validation measures: coefficients of determination (R²) of 0.92, 0.97 and 0.95 and root-mean-square errors of 1.30, 0.80 and 1.31 °C, for daily minimum, daily maximum and hourly temperatures, respectively. Compared with the traditional method of calibration using direct linear regression between short-term and long-term stations, the RK-based calibration method improved R² and reduced root-mean-square error (RMSE) by at least 5 % and 0.47 °C for daily minimum temperature, 1 % and 0.23 °C for daily maximum temperature and 3 % and 0.33 °C for hourly temperature. Spatial modelling indicated insignificant differences between the interpolation methods, with the RK technique tending to be the slightly better method due to the high degree of spatial autocorrelation between logger sites.
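    The "traditional" baseline in this abstract is a direct linear regression between a short-term logger series and a long-term station series, scored with the coefficient of determination and RMSE. The sketch below illustrates that baseline only, not the RK-based method. The temperature values are invented for illustration and the function names are assumptions.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rmse(observed, predicted):
    """Root-mean-square error, in the units of the observations."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


# Hypothetical daily minimum temperatures in °C (illustrative values only).
station = [2.0, 4.5, 6.1, 8.3, 10.0]   # long-term climate station
logger = [1.1, 3.9, 5.2, 7.8, 9.1]     # short-term temperature logger

slope, intercept = fit_line(station, logger)
calibrated = [slope * s + intercept for s in station]
print(r_squared(logger, calibrated), rmse(logger, calibrated))
```

    Once fitted, the same slope and intercept can be applied to the station's long-term record to estimate what the logger site would have recorded over that period.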

  12. Data mining: Potential applications in research on nutrition and health.

    PubMed

    Batterham, Marijka; Neale, Elizabeth; Martin, Allison; Tapsell, Linda

    2017-02-01

    Data mining enables further insights from nutrition-related research, but caution is required. The aim of this analysis was to demonstrate and compare the utility of data mining methods in classifying a categorical outcome derived from a nutrition-related intervention. Baseline data (23 variables, 8 categorical) on participants (n = 295) in an intervention trial were used to classify participants in terms of meeting the criteria of achieving 10 000 steps per day. Results from classification and regression trees (CARTs), random forests, adaptive boosting, logistic regression, support vector machines and neural networks were compared using area under the curve (AUC) and error assessments. The CART produced the best model when considering the AUC (0.703), overall error (18%) and within class error (28%). Logistic regression also performed reasonably well compared to the other models (AUC 0.675, overall error 23%, within class error 36%). All the methods gave different rankings of variables' importance. CART found that body fat, quality of life using the SF-12 Physical Component Summary (PCS) and the cholesterol: HDL ratio were the most important predictors of meeting the 10 000 steps criteria, while logistic regression showed the SF-12PCS, glucose levels and level of education to be the most significant predictors (P ≤ 0.01). Differing outcomes suggest caution is required with a single data mining method, particularly in a dataset with nonlinear relationships and outliers and when exploring relationships that were not the primary outcomes of the research. © 2017 Dietitians Association of Australia.
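    The models above are ranked by area under the ROC curve (AUC). A minimal sketch of that metric follows, using its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The labels and scores are made up for illustration and do not come from the trial.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical outcome (1 = met the 10 000 steps criterion) and model scores.
met_target = [1, 1, 0, 1, 0, 0]
model_scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.2]
print(auc(met_target, model_scores))
```

    The pairwise loop is quadratic in the number of cases, which is acceptable for a dataset of a few hundred participants. A sort-based implementation would be preferable for large samples.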

  13. Intra-and-Inter Species Biomass Prediction in a Plantation Forest: Testing the Utility of High Spatial Resolution Spaceborne Multispectral RapidEye Sensor and Advanced Machine Learning Algorithms

    PubMed Central

    Dube, Timothy; Mutanga, Onisimo; Adam, Elhadi; Ismail, Riyad

    2014-01-01

    The quantification of aboveground biomass using remote sensing is critical for better understanding the role of forests in carbon sequestration and for informed sustainable management. Although remote sensing techniques have been proven useful in assessing forest biomass in general, more is required to investigate their capabilities in predicting intra-and-inter species biomass, which are mainly characterised by non-linear relationships. In this study, we tested two machine learning algorithms, Stochastic Gradient Boosting (SGB) and Random Forest (RF) regression trees, to predict intra-and-inter species biomass using high resolution RapidEye reflectance bands as well as the derived vegetation indices in a commercial plantation. The results showed that the SGB algorithm yielded the best performance for intra-and-inter species biomass prediction, both when using all the predictor variables and when using only the most important selected variables. For example, using the most important variables the algorithm produced an R² of 0.80 and RMSE of 16.93 t·ha−1 for E. grandis; R² of 0.79, RMSE of 17.27 t·ha−1 for P. taeda and R² of 0.61, RMSE of 43.39 t·ha−1 for the combined species data sets. Comparatively, RF yielded plausible results only for E. dunnii (R² of 0.79; RMSE of 7.18 t·ha−1). We demonstrated that although the two statistical methods were able to predict biomass accurately, RF produced weaker results as compared to SGB when applied to the combined species dataset. The result underscores the relevance of stochastic models in predicting biomass drawn from different species and genera using the new generation high resolution RapidEye sensor with strategically positioned bands. PMID:25140631

  14. Relations between fish abundances, summer temperatures, and forest harvest in a northern Minnesota stream system from 1997 to 2007

    USGS Publications Warehouse

    Merten, Eric C.; Hemstad, Nathaniel A.; Eggert, S.L.; Johnson, L.B.; Kolka, Randall K.; Newman, Raymond M.; Vondracek, Bruce C.

    2010-01-01

    Short-term effects of forest harvest on fish habitat have been well documented, including sediment inputs, leaf litter reductions, and stream warming. However, few studies have considered changes in local climate when examining postlogging changes in fish communities. To address this need, we examined fish abundances between 1997 and 2007 in a basin in a northern hardwood forest. Streams in the basin were subjected to experimental riparian forest harvest in fall 1997. We noted a significant decrease for fish index of biotic integrity and abundance of Salvelinus fontinalis and Phoxinus eos over the study period. However, for P. eos and Culaea inconstans, the temporal patterns in abundances were related more to summer air temperatures than to fine sediment or spring precipitation when examined using multiple regressions. Univariate regressions suggested that summer air temperatures influenced temporal patterns in fish communities more than fine sediment or spring precipitation.

  15. High resolution satellite remote sensing used in a stratified random sampling scheme to quantify the constituent land cover components of the shifting cultivation mosaic of the Democratic Republic of Congo

    NASA Astrophysics Data System (ADS)

    Molinario, G.; Hansen, M.; Potapov, P.

    2016-12-01

    High resolution satellite imagery obtained from the National Geospatial Intelligence Agency through NASA was used to photo-interpret sample areas within the DRC. The area sampled is a stratification of the forest cover loss from circa 2014 that occurred either completely within the previously mapped homogeneous area of the Rural Complex, at its interface with primary forest, or in isolated forest perforations. Previous research resulted in a map of these areas that contextualizes forest loss depending on where it occurs and with what spatial density, leading to a better understanding of the real impacts of livelihood shifting cultivation on forest degradation. The stratified random sampling approach of these areas allows the characterization of the constituent land cover types within these areas, and their variability throughout the DRC. Shifting cultivation has a variable forest degradation footprint in the DRC depending on the many factors that drive it, but its role in forest degradation and deforestation had been disputed, leading us to investigate and quantify the clearing and reuse rates within the strata throughout the country.

  16. Modeling forest site productivity using mapped geospatial attributes within a South Carolina Landscape, USA

    DOE PAGES

    Parresol, B. R.; Scott, D. A.; Zarnoch, S. J.; ...

    2017-12-15

    Spatially explicit mapping of forest productivity is important to assess many forest management alternatives. We assessed the relationship between mapped variables and site index of forests ranging from southern pine plantations to natural hardwoods on a 74,000-ha landscape in South Carolina, USA. Mapped features used in the analysis were soil association, land use condition in 1951, depth to groundwater, slope and aspect. Basal area, species composition, age and height were the tree variables measured. Linear modelling identified that plot basal area, depth to groundwater, soils association and the interactions between depth to groundwater and forest group, and between land use in 1951 and forest group were related to site index (SI) (R² = 0.37), but this model had regression attenuation. We then used structural equation modeling to incorporate error-in-measurement corrections for basal area and groundwater to remove bias in the model. We validated this model using 89 independent observations and found the 95% confidence intervals for the slope and intercept of an observed vs. predicted site index error-corrected regression included zero and one, respectively, indicating a good fit. With error in measurement incorporated, only basal area, soil association, and the interaction between forest groups and land use were important predictors (R² = 0.57). Thus, we were able to develop an unbiased model of SI that could be applied to create a spatially explicit map based primarily on soils as modified by past (land use and forest type) and recent forest management (basal area).

  17. Retrieval of forest biomass for tropical deciduous mixed forest using ALOS PALSAR mosaic imagery and field plot data

    NASA Astrophysics Data System (ADS)

    Ningthoujam, Ramesh K.; Joshi, P. K.; Roy, P. S.

    2018-07-01

    Tropical forest is an important ecosystem rich in biodiversity and structural complexity with high woody biomass content. Longer wavelength radar data at L-band sensor provides improved forest biomass (AGB) information due to its higher penetration level and sensitivity to canopy structure. The study presents a regression based woody biomass estimation for tropical deciduous mixed forest dominated by Shorea robusta using ALOS PALSAR mosaic (HH, HV) and field data at the lower Himalayan belt of Northern India. For the purpose of understanding the scattering mechanisms at L-band from this forest type, Michigan Microwave Canopy Scattering model (MIMICS-I) was parameterized with field data to simulate backscatter across polarization and incidence range. Regression analysis between field measured forest biomass and L-band backscatter data from PALSAR mosaic show retrieval of woody biomass up to 100 Mg ha-1 with error between 92 and 94 Mg ha-1 and coefficient of determination (r2) between 0.53 and 0.55 for HH and HH + HV polarized channel at 0.25 ha resolution. This positive relationship could be due to strong volume scattering from ground/trunk interaction at HH-polarized while in combination with direct canopy scattering for HV-polarization at ALOS specific incidence angles as predicted by MIMICS-I model. This study has found that L-band SAR data from currently ALOS-1/-2 and upcoming joint NASA-ISRO SAR (NISAR) are suitable for mapping forest biomass ≤100 Mg ha-1 at 25 m resolution in far incidence range in dense deciduous mixed forest of Northern India.

  18. Modeling forest site productivity using mapped geospatial attributes within a South Carolina Landscape, USA

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Parresol, B. R.; Scott, D. A.; Zarnoch, S. J.

    Spatially explicit mapping of forest productivity is important to assess many forest management alternatives. We assessed the relationship between mapped variables and site index of forests ranging from southern pine plantations to natural hardwoods on a 74,000-ha landscape in South Carolina, USA. Mapped features used in the analysis were soil association, land use condition in 1951, depth to groundwater, slope and aspect. Basal area, species composition, age and height were the tree variables measured. Linear modelling identified that plot basal area, depth to groundwater, soils association and the interactions between depth to groundwater and forest group, and between land use in 1951 and forest group were related to site index (SI) (R² = 0.37), but this model had regression attenuation. We then used structural equation modeling to incorporate error-in-measurement corrections for basal area and groundwater to remove bias in the model. We validated this model using 89 independent observations and found the 95% confidence intervals for the slope and intercept of an observed vs. predicted site index error-corrected regression included zero and one, respectively, indicating a good fit. With error in measurement incorporated, only basal area, soil association, and the interaction between forest groups and land use were important predictors (R² = 0.57). Thus, we were able to develop an unbiased model of SI that could be applied to create a spatially explicit map based primarily on soils as modified by past (land use and forest type) and recent forest management (basal area).

  19. Methods for calculating confidence and credible intervals for the residual between-study variance in random effects meta-regression models

    PubMed Central

    2014-01-01

    Background: Meta-regression is becoming increasingly used to model study level covariate effects. However, this type of statistical analysis presents many difficulties and challenges. Here two methods for calculating confidence intervals for the magnitude of the residual between-study variance in random effects meta-regression models are developed. A further suggestion for calculating credible intervals using informative prior distributions for the residual between-study variance is presented. Methods: Two recently proposed and, under the assumptions of the random effects model, exact methods for constructing confidence intervals for the between-study variance in random effects meta-analyses are extended to the meta-regression setting. The use of Generalised Cochran heterogeneity statistics is extended to the meta-regression setting and a Newton-Raphson procedure is developed to implement the Q profile method for meta-analysis and meta-regression. WinBUGS is used to implement informative priors for the residual between-study variance in the context of Bayesian meta-regressions. Results: Results are obtained for two contrasting examples, where the first example involves a binary covariate and the second involves a continuous covariate. Intervals for the residual between-study variance are wide for both examples. Conclusions: Statistical methods, and R computer software, are available to compute exact confidence intervals for the residual between-study variance under the random effects model for meta-regression. These frequentist methods are almost as easily implemented as their established counterparts for meta-analysis. Bayesian meta-regressions are also easily performed by analysts who are comfortable using WinBUGS. Estimates of the residual between-study variance in random effects meta-regressions should be routinely reported and accompanied by some measure of their uncertainty. Confidence and/or credible intervals are well-suited to this purpose. PMID:25196829

  20. The increasing importance of small-scale forestry: evidence from family forest ownership patterns in the United States

    Treesearch

    Y. Zhang; X. Liao; B.J. Butler; J. Schelhas

    2009-01-01

    The state-level distribution of the size of family forest holdings in the contiguous United States was examined using data collected by the USDA Forest Service in 1993 and 2003. Regression models were used to analyze the factors influencing the mean size and structural variation among states and between the two periods. Population density, percent of the population at...

  1. Research on electricity consumption forecast based on mutual information and random forests algorithm

    NASA Astrophysics Data System (ADS)

    Shi, Jing; Shi, Yunli; Tan, Jian; Zhu, Lei; Li, Hu

    2018-02-01

    Traditional power forecasting models can neither efficiently take various factors into account nor identify which factors are relevant. In this paper, mutual information from information theory and the random forests algorithm are applied to medium- and long-term electricity demand prediction. Mutual information identifies the most relevant factors from the average mutual information between each candidate variable and electricity demand; different industries may be strongly associated with different variables. The random forests algorithm was then used to build a forecasting model for each industry from its own relevant factors. Electricity consumption data from Jiangsu Province are taken as a practical example, and the proposed method is compared with methods that ignore mutual information and industry differences. The simulation results show that the proposed method is sound and effective and provides higher prediction accuracy.
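    The abstract does not say how mutual information was estimated. The sketch below assumes the simplest case: variables already discretized into categories, and the plug-in estimate I(X;Y) = Σ p(x,y) log2[p(x,y) / (p(x) p(y))] computed from observed frequencies. The factor names and data are hypothetical, and continuous variables would first need binning or a different estimator.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y).
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


def rank_factors(factors, target):
    """Return (name, MI) pairs sorted from most to least informative."""
    scored = [(name, mutual_information(values, target)) for name, values in factors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical discretized observations for one industry (illustration only).
demand = ["low", "low", "high", "high", "high", "low"]
candidate_factors = {
    "output_level": ["low", "low", "high", "high", "high", "low"],
    "season": ["hot", "cold", "hot", "cold", "hot", "cold"],
}
for name, score in rank_factors(candidate_factors, demand):
    print(name, round(score, 3))
```

    Only the top-ranked factors would then be passed to the forecasting model for that industry, and the ranking would be repeated separately for each industry.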

  2. Habitat area requirements of breeding forest birds of the middle Atlantic states

    USGS Publications Warehouse

    Robbins, Chandler S.; Dawson, Deanna K.; Dowell, Barbara A.

    1989-01-01

    Conservation of birds requires an understanding of their nesting requirements, including area as well as structural characteristics of the habitat. Previous studies have shown that many neotropical migrant bird species seem to depend on extensive forested areas, but the specific area requirements of individual species have not been clarified sufficiently to aid in design and management of effective preserves. For this 5-year study, bird and vegetation data were obtained at 469 points in forests ranging in area from 0.1 ha to more than 3,000 ha in Maryland and adjacent states. Data were analyzed first by stepwise regression to identify habitat factors that had the greatest influence on relative abundance of each bird species. In the relatively undisturbed mature forests studied, degree of isolation and area were significant predictors of relative abundance for more bird species than were any habitat variables. For species for which forest area was a significant predictor of abundance, we used logistic regression to examine the relationship between forest area and the probability of detecting the species. In managing forest lands for wildlife, top priority should go toward providing for the needs of area-sensitive or rare species rather than increasing species diversity per se. Avian species that occur in small and disturbed forests are generalists that are adapted to survival under edge conditions and need no special assistance from man. Forest reserves with thousands of hectares are required to have the highest probability of providing for the least common species of forest birds in a region. However, if preservation of large contiguous forest tracts is not a realistic option, results of this study suggest 2 alternative approaches. First, if other habitat attributes also are considered, smaller forests may provide suitable breeding sites for relatively rare species. 
Second, smaller tracts in close proximity to other forests may serve to attract or retain area-sensitive species.

  3. Modeling and mapping abundance of American Woodcock across the Midwestern and Northeastern United States

    USGS Publications Warehouse

    Thogmartin, W.E.; Sauer, J.R.; Knutson, M.G.

    2007-01-01

    We used an over-dispersed Poisson regression with fixed and random effects, fitted by Markov chain Monte Carlo methods, to model population spatial patterns of relative abundance of American woodcock (Scolopax minor) across its breeding range in the United States. We predicted North American woodcock Singing Ground Survey counts with a log-linear function of explanatory variables describing habitat, year effects, and observer effects. The model also included a conditional autoregressive term representing potential correlation between adjacent route counts. Categories of explanatory habitat variables in the model included land-cover composition, climate, terrain heterogeneity, and human influence. Woodcock counts were higher in landscapes with more forest, especially aspen (Populus tremuloides) and birch (Betula spp.) forest, and in locations with a high degree of interspersion among forest, shrubs, and grasslands. Woodcock counts were lower in landscapes with a high degree of human development. The most noteworthy practical application of this spatial modeling approach was the ability to map predicted relative abundance. Based on a map of predicted relative abundance derived from the posterior parameter estimates, we identified major concentrations of woodcock abundance in east-central Minnesota, USA, the intersection of Vermont, USA, New York, USA, and Ontario, Canada, the upper peninsula of Michigan, USA, and St. Lawrence County, New York. The functional relations we elucidated for the American woodcock provide a basis for the development of management programs and the model and map may serve to focus management and monitoring on areas and habitat features important to American woodcock.

  4. Forest Biomass Mapping From Lidar and Radar Synergies

    NASA Technical Reports Server (NTRS)

    Sun, Guoqing; Ranson, K. Jon; Guo, Z.; Zhang, Z.; Montesano, P.; Kimes, D.

    2011-01-01

    The use of lidar and radar instruments to measure forest structure attributes such as height and biomass at global scales is being considered for a future Earth Observation satellite mission, DESDynI (Deformation, Ecosystem Structure, and Dynamics of Ice). Large footprint lidar makes a direct measurement of the heights of scatterers in the illuminated footprint and can yield accurate information about the vertical profile of the canopy within lidar footprint samples. Synthetic Aperture Radar (SAR) is known to sense the canopy volume, especially at longer wavelengths and provides image data. Methods for biomass mapping by a combination of lidar sampling and radar mapping need to be developed. In this study, several issues in this respect were investigated using aircraft borne lidar and SAR data in Howland, Maine, USA. The stepwise regression selected the height indices rh50 and rh75 of the Laser Vegetation Imaging Sensor (LVIS) data for predicting field measured biomass with an R² of 0.71 and RMSE of 31.33 Mg/ha. The above-ground biomass map generated from this regression model was considered to represent the true biomass of the area and used as a reference map since no better biomass map exists for the area. Random samples were taken from the biomass map and the correlation between the sampled biomass and co-located SAR signature was studied. The best models were used to extend the biomass from lidar samples into all forested areas in the study area, which mimics a procedure that could be used for the future DESDynI Mission. It was found that depending on the data types used (quad-pol or dual-pol) the SAR data can predict the lidar biomass samples with R² of 0.63-0.71, RMSE of 32.0-28.2 Mg/ha up to biomass levels of 200-250 Mg/ha. The mean biomass of the study area calculated from the biomass maps generated by lidar-SAR synergy was within 10% of the reference biomass map derived from LVIS data. The results from this study are preliminary, but do show the potential of the combined use of lidar samples and radar imagery for forest biomass mapping. Various issues regarding lidar/radar data synergies for biomass mapping are discussed in the paper.

  5. Fire risk in California

    NASA Astrophysics Data System (ADS)

    Peterson, Seth Howard

    Fire is an integral part of ecosystems in the western United States. Decades of fire suppression have led to (unnaturally) large accumulations of fuel in some forest communities, such as the lower elevation forests of the Sierra Nevada. Urban sprawl into fire prone chaparral vegetation in southern California has put human lives at risk and the decreased fire return intervals have put the vegetation community at risk of type conversion. This research examines the factors affecting fire risk in two of the dominant landscapes in the state of California, chaparral and inland coniferous forests. Live fuel moisture (LFM) is important for fire ignition, spread rate, and intensity in chaparral. LFM maps were generated for Los Angeles County by developing and then inverting robust cross-validated regression equations from time series field data and vegetation indices (VIs) and phenological metrics from MODIS data. Fire fuels, including understory fuels which are not visible to remote sensing instruments, were mapped in Yosemite National Park using the random forests decision tree algorithm and climatic, topographic, remotely sensed, and fire history variables. Combining the disparate data sources served to improve classification accuracies. The models were inverted to produce maps of fuel models and fuel amounts, and these showed that fire fuel amounts are highest in the low elevation forests that have been most affected by fire suppression impacting the natural fire regime. Wildland fires in chaparral commonly burn in late summer or fall when LFM is near its annual low, however, the Jesusita Fire burned in early May of 2009, when LFM was still relatively high. The HFire fire spread model was used to simulate the growth of the Jesusita Fire using LFM maps derived from imagery acquired at the time of the fire and imagery acquired in late August to determine how much different the fire would have been if it had occurred later in the year. 
Simulated fires were 1.5 times larger, and the fire reached the wildland urban interface three hours earlier, when using August LFM.

  6. An electronic health record based model predicts statin adherence, LDL cholesterol, and cardiovascular disease in the United States Military Health System

    PubMed Central

    Lucas, Joseph E.; Bazemore, Taylor C.; Alo, Celan; Monahan, Patrick B.

    2017-01-01

    HMG-CoA reductase inhibitors (or “statins”) are important and commonly used medications to lower cholesterol and prevent cardiovascular disease. Nearly half of patients stop taking statin medications one year after they are prescribed leading to higher cholesterol, increased cardiovascular risk, and costs due to excess hospitalizations. Identifying which patients are at highest risk for not adhering to long-term statin therapy is an important step towards individualizing interventions to improve adherence. Electronic health records (EHR) are an increasingly common source of data that are challenging to analyze but have potential for generating more accurate predictions of disease risk. The aim of this study was to build an EHR based model for statin adherence and link this model to biologic and clinical outcomes in patients receiving statin therapy. We gathered EHR data from the Military Health System which maintains administrative data for active duty, retirees, and dependents of the United States armed forces military that receive health care benefits. Data were gathered from patients prescribed their first statin prescription in 2005 and 2006. Baseline billing, laboratory, and pharmacy claims data were collected from the two years leading up to the first statin prescription and summarized using non-negative matrix factorization. Follow up statin prescription refill data was used to define the adherence outcome (> 80 percent days covered). The subsequent factors to emerge from this model were then used to build cross-validated, predictive models of 1) overall disease risk using coalescent regression and 2) statin adherence (using random forest regression). The predicted statin adherence for each patient was subsequently used to correlate with cholesterol lowering and hospitalizations for cardiovascular disease during the 5 year follow up period using Cox regression. 
The analytical dataset included 138 731 individuals and 1840 potential baseline predictors that were reduced to 30 independent EHR “factors”. A random forest predictive model taking patient, statin prescription, predicted disease risk, and the EHR factors as potential inputs produced a cross-validated c-statistic of 0.736 for classifying statin non-adherence. The addition of the first refill to the model increased the c-statistic to 0.81. The predicted statin adherence was independently associated with greater cholesterol lowering (correlation = 0.14, p < 1e-20) and lower hospitalization for myocardial infarction, coronary artery disease, and stroke (hazard ratio = 0.84, p = 1.87E-06). Electronic health records data can be used to build a predictive model of statin adherence that also correlates with statins’ cardiovascular benefits. PMID:29155848

  7. Spatially explicit estimation of aboveground boreal forest biomass in the Yukon River Basin, Alaska

    USGS Publications Warehouse

    Ji, Lei; Wylie, Bruce K.; Brown, Dana R. N.; Peterson, Birgit E.; Alexander, Heather D.; Mack, Michelle C.; Rover, Jennifer R.; Waldrop, Mark P.; McFarland, Jack W.; Chen, Xuexia; Pastick, Neal J.

    2015-01-01

    Quantification of aboveground biomass (AGB) in Alaska’s boreal forest is essential to the accurate evaluation of terrestrial carbon stocks and dynamics in northern high-latitude ecosystems. Our goal was to map AGB at 30 m resolution for the boreal forest in the Yukon River Basin of Alaska using Landsat data and ground measurements. We acquired Landsat images to generate a 3-year (2008–2010) composite of top-of-atmosphere reflectance for six bands as well as the brightness temperature (BT). We constructed a multiple regression model using field-observed AGB and Landsat-derived reflectance, BT, and vegetation indices. A basin-wide boreal forest AGB map at 30 m resolution was generated by applying the regression model to the Landsat composite. The fivefold cross-validation with field measurements had a mean absolute error (MAE) of 25.7 Mg ha−1 (relative MAE 47.5%) and a mean bias error (MBE) of 4.3 Mg ha−1 (relative MBE 7.9%). The boreal forest AGB product was compared with lidar-based vegetation height data; the comparison indicated that there was a significant correlation between the two data sets.
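    The error figures above come from fivefold cross-validation. The sketch below shows how MAE, MBE and their relative forms could be computed from held-out predictions, under three assumptions the abstract does not confirm: the relative metrics are normalized by the mean observed AGB, bias is defined as predicted minus observed, and folds are assigned by a simple interleaved split. The AGB values are invented, and the "model" is a stand-in that predicts the training-fold mean.

```python
def mean_absolute_error(observed, predicted):
    return sum(abs(p - o) for o, p in zip(observed, predicted)) / len(observed)


def mean_bias_error(observed, predicted):
    # Positive values mean over-prediction on average (sign convention assumed).
    return sum(p - o for o, p in zip(observed, predicted)) / len(observed)


def relative_to_mean(error, observed):
    """Express an error as a percentage of the mean observed value (assumed normalization)."""
    return 100.0 * error / (sum(observed) / len(observed))


def k_fold_indices(n, k):
    """Yield (train, test) index lists; fold i holds out every k-th sample starting at i."""
    for i in range(k):
        test = list(range(i, n, k))
        held_out = set(test)
        yield [j for j in range(n) if j not in held_out], test


# Hypothetical field AGB values in Mg/ha (illustration only).
agb = [12.0, 48.0, 75.0, 30.0, 96.0, 54.0, 21.0, 66.0, 39.0, 87.0]

observed, predicted = [], []
for train, test in k_fold_indices(len(agb), 5):
    # Stand-in model: a real analysis would fit the regression on Landsat predictors here.
    train_mean = sum(agb[j] for j in train) / len(train)
    for j in test:
        observed.append(agb[j])
        predicted.append(train_mean)

mae = mean_absolute_error(observed, predicted)
mbe = mean_bias_error(observed, predicted)
print(mae, relative_to_mean(mae, observed), mbe, relative_to_mean(mbe, observed))
```

    Pooling the held-out predictions from all folds before computing the metrics, as done here, is one common choice. Averaging per-fold metrics is another and can give slightly different numbers.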

  8. Effects of the amount and composition of the forest floor on emergence and early establishment of loblolly pine seedlings

    Treesearch

    Michael G. Shelton

    1995-01-01

    Five forest floor weights (0, 10, 20, 30, and 40 Mg/ha), three forest floor compositions (pine, pine-hardwood, and hardwood), and two seed placements (forest floor and soil surface) were tested in a three-factorial, split-plot design with four incomplete, randomized blocks. The experiment was conducted in a nursery setting and used wooden frames to define 0.145-m

  9. Forest-floor disturbance reduces chipmunk (Tamias spp.) abundance two years after variable-retention harvest of Pacific Northwestern forests

    Treesearch

    Randall J. Wilk; Timothy B. Harrington; Robert A. Gitzen; Chris C. Maguire

    2015-01-01

    We evaluated the two-year effects of variable-retention harvest on chipmunk (Tamias spp.) abundance (N^) and habitat in mature coniferous forests in western Oregon and Washington because wildlife responses to density/pattern of retained trees remain largely unknown. In a randomized complete-block design, six...

  10. Highlights of the national evaluation of the Forest Stewardship Planning Program

    Treesearch

    R.J. Moulton; J.D. Esseks

    2001-01-01

    In 1998 and 1999, a nationwide random sample of 1238 nonindustrial private forest (NIPF) landowners with approved multiple resource Forest Stewardship Plans were interviewed to determine if this program is meeting its Congressional mandate of promoting sustainable management of forest resources on NIPF ownerships. It was found that two-thirds of program participants had never...

  11. Ownership and ecosystem as sources of spatial heterogeneity in a forested landscape, Wisconsin, USA

    Treesearch

    Thomas R. Crow; George E. Host; David J. Mladenoff

    1999-01-01

    The interaction between physical environment and land ownership in creating spatial heterogeneity was studied in largely forested landscapes of northern Wisconsin, USA. A stratified random approach was used in which 2500-ha plots representing two ownerships (National Forest and private non-industrial) were located within two regional ecosystems (extremely well-drained...

  12. Prediction of forest fires occurrences with area-level Poisson mixed models.

    PubMed

    Boubeta, Miguel; Lombardía, María José; Marey-Pérez, Manuel Francisco; Morales, Domingo

    2015-05-01

    The number of fires in forest areas of Galicia (north-west of Spain) during the summer period is quite high. Local authorities are interested in analyzing the factors that explain this phenomenon. Poisson regression models are good tools for describing and predicting the number of fires per forest areas. This work employs area-level Poisson mixed models for treating real data about fires in forest areas. A parametric bootstrap method is applied for estimating the mean squared errors of fires predictors. The developed methodology and software are applied to a real data set of fires in forest areas of Galicia. Copyright © 2015 Elsevier Ltd. All rights reserved.

  13. Advanced Subspace Techniques for Modeling Channel and Session Variability in a Speaker Recognition System

    DTIC Science & Technology

    2012-03-01

    with each SVM discriminating between a pair of the N total speakers in the data set. The (( + 1))/2 classifiers then vote on the final...classification of a test sample. The Random Forest classifier is an ensemble classifier that votes amongst decision trees generated with each node using...Forest vote, and the effects of overtraining will be mitigated by the fact that each decision tree is overtrained differently (due to the random

  14. A random forest algorithm for nowcasting of intense precipitation events

    NASA Astrophysics Data System (ADS)

    Das, Saurabh; Chakraborty, Rohit; Maitra, Animesh

    2017-09-01

    Automatic nowcasting of convective initiation and thunderstorms has potential applications in several sectors including aviation planning and disaster management. In this paper, a random forest based machine learning algorithm is tested for nowcasting of convective rain with a ground based radiometer. Brightness temperatures measured at 14 frequencies (7 frequencies in the 22-31 GHz band and 7 frequencies in the 51-58 GHz band) are utilized as the inputs of the model. The lower frequency band is associated with water vapor absorption whereas the upper frequency band relates to oxygen absorption; together they provide information on the temperature and humidity of the atmosphere. Synthetic minority over-sampling technique is used to balance the data set and 10-fold cross validation is used to assess the performance of the model. Results indicate that the random forest algorithm with fixed alarm generation times of 30 min and 60 min performs quite well (probability of detection of all types of weather condition ∼90%) with low false alarms. It is, however, also observed that reducing the alarm generation time improves the threat score significantly and also decreases false alarms. The proposed model is found to be very sensitive to the boundary layer instability as indicated by the variable importance measure. The study shows the suitability of a random forest algorithm for nowcasting applications utilizing a large number of input parameters from diverse sources; the approach can be utilized in other forecasting problems.
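
    The skill measures named above (probability of detection, false alarms, threat score) can be illustrated with the standard forecast-verification definitions. The abstract does not give its formulas, so this sketch assumes the usual contingency-table forms and reads "false alarms" as the false alarm ratio. The function name and the counts are hypothetical.

```python
def verification_scores(hits, misses, false_alarms):
    """Return (POD, FAR, threat score) from contingency-table counts.

    POD = hits / (hits + misses)
    FAR = false_alarms / (hits + false_alarms)   # false alarm ratio
    threat score (CSI) = hits / (hits + misses + false_alarms)
    """
    if min(hits, misses, false_alarms) < 0:
        raise ValueError("counts must be non-negative")
    if hits + misses == 0 or hits + false_alarms == 0:
        raise ValueError("need at least one observed event and one alarm")
    pod = hits / (hits + misses)
    far = false_alarms / (hits + false_alarms)
    threat = hits / (hits + misses + false_alarms)
    return pod, far, threat


# Hypothetical counts, chosen only so that POD comes out at 0.9.
pod, far, threat = verification_scores(hits=90, misses=10, false_alarms=30)
print(f"POD={pod:.2f} FAR={far:.2f} threat score={threat:.2f}")
```

    Because the threat score penalizes both misses and false alarms, it can improve even when the probability of detection is unchanged, which matches the abstract's remark that a shorter alarm generation time raises the threat score while lowering false alarms.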

  15. A regional-scale survey and analysis of forest growth and mortality as affected by site and stand factors and acidic deposition

    Treesearch

    Robert T. Brooks

    1994-01-01

    Regression analyses were used to identify factors most closely related to species growth and mortality on continuous forest survey plots in Pennsylvania. In 1985, 200 plots with two prior measurements (in the 1960s and 1970s) were selected and measured for a third time to determine periodic forest growth and mortality rates. Growth and mortality were analyzed for...

  16. A preliminary comparison of Landsat Thematic Mapper and SPOT-1 HRV multispectral data for estimating coniferous forest volume

    NASA Technical Reports Server (NTRS)

    Ripple, W. J.; Wang, S.; Isaacson, D. L.; Paine, D. P.

    1991-01-01

    Digital Landsat Thematic Mapper (TM) and SPOT high-resolution visible (HRV) images of coniferous forest canopies were compared in their relationship to forest wood volume using correlation and regression analyses. Significant inverse relationships were found between softwood volume and the spectral bands from both sensors (P less than 0.01). The highest correlations were between the log of softwood volume and the near-infrared bands.

  17. The contribution of competition to tree mortality in old-growth coniferous forests

    USGS Publications Warehouse

    Das, A.; Battles, J.; Stephenson, N.L.; van Mantgem, P.J.

    2011-01-01

    Competition is a well-documented contributor to tree mortality in temperate forests, with numerous studies documenting a relationship between tree death and the competitive environment. Models frequently rely on competition as the only non-random mechanism affecting tree mortality. However, for mature forests, competition may cease to be the primary driver of mortality. We use a large, long-term dataset to study the importance of competition in determining tree mortality in old-growth forests on the western slope of the Sierra Nevada of California, U.S.A. We make use of the comparative spatial configuration of dead and live trees, changes in tree spatial pattern through time, and field assessments of contributors to an individual tree's death to quantify competitive effects. Competition was apparently a significant contributor to tree mortality in these forests. Trees that died tended to be in more competitive environments than trees that survived, and suppression frequently appeared as a factor contributing to mortality. On the other hand, based on spatial pattern analyses, only three of 14 plots demonstrated compelling evidence that competition was dominating mortality. Most of the rest of the plots fell within the expectation for random mortality, and three fit neither the random nor the competition model. These results suggest that while competition is often playing a significant role in tree mortality processes in these forests it only infrequently governs those processes. In addition, the field assessments indicated a substantial presence of biotic mortality agents in trees that died. While competition is almost certainly important, demographics in these forests cannot accurately be characterized without a better grasp of other mortality processes. In particular, we likely need a better understanding of biotic agents and their interactions with one another and with competition.

  18. Predicting the rate of change in timber value for forest stands infested with gypsy moth

    Treesearch

    David A. Gansner; Owen W. Herrick

    1982-01-01

    Presents a method for estimating the potential impact of gypsy moth attacks on forest-stand value. Robust regression analysis is used to develop an equation for predicting the rate of change in timber value from easy-to-measure key characteristics of stand condition.

  19. System identification principles in studies of forest dynamics.

    Treesearch

    Rolfe A. Leary

    1970-01-01

    Shows how it is possible to obtain governing equation parameter estimates on the basis of observed system states. The approach used represents a constructive alternative to regression techniques for models expressed as differential equations. This approach allows scientists to more completely quantify knowledge of forest development processes, to express theories in...

  20. Estimating total forest biomass in Maine, 1995

    Treesearch

    Eric H. Wharton; Douglas M. Griffith

    1998-01-01

    Presents methods for synthesizing information from existing biomass literature for estimating biomass over extensive forest areas with specific applications to Maine. Tables of appropriate regression equations and the tree and shrub species to which these equations can be applied are presented as well as biomass estimates at the county and state level.

  1. Environmental factors affecting understory diversity in second-growth deciduous forests

    Treesearch

    Cynthia D. Huebner; J.C. Randolph; G.R. Parker

    1995-01-01

    The purpose of this study was to determine the most important nonanthropogenic factors affecting understory (herbs, shrubs and low-growing vines) diversity in forested landscapes of southern Indiana. Fourteen environmental variables were measured for 46 sites. Multiple regression analysis showed significant positive correlation between understory diversity and tree...

  2. Random forest learning of ultrasonic statistical physics and object spaces for lesion detection in 2D sonomammography

    NASA Astrophysics Data System (ADS)

    Sheet, Debdoot; Karamalis, Athanasios; Kraft, Silvan; Noël, Peter B.; Vag, Tibor; Sadhu, Anup; Katouzian, Amin; Navab, Nassir; Chatterjee, Jyotirmoy; Ray, Ajoy K.

    2013-03-01

    Breast cancer is the most common form of cancer in women. Early diagnosis can significantly improve life expectancy and allow different treatment options. Clinicians favor 2D ultrasonography for breast tissue abnormality screening due to high sensitivity and specificity compared to competing technologies. However, inter- and intra-observer variability in visual assessment and reporting of lesions often handicaps its performance. Existing Computer Assisted Diagnosis (CAD) systems, though able to detect solid lesions, are often restricted in performance. These restrictions are the inability to (1) detect lesions of multiple sizes and shapes, and (2) differentiate hypo-echoic lesions from their posterior acoustic shadowing. In this work we present a completely automatic system for detection and segmentation of breast lesions in 2D ultrasound images. We employ random forests for learning of tissue specific primal to discriminate breast lesions from surrounding normal tissues. This enables it to detect lesions of multiple shapes and sizes, as well as discriminate hypo-echoic lesions from associated posterior acoustic shadowing. The primal comprises (i) multiscale estimated ultrasonic statistical physics and (ii) scale-space characteristics. The random forest learns lesion vs. background primal from a database of 2D ultrasound images with labeled lesions. For segmentation, the posterior probabilities of lesion pixels estimated by the learnt random forest are hard thresholded to provide a random walks segmentation stage with starting seeds. Our method achieves detection with 99.19% accuracy and segmentation with mean contour-to-contour error < 3 pixels on a set of 40 images with 49 lesions.
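
    The seeding step described above, hard-thresholding per-pixel posterior probabilities to obtain starting seeds for a random-walks segmenter, might look like the sketch below. The abstract gives neither the thresholds nor the labeling scheme, so the two-threshold design, the label constants, and the values 0.9 and 0.1 are all assumptions made for illustration.

```python
LESION, BACKGROUND, UNLABELED = 1, 0, -1


def seeds_from_posterior(prob_map, lesion_threshold=0.9, background_threshold=0.1):
    """Turn a 2D grid of lesion posteriors into seed labels.

    Pixels at or above lesion_threshold become lesion seeds, pixels at or
    below background_threshold become background seeds, and everything in
    between is left unlabeled for a later segmentation stage to resolve.
    Both thresholds are hypothetical defaults.
    """
    if not 0.0 <= background_threshold < lesion_threshold <= 1.0:
        raise ValueError("need 0 <= background_threshold < lesion_threshold <= 1")
    seeds = []
    for row in prob_map:
        seed_row = []
        for p in row:
            if p >= lesion_threshold:
                seed_row.append(LESION)
            elif p <= background_threshold:
                seed_row.append(BACKGROUND)
            else:
                seed_row.append(UNLABELED)
        seeds.append(seed_row)
    return seeds


# Hypothetical 2 x 3 posterior map.
posterior = [[0.05, 0.50, 0.95],
             [0.10, 0.90, 0.30]]
print(seeds_from_posterior(posterior))
```

    Leaving the uncertain middle band unlabeled is one plausible design: only confident pixels act as seeds, and the boundary is left to the segmentation stage.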

  3. Finding structure in data using multivariate tree boosting

    PubMed Central

    Miller, Patrick J.; Lubke, Gitta H.; McArtor, Daniel B.; Bergeman, C. S.

    2016-01-01

    Technology and collaboration enable dramatic increases in the size of psychological and psychiatric data collections, but finding structure in these large data sets with many collected variables is challenging. Decision tree ensembles such as random forests (Strobl, Malley, & Tutz, 2009) are a useful tool for finding structure, but are difficult to interpret with multiple outcome variables which are often of interest in psychology. To find and interpret structure in data sets with multiple outcomes and many predictors (possibly exceeding the sample size), we introduce a multivariate extension to a decision tree ensemble method called gradient boosted regression trees (Friedman, 2001). Our extension, multivariate tree boosting, is a method for nonparametric regression that is useful for identifying important predictors, detecting predictors with nonlinear effects and interactions without specification of such effects, and for identifying predictors that cause two or more outcome variables to covary. We provide the R package ‘mvtboost’ to estimate, tune, and interpret the resulting model, which extends the implementation of univariate boosting in the R package ‘gbm’ (Ridgeway et al., 2015) to continuous, multivariate outcomes. To illustrate the approach, we analyze predictors of psychological well-being (Ryff & Keyes, 1995). Simulations verify that our approach identifies predictors with nonlinear effects and achieves high prediction accuracy, exceeding or matching the performance of (penalized) multivariate multiple regression and multivariate decision trees over a wide range of conditions. PMID:27918183

  4. Comparing ensemble learning methods based on decision tree classifiers for protein fold recognition.

    PubMed

    Bardsiri, Mahshid Khatibi; Eftekhari, Mahdi

    2014-01-01

    In this paper, some methods for ensemble learning of protein fold recognition based on a decision tree (DT) are compared and contrasted against each other over three datasets taken from the literature. According to previously reported studies, the features of the datasets are divided into some groups. Then, for each of these groups, three ensemble classifiers, namely, random forest, rotation forest and AdaBoost.M1 are employed. Also, some fusion methods are introduced for combining the ensemble classifiers obtained in the previous step. After this step, three classifiers are produced based on the combination of classifiers of types random forest, rotation forest and AdaBoost.M1. Finally, the three different classifiers achieved are combined to make an overall classifier. Experimental results show that the overall classifier obtained by the genetic algorithm (GA) weighting fusion method, is the best one in comparison to previously applied methods in terms of classification accuracy.

  5. Polarimetric signatures of a coniferous forest canopy based on vector radiative transfer theory

    NASA Technical Reports Server (NTRS)

    Karam, M. A.; Fung, A. K.; Amar, F.; Mougin, E.; Lopes, A.; Beaudoin, A.

    1992-01-01

    Complete polarization signatures of a coniferous forest canopy are studied by the iterative solution of the vector radiative transfer equations up to the second order. The forest canopy constituents (leaves, branches, stems, and trunk) are embedded in a multi-layered medium over a rough interface. The branches, stems and trunk scatterers are modeled as finite randomly oriented cylinders. The leaves are modeled as randomly oriented needles. For a plane wave exciting the canopy, the average Mueller matrix is formulated in terms of the iterative solution of the radiative transfer solution and used to determine the linearly polarized backscattering coefficients, the co-polarized and cross-polarized power returns, and the phase difference statistics. Numerical results are presented to investigate the effect of transmitting and receiving antenna configurations on the polarimetric signature of a pine forest. Comparison is made with measurements.

  6. Soil moisture sensitivity of autotrophic and heterotrophic forest floor respiration in boreal xeric pine and mesic spruce forests

    NASA Astrophysics Data System (ADS)

    Ťupek, Boris; Launiainen, Samuli; Peltoniemi, Mikko; Heikkinen, Jukka; Lehtonen, Aleksi

    2016-04-01

    Litter decomposition rates of most process based soil carbon models are affected by environmental conditions, are linked with soil heterotrophic CO2 emissions, and serve for estimating soil carbon sequestration; thus, due to the mass balance equation, the variation in measured litter inputs and measured heterotrophic soil CO2 effluxes should indicate soil carbon stock changes, needed by soil carbon management for mitigation of anthropogenic CO2 emissions, if the sensitivity functions of the applied model suit the environmental conditions, e.g. soil temperature and moisture. We evaluated the response forms of autotrophic and heterotrophic forest floor respiration to soil temperature and moisture in four boreal forest sites of the International Cooperative Programme on Assessment and Monitoring of Air Pollution Effects on Forests (ICP Forests) by a soil trenching experiment during 2015 in southern Finland. As expected, both autotrophic and heterotrophic forest floor respiration components were primarily controlled by soil temperature, and exponential regression models generally explained more than 90% of the variance. Soil moisture regression models on average explained less than 10% of the variance and the response forms varied between Gaussian for the autotrophic forest floor respiration component and linear for the heterotrophic forest floor respiration component. Although the percentage of explained variance of soil heterotrophic respiration by the soil moisture was small, the observed reduction of CO2 emissions with higher moisture levels suggested that the soil moisture response of soil carbon models not accounting for the reduction due to excessive moisture should be re-evaluated in order to estimate the right levels of soil carbon stock changes. Our further study will include evaluation of process based soil carbon models by the annual heterotrophic respiration and soil carbon stocks.

  7. Prediction models for clustered data: comparison of a random intercept and standard regression model

    PubMed Central

    2013-01-01

    Background When study data are clustered, standard regression analysis is considered inappropriate and analytical techniques for clustered data need to be used. For prediction research in which the interest of predictor effects is on the patient level, random effect regression models are probably preferred over standard regression analysis. It is well known that the random effect parameter estimates and the standard logistic regression parameter estimates are different. Here, we compared random effect and standard logistic regression models for their ability to provide accurate predictions. Methods Using an empirical study on 1642 surgical patients at risk of postoperative nausea and vomiting, who were treated by one of 19 anesthesiologists (clusters), we developed prognostic models either with standard or random intercept logistic regression. External validity of these models was assessed in new patients from other anesthesiologists. We supported our results with simulation studies using intra-class correlation coefficients (ICC) of 5%, 15%, or 30%. Standard performance measures and measures adapted for the clustered data structure were estimated. Results The model developed with random effect analysis showed better discrimination than the standard approach, if the cluster effects were used for risk prediction (standard c-index of 0.69 versus 0.66). In the external validation set, both models showed similar discrimination (standard c-index 0.68 versus 0.67). The simulation study confirmed these results. For datasets with a high ICC (≥15%), model calibration was only adequate in external subjects, if the used performance measure assumed the same data structure as the model development method: standard calibration measures showed good calibration for the standard developed model, calibration measures adapting the clustered data structure showed good calibration for the prediction model with random intercept. 
    Conclusion The models with random intercept discriminate better than the standard model only if the cluster effect is used for predictions. The prediction model with random intercept had good calibration within clusters. PMID:23414436
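
    The c-index used above to compare discrimination can be sketched for a binary outcome as the share of (event, non-event) patient pairs in which the event received the higher predicted risk, with ties counted as one half. This is the standard pairwise definition, not the study's own code. The function name and the data are hypothetical.

```python
def c_index(outcomes, predicted_risk):
    """Concordance index for binary outcomes (1 = event, 0 = no event).

    Every event is compared with every non-event: a pair is concordant when
    the event has the higher predicted risk; tied risks count as 0.5.
    """
    if len(outcomes) != len(predicted_risk):
        raise ValueError("inputs must be of equal length")
    event_risks = [r for y, r in zip(outcomes, predicted_risk) if y == 1]
    nonevent_risks = [r for y, r in zip(outcomes, predicted_risk) if y == 0]
    if not event_risks or not nonevent_risks:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for e in event_risks:
        for n in nonevent_risks:
            if e > n:
                concordant += 1.0
            elif e == n:
                concordant += 0.5
    return concordant / (len(event_risks) * len(nonevent_risks))


# Hypothetical patients: outcome and predicted risk of the outcome.
outcomes = [1, 0, 1, 0, 0]
risks = [0.8, 0.3, 0.4, 0.4, 0.1]
print(round(c_index(outcomes, risks), 3))
```

    A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking, so the reported 0.69 versus 0.66 is a modest difference in discrimination.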

  8. Prediction models for clustered data: comparison of a random intercept and standard regression model.

    PubMed

    Bouwmeester, Walter; Twisk, Jos W R; Kappen, Teus H; van Klei, Wilton A; Moons, Karel G M; Vergouwe, Yvonne

    2013-02-15

    When study data are clustered, standard regression analysis is considered inappropriate and analytical techniques for clustered data need to be used. For prediction research in which the interest of predictor effects is on the patient level, random effect regression models are probably preferred over standard regression analysis. It is well known that the random effect parameter estimates and the standard logistic regression parameter estimates are different. Here, we compared random effect and standard logistic regression models for their ability to provide accurate predictions. Using an empirical study on 1642 surgical patients at risk of postoperative nausea and vomiting, who were treated by one of 19 anesthesiologists (clusters), we developed prognostic models either with standard or random intercept logistic regression. External validity of these models was assessed in new patients from other anesthesiologists. We supported our results with simulation studies using intra-class correlation coefficients (ICC) of 5%, 15%, or 30%. Standard performance measures and measures adapted for the clustered data structure were estimated. The model developed with random effect analysis showed better discrimination than the standard approach, if the cluster effects were used for risk prediction (standard c-index of 0.69 versus 0.66). In the external validation set, both models showed similar discrimination (standard c-index 0.68 versus 0.67). The simulation study confirmed these results. For datasets with a high ICC (≥15%), model calibration was only adequate in external subjects, if the used performance measure assumed the same data structure as the model development method: standard calibration measures showed good calibration for the standard developed model, calibration measures adapting the clustered data structure showed good calibration for the prediction model with random intercept. 
    The models with random intercept discriminate better than the standard model only if the cluster effect is used for predictions. The prediction model with random intercept had good calibration within clusters.

  9. Field evaluation of a random forest activity classifier for wrist-worn accelerometer data.

    PubMed

    Pavey, Toby G; Gilson, Nicholas D; Gomersall, Sjaan R; Clark, Bronwyn; Trost, Stewart G

    2017-01-01

    Wrist-worn accelerometers are convenient to wear and associated with greater wear-time compliance. Previous work has generally relied on choreographed activity trials to train and test classification models. However, validity in free-living contexts is starting to emerge. Study aims were: (1) train and test a random forest activity classifier for wrist accelerometer data; and (2) determine if models trained on laboratory data perform well under free-living conditions. Twenty-one participants (mean age=27.6±6.2) completed seven lab-based activity trials and a 24h free-living trial (N=16). Participants wore a GENEActiv monitor on the non-dominant wrist. Classification models recognising four activity classes (sedentary, stationary+, walking, and running) were trained using time and frequency domain features extracted from 10-s non-overlapping windows. Model performance was evaluated using leave-one-out-cross-validation. Models were implemented using the randomForest package within R. Classifier accuracy during the 24h free living trial was evaluated by calculating agreement with concurrently worn activPAL monitors. Overall classification accuracy for the random forest algorithm was 92.7%. Recognition accuracy for sedentary, stationary+, walking, and running was 80.1%, 95.7%, 91.7%, and 93.7%, respectively for the laboratory protocol. Agreement with the activPAL data (stepping vs. non-stepping) during the 24h free-living trial was excellent and, on average, exceeded 90%. The ICC for stepping time was 0.92 (95% CI=0.75-0.97). However, sensitivity and positive predictive values were modest. Mean bias was 10.3min/d (95% LOA=-46.0 to 25.4min/d). 
    The random forest classifier for wrist accelerometer data yielded accurate group-level predictions under controlled conditions, but was less accurate at identifying stepping versus non-stepping behaviour in free-living conditions. Future studies should conduct more rigorous field-based evaluations using observation as a criterion measure. Copyright © 2016 Sports Medicine Australia. Published by Elsevier Ltd. All rights reserved.
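
    The windowing step described above (features extracted from 10-s non-overlapping windows) can be sketched as follows. The study used time and frequency domain features computed in R; this illustration is in Python, computes only two simple time-domain features, and assumes a single-axis signal. The function name, the sampling rate, and the feature choice are hypothetical.

```python
from statistics import mean, pstdev


def window_features(samples, sample_rate_hz, window_seconds=10):
    """Split a signal into non-overlapping windows and summarize each one.

    Returns one dict per complete window with its mean and (population)
    standard deviation; a trailing partial window is dropped.
    """
    window_len = int(sample_rate_hz * window_seconds)
    if window_len <= 0:
        raise ValueError("window length must be at least one sample")
    features = []
    for start in range(0, len(samples) - window_len + 1, window_len):
        window = samples[start:start + window_len]
        features.append({"mean": mean(window), "sd": pstdev(window)})
    return features


# Hypothetical 25-s signal sampled at 1 Hz: two full windows, 5 samples dropped.
signal = [float(i) for i in range(25)]
for row in window_features(signal, sample_rate_hz=1):
    print(row)
```

    Each resulting feature row would then be one training example for the classifier, labeled with the activity performed during that window.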

  10. An analysis of tree mortality using high resolution remotely-sensed data for mixed-conifer forests in San Diego county

    NASA Astrophysics Data System (ADS)

    Freeman, Mary Pyott

    The montane mixed-conifer forests of San Diego County are currently experiencing extensive tree mortality, which is defined as dieback where whole stands are affected. This mortality is likely the result of the complex interaction of many variables, such as altered fire regimes, climatic conditions such as drought, as well as forest pathogens and past management strategies. Conifer tree mortality and its spatial pattern and change over time were examined in three components. In component 1, two remote sensing approaches were compared for their effectiveness in delineating dead trees, a spatial contextual approach and an OBIA (object based image analysis) approach, utilizing various dates and spatial resolutions of airborne image data. For each approach transforms and masking techniques were explored, which were found to improve classifications, and an object-based assessment approach was tested. In component 2, dead tree maps produced by the most effective techniques derived from component 1 were utilized for point pattern and vector analyses to further understand spatio-temporal changes in tree mortality for the years 1997, 2000, 2002, and 2005 for three study areas: Palomar, Volcan and Laguna mountains. Plot-based fieldwork was conducted to further assess mortality patterns. Results indicate that conifer mortality was significantly clustered, increased substantially between 2002 and 2005, and was non-random with respect to tree species and diameter class sizes. In component 3, multiple environmental variables were used in Generalized Linear Model (GLM-logistic regression) and decision tree classifier model development, revealing the importance of climate and topographic factors such as precipitation and elevation, in being able to predict areas of high risk for tree mortality.
    The results from this study highlight the importance of multi-scale spatial as well as temporal analyses, in order to understand mixed-conifer forest structure, dynamics, and processes of decline, which can lead to more sustainable management of forests with continued natural and anthropogenic disturbance.

  11. Metastability for discontinuous dynamical systems under Lévy noise: Case study on Amazonian Vegetation.

    PubMed

    Serdukova, Larissa; Zheng, Yayun; Duan, Jinqiao; Kurths, Jürgen

    2017-08-24

    For the tipping elements in the Earth's climate system, the most important issue to address is how stable is the desirable state against random perturbations. Extreme biotic and climatic events pose severe hazards to tropical rainforests. Their local effects are extremely stochastic and difficult to measure. Moreover, the direction and intensity of the response of forest trees to such perturbations are unknown, especially given the lack of efficient dynamical vegetation models to evaluate forest tree cover changes over time. In this study, we consider randomness in the mathematical modelling of forest trees by incorporating uncertainty through a stochastic differential equation. According to field-based evidence, the interactions between fires and droughts are a more direct mechanism that may describe sudden forest degradation in the south-eastern Amazon. In modeling the Amazonian vegetation system, we include symmetric α-stable Lévy perturbations. We report results of stability analysis of the metastable fertile forest state. We conclude that even a very slight threat to the forest state stability represents Lévy noise with large jumps of low intensity, that can be interpreted as a fire occurring in a non-drought year. During years of severe drought, high-intensity fires significantly accelerate the transition between a forest and savanna state.

  12. Stochastic assembly in a subtropical forest chronosequence: evidence from contrasting changes of species, phylogenetic and functional dissimilarity over succession.

    PubMed

    Mi, Xiangcheng; Swenson, Nathan G; Jia, Qi; Rao, Mide; Feng, Gang; Ren, Haibao; Bebber, Daniel P; Ma, Keping

    2016-09-07

    Deterministic and stochastic processes jointly determine the community dynamics of forest succession. However, it has been widely held in previous studies that deterministic processes dominate forest succession. Furthermore, inference of mechanisms for community assembly may be misleading if based on a single axis of diversity alone. In this study, we evaluated the relative roles of deterministic and stochastic processes along a disturbance gradient by integrating species, functional, and phylogenetic beta diversity in a subtropical forest chronosequence in Southeastern China. We found a general pattern of increasing species turnover, but little-to-no change in phylogenetic and functional turnover over succession at two spatial scales. Meanwhile, the phylogenetic and functional beta diversity were not significantly different from random expectation. This result suggested a dominance of stochastic assembly, contrary to the general expectation that deterministic processes dominate forest succession. On the other hand, we found significant interactions of environment and disturbance and limited evidence for significant deviations of phylogenetic or functional turnover from random expectations for different size classes. This result provided weak evidence of deterministic processes over succession. Stochastic assembly of forest succession suggests that post-disturbance restoration may be largely unpredictable and difficult to control in subtropical forests.

  13. Correspondence between sound propagation in discrete and continuous random media with application to forest acoustics.

    PubMed

    Ostashev, Vladimir E; Wilson, D Keith; Muhlestein, Michael B; Attenborough, Keith

    2018-02-01

    Although sound propagation in a forest is important in several applications, there are currently no rigorous yet computationally tractable prediction methods. Due to the complexity of sound scattering in a forest, it is natural to formulate the problem stochastically. In this paper, it is demonstrated that the equations for the statistical moments of the sound field propagating in a forest have the same form as those for sound propagation in a turbulent atmosphere if the scattering properties of the two media are expressed in terms of the differential scattering and total cross sections. Using the existing theories for sound propagation in a turbulent atmosphere, this analogy enables the derivation of several results for predicting forest acoustics. In particular, the second-moment parabolic equation is formulated for the spatial correlation function of the sound field propagating above an impedance ground in a forest with micrometeorology. Effective numerical techniques for solving this equation have been developed in atmospheric acoustics. In another example, formulas are obtained that describe the effect of a forest on the interference between the direct and ground-reflected waves. The formulated correspondence between wave propagation in discrete and continuous random media can also be used in other fields of physics.

  14. First direct landscape-scale measurement of tropical rain forest Leaf Area Index, a key driver of global primary productivity

    Treesearch

    David B. Clark; Paulo C. Olivas; Steven F. Oberbauer; Deborah A. Clark; Michael G. Ryan

    2008-01-01

    Leaf Area Index (leaf area per unit ground area, LAI) is a key driver of forest productivity but has never previously been measured directly at the landscape scale in tropical rain forest (TRF). We used a modular tower and stratified random sampling to harvest all foliage from forest floor to canopy top in 55 vertical transects (4.6 m²) across 500 ha of old growth in...

  15. Using methods from the data mining and machine learning literature for disease classification and prediction: A case study examining classification of heart failure sub-types

    PubMed Central

    Austin, Peter C.; Tu, Jack V.; Ho, Jennifer E.; Levy, Daniel; Lee, Douglas S.

    2014-01-01

    Objective: Physicians classify patients into those with or without a specific disease. Furthermore, there is often interest in classifying patients according to disease etiology or subtype. Classification trees are frequently used to classify patients according to the presence or absence of a disease. However, classification trees can suffer from limited accuracy. In the data-mining and machine learning literature, alternate classification schemes have been developed. These include bootstrap aggregation (bagging), boosting, random forests, and support vector machines. Study design and setting: We compared the performance of these classification methods with those of conventional classification trees to classify patients with heart failure according to the following sub-types: heart failure with preserved ejection fraction (HFPEF) vs. heart failure with reduced ejection fraction (HFREF). We also compared the ability of these methods to predict the probability of the presence of HFPEF with that of conventional logistic regression. Results: We found that modern, flexible tree-based methods from the data mining literature offer substantial improvement in prediction and classification of heart failure sub-type compared to conventional classification and regression trees. However, conventional logistic regression had superior performance for predicting the probability of the presence of HFPEF compared to the methods proposed in the data mining literature. Conclusion: The use of tree-based methods offers superior performance over conventional classification and regression trees for predicting and classifying heart failure subtypes in a population-based sample of patients from Ontario. However, these methods do not offer substantial improvements over logistic regression for predicting the presence of HFPEF. PMID:23384592

  16. NIMEFI: gene regulatory network inference using multiple ensemble feature importance algorithms.

    PubMed

    Ruyssinck, Joeri; Huynh-Thu, Vân Anh; Geurts, Pierre; Dhaene, Tom; Demeester, Piet; Saeys, Yvan

    2014-01-01

    One of the long-standing open challenges in computational systems biology is the topology inference of gene regulatory networks from high-throughput omics data. Recently, two community-wide efforts, DREAM4 and DREAM5, have been established to benchmark network inference techniques using gene expression measurements. In these challenges the overall top performer was the GENIE3 algorithm. This method decomposes the network inference task into separate regression problems for each gene in the network, in which the expression values of a particular target gene are predicted using all other genes as possible predictors. Next, using tree-based ensemble methods, an importance measure for each predictor gene is calculated with respect to the target gene, and a high feature importance is considered as putative evidence of a regulatory link existing between both genes. The contribution of this work is twofold. First, we generalize the regression decomposition strategy of GENIE3 to other feature importance methods. We compare the performance of support vector regression, the elastic net, random forest regression, symbolic regression and their ensemble variants in this setting to the original GENIE3 algorithm. To create the ensemble variants, we propose a subsampling approach which allows us to cast any feature selection algorithm that produces a feature ranking into an ensemble feature importance algorithm. We demonstrate that the ensemble setting is key to the network inference task, as only ensemble variants achieve top performance. As a second contribution, we explore the effect of using rankwise averaged predictions of multiple ensemble algorithms as opposed to only one. We name this approach NIMEFI (Network Inference using Multiple Ensemble Feature Importance algorithms) and show that this approach outperforms all individual methods in general, although on a specific network a single method can perform better. An implementation of NIMEFI has been made publicly available.
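
    The rankwise averaging step described in this abstract can be illustrated with a small sketch. It is not the NIMEFI implementation: the function names, the gene labels and the importance scores below are hypothetical, and ties are simply broken by insertion order. Each method contributes a table of importance scores per candidate regulatory link; the scores are converted to ranks and the ranks are averaged across methods.

```python
def ranks_from_scores(scores):
    """Map each candidate link to its rank under one method (1 = most important)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {link: position + 1 for position, link in enumerate(ordered)}


def rankwise_average(score_tables):
    """Average the per-method ranks of every link; a lower value means stronger support."""
    rank_tables = [ranks_from_scores(table) for table in score_tables]
    links = rank_tables[0].keys()
    return {
        link: sum(ranks[link] for ranks in rank_tables) / len(rank_tables)
        for link in links
    }


# Hypothetical importance scores for (regulator, target) pairs from two methods.
method_a = {("g1", "g2"): 0.9, ("g1", "g3"): 0.1, ("g2", "g3"): 0.5}
method_b = {("g1", "g2"): 0.4, ("g1", "g3"): 0.2, ("g2", "g3"): 0.7}

combined = rankwise_average([method_a, method_b])
```

    Averaging ranks rather than raw scores sidesteps the fact that importance values from different methods are generally not on a common scale.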

  17. Quantitative monitoring of sucrose, reducing sugar and total sugar dynamics for phenotyping of water-deficit stress tolerance in rice through spectroscopy and chemometrics

    NASA Astrophysics Data System (ADS)

    Das, Bappa; Sahoo, Rabi N.; Pargal, Sourabh; Krishna, Gopal; Verma, Rakesh; Chinnusamy, Viswanathan; Sehgal, Vinay K.; Gupta, Vinod K.; Dash, Sushanta K.; Swain, Padmini

    2018-03-01

    In the present investigation, the changes in sucrose, reducing and total sugar content due to water-deficit stress in rice leaves were modeled using visible, near infrared (VNIR) and shortwave infrared (SWIR) spectroscopy. The objectives of the study were to identify the best vegetation indices and suitable multivariate technique based on precise analysis of hyperspectral data (350 to 2500 nm) and sucrose, reducing sugar and total sugar content measured at different stress levels from 16 different rice genotypes. Spectral data analysis was done to identify suitable spectral indices and models for sucrose estimation. Novel spectral indices in the near infrared (NIR) range, viz. ratio spectral index (RSI) and normalised difference spectral indices (NDSI), sensitive to sucrose, reducing sugar and total sugar content were identified, which were subsequently calibrated and validated. The RSI and NDSI models had R² values of 0.65, 0.71 and 0.67 and RPD values of 1.68, 1.95 and 1.66 for sucrose, reducing sugar and total sugar, respectively, for the validation dataset. Different multivariate spectral models such as artificial neural network (ANN), multivariate adaptive regression splines (MARS), multiple linear regression (MLR), partial least square regression (PLSR), random forest regression (RFR) and support vector machine regression (SVMR) were also evaluated. The best performing multivariate models for sucrose, reducing sugars and total sugars were found to be MARS, ANN and MARS, respectively, with RPD values of 2.08, 2.44, and 1.93. Results indicated that VNIR and SWIR spectroscopy combined with multivariate calibration can be used as a reliable alternative to conventional methods for measurement of sucrose, reducing sugars and total sugars of rice under water-deficit stress, as this technique is fast, economic, and noninvasive.
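
    As a rough illustration of the quantities named in this abstract, the sketch below computes a ratio spectral index, a normalised difference spectral index and the RPD statistic. The abstract does not give its formulas, so the commonly used definitions are assumed here: RSI as the ratio of reflectances at two wavelengths, NDSI as their normalised difference, and RPD as the sample standard deviation of the observed values divided by the root mean square error of prediction. All numbers are made up.

```python
import math
import statistics


def rsi(r_i, r_j):
    """Ratio spectral index of reflectances at two wavelengths (assumed definition)."""
    return r_i / r_j


def ndsi(r_i, r_j):
    """Normalised difference spectral index of two reflectances (assumed definition)."""
    return (r_i - r_j) / (r_i + r_j)


def rpd(observed, predicted):
    """Sample standard deviation of observations divided by the prediction RMSE."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
    return statistics.stdev(observed) / rmse


# Hypothetical reflectances at two NIR wavelengths and hypothetical sugar contents.
index_value = ndsi(0.6, 0.2)
ratio_value = rsi(0.6, 0.2)
rpd_value = rpd([1.0, 2.0, 3.0, 4.0], [1.5, 1.5, 3.5, 3.5])
```

    A larger RPD means the prediction error is small relative to the spread of the reference measurements, which is why the abstract ranks models by it.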

  18. A Ranking Approach to Genomic Selection.

    PubMed

    Blondel, Mathieu; Onogi, Akio; Iwata, Hiroyoshi; Ueda, Naonori

    2015-01-01

    Genomic selection (GS) is a recent selective breeding method which uses predictive models based on whole-genome molecular markers. Until now, existing studies formulated GS as the problem of modeling an individual's breeding value for a particular trait of interest, i.e., as a regression problem. To assess predictive accuracy of the model, the Pearson correlation between observed and predicted trait values was used. In this paper, we propose to formulate GS as the problem of ranking individuals according to their breeding value. Our proposed framework allows us to employ machine learning methods for ranking which had previously not been considered in the GS literature. To assess ranking accuracy of a model, we introduce a new measure originating from the information retrieval literature called normalized discounted cumulative gain (NDCG). NDCG rewards more strongly models which assign a high rank to individuals with high breeding value. Therefore, NDCG reflects a prerequisite objective in selective breeding: accurate selection of individuals with high breeding value. We conducted a comparison of 10 existing regression methods and 3 new ranking methods on 6 datasets, consisting of 4 plant species and 25 traits. Our experimental results suggest that tree-based ensemble methods including McRank, Random Forests and Gradient Boosting Regression Trees achieve excellent ranking accuracy. RKHS regression and RankSVM also achieve good accuracy when used with an RBF kernel. Traditional regression methods such as Bayesian lasso, wBSR and BayesC were found less suitable for ranking. Pearson correlation was found to correlate poorly with NDCG. Our study suggests two important messages. First, ranking methods are a promising research direction in GS. Second, NDCG can be a useful evaluation measure for GS.
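
    To make the evaluation measure concrete, here is a minimal sketch of NDCG at a cutoff k. It assumes a linear gain (the true breeding value itself, taken to be non-negative) and a base-2 logarithmic discount; the paper may use a different gain function, and the function names and values are illustrative only.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain of the first k gains, in the order given."""
    return sum(gain / math.log2(position + 2) for position, gain in enumerate(gains[:k]))


def ndcg_at_k(true_values, predicted_scores, k):
    """DCG of the predicted ordering divided by the DCG of the ideal ordering."""
    order = sorted(
        range(len(true_values)), key=lambda i: predicted_scores[i], reverse=True
    )
    ranked_gains = [true_values[i] for i in order]
    ideal_gains = sorted(true_values, reverse=True)
    ideal = dcg_at_k(ideal_gains, k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


# Hypothetical breeding values and two sets of model scores.
breeding_values = [3.0, 1.0, 2.0, 0.0]
good_scores = [0.9, 0.2, 0.5, 0.1]   # orders individuals exactly by breeding value
bad_scores = [0.1, 0.5, 0.2, 0.9]    # puts the worst individual first

ndcg_good = ndcg_at_k(breeding_values, good_scores, k=2)
ndcg_bad = ndcg_at_k(breeding_values, bad_scores, k=2)
```

    Because the discount shrinks with rank position, mistakes near the top of the list cost more than mistakes further down, which matches the breeding goal of selecting the best individuals.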

  19. Association between climatic variables and malaria incidence: a study in Kokrajhar district of Assam, India.

    PubMed

    Nath, Dilip C; Mwchahary, Dimacha Dwibrang

    2012-11-11

    A favorable climatic condition for transmission of malaria prevails in Kokrajhar district throughout the year. A sizeable part of the district is covered by forest, due to which dissimilar dynamics of malaria transmission emerge in forest and non-forest areas. Observed malaria incidence rates of forest area, non-forest area and the whole district over the period 2001-2010 were considered for analyzing temporal correlation between malaria incidence and climatic variables. Associations between the two were examined by Pearson correlation analysis. Cross-correlation tests were performed between pre-whitened series of climatic variables and malaria series. Linear regressions were used to obtain linear relationships between climatic factors and malaria incidence, while weighted least squares regression was used to construct models for explaining and estimating malaria incidence rates. Annual concentration of malaria incidence was analyzed by the Markham technique, obtaining a seasonal index. Forest area and non-forest area have distinguishable malaria seasons. Relative humidity was positively correlated with forest malaria incidence, while temperature series were negatively correlated with non-forest malaria incidence. There was higher seasonality of concentration of malaria in the forest area than in the non-forest area. Significant correlation between annual changes in malaria cases in forest area and temperature was observed (coeff=0.689, p=0.040). Separate reliable models constructed for forecasting malaria incidence rates based on the combined influence of climatic variables on malaria incidence in different areas of the district were able to explain a substantial percentage of observed variability in the incidence rates (R²adj=45.4%, 50.6%, 47.2%; p< .001 for all). There is an intricate association between climatic variables and malaria incidence of the district. Climatic variables influence malaria incidence in forest area and non-forest area in different ways. Rainfall plays a primary role in characterizing malaria incidences in the district. Malaria parasites in the district had adapted to a relative humidity condition higher than the normal range for transmission in India. Instead of individual influence of the climatic variables, their combined influence was utilizable for construction of models.
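
    The correlation analysis described here can be sketched as a Pearson correlation between a climatic series and a malaria incidence series shifted by a lag. This is a simplified illustration only: it skips the pre-whitening step mentioned in the abstract, and the function names and monthly values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def lagged_correlation(climate, incidence, lag):
    """Correlate climate at time t with incidence at time t + lag (lag >= 0)."""
    if lag == 0:
        return pearson(climate, incidence)
    return pearson(climate[:-lag], incidence[lag:])


# Hypothetical monthly series in which incidence follows the climate one month later.
humidity = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
incidence = [0.0, 1.0, 3.0, 2.0, 5.0, 4.0]

r_lag1 = lagged_correlation(humidity, incidence, lag=1)
r_lag0 = lagged_correlation(humidity, incidence, lag=0)
```

    Scanning several lags and keeping the one with the strongest correlation is a common way to estimate the delay between a climatic driver and its effect on incidence.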

  20. Biological forest ecosystems diversity and their impact in semi arid land, analysis and followed by remote sensing (Alsat-1 data, Steppe of Algeria)

    NASA Astrophysics Data System (ADS)

    Zegrar, A.

    2008-05-01

    Algerian forests show substantial ecological diversity because of the range of climates, from sub-humid to arid. These climates directly influence forest ecosystems and condition the floristic composition of the forests as well as their regeneration. The ecological diversity of a forest, as part of its constitution, plays an important role in natural regeneration following disturbances (forest fire, windthrow (chablis), wood cutting...). Conserving biological diversity and ensuring the lasting value of forest ecosystems are therefore taking on greater importance, because forest ecosystems bring together the biological wealth of forests, inland waters, agricultural land, and arid and sub-humid land. In this study, remote sensing data from the ALSAT-1 and LANDSAT TM satellites, acquired on different dates, inform us about the impact of the arid climate on ecological diversity within steppe vegetation formations, and in particular about the regressive evolution of forest ecosystems known as deforestation. We used LANDSAT TM data from 1989 and ALSAT-1 data from 2007 for a multistage study of the regressive evolution of forest ecosystems. Specific processing, in particular supervised classification by the maximum likelihood method, was applied to identify the most important formations in the study area. The NDVI and MSAVI2 indices and the IV greenness index were used to characterize and determine changes in forest formations. The arithmetic combinations were computed in the IDRISI geographic information system. After applying the ratio method, we obtained an image of the changes. A vulnerability map of the forest ecosystems was produced; it informs us about the process of deforestation in natural forests following the various disturbances and suggests possible prospects for management.
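
    The two named vegetation indices can be written out as a short sketch using their standard definitions, NDVI = (NIR - Red) / (NIR + Red) and MSAVI2 = (2 NIR + 1 - sqrt((2 NIR + 1)^2 - 8 (NIR - Red))) / 2. The reflectance values and the simple date-to-date difference below are hypothetical and are not taken from the study, which worked in IDRISI.

```python
import math


def ndvi(nir, red):
    """Normalized Difference Vegetation Index from NIR and red reflectance."""
    return (nir - red) / (nir + red)


def msavi2(nir, red):
    """Second Modified Soil-Adjusted Vegetation Index from NIR and red reflectance."""
    term = 2 * nir + 1
    return (term - math.sqrt(term ** 2 - 8 * (nir - red))) / 2


# Hypothetical reflectances for one pixel at an earlier and a later date.
earlier = {"nir": 0.5, "red": 0.1}
later = {"nir": 0.3, "red": 0.2}

# A negative difference indicates a loss of green vegetation between the dates.
ndvi_change = ndvi(later["nir"], later["red"]) - ndvi(earlier["nir"], earlier["red"])
msavi2_earlier = msavi2(earlier["nir"], earlier["red"])
```

    MSAVI2 is often preferred over NDVI in sparsely vegetated areas such as steppe because it reduces the influence of bare soil on the index.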
