Sample records for random forest models

  1. An application of quantile random forests for predictive mapping of forest attributes

    Treesearch

    E.A. Freeman; G.G. Moisen

    2015-01-01

    Increasingly, random forest models are used in predictive mapping of forest attributes. Traditional random forests output the mean prediction from the random trees. Quantile regression forests (QRF) is an extension of random forests developed by Nicolai Meinshausen that provides non-parametric estimates of the median predicted value as well as prediction quantiles. It...
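
    The contrast between a mean prediction and prediction quantiles can be sketched in a few lines. The example below is a minimal illustration, not the authors' implementation: it assumes that a forest assigns each training response a weight for a given query point (in Meinshausen's formulation the weights come from shared leaf membership averaged over trees), and the responses and weights shown are made-up values.

```python
def weighted_mean(values, weights):
    """Mean prediction: weighted average of the training responses."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total


def weighted_quantile(values, weights, q):
    """Smallest response whose weighted cumulative share reaches q."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight / total
        if cumulative >= q - 1e-12:  # small tolerance for float round-off
            return value
    return pairs[-1][0]


# Hypothetical training responses and forest weights for one query point.
responses = [10.0, 12.0, 15.0, 20.0, 40.0]
weights = [0.1, 0.3, 0.3, 0.2, 0.1]

mean_prediction = weighted_mean(responses, weights)
median_prediction = weighted_quantile(responses, weights, 0.5)
lower = weighted_quantile(responses, weights, 0.05)
upper = weighted_quantile(responses, weights, 0.95)
print(mean_prediction, median_prediction, (lower, upper))
```

    With a skewed set of responses such as this one, the mean and the median differ, and the 5% and 95% quantiles give a rough prediction interval.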

  2. A tale of two "forests": random forest machine learning aids tropical forest carbon mapping.

    PubMed

    Mascaro, Joseph; Asner, Gregory P; Knapp, David E; Kennedy-Bowdoin, Ty; Martin, Roberta E; Anderson, Christopher; Higgins, Mark; Chadwick, K Dana

    2014-01-01

    Accurate and spatially-explicit maps of tropical forest carbon stocks are needed to implement carbon offset mechanisms such as REDD+ (Reduced Deforestation and Degradation Plus). The Random Forest machine learning algorithm may aid carbon mapping applications using remotely-sensed data. However, Random Forest has never been compared to traditional and potentially more reliable techniques such as regionally stratified sampling and upscaling, and it has rarely been employed with spatial data. Here, we evaluated the performance of Random Forest in upscaling airborne LiDAR (Light Detection and Ranging)-based carbon estimates compared to the stratification approach over a 16-million hectare focal area of the Western Amazon. We considered two runs of Random Forest, with and without spatial contextual modeling; the run with spatial context included x and y position directly in the model. In each case, we set aside 8 million hectares (i.e., half of the focal area) for validation; this rigorous test of Random Forest went above and beyond the internal validation normally compiled by the algorithm (called "out-of-bag"), which proved insufficient for this spatial application. In this heterogeneous region of Northern Peru, the model with spatial context was the best performing run of Random Forest, and explained 59% of the variation in LiDAR-based carbon estimates within the validation area, compared to 37% for stratification or 43% by Random Forest without spatial context. With the 60% improvement in explained variation, RMSE against validation LiDAR samples improved from 33 to 26 Mg C ha^-1 when using Random Forest with spatial context. Our results suggest that spatial context should be considered when using Random Forest, and that doing so may result in substantially improved carbon stock modeling for purposes of climate change mitigation.
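
    The abstract does not say which baseline the 60% improvement refers to. The short calculation below, which uses only the percentages and RMSE values quoted above, suggests it is closest to the comparison with stratification; this reading is an assumption, not something the abstract states.

```python
def relative_change(baseline, new):
    """Change from baseline to new, as a fraction of the baseline."""
    return (new - baseline) / baseline


# Explained variation reported in the abstract.
r2_stratification = 0.37
r2_rf_no_context = 0.43
r2_rf_context = 0.59

gain_vs_stratification = relative_change(r2_stratification, r2_rf_context)
gain_vs_plain_rf = relative_change(r2_rf_no_context, r2_rf_context)

# RMSE reported in the abstract, in Mg C per hectare.
rmse_change = relative_change(33.0, 26.0)

print(f"vs stratification: {gain_vs_stratification:.1%}")
print(f"vs Random Forest without context: {gain_vs_plain_rf:.1%}")
print(f"RMSE change: {rmse_change:.1%}")
```

    The gain relative to stratification comes out a little under 60%, while the gain relative to Random Forest without spatial context is under 40%.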

  3. A Tale of Two “Forests”: Random Forest Machine Learning Aids Tropical Forest Carbon Mapping

    PubMed Central

    Mascaro, Joseph; Asner, Gregory P.; Knapp, David E.; Kennedy-Bowdoin, Ty; Martin, Roberta E.; Anderson, Christopher; Higgins, Mark; Chadwick, K. Dana

    2014-01-01

    Accurate and spatially-explicit maps of tropical forest carbon stocks are needed to implement carbon offset mechanisms such as REDD+ (Reduced Deforestation and Degradation Plus). The Random Forest machine learning algorithm may aid carbon mapping applications using remotely-sensed data. However, Random Forest has never been compared to traditional and potentially more reliable techniques such as regionally stratified sampling and upscaling, and it has rarely been employed with spatial data. Here, we evaluated the performance of Random Forest in upscaling airborne LiDAR (Light Detection and Ranging)-based carbon estimates compared to the stratification approach over a 16-million hectare focal area of the Western Amazon. We considered two runs of Random Forest, with and without spatial contextual modeling; the run with spatial context included x and y position directly in the model. In each case, we set aside 8 million hectares (i.e., half of the focal area) for validation; this rigorous test of Random Forest went above and beyond the internal validation normally compiled by the algorithm (called “out-of-bag”), which proved insufficient for this spatial application. In this heterogeneous region of Northern Peru, the model with spatial context was the best performing run of Random Forest, and explained 59% of the variation in LiDAR-based carbon estimates within the validation area, compared to 37% for stratification or 43% by Random Forest without spatial context. With the 60% improvement in explained variation, RMSE against validation LiDAR samples improved from 33 to 26 Mg C ha^-1 when using Random Forest with spatial context. Our results suggest that spatial context should be considered when using Random Forest, and that doing so may result in substantially improved carbon stock modeling for purposes of climate change mitigation. PMID:24489686

  4. Comparison of the Predictive Performance and Interpretability of Random Forest and Linear Models on Benchmark Data Sets.

    PubMed

    Marchese Robinson, Richard L; Palczewska, Anna; Palczewski, Jan; Kidley, Nathan

    2017-08-28

    The ability to interpret the predictions made by quantitative structure-activity relationships (QSARs) offers a number of advantages. While QSARs built using nonlinear modeling approaches, such as the popular Random Forest algorithm, might sometimes be more predictive than those built using linear modeling approaches, their predictions have been perceived as difficult to interpret. However, a growing number of approaches have been proposed for interpreting nonlinear QSAR models in general and Random Forest in particular. In the current work, we compare the performance of Random Forest to those of two widely used linear modeling approaches: linear Support Vector Machines (SVMs) (or Support Vector Regression (SVR)) and partial least-squares (PLS). We compare their performance in terms of their predictivity as well as the chemical interpretability of the predictions using novel scoring schemes for assessing heat map images of substructural contributions. We critically assess different approaches for interpreting Random Forest models as well as for obtaining predictions from the forest. We assess the models on a large number of widely employed public-domain benchmark data sets corresponding to regression and binary classification problems of relevance to hit identification and toxicology. We conclude that Random Forest typically yields comparable or possibly better predictive performance than the linear modeling approaches and that its predictions may also be interpreted in a chemically and biologically meaningful way. In contrast to earlier work looking at interpretation of nonlinear QSAR models, we directly compare two methodologically distinct approaches for interpreting Random Forest models. The approaches for interpreting Random Forest assessed in our article were implemented using open-source programs that we have made available to the community. These programs are the rfFC package (https://r-forge.r-project.org/R/?group_id=1725) for the R statistical programming language and the Python program HeatMapWrapper (https://doi.org/10.5281/zenodo.495163) for heat map generation.

  5. Approximating prediction uncertainty for random forest regression models

    Treesearch

    John W. Coulston; Christine E. Blinn; Valerie A. Thomas; Randolph H. Wynne

    2016-01-01

    The use of machine learning approaches such as random forest has increased for the spatial modeling and mapping of continuous variables. Random forest is a non-parametric ensemble approach, and unlike traditional regression approaches there is no direct quantification of prediction error. Understanding prediction uncertainty is important when using model-based continuous maps as...

  6. The Past, Present and Future of the Meteorological Phenomena Identification Near the Ground (mPING) Project

    NASA Astrophysics Data System (ADS)

    Elmore, K. L.

    2016-12-01

    The Meteorological Phenomena Identification Near the Ground (mPING) project is an example of a crowd-sourced, citizen science effort to gather data of sufficient quality and quantity needed by new post-processing methods that use machine learning. Transportation and infrastructure are particularly sensitive to precipitation type in winter weather. We extract attributes from operational numerical forecast models and use them in a random forest to generate forecast winter precipitation types. We find that random forests applied to forecast soundings are effective at generating skillful forecasts of surface ptype with considerably more skill than the current algorithms, especially for ice pellets and freezing rain. We also find that three very different forecast models yield similar overall results, showing that random forests are able to extract essentially equivalent information from different forecast models. We also show that the random forest for each model and each profile type is unique to the particular forecast model and that the random forests developed using a particular model suffer significant degradation when given attributes derived from a different model. This implies that no single algorithm can perform well across all forecast models. Clearly, random forests extract information unavailable to "physically based" methods because the physical information in the models does not appear as we expect. One interesting result is that results from the classic "warm nose" sounding profile are, by far, the most sensitive to the particular forecast model, but this profile is also the one for which random forests are most skillful. Finally, a method for calibrating probabilities for each different ptype using multinomial logistic regression is shown.

  7. A comparison of the conditional inference survival forest model to random survival forests based on a simulation study as well as on two applications with time-to-event data.

    PubMed

    Nasejje, Justine B; Mwambi, Henry; Dheda, Keertan; Lesosky, Maia

    2017-07-28

    Random survival forest (RSF) models have been identified as alternative methods to the Cox proportional hazards model in analysing time-to-event data. These methods, however, have been criticised for the bias that results from favouring covariates with many split-points and hence conditional inference forests for time-to-event data have been suggested. Conditional inference forests (CIF) are known to correct the bias in RSF models by separating the procedure for selecting the best covariate to split on from that of the best split point search for the selected covariate. In this study, we compare the random survival forest model to the conditional inference forest (CIF) model using twenty-two simulated time-to-event datasets. We also analysed two real time-to-event datasets. The first dataset is based on the survival of children under five years of age in Uganda and it consists of categorical covariates with most of them having more than two levels (many split-points). The second dataset is based on the survival of patients with extremely drug resistant tuberculosis (XDR TB) and consists mainly of categorical covariates with two levels (few split-points). The study findings indicate that the conditional inference forest model is superior to random survival forest models in analysing time-to-event data that consist of covariates with many split-points, based on the values of the bootstrap cross-validated estimates for integrated Brier scores. However, conditional inference forests perform comparably to random survival forest models in analysing time-to-event data consisting of covariates with fewer split-points. Although survival forests are promising methods in analysing time-to-event data, it is important to identify the best forest model for analysis based on the nature of covariates of the dataset in question.

  8. Using Random Forest Models to Predict Organizational Violence

    NASA Technical Reports Server (NTRS)

    Levine, Burton; Bobashev, Georgly

    2012-01-01

    We present a methodology to assess the proclivity of an organization to commit violence against nongovernment personnel. We fitted a Random Forest model using the Minority at Risk Organizational Behavior (MAROS) dataset. The MAROS data are longitudinal, so individual observations are not independent. We propose a modification to the standard Random Forest methodology to account for the violation of the independence assumption. We present the results of the model fit and an example of predicting violence for an organization, and finally, we present a summary of the forest in a "meta-tree."

  9. Predicting temperate forest stand types using only structural profiles from discrete return airborne lidar

    NASA Astrophysics Data System (ADS)

    Fedrigo, Melissa; Newnham, Glenn J.; Coops, Nicholas C.; Culvenor, Darius S.; Bolton, Douglas K.; Nitschke, Craig R.

    2018-02-01

    Light detection and ranging (lidar) data have been increasingly used for forest classification due to their ability to penetrate the forest canopy and provide detail about the structure of the lower strata. In this study we demonstrate forest classification approaches using airborne lidar data as inputs to random forest and linear unmixing classification algorithms. Our results demonstrated that both random forest and linear unmixing models identified a distribution of rainforest and eucalypt stands that was comparable to existing ecological vegetation class (EVC) maps based primarily on manual interpretation of high resolution aerial imagery. Rainforest stands were also identified in the region that have not previously been identified in the EVC maps. The transition between stand types was better characterised by the random forest modelling approach. In contrast, the linear unmixing model placed greater emphasis on field plots selected as endmembers, which may not have captured the variability in stand structure within a single stand type. The random forest model had the highest overall accuracy (84%) and Cohen's kappa coefficient (0.62). However, the classification accuracy was only marginally better than linear unmixing. The random forest model was applied to a region in the Central Highlands of south-eastern Australia to produce maps of stand type probability, including areas of transition (the 'ecotone') between rainforest and eucalypt forest. The resulting map provided a detailed delineation of forest classes, which specifically recognised the coalescing of stand types at the landscape scale. This represents a key step towards mapping the structural and spatial complexity of these ecosystems, which is important for both their management and conservation.
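
    Overall accuracy and Cohen's kappa are both derived from a confusion matrix, and kappa is lower because it discounts agreement expected by chance. The sketch below shows the calculation. The 2x2 counts are hypothetical: they were chosen only so that the result lands near the reported 84% and 0.62, and the study's actual confusion matrix is not given in the abstract.

```python
def overall_accuracy(matrix):
    """Share of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def cohens_kappa(matrix):
    """Agreement beyond chance: (observed - expected) / (1 - expected)."""
    total = sum(sum(row) for row in matrix)
    observed = overall_accuracy(matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical counts (rows: reference class, columns: predicted class).
confusion = [
    [62, 8],   # reference eucalypt
    [8, 22],   # reference rainforest
]

accuracy = overall_accuracy(confusion)
kappa = cohens_kappa(confusion)
print(f"accuracy = {accuracy:.2f}, kappa = {kappa:.2f}")
```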

  10. Comparison of Random Forest and Parametric Imputation Models for Imputing Missing Data Using MICE: A CALIBER Study

    PubMed Central

    Shah, Anoop D.; Bartlett, Jonathan W.; Carpenter, James; Nicholas, Owen; Hemingway, Harry

    2014-01-01

    Multivariate imputation by chained equations (MICE) is commonly used for imputing missing data in epidemiologic research. The “true” imputation model may contain nonlinearities which are not included in default imputation models. Random forest imputation is a machine learning technique which can accommodate nonlinearities and interactions and does not require a particular regression model to be specified. We compared parametric MICE with a random forest-based MICE algorithm in 2 simulation studies. The first study used 1,000 random samples of 2,000 persons drawn from the 10,128 stable angina patients in the CALIBER database (Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; 2001–2010) with complete data on all covariates. Variables were artificially made “missing at random,” and the bias and efficiency of parameter estimates obtained using different imputation methods were compared. Both MICE methods produced unbiased estimates of (log) hazard ratios, but random forest was more efficient and produced narrower confidence intervals. The second study used simulated data in which the partially observed variable depended on the fully observed variables in a nonlinear way. Parameter estimates were less biased using random forest MICE, and confidence interval coverage was better. This suggests that random forest imputation may be useful for imputing complex epidemiologic data sets in which some patients have missing data. PMID:24589914

  11. Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study.

    PubMed

    Shah, Anoop D; Bartlett, Jonathan W; Carpenter, James; Nicholas, Owen; Hemingway, Harry

    2014-03-15

    Multivariate imputation by chained equations (MICE) is commonly used for imputing missing data in epidemiologic research. The "true" imputation model may contain nonlinearities which are not included in default imputation models. Random forest imputation is a machine learning technique which can accommodate nonlinearities and interactions and does not require a particular regression model to be specified. We compared parametric MICE with a random forest-based MICE algorithm in 2 simulation studies. The first study used 1,000 random samples of 2,000 persons drawn from the 10,128 stable angina patients in the CALIBER database (Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; 2001-2010) with complete data on all covariates. Variables were artificially made "missing at random," and the bias and efficiency of parameter estimates obtained using different imputation methods were compared. Both MICE methods produced unbiased estimates of (log) hazard ratios, but random forest was more efficient and produced narrower confidence intervals. The second study used simulated data in which the partially observed variable depended on the fully observed variables in a nonlinear way. Parameter estimates were less biased using random forest MICE, and confidence interval coverage was better. This suggests that random forest imputation may be useful for imputing complex epidemiologic data sets in which some patients have missing data.

  12. Novel solutions for an old disease: diagnosis of acute appendicitis with random forest, support vector machines, and artificial neural networks.

    PubMed

    Hsieh, Chung-Ho; Lu, Ruey-Hwa; Lee, Nai-Hsin; Chiu, Wen-Ta; Hsu, Min-Huei; Li, Yu-Chuan Jack

    2011-01-01

    Diagnosing acute appendicitis clinically is still difficult. We developed random forests, support vector machines, and artificial neural network models to diagnose acute appendicitis. Between January 2006 and December 2008, patients who had a consultation session with surgeons for suspected acute appendicitis were enrolled. Seventy-five percent of the data set was used to construct models including random forest, support vector machines, artificial neural networks, and logistic regression. Twenty-five percent of the data set was withheld to evaluate model performance. The area under the receiver operating characteristic curve (AUC) was used to evaluate performance, which was compared with that of the Alvarado score. Data from a total of 180 patients were collected, 135 used for training and 45 for testing. The mean age of patients was 39.4 years (range, 16-85). Final diagnosis revealed 115 patients with and 65 without appendicitis. The AUC of random forest, support vector machines, artificial neural networks, logistic regression, and Alvarado was 0.98, 0.96, 0.91, 0.87, and 0.77, respectively. The sensitivity, specificity, positive, and negative predictive values of random forest were 94%, 100%, 100%, and 87%, respectively. Random forest performed better than artificial neural networks, logistic regression, and Alvarado. We demonstrated that random forest can predict acute appendicitis with good accuracy and, deployed appropriately, can be an effective tool in clinical decision making. Copyright © 2011 Mosby, Inc. All rights reserved.
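
    The four diagnostic measures reported for random forest all follow from a 2x2 table of test-set outcomes. The abstract does not give that table, so the counts below are an assumption: one set of counts for 45 test patients that reproduces the reported percentages after rounding.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV and NPV from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positives among diseased
        "specificity": tn / (tn + fp),  # true negatives among healthy
        "ppv": tp / (tp + fp),          # diseased among predicted positive
        "npv": tn / (tn + fn),          # healthy among predicted negative
    }


# Hypothetical counts for the 45 test patients (not taken from the paper).
metrics = diagnostic_metrics(tp=30, fp=0, fn=2, tn=13)

for name, value in metrics.items():
    print(f"{name}: {value:.0%}")
```

    A table like this also shows why a specificity of 100% and a positive predictive value of 100% appear together: both require zero false positives.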

  13. Assessing the accuracy and stability of variable selection methods for random forest modeling in ecology

    EPA Science Inventory

    Random forest (RF) modeling has emerged as an important statistical learning method in ecology due to its exceptional predictive performance. However, for large and complex ecological datasets there is limited guidance on variable selection methods for RF modeling. Typically, e...

  14. Hierarchical Bayesian spatial models for predicting multiple forest variables using waveform LiDAR, hyperspectral imagery, and large inventory datasets

    USGS Publications Warehouse

    Finley, Andrew O.; Banerjee, Sudipto; Cook, Bruce D.; Bradford, John B.

    2013-01-01

    In this paper we detail a multivariate spatial regression model that couples LiDAR, hyperspectral and forest inventory data to predict forest outcome variables at a high spatial resolution. The proposed model is used to analyze forest inventory data collected on the US Forest Service Penobscot Experimental Forest (PEF), ME, USA. In addition to helping meet the regression model's assumptions, results from the PEF analysis suggest that the addition of multivariate spatial random effects improves model fit and predictive ability, compared with two commonly applied modeling approaches. This improvement results from explicitly modeling the covariation among forest outcome variables and spatial dependence among observations through the random effects. Direct application of such multivariate models to even moderately large datasets is often computationally infeasible because of cubic order matrix algorithms involved in estimation. We apply a spatial dimension reduction technique to help overcome this computational hurdle without sacrificing richness in modeling.

  15. Variable selection with random forest: Balancing stability, performance, and interpretation in ecological and environmental modeling

    EPA Science Inventory

    Random forest (RF) is popular in ecological and environmental modeling, in part, because of its insensitivity to correlated predictors and resistance to overfitting. Although variable selection has been proposed to improve both performance and interpretation of RF models, it is u...

  16. Newer classification and regression tree techniques: Bagging and Random Forests for ecological prediction

    Treesearch

    Anantha M. Prasad; Louis R. Iverson; Andy Liaw

    2006-01-01

    We evaluated four statistical models - Regression Tree Analysis (RTA), Bagging Trees (BT), Random Forests (RF), and Multivariate Adaptive Regression Splines (MARS) - for predictive vegetation mapping under current and future climate scenarios according to the Canadian Climate Centre global circulation model.

  17. A Comparison between Decision Tree and Random Forest in Determining the Risk Factors Associated with Type 2 Diabetes.

    PubMed

    Esmaily, Habibollah; Tayefi, Maryam; Doosti, Hassan; Ghayour-Mobarhan, Majid; Nezami, Hossein; Amirabadizadeh, Alireza

    2018-04-24

    We aimed to identify the risk factors associated with type 2 diabetes mellitus (T2DM) using a data mining approach, with decision tree and random forest techniques applied to data from the Mashhad Stroke and Heart Atherosclerotic Disorders (MASHAD) Study program. This was a cross-sectional study. The MASHAD study started in 2010 and will continue until 2020. Two data mining tools, namely decision trees and random forests, were used for predicting T2DM from other observed characteristics of 9528 subjects recruited from the MASHAD database. This paper compares these two models in terms of accuracy, sensitivity, specificity, and the area under the ROC curve. The prevalence rate of T2DM was 14% among these subjects. The decision tree model has 64.9% accuracy, 64.5% sensitivity, 66.8% specificity, and an area under the ROC curve of 68.6%, while the random forest model has 71.1% accuracy, 71.3% sensitivity, 69.9% specificity, and an area under the ROC curve of 77.3%. The random forest model, when used with demographic, clinical, anthropometric, and biochemical measurements, can provide a simple tool to identify associated risk factors for type 2 diabetes. Such identification can be of substantial use in managing health policy to reduce the number of subjects with T2DM.

  18. Random Forest as an Imputation Method for Education and Psychology Research: Its Impact on Item Fit and Difficulty of the Rasch Model

    ERIC Educational Resources Information Center

    Golino, Hudson F.; Gomes, Cristiano M. A.

    2016-01-01

    This paper presents a non-parametric imputation technique, named random forest, from the machine learning field. The random forest procedure has two main tuning parameters: the number of trees grown in the prediction and the number of predictors used. Fifty experimental conditions were created in the imputation procedure, with different…

  19. Characterizing stand-level forest canopy cover and height using Landsat time series, samples of airborne LiDAR, and the Random Forest algorithm

    NASA Astrophysics Data System (ADS)

    Ahmed, Oumer S.; Franklin, Steven E.; Wulder, Michael A.; White, Joanne C.

    2015-03-01

    Many forest management activities, including the development of forest inventories, require spatially detailed forest canopy cover and height data. Among the various remote sensing technologies, LiDAR (Light Detection and Ranging) offers the most accurate and consistent means for obtaining reliable canopy structure measurements. A potential solution to reduce the cost of LiDAR data is to integrate transects (samples) of LiDAR data with frequently acquired and spatially comprehensive optical remotely sensed data. Although multiple regression is commonly used for such modeling, often it does not fully capture the complex relationships between forest structure variables. This study investigates the potential of Random Forest (RF), a machine learning technique, to estimate LiDAR measured canopy structure using a time series of Landsat imagery. The study is implemented over a 2600 ha area of industrially managed coastal temperate forests on Vancouver Island, British Columbia, Canada. We implemented a trajectory-based approach to time series analysis that generates time since disturbance (TSD) and disturbance intensity information for each pixel and we used this information to stratify the forest land base into two strata: mature forests and young forests. Canopy cover and height for three forest classes (i.e., mature, young, and mature and young combined) were modeled separately using multiple regression and Random Forest (RF) techniques. For all forest classes, the RF models provided improved estimates relative to the multiple regression models. The lowest validation error was obtained for the mature forest stratum in an RF model (R2 = 0.88, RMSE = 2.39 m and bias = -0.16 for canopy height; R2 = 0.72, RMSE = 0.068% and bias = -0.0049 for canopy cover). This study demonstrates the value of using disturbance and successional history to inform estimates of canopy structure and obtain improved estimates of forest canopy cover and height using the RF algorithm.
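
    The three validation statistics quoted above (R2, RMSE and bias) can be computed from paired observed and predicted values. The sketch below is illustrative only. It assumes R2 is defined as 1 - SS_res / SS_tot and bias as the mean of predicted minus observed; the abstract does not state which definitions were used, and the canopy heights are made-up numbers.

```python
import math


def validation_stats(observed, predicted):
    """Return (r2, rmse, bias) for paired observed and predicted values."""
    n = len(observed)
    residuals = [p - o for o, p in zip(observed, predicted)]
    bias = sum(residuals) / n                      # mean signed error
    ss_res = sum(r * r for r in residuals)         # residual sum of squares
    rmse = math.sqrt(ss_res / n)                   # root mean squared error
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1.0 - ss_res / ss_tot                     # assumed R2 definition
    return r2, rmse, bias


# Hypothetical LiDAR-measured and model-predicted canopy heights (m).
lidar_height = [20.0, 25.0, 30.0, 35.0]
model_height = [21.0, 24.0, 29.0, 34.0]

r2, rmse, bias = validation_stats(lidar_height, model_height)
print(f"R2 = {r2:.3f}, RMSE = {rmse:.2f} m, bias = {bias:.2f} m")
```

    A negative bias under this sign convention means the model underestimates on average.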

  20. Random forests as cumulative effects models: A case study of lakes and rivers in Muskoka, Canada.

    PubMed

    Jones, F Chris; Plewes, Rachel; Murison, Lorna; MacDougall, Mark J; Sinclair, Sarah; Davies, Christie; Bailey, John L; Richardson, Murray; Gunn, John

    2017-10-01

    Cumulative effects assessment (CEA) - a type of environmental appraisal - lacks effective methods for modeling cumulative effects, evaluating indicators of ecosystem condition, and exploring the likely outcomes of development scenarios. Random forests are an extension of classification and regression trees, which model response variables by recursive partitioning. Random forests were used to model a series of candidate ecological indicators that described lakes and rivers from a case study watershed (The Muskoka River Watershed, Canada). Suitability of the candidate indicators for use in cumulative effects assessment and watershed monitoring was assessed according to how well they could be predicted from natural habitat features and how sensitive they were to human land-use. The best models explained 75% of the variation in a multivariate descriptor of lake benthic-macroinvertebrate community structure, and 76% of the variation in the conductivity of river water. Similar results were obtained by cross-validation. Several candidate indicators detected a simulated doubling of urban land-use in their catchments, and a few were able to detect a simulated doubling of agricultural land-use. The paper demonstrates that random forests can be used to describe the combined and singular effects of multiple stressors and natural environmental factors, and furthermore, that random forests can be used to evaluate the performance of monitoring indicators. The numerical methods presented are applicable to any ecosystem and indicator type, and therefore represent a step forward for CEA. Crown Copyright © 2017. Published by Elsevier Ltd. All rights reserved.

  1. Evaluating effectiveness of down-sampling for stratified designs and unbalanced prevalence in Random Forest models of tree species distributions in Nevada

    Treesearch

    Elizabeth A. Freeman; Gretchen G. Moisen; Tracy S. Frescino

    2012-01-01

    Random Forests is frequently used to model species distributions over large geographic areas. Complications arise when data used to train the models have been collected in stratified designs that involve different sampling intensity per stratum. The modeling process is further complicated if some of the target species are relatively rare on the landscape leading to an...
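
    Down-sampling for a rare species can be sketched as drawing an equal number of records from each class before model fitting. The abstract is truncated, so the study's actual scheme (including how strata are handled) is not shown here; the function, field names and plot records below are hypothetical.

```python
import random


def down_sample(records, label_key, seed=0):
    """Randomly reduce every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for record in records:
        by_class.setdefault(record[label_key], []).append(record)
    target = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# Hypothetical plot records: the species is present on 10 of 100 plots.
plots = [{"plot": i, "present": i % 10 == 0} for i in range(100)]

balanced = down_sample(plots, "present")
n_present = sum(1 for p in balanced if p["present"])
print(len(balanced), n_present)
```

    Balancing the classes this way discards most absence records, which is the trade-off such evaluations need to weigh.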

  2. Random forest regression modelling for forest aboveground biomass estimation using RISAT-1 PolSAR and terrestrial LiDAR data

    NASA Astrophysics Data System (ADS)

    Mangla, Rohit; Kumar, Shashi; Nandy, Subrata

    2016-05-01

    SAR and LiDAR remote sensing have already shown the potential of active sensors for forest parameter retrieval. A SAR sensor in its fully polarimetric mode can retrieve the scattering properties of different components of forest structure, and LiDAR can measure structural information with very high accuracy. This study focused on retrieval of forest aboveground biomass (AGB) using Terrestrial Laser Scanner (TLS) based point clouds and the scattering properties of forest vegetation obtained from decomposition modelling of RISAT-1 fully polarimetric SAR data. TLS data were acquired for 14 plots of Timli forest range, Uttarakhand, India. The forest area is dominated by Sal trees, and random sampling with a plot size of 0.1 ha (31.62 m × 31.62 m) was adopted for TLS and field data collection. RISAT-1 data were processed to retrieve SAR-based variables, and 3D imaging based on TLS point clouds was used to retrieve LiDAR-based variables. Surface scattering, double-bounce scattering, volume scattering, helix and wire scattering were the SAR-based variables retrieved from polarimetric decomposition. Tree heights and stem diameters were the LiDAR-based variables, retrieved with the single tree vertical height and least square circle fit methods respectively. All the variables obtained for the forest plots were used as input to a machine learning based Random Forest Regression Model, which was developed in this study for forest AGB estimation. Modelled output for forest AGB showed reliable accuracy (RMSE = 27.68 t/ha), and a good coefficient of determination (0.63) was obtained through the linear regression between modelled AGB and field-estimated AGB. The sensitivity analysis showed that the model was most sensitive to the major contributing variables (stem diameter and volume scattering), which were measured by two different remote sensing techniques. This study strongly recommends the integration of SAR and LiDAR data for forest AGB estimation.
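
    The two accuracy figures quoted in this abstract, RMSE and the coefficient of determination from a linear regression of modelled on field-estimated AGB, can be computed roughly as sketched below. The plot values are invented for illustration and are not data from the study; the function names are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between two equal-length sequences."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared_linear(x, y):
    """Coefficient of determination of the least-squares line y = a + b*x.

    For simple linear regression this equals the squared Pearson correlation.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return (sxy * sxy) / (sxx * syy)


# Hypothetical AGB values in t/ha (NOT taken from the study).
field_agb = [210.0, 250.0, 180.0, 300.0, 270.0]
modelled_agb = [225.0, 240.0, 200.0, 280.0, 260.0]

print(f"RMSE = {rmse(field_agb, modelled_agb):.2f} t/ha")
print(f"R^2  = {r_squared_linear(field_agb, modelled_agb):.2f}")
```

    RMSE is expressed in the units of the response (t/ha here), whereas the coefficient of determination is unitless, which is why abstracts such as this one usually report both.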

  3. Probabilistic risk models for multiple disturbances: an example of forest insects and wildfires

    Treesearch

    Haiganoush K. Preisler; Alan A. Ager; Jane L. Hayes

    2010-01-01

    Building probabilistic risk models for highly random forest disturbances like wildfire and forest insect outbreaks is challenging. Modeling the interactions among natural disturbances is even more difficult. In the case of wildfire and forest insects, we looked at the probability of a large fire given an insect outbreak and also the incidence of insect outbreaks...

  4. Application of random survival forests in understanding the determinants of under-five child mortality in Uganda in the presence of covariates that satisfy the proportional and non-proportional hazards assumption.

    PubMed

    Nasejje, Justine B; Mwambi, Henry

    2017-09-07

    Uganda, just like any other Sub-Saharan African country, has a high under-five child mortality rate. To inform policy on intervention strategies, sound statistical methods are required to critically identify factors strongly associated with under-five child mortality rates. The Cox proportional hazards model has been a common choice in analysing data to understand factors strongly associated with high child mortality rates, taking age as the time-to-event variable. However, due to its restrictive proportional hazards (PH) assumption, some covariates of interest which do not satisfy the assumption are often excluded from the analysis to avoid mis-specifying the model, since using covariates that clearly violate the assumption would produce invalid results. Survival trees and random survival forests are becoming increasingly popular in analysing survival data, particularly in the case of large survey data, and could be attractive alternatives to models with the restrictive PH assumption. In this article, we adopt random survival forests, which have never been used in understanding factors affecting under-five child mortality rates in Uganda, using Demographic and Health Survey data. Thus the first part of the analysis is based on the classical Cox PH model and the second part on random survival forests in the presence of covariates that do not necessarily satisfy the PH assumption. Random survival forests and the Cox proportional hazards model agree that the sex of the household head, the sex of the child and the number of births in the past 1 year are strongly associated with under-five child mortality in Uganda, given that all three covariates satisfy the PH assumption. Random survival forests further demonstrated that covariates that were originally excluded from the earlier analysis due to violation of the PH assumption were important in explaining under-five child mortality rates. These covariates include the number of children under the age of five in a household, number of births in the past 5 years, wealth index, total number of children ever born and the child's birth order. The results further indicated that the predictive performance for random survival forests built using covariates including those that violate the PH assumption was higher than that for random survival forests built using only covariates that satisfy the PH assumption. Random survival forests are appealing methods in analysing public health data to understand factors strongly associated with under-five child mortality rates, especially in the presence of covariates that violate the proportional hazards assumption.

  5. Calibrating random forests for probability estimation.

    PubMed

    Dankowski, Theresa; Ziegler, Andreas

    2016-09-30

    Probabilities can be consistently estimated using random forests. It is, however, unclear how random forests should be updated to make predictions for other centers or at different time points. In this work, we present two approaches for updating random forests for probability estimation. The first method has been proposed by Elkan and may be used for updating any machine learning approach yielding consistent probabilities, so-called probability machines. The second approach is a new strategy specifically developed for random forests. Using the terminal nodes, which represent conditional probabilities, the random forest is first translated to logistic regression models. These are, in turn, used for re-calibration. The two updating strategies were compared in a simulation study and are illustrated with data from the German Stroke Study Collaboration. In most simulation scenarios, both methods led to similar improvements. In the simulation scenario in which the stricter assumptions of Elkan's method were not met, the logistic regression-based re-calibration approach for random forests outperformed Elkan's method. It also performed better on the stroke data than Elkan's method. The strength of Elkan's method is its general applicability to any probability machine. However, if the strict assumptions underlying this approach are not met, the logistic regression-based approach is preferable for updating random forests for probability estimation. © 2016 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd.
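
    The abstract does not spell out Elkan's update. A minimal sketch follows, assuming it is the familiar base-rate (prior shift) correction, which rescales the predicted odds by the ratio of the new to the old outcome odds. The function name and the example rates are hypothetical, and this is not the paper's logistic regression-based re-calibration.

```python
def adjust_for_base_rate(p, old_rate, new_rate):
    """Update a predicted probability when the outcome prevalence changes.

    Assumption (not confirmed by the abstract): only the base rate differs
    between the training setting and the new setting, so the predicted odds
    are multiplied by (new odds of outcome) / (old odds of outcome).
    """
    odds_ratio = (new_rate / (1.0 - new_rate)) / (old_rate / (1.0 - old_rate))
    scaled = p * odds_ratio
    return scaled / (scaled + (1.0 - p))


# Hypothetical example: model trained where 50% of patients had the outcome,
# applied in a center where only 10% do.
for p in (0.2, 0.5, 0.8):
    print(p, round(adjust_for_base_rate(p, old_rate=0.5, new_rate=0.1), 3))
```

    A quick sanity check of the design: a prediction equal to the old base rate maps to the new base rate, and the update leaves predictions unchanged when the two rates coincide.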

  6. 3D Semantic Labeling of ALS Data Based on Domain Adaption by Transferring and Fusing Random Forest Models

    NASA Astrophysics Data System (ADS)

    Wu, J.; Yao, W.; Zhang, J.; Li, Y.

    2018-04-01

    Labeling 3D point cloud data with traditional supervised learning methods requires a considerable number of labelled samples, which are costly and time-consuming to collect. This work focuses on adopting the domain adaption concept to transfer existing trained random forest classifiers (based on the source domain) to new data scenes (the target domain), which aims at reducing the dependence of accurate 3D semantic labeling in point clouds on training samples from the new data scene. First, two random forest classifiers were trained with existing samples previously collected for other data. They differed in their decision tree construction algorithms: C4.5 with information gain ratio and CART with Gini index. Second, four random forest classifiers adapted to the target domain were derived by transferring each tree in the source random forest models with two types of operations: structure expansion and reduction (SER) and structure transfer (STRUT). Finally, points in the target domain were labelled by fusing the four newly derived random forest classifiers using a weights-of-evidence based fusion model. To validate our method, experimental analysis was conducted using 3 datasets: one was used as the source domain data (Vaihingen data for 3D Semantic Labelling); the other two were used as the target domain data from two cities in China (Jinmen city and Dunhuang city). Overall accuracies of 85.5 % and 83.3 % for 3D labelling were achieved for Jinmen city and Dunhuang city data respectively, with only 1/3 newly labelled samples compared to the cases without domain adaption.

  7. Road Network State Estimation Using Random Forest Ensemble Learning

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hou, Yi; Edara, Praveen; Chang, Yohan

    Network-scale travel time prediction not only enables traffic management centers (TMC) to proactively implement traffic management strategies, but also allows travelers to make informed decisions about route choices between various origins and destinations. In this paper, a random forest estimator was proposed to predict travel time in a network. The estimator was trained using two years of historical travel time data for a case study network in St. Louis, Missouri. Both temporal and spatial effects were considered in the modeling process. The random forest models predicted travel times accurately during both congested and uncongested traffic conditions. The computational times for the models were low, and thus useful for real-time traffic management and traveler information applications.

  8. Do bioclimate variables improve performance of climate envelope models?

    USGS Publications Warehouse

    Watling, James I.; Romañach, Stephanie S.; Bucklin, David N.; Speroterra, Carolina; Brandt, Laura A.; Pearlstine, Leonard G.; Mazzotti, Frank J.

    2012-01-01

    Climate envelope models are widely used to forecast potential effects of climate change on species distributions. A key issue in climate envelope modeling is the selection of predictor variables that most directly influence species. To determine whether model performance and spatial predictions were related to the selection of predictor variables, we compared models using bioclimate variables with models constructed from monthly climate data for twelve terrestrial vertebrate species in the southeastern USA using two different algorithms (random forests or generalized linear models), and two model selection techniques (using uncorrelated predictors or a subset of user-defined biologically relevant predictor variables). There were no differences in performance between models created with bioclimate or monthly variables, but one metric of model performance was significantly greater using the random forest algorithm compared with generalized linear models. Spatial predictions between maps using bioclimate and monthly variables were very consistent using the random forest algorithm with uncorrelated predictors, whereas we observed greater variability in predictions using generalized linear models.

  9. Screening large-scale association study data: exploiting interactions using random forests.

    PubMed

    Lunetta, Kathryn L; Hayward, L Brooke; Segal, Jonathan; Van Eerdewegh, Paul

    2004-12-10

    Genome-wide association studies for complex diseases will produce genotypes on hundreds of thousands of single nucleotide polymorphisms (SNPs). A logical first approach to dealing with massive numbers of SNPs is to use some test to screen the SNPs, retaining only those that meet some criterion for further study. For example, SNPs can be ranked by p-value, and those with the lowest p-values retained. When SNPs have large interaction effects but small marginal effects in a population, they are unlikely to be retained when univariate tests are used for screening. However, model-based screens that pre-specify interactions are impractical for data sets with thousands of SNPs. Random forest analysis is an alternative method that produces a single measure of importance for each predictor variable that takes into account interactions among variables without requiring model specification. Interactions increase the importance for the individual interacting variables, making them more likely to be given high importance relative to other variables. We test the performance of random forests as a screening procedure to identify small numbers of risk-associated SNPs from among large numbers of unassociated SNPs using complex disease models with up to 32 loci, incorporating both genetic heterogeneity and multi-locus interaction. Keeping other factors constant, if risk SNPs interact, the random forest importance measure significantly outperforms the Fisher Exact test as a screening tool. As the number of interacting SNPs increases, the improvement in performance of random forest analysis relative to Fisher Exact test for screening also increases. Random forests perform similarly to the univariate Fisher Exact test as a screening tool when SNPs in the analysis do not interact. 
    In the context of large-scale genetic association studies where unknown interactions exist among true risk-associated SNPs or SNPs and environmental covariates, screening SNPs using random forest analyses can significantly reduce the number of SNPs that need to be retained for further study compared to standard univariate screening methods.
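
    The point about SNPs with large interaction effects but small marginal effects can be made concrete with a toy example. The sketch below uses hypothetical, perfectly balanced data in which disease status is the exclusive-or of two binary SNPs. It is not a random forest and not the authors' simulation; it only shows why a one-SNP-at-a-time screen sees nothing while a joint view of the two SNPs sees everything.

```python
from itertools import product

# Hypothetical toy data: 25 individuals for each combination of two binary
# SNPs "a" and "b"; disease status is a XOR b (a pure interaction).
records = [(a, b, a ^ b) for a, b in product((0, 1), repeat=2) for _ in range(25)]


def risk_given(records, **fixed):
    """Fraction of cases among the records matching the fixed SNP values."""
    index = {"a": 0, "b": 1}
    matching = [r for r in records
                if all(r[index[name]] == value for name, value in fixed.items())]
    return sum(r[2] for r in matching) / len(matching)


# Univariate view: risk is identical for both values of each SNP.
for snp in ("a", "b"):
    print(snp, risk_given(records, **{snp: 0}), risk_given(records, **{snp: 1}))

# Joint view: risk is fully determined by the pair of SNPs.
for a, b in product((0, 1), repeat=2):
    print((a, b), risk_given(records, a=a, b=b))
```

    In this extreme case any univariate test, Fisher's exact test included, has nothing to detect, so only a procedure that evaluates SNPs in combination can rank the two SNPs highly.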

  10. Unbiased split variable selection for random survival forests using maximally selected rank statistics.

    PubMed

    Wright, Marvin N; Dankowski, Theresa; Ziegler, Andreas

    2017-04-15

    The most popular approach for analyzing survival data is the Cox regression model. The Cox model may, however, be misspecified, and its proportionality assumption may not always be fulfilled. An alternative approach for survival prediction is random forests for survival outcomes. The standard split criterion for random survival forests is the log-rank test statistic, which favors splitting variables with many possible split points. Conditional inference forests avoid this split variable selection bias. However, linear rank statistics are utilized by default in conditional inference forests to select the optimal splitting variable, which cannot detect non-linear effects in the independent variables. An alternative is to use maximally selected rank statistics for the split point selection. As in conditional inference forests, splitting variables are compared on the p-value scale. However, instead of the conditional Monte-Carlo approach used in conditional inference forests, p-value approximations are employed. We describe several p-value approximations and the implementation of the proposed random forest approach. A simulation study demonstrates that unbiased split variable selection is possible. However, there is a trade-off between unbiased split variable selection and runtime. In benchmark studies of prediction performance on simulated and real datasets, the new method performs better than random survival forests if informative dichotomous variables are combined with uninformative variables with more categories and better than conditional inference forests if non-linear covariate effects are included. In a runtime comparison, the method proves to be computationally faster than both alternatives, if a simple p-value approximation is used. Copyright © 2017 John Wiley & Sons, Ltd.

  11. Comparative analysis of used car price evaluation models

    NASA Astrophysics Data System (ADS)

    Chen, Chuancan; Hao, Lulu; Xu, Cong

    2017-05-01

    An accurate used car price evaluation is a catalyst for the healthy development of the used car market. Data mining has been applied to predict used car prices in several articles. However, few studies have compared different algorithms for used car price estimation. This paper collects more than 100,000 used car transaction records from across China for an empirical comparison of two algorithms: linear regression and random forest. These two algorithms are used to predict used car prices in three different models: a model for a certain car make, a model for a certain car series, and a universal model. Results show that random forest has a stable but not ideal effect in the price evaluation model for a certain car make, but it shows a great advantage in the universal model compared with linear regression. This indicates that random forest is an optimal algorithm when handling complex models with a large number of variables and samples, yet it shows no obvious advantage when coping with simple models with fewer variables.

  12. Random forests and stochastic gradient boosting for predicting tree canopy cover: Comparing tuning processes and model performance

    Treesearch

    E. Freeman; G. Moisen; J. Coulston; B. Wilson

    2014-01-01

    Random forests (RF) and stochastic gradient boosting (SGB), both involving an ensemble of classification and regression trees, are compared for modeling tree canopy cover for the 2011 National Land Cover Database (NLCD). The objectives of this study were twofold. First, sensitivity of RF and SGB to choices in tuning parameters was explored. Second, performance of the...

  13. Random forests and stochastic gradient boosting for predicting tree canopy cover: Comparing tuning processes and model performance

    Treesearch

    Elizabeth A. Freeman; Gretchen G. Moisen; John W. Coulston; Barry T. (Ty) Wilson

    2015-01-01

    As part of the development of the 2011 National Land Cover Database (NLCD) tree canopy cover layer, a pilot project was launched to test the use of high-resolution photography coupled with extensive ancillary data to map the distribution of tree canopy cover over four study regions in the conterminous US. Two stochastic modeling techniques, random forests (RF...

  14. 3D statistical shape models incorporating 3D random forest regression voting for robust CT liver segmentation

    NASA Astrophysics Data System (ADS)

    Norajitra, Tobias; Meinzer, Hans-Peter; Maier-Hein, Klaus H.

    2015-03-01

    During image segmentation, 3D Statistical Shape Models (SSM) usually conduct a limited search for target landmarks within one-dimensional search profiles perpendicular to the model surface. In addition, landmark appearance is modeled only locally based on linear profiles and weak learners, altogether leading to segmentation errors from landmark ambiguities and limited search coverage. We present a new method for 3D SSM segmentation based on 3D Random Forest Regression Voting. For each surface landmark, a Random Regression Forest is trained that learns a 3D spatial displacement function between the according reference landmark and a set of surrounding sample points, based on an infinite set of non-local randomized 3D Haar-like features. Landmark search is then conducted omni-directionally within 3D search spaces, where voxelwise forest predictions on landmark position contribute to a common voting map which reflects the overall position estimate. Segmentation experiments were conducted on a set of 45 CT volumes of the human liver, of which 40 images were randomly chosen for training and 5 for testing. Without parameter optimization, using a simple candidate selection and a single resolution approach, excellent results were achieved, while faster convergence and better concavity segmentation were observed, altogether underlining the potential of our approach in terms of increased robustness from distinct landmark detection and from better search coverage.

  15. Do little interactions get lost in dark random forests?

    PubMed

    Wright, Marvin N; Ziegler, Andreas; König, Inke R

    2016-03-31

    Random forests have often been claimed to uncover interaction effects. However, if and how interaction effects can be differentiated from marginal effects remains unclear. In extensive simulation studies, we investigate whether random forest variable importance measures capture or detect gene-gene interactions. We define capturing an interaction as the ability to identify a variable that acts through an interaction with another one, while detection is the ability to identify an interaction effect as such. Of the single importance measures, the Gini importance captured interaction effects in most of the simulated scenarios; however, they were masked by marginal effects in other variables. With the permutation importance, the proportion of captured interactions was lower in all cases. Pairwise importance measures performed about equal, with a slight advantage for the joint variable importance method. However, the overall fraction of detected interactions was low. In almost all scenarios the detection fraction in a model with only marginal effects was larger than in a model with an interaction effect only. Random forests are generally capable of capturing gene-gene interactions, but current variable importance measures are unable to detect them as interactions. In most of the cases, interactions are masked by marginal effects and interactions cannot be differentiated from marginal effects. Consequently, caution is warranted when claiming that random forests uncover interactions.
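
    For readers unfamiliar with the permutation importance mentioned above, the idea can be sketched as follows: shuffle one predictor, re-score the fitted model, and record the drop in accuracy. The sketch is a simplification under stated assumptions. The "model" is a hypothetical fixed rule standing in for a fitted forest, the data are made up, and the importance is computed on the full data set, whereas random forest implementations typically compute it per tree on out-of-bag samples.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the correct label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, column, rng):
    """Drop in accuracy after randomly permuting one predictor column."""
    baseline = accuracy(model, rows, labels)
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:column] + (v,) + r[column + 1:] for r, v in zip(rows, shuffled)]
    return baseline - accuracy(model, permuted, labels)


# Hypothetical data: two binary predictors; the label depends only on the first.
rows = [(i % 2, (i // 2) % 2) for i in range(200)]
labels = [r[0] for r in rows]


def model(row):
    """Stand-in for a fitted model: predicts from the first predictor only."""
    return row[0]


rng = random.Random(0)
importance_first = permutation_importance(model, rows, labels, 0, rng)
importance_second = permutation_importance(model, rows, labels, 1, rng)
print(importance_first, importance_second)
```

    Because each predictor is permuted on its own, the score is a single-variable measure: it can flag a variable that matters, but it does not say whether that variable acts marginally or through an interaction, which is the distinction between capturing and detecting drawn in the abstract.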

  16. Benchmarking dairy herd health status using routinely recorded herd summary data.

    PubMed

    Parker Gaddis, K L; Cole, J B; Clay, J S; Maltecca, C

    2016-02-01

    Genetic improvement of dairy cattle health through the use of producer-recorded data has been determined to be feasible. Low estimated heritabilities indicate that genetic progress will be slow. Variation observed in lowly heritable traits can largely be attributed to nongenetic factors, such as the environment. More rapid improvement of dairy cattle health may be attainable if herd health programs incorporate environmental and managerial aspects. More than 1,100 herd characteristics are regularly recorded on farm test-days. We combined these data with producer-recorded health event data, and parametric and nonparametric models were used to benchmark herd and cow health status. Health events were grouped into 3 categories for analyses: mastitis, reproductive, and metabolic. Both herd incidence and individual incidence were used as dependent variables. Models implemented included stepwise logistic regression, support vector machines, and random forests. At both the herd and individual levels, random forest models attained the highest accuracy for predicting health status in all health event categories when evaluated with 10-fold cross-validation. Accuracy (SD) ranged from 0.61 (0.04) to 0.63 (0.04) when using random forest models at the herd level. Accuracy of prediction (SD) at the individual cow level ranged from 0.87 (0.06) to 0.93 (0.001) with random forest models. Highly significant variables and key words from logistic regression and random forest models were also investigated. All models identified several of the same key factors for each health event category, including movement out of the herd, size of the herd, and weather-related variables. We concluded that benchmarking health status using routinely collected herd data is feasible. Nonparametric models were better suited to handle this complex data with numerous variables. 
    These data mining techniques were able to perform prediction of health status and could add evidence to personal experience in herd management. Copyright © 2016 American Dairy Science Association. Published by Elsevier Inc. All rights reserved.

  17. Fast image interpolation via random forests.

    PubMed

    Huang, Jun-Jie; Siu, Wan-Chi; Liu, Tian-Rui

    2015-10-01

    This paper proposes a two-stage framework for fast image interpolation via random forests (FIRF). The proposed FIRF method gives high accuracy and requires little computation. The underlying idea of this work is to apply random forests to classify the natural image patch space into numerous subspaces and learn a linear regression model for each subspace to map the low-resolution image patch to a high-resolution image patch. The FIRF framework consists of two stages. Stage 1 of the framework removes most of the ringing and aliasing artifacts in the initial bicubic interpolated image, while Stage 2 further refines the Stage 1 interpolated image. By varying the number of decision trees in the random forests and the number of stages applied, the proposed FIRF method can realize computationally scalable image interpolation. Extensive experimental results show that the proposed FIRF(3, 2) method achieves more than 0.3 dB improvement in peak signal-to-noise ratio over the state-of-the-art nonlocal autoregressive modeling (NARM) method. Moreover, the proposed FIRF(1, 1) obtains results similar to or better than NARM while taking only 0.3% of its computational time.

  18. Tehran Air Pollutants Prediction Based on Random Forest Feature Selection Method

    NASA Astrophysics Data System (ADS)

    Shamsoddini, A.; Aboodi, M. R.; Karami, J.

    2017-09-01

    Air pollution, as one of the most serious forms of environmental pollution, poses a huge threat to human life. Air pollution leads to environmental instability, and has harmful and undesirable effects on the environment. Modern methods for predicting pollutant concentrations are able to improve decision making and provide appropriate solutions. This study examines the performance of Random Forest feature selection in combination with multiple linear regression and Multilayer Perceptron Artificial Neural Network methods, in order to achieve an efficient model to estimate carbon monoxide, nitrogen dioxide, sulfur dioxide and PM2.5 contents in the air. The results indicated that Artificial Neural Networks fed by the attributes selected by the Random Forest feature selection method performed more accurately than other models for the modeling of all pollutants. The estimation accuracy of sulfur dioxide emissions was lower than that of the other air contaminants, whereas nitrogen dioxide was predicted more accurately than the other pollutants.

  19. Estimating the impact of mineral aerosols on crop yields in food insecure regions using statistical crop models

    NASA Astrophysics Data System (ADS)

    Hoffman, A.; Forest, C. E.; Kemanian, A.

    2016-12-01

    A significant number of food-insecure nations exist in regions of the world where dust plays a large role in the climate system. While the impacts of common climate variables (e.g. temperature, precipitation, ozone, and carbon dioxide) on crop yields are relatively well understood, the impact of mineral aerosols on yields has not yet been thoroughly investigated. This research aims to develop the data and tools to progress our understanding of mineral aerosol impacts on crop yields. Suspended dust affects crop yields by altering the amount and type of radiation reaching the plant and by modifying local temperature and precipitation, while dust events (i.e. dust storms) affect crop yields by depleting the soil of nutrients or by defoliation via particle abrasion. The impact of dust on yields is modeled statistically because we are uncertain which impacts will dominate the response on the national and regional scales considered in this study. Multiple linear regression is used in a number of large-scale statistical crop modeling studies to estimate yield responses to various climate variables. In alignment with previous work, we develop linear crop models, but build upon this simple method of regression with machine-learning techniques (e.g. random forests) to identify important statistical predictors and isolate how dust affects yields on the scales of interest. To perform this analysis, we develop a crop-climate dataset for maize, soybean, groundnut, sorghum, rice, and wheat for the regions of West Africa, East Africa, South Africa, and the Sahel. Random forest regression models consistently model historic crop yields better than the linear models. In several instances, the random forest models accurately capture the temperature and precipitation threshold behavior in crops. Additionally, improving agricultural technology has caused a well-documented positive trend that dominates time series of global and regional yields. This trend is often removed before regression with traditional crop models, but likely at the cost of removing climate information. Our random forest models consistently discover the positive trend without removing any additional data. The application of random forests as a statistical crop model provides insight into understanding the impact of dust on yields in marginal food producing regions.

  20. Prostate cancer prediction using the random forest algorithm that takes into account transrectal ultrasound findings, age, and serum levels of prostate-specific antigen.

    PubMed

    Xiao, Li-Hong; Chen, Pei-Ran; Gou, Zhong-Ping; Li, Yong-Zhong; Li, Mei; Xiang, Liang-Cheng; Feng, Ping

    2017-01-01

    The aim of this study is to evaluate the ability of the random forest algorithm that combines data on transrectal ultrasound findings, age, and serum levels of prostate-specific antigen to predict prostate carcinoma. Clinico-demographic data were analyzed for 941 patients with prostate diseases treated at our hospital, including age, serum prostate-specific antigen levels, transrectal ultrasound findings, and pathology diagnosis based on ultrasound-guided needle biopsy of the prostate. These data were compared between patients with and without prostate cancer using the Chi-square test, and then entered into the random forest model to predict diagnosis. Patients with and without prostate cancer differed significantly in age and serum prostate-specific antigen levels (P < 0.001), as well as in all transrectal ultrasound characteristics (P < 0.05) except uneven echo (P = 0.609). The random forest model based on age, prostate-specific antigen and ultrasound predicted prostate cancer with an accuracy of 83.10%, sensitivity of 65.64%, and specificity of 93.83%. Positive predictive value was 86.72%, and negative predictive value was 81.64%. By integrating age, prostate-specific antigen levels and transrectal ultrasound findings, the random forest algorithm shows better diagnostic performance for prostate cancer than either diagnostic indicator on its own. This algorithm may help improve diagnosis of the disease by identifying patients at high risk for biopsy.
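
    The five performance figures in this abstract all derive from a single 2 x 2 confusion matrix. The sketch below shows the arithmetic. The cell counts are back-calculated here from the reported percentages and the total of 941 patients; they are an inference, not numbers stated in the abstract, and the function name is hypothetical.

```python
def diagnostic_metrics(tp, fn, fp, tn):
    """Standard diagnostic accuracy measures from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
    }


# Counts inferred from the reported percentages (assumption, not source data):
# 358 patients with cancer (235 detected, 123 missed) and
# 583 patients without cancer (547 correctly cleared, 36 false alarms).
metrics = diagnostic_metrics(tp=235, fn=123, fp=36, tn=547)

for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

    Laid out this way, the pattern in the abstract is easier to read: specificity is high and sensitivity is moderate, so under these inferred counts the model rarely raises a false alarm but misses roughly a third of the cancers.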

  1. The Efficiency of Random Forest Method for Shoreline Extraction from LANDSAT-8 and GOKTURK-2 Imageries

    NASA Astrophysics Data System (ADS)

    Bayram, B.; Erdem, F.; Akpinar, B.; Ince, A. K.; Bozkurt, S.; Catal Reis, H.; Seker, D. Z.

    2017-11-01

    Coastal monitoring plays a vital role in environmental planning and hazard management. Since shorelines are fundamental data for environment management, disaster management, coastal erosion studies, modelling of sediment transport and coastal morphodynamics, various techniques have been developed to extract shorelines. Random Forest, one of these techniques, is used in this study for shoreline extraction. This algorithm is a machine learning method based on decision trees. Decision trees analyse classes of training data and create rules for classification. In this study, the Terkos region was chosen for the proposed method within the scope of the TUBITAK project (Project No: 115Y718) titled "Integration of Unmanned Aerial Vehicles for Sustainable Coastal Zone Monitoring Model - Three-Dimensional Automatic Coastline Extraction and Analysis: Istanbul-Terkos Example". The Random Forest algorithm was implemented to extract the shoreline of the Black Sea near the lake from LANDSAT-8 and GOKTURK-2 satellite imagery taken in 2015. The MATLAB environment was used for classification. To obtain land and water-body classes, the Random Forest method was applied to the NIR bands of the LANDSAT-8 (5th band) and GOKTURK-2 (4th band) imagery. Each image was digitized manually to obtain shorelines for accuracy assessment. According to the accuracy assessment results, the Random Forest method is efficient for shoreline extraction from both medium and high resolution images.

  2. Random forest models to predict aqueous solubility.

    PubMed

    Palmer, David S; O'Boyle, Noel M; Glen, Robert C; Mitchell, John B O

    2007-01-01

    Random Forest regression (RF), Partial-Least-Squares (PLS) regression, Support Vector Machines (SVM), and Artificial Neural Networks (ANN) were used to develop QSPR models for the prediction of aqueous solubility, based on experimental data for 988 organic molecules. The Random Forest regression model predicted aqueous solubility more accurately than those created by PLS, SVM, and ANN and offered methods for automatic descriptor selection, an assessment of descriptor importance, and an in-parallel measure of predictive ability, all of which serve to recommend its use. The prediction of log molar solubility for an external test set of 330 molecules that are solid at 25 degrees C gave an r2 = 0.89 and RMSE = 0.69 log S units. For a standard data set selected from the literature, the model performed well with respect to other documented methods. Finally, the diversity of the training and test sets are compared to the chemical space occupied by molecules in the MDL drug data report, on the basis of molecular descriptors selected by the regression analysis.

  3. Decision tree modeling using R.

    PubMed

    Zhang, Zhongheng

    2016-08-01

    In the machine learning field, the decision tree learner is powerful and easy to interpret. It employs a recursive binary partitioning algorithm that splits the sample on the partitioning variable with the strongest association with the response variable. The process continues until some stopping criteria are met. In the example I focus on the conditional inference tree, which incorporates tree-structured regression models into conditional inference procedures. Because a single tree is sensitive to small changes in the training data, the random forests procedure is introduced to address this problem. The sources of diversity for random forests are the random sampling and the restricted set of input variables available for selection. Finally, I introduce R functions to perform model-based recursive partitioning. This method incorporates recursive partitioning into conventional parametric model building.
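
    One step of the recursive binary partitioning described above can be sketched as follows. This is a minimal illustration, not the article's method: it assumes a CART-style sum-of-squared-errors criterion in place of the permutation-test-based variable selection that conditional inference trees use, and it is written in Python rather than R. The function names `sse` and `best_split` and the toy data are hypothetical.

```python
from statistics import fmean


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mu = fmean(values)
    return sum((v - mu) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, sse_after_split) for the best binary split of y on x.

    Candidate thresholds are midpoints between consecutive distinct x values.
    Returns (None, sse(y)) when x has fewer than two distinct values or when
    no candidate split lowers the error.
    """
    distinct = sorted(set(x))
    best_threshold, best_score = None, sse(y)
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        score = sse(left) + sse(right)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Toy data (made up): the response jumps once x exceeds 3.
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 5.1, 4.8, 5.0]
threshold, score = best_split(x, y)
```

    A full tree would apply `best_split` again to each of the two resulting subsets until a stopping criterion is met; a random forest would additionally grow each tree on a bootstrap sample and restrict the variables considered at each split.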

  4. Artificial Intelligence Procedures for Tree Taper Estimation within a Complex Vegetation Mosaic in Brazil

    PubMed Central

    Nunes, Matheus Henrique

    2016-01-01

    Tree stem form in native tropical forests is very irregular, posing a challenge to establishing taper equations that can accurately predict the diameter at any height along the stem and subsequently merchantable volume. Artificial intelligence approaches can be useful techniques in minimizing estimation errors within complex variations of vegetation. We evaluated the performance of Random Forest® regression tree and Artificial Neural Network procedures in modelling stem taper. Diameters and volume outside bark were compared to a traditional taper-based equation across a tropical Brazilian savanna, a seasonal semi-deciduous forest and a rainforest. Neural network models were found to be more accurate than the traditional taper equation. Random forest showed trends in the residuals from the diameter prediction and provided the least precise and accurate estimations for all forest types. This study provides insights into the superiority of a neural network, which provided advantages regarding the handling of local effects. PMID:27187074

  5. Electromagnetic wave extinction within a forested canopy

    NASA Technical Reports Server (NTRS)

    Karam, M. A.; Fung, A. K.

    1989-01-01

    A forested canopy is modeled by a collection of randomly oriented finite-length cylinders shaded by randomly oriented and distributed disk- or needle-shaped leaves. For a plane wave exciting the forested canopy, the extinction coefficient is formulated in terms of the extinction cross sections (ECSs) in the local frame of each forest component and the Eulerian angles of orientation (used to describe the orientation of each component). The ECSs in the local frame for the finite-length cylinders used to model the branches are obtained by using the forward-scattering theorem. ECSs in the local frame for the disk- and needle-shaped leaves are obtained by the summation of the absorption and scattering cross-sections. The behavior of the extinction coefficients with the incidence angle is investigated numerically for both deciduous and coniferous forest. The dependencies of the extinction coefficients on the orientation of the leaves are illustrated numerically.

  6. Artificial Intelligence Procedures for Tree Taper Estimation within a Complex Vegetation Mosaic in Brazil.

    PubMed

    Nunes, Matheus Henrique; Görgens, Eric Bastos

    2016-01-01

    Tree stem form in native tropical forests is very irregular, posing a challenge to establishing taper equations that can accurately predict the diameter at any height along the stem and subsequently merchantable volume. Artificial intelligence approaches can be useful techniques in minimizing estimation errors within complex variations of vegetation. We evaluated the performance of Random Forest® regression tree and Artificial Neural Network procedures in modelling stem taper. Diameters and volume outside bark were compared to a traditional taper-based equation across a tropical Brazilian savanna, a seasonal semi-deciduous forest and a rainforest. Neural network models were found to be more accurate than the traditional taper equation. Random forest showed trends in the residuals from the diameter prediction and provided the least precise and accurate estimations for all forest types. This study provides insights into the superiority of a neural network, which provided advantages regarding the handling of local effects.

  7. Predicting Coastal Flood Severity using Random Forest Algorithm

    NASA Astrophysics Data System (ADS)

    Sadler, J. M.; Goodall, J. L.; Morsy, M. M.; Spencer, K.

    2017-12-01

    Coastal floods have become more common recently and are predicted to further increase in frequency and severity due to sea level rise. Predicting floods in coastal cities can be difficult due to the number of environmental and geographic factors which can influence flooding events. Built stormwater infrastructure and irregular urban landscapes add further complexity. This paper demonstrates the use of machine learning algorithms in predicting street flood occurrence in an urban coastal setting. The model is trained and evaluated using data from Norfolk, Virginia USA from September 2010 - October 2016. Rainfall, tide levels, water table levels, and wind conditions are used as input variables. Street flooding reports made by city workers after named and unnamed storm events, ranging from 1-159 reports per event, are the model output. Results show that Random Forest provides predictive power in estimating the number of flood occurrences given a set of environmental conditions with an out-of-bag root mean squared error of 4.3 flood reports and a mean absolute error of 0.82 flood reports. The Random Forest algorithm performed much better than Poisson regression. From the Random Forest model, total daily rainfall was by far the most important factor in flood occurrence prediction, followed by daily low tide and daily higher high tide. The model demonstrated here could be used to predict flood severity based on forecast rainfall and tide conditions and could be further enhanced using more complete street flooding data for model training.
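
    The two error measures quoted above can be computed as in the sketch below. The helper names `rmse` and `mae` and the report counts are made up for illustration and are not the Norfolk data. Because RMSE squares each error before averaging, a few events with large errors raise it well above the MAE, which is one plausible reading of the gap between the two reported values.

```python
from math import sqrt


def rmse(observed, predicted):
    """Root mean squared error between paired observations and predictions."""
    n = len(observed)
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def mae(observed, predicted):
    """Mean absolute error between paired observations and predictions."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n


# Hypothetical flood-report counts per event: three small errors, one large.
observed = [0, 2, 1, 30]
predicted = [1, 2, 0, 20]

event_rmse = rmse(observed, predicted)  # dominated by the single large miss
event_mae = mae(observed, predicted)
```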

  8. LiDAR based prediction of forest biomass using hierarchical models with spatially varying coefficients

    USGS Publications Warehouse

    Babcock, Chad; Finley, Andrew O.; Bradford, John B.; Kolka, Randall K.; Birdsey, Richard A.; Ryan, Michael G.

    2015-01-01

    Many studies and production inventory systems have shown the utility of coupling covariates derived from Light Detection and Ranging (LiDAR) data with forest variables measured on georeferenced inventory plots through regression models. The objective of this study was to propose and assess the use of a Bayesian hierarchical modeling framework that accommodates both residual spatial dependence and non-stationarity of model covariates through the introduction of spatial random effects. We explored this objective using four forest inventory datasets that are part of the North American Carbon Program, each comprising point-referenced measures of above-ground forest biomass and discrete LiDAR. For each dataset, we considered at least five regression model specifications of varying complexity. Models were assessed based on goodness of fit criteria and predictive performance using a 10-fold cross-validation procedure. Results showed that the addition of spatial random effects to the regression model intercept improved fit and predictive performance in the presence of substantial residual spatial dependence. Additionally, in some cases, allowing either some or all regression slope parameters to vary spatially, via the addition of spatial random effects, further improved model fit and predictive performance. In other instances, models showed improved fit but decreased predictive performance—indicating over-fitting and underscoring the need for cross-validation to assess predictive ability. The proposed Bayesian modeling framework provided access to pixel-level posterior predictive distributions that were useful for uncertainty mapping, diagnosing spatial extrapolation issues, revealing missing model covariates, and discovering locally significant parameters.

  9. Field evaluation of a random forest activity classifier for wrist-worn accelerometer data.

    PubMed

    Pavey, Toby G; Gilson, Nicholas D; Gomersall, Sjaan R; Clark, Bronwyn; Trost, Stewart G

    2017-01-01

    Wrist-worn accelerometers are convenient to wear and associated with greater wear-time compliance. Previous work has generally relied on choreographed activity trials to train and test classification models. However, validity in free-living contexts is starting to emerge. Study aims were: (1) train and test a random forest activity classifier for wrist accelerometer data; and (2) determine if models trained on laboratory data perform well under free-living conditions. Twenty-one participants (mean age=27.6±6.2) completed seven lab-based activity trials and a 24h free-living trial (N=16). Participants wore a GENEActiv monitor on the non-dominant wrist. Classification models recognising four activity classes (sedentary, stationary+, walking, and running) were trained using time and frequency domain features extracted from 10-s non-overlapping windows. Model performance was evaluated using leave-one-out-cross-validation. Models were implemented using the randomForest package within R. Classifier accuracy during the 24h free living trial was evaluated by calculating agreement with concurrently worn activPAL monitors. Overall classification accuracy for the random forest algorithm was 92.7%. Recognition accuracy for sedentary, stationary+, walking, and running was 80.1%, 95.7%, 91.7%, and 93.7%, respectively for the laboratory protocol. Agreement with the activPAL data (stepping vs. non-stepping) during the 24h free-living trial was excellent and, on average, exceeded 90%. The ICC for stepping time was 0.92 (95% CI=0.75-0.97). However, sensitivity and positive predictive values were modest. Mean bias was 10.3min/d (95% LOA=-46.0 to 25.4min/d). 
The random forest classifier for wrist accelerometer data yielded accurate group-level predictions under controlled conditions, but was less accurate at identifying stepping versus non-stepping behaviour in free-living conditions. Future studies should conduct more rigorous field-based evaluations using observation as a criterion measure. Copyright © 2016 Sports Medicine Australia. Published by Elsevier Ltd. All rights reserved.

  10. Modelling above Ground Biomass of Mangrove Forest Using SENTINEL-1 Imagery

    NASA Astrophysics Data System (ADS)

    Labadisos Argamosa, Reginald Jay; Conferido Blanco, Ariel; Balidoy Baloloy, Alvin; Gumbao Candido, Christian; Lovern Caboboy Dumalag, John Bart; Carandang Dimapilis, Lee, , Lady; Camero Paringit, Enrico

    2018-04-01

    Many studies have been conducted on the estimation of forest above ground biomass (AGB) using features from synthetic aperture radar (SAR). Specifically, L-band ALOS/PALSAR (wavelength 23 cm) data is often used. However, few studies have been made on the use of shorter wavelengths (e.g., C-band, 3.75 cm to 7.5 cm) for forest mapping, especially in tropical forests, since higher attenuation is observed for volumetric objects where the propagated energy is absorbed. This study aims to model AGB estimates of mangrove forest using information derived from Sentinel-1 C-band SAR data. Combinations of polarisations (VV, VH), their derivatives, grey level co-occurrence matrix (GLCM), and its principal components were used as features for modelling AGB. Five models were tested with varying combinations of features: a) sigma nought polarisations and their derivatives; b) GLCM textures; c) the first five principal components; d) combination of models a-c; and e) the important features identified by the Random Forest variable importance algorithm. Random Forest was used as the regressor for the AGB estimates to avoid overfitting caused by the introduction of too many features in the model. Model e obtained the highest r2 of 0.79 and an RMSE of 0.44 Mg using only four features, namely, σ°VH GLCM variance, σ°VH GLCM contrast, PC1, and PC2. This study shows that Sentinel-1 C-band SAR data could be used to produce acceptable AGB estimates in mangrove forest to compensate for the unavailability of longer wavelength SAR.

  11. Personalized Risk Prediction in Clinical Oncology Research: Applications and Practical Issues Using Survival Trees and Random Forests.

    PubMed

    Hu, Chen; Steingrimsson, Jon Arni

    2018-01-01

    A crucial component of making individualized treatment decisions is to accurately predict each patient's disease risk. In clinical oncology, disease risks are often measured through time-to-event data, such as overall survival and progression/recurrence-free survival, and are often subject to censoring. Risk prediction models based on recursive partitioning methods are becoming increasingly popular largely due to their ability to handle nonlinear relationships, higher-order interactions, and/or high-dimensional covariates. The most popular recursive partitioning methods are versions of the Classification and Regression Tree (CART) algorithm, which builds a simple interpretable tree structured model. With the aim of increasing prediction accuracy, the random forest algorithm averages multiple CART trees, creating a flexible risk prediction model. Risk prediction models used in clinical oncology commonly use both traditional demographic and tumor pathological factors as well as high-dimensional genetic markers and treatment parameters from multimodality treatments. In this article, we describe the most commonly used extensions of the CART and random forest algorithms to right-censored outcomes. We focus on how they differ from the methods for noncensored outcomes, and how the different splitting rules and methods for cost-complexity pruning impact these algorithms. We demonstrate these algorithms by analyzing a randomized Phase III clinical trial of breast cancer. We also conduct Monte Carlo simulations to compare the prediction accuracy of survival forests with more commonly used regression models under various scenarios. These simulation studies aim to evaluate how sensitive the prediction accuracy is to the underlying model specifications, the choice of tuning parameters, and the degrees of missing covariates.

  12. Visible and near infrared spectroscopy coupled to random forest to quantify some soil quality parameters

    NASA Astrophysics Data System (ADS)

    de Santana, Felipe Bachion; de Souza, André Marcelo; Poppi, Ronei Jesus

    2018-02-01

    This study evaluates the use of visible and near infrared spectroscopy (Vis-NIRS) combined with multivariate regression based on random forest to quantify some soil quality parameters. The parameters analyzed were soil cation exchange capacity (CEC), sum of exchange bases (SB), organic matter (OM), clay and sand present in the soils of several regions of Brazil. Current methods for evaluating these parameters are laborious and time-consuming, and require various wet analytical methods that are not adequate for use in precision agriculture, where faster and automatic responses are required. The random forest regression models were statistically better than PLS regression models for CEC, OM, clay and sand, demonstrating resistance to overfitting, attenuating the effect of outlier samples and indicating the most important variables for the model. The methodology demonstrates the potential of Vis-NIR as an alternative for determination of CEC, SB, OM, sand and clay, making it possible to develop a fast and automatic analytical procedure.

  13. Research on electricity consumption forecast based on mutual information and random forests algorithm

    NASA Astrophysics Data System (ADS)

    Shi, Jing; Shi, Yunli; Tan, Jian; Zhu, Lei; Li, Hu

    2018-02-01

    Traditional power forecasting models cannot efficiently take various factors into account, nor can they identify the relevant factors. In this paper, mutual information from information theory and the random forests algorithm from artificial intelligence are introduced into medium- and long-term electricity demand prediction. Mutual information can identify the highly related factors based on the average mutual information between a variety of variables and electricity demand; different industries may be highly associated with different variables. The random forests algorithm was used to build forecasting models for the different industries according to their respective correlated factors. Electricity consumption data from Jiangsu Province are taken as a practical example, and the above methods are compared with methods that disregard mutual information and industry differences. The simulation results show that the above method is scientific, effective, and can provide higher prediction accuracy.
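
    The mutual information screening step can be illustrated with a small sketch. It assumes that both the candidate variable and electricity demand have already been discretised into bins (the abstract does not say how the authors estimate mutual information), and the function name `mutual_information` and the toy labels are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information I(X;Y) in bits, estimated from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


# Made-up binned series: demand level and two candidate driver variables.
demand = ["low", "low", "high", "high"]
driver_a = ["low", "low", "high", "high"]  # moves with demand
driver_b = ["low", "high", "low", "high"]  # unrelated to demand

mi_a = mutual_information(driver_a, demand)
mi_b = mutual_information(driver_b, demand)
```

    Under this sketch, variables with higher mutual information (here `driver_a`) would be kept as inputs to the forecasting model for that industry, and variables near zero (here `driver_b`) would be dropped.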

  14. A random forest algorithm for nowcasting of intense precipitation events

    NASA Astrophysics Data System (ADS)

    Das, Saurabh; Chakraborty, Rohit; Maitra, Animesh

    2017-09-01

    Automatic nowcasting of convective initiation and thunderstorms has potential applications in several sectors including aviation planning and disaster management. In this paper, a random forest based machine learning algorithm is tested for nowcasting of convective rain with a ground-based radiometer. Brightness temperatures measured at 14 frequencies (7 frequencies in the 22-31 GHz band and 7 frequencies in the 51-58 GHz band) are used as the inputs of the model. The lower frequency band is associated with water vapor absorption whereas the upper frequency band relates to oxygen absorption, so together the bands provide information on the temperature and humidity of the atmosphere. The synthetic minority over-sampling technique is used to balance the data set and 10-fold cross validation is used to assess the performance of the model. Results indicate that the random forest algorithm with fixed alarm generation times of 30 min and 60 min performs quite well (probability of detection of all types of weather condition ∼90%) with low false alarms. It is, however, also observed that reducing the alarm generation time improves the threat score significantly and also decreases false alarms. The proposed model is found to be very sensitive to the boundary layer instability as indicated by the variable importance measure. The study shows the suitability of a random forest algorithm for nowcasting applications utilizing a large number of input parameters from diverse sources, and the approach can be utilized in other forecasting problems.
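
    The verification scores named above follow from a 2x2 contingency table of alarms against observed events. The sketch below uses the standard forecast-verification definitions; the function names and the counts are made up for illustration and are not taken from the study.

```python
def probability_of_detection(hits, misses):
    """Fraction of observed events for which an alarm was raised."""
    return hits / (hits + misses)


def false_alarm_ratio(hits, false_alarms):
    """Fraction of raised alarms that were not followed by an event."""
    return false_alarms / (hits + false_alarms)


def threat_score(hits, misses, false_alarms):
    """Threat score (critical success index); correct negatives are ignored."""
    return hits / (hits + misses + false_alarms)


# Hypothetical counts for one alarm generation time.
hits, misses, false_alarms = 45, 5, 10

pod = probability_of_detection(hits, misses)     # 45 / 50
far = false_alarm_ratio(hits, false_alarms)      # 10 / 55
ts = threat_score(hits, misses, false_alarms)    # 45 / 60
```

    Because the threat score penalises both misses and false alarms, it can improve even when the probability of detection is unchanged, provided false alarms fall.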

  15. New machine learning tools for predictive vegetation mapping after climate change: Bagging and Random Forest perform better than Regression Tree Analysis

    Treesearch

    L.R. Iverson; A.M. Prasad; A. Liaw

    2004-01-01

    More and better machine learning tools are becoming available for landscape ecologists to aid in understanding species-environment relationships and to map probable species occurrence now and potentially into the future. To that end, we evaluated three statistical models: Regression Tree Analysis (RTA), Bagging Trees (BT) and Random Forest (RF) for their utility in...

  16. Metastability for discontinuous dynamical systems under Lévy noise: Case study on Amazonian Vegetation.

    PubMed

    Serdukova, Larissa; Zheng, Yayun; Duan, Jinqiao; Kurths, Jürgen

    2017-08-24

    For the tipping elements in the Earth's climate system, the most important issue to address is how stable is the desirable state against random perturbations. Extreme biotic and climatic events pose severe hazards to tropical rainforests. Their local effects are extremely stochastic and difficult to measure. Moreover, the direction and intensity of the response of forest trees to such perturbations are unknown, especially given the lack of efficient dynamical vegetation models to evaluate forest tree cover changes over time. In this study, we consider randomness in the mathematical modelling of forest trees by incorporating uncertainty through a stochastic differential equation. According to field-based evidence, the interactions between fires and droughts are a more direct mechanism that may describe sudden forest degradation in the south-eastern Amazon. In modeling the Amazonian vegetation system, we include symmetric α-stable Lévy perturbations. We report results of stability analysis of the metastable fertile forest state. We conclude that even a very slight threat to the forest state stability represents Lévy noise with large jumps of low intensity, that can be interpreted as a fire occurring in a non-drought year. During years of severe drought, high-intensity fires significantly accelerate the transition between a forest and savanna state.

  17. Predicting live and dead tree basal area of bark beetle affected forests from discrete-return lidar

    Treesearch

    Benjamin C. Bright; Andrew T. Hudak; Robert McGaughey; Hans-Erik Andersen; Jose Negron

    2013-01-01

    Bark beetle outbreaks have killed large numbers of trees across North America in recent years. Lidar remote sensing can be used to effectively estimate forest biomass, but prediction of both live and dead standing biomass in beetle-affected forests using lidar alone has not been demonstrated. We developed Random Forest (RF) models predicting total, live, dead, and...

  18. Valuing the Recreational Benefits from the Creation of Nature Reserves in Irish Forests

    Treesearch

    Riccardo Scarpa; Susan M. Chilton; W. George Hutchinson; Joseph Buongiorno

    2000-01-01

    Data from a large-scale contingent valuation study are used to investigate the effects of forest attributes on willingness to pay for forest recreation in Ireland. In particular, the presence of a nature reserve in the forest is found to significantly increase the visitors' willingness to pay. A random utility model is used to estimate the welfare change associated...

  19. Polarimetric signatures of a coniferous forest canopy based on vector radiative transfer theory

    NASA Technical Reports Server (NTRS)

    Karam, M. A.; Fung, A. K.; Amar, F.; Mougin, E.; Lopes, A.; Beaudoin, A.

    1992-01-01

    Complete polarization signatures of a coniferous forest canopy are studied by the iterative solution of the vector radiative transfer equations up to the second order. The forest canopy constituents (leaves, branches, stems, and trunk) are embedded in a multi-layered medium over a rough interface. The branches, stems and trunk scatterers are modeled as finite randomly oriented cylinders. The leaves are modeled as randomly oriented needles. For a plane wave exciting the canopy, the average Mueller matrix is formulated in terms of the iterative solution of the radiative transfer solution and used to determine the linearly polarized backscattering coefficients, the co-polarized and cross-polarized power returns, and the phase difference statistics. Numerical results are presented to investigate the effect of transmitting and receiving antenna configurations on the polarimetric signature of a pine forest. Comparison is made with measurements.

  20. A random forest learning assisted "divide and conquer" approach for peptide conformation search.

    PubMed

    Chen, Xin; Yang, Bing; Lin, Zijing

    2018-06-11

    Computational determination of peptide conformations is challenging as it is a problem of finding minima in a high-dimensional space. The "divide and conquer" approach is promising for reliably reducing the search space size. A random forest learning model is proposed here to expand the scope of applicability of the "divide and conquer" approach. A random forest classification algorithm is used to characterize the distributions of the backbone φ-ψ units ("words"). A random forest supervised learning model is developed to analyze the combinations of the φ-ψ units ("grammar"). It is found that amino acid residues may be grouped as equivalent "words", while the φ-ψ combinations in low-energy peptide conformations follow a distinct "grammar". The finding of equivalent words empowers the "divide and conquer" method with the flexibility of fragment substitution. The learnt grammar is used to improve the efficiency of the "divide and conquer" method by removing unfavorable φ-ψ combinations without the need of dedicated human effort. The machine learning assisted search method is illustrated by efficiently searching the conformations of GGG/AAA/GGGG/AAAA/GGGGG through assembling the structures of GFG/GFGG. Moreover, the computational cost of the new method is shown to increase rather slowly with the peptide length.

  1. Aspen, climate, and sudden decline in western USA

    Treesearch

    Gerald E. Rehfeldt; Dennis E. Ferguson; Nicholas L. Crookston

    2009-01-01

    A bioclimate model predicting the presence or absence of aspen, Populus tremuloides, in western USA from climate variables was developed by using the Random Forests classification tree on Forest Inventory data from about 118,000 permanent sample plots. A reasonably parsimonious model used eight predictors to describe aspen's climate profile. Classification errors...

  2. Stemflow estimation in a redwood forest using model-based stratified random sampling

    Treesearch

    Jack Lewis

    2003-01-01

    Model-based stratified sampling is illustrated by a case study of stemflow volume in a redwood forest. The approach is actually a model-assisted sampling design in which auxiliary information (tree diameter) is utilized in the design of stratum boundaries to optimize the efficiency of a regression or ratio estimator. The auxiliary information is utilized in both the...

  3. Why choose Random Forest to predict rare species distribution with few samples in large undersampled areas? Three Asian crane species models provide supporting evidence.

    PubMed

    Mi, Chunrong; Huettmann, Falk; Guo, Yumin; Han, Xuesong; Wen, Lijia

    2017-01-01

    Species distribution models (SDMs) have become an essential tool in ecology, biogeography, evolution and, more recently, in conservation biology. How to generalize species distributions in large undersampled areas, especially with few samples, is a fundamental issue of SDMs. In order to explore this issue, we used the best available presence records for the Hooded Crane (Grus monacha, n = 33), White-naped Crane (Grus vipio, n = 40), and Black-necked Crane (Grus nigricollis, n = 75) in China as three case studies, employing four powerful and commonly used machine learning algorithms to map the breeding distributions of the three species: TreeNet (Stochastic Gradient Boosting, Boosted Regression Tree Model), Random Forest, CART (Classification and Regression Tree) and Maxent (Maximum Entropy Models). In addition, we developed an ensemble forecast by averaging predicted probability of the above four models results. Commonly used model performance metrics (Area under ROC (AUC) and true skill statistic (TSS)) were employed to evaluate model accuracy. The latest satellite tracking data and compiled literature data were used as two independent testing datasets to confront model predictions. We found Random Forest demonstrated the best performance for the most assessment method, provided a better model fit to the testing data, and achieved better species range maps for each crane species in undersampled areas. Random Forest has been generally available for more than 20 years and has been known to perform extremely well in ecological predictions. However, while increasingly on the rise, its potential is still widely underused in conservation, (spatial) ecological applications and for inference. Our results show that it informs ecological and biogeographical theories as well as being suitable for conservation applications, specifically when the study area is undersampled.
This method helps to save model-selection time and effort, and allows robust and rapid assessments and decisions for efficient conservation.
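
    Two of the quantities mentioned in this abstract, the true skill statistic and the ensemble forecast, can be sketched in a few lines. The TSS definition (sensitivity plus specificity minus one) is the standard one; equal model weights are assumed for the ensemble, since the abstract describes plain averaging. The function names and all numbers are made up for illustration.

```python
def true_skill_statistic(tp, fn, tn, fp):
    """TSS = sensitivity + specificity - 1, ranging from -1 to 1."""
    sensitivity = tp / (tp + fn)  # presences correctly predicted
    specificity = tn / (tn + fp)  # absences correctly predicted
    return sensitivity + specificity - 1


def ensemble_mean(probabilities_by_model):
    """Average per-site predicted probabilities across models (equal weights)."""
    n_models = len(probabilities_by_model)
    return [sum(site) / n_models for site in zip(*probabilities_by_model)]


# Hypothetical confusion-matrix counts for one model on a testing dataset.
tss = true_skill_statistic(tp=30, fn=10, tn=80, fp=20)

# Hypothetical predicted occurrence probabilities at two sites from four models.
predictions = [
    [0.9, 0.2],
    [0.7, 0.4],
    [0.8, 0.1],
    [0.6, 0.3],
]
ensemble = ensemble_mean(predictions)
```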

  4. Why choose Random Forest to predict rare species distribution with few samples in large undersampled areas? Three Asian crane species models provide supporting evidence

    PubMed Central

    Mi, Chunrong; Huettmann, Falk; Han, Xuesong; Wen, Lijia

    2017-01-01

    Species distribution models (SDMs) have become an essential tool in ecology, biogeography, evolution and, more recently, in conservation biology. How to generalize species distributions in large undersampled areas, especially with few samples, is a fundamental issue of SDMs. In order to explore this issue, we used the best available presence records for the Hooded Crane (Grus monacha, n = 33), White-naped Crane (Grus vipio, n = 40), and Black-necked Crane (Grus nigricollis, n = 75) in China as three case studies, employing four powerful and commonly used machine learning algorithms to map the breeding distributions of the three species: TreeNet (Stochastic Gradient Boosting, Boosted Regression Tree Model), Random Forest, CART (Classification and Regression Tree) and Maxent (Maximum Entropy Models). In addition, we developed an ensemble forecast by averaging predicted probability of the above four models results. Commonly used model performance metrics (Area under ROC (AUC) and true skill statistic (TSS)) were employed to evaluate model accuracy. The latest satellite tracking data and compiled literature data were used as two independent testing datasets to confront model predictions. We found Random Forest demonstrated the best performance for the most assessment method, provided a better model fit to the testing data, and achieved better species range maps for each crane species in undersampled areas. Random Forest has been generally available for more than 20 years and has been known to perform extremely well in ecological predictions. However, while increasingly on the rise, its potential is still widely underused in conservation, (spatial) ecological applications and for inference. Our results show that it informs ecological and biogeographical theories as well as being suitable for conservation applications, specifically when the study area is undersampled. 
This method helps to save model-selection time and effort, and allows robust and rapid assessments and decisions for efficient conservation. PMID:28097060

  5. Modeling urban coastal flood severity from crowd-sourced flood reports using Poisson regression and Random Forest

    NASA Astrophysics Data System (ADS)

    Sadler, J. M.; Goodall, J. L.; Morsy, M. M.; Spencer, K.

    2018-04-01

    Sea level rise has already caused more frequent and severe coastal flooding and this trend will likely continue. Flood prediction is an essential part of a coastal city's capacity to adapt to and mitigate this growing problem. Complex coastal urban hydrological systems however, do not always lend themselves easily to physically-based flood prediction approaches. This paper presents a method for using a data-driven approach to estimate flood severity in an urban coastal setting using crowd-sourced data, a non-traditional but growing data source, along with environmental observation data. Two data-driven models, Poisson regression and Random Forest regression, are trained to predict the number of flood reports per storm event as a proxy for flood severity, given extensive environmental data (i.e., rainfall, tide, groundwater table level, and wind conditions) as input. The method is demonstrated using data from Norfolk, Virginia USA from September 2010 to October 2016. Quality-controlled, crowd-sourced street flooding reports ranging from 1 to 159 per storm event for 45 storm events are used to train and evaluate the models. Random Forest performed better than Poisson regression at predicting the number of flood reports and had a lower false negative rate. From the Random Forest model, total cumulative rainfall was by far the most dominant input variable in predicting flood severity, followed by low tide and lower low tide. These methods serve as a first step toward using data-driven methods for spatially and temporally detailed coastal urban flood prediction.

  6. Mapping Deforestation area in North Korea Using Phenology-based Multi-Index and Random Forest

    NASA Astrophysics Data System (ADS)

    Jin, Y.; Sung, S.; Lee, D. K.; Jeong, S.

    2016-12-01

    Forest ecosystems provide ecological benefits to both humans and wildlife. Growing global demand for food and fiber is increasing the pressure that agriculture and logging place on forest ecosystems worldwide. Between 1990 and 2015, North Korea lost almost 40% of its forests to conversion into crop fields for food production and to cutting for fuel wood. This has increased the damage caused by natural disasters, and the country is known to be one of the most forest-degraded areas in the world. The forest landscape in North Korea is complex and heterogeneous; the major landscape types in the forest are hillside farm, unstocked forest, natural forest and plateau vegetation. Remote sensing can be used for the forest degradation mapping of a dynamic landscape at a broad scale of detail and spatial distribution. Confusion mostly occurred between hillside farmland and unstocked forest, but also between unstocked forest and forest. Most previous forest degradation studies focused on the classification of broad types such as deforested area and sand from the perspective of land cover classification. The objective of this study is to use random forest to map degraded forest in North Korea with a phenology-based vegetation index derived from MODIS products, which capture various environmental factors such as vegetation, soil and water at a regional scale, in order to improve accuracy. The model created by random forest achieved an overall accuracy of 91.44%. The class user's accuracies of hillside farmland and unstocked forest, the two classes that indicate degraded forest, were 97.2% and 84%. Unstocked forest had relatively low user's accuracy due to misclassified hillside farmland and forest samples. The producer's accuracies of hillside farmland and unstocked forest were 85.2% and 93.3%, respectively. In this case hillside farmland had lower producer's accuracy mainly due to confusion with field, unstocked forest and forest. 
Such a classification of degraded forest could supply essential information to decide the priority of forest management and restoration in degraded forest area.
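
    The overall, user's and producer's accuracies reported in this abstract are all derived from a confusion matrix. The sketch below shows the standard calculation; it assumes rows hold the reference (ground-truth) classes and columns the mapped classes, and the counts are invented for illustration rather than taken from the study.

```python
def accuracy_report(matrix):
    """Return (overall, producer's, user's) accuracy from a square matrix.

    Assumed layout: matrix[r][c] counts samples whose reference class is r
    and whose mapped (predicted) class is c.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    overall = sum(matrix[i][i] for i in range(n)) / total
    # Producer's accuracy: correct / all reference samples of that class.
    producers = [matrix[i][i] / sum(matrix[i]) for i in range(n)]
    # User's accuracy: correct / all samples mapped to that class.
    users = [matrix[i][i] / sum(matrix[r][i] for r in range(n)) for i in range(n)]
    return overall, producers, users


# Hypothetical two-class example (counts are made up).
confusion = [
    [8, 2],  # reference class 0
    [1, 9],  # reference class 1
]
overall, producers, users = accuracy_report(confusion)
print(overall, producers, users)
```

    A class can have high user's accuracy and lower producer's accuracy at the same time, which is the pattern the abstract reports for hillside farmland.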

  7. Subtyping cognitive profiles in Autism Spectrum Disorder using a Functional Random Forest algorithm.

    PubMed

    Feczko, E; Balba, N M; Miranda-Dominguez, O; Cordova, M; Karalunas, S L; Irwin, L; Demeter, D V; Hill, A P; Langhorst, B H; Grieser Painter, J; Van Santen, J; Fombonne, E J; Nigg, J T; Fair, D A

    2018-05-15

    DSM-5 Autism Spectrum Disorder (ASD) comprises a set of neurodevelopmental disorders characterized by deficits in social communication and interaction and repetitive behaviors or restricted interests, and may both affect and be affected by multiple cognitive mechanisms. This study attempts to identify and characterize cognitive subtypes within the ASD population using our Functional Random Forest (FRF) machine learning classification model. This model trained a traditional random forest model on measures from seven tasks that reflect multiple levels of information processing. 47 ASD diagnosed and 58 typically developing (TD) children between the ages of 9 and 13 participated in this study. Our RF model was 72.7% accurate, with 80.7% specificity and 63.1% sensitivity. Using the random forest model, the FRF then measures the proximity of each subject to every other subject, generating a distance matrix between participants. This matrix is then used in a community detection algorithm to identify subgroups within the ASD and TD groups, and revealed 3 ASD and 4 TD putative subgroups with unique behavioral profiles. We then examined differences in functional brain systems between diagnostic groups and putative subgroups using resting-state functional connectivity magnetic resonance imaging (rsfcMRI). Chi-square tests revealed a significantly greater number of between group differences (p < .05) within the cingulo-opercular, visual, and default systems as well as differences in inter-system connections in the somato-motor, dorsal attention, and subcortical systems. Many of these differences were primarily driven by specific subgroups suggesting that our method could potentially parse the variation in brain mechanisms affected by ASD. Copyright © 2017. Published by Elsevier Inc.
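
    The proximity step described in this abstract can be illustrated with the usual random forest definition of proximity: the fraction of trees in which two subjects fall in the same terminal node. The sketch below assumes that definition and takes precomputed leaf assignments as input; the abstract does not specify how the FRF computes proximity, so the function names and the conversion to distance as 1 minus proximity are assumptions.

```python
def proximity_matrix(leaf_ids):
    """Pairwise proximity from leaf assignments.

    leaf_ids[t][i] is the terminal-node id of subject i in tree t
    (hypothetical input; any forest implementation could supply it).
    """
    n_trees = len(leaf_ids)
    n_subjects = len(leaf_ids[0])
    shared = [[0] * n_subjects for _ in range(n_subjects)]
    for tree in leaf_ids:
        for i in range(n_subjects):
            for j in range(n_subjects):
                if tree[i] == tree[j]:
                    shared[i][j] += 1
    return [[count / n_trees for count in row] for row in shared]


def distance_matrix(proximity):
    """Convert proximities in [0, 1] to distances (assumed: 1 - proximity)."""
    return [[1.0 - p for p in row] for row in proximity]


# Two hypothetical trees, three subjects.
leaves = [
    [0, 0, 1],
    [2, 3, 3],
]
prox = proximity_matrix(leaves)
dist = distance_matrix(prox)
print(prox)
print(dist)
```

    A matrix of this kind is what a community detection algorithm would then take as input to propose subgroups.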

  8. Extrapolating intensified forest inventory data to the surrounding landscape using landsat

    Treesearch

    Evan B. Brooks; John W. Coulston; Valerie A. Thomas; Randolph H. Wynne

    2015-01-01

    In 2011, a collection of spatially intensified plots was established on three of the Experimental Forests and Ranges (EFRs) sites with the intent of facilitating FIA program objectives for regional extrapolation. Characteristic coefficients from harmonic regression (HR) analysis of associated Landsat stacks are used as inputs into a conditional random forests model to...

  9. On the information content of hydrological signatures and their relationship to catchment attributes

    NASA Astrophysics Data System (ADS)

    Addor, Nans; Clark, Martyn P.; Prieto, Cristina; Newman, Andrew J.; Mizukami, Naoki; Nearing, Grey; Le Vine, Nataliya

    2017-04-01

    Hydrological signatures, which are indices characterizing hydrologic behavior, are increasingly used for the evaluation, calibration and selection of hydrological models. Their key advantage is to provide more direct insights into specific hydrological processes than aggregated metrics (e.g., the Nash-Sutcliffe efficiency). A plethora of signatures now exists, which enable characterizing a variety of hydrograph features, but also makes the selection of signatures for new studies challenging. Here we propose that the selection of signatures should be based on their information content, which we estimated using several approaches, all leading to similar conclusions. To explore the relationship between hydrological signatures and the landscape, we extended a previously published data set of hydrometeorological time series for 671 catchments in the contiguous United States, by characterizing the climatic conditions, topography, soil, vegetation and stream network of each catchment. This new catchment attributes data set will soon be in open access, and we are looking forward to introducing it to the community. We used this data set in a data-learning algorithm (random forests) to explore whether hydrological signatures could be inferred from catchment attributes alone. We find that some signatures can be predicted remarkably well by random forests and, interestingly, the same signatures are well captured when simulating discharge using a conceptual hydrological model. We discuss what this result reveals about our understanding of hydrological processes shaping hydrological signatures. We also identify which catchment attributes exert the strongest control on catchment behavior, in particular during extreme hydrological events. Overall, climatic attributes have the most significant influence, and strongly condition how well hydrological signatures can be predicted by random forests and simulated by the hydrological model. 
In contrast, soil characteristics at the catchment scale are not found to be significant predictors by random forests, which raises questions on how to best use soil data for hydrological modeling, for instance for parameter estimation. We finally demonstrate that signatures with high spatial variability are poorly captured by random forests and model simulations, which makes their regionalization delicate. We conclude with a ranking of signatures based on their information content, and propose that the signatures with high information content are best suited for model calibration, model selection and understanding hydrologic similarity.

  10. A scattering model for forested area

    NASA Technical Reports Server (NTRS)

    Karam, M. A.; Fung, A. K.

    1988-01-01

    A forested area is modeled as a volume of randomly oriented and distributed disc-shaped, or needle-shaped leaves shading a distribution of branches modeled as randomly oriented finite-length, dielectric cylinders above an irregular soil surface. Since the radii of branches have a wide range of sizes, the model only requires the length of a branch to be large compared with its radius which may be any size relative to the incident wavelength. In addition, the model also assumes the thickness of a disc-shaped leaf or the radius of a needle-shaped leaf is much smaller than the electromagnetic wavelength. The scattering phase matrices for disc, needle, and cylinder are developed in terms of the scattering amplitudes of the corresponding fields which are computed by the forward scattering theorem. These quantities along with the Kirchhoff scattering model for a randomly rough surface are used in the standard radiative transfer formulation to compute the backscattering coefficient. Numerical illustrations for the backscattering coefficient are given as a function of the shading factor, incidence angle, leaf orientation distribution, branch orientation distribution, and the number density of leaves. Also illustrated are the properties of the extinction coefficient as a function of leaf and branch orientation distributions. Comparisons are made with measured backscattering coefficients from forested areas reported in the literature.

  11. Modeling species’ realized climatic niche space and predicting their response to global warming for several western forest species with small geographic distributions

    Treesearch

    Marcus V. Warwell; Gerald E. Rehfeldt; Nicholas L. Crookston

    2010-01-01

    The Random Forests multiple regression tree was used to develop an empirically based bioclimatic model of the presence-absence of species occupying small geographic distributions in western North America. The species assessed were subalpine larch (Larix lyallii), smooth Arizona cypress (Cupressus arizonica ssp. glabra...

  12. Optimal Symmetric Multimodal Templates and Concatenated Random Forests for Supervised Brain Tumor Segmentation (Simplified) with ANTsR.

    PubMed

    Tustison, Nicholas J; Shrinidhi, K L; Wintermark, Max; Durst, Christopher R; Kandel, Benjamin M; Gee, James C; Grossman, Murray C; Avants, Brian B

    2015-04-01

    Segmenting and quantifying gliomas from MRI is an important task for diagnosis, planning intervention, and for tracking tumor changes over time. However, this task is complicated by the lack of prior knowledge concerning tumor location, spatial extent, shape, possible displacement of normal tissue, and intensity signature. To accommodate such complications, we introduce a framework for supervised segmentation based on multiple modality intensity, geometry, and asymmetry feature sets. These features drive a supervised whole-brain and tumor segmentation approach based on random forest-derived probabilities. The asymmetry-related features (based on optimal symmetric multimodal templates) demonstrate excellent discriminative properties within this framework. We also gain performance by generating probability maps from random forest models and using these maps for a refining Markov random field regularized probabilistic segmentation. This strategy allows us to interface the supervised learning capabilities of the random forest model with regularized probabilistic segmentation using the recently developed ANTsR package--a comprehensive statistical and visualization interface between the popular Advanced Normalization Tools (ANTs) and the R statistical project. The reported algorithmic framework was the top-performing entry in the MICCAI 2013 Multimodal Brain Tumor Segmentation challenge. The challenge data were widely varying consisting of both high-grade and low-grade glioma tumor four-modality MRI from five different institutions. Average Dice overlap measures for the final algorithmic assessment were 0.87, 0.78, and 0.74 for "complete", "core", and "enhanced" tumor components, respectively.
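
    The Dice overlap used to score the segmentations in this abstract is twice the size of the intersection divided by the sum of the two region sizes. The sketch below computes it for flat label lists; the function name, the treatment of two empty regions as perfect agreement, and the example labels are assumptions for illustration.

```python
def dice_overlap(predicted, reference, label=1):
    """Dice = 2 * |A and B| / (|A| + |B|) for one label in two label lists."""
    pred_count = sum(1 for p in predicted if p == label)
    ref_count = sum(1 for r in reference if r == label)
    both = sum(
        1 for p, r in zip(predicted, reference) if p == label and r == label
    )
    if pred_count + ref_count == 0:
        return 1.0  # assumed convention: two empty regions agree perfectly
    return 2 * both / (pred_count + ref_count)


# Hypothetical voxel labels (1 = tumor, 0 = background).
predicted_labels = [1, 1, 0, 0]
reference_labels = [1, 0, 1, 0]
print(dice_overlap(predicted_labels, reference_labels))
```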

  13. Looking for age-related growth decline in natural forests: unexpected biomass patterns from tree rings and simulated mortality

    USGS Publications Warehouse

    Foster, Jane R.; D'Amato, Anthony W.; Bradford, John B.

    2014-01-01

    Forest biomass growth is almost universally assumed to peak early in stand development, near canopy closure, after which it will plateau or decline. The chronosequence and plot remeasurement approaches used to establish the decline pattern suffer from limitations and coarse temporal detail. We combined annual tree ring measurements and mortality models to address two questions: first, how do assumptions about tree growth and mortality influence reconstructions of biomass growth? Second, under what circumstances does biomass production follow the model that peaks early, then declines? We integrated three stochastic mortality models with a census tree-ring data set from eight temperate forest types to reconstruct stand-level biomass increments (in Minnesota, USA). We compared growth patterns among mortality models, forest types and stands. Timing of peak biomass growth varied significantly among mortality models, peaking 20–30 years earlier when mortality was random with respect to tree growth and size, than when mortality favored slow-growing individuals. Random or u-shaped mortality (highest in small or large trees) produced peak growth 25–30 % higher than the surviving tree sample alone. Growth trends for even-aged, monospecific Pinus banksiana or Acer saccharum forests were similar to the early peak and decline expectation. However, we observed continually increasing biomass growth in older, low-productivity forests of Quercus rubra, Fraxinus nigra, and Thuja occidentalis. Tree-ring reconstructions estimated annual changes in live biomass growth and identified more diverse development patterns than previous methods. These detailed, long-term patterns of biomass development are crucial for detecting recent growth responses to global change and modeling future forest dynamics.

  14. Prediction of aquatic toxicity mode of action using linear discriminant and random forest models.

    PubMed

    Martin, Todd M; Grulke, Christopher M; Young, Douglas M; Russom, Christine L; Wang, Nina Y; Jackson, Crystal R; Barron, Mace G

    2013-09-23

    The ability to determine the mode of action (MOA) for a diverse group of chemicals is a critical part of ecological risk assessment and chemical regulation. However, existing MOA assignment approaches in ecotoxicology have been limited to relatively few MOAs, have high uncertainty, or rely on professional judgment. In this study, machine-based learning algorithms (linear discriminant analysis and random forest) were used to develop models for assigning aquatic toxicity MOA. These methods were selected since they have been shown to be able to correlate diverse data sets and provide an indication of the most important descriptors. A data set of MOA assignments for 924 chemicals was developed using a combination of high confidence assignments, international consensus classifications, ASTER (ASsessment Tools for the Evaluation of Risk) predictions, and weight of evidence professional judgment based on an assessment of structure and literature information. The overall data set was randomly divided into a training set (75%) and a validation set (25%) and then used to develop linear discriminant analysis (LDA) and random forest (RF) MOA assignment models. The LDA and RF models had high internal concordance and specificity and were able to produce overall prediction accuracies ranging from 84.5 to 87.7% for the validation set. These results demonstrate that computational chemistry approaches can be used to determine the acute toxicity MOAs across a large range of structures and mechanisms.

  15. Advanced Subspace Techniques for Modeling Channel and Session Variability in a Speaker Recognition System

    DTIC Science & Technology

    2012-03-01

    with each SVM discriminating between a pair of the N total speakers in the data set. The (( + 1))/2 classifiers then vote on the final...classification of a test sample. The Random Forest classifier is an ensemble classifier that votes amongst decision trees generated with each node using...Forest vote, and the effects of overtraining will be mitigated by the fact that each decision tree is overtrained differently (due to the random

  16. Ground-Level Digital Terrain Model (DTM) Construction from Tandem-X InSAR Data and Worldview Stereo-Photogrammetric Images

    NASA Technical Reports Server (NTRS)

    Lee, Seung-Kuk; Fatoyinbo, Temilola; Lagomasino, David; Osmanoglu, Batuhan; Feliciano, Emanuelle

    2016-01-01

    Ground-level digital elevation model (DEM) or digital terrain model (DTM) information is invaluable for environmental modeling, such as water dynamics in forests, canopy height, forest biomass, carbon estimation, etc. We propose to extract the DTM over forested areas from the combination of interferometric complex coherence from single-pass TanDEM-X (TDX) data at HH polarization and a Digital Surface Model (DSM) derived from a high-resolution WorldView (WV) image pair by means of the random volume over ground (RVoG) model. The RVoG model is a widely and successfully used model in the polarimetric SAR interferometry (Pol-InSAR) technique for vertical forest structure parameter retrieval [1][2][3][4]. The ground-level DEM has been obtained by complex volume decorrelation in the RVoG model with the DSM using the stereo-photogrammetric technique. Finally, the airborne lidar data were used to validate the ground-level DEM and forest canopy height results.

  17. Learning accurate and interpretable models based on regularized random forests regression

    PubMed Central

    2014-01-01

    Background Many biology-related research works combine data from multiple sources in an effort to understand the underlying problems. It is important to find and interpret the most important information from these sources. Thus it will be beneficial to have an effective algorithm that can simultaneously extract decision rules and select critical features for good interpretation while preserving the prediction performance. Methods In this study, we focus on regression problems for biological data where target outcomes are continuous. In general, models constructed from linear regression approaches are relatively easy to interpret. However, many practical biological applications are nonlinear in essence where we can hardly find a direct linear relationship between input and output. Nonlinear regression techniques can reveal nonlinear relationships in data, but are generally hard for humans to interpret. We propose a rule-based regression algorithm that uses 1-norm regularized random forests. The proposed approach simultaneously extracts a small number of rules from generated random forests and eliminates unimportant features. Results We tested the approach on some biological data sets. The proposed approach is able to construct a significantly smaller set of regression rules using a subset of attributes while achieving prediction performance comparable to that of random forests regression. Conclusion It demonstrates high potential in aiding prediction and interpretation of nonlinear relationships of the subject being studied. PMID:25350120

  18. Data-Driven Lead-Acid Battery Prognostics Using Random Survival Forests

    DTIC Science & Technology

    2014-10-02

    Kogalur, Blackstone, & Lauer, 2008; Ishwaran & Kogalur, 2010). Random survival forest is a survival analysis extension of Random Forests (Breiman, 2001...Statistics & probability letters, 80(13), 1056–1064. Ishwaran, H., Kogalur, U. B., Blackstone, E. H., & Lauer, M. S. (2008). Random survival forests. The

  19. Radar modeling of a boreal forest

    NASA Technical Reports Server (NTRS)

    Chauhan, Narinder S.; Lang, Roger H.; Ranson, K. J.

    1991-01-01

    Microwave modeling, ground truth, and SAR data are used to investigate the characteristics of forest stands. A mixed coniferous forest stand has been modeled at P, L, and C bands. Extensive measurements of ground truth and canopy geometry parameters were performed in a 200-m-square hemlock-dominated forest plot. About 10 percent of the trees were sampled to determine a distribution of diameter at breast height (DBH). Hemlock trees in the forest are modeled by characterizing tree trunks, branches, and needles as randomly oriented lossy dielectric cylinders whose area and orientation distributions are prescribed. The distorted Born approximation is used to compute the backscatter at P, L, and C bands. The theoretical results are found to be lower than the calibrated ground-truth data. The experiment and model results agree quite closely, however, when the ratios of VV to HH and HV to HH are compared.

  20. Managing salinity in Upper Colorado River Basin streams: Selecting catchments for sediment control efforts using watershed characteristics and random forests models

    USGS Publications Warehouse

    Tillman, Fred; Anning, David W.; Heilman, Julian A.; Buto, Susan G.; Miller, Matthew P.

    2018-01-01

    Elevated concentrations of dissolved-solids (salinity) including calcium, sodium, sulfate, and chloride, among others, in the Colorado River cause substantial problems for its water users. Previous efforts to reduce dissolved solids in upper Colorado River basin (UCRB) streams often focused on reducing suspended-sediment transport to streams, but few studies have investigated the relationship between suspended sediment and salinity, or evaluated which watershed characteristics might be associated with this relationship. Are there catchment properties that may help in identifying areas where control of suspended sediment will also reduce salinity transport to streams? A random forests classification analysis was performed on topographic, climate, land cover, geology, rock chemistry, soil, and hydrologic information in 163 UCRB catchments. Two random forests models were developed in this study: one for exploring stream and catchment characteristics associated with stream sites where dissolved solids increase with increasing suspended-sediment concentration, and the other for predicting where these sites are located in unmonitored reaches. Results of variable importance from the exploratory random forests models indicate that no simple source, geochemical process, or transport mechanism can easily explain the relationship between dissolved solids and suspended sediment concentrations at UCRB monitoring sites. Among the most important watershed characteristics in both models were measures of soil hydraulic conductivity, soil erodibility, minimum catchment elevation, catchment area, and the silt component of soil in the catchment. Predictions at key locations in the basin were combined with observations from selected monitoring sites, and presented in map-form to give a complete understanding of where catchment sediment control practices would also benefit control of dissolved solids in streams.

  1. Electromagnetic wave scattering from a forest or vegetation canopy - Ongoing research at the University of Texas at Arlington

    NASA Technical Reports Server (NTRS)

    Karam, Mostafa A.; Amar, Faouzi; Fung, Adrian K.

    1993-01-01

    The Wave Scattering Research Center at the University of Texas at Arlington has developed a scattering model for forest or vegetation, based on the theory of electromagnetic-wave scattering in random media. The model generalizes the assumptions imposed by earlier models, and compares well with measurements from several forest canopies. This paper gives a description of the model. It also indicates how the model elements are integrated to obtain the scattering characteristics of different forest canopies. The scattering characteristics may be displayed in the form of polarimetric signatures, represented by like- and cross-polarized scattering coefficients, for an elliptically-polarized wave, or in the form of signal-distribution curves. Results illustrating both types of scattering characteristics are given.

  2. Predicting relative species composition within mixed conifer forest pixels using zero‐inflated models and Landsat imagery

    Treesearch

    Shannon L. Savage; Rick L. Lawrence; John R. Squires

    2015-01-01

    Ecological and land management applications would often benefit from maps of relative canopy cover of each species present within a pixel, instead of traditional remote-sensing based maps of either dominant species or percent canopy cover without regard to species composition. Widely used statistical models for remote sensing, such as randomForest (RF),...

  3. Machine learning methods reveal the temporal pattern of dengue incidence using meteorological factors in metropolitan Manila, Philippines.

    PubMed

    Carvajal, Thaddeus M; Viacrusis, Katherine M; Hernandez, Lara Fides T; Ho, Howell T; Amalin, Divina M; Watanabe, Kozo

    2018-04-17

    Several studies have applied ecological factors such as meteorological variables to develop models and accurately predict the temporal pattern of dengue incidence or occurrence. Although many studies have investigated this premise, their modeling approaches differ, and each typically uses only a single statistical technique. This raises the question of which technique is robust and reliable. Hence, our study aims to compare the predictive accuracy of four modeling techniques for the temporal pattern of dengue incidence in Metropolitan Manila as influenced by meteorological factors: (a) General Additive Modeling, (b) Seasonal Autoregressive Integrated Moving Average with exogenous variables, (c) Random Forest and (d) Gradient Boosting. Dengue incidence and meteorological data (flood, precipitation, temperature, southern oscillation index, relative humidity, wind speed and direction) of Metropolitan Manila from January 1, 2009 to December 31, 2013 were obtained from respective government agencies. Two types of datasets were used in the analysis: observed meteorological factors (MF) and their corresponding delayed or lagged effects (LG). These datasets were then subjected to the four modeling techniques. The predictive accuracy and variable importance of each modeling technique were calculated and evaluated. Among the statistical modeling techniques, Random Forest showed the best predictive accuracy. Moreover, the delayed or lagged effects of the meteorological variables were shown to be the best dataset to use for this purpose. Thus, the model of Random Forest with delayed meteorological effects (RF-LG) was deemed the best among all assessed models. Relative humidity was shown to be the most important meteorological factor in the best model. 
The study showed that each statistical modeling technique generates different predictive outcomes, and it further revealed the Random Forest model with delayed meteorological effects to be the best in predicting the temporal pattern of dengue incidence in Metropolitan Manila. It is also noteworthy that the study identified relative humidity as an important meteorological factor, along with rainfall and temperature, that can influence this temporal pattern.

  4. Comparison of Logistic Regression and Random Forests techniques for shallow landslide susceptibility assessment in Giampilieri (NE Sicily, Italy)

    NASA Astrophysics Data System (ADS)

    Trigila, Alessandro; Iadanza, Carla; Esposito, Carlo; Scarascia-Mugnozza, Gabriele

    2015-11-01

    The aim of this work is to define reliable susceptibility models for shallow landslides using Logistic Regression and Random Forests multivariate statistical techniques. The study area, located in North-East Sicily, was hit on October 1st 2009 by a severe rainstorm (225 mm of cumulative rainfall in 7 h) which caused flash floods and more than 1000 landslides. Several small villages, such as Giampilieri, were hit with 31 fatalities, 6 missing persons and damage to buildings and transportation infrastructures. Landslides, mainly types such as earth and debris translational slides evolving into debris flows, were triggered on steep slopes and involved colluvium and regolith materials which cover the underlying metamorphic bedrock. The work has been carried out with the following steps: i) realization of a detailed event landslide inventory map through field surveys coupled with observation of high resolution aerial colour orthophoto; ii) identification of landslide source areas; iii) data preparation of landslide controlling factors and descriptive statistics based on a bivariate method (Frequency Ratio) to get an initial overview on existing relationships between causative factors and shallow landslide source areas; iv) choice of criteria for the selection and sizing of the mapping unit; v) implementation of 5 multivariate statistical susceptibility models based on Logistic Regression and Random Forests techniques and focused on landslide source areas; vi) evaluation of the influence of sample size and type of sampling on results and performance of the models; vii) evaluation of the predictive capabilities of the models using ROC curve, AUC and contingency tables; viii) comparison of model results and obtained susceptibility maps; and ix) analysis of temporal variation of landslide susceptibility related to input parameter changes. Models based on Logistic Regression and Random Forests have demonstrated excellent predictive capabilities. 
Land use and wildfire variables were found to have a strong control on the occurrence of very rapid shallow landslides.
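
    The bivariate Frequency Ratio used in step iii) of this abstract is commonly defined, for each class of a causative factor, as the share of landslide cells falling in that class divided by the share of the study area covered by that class. The sketch below assumes that common definition; the class names and cell counts are invented and do not come from the Giampilieri data.

```python
def frequency_ratio(class_cells, class_landslide_cells):
    """Frequency Ratio per class of one causative factor.

    class_cells: total number of cells in each class.
    class_landslide_cells: number of landslide source cells in each class.
    Values above 1 mean landslides are over-represented in that class.
    """
    total_cells = sum(class_cells.values())
    total_landslides = sum(class_landslide_cells.values())
    ratios = {}
    for name, cells in class_cells.items():
        landslide_share = class_landslide_cells.get(name, 0) / total_landslides
        area_share = cells / total_cells
        ratios[name] = landslide_share / area_share
    return ratios


# Hypothetical slope classes (counts are made up).
cells = {"gentle": 600, "steep": 400}
landslides = {"gentle": 20, "steep": 80}
ratios = frequency_ratio(cells, landslides)
print(ratios)
```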

  5. Disaggregating census data for population mapping using random forests with remotely-sensed and ancillary data.

    PubMed

    Stevens, Forrest R; Gaughan, Andrea E; Linard, Catherine; Tatem, Andrew J

    2015-01-01

    High resolution, contemporary data on human population distributions are vital for measuring impacts of population growth, monitoring human-environment interactions and for planning and policy development. Many methods are used to disaggregate census data and predict population densities for finer scale, gridded population data sets. We present a new semi-automated dasymetric modeling approach that incorporates detailed census and ancillary data in a flexible, "Random Forest" estimation technique. We outline the combination of widely available, remotely-sensed and geospatial data that contribute to the modeled dasymetric weights and then use the Random Forest model to generate a gridded prediction of population density at ~100 m spatial resolution. This prediction layer is then used as the weighting surface to perform dasymetric redistribution of the census counts at a country level. As a case study we compare the new algorithm and its products for three countries (Vietnam, Cambodia, and Kenya) with other common gridded population data production methodologies. We discuss the advantages of the new method and increases over the accuracy and flexibility of those previous approaches. Finally, we outline how this algorithm will be extended to provide freely-available gridded population data sets for Africa, Asia and Latin America.
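
    The dasymetric redistribution step described in this abstract spreads each census count over the grid cells of its administrative unit in proportion to a weighting surface. The sketch below shows that proportional allocation for a single unit; the function name, the even-split fallback for all-zero weights, and the numbers are assumptions, not details from the paper.

```python
def redistribute_counts(census_count, cell_weights):
    """Allocate one unit's census count to its cells in proportion to weight.

    cell_weights would come from a predicted density surface; here they
    are plain numbers supplied by the caller.
    """
    total_weight = sum(cell_weights)
    if total_weight == 0:
        # Assumed fallback: spread evenly when the surface is all zeros.
        share = census_count / len(cell_weights)
        return [share] * len(cell_weights)
    return [census_count * w / total_weight for w in cell_weights]


# Hypothetical unit with 100 people and two cells weighted 1 and 3.
allocated = redistribute_counts(100, [1, 3])
print(allocated, sum(allocated))
```

    Because each cell receives a fixed share of the unit total, the allocated values sum back to the census count, so the gridded output stays consistent with the census at the unit level.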

  6. Vehicular traffic noise prediction using soft computing approach.

    PubMed

    Singh, Daljeet; Nigam, S P; Agrawal, V P; Kumar, Maneek

    2016-12-01

    A new approach for the development of vehicular traffic noise prediction models is presented. Four different soft computing methods, namely, Generalized Linear Model, Decision Trees, Random Forests and Neural Networks, have been used to develop models to predict the hourly equivalent continuous sound pressure level, Leq, at different locations in the Patiala city in India. The input variables include the traffic volume per hour, percentage of heavy vehicles and average speed of vehicles. The performance of the four models is compared on the basis of performance criteria of coefficient of determination, mean square error and accuracy. 10-fold cross validation is done to check the stability of the Random Forest model, which gave the best results. A t-test is performed to check the fit of the model with the field data. Copyright © 2016 Elsevier Ltd. All rights reserved.
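
    The evaluation described in this abstract combines 10-fold cross validation with the coefficient of determination and mean square error. The sketch below shows how the fold indices and the two metrics can be computed; it uses contiguous, unshuffled folds for simplicity, and the function names and example values are assumptions for illustration.

```python
def k_fold_indices(n_samples, k=10):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    folds = []
    start = 0
    for fold in range(k):
        size = n_samples // k + (1 if fold < n_samples % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def mean_squared_error(observed, predicted):
    """Average of squared differences between observed and predicted."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


folds = k_fold_indices(25, k=10)
print([len(f) for f in folds])
print(mean_squared_error([1, 2], [1, 4]), r_squared([1, 2, 3], [1, 2, 3]))
```

    Each fold serves once as the held-out set while the model is fitted on the remaining nine; in practice the sample order is usually shuffled before the folds are cut.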

  7. Uncertainty in Random Forests: What does it mean in a spatial context?

    NASA Astrophysics Data System (ADS)

    Klump, Jens; Fouedjio, Francky

    2017-04-01

    Geochemical surveys are an important part of exploration for mineral resources and in environmental studies. The samples and chemical analyses are often laborious and difficult to obtain and therefore come at a high cost. As a consequence, these surveys are characterised by datasets with large numbers of variables but relatively few data points when compared to conventional big data problems. With more remote sensing platforms and sensor networks being deployed, large volumes of auxiliary data of the surveyed areas are becoming available. The use of these auxiliary data has the potential to improve the prediction of chemical element concentrations over the whole study area. Kriging is a well established geostatistical method for the prediction of spatial data but requires significant pre-processing and makes some basic assumptions about the underlying distribution of the data. Some machine learning algorithms, on the other hand, may require less data pre-processing and are non-parametric. In this study we used a dataset provided by Kirkwood et al. [1] to explore the potential use of Random Forest in geochemical mapping. We chose Random Forest because it is a well understood machine learning method and has the advantage that it provides us with a measure of uncertainty. By comparing Random Forest to Kriging we found that both methods produced comparable maps of estimated values for our variables of interest. Kriging outperformed Random Forest for variables of interest with relatively strong spatial correlation. The measure of uncertainty provided by Random Forest seems to be quite different to the measure of uncertainty provided by Kriging. In particular, the lack of spatial context can give misleading results in areas without ground truth data. 
    In conclusion, our preliminary results show that the model-driven approach in geostatistics gives us more reliable estimates for our target variables than Random Forest for variables with relatively strong spatial correlation. However, in cases of weak spatial correlation Random Forest, as a nonparametric method, may give the better results once we have a better understanding of the meaning of its uncertainty measures in a spatial context. References [1] Kirkwood, C., M. Cave, D. Beamish, S. Grebby, and A. Ferreira (2016), A machine learning approach to geochemical mapping, Journal of Geochemical Exploration, 163, 28-40, doi:10.1016/j.gexplo.2016.05.003.

  8. The contribution of competition to tree mortality in old-growth coniferous forests

    USGS Publications Warehouse

    Das, A.; Battles, J.; Stephenson, N.L.; van Mantgem, P.J.

    2011-01-01

    Competition is a well-documented contributor to tree mortality in temperate forests, with numerous studies documenting a relationship between tree death and the competitive environment. Models frequently rely on competition as the only non-random mechanism affecting tree mortality. However, for mature forests, competition may cease to be the primary driver of mortality. We use a large, long-term dataset to study the importance of competition in determining tree mortality in old-growth forests on the western slope of the Sierra Nevada of California, U.S.A. We make use of the comparative spatial configuration of dead and live trees, changes in tree spatial pattern through time, and field assessments of contributors to an individual tree's death to quantify competitive effects. Competition was apparently a significant contributor to tree mortality in these forests. Trees that died tended to be in more competitive environments than trees that survived, and suppression frequently appeared as a factor contributing to mortality. On the other hand, based on spatial pattern analyses, only three of 14 plots demonstrated compelling evidence that competition was dominating mortality. Most of the rest of the plots fell within the expectation for random mortality, and three fit neither the random nor the competition model. These results suggest that while competition often plays a significant role in tree mortality processes in these forests, it only infrequently governs those processes. In addition, the field assessments indicated a substantial presence of biotic mortality agents in trees that died. While competition is almost certainly important, demographics in these forests cannot accurately be characterized without a better grasp of other mortality processes. In particular, we likely need a better understanding of biotic agents and their interactions with one another and with competition.

  9. Assessing the Potential of Land Use Modification to Mitigate Ambient NO₂ and Its Consequences for Respiratory Health.

    PubMed

    Rao, Meenakshi; George, Linda A; Shandas, Vivek; Rosenstiel, Todd N

    2017-07-10

    Understanding how local land use and land cover (LULC) shapes intra-urban concentrations of atmospheric pollutants, and thus human health, is a key component in designing healthier cities. Here, NO₂ is modeled based on spatially dense summer and winter NO₂ observations in Portland-Hillsboro-Vancouver (USA), and the spatial variation of NO₂ with LULC is investigated using random forest, an ensemble data learning technique. The NO₂ random forest model, together with BenMAP, is further used to develop a better understanding of the relationship among LULC, ambient NO₂ and respiratory health. The impact of land use modifications on ambient NO₂, and consequently on respiratory health, is also investigated using a sensitivity analysis. We find that NO₂ associated with roadways and tree-canopied areas may be affecting annual incidence rates of asthma exacerbation in 4-12 year olds by +3000 per 100,000 and -1400 per 100,000, respectively. Our model shows that increasing local tree canopy by 5% may reduce local incidence rates of asthma exacerbation by 6%, indicating that targeted local tree-planting efforts may have a substantial impact on reducing city-wide incidence of respiratory distress. Our findings demonstrate the utility of random forest modeling in evaluating LULC modifications for enhanced respiratory health.

  10. Sensitivity of a Riparian Large Woody Debris Recruitment Model to the Number of Contributing Banks and Tree Fall Pattern

    Treesearch

    Don C. Bragg; Jeffrey L. Kershner

    2004-01-01

    Riparian large woody debris (LWD) recruitment simulations have traditionally applied a random angle of tree fall from two well-forested stream banks. We used a riparian LWD recruitment model (CWD, version 1.4) to test the validity of these assumptions. Both the number of contributing forest banks and predominant tree fall direction significantly influenced simulated...

  11. Machine Learning Predictions of a Multiresolution Climate Model Ensemble

    NASA Astrophysics Data System (ADS)

    Anderson, Gemma J.; Lucas, Donald D.

    2018-05-01

    Statistical models of high-resolution climate models are useful for many purposes, including sensitivity and uncertainty analyses, but building them can be computationally prohibitive. We generated a unique multiresolution perturbed parameter ensemble of a global climate model. We use a novel application of a machine learning technique known as random forests to train a statistical model on the ensemble to make high-resolution model predictions of two important quantities: global mean top-of-atmosphere energy flux and precipitation. The random forests leverage cheaper low-resolution simulations, greatly reducing the number of high-resolution simulations required to train the statistical model. We demonstrate that high-resolution predictions of these quantities can be obtained by training on an ensemble that includes only a small number of high-resolution simulations. We also find that global annually averaged precipitation is more sensitive to resolution changes than to any of the model parameters considered.

  12. How random is the random forest? Random forest algorithm on the service of structural imaging biomarkers for Alzheimer's disease: from Alzheimer's disease neuroimaging initiative (ADNI) database.

    PubMed

    Dimitriadis, Stavros I; Liparas, Dimitris

    2018-06-01

    Neuroinformatics is a fascinating research field that applies computational models and analytical tools to high dimensional experimental neuroscience data for a better understanding of how the brain functions or dysfunctions in brain diseases. Neuroinformaticians work in the intersection of neuroscience and informatics supporting the integration of various sub-disciplines (behavioural neuroscience, genetics, cognitive psychology, etc.) working on brain research. Neuroinformaticians are the pathway of information exchange between informaticians and clinicians for a better understanding of the outcome of computational models and the clinical interpretation of the analysis. Machine learning is one of the most significant computational developments in the last decade giving tools to neuroinformaticians and finally to radiologists and clinicians for an automatic and early diagnosis-prognosis of a brain disease. The random forest (RF) algorithm has been successfully applied to high-dimensional neuroimaging data for feature reduction and has also been applied to classify the clinical label of a subject using single or multi-modal neuroimaging datasets. Our aim was to review the studies where RF was applied to correctly predict Alzheimer's disease (AD), the conversion from mild cognitive impairment (MCI) and its robustness to overfitting, outliers and handling of non-linear data. Finally, we described our RF-based model that gave us the 1st position in an international challenge for automated prediction of MCI from MRI data.

  13. Semi-empirical modelling for forest above ground biomass estimation using hybrid and fully PolSAR data

    NASA Astrophysics Data System (ADS)

    Tomar, Kiledar S.; Kumar, Shashi; Tolpekin, Valentyn A.; Joshi, Sushil K.

    2016-05-01

    Forests act as a carbon sink and thereby help maintain the carbon cycle in the atmosphere. Deforestation leads to an imbalance in the global carbon cycle and to changes in climate. Hence, estimation of forest biophysical parameters such as biomass becomes a necessity. PolSAR can discriminate the contributions of scattering elements such as surface, double-bounce and volume scattering within a single SAR resolution cell. Studies have shown that volume scattering, which arises mainly from the randomly oriented structures of vegetation, is a significant parameter for forest biophysical characterization. This random orientation of forest structure causes a shift in the orientation angle of the polarization ellipse, which ultimately disturbs the radar signature and leads to overestimation of volume scattering and underestimation of double-bounce scattering after decomposition of fully PolSAR data. Hybrid polarimetry has the advantage of zero POA shift due to rotational symmetry followed by the circular transmission of electromagnetic waves. The prime objective of this study was to assess the potential of Hybrid PolSAR and fully PolSAR data for AGB estimation using the Extended Water Cloud model. Validation was performed using field biomass. The study site chosen was Barkot Forest, Uttarakhand, India. To obtain the decomposition components, m-alpha and Yamaguchi decomposition modelling were applied to the Hybrid and fully PolSAR data, respectively. An RGB composite image was generated for each decomposition technique. The contributions of all scattering types from each plot were extracted for the m-alpha and Yamaguchi decompositions. The R² values between modelled AGB and field biomass were 0.5127 for Hybrid PolSAR and 0.4625 for fully PolSAR data. The corresponding RMSE values were 63.156 t ha⁻¹ and 73.424 t ha⁻¹. On the basis of RMSE and R², this study suggests Hybrid PolSAR decomposition modelling for retrieving scattering elements for AGB estimation in forests.

  14. Radiative transfer theory for active remote sensing of a forested canopy

    NASA Technical Reports Server (NTRS)

    Karam, M. A.; Fung, A. K.

    1989-01-01

    A canopy is modeled as a two-layer medium above a rough interface. The upper layer stands for the forest crown, with the leaves modeled as randomly oriented and distributed disks and needles and the branches modeled as randomly oriented finite dielectric cylinders. The lower layer contains the tree trunks, modeled as randomly positioned vertical cylinders above the rough soil. Radiative-transfer theory is applied to calculate EM scattering from such a canopy, which is expressed in terms of the scattering-amplitude tensors (SATs). For leaves, the generalized Rayleigh-Gans approximation is applied, whereas the branch and trunk SATs are obtained by estimating the inner field by fields inside a similar cylinder of infinite length. The Kirchhoff method is used to calculate the soil SAT. For a plane wave exciting the canopy, the radiative-transfer equations are solved by iteration to the first order in albedo of the leaves and the branches. Numerical results are illustrated as a function of the incidence angle.

  15. GPURFSCREEN: a GPU based virtual screening tool using random forest classifier.

    PubMed

    Jayaraj, P B; Ajay, Mathias K; Nufail, M; Gopakumar, G; Jaleel, U C A

    2016-01-01

    In-silico methods are an integral part of the modern drug discovery paradigm. Virtual screening, an in-silico method, is used to refine data models and reduce the chemical space on which wet lab experiments need to be performed. Virtual screening of a ligand data model requires large scale computations, making it a highly time consuming task. This process can be sped up by implementing parallelized algorithms on a Graphical Processing Unit (GPU). Random Forest is a robust classification algorithm that can be employed in virtual screening. A ligand based virtual screening tool (GPURFSCREEN) that uses random forests on GPU systems has been proposed and evaluated in this paper. This tool produces optimized results at a lower execution time for large bioassay data sets. The quality of results produced by our tool on GPU is the same as that on a regular serial environment. Considering the magnitude of data to be screened, the parallelized virtual screening has a significantly lower running time at high throughput. The proposed parallel tool outperforms its serial counterpart by successfully screening billions of molecules in training and prediction phases.

  16. Prediction of 1-octanol solubilities using data from the Open Notebook Science Challenge.

    PubMed

    Buonaiuto, Michael A; Lang, Andrew S I D

    2015-12-01

    1-Octanol solubility is important in a variety of applications involving pharmacology and environmental chemistry. Current models are linear in nature and often require foreknowledge of either melting point or aqueous solubility. Here we extend the range of applicability of 1-octanol solubility models by creating a random forest model that can predict 1-octanol solubilities directly from structure. We created a random forest model using CDK descriptors that has an out-of-bag (OOB) R² value of 0.66 and an OOB mean squared error of 0.34. The model has been deployed for general use as a Shiny application. The 1-octanol solubility model provides reasonably accurate predictions of the 1-octanol solubility of organic solutes directly from structure. The model was developed under Open Notebook Science conditions, which makes it open, reproducible, and as useful as possible.
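
    As a reading aid for the two out-of-bag figures above, the following states one common convention for relating them. The symbols are assumptions introduced here, and the abstract does not say which definition of R² was used.

```latex
% Assumed notation: y_i is the observed value for compound i, \hat{y}_i^{\mathrm{OOB}} is the
% prediction averaged over only those trees that did not see compound i during training,
% and n is the number of compounds.
\[
  \mathrm{MSE}_{\mathrm{OOB}} = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i^{\mathrm{OOB}}\bigr)^2,
  \qquad
  R^2_{\mathrm{OOB}} = 1 - \frac{\mathrm{MSE}_{\mathrm{OOB}}}{\widehat{\mathrm{Var}}(y)}
\]
```

    Under this convention, an R² of 0.66 together with a mean squared error of 0.34 would correspond to a response variance of about 1.0 in the units of the modelled quantity; this is an inference from the formula, not a figure reported in the abstract.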

  17. Aggregating pixel-level basal area predictions derived from LiDAR data to industrial forest stands in North-Central Idaho

    Treesearch

    Andrew T. Hudak; Jeffrey S. Evans; Nicholas L. Crookston; Michael J. Falkowski; Brant K. Steigers; Rob Taylor; Halli Hemingway

    2008-01-01

    Stand exams are the principal means by which timber companies monitor and manage their forested lands. Airborne LiDAR surveys sample forest stands at much finer spatial resolution and broader spatial extent than is practical on the ground. In this paper, we developed models that leverage spatially intensive and extensive LiDAR data and a stratified random sample of...

  18. Mapping ecological systems with a random forest model: tradeoffs between errors and bias

    Treesearch

    Emilie Grossmann; Janet Ohmann; James Kagan; Heather May; Matthew Gregory

    2010-01-01

    New methods for predictive vegetation mapping allow improved estimations of plant community composition across large regions. Random Forest (RF) models limit over-fitting problems of other methods, and are known for making accurate classification predictions from noisy, nonnormal data, but can be biased when plot samples are unbalanced. We developed two contrasting...

  19. Predicting CD4 count changes among patients on antiretroviral treatment: Application of data mining techniques.

    PubMed

    Kebede, Mihiretu; Zegeye, Desalegn Tigabu; Zeleke, Berihun Megabiaw

    2017-12-01

    To monitor the progress of therapy and disease progression, periodic CD4 counts are required throughout the course of HIV/AIDS care and support. The demand for CD4 count measurement has increased as ART programs have expanded over the last decade. This study aimed to predict CD4 count changes and to identify the predictors of CD4 count changes among patients on ART. A cross-sectional study was conducted at the University of Gondar Hospital among 3,104 adult patients on ART with CD4 counts measured at least twice (baseline and most recent). Data were retrieved from the HIV care clinic electronic database and patients' charts. Descriptive data were analyzed with SPSS version 20. The Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology was followed. WEKA version 3.8 was used for predictive data mining. Before building the predictive models, information gain values and correlation-based feature selection methods were used for attribute selection. Variables were ranked according to their relevance based on their information gain values. J48, Neural Network, and Random Forest algorithms were tested to assess model accuracies. The median duration of ART was 191.5 weeks. The mean CD4 count change was 243 (SD 191.14) cells per microliter. Overall, 2,427 (78.2%) patients had their CD4 counts increased by at least 100 cells per microliter, while 4% had a decline from the baseline CD4 value. Baseline variables including age, educational status, CD8 count, ART regimen, and hemoglobin levels predicted CD4 count changes, with predictive accuracies of J48, Neural Network, and Random Forest being 87.1%, 83.5%, and 99.8%, respectively. The Random Forest algorithm was more accurate than both J48 and the Artificial Neural Network. The precision, sensitivity and recall values of Random Forest were also more than 99%. Nearly accurate prediction results were obtained using the Random Forest algorithm. This algorithm could be used in a low-resource setting to build a web-based prediction model for CD4 count changes. Copyright © 2017 Elsevier B.V. All rights reserved.

  20. Modeling Verdict Outcomes Using Social Network Measures: The Watergate and Caviar Network Cases.

    PubMed

    Masías, Víctor Hugo; Valle, Mauricio; Morselli, Carlo; Crespo, Fernando; Vargas, Augusto; Laengle, Sigifredo

    2016-01-01

    Modelling criminal trial verdict outcomes using social network measures is an emerging research area in quantitative criminology. Few studies have yet analyzed which of these measures are the most important for verdict modelling or which data classification techniques perform best for this application. To compare the performance of different techniques in classifying members of a criminal network, this article applies three different machine learning classifiers (Logistic Regression, Naïve Bayes and Random Forest) with a range of social network measures and the necessary databases to model the verdicts in two real-world cases: the U.S. Watergate Conspiracy of the 1970s and the now-defunct Canada-based international drug trafficking ring known as the Caviar Network. In both cases it was found that the Random Forest classifier did better than either Logistic Regression or Naïve Bayes, and its superior performance was statistically significant. This being so, Random Forest was used not only for classification but also to assess the importance of the measures. For the Watergate case, the most important one proved to be betweenness centrality while for the Caviar Network, it was the effective size of the network. These results are significant because they show that an approach combining machine learning with social network analysis not only can generate accurate classification models but also helps quantify the importance of social network variables in modelling verdict outcomes. We conclude our analysis with a discussion and some suggestions for future work in verdict modelling using social network measures.

  1. Simple to complex modeling of breathing volume using a motion sensor.

    PubMed

    John, Dinesh; Staudenmayer, John; Freedson, Patty

    2013-06-01

    To compare simple and complex modeling techniques to estimate categories of low, medium, and high ventilation (VE) from ActiGraph™ activity counts. Vertical axis ActiGraph™ GT1M activity counts, oxygen consumption and VE were measured during treadmill walking and running, sports, household chores and labor-intensive employment activities. Categories of low (<19.3 l/min), medium (19.3 to 35.4 l/min) and high (>35.4 l/min) VEs were derived from activity intensity classifications (light <2.9 METs, moderate 3.0 to 5.9 METs and vigorous >6.0 METs). We examined the accuracy of two simple techniques (multiple regression and activity count cut-point analyses) and one complex modeling technique (random forest) in predicting VE from activity counts. Prediction accuracy of the complex random forest technique was marginally better than the simple multiple regression method. Both techniques accurately predicted VE categories almost 80% of the time. The multiple regression and random forest techniques were more accurate (85 to 88%) in predicting medium VE. Both techniques predicted the high VE (70 to 73%) with greater accuracy than low VE (57 to 60%). ActiGraph™ cut-points for low, medium and high VEs were <1381, 1381 to 3660 and >3660 cpm. There were minor differences in prediction accuracy between the multiple regression and the random forest technique. This study provides methods to objectively estimate VE categories using activity monitors that can easily be deployed in the field. Objective estimates of VE should provide a better understanding of the dose-response relationship between internal exposure to pollutants and disease. Copyright © 2013 Elsevier B.V. All rights reserved.
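
    The cut-point technique mentioned above amounts to a threshold lookup. The sketch below uses only the cut-points reported in the abstract (<1381, 1381 to 3660 and >3660 counts per minute); the function name and the returned labels are hypothetical.

```python
def ve_category(counts_per_minute):
    """Map vertical-axis activity counts per minute to a ventilation (VE) category."""
    if counts_per_minute < 1381:
        return "low"
    if counts_per_minute <= 3660:
        return "medium"
    return "high"


categories = [ve_category(cpm) for cpm in (500, 1381, 3660, 3661)]
```

    Both boundary values (1381 and 3660) fall in the medium category, following the ranges as reported.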

  2. Mathematical models application for mapping soils spatial distribution on the example of the farm from the North of Udmurt Republic of Russia

    NASA Astrophysics Data System (ADS)

    Dokuchaev, P. M.; Meshalkina, J. L.; Yaroslavtsev, A. M.

    2018-01-01

    A comparative analysis of geospatial soil modeling using multinomial logistic regression, decision tree, random forest, regression tree and support vector machine algorithms was conducted. Visual interpretation of the resulting digital maps and their comparison with the existing map, together with quantitative assessment of the overall detection accuracy for individual soil groups and of the models' kappa, showed that multinomial logistic regression, support vector machine, and random forest models that spatially predict the distribution of the conditional soil groups can be reliably used for mapping the study area. Detection was most accurate for lightly eroded and moderately eroded sod-podzolic soils (Phaeozems Albic). Second, in terms of mean overall prediction accuracy, were non-eroded and warp sod-podzolic soils, as well as sod-gley soils (Umbrisols Gleyic) and alluvial soils (Fluvisols Dystric, Umbric). Heavily eroded sod-podzolic and gray forest soils (Phaeozems Albic) were detected worst by the automatic classification methods.

  3. EDITORIAL: Special section on foliage penetration

    NASA Astrophysics Data System (ADS)

    Fiddy, M. A.; Lang, R.; McGahan, R. V.

    2004-04-01

    Waves in Random Media was founded in 1991 to provide a forum for papers dealing with electromagnetic and acoustic waves as they propagate and scatter through media or objects having some degree of randomness. This is a broad charter since, in practice, all scattering obstacles and structures have roughness or randomness, often on the scale of the wavelength being used to probe them. Including this random component leads to some quite different methods for describing propagation effects, for example, when propagating through the atmosphere or the ground. This special section on foliage penetration (FOPEN) focuses on the problems arising from microwave propagation through foliage and vegetation. Applications of such studies include the estimation for forest biomass and the moisture of the underlying soil, as well as detecting objects hidden therein. In addition to the so-called `direct problem' of trying to describe energy propagating through such media, the complementary inverse problem is of great interest and much harder to solve. The development of theoretical models and associated numerical algorithms for identifying objects concealed by foliage has applications in surveillance, ranging from monitoring drug trafficking to targeting military vehicles. FOPEN can be employed to map the earth's surface in cases when it is under a forest canopy, permitting the identification of objects or targets on that surface, but the process for doing so is not straightforward. There has been an increasing interest in foliage penetration synthetic aperture radar (FOPEN or FOPENSAR) over the last 10 years and this special section provides a broad overview of many of the issues involved. The detection, identification, and geographical location of targets under foliage or otherwise obscured by poor visibility conditions remains a challenge. 
    In particular, a trade-off often needs to be appreciated, namely that diminishing the deleterious effects of multiple scattering from leaves is typically associated with a significant loss in target resolution. Foliage is more or less transparent to some radar frequencies, but longer wavelengths found in the VHF (30 to 300 MHz) and UHF (300 MHz to 3 GHz) portions of the microwave spectrum have more chance of penetrating foliage than do wavelengths at the X band (8 to 12 GHz). Reflection and multiple scattering occur for some other frequencies and models of the processes involved are crucial. Two topical reviews can be found in this issue, one on the microwave radiometry of forests (page S275) and another describing ionospheric effects on space-based radar (page S189). Subsequent papers present new results on modelling coherent backscatter from forests (page S299), modelling forests as discrete random media over a random interface (page S359) and interpreting ranging scatterometer data from forests (page S317). Cloude et al present research on identifying targets beneath foliage using polarimetric SAR interferometry (page S393) while Treuhaft and Siqueira use interferometric radar to describe forest structure and biomass (page S345). Vechhia et al model scattering from leaves (page S333) and Semichaevsky et al address the problem of the trade-off between increasing wavelength, reduction in multiple scattering, and target resolution (page S415).

  4. Modeling forest fire occurrences using count-data mixed models in Qiannan autonomous prefecture of Guizhou province in China.

    PubMed

    Xiao, Yundan; Zhang, Xiongqing; Ji, Ping

    2015-01-01

    Forest fires can cause catastrophic damage to natural resources. They can also have serious economic and social impacts. Meteorological factors play a critical role in establishing conditions favorable for a forest fire. Effective prediction of forest fire occurrences could prevent or minimize losses. This paper uses count data models to analyze fire occurrence data, which are likely to be dispersed and frequently contain an excess of zero counts (no fire occurrence). Such data have commonly been analyzed using count data models such as a Poisson model, negative binomial model (NB), zero-inflated models, and hurdle models. The data used in this paper were collected from Qiannan autonomous prefecture of Guizhou province in China. Using the fire occurrence data from January to April (spring fire season) for the years 1996 through 2007, we introduced random effects to the count data models. The results indicated that the prediction achieved through the NB model provided a more compelling and credible inferential basis for fitting actual forest fire occurrence, and that the mixed-effects model performed better than the corresponding fixed-effects model in forest fire forecasting. In addition, among all meteorological factors, we found that relative humidity and wind speed are highly correlated with fire occurrence.
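
    For orientation, the count-data models named above can be written in their usual textbook forms. The notation below is an assumption made for illustration and is not taken from the paper, which may parameterise its models differently.

```latex
% Assumed notation: Y_{ij} is the fire count for group i (e.g. a year or station) and
% observation j, x_{ij} holds the meteorological covariates, \beta the fixed effects,
% and u_i a group-level random effect.
\[
  \log \mu_{ij} = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + u_i,
  \qquad u_i \sim \mathcal{N}(0, \sigma_u^2)
\]
% Poisson: the conditional variance equals the mean.
\[
  \mathrm{Var}(Y_{ij} \mid u_i) = \mu_{ij}
\]
% Negative binomial (NB2 form): the extra parameter \alpha > 0 allows overdispersion.
\[
  \mathrm{Var}(Y_{ij} \mid u_i) = \mu_{ij} + \alpha\,\mu_{ij}^{2}
\]
```

    Setting the random-effect variance to zero recovers the corresponding fixed-effects model, which is the comparison the abstract describes.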

  5. Modeling Forest Fire Occurrences Using Count-Data Mixed Models in Qiannan Autonomous Prefecture of Guizhou Province in China

    PubMed Central

    Ji, Ping

    2015-01-01

    Forest fires can cause catastrophic damage to natural resources. They can also have serious economic and social impacts. Meteorological factors play a critical role in establishing conditions favorable for a forest fire. Effective prediction of forest fire occurrences could prevent or minimize losses. This paper uses count data models to analyze fire occurrence data, which are likely to be dispersed and frequently contain an excess of zero counts (no fire occurrence). Such data have commonly been analyzed using count data models such as a Poisson model, negative binomial model (NB), zero-inflated models, and hurdle models. The data used in this paper were collected from Qiannan autonomous prefecture of Guizhou province in China. Using the fire occurrence data from January to April (spring fire season) for the years 1996 through 2007, we introduced random effects to the count data models. The results indicated that the prediction achieved through the NB model provided a more compelling and credible inferential basis for fitting actual forest fire occurrence, and that the mixed-effects model performed better than the corresponding fixed-effects model in forest fire forecasting. In addition, among all meteorological factors, we found that relative humidity and wind speed are highly correlated with fire occurrence. PMID:25790309

  6. Quantifying Biomass from Point Clouds by Connecting Representations of Ecosystem Structure

    NASA Astrophysics Data System (ADS)

    Hendryx, S. M.; Barron-Gafford, G.

    2017-12-01

    Quantifying terrestrial ecosystem biomass is an essential part of monitoring carbon stocks and fluxes within the global carbon cycle and optimizing natural resource management. Point cloud data such as from lidar and structure from motion can be effective for quantifying biomass over large areas, but significant challenges remain in developing effective models that allow for such predictions. Inference models that estimate biomass from point clouds are established in many environments, yet are often scale-dependent, needing to be fitted and applied at the same spatial scale and grid size at which they were developed. Furthermore, training such models typically requires large in situ datasets that are often prohibitively costly or time-consuming to obtain. We present here a scale- and sensor-invariant framework for efficiently estimating biomass from point clouds. Central to this framework, we present a new algorithm, assignPointsToExistingClusters, that has been developed for finding matches between in situ data and clusters in remotely-sensed point clouds. The algorithm can be used for assessing canopy segmentation accuracy and for training and validating machine learning models for predicting biophysical variables. We demonstrate the algorithm's efficacy by using it to train a random forest model of above ground biomass in a shrubland environment in Southern Arizona. We show that by learning a nonlinear function to estimate biomass from segmented canopy features we can reduce error, especially in the presence of inaccurate clusterings, when compared to a traditional, deterministic technique to estimate biomass from remotely measured canopies. Our random forest on cluster features model extends established methods of training random forest regressions to predict biomass of subplots but requires significantly less training data and is scale invariant. The random forest on cluster features model reduced mean absolute error, when evaluated on all test data in leave-one-out cross-validation, by 40.6% from deterministic mesquite allometry and 35.9% from the inferred ecosystem-state allometric function. Our framework should allow for the inference of biomass more efficiently than common subplot methods and more accurately than individual tree segmentation methods in densely vegetated environments.

  7. Preference heterogeneity in a count data model of demand for off-highway vehicle recreation

    Treesearch

    Thomas P Holmes; Jeffrey E Englin

    2010-01-01

    This paper examines heterogeneity in the preferences for OHV recreation by applying the random parameters Poisson model to a data set of off-highway vehicle (OHV) users at four National Forest sites in North Carolina. The analysis develops estimates of individual consumer surplus and finds that estimates are systematically affected by the random parameter specification...

  8. A Global Study of GPP focusing on Light Use Efficiency in a Random Forest Regression Model

    NASA Astrophysics Data System (ADS)

    Fang, W.; Wei, S.; Yi, C.; Hendrey, G. R.

    2016-12-01

    Light use efficiency (LUE) is at the core of mechanistic modeling of global gross primary production (GPP). However, most LUE estimates in global models are satellite-based and coarsely measured with emphasis on environmental variables. Others are from eddy covariance towers with much greater spatial and temporal data quality and emphasis on mechanistic processes, but in a limited number of sites. In this paper, we conducted a comprehensive global study of tower-based LUE from 237 FLUXNET towers, and scaled up LUEs from in-situ tower level to global biome level. We integrated key environmental and biological variables into the tower-based LUE estimates, at 0.5° × 0.5° grid-cell resolution, using a random forest regression (RFR) approach. We then developed an RFR-LUE-GPP model using the grid-cell LUE data, and compared it to a tower-LUE-GPP model built in the conventional way of treating LUE as a series of biome-specific constants. In order to calibrate the LUE models, we developed a data-driven RFR-GPP model using a random forest regression method. Our results showed that LUE varies largely with latitude. We estimated a global area-weighted average of LUE at 1.21 g C m⁻² MJ⁻¹ APAR, which led to an estimated global GPP of 102.9 Gt C/year from 2000 to 2005. The tower-LUE-GPP model tended to overestimate forest GPP in tropical and boreal regions. Large uncertainties exist in GPP estimates over sparsely vegetated areas covered by savannas and woody savannas around the middle to low latitudes (e.g., 20°S to 40°S and 5°N to 15°N) due to lack of available data. Model results were improved by incorporating Köppen climate types to represent climate/meteorological information in machine learning modeling. This shed new light on the recognized issues of climate dependence of spring onset of photosynthesis and the challenges in modeling the biome GPP of evergreen broad leaf forests (EBF) accurately. The divergent responses of GPP to temperature and precipitation at mid-high latitudes and at mid-low latitudes echoed the necessity of modeling GPP separately by latitudes. This work provided a global distribution of LUE estimates, and developed a comprehensive algorithm for modeling global terrestrial carbon with high spatial and temporal resolution.
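
    A global area-weighted average over a gridded product can be illustrated with a minimal sketch. It assumes a regular latitude-longitude grid, where cell area is approximately proportional to the cosine of the cell-centre latitude. The function name and the sample values are hypothetical and are not taken from the study.

```python
import math

def area_weighted_mean(values, latitudes_deg):
    """Average gridded values, weighting each cell by cos(latitude).

    Assumption: on a regular lat-lon grid, cell area scales with the
    cosine of the cell-centre latitude.
    """
    weights = [math.cos(math.radians(lat)) for lat in latitudes_deg]
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical LUE values for two cells centred at 0 and 60 degrees latitude.
lue_mean = area_weighted_mean([1.0, 2.0], [0.0, 60.0])
```

    The high-latitude cell covers roughly half the area of the equatorial one, so it contributes half the weight to the mean.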

  9. Comparing spatial regression to random forests for large ...

    EPA Pesticide Factsheets

    Environmental data may be “large” due to number of records, number of covariates, or both. Random forests has a reputation for good predictive performance when using many covariates, whereas spatial regression, when using reduced rank methods, has a reputation for good predictive performance when using many records. In this study, we compare these two techniques using a data set containing the macroinvertebrate multimetric index (MMI) at 1859 stream sites with over 200 landscape covariates. Our primary goal is predicting MMI at over 1.1 million perennial stream reaches across the USA. For spatial regression modeling, we develop two new methods to accommodate large data: (1) a procedure that estimates optimal Box-Cox transformations to linearize covariate relationships; and (2) a computationally efficient covariate selection routine that takes into account spatial autocorrelation. We show that our new methods lead to cross-validated performance similar to random forests, but that there is an advantage for spatial regression when quantifying the uncertainty of the predictions. Simulations are used to clarify advantages for each method. This research investigates different approaches for modeling and mapping national stream condition. We use MMI data from the EPA's National Rivers and Streams Assessment and predictors from StreamCat (Hill et al., 2015). Previous studies have focused on modeling the MMI condition classes (i.e., good, fair, and po

  10. The Trail Making test: a study of its ability to predict falls in the acute neurological in-patient population.

    PubMed

    Mateen, Bilal Akhter; Bussas, Matthias; Doogan, Catherine; Waller, Denise; Saverino, Alessia; Király, Franz J; Playford, E Diane

    2018-05-01

    To determine whether tests of cognitive function and patient-reported outcome measures of motor function can be used to create a machine learning-based predictive tool for falls. Prospective cohort study. Tertiary neurological and neurosurgical center. In all, 337 in-patients receiving neurosurgical, neurological, or neurorehabilitation-based care. Binary (Y/N) for falling during the in-patient episode, the Trail Making Test (a measure of attention and executive function) and the Walk-12 (a patient-reported measure of physical function). The principal outcome was a fall during the in-patient stay (n = 54). The Trail Making Test was identified as the best predictor of falls. Moreover, the addition of other variables did not improve the prediction (Wilcoxon signed-rank P < 0.001). Classical linear statistical modeling methods were then compared with more recent machine learning based strategies, for example, random forests, neural networks, and support vector machines. The random forest was the best modeling strategy when utilizing just the Trail Making Test data (Wilcoxon signed-rank P < 0.001) with 68% (± 7.7) sensitivity, and 90% (± 2.3) specificity. This study identifies a simple yet powerful machine learning (Random Forest) based predictive model for an in-patient neurological population, utilizing a single neuropsychological test of cognitive function, the Trail Making Test.
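
    Sensitivity and specificity, as reported above, follow directly from the counts of correct and incorrect binary predictions. The sketch below uses hypothetical labels (1 = fall, 0 = no fall), not data from the study, and the function name is illustrative.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # falls correctly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # falls missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-falls correctly cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical example: 4 patients who fell, 6 who did not.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
```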

  11. Random forest feature selection approach for image segmentation

    NASA Astrophysics Data System (ADS)

    Lefkovits, László; Lefkovits, Szidónia; Emerich, Simina; Vaida, Mircea Florin

    2017-03-01

    In the field of image segmentation, discriminative models have shown promising performance. Generally, every such model begins with the extraction of numerous features from annotated images. Most authors create their discriminative model by using many features without using any selection criteria. A more reliable model can be built by using a framework that selects the important variables, from the point of view of the classification, and eliminates the unimportant ones. In this article we present a framework for feature selection and data dimensionality reduction. The methodology is built around the random forest (RF) algorithm and its variable importance evaluation. In order to deal with datasets so large as to be practically unmanageable, we propose an algorithm based on RF that reduces the dimension of the database by eliminating irrelevant features. Furthermore, this framework is applied to optimize our discriminative model for brain tumor segmentation.

  12. Multiple filters affect tree species assembly in mid-latitude forest communities.

    PubMed

    Kubota, Y; Kusumoto, B; Shiono, T; Ulrich, W

    2018-05-01

    Species assembly patterns of local communities are shaped by the balance between multiple abiotic/biotic filters and dispersal that both select individuals from species pools at the regional scale. Knowledge regarding functional assembly can provide insight into the relative importance of the deterministic and stochastic processes that shape species assembly. We evaluated the hierarchical roles of the α niche and β niches by analyzing the influence of environmental filtering relative to functional traits on geographical patterns of tree species assembly in mid-latitude forests. Using forest plot datasets, we examined the α niche traits (leaf and wood traits) and β niche properties (cold/drought tolerance) of tree species, and tested non-randomness (clustering/over-dispersion) of trait assembly based on null models that assumed two types of species pools related to biogeographical regions. For most plots, species assembly patterns fell within the range of random expectation. However, particularly for cold/drought tolerance-related β niche properties, deviation from randomness was frequently found; non-random clustering was predominant in higher latitudes with harsh climates. Our findings demonstrate that both randomness and non-randomness in trait assembly emerged as a result of the α and β niches, although we suggest the potential role of dispersal processes and/or species equalization through trait similarities in generating the prevalence of randomness. Clustering of β niche traits along latitudinal climatic gradients provides clear evidence of species sorting by filtering particular traits. Our results reveal that multiple filters through functional niches and stochastic processes jointly shape geographical patterns of species assembly across mid-latitude forests.

  13. Assessing the Potential of Land Use Modification to Mitigate Ambient NO2 and Its Consequences for Respiratory Health

    PubMed Central

    Rao, Meenakshi; George, Linda A.; Shandas, Vivek; Rosenstiel, Todd N.

    2017-01-01

    Understanding how local land use and land cover (LULC) shapes intra-urban concentrations of atmospheric pollutants, and thus human health, is a key component in designing healthier cities. Here, NO2 is modeled based on spatially dense summer and winter NO2 observations in Portland-Hillsboro-Vancouver (USA), and the spatial variation of NO2 with LULC is investigated using random forest, an ensemble data learning technique. The NO2 random forest model, together with BenMAP, is further used to develop a better understanding of the relationship among LULC, ambient NO2 and respiratory health. The impact of land use modifications on ambient NO2, and consequently on respiratory health, is also investigated using a sensitivity analysis. We find that NO2 associated with roadways and tree-canopied areas may be affecting annual incidence rates of asthma exacerbation in 4–12 year olds by +3000 per 100,000 and −1400 per 100,000, respectively. Our model shows that increasing local tree canopy by 5% may reduce local incidence rates of asthma exacerbation by 6%, indicating that targeted local tree-planting efforts may have a substantial impact on reducing city-wide incidence of respiratory distress. Our findings demonstrate the utility of random forest modeling in evaluating LULC modifications for enhanced respiratory health. PMID:28698523

  14. Development of machine learning models for diagnosis of glaucoma.

    PubMed

    Kim, Seong Jae; Cho, Kyong Jin; Oh, Sejong

    2017-01-01

    The study aimed to develop machine learning models that have strong prediction power and interpretability for diagnosis of glaucoma based on retinal nerve fiber layer (RNFL) thickness and visual field (VF). We collected various candidate features from the examination of RNFL thickness and VF. We also developed synthesized features from the original features. We then selected the features best suited for classification (diagnosis) through feature evaluation. We used 100 cases of data as a test dataset and 399 cases of data as a training and validation dataset. To develop the glaucoma prediction model, we considered four machine learning algorithms: C5.0, random forest (RF), support vector machine (SVM), and k-nearest neighbor (KNN). We repeatedly composed a learning model using the training dataset and evaluated it by using the validation dataset. Finally, we obtained the learning model that produces the highest validation accuracy. We analyzed the quality of the models using several measures. The random forest model shows the best performance, and the C5.0, SVM, and KNN models show similar accuracy. In the random forest model, the classification accuracy is 0.98, sensitivity is 0.983, specificity is 0.975, and AUC is 0.979. The developed prediction models show high accuracy, sensitivity, specificity, and AUC in classifying between glaucoma and healthy eyes. It will be used for predicting glaucoma against unknown examination records. Clinicians may reference the prediction results and be able to make better decisions. We may combine multiple learning models to increase prediction accuracy. The C5.0 model includes decision rules for prediction. It can be used to explain the reasons for specific predictions.

  15. Introducing two Random Forest based methods for cloud detection in remote sensing images

    NASA Astrophysics Data System (ADS)

    Ghasemian, Nafiseh; Akhoondzadeh, Mehdi

    2018-07-01

    Cloud detection is a necessary phase in satellite image processing to retrieve the atmospheric and lithospheric parameters. Currently, some cloud detection methods based on the Random Forest (RF) model have been proposed but they do not consider both spectral and textural characteristics of the image. Furthermore, they have not been tested in the presence of snow/ice. In this paper, we introduce two RF based algorithms, Feature Level Fusion Random Forest (FLFRF) and Decision Level Fusion Random Forest (DLFRF) to incorporate visible, infrared (IR) and thermal spectral and textural features (FLFRF) including Gray Level Co-occurrence Matrix (GLCM) and Robust Extended Local Binary Pattern (RELBP_CI) or visible, IR and thermal classifiers (DLFRF) for highly accurate cloud detection on remote sensing images. FLFRF first fuses visible, IR and thermal features. Thereafter, it uses the RF model to classify pixels to cloud, snow/ice and background or thick cloud, thin cloud and background. DLFRF considers visible, IR and thermal features (both spectral and textural) separately and feeds each set of features to the RF model. Then, it holds the vote matrix of each run of the model. Finally, it fuses the classifiers using the majority vote method. To demonstrate the effectiveness of the proposed algorithms, 10 Terra MODIS and 15 Landsat 8 OLI/TIRS images with different spatial resolutions are used in this paper. Quantitative analyses are based on manually selected ground truth data. Results show that after adding RELBP_CI to the input feature set, cloud detection accuracy improves. Also, the average cloud kappa values of FLFRF and DLFRF on MODIS images (1 and 0.99) are higher than other machine learning methods, Linear Discriminant Analysis (LDA), Classification And Regression Tree (CART), K Nearest Neighbor (KNN) and Support Vector Machine (SVM) (0.96). The average snow/ice kappa values of FLFRF and DLFRF on MODIS images (1 and 0.85) are higher than other traditional methods. The quantitative values on Landsat 8 images show a similar trend. Consequently, while SVM and K-nearest neighbor show overestimation in predicting cloud and snow/ice pixels, our Random Forest (RF) based models can achieve higher cloud, snow/ice kappa values on MODIS and thin cloud, thick cloud and snow/ice kappa values on Landsat 8 images. Our algorithms predict both thin and thick cloud on Landsat 8 images while the existing cloud detection algorithm, Fmask, cannot discriminate them. Compared to the state-of-the-art methods, our algorithms have acquired higher average cloud and snow/ice kappa values for different spatial resolutions.
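
    The decision-level fusion step described above (one classifier per feature set, combined by majority vote) can be sketched as follows. This is a simplified illustration that fuses hard per-pixel labels; the abstract does not specify the vote-matrix layout or the tie-breaking rule, so both are assumptions here, and all names and labels are hypothetical.

```python
from collections import Counter

def majority_vote(per_classifier_labels):
    """Fuse per-pixel labels from several classifiers by majority vote.

    per_classifier_labels: list of equal-length label lists, one per classifier.
    Assumption: ties go to the earliest classifier whose label has the top count.
    """
    fused = []
    for votes in zip(*per_classifier_labels):
        counts = Counter(votes)
        best = max(counts.values())
        fused.append(next(v for v in votes if counts[v] == best))
    return fused

# Hypothetical per-pixel predictions from three feature-specific classifiers.
visible = ["cloud", "snow", "background", "cloud"]
infrared = ["cloud", "cloud", "background", "snow"]
thermal = ["snow", "snow", "background", "background"]
fused = majority_vote([visible, infrared, thermal])
```

    The last pixel is a three-way tie, so the label from the first (visible) classifier is kept under the stated tie-breaking assumption.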

  16. Modelling Biophysical Parameters of Maize Using Landsat 8 Time Series

    NASA Astrophysics Data System (ADS)

    Dahms, Thorsten; Seissiger, Sylvia; Conrad, Christopher; Borg, Erik

    2016-06-01

    Open and free access to multi-frequent high-resolution data (e.g. Sentinel-2) will fortify agricultural applications based on satellite data. The temporal and spatial resolution of these remote sensing datasets directly affects the applicability of remote sensing methods, for instance the robust retrieval of biophysical parameters over the entire growing season with very high geometric resolution. In this study we use machine learning methods to predict biophysical parameters, namely the fraction of absorbed photosynthetic radiation (FPAR), the leaf area index (LAI) and the chlorophyll content, from high resolution remote sensing. 30 Landsat 8 OLI scenes were available in our study region in Mecklenburg-Western Pomerania, Germany. In-situ data were collected weekly to bi-weekly on 18 maize plots throughout the summer season 2015. The study aims at an optimized prediction of biophysical parameters and the identification of the best explaining spectral bands and vegetation indices. For this purpose, we used the entire in-situ dataset from 24.03.2015 to 15.10.2015. Random forest and conditional inference forests were used because of their explicit strong exploratory and predictive character. Variable importance measures allowed for analysing the relation between the biophysical parameters with respect to the spectral response, and the performance of the two approaches over the plant stock evolvement. Classical random forest regression outperformed conditional inference forests, in particular when modelling the biophysical parameters over the entire growing period. For example, modelling biophysical parameters of maize for the entire vegetation period using random forests yielded: FPAR: R² = 0.85; RMSE = 0.11; LAI: R² = 0.64; RMSE = 0.9 and chlorophyll content (SPAD): R² = 0.80; RMSE = 4.9. Our results demonstrate the great potential in using machine-learning methods for the interpretation of long-term multi-frequent remote sensing datasets to model biophysical parameters.

  17. Effect of inventory method on niche models: random versus systematic error

    Treesearch

    Heather E. Lintz; Andrew N. Gray; Bruce McCune

    2013-01-01

    Data from large-scale biological inventories are essential for understanding and managing Earth's ecosystems. The Forest Inventory and Analysis Program (FIA) of the U.S. Forest Service is the largest biological inventory in North America; however, the FIA inventory recently changed from an amalgam of different approaches to a nationally-standardized approach in...

  18. BitterSweetForest: A random forest based binary classifier to predict bitterness and sweetness of chemical compounds

    NASA Astrophysics Data System (ADS)

    Banerjee, Priyanka; Preissner, Robert

    2018-04-01

    The taste of chemical compounds present in food stimulates us to take in nutrients and avoid poisons. However, the perception of taste greatly depends on the genetic as well as evolutionary perspectives. The aim of this work was the development and validation of a machine learning model based on molecular fingerprints to discriminate between sweet and bitter taste of molecules. BitterSweetForest is the first open access model based on KNIME workflow that provides a platform for prediction of bitter and sweet taste of chemical compounds using molecular fingerprints and a Random Forest based classifier. The constructed model yielded an accuracy of 95% and an AUC of 0.98 in cross-validation. In an independent test set, BitterSweetForest achieved an accuracy of 96% and an AUC of 0.98 for bitter and sweet taste prediction. The constructed model was further applied to predict the bitter and sweet taste of natural compounds, approved drugs as well as on an acute toxicity compound data set. BitterSweetForest suggests 70% of the natural product space as bitter and 10% of the natural product space as sweet with a confidence score of 0.60 and above. 77% of the approved drug set was predicted as bitter and 2% as sweet with a confidence score of 0.75 and above. Similarly, 75% of the total compounds from the acute oral toxicity class were predicted only as bitter with a minimum confidence score of 0.75, revealing that toxic compounds are mostly bitter. Furthermore, we applied a Bayesian based feature analysis method to discriminate the most occurring chemical features between sweet and bitter compounds from the feature space of a circular fingerprint.

  19. BitterSweetForest: A Random Forest Based Binary Classifier to Predict Bitterness and Sweetness of Chemical Compounds

    PubMed Central

    Banerjee, Priyanka; Preissner, Robert

    2018-01-01

    The taste of a chemical compound present in food stimulates us to take in nutrients and avoid poisons. However, the perception of taste greatly depends on the genetic as well as evolutionary perspectives. The aim of this work was the development and validation of a machine learning model based on molecular fingerprints to discriminate between sweet and bitter taste of molecules. BitterSweetForest is the first open access model based on KNIME workflow that provides a platform for prediction of bitter and sweet taste of chemical compounds using molecular fingerprints and a Random Forest based classifier. The constructed model yielded an accuracy of 95% and an AUC of 0.98 in cross-validation. In an independent test set, BitterSweetForest achieved an accuracy of 96% and an AUC of 0.98 for bitter and sweet taste prediction. The constructed model was further applied to predict the bitter and sweet taste of natural compounds, approved drugs as well as on an acute toxicity compound data set. BitterSweetForest suggests 70% of the natural product space as bitter and 10% of the natural product space as sweet with a confidence score of 0.60 and above. 77% of the approved drug set was predicted as bitter and 2% as sweet with a confidence score of 0.75 and above. Similarly, 75% of the total compounds from the acute oral toxicity class were predicted only as bitter with a minimum confidence score of 0.75, revealing that toxic compounds are mostly bitter. Furthermore, we applied a Bayesian based feature analysis method to discriminate the most occurring chemical features between sweet and bitter compounds using the feature space of a circular fingerprint. PMID:29696137

  20. Classification of savanna tree species, in the Greater Kruger National Park region, by integrating hyperspectral and LiDAR data in a Random Forest data mining environment

    NASA Astrophysics Data System (ADS)

    Naidoo, L.; Cho, M. A.; Mathieu, R.; Asner, G.

    2012-04-01

    The accurate classification and mapping of individual trees at species level in the savanna ecosystem can provide numerous benefits for the managerial authorities. Such benefits include the mapping of economically useful tree species, which are a key source of food production and fuel wood for the local communities, and of problematic alien invasive and bush encroaching species, which can threaten the integrity of the environment and livelihoods of the local communities. Species level mapping is particularly challenging in African savannas which are complex, heterogeneous, and open environments with high intra-species spectral variability due to differences in geology, topography, rainfall, herbivory and human impacts within relatively short distances. Savanna vegetation is also highly irregular in canopy and crown shape, height and other structural dimensions with a combination of open grassland patches and dense woody thicket - a stark contrast to the more homogeneous forest vegetation. This study classified eight common savanna tree species in the Greater Kruger National Park region, South Africa, using a combination of hyperspectral and Light Detection and Ranging (LiDAR)-derived structural parameters, in the form of seven predictor datasets, in an automated Random Forest modelling approach. The most important predictors, which were found to play an important role in the different classification models and contributed to the success of the hybrid dataset model when combined, were species tree height; NDVI; the chlorophyll b wavelength (466 nm) and a selection of raw, continuum removed and Spectral Angle Mapper (SAM) bands. It was also concluded that the hybrid predictor dataset Random Forest model yielded the highest classification accuracy and prediction success for the eight savanna tree species with an overall classification accuracy of 87.68% and KHAT value of 0.843.
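
    Overall classification accuracy and the KHAT statistic (the sample estimate of Cohen's kappa) are both derived from a confusion matrix. The sketch below shows the standard calculation on a hypothetical two-class matrix; the values are illustrative and are not the study's eight-species results.

```python
def overall_accuracy_and_kappa(matrix):
    """Return (overall accuracy, kappa) for a square confusion matrix.

    matrix[i][j] = number of samples of reference class i assigned to class j.
    """
    n = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / n
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Hypothetical confusion matrix for two classes.
confusion = [[40, 10],
             [5, 45]]
accuracy, kappa = overall_accuracy_and_kappa(confusion)
```

    Kappa is lower than raw accuracy because it discounts the agreement that would be expected by chance alone.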

  1. Application and partial validation of a habitat model for moose in the Lake Superior region

    USGS Publications Warehouse

    Allen, A.W.; Terrell, J.W.; Mangus, W.L.; Lindquist, E.L.

    1991-01-01

    A modified version of the dormant-season portion of a Habitat Suitability Index (HSI) model developed for assessing moose (Alces alces) habitat in the Lake Superior Region was incorporated in a Geographic Information System (GIS) for 490 km2 of Minnesota's Superior National Forest. Moose locations (n=235) were plotted during aerial surveys conducted in December 1988 and January 1990-1991. Dormant-season forage and cover quality for 1,000-m, 500-m, and 200-m radii plots around random points and moose locations were compared using U.S. Forest Service stand examination data. Cover quality indices were lower than forage quality indices within all plots. The median value for the average cover quality index was greater (P=0.003) within 200-m plots around cow moose locations than for plots around random points for the most severe winter of the study. The proportion of highest-quality winter cover, such as mixed stands dominated by mid-age class white spruce (Picea glauca) and balsam fir (Abies balsamea), was greater within 500-m and 200-m plots around cow moose than within similar plots around random points during the two most severe winters. These results indicate that suboptimum ratings of winter habitat quality used in the GIS for dormant-season forage >100 m from cover, as suggested in the original HSI model, are reasonable. Integrating the habitat model with forest stand data using a GIS permitted analysis of moose habitat within a relatively large geographic area. Simulation of habitat quality indicated a potential shortage of late-winter cover in the study area. The effects of forest management actions on moose habitat quality can be simulated without collecting additional data.

  2. A spatially explicit decision support model for restoration of forest bird habitat

    USGS Publications Warehouse

    Twedt, D.J.; Uihlein, W.B.; Elliott, A.B.

    2006-01-01

    The historical area of bottomland hardwood forest in the Mississippi Alluvial Valley has been reduced by >75%. Agricultural production was the primary motivator for deforestation; hence, clearing deliberately targeted higher and drier sites. Remaining forests are highly fragmented and hydrologically altered, with larger forest fragments subject to greater inundation, which has negatively affected many forest bird populations. We developed a spatially explicit decision support model, based on a Partners in Flight plan for forest bird conservation, that prioritizes forest restoration to reduce forest fragmentation and increase the area of forest core (interior forest >1 km from 'hostile' edge). Our primary objective was to increase the number of forest patches that harbor >2000 ha of forest core, but we also sought to increase the number and area of forest cores >5000 ha. Concurrently, we targeted restoration within local (320 km2) landscapes to achieve >60% forest cover. Finally, we emphasized restoration of higher-elevation bottomland hardwood forests in areas where restoration would not increase forest fragmentation. Reforestation of 10% of restorable land in the Mississippi Alluvial Valley (approximately 880,000 ha) targeted at priorities established by this decision support model resulted in approximately 824,000 ha of new forest core. This is more than 32 times the amount of core forest added through reforestation of randomly located fields (approximately 25,000 ha). The total area of forest core (1.6 million ha) that resulted from targeted restoration exceeded habitat objectives identified in the Partners in Flight Bird Conservation Plan and approached the area of forest core present in the 1950s.

  3. Free variable selection QSPR study to predict 19F chemical shifts of some fluorinated organic compounds using Random Forest and RBF-PLS methods

    NASA Astrophysics Data System (ADS)

    Goudarzi, Nasser

    2016-04-01

    In this work, two new and powerful chemometrics methods are applied for the modeling and prediction of the 19F chemical shift values of some fluorinated organic compounds. The radial basis function-partial least square (RBF-PLS) and random forest (RF) methods are employed to construct the models to predict the 19F chemical shifts. In this study, we did not use any variable selection method, since the RF method can serve as both a variable selection and a modeling technique. The effects of the important parameters affecting the prediction power of RF, such as the number of trees (nt) and the number of randomly selected variables to split each node (m), were investigated. The root-mean-square errors of prediction (RMSEP) for the training set and the prediction set for the RBF-PLS and RF models were 44.70, 23.86, 29.77, and 23.69, respectively. Also, the correlation coefficients of the prediction set for the RBF-PLS and RF models were 0.8684 and 0.9313, respectively. The results obtained reveal that the RF model can be used as a powerful chemometrics tool for quantitative structure-property relationship (QSPR) studies.

  4. Applying a weighted random forests method to extract karst sinkholes from LiDAR data

    NASA Astrophysics Data System (ADS)

    Zhu, Junfeng; Pierskalla, William P.

    2016-02-01

    Detailed mapping of sinkholes provides critical information for mitigating sinkhole hazards and understanding groundwater and surface water interactions in karst terrains. LiDAR (Light Detection and Ranging) measures the earth's surface at high resolution and high density and has shown great potential to drastically improve locating and delineating sinkholes. However, processing LiDAR data to extract sinkholes requires separating sinkholes from other depressions, which can be laborious because of the sheer number of depressions commonly generated from LiDAR data. In this study, we applied random forests, a machine learning method, to automatically separate sinkholes from other depressions in a karst region in central Kentucky. The sinkhole-extraction random forest was grown on a training dataset built from an area where LiDAR-derived depressions were manually classified through a visual inspection and field verification process. Based on the geometry of depressions, as well as natural and human factors related to sinkholes, 11 parameters were selected as predictive variables to form the dataset. Because the training dataset was imbalanced with the majority of depressions being non-sinkholes, a weighted random forests method was used to improve the accuracy of predicting sinkholes. The weighted random forest achieved an average accuracy of 89.95% for the training dataset, demonstrating that the random forest can be an effective sinkhole classifier. Testing of the random forest in another area, however, resulted in moderate success with an average accuracy rate of 73.96%. This study suggests that an automatic sinkhole extraction procedure like the random forest classifier can significantly reduce time and labor costs and make it more tractable to map sinkholes using LiDAR data for large areas. However, the random forests method cannot totally replace manual procedures, such as visual inspection and field verification.
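
    One common way to weight a random forest for imbalanced classes is to give each class a weight inversely proportional to its frequency and to use those weights when computing node impurity. The abstract does not say how the weights were chosen, so the scheme below is an assumption shown only to illustrate the idea; the function names and the toy labels are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}

def weighted_gini(labels, weights):
    """Gini impurity where each sample counts with its class weight."""
    totals = Counter()
    for y in labels:
        totals[y] += weights[y]
    s = sum(totals.values())
    return 1.0 - sum((t / s) ** 2 for t in totals.values())

# Hypothetical node: 8 non-sinkhole depressions and 2 sinkholes.
labels = ["other"] * 8 + ["sinkhole"] * 2
weights = balanced_class_weights(labels)
gini = weighted_gini(labels, weights)
```

    With these weights the minority class carries as much total weight as the majority class, so splits that isolate sinkholes are rewarded more than they would be under unweighted impurity.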

  5. Forecasting Daily Patient Outflow From a Ward Having No Real-Time Clinical Data

    PubMed Central

    Tran, Truyen; Luo, Wei; Phung, Dinh; Venkatesh, Svetha

    2016-01-01

    Background: Modeling patient flow is crucial in understanding resource demand and prioritization. We study patient outflow from an open ward in an Australian hospital, where currently bed allocation is carried out by a manager relying on past experiences and looking at demand. Automatic methods that provide a reasonable estimate of total next-day discharges can aid in efficient bed management. The challenges in building such methods lie in dealing with large amounts of discharge noise introduced by the nonlinear nature of hospital procedures, and the nonavailability of real-time clinical information in wards. Objective: Our study investigates different models to forecast the total number of next-day discharges from an open ward having no real-time clinical data. Methods: We compared 5 popular regression algorithms to model total next-day discharges: (1) autoregressive integrated moving average (ARIMA), (2) the autoregressive moving average with exogenous variables (ARMAX), (3) k-nearest neighbor regression, (4) random forest regression, and (5) support vector regression. Whereas the autoregressive integrated moving average model relied on past 3-month discharges, nearest neighbor forecasting used the median of similar discharges in the past in estimating next-day discharge. In addition, the ARMAX model used the day of the week and number of patients currently in ward as exogenous variables. For the random forest and support vector regression models, we designed a predictor set of 20 patient features and 88 ward-level features. Results: Our data consisted of 12,141 patient visits over 1826 days. Forecasting quality was measured using mean forecast error, mean absolute error, symmetric mean absolute percentage error, and root mean square error. When compared with a moving average prediction model, all 5 models demonstrated superior performance with the random forests achieving 22.7% improvement in mean absolute error, for all days in the year 2014. Conclusions: In the absence of clinical information, our study recommends using patient-level and ward-level data in predicting next-day discharges. Random forest and support vector regression models are able to use all available features from such data, resulting in superior performance over traditional autoregressive methods. An intelligent estimate of available beds in wards plays a crucial role in relieving access block in emergency departments. PMID:27444059
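
    The four forecast-quality measures named in this abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the sign convention for mean forecast error (actual minus forecast), the particular sMAPE variant, and the function name `forecast_errors` are all assumptions.

```python
import math


def forecast_errors(actual, predicted):
    """Return MFE, MAE, sMAPE (in percent) and RMSE for two paired series."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("series must be non-empty and of equal length")
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]  # assumed sign: actual - forecast

    mfe = sum(errors) / n                        # mean forecast error (bias)
    mae = sum(abs(e) for e in errors) / n        # mean absolute error
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    # Assumed sMAPE variant: |a - p| / ((|a| + |p|) / 2), averaged, times 100.
    # A term whose denominator is zero (both values zero) contributes 0.
    smape_terms = []
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2
        smape_terms.append(abs(a - p) / denom if denom else 0.0)
    smape = 100 * sum(smape_terms) / n

    return {"mfe": mfe, "mae": mae, "smape": smape, "rmse": rmse}


# Made-up daily discharge counts, purely for illustration.
example = forecast_errors([10, 12, 8], [8, 12, 10])
```

    In the made-up example the over- and under-forecasts cancel, so the mean forecast error is zero while the mean absolute error is not, which is why a bias measure and a magnitude measure are usually reported together.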

  6. Ensemble Feature Learning of Genomic Data Using Support Vector Machine

    PubMed Central

    Anaissi, Ali; Goyal, Madhu; Catchpoole, Daniel R.; Braytee, Ali; Kennedy, Paul J.

    2016-01-01

    The identification of a subset of genes having the ability to capture the necessary information to distinguish classes of patients is crucial in bioinformatics applications. Ensemble and bagging methods have been shown to work effectively in the process of gene selection and classification. Testament to that is random forest, which combines random decision trees with bagging to improve overall feature selection and classification accuracy. Surprisingly, the adoption of these methods in support vector machines has only recently received attention, and mostly for classification rather than gene selection. This paper introduces an ensemble SVM-Recursive Feature Elimination (ESVM-RFE) for gene selection that follows the concepts of ensemble and bagging used in random forest but adopts the backward elimination strategy, which is the rationale of the RFE algorithm. The rationale is that building ensemble SVM models from randomly drawn bootstrap samples of the training set produces different feature rankings, which are subsequently aggregated into one feature ranking. As a result, the decision for elimination of features is based upon the ranking of multiple SVM models instead of choosing one particular model. Moreover, this approach will address the problem of imbalanced datasets by constructing a nearly balanced bootstrap sample. Our experiments show that ESVM-RFE for gene selection substantially increased the classification performance on five microarray datasets compared to state-of-the-art methods. Experiments on the childhood leukaemia dataset show that an average 9% better accuracy is achieved by ESVM-RFE over SVM-RFE, and 5% over a random forest based approach. The genes selected by the ESVM-RFE algorithm were further explored with Singular Value Decomposition (SVD), which reveals significant clusters with the selected data. PMID:27304923
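
    The aggregation step described in this abstract (several per-model feature rankings merged into one before elimination) can be sketched in a few lines. The abstract does not state the aggregation rule, so averaging rank positions is an assumption; the function names and gene labels are hypothetical, and no SVM training is shown.

```python
from collections import defaultdict


def aggregate_rankings(rankings):
    """Merge several best-first feature rankings into one by mean position."""
    positions = defaultdict(list)
    for ranking in rankings:
        for position, feature in enumerate(ranking):
            positions[feature].append(position)
    mean_position = {f: sum(p) / len(p) for f, p in positions.items()}
    # Lower mean position is better; ties are broken alphabetically.
    return sorted(mean_position, key=lambda f: (mean_position[f], f))


def drop_worst(rankings, n_drop=1):
    """Backward-elimination step: remove the n_drop lowest-ranked features."""
    combined = aggregate_rankings(rankings)
    return combined[:-n_drop] if n_drop else combined


# Hypothetical rankings, as if produced by three models fit on bootstrap samples.
bootstrap_rankings = [
    ["g1", "g2", "g3"],
    ["g2", "g1", "g3"],
    ["g1", "g3", "g2"],
]
survivors = drop_worst(bootstrap_rankings)
```

    In a full recursive procedure this step would be repeated, refitting the models on the surviving features each round; only a single elimination round is shown here.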

  7. Validating predictions from climate envelope models

    USGS Publications Warehouse

    Watling, J.; Bucklin, D.; Speroterra, C.; Brandt, L.; Cabal, C.; Romañach, Stephanie S.; Mazzotti, Frank J.

    2013-01-01

    Climate envelope models are a potentially important conservation tool, but their ability to accurately forecast species’ distributional shifts using independent survey data has not been fully evaluated. We created climate envelope models for 12 species of North American breeding birds previously shown to have experienced poleward range shifts. For each species, we evaluated three different approaches to climate envelope modeling that differed in the way they treated climate-induced range expansion and contraction, using random forests and maximum entropy modeling algorithms. All models were calibrated using occurrence data from 1967–1971 (t1) and evaluated using occurrence data from 1998–2002 (t2). Model sensitivity (the ability to correctly classify species presences) was greater using the maximum entropy algorithm than the random forest algorithm. Although sensitivity did not differ significantly among approaches, for many species, sensitivity was maximized using a hybrid approach that assumed range expansion, but not contraction, in t2. Species for which the hybrid approach resulted in the greatest improvement in sensitivity have been reported from more land cover types than species for which there was little difference in sensitivity between hybrid and dynamic approaches, suggesting that habitat generalists may be buffered somewhat against climate-induced range contractions. Specificity (the ability to correctly classify species absences) was maximized using the random forest algorithm and was lowest using the hybrid approach. Overall, our results suggest cautious optimism for the use of climate envelope models to forecast range shifts, but also underscore the importance of considering non-climate drivers of species range limits. The use of alternative climate envelope models that make different assumptions about range expansion and contraction is a new and potentially useful way to help inform our understanding of climate change effects on species.
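
    Sensitivity and specificity, as defined in this abstract, can be computed from paired presence/absence observations and predictions. The sketch below is illustrative only; the function name and the toy data are assumptions and do not come from the study.

```python
def sensitivity_specificity(observed, predicted):
    """Return (sensitivity, specificity) for paired 0/1 observations and predictions.

    sensitivity = correctly predicted presences / all observed presences
    specificity = correctly predicted absences / all observed absences
    """
    tp = fn = tn = fp = 0
    for obs, pred in zip(observed, predicted):
        if obs and pred:
            tp += 1
        elif obs and not pred:
            fn += 1
        elif not obs and pred:
            fp += 1
        else:
            tn += 1
    # Undefined ratios (no presences or no absences) are reported as NaN.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Made-up survey: 1 = species present, 0 = species absent.
sens, spec = sensitivity_specificity([1, 1, 1, 0, 0], [1, 0, 1, 0, 1])
```

    Reporting the two values separately, as the abstract does, makes the trade-off visible: a model can gain sensitivity by predicting presence more liberally while losing specificity.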

  8. Predicting Seagrass Occurrence in a Changing Climate Using Random Forests

    NASA Astrophysics Data System (ADS)

    Aydin, O.; Butler, K. A.

    2017-12-01

    Seagrasses are marine plants that can quickly sequester vast amounts of carbon (up to 100 times more and 12 times faster than tropical forests). In this work, we present an integrated GIS and machine learning approach to build a data-driven model of seagrass presence-absence. We outline a random forest approach that avoids the prevalence bias in many ecological presence-absence models. One of our goals is to predict global seagrass occurrence from a spatially limited training sample. In addition, we conduct a sensitivity study which investigates the vulnerability of seagrass to changing climate conditions. We integrate multiple data sources including fine-scale seagrass data from MarineCadastre.gov and the recently released, globally extensive, publicly available Ecological Marine Units (EMU) dataset. These data are used to train a model for seagrass occurrence along the U.S. coast. In situ ocean data are interpolated using Empirical Bayesian Kriging (EBK) to produce globally extensive prediction variables. A neural network is used to estimate probable future values of prediction variables such as ocean temperature to assess the impact of a warming climate on seagrass occurrence. The proposed workflow can be generalized to many presence-absence models.

  9. CURE-SMOTE algorithm and hybrid algorithm for feature selection and parameter optimization based on random forests.

    PubMed

    Ma, Li; Fan, Suohai

    2017-03-14

    The random forests algorithm is a type of classifier with prominent universality, a wide application range, and robustness for avoiding overfitting. But there are still some drawbacks to random forests. Therefore, to improve the performance of random forests, this paper seeks to improve imbalanced data processing, feature selection and parameter optimization. We propose the CURE-SMOTE algorithm for the imbalanced data classification problem. Experiments on imbalanced UCI data reveal that the combination of Clustering Using Representatives (CURE) enhances the original synthetic minority oversampling technique (SMOTE) algorithms effectively compared with the classification results on the original data using random sampling, Borderline-SMOTE1, safe-level SMOTE, C-SMOTE, and k-means-SMOTE. Additionally, the hybrid RF (random forests) algorithm has been proposed for feature selection and parameter optimization, which uses the minimum out of bag (OOB) data error as its objective function. Simulation results on binary and higher-dimensional data indicate that the proposed hybrid RF algorithms, hybrid genetic-random forests algorithm, hybrid particle swarm-random forests algorithm and hybrid fish swarm-random forests algorithm can achieve the minimum OOB error and show the best generalization ability. The training set produced from the proposed CURE-SMOTE algorithm is closer to the original data distribution because it contains minimal noise. Thus, better classification results are produced from this feasible and effective algorithm. Moreover, the hybrid algorithm's F-value, G-mean, AUC and OOB scores demonstrate that they surpass the performance of the original RF algorithm. Hence, this hybrid algorithm provides a new way to perform feature selection and parameter optimization.
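
    The oversampling idea underlying SMOTE, which this abstract builds on, is to create a synthetic minority sample on the line segment between a minority point and one of its minority-class neighbours. The sketch below shows only that interpolation step; it does not include neighbour search or the CURE clustering the paper adds, and the function name and toy points are assumptions.

```python
import random


def smote_point(sample, neighbor, rng=random):
    """Return a synthetic point between `sample` and `neighbor`.

    A single gap in [0, 1) is drawn and applied to every coordinate, so the
    result lies on the straight segment joining the two points.
    """
    gap = rng.random()
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]


# Toy minority-class points; a seeded generator keeps the sketch repeatable.
rng = random.Random(0)
synthetic = smote_point([0.0, 0.0], [2.0, 4.0], rng)
```

    Because the new point is an interpolation, it stays inside the region spanned by existing minority samples, which is also why noisy or outlying minority points can propagate into the synthetic data.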

  10. Clustering Single-Cell Expression Data Using Random Forest Graphs.

    PubMed

    Pouyan, Maziyar Baran; Nourani, Mehrdad

    2017-07-01

    Complex tissues such as brain and bone marrow are made up of multiple cell types. As the study of biological tissue structure progresses, the role of cell-type-specific research becomes increasingly important. Novel sequencing technology such as single-cell cytometry provides researchers access to valuable biological data. Applying machine-learning techniques to these high-throughput datasets provides deep insights into the cellular landscape of the tissue of which those cells are a part. In this paper, we propose the use of random-forest-based single-cell profiling, a new machine-learning-based technique, to profile different cell types of intricate tissues using single-cell cytometry data. Our technique utilizes random forests to capture cell marker dependences and model the cellular populations using the cell network concept. This cellular network helps us discover what cell types are in the tissue. Our experimental results on public-domain datasets indicate promising performance and accuracy of our technique in extracting cell populations of complex tissues.

  11. Random Forest Application for NEXRAD Radar Data Quality Control

    NASA Astrophysics Data System (ADS)

    Keem, M.; Seo, B. C.; Krajewski, W. F.

    2017-12-01

    Identification and elimination of non-meteorological radar echoes (e.g., returns from ground, wind turbines, and biological targets) are the basic data quality control steps before radar data use in quantitative applications (e.g., precipitation estimation). Although WSR-88Ds' recent upgrade to dual-polarization has enhanced this quality control and echo classification, there are still challenges to detect some non-meteorological echoes that show precipitation-like characteristics (e.g., wind turbine or anomalous propagation clutter embedded in rain). With this in mind, a new quality control method using Random Forest is proposed in this study. This classification algorithm is known to produce reliable results with less uncertainty. The method introduces randomness into sampling and feature selections and integrates consequent multiple decision trees. The multidimensional structure of the trees can characterize the statistical interactions of involved multiple features in complex situations. The authors explore the performance of Random Forest method for NEXRAD radar data quality control. Training datasets are selected using several clear cases of precipitation and non-precipitation (but with some non-meteorological echoes). The model is structured using available candidate features (from the NEXRAD data) such as horizontal reflectivity, differential reflectivity, differential phase shift, copolar correlation coefficient, and their horizontal textures (e.g., local standard deviation). The influence of each feature on classification results is quantified by variable importance measures that are automatically estimated by the Random Forest algorithm. Therefore, the number and types of features in the final forest can be examined based on the classification accuracy. The authors demonstrate the capability of the proposed approach using several cases ranging from distinct to complex rain/no-rain events and compare the performance with the existing algorithms (e.g., MRMS). They also discuss operational feasibility based on the observed strengths and weaknesses of the method.

  12. Ecological impacts and management strategies for western larch in the face of climate-change

    Treesearch

    Gerald E. Rehfeldt; Barry C. Jaquish

    2010-01-01

    Approximately 185,000 forest inventory and ecological plots from both USA and Canada were used to predict the contemporary distribution of western larch (Larix occidentalis Nutt.) from climate variables. The random forests algorithm, using an 8-variable model, produced an overall error rate of about 2.9 %, nearly all of which consisted of predicting presence at...

  13. Minimizing effects of methodological decisions on interpretation and prediction in species distribution studies: An example with background selection

    USGS Publications Warehouse

    Jarnevich, Catherine S.; Talbert, Marian; Morisette, Jeffrey T.; Aldridge, Cameron L.; Brown, Cynthia; Kumar, Sunil; Manier, Daniel; Talbert, Colin; Holcombe, Tracy R.

    2017-01-01

    Evaluating the conditions where a species can persist is an important question in ecology both to understand tolerances of organisms and to predict distributions across landscapes. Presence data combined with background or pseudo-absence locations are commonly used with species distribution modeling to develop these relationships. However, there is not a standard method to generate background or pseudo-absence locations, and method choice affects model outcomes. We evaluated combinations of both model algorithms (simple and complex generalized linear models, multivariate adaptive regression splines, Maxent, boosted regression trees, and random forest) and background methods (random, minimum convex polygon, and continuous and binary kernel density estimator (KDE)) to assess the sensitivity of model outcomes to choices made. We evaluated six questions related to model results, including five beyond the common comparison of model accuracy assessment metrics (biological interpretability of response curves, cross-validation robustness, independent data accuracy and robustness, and prediction consistency). For our case study with cheatgrass in the western US, random forest was least sensitive to background choice and the binary KDE method was least sensitive to model algorithm choice. While this outcome may not hold for other locations or species, the methods we used can be implemented to help determine appropriate methodologies for particular research questions.

  14. Global patterns of tropical forest fragmentation

    NASA Astrophysics Data System (ADS)

    Taubert, Franziska; Fischer, Rico; Groeneveld, Jürgen; Lehmann, Sebastian; Müller, Michael S.; Rödig, Edna; Wiegand, Thorsten; Huth, Andreas

    2018-02-01

    Remote sensing enables the quantification of tropical deforestation with high spatial resolution. This in-depth mapping has led to substantial advances in the analysis of continent-wide fragmentation of tropical forests. Here we identified approximately 130 million forest fragments in three continents that show surprisingly similar power-law size and perimeter distributions as well as fractal dimensions. Power-law distributions have been observed in many natural phenomena such as wildfires, landslides and earthquakes. The principles of percolation theory provide one explanation for the observed patterns, and suggest that forest fragmentation is close to the critical point of percolation; simulation modelling also supports this hypothesis. The observed patterns emerge not only from random deforestation, which can be described by percolation theory, but also from a wide range of deforestation and forest-recovery regimes. Our models predict that additional forest loss will result in a large increase in the total number of forest fragments—at maximum by a factor of 33 over 50 years—as well as a decrease in their size, and that these consequences could be partly mitigated by reforestation and forest protection.

  15. Water chemistry in 179 randomly selected Swedish headwater streams related to forest production, clear-felling and climate.

    PubMed

    Löfgren, Stefan; Fröberg, Mats; Yu, Jun; Nisell, Jakob; Ranneby, Bo

    2014-12-01

    From a policy perspective, it is important to understand forestry effects on surface waters from a landscape perspective. The EU Water Framework Directive demands remedial actions if not achieving good ecological status. In Sweden, 44 % of the surface water bodies have moderate ecological status or worse. Many of these drain catchments with a mosaic of managed forests. It is important for the forestry sector and water authorities to be able to identify where, in the forested landscape, special precautions are necessary. The aim of this study was to quantify the relations between forestry parameters and headwater stream concentrations of nutrients, organic matter and acid-base chemistry. The results are put into the context of regional climate, sulphur and nitrogen deposition, as well as marine influences. Water chemistry was measured in 179 randomly selected headwater streams from two regions in southwest and central Sweden, corresponding to 10 % of the Swedish land area. Forest status was determined from satellite images and Swedish National Forest Inventory data using the probabilistic classifier method, which was used to model stream water chemistry with Bayesian model averaging. The results indicate that concentrations of e.g. nitrogen, phosphorus and organic matter are related to factors associated with forest production but that it is not forestry per se that causes the excess losses. Instead, factors simultaneously affecting forest production and stream water chemistry, such as climate, extensive soil pools and nitrogen deposition, are the most likely candidates. The relationships with clear-felled and wetland areas are likely to be direct effects.

  16. Inventory of forest resources (including water) by multi-level sampling. [nine northern Virginia coastal plain counties

    NASA Technical Reports Server (NTRS)

    Aldrich, R. C.; Dana, R. W.; Roberts, E. H. (Principal Investigator)

    1977-01-01

    The author has identified the following significant results. A stratified random sample using LANDSAT band 5 and 7 panchromatic prints resulted in estimates of water in counties with sampling errors less than ±9% (67% probability level). A forest inventory using a four band LANDSAT color composite resulted in estimates of forest area by counties that were within ±6.7% and ±3.7%, respectively (67% probability level). Estimates of forest area for counties by computer assisted techniques were within ±21% of operational forest survey figures and for all counties the difference was only one percent. Correlations of airborne terrain reflectance measurements with LANDSAT radiance verified a linear atmospheric model with an additive (path radiance) term and multiplicative (transmittance) term. Coefficients of determination for 28 of the 32 modeling attempts, not adversely affected by rain showers occurring between the times of LANDSAT passage and aircraft overflights, exceeded 0.83.

  17. Statistical-learning strategies generate only modestly performing predictive models for urinary symptoms following external beam radiotherapy of the prostate: A comparison of conventional and machine-learning methods

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Yahya, Noorazrul, E-mail: noorazrul.yahya@research.uwa.edu.au; Ebert, Martin A.; Bulsara, Max

    Purpose: Given the paucity of available data concerning radiotherapy-induced urinary toxicity, it is important to ensure derivation of the most robust models with superior predictive performance. This work explores multiple statistical-learning strategies for prediction of urinary symptoms following external beam radiotherapy of the prostate. Methods: The performance of logistic regression, elastic-net, support-vector machine, random forest, neural network, and multivariate adaptive regression splines (MARS) to predict urinary symptoms was analyzed using data from 754 participants accrued by TROG03.04-RADAR. Predictive features included dose-surface data, comorbidities, and medication-intake. Four symptoms were analyzed: dysuria, haematuria, incontinence, and frequency, each with three definitions (grade ≥ 1, grade ≥ 2 and longitudinal) with event rate between 2.3% and 76.1%. Repeated cross-validations producing matched models were implemented. A synthetic minority oversampling technique was utilized in endpoints with rare events. Parameter optimization was performed on the training data. Area under the receiver operating characteristic curve (AUROC) was used to compare performance using sample size to detect differences of ≥0.05 at the 95% confidence level. Results: Logistic regression, elastic-net, random forest, MARS, and support-vector machine were the highest-performing statistical-learning strategies in 3, 3, 3, 2, and 1 endpoints, respectively. Logistic regression, MARS, elastic-net, random forest, neural network, and support-vector machine were the best, or were not significantly worse than the best, in 7, 7, 5, 5, 3, and 1 endpoints. The best-performing statistical model was for dysuria grade ≥ 1 with AUROC ± standard deviation of 0.649 ± 0.074 using MARS. For longitudinal frequency and dysuria grade ≥ 1, all strategies produced AUROC>0.6 while all haematuria endpoints and longitudinal incontinence models produced AUROC<0.6. Conclusions: Logistic regression and MARS were most likely to be the best-performing strategy for the prediction of urinary symptoms with elastic-net and random forest producing competitive results. The predictive power of the models was modest and endpoint-dependent. New features, including spatial dose maps, may be necessary to achieve better models.

  18. Effective search for stable segregation configurations at grain boundaries with data-mining techniques

    NASA Astrophysics Data System (ADS)

    Kiyohara, Shin; Mizoguchi, Teruyasu

    2018-03-01

    Grain boundary segregation of dopants plays a crucial role in materials properties. To investigate the dopant segregation behavior at the grain boundary, an enormous number of combinations have to be considered in the segregation of multiple dopants at the complex grain boundary structures. Here, two data mining techniques, the random-forests regression and the genetic algorithm, were applied to determine stable segregation sites at grain boundaries efficiently. Using the random-forests method, a predictive model was constructed from 2% of the segregation configurations and it has been shown that this model could determine the stable segregation configurations. Furthermore, the genetic algorithm also successfully determined the most stable segregation configuration with great efficiency. We demonstrate that these approaches are quite effective to investigate the dopant segregation behaviors at grain boundaries.

  19. Integrating remotely sensed land cover observations and a biogeochemical model for estimating forest ecosystem carbon dynamics

    USGS Publications Warehouse

    Liu, J.; Liu, S.; Loveland, Thomas R.; Tieszen, L.L.

    2008-01-01

    Land cover change is one of the key driving forces for ecosystem carbon (C) dynamics. We present an approach for using sequential remotely sensed land cover observations and a biogeochemical model to estimate contemporary and future ecosystem carbon trends. We applied the General Ensemble Biogeochemical Modelling System (GEMS) for the Laurentian Plains and Hills ecoregion in the northeastern United States for the period of 1975-2025. The land cover changes, especially forest stand-replacing events, were detected on 30 randomly located 10-km by 10-km sample blocks, and were assimilated by GEMS for biogeochemical simulations. In GEMS, each unique combination of major controlling variables (including land cover change history) forms a geo-referenced simulation unit. For a forest simulation unit, a Monte Carlo process is used to determine forest type, forest age, forest biomass, and soil C, based on the Forest Inventory and Analysis (FIA) data and the U.S. General Soil Map (STATSGO) data. Ensemble simulations are performed for each simulation unit to incorporate input data uncertainty. Results show that on average forests of the Laurentian Plains and Hills ecoregion have been sequestrating 4.2 Tg C (1 teragram = 10^12 grams) per year, including 1.9 Tg C removed from the ecosystem as the consequences of land cover change. © 2008 Elsevier B.V.

  20. Reducing RANS Model Error Using Random Forest

    NASA Astrophysics Data System (ADS)

    Wang, Jian-Xun; Wu, Jin-Long; Xiao, Heng; Ling, Julia

    2016-11-01

    Reynolds-Averaged Navier-Stokes (RANS) models are still the work-horse tools in the turbulence modeling of industrial flows. However, the model discrepancy due to the inadequacy of modeled Reynolds stresses largely diminishes the reliability of simulation results. In this work we use a physics-informed machine learning approach to improve the RANS modeled Reynolds stresses and propagate them to obtain the mean velocity field. Specifically, the functional forms of Reynolds stress discrepancies with respect to mean flow features are trained based on an offline database of flows with similar characteristics. The random forest model is used to predict Reynolds stress discrepancies in new flows. Then the improved Reynolds stresses are propagated to the velocity field via RANS equations. The effects of expanding the feature space through the use of a complete basis of Galilean tensor invariants are also studied. The flow in a square duct, which is challenging for standard RANS models, is investigated to demonstrate the merit of the proposed approach. The results show that both the Reynolds stresses and the propagated velocity field are improved over the baseline RANS predictions. SAND Number: SAND2016-7437 A

  1. Complex Network Simulation of Forest Network Spatial Pattern in Pearl River Delta

    NASA Astrophysics Data System (ADS)

    Zeng, Y.

    2017-09-01

    Forest network-construction uses for the method and model with the scale-free features of complex network theory based on random graph theory and dynamic network nodes which show a power-law distribution phenomenon. The model is suitable for ecological disturbance by larger ecological landscape Pearl River Delta consistent recovery. Remote sensing and GIS spatial data are available through the latest forest patches. A standard scale-free network node distribution model calculates the area of forest network's power-law distribution parameter value size; The recent existing forest polygons which are defined as nodes can compute the network nodes decaying index value of the network's degree distribution. The parameters of forest network are picked up then make a spatial transition to GIS real world models. Hence the connection is automatically generated by minimizing the ecological corridor by the least cost rule between the near nodes. Based on scale-free network node distribution requirements, select the number compared with less, a huge point of aggregation as a future forest planning network's main node, and put them with the existing node sequence comparison. By this theory, the forest ecological projects in the past avoid being fragmented, scattered disorderly phenomena. The previous regular forest networks can be reduced the required forest planting costs by this method. For ecological restoration of tropical and subtropical in south China areas, it will provide an effective method for the forest entering city project guidance and demonstration with other ecological networks (water, climate network, etc.) for networking a standard and base datum.

  2. A Multiscale Approach Indicates a Severe Reduction in Atlantic Forest Wetlands and Highlights that São Paulo Marsh Antwren Is on the Brink of Extinction

    PubMed Central

    Del-Rio, Glaucia; Rêgo, Marco Antonio; Silveira, Luís Fábio

    2015-01-01

    Over the last 200 years the wetlands of the Upper Tietê and Upper Paraíba do Sul basins, in the southeastern Atlantic Forest, Brazil, have been almost-completely transformed by urbanization, agriculture and mining. Endemic to these river basins, the São Paulo Marsh Antwren (Formicivora paludicola) survived these impacts, but remained unknown to science until its discovery in 2005. Its population status was cause for immediate concern. In order to understand the factors imperiling the species, and provide guidelines for its conservation, we investigated both the species’ distribution and the distribution of areas of suitable habitat using a multiscale approach encompassing species distribution modeling, fieldwork surveys and occupancy models. Of six species distribution modeling methods used (Generalized Linear Models, Generalized Additive Models, Multivariate Adaptive Regression Splines, Classification Tree Analysis, Artificial Neural Networks and Random Forest), Random Forest showed the best fit and was utilized to guide field validation. After surveying 59 sites, our results indicated that Formicivora paludicola occurred in only 13 sites, having narrow habitat specificity, and restricted habitat availability. Additionally, historic maps, distribution models and satellite imagery showed that human occupation has resulted in a loss of more than 346 km² of suitable habitat for this species since the early twentieth century, so that it now only occupies a severely fragmented area (area of occupancy) of 1.42 km², and it should be considered Critically Endangered according to IUCN criteria. Furthermore, averaged occupancy models showed that marshes with lower cattail (Typha dominguensis) densities have higher probabilities of being occupied. Thus, these areas should be prioritized in future conservation efforts to protect the species, and to restore a portion of Atlantic Forest wetlands, in times of unprecedented regional water supply problems. PMID:25798608

  3. A multiscale approach indicates a severe reduction in Atlantic Forest wetlands and highlights that São Paulo Marsh Antwren is on the brink of extinction.

    PubMed

    Del-Rio, Glaucia; Rêgo, Marco Antonio; Silveira, Luís Fábio

    2015-01-01

    Over the last 200 years the wetlands of the Upper Tietê and Upper Paraíba do Sul basins, in the southeastern Atlantic Forest, Brazil, have been almost-completely transformed by urbanization, agriculture and mining. Endemic to these river basins, the São Paulo Marsh Antwren (Formicivora paludicola) survived these impacts, but remained unknown to science until its discovery in 2005. Its population status was cause for immediate concern. In order to understand the factors imperiling the species, and provide guidelines for its conservation, we investigated both the species' distribution and the distribution of areas of suitable habitat using a multiscale approach encompassing species distribution modeling, fieldwork surveys and occupancy models. Of six species distribution models methods used (Generalized Linear Models, Generalized Additive Models, Multivariate Adaptive Regression Splines, Classification Tree Analysis, Artificial Neural Networks and Random Forest), Random Forest showed the best fit and was utilized to guide field validation. After surveying 59 sites, our results indicated that Formicivora paludicola occurred in only 13 sites, having narrow habitat specificity, and restricted habitat availability. Additionally, historic maps, distribution models and satellite imagery showed that human occupation has resulted in a loss of more than 346 km2 of suitable habitat for this species since the early twentieth century, so that it now only occupies a severely fragmented area (area of occupancy) of 1.42 km2, and it should be considered Critically Endangered according to IUCN criteria. Furthermore, averaged occupancy models showed that marshes with lower cattail (Typha dominguensis) densities have higher probabilities of being occupied. Thus, these areas should be prioritized in future conservation efforts to protect the species, and to restore a portion of Atlantic Forest wetlands, in times of unprecedented regional water supply problems.

  4. Exposure assessment models for elemental components of particulate matter in an urban environment: A comparison of regression and random forest approaches

    NASA Astrophysics Data System (ADS)

    Brokamp, Cole; Jandarov, Roman; Rao, M. B.; LeMasters, Grace; Ryan, Patrick

    2017-02-01

    Exposure assessment for elemental components of particulate matter (PM) using land use modeling is a complex problem due to the high spatial and temporal variations in pollutant concentrations at the local scale. Land use regression (LUR) models may fail to capture complex interactions and non-linear relationships between pollutant concentrations and land use variables. The increasing availability of big spatial data and machine learning methods present an opportunity for improvement in PM exposure assessment models. In this manuscript, our objective was to develop a novel land use random forest (LURF) model and compare its accuracy and precision to a LUR model for elemental components of PM in the urban city of Cincinnati, Ohio. PM smaller than 2.5 μm (PM2.5) and eleven elemental components were measured at 24 sampling stations from the Cincinnati Childhood Allergy and Air Pollution Study (CCAAPS). Over 50 different predictors associated with transportation, physical features, community socioeconomic characteristics, greenspace, land cover, and emission point sources were used to construct LUR and LURF models. Cross validation was used to quantify and compare model performance. LURF and LUR models were created for aluminum (Al), copper (Cu), iron (Fe), potassium (K), manganese (Mn), nickel (Ni), lead (Pb), sulfur (S), silicon (Si), vanadium (V), zinc (Zn), and total PM2.5 in the CCAAPS study area. LURF utilized a more diverse and greater number of predictors than LUR and LURF models for Al, K, Mn, Pb, Si, Zn, TRAP, and PM2.5 all showed a decrease in fractional predictive error of at least 5% compared to their LUR models. LURF models for Al, Cu, Fe, K, Mn, Pb, Si, Zn, TRAP, and PM2.5 all had a cross validated fractional predictive error less than 30%. Furthermore, LUR models showed a differential exposure assessment bias and had a higher prediction error variance. Random forest and other machine learning methods may provide more accurate exposure assessment.

  5. Exposure assessment models for elemental components of particulate matter in an urban environment: A comparison of regression and random forest approaches.

    PubMed

    Brokamp, Cole; Jandarov, Roman; Rao, M B; LeMasters, Grace; Ryan, Patrick

    2017-02-01

    Exposure assessment for elemental components of particulate matter (PM) using land use modeling is a complex problem due to the high spatial and temporal variations in pollutant concentrations at the local scale. Land use regression (LUR) models may fail to capture complex interactions and non-linear relationships between pollutant concentrations and land use variables. The increasing availability of big spatial data and machine learning methods present an opportunity for improvement in PM exposure assessment models. In this manuscript, our objective was to develop a novel land use random forest (LURF) model and compare its accuracy and precision to a LUR model for elemental components of PM in the urban city of Cincinnati, Ohio. PM smaller than 2.5 μm (PM2.5) and eleven elemental components were measured at 24 sampling stations from the Cincinnati Childhood Allergy and Air Pollution Study (CCAAPS). Over 50 different predictors associated with transportation, physical features, community socioeconomic characteristics, greenspace, land cover, and emission point sources were used to construct LUR and LURF models. Cross validation was used to quantify and compare model performance. LURF and LUR models were created for aluminum (Al), copper (Cu), iron (Fe), potassium (K), manganese (Mn), nickel (Ni), lead (Pb), sulfur (S), silicon (Si), vanadium (V), zinc (Zn), and total PM2.5 in the CCAAPS study area. LURF utilized a more diverse and greater number of predictors than LUR and LURF models for Al, K, Mn, Pb, Si, Zn, TRAP, and PM2.5 all showed a decrease in fractional predictive error of at least 5% compared to their LUR models. LURF models for Al, Cu, Fe, K, Mn, Pb, Si, Zn, TRAP, and PM2.5 all had a cross validated fractional predictive error less than 30%. Furthermore, LUR models showed a differential exposure assessment bias and had a higher prediction error variance. Random forest and other machine learning methods may provide more accurate exposure assessment.

  6. Exposure assessment models for elemental components of particulate matter in an urban environment: A comparison of regression and random forest approaches

    PubMed Central

    Brokamp, Cole; Jandarov, Roman; Rao, M.B.; LeMasters, Grace; Ryan, Patrick

    2017-01-01

    Exposure assessment for elemental components of particulate matter (PM) using land use modeling is a complex problem due to the high spatial and temporal variations in pollutant concentrations at the local scale. Land use regression (LUR) models may fail to capture complex interactions and non-linear relationships between pollutant concentrations and land use variables. The increasing availability of big spatial data and machine learning methods present an opportunity for improvement in PM exposure assessment models. In this manuscript, our objective was to develop a novel land use random forest (LURF) model and compare its accuracy and precision to a LUR model for elemental components of PM in the urban city of Cincinnati, Ohio. PM smaller than 2.5 μm (PM2.5) and eleven elemental components were measured at 24 sampling stations from the Cincinnati Childhood Allergy and Air Pollution Study (CCAAPS). Over 50 different predictors associated with transportation, physical features, community socioeconomic characteristics, greenspace, land cover, and emission point sources were used to construct LUR and LURF models. Cross validation was used to quantify and compare model performance. LURF and LUR models were created for aluminum (Al), copper (Cu), iron (Fe), potassium (K), manganese (Mn), nickel (Ni), lead (Pb), sulfur (S), silicon (Si), vanadium (V), zinc (Zn), and total PM2.5 in the CCAAPS study area. LURF utilized a more diverse and greater number of predictors than LUR and LURF models for Al, K, Mn, Pb, Si, Zn, TRAP, and PM2.5 all showed a decrease in fractional predictive error of at least 5% compared to their LUR models. LURF models for Al, Cu, Fe, K, Mn, Pb, Si, Zn, TRAP, and PM2.5 all had a cross validated fractional predictive error less than 30%. Furthermore, LUR models showed a differential exposure assessment bias and had a higher prediction error variance. Random forest and other machine learning methods may provide more accurate exposure assessment. 
    PMID:28959135

  7. A Random Forest Approach to Predict the Spatial Distribution ...

    EPA Pesticide Factsheets

    Modeling the magnitude and distribution of sediment-bound pollutants in estuaries is often limited by incomplete knowledge of the site and inadequate sample density. To address these modeling limitations, a decision-support tool framework was conceived that predicts sediment contamination from the sub-estuary to broader estuary extent. For this study, a Random Forest (RF) model was implemented to predict the distribution of a model contaminant, triclosan (5-chloro-2-(2,4-dichlorophenoxy)phenol) (TCS), in Narragansett Bay, Rhode Island, USA. TCS is an unregulated contaminant used in many personal care products. The RF explanatory variables were associated with TCS transport and fate (proxies) and direct and indirect environmental entry. The continuous RF TCS concentration predictions were discretized into three levels of contamination (low, medium, and high) for three different quantile thresholds. The RF model explained 63% of the variance with a minimum number of variables. Total organic carbon (TOC) (transport and fate proxy) was a strong predictor of TCS contamination causing a mean squared error increase of 59% when compared to permutations of randomized values of TOC. Additionally, combined sewer overflow discharge (environmental entry) and sand (transport and fate proxy) were strong predictors. The discretization models identified a TCS area of greatest concern in the northern reach of Narragansett Bay (Providence River sub-estuary), which was validated wi

  8. Non-random species loss in a forest herbaceous layer following nitrogen addition

    Treesearch

    Christopher A. Walter; Mary Beth Adams; Frank S. Gilliam; William T. Peterjohn

    2017-01-01

    Nitrogen (N) additions have decreased species richness (S) in hardwood forest herbaceous layers, yet the functional mechanisms for these decreases have not been explicitly evaluated. We tested two hypothesized mechanisms, random species loss (RSL) and non-random species loss (NRSL), in the hardwood forest herbaceous layer of a long-term, plot-scale...

  9. Idaho forest carbon projections from 2017 to 2117 under forest disturbance and climate change scenarios

    NASA Astrophysics Data System (ADS)

    Hudak, A. T.; Crookston, N.; Kennedy, R. E.; Domke, G. M.; Fekety, P.; Falkowski, M. J.

    2017-12-01

    Commercial off-the-shelf lidar collections associated with tree measures in field plots allow aboveground biomass (AGB) estimation with high confidence. Predictive models developed from such datasets are used operationally to map AGB across lidar project areas. We use a random selection of these pixel-level AGB predictions as training for predicting AGB annually across Idaho and western Montana, primarily from Landsat time series imagery processed through LandTrendr. At both the landscape and regional scales, Random Forests is used for predictive AGB modeling. To project future carbon dynamics, we use Climate-FVS (Forest Vegetation Simulator), the tree growth engine used by foresters to inform forest planning decisions, under either constant or changing climate scenarios. Disturbance data compiled from LandTrendr (Kennedy et al. 2010) using TimeSync (Cohen et al. 2010) in forested lands of Idaho (n=509) and western Montana (n=288) are used to generate probabilities of disturbance (harvest, fire, or insect) by land ownership class (public, private) as well as the magnitude of disturbance. Our verification approach is to aggregate the regional, annual AGB predictions at the county level and compare them to annual county-level AGB summarized independently from systematic, field-based, annual inventories conducted by the US Forest Inventory and Analysis (FIA) Program nationally. This analysis shows that when federal lands are disturbed the magnitude is generally high and when other lands are disturbed the magnitudes are more moderate. The probability of disturbance in corporate lands is higher than in other lands but the magnitudes are generally lower. This is consistent with the much higher prevalence of fire and insects occurring on federal lands, and greater harvest activity on private lands. We found large forest carbon losses in drier southern Idaho, only partially offset by carbon gains in wetter northern Idaho, due to anticipated climate change. 
    Public and private forest managers can use these forest carbon projections to 2117 to inform 2017 decisions on which tree species and seed sources to select for planting, and implement forest management strategies now that may seek to maximize forest carbon sequestration for greenhouse gas abatement a century from now.

  10. Analysis of Machine Learning Techniques for Heart Failure Readmissions.

    PubMed

    Mortazavi, Bobak J; Downing, Nicholas S; Bucholz, Emily M; Dharmarajan, Kumar; Manhapra, Ajay; Li, Shu-Xia; Negahban, Sahand N; Krumholz, Harlan M

    2016-11-01

    The current ability to predict readmissions in patients with heart failure is modest at best. It is unclear whether machine learning techniques that address higher dimensional, nonlinear relationships among variables would enhance prediction. We sought to compare the effectiveness of several machine learning algorithms for predicting readmissions. Using data from the Telemonitoring to Improve Heart Failure Outcomes trial, we compared the effectiveness of random forests, boosting, random forests combined hierarchically with support vector machines or logistic regression (LR), and Poisson regression against traditional LR to predict 30- and 180-day all-cause readmissions and readmissions because of heart failure. We randomly selected 50% of patients for a derivation set, and a validation set comprised the remaining patients, validated using 100 bootstrapped iterations. We compared C statistics for discrimination and distributions of observed outcomes in risk deciles for predictive range. In 30-day all-cause readmission prediction, the best performing machine learning model, random forests, provided a 17.8% improvement over LR (mean C statistics, 0.628 and 0.533, respectively). For readmissions because of heart failure, boosting improved the C statistic by 24.9% over LR (mean C statistic 0.678 and 0.543, respectively). For 30-day all-cause readmission, the observed readmission rates in the lowest and highest deciles of predicted risk with random forests (7.8% and 26.2%, respectively) showed a much wider separation than LR (14.2% and 16.4%, respectively). Machine learning methods improved the prediction of readmission after hospitalization for heart failure compared with LR and provided the greatest predictive range in observed readmission rates. © 2016 American Heart Association, Inc.
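
    As an arithmetic check on the figures above: the quoted percentage improvements appear to be relative differences in the mean C statistic. The sketch below assumes that reading; the helper name `relative_improvement` is hypothetical and not taken from the study.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Percent improvement of `new` over `baseline` (relative difference)."""
    return 100.0 * (new - baseline) / baseline


# Mean C statistics quoted in the abstract.
rf_vs_lr = relative_improvement(0.628, 0.533)     # 30-day all-cause: random forests vs LR
boost_vs_lr = relative_improvement(0.678, 0.543)  # heart-failure readmission: boosting vs LR

print(f"{rf_vs_lr:.1f}%")     # prints 17.8%
print(f"{boost_vs_lr:.1f}%")  # prints 24.9%
```

    Both values agree with the 17.8% and 24.9% reported in the abstract, which suggests the improvements are relative rather than absolute differences.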

  11. Birth-jump processes and application to forest fire spotting.

    PubMed

    Hillen, T; Greese, B; Martin, J; de Vries, G

    2015-01-01

    Birth-jump models are designed to describe population models for which growth and spatial spread cannot be decoupled. A birth-jump model is a nonlinear integro-differential equation. We present two different derivations of this equation, one based on a random walk approach and the other based on a two-compartmental reaction-diffusion model. In the case that the redistribution kernels are highly concentrated, we show that the integro-differential equation can be approximated by a reaction-diffusion equation, in which the proliferation rate contributes to both the diffusion term and the reaction term. We completely solve the corresponding critical domain size problem and the minimal wave speed problem. Birth-jump models can be applied in many areas in mathematical biology. We highlight an application of our results in the context of forest fire spread through spotting. We show that spotting increases the invasion speed of a forest fire front.

  12. Red-shouldered hawk nesting habitat preference in south Texas

    USGS Publications Warehouse

    Strobel, Bradley N.; Boal, Clint W.

    2010-01-01

    We examined nesting habitat preference by red-shouldered hawks Buteo lineatus using conditional logistic regression on characteristics measured at 27 occupied nest sites and 68 unused sites in 2005–2009 in south Texas. We measured vegetation characteristics of individual trees (nest trees and unused trees) and corresponding 0.04-ha plots. We evaluated the importance of tree and plot characteristics to nesting habitat selection by comparing a priori tree-specific and plot-specific models using Akaike's information criterion. Models with only plot variables carried 14% more weight than models with only center tree variables. The model-averaged odds ratios indicated red-shouldered hawks selected to nest in taller trees and in areas with higher average diameter at breast height than randomly available within the forest stand. Relative to randomly selected areas, each 1-m increase in nest tree height and 1-cm increase in the plot average diameter at breast height increased the probability of selection by 85% and 10%, respectively. Our results indicate that red-shouldered hawks select nesting habitat based on vegetation characteristics of individual trees as well as the 0.04-ha area surrounding the tree. Our results indicate forest management practices resulting in tall forest stands with large average diameter at breast height would benefit red-shouldered hawks in south Texas.

  13. Land cover and land use mapping of the iSimangaliso Wetland Park, South Africa: comparison of oblique and orthogonal random forest algorithms

    NASA Astrophysics Data System (ADS)

    Bassa, Zaakirah; Bob, Urmilla; Szantoi, Zoltan; Ismail, Riyad

    2016-01-01

    In recent years, the popularity of tree-based ensemble methods for land cover classification has increased significantly. Using WorldView-2 image data, we evaluate the potential of the oblique random forest algorithm (oRF) to classify a highly heterogeneous protected area. In contrast to the random forest (RF) algorithm, the oRF algorithm builds multivariate trees by learning the optimal split using a supervised model. The oRF binary algorithm is adapted to a multiclass land cover and land use application using both the "one-against-one" and "one-against-all" combination approaches. Results show that the oRF algorithms are capable of achieving high classification accuracies (>80%). However, there was no statistical difference in classification accuracies obtained by the oRF algorithms and the more popular RF algorithm. For all the algorithms, user accuracies (UAs) and producer accuracies (PAs) >80% were recorded for most of the classes. Both the RF and oRF algorithms poorly classified the indigenous forest class as indicated by the low UAs and PAs. Finally, the results from this study advocate and support the utility of the oRF algorithm for land cover and land use mapping of protected areas using WorldView-2 image data.
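
    The two combination approaches named above differ in how many binary classifiers they require. The following is a minimal sketch of the decomposition only, not the authors' implementation; the class names and function names are hypothetical.

```python
from itertools import combinations


def one_against_all(classes):
    """One binary problem per class: that class versus all remaining classes."""
    return [(c, [other for other in classes if other != c]) for c in classes]


def one_against_one(classes):
    """One binary problem per unordered pair of classes."""
    return list(combinations(classes, 2))


# Hypothetical land cover classes, for illustration only.
classes = ["water", "grassland", "wetland", "plantation", "indigenous forest"]

ova = one_against_all(classes)
ovo = one_against_one(classes)

print(len(ova))  # prints 5  (k binary classifiers)
print(len(ovo))  # prints 10 (k * (k - 1) / 2 binary classifiers)
```

    With k classes, one-against-all trains k binary models on the full data set, while one-against-one trains k(k-1)/2 models, each on only the two classes involved.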

  14. Modeling change in potential landscape vulnerability to forest insect and pathogen disturbances: methods for forested subwatersheds sampled in the midscale interior Columbia River basin assessment.

    Treesearch

    Paul F. Hessburg; Bradley G. Smith; Craig A. Miller; Scott D. Kreiter; R. Brion Salter

    1999-01-01

    In the interior Columbia River basin midscale ecological assessment, including portions of the Klamath and Great Basins, we mapped and characterized historical and current vegetation composition and structure of 337 randomly sampled subwatersheds (9500 ha average size) in 43 subbasins (404 000 ha average size). We compared landscape patterns, vegetation structure and...

  15. Gray level co-occurrence and random forest algorithm-based gender determination with maxillary tooth plaster images.

    PubMed

    Akkoç, Betül; Arslan, Ahmet; Kök, Hatice

    2016-06-01

    Gender is one of the intrinsic properties of identity, with performance enhancement reducing the cluster when a search is performed. Teeth have durable and resistant structure, and as such are important sources of identification in disasters (accident, fire, etc.). In this study, gender determination is accomplished by maxillary tooth plaster models of 40 people (20 males and 20 females). The images of tooth plaster models are taken with a lighting mechanism set-up. A gray level co-occurrence matrix of the image with segmentation is formed and classified via a Random Forest (RF) algorithm by extracting pertinent features of the matrix. Automatic gender determination has a 90% success rate, with an applicable system to determine gender from maxillary tooth plaster images. Copyright © 2016 Elsevier Ltd. All rights reserved.
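
    For readers unfamiliar with the gray level co-occurrence matrix (GLCM), the sketch below counts how often gray level i is followed by gray level j at a fixed pixel offset and derives one common texture feature (contrast). It is an illustration under stated assumptions: the 4x4 image, the horizontal offset, and the choice of feature are hypothetical, and the abstract does not say which offsets or features the authors used.

```python
def glcm(image, levels, dx=1, dy=0):
    """Count co-occurrences of gray levels (i, j) at pixel offset (dx, dy)."""
    matrix = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                matrix[image[r][c]][image[r2][c2]] += 1
    return matrix


def contrast(matrix):
    """Contrast feature: sum of (i - j)^2 weighted by normalized counts."""
    total = sum(sum(row) for row in matrix)
    weighted = sum(
        (i - j) ** 2 * count
        for i, row in enumerate(matrix)
        for j, count in enumerate(row)
    )
    return weighted / total


# Hypothetical 4x4 image with four gray levels (0..3).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]

counts = glcm(image, levels=4)
for row in counts:
    print(row)
print(round(contrast(counts), 4))
```

    Features such as this one, computed per image, would form the input vector passed to a classifier such as a random forest.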

  16. Intelligent Fault Diagnosis of HVCB with Feature Space Optimization-Based Random Forest

    PubMed Central

    Ma, Suliang; Wu, Jianwen; Wang, Yuhao; Jia, Bowen; Jiang, Yuan

    2018-01-01

    Mechanical faults of high-voltage circuit breakers (HVCBs) always happen over long-term operation, so extracting the fault features and identifying the fault type have become a key issue for ensuring the security and reliability of power supply. Based on wavelet packet decomposition technology and random forest algorithm, an effective identification system was developed in this paper. First, compared with the incomplete description of Shannon entropy, the wavelet packet time-frequency energy rate (WTFER) was adopted as the input vector for the classifier model in the feature selection procedure. Then, a random forest classifier was used to diagnose the HVCB fault, assess the importance of the feature variable and optimize the feature space. Finally, the approach was verified based on actual HVCB vibration signals by considering six typical fault classes. The comparative experiment results show that the classification accuracy of the proposed method with the origin feature space reached 93.33% and reached up to 95.56% with optimized input feature vector of classifier. This indicates that feature optimization procedure is successful, and the proposed diagnosis algorithm has higher efficiency and robustness than traditional methods. PMID:29659548

  17. Spectral Analysis of Ultrasound Radiofrequency Backscatter for the Detection of Intercostal Blood Vessels.

    PubMed

    Klingensmith, Jon D; Haggard, Asher; Fedewa, Russell J; Qiang, Beidi; Cummings, Kenneth; DeGrande, Sean; Vince, D Geoffrey; Elsharkawy, Hesham

    2018-04-19

    Spectral analysis of ultrasound radiofrequency backscatter has the potential to identify intercostal blood vessels during ultrasound-guided placement of paravertebral nerve blocks and intercostal nerve blocks. Autoregressive models were used for spectral estimation, and bandwidth, autoregressive order and region-of-interest size were evaluated. Eight spectral parameters were calculated and used to create random forests. An autoregressive order of 10, bandwidth of 6 dB and region-of-interest size of 1.0 mm resulted in the minimum out-of-bag error. An additional random forest, using these chosen values, was created from 70% of the data and evaluated independently from the remaining 30% of data. The random forest achieved a predictive accuracy of 92% and Youden's index of 0.85. These results suggest that spectral analysis of ultrasound radiofrequency backscatter has the potential to identify intercostal blood vessels. Copyright © 2018 World Federation for Ultrasound in Medicine and Biology. Published by Elsevier Inc. All rights reserved.
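
    Youden's index, reported above alongside accuracy, is defined as sensitivity + specificity - 1. The sketch below shows the calculation from a confusion matrix; the counts are hypothetical and are not the study's data.

```python
def youden_index(tp: int, fn: int, tn: int, fp: int) -> float:
    """Youden's J = sensitivity + specificity - 1."""
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity + specificity - 1.0


def accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Fraction of all cases classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


# Hypothetical held-out test counts, for illustration only.
j = youden_index(tp=40, fn=10, tn=45, fp=5)
acc = accuracy(tp=40, fn=10, tn=45, fp=5)

print(f"J = {j:.2f}, accuracy = {acc:.2f}")  # prints J = 0.70, accuracy = 0.85
```

    Unlike accuracy, J does not depend on the ratio of positive to negative cases, which is one reason the two are often reported together.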

  18. RandomForest4Life: a Random Forest for predicting ALS disease progression.

    PubMed

    Hothorn, Torsten; Jung, Hans H

    2014-09-01

    We describe a method for predicting disease progression in amyotrophic lateral sclerosis (ALS) patients. The method was developed as a submission to the DREAM Phil Bowen ALS Prediction Prize4Life Challenge of summer 2012. Based on repeated patient examinations over a three- month period, we used a random forest algorithm to predict future disease progression. The procedure was set up and internally evaluated using data from 1197 ALS patients. External validation by an expert jury was based on undisclosed information of an additional 625 patients; all patient data were obtained from the PRO-ACT database. In terms of prediction accuracy, the approach described here ranked third best. Our interpretation of the prediction model confirmed previous reports suggesting that past disease progression is a strong predictor of future disease progression measured on the ALS functional rating scale (ALSFRS). We also found that larger variability in initial ALSFRS scores is linked to faster future disease progression. The results reported here furthermore suggested that approaches taking the multidimensionality of the ALSFRS into account promise some potential for improved ALS disease prediction.

  19. [Functional diversity characteristics of canopy tree species of Jianfengling tropical montane rainforest on Hainan Island, China].

    PubMed

    Xu, Ge Xi; Shi, Zuo Min; Tang, Jing Chao; Liu, Shun; Ma, Fan Qiang; Xu, Han; Liu, Shi Rong; Li, Yi de

    2016-11-18

    Based on three 1-hm² plots of Jianfengling tropical montane rainforest on Hainan Island, 11 commonly used functional traits of canopy trees were measured. After combining these with topographical factors and tree census data from the three plots, we compared the impacts of weighted species abundance on two functional dispersion indices, mean pairwise distance (MPD) and mean nearest taxon distance (MNTD), using single- and multi-dimensional traits, respectively. The relationship between functional richness of the forest canopies and species abundance was analyzed. We used a null model approach to explore the variations in standardized effect sizes of MPD and MNTD, which were weighted by species abundance and eliminated the influences of species richness differences among communities, and assessed functional diversity patterns of the forest canopies and their responses to local habitat heterogeneity at the community level. The results showed that variation in MPD was greatly dependent on the dimensionality of functional traits as well as species abundance. The correlations between weighted and non-weighted MPD based on different dimensional traits were relatively weak (R=0.359-0.628). In contrast, functional traits and species abundance had relatively weak effects on MNTD, which brought stronger correlations between weighted and non-weighted MNTD based on different dimensional traits (R=0.746-0.820). Functional dispersion of the forest canopies was generally overestimated when using non-weighted MPD and MNTD. Functional richness of the forest canopies showed an exponential relationship with species abundance (F=128.20; R²=0.632; AIC=97.72; P<0.001), which suggests that a species abundance threshold value may exist. Patterns of functional diversity of the forest canopies based on different dimensional functional traits and their habitat responses showed some degree of variation.
    Forest canopies in the valley usually had relatively stronger biological competition, and functional diversity was higher than the functional diversity expected under null-model randomization, which indicated dispersed distribution of functional traits among canopy tree species in this habitat. However, the functional diversity of the forest canopies tended to be close to or lower than the randomized expectation in the other habitat types, which demonstrated random or clustered distribution of the functional traits among canopy tree species.
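
    The two dispersion indices discussed above can be illustrated with a small calculation. This is a sketch of the non-weighted forms only, assuming Euclidean distance in trait space; the abundance weighting and null-model standardization described in the abstract are not reproduced, and the species names and trait values are hypothetical.

```python
from itertools import combinations
from math import dist


def mpd(traits):
    """Mean pairwise distance: average distance over all distinct species pairs."""
    pairs = list(combinations(traits.values(), 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


def mntd(traits):
    """Mean nearest taxon distance: average distance from each species
    to its closest other species in trait space."""
    nearest = [
        min(dist(traits[s], traits[o]) for o in traits if o != s)
        for s in traits
    ]
    return sum(nearest) / len(nearest)


# Hypothetical two-trait values for three species.
traits = {
    "sp_a": (1.0, 0.0),
    "sp_b": (2.0, 0.0),
    "sp_c": (4.0, 0.0),
}

print(mpd(traits))             # prints 2.0
print(round(mntd(traits), 3))  # prints 1.333
```

    MPD reflects the overall spread of all species in trait space, whereas MNTD responds only to each species' closest neighbor, which is consistent with the two indices behaving differently in the results above.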

  20. Estimation of retinal vessel caliber using model fitting and random forests

    NASA Astrophysics Data System (ADS)

    Araújo, Teresa; Mendonça, Ana Maria; Campilho, Aurélio

    2017-03-01

    Retinal vessel caliber changes are associated with several major diseases, such as diabetes and hypertension. These caliber changes can be evaluated using eye fundus images. However, the clinical assessment is tiresome and prone to errors, motivating the development of automatic methods. An automatic method based on vessel cross-section intensity profile model fitting for the estimation of vessel caliber in retinal images is herein proposed. First, vessels are segmented from the image, vessel centerlines are detected and individual segments are extracted and smoothed. Intensity profiles are extracted perpendicularly to the vessel, and the profile lengths are determined. Then, model fitting is applied to the smoothed profiles. A novel parametric model (DoG-L7) is used, consisting of a Difference-of-Gaussians multiplied by a line, which is able to describe profile asymmetry. Finally, the parameters of the best-fit model are used for determining the vessel width through regression using ensembles of bagged regression trees with random sampling of the predictors (random forests). The method is evaluated on the REVIEW public dataset. A precision close to that of the observers is achieved, outperforming other state-of-the-art methods. The method is robust and reliable for width estimation in images with pathologies and artifacts, with performance independent of the range of diameters.

  1. Modeling the Emergent Impacts of Harvesting Acadian Forests over 100+ Years

    NASA Astrophysics Data System (ADS)

    Luus, K. A.; Plug, L. J.

    2007-12-01

    Harvesting strategies and policies for Acadian forest in Nova Scotia, Canada, presently are set using Decision Support Models (DSMs) that aim to maximize the long-term (>100y) value of forests through decisions implemented over short time horizons (5-80 years). However, DSMs typically are aspatial, lack ecological processes and do not treat erosion, so the long-term (>100y) emergent impacts of the prescribed forestry decisions on erosion and vegetation in Acadian forests remain poorly known. To better understand these impacts, we created an equation-based model that simulates the evolution of a ≥4 km2 forest in time steps of 1 y and at a spatial resolution of 3 m2, the footprint of a single mature tree. The model combines 1) ecological processes of recruitment, competition, and mortality; 2) geomorphic processes of hillslope erosion; 3) anthropic processes of tree harvesting, replanting, and road construction under constraints imposed by regulations and cost/benefit ratio. The model uses digital elevation models, parameters (where available), and calibration (where measurements are not available) for conditions presently found in central Cape Breton, Nova Scotia. The model is unique because it 1) deals with the impacts of harvesting on an Acadian forest; and 2) vegetation and erosion are coupled. The model was tested by comparing the species-specific biomass of long-term (40 y) forest plot data to simulated results. At the spatial scale of individual 1 ha plots, model predictions presently account for approximately 50% of observed biomass changes through time, but predictions are hampered by the effects of serendipitous "random" events such as single tree windfall. Harvesting increases the cumulative erosion over 3000 years by 240% when compared to an old growth forest and significantly suppresses the growth of Balsam Fir and Sugar Maple. 
We discuss further tests of the model, and how it might be used to investigate the long-term sustainability of the recommendations made by DSMs and to better understand the relationship between vegetation, erosion, and forest management strategies.

  2. The experimental design of the Missouri Ozark Forest Ecosystem Project

    Treesearch

    Steven L. Sheriff; Shuoqiong He

    1997-01-01

    The Missouri Ozark Forest Ecosystem Project (MOFEP) is an experiment that examines the effects of three forest management practices on the forest community. MOFEP is designed as a randomized complete block design using nine sites divided into three blocks. Treatments of uneven-aged, even-aged, and no-harvest management were randomly assigned to sites within each block...
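
    The blocked random assignment described in this abstract can be sketched as follows. This is a minimal illustration, not MOFEP's actual procedure: the site labels, the block membership, and the seed are hypothetical, and only the layout (nine sites, three blocks, three treatments) is taken from the abstract.

```python
import random

TREATMENTS = ["uneven-aged", "even-aged", "no-harvest"]

# Hypothetical site labels: nine sites split into three blocks of three.
BLOCKS = {
    "block 1": ["site 1", "site 2", "site 3"],
    "block 2": ["site 4", "site 5", "site 6"],
    "block 3": ["site 7", "site 8", "site 9"],
}


def assign_rcbd(blocks, treatments, seed=None):
    """Randomly assign every treatment exactly once within each block."""
    rng = random.Random(seed)
    assignment = {}
    for block_name, sites in blocks.items():
        if len(sites) != len(treatments):
            raise ValueError(f"{block_name} must contain {len(treatments)} sites")
        shuffled = list(treatments)
        rng.shuffle(shuffled)  # independent randomization per block
        for site, treatment in zip(sites, shuffled):
            assignment[site] = (block_name, treatment)
    return assignment


if __name__ == "__main__":
    for site, (block, treatment) in assign_rcbd(BLOCKS, TREATMENTS, seed=0).items():
        print(site, block, treatment)
```

    Randomizing within blocks, rather than across all nine sites at once, guarantees that every block receives all three treatments, so block-to-block differences do not get confounded with treatment effects.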

  3. The feasibility of using a universal Random Forest model to map tree height across different locations and vegetation types

    NASA Astrophysics Data System (ADS)

    Su, Y.; Guo, Q.; Jin, S.; Gao, S.; Hu, T.; Liu, J.; Xue, B. L.

    2017-12-01

    Tree height is an important forest structure parameter for understanding forest ecosystems and improving the accuracy of global carbon stock quantification. Light detection and ranging (LiDAR) can provide accurate tree height measurements, but its use in large-scale tree height mapping is limited by its spatial availability. Random Forest (RF) has been one of the most commonly used algorithms for mapping large-scale tree height through the fusion of LiDAR and other remotely sensed datasets. However, how the variances in vegetation types, geolocations and spatial scales of different study sites influence the RF results is still a question that needs to be addressed. In this study, we selected 16 study sites across four vegetation types in the United States (U.S.) fully covered by airborne LiDAR data, and the area of each site was 100 km². The LiDAR-derived canopy height models (CHMs) were used as the ground truth to train the RF algorithm to predict canopy height from other remotely sensed variables, such as Landsat TM imagery, terrain information and climate surfaces. To address the abovementioned question, 22 models were run under different combinations of vegetation types, geolocations and spatial scales. The results show that the RF model trained at one specific location or vegetation type cannot be used to predict tree height in other locations or vegetation types. However, by training the RF model using samples from all locations and vegetation types, a universal model can be achieved for predicting canopy height across different locations and vegetation types. Moreover, the number of training samples and the targeted spatial resolution of the canopy height product have a noticeable influence on the RF prediction accuracy.

  4. Random Bits Forest: a Strong Classifier/Regressor for Big Data

    NASA Astrophysics Data System (ADS)

    Wang, Yi; Li, Yi; Pu, Weilin; Wen, Kathryn; Shugart, Yin Yao; Xiong, Momiao; Jin, Li

    2016-07-01

    Efficiency, memory consumption, and robustness are common problems with many popular methods for data analysis. As a solution, we present Random Bits Forest (RBF), a classification and regression algorithm that integrates neural networks (for depth), boosting (for width), and random forests (for prediction accuracy). Through a gradient boosting scheme, it first generates and selects ~10,000 small, 3-layer random neural networks. These networks are then fed into a modified random forest algorithm to obtain predictions. Testing with datasets from the UCI (University of California, Irvine) Machine Learning Repository shows that RBF outperforms other popular methods in both accuracy and robustness, especially with large datasets (N > 1000). The algorithm also performed well in testing with an independent data set, a real psoriasis genome-wide association study (GWAS).

  5. Prognostic Factors for Survival in Patients with Gastric Cancer using a Random Survival Forest

    PubMed

    Adham, Davoud; Abbasgholizadeh, Nategh; Abazari, Malek

    2017-01-01

    Background: Gastric cancer is the fifth most common cancer and the third leading cause of cancer-related death, with about 1 million new cases and 700,000 deaths in 2012. The aim of this investigation was to identify important factors for outcome using a random survival forest (RSF) approach. Materials and Methods: Data were collected from 128 gastric cancer patients through a historical cohort study in Hamedan, Iran, from 2007 to 2013. The event under consideration was death due to gastric cancer. The random survival forest model in R software was applied to determine the key factors affecting survival. Four split criteria were used to determine importance of the variables in the model: log-rank, conservation of events, log-rank score, and randomization. Efficiency of the model was confirmed in terms of Harrell’s concordance index. Results: The mean age at diagnosis was 63 ± 12.57 and mean and median survival times were 15.2 (95%CI: 13.3, 17.0) and 12.3 (95%CI: 11.0, 13.4) months, respectively. The one-year, two-year, and three-year rates for survival were 51%, 13%, and 5%, respectively. Each RSF approach showed a slightly different ranking order. Very important covariates in nearly all the 4 RSF approaches were metastatic status, age at diagnosis and tumor size. The performance of each RSF approach was in the range of 0.29-0.32 and the best error rate was obtained by the log-rank splitting rule; second, third, and fourth ranks were log-rank score, conservation of events, and the random splitting rule, respectively. Conclusion: Low survival rate of gastric cancer patients is an indication of absence of a screening program for early diagnosis of the disease. Timely diagnosis in early phases increases survival and decreases mortality.
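
    Harrell's concordance index, the evaluation measure named in this abstract, can be sketched in a few lines. This is an illustrative implementation of one common variant (pairs are comparable when the earlier time is an observed event, and tied risk scores count as half), not the code used in the study; the toy follow-up data are hypothetical.

```python
def harrell_c_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored survival data.

    times       -- observed follow-up time per subject
    events      -- 1 if the event (death) was observed, 0 if censored
    risk_scores -- model output; a higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # a censored subject cannot be the "earlier" member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0  # earlier death had the higher risk score
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: four patients, the third one censored at 12 months.
times = [5, 8, 12, 20]
events = [1, 1, 0, 1]
risk = [0.9, 0.4, 0.5, 0.1]
c = harrell_c_index(times, events, risk)
print(f"C-index = {c:.2f}, error rate = {1 - c:.2f}")
```

    A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. If the 0.29-0.32 figures in the abstract are error rates defined as 1 - C (an assumption; the abstract does not say), they would correspond to concordance values of roughly 0.68-0.71.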

  6. Mapping the montane cloud forest of Taiwan using 12 year MODIS-derived ground fog frequency data.

    PubMed

    Schulz, Hans Martin; Li, Ching-Feng; Thies, Boris; Chang, Shih-Chieh; Bendix, Jörg

    2017-01-01

    Up until now montane cloud forest (MCF) in Taiwan has only been mapped for selected areas of vegetation plots. This paper presents the first comprehensive map of MCF distribution for the entire island. For its creation, a Random Forest model was trained with vegetation plots from the National Vegetation Database of Taiwan that were classified as "MCF" or "non-MCF". This model predicted the distribution of MCF from a raster data set of parameters derived from a digital elevation model (DEM), Landsat channels and texture measures derived from them as well as ground fog frequency data derived from the Moderate Resolution Imaging Spectroradiometer. While the DEM parameters and Landsat data predicted much of the cloud forest's location, local deviations in the altitudinal distribution of MCF linked to the monsoonal influence as well as the Massenerhebung effect (causing MCF in atypically low altitudes) were only captured once fog frequency data was included. Therefore, our study suggests that ground fog data are most useful for accurately mapping MCF.

  7. Applying genetic algorithms to set the optimal combination of forest fire related variables and model forest fire susceptibility based on data mining models. The case of Dayu County, China.

    PubMed

    Hong, Haoyuan; Tsangaratos, Paraskevas; Ilia, Ioanna; Liu, Junzhi; Zhu, A-Xing; Xu, Chong

    2018-07-15

    The main objective of the present study was to utilize Genetic Algorithms (GA) in order to obtain the optimal combination of forest fire related variables and apply data mining methods for constructing a forest fire susceptibility map. In the proposed approach, a Random Forest (RF) and a Support Vector Machine (SVM) were used to produce a forest fire susceptibility map for Dayu County, located in the southwest of Jiangxi Province, China. For this purpose, historic forest fires and thirteen forest fire related variables were analyzed, namely: elevation, slope angle, aspect, curvature, land use, soil cover, heat load index, normalized difference vegetation index, mean annual temperature, mean annual wind speed, mean annual rainfall, distance to river network and distance to road network. The Natural Break and the Certainty Factor methods were used to classify and weight the thirteen variables, while a multicollinearity analysis was performed to determine the correlation among the variables and decide on their usability. The optimal set of variables determined by the GA reduced the number of variables to eight, excluding aspect, land use, heat load index, distance to river network and mean annual rainfall from the analysis. The performance of the forest fire models was evaluated by using the area under the Receiver Operating Characteristic curve (ROC-AUC) based on the validation dataset. Overall, the RF models gave higher AUC values. The results also showed that the proposed optimized models outperform the original models. Specifically, the optimized RF model gave the best results (0.8495), followed by the original RF (0.8169), while the optimized SVM gave a lower value (0.7456) than the RF models, though higher than the original SVM (0.7148) model. The study highlights the significance of feature selection techniques in forest fire susceptibility, while data mining methods could be considered a valid approach for forest fire susceptibility modeling.
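
    The ROC-AUC metric used to rank the models in this abstract can be computed directly from its pairwise definition: the probability that a randomly chosen positive sample (here, a fire location) receives a higher score than a randomly chosen negative one. The sketch below is a plain-Python illustration with hypothetical labels and scores; it does not reproduce the study's data or its reported AUC values.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) formulation.

    labels -- 1 for the positive class (e.g. fire), 0 for the negative class
    scores -- predicted susceptibility; higher means more likely positive
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0  # positive ranked above negative
            elif p == q:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical validation samples: three fire and three non-fire locations.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.7, 0.6, 0.2, 0.4, 0.5]
print(round(roc_auc(labels, scores), 4))
```

    The double loop is quadratic in the number of samples, which is fine for illustration; a rank-based formulation gives the same value more efficiently on large validation sets.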

  8. Projections of suitable habitat for rare species under global warming scenarios

    Treesearch

    F. Thomas Ledig; Gerald E. Rehfeldt; Cuauhtemoc Saenz-Romero; Flores-Lopez Celestino

    2010-01-01

    Premise of the study: Modeling the contemporary and future climate niche for rare plants is a major hurdle in conservation, yet such projections are necessary to prevent extinctions that may result from climate change. Methods: We used recently developed spline climatic models and modified Random Forests...

  9. A Random Forest Approach to Predict the Spatial Distribution of Sediment Pollution in an Estuarine System

    EPA Science Inventory

    Modeling the magnitude and distribution of sediment-bound pollutants in estuaries is often limited by incomplete knowledge of the site and inadequate sample density. To address these modeling limitations, a decision-support tool framework was conceived that predicts sediment cont...

  10. Gradient modeling of conifer species using random forests

    Treesearch

    Jeffrey S. Evans; Samuel A. Cushman

    2009-01-01

    Landscape ecology often adopts a patch mosaic model of ecological patterns. However, many ecological attributes are inherently continuous and classification of species composition into vegetation communities and discrete patches provides an overly simplistic view of the landscape. If one adopts a niche-based, individualistic concept of biotic communities then it may...

  11. Forest Structure Characterization Using JPL's UAVSAR Multi-Baseline Polarimetric SAR Interferometry and Tomography

    NASA Technical Reports Server (NTRS)

    Neumann, Maxim; Hensley, Scott; Lavalle, Marco; Ahmed, Razi

    2013-01-01

    This paper concerns forest remote sensing using JPL's multi-baseline polarimetric interferometric UAVSAR data. It presents exemplary results and analyzes the possibilities and limitations of using SAR Tomography and Polarimetric SAR Interferometry (PolInSAR) techniques for the estimation of forest structure. Performance and error indicators for the applicability and reliability of the used multi-baseline (MB) multi-temporal (MT) PolInSAR random volume over ground (RVoG) model are discussed. Experimental results are presented based on JPL's L-band repeat-pass polarimetric interferometric UAVSAR data over temperate and tropical forest biomes in the Harvard Forest, Massachusetts, and in the La Amistad Park, Panama and Costa Rica. The results are partially compared with ground field measurements and with air-borne LVIS lidar data.

  13. A machine learning calibration model using random forests to improve sensor performance for lower-cost air quality monitoring

    NASA Astrophysics Data System (ADS)

    Zimmerman, Naomi; Presto, Albert A.; Kumar, Sriniwasa P. N.; Gu, Jason; Hauryliuk, Aliaksei; Robinson, Ellis S.; Robinson, Allen L.; Subramanian, R.

    2018-01-01

    Low-cost sensing strategies hold the promise of denser air quality monitoring networks, which could significantly improve our understanding of personal air pollution exposure. Additionally, low-cost air quality sensors could be deployed to areas where limited monitoring exists. However, low-cost sensors are frequently sensitive to environmental conditions and pollutant cross-sensitivities, which have historically been poorly addressed by laboratory calibrations, limiting their utility for monitoring. In this study, we investigated different calibration models for the Real-time Affordable Multi-Pollutant (RAMP) sensor package, which measures CO, NO2, O3, and CO2. We explored three methods: (1) laboratory univariate linear regression, (2) empirical multiple linear regression, and (3) machine-learning-based calibration models using random forests (RF). Calibration models were developed for 16-19 RAMP monitors (varied by pollutant) using training and testing windows spanning August 2016 through February 2017 in Pittsburgh, PA, US. The random forest models matched (CO) or significantly outperformed (NO2, CO2, O3) the other calibration models, and their accuracy and precision were robust over time for testing windows of up to 16 weeks. Following calibration, average mean absolute error on the testing data set from the random forest models was 38 ppb for CO (14 % relative error), 10 ppm for CO2 (2 % relative error), 3.5 ppb for NO2 (29 % relative error), and 3.4 ppb for O3 (15 % relative error), and Pearson r versus the reference monitors exceeded 0.8 for most units. Model performance is explored in detail, including a quantification of model variable importance, accuracy across different concentration ranges, and performance in a range of monitoring contexts including the National Ambient Air Quality Standards (NAAQS) and the US EPA Air Sensors Guidebook recommendations of minimum data quality for personal exposure measurement. 
A key strength of the RF approach is that it accounts for pollutant cross-sensitivities. This highlights the importance of developing multipollutant sensor packages (as opposed to single-pollutant monitors); we determined this is especially critical for NO2 and CO2. The evaluation reveals that only the RF-calibrated sensors meet the US EPA Air Sensors Guidebook recommendations of minimum data quality for personal exposure measurement. We also demonstrate that the RF-model-calibrated sensors could detect differences in NO2 concentrations between a near-road site and a suburban site less than 1.5 km away. From this study, we conclude that combining RF models with carefully controlled state-of-the-art multipollutant sensor packages as in the RAMP monitors appears to be a very promising approach to address the poor performance that has plagued low-cost air quality sensors.

  14. Simulating the Effects of Fire on Forests in the Russian Far East: Integrating a Fire Danger Model and the FAREAST Forest Growth Model Across a Complex Landscape

    NASA Astrophysics Data System (ADS)

    Sherman, N. J.; Loboda, T.; Sun, G.; Shugart, H. H.; Csiszar, I.

    2008-12-01

    The remaining natural habitat of the critically endangered Amur tiger (Panthera tigris altaica) and Amur leopard (Panthera pardus orientalis) is a vast, biologically and topographically diverse area in the Russian Far East (RFE). Although wildland fire is a natural component of ecosystem functioning in the RFE, severe or repeated fires frequently re-set the process of forest succession, which may take centuries to return the affected forests to the pre-fire state, and thus significantly alter habitat quality and long-term availability. The frequency of severe fire events has increased over the last 25 years, leading to irreversible modifications of some parts of the species' habitats. Moreover, fire regimes are expected to continue to change toward more frequent and severe events under the influence of climate change. Here we present an approach to developing capabilities for a comprehensive assessment of potential Amur tiger and leopard habitat availability throughout the 21st century by integrating regionally parameterized fire danger and forest growth models. The FAREAST model is an individual, gap-based model that simulates forest growth in a single location and demonstrates temporally explicit forest succession leading to mature forests. Including spatially explicit information on probabilities of fire occurrence at 1 km resolution developed from the regionally specific, remotely sensed, data-driven fire danger model improves our ability to provide realistic long-term projections of potential forest composition in the RFE. This work presents the first attempt to merge the FAREAST model with a fire disturbance model, to validate its outputs across a large region, and to compare it to remotely sensed data products as well as in situ assessments of forest structure. We ran the FAREAST model at 1,000 randomly selected points within forested areas in the RFE.
At each point, the model was calibrated for temperature, precipitation, slope, elevation, and fire probability. The output of the model includes biomass estimates for 44 tree species that occur in the RFE, grouped by genus. We compared the model outputs with land cover classifications derived from the Moderate Resolution Imaging Spectroradiometer (MODIS) data and LIDAR-based estimates of biomass across the entire region, and Russian forest inventory records at selected sites. Overall, we find that the FAREAST estimates of forest biomass and general composition are consistent with the observed distribution of forest types.

  15. Sensitivity of regional forest carbon budgets to continuous and stochastic climate change pressures

    NASA Astrophysics Data System (ADS)

    Sulman, B. N.; Desai, A. R.; Scheller, R. M.

    2010-12-01

    Climate change is expected to impact forest-atmosphere carbon budgets through three processes: (1) increased disturbance rates, including fires, mortality due to pest outbreaks, and severe storms; (2) changes in patterns of inter-annual variability, related to increased incidence of severe droughts and defoliating insect outbreaks; and (3) continuous changes in forest productivity and respiration, related to increases in mean temperature, growing season length, and CO2 fertilization. While the importance of these climate change effects in future regional carbon budgets has been established, quantitative characterization of the relative sensitivity of forested landscapes to these different types of pressures is needed. We present a model- and data-based approach to understanding the sensitivity of forested landscapes to climate change pressures. Eddy-covariance and biometric measurements from forests in the northern United States were used to constrain two forest landscape models. The first, LandNEP, uses a prescribed functional form for the evolution of net ecosystem productivity (NEP) over the age of a forested grid cell, which is reset following a disturbance event. This model was used for investigating the basic statistical properties of a simple landscape’s responses to climate change pressures. The second model, LANDIS-II, includes different tree species and models forest biomass accumulation and succession, allowing us to investigate the effects of more complex forest processes such as species change and carbon pool accumulation on landscape responses to climate change effects. We tested the sensitivity of forested landscapes to these three types of climate change pressures by applying ensemble perturbations of random disturbance rates, distribution functions of inter-annual variability, and maximum potential carbon uptake rates, in the two models.
We find that landscape-scale net carbon exchange responds linearly to continuous changes in potential carbon uptake and inter-annual variability, while responses to stochastic changes are non-linear and become more important at shorter mean disturbance intervals. These results provide insight on how to better parameterize coupled carbon-climate models to more realistically simulate feedbacks between forests and the atmosphere.

  16. Integrating Geo-Spatial Data for Regional Landslide Susceptibility Modeling in Consideration of Run-Out Signature

    NASA Astrophysics Data System (ADS)

    Lai, J.-S.; Tsai, F.; Chiang, S.-H.

    2016-06-01

    This study implements a data mining-based algorithm, the random forests classifier, with geo-spatial data to construct a regional, rainfall-induced landslide susceptibility model. The developed model also takes account of landslide regions (source, non-occurrence and run-out signatures) from the original landslide inventory in order to increase the reliability of the susceptibility modelling. A total of ten causative factors were collected and used in this study, including aspect, curvature, elevation, slope, faults, geology, NDVI (Normalized Difference Vegetation Index), rivers, roads and soil data. This study transforms the landslide inventory and vector-based causative factors into a pixel-based format in order to overlay them with other raster data for constructing the random forests based model. This study also uses original and edited topographic data in the analysis to understand their impacts on the susceptibility modeling. Experimental results demonstrate that after identifying the run-out signatures, the overall accuracy and Kappa coefficient exceed 85 % and 0.8, respectively. In addition, correcting unreasonable topographic features of the digital terrain model also produces more reliable modelling results.
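
    The two accuracy measures reported in this abstract, overall accuracy and the Kappa coefficient, are both derived from a confusion matrix. The sketch below shows the arithmetic with a hypothetical two-class matrix; the counts are invented for illustration and are not the study's results. The function accepts any number of classes.

```python
def accuracy_and_kappa(confusion):
    """Overall accuracy and Cohen's kappa from a square confusion matrix.

    confusion[i][j] is the number of samples whose reference class is i
    and whose predicted class is j.
    """
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    # Observed agreement: share of samples on the diagonal.
    observed = sum(confusion[i][i] for i in range(k)) / total
    # Agreement expected by chance, from the row and column marginals.
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


# Hypothetical counts: rows = reference class, columns = predicted class.
matrix = [
    [45, 5],   # landslide pixels
    [5, 45],   # non-landslide pixels
]
acc, kappa = accuracy_and_kappa(matrix)
print(f"overall accuracy = {acc:.2f}, kappa = {kappa:.2f}")
```

    Kappa discounts the agreement that would occur by chance, which is why it is typically lower than overall accuracy and is reported alongside it when class sizes are unbalanced.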

  17. Aspen succession in the Intermountain West: A deterministic model

    Treesearch

    Dale L. Bartos; Frederick R. Ward; George S. Innis

    1983-01-01

    A deterministic model of succession in aspen forests was developed using existing data and intuition. The degree of uncertainty, which was determined by allowing the parameter values to vary at random within limits, was larger than desired. This report presents results of an analysis of model sensitivity to changes in parameter values. These results have indicated...

  18. Leaf area index uncertainty estimates for model-data fusion applications

    Treesearch

    Andrew D. Richardson; D. Bryan Dail; D.Y. Hollinger

    2011-01-01

    Estimates of data uncertainties are required to integrate different observational data streams as model constraints using model-data fusion. We describe an approach with which random and systematic uncertainties in optical measurements of leaf area index [LAI] can be quantified. We use data from a measurement campaign at the spruce-dominated Howland Forest AmeriFlux...

  19. Combination of support vector machine, artificial neural network and random forest for improving the classification of convective and stratiform rain using spectral features of SEVIRI data

    NASA Astrophysics Data System (ADS)

    Lazri, Mourad; Ameur, Soltane

    2018-05-01

    A model combining three classifiers, namely Support vector machine, Artificial neural network and Random forest (SAR), is designed for improving the classification of convective and stratiform rain. This model (SAR model) has been trained and then tested on datasets derived from MSG-SEVIRI (Meteosat Second Generation-Spinning Enhanced Visible and Infrared Imager). Well-classified, mid-classified and misclassified pixels are determined from the combination of the three classifiers. Mid-classified and misclassified pixels, which are considered unreliable, are reclassified by using a novel training of the developed scheme. In this novel training, only the input data corresponding to the pixels in question are used. This whole process is repeated a second time and applied to mid-classified and misclassified pixels separately. Learning and validation of the developed scheme are realized against co-located data observed by ground radar. The developed scheme outperformed the individual classifiers used separately and reached an overall classification accuracy of 97.40%.

  20. Ship Detection Based on Multiple Features in Random Forest Model for Hyperspectral Images

    NASA Astrophysics Data System (ADS)

    Li, N.; Ding, L.; Zhao, H.; Shi, J.; Wang, D.; Gong, X.

    2018-04-01

    A novel method for detecting ships that aims to make full use of both the spatial and spectral information from hyperspectral images is proposed. Firstly, a band with a high signal-to-noise ratio in the near-infrared or short-wave infrared range is used to segment land and sea with the Otsu threshold segmentation method. Secondly, multiple features, including spectral and texture features, are extracted from the hyperspectral images. Principal components analysis (PCA) is used to extract spectral features, and the Grey Level Co-occurrence Matrix (GLCM) is used to extract texture features. Finally, a Random Forest (RF) model is introduced to detect ships based on the extracted features. To illustrate the effectiveness of the method, we carry out experiments on EO-1 data, comparing single features and different combinations of multiple features. Compared with the traditional single-feature method and the Support Vector Machine (SVM) model, the proposed method can stably detect ships against complex backgrounds and can effectively improve the detection accuracy.
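
    The first step of this pipeline, Otsu thresholding of a single band to separate sea from land, can be sketched as below. This is a generic plain-Python version of Otsu's method for integer intensities, not the authors' implementation; the pixel values are hypothetical and assume 8-bit data in which water appears dark and land bright.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that maximizes between-class variance.

    pixels -- iterable of integer intensities in the range 0..levels-1
    Pixels with intensity <= t form the dark class, the rest the bright class.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # sum of their intensities
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, so the split is not meaningful
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        # Proportional to the between-class variance (constant factor dropped).
        score = w0 * w1 * (mu0 - mu1) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t


# Hypothetical near-infrared band: dark water pixels and bright land pixels.
band = [10] * 50 + [12] * 50 + [200] * 30 + [205] * 30
t = otsu_threshold(band)
sea_mask = [p <= t for p in band]
print(t, sum(sea_mask))
```

    Later steps in the abstract (PCA, GLCM texture features, and the Random Forest classifier) would then operate only on the pixels flagged as sea.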

  1. Mapping Soil Properties of Africa at 250 m Resolution: Random Forests Significantly Improve Current Predictions

    PubMed Central

    Hengl, Tomislav; Heuvelink, Gerard B. M.; Kempen, Bas; Leenaars, Johan G. B.; Walsh, Markus G.; Shepherd, Keith D.; Sila, Andrew; MacMillan, Robert A.; Mendes de Jesus, Jorge; Tamene, Lulseged; Tondoh, Jérôme E.

    2015-01-01

    80% of arable land in Africa has low soil fertility and suffers from physical soil problems. Additionally, significant amounts of nutrients are lost every year due to unsustainable soil management practices. This is partially the result of insufficient use of soil management knowledge. To help bridge the soil information gap in Africa, the Africa Soil Information Service (AfSIS) project was established in 2008. Over the period 2008–2014, the AfSIS project compiled two point data sets: the Africa Soil Profiles (legacy) database and the AfSIS Sentinel Site database. These data sets contain over 28 thousand sampling locations and represent the most comprehensive soil sample data sets of the African continent to date. Utilizing these point data sets in combination with a large number of covariates, we have generated a series of spatial predictions of soil properties relevant to the agricultural management—organic carbon, pH, sand, silt and clay fractions, bulk density, cation-exchange capacity, total nitrogen, exchangeable acidity, Al content and exchangeable bases (Ca, K, Mg, Na). We specifically investigate differences between two predictive approaches: random forests and linear regression. Results of 5-fold cross-validation demonstrate that the random forests algorithm consistently outperforms the linear regression algorithm, with average decreases of 15–75% in Root Mean Squared Error (RMSE) across soil properties and depths. Fitting and running random forests models takes an order of magnitude more time and the modelling success is sensitive to artifacts in the input data, but as long as quality-controlled point data are provided, an increase in soil mapping accuracy can be expected. Results also indicate that globally predicted soil classes (USDA Soil Taxonomy, especially Alfisols and Mollisols) help improve continental scale soil property mapping, and are among the most important predictors. 
This indicates a promising potential for transferring pedological knowledge from data rich countries to countries with limited soil data. PMID:26110833
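
    The evaluation protocol in this abstract, 5-fold cross-validation with models compared by RMSE, can be sketched with a small harness. Everything below is illustrative: the two stand-in models (a mean predictor and a one-variable least-squares line) and the synthetic data are hypothetical placeholders, and a random forest or multi-covariate regression would be plugged in through the same `fit` interface.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validated_rmse(fit, xs, ys, k=5, seed=0):
    """Pool the held-out predictions from all k folds and return their RMSE."""
    truth, preds = [], []
    for fold in k_fold_indices(len(xs), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in fold:
            truth.append(ys[i])
            preds.append(model(xs[i]))
    return rmse(truth, preds)


def fit_mean(xs, ys):
    """Stand-in baseline: always predict the training mean."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def fit_line(xs, ys):
    """Stand-in model: one-variable ordinary least squares."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def relative_decrease(baseline_rmse, candidate_rmse):
    """Percentage decrease in RMSE of the candidate relative to the baseline."""
    return 100.0 * (baseline_rmse - candidate_rmse) / baseline_rmse


if __name__ == "__main__":
    rng = random.Random(42)
    xs = [float(i) for i in range(50)]              # synthetic covariate
    ys = [2.0 * x + rng.gauss(0, 3.0) for x in xs]  # synthetic target
    base = cross_validated_rmse(fit_mean, xs, ys)
    cand = cross_validated_rmse(fit_line, xs, ys)
    print(f"RMSE decrease: {relative_decrease(base, cand):.1f}%")
```

    Using the same folds (same `seed`) for both models keeps the comparison paired, so differences in RMSE reflect the models rather than the split.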

  2. Monitoring grass nutrients and biomass as indicators of rangeland quality and quantity using random forest modelling and WorldView-2 data

    NASA Astrophysics Data System (ADS)

    Ramoelo, Abel; Cho, M. A.; Mathieu, R.; Madonsela, S.; van de Kerchove, R.; Kaszta, Z.; Wolff, E.

    2015-12-01

    Land use and climate change could have huge impacts on food security and the health of various ecosystems. Leaf nitrogen (N) and above-ground biomass are some of the key factors limiting agricultural production and ecosystem functioning. Leaf N and biomass can be used as indicators of rangeland quality and quantity. Conventional methods for assessing these vegetation parameters at landscape scale are time consuming and tedious. Remote sensing provides a bird's-eye view of the landscape, which creates an opportunity to assess these vegetation parameters over wider rangeland areas. Estimation of leaf N has been successful during peak productivity or high biomass, but few studies have estimated leaf N in the dry season. The estimation of above-ground biomass has been hindered by signal saturation problems when using conventional vegetation indices. The objective of this study is to monitor leaf N and above-ground biomass as indicators of rangeland quality and quantity using WorldView-2 satellite images and the random forest technique in the north-eastern part of South Africa. A series of field campaigns to collect samples for leaf N and biomass was undertaken in March 2013, April or May 2012 (end of wet season) and July 2012 (dry season). Several conventional and red edge based vegetation indices were computed. Overall results indicate that random forest and vegetation indices explained over 89% of leaf N concentrations for grass and trees, and less than 89% for all the years of assessment. The red edge based vegetation indices were among the important variables for predicting leaf N. For the biomass, the random forest model explained over 84% of biomass variation in all years, and visible bands including red edge based vegetation indices were found to be important. The study demonstrated that leaf N could be monitored using high spatial resolution imagery with red edge band capability, and that this is important for rangeland assessment and monitoring.

  3. Mapping Soil Properties of Africa at 250 m Resolution: Random Forests Significantly Improve Current Predictions.

    PubMed

    Hengl, Tomislav; Heuvelink, Gerard B M; Kempen, Bas; Leenaars, Johan G B; Walsh, Markus G; Shepherd, Keith D; Sila, Andrew; MacMillan, Robert A; Mendes de Jesus, Jorge; Tamene, Lulseged; Tondoh, Jérôme E

    2015-01-01

    80% of arable land in Africa has low soil fertility and suffers from physical soil problems. Additionally, significant amounts of nutrients are lost every year due to unsustainable soil management practices. This is partially the result of insufficient use of soil management knowledge. To help bridge the soil information gap in Africa, the Africa Soil Information Service (AfSIS) project was established in 2008. Over the period 2008-2014, the AfSIS project compiled two point data sets: the Africa Soil Profiles (legacy) database and the AfSIS Sentinel Site database. These data sets contain over 28 thousand sampling locations and represent the most comprehensive soil sample data sets of the African continent to date. Utilizing these point data sets in combination with a large number of covariates, we have generated a series of spatial predictions of soil properties relevant to the agricultural management--organic carbon, pH, sand, silt and clay fractions, bulk density, cation-exchange capacity, total nitrogen, exchangeable acidity, Al content and exchangeable bases (Ca, K, Mg, Na). We specifically investigate differences between two predictive approaches: random forests and linear regression. Results of 5-fold cross-validation demonstrate that the random forests algorithm consistently outperforms the linear regression algorithm, with average decreases of 15-75% in Root Mean Squared Error (RMSE) across soil properties and depths. Fitting and running random forests models takes an order of magnitude more time and the modelling success is sensitive to artifacts in the input data, but as long as quality-controlled point data are provided, an increase in soil mapping accuracy can be expected. Results also indicate that globally predicted soil classes (USDA Soil Taxonomy, especially Alfisols and Mollisols) help improve continental scale soil property mapping, and are among the most important predictors. 
    This indicates a promising potential for transferring pedological knowledge from data-rich countries to countries with limited soil data.
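
    The comparison described above (k-fold cross-validation, with the gain reported as a relative decrease in RMSE) can be sketched in a few lines. This is only an illustration under stated assumptions: the two "models" are a mean baseline and a one-variable least-squares line standing in for the linear regression and random forest of the study, and the names `fit_mean`, `fit_line`, `kfold_rmse` and `rmse_decrease_percent` are hypothetical helpers, not anything taken from the AfSIS workflow.

```python
import math

def fit_mean(xs, ys):
    """Stand-in reference model: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Stand-in candidate model: ordinary least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)

def kfold_rmse(xs, ys, fit, k=5):
    """Pooled RMSE over k held-out folds (fold f holds out indices f, f+k, ...)."""
    sq_err = 0.0
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        test = [i for i in range(len(xs)) if i % k == fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        sq_err += sum((model(xs[i]) - ys[i]) ** 2 for i in test)
    return math.sqrt(sq_err / len(xs))

def rmse_decrease_percent(rmse_reference, rmse_candidate):
    """Relative RMSE decrease of the candidate over the reference, in percent."""
    return 100.0 * (rmse_reference - rmse_candidate) / rmse_reference

# Toy data (made up for illustration): an exactly linear response.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
rmse_base = kfold_rmse(xs, ys, fit_mean)
rmse_line = kfold_rmse(xs, ys, fit_line)
print(rmse_decrease_percent(rmse_base, rmse_line))
```

    In practice the two `fit_*` functions would be replaced by the actual regressors being compared, while the fold assignment and the RMSE bookkeeping would stay the same.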

  4. The structure of tropical forests and sphere packings

    PubMed Central

    Jahn, Markus Wilhelm; Dobner, Hans-Jürgen; Wiegand, Thorsten; Huth, Andreas

    2015-01-01

    The search for simple principles underlying the complex architecture of ecological communities such as forests still challenges ecological theorists. We use tree diameter distributions—fundamental for deriving other forest attributes—to describe the structure of tropical forests. Here we argue that tree diameter distributions of natural tropical forests can be explained by stochastic packing of tree crowns representing a forest crown packing system: a method usually used in physics or chemistry. We demonstrate that tree diameter distributions emerge accurately from a surprisingly simple set of principles that include site-specific tree allometries, random placement of trees, competition for space, and mortality. The simple static model also successfully predicted the canopy structure, revealing that most trees in our two studied forests grow up to 30–50 m in height and that the highest packing density of about 60% is reached between the 25- and 40-m height layer. Our approach is an important step toward identifying a minimal set of processes responsible for generating the spatial structure of tropical forests. PMID:26598678

  5. A Random Forest-based ensemble method for activity recognition.

    PubMed

    Feng, Zengtao; Mo, Lingfei; Li, Meng

    2015-01-01

    This paper presents a multi-sensor ensemble approach to human physical activity (PA) recognition, using random forest. We designed an ensemble learning algorithm, which integrates several independent Random Forest classifiers based on different sensor feature sets to build a more stable, more accurate and faster classifier for human activity recognition. To evaluate the algorithm, PA data collected from the PAMAP (Physical Activity Monitoring for Aging People), which is a standard, publicly available database, was utilized to train and test. The experimental results show that the algorithm is able to correctly recognize 19 PA types with an accuracy of 93.44%, while the training is faster than others. The ensemble classifier system based on the RF (Random Forest) algorithm can achieve high recognition accuracy and fast calculation.

  6. Microwave scattering models and basic experiments

    NASA Technical Reports Server (NTRS)

    Fung, Adrian K.

    1989-01-01

    Progress is summarized which has been made in four areas of study: (1) scattering model development for sparsely populated media, such as a forested area; (2) scattering model development for dense media, such as a sea ice medium or a snow covered terrain; (3) model development for randomly rough surfaces; and (4) design and conduct of basic scattering and attenuation experiments suitable for the verification of theoretical models.

  7. Characterizing Forest Change Using Community-Based Monitoring Data and Landsat Time Series

    PubMed Central

    DeVries, Ben; Pratihast, Arun Kumar; Verbesselt, Jan; Kooistra, Lammert; Herold, Martin

    2016-01-01

    Increasing awareness of the issue of deforestation and degradation in the tropics has resulted in efforts to monitor forest resources in tropical countries. Advances in satellite-based remote sensing and ground-based technologies have allowed for monitoring of forests with high spatial, temporal and thematic detail. Despite these advances, there is a need to engage communities in monitoring activities and include these stakeholders in national forest monitoring systems. In this study, we analyzed activity data (deforestation and forest degradation) collected by local forest experts over a 3-year period in an Afro-montane forest area in southwestern Ethiopia and corresponding Landsat Time Series (LTS). Local expert data included forest change attributes, geo-location and photo evidence recorded using mobile phones with integrated GPS and photo capabilities. We also assembled LTS using all available data from all spectral bands and a suite of additional indices and temporal metrics based on time series trajectory analysis. We predicted deforestation, degradation or stable forests using random forest models trained with data from local experts and LTS spectral-temporal metrics as model covariates. Resulting models predicted deforestation and degradation with an out of bag (OOB) error estimate of 29% overall, and 26% and 31% for the deforestation and degradation classes, respectively. By dividing the local expert data into training and operational phases corresponding to local monitoring activities, we found that forest change models improved as more local expert data were used. Finally, we produced maps of deforestation and degradation using the most important spectral bands. The results in this study represent some of the first to combine local expert based forest change data and dense LTS, demonstrating the complementary value of both continuous data streams. 
    Our results underpin the utility of both datasets and provide a useful foundation for integrated forest monitoring systems relying on data streams from diverse sources. PMID:27018852

  8. Design of Probabilistic Random Forests with Applications to Anticancer Drug Sensitivity Prediction

    PubMed Central

    Rahman, Raziur; Haider, Saad; Ghosh, Souparno; Pal, Ranadip

    2015-01-01

    Random forests consisting of an ensemble of regression trees with equal weights are frequently used for design of predictive models. In this article, we consider an extension of the methodology by representing the regression trees in the form of probabilistic trees and analyzing the nature of heteroscedasticity. The probabilistic tree representation allows for analytical computation of confidence intervals (CIs), and the tree weight optimization is expected to provide stricter CIs with comparable performance in mean error. We approached the ensemble of probabilistic trees’ prediction from the perspectives of a mixture distribution and as a weighted sum of correlated random variables. We applied our methodology to the drug sensitivity prediction problem on synthetic and cancer cell line encyclopedia dataset and illustrated that tree weights can be selected to reduce the average length of the CI without increase in mean error. PMID:27081304
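
    The two perspectives named in the abstract, a mixture distribution and a weighted sum of correlated random variables, lead to different variance formulas for the same set of tree weights. The sketch below uses only the standard textbook identities; it is not the authors' derivation, and the function names and the toy numbers are assumptions made for illustration.

```python
def mixture_mean_var(means, variances, weights):
    """Mean and variance of a mixture whose components have the given
    means and variances; weights are normalised to sum to 1."""
    total = sum(weights)
    w = [wi / total for wi in weights]
    mean = sum(wi * m for wi, m in zip(w, means))
    # E[X^2] of a mixture is the weighted sum of component second moments.
    second_moment = sum(wi * (v + m * m) for wi, m, v in zip(w, means, variances))
    return mean, second_moment - mean * mean

def weighted_sum_var(weights, cov):
    """Variance of sum_i w_i * X_i given the covariance matrix of the X_i."""
    n = len(weights)
    return sum(weights[i] * weights[j] * cov[i][j]
               for i in range(n) for j in range(n))

# Two hypothetical probabilistic trees with equal weights.
mean, var_mixture = mixture_mean_var([1.0, 3.0], [1.0, 1.0], [0.5, 0.5])
var_sum = weighted_sum_var([0.5, 0.5], [[1.0, 0.5], [0.5, 1.0]])
print(mean, var_mixture, var_sum)
```

    The mixture variance includes the spread between tree means, whereas the weighted-sum variance depends on the covariances between trees, so an interval built from one view can differ in length from one built from the other.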

  9. Combined use of two supervised learning algorithms to model sea turtle behaviours from tri-axial acceleration data.

    PubMed

    Jeantet, L; Dell'Amico, F; Forin-Wiart, M-A; Coutant, M; Bonola, M; Etienne, D; Gresser, J; Regis, S; Lecerf, N; Lefebvre, F; de Thoisy, B; Le Maho, Y; Brucker, M; Châtelain, N; Laesser, R; Crenner, F; Handrich, Y; Wilson, R; Chevallier, D

    2018-05-23

    Accelerometers are becoming ever more important sensors in animal-attached technology, providing data that allow determination of body posture and movement and thereby helping to elucidate behaviour in animals that are difficult to observe. We sought to validate the identification of sea turtle behaviours from accelerometer signals by deploying tags on the carapace of a juvenile loggerhead (Caretta caretta), an adult hawksbill (Eretmochelys imbricata) and an adult green turtle (Chelonia mydas) at Aquarium La Rochelle, France. We recorded tri-axial acceleration at 50 Hz for each species for a full day while two fixed cameras recorded their behaviours. We identified behaviours from the acceleration data using two different supervised learning algorithms, Random Forest and Classification And Regression Tree (CART), treating the data from the adult animals as separate from the juvenile data. We achieved a global accuracy of 81.30% for the adult hawksbill and green turtle CART model and 71.63% for the juvenile loggerhead, identifying 10 and 12 different behaviours, respectively. Equivalent figures were 86.96% for the adult hawksbill and green turtle Random Forest model and 79.49% for the juvenile loggerhead, for the same behaviours. The use of Random Forest combined with CART algorithms allowed us to understand the decision rules implicated in behaviour discrimination, and thus remove or group together some 'confused' or under-represented behaviours in order to get the most accurate models. This study is the first to validate accelerometer data to identify turtle behaviours and the approach can now be tested on other captive sea turtle species. © 2018. Published by The Company of Biologists Ltd.

  10. Automated retrieval of forest structure variables based on multi-scale texture analysis of VHR satellite imagery

    NASA Astrophysics Data System (ADS)

    Beguet, Benoit; Guyon, Dominique; Boukir, Samia; Chehata, Nesrine

    2014-10-01

    The main goal of this study is to design a method to describe the structure of forest stands from Very High Resolution satellite imagery, relying on some typical variables such as crown diameter, tree height, trunk diameter, tree density and tree spacing. The emphasis is placed on the automatization of the process of identification of the most relevant image features for the forest structure retrieval task, exploiting both spectral and spatial information. Our approach is based on linear regressions between the forest structure variables to be estimated and various spectral and Haralick's texture features. The main drawback of this well-known texture representation is the underlying parameters which are extremely difficult to set due to the spatial complexity of the forest structure. To tackle this major issue, an automated feature selection process is proposed which is based on statistical modeling, exploring a wide range of parameter values. It provides texture measures of diverse spatial parameters hence implicitly inducing a multi-scale texture analysis. A new feature selection technique, we called Random PRiF, is proposed. It relies on random sampling in feature space, carefully addresses the multicollinearity issue in multiple-linear regression while ensuring accurate prediction of forest variables. Our automated forest variable estimation scheme was tested on Quickbird and Pléiades panchromatic and multispectral images, acquired at different periods on the maritime pine stands of two sites in South-Western France. It outperforms two well-established variable subset selection techniques. It has been successfully applied to identify the best texture features in modeling the five considered forest structure variables. 
    The RMSE of all predicted forest variables is improved by combining multispectral and panchromatic texture features, with various parameterizations, highlighting the potential of a multi-resolution approach for retrieving forest structure variables from VHR satellite images. Thus an average prediction error of ~1.1 m is expected on crown diameter, ~0.9 m on tree spacing, ~3 m on height and ~0.06 m on diameter at breast height.

  11. A Time-Series Water Level Forecasting Model Based on Imputation and Variable Selection Method.

    PubMed

    Yang, Jun-He; Cheng, Ching-Hsue; Chan, Chia-Pan

    2017-01-01

    Reservoirs are important for households and impact the national economy. This paper proposed a time-series forecasting model based on estimating a missing value followed by variable selection to forecast the reservoir's water level. This study collected data from the Taiwan Shimen Reservoir as well as daily atmospheric data from 2008 to 2015. The two datasets are concatenated into an integrated dataset based on ordering of the data as a research dataset. The proposed time-series forecasting model summarily has three foci. First, this study uses five imputation methods to directly delete the missing value. Second, we identified the key variable via factor analysis and then deleted the unimportant variables sequentially via the variable selection method. Finally, the proposed model uses a Random Forest to build the forecasting model of the reservoir's water level. This was done to compare with the listing method under the forecasting error. These experimental results indicate that the Random Forest forecasting model when applied to variable selection with full variables has better forecasting performance than the listing model. In addition, this experiment shows that the proposed variable selection can help determine five forecast methods used here to improve the forecasting capability.

  12. Summer and winter habitat suitability of Marco Polo argali in southeastern Tajikistan: A modeling approach.

    PubMed

    Salas, Eric Ariel L; Valdez, Raul; Michel, Stefan

    2017-11-01

    We modeled summer and winter habitat suitability of Marco Polo argali in the Pamir Mountains in southeastern Tajikistan using the following statistical algorithms: Generalized Linear Model, Random Forest, Boosted Regression Tree, Maxent, and Multivariate Adaptive Regression Splines. Using sheep occurrence data collected from 2009 to 2015 and a set of selected habitat predictors, we produced summer and winter habitat suitability maps and determined the important habitat suitability predictors for both seasons. Our results demonstrated that argali selected proximity to riparian areas and greenness as the two most relevant variables for summer, and the degree of slope (gentler slopes from 0° to 20°) and Landsat temperature band for winter. The terrain roughness was also among the most important variables in summer and winter models. Aspect was only significant for winter habitat, with argali preferring south-facing mountain slopes. We evaluated various measures of model performance such as the Area Under the Curve (AUC) and the True Skill Statistic (TSS). Comparing the five algorithms, the AUC was highest for Boosted Regression Tree in both the summer (AUC = 0.94) and winter model runs (AUC = 0.94). In contrast, Random Forest underperformed in both model runs.
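
    For readers unfamiliar with the two performance measures named above, the following sketch computes them from suitability scores at known presence and absence locations. It uses the usual definitions (AUC as the probability that a presence outscores an absence, TSS as sensitivity plus specificity minus one); the scores, the 0.5 threshold and the function names are made up for illustration and do not come from the study.

```python
def auc(scores_presence, scores_absence):
    """Probability that a random presence site scores higher than a
    random absence site; ties count as one half."""
    wins = 0.0
    for p in scores_presence:
        for a in scores_absence:
            if p > a:
                wins += 1.0
            elif p == a:
                wins += 0.5
    return wins / (len(scores_presence) * len(scores_absence))

def tss(scores_presence, scores_absence, threshold):
    """True Skill Statistic: sensitivity + specificity - 1 at a threshold."""
    sensitivity = sum(s >= threshold for s in scores_presence) / len(scores_presence)
    specificity = sum(s < threshold for s in scores_absence) / len(scores_absence)
    return sensitivity + specificity - 1.0

# Hypothetical habitat suitability scores.
presence = [0.9, 0.8, 0.4]
absence = [0.7, 0.3, 0.2]
print(auc(presence, absence), tss(presence, absence, threshold=0.5))
```

    AUC is threshold-free, while TSS depends on the cut-off used to turn suitability scores into presence or absence, so the two measures can rank models differently.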

  13. Predicting the accuracy of ligand overlay methods with Random Forest models.

    PubMed

    Nandigam, Ravi K; Evans, David A; Erickson, Jon A; Kim, Sangtae; Sutherland, Jeffrey J

    2008-12-01

    The accuracy of binding mode prediction using standard molecular overlay methods (ROCS, FlexS, Phase, and FieldCompare) is studied. Previous work has shown that simple decision tree modeling can be used to improve accuracy by selection of the best overlay template. This concept is extended to the use of Random Forest (RF) modeling for template and algorithm selection. An extensive data set of 815 ligand-bound X-ray structures representing 5 gene families was used for generating ca. 70,000 overlays using four programs. RF models, trained using standard measures of ligand and protein similarity and Lipinski-related descriptors, are used for automatically selecting the reference ligand and overlay method maximizing the probability of reproducing the overlay deduced from X-ray structures (i.e., using rmsd ≤ 2 Å as the criteria for success). RF model scores are highly predictive of overlay accuracy, and their use in template and method selection produces correct overlays in 57% of cases for 349 overlay ligands not used for training RF models. The inclusion in the models of protein sequence similarity enables the use of templates bound to related protein structures, yielding useful results even for proteins having no available X-ray structures.

  14. Genome-wide association data classification and SNPs selection using two-stage quality-based Random Forests.

    PubMed

    Nguyen, Thanh-Tung; Huang, Joshua; Wu, Qingyao; Nguyen, Thuy; Li, Mark

    2015-01-01

    Single-nucleotide polymorphisms (SNPs) selection and identification are the most important tasks in Genome-wide association data analysis. The problem is difficult because genome-wide association data is very high dimensional and a large portion of SNPs in the data is irrelevant to the disease. Advanced machine learning methods have been successfully used in Genome-wide association studies (GWAS) for identification of genetic variants that have relatively big effects in some common, complex diseases. Among them, the most successful one is Random Forests (RF). Despite performing well in terms of prediction accuracy on some data sets of moderate size, RF still struggles in GWAS when selecting informative SNPs and building accurate prediction models. In this paper, we propose to use a new two-stage quality-based sampling method in random forests, named ts-RF, for SNP subspace selection for GWAS. The method first applies p-value assessment to find a cut-off point that separates informative and irrelevant SNPs into two groups. The informative SNPs group is further divided into two sub-groups: highly informative and weakly informative SNPs. When sampling the SNP subspace for building trees for the forest, only those SNPs from the two sub-groups are taken into account. The feature subspaces always contain highly informative SNPs when used to split a node of a tree. This approach enables one to generate more accurate trees with a lower prediction error, while possibly avoiding overfitting. It allows one to detect interactions of multiple SNPs with the diseases, and to reduce the dimensionality and the amount of Genome-wide association data needed for learning the RF model. Extensive experiments on two genome-wide SNP data sets (Parkinson case-control data comprised of 408,803 SNPs and Alzheimer case-control data comprised of 380,157 SNPs) and 10 gene data sets have demonstrated that the proposed model significantly reduced prediction errors and outperformed most existing state-of-the-art random forests. The top 25 SNPs in the Parkinson data set were identified by the proposed model, including four interesting genes associated with neurological disorders. The presented approach has been shown to be effective in selecting informative sub-groups of SNPs potentially associated with diseases that traditional statistical approaches might fail to detect. The new RF works well for data where the number of case-control objects is much smaller than the number of SNPs, which is a typical problem in gene data and GWAS. Experimental results demonstrated the effectiveness of the proposed RF model, which outperformed the state-of-the-art RFs, including Breiman's RF, GRRF and wsRF methods.

  15. SNP selection and classification of genome-wide SNP data using stratified sampling random forests.

    PubMed

    Wu, Qingyao; Ye, Yunming; Liu, Yang; Ng, Michael K

    2012-09-01

    For high-dimensional genome-wide association (GWA) case-control data of complex disease, a large portion of the single-nucleotide polymorphisms (SNPs) is usually irrelevant to the disease. A simple random sampling method in random forest, using the default mtry parameter to choose the feature subspace, will select too many subspaces without informative SNPs. An exhaustive search for an optimal mtry is often required in order to include useful and relevant SNPs and discard the vast number of non-informative SNPs. However, it is too time-consuming and not favorable in GWA for high-dimensional data. The main aim of this paper is to propose a stratified sampling method for feature subspace selection to generate decision trees in a random forest for GWA high-dimensional data. Our idea is to design an equal-width discretization scheme for informativeness to divide SNPs into multiple groups. In feature subspace selection, we randomly select the same number of SNPs from each group and combine them to form a subspace to generate a decision tree. The advantage of this stratified sampling procedure is that it makes sure each subspace contains enough useful SNPs, while avoiding the very high computational cost of an exhaustive search for an optimal mtry and maintaining the randomness of a random forest. We employ two genome-wide SNP data sets (Parkinson case-control data comprised of 408 803 SNPs and Alzheimer case-control data comprised of 380 157 SNPs) to demonstrate that the proposed stratified sampling method is effective, and that it can generate a better random forest, with higher accuracy and a lower error bound, than Breiman's random forest generation method. For Parkinson data, we also show some interesting genes identified by the method, which may be associated with neurological disorders for further biological investigations.
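
    The stratified subspace sampling described above can be sketched as two steps: bin the features by an informativeness score using equal-width intervals, then draw the same number of features from every bin. The sketch below is a minimal reading of the abstract, not the authors' code; the informativeness scores, the number of groups, the per-group count and both function names are assumptions made for illustration.

```python
import random

def equal_width_groups(scores, n_groups):
    """Assign feature indices to n_groups equal-width bins of the score range."""
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / n_groups or 1.0  # guard against identical scores
    groups = [[] for _ in range(n_groups)]
    for idx, score in enumerate(scores):
        # The maximum score would land in bin n_groups, so clamp it.
        g = min(int((score - lo) / width), n_groups - 1)
        groups[g].append(idx)
    return groups

def stratified_subspace(groups, per_group, rng):
    """Draw up to per_group feature indices from every group, without replacement."""
    subspace = []
    for members in groups:
        take = min(per_group, len(members))
        subspace.extend(rng.sample(members, take))
    return subspace

# Hypothetical informativeness scores for eight SNPs.
scores = [0.01, 0.02, 0.05, 0.40, 0.45, 0.90, 0.95, 0.99]
groups = equal_width_groups(scores, n_groups=3)
subspace = stratified_subspace(groups, per_group=2, rng=random.Random(0))
print(groups, subspace)
```

    Each tree would be grown on a fresh `subspace` draw, so every tree sees some features from the most informative bin while the selection within each bin remains random.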

  16. A random forest approach for predicting the presence of Echinococcus multilocularis intermediate host Ochotona spp. presence in relation to landscape characteristics in western China

    PubMed Central

    Marston, Christopher G.; Danson, F. Mark; Armitage, Richard P.; Giraudoux, Patrick; Pleydell, David R.J.; Wang, Qian; Qui, Jiamin; Craig, Philip S.

    2014-01-01

    Understanding distribution patterns of hosts implicated in the transmission of zoonotic disease remains a key goal of parasitology. Here, random forests are employed to model spatial patterns of the presence of the plateau pika (Ochotona spp.) small mammal intermediate host for the parasitic tapeworm Echinococcus multilocularis which is responsible for a significant burden of human zoonoses in western China. Landsat ETM+ satellite imagery and digital elevation model data were utilized to generate quantified measures of environmental characteristics across a study area in Sichuan Province, China. Land cover maps were generated identifying the distribution of specific land cover types, with landscape metrics employed to describe the spatial organisation of land cover patches. Random forests were used to model spatial patterns of Ochotona spp. presence, enabling the relative importance of the environmental characteristics in relation to Ochotona spp. presence to be ranked. An index of habitat aggregation was identified as the most important variable in influencing Ochotona spp. presence, with area of degraded grassland the most important land cover class variable. 71% of the variance in Ochotona spp. presence was explained, with a 90.98% accuracy rate as determined by ‘out-of-bag’ error assessment. Identification of the environmental characteristics influencing Ochotona spp. presence enables us to better understand distribution patterns of hosts implicated in the transmission of Em. The predictive mapping of this Em host enables the identification of human populations at increased risk of infection, enabling preventative strategies to be adopted. PMID:25386042

  17. Predicting adaptive phenotypes from multilocus genotypes in Sitka spruce (Picea sitchensis) using random forest.

    PubMed

    Holliday, Jason A; Wang, Tongli; Aitken, Sally

    2012-09-01

    Climate is the primary driver of the distribution of tree species worldwide, and the potential for adaptive evolution will be an important factor determining the response of forests to anthropogenic climate change. Although association mapping has the potential to improve our understanding of the genomic underpinnings of climatically relevant traits, the utility of adaptive polymorphisms uncovered by such studies would be greatly enhanced by the development of integrated models that account for the phenotypic effects of multiple single-nucleotide polymorphisms (SNPs) and their interactions simultaneously. We previously reported the results of association mapping in the widespread conifer Sitka spruce (Picea sitchensis). In the current study we used the recursive partitioning algorithm 'Random Forest' to identify optimized combinations of SNPs to predict adaptive phenotypes. After adjusting for population structure, we were able to explain 37% and 30% of the phenotypic variation, respectively, in two locally adaptive traits--autumn budset timing and cold hardiness. For each trait, the leading five SNPs captured much of the phenotypic variation. To determine the role of epistasis in shaping these phenotypes, we also used a novel approach to quantify the strength and direction of pairwise interactions between SNPs and found such interactions to be common. Our results demonstrate the power of Random Forest to identify subsets of markers that are most important to climatic adaptation, and suggest that interactions among these loci may be widespread.

  18. Spectroscopic diagnosis of laryngeal carcinoma using near-infrared Raman spectroscopy and random recursive partitioning ensemble techniques.

    PubMed

    Teh, Seng Khoon; Zheng, Wei; Lau, David P; Huang, Zhiwei

    2009-06-01

    In this work, we evaluated the diagnostic ability of near-infrared (NIR) Raman spectroscopy associated with the ensemble recursive partitioning algorithm based on random forests for identifying cancer from normal tissue in the larynx. A rapid-acquisition NIR Raman system was utilized for tissue Raman measurements at 785 nm excitation, and 50 human laryngeal tissue specimens (20 normal; 30 malignant tumors) were used for NIR Raman studies. The random forests method was introduced to develop effective diagnostic algorithms for classification of Raman spectra of different laryngeal tissues. High-quality Raman spectra in the range of 800-1800 cm⁻¹ can be acquired from laryngeal tissue within 5 seconds. Raman spectra differed significantly between normal and malignant laryngeal tissues. Classification results obtained from the random forests algorithm on tissue Raman spectra yielded a diagnostic sensitivity of 88.0% and specificity of 91.4% for laryngeal malignancy identification. The random forests technique also provided variable importance measures that facilitate correlation of significant Raman spectral features with cancer transformation. This study shows that NIR Raman spectroscopy in conjunction with the random forests algorithm has great potential for the rapid diagnosis and detection of malignant tumors in the larynx.

  19. Pigmented skin lesion detection using random forest and wavelet-based texture

    NASA Astrophysics Data System (ADS)

    Hu, Ping; Yang, Tie-jun

    2016-10-01

    The incidence of cutaneous malignant melanoma, a disease of worldwide distribution and the deadliest form of skin cancer, has been rapidly increasing over the last few decades. Because advanced cutaneous melanoma is still incurable, early detection is an important step toward a reduction in mortality. Dermoscopy photographs are commonly used in melanoma diagnosis and can capture detailed features of a lesion. A great variability exists in the visual appearance of pigmented skin lesions. Therefore, in order to minimize the diagnostic errors that result from the difficulty and subjectivity of visual interpretation, an automatic detection approach is required. The objectives of this paper were to propose a hybrid method using random forest and Gabor wavelet transformation to accurately differentiate the lesion area from the non-lesion area in dermoscopy photographs, and to analyze segmentation accuracy. A random forest classifier consisting of a set of decision trees was used for classification. The Gabor wavelet transformation is a mathematical model of the visual cortical cells of the mammalian brain, and it can decompose an image into multiple scales and orientations. The Gabor function has been recognized as a very useful tool in texture analysis, due to its optimal localization properties in both the spatial and frequency domains. Texture features based on the Gabor wavelet transformation are derived from the Gabor-filtered image. Experimental results indicate the following: (1) the proposed algorithm based on random forest outperformed the state of the art in pigmented skin lesion detection, and (2) the inclusion of Gabor wavelet transformation based texture features improved segmentation accuracy significantly.

  20. Correlation analysis between forest carbon stock and spectral vegetation indices in Xuan Lien Nature Reserve, Thanh Hoa, Viet Nam

    NASA Astrophysics Data System (ADS)

    Dung Nguyen, The; Kappas, Martin

    2017-04-01

    In the last several years, the interest in forest biomass and carbon stock estimation has increased due to its importance for forest management, modelling the carbon cycle, and other ecosystem services. However, no estimates of biomass and carbon stocks of different forest cover types exist for the Xuan Lien Nature Reserve, Thanh Hoa, Viet Nam. This study investigates the relationship between above-ground carbon stock and different vegetation indices in order to identify the vegetation index that best correlates with forest carbon stock. The terrestrial inventory data come from 380 sample plots that were randomly sampled. Individual tree parameters such as DBH and tree height were collected to calculate the above-ground volume, biomass and carbon for different forest types. The SPOT6 2013 satellite data was used in the study to obtain five vegetation indices: NDVI, RDVI, MSR, RVI, and EVI. The relationships between the forest carbon stock and vegetation indices were investigated using a multiple linear regression analysis. R-square, RMSE values and cross-validation were used to measure the strength and validate the performance of the models. The methodology presented here demonstrates the possibility of estimating forest volume, biomass and carbon stock. It can also be further improved by incorporating additional spectral band data and/or elevation.
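
    The five indices named in the abstract are simple band ratios, and a short sketch may help readers who know them only by acronym. The formulas below are the commonly cited definitions (with the usual EVI coefficients 2.5, 6, 7.5 and 1); the abstract does not state which formulations the study used, so treat them as an assumption. The reflectance values and the function name are made up for illustration.

```python
import math

def vegetation_indices(nir, red, blue):
    """Common spectral vegetation indices from surface reflectances (0..1)."""
    ratio = nir / red
    return {
        # Normalized Difference Vegetation Index
        "NDVI": (nir - red) / (nir + red),
        # Ratio Vegetation Index (simple ratio)
        "RVI": ratio,
        # Renormalized Difference Vegetation Index
        "RDVI": (nir - red) / math.sqrt(nir + red),
        # Modified Simple Ratio
        "MSR": (ratio - 1.0) / (math.sqrt(ratio) + 1.0),
        # Enhanced Vegetation Index with the usual coefficients
        "EVI": 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0),
    }

# Hypothetical reflectances for a densely vegetated pixel.
indices = vegetation_indices(nir=0.5, red=0.1, blue=0.05)
for name, value in indices.items():
    print(name, round(value, 4))
```

    Each plot's index values would then serve as predictors in the regression against field-measured carbon stock.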

  1. Automated time activity classification based on global positioning system (GPS) tracking data

    PubMed Central

    2011-01-01

    Background: Air pollution epidemiological studies are increasingly using global positioning system (GPS) to collect time-location data because they offer continuous tracking, high temporal resolution, and minimum reporting burden for participants. However, substantial uncertainties in the processing and classifying of raw GPS data create challenges for reliably characterizing time activity patterns. We developed and evaluated models to classify people's major time activity patterns from continuous GPS tracking data. Methods: We developed and evaluated two automated models to classify major time activity patterns (i.e., indoor, outdoor static, outdoor walking, and in-vehicle travel) based on GPS time activity data collected under free living conditions for 47 participants (N = 131 person-days) from the Harbor Communities Time Location Study (HCTLS) in 2008 and supplemental GPS data collected from three UC-Irvine research staff (N = 21 person-days) in 2010. Time activity patterns used for model development were manually classified by research staff using information from participant GPS recordings, activity logs, and follow-up interviews. We evaluated two models: (a) a rule-based model that developed user-defined rules based on time, speed, and spatial location, and (b) a random forest decision tree model. Results: Indoor, outdoor static, outdoor walking and in-vehicle travel activities accounted for 82.7%, 6.1%, 3.2% and 7.2% of manually-classified time activities in the HCTLS dataset, respectively. The rule-based model classified indoor and in-vehicle travel periods reasonably well (Indoor: sensitivity > 91%, specificity > 80%, and precision > 96%; in-vehicle travel: sensitivity > 71%, specificity > 99%, and precision > 88%), but the performance was moderate for outdoor static and outdoor walking predictions. No striking differences in performance were observed between the rule-based and the random forest models. The random forest model was fast and easy to execute, but was likely less robust than the rule-based model under the condition of biased or poor-quality training data. Conclusions: Our models can successfully identify indoor and in-vehicle travel points from the raw GPS data, but challenges remain in developing models to distinguish outdoor static points and walking. Accurate training data are essential in developing reliable models in classifying time-activity patterns. PMID:22082316

  2. Automated time activity classification based on global positioning system (GPS) tracking data.

    PubMed

    Wu, Jun; Jiang, Chengsheng; Houston, Douglas; Baker, Dean; Delfino, Ralph

    2011-11-14

    Air pollution epidemiological studies are increasingly using global positioning system (GPS) to collect time-location data because they offer continuous tracking, high temporal resolution, and minimum reporting burden for participants. However, substantial uncertainties in the processing and classifying of raw GPS data create challenges for reliably characterizing time activity patterns. We developed and evaluated models to classify people's major time activity patterns from continuous GPS tracking data. We developed and evaluated two automated models to classify major time activity patterns (i.e., indoor, outdoor static, outdoor walking, and in-vehicle travel) based on GPS time activity data collected under free living conditions for 47 participants (N = 131 person-days) from the Harbor Communities Time Location Study (HCTLS) in 2008 and supplemental GPS data collected from three UC-Irvine research staff (N = 21 person-days) in 2010. Time activity patterns used for model development were manually classified by research staff using information from participant GPS recordings, activity logs, and follow-up interviews. We evaluated two models: (a) a rule-based model that developed user-defined rules based on time, speed, and spatial location, and (b) a random forest decision tree model. Indoor, outdoor static, outdoor walking and in-vehicle travel activities accounted for 82.7%, 6.1%, 3.2% and 7.2% of manually-classified time activities in the HCTLS dataset, respectively. The rule-based model classified indoor and in-vehicle travel periods reasonably well (Indoor: sensitivity > 91%, specificity > 80%, and precision > 96%; in-vehicle travel: sensitivity > 71%, specificity > 99%, and precision > 88%), but the performance was moderate for outdoor static and outdoor walking predictions. No striking differences in performance were observed between the rule-based and the random forest models. 
The random forest model was fast and easy to execute, but was likely less robust than the rule-based model under the condition of biased or poor quality training data. Our models can successfully identify indoor and in-vehicle travel points from the raw GPS data, but challenges remain in developing models to distinguish outdoor static points and walking. Accurate training data are essential in developing reliable models in classifying time-activity patterns.
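
    The sensitivity, specificity, and precision figures quoted in the abstract above can be derived from one-vs-rest confusion counts for each activity class. The following is a minimal Python sketch under stated assumptions: the function name `binary_metrics` and the counts are hypothetical and are not taken from the study.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, precision) for one class vs. the rest.

    tp/fp/tn/fn are counts of true positives, false positives,
    true negatives, and false negatives.
    """
    sensitivity = tp / (tp + fn)  # share of actual positives that were found
    specificity = tn / (tn + fp)  # share of actual negatives that were rejected
    precision = tp / (tp + fp)    # share of predicted positives that were right
    return sensitivity, specificity, precision


# Hypothetical counts of GPS points for an "indoor" vs. "not indoor" split.
sens, spec, prec = binary_metrics(tp=920, fp=30, tn=150, fn=80)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} precision={prec:.3f}")
```

    With a dominant class such as indoor time, precision and sensitivity can both be high while specificity stays lower, because few negatives are available to reject.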

  3. Estimation of Rice Crop Yields Using Random Forests in Taiwan

    NASA Astrophysics Data System (ADS)

    Chen, C. F.; Lin, H. S.; Nguyen, S. T.; Chen, C. R.

    2017-12-01

    Rice is globally one of the most important food crops, directly feeding more people than any other crops. Rice is not only the most important commodity, but also plays a critical role in the economy of Taiwan because it provides employment and income for large rural populations. The rice harvested area and production are thus monitored yearly due to the government's initiatives. Agronomic planners need such information for more precise assessment of food production to tackle issues of national food security and policymaking. This study aimed to develop a machine-learning approach using physical parameters to estimate rice crop yields in Taiwan. We processed the data for 2014 cropping seasons, following three main steps: (1) data pre-processing to construct input layers, including soil types and weather parameters (e.g., maxima and minima air temperature, precipitation, and solar radiation) obtained from meteorological stations across the country; (2) crop yield estimation using the random forests owing to its merits as it can process thousands of variables, estimate missing data, maintain the accuracy level when a large proportion of the data is missing, overcome most of over-fitting problems, and run fast and efficiently when handling large datasets; and (3) error verification. To execute the model, we separated the datasets into two groups of pixels: group-1 (70% of pixels) for training the model and group-2 (30% of pixels) for testing the model. Once the model is trained to produce small and stable out-of-bag error (i.e., the mean squared error between predicted and actual values), it can be used for estimating rice yields of cropping seasons. 
Comparison of the random forests-based regression results with the actual yield statistics indicated that the root mean square error (RMSE) and mean absolute error (MAE) for the first rice crop were 6.2% and 2.7%, respectively, while those for the second rice crop were 5.3% and 2.9%. Although there are several uncertainties attributed to the data quality of input layers, our study demonstrates the promising application of random forests for estimating rice crop yields at the national level in Taiwan. This approach could be transferable to other regions of the world for improving large-scale estimation of rice crop yields.
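
    The abstract above reports RMSE and MAE as percentages but does not say how they were normalized. The sketch below assumes one common convention, dividing each error by the mean observed value; the function name `percent_errors` and the yield figures are hypothetical.

```python
import math


def percent_errors(predicted, observed):
    """Return (RMSE %, MAE %) relative to the mean of the observed values.

    Assumption: percentages are errors divided by the mean observation.
    The source abstract does not state its normalization.
    """
    n = len(observed)
    mean_obs = sum(observed) / n
    rmse = math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)
    mae = sum(abs(p - o) for p, o in zip(predicted, observed)) / n
    return 100 * rmse / mean_obs, 100 * mae / mean_obs


# Hypothetical held-out yields (t/ha) and model predictions.
observed = [5.0, 6.0, 7.0]
predicted = [5.5, 6.0, 6.5]
rmse_pct, mae_pct = percent_errors(predicted, observed)
print(f"RMSE = {rmse_pct:.2f}%, MAE = {mae_pct:.2f}%")
```

    RMSE is never smaller than MAE on the same data, so a large gap between the two points to a few large errors rather than uniformly small ones.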

  4. Modeling species distribution and change using random forest [Chapter 8]

    Treesearch

    Jeffrey S. Evans; Melanie A. Murphy; Zachary A. Holden; Samuel A. Cushman

    2011-01-01

    Although inference is a critical component in ecological modeling, the balance between accurate predictions and inference is the ultimate goal in ecological studies (Peters 1991; De’ath 2007). Practical applications of ecology in conservation planning, ecosystem assessment, and bio-diversity are highly dependent on very accurate spatial predictions of...

  5. Development of a hybrid proximal sensing method for rapid identification of petroleum contaminated soils.

    PubMed

    Chakraborty, Somsubhra; Weindorf, David C; Li, Bin; Ali Aldabaa, Abdalsamad Abdalsatar; Ghosh, Rakesh Kumar; Paul, Sathi; Nasim Ali, Md

    2015-05-01

    Using 108 petroleum contaminated soil samples, this pilot study proposed a new analytical approach of combining visible near-infrared diffuse reflectance spectroscopy (VisNIR DRS) and portable X-ray fluorescence spectrometry (PXRF) for rapid and improved quantification of soil petroleum contamination. Results indicated that an advanced fused model, in which VisNIR DRS spectra-based penalized spline regression (PSR) was used to predict total petroleum hydrocarbon and PXRF elemental data-based random forest regression was then used to model the PSR residuals, outperformed (R² = 0.78, residual prediction deviation (RPD) = 2.19) all other models tested, even producing better generalization than using VisNIR DRS alone (RPDs of 1.64, 1.86, and 1.96 for random forest, penalized spline regression, and partial least squares regression, respectively). Additionally, unsupervised principal component analysis using the PXRF+VisNIR DRS system qualitatively separated contaminated soils from control samples. Fusion of PXRF elemental data and VisNIR derivative spectra produced an optimized model for total petroleum hydrocarbon quantification in soils. Copyright © 2015 Elsevier B.V. All rights reserved.

  6. Faster Trees: Strategies for Accelerated Training and Prediction of Random Forests for Classification of Polsar Images

    NASA Astrophysics Data System (ADS)

    Hänsch, Ronny; Hellwich, Olaf

    2018-04-01

    Random Forests have continuously proven to be one of the most accurate, robust, as well as efficient methods for the supervised classification of images in general and polarimetric synthetic aperture radar data in particular. While the majority of previous work focuses on improving classification accuracy, we aim for accelerating the training of the classifier as well as its usage during prediction while maintaining its accuracy. Unlike other approaches we mainly consider algorithmic changes to stay as much as possible independent of platform and programming language. The final model achieves an approximately 60 times faster training and a 500 times faster prediction, while the accuracy is only marginally decreased by roughly 1%.

  7. Analysis and Recognition of Traditional Chinese Medicine Pulse Based on the Hilbert-Huang Transform and Random Forest in Patients with Coronary Heart Disease

    PubMed Central

    Wang, Yiqin; Yan, Hanxia; Yan, Jianjun; Yuan, Fengyin; Xu, Zhaoxia; Liu, Guoping; Xu, Wenjie

    2015-01-01

    Objective. This research provides objective and quantitative parameters of the traditional Chinese medicine (TCM) pulse conditions for distinguishing between patients with the coronary heart disease (CHD) and normal people by using the proposed classification approach based on Hilbert-Huang transform (HHT) and random forest. Methods. The energy and the sample entropy features were extracted by applying the HHT to TCM pulse by treating these pulse signals as time series. By using the random forest classifier, the extracted two types of features and their combination were, respectively, used as input data to establish classification model. Results. Statistical results showed that there were significant differences in the pulse energy and sample entropy between the CHD group and the normal group. Moreover, the energy features, sample entropy features, and their combination were inputted as pulse feature vectors; the corresponding average recognition rates were 84%, 76.35%, and 90.21%, respectively. Conclusion. The proposed approach could be appropriately used to analyze pulses of patients with CHD, which can lay a foundation for research on objective and quantitative criteria on disease diagnosis or Zheng differentiation. PMID:26180536

  8. Analysis and Recognition of Traditional Chinese Medicine Pulse Based on the Hilbert-Huang Transform and Random Forest in Patients with Coronary Heart Disease.

    PubMed

    Guo, Rui; Wang, Yiqin; Yan, Hanxia; Yan, Jianjun; Yuan, Fengyin; Xu, Zhaoxia; Liu, Guoping; Xu, Wenjie

    2015-01-01

    Objective. This research provides objective and quantitative parameters of the traditional Chinese medicine (TCM) pulse conditions for distinguishing between patients with the coronary heart disease (CHD) and normal people by using the proposed classification approach based on Hilbert-Huang transform (HHT) and random forest. Methods. The energy and the sample entropy features were extracted by applying the HHT to TCM pulse by treating these pulse signals as time series. By using the random forest classifier, the extracted two types of features and their combination were, respectively, used as input data to establish classification model. Results. Statistical results showed that there were significant differences in the pulse energy and sample entropy between the CHD group and the normal group. Moreover, the energy features, sample entropy features, and their combination were inputted as pulse feature vectors; the corresponding average recognition rates were 84%, 76.35%, and 90.21%, respectively. Conclusion. The proposed approach could be appropriately used to analyze pulses of patients with CHD, which can lay a foundation for research on objective and quantitative criteria on disease diagnosis or Zheng differentiation.

  9. Analysis of landslide hazard area in Ludian earthquake based on Random Forests

    NASA Astrophysics Data System (ADS)

    Xie, J.-C.; Liu, R.; Li, H.-W.; Lai, Z.-L.

    2015-04-01

    With the development of machine learning theory, more and more algorithms are being evaluated for seismic landslides. After the Ludian earthquake, the research team combined the special geological structure of the Ludian area with the seismic field exploration results and selected slope (PODU), river distance (HL), fault distance (DC), seismic intensity (LD), the digital elevation model (DEM), and the normalized difference vegetation index (NDVI), derived from remote sensing images, as evaluation factors. Because the relationships among these factors are fuzzy and the data are noisy and high-dimensional, we introduce the random forest algorithm to tolerate these difficulties and obtain an evaluation of the Ludian landslide areas. To verify the accuracy of the result, we used ROC graphs as the evaluation standard: the AUC was 0.918, and the random forest's generalization error rate, estimated out-of-bag (OOB), decreased as the number of classification trees increased, reaching an ideal value of 0.08. Studying the final landslide inversion results, the paper comes to the statistical conclusion that nearly 80% of all landslides and dilapidations are in areas with high or moderate susceptibility, showing that the forecast results are reasonable and can be adopted.
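
    The AUC used as the evaluation standard above can be read as the probability that a randomly chosen landslide cell receives a higher susceptibility score than a randomly chosen non-landslide cell. The sketch below computes it by direct pairwise comparison; the function name `roc_auc`, the scores, and the labels are hypothetical and are not data from the study.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise ranking.

    scores: predicted susceptibility per cell (higher = more likely landslide)
    labels: 1 for an observed landslide, 0 otherwise
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores for five map cells.
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0]
print(f"AUC = {roc_auc(scores, labels):.3f}")
```

    The pairwise form is quadratic in the number of cells, so it suits small illustrations; sorting-based implementations scale better on full maps.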

  10. Comparisons between physics-based, engineering, and statistical learning models for outdoor sound propagation.

    PubMed

    Hart, Carl R; Reznicek, Nathan J; Wilson, D Keith; Pettit, Chris L; Nykaza, Edward T

    2016-05-01

    Many outdoor sound propagation models exist, ranging from highly complex physics-based simulations to simplified engineering calculations, and more recently, highly flexible statistical learning methods. Several engineering and statistical learning models are evaluated by using a particular physics-based model, namely, a Crank-Nicholson parabolic equation (CNPE), as a benchmark. Narrowband transmission loss values predicted with the CNPE, based upon a simulated data set of meteorological, boundary, and source conditions, act as simulated observations. In the simulated data set sound propagation conditions span from downward refracting to upward refracting, for acoustically hard and soft boundaries, and low frequencies. Engineering models used in the comparisons include the ISO 9613-2 method, Harmonoise, and Nord2000 propagation models. Statistical learning methods used in the comparisons include bagged decision tree regression, random forest regression, boosting regression, and artificial neural network models. Computed skill scores are relative to sound propagation in a homogeneous atmosphere over a rigid ground. Overall skill scores for the engineering noise models are 0.6%, -7.1%, and 83.8% for the ISO 9613-2, Harmonoise, and Nord2000 models, respectively. Overall skill scores for the statistical learning models are 99.5%, 99.5%, 99.6%, and 99.6% for bagged decision tree, random forest, boosting, and artificial neural network regression models, respectively.
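
    The abstract above reports skill scores relative to a reference case but does not give the formula. The sketch below assumes a common mean-squared-error form, skill = 1 - MSE(model) / MSE(reference), under which a score can be negative when a model does worse than the reference. The function names and the transmission loss values are hypothetical.

```python
def mse(predicted, benchmark):
    """Mean squared error of predictions against benchmark values."""
    return sum((p - b) ** 2 for p, b in zip(predicted, benchmark)) / len(benchmark)


def skill_score(model, reference, benchmark):
    """Skill of `model` relative to `reference`, both judged against `benchmark`.

    Assumed form: 1 - MSE(model) / MSE(reference).
    1.0 is a perfect match, 0.0 is no better than the reference,
    and negative values are worse than the reference.
    """
    return 1.0 - mse(model, benchmark) / mse(reference, benchmark)


# Hypothetical transmission loss values (dB) at three receivers.
benchmark = [10.0, 20.0, 30.0]   # stand-in for the physics-based benchmark
reference = [15.0, 15.0, 15.0]   # stand-in for the simple reference case
model = [11.0, 19.0, 31.0]       # stand-in for a statistical learning model
print(f"skill = {100 * skill_score(model, reference, benchmark):.1f}%")
```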

  11. Lessons learned while integrating habitat, dispersal, disturbance, and life-history traits into species habitat models under climate change

    Treesearch

    Louis R. Iverson; Anantha M. Prasad; Stephen N. Matthews; Matthew P. Peters

    2011-01-01

    We present an approach to modeling potential climate-driven changes in habitat for tree and bird species in the eastern United States. First, we took an empirical-statistical modeling approach, using randomForest, with species abundance data from national inventories combined with soil, climate, and landscape variables, to build abundance-based habitat models for 134...

  12. Modelling Variable Fire Severity in Boreal Forests: Effects of Fire Intensity and Stand Structure

    PubMed Central

    Miquelajauregui, Yosune; Cumming, Steven G.; Gauthier, Sylvie

    2016-01-01

    It is becoming clear that fires in boreal forests are not uniformly stand-replacing. On the contrary, marked variation in fire severity, measured as tree mortality, has been found both within and among individual fires. It is important to understand the conditions under which this variation can arise. We integrated forest sample plot data, tree allometries and historical forest fire records within a diameter class-structured model of 1.0 ha patches of mono-specific black spruce and jack pine stands in northern Québec, Canada. The model accounts for crown fire initiation and vertical spread into the canopy. It uses empirical relations between fire intensity, scorch height, the percent of crown scorched and tree mortality to simulate fire severity, specifically the percent reduction in patch basal area due to fire-caused mortality. A random forest and a regression tree analysis of a large random sample of simulated fires were used to test for an effect of fireline intensity, stand structure, species composition and pyrogeographic regions on resultant severity. Severity increased with intensity and was lower for jack pine stands. The proportion of simulated fires that burned at high severity (e.g. >75% reduction in patch basal area) was 0.80 for black spruce and 0.11 for jack pine. We identified thresholds in intensity below which there was a marked sensitivity of simulated fire severity to stand structure, and to interactions between intensity and structure. We found no evidence for a residual effect of pyrogeographic region on simulated severity, after the effects of stand structure and species composition were accounted for. The model presented here was able to produce variation in fire severity under a range of fire intensity conditions. This suggests that variation in stand structure is one of the factors causing the observed variation in boreal fire severity. PMID:26919456

  13. Modelling Variable Fire Severity in Boreal Forests: Effects of Fire Intensity and Stand Structure.

    PubMed

    Miquelajauregui, Yosune; Cumming, Steven G; Gauthier, Sylvie

    2016-01-01

    It is becoming clear that fires in boreal forests are not uniformly stand-replacing. On the contrary, marked variation in fire severity, measured as tree mortality, has been found both within and among individual fires. It is important to understand the conditions under which this variation can arise. We integrated forest sample plot data, tree allometries and historical forest fire records within a diameter class-structured model of 1.0 ha patches of mono-specific black spruce and jack pine stands in northern Québec, Canada. The model accounts for crown fire initiation and vertical spread into the canopy. It uses empirical relations between fire intensity, scorch height, the percent of crown scorched and tree mortality to simulate fire severity, specifically the percent reduction in patch basal area due to fire-caused mortality. A random forest and a regression tree analysis of a large random sample of simulated fires were used to test for an effect of fireline intensity, stand structure, species composition and pyrogeographic regions on resultant severity. Severity increased with intensity and was lower for jack pine stands. The proportion of simulated fires that burned at high severity (e.g. >75% reduction in patch basal area) was 0.80 for black spruce and 0.11 for jack pine. We identified thresholds in intensity below which there was a marked sensitivity of simulated fire severity to stand structure, and to interactions between intensity and structure. We found no evidence for a residual effect of pyrogeographic region on simulated severity, after the effects of stand structure and species composition were accounted for. The model presented here was able to produce variation in fire severity under a range of fire intensity conditions. This suggests that variation in stand structure is one of the factors causing the observed variation in boreal fire severity.

  14. Extracting Tree Height from Repeat-Pass PolInSAR Data: Experiments with JPL and ESA Airborne Systems

    NASA Technical Reports Server (NTRS)

    Lavalle, Marco; Ahmed, Razi; Neumann, Maxim; Hensley, Scott

    2013-01-01

    In this paper we present our latest developments and experiments with the random-motion-over-ground (RMoG) model used to extract canopy height and other important forest parameters from repeat-pass polarimetric-interferometric SAR (Pol-InSAR) data. More specifically, we summarize the key features of the RMoG model in contrast with the random-volume-over-ground (RVoG) model, describe in detail a possible inversion scheme for the RMoG model and illustrate the results of the RMoG inversion using airborne data collected by the Jet Propulsion Laboratory (JPL) and the European Space Agency (ESA).

  15. Community turnover of wood-inhabiting fungi across hierarchical spatial scales.

    PubMed

    Abrego, Nerea; García-Baquero, Gonzalo; Halme, Panu; Ovaskainen, Otso; Salcedo, Isabel

    2014-01-01

    For efficient use of conservation resources it is important to determine how species diversity changes across spatial scales. In many poorly known species groups little is known about at which spatial scales the conservation efforts should be focused. Here we examined how the community turnover of wood-inhabiting fungi is realised at three hierarchical levels, and how much of community variation is explained by variation in resource composition and spatial proximity. The hierarchical study design consisted of management type (fixed factor), forest site (random factor, nested within management type) and study plots (randomly placed plots within each study site). To examine how species richness varied across the three hierarchical scales, randomized species accumulation curves and additive partitioning of species richness were applied. To analyse variation in wood-inhabiting species and dead wood composition at each scale, linear and Permanova modelling approaches were used. Wood-inhabiting fungal communities were dominated by rare and infrequent species. The similarity of fungal communities was higher within sites and within management categories than among sites or between the two management categories, and it decreased with increasing distance among the sampling plots and with decreasing similarity of dead wood resources. However, only a small part of community variation could be explained by these factors. The species present in managed forests were in a large extent a subset of those species present in natural forests. Our results suggest that in particular the protection of rare species requires a large total area. As managed forests have only little additional value complementing the diversity of natural forests, the conservation of natural forests is the key to ecologically effective conservation. 
As the dissimilarity of fungal communities increases with distance, the conserved natural forest sites should be broadly distributed in space, yet the individual conserved areas should be large enough to ensure local persistence.

  16. Community Turnover of Wood-Inhabiting Fungi across Hierarchical Spatial Scales

    PubMed Central

    Abrego, Nerea; García-Baquero, Gonzalo; Halme, Panu; Ovaskainen, Otso; Salcedo, Isabel

    2014-01-01

    For efficient use of conservation resources it is important to determine how species diversity changes across spatial scales. In many poorly known species groups little is known about at which spatial scales the conservation efforts should be focused. Here we examined how the community turnover of wood-inhabiting fungi is realised at three hierarchical levels, and how much of community variation is explained by variation in resource composition and spatial proximity. The hierarchical study design consisted of management type (fixed factor), forest site (random factor, nested within management type) and study plots (randomly placed plots within each study site). To examine how species richness varied across the three hierarchical scales, randomized species accumulation curves and additive partitioning of species richness were applied. To analyse variation in wood-inhabiting species and dead wood composition at each scale, linear and Permanova modelling approaches were used. Wood-inhabiting fungal communities were dominated by rare and infrequent species. The similarity of fungal communities was higher within sites and within management categories than among sites or between the two management categories, and it decreased with increasing distance among the sampling plots and with decreasing similarity of dead wood resources. However, only a small part of community variation could be explained by these factors. The species present in managed forests were in a large extent a subset of those species present in natural forests. Our results suggest that in particular the protection of rare species requires a large total area. As managed forests have only little additional value complementing the diversity of natural forests, the conservation of natural forests is the key to ecologically effective conservation. 
As the dissimilarity of fungal communities increases with distance, the conserved natural forest sites should be broadly distributed in space, yet the individual conserved areas should be large enough to ensure local persistence. PMID:25058128

  17. A machine learning method to estimate PM2.5 concentrations across China with remote sensing, meteorological and land use information.

    PubMed

    Chen, Gongbo; Li, Shanshan; Knibbs, Luke D; Hamm, N A S; Cao, Wei; Li, Tiantian; Guo, Jianping; Ren, Hongyan; Abramson, Michael J; Guo, Yuming

    2018-09-15

    Machine learning algorithms have very high predictive ability. However, no study has used machine learning to estimate historical concentrations of PM2.5 (particulate matter with aerodynamic diameter ≤ 2.5 μm) at daily time scale in China at a national level. This study aimed to estimate daily concentrations of PM2.5 across China during 2005-2016. Daily ground-level PM2.5 data were obtained from 1479 stations across China during 2014-2016. Data on aerosol optical depth (AOD), meteorological conditions and other predictors were downloaded. A random forests model (a non-parametric machine learning algorithm) and two traditional regression models were developed to estimate ground-level PM2.5 concentrations. The best-fit model was then utilized to estimate the daily concentrations of PM2.5 across China with a resolution of 0.1° (≈10 km) during 2005-2016. The daily random forests model showed much higher predictive accuracy than the other two traditional regression models, explaining the majority of spatial variability in daily PM2.5 [10-fold cross-validation (CV) R² = 83%, root mean squared prediction error (RMSE) = 28.1 μg/m³]. At the monthly and annual time-scale, the explained variability of average PM2.5 increased up to 86% (RMSE = 10.7 μg/m³ and 6.9 μg/m³, respectively). Taking advantage of a novel application of modeling framework and the most recent ground-level PM2.5 observations, the machine learning method showed higher predictive ability than previous studies. The random forests approach can be used to estimate historical exposure to PM2.5 in China with high accuracy. Copyright © 2018 Elsevier B.V. All rights reserved.
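
    The 10-fold cross-validation mentioned above holds out each tenth of the data once while the model is fitted on the remaining nine tenths. The sketch below shows only the index bookkeeping and assumes folds are drawn by shuffling individual samples; the abstract does not say whether the study split by sample, by station, or by day. The function name `k_fold_indices` is hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds.

    Assumption: samples are shuffled individually and dealt into folds,
    so every sample is held out exactly once.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # fold sizes differ by at most 1
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


# Hypothetical data set of 25 daily observations.
for fold_number, (train, test) in enumerate(k_fold_indices(25, k=10), start=1):
    print(f"fold {fold_number}: {len(train)} training, {len(test)} held out")
```

    When observations from the same monitoring station are correlated, grouping folds by station rather than by sample gives a stricter test of spatial prediction.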

  18. Adapting GNU random forest program for Unix and Windows

    NASA Astrophysics Data System (ADS)

    Jirina, Marcel; Krayem, M. Said; Jirina, Marcel, Jr.

    2013-10-01

    The Random Forest is a well-known method and also a program for data clustering and classification. Unfortunately, the original Random Forest program is rather difficult to use. Here we describe a new version of this program originally written in Fortran 77. The modified program in Fortran 95 needs to be compiled only once and information for different tasks is passed with help of arguments. The program was tested with 24 data sets from UCI MLR and results are available on the net.

  19. Implications of allometric model selection for county-level biomass mapping.

    PubMed

    Duncanson, Laura; Huang, Wenli; Johnson, Kristofer; Swatantran, Anu; McRoberts, Ronald E; Dubayah, Ralph

    2017-10-18

    Carbon accounting in forests remains a large area of uncertainty in the global carbon cycle. Forest aboveground biomass is therefore an attribute of great interest for the forest management community, but the accuracy of aboveground biomass maps depends on the accuracy of the underlying field estimates used to calibrate models. These field estimates depend on the application of allometric models, which often have unknown and unreported uncertainties outside of the size class or environment in which they were developed. Here, we test three popular allometric approaches to field biomass estimation, and explore the implications of allometric model selection for county-level biomass mapping in Sonoma County, California. We test three allometric models: Jenkins et al. (For Sci 49(1): 12-35, 2003), Chojnacky et al. (Forestry 87(1): 129-151, 2014) and the US Forest Service's Component Ratio Method (CRM). We found that Jenkins and Chojnacky models perform comparably, but that at both a field plot level and a total county level there was a ~ 20% difference between these estimates and the CRM estimates. Further, we show that discrepancies are greater in high biomass areas with high canopy covers and relatively moderate heights (25-45 m). The CRM models, although on average ~ 20% lower than Jenkins and Chojnacky, produce higher estimates in the tallest forests samples (> 60 m), while Jenkins generally produces higher estimates of biomass in forests < 50 m tall. Discrepancies do not continually increase with increasing forest height, suggesting that inclusion of height in allometric models is not primarily driving discrepancies. Models developed using all three allometric models underestimate high biomass and overestimate low biomass, as expected with random forest biomass modeling. However, these deviations were generally larger using the Jenkins and Chojnacky allometries, suggesting that the CRM approach may be more appropriate for biomass mapping with lidar. 
These results confirm that allometric model selection considerably impacts biomass maps and estimates, and that allometric model errors remain poorly understood. Our findings that allometric model discrepancies are not explained by lidar heights suggests that allometric model form does not drive these discrepancies. A better understanding of the sources of allometric model errors, particularly in high biomass systems, is essential for improved forest biomass mapping.

  20. Integrating invasive grasses into carbon cycle projections: Cogongrass spread in southern pine forests

    NASA Astrophysics Data System (ADS)

    McCabe, T. D.; Flory, S. L.; Wiesner, S.; Dietze, M.

    2017-12-01

    Forested ecosystems are currently being disrupted by invasive species. One example is the invasive grass Imperata cylindrica (cogongrass), which is widespread in southeastern US pine forests. Pine forests dominate the forest cover of the southeast, and contribute to making the Southeast the United States' largest carbon sink. Cogongrass decreases the colonization of loblolly pine fine roots. If cogongrass continues to invade, this sink could be jeopardized. However, the effects of cogongrass invasion on carbon sequestration are largely unknown. We have projected the effects of elevated CO2 and changing climate on future cogongrass invasion. To test how pine stands are affected by cogongrass, cogongrass invasions were modeled using the Ecosystem Demography 2 (ED2) model, and parameterized using the Predictive Ecosystem Analyzer (PEcAn). ED2 takes into account local meteorological data, stand populations and succession, disturbance, and geochemical pools. PEcAn is a workflow that uses Bayesian sensitivity analyses and variance decomposition to quantify the uncertainty that each parameter contributes to overall model uncertainty. ED2 was run for four NEON and Ameriflux sites in the Southeast from the earliest available census of the site into 2010. These model results were compared to site measures to test for model accuracy and bias. To project the effect of elevated CO2 on cogongrass invasions, ED was run from 2006-2100 at four sites under four separate scenarios: 1) RCP4.5 CO2 and climate, 2) RCP4.5 climate only, with constant CO2 concentrations, 3) RCP4.5 Elevated CO2 only, with climate randomly selected from 2006-2026, 4) Present Day, made from randomly selected measures of CO2 and radiation from 2006-2026. Each scenario was run three times; once with cogongrass absent, once with a low cogongrass abundance, and once with a high cogongrass abundance. Model results suggest that many relevant parameters have high uncertainty due to lack of measurement. 
Further field work quantifying the carbon cycle, particularly belowground processes and respiration, could help constrain parameter uncertainty.
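
    The factorial design described above (sites by scenarios by cogongrass abundance levels) can be enumerated as a quick check on the number of projection runs. This is only a minimal sketch: the site labels are placeholders, since the abstract does not name the sites, and the scenario labels are abbreviated.

```python
from itertools import product

# Placeholder site labels: the abstract says four sites but does not name them.
sites = ["site_1", "site_2", "site_3", "site_4"]

# Abbreviated labels for the four scenarios listed in the abstract.
scenarios = [
    "CO2 and climate",
    "climate only, constant CO2",
    "elevated CO2 only",
    "present day",
]

# Each scenario is run with cogongrass absent, at low and at high abundance.
abundances = ["absent", "low", "high"]

# One projection run per (site, scenario, abundance) combination.
runs = list(product(sites, scenarios, abundances))
print(len(runs))  # 4 sites * 4 scenarios * 3 abundance levels = 48
```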

  1. Modeling nitrate at domestic and public-supply well depths in the Central Valley, California

    USGS Publications Warehouse

    Nolan, Bernard T.; Gronberg, JoAnn M.; Faunt, Claudia C.; Eberts, Sandra M.; Belitz, Ken

    2014-01-01

    Aquifer vulnerability models were developed to map groundwater nitrate concentration at domestic and public-supply well depths in the Central Valley, California. We compared three modeling methods for ability to predict nitrate concentration >4 mg/L: logistic regression (LR), random forest classification (RFC), and random forest regression (RFR). All three models indicated processes of nitrogen fertilizer input at the land surface, transmission through coarse-textured, well-drained soils, and transport in the aquifer to the well screen. The total percent correct predictions were similar among the three models (69–82%), but RFR had greater sensitivity (84% for shallow wells and 51% for deep wells). The results suggest that RFR can better identify areas with high nitrate concentration but that LR and RFC may better describe bulk conditions in the aquifer. A unique aspect of the modeling approach was inclusion of outputs from previous, physically based hydrologic and textural models as predictor variables, which were important to the models. Vertical water fluxes in the aquifer and percent coarse material above the well screen were ranked moderately high-to-high in the RFR models, and the average vertical water flux during the irrigation season was highly significant (p < 0.0001) in logistic regression.
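
    The two metrics compared in this abstract, total percent correct and sensitivity, can both be derived from a confusion matrix. The sketch below uses made-up counts (not values from the study) to illustrate how a model can have high sensitivity while its overall percent correct stays moderate.

```python
def sensitivity(tp, fn):
    """Fraction of true exceedances (e.g. nitrate > 4 mg/L) that the model flags."""
    return tp / (tp + fn)


def percent_correct(tp, tn, fp, fn):
    """Share of all predictions, exceedance or not, that are correct."""
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)


# Hypothetical counts for illustration only.
tp, fn, fp, tn = 42, 8, 20, 30

print(sensitivity(tp, fn))              # 42 / 50 = 0.84
print(percent_correct(tp, tn, fp, fn))  # 72 / 100 correct = 72.0
```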

  2. Using occupancy models of forest breeding birds to prioritize conservation planning

    USGS Publications Warehouse

    De Wan, A. A.; Sullivan, P.J.; Lembo, A.J.; Smith, C.R.; Maerz, J.C.; Lassoie, J.P.; Richmond, M.E.

    2009-01-01

    As urban development continues to encroach on the natural and rural landscape, land-use planners struggle to identify high priority conservation areas for protection. Although knowing where urban-sensitive species may be occurring on the landscape would facilitate conservation planning, research efforts are often not sufficiently designed to make quality predictions at unknown locations. Recent advances in occupancy modeling allow for more precise estimates of occupancy by accounting for differences in detectability. We applied these techniques to produce robust estimates of habitat occupancy for a subset of forest breeding birds, a group that has been shown to be sensitive to urbanization, in a rapidly urbanizing yet biologically diverse region of New York State. We found that detection probability ranged widely across species, from 0.05 to 0.8. Our models suggest that detection probability declined with increasing forest fragmentation. We also found that the probability of occupancy of forest breeding birds is negatively influenced by increasing perimeter-area ratio of forest fragments and urbanization in the surrounding habitat matrix. We capitalized on our random sampling design to produce spatially explicit models that predict high priority conservation areas across the entire region, where interior-species were most likely to occur. Finally, we use our predictive maps to demonstrate how a strict sampling design coupled with occupancy modeling can be a valuable tool for prioritizing biodiversity conservation in land-use planning. © 2009 Elsevier Ltd.

  3. Subpixel urban land cover estimation: comparing cubist, random forests, and support vector regression

    Treesearch

    Jeffrey T. Walton

    2008-01-01

    Three machine learning subpixel estimation methods (Cubist, Random Forests, and support vector regression) were applied to estimate urban cover. Urban forest canopy cover and impervious surface cover were estimated from Landsat-7 ETM+ imagery using a higher resolution cover map resampled to 30 m as training and reference data. Three different band combinations (...

  4. Effects of road network on diversiform forest cover changes in the highest coverage region in China: An analysis of sampling strategies.

    PubMed

    Hu, Xisheng; Wu, Zhilong; Wu, Chengzhen; Ye, Limin; Lan, Chaofeng; Tang, Kun; Xu, Lu; Qiu, Rongzu

    2016-09-15

    Forest cover changes are of global concern due to their roles in global warming and biodiversity. However, many previous studies have ignored the fact that forest loss and forest gain are different processes that may respond to distinct factors by stressing forest loss more than gain or viewing forest cover change as a whole. It behooves us to carefully examine the patterns and drivers of the change by subdividing it into several categories. Our study includes areas of forest loss (4.8% of the study area), forest gain (1.3% of the study area) and forest loss and gain (2.0% of the study area) from 2000 to 2012 in Fujian Province, China. In the study area, approximately 65% and 90% of these changes occurred within 2000 m of the nearest road and under road densities of 0.6 km/km², respectively. We compared two sampling techniques (systematic sampling and random sampling) and four intensities for each technique to investigate the driving patterns underlying the changes using multinomial logistic regression. The results indicated the lack of pronounced differences in the regressions between the two sampling designs, although the sample size had a great impact on the regression outcome. The application of multi-model inference indicated that the low level road density had a negative significant association with forest loss and forest loss and gain, the expressway density had a positive significant impact on forest loss, and the road network was insignificantly related to forest gain. The model including socioeconomic and biophysical variables illuminated potentially different predictors of the different forest change categories. Moreover, the multiple comparisons tested by Fisher's least significant difference (LSD) were a good compensation for the multinomial logistic model to enrich the interpretation of the regression results. Copyright © 2016 Elsevier B.V. All rights reserved.

  5. Disaggregating Census Data for Population Mapping Using Random Forests with Remotely-Sensed and Ancillary Data

    PubMed Central

    Stevens, Forrest R.; Gaughan, Andrea E.; Linard, Catherine; Tatem, Andrew J.

    2015-01-01

    High resolution, contemporary data on human population distributions are vital for measuring impacts of population growth, monitoring human-environment interactions and for planning and policy development. Many methods are used to disaggregate census data and predict population densities for finer scale, gridded population data sets. We present a new semi-automated dasymetric modeling approach that incorporates detailed census and ancillary data in a flexible, “Random Forest” estimation technique. We outline the combination of widely available, remotely-sensed and geospatial data that contribute to the modeled dasymetric weights and then use the Random Forest model to generate a gridded prediction of population density at ~100 m spatial resolution. This prediction layer is then used as the weighting surface to perform dasymetric redistribution of the census counts at a country level. As a case study we compare the new algorithm and its products for three countries (Vietnam, Cambodia, and Kenya) with other common gridded population data production methodologies. We discuss the advantages of the new method and increases over the accuracy and flexibility of those previous approaches. Finally, we outline how this algorithm will be extended to provide freely-available gridded population data sets for Africa, Asia and Latin America. PMID:25689585
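
    The dasymetric redistribution step described above can be sketched as a proportional allocation: each census unit's count is spread over its grid cells according to a weighting surface. In the abstract the weights come from the Random Forest density prediction; in this illustrative sketch they are hard-coded hypothetical values, and the even-split fallback for all-zero weights is an assumption rather than something the abstract specifies.

```python
def dasymetric_redistribute(census_count, weights):
    """Spread one census unit's count over its grid cells in proportion to weights."""
    total = sum(weights)
    if total == 0:
        # Assumed fallback: with no weight information, split the count evenly.
        return [census_count / len(weights)] * len(weights)
    return [census_count * w / total for w in weights]


# Hypothetical census unit with 1000 people and three grid cells.
cells = dasymetric_redistribute(1000, [1.0, 3.0, 6.0])
print(cells)       # [100.0, 300.0, 600.0]
print(sum(cells))  # the census total is preserved: 1000.0
```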

  6. Prediction of Return-to-original-work after an Industrial Accident Using Machine Learning and Comparison of Techniques

    PubMed Central

    2018-01-01

    Background: Many studies have tried to develop predictors for return-to-work (RTW). However, since complex factors have been demonstrated to predict RTW, it is difficult to use them practically. This study investigated whether factors used in previous studies could predict whether an individual had returned to his/her original work by four years after termination of the worker's recovery period. Methods: An initial logistic regression analysis of 1,567 participants of the fourth Panel Study of Worker's Compensation Insurance yielded odds ratios. The participants were divided into two subsets, a training dataset and a test dataset. Using the training dataset, logistic regression, decision tree, random forest, and support vector machine models were established, and important variables of each model were identified. The predictive abilities of the different models were compared. Results: The analysis showed that only earned income and company-related factors significantly affected return-to-original-work (RTOW). The random forest model showed the best accuracy among the tested machine learning models; however, the difference was not prominent. Conclusion: It is possible to predict a worker's probability of RTOW using machine learning techniques with moderate accuracy. PMID:29736160

  7. Assessing accuracy of point fire intervals across landscapes with simulation modelling

    Treesearch

    Russell A. Parsons; Emily K. Heyerdahl; Robert E. Keane; Brigitte Dorner; Joseph Fall

    2007-01-01

    We assessed accuracy in point fire intervals using a simulation model that sampled four spatially explicit simulated fire histories. These histories varied in fire frequency and size and were simulated on a flat landscape with two forest types (dry versus mesic). We used three sampling designs (random, systematic grids, and stratified). We assessed the sensitivity of...

  8. Estimating potential habitat for 134 eastern US tree species under six climate scenarios

    Treesearch

    Louis R. Iverson; Anantha M. Prasad; Stephen N. Matthews; Matthew Peters

    2008-01-01

    We modeled and mapped, using the predictive data mining tool Random Forests, 134 tree species from the eastern United States for potential response to several scenarios of climate change. Each species was modeled individually to show current and potential future habitats according to two emission scenarios (high emissions on current trajectory and reasonable...

  9. Modeling lake trophic state: a random forest approach

    EPA Science Inventory

    Productivity of lentic ecosystems has been well studied and it is widely accepted that as nutrient inputs increase, productivity increases and lakes transition from low trophic state (e.g. oligotrophic) to higher trophic states (e.g. eutrophic). These broad trophic state classi...

  10. Modelling past land use using archaeological and pollen data

    NASA Astrophysics Data System (ADS)

    Pirzamanbein, Behnaz; Lindström, Johan; Poska, Anneli; Gaillard-Lemdahl, Marie-José

    2016-04-01

    Accurate maps of past land use are necessary for studying the impact of anthropogenic land-cover changes on climate and biodiversity. We develop a Bayesian hierarchical model to reconstruct the land use using Gaussian Markov random fields. The model uses two observation sets: 1) archaeological data, representing human settlements, urbanization and agricultural findings; and 2) pollen-based land estimates of the three land-cover types Coniferous forest, Broadleaved forest and Unforested/Open land. The pollen-based estimates are obtained from the REVEALS model, based on pollen counts from lakes and bogs. Our developed model uses the sparse pollen-based estimations to reconstruct the spatially continuous cover of three land cover types. Using the open-land component and the archaeological data, the extent of land-use is reconstructed. The model is applied to three time periods, centred around 1900 CE, 1000 and 4000 BCE, over Sweden, for which both pollen-based estimates and archaeological data are available. To estimate the model parameters and land use, a block updated Markov chain Monte Carlo (MCMC) algorithm is applied. Using the MCMC posterior samples, uncertainties in land-use predictions are computed. Due to lack of good historic land use data, model results are evaluated by cross-validation. Keywords: Spatial reconstruction, Gaussian Markov random field, Fossil pollen records, Archaeological data, Human land-use, Prediction uncertainty

  11. Random Forest-Based Recognition of Isolated Sign Language Subwords Using Data from Accelerometers and Surface Electromyographic Sensors.

    PubMed

    Su, Ruiliang; Chen, Xiang; Cao, Shuai; Zhang, Xu

    2016-01-14

    Sign language recognition (SLR) has been widely used for communication amongst the hearing-impaired and non-verbal community. This paper proposes an accurate and robust SLR framework using an improved decision tree as the base classifier of random forests. This framework was used to recognize Chinese sign language subwords using recordings from a pair of portable devices worn on both arms consisting of accelerometers (ACC) and surface electromyography (sEMG) sensors. The experimental results demonstrated the validity of the proposed random forest-based method for recognition of Chinese sign language (CSL) subwords. With the proposed method, 98.25% average accuracy was obtained for the classification of a list of 121 frequently used CSL subwords. Moreover, the random forests method demonstrated a superior performance in resisting the impact of bad training samples. When the proportion of bad samples in the training set reached 50%, the recognition error rate of the random forest-based method was only 10.67%, while that of a single decision tree adopted in our previous work was almost 27.5%. Our study offers a practical way of realizing a robust and wearable EMG-ACC-based SLR system.
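
    One common intuition for why a forest tolerates bad training samples better than a single tree is that the final label is a majority vote, so a minority of misled trees can be outvoted. The sketch below shows only that generic voting step with made-up labels; it is not the improved base classifier proposed in the paper.

```python
from collections import Counter


def forest_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical votes from five trees; two were misled by bad samples.
votes = ["subword_A", "subword_A", "subword_B", "subword_A", "subword_C"]
print(forest_vote(votes))  # subword_A
```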

  12. Pseudo CT estimation from MRI using patch-based random forest

    NASA Astrophysics Data System (ADS)

    Yang, Xiaofeng; Lei, Yang; Shu, Hui-Kuo; Rossi, Peter; Mao, Hui; Shim, Hyunsuk; Curran, Walter J.; Liu, Tian

    2017-02-01

    Recently, MR simulators have gained popularity because CT simulators used in radiation therapy planning involve unnecessary radiation exposure. We propose a method for pseudo CT estimation from MR images based on a patch-based random forest. Patient-specific anatomical features are extracted from the aligned training images and adopted as signatures for each voxel. The most robust and informative features are identified using feature selection to train the random forest. The well-trained random forest is used to predict the pseudo CT of a new patient. This prediction technique was tested with human brain images and the prediction accuracy was assessed using the original CT images. Peak signal-to-noise ratio (PSNR) and feature similarity (FSIM) indexes were used to quantify the differences between the pseudo and original CT images. The experimental results showed the proposed method could accurately generate pseudo CT images from MR images. In summary, we have developed a new pseudo CT prediction method based on patch-based random forest, demonstrated its clinical feasibility, and validated its prediction accuracy. This pseudo CT prediction technique could be a useful tool for MRI-based radiation treatment planning and attenuation correction in a PET/MRI scanner.
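
    PSNR, one of the two similarity measures named above, follows the standard definition 10 * log10(MAX^2 / MSE). The sketch below applies it to flat lists of intensities; the sample values and the choice of 255 as the maximum intensity are illustrative assumptions, not settings from the study.

```python
import math


def psnr(reference, estimate, max_value):
    """Peak signal-to-noise ratio in dB between two equal-length intensity lists."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: no noise at all
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical two-voxel "images"; each voxel is off by 10, so MSE = 100.
value = psnr([0, 100], [10, 90], max_value=255)
print(round(value, 2))
```

    A higher PSNR means the estimate is closer to the reference; identical inputs give infinity.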

  13. Predicting attention-deficit/hyperactivity disorder severity from psychosocial stress and stress-response genes: a random forest regression approach

    PubMed Central

    van der Meer, D; Hoekstra, P J; van Donkelaar, M; Bralten, J; Oosterlaan, J; Heslenfeld, D; Faraone, S V; Franke, B; Buitelaar, J K; Hartman, C A

    2017-01-01

    Identifying genetic variants contributing to attention-deficit/hyperactivity disorder (ADHD) is complicated by the involvement of numerous common genetic variants with small effects, interacting with each other as well as with environmental factors, such as stress exposure. Random forest regression is well suited to explore this complexity, as it allows for the analysis of many predictors simultaneously, taking into account any higher-order interactions among them. Using random forest regression, we predicted ADHD severity, measured by Conners’ Parent Rating Scales, from 686 adolescents and young adults (of which 281 were diagnosed with ADHD). The analysis included 17 374 single-nucleotide polymorphisms (SNPs) across 29 genes previously linked to hypothalamic–pituitary–adrenal (HPA) axis activity, together with information on exposure to 24 individual long-term difficulties or stressful life events. The model explained 12.5% of variance in ADHD severity. The most important SNP, which also showed the strongest interaction with stress exposure, was located in a region regulating the expression of telomerase reverse transcriptase (TERT). Other high-ranking SNPs were found in or near NPSR1, ESR1, GABRA6, PER3, NR3C2 and DRD4. Chronic stressors were more influential than single, severe, life events. Top hits were partly shared with conduct problems. We conclude that random forest regression may be used to investigate how multiple genetic and environmental factors jointly contribute to ADHD. It is able to implicate novel SNPs of interest, interacting with stress exposure, and may explain inconsistent findings in ADHD genetics. This exploratory approach may be best combined with more hypothesis-driven research; top predictors and their interactions with one another should be replicated in independent samples. PMID:28585928

  14. Towards large-scale FAME-based bacterial species identification using machine learning techniques.

    PubMed

    Slabbinck, Bram; De Baets, Bernard; Dawyndt, Peter; De Vos, Paul

    2009-05-01

    In the last decade, bacterial taxonomy witnessed a huge expansion. The swift pace of bacterial species (re-)definitions has a serious impact on the accuracy and completeness of first-line identification methods. Consequently, back-end identification libraries need to be synchronized with the List of Prokaryotic names with Standing in Nomenclature. In this study, we focus on bacterial fatty acid methyl ester (FAME) profiling as a broadly used first-line identification method. From the BAME@LMG database, we have selected FAME profiles of individual strains belonging to the genera Bacillus, Paenibacillus and Pseudomonas. Only those profiles resulting from standard growth conditions have been retained. The corresponding data set covers 74, 44 and 95 validly published bacterial species, respectively, represented by 961, 378 and 1673 standard FAME profiles. Through the application of machine learning techniques in a supervised strategy, different computational models have been built for genus and species identification. Three techniques have been considered: artificial neural networks, random forests and support vector machines. Nearly perfect identification has been achieved at genus level. Notwithstanding the known limited discriminative power of FAME analysis for species identification, the computational models have resulted in good species identification results for the three genera. For Bacillus, Paenibacillus and Pseudomonas, random forests have resulted in sensitivity values of 0.847, 0.901 and 0.708, respectively. The random forests models outperform those of the other machine learning techniques. Moreover, our machine learning approach also outperformed the Sherlock MIS (MIDI Inc., Newark, DE, USA). These results show that machine learning proves very useful for FAME-based bacterial species identification. Besides good bacterial identification at species level, speed and ease of taxonomic synchronization are major advantages of this computational species identification strategy.

  15. Inside the black box: starting to uncover the underlying decision rules used in one-by-one expert assessment of occupational exposure in case-control studies

    PubMed Central

    Wheeler, David C.; Burstyn, Igor; Vermeulen, Roel; Yu, Kai; Shortreed, Susan M.; Pronk, Anjoeka; Stewart, Patricia A.; Colt, Joanne S.; Baris, Dalsu; Karagas, Margaret R.; Schwenn, Molly; Johnson, Alison; Silverman, Debra T.; Friesen, Melissa C.

    2014-01-01

    Objectives: Evaluating occupational exposures in population-based case-control studies often requires exposure assessors to review each study participant's reported occupational information job-by-job to derive exposure estimates. Although such assessments likely have underlying decision rules, they usually lack transparency, are time-consuming and have uncertain reliability and validity. We aimed to identify the underlying rules to enable documentation, review, and future use of these expert-based exposure decisions. Methods: Classification and regression trees (CART, predictions from a single tree) and random forests (predictions from many trees) were used to identify the underlying rules from the questionnaire responses and an expert's exposure assignments for occupational diesel exhaust exposure for several metrics: binary exposure probability and ordinal exposure probability, intensity, and frequency. Data were split into training (n=10,488 jobs), testing (n=2,247), and validation (n=2,248) data sets. Results: The CART and random forest models' predictions agreed with 92–94% of the expert's binary probability assignments. For ordinal probability, intensity, and frequency metrics, the two models extracted decision rules more successfully for unexposed and highly exposed jobs (86–90% and 57–85%, respectively) than for low or medium exposed jobs (7–71%). Conclusions: CART and random forest models extracted decision rules and accurately predicted an expert's exposure decisions for the majority of jobs and identified questionnaire response patterns that would require further expert review if the rules were applied to other jobs in the same or different study. This approach makes the exposure assessment process in case-control studies more transparent and creates a mechanism to efficiently replicate exposure decisions in future studies. PMID:23155187
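
    The three-way split reported above (10,488 / 2,247 / 2,248 jobs) could be reproduced in outline as follows. The abstract does not say how jobs were assigned to each set, so the seeded random shuffle here is an assumption, and integer job ids stand in for the real records.

```python
import random


def three_way_split(items, n_train, n_test, seed=0):
    """Shuffle items and cut them into training, testing and validation lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # whatever remains
    return train, test, validation


# Integer ids stand in for the 14,983 jobs in the abstract.
jobs = range(10488 + 2247 + 2248)
train, test, validation = three_way_split(jobs, 10488, 2247)
print(len(train), len(test), len(validation))  # 10488 2247 2248
```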

  16. Genome analysis of Legionella pneumophila strains using a mixed-genome microarray.

    PubMed

    Euser, Sjoerd M; Nagelkerke, Nico J; Schuren, Frank; Jansen, Ruud; Den Boer, Jeroen W

    2012-01-01

    Legionella, the causative agent for Legionnaires' disease, is ubiquitous in both natural and man-made aquatic environments. The distribution of Legionella genotypes within clinical strains is significantly different from that found in environmental strains. Developing novel genotypic methods that offer the ability to distinguish clinical from environmental strains could help to focus on more relevant (virulent) Legionella species in control efforts. Mixed-genome microarray data can be used to perform a comparative-genome analysis of strain collections, and advanced statistical approaches, such as the Random Forest algorithm are available to process these data. Microarray analysis was performed on a collection of 222 Legionella pneumophila strains, which included patient-derived strains from notified cases in The Netherlands in the period 2002-2006 and the environmental strains that were collected during the source investigation for those patients within the Dutch National Legionella Outbreak Detection Programme. The Random Forest algorithm combined with a logistic regression model was used to select predictive markers and to construct a predictive model that could discriminate between strains from different origin: clinical or environmental. Four genetic markers were selected that correctly predicted 96% of the clinical strains and 66% of the environmental strains collected within the Dutch National Legionella Outbreak Detection Programme. The Random Forest algorithm is well suited for the development of prediction models that use mixed-genome microarray data to discriminate between Legionella strains from different origin. The identification of these predictive genetic markers could offer the possibility to identify virulence factors within the Legionella genome, which in the future may be implemented in the daily practice of controlling Legionella in the public health environment.

  17. Object-based random forest classification of Landsat ETM+ and WorldView-2 satellite imagery for mapping lowland native grassland communities in Tasmania, Australia

    NASA Astrophysics Data System (ADS)

    Melville, Bethany; Lucieer, Arko; Aryal, Jagannath

    2018-04-01

    This paper presents a random forest classification approach for identifying and mapping three types of lowland native grassland communities found in the Tasmanian Midlands region. Due to the high conservation priority assigned to these communities, there has been an increasing need to identify appropriate datasets that can be used to derive accurate and frequently updateable maps of community extent. Therefore, this paper proposes a method employing repeat classification and statistical significance testing as a means of identifying the most appropriate dataset for mapping these communities. Two datasets were acquired and analysed: a Landsat ETM+ scene and a WorldView-2 scene, both from 2010. Training and validation data were randomly subset using a k-fold (k = 50) approach from a pre-existing field dataset. Poa labillardierei, Themeda triandra and lowland native grassland complex communities were identified in addition to dry woodland and agriculture. For each subset of randomly allocated points, a random forest model was trained based on each dataset, and then used to classify the corresponding imagery. Validation was performed using the reciprocal points from the independent subset that had not been used to train the model. Final training and classification accuracies were reported as per class means for each satellite dataset. Analysis of Variance (ANOVA) was undertaken to determine whether classification accuracy differed between the two datasets, as well as between classifications. Results showed mean class accuracies between 54% and 87%. Class accuracy only differed significantly between datasets for the dry woodland and Themeda grassland classes, with the WorldView-2 dataset showing higher mean classification accuracies. The results of this study indicate that remote sensing is a viable method for the identification of lowland native grassland communities in the Tasmanian Midlands, and that repeat classification and statistical significance testing can be used to identify optimal datasets for vegetation community mapping.

  18. Using Classification and Regression Trees (CART) and random forests to analyze attrition: Results from two simulations.

    PubMed

    Hayes, Timothy; Usami, Satoshi; Jacobucci, Ross; McArdle, John J

    2015-12-01

    In this article, we describe a recent development in the analysis of attrition: using classification and regression trees (CART) and random forest methods to generate inverse sampling weights. These flexible machine learning techniques have the potential to capture complex nonlinear, interactive selection models, yet to our knowledge, their performance in the missing data analysis context has never been evaluated. To assess the potential benefits of these methods, we compare their performance with commonly employed multiple imputation and complete case techniques in 2 simulations. These initial results suggest that weights computed from pruned CART analyses performed well in terms of both bias and efficiency when compared with other methods. We discuss the implications of these findings for applied researchers. (c) 2015 APA, all rights reserved.
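
    The inverse sampling weights mentioned above are commonly built by weighting each retained participant by the reciprocal of their estimated probability of remaining in the study. In the article those probabilities would come from CART or random forest models; in this illustrative sketch they are hard-coded hypothetical values.

```python
def inverse_probability_weights(p_retained):
    """Weight each completer by 1 / (estimated probability of staying in the study)."""
    return [1.0 / p for p in p_retained]


def weighted_mean(values, weights):
    """Mean of values in which each observation counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical completers: the second was less likely to stay, so it counts more.
outcomes = [10.0, 20.0]
weights = inverse_probability_weights([0.5, 0.25])
print(weights)                           # [2.0, 4.0]
print(weighted_mean(outcomes, weights))  # (10*2 + 20*4) / 6
```

    The unweighted mean of the two outcomes is 15; the weighted mean shifts toward the participant who resembles those more likely to drop out.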

  19. Using Classification and Regression Trees (CART) and Random Forests to Analyze Attrition: Results From Two Simulations

    PubMed Central

    Hayes, Timothy; Usami, Satoshi; Jacobucci, Ross; McArdle, John J.

    2016-01-01

    In this article, we describe a recent development in the analysis of attrition: using classification and regression trees (CART) and random forest methods to generate inverse sampling weights. These flexible machine learning techniques have the potential to capture complex nonlinear, interactive selection models, yet to our knowledge, their performance in the missing data analysis context has never been evaluated. To assess the potential benefits of these methods, we compare their performance with commonly employed multiple imputation and complete case techniques in 2 simulations. These initial results suggest that weights computed from pruned CART analyses performed well in terms of both bias and efficiency when compared with other methods. We discuss the implications of these findings for applied researchers. PMID:26389526

  20. Detecting targets hidden in random forests

    NASA Astrophysics Data System (ADS)

    Kouritzin, Michael A.; Luo, Dandan; Newton, Fraser; Wu, Biao

    2009-05-01

    Military tanks, cargo or troop carriers, missile carriers or rocket launchers often hide themselves from detection in the forests. This plagues the detection problem of locating these hidden targets. An electro-optic camera mounted on a surveillance aircraft or unmanned aerial vehicle is used to capture the images of the forests with possible hidden targets, e.g., rocket launchers. We consider random forests of longitudinal and latitudinal correlations. Specifically, foliage coverage is encoded with a binary representation (i.e., foliage or no foliage), and is correlated in adjacent regions. We address the detection problem of camouflaged targets hidden in random forests by building memory into the observations. In particular, we propose an efficient algorithm to generate random forests, ground, and camouflage of hidden targets with two dimensional correlations. The observations are a sequence of snapshots consisting of foliage-obscured ground or target. Theoretically, detection is possible because there are subtle differences in the correlations of the ground and camouflage of the rocket launcher. However, these differences are well beyond human perception. To detect the presence of hidden targets automatically, we develop a Markov representation for these sequences and modify the classical filtering equations to allow the Markov chain observation. Particle filters are used to estimate the position of the targets in combination with a novel random weighting technique. Furthermore, we give positive proof-of-concept simulations.

  1. Modeling groundwater nitrate concentrations in private wells in Iowa

    USGS Publications Warehouse

    Wheeler, David C.; Nolan, Bernard T.; Flory, Abigail R.; DellaValle, Curt T.; Ward, Mary H.

    2015-01-01

    Contamination of drinking water by nitrate is a growing problem in many agricultural areas of the country. Ingested nitrate can lead to the endogenous formation of N-nitroso compounds, potent carcinogens. We developed a predictive model for nitrate concentrations in private wells in Iowa. Using 34,084 measurements of nitrate in private wells, we trained and tested random forest models to predict log nitrate levels by systematically assessing the predictive performance of 179 variables in 36 thematic groups (well depth, distance to sinkholes, location, land use, soil characteristics, nitrogen inputs, meteorology, and other factors). The final model contained 66 variables in 17 groups. Some of the most important variables were well depth, slope length within 1 km of the well, year of sample, and distance to nearest animal feeding operation. The correlation between observed and estimated nitrate concentrations was excellent in the training set (r-square = 0.77) and was acceptable in the testing set (r-square = 0.38). The random forest model had substantially better predictive performance than a traditional linear regression model or a regression tree. Our model will be used to investigate the association between nitrate levels in drinking water and cancer risk in the Iowa participants of the Agricultural Health Study cohort.
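
    The gap between training and testing fit reported above (0.77 versus 0.38) is the kind of comparison sketched below. The function uses the coefficient of determination, 1 - SS_res / SS_tot; the abstract may instead mean a squared Pearson correlation, so treat this as one possible reading, with toy numbers rather than study data.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Toy values: a perfect fit scores 1.0; always predicting the mean scores 0.0.
print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0
print(r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # 0.0
```

    Computing the same statistic separately on the training and held-out testing sets is what exposes a drop in fit on data the model has not seen.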

  2. Discrimination of fish populations using parasites: Random Forests on a 'predictable' host-parasite system.

    PubMed

    Pérez-Del-Olmo, A; Montero, F E; Fernández, M; Barrett, J; Raga, J A; Kostadinova, A

    2010-10-01

    We address the effect of spatial scale and temporal variation on model generality when forming predictive models for fish assignment using a new data mining approach, Random Forests (RF), to variable biological markers (parasite community data). Models were implemented for a fish host-parasite system sampled along the Mediterranean and Atlantic coasts of Spain and were validated using independent datasets. We considered 2 basic classification problems in evaluating the importance of variations in parasite infracommunities for assignment of individual fish to their populations of origin: multiclass (2-5 population models, using 2 seasonal replicates from each of the populations) and 2-class task (using 4 seasonal replicates from 1 Atlantic and 1 Mediterranean population each). The main results are that (i) RF are well suited for multiclass population assignment using parasite communities in non-migratory fish; (ii) RF provide an efficient means for model cross-validation on the baseline data and this allows sample size limitations in parasite tag studies to be tackled effectively; (iii) the performance of RF is dependent on the complexity and spatial extent/configuration of the problem; and (iv) the development of predictive models is strongly influenced by seasonal change and this stresses the importance of both temporal replication and model validation in parasite tagging studies.

  3. Modeling groundwater nitrate concentrations in private wells in Iowa.

    PubMed

    Wheeler, David C; Nolan, Bernard T; Flory, Abigail R; DellaValle, Curt T; Ward, Mary H

    2015-12-01

    Contamination of drinking water by nitrate is a growing problem in many agricultural areas of the country. Ingested nitrate can lead to the endogenous formation of N-nitroso compounds, potent carcinogens. We developed a predictive model for nitrate concentrations in private wells in Iowa. Using 34,084 measurements of nitrate in private wells, we trained and tested random forest models to predict log nitrate levels by systematically assessing the predictive performance of 179 variables in 36 thematic groups (well depth, distance to sinkholes, location, land use, soil characteristics, nitrogen inputs, meteorology, and other factors). The final model contained 66 variables in 17 groups. Some of the most important variables were well depth, slope length within 1 km of the well, year of sample, and distance to nearest animal feeding operation. The correlation between observed and estimated nitrate concentrations was excellent in the training set (r-square=0.77) and was acceptable in the testing set (r-square=0.38). The random forest model had substantially better predictive performance than a traditional linear regression model or a regression tree. Our model will be used to investigate the association between nitrate levels in drinking water and cancer risk in the Iowa participants of the Agricultural Health Study cohort. Copyright © 2015 Elsevier B.V. All rights reserved.

  4. Exploring prediction uncertainty of spatial data in geostatistical and machine learning Approaches

    NASA Astrophysics Data System (ADS)

    Klump, J. F.; Fouedjio, F.

    2017-12-01

    Geostatistical methods such as kriging with external drift as well as machine learning techniques such as quantile regression forest have been intensively used for modelling spatial data. In addition to providing predictions for target variables, both approaches are able to deliver a quantification of the uncertainty associated with the prediction at a target location. Geostatistical approaches are, by essence, adequate for providing such prediction uncertainties and their behaviour is well understood. However, they often require significant data pre-processing and rely on assumptions that are rarely met in practice. Machine learning algorithms such as random forest regression, on the other hand, require less data pre-processing and are non-parametric. This makes the application of machine learning algorithms to geostatistical problems an attractive proposition. The objective of this study is to compare kriging with external drift and quantile regression forest with respect to their ability to deliver reliable prediction uncertainties of spatial data. In our comparison we use both simulated and real world datasets. Apart from classical performance indicators, comparisons make use of accuracy plots, probability interval width plots, and the visual examinations of the uncertainty maps provided by the two approaches. By comparing random forest regression to kriging we found that both methods produced comparable maps of estimated values for our variables of interest. However, the measure of uncertainty provided by random forest seems to be quite different to the measure of uncertainty provided by kriging. In particular, the lack of spatial context can give misleading results in areas without ground truth data. These preliminary results raise questions about assessing the risks associated with decisions based on the predictions from geostatistical and machine learning algorithms in a spatial context, e.g. mineral exploration.

  5. Application of lifting wavelet and random forest in compound fault diagnosis of gearbox

    NASA Astrophysics Data System (ADS)

    Chen, Tang; Cui, Yulian; Feng, Fuzhou; Wu, Chunzhi

    2018-03-01

    To address the weak compound-fault characteristic signals of an armored-vehicle gearbox and the resulting difficulty of identifying fault types, a fault diagnosis method based on the lifting wavelet and random forest is proposed. First, the method uses the lifting wavelet transform to decompose the original vibration signal into multiple layers and reconstructs the low-frequency and high-frequency components obtained from the decomposition into multiple component signals. Time-domain feature parameters are then computed for each component signal to form multiple feature vectors, which are input into the random forest pattern recognition classifier to determine the compound fault type. Finally, the method is verified on a variety of compound fault data from a gearbox fault simulation test platform; the results show that the recognition accuracy of the fault diagnosis method combining the lifting wavelet and the random forest reaches up to 99.99%.

  6. Integrating Archaeological Modeling in DoD Cultural Resource Compliance

    DTIC Science & Technology

    2012-10-26

    Leo 2001 Random Forests. Machine Learning 45:5–32. Briuer, Frederick, Clifford Brown, Alan Gillespie, Fredrick Limp, Michael Trimble, and Len...glaciolacustrine clays on glacial lake plains Inceptisols Very-fine, mixed, active, nonacid, mesic Mollic Endoaquepts Low to none LoC Lowville silt

  7. Land Covers Classification Based on Random Forest Method Using Features from Full-Waveform LIDAR Data

    NASA Astrophysics Data System (ADS)

    Ma, L.; Zhou, M.; Li, C.

    2017-09-01

    In this study, a Random Forest (RF) based land cover classification method is presented to predict the types of land cover in the Miyun area. The returned full waveforms, acquired by a LiteMapper 5600 airborne LiDAR system, were processed by waveform filtering, waveform decomposition and feature extraction. Commonly used features, namely distance, intensity, Full Width at Half Maximum (FWHM), skewness and kurtosis, were extracted. These waveform features were used as attributes of the training data for generating the RF prediction model. The RF prediction model was applied to classify land cover in the Miyun area as trees, buildings, farmland or ground. The classification results for these four types of land cover were assessed against ground truth information acquired from CCD image data of the same region. The RF classification results were compared with those of the SVM method and were better. The RF classification accuracy reached 89.73% and the classification Kappa was 0.8631.

  8. Fuzzy association rule mining and classification for the prediction of malaria in South Korea.

    PubMed

    Buczak, Anna L; Baugher, Benjamin; Guven, Erhan; Ramac-Thomas, Liane C; Elbert, Yevgeniy; Babin, Steven M; Lewis, Sheri H

    2015-06-18

    Malaria is the world's most prevalent vector-borne disease. Accurate prediction of malaria outbreaks may lead to public health interventions that mitigate disease morbidity and mortality. We describe an application of a method for creating prediction models utilizing Fuzzy Association Rule Mining to extract relationships between epidemiological, meteorological, climatic, and socio-economic data from Korea. These relationships are in the form of rules, from which the best set of rules is automatically chosen and forms a classifier. Two classifiers have been built and their results fused to become a malaria prediction model. Future malaria cases are predicted as Low, Medium or High, where these classes are defined as a total of 0-2, 3-16, and above 17 cases, respectively, for a region in South Korea during a two-week period. Based on user recommendations, HIGH is considered an outbreak. Model accuracy is described by Positive Predictive Value (PPV), Sensitivity, and F-score for each class, computed on test data not previously used to develop the model. For predictions made 7-8 weeks in advance, model PPV and Sensitivity are 0.842 and 0.681, respectively, for the HIGH classes. The F0.5 and F3 scores (which combine PPV and Sensitivity) are 0.804 and 0.694, respectively, for the HIGH classes. The overall FARM results (as measured by F-scores) are significantly better than those obtained by Decision Tree, Random Forest, Support Vector Machine, and Holt-Winters methods for the HIGH class. For the Medium class, Random Forest and FARM obtain comparable results, with FARM being better at F0.5, and Random Forest obtaining a higher F3. A previously described method for creating disease prediction models has been modified and extended to build models for predicting malaria. In addition, some new input variables were used, including indicators of intervention measures. The South Korea malaria prediction models predict Low, Medium or High cases 7-8 weeks in the future. 
    This paper demonstrates that our data-driven approach can be used for the prediction of different diseases.
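
    The F0.5 and F3 scores quoted above combine PPV and Sensitivity through the standard F-beta formula, F_beta = (1 + beta^2) * PPV * Sens / (beta^2 * PPV + Sens). A short sketch that plugs in the PPV and Sensitivity reported for the HIGH class (the function name is illustrative, not taken from the paper):

```python
def f_beta(ppv: float, sensitivity: float, beta: float) -> float:
    """F-beta score: weighted harmonic mean of PPV (precision) and
    sensitivity (recall). beta < 1 favours PPV, beta > 1 favours sensitivity."""
    b2 = beta * beta
    denom = b2 * ppv + sensitivity
    if denom == 0:
        return 0.0
    return (1 + b2) * ppv * sensitivity / denom

# PPV and Sensitivity reported in the abstract for the HIGH class.
ppv, sens = 0.842, 0.681
f_half = f_beta(ppv, sens, 0.5)
f_three = f_beta(ppv, sens, 3.0)
print(round(f_half, 3), round(f_three, 3))  # 0.804 0.694
```

    Both values agree with the F0.5 = 0.804 and F3 = 0.694 stated in the abstract.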

  9. Machine learning models in breast cancer survival prediction.

    PubMed

    Montazeri, Mitra; Montazeri, Mohadeseh; Montazeri, Mahdieh; Beigzadeh, Amin

    2016-01-01

    Breast cancer is one of the most common cancers with a high mortality rate among women. With early diagnosis of breast cancer, survival will increase from 56% to more than 86%. Therefore, an accurate and reliable system is necessary for the early diagnosis of this cancer. The proposed model is the combination of rules and different machine learning techniques. Machine learning models can help physicians to reduce the number of false decisions. They try to exploit patterns and relationships among a large number of cases and predict the outcome of a disease using historical cases stored in datasets. The objective of this study is to propose a rule-based classification method with machine learning techniques for the prediction of different types of breast cancer survival. We use a dataset with eight attributes that includes the records of 900 patients, of whom 876 (97.3%) were female and 24 (2.7%) were male. Naive Bayes (NB), Trees Random Forest (TRF), 1-Nearest Neighbor (1NN), AdaBoost (AD), Support Vector Machine (SVM), RBF Network (RBFN), and Multilayer Perceptron (MLP) machine learning techniques with 10-fold cross-validation were used with the proposed model for the prediction of breast cancer survival. The performance of the machine learning techniques was evaluated with accuracy, precision, sensitivity, specificity, and area under ROC curve. Out of 900 patients, 803 patients and 97 patients were alive and dead, respectively. In this study, the Trees Random Forest (TRF) technique showed better results in comparison to the other techniques (NB, 1NN, AD, SVM, RBFN and MLP). The accuracy, sensitivity and area under ROC curve of TRF are 96%, 96% and 93%, respectively. However, the 1NN machine learning technique provided poor performance (accuracy 91%, sensitivity 91% and area under ROC curve 78%). This study demonstrates that the Trees Random Forest model (TRF), which is a rule-based classification model, was the best model with the highest level of accuracy. Therefore, this model is recommended as a useful tool for breast cancer survival prediction as well as medical decision making.

  10. Modeling Verdict Outcomes Using Social Network Measures: The Watergate and Caviar Network Cases

    PubMed Central

    2016-01-01

    Modelling criminal trial verdict outcomes using social network measures is an emerging research area in quantitative criminology. Few studies have yet analyzed which of these measures are the most important for verdict modelling or which data classification techniques perform best for this application. To compare the performance of different techniques in classifying members of a criminal network, this article applies three different machine learning classifiers (Logistic Regression, Naïve Bayes and Random Forest) with a range of social network measures and the necessary databases to model the verdicts in two real-world cases: the U.S. Watergate Conspiracy of the 1970s and the now-defunct Canada-based international drug trafficking ring known as the Caviar Network. In both cases it was found that the Random Forest classifier did better than either Logistic Regression or Naïve Bayes, and its superior performance was statistically significant. This being so, Random Forest was used not only for classification but also to assess the importance of the measures. For the Watergate case, the most important one proved to be betweenness centrality while for the Caviar Network, it was the effective size of the network. These results are significant because they show that an approach combining machine learning with social network analysis not only can generate accurate classification models but also helps quantify the importance of social network variables in modelling verdict outcomes. We conclude our analysis with a discussion and some suggestions for future work in verdict modelling using social network measures. PMID:26824351

  11. Modeling Forest Biomass and Growth: Coupling Long-Term Inventory and Lidar Data

    NASA Technical Reports Server (NTRS)

    Babcock, Chad; Finley, Andrew O.; Cook, Bruce D.; Weiskittel, Andrew; Woodall, Christopher W.

    2016-01-01

    Combining spatially-explicit long-term forest inventory and remotely sensed information from Light Detection and Ranging (LiDAR) datasets through statistical models can be a powerful tool for predicting and mapping above-ground biomass (AGB) at a range of geographic scales. We present and examine a novel modeling approach to improve prediction of AGB and estimate AGB growth using LiDAR data. The proposed model accommodates temporal misalignment between field measurements and remotely sensed data (a problem pervasive in such settings) by including multiple time-indexed measurements at plot locations to estimate AGB growth. We pursue a Bayesian modeling framework that allows for appropriately complex parameter associations and uncertainty propagation through to prediction. Specifically, we identify a space-varying coefficients model to predict and map AGB and its associated growth simultaneously. The proposed model is assessed using LiDAR data acquired from NASA Goddard's LiDAR, Hyper-spectral & Thermal imager and field inventory data from the Penobscot Experimental Forest in Bradley, Maine. The proposed model outperformed the time-invariant counterpart models in predictive performance as indicated by a substantial reduction in root mean squared error. The proposed model adequately accounts for temporal misalignment through the estimation of forest AGB growth and accommodates residual spatial dependence. Results from this analysis suggest that future AGB models informed using remotely sensed data, such as LiDAR, may be improved by adapting traditional modeling frameworks to account for temporal misalignment and spatial dependence using random effects.

  12. [Estimating individual tree aboveground biomass of the mid-subtropical forest using airborne LiDAR technology].

    PubMed

    Liu, Feng; Tan, Chang; Lei, Pi-Feng

    2014-11-01

    Taking Wugang forest farm in Xuefeng Mountain as the research object, and using airborne light detection and ranging (LiDAR) data acquired under leaf-on conditions together with field data from concomitant plots, this paper assessed the ability of LiDAR technology to estimate aboveground biomass of the mid-subtropical forest. A semi-automated individual-tree LiDAR point cloud segmentation was obtained by using conditional random fields and optimization methods. Spatial structure, waveform characteristics and topography were calculated as LiDAR metrics from the segmented objects. Then statistical models between aboveground biomass from field data and these LiDAR metrics were built. The individual tree recognition rates were 93%, 86% and 60% for coniferous, broadleaf and mixed forests, respectively. The adjusted coefficients of determination (R(2)adj) and the root mean squared errors (RMSE) for the three types of forest were 0.83, 0.81 and 0.74, and 28.22, 29.79 and 32.31 t · hm(-2), respectively. The estimation capability of the model based on canopy geometric volume, tree percentile height, slope and waveform characteristics was much better than that of the traditional regression model based on tree height. Therefore, individual-tree LiDAR metrics could facilitate better performance in biomass estimation.
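
    For readers unfamiliar with the two fit statistics reported above, the following sketch computes R2, adjusted R2 and RMSE from first principles. The observed and predicted biomass values and the predictor count are invented for illustration and are not data from the study.

```python
import math

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n_samples, n_predictors):
    """Penalise R2 for the number of predictors in the model."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the response variable."""
    squared = [(y - p) ** 2 for y, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical plot-level biomass values (t per hectare), not from the paper.
observed = [100.0, 150.0, 200.0, 250.0, 300.0]
predicted = [110.0, 140.0, 210.0, 240.0, 310.0]

r2 = r_squared(observed, predicted)
r2_adj = adjusted_r_squared(r2, n_samples=len(observed), n_predictors=2)
error = rmse(observed, predicted)
print(round(r2, 2), round(r2_adj, 2), round(error, 2))  # 0.98 0.96 10.0
```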

  13. Mapping Species Composition of Forests and Tree Plantations in Northeastern Costa Rica with an Integration of Hyperspectral and Multitemporal Landsat Imagery

    NASA Technical Reports Server (NTRS)

    Fagan, Matthew E.; Defries, Ruth S.; Sesnie, Steven E.; Arroyo-Mora, J. Pablo; Soto, Carlomagno; Singh, Aditya; Townsend, Philip A.; Chazdon, Robin L.

    2015-01-01

    An efficient means to map tree plantations is needed to detect tropical land use change and evaluate reforestation projects. To analyze recent tree plantation expansion in northeastern Costa Rica, we examined the potential of combining moderate-resolution hyperspectral imagery (2005 HyMap mosaic) with multitemporal, multispectral data (Landsat) to accurately classify (1) general forest types and (2) tree plantations by species composition. Following a linear discriminant analysis to reduce data dimensionality, we compared four Random Forest classification models: hyperspectral data (HD) alone; HD plus interannual spectral metrics; HD plus a multitemporal forest regrowth classification; and all three models combined. The fourth, combined model achieved overall accuracy of 88.5%. Adding multitemporal data significantly improved classification accuracy (p < 0.0001) of all forest types, although the effect on tree plantation accuracy was modest. The hyperspectral data alone classified six species of tree plantations with 75% to 93% producer's accuracy; adding multitemporal spectral data increased accuracy only for two species with dense canopies. Non-native tree species had higher classification accuracy overall and made up the majority of tree plantations in this landscape. Our results indicate that combining occasionally acquired hyperspectral data with widely available multitemporal satellite imagery enhances mapping and monitoring of reforestation in tropical landscapes.

  14. Satellite and in situ monitoring data used for modeling of forest vegetation reflectance

    NASA Astrophysics Data System (ADS)

    Zoran, M. A.; Savastru, R. S.; Savastru, D. M.; Miclos, S. I.; Tautan, M. N.; Baschir, L.

    2010-10-01

    As climatic variability and anthropogenic stressors grow continuously, proper criteria for forest vegetation assessment must be defined. Satellite imagery is a very useful tool for characterizing the current and future state of forest vegetation. Using remote sensing data, vegetation can be distinguished from most other (mainly inorganic) materials by virtue of its notable absorption in the red and blue segments of the visible spectrum, its higher green reflectance and, especially, its very strong reflectance in the near-IR. Vegetation reflectance varies with sun zenith angle, view zenith angle, and terrain slope angle. To correct for these effects in visible and near-infrared light, a simple physical model of vegetation reflectance was developed, assuming a homogeneous, closed vegetation canopy with randomly oriented leaves. The model was applied and validated for the Cernica forested area, near Bucharest, using two ASTER satellite scenes, one nadir and one off-nadir, acquired within minutes of one another; for band 3, which lies in the near infrared, most radiance differences between the two scenes can be attributed to the BRDF effect. Other satellite data (MODIS, Landsat TM and ETM, and IKONOS) were used for various NDVI and classification analyses.
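
    The NDVI mentioned above exploits exactly the contrast the abstract describes: strong near-infrared reflectance and strong red absorption. A minimal sketch of the standard formula, NDVI = (NIR - Red) / (NIR + Red), follows. The reflectance values are invented for illustration and do not come from the ASTER, MODIS, Landsat or IKONOS data in the study.

```python
def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index from near-infrared and red
    reflectance. Returns 0.0 when both bands are zero to avoid dividing by zero."""
    total = nir + red
    if total == 0:
        return 0.0
    return (nir - red) / total

# Hypothetical surface reflectances (fractions between 0 and 1).
dense_canopy = ndvi(nir=0.50, red=0.10)   # vegetation: high NIR, low red
bare_surface = ndvi(nir=0.20, red=0.20)   # similar reflectance in both bands
print(round(dense_canopy, 3), round(bare_surface, 3))  # 0.667 0.0
```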

  15. Computer aided diagnosis system for the Alzheimer's disease based on partial least squares and random forest SPECT image classification.

    PubMed

    Ramírez, J; Górriz, J M; Segovia, F; Chaves, R; Salas-Gonzalez, D; López, M; Alvarez, I; Padilla, P

    2010-03-19

    This letter shows a computer aided diagnosis (CAD) technique for the early detection of the Alzheimer's disease (AD) by means of single photon emission computed tomography (SPECT) image classification. The proposed method is based on partial least squares (PLS) regression model and a random forest (RF) predictor. The challenge of the curse of dimensionality is addressed by reducing the large dimensionality of the input data by downscaling the SPECT images and extracting score features using PLS. A RF predictor then forms an ensemble of classification and regression tree (CART)-like classifiers being its output determined by a majority vote of the trees in the forest. A baseline principal component analysis (PCA) system is also developed for reference. The experimental results show that the combined PLS-RF system yields a generalization error that converges to a limit when increasing the number of trees in the forest. Thus, the generalization error is reduced when using PLS and depends on the strength of the individual trees in the forest and the correlation between them. Moreover, PLS feature extraction is found to be more effective for extracting discriminative information from the data than PCA yielding peak sensitivity, specificity and accuracy values of 100%, 92.7%, and 96.9%, respectively. Moreover, the proposed CAD system outperformed several other recently developed AD CAD systems. Copyright 2010 Elsevier Ireland Ltd. All rights reserved.
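
    The majority vote that determines the RF output can be sketched in a few lines. The class labels and the list of per-tree votes below are made up for illustration; a real forest would obtain each vote by running a sample through one fitted tree.

```python
from collections import Counter

def majority_vote(tree_votes):
    """Return the class label predicted by the largest number of trees.
    Ties are not handled specially in this sketch."""
    if not tree_votes:
        raise ValueError("need at least one vote")
    return Counter(tree_votes).most_common(1)[0][0]

# Hypothetical votes from five trees for a single SPECT image.
votes = ["AD", "normal", "AD", "AD", "normal"]
label = majority_vote(votes)
print(label)  # AD
```

    Using an odd number of trees avoids ties in a two-class problem such as this one.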

  16. Pre-operative prediction of surgical morbidity in children: comparison of five statistical models.

    PubMed

    Cooper, Jennifer N; Wei, Lai; Fernandez, Soledad A; Minneci, Peter C; Deans, Katherine J

    2015-02-01

    The accurate prediction of surgical risk is important to patients and physicians. Logistic regression (LR) models are typically used to estimate these risks. However, in the fields of data mining and machine-learning, many alternative classification and prediction algorithms have been developed. This study aimed to compare the performance of LR to several data mining algorithms for predicting 30-day surgical morbidity in children. We used the 2012 National Surgical Quality Improvement Program-Pediatric dataset to compare the performance of (1) a LR model that assumed linearity and additivity (simple LR model) (2) a LR model incorporating restricted cubic splines and interactions (flexible LR model) (3) a support vector machine, (4) a random forest and (5) boosted classification trees for predicting surgical morbidity. The ensemble-based methods showed significantly higher accuracy, sensitivity, specificity, PPV, and NPV than the simple LR model. However, none of the models performed better than the flexible LR model in terms of the aforementioned measures or in model calibration or discrimination. Support vector machines, random forests, and boosted classification trees do not show better performance than LR for predicting pediatric surgical morbidity. After further validation, the flexible LR model derived in this study could be used to assist with clinical decision-making based on patient-specific surgical risks. Copyright © 2014 Elsevier Ltd. All rights reserved.
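
    The five measures used in the comparison above all derive from the four cells of a binary confusion matrix. The sketch below shows the definitions; the counts are hypothetical and are not taken from the NSQIP-Pediatric dataset.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, sensitivity, specificity, PPV and NPV from confusion-matrix
    counts (true/false positives and negatives)."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # share of true events detected
        "specificity": tn / (tn + fp),   # share of non-events correctly cleared
        "ppv": tp / (tp + fp),           # share of positive calls that are right
        "npv": tn / (tn + fn),           # share of negative calls that are right
    }

# Hypothetical counts for a rare outcome (60 events among 1,000 patients).
metrics = binary_metrics(tp=40, fp=10, tn=930, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

    With a rare outcome, accuracy and NPV stay high even when sensitivity is modest, which is one reason such comparisons report all five measures.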

  17. Predicting stem total and assortment volumes in an industrial Pinus taeda L. forest plantation using airborne laser scanning data and random forest

    Treesearch

    Carlos Alberto Silva; Carine Klauberg; Andrew Thomas Hudak; Lee Alexander Vierling; Wan Shafrina Wan Mohd Jaafar; Midhun Mohan; Mariano Garcia; Antonio Ferraz; Adrian Cardil; Sassan Saatchi

    2017-01-01

    Improvements in the management of pine plantations result in multiple industrial and environmental benefits. Remote sensing techniques can dramatically increase the efficiency of plantation management by reducing or replacing time-consuming field sampling. We tested the utility and accuracy of combining field and airborne lidar data with Random Forest, a supervised...

  18. A Complex Network Theory Approach for the Spatial Distribution of Fire Breaks in Heterogeneous Forest Landscapes for the Control of Wildland Fires

    PubMed Central

    Russo, Lucia; Russo, Paola; Siettos, Constantinos I.

    2016-01-01

    Based on complex network theory, we propose a computational methodology which addresses the spatial distribution of fuel breaks for the inhibition of the spread of wildland fires on heterogeneous landscapes. This is a two-level approach where the dynamics of fire spread are modeled as a random Markov field process on a directed network whose edge weights are determined by a Cellular Automata model that integrates detailed GIS, landscape and meteorological data. Within this framework, the spatial distribution of fuel breaks is reduced to the problem of finding network nodes (small land patches) which favour fire propagation. Here, this is accomplished by exploiting network centrality statistics. We illustrate the proposed approach through (a) an artificial forest of randomly distributed density of vegetation, and (b) a real-world case concerning the island of Rhodes in Greece whose major part of its forest was burned in 2008. Simulation results show that the proposed methodology outperforms the benchmark/conventional policy of fuel reduction as this can be realized by selective harvesting and/or prescribed burning based on the density and flammability of vegetation. Interestingly, our approach reveals that patches with sparse density of vegetation may act as hubs for the spread of the fire. PMID:27780249

  19. A Complex Network Theory Approach for the Spatial Distribution of Fire Breaks in Heterogeneous Forest Landscapes for the Control of Wildland Fires.

    PubMed

    Russo, Lucia; Russo, Paola; Siettos, Constantinos I

    2016-01-01

    Based on complex network theory, we propose a computational methodology which addresses the spatial distribution of fuel breaks for the inhibition of the spread of wildland fires on heterogeneous landscapes. This is a two-level approach where the dynamics of fire spread are modeled as a random Markov field process on a directed network whose edge weights are determined by a Cellular Automata model that integrates detailed GIS, landscape and meteorological data. Within this framework, the spatial distribution of fuel breaks is reduced to the problem of finding network nodes (small land patches) which favour fire propagation. Here, this is accomplished by exploiting network centrality statistics. We illustrate the proposed approach through (a) an artificial forest of randomly distributed density of vegetation, and (b) a real-world case concerning the island of Rhodes in Greece whose major part of its forest was burned in 2008. Simulation results show that the proposed methodology outperforms the benchmark/conventional policy of fuel reduction as this can be realized by selective harvesting and/or prescribed burning based on the density and flammability of vegetation. Interestingly, our approach reveals that patches with sparse density of vegetation may act as hubs for the spread of the fire.

  20. Remote sensing-based measurement of Living Environment Deprivation: Improving classical approaches with machine learning

    PubMed Central

    2017-01-01

    This paper provides evidence on the usefulness of very high spatial resolution (VHR) imagery in gathering socioeconomic information in urban settlements. We use land cover, spectral, structure and texture features extracted from a Google Earth image of Liverpool (UK) to evaluate their potential to predict Living Environment Deprivation at a small statistical area level. We also contribute to the methodological literature on the estimation of socioeconomic indices with remote-sensing data by introducing elements from modern machine learning. In addition to classical approaches such as Ordinary Least Squares (OLS) regression and a spatial lag model, we explore the potential of the Gradient Boost Regressor and Random Forests to improve predictive performance and accuracy. In addition to novel predicting methods, we also introduce tools for model interpretation and evaluation such as feature importance and partial dependence plots, or cross-validation. Our results show that Random Forest proved to be the best model with an R2 of around 0.54, followed by Gradient Boost Regressor with 0.5. Both the spatial lag model and the OLS fall behind with significantly lower performances of 0.43 and 0.3, respectively. PMID:28464010

  1. Remote sensing-based measurement of Living Environment Deprivation: Improving classical approaches with machine learning.

    PubMed

    Arribas-Bel, Daniel; Patino, Jorge E; Duque, Juan C

    2017-01-01

    This paper provides evidence on the usefulness of very high spatial resolution (VHR) imagery in gathering socioeconomic information in urban settlements. We use land cover, spectral, structure and texture features extracted from a Google Earth image of Liverpool (UK) to evaluate their potential to predict Living Environment Deprivation at a small statistical area level. We also contribute to the methodological literature on the estimation of socioeconomic indices with remote-sensing data by introducing elements from modern machine learning. In addition to classical approaches such as Ordinary Least Squares (OLS) regression and a spatial lag model, we explore the potential of the Gradient Boost Regressor and Random Forests to improve predictive performance and accuracy. In addition to novel predicting methods, we also introduce tools for model interpretation and evaluation such as feature importance and partial dependence plots, or cross-validation. Our results show that Random Forest proved to be the best model with an R2 of around 0.54, followed by Gradient Boost Regressor with 0.5. Both the spatial lag model and the OLS fall behind with significantly lower performances of 0.43 and 0.3, respectively.
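
    The cross-validation mentioned above evaluates a model on data it was not fitted to by rotating a held-out fold. A minimal sketch of how k-fold index sets can be built is given here; the function name and the choice of contiguous, unshuffled folds are assumptions made for illustration, and the abstract does not say which scheme the authors used.

```python
def k_fold_indices(n_samples: int, k: int):
    """Yield (train_indices, test_indices) pairs so that every sample is
    held out exactly once. Folds are contiguous; shuffle beforehand if the
    data are ordered."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop

folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```

    A score such as R2 would then be computed on each held-out fold and averaged across folds.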

  2. Mixed-effects models for estimating stand volume by means of small footprint airborne laser scanner data.

    Treesearch

    J. Breidenbach; E. Kublin; R. McGaughey; H.-E. Andersen; S. Reutebuch

    2008-01-01

    For this study, hierarchical data sets--in that several sample plots are located within a stand--were analyzed for study sites in the USA and Germany. The German data had an additional hierarchy as the stands are located within four distinct public forests. Fixed-effects models and mixed-effects models with a random intercept on the stand level were fit to each data...

  3. Predicting Ascospore Release of Monilinia vaccinii-corymbosi of Blueberry with Machine Learning.

    PubMed

    Harteveld, Dalphy O C; Grant, Michael R; Pscheidt, Jay W; Peever, Tobin L

    2017-11-01

    Mummy berry, caused by Monilinia vaccinii-corymbosi, causes economic losses of highbush blueberry in the U.S. Pacific Northwest (PNW). Apothecia develop from mummified berries overwintering on soil surfaces and produce ascospores that infect tissue emerging from floral and vegetative buds. Disease control currently relies on fungicides applied on a calendar basis rather than inoculum availability. To establish a prediction model for ascospore release, apothecial development was tracked in three fields, one in western Oregon and two in northwestern Washington, in 2015 and 2016. Air and soil temperature, precipitation, soil moisture, leaf wetness, relative humidity and solar radiation were monitored using in-field weather stations and Washington State University's AgWeatherNet stations. Four modeling approaches were compared: logistic regression, multivariate adaptive regression splines, artificial neural networks, and random forest. A supervised learning approach was used, with the data split into training (70%) and testing (30%) sets. The importance of environmental factors was calculated for each model separately. Soil temperature, soil moisture, and solar radiation were identified as the most important factors influencing ascospore release. Random forest models, with 78% accuracy, showed the best performance compared with the other models. The results of this research help PNW blueberry growers to optimize fungicide use and reduce production costs.

  4. Modeling and Prediction of Solvent Effect on Human Skin Permeability using Support Vector Regression and Random Forest.

    PubMed

    Baba, Hiromi; Takahara, Jun-ichi; Yamashita, Fumiyoshi; Hashida, Mitsuru

    2015-11-01

    The solvent effect on skin permeability is important for assessing the effectiveness and toxicological risk of new dermatological formulations in pharmaceuticals and cosmetics development. The solvent effect occurs by diverse mechanisms, which could be elucidated by efficient and reliable prediction models. However, such prediction models have been hampered by the small variety of permeants and mixture components archived in databases and by low predictive performance. Here, we propose a solution to both problems. We first compiled a novel large database of 412 samples from 261 structurally diverse permeants and 31 solvents reported in the literature. The data were carefully screened to ensure their collection under consistent experimental conditions. To construct a high-performance predictive model, we then applied support vector regression (SVR) and random forest (RF) with greedy stepwise descriptor selection to our database. The models were internally and externally validated. The SVR achieved higher performance statistics than RF. The (externally validated) determination coefficient, root mean square error, and mean absolute error of SVR were 0.899, 0.351, and 0.268, respectively. Moreover, because all descriptors are fully computational, our method can predict as-yet unsynthesized compounds. Our high-performance prediction model offers an attractive alternative to permeability experiments for pharmaceutical and cosmetic candidate screening and optimizing skin-permeable topical formulations.

  5. Prediction of Baseflow Index of Catchments using Machine Learning Algorithms

    NASA Astrophysics Data System (ADS)

    Yadav, B.; Hatfield, K.

    2017-12-01

    We present the results of eight machine learning techniques for predicting the baseflow index (BFI) of ungauged basins using a surrogate of catchment scale climate and physiographic data. The tested algorithms include ordinary least squares, ridge regression, least absolute shrinkage and selection operator (lasso), elasticnet, support vector machine, gradient boosted regression trees, random forests, and extremely randomized trees. Our work seeks to identify the dominant controls of BFI that can be readily obtained from ancillary geospatial databases and remote sensing measurements, such that the developed techniques can be extended to ungauged catchments. More than 800 gauged catchments spanning the continental United States were selected to develop the general methodology. The BFI calculation was based on the baseflow separated from the daily streamflow hydrograph using the HYSEP filter. The surrogate catchment attributes were compiled from multiple sources including digital elevation model, soil, landuse, and climate data, and other publicly available ancillary and geospatial data. 80% of the catchments were used to train the ML algorithms, and the remaining 20% of the catchments were used as an independent test set to measure the generalization performance of the fitted models. A k-fold cross-validation using exhaustive grid search was used to fit the hyperparameters of each model. Initial model development was based on 19 independent variables, but after variable selection and feature ranking, we generated revised sparse models of BFI prediction that are based on only six catchment attributes. These key predictive variables, selected after careful evaluation of the bias-variance tradeoff, include average catchment elevation, slope, fraction of sand, permeability, temperature, and precipitation. The most promising algorithms, exceeding an accuracy score (r-square) of 0.7 on test data, include support vector machine, gradient boosted regression trees, random forests, and extremely randomized trees. Considering both the accuracy and the computational complexity of these algorithms, we identify the extremely randomized trees as the best performing algorithm for BFI prediction in ungauged basins.
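    The hyperparameter fitting step described in this abstract (k-fold cross-validation with an exhaustive grid search) can be sketched in plain Python. This is only an illustration under stated assumptions: the one-feature ridge model, the interleaved fold assignment, and all function names are hypothetical and are not taken from the study.

```python
from typing import List, Sequence


def k_fold_indices(n_samples: int, k: int) -> List[List[int]]:
    """Split indices 0..n_samples-1 into k interleaved held-out folds."""
    return [list(range(start, n_samples, k)) for start in range(k)]


def fit_ridge_slope(xs: Sequence[float], ys: Sequence[float], penalty: float) -> float:
    """Closed-form slope of a no-intercept ridge regression y ~ slope * x."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + penalty)


def cross_validated_mse(xs: Sequence[float], ys: Sequence[float],
                        penalty: float, k: int) -> float:
    """Average held-out mean squared error over k folds for one penalty value."""
    fold_errors = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        slope = fit_ridge_slope([xs[i] for i in train], [ys[i] for i in train], penalty)
        squared = [(slope * xs[i] - ys[i]) ** 2 for i in held_out]
        fold_errors.append(sum(squared) / len(squared))
    return sum(fold_errors) / k


def grid_search(xs: Sequence[float], ys: Sequence[float],
                penalties: Sequence[float], k: int = 5) -> float:
    """Exhaustively try every penalty and return the one with the lowest CV error."""
    return min(penalties, key=lambda p: cross_validated_mse(xs, ys, p, k))
```

    In practice the held-out 20% test set would be kept out of this loop entirely, so that the grid search only ever sees the training catchments.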

  6. Large-Scale Mixed Temperate Forest Mapping at the Single Tree Level using Airborne Laser Scanning

    NASA Astrophysics Data System (ADS)

    Scholl, V.; Morsdorf, F.; Ginzler, C.; Schaepman, M. E.

    2017-12-01

    Monitoring vegetation on a single tree level is critical to understand and model a variety of processes, functions, and changes in forest systems. Remote sensing technologies are increasingly utilized to complement and upscale the field-based measurements of forest inventories. Airborne laser scanning (ALS) systems provide valuable information in the vertical dimension for effective vegetation structure mapping. Although many algorithms exist to extract single tree segments from forest scans, they are often tuned to perform well in homogeneous coniferous or deciduous areas and are not successful in mixed forests. Other methods are too computationally expensive to apply operationally. The aim of this study was to develop a single tree detection workflow using leaf-off ALS data for the canton of Aargau in Switzerland. Aargau covers an area of over 1,400km2 and features mixed forests with various development stages and topography. Forest type was classified using random forests to guide local parameter selection. Canopy height model-based treetop maxima were detected and maintained based on the relationship between tree height and window size, used as a proxy to crown diameter. Watershed segmentation was used to generate crown polygons surrounding each maximum. The location, height, and crown dimensions of single trees were derived from the ALS returns within each polygon. Validation was performed through comparison with field measurements and extrapolated estimates from long-term monitoring plots of the Swiss National Forest Inventory within the framework of the Swiss Federal Institute for Forest, Snow, and Landscape Research. This method shows promise for robust, large-scale single tree detection in mixed forests. The single tree data will aid ecological studies as well as forest management practices. Figure description: Height-normalized ALS point cloud data (top) and resulting single tree segments (bottom) on the Laegeren mountain in Switzerland.

  7. Modelling Associations between Public Understanding, Engagement and Forest Conditions in the Inland Northwest, USA

    PubMed Central

    Hartter, Joel; Stevens, Forrest R.; Hamilton, Lawrence C.; Congalton, Russell G.; Ducey, Mark J.; Oester, Paul T.

    2015-01-01

    Opinions about public lands and the actions of private non-industrial forest owners in the western United States play important roles in forested landscape management as both public and private forests face increasing risks from large wildfires, pests and disease. This work presents the responses from two surveys, a random-sample telephone survey of more than 1500 residents and a mail survey targeting owners of parcels with 10 or more acres of forest. These surveys were conducted in three counties (Wallowa, Union, and Baker) in northeast Oregon, USA. We analyze these survey data using structural equation models in order to assess how individual characteristics and understanding of forest management issues affect perceptions about forest conditions and risks associated with declining forest health on public lands. We test whether forest understanding is informed by background, beliefs, and experiences, and whether as an intervening variable it is associated with views about forest conditions on publicly managed forests. Individual background characteristics such as age, gender and county of residence have significant direct or indirect effects on our measurement of understanding. Controlling for background factors, we found that forest owners with higher self-assessed understanding, and more education about forest management, tend to hold more pessimistic views about forest conditions. Based on our results we argue that self-assessed understanding, interest in learning, and willingness to engage in extension activities together have leverage to affect perceptions about the risks posed by declining forest conditions on public lands, influence land owner actions, and affect support for public policies. 
These results also have broader implications for management of forested landscapes on public and private lands amidst changing demographics in rural communities across the Inland Northwest where migration may significantly alter the composition of forest owner goals, understanding, and support for various management actions. PMID:25671619

  8. Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies

    PubMed Central

    Theis, Fabian J.

    2017-01-01

    Epidemiological studies often utilize stratified data in which rare outcomes or exposures are artificially enriched. This design can increase precision in association tests but distorts predictions when applying classifiers on nonstratified data. Several methods correct for this so-called sample selection bias, but their performance remains unclear especially for machine learning classifiers. With an emphasis on two-phase case-control studies, we aim to assess which corrections to perform in which setting and to obtain methods suitable for machine learning techniques, especially the random forest. We propose two new resampling-based methods to resemble the original data and covariance structure: stochastic inverse-probability oversampling and parametric inverse-probability bagging. We compare all techniques for the random forest and other classifiers, both theoretically and on simulated and real data. Empirical results show that the random forest profits from only the parametric inverse-probability bagging proposed by us. For other classifiers, correction is mostly advantageous, and methods perform uniformly. We discuss consequences of inappropriate distribution assumptions and reason for different behaviors between the random forest and other classifiers. In conclusion, we provide guidance for choosing correction methods when training classifiers on biased samples. For random forests, our method outperforms state-of-the-art procedures if distribution assumptions are roughly fulfilled. We provide our implementation in the R package sambia. PMID:29312464

  9. Applications of random forest feature selection for fine-scale genetic population assignment.

    PubMed

    Sylvester, Emma V A; Bentzen, Paul; Bradbury, Ian R; Clément, Marie; Pearce, Jon; Horne, John; Beiko, Robert G

    2018-02-01

    Genetic population assignment used to inform wildlife management and conservation efforts requires panels of highly informative genetic markers and sensitive assignment tests. We explored the utility of machine-learning algorithms (random forest, regularized random forest and guided regularized random forest) compared with FST ranking for selection of single nucleotide polymorphisms (SNP) for fine-scale population assignment. We applied these methods to an unpublished SNP data set for Atlantic salmon (Salmo salar) and a published SNP data set for Alaskan Chinook salmon (Oncorhynchus tshawytscha). In each species, we identified the minimum panel size required to obtain a self-assignment accuracy of at least 90% using each method to create panels of 50-700 markers. Panels of SNPs identified using random forest-based methods performed up to 7.8 and 11.2 percentage points better than FST-selected panels of similar size for the Atlantic salmon and Chinook salmon data, respectively. Self-assignment accuracy ≥90% was obtained with panels of 670 and 384 SNPs for each data set, respectively, a level of accuracy never reached for these species using FST-selected panels. Our results demonstrate a role for machine-learning approaches in marker selection across large genomic data sets to improve assignment for management and conservation of exploited populations.

  10. Comparison of Models for the Prediction of Medical Costs of Spinal Fusion in Taiwan Diagnosis-Related Groups by Machine Learning Algorithms.

    PubMed

    Kuo, Ching-Yen; Yu, Liang-Chin; Chen, Hou-Chaung; Chan, Chien-Lung

    2018-01-01

    The aims of this study were to compare the performance of machine learning methods for the prediction of the medical costs associated with spinal fusion in terms of profit or loss in Taiwan Diagnosis-Related Groups (Tw-DRGs) and to apply these methods to explore the important factors associated with the medical costs of spinal fusion. A data set was obtained from a regional hospital in Taoyuan city in Taiwan, which contained data from 2010 to 2013 on patients of Tw-DRG49702 (posterior and other spinal fusion without complications or comorbidities). Naïve-Bayesian, support vector machines, logistic regression, C4.5 decision tree, and random forest methods were employed for prediction using WEKA 3.8.1. Five hundred thirty-two cases were categorized as belonging to the Tw-DRG49702 group. The mean medical cost was US $4,549.7, and the mean age of the patients was 62.4 years. The mean length of stay was 9.3 days. The length of stay was an important variable in terms of determining medical costs for patients undergoing spinal fusion. The random forest method had the best predictive performance in comparison to the other methods, achieving an accuracy of 84.30%, a sensitivity of 71.4%, a specificity of 92.2%, and an AUC of 0.904. Our study demonstrated that the random forest model can be employed to predict the medical costs of Tw-DRG49702, and could inform hospital strategy in terms of increasing the financial management efficiency of this operation.
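    The accuracy, sensitivity, and specificity figures quoted in this abstract all derive from a binary confusion matrix. The following minimal sketch shows the standard definitions; the counts are made-up illustrative values, not the study's data, and the function name is hypothetical. AUC is not covered because it requires ranked scores rather than counts.

```python
from typing import Dict


def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Standard confusion-matrix metrics for a two-class problem."""
    return {
        # Share of all cases classified correctly.
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        # Share of actual positives that were detected (true positive rate).
        "sensitivity": tp / (tp + fn),
        # Share of actual negatives that were correctly rejected (true negative rate).
        "specificity": tn / (tn + fp),
    }


# Illustrative counts only (assumption): 40 positives and 60 negatives.
metrics = binary_metrics(tp=30, fn=10, tn=50, fp=10)
```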

  11. Prediction of soil attributes through interpolators in a deglaciated environment with complex landforms

    NASA Astrophysics Data System (ADS)

    Schünemann, Adriano Luis; Inácio Fernandes Filho, Elpídio; Rocha Francelino, Marcio; Rodrigues Santos, Gérson; Thomazini, Andre; Batista Pereira, Antônio; Gonçalves Reynaud Schaefer, Carlos Ernesto

    2017-04-01

    Values of environmental variables at non-sampled sites can be estimated from a minimal data set through interpolation techniques. Kriging and the Random Forest classifier are examples of predictors with this aim. The objective of this work was to compare methods for the spatialization of soil attributes in a recently deglaciated environment with complex landforms. Prediction of the selected soil attributes (potassium, calcium and magnesium) in ice-free areas was tested using morphometric covariates, and using geostatistical models without these covariates. For this, 106 soil samples were collected at 0-10 cm depth in Keller Peninsula, King George Island, Maritime Antarctica. Soil chemical analysis was performed by the gravimetric method, determining values of potassium, calcium and magnesium for each sampled point. Digital terrain models (DTMs) were obtained using a Terrestrial Laser Scanner. DTMs were generated from a point cloud at spatial resolutions of 1, 5, 10, 20 and 30 m. Hence, 40 morphometric covariates were generated. Simple Kriging was performed using the R software package. The same data set, coupled with the morphometric covariates, was used to predict values of the studied attributes at non-sampled sites with the Random Forest interpolator. Only small differences were observed on the DTMs generated by the Simple Kriging and Random Forest interpolators. Also, DTMs with better spatial resolution did not improve the quality of soil attribute prediction. Results revealed that Simple Kriging can be used as an interpolator when morphometric covariates are not available, with little impact on quality. Further work is needed on techniques for predicting soil chemical attributes, especially in periglacial areas with complex landforms.

  12. Application of an imputation method for geospatial inventory of forest structural attributes across multiple spatial scales in the Lake States, U.S.A

    NASA Astrophysics Data System (ADS)

    Deo, Ram K.

    Credible spatial information characterizing the structure and site quality of forests is critical to sustainable forest management and planning, especially given the increasing demands and threats to forest products and services. Forest managers and planners are required to evaluate forest conditions over a broad range of scales, contingent on operational or reporting requirements. Traditionally, forest inventory estimates are generated via a design-based approach that involves generalizing sample plot measurements to characterize an unknown population across a larger area of interest. However, field plot measurements are costly and as a consequence spatial coverage is limited. Remote sensing technologies have shown remarkable success in augmenting limited sample plot data to generate stand- and landscape-level spatial predictions of forest inventory attributes. Further enhancement of forest inventory approaches that couple field measurements with cutting-edge remotely sensed and geospatial datasets is essential to sustainable forest management. We evaluated a novel Random Forest based k Nearest Neighbors (RF-kNN) imputation approach to couple remote sensing and geospatial data with field inventory collected by different sampling methods to generate forest inventory information across large spatial extents. The forest inventory data collected by the FIA program of the US Forest Service were integrated with optical remote sensing and other geospatial datasets to produce biomass distribution maps for a part of the Lake States and species-specific site index maps for the entire Lake States. Targeting small-area application of state-of-the-art remote sensing, LiDAR (light detection and ranging) data were integrated with field data collected by an inexpensive method, called variable plot sampling, in the Ford Forest of Michigan Tech to derive a standing volume map in a cost-effective way. The outputs of the RF-kNN imputation were compared with independent validation datasets and extant map products based on different sampling and modeling strategies. The RF-kNN modeling approach was found to be very effective, especially for large-area estimation, and produced results statistically equivalent to the field observations or the estimates derived from secondary data sources. The models are useful to resource managers for operational and strategic purposes.

  13. A Random Forest approach to predict the spatial distribution of sediment pollution in an estuarine system

    PubMed Central

    Kreakie, Betty J.; Cantwell, Mark G.; Nacci, Diane

    2017-01-01

    Modeling the magnitude and distribution of sediment-bound pollutants in estuaries is often limited by incomplete knowledge of the site and inadequate sample density. To address these modeling limitations, a decision-support tool framework was conceived that predicts sediment contamination from the sub-estuary to broader estuary extent. For this study, a Random Forest (RF) model was implemented to predict the distribution of a model contaminant, triclosan (5-chloro-2-(2,4-dichlorophenoxy)phenol) (TCS), in Narragansett Bay, Rhode Island, USA. TCS is an unregulated contaminant used in many personal care products. The RF explanatory variables were associated with TCS transport and fate (proxies) and direct and indirect environmental entry. The continuous RF TCS concentration predictions were discretized into three levels of contamination (low, medium, and high) for three different quantile thresholds. The RF model explained 63% of the variance with a minimum number of variables. Total organic carbon (TOC) (transport and fate proxy) was a strong predictor of TCS contamination causing a mean squared error increase of 59% when compared to permutations of randomized values of TOC. Additionally, combined sewer overflow discharge (environmental entry) and sand (transport and fate proxy) were strong predictors. The discretization models identified a TCS area of greatest concern in the northern reach of Narragansett Bay (Providence River sub-estuary), which was validated with independent test samples. This decision-support tool performed well at the sub-estuary extent and provided the means to identify areas of concern and prioritize bay-wide sampling. PMID:28738089
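    The importance measure mentioned in this abstract (the increase in mean squared error when a predictor's values are randomly permuted) can be sketched as follows. This is a simplified illustration, not the study's implementation: random forest packages typically compute the measure on out-of-bag samples per tree, whereas this sketch shuffles one column of a fixed data set once. The toy model, the toy data, and the function names are assumptions.

```python
import random
from typing import Callable, List, Sequence

Row = List[float]


def mse(model: Callable[[Row], float], rows: Sequence[Row],
        targets: Sequence[float]) -> float:
    """Mean squared error of model predictions against targets."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_mse_increase(model: Callable[[Row], float], rows: Sequence[Row],
                             targets: Sequence[float], column: int,
                             seed: int = 0) -> float:
    """Percent increase in MSE after shuffling the values of one column."""
    baseline = mse(model, rows, targets)
    shuffled = [r[column] for r in rows]
    random.Random(seed).shuffle(shuffled)
    # Rebuild each row with the shuffled value in place of the original one.
    permuted = [r[:column] + [v] + r[column + 1:] for r, v in zip(rows, shuffled)]
    return 100.0 * (mse(model, permuted, targets) - baseline) / baseline


# Toy setup (assumption): the "model" uses only column 0, so column 1 is irrelevant.
rows = [[float(i), float((i * 7) % 5)] for i in range(20)]
targets = [2.0 * r[0] + (0.5 if i % 2 == 0 else -0.5) for i, r in enumerate(rows)]


def model(row: Row) -> float:
    return 2.0 * row[0]
```

    Shuffling a predictor the model relies on raises the error, while shuffling an ignored predictor leaves it unchanged, which is why a large percent increase is read as a sign of a strong predictor.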

  14. Estimating Mixed Broadleaves Forest Stand Volume Using Dsm Extracted from Digital Aerial Images

    NASA Astrophysics Data System (ADS)

    Sohrabi, H.

    2012-07-01

    In the mixed old-growth broadleaf stands of the Hyrcanian forests, it is difficult to estimate stand volume at plot level from remotely sensed data when LiDAR data are absent. In this paper, a new approach is proposed and tested for estimating forest stand volume. The approach is based on the idea that forest volume can be estimated from the variation of tree heights within plots. In other words, the greater the height variation in a plot, the greater the expected stand volume. To test this idea, 120 circular 0.1 ha sample plots in a systematic random design were collected in the Tonekaon forest, located in the Hyrcanian zone. A digital surface model (DSM) measures the height of the first surface on the ground, including terrain features, trees, buildings, etc., and provides a topographic model of the earth's surface. The DSMs were extracted automatically from aerial UltraCamD images, with the ground pixel size of the extracted DSMs varying from 1 to 10 m in 1 m steps. DSMs were checked manually for probable errors. For each ground sample, the standard deviation and range of the corresponding DSM pixels were calculated. For modeling, a non-linear regression method was used. The results showed that the standard deviation of plot pixels at 5 m resolution was the most appropriate data for modeling. The relative bias and RMSE of estimation were 5.8 and 49.8 percent, respectively. Compared to other approaches for estimating stand volume based on passive remote sensing data in mixed broadleaf forests, these results are more encouraging. One major problem with this method occurs when the tree canopy cover is totally closed. In this situation, the standard deviation of height is low while stand volume is high. Future studies could examine applying forest stratification.

  15. Geometric Accuracy Analysis of Worlddem in Relation to AW3D30, Srtm and Aster GDEM2

    NASA Astrophysics Data System (ADS)

    Bayburt, S.; Kurtak, A. B.; Büyüksalih, G.; Jacobsen, K.

    2017-05-01

    In a project area close to Istanbul, the quality of WorldDEM, AW3D30, SRTM DSM and ASTER GDEM2 has been analyzed in relation to a reference aerial LiDAR DEM and to each other. The random and the systematic height errors have been separated. The absolute offset for all height models in X, Y and Z is within expectations. The shifts have been accounted for in advance to obtain a satisfactory estimate of the random error component. All height models are influenced by some tilts, different in size. In addition, systematic deformations can be seen that do not influence the standard deviation much. The delivery of WorldDEM includes information about the height error map, which is based on the interferometric phase errors, and the number and location of coverages from different orbits. A dependency of the height accuracy on the height error map information and the number of coverages can be seen, but it is smaller than expected. WorldDEM is more accurate than the other investigated height models, and with 10 m point spacing it includes more morphologic details, visible in contour lines. The morphologic details are close to the details based on the LiDAR digital surface model (DSM). As usual, a dependency of the accuracy on the terrain slope can be seen. In forest areas, the canopy definition of InSAR X- and C-band height models, as well as of the height models based on optical satellite images, is not the same as the height definition by LiDAR. In addition, the interferometric phase uncertainty over forest areas is larger. Both effects lead to lower height accuracy in forest areas, also visible in the height error map.

  16. Modeling Mediterranean forest structure using airborne laser scanning data

    NASA Astrophysics Data System (ADS)

    Bottalico, Francesca; Chirici, Gherardo; Giannini, Raffaello; Mele, Salvatore; Mura, Matteo; Puxeddu, Michele; McRoberts, Ronald E.; Valbuena, Ruben; Travaglini, Davide

    2017-05-01

    The conservation of biological diversity is recognized as a fundamental component of sustainable development, and forests contribute greatly to its preservation. Structural complexity increases the potential biological diversity of a forest by creating multiple niches that can host a wide variety of species. To facilitate greater understanding of the contributions of forest structure to forest biological diversity, we modeled relationships between 14 forest structure variables and airborne laser scanning (ALS) data for two Italian study areas representing two common Mediterranean forests, conifer plantations and coppice oaks subjected to irregular intervals of unplanned and non-standard silvicultural interventions. The objectives were twofold: (i) to compare model prediction accuracies when using two types of ALS metrics, echo-based metrics and canopy height model (CHM)-based metrics, and (ii) to construct inferences in the form of confidence intervals for large area structural complexity parameters. Our results showed that the effects of the two study areas on accuracies were greater than the effects of the two types of ALS metrics. In particular, accuracies were less for the more complex study area in terms of species composition and forest structure. However, accuracies achieved using the echo-based metrics were only slightly greater than when using the CHM-based metrics, thus demonstrating that both options yield reliable and comparable results. Accuracies were greatest for dominant height (Hd) (R2 = 0.91; RMSE% = 8.2%) and mean height weighted by basal area (R2 = 0.83; RMSE% = 10.5%) when using the echo-based metrics, 99th percentile of the echo height distribution and interquantile distance. For the forested area, the generalized regression (GREG) estimate of mean Hd was similar to the simple random sampling (SRS) estimate, 15.5 m for GREG and 16.2 m SRS. 
Further, the GREG estimator with standard error of 0.10 m was considerably more precise than the SRS estimator with standard error of 0.69 m.

  17. Recursive random forest algorithm for constructing multilayered hierarchical gene regulatory networks that govern biological pathways.

    PubMed

    Deng, Wenping; Zhang, Kui; Busov, Victor; Wei, Hairong

    2017-01-01

    Present knowledge indicates a multilayered hierarchical gene regulatory network (ML-hGRN) often operates above a biological pathway. Although the ML-hGRN is very important for understanding how a pathway is regulated, there is almost no computational algorithm for directly constructing ML-hGRNs. A backward elimination random forest (BWERF) algorithm was developed for constructing the ML-hGRN operating above a biological pathway. For each pathway gene, the BWERF used a random forest model to calculate the importance values of all transcription factors (TFs) to this pathway gene recursively, with a portion (e.g. 1/10) of the least important TFs being excluded in each round of modeling, during which the importance values of all TFs to the pathway gene were updated and ranked, until only one TF remained in the list. The above procedure is termed BWERF. After that, the importance values of a TF to all pathway genes were aggregated and fitted to a Gaussian mixture model to determine the TF retention for the regulatory layer immediately above the pathway layer. The acquired TFs at the secondary layer were then set to be the new bottom layer to infer the next upper layer, and this process was repeated until an ML-hGRN with the expected layers was obtained. BWERF improved the accuracy for constructing ML-hGRNs because it used backward elimination to exclude the noise genes, and aggregated the individual importance values for determining the TF retention. We validated the BWERF by using it for constructing ML-hGRNs operating above the mouse pluripotency maintenance pathway and the Arabidopsis lignocellulosic pathway. Compared to GENIE3, BWERF showed an improvement in recognizing authentic TFs regulating a pathway. Compared to the bottom-up Gaussian graphical model algorithm we developed for constructing ML-hGRNs, the BWERF can construct ML-hGRNs with significantly reduced edges that enable biologists to choose the implicit edges for experimental validation.
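    The backward elimination loop described in this abstract can be sketched in a few lines. This is a hedged illustration of the general idea only: `importance_fn` is a hypothetical stand-in for refitting a random forest on the remaining TFs and reading off their importance values, and the rounding rule (drop at least one TF per round, never the last one) is an assumption the abstract does not specify.

```python
import math
from typing import Callable, Dict, List, Sequence


def backward_eliminate(
    candidates: Sequence[str],
    importance_fn: Callable[[List[str]], Dict[str, float]],
    drop_fraction: float = 0.1,
) -> List[str]:
    """Return candidates ordered from first eliminated to last survivor."""
    remaining = list(candidates)
    eliminated: List[str] = []
    while len(remaining) > 1:
        # Recompute importance using only the candidates still in play.
        scores = importance_fn(remaining)
        ranked = sorted(remaining, key=lambda name: scores[name])
        # Drop a fixed fraction per round: at least one, but never the last one.
        n_drop = max(1, math.floor(len(remaining) * drop_fraction))
        n_drop = min(n_drop, len(remaining) - 1)
        dropped = ranked[:n_drop]
        eliminated.extend(dropped)
        dropped_set = set(dropped)
        remaining = [name for name in remaining if name not in dropped_set]
    return eliminated + remaining


# Toy importance function (assumption): fixed weights instead of a fitted forest.
TOY_WEIGHTS = {"TF_a": 0.9, "TF_b": 0.1, "TF_c": 0.5, "TF_d": 0.3}


def toy_importance(names: List[str]) -> Dict[str, float]:
    return {name: TOY_WEIGHTS[name] for name in names}


elimination_order = backward_eliminate(list(TOY_WEIGHTS), toy_importance)
```

    With a real model the importance values would change from round to round as TFs are removed, which is the reason for re-ranking inside the loop rather than ranking once up front.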

  18. Fire detection system using random forest classification for image sequences of complex background

    NASA Astrophysics Data System (ADS)

    Kim, Onecue; Kang, Dong-Joong

    2013-06-01

    We present a fire alarm system based on image processing that detects fire accidents in various environments. To reduce false alarms that frequently appeared in earlier systems, we combined image features including color, motion, and blinking information. We specifically define the color conditions of fires in hue, saturation and value, and RGB color space. Fire features are represented as intensity variation, color mean and variance, motion, and image differences. Moreover, blinking fire features are modeled by using crossing patches. We propose an algorithm that classifies patches into fire or nonfire areas by using random forest supervised learning. We design an embedded surveillance device made with acrylonitrile butadiene styrene housing for stable fire detection in outdoor environments. The experimental results show that our algorithm works robustly in complex environments and is able to detect fires in real time.

  19. Combined rule extraction and feature elimination in supervised classification.

    PubMed

    Liu, Sheng; Patel, Ronak Y; Daga, Pankaj R; Liu, Haining; Fu, Gang; Doerksen, Robert J; Chen, Yixin; Wilkins, Dawn E

    2012-09-01

    There are a vast number of biology related research problems involving a combination of multiple sources of data to achieve a better understanding of the underlying problems. It is important to select and interpret the most important information from these sources. Thus it will be beneficial to have a good algorithm to simultaneously extract rules and select features for better interpretation of the predictive model. We propose an efficient algorithm, Combined Rule Extraction and Feature Elimination (CRF), based on 1-norm regularized random forests. CRF simultaneously extracts a small number of rules generated by random forests and selects important features. We applied CRF to several drug activity prediction and microarray data sets. CRF is capable of producing performance comparable with state-of-the-art prediction algorithms using a small number of decision rules. Some of the decision rules are biologically significant.

  20. Predicting the Spatial Distribution of Organic Contaminants in an Estuarine System using a Random Forest Approach

    EPA Science Inventory

    Modeling the magnitude and distribution of estuarine sediment contamination by pollutants of historic (e.g. PCB) and emerging concern (e.g., personal care products, PCP) is often limited by incomplete site knowledge and inadequate sediment contamination sampling. We tested a mode...

  1. Random forest models for the probable biological condition of streams and rivers in the USA

    EPA Science Inventory

    The National Rivers and Streams Assessment (NRSA) is a probability based survey conducted by the US Environmental Protection Agency and its state and tribal partners. It provides information on the ecological condition of the rivers and streams in the conterminous USA, and the ex...

  2. Prediction of aquatic toxicity mode of action using linear discriminant and random forest models

    EPA Science Inventory

    The ability to determine the mode of action (MOA) for a diverse group of chemicals is a critical part of ecological risk assessment and chemical regulation. However, existing MOA assignment approaches in ecotoxicology have been limited to a relatively few MOAs, have high uncertai...

  3. Analysis of perception and community participation in forest management at KPHP model unit VII-Hulu Sarolangun, Jambi Province

    NASA Astrophysics Data System (ADS)

    Purnomo, B.; Anggoro, S.; Izzati, M.

    2017-06-01

    The concept of forest management at the site level, in the form of forest management units (KPH), is implemented by the government in an effort to improve forest governance in Indonesia. Forest management must ensure fairness for all stakeholders, especially indigenous and local communities that have been the most marginalized groups. Local communities have become an important part of the efforts to achieve sustainable forest management. The perceptions of the public, as one of the stakeholders in forest management, need to be analyzed to determine their perspectives on the forest. This study aimed to analyze the perception and the level of community participation in forest management activities in KPHP Model Unit VII-Hulu Sarolangun, as well as to examine the relationship between these two variables. The perception variable is divided into three categories (good, moderate, and bad), and the participation variable is likewise divided into three categories (high, medium, and low). Data were obtained through semi-structured interviews with key informants and questionnaires administered to randomly selected respondents. Statistical analysis was conducted to determine whether perception and participation differ between the two villages and whether perception is related to participation. The results showed that 90.16% of people have a good perception and the remaining 9.84% have a moderate perception. In general, community participation is at a low level (76.17%), and only 1.55% had a high participation rate. The analysis showed differences in levels of participation between the two villages and no relationship between perception and the level of community participation in forest management. The results of this study can be taken into consideration by KPHP and other stakeholders in forest management policy in the KPHP region.

  4. Association of extinction risk of saproxylic beetles with ecological degradation of forests in Europe.

    PubMed

    Seibold, Sebastian; Brandl, Roland; Buse, Jörn; Hothorn, Torsten; Schmidl, Jürgen; Thorn, Simon; Müller, Jörg

    2015-04-01

    To reduce future loss of biodiversity and to allocate conservation funds effectively, the major drivers behind large-scale extinction processes must be identified. A promising approach is to link the red-list status of species and specific traits that connect species of functionally important taxa or guilds to resources they rely on. Such traits can be used to detect the influence of anthropogenic ecosystem changes and conservation efforts on species, which allows for practical recommendations for conservation. We modeled the German Red List categories as an ordinal index of extinction risk of 1025 saproxylic beetles with a proportional-odds linear mixed-effects model for ordered categorical responses. In this model, we estimated fixed effects for intrinsic traits characterizing species biology, required resources, and distribution with phylogenetically correlated random intercepts. The model also allowed predictions of extinction risk for species with no red-list category. Our model revealed a higher extinction risk for lowland and large species as well as for species that rely on wood of large diameter, broad-leaved trees, or open canopy. These results mirror well the ecological degradation of European forests over the last centuries caused by modern forestry, that is the conversion of natural broad-leaved forests to dense conifer-dominated forests and the loss of old growth and dead wood. Therefore, conservation activities aimed at saproxylic beetles in all types of forests in Central and Western Europe should focus on lowlands, and habitat management of forest stands should aim at increasing the amount of dead wood of large diameter, dead wood of broad-leaved trees, and dead wood in sunny areas. © 2014 Society for Conservation Biology.

  5. Meta-Analysis of Land Use / Land Cover Change Factors in the Conterminous US and Prediction of Potential Working Timberlands in the US South from FIA Inventory Plots and NLCD Cover Maps

    NASA Astrophysics Data System (ADS)

    Jeuck, James A.

    This dissertation consists of research projects related to forest land use / land cover (LULC): (1) factors predicting LULC change and (2) methodology to predict particular forest use, or "potential working timberland" (PWT), from current forms of land data. The first project resulted in a published paper, a meta-analysis of 64 econometric models from 47 studies predicting forest land use changes. The response variables, representing some form of forest land change, were organized into four groups: forest conversion to agriculture (F2A), forestland to development (F2D), forestland to non-forested (F2NF) and undeveloped (including forestland) to developed (U2D) land. Over 250 independent econometric variables were identified, from 21 F2A models, 21 F2D models, 12 F2NF models, and 10 U2D models. These variables were organized into a hierarchy of 119 independent variable groups, 15 categories, and 4 econometric drivers suitable for conducting simple vote count statistics. Vote counts were summarized at the independent variable group level and formed into ratios estimating the predictive success of each variable group. Two ratio estimates were developed based on (1) proportion of times independent variables successfully achieved statistical significance (p ≤ 0.10), and (2) proportion of times independent variables successfully met the original researchers' expectations. In F2D models, popular independent variables such as population, income, and urban proximity often achieved statistical significance. In F2A models, popular independent variables such as forest and agricultural rents and costs, governmental programs, and site quality often achieved statistical significance. In U2D models, successful independent variables included urban rents and costs, zoning issues concerning forestland loss, site quality, urban proximity, population, and income. In F2NF models, high-success variables were found to be agricultural rents, site quality, population, and income.
This meta-analysis provides insight into the general success of econometric independent variables for future forest use or cover change research. The second part of this dissertation developed a method for predicting area estimates and spatial distribution of PWT in the US South. This technique determined land use from USFS Forest Inventory and Analysis (FIA) and land cover from the National Land Cover Database (NLCD). Three dependent variable forms (DV Forms) were derived from the FIA data: DV Form 1, timberland, other; DV Form 2, short timberland, tall timberland, agriculture, other; and DV Form 3, short hardwood (HW) timberland, tall HW timberland, short softwood (SW) timberland, tall SW timberland, agriculture, other. The prediction accuracy of each DV Form was investigated using both random forest model and logistic regression model specifications and data optimization techniques. Model verification employing a "leave-group-out" Monte Carlo simulation determined the selection of a stratified version of the random forest model using one-year NLCD observations with an overall accuracy of 0.53-0.94. The lower accuracy side of the range was when predictions were made from an aggregated NLCD land cover class "grass_shrub". The selected model specification was run using 2011 NLCD and the other predictor variables to produce three levels of timberland prediction and probability maps for the US South. Spatial masks removed areas unlikely to be working forests (protected and urbanized lands) resulting in PWT maps. The area of the resulting maps compared well with USFS area estimates and masked PWT maps and had an 8-11% reduction of the USFS timberland estimate for the US South compared to the DV Form. 
Change analysis of the 2011 NLCD to PWT showed (1) the majority of the short timberland came from NLCD grass_shrub; (2) the majority of NLCD grass_shrub predicted into tall timberland, and (3) NLCD grass_shrub was more strongly associated with timberland in the Coastal Plain. Resulting map products provide practical analytical tools for those interested in studying the area and distribution of PWT in the US South.
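
    The "leave-group-out" Monte Carlo verification mentioned in this abstract is usually understood as repeated random train/test splitting. The Python sketch below shows only that resampling loop, under stated assumptions: the stand-in model simply predicts the most common training class (the dissertation used random forest and logistic regression models), and the class proportions, the 25% hold-out fraction, and the number of repeats are invented for illustration.

```python
import random
from collections import Counter


def majority_fit(train_labels):
    """Stand-in model (assumption): always predict the most common class."""
    return Counter(train_labels).most_common(1)[0][0]


def monte_carlo_cv(labels, n_repeats=100, test_frac=0.25, seed=0):
    """Repeated random train/test splits; returns one accuracy per split."""
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    n_test = max(1, int(len(idx) * test_frac))
    accuracies = []
    for _ in range(n_repeats):
        rng.shuffle(idx)                      # new random partition each time
        test, train = idx[:n_test], idx[n_test:]
        predicted = majority_fit([labels[i] for i in train])
        hits = sum(labels[i] == predicted for i in test)
        accuracies.append(hits / n_test)
    return accuracies


# Toy two-class response in the spirit of DV Form 1 (proportions made up).
labels = ["timberland"] * 70 + ["other"] * 30
acc = monte_carlo_cv(labels)
mean_acc = sum(acc) / len(acc)
print(min(acc), mean_acc, max(acc))
```

    Reporting the spread of the per-split accuracies, not only their mean, is what gives a range such as the 0.53-0.94 quoted above its meaning.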

  6. Comparing spatial regression to random forests for large environmental data sets

    EPA Science Inventory

    Environmental data may be “large” due to number of records, number of covariates, or both. Random forests has a reputation for good predictive performance when using many covariates, whereas spatial regression, when using reduced rank methods, has a reputatio...

  7. Modeling contemporary climate profiles of whitebark pine (Pinus albicaulis) and predicting responses to global warming

    Treesearch

    Marcus V. Warwell; Gerald E. Rehfeldt; Nicholas L. Crookston

    2006-01-01

    The Random Forests multiple regression tree was used to develop an empirically-based bioclimate model for the distribution of Pinus albicaulis (whitebark pine) in western North America, latitudes 31° to 51° N and longitudes 102° to 125° W. Independent variables included 35 simple expressions of temperature and precipitation and their interactions....

  8. Properties of the endogenous post-stratified estimator using a random forests model

    Treesearch

    John Tipton; Jean Opsomer; Gretchen G. Moisen

    2012-01-01

    Post-stratification is used in survey statistics as a method to improve variance estimates. In traditional post-stratification methods, the variable on which the data is being stratified must be known at the population level. In many cases this is not possible, but it is possible to use a model to predict values using covariates, and then stratify on these predicted...

  9. Computed tomography synthesis from magnetic resonance images in the pelvis using multiple random forests and auto-context features

    NASA Astrophysics Data System (ADS)

    Andreasen, Daniel; Edmund, Jens M.; Zografos, Vasileios; Menze, Bjoern H.; Van Leemput, Koen

    2016-03-01

    In radiotherapy treatment planning that is only based on magnetic resonance imaging (MRI), the electron density information usually obtained from computed tomography (CT) must be derived from the MRI by synthesizing a so-called pseudo CT (pCT). This is a non-trivial task since MRI intensities are neither uniquely nor quantitatively related to electron density. Typical approaches involve either a classification or regression model requiring specialized MRI sequences to solve intensity ambiguities, or an atlas-based model necessitating multiple registrations between atlases and subject scans. In this work, we explore a machine learning approach for creating a pCT of the pelvic region from conventional MRI sequences without using atlases. We use a random forest provided with information about local texture, edges and spatial features derived from the MRI. This helps to solve intensity ambiguities. Furthermore, we use the concept of auto-context by sequentially training a number of classification forests to create and improve context features, which are finally used to train a regression forest for pCT prediction. We evaluate the pCT quality in terms of the voxel-wise error and the radiologic accuracy as measured by water-equivalent path lengths. We compare the performance of our method against two baseline pCT strategies, which either set all MRI voxels in the subject equal to the CT value of water, or in addition transfer the bone volume from the real CT. We show an improved performance compared to both baseline pCTs suggesting that our method may be useful for MRI-only radiotherapy.

  10. Estimating Forest Vertical Structure from Multialtitude, Fixed-Baseline Radar Interferometric and Polarimetric Data

    NASA Technical Reports Server (NTRS)

    Treuhaft, Robert N.; Law, Beverly E.; Siqueira, Paul R.

    2000-01-01

    Parameters describing the vertical structure of forests, for example tree height, height-to-base-of-live-crown, underlying topography, and leaf area density, bear on land-surface, biogeochemical, and climate modeling efforts. Single, fixed-baseline interferometric synthetic aperture radar (INSAR) normalized cross-correlations constitute two observations from which to estimate forest vertical structure parameters: Cross-correlation amplitude and phase. Multialtitude INSAR observations increase the effective number of baselines potentially enabling the estimation of a larger set of vertical-structure parameters. Polarimetry and polarimetric interferometry can further extend the observation set. This paper describes the first acquisition of multialtitude INSAR for the purpose of estimating the parameters describing a vegetated land surface. These data were collected over ponderosa pine in central Oregon near longitude and latitude -121 37 25 and 44 29 56. The JPL interferometric TOPSAR system was flown at the standard 8-km altitude, and also at 4-km and 2-km altitudes, in a race track. A reference line including the above coordinates was maintained at 35 deg for both the north-east heading and the return southwest heading, at all altitudes. In addition to the three altitudes for interferometry, one line was flown with full zero-baseline polarimetry at the 8-km altitude. A preliminary analysis of part of the data collected suggests that they are consistent with one of two physical models describing the vegetation: 1) a single-layer, randomly oriented forest volume with a very strong ground return or 2) a multilayered randomly oriented volume; a homogeneous, single-layer model with no ground return cannot account for the multialtitude correlation amplitudes. Below the inconsistency of the data with a single-layer model is followed by analysis scenarios which include either the ground or a layered structure. 
The ground returns suggested by this preliminary analysis seem too strong to be plausible, but parameters describing a two-layer model compare reasonably well to a field-measured probability distribution of tree heights in the area.

  11. Mapping risk for nest predation on a barrier island

    USGS Publications Warehouse

    Hackney, Amanda D.; Baldwin, Robert F.; Jodice, Patrick G.R.

    2013-01-01

    Barrier islands and coastal beach systems provide nesting habitat for marine and estuarine turtles. Densely settled coastal areas may subsidize nest predators. Our purpose was to inform conservation by providing a greater understanding of habitat-based risk factors for nest predation, for an estuarine turtle. We expected that habitat conditions at predated nests would differ from random locations at two spatial extents. We developed and validated an island-wide model for the distribution of predated Diamondback terrapin nests using locations of 198 predated nests collected during exhaustive searches at Fisherman Island National Wildlife Refuge, USA. We used aerial photographs to identify all areas of possible nesting habitat and searched each and surrounding environments for nests, collecting location and random-point microhabitat data. We built models for the probability of finding a predated nest using an equal number of random points and validated them with a reserve set (N = 67). Five variables in 9 a priori models were used and the best selected model (AIC weight 0.98) reflected positive associations with sand patches near marshes and roadways. Model validation had an average capture rate of predated nests of 84.14 % (26.17–97.38 %, Q1 77.53 %, median 88.07 %, Q3 95.08 %). Microhabitat selection results suggest that nests placed at the edges of sand patches adjacent to upland shrub/forest and marsh systems are vulnerable to predation. Forests and marshes provide cover and alternative resources for predators and roadways provide access; a suggestion is to focus nest protection efforts on the edges of dunes, near dense vegetation and roads.

  12. Lawsuit lead time prediction: Comparison of data mining techniques based on categorical response variable.

    PubMed

    Gruginskie, Lúcia Adriana Dos Santos; Vaccaro, Guilherme Luís Roehe

    2018-01-01

    The quality of the judicial system of a country can be verified by the overall length of time of lawsuits, or the lead time. When the lead time is excessive, a country's economy can be affected, leading to the adoption of measures such as the creation of the Saturn Center in Europe. Although there are performance indicators to measure the lead time of lawsuits, the analysis and the fit of prediction models are still underdeveloped themes in the literature. To contribute to this subject, this article compares different prediction models according to their accuracy, sensitivity, specificity, precision, and F1 measure. The database used was from TRF4 (the Tribunal Regional Federal da 4a Região), a federal court in southern Brazil, and corresponds to the 2nd Instance civil lawsuits completed in 2016. The models were fitted using support vector machine, naive Bayes, random forests, and neural network approaches with categorical predictor variables. The lead time of the 2nd Instance judgment, measured in days and categorized in bands, was selected as the response variable. The comparison among the models showed that the support vector machine and random forest approaches produced measurements that were superior to those of the other models. The evaluation of the models was made using k-fold cross-validation similar to that applied to the test models.
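
    The five measures named in this abstract all derive from the cells of a confusion matrix. The following Python sketch computes them for a single class treated as "positive"; for a banded, multi-class response like the one described, they would typically be computed one band against the rest. The function name and the counts are illustrative assumptions, not values from the study.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard measures from confusion-matrix counts.

    Assumes every denominator is non-zero (both classes occur and the
    classifier predicts the positive class at least once).
    """
    sensitivity = tp / (tp + fn)              # recall, true positive rate
    specificity = tn / (tn + fp)              # true negative rate
    precision = tp / (tp + fp)                # positive predictive value
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Made-up counts for one lead-time band versus all other bands.
m = binary_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

    Because F1 is the harmonic mean of precision and sensitivity, it falls below their arithmetic mean whenever the two differ, which makes it a stricter summary for unbalanced bands.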

  13. Data mining methods in the prediction of Dementia: A real-data comparison of the accuracy, sensitivity and specificity of linear discriminant analysis, logistic regression, neural networks, support vector machines, classification trees and random forests

    PubMed Central

    2011-01-01

    Background: Dementia and cognitive impairment associated with aging are a major medical and social concern. Neuropsychological testing is a key element in the diagnostic procedures of Mild Cognitive Impairment (MCI), but presently has limited value in the prediction of progression to dementia. We advance the hypothesis that newer statistical classification methods derived from data mining and machine learning, such as Neural Networks, Support Vector Machines and Random Forests, can improve the accuracy, sensitivity and specificity of predictions obtained from neuropsychological testing. Seven nonparametric classifiers derived from data mining methods (Multilayer Perceptrons Neural Networks, Radial Basis Function Neural Networks, Support Vector Machines, CART, CHAID and QUEST Classification Trees and Random Forests) were compared to three traditional classifiers (Linear Discriminant Analysis, Quadratic Discriminant Analysis and Logistic Regression) in terms of overall classification accuracy, specificity, sensitivity, area under the ROC curve and Press' Q. Model predictors were 10 neuropsychological tests currently used in the diagnosis of dementia. Statistical distributions of classification parameters obtained from a 5-fold cross-validation were compared using Friedman's nonparametric test. Results: Press' Q test showed that all classifiers performed better than chance alone (p < 0.05). Support Vector Machines showed the largest overall classification accuracy (median (Me) = 0.76) and area under the ROC (Me = 0.90). However, this method showed high specificity (Me = 1.0) but low sensitivity (Me = 0.3). Random Forest ranked second in overall accuracy (Me = 0.73), with high area under the ROC (Me = 0.73), specificity (Me = 0.73) and sensitivity (Me = 0.64). Linear Discriminant Analysis also showed acceptable overall accuracy (Me = 0.66), with acceptable area under the ROC (Me = 0.72), specificity (Me = 0.66) and sensitivity (Me = 0.64).
The remaining classifiers showed overall classification accuracy above a median value of 0.63, but for most of them sensitivity was around or even lower than a median value of 0.5. Conclusions: When taking into account sensitivity, specificity and overall classification accuracy, Random Forests and Linear Discriminant Analysis rank first among all the classifiers tested in the prediction of dementia using several neuropsychological tests. These methods may be used to improve the accuracy, sensitivity and specificity of dementia predictions from neuropsychological testing. PMID:21849043
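
    The abstract relies on Press' Q to test whether a classifier beats chance but does not give the statistic. A minimal Python sketch follows, assuming the form in which it is commonly stated in multivariate analysis textbooks, Q = (N - nK)^2 / (N(K - 1)), with N cases, n correct classifications and K groups, compared against the chi-square critical value with one degree of freedom. The counts below are invented and do not come from the study.

```python
# Chi-square critical value, 1 degree of freedom, alpha = 0.05.
CHI2_CRIT_1DF_05 = 3.841


def press_q(n_total, n_correct, n_groups):
    """Press' Q as commonly stated: (N - n*K)^2 / (N * (K - 1))."""
    return (n_total - n_correct * n_groups) ** 2 / (n_total * (n_groups - 1))


# Made-up example: 100 cases, 75 classified correctly, 2 diagnostic groups.
q = press_q(n_total=100, n_correct=75, n_groups=2)
better_than_chance = q > CHI2_CRIT_1DF_05
print(q, better_than_chance)
```

    When the number of correct classifications equals the chance expectation N/K, the numerator vanishes and Q is zero. The statistic is two-sided in form, so a classifier that does much worse than chance also yields a large Q.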

  14. Predicting membrane protein types using various decision tree classifiers based on various modes of general PseAAC for imbalanced datasets.

    PubMed

    Sankari, E Siva; Manimegalai, D

    2017-12-21

    Predicting membrane protein types is an important and challenging research area in bioinformatics and proteomics. Traditional biophysical methods are used to classify membrane protein types. Owing to the large number of uncharacterized protein sequences in databases, traditional methods are very time consuming, expensive and susceptible to errors. Hence, it is highly desirable to develop a robust, reliable, and efficient method to predict membrane protein types. Imbalanced datasets and large datasets are often handled well by decision tree classifiers. Because imbalanced datasets are used, the performance of various decision tree classifiers such as Decision Tree (DT), Classification And Regression Tree (CART), C4.5, Random tree and REP (Reduced Error Pruning) tree, and of ensemble methods such as Adaboost, RUS (Random Under Sampling) boost, Rotation forest and Random forest, is analysed. Among the various decision tree classifiers, Random forest performs well in less time, with a good accuracy of 96.35%. Another inference is that the RUS boost decision tree classifier is able to classify one or two samples in the class with very few samples, while the other classifiers such as DT, Adaboost, Rotation forest and Random forest are not sensitive to the classes with fewer samples. The performance of decision tree classifiers is also compared with SVM (Support Vector Machine) and Naive Bayes classifiers. Copyright © 2017 Elsevier Ltd. All rights reserved.

  15. Identification by random forest method of HLA class I amino acid substitutions associated with lower survival at day 100 in unrelated donor hematopoietic cell transplantation.

    PubMed

    Marino, S R; Lin, S; Maiers, M; Haagenson, M; Spellman, S; Klein, J P; Binkowski, T A; Lee, S J; van Besien, K

    2012-02-01

    The identification of important amino acid substitutions associated with low survival in hematopoietic cell transplantation (HCT) is hampered by the large number of observed substitutions compared with the small number of patients available for analysis. Random forest analysis is designed to address these limitations. We studied 2107 HCT recipients with good or intermediate risk hematological malignancies to identify HLA class I amino acid substitutions associated with reduced survival at day 100 post transplant. Random forest analysis and traditional univariate and multivariate analyses were used. Random forest analysis identified amino acid substitutions in 33 positions that were associated with reduced 100-day survival, including HLA-A 9, 43, 62, 63, 76, 77, 95, 97, 114, 116, 152, 156, 166 and 167; HLA-B 97, 109, 116 and 156; and HLA-C 6, 9, 11, 14, 21, 66, 77, 80, 95, 97, 99, 116, 156, 163 and 173. In all, 13 had been previously reported by other investigators using classical biostatistical approaches. Using the same data set, traditional multivariate logistic regression identified only five amino acid substitutions associated with lower day 100 survival. Random forest analysis is a novel statistical methodology for analysis of HLA mismatching and outcome studies, capable of identifying important amino acid substitutions missed by other methods.

  16. Complex mountain terrain and disturbance history drive variation in forest aboveground live carbon density in the western Oregon Cascades, USA

    PubMed Central

    Zald, Harold S.J.; Spies, Thomas A.; Seidl, Rupert; Pabst, Robert J.; Olsen, Keith A.; Steel, E. Ashley

    2016-01-01

    Forest carbon (C) density varies tremendously across space due to the inherent heterogeneity of forest ecosystems. Variation of forest C density is especially pronounced in mountainous terrain, where environmental gradients are compressed and vary at multiple spatial scales. Additionally, the influence of environmental gradients may vary with forest age and developmental stage, an important consideration as forest landscapes often have a diversity of stand ages from past management and other disturbance agents. Quantifying forest C density and its underlying environmental determinants in mountain terrain has remained challenging because many available data sources lack the spatial grain and ecological resolution needed at both stand and landscape scales. The objective of this study was to determine if environmental factors influencing aboveground live carbon (ALC) density differed between young versus old forests. We integrated aerial light detection and ranging (lidar) data with 702 field plots to map forest ALC density at a grain of 25 m across the H.J. Andrews Experimental Forest, a 6369 ha watershed in the Cascade Mountains of Oregon, USA. We used linear regressions, random forest ensemble learning (RF) and sequential autoregressive modeling (SAR) to reveal how mapped forest ALC density was related to climate, topography, soils, and past disturbance history (timber harvesting and wildfires). ALC increased with stand age in young managed forests, with much greater variation of ALC in relation to years since wildfire in old unmanaged forests. Timber harvesting was the most important driver of ALC across the entire watershed, despite occurring on only 23% of the landscape. More variation in forest ALC density was explained in models of young managed forests than in models of old unmanaged forests. 
Besides stand age, ALC density in young managed forests was driven by factors influencing site productivity, whereas variation in ALC density in old unmanaged forests was also affected by finer scale topographic conditions associated with sheltered sites. Past wildfires only had a small influence on current ALC density, which may be a result of long times since fire and/or prevalence of non-stand replacing fire. Our results indicate that forest ALC density depends on a suite of multi-scale environmental drivers mediated by complex mountain topography, and that these relationships are dependent on stand age. The high and context-dependent spatial variability of forest ALC density has implications for quantifying forest carbon stores, establishing upper bounds of potential carbon sequestration, and scaling field data to landscape and regional scales. PMID:27041818

  17. Forest fire spatial pattern analysis in Galicia (NW Spain).

    PubMed

    Fuentes-Santos, I; Marey-Pérez, M F; González-Manteiga, W

    2013-10-15

    Knowledge of fire behaviour is of key importance in forest management. In the present study, we analysed the spatial structure of forest fire with spatial point pattern analysis and inference techniques recently developed in the Spatstat package of R. Wildfires have been the primary threat to Galician forests in recent years. The district of Fonsagrada-Ancares is one of the most seriously affected by fire in the region and, therefore, the central focus of the study. Our main goal was to determine the spatial distribution of ignition points to model and predict fire occurrence. These data are of great value in establishing enhanced fire prevention and fire fighting plans. We found that the spatial distribution of wildfires is not random and that fire occurrence may depend on ownership conflicts. We also found positive interaction between small and large fires and spatial independence between wildfires in consecutive years. Copyright © 2013 Elsevier Ltd. All rights reserved.

  18. Estimating aboveground forest biomass carbon and fire consumption in the U.S. Utah High Plateaus using data from the Forest Inventory and Analysis program, Landsat, and LANDFIRE

    USGS Publications Warehouse

    Chen, Xuexia; Liu, Shuguang; Zhu, Zhiliang; Vogelmann, James E.; Li, Zhengpeng; Ohlen, Donald O.

    2011-01-01

    The concentrations of CO2 and other greenhouse gases in the atmosphere have been increasing and greatly affecting global climate and socio-economic systems. Actively growing forests are generally considered to be a major carbon sink, but forest wildfires lead to large releases of biomass carbon into the atmosphere. Aboveground forest biomass carbon (AFBC), an important ecological indicator, and fire-induced carbon emissions at regional scales are highly relevant to forest sustainable management and climate change. It is challenging to accurately estimate the spatial distribution of AFBC across large areas because of the spatial heterogeneity of forest cover types and canopy structure. In this study, Forest Inventory and Analysis (FIA) data, Landsat, and Landscape Fire and Resource Management Planning Tools Project (LANDFIRE) data were integrated in a regression tree model for estimating AFBC at a 30-m resolution in the Utah High Plateaus. AFBC were calculated from 225 FIA field plots and used as the dependent variable in the model. Of these plots, 10% were held out for model evaluation with stratified random sampling, and the other 90% were used as training data to develop the regression tree model. Independent variable layers included Landsat imagery and the derived spectral indicators, digital elevation model (DEM) data and derivatives, biophysical gradient data, existing vegetation cover type and vegetation structure. The cross-validation correlation coefficient (r value) was 0.81 for the training model. Independent validation using withheld plot data was similar with r value of 0.82. This validated regression tree model was applied to map AFBC in the Utah High Plateaus and then combined with burn severity information to estimate loss of AFBC in the Longston fire of Zion National Park in 2001. The final dataset represented 24 forest cover types for a 4 million ha forested area. 
We estimated a total of 353 Tg AFBC with an average of 87 MgC/ha in the Utah High Plateaus. We also estimated that 8054 Mg AFBC were released from 2.24 km2 of burned forest area in the Longston fire. These results demonstrate that an AFBC spatial map and estimated biomass carbon consumption can readily be generated using existing databases. The methodology provides a consistent, practical, and inexpensive way to estimate AFBC at 30-m resolution over large areas throughout the United States.
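
    The totals reported in this abstract can be cross-checked with a short unit conversion. The sketch below uses only the figures quoted above; the forest area is given as a rounded "4 million ha", so the computed mean density is expected to be close to, not identical to, the reported 87 MgC/ha.

```python
# Unit check on the totals quoted in the abstract above.
TG_TO_MG = 1_000_000          # 1 Tg = 1e12 g = 1e6 Mg
HA_PER_KM2 = 100              # 1 km2 = 100 ha

total_afbc_tg = 353           # reported total AFBC
forest_area_ha = 4_000_000    # reported as "4 million ha" (a rounded figure)

# Mean carbon density implied by the rounded totals.
mean_density = total_afbc_tg * TG_TO_MG / forest_area_ha
print(f"Implied mean AFBC density: {mean_density:.2f} Mg C/ha")

fire_loss_mg = 8054           # reported AFBC released in the fire
burned_area_ha = 2.24 * HA_PER_KM2

# Carbon released per burned hectare.
loss_density = fire_loss_mg / burned_area_ha
print(f"Implied loss per burned hectare: {loss_density:.2f} Mg C/ha")
```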

  19. Prediction of Short-Distance Aerial Movement of Phakopsora pachyrhizi Urediniospores Using Machine Learning.

    PubMed

    Wen, L; Bowen, C R; Hartman, G L

    2017-10-01

    Dispersal of urediniospores by wind is the primary means of spread for Phakopsora pachyrhizi, the cause of soybean rust. Our research focused on the short-distance movement of urediniospores from within the soybean canopy and up to 61 m from field-grown rust-infected soybean plants. Environmental variables were used to develop and compare models including the least absolute shrinkage and selection operator regression, zero-inflated Poisson/regular Poisson regression, random forest, and neural network to describe deposition of urediniospores collected in passive and active traps. All four models identified distance of trap from source, humidity, temperature, wind direction, and wind speed as the five most important variables influencing short-distance movement of urediniospores. The random forest model provided the best predictions, explaining 76.1 and 86.8% of the total variation in the passive- and active-trap datasets, respectively. The prediction accuracy, based on the correlation coefficient (r) between predicted and true values, was 0.83 (P < 0.0001) and 0.94 (P < 0.0001) for the passive- and active-trap datasets, respectively. Overall, multiple machine learning techniques identified the most important variables for making accurate predictions of the short-distance movement of P. pachyrhizi urediniospores.
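
    The accuracy measure quoted above is the correlation coefficient between predicted and true values. A minimal sketch of that calculation follows; the trap counts are made-up illustrative numbers, not data from the study, and the function name is hypothetical.

```python
import math

def pearson_r(predicted, observed):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    # Sum of cross-products of deviations from the means.
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    # Square roots of the sums of squared deviations.
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    sd_o = math.sqrt(sum((o - mean_o) ** 2 for o in observed))
    return cov / (sd_p * sd_o)

# Hypothetical spore counts per trap, for illustration only.
observed = [12.0, 7.0, 3.0, 1.0, 0.0]
predicted = [10.0, 8.0, 4.0, 1.5, 0.5]

r = pearson_r(predicted, observed)
print(f"r = {r:.3f}")
```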

  20. Tropical forest carbon balance: effects of field- and satellite-based mortality regimes on the dynamics and the spatial structure of Central Amazon forest biomass

    NASA Astrophysics Data System (ADS)

    Di Vittorio, Alan V.; Negrón-Juárez, Robinson I.; Higuchi, Niro; Chambers, Jeffrey Q.

    2014-03-01

    Debate continues over the adequacy of existing field plots to sufficiently capture Amazon forest dynamics to estimate regional forest carbon balance. Tree mortality dynamics are particularly uncertain due to the difficulty of observing large, infrequent disturbances. A recent paper (Chambers et al 2013 Proc. Natl Acad. Sci. 110 3949-54) reported that Central Amazon plots missed 9-17% of tree mortality, and here we address ‘why’ by elucidating two distinct mortality components: (1) variation in annual landscape-scale average mortality and (2) the frequency distribution of the size of clustered mortality events. Using a stochastic-empirical tree growth model we show that a power law distribution of event size (based on merged plot and satellite data) is required to generate spatial clustering of mortality that is consistent with forest gap observations. We conclude that existing plots do not sufficiently capture losses because their placement, size, and longevity assume spatially random mortality, while mortality is actually distributed among differently sized events (clusters of dead trees) that determine the spatial structure of forest canopies.

  1. Missouri Ozark Forest Ecosystem Project: the experiment

    Treesearch

    Steven L. Sheriff

    2002-01-01

    Missouri Ozark Forest Ecosystem Project (MOFEP) is a unique experiment to learn about the impacts of management practices on a forest system. Three forest management practices (uneven-aged management, even-aged management, and no-harvest management) as practiced by the Missouri Department of Conservation were randomly assigned to nine forest management sites using a...

  2. Modelling the ecological consequences of whole tree harvest for bioenergy production

    NASA Astrophysics Data System (ADS)

    Skår, Silje; Lange, Holger; Sogn, Trine

    2013-04-01

    There is an increasing demand for energy from biomass as a substitute for fossil fuels worldwide, and the Norwegian government plans to double the production of bioenergy to 9% of the national energy production or to 28 TWh per year by 2020. A large part of this increase may come from forests, which have a great potential with respect to biomass supply as forest growth has increasingly exceeded harvest in the last decades. One feasible option is the utilization of forest residues (needles, twigs and branches) in addition to stems, known as Whole Tree Harvest (WTH). As opposed to WTH, the residues are traditionally left in the forest with Conventional Timber Harvesting (CH). However, the residues contain a large share of the tree's nutrients, indicating that WTH may alter the supply of nutrients and organic matter to the soil and the forest ecosystem. This may potentially lead to reduced tree growth. Other implications can be nutrient imbalance, loss of carbon from the soil and changes in species composition and diversity. This study aims to identify key factors and appropriate strategies for ecologically sustainable WTH in Norway spruce (Picea abies) and Scots pine (Pinus sylvestris) forest stands in Norway. We focus on identifying key factors driving soil organic matter, nutrients, biomass, biodiversity etc. Simulations of the effect on the carbon and nitrogen budget with the two harvesting methods will also be conducted. Data from field trials and long-term manipulation experiments are used to obtain a first overview of key variables. The relationships between the variables are hitherto unknown, but it is by no means obvious that they can be assumed to be linear; thus, an ordinary multiple linear regression approach is expected to be insufficient. 
Here we apply two advanced and highly flexible modelling frameworks which have hardly been used in the context of tree growth, nutrient balances and biomass removal so far: Generalized Additive Models (GAMs) and Random Forests. Results obtained for GAMs so far show that there are differences between WTH and CH in two directions: both the significance of drivers and the shape of the response functions differ. GAMs turn out to be a flexible and powerful alternative to multivariate linear regression. The restriction to linear relationships seems to be unjustified in the present case. We use Random Forests as a highly efficient classifier which gives reliable estimates for the importance of each driver variable in determining the diameter growth for the two different harvesting treatments. Based on the final results of these two modelling approaches, the study contributes to finding appropriate strategies and suitable regions (in Norway) where WTH may be sustainably performed.

  3. Landscape anthropogenic disturbance in the Mediterranean ecosystem: is the current landscape sustainable?

    NASA Astrophysics Data System (ADS)

    Biondi, Guido; D'Andrea, Mirko; Fiorucci, Paolo; Franciosi, Chiara; Lima, Marco

    2013-04-01

    Over the last centuries, the Mediterranean landscape has been subject to strong anthropogenic disturbances that shifted the natural vegetation cover into a cultural landscape. Most of the natural forests were destroyed to allow cultivation and grazing. In the last century, fast-growing conifer plantations were introduced to increase timber production, replacing slow-growing natural forests. In addition, after the Second World War most of the grazing areas changed into unmanaged Mediterranean conifer forest frequently affected by fires. In recent decades, radical socio-economic changes have led to a dramatic abandonment of the cultural landscape. One of the most relevant results of these human disturbances, in particular the replacement of deciduous forests with coniferous forests, has been the increase in the number of forest fires, mainly human-caused. The presence of conifers and shrubs, which are more prone to fire, triggered a feedback mechanism that makes it difficult to return to the stage of potential vegetation, causing huge economic, social and environmental damage. The aim of this work is to investigate the sustainability of the current landscape. A future landscape scenario has been simulated considering natural succession in the absence of human intervention, assuming the current fire regime remains unaltered. To this end, a new model has been defined, implementing an ecological succession model coupled with a simple Forest Fire Model. The ecological succession model simulates the vegetation dynamics using a rule-based approach that is discrete in space and time. In this model Plant Functional Types (PFTs) are used to describe the landscape. Wildfires are randomly ignited on the landscape, and their propagation is simulated using a stochastic cellular automata model. The results show that the success of natural succession toward a potential vegetation cover is prevented by the frequency of fire spread. 
The current landscape is therefore unsustainable because of the high cost of firefighting activities. The right path to success consists in developing suitable land use planning and forest management to mitigate the consequences of past anthropogenic disturbances.
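
    The abstract describes random ignition followed by propagation through a stochastic cellular automaton. The sketch below illustrates that general idea only; the grid layout, the four-neighbour rule, the single spread probability `p_spread`, and all names are assumptions made for illustration, not the authors' model.

```python
import random

FUEL, BURNING, BURNED = 0, 1, 2

def step(grid, p_spread, rng):
    """Advance the fire by one time step on a rectangular grid."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != BURNING:
                continue
            new[r][c] = BURNED  # a burning cell burns out after one step
            # Each unburned neighbour ignites with probability p_spread.
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == FUEL:
                    if rng.random() < p_spread:
                        new[nr][nc] = BURNING
    return new

def simulate_fire(rows, cols, p_spread, seed=0):
    """Ignite one random cell and return the number of cells burned."""
    rng = random.Random(seed)
    grid = [[FUEL] * cols for _ in range(rows)]
    grid[rng.randrange(rows)][rng.randrange(cols)] = BURNING  # random ignition
    while any(BURNING in row for row in grid):
        grid = step(grid, p_spread, rng)
    return sum(row.count(BURNED) for row in grid)

print(simulate_fire(20, 20, p_spread=0.5, seed=1))
```

    In a fuller model the spread probability would presumably depend on the vegetation type of each cell, which is where a coupling to the succession model would enter.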

  4. Developing reservoir monthly inflow forecasts using artificial intelligence and climate phenomenon information

    NASA Astrophysics Data System (ADS)

    Yang, Tiantian; Asanjan, Ata Akbari; Welles, Edwin; Gao, Xiaogang; Sorooshian, Soroosh; Liu, Xiaomang

    2017-04-01

    Reservoirs are fundamental human-built infrastructures that collect, store, and deliver fresh surface water in a timely manner for many purposes. Efficient reservoir operation requires policy makers and operators to understand how reservoir inflows are changing under different hydrological and climatic conditions to enable forecast-informed operations. Over the last decade, the use of Artificial Intelligence and Data Mining (AI & DM) techniques in assisting reservoir streamflow subseasonal to seasonal forecasts has been increasing. In this study, Random Forest (RF), Artificial Neural Network (ANN), and Support Vector Regression (SVR) are employed and compared with respect to their capabilities for predicting 1 month-ahead reservoir inflows for two headwater reservoirs in the USA and China. Both current and lagged hydrological information and 17 known climate phenomenon indices, e.g., PDO and ENSO, are selected as predictors for simulating reservoir inflows. Results show (1) all three methods are capable of providing monthly reservoir inflows with satisfactory statistics; (2) the results obtained by Random Forest have the best statistical performances compared with the other two methods; (3) another advantage of the Random Forest algorithm is its capability of interpreting raw model inputs; (4) climate phenomenon indices are useful in assisting monthly or seasonal forecasts of reservoir inflow; and (5) different climate conditions are autocorrelated with up to several months, and the climatic information and their lags are cross correlated with local hydrological conditions in our case studies.
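
    Using "current and lagged" information as predictors for a 1 month-ahead target amounts to building a lagged feature table. A minimal sketch of that data preparation step follows, assuming one inflow series and a single climate index; the function name, series values, and number of lags are hypothetical.

```python
def make_lagged_samples(inflow, index, n_lags):
    """Build (features, target) pairs for 1-step-ahead forecasting.

    For each month t, the features are the inflow and the climate index
    at months t, t-1, ..., t-n_lags+1, and the target is inflow at t+1.
    """
    samples = []
    for t in range(n_lags - 1, len(inflow) - 1):
        features = []
        for lag in range(n_lags):
            features.append(inflow[t - lag])
            features.append(index[t - lag])
        samples.append((features, inflow[t + 1]))
    return samples

# Hypothetical monthly series, for illustration only.
inflow = [10, 20, 30, 40, 50]
climate_index = [0.1, 0.2, 0.3, 0.4, 0.5]

samples = make_lagged_samples(inflow, climate_index, n_lags=2)
for features, target in samples:
    print(features, "->", target)
```

    The resulting pairs could then be passed to any regressor, such as a random forest.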

  5. Random Forests for Evaluating Pedagogy and Informing Personalized Learning

    ERIC Educational Resources Information Center

    Spoon, Kelly; Beemer, Joshua; Whitmer, John C.; Fan, Juanjuan; Frazee, James P.; Stronach, Jeanne; Bohonak, Andrew J.; Levine, Richard A.

    2016-01-01

    Random forests are presented as an analytics foundation for educational data mining tasks. The focus is on course- and program-level analytics including evaluating pedagogical approaches and interventions and identifying and characterizing at-risk students. As part of this development, the concept of individualized treatment effects (ITE) is…

  6. Employing canopy hyperspectral narrowband data and random forest algorithm to differentiate palmer amaranth from colored cotton

    USDA-ARS's Scientific Manuscript database

    Palmer amaranth (Amaranthus palmeri S. Wats.) invasion negatively impacts cotton (Gossypium hirsutum L.) production systems throughout the United States. The objective of this study was to evaluate canopy hyperspectral narrowband data as input into the random forest machine learning algorithm to dis...

  7. Does Sentinel multi sensor data offer synergy in Improving Accuracy of Aboveground Biomass Estimate of Dense Tropical Forest? - Utility of Decision Tree Based Machine Learning Algorithms

    NASA Astrophysics Data System (ADS)

    Ghosh, S. M.; Behera, M. D.

    2017-12-01

    Forest aboveground biomass (AGB) is an important factor in preparing global policy decisions to tackle the impact of climate change. Several previous studies have concluded that remote sensing methods are more suitable for estimating forest biomass on a regional scale. Among all available remote sensing data and methods, Synthetic Aperture Radar (SAR) data in combination with decision tree based machine learning algorithms have shown better promise in estimating higher biomass values. Few studies have estimated the biomass of dense Indian tropical forests with high biomass density. In this study aboveground biomass was estimated for two major tree species, Sal (Shorea robusta) and Teak (Tectona grandis), of Katerniaghat Wildlife Sanctuary, a tropical forest situated in northern India. Biomass was estimated by combining C-band SAR data from the Sentinel-1A satellite, vegetation indices produced using Sentinel-2A data, and ground inventory plots. Along with SAR backscatter values, SAR texture images were also used as input, as earlier studies had found that image texture is correlated with vegetation biomass. Decision tree based nonlinear machine learning algorithms were used in place of parametric regression models for establishing the relationship between field-measured values and remotely sensed parameters. A random forest model with a combination of vegetation indices and SAR backscatter as predictor variables gave the best result for Sal forest, with a coefficient of determination of 0.71 and an RMSE of 105.027 t/ha. In teak forest the best result was also obtained with the same combination of predictors, but with a stochastic gradient boosted model, with a coefficient of determination of 0.6 and an RMSE of 79.45 t/ha. These results are mostly better than the results of other studies done for similar kinds of forests. 
This study shows that Sentinel series satellite data has exceptional capabilities in estimating dense forest AGB and machine learning algorithms are better means to do so than parametric regression models.
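
    The two scores quoted in this abstract, RMSE and the coefficient of determination, can be computed as sketched below. The abstract does not say which definition of the coefficient of determination was used, so the common form 1 - SS_res / SS_tot is assumed here, and the biomass values are made up for illustration.

```python
import math

def rmse(observed, predicted):
    """Root mean squared error, in the units of the observations."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def r_squared(observed, predicted):
    """Coefficient of determination as 1 - SS_res / SS_tot (assumed form)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

# Hypothetical plot biomass values in t/ha, for illustration only.
observed_agb = [100.0, 200.0, 300.0, 400.0]
predicted_agb = [110.0, 190.0, 310.0, 390.0]

print(f"RMSE = {rmse(observed_agb, predicted_agb):.2f} t/ha")
print(f"R2   = {r_squared(observed_agb, predicted_agb):.3f}")
```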

  8. Pedological memory in forest soil development

    Treesearch

    Jonathan D. Phillips; Daniel A. Marion

    2004-01-01

    Individual trees may have significant impacts on soil morphology. If these impacts are non-random such that some microsites are repeatedly preferentially affected by trees, complex local spatial variability of soils would result. A model of self-reinforcing pedologic influences of trees (SRPIT) is proposed to explain patterns of soil variability in the Ouachita...

  9. Predicting ToxCast™ and Tox21 Bioactivity Using Toxprint Chemotypes (WC10)

    EPA Science Inventory

    The EPA ToxCast™ and Tox21 programs have generated bioactivity data for nearly 9076 chemicals across ~1192 assay endpoints; however, for over 70% of the chemical-assay endpoint pairs there is no data. To help fill the gaps, we constructed random forest models for each assay endpo...

  10. Identification of Genes Involved in Breast Cancer Metastasis by Integrating Protein-Protein Interaction Information with Expression Data.

    PubMed

    Tian, Xin; Xin, Mingyuan; Luo, Jian; Liu, Mingyao; Jiang, Zhenran

    2017-02-01

    The selection of relevant genes for breast cancer metastasis is critical for the treatment and prognosis of cancer patients. Although much effort has been devoted to gene selection procedures using different statistical analysis methods or computational techniques, the interpretation of the variables in the resulting survival models has been limited so far. This article proposes a new Random Forest (RF)-based algorithm to identify important variables highly related to breast cancer metastasis, which is based on the importance scores of two variable selection algorithms: the mean decrease Gini (MDG) criterion of Random Forest and the GeneRank algorithm with protein-protein interaction (PPI) information. The new gene selection algorithm is called PPIRF. The improved prediction accuracy illustrates the reliability and high interpretability of the gene list selected by the PPIRF approach.

  11. Old-growth and mature forests near spotted owl nests in western Oregon

    NASA Technical Reports Server (NTRS)

    Ripple, William J.; Johnson, David H.; Hershey, K. T.; Meslow, E. Charles

    1995-01-01

    We investigated how the amount of old-growth and mature forest influences the selection of nest sites by northern spotted owls (Strix occidentalis caurina) in the Central Cascade Mountains of Oregon. We used 7 different plot sizes to compare the proportion of mature and old-growth forest between 30 nest sites and 30 random sites. The proportion of old-growth and mature forest was significantly greater at nest sites than at random sites for all plot sizes (P ≤ 0.01). Thus, management of the spotted owl might require setting the percentage of old-growth and mature forest retained from harvesting at least 1 standard deviation above the mean for the 30 nest sites we examined.
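
    The management rule suggested above, a retention level at least one standard deviation above the nest-site mean, reduces to a one-line calculation. The sketch below assumes the sample standard deviation (the abstract does not say which form was used) and uses made-up proportions, not the study's measurements.

```python
import statistics

def retention_target(nest_site_proportions, n_sd=1.0):
    """Return mean + n_sd * sample standard deviation of the proportions."""
    mean = statistics.mean(nest_site_proportions)
    sd = statistics.stdev(nest_site_proportions)  # sample SD (assumption)
    return mean + n_sd * sd

# Hypothetical proportions of old-growth and mature forest at nest sites.
proportions = [0.6, 0.7, 0.8]

target = retention_target(proportions)
print(f"Retention target: {target:.2f}")
```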

  12. Methods for identifying SNP interactions: a review on variations of Logic Regression, Random Forest and Bayesian logistic regression.

    PubMed

    Chen, Carla Chia-Ming; Schwender, Holger; Keith, Jonathan; Nunkesser, Robin; Mengersen, Kerrie; Macrossan, Paula

    2011-01-01

    Due to advancements in computational ability, enhanced technology and a reduction in the price of genotyping, more data are being generated for understanding genetic associations with diseases and disorders. However, with the availability of large data sets comes the inherent challenges of new methods of statistical analysis and modeling. Considering a complex phenotype may be the effect of a combination of multiple loci, various statistical methods have been developed for identifying genetic epistasis effects. Among these methods, logic regression (LR) is an intriguing approach incorporating tree-like structures. Various methods have built on the original LR to improve different aspects of the model. In this study, we review four variations of LR, namely Logic Feature Selection, Monte Carlo Logic Regression, Genetic Programming for Association Studies, and Modified Logic Regression-Gene Expression Programming, and investigate the performance of each method using simulated and real genotype data. We contrast these with another tree-like approach, namely Random Forests, and a Bayesian logistic regression with stochastic search variable selection.

  13. Quantifying and Characterizing Tonic Thermal Pain Across Subjects From EEG Data Using Random Forest Models.

    PubMed

    Vijayakumar, Vishal; Case, Michelle; Shirinpour, Sina; He, Bin

    2017-12-01

    Effective pain assessment and management strategies are needed to better manage pain. In addition to self-report, an objective pain assessment system can provide a more complete picture of the neurophysiological basis for pain. In this study, a robust and accurate machine learning approach is developed to quantify tonic thermal pain across healthy subjects into a maximum of ten distinct classes. A random forest model was trained to predict pain scores using time-frequency wavelet representations of independent components obtained from electroencephalography (EEG) data, and the relative importance of each frequency band to pain quantification is assessed. The mean classification accuracy for predicting pain on an independent test subject for a range of 1-10 is 89.45%, highest among existing state of the art quantification algorithms for EEG. The gamma band is the most important to both intersubject and intrasubject classification accuracy. The robustness and generalizability of the classifier are demonstrated. Our results demonstrate the potential of this tool to be used clinically to help us to improve chronic pain treatment and establish spectral biomarkers for future pain-related studies using EEG.

  14. An assessment of the effectiveness of a random forest classifier for land-cover classification

    NASA Astrophysics Data System (ADS)

    Rodriguez-Galiano, V. F.; Ghimire, B.; Rogan, J.; Chica-Olmo, M.; Rigol-Sanchez, J. P.

    2012-01-01

    Land cover monitoring using remotely sensed data requires robust classification methods which allow for the accurate mapping of complex land cover and land use categories. Random forest (RF) is a powerful machine learning classifier that is relatively unknown in land remote sensing and has not been evaluated thoroughly by the remote sensing community compared to more conventional pattern recognition techniques. Key advantages of RF include its non-parametric nature, high classification accuracy, and capability to determine variable importance. However, the split rules for classification are unknown; therefore RF can be considered a black-box classifier. RF provides an algorithm for estimating missing values, and the flexibility to perform several types of data analysis, including regression, classification, survival analysis, and unsupervised learning. In this paper, the performance of the RF classifier for land cover classification of a complex area is explored. Evaluation was based on several criteria: mapping accuracy, sensitivity to data set size, and sensitivity to noise. Landsat-5 Thematic Mapper data captured in European spring and summer were used with auxiliary variables derived from a digital terrain model to classify 14 different land categories in the south of Spain. Results show that the RF algorithm yields accurate land cover classifications, with 92% overall accuracy and a Kappa index of 0.92. RF is robust to training data reduction and noise because significant differences in kappa values were only observed for data reduction and noise addition values greater than 50 and 20%, respectively. Additionally, variables that RF identified as most important for classifying land cover coincided with expectations. A McNemar test indicates an overall better performance of the random forest model over a single decision tree at the 0.00001 significance level.
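
    Overall accuracy and the Kappa index are both derived from a confusion matrix. A minimal sketch of the standard Cohen's kappa calculation follows; the two-class matrix is a made-up example, not the study's 14-class result, and the function name is hypothetical.

```python
def overall_accuracy_and_kappa(confusion):
    """Compute overall accuracy and Cohen's kappa.

    confusion[i][j] is the number of samples of reference class i
    that were assigned to class j.
    """
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of samples on the diagonal.
    observed = sum(confusion[i][i] for i in range(k)) / n
    # Chance agreement from the row and column marginals.
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / n ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Hypothetical two-class confusion matrix, for illustration only.
matrix = [[40, 10],
          [5, 45]]

accuracy, kappa = overall_accuracy_and_kappa(matrix)
print(f"accuracy = {accuracy:.2f}, kappa = {kappa:.2f}")
```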

  15. Tissue segmentation of computed tomography images using a Random Forest algorithm: a feasibility study

    NASA Astrophysics Data System (ADS)

    Polan, Daniel F.; Brady, Samuel L.; Kaufman, Robert A.

    2016-09-01

    There is a need for robust, fully automated whole body organ segmentation for diagnostic CT. This study investigates and optimizes a Random Forest algorithm for automated organ segmentation; explores the limitations of a Random Forest algorithm applied to the CT environment; and demonstrates segmentation accuracy in a feasibility study of pediatric and adult patients. To the best of our knowledge, this is the first study to investigate a trainable Weka segmentation (TWS) implementation using Random Forest machine-learning as a means to develop a fully automated tissue segmentation tool developed specifically for pediatric and adult examinations in a diagnostic CT environment. Current innovation in computed tomography (CT) is focused on radiomics, patient-specific radiation dose calculation, and image quality improvement using iterative reconstruction, all of which require specific knowledge of tissue and organ systems within a CT image. The purpose of this study was to develop a fully automated Random Forest classifier algorithm for segmentation of neck-chest-abdomen-pelvis CT examinations based on pediatric and adult CT protocols. Seven materials were classified: background, lung/internal air or gas, fat, muscle, solid organ parenchyma, blood/contrast enhanced fluid, and bone tissue using Matlab and the TWS plugin of FIJI. The following classifier feature filters of TWS were investigated: minimum, maximum, mean, and variance evaluated over a voxel radius of 2^n (n from 0 to 4), along with noise reduction and edge preserving filters: Gaussian, bilateral, Kuwahara, and anisotropic diffusion. The Random Forest algorithm used 200 trees with 2 features randomly selected per node. The optimized auto-segmentation algorithm resulted in 16 image features including features derived from maximum, mean, variance, Gaussian and Kuwahara filters. 
Dice similarity coefficient (DSC) calculations between manually segmented and Random Forest algorithm segmented images from 21 patient image sections were analyzed. The automated algorithm produced segmentation of seven material classes with a median DSC of 0.86 ± 0.03 for pediatric patient protocols, and 0.85 ± 0.04 for adult patient protocols. Additionally, 100 randomly selected patient examinations were segmented and analyzed, and a mean sensitivity of 0.91 (range: 0.82-0.98), specificity of 0.89 (range: 0.70-0.98), and accuracy of 0.90 (range: 0.76-0.98) were demonstrated. In this study, we demonstrate that this fully automated segmentation tool was able to produce fast and accurate segmentation of the neck and trunk of the body over a wide range of patient habitus and scan parameters.
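
    The Dice similarity coefficient compares two segmentations of the same class as twice the overlap divided by the total size of both. A minimal sketch follows, representing each segmentation as a set of voxel coordinates; that representation and the example masks are assumptions for illustration.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice similarity coefficient: 2 * |A & B| / (|A| + |B|)."""
    a, b = set(mask_a), set(mask_b)
    if not a and not b:
        return 1.0  # two empty masks are treated as identical (convention)
    return 2 * len(a & b) / (len(a) + len(b))

# Hypothetical voxel coordinates labelled as one tissue class.
manual = {(0, 0), (0, 1), (1, 0), (1, 1)}
automated = {(0, 1), (1, 1), (1, 2)}

dsc = dice_coefficient(manual, automated)
print(f"DSC = {dsc:.3f}")
```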

  16. A microwave scattering model for layered vegetation

    NASA Technical Reports Server (NTRS)

    Karam, Mostafa A.; Fung, Adrian K.; Lang, Roger H.; Chauhan, Narinder S.

    1992-01-01

    A microwave scattering model was developed for layered vegetation based on an iterative solution of the radiative transfer equation up to the second order to account for multiple scattering within the canopy and between the ground and the canopy. The model is designed to operate over a wide frequency range for both deciduous and coniferous forest and to account for the branch size distribution, leaf orientation distribution, and branch orientation distribution for each size. The canopy is modeled as a two-layered medium above a rough interface. The upper layer is the crown containing leaves, stems, and branches. The lower layer is the trunk region modeled as randomly positioned cylinders with a preferred orientation distribution above an irregular soil surface. Comparisons of this model with measurements from deciduous and coniferous forests show good agreement at several frequencies for both like and cross polarizations. Major features of the model needed to realize the agreement include allowance for: (1) branch size distribution, (2) second-order effects, and (3) tree component models valid over a wide range of frequencies.

  17. Prediction of the effect of formulation on the toxicity of chemicals.

    PubMed

    Mistry, Pritesh; Neagu, Daniel; Sanchez-Ruiz, Antonio; Trundle, Paul R; Vessey, Jonathan D; Gosling, John Paul

    2017-01-01

    Two approaches for the prediction of which of two vehicles will result in lower toxicity for anticancer agents are presented. Machine-learning models are developed using decision tree, random forest and partial least squares methodologies and statistical evidence is presented to demonstrate that they represent valid models. Separately, a clustering method is presented that allows the ordering of vehicles by the toxicity they show for chemically-related compounds.

  18. Landslide susceptibility assessment in Lianhua County (China): A comparison between a random forest data mining technique and bivariate and multivariate statistical models

    NASA Astrophysics Data System (ADS)

    Hong, Haoyuan; Pourghasemi, Hamid Reza; Pourtaghi, Zohre Sadat

    2016-04-01

    Landslides are an important natural hazard that causes a great amount of damage around the world every year, especially during the rainy season. The Lianhua area is located in the middle of China's southern mountainous area, west of Jiangxi Province, and is known to be an area prone to landslides. The aim of this study was to evaluate and compare landslide susceptibility maps produced using the random forest (RF) data mining technique with those produced by bivariate (evidential belief function and frequency ratio) and multivariate (logistic regression) statistical models for Lianhua County, China. First, a landslide inventory map was prepared using aerial photograph interpretation, satellite images, and extensive field surveys. In total, 163 landslide events were recognized in the study area, with 114 landslides (70%) used for training and 49 landslides (30%) used for validation. Next, the landslide conditioning factors (slope angle, altitude, slope aspect, topographic wetness index (TWI), slope-length (LS), plan curvature, profile curvature, distance to rivers, distance to faults, distance to roads, annual precipitation, land use, normalized difference vegetation index (NDVI), and lithology) were derived from the spatial database. Finally, the landslide susceptibility maps of Lianhua County were generated in ArcGIS 10.1 based on the random forest (RF), evidential belief function (EBF), frequency ratio (FR), and logistic regression (LR) approaches and were validated using a receiver operating characteristic (ROC) curve. The ROC plot assessment results showed that for landslide susceptibility maps produced using the EBF, FR, LR, and RF models, the area under the curve (AUC) values were 0.8122, 0.8134, 0.7751, and 0.7172, respectively. 
Therefore, we can conclude that all four models have an AUC of more than 0.70 and can be used in landslide susceptibility mapping in the study area; meanwhile, the EBF and FR models had the best performance for Lianhua County, China. Thus, the resultant susceptibility maps will be useful for land use planning and hazard mitigation aims.
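
    The AUC used to compare the four models can be read as the probability that a randomly chosen landslide location receives a higher susceptibility score than a randomly chosen non-landslide location. The sketch below computes it directly from that pairwise definition; the scores are made up for illustration and the function name is hypothetical.

```python
def roc_auc(scores_pos, scores_neg):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the
    positive case has the higher score; ties count as half.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical susceptibility scores, for illustration only.
landslide_scores = [0.9, 0.8, 0.4]
stable_scores = [0.7, 0.3, 0.2]

auc = roc_auc(landslide_scores, stable_scores)
print(f"AUC = {auc:.3f}")
```

    The pairwise form is quadratic in the number of samples, which is fine for small validation sets such as the 49 landslides held out here; rank-based formulas give the same value more efficiently.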

  19. Improving the performance of the mass transfer-based reference evapotranspiration estimation approaches through a coupled wavelet-random forest methodology

    NASA Astrophysics Data System (ADS)

    Shiri, Jalal

    2018-06-01

    Among different reference evapotranspiration (ETo) modeling approaches, mass transfer-based methods have been less studied. These approaches utilize temperature and wind speed records. On the other hand, the empirical equations proposed in this context generally produce weak simulations, except when a local calibration is used for improving their performance. This might be a crucial drawback for those equations when local data for the calibration procedure are scarce. Heuristic methods can therefore be considered as a substitute for improving the accuracy of the mass transfer-based approaches. However, given that wind speed records usually have higher variation magnitudes than the other meteorological parameters, coupling a wavelet transform with heuristic models would be necessary. In the present paper, a coupled wavelet-random forest (WRF) methodology was proposed for the first time to improve the accuracy of the mass transfer-based ETo estimation approaches using cross-validation data management scenarios at both local and cross-station scales. The obtained results revealed that the new coupled WRF model (with minimum scatter index values of 0.150 and 0.192 for local and external applications, respectively) improved the accuracy of the single RF models as well as the empirical equations to a great extent.

  20. Applying under-sampling techniques and cost-sensitive learning methods on risk assessment of breast cancer.

    PubMed

    Hsu, Jia-Lien; Hung, Ping-Cheng; Lin, Hung-Yen; Hsieh, Chung-Ho

    2015-04-01

    Breast cancer is one of the most common causes of cancer mortality. Early detection through mammography screening could significantly reduce mortality from breast cancer. However, most screening methods consume a large amount of resources. We propose a computational model, based solely on personal health information, for breast cancer risk assessment. Our model can serve as a pre-screening program in low-cost settings. In our study, the data set, consisting of 3976 records, was collected from Taipei City Hospital from 2008.1.1 to 2008.12.31. Based on the dataset, we first apply sampling techniques and a dimension reduction method to preprocess the testing data. Then, we construct various kinds of classifiers (including basic classifiers, ensemble methods, and cost-sensitive methods) to predict the risk. The cost-sensitive method with a random forest classifier is able to achieve a recall (or sensitivity) of 100 %. At a recall of 100 %, the precision (positive predictive value, PPV) and specificity of the cost-sensitive method with a random forest classifier were 2.9 % and 14.87 %, respectively. In our study, we build a breast cancer risk assessment model using data mining techniques. Our model has the potential to serve as an assisting tool in breast cancer screening.
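
    The recall, precision, and specificity quoted in this abstract all follow from the four cells of a confusion matrix. The sketch below shows the arithmetic; the function name and the counts are hypothetical illustrations, not the study's data or code. It shows how a screening model can reach 100 % recall while precision and specificity stay low.

```python
def screening_metrics(tp, fp, tn, fn):
    """Return (recall, precision, specificity) from confusion-matrix counts."""
    recall = tp / (tp + fn)        # share of true cases that are flagged
    precision = tp / (tp + fp)     # share of flagged records that are true cases
    specificity = tn / (tn + fp)   # share of non-cases that are correctly cleared
    return recall, precision, specificity


# Hypothetical counts chosen for illustration only (not from the study):
# every true case is flagged (fn = 0), at the cost of many false alarms.
recall, precision, specificity = screening_metrics(tp=5, fp=95, tn=20, fn=0)
print(recall, precision, specificity)
```

    The function assumes that each denominator is non-zero; a production version would need to handle empty classes.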

  1. Field strategies for the calibration and validation of high-resolution forest carbon maps: Scaling from plots to a three state region MD, DE, & PA, USA.

    NASA Astrophysics Data System (ADS)

    Dolan, K. A.; Huang, W.; Johnson, K. D.; Birdsey, R.; Finley, A. O.; Dubayah, R.; Hurtt, G. C.

    2016-12-01

    In 2010 Congress directed NASA to initiate research towards the development of Carbon Monitoring Systems (CMS). In response, our team has worked to develop a robust, replicable framework to quantify and map aboveground forest biomass at high spatial resolutions. Crucial to this framework has been the collection of field-based estimates of aboveground tree biomass, combined with remotely detected canopy and structural attributes, for calibration and validation. Here we evaluate the field-based calibration and validation strategies within this carbon monitoring framework and discuss the implications for local to national monitoring systems. Through project development, the domain of this research has expanded from two counties in MD (2,181 km2), to the entire state of MD (32,133 km2), and most recently the tri-state region of MD, PA, and DE (157,868 km2) and covers forests in four major USDA ecological provinces. While there are approximately 1000 Forest Inventory and Analysis (FIA) plots distributed across the state of MD, 60% fell in areas considered non-forest or had conditions that precluded them from being measured in the last forest inventory. Across the two pilot counties, where population and land-use competition is high, that proportion rose to 70%. Thus, during the initial phases of this project 850 independent field plots were established for model calibration following a random stratified design to ensure the adequate representation of height and vegetation classes found across the state, while FIA data were used as an independent data source for validation. As the project expanded to cover the larger spatial tri-state domain, the strategy was flipped to base calibration on more than 3,300 measured FIA plots, as they provide a standardized, consistent and available data source across the nation. An additional 350 stratified random plots were deployed in the Northern Mixed forests of PA and the Coastal Plains forests of DE for validation.

  2. Improved predictive mapping of indoor radon concentrations using ensemble regression trees based on automatic clustering of geological units.

    PubMed

    Kropat, Georg; Bochud, Francois; Jaboyedoff, Michel; Laedermann, Jean-Pascal; Murith, Christophe; Palacios Gruson, Martha; Baechler, Sébastien

    2015-09-01

    According to estimations around 230 people die as a result of radon exposure in Switzerland. This public health concern makes reliable indoor radon prediction and mapping methods necessary in order to improve risk communication to the public. The aim of this study was to develop an automated method to classify lithological units according to their radon characteristics and to develop mapping and predictive tools in order to improve local radon prediction. About 240 000 indoor radon concentration (IRC) measurements in about 150 000 buildings were available for our analysis. The automated classification of lithological units was based on k-medoids clustering via pair-wise Kolmogorov distances between IRC distributions of lithological units. For IRC mapping and prediction we used random forests and Bayesian additive regression trees (BART). The automated classification groups lithological units well in terms of their IRC characteristics. Especially the IRC differences in metamorphic rocks like gneiss are well revealed by this method. The maps produced by random forests soundly represent the regional difference of IRCs in Switzerland and improve the spatial detail compared to existing approaches. We could explain 33% of the variations in IRC data with random forests. Additionally, the influence of a variable evaluated by random forests shows that building characteristics are less important predictors for IRCs than spatial/geological influences. BART could explain 29% of IRC variability and produced maps that indicate the prediction uncertainty. Ensemble regression trees are a powerful tool to model and understand the multidimensional influences on IRCs. Automatic clustering of lithological units complements this method by facilitating the interpretation of radon properties of rock types. This study provides an important element for radon risk communication. 
Future approaches should consider taking into account further variables like soil gas radon measurements as well as more detailed geological information.
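
    The clustering step in this abstract relies on pair-wise Kolmogorov distances between IRC distributions. As a minimal sketch, that distance can be computed as the largest gap between two empirical cumulative distribution functions. The function name and sample values below are hypothetical, and the k-medoids step itself is not shown.

```python
from bisect import bisect_right


def kolmogorov_distance(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    # The ECDFs only change at observed values, so checking those is enough.
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical measurements for two lithological units (illustration only).
unit_1 = [1, 2, 3, 4]
unit_2 = [3, 4, 5, 6]
print(kolmogorov_distance(unit_1, unit_2))
```

    A matrix of such distances between all pairs of units could then be passed to a k-medoids routine, which needs only pair-wise dissimilarities rather than coordinates.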

  3. Automated segmentation of dental CBCT image with prior-guided sequential random forests

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wang, Li; Gao, Yaozong; Shi, Feng

    Purpose: Cone-beam computed tomography (CBCT) is an increasingly utilized imaging modality for the diagnosis and treatment planning of the patients with craniomaxillofacial (CMF) deformities. Accurate segmentation of CBCT image is an essential step to generate 3D models for the diagnosis and treatment planning of the patients with CMF deformities. However, due to the image artifacts caused by beam hardening, imaging noise, inhomogeneity, truncation, and maximal intercuspation, it is difficult to segment the CBCT. Methods: In this paper, the authors present a new automatic segmentation method to address these problems. Specifically, the authors first employ a majority voting method to estimate the initial segmentation probability maps of both mandible and maxilla based on multiple aligned expert-segmented CBCT images. These probability maps provide an important prior guidance for CBCT segmentation. The authors then extract both the appearance features from CBCTs and the context features from the initial probability maps to train the first-layer of random forest classifier that can select discriminative features for segmentation. Based on the first-layer of trained classifier, the probability maps are updated, which will be employed to further train the next layer of random forest classifier. By iteratively training the subsequent random forest classifier using both the original CBCT features and the updated segmentation probability maps, a sequence of classifiers can be derived for accurate segmentation of CBCT images. Results: Segmentation results on CBCTs of 30 subjects were both quantitatively and qualitatively validated based on manually labeled ground truth. The average Dice ratios of mandible and maxilla by the authors’ method were 0.94 and 0.91, respectively, which are significantly better than the state-of-the-art method based on sparse representation (p-value < 0.001). 
Conclusions: The authors have developed and validated a novel fully automated method for CBCT segmentation.

  4. Accurate Diabetes Risk Stratification Using Machine Learning: Role of Missing Value and Outliers.

    PubMed

    Maniruzzaman, Md; Rahman, Md Jahanur; Al-MehediHasan, Md; Suri, Harman S; Abedin, Md Menhazul; El-Baz, Ayman; Suri, Jasjit S

    2018-04-10

    Diabetes mellitus is a group of metabolic diseases in which blood sugar levels are too high. About 8.8% of the world was diabetic in 2017. It is projected that this will reach nearly 10% by 2045. The major challenge is that applying machine learning-based classifiers to such data sets for risk stratification leads to lower performance. Thus, our objective is to develop an optimized and robust machine learning (ML) system under the assumption that missing values or outliers, if replaced by a median configuration, will yield higher risk stratification accuracy. This ML-based risk stratification is designed, optimized and evaluated, where: (i) the features are extracted and optimized from the six feature selection techniques (random forest, logistic regression, mutual information, principal component analysis, analysis of variance, and Fisher discriminant ratio) and combined with ten different types of classifiers (linear discriminant analysis, quadratic discriminant analysis, naïve Bayes, Gaussian process classification, support vector machine, artificial neural network, Adaboost, logistic regression, decision tree, and random forest) under the hypothesis that both missing values and outliers when replaced by computed medians will improve the risk stratification accuracy. Pima Indian diabetic dataset (768 patients: 268 diabetic and 500 controls) was used. Our results demonstrate that replacing the missing values and outliers by group median and median values, respectively, and further using the combination of random forest feature selection and random forest classification yields an accuracy, sensitivity, specificity, positive predictive value, negative predictive value and area under the curve of 92.26%, 95.96%, 79.72%, 91.14%, 91.20%, and 0.93, respectively. This is an improvement of 10% over previously developed techniques published in literature. The system was validated for its stability and reliability. 
RF-based model showed the best performance when outliers are replaced by median values.
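
    The preprocessing idea in this abstract (missing values replaced by the group median, outliers replaced by the median) can be sketched as follows. The abstract does not say how outliers were identified, so the 1.5 x IQR rule used here is an assumption, and the function names and sample values are hypothetical.

```python
from statistics import median, quantiles


def impute_group_median(values, labels):
    """Replace None entries with the median of observed values in the same class."""
    group_medians = {}
    for group in set(labels):
        observed = [v for v, g in zip(values, labels) if g == group and v is not None]
        group_medians[group] = median(observed)
    return [group_medians[g] if v is None else v for v, g in zip(values, labels)]


def replace_outliers_with_median(values, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the overall median.

    The IQR fence is an assumed outlier rule, not one stated in the abstract.
    """
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    centre = median(values)
    return [centre if (v < low or v > high) else v for v in values]


# Hypothetical feature column with class labels (1 = diabetic, 0 = control).
filled = impute_group_median([100, None, 120, 90, None, 95], [1, 1, 1, 0, 0, 0])
cleaned = replace_outliers_with_median([10, 11, 12, 13, 14, 15, 100])
print(filled, cleaned)
```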

  5. Temporal changes in randomness of bird communities across Central Europe.

    PubMed

    Renner, Swen C; Gossner, Martin M; Kahl, Tiemo; Kalko, Elisabeth K V; Weisser, Wolfgang W; Fischer, Markus; Allan, Eric

    2014-01-01

    Many studies have examined whether communities are structured by random or deterministic processes, and both are likely to play a role, but relatively few studies have attempted to quantify the degree of randomness in species composition. We quantified, for the first time, the degree of randomness in forest bird communities based on an analysis of spatial autocorrelation in three regions of Germany. The compositional dissimilarity between pairs of forest patches was regressed against the distance between them. We then calculated the y-intercept of the curve, i.e. the 'nugget', which represents the compositional dissimilarity at zero spatial distance. We therefore assume, following similar work on plant communities, that this represents the degree of randomness in species composition. We then analysed how the degree of randomness in community composition varied over time and with forest management intensity, which we expected to reduce the importance of random processes by increasing the strength of environmental drivers. We found that a high portion of the bird community composition could be explained by chance (overall mean of 0.63), implying that most of the variation in local bird community composition is driven by stochastic processes. Forest management intensity did not consistently affect the mean degree of randomness in community composition, perhaps because the bird communities were relatively insensitive to management intensity. We found a high temporal variation in the degree of randomness, which may indicate temporal variation in assembly processes and in the importance of key environmental drivers. We conclude that the degree of randomness in community composition should be considered in bird community studies, and the high values we find may indicate that bird community composition is relatively hard to predict at the regional scale.
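
    The 'nugget' described in this abstract is the y-intercept of dissimilarity regressed against distance. The sketch below assumes an ordinary least-squares straight line, which the abstract does not confirm (it only speaks of a "curve"); the function name and data points are hypothetical.

```python
def fit_nugget(distances, dissimilarities):
    """Fit dissimilarity = nugget + slope * distance by ordinary least squares."""
    n = len(distances)
    mean_x = sum(distances) / n
    mean_y = sum(dissimilarities) / n
    sxx = sum((x - mean_x) ** 2 for x in distances)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(distances, dissimilarities))
    slope = sxy / sxx
    nugget = mean_y - slope * mean_x   # predicted dissimilarity at zero distance
    return nugget, slope


# Hypothetical pairs of forest patches: distance (km) and compositional dissimilarity.
nugget, slope = fit_nugget([1, 2, 3, 4], [0.62, 0.64, 0.66, 0.68])
print(nugget, slope)
```

    Under the abstract's interpretation, a larger intercept means that even adjacent patches differ strongly in composition, which is read as a higher degree of randomness.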

  6. Security authentication with a three-dimensional optical phase code using random forest classifier: an overview

    NASA Astrophysics Data System (ADS)

    Markman, Adam; Carnicer, Artur; Javidi, Bahram

    2017-05-01

    We overview our recent work [1] on utilizing three-dimensional (3D) optical phase codes for object authentication using the random forest classifier. A simple 3D optical phase code (OPC) is generated by combining multiple diffusers and glass slides. This tag is then placed on a quick-response (QR) code, which is a barcode capable of storing information and can be scanned under non-uniform illumination conditions, rotation, and slight degradation. A coherent light source illuminates the OPC and the transmitted light is captured by a CCD to record the unique signature. Feature extraction on the signature is performed and inputted into a pre-trained random-forest classifier for authentication.

  7. Remote sensing based detection of forested wetlands: An evaluation of LiDAR, aerial imagery, and their data fusion

    NASA Astrophysics Data System (ADS)

    Suiter, Ashley Elizabeth

    Multi-spectral imagery provides a robust and low-cost dataset for assessing wetland extent and quality over broad regions and is frequently used for wetland inventories. However, in forested wetlands, hydrology is obscured by tree canopy, making it difficult to detect with multi-spectral imagery alone. Because of this, classification of forested wetlands often includes greater errors than that of other wetland types. Elevation and terrain derivatives have been shown to be useful for modelling wetland hydrology. However, few studies have addressed the use of LiDAR intensity data for detecting hydrology in forested wetlands. Due to the tendency of LiDAR signal to be attenuated by water, this research proposed the fusion of LiDAR intensity data with LiDAR elevation, terrain data, and aerial imagery, for the detection of forested wetland hydrology. We examined the utility of LiDAR intensity data and determined whether the fusion of LiDAR-derived data with multi-spectral imagery increased the accuracy of forested wetland classification compared with a classification performed with only multi-spectral imagery. Four classifications were performed: Classification A -- All Imagery, Classification B -- All LiDAR, Classification C -- LiDAR without Intensity, and Classification D -- Fusion of All Data. These classifications were performed using random forest and each resulted in a 3-foot resolution thematic raster of forested upland and forested wetland locations in Vermilion County, Illinois. The accuracies of these classifications were compared using Kappa Coefficient of Agreement. Importance statistics produced within the random forest classifier were evaluated in order to understand the contribution of individual datasets. Classification D, which used the fusion of LiDAR and multi-spectral imagery as input variables, had moderate to strong agreement between reference data and classification results. 
It was found that Classification B, performed using all the LiDAR data and its derivatives (intensity, elevation, slope, aspect, curvatures, and Topographic Wetness Index), was the most accurate classification with Kappa: 78.04%, indicating moderate to strong agreement. However, Classification C, performed with LiDAR derivatives without intensity data, had less agreement than would be expected by chance, indicating that LiDAR intensity contributed significantly to the accuracy of Classification B.

  8. Does rational selection of training and test sets improve the outcome of QSAR modeling?

    PubMed

    Martin, Todd M; Harten, Paul; Young, Douglas M; Muratov, Eugene N; Golbraikh, Alexander; Zhu, Hao; Tropsha, Alexander

    2012-10-22

    Prior to using a quantitative structure-activity relationship (QSAR) model for external predictions, its predictive power should be established and validated. In the absence of a true external data set, the best way to validate the predictive ability of a model is to perform its statistical external validation. In statistical external validation, the overall data set is divided into training and test sets. Commonly, this splitting is performed using random division. Rational splitting methods can divide data sets into training and test sets in an intelligent fashion. The purpose of this study was to determine whether rational division methods lead to more predictive models compared to random division. A special data splitting procedure was used to facilitate the comparison between random and rational division methods. For each toxicity end point, the overall data set was divided into a modeling set (80% of the overall set) and an external evaluation set (20% of the overall set) using random division. The modeling set was then subdivided into a training set (80% of the modeling set) and a test set (20% of the modeling set) using rational division methods and by using random division. The Kennard-Stone, minimal test set dissimilarity, and sphere exclusion algorithms were used as the rational division methods. The hierarchical clustering, random forest, and k-nearest neighbor (kNN) methods were used to develop QSAR models based on the training sets. For kNN QSAR, multiple training and test sets were generated, and multiple QSAR models were built. The results of this study indicate that models based on rational division methods generate better statistical results for the test sets than models based on random division, but the predictive power of both types of models is comparable.
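
    The two-level splitting procedure in this abstract (80/20 into modeling and external evaluation sets, then 80/20 into training and test sets) can be sketched for the random-division case as follows. The rational division methods (Kennard-Stone and others) are not shown. The function name, the seed, and the 100 placeholder compound IDs are hypothetical.

```python
import random


def random_split(items, fraction, rng):
    """Shuffle a copy of items and split it into (first part, remainder)."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


rng = random.Random(0)          # fixed seed so the split is repeatable
compounds = list(range(100))    # placeholder compound IDs, not real data

# Level 1: modeling set (80%) versus external evaluation set (20%).
modeling, external = random_split(compounds, 0.8, rng)
# Level 2: training set (80% of modeling) versus test set (20% of modeling).
training, test = random_split(modeling, 0.8, rng)

print(len(training), len(test), len(external))
```

    The external evaluation set stays out of all model building, which is what allows the comparison between the two division methods at the second level.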

  9. Mapping the montane cloud forest of Taiwan using 12 year MODIS-derived ground fog frequency data

    PubMed Central

    Li, Ching-Feng; Thies, Boris; Chang, Shih-Chieh; Bendix, Jörg

    2017-01-01

    Up until now montane cloud forest (MCF) in Taiwan has only been mapped for selected areas of vegetation plots. This paper presents the first comprehensive map of MCF distribution for the entire island. For its creation, a Random Forest model was trained with vegetation plots from the National Vegetation Database of Taiwan that were classified as “MCF” or “non-MCF”. This model predicted the distribution of MCF from a raster data set of parameters derived from a digital elevation model (DEM), Landsat channels and texture measures derived from them as well as ground fog frequency data derived from the Moderate Resolution Imaging Spectroradiometer. While the DEM parameters and Landsat data predicted much of the cloud forest’s location, local deviations in the altitudinal distribution of MCF linked to the monsoonal influence as well as the Massenerhebung effect (causing MCF in atypically low altitudes) were only captured once fog frequency data was included. Therefore, our study suggests that ground fog data are most useful for accurately mapping MCF. PMID:28245279

  10. CW-SSIM kernel based random forest for image classification

    NASA Astrophysics Data System (ADS)

    Fan, Guangzhe; Wang, Zhou; Wang, Jiheng

    2010-07-01

    Complex wavelet structural similarity (CW-SSIM) index has been proposed as a powerful image similarity metric that is robust to translation, scaling and rotation of images, but how to employ it in image classification applications has not been deeply investigated. In this paper, we incorporate CW-SSIM as a kernel function into a random forest learning algorithm. This leads to a novel image classification approach that does not require a feature extraction or dimension reduction stage at the front end. We use hand-written digit recognition as an example to demonstrate our algorithm. We compare the performance of the proposed approach with random forest learning based on other kernels, including the widely adopted Gaussian and the inner product kernels. Empirical evidence shows that the proposed method is superior in its classification power. We also compared our proposed approach with the direct random forest method without kernel and the popular kernel-learning method support vector machine. Our test results based on both simulated and real-world data suggest that the proposed approach outperforms traditional methods without the feature selection procedure.

  11. Aboveground Biomass and Dynamics of Forest Attributes using LiDAR Data and Vegetation Model

    NASA Astrophysics Data System (ADS)

    V V L, P. A.

    2015-12-01

    In recent years, biomass estimation for tropical forests has received much attention because regional biomass is considered to be a critical input to climate change. Biomass largely determines the potential carbon emission that could be released to the atmosphere due to deforestation or conversion to non-forest land use. Thus, accurate biomass estimation is necessary for better understanding of deforestation impacts on global warming and environmental degradation. In this context, forest stand height inclusion in biomass estimation plays a major role in reducing the uncertainty in the estimation of biomass. The improvement in the accuracy of biomass estimates shall also help in meeting the MRV objectives of REDD+. Along with the precise estimate of biomass, it is also important to emphasize the role of vegetation models that will most likely become an important tool for assessing the effects of climate change on potential vegetation dynamics and terrestrial carbon storage and for managing terrestrial ecosystem sustainability. Remote sensing is an efficient way to estimate forest parameters over large areas, especially at regional scale where field data is limited. LIDAR (Light Detection And Ranging) provides accurate information on the vertical structure of forests. We estimated average tree canopy heights and AGB from GLAS waveform parameters by using a multi-regression linear model in the forested area of Madhya Pradesh (area: 3,08,245 km2), India. The derived heights from ICESat-GLAS were correlated with field measured tree canopy heights for 60 plots. Results have shown a significant correlation of R2= 74% for top canopy heights and R2= 57% for stand biomass. The total biomass estimate (320.17 Mt) and canopy heights were generated by using a random forest algorithm. These canopy heights and biomass maps were used in vegetation models to predict the changes in biophysical/physiological characteristics of the forest according to the changing climate. 
In our study we have used a Dynamic Global Vegetation Model to understand the possible vegetation dynamics in the event of climate change. The vegetation represents a biogeographic regime. Simulations were carried out for a 70-year time period. The model produced leaf area index and biomass for each plant functional type and biome for each grid in that region.

  12. Improvement of Forest Height Retrieval By Integration of Dual-Baseline PolInSAR Data And External DEM Data

    NASA Astrophysics Data System (ADS)

    Xie, Q.; Wang, C.; Zhu, J.; Fu, H.; Wang, C.

    2015-06-01

    In recent years, many studies have shown that polarimetric synthetic aperture radar interferometry (PolInSAR) is a powerful technique for forest height mapping and monitoring. However, few studies address the problem of the terrain slope effect, which is one of the major limitations for forest height inversion in mountain forest areas. In this paper, we present a novel forest height retrieval algorithm by integration of dual-baseline PolInSAR data and external DEM data. For the first time, we successfully expand the S-RVoG (Sloped-Random Volume over Ground) model for forest parameters inversion into the case of dual-baseline PolInSAR configuration. In this case, the proposed method not only corrects the terrain slope variation effect efficiently, but also involves more observations to improve the accuracy of parameters inversion. In order to demonstrate the performance of the inversion algorithm, a set of quad-pol images acquired at the P-band in interferometric repeat-pass mode by the German Aerospace Center (DLR) with the Experimental SAR (E-SAR) system, in the frame of the BioSAR2008 campaign, has been used for the retrieval of forest height over Krycklan boreal forest in northern Sweden. At the same time, a high accuracy external DEM in the experimental area has been collected for computing terrain slope information, which subsequently is used as an input parameter in the S-RVoG model. Finally, in-situ ground truth heights at stand level have been collected to validate the inversion result. The preliminary results show that the proposed inversion algorithm promises to provide much more accurate estimation of forest height than traditional dual-baseline inversion algorithms.

  13. Improved high-dimensional prediction with Random Forests by the use of co-data.

    PubMed

    Te Beest, Dennis E; Mes, Steven W; Wilting, Saskia M; Brakenhoff, Ruud H; van de Wiel, Mark A

    2017-12-28

    Prediction in high dimensional settings is difficult due to the large number of variables relative to the sample size. We demonstrate how auxiliary 'co-data' can be used to improve the performance of a Random Forest in such a setting. Co-data are incorporated in the Random Forest by replacing the uniform sampling probabilities that are used to draw candidate variables by co-data moderated sampling probabilities. Co-data here are defined as any type of information that is available on the variables of the primary data, but does not use its response labels. These moderated sampling probabilities are, inspired by empirical Bayes, learned from the data at hand. We demonstrate the co-data moderated Random Forest (CoRF) with two examples. In the first example we aim to predict the presence of a lymph node metastasis with gene expression data. We demonstrate how a set of external p-values, a gene signature, and the correlation between gene expression and DNA copy number can improve the predictive performance. In the second example we demonstrate how the prediction of cervical (pre-)cancer with methylation data can be improved by including the location of the probe relative to the known CpG islands, the number of CpG sites targeted by a probe, and a set of p-values from a related study. The proposed method is able to utilize auxiliary co-data to improve the performance of a Random Forest.
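
    The core mechanism in this abstract is replacing uniform sampling of candidate variables with weighted sampling. The sketch below illustrates only that step: drawing a fixed number of distinct candidate variables with probability proportional to given weights. It is not the CoRF implementation. How the weights are learned from co-data is not shown, sampling without replacement is an assumption, and the function name and weights are hypothetical.

```python
import random


def draw_candidates(weights, mtry, rng):
    """Draw mtry distinct variable indices with probability proportional to weights.

    Assumes at least mtry variables have a positive weight.
    """
    pool = list(range(len(weights)))
    remaining = list(weights)
    chosen = []
    for _ in range(mtry):
        position = rng.choices(range(len(pool)), weights=remaining, k=1)[0]
        chosen.append(pool.pop(position))   # remove so it cannot be drawn again
        remaining.pop(position)
    return chosen


rng = random.Random(1)
# Hypothetical sampling weights: variables 2 and 3 are favoured by the co-data,
# variables 0 and 1 are excluded entirely.
candidates = draw_candidates([0.0, 0.0, 1.0, 1.0], mtry=2, rng=rng)
uniform = draw_candidates([1.0] * 10, mtry=3, rng=rng)
print(candidates, uniform)
```

    With all weights equal, the function reduces to the uniform candidate draw of a standard random forest.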

  14. GIS-based groundwater potential mapping using boosted regression tree, classification and regression tree, and random forest machine learning models in Iran.

    PubMed

    Naghibi, Seyed Amir; Pourghasemi, Hamid Reza; Dixon, Barnali

    2016-01-01

    Groundwater is considered one of the most valuable fresh water resources. The main objective of this study was to produce groundwater spring potential maps in the Koohrang Watershed, Chaharmahal-e-Bakhtiari Province, Iran, using three machine learning models: boosted regression tree (BRT), classification and regression tree (CART), and random forest (RF). Thirteen hydrological-geological-physiographical (HGP) factors that influence locations of springs were considered in this research. These factors include slope degree, slope aspect, altitude, topographic wetness index (TWI), slope length (LS), plan curvature, profile curvature, distance to rivers, distance to faults, lithology, land use, drainage density, and fault density. Subsequently, groundwater spring potential was modeled and mapped using CART, RF, and BRT algorithms. The predicted results from the three models were validated using the receiver operating characteristics curve (ROC). From 864 springs identified, 605 (≈70 %) locations were used for the spring potential mapping, while the remaining 259 (≈30 %) springs were used for the model validation. The area under the curve (AUC) for the BRT model was calculated as 0.8103 and for CART and RF the AUC were 0.7870 and 0.7119, respectively. Therefore, it was concluded that the BRT model produced the best prediction results while predicting locations of springs followed by CART and RF models, respectively. Geospatially integrated BRT, CART, and RF methods proved to be useful in generating the spring potential map (SPM) with reasonable accuracy.
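
    The AUC values used in this abstract to compare BRT, CART, and RF can be read as the probability that a randomly chosen positive location receives a higher score than a randomly chosen negative one. A minimal sketch of that pairwise calculation follows; the function name and scores are hypothetical and are not taken from the study.

```python
def auc_from_scores(positive_scores, negative_scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical model scores at validation locations with and without springs.
spring_scores = [0.9, 0.8, 0.4]
non_spring_scores = [0.7, 0.3, 0.2]
auc = auc_from_scores(spring_scores, non_spring_scores)
print(auc)
```

    The double loop is quadratic in the number of locations, which is fine for a few hundred validation points; rank-based formulas give the same value more efficiently on larger sets.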

  15. [The Effects of Urban Forest-walking Program on Health Promotion Behavior, Physical Health, Depression, and Quality of Life: A Randomized Controlled Trial of Office-workers].

    PubMed

    Bang, Kyung Sook; Lee, In Sook; Kim, Sung Jae; Song, Min Kyung; Park, Se Eun

    2016-02-01

    This study was performed to determine the physical and psychological effects of an urban forest-walking program for office workers. For many workers, sedentary lifestyles can lead to low levels of physical activity causing various health problems despite an increased interest in health promotion. Fifty four office workers participated in this study. They were assigned to two groups (experimental group and control group) in random order and the experimental group performed 5 weeks of walking exercise based on Information-Motivation-Behavioral skills Model. The data were collected from October to November 2014. SPSS 21.0 was used for the statistical analysis. The results showed that the urban forest walking program had positive effects on the physical activity level (U=65.00, p<.001), health promotion behavior (t=-2.20, p=.033), and quality of life (t=-2.42, p=.020). However, there were no statistical differences in depression, waist size, body mass index, blood pressure, or bone density between the groups. The current findings of the study suggest the forest-walking program may have positive effects on improving physical activity, health promotion behavior, and quality of life. The program can be used as an effective and efficient strategy for physical and psychological health promotion for office workers.

  16. Distance error correction for time-of-flight cameras

    NASA Astrophysics Data System (ADS)

    Fuersattel, Peter; Schaller, Christian; Maier, Andreas; Riess, Christian

    2017-06-01

    The measurement accuracy of time-of-flight cameras is limited due to properties of the scene and systematic errors. These errors can accumulate to multiple centimeters, which may limit the applicability of these range sensors. In the past, different approaches have been proposed for improving the accuracy of these cameras. In this work, we propose a new method that improves two important aspects of the range calibration. First, we propose a new checkerboard which is augmented by a gray-level gradient. With this addition it becomes possible to capture the calibration features for intrinsic and distance calibration at the same time. The gradient strip makes it possible to acquire a large number of distance measurements for different surface reflectivities, which results in more meaningful training data. Second, we present multiple new features which are used as input to a random forest regressor. By using random regression forests, we circumvent the problem of finding an accurate model for the measurement error. During application, a correction value for each individual pixel is estimated with the trained forest based on a specifically tailored feature vector. With our approach the measurement error can be reduced by more than 40% for the Mesa SR4000 and by more than 30% for the Microsoft Kinect V2. In our evaluation we also investigate the impact of the individual forest parameters and illustrate the importance of the individual features.

  17. Differential privacy-based evaporative cooling feature selection and classification with relief-F and random forests.

    PubMed

    Le, Trang T; Simmons, W Kyle; Misaki, Masaya; Bodurka, Jerzy; White, Bill C; Savitz, Jonathan; McKinney, Brett A

    2017-09-15

    Classification of individuals into disease or clinical categories from high-dimensional biological data with low prediction error is an important challenge of statistical learning in bioinformatics. Feature selection can improve classification accuracy but must be incorporated carefully into cross-validation to avoid overfitting. Recently, feature selection methods based on differential privacy, such as differentially private random forests and reusable holdout sets, have been proposed. However, for domains such as bioinformatics, where the number of features is much larger than the number of observations (p ≫ n), these differential privacy methods are susceptible to overfitting. We introduce private Evaporative Cooling, a stochastic privacy-preserving machine learning algorithm that uses Relief-F for feature selection and random forest for privacy-preserving classification that also prevents overfitting. We relate the privacy-preserving threshold mechanism to a thermodynamic Maxwell-Boltzmann distribution, where the temperature represents the privacy threshold. We use the thermal statistical physics concept of Evaporative Cooling of atomic gases to perform backward stepwise privacy-preserving feature selection. On simulated data with main effects and statistical interactions, we compare accuracies on holdout and validation sets for three privacy-preserving methods: the reusable holdout, reusable holdout with random forest, and private Evaporative Cooling, which uses Relief-F feature selection and random forest classification. In simulations where interactions exist between attributes, private Evaporative Cooling provides higher classification accuracy without overfitting based on an independent validation set. In simulations without interactions, thresholdout with random forest and private Evaporative Cooling give comparable accuracies. We also apply these privacy methods to human brain resting-state fMRI data from a study of major depressive disorder.
    Code available at http://insilico.utulsa.edu/software/privateEC. Contact: brett-mckinney@utulsa.edu. Supplementary data are available at Bioinformatics online.

  18. Deorientation of PolSAR coherency matrix for volume scattering retrieval

    NASA Astrophysics Data System (ADS)

    Kumar, Shashi; Garg, R. D.; Kushwaha, S. P. S.

    2016-05-01

    Polarimetric SAR (PolSAR) data have proven their potential for extracting scattering information for different features appearing in a single resolution cell. Several decomposition modelling approaches have been developed to retrieve scattering information from PolSAR data. During scattering power decomposition based on physical scattering models, it becomes very difficult to distinguish volume scattering from randomly oriented vegetation from the scattering of oblique structures, which are responsible for both double-bounce and volume scattering, because both are decomposed into the same scattering mechanism. The polarization orientation angle (POA) of an electromagnetic wave is one of the most important characteristics that changes due to scattering from the geometrical structure of topographic slopes, oriented urban areas and randomly oriented features such as vegetation cover. The shift in POA affects the polarimetric radar signatures, so compensating for the polarization orientation shift is essential for accurately estimating the scattering nature of a feature. The prime objective of this work was to investigate the effect of the POA shift on scattering information retrieval and to explore the effect of deorientation on the regression between field-estimated aboveground biomass (AGB) and volume scattering. Dudhwa National Park, U.P., India was selected as the study area, and fully polarimetric ALOS PALSAR data were used to retrieve scattering information from its forest area. Field data for DBH and tree height were collected for AGB estimation using stratified random sampling. AGB was estimated for 170 plots at different locations in the forest area. The Yamaguchi four-component decomposition modelling approach was used to retrieve surface, double-bounce, helix and volume scattering information. The shift in polarization orientation angle was estimated, and deorientation of the coherency matrix was performed to compensate for the POA shift. The effect of deorientation on the RGB color composite of the forest area is clearly visible. Without deorientation, PolSAR decomposition overestimated volume scattering and underestimated double-bounce scattering; after deorientation, double-bounce scattering increased and volume scattering decreased. This study focused mainly on volume scattering retrieval and its relation to field-estimated AGB. The change in volume scattering after POA compensation of the PolSAR data was recorded, and volume scattering values were compared for all 170 forest plots for which field data were collected. Volume scattering decreased after deorientation for all plots. A regression between PolSAR decomposition-based volume scattering and AGB was performed. Before deorientation, the coefficient of determination (R²) between volume scattering and AGB was 0.225; after deorientation it improved to 0.613. This study recommends deorientation of PolSAR data for decomposition modelling to retrieve reliable volume scattering information from forest areas.
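
    The coefficient of determination reported in this record comes from regressing field-estimated AGB against volume scattering. As a minimal sketch of how such a value is computed, assuming a simple ordinary least-squares line (the abstract does not state the regression form) and a hypothetical helper name `r_squared`:

```python
def r_squared(x, y):
    """Coefficient of determination for the least-squares line y = a + b*x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products about the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares of the fitted line.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return 1.0 - ss_res / syy
```

    Here `x` would hold the per-plot volume scattering values and `y` the per-plot AGB estimates; comparing the result before and after deorientation mirrors the comparison described in the abstract.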

  19. What does it take to get family forest owners to enroll in a forest stewardship-type program?

    Treesearch

    Michael A. Kilgore; Stephanie A. Snyder; Joseph Schertz; Steven J. Taff

    2008-01-01

    We estimated the probability of enrollment and factors influencing participation in a forest stewardship-type program, Minnesota's Sustainable Forest Incentives Act, using data from a mail survey of over 1000 randomly-selected Minnesota family forest owners. Of the 15 variables tested, only five were significant predictors of a landowner's interest in...

  20. Mapping forest vegetation for the western United States using modified random forests imputation of FIA forest plots

    Treesearch

    Karin Riley; Isaac C. Grenfell; Mark A. Finney

    2016-01-01

    Maps of the number, size, and species of trees in forests across the western United States are desirable for many applications such as estimating terrestrial carbon resources, predicting tree mortality following wildfires, and for forest inventory. However, detailed mapping of trees for large areas is not feasible with current technologies, but statistical...

  1. Predictors of occurrence of the aquatic macrophyte Podostemum ceratophyllum in a southern Appalachian River

    USGS Publications Warehouse

    Argentina, Jane E.; Freeman, Mary C.; Freeman, Byron J.

    2010-01-01

    The aquatic macrophyte Podostemum ceratophyllum (Hornleaf Riverweed) commonly provides habitat for invertebrates and fishes in flowing-water portions of Piedmont and Appalachian streams in the eastern US. We quantified variation in percent cover by P. ceratophyllum in a 39-km reach of the Conasauga River, TN and GA, to test the hypothesis that cover decreased with increasing non-forest land use. We estimated percent P. ceratophyllum cover in quadrats (0.09 m²) placed at random coordinates within 20 randomly selected shoals. We then used hierarchical logistic regression, in an information-theoretic framework, to evaluate relative support for models incorporating alternative combinations of microhabitat and shoal-level variables to predict the occurrence of high (≥50%) P. ceratophyllum cover. As expected, bed sediment size and measures of light availability (location in the center of the channel, canopy cover) were included in best-supported models and had similar estimated-effect sizes across models. Podostemum ceratophyllum cover declined with increasing watershed size (included in 8 of 13 models in the confidence set of models); however, this decrease in cover was not well predicted by variation in land use. Focused monitoring of temporal and spatial trends in the status of P. ceratophyllum is important due to its biotic importance in fast-flowing waters and its potential sensitivity to landscape-level changes, such as declines in forested land cover and homogenization of benthic habitats.

  2. Random Forest-Based Approach for Maximum Power Point Tracking of Photovoltaic Systems Operating under Actual Environmental Conditions.

    PubMed

    Shareef, Hussain; Mutlag, Ammar Hussein; Mohamed, Azah

    2017-01-01

    Many maximum power point tracking (MPPT) algorithms have been developed in recent years to maximize the produced PV energy. These algorithms are not sufficiently robust because of fast-changing environmental conditions, efficiency, accuracy at steady-state value, and dynamics of the tracking algorithm. Thus, this paper proposes a new random forest (RF) model to improve MPPT performance. The RF model has the ability to capture the nonlinear association of patterns between predictors, such as irradiance and temperature, to determine an accurate maximum power point. An RF-based tracker is designed for 25 SolarTIFSTF-120P6 PV modules, with a capacity of 3 kW peak, using two high-speed sensors. For this purpose, a complete PV system is modeled using 300,000 data samples and simulated using the MATLAB/SIMULINK package. The proposed RF-based MPPT is then tested under actual environmental conditions for 24 days to validate the accuracy and dynamic response. The response of the RF-based MPPT model is also compared with that of the artificial neural network and adaptive neurofuzzy inference system algorithms for further validation. The results show that the proposed MPPT technique gives significant improvement compared with that of other techniques. In addition, the RF model passes the Bland-Altman test, with more than 95 percent acceptability.

  3. Accurate prediction of personalized olfactory perception from large-scale chemoinformatic features.

    PubMed

    Li, Hongyang; Panwar, Bharat; Omenn, Gilbert S; Guan, Yuanfang

    2018-02-01

    The olfactory stimulus-percept problem has been studied for more than a century, yet it is still hard to precisely predict the odor given the large-scale chemoinformatic features of an odorant molecule. A major challenge is that the perceived qualities vary greatly among individuals due to different genetic and cultural backgrounds. Moreover, the combinatorial interactions between multiple odorant receptors and diverse molecules significantly complicate the olfaction prediction. Many attempts have been made to establish structure-odor relationships for intensity and pleasantness, but no models are available to predict the personalized multi-odor attributes of molecules. In this study, we describe our winning algorithm for predicting individual and population perceptual responses to various odorants in the DREAM Olfaction Prediction Challenge. We find that a random forest model consisting of multiple decision trees is well suited to this prediction problem, given the large feature spaces and high variability of perceptual ratings among individuals. Integrating both population and individual perceptions into our model effectively reduces the influence of noise and outliers. By analyzing the importance of each chemical feature, we find that a small set of low- and nondegenerative features is sufficient for accurate prediction. Our random forest model successfully predicts personalized odor attributes of structurally diverse molecules. This model, together with the top discriminative features, has the potential to extend our understanding of olfactory perception mechanisms and provide an alternative for rational odorant design.

  4. Random Forest-Based Approach for Maximum Power Point Tracking of Photovoltaic Systems Operating under Actual Environmental Conditions

    PubMed Central

    Shareef, Hussain; Mohamed, Azah

    2017-01-01

    Many maximum power point tracking (MPPT) algorithms have been developed in recent years to maximize the produced PV energy. These algorithms are not sufficiently robust because of fast-changing environmental conditions, efficiency, accuracy at steady-state value, and dynamics of the tracking algorithm. Thus, this paper proposes a new random forest (RF) model to improve MPPT performance. The RF model has the ability to capture the nonlinear association of patterns between predictors, such as irradiance and temperature, to determine an accurate maximum power point. An RF-based tracker is designed for 25 SolarTIFSTF-120P6 PV modules, with a capacity of 3 kW peak, using two high-speed sensors. For this purpose, a complete PV system is modeled using 300,000 data samples and simulated using the MATLAB/SIMULINK package. The proposed RF-based MPPT is then tested under actual environmental conditions for 24 days to validate the accuracy and dynamic response. The response of the RF-based MPPT model is also compared with that of the artificial neural network and adaptive neurofuzzy inference system algorithms for further validation. The results show that the proposed MPPT technique gives significant improvement compared with that of other techniques. In addition, the RF model passes the Bland–Altman test, with more than 95 percent acceptability. PMID:28702051

  5. A comparison of selected parametric and imputation methods for estimating snag density and snag quality attributes

    USGS Publications Warehouse

    Eskelson, Bianca N.I.; Hagar, Joan; Temesgen, Hailemariam

    2012-01-01

    Snags (standing dead trees) are an essential structural component of forests. Because wildlife use of snags depends on size and decay stage, snag density estimation without any information about snag quality attributes is of little value for wildlife management decision makers. Little work has been done to develop models that allow multivariate estimation of snag density by snag quality class. Using climate, topography, Landsat TM data, stand age and forest type collected for 2356 forested Forest Inventory and Analysis plots in western Washington and western Oregon, we evaluated two multivariate techniques for their abilities to estimate density of snags by three decay classes. The density of live trees and snags in three decay classes (D1: recently dead, little decay; D2: decay, without top, some branches and bark missing; D3: extensive decay, missing bark and most branches) with diameter at breast height (DBH) ≥ 12.7 cm was estimated using a nonparametric random forest nearest neighbor imputation technique (RF) and a parametric two-stage model (QPORD), for which the number of trees per hectare was estimated with a Quasipoisson model in the first stage and the probability of belonging to a tree status class (live, D1, D2, D3) was estimated with an ordinal regression model in the second stage. The presence of large snags with DBH ≥ 50 cm was predicted using a logistic regression and RF imputation. Because of the more homogenous conditions on private forest lands, snag density by decay class was predicted with higher accuracies on private forest lands than on public lands, while presence of large snags was more accurately predicted on public lands, owing to the higher prevalence of large snags on public lands. RF outperformed the QPORD model in terms of percent accurate predictions, while QPORD provided smaller root mean square errors in predicting snag density by decay class. 
The logistic regression model achieved more accurate presence/absence classification of large snags than the RF imputation approach. Adjusting the decision threshold to account for unequal size for presence and absence classes is more straightforward for the logistic regression than for the RF imputation approach. Overall, model accuracies were poor in this study, which can be attributed to the poor predictive quality of the explanatory variables and the large range of forest types and geographic conditions observed in the data.

  6. The Random Forests Statistical Technique: An Examination of Its Value for the Study of Reading

    ERIC Educational Resources Information Center

    Matsuki, Kazunaga; Kuperman, Victor; Van Dyke, Julie A.

    2016-01-01

    Studies investigating individual differences in reading ability often involve data sets containing a large number of collinear predictors and a small number of observations. In this article, we discuss the method of Random Forests and demonstrate its suitability for addressing the statistical concerns raised by such data sets. The method is…

  7. An Introduction to Recursive Partitioning: Rationale, Application, and Characteristics of Classification and Regression Trees, Bagging, and Random Forests

    ERIC Educational Resources Information Center

    Strobl, Carolin; Malley, James; Tutz, Gerhard

    2009-01-01

    Recursive partitioning methods have become popular and widely used tools for nonparametric regression and classification in many scientific fields. Especially random forests, which can deal with large numbers of predictor variables even in the presence of complex interactions, have been applied successfully in genetics, clinical medicine, and…

  8. Random location of fuel treatments in wildland community interfaces: a percolation approach

    Treesearch

    Michael Bevers; Philip N. Omi; John G. Hof

    2004-01-01

    We explore the use of spatially correlated random treatments to reduce fuels in landscape patterns that appear somewhat natural while forming fully connected fuelbreaks between wildland forests and developed protection zones. From treatment zone maps partitioned into grids of hexagonal forest cells representing potential treatment sites, we selected cells to be treated...

  9. Measuring the effect of fuel treatments on forest carbon using landscape risk analysis

    NASA Astrophysics Data System (ADS)

    Ager, A. A.; Finney, M. A.; McMahan, A.; Cathcart, J.

    2010-12-01

    Wildfire simulation modelling was used to examine whether fuel reduction treatments can potentially reduce future wildfire emissions and provide carbon benefits. In contrast to previous reports, the current study modelled landscape scale effects of fuel treatments on fire spread and intensity, and used a probabilistic framework to quantify wildfire effects on carbon pools to account for stochastic wildfire occurrence. The study area was a 68 474 ha watershed located on the Fremont-Winema National Forest in southeastern Oregon, USA. Fuel reduction treatments were simulated on 10% of the watershed (19% of federal forestland). We simulated 30 000 wildfires with random ignition locations under both treated and untreated landscapes to estimate the change in burn probability by flame length class resulting from the treatments. Carbon loss functions were then calculated with the Forest Vegetation Simulator for each stand in the study area to quantify change in carbon as a function of flame length. We then calculated the expected change in carbon from a random ignition and wildfire as the sum of the product of the carbon loss and the burn probabilities by flame length class. The expected carbon difference between the non-treatment and treatment scenarios was then calculated to quantify the effect of fuel treatments. Overall, the results show that the carbon loss from implementing fuel reduction treatments exceeded the expected carbon benefit associated with lowered burn probabilities and reduced fire severity on the treated landscape. Thus, fuel management activities resulted in an expected net loss of carbon immediately after treatment. However, the findings represent a point in time estimate (wildfire immediately after treatments), and a temporal analysis with a probabilistic framework used here is needed to model carbon dynamics over the life cycle of the fuel treatments. 
Of particular importance is the long-term balance between emissions from the decay of dead trees killed by fire and carbon sequestration by forest regeneration following wildfire.
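
    The expected-value step described in this record (the sum, over flame length classes, of carbon loss times burn probability) can be illustrated with a short sketch. The function name, the three flame length classes and all numbers below are hypothetical and are not taken from the study:

```python
def expected_carbon_change(burn_probabilities, carbon_changes):
    """Sum of (burn probability x carbon change) over flame length classes."""
    if len(burn_probabilities) != len(carbon_changes):
        raise ValueError("one carbon change is needed per flame length class")
    if any(p < 0 for p in burn_probabilities) or sum(burn_probabilities) > 1:
        raise ValueError("burn probabilities must be non-negative and sum to at most 1")
    return sum(p * c for p, c in zip(burn_probabilities, carbon_changes))


# Hypothetical stand: carbon change (Mg C/ha, negative = loss) per flame length class.
carbon_changes = [-5.0, -20.0, -60.0]
# Hypothetical burn probabilities per class, without and with fuel treatment.
untreated = expected_carbon_change([0.02, 0.01, 0.005], carbon_changes)
treated = expected_carbon_change([0.015, 0.004, 0.001], carbon_changes)
# Expected carbon retained because treatment lowers burn probabilities.
fire_benefit = treated - untreated
```

    In the study's framing, this expected benefit would then be weighed against the carbon removed by the treatment itself; the sketch covers only the probability-weighted sum.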

  10. Landscape Analysis of Adult Florida Panther Habitat.

    PubMed

    Frakes, Robert A; Belden, Robert C; Wood, Barry E; James, Frederick E

    2015-01-01

    Historically occurring throughout the southeastern United States, the Florida panther is now restricted to less than 5% of its historic range in one breeding population located in southern Florida. Using radio-telemetry data from 87 prime-aged (≥3 years old) adult panthers (35 males and 52 females) during the period 2004 through 2013 (28,720 radio-locations), we analyzed the characteristics of the occupied area and used those attributes in a random forest model to develop a predictive distribution map for resident breeding panthers in southern Florida. Using 10-fold cross validation, the model was 87.5% accurate in predicting presence or absence of panthers in the 16,678 km² study area. Analysis of variable importance indicated that the amount of forests and forest edge, hydrology, and human population density were the most important factors determining presence or absence of panthers. Sensitivity analysis showed that the presence of human populations, roads, and agriculture (other than pasture) had strong negative effects on the probability of panther presence. Forest cover and forest edge had strong positive effects. The median model-predicted probability of presence for panther home ranges was 0.81 (0.82 for females and 0.74 for males). The model identified 5579 km² of suitable breeding habitat remaining in southern Florida; 1399 km² (25%) of this habitat is in non-protected private ownership. Because there is less panther habitat remaining than previously thought, we recommend that all remaining breeding habitat in south Florida should be maintained, and the current panther range should be expanded into south-central Florida. This model should be useful for evaluating the impacts of future development projects, in prioritizing areas for panther conservation, and in evaluating the potential impacts of sea-level rise and changes in hydrology.
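
    The 10-fold cross validation mentioned in this record follows a general pattern: shuffle the samples, split them into ten folds, and score each fold with a model trained on the other nine. The sketch below shows that pattern only; `k_fold_indices`, `cross_validated_accuracy`, and the `fit`/`predict` callables are hypothetical names, and any classifier (a random forest in the study) could be plugged in:

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def cross_validated_accuracy(features, labels, fit, predict, k=10, seed=0):
    """Fraction of samples predicted correctly when each is held out once."""
    correct = 0
    for train, test in k_fold_indices(len(labels), k, seed):
        model = fit([features[i] for i in train], [labels[i] for i in train])
        correct += sum(predict(model, features[i]) == labels[i] for i in test)
    return correct / len(labels)
```

    Each sample is used for testing exactly once, so the returned accuracy is an average over all held-out predictions.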

  11. Detecting understory plant invasion in urban forests using LiDAR

    NASA Astrophysics Data System (ADS)

    Singh, Kunwar K.; Davis, Amy J.; Meentemeyer, Ross K.

    2015-06-01

    Light detection and ranging (LiDAR) data are increasingly used to measure structural characteristics of urban forests but are rarely used to detect the growing problem of exotic understory plant invaders. We explored the merits of using LiDAR-derived metrics alone and through integration with spectral data to detect the spatial distribution of the exotic understory plant Ligustrum sinense, a rapidly spreading invader in the urbanizing region of Charlotte, North Carolina, USA. We analyzed regional-scale L. sinense occurrence data collected over the course of three years with LiDAR-derived metrics of forest structure that were categorized into the following groups: overstory, understory, topography, and overall vegetation characteristics, together with optical IKONOS spectral features. Using random forest (RF) and logistic regression (LR) classifiers, we assessed the relative contributions of LiDAR and IKONOS derived variables to the detection of L. sinense. We compared the top performing models developed for a smaller, nested experimental extent using RF and LR classifiers, and used the best overall model to produce a predictive map of the spatial distribution of L. sinense across our county-wide study extent. RF classification of LiDAR-derived topography metrics produced the highest mapping accuracy estimates, outperforming IKONOS data by 17.5% and the integration of LiDAR and IKONOS data by 5.3%. The top performing model from the RF classifier produced the highest kappa of 64.8%, improving on the parsimonious LR model kappa by 31.1% with a moderate gain of 6.2% over the county extent model. Our results demonstrate the superiority of LiDAR-derived metrics over spectral data and fusion of LiDAR and spectral data for accurately mapping the spatial distribution of the forest understory invader L. sinense.
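
    The kappa statistic quoted in this record measures agreement beyond what chance alone would produce. A minimal sketch of Cohen's kappa computed from a confusion matrix follows; the function name and the example matrix are illustrative and not drawn from the study:

```python
def cohens_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows: reference, cols: predicted)."""
    k = len(confusion)
    if any(len(row) != k for row in confusion):
        raise ValueError("confusion matrix must be square")
    n = sum(sum(row) for row in confusion)
    if n == 0:
        raise ValueError("confusion matrix is empty")
    # Observed agreement: share of samples on the diagonal.
    observed = sum(confusion[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Illustrative presence/absence matrix: 80% overall accuracy, 50% chance agreement.
example_kappa = cohens_kappa([[40, 10], [10, 40]])
```

    With balanced classes as in the example, an overall accuracy of 80% corresponds to a kappa of 0.6, which shows why kappa is typically lower than raw accuracy.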

  12. Adaptive economic and ecological forest management under risk

    Treesearch

    Joseph Buongiorno; Mo Zhou

    2015-01-01

    Background: Forest managers must deal with inherently stochastic ecological and economic processes. The future growth of trees is uncertain, and so is their value. The randomness of low-impact, high frequency or rare catastrophic shocks in forest growth has significant implications in shaping the mix of tree species and the forest landscape...

  13. Feature combination networks for the interpretation of statistical machine learning models: application to Ames mutagenicity.

    PubMed

    Webb, Samuel J; Hanser, Thierry; Howlin, Brendan; Krause, Paul; Vessey, Jonathan D

    2014-03-25

    A new algorithm has been developed to enable the interpretation of black box models. The developed algorithm is agnostic to the learning algorithm and open to all structure-based descriptors such as fragments, keys and hashed fingerprints. The algorithm has provided meaningful interpretation of Ames mutagenicity predictions from both random forest and support vector machine models built on a variety of structural fingerprints. A fragmentation algorithm is utilised to investigate the model's behaviour on specific substructures present in the query. An output is formulated summarising causes of activation and deactivation. The algorithm is able to identify multiple causes of activation or deactivation in addition to identifying localised deactivations where the prediction for the query is active overall. No loss in performance is seen as there is no change in the prediction; the interpretation is produced directly on the model's behaviour for the specific query. Models have been built using multiple learning algorithms including support vector machine and random forest. The models were built on public Ames mutagenicity data and a variety of fingerprint descriptors were used. These models produced a good performance in both internal and external validation with accuracies around 82%. The models were used to evaluate the interpretation algorithm. The interpretations produced link closely with understood mechanisms for Ames mutagenicity. This methodology allows for a greater utilisation of the predictions made by black box models and can expedite further study based on the output for a (quantitative) structure activity model. Additionally, the algorithm could be utilised for chemical dataset investigation and knowledge extraction/human SAR development.

  14. Seeing the forest for the trees: utilizing modified random forests imputation of forest plot data for landscape-level analyses

    Treesearch

    Karin L. Riley; Isaac C. Grenfell; Mark A. Finney

    2015-01-01

    Mapping the number, size, and species of trees in forests across the western United States has utility for a number of research endeavors, ranging from estimation of terrestrial carbon resources to tree mortality following wildfires. For landscape fire and forest simulations that use the Forest Vegetation Simulator (FVS), a tree-level dataset, or “tree list”, is a...

  15. Forecasting model of Corylus, Alnus, and Betula pollen concentration levels using spatiotemporal correlation properties of pollen count.

    PubMed

    Nowosad, Jakub; Stach, Alfred; Kasprzyk, Idalia; Weryszko-Chmielewska, Elżbieta; Piotrowska-Weryszko, Krystyna; Puc, Małgorzata; Grewling, Łukasz; Pędziszewska, Anna; Uruska, Agnieszka; Myszkowska, Dorota; Chłopek, Kazimiera; Majkowska-Wojciechowska, Barbara

    The aim of the study was to create and evaluate models for predicting high levels of daily pollen concentration of Corylus, Alnus, and Betula using a spatiotemporal correlation of pollen count. For each taxon, a high pollen count level was established according to the first allergy symptoms during exposure. The dataset was divided into a training set and a test set, using a stratified random split. For each taxon and city, the model was built using a random forest method. Corylus models performed poorly. However, the study revealed the possibility of predicting with substantial accuracy the occurrence of days with high pollen concentrations of Alnus and Betula using past pollen count data from monitoring sites. These results can be used for building (1) simpler models, which require data only from aerobiological monitoring sites, and (2) combined meteorological and aerobiological models for predicting high levels of pollen concentration.
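
    A stratified random split, as used in this record, keeps the proportion of each class (for example, high versus low pollen days) roughly equal in the training and test sets. The sketch below is one simple way to do this; the function name, the 25% test fraction and the class labels are assumptions made for illustration:

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train, test) index lists with class proportions preserved."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    rng = random.Random(seed)
    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for indices in by_class.values():
        # Shuffle within each class, then cut off the test share.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Illustrative labels: 20 "high" days and 80 "low" days.
day_labels = ["high"] * 20 + ["low"] * 80
train_idx, test_idx = stratified_split(day_labels)
```

    Splitting within each class matters most when one class is rare, since a plain random split could leave the test set with very few high-concentration days.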

  16. Forest Cover Estimation in Ireland Using Radar Remote Sensing: A Comparative Analysis of Forest Cover Assessment Methodologies.

    PubMed

    Devaney, John; Barrett, Brian; Barrett, Frank; Redmond, John; O'Halloran, John

    2015-01-01

    Quantification of spatial and temporal changes in forest cover is an essential component of forest monitoring programs. Due to its cloud free capability, Synthetic Aperture Radar (SAR) is an ideal source of information on forest dynamics in countries with near-constant cloud-cover. However, few studies have investigated the use of SAR for forest cover estimation in landscapes with highly sparse and fragmented forest cover. In this study, the potential use of L-band SAR for forest cover estimation in two regions (Longford and Sligo) in Ireland is investigated and compared to forest cover estimates derived from three national (Forestry2010, Prime2, National Forest Inventory), one pan-European (Forest Map 2006) and one global forest cover (Global Forest Change) product. Two machine-learning approaches (Random Forests and Extremely Randomised Trees) are evaluated. Both Random Forests and Extremely Randomised Trees classification accuracies were high (98.1-98.5%), with differences between the two classifiers being minimal (<0.5%). Increasing levels of post classification filtering led to a decrease in estimated forest area and an increase in overall accuracy of SAR-derived forest cover maps. All forest cover products were evaluated using an independent validation dataset. For the Longford region, the highest overall accuracy was recorded with the Forestry2010 dataset (97.42%) whereas in Sligo, highest overall accuracy was obtained for the Prime2 dataset (97.43%), although accuracies of SAR-derived forest maps were comparable. Our findings indicate that spaceborne radar could aid inventories in regions with low levels of forest cover in fragmented landscapes. The reduced accuracies observed for the global and pan-continental forest cover maps in comparison to national and SAR-derived forest maps indicate that caution should be exercised when applying these datasets for national reporting.

  17. Forest Cover Estimation in Ireland Using Radar Remote Sensing: A Comparative Analysis of Forest Cover Assessment Methodologies

    PubMed Central

    Devaney, John; Barrett, Brian; Barrett, Frank; Redmond, John; O'Halloran, John

    2015-01-01

    Quantification of spatial and temporal changes in forest cover is an essential component of forest monitoring programs. Due to its cloud free capability, Synthetic Aperture Radar (SAR) is an ideal source of information on forest dynamics in countries with near-constant cloud-cover. However, few studies have investigated the use of SAR for forest cover estimation in landscapes with highly sparse and fragmented forest cover. In this study, the potential use of L-band SAR for forest cover estimation in two regions (Longford and Sligo) in Ireland is investigated and compared to forest cover estimates derived from three national (Forestry2010, Prime2, National Forest Inventory), one pan-European (Forest Map 2006) and one global forest cover (Global Forest Change) product. Two machine-learning approaches (Random Forests and Extremely Randomised Trees) are evaluated. Both Random Forests and Extremely Randomised Trees classification accuracies were high (98.1–98.5%), with differences between the two classifiers being minimal (<0.5%). Increasing levels of post classification filtering led to a decrease in estimated forest area and an increase in overall accuracy of SAR-derived forest cover maps. All forest cover products were evaluated using an independent validation dataset. For the Longford region, the highest overall accuracy was recorded with the Forestry2010 dataset (97.42%) whereas in Sligo, highest overall accuracy was obtained for the Prime2 dataset (97.43%), although accuracies of SAR-derived forest maps were comparable. Our findings indicate that spaceborne radar could aid inventories in regions with low levels of forest cover in fragmented landscapes. The reduced accuracies observed for the global and pan-continental forest cover maps in comparison to national and SAR-derived forest maps indicate that caution should be exercised when applying these datasets for national reporting. PMID:26262681

  18. WDL-RF: Predicting Bioactivities of Ligand Molecules Acting with G Protein-coupled Receptors by Combining Weighted Deep Learning and Random Forest.

    PubMed

    Wu, Jiansheng; Zhang, Qiuming; Wu, Weijian; Pang, Tao; Hu, Haifeng; Chan, Wallace K B; Ke, Xiaoyan; Zhang, Yang; Wren, Jonathan

    2018-02-08

    Precise assessment of ligand bioactivities (including IC50, EC50, Ki, Kd, etc.) is essential for virtual screening and lead compound identification. However, not all ligands have experimentally-determined activities. In particular, many G protein-coupled receptors (GPCRs), which are the largest integral membrane protein family and represent targets of nearly 40% of drugs on the market, lack published experimental data about ligand interactions. Computational methods with the ability to accurately predict the bioactivity of ligands can help efficiently address this problem. We proposed a new method, WDL-RF, using weighted deep learning and random forest, to model the bioactivity of GPCR-associated ligand molecules. The pipeline of our algorithm consists of two consecutive stages: 1) molecular fingerprint generation through a new weighted deep learning method, and 2) bioactivity calculations with a random forest model; a distinctive feature of the approach is that the model allows end-to-end learning of prediction pipelines with input ligands being of arbitrary size. The method was tested on a set of twenty-six non-redundant GPCRs that have a high number of active ligands, each with 200∼4000 ligand associations. The results from our benchmark show that WDL-RF can generate bioactivity predictions with an average root-mean-square error of 1.33 and correlation coefficient (r2) of 0.80 compared to the experimental measurements, which are significantly more accurate than the control predictors with different molecular fingerprints and descriptors. In particular, data-driven molecular fingerprint features, as extracted from the weighted deep learning models, can help solve deficiencies stemming from the use of traditional hand-crafted features and significantly increase the efficiency of short molecular fingerprints in virtual screening. 
The WDL-RF web server, as well as source codes and datasets of WDL-RF, is freely available at https://zhanglab.ccmb.med.umich.edu/WDL-RF/ for academic purposes. Xiaoyan Ke (kexynj@hotmail.com); Yang Zhang (zhng@umich.edu). Supplementary data are available at Bioinformatics online. © The Author (2018). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com

  19. Multidecadal Rates of Disturbance- and Climate Change-Induced Land Cover Change in Arctic and Boreal Ecosystems over Western Canada and Alaska Inferred from Dense Landsat Time Series

    NASA Astrophysics Data System (ADS)

    Wang, J.; Sulla-menashe, D. J.; Woodcock, C. E.; Sonnentag, O.; Friedl, M. A.

    2017-12-01

    Rapid climate change in arctic and boreal ecosystems is driving changes to land cover composition, including woody expansion in the arctic tundra, successional shifts following boreal fires, and thaw-induced wetland expansion and forest collapse along the southern limit of permafrost. The impacts of these land cover transformations on the physical climate and the carbon cycle are increasingly well-documented from field and model studies, but there have been few attempts to empirically estimate rates of land cover change at decadal time scale and continental spatial scale. Previous studies have used too coarse spatial resolution or have been too limited in temporal range to enable broad multi-decadal assessment of land cover change. As part of NASA's Arctic Boreal Vulnerability Experiment (ABoVE), we are using dense time series of Landsat remote sensing data to map disturbances and classify land cover types across the ABoVE extended domain (spanning western Canada and Alaska) over the last three decades (1982-2014) at 30 m resolution. We utilize regionally-complete and repeated acquisition high-resolution (<2 m) DigitalGlobe imagery to generate training data from across the region that follows a nested, hierarchical classification scheme encompassing plant functional type and cover density, understory type, wetland status, and land use. Additionally, we crosswalk plot-level field data into our scheme for additional high quality training sites. We use the Continuous Change Detection and Classification algorithm to estimate land cover change dates and temporal-spectral features in the Landsat data. These features are used to train random forest classification models and map land cover and analyze land cover change processes, focusing primarily on tundra "shrubification", post-fire succession, and boreal wetland expansion. 
We will analyze the high resolution data based on stratified random sampling of our change maps to validate and assess the accuracy of our model predictions. In this paper, we present initial results from this effort, including sub-regional analyses focused on several key areas, such as the Taiga Plains and the Southern Arctic ecozones, to calibrate our random forest models and assess results.
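
    The validation design mentioned above, stratified random sampling of a change map, can be sketched as follows. This is a minimal illustration only: the class names, the per-stratum sample size, and the `stratified_sample` helper are hypothetical and are not taken from the abstract.

```python
import random
from collections import defaultdict


def stratified_sample(pixel_classes, n_per_stratum, seed=0):
    """Draw up to n_per_stratum pixel ids at random from each mapped class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for pixel_id, label in pixel_classes.items():
        by_class[label].append(pixel_id)
    sample = {}
    for label, ids in sorted(by_class.items()):
        # Rare classes may hold fewer pixels than requested.
        k = min(n_per_stratum, len(ids))
        sample[label] = sorted(rng.sample(ids, k))
    return sample


# Hypothetical change map: pixel id -> mapped class (10% "burned", 90% "stable").
change_map = {i: ("burned" if i % 10 == 0 else "stable") for i in range(200)}
validation = stratified_sample(change_map, n_per_stratum=5)
print(validation)
```

    Sampling each stratum separately means that rare change classes still receive validation points, which simple random sampling of the whole map would not guarantee.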

  20. Image matching as a data source for forest inventory - Comparison of Semi-Global Matching and Next-Generation Automatic Terrain Extraction algorithms in a typical managed boreal forest environment

    NASA Astrophysics Data System (ADS)

    Kukkonen, M.; Maltamo, M.; Packalen, P.

    2017-08-01

    Image matching is emerging as a compelling alternative to airborne laser scanning (ALS) as a data source for forest inventory and management. There is currently an open discussion in the forest inventory community about whether, and to what extent, the new method can be applied to practical inventory campaigns. This paper aims to contribute to this discussion by comparing two different image matching algorithms (Semi-Global Matching [SGM] and Next-Generation Automatic Terrain Extraction [NGATE]) and ALS in a typical managed boreal forest environment in southern Finland. Spectral features from unrectified aerial images were included in the modeling and the potential of image matching in areas without a high resolution digital terrain model (DTM) was also explored. Plot level predictions for total volume, stem number, basal area, height of basal area median tree and diameter of basal area median tree were modeled using an area-based approach. Plot level dominant tree species were predicted using a random forest algorithm, also using an area-based approach. The statistical difference between the error rates from different datasets was evaluated using a bootstrap method. Results showed that ALS outperformed image matching with every forest attribute, even when a high resolution DTM was used for height normalization and spectral information from images was included. Dominant tree species classification with image matching achieved accuracy levels similar to ALS regardless of the resolution of the DTM when spectral metrics were used. Neither of the image matching algorithms consistently outperformed the other, but there were noticeably different error rates depending on the parameter configuration, spectral band, resolution of DTM, or response variable. This study showed that image matching provides reasonable point cloud data for forest inventory purposes, especially when a high resolution DTM is available and information from the understory is redundant.
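
    The abstract states that differences in error rates were evaluated with a bootstrap method but does not describe the procedure. One common variant, shown here purely as an assumed sketch, is a paired percentile bootstrap over plots for the difference in RMSE between two data sources. The error values below are made up.

```python
import math
import random


def rmse(errors):
    """Root mean square of a list of prediction errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def bootstrap_rmse_diff(errors_a, errors_b, n_boot=2000, seed=0):
    """Percentile interval (about 95%) for RMSE(a) - RMSE(b), resampling plots with replacement."""
    rng = random.Random(seed)
    n = len(errors_a)
    diffs = []
    for _ in range(n_boot):
        # Use the same resampled plots for both data sources (paired design).
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(rmse([errors_a[i] for i in idx]) - rmse([errors_b[i] for i in idx]))
    diffs.sort()
    lower = diffs[int(0.025 * n_boot)]
    upper = diffs[int(0.975 * n_boot) - 1]
    return lower, upper


# Hypothetical plot-level volume errors for two data sources on the same plots.
errors_b = [3.0, -2.0, 4.0, -1.0, 2.5, -3.5, 1.5, -2.5]
errors_a = [2.0 * e for e in errors_b]  # source "a" is uniformly twice as far off
lower, upper = bootstrap_rmse_diff(errors_a, errors_b)
print(lower, upper)
```

    If the interval excludes zero, as it does by construction in this example, the difference in RMSE would be judged significant at roughly the 5% level.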

  1. A bioavailable strontium isoscape for Western Europe: A machine learning approach

    PubMed Central

    von Holstein, Isabella C. C.; Laffoon, Jason E.; Willmes, Malte; Liu, Xiao-Ming; Davies, Gareth R.

    2018-01-01

    Strontium isotope ratios (87Sr/86Sr) are gaining considerable interest as a geolocation tool and are now widely applied in archaeology, ecology, and forensic research. However, their application for provenance requires the development of baseline models predicting surficial 87Sr/86Sr variations (“isoscapes”). A variety of empirically-based and process-based models have been proposed to build terrestrial 87Sr/86Sr isoscapes but, in their current forms, those models are not mature enough to be integrated with continuous-probability surface models used in geographic assignment. In this study, we aim to overcome those limitations and to predict 87Sr/86Sr variations across Western Europe by combining process-based models and a series of remote-sensing geospatial products into a regression framework. We find that random forest regression significantly outperforms other commonly used regression and interpolation methods, and efficiently predicts the multi-scale patterning of 87Sr/86Sr variations by accounting for geological, geomorphological and atmospheric controls. Random forest regression also provides an easily interpretable and flexible framework to integrate different types of environmental auxiliary variables required to model the multi-scale patterning of 87Sr/86Sr variability. The method is transferable to different scales and resolutions and can be applied to the large collection of geospatial data available at local and global levels. The isoscape generated in this study provides the most accurate 87Sr/86Sr predictions in bioavailable strontium for Western Europe (R2 = 0.58 and RMSE = 0.0023) to date, as well as a conservative estimate of spatial uncertainty by applying quantile regression forest. We anticipate that the method presented in this study combined with the growing numbers of bioavailable 87Sr/86Sr data and satellite geospatial products will extend the applicability of the 87Sr/86Sr geo-profiling tool in provenance applications. PMID:29847595

  2. Cross-country transferability of multi-variable damage models

    NASA Astrophysics Data System (ADS)

    Wagenaar, Dennis; Lüdtke, Stefan; Kreibich, Heidi; Bouwer, Laurens

    2017-04-01

    Flood damage assessment is often done with simple damage curves based only on flood water depth. Additionally, damage models are often transferred in space and time, e.g. from region to region or from one flood event to another. Validation has shown that depth-damage curve estimates are associated with high uncertainties, particularly when applied in regions outside the area where the data for curve development was collected. Recently, progress has been made with multi-variable damage models created with data-mining techniques, i.e. Bayesian Networks and random forest. However, it is still unknown to what extent and under which conditions model transfers are possible and reliable. Model validations in different countries will provide valuable insights into the transferability of multi-variable damage models. In this study we compare multi-variable models developed on the basis of flood damage datasets from Germany as well as from the Netherlands. Data from several German floods was collected using computer-aided telephone interviews. Data from the 1993 Meuse flood in the Netherlands is available, based on compensations paid by the government. The Bayesian network and random forest based models are applied and validated in both countries on the basis of the individual datasets. A major challenge was the harmonization of the variables between both datasets due to factors like differences in variable definitions, and regional and temporal differences in flood hazard and exposure characteristics. Results of model validations and comparisons in both countries are discussed, particularly with respect to encountered challenges and possible solutions for an improvement of model transferability.

  3. Ensemble Methods for Classification of Physical Activities from Wrist Accelerometry.

    PubMed

    Chowdhury, Alok Kumar; Tjondronegoro, Dian; Chandran, Vinod; Trost, Stewart G

    2017-09-01

    To investigate whether the use of ensemble learning algorithms improves physical activity recognition accuracy compared to the single classifier algorithms, and to compare the classification accuracy achieved by three conventional ensemble machine learning methods (bagging, boosting, random forest) and a custom ensemble model comprising four algorithms commonly used for activity recognition (binary decision tree, k nearest neighbor, support vector machine, and neural network). The study used three independent data sets that included wrist-worn accelerometer data. For each data set, a four-step classification framework consisting of data preprocessing, feature extraction, normalization and feature selection, and classifier training and testing was implemented. For the custom ensemble, decisions from the single classifiers were aggregated using three decision fusion methods: weighted majority vote, naïve Bayes combination, and behavior knowledge space combination. Classifiers were cross-validated using leave-one-subject-out cross-validation and compared on the basis of average F1 scores. In all three data sets, ensemble learning methods consistently outperformed the individual classifiers. Among the conventional ensemble methods, random forest models provided consistently high activity recognition; however, the custom ensemble model using weighted majority voting demonstrated the highest classification accuracy in two of the three data sets. Combining multiple individual classifiers using conventional or custom ensemble learning methods can improve activity recognition accuracy from wrist-worn accelerometer data.
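
    Weighted majority voting, one of the three decision fusion methods named above, can be sketched in a few lines. The classifier weights and activity labels here are invented for illustration. The abstract does not say how the weights were derived, so treating them as per-classifier validation scores is an assumption.

```python
from collections import defaultdict


def weighted_majority_vote(predictions, weights):
    """Return the label with the largest total weight across classifiers."""
    totals = defaultdict(float)
    for name, label in predictions.items():
        totals[label] += weights[name]
    # Iterate labels in sorted order so that ties resolve deterministically.
    return max(sorted(totals), key=lambda label: totals[label])


# Hypothetical weights, e.g. each classifier's validation F1 score.
weights = {"tree": 0.70, "knn": 0.75, "svm": 0.90, "nn": 0.85}
predictions = {"tree": "walking", "knn": "walking", "svm": "running", "nn": "running"}
fused = weighted_majority_vote(predictions, weights)
print(fused)
```

    With a 2-2 split in raw votes, the two higher-weighted classifiers decide the outcome, which is the practical difference from an unweighted majority vote.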

  4. Random Forests for Global and Regional Crop Yield Predictions.

    PubMed

    Jeong, Jig Han; Resop, Jonathan P; Mueller, Nathaniel D; Fleisher, David H; Yun, Kyungdahm; Butler, Ethan E; Timlin, Dennis J; Shim, Kyo-Moon; Gerber, James S; Reddy, Vangimalla R; Kim, Soo-Hyung

    2016-01-01

    Accurate predictions of crop yield are critical for developing effective agricultural and food policies at the regional and global scales. We evaluated a machine-learning method, Random Forests (RF), for its ability to predict crop yield responses to climate and biophysical variables at global and regional scales in wheat, maize, and potato in comparison with multiple linear regressions (MLR) serving as a benchmark. We used crop yield data from various sources and regions for model training and testing: 1) gridded global wheat grain yield, 2) maize grain yield from US counties over thirty years, and 3) potato tuber and maize silage yield from the northeastern seaboard region. RF was found highly capable of predicting crop yields and outperformed MLR benchmarks in all performance statistics that were compared. For example, the root mean square errors (RMSE) ranged between 6 and 14% of the average observed yield with RF models in all test cases whereas these values ranged from 14% to 49% for MLR models. Our results show that RF is an effective and versatile machine-learning method for crop yield predictions at regional and global scales for its high accuracy and precision, ease of use, and utility in data analysis. RF may result in a loss of accuracy when predicting the extreme ends or responses beyond the boundaries of the training data.
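
    The abstract reports RMSE as a percentage of the average observed yield. A minimal sketch of that normalisation follows, using made-up yields. The exact formula the authors used is not given in the abstract, so dividing RMSE by the mean of the observations is an assumption.

```python
import math


def relative_rmse(observed, predicted):
    """RMSE divided by the mean observed value, expressed as a percentage."""
    n = len(observed)
    rmse = math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)
    return 100.0 * rmse / (sum(observed) / n)


# Hypothetical yields in t/ha.
observed = [4.0, 5.0, 6.0, 5.0]
predicted = [4.5, 4.5, 6.5, 4.5]
pct = relative_rmse(observed, predicted)
print(f"{pct:.1f}% of mean observed yield")
```

    Normalising by the mean makes errors comparable across crops and regions whose absolute yields differ widely.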

  5. Predicting changes in hypertension control using electronic health records from a chronic disease management program

    PubMed Central

    Sun, Jimeng; McNaughton, Candace D; Zhang, Ping; Perer, Adam; Gkoulalas-Divanis, Aris; Denny, Joshua C; Kirby, Jacqueline; Lasko, Thomas; Saip, Alexander; Malin, Bradley A

    2014-01-01

    Objective: Common chronic diseases such as hypertension are costly and difficult to manage. Our ultimate goal is to use data from electronic health records to predict the risk and timing of deterioration in hypertension control. Towards this goal, this work predicts the transition points at which hypertension is brought into, as well as pushed out of, control. Method: In a cohort of 1294 patients with hypertension enrolled in a chronic disease management program at the Vanderbilt University Medical Center, patients are modeled as an array of features derived from the clinical domain over time, which are distilled into a core set using an information gain criterion regarding their predictive performance. A model for transition point prediction was then computed using a random forest classifier. Results: The most predictive features for transitions in hypertension control status included hypertension assessment patterns, comorbid diagnoses, procedures and medication history. The final random forest model achieved a c-statistic of 0.836 (95% CI 0.830 to 0.842) and an accuracy of 0.773 (95% CI 0.766 to 0.780). Conclusions: This study achieved accurate prediction of transition points of hypertension control status, an important first step in the long-term goal of developing personalized hypertension management plans. PMID:24045907
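
    For a binary outcome, the c-statistic reported above is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, which equals the area under the ROC curve. A direct pairwise sketch follows. The labels and scores are invented and the function name is hypothetical.

```python
def c_statistic(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    concordant = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                concordant += 1.0
            elif p == q:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))


# Hypothetical outcomes (1 = transition occurred) and predicted probabilities.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.4]
c = c_statistic(labels, scores)
print(round(c, 3))
```

    The pairwise loop is quadratic in the number of cases. It is shown for clarity, and a rank-based computation would be preferable for large cohorts.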

  6. Predicting changes in hypertension control using electronic health records from a chronic disease management program.

    PubMed

    Sun, Jimeng; McNaughton, Candace D; Zhang, Ping; Perer, Adam; Gkoulalas-Divanis, Aris; Denny, Joshua C; Kirby, Jacqueline; Lasko, Thomas; Saip, Alexander; Malin, Bradley A

    2014-01-01

    Common chronic diseases such as hypertension are costly and difficult to manage. Our ultimate goal is to use data from electronic health records to predict the risk and timing of deterioration in hypertension control. Towards this goal, this work predicts the transition points at which hypertension is brought into, as well as pushed out of, control. In a cohort of 1294 patients with hypertension enrolled in a chronic disease management program at the Vanderbilt University Medical Center, patients are modeled as an array of features derived from the clinical domain over time, which are distilled into a core set using an information gain criterion regarding their predictive performance. A model for transition point prediction was then computed using a random forest classifier. The most predictive features for transitions in hypertension control status included hypertension assessment patterns, comorbid diagnoses, procedures and medication history. The final random forest model achieved a c-statistic of 0.836 (95% CI 0.830 to 0.842) and an accuracy of 0.773 (95% CI 0.766 to 0.780). This study achieved accurate prediction of transition points of hypertension control status, an important first step in the long-term goal of developing personalized hypertension management plans.

  7. Mortality risk prediction in burn injury: Comparison of logistic regression with machine learning approaches.

    PubMed

    Stylianou, Neophytos; Akbarov, Artur; Kontopantelis, Evangelos; Buchan, Iain; Dunn, Ken W

    2015-08-01

    Predicting mortality from burn injury has traditionally employed logistic regression models. Alternative machine learning methods have been introduced in some areas of clinical prediction as the necessary software and computational facilities have become accessible. Here we compare logistic regression and machine learning predictions of mortality from burn. An established logistic mortality model was compared to machine learning methods (artificial neural network, support vector machine, random forests and naïve Bayes) using a population-based (England & Wales) case-cohort registry. Predictive evaluation used: area under the receiver operating characteristic curve; sensitivity; specificity; positive predictive value and Youden's index. All methods had comparable discriminatory abilities, similar sensitivities, specificities and positive predictive values. Although some machine learning methods performed marginally better than logistic regression the differences were seldom statistically significant and clinically insubstantial. Random forests were marginally better for high positive predictive value and reasonable sensitivity. Neural networks yielded slightly better prediction overall. Logistic regression gives an optimal mix of performance and interpretability. The established logistic regression model of burn mortality performs well against more complex alternatives. Clinical prediction with a small set of strong, stable, independent predictors is unlikely to gain much from machine learning outside specialist research contexts. Copyright © 2015 Elsevier Ltd and ISBI. All rights reserved.
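
    Four of the evaluation measures listed above (sensitivity, specificity, positive predictive value and Youden's index) follow directly from a 2x2 confusion table. The counts below are invented for illustration and are not data from the study.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute standard diagnostic measures from confusion-table counts."""
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    ppv = tp / (tp + fp)          # positive predictive value
    youden = sensitivity + specificity - 1.0  # Youden's J
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "youden": youden,
    }


# Hypothetical counts: 50 deaths (40 predicted), 200 survivors (180 predicted).
m = diagnostic_metrics(tp=40, fp=20, tn=180, fn=10)
print(m)
```

    Youden's index combines sensitivity and specificity into one number between -1 and 1, with 0 meaning no better than chance.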

  8. Quantitative prediction of oral cancer risk in patients with oral leukoplakia.

    PubMed

    Liu, Yao; Li, Yicheng; Fu, Yue; Liu, Tong; Liu, Xiaoyong; Zhang, Xinyan; Fu, Jie; Guan, Xiaobing; Chen, Tong; Chen, Xiaoxin; Sun, Zheng

    2017-07-11

    Exfoliative cytology has been widely used for early diagnosis of oral squamous cell carcinoma. We have developed an oral cancer risk index using DNA index value to quantitatively assess cancer risk in patients with oral leukoplakia, but with limited success. In order to improve the performance of the risk index, we collected exfoliative cytology, histopathology, and clinical follow-up data from two independent cohorts of normal, leukoplakia and cancer subjects (training set and validation set). Peaks were defined on the basis of first derivatives with positives, and modern machine learning techniques were utilized to build statistical prediction models on the reconstructed data. Random forest was found to be the best model with high sensitivity (100%) and specificity (99.2%). Using the Peaks-Random Forest model, we constructed an index (OCRI2) as a quantitative measurement of cancer risk. Among 11 leukoplakia patients with an OCRI2 over 0.5, 4 (36.4%) developed cancer during follow-up (23 ± 20 months), whereas 3 (5.3%) of 57 leukoplakia patients with an OCRI2 less than 0.5 developed cancer (32 ± 31 months). OCRI2 is better than other methods in predicting oral squamous cell carcinoma during follow-up. In conclusion, we have developed an exfoliative cytology-based method for quantitative prediction of cancer risk in patients with oral leukoplakia.
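
    The follow-up percentages quoted above can be checked from the stated counts. This is plain arithmetic on the figures in the abstract, and the helper name is hypothetical.

```python
def percent(events, total):
    """Proportion of events as a percentage, rounded to one decimal place."""
    return round(100.0 * events / total, 1)


# Counts as stated in the abstract.
high_risk = percent(4, 11)  # OCRI2 over 0.5
low_risk = percent(3, 57)   # OCRI2 less than 0.5
print(high_risk, low_risk)
```

    Both values agree with the percentages reported in the abstract.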

  9. Predicting fecal indicator organism contamination in Oregon coastal streams.

    PubMed

    Pettus, Paul; Foster, Eugene; Pan, Yangdong

    2015-12-01

    In this study, we used publicly available GIS layers and statistical tree-based modeling (CART and Random Forest) to predict pathogen indicator counts at a regional scale using 88 spatially explicit landscape predictors and 6657 samples from non-estuarine streams in the Oregon Coast Range. A total of 532 frequently sampled sites were pared down to 93 pathogen sampling sites to control for spatial and temporal biases. The model explained 56.5% of the variance, which was comparable to other regional models, while still including a large number of variables. Analysis showed the most important predictors of bacteria counts to be forest and natural riparian zones, cattle-related activities, and urban land uses. This research confirmed linkages to anthropogenic activities, with the prediction mapping showing increased bacteria counts in agricultural and urban land use areas and lower counts with more natural riparian conditions. Copyright © 2015 Elsevier Ltd. All rights reserved.

  10. Hand pose estimation in depth image using CNN and random forest

    NASA Astrophysics Data System (ADS)

    Chen, Xi; Cao, Zhiguo; Xiao, Yang; Fang, Zhiwen

    2018-03-01

    Thanks to the availability of low-cost depth cameras such as the Microsoft Kinect, 3D hand pose estimation has attracted special research attention in recent years. Owing to the large variations in the hand's viewpoint and the high dimensionality of hand motion, 3D hand pose estimation remains challenging. In this paper we propose a two-stage framework that combines a CNN and a Random Forest to boost the performance of hand pose estimation. First, we use a standard Convolutional Neural Network (CNN) to regress the hand joints' locations. Second, we use a Random Forest to refine the joints from the first stage. In the second stage, we propose a pyramid feature that merges the information flow of the CNN. Specifically, we take the rough joint locations from the first stage and rotate the convolutional feature maps (and image). Then, for each joint, we map its location onto each feature map (and image), crop features around that location in each feature map (and image), and pass the extracted features to the Random Forest for refinement. Experimentally, we evaluate the proposed method on the ICVL dataset and obtain a mean error of about 11 mm; the method also runs in real time on a desktop.

  11. Accurate Segmentation of CT Male Pelvic Organs via Regression-based Deformable Models and Multi-task Random Forests

    PubMed Central

    Gao, Yaozong; Shao, Yeqin; Lian, Jun; Wang, Andrew Z.; Chen, Ronald C.

    2016-01-01

    Segmenting male pelvic organs from CT images is a prerequisite for prostate cancer radiotherapy. The efficacy of radiation treatment highly depends on segmentation accuracy. However, accurate segmentation of male pelvic organs is challenging due to low tissue contrast of CT images, as well as large variations of shape and appearance of the pelvic organs. Among existing segmentation methods, deformable models are the most popular, as shape prior can be easily incorporated to regularize the segmentation. Nonetheless, the sensitivity to initialization often limits their performance, especially for segmenting organs with large shape variations. In this paper, we propose a novel approach to guide deformable models, thus making them robust against arbitrary initializations. Specifically, we learn a displacement regressor, which predicts 3D displacement from any image voxel to the target organ boundary based on the local patch appearance. This regressor provides a nonlocal external force for each vertex of deformable model, thus overcoming the initialization problem suffered by the traditional deformable models. To learn a reliable displacement regressor, two strategies are particularly proposed. 1) A multi-task random forest is proposed to learn the displacement regressor jointly with the organ classifier; 2) an auto-context model is used to iteratively enforce structural information during voxel-wise prediction. Extensive experiments on 313 planning CT scans of 313 patients show that our method achieves better results than alternative classification or regression based methods, and also several other existing methods in CT pelvic organ segmentation. PMID:26800531

  12. Comparison of modeling methods to predict the spatial distribution of deep-sea coral and sponge in the Gulf of Alaska

    NASA Astrophysics Data System (ADS)

    Rooper, Christopher N.; Zimmermann, Mark; Prescott, Megan M.

    2017-08-01

    Deep-sea coral and sponge ecosystems are widespread throughout most of Alaska's marine waters, and are associated with many different species of fishes and invertebrates. These ecosystems are vulnerable to the effects of commercial fishing activities and climate change. We compared four commonly used species distribution models (general linear models, generalized additive models, boosted regression trees and random forest models) and an ensemble model to predict the presence or absence and abundance of six groups of benthic invertebrate taxa in the Gulf of Alaska. All four model types performed adequately on training data for predicting presence and absence, with regression forest models having the best overall performance measured by the area under the receiver-operating-curve (AUC). The models also performed well on the test data for presence and absence with average AUCs ranging from 0.66 to 0.82. For the test data, ensemble models performed the best. For abundance data, there was an obvious demarcation in performance between the two regression-based methods (general linear models and generalized additive models), and the tree-based models. The boosted regression tree and random forest models out-performed the other models by a wide margin on both the training and testing data. However, there was a significant drop-off in performance for all models of invertebrate abundance ( 50%) when moving from the training data to the testing data. Ensemble model performance was between the tree-based and regression-based methods. The maps of predictions from the models for both presence and abundance agreed very well across model types, with an increase in variability in predictions for the abundance data. 
We conclude that where data conforms well to the modeled distribution (such as the presence-absence data and binomial distribution in this study), the four types of models will provide similar results, although the regression-type models may be more consistent with biological theory. For data with highly zero-inflated distributions and non-normal distributions such as the abundance data from this study, the tree-based methods performed better. Ensemble models that averaged predictions across the four model types, performed better than the GLM or GAM models but slightly poorer than the tree-based methods, suggesting ensemble models might be more robust to overfitting than tree methods, while mitigating some of the disadvantages in predictive performance of regression methods.
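
    The ensemble described above averages predictions across the four model types. A minimal equal-weight sketch follows. The predicted probabilities are made up, and equal weighting is an assumption, since the abstract does not state how the models were weighted.

```python
def ensemble_average(model_predictions):
    """Average per-site predictions across models using equal weights."""
    models = list(model_predictions.values())
    n_sites = len(models[0])
    if any(len(p) != n_sites for p in models):
        raise ValueError("all models must predict the same number of sites")
    return [sum(p[i] for p in models) / len(models) for i in range(n_sites)]


# Hypothetical predicted probabilities of presence at three sites.
preds = {
    "glm": [0.20, 0.60, 0.90],
    "gam": [0.30, 0.50, 0.80],
    "brt": [0.10, 0.70, 0.95],
    "rf": [0.20, 0.60, 0.95],
}
ensemble = ensemble_average(preds)
print([round(p, 3) for p in ensemble])
```

    Averaging pulls each site's prediction toward the consensus of the models, which is one plausible reason an ensemble can sit between the tree-based and regression-based methods in performance.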

  13. Recent drought conditions in the Conterminous United States

    Treesearch

    Frank H. Koch; William D. Smith; John W. Coulston

    2013-01-01

    Droughts are common in virtually all U.S. forests, but their frequency and intensity vary widely both between and within forest ecosystems (Hanson and Weltzin 2000). Forests in the Western United States generally exhibit a pattern of annual seasonal droughts. Forests in the Eastern United States tend to exhibit one of two prevailing patterns: random occasional droughts...

  14. Stratifying to reduce bias caused by high nonresponse rates: A case study from New Mexico’s forest inventory

    Treesearch

    Sara A. Goeking; Paul L. Patterson

    2013-01-01

    The USDA Forest Service’s Forest Inventory and Analysis (FIA) Program applies specific sampling and analysis procedures to estimate a variety of forest attributes. FIA’s Interior West region uses post-stratification, where strata consist of forest/nonforest polygons based on MODIS imagery, and assumes that nonresponse plots are distributed at random across each stratum...

  15. RF-Phos: A Novel General Phosphorylation Site Prediction Tool Based on Random Forest.

    PubMed

    Ismail, Hamid D; Jones, Ahoi; Kim, Jung H; Newman, Robert H; Kc, Dukka B

    2016-01-01

    Protein phosphorylation is one of the most widespread regulatory mechanisms in eukaryotes. Over the past decade, phosphorylation site prediction has emerged as an important problem in the field of bioinformatics. Here, we report a new method, termed Random Forest-based Phosphosite predictor 2.0 (RF-Phos 2.0), to predict phosphorylation sites given only the primary amino acid sequence of a protein as input. RF-Phos 2.0, which uses random forest with sequence and structural features, is able to identify putative sites of phosphorylation across many protein families. In side-by-side comparisons based on 10-fold cross validation and an independent dataset, RF-Phos 2.0 compares favorably to other popular mammalian phosphosite prediction methods, such as PhosphoSVM, GPS2.1, and Musite.

  16. Different relationships between temporal phylogenetic turnover and phylogenetic similarity and in two forests were detected by a new null model.

    PubMed

    Huang, Jian-Xiong; Zhang, Jian; Shen, Yong; Lian, Ju-yu; Cao, Hong-lin; Ye, Wan-hui; Wu, Lin-fang; Bin, Yue

    2014-01-01

    Ecologists have been monitoring community dynamics with the purpose of understanding the rates and causes of community change. However, there is a lack of monitoring of community dynamics from the perspective of phylogeny. We attempted to understand temporal phylogenetic turnover in a 50 ha tropical forest (Barro Colorado Island, BCI) and a 20 ha subtropical forest (Dinghushan in southern China, DHS). To obtain temporal phylogenetic turnover under random conditions, two null models were used. The first shuffled names of species that are widely used in community phylogenetic analyses. The second simulated demographic processes with careful consideration on the variation in dispersal ability among species and the variations in mortality both among species and among size classes. With the two models, we tested the relationships between temporal phylogenetic turnover and phylogenetic similarity at different spatial scales in the two forests. Results were more consistent with previous findings using the second null model suggesting that the second null model is more appropriate for our purposes. With the second null model, a significantly positive relationship was detected between phylogenetic turnover and phylogenetic similarity in BCI at a 10 m×10 m scale, potentially indicating phylogenetic density dependence. This relationship in DHS was significantly negative at three of five spatial scales. This could indicate abiotic filtering processes for community assembly. Using variation partitioning, we found phylogenetic similarity contributed to variation in temporal phylogenetic turnover in the DHS plot but not in BCI plot. The mechanisms for community assembly in BCI and DHS vary from phylogenetic perspective. Only the second null model detected this difference indicating the importance of choosing a proper null model.

  17. Comparison of Stem Map Developed from Crown Geometry Allometry Linked Census Data to Airborne and Terrestrial Lidar at Harvard Forest, MA

    NASA Astrophysics Data System (ADS)

    Sullivan, F.; Palace, M. W.; Ducey, M. J.; David, O.; Cook, B. D.; Lepine, L. C.

    2014-12-01

    Harvard Forest in Petersham, MA, USA is the location of one of the temperate forest plots established by the Center for Tropical Forest Science (CTFS) as a joint effort with Harvard Forest and the Smithsonian Institute's Forest Global Earth Observatory (ForestGEO) to characterize ecosystem processes and forest dynamics. Census of a 35 ha plot on Prospect Hill was completed during the winter of 2014 by researchers at Harvard Forest. Census data were collected according to CTFS protocol; measured variables included species, stem diameter, and relative X-Y locations. Airborne lidar data were collected over the censused plot using the high spatial resolution Goddard LiDAR, Hyperspectral, and Thermal sensor package (G-LiHT) during June 2012. As part of a separate study, 39 variable radius plots (VRPs) were randomly located and sampled within and throughout the Prospect Hill CTFS/ForestGEO plot during September and October 2013. On VRPs, biometric properties of trees were sampled, including species, stem diameter, total height, crown base height, crown radii, and relative location to plot centers using a 20 Basal Area Factor prism. In addition, a terrestrial-based lidar scanner was used to collect one lidar scan at plot center for 38 of the 39 VRPs. Leveraging allometric equations of crown geometry and tree height developed from 374 trees and 16 different species sampled on 39 VRPs, a 3-dimensional stem map will be created using the Harvard Forest ForestGEO Prospect Hill census. Vertical and horizontal structure of 3d field-based stem maps will be compared to terrestrial and airborne lidar scan data. Furthermore, to assess the quality of allometric equations, a 2d canopy height raster of the field-based stem map will be compared to a G-LiHT derived canopy height model for the 35 ha census plot. Our automated crown delineation methods will be applied to the 2d representation of the census stem map and the G-LiHT canopy height model. 
For future work related to this study, high quality field-based stem maps with species and crown geometry information will allow for better comparisons and interpretations of individual tree spectra from the G-LiHT hyperspectral sensor as estimated by automated crown delineation of the G-LiHT lidar canopy height model.

  18. QUANTIFYING FOREST ABOVEGROUND CARBON POOLS AND FLUXES USING MULTI-TEMPORAL LIDAR A report on field monitoring, remote sensing MMV, GIS integration, and modeling results for forestry field validation test to quantify aboveground tree biomass and carbon

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lee Spangler; Lee A. Vierling; Eva K. Stand

    2012-04-01

    Sound policy recommendations relating to the role of forest management in mitigating atmospheric carbon dioxide (CO2) depend upon establishing accurate methodologies for quantifying forest carbon pools for large tracts of land that can be dynamically updated over time. Light Detection and Ranging (LiDAR) remote sensing is a promising technology for achieving accurate estimates of aboveground biomass and thereby carbon pools; however, not much is known about the accuracy of estimating biomass change and carbon flux from repeat LiDAR acquisitions containing different data sampling characteristics. In this study, discrete return airborne LiDAR data were collected in 2003 and 2009 across approximately 20,000 hectares (ha) of an actively managed, mixed conifer forest landscape in northern Idaho, USA. Forest inventory plots, laid out via a random stratified sampling design, were established and sampled in 2003 and 2009. The Random Forest machine learning algorithm was used to establish statistical relationships between inventory data and forest structural metrics derived from the LiDAR acquisitions. Aboveground biomass maps were created for the study area based on statistical relationships developed at the plot level. Over this 6-year period, we found that the mean increase in biomass due to forest growth across the non-harvested portions of the study area was 4.8 metric ton/hectare (Mg/ha). In these non-harvested areas, we found a significant difference in biomass increase among forest successional stages, with a higher biomass increase in mature and old forest compared to stand initiation and young forest. Approximately 20% of the landscape had been disturbed by harvest activities during the six-year time period, representing a biomass loss of >70 Mg/ha in these areas. During the study period, these harvest activities outweighed growth at the landscape scale, resulting in an overall loss in aboveground carbon at this site. 
The 30-fold increase in sampling density between the 2003 and 2009 acquisitions did not affect the biomass estimates. Overall, LiDAR data coupled with field reference data offer a powerful method for calculating pools and changes in aboveground carbon in forested systems. The results of our study suggest that multitemporal LiDAR-based approaches are likely to be useful for high quality estimates of aboveground carbon change in conifer forest systems.
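
    The statement that harvest outweighed growth at the landscape scale can be checked with a back-of-envelope, area-weighted calculation. The sketch below is not a figure from the report. It assumes that the remaining 80% of the landscape was non-harvested and grew at the reported mean rate, and it treats the reported ">70 Mg/ha" loss as exactly 70 Mg/ha, so the result is only a rough bound. The function name is hypothetical.

```python
def net_biomass_change(harvested_fraction, growth_per_ha, harvest_loss_per_ha):
    """Area-weighted mean change in aboveground biomass (Mg/ha)."""
    unharvested_fraction = 1.0 - harvested_fraction
    gain = unharvested_fraction * growth_per_ha
    loss = harvested_fraction * harvest_loss_per_ha
    return gain - loss


# Inputs taken from the abstract; 70.0 stands in for the reported ">70".
net_change = net_biomass_change(
    harvested_fraction=0.20,
    growth_per_ha=4.8,
    harvest_loss_per_ha=70.0,
)
```

    Under these assumptions the landscape-level mean change is negative (about -10 Mg/ha), which is consistent with the overall loss the abstract reports. Because the true loss per harvested hectare exceeded 70 Mg/ha, the actual net loss would be at least this large.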

  19. Empirical analyses of plant-climate relationships for the western United States

    Treesearch

    Gerald E. Rehfeldt; Nicholas L. Crookston; Marcus V. Warwell; Jeffrey S. Evans

    2006-01-01

    The Random Forests multiple-regression tree was used to model climate profiles of 25 biotic communities of the western United States and nine of their constituent species. Analyses of the communities were based on a gridded sample of ca. 140,000 points, while those for the species used presence-absence data from ca. 120,000 locations. Independent variables included 35...

  20. Kalman filter for statistical monitoring of forest cover across sub-continental regions

    Treesearch

    Raymond L. Czaplewski

    1991-01-01

    The Kalman filter is a multivariate generalization of the composite estimator which recursively combines a current direct estimate with a past estimate that is updated for expected change over time with a prediction model. The Kalman filter can estimate proportions of different cover types for sub-continental regions each year. A random sample of high-resolution...
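
    The recursion described in this abstract can be sketched for a single cover-type proportion. This is a scalar simplification of the multivariate filter, and every name and number below (kalman_step, the variances, the expected change) is a hypothetical illustration rather than anything taken from the report.

```python
def kalman_step(prev_est, prev_var, expected_change, process_var,
                direct_est, direct_var):
    """One scalar Kalman filter cycle: predict, then update.

    prev_est, prev_var   -- last year's estimate and its variance
    expected_change      -- change predicted by the model since last year
    process_var          -- variance added by the prediction model
    direct_est, direct_var -- this year's direct (sample-based) estimate
    """
    # Predict: carry the past estimate forward with the prediction model.
    pred_est = prev_est + expected_change
    pred_var = prev_var + process_var

    # Update: weight the direct estimate by the relative uncertainties.
    gain = pred_var / (pred_var + direct_var)
    new_est = pred_est + gain * (direct_est - pred_est)
    new_var = (1.0 - gain) * pred_var
    return new_est, new_var


# Hypothetical forest-cover proportion for one region and year.
estimate, variance = kalman_step(
    prev_est=0.60, prev_var=0.0004,
    expected_change=-0.01, process_var=0.0001,
    direct_est=0.57, direct_var=0.0005,
)
```

    With equal predicted and direct variances the gain is 0.5, so the combined estimate falls halfway between the prediction (0.59) and the direct estimate (0.57), and its variance is half of either input. This is the composite-estimator behavior that the abstract says the filter generalizes.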

  1. Simulation of Long-Term Landscape-Level Fuel Treatment Effects on Large Wildfires

    Treesearch

    Mark A. Finney; Rob C. Seli; Charles W. McHugh; Alan A. Ager; Berni Bahro; James K. Agee

    2006-01-01

    A simulation system was developed to explore how fuel treatments placed in random and optimal spatial patterns affect the growth and behavior of large fires when implemented at different rates over the course of five decades. The system consists of a forest/fuel dynamics simulation module (FVS), logic for deriving fuel model dynamics from FVS output, a spatial fuel...

  2. Incorporating landscape fuel treatment modeling into the Forest Vegetation Simulator

    Treesearch

    Robert C. Seli; Alan A. Ager; Nicholas L. Crookston; Mark A. Finney; Berni Bahro; James K. Agee; Charles W. McHugh

    2008-01-01

    A simulation system was developed to explore how fuel treatments placed in random and optimal spatial patterns affect the growth and behavior of large fires when implemented at different rates over the course of five decades. The system consists of several command line programs linked together: (1) FVS with the Parallel Processor (PPE) and Fire and Fuels (FFE)...

  3. Source identification of western Oregon Douglas-fir wood cores using mass spectrometry and random forest classification.

    PubMed

    Finch, Kristen; Espinoza, Edgard; Jones, F Andrew; Cronn, Richard

    2017-05-01

    We investigated whether wood metabolite profiles from direct analysis in real time (time-of-flight) mass spectrometry (DART-TOFMS) could be used to determine the geographic origin of Douglas-fir wood cores originating from two regions in western Oregon, USA. Three annual ring mass spectra were obtained from 188 adult Douglas-fir trees, and these were analyzed using random forest models to determine whether samples could be classified to geographic origin, growth year, or growth year and geographic origin. Specific wood molecules that contributed to geographic discrimination were identified. Douglas-fir mass spectra could be differentiated into two geographic classes with an accuracy between 70% and 76%. Classification models could not accurately classify sample mass spectra based on growth year. Thirty-two molecules were identified as key for classifying western Oregon Douglas-fir wood cores to geographic origin. DART-TOFMS is capable of detecting minute but regionally informative differences in wood molecules over a small geographic scale, and these differences made it possible to predict the geographic origin of Douglas-fir wood with moderate accuracy. Studies involving DART-TOFMS, alone and in combination with other technologies, will be relevant for identifying the geographic origin of illegally harvested wood.

  4. Random Forests Are Able to Identify Differences in Clotting Dynamics from Kinetic Models of Thrombin Generation.

    PubMed

    Arumugam, Jayavel; Bukkapatnam, Satish T S; Narayanan, Krishna R; Srinivasa, Arun R

    2016-01-01

    Current methods for distinguishing acute coronary syndromes such as heart attack from stable coronary artery disease, based on the kinetics of thrombin formation, have been limited to evaluating sensitivity of well-established chemical species (e.g., thrombin) using simple quantifiers of their concentration profiles (e.g., maximum level of thrombin concentration, area under the thrombin concentration versus time curve). In order to get an improved classifier, we use a 34-protein factor clotting cascade model and convert the simulation data into a high-dimensional representation (about 19000 features) using a piecewise cubic polynomial fit. Then, we systematically find plausible assays to effectively gauge changes in acute coronary syndrome/coronary artery disease populations by introducing a statistical learning technique called Random Forests. We find that differences associated with acute coronary syndromes emerge in combinations of a handful of features. For instance, concentrations of 3 chemical species, namely, active alpha-thrombin, tissue factor-factor VIIa-factor Xa ternary complex, and intrinsic tenase complex with factor X, at specific time windows, could be used to classify acute coronary syndromes to an accuracy of about 87.2%. Such a combination could be used to efficiently assay the coagulation system.

  5. Preliminary investigation of human exhaled breath for tuberculosis diagnosis by multidimensional gas chromatography - Time of flight mass spectrometry and machine learning.

    PubMed

    Beccaria, Marco; Mellors, Theodore R; Petion, Jacky S; Rees, Christiaan A; Nasir, Mavra; Systrom, Hannah K; Sairistil, Jean W; Jean-Juste, Marc-Antoine; Rivera, Vanessa; Lavoile, Kerline; Severe, Patrice; Pape, Jean W; Wright, Peter F; Hill, Jane E

    2018-02-01

    Tuberculosis (TB) remains a global public health malady that claims almost 1.8 million lives annually. Diagnosis of TB represents perhaps one of the most challenging aspects of tuberculosis control. Gold standards for diagnosis of active TB (culture and nucleic acid amplification) are sputum-dependent; however, in up to a third of TB cases, an adequate biological sputum sample is not readily available. The analysis of exhaled breath, as an alternative to sputum-dependent tests, has the potential to provide a simple, fast, non-invasive, and readily available diagnostic service that could positively change TB detection. Human breath has been evaluated in the setting of active tuberculosis using thermal desorption-comprehensive two-dimensional gas chromatography-time of flight mass spectrometry methodology. From the entire spectrum of volatile metabolites in breath, three random forest machine learning models were applied leading to the generation of a panel of 46 breath features. The twenty-two common features within each random forest model used were selected as a set that could distinguish subjects with confirmed pulmonary M. tuberculosis infection and people with other pathologies than TB.

  6. Texture analysis of common renal masses in multiple MR sequences for prediction of pathology

    NASA Astrophysics Data System (ADS)

    Hoang, Uyen N.; Malayeri, Ashkan A.; Lay, Nathan S.; Summers, Ronald M.; Yao, Jianhua

    2017-03-01

    This pilot study performs texture analysis on multiple magnetic resonance (MR) images of common renal masses for differentiation of renal cell carcinoma (RCC). Bounding boxes are drawn around each mass on one axial slice in T1 delayed sequence to use for feature extraction and classification. All sequences (T1 delayed, venous, arterial, pre-contrast phases, T2, and T2 fat saturated sequences) are co-registered and texture features are extracted from each sequence simultaneously. Random forest is used to construct models to classify lesions on 96 normal regions, 87 clear cell RCCs, 8 papillary RCCs, and 21 renal oncocytomas; ground truths are verified through pathology reports. The highest performance is seen in random forest model when data from all sequences are used in conjunction, achieving an overall classification accuracy of 83.7%. When using data from one single sequence, the overall accuracies achieved for T1 delayed, venous, arterial, and pre-contrast phase, T2, and T2 fat saturated were 79.1%, 70.5%, 56.2%, 61.0%, 60.0%, and 44.8%, respectively. This demonstrates promising results of utilizing intensity information from multiple MR sequences for accurate classification of renal masses.

  7. Random-Forest Classification of High-Resolution Remote Sensing Images and Ndsm Over Urban Areas

    NASA Astrophysics Data System (ADS)

    Sun, X. F.; Lin, X. G.

    2017-09-01

    As an intermediate step between raw remote sensing data and digital urban maps, remote sensing data classification has been a challenging and long-standing research problem in the community of remote sensing. In this work, an effective classification method is proposed for classifying high-resolution remote sensing data over urban areas. Starting from high resolution multi-spectral images and 3D geometry data, our method proceeds in three main stages: feature extraction, classification, and classified result refinement. First, we extract color, vegetation index and texture features from the multi-spectral image and compute the height, elevation texture and differential morphological profile (DMP) features from the 3D geometry data. Then in the classification stage, multiple random forest (RF) classifiers are trained separately, then combined to form a RF ensemble to estimate each sample's category probabilities. Finally the probabilities along with the feature importance indicator outputted by RF ensemble are used to construct a fully connected conditional random field (FCCRF) graph model, by which the classification results are refined through mean-field based statistical inference. Experiments on the ISPRS Semantic Labeling Contest dataset show that our proposed 3-stage method achieves 86.9% overall accuracy on the test data.

  8. Integrating geographic information systems and remote sensing with spatial econometric and mixed logit models for environmental valuation

    NASA Astrophysics Data System (ADS)

    Wells, Aaron Raymond

    This research focuses on the Emory and Obed Watersheds in the Cumberland Plateau in Central Tennessee and the Lower Hatchie River Watershed in West Tennessee. A framework based on market and nonmarket valuation techniques was used to empirically estimate economic values for environmental amenities and negative externalities in these areas. The specific techniques employed include a variation of hedonic pricing and discrete choice conjoint analysis (i.e., choice modeling), in addition to geographic information systems (GIS) and remote sensing. Microeconomic models of agent behavior, including random utility theory and profit maximization, provide the principal theoretical foundation linking valuation techniques and econometric models. The generalized method of moments estimator for a first-order spatial autoregressive function and mixed logit models are the principal econometric methods applied within the framework. The dissertation is subdivided into three separate chapters written in a manuscript format. The first chapter provides the necessary theoretical and mathematical conditions that must be satisfied in order for a forest amenity enhancement program to be implemented. These conditions include utility, value, and profit maximization. The second chapter evaluates the effect of forest land cover and information about future land use change on respondent preferences and willingness to pay for alternative hypothetical forest amenity enhancement options. Land use change information and the amount of forest land cover significantly influenced respondent preferences, choices, and stated willingness to pay. Hicksian welfare estimates for proposed enhancement options ranged from 57.42 to 25.53, depending on the policy specification, information level, and econometric model. The third chapter presents economic values for negative externalities associated with channelization that affect the productivity and overall market value of forested wetlands. 
Results of robust, generalized moments estimation of a double logarithmic first-order spatial autoregressive error model (inverse distance weights with spatial dependence up to 1500m) indicate that the implicit cost of damages to forested wetlands caused by channelization equaled -$5,438 ha-1. Collectively, the results of this dissertation provide economic measures of the damages to and benefits of environmental assets, help private landowners and policy makers identify the amenity attributes preferred by the public, and improve the management of natural resources.

  9. Sentinel node status prediction by four statistical models: results from a large bi-institutional series (n = 1132).

    PubMed

    Mocellin, Simone; Thompson, John F; Pasquali, Sandro; Montesco, Maria C; Pilati, Pierluigi; Nitti, Donato; Saw, Robyn P; Scolyer, Richard A; Stretch, Jonathan R; Rossi, Carlo R

    2009-12-01

    To improve selection for sentinel node (SN) biopsy (SNB) in patients with cutaneous melanoma using statistical models predicting SN status. About 80% of patients currently undergoing SNB are node negative. In the absence of conclusive evidence of a SNB-associated survival benefit, these patients may be over-treated. Here, we tested the efficiency of 4 different models in predicting SN status. The clinicopathologic data (age, gender, tumor thickness, Clark level, regression, ulceration, histologic subtype, and mitotic index) of 1132 melanoma patients who had undergone SNB at institutions in Italy and Australia were analyzed. Logistic regression, classification tree, random forest, and support vector machine models were fitted to the data. The predictive models were built with the aim of maximizing the negative predictive value (NPV) and reducing the rate of SNB procedures while minimizing the error rate. After cross-validation, logistic regression, classification tree, random forest, and support vector machine predictive models obtained clinically relevant NPV (93.6%, 94.0%, 97.1%, and 93.0%, respectively), SNB reduction (27.5%, 29.8%, 18.2%, and 30.1%, respectively), and error rates (1.8%, 1.8%, 0.5%, and 2.1%, respectively). Using commonly available clinicopathologic variables, predictive models can preoperatively identify a proportion of patients (approximately 25%) who might be spared SNB, with an acceptable (1%-2%) error. If validated in large prospective series, these models might be implemented in the clinical setting for improved patient selection, which ultimately would lead to better quality of life for patients and optimization of resource allocation for the health care system.
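
    The three reported measures can be related with a short worked check. The abstract does not define its error rate, so the sketch below assumes one reading: SNB reduction is the share of all patients predicted node negative, and the error rate is the share of all patients who are predicted negative but are in fact node positive. Under that assumed reading, error rate = SNB reduction x (1 - NPV). The function names are hypothetical.

```python
def negative_predictive_value(true_negatives, false_negatives):
    """Share of predicted-negative patients who are truly node negative."""
    return true_negatives / (true_negatives + false_negatives)


def implied_error_rate(npv, snb_reduction):
    """Share of all patients wrongly spared SNB, under the assumed reading."""
    return snb_reduction * (1.0 - npv)


# (NPV, SNB reduction, reported error rate) as given in the abstract.
reported = {
    "logistic regression": (0.936, 0.275, 0.018),
    "classification tree": (0.940, 0.298, 0.018),
    "random forest": (0.971, 0.182, 0.005),
    "support vector machine": (0.930, 0.301, 0.021),
}

implied = {
    model: implied_error_rate(npv, reduction)
    for model, (npv, reduction, _) in reported.items()
}
```

    For all four models the implied value agrees with the reported error rate to within rounding (for example 0.182 x 0.029 is about 0.005 for random forest), which suggests, but does not confirm, that this is how the authors defined the error rate.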

  10. Producing landslide susceptibility maps by utilizing machine learning methods. The case of Finikas catchment basin, North Peloponnese, Greece.

    NASA Astrophysics Data System (ADS)

    Tsangaratos, Paraskevas; Ilia, Ioanna; Loupasakis, Constantinos; Papadakis, Michalis; Karimalis, Antonios

    2017-04-01

    The main objective of the present study was to apply two machine learning methods for the production of a landslide susceptibility map in the Finikas catchment basin, located in North Peloponnese, Greece, and to compare their results. Specifically, Logistic Regression and Random Forest were utilized, based on a database of 40 sites classified into two categories, non-landslide and landslide areas, which were separated into a training dataset (70% of the total data) and a validation dataset (remaining 30%). The identification of the areas was established by analyzing airborne imagery, extensive field investigation and the examination of previous research studies. Six landslide related variables were analyzed, namely: lithology, elevation, slope, aspect, distance to rivers and distance to faults. Within the Finikas catchment basin most of the reported landslides were located along the road network and within the residential complexes, classified as rotational and translational slides, and rockfalls, mainly caused by the physical conditions and the general geotechnical behavior of the geological formations that cover the area. Each landslide susceptibility map was reclassified by applying the Geometric Interval classification technique into five classes, namely: very low susceptibility, low susceptibility, moderate susceptibility, high susceptibility, and very high susceptibility. The comparison and validation of the outcomes of each model were achieved using statistical evaluation measures, the receiver operating characteristic and the area under the success and predictive rate curves. The computation process was carried out using RStudio, an integrated development environment for the R language, and ArcGIS 10.1 for compiling the data and producing the landslide susceptibility maps. From the outcomes of the Logistic Regression analysis it was inferred that the highest b coefficients are allocated to lithology and slope, which were 2.8423 and 1.5841, respectively. 
From the estimation of the mean decrease in Gini coefficient performed during the application of Random Forest and the mean decrease in accuracy, the most important variable is slope followed by lithology, aspect, elevation, distance from river network, and distance from faults, while the most used variables during the training phase were the variable aspect (21.45%), slope (20.53%) and lithology (19.84%). The outcomes of the analysis are consistent with previous studies concerning the area of research, which have indicated the high influence of lithology and slope in the manifestation of landslides. A high percentage of landslide occurrence has been observed in Plio-Pleistocene sediments, flysch formations, and Cretaceous limestone. The presence of landslides has also been associated with the degree of weathering and fragmentation, the orientation of the discontinuity surfaces and the intense morphological relief. The most accurate model was Random Forest, which correctly identified 92.00% of the instances during the training phase, followed by the Logistic Regression with 89.00%. The same pattern of accuracy was calculated during the validation phase, in which the Random Forest achieved a classification accuracy of 93.00%, while the Logistic Regression model achieved an accuracy of 91.00%. In conclusion, the outcomes of the study could be a useful cartographic product to local authorities and government agencies during the implementation of successful decision-making and land use planning strategies. Keywords: Landslide Susceptibility, Logistic Regression, Random Forest, GIS, Greece.

  11. Modeling the Effects of Climate Change on Whitebark Pine Along the Pacific Crest Trail

    NASA Astrophysics Data System (ADS)

    Anderson, R. S.; Nguyen, A.; Gill, N.; Kannan, S.; Patadia, N.; Meyer, M.; Schmidt, C.

    2012-12-01

    The Pacific Crest Trail (PCT), one of eight National Scenic Trails, stretches 2,650 miles from Mexico to the Canadian border. At high elevations along this trail, within Inyo and Sierra National Forests, populations of whitebark pine (Pinus albicaulis) have been diminishing due to infestation of the mountain pine beetle (Dendroctonus ponderosae) and are threatened due to a changing climate. Understanding the current and future condition of whitebark pine is a primary goal of forest managers due to its high ecological and economic importance, and it is currently a candidate for protection under the Endangered Species Act (ESA). Using satellite imagery, we analyzed the rate and spatial extent of whitebark pine tree mortality from 1984 to 2011 using the Landsat-based Detection of Trends in Disturbance and Recovery (LandTrendr) program. Climate data, soil properties, and biological features of the whitebark pine were incorporated in the Physiological Principles to Predict Growth (3-PG) model to predict future rates of growth and assess its applicability in modeling natural whitebark pine processes. Finally, the Random Forest algorithm was used with topographic data alongside recent and future climate data from the IPCC A2 and B1 climate scenarios for the years 2030, 2060, and 2090 to model the future distribution of whitebark pine. LandTrendr results indicate beetle related mortality covering 14,940 km2 of forest, 2,880 km2 of which are within whitebark pine forest. Our results show that under the A2 climate scenario, suitable whitebark pine habitat within our study area may be reduced by as much as 99.97% by the year 2090. Under the B1 climate scenario, which has decreased CO2 emissions, 13.54% more habitat would be preserved in 2090.

  12. Optimizing classification performance in an object-based very-high-resolution land use-land cover urban application

    NASA Astrophysics Data System (ADS)

    Georganos, Stefanos; Grippa, Tais; Vanhuysse, Sabine; Lennert, Moritz; Shimoni, Michal; Wolff, Eléonore

    2017-10-01

    This study evaluates the impact of three Feature Selection (FS) algorithms in an Object Based Image Analysis (OBIA) framework for Very-High-Resolution (VHR) Land Use-Land Cover (LULC) classification. The three selected FS algorithms, Correlation Based Selection (CFS), Mean Decrease in Accuracy (MDA) and Random Forest (RF) based Recursive Feature Elimination (RFE), were tested on Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Random Forest (RF) classifiers. The results demonstrate that the accuracies of the SVM and KNN classifiers are the most sensitive to FS. The RF appeared to be more robust to high dimensionality, although a significant increase in accuracy was found by using the RFE method. In terms of classification accuracy, SVM performed the best using FS, followed by RF and KNN. Finally, only a small number of features is needed to achieve the highest performance using each classifier. This study emphasizes the benefits of rigorous FS for maximizing performance, as well as for minimizing model complexity and interpretation.

  13. Taxi-Out Time Prediction for Departures at Charlotte Airport Using Machine Learning Techniques

    NASA Technical Reports Server (NTRS)

    Lee, Hanbong; Malik, Waqar; Jung, Yoon C.

    2016-01-01

    Predicting the taxi-out times of departures accurately is important for improving airport efficiency and takeoff time predictability. In this paper, we attempt to apply machine learning techniques to actual traffic data at Charlotte Douglas International Airport for taxi-out time prediction. To find the key factors affecting aircraft taxi times, surface surveillance data is first analyzed. From this data analysis, several variables, including terminal concourse, spot, runway, departure fix and weight class, are selected for taxi time prediction. Then, various machine learning methods such as linear regression, support vector machines, k-nearest neighbors, random forest, and neural networks model are applied to actual flight data. Different traffic flow and weather conditions at Charlotte airport are also taken into account for more accurate prediction. The taxi-out time prediction results show that linear regression and random forest techniques can provide the most accurate prediction in terms of root-mean-square errors. We also discuss the operational complexity and uncertainties that make it difficult to predict the taxi times accurately.
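
    The root-mean-square error used to rank the methods in this abstract can be computed as below. This is a minimal sketch: the taxi-out times, the two model names, and their predictions are made-up illustrative values, not data from the paper.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical taxi-out times in minutes for four departures.
actual_minutes = [12.0, 18.5, 15.0, 22.0]
predictions = {
    "model_a": [13.0, 17.5, 16.0, 21.0],  # off by 1 minute on every flight
    "model_b": [12.0, 18.5, 15.0, 26.0],  # exact on three, off by 4 on one
}

scores = {name: rmse(actual_minutes, pred) for name, pred in predictions.items()}
best_model = min(scores, key=scores.get)
```

    Because errors are squared before averaging, the single 4-minute miss gives model_b a larger RMSE than model_a even though model_b is exact on three of the four flights, which is why RMSE tends to favor methods that avoid occasional large errors.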

  14. Forest clearing in the Ecuadorian Amazon: A study of patterns over space and time

    PubMed Central

    Pan, William; Carr, David; Barbieri, Alisson; Bilsborrow, Richard; Suchindran, Chirayath

    2010-01-01

    This study tests four hypotheses related to forest clearing over time in Ecuador’s northern Amazon: (1) a larger increase in population over time on a farm (finca) leads to more deforestation; (2) rates of forest clearing surrounding four primary reference communities differ (spatial heterogeneity); (3) fincas farther from towns/communities experience lower rates of forest clearing over time; and (4) forest clearing differs by finca settlement cohort, viz., by year of establishment of the finca. In this paper, we examine the relationship between forest clearing and key variables over time, and compare three statistical models—OLS, random effects, and spatial regression—to test hypotheses. Descriptive analyses indicate that 7–15% of forest area was cleared on fincas between 1990 and 1999; that more recently established fincas experienced more rapid forest clearing; and that population size and forest clearing are both related to distance from a major community. Controlling for key variables, model results indicate that an increase in population size is significantly related to more forest clearing; rates of forest clearing around the four major communities are not significantly different; distances separating fincas and communities are not significantly related to deforestation; and deforestation rates are higher among more recently established fincas. Key policy implications include the importance of reducing population growth and momentum through measures such as improving information about and provision of family planning services; increasing the low level of girls education to delay and reduce fertility; and expanding credit and agricultural extension services to increase agricultural intensification. PMID:20703367

  15. Relationship of field and LiDAR estimates of forest canopy cover with snow accumulation and melt

    Treesearch

    Mariana Dobre; William J. Elliot; Joan Q. Wu; Timothy E. Link; Brandon Glaza; Theresa B. Jain; Andrew T. Hudak

    2012-01-01

    At the Priest River Experimental Forest in northern Idaho, USA, snow water equivalent (SWE) was recorded over a period of six years on random, equally-spaced plots in ~4.5 ha small watersheds (n=10). Two watersheds were selected as controls and eight as treatments, with two watersheds randomly assigned per treatment as follows: harvest (2007) followed by mastication (...

  16. Chapter 4 - Drought patterns in the conterminous United States and Hawaii.

    Treesearch

    Frank H. Koch; William D. Smith; John W. Coulston

    2014-01-01

    Droughts are common in virtually all U.S. forests, but their frequency and intensity vary widely both between and within forest ecosystems (Hanson and Weltzin 2000). Forests in the Western United States generally exhibit a pattern of annual seasonal droughts. Forests in the Eastern United States tend to exhibit one of two prevailing patterns: random occasional droughts...

  17. A Prospectus on Restoring Late Successional Forest Structure to Eastside Pine Ecosystems Through Large-Scale, Interdisciplinary Research

    Treesearch

    Steve Zack; William F. Laudenslayer; Luke George; Carl Skinner; William Oliver

    1999-01-01

    At two different locations in northeast California, an interdisciplinary team of scientists is initiating long-term studies to quantify the effects of forest manipulations intended to accelerate and/or enhance late-successional structure of eastside pine forest ecosystems. One study, at Blacks Mountain Experimental Forest, uses a split-plot, factorial, randomized block...

  18. Multisource passive acoustic tracking: an application of random finite set data fusion

    NASA Astrophysics Data System (ADS)

    Ali, Andreas M.; Hudson, Ralph E.; Lorenzelli, Flavio; Yao, Kung

    2010-04-01

    Multisource passive acoustic tracking is useful in animal bio-behavioral study by replacing or enhancing human involvement during and after field data collection. Multiple simultaneous vocalizations are a common occurrence in a forest or a jungle, where many species are encountered. Given a set of nodes that are capable of producing multiple direction-of-arrivals (DOAs), such data needs to be combined into meaningful estimates. Random Finite Set provides the mathematical probabilistic model, which is suitable for analysis and optimal estimation algorithm synthesis. Then the proposed algorithm has been verified using a simulation and a controlled test experiment.

  19. Utilizing random forests imputation of forest plot data for landscape-level wildfire analyses

    Treesearch

    Karin L. Riley; Isaac C. Grenfell; Mark A. Finney; Nicholas L. Crookston

    2014-01-01

    Maps of the number, size, and species of trees in forests across the United States are desirable for a number of applications. For landscape-level fire and forest simulations that use the Forest Vegetation Simulator (FVS), a spatial tree-level dataset, or “tree list”, is a necessity. FVS is widely used at the stand level for simulating fire effects on tree mortality,...

  20. Prediction of drug synergy in cancer using ensemble-based machine learning techniques

    NASA Astrophysics Data System (ADS)

    Singh, Harpreet; Rana, Prashant Singh; Singh, Urvinder

    2018-04-01

    Drug synergy prediction plays a significant role in the medical field for inhibiting specific cancer agents. It can be developed as a pre-processing tool for therapeutic successes. Different drug-drug interactions can be examined using the drug synergy score. This requires efficient regression-based machine learning approaches to minimize the prediction errors. Numerous machine learning techniques, such as neural networks, support vector machines, random forests, LASSO, and Elastic Nets, have been used in the past to meet this requirement. However, these techniques individually do not provide significant accuracy in drug synergy score. Therefore, the primary objective of this paper is to design a neuro-fuzzy-based ensembling approach. To achieve this, nine well-known machine learning techniques have been implemented by considering the drug synergy data. Based on the accuracy of each model, four techniques with high accuracy are selected to develop an ensemble-based machine learning model. These models are Random forest, Fuzzy Rules Using Genetic Cooperative-Competitive Learning method (GFS.GCCL), Adaptive-Network-Based Fuzzy Inference System (ANFIS) and Dynamic Evolving Neural-Fuzzy Inference System method (DENFIS). Ensembling is achieved by evaluating the biased weighted aggregation (i.e. adding more weights to the model with a higher prediction score) of predicted data by selected models. The proposed and existing machine learning techniques have been evaluated on drug synergy score data. The comparative analysis reveals that the proposed method outperforms others in terms of accuracy, root mean square error and coefficient of correlation.
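
    The "biased weighted aggregation" described in this abstract can be illustrated with a minimal sketch. It assumes the weights are simply each model's score divided by the sum of scores; the abstract does not state the exact weighting rule, and the function name, model names, and numbers below are hypothetical.

```python
def weighted_ensemble(predictions, scores):
    """Combine per-model predictions, weighting each model by its score.

    predictions: dict mapping model name -> list of predicted values
    scores:      dict mapping model name -> non-negative prediction score
    Assumption: weight = score / sum of scores (not taken from the paper).
    """
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must sum to a positive value")
    weights = {name: score / total for name, score in scores.items()}
    n_samples = len(next(iter(predictions.values())))
    # Weighted sum across models for each sample
    return [
        sum(weights[name] * predictions[name][i] for name in predictions)
        for i in range(n_samples)
    ]


# Hypothetical example: model "b" scores three times higher than model "a",
# so it receives three times the weight.
combined = weighted_ensemble(
    {"a": [1.0, 2.0], "b": [3.0, 4.0]},
    {"a": 1.0, "b": 3.0},
)
```

    Normalizing the scores keeps the combined prediction on the same scale as the individual model outputs.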

  1. Using GPS, GIS, and Accelerometer Data to Predict Transportation Modes.

    PubMed

    Brondeel, Ruben; Pannier, Bruno; Chaix, Basile

    2015-12-01

    Active transportation is a substantial source of physical activity, which has a positive influence on many health outcomes. A survey of transportation modes for each trip is challenging, time-consuming, and requires substantial financial investments. This study proposes a passive collection method and the prediction of modes at the trip level using random forests. The RECORD GPS study collected real-life trip data from 236 participants over 7 d, including the transportation mode, global positioning system, geographical information systems, and accelerometer data. A prediction model of transportation modes was constructed using the random forests method. Finally, we investigated the performance of models on the basis of a limited number of participants/trips to predict transportation modes for a large number of trips. The full model had a correct prediction rate of 90%. A simpler model of global positioning system explanatory variables combined with geographical information systems variables performed nearly as well. Relatively good predictions could be made using a model based on the 991 trips of the first 30 participants. This study uses real-life data from a large sample set to test a method for predicting transportation modes at the trip level, thereby providing a useful complement to time unit-level prediction methods. By enabling predictions on the basis of a limited number of observations, this method may decrease the workload for participants/researchers and provide relevant trip-level data to investigate relations between transportation and health.

  2. A prediction scheme of tropical cyclone frequency based on lasso and random forest

    NASA Astrophysics Data System (ADS)

    Tan, Jinkai; Liu, Hexiang; Li, Mengya; Wang, Jun

    2017-07-01

    This study aims to propose a novel prediction scheme of tropical cyclone frequency (TCF) over the Western North Pacific (WNP). We considered large-scale meteorological factors including the sea surface temperature, sea level pressure, the Niño-3.4 index, the wind shear, the vorticity, the subtropical high, and the sea ice cover, since the chronic change of these factors in the context of climate change would cause a gradual variation of the annual TCF. Specifically, we focus on the correlation between the year-to-year increment of these factors and TCF. The least absolute shrinkage and selection operator (Lasso) method was used for variable selection and dimension reduction from 11 initial predictors. Then, a prediction model based on random forest (RF) was established by using the training samples (1978-2011) for calibration and the testing samples (2012-2016) for validation. The RF model captures the major variation and trend of TCF in the calibration period, and also fits the observed TCF well in the validation period, though with some deviations. The leave-one-out cross validation of the model showed that most of the predicted TCF values are consistent with the observed TCF, with a high correlation coefficient. A comparison between results of the RF model and the multiple linear regression (MLR) model suggested the RF is more practical and capable of giving reliable results of TCF prediction over the WNP.

  3. Machine learning to predict the occurrence of bisphosphonate-related osteonecrosis of the jaw associated with dental extraction: A preliminary report.

    PubMed

    Kim, Dong Wook; Kim, Hwiyoung; Nam, Woong; Kim, Hyung Jun; Cha, In-Ho

    2018-04-23

    The aim of this study was to build and validate five types of machine learning models that can predict the occurrence of BRONJ associated with dental extraction in patients taking bisphosphonates for the management of osteoporosis. A retrospective review of the medical records was conducted to obtain cases and controls for the study. A total of 125 patients, consisting of 41 cases and 84 controls, were selected for the study. Five machine learning prediction algorithms including multivariable logistic regression model, decision tree, support vector machine, artificial neural network, and random forest were implemented. The outputs of these models were compared with each other and also with conventional methods, such as serum CTX level. Area under the receiver operating characteristic (ROC) curve (AUC) was used to compare the results. The performance of machine learning models was significantly superior to conventional statistical methods and single predictors. The random forest model yielded the best performance (AUC = 0.973), followed by artificial neural network (AUC = 0.915), support vector machine (AUC = 0.882), logistic regression (AUC = 0.844), decision tree (AUC = 0.821), drug holiday alone (AUC = 0.810), and CTX level alone (AUC = 0.630). Machine learning methods showed superior performance in predicting BRONJ associated with dental extraction compared to conventional statistical methods using drug holiday and serum CTX level. Machine learning can thus be applied in a wide range of clinical studies. Copyright © 2017. Published by Elsevier Inc.
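
    The AUC used to rank the models in this abstract equals the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as one half. The following is a minimal sketch of that pairwise definition; the function name and the toy labels and scores are illustrative only and are not data from the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of cases and controls.

    labels: iterable of 0 (control) or 1 (case)
    scores: iterable of model scores, higher meaning "more likely a case"
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # case ranked above control
            elif p == q:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical values): 3 of the 4 case/control pairs are
# ordered correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

    The pairwise form is quadratic in sample size, which is acceptable for a cohort of this size; rank-based formulas give the same value more efficiently on large datasets.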

  4. Spatiotemporal prediction of daily ambient ozone levels across China using random forest for human exposure assessment.

    PubMed

    Zhan, Yu; Luo, Yuzhou; Deng, Xunfei; Grieneisen, Michael L; Zhang, Minghua; Di, Baofeng

    2018-02-01

    In China, ozone pollution shows an increasing trend and has become the primary air pollutant in warm seasons. Leveraging the air quality monitoring network, a random forest model is developed to predict the daily maximum 8-h average ozone concentrations ([O3]MDA8) across China in 2015 for human exposure assessment. This model captures the observed spatiotemporal variations of [O3]MDA8 by using the data of meteorology, elevation, and recent-year emission inventories (cross-validation R² = 0.69 and RMSE = 26 μg/m³). Compared with chemical transport models that require many variables and expensive computation, the random forest model shows comparable or higher predictive performance based on only a handful of readily available variables at much lower computational cost. The nationwide population-weighted [O3]MDA8 is predicted to be 84 ± 23 μg/m³ annually, with the highest seasonal mean in the summer (103 ± 8 μg/m³). The summer [O3]MDA8 is predicted to be the highest in North China (125 ± 17 μg/m³). Approximately 58% of the population lives in areas with more than 100 nonattainment days ([O3]MDA8 > 100 μg/m³), and 12% of the population is exposed to [O3]MDA8 > 160 μg/m³ (WHO Interim Target 1) for more than 30 days. As the most populous zones in China, the Beijing-Tianjin Metro, Yangtze River Delta, Pearl River Delta, and Sichuan Basin are predicted to be at 154, 141, 124, and 98 nonattainment days, respectively. Effective controls of O3 pollution are urgently needed for the highly populated zones, especially the Beijing-Tianjin Metro with seasonal [O3]MDA8 of 140 ± 29 μg/m³ in summer. To the best of the authors' knowledge, this study is the first statistical modeling work of ambient O3 for China at the national level. This timely and extensively validated [O3]MDA8 dataset is valuable for refining epidemiological analyses on O3 pollution in China. Copyright © 2017 Elsevier Ltd. All rights reserved.

  5. Regional scale soil salinity assessment using remote sensing based environmental factors and vegetation indicators

    NASA Astrophysics Data System (ADS)

    Ma, Ligang; Ma, Fenglan; Li, Jiadan; Gu, Qing; Yang, Shengtian; Ding, Jianli

    2017-04-01

    Land degradation, specifically soil salinization, has rendered large areas of western China sterile and unproductive while diminishing the productivity of adjacent lands and other areas where salting is less severe. Despite decades of research in soil mapping, little accurate and up-to-date information on the spatial extent and variability of soil salinity is available for large geographic regions. This study explores the potential of assessing soil salinity via linear and random forest modeling of remote sensing based environmental factors and indirect indicators. A case study is presented for the arid oases of Tarim and Junggar Basin, Xinjiang, China using time series land surface temperature (LST), evapotranspiration (ET), TRMM precipitation (TRM), DEM product and vegetation indexes as well as their second order products. In particular, the location of the oasis, the best feature sets, different salinity degrees and modeling approaches were fully examined. All constructed models were evaluated for their fit to the whole data set and their performance in a leave-one-field-out spatial cross-validation. In addition, the Kruskal-Wallis rank test was adopted for the statistical comparison of different models. Overall, the random forest model outperformed the linear model for the two basins, all salinity degrees and datasets. As for the feature set, LST and ET were consistently identified as the most important factors for the two basins, while the contribution of vegetation indexes varies with location. Moreover, model performance is promising for the salinity ranges that are most relevant to agricultural productivity.

  6. Random Forests to Predict Rectal Toxicity Following Prostate Cancer Radiation Therapy

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ospina, Juan D.; INSERM, U1099, Rennes; Escuela de Estadística, Universidad Nacional de Colombia Sede Medellín, Medellín

    2014-08-01

    Purpose: To propose a random forest normal tissue complication probability (RF-NTCP) model to predict late rectal toxicity following prostate cancer radiation therapy, and to compare its performance to that of classic NTCP models. Methods and Materials: Clinical data and dose-volume histograms (DVH) were collected from 261 patients who received 3-dimensional conformal radiation therapy for prostate cancer with at least 5 years of follow-up. The series was split 1000 times into training and validation cohorts. A RF was trained to predict the risk of 5-year overall rectal toxicity and bleeding. Parameters of the Lyman-Kutcher-Burman (LKB) model were identified and a logistic regression model was fit. The performance of all the models was assessed by computing the area under the receiver operating characteristic curve (AUC). Results: The 5-year grade ≥2 overall rectal toxicity and grade ≥1 and grade ≥2 rectal bleeding rates were 16%, 25%, and 10%, respectively. Predictive capabilities were obtained using the RF-NTCP model for all 3 toxicity endpoints, including both the training and validation cohorts. The age and use of anticoagulants were found to be predictors of rectal bleeding. The AUC for RF-NTCP ranged from 0.66 to 0.76, depending on the toxicity endpoint. The AUC values for the LKB-NTCP were statistically significantly inferior, ranging from 0.62 to 0.69. Conclusions: The RF-NTCP model may be a useful new tool in predicting late rectal toxicity, including variables other than DVH, and thus appears as a strong competitor to classic NTCP models.

  7. Fault Detection of Aircraft System with Random Forest Algorithm and Similarity Measure

    PubMed Central

    Park, Wookje; Jung, Sikhang

    2014-01-01

    A fault detection algorithm was developed using a similarity measure and the random forest algorithm. The resulting algorithm was applied to an unmanned aerial vehicle (UAV) that we prepared. The similarity measure was designed using distance information, and its usefulness was verified by proof. The fault decision was made by calculating a weighted similarity measure. Twelve available coefficients from the healthy and faulty status data groups were used to determine the decision. The similarity measure weights were obtained through the random forest algorithm (RFA); RF provides data priority. To obtain a fast decision response, a limited number of coefficients was also considered. The relation between detection rate and amount of feature data was analyzed and illustrated. By repeated trials of the similarity calculation, the useful data amount was obtained. PMID:25057508

  8. A primer on stand and forest inventory designs

    Treesearch

    H. Gyde Lund; Charles E. Thomas

    1989-01-01

    Covers designs for the inventory of stands and forests in detail and with worked-out examples. For stands, random sampling, line transects, ricochet plot, systematic sampling, single plot, cluster, subjective sampling and complete enumeration are discussed. For forest inventory, the main categories are subjective sampling, inventories without prior stand mapping,...

  9. Forest community classification of the Porcupine River drainage, interior Alaska, and its application to forest management.

    Treesearch

    John Yarie

    1983-01-01

    The forest vegetation of 3,600,000 hectares in northeast interior Alaska was classified. A total of 365 plots located in a stratified random design were run through the ordination programs SIMORD and TWINSPAN. A total of 40 forest communities were described vegetatively and, to a limited extent, environmentally. The area covered by each community was similar, ranging...

  10. Experimental Design Considerations for Establishing an Off-Road, Habitat-Specific Bird Monitoring Program Using Point Counts

    Treesearch

    JoAnn M. Hanowski; Gerald J. Niemi

    1995-01-01

    We established bird monitoring programs in two regions of Minnesota: the Chippewa National Forest and the Superior National Forest. The experimental design defined forest cover types as strata in which samples of forest stands were randomly selected. Subsamples (3 point counts) were placed in each stand to maximize field effort and to assess within-stand and between-...

  11. Unbiased feature selection in learning random forests for high-dimensional data.

    PubMed

    Nguyen, Thanh-Tung; Huang, Joshua Zhexue; Nguyen, Thuy Thi

    2015-01-01

    Random forests (RFs) have been widely used as a powerful classification method. However, with the randomization in both bagging samples and feature selection, the trees in the forest tend to select uninformative features for node splitting. This makes RFs have poor accuracy when working with high-dimensional data. Besides that, RFs have bias in the feature selection process where multivalued features are favored. Aiming at debiasing feature selection in RFs, we propose a new RF algorithm, called xRF, to select good features in learning RFs for high-dimensional data. We first remove the uninformative features using p-value assessment, and the subset of unbiased features is then selected based on some statistical measures. This feature subset is then partitioned into two subsets. A feature weighting sampling technique is used to sample features from these two subsets for building trees. This approach enables one to generate more accurate trees, while allowing one to reduce dimensionality and the amount of data needed for learning RFs. An extensive set of experiments has been conducted on 47 high-dimensional real-world datasets including image datasets. The experimental results have shown that RFs with the proposed approach outperformed the existing random forests in increasing the accuracy and the AUC measures.

  12. North American vegetation model for land-use planning in a changing climate: A solution to large classification problems

    Treesearch

    Gerald E. Rehfeldt; Nicholas L. Crookston; Cuauhtemoc Saenz-Romero; Elizabeth M. Campbell

    2012-01-01

    Data points intensively sampling 46 North American biomes were used to predict the geographic distribution of biomes from climate variables using the Random Forests classification tree. Techniques were incorporated to accommodate a large number of classes and to predict the future occurrence of climates beyond the contemporary climatic range of the biomes. Errors of...

  13. Mapping growing stock volume and forest live biomass: a case study of the Polissya region of Ukraine

    NASA Astrophysics Data System (ADS)

    Bilous, Andrii; Myroniuk, Viktor; Holiaka, Dmytrii; Bilous, Svitlana; See, Linda; Schepaschenko, Dmitry

    2017-10-01

    Forest inventory and biomass mapping are important tasks that require inputs from multiple data sources. In this paper we implement two methods for the Ukrainian region of Polissya: random forest (RF) for tree species prediction and k-nearest neighbors (k-NN) for growing stock volume and biomass mapping. We examined the suitability of the five-band RapidEye satellite image to predict the distribution of six tree species. The accuracy of RF is quite high: ~99% for forest/non-forest mask and 89% for tree species prediction. Our results demonstrate that inclusion of elevation as a predictor variable in the RF model improved the performance of tree species classification. We evaluated different distance metrics for the k-NN method, including Euclidean or Mahalanobis distance, most similar neighbor (MSN), gradient nearest neighbor, and independent component analysis. The MSN with the four nearest neighbors (k = 4) is the most precise (according to the root-mean-square deviation) for predicting forest attributes across the study area. The k-NN method allowed us to estimate growing stock volume with an accuracy of 3 m³ ha⁻¹ and live biomass with an accuracy of about 2 t ha⁻¹ over the study area.
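
    As a rough illustration of the k-NN step in this abstract, the sketch below predicts a forest attribute as the mean of the k = 4 nearest reference plots. It uses plain Euclidean distance, one of the metrics the authors evaluated, rather than the MSN distance they found most precise; the function name, feature vectors, and volumes are hypothetical.

```python
import math


def knn_predict(train_x, train_y, query, k=4):
    """Predict a value as the mean of the k nearest training samples.

    Assumption: Euclidean distance in feature space (not the MSN metric
    reported as best in the abstract).
    """
    if k < 1 or k > len(train_x):
        raise ValueError("k must be between 1 and the number of samples")
    # Pair each distance with its target value and sort by distance
    ranked = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    nearest = ranked[:k]
    return sum(y for _, y in nearest) / k


# Hypothetical reference plots: two spectral features and a growing
# stock volume for each plot.
plots = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 10.0)]
volumes = [100.0, 120.0, 110.0, 130.0, 400.0]
estimate = knn_predict(plots, volumes, (0.5, 0.5), k=4)
```

    The distant fifth plot is excluded from the average, which is the behaviour that makes the choice of distance metric matter in practice.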

  14. The Search for Efficiency in Arboreal Ray Tracing Applications

    NASA Astrophysics Data System (ADS)

    van Leeuwen, M.; Disney, M.; Chen, J. M.; Gomez-Dans, J.; Kelbe, D.; van Aardt, J. A.; Lewis, P.

    2016-12-01

    Forest structure significantly impacts a range of abiotic conditions, including humidity and the radiation regime, all of which affect the rate of net and gross primary productivity. Current forest productivity models typically consider abstract media to represent the transfer of radiation within the canopy. Examples include the representation of forest structure via a layered canopy model, where leaf area and inclination angles are stratified with canopy depth, or as turbid media where leaves are randomly distributed within space or within confined geometric solids such as blocks, spheres or cones. While these abstract models are known to produce accurate estimates of primary productivity at the stand level, their limited geometric resolution restricts applicability at fine spatial scales, such as the cell, leaf or shoot levels, thereby not addressing the full potential of assimilation of data from laboratory and field measurements with that of remote sensing technology. Recent research efforts have explored the use of laser scanning to capture detailed tree morphology at millimeter accuracy. These data can subsequently be used to combine ray tracing with primary productivity models, providing an ability to explore trade-offs among different morphological traits or assimilate data from spatial scales spanning the leaf to the stand level. Ray tracing has a major advantage of allowing the most accurate structural description of the canopy, and can directly exploit new 3D structural measurements, e.g., from laser scanning. However, the biggest limitation of ray tracing models is their high computational cost, which currently limits their use for large-scale applications. In this talk, we explore ways to more efficiently exploit ray tracing simulations and capture this information in a readily computable form for future evaluation, thus potentially enabling large-scale first-principles forest growth modelling applications.

  15. High-Resolution Regional Biomass Map of Siberia from Glas, Palsar L-Band Radar and Landsat Vcf Data

    NASA Astrophysics Data System (ADS)

    Sun, G.; Ranson, K.; Montesano, P.; Zhang, Z.; Kharuk, V.

    2015-12-01

    The Arctic-Boreal zone is known to be warming at an accelerated rate relative to other biomes. The taiga or boreal forest covers over 16 × 10⁶ km² of Arctic North America, Scandinavia, and Eurasia. A large part of the northern Boreal forests is in Russia's Siberia, an area with recent accelerated climate warming. During the last two decades we have been working on characterization of boreal forests in north-central Siberia using field and satellite measurements. We have published results of circumpolar biomass using field plots, airborne (PALS, ACTM) and spaceborne (GLAS) lidar data with ASTER DEM, LANDSAT and MODIS land cover classification, MODIS burned area and WWF's ecoregion map. Researchers from ESA and Russia have also been working on biomass (or growing stock) mapping in Siberia. For example, they developed a pan-boreal growing stock volume map at 1-kilometer scale using hyper-temporal ENVISAT ASAR ScanSAR backscatter data. Using the annual PALSAR mosaics from 2007 to 2010, growing stock volume maps were retrieved based on a supervised random forest regression approach. This method is being used in the ESA/Russia ZAPAS project for Central Siberia Biomass mapping. Spatially specific biomass maps of this region at higher resolution are desired for carbon cycle and climate change studies. In this study, our work focused on improving the resolution (50 m) of a biomass map based on PALSAR L-band data and Landsat Vegetation Canopy Fraction products. GLAS data were carefully processed and screened using land cover classification, local slope, and acquisition dates. The biomass at remaining footprints was estimated using a model developed from field measurements at GLAS footprints. The GLAS biomass samples were then aggregated into 1 Mg/ha bins of biomass, and mean VCF and PALSAR backscatter and textures were calculated for each of these biomass bins. The resulting biomass/signature data were used to train a random forest model for biomass mapping of the entire region from 50°N to 75°N and 80°E to 145°E. The spatial patterns of the new biomass map are much better than those of the previous maps due to spatially specific mapping at high resolution. The uncertainties of field/GLAS and GLAS/imagery models were investigated using a bootstrap procedure, and the final biomass map was compared with previous maps.

  16. Diameter distribution in a Brazilian tropical dry forest domain: predictions for the stand and species.

    PubMed

    Lima, Robson B DE; Bufalino, Lina; Alves, Francisco T; Silva, José A A DA; Ferreira, Rinaldo L C

    2017-01-01

    Currently, there is a lack of studies on the correct utilization of continuous distributions for dry tropical forests. Therefore, this work aims to investigate the diameter structure of a Brazilian tropical dry forest and to select suitable continuous distributions by means of statistical tools for the stand and the main species. Two subsets were randomly selected from 40 plots. Diameter at base height was obtained. The following functions were tested: log-normal; gamma; Weibull 2P and Burr. The best fits were selected by Akaike's information validation criterion. Overall, the diameter distribution of the dry tropical forest was better described by negative exponential curves and positive skewness. The forest studied showed diameter distributions with decreasing probability for larger trees. This behavior was observed for both the main species and the stand. The generalization of the function fitted for the main species shows that the development of individual models is needed. The Burr function showed good flexibility to describe the diameter structure of the stand and the behavior of Mimosa ophthalmocentra and Bauhinia cheilantha species. For Poincianella bracteosa, Aspidosperma pyrifolium and Myracrodum urundeuva better fitting was obtained with the log-normal function.
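
    Selecting among candidate distributions by Akaike's criterion, as described in this abstract, amounts to computing AIC = 2k − 2 ln L for each fit (k parameters, maximized log-likelihood ln L) and keeping the smallest value. The sketch below assumes the log-likelihoods have already been obtained from fitting; the function names and the numbers are hypothetical and are not results from the study.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def best_by_aic(candidates):
    """Return the name of the candidate with the lowest AIC.

    candidates: dict mapping distribution name ->
                (maximized log-likelihood, number of fitted parameters)
    """
    return min(candidates, key=lambda name: aic(*candidates[name]))


# Hypothetical fits: the Burr fit has one more parameter but a higher
# log-likelihood, enough to offset the penalty.
fits = {
    "log-normal": (-120.0, 2),   # AIC = 244.0
    "Burr": (-118.5, 3),         # AIC = 243.0
}
selected = best_by_aic(fits)
```

    The 2k term penalizes extra parameters, so a more flexible function is preferred only when its likelihood gain exceeds that penalty.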

  17. Maximizing Conservation and Production with Intensive Forest Management: It's All About Location

    NASA Astrophysics Data System (ADS)

    Tittler, Rebecca; Filotas, Élise; Kroese, Jasmin; Messier, Christian

    2015-11-01

    Functional zoning has been suggested as a way to balance the needs of a viable forest industry with those of healthy ecosystems. Under this system, part of the forest is set aside for protected areas, counterbalanced by intensive and extensive management of the rest of the forest. Studies indicate this may provide adequate timber while minimizing road construction and favoring the development of large mature and old stands. However, it is unclear how the spatial arrangement of intensive management areas may affect the success of this zoning. Should these areas be agglomerated or dispersed throughout the forest landscape? Should managers prioritize (a) proximity to existing roads, (b) distance from protected areas, or (c) site-specific productivity? We use a spatially explicit landscape simulation model to examine the effects of different spatial scenarios on landscape structure, connectivity for native forest wildlife, stand diversity, harvest volume, and road construction: (1) random placement of intensive management areas, and (2-8) all possible combinations of rules (a)-(c). Results favor the agglomeration of intensive management areas. For most wildlife species, connectivity was the highest when intensive management was far from the protected areas. This scenario also resulted in relatively high harvest volumes. Maximizing distance of intensive management areas from protected areas may therefore be the best way to maximize the benefits of intensive management areas while minimizing their potentially negative effects on forest structure and biodiversity.

  18. Do David and Goliath Play the Same Game? Explanation of the Abundance of Rare and Frequent Invasive Alien Plants in Urban Woodlands in Warsaw, Poland.

    PubMed

    Obidziński, Artur; Mędrzycki, Piotr; Kołaczkowska, Ewa; Ciurzycki, Wojciech; Marciszewska, Katarzyna

    2016-01-01

    Invasive Alien Plants occur in numbers differing by orders of magnitude at subsequent invasion stages. Effective sampling and quantifying niches of rare invasive plants are quite problematic. The aim of this paper is to estimate the influence of invasive plant frequency on the explanation of their local abundance. We attempted to achieve it through: (1) assessment of occurrence of self-regenerating invasive plants in urban woodlands, (2) comparison of Random Forest modelling results for frequent and rare species. We hypothesized that the abundance of frequent species would be explained better than that of rare ones and that both rare and frequent species share a common hierarchy of the most important determinants. We found 15 taxa in almost two thirds of 1040 plots with a total number of 1068 occurrences. We recorded 6 taxa of high frequency (Prunus serotina, Quercus rubra, Acer negundo, Robinia pseudoacacia, Impatiens parviflora and Solidago spp.) and 9 taxa of low frequency: Acer saccharinum, Amelanchier spicata, Cornus spp., Fraxinus spp., Parthenocissus spp., Syringa vulgaris, Echinocystis lobata, Helianthus tuberosus, Reynoutria spp. The quality of the Random Forest models grows with the number of occurrences for frequent taxa but not for rare ones. Both frequent and rare taxa share a similar hierarchy of predictor importance: Land use > Tree stand > Seed source and, for frequent taxa, Forest properties as well. We conclude that there is an 'explanation jump' at higher species frequencies, but rare species are surprisingly similar to frequent ones in their hierarchy of determinants, with differences conforming with their respective stages of invasion.

  19. Classification of suicide attempters in schizophrenia using sociocultural and clinical features: A machine learning approach.

    PubMed

    Hettige, Nuwan C; Nguyen, Thai Binh; Yuan, Chen; Rajakulendran, Thanara; Baddour, Jermeen; Bhagwat, Nikhil; Bani-Fatemi, Ali; Voineskos, Aristotle N; Mallar Chakravarty, M; De Luca, Vincenzo

    2017-07-01

    Suicide is a major concern for those afflicted by schizophrenia. Identifying patients at the highest risk for future suicide attempts remains a complex problem for psychiatric interventions. Machine learning models allow for the integration of many risk factors in order to build an algorithm that predicts which patients are likely to attempt suicide. Currently it is unclear how to integrate previously identified risk factors into a clinically relevant predictive tool to estimate the probability that a patient with schizophrenia will attempt suicide. We conducted a cross-sectional assessment on a sample of 345 participants diagnosed with schizophrenia spectrum disorders. Suicide attempters and non-attempters were clearly identified using the Columbia Suicide Severity Rating Scale (C-SSRS) and the Beck Suicide Ideation Scale (BSS). We developed four classification algorithms using regularized regression, random forest, elastic net and support vector machine models with sociocultural and clinical variables as features to train the models. All classification models performed similarly in identifying suicide attempters and non-attempters. Our regularized logistic regression model demonstrated an accuracy of 67% and an area under the curve (AUC) of 0.71, while the random forest model demonstrated 66% accuracy and an AUC of 0.67. The support vector classifier (SVC) model demonstrated an accuracy of 67% and an AUC of 0.70, and the elastic net model demonstrated an accuracy of 65% and an AUC of 0.71. Machine learning algorithms offer a relatively successful method for incorporating many clinical features to predict individuals at risk for future suicide attempts. Increased performance of these models using clinically relevant variables offers the potential to facilitate early treatment and intervention to prevent future suicide attempts. Copyright © 2017 Elsevier Inc. All rights reserved.

  20. Supervised machine learning for analysing spectra of exoplanetary atmospheres

    NASA Astrophysics Data System (ADS)

    Márquez-Neila, Pablo; Fisher, Chloe; Sznitman, Raphael; Heng, Kevin

    2018-06-01

    The use of machine learning is becoming ubiquitous in astronomy [1-3], but remains rare in the study of the atmospheres of exoplanets. Given the spectrum of an exoplanetary atmosphere, a multi-parameter space is swept through in real time to find the best-fit model [4-6]. Known as atmospheric retrieval, this technique originates in the Earth and planetary sciences [7]. Such methods are very time-consuming, and by necessity there is a compromise between physical and chemical realism and computational feasibility. Machine learning has previously been used to determine which molecules to include in the model, but the retrieval itself was still performed using standard methods [8]. Here, we report an adaptation of the 'random forest' method of supervised machine learning [9,10], trained on a precomputed grid of atmospheric models, which retrieves full posterior distributions of the abundances of molecules and the cloud opacity. The use of a precomputed grid allows a large part of the computational burden to be shifted offline. We demonstrate our technique on a transmission spectrum of the hot gas-giant exoplanet WASP-12b using a five-parameter model (temperature, a constant cloud opacity and the volume mixing ratios or relative abundances of molecules of water, ammonia and hydrogen cyanide) [11]. We obtain results consistent with the standard nested-sampling retrieval method. We also estimate the sensitivity of the measured spectrum to the model parameters, and we are able to quantify the information content of the spectrum. Our method can be straightforwardly applied using more sophisticated atmospheric models to interpret an ensemble of spectra without having to retrain the random forest.

  1. A comparative study of family-specific protein-ligand complex affinity prediction based on random forest approach

    NASA Astrophysics Data System (ADS)

    Wang, Yu; Guo, Yanzhi; Kuang, Qifan; Pu, Xuemei; Ji, Yue; Zhang, Zhihang; Li, Menglong

    2015-04-01

    The assessment of binding affinity between ligands and the target proteins plays an essential role in the drug discovery and design process. As an alternative to widely used scoring approaches, machine learning methods have also been proposed for fast prediction of the binding affinity with promising results, but most of them were developed as all-purpose models despite the specific functions of different protein families, even though proteins from different functional families always have different structures and physicochemical features. In this study, we proposed a random forest method to predict the protein-ligand binding affinity based on a comprehensive feature set covering protein sequence, binding pocket, ligand structure and intermolecular interaction. Feature processing and compression were implemented separately for the different protein family datasets, which indicates that different features contribute to different models, so individual representation for each protein family is necessary. Three family-specific models were constructed for three important protein target families: HIV-1 protease, trypsin and carbonic anhydrase. As a comparison, two generic models including diverse protein families were also built. The evaluation results show that models on family-specific datasets have superior performance to those on the generic datasets, and the Pearson and Spearman correlation coefficients (Rp and Rs) on the test sets are 0.740, 0.874, 0.735 and 0.697, 0.853, 0.723 for HIV-1 protease, trypsin and carbonic anhydrase respectively. Comparisons with the other methods further demonstrate that individual representation and model construction for each protein family is a more reasonable way of predicting the affinity of one particular protein family.
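
    The two agreement measures quoted in this abstract, Rp and Rs, can be illustrated with a minimal standard-library sketch. The affinity values below are invented for illustration and are not taken from the study; the function names are likewise hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Invented measured vs. predicted affinities (illustrative only).
measured = [4.2, 5.1, 6.3, 7.0, 8.4]
predicted = [4.8, 5.0, 6.9, 6.5, 8.1]
print(round(pearson(measured, predicted), 3))
print(round(spearman(measured, predicted), 3))
```

    Rp rewards a linear relationship between predicted and measured values, whereas Rs only asks that the ordering agree, which is why the two are often reported together.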

  2. Modeled streamflow metrics on small, ungaged stream reaches in the Upper Colorado River Basin

    USGS Publications Warehouse

    Reynolds, Lindsay V.; Shafroth, Patrick B.

    2016-01-20

    Modeling streamflow is an important approach for understanding landscape-scale drivers of flow and estimating flows where there are no streamgage records. In this study conducted by the U.S. Geological Survey in cooperation with Colorado State University, the objectives were to model streamflow metrics on small, ungaged streams in the Upper Colorado River Basin and identify streams that are potentially threatened with becoming intermittent under drier climate conditions. The Upper Colorado River Basin is a region that is critical for water resources and also projected to experience large future climate shifts toward a drying climate. A random forest modeling approach was used to model the relationship between streamflow metrics and environmental variables. Flow metrics were then projected to ungaged reaches in the Upper Colorado River Basin using environmental variables for each stream, represented as raster cells, in the basin. Last, the projected random forest models of minimum flow coefficient of variation and specific mean daily flow were used to highlight streams that had greater than 61.84 percent minimum flow coefficient of variation and less than 0.096 specific mean daily flow and suggested that these streams will be most threatened to shift to intermittent flow regimes under drier climate conditions. Map projection products can help scientists, land managers, and policymakers understand current hydrology in the Upper Colorado River Basin and make informed decisions regarding water resources. With knowledge of which streams are likely to undergo significant drying in the future, managers and scientists can plan for stream-dependent ecosystems and human water users.
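
    The screening step described at the end of this abstract amounts to a two-condition filter on the projected metrics. A minimal sketch follows, assuming each reach is a dictionary with hypothetical field names; the reach values are invented, and only the two thresholds come from the abstract.

```python
# Thresholds quoted in the abstract; units are as reported there.
CV_THRESHOLD_PERCENT = 61.84
SPECIFIC_FLOW_THRESHOLD = 0.096

def flag_threatened(reaches):
    """Return ids of reaches exceeding the CV threshold and below the flow threshold."""
    return [
        reach["id"]
        for reach in reaches
        if reach["min_flow_cv_percent"] > CV_THRESHOLD_PERCENT
        and reach["specific_mean_daily_flow"] < SPECIFIC_FLOW_THRESHOLD
    ]

# Invented example reaches (field names are hypothetical).
reaches = [
    {"id": "A", "min_flow_cv_percent": 70.0, "specific_mean_daily_flow": 0.05},
    {"id": "B", "min_flow_cv_percent": 70.0, "specific_mean_daily_flow": 0.20},
    {"id": "C", "min_flow_cv_percent": 40.0, "specific_mean_daily_flow": 0.05},
]
print(flag_threatened(reaches))
```

    Both conditions must hold, so a reach with highly variable minimum flow but ample mean flow is not flagged.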

  3. Estimating current and future streamflow characteristics at ungaged sites, central and eastern Montana, with application to evaluating effects of climate change on fish populations

    USGS Publications Warehouse

    Sando, Roy; Chase, Katherine J.

    2017-03-23

    A common statistical procedure for estimating streamflow statistics at ungaged locations is to develop a relational model between streamflow and drainage basin characteristics at gaged locations using least squares regression analysis; however, least squares regression methods are parametric and make constraining assumptions about the data distribution. The random forest regression method provides an alternative nonparametric method for estimating streamflow characteristics at ungaged sites and requires that the data meet fewer statistical conditions than least squares regression methods. Random forest regression analysis was used to develop predictive models for 89 streamflow characteristics using Precipitation-Runoff Modeling System simulated streamflow data and drainage basin characteristics at 179 sites in central and eastern Montana. The predictive models were developed from streamflow data simulated for current (baseline, water years 1982–99) conditions and three future periods (water years 2021–38, 2046–63, and 2071–88) under three different climate-change scenarios. These predictive models were then used to predict streamflow characteristics for baseline conditions and three future periods at 1,707 fish sampling sites in central and eastern Montana. The average root mean square error for all predictive models was about 50 percent. When streamflow predictions at 23 fish sampling sites were compared to nearby locations with simulated data, the mean relative percent difference was about 43 percent. When predictions were compared to streamflow data recorded at 21 U.S. Geological Survey streamflow-gaging stations outside of the calibration basins, the average mean absolute percent error was about 73 percent.
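
    The three percentage error measures named in this abstract can be sketched as follows. The abstract does not state the exact formulas, so the normalizations used here (RMSE divided by the mean observed value, and relative difference divided by the pair average) are assumptions made for illustration.

```python
import math

def rmse_percent(observed, predicted):
    """RMSE as a percentage of the mean observed value (assumed normalization)."""
    squared = sum((p - o) ** 2 for o, p in zip(observed, predicted))
    rmse = math.sqrt(squared / len(observed))
    return 100.0 * rmse / (sum(observed) / len(observed))

def mean_relative_percent_difference(a, b):
    """Mean of |a - b| over the pair average, in percent (assumed definition)."""
    terms = [abs(x - y) / ((x + y) / 2) for x, y in zip(a, b)]
    return 100.0 * sum(terms) / len(terms)

def mean_absolute_percent_error(observed, predicted):
    """Mean of |predicted - observed| / |observed|, in percent."""
    terms = [abs(p - o) / abs(o) for o, p in zip(observed, predicted)]
    return 100.0 * sum(terms) / len(terms)

# Invented flows (illustrative only).
observed = [100.0, 200.0]
predicted = [110.0, 180.0]
print(mean_absolute_percent_error(observed, predicted))
```

    Percent-based measures make errors comparable across streamflow characteristics with very different magnitudes, at the cost of inflating errors where observed values are small.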

  4. Predicting acidification recovery at the Hubbard Brook Experimental Forest, New Hampshire: evaluation of four models.

    PubMed

    Tominaga, Koji; Aherne, Julian; Watmough, Shaun A; Alveteg, Mattias; Cosby, Bernard J; Driscoll, Charles T; Posch, Maximilian; Pourmokhtarian, Afshin

    2010-12-01

    The performance and prediction uncertainty (owing to parameter and structural uncertainties) of four dynamic watershed acidification models (MAGIC, PnET-BGC, SAFE, and VSD) were assessed by systematically applying them to data from the Hubbard Brook Experimental Forest (HBEF), New Hampshire, where long-term records of precipitation and stream chemistry were available. In order to facilitate systematic evaluation, Monte Carlo simulation was used to randomly generate common model input data sets (n = 10,000) from parameter distributions; input data were subsequently translated among models to retain consistency. The model simulations were objectively calibrated against observed data (streamwater: 1963-2004, soil: 1983). The ensemble of calibrated models was used to assess future response of soil and stream chemistry to reduced sulfur deposition at the HBEF. Although both hindcast (1850-1962) and forecast (2005-2100) predictions were qualitatively similar across the four models, the temporal pattern of key indicators of acidification recovery (stream acid neutralizing capacity and soil base saturation) differed substantially. The range in predictions resulted from differences in model structure and their associated posterior parameter distributions. These differences can be accommodated by employing multiple models (ensemble analysis) but have implications for individual model applications.

  5. High resolution satellite remote sensing used in a stratified random sampling scheme to quantify the constituent land cover components of the shifting cultivation mosaic of the Democratic Republic of Congo

    NASA Astrophysics Data System (ADS)

    Molinario, G.; Hansen, M.; Potapov, P.

    2016-12-01

    High resolution satellite imagery obtained from the National Geospatial Intelligence Agency through NASA was used to photo-interpret sample areas within the DRC. The area sampled is a stratification of the forest cover loss from circa 2014 that either occurred completely within the previously mapped homogeneous area of the Rural Complex, at its interface with primary forest, or in isolated forest perforations. Previous research resulted in a map of these areas that contextualizes forest loss depending on where it occurs and with what spatial density, leading to a better understanding of the real impacts of livelihood shifting cultivation on forest degradation. The stratified random sampling approach of these areas allows the characterization of the constituent land cover types within these areas, and their variability throughout the DRC. Shifting cultivation has a variable forest degradation footprint in the DRC depending on many factors that drive it, but its role in forest degradation and deforestation had been disputed, leading us to investigate and quantify the clearing and reuse rates within the strata throughout the country.

  6. A Random Forest Based Risk Model for Reliable and Accurate Prediction of Receipt of Transfusion in Patients Undergoing Percutaneous Coronary Intervention

    PubMed Central

    Gurm, Hitinder S.; Kooiman, Judith; LaLonde, Thomas; Grines, Cindy; Share, David; Seth, Milan

    2014-01-01

    Background: Transfusion is a common complication of Percutaneous Coronary Intervention (PCI) and is associated with adverse short and long term outcomes. There is no risk model for identifying patients most likely to receive transfusion after PCI. The objective of our study was to develop and validate a tool for predicting receipt of blood transfusion in patients undergoing contemporary PCI. Methods: Random forest models were developed utilizing 45 pre-procedural clinical and laboratory variables to estimate the receipt of transfusion in patients undergoing PCI. The most influential variables were selected for inclusion in an abbreviated model. Model performance estimating transfusion was evaluated in an independent validation dataset using area under the ROC curve (AUC), with net reclassification improvement (NRI) used to compare full and reduced model prediction after grouping in low, intermediate, and high risk categories. The impact of procedural anticoagulation on observed versus predicted transfusion rates was assessed for the different risk categories. Results: Our study cohort was comprised of 103,294 PCI procedures performed at 46 hospitals between July 2009 and December 2012 in Michigan, of which 72,328 (70%) were randomly selected for training the models, and 30,966 (30%) for validation. The models demonstrated excellent calibration and discrimination (AUC: full model = 0.888 (95% CI 0.877–0.899), reduced model = 0.880 (95% CI 0.868–0.892), p for difference 0.003, NRI = 2.77%, p = 0.007). Procedural anticoagulation and radial access significantly influenced transfusion rates in the intermediate and high risk patients but no clinically relevant impact was noted in low risk patients, who made up 70% of the total cohort. Conclusions: The risk of transfusion among patients undergoing PCI can be reliably calculated using a novel easy to use computational tool (https://bmc2.org/calculators/transfusion). This risk prediction algorithm may prove useful for both bedside clinical decision making and risk adjustment for assessment of quality. PMID:24816645
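
    The evaluation design in this abstract, a random 70/30 split followed by AUC on the held-out part, can be sketched with the standard library. This is not the authors' code: the helper names are hypothetical, the labels and scores are invented, and the AUC is computed through its pairwise-ranking interpretation rather than by tracing an ROC curve.

```python
import random

def split_train_validation(n_records, train_fraction=0.7, seed=0):
    """Randomly partition record indices into training and validation lists."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    cut = round(n_records * train_fraction)
    return indices[:cut], indices[cut:]

def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(
        (1.0 if p > n else 0.5 if p == n else 0.0)
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Invented outcomes (1 = transfused) and predicted risks.
labels = [1, 1, 0, 0]
scores = [0.9, 0.2, 0.3, 0.1]
train, validation = split_train_validation(10)
print(len(train), len(validation), auc(labels, scores))
```

    The pairwise form is quadratic in the number of records, so it suits small illustrations; a rank-based computation would be preferable at the scale of the cohort described above.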

  7. Using random forest for the risk assessment of coal-floor water inrush in Panjiayao Coal Mine, northern China

    NASA Astrophysics Data System (ADS)

    Zhao, Dekang; Wu, Qiang; Cui, Fangpeng; Xu, Hua; Zeng, Yifan; Cao, Yufei; Du, Yuanze

    2018-04-01

    Coal-floor water-inrush incidents account for a large proportion of coal mine disasters in northern China, and accurate risk assessment is crucial for safe coal production. A novel and promising assessment model for water inrush is proposed based on random forest (RF), which is a powerful intelligent machine-learning algorithm. RF has considerable advantages, including high classification accuracy and the capability to evaluate the importance of variables; in particular, it is robust in dealing with the complicated and non-linear problems inherent in risk assessment. In this study, the proposed model is applied to Panjiayao Coal Mine, northern China. Eight factors were selected as evaluation indices according to systematic analysis of the geological conditions and a field survey of the study area. Risk assessment maps were generated based on RF, and the probabilistic neural network (PNN) model was also used for risk assessment as a comparison. The results demonstrate that the two methods are consistent in the risk assessment of water inrush at the mine, and RF shows a better performance compared to PNN with an overall accuracy higher by 6.67%. It is concluded that RF is more practicable for assessing the water-inrush risk than PNN. The presented method will be helpful in avoiding water inrush and also can be extended to various engineering applications.

  8. Diagnosis of colorectal cancer by near-infrared optical fiber spectroscopy and random forest

    NASA Astrophysics Data System (ADS)

    Chen, Hui; Lin, Zan; Wu, Hegang; Wang, Li; Wu, Tong; Tan, Chao

    2015-01-01

    Near-infrared (NIR) spectroscopy has such advantages as being noninvasive, fast, relatively inexpensive, and free of the risk of ionizing radiation. Differences in the NIR signals can reflect many physiological changes, which are in turn associated with such factors as vascularization, cellularity, oxygen consumption, or remodeling. NIR spectral differences between colorectal cancer and healthy tissues were investigated. A Fourier transform NIR spectroscopy instrument equipped with a fiber-optic probe was used to mimic in situ clinical measurements. A total of 186 spectra were collected and then preprocessed with the standard normal variate (SNV) transform to remove unwanted background variance. All the specimens and spots used for spectral collection were confirmed by staining and examination by an experienced pathologist to ensure that they were representative of the pathology. Principal component analysis (PCA) was used to uncover possible clustering. Several methods including random forest (RF), partial least squares-discriminant analysis (PLSDA), K-nearest neighbor and classification and regression tree (CART) were used to extract spectral features and to construct the diagnostic models. The comparison reveals that, although no obvious difference in misclassified ratio (MCR) was observed between these models, RF is preferable since it is quicker, more convenient and insensitive to over-fitting. The results indicate that NIR spectroscopy coupled with an RF model can serve as a potential tool for discriminating colorectal cancer tissues from normal ones.
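
    The SNV preprocessing mentioned in this abstract is commonly defined per spectrum: subtract that spectrum's own mean and divide by its own standard deviation. A minimal sketch under that usual definition follows; the absorbance values are invented, and the use of the sample (n - 1) standard deviation is an assumption.

```python
import statistics

def snv(spectrum):
    """Centre one spectrum on its own mean and scale by its own standard deviation."""
    mean = statistics.fmean(spectrum)
    std = statistics.stdev(spectrum)  # sample standard deviation (n - 1), assumed
    return [(value - mean) / std for value in spectrum]

# Two invented spectra that differ only by a multiplicative factor.
spectrum_a = [1.0, 2.0, 3.0]
spectrum_b = [2.0, 4.0, 6.0]
print(snv(spectrum_a))
print(snv(spectrum_b))
```

    Because each spectrum is normalized independently, baseline offsets and multiplicative scaling between measurements are removed before any classifier sees the data.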

  9. Global patterns and predictions of seafloor biomass using random forests.

    PubMed

    Wei, Chih-Lin; Rowe, Gilbert T; Escobar-Briones, Elva; Boetius, Antje; Soltwedel, Thomas; Caley, M Julian; Soliman, Yousria; Huettmann, Falk; Qu, Fangyuan; Yu, Zishan; Pitcher, C Roland; Haedrich, Richard L; Wicksten, Mary K; Rex, Michael A; Baguley, Jeffrey G; Sharma, Jyotsna; Danovaro, Roberto; MacDonald, Ian R; Nunnally, Clifton C; Deming, Jody W; Montagna, Paul; Lévesque, Mélanie; Weslawski, Jan Marcin; Wlodarska-Kowalczuk, Maria; Ingole, Baban S; Bett, Brian J; Billett, David S M; Yool, Andrew; Bluhm, Bodil A; Iken, Katrin; Narayanaswamy, Bhavani E

    2010-12-30

    A comprehensive seafloor biomass and abundance database has been constructed from 24 oceanographic institutions worldwide within the Census of Marine Life (CoML) field projects. The machine-learning algorithm, Random Forests, was employed to model and predict seafloor standing stocks from surface primary production, water-column integrated and export particulate organic matter (POM), seafloor relief, and bottom water properties. The predictive models explain 63% to 88% of stock variance among the major size groups. Individual and composite maps of predicted global seafloor biomass and abundance are generated for bacteria, meiofauna, macrofauna, and megafauna (invertebrates and fishes). Patterns of benthic standing stocks were positive functions of surface primary production and delivery of the particulate organic carbon (POC) flux to the seafloor. At a regional scale, the census maps illustrate that integrated biomass is highest at the poles, on continental margins associated with coastal upwelling and with broad zones associated with equatorial divergence. Lowest values are consistently encountered on the central abyssal plains of major ocean basins. The shift of biomass dominance groups with depth is shown to be affected by the decrease in average body size rather than abundance, presumably due to decrease in quantity and quality of food supply. This biomass census and associated maps are vital components of mechanistic deep-sea food web models and global carbon cycling, and as such provide fundamental information that can be incorporated into evidence-based management.

  10. Global Patterns and Predictions of Seafloor Biomass Using Random Forests

    PubMed Central

    Wei, Chih-Lin; Rowe, Gilbert T.; Escobar-Briones, Elva; Boetius, Antje; Soltwedel, Thomas; Caley, M. Julian; Soliman, Yousria; Huettmann, Falk; Qu, Fangyuan; Yu, Zishan; Pitcher, C. Roland; Haedrich, Richard L.; Wicksten, Mary K.; Rex, Michael A.; Baguley, Jeffrey G.; Sharma, Jyotsna; Danovaro, Roberto; MacDonald, Ian R.; Nunnally, Clifton C.; Deming, Jody W.; Montagna, Paul; Lévesque, Mélanie; Weslawski, Jan Marcin; Wlodarska-Kowalczuk, Maria; Ingole, Baban S.; Bett, Brian J.; Billett, David S. M.; Yool, Andrew; Bluhm, Bodil A.; Iken, Katrin; Narayanaswamy, Bhavani E.

    2010-01-01

    A comprehensive seafloor biomass and abundance database has been constructed from 24 oceanographic institutions worldwide within the Census of Marine Life (CoML) field projects. The machine-learning algorithm, Random Forests, was employed to model and predict seafloor standing stocks from surface primary production, water-column integrated and export particulate organic matter (POM), seafloor relief, and bottom water properties. The predictive models explain 63% to 88% of stock variance among the major size groups. Individual and composite maps of predicted global seafloor biomass and abundance are generated for bacteria, meiofauna, macrofauna, and megafauna (invertebrates and fishes). Patterns of benthic standing stocks were positive functions of surface primary production and delivery of the particulate organic carbon (POC) flux to the seafloor. At a regional scale, the census maps illustrate that integrated biomass is highest at the poles, on continental margins associated with coastal upwelling and with broad zones associated with equatorial divergence. Lowest values are consistently encountered on the central abyssal plains of major ocean basins. The shift of biomass dominance groups with depth is shown to be affected by the decrease in average body size rather than abundance, presumably due to decrease in quantity and quality of food supply. This biomass census and associated maps are vital components of mechanistic deep-sea food web models and global carbon cycling, and as such provide fundamental information that can be incorporated into evidence-based management. PMID:21209928

  11. Discrimination of raw and processed Dipsacus asperoides by near infrared spectroscopy combined with least squares-support vector machine and random forests

    NASA Astrophysics Data System (ADS)

    Xin, Ni; Gu, Xiao-Feng; Wu, Hao; Hu, Yu-Zhu; Yang, Zhong-Lin

    2012-04-01

    Most herbal medicines could be processed to fulfill the different requirements of therapy. The purpose of this study was to discriminate between raw and processed Dipsacus asperoides, a common traditional Chinese medicine, based on their near infrared (NIR) spectra. Least squares-support vector machine (LS-SVM) and random forests (RF) were employed for full-spectrum classification. Three types of kernels, including linear kernel, polynomial kernel and radial basis function kernel (RBF), were checked for optimization of LS-SVM model. For comparison, a linear discriminant analysis (LDA) model was performed for classification, and the successive projections algorithm (SPA) was executed prior to building an LDA model to choose an appropriate subset of wavelengths. The three methods were applied to a dataset containing 40 raw herbs and 40 corresponding processed herbs. We ran 50 runs of 10-fold cross validation to evaluate the model's efficiency. The performance of the LS-SVM with RBF kernel (RBF LS-SVM) was better than the other two kernels. The RF, RBF LS-SVM and SPA-LDA successfully classified all test samples. The mean error rates for the 50 runs of 10-fold cross validation were 1.35% for RBF LS-SVM, 2.87% for RF, and 2.50% for SPA-LDA. The best classification results were obtained by using LS-SVM with RBF kernel, while RF was fast in the training and making predictions.
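
    The evaluation protocol in this abstract, 50 runs of 10-fold cross validation on 80 samples, can be sketched as an index generator. This is an illustrative outline only: the abstract does not say whether folds were stratified by class, so plain shuffled folds are assumed, and the function name is hypothetical.

```python
import random

def repeated_kfold(n_samples, k=10, repeats=50, seed=0):
    """Yield (train_indices, test_indices) for every fold of every repeat."""
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random partition for each repeat
        for fold in range(k):
            test = indices[fold::k]  # every k-th shuffled index forms one fold
            held_out = set(test)
            train = [i for i in indices if i not in held_out]
            yield train, test

# 80 samples, as in the abstract (40 raw and 40 processed herbs).
splits = list(repeated_kfold(80))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

    Averaging the per-fold error over all repeats, as the abstract reports, reduces the dependence of the estimate on any single random partition of a small dataset.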

  12. Performance of thigh-mounted triaxial accelerometer algorithms in objective quantification of sedentary behaviour and physical activity in older adults

    PubMed Central

    Wullems, Jorgen A.; Verschueren, Sabine M. P.; Degens, Hans; Morse, Christopher I.; Onambélé, Gladys L.

    2017-01-01

    Accurate monitoring of sedentary behaviour and physical activity is key to investigating their exact role in healthy ageing. To date, accelerometers using cut-off point models are most preferred for this; however, machine learning seems a highly promising future alternative. Hence, the current study compared cut-off point and machine learning algorithms for optimal quantification of sedentary behaviour and physical activity intensities in the elderly. Thus, in a heterogeneous sample of forty participants (aged ≥60 years, 50% female) energy expenditure during laboratory-based activities (ranging from sedentary behaviour through to moderate-to-vigorous physical activity) was estimated by indirect calorimetry, whilst wearing triaxial thigh-mounted accelerometers. Three cut-off point algorithms and a Random Forest machine learning model were developed and cross-validated using the collected data. Detailed analyses were performed to check algorithm robustness, and examine and benchmark both overall and participant-specific balanced accuracies. This revealed that the four models can at least be used to confidently monitor sedentary behaviour and moderate-to-vigorous physical activity. Nevertheless, the machine learning algorithm outperformed the cut-off point models by being robust to all individuals' physiological and non-physiological characteristics and by performing at an acceptable level over more of the range of physical activity intensities. Therefore, we propose that Random Forest machine learning may be optimal for objective assessment of sedentary behaviour and physical activity in older adults using thigh-mounted triaxial accelerometry. PMID:29155839

  13. Performance of thigh-mounted triaxial accelerometer algorithms in objective quantification of sedentary behaviour and physical activity in older adults.

    PubMed

    Wullems, Jorgen A; Verschueren, Sabine M P; Degens, Hans; Morse, Christopher I; Onambélé, Gladys L

    2017-01-01

    Accurate monitoring of sedentary behaviour and physical activity is key to investigating their exact role in healthy ageing. To date, accelerometers using cut-off point models are most preferred for this; however, machine learning seems a highly promising future alternative. Hence, the current study compared cut-off point and machine learning algorithms for optimal quantification of sedentary behaviour and physical activity intensities in the elderly. Thus, in a heterogeneous sample of forty participants (aged ≥60 years, 50% female) energy expenditure during laboratory-based activities (ranging from sedentary behaviour through to moderate-to-vigorous physical activity) was estimated by indirect calorimetry, whilst wearing triaxial thigh-mounted accelerometers. Three cut-off point algorithms and a Random Forest machine learning model were developed and cross-validated using the collected data. Detailed analyses were performed to check algorithm robustness, and examine and benchmark both overall and participant-specific balanced accuracies. This revealed that the four models can at least be used to confidently monitor sedentary behaviour and moderate-to-vigorous physical activity. Nevertheless, the machine learning algorithm outperformed the cut-off point models by being robust to all individuals' physiological and non-physiological characteristics and by performing at an acceptable level over more of the range of physical activity intensities. Therefore, we propose that Random Forest machine learning may be optimal for objective assessment of sedentary behaviour and physical activity in older adults using thigh-mounted triaxial accelerometry.

  14. Optimal Subset Selection of Time-Series MODIS Images and Sample Data Transfer with Random Forests for Supervised Classification Modelling.

    PubMed

    Zhou, Fuqun; Zhang, Aining

    2016-10-25

    Nowadays, various time-series Earth Observation data with multiple bands are freely available, such as Moderate Resolution Imaging Spectroradiometer (MODIS) datasets including 8-day composites from NASA, and 10-day composites from the Canada Centre for Remote Sensing (CCRS). It is challenging to efficiently use these time-series MODIS datasets for long-term environmental monitoring due to their vast volume and information redundancy. This challenge will be greater when Sentinel 2-3 data become available. Another challenge that researchers face is the lack of in-situ data for supervised modelling, especially for time-series data analysis. In this study, we attempt to tackle these two important issues with a case study of land cover mapping using CCRS 10-day MODIS composites with the help of two Random Forests features: variable importance and outlier identification. The variable importance feature is used to analyze and select optimal subsets of time-series MODIS imagery for efficient land cover mapping, and the outlier identification feature is utilized for transferring sample data available from one year to an adjacent year for supervised classification modelling. The results of the case study of agricultural land cover classification at a regional scale show that, using only about half of the variables, we can achieve land cover classification accuracy close to that generated using the full dataset. The proposed simple but effective solution of sample transferring could make supervised modelling possible for applications lacking sample data.

  15. Strategies for minimizing sample size for use in airborne LiDAR-based forest inventory

    USGS Publications Warehouse

    Junttila, Virpi; Finley, Andrew O.; Bradford, John B.; Kauranne, Tuomo

    2013-01-01

    Recently airborne Light Detection And Ranging (LiDAR) has emerged as a highly accurate remote sensing modality to be used in operational scale forest inventories. Inventories conducted with the help of LiDAR are most often model-based, i.e. they use variables derived from LiDAR point clouds as the predictive variables that are to be calibrated using field plots. The measurement of the necessary field plots is a time-consuming and statistically sensitive process. Because of this, current practice often presumes hundreds of plots to be collected. But since these plots are only used to calibrate regression models, it should be possible to minimize the number of plots needed by carefully selecting the plots to be measured. In the current study, we compare several systematic and random methods for calibration plot selection, with the specific aim that they be used in LiDAR based regression models for forest parameters, especially above-ground biomass. The primary criteria compared are based on both spatial representativity as well as on their coverage of the variability of the forest features measured. In the former case, it is important also to take into account spatial auto-correlation between the plots. The results indicate that choosing the plots in a way that ensures ample coverage of both spatial and feature space variability improves the performance of the corresponding models, and that adequate coverage of the variability in the feature space is the most important condition that should be met by the set of plots collected.

  16. Quantifying uncertainty in national forest carbon stocks: challenges and opportunities for the United States National Greenhouse Gas Inventory

    NASA Astrophysics Data System (ADS)

    Clough, B.; Russell, M.; Domke, G. M.; Woodall, C. W.

    2016-12-01

    Uncertainty estimates are needed to establish confidence in national forest carbon stocks and to verify changes reported to the United Nations Framework Convention on Climate Change. Good practice guidance from the Intergovernmental Panel on Climate Change stipulates that uncertainty assessments should neither exaggerate nor underestimate the actual error within carbon stocks, yet methodological guidance for forests has been hampered by limited understanding of how complex dynamics give rise to errors across spatial scales (i.e., individuals to continents). This talk highlights efforts to develop a multi-scale, data-driven framework for assessing uncertainty within the United States (US) forest carbon inventory, and focuses on challenges and opportunities for improving the precision of national forest carbon stock estimates. Central to our approach is the calibration of allometric models with a newly established legacy biomass database for North American tree species, and the use of hierarchical models to link these data with the Forest Inventory and Analysis (FIA) database as well as remote sensing datasets. Our work suggests substantial risk for misestimating key sources of uncertainty including: (1) attributing more confidence in allometric models than what is warranted by the best available data; (2) failing to capture heterogeneity in biomass stocks due to environmental variation at regional scales; and (3) ignoring spatial autocorrelation and other random effects that are characteristic of national forest inventory data. Our results suggest these sources of error may be much higher than is generally assumed, though these results must be understood with the limited scope and availability of appropriate calibration data in mind. 
    In addition to reporting on important sources of uncertainty, this talk will discuss opportunities to improve the precision of national forest carbon stocks that are motivated by our use of data-driven forecasting including: (1) improving the taxonomic and geographic scope of available biomass data; (2) direct attribution of landscape-level heterogeneity in biomass stocks to specific ecological processes; and (3) integration of expert opinion and meta-analysis to lessen the influence of often highly variable datasets on biomass stock forecasts.

  17. Can Predictive Modeling Identify Head and Neck Oncology Patients at Risk for Readmission?

    PubMed

    Manning, Amy M; Casper, Keith A; Peter, Kay St; Wilson, Keith M; Mark, Jonathan R; Collar, Ryan M

    2018-05-01

    Objective Unplanned readmission within 30 days is a contributor to health care costs in the United States. The use of predictive modeling during hospitalization to identify patients at risk for readmission offers a novel approach to quality improvement and cost reduction. Study Design Two-phase study including retrospective analysis of prospectively collected data followed by prospective longitudinal study. Setting Tertiary academic medical center. Subjects and Methods Prospectively collected data for patients undergoing surgical treatment for head and neck cancer from January 2013 to January 2015 were used to build predictive models for readmission within 30 days of discharge using logistic regression, classification and regression tree (CART) analysis, and random forests. One model (logistic regression) was then placed prospectively into the discharge workflow from March 2016 to May 2016 to determine the model's ability to predict which patients would be readmitted within 30 days. Results In total, 174 admissions had descriptive data. Thirty-two were excluded due to incomplete data. Logistic regression, CART, and random forest predictive models were constructed using the remaining 142 admissions. When applied to 106 consecutive prospective head and neck oncology patients at the time of discharge, the logistic regression model predicted readmissions with a specificity of 94%, a sensitivity of 47%, a negative predictive value of 90%, and a positive predictive value of 62% (odds ratio, 14.9; 95% confidence interval, 4.02-55.45). Conclusion Prospectively collected head and neck cancer databases can be used to develop predictive models that can accurately predict which patients will be readmitted. This offers valuable support for quality improvement initiatives and readmission-related cost reduction in head and neck cancer care.

  18. High-Risk Breast Lesions: A Machine Learning Model to Predict Pathologic Upgrade and Reduce Unnecessary Surgical Excision.

    PubMed

    Bahl, Manisha; Barzilay, Regina; Yedidia, Adam B; Locascio, Nicholas J; Yu, Lili; Lehman, Constance D

    2018-03-01

    Purpose To develop a machine learning model that allows high-risk breast lesions (HRLs) diagnosed with image-guided needle biopsy that require surgical excision to be distinguished from HRLs that are at low risk for upgrade to cancer at surgery and thus could be surveilled. Materials and Methods Consecutive patients with biopsy-proven HRLs who underwent surgery or at least 2 years of imaging follow-up from June 2006 to April 2015 were identified. A random forest machine learning model was developed to identify HRLs at low risk for upgrade to cancer. Traditional features such as age and HRL histologic results were used in the model, as were text features from the biopsy pathologic report. Results One thousand six HRLs were identified, with a cancer upgrade rate of 11.4% (115 of 1006). A machine learning random forest model was developed with 671 HRLs and tested with an independent set of 335 HRLs. Among the most important traditional features were age and HRL histologic results (eg, atypical ductal hyperplasia). An important text feature from the pathologic reports was "severely atypical." Instead of surgical excision of all HRLs, if those categorized with the model to be at low risk for upgrade were surveilled and the remainder were excised, then 97.4% (37 of 38) of malignancies would have been diagnosed at surgery, and 30.6% (91 of 297) of surgeries of benign lesions could have been avoided. Conclusion This study provides proof of concept that a machine learning model can be applied to predict the risk of upgrade of HRLs to cancer. Use of this model could decrease unnecessary surgery by nearly one-third and could help guide clinical decision making with regard to surveillance versus surgical excision of HRLs. © RSNA, 2017.
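
    The percentages in this abstract can be rechecked from the counts it quotes. The short sketch below does only that arithmetic; the variable names are illustrative and are not taken from the study.

```python
# Counts quoted in the abstract above; the variable names are illustrative.
total_hrls = 1006
upgraded_to_cancer = 115
train_n, test_n = 671, 335
test_malignant, test_benign = 38, 297
malignancies_excised = 37        # cancers that would still be diagnosed at surgery
benign_surgeries_avoided = 91


def pct(numerator, denominator):
    """Return a percentage rounded to one decimal place."""
    return round(100.0 * numerator / denominator, 1)


upgrade_rate = pct(upgraded_to_cancer, total_hrls)
malignancies_caught = pct(malignancies_excised, test_malignant)
surgeries_avoided = pct(benign_surgeries_avoided, test_benign)

print(upgrade_rate, malignancies_caught, surgeries_avoided)
```

    The training and test sets add up to the 1006 lesions, and the 38 malignant plus 297 benign lesions account for the 335 test cases, so the reported figures are internally consistent.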

  19. Evaluating the potential for site-specific modification of LiDAR DEM derivatives to improve environmental planning-scale wetland identification using Random Forest classification

    NASA Astrophysics Data System (ADS)

    O'Neil, Gina L.; Goodall, Jonathan L.; Watson, Layne T.

    2018-04-01

    Wetlands are important ecosystems that provide many ecological benefits, and their quality and presence are protected by federal regulations. These regulations require wetland delineations, which can be costly and time-consuming to perform. Computer models can assist in this process, but lack the accuracy necessary for environmental planning-scale wetland identification. In this study, the potential for improvement of wetland identification models through modification of digital elevation model (DEM) derivatives, derived from high-resolution and increasingly available light detection and ranging (LiDAR) data, at a scale necessary for small-scale wetland delineations is evaluated. A novel approach of flow convergence modelling is presented where Topographic Wetness Index (TWI), curvature, and Cartographic Depth-to-Water index (DTW), are modified to better distinguish wetland from upland areas, combined with ancillary soil data, and used in a Random Forest classification. This approach is applied to four study sites in Virginia, implemented as an ArcGIS model. The model resulted in significant improvement in average wetland accuracy compared to the commonly used National Wetland Inventory (84.9% vs. 32.1%), at the expense of a moderately lower average non-wetland accuracy (85.6% vs. 98.0%) and average overall accuracy (85.6% vs. 92.0%). From this, we concluded that modifying TWI, curvature, and DTW provides more robust wetland and non-wetland signatures to the models by improving accuracy rates compared to classifications using the original indices. The resulting ArcGIS model is a general tool able to modify these local LiDAR DEM derivatives based on site characteristics to identify wetlands at a high resolution.

  20. Machine Learning Algorithm Predicts Cardiac Resynchronization Therapy Outcomes: Lessons From the COMPANION Trial.

    PubMed

    Kalscheur, Matthew M; Kipp, Ryan T; Tattersall, Matthew C; Mei, Chaoqun; Buhr, Kevin A; DeMets, David L; Field, Michael E; Eckhardt, Lee L; Page, C David

    2018-01-01

    Cardiac resynchronization therapy (CRT) reduces morbidity and mortality in heart failure patients with reduced left ventricular function and intraventricular conduction delay. However, individual outcomes vary significantly. This study sought to use a machine learning algorithm to develop a model to predict outcomes after CRT. Models were developed with machine learning algorithms to predict all-cause mortality or heart failure hospitalization at 12 months post-CRT in the COMPANION trial (Comparison of Medical Therapy, Pacing, and Defibrillation in Heart Failure). The best performing model was developed with the random forest algorithm. The ability of this model to predict all-cause mortality or heart failure hospitalization and all-cause mortality alone was compared with discrimination obtained using a combination of bundle branch block morphology and QRS duration. In the 595 patients with CRT-defibrillator in the COMPANION trial, 105 deaths occurred (median follow-up, 15.7 months). The survival difference across subgroups differentiated by bundle branch block morphology and QRS duration did not reach significance (P = 0.08). The random forest model produced quartiles of patients with an 8-fold difference in survival between those with the highest and lowest predicted probability for events (hazard ratio, 7.96; P < 0.0001). The model also discriminated the risk of the composite end point of all-cause mortality or heart failure hospitalization better than subgroups based on bundle branch block morphology and QRS duration. In the COMPANION trial, a machine learning algorithm produced a model that predicted clinical outcomes after CRT. Applied before device implant, this model may better differentiate outcomes over current clinical discriminators and improve shared decision-making with patients. © 2018 American Heart Association, Inc.

  1. RS-Forest: A Rapid Density Estimator for Streaming Anomaly Detection.

    PubMed

    Wu, Ke; Zhang, Kun; Fan, Wei; Edwards, Andrea; Yu, Philip S

    Anomaly detection in streaming data is of high interest in numerous application domains. In this paper, we propose a novel one-class semi-supervised algorithm to detect anomalies in streaming data. Underlying the algorithm is a fast and accurate density estimator implemented by multiple fully randomized space trees (RS-Trees), named RS-Forest. The piecewise constant density estimate of each RS-tree is defined on the tree node into which an instance falls. Each incoming instance in a data stream is scored by the density estimates averaged over all trees in the forest. Two strategies, statistical attribute range estimation of high probability guarantee and dual node profiles for rapid model update, are seamlessly integrated into RS-Forest to systematically address the ever-evolving nature of data streams. We derive the theoretical upper bound for the proposed algorithm and analyze its asymptotic properties via bias-variance decomposition. Empirical comparisons to the state-of-the-art methods on multiple benchmark datasets demonstrate that the proposed method features high detection rate, fast response, and insensitivity to most of the parameter settings. Algorithm implementations and datasets are available upon request.

  2. RS-Forest: A Rapid Density Estimator for Streaming Anomaly Detection

    PubMed Central

    Wu, Ke; Zhang, Kun; Fan, Wei; Edwards, Andrea; Yu, Philip S.

    2015-01-01

    Anomaly detection in streaming data is of high interest in numerous application domains. In this paper, we propose a novel one-class semi-supervised algorithm to detect anomalies in streaming data. Underlying the algorithm is a fast and accurate density estimator implemented by multiple fully randomized space trees (RS-Trees), named RS-Forest. The piecewise constant density estimate of each RS-tree is defined on the tree node into which an instance falls. Each incoming instance in a data stream is scored by the density estimates averaged over all trees in the forest. Two strategies, statistical attribute range estimation of high probability guarantee and dual node profiles for rapid model update, are seamlessly integrated into RS-Forest to systematically address the ever-evolving nature of data streams. We derive the theoretical upper bound for the proposed algorithm and analyze its asymptotic properties via bias-variance decomposition. Empirical comparisons to the state-of-the-art methods on multiple benchmark datasets demonstrate that the proposed method features high detection rate, fast response, and insensitivity to most of the parameter settings. Algorithm implementations and datasets are available upon request. PMID:25685112
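
    The idea of scoring an instance by a piecewise constant density averaged over randomized space trees can be sketched as follows. This is a heavily simplified illustration and not the authors' RS-Forest implementation: it omits the attribute range estimation and the dual node profiles used for streaming updates, and the function names, tree depth, number of trees, and cut-point rule are all assumptions made for the example.

```python
import random


def build_tree(bounds, depth, rng):
    """Partition the box `bounds` with random axis-aligned cuts (no data needed)."""
    if depth == 0:
        volume = 1.0
        for low, high in bounds:
            volume *= high - low
        return {"leaf": True, "volume": volume, "count": 0}
    dim = rng.randrange(len(bounds))
    low, high = bounds[dim]
    # Assumption: keep cuts away from the cell edges so no cell has zero volume.
    cut = low + (high - low) * rng.uniform(0.25, 0.75)
    left, right = list(bounds), list(bounds)
    left[dim], right[dim] = (low, cut), (cut, high)
    return {"leaf": False, "dim": dim, "cut": cut,
            "left": build_tree(left, depth - 1, rng),
            "right": build_tree(right, depth - 1, rng)}


def find_leaf(node, x):
    """Follow the cuts down to the leaf cell that contains x."""
    while not node["leaf"]:
        node = node["left"] if x[node["dim"]] < node["cut"] else node["right"]
    return node


def fit_forest(data, bounds, n_trees=50, depth=6, seed=0):
    """Build the random trees, then count the training instances per leaf."""
    rng = random.Random(seed)
    forest = [build_tree(bounds, depth, rng) for _ in range(n_trees)]
    for tree in forest:
        for x in data:
            find_leaf(tree, x)["count"] += 1
    return forest


def density_score(forest, x, n_train):
    """Average the per-tree estimate count / (n_train * cell volume)."""
    total = 0.0
    for tree in forest:
        leaf = find_leaf(tree, x)
        total += leaf["count"] / (n_train * leaf["volume"])
    return total / len(forest)


def clip(value):
    """Keep a coordinate inside the unit interval."""
    return min(max(value, 0.0), 1.0)


# Hypothetical "normal" data clustered near the centre of the unit square.
data_rng = random.Random(1)
normal = [(clip(data_rng.gauss(0.5, 0.05)), clip(data_rng.gauss(0.5, 0.05)))
          for _ in range(300)]
forest = fit_forest(normal, bounds=[(0.0, 1.0), (0.0, 1.0)])
centre_score = density_score(forest, (0.5, 0.5), len(normal))
corner_score = density_score(forest, (0.95, 0.05), len(normal))
print(centre_score, corner_score)
```

    A low averaged density marks a candidate anomaly, so in this toy setting the far corner of the square should score well below the centre of the cluster.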

  3. Modeling biophysical properties of broad-leaved stands in the hyrcanian forests of Iran using fused airborne laser scanner data and ultraCam-D images

    NASA Astrophysics Data System (ADS)

    Mohammadi, Jahangir; Shataee, Shaban; Namiranian, Manochehr; Næsset, Erik

    2017-09-01

    Inventories of mixed broad-leaved forests of Iran mainly rely on terrestrial measurements. Due to rapid changes and disturbances and great complexity of the silvicultural systems of these multilayer forests, frequent repetition of conventional ground-based plot surveys is often cost prohibitive. Airborne laser scanning (ALS) and multispectral data offer an alternative or supplement to conventional inventories in the Hyrcanian forests of Iran. In this study, the capability of a combination of ALS and UltraCam-D data to model stand volume, tree density, and basal area using random forest (RF) algorithm was evaluated. Systematic sampling was applied to collect field plot data on a 150 m × 200 m sampling grid within a 1100 ha study area located at 36°38′-36°42′N and 54°24′-54°25′E. A total of 308 circular plots (0.1 ha) were measured for calculation of stand volume, tree density, and basal area per hectare. For each plot, a set of variables was extracted from both ALS and multispectral data. The RF algorithm was used for modeling of the biophysical properties using ALS and UltraCam-D data separately and combined. The results showed that combining the ALS data and UltraCam-D images provided a slight increase in prediction accuracy compared to separate modeling. The RMSE as percentage of the mean, the mean difference between observed and predicted values, and standard deviation of the differences using a combination of ALS data and UltraCam-D images in an independent validation at 0.1-ha plot level were 31.7%, 1.1%, and 84 m³ ha⁻¹ for stand volume; 27.2%, 0.86%, and 6.5 m² ha⁻¹ for basal area, and 35.8%, -4.6%, and 77.9 n ha⁻¹ for tree density, respectively. Based on the results, we conclude that fusion of ALS and UltraCam-D data may be useful for modeling of stand volume, basal area, and tree density and thus gain insights into structural characteristics in the complex Hyrcanian forests.
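
    The three validation statistics named in this abstract (RMSE as a percentage of the mean, mean difference, and standard deviation of the differences) can be computed as in the sketch below. The plot values are hypothetical, and the sign convention (observed minus predicted) and the use of the sample standard deviation are assumptions, since the abstract does not state them.

```python
import math


def validation_metrics(observed, predicted):
    """Relative RMSE (%), relative mean difference (%), and SD of differences."""
    n = len(observed)
    mean_obs = sum(observed) / n
    # Assumed sign convention: observed minus predicted.
    diffs = [o - p for o, p in zip(observed, predicted)]
    rmse = math.sqrt(sum(d * d for d in diffs) / n)
    mean_diff = sum(diffs) / n
    # Assumed: sample standard deviation (n - 1 in the denominator).
    sd_diff = math.sqrt(sum((d - mean_diff) ** 2 for d in diffs) / (n - 1))
    return {
        "rmse_pct": 100.0 * rmse / mean_obs,
        "mean_diff_pct": 100.0 * mean_diff / mean_obs,
        "sd_diff": sd_diff,
    }


# Hypothetical plot-level stand volumes; not data from the study.
observed = [200.0, 300.0, 250.0, 350.0]
predicted = [210.0, 280.0, 260.0, 330.0]
metrics = validation_metrics(observed, predicted)
print(metrics)
```

    The first two statistics are unitless percentages, while the standard deviation keeps the units of the modeled attribute, which matches how the abstract reports them.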

  4. Effects of fire on spotted owl site occupancy in a late-successional forest

    USGS Publications Warehouse

    Roberts, Susan L.; van Wagtendonk, Jan W.; Miles, A. Keith; Kelt, Douglas A.

    2011-01-01

    The spotted owl (Strix occidentalis) is a late-successional forest dependent species that is sensitive to forest management practices throughout its range. An increase in the frequency and spatial extent of stand-replacing fires in western North America has prompted concern for the persistence of spotted owls and other sensitive late-successional forest associated species. However, there is sparse information on the effects of fire on spotted owls to guide conservation policies. In 2004-2005, we surveyed for California spotted owls during the breeding season at 32 random sites (16 burned, 16 unburned) throughout late-successional montane forest in Yosemite National Park, California. Our burned areas burned at all severities, but predominantly involved low to moderate fire severity. Based on an information theoretic approach, spotted owl detection and occupancy rates were similar between burned and unburned sites. Nest and roost site occupancy was best explained by a model that combined total tree basal area (positive effect) with cover by coarse woody debris (negative effect). The density estimates of California spotted owl pairs were similar in burned and unburned forests, and the overall mean density estimate for Yosemite was higher than previously reported for montane forests. Our results indicate that low to moderate severity fires, historically common within montane forests of the Sierra Nevada, California, maintain habitat characteristics essential for spotted owl site occupancy. These results suggest that managed fires that emulate the historic fire regime of these forests may maintain spotted owl habitat and protect this species from the effects of future catastrophic fires.

  5. Combination of lateral and PA view radiographs to study development of knee OA and associated pain

    NASA Astrophysics Data System (ADS)

    Minciullo, Luca; Thomson, Jessie; Cootes, Timothy F.

    2017-03-01

    Knee Osteoarthritis (OA) is the most common form of arthritis, affecting millions of people around the world. The effects of the disease have been studied using the shape and texture features of bones in Posterior-Anterior (PA) and Lateral radiographs separately. In this work we compare the utility of features from each view, and evaluate whether combining features from both is advantageous. We built a fully automated system to independently locate landmark points in both radiographic images using Random Forest Constrained Local Models. We extracted discriminative features from the two bony outlines using Appearance Models. The features were used to train Random Forest classifiers to solve three specific tasks: (i) OA classification, distinguishing patients with structural signs of OA from the others; (ii) predicting future onset of the disease; and (iii) predicting which patients with no current pain will have a positive pain score later in a follow-up visit. Using a subset of the MOST dataset we show that the PA view has more discriminative features to classify and predict OA, while the lateral view contains features that achieve better performance in predicting pain, and that combining the features from both views gives a small improvement in accuracy of the classification compared to the individual views.

  6. Predicting human liver microsomal stability with machine learning techniques.

    PubMed

    Sakiyama, Yojiro; Yuki, Hitomi; Moriya, Takashi; Hattori, Kazunari; Suzuki, Misaki; Shimada, Kaoru; Honma, Teruki

    2008-02-01

    To ensure a continuing pipeline in pharmaceutical research, lead candidates must possess appropriate metabolic stability in the drug discovery process. In vitro ADMET (absorption, distribution, metabolism, elimination, and toxicity) screening provides us with useful information regarding the metabolic stability of compounds. However, before the synthesis stage, an efficient process is required in order to deal with the vast quantity of data from large compound libraries and high-throughput screening. Here we have derived a relationship between the chemical structure and its metabolic stability for a data set of in-house compounds by means of various in silico machine learning such as random forest, support vector machine (SVM), logistic regression, and recursive partitioning. For model building, 1952 proprietary compounds comprising two classes (stable/unstable) were used with 193 descriptors calculated by Molecular Operating Environment. The results using test compounds have demonstrated that all classifiers yielded satisfactory results (accuracy > 0.8, sensitivity > 0.9, specificity > 0.6, and precision > 0.8). Above all, classification by random forest as well as SVM yielded kappa values of approximately 0.7 in an independent validation set, slightly higher than other classification tools. These results suggest that nonlinear/ensemble-based classification methods might prove useful in the area of in silico ADME modeling.
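
    The classification statistics quoted in this abstract (accuracy, sensitivity, specificity, precision, and kappa) all follow from a two-class confusion matrix. A minimal sketch is given below; the counts are hypothetical and are not the study's results, and kappa is taken to be Cohen's kappa, which is an assumption because the abstract does not name the variant.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, sensitivity, specificity, precision, and Cohen's kappa."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    # Agreement expected by chance, from the row and column marginals.
    p_positive = ((tp + fp) / n) * ((tp + fn) / n)
    p_negative = ((tn + fn) / n) * ((tn + fp) / n)
    p_expected = p_positive + p_negative
    kappa = (accuracy - p_expected) / (1.0 - p_expected)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "kappa": kappa,
    }


# Hypothetical counts for a stable/unstable classifier; not data from the study.
result = binary_metrics(tp=45, fp=10, fn=5, tn=40)
print(result)
```

    Kappa discounts the agreement expected by chance, which is why it can sit well below raw accuracy when one class dominates.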

  7. Evaluating the statistical performance of less applied algorithms in classification of worldview-3 imagery data in an urbanized landscape

    NASA Astrophysics Data System (ADS)

    Ranaie, Mehrdad; Soffianian, Alireza; Pourmanafi, Saeid; Mirghaffari, Noorollah; Tarkesh, Mostafa

    2018-03-01

    In the past decade, analysis of remotely sensed imagery has become one of the most common and widely used procedures in environmental studies, and supervised image classification techniques play a central role in it. Using a high-resolution Worldview-3 image of a mixed urbanized landscape in Iran, three less commonly applied image classification methods (bagged CART, a stochastic gradient boosting model, and a neural network with feature extraction) were tested and compared with two prevalent methods: random forest and a support vector machine with a linear kernel. Each method was run ten times, and three validation techniques were used to estimate the accuracy statistics: cross-validation, independent validation, and validation with the full training data. Moreover, ANOVA and the Tukey test were used to assess whether differences between the classification methods were statistically significant. In general, the results showed that random forest was the best-performing method, by a marginal difference over bagged CART and the stochastic gradient boosting model, whereas under independent validation there was no significant difference between the performances of the classification methods. Finally, the neural network with feature extraction and the linear support vector machine had better processing speed than the other methods.

  8. Predicting Blood Lactate Concentration and Oxygen Uptake from sEMG Data during Fatiguing Cycling Exercise.

    PubMed

    Ražanskas, Petras; Verikas, Antanas; Olsson, Charlotte; Viberg, Per-Arne

    2015-08-19

    This article presents a study of the relationship between electromyographic (EMG) signals from vastus lateralis, rectus femoris, biceps femoris and semitendinosus muscles, collected during fatiguing cycling exercises, and other physiological measurements, such as blood lactate concentration and oxygen consumption. In contrast to the usual practice of picking one particular characteristic of the signal, e.g., the median or mean frequency, multiple variables were used to obtain a thorough characterization of EMG signals in the spectral domain. Based on these variables, linear and non-linear (random forest) models were built to predict blood lactate concentration and oxygen consumption. The results showed that mean and median frequencies are sub-optimal choices for predicting these physiological quantities in dynamic exercises, as they did not exhibit significant changes over the course of our protocol and only weakly correlated with blood lactate concentration or oxygen uptake. Instead, the root mean square of the original signal and backward difference, as well as parameters describing the tails of the EMG power distribution were the most important variables for these models. Coefficients of determination ranging from R(2) = 0.77 to R(2) = 0.98 (for blood lactate) and from R(2) = 0.81 to R(2) = 0.97 (for oxygen uptake) were obtained when using random forest regressors.
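
    The coefficient of determination reported here compares the residual sum of squares with the total sum of squares around the mean. The sketch below uses the standard definition with hypothetical values; the abstract does not say which variant of R² the authors computed.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical blood lactate values (mmol/L) and model predictions.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(observed, predicted)
print(round(r2, 2))
```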

  9. Source identification of western Oregon Douglas-fir wood cores using mass spectrometry and random forest classification

    PubMed Central

    Finch, Kristen; Espinoza, Edgard; Jones, F. Andrew; Cronn, Richard

    2017-01-01

    Premise of the study: We investigated whether wood metabolite profiles from direct analysis in real time (time-of-flight) mass spectrometry (DART-TOFMS) could be used to determine the geographic origin of Douglas-fir wood cores originating from two regions in western Oregon, USA. Methods: Three annual ring mass spectra were obtained from 188 adult Douglas-fir trees, and these were analyzed using random forest models to determine whether samples could be classified to geographic origin, growth year, or growth year and geographic origin. Specific wood molecules that contributed to geographic discrimination were identified. Results: Douglas-fir mass spectra could be differentiated into two geographic classes with an accuracy between 70% and 76%. Classification models could not accurately classify sample mass spectra based on growth year. Thirty-two molecules were identified as key for classifying western Oregon Douglas-fir wood cores to geographic origin. Discussion: DART-TOFMS is capable of detecting minute but regionally informative differences in wood molecules over a small geographic scale, and these differences made it possible to predict the geographic origin of Douglas-fir wood with moderate accuracy. Studies involving DART-TOFMS, alone and in combination with other technologies, will be relevant for identifying the geographic origin of illegally harvested wood. PMID:28529831

  10. Effects of the amount and composition of the forest floor on emergence and early establishment of loblolly pine seedlings

    Treesearch

    Michael G. Shelton

    1995-01-01

    Five forest floor weights (0, 10, 20, 30, and 40 Mg/ha), three forest floor compositions (pine, pine-hardwood, and hardwood), and two seed placements (forest floor and soil surface) were tested in a three-factorial, split-plot design with four incomplete, randomized blocks. The experiment was conducted in a nursery setting and used wooden frames to define 0.145-m

  11. Forest-floor disturbance reduces chipmunk (Tamias spp.) abundance two years after variable-retention harvest of Pacific Northwestern forests

    Treesearch

    Randall J. Wilk; Timothy B. Harrington; Robert A. Gitzen; Chris C. Maguire

    2015-01-01

    We evaluated the two-year effects of variable-retention harvest on chipmunk (Tamias spp.) abundance (N^) and habitat in mature coniferous forests in western Oregon and Washington because wildlife responses to density/pattern of retained trees remain largely unknown. In a randomized complete-block design, six...

  12. Highlights of the national evaluation of the Forest Stewardship Planning Program

    Treesearch

    R.J. Moulton; J.D. Esseks

    2001-01-01

    In 1998 and 1999, a nationwide random sample of 1238 nonindustrial private (NIPF) landowners with approved multiple resource Forest Stewardship Plans were interviewed to determine if this program is meeting its Congressional mandate of promoting sustainable management of forest resources on NIPF ownerships. It was found that two-thirds of program participants had never...

  13. Ownership and ecosystem as sources of spatial heterogeneity in a forested landscape, Wisconsin, USA

    Treesearch

    Thomas R. Crow; George E. Host; David J. Mladenoff

    1999-01-01

    The interaction between physical environment and land ownership in creating spatial heterogeneity was studied in largely forested landscapes of northern Wisconsin, USA. A stratified random approach was used in which 2500-ha plots representing two ownerships (National Forest and private non-industrial) were located within two regional ecosystems (extremely well-drained...

  14. Modeling and mapping abundance of American Woodcock across the Midwestern and Northeastern United States

    USGS Publications Warehouse

    Thogmartin, W.E.; Sauer, J.R.; Knutson, M.G.

    2007-01-01

    We used an over-dispersed Poisson regression with fixed and random effects, fitted by Markov chain Monte Carlo methods, to model population spatial patterns of relative abundance of American woodcock (Scolopax minor) across its breeding range in the United States. We predicted North American woodcock Singing Ground Survey counts with a log-linear function of explanatory variables describing habitat, year effects, and observer effects. The model also included a conditional autoregressive term representing potential correlation between adjacent route counts. Categories of explanatory habitat variables in the model included land-cover composition, climate, terrain heterogeneity, and human influence. Woodcock counts were higher in landscapes with more forest, especially aspen (Populus tremuloides) and birch (Betula spp.) forest, and in locations with a high degree of interspersion among forest, shrubs, and grasslands. Woodcock counts were lower in landscapes with a high degree of human development. The most noteworthy practical application of this spatial modeling approach was the ability to map predicted relative abundance. Based on a map of predicted relative abundance derived from the posterior parameter estimates, we identified major concentrations of woodcock abundance in east-central Minnesota, USA, the intersection of Vermont, USA, New York, USA, and Ontario, Canada, the upper peninsula of Michigan, USA, and St. Lawrence County, New York. The functional relations we elucidated for the American woodcock provide a basis for the development of management programs and the model and map may serve to focus management and monitoring on areas and habitat features important to American woodcock.

  15. Probability machines: consistent probability estimation using nonparametric learning machines.

    PubMed

    Malley, J D; Kruppa, J; Dasgupta, A; Malley, K G; Ziegler, A

    2012-01-01

    Most machine learning approaches only provide a classification for binary responses. However, probabilities are required for risk estimation using individual patient characteristics. It has been shown recently that every statistical learning machine known to be consistent for a nonparametric regression problem is a probability machine that is provably consistent for this estimation problem. The aim of this paper is to show how random forests and nearest neighbors can be used for consistent estimation of individual probabilities. Two random forest algorithms and two nearest neighbor algorithms are described in detail for estimation of individual probabilities. We discuss the consistency of random forests, nearest neighbors and other learning machines in detail. We conduct a simulation study to illustrate the validity of the methods. We exemplify the algorithms by analyzing two well-known data sets on the diagnosis of appendicitis and the diagnosis of diabetes in Pima Indians. Simulations demonstrate the validity of the method. With the real data application, we show the accuracy and practicality of this approach. We provide sample code from R packages in which the probability estimation is already available. This means that all calculations can be performed using existing software. Random forest algorithms as well as nearest neighbor approaches are valid machine learning methods for estimating individual probabilities for binary responses. Freely available implementations are available in R and may be used for applications.
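
    The core idea, treating a 0/1 response as a regression target so that the prediction is an individual probability, can be illustrated with a nearest-neighbour estimator. This is a minimal sketch with hypothetical data and a hypothetical function name; it is not the R code the authors provide.

```python
import math


def knn_probability(train_x, train_y, query, k):
    """Estimate P(y = 1 | query) as the mean 0/1 label of the k nearest points."""
    by_distance = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / k


# Hypothetical one-dimensional data; labels are 0 (no disease) or 1 (disease).
train_x = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
train_y = [0, 0, 0, 1, 1, 1]

p_boundary = knn_probability(train_x, train_y, (2.4,), k=3)
p_high = knn_probability(train_x, train_y, (4.5,), k=3)
print(p_boundary, p_high)
```

    Averaging the labels rather than taking a majority vote is what turns the classifier into a probability estimator; a regression random forest trained on the same 0/1 response follows the same principle with tree-based neighbourhoods.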

  16. Policy Implications and Suggestions on Administrative Measures of Urban Flood

    NASA Astrophysics Data System (ADS)

    Lee, S. V.; Lee, M. J.; Lee, C.; Yoon, J. H.; Chae, S. H.

    2017-12-01

    The frequency and intensity of floods are increasing worldwide as climate change progresses. Flood management in urban municipalities should be policy-oriented because urban areas are prone to heavy damage. The purpose of this study is therefore to prepare a flood susceptibility map using data mining models and to make policy suggestions on administrative measures for urban floods. We constructed a spatial database by collecting relevant factors, including the topography, geology, soil and land use data, for the representative city of Seoul, the capital of Korea. A flood susceptibility map was constructed by applying random forest and boosted tree data mining models to the input data and the existing flooded area data from 2010. The susceptibility map was validated using the 2011 flood area data, which were not used for training. The predictor importance value of each factor was calculated in this process. The distance from the water, DEM and geology showed high predictor importance values, which means they should be a high priority for flood preparation policy. In the receiver operating characteristic (ROC) analysis, the random forest model showed 78.78% and 79.18% accuracy for regression and classification, and the boosted tree model showed 77.55% and 77.26% accuracy for regression and classification, respectively. The results show that the flood susceptibility maps can be applied to flood prevention and management, and that they can also help determine the priority areas for flood mitigation policy by providing useful information to policy makers.

  17. The use of random forests in modelling short-term air pollution effects based on traffic and meteorological conditions: A case study in Wrocław.

    PubMed

    Kamińska, Joanna A

    2018-07-01

    Random forests, an advanced data mining method, are used here to model the regression relationships between concentrations of the pollutants NO2, NOx and PM2.5, and nine variables describing meteorological conditions, temporal conditions and traffic flow. The study was based on hourly values of wind speed, wind direction, temperature, air pressure and relative humidity, temporal variables, and finally traffic flow, in the two years 2015 and 2016. An air quality measurement station was selected on a main road, located a short distance (40 m) from a large intersection equipped with a traffic flow measurement system. Nine different time subsets were defined, based among other things on the climatic conditions in Wrocław. An analysis was made of the fit of models created for those subsets, and of the importance of the predictors. Both the fit and the importance of particular predictors were found to be dependent on season. The best fit was obtained for models created for the six-month warm season (April-September) and for the summer season (June-August). The most important explanatory variable in the models of concentrations of nitrogen oxides was traffic flow, while in the case of PM2.5 the most important were meteorological conditions, in particular temperature, wind speed and wind direction. Temporal variables (except for month in the case of PM2.5) were found to have no significant effect on the concentrations of the studied pollutants. Copyright © 2018 Elsevier Ltd. All rights reserved.

  18. Predicting diabetes mellitus using SMOTE and ensemble machine learning approach: The Henry Ford ExercIse Testing (FIT) project.

    PubMed

    Alghamdi, Manal; Al-Mallah, Mouaz; Keteyian, Steven; Brawner, Clinton; Ehrman, Jonathan; Sakr, Sherif

    2017-01-01

    Machine learning is becoming a popular and important approach in the field of medical research. In this study, we investigate the relative performance of various machine learning methods such as Decision Tree, Naïve Bayes, Logistic Regression, Logistic Model Tree and Random Forests for predicting incident diabetes using medical records of cardiorespiratory fitness. In addition, we apply different techniques to uncover potential predictors of diabetes. This FIT project study used data of 32,555 patients who were free of any known coronary artery disease or heart failure, who underwent clinician-referred exercise treadmill stress testing at Henry Ford Health Systems between 1991 and 2009, and who had a complete 5-year follow-up. At the completion of the fifth year, 5,099 of those patients had developed diabetes. The dataset contained 62 attributes classified into four categories: demographic characteristics, disease history, medication use history, and stress test vital signs. We developed an Ensembling-based predictive model using 13 attributes that were selected based on their clinical importance, Multiple Linear Regression, and Information Gain Ranking methods. The negative effect of class imbalance on the constructed model was handled by the Synthetic Minority Oversampling Technique (SMOTE). The overall performance of the predictive model classifier was improved by the Ensemble machine learning approach using the Vote method with three Decision Trees (Naïve Bayes Tree, Random Forest, and Logistic Model Tree) and achieved high accuracy of prediction (AUC = 0.92). The study shows the potential of ensembling and SMOTE approaches for predicting incident diabetes using cardiorespiratory fitness data.
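
    As a rough illustration of how SMOTE counters class imbalance, the sketch below creates synthetic minority-class points by interpolating between a minority sample and one of its nearest minority neighbors. It is a simplified, dependency-free version written for this listing; the function name `smote`, its parameters, and the toy points are hypothetical and do not reproduce the study's implementation.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority samples by interpolating toward nearest minority neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the base point.
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        # Find its k nearest minority neighbors, excluding the base point itself.
        others = [i for i in range(len(minority)) if i != base_idx]
        others.sort(key=lambda i: math.dist(minority[i], base))
        neighbor = minority[rng.choice(others[:k])]
        # Place the synthetic point at a random position on the segment base -> neighbor.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Hypothetical minority-class points in two dimensions.
minority_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (3.0, 3.0)]
new_points = smote(minority_points, n_new=5, k=2, seed=42)
print(new_points)
```

    Because every synthetic point lies on a segment between two real minority samples, the oversampled class stays inside the region the minority data already occupy.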

  19. Development of an automated assessment tool for MedWatch reports in the FDA adverse event reporting system.

    PubMed

    Han, Lichy; Ball, Robert; Pamer, Carol A; Altman, Russ B; Proestel, Scott

    2017-09-01

    As the US Food and Drug Administration (FDA) receives over a million adverse event reports associated with medication use every year, a system is needed to aid FDA safety evaluators in identifying reports most likely to demonstrate causal relationships to the suspect medications. We combined text mining with machine learning to construct and evaluate such a system to identify medication-related adverse event reports. FDA safety evaluators assessed 326 reports for medication-related causality. We engineered features from these reports and constructed random forest, L1 regularized logistic regression, and support vector machine models. We evaluated model accuracy and further assessed utility by generating report rankings that represented a prioritized report review process. Our random forest model showed the best performance in report ranking and accuracy, with an area under the receiver operating characteristic curve of 0.66. The generated report ordering assigns reports with a higher probability of medication-related causality a higher rank and is significantly correlated to a perfect report ordering, with a Kendall's tau of 0.24 (P = .002). Our models produced prioritized report orderings that enable FDA safety evaluators to focus on reports that are more likely to contain valuable medication-related adverse event information. Applying our models to all FDA adverse event reports has the potential to streamline the manual review process and greatly reduce reviewer workload. Published by Oxford University Press on behalf of the American Medical Informatics Association 2017. This work is written by US Government employees and is in the public domain in the United States.
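
    The rank agreement reported above can be made concrete with a small sketch of Kendall's tau, which compares two orderings pair by pair. The version below is the simple tau-a variant, in which tied pairs count as neither concordant nor discordant; the function name `kendall_tau` and the example rankings are assumptions for illustration, and the study may have used a different tie-handling variant.

```python
from itertools import combinations


def kendall_tau(a, b):
    """Kendall's tau-a between two equal-length sequences of scores or ranks."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("sequences must have equal length of at least 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        # A pair is concordant when both sequences order items i and j the same way.
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical example: a model ranking versus a reference ordering of four reports.
reference = [1, 2, 3, 4]
model_rank = [1, 3, 2, 4]
tau = kendall_tau(reference, model_rank)
print(tau)
```

    A value of 1 means the two orderings agree on every pair, -1 means they are exact reverses, and values near 0 indicate little agreement.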

  20. How to reconcile wood production and biodiversity conservation? The Pan-European boreal forest history gradient as an "experiment".

    PubMed

    Naumov, Vladimir; Manton, Michael; Elbakidze, Marine; Rendenieks, Zigmars; Priednieks, Janis; Uhlianets, Siarhei; Yamelynets, Taras; Zhivotov, Anton; Angelstam, Per

    2018-07-15

    There are currently competing demands on Europe's forests and the finite resources and services that they can offer. Forestry intensification that aims at mitigating climate change and biodiversity conservation is one example. Whether or not these two objectives compete can be evaluated by comparative studies of forest landscapes with different histories. We test the hypothesis that indicators of wood production and biodiversity conservation are inversely related in a gradient of long to short forestry intensification histories. Forest management data containing stand age, volume and tree species were used to model the opportunity for wood production and biodiversity conservation in five north European forest regions representing a gradient in landscape history from very long in the West to short in the East. Wood production indicators captured the supply of coniferous wood and total biomass, as well as current accessibility by transport infrastructure. Biodiversity conservation indicators were based on modelling habitat network functionality for focal bird species dependent on different combinations of stand age and tree species composition representing naturally dynamic forests. In each region we randomly sampled 25 individual 100-km² areas with contiguous forest cover. Regarding wood production, Sweden's Bergslagen region had the largest areas of coniferous wood, followed by Vitebsk in Belarus and Zemgale in Latvia. NW Russia's case study regions in Pskov and Komi had the lowest values, except for the biomass indicator. The addition of forest accessibility for transportation made the Belarusian and Swedish study regions most suitable for wood and biomass production, followed by Latvia and the two study regions in NW Russia. Regarding biodiversity conservation, the overall rank among regions was opposite. Mixed and deciduous habitats were functional in Russia, Belarus and Latvia. Old Scots pine and Norway spruce habitats were only functional in Komi. Thus, different regional forest histories provide different challenges in terms of satisfying both wood production and biodiversity conservation objectives in a forest management unit. These regional differences in northern Europe create opportunities for exchanging experiences among different regional contexts about how to achieve both objectives. We discuss this in the context of land-sharing versus land-sparing. Copyright © 2018 Elsevier Ltd. All rights reserved.

  1. Landscape-scale consequences of differential tree mortality from catastrophic wind disturbance in the Amazon.

    PubMed

    Rifai, Sami W; Urquiza Muñoz, José D; Negrón-Juárez, Robinson I; Ramírez Arévalo, Fredy R; Tello-Espinoza, Rodil; Vanderwel, Mark C; Lichstein, Jeremy W; Chambers, Jeffrey Q; Bohlman, Stephanie A

    2016-10-01

    Wind disturbance can create large forest blowdowns, which greatly reduce live biomass and add uncertainty to the strength of the Amazon carbon sink. Observational studies from within the central Amazon have quantified blowdown size and estimated total mortality but have not determined which trees are most likely to die from a catastrophic wind disturbance. Also, the impact of spatial dependence upon tree mortality from wind disturbance has seldom been quantified, which is important because wind disturbance often kills clusters of trees due to large treefalls killing surrounding neighbors. We examine (1) the causes of differential mortality between adult trees from a 300-ha blowdown event in the Peruvian region of the northwestern Amazon, (2) how accounting for spatial dependence affects mortality predictions, and (3) how incorporating both differential mortality and spatial dependence affects the landscape-level estimation of necromass produced from the blowdown. Standard regression and spatial regression models were used to estimate how stem diameter, wood density, elevation, and a satellite-derived disturbance metric influenced the probability of tree death from the blowdown event. The model parameters regarding tree characteristics, topography, and spatial autocorrelation of the field data were then used to determine the consequences of non-random mortality for landscape production of necromass through a simulation model. Tree mortality was highly non-random within the blowdown, where tree mortality rates were highest for trees that were large, had low wood density, and were located at high elevation. Of the differential mortality models, the non-spatial models overpredicted necromass, whereas the spatial model slightly underpredicted necromass. When parameterized from the same field data, the spatial regression model with differential mortality estimated only 7.5% more dead trees across the entire blowdown than the random mortality model, yet it estimated 51% greater necromass. We suggest that predictions of forest carbon loss from wind disturbance are sensitive to not only the underlying spatial dependence of observations, but also the biological differences between individuals that promote differential levels of mortality. © 2016 by the Ecological Society of America.

  2. Predicting network modules of cell cycle regulators using relative protein abundance statistics.

    PubMed

    Oguz, Cihan; Watson, Layne T; Baumann, William T; Tyson, John J

    2017-02-28

    Parameter estimation in systems biology is typically done by enforcing experimental observations through an objective function as the parameter space of a model is explored by numerical simulations. Past studies have shown that one usually finds a set of "feasible" parameter vectors that fit the available experimental data equally well, and that these alternative vectors can make different predictions under novel experimental conditions. In this study, we characterize the feasible region of a complex model of the budding yeast cell cycle under a large set of discrete experimental constraints in order to test whether the statistical features of relative protein abundance predictions are influenced by the topology of the cell cycle regulatory network. Using differential evolution, we generate an ensemble of feasible parameter vectors that reproduce the phenotypes (viable or inviable) of wild-type yeast cells and 110 mutant strains. We use this ensemble to predict the phenotypes of 129 mutant strains for which experimental data is not available. We identify 86 novel mutants that are predicted to be viable and then rank the cell cycle proteins in terms of their contributions to cumulative variability of relative protein abundance predictions. Proteins involved in "regulation of cell size" and "regulation of G1/S transition" contribute most to predictive variability, whereas proteins involved in "positive regulation of transcription involved in exit from mitosis," "mitotic spindle assembly checkpoint" and "negative regulation of cyclin-dependent protein kinase by cyclin degradation" contribute the least. These results suggest that the statistics of these predictions may be generating patterns specific to individual network modules (START, S/G2/M, and EXIT). To test this hypothesis, we develop random forest models for predicting the network modules of cell cycle regulators using relative abundance statistics as model inputs. Predictive performance is assessed by the areas under receiver operating characteristics curves (AUC). Our models generate an AUC range of 0.83-0.87 as opposed to randomized models with AUC values around 0.50. By using differential evolution and random forest modeling, we show that the model prediction statistics generate distinct network module-specific patterns within the cell cycle network.
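
    For readers comparing the AUC values quoted in this and other records, the following sketch computes the area under the ROC curve for a binary problem directly from its rank interpretation: the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, with ties counted as one half. The function name `roc_auc` and the labels and scores are illustrative assumptions, not data from the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative)."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical labels and classifier scores.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(auc)
```

    Under this definition a classifier that scores cases at random is expected to reach about 0.50, which is why the randomized models serve as the baseline in the abstract.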

  3. Random forest learning of ultrasonic statistical physics and object spaces for lesion detection in 2D sonomammography

    NASA Astrophysics Data System (ADS)

    Sheet, Debdoot; Karamalis, Athanasios; Kraft, Silvan; Noël, Peter B.; Vag, Tibor; Sadhu, Anup; Katouzian, Amin; Navab, Nassir; Chatterjee, Jyotirmoy; Ray, Ajoy K.

    2013-03-01

    Breast cancer is the most common form of cancer in women. Early diagnosis can significantly improve life expectancy and allow different treatment options. Clinicians favor 2D ultrasonography for breast tissue abnormality screening due to high sensitivity and specificity compared to competing technologies. However, inter- and intra-observer variability in visual assessment and reporting of lesions often handicaps its performance. Existing Computer Assisted Diagnosis (CAD) systems, though able to detect solid lesions, are often restricted in performance. These restrictions are the inability to (1) detect lesions of multiple sizes and shapes, and (2) differentiate hypo-echoic lesions from their posterior acoustic shadowing. In this work we present a completely automatic system for detection and segmentation of breast lesions in 2D ultrasound images. We employ random forests for learning of a tissue-specific primal to discriminate breast lesions from surrounding normal tissues. This enables the system to detect lesions of multiple shapes and sizes, as well as to discriminate a hypo-echoic lesion from its associated posterior acoustic shadowing. The primal comprises (i) multiscale estimated ultrasonic statistical physics and (ii) scale-space characteristics. The random forest learns the lesion vs. background primal from a database of 2D ultrasound images with labeled lesions. For segmentation, the posterior probabilities of lesion pixels estimated by the learnt random forest are hard thresholded to provide a random walks segmentation stage with starting seeds. Our method achieves detection with 99.19% accuracy and segmentation with mean contour-to-contour error < 3 pixels on a set of 40 images with 49 lesions.

  4. Remote sensing of Earth terrain

    NASA Technical Reports Server (NTRS)

    Kong, J. A.

    1992-01-01

    Research findings are summarized for projects dealing with the following: application of theoretical models to active and passive remote sensing of saline ice; radiative transfer theory for polarimetric remote sensing of pine forest; scattering of electromagnetic waves from a dense medium consisting of correlated Mie scatterers with size distribution and applications to dry snow; variance of phase fluctuations of waves propagating through a random medium; theoretical modeling for passive microwave remote sensing of earth terrain; polarimetric signatures of a canopy of dielectric cylinders based on first and second order vector radiative transfer theory; branching model for vegetation; polarimetric passive remote sensing of periodic surfaces; composite volume and surface scattering model; and radar image classification.

  5. Comparing ensemble learning methods based on decision tree classifiers for protein fold recognition.

    PubMed

    Bardsiri, Mahshid Khatibi; Eftekhari, Mahdi

    2014-01-01

    In this paper, some methods for ensemble learning of protein fold recognition based on a decision tree (DT) are compared and contrasted against each other over three datasets taken from the literature. According to previously reported studies, the features of the datasets are divided into some groups. Then, for each of these groups, three ensemble classifiers, namely, random forest, rotation forest and AdaBoost.M1 are employed. Also, some fusion methods are introduced for combining the ensemble classifiers obtained in the previous step. After this step, three classifiers are produced based on the combination of classifiers of types random forest, rotation forest and AdaBoost.M1. Finally, the three different classifiers achieved are combined to make an overall classifier. Experimental results show that the overall classifier obtained by the genetic algorithm (GA) weighting fusion method, is the best one in comparison to previously applied methods in terms of classification accuracy.

  6. Data-Science Analysis of the Macro-scale Features Governing the Corrosion to Crack Transition in AA7050-T7451

    NASA Astrophysics Data System (ADS)

    Co, Noelle Easter C.; Brown, Donald E.; Burns, James T.

    2018-05-01

    This study applies data science approaches (random forest and logistic regression) to determine the extent to which macro-scale corrosion damage features govern the crack formation behavior in AA7050-T7451. Each corrosion morphology has a set of corresponding predictor variables (pit depth, volume, area, diameter, pit density, total fissure length, surface roughness metrics, etc.) describing the shape of the corrosion damage. The values of the predictor variables are obtained from white light interferometry, x-ray tomography, and scanning electron microscope imaging of the corrosion damage. A permutation test is employed to assess the significance of the logistic and random forest model predictions. Results indicate minimal relationship between the macro-scale corrosion feature predictor variables and fatigue crack initiation. These findings suggest that the macro-scale corrosion features and their interactions do not solely govern the crack formation behavior. While these results do not imply that the macro-features have no impact, they do suggest that additional parameters must be considered to rigorously inform the crack formation location.

  7. Random forest classification of stars in the Galactic Centre

    NASA Astrophysics Data System (ADS)

    Plewa, P. M.

    2018-05-01

    Near-infrared high-angular resolution imaging observations of the Milky Way's nuclear star cluster have revealed all luminous members of the existing stellar population within the central parsec. Generally, these stars are either evolved late-type giants or massive young, early-type stars. We revisit the problem of stellar classification based on intermediate-band photometry in the K band, with the primary aim of identifying faint early-type candidate stars in the extended vicinity of the central massive black hole. A random forest classifier, trained on a subsample of spectroscopically identified stars, performs similarly well as competitive methods (F1 = 0.85), without involving any model of stellar spectral energy distributions. Advantages of using such a machine-trained classifier are a minimum of required calibration effort, a predictive accuracy expected to improve as more training data become available, and the ease of application to future, larger data sets. By applying this classifier to archive data, we are also able to reproduce the results of previous studies of the spatial distribution and the K-band luminosity function of both the early- and late-type stars.

  8. Per-field crop classification in irrigated agricultural regions in middle Asia using random forest and support vector machine ensemble

    NASA Astrophysics Data System (ADS)

    Löw, Fabian; Schorcht, Gunther; Michel, Ulrich; Dech, Stefan; Conrad, Christopher

    2012-10-01

    Accurate crop identification and crop area estimation are important for studies on irrigated agricultural systems, yield and water demand modeling, and agrarian policy development. In this study a novel combination of Random Forest (RF) and Support Vector Machine (SVM) classifiers is presented that (i) enhances crop classification accuracy and (ii) provides spatial information on map uncertainty. The methodology was implemented over four distinct irrigated sites in Middle Asia using RapidEye time series data. The RF feature importance statistics were used as a feature-selection strategy for the SVM to assess possible negative effects on classification accuracy caused by an oversized feature space. The results of the individual RF and SVM classifications were combined with rules based on posterior classification probability and estimates of classification probability entropy. SVM classification performance was increased by feature selection through RF. Further experimental results indicate that the hybrid classifier improves overall classification accuracy in comparison to the single classifiers as well as user's and producer's accuracy.

  9. A Hybrid Color Space for Skin Detection Using Genetic Algorithm Heuristic Search and Principal Component Analysis Technique

    PubMed Central

    2015-01-01

    Color is one of the most prominent features of an image and used in many skin and face detection applications. Color space transformation is widely used by researchers to improve face and skin detection performance. Despite the substantial research efforts in this area, choosing a proper color space in terms of skin and face classification performance which can address issues like illumination variations, various camera characteristics and diversity in skin color tones has remained an open issue. This research proposes a new three-dimensional hybrid color space termed SKN by employing the Genetic Algorithm heuristic and Principal Component Analysis to find the optimal representation of human skin color in over seventeen existing color spaces. Genetic Algorithm heuristic is used to find the optimal color component combination setup in terms of skin detection accuracy while the Principal Component Analysis projects the optimal Genetic Algorithm solution to a less complex dimension. Pixel wise skin detection was used to evaluate the performance of the proposed color space. We have employed four classifiers including Random Forest, Naïve Bayes, Support Vector Machine and Multilayer Perceptron in order to generate the human skin color predictive model. The proposed color space was compared to some existing color spaces and shows superior results in terms of pixel-wise skin detection accuracy. Experimental results show that by using Random Forest classifier, the proposed SKN color space obtained an average F-score and True Positive Rate of 0.953 and False Positive Rate of 0.0482 which outperformed the existing color spaces in terms of pixel wise skin detection accuracy. The results also indicate that among the classifiers used in this study, Random Forest is the most suitable classifier for pixel wise skin detection applications. PMID:26267377

  10. Stochastic assembly in a subtropical forest chronosequence: evidence from contrasting changes of species, phylogenetic and functional dissimilarity over succession.

    PubMed

    Mi, Xiangcheng; Swenson, Nathan G; Jia, Qi; Rao, Mide; Feng, Gang; Ren, Haibao; Bebber, Daniel P; Ma, Keping

    2016-09-07

    Deterministic and stochastic processes jointly determine the community dynamics of forest succession. However, it has been widely held in previous studies that deterministic processes dominate forest succession. Furthermore, inference of mechanisms for community assembly may be misleading if based on a single axis of diversity alone. In this study, we evaluated the relative roles of deterministic and stochastic processes along a disturbance gradient by integrating species, functional, and phylogenetic beta diversity in a subtropical forest chronosequence in Southeastern China. We found a general pattern of increasing species turnover, but little-to-no change in phylogenetic and functional turnover over succession at two spatial scales. Meanwhile, the phylogenetic and functional beta diversity were not significantly different from random expectation. This result suggested a dominance of stochastic assembly, contrary to the general expectation that deterministic processes dominate forest succession. On the other hand, we found significant interactions of environment and disturbance and limited evidence for significant deviations of phylogenetic or functional turnover from random expectations for different size classes. This result provided weak evidence of deterministic processes over succession. Stochastic assembly of forest succession suggests that post-disturbance restoration may be largely unpredictable and difficult to control in subtropical forests.

  11. Correspondence between sound propagation in discrete and continuous random media with application to forest acoustics.

    PubMed

    Ostashev, Vladimir E; Wilson, D Keith; Muhlestein, Michael B; Attenborough, Keith

    2018-02-01

    Although sound propagation in a forest is important in several applications, there are currently no rigorous yet computationally tractable prediction methods. Due to the complexity of sound scattering in a forest, it is natural to formulate the problem stochastically. In this paper, it is demonstrated that the equations for the statistical moments of the sound field propagating in a forest have the same form as those for sound propagation in a turbulent atmosphere if the scattering properties of the two media are expressed in terms of the differential scattering and total cross sections. Using the existing theories for sound propagation in a turbulent atmosphere, this analogy enables the derivation of several results for predicting forest acoustics. In particular, the second-moment parabolic equation is formulated for the spatial correlation function of the sound field propagating above an impedance ground in a forest with micrometeorology. Effective numerical techniques for solving this equation have been developed in atmospheric acoustics. In another example, formulas are obtained that describe the effect of a forest on the interference between the direct and ground-reflected waves. The formulated correspondence between wave propagation in discrete and continuous random media can also be used in other fields of physics.

  12. First direct landscape-scale measurement of tropical rain forest Leaf Area Index, a key driver of global primary productivity

    Treesearch

    David B. Clark; Paulo C. Olivas; Steven F. Oberbauer; Deborah A. Clark; Michael G. Ryan

    2008-01-01

    Leaf Area Index (leaf area per unit ground area, LAI) is a key driver of forest productivity but has never previously been measured directly at the landscape scale in tropical rain forest (TRF). We used a modular tower and stratified random sampling to harvest all foliage from forest floor to canopy top in 55 vertical transects (4.6 m²) across 500 ha of old growth in...

  13. Alternative methods to evaluate trial level surrogacy.

    PubMed

    Abrahantes, Josè Cortiñas; Shkedy, Ziv; Molenberghs, Geert

    2008-01-01

    The evaluation and validation of surrogate endpoints have been extensively studied in the last decade. Prentice [1] and Freedman, Graubard and Schatzkin [2] laid the foundations for the evaluation of surrogate endpoints in randomized clinical trials. Later, Buyse et al. [5] proposed a meta-analytic methodology, producing different methods for different settings, which was further studied by Alonso and Molenberghs [9] in their unifying approach based on information theory. In this article, we focus on trial-level surrogacy and propose alternative procedures to evaluate this surrogacy measure that do not pre-specify the type of association. A promising correction based on cross-validation is investigated, as well as the construction of confidence intervals for this measure. To avoid making assumptions about the type of relationship between the treatment effects and its distribution, a collection of alternative methods based on regression trees, bagging, random forests, and support vector machines, combined with bootstrap-based confidence intervals and, should one wish, a cross-validation-based correction, is proposed and applied. We apply the various strategies to data from three clinical studies: in ophthalmology, in advanced colorectal cancer, and in schizophrenia. The results obtained for the three case studies are compared; they indicate that random forest or bagging models produce larger estimated values for the surrogacy measure, which are in general more stable and have narrower confidence intervals than those from linear regression and support vector regression. For the advanced colorectal cancer studies, we even found that the trial-level surrogacy is considerably different from what has been reported. In general the alternative methods are more computationally demanding; the calculation of the confidence intervals, in particular, requires more computational time than the delta-method counterpart. First, more flexible modeling techniques can be used, allowing for other types of association. Second, when no cross-validation-based correction is applied, overly optimistic trial-level surrogacy estimates will be found; cross-validation is therefore highly recommended. Third, the use of the delta method to calculate confidence intervals is not recommended, since it makes assumptions valid only in very large samples and may also produce range-violating limits. We therefore recommend alternatives: bootstrap methods in general. Also, the information-theoretic approach produces results comparable to those of the bagging and random forest approaches when the cross-validation correction is applied. It is also important to observe that, even in the case in which the linear model might be a good option, bagging methods perform well, and their confidence intervals were narrower.
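
    The bootstrap confidence intervals recommended in this abstract can be illustrated with a minimal sketch. It assumes the surrogacy measure is a plain squared correlation between per-trial treatment effects, which is a simplification: the article itself uses trees, bagging, random forests, and support vector machines. The data values and the function names r_squared and bootstrap_ci are hypothetical.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # degenerate resample: no variation to explain
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return min(1.0, sxy ** 2 / (sxx * syy))  # clamp against rounding error


def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for r_squared, resampling trials."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample trials with replacement
        stats.append(r_squared([x[i] for i in idx], [y[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical per-trial treatment effects (illustrative values only).
surrogate_effect = [0.1, 0.4, 0.35, 0.8, 0.55, 0.9, 0.2, 0.65, 0.7, 0.3]
true_effect = [0.2, 0.5, 0.30, 0.7, 0.60, 1.0, 0.1, 0.60, 0.9, 0.4]

lower, upper = bootstrap_ci(surrogate_effect, true_effect)
print(f"R^2 = {r_squared(surrogate_effect, true_effect):.3f}, "
      f"95% CI = ({lower:.3f}, {upper:.3f})")
```

    Because the limits are taken from resampled values of the statistic itself, they cannot fall outside [0, 1], unlike the range-violating limits that the abstract attributes to the delta method.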

  14. Landscape analysis and pattern of hurricane impact and circulation on mangrove forests of the everglades

    USGS Publications Warehouse

    Doyle, T.W.; Krauss, K.W.; Wells, C.J.

    2009-01-01

    The Everglades ecosystem contains the largest contiguous tract of mangrove forest outside the tropics, which was also coincidentally intersected by a major Category 5 hurricane. Airborne videography was flown to capture the landscape pattern and process of forest damage in relation to storm trajectory and circulation. Two aerial video transects, representing different topographic positions, were used to quantify forest damage from video frame analysis in relation to prevailing wind force, treefall direction, and forest height. A hurricane simulation model was applied to reconstruct wind fields corresponding to the ground location of each video frame and to correlate observed treefall and destruction patterns with wind speed and direction. Mangrove forests within the storm's eyepath and in the right-side (forewind) quadrants suffered whole or partial blowdowns, while left-side (backwind) sites south of the eyewall zone incurred moderate canopy reduction and defoliation. Sites along the coastal transect sustained substantially more storm damage than sites along the inland transect, which may be attributed to differences in stand exposure and/or stature. Observed treefall directions were shown to be non-random and associated with hurricane trajectory and simulated forewind azimuths. Wide-area sampling using airborne videography provided an efficient adjunct to limited ground observations and improved our spatial understanding of how hurricanes imprint landscape-scale patterns of disturbance. © 2009 The Society of Wetland Scientists.

  15. Biodiversity mapping in a tropical West African forest with airborne hyperspectral data.

    PubMed

    Vaglio Laurin, Gaia; Cheung-Wai Chan, Jonathan; Chen, Qi; Lindsell, Jeremy A; Coomes, David A; Guerriero, Leila; Del Frate, Fabio; Miglietta, Franco; Valentini, Riccardo

    2014-01-01

    Tropical forests are major repositories of biodiversity, but are fast disappearing as land is converted to agriculture. Decision-makers need to know which of the remaining forests to prioritize for conservation, but the only spatial information on forest biodiversity has, until recently, come from a sparse network of ground-based plots. Here we explore whether airborne hyperspectral imagery can be used to predict the alpha diversity of upper canopy trees in a West African forest. Tree species abundance data were collected from 64 plots (each 1250 m² in size) within a Sierra Leonean national park, and Shannon-Wiener biodiversity indices were calculated. An airborne spectrometer measured reflectances of 186 bands in the visible and near-infrared spectral range at 1 m² resolution. The standard deviations of these reflectance values and their first-order derivatives were calculated for each plot from the c. 1250 pixels of hyperspectral information within them. Shannon-Wiener indices were then predicted from these plot-based reflectance statistics using a machine-learning algorithm (Random Forest). The regression model fitted the data well (pseudo-R² = 84.9%), and we show that standard deviations of green-band reflectances and infra-red region derivatives had the strongest explanatory powers. Our work shows that airborne hyperspectral sensing can be very effective at mapping canopy tree diversity, because its high spatial resolution allows within-plot heterogeneity in reflectance to be characterized, making it an effective tool for monitoring forest biodiversity over large geographic scales.
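
    The Shannon-Wiener index used as the response variable in this record is H' = -sum(p_i * ln p_i), where p_i is the proportion of individuals belonging to species i. A minimal sketch follows, assuming the natural logarithm (the abstract does not state the base) and made-up plot counts; the function name shannon_index is hypothetical.

```python
import math


def shannon_index(abundances):
    """Shannon-Wiener diversity H' from per-species counts in one plot."""
    total = sum(abundances)
    if total <= 0:
        raise ValueError("plot must contain at least one individual")
    h = 0.0
    for count in abundances:
        if count > 0:  # absent species contribute nothing to H'
            p = count / total
            h -= p * math.log(p)
    return h


# Made-up counts for three plots (illustrative values only).
plots = {
    "even": [10, 10, 10, 10],
    "dominated": [37, 1, 1, 1],
    "single_species": [40],
}
for name, counts in plots.items():
    print(name, round(shannon_index(counts), 3))
```

    For a fixed number of species the index is highest when abundances are even and drops to zero for a single-species plot, which is why it serves as a measure of alpha diversity.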

  16. Biodiversity Mapping in a Tropical West African Forest with Airborne Hyperspectral Data

    PubMed Central

    Vaglio Laurin, Gaia; Chan, Jonathan Cheung-Wai; Chen, Qi; Lindsell, Jeremy A.; Coomes, David A.; Guerriero, Leila; Frate, Fabio Del; Miglietta, Franco; Valentini, Riccardo

    2014-01-01

    Tropical forests are major repositories of biodiversity, but are fast disappearing as land is converted to agriculture. Decision-makers need to know which of the remaining forests to prioritize for conservation, but the only spatial information on forest biodiversity has, until recently, come from a sparse network of ground-based plots. Here we explore whether airborne hyperspectral imagery can be used to predict the alpha diversity of upper canopy trees in a West African forest. Tree species abundance data were collected from 64 plots (each 1250 m² in size) within a Sierra Leonean national park, and Shannon-Wiener biodiversity indices were calculated. An airborne spectrometer measured reflectances of 186 bands in the visible and near-infrared spectral range at 1 m² resolution. The standard deviations of these reflectance values and their first-order derivatives were calculated for each plot from the c. 1250 pixels of hyperspectral information within them. Shannon-Wiener indices were then predicted from these plot-based reflectance statistics using a machine-learning algorithm (Random Forest). The regression model fitted the data well (pseudo-R² = 84.9%), and we show that standard deviations of green-band reflectances and infra-red region derivatives had the strongest explanatory powers. Our work shows that airborne hyperspectral sensing can be very effective at mapping canopy tree diversity, because its high spatial resolution allows within-plot heterogeneity in reflectance to be characterized, making it an effective tool for monitoring forest biodiversity over large geographic scales. PMID:24937407

  17. Determinants of the process and outcomes of household participation in collaborative forest management in Ghana: a quantitative test of a community resilience model.

    PubMed

    Akamani, Kofi; Hall, Troy Elizabeth

    2015-01-01

    This study tested a proposed community resilience model by investigating the role of institutions, capital assets, community and socio-demographic variables as determinants of households' participation in Ghana's collaborative forest management (CFM) program and outcomes of the program. Quantitative survey data were gathered from 209 randomly selected households from two forest-dependent communities. Regression analysis shows that households' participation in the CFM program was predicted by community location, past connections with institutions, and past bonding social capital. Community location and past capitals were the strongest predictors of the outcomes of the CFM program as judged by current levels of capitals. Participation in the CFM program also had a positive effect on human capital but had minimal impact on the other capitals influencing household well-being and resilience, suggesting that the impact of co-management on household resilience may be modest. In all, the findings highlight the need for co-management policies to pay attention to the historical context of community interaction processes influencing access to capital assets and local institutions to successfully promote equitable resilience. Copyright © 2014 Elsevier Ltd. All rights reserved.

  18. Dynamics of Tree Species Diversity in Unlogged and Selectively Logged Malaysian Forests.

    PubMed

    Shima, Ken; Yamada, Toshihiro; Okuda, Toshinori; Fletcher, Christine; Kassim, Abdul Rahman

    2018-01-18

    Selective logging, which is commonly conducted in tropical forests, may change tree species diversity. In rarely disturbed tropical forests, locally rare species exhibit higher survival rates. If this non-random process occurs in a logged forest, the forest will rapidly recover its tree species diversity. Here we determined whether a forest in the Pasoh Forest Reserve, Malaysia, which was selectively logged 40 years ago, recovered its original species diversity (species richness and composition). To explore this, we compared the dynamics of species diversity between an unlogged forest plot (18.6 ha) and a logged forest plot (5.4 ha). We found that 40 years are not sufficient to recover species diversity after logging. Unlike unlogged forests, tree deaths and recruitments did not contribute to increased diversity in the selectively logged forests. Our results predict that selectively logged forests require longer than our observation period (40 years) to regain their diversity.

  19. Assessing change in large-scale forest area by visually interpreting Landsat images

    Treesearch

    Jerry D. Greer; Frederick P. Weber; Raymond L. Czaplewski

    2000-01-01

    As part of the Forest Resources Assessment 1990, the Food and Agriculture Organization of the United Nations visually interpreted a stratified random sample of 117 Landsat scenes to estimate global status and change in tropical forest area. Images from 1980 and 1990 were interpreted by a group of widely experienced technical people in many different tropical countries...

  20. A ground-based method of assessing urban forest structure and ecosystem services

    Treesearch

    David J. Nowak; Daniel E. Crane; Jack C. Stevens; Robert E. Hoehn; Jeffrey T. Walton; Jerry Bond

    2008-01-01

    To properly manage urban forests, it is essential to have data on this important resource. An efficient means to obtain this information is to randomly sample urban areas. To help assess the urban forest structure (e.g., number of trees, species composition, tree sizes, health) and several functions (e.g., air pollution removal, carbon storage and sequestration), the...

  1. Spatially random mortality in old-growth red pine forests of northern Minnesota

    Treesearch

    Tuomas Aakala; Shawn Fraver; Brian J. Palik; Anthony W. D'Amato

    2012-01-01

    Characterizing the spatial distribution of tree mortality is critical to understanding forest dynamics, but empirical studies on these patterns under old-growth conditions are rare. This rarity is due in part to low mortality rates in old-growth forests, the study of which necessitates long observation periods, and the confounding influence of tree in-growth during...

  2. Random forests of interaction trees for estimating individualized treatment effects in randomized trials.

    PubMed

    Su, Xiaogang; Peña, Annette T; Liu, Lei; Levine, Richard A

    2018-04-29

    Assessing heterogeneous treatment effects is a growing interest in advancing precision medicine. Individualized treatment effects (ITEs) play a critical role in such an endeavor. Concerning experimental data collected from randomized trials, we put forward a method, termed random forests of interaction trees (RFIT), for estimating ITE on the basis of interaction trees. To this end, we propose a smooth sigmoid surrogate method, as an alternative to greedy search, to speed up tree construction. The RFIT outperforms the "separate regression" approach in estimating ITE. Furthermore, standard errors for the estimated ITE via RFIT are obtained with the infinitesimal jackknife method. We assess and illustrate the use of RFIT via both simulation and the analysis of data from an acupuncture headache trial. Copyright © 2018 John Wiley & Sons, Ltd.

  3. Mapping SOC (Soil Organic Carbon) using LiDAR-derived vegetation indices in a random forest regression model

    NASA Astrophysics Data System (ADS)

    Will, R. M.; Glenn, N. F.; Benner, S. G.; Pierce, J. L.; Spaete, L.; Li, A.

    2015-12-01

    Quantifying SOC (Soil Organic Carbon) storage in complex terrain is challenging due to high spatial variability. Generally, the challenge is met by transforming point data to the entire landscape using surrogate, spatially-distributed, variables like elevation or precipitation. In many ecosystems, remotely sensed information on above-ground vegetation (e.g. NDVI) is a good predictor of below-ground carbon stocks. In this project, we are attempting to improve this predictive method by incorporating LiDAR-derived vegetation indices. LiDAR provides a mechanism for improved characterization of aboveground vegetation by providing structural parameters such as vegetation height and biomass. In this study, a random forest model is used to predict SOC using a suite of LiDAR-derived vegetation indices as predictor variables. The Reynolds Creek Experimental Watershed (RCEW) is an ideal location for a study of this type since it encompasses a strong elevation/precipitation gradient that supports lower biomass sagebrush ecosystems at low elevations and forests with more biomass at higher elevations. Sagebrush ecosystems composed of Wyoming, Low and Mountain Sagebrush have SOC values ranging from .4 to 1% (top 30 cm), while higher biomass ecosystems composed of aspen, juniper and fir have SOC values approaching 4% (top 30 cm). Large differences in SOC have been observed between canopy and interspace locations and high resolution vegetation information is likely to explain plot scale variability in SOC. Mapping of the SOC reservoir will help identify underlying controls on SOC distribution and provide insight into which processes are most important in determining SOC in semi-arid mountainous regions. In addition, airborne LiDAR has the potential to characterize vegetation communities at a high resolution and could be a tool for improving estimates of SOC at larger scales.

  4. Assessing soil carbon vulnerability in the Western USA by geospatial modeling of pyrogenic and particulate carbon stocks

    NASA Astrophysics Data System (ADS)

    Ahmed, Zia U.; Woodbury, Peter B.; Sanderman, Jonathan; Hawke, Bruce; Jauss, Verena; Solomon, Dawit; Lehmann, Johannes

    2017-02-01

    To predict how land management practices and climate change will affect soil carbon cycling, improved understanding of factors controlling soil organic carbon fractions at large spatial scales is needed. We analyzed total soil organic carbon (SOC) as well as pyrogenic (PyC), particulate (POC), and other soil organic carbon (OOC) fractions in surface layers from 650 stratified-sampling locations throughout Colorado, Kansas, New Mexico, and Wyoming. PyC varied from 0.29 to 18.0 mg C g⁻¹ soil with a mean of 4.05 mg C g⁻¹ soil. The mean PyC was 34.6% of the SOC and ranged from 11.8 to 96.6%. Both POC and PyC were highest in forests and canyon bottoms. In the best random forest regression model, normalized difference vegetation index (NDVI), mean annual precipitation (MAP), mean annual temperature (MAT), and elevation were ranked as the top four important variables determining PyC and POC variability. Random forests regression kriging (RFK) with environmental covariables improved predictions over ordinary kriging by 20 and 7% for PyC and POC, respectively. Based on RFK, 8% of the study area was dominated (≥50% of SOC) by PyC and less than 1% was dominated by POC. Furthermore, based on spatial analysis of the ratio of POC to PyC, we estimated that about 16% of the study area is medium to highly vulnerable to SOC mineralization in surface soil. These are the first results to characterize PyC and POC stocks geospatially using a stratified sampling scheme at the scale of 1,000,000 km², and the methods are scalable to other regions.

  5. Dynamics of transit times and StorAge Selection functions in four forested catchments from stable isotope data

    NASA Astrophysics Data System (ADS)

    Rodriguez, Nicolas B.; McGuire, Kevin J.; Klaus, Julian

    2017-04-01

    Transit time distributions, residence time distributions and StorAge Selection functions are fundamental integrated descriptors of water storage, mixing, and release in catchments. In this contribution, we determined these time-variant functions in four neighboring forested catchments in H.J. Andrews Experimental Forest, Oregon, USA by employing a two-year time series of ¹⁸O in precipitation and discharge. Previous studies in these catchments assumed stationary, exponentially distributed transit times, and complete mixing/random sampling to explore the influence of various catchment properties on the mean transit time. Here we relaxed such assumptions to relate transit time dynamics and the variability of StorAge Selection functions to catchment characteristics, catchment storage, and meteorological forcing seasonality. Conceptual models of the catchments, consisting of two reservoirs combined in series-parallel, were calibrated to discharge and stable isotope tracer data. We assumed randomly sampled/fully mixed conditions for each reservoir, which resulted in an incompletely mixed system overall. Based on the results we solved the Master Equation, which describes the dynamics of water ages in storage and in catchment outflows. Consistently across all catchments, we found that transit times were generally shorter during wet periods, indicating the contribution of shallow storage (soil, saprolite) to discharge. During extended dry periods, transit times increased significantly, indicating the contribution of deeper storage (bedrock) to discharge. Our work indicated that the strong seasonality of precipitation impacted transit times by leading to a dynamic selection of stored water ages, whereas catchment size was not a control on transit times. In general this work showed the usefulness of using time-variant transit times with conceptual models and confirmed the existence of the catchment age mixing behaviors emerging from other similar studies.

  6. Probabilistic hazard assessment for skin sensitization potency by dose–response modeling using feature elimination instead of quantitative structure–activity relationships

    PubMed Central

    McKim, James M.; Hartung, Thomas; Kleensang, Andre; Sá-Rocha, Vanessa

    2016-01-01

    Supervised learning methods promise to improve integrated testing strategies (ITS), but must be adjusted to handle high dimensionality and dose–response data. ITS approaches are currently fueled by the increasing mechanistic understanding of adverse outcome pathways (AOP) and the development of tests reflecting these mechanisms. Simple approaches to combine skin sensitization data sets, such as weight of evidence, fail due to problems in information redundancy and high dimensionality. The problem is further amplified when potency information (dose/response) of hazards would be estimated. Skin sensitization currently serves as the foster child for AOP and ITS development, as legislative pressures combined with a very good mechanistic understanding of contact dermatitis have led to test development and relatively large high-quality data sets. We curated such a data set and combined a recursive variable selection algorithm to evaluate the information available through in silico, in chemico and in vitro assays. Chemical similarity alone could not cluster chemicals’ potency, and in vitro models consistently ranked high in recursive feature elimination. This allows reducing the number of tests included in an ITS. Next, we analyzed with a hidden Markov model that takes advantage of an intrinsic inter-relationship among the local lymph node assay classes, i.e. the monotonous connection between local lymph node assay and dose. The dose-informed random forest/hidden Markov model was superior to the dose-naive random forest model on all data sets. Although balanced accuracy improvement may seem small, this obscures the actual improvement in misclassifications as the dose-informed hidden Markov model strongly reduced "false-negatives" (i.e. extreme sensitizers as non-sensitizer) on all data sets. PMID:26046447

  7. Utilizing random Forest QSAR models with optimized parameters for target identification and its application to target-fishing server.

    PubMed

    Lee, Kyoungyeul; Lee, Minho; Kim, Dongsup

    2017-12-28

    The identification of target molecules is important for understanding the mechanism of "target deconvolution" in phenotypic screening and "polypharmacology" of drugs. Because conventional methods of identifying targets require time and cost, in-silico target identification has been considered an alternative solution. One of the well-known in-silico methods of identifying targets involves structure activity relationships (SARs). SARs have advantages such as low computational cost and high feasibility; however, the data dependency in the SAR approach causes imbalance of active data and ambiguity of inactive data throughout targets. We developed a ligand-based virtual screening model comprising 1121 target SAR models built using a random forest algorithm. The performance of each target model was tested by employing the ROC curve and the mean score using an internal five-fold cross validation. Moreover, recall rates for top-k targets were calculated to assess the performance of target ranking. A benchmark model using an optimized sampling method and parameters was examined via external validation set. The result shows recall rates of 67.6% and 73.9% for top-11 (1% of the total targets) and top-33, respectively. We provide a website for users to search the top-k targets for query ligands available publicly at http://rfqsar.kaist.ac.kr . The target models that we built can be used for both predicting the activity of ligands toward each target and ranking candidate targets for a query ligand using a unified scoring scheme. The scores are additionally fitted to the probability so that users can estimate how likely a ligand-target interaction is active. The user interface of our web site is user friendly and intuitive, offering useful information and cross references.
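
    The top-k recall reported in this record (67.6% for top-11, 73.9% for top-33) can be illustrated with a short sketch: for each query ligand, candidate targets are ranked by score, and a hit is counted when the known target appears among the first k. The target identifiers, rankings, and the function name top_k_recall are hypothetical and are not taken from the authors' server.

```python
def top_k_recall(ranked_targets, true_targets, k):
    """Fraction of queries whose known target is within the top-k ranking.

    ranked_targets: one list per query ligand, best-scoring target first.
    true_targets:   the known target for each query ligand, in the same order.
    """
    if len(ranked_targets) != len(true_targets):
        raise ValueError("need one ranking per query ligand")
    hits = sum(
        1
        for ranking, truth in zip(ranked_targets, true_targets)
        if truth in ranking[:k]
    )
    return hits / len(true_targets)


# Hypothetical rankings for four query ligands (illustrative only).
rankings = [
    ["T3", "T1", "T7", "T2"],
    ["T5", "T2", "T9", "T1"],
    ["T8", "T4", "T6", "T3"],
    ["T1", "T9", "T2", "T5"],
]
known = ["T1", "T9", "T3", "T1"]

for k in (1, 2, 4):
    print(f"top-{k} recall = {top_k_recall(rankings, known, k):.2f}")
```

    With 1121 target models, k = 11 corresponds to roughly the top 1% of the ranked list, which matches the cutoff quoted in the abstract.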

  8. Efficacy of extracting indices from large-scale acoustic recordings to monitor biodiversity.

    PubMed

    Buxton, Rachel; McKenna, Megan F; Clapp, Mary; Meyer, Erik; Stabenau, Erik; Angeloni, Lisa M; Crooks, Kevin; Wittemyer, George

    2018-04-20

    Passive acoustic monitoring has the potential to be a powerful approach for assessing biodiversity across large spatial and temporal scales. However, extracting meaningful information from recordings can be prohibitively time consuming. Acoustic indices offer a relatively rapid method for processing acoustic data and are increasingly used to characterize biological communities. We examine the ability of acoustic indices to predict the diversity and abundance of biological sounds within recordings. First we reviewed the acoustic index literature and found that over 60 indices have been applied to a range of objectives with varying success. We then implemented a subset of the most successful indices on acoustic data collected at 43 sites in temperate terrestrial and tropical marine habitats across the continental U.S., developing a predictive model of the diversity of animal sounds observed in recordings. For terrestrial recordings, random forest models using a suite of acoustic indices as covariates predicted Shannon diversity, richness, and total number of biological sounds with high accuracy (R² ≥ 0.94, mean squared error MSE ≤ 170.2). Among the indices assessed, roughness, acoustic activity, and acoustic richness contributed most to the predictive ability of models. Performance of index models was negatively impacted by insect, weather, and anthropogenic sounds. For marine recordings, random forest models predicted Shannon diversity, richness, and total number of biological sounds with low accuracy (R² ≤ 0.40, MSE ≥ 195), indicating that alternative methods are necessary in marine habitats. Our results suggest that using a combination of relevant indices in a flexible model can accurately predict the diversity of biological sounds in temperate terrestrial acoustic recordings. Thus, acoustic approaches could be an important contribution to biodiversity monitoring in some habitats in the face of accelerating human-caused ecological change.
This article is protected by copyright. All rights reserved.

  9. Probabilistic hazard assessment for skin sensitization potency by dose-response modeling using feature elimination instead of quantitative structure-activity relationships.

    PubMed

    Luechtefeld, Thomas; Maertens, Alexandra; McKim, James M; Hartung, Thomas; Kleensang, Andre; Sá-Rocha, Vanessa

    2015-11-01

    Supervised learning methods promise to improve integrated testing strategies (ITS), but must be adjusted to handle high dimensionality and dose-response data. ITS approaches are currently fueled by the increasing mechanistic understanding of adverse outcome pathways (AOP) and the development of tests reflecting these mechanisms. Simple approaches to combine skin sensitization data sets, such as weight of evidence, fail due to problems in information redundancy and high dimensionality. The problem is further amplified when potency information (dose/response) of hazards would be estimated. Skin sensitization currently serves as the foster child for AOP and ITS development, as legislative pressures combined with a very good mechanistic understanding of contact dermatitis have led to test development and relatively large high-quality data sets. We curated such a data set and combined a recursive variable selection algorithm to evaluate the information available through in silico, in chemico and in vitro assays. Chemical similarity alone could not cluster chemicals' potency, and in vitro models consistently ranked high in recursive feature elimination. This allows reducing the number of tests included in an ITS. Next, we analyzed with a hidden Markov model that takes advantage of an intrinsic inter-relationship among the local lymph node assay classes, i.e. the monotonous connection between local lymph node assay and dose. The dose-informed random forest/hidden Markov model was superior to the dose-naive random forest model on all data sets. Although balanced accuracy improvement may seem small, this obscures the actual improvement in misclassifications as the dose-informed hidden Markov model strongly reduced "false-negatives" (i.e. extreme sensitizers as non-sensitizer) on all data sets. Copyright © 2015 John Wiley & Sons, Ltd.

  10. Effects of Land Cover on the Movement of Frugivorous Birds in a Heterogeneous Landscape.

    PubMed

    Da Silveira, Natalia Stefanini; Niebuhr, Bernardo Brandão S; Muylaert, Renata de Lara; Ribeiro, Milton Cezar; Pizo, Marco Aurélio

    2016-01-01

    Movement is a key spatiotemporal process that enables interactions between animals and other elements of nature. The understanding of animal trajectories and the mechanisms that influence them at the landscape level can yield insight into ecological processes and potential solutions to specific ecological problems. Based upon optimal foraging models and empirical evidence, we hypothesized that movement by thrushes is highly tortuous (low average movement speeds and homogeneous distribution of turning angles) inside forests, moderately tortuous in urban areas, which present intermediary levels of resources, and minimally tortuous (high movement speeds and turning angles next to 0 radians) in open matrix types (e.g., crops and pasture). We used data on the trajectories of two common thrush species (Turdus rufiventris and Turdus leucomelas) collected by radio telemetry in a fragmented region in Brazil. Using a maximum likelihood model selection approach we fit four probability distribution models to average speed data, considering short-tailed, long-tailed, and scale-free distributions (to represent different regimes of movement variation), and one distribution to relative angle data. Models included land cover type and distance from forest-matrix edges as explanatory variables. Speed was greater farther away from forest edges and increased faster inside forest habitat compared to urban and open matrices. However, turning angle was not influenced by land cover. Thrushes presented a very tortuous trajectory, with many displacements followed by turns near 180 degrees. Thrush trajectories resembled habitat and edge dependent, tortuous random walks, with a well-defined movement scale inside each land cover type. Although thrushes are habitat generalists, they showed a greater preference for forest edges, and thus may be considered edge specialists. 
Our results reinforce the importance of studying animal movement patterns in order to understand ecological processes such as seed dispersal in fragmented areas, where the percentage of remaining habitat is dwindling.

  11. A Machine Learning and Cross-Validation Approach for the Discrimination of Vegetation Physiognomic Types Using Satellite Based Multispectral and Multitemporal Data.

    PubMed

    Sharma, Ram C; Hara, Keitarou; Hirayama, Hidetake

    2017-01-01

    This paper presents the performance and evaluation of a number of machine learning classifiers for discriminating between vegetation physiognomic classes using satellite-based time series of surface reflectance data. The research dealt with the discrimination of six vegetation physiognomic classes: Evergreen Coniferous Forest, Evergreen Broadleaf Forest, Deciduous Coniferous Forest, Deciduous Broadleaf Forest, Shrubs, and Herbs. Rich-feature data were prepared from time series of the satellite data for the discrimination and cross-validation of the vegetation physiognomic types using a machine learning approach. A set of machine learning experiments comprised of a number of supervised classifiers with different model parameters was conducted to assess how the discrimination of vegetation physiognomic classes varies with classifiers, input features, and ground truth data size. The performance of each experiment was evaluated using the 10-fold cross-validation method. The experiment using the Random Forests classifier provided the highest overall accuracy (0.81) and kappa coefficient (0.78). However, accuracy metrics did not vary much with experiments. Accuracy metrics were found to be very sensitive to input features and size of ground truth data. The results obtained in the research are expected to be useful for improving the vegetation physiognomic mapping in Japan.
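
    The two metrics quoted in this record, overall accuracy and the kappa coefficient, together with a 10-fold split, can be sketched as follows. This is a minimal illustration assuming Cohen's kappa and a simple interleaved fold assignment; the abstract does not say how folds were drawn, and the labels and function names are hypothetical.

```python
from collections import Counter


def overall_accuracy(y_true, y_pred):
    """Share of samples whose predicted class equals the reference class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    p_o = overall_accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    # Chance agreement from the marginal class frequencies.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        return float("nan")  # undefined when only one class ever occurs
    return (p_o - p_e) / (1.0 - p_e)


def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists; sample i is tested in fold i % k."""
    for fold in range(k):
        test = [i for i in range(n_samples) if i % k == fold]
        train = [i for i in range(n_samples) if i % k != fold]
        yield train, test


# Hypothetical reference and predicted labels (illustrative only).
reference = ["Forest", "Forest", "Shrubs", "Shrubs"]
predicted = ["Forest", "Forest", "Shrubs", "Forest"]
print(overall_accuracy(reference, predicted))  # 0.75
print(cohen_kappa(reference, predicted))       # 0.5
```

    Kappa is lower than overall accuracy here because half of the observed agreement would be expected by chance given the class frequencies, which is why the two values are usually reported together.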

  12. Linear Subpixel Learning Algorithm for Land Cover Classification from WELD using High Performance Computing

    NASA Technical Reports Server (NTRS)

    Kumar, Uttam; Nemani, Ramakrishna R.; Ganguly, Sangram; Kalia, Subodh; Michaelis, Andrew

    2017-01-01

    In this work, we use a Fully Constrained Least Squares Subpixel Learning Algorithm to unmix global WELD (Web Enabled Landsat Data) to obtain fractions or abundances of substrate (S), vegetation (V) and dark objects (D) classes. Because of the sheer nature of data and compute needs, we leveraged the NASA Earth Exchange (NEX) high performance computing architecture to optimize and scale our algorithm for large-scale processing. Subsequently, the S-V-D abundance maps were characterized into 4 classes, namely forest, farmland, water and urban areas (with NPP-VIIRS, the national polar orbiting partnership visible infrared imaging radiometer suite, nighttime lights data), over California, USA using a Random Forest classifier. Validation of these land cover maps with NLCD (National Land Cover Database) 2011 products and NAFD (North American Forest Dynamics) static forest cover maps showed that an overall classification accuracy of over 91 percent was achieved, which is a 6 percent improvement in unmixing-based classification relative to per-pixel-based classification. As such, abundance maps continue to offer a useful alternative to high-spatial resolution data derived classification maps for forest inventory analysis, multi-class mapping for eco-climatic models and applications, fast multi-temporal trend analysis and for societal and policy-relevant applications needed at the watershed scale.

  13. Linear Subpixel Learning Algorithm for Land Cover Classification from WELD using High Performance Computing

    NASA Astrophysics Data System (ADS)

    Ganguly, S.; Kumar, U.; Nemani, R. R.; Kalia, S.; Michaelis, A.

    2017-12-01

    In this work, we use a Fully Constrained Least Squares Subpixel Learning Algorithm to unmix global WELD (Web Enabled Landsat Data) to obtain fractions or abundances of substrate (S), vegetation (V) and dark objects (D) classes. Because of the sheer scale of the data and compute needs, we leveraged the NASA Earth Exchange (NEX) high performance computing architecture to optimize and scale our algorithm for large-scale processing. Subsequently, the S-V-D abundance maps were characterized into four classes, namely forest, farmland, water, and urban areas (with NPP-VIIRS, the National Polar-orbiting Partnership Visible Infrared Imaging Radiometer Suite, nighttime lights data), over California, USA, using a Random Forest classifier. Validation of these land cover maps with NLCD (National Land Cover Database) 2011 products and NAFD (North American Forest Dynamics) static forest cover maps showed that an overall classification accuracy of over 91% was achieved, which is a 6% improvement in unmixing-based classification relative to per-pixel-based classification. As such, abundance maps continue to offer a useful alternative to classification maps derived from high-spatial-resolution data for forest inventory analysis, multi-class mapping for eco-climatic models and applications, fast multi-temporal trend analysis, and societal and policy-relevant applications needed at the watershed scale.

  14. Evaluating the Effectiveness of Flood Control Strategies in Contrasting Urban Watersheds and Implications for Houston's Future Flood Vulnerability

    NASA Astrophysics Data System (ADS)

    Ganguly, S.; Kumar, U.; Nemani, R. R.; Kalia, S.; Michaelis, A.

    2016-12-01

    In this work, we use a Fully Constrained Least Squares Subpixel Learning Algorithm to unmix global WELD (Web Enabled Landsat Data) to obtain fractions or abundances of substrate (S), vegetation (V) and dark objects (D) classes. Because of the sheer scale of the data and compute needs, we leveraged the NASA Earth Exchange (NEX) high performance computing architecture to optimize and scale our algorithm for large-scale processing. Subsequently, the S-V-D abundance maps were characterized into four classes, namely forest, farmland, water, and urban areas (with NPP-VIIRS, the National Polar-orbiting Partnership Visible Infrared Imaging Radiometer Suite, nighttime lights data), over California, USA, using a Random Forest classifier. Validation of these land cover maps with NLCD (National Land Cover Database) 2011 products and NAFD (North American Forest Dynamics) static forest cover maps showed that an overall classification accuracy of over 91% was achieved, which is a 6% improvement in unmixing-based classification relative to per-pixel-based classification. As such, abundance maps continue to offer a useful alternative to classification maps derived from high-spatial-resolution data for forest inventory analysis, multi-class mapping for eco-climatic models and applications, fast multi-temporal trend analysis, and societal and policy-relevant applications needed at the watershed scale.

  15. Assessing the accuracy and stability of variable selection methods for random forest modeling in ecology.

    PubMed

    Fox, Eric W; Hill, Ryan A; Leibowitz, Scott G; Olsen, Anthony R; Thornbrugh, Darren J; Weber, Marc H

    2017-07-01

    Random forest (RF) modeling has emerged as an important statistical learning method in ecology due to its exceptional predictive performance. However, for large and complex ecological data sets, there is limited guidance on variable selection methods for RF modeling. Typically, either a preselected set of predictor variables is used or stepwise procedures are employed that iteratively remove variables according to their importance measures. This paper investigates the application of variable selection methods to RF models for predicting probable biological stream condition. Our motivating data set consists of the good/poor condition of n = 1365 stream survey sites from the 2008/2009 National Rivers and Streams Assessment, and a large set (p = 212) of landscape features from the StreamCat data set as potential predictors. We compare two types of RF models: a full variable set model with all 212 predictors and a reduced variable set model selected using a backward elimination approach. We assess model accuracy using RF's internal out-of-bag estimate, and a cross-validation procedure with validation folds external to the variable selection process. We also assess the stability of the spatial predictions generated by the RF models to changes in the number of predictors and argue that model selection needs to consider both accuracy and stability. The results suggest that RF modeling is robust to the inclusion of many variables of moderate to low importance. We found no substantial improvement in cross-validated accuracy as a result of variable reduction. Moreover, the backward elimination procedure tended to select too few variables and exhibited numerous issues such as upwardly biased out-of-bag accuracy estimates and instabilities in the spatial predictions. We use simulations to further support and generalize results from the analysis of real data. A main purpose of this work is to elucidate issues of model selection bias and instability to ecologists interested in using RF to develop predictive models with large environmental data sets.
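
    The phrase "validation folds external to the variable selection process" means that variable selection is rerun inside every training fold, so held-out sites never influence which predictors are kept. The sketch below shows only that control flow, under stated assumptions: `external_cv` and its three callbacks (`select_vars`, `fit`, `evaluate`) are hypothetical names, and the stubs stand in for a real backward elimination routine and RF model.

```python
def external_cv(n_rows, n_folds, select_vars, fit, evaluate):
    """Cross-validate with variable selection repeated inside each training fold."""
    # Assign rows to folds in round-robin order (row i goes to fold i % n_folds).
    folds = [list(range(start, n_rows, n_folds)) for start in range(n_folds)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n_rows) if i not in held]
        chosen = select_vars(train)       # selection sees training rows only
        model = fit(train, chosen)        # model is fitted on the same training rows
        scores.append(evaluate(model, held_out))  # scored on rows unseen so far
    return scores


# Stub callbacks for illustration; they record what each step was given.
rows_seen_by_selection = []

def stub_select(train_rows):
    rows_seen_by_selection.append(set(train_rows))
    return ["x1", "x2"]                   # hypothetical predictor names

def stub_fit(train_rows, variables):
    return {"n_train": len(train_rows), "variables": variables}

def stub_evaluate(model, held_out_rows):
    return len(held_out_rows)             # placeholder for an accuracy measure

fold_scores = external_cv(10, 5, stub_select, stub_fit, stub_evaluate)
print(fold_scores)
```

    Running selection once on all rows and then cross-validating the reduced model would let the held-out rows leak into the choice of predictors, which is the kind of selection bias the abstract warns about.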

  16. Machine-Learning Techniques for the Determination of Attrition of Forces Due to Atmospheric Conditions

    DTIC Science & Technology

    2018-02-01

    the possibility of a correlation between aircraft incidents in the National Transportation Safety Board database and meteorological conditions. If a...strong correlation could be found, it could be used to derive a model to predict aircraft incidents and become part of a decision support tool for...techniques, primarily the random forest algorithm, were used to explore the possibility of a correlation between aircraft incidents in the National

  17. A Predictive Analysis of the Department of Defense Distribution System Utilizing Random Forests

    DTIC Science & Technology

    2016-06-01

    resources capable of meeting both customer and individual resource constraints and goals while also maximizing the global benefit to the supply...and probability rules to determine the optimal red wine distribution network for an Italian-based wine producer. The decision support model for...combinations of factors that will result in delivery of the highest quality wines. The model’s first stage inputs basic logistics information to look

  18. Prediction of In-hospital Mortality in Emergency Department Patients With Sepsis: A Local Big Data-Driven, Machine Learning Approach.

    PubMed

    Taylor, R Andrew; Pare, Joseph R; Venkatesh, Arjun K; Mowafi, Hani; Melnick, Edward R; Fleischman, William; Hall, M Kennedy

    2016-03-01

    Predictive analytics in emergency care has mostly been limited to the use of clinical decision rules (CDRs) in the form of simple heuristics and scoring systems. In the development of CDRs, limitations in analytic methods and concerns with usability have generally constrained models to a preselected small set of variables judged to be clinically relevant and to rules that are easily calculated. Furthermore, CDRs frequently suffer from questions of generalizability, take years to develop, and lack the ability to be updated as new information becomes available. Newer analytic and machine learning techniques capable of harnessing the large number of variables that are already available through electronic health records (EHRs) may better predict patient outcomes and facilitate automation and deployment within clinical decision support systems. In this proof-of-concept study, a local, big data-driven, machine learning approach is compared to existing CDRs and traditional analytic methods using the prediction of sepsis in-hospital mortality as the use case. This was a retrospective study of adult ED visits admitted to the hospital meeting criteria for sepsis from October 2013 to October 2014. Sepsis was defined as meeting criteria for systemic inflammatory response syndrome with an infectious admitting diagnosis in the ED. ED visits were randomly partitioned into an 80%/20% split for training and validation. A random forest model (machine learning approach) was constructed using over 500 clinical variables from data available within the EHRs of four hospitals to predict in-hospital mortality. The machine learning prediction model was then compared to a classification and regression tree (CART) model, logistic regression model, and previously developed prediction tools on the validation data set using area under the receiver operating characteristic curve (AUC) and chi-square statistics. There were 5,278 visits among 4,676 unique patients who met criteria for sepsis. Of the 4,222 patients in the training group, 210 (5.0%) died during hospitalization, and of the 1,056 patients in the validation group, 50 (4.7%) died during hospitalization. The AUCs with 95% confidence intervals (CIs) for the different models were as follows: random forest model, 0.86 (95% CI = 0.82 to 0.90); CART model, 0.69 (95% CI = 0.62 to 0.77); logistic regression model, 0.76 (95% CI = 0.69 to 0.82); CURB-65, 0.73 (95% CI = 0.67 to 0.80); MEDS, 0.71 (95% CI = 0.63 to 0.77); and mREMS, 0.72 (95% CI = 0.65 to 0.79). The random forest model AUC was statistically different from all other models (p ≤ 0.003 for all comparisons). In this proof-of-concept study, a local big data-driven, machine learning approach outperformed existing CDRs as well as traditional analytic techniques for predicting in-hospital mortality of ED patients with sepsis. Future research should prospectively evaluate the effectiveness of this approach and whether it translates into improved clinical outcomes for high-risk sepsis patients. The methods developed serve as an example of a new model for predictive analytics in emergency care that can be automated, applied to other clinical outcomes of interest, and deployed in EHRs to enable locally relevant clinical predictions. © 2015 by the Society for Academic Emergency Medicine.
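
    The models above are compared by AUC, which can be read as the probability that a randomly chosen patient who died received a higher risk score than a randomly chosen survivor. The following minimal sketch computes AUC from that pairwise definition and rechecks the cohort arithmetic quoted in the abstract; the function name and the six scores and labels are hypothetical, while the cohort counts are taken from the abstract.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels are 1 for the outcome (e.g. death) and 0 otherwise; ties count as half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical risk scores and outcomes (not data from the study).
example_scores = [0.90, 0.80, 0.35, 0.60, 0.20, 0.10]
example_labels = [1, 1, 1, 0, 0, 0]
example_auc = auc(example_scores, example_labels)

# Cohort counts as reported in the abstract.
n_train, deaths_train = 4222, 210
n_valid, deaths_valid = 1056, 50
train_mortality_pct = round(100 * deaths_train / n_train, 1)
valid_mortality_pct = round(100 * deaths_valid / n_valid, 1)
valid_fraction = n_valid / (n_train + n_valid)
print(example_auc, train_mortality_pct, valid_mortality_pct, round(valid_fraction, 2))
```

    The pairwise loop is quadratic in the number of cases, which is acceptable for a sketch; rank-based formulas give the same value more efficiently on large cohorts.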

  19. Prediction of In-hospital Mortality in Emergency Department Patients With Sepsis: A Local Big Data–Driven, Machine Learning Approach

    PubMed Central

    Taylor, R. Andrew; Pare, Joseph R.; Venkatesh, Arjun K.; Mowafi, Hani; Melnick, Edward R.; Fleischman, William; Hall, M. Kennedy

    2018-01-01

    Objectives: Predictive analytics in emergency care has mostly been limited to the use of clinical decision rules (CDRs) in the form of simple heuristics and scoring systems. In the development of CDRs, limitations in analytic methods and concerns with usability have generally constrained models to a preselected small set of variables judged to be clinically relevant and to rules that are easily calculated. Furthermore, CDRs frequently suffer from questions of generalizability, take years to develop, and lack the ability to be updated as new information becomes available. Newer analytic and machine learning techniques capable of harnessing the large number of variables that are already available through electronic health records (EHRs) may better predict patient outcomes and facilitate automation and deployment within clinical decision support systems. In this proof-of-concept study, a local, big data–driven, machine learning approach is compared to existing CDRs and traditional analytic methods using the prediction of sepsis in-hospital mortality as the use case. Methods: This was a retrospective study of adult ED visits admitted to the hospital meeting criteria for sepsis from October 2013 to October 2014. Sepsis was defined as meeting criteria for systemic inflammatory response syndrome with an infectious admitting diagnosis in the ED. ED visits were randomly partitioned into an 80%/20% split for training and validation. A random forest model (machine learning approach) was constructed using over 500 clinical variables from data available within the EHRs of four hospitals to predict in-hospital mortality. The machine learning prediction model was then compared to a classification and regression tree (CART) model, logistic regression model, and previously developed prediction tools on the validation data set using area under the receiver operating characteristic curve (AUC) and chi-square statistics. Results: There were 5,278 visits among 4,676 unique patients who met criteria for sepsis. Of the 4,222 patients in the training group, 210 (5.0%) died during hospitalization, and of the 1,056 patients in the validation group, 50 (4.7%) died during hospitalization. The AUCs with 95% confidence intervals (CIs) for the different models were as follows: random forest model, 0.86 (95% CI = 0.82 to 0.90); CART model, 0.69 (95% CI = 0.62 to 0.77); logistic regression model, 0.76 (95% CI = 0.69 to 0.82); CURB-65, 0.73 (95% CI = 0.67 to 0.80); MEDS, 0.71 (95% CI = 0.63 to 0.77); and mREMS, 0.72 (95% CI = 0.65 to 0.79). The random forest model AUC was statistically different from all other models (p ≤ 0.003 for all comparisons). Conclusions: In this proof-of-concept study, a local big data–driven, machine learning approach outperformed existing CDRs as well as traditional analytic techniques for predicting in-hospital mortality of ED patients with sepsis. Future research should prospectively evaluate the effectiveness of this approach and whether it translates into improved clinical outcomes for high-risk sepsis patients. The methods developed serve as an example of a new model for predictive analytics in emergency care that can be automated, applied to other clinical outcomes of interest, and deployed in EHRs to enable locally relevant clinical predictions. PMID: 26679719

  20. Impacts of Landscape Context on Patterns of Wind Downfall Damage in a Fragmented Amazonian Landscape

    NASA Astrophysics Data System (ADS)

    Schwartz, N.; Uriarte, M.; DeFries, R. S.; Gutierrez-Velez, V. H.; Fernandes, K.; Pinedo-Vasquez, M.

    2015-12-01

    Wind is a major disturbance in the Amazon and has both short-term impacts and lasting legacies in tropical forests. Observed patterns of damage across landscapes result from differences in wind exposure and stand characteristics, such as tree stature, species traits, successional age, and fragmentation. Wind disturbance has important consequences for biomass dynamics in Amazonian forests, and understanding the spatial distribution and size of impacts is necessary to quantify the effects on carbon dynamics. In November 2013, a mesoscale convective system was observed over the study area in Ucayali, Peru, a highly human modified and fragmented forest landscape. We mapped downfall damage associated with the storm in order to ask: how does the severity of damage vary within forest patches, and across forest patches of different sizes and successional ages? We applied spectral mixture analysis to Landsat images from 2013 and 2014 to calculate the change in non-photosynthetic vegetation fraction after the storm, and combined it with C-band SAR data from the Sentinel-1 satellite to predict downfall damage measured in 30 field plots using random forest regression. We then applied this model to map damage in forests across the study area. Using a land cover classification developed in a previous study, we mapped secondary and mature forest, and compared the severity of damage in the two. We found that damage was on average higher in secondary forests, but patterns varied spatially. This study demonstrates the utility of using multiple sources of satellite data for mapping wind disturbance, and adds to our understanding of the sources of variation in wind-related damage. Ultimately, an improved ability to map wind impacts and a better understanding of their spatial patterns can contribute to better quantification of carbon dynamics in Amazonian landscapes.
