NASA Astrophysics Data System (ADS)
Bassa, Zaakirah; Bob, Urmilla; Szantoi, Zoltan; Ismail, Riyad
2016-01-01
In recent years, the popularity of tree-based ensemble methods for land cover classification has increased significantly. Using WorldView-2 image data, we evaluate the potential of the oblique random forest algorithm (oRF) to classify a highly heterogeneous protected area. In contrast to the random forest (RF) algorithm, the oRF algorithm builds multivariate trees by learning the optimal split using a supervised model. The oRF binary algorithm is adapted to a multiclass land cover and land use application using both the "one-against-one" and "one-against-all" combination approaches. Results show that the oRF algorithms are capable of achieving high classification accuracies (>80%). However, there was no statistical difference in classification accuracies obtained by the oRF algorithms and the more popular RF algorithm. For all the algorithms, user accuracies (UAs) and producer accuracies (PAs) >80% were recorded for most of the classes. Both the RF and oRF algorithms poorly classified the indigenous forest class as indicated by the low UAs and PAs. Finally, the results from this study advocate and support the utility of the oRF algorithm for land cover and land use mapping of protected areas using WorldView-2 image data.
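The abstract above adapts a binary oblique random forest to a multiclass problem via "one-against-one" and "one-against-all" combinations. As an illustrative sketch only (scikit-learn has no oblique random forest, so a standard RandomForestClassifier stands in for the binary oRF learner, and the data are synthetic placeholders for WorldView-2 features), the two combination schemes look like this:

```python
# Hedged sketch: standard RF stands in for the binary oRF base learner to show the
# "one-against-one" and "one-against-all" multiclass combination approaches.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical stand-in for WorldView-2 band/index features and land cover labels
X, y = make_classification(n_samples=600, n_features=8, n_informative=6,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base = RandomForestClassifier(n_estimators=200, random_state=0)
for name, clf in [("one-against-one", OneVsOneClassifier(base)),
                  ("one-against-all", OneVsRestClassifier(base))]:
    clf.fit(X_train, y_train)
    print(name, accuracy_score(y_test, clf.predict(X_test)))
```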
Random forest (RF) modeling has emerged as an important statistical learning method in ecology due to its exceptional predictive performance. However, for large and complex ecological datasets there is limited guidance on variable selection methods for RF modeling. Typically, e...
RF-Phos: A Novel General Phosphorylation Site Prediction Tool Based on Random Forest.
Ismail, Hamid D; Jones, Ahoi; Kim, Jung H; Newman, Robert H; Kc, Dukka B
2016-01-01
Protein phosphorylation is one of the most widespread regulatory mechanisms in eukaryotes. Over the past decade, phosphorylation site prediction has emerged as an important problem in the field of bioinformatics. Here, we report a new method, termed Random Forest-based Phosphosite predictor 2.0 (RF-Phos 2.0), to predict phosphorylation sites given only the primary amino acid sequence of a protein as input. RF-Phos 2.0, which uses random forest with sequence and structural features, is able to identify putative sites of phosphorylation across many protein families. In side-by-side comparisons based on 10-fold cross validation and an independent dataset, RF-Phos 2.0 compares favorably to other popular mammalian phosphosite prediction methods, such as PhosphoSVM, GPS2.1, and Musite.
Random forest (RF) is popular in ecological and environmental modeling, in part, because of its insensitivity to correlated predictors and resistance to overfitting. Although variable selection has been proposed to improve both performance and interpretation of RF models, it is u...
Ma, Li; Fan, Suohai
2017-03-14
The random forests algorithm is a classifier with prominent universality, a wide application range, and robustness against overfitting, but it still has some drawbacks. Therefore, to improve the performance of random forests, this paper addresses imbalanced data processing, feature selection, and parameter optimization. We propose the CURE-SMOTE algorithm for the imbalanced data classification problem. Experiments on imbalanced UCI data reveal that combining Clustering Using Representatives (CURE) with the original synthetic minority oversampling technique (SMOTE) is effective compared with classification results on the original data using random sampling, Borderline-SMOTE1, safe-level SMOTE, C-SMOTE, and k-means-SMOTE. Additionally, a hybrid RF (random forests) algorithm is proposed for feature selection and parameter optimization, which uses the minimum out-of-bag (OOB) data error as its objective function. Simulation results on binary and higher-dimensional data indicate that the proposed hybrid RF algorithms (hybrid genetic-random forests, hybrid particle swarm-random forests, and hybrid fish swarm-random forests) can achieve the minimum OOB error and show the best generalization ability. The training set produced by the proposed CURE-SMOTE algorithm is closer to the original data distribution because it contains minimal noise; thus, this feasible and effective algorithm produces better classification results. Moreover, the F-value, G-mean, AUC, and OOB scores of the hybrid algorithms surpass those of the original RF algorithm. Hence, the hybrid algorithm provides a new way to perform feature selection and parameter optimization.
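The central idea above is to tune the forest by minimizing the out-of-bag (OOB) error. A minimal sketch of that objective follows; the paper couples it with genetic, particle-swarm, or fish-swarm search, whereas a plain grid stands in for the optimizer here, and the imbalanced data are synthetic:

```python
# Sketch of using minimum out-of-bag error as the tuning objective (optimizer is a
# simple grid here, not the paper's evolutionary/swarm search).
import itertools
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, weights=[0.9, 0.1],
                           random_state=0)  # imbalanced toy data

best = None
for n_trees, max_feats in itertools.product([100, 300, 500], [2, 4, 8]):
    rf = RandomForestClassifier(n_estimators=n_trees, max_features=max_feats,
                                oob_score=True, bootstrap=True, random_state=0)
    rf.fit(X, y)
    oob_error = 1.0 - rf.oob_score_          # objective to minimize
    if best is None or oob_error < best[0]:
        best = (oob_error, n_trees, max_feats)

print("min OOB error %.3f at n_estimators=%d, max_features=%d" % best)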
NASA Astrophysics Data System (ADS)
Ahmed, Oumer S.; Franklin, Steven E.; Wulder, Michael A.; White, Joanne C.
2015-03-01
Many forest management activities, including the development of forest inventories, require spatially detailed forest canopy cover and height data. Among the various remote sensing technologies, LiDAR (Light Detection and Ranging) offers the most accurate and consistent means of obtaining reliable canopy structure measurements. A potential solution to reduce the cost of LiDAR data is to integrate transects (samples) of LiDAR data with frequently acquired and spatially comprehensive optical remotely sensed data. Although multiple regression is commonly used for such modeling, it often does not fully capture the complex relationships between forest structure variables. This study investigates the potential of Random Forest (RF), a machine learning technique, to estimate LiDAR-measured canopy structure using a time series of Landsat imagery. The study is implemented over a 2600 ha area of industrially managed coastal temperate forests on Vancouver Island, British Columbia, Canada. We implemented a trajectory-based approach to time series analysis that generates time since disturbance (TSD) and disturbance intensity information for each pixel, and we used this information to stratify the forest land base into two strata: mature forests and young forests. Canopy cover and height for three forest classes (i.e., mature, young, and combined mature and young) were modeled separately using multiple regression and Random Forest (RF) techniques. For all forest classes, the RF models provided improved estimates relative to the multiple regression models. The lowest validation error was obtained for the mature forest stratum with an RF model (R2 = 0.88, RMSE = 2.39 m and bias = -0.16 for canopy height; R2 = 0.72, RMSE = 0.068% and bias = -0.0049 for canopy cover). This study demonstrates the value of using disturbance and successional history to inform estimates of canopy structure and obtain improved estimates of forest canopy cover and height using the RF algorithm.
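The comparison described above (RF regression vs. multiple regression for a LiDAR-derived canopy height target) can be sketched as below. This is not the authors' code: the predictors and the height relationship are synthetic stand-ins for the Landsat time-series variables.

```python
# Minimal sketch: Random Forest regression vs. multiple linear regression for a
# canopy-height target, on synthetic stand-ins for spectral/TSD predictors.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))                       # hypothetical predictors
height = 20 + 5 * X[:, 0] - 3 * X[:, 1] ** 2 + rng.normal(scale=2, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, height, random_state=0)
for name, model in [("multiple regression", LinearRegression()),
                    ("random forest", RandomForestRegressor(n_estimators=300, random_state=0))]:
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    rmse = mean_squared_error(y_te, pred) ** 0.5
    print(f"{name}: R2={r2_score(y_te, pred):.2f} RMSE={rmse:.2f} m")
```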
E. Freeman; G. Moisen; J. Coulston; B. Wilson
2014-01-01
Random forests (RF) and stochastic gradient boosting (SGB), both involving an ensemble of classification and regression trees, are compared for modeling tree canopy cover for the 2011 National Land Cover Database (NLCD). The objectives of this study were twofold. First, sensitivity of RF and SGB to choices in tuning parameters was explored. Second, performance of the...
Sarica, Alessia; Cerasa, Antonio; Quattrone, Aldo
2017-01-01
Objective: Machine learning classification has been the most important computational development in recent years to satisfy the primary need of clinicians for automatic early diagnosis and prognosis. Nowadays, the Random Forest (RF) algorithm has been successfully applied for reducing high dimensional and multi-source data in many scientific realms. Our aim was to explore the state of the art of the application of RF on single and multi-modal neuroimaging data for the prediction of Alzheimer's disease. Methods: A systematic review following PRISMA guidelines was conducted on this field of study. In particular, we constructed an advanced query using boolean operators as follows: ("random forest" OR "random forests") AND neuroimaging AND ("alzheimer's disease" OR alzheimer's OR alzheimer) AND (prediction OR classification). The query was then searched in four well-known scientific databases: Pubmed, Scopus, Google Scholar and Web of Science. Results: Twelve articles, published between 2007 and 2017, have been included in this systematic review after a quantitative and qualitative selection. The lessons learnt from these works suggest that when RF is applied on multi-modal data for prediction of Alzheimer's disease (AD) conversion from Mild Cognitive Impairment (MCI), it produces one of the best accuracies to date. Moreover, RF has important advantages in terms of robustness to overfitting, ability to handle highly non-linear data, stability in the presence of outliers, and opportunity for efficient parallel processing, mainly when applied to multi-modality neuroimaging data such as MRI morphometric, diffusion tensor imaging, and PET images. Conclusions: We discussed the strengths of RF, considering also possible limitations, and encourage further studies on the comparison of this algorithm with other commonly used classification approaches, particularly in the early prediction of the progression from MCI to AD.
NASA Astrophysics Data System (ADS)
Goudarzi, Nasser
2016-04-01
In this work, two new and powerful chemometrics methods are applied for the modeling and prediction of the 19F chemical shift values of some fluorinated organic compounds. The radial basis function-partial least square (RBF-PLS) and random forest (RF) methods are employed to construct the models to predict the 19F chemical shifts. No separate variable selection method was used in this study, because the RF method can serve as both a variable selection and a modeling technique. The effects of the important parameters governing the RF prediction power, namely the number of trees (nt) and the number of randomly selected variables to split each node (m), were investigated. The root-mean-square errors of prediction (RMSEP) for the training set and the prediction set for the RBF-PLS and RF models were 44.70, 23.86, 29.77, and 23.69, respectively. Also, the correlation coefficients of the prediction set for the RBF-PLS and RF models were 0.8684 and 0.9313, respectively. The results obtained reveal that the RF model can be used as a powerful chemometrics tool for quantitative structure-property relationship (QSPR) studies.
NASA Astrophysics Data System (ADS)
Zafari, A.; Zurita-Milla, R.; Izquierdo-Verdiguier, E.
2017-10-01
Crop maps are essential inputs for the agricultural planning done at various governmental and agribusiness agencies. Remote sensing offers timely and cost-efficient technologies to identify and map crop types over large areas. Among the plethora of classification methods, Support Vector Machine (SVM) and Random Forest (RF) are widely used because of their proven performance. In this work, we study the synergic use of both methods by introducing a random forest kernel (RFK) in an SVM classifier. A time series of multispectral WorldView-2 images acquired over Mali (West Africa) in 2014 was used to develop our case study. Ground truth containing five common crop classes (cotton, maize, millet, peanut, and sorghum) was collected at 45 farms and used to train and test the classifiers. An SVM with the standard Radial Basis Function (RBF) kernel, an RF, and an SVM-RFK were trained and tested over 10 random training and test subsets generated from the ground data. Results show that the newly proposed SVM-RFK classifier can compete with both RF and SVM-RBF. The overall accuracies based on the spectral bands only are 83%, 82%, and 83%, respectively. Adding vegetation indices to the analysis results in classification accuracies of 82%, 81%, and 84% for SVM-RFK, RF, and SVM-RBF, respectively. Overall, the newly tested RFK can compete with the SVM-RBF and RF classifiers in terms of classification accuracy.
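A hedged sketch of one common way to build a random forest kernel for an SVM follows. The paper's exact RFK construction may differ; leaf co-occurrence proximity is the usual choice and can be passed to an SVM as a precomputed kernel. Data here are synthetic stand-ins for the five crop classes.

```python
# Illustrative RF-kernel sketch: proximity = fraction of trees in which two samples
# share a leaf, used as a precomputed SVM kernel.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=400, n_features=10, n_classes=5,
                           n_informative=8, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
leaves_tr = rf.apply(X_tr)          # (n_train, n_trees) leaf indices
leaves_te = rf.apply(X_te)

def rf_kernel(A, B):
    # Proximity: share of trees in which the two samples fall into the same leaf.
    return (A[:, None, :] == B[None, :, :]).mean(axis=2)

svm = SVC(kernel="precomputed").fit(rf_kernel(leaves_tr, leaves_tr), y_tr)
pred = svm.predict(rf_kernel(leaves_te, leaves_tr))
print("SVM-RFK accuracy:", accuracy_score(y_te, pred))
```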
An assessment of the effectiveness of a random forest classifier for land-cover classification
NASA Astrophysics Data System (ADS)
Rodriguez-Galiano, V. F.; Ghimire, B.; Rogan, J.; Chica-Olmo, M.; Rigol-Sanchez, J. P.
2012-01-01
Land cover monitoring using remotely sensed data requires robust classification methods which allow for the accurate mapping of complex land cover and land use categories. Random forest (RF) is a powerful machine learning classifier that is relatively unknown in land remote sensing and has not been evaluated thoroughly by the remote sensing community compared to more conventional pattern recognition techniques. Key advantages of RF include its non-parametric nature, high classification accuracy, and capability to determine variable importance. However, the split rules for classification are unknown, so RF can be considered a black-box type of classifier. RF provides an algorithm for estimating missing values and the flexibility to perform several types of data analysis, including regression, classification, survival analysis, and unsupervised learning. In this paper, the performance of the RF classifier for land cover classification of a complex area is explored. Evaluation was based on several criteria: mapping accuracy, and sensitivity to data set size and noise. Landsat-5 Thematic Mapper data captured in European spring and summer were used with auxiliary variables derived from a digital terrain model to classify 14 different land categories in the south of Spain. Results show that the RF algorithm yields accurate land cover classifications, with 92% overall accuracy and a Kappa index of 0.92. RF is robust to training data reduction and noise, as significant differences in kappa values were only observed for data reduction and noise addition values greater than 50% and 20%, respectively. Additionally, the variables that RF identified as most important for classifying land cover coincided with expectations. A McNemar test indicates an overall better performance of the random forest model over a single decision tree at the 0.00001 significance level.
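Two elements mentioned above, the RF variable importance ranking and the McNemar comparison against a single decision tree, can be sketched as follows (not the study's code; data are synthetic, and the McNemar statistic is computed directly from the 2x2 table of discordant predictions with a continuity correction):

```python
# Sketch: RF feature importance ranking plus a McNemar test of RF vs. one tree.
import numpy as np
from scipy.stats import chi2
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, n_features=12, n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

print("top features:", np.argsort(rf.feature_importances_)[::-1][:5])

rf_ok = rf.predict(X_te) == y_te
tree_ok = tree.predict(X_te) == y_te
b = np.sum(rf_ok & ~tree_ok)             # RF correct, tree wrong
c = np.sum(~rf_ok & tree_ok)             # tree correct, RF wrong
stat = (abs(b - c) - 1) ** 2 / (b + c)   # McNemar statistic (continuity-corrected)
print("McNemar p-value:", chi2.sf(stat, df=1))
```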
Anantha M. Prasad; Louis R. Iverson; Andy Liaw; Andy Liaw
2006-01-01
We evaluated four statistical models - Regression Tree Analysis (RTA), Bagging Trees (BT), Random Forests (RF), and Multivariate Adaptive Regression Splines (MARS) - for predictive vegetation mapping under current and future climate scenarios according to the Canadian Climate Centre global circulation model.
NASA Astrophysics Data System (ADS)
Ma, L.; Zhou, M.; Li, C.
2017-09-01
In this study, a Random Forest (RF) based land cover classification method is presented to predict the types of land cover in the Miyun area. The returned full waveforms, acquired by a LiteMapper 5600 airborne LiDAR system, were processed through waveform filtering, waveform decomposition, and feature extraction. Commonly used features, namely distance, intensity, Full Width at Half Maximum (FWHM), skewness, and kurtosis, were extracted. These waveform features were used as attributes of the training data for generating the RF prediction model. The RF prediction model was applied to predict the land cover types in the Miyun area as trees, buildings, farmland, and ground. The classification results for these four land cover types were evaluated against ground truth information acquired from CCD image data of the same region. The RF classification results were compared with those of an SVM method and showed better performance. The RF classification accuracy reached 89.73% and the classification Kappa was 0.8631.
Nagasawa, Shinji; Al-Naamani, Eman; Saeki, Akinori
2018-05-17
Owing to the diverse chemical structures, organic photovoltaic (OPV) applications with a bulk heterojunction framework have greatly evolved over the last two decades, which has produced numerous organic semiconductors exhibiting improved power conversion efficiencies (PCEs). Despite the recent fast progress in materials informatics and data science, data-driven molecular design of OPV materials remains challenging. We report a screening of conjugated molecules for polymer-fullerene OPV applications by supervised learning methods (artificial neural network (ANN) and random forest (RF)). Approximately 1000 experimental parameters including PCE, molecular weight, and electronic properties are manually collected from the literature and subjected to machine learning with digitized chemical structures. Contrary to the low correlation coefficient in ANN, RF yields an acceptable accuracy, which is twice that of random classification. We demonstrate the application of RF screening for the design, synthesis, and characterization of a conjugated polymer, which facilitates a rapid development of optoelectronic materials.
Na, X D; Zang, S Y; Wu, C S; Li, W L
2015-11-01
Knowledge of the spatial extent of forested wetlands is essential to many studies including wetland functioning assessment, greenhouse gas flux estimation, and wildlife suitable habitat identification. For discriminating forested wetlands from their adjacent land cover types, researchers have resorted to image analysis techniques applied to numerous remotely sensed data. While these have had some success, there is still no consensus on the optimal approaches for mapping forested wetlands. To address this problem, we examined two machine learning approaches, random forest (RF) and K-nearest neighbor (KNN) algorithms, and applied these two approaches within the framework of pixel-based and object-based classifications. The RF and KNN algorithms were constructed using predictors derived from Landsat 8 imagery, Radarsat-2 advanced synthetic aperture radar (SAR), and topographical indices. The results show that the object-based classifications performed better than per-pixel classifications using the same algorithm (RF) in terms of overall accuracy, and the difference in their kappa coefficients is statistically significant (p<0.01). There were noticeable omissions for forested and herbaceous wetlands in the per-pixel classifications using the RF algorithm. As for the object-based image analysis, there were also statistically significant differences (p<0.01) in kappa coefficient between results based on the RF and KNN algorithms. The object-based classification using RF provided a more visually adequate distribution of the land cover types of interest, while the object-based classification using the KNN algorithm showed noticeable commissions for forested wetlands and omissions for agricultural land. This research shows that object-based classification with RF using optical, radar, and topographical data improved the mapping accuracy of land covers and provided a feasible approach to discriminating forested wetlands from the other land cover types in forested areas.
Nagao, Chioko; Nagano, Nozomi; Mizuguchi, Kenji
2014-01-01
Determining enzyme functions is essential for a thorough understanding of cellular processes. Although many prediction methods have been developed, it remains a significant challenge to predict enzyme functions at the fourth-digit level of the Enzyme Commission numbers. Functional specificity of enzymes often changes drastically by mutations of a small number of residues and therefore, information about these critical residues can potentially help discriminate detailed functions. However, because these residues must be identified by mutagenesis experiments, the available information is limited, and the lack of experimentally verified specificity determining residues (SDRs) has hindered the development of detailed function prediction methods and computational identification of SDRs. Here we present a novel method for predicting enzyme functions by random forests, EFPrf, along with a set of putative SDRs, the random forests derived SDRs (rf-SDRs). EFPrf consists of a set of binary predictors for enzymes in each CATH superfamily and the rf-SDRs are the residue positions corresponding to the most highly contributing attributes obtained from each predictor. EFPrf showed a precision of 0.98 and a recall of 0.89 in a cross-validated benchmark assessment. The rf-SDRs included many residues, whose importance for specificity had been validated experimentally. The analysis of the rf-SDRs revealed both a general tendency that functionally diverged superfamilies tend to include more active site residues in their rf-SDRs than in less diverged superfamilies, and superfamily-specific conservation patterns of each functional residue. EFPrf and the rf-SDRs will be an effective tool for annotating enzyme functions and for understanding how enzyme functions have diverged within each superfamily. PMID:24416252
NASA Astrophysics Data System (ADS)
Georganos, Stefanos; Grippa, Tais; Vanhuysse, Sabine; Lennert, Moritz; Shimoni, Michal; Wolff, Eléonore
2017-10-01
This study evaluates the impact of three Feature Selection (FS) algorithms in an Object Based Image Analysis (OBIA) framework for Very-High-Resolution (VHR) Land Use-Land Cover (LULC) classification. The three selected FS algorithms, Correlation Based Selection (CFS), Mean Decrease in Accuracy (MDA) and Random Forest (RF) based Recursive Feature Elimination (RFE), were tested on Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Random Forest (RF) classifiers. The results demonstrate that the accuracies of the SVM and KNN classifiers are the most sensitive to FS. The RF appeared to be more robust to high dimensionality, although a significant increase in accuracy was found by using the RFE method. In terms of classification accuracy, SVM performed the best using FS, followed by RF and KNN. Finally, only a small number of features is needed to achieve the highest performance with each classifier. This study emphasizes the benefits of rigorous FS for maximizing performance, as well as for minimizing model complexity and easing interpretation.
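The RF-based RFE step highlighted above can be sketched with scikit-learn's cross-validated recursive feature elimination; this is a minimal illustration, and the object features and data here are placeholders rather than the study's OBIA attributes.

```python
# Minimal sketch of RF-based recursive feature elimination (RFE) with cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=400, n_features=60, n_informative=10, random_state=0)

selector = RFECV(estimator=RandomForestClassifier(n_estimators=200, random_state=0),
                 step=5,                       # drop 5 features per iteration
                 cv=StratifiedKFold(5),
                 scoring="accuracy")
selector.fit(X, y)
print("features kept:", selector.n_features_)
X_reduced = selector.transform(X)              # reduced feature set for SVM / KNN / RF
```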
Nguyen, Thanh-Tung; Huang, Joshua; Wu, Qingyao; Nguyen, Thuy; Li, Mark
2015-01-01
Single-nucleotide polymorphism (SNP) selection and identification are the most important tasks in genome-wide association data analysis. The problem is difficult because genome-wide association data are very high dimensional and a large portion of the SNPs in the data are irrelevant to the disease. Advanced machine learning methods have been used successfully in genome-wide association studies (GWAS) for identification of genetic variants that have relatively big effects in some common, complex diseases. Among them, the most successful one is Random Forests (RF). Despite performing well in terms of prediction accuracy on some data sets of moderate size, RF still suffers when working in GWAS for selecting informative SNPs and building accurate prediction models. In this paper, we propose a new two-stage quality-based sampling method in random forests, named ts-RF, for SNP subspace selection in GWAS. The method first applies p-value assessment to find a cut-off point that separates informative and irrelevant SNPs into two groups. The informative SNP group is further divided into two sub-groups: highly informative and weakly informative SNPs. When sampling the SNP subspace for building trees for the forest, only SNPs from these two sub-groups are taken into account. The feature subspaces always contain highly informative SNPs when used to split a node of a tree. This approach enables one to generate more accurate trees with a lower prediction error, while possibly avoiding overfitting. It allows one to detect interactions of multiple SNPs with the diseases, and to reduce the dimensionality and the amount of genome-wide association data needed for learning the RF model. Extensive experiments on two genome-wide SNP data sets (Parkinson case-control data comprising 408,803 SNPs and Alzheimer case-control data comprising 380,157 SNPs) and 10 gene data sets have demonstrated that the proposed model significantly reduced prediction errors and outperformed most existing state-of-the-art random forests. The top 25 SNPs in the Parkinson data set identified by the proposed model include four interesting genes associated with neurological disorders. The presented approach has been shown to be effective in selecting informative sub-groups of SNPs potentially associated with diseases that traditional statistical approaches might fail to detect. The new RF works well for data where the number of case-control objects is much smaller than the number of SNPs, which is a typical problem in gene data and GWAS. Experimental results demonstrated the effectiveness of the proposed RF model, which outperformed state-of-the-art RFs, including Breiman's RF, GRRF and wsRF methods.
NASA Astrophysics Data System (ADS)
Su, Y.; Guo, Q.; Jin, S.; Gao, S.; Hu, T.; Liu, J.; Xue, B. L.
2017-12-01
Tree height is an important forest structure parameter for understanding forest ecosystem and improving the accuracy of global carbon stock quantification. Light detection and ranging (LiDAR) can provide accurate tree height measurements, but its use in large-scale tree height mapping is limited by the spatial availability. Random Forest (RF) has been one of the most commonly used algorithms for mapping large-scale tree height through the fusion of LiDAR and other remotely sensed datasets. However, how the variances in vegetation types, geolocations and spatial scales of different study sites influence the RF results is still a question that needs to be addressed. In this study, we selected 16 study sites across four vegetation types in United States (U.S.) fully covered by airborne LiDAR data, and the area of each site was 100 km2. The LiDAR-derived canopy height models (CHMs) were used as the ground truth to train the RF algorithm to predict canopy height from other remotely sensed variables, such as Landsat TM imagery, terrain information and climate surfaces. To address the abovementioned question, 22 models were run under different combinations of vegetation types, geolocations and spatial scales. The results show that the RF model trained at one specific location or vegetation type cannot be used to predict tree height in other locations or vegetation types. However, by training the RF model using samples from all locations and vegetation types, a universal model can be achieved for predicting canopy height across different locations and vegetation types. Moreover, the number of training samples and the targeted spatial resolution of the canopy height product have noticeable influence on the RF prediction accuracy.
Ozçift, Akin
2011-05-01
Supervised classification algorithms are commonly used in the design of computer-aided diagnosis systems. In this study, we present a resampling strategy based Random Forests (RF) ensemble classifier to improve diagnosis of cardiac arrhythmia. Random forests is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees. In this way, an RF ensemble classifier performs better than a single tree from a classification performance point of view. In general, multiclass datasets having an unbalanced distribution of sample sizes are difficult to analyze in terms of class discrimination. Cardiac arrhythmia is such a dataset, having multiple classes with small sample sizes, and it is therefore suitable for testing our resampling based training strategy. The dataset contains 452 samples in fourteen types of arrhythmias, and eleven of these classes have sample sizes of less than 15. Our diagnosis strategy consists of two parts: (i) a correlation based feature selection algorithm is used to select relevant features from the cardiac arrhythmia dataset; (ii) the RF machine learning algorithm is used to evaluate the performance of the selected features with and without simple random sampling, to evaluate the efficiency of the proposed training strategy. The resultant accuracy of the classifier is found to be 90.0%, which is a quite high diagnosis performance for cardiac arrhythmia. Furthermore, three case studies, i.e., thyroid, cardiotocography and audiology, are used to benchmark the effectiveness of the proposed method. The results of the experiments demonstrate the efficiency of the random sampling strategy in training the RF ensemble classification algorithm.
Dimitriadis, Stavros I; Liparas, Dimitris
2018-06-01
Neuroinformatics is a fascinating research field that applies computational models and analytical tools to high dimensional experimental neuroscience data for a better understanding of how the brain functions or dysfunctions in brain diseases. Neuroinformaticians work at the intersection of neuroscience and informatics, supporting the integration of various sub-disciplines (behavioural neuroscience, genetics, cognitive psychology, etc.) working on brain research. Neuroinformaticians are the pathway of information exchange between informaticians and clinicians for a better understanding of the outcome of computational models and the clinical interpretation of the analysis. Machine learning is one of the most significant computational developments of the last decade, giving neuroinformaticians, and ultimately radiologists and clinicians, tools for automatic and early diagnosis and prognosis of brain disease. The random forest (RF) algorithm has been successfully applied to high-dimensional neuroimaging data for feature reduction and has also been applied to classify the clinical label of a subject using single or multi-modal neuroimaging datasets. Our aim was to review the studies where RF was applied to correctly predict Alzheimer's disease (AD) and the conversion from mild cognitive impairment (MCI), and to assess its robustness to overfitting and outliers and its handling of non-linear data. Finally, we describe our RF-based model that gave us the 1st position in an international challenge for automated prediction of MCI from MRI data.
Random forests for classification in ecology
Cutler, D.R.; Edwards, T.C.; Beard, K.H.; Cutler, A.; Hess, K.T.; Gibson, J.; Lawler, J.J.
2007-01-01
Classification procedures are some of the most widely used statistical methods in ecology. Random forests (RF) is a new and powerful statistical classifier that is well established in other disciplines but is relatively unknown in ecology. Advantages of RF compared to other statistical classifiers include (1) very high classification accuracy; (2) a novel method of determining variable importance; (3) ability to model complex interactions among predictor variables; (4) flexibility to perform several types of statistical data analysis, including regression, classification, survival analysis, and unsupervised learning; and (5) an algorithm for imputing missing values. We compared the accuracies of RF and four other commonly used statistical classifiers using data on invasive plant species presence in Lava Beds National Monument, California, USA, rare lichen species presence in the Pacific Northwest, USA, and nest sites for cavity nesting birds in the Uinta Mountains, Utah, USA. We observed high classification accuracy in all applications as measured by cross-validation and, in the case of the lichen data, by independent test data, when comparing RF to other common classification methods. We also observed that the variables that RF identified as most important for classifying invasive plant species coincided with expectations based on the literature.
L.R. Iverson; A.M. Prasad; A. Liaw
2004-01-01
More and better machine learning tools are becoming available for landscape ecologists to aid in understanding species-environment relationships and to map probable species occurrence now and potentially into the future. To that end, we evaluated three statistical models: Regression Tree Analysis (RTA), Bagging Trees (BT) and Random Forest (RF) for their utility in...
Elizabeth A. Freeman; Gretchen G. Moisen; John W. Coulston; Barry T. (Ty) Wilson
2015-01-01
As part of the development of the 2011 National Land Cover Database (NLCD) tree canopy cover layer, a pilot project was launched to test the use of high-resolution photography coupled with extensive ancillary data to map the distribution of tree canopy cover over four study regions in the conterminous US. Two stochastic modeling techniques, random forests (RF...
NASA Astrophysics Data System (ADS)
Löw, Fabian; Schorcht, Gunther; Michel, Ulrich; Dech, Stefan; Conrad, Christopher
2012-10-01
Accurate crop identification and crop area estimation are important for studies on irrigated agricultural systems, yield and water demand modeling, and agrarian policy development. In this study a novel combination of Random Forest (RF) and Support Vector Machine (SVM) classifiers is presented that (i) enhances crop classification accuracy and (ii) provides spatial information on map uncertainty. The methodology was implemented over four distinct irrigated sites in Middle Asia using RapidEye time series data. The RF feature importance statistic was used as a feature-selection strategy for the SVM to assess possible negative effects on classification accuracy caused by an oversized feature space. The results of the individual RF and SVM classifications were combined with rules based on posterior classification probability and estimates of classification probability entropy. SVM classification performance was increased by feature selection through RF. Further experimental results indicate that the hybrid classifier improves overall classification accuracy as well as user's and producer's accuracy in comparison to the single classifiers.
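A hedged sketch of the two ingredients described above follows: RF importance as a feature filter for the SVM, and a simple posterior-probability rule for combining the two classifiers. The paper's exact entropy-based combination rule is not reproduced; the confidence rule and all data below are illustrative assumptions.

```python
# Sketch: RF importance as SVM feature filter, then per-sample combination of RF and
# SVM predictions by whichever classifier reports the higher posterior probability.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=40, n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
selector = SelectFromModel(rf, prefit=True, threshold=-np.inf, max_features=15)
svm = SVC(probability=True).fit(selector.transform(X_tr), y_tr)

p_rf = rf.predict_proba(X_te)
p_svm = svm.predict_proba(selector.transform(X_te))
# Illustrative combination rule: trust whichever classifier is more confident per sample.
pred = np.where(p_rf.max(axis=1) >= p_svm.max(axis=1),
                p_rf.argmax(axis=1), p_svm.argmax(axis=1))
print("hybrid accuracy:", (pred == y_te).mean())
```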
Introducing two Random Forest based methods for cloud detection in remote sensing images
NASA Astrophysics Data System (ADS)
Ghasemian, Nafiseh; Akhoondzadeh, Mehdi
2018-07-01
Cloud detection is a necessary phase in satellite image processing to retrieve atmospheric and lithospheric parameters. Currently, some cloud detection methods based on the Random Forest (RF) model have been proposed, but they do not consider both spectral and textural characteristics of the image. Furthermore, they have not been tested in the presence of snow/ice. In this paper, we introduce two RF based algorithms, Feature Level Fusion Random Forest (FLFRF) and Decision Level Fusion Random Forest (DLFRF), which incorporate visible, infrared (IR) and thermal spectral and textural features (FLFRF), including the Gray Level Co-occurrence Matrix (GLCM) and Robust Extended Local Binary Pattern (RELBP_CI), or visible, IR and thermal classifiers (DLFRF), for highly accurate cloud detection on remote sensing images. FLFRF first fuses visible, IR and thermal features. Thereafter, it uses the RF model to classify pixels into cloud, snow/ice and background, or thick cloud, thin cloud and background. DLFRF considers visible, IR and thermal features (both spectral and textural) separately and inserts each set of features into the RF model. Then, it holds the vote matrix of each run of the model. Finally, it fuses the classifiers using the majority vote method. To demonstrate the effectiveness of the proposed algorithms, 10 Terra MODIS and 15 Landsat 8 OLI/TIRS images with different spatial resolutions are used in this paper. Quantitative analyses are based on manually selected ground truth data. Results show that after adding RELBP_CI to the input feature set, cloud detection accuracy improves. Also, the average cloud kappa values of FLFRF and DLFRF on MODIS images (1 and 0.99) are higher than those of other machine learning methods: Linear Discriminant Analysis (LDA), Classification And Regression Tree (CART), K Nearest Neighbor (KNN) and Support Vector Machine (SVM) (0.96). The average snow/ice kappa values of FLFRF and DLFRF on MODIS images (1 and 0.85) are higher than those of other traditional methods. The quantitative values on Landsat 8 images show a similar trend. Consequently, while SVM and K-nearest neighbor show overestimation in predicting cloud and snow/ice pixels, our Random Forest (RF) based models achieve higher cloud and snow/ice kappa values on MODIS images and thin cloud, thick cloud and snow/ice kappa values on Landsat 8 images. Our algorithms predict both thin and thick cloud on Landsat 8 images, while the existing cloud detection algorithm, Fmask, cannot discriminate them. Compared to the state-of-the-art methods, our algorithms achieve higher average cloud and snow/ice kappa values for different spatial resolutions.
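The decision-level fusion idea described above can be illustrated with a simplified sketch: one RF per feature group, fused by majority vote. The feature groupings (stand-ins for the visible, IR and thermal sets) and the data are hypothetical.

```python
# Simplified decision-level fusion sketch: one RF per feature group, majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=900, n_features=18, n_informative=12,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
groups = [slice(0, 6), slice(6, 12), slice(12, 18)]      # "visible", "IR", "thermal"
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forests = [RandomForestClassifier(n_estimators=200, random_state=i).fit(X_tr[:, g], y_tr)
           for i, g in enumerate(groups)]

votes = np.stack([f.predict(X_te[:, g]) for f, g in zip(forests, groups)])  # (3, n_test)
fused = np.array([np.bincount(v).argmax() for v in votes.T])                # majority vote
print("decision-level fusion accuracy:", (fused == y_te).mean())
```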
Ramírez, J; Górriz, J M; Segovia, F; Chaves, R; Salas-Gonzalez, D; López, M; Alvarez, I; Padilla, P
2010-03-19
This letter presents a computer-aided diagnosis (CAD) technique for the early detection of Alzheimer's disease (AD) by means of single photon emission computed tomography (SPECT) image classification. The proposed method is based on a partial least squares (PLS) regression model and a random forest (RF) predictor. The challenge of the curse of dimensionality is addressed by reducing the large dimensionality of the input data through downscaling the SPECT images and extracting score features using PLS. An RF predictor then forms an ensemble of classification and regression tree (CART)-like classifiers, with its output determined by a majority vote of the trees in the forest. A baseline principal component analysis (PCA) system is also developed for reference. The experimental results show that the combined PLS-RF system yields a generalization error that converges to a limit as the number of trees in the forest increases. Thus, the generalization error is reduced when using PLS and depends on the strength of the individual trees in the forest and the correlation between them. Moreover, PLS feature extraction is found to be more effective for extracting discriminative information from the data than PCA, yielding peak sensitivity, specificity and accuracy values of 100%, 92.7%, and 96.9%, respectively. Finally, the proposed CAD system outperformed several other recently developed AD CAD systems.
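An illustrative sketch of the PLS-RF pipeline described above follows: PLS scores are extracted from high-dimensional image data and fed to an RF classifier. The voxel data and labels here are synthetic stand-ins, not the SPECT dataset.

```python
# Sketch of a PLS feature extraction -> Random Forest classification pipeline.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5000))                 # stand-in for downscaled SPECT voxels
y = rng.integers(0, 2, size=120)                 # AD vs. normal labels (toy)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
pls = PLSRegression(n_components=10).fit(X_tr, y_tr)      # supervised score extraction
rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(pls.transform(X_tr), y_tr)
print("PLS-RF accuracy:", rf.score(pls.transform(X_te), y_te))
```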
Predicting live and dead tree basal area of bark beetle affected forests from discrete-return lidar
Benjamin C. Bright; Andrew T. Hudak; Robert McGaughey; Hans-Erik Andersen; Jose Negron
2013-01-01
Bark beetle outbreaks have killed large numbers of trees across North America in recent years. Lidar remote sensing can be used to effectively estimate forest biomass, but prediction of both live and dead standing biomass in beetle-affected forests using lidar alone has not been demonstrated. We developed Random Forest (RF) models predicting total, live, dead, and...
A Random Forest-based ensemble method for activity recognition.
Feng, Zengtao; Mo, Lingfei; Li, Meng
2015-01-01
This paper presents a multi-sensor ensemble approach to human physical activity (PA) recognition using random forest. We designed an ensemble learning algorithm which integrates several independent Random Forest classifiers based on different sensor feature sets to build a more stable, more accurate and faster classifier for human activity recognition. To evaluate the algorithm, PA data collected from PAMAP (Physical Activity Monitoring for Aging People), a standard, publicly available database, was used for training and testing. The experimental results show that the algorithm is able to correctly recognize 19 PA types with an accuracy of 93.44%, while training is faster than with other methods. The ensemble classifier system based on the RF (Random Forest) algorithm can achieve high recognition accuracy and fast calculation.
A Random Forest Approach to Predict the Spatial Distribution ...
Modeling the magnitude and distribution of sediment-bound pollutants in estuaries is often limited by incomplete knowledge of the site and inadequate sample density. To address these modeling limitations, a decision-support tool framework was conceived that predicts sediment contamination from the sub-estuary to broader estuary extent. For this study, a Random Forest (RF) model was implemented to predict the distribution of a model contaminant, triclosan (5-chloro-2-(2,4-dichlorophenoxy)phenol) (TCS), in Narragansett Bay, Rhode Island, USA. TCS is an unregulated contaminant used in many personal care products. The RF explanatory variables were associated with TCS transport and fate (proxies) and direct and indirect environmental entry. The continuous RF TCS concentration predictions were discretized into three levels of contamination (low, medium, and high) for three different quantile thresholds. The RF model explained 63% of the variance with a minimum number of variables. Total organic carbon (TOC) (transport and fate proxy) was a strong predictor of TCS contamination causing a mean squared error increase of 59% when compared to permutations of randomized values of TOC. Additionally, combined sewer overflow discharge (environmental entry) and sand (transport and fate proxy) were strong predictors. The discretization models identified a TCS area of greatest concern in the northern reach of Narragansett Bay (Providence River sub-estuary), which was validated wi
Zimbelman, Eloise G; Keefe, Robert F
2018-01-01
Real-time positioning on mobile devices using global navigation satellite system (GNSS) technology paired with radio frequency (RF) transmission (GNSS-RF) may help to improve safety on logging operations by increasing situational awareness. However, GNSS positional accuracy for ground workers in motion may be reduced by multipath error, satellite signal obstruction, or other factors. Radio propagation of GNSS locations may also be impacted due to line-of-sight (LOS) obstruction in remote, forested areas. The objective of this study was to characterize the effects of forest stand characteristics, topography, and other LOS obstructions on the GNSS accuracy and radio signal propagation quality of multiple Raveon Atlas PT GNSS-RF transponders functioning as a network in a range of forest conditions. Because most previous research with GNSS in forestry has focused on stationary units, we chose to analyze units in motion by evaluating the time-to-signal accuracy of geofence crossings in 21 randomly-selected stands on the University of Idaho Experimental Forest. Specifically, we studied the effects of forest stand characteristics, topography, and LOS obstructions on (1) the odds of missed GNSS-RF signals, (2) the root mean squared error (RMSE) of Atlas PTs, and (3) the time-to-signal accuracy of safety geofence crossings in forested environments. Mixed-effects models used to analyze the data showed that stand characteristics, topography, and obstructions in the LOS affected the odds of missed radio signals while stand variables alone affected RMSE. Both stand characteristics and topography affected the accuracy of geofence alerts.
Unbiased feature selection in learning random forests for high-dimensional data.
Nguyen, Thanh-Tung; Huang, Joshua Zhexue; Nguyen, Thuy Thi
2015-01-01
Random forests (RFs) have been widely used as a powerful classification method. However, with the randomization in both bagging samples and feature selection, the trees in the forest tend to select uninformative features for node splitting. This makes RFs have poor accuracy when working with high-dimensional data. Besides that, RFs have bias in the feature selection process where multivalued features are favored. Aiming at debiasing feature selection in RFs, we propose a new RF algorithm, called xRF, to select good features in learning RFs for high-dimensional data. We first remove the uninformative features using p-value assessment, and the subset of unbiased features is then selected based on some statistical measures. This feature subset is then partitioned into two subsets. A feature weighting sampling technique is used to sample features from these two subsets for building trees. This approach enables one to generate more accurate trees, while allowing one to reduce dimensionality and the amount of data needed for learning RFs. An extensive set of experiments has been conducted on 47 high-dimensional real-world datasets including image datasets. The experimental results have shown that RFs with the proposed approach outperformed the existing random forests in increasing the accuracy and the AUC measures.
Shareef, Hussain; Mutlag, Ammar Hussein; Mohamed, Azah
2017-01-01
Many maximum power point tracking (MPPT) algorithms have been developed in recent years to maximize the produced PV energy. These algorithms are not sufficiently robust because of fast-changing environmental conditions, efficiency, accuracy at steady-state value, and dynamics of the tracking algorithm. Thus, this paper proposes a new random forest (RF) model to improve MPPT performance. The RF model has the ability to capture the nonlinear association of patterns between predictors, such as irradiance and temperature, to determine the accurate maximum power point. An RF-based tracker is designed for 25 SolarTIFSTF-120P6 PV modules, with a capacity of 3 kW peak, using two high-speed sensors. For this purpose, a complete PV system is modeled using 300,000 data samples and simulated using the MATLAB/SIMULINK package. The proposed RF-based MPPT is then tested under actual environmental conditions for 24 days to validate the accuracy and dynamic response. The response of the RF-based MPPT model is also compared with that of the artificial neural network and adaptive neurofuzzy inference system algorithms for further validation. The results show that the proposed MPPT technique gives significant improvement compared with other techniques. In addition, the RF model passes the Bland-Altman test, with more than 95 percent acceptability.
NASA Astrophysics Data System (ADS)
Morizet, N.; Godin, N.; Tang, J.; Maillet, E.; Fregonese, M.; Normand, B.
2016-03-01
This paper proposes a novel approach to classify acoustic emission (AE) signals deriving from corrosion experiments, even when embedded in a noisy environment. To validate this new methodology, synthetic data are first used throughout an in-depth analysis, comparing Random Forests (RF) to the k-Nearest Neighbor (k-NN) algorithm. Moreover, a new evaluation tool called the alter-class matrix (ACM) is introduced to simulate different degrees of uncertainty on labeled data for supervised classification. Then, tests on real cases involving noise and crevice corrosion are conducted by preprocessing the waveforms, including wavelet denoising, and extracting a rich set of features as input to the RF algorithm. To this end, a software tool called RF-CAM has been developed. Results show that this approach is very efficient on ground truth data and is also very promising on real data, especially for its reliability, performance and speed, which are serious criteria for the chemical industry.
NASA Astrophysics Data System (ADS)
Hu, Yifan; Han, Hao; Zhu, Wei; Li, Lihong; Pickhardt, Perry J.; Liang, Zhengrong
2016-03-01
Feature classification plays an important role in differentiation or computer-aided diagnosis (CADx) of suspicious lesions. As a widely used ensemble learning algorithm for classification, random forest (RF) has a distinguished performance for CADx. Our recent study has shown that the location index (LI), which is derived from the well-known kNN (k nearest neighbor) and wkNN (weighted k nearest neighbor) classifiers [1], also plays a distinguished role in classification for CADx. Therefore, in this paper, based on the property that the LI achieves very high accuracy, we design an algorithm to integrate the LI into RF for a higher AUC (area under the receiver operating characteristic curve, ROC). Experiments were performed using a database of 153 lesions (polyps), including 116 neoplastic lesions and 37 hyperplastic lesions, with comparison to the existing RF and wkNN classifiers, respectively. A noticeable gain by the proposed integrated classifier was quantified by the AUC measure.
Credit Risk Evaluation of Power Market Players with Random Forest
NASA Astrophysics Data System (ADS)
Umezawa, Yasushi; Mori, Hiroyuki
A new method is proposed for credit risk evaluation in a power market. Credit risk evaluation measures the bankruptcy risk of a company. Power system liberalization results in a new environment that puts emphasis on profit maximization and risk minimization. There is a high probability that electricity transactions cause risk between companies, so power market players are concerned with risk minimization. As a management strategy, a risk index is required to evaluate the worth of a business partner. This paper proposes a new method for evaluating credit risk with Random Forest (RF), which performs ensemble learning over decision trees. RF is an efficient data mining technique for clustering data and extracting relationships between input and output data. In addition, a method of generating pseudo-measurements is proposed to improve the performance of RF. The proposed method is successfully applied to real financial data of energy utilities in the power market. A comparison is made between the proposed and conventional methods.
Random Forests for Global and Regional Crop Yield Predictions.
Jeong, Jig Han; Resop, Jonathan P; Mueller, Nathaniel D; Fleisher, David H; Yun, Kyungdahm; Butler, Ethan E; Timlin, Dennis J; Shim, Kyo-Moon; Gerber, James S; Reddy, Vangimalla R; Kim, Soo-Hyung
2016-01-01
Accurate predictions of crop yield are critical for developing effective agricultural and food policies at the regional and global scales. We evaluated a machine-learning method, Random Forests (RF), for its ability to predict crop yield responses to climate and biophysical variables at global and regional scales in wheat, maize, and potato in comparison with multiple linear regressions (MLR) serving as a benchmark. We used crop yield data from various sources and regions for model training and testing: 1) gridded global wheat grain yield, 2) maize grain yield from US counties over thirty years, and 3) potato tuber and maize silage yield from the northeastern seaboard region. RF was found highly capable of predicting crop yields and outperformed MLR benchmarks in all performance statistics that were compared. For example, the root mean square errors (RMSE) ranged between 6 and 14% of the average observed yield with RF models in all test cases whereas these values ranged from 14% to 49% for MLR models. Our results show that RF is an effective and versatile machine-learning method for crop yield predictions at regional and global scales for its high accuracy and precision, ease of use, and utility in data analysis. RF may result in a loss of accuracy when predicting the extreme ends or responses beyond the boundaries of the training data.
Random Forest Segregation of Drug Responses May Define Regions of Biological Significance.
Bukhari, Qasim; Borsook, David; Rudin, Markus; Becerra, Lino
2016-01-01
The ability to assess brain responses in an unsupervised manner based on fMRI measures has remained a challenge. Here we have applied the Random Forest (RF) method to detect differences in the pharmacological MRI (phMRI) response in rats to treatment with an analgesic drug (buprenorphine) as compared to control (saline). Three groups of animals were studied: two groups treated with different doses of the opioid buprenorphine, low (LD) and high dose (HD), and one receiving saline. PhMRI responses were evaluated in 45 brain regions and RF analysis was applied to allocate rats to the individual treatment groups. RF analysis was able to identify drug effects based on differential phMRI responses in the hippocampus, amygdala, nucleus accumbens, superior colliculus, and the lateral and posterior thalamus for drug vs. saline. These structures have high levels of mu opioid receptors. In addition, these regions are involved in aversive signaling, which is inhibited by mu opioids. The results demonstrate that buprenorphine-mediated phMRI responses comprise characteristic features that allow a supervised differentiation from placebo-treated rats as well as the proper allocation to the respective drug dose group using the RF method, a method that has been successfully applied in clinical studies.
Predicting the accuracy of ligand overlay methods with Random Forest models.
Nandigam, Ravi K; Evans, David A; Erickson, Jon A; Kim, Sangtae; Sutherland, Jeffrey J
2008-12-01
The accuracy of binding mode prediction using standard molecular overlay methods (ROCS, FlexS, Phase, and FieldCompare) is studied. Previous work has shown that simple decision tree modeling can be used to improve accuracy by selection of the best overlay template. This concept is extended to the use of Random Forest (RF) modeling for template and algorithm selection. An extensive data set of 815 ligand-bound X-ray structures representing 5 gene families was used to generate ca. 70,000 overlays using four programs. RF models, trained using standard measures of ligand and protein similarity and Lipinski-related descriptors, are used for automatically selecting the reference ligand and overlay method that maximize the probability of reproducing the overlay deduced from X-ray structures (i.e., using RMSD ≤ 2 Å as the criterion for success). RF model scores are highly predictive of overlay accuracy, and their use in template and method selection produces correct overlays in 57% of cases for 349 overlay ligands not used for training the RF models. The inclusion in the models of protein sequence similarity enables the use of templates bound to related protein structures, yielding useful results even for proteins having no available X-ray structures.
Random-Forest Classification of High-Resolution Remote Sensing Images and Ndsm Over Urban Areas
NASA Astrophysics Data System (ADS)
Sun, X. F.; Lin, X. G.
2017-09-01
As an intermediate step between raw remote sensing data and digital urban maps, remote sensing data classification has been a challenging and long-standing research problem in the remote sensing community. In this work, an effective classification method is proposed for classifying high-resolution remote sensing data over urban areas. Starting from high-resolution multi-spectral images and 3D geometry data, our method proceeds in three main stages: feature extraction, classification, and refinement of the classified result. First, we extract color, vegetation index and texture features from the multi-spectral image and compute the height, elevation texture and differential morphological profile (DMP) features from the 3D geometry data. In the classification stage, multiple random forest (RF) classifiers are trained separately and then combined to form an RF ensemble that estimates each sample's category probabilities. Finally, the probabilities, along with the feature importance indicator output by the RF ensemble, are used to construct a fully connected conditional random field (FCCRF) graph model, by which the classification results are refined through mean-field-based statistical inference. Experiments on the ISPRS Semantic Labeling Contest dataset show that our proposed 3-stage method achieves 86.9% overall accuracy on the test data.
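A minimal sketch of the ensemble stage only, under stated assumptions (scikit-learn, synthetic data, and hypothetical feature groups standing in for the spectral, texture and height features); the FCCRF refinement step is omitted.

```python
# Sketch: per-feature-group RF classifiers combined into a probability ensemble.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=12, n_informative=8,
                           n_classes=4, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Hypothetical feature groups, e.g. spectral, texture, and height features.
groups = {"spectral": slice(0, 4), "texture": slice(4, 8), "height": slice(8, 12)}
probas = []
for name, cols in groups.items():
    rf = RandomForestClassifier(n_estimators=300, random_state=1).fit(X_tr[:, cols], y_tr)
    probas.append(rf.predict_proba(X_te[:, cols]))

ensemble_proba = np.mean(probas, axis=0)   # average per-class probabilities
y_pred = ensemble_proba.argmax(axis=1)
print("ensemble accuracy:", (y_pred == y_te).mean())
```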
A System-Level Pathway-Phenotype Association Analysis Using Synthetic Feature Random Forest
Pan, Qinxin; Hu, Ting; Malley, James D.; Andrew, Angeline S.; Karagas, Margaret R.; Moore, Jason H.
2015-01-01
As the cost of genome-wide genotyping decreases, the number of genome-wide association studies (GWAS) has increased considerably. However, the transition from GWAS findings to the underlying biology of various phenotypes remains challenging. As a result, due to its system-level interpretability, pathway analysis has become a popular tool for gaining insights on the underlying biology from high-throughput genetic association data. In pathway analyses, gene sets representing particular biological processes are tested for significant associations with a given phenotype. Most existing pathway analysis approaches rely on single-marker statistics and assume that pathways are independent of each other. As biological systems are driven by complex biomolecular interactions, the complex relationships between single-nucleotide polymorphisms (SNPs) and pathways need to be addressed. To incorporate the complexity of gene-gene interactions and pathway-pathway relationships, we propose a system-level pathway analysis approach, synthetic feature random forest (SF-RF), which is designed to detect pathway-phenotype associations without making assumptions about the relationships among SNPs or pathways. In our approach, the genotypes of SNPs in a particular pathway are aggregated into a synthetic feature representing that pathway via Random Forest (RF). Multiple synthetic features are analyzed using RF simultaneously, and the significance of a synthetic feature indicates the significance of the corresponding pathway. We further complement SF-RF with pathway-based Statistical Epistasis Network (SEN) analysis that evaluates interactions among pathways. By investigating the pathway SEN, we hope to gain additional insights into the genetic mechanisms contributing to the pathway-phenotype association. We apply SF-RF to a population-based genetic study of bladder cancer and further investigate the mechanisms that help explain the pathway-phenotype associations using SEN. The bladder cancer associated pathways we found are both consistent with existing biological knowledge and reveal novel and plausible hypotheses for future biological validation. PMID:24535726
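The synthetic-feature idea can be sketched as follows, assuming random genotype data and scikit-learn; this is an illustration of the concept, not the published SF-RF implementation, and the pathway names and SNP groupings are hypothetical.

```python
# Sketch: synthetic feature construction from pathway SNPs, then a pathway-level RF.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_subjects, n_snps = 500, 60
genotypes = rng.integers(0, 3, size=(n_subjects, n_snps))   # 0/1/2 allele counts
phenotype = rng.integers(0, 2, size=n_subjects)             # case/control labels
pathways = {"pathway_A": range(0, 20), "pathway_B": range(20, 40),
            "pathway_C": range(40, 60)}                     # hypothetical SNP sets

synthetic = np.zeros((n_subjects, len(pathways)))
for j, (name, snps) in enumerate(pathways.items()):
    rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
    rf.fit(genotypes[:, list(snps)], phenotype)
    # Out-of-bag probability of the case class serves as the synthetic feature.
    synthetic[:, j] = rf.oob_decision_function_[:, 1]

pathway_rf = RandomForestClassifier(n_estimators=500, random_state=0)
pathway_rf.fit(synthetic, phenotype)
for name, imp in zip(pathways, pathway_rf.feature_importances_):
    print(name, round(imp, 3))   # importance of each pathway-level synthetic feature
```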
Application of random forests methods to diabetic retinopathy classification analyses.
Casanova, Ramon; Saldana, Santiago; Chew, Emily Y; Danis, Ronald P; Greven, Craig M; Ambrosius, Walter T
2014-01-01
Diabetic retinopathy (DR) is one of the leading causes of blindness in the United States and worldwide. DR is a silent disease that may go unnoticed until it is too late for effective treatment. Therefore, early detection could improve the chances of therapeutic interventions that would alleviate its effects. Graded fundus photography and systemic data from 3443 ACCORD-Eye Study participants were used to estimate Random Forest (RF) and logistic regression classifiers. We studied the impact of sample size on classifier performance and the possibility of using RF-generated class conditional probabilities as metrics describing DR risk. RF measures of variable importance were used to detect factors that affect classification performance. Both types of data were informative when discriminating participants with or without DR. RF-based models produced much higher classification accuracy than those based on logistic regression. Combining both types of data did not increase accuracy but did increase statistical discrimination of healthy participants who subsequently did or did not have DR events during four years of follow-up. RF variable importance criteria revealed that microaneurysm counts in both eyes seemed to play the most important role in discrimination among the graded fundus variables, while the number of medicines and diabetes duration were the most relevant among the systemic variables. We have introduced RF methods to DR classification analyses based on fundus photography data. In addition, we propose an approach to DR risk assessment based on metrics derived from graded fundus photography and systemic data. Our results suggest that RF methods could be a valuable tool to diagnose DR and evaluate its progression.
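As a generic illustration (synthetic data; not the ACCORD-Eye analysis), the sketch below contrasts RF with logistic regression and treats the RF class-conditional probability as a continuous risk score, mirroring the use of probabilities as risk metrics described above.

```python
# Sketch: RF vs. logistic regression, with RF probabilities used as a risk metric.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=20, n_informative=6,
                           weights=[0.7, 0.3], random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)

rf = RandomForestClassifier(n_estimators=500, random_state=2).fit(X_tr, y_tr)
lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

risk = rf.predict_proba(X_te)[:, 1]   # class-conditional probability as risk score
print("RF AUC:", round(roc_auc_score(y_te, risk), 3))
print("LR AUC:", round(roc_auc_score(y_te, lr.predict_proba(X_te)[:, 1]), 3))
```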
Wu, Jiansheng; Zhang, Qiuming; Wu, Weijian; Pang, Tao; Hu, Haifeng; Chan, Wallace K B; Ke, Xiaoyan; Zhang, Yang; Wren, Jonathan
2018-02-08
Precise assessment of ligand bioactivities (including IC50, EC50, Ki, Kd, etc.) is essential for virtual screening and lead compound identification. However, not all ligands have experimentally determined activities. In particular, many G protein-coupled receptors (GPCRs), which constitute the largest family of integral membrane proteins and represent the targets of nearly 40% of drugs on the market, lack published experimental data about ligand interactions. Computational methods with the ability to accurately predict the bioactivity of ligands can help efficiently address this problem. We proposed a new method, WDL-RF, using weighted deep learning and random forest, to model the bioactivity of GPCR-associated ligand molecules. The pipeline of our algorithm consists of two consecutive stages: 1) molecular fingerprint generation through a new weighted deep learning method, and 2) bioactivity calculation with a random forest model; one uniqueness of the approach is that the model allows end-to-end learning of prediction pipelines with input ligands of arbitrary size. The method was tested on a set of twenty-six non-redundant GPCRs that have a high number of active ligands, each with 200∼4000 ligand associations. The results from our benchmark show that WDL-RF can generate bioactivity predictions with an average root-mean-square error of 1.33 and a correlation coefficient (r2) of 0.80 compared to the experimental measurements, which are significantly more accurate than control predictors with different molecular fingerprints and descriptors. In particular, data-driven molecular fingerprint features, as extracted from the weighted deep learning models, can help solve deficiencies stemming from the use of traditional hand-crafted features and significantly increase the efficiency of short molecular fingerprints in virtual screening. The WDL-RF web server, as well as the source code and datasets of WDL-RF, is freely available at https://zhanglab.ccmb.med.umich.edu/WDL-RF/ for academic purposes.
NASA Astrophysics Data System (ADS)
Zhao, Dekang; Wu, Qiang; Cui, Fangpeng; Xu, Hua; Zeng, Yifan; Cao, Yufei; Du, Yuanze
2018-04-01
Coal-floor water-inrush incidents account for a large proportion of coal mine disasters in northern China, and accurate risk assessment is crucial for safe coal production. A novel and promising assessment model for water inrush is proposed based on random forest (RF), a powerful intelligent machine-learning algorithm. RF has considerable advantages, including high classification accuracy and the capability to evaluate the importance of variables; in particular, it is robust in dealing with the complicated and non-linear problems inherent in risk assessment. In this study, the proposed model is applied to Panjiayao Coal Mine, northern China. Eight factors were selected as evaluation indices according to a systematic analysis of the geological conditions and a field survey of the study area. Risk assessment maps were generated based on RF, and the probabilistic neural network (PNN) model was also used for risk assessment as a comparison. The results demonstrate that the two methods are consistent in the risk assessment of water inrush at the mine, and RF shows better performance than PNN, with an overall accuracy higher by 6.67%. It is concluded that RF is more practicable than PNN for assessing water-inrush risk. The presented method will be helpful in avoiding water inrush and can also be extended to various engineering applications.
Lee, Soo Yee; Mediani, Ahmed; Maulidiani, Maulidiani; Khatib, Alfi; Ismail, Intan Safinar; Zawawi, Norhasnida; Abas, Faridah
2018-01-01
Neptunia oleracea is a plant consumed as a vegetable and which has been used as a folk remedy for several diseases. Herein, two regression models (partial least squares, PLS; and random forest, RF) in a metabolomics approach were compared and applied to the evaluation of the relationship between phenolics and bioactivities of N. oleracea. In addition, the effects of different extraction conditions on the phenolic constituents were assessed by pattern recognition analysis. Comparison of the PLS and RF showed that RF exhibited poorer generalization and hence poorer predictive performance. Both the regression coefficient of PLS and the variable importance of RF revealed that quercetin and kaempferol derivatives, caffeic acid and vitexin-2-O-rhamnoside were significant towards the tested bioactivities. Furthermore, principal component analysis (PCA) and partial least squares-discriminant analysis (PLS-DA) results showed that sonication and absolute ethanol are the preferable extraction method and ethanol ratio, respectively, to produce N. oleracea extracts with high phenolic levels and therefore high DPPH scavenging and α-glucosidase inhibitory activities. Both PLS and RF are useful regression models in metabolomics studies. This work provides insight into the performance of different multivariate data analysis tools and the effects of different extraction conditions on the extraction of desired phenolics from plants.
Mapping growing stock volume and forest live biomass: a case study of the Polissya region of Ukraine
NASA Astrophysics Data System (ADS)
Bilous, Andrii; Myroniuk, Viktor; Holiaka, Dmytrii; Bilous, Svitlana; See, Linda; Schepaschenko, Dmitry
2017-10-01
Forest inventory and biomass mapping are important tasks that require inputs from multiple data sources. In this paper we implement two methods for the Ukrainian region of Polissya: random forest (RF) for tree species prediction and k-nearest neighbors (k-NN) for growing stock volume and biomass mapping. We examined the suitability of the five-band RapidEye satellite image to predict the distribution of six tree species. The accuracy of RF is quite high: ~99% for the forest/non-forest mask and 89% for tree species prediction. Our results demonstrate that inclusion of elevation as a predictor variable in the RF model improved the performance of tree species classification. We evaluated different distance metrics for the k-NN method, including Euclidean and Mahalanobis distance, most similar neighbor (MSN), gradient nearest neighbor, and independent component analysis. The MSN with the four nearest neighbors (k = 4) is the most precise (according to the root-mean-square deviation) for predicting forest attributes across the study area. The k-NN method allowed us to estimate growing stock volume with an accuracy of 3 m3 ha-1 and live biomass with an accuracy of about 2 t ha-1 over the study area.
Hong, Haoyuan; Tsangaratos, Paraskevas; Ilia, Ioanna; Liu, Junzhi; Zhu, A-Xing; Xu, Chong
2018-07-15
The main objective of the present study was to utilize Genetic Algorithms (GA) in order to obtain the optimal combination of forest fire related variables and apply data mining methods for constructing a forest fire susceptibility map. In the proposed approach, a Random Forest (RF) and a Support Vector Machine (SVM) were used to produce a forest fire susceptibility map for Dayu County, which is located in the southwest of Jiangxi Province, China. For this purpose, historic forest fires and thirteen forest fire related variables were analyzed, namely: elevation, slope angle, aspect, curvature, land use, soil cover, heat load index, normalized difference vegetation index, mean annual temperature, mean annual wind speed, mean annual rainfall, distance to river network and distance to road network. The Natural Break and the Certainty Factor method were used to classify and weight the thirteen variables, while a multicollinearity analysis was performed to determine the correlation among the variables and decide about their usability. The optimal set of variables determined by the GA limited the number of variables to eight, excluding aspect, land use, heat load index, distance to river network and mean annual rainfall from the analysis. The performance of the forest fire models was evaluated by using the area under the Receiver Operating Characteristic curve (ROC-AUC) based on the validation dataset. Overall, the RF models gave higher AUC values, and the results showed that the proposed optimized models outperform the original models. Specifically, the optimized RF model gave the best results (0.8495), followed by the original RF (0.8169), while the optimized SVM gave lower values (0.7456) than the RF models, though higher than the original SVM model (0.7148). The study highlights the significance of feature selection techniques in forest fire susceptibility, and data mining methods can be considered a valid approach for forest fire susceptibility modeling.
Pérez-Del-Olmo, A; Montero, F E; Fernández, M; Barrett, J; Raga, J A; Kostadinova, A
2010-10-01
We address the effect of spatial scale and temporal variation on model generality when forming predictive models for fish assignment using a new data mining approach, Random Forests (RF), applied to variable biological markers (parasite community data). Models were implemented for a fish host-parasite system sampled along the Mediterranean and Atlantic coasts of Spain and were validated using independent datasets. We considered 2 basic classification problems in evaluating the importance of variations in parasite infracommunities for assignment of individual fish to their populations of origin: a multiclass task (2-5 population models, using 2 seasonal replicates from each of the populations) and a 2-class task (using 4 seasonal replicates from 1 Atlantic and 1 Mediterranean population each). The main results are that (i) RF are well suited for multiclass population assignment using parasite communities in non-migratory fish; (ii) RF provide an efficient means for model cross-validation on the baseline data and this allows sample size limitations in parasite tag studies to be tackled effectively; (iii) the performance of RF is dependent on the complexity and spatial extent/configuration of the problem; and (iv) the development of predictive models is strongly influenced by seasonal change and this stresses the importance of both temporal replication and model validation in parasite tagging studies.
Shannon L. Savage; Rick L. Lawrence; John R. Squires
2015-01-01
Ecological and land management applications would often benefit from maps of relative canopy cover of each species present within a pixel, instead of traditional remote-sensing based maps of either dominant species or percent canopy cover without regard to species composition. Widely used statistical models for remote sensing, such as randomForest (RF),...
Random forest feature selection approach for image segmentation
NASA Astrophysics Data System (ADS)
Lefkovits, László; Lefkovits, Szidónia; Emerich, Simina; Vaida, Mircea Florin
2017-03-01
In the field of image segmentation, discriminative models have shown promising performance. Generally, every such model begins with the extraction of numerous features from annotated images. Most authors create their discriminative model by using many features without applying any selection criteria. A more reliable model can be built by using a framework that selects the variables that are important from the point of view of the classification and eliminates the unimportant ones. In this article we present a framework for feature selection and data dimensionality reduction. The methodology is built around the random forest (RF) algorithm and its variable importance evaluation. In order to deal with datasets so large as to be practically unmanageable, we propose an algorithm based on RF that reduces the dimension of the database by eliminating irrelevant features. Furthermore, this framework is applied to optimize our discriminative model for brain tumor segmentation.
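The general idea of importance-driven dimensionality reduction, not the authors' exact framework, might be sketched as follows with scikit-learn: repeatedly drop the least important fraction of features according to RF variable importance until a target dimension is reached.

```python
# Sketch: RF-importance-driven dimensionality reduction by iterative elimination.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=200, n_informative=15, random_state=3)
keep = np.arange(X.shape[1])          # indices of currently retained features
target_dim, drop_frac = 20, 0.2

while keep.size > target_dim:
    rf = RandomForestClassifier(n_estimators=300, random_state=3).fit(X[:, keep], y)
    order = np.argsort(rf.feature_importances_)                  # ascending importance
    n_drop = min(max(1, int(drop_frac * keep.size)), keep.size - target_dim)
    keep = np.sort(keep[order[n_drop:]])                         # discard least important
print("retained feature indices:", keep)
```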
Onsongo, Getiria; Baughn, Linda B; Bower, Matthew; Henzler, Christine; Schomaker, Matthew; Silverstein, Kevin A T; Thyagarajan, Bharat
2016-11-01
Simultaneous detection of small copy number variations (CNVs) (<0.5 kb) and single-nucleotide variants in clinically significant genes is of great interest for clinical laboratories. The analytical variability in next-generation sequencing (NGS) and artifacts in coverage data because of issues with mappability along with lack of robust bioinformatics tools for CNV detection have limited the utility of targeted NGS data to identify CNVs. We describe the development and implementation of a bioinformatics algorithm, copy number variation-random forest (CNV-RF), that incorporates a machine learning component to identify CNVs from targeted NGS data. Using CNV-RF, we identified 12 of 13 deletions in samples with known CNVs, two cases with duplications, and identified novel deletions in 22 additional cases. Furthermore, no CNVs were identified among 60 genes in 14 cases with normal copy number and no CNVs were identified in another 104 patients with clinical suspicion of CNVs. All positive deletions and duplications were confirmed using a quantitative PCR method. CNV-RF also detected heterozygous deletions and duplications with a specificity of 50% across 4813 genes. The ability of CNV-RF to detect clinically relevant CNVs with a high degree of sensitivity along with confirmation using a low-cost quantitative PCR method provides a framework for comprehensive NGS-based CNV/single-nucleotide variant detection in a clinical molecular diagnostics laboratory.
Naghibi, Seyed Amir; Pourghasemi, Hamid Reza; Dixon, Barnali
2016-01-01
Groundwater is considered one of the most valuable fresh water resources. The main objective of this study was to produce groundwater spring potential maps in the Koohrang Watershed, Chaharmahal-e-Bakhtiari Province, Iran, using three machine learning models: boosted regression tree (BRT), classification and regression tree (CART), and random forest (RF). Thirteen hydrological-geological-physiographical (HGP) factors that influence locations of springs were considered in this research. These factors include slope degree, slope aspect, altitude, topographic wetness index (TWI), slope length (LS), plan curvature, profile curvature, distance to rivers, distance to faults, lithology, land use, drainage density, and fault density. Subsequently, groundwater spring potential was modeled and mapped using CART, RF, and BRT algorithms. The predicted results from the three models were validated using the receiver operating characteristics curve (ROC). From 864 springs identified, 605 (≈70 %) locations were used for the spring potential mapping, while the remaining 259 (≈30 %) springs were used for the model validation. The area under the curve (AUC) for the BRT model was calculated as 0.8103 and for CART and RF the AUC were 0.7870 and 0.7119, respectively. Therefore, it was concluded that the BRT model produced the best prediction results while predicting locations of springs followed by CART and RF models, respectively. Geospatially integrated BRT, CART, and RF methods proved to be useful in generating the spring potential map (SPM) with reasonable accuracy.
Mapping ecological systems with a random forest model: tradeoffs between errors and bias
Emilie Grossmann; Janet Ohmann; James Kagan; Heather May; Matthew Gregory
2010-01-01
New methods for predictive vegetation mapping allow improved estimations of plant community composition across large regions. Random Forest (RF) models limit over-fitting problems of other methods, and are known for making accurate classification predictions from noisy, nonnormal data, but can be biased when plot samples are unbalanced. We developed two contrasting...
NASA Astrophysics Data System (ADS)
Gao, Yan; Marpu, Prashanth; Morales Manila, Luis M.
2014-11-01
This paper assesses the suitability of 8-band WorldView-2 (WV2) satellite data and an object-based random forest algorithm for the classification of avocado growth stages in Mexico. We tested both pixel-based classification with minimum distance (MD) and maximum likelihood (MLC) and object-based classification with the Random Forest (RF) algorithm for this task. Training samples and verification data were selected by visually interpreting the WV2 images for seven thematic classes: fully grown, middle stage, and early stage of avocado crops, bare land, two types of natural forests, and water body. To examine the contribution of the four new spectral bands of the WV2 sensor, all the tested classifications were carried out with and without the four new spectral bands. Classification accuracy assessment results show that object-based classification with the RF algorithm obtained higher overall accuracy (93.06%) than the pixel-based MD (69.37%) and MLC (64.03%) methods. For both pixel-based and object-based methods, the classifications with the four new spectral bands obtained higher accuracy than those without (object-based RF: 93.06% vs. 83.59%; pixel-based MD: 69.37% vs. 67.2%; pixel-based MLC: 64.03% vs. 36.05%), suggesting that the four new spectral bands of the WV2 sensor contributed to the increase in classification accuracy.
Classification of large-sized hyperspectral imagery using fast machine learning algorithms
NASA Astrophysics Data System (ADS)
Xia, Junshi; Yokoya, Naoto; Iwasaki, Akira
2017-07-01
We present a framework of fast machine learning algorithms in the context of large-sized hyperspectral image classification, from a theoretical to a practical viewpoint. In particular, we assess the performance of random forest (RF), rotation forest (RoF), and extreme learning machine (ELM), as well as ensembles of RF and ELM. These classifiers are applied to two large-sized hyperspectral images and compared to support vector machines. To give a quantitative analysis, we focus on comparing these methods when working with high input dimensions and a limited/sufficient training set. Moreover, other important issues such as computational cost and robustness against noise are also discussed.
Prediction of aquatic toxicity mode of action using linear discriminant and random forest models.
Martin, Todd M; Grulke, Christopher M; Young, Douglas M; Russom, Christine L; Wang, Nina Y; Jackson, Crystal R; Barron, Mace G
2013-09-23
The ability to determine the mode of action (MOA) for a diverse group of chemicals is a critical part of ecological risk assessment and chemical regulation. However, existing MOA assignment approaches in ecotoxicology have been limited to relatively few MOAs, have high uncertainty, or rely on professional judgment. In this study, machine learning algorithms (linear discriminant analysis and random forest) were used to develop models for assigning aquatic toxicity MOA. These methods were selected since they have been shown to be able to correlate diverse data sets and provide an indication of the most important descriptors. A data set of MOA assignments for 924 chemicals was developed using a combination of high confidence assignments, international consensus classifications, ASTER (ASsessment Tools for the Evaluation of Risk) predictions, and weight-of-evidence professional judgment based on an assessment of structure and literature information. The overall data set was randomly divided into a training set (75%) and a validation set (25%) and then used to develop linear discriminant analysis (LDA) and random forest (RF) MOA assignment models. The LDA and RF models had high internal concordance and specificity and were able to produce overall prediction accuracies ranging from 84.5 to 87.7% for the validation set. These results demonstrate that computational chemistry approaches can be used to determine acute toxicity MOAs across a large range of structures and mechanisms.
Kreakie, Betty J.; Cantwell, Mark G.; Nacci, Diane
2017-01-01
Modeling the magnitude and distribution of sediment-bound pollutants in estuaries is often limited by incomplete knowledge of the site and inadequate sample density. To address these modeling limitations, a decision-support tool framework was conceived that predicts sediment contamination from the sub-estuary to broader estuary extent. For this study, a Random Forest (RF) model was implemented to predict the distribution of a model contaminant, triclosan (5-chloro-2-(2,4-dichlorophenoxy)phenol) (TCS), in Narragansett Bay, Rhode Island, USA. TCS is an unregulated contaminant used in many personal care products. The RF explanatory variables were associated with TCS transport and fate (proxies) and direct and indirect environmental entry. The continuous RF TCS concentration predictions were discretized into three levels of contamination (low, medium, and high) for three different quantile thresholds. The RF model explained 63% of the variance with a minimum number of variables. Total organic carbon (TOC) (transport and fate proxy) was a strong predictor of TCS contamination causing a mean squared error increase of 59% when compared to permutations of randomized values of TOC. Additionally, combined sewer overflow discharge (environmental entry) and sand (transport and fate proxy) were strong predictors. The discretization models identified a TCS area of greatest concern in the northern reach of Narragansett Bay (Providence River sub-estuary), which was validated with independent test samples. This decision-support tool performed well at the sub-estuary extent and provided the means to identify areas of concern and prioritize bay-wide sampling. PMID:28738089
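The permutation-based importance metric quoted above (percent increase in mean squared error when a predictor is randomized) can be illustrated on synthetic data; this sketch is generic and does not use the Narragansett Bay variables.

```python
# Sketch: permutation importance reported as a percent increase in mean squared error.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1500, n_features=8, n_informative=4, noise=5.0, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=4)
rf = RandomForestRegressor(n_estimators=500, random_state=4).fit(X_tr, y_tr)

base_mse = mean_squared_error(y_te, rf.predict(X_te))
rng = np.random.default_rng(4)
for j in range(X_te.shape[1]):
    X_perm = X_te.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])   # break the predictor-response link
    perm_mse = mean_squared_error(y_te, rf.predict(X_perm))
    print(f"feature {j}: MSE increase = {100 * (perm_mse - base_mse) / base_mse:.0f}%")
```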
Fox, Eric W; Hill, Ryan A; Leibowitz, Scott G; Olsen, Anthony R; Thornbrugh, Darren J; Weber, Marc H
2017-07-01
Random forest (RF) modeling has emerged as an important statistical learning method in ecology due to its exceptional predictive performance. However, for large and complex ecological data sets, there is limited guidance on variable selection methods for RF modeling. Typically, either a preselected set of predictor variables are used or stepwise procedures are employed which iteratively remove variables according to their importance measures. This paper investigates the application of variable selection methods to RF models for predicting probable biological stream condition. Our motivating data set consists of the good/poor condition of n = 1365 stream survey sites from the 2008/2009 National Rivers and Stream Assessment, and a large set (p = 212) of landscape features from the StreamCat data set as potential predictors. We compare two types of RF models: a full variable set model with all 212 predictors and a reduced variable set model selected using a backward elimination approach. We assess model accuracy using RF's internal out-of-bag estimate, and a cross-validation procedure with validation folds external to the variable selection process. We also assess the stability of the spatial predictions generated by the RF models to changes in the number of predictors and argue that model selection needs to consider both accuracy and stability. The results suggest that RF modeling is robust to the inclusion of many variables of moderate to low importance. We found no substantial improvement in cross-validated accuracy as a result of variable reduction. Moreover, the backward elimination procedure tended to select too few variables and exhibited numerous issues such as upwardly biased out-of-bag accuracy estimates and instabilities in the spatial predictions. We use simulations to further support and generalize results from the analysis of real data. A main purpose of this work is to elucidate issues of model selection bias and instability to ecologists interested in using RF to develop predictive models with large environmental data sets.
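A bare-bones sketch of a backward elimination loop guided by the RF out-of-bag (OOB) accuracy, on synthetic data rather than the StreamCat predictors; as the study cautions, OOB accuracy measured along the elimination path is optimistically biased, so an external cross-validation fold should be used to judge the final model.

```python
# Sketch: backward elimination of predictors guided by RF out-of-bag (OOB) accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1200, n_features=40, n_informative=8, random_state=5)
keep = list(range(X.shape[1]))
history = []

while len(keep) >= 5:
    rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=5)
    rf.fit(X[:, keep], y)
    history.append((len(keep), rf.oob_score_))
    worst = int(np.argmin(rf.feature_importances_))   # least important remaining predictor
    keep.pop(worst)

for n_vars, oob in history:
    print(n_vars, "predictors -> OOB accuracy", round(oob, 3))
```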
Perdiguero-Alonso, Diana; Montero, Francisco E; Kostadinova, Aneta; Raga, Juan Antonio; Barrett, John
2008-10-01
Due to the complexity of host-parasite relationships, discrimination between fish populations using parasites as biological tags is difficult. This study introduces, to our knowledge for the first time, random forests (RF) as a new modelling technique in the application of parasite community data as biological markers for population assignment of fish. This novel approach is applied to a dataset with a complex structure comprising 763 parasite infracommunities in population samples of Atlantic cod, Gadus morhua, from the spawning/feeding areas in five regions in the North East Atlantic (Baltic, Celtic, Irish and North seas and Icelandic waters). The learning behaviour of RF is evaluated in comparison with two other algorithms applied to class assignment problems, the linear discriminant function analysis (LDA) and artificial neural networks (ANN). The three algorithms are used to develop predictive models applying three cross-validation procedures in a series of experiments (252 models in total). The comparative approach to RF, LDA and ANN algorithms applied to the same datasets demonstrates the competitive potential of RF for developing predictive models since RF exhibited better accuracy of prediction and outperformed LDA and ANN in the assignment of fish to their regions of sampling using parasite community data. The comparative analyses and the validation experiment with a 'blind' sample confirmed that RF models performed more effectively with a large and diverse training set and a large number of variables. The discrimination results obtained for a migratory fish species with largely overlapping parasite communities reflects the high potential of RF for developing predictive models using data that are both complex and noisy, and indicates that it is a promising tool for parasite tag studies. Our results suggest that parasite community data can be used successfully to discriminate individual cod from the five different regions of the North East Atlantic studied using RF.
What variables are important in predicting bovine viral diarrhea virus? A random forest approach.
Machado, Gustavo; Mendoza, Mariana Recamonde; Corbellini, Luis Gustavo
2015-07-24
Bovine viral diarrhea virus (BVDV) causes one of the most economically important diseases in cattle, and the virus is found worldwide. A better understanding of the disease-associated factors is a crucial step towards the definition of strategies for control and eradication. In this study we trained a random forest (RF) prediction model and performed variable importance analysis to identify factors associated with BVDV occurrence. In addition, we assessed the influence of feature selection on RF performance and evaluated its predictive power relative to other popular classifiers and to logistic regression. We found that the RF classification model resulted in an average error rate of 32.03% for the negative class (negative for BVDV) and 36.78% for the positive class (positive for BVDV). The RF model presented an area under the ROC curve equal to 0.702. Variable importance analysis revealed that important predictors of BVDV occurrence were: a) who inseminates the animals, b) number of neighboring farms that have cattle and c) rectal palpation performed routinely. Our results suggest that the use of machine learning algorithms, especially RF, is a promising methodology for the analysis of cross-sectional studies, presenting satisfactory predictive power and the ability to identify predictors that represent potential risk factors for BVDV investigation. We examined classical predictors and found some new and hard-to-control practices that may lead to the spread of this disease within and among farms, mainly regarding poor or neglected reproduction management, which should be considered for disease control and eradication.
Random Forests to Predict Rectal Toxicity Following Prostate Cancer Radiation Therapy
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ospina, Juan D.; INSERM, U1099, Rennes; Escuela de Estadística, Universidad Nacional de Colombia Sede Medellín, Medellín
2014-08-01
Purpose: To propose a random forest normal tissue complication probability (RF-NTCP) model to predict late rectal toxicity following prostate cancer radiation therapy, and to compare its performance to that of classic NTCP models. Methods and Materials: Clinical data and dose-volume histograms (DVH) were collected from 261 patients who received 3-dimensional conformal radiation therapy for prostate cancer with at least 5 years of follow-up. The series was split 1000 times into training and validation cohorts. A RF was trained to predict the risk of 5-year overall rectal toxicity and bleeding. Parameters of the Lyman-Kutcher-Burman (LKB) model were identified and a logistic regression model was fit. The performance of all the models was assessed by computing the area under the receiver operating characteristic curve (AUC). Results: The 5-year grade ≥2 overall rectal toxicity and grade ≥1 and grade ≥2 rectal bleeding rates were 16%, 25%, and 10%, respectively. Predictive capabilities were obtained using the RF-NTCP model for all 3 toxicity endpoints, including both the training and validation cohorts. The age and use of anticoagulants were found to be predictors of rectal bleeding. The AUC for RF-NTCP ranged from 0.66 to 0.76, depending on the toxicity endpoint. The AUC values for the LKB-NTCP were statistically significantly inferior, ranging from 0.62 to 0.69. Conclusions: The RF-NTCP model may be a useful new tool in predicting late rectal toxicity, including variables other than DVH, and thus appears to be a strong competitor to classic NTCP models.
An AUC-based permutation variable importance measure for random forests.
Janitza, Silke; Strobl, Carolin; Boulesteix, Anne-Laure
2013-04-05
The random forest (RF) method is a commonly used tool for classification with high dimensional data as well as for ranking candidate predictors based on the so-called random forest variable importance measures (VIMs). However the classification performance of RF is known to be suboptimal in case of strongly unbalanced data, i.e. data where response class sizes differ considerably. Suggestions were made to obtain better classification performance based either on sampling procedures or on cost sensitivity analyses. However to our knowledge the performance of the VIMs has not yet been examined in the case of unbalanced response classes. In this paper we explore the performance of the permutation VIM for unbalanced data settings and introduce an alternative permutation VIM based on the area under the curve (AUC) that is expected to be more robust towards class imbalance. We investigated the performance of the standard permutation VIM and of our novel AUC-based permutation VIM for different class imbalance levels using simulated data and real data. The results suggest that the new AUC-based permutation VIM outperforms the standard permutation VIM for unbalanced data settings while both permutation VIMs have equal performance for balanced data settings. The standard permutation VIM loses its ability to discriminate between associated predictors and predictors not associated with the response for increasing class imbalance. It is outperformed by our new AUC-based permutation VIM for unbalanced data settings, while the performance of both VIMs is very similar in the case of balanced classes. The new AUC-based VIM is implemented in the R package party for the unbiased RF variant based on conditional inference trees. The codes implementing our study are available from the companion website: http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/070_drittmittel/janitza/index.html.
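The published implementation is the R package party; purely as an illustration of the idea, the sketch below computes a permutation importance scored by the drop in AUC rather than in error rate, on a held-out set instead of the per-tree out-of-bag samples used in the original method.

```python
# Sketch: AUC-based permutation variable importance for an unbalanced binary response.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           weights=[0.9, 0.1], random_state=6)   # strongly unbalanced
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=6)
rf = RandomForestClassifier(n_estimators=500, random_state=6).fit(X_tr, y_tr)

base_auc = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])
rng = np.random.default_rng(6)
for j in range(X_te.shape[1]):
    X_perm = X_te.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    auc = roc_auc_score(y_te, rf.predict_proba(X_perm)[:, 1])
    print(f"feature {j}: AUC-based importance = {base_auc - auc:.3f}")
```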
NASA Astrophysics Data System (ADS)
Zimmerman, Naomi; Presto, Albert A.; Kumar, Sriniwasa P. N.; Gu, Jason; Hauryliuk, Aliaksei; Robinson, Ellis S.; Robinson, Allen L.; Subramanian, R.
2018-01-01
Low-cost sensing strategies hold the promise of denser air quality monitoring networks, which could significantly improve our understanding of personal air pollution exposure. Additionally, low-cost air quality sensors could be deployed to areas where limited monitoring exists. However, low-cost sensors are frequently sensitive to environmental conditions and pollutant cross-sensitivities, which have historically been poorly addressed by laboratory calibrations, limiting their utility for monitoring. In this study, we investigated different calibration models for the Real-time Affordable Multi-Pollutant (RAMP) sensor package, which measures CO, NO2, O3, and CO2. We explored three methods: (1) laboratory univariate linear regression, (2) empirical multiple linear regression, and (3) machine-learning-based calibration models using random forests (RF). Calibration models were developed for 16-19 RAMP monitors (varied by pollutant) using training and testing windows spanning August 2016 through February 2017 in Pittsburgh, PA, US. The random forest models matched (CO) or significantly outperformed (NO2, CO2, O3) the other calibration models, and their accuracy and precision were robust over time for testing windows of up to 16 weeks. Following calibration, average mean absolute error on the testing data set from the random forest models was 38 ppb for CO (14 % relative error), 10 ppm for CO2 (2 % relative error), 3.5 ppb for NO2 (29 % relative error), and 3.4 ppb for O3 (15 % relative error), and Pearson r versus the reference monitors exceeded 0.8 for most units. Model performance is explored in detail, including a quantification of model variable importance, accuracy across different concentration ranges, and performance in a range of monitoring contexts including the National Ambient Air Quality Standards (NAAQS) and the US EPA Air Sensors Guidebook recommendations of minimum data quality for personal exposure measurement. A key strength of the RF approach is that it accounts for pollutant cross-sensitivities. This highlights the importance of developing multipollutant sensor packages (as opposed to single-pollutant monitors); we determined this is especially critical for NO2 and CO2. The evaluation reveals that only the RF-calibrated sensors meet the US EPA Air Sensors Guidebook recommendations of minimum data quality for personal exposure measurement. We also demonstrate that the RF-model-calibrated sensors could detect differences in NO2 concentrations between a near-road site and a suburban site less than 1.5 km away. From this study, we conclude that combining RF models with carefully controlled state-of-the-art multipollutant sensor packages as in the RAMP monitors appears to be a very promising approach to address the poor performance that has plagued low-cost air quality sensors.
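A toy sketch of the calibration idea (entirely synthetic signals; not the RAMP data or code): an RF regressor maps a raw sensor signal plus temperature, relative humidity, and a co-located interfering-gas signal to the reference concentration, so the model can learn the cross-sensitivities directly.

```python
# Sketch: RF calibration of a low-cost gas sensor against a reference monitor.
# Hypothetical inputs: raw sensor signal, temperature, relative humidity, and an
# interfering-gas signal to capture cross-sensitivity.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 5000
temp, rh = rng.uniform(0, 35, n), rng.uniform(20, 95, n)
true_no2 = rng.gamma(2.0, 6.0, n)                       # "reference" NO2, ppb
interferent = rng.gamma(2.0, 40.0, n)                   # e.g. an O3-like cross-sensitivity
raw = 0.8 * true_no2 - 0.3 * interferent + 0.5 * temp - 0.1 * rh + rng.normal(0, 2, n)

X = np.column_stack([raw, temp, rh, interferent])
X_tr, X_te, y_tr, y_te = train_test_split(X, true_no2, test_size=0.3, random_state=7)
rf = RandomForestRegressor(n_estimators=500, random_state=7).fit(X_tr, y_tr)
print("MAE (ppb):", round(mean_absolute_error(y_te, rf.predict(X_te)), 2))
```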
Automatic pattern identification of rock moisture based on the Staff-RF model
NASA Astrophysics Data System (ADS)
Zheng, Wei; Tao, Kai; Jiang, Wei
2018-04-01
Studies on the moisture and damage state of rocks generally focus on qualitative description and mechanical information of rocks; such approaches are not applicable to the real-time safety monitoring of rock masses. In this study, a musical staff computing model is used to quantify the acoustic emission (AE) signals of rocks with different moisture patterns. Then, the random forest (RF) method is adopted to form the Staff-RF model for the real-time pattern identification of rock moisture. The entire process requires only the computed information of the AE signal and does not require the mechanical conditions of the rocks.
Fault Detection of Aircraft System with Random Forest Algorithm and Similarity Measure
Park, Wookje; Jung, Sikhang
2014-01-01
A fault detection algorithm was developed using a similarity measure and the random forest algorithm. The algorithm was applied to an unmanned aerial vehicle (UAV) that we prepared. The similarity measure was designed with the help of distance information, and its usefulness was also verified by proof. Fault decisions were carried out by calculation of a weighted similarity measure. Twelve available coefficients among the healthy and faulty status data groups were used to determine the decision. Similarity measure weighting was obtained through the random forest algorithm (RFA); RF provides data priority. In order to get a fast decision response, a limited number of coefficients was also considered. The relation between detection rate and the amount of feature data was analyzed and illustrated. By repeated trials of the similarity calculation, the useful data amount was obtained. PMID:25057508
A random forest model based classification scheme for neonatal amplitude-integrated EEG.
Chen, Weiting; Wang, Yu; Cao, Guitao; Chen, Guoqiang; Gu, Qiufang
2014-01-01
Modern medical advances have greatly increased the survival rate of infants, while they remain in the higher risk group for neurological problems later in life. For infants with encephalopathy or seizures, identification of the extent of brain injury is clinically challenging. Continuous amplitude-integrated electroencephalography (aEEG) monitoring offers a possibility to directly monitor the brain functional state of newborns over hours, and has seen increasing application in neonatal intensive care units (NICUs). This paper presents a novel combined feature set for aEEG and applies the random forest (RF) method to classify aEEG tracings. To that end, a series of experiments were conducted on 282 aEEG tracing cases (209 normal and 73 abnormal ones). Basic features, statistical features and segmentation features were extracted from both the tracing as a whole and the segmented recordings to form a combined feature set. All the features were then sent to a classifier. The significance of features, the data segmentation, the optimization of RF parameters, and the problem of imbalanced datasets were examined through experiments. Experiments were also done to evaluate the performance of RF on aEEG signal classification, compared with several other widely used classifiers including SVM-Linear, SVM-RBF, ANN, Decision Tree (DT), Logistic Regression (LR), ML, and LDA. The combined feature set can better characterize aEEG signals than the basic features, statistical features and segmentation features alone. With the combined feature set, the proposed RF-based aEEG classification system achieved a correct rate of 92.52% and a high F1-score of 95.26%. Among all of the seven classifiers examined in our work, the RF method got the highest correct rate, sensitivity, specificity, and F1-score, which means that RF outperforms all of the other classifiers considered here. The results show that the proposed RF-based aEEG classification system with the combined feature set is efficient and helpful for better detecting brain disorders in newborns.
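For illustration only (synthetic data of the same size and class balance; not the aEEG features), a minimal RF classifier with class weighting to address the imbalance discussed above:

```python
# Sketch: RF on a combined feature set with class weighting for imbalanced aEEG-like data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Hypothetical combined feature matrix: basic + statistical + segmentation features.
X, y = make_classification(n_samples=282, n_features=30, n_informative=10,
                           weights=[0.74, 0.26], random_state=10)   # ~209 vs. ~73 cases
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=10)

rf = RandomForestClassifier(n_estimators=500, class_weight="balanced", random_state=10)
rf.fit(X_tr, y_tr)
print("F1 score:", round(f1_score(y_te, rf.predict(X_te)), 3))
```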
Cluster ensemble based on Random Forests for genetic data.
Alhusain, Luluah; Hafez, Alaaeldin M
2017-01-01
Clustering plays a crucial role in several application domains, such as bioinformatics. In bioinformatics, clustering has been extensively used as an approach for detecting interesting patterns in genetic data. One application is population structure analysis, which aims to group individuals into subpopulations based on shared genetic variations, such as single nucleotide polymorphisms. Advances in DNA sequencing technology have facilitated the obtainment of genetic datasets of exceptional size. Genetic data usually contain hundreds of thousands of genetic markers genotyped for thousands of individuals, making an efficient means for handling such data desirable. Random Forests (RF) has emerged as an efficient algorithm capable of handling high-dimensional data. RF provides a proximity measure that can capture different levels of co-occurring relationships between variables. RF has been widely considered a supervised learning method, although it can be converted into an unsupervised learning method. Therefore, an RF-derived proximity measure combined with a clustering technique may be well suited for determining the underlying structure of unlabeled data. This paper proposes RFcluE, a cluster ensemble approach for determining the underlying structure of genetic data based on RF. The approach comprises a cluster ensemble framework to combine multiple runs of RF clustering. Experiments were conducted on a high-dimensional, real genetic dataset to evaluate the proposed approach. The experiments included an examination of the impact of parameter changes, comparing RFcluE performance against other clustering methods, and an assessment of the relationship between the diversity and quality of the ensemble and its effect on RFcluE performance. This paper proposes RFcluE, a cluster ensemble approach based on RF clustering, to address the problem of population structure analysis and demonstrates the effectiveness of the approach. The paper also illustrates that applying a cluster ensemble approach, combining multiple RF clusterings, produces more robust and higher-quality results as a consequence of feeding the ensemble with diverse views of the high-dimensional genetic data obtained through bagging and random subspace, the two key features of the RF algorithm.
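One ensemble member of such an approach might look like the following sketch (synthetic data, scikit-learn and SciPy): an RF is trained to separate the real data from a column-permuted synthetic contrast class, the leaf co-occurrence frequency is used as a proximity, and the proximity is clustered; RFcluE then combines many such runs, which is not shown here.

```python
# Sketch: unsupervised RF proximity followed by clustering (one ensemble member).
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

X, _ = make_blobs(n_samples=300, centers=3, n_features=10, random_state=8)

# Build a synthetic "contrast" class by permuting each column independently,
# then train an RF to separate real from synthetic rows (Breiman-style unsupervised RF).
rng = np.random.default_rng(8)
X_synth = np.column_stack([rng.permutation(col) for col in X.T])
X_all = np.vstack([X, X_synth])
y_all = np.r_[np.ones(len(X), dtype=int), np.zeros(len(X_synth), dtype=int)]
rf = RandomForestClassifier(n_estimators=500, random_state=8).fit(X_all, y_all)

# Proximity: fraction of trees in which two real samples land in the same leaf.
leaves = rf.apply(X)                                     # (n_samples, n_trees) leaf ids
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
dist = 1.0 - prox[np.triu_indices(len(X), k=1)]          # condensed distance vector
labels = fcluster(linkage(dist, method="average"), t=3, criterion="maxclust")
print("cluster sizes:", np.bincount(labels)[1:])
```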
Detecting understory plant invasion in urban forests using LiDAR
NASA Astrophysics Data System (ADS)
Singh, Kunwar K.; Davis, Amy J.; Meentemeyer, Ross K.
2015-06-01
Light detection and ranging (LiDAR) data are increasingly used to measure structural characteristics of urban forests but are rarely used to detect the growing problem of exotic understory plant invaders. We explored the merits of using LiDAR-derived metrics alone and integrated with spectral data to detect the spatial distribution of the exotic understory plant Ligustrum sinense, a rapidly spreading invader in the urbanizing region of Charlotte, North Carolina, USA. We analyzed regional-scale L. sinense occurrence data collected over the course of three years together with LiDAR-derived metrics of forest structure, categorized into the following groups: overstory, understory, topography, and overall vegetation characteristics, and IKONOS spectral (optical) features. Using random forest (RF) and logistic regression (LR) classifiers, we assessed the relative contributions of LiDAR- and IKONOS-derived variables to the detection of L. sinense. We compared the top performing models developed for a smaller, nested experimental extent using RF and LR classifiers, and used the best overall model to produce a predictive map of the spatial distribution of L. sinense across our county-wide study extent. RF classification of LiDAR-derived topography metrics produced the highest mapping accuracy estimates, outperforming IKONOS data by 17.5% and the integration of LiDAR and IKONOS data by 5.3%. The top performing model from the RF classifier produced the highest kappa of 64.8%, improving on the parsimonious LR model kappa by 31.1% with a moderate gain of 6.2% over the county extent model. Our results demonstrate the superiority of LiDAR-derived metrics over spectral data and the fusion of LiDAR and spectral data for accurately mapping the spatial distribution of the forest understory invader L. sinense.
Nicodemus, Kristin K; Malley, James D; Strobl, Carolin; Ziegler, Andreas
2010-02-27
Random forests (RF) have been increasingly used in applications such as genome-wide association and microarray studies where predictor correlation is frequently observed. Recent works on permutation-based variable importance measures (VIMs) used in RF have come to apparently contradictory conclusions. We present an extended simulation study to synthesize results. In the case when both predictor correlation was present and predictors were associated with the outcome (HA), the unconditional RF VIM attributed a higher share of importance to correlated predictors, while under the null hypothesis that no predictors are associated with the outcome (H0) the unconditional RF VIM was unbiased. Conditional VIMs showed a decrease in VIM values for correlated predictors versus the unconditional VIMs under HA and was unbiased under H0. Scaled VIMs were clearly biased under HA and H0. Unconditional unscaled VIMs are a computationally tractable choice for large datasets and are unbiased under the null hypothesis. Whether the observed increased VIMs for correlated predictors may be considered a "bias" - because they do not directly reflect the coefficients in the generating model - or if it is a beneficial attribute of these VIMs is dependent on the application. For example, in genetic association studies, where correlation between markers may help to localize the functionally relevant variant, the increased importance of correlated predictors may be an advantage. On the other hand, we show examples where this increased importance may result in spurious signals.
NASA Astrophysics Data System (ADS)
Xin, Ni; Gu, Xiao-Feng; Wu, Hao; Hu, Yu-Zhu; Yang, Zhong-Lin
2012-04-01
Most herbal medicines can be processed to fulfill different requirements of therapy. The purpose of this study was to discriminate between raw and processed Dipsacus asperoides, a common traditional Chinese medicine, based on their near infrared (NIR) spectra. Least squares-support vector machine (LS-SVM) and random forests (RF) were employed for full-spectrum classification. Three types of kernels, including the linear kernel, polynomial kernel and radial basis function kernel (RBF), were checked for optimization of the LS-SVM model. For comparison, a linear discriminant analysis (LDA) model was performed for classification, and the successive projections algorithm (SPA) was executed prior to building the LDA model to choose an appropriate subset of wavelengths. The three methods were applied to a dataset containing 40 raw herbs and 40 corresponding processed herbs. We performed 50 runs of 10-fold cross-validation to evaluate model efficiency. The performance of the LS-SVM with the RBF kernel (RBF LS-SVM) was better than that with the other two kernels. The RF, RBF LS-SVM and SPA-LDA successfully classified all test samples. The mean error rates for the 50 runs of 10-fold cross-validation were 1.35% for RBF LS-SVM, 2.87% for RF, and 2.50% for SPA-LDA. The best classification results were obtained by using LS-SVM with the RBF kernel, while RF was fast in training and in making predictions.
A prediction scheme of tropical cyclone frequency based on lasso and random forest
NASA Astrophysics Data System (ADS)
Tan, Jinkai; Liu, Hexiang; Li, Mengya; Wang, Jun
2017-07-01
This study proposes a novel prediction scheme for tropical cyclone frequency (TCF) over the Western North Pacific (WNP). We considered large-scale meteorological factors including sea surface temperature, sea level pressure, the Niño-3.4 index, wind shear, vorticity, the subtropical high, and sea ice cover, since the gradual change of these factors in the context of climate change would cause a gradual variation of the annual TCF. Specifically, we focus on the correlation between the year-to-year increments of these factors and that of TCF. The least absolute shrinkage and selection operator (Lasso) method was used for variable selection and dimension reduction from 11 initial predictors. Then, a prediction model based on random forest (RF) was established, using the training samples (1978-2011) for calibration and the testing samples (2012-2016) for validation. The RF model reproduced the major variation and trend of TCF in the calibration period, and also fitted the observed TCF well in the validation period, though with some deviations. Leave-one-out cross-validation showed that most of the predicted TCF values were consistent with the observed TCF, with a high correlation coefficient. A comparison between the results of the RF model and a multiple linear regression (MLR) model suggested that RF is more practical and capable of giving reliable TCF predictions over the WNP.
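A minimal sketch of the two-stage scheme follows, under stated assumptions: Lasso screens the 11 candidate increment predictors and a random forest is then trained on the retained ones. The arrays and the 1978-2011 / 2012-2016 split are placeholders, not the study's data.

```python
# Sketch of a Lasso-then-RF scheme on placeholder year-to-year increments.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
X_inc = rng.normal(size=(39, 11))              # increments of 11 predictors, 1978-2016
y_inc = X_inc[:, 0] - 0.8 * X_inc[:, 3] + rng.normal(0, 0.3, 39)  # TCF increment (synthetic)

train, test = slice(0, 34), slice(34, 39)      # 1978-2011 calibration, 2012-2016 validation
lasso = LassoCV(cv=5).fit(X_inc[train], y_inc[train])
keep = np.flatnonzero(lasso.coef_ != 0)        # predictors retained by Lasso
if keep.size == 0:                             # fall back to all predictors if none retained
    keep = np.arange(X_inc.shape[1])
print("selected predictor indices:", keep)

rf = RandomForestRegressor(n_estimators=500, random_state=0)
rf.fit(X_inc[train][:, keep], y_inc[train])
print("predicted TCF increments 2012-2016:", rf.predict(X_inc[test][:, keep]).round(2))
```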
NASA Astrophysics Data System (ADS)
Othman, Arsalan A.; Gloaguen, Richard
2017-09-01
Lithological mapping in mountainous regions is often impeded by limited accessibility due to relief. This study aims to evaluate (1) the performance of different supervised classification approaches using remote sensing data and (2) the use of additional information such as geomorphology. We exemplify the methodology in the Bardi-Zard area in NE Iraq, a part of the Zagros Fold-Thrust Belt known for its chromite deposits. We highlight the improvement of remote sensing geological classification obtained by integrating geomorphic features and spatial information into the classification scheme. We applied a Maximum Likelihood (ML) classification method alongside two machine learning algorithms (MLAs), Support Vector Machine (SVM) and Random Forest (RF), to allow the joint use of geomorphic features, Band Ratio (BR), Principal Component Analysis (PCA), spatial information (spatial coordinates) and multispectral data of the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) satellite. The RF algorithm showed reliable results and discriminated serpentinite; talus and terrace deposits; red argillites with conglomerates and limestone; limy conglomerates and limestone conglomerates; tuffites interbedded with basic lavas; limestone and metamorphosed limestone; and reddish green shales. The best overall accuracy (∼80%) was achieved by the RF algorithm in the majority of the sixteen tested dataset combinations.
Human tracking in thermal images using adaptive particle filters with online random forest learning
NASA Astrophysics Data System (ADS)
Ko, Byoung Chul; Kwak, Joon-Young; Nam, Jae-Yeal
2013-11-01
This paper presents a fast and robust human tracking method for use with a moving long-wave infrared thermal camera under poor illumination, shadows, and cluttered backgrounds. To improve human tracking performance while minimizing computation time, this study proposes online learning of classifiers based on particle filters and the combination of a local intensity distribution (LID) with oriented center-symmetric local binary patterns (OCS-LBP). Specifically, we design a real-time random forest (RF), an ensemble of decision trees for confidence estimation, whose confidences are converted into a likelihood function of the target state. First, the target model is selected by the user and particles are sampled. Then, RFs are generated by online learning from positive and negative examples with LID and OCS-LBP features. In the next stage, the learned RF classifiers are used to detect the most likely target position in the subsequent frame. The RFs are then learned again by means of fast retraining with the tracked object and background appearance in the new frame. The proposed algorithm was successfully tested on various thermal videos, and its tracking performance is better than that of other methods.
Eskelson, Bianca N.I.; Hagar, Joan; Temesgen, Hailemariam
2012-01-01
Snags (standing dead trees) are an essential structural component of forests. Because wildlife use of snags depends on size and decay stage, snag density estimation without any information about snag quality attributes is of little value for wildlife management decision makers. Little work has been done to develop models that allow multivariate estimation of snag density by snag quality class. Using climate, topography, Landsat TM data, stand age and forest type collected for 2356 forested Forest Inventory and Analysis plots in western Washington and western Oregon, we evaluated two multivariate techniques for their abilities to estimate density of snags by three decay classes. The density of live trees and snags in three decay classes (D1: recently dead, little decay; D2: decay, without top, some branches and bark missing; D3: extensive decay, missing bark and most branches) with diameter at breast height (DBH) ≥ 12.7 cm was estimated using a nonparametric random forest nearest neighbor imputation technique (RF) and a parametric two-stage model (QPORD), for which the number of trees per hectare was estimated with a Quasipoisson model in the first stage and the probability of belonging to a tree status class (live, D1, D2, D3) was estimated with an ordinal regression model in the second stage. The presence of large snags with DBH ≥ 50 cm was predicted using a logistic regression and RF imputation. Because of the more homogeneous conditions on private forest lands, snag density by decay class was predicted with higher accuracies on private forest lands than on public lands, while presence of large snags was more accurately predicted on public lands, owing to the higher prevalence of large snags on public lands. RF outperformed the QPORD model in terms of percent accurate predictions, while QPORD provided smaller root mean square errors in predicting snag density by decay class. The logistic regression model achieved more accurate presence/absence classification of large snags than the RF imputation approach. Adjusting the decision threshold to account for unequal sizes of the presence and absence classes is more straightforward for the logistic regression than for the RF imputation approach. Overall, model accuracies were poor in this study, which can be attributed to the poor predictive quality of the explanatory variables and the large range of forest types and geographic conditions observed in the data.
Naderi, S; Yin, T; König, S
2016-09-01
A simulation study was conducted to investigate the performance of random forest (RF) and genomic BLUP (GBLUP) for genomic predictions of binary disease traits based on cow calibration groups. Training and testing sets were modified in different scenarios according to disease incidence, the quantitative-genetic background of the trait (h(2)=0.30 and h(2)=0.10), and the genomic architecture [725 quantitative trait loci (QTL) and 290 QTL, populations with high and low levels of linkage disequilibrium (LD)]. For all scenarios, 10,005 SNP (depicting a low-density 10K SNP chip) and 50,025 SNP (depicting a 50K SNP chip) were evenly spaced along 29 chromosomes. Training and testing sets included 20,000 cows (4,000 sick, 16,000 healthy, disease incidence 20%) from the last 2 generations. Initially, 4,000 sick cows were assigned to the testing set, and the remaining 16,000 healthy cows represented the training set. In the ongoing allocation schemes, the number of sick cows in the training set increased stepwise by moving 10% of the sick animals from the testing set to the training set, and vice versa. The size of the training and testing sets was kept constant. Evaluation criteria for both GBLUP and RF were the correlations between genomic breeding values and true breeding values (prediction accuracy), and the area under the receiver operating characteristic curve (AUROC). Prediction accuracy and AUROC increased for both methods and all scenarios as increasing percentages of sick cows were allocated to the training set. Highest prediction accuracies were observed for disease incidences in training sets that reflected the population disease incidence of 0.20. For this allocation scheme, the largest prediction accuracies of 0.53 for RF and of 0.51 for GBLUP, and the largest AUROC of 0.66 for RF and of 0.64 for GBLUP, were achieved using 50,025 SNP, a heritability of 0.30, and 725 QTL. Heritability decreases from 0.30 to 0.10 and QTL reduction from 725 to 290 were associated with decreasing prediction accuracy and decreasing AUROC for all scenarios. This decrease was more pronounced for RF. Also, the increase of LD had a stronger effect on RF results than on GBLUP results. The highest prediction accuracy from the low LD scenario was 0.30 from RF and 0.36 from GBLUP, and increased to 0.39 for both methods in the high LD population. Random forest successfully identified important SNP in close map distance to QTL explaining a high proportion of the phenotypic trait variations. Copyright © 2016 American Dairy Science Association. Published by Elsevier Inc. All rights reserved.
Discriminant forest classification method and system
Chen, Barry Y.; Hanley, William G.; Lemmond, Tracy D.; Hiller, Lawrence J.; Knapp, David A.; Mugge, Marshall J.
2012-11-06
A hybrid machine learning methodology and system for classification that combines classical random forest (RF) methodology with discriminant analysis (DA) techniques to provide enhanced classification capability. A DA technique which uses feature measurements of an object to predict its class membership, such as linear discriminant analysis (LDA) or Andersen-Bahadur linear discriminant technique (AB), is used to split the data at each node in each of its classification trees to train and grow the trees and the forest. When training is finished, a set of n DA-based decision trees of a discriminant forest is produced for use in predicting the classification of new samples of unknown class.
Wang, Huazhen; Liu, Xin; Lv, Bing; Yang, Fan; Hong, Yanzhu
2014-01-01
Objective The etiology, pathophysiology, nomenclature and diagnostic criteria of Chronic Fatigue (CF) remain unclear in the medical community. Traditional Chinese medicine (TCM) adopts a unique diagnostic method, namely ‘bian zheng lun zhi’ or syndrome differentiation, to diagnose CF with a set of syndrome factors, which can be regarded as a Multi-Label Learning (MLL) problem in the machine learning literature. To obtain an effective and reliable diagnostic tool, we use a Conformal Predictor (CP), Random Forest (RF) and a Problem Transformation method (PT) for the syndrome differentiation of CF. Methods and Materials In this work, using the PT method, CP-RF is extended to handle the MLL problem. CP-RF applies RF to measure the confidence level (p-value) of each label being a true label, and then selects the labels whose p-values are larger than a pre-defined significance level as the region prediction. We compare the proposed CP-RF with typical CP-NBC (Naïve Bayes Classifier), CP-KNN (K-Nearest Neighbors) and ML-KNN on a CF dataset consisting of 736 cases. Specifically, 95 symptoms are used to identify CF, and four syndrome factors are employed in the syndrome differentiation, including ‘spleen deficiency’, ‘heart deficiency’, ‘liver stagnation’ and ‘qi deficiency’. Results CP-RF demonstrates outstanding performance beyond CP-NBC, CP-KNN and ML-KNN under the general metrics of subset accuracy, hamming loss, one-error, coverage, ranking loss and average precision. Furthermore, the performance of CP-RF remains steady over the large range of confidence levels from 80% to 100%, which indicates its robustness to the threshold determination. In addition, the confidence evaluation provided by CP is valid and well-calibrated. Conclusion CP-RF not only offers outstanding performance but also provides valid confidence evaluation for CF syndrome differentiation. It should be well suited to TCM practitioners and facilitate objective, effective and reliable computer-based diagnosis. PMID:24918430
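For a single syndrome factor, the CP-RF idea reduces to an inductive conformal predictor wrapped around a random forest; the multi-label case repeats this per label via problem transformation. The following is a hedged sketch with hypothetical symptom data and thresholds, not the authors' implementation.

```python
# Hedged sketch of an inductive conformal predictor built on a random forest for
# one binary label; data, features and significance level are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.integers(0, 2, size=(736, 95)).astype(float)    # 95 binary symptoms (placeholder)
y = (X[:, :5].sum(axis=1) + rng.normal(0, 0.5, 736) > 2.5).astype(int)  # one syndrome factor

X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, test_size=0.3, random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

# Nonconformity score: 1 - forest probability of the hypothesised class.
cal_scores = 1.0 - rf.predict_proba(X_cal)[np.arange(len(y_cal)), y_cal]

def p_value(x, label):
    score = 1.0 - rf.predict_proba(x.reshape(1, -1))[0, label]
    return (np.sum(cal_scores >= score) + 1) / (len(cal_scores) + 1)

significance = 0.20                      # i.e. an 80% confidence level
x_new = X[0]
region = [lab for lab in (0, 1) if p_value(x_new, lab) > significance]
print("region prediction for this label:", region)
```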
Comparison of machine-learning methods for above-ground biomass estimation based on Landsat imagery
NASA Astrophysics Data System (ADS)
Wu, Chaofan; Shen, Huanhuan; Shen, Aihua; Deng, Jinsong; Gan, Muye; Zhu, Jinxia; Xu, Hongwei; Wang, Ke
2016-07-01
Biomass is a significant biophysical parameter of a forest ecosystem, and accurate biomass estimation on the regional scale provides important information for carbon-cycle investigation and sustainable forest management. In this study, Landsat satellite imagery combined with field-based measurements was used to compare five regression approaches [stepwise linear regression, K-nearest neighbor, support vector regression, random forest (RF), and stochastic gradient boosting] with two different candidate-variable strategies in order to obtain the optimal spatial above-ground biomass (AGB) estimation. The results suggested that the RF algorithm exhibited the best performance under 10-fold cross-validation with respect to R2 (0.63) and root-mean-square error (26.44 ton/ha). Consequently, a map of estimated AGB was generated with a mean value of 89.34 ton/ha in northwestern Zhejiang Province, China, with a pattern similar to the distribution of local forest species. This research indicates that machine-learning approaches combined with Landsat imagery provide an economical way to estimate biomass. Moreover, ensemble methods using all candidate variables, especially for Landsat images, provide an alternative for regional biomass simulation.
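The model comparison step amounts to scoring several regressors with 10-fold cross-validation on the same predictor set. A minimal sketch is shown below with placeholder predictors and AGB values; plain ordinary least squares stands in for stepwise linear regression.

```python
# Minimal sketch of the regression comparison on placeholder predictor/AGB arrays.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 10))                      # Landsat bands / indices (placeholder)
agb = 80 + 20 * X[:, 0] - 10 * X[:, 1] + rng.normal(0, 15, 300)   # ton/ha (synthetic)

models = {
    "OLS (stand-in for stepwise)": LinearRegression(),
    "kNN": KNeighborsRegressor(n_neighbors=5),
    "SVR": SVR(),
    "RF": RandomForestRegressor(n_estimators=500, random_state=0),
    "SGB": GradientBoostingRegressor(random_state=0),
}
for name, model in models.items():
    r2 = cross_val_score(model, X, agb, cv=10, scoring="r2").mean()
    print(f"{name}: mean 10-fold R2 = {r2:.2f}")
```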
NASA Astrophysics Data System (ADS)
Deo, Ram K.
Credible spatial information characterizing the structure and site quality of forests is critical to sustainable forest management and planning, especially given the increasing demands and threats to forest products and services. Forest managers and planners are required to evaluate forest conditions over a broad range of scales, contingent on operational or reporting requirements. Traditionally, forest inventory estimates are generated via a design-based approach that involves generalizing sample plot measurements to characterize an unknown population across a larger area of interest. However, field plot measurements are costly and as a consequence spatial coverage is limited. Remote sensing technologies have shown remarkable success in augmenting limited sample plot data to generate stand- and landscape-level spatial predictions of forest inventory attributes. Further enhancement of forest inventory approaches that couple field measurements with cutting-edge remotely sensed and geospatial datasets is essential to sustainable forest management. We evaluated a novel Random Forest-based k-Nearest Neighbors (RF-kNN) imputation approach to couple remote sensing and geospatial data with field inventory collected by different sampling methods to generate forest inventory information across large spatial extents. The forest inventory data collected by the FIA program of the US Forest Service were integrated with optical remote sensing and other geospatial datasets to produce biomass distribution maps for a part of the Lake States and species-specific site index maps for the entire Lake States region. Targeting small-area application of state-of-the-art remote sensing, LiDAR (light detection and ranging) data were integrated with field data collected by an inexpensive method, called variable plot sampling, in the Ford Forest of Michigan Tech to derive a standing volume map in a cost-effective way. The outputs of the RF-kNN imputation were compared with independent validation datasets and extant map products based on different sampling and modeling strategies. The RF-kNN modeling approach was found to be very effective, especially for large-area estimation, and produced results statistically equivalent to the field observations or the estimates derived from secondary data sources. The models are useful to resource managers for operational and strategic purposes.
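One common reading of RF-kNN imputation is to use the forest's leaf assignments as a proximity measure and impute each target unit from its k most proximate reference plots. The sketch below follows that reading with placeholder data; it is an assumption-laden illustration, not the study's implementation.

```python
# Hedged sketch of random-forest-proximity kNN imputation: proximity = share of
# trees in which two observations land in the same leaf; a target unit gets the
# mean attribute of its k most proximate reference plots. Data and k are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
X_ref = rng.normal(size=(200, 6))                 # remote-sensing predictors at field plots
y_ref = 100 + 30 * X_ref[:, 0] + rng.normal(0, 10, 200)   # e.g. biomass or volume
X_target = rng.normal(size=(50, 6))               # pixels/units to impute

rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_ref, y_ref)
leaves_ref = rf.apply(X_ref)                      # (n_ref, n_trees) leaf indices
leaves_tgt = rf.apply(X_target)

k = 5
imputed = np.empty(len(X_target))
for i, row in enumerate(leaves_tgt):
    proximity = (leaves_ref == row).mean(axis=1)  # similarity to every reference plot
    nearest = np.argsort(proximity)[-k:]          # k most proximate plots
    imputed[i] = y_ref[nearest].mean()
print(imputed[:5].round(1))
```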
Evaluation of Semi-supervised Learning for Classification of Protein Crystallization Imagery.
Sigdel, Madhav; Dinç, İmren; Dinç, Semih; Sigdel, Madhu S; Pusey, Marc L; Aygün, Ramazan S
2014-03-01
In this paper, we investigate the performance of two wrapper methods for semi-supervised learning algorithms for classification of protein crystallization images with limited labeled images. Firstly, we evaluate the performance of the semi-supervised approach using self-training with naïve Bayesian (NB) and sequential minimum optimization (SMO) as the base classifiers. The confidence values returned by these classifiers are used to select high-confidence predictions to be used for self-training. Secondly, we analyze the performance of Yet Another Two Stage Idea (YATSI) semi-supervised learning using NB, SMO, multilayer perceptron (MLP), J48 and random forest (RF) classifiers. These results are compared with basic supervised learning using the same training sets. We perform our experiments on a dataset consisting of 2250 protein crystallization images for different proportions of training and test data. Our results indicate that NB and SMO using both self-training and YATSI semi-supervised approaches improve accuracies with respect to supervised learning. On the other hand, MLP, J48 and RF perform better using basic supervised learning. Overall, the random forest classifier yields the best accuracy with supervised learning for our dataset.
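The self-training wrapper described above retrains a base classifier on its own most confident predictions. A minimal sketch follows using scikit-learn's SelfTrainingClassifier with naive Bayes as the base learner; the feature matrix, labeling rate and confidence threshold are placeholders.

```python
# Minimal self-training sketch: a base classifier labels its most confident
# unlabeled images (probability above a threshold) and is retrained on them.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(6)
X = rng.normal(size=(2250, 20))                  # placeholder image features
y_true = (X[:, 0] + X[:, 1] > 0).astype(int)
y = y_true.copy()
y[rng.random(2250) > 0.1] = -1                   # keep ~10% labeled; -1 marks unlabeled

model = SelfTrainingClassifier(GaussianNB(), threshold=0.95)
model.fit(X, y)
print("accuracy on all images:", (model.predict(X) == y_true).mean().round(3))
```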
Classification of Hyperspectral Data Based on Guided Filtering and Random Forest
NASA Astrophysics Data System (ADS)
Ma, H.; Feng, W.; Cao, X.; Wang, L.
2017-09-01
Hyperspectral images usually consist of more than one hundred spectral bands, which have the potential to provide rich spatial and spectral information. However, the application of hyperspectral data is still challenging due to "the curse of dimensionality". In this context, many techniques that aim to make full use of both the spatial and spectral information have been investigated. In order to preserve the geometrical information while using fewer spectral bands, we propose a novel method that combines principal component analysis (PCA), guided image filtering and the random forest classifier (RF). In detail, PCA is first employed to reduce the dimensionality of the spectral bands. Secondly, the guided image filtering technique is introduced to smooth land objects while preserving their edges. Finally, the features are fed into the RF classifier. To illustrate the effectiveness of the method, we carry out experiments over the popular Indian Pines data set, which was collected by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor. Comparing the proposed method with methods using only PCA or only the guided image filter, we find that the proposed method performs better.
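A hedged sketch of the PCA, guided-filtering and RF pipeline is given below. It assumes the guided filter from opencv-contrib (cv2.ximgproc.guidedFilter) is available, and the cube shape, number of components, filter radius and regularization are placeholders rather than the Indian Pines settings.

```python
# Sketch of the pipeline: PCA -> guided filtering of each component -> RF classifier.
import numpy as np
import cv2  # requires opencv-contrib-python for cv2.ximgproc
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)
h, w, bands = 64, 64, 120
cube = rng.random((h, w, bands)).astype(np.float32)     # hyperspectral cube (placeholder)
labels = rng.integers(0, 5, size=(h, w))                # ground-truth map (placeholder)

# 1) Spectral dimension reduction.
pcs = PCA(n_components=10).fit_transform(cube.reshape(-1, bands))
pcs = pcs.reshape(h, w, -1).astype(np.float32)

# 2) Edge-preserving smoothing of each principal component, guided by the first PC.
guide = pcs[:, :, 0]
filtered = np.stack(
    [cv2.ximgproc.guidedFilter(guide, pcs[:, :, i], 4, 1e-2)   # radius=4, eps=1e-2
     for i in range(pcs.shape[2])], axis=-1)

# 3) Pixel-wise random forest classification on the filtered features.
X, y = filtered.reshape(-1, filtered.shape[2]), labels.ravel()
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print("training accuracy:", rf.score(X, y).round(3))
```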
Diagnosis of colorectal cancer by near-infrared optical fiber spectroscopy and random forest
NASA Astrophysics Data System (ADS)
Chen, Hui; Lin, Zan; Wu, Hegang; Wang, Li; Wu, Tong; Tan, Chao
2015-01-01
Near-infrared (NIR) spectroscopy has such advantages as being noninvasive, fast, relatively inexpensive, and free of ionizing radiation. Differences in the NIR signals can reflect many physiological changes, which are in turn associated with such factors as vascularization, cellularity, oxygen consumption, or remodeling. NIR spectral differences between colorectal cancer and healthy tissues were investigated. A Fourier transform NIR spectroscopy instrument equipped with a fiber-optic probe was used to mimic in situ clinical measurements. A total of 186 spectra were collected and then underwent standard normal variate (SNV) preprocessing to remove unwanted background variance. All specimens and spots used for spectral collection were confirmed by staining and examination by an experienced pathologist to ensure that they were representative of the pathology. Principal component analysis (PCA) was used to uncover possible clustering. Several methods including random forest (RF), partial least squares-discriminant analysis (PLSDA), K-nearest neighbor and classification and regression tree (CART) were used to extract spectral features and to construct the diagnostic models. The comparison reveals that, although no obvious difference in misclassification ratio (MCR) was observed between these models, RF is preferable since it is quicker, more convenient and insensitive to over-fitting. The results indicate that NIR spectroscopy coupled with an RF model can serve as a potential tool for discriminating colorectal cancer tissues from normal ones.
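SNV preprocessing simply centres and scales each spectrum individually before modeling. The short sketch below applies SNV to synthetic spectra, inspects clustering with PCA, and cross-validates a random forest; all data are placeholders.

```python
# Sketch: standard normal variate (SNV) preprocessing, PCA inspection, RF diagnosis.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(8)
spectra = rng.random((186, 700)) + rng.normal(0, 0.2, (186, 1))   # additive baseline shifts
labels = rng.integers(0, 2, 186)                                   # cancer vs normal (placeholder)

# SNV: centre and scale each spectrum individually to remove background variance.
snv = (spectra - spectra.mean(axis=1, keepdims=True)) / spectra.std(axis=1, keepdims=True)

scores = PCA(n_components=2).fit_transform(snv)    # for visualising possible clustering
rf = RandomForestClassifier(n_estimators=500, random_state=0)
print("10-fold CV accuracy:", cross_val_score(rf, snv, labels, cv=10).mean().round(3))
```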
Defining Higher-Order Turbulent Moment Closures with an Artificial Neural Network and Random Forest
NASA Astrophysics Data System (ADS)
McGibbon, J.; Bretherton, C. S.
2017-12-01
Unresolved turbulent advection and clouds must be parameterized in atmospheric models. Modern higher-order closure schemes depend on analytic moment closure assumptions that diagnose higher-order moments in terms of lower-order ones. These are then tested against Large-Eddy Simulation (LES) higher-order moment relations. However, these relations may not be neatly analytic in nature. Rather than rely on an analytic higher-order moment closure, can we use machine learning on LES data itself to define a higher-order moment closure? We assess the ability of a deep artificial neural network (NN) and random forest (RF) to perform this task using a set of observationally-based LES runs from the MAGIC field campaign. By training on a subset of 12 simulations and testing on the remaining simulations, we avoid over-fitting the training data. Performance of the NN and RF will be assessed and compared to the Analytic Double Gaussian 1 (ADG1) closure assumed by Cloudy Layers Unified By Binormals (CLUBB), a higher-order turbulence closure currently used in the Community Atmosphere Model (CAM). We will show that the RF outperforms the NN and the ADG1 closure for the MAGIC cases within this diagnostic framework. Progress and challenges in using a diagnostic machine learning closure within a prognostic cloud and turbulence parameterization will also be discussed.
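In this diagnostic framework, the learning task is a regression from lower-order moments to a higher-order moment, with whole simulations held out for testing. The sketch below mimics that setup on synthetic data, using a small multilayer perceptron as a stand-in for the deep network; it is illustrative only.

```python
# Illustrative sketch: learn a higher-order moment from lower-order moments,
# holding entire "simulations" out for testing, and compare RF with a small NN.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(9)
sim_id = np.repeat(np.arange(16), 200)                   # 16 synthetic "simulations"
X = rng.normal(size=(3200, 5))                           # lower-order moments (placeholder)
y = X[:, 0] * X[:, 1] + 0.3 * X[:, 2] ** 2 + rng.normal(0, 0.1, 3200)  # higher-order moment

train = sim_id < 12                                      # train on 12 runs, test on the rest
rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X[train], y[train])
nn = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                  random_state=0).fit(X[train], y[train])
for name, model in [("RF", rf), ("NN", nn)]:
    print(name, "held-out R2:", r2_score(y[~train], model.predict(X[~train])).round(3))
```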
Silva, Carlos Alberto; Hudak, Andrew Thomas; Klauberg, Carine; Vierling, Lee Alexandre; Gonzalez-Benecke, Carlos; de Padua Chaves Carvalho, Samuel; Rodriguez, Luiz Carlos Estraviz; Cardil, Adrián
2017-12-01
LiDAR remote sensing is a rapidly evolving technology for quantifying a variety of forest attributes, including aboveground carbon (AGC). Pulse density influences the acquisition cost of LiDAR, and grid cell size influences AGC prediction using plot-based methods; however, little work has evaluated the effects of LiDAR pulse density and cell size for predicting and mapping AGC in fast-growing Eucalyptus forest plantations. The aim of this study was to evaluate the effect of LiDAR pulse density and grid cell size on AGC prediction accuracy at plot and stand levels using airborne LiDAR and field data. We used the Random Forest (RF) machine learning algorithm to model AGC using LiDAR-derived metrics from LiDAR collections of 5 and 10 pulses m-2 (RF5 and RF10) and grid cell sizes of 5, 10, 15 and 20 m. The results show that a LiDAR pulse density of 5 pulses m-2 provides metrics with similar prediction accuracy for AGC as a dataset with 10 pulses m-2 in these fast-growing plantations. Relative root mean square errors (RMSEs) for the RF5 and RF10 were 6.14 and 6.01%, respectively. Equivalence tests showed that the predicted AGC from the training and validation models were equivalent to the observed AGC measurements. The grid cell sizes for mapping, ranging from 5 to 20 m, also did not significantly affect the prediction accuracy of AGC at stand level in this system. LiDAR measurements can be used to predict and map AGC across variable-age Eucalyptus plantations with adequate levels of precision and accuracy using 5 pulses m-2 and a grid cell size of 5 m. The promising results for AGC modeling in this study will allow for greater confidence in comparing AGC estimates with varying LiDAR sampling densities for Eucalyptus plantations and assist in decision making towards more cost-effective and efficient forest inventory.
NASA Astrophysics Data System (ADS)
Liu, Jiamin; Chang, Kevin; Kim, Lauren; Turkbey, Evrim; Lu, Le; Yao, Jianhua; Summers, Ronald
2015-03-01
The thyroid gland plays an important role in clinical practice, especially for radiation therapy treatment planning. For patients with head and neck cancer, radiation therapy requires a precise delineation of the thyroid gland to be spared on the pre-treatment planning CT images to avoid thyroid dysfunction. In the current clinical workflow, the thyroid gland is normally manually delineated by radiologists or radiation oncologists, which is time consuming and error prone. Therefore, a system for automated segmentation of the thyroid is desirable. However, automated segmentation of the thyroid is challenging because the thyroid is inhomogeneous and surrounded by structures that have similar intensities. In this work, the thyroid gland segmentation is initially estimated by a multi-atlas label fusion algorithm. The segmentation is refined by supervised statistical learning-based voxel labeling with a random forest algorithm. Multi-atlas label fusion (MALF) transfers expert-labeled thyroids from atlases to a target image using deformable registration. Errors produced by label transfer are reduced by label fusion that combines the results produced by all atlases into a consensus solution. Then, random forest (RF) employs an ensemble of decision trees that are trained on labeled thyroids to recognize features. The trained forest classifier is then applied to the thyroid estimated from the MALF by voxel scanning to assign the class-conditional probability. Voxels from the expert-labeled thyroids in CT volumes are treated as positive classes; background non-thyroid voxels as negatives. We applied this automated thyroid segmentation system to CT scans of 20 patients. The results showed that the MALF achieved an overall 0.75 Dice Similarity Coefficient (DSC) and the RF classification further improved the DSC to 0.81.
Marchese Robinson, Richard L; Palczewska, Anna; Palczewski, Jan; Kidley, Nathan
2017-08-28
The ability to interpret the predictions made by quantitative structure-activity relationships (QSARs) offers a number of advantages. While QSARs built using nonlinear modeling approaches, such as the popular Random Forest algorithm, might sometimes be more predictive than those built using linear modeling approaches, their predictions have been perceived as difficult to interpret. However, a growing number of approaches have been proposed for interpreting nonlinear QSAR models in general and Random Forest in particular. In the current work, we compare the performance of Random Forest to those of two widely used linear modeling approaches: linear Support Vector Machines (SVMs) (or Support Vector Regression (SVR)) and partial least-squares (PLS). We compare their performance in terms of their predictivity as well as the chemical interpretability of the predictions using novel scoring schemes for assessing heat map images of substructural contributions. We critically assess different approaches for interpreting Random Forest models as well as for obtaining predictions from the forest. We assess the models on a large number of widely employed public-domain benchmark data sets corresponding to regression and binary classification problems of relevance to hit identification and toxicology. We conclude that Random Forest typically yields comparable or possibly better predictive performance than the linear modeling approaches and that its predictions may also be interpreted in a chemically and biologically meaningful way. In contrast to earlier work looking at interpretation of nonlinear QSAR models, we directly compare two methodologically distinct approaches for interpreting Random Forest models. The approaches for interpreting Random Forest assessed in our article were implemented using open-source programs that we have made available to the community. These programs are the rfFC package ( https://r-forge.r-project.org/R/?group_id=1725 ) for the R statistical programming language and the Python program HeatMapWrapper [ https://doi.org/10.5281/zenodo.495163 ] for heat map generation.
Random forest models to predict aqueous solubility.
Palmer, David S; O'Boyle, Noel M; Glen, Robert C; Mitchell, John B O
2007-01-01
Random Forest regression (RF), Partial-Least-Squares (PLS) regression, Support Vector Machines (SVM), and Artificial Neural Networks (ANN) were used to develop QSPR models for the prediction of aqueous solubility, based on experimental data for 988 organic molecules. The Random Forest regression model predicted aqueous solubility more accurately than those created by PLS, SVM, and ANN and offered methods for automatic descriptor selection, an assessment of descriptor importance, and an in-parallel measure of predictive ability, all of which serve to recommend its use. The prediction of log molar solubility for an external test set of 330 molecules that are solid at 25 degrees C gave an r2 = 0.89 and RMSE = 0.69 log S units. For a standard data set selected from the literature, the model performed well with respect to other documented methods. Finally, the diversity of the training and test sets is compared to the chemical space occupied by molecules in the MDL drug data report, on the basis of molecular descriptors selected by the regression analysis.
Random forest meteorological normalisation models for Swiss PM10 trend analysis
NASA Astrophysics Data System (ADS)
Grange, Stuart K.; Carslaw, David C.; Lewis, Alastair C.; Boleti, Eirini; Hueglin, Christoph
2018-05-01
Meteorological normalisation is a technique which accounts for changes in meteorology over time in an air quality time series. Controlling for such changes helps support robust trend analysis because there is more certainty that the observed trends are due to changes in emissions or chemistry, not changes in meteorology. Predictive random forest models (RF; a decision tree machine learning technique) were grown for 31 air quality monitoring sites in Switzerland using surface meteorological, synoptic scale, boundary layer height, and time variables to explain daily PM10 concentrations. The RF models were used to calculate meteorologically normalised trends which were formally tested and evaluated using the Theil-Sen estimator. Between 1997 and 2016, significantly decreasing normalised PM10 trends ranged between -0.09 and -1.16 µg m-3 yr-1 with urban traffic sites experiencing the greatest mean decrease in PM10 concentrations at -0.77 µg m-3 yr-1. Similar magnitudes have been reported for normalised PM10 trends for earlier time periods in Switzerland which indicates PM10 concentrations are continuing to decrease at similar rates as in the past. The ability for RF models to be interpreted was leveraged using partial dependence plots to explain the observed trends and relevant physical and chemical processes influencing PM10 concentrations. Notably, two regimes were suggested by the models which cause elevated PM10 concentrations in Switzerland: one related to poor dispersion conditions and a second resulting from high rates of secondary PM generation in deep, photochemically active boundary layers. The RF meteorological normalisation process was found to be robust, user friendly and simple to implement, and readily interpretable which suggests the technique could be useful in many air quality exploratory data analysis situations.
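The normalisation idea can be sketched as follows: fit an RF to daily PM10 using meteorological and time variables, then repeatedly resample the meteorological inputs and average the predictions so that the remaining variation reflects emissions and chemistry, and finally fit a Theil-Sen slope to the normalised series. The code below is a minimal illustration on synthetic data, not the Swiss monitoring records.

```python
# Sketch of RF meteorological normalisation followed by a Theil-Sen trend estimate.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from scipy.stats import theilslopes

rng = np.random.default_rng(10)
n_days = 3000
day = np.arange(n_days)
met = rng.normal(size=(n_days, 4))                       # wind, temperature, BLH, pressure
pm10 = 25 - 0.003 * day + 4 * met[:, 0] + rng.normal(0, 2, n_days)

X = np.column_stack([met, day])
rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, pm10)

# Normalisation: average predictions over many random re-samplings of meteorology.
normalised = np.zeros(n_days)
for _ in range(50):
    met_shuffled = met[rng.permutation(n_days)]
    normalised += rf.predict(np.column_stack([met_shuffled, day]))
normalised /= 50

slope, intercept, lo, hi = theilslopes(normalised, day)
print(f"normalised trend: {slope * 365:.2f} ug m-3 per year")
```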
Wang, Qi; Xie, Zhiyi; Li, Fangbai
2015-11-01
This study aims to identify and apportion multi-source and multi-phase heavy metal pollution from natural and anthropogenic inputs using ensemble models that include stochastic gradient boosting (SGB) and random forest (RF) in agricultural soils on the local scale. The heavy metal pollution sources were quantitatively assessed, and the results illustrated the suitability of the ensemble models for the assessment of multi-source and multi-phase heavy metal pollution in agricultural soils on the local scale. The results of SGB and RF consistently demonstrated that anthropogenic sources contributed the most to the concentrations of Pb and Cd in agricultural soils in the study region and that SGB performed better than RF. Copyright © 2015 Elsevier Ltd. All rights reserved.
Thanh Noi, Phan; Kappas, Martin
2017-01-01
In previous classification studies, three non-parametric classifiers, Random Forest (RF), k-Nearest Neighbor (kNN), and Support Vector Machine (SVM), were reported as the foremost classifiers at producing high accuracies. However, only a few studies have compared the performances of these classifiers with different training sample sizes for the same remote sensing images, particularly the Sentinel-2 Multispectral Imager (MSI). In this study, we examined and compared the performances of the RF, kNN, and SVM classifiers for land use/cover classification using Sentinel-2 image data. An area of 30 × 30 km2 within the Red River Delta of Vietnam with six land use/cover types was classified using 14 different training sample sizes, including balanced and imbalanced, from 50 to over 1250 pixels/class. All classification results showed a high overall accuracy (OA) ranging from 90% to 95%. Among the three classifiers and 14 sub-datasets, SVM produced the highest OA with the least sensitivity to the training sample sizes, followed consecutively by RF and kNN. In relation to the sample size, all three classifiers showed a similar and high OA (over 93.85%) when the training sample size was large enough, i.e., greater than 750 pixels/class or representing an area of approximately 0.25% of the total study area. The high accuracy was achieved with both imbalanced and balanced datasets. PMID:29271909
Improved high-dimensional prediction with Random Forests by the use of co-data.
Te Beest, Dennis E; Mes, Steven W; Wilting, Saskia M; Brakenhoff, Ruud H; van de Wiel, Mark A
2017-12-28
Prediction in high dimensional settings is difficult due to the large number of variables relative to the sample size. We demonstrate how auxiliary 'co-data' can be used to improve the performance of a Random Forest in such a setting. Co-data are incorporated in the Random Forest by replacing the uniform sampling probabilities that are used to draw candidate variables with co-data-moderated sampling probabilities. Co-data here are defined as any type of information that is available on the variables of the primary data but does not use its response labels. These moderated sampling probabilities are, inspired by empirical Bayes, learned from the data at hand. We demonstrate the co-data moderated Random Forest (CoRF) with two examples. In the first example we aim to predict the presence of a lymph node metastasis with gene expression data. We demonstrate how a set of external p-values, a gene signature, and the correlation between gene expression and DNA copy number can improve the predictive performance. In the second example we demonstrate how the prediction of cervical (pre-)cancer with methylation data can be improved by including the location of the probe relative to the known CpG islands, the number of CpG sites targeted by a probe, and a set of p-values from a related study. The proposed method is able to utilize auxiliary co-data to improve the performance of a Random Forest.
Uncertain Photometric Redshifts with Deep Learning Methods
NASA Astrophysics Data System (ADS)
D'Isanto, A.
2017-06-01
The need for accurate photometric redshifts estimation is a topic that has fundamental importance in Astronomy, due to the necessity of efficiently obtaining redshift information without the need of spectroscopic analysis. We propose a method for determining accurate multi-modal photo-z probability density functions (PDFs) using Mixture Density Networks (MDN) and Deep Convolutional Networks (DCN). A comparison with a Random Forest (RF) is performed.
NASA Astrophysics Data System (ADS)
Hong, Haoyuan; Pourghasemi, Hamid Reza; Pourtaghi, Zohre Sadat
2016-04-01
Landslides are an important natural hazard that causes a great amount of damage around the world every year, especially during the rainy season. The Lianhua area is located in the middle of China's southern mountainous area, west of Jiangxi Province, and is known to be an area prone to landslides. The aim of this study was to evaluate and compare landslide susceptibility maps produced using the random forest (RF) data mining technique with those produced by bivariate (evidential belief function and frequency ratio) and multivariate (logistic regression) statistical models for Lianhua County, China. First, a landslide inventory map was prepared using aerial photograph interpretation, satellite images, and extensive field surveys. In total, 163 landslide events were recognized in the study area, with 114 landslides (70%) used for training and 49 landslides (30%) used for validation. Next, the landslide conditioning factors-including the slope angle, altitude, slope aspect, topographic wetness index (TWI), slope-length (LS), plan curvature, profile curvature, distance to rivers, distance to faults, distance to roads, annual precipitation, land use, normalized difference vegetation index (NDVI), and lithology-were derived from the spatial database. Finally, the landslide susceptibility maps of Lianhua County were generated in ArcGIS 10.1 based on the random forest (RF), evidential belief function (EBF), frequency ratio (FR), and logistic regression (LR) approaches and were validated using a receiver operating characteristic (ROC) curve. The ROC plot assessment results showed that for landslide susceptibility maps produced using the EBF, FR, LR, and RF models, the area under the curve (AUC) values were 0.8122, 0.8134, 0.7751, and 0.7172, respectively. Therefore, we can conclude that all four models have an AUC of more than 0.70 and can be used in landslide susceptibility mapping in the study area; meanwhile, the EBF and FR models had the best performance for Lianhua County, China. Thus, the resultant susceptibility maps will be useful for land use planning and hazard mitigation aims.
Multi-label spacecraft electrical signal classification method based on DBN and random forest
Li, Ke; Yu, Nan; Li, Pengfei; Song, Shimin; Wu, Yalei; Li, Yang; Liu, Meng
2017-01-01
Spacecraft electrical signal characteristic data contain a large amount of data with high-dimensional features, high computational complexity, and low identification rates, which causes great difficulty in fault diagnosis of spacecraft electronic load systems. This paper proposes a feature extraction method based on deep belief networks (DBN) and a classification method based on the random forest (RF) algorithm; the proposed approach mainly employs a multi-layer neural network to reduce the dimension of the original data, and classification is then applied. Firstly, wavelet denoising is used to pre-process the data. Secondly, the deep belief network is used to reduce the feature dimension and improve the classification rate for the electrical characteristics data. Finally, the random forest algorithm is used to classify the data and is compared with other algorithms. The experimental results show that, compared with other algorithms, the proposed method shows excellent performance in terms of accuracy, computational efficiency, and stability in addressing spacecraft electrical signal data. PMID:28486479
Qiu, Lefeng; Wang, Kai; Long, Wenli; Wang, Ke; Hu, Wei; Amable, Gabriel S.
2016-01-01
Soil cadmium (Cd) contamination has attracted a great deal of attention because of its detrimental effects on animals and humans. This study aimed to develop and compare the performances of stepwise linear regression (SLR), classification and regression tree (CART) and random forest (RF) models in the prediction and mapping of the spatial distribution of soil Cd and to identify likely sources of Cd accumulation in Fuyang County, eastern China. Soil Cd data from 276 topsoil (0–20 cm) samples were collected and randomly divided into calibration (222 samples) and validation datasets (54 samples). Auxiliary data, including detailed land use information, soil organic matter, soil pH, and topographic data, were incorporated into the models to simulate the soil Cd concentrations and further identify the main factors influencing soil Cd variation. The predictive models for soil Cd concentration exhibited acceptable overall accuracies (72.22% for SLR, 70.37% for CART, and 75.93% for RF). The SLR model exhibited the largest predicted deviation, with a mean error (ME) of 0.074 mg/kg, a mean absolute error (MAE) of 0.160 mg/kg, and a root mean squared error (RMSE) of 0.274 mg/kg, and the RF model produced the results closest to the observed values, with an ME of 0.002 mg/kg, an MAE of 0.132 mg/kg, and an RMSE of 0.198 mg/kg. The RF model also exhibited the greatest R2 value (0.772). The CART model predictions closely followed, with ME, MAE, RMSE, and R2 values of 0.013 mg/kg, 0.154 mg/kg, 0.230 mg/kg and 0.644, respectively. The three prediction maps generally exhibited similar and realistic spatial patterns of soil Cd contamination. The heavily Cd-affected areas were primarily located in the alluvial valley plain of the Fuchun River and its tributaries because of the dramatic industrialization and urbanization processes that have occurred there. The most important variable for explaining high levels of soil Cd accumulation was the presence of metal smelting industries. The good performance of the RF model was attributable to its ability to handle the non-linear and hierarchical relationships between soil Cd and environmental variables. These results confirm that the RF approach is promising for the prediction and spatial distribution mapping of soil Cd at the regional scale. PMID:26964095
Time Series of Images to Improve Tree Species Classification
NASA Astrophysics Data System (ADS)
Miyoshi, G. T.; Imai, N. N.; de Moraes, M. V. A.; Tommaselli, A. M. G.; Näsi, R.
2017-10-01
Tree species classification provides valuable information for forest monitoring and management. The high floristic variation of tree species is a challenging issue in tree species classification because vegetation characteristics change according to the season. To help monitor this complex environment, imaging spectroscopy has been widely applied since the development of miniaturized sensors attached to Unmanned Aerial Vehicles (UAV). Considering the seasonal changes in forests and the higher spectral and spatial resolution acquired with sensors attached to UAVs, we present the use of a time series of images to classify four tree species. The study area is an Atlantic Forest area located in the western part of São Paulo State. Images were acquired in August 2015 and August 2016, generating three data sets: one with the image spectra of 2015 only; one with the image spectra of 2016 only; and one with the layer stacking of images from 2015 and 2016. Four tree species were classified using the Spectral Angle Mapper (SAM), Spectral Information Divergence (SID) and Random Forest (RF). The results showed that SAM and SID caused an overfitting of the data, whereas RF showed better results, and the use of the layer stacking improved the classification, achieving a kappa coefficient of 18.26%.
Chebouba, Lokmane; Boughaci, Dalila; Guziolowski, Carito
2018-06-04
The use of data from high-throughput technologies in drug target problems has become widespread during the last decades. This study proposes a meta-heuristic framework using stochastic local search (SLS) combined with random forest (RF), where the aim is to identify the most important genes and proteins leading to the best classification of Acute Myeloid Leukemia (AML) patients. First, we use a stochastic local search meta-heuristic as a feature selection technique to select the most significant proteins to be used in the classification step. Then we apply RF to classify new patients into their corresponding classes. For evaluation, the RF classifier is run on the training data to obtain a model, which is then applied to the test data to find the appropriate class. We use the balanced accuracy (BAC) and the area under the receiver operating characteristic curve (AUROC) as metrics to measure the performance of our model. The proposed method is evaluated on the dataset from the DREAM 9 challenge. The comparison is done with a pure random forest (without feature selection) and with the two best-ranked results of the DREAM 9 challenge. We used three types of data: only clinical data, only proteomics data, and finally clinical and proteomics data combined. The numerical results show that the highest scores are obtained when using clinical data alone, and the lowest when using proteomics data alone. Further, our method achieves promising results compared with the methods presented in the DREAM challenge.
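A generic stochastic local search for feature selection can be sketched as follows: flip features in or out of the current subset, keep moves that improve the cross-validated balanced accuracy of a random forest, and occasionally accept a random move to escape local optima. This is a hedged illustration of the general idea on placeholder data, not the authors' exact SLS.

```python
# Hedged sketch of SLS-based feature selection wrapped around a random forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(11)
X = rng.normal(size=(200, 40))                     # proteomic/clinical features (placeholder)
y = (X[:, 0] - X[:, 3] + rng.normal(0, 0.5, 200) > 0).astype(int)

def score(mask):
    if not mask.any():
        return 0.0
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(rf, X[:, mask], y, cv=3, scoring="balanced_accuracy").mean()

mask = rng.random(X.shape[1]) < 0.3                # random initial subset
best, best_mask = score(mask), mask.copy()
for step in range(100):
    candidate = mask.copy()
    candidate[rng.integers(X.shape[1])] ^= True    # flip one feature in or out
    s = score(candidate)
    if s >= best or rng.random() < 0.1:            # greedy move or occasional random walk
        mask = candidate
        if s > best:
            best, best_mask = s, candidate.copy()
print("best balanced accuracy:", round(best, 3), "| features kept:", int(best_mask.sum()))
```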
NASA Astrophysics Data System (ADS)
Bai, Ting; Sun, Kaimin; Deng, Shiquan; Chen, Yan
2018-03-01
High-resolution image change detection is one of the key technologies of remote sensing application, and is of great significance for resource surveys, environmental monitoring, precision agriculture, military mapping and battlefield environment detection. In this paper, Random Forest (RF), Support Vector Machine (SVM), Deep Belief Network (DBN), and Adaboost models were established for high-resolution satellite imagery to verify the feasibility of different machine learning approaches for change detection. In order to compare the detection accuracy of the four machine learning methods, we applied them to two high-resolution images. The results show that SVM has higher overall accuracy with small samples compared with RF, Adaboost, and DBN for binary and from-to change detection. With an increase in the number of samples, RF has higher overall accuracy compared with Adaboost, SVM and DBN.
Dube, Timothy; Mutanga, Onisimo; Adam, Elhadi; Ismail, Riyad
2014-01-01
The quantification of aboveground biomass using remote sensing is critical for better understanding the role of forests in carbon sequestration and for informed sustainable management. Although remote sensing techniques have been proven useful in assessing forest biomass in general, more is required to investigate their capabilities in predicting intra- and inter-species biomass, which is mainly characterised by non-linear relationships. In this study, we tested two machine learning algorithms, Stochastic Gradient Boosting (SGB) and Random Forest (RF) regression trees, to predict intra- and inter-species biomass using high resolution RapidEye reflectance bands as well as the derived vegetation indices in a commercial plantation. The results showed that the SGB algorithm yielded the best performance for intra- and inter-species biomass prediction, using all the predictor variables as well as the most important selected variables. For example, using the most important variables, the algorithm produced an R2 of 0.80 and RMSE of 16.93 t·ha−1 for E. grandis; an R2 of 0.79 and RMSE of 17.27 t·ha−1 for P. taeda; and an R2 of 0.61 and RMSE of 43.39 t·ha−1 for the combined species data sets. Comparatively, RF yielded plausible results only for E. dunnii (R2 of 0.79; RMSE of 7.18 t·ha−1). We demonstrated that although the two statistical methods were able to predict biomass accurately, RF produced weaker results as compared to SGB when applied to the combined species dataset. The result underscores the relevance of stochastic models in predicting biomass drawn from different species and genera using the new generation high resolution RapidEye sensor with strategically positioned bands. PMID:25140631
NASA Astrophysics Data System (ADS)
Du, Shihong; Zhang, Fangli; Zhang, Xiuyuan
2015-07-01
While most existing studies have focused on extracting geometric information on buildings, only a few have concentrated on semantic information. The lack of semantic information cannot satisfy many demands on resolving environmental and social issues. This study presents an approach to semantically classify buildings into much finer categories than those of existing studies by learning random forest (RF) classifier from a large number of imbalanced samples with high-dimensional features. First, a two-level segmentation mechanism combining GIS and VHR image produces single image objects at a large scale and intra-object components at a small scale. Second, a semi-supervised method chooses a large number of unbiased samples by considering the spatial proximity and intra-cluster similarity of buildings. Third, two important improvements in RF classifier are made: a voting-distribution ranked rule for reducing the influences of imbalanced samples on classification accuracy and a feature importance measurement for evaluating each feature's contribution to the recognition of each category. Fourth, the semantic classification of urban buildings is practically conducted in Beijing city, and the results demonstrate that the proposed approach is effective and accurate. The seven categories used in the study are finer than those in existing work and more helpful to studying many environmental and social problems.
Baba, Hiromi; Takahara, Jun-ichi; Yamashita, Fumiyoshi; Hashida, Mitsuru
2015-11-01
The solvent effect on skin permeability is important for assessing the effectiveness and toxicological risk of new dermatological formulations in pharmaceuticals and cosmetics development. The solvent effect occurs by diverse mechanisms, which could be elucidated by efficient and reliable prediction models. However, such prediction models have been hampered by the small variety of permeants and mixture components archived in databases and by low predictive performance. Here, we propose a solution to both problems. We first compiled a novel large database of 412 samples from 261 structurally diverse permeants and 31 solvents reported in the literature. The data were carefully screened to ensure their collection under consistent experimental conditions. To construct a high-performance predictive model, we then applied support vector regression (SVR) and random forest (RF) with greedy stepwise descriptor selection to our database. The models were internally and externally validated. The SVR achieved higher performance statistics than RF. The (externally validated) determination coefficient, root mean square error, and mean absolute error of SVR were 0.899, 0.351, and 0.268, respectively. Moreover, because all descriptors are fully computational, our method can predict as-yet unsynthesized compounds. Our high-performance prediction model offers an attractive alternative to permeability experiments for pharmaceutical and cosmetic candidate screening and optimizing skin-permeable topical formulations.
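A sketch of greedy stepwise descriptor selection for support vector regression, under the assumption of synthetic descriptors and scikit-learn's SequentialFeatureSelector (0.24+) as the stepwise engine; the externally validated metrics mirror those reported above.

```python
# Illustrative sketch only: forward stepwise descriptor selection for SVR,
# then external-validation metrics (R2, RMSE, MAE). Data are synthetic.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = make_regression(n_samples=400, n_features=50, n_informative=10,
                       noise=5.0, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

svr = make_pipeline(StandardScaler(), SVR(C=10.0))
selector = SequentialFeatureSelector(svr, n_features_to_select=10,
                                     direction="forward", cv=5)
selector.fit(X_tr, y_tr)

X_tr_sel, X_te_sel = selector.transform(X_tr), selector.transform(X_te)
pred = svr.fit(X_tr_sel, y_tr).predict(X_te_sel)
rmse = mean_squared_error(y_te, pred) ** 0.5
print(f"R2={r2_score(y_te, pred):.3f}  RMSE={rmse:.3f}  "
      f"MAE={mean_absolute_error(y_te, pred):.3f}")
```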
Steyrl, David; Scherer, Reinhold; Faller, Josef; Müller-Putz, Gernot R
2016-02-01
There is general agreement in the brain-computer interface (BCI) community that, although non-linear classifiers can provide better results in some cases, linear classifiers are preferable, particularly because non-linear classifiers often involve a number of parameters that must be carefully chosen. However, new non-linear classifiers have been developed over the last decade. One of them is the random forest (RF) classifier. Although popular in other fields of science, RFs are not common in BCI research. In this work, we address three open questions regarding RFs in sensorimotor rhythm (SMR) BCIs: parametrization, online applicability, and performance compared to regularized linear discriminant analysis (LDA). We found that the performance of RF is constant over a large range of parameter values. We demonstrate - for the first time - that RFs are applicable online in SMR-BCIs. Further, we show in an offline BCI simulation that RFs statistically significantly outperform regularized LDA by about 3%. These results confirm that RFs are practical and convenient non-linear classifiers for SMR-BCIs. Taking into account further properties of RFs, such as independence from feature distributions, maximum margin behavior, multiclass and advanced data mining capabilities, we argue that RFs should be taken into consideration for future BCIs.
Desbordes, Paul; Ruan, Su; Modzelewski, Romain; Pineau, Pascal; Vauclin, Sébastien; Gouel, Pierrick; Michel, Pierre; Di Fiore, Frédéric; Vera, Pierre; Gardin, Isabelle
2017-01-01
In oncology, texture features extracted from positron emission tomography with 18-fluorodeoxyglucose images (FDG-PET) are of increasing interest for predictive and prognostic studies, leading to several tens of features per tumor. To select the best features, the use of a random forest (RF) classifier was investigated. Sixty-five patients with an esophageal cancer treated with a combined chemo-radiation therapy were retrospectively included. All patients underwent a pretreatment whole-body FDG-PET. The patients were followed for 3 years after the end of the treatment. The response assessment was performed 1 month after the end of the therapy. Patients were classified as complete responders and non-complete responders. Sixty-one features were extracted from medical records and PET images. First, Spearman's analysis was performed to eliminate correlated features. Then, the best predictive and prognostic subsets of features were selected using a RF algorithm. These results were compared to those obtained by a Mann-Whitney U test (predictive study) and a univariate Kaplan-Meier analysis (prognostic study). Among the 61 initial features, 28 were not correlated. From these 28 features, the best subset of complementary features found using the RF classifier to predict response was composed of 2 features: metabolic tumor volume (MTV) and homogeneity from the co-occurrence matrix. The corresponding predictive value (AUC = 0.836 ± 0.105, Se = 82 ± 9%, Sp = 91 ± 12%) was higher than the best predictive results found using the Mann-Whitney test: busyness from the gray level difference matrix (P < 0.0001, AUC = 0.810, Se = 66%, Sp = 88%). The best prognostic subset found using RF was composed of 3 features: MTV and 2 clinical features (WHO status and nutritional risk index) (AUC = 0.822 ± 0.059, Se = 79 ± 9%, Sp = 95 ± 6%), while no feature was significantly prognostic according to the Kaplan-Meier analysis. The RF classifier can improve predictive and prognostic values compared to the Mann-Whitney U test and the univariate Kaplan-Meier survival analysis when applied to several tens of features in a limited patient database.
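An illustrative sketch of the two-step selection described above, with synthetic features standing in for the PET and clinical variables: features strongly Spearman-correlated with an already-kept feature are dropped, and the survivors are ranked by random-forest importance.

```python
# Hedged sketch: Spearman correlation filter, then RF importance ranking of the
# surviving features. The 0.8 threshold and all data here are assumptions.
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=65, n_features=61, n_informative=6,
                           random_state=3)

rho, _ = spearmanr(X)                      # 61 x 61 rank-correlation matrix
keep = []
for j in range(X.shape[1]):
    # keep a feature only if it is not highly correlated with any kept feature
    if all(abs(rho[j, k]) < 0.8 for k in keep):
        keep.append(j)

rf = RandomForestClassifier(n_estimators=1000, random_state=3).fit(X[:, keep], y)
ranked = sorted(zip(keep, rf.feature_importances_), key=lambda t: t[1], reverse=True)
print("top features:", [idx for idx, _ in ranked[:5]])
```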
Li, Jin; Tran, Maggie; Siwabessy, Justy
2016-01-01
Spatially continuous predictions of seabed hardness are important baseline environmental information for sustainable management of Australia’s marine jurisdiction. Seabed hardness is often inferred from multibeam backscatter data with unknown accuracy and can be inferred from underwater video footage at limited locations. In this study, we classified the seabed into four classes based on two new seabed hardness classification schemes (i.e., hard90 and hard70). We developed optimal predictive models to predict seabed hardness using random forest (RF) based on the point data of hardness classes and spatially continuous multibeam data. Five feature selection (FS) methods that are variable importance (VI), averaged variable importance (AVI), knowledge informed AVI (KIAVI), Boruta and regularized RF (RRF) were tested based on predictive accuracy. Effects of highly correlated, important and unimportant predictors on the accuracy of RF predictive models were examined. Finally, spatial predictions generated using the most accurate models were visually examined and analysed. This study confirmed that: 1) hard90 and hard70 are effective seabed hardness classification schemes; 2) seabed hardness of four classes can be predicted with a high degree of accuracy; 3) the typical approach used to pre-select predictive variables by excluding highly correlated variables needs to be re-examined; 4) the identification of the important and unimportant predictors provides useful guidelines for further improving predictive models; 5) FS methods select the most accurate predictive model(s) instead of the most parsimonious ones, and AVI and Boruta are recommended for future studies; and 6) RF is an effective modelling method with high predictive accuracy for multi-level categorical data and can be applied to ‘small p and large n’ problems in environmental sciences. Additionally, automated computational programs for AVI need to be developed to increase its computational efficiency and caution should be taken when applying filter FS methods in selecting predictive models. PMID:26890307
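A hedged sketch of an averaged variable importance (AVI)-style ranking, on synthetic data: importances are averaged over repeated random forest fits with different seeds before the top predictors are retained. The study's own AVI and KIAVI procedures are defined in its methods; this only illustrates the averaging idea.

```python
# Sketch of averaging RF importances over repeated fits to stabilise the ranking.
# Synthetic data; the number of repeats and predictors kept are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=30, n_informative=5,
                           random_state=4)

n_repeats = 20
importances = np.zeros(X.shape[1])
for seed in range(n_repeats):
    rf = RandomForestClassifier(n_estimators=200, random_state=seed).fit(X, y)
    importances += rf.feature_importances_
avi = importances / n_repeats              # averaged variable importance

top = np.argsort(avi)[::-1][:10]           # keep the ten most important predictors
print("selected predictors:", sorted(top.tolist()))
```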
Akkoç, Betül; Arslan, Ahmet; Kök, Hatice
2016-06-01
Gender is one of the intrinsic properties of identity; determining it narrows the search cluster and enhances performance when an identification search is performed. Teeth have a durable and resistant structure and as such are important sources of identification in disasters (accidents, fires, etc.). In this study, gender determination is accomplished using maxillary tooth plaster models of 40 people (20 males and 20 females). Images of the tooth plaster models are taken with a lighting mechanism set-up. A gray-level co-occurrence matrix of the segmented image is formed and classified via a Random Forest (RF) algorithm by extracting pertinent features of the matrix. Automatic gender determination achieves a 90% success rate, yielding an applicable system for determining gender from maxillary tooth plaster images.
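A sketch of the GLCM-plus-random-forest pipeline outlined above, assuming random arrays in place of the segmented tooth-model photographs and scikit-image (graycomatrix/graycoprops) for the texture features.

```python
# Hedged sketch: GLCM texture statistics per image, fed to a random forest.
# Images are random stand-ins; requires scikit-image >= 0.19 for graycomatrix.
import numpy as np
from skimage.feature import graycomatrix, graycoprops
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def glcm_features(img):
    glcm = graycomatrix(img, distances=[1], angles=[0, np.pi / 2],
                        levels=256, symmetric=True, normed=True)
    props = ["contrast", "homogeneity", "energy", "correlation"]
    return np.hstack([graycoprops(glcm, p).ravel() for p in props])

# 40 fake 64x64 grey-level images, 20 per class (mimicking 20 males / 20 females)
images = rng.integers(0, 256, size=(40, 64, 64), dtype=np.uint8)
labels = np.repeat([0, 1], 20)
X = np.array([glcm_features(img) for img in images])

rf = RandomForestClassifier(n_estimators=300, random_state=0)
print("CV accuracy:", cross_val_score(rf, X, labels, cv=5).mean())
```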
NASA Astrophysics Data System (ADS)
Adelabu, Samuel; Mutanga, Onisimo; Adam, Elhadi; Cho, Moses Azong
2013-01-01
Classification of different tree species in semiarid areas can be challenging as a result of changes in leaf structure and orientation due to soil moisture constraints. Tree species mapping is, however, a key parameter for forest management in semiarid environments. In this study, we examined the suitability of 5-band RapidEye satellite data for the classification of five tree species in mopane woodland of Botswana using machine learning algorithms with limited training samples. We performed classification using random forest (RF) and support vector machines (SVM) based on EnMAP-Box. The overall accuracies for classifying the five tree species were 88.75% and 85% for SVM and RF, respectively. We also demonstrated that the new red-edge band in the RapidEye sensor has the potential for classifying tree species in semiarid environments when integrated with other standard bands. Similarly, we observed that where there are limited training samples, SVM is preferred over RF. Finally, we demonstrated that the two accuracy measures of quantity and allocation disagreement are simpler and more helpful for the vast majority of remote sensing classification processes than the kappa coefficient. Overall, high species classification accuracy can be achieved using strategically located RapidEye bands integrated with advanced processing algorithms.
Pan, Yue; Liu, Hongmei; Metsch, Lisa R; Feaster, Daniel J
2017-02-01
HIV testing is the foundation for consolidated HIV treatment and prevention. In this study, we aim to discover the most relevant variables for predicting HIV testing uptake among substance users in substance use disorder treatment programs by applying random forest (RF), a robust multivariate statistical learning method. We also provide a descriptive introduction to this method for those who are unfamiliar with it. We used data from the National Institute on Drug Abuse Clinical Trials Network HIV testing and counseling study (CTN-0032). A total of 1281 HIV-negative or status unknown participants from 12 US community-based substance use disorder treatment programs were included and were randomized into three HIV testing and counseling treatment groups. The a priori primary outcome was self-reported receipt of HIV test results. Classification accuracy of RF was compared to logistic regression, a standard statistical approach for binary outcomes. Variable importance measures for the RF model were used to select the most relevant variables. RF based models produced much higher classification accuracy than those based on logistic regression. Treatment group is the most important predictor among all covariates, with a variable importance index of 12.9%. RF variable importance revealed that several types of condomless sex behaviors, condom use self-efficacy and attitudes towards condom use, and level of depression are the most important predictors of receipt of HIV testing results. There is a non-linear negative relationship between count of condomless sex acts and the receipt of HIV testing. In conclusion, RF seems promising in discovering important factors related to HIV testing uptake among large numbers of predictors and should be encouraged in future HIV prevention and treatment research and intervention program evaluations.
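A minimal sketch comparing random-forest and logistic-regression accuracy for a binary outcome and ranking predictors by forest importance, with synthetic covariates standing in for the CTN-0032 variables.

```python
# Hedged sketch: RF vs logistic regression on a binary outcome, then inspect the
# forest's variable importance ranking. Data and dimensions are synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1281, n_features=50, n_informative=8,
                           random_state=5)

rf = RandomForestClassifier(n_estimators=500, random_state=5)
lr = LogisticRegression(max_iter=2000)
print("RF accuracy:", cross_val_score(rf, X, y, cv=5).mean())
print("LR accuracy:", cross_val_score(lr, X, y, cv=5).mean())

rf.fit(X, y)
order = np.argsort(rf.feature_importances_)[::-1]
print("top predictors:", order[:5], rf.feature_importances_[order[:5]].round(3))
```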
Predicting active-layer soil thickness using topographic variables at a small watershed scale
Li, Aidi; Tan, Xing; Wu, Wei; Liu, Hongbin; Zhu, Jie
2017-01-01
Knowledge about the spatial distribution of active-layer (AL) soil thickness is indispensable for ecological modeling, precision agriculture, and land resource management. However, it is difficult to obtain the details on AL soil thickness by using conventional soil survey method. In this research, the objective is to investigate the possibility and accuracy of mapping the spatial distribution of AL soil thickness through random forest (RF) model by using terrain variables at a small watershed scale. A total of 1113 soil samples collected from the slope fields were randomly divided into calibration (770 soil samples) and validation (343 soil samples) sets. Seven terrain variables including elevation, aspect, relative slope position, valley depth, flow path length, slope height, and topographic wetness index were derived from a digital elevation map (30 m). The RF model was compared with multiple linear regression (MLR), geographically weighted regression (GWR) and support vector machines (SVM) approaches based on the validation set. Model performance was evaluated by precision criteria of mean error (ME), mean absolute error (MAE), root mean square error (RMSE), and coefficient of determination (R2). Comparative results showed that RF outperformed MLR, GWR and SVM models. The RF gave better values of ME (0.39 cm), MAE (7.09 cm), and RMSE (10.85 cm) and higher R2 (62%). The sensitivity analysis demonstrated that the DEM had less uncertainty than the AL soil thickness. The outcome of the RF model indicated that elevation, flow path length and valley depth were the most important factors affecting the AL soil thickness variability across the watershed. These results demonstrated the RF model is a promising method for predicting spatial distribution of AL soil thickness using terrain parameters. PMID:28877196
Missing Value Imputation Approach for Mass Spectrometry-based Metabolomics Data.
Wei, Runmin; Wang, Jingye; Su, Mingming; Jia, Erik; Chen, Shaoqiu; Chen, Tianlu; Ni, Yan
2018-01-12
Missing values exist widely in mass-spectrometry (MS) based metabolomics data. Various methods have been applied for handling missing values, but the selection can significantly affect following data analyses. Typically, there are three types of missing values, missing not at random (MNAR), missing at random (MAR), and missing completely at random (MCAR). Our study comprehensively compared eight imputation methods (zero, half minimum (HM), mean, median, random forest (RF), singular value decomposition (SVD), k-nearest neighbors (kNN), and quantile regression imputation of left-censored data (QRILC)) for different types of missing values using four metabolomics datasets. Normalized root mean squared error (NRMSE) and NRMSE-based sum of ranks (SOR) were applied to evaluate imputation accuracy. Principal component analysis (PCA)/partial least squares (PLS)-Procrustes analysis were used to evaluate the overall sample distribution. Student's t-test followed by correlation analysis was conducted to evaluate the effects on univariate statistics. Our findings demonstrated that RF performed the best for MCAR/MAR and QRILC was the favored one for left-censored MNAR. Finally, we proposed a comprehensive strategy and developed a public-accessible web-tool for the application of missing value imputation in metabolomics ( https://metabolomics.cc.hawaii.edu/software/MetImp/ ).
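An illustrative sketch (not the authors' web tool): a missForest-style random-forest imputation via scikit-learn's IterativeImputer, scored with normalised RMSE on a synthetic matrix whose entries were removed completely at random (MCAR).

```python
# Hedged sketch of RF-based imputation and NRMSE scoring on synthetic data.
# IterativeImputer with an RF estimator approximates the missForest idea.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X_true = rng.lognormal(size=(100, 20))             # toy metabolite intensities
mask = rng.random(X_true.shape) < 0.15             # 15% MCAR missingness
X_missing = np.where(mask, np.nan, X_true)

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=5, random_state=0)
X_imputed = imputer.fit_transform(X_missing)

# NRMSE: RMSE over the masked entries, normalised by their standard deviation
nrmse = np.sqrt(np.mean((X_imputed[mask] - X_true[mask]) ** 2)) / np.std(X_true[mask])
print(f"NRMSE = {nrmse:.3f}")
```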
Assessing the accuracy and stability of variable selection ...
Random forest (RF) modeling has emerged as an important statistical learning method in ecology due to its exceptional predictive performance. However, for large and complex ecological datasets there is limited guidance on variable selection methods for RF modeling. Typically, either a preselected set of predictor variables are used, or stepwise procedures are employed which iteratively add/remove variables according to their importance measures. This paper investigates the application of variable selection methods to RF models for predicting probable biological stream condition. Our motivating dataset consists of the good/poor condition of n=1365 stream survey sites from the 2008/2009 National Rivers and Stream Assessment, and a large set (p=212) of landscape features from the StreamCat dataset. Two types of RF models are compared: a full variable set model with all 212 predictors, and a reduced variable set model selected using a backwards elimination approach. We assess model accuracy using RF's internal out-of-bag estimate, and a cross-validation procedure with validation folds external to the variable selection process. We also assess the stability of the spatial predictions generated by the RF models to changes in the number of predictors, and argue that model selection needs to consider both accuracy and stability. The results suggest that RF modeling is robust to the inclusion of many variables of moderate to low importance. We found no substanti
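A sketch of backwards elimination guided by the out-of-bag (OOB) estimate, on synthetic predictors rather than the StreamCat features: at each step the least important variable is dropped and the OOB accuracy of the reduced forest is recorded.

```python
# Hedged sketch of RF backwards elimination tracked with the internal OOB score.
# Synthetic predictors; stopping rule and forest size are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1365, n_features=30, n_informative=10,
                           random_state=6)
remaining = list(range(X.shape[1]))
history = []

while len(remaining) > 5:
    rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=6)
    rf.fit(X[:, remaining], y)
    history.append((len(remaining), rf.oob_score_))
    worst = int(np.argmin(rf.feature_importances_))   # least important variable
    remaining.pop(worst)

for n_vars, oob in history[:5]:
    print(f"{n_vars} variables -> OOB accuracy {oob:.3f}")
```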
Prediction of Incident Diabetes in the Jackson Heart Study Using High-Dimensional Machine Learning
Casanova, Ramon; Saldana, Santiago; Simpson, Sean L.; Lacy, Mary E.; Subauste, Angela R.; Blackshear, Chad; Wagenknecht, Lynne; Bertoni, Alain G.
2016-01-01
Statistical models to predict incident diabetes are often based on limited variables. Here we pursued two main goals: 1) investigate the relative performance of a machine learning method such as Random Forests (RF) for detecting incident diabetes in a high-dimensional setting defined by a large set of observational data, and 2) uncover potential predictors of diabetes. The Jackson Heart Study collected data at baseline and in two follow-up visits from 5,301 African Americans. We excluded those with baseline diabetes and no follow-up, leaving 3,633 individuals for analyses. Over a mean 8-year follow-up, 584 participants developed diabetes. The full RF model evaluated 93 variables including demographic, anthropometric, blood biomarker, medical history, and echocardiogram data. We also used RF metrics of variable importance to rank variables according to their contribution to diabetes prediction. We implemented other models based on logistic regression and RF where features were preselected. The RF full model performance was similar (AUC = 0.82) to those more parsimonious models. The top-ranked variables according to RF included hemoglobin A1C, fasting plasma glucose, waist circumference, adiponectin, c-reactive protein, triglycerides, leptin, left ventricular mass, high-density lipoprotein cholesterol, and aldosterone. This work shows the potential of RF for incident diabetes prediction while dealing with high-dimensional data. PMID:27727289
Recovering area-to-mass ratio of resident space objects through data mining
NASA Astrophysics Data System (ADS)
Peng, Hao; Bai, Xiaoli
2018-01-01
The area-to-mass ratio (AMR) of a resident space object (RSO) is an important parameter for improved space situation awareness capability due to its effect on the non-conservative forces including the atmosphere drag force and the solar radiation pressure force. However, information about AMR is often not provided in most space catalogs. The present paper investigates recovering the AMR information from the consistency error, which refers to the difference between the orbit predicted from an earlier estimate and the orbit estimated at the current epoch. A data mining technique, particularly the random forest (RF) method, is used to discover the relationship between the consistency error and the AMR. Using a simulation-based space catalog environment as the testbed, this paper demonstrates that the classification RF model can determine the RSO's category AMR and the regression RF model can generate continuous AMR values, both with good accuracies. Furthermore, the paper reveals that by recording additional information besides the consistency error, the RF model can estimate the AMR with even higher accuracy.
Source localization in an ocean waveguide using supervised machine learning.
Niu, Haiqiang; Reeves, Emma; Gerstoft, Peter
2017-09-01
Source localization in ocean acoustics is posed as a machine learning problem in which data-driven methods learn source ranges directly from observed acoustic data. The pressure received by a vertical linear array is preprocessed by constructing a normalized sample covariance matrix and used as the input for three machine learning methods: feed-forward neural networks (FNN), support vector machines (SVM), and random forests (RF). The range estimation problem is solved both as a classification problem and as a regression problem by these three machine learning algorithms. The results of range estimation for the Noise09 experiment are compared for FNN, SVM, RF, and conventional matched-field processing and demonstrate the potential of machine learning for underwater source localization.
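A hedged sketch of the preprocessing described above: a normalised sample covariance matrix is formed from simulated array snapshots, its real and imaginary parts are stacked into a feature vector, and a random forest classifies the (synthetic) range bin. The array geometry and signal model are made up for illustration only.

```python
# Sketch: normalised sample covariance matrix features + RF range classification.
# The toy signal model below is an assumption, not the Noise09 data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_sensors, n_snapshots = 16, 32

def covariance_features(range_bin):
    # toy snapshots whose phase structure depends on the range bin
    steering = np.exp(1j * 0.1 * (range_bin + 1) * np.arange(n_sensors))
    signal = steering[:, None] * rng.standard_normal((1, n_snapshots))
    noise = 0.5 * (rng.standard_normal((n_sensors, n_snapshots)) +
                   1j * rng.standard_normal((n_sensors, n_snapshots)))
    snapshots = signal + noise
    C = snapshots @ snapshots.conj().T
    C /= np.trace(C).real                          # normalise the covariance matrix
    return np.concatenate([C.real.ravel(), C.imag.ravel()])

X = np.array([covariance_features(r) for r in range(5) for _ in range(40)])
y = np.repeat(np.arange(5), 40)                    # five range bins, 40 samples each
rf = RandomForestClassifier(n_estimators=300, random_state=0)
print("CV accuracy:", cross_val_score(rf, X, y, cv=5).mean().round(3))
```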
Mehrang, Saeed; Pietilä, Julia; Korhonen, Ilkka
2018-02-22
Wrist-worn sensors have better compliance for activity monitoring compared to hip, waist, ankle or chest positions. However, wrist-worn activity monitoring is challenging due to the wide degree of freedom for the hand movements, as well as similarity of hand movements in different activities such as varying intensities of cycling. To strengthen the ability of wrist-worn sensors in detecting human activities more accurately, motion signals can be complemented by physiological signals such as optical heart rate (HR) based on photoplethysmography. In this paper, an activity monitoring framework using an optical HR sensor and a triaxial wrist-worn accelerometer is presented. We investigated a range of daily life activities including sitting, standing, household activities and stationary cycling with two intensities. A random forest (RF) classifier was exploited to detect these activities based on the wrist motions and optical HR. The highest overall accuracy of 89.6 ± 3.9% was achieved with a forest of a size of 64 trees and 13-s signal segments with 90% overlap. Removing the HR-derived features decreased the classification accuracy of high-intensity cycling by almost 7%, but did not affect the classification accuracies of other activities. A feature reduction utilizing the feature importance scores of RF was also carried out and resulted in a shrunken feature set of only 21 features. The overall accuracy of the classification utilizing the shrunken feature set was 89.4 ± 4.2%, which is almost equivalent to the above-mentioned peak overall accuracy.
Lu, Hua-Zheng; Sha, Li-Qing; Wang, Jun; Hu, Wen-Yan; Wu, Bing-Xia
2009-10-01
By using trenching method and infrared gas analyzer, this paper studied the seasonal variation of soil respiration (SR), including root respiration (RR) and heterotrophic respiration (HR), in tropical seasonal rain forest (RF) and rubber (Hevea brasiliensis) plantation (RP) in Xishuangbanna of Yunnan, China. The results showed that the SR and HR rates were significantly higher in RF than in RP (P < 0.01), while the RR rate had less difference between the two forests. Soil temperature and moisture were the key factors affecting the SR, RR and HR. The SR and HR rates in the two forests were rainy season > dry-hot season > foggy season, but the RR rate was rainy season > foggy season > dry-hot season in RF, and foggy season > rainy season > dry-hot season in RP. The contribution of RR to SR in RF (29%) was much lower than that in RP (42%, P < 0.01), while the contribution of HR to SR was 71% in RF and 58% in RP. When the soil temperature at 5 cm depth varied from 12 degrees C to 32 degrees C, the Q10 values for SR, HR, and RR rates were higher in RF than in RP. HR had the highest Q10 value, while RR had the lowest one.
Golkarian, Ali; Naghibi, Seyed Amir; Kalantar, Bahareh; Pradhan, Biswajeet
2018-02-17
Ever increasing demand for water resources for different purposes makes it essential to have better understanding and knowledge about water resources. As known, groundwater resources are one of the main water resources especially in countries with arid climatic condition. Thus, this study seeks to provide groundwater potential maps (GPMs) employing new algorithms. Accordingly, this study aims to validate the performance of C5.0, random forest (RF), and multivariate adaptive regression splines (MARS) algorithms for generating GPMs in the eastern part of Mashhad Plain, Iran. For this purpose, a dataset was produced consisting of spring locations as indicator and groundwater-conditioning factors (GCFs) as input. In this research, 13 GCFs were selected including altitude, slope aspect, slope angle, plan curvature, profile curvature, topographic wetness index (TWI), slope length, distance from rivers and faults, rivers and faults density, land use, and lithology. The mentioned dataset was divided into two classes of training and validation with 70 and 30% of the springs, respectively. Then, C5.0, RF, and MARS algorithms were employed using R statistical software, and the final values were transformed into GPMs. Finally, two evaluation criteria including Kappa and area under receiver operating characteristics curve (AUC-ROC) were calculated. According to the findings of this research, MARS had the best performance with AUC-ROC of 84.2%, followed by RF and C5.0 algorithms with AUC-ROC values of 79.7 and 77.3%, respectively. The results indicated that AUC-ROC values for the employed models are more than 70% which shows their acceptable performance. As a conclusion, the produced methodology could be used in other geographical areas. GPMs could be used by water resource managers and related organizations to accelerate and facilitate water resource exploitation.
Exploring Capabilities of SENTINEL-2 for Vegetation Mapping Using Random Forest
NASA Astrophysics Data System (ADS)
Saini, R.; Ghosh, S. K.
2018-04-01
Accurate vegetation mapping is essential for monitoring crops and sustainable agricultural practice. This study aims to explore the capabilities of Sentinel-2 data over Landsat-8 Operational Land Imager (OLI) data for vegetation mapping. Two combinations of the Sentinel-2 dataset have been considered: the first is a 4-band dataset at 10 m resolution consisting of the NIR, R, G and B bands, while the second is generated by stacking the four 10 m bands along with the other six bands sharpened using the Gram-Schmidt algorithm. For the Landsat-8 OLI dataset, six multispectral bands have been pan-sharpened to a spatial resolution of 15 m using the Gram-Schmidt algorithm. Random Forest (RF) and the Maximum Likelihood classifier (MLC) were selected for classification of the images. The overall accuracies achieved by RF for the 4-band and 10-band Sentinel-2 datasets and Landsat-8 OLI are 88.38%, 90.05% and 86.68%, respectively, while MLC gives overall accuracies of 85.12%, 87.14% and 83.56% for the 4-band Sentinel-2, 10-band Sentinel-2 and Landsat-8 OLI, respectively. The results show that the 10-band Sentinel-2 dataset gives the highest accuracy, a rise of 3.37% for RF and 3.58% for MLC compared to Landsat-8 OLI. All classes show improvement in accuracy, but a major rise is observed for Sugarcane, Wheat and Fodder in the Sentinel 10-band imagery. This study substantiates the fact that Sentinel-2 data can be utilized for mapping of vegetation with a good degree of accuracy compared to Landsat-8 OLI, specifically when the objective is to map a sub-class of vegetation.
NASA Astrophysics Data System (ADS)
Olory Agomma, R.; Vázquez, C.; Cresson, T.; De Guise, J.
2018-02-01
Most algorithms to detect and identify anatomical structures in medical images require either to be initialized close to the target structure, or to know that the structure is present in the image, or to be trained on a homogeneous database (e.g. all full body or all lower limbs). Detecting these structures when there is no guarantee that the structure is present in the image, or when the image database is heterogeneous (mixed configurations), is a challenge for automatic algorithms. In this work we compared two state-of-the-art machine learning techniques in order to determine which one is the most appropriate for predicting targets locations based on image patches. By knowing the position of thirteen landmarks points, labelled by an expert in EOS frontal radiography, we learn the displacement between salient points detected in the image and these thirteen landmarks. The learning step is carried out with a machine learning approach by exploring two methods: Convolutional Neural Network (CNN) and Random Forest (RF). The automatic detection of the thirteen landmarks points in a new image is then obtained by averaging the positions of each one of these thirteen landmarks estimated from all the salient points in the new image. We respectively obtain for CNN and RF, an average prediction error (both mean and standard deviation in mm) of 29 +/-18 and 30 +/- 21 for the thirteen landmarks points, indicating the approximate location of anatomical regions. On the other hand, the learning time is 9 days for CNN versus 80 minutes for RF. We provide a comparison of the results between the two machine learning approaches.
Saliency-Guided Change Detection of Remotely Sensed Images Using Random Forest
NASA Astrophysics Data System (ADS)
Feng, W.; Sui, H.; Chen, X.
2018-04-01
Studies based on object-based image analysis (OBIA) representing the paradigm shift in change detection (CD) have achieved remarkable progress in the last decade. Their aim has been developing more intelligent interpretation analysis methods in the future. The prediction effect and performance stability of random forest (RF), as a new kind of machine learning algorithm, are better than many single predictors and integrated forecasting method. In this paper, we present a novel CD approach for high-resolution remote sensing images, which incorporates visual saliency and RF. First, highly homogeneous and compact image super-pixels are generated using super-pixel segmentation, and the optimal segmentation result is obtained through image superimposition and principal component analysis (PCA). Second, saliency detection is used to guide the search of interest regions in the initial difference image obtained via the improved robust change vector analysis (RCVA) algorithm. The salient regions within the difference image that correspond to the binarized saliency map are extracted, and the regions are subject to the fuzzy c-means (FCM) clustering to obtain the pixel-level pre-classification result, which can be used as a prerequisite for superpixel-based analysis. Third, on the basis of the optimal segmentation and pixel-level pre-classification results, different super-pixel change possibilities are calculated. Furthermore, the changed and unchanged super-pixels that serve as the training samples are automatically selected. The spectral features and Gabor features of each super-pixel are extracted. Finally, superpixel-based CD is implemented by applying RF based on these samples. Experimental results on Ziyuan 3 (ZY3) multi-spectral images show that the proposed method outperforms the compared methods in the accuracy of CD, and also confirm the feasibility and effectiveness of the proposed approach.
Deciphering the Routes of invasion of Drosophila suzukii by Means of ABC Random Forest.
Fraimout, Antoine; Debat, Vincent; Fellous, Simon; Hufbauer, Ruth A; Foucaud, Julien; Pudlo, Pierre; Marin, Jean-Michel; Price, Donald K; Cattel, Julien; Chen, Xiao; Deprá, Marindia; François Duyck, Pierre; Guedot, Christelle; Kenis, Marc; Kimura, Masahito T; Loeb, Gregory; Loiseau, Anne; Martinez-Sañudo, Isabel; Pascual, Marta; Polihronakis Richmond, Maxi; Shearer, Peter; Singh, Nadia; Tamura, Koichiro; Xuéreb, Anne; Zhang, Jinping; Estoup, Arnaud
2017-04-01
Deciphering invasion routes from molecular data is crucial to understanding biological invasions, including identifying bottlenecks in population size and admixture among distinct populations. Here, we unravel the invasion routes of the invasive pest Drosophila suzukii using a multi-locus microsatellite dataset (25 loci on 23 worldwide sampling locations). To do this, we use approximate Bayesian computation (ABC), which has improved the reconstruction of invasion routes, but can be computationally expensive. We use our study to illustrate the use of a new, more efficient, ABC method, ABC random forest (ABC-RF) and compare it to a standard ABC method (ABC-LDA). We find that Japan emerges as the most probable source of the earliest recorded invasion into Hawaii. Southeast China and Hawaii together are the most probable sources of populations in western North America, which then in turn served as sources for those in eastern North America. European populations are genetically more homogeneous than North American populations, and their most probable source is northeast China, with evidence of limited gene flow from the eastern US as well. All introduced populations passed through bottlenecks, and analyses reveal five distinct admixture events. These findings can inform hypotheses concerning how this species evolved between different and independent source and invasive populations. Methodological comparisons indicate that ABC-RF and ABC-LDA show concordant results if ABC-LDA is based on a large number of simulated datasets but that ABC-RF out-performs ABC-LDA when using a comparable and more manageable number of simulated datasets, especially when analyzing complex introduction scenarios.
Recognizing pedestrian's unsafe behaviors in far-infrared imagery at night
NASA Astrophysics Data System (ADS)
Lee, Eun Ju; Ko, Byoung Chul; Nam, Jae-Yeal
2016-05-01
Pedestrian behavior recognition is important for early accident prevention in advanced driver assistance systems (ADAS). In particular, because most pedestrian-vehicle crashes occur between late night and early dawn, our study focuses on recognizing unsafe behaviors of pedestrians using thermal images captured from a moving vehicle at night. For recognizing unsafe behavior, this study uses a convolutional neural network (CNN), which shows high recognition performance. However, because a traditional CNN requires very expensive training time and memory, we design a light CNN consisting of two convolutional layers and two subsampling layers for real-time processing in vehicle applications. In addition, we combine the light CNN with a boosted random forest (Boosted RF) classifier so that the output of the CNN is not fully connected with the classifier but randomly connected with the boosted random forest. We name this CNN the randomly connected CNN (RC-CNN). The proposed method was successfully applied to the pedestrian unsafe behavior (PUB) dataset captured from a far-infrared camera at night, and its behavior recognition accuracy is confirmed to be higher than that of related CNN-based algorithms, with a shorter processing time.
NASA Astrophysics Data System (ADS)
Geelen, Christopher D.; Wijnhoven, Rob G. J.; Dubbelman, Gijs; de With, Peter H. N.
2015-03-01
This research considers gender classification in surveillance environments, typically involving low-resolution images and a large amount of viewpoint variation and occlusion. Gender classification is inherently difficult due to the large intra-class variation and inter-class correlation. We have developed a gender classification system, which is successfully evaluated on two novel datasets that realistically reflect the above conditions, typical for surveillance. The system reaches a mean accuracy of up to 90% and approaches our human baseline of 92.6%, proving a high-quality gender classification system. We also present an in-depth discussion of the fundamental differences between SVM and RF classifiers. We conclude that balancing the degree of randomization in any classifier is required for the highest classification accuracy. For our problem, an RF-SVM hybrid classifier exploiting the combination of HSV and LBP features results in the highest classification accuracy of 89.9 ± 0.2%, while classification computation time is negligible compared to the detection time of pedestrians.
Tian, Xin; Xin, Mingyuan; Luo, Jian; Liu, Mingyao; Jiang, Zhenran
2017-02-01
The selection of relevant genes for breast cancer metastasis is critical for the treatment and prognosis of cancer patients. Although much effort has been devoted to the gene selection procedures by use of different statistical analysis methods or computational techniques, the interpretation of the variables in the resulting survival models has been limited so far. This article proposes a new Random Forest (RF)-based algorithm to identify important variables highly related with breast cancer metastasis, which is based on the important scores of two variable selection algorithms, including the mean decrease Gini (MDG) criteria of Random Forest and the GeneRank algorithm with protein-protein interaction (PPI) information. The new gene selection algorithm can be called PPIRF. The improved prediction accuracy fully illustrated the reliability and high interpretability of gene list selected by the PPIRF approach.
NASA Astrophysics Data System (ADS)
Novelli, Antonio; Aguilar, Manuel A.; Nemmaoui, Abderrahim; Aguilar, Fernando J.; Tarantino, Eufemia
2016-10-01
This paper presents the first comparison between data from the Sentinel-2 (S2) Multi Spectral Instrument (MSI) and the Landsat 8 (L8) Operational Land Imager (OLI) aimed at greenhouse detection. Two closely related in time scenes, one for each sensor, were classified using Object Based Image Analysis and Random Forest (RF). The RF input consisted of several object-based features computed from the spectral bands, including mean values, spectral indices and textural features. S2 and L8 data comparisons were also extended using a common segmentation dataset extracted from VHR WorldView-2 (WV2) imagery to test differences due only to their specific spectral contribution. The best band combinations to perform segmentation were found through a modified version of the Euclidian Distance 2 index. Four different RF classification schemes were considered, achieving 89.1%, 91.3%, 90.9% and 93.4% as the best overall accuracies, respectively, evaluated over the whole study area.
The prediction of food additives in the fruit juice based on electronic nose with chemometrics.
Qiu, Shanshan; Wang, Jun
2017-09-01
Food additives are added to products to enhance their taste and preserve flavor or appearance. While their use should be restricted to achieving a technological benefit, the contents of food additives should also be strictly controlled. In this study, an E-nose was applied as an alternative to traditional monitoring technologies for determining two food additives, namely benzoic acid and chitosan. For quantitative monitoring, support vector machine (SVM), random forest (RF), extreme learning machine (ELM) and partial least squares regression (PLSR) were applied to establish regression models between E-nose signals and the amount of food additives in fruit juices. The monitoring models based on ELM and RF reached higher correlation coefficients (R2 values) and lower root mean square errors (RMSEs) than models based on PLSR and SVM. This work indicates that an E-nose combined with RF or ELM can be a cost-effective, easy-to-build and rapid detection system for food additive monitoring.
NASA Astrophysics Data System (ADS)
Huesca Martinez, M.; Garcia, M.; Roth, K. L.; Casas, A.; Ustin, S.
2015-12-01
There is a well-established need within the remote sensing community for improved estimation of canopy structure and understanding of its influence on the retrieval of leaf biochemical properties. The aim of this project was to evaluate the estimation of structural properties directly from hyperspectral data, with the broader goal that these might be used to constrain retrievals of canopy chemistry. We used NASA's Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) to discriminate different canopy structural types, defined in terms of biomass, canopy height and vegetation complexity, and compared them to estimates of these properties measured by LiDAR data. We tested a large number of optical metrics, including single narrow band reflectance and 1st derivative, sub-pixel cover fractions, narrow-band indices, spectral absorption features, and Principal Component Analysis components. Canopy structural types were identified and classified from different forest types by integrating structural traits measured by optical metrics using the Random Forest (RF) classifier. The classification accuracy was above 70% in most of the vegetation scenarios. The best overall accuracy was achieved for hardwood forest (>80% accuracy) and the lowest accuracy was found in mixed forest (~70% accuracy). Furthermore, similarly high accuracy was found when the RF classifier was applied to a spatially independent dataset, showing significant portability for the method used. Results show that all spectral regions played a role in canopy structure assessment, thus the whole spectrum is required. Furthermore, optical metrics derived from AVIRIS proved to be a powerful technique for structural attribute mapping. This research illustrates the potential for using optical properties to distinguish several canopy structural types in different forest types, and these may be used to constrain quantitative measurements of absorbing properties in future research.
NASA Astrophysics Data System (ADS)
Agjee, Na'eem Hoosen; Ismail, Riyad; Mutanga, Onisimo
2016-10-01
Water hyacinth plants (Eichhornia crassipes) are threatening freshwater ecosystems throughout Africa. The Neochetina spp. weevils are seen as an effective solution that can combat the proliferation of the invasive alien plant. We aimed to determine if multitemporal hyperspectral data could be utilized to detect the efficacy of the biocontrol agent. The random forest (RF) algorithm was used to classify variable infestation levels for 6 weeks using: (1) all the hyperspectral bands, (2) bands selected by the recursive feature elimination (RFE) algorithm, and (3) bands selected by the Boruta algorithm. Results showed that the RF model using all the bands successfully produced low-classification errors (12.50% to 32.29%) for all 6 weeks. However, the RF model using Boruta selected bands produced lower classification errors (8.33% to 15.62%) than the RF model using all the bands or bands selected by the RFE algorithm (11.25% to 21.25%) for all 6 weeks, highlighting the utility of Boruta as an all relevant band selection algorithm. All relevant bands selected by Boruta included: 352, 754, 770, 771, 775, 781, 782, 783, 786, and 789 nm. It was concluded that RF coupled with Boruta band-selection algorithm can be utilized to undertake multitemporal monitoring of variable infestation levels on water hyacinth plants.
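A sketch of all-relevant band selection with Boruta followed by a random forest, using synthetic "band" reflectances rather than the hyperspectral data above; it assumes the third-party boruta package (BorutaPy).

```python
# Hedged sketch: Boruta all-relevant selection, then RF classification on the
# retained bands. Requires the third-party `boruta` package; data are synthetic.
import numpy as np
from boruta import BorutaPy
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=100, n_informative=10,
                           random_state=7)

rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=7)
boruta = BorutaPy(rf, n_estimators="auto", random_state=7)
boruta.fit(X, y)

selected = np.where(boruta.support_)[0]            # indices of confirmed features
print("bands kept by Boruta:", selected)
acc = cross_val_score(RandomForestClassifier(n_estimators=500, random_state=7),
                      X[:, selected], y, cv=5).mean()
print("CV accuracy on selected bands:", round(acc, 3))
```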
NASA Astrophysics Data System (ADS)
Chen, Y.; Luo, M.; Xu, L.; Zhou, X.; Ren, J.; Zhou, J.
2018-04-01
The RF method based on grid-search parameter optimization achieved a classification accuracy of 88.16% in the classification of images with multiple feature variables. This classification accuracy was higher than that of SVM and ANN under the same feature variables. In terms of efficiency, the RF classification method performs better than SVM and ANN, and it is more capable of handling multidimensional feature variables. The RF method combined with an object-based analysis approach could improve the classification accuracy further. The multiresolution segmentation approach, on the basis of ESP scale parameter optimization, was used to obtain six scales for image segmentation; when the segmentation scale was 49, the classification accuracy reached its highest value of 89.58%. The classification accuracy of object-based RF classification was 1.42% higher than that of pixel-based classification (88.16%), so the classification accuracy was further improved. Therefore, the RF classification method combined with an object-based analysis approach can achieve relatively high accuracy in the classification and extraction of land use information for industrial and mining reclamation areas. Moreover, the interpretation of remotely sensed imagery using the proposed method can provide technical support and a theoretical reference for remote sensing monitoring of land reclamation.
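A minimal sketch of the grid-search parameter optimisation referenced above, on synthetic multi-feature data in place of the segmented image objects.

```python
# Hedged sketch: grid search over common RF hyperparameters with cross-validation.
# The grid values and data are assumptions for illustration only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=600, n_features=25, n_informative=8,
                           random_state=8)

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_features": ["sqrt", 0.3, 0.5],
    "min_samples_leaf": [1, 3, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=8), param_grid,
                      cv=5, n_jobs=-1)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))
```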
Nor Hashim, Ezyan; Ramli, Rosli
2013-01-01
A comparative study of understorey birds inhabiting different habitats, that is, virgin jungle reserve (VJR) and regenerated forest (RF), was conducted in Ulu Gombak Forest Reserve and Selangor and Triang Forest Reserve, Negeri Sembilan, Peninsular Malaysia. The objective of this study was to assess the diversity of understorey birds in both habitats and the effects of forest regeneration on the understorey bird community. The mist-netting method was used to capture understorey birds inhabiting both habitats in both locations. Species composition and feeding guild indicated that understorey bird populations were similar in the two habitats. However, the number of secondary forest species such as Little spiderhunter (Arachnothera longirostra) in VJR is increasing due to its proximity to RF. This study discovered that RFs in both study areas are not yet fully recovered. However, based on the range of species discovered, the RFs have conservation value and should be maintained because they harbour important forest species such as babblers and flycatchers. The assessment of the community structure of understorey birds in VJR and RF is important for forest management and conservation, especially where both habitats are intact.
Random forest wetland classification using ALOS-2 L-band, RADARSAT-2 C-band, and TerraSAR-X imagery
NASA Astrophysics Data System (ADS)
Mahdianpari, Masoud; Salehi, Bahram; Mohammadimanesh, Fariba; Motagh, Mahdi
2017-08-01
Wetlands are important ecosystems around the world, although they are degraded due both to anthropogenic and natural process. Newfoundland is among the richest Canadian province in terms of different wetland classes. Herbaceous wetlands cover extensive areas of the Avalon Peninsula, which are the habitat of a number of animal and plant species. In this study, a novel hierarchical object-based Random Forest (RF) classification approach is proposed for discriminating between different wetland classes in a sub-region located in the north eastern portion of the Avalon Peninsula. Particularly, multi-polarization and multi-frequency SAR data, including X-band TerraSAR-X single polarized (HH), L-band ALOS-2 dual polarized (HH/HV), and C-band RADARSAT-2 fully polarized images, were applied in different classification levels. First, a SAR backscatter analysis of different land cover types was performed by training data and used in Level-I classification to separate water from non-water classes. This was followed by Level-II classification, wherein the water class was further divided into shallow- and deep-water classes, and the non-water class was partitioned into herbaceous and non-herbaceous classes. In Level-III classification, the herbaceous class was further divided into bog, fen, and marsh classes, while the non-herbaceous class was subsequently partitioned into urban, upland, and swamp classes. In Level-II and -III classifications, different polarimetric decomposition approaches, including Cloude-Pottier, Freeman-Durden, Yamaguchi decompositions, and Kennaugh matrix elements were extracted to aid the RF classifier. The overall accuracy and kappa coefficient were determined in each classification level for evaluating the classification results. The importance of input features was also determined using the variable importance obtained by RF. It was found that the Kennaugh matrix elements, Yamaguchi, and Freeman-Durden decompositions were the most important parameters for wetland classification in this study. Using this new hierarchical RF classification approach, an overall accuracy of up to 94% was obtained for classifying different land cover types in the study area.
Saberioon, Mohammadmehdi; Císař, Petr; Labbé, Laurent; Souček, Pavel; Pelissier, Pablo; Kerneis, Thierry
2018-03-29
The main aim of this study was to develop a new objective method for evaluating the impacts of different diets on live fish skin using image-based features. In total, one hundred and sixty rainbow trout (Oncorhynchus mykiss) were fed either a fish-meal based diet (80 fish) or a 100% plant-based diet (80 fish) and photographed using a consumer-grade digital camera. Twenty-three colour features and four texture features were extracted. Four different classification methods were used to evaluate fish diets, including Random forest (RF), Support vector machine (SVM), Logistic regression (LR) and k-Nearest neighbours (k-NN). The SVM with a radial basis kernel provided the best classifier, with a correct classification rate (CCR) of 82% and a Kappa coefficient of 0.65. Although both the LR and RF methods were less accurate than SVM, they achieved good classification with CCRs of 75% and 70%, respectively. The k-NN was the least accurate (40%) classification model. Overall, it can be concluded that consumer-grade digital cameras can be employed as fast, accurate and non-invasive sensors for classifying rainbow trout based on their diets. Furthermore, there was a close association between image-based features and the fish diet received during cultivation. These procedures can be used as non-invasive, accurate and precise approaches for monitoring fish status during cultivation by evaluating the diet's effects on fish skin.
Automatic Estimation of Osteoporotic Fracture Cases by Using Ensemble Learning Approaches.
Kilic, Niyazi; Hosgormez, Erkan
2016-03-01
Ensemble learning methods are among the most powerful tools for pattern classification problems. In this paper, the effects of ensemble learning methods and some physical bone densitometry parameters on osteoporotic fracture detection were investigated. Six feature set models were constructed including different physical parameters, and they were fed into the ensemble classifiers as input features. As ensemble learning techniques, bagging, gradient boosting and the random subspace method (RSM) were used. Instance-based learning (IBk) and random forest (RF) classifiers were applied to the six feature set models. The patients were classified into three groups, osteoporosis, osteopenia and control (healthy), using the ensemble classifiers. Total classification accuracy and F-measure were used to evaluate the diagnostic performance of the proposed ensemble classification system. The classification accuracy reached 98.85% with the combination of model 6 (five BMD + five T-score values) using the RSM-RF classifier. The findings of this paper suggest that patients will be able to be warned before a bone fracture occurs, by simply examining some physical parameters that can easily be measured without invasive operations.
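A hedged sketch of a random-subspace (RSM) ensemble in the spirit of the setup above: bagging over random feature subsets with a k-nearest-neighbour base learner (a scikit-learn analogue of IBk), compared against a plain random forest on synthetic densitometry-like features.

```python
# Sketch only: random subspace ensemble (feature subsets, no sample bootstrap)
# with a k-NN base learner, versus a random forest. Data are synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=3, random_state=9)

# random subspace: each member sees a random half of the features
rsm_knn = BaggingClassifier(KNeighborsClassifier(n_neighbors=5),
                            n_estimators=50, max_features=0.5,
                            bootstrap=False, random_state=9)
rf = RandomForestClassifier(n_estimators=300, random_state=9)

print("RSM-kNN accuracy:", cross_val_score(rsm_knn, X, y, cv=5).mean().round(3))
print("RF accuracy:     ", cross_val_score(rf, X, y, cv=5).mean().round(3))
```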
NASA Astrophysics Data System (ADS)
Lee, Jung-Hyun; Sameen, Maher Ibrahim; Pradhan, Biswajeet; Park, Hyuck-Jin
2018-02-01
This study evaluated the generalizability of five models to select a suitable approach for landslide susceptibility modeling in data-scarce environments. In total, 418 landslide inventories and 18 landslide conditioning factors were analyzed. Multicollinearity and factor optimization were investigated before data modeling, and two experiments were then conducted. In each experiment, five susceptibility maps were produced based on support vector machine (SVM), random forest (RF), weight-of-evidence (WoE), ridge regression (Rid_R), and robust regression (RR) models. The highest accuracy (AUC = 0.85) was achieved with the SVM model when either the full or limited landslide inventories were used. Furthermore, the RF and WoE models were severely affected when fewer landslide samples were used for training. The other models were only slightly affected when the training samples were limited.
Ship Detection Based on Multiple Features in Random Forest Model for Hyperspectral Images
NASA Astrophysics Data System (ADS)
Li, N.; Ding, L.; Zhao, H.; Shi, J.; Wang, D.; Gong, X.
2018-04-01
A novel ship detection method that aims to make full use of both the spatial and spectral information in hyperspectral images is proposed. First, a band with a high signal-to-noise ratio in the near-infrared or short-wave infrared range is used to segment land and sea with the Otsu threshold segmentation method. Second, multiple features, including spectral and texture features, are extracted from the hyperspectral images: principal components analysis (PCA) is used to extract spectral features, and the Grey Level Co-occurrence Matrix (GLCM) is used to extract texture features. Finally, a Random Forest (RF) model is introduced to detect ships based on the extracted features. To illustrate the effectiveness of the method, we carried out experiments on EO-1 data, comparing a single feature against different combinations of multiple features. Compared with the traditional single-feature method and a Support Vector Machine (SVM) model, the proposed method can stably detect ships against complex backgrounds and can effectively improve the detection accuracy.
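The following sketch (placeholder data only; the Otsu sea-land masking and the GLCM texture features are omitted for brevity) illustrates the PCA-spectral-feature plus RF classification step under those assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.RandomState(0)
pixels = rng.rand(5000, 220)           # placeholder: pixels x spectral bands
labels = rng.randint(0, 2, 5000)       # placeholder ship / background labels

spectral = PCA(n_components=10).fit_transform(pixels)   # spectral features
X_tr, X_te, y_tr, y_te = train_test_split(spectral, labels, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, rf.predict(X_te)))
```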
Power of data mining methods to detect genetic associations and interactions.
Molinaro, Annette M; Carriero, Nicholas; Bjornson, Robert; Hartge, Patricia; Rothman, Nathaniel; Chatterjee, Nilanjan
2011-01-01
Genetic association studies, thus far, have focused on the analysis of individual main effects of SNP markers. Nonetheless, there is a clear need for modeling epistasis or gene-gene interactions to better understand the biologic basis of existing associations. Tree-based methods have been widely studied as tools for building prediction models based on complex variable interactions. An understanding of the power of such methods for the discovery of genetic associations in the presence of complex interactions is of great importance. Here, we systematically evaluate the power of three leading algorithms: random forests (RF), Monte Carlo logic regression (MCLR), and multifactor dimensionality reduction (MDR). We use the algorithm-specific variable importance measures (VIMs) as statistics and employ permutation-based resampling to generate the null distribution and associated p values. The power of the three methods is assessed via simulation studies. Additionally, in a data analysis, we evaluate the associations between individual SNPs in pro-inflammatory and immunoregulatory genes and the risk of non-Hodgkin lymphoma. The power of RF is highest in all simulation models, that of MCLR is similar to that of RF in half of them, and that of MDR is consistently the lowest. Our study indicates that the power of RF VIMs is most reliable. However, in addition to tuning parameters, the power of RF is notably influenced by the type of variable (continuous vs. categorical) and the chosen VIM. Copyright © 2011 S. Karger AG, Basel.
Use of real-time GNSS-RF data to characterize the swing movements of forestry equipment
Ryer M. Becker; Robert F. Keefe; Nathaniel M. Anderson
2017-01-01
The western United States faces significant forest management challenges after severe bark beetle infestations have led to substantial mortality. Minimizing costs is vital for increasing the feasibility of management operations in affected forests. Multi-transmitter Global Navigation Satellite System (GNSS)-radio frequency (RF) technology has applications in the...
NASA Astrophysics Data System (ADS)
Chen, C. R.; Chen, C. F.; Nguyen, S. T.; Lau, K.; Lay, J. G.
2016-12-01
Sugarcane, mostly grown in tropical and subtropical regions, is one of the important commercial crops worldwide, providing significant employment, foreign exchange earnings, and other social and environmental benefits. The sugar industry is a vital component of Belize's economy as it provides employment to 15% of the country's population and 60% of the national agricultural exports. Sugarcane mapping is thus an important task, driven by official initiatives to provide reliable information on sugarcane-growing areas and to improve the accuracy of monitoring sugarcane production and yield estimates. Policymakers need such monitoring information to formulate timely plans that ensure sustainable socioeconomic development. Sugarcane monitoring in Belize is traditionally carried out through time-consuming and costly field surveys. Remote sensing is an indispensable tool for crop monitoring on national, regional and global scales. The use of high- and low-resolution satellites for sugarcane monitoring in Belize is often restricted by cost limitations and mixed-pixel problems, because sugarcane fields are small and fragmented. With the launch of the Sentinel-2 satellite, it is possible to map small patches of sugarcane fields collectively over a large region, as the data are free of charge and have high spectral, spatial, and temporal resolutions. This study aims to develop an object-based classification approach to map sugarcane fields in Belize from Sentinel-2 data, comparing random forests (RF) and support vector machines (SVM). The data were processed through four main steps: (1) data pre-processing, (2) image segmentation, (3) sugarcane classification, and (4) accuracy assessment. The mapping results, compared with ground reference data, were satisfactory: the overall accuracies and Kappa coefficients were generally higher than 80% and 0.7, respectively, in both cases. The RF produced slightly more accurate mapping results than SVM. This study demonstrates the potential of Sentinel-2 data for sugarcane mapping in Belize with the aid of RF and SVM methods, and the methods are proposed for monitoring purposes in the country.
Using Random Forest to Improve the Downscaling of Global Livestock Census Data
Nicolas, Gaëlle; Robinson, Timothy P.; Wint, G. R. William; Conchedda, Giulia; Cinardi, Giuseppina; Gilbert, Marius
2016-01-01
Large scale, high-resolution global data on farm animal distributions are essential for spatially explicit assessments of the epidemiological, environmental and socio-economic impacts of the livestock sector. This has been the major motivation behind the development of the Gridded Livestock of the World (GLW) database, which has been extensively used since its first publication in 2007. The database relies on a downscaling methodology whereby census counts of animals in sub-national administrative units are redistributed at the level of grid cells as a function of a series of spatial covariates. The recent upgrade of GLW1 to GLW2 involved automating the processing, improvement of input data, and downscaling at a spatial resolution of 1 km per cell (5 km per cell in the earlier version). The underlying statistical methodology, however, remained unchanged. In this paper, we evaluate new methods to downscale census data with a higher accuracy and increased processing efficiency. Two main factors were evaluated, based on sample census datasets of cattle in Africa and chickens in Asia. First, we implemented and evaluated Random Forest models (RF) instead of stratified regressions. Second, we investigated whether models that predicted the number of animals per rural person (per capita) could provide better downscaled estimates than the previous approach that predicted absolute densities (animals per km2). RF models consistently provided better predictions than the stratified regressions for both continents and species. The benefit of per capita over absolute density models varied according to the species and continent. In addition, different technical options were evaluated to reduce the processing time while maintaining their predictive power. Future GLW runs (GLW 3.0) will apply the new RF methodology with optimized modelling options. The potential benefit of per capita models will need to be further investigated with a better distinction between rural and agricultural populations. PMID:26977807
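The core downscaling idea can be sketched as follows (an illustrative outline, not GLW code; the covariate names and counts are hypothetical): an RF regressor trained on cell-level covariates predicts a relative density for every grid cell, and each administrative unit's census count is then redistributed to its cells in proportion to those predictions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
cells = pd.DataFrame({
    "admin_id": rng.randint(0, 20, 2000),             # admin unit of each grid cell
    "covar_1": rng.rand(2000), "covar_2": rng.rand(2000),
    "observed_density": rng.gamma(2.0, 10.0, 2000),   # training target (animals/km2)
})
census = cells.groupby("admin_id")["observed_density"].sum().rename("census_count")

rf = RandomForestRegressor(n_estimators=500, random_state=0)
rf.fit(cells[["covar_1", "covar_2"]], cells["observed_density"])
cells["weight"] = rf.predict(cells[["covar_1", "covar_2"]])

# redistribute each unit's census count proportionally to the predicted weights
weight_sum = cells.groupby("admin_id")["weight"].transform("sum")
cells["downscaled"] = cells["weight"] / weight_sum * cells["admin_id"].map(census)

# sanity check: downscaled cell values sum back to the unit-level census counts
print(np.allclose(cells.groupby("admin_id")["downscaled"].sum(), census))
```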
Qiu, Shanshan; Wang, Jun; Gao, Liping
2014-07-09
An electronic nose (E-nose) and an electronic tongue (E-tongue) were used to characterize five types of strawberry juices based on processing approach (microwave pasteurization, steam blanching, high-temperature short-time pasteurization, frozen-thawed, and freshly squeezed). Juice quality parameters (vitamin C, pH, total soluble solids, total acid, and sugar/acid ratio) were measured by traditional methods. Multivariate statistical methods (linear discriminant analysis (LDA) and partial least squares regression (PLSR)) and machine learning methods (Random Forest (RF) and Support Vector Machines (SVM)) were employed for qualitative classification and quantitative regression. The E-tongue system reached higher accuracy rates than the E-nose, and their simultaneous utilization had an advantage in LDA classification and PLSR regression. According to cross-validation, RF showed outstanding and indisputable performance in both the qualitative and quantitative analyses. This work indicates that the simultaneous utilization of E-nose and E-tongue can discriminate processed fruit juices and predict quality parameters successfully for the beverage industry.
2015-01-28
trends is that at both 24 hr time points and the 1.5 wks time point, the HPA had already become desensitized, potentially involving attenuated release...points To explore metabolic alterations occurring at 24 hrs that potentially persisted out to 1.5 and 4 wks, we used random forests (RF) to classify...NR0B2, SLC27A3, SREBF1) Inhibition of cholesterol and lipid metabolism and transport; activation of Phase I metabolizing enzymes -> lipid and xenobiotic
The brain MRI classification problem from wavelets perspective
NASA Astrophysics Data System (ADS)
Bendib, Mohamed M.; Merouani, Hayet F.; Diaba, Fatma
2015-02-01
Haar and Daubechies 4 (DB4) are the most used wavelets for brain MRI (Magnetic Resonance Imaging) classification. The former is simple and fast to compute while the latter is more complex and offers a better resolution. This paper explores the potential of both of them in performing Normal versus Pathological discrimination on the one hand, and Multiclassification on the other hand. The Whole Brain Atlas is used as a validation database, and the Random Forest (RF) algorithm is employed as a learning approach. The achieved results are discussed and statistically compared.
Subtyping cognitive profiles in Autism Spectrum Disorder using a Functional Random Forest algorithm.
Feczko, E; Balba, N M; Miranda-Dominguez, O; Cordova, M; Karalunas, S L; Irwin, L; Demeter, D V; Hill, A P; Langhorst, B H; Grieser Painter, J; Van Santen, J; Fombonne, E J; Nigg, J T; Fair, D A
2018-05-15
DSM-5 Autism Spectrum Disorder (ASD) comprises a set of neurodevelopmental disorders characterized by deficits in social communication and interaction and repetitive behaviors or restricted interests, and may both affect and be affected by multiple cognitive mechanisms. This study attempts to identify and characterize cognitive subtypes within the ASD population using our Functional Random Forest (FRF) machine learning classification model. This model trained a traditional random forest model on measures from seven tasks that reflect multiple levels of information processing. 47 ASD diagnosed and 58 typically developing (TD) children between the ages of 9 and 13 participated in this study. Our RF model was 72.7% accurate, with 80.7% specificity and 63.1% sensitivity. Using the random forest model, the FRF then measures the proximity of each subject to every other subject, generating a distance matrix between participants. This matrix is then used in a community detection algorithm to identify subgroups within the ASD and TD groups, and revealed 3 ASD and 4 TD putative subgroups with unique behavioral profiles. We then examined differences in functional brain systems between diagnostic groups and putative subgroups using resting-state functional connectivity magnetic resonance imaging (rsfcMRI). Chi-square tests revealed a significantly greater number of between group differences (p < .05) within the cingulo-opercular, visual, and default systems as well as differences in inter-system connections in the somato-motor, dorsal attention, and subcortical systems. Many of these differences were primarily driven by specific subgroups suggesting that our method could potentially parse the variation in brain mechanisms affected by ASD. Copyright © 2017. Published by Elsevier Inc.
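The proximity step at the heart of this kind of approach can be sketched as follows (a minimal illustration on synthetic data, not the authors' pipeline; spectral clustering stands in here for their community detection algorithm): after fitting an RF, the proximity between two samples is taken as the fraction of trees in which they land in the same leaf, and the resulting matrix is clustered to look for putative subgroups.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import SpectralClustering

X, y = make_classification(n_samples=105, n_features=20, random_state=0)
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

leaves = rf.apply(X)                          # (n_samples, n_trees) leaf indices
# proximity = fraction of trees in which two samples share a leaf
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

# cluster the proximity (similarity) matrix into putative subgroups
subgroups = SpectralClustering(n_clusters=3, affinity="precomputed",
                               random_state=0).fit_predict(prox)
print(np.bincount(subgroups))
```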
NASA Astrophysics Data System (ADS)
Yang, Tiantian; Asanjan, Ata Akbari; Welles, Edwin; Gao, Xiaogang; Sorooshian, Soroosh; Liu, Xiaomang
2017-04-01
Reservoirs are fundamental human-built infrastructures that collect, store, and deliver fresh surface water in a timely manner for many purposes. Efficient reservoir operation requires policy makers and operators to understand how reservoir inflows are changing under different hydrological and climatic conditions to enable forecast-informed operations. Over the last decade, the use of Artificial Intelligence and Data Mining (AI & DM) techniques to assist subseasonal-to-seasonal reservoir streamflow forecasts has been increasing. In this study, Random Forest (RF), Artificial Neural Network (ANN), and Support Vector Regression (SVR) are employed and compared with respect to their capabilities for predicting 1-month-ahead reservoir inflows for two headwater reservoirs in the USA and China. Both current and lagged hydrological information and 17 known climate phenomenon indices (e.g., PDO and ENSO) are selected as predictors for simulating reservoir inflows. Results show (1) all three methods are capable of providing monthly reservoir inflows with satisfactory statistics; (2) the results obtained by Random Forest have the best statistical performance of the three methods; (3) another advantage of the Random Forest algorithm is its capability of interpreting raw model inputs; (4) climate phenomenon indices are useful in assisting monthly or seasonal forecasts of reservoir inflow; and (5) different climate conditions are autocorrelated over up to several months, and the climatic information and its lags are cross-correlated with local hydrological conditions in our case studies.
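A minimal sketch of the predictor construction and RF regression described above (the variable names and the synthetic monthly series are illustrative assumptions, not the study's data):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
df = pd.DataFrame({
    "inflow": rng.gamma(2.0, 50.0, 360),   # monthly inflow (placeholder)
    "pdo": rng.randn(360),                 # example climate phenomenon indices
    "enso": rng.randn(360),
})
for lag in (1, 2, 3):                      # lagged hydro-climatic information
    for col in ("inflow", "pdo", "enso"):
        df[f"{col}_lag{lag}"] = df[col].shift(lag)
df["target"] = df["inflow"].shift(-1)      # inflow one month ahead
df = df.dropna()

X, y = df.drop(columns=["target"]), df["target"]
split = int(len(df) * 0.8)                 # keep chronology: train on the past
rf = RandomForestRegressor(n_estimators=500, random_state=0)
rf.fit(X.iloc[:split], y.iloc[:split])
print("held-out R^2:", rf.score(X.iloc[split:], y.iloc[split:]))
```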
Grading of Chinese Cantonese Sausage Using Hyperspectral Imaging Combined with Chemometric Methods
Gong, Aiping; Zhu, Susu; He, Yong; Zhang, Chu
2017-01-01
Fast and accurate grading of Chinese Cantonese sausage is an important concern for customers, organizations, and the industry. Hyperspectral imaging in the spectral range of 874–1734 nm, combined with chemometric methods, was applied to grade Chinese Cantonese sausage. Three grades of intact and sliced Cantonese sausages were studied, including the top, first, and second grades. Support vector machine (SVM) and random forests (RF) techniques were used to build two different models. Second derivative spectra and RF were applied to select optimal wavelengths. The optimal wavelengths were the same for intact and sliced sausages when selected from second derivative spectra, while the optimal wavelengths for intact and sliced sausages selected using RF were quite similar. The SVM and RF models, using full spectra and the optimal wavelengths, obtained acceptable results for intact and sliced sausages. Both models for intact sausages performed better than those for sliced sausages, with a classification accuracy of the calibration and prediction set of over 90%. The overall results indicated that hyperspectral imaging combined with chemometric methods could be used to grade Chinese Cantonese sausages, with intact sausages being better suited for grading. This study will help to develop fast and accurate online grading of Cantonese sausages, as well as other sausages. PMID:28757578
Retrieving Temperature Anomaly in the Global Subsurface and Deeper Ocean From Satellite Observations
NASA Astrophysics Data System (ADS)
Su, Hua; Li, Wene; Yan, Xiao-Hai
2018-01-01
Retrieving the subsurface and deeper ocean (SDO) dynamic parameters from satellite observations is crucial for effectively understanding ocean interior anomalies and dynamic processes, but it is challenging to accurately estimate the subsurface thermal structure over the global scale from sea surface parameters. This study proposes a new approach based on Random Forest (RF) machine learning to retrieve subsurface temperature anomaly (STA) in the global ocean from multisource satellite observations including sea surface height anomaly (SSHA), sea surface temperature anomaly (SSTA), sea surface salinity anomaly (SSSA), and sea surface wind anomaly (SSWA), with in situ Argo data used for RF training and testing. The RF machine-learning approach can accurately retrieve the STA in the global ocean from satellite observations of sea surface parameters (SSHA, SSTA, SSSA, SSWA). The Argo STA data were used to validate the accuracy and reliability of the results from the RF model. The results indicated that SSHA, SSTA, SSSA, and SSWA together are useful parameters for detecting SDO thermal information and obtaining accurate STA estimations. The proposed method also outperformed support vector regression (SVR) in global STA estimation. It will be a useful technique for studying SDO thermal variability and its role in the global climate system from global-scale satellite observations.
Prediction of Enzyme Mutant Activity Using Computational Mutagenesis and Incremental Transduction
Basit, Nada; Wechsler, Harry
2011-01-01
Wet laboratory mutagenesis to determine enzyme activity changes is expensive and time consuming. This paper expands on standard one-shot learning by proposing an incremental transductive method (T2bRF) for the prediction of enzyme mutant activity during mutagenesis using Delaunay tessellation and 4-body statistical potentials for representation. Incremental learning is in tune with both eScience and actual experimentation, as it accounts for cumulative annotation effects of enzyme mutant activity over time. The experimental results reported, using cross-validation, show that overall the incremental transductive method proposed, using random forest as base classifier, yields better results compared to one-shot learning methods. T2bRF is shown to yield 90% on T4 and LAC (and 86% on HIV-1). This is significantly better than state-of-the-art competing methods, whose performance yield is at 80% or less using the same datasets. PMID:22007208
Chen, Jun; Toyomasu, Yoshitaka; Hayashi, Yujiro; Linden, David R; Szurszewski, Joseph H; Nelson, Heidi; Farrugia, Gianrico; Kashyap, Purna C; Chia, Nicholas; Ordog, Tamas
2016-10-03
Nutritional interventions often fail to prevent growth failure in childhood and adolescent malnutrition, and the mechanisms remain unclear. Recent studies revealed altered microbiota in malnourished children and in anorexia nervosa. To facilitate mechanistic studies under physiologically relevant conditions, we established a mouse model of growth failure following chronic dietary restriction and examined microbiota in relation to age, diet, body weight, and anabolic treatment. Four-week-old female BALB/c mice (n = 12/group) were fed ad libitum (AL) or offered limited food to abolish weight gain (LF). A subset of restricted mice was treated with an insulin-like growth factor 1 (IGF1) analog. Food access was restored in a subset of untreated LF (LF-RF) and IGF1-treated LF mice (TLF-RF) on day 97. Gut microbiota were determined on days 69, 96-99 and 120 by next generation sequencing of the V3-5 region of the 16S rRNA gene. Microbiota-host factor associations were analyzed by distance-based PERMANOVA and quantified by the coefficient of determination R2 for age, diet, and normalized body weight change (Δbwt). Microbial taxa on day 120 were compared following fitting with an overdispersed Poisson regression model. The machine learning algorithm Random Forests was used to predict age based on the microbiota. On day 120, Δbwt in AL, LF, LF-RF, and TLF-RF mice was 52 ± 3, -6 ± 1*, 40 ± 3*, and 46 ± 2% (*, P < 0.05 versus AL). Age and diet, but not Δbwt, were associated with gut microbiota composition. Age explained a larger proportion of the microbiota variability than diet or Δbwt. Random Forests predicted chronological age based on the microbiota and indicated microbiota immaturity in the LF mice before, but not after, refeeding. However, on day 120, the microbiota community structure of LF-RF mice was significantly different from that of both AL and LF mice. IGF1 mitigated the difference from the AL group. Refed groups had a higher abundance of Bacteroidetes and Proteobacteria and a lower abundance of Firmicutes than AL mice. Persistent growth failure can be induced by 97-day dietary restriction in young female mice and is associated with microbiota changes seen in lean mice and in individuals with anorexia nervosa. IGF1 facilitates recovery of body weights and microbiota.
NASA Technical Reports Server (NTRS)
Gao, Feng; Ghimire, Bardan; Jiao, Tong; Williams, Christopher A.; Masek, Jeffrey; Schaaf, Crystal
2017-01-01
Large-scale deforestation and reforestation have contributed substantially to historical and contemporary global climate change in part through albedo-induced radiative forcing, with meaningful implications for forest management aiming to mitigate climate change. Associated warming or cooling varies widely across the globe due to a range of factors including forest type, snow cover, and insolation, but the resulting geographic variation remains poorly described and has been largely based on model assessments. This study provides an observation-based approach to quantify local and global radiative forcings from large-scale deforestation and reforestation and further examines mechanisms that result in the spatial heterogeneity of radiative forcing. We incorporate a new spatially and temporally explicit land cover-specific albedo product derived from the Moderate Resolution Imaging Spectroradiometer with a historical land use data set (Land Use Harmonization product). Spatial variation in radiative forcing was attributed to four mechanisms, including the change in snow-covered albedo, change in snow-free albedo, snow cover fraction, and incoming solar radiation. We find an albedo-only radiative forcing (RF) of -0.819 W m(exp -2) if year 2000 forests were completely deforested and converted to croplands. Albedo RF from global reforestation of present-day croplands to recover year 1700 forests is estimated to be 0.161 W m(exp -2). Snow-cover fraction is identified as the primary factor in determining the spatial variation of radiative forcing in winter, while the magnitude of the change in snow-free albedo is the primary factor determining variations in summertime RF. Findings reinforce the notion that, for conifers at the snowier high latitudes, albedo RF diminishes the warming from forest loss and the cooling from forest gain more so than for other forest types, latitudes, and climate settings.
NASA Astrophysics Data System (ADS)
Li, Manchun; Ma, Lei; Blaschke, Thomas; Cheng, Liang; Tiede, Dirk
2016-07-01
Geographic Object-Based Image Analysis (GEOBIA) is becoming more prevalent in remote sensing classification, especially for high-resolution imagery. Many supervised classification approaches are applied to objects rather than pixels, and several studies have been conducted to evaluate the performance of such supervised classification techniques in GEOBIA. However, these studies did not systematically investigate all relevant factors affecting the classification (segmentation scale, training set size, feature selection and mixed objects). In this study, statistical methods and visual inspection were used to compare these factors systematically in two agricultural case studies in China. The results indicate that Random Forest (RF) and Support Vector Machines (SVM) are highly suitable for GEOBIA classifications in agricultural areas and confirm the expected general tendency, namely that the overall accuracies decline with increasing segmentation scale. All other investigated methods except for RF and SVM are more prone to obtain a lower accuracy due to broken objects at fine scales. In contrast to some previous studies, the RF classifiers yielded the best results and the k-nearest neighbor classifier the worst, in most cases. Likewise, the RF and Decision Tree classifiers are the most robust with or without feature selection. The results of the training sample analyses indicated that RF and AdaBoost.M1 possess a superior generalization capability, except when dealing with small training sample sizes. Furthermore, the classification accuracies were directly related to the homogeneity/heterogeneity of the segmented objects for all classifiers. Finally, it was suggested that RF should be considered in most cases for agricultural mapping.
Silva, José Cleydson F; Carvalho, Thales F M; Fontes, Elizabeth P B; Cerqueira, Fabio R
2017-09-30
Geminiviruses infect a broad range of cultivated and non-cultivated plants, causing significant economic losses worldwide. Studies of the diversity of species, taxonomy, mechanisms of evolution, geographic distribution, and mechanisms of interaction of these pathogens with the host have greatly increased in recent years. Furthermore, the use of rolling circle amplification (RCA) and advanced metagenomics approaches has enabled the elucidation of viromes and the identification of many viral agents in a large number of plant species. As a result, determining the nomenclature and taxonomically classifying geminiviruses have turned into complex tasks. In addition, the gene responsible for viral replication (particularly in viruses belonging to the genus Mastrevirus) may be spliced due to the use of the transcriptional/splicing machinery in the host cells. However, the current tools have limitations concerning the identification of introns. This study proposes a new method, designated Fangorn Forest (F2), based on machine learning approaches to classify genera using an ab initio approach, i.e., using only the genomic sequence, as well as to predict and classify genes in the family Geminiviridae. In this investigation, nine genera of the family Geminiviridae and their related satellite DNAs were selected. We obtained two training sets, one for genus classification, containing attributes extracted from the complete genomes of geminiviruses, while the other was built to classify geminivirus genes, containing attributes extracted from ORFs taken from the complete genomes cited above. Three ML algorithms were applied to those datasets to build the predictive models: support vector machines, using the sequential minimal optimization training approach, random forest (RF), and multilayer perceptron. RF demonstrated a very high predictive power, achieving 0.966, 0.964, and 0.995 of precision, recall, and area under the curve (AUC), respectively, for genus classification. For gene classification, RF reached 0.983, 0.983, and 0.998 of precision, recall, and AUC, respectively. Therefore, Fangorn Forest is proven to be an efficient method for classifying genera of the family Geminiviridae with high precision and effective gene prediction and classification. The method is freely accessible at www.geminivirus.org:8080/geminivirusdw/discoveryGeminivirus.jsp.
Yang, Runtao; Zhang, Chengjin; Gao, Rui; Zhang, Lina
2016-01-01
The Golgi Apparatus (GA) is a major collection and dispatch station for numerous proteins destined for secretion, plasma membranes and lysosomes. The dysfunction of GA proteins can result in neurodegenerative diseases. Therefore, accurate identification of protein subGolgi localizations may assist in drug development and understanding the mechanisms of the GA involved in various cellular processes. In this paper, a new computational method is proposed for identifying cis-Golgi proteins from trans-Golgi proteins. Based on the concept of Common Spatial Patterns (CSP), a novel feature extraction technique is developed to extract evolutionary information from protein sequences. To deal with the imbalanced benchmark dataset, the Synthetic Minority Over-sampling Technique (SMOTE) is adopted. A feature selection method called Random Forest-Recursive Feature Elimination (RF-RFE) is employed to search the optimal features from the CSP based features and g-gap dipeptide composition. Based on the optimal features, a Random Forest (RF) module is used to distinguish cis-Golgi proteins from trans-Golgi proteins. Through the jackknife cross-validation, the proposed method achieves a promising performance with a sensitivity of 0.889, a specificity of 0.880, an accuracy of 0.885, and a Matthew’s Correlation Coefficient (MCC) of 0.765, which remarkably outperforms previous methods. Moreover, when tested on a common independent dataset, our method also achieves a significantly improved performance. These results highlight the promising performance of the proposed method to identify Golgi-resident protein types. Furthermore, the CSP based feature extraction method may provide guidelines for protein function predictions. PMID:26861308
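A rough sketch of the SMOTE plus RF-RFE combination is shown below (it assumes the third-party imbalanced-learn package for SMOTE and uses synthetic placeholder features rather than the CSP/g-gap descriptors described above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from imblearn.over_sampling import SMOTE   # assumes imbalanced-learn is installed

# placeholder imbalanced dataset standing in for cis- vs trans-Golgi proteins
X, y = make_classification(n_samples=400, n_features=100, weights=[0.8, 0.2],
                           random_state=0)
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)   # balance the classes

rf = RandomForestClassifier(n_estimators=300, random_state=0)
selector = RFE(rf, n_features_to_select=20, step=5).fit(X_bal, y_bal)  # RF-RFE
optimal_idx = selector.get_support(indices=True)          # selected feature columns
final_rf = rf.fit(X_bal[:, optimal_idx], y_bal)           # train RF on optimal features
```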
NASA Astrophysics Data System (ADS)
Carlucci, Roberto; Cipriano, Giulia; Paoli, Chiara; Ricci, Pasquale; Fanizza, Carmelo; Capezzuto, Francesca; Vassallo, Paolo
2018-05-01
This study provides the first estimates of density and abundance of the striped dolphin Stenella coeruleoalba and common bottlenose dolphin Tursiops truncatus in the Gulf of Taranto (Northern Ionian Sea, Central Mediterranean Sea) and identifies the predictive variables mainly influencing their occurrence and concentration in the study area. Conventional Distance Sampling (CDS) and the Delta approach on Random Forest (DaRF) methods have been applied to sightings data collected between 2009 and 2016 during standardized vessel-based surveys, providing similar outcomes. The mean value of density over the entire study area was 0.72 ± 0.26 specimens/km2 for the striped dolphin and 0.47 ± 0.09 specimens/km2 for the common bottlenose dolphin. The abundance estimated by DaRF in the Gulf of Taranto was 10080 ± 3584 specimens of S. coeruleoalba and 6580 ± 1270 specimens of T. truncatus, respectively. Eight predictive variables were selected, considering both the local physiographic features and human activities existing in the investigated area. The explanatory variables depth, distance from the coast, distance from industrial areas and distance from areas exploited by fishery seem to play a key role in influencing the spatial distribution of both species, whereas the geomorphological variables proved to be the most significant factors shaping the concentration of both dolphins. The establishment of a Specially Protected Area of Mediterranean Importance (SPAMI) according the SPA/BD Protocol in the Gulf of Taranto is indicated as an effective management tool for the conservation of both dolphin populations in the Central-eastern Mediterranean Sea.
NASA Astrophysics Data System (ADS)
Shabani, Farzin; Kumar, Lalit; Solhjouy-fard, Samaneh
2017-08-01
The aim of this study was to comparatively investigate and evaluate the capabilities of correlative and mechanistic modeling processes, applied to the projection of future distributions of date palm in novel environments, and to establish a method of minimizing uncertainty in the projections of differing techniques. The study covers Middle Eastern countries at a global scale. We compared the mechanistic model CLIMEX (CL) with the correlative models MaxEnt (MX), Boosted Regression Trees (BRT), and Random Forests (RF) to project current and future distributions of date palm (Phoenix dactylifera L.). The Global Climate Model (GCM), the CSIRO-Mk3.0 (CS) using the A2 emissions scenario, was selected for making projections. Both indigenous and alien distribution data of the species were utilized in the modeling process. The common areas predicted by MX, BRT, RF, and CL from the CS GCM were extracted and compared to ascertain projection uncertainty levels of each individual technique. The common areas identified by all four modeling techniques were used to produce a map indicating suitable and unsuitable areas for date palm cultivation for Middle Eastern countries, for the present and the year 2100. The four different modeling approaches predict fairly different distributions. Projections from CL were more conservative than from MX. The BRT and RF were the most conservative methods in terms of projections for the current time. The combination of the final CL and MX projections for the present and 2100 provides higher certainty concerning those areas that will become highly suitable for future date palm cultivation. According to the four models, cold, hot, and wet stress, with differences on a regional basis, appear to be the major restrictions on future date palm distribution. The results demonstrate variation in the projections resulting from the different techniques. The assessment and interpretation of model projections requires caution, especially in correlative models such as MX, BRT, and RF. Intersections between different techniques may decrease uncertainty in future distribution projections. However, readers should not miss the fact that the uncertainties are mostly because the future GHG emission scenarios are unknowable with sufficient precision. Suggestions towards methodology and processing for improving projections are included.
Spatial Downscaling of Alien Species Presences using Machine Learning
NASA Astrophysics Data System (ADS)
Daliakopoulos, Ioannis N.; Katsanevakis, Stelios; Moustakas, Aristides
2017-07-01
Large scale, high-resolution data on alien species distributions are essential for spatially explicit assessments of their environmental and socio-economic impacts, and for management interventions for mitigation. However, these data are often unavailable. This paper presents a method that relies on Random Forest (RF) models to distribute alien species presence counts on a finer resolution grid, thus achieving spatial downscaling. A sufficiently large number of RF models are trained using random subsets of the dataset as predictors, in a bootstrapping approach to account for the uncertainty introduced by the subset selection. The method is tested with an approximately 8×8 km2 grid containing floral alien species presences and several climatic, habitat, and land use covariates for the Mediterranean island of Crete, Greece. Alien species presence is aggregated at 16×16 km2 and used as a predictor of presence at the original resolution, thus simulating spatial downscaling. Potential explanatory variables included habitat types, land cover richness, endemic species richness, soil type, temperature, precipitation, and freshwater availability. Uncertainty assessment of the spatial downscaling of alien species occurrences was also performed and true/false presences and absences were quantified. The approach is promising for downscaling alien species datasets of larger spatial scale but coarse resolution, where the underlying environmental information is available at a finer resolution than the alien species data. Furthermore, the RF architecture allows for tuning towards operationally optimal sensitivity and specificity, thus providing a decision support tool for designing a resource-efficient alien species census.
NASA Astrophysics Data System (ADS)
Daliakopoulos, Ioannis; Tsanis, Ioannis
2017-04-01
Mitigating the vulnerability of Mediterranean rangelands against degradation is limited by our ability to understand and accurately characterize those impacts in space and time. The Normalized Difference Vegetation Index (NDVI) is a radiometric measure of the photosynthetically active radiation absorbed by green vegetation canopy chlorophyll and is therefore a good surrogate measure of vegetation dynamics. On the other hand, meteorological indices such as the drought-assessing Standardised Precipitation Index (SPI) can be easily estimated from historical and projected datasets at the global scale. This work investigates the potential of driving Random Forest (RF) models with meteorological indices to approximate NDVI-based vegetation dynamics. A sufficiently large number of RF models are trained using random subsets of the dataset as predictors, in a bootstrapping approach to account for the uncertainty introduced by the subset selection. The updated E-OBS-v13.1 dataset of the ENSEMBLES EU FP6 program provides observed monthly meteorological input to estimate SPI over the Mediterranean rangelands. RF models are trained to depict vegetation dynamics using the latest version (3g.v1) of the third generation GIMMS NDVI generated from NOAA's Advanced Very High Resolution Radiometer (AVHRR) sensors. Analysis is conducted for the period 1981-2015 at a gridded spatial resolution of 25 km. Preliminary results demonstrate the potential of machine learning algorithms to effectively mimic the underlying physical relationship of drought and Earth Observation vegetation indices to provide estimates based on precipitation variability.
Automated Identification of Abnormal Adult EEGs
López, S.; Suarez, G.; Jungreis, D.; Obeid, I.; Picone, J.
2016-01-01
The interpretation of electroencephalograms (EEGs) is a process that is still dependent on the subjective analysis of the examiners. Though interrater agreement on critical events such as seizures is high, it is much lower on subtler events (e.g., when there are benign variants). The process used by an expert to interpret an EEG is quite subjective and hard to replicate by machine. The performance of machine learning technology is far from human performance. We have been developing an interpretation system, AutoEEG, with a goal of exceeding human performance on this task. In this work, we are focusing on one of the early decisions made in this process – whether an EEG is normal or abnormal. We explore two baseline classification algorithms: k-Nearest Neighbor (kNN) and Random Forest Ensemble Learning (RF). A subset of the TUH EEG Corpus was used to evaluate performance. Principal Components Analysis (PCA) was used to reduce the dimensionality of the data. kNN achieved a 41.8% detection error rate while RF achieved an error rate of 31.7%. These error rates are significantly lower than those obtained by random guessing based on priors (49.5%). The majority of the errors were related to misclassification of normal EEGs. PMID:27195311
Bright, Benjamin C.; Hudak, Andrew T.; Meddens, Arjan J.H.; Hawbaker, Todd J.; Briggs, Jenny S.; Kennedy, Robert E.
2017-01-01
Wildfire behavior depends on the type, quantity, and condition of fuels, and the effect that bark beetle outbreaks have on fuels is a topic of current research and debate. Remote sensing can provide estimates of fuels across landscapes, although few studies have estimated surface fuels from remote sensing data. Here we predicted and mapped field-measured canopy and surface fuels from light detection and ranging (lidar) and Landsat time series explanatory variables via random forest (RF) modeling across a coniferous montane forest in Colorado, USA, which was affected by mountain pine beetles (Dendroctonus ponderosae Hopkins) approximately six years prior. We examined relationships between mapped fuels and the severity of tree mortality with correlation tests. RF models explained 59%, 48%, 35%, and 70% of the variation in available canopy fuel, canopy bulk density, canopy base height, and canopy height, respectively (percent root-mean-square error (%RMSE) = 12–54%). Surface fuels were predicted less accurately, with models explaining 24%, 28%, 32%, and 30% of the variation in litter and duff, 1 to 100-h, 1000-h, and total surface fuels, respectively (%RMSE = 37–98%). Fuel metrics were negatively correlated with the severity of tree mortality, except canopy base height, which increased with greater tree mortality. Our results showed how bark beetle-caused tree mortality significantly reduced canopy fuels in our study area. We demonstrated that lidar and Landsat time series data contain substantial information about canopy and surface fuels and can be used for large-scale efforts to monitor and map fuel loads for fire behavior modeling at a landscape scale.
Assessment of various supervised learning algorithms using different performance metrics
NASA Astrophysics Data System (ADS)
Susheel Kumar, S. M.; Laxkar, Deepak; Adhikari, Sourav; Vijayarajan, V.
2017-11-01
Our work presents a comparison of the performance of supervised machine learning algorithms on a binary classification task. The supervised machine learning algorithms considered in the following work are Support Vector Machine (SVM), Decision Tree (DT), K Nearest Neighbour (KNN), Naïve Bayes (NB) and Random Forest (RF). This paper mostly focuses on comparing the performance of the above-mentioned algorithms on one binary classification task by analysing metrics such as accuracy, F-measure, G-measure, precision, misclassification rate, false positive rate, true positive rate, specificity, and prevalence.
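For reference, the metrics listed above can all be derived from a binary confusion matrix; the sketch below computes them for a toy prediction vector (the G-measure is taken here as the geometric mean of precision and recall, one common definition):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # toy ground truth
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])   # toy predictions
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)
recall_tpr  = tp / (tp + fn)                   # true positive rate / sensitivity
specificity = tn / (tn + fp)
fpr         = fp / (fp + tn)                   # false positive rate
f_measure   = 2 * precision * recall_tpr / (precision + recall_tpr)
g_measure   = np.sqrt(precision * recall_tpr)  # geometric mean of precision and recall
misclass    = 1 - accuracy                     # misclassification rate
prevalence  = (tp + fn) / (tp + tn + fp + fn)
```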
Can single classifiers be as useful as model ensembles to produce benthic seabed substratum maps?
NASA Astrophysics Data System (ADS)
Turner, Joseph A.; Babcock, Russell C.; Hovey, Renae; Kendrick, Gary A.
2018-05-01
Numerous machine-learning classifiers are available for benthic habitat map production, which can lead to different results. This study highlights the performance of the Random Forest (RF) classifier, which was significantly better than Classification Trees (CT), Naïve Bayes (NB), and a multi-model ensemble in terms of overall accuracy, Balanced Error Rate (BER), Kappa, and area under the curve (AUC) values. RF accuracy was often higher than 90% for each substratum class, even at the most detailed level of the substratum classification and AUC values also indicated excellent performance (0.8-1). Total agreement between classifiers was high at the broadest level of classification (75-80%) when differentiating between hard and soft substratum. However, this sharply declined as the number of substratum categories increased (19-45%) including a mix of rock, gravel, pebbles, and sand. The model ensemble, produced from the results of all three classifiers by majority voting, did not show any increase in predictive performance when compared to the single RF classifier. This study shows how a single classifier may be sufficient to produce benthic seabed maps and model ensembles of multiple classifiers.
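A minimal sketch of such a majority-vote ensemble versus a standalone RF is given below (scikit-learn, with synthetic stand-in substratum data rather than the study's predictors):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

# placeholder for acoustic/bathymetric predictors and substratum classes
X, y = make_classification(n_samples=600, n_features=15, n_classes=3,
                           n_informative=6, random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0)
ensemble = VotingClassifier(
    estimators=[("ct", DecisionTreeClassifier(random_state=0)),
                ("nb", GaussianNB()),
                ("rf", rf)],
    voting="hard")                              # majority voting across classifiers

for name, model in [("RF alone", rf), ("Majority-vote ensemble", ensemble)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```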
NASA Astrophysics Data System (ADS)
Hu, Xiaohua; Lang, Wenhui; Liu, Wei; Xu, Xue; Yang, Jianbo; Zheng, Lei
2017-08-01
The terahertz (THz) spectroscopy technique has been researched and developed for rapid and non-destructive detection of food safety and quality due to its low-energy and non-ionizing characteristics. The objective of this study was to develop a flexible identification model to discriminate transgenic from non-transgenic rice seeds based on THz spectroscopy. To extract THz spectral features and reduce the feature dimension, sparse representation (SR) is employed in this work. A sufficient sparsity level is selected to train the sparse coding of the THz data, and the random forest (RF) method is then applied to obtain a discrimination model. The results show that there are differences between transgenic and non-transgenic rice seeds in the THz spectral band and that, compared with the least squares support vector machine (LS-SVM) method, SR-RF is a better model for discrimination (accuracy of 95% in the prediction set and 100% in the calibration set). The conclusion is that SR may be useful in THz spectroscopy applications to reduce dimensionality, and SR-RF provides a new, effective, and flexible method for detecting and identifying transgenic and non-transgenic rice seeds with a THz spectral system.
Jeyasingh, Suganthi; Veluchamy, Malathi
2017-05-01
Early diagnosis of breast cancer is essential to save the lives of patients. Usually, medical datasets include a large variety of data that can lead to confusion during diagnosis. The Knowledge Discovery in Databases (KDD) process helps to improve efficiency. It requires elimination of inappropriate and repeated data from the dataset before final diagnosis. This can be done using any of the feature selection algorithms available in data mining. Feature selection is considered a vital step to increase classification accuracy. This paper proposes a Modified Bat Algorithm (MBA) for feature selection to eliminate irrelevant features from an original dataset. The Bat algorithm was modified using simple random sampling to select random instances from the dataset. Features were ranked against the global best features to identify the predominant features in the dataset. The selected features are used to train a Random Forest (RF) classification algorithm. The MBA feature selection algorithm enhanced the classification accuracy of RF in identifying the occurrence of breast cancer. The Wisconsin Diagnosis Breast Cancer Dataset (WDBC) was used for the performance analysis of the proposed MBA feature selection algorithm. The proposed algorithm achieved better performance in terms of Kappa statistic, Matthews Correlation Coefficient, Precision, F-measure, Recall, Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Relative Absolute Error (RAE) and Root Relative Squared Error (RRSE).
Men, Hong; Fu, Songlin; Yang, Jialin; Cheng, Meiqi; Shi, Yan
2018-01-01
Paraffin odor intensity is an important quality indicator when a paraffin inspection is performed. Currently, paraffin odor level assessment is mainly dependent on an artificial sensory evaluation. In this paper, we developed a paraffin odor analysis system to classify and grade four kinds of paraffin samples. The original feature set was optimized using Principal Component Analysis (PCA) and Partial Least Squares (PLS). Support Vector Machine (SVM), Random Forest (RF), and Extreme Learning Machine (ELM) were applied to three different feature data sets for classification and level assessment of paraffin. For classification, the model based on SVM, with an accuracy rate of 100%, was superior to that based on RF, with an accuracy rate of 98.33–100%, and ELM, with an accuracy rate of 98.01–100%. For level assessment, the R2 related to the training set was above 0.97 and the R2 related to the test set was above 0.87. Through comprehensive comparison, the generalization of the model based on ELM was superior to those based on SVM and RF. The scoring errors for the three models were 0.0016–0.3494, lower than the error of 0.5–1.0 measured by industry standard experts, meaning these methods have a higher prediction accuracy for scoring paraffin level. PMID:29346328
Decision Variants for the Automatic Determination of Optimal Feature Subset in RF-RFE.
Chen, Qi; Meng, Zhaopeng; Liu, Xinyi; Jin, Qianguo; Su, Ran
2018-06-15
Feature selection, which identifies a set of the most informative features from the original feature space, has been widely used to simplify the predictor. Recursive feature elimination (RFE), one of the most popular feature selection approaches, is effective for data dimension reduction and efficiency increase. RFE produces a ranking of features, as well as candidate subsets with their corresponding accuracies. The subset with the highest accuracy (HA) or with a preset number of features (PreNum) is often used as the final subset. However, this may lead to a large number of features being selected, or, if there is no prior knowledge about this preset number, the final subset selection is often ambiguous and subjective. A proper decision variant is in high demand to automatically determine the optimal subset. In this study, we conduct pioneering work to explore the decision variant after obtaining a list of candidate subsets from RFE. We provide a detailed analysis and comparison of several decision variants to automatically select the optimal feature subset. A random forest (RF)-recursive feature elimination (RF-RFE) algorithm and a voting strategy are introduced. We validated the variants on two totally different molecular biology datasets, one for a toxicogenomic study and the other for protein sequence analysis. The study provides an automated way to determine the optimal feature subset when using RF-RFE.
NASA Astrophysics Data System (ADS)
Nur Johana, J.; Muzzneena, A. M.; Grismer, L. L.; Norhayati, A.
2016-11-01
Anurans on Langkawi Island, Peninsular Malaysia exhibit variation in their habits and forms, ranging from small (SVL < 25 mm) to large (SVL > 150 mm), and occupy a range of habitats, such as riverine forests, agricultural fields, peat swamps, and lowland and upland dipterocarp forests. These variations provide a platform to explore species diversity, distribution, abundance, microhabitat, and other ecological parameters to understand distribution patterns and to facilitate conservation and management of sensitive or important species and areas. The objective of this study was to evaluate the diversity and distribution of anuran species in different types of habitat on Langkawi Island. Specimens were collected by active sampling using the Visual Encounter Survey (VES) method. We surveyed anuran species inhabiting seven types of habitat, namely agriculture (AG), coastal (CL), forest (FT), pond (PD), mangrove (MG), riparian forest (RF) and river (RV). A total of 775 individuals were sampled from all localities, representing 23 species from 12 genera and including all six families of frogs in Malaysia. FT and RF showed high values of the Shannon Index, H', 2.60 and 2.38, respectively, followed by the other types of habitat: CL (1.82), RV (1.71), MG (1.56), PD (1.54), and AG (1.53). AG had the highest abundance (156 individuals) compared to other habitat types. Based on cluster analysis using the Jaccard coefficient (UPGMA), two groups could be clearly distinguished: a forested-species group (FT and RF) and a group of species associated with human activity (AG, CL, PD, MG and RV). The forest species group is more diverse than the non-forest group. Nevertheless, non-forest species were found in abundance, highlighting the relevance of these disturbed habitats in supporting amphibians.
Yuan, Xiao Chun; Lin, Wei Sheng; Pu, Xiao Ting; Yang, Zhi Rong; Zheng, Wei; Chen, Yue Min; Yang, Yu Sheng
2016-06-01
Using the negative pressure sampling method, the concentrations and spectral characteristics of dissolved organic matter (DOM) of soil solution were studied at 0-15, 15-30, 30-60 cm layers in Castanopsis carlesii forest (BF), human-assisted naturally regenerated C. carlesii forest (RF), C. carlesii plantation (CP) in evergreen broad-leaved forests in Sanming City, Fujian Province. The results showed that the overall trend of dissolved organic carbon (DOC) concentrations in soil solution was RF>CP>BF, and the concentration of dissolved organic nitrogen (DON) was highest in C. carlesii plantation. The concentrations of DOC and DON in surface soil (0-15 cm) were all significantly higher than in the subsurface (30-60 cm). The aromatic index (AI) was in the order of RF>CP>BF, and as a whole, the highest AI was observed in the surface soil. Higher fluorescence intensity and a short wave absorption peak (320 nm) were observed in C. carlesii plantation, suggesting the surface soil of C. carlesii plantation was rich in decomposed substance content, while the degree of humification was lower. A medium wave absorption peak (380 nm) was observed in human-assisted naturally regenerated C. carlesii forest, indicating the degree of humification was higher which would contribute to the storage of soil fertility. In addition, DOM characteristics in 30-60 cm soil solution were almost unaffected by forest regeneration patterns.
NASA Astrophysics Data System (ADS)
Rouet-Leduc, B.; Hulbert, C.; Riviere, J.; Lubbers, N.; Barros, K.; Marone, C.; Johnson, P. A.
2016-12-01
Forecasting failure is a primary goal in diverse domains that include earthquake physics, materials science, nondestructive evaluation of materials and other engineering applications. Due to the highly complex physics of material failure and limitations on gathering data in the failure nucleation zone, this goal has often appeared out of reach; however, recent advances in instrumentation sensitivity, instrument density and data analysis show promise toward forecasting failure times. Here, we show that we can predict frictional failure times of both slow and fast stick slip failure events in the laboratory. This advance is made possible by applying a machine learning approach known as Random Forests1 (RF) to the continuous acoustic emission (AE) time series recorded by detectors located on the fault blocks. The RF is trained using a large number of statistical features derived from the AE time series signal. The model is then applied to data not previously analyzed. Remarkably, we find that the RF method predicts upcoming failure time far in advance of a stick slip event, based only on a short time window of data. Further, the algorithm accurately predicts the time of the beginning and end of the next slip event. The predicted time improves as failure is approached, as other data features add to prediction. Our results show robust predictions of slow and dynamic failure based on acoustic emissions from the fault zone throughout the laboratory seismic cycle. The predictions are based on previously unidentified tremor-like acoustic signals that occur during stress build up and the onset of macroscopic frictional weakening. We suggest that the tremor-like signals carry information about fault zone processes and allow precise predictions of failure at any time in the slow slip or stick slip cycle2. If the laboratory experiments represent Earth frictional conditions, it could well be that signals are being missed that contain highly useful predictive information. 1Breiman, L. Random forests. Machine Learning 45, 5-32 (2001). 2Rouet-Leduc, B. C. Hulbert, N. Lubbers, K. Barros and P. A. Johnson, Learning the physics of failure, in review (2016).
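The windowed-feature idea can be sketched as follows (an entirely synthetic signal and failure-time target, simply to illustrate mapping moving-window statistics of a continuous record to the remaining time before failure with an RF regressor; it is not the authors' pipeline):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
signal = rng.randn(100_000)                          # placeholder continuous AE record
time_to_failure = np.linspace(10.0, 0.0, 100_000)    # placeholder remaining time target

win = 1000
features, targets = [], []
for start in range(0, len(signal) - win, win):
    w = signal[start:start + win]
    features.append([w.mean(), w.std(), np.percentile(w, 95),
                     ((w - w.mean()) ** 4).mean() / w.var() ** 2])  # kurtosis-like stat
    targets.append(time_to_failure[start + win - 1])

X, y = np.array(features), np.array(targets)
split = int(0.8 * len(X))                            # train on earlier windows only
rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X[:split], y[:split])
print("held-out R^2:", rf.score(X[split:], y[split:]))
```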
Burlina, Philippe; Billings, Seth; Joshi, Neil; Albayda, Jemima
2017-01-01
To evaluate the use of ultrasound coupled with machine learning (ML) and deep learning (DL) techniques for automated or semi-automated classification of myositis. Eighty subjects, comprising 19 with inclusion body myositis (IBM), 14 with polymyositis (PM), 14 with dermatomyositis (DM), and 33 normal (N) subjects, were included in this study, in which 3214 muscle ultrasound images of 7 muscles (observed bilaterally) were acquired. We considered three classification problems: (A) normal vs. affected (DM, PM, IBM); (B) normal vs. IBM patients; and (C) IBM vs. other types of myositis (DM or PM). We studied the use of an automated DL method using deep convolutional neural networks (DL-DCNNs) for diagnostic classification and compared it with a semi-automated conventional ML method based on random forests (ML-RF) and "engineered" features. We used the known clinical diagnosis as the gold standard for evaluating performance of muscle classification. The performance of the DL-DCNN method resulted in accuracies ± standard deviation of 76.2% ± 3.1% for problem (A), 86.6% ± 2.4% for (B) and 74.8% ± 3.9% for (C), while the ML-RF method led to accuracies of 72.3% ± 3.3% for problem (A), 84.3% ± 2.3% for (B) and 68.9% ± 2.5% for (C). This study demonstrates the application of machine learning methods for automatically or semi-automatically classifying inflammatory muscle disease using muscle ultrasound. Compared to the conventional random forest machine learning method used here, which has the drawback of requiring manual delineation of muscle/fat boundaries, DCNN-based classification by and large improved the accuracies in all classification problems while providing a fully automated approach to classification.
NASA Astrophysics Data System (ADS)
Liu, Meiling; Liu, Xiangnan; Li, Jin; Ding, Chao; Jiang, Jiale
2014-12-01
Satellites routinely provide frequent, large-scale, near-surface views of many oceanographic variables pertinent to plankton ecology. However, the nutrient fertility of water can be challenging to detect accurately using remote sensing technology. This research explored an approach to estimate the nutrient fertility in coastal waters through the fusion of synthetic aperture radar (SAR) images and optical images using the random forest (RF) algorithm. The estimation of total inorganic nitrogen (TIN) in the Hong Kong Sea, China, was used as a case study. In March of 2009 and May and August of 2010, a sequence of multi-temporal in situ data, CCD images from China's HJ-1 satellite and RADARSAT-2 images were acquired. Four sensitive parameters were selected as input variables to evaluate TIN: single-band reflectance, a normalized difference spectral index (NDSI), and the HV and VH polarizations. The RF algorithm was used to merge the different input variables from the SAR and optical imagery to generate a new dataset (i.e., the TIN outputs). The results showed the temporal-spatial distribution of TIN: TIN values decreased from coastal waters to the open water areas, values in the northeast were higher than those found in the southwest of the study area, and the maximum TIN values occurred in May. Additionally, the accuracy of TIN estimation was significantly improved when the SAR and optical data were used in combination rather than a single data type alone. This study suggests that this method of estimating nutrient fertility in coastal waters by effectively fusing data from multiple sensors is very promising.
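A hedged sketch of the fusion step: the four sensitive variables named above (single-band reflectance, NDSI, and the HV and VH polarizations) are stacked per in situ matchup and fed to a random-forest regressor. The synthetic data and hyperparameters are placeholders, not the study's configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 150                                                  # hypothetical number of in situ matchups
refl, ndsi = rng.random(n), rng.random(n)                # optical inputs
hv, vh = rng.normal(size=n), rng.normal(size=n)          # SAR polarizations
tin = 0.6 * ndsi + 0.25 * hv + rng.normal(0, 0.1, n)     # synthetic stand-in for measured TIN

X = np.column_stack([refl, ndsi, hv, vh])
X_tr, X_te, y_tr, y_te = train_test_split(X, tin, test_size=0.3, random_state=0)
rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)
print("held-out R^2:", round(rf.score(X_te, y_te), 3))
```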
Prediction of hot spots in protein interfaces using a random forest model with hybrid features.
Wang, Lin; Liu, Zhi-Ping; Zhang, Xiang-Sun; Chen, Luonan
2012-03-01
Prediction of hot spots in protein interfaces provides crucial information for research on protein-protein interaction and drug design. Existing machine learning methods generally judge whether a given residue is likely to be a hot spot by extracting features only from the target residue. However, hot spots usually form a small cluster of residues which are tightly packed together at the center of the protein interface. With this in mind, we present a novel method to extract hybrid features which incorporate a wide range of information on the target residue and its spatially neighboring residues, i.e. the nearest contact residue in the other face (mirror-contact residue) and the nearest contact residue in the same face (intra-contact residue). We provide a novel random forest (RF) model to effectively integrate these hybrid features for predicting hot spots in protein interfaces. Our method achieves an accuracy (ACC) of 82.4% and a Matthews correlation coefficient (MCC) of 0.482 on the Alanine Scanning Energetics Database, and an ACC of 77.6% and an MCC of 0.429 on the Binding Interface Database. In a comparison study, the performance of our RF model exceeds that of other existing methods, such as Robetta, FOLDEF, KFC, KFC2, MINERVA and HotPoint. Of our hybrid features, three physicochemical features of target residues (mass, polarizability and isoelectric point), the relative side-chain accessible surface area and the average depth index of mirror-contact residues are found to be the main discriminative features in hot spot prediction. We also confirm that hot spots tend to form large contact surface areas between two interacting proteins. Source data and code are available at: http://www.aporc.org/doc/wiki/HotSpot.
Detecting paroxysmal coughing from pertussis cases using voice recognition technology.
Parker, Danny; Picone, Joseph; Harati, Amir; Lu, Shuang; Jenkyns, Marion H; Polgreen, Philip M
2013-01-01
Pertussis is highly contagious; thus, prompt identification of cases is essential to control outbreaks. Clinicians experienced with the disease can easily identify classic cases, where patients have bursts of rapid coughing followed by gasps and a characteristic whooping sound. However, many clinicians have never seen a case, and thus may miss initial cases during an outbreak. The purpose of this project was to use voice-recognition software to distinguish pertussis coughs from croup and other coughs. We collected a series of recordings representing pertussis, croup and miscellaneous coughing by children. We manually categorized coughs as either pertussis or non-pertussis, and extracted features for each category. We used Mel-frequency cepstral coefficients (MFCC), a sampling rate of 16 kHz, a frame duration of 25 ms, and a frame rate of 10 ms. The coughs were filtered. Each cough was divided into 3 sections of proportion 3-4-3. The average of the 13 MFCCs for each section was computed and made into a 39-element feature vector used for the classification. We used the following machine learning algorithms: Neural Networks, K-Nearest Neighbor (KNN), and a 200-tree Random Forest (RF). Data were reserved for cross-validation of the KNN and RF. The Neural Network was trained 100 times, and the averaged results are presented. After categorization, we had 16 examples of non-pertussis coughs and 31 examples of pertussis coughs. Over 90% of all pertussis coughs were properly classified as pertussis. The error rates were: Type I errors of 7%, 12%, and 25% and Type II errors of 8%, 0%, and 0%, using the Neural Network, Random Forest, and KNN, respectively. Our results suggest that we can build a robust classifier to assist clinicians and the public to help identify pertussis cases in children presenting with typical symptoms.
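The 39-element feature vector described above can be sketched as follows, assuming each cough is available as a 16 kHz waveform; `librosa` is used here for MFCC extraction, and the 3-4-3 split points are approximated at 30% and 70% of the frames, so this is an illustration rather than the authors' exact pipeline.

```python
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def cough_features(y, sr=16000):
    """39-element vector: mean of 13 MFCCs over three sections of the cough in 3-4-3 proportion."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    n = mfcc.shape[1]
    cuts = [0, int(0.3 * n), int(0.7 * n), n]
    return np.concatenate([mfcc[:, a:b].mean(axis=1) for a, b in zip(cuts[:-1], cuts[1:])])

y = np.random.default_rng(0).normal(size=16000)   # one second of synthetic audio as a stand-in cough
print(cough_features(y).shape)                    # -> (39,)
# X = np.vstack([cough_features(w) for w in cough_waveforms])    # hypothetical list of cough clips
# clf = RandomForestClassifier(n_estimators=200).fit(X, labels)  # 200-tree RF, as in the abstract
```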
Ballester, Pedro J; Mitchell, John B O
2010-05-01
Accurately predicting the binding affinities of large sets of diverse protein-ligand complexes is an extremely challenging task. The scoring functions that attempt such computational prediction are essential for analysing the outputs of molecular docking, which in turn is an important technique for drug discovery, chemical biology and structural biology. Each scoring function assumes a predetermined, theory-inspired functional form for the relationship between the variables that characterize the complex and its predicted binding affinity; this functional form also includes parameters fitted to experimental or simulation data. The inherent problem of this rigid approach is that it leads to poor predictivity for those complexes that do not conform to the modelling assumptions. Moreover, resampling strategies, such as cross-validation or bootstrapping, are still not systematically used to guard against the overfitting of calibration data in parameter estimation for scoring functions. We propose a novel scoring function (RF-Score) that circumvents the need for problematic modelling assumptions via non-parametric machine learning. In particular, Random Forest was used to implicitly capture binding effects that are hard to model explicitly. RF-Score is compared with the state of the art on the demanding PDBbind benchmark. Results show that RF-Score is a very competitive scoring function. Importantly, RF-Score's performance was shown to improve dramatically with training set size, and hence the future availability of more high-quality structural and interaction data is expected to lead to improved versions of RF-Score. pedro.ballester@ebi.ac.uk; jbom@st-andrews.ac.uk Supplementary data are available at Bioinformatics online.
Mapping the spatial distribution of Aedes aegypti and Aedes albopictus.
Ding, Fangyu; Fu, Jingying; Jiang, Dong; Hao, Mengmeng; Lin, Gang
2018-02-01
Mosquito-borne infectious diseases, such as Rift Valley fever, Dengue, Chikungunya and Zika, have caused large numbers of human deaths, with their transnational expansion fueled by economic globalization. Simulating the distribution of the disease vectors is of great importance in formulating public health planning and disease control strategies. In the present study, we simulated the global distribution of Aedes aegypti and Aedes albopictus at a 5 × 5 km spatial resolution with high-dimensional multidisciplinary datasets and machine learning methods. Three relatively popular and robust machine learning models, including support vector machine (SVM), gradient boosting machine (GBM) and random forest (RF), were used. During the fine-tuning process based on training datasets of A. aegypti and A. albopictus, RF models achieved the highest performance with an area under the curve (AUC) of 0.973 and 0.974, respectively, followed by GBM (AUC of 0.971 and 0.972, respectively) and SVM (AUC of 0.963 and 0.964, respectively) models. The simulation difference between RF and GBM models was not statistically significant (p>0.05) based on the validation datasets, whereas statistically significant differences (p<0.05) were observed for RF and GBM simulations compared with SVM simulations. From the simulated maps derived from RF models, we observed that the distribution of A. albopictus was wider than that of A. aegypti along a latitudinal gradient. The discriminatory power of each factor in simulating the global distribution of the two species was also analyzed. Our results provide fundamental information for further study on disease transmission simulation and risk assessment. Copyright © 2017 Elsevier B.V. All rights reserved.
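The three-model comparison reported above can be reproduced in outline as in this sketch; the synthetic occurrence/background matrix stands in for the multidisciplinary covariates, and the hyperparameters are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for presence/background samples with environmental covariates
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {"RF": RandomForestClassifier(n_estimators=500, random_state=0),
          "GBM": GradientBoostingClassifier(random_state=0),
          "SVM": SVC(probability=True, random_state=0)}
for name, m in models.items():
    m.fit(X_tr, y_tr)
    print(name, "AUC =", round(roc_auc_score(y_te, m.predict_proba(X_te)[:, 1]), 3))
```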
Comparison of Random Forest and Support Vector Machine classifiers using UAV remote sensing imagery
NASA Astrophysics Data System (ADS)
Piragnolo, Marco; Masiero, Andrea; Pirotti, Francesco
2017-04-01
In recent years, surveying with unmanned aerial vehicles (UAVs) has received a great deal of attention due to decreasing costs, higher precision and flexibility of use. UAVs have been applied for geomorphological investigations, forestry, precision agriculture, cultural heritage assessment and for archaeological purposes. They can also be used for land use and land cover (LULC) classification. In the literature, there are two main approaches for classification of remote sensing imagery: pixel-based and object-based. On one hand, the pixel-based approach mostly uses training areas to define classes and their respective spectral signatures. On the other hand, object-based classification considers pixels, scale, spatial information and texture information to create homogeneous objects. Machine learning methods have been applied successfully for classification, and their use is increasing due to the availability of faster computing capabilities; these methods learn a model from previously computed training data. Two machine learning methods that have given good results in previous investigations are Random Forest (RF) and Support Vector Machine (SVM). The goal of this work is to compare the RF and SVM methods for classifying LULC using images collected with a fixed-wing UAV. The classification processing chain uses packages in R, an open source scripting language for data analysis, which provides all necessary algorithms. The imagery was acquired and processed in November 2015 with cameras providing reflectance information over the red, blue, green and near-infrared wavelengths over a testing area in the campus of Agripolis, Italy. Images were elaborated and ortho-rectified with Agisoft Photoscan. The ortho-rectified image is the full data set, and the test set is derived from partial sub-setting of the full data set. Different tests have been carried out, using from 2% to 20% of the total. Ten training sets and ten validation sets are obtained from each test set. The control dataset consists of an independent visual classification done by an expert over the whole area. The classes are (i) broadleaf, (ii) building, (iii) grass, (iv) headland access path, (v) road, (vi) sowed land, (vii) vegetable. RF and SVM are applied to the test set. The performances of the methods are evaluated using three accuracy metrics: Kappa index, classification accuracy and classification error. All three are calculated in three different ways: with K-fold cross-validation, using the validation test set and using the full test set. The analysis indicates that SVM obtains better scores when evaluated with K-fold cross-validation or the validation test set. Using the full test set, RF achieves a better result than SVM. It also seems that SVM performs better with smaller training sets, whereas RF performs better as training sets get larger.
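The study's processing chain is implemented in R; the sketch below mirrors the RF versus SVM comparison in Python/scikit-learn purely for illustration, with a synthetic stand-in for the per-pixel band values and the seven LULC classes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Synthetic stand-in: four spectral bands per pixel, seven LULC classes
X, y = make_classification(n_samples=3000, n_features=4, n_informative=4, n_redundant=0,
                           n_classes=7, n_clusters_per_class=1, random_state=0)

for name, clf in [("RF", RandomForestClassifier(n_estimators=300, random_state=0)),
                  ("SVM", SVC(kernel="rbf", random_state=0))]:
    pred = cross_val_predict(clf, X, y, cv=10)                 # K-fold cross-validation (K=10)
    print(name, "accuracy:", round(accuracy_score(y, pred), 3),
          "kappa:", round(cohen_kappa_score(y, pred), 3))
```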
Accurate Diabetes Risk Stratification Using Machine Learning: Role of Missing Value and Outliers.
Maniruzzaman, Md; Rahman, Md Jahanur; Al-MehediHasan, Md; Suri, Harman S; Abedin, Md Menhazul; El-Baz, Ayman; Suri, Jasjit S
2018-04-10
Diabetes mellitus is a group of metabolic diseases in which blood sugar levels are too high. About 8.8% of the world was diabetic in 2017, and this is projected to reach nearly 10% by 2045. A major challenge is that when machine learning-based classifiers are applied to such data sets for risk stratification, performance is reduced. Thus, our objective is to develop an optimized and robust machine learning (ML) system under the assumption that replacing missing values or outliers by a median configuration will yield higher risk stratification accuracy. This ML-based risk stratification is designed, optimized and evaluated, where: (i) the features are extracted and optimized from six feature selection techniques (random forest, logistic regression, mutual information, principal component analysis, analysis of variance, and Fisher discriminant ratio) and combined with ten different types of classifiers (linear discriminant analysis, quadratic discriminant analysis, naïve Bayes, Gaussian process classification, support vector machine, artificial neural network, Adaboost, logistic regression, decision tree, and random forest) under the hypothesis that both missing values and outliers, when replaced by computed medians, will improve the risk stratification accuracy. The Pima Indian diabetic dataset (768 patients: 268 diabetic and 500 controls) was used. Our results demonstrate that replacing the missing values and outliers by group median and median values, respectively, and further using the combination of random forest feature selection and random forest classification yields an accuracy, sensitivity, specificity, positive predictive value, negative predictive value and area under the curve of 92.26%, 95.96%, 79.72%, 91.14%, 91.20%, and 0.93, respectively. This is an improvement of 10% over previously developed techniques published in the literature. The system was validated for its stability and reliability. The RF-based model showed the best performance when outliers are replaced by median values.
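A simplified sketch of the median-replacement and RF-based selection idea, using a random stand-in for the Pima-sized table. The paper replaces values by group medians and evaluates many selector/classifier pairs; this illustration uses overall column medians and a single RF for both selection and classification.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(768, 8)), columns=[f"f{i}" for i in range(8)])  # Pima-sized stand-in
df.iloc[::20, 3] = np.nan                                        # inject some missing values
y = rng.integers(0, 2, size=len(df))

for c in df.columns:                                             # replace NaNs and >3-SD outliers with the median
    med, sd, mean = df[c].median(), df[c].std(), df[c].mean()
    df.loc[df[c].isna() | ((df[c] - mean).abs() > 3 * sd), c] = med

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(df, y)
top = df.columns[np.argsort(rf.feature_importances_)[::-1][:5]]  # RF used as the feature selector
print("selected:", list(top), "CV accuracy:",
      round(cross_val_score(RandomForestClassifier(random_state=0), df[top], y, cv=5).mean(), 3))
```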
Early esophageal cancer detection using RF classifiers
NASA Astrophysics Data System (ADS)
Janse, Markus H. A.; van der Sommen, Fons; Zinger, Svitlana; Schoon, Erik J.; de With, Peter H. N.
2016-03-01
Esophageal cancer is one of the fastest rising forms of cancer in the Western world. Using High-Definition (HD) endoscopy, gastroenterology experts can identify esophageal cancer at an early stage. Recent research shows that early cancer can be found using a state-of-the-art computer-aided detection (CADe) system based on analyzing static HD endoscopic images. Our research aims at extending this system by applying Random Forest (RF) classification, which introduces a confidence measure for detected cancer regions. To visualize this data, we propose a novel automated annotation system, employing the unique characteristics of the previous confidence measure. This approach allows reliable modeling of multi-expert knowledge and provides essential data for real-time video processing, to enable future use of the system in a clinical setting. The performance of the CADe system is evaluated on a 39-patient dataset, containing 100 images annotated by 5 expert gastroenterologists. The proposed system reaches a precision of 75% and recall of 90%, thereby improving the state-of-the-art results by 11 and 6 percentage points, respectively.
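One common way to obtain the kind of per-region confidence measure the abstract describes is the fraction of trees voting for the positive class, exposed by scikit-learn's `predict_proba`; the features and threshold below are synthetic placeholders, not the CADe system's actual descriptors.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for per-region texture/colour descriptors with cancer / non-cancer labels
X, y = make_classification(n_samples=1000, n_features=30, weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=400, random_state=0).fit(X_tr, y_tr)
conf = rf.predict_proba(X_te)[:, 1]     # fraction of trees voting "cancer" acts as a confidence score
flagged = conf > 0.5                    # threshold can be tuned to trade precision against recall
print("flagged regions:", flagged.sum(), "mean confidence of flags:", round(conf[flagged].mean(), 3))
```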
Canizo, Brenda V; Escudero, Leticia B; Pérez, María B; Pellerano, Roberto G; Wuilloud, Rodolfo G
2018-03-01
The feasibility of the application of chemometric techniques associated with multi-element analysis for the classification of grape seeds according to their provenance vineyard soil was investigated. Grape seed samples from different localities of Mendoza province (Argentina) were evaluated. Inductively coupled plasma mass spectrometry (ICP-MS) was used for the determination of twenty-nine elements (Ag, As, Ce, Co, Cs, Cu, Eu, Fe, Ga, Gd, La, Lu, Mn, Mo, Nb, Nd, Ni, Pr, Rb, Sm, Te, Ti, Tl, Tm, U, V, Y, Zn and Zr). Once the analytical data were collected, supervised pattern recognition techniques such as linear discriminant analysis (LDA), partial least square discriminant analysis (PLS-DA), k-nearest neighbors (k-NN), support vector machine (SVM) and Random Forest (RF) were applied to construct classification/discrimination rules. The results indicated that nonlinear methods, RF and SVM, perform best with up to 98% and 93% accuracy rate, respectively, and therefore are excellent tools for classification of grapes. Copyright © 2017 Elsevier Ltd. All rights reserved.
Interpretation of fingerprint image quality features extracted by self-organizing maps
NASA Astrophysics Data System (ADS)
Danov, Ivan; Olsen, Martin A.; Busch, Christoph
2014-05-01
Accurate prediction of fingerprint quality is of significant importance to any fingerprint-based biometric system. Ensuring high-quality samples for both probe and reference can substantially improve the system's performance by lowering false non-matches, thus allowing finer adjustment of the decision threshold of the biometric system. Furthermore, the increasing usage of biometrics in mobile contexts demands the development of lightweight methods for operational environments. A novel two-tier, computationally efficient approach was recently proposed based on modelling block-wise fingerprint image data using a Self-Organizing Map (SOM) to extract specific ridge pattern features, which are then used as input to a Random Forests (RF) classifier trained to predict the quality score of a propagated sample. This paper conducts an investigative comparative analysis on a publicly available dataset to improve the two-tier approach by additionally proposing three feature interpretation methods, based respectively on SOM, Generative Topographic Mapping and RF. The analysis shows that two of the proposed methods produce promising results on the given dataset.
Application of Machine Learning Approaches for Protein-protein Interactions Prediction.
Zhang, Mengying; Su, Qiang; Lu, Yi; Zhao, Manman; Niu, Bing
2017-01-01
Proteomics endeavors to study the structures, functions and interactions of proteins. Information on protein-protein interactions (PPIs) helps to improve our knowledge of the functions and the 3D structures of proteins; thus, determining PPIs is essential for the study of proteomics. In this review, in order to examine the application of machine learning to PPI prediction, several machine learning approaches such as support vector machine (SVM), artificial neural networks (ANNs) and random forest (RF) were selected, and examples of their applications to PPIs were listed. SVM and RF are two commonly used methods; nowadays, more researchers predict PPIs by combining more than two methods. This review presents the application of machine learning approaches to predicting PPIs. Many examples of success in identification and prediction in the area of PPI prediction have been discussed, and PPI research is still in progress. Copyright© Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.
Comparing Pixel- and Object-Based Approaches in Effectively Classifying Wetland-Dominated Landscapes
Berhane, Tedros M.; Lane, Charles R.; Wu, Qiusheng; Anenkhonov, Oleg A.; Chepinoga, Victor V.; Autrey, Bradley C.; Liu, Hongxing
2018-01-01
Wetland ecosystems straddle both terrestrial and aquatic habitats, performing many ecological functions directly and indirectly benefitting humans. However, global wetland losses are substantial. Satellite remote sensing and classification informs wise wetland management and monitoring. Both pixel- and object-based classification approaches using parametric and non-parametric algorithms may be effectively used in describing wetland structure and habitat, but which approach should one select? We conducted both pixel- and object-based image analyses (OBIA) using parametric (Iterative Self-Organizing Data Analysis Technique, ISODATA, and maximum likelihood, ML) and non-parametric (random forest, RF) approaches in the Barguzin Valley, a large wetland (~500 km2) in the Lake Baikal, Russia, drainage basin. Four Quickbird multispectral bands plus various spatial and spectral metrics (e.g., texture, Non-Differentiated Vegetation Index, slope, aspect, etc.) were analyzed using field-based regions of interest sampled to characterize an initial 18 ISODATA-based classes. Parsimoniously using a three-layer stack (Quickbird band 3, water ratio index (WRI), and mean texture) in the analyses resulted in the highest accuracy, 87.9% with pixel-based RF, followed by OBIA RF (segmentation scale 5, 84.6% overall accuracy), followed by pixel-based ML (83.9% overall accuracy). Increasing the predictors from three to five by adding Quickbird bands 2 and 4 decreased the pixel-based overall accuracy while increasing the OBIA RF accuracy to 90.4%. However, McNemar’s chi-square test confirmed no statistically significant difference in overall accuracy among the classifiers (pixel-based ML, RF, or object-based RF) for either the three- or five-layer analyses. Although potentially useful in some circumstances, the OBIA approach requires substantial resources and user input (such as segmentation scale selection—which was found to substantially affect overall accuracy). Hence, we conclude that pixel-based RF approaches are likely satisfactory for classifying wetland-dominated landscapes. PMID:29707381
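McNemar's chi-square test, used above to compare classifiers on the same validation pixels, can be run as in the sketch below; the per-pixel predictions are simulated so the numbers are placeholders, and `statsmodels` provides the test itself.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Simulated per-pixel predictions from two classifiers on the same validation pixels
rng = np.random.default_rng(0)
truth = rng.integers(0, 5, 500)
pred_rf = np.where(rng.random(500) < 0.88, truth, (truth + 1) % 5)   # ~88 % correct
pred_ml = np.where(rng.random(500) < 0.84, truth, (truth + 1) % 5)   # ~84 % correct

ok_rf, ok_ml = pred_rf == truth, pred_ml == truth
table = [[np.sum(ok_rf & ok_ml),  np.sum(ok_rf & ~ok_ml)],          # 2x2 table of agreement/disagreement
         [np.sum(~ok_rf & ok_ml), np.sum(~ok_rf & ~ok_ml)]]
print(mcnemar(table, exact=False, correction=True))                  # chi-square form of McNemar's test
```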
NASA Astrophysics Data System (ADS)
Mohammadi, Jahangir; Shataee, Shaban; Namiranian, Manochehr; Næsset, Erik
2017-09-01
Inventories of mixed broad-leaved forests of Iran mainly rely on terrestrial measurements. Due to rapid changes and disturbances and great complexity of the silvicultural systems of these multilayer forests, frequent repetition of conventional ground-based plot surveys is often cost prohibitive. Airborne laser scanning (ALS) and multispectral data offer an alternative or supplement to conventional inventories in the Hyrcanian forests of Iran. In this study, the capability of a combination of ALS and UltraCam-D data to model stand volume, tree density, and basal area using random forest (RF) algorithm was evaluated. Systematic sampling was applied to collect field plot data on a 150 m × 200 m sampling grid within a 1100 ha study area located at 36°38′-36°42′N and 54°24′-54°25′E. A total of 308 circular plots (0.1 ha) were measured for calculation of stand volume, tree density, and basal area per hectare. For each plot, a set of variables was extracted from both ALS and multispectral data. The RF algorithm was used for modeling of the biophysical properties using ALS and UltraCam-D data separately and combined. The results showed that combining the ALS data and UltraCam-D images provided a slight increase in prediction accuracy compared to separate modeling. The RMSE as percentage of the mean, the mean difference between observed and predicted values, and standard deviation of the differences using a combination of ALS data and UltraCam-D images in an independent validation at 0.1-ha plot level were 31.7%, 1.1%, and 84 m³ ha⁻¹ for stand volume; 27.2%, 0.86%, and 6.5 m² ha⁻¹ for basal area; and 35.8%, -4.6%, and 77.9 n ha⁻¹ for tree density, respectively. Based on the results, we conclude that fusion of ALS and UltraCam-D data may be useful for modeling of stand volume, basal area, and tree density and thus gain insights into structural characteristics in the complex Hyrcanian forests.
VizieR Online Data Catalog: Gamma-ray AGN type determination (Hassan+, 2013)
NASA Astrophysics Data System (ADS)
Hassan, T.; Mirabal, N.; Contreras, J. L.; Oya, I.
2013-11-01
In this paper, we employ Support Vector Machines (SVMs) and Random Forest (RF) that embody two of the most robust supervised learning algorithms available today. We are interested in building classifiers that can distinguish between two AGN classes: BL Lacs and FSRQs. In the 2FGL, there is a total set of 1074 identified/associated AGN objects with the following labels: 'bzb' (BL Lacs), 'bzq' (FSRQs), 'agn' (other non-blazar AGN) and 'agu' (active galaxies of uncertain type). From this global set, we group the identified/associated blazars ('bzb' and 'bzq' labels) as the training/testing set of our algorithms. (2 data files).
Feature genes predicting the FLT3/ITD mutation in acute myeloid leukemia.
Li, Chenglong; Zhu, Biao; Chen, Jiao; Huang, Xiaobing
2016-07-01
In the present study, gene expression profiles of acute myeloid leukemia (AML) samples were analyzed to identify feature genes with the capacity to predict the mutation status of FLT3/ITD. Two machine learning models, namely the support vector machine (SVM) and random forest (RF) methods, were used for classification. Four datasets were downloaded from the European Bioinformatics Institute, two of which (containing 371 samples, including 281 FLT3/ITD mutation-negative and 90 mutation-positive samples) were randomly defined as the training group, while the other two datasets (containing 488 samples, including 350 FLT3/ITD mutation-negative and 138 mutation-positive samples) were defined as the test group. Differentially expressed genes (DEGs) were identified by significance analysis of the microarray data by using the training samples. The classification efficiency of the SVM and RF methods was evaluated using the following parameters: sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV) and the area under the receiver operating characteristic curve. Functional enrichment analysis was performed for the feature genes with DAVID. A total of 585 DEGs were identified in the training group, of which 580 were upregulated and five were downregulated. The classification accuracy rates of the two methods for the training group, the test group and the combined group using the 585 feature genes were >90%. For the SVM and RF methods, the rates of correct determination, specificity and PPV were >90%, while the sensitivity and NPV were >80%. The SVM method produced a slightly better classification effect than the RF method. A total of 13 biological pathways were overrepresented by the feature genes, mainly involving energy metabolism, chromatin organization and translation. The feature genes identified in the present study may be used to predict the mutation status of FLT3/ITD in patients with AML.
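The evaluation metrics listed above can be computed from a confusion matrix as sketched here; the held-out predictions are hypothetical stand-ins for the SVM/RF outputs on the test group.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def diagnostic_metrics(y_true, y_pred, score):
    """Sensitivity, specificity, PPV, NPV and AUC for a binary mutation-status prediction."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {"sensitivity": tp / (tp + fn), "specificity": tn / (tn + fp),
            "PPV": tp / (tp + fp), "NPV": tn / (tn + fn),
            "AUC": roc_auc_score(y_true, score)}

# Hypothetical held-out predictions (1 = FLT3/ITD mutation-positive)
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
score = np.array([0.1, 0.4, 0.8, 0.7, 0.2, 0.9, 0.3, 0.6])
print(diagnostic_metrics(y_true, (score > 0.5).astype(int), score))
```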
NASA Astrophysics Data System (ADS)
Othman, Arsalan; Gloaguen, Richard
2015-04-01
Topographic effects and complex vegetation cover hinder lithology classification in mountain regions based not only on field data but also on reflectance remote sensing data. The area of interest, "Bardi-Zard", is located in NE Iraq. It is part of the Zagros orogenic belt, where seven lithological units crop out, and it is known for its chromite deposits. The aim of this study is to compare three machine learning algorithms (MLAs): Maximum Likelihood (ML), Support Vector Machines (SVM), and Random Forest (RF) in the context of a supervised lithology classification task using Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) satellite data, their derivatives, spatial information (spatial coordinates) and geomorphic data. We emphasize the enhancement in remote sensing lithological mapping accuracy that arises from the integration of geomorphic features and spatial information (spatial coordinates) in the classifications. This study finds that RF is better than the ML and SVM algorithms in almost all of the sixteen dataset combinations that were tested. The overall accuracy of the best dataset combination with the RF map for all seven classes reaches ~80%, the producer's and user's accuracies are ~73.91% and 76.09% respectively, and the kappa coefficient is ~0.76. TPI is more effective with the SVM algorithm than with the RF algorithm. This paper demonstrates that adding geomorphic indices such as TPI and spatial information to the dataset increases the lithological classification accuracy.
Guo, Doudou; Juan, Jiaxiang; Chang, Liying; Zhang, Jingjin; Huang, Danfeng
2017-08-15
Plant-based sensing of water stress can provide a sensitive and direct reference for precision irrigation systems in greenhouses. However, plant information acquisition, interpretation, and systematic application remain insufficient. This study developed a discrimination method for plant root zone water status in the greenhouse by integrating phenotyping and machine learning techniques. Pakchoi plants were used and treated with three root zone moisture levels: 40%, 60%, and 80% relative water content. Three classification models, Random Forest (RF), Neural Network (NN), and Support Vector Machine (SVM), were developed and validated in different scenarios, with overall accuracy over 90% for all. The SVM model had the highest accuracy, but it required the longest training time. All models had accuracy over 85% in all scenarios, and more stable performance was observed in the RF model. The simplified SVM model developed from the top five most contributing traits had the largest accuracy reduction, 29.5%, while the simplified RF and NN models still maintained approximately 80%. For real-world application, factors such as operation cost, precision requirement, and system reaction time should be considered together in model selection. Our work shows it is promising to discriminate plant root zone water status by implementing phenotyping and machine learning techniques for precision irrigation management.
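The "simplified model from the top five most contributing traits" idea can be sketched with RF feature importances as below; the synthetic traits and class labels are placeholders for the measured phenotypes and the three moisture classes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for phenotypic traits measured under three soil-moisture classes
X, y = make_classification(n_samples=600, n_features=30, n_informative=8,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

full = RandomForestClassifier(n_estimators=500, random_state=0)
print("full model accuracy:", round(cross_val_score(full, X, y, cv=5).mean(), 3))

full.fit(X, y)
top5 = np.argsort(full.feature_importances_)[::-1][:5]          # five most contributing traits
print("top-5 model accuracy:", round(cross_val_score(full, X[:, top5], y, cv=5).mean(), 3))
```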
NASA Astrophysics Data System (ADS)
Ma, Hongchao; Cai, Zhan; Zhang, Liang
2018-01-01
This paper discusses airborne light detection and ranging (LiDAR) point cloud filtering (a binary classification problem) from the machine learning point of view. We compared three supervised classifiers for point cloud filtering, namely Adaptive Boosting, support vector machine, and random forest (RF). Nineteen features were generated from the raw LiDAR point cloud based on height and other geometric information within a given neighborhood. The test datasets issued by the International Society for Photogrammetry and Remote Sensing (ISPRS) were used to evaluate the performance of the three filtering algorithms; RF showed the best results, with an average total error of 5.50%. The paper also makes a tentative exploration of applying transfer learning theory to point cloud filtering, which, to the authors' knowledge, has not previously been introduced into the LiDAR field. We performed filtering of three datasets from real projects carried out in China with RF models constructed by learning from the 15 ISPRS datasets and then transferred with little to no change of the parameters. Reliable results were achieved, especially in rural areas (overall accuracy of 95.64%), indicating the feasibility of model transfer in the context of point cloud filtering for both easy automation and acceptable accuracy.
Zhuang, Xiaodong; Guo, Yue; Ni, Ao; Yang, Daya; Liao, Lizhen; Zhang, Shaozhao; Zhou, Huimin; Sun, Xiuting; Wang, Lichun; Wang, Xueqin; Liao, Xinxue
2018-06-04
An environment-wide association study (EWAS) may be useful to comprehensively test and validate associations between environmental factors and cardiovascular disease (CVD) in an unbiased manner. Data from the National Health and Nutrition Examination Survey (1999-2014) were randomly split 50:50 into a training set and a testing set. CVD was ascertained by a self-reported diagnosis of myocardial infarction, coronary heart disease or stroke. We performed multiple linear regression analyses associating 203 environmental factors and 132 clinical phenotypes with CVD in the training set (false discovery rate < 5%), and significant factors were validated in the testing set (P < 0.05). A random forest (RF) model was used for multicollinearity elimination and variable importance ranking. The discriminative power of factors for CVD was calculated by the area under the receiver operating characteristic curve (AUROC). Overall, 43,568 participants with 4084 (9.4%) CVD were included. After adjusting for age, sex, race, body mass index, blood pressure and socio-economic level, we identified 5 environmental variables and 19 clinical phenotypes associated with CVD in the training and testing datasets. The top five factors in the RF importance ranking were waist, glucose, uric acid, red cell distribution width and glycated hemoglobin. The AUROC of the RF model was 0.816 (top 5 factors) and 0.819 (full model). Sensitivity analyses revealed no specific moderators of the associations. Our systematic evaluation provides new knowledge on the complex array of environmental correlates of CVD. These identified correlates may serve as a complementary approach to CVD risk assessment. Our findings need to be probed in further observational and interventional studies. Copyright © 2018. Published by Elsevier Ltd.
Spectral turning bands for efficient Gaussian random fields generation on GPUs and accelerators
NASA Astrophysics Data System (ADS)
Hunger, L.; Cosenza, B.; Kimeswenger, S.; Fahringer, T.
2015-11-01
A random field (RF) is a set of correlated random variables associated with different spatial locations. RF generation algorithms are of crucial importance for many scientific areas, such as astrophysics, geostatistics, computer graphics, and many others. Current approaches commonly make use of 3D fast Fourier transform (FFT), which does not scale well for RF bigger than the available memory; they are also limited to regular rectilinear meshes. We introduce random field generation with the turning band method (RAFT), an RF generation algorithm based on the turning band method that is optimized for massively parallel hardware such as GPUs and accelerators. Our algorithm replaces the 3D FFT with a lower-order, one-dimensional FFT followed by a projection step and is further optimized with loop unrolling and blocking. RAFT can easily generate RF on non-regular (non-uniform) meshes and efficiently produce fields with mesh sizes bigger than the available device memory by using a streaming, out-of-core approach. Our algorithm generates RF with the correct statistical behavior and is tested on a variety of modern hardware, such as NVIDIA Tesla, AMD FirePro and Intel Phi. RAFT is faster than the traditional methods on regular meshes and has been successfully applied to two real case scenarios: planetary nebulae and cosmological simulations.
Filling of Cloud-Induced Gaps for Land Use and Land Cover Classifications Around Refugee Camps
NASA Astrophysics Data System (ADS)
Braun, Andreas; Hagensieker, Ron; Hochschild, Volker
2016-08-01
Cloud cover is one of the main constraints in the field of optical remote sensing. In particular, the use of multispectral imagery is affected by either fully obscured data or parts of the image which remain unusable. This study compares four algorithms for the filling of cloud-induced gaps in classified land cover products, based on Markov Random Field (MRF), Random Forest (RF) and Closest Spectral Fit (CSF) operators. They are tested on a classified Sentinel-2 image in which artificial clouds are filled with information derived from a Sentinel-1 scene. The approaches rely on different mathematical principles and therefore produced results varying in both pattern and quality. Overall accuracies for the filled areas range from 57 to 64%. The best results are achieved by CSF, although some classes (e.g. sands and grassland) remain critical through all approaches.
Ren, Zhoupeng; Zhu, Jun; Gao, Yanfang; Yin, Qian; Hu, Maogui; Dai, Li; Deng, Changfei; Yi, Lin; Deng, Kui; Wang, Yanping; Li, Xiaohong; Wang, Jinfeng
2018-07-15
Previous research suggested an association between maternal exposure to ambient air pollutants and risk of congenital heart defects (CHDs), though the effects of particulate matter ≤10 μm in aerodynamic diameter (PM10) on CHDs are inconsistent. We used two machine learning models (i.e., random forest (RF) and gradient boosting (GB)) to investigate the non-linear effects of PM10 exposure during the critical time window, weeks 3-8 in pregnancy, on risk of CHDs. From 2009 through 2012, we carried out a population-based birth cohort study on 39,053 live-born infants in Beijing. RF and GB models were used to calculate odds ratios for CHDs associated with increase in PM10 exposure, adjusting for maternal and perinatal characteristics. Maternal exposure to PM10 was identified as the primary risk factor for CHDs in all machine learning models. We observed a clear non-linear effect of maternal exposure to PM10 on CHDs risk. Compared to 40 μg m⁻³, the following odds ratios resulted: 1) 92 μg m⁻³ [RF: 1.16 (95% CI: 1.06, 1.28); GB: 1.26 (95% CI: 1.17, 1.35)]; 2) 111 μg m⁻³ [RF: 1.04 (95% CI: 0.96, 1.14); GB: 1.04 (95% CI: 0.99, 1.08)]; 3) 124 μg m⁻³ [RF: 1.01 (95% CI: 0.94, 1.10); GB: 0.98 (95% CI: 0.93, 1.02)]; 4) 190 μg m⁻³ [RF: 1.29 (95% CI: 1.14, 1.44); GB: 1.71 (95% CI: 1.04, 2.17)]. Overall, both machine learning models showed an association between maternal exposure to ambient PM10 and CHDs in Beijing, highlighting the need for non-linear methods to investigate dose-response relationships. Copyright © 2018 Elsevier B.V. All rights reserved.
Development of machine learning models for diagnosis of glaucoma.
Kim, Seong Jae; Cho, Kyong Jin; Oh, Sejong
2017-01-01
The study aimed to develop machine learning models that have strong prediction power and interpretability for diagnosis of glaucoma, based on retinal nerve fiber layer (RNFL) thickness and visual field (VF). We collected various candidate features from examinations of RNFL thickness and VF, and also developed synthesized features from the original features. We then selected the best features for classification (diagnosis) through feature evaluation. We used 100 cases of data as a test dataset and 399 cases of data as a training and validation dataset. To develop the glaucoma prediction model, we considered four machine learning algorithms: C5.0, random forest (RF), support vector machine (SVM), and k-nearest neighbor (KNN). We repeatedly composed a learning model using the training dataset and evaluated it using the validation dataset. Finally, we selected the learning model that produced the highest validation accuracy, and we analyzed the quality of the models using several measures. The random forest model shows the best performance, and the C5.0, SVM, and KNN models show similar accuracy. In the random forest model, the classification accuracy is 0.98, sensitivity is 0.983, specificity is 0.975, and AUC is 0.979. The developed prediction models show high accuracy, sensitivity, specificity, and AUC in classifying between glaucomatous and healthy eyes, and can be used to predict glaucoma for unseen examination records. Clinicians may reference the prediction results and be able to make better decisions. Multiple learning models may be combined to increase prediction accuracy. The C5.0 model includes decision rules for prediction and can be used to explain the reasons for specific predictions.
Pian, Cong; Zhang, Guangle; Chen, Zhi; Chen, Yuanyuan; Zhang, Jin; Yang, Tao; Zhang, Liangyun
2016-01-01
As a novel class of noncoding RNAs, long noncoding RNAs (lncRNAs) have been verified to be associated with various diseases. As large-scale transcripts are generated every year, it is important to accurately and quickly identify lncRNAs from thousands of assembled transcripts. To accurately discover new lncRNAs, we develop a random forest (RF) classification tool named LncRNApred based on a new hybrid feature set. This hybrid feature set includes three newly proposed features: MaxORF, RMaxORF and SNR. LncRNApred is effective for classifying lncRNAs and protein-coding transcripts accurately and quickly. Moreover, our RF model only requires training on data from human coding and non-coding transcripts; other species can also be predicted using LncRNApred. The results show that our method is more effective than the Coding Potential Calculator (CPC). The web server of LncRNApred is available for free at http://mm20132014.wicp.net:57203/LncRNApred/home.jsp.
Ji, Guoli; Ye, Pengchao; Shi, Yijian; Yuan, Leiming; Chen, Xiaojing; Yuan, Mingshun; Zhu, Dehua; Chen, Xi; Hu, Xinyu; Jiang, Jing
2017-01-01
An attempt was made in this study to distinguish Tegillarca granosa samples artificially contaminated with three kinds of toxic heavy metals, zinc (Zn), cadmium (Cd), and lead (Pb), using laser-induced breakdown spectroscopy (LIBS) technology and pattern recognition methods. The measured spectra were first processed by a wavelet transform algorithm (WTA), and the generated characteristic information was subsequently selected by an information gain algorithm (IGA). The resulting 30 variables were used as input variables for three classifiers: partial least squares discriminant analysis (PLS-DA), support vector machine (SVM), and random forest (RF), among which the RF model exhibited the best performance, with 93.3% discrimination accuracy. In addition, the extracted characteristic information was used to reconstruct the original spectra by inverse WTA, and the corresponding attribution of the reconstructed spectra was then discussed. This work indicates that healthy shellfish samples of Tegillarca granosa could be distinguished from toxic heavy-metal-contaminated ones by pattern recognition analysis combined with LIBS technology, which requires only minimal pretreatment. PMID:29149053
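A rough sketch of a WTA-plus-information-gain-plus-RF pipeline of the kind described above, using `pywt` for the discrete wavelet transform and scikit-learn's mutual information as an information-gain-style ranking; the synthetic spectra, wavelet choice and number of retained variables are assumptions, not the study's settings.

```python
import numpy as np
import pywt
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
spectra = rng.normal(size=(120, 2048))              # synthetic stand-in for LIBS spectra
labels = rng.integers(0, 4, 120)                    # healthy / Zn / Cd / Pb classes

def wavelet_features(spectrum, wavelet="db4", level=5):
    """Concatenate approximation and detail coefficients of a discrete wavelet transform."""
    return np.concatenate(pywt.wavedec(spectrum, wavelet, level=level))

W = np.vstack([wavelet_features(s) for s in spectra])
gain = mutual_info_classif(W, labels, random_state=0)        # information-gain-style ranking
top = np.argsort(gain)[::-1][:30]                            # keep 30 variables, as in the study
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(W[:, top], labels)
```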
Multi-channel non-invasive fetal electrocardiography detection using wavelet decomposition
NASA Astrophysics Data System (ADS)
Almeida, Javier; Ruano, Josué; Corredor, Germán.; Romo-Bucheli, David; Navarro-Vargas, José Ricardo; Romero, Eduardo
2017-11-01
Non-invasive fetal electrocardiography (fECG) has attracted the medical community because of the importance of fetal monitoring. However, its implementation in clinical practice is challenging: the fetal signal has a low signal-to-noise ratio and several signal sources are present in the maternal abdominal electrocardiography (AECG). This paper presents a novel method to detect the fetal signal from a multi-channel maternal AECG. The method begins by filtering and detrending the AECG signals. Afterwards, the maternal QRS complexes are identified and subtracted. The residual signals are used to detect the fetal QRS complex. Intervals of these signals are analyzed using a wavelet decomposition. The resulting representation feeds a previously trained Random Forest (RF) classifier that identifies signal intervals associated with the fetal QRS complex. The method was evaluated on a publicly available dataset: the Physionet 2013 challenge. A set of 50 maternal AECG records was used to train the RF classifier. The evaluation was carried out on signal intervals extracted from an additional 25 maternal AECG records. The proposed method yielded an 83.77% accuracy in the fetal QRS complex classification task.
A Partial Least Squares Based Procedure for Upstream Sequence Classification in Prokaryotes.
Mehmood, Tahir; Bohlin, Jon; Snipen, Lars
2015-01-01
The upstream region of coding genes is important for several reasons, for instance locating transcription factor binding sites and start site initiation in genomic DNA. Motivated by a recently conducted study in which a multivariate approach was successfully applied to coding sequence modeling, we introduce a partial least squares (PLS) based procedure for the classification of true upstream prokaryotic sequences from background upstream sequences. The upstream sequences of coding genes conserved across genomes were considered in the analysis, where conserved coding genes were found using the pan-genome concept for each prokaryotic species considered. PLS uses a position-specific scoring matrix (PSSM) to study the characteristics of the upstream region. Results obtained by the PLS-based method were compared with the Gini importance of random forest (RF) and with support vector machines (SVM), a widely used method for sequence classification. The upstream sequence classification performance was evaluated using cross-validation, and the suggested approach identifies the prokaryotic upstream region significantly better than RF (p-value < 0.01) and SVM (p-value < 0.01). Further, the proposed method also produced results that concurred with known biological characteristics of the upstream region.
Semi-supervised prediction of gene regulatory networks using machine learning algorithms.
Patel, Nihir; Wang, Jason T L
2015-10-01
Use of computational methods to predict gene regulatory networks (GRNs) from gene expression data is a challenging task. Many studies have been conducted using unsupervised methods to fulfill the task; however, such methods usually yield low prediction accuracies due to the lack of training data. In this article, we propose semi-supervised methods for GRN prediction by utilizing two machine learning algorithms, namely, support vector machines (SVM) and random forests (RF). The semi-supervised methods make use of unlabelled data for training. We investigated inductive and transductive learning approaches, both of which adopt an iterative procedure to obtain reliable negative training data from the unlabelled data. We then applied our semi-supervised methods to gene expression data of Escherichia coli and Saccharomyces cerevisiae, and evaluated the performance of our methods using the expression data. Our analysis indicated that the transductive learning approach outperformed the inductive learning approach for both organisms. However, there was no conclusive difference identified in the performance of SVM and RF. Experimental results also showed that the proposed semi-supervised methods performed better than existing supervised methods for both organisms.
Hybrid feature selection for supporting lightweight intrusion detection systems
NASA Astrophysics Data System (ADS)
Song, Jianglong; Zhao, Wentao; Liu, Qiang; Wang, Xin
2017-08-01
Redundant and irrelevant features not only cause high resource consumption but also degrade the performance of Intrusion Detection Systems (IDS), especially when coping with big data. These features slow down the process of training and testing in network traffic classification. Therefore, a hybrid feature selection approach combining wrapper and filter selection is designed in this paper to build a lightweight intrusion detection system. Two main phases are involved in this method. The first phase conducts a preliminary search for an optimal subset of features, in which chi-square feature selection is utilized. The set of features selected in the previous phase is further refined in the second phase in a wrapper manner, in which Random Forest (RF) is used to guide the selection process and retain an optimized set of features. After that, we build an RF-based detection model and make a fair comparison with other approaches. The experimental results on the NSL-KDD datasets show that our approach results in higher detection accuracy as well as faster training and testing processes.
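A compact sketch of the two-phase idea, assuming non-negative traffic features (the chi-square test requires them); the synthetic data, the number of features kept in each phase and the accuracy criterion are placeholders rather than the paper's configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for NSL-KDD-style traffic features; shift to non-negative values for chi2
X, y = make_classification(n_samples=2000, n_features=40, n_informative=10, random_state=0)
X = X - X.min(axis=0)

X_filt = SelectKBest(chi2, k=20).fit_transform(X, y)          # phase 1: chi-square filter

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_filt, y)
order = np.argsort(rf.feature_importances_)[::-1]             # phase 2: RF-guided wrapper refinement
best_k, best_acc = None, 0.0
for k in range(5, 21, 5):
    acc = cross_val_score(RandomForestClassifier(random_state=0), X_filt[:, order[:k]], y, cv=5).mean()
    if acc > best_acc:
        best_k, best_acc = k, acc
print("retained features:", best_k, "CV accuracy:", round(best_acc, 3))
```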
Schneider, Matthias; Hirsch, Sven; Weber, Bruno; Székely, Gábor; Menze, Bjoern H
2015-01-01
We propose a novel framework for joint 3-D vessel segmentation and centerline extraction. The approach is based on multivariate Hough voting and oblique random forests (RFs) that we learn from noisy annotations. It relies on steerable filters for the efficient computation of local image features at different scales and orientations. We validate both the segmentation performance and the centerline accuracy of our approach on synthetic vascular data and on four 3-D imaging datasets of the rat visual cortex at 700 nm resolution. First, we evaluate the most important structural components of our approach: (1) orthogonal subspace filtering in comparison to steerable filters, which show, qualitatively, similarities to the eigenspace filters learned from local image patches; and (2) standard RF against oblique RF. Second, we compare the overall approach to different state-of-the-art methods for (1) vessel segmentation based on optimally oriented flux (OOF) and the eigenstructure of the Hessian, and (2) centerline extraction based on homotopic skeletonization and geodesic path tracing. Our experiments reveal the benefit of steerable over eigenspace filters as well as the advantage of oblique split directions over univariate orthogonal splits. We further show that the learning-based approach outperforms the different state-of-the-art methods and proves highly accurate and robust with regard to both vessel segmentation and centerline extraction, in spite of the high level of label noise in the training data. Copyright © 2014 Elsevier B.V. All rights reserved.
Mapping Winter Wheat with Multi-Temporal SAR and Optical Images in an Urban Agricultural Region
Zhou, Tao; Pan, Jianjun; Zhang, Peiyu; Wei, Shanbao; Han, Tao
2017-01-01
Winter wheat is the second largest food crop in China. It is important to obtain reliable winter wheat acreage to guarantee the food security for the most populous country in the world. This paper focuses on assessing the feasibility of in-season winter wheat mapping and investigating potential classification improvement by using SAR (Synthetic Aperture Radar) images, optical images, and the integration of both types of data in urban agricultural regions with complex planting structures in Southern China. Both SAR (Sentinel-1A) and optical (Landsat-8) data were acquired, and classification using different combinations of Sentinel-1A-derived information and optical images was performed using a support vector machine (SVM) and a random forest (RF) method. The interference coherence and texture images were obtained and used to assess the effect of adding them to the backscatter intensity images on the classification accuracy. The results showed that the use of four Sentinel-1A images acquired before the jointing period of winter wheat can provide satisfactory winter wheat classification accuracy, with an F1 measure of 87.89%. The combination of SAR and optical images for winter wheat mapping achieved the best F1 measure, up to 98.06%. The SVM was superior to RF in terms of the overall accuracy and the kappa coefficient, and was faster than RF, while the RF classifier was slightly better than SVM in terms of the F1 measure. In addition, the classification accuracy can be effectively improved by adding the texture and coherence images to the backscatter intensity data. PMID:28587066
Carvajal, Thaddeus M; Viacrusis, Katherine M; Hernandez, Lara Fides T; Ho, Howell T; Amalin, Divina M; Watanabe, Kozo
2018-04-17
Several studies have applied ecological factors such as meteorological variables to develop models that accurately predict the temporal pattern of dengue incidence or occurrence. Across the many studies that have investigated this premise, the modeling approaches differ, and each typically relies on a single statistical technique, raising the question of which technique is most robust and reliable. Hence, our study compares the predictive accuracy of four modeling techniques for the temporal pattern of dengue incidence in Metropolitan Manila as influenced by meteorological factors: (a) General Additive Modeling, (b) Seasonal Autoregressive Integrated Moving Average with exogenous variables, (c) Random Forest and (d) Gradient Boosting. Dengue incidence and meteorological data (flood, precipitation, temperature, southern oscillation index, relative humidity, wind speed and direction) of Metropolitan Manila from January 1, 2009 to December 31, 2013 were obtained from the respective government agencies. Two types of datasets were used in the analysis: observed meteorological factors (MF) and their corresponding delayed or lagged effects (LG). These datasets were then subjected to the four modeling techniques, and the predictive accuracy and variable importance of each technique were calculated and evaluated. Among the statistical modeling techniques, Random Forest showed the best predictive accuracy, and the delayed or lagged effects of the meteorological variables proved to be the better dataset for this purpose. Thus, the Random Forest model with delayed meteorological effects (RF-LG) was deemed the best among all assessed models. Relative humidity was the most important meteorological factor in the best model. The study showed that different statistical modeling techniques indeed generate different predictive outcomes, and that the Random Forest model with delayed meteorological effects best predicts the temporal pattern of dengue incidence in Metropolitan Manila. It also identified relative humidity, along with rainfall and temperature, as an important meteorological factor influencing this temporal pattern.
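As a rough illustration of how the lagged-effect (LG) dataset can be built and fed to a Random Forest (pandas/scikit-learn assumed; column names, lag choices and the weekly frequency are hypothetical, and the study's other three techniques are not shown):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def lagged_rf(df, target="dengue_cases", lags=(1, 2, 4, 8)):
    """Fit an RF on lagged meteorological predictors of a time-indexed frame."""
    met_cols = [c for c in df.columns if c != target]
    # build the LG dataset: each meteorological column shifted by each lag
    feats = pd.concat(
        {f"{c}_lag{k}": df[c].shift(k) for c in met_cols for k in lags}, axis=1
    ).dropna()
    y = df.loc[feats.index, target]
    rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(feats, y)
    importance = pd.Series(rf.feature_importances_, index=feats.columns)
    return rf, importance.sort_values(ascending=False)
```

Variable importance from the fitted forest is one simple way to inspect which lagged factors (e.g. relative humidity) dominate.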
Liang, Ja-Der; Ping, Xiao-Ou; Tseng, Yi-Ju; Huang, Guan-Tarn; Lai, Feipei; Yang, Pei-Ming
2014-12-01
Recurrence of hepatocellular carcinoma (HCC) is an important issue despite effective treatments with tumor eradication. Identification of patients who are at high risk for recurrence may provide more efficacious screening and detection of tumor recurrence. The aim of this study was to develop recurrence predictive models for HCC patients who received radiofrequency ablation (RFA) treatment. From January 2007 to December 2009, 83 newly diagnosed HCC patients receiving RFA as their first treatment were enrolled. Five feature selection methods, including genetic algorithm (GA), simulated annealing (SA) algorithm, random forests (RF) and hybrid methods (GA+RF and SA+RF), were utilized to select an important subset of features from a total of 16 clinical features. These feature selection methods were combined with a support vector machine (SVM) to develop predictive models with better performance. Five-fold cross-validation was used to train and test the SVM models. The developed SVM-based predictive models with hybrid feature selection methods and 5-fold cross-validation had average sensitivity, specificity, accuracy, positive predictive value, negative predictive value, and area under the ROC curve of 67%, 86%, 82%, 69%, 90%, and 0.69, respectively. The SVM-derived predictive models can flag patients at high risk of recurrence, who should be closely followed up after complete RFA treatment. Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.
A novel feature extraction scheme with ensemble coding for protein-protein interaction prediction.
Du, Xiuquan; Cheng, Jiaxing; Zheng, Tingting; Duan, Zheng; Qian, Fulan
2014-07-18
Protein-protein interactions (PPIs) play key roles in most cellular processes, such as cell metabolism, immune response, endocrine function, DNA replication, and transcription regulation. PPI prediction is one of the most challenging problems in functional genomics. Although PPI data have been increasing because of the development of high-throughput technologies and computational methods, many problems are still far from being solved. In this study, a novel predictor was designed by using the Random Forest (RF) algorithm with the ensemble coding (EC) method. To reduce computational time, a feature selection method (DX) was adopted to rank the features and search the optimal feature combination. The DXEC method integrates many features and physicochemical/biochemical properties to predict PPIs. On the Gold Yeast dataset, the DXEC method achieves 67.2% overall precision, 80.74% recall, and 70.67% accuracy. On the Silver Yeast dataset, the DXEC method achieves 76.93% precision, 77.98% recall, and 77.27% accuracy. On the human dataset, the prediction accuracy reaches 80% for the DXEC-RF method. We extended the experiment to a bigger and more realistic dataset that maintains 50% recall on the Yeast All dataset and 80% recall on the Human All dataset. These results show that the DXEC method is suitable for performing PPI prediction. The prediction service of the DXEC-RF classifier is available at http://ailab.ahu.edu.cn:8087/DXECPPI/index.jsp.
Machine Learning Estimation of Atom Condensed Fukui Functions.
Zhang, Qingyou; Zheng, Fangfang; Zhao, Tanfeng; Qu, Xiaohui; Aires-de-Sousa, João
2016-02-01
To enable the fast estimation of atom condensed Fukui functions, machine learning algorithms were trained with databases of DFT pre-calculated values for ca. 23,000 atoms in organic molecules. The problem was approached as the ranking of atom types with the Bradley-Terry (BT) model, and as the regression of the Fukui function. Random Forests (RF) were trained to predict the condensed Fukui function, to rank atoms in a molecule, and to classify atoms as high/low Fukui function. Atomic descriptors were based on counts of atom types in spheres around the kernel atom. The BT coefficients assigned to atom types enabled the identification (93-94% accuracy) of the atom with the highest Fukui function in pairs of atoms in the same molecule with differences ≥0.1. In whole molecules, the atom with the top Fukui function could be recognized in ca. 50% of the cases and, on average, about 3 of the top 4 atoms could be recognized in a shortlist of 4. Regression RF yielded predictions for test sets with R2 = 0.68-0.69, improving the ability of BT coefficients to rank atoms in a molecule. Atom classification (as high/low Fukui function) was obtained with RF with sensitivity of 55-61% and specificity of 94-95%. © 2016 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Deep neural nets as a method for quantitative structure-activity relationships.
Ma, Junshui; Sheridan, Robert P; Liaw, Andy; Dahl, George E; Svetnik, Vladimir
2015-02-23
Neural networks were widely used for quantitative structure-activity relationships (QSAR) in the 1990s. Because of various practical issues (e.g., slow on large problems, difficult to train, prone to overfitting, etc.), they were superseded by more robust methods like support vector machine (SVM) and random forest (RF), which arose in the early 2000s. The last 10 years have witnessed a revival of neural networks in the machine learning community thanks to new methods for preventing overfitting, more efficient training algorithms, and advancements in computer hardware. In particular, deep neural nets (DNNs), i.e. neural nets with more than one hidden layer, have found great successes in many applications, such as computer vision and natural language processing. Here we show that DNNs can routinely make better prospective predictions than RF on a set of large diverse QSAR data sets that are taken from Merck's drug discovery effort. The number of adjustable parameters needed for DNNs is fairly large, but our results show that it is not necessary to optimize them for individual data sets, and a single set of recommended parameters can achieve better performance than RF for most of the data sets we studied. The usefulness of the parameters is demonstrated on additional data sets not used in the calibration. Although training DNNs is still computationally intensive, using graphics processing units (GPUs) can make this issue manageable.
Li, Hongjian; Peng, Jiangjun; Leung, Yee; Leung, Kwong-Sak; Wong, Man-Hon; Lu, Gang; Ballester, Pedro J
2018-03-14
It has recently been claimed that the outstanding performance of machine-learning scoring functions (SFs) is exclusively due to the presence of training complexes with highly similar proteins to those in the test set. Here, we revisit this question using 24 similarity-based training sets, a widely used test set, and four SFs. Three of these SFs employ machine learning instead of the classical linear regression approach of the fourth SF (X-Score which has the best test set performance out of 16 classical SFs). We have found that random forest (RF)-based RF-Score-v3 outperforms X-Score even when 68% of the most similar proteins are removed from the training set. In addition, unlike X-Score, RF-Score-v3 is able to keep learning with an increasing training set size, becoming substantially more predictive than X-Score when the full 1105 complexes are used for training. These results show that machine-learning SFs owe a substantial part of their performance to training on complexes with dissimilar proteins to those in the test set, against what has been previously concluded using the same data. Given that a growing amount of structural and interaction data will be available from academic and industrial sources, this performance gap between machine-learning SFs and classical SFs is expected to enlarge in the future.
NASA Astrophysics Data System (ADS)
Clark, M. L.; Kilham, N. E.
2015-12-01
Land-cover maps are important science products needed for natural resource and ecosystem service management, biodiversity conservation planning, and assessing human-induced and natural drivers of land change. Most land-cover maps at regional to global scales are produced with remote sensing techniques applied to multispectral satellite imagery with 30-500 m pixel sizes (e.g., Landsat, MODIS). Hyperspectral, or imaging spectrometer, imagery measuring the visible to shortwave infrared regions (VSWIR) of the spectrum has shown impressive capacity to map plant species and coarser land-cover associations, yet these techniques have not been widely tested at regional and greater spatial scales. The Hyperspectral Infrared Imager (HyspIRI) mission is a VSWIR hyperspectral and thermal satellite being considered for development by NASA. The goal of this study was to assess multi-temporal, HyspIRI-like satellite imagery for improved land cover mapping relative to multispectral satellites. We mapped FAO Land Cover Classification System (LCCS) classes over 22,500 km2 in the San Francisco Bay Area, California using 30-m HyspIRI, Landsat 8 and Sentinel-2 imagery simulated from data acquired by NASA's AVIRIS airborne sensor. Random Forests (RF) and Multiple-Endmember Spectral Mixture Analysis (MESMA) classifiers were applied to the simulated images and accuracies were compared to those from real Landsat 8 images. The RF classifier was superior to MESMA, and multi-temporal data yielded higher accuracy than summer-only data. With RF, hyperspectral data had overall accuracy of 72.2% and 85.1% with full 20-class and reduced 12-class schemes, respectively. Multispectral imagery had lower accuracy. For example, simulated and real Landsat data had 7.5% and 4.6% lower accuracy than HyspIRI data with 12 classes, respectively. In summary, our results indicate increased mapping accuracy using HyspIRI multi-temporal imagery, particularly in discriminating different natural vegetation types, such as spectrally-mixed woodlands and forests.
Anastasiadou, Maria N; Christodoulakis, Manolis; Papathanasiou, Eleftherios S; Papacostas, Savvas S; Mitsis, Georgios D
2017-09-01
This paper proposes supervised and unsupervised algorithms for automatic muscle artifact detection and removal from long-term EEG recordings, which combine canonical correlation analysis (CCA) and wavelets with random forests (RF). The proposed algorithms first perform CCA and continuous wavelet transform of the canonical components to generate a number of features which include component autocorrelation values and wavelet coefficient magnitude values. A subset of the most important features is subsequently selected using RF and labelled observations (supervised case) or synthetic data constructed from the original observations (unsupervised case). The proposed algorithms are evaluated using realistic simulation data as well as 30-min epochs of non-invasive EEG recordings obtained from ten patients with epilepsy. We assessed the performance of the proposed algorithms using classification performance and goodness-of-fit values for noisy and noise-free signal windows. In the simulation study, where the ground truth was known, the proposed algorithms yielded almost perfect performance. In the case of experimental data, where expert marking was performed, the results suggest that both the supervised and unsupervised algorithm versions were able to remove artifacts without affecting noise-free channels considerably, outperforming standard CCA, independent component analysis (ICA) and Lagged Auto-Mutual Information Clustering (LAMIC). The proposed algorithms achieved excellent performance for both simulation and experimental data. Importantly, for the first time to our knowledge, we were able to perform entirely unsupervised artifact removal, i.e. without using already marked noisy data segments, achieving performance that is comparable to the supervised case. Overall, the results suggest that the proposed algorithms hold significant potential for improving EEG signal quality in research or clinical settings without the need for marking by expert neurophysiologists, EMG signal recording and user visual inspection. Copyright © 2017 International Federation of Clinical Neurophysiology. Published by Elsevier B.V. All rights reserved.
Detecting Paroxysmal Coughing from Pertussis Cases Using Voice Recognition Technology
Parker, Danny; Picone, Joseph; Harati, Amir; Lu, Shuang; Jenkyns, Marion H.; Polgreen, Philip M.
2013-01-01
Background Pertussis is highly contagious; thus, prompt identification of cases is essential to control outbreaks. Clinicians experienced with the disease can easily identify classic cases, where patients have bursts of rapid coughing followed by gasps, and a characteristic whooping sound. However, many clinicians have never seen a case, and thus may miss initial cases during an outbreak. The purpose of this project was to use voice-recognition software to distinguish pertussis coughs from croup and other coughs. Methods We collected a series of recordings representing pertussis, croup and miscellaneous coughing by children. We manually categorized coughs as either pertussis or non-pertussis, and extracted features for each category. We used Mel-frequency cepstral coefficients (MFCC), a sampling rate of 16 kHz, a frame duration of 25 msec, and a frame rate of 10 msec. The coughs were filtered. Each cough was divided into 3 sections of proportion 3-4-3. The average of the 13 MFCCs for each section was computed and made into a 39-element feature vector used for the classification. We used the following machine learning algorithms: Neural Networks, K-Nearest Neighbor (KNN), and a 200-tree Random Forest (RF). Data were reserved for cross-validation of the KNN and RF. The Neural Network was trained 100 times, and the averaged results are presented. Results After categorization, we had 16 examples of non-pertussis coughs and 31 examples of pertussis coughs. Over 90% of all pertussis coughs were properly classified as pertussis. The error rates were: Type I errors of 7%, 12%, and 25% and Type II errors of 8%, 0%, and 0%, using the Neural Network, Random Forest, and KNN, respectively. Conclusion Our results suggest that we can build a robust classifier to assist clinicians and the public to help identify pertussis cases in children presenting with typical symptoms. PMID:24391730
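A hedged sketch of the 39-element feature vector described above (librosa is used here purely for illustration; the study's toolchain, filtering step and file handling are not specified, and the classifier call is only indicative):

```python
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def cough_features(path, sr=16000):
    """Mean of 13 MFCCs over three sections in 3-4-3 proportion -> 39 values."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr),        # 25 msec frames
                                hop_length=int(0.010 * sr))   # 10 msec frame rate
    n = mfcc.shape[1]
    cuts = [0, int(0.3 * n), int(0.7 * n), n]                 # 3-4-3 split
    return np.concatenate([mfcc[:, cuts[i]:cuts[i + 1]].mean(axis=1) for i in range(3)])

# hypothetical usage, with 'paths' and binary 'labels' (pertussis vs. not) given:
# X = np.vstack([cough_features(p) for p in paths])
# clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, labels)
```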
Improving the Accuracy of Cloud Detection Using Machine Learning
NASA Astrophysics Data System (ADS)
Craddock, M. E.; Alliss, R. J.; Mason, M.
2017-12-01
Cloud detection from geostationary satellite imagery has long been accomplished through multi-spectral channel differencing in comparison to the Earth's surface. The distinction of clear/cloud is then determined by comparing these differences to empirical thresholds. Using this methodology, the probability of detecting clouds exceeds 90%, but performance varies seasonally, regionally and temporally. The Cloud Mask Generator (CMG) database developed under this effort consists of 20 years of 4 km, 15-minute clear/cloud images based on GOES data over CONUS and Hawaii. The algorithms to determine cloudy pixels in the imagery are based on well-known multi-spectral techniques and defined thresholds. These thresholds were produced through manual study of thousands of images, requiring thousands of man-hours to assess the success and failure of the algorithms and fine-tune the thresholds. This study aims to investigate the potential of improving cloud detection by using Random Forest (RF) ensemble classification. RF is an ideal methodology for cloud detection as it runs efficiently on large datasets, is robust to outliers and noise, and is able to deal with highly correlated predictors, such as multi-spectral satellite imagery. The RF code was developed using Python in about 4 weeks. The region of focus selected was Hawaii and includes the use of visible and infrared imagery, topography and multi-spectral image products as predictors. The development of the cloud detection technique is realized in three steps. First, tuning of the RF models is completed to identify the optimal values of the number of trees and number of predictors to employ for both day and night scenes. Second, the RF models are trained using the optimal number of trees and a select number of random predictors identified during the tuning phase. Lastly, the model is used to predict clouds for a time period independent of that used during training, and the predictions are compared to truth, the CMG cloud mask. Initial results show 97% accuracy during the daytime, 94% accuracy at night, and 95% accuracy for all times. The total time to train, tune and test was approximately one week. The improved performance and reduced time to produce results is testament to improved computer technology and the use of machine learning as a more efficient and accurate methodology of cloud detection.
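The tuning step (number of trees and number of predictors tried per split) could look roughly like the following scikit-learn sketch; the grid values and the per-pixel predictor/label arrays X_day and y_day are assumptions, not details from the study:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Tune the two RF knobs highlighted above, separately for day and night scenes.
param_grid = {"n_estimators": [100, 250, 500],
              "max_features": [2, 4, 6, "sqrt"]}
search = GridSearchCV(RandomForestClassifier(n_jobs=-1, random_state=0),
                      param_grid, cv=3, scoring="accuracy")
# search.fit(X_day, y_day)   # X_day: per-pixel predictors, y_day: CMG clear/cloud labels
# print(search.best_params_, search.best_score_)
```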
Contributions of projected land use to global radiative forcing ascribed to local sources
NASA Astrophysics Data System (ADS)
Ward, D. S.; Mahowald, N. M.; Kloster, S.
2013-12-01
With global demand for food expected to dramatically increase and put additional pressures on natural lands, there is a need to understand the environmental impacts of land use and land cover change (LULCC). Previous studies have shown that the magnitude and even the sign of the radiative forcing (RF) of biogeophysical effects from LULCC depend on the latitude and forest ecology of the disturbed region. Here we ascribe the contributions to the global RF by land-use related anthropogenic activities to their local sources, organized on a grid of 1.9 degrees latitude by 2.5 degrees longitude. We use RF estimates for the year 2100, using five future LULCC projections, computed from simulations with the National Center for Atmospheric Research Community Land Model and Community Atmosphere Models and additional offline analyses. Our definition of the LULCC RF includes changes to terrestrial carbon storage, methane and nitrous oxide emissions, atmospheric chemistry, aerosol emissions, and surface albedo. We ascribe the RF to gridded locations based on LULCC-related emissions of relevant trace gases and aerosols, including emissions from fires. We find that the largest contributions to the global RF in year 2100 from LULCC originate in the tropics for all future scenarios. In fact, LULCC is the largest tropical source of anthropogenic RF. The LULCC RF in the tropics is dominated by emissions of CO2 from deforestation and methane emissions from livestock and soils. Land surface albedo change is rarely the dominant forcing agent in any of the future LULCC projections, at any location. By combining the five future scenarios we find that deforested area at a specific tropical location can be used to predict the contribution to global RF from LULCC at that location (the relationship does not hold as well in the extratropics). This information could support global efforts like REDD (Reducing Emissions from Deforestation and Forest Degradation), that aim to reduce greenhouse gas emissions from land use, by helping to optimize their effectiveness for climate change mitigation.
NASA Astrophysics Data System (ADS)
Shiri, Jalal
2018-06-01
Among different reference evapotranspiration (ETo) modeling approaches, mass transfer-based methods have been less studied. These approaches utilize temperature and wind speed records. On the other hand, the empirical equations proposed in this context generally produce weak simulations, except when a local calibration is used for improving their performance. This might be a crucial drawback for those equations when local data for the calibration procedure are scarce. Therefore, the application of heuristic methods can be considered as an alternative for improving the performance accuracy of the mass transfer-based approaches. However, given that wind speed records usually have higher variation magnitudes than the other meteorological parameters, coupling a wavelet transform with the heuristic models would be necessary. In the present paper, a coupled wavelet-random forest (WRF) methodology was proposed for the first time to improve the performance accuracy of the mass transfer-based ETo estimation approaches, using cross-validation data management scenarios at both local and cross-station scales. The obtained results revealed that the new coupled WRF model (with minimum scatter index values of 0.150 and 0.192 for local and external applications, respectively) improved the performance accuracy of the single RF models as well as the empirical equations to a great extent.
Building rooftop classification using random forests for large-scale PV deployment
NASA Astrophysics Data System (ADS)
Assouline, Dan; Mohajeri, Nahid; Scartezzini, Jean-Louis
2017-10-01
Large scale solar Photovoltaic (PV) deployment on existing building rooftops has proven to be one of the most efficient and viable sources of renewable energy in urban areas. As it usually requires a potential analysis over the area of interest, a crucial step is to estimate the geometric characteristics of the building rooftops. In this paper, we introduce a multi-layer machine learning methodology to classify 6 roof types, 9 aspect (azimuth) classes and 5 slope (tilt) classes for all building rooftops in Switzerland, using GIS processing. We train Random Forests (RF), an ensemble learning algorithm, to build the classifiers. We use (2 × 2) m2 LiDAR data (considering buildings and vegetation) to extract several rooftop features, and generalised footprint polygon data to localize buildings. The roof classifier is trained and tested with 1252 labeled roofs from three different urban areas, namely Baden, Luzern, and Winterthur. The results for roof type classification show an average accuracy of 67%. The aspect and slope classifiers are trained and tested with 11449 labeled roofs in the Zurich periphery area. The results for aspect and slope classification show different accuracies depending on the classes: while some classes are well identified, other under-represented classes remain challenging to detect.
Anghelone, Marta; Jembrih-Simbürger, Dubravka; Schreiner, Manfred
2015-10-05
Copper phthalocyanine (CuPc) blues (PB15) are largely used in art and industry as pigments. In these fields mainly three different polymorphic modifications of PB15 are employed: alpha, beta and epsilon. Differentiating among these CuPc forms can give important information for developing conservation strategies and can help in relative dating, since each form was introduced to the market in a different time period. This study focuses on the classification of Raman spectra measured using a 532 nm excitation wavelength on: (i) dry pigment powders, (ii) unaged mock-ups of self-made paints, (iii) unaged commercial paints, and (iv) paints subjected to accelerated UV ageing. The ratios among integrated Raman bands are taken into consideration as features for Random Forest (RF) classification. Feature selection based on the Gini contrast score was carried out on the measured dataset to determine the Raman band ratios with the highest predictive power. These were used as polymorphic markers, in order to establish an easy and accessible identification method. Three different ratios and the presence of a characteristic vibrational band allowed the identification of the crystal modification in pigment powders as well as in unaged and aged paint films. Copyright © 2015 Elsevier B.V. All rights reserved.
Prediction of Protein-Protein Interaction Sites by Random Forest Algorithm with mRMR and IFS
Li, Bi-Qing; Feng, Kai-Yan; Chen, Lei; Huang, Tao; Cai, Yu-Dong
2012-01-01
Prediction of protein-protein interaction (PPI) sites is one of the most challenging problems in computational biology. Although great progress has been made by employing various machine learning approaches with numerous characteristic features, the problem is still far from being solved. In this study, we developed a novel predictor based on Random Forest (RF) algorithm with the Minimum Redundancy Maximal Relevance (mRMR) method followed by incremental feature selection (IFS). We incorporated features of physicochemical/biochemical properties, sequence conservation, residual disorder, secondary structure and solvent accessibility. We also included five 3D structural features to predict protein-protein interaction sites and achieved an overall accuracy of 0.672997 and MCC of 0.347977. Feature analysis showed that 3D structural features such as Depth Index (DPX) and surface curvature (SC) contributed most to the prediction of protein-protein interaction sites. It was also shown via site-specific feature analysis that the features of individual residues from PPI sites contribute most to the determination of protein-protein interaction sites. It is anticipated that our prediction method will become a useful tool for identifying PPI sites, and that the feature analysis described in this paper will provide useful insights into the mechanisms of interaction. PMID:22937126
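A simplified sketch of the ranking-plus-incremental-feature-selection (IFS) loop; plain mutual information stands in here for mRMR, and the tree count and step size are arbitrary, so this is illustrative rather than the authors' pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score

def incremental_feature_selection(X, y, cv=5, step=1):
    """Rank features, then grow the set by rank and keep the size with the
    best cross-validated RF accuracy (the IFS idea)."""
    order = np.argsort(mutual_info_classif(X, y, random_state=0))[::-1]
    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    best_k, best_score = 0, -np.inf
    for k in range(step, X.shape[1] + 1, step):
        score = cross_val_score(rf, X[:, order[:k]], y, cv=cv).mean()
        if score > best_score:
            best_k, best_score = k, score
    return order[:best_k], best_score
```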
Zhou, Qingtao; Flores, Alejandro; Glenn, Nancy F; Walters, Reggie; Han, Bangshuai
2017-01-01
Shortwave solar radiation is an important component of the surface energy balance and provides the principal source of energy for terrestrial ecosystems. This paper presents a machine learning approach in the form of a random forest (RF) model for estimating daily downward solar radiation flux at the land surface over complex terrain using MODIS (MODerate Resolution Imaging Spectroradiometer) remote sensing data. The model-building technique makes use of a unique network of 16 solar flux measurements in the semi-arid Reynolds Creek Experimental Watershed and Critical Zone Observatory, in southwest Idaho, USA. Based on a composite RF model built on daily observations from all 16 sites in the watershed, the model simulation of downward solar radiation matches well with the observation data (r2 = 0.96). To evaluate model performance, RF models were built from 12 of 16 sites selected at random and validated against the observations at the remaining four sites. Overall root mean square errors (RMSE), bias, and mean absolute error (MAE) are small (ranges: 37.17 to 81.27 W/m2, -48.31 to 15.67 W/m2, and 26.56 to 63.77 W/m2, respectively). When extrapolated to the entire watershed, spatiotemporal patterns of solar flux are largely consistent with expected trends in this watershed. We also explored significant predictors of downward solar flux in order to reveal important properties and processes controlling downward solar radiation. Based on the composite RF model built on all 16 sites, the three most important predictors to estimate downward solar radiation include the black sky albedo (BSA) near infrared band (0.858 μm), BSA visible band (0.3-0.7 μm), and clear day coverage. This study has important implications for improving the ability to derive downward solar radiation through a fusion of multiple remote sensing datasets and can potentially capture spatiotemporally varying trends in solar radiation that is useful for land surface hydrologic and terrestrial ecosystem modeling.
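The 12-train/4-test station design can be mimicked with a grouped split; a hedged scikit-learn sketch (array names and the repeated random split are assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupShuffleSplit

def leave_sites_out_scores(X, y, site_id, n_splits=5, n_test_sites=4):
    """Train on some stations, validate on held-out stations; report RMSE/bias/MAE."""
    gss = GroupShuffleSplit(n_splits=n_splits, test_size=n_test_sites, random_state=0)
    results = []
    for train, test in gss.split(X, y, groups=site_id):
        rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X[train], y[train])
        err = rf.predict(X[test]) - y[test]
        results.append({"rmse": float(np.sqrt(np.mean(err ** 2))),
                        "bias": float(np.mean(err)),
                        "mae": float(np.mean(np.abs(err)))})
    return results
```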
Evaluation of digital soil mapping approaches with large sets of environmental covariates
NASA Astrophysics Data System (ADS)
Nussbaum, Madlene; Spiess, Kay; Baltensweiler, Andri; Grob, Urs; Keller, Armin; Greiner, Lucie; Schaepman, Michael E.; Papritz, Andreas
2018-01-01
The spatial assessment of soil functions requires maps of basic soil properties. Unfortunately, these are either missing for many regions or are not available at the desired spatial resolution or down to the required soil depth. The field-based generation of large soil datasets and conventional soil maps remains costly. Meanwhile, legacy soil data and comprehensive sets of spatial environmental data are available for many regions. Digital soil mapping (DSM) approaches relating soil data (responses) to environmental data (covariates) face the challenge of building statistical models from large sets of covariates originating, for example, from airborne imaging spectroscopy or multi-scale terrain analysis. We evaluated six approaches for DSM in three study regions in Switzerland (Berne, Greifensee, ZH forest) by mapping the effective soil depth available to plants (SD), pH, soil organic matter (SOM), effective cation exchange capacity (ECEC), clay, silt, gravel content and fine fraction bulk density for four soil depths (totalling 48 responses). Models were built from 300-500 environmental covariates by selecting linear models through (1) grouped lasso and (2) an ad hoc stepwise procedure for robust external-drift kriging (georob). For (3) geoadditive models we selected penalized smoothing spline terms by component-wise gradient boosting (geoGAM). We further used two tree-based methods: (4) boosted regression trees (BRTs) and (5) random forest (RF). Lastly, we computed (6) weighted model averages (MAs) from the predictions obtained from methods 1-5. Lasso, georob and geoGAM successfully selected strongly reduced sets of covariates (subsets of 3-6 % of all covariates). Differences in predictive performance, tested on independent validation data, were mostly small and did not reveal a single best method for 48 responses. Nevertheless, RF was often the best among methods 1-5 (28 of 48 responses), but was outcompeted by MA for 14 of these 28 responses. RF tended to over-fit the data. The performance of BRT was slightly worse than RF. GeoGAM performed poorly on some responses and was the best only for 7 of 48 responses. The prediction accuracy of lasso was intermediate. All models generally had small bias. Only the computationally very efficient lasso had slightly larger bias because it tended to under-fit the data. Summarizing, although differences were small, the frequencies of the best and worst performance clearly favoured RF if a single method is applied and MA if multiple prediction models can be developed.
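One simple way to form the weighted model averages (MA) mentioned above is to weight each method's predictions by its inverse validation error; the study's exact weighting scheme may differ, so this is only a sketch:

```python
import numpy as np

def weighted_model_average(preds, val_errors):
    """Average per-method prediction vectors with weights inversely
    proportional to each method's validation error."""
    w = 1.0 / np.asarray(val_errors, dtype=float)
    w /= w.sum()
    return np.average(np.column_stack(preds), axis=1, weights=w)

# e.g. weighted_model_average([pred_lasso, pred_georob, pred_geogam, pred_brt, pred_rf],
#                             [rmse_lasso, rmse_georob, rmse_geogam, rmse_brt, rmse_rf])
```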
Using random forest for reliable classification and cost-sensitive learning for medical diagnosis.
Yang, Fan; Wang, Hua-zhen; Mi, Hong; Lin, Cheng-de; Cai, Wei-wen
2009-01-30
Most machine-learning classifiers output label predictions for new instances without indicating how reliable the predictions are. The applicability of these classifiers is limited in critical domains where incorrect predictions have serious consequences, like medical diagnosis. Further, the default assumption of equal misclassification costs is most likely violated in medical diagnosis. In this paper, we present a modified random forest classifier which is incorporated into the conformal predictor scheme. A conformal predictor is a transductive learning scheme, using Kolmogorov complexity to test the randomness of a particular sample with respect to the training sets. Our method shows the well-calibrated property that the performance can be set prior to classification, and the accuracy rate is exactly equal to the predefined confidence level. Further, to address the cost-sensitive problem, we extend our method to a label-conditional predictor which takes into account different costs for misclassifications in different classes and allows a different confidence level to be specified for each class. Intensive experiments on benchmark datasets and real-world applications show that the resultant classifier is well calibrated and able to control the specific risk of each class. The method of using the RF outlier measure to design a nonconformity measure benefits the resultant predictor. Further, the label-conditional classifier turns out to be an alternative approach to the cost-sensitive learning problem that relies on label-wise predefined confidence levels. The target of minimizing the risk of misclassification is achieved by specifying different confidence levels for different classes.
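A much-simplified split (inductive) Mondrian conformal classifier with an RF-probability nonconformity score conveys the calibration and label-conditional ideas; the paper itself uses a transductive scheme built on an RF outlier measure, so treat this only as a sketch of the general mechanics:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

class LabelConditionalConformalRF:
    def fit(self, X, y, cal_size=0.25, seed=0):
        X_tr, X_cal, y_tr, y_cal = train_test_split(
            X, y, test_size=cal_size, stratify=y, random_state=seed)
        self.rf_ = RandomForestClassifier(n_estimators=500, random_state=seed).fit(X_tr, y_tr)
        self.classes_ = self.rf_.classes_
        proba = self.rf_.predict_proba(X_cal)
        # per-class calibration nonconformity scores: 1 - P(true class)
        self.cal_ = {c: 1.0 - proba[y_cal == c][:, i]
                     for i, c in enumerate(self.classes_)}
        return self

    def predict_sets(self, X, eps_per_class):
        """eps_per_class maps each class to its allowed error rate."""
        sets = []
        for row in self.rf_.predict_proba(X):
            keep = []
            for i, c in enumerate(self.classes_):
                alpha = 1.0 - row[i]
                cal = self.cal_[c]
                p_val = (np.sum(cal >= alpha) + 1) / (len(cal) + 1)
                if p_val > eps_per_class[c]:   # class-specific confidence 1 - eps
                    keep.append(c)
            sets.append(keep)
        return sets
```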
A Deep Machine Learning Algorithm to Optimize the Forecast of Atmospherics
NASA Astrophysics Data System (ADS)
Russell, A. M.; Alliss, R. J.; Felton, B. D.
Space-based applications from imaging to optical communications are significantly impacted by the atmosphere. Specifically, the occurrence of clouds and optical turbulence can determine whether a mission is a success or a failure. In the case of space-based imaging applications, clouds produce atmospheric transmission losses that can make it impossible for an electro-optical platform to image its target. Hence, accurate predictions of negative atmospheric effects are a high priority in order to facilitate the efficient scheduling of resources. This study seeks to revolutionize our understanding of and our ability to predict such atmospheric events through the mining of data from a high-resolution Numerical Weather Prediction (NWP) model. Specifically, output from the Weather Research and Forecasting (WRF) model is mined using a Random Forest (RF) ensemble classification and regression approach in order to improve the prediction of low cloud cover over the Haleakala summit of the Hawaiian island of Maui. RF techniques have a number of advantages including the ability to capture non-linear associations between the predictors (in this case physical variables from WRF such as temperature, relative humidity, wind speed and pressure) and the predictand (clouds), which becomes critical when dealing with the complex non-linear occurrence of clouds. In addition, RF techniques are capable of representing complex spatial-temporal dynamics to some extent. Input predictors to the WRF-based RF model are strategically selected based on expert knowledge and a series of sensitivity tests. Ultimately, three types of WRF predictors are chosen: local surface predictors, regional 3D moisture predictors and regional inversion predictors. A suite of RF experiments is performed using these predictors in order to evaluate the performance of the hybrid RF-WRF technique. The RF model is trained and tuned on approximately half of the input dataset and evaluated on the other half. The RF approach is validated using in-situ observations of clouds. All of the hybrid RF-WRF experiments demonstrated here significantly outperform the base WRF local low cloud cover forecasts in terms of the probability of detection and the overall bias. In particular, RF experiments that use only regional three-dimensional moisture predictors from the WRF model produce the highest accuracy when compared to RF experiments that use local surface predictors only or regional inversion predictors only. Furthermore, adding multiple types of WRF predictors and additional WRF predictors to the RF algorithm does not necessarily add more value in the resulting forecasts, indicating that it is better to have a small set of meaningful predictors than to have a vast set of indiscriminately-chosen predictors. This work also reveals that the WRF-based RF approach is highly sensitive to the time period over which the algorithm is trained and evaluated. Future work will focus on developing a similar WRF-based RF model for high cloud prediction and expanding the algorithm to two-dimensions horizontally.
Multi-fractal detrended texture feature for brain tumor classification
NASA Astrophysics Data System (ADS)
Reza, Syed M. S.; Mays, Randall; Iftekharuddin, Khan M.
2015-03-01
We propose a novel non-invasive brain tumor type classification using Multi-fractal Detrended Fluctuation Analysis (MFDFA) [1] in structural magnetic resonance (MR) images. This preliminary work investigates the efficacy of the MFDFA features along with our novel texture feature known as multifractional Brownian motion (mBm) [2] in classifying (grading) brain tumors as High Grade (HG) and Low Grade (LG). Based on prior performance, Random Forest (RF) [3] is employed for tumor grading using two different datasets, BRATS-2013 [4] and BRATS-2014 [5]. Quantitative scores such as precision, recall, and accuracy are obtained using the confusion matrix. On average, 90% precision and 85% recall from the inter-dataset cross-validation confirm the efficacy of the proposed method.
NASA Astrophysics Data System (ADS)
Esposito, Carlo; Barra, Anna; Evans, Stephen G.; Scarascia Mugnozza, Gabriele; Delaney, Keith
2014-05-01
The study of landslide susceptibility by multivariate statistical methods is based on finding a quantitative relationship between controlling factors and landslide occurrence. Such studies have become popular in the last few decades thanks to the development of geographic information systems (GIS) software and the related improved data management. In this work we applied a statistical approach to an area of high landslide susceptibility, mainly due to its tropical climate and geological-geomorphological setting. The study area is located in the south-east region of Brazil, which has frequently been affected by flood and landslide hazard, especially because of heavy rainfall events during the summer season. In this work we studied a disastrous event that occurred on January 11th and 12th of 2011, which involved Região Serrana (the mountainous region of Rio de Janeiro State) and caused more than 5000 landslides and at least 904 deaths. In order to produce susceptibility maps, we focused our attention on an area of 93.6 km2 that includes Nova Friburgo city. We utilized two different multivariate statistical methods: Logistic Regression (LR), already widely used in applied geosciences, and Random Forest (RF), which has only recently been applied to landslide susceptibility analysis. With reference to each mapping unit, the first method (LR) results in a probability of landslide occurrence, while the second one (RF) gives a prediction in terms of the percentage of area susceptible to slope failure. With this aim in mind, a landslide inventory map (related to the studied event) was drawn up through analyses of high-resolution GeoEye satellite images in a GIS environment. Data layers of 11 causative factors were created and processed in order to be used as continuous numerical or discrete categorical variables in the statistical analysis. In particular, the logistic regression method has frequent difficulties in managing numerical continuous and discrete categorical variables together; therefore we tried different ways of processing the categorical variables until we obtained a statistically significant model. The outcomes of the two statistical methods (RF and LR) were tested with a spatial validation and yielded two susceptibility maps. The significance of the models is quantified in terms of the Area Under the ROC Curve (AUC of 0.81 for the RF model and 0.72 for the LR model). In the first instance, a graphical comparison of the two methods shows a good correspondence between them. Further, we integrated the results into a single susceptibility map which retains both the probability of occurrence (from LR) and the percentage of area of landslide detachment (from RF). In fact, in view of a landslide susceptibility classification of the study area, the former is less accurate but gives easily classifiable results, while the latter is more accurate but its results can only be classified subjectively. The resulting "integrated" susceptibility map preserves information about the probability that a given percentage of area could fail within each mapping unit.
Sevenster, Merlijn; Bozeman, Jeffrey; Cowhy, Andrea; Trost, William
2015-02-01
To standardize and objectivize treatment response assessment in oncology, guidelines have been proposed that are driven by radiological measurements, which are typically communicated in free-text reports defying automated processing. We study through inter-annotator agreement and natural language processing (NLP) algorithm development the task of pairing measurements that quantify the same finding across consecutive radiology reports, such that each measurement is paired with at most one other ("partial uniqueness"). Ground truth is created based on 283 abdomen and 311 chest CT reports of 50 patients each. A pre-processing engine segments reports and extracts measurements. Thirteen features are developed based on volumetric similarity between measurements, semantic similarity between their respective narrative contexts and structural properties of their report positions. A Random Forest classifier (RF) integrates all features. A "mutual best match" (MBM) post-processor ensures partial uniqueness. In an end-to-end evaluation, RF has precision 0.841, recall 0.807, F-measure 0.824 and AUC 0.971; with MBM, which performs above chance level (P<0.001), it has precision 0.899, recall 0.776, F-measure 0.833 and AUC 0.935. RF (RF+MBM) has error-free performance on 52.7% (57.4%) of report pairs. Inter-annotator agreement of three domain specialists with the ground truth (κ>0.960) indicates that the task is well defined. Domain properties and inter-section differences are discussed to explain superior performance in abdomen. Enforcing partial uniqueness has mixed but minor effects on performance. A combined machine learning-filtering approach is proposed for pairing measurements, which can support prospective (supporting treatment response assessment) and retrospective purposes (data mining). Copyright © 2014 Elsevier Inc. All rights reserved.
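The "mutual best match" post-processing step is easy to sketch: given the RF pairing scores between measurements of two consecutive reports, keep only pairs that pick each other as their top match (the threshold and array layout are illustrative, not details from the paper):

```python
import numpy as np

def mutual_best_match(scores, threshold=0.5):
    """scores[i, j]: RF pairing probability between measurement i of the
    earlier report and measurement j of the later report. Returns pairs
    (i, j) that are each other's best match, enforcing partial uniqueness."""
    scores = np.asarray(scores, dtype=float)
    pairs = []
    for i in range(scores.shape[0]):
        j = int(np.argmax(scores[i]))
        if scores[i, j] >= threshold and int(np.argmax(scores[:, j])) == i:
            pairs.append((i, j))
    return pairs
```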
Comparison of four statistical and machine learning methods for crash severity prediction.
Iranitalab, Amirfarrokh; Khattak, Aemal
2017-11-01
Crash severity prediction models enable different agencies to predict the severity of a reported crash with unknown severity or the severity of crashes that may be expected to occur sometime in the future. This paper had three main objectives: comparison of the performance of four statistical and machine learning methods including Multinomial Logit (MNL), Nearest Neighbor Classification (NNC), Support Vector Machines (SVM) and Random Forests (RF), in predicting traffic crash severity; developing a crash costs-based approach for comparison of crash severity prediction methods; and investigating the effects of data clustering methods comprising K-means Clustering (KC) and Latent Class Clustering (LCC), on the performance of crash severity prediction models. The 2012-2015 reported crash data from Nebraska, United States was obtained and two-vehicle crashes were extracted as the analysis data. The dataset was split into training/estimation (2012-2014) and validation (2015) subsets. The four prediction methods were trained/estimated using the training/estimation dataset, and the correct prediction rates for each crash severity level, the overall correct prediction rate and a proposed crash costs-based accuracy measure were obtained for the validation dataset. The correct prediction rates and the proposed approach showed that NNC had the best prediction performance overall and for more severe crashes. RF and SVM had the next best performances, and MNL was the weakest method. Data clustering did not affect the prediction results of SVM, but KC improved the prediction performance of MNL, NNC and RF, while LCC caused improvement in MNL and RF but weakened the performance of NNC. The overall correct prediction rate gave almost the exact opposite results compared to the proposed approach, showing that neglecting the crash costs can lead to misjudgment in choosing the right prediction method. Copyright © 2017 Elsevier Ltd. All rights reserved.
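One plausible reading of the crash-costs-based accuracy measure is the share of total crash cost that falls on correctly predicted crashes; the sketch below uses hypothetical unit costs, and the paper's exact definition may differ:

```python
import numpy as np

def cost_weighted_accuracy(y_true, y_pred, unit_costs):
    """Fraction of total crash cost attached to correctly classified crashes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    costs = np.array([unit_costs[s] for s in y_true], dtype=float)
    return float(costs[y_true == y_pred].sum() / costs.sum())

# hypothetical usage with made-up cost figures:
# cost_weighted_accuracy(y_val, model.predict(X_val),
#                        {"fatal": 1_500_000, "injury": 80_000, "pdo": 4_000})
```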
Ichikawa, Daisuke; Saito, Toki; Ujita, Waka; Oyama, Hiroshi
2016-12-01
Our purpose was to develop a new machine-learning approach (a virtual health check-up) toward identification of those at high risk of hyperuricemia. Applying the system to general health check-ups is expected to reduce medical costs compared with administering an additional test. Data were collected during annual health check-ups performed in Japan between 2011 and 2013 (inclusive). We prepared training and test datasets from the health check-up data to build prediction models; these were composed of 43,524 and 17,789 persons, respectively. Gradient-boosting decision tree (GBDT), random forest (RF), and logistic regression (LR) approaches were trained using the training dataset and were then used to predict hyperuricemia in the test dataset. Undersampling was applied to build the prediction models to deal with the imbalanced class dataset. The results showed that the RF and GBDT approaches afforded the best performances in terms of sensitivity and specificity, respectively. The area under the curve (AUC) values of the models, which reflected the total discriminative ability of the classification, were 0.796 [95% confidence interval (CI): 0.766-0.825] for the GBDT, 0.784 [95% CI: 0.752-0.815] for the RF, and 0.785 [95% CI: 0.752-0.819] for the LR approaches. No significant differences were observed between any pair of approaches. Small changes occurred in the AUCs after applying undersampling to build the models. We developed a virtual health check-up that predicted the development of hyperuricemia using machine-learning methods. The GBDT, RF, and LR methods had similar predictive capability. Undersampling did not remarkably improve predictive power. Copyright © 2016 Elsevier Inc. All rights reserved.
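A compact, hedged sketch of the undersampling-plus-model-comparison workflow (scikit-learn assumed; hyperparameters are illustrative, and the positive class, hyperuricemia, is assumed to be the minority class coded as 1):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def undersample(X, y, seed=0):
    """Randomly drop majority-class (y == 0) rows to balance the classes."""
    rng = np.random.default_rng(seed)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    idx = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])
    return X[idx], y[idx]

def compare_models(X_train, y_train, X_test, y_test):
    models = {"GBDT": GradientBoostingClassifier(random_state=0),
              "RF": RandomForestClassifier(n_estimators=300, random_state=0),
              "LR": LogisticRegression(max_iter=1000)}
    Xb, yb = undersample(X_train, y_train)
    return {name: roc_auc_score(y_test, m.fit(Xb, yb).predict_proba(X_test)[:, 1])
            for name, m in models.items()}
```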
Wang, Jingzhe; Ding, Jianli; Abulimiti, Aerzuna; Cai, Lianghong
2018-01-01
Soil salinization is one of the most common forms of land degradation. The detection and assessment of soil salinity is critical for the prevention of environmental deterioration especially in arid and semi-arid areas. This study introduced the fractional derivative in the pretreatment of visible and near infrared (VIS-NIR) spectroscopy. The soil samples (n = 400) collected from the Ebinur Lake Wetland, Xinjiang Uyghur Autonomous Region (XUAR), China, were used as the dataset. After measuring the spectral reflectance and salinity in the laboratory, the raw spectral reflectance was preprocessed by means of the absorbance and the fractional derivative order in the range of 0.0-2.0 order with an interval of 0.1. Two different modeling methods, namely, partial least squares regression (PLSR) and random forest (RF) with preprocessed reflectance were used for quantifying soil salinity. The results showed that more spectral characteristics were refined for the spectrum reflectance treated via fractional derivative. The validation accuracies showed that RF models performed better than those of PLSR. The most effective model was established based on RF with the 1.5 order derivative of absorbance with the optimal values of R2 (0.93), RMSE (4.57 dS m-1), and RPD (2.78 ≥ 2.50). The developed RF model was stable and accurate in the application of spectral reflectance for determining the soil salinity of the Ebinur Lake wetland. The pretreatment of fractional derivative could be useful for monitoring multiple soil parameters with higher accuracy, which could effectively help to analyze the soil salinity.
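A hedged sketch of a Grünwald-Letnikov-style fractional derivative applied to absorbance spectra before RF regression; whether this discretization matches the paper's exact formulation is an assumption, and the variable names are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fractional_derivative(spectra, order, step=1.0):
    """Grünwald-Letnikov fractional derivative of each spectrum (one row per sample)."""
    spectra = np.asarray(spectra, dtype=float)
    n_bands = spectra.shape[1]
    # recursive GL weights: w_0 = 1, w_k = w_{k-1} * (k - 1 - order) / k
    w = np.empty(n_bands)
    w[0] = 1.0
    for k in range(1, n_bands):
        w[k] = w[k - 1] * (k - 1 - order) / k
    out = np.zeros_like(spectra)
    for i in range(n_bands):
        # derivative at band i: sum over k of w_k * f[i - k]
        out[:, i] = spectra[:, :i + 1][:, ::-1] @ w[:i + 1]
    return out / step ** order

# hypothetical usage: 1.5-order derivative of absorbance log10(1/R), then RF on salinity
# A = np.log10(1.0 / reflectance)
# rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(fractional_derivative(A, 1.5), salinity)
```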
Permutation importance: a corrected feature importance measure.
Altmann, André; Toloşi, Laura; Sander, Oliver; Lengauer, Thomas
2010-05-15
In life sciences, interpretability of machine learning models is as important as their prediction accuracy. Linear models are probably the most frequently used methods for assessing feature relevance, despite their relative inflexibility. However, in the past years effective estimators of feature relevance have been derived for highly complex or non-parametric models such as support vector machines and Random Forest (RF) models. Recently, it has been observed that RF models are biased in such a way that categorical variables with a large number of categories are preferred. In this work, we introduce a heuristic for normalizing feature importance measures that can correct the feature importance bias. The method is based on repeated permutations of the outcome vector for estimating the distribution of measured importance for each variable in a non-informative setting. The P-value of the observed importance provides a corrected measure of feature importance. We apply our method to simulated data and demonstrate that (i) non-informative predictors do not receive significant P-values, (ii) informative variables can successfully be recovered among non-informative variables and (iii) P-values computed with permutation importance (PIMP) are very helpful for deciding the significance of variables, and therefore improve model interpretability. Furthermore, PIMP was used to correct RF-based importance measures for two real-world case studies. We propose an improved RF model that uses the significant variables with respect to the PIMP measure and show that its prediction accuracy is superior to that of other existing models. R code for the method presented in this article is available at http://www.mpi-inf.mpg.de/~altmann/download/PIMP.R CONTACT: altmann@mpi-inf.mpg.de, laura.tolosi@mpi-inf.mpg.de Supplementary data are available at Bioinformatics online.
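The core PIMP procedure can be sketched as follows; the published method fits a parametric null distribution to the permuted importances, so the simple exceedance count below is only a hedged approximation:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pimp_p_values(X, y, n_perm=100, seed=0):
    """P-value per feature: how often its RF (Gini) importance under
    permuted outcomes reaches the importance observed on the real outcome."""
    rng = np.random.default_rng(seed)
    rf = RandomForestClassifier(n_estimators=300, random_state=seed)
    observed = rf.fit(X, y).feature_importances_
    null = np.empty((n_perm, X.shape[1]))
    for b in range(n_perm):
        null[b] = rf.fit(X, rng.permutation(y)).feature_importances_
    return (np.sum(null >= observed, axis=0) + 1) / (n_perm + 1)
```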
Improved supervised classification of accelerometry data to distinguish behaviors of soaring birds.
Sur, Maitreyi; Suffredini, Tony; Wessells, Stephen M; Bloom, Peter H; Lanzone, Michael; Blackshire, Sheldon; Sridhar, Srisarguru; Katzner, Todd
2017-01-01
Soaring birds can balance the energetic costs of movement by switching between flapping, soaring and gliding flight. Accelerometers can allow quantification of flight behavior and thus a context to interpret these energetic costs. However, models to interpret accelerometry data are still being developed, rarely trained with supervised datasets, and difficult to apply. We collected accelerometry data at 140 Hz from a trained golden eagle (Aquila chrysaetos) whose flight we recorded with video that we used to characterize behavior. We applied two forms of supervised classification, random forest (RF) models and K-nearest neighbor (KNN) models. The KNN model was substantially easier to implement than the RF approach, but both were highly accurate in classifying basic behaviors such as flapping (85.5% and 83.6% accurate, respectively), soaring (92.8% and 87.6%) and sitting (84.1% and 88.9%), with overall accuracies of 86.6% and 92.3%, respectively. More detailed classification schemes, with specific behaviors such as banking and straight flights, were well classified only by the KNN model (91.24% accurate; RF = 61.64% accurate). The RF model maintained its classification accuracy of basic behaviors at sampling frequencies as low as 10 Hz, and the KNN model at sampling frequencies as low as 20 Hz. Classification of accelerometer data collected from free-ranging birds demonstrated a strong dependence of predicted behavior on the type of classification model used. Our analyses demonstrate the consequence of different approaches to classification of accelerometry data, the potential to optimize classification algorithms with validated flight behaviors to improve classification accuracy, ideal sampling frequencies for different classification algorithms, and a number of ways to improve commonly used analytical techniques and best practices for classification of accelerometry data.
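A hedged sketch of the model comparison described above: a random forest and a K-nearest-neighbour classifier scored by cross-validated accuracy on per-window accelerometry features. The feature layout, class labels, and synthetic data are placeholders standing in for the labelled eagle recordings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 9))              # toy per-window features (e.g. mean, sd, dominant frequency per axis)
y = rng.integers(0, 3, size=600)           # toy labels: 0 = flapping, 1 = soaring, 2 = sitting

for name, clf in [("RF", RandomForestClassifier(n_estimators=500, random_state=0)),
                  ("KNN", KNeighborsClassifier(n_neighbors=5))]:
    acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    print(f"{name}: accuracy = {acc.mean():.3f} +/- {acc.std():.3f}")
```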
Zeng, Qinghui; Liu, Yi; Zhao, Hongtao; Sun, Mingdong; Li, Xuyong
2017-04-01
Inter-basin water transfer projects might cause complex hydro-chemical and biological variation in the receiving aquatic ecosystems. Whether machine learning models can be used to predict changes in phytoplankton community composition caused by water transfer projects has rarely been studied. In the present study, we used machine learning models to predict the total algal cell densities and changes in phytoplankton community composition in Miyun reservoir caused by the middle route of the South-to-North Water Transfer Project (SNWTP). The performances of four machine learning models, including regression trees (RT), random forest (RF), support vector machine (SVM), and artificial neural network (ANN), were evaluated and the best model was selected for further prediction. The results showed that the predictive accuracies (Pearson's correlation coefficient) of the models were RF (0.974), ANN (0.951), SVM (0.860), and RT (0.817) in the training step and RF (0.806), ANN (0.734), SVM (0.730), and RT (0.692) in the testing step. Therefore, the RF model was the best method for estimating total algal cell densities. Furthermore, the predicted accuracies of the RF model for dominant phytoplankton phyla (Cyanophyta, Chlorophyta, and Bacillariophyta) in Miyun reservoir ranged from 0.824 to 0.869 in the testing step. The predicted changes in the proportions of the different phytoplankton phyla with water transfer ranged from -8.88% to 9.93%, and the predicted dominant phyla with water transfer in each season remained unchanged compared to the phytoplankton succession without water transfer. The results of the present study provide a useful tool for predicting the changes in the phytoplankton community caused by water transfer. The method is transferable to other locations by establishing models with data relevant to a particular area. Our findings help in better understanding the possible changes in aquatic ecosystems influenced by inter-basin water transfer. Copyright © 2017 Elsevier Ltd. All rights reserved.
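A minimal sketch of the evaluation scheme described above, assuming a random forest regressor scored by Pearson's correlation on training and testing splits; the predictors and the toy response stand in for the reservoir's water-quality data and algal cell densities.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 8))                                        # e.g. nutrients, temperature, flow (placeholders)
y = 2 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.5, size=400)     # toy algal cell density

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_tr, y_tr)
r_train = pearsonr(y_tr, rf.predict(X_tr))[0]
r_test = pearsonr(y_te, rf.predict(X_te))[0]
print(f"training r = {r_train:.3f}, testing r = {r_test:.3f}")
```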
NASA Astrophysics Data System (ADS)
Valdez Vasquez, M. C.; Chen, C. F.; Chiang, S. H.
2016-12-01
Forests in Honduras are one of the most important resources as they provide a wide range of environmental, economic, and social benefits. However, they are endangered as a result of the relentless occurrence of wildfires during the dry season. Despite the knowledge acquired by the population concerning the effects of wildfires, the frequency is increasing, a pattern attributable to the numerous ignition sources linked to human activity. The purpose of this study is to integrate the wildfire occurrences throughout the 2010-2015 period with a series of anthropogenic and non-anthropogenic variables using the random forest algorithm (RF). We use a series of variables that represent the anthropogenic activity, the flammability of vegetation, climatic conditions, and topography. To represent the anthropogenic activity, we included the continuous distances to rivers, roads, and settlements. To characterize the vegetation flammability, we used the normalized difference vegetation index (NDVI) and the normalized multi-band drought index (NMDI) acquired from MODIS surface reflectance data. Additionally, we included the topographical variables elevation, slope, and solar radiation derived from the ASTER global digital elevation model (GDEM V2). To represent the climatic conditions, we employed the land surface temperature (LST) product from the MODIS sensor and the WorldClim precipitation data. We analyzed the explanatory variables through native RF variable importance analysis and jackknife test, and the results revealed that the dry fuel conditions and low precipitation combined with the proximity to non-paved roads were the major drivers of wildfires. Furthermore, we predicted the areas with highest wildfire susceptibility, which are located mainly in the central and eastern regions of the country, within coniferous and mixed forests. Results acquired were validated using the area under the receiver operating characteristic (ROC) curve and the point biserial correlation and both validation metrics showed satisfactory agreement with the test data. Predictions of forest fire risk and its spatial variability are important instruments for proper management and the results acquired can lead to enhanced preventive measures to minimize risk and reduce the impacts caused by wildfires.
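A hedged sketch of the kind of workflow summarized above: a random forest trained on fire/no-fire samples with anthropogenic, vegetation, climatic and topographic predictors, followed by variable-importance ranking and ROC-based validation. The predictor names and simulated data are illustrative only, not the study's rasters.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

names = ["dist_road", "dist_river", "dist_settlement", "NDVI", "NMDI",
         "elevation", "slope", "solar_rad", "LST", "precip"]
rng = np.random.default_rng(3)
X = rng.normal(size=(2000, len(names)))
y = ((X[:, 9] < 0) & (X[:, 4] < 0.5)).astype(int)     # toy rule: low precipitation + dry fuel burns

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)

auc = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])
print(f"ROC AUC = {auc:.3f}")
for imp, name in sorted(zip(rf.feature_importances_, names), reverse=True)[:5]:
    print(f"{name:16s} {imp:.3f}")
```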
Recognising discourse causality triggers in the biomedical domain.
Mihăilă, Claudiu; Ananiadou, Sophia
2013-12-01
Current domain-specific information extraction systems represent an important resource for biomedical researchers, who need to process vast amounts of knowledge in a short time. Automatic discourse causality recognition can further reduce their workload by suggesting possible causal connections and aiding in the curation of pathway models. We describe here an approach to the automatic identification of discourse causality triggers in the biomedical domain using machine learning. We create several baselines and experiment with and compare various parameter settings for three algorithms, i.e. Conditional Random Fields (CRF), Support Vector Machines (SVM) and Random Forests (RF). We also evaluate the impact of lexical, syntactic, and semantic features on each of the algorithms, showing that semantics improves the performance in all cases. We test our comprehensive feature set on two corpora containing gold standard annotations of causal relations, and demonstrate the need for more gold standard data. The best performance of 79.35% F-score is achieved by CRFs when using all three feature types.
USDA-ARS?s Scientific Manuscript database
The American Academy of Pediatrics and World Health Organization recommend responsive feeding (RF) to promote healthy eating behaviors in early childhood. This project developed and tested a vicarious learning video to teach parents RF practices. A RF vicarious learning video was developed using com...
DOE Office of Scientific and Technical Information (OSTI.GOV)
Liu, Chang; Deng, Na; Wang, Haimin
Adverse space-weather effects can often be traced to solar flares, the prediction of which has drawn significant research interests. The Helioseismic and Magnetic Imager (HMI) produces full-disk vector magnetograms with continuous high cadence, while flare prediction efforts utilizing this unprecedented data source are still limited. Here we report results of flare prediction using physical parameters provided by the Space-weather HMI Active Region Patches (SHARP) and related data products. We survey X-ray flares that occurred from 2010 May to 2016 December and categorize their source regions into four classes (B, C, M, and X) according to the maximum GOES magnitude of flares they generated. We then retrieve SHARP-related parameters for each selected region at the beginning of its flare date to build a database. Finally, we train a machine-learning algorithm, called random forest (RF), to predict the occurrence of a certain class of flares in a given active region within 24 hr, evaluate the classifier performance using the 10-fold cross-validation scheme, and characterize the results using standard performance metrics. Compared to previous works, our experiments indicate that using the HMI parameters and RF is a valid method for flare forecasting with fairly reasonable prediction performance. To our knowledge, this is the first time that RF has been used to make multiclass predictions of solar flares. We also find that the total unsigned quantities of vertical current, current helicity, and flux near the polarity inversion line are among the most important parameters for classifying flaring regions into different classes.
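A minimal sketch of the classification setup described above: SHARP-style magnetic parameters as features, four flare classes (B, C, M, X) as labels, and a random forest evaluated with 10-fold cross-validation. The chosen parameter names and simulated values are stand-ins for the actual SHARP database.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import classification_report

features = ["USFLUX", "TOTUSJZ", "TOTUSJH", "R_VALUE", "MEANSHR"]   # a few SHARP keywords, chosen for illustration
rng = np.random.default_rng(4)
X = rng.normal(size=(800, len(features)))
y = rng.choice(list("BCMX"), size=800, p=[0.5, 0.3, 0.15, 0.05])    # toy class frequencies

rf = RandomForestClassifier(n_estimators=500, class_weight="balanced", random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
pred = cross_val_predict(rf, X, y, cv=cv)
print(classification_report(y, pred, zero_division=0))
```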
HLPI-Ensemble: Prediction of human lncRNA-protein interactions based on ensemble strategy.
Hu, Huan; Zhang, Li; Ai, Haixin; Zhang, Hui; Fan, Yetian; Zhao, Qi; Liu, Hongsheng
2018-03-27
LncRNAs play important roles in many biological processes and in disease progression by binding to related proteins. However, the experimental methods for studying lncRNA-protein interactions are time-consuming and expensive. Although there are a few models designed to predict ncRNA-protein interactions, they all have some common drawbacks that limit their predictive performance. In this study, we present a model called HLPI-Ensemble designed specifically for human lncRNA-protein interactions. HLPI-Ensemble adopts an ensemble strategy based on three mainstream machine learning algorithms, Support Vector Machines (SVM), Random Forests (RF) and Extreme Gradient Boosting (XGB), to generate HLPI-SVM Ensemble, HLPI-RF Ensemble and HLPI-XGB Ensemble, respectively. The results of 10-fold cross-validation show that HLPI-SVM Ensemble, HLPI-RF Ensemble and HLPI-XGB Ensemble achieved AUCs of 0.95, 0.96 and 0.96, respectively, in the test dataset. Furthermore, we compared the performance of the HLPI-Ensemble models with the previous models on an external validation dataset. The results show that the false positives (FPs) of the HLPI-Ensemble models are much lower than those of the previous models, and other evaluation indicators of the HLPI-Ensemble models are also higher than those of the previous models. This further shows that the HLPI-Ensemble models are superior in predicting human lncRNA-protein interactions compared with previous models. HLPI-Ensemble is publicly available at: http://ccsipb.lnu.edu.cn/hlpiensemble/ .
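A hedged sketch of the three-model comparison described above, scored by 10-fold cross-validated AUC. Scikit-learn's GradientBoostingClassifier stands in for XGBoost here, and the feature vectors for lncRNA-protein pairs are simulated placeholders.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=50, n_informative=10, random_state=0)

models = {
    "SVM-based model": SVC(),
    "RF-based model": RandomForestClassifier(n_estimators=500, random_state=0),
    "boosting stand-in for XGB": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=10, scoring="roc_auc")
    print(f"{name}: AUC = {auc.mean():.3f}")
```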
Comparison of stream invertebrate response models for bioassessment metric
Waite, Ian R.; Kennen, Jonathan G.; May, Jason T.; Brown, Larry R.; Cuffney, Thomas F.; Jones, Kimberly A.; Orlando, James L.
2012-01-01
We aggregated invertebrate data from various sources to assemble data for modeling in two ecoregions in Oregon and one in California. Our goal was to compare the performance of models developed using multiple linear regression (MLR) techniques with models developed using three relatively new techniques: classification and regression trees (CART), random forest (RF), and boosted regression trees (BRT). We used tolerance of taxa based on richness (RICHTOL) and ratio of observed to expected taxa (O/E) as response variables and land use/land cover as explanatory variables. Responses were generally linear; therefore, there was little improvement to the MLR models when compared to models using CART and RF. In general, the four modeling techniques (MLR, CART, RF, and BRT) consistently selected the same primary explanatory variables for each region. However, results from the BRT models showed significant improvement over the MLR models for each region; increases in R2 from 0.09 to 0.20. The O/E metric that was derived from models specifically calibrated for Oregon consistently had lower R2 values than RICHTOL for the two regions tested. Modeled O/E R2 values were between 0.06 and 0.10 lower for each of the four modeling methods applied in the Willamette Valley and were between 0.19 and 0.36 points lower for the Blue Mountains. As a result, BRT models may indeed represent a good alternative to MLR for modeling species distribution relative to environmental variables.
Marini, C; Fossa, F; Paoli, C; Bellingeri, M; Gnone, G; Vassallo, P
2015-03-01
Habitat modeling is an important tool to investigate the quality of the habitat for a species within a certain area, to predict species distribution and to understand the ecological processes behind it. Many species have been investigated by means of habitat modeling techniques mainly to address effective management and protection policies and cetaceans play an important role in this context. The bottlenose dolphin (Tursiops truncatus) has been investigated with habitat modeling techniques since 1997. The objectives of this work were to predict the distribution of bottlenose dolphin in a coastal area through the use of static morphological features and to compare the prediction performances of three different modeling techniques: Generalized Linear Model (GLM), Generalized Additive Model (GAM) and Random Forest (RF). Four static variables were tested: depth, bottom slope, distance from 100 m bathymetric contour and distance from coast. RF revealed itself both the most accurate and the most precise modeling technique with very high distribution probabilities predicted in presence cells (90.4% of mean predicted probabilities) and with 66.7% of presence cells with a predicted probability comprised between 90% and 100%. The bottlenose distribution obtained with RF allowed the identification of specific areas with particularly high presence probability along the coastal zone; the recognition of these core areas may be the starting point to develop effective management practices to improve T. truncatus protection. Copyright © 2014 Elsevier Ltd. All rights reserved.
Predicting the trajectories and intensities of hurricanes by applying machine learning techniques
NASA Astrophysics Data System (ADS)
Sujithkumar, A.; King, A. W.; Kovilakam, M.; Graves, D.
2017-12-01
The world has witnessed an escalation of devastating hurricanes and tropical cyclones over the last three decades. Hurricanes and tropical cyclones of very high magnitude will likely be even more frequent in a warmer world. Thus, precise forecasting of the track and intensity of hurricanes and tropical cyclones remains one of the meteorological community's top priorities. However, comprehensive prediction of hurricanes and tropical cyclones is a difficult problem because of the complexity of the underlying physical processes, which involve many variables and intricate relations. The availability of global meteorological and hurricane/tropical storm climatological data opens new opportunities for data-driven approaches to hurricane/tropical cyclone modeling. Here we report initial results from two data-driven machine learning techniques, specifically random forest (RF) and Bayesian learning (BL), to predict the trajectory and intensity of hurricanes and tropical cyclones. We used International Best Track Archive for Climate Stewardship (IBTrACS) data along with weather data from NOAA in a 50 km buffer surrounding each of the reported hurricane and tropical cyclone tracks to train the model. Initial results reveal that both RF and BL are skillful in predicting storm intensity. We will also present results for the more complicated trajectory prediction.
NASA Astrophysics Data System (ADS)
Aytaç Korkmaz, Sevcan; Binol, Hamidullah
2018-03-01
Deaths from stomach cancer still occur, and early diagnosis is crucial for reducing the mortality rate of cancer patients. Therefore, computer-aided methods for early detection are developed in this article. Stomach cancer images were obtained from the Fırat University Medical Faculty Pathology Department. The Local Binary Patterns (LBP) and Histogram of Oriented Gradients (HOG) features of these images are calculated. At the same time, Sammon mapping, Stochastic Neighbor Embedding (SNE), Isomap, classical multidimensional scaling (MDS), Local Linear Embedding (LLE), Linear Discriminant Analysis (LDA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and Laplacian Eigenmaps are used for dimensionality reduction of the features. The high dimension of these features is thereby reduced to lower dimensions. Artificial neural network (ANN) and Random Forest (RF) classifiers were then used to classify the stomach cancer images with these new, lower-dimensional feature sets. New medical systems were developed to measure the effect of dimensionality by obtaining features at different dimensions with the dimensionality reduction methods. When all the developed methods are compared, the best accuracy results are obtained with the LBP_MDS_ANN and LBP_LLE_ANN methods.
Detecting Parkinson's disease from sustained phonation and speech signals.
Vaiciukynas, Evaldas; Verikas, Antanas; Gelzinis, Adas; Bacauskiene, Marija
2017-01-01
This study investigates signals from sustained phonation and text-dependent speech modalities for Parkinson's disease screening. Phonation corresponds to the vowel /a/ voicing task and speech to the pronunciation of a short sentence in Lithuanian language. Signals were recorded through two channels simultaneously, namely, acoustic cardioid (AC) and smart phone (SP) microphones. Additional modalities were obtained by splitting speech recording into voiced and unvoiced parts. Information in each modality is summarized by 18 well-known audio feature sets. Random forest (RF) is used as a machine learning algorithm, both for individual feature sets and for decision-level fusion. Detection performance is measured by the out-of-bag equal error rate (EER) and the cost of log-likelihood-ratio. Essentia audio feature set was the best using the AC speech modality and YAAFE audio feature set was the best using the SP unvoiced modality, achieving EER of 20.30% and 25.57%, respectively. Fusion of all feature sets and modalities resulted in EER of 19.27% for the AC and 23.00% for the SP channel. Non-linear projection of a RF-based proximity matrix into the 2D space enriched medical decision support by visualization.
Men, Hong; Shi, Yan; Fu, Songlin; Jiao, Yanan; Qiao, Yu; Liu, Jingjing
2017-01-01
Multi-sensor data fusion can provide more comprehensive and more accurate analysis results. However, it also brings some redundant information, which is an important issue with respect to finding a feature-mining method for intuitive and efficient analysis. This paper demonstrates a feature-mining method based on variable accumulation to find the best expression form and variables’ behavior affecting beer flavor. First, e-tongue and e-nose were used to gather the taste and olfactory information of beer, respectively. Second, principal component analysis (PCA), genetic algorithm-partial least squares (GA-PLS), and variable importance of projection (VIP) scores were applied to select feature variables of the original fusion set. Finally, the classification models based on support vector machine (SVM), random forests (RF), and extreme learning machine (ELM) were established to evaluate the efficiency of the feature-mining method. The result shows that the feature-mining method based on variable accumulation obtains the main feature affecting beer flavor information, and the best classification performance for the SVM, RF, and ELM models with 96.67%, 94.44%, and 98.33% prediction accuracy, respectively. PMID:28753917
Hu, Jianfeng
2017-01-01
Driver fatigue has become an important factor in traffic accidents worldwide, and effective detection of driver fatigue has major significance for public health. The proposed method employs entropy measures for feature extraction from a single electroencephalogram (EEG) channel. Four types of entropy measures, sample entropy (SE), fuzzy entropy (FE), approximate entropy (AE), and spectral entropy (PE), were deployed for the analysis of the original EEG signal and compared across ten state-of-the-art classifiers. Results indicate that the optimal single-channel performance is achieved using a combination of channel CP4, feature FE, and the Random Forest (RF) classifier. The highest accuracy reaches 96.6%, which meets the needs of real applications. The best combination of channel, feature, and classifier is subject-specific. In this work, the accuracy with FE as the feature is far greater than that of the other features. The accuracy using the RF classifier is the best, while that of the SVM classifier with a linear kernel is the worst. Channel selection has a larger impact on accuracy, and the performance of the various channels differs considerably.
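A hedged sketch of one piece of the pipeline described above: computing sample entropy from single-channel EEG windows and classifying the windows with a random forest. The entropy implementation, window layout, and toy labels are simplified stand-ins for the study's data and full feature set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def sample_entropy(x, m=2, r_factor=0.2):
    """Richman-Moorman sample entropy with Chebyshev distance."""
    x = np.asarray(x, dtype=float)
    r = r_factor * np.std(x)
    n = len(x)
    def matches(length):
        templ = np.array([x[i:i + length] for i in range(n - m)])
        dist = np.max(np.abs(templ[:, None, :] - templ[None, :, :]), axis=2)
        return np.sum(dist <= r) - len(templ)          # exclude self-matches
    return -np.log(matches(m + 1) / matches(m))

rng = np.random.default_rng(9)
windows = rng.normal(size=(200, 256))                  # 200 single-channel EEG windows (placeholder signals)
labels = rng.integers(0, 2, size=200)                  # toy labels: 0 = alert, 1 = fatigued

features = np.array([[sample_entropy(w)] for w in windows])
rf = RandomForestClassifier(n_estimators=300, random_state=0)
print(cross_val_score(rf, features, labels, cv=5).mean())
```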
Robust face alignment under occlusion via regional predictive power estimation.
Heng Yang; Xuming He; Xuhui Jia; Patras, Ioannis
2015-08-01
Face alignment has been well studied in recent years, however, when a face alignment model is applied on facial images with heavy partial occlusion, the performance deteriorates significantly. In this paper, instead of training an occlusion-aware model with visibility annotation, we address this issue via a model adaptation scheme that uses the result of a local regression forest (RF) voting method. In the proposed scheme, the consistency of the votes of the local RF in each of several oversegmented regions is used to determine the reliability of predicting the location of the facial landmarks. The latter is what we call regional predictive power (RPP). Subsequently, we adapt a holistic voting method (cascaded pose regression based on random ferns) by putting weights on the votes of each fern according to the RPP of the regions used in the fern tests. The proposed method shows superior performance over existing face alignment models in the most challenging data sets (COFW and 300-W). Moreover, it can also estimate with high accuracy (72.4% overlap ratio) which image areas belong to the face or nonface objects, on the heavily occluded images of the COFW data set, without explicit occlusion modeling.
An application of quantile random forests for predictive mapping of forest attributes
E.A. Freeman; G.G. Moisen
2015-01-01
Increasingly, random forest models are used in predictive mapping of forest attributes. Traditional random forests output the mean prediction from the random trees. Quantile regression forests (QRF) is an extension of random forests developed by Nicolai Meinshausen that provides non-parametric estimates of the median predicted value as well as prediction quantiles. It...
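A hedged sketch of the quantile idea described above: rather than returning only the forest's mean prediction, collect each tree's prediction and report percentiles. This per-tree summary is a simpler approximation than Meinshausen's exact quantile regression forests, but it illustrates the kind of output.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)

x_new = X[:3]
per_tree = np.stack([tree.predict(x_new) for tree in rf.estimators_])   # shape (n_trees, n_points)
median = np.percentile(per_tree, 50, axis=0)
lower, upper = np.percentile(per_tree, [5, 95], axis=0)
for m, lo, hi in zip(median, lower, upper):
    print(f"median {m:8.1f}   90% interval [{lo:8.1f}, {hi:8.1f}]")
```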
Mehra, Lucky K; Cowger, Christina; Gross, Kevin; Ojiambo, Peter S
2016-01-01
Pre-planting factors have been associated with the late-season severity of Stagonospora nodorum blotch (SNB), caused by the fungal pathogen Parastagonospora nodorum, in winter wheat (Triticum aestivum). The relative importance of these factors in the risk of SNB has not been determined and this knowledge can facilitate disease management decisions prior to planting of the wheat crop. In this study, we examined the performance of multiple regression (MR) and three machine learning algorithms namely artificial neural networks, categorical and regression trees, and random forests (RF), in predicting the pre-planting risk of SNB in wheat. Pre-planting factors tested as potential predictor variables were cultivar resistance, latitude, longitude, previous crop, seeding rate, seed treatment, tillage type, and wheat residue. Disease severity assessed at the end of the growing season was used as the response variable. The models were developed using 431 disease cases (unique combinations of predictors) collected from 2012 to 2014 and these cases were randomly divided into training, validation, and test datasets. Models were evaluated based on the regression of observed against predicted severity values of SNB, sensitivity-specificity ROC analysis, and the Kappa statistic. A strong relationship was observed between late-season severity of SNB and specific pre-planting factors in which latitude, longitude, wheat residue, and cultivar resistance were the most important predictors. The MR model explained 33% of variability in the data, while machine learning models explained 47 to 79% of the total variability. Similarly, the MR model correctly classified 74% of the disease cases, while machine learning models correctly classified 81 to 83% of these cases. Results show that the RF algorithm, which explained 79% of the variability within the data, was the most accurate in predicting the risk of SNB, with an accuracy rate of 93%. The RF algorithm could allow early assessment of the risk of SNB, facilitating sound disease management decisions prior to planting of wheat.
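A minimal sketch contrasting multiple regression with a random forest for pre-planting risk prediction, scored by R2 on held-out severity and by Cohen's kappa on a hypothetical high/low risk split at the training median. The predictors and simulated severities are placeholders, not the SNB survey data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, cohen_kappa_score

rng = np.random.default_rng(5)
X = rng.normal(size=(431, 8))           # e.g. latitude, longitude, wheat residue, cultivar resistance (placeholders)
y = 3 * X[:, 0] + X[:, 2] ** 2 + rng.normal(scale=1.0, size=431)   # toy late-season severity

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
threshold = np.median(y_tr)
for name, model in [("MR", LinearRegression()),
                    ("RF", RandomForestRegressor(n_estimators=500, random_state=0))]:
    pred = model.fit(X_tr, y_tr).predict(X_te)
    kappa = cohen_kappa_score(np.digitize(y_te, [threshold]), np.digitize(pred, [threshold]))
    print(f"{name}: R2 = {r2_score(y_te, pred):.2f}, kappa = {kappa:.2f}")
```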
Quraishi, B M; Zhang, H; Everson, T M; Ray, M; Lockett, G A; Holloway, J W; Tetali, S R; Arshad, S H; Kaushal, A; Rezwan, F I; Karmaus, W
2015-01-01
The prevalence of eczema is increasing in industrialized nations. Limited evidence has shown the association of DNA methylation (DNA-M) with eczema. We explored this association at the epigenome-scale to better understand the role of DNA-M. Data from the first generation (F1) of the Isle of Wight (IoW) birth cohort participants and the second generation (F2) were examined in our study. Epigenome-scale DNA methylation of F1 at age 18 years and F2 in cord blood was measured using the Illumina Infinium HumanMethylation450 Beadchip. A total of 307,357 cytosine-phosphate-guanine sites (CpGs) in the F1 generation were screened via recursive random forest (RF) for their potential association with eczema at age 18. Functional enrichment and pathway analysis of resulting genes were carried out using DAVID gene functional classification tool. Log-linear models were performed in F1 to corroborate the identified CpGs. Findings in F1 were further replicated in F2. The recursive RF yielded 140 CpGs, 88 of which showed statistically significant associations with eczema at age 18, corroborated by log-linear models after controlling for false discovery rate (FDR) of 0.05. These CpGs were enriched among many biological pathways, including pathways related to creating transcriptional variety and pathways mechanistically linked to eczema such as cadherins, cell adhesion, gap junctions, tight junctions, melanogenesis, and apoptosis. In the F2 generation, about half of the 83 CpGs identified in F1 showed the same direction of association with eczema risk as in F1, of which two CpGs were significantly associated with eczema risk, cg04850479 of the PROZ gene (risk ratio (RR) = 15.1 in F1, 95 % confidence interval (CI) 1.71, 79.5; RR = 6.82 in F2, 95 % CI 1.52, 30.62) and cg01427769 of the NEU1 gene (RR = 0.13 in F1, 95 % CI 0.03, 0.46; RR = 0.09 in F2, 95 % CI 0.03, 0.36). Via epigenome-scaled analyses using recursive RF followed by log-linear models, we identified 88 CpGs associated with eczema in F1, of which 41 were replicated in F2. Several identified CpGs are located within genes in biological pathways relating to skin barrier integrity, which is central to the pathogenesis of eczema. Novel genes associated with eczema risk were identified (e.g., the PROZ and NEU1 genes).
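A hedged sketch of the recursive random-forest screening described above: fit a forest, keep the most important half of the features, and repeat until a small candidate set remains. The stopping size of 140 mirrors the abstract, but the loop and the simulated "CpG" matrix are simplified illustrations, not the authors' exact protocol.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=2000, n_informative=20, random_state=0)
kept = np.arange(X.shape[1])                      # indices of surviving "CpG" features

while len(kept) > 140:
    rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X[:, kept], y)
    order = np.argsort(rf.feature_importances_)[::-1]
    kept = kept[order[: max(140, len(kept) // 2)]]   # keep the most important half each round

print(f"{len(kept)} candidate features retained")
```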
Forecasting Solar Flares Using Magnetogram-based Predictors and Machine Learning
NASA Astrophysics Data System (ADS)
Florios, Kostas; Kontogiannis, Ioannis; Park, Sung-Hong; Guerra, Jordan A.; Benvenuto, Federico; Bloomfield, D. Shaun; Georgoulis, Manolis K.
2018-02-01
We propose a forecasting approach for solar flares based on data from Solar Cycle 24, taken by the Helioseismic and Magnetic Imager (HMI) on board the Solar Dynamics Observatory (SDO) mission. In particular, we use the Space-weather HMI Active Region Patches (SHARP) product that facilitates cut-out magnetograms of solar active regions (AR) in the Sun in near-realtime (NRT), taken over a five-year interval (2012-2016). Our approach utilizes a set of thirteen predictors, which are not included in the SHARP metadata, extracted from line-of-sight and vector photospheric magnetograms. We exploit several machine learning (ML) and conventional statistics techniques to predict flares of peak magnitude >M1 and >C1 within a 24 h forecast window. The ML methods used are multi-layer perceptrons (MLP), support vector machines (SVM), and random forests (RF). We conclude that random forests could be the prediction technique of choice for our sample, with the second-best method being multi-layer perceptrons, subject to an entropy objective function. A Monte Carlo simulation showed that the best-performing method gives accuracy ACC=0.93(0.00), true skill statistic TSS=0.74(0.02), and Heidke skill score HSS=0.49(0.01) for >M1 flare prediction with probability threshold 15% and ACC=0.84(0.00), TSS=0.60(0.01), and HSS=0.59(0.01) for >C1 flare prediction with probability threshold 35%.
ERIC Educational Resources Information Center
Ledoux, Tracey; Robinson, Jessica; Baranowski, Tom; O'Connor, Daniel P.
2018-01-01
The American Academy of Pediatrics and World Health Organization recommend responsive feeding (RF) to promote healthy eating behaviors in early childhood. This project developed and tested a vicarious learning video to teach parents RF practices. A RF vicarious learning video was developed using community-based participatory research methods.…
Random forests-based differential analysis of gene sets for gene expression data.
Hsueh, Huey-Miin; Zhou, Da-Wei; Tsai, Chen-An
2013-04-10
In DNA microarray studies, gene-set analysis (GSA) has become the focus of gene expression data analysis. GSA utilizes the gene expression profiles of functionally related gene sets in Gene Ontology (GO) categories or a priori defined biological classes to assess the significance of gene sets associated with clinical outcomes or phenotypes. Many statistical approaches have been proposed to determine whether such functionally related gene sets express differentially (enrichment and/or deletion) in variations of phenotypes. However, little attention has been given to the discriminatory power of gene sets and classification of patients. In this study, we propose a method of gene set analysis, in which gene sets are used to develop classifications of patients based on the Random Forest (RF) algorithm. The corresponding empirical p-value of an observed out-of-bag (OOB) error rate of the classifier is introduced to identify differentially expressed gene sets using an adequate resampling method. In addition, we discuss the impacts and correlations of genes within each gene set based on the measures of variable importance in the RF algorithm. Significant classifications are reported and visualized together with the underlying gene sets and their contribution to the phenotypes of interest. Numerical studies using both synthesized data and a series of publicly available gene expression data sets are conducted to evaluate the performance of the proposed methods. Compared with other hypothesis testing approaches, our proposed methods are reliable and successful in identifying enriched gene sets and in discovering the contributions of genes within a gene set. The classification results of identified gene sets can provide a valuable alternative to gene set testing to reveal the unknown, biologically relevant classes of samples or patients. In summary, our proposed method allows one to simultaneously assess the discriminatory ability of gene sets and the importance of genes for interpretation of data in complex biological systems. The classifications of biologically defined gene sets can reveal the underlying interactions of gene sets associated with the phenotypes, and provide an insightful complement to conventional gene set analyses. Copyright © 2012 Elsevier B.V. All rights reserved.
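A minimal sketch of the gene-set test described above: the OOB error of a random forest built on one gene set's expression matrix is compared with OOB errors obtained after permuting the phenotype labels, yielding an empirical p-value for that gene set. The data are simulated stand-ins for an expression matrix restricted to a single gene set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X_set, y = make_classification(n_samples=120, n_features=30, n_informative=8, random_state=0)

def oob_error(X, y, seed=0):
    rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=seed).fit(X, y)
    return 1.0 - rf.oob_score_

observed = oob_error(X_set, y)
rng = np.random.default_rng(0)
null = np.array([oob_error(X_set, rng.permutation(y), seed=s) for s in range(100)])
p_value = (1 + np.sum(null <= observed)) / (1 + len(null))   # a lower OOB error is better
print(f"OOB error = {observed:.3f}, empirical p-value = {p_value:.3f}")
```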
Effects of rock fragments on water dynamics in a fire-affected soil
NASA Astrophysics Data System (ADS)
Gordillo-Rivero, Ángel J.; García-Moreno, Jorge; Jordán, Antonio; Zavala, Lorena M.
2014-05-01
Rock fragments (RF) are common on the surface of Mediterranean semiarid soils, and have important effects on soil physical properties (bulk density and porosity) and hydrological processes (infiltration, evaporation, splash erosion and runoff generation) (Poesen and Lavee, 1994; Rieke-Zapp et al., 2007). In some cases, RFs in Mediterranean areas have been shown to protect bare soils from erosion risk (Cerdà, 2001; Martínez-Zavala and Jordán, 2008; Zavala et al., 2010). Some of these effects are much more relevant when vegetation cover is low or has been reduced after land use change or other causes, such as forest fires. Although very few studies exist, interest in the hydrological effects of RFs in burned areas has been increasing recently. After a forest fire, RFs may contribute significantly to soil recovery. In this research we have studied the effect of surface and embedded RFs on soil water control, infiltration and evaporation in calcareous fire-affected soils from a Mediterranean area (SW Spain). For this study, we selected an area with soils derived from limestone under holm oak forest, recently affected by a moderate-severity forest fire. The proportion of RF cover showed a significant positive relation with soil water-holding capacity and infiltration rates, although the infiltration rate decreased significantly when RF cover increased above a certain threshold. Soil evaporation rate decreased with increasing volumetric content of RFs and became stable with RF contents approximately above 30%. Evaporation also decreased with increasing RF cover. When RF cover increased above 50%, no significant differences were observed between burned and control vegetated plots. REFERENCES Poesen, J., Lavee, H. 1994. Rock fragments in top soils: significance and processes. Catena Supplement 23, 1-28. Cerdà, A. 2001. Effect of rock fragment cover on soil infiltration, interrill runoff and erosion. European Journal of Soil Science 52, 59-68. DOI: 10.1046/j.1365-2389.2001.00354.x. Rieke-Zapp, D., Poesen, J., Nearing, M.A. 2007. Effects of rock fragments incorporated in the soil matrix on concentrated flow hydraulics and erosion. Earth Surface Processes and Landforms 32, 1063-1076. Martínez-Zavala, L., Jordán, A. 2008. Effect of rock fragment cover on interrill soil erosion from bare soils in Western Andalusia, Spain. Soil Use and Management 24, 108-117. DOI: 10.1111/j.1475-2743.2007.00139.x. Zavala, L.M., Jordán, A., Bellinfante, N., Gil, J. 2010. Relationships between rock fragment cover and soil hydrological response in a Mediterranean environment. Soil Science and Plant Nutrition 56, 95-104. DOI: 10.1111/j.1747-0765.2009.00429.x.
NASA Astrophysics Data System (ADS)
Burkard, Reto; Bützberger, Patrick; Eugster, Werner
During the winter of 2001/2002 wet and occult deposition measurements were performed at the Lägeren research site (690 m a.s.l.) in Switzerland. Two types of fog were observed: radiation fog (RF) and fog associated with atmospheric instabilities (FAI). The deposition measurements were performed above the forest canopy on a 45 m high tower. Occult deposition was measured by means of the eddy covariance method. Due to the large differences of microphysical properties of the two fog types, the liquid water fluxes were much higher (6.9 mg m⁻² s⁻¹) during RF than during FAI (0.57 mg m⁻² s⁻¹). Fogwater concentrations were considerably enhanced during RF compared with FAI. The comparison of fog and rain revealed that fogwater nutrient concentrations were 3-66 times larger than concentrations in precipitation. The considerably larger water fluxes and nutrient concentrations of RF resulted in much higher nutrient deposition compared with FAI. In winter when RF was quite frequent, occult deposition was the dominant pathway for nitrate and ammonium deposition. Daily fluxes of total inorganic nitrogen were 1.89 mg m⁻² d⁻¹ by occult and 1.01 mg m⁻² d⁻¹ by wet deposition. The estimated contribution of occult deposition to total annual nitrogen input was 16.4% or 4.3 kg N ha⁻¹ yr⁻¹, and wet deposition contributed 26.5% (6.9 kg N ha⁻¹ yr⁻¹). As a consequence, critical loads of annual N-input were exceeded, resulting in a significant over-fertilization at the Lägeren site.
Automatic Classification of Aerial Imagery for Urban Hydrological Applications
NASA Astrophysics Data System (ADS)
Paul, A.; Yang, C.; Breitkopf, U.; Liu, Y.; Wang, Z.; Rottensteiner, F.; Wallner, M.; Verworn, A.; Heipke, C.
2018-04-01
In this paper we investigate the potential of automatic supervised classification for urban hydrological applications. In particular, we contribute to runoff simulations using hydrodynamic urban drainage models. In order to assess whether the capacity of the sewers is sufficient to avoid surcharge within certain return periods, precipitation is transformed into runoff. The transformation of precipitation into runoff requires knowledge about the proportion of drainage-effective areas and their spatial distribution in the catchment area. Common simulation methods use the coefficient of imperviousness as an important parameter to estimate the overland flow, which subsequently contributes to the pipe flow. The coefficient of imperviousness is the percentage of area covered by impervious surfaces such as roofs or road surfaces. It is still common practice to assign the coefficient of imperviousness for each particular land parcel manually by visual interpretation of aerial images. Based on the classification results of this imagery, we contribute to an objective automatic determination of the coefficient of imperviousness. In this context we compare two classification techniques: Random Forests (RF) and Conditional Random Fields (CRF). Experiments performed on an urban test area show good results and confirm that the automated derivation of the coefficient of imperviousness, apart from being more objective and, thus, reproducible, delivers more accurate results than the interactive estimation. We achieve an overall accuracy of about 85 % for both classifiers. The root mean square error of the differences of the coefficient of imperviousness compared to the reference is 4.4 % for the CRF-based classification, and 3.8 % for the RF-based classification.
Ozcift, Akin; Gulten, Arif
2011-12-01
Improving the accuracies of machine learning algorithms is vital in designing high-performance computer-aided diagnosis (CADx) systems. Research has shown that base classifier performance might be enhanced by ensemble classification strategies. In this study, we construct rotation forest (RF) ensemble classifiers of 30 machine learning algorithms to evaluate their classification performances using Parkinson's, diabetes and heart disease datasets from the literature. In the experiments, the feature dimension of the three datasets is first reduced using the correlation-based feature selection (CFS) algorithm. Second, the classification performances of the 30 machine learning algorithms are calculated for the three datasets. Third, 30 classifier ensembles are constructed based on the RF algorithm to assess the performances of the respective classifiers with the same disease data. All the experiments are carried out with a leave-one-out validation strategy and the performances of the 60 algorithms are evaluated using three metrics: classification accuracy (ACC), kappa error (KE) and area under the receiver operating characteristic (ROC) curve (AUC). The base classifiers achieved average accuracies of 72.15%, 77.52% and 84.43% for the diabetes, heart and Parkinson's datasets, respectively. As for the RF classifier ensembles, they produced average accuracies of 74.47%, 80.49% and 87.13% for the respective diseases. RF, a newly proposed classifier ensemble algorithm, might be used to improve the accuracy of miscellaneous machine learning algorithms to design advanced CADx systems. Copyright © 2011 Elsevier Ireland Ltd. All rights reserved.
NASA Astrophysics Data System (ADS)
Haguma, D.; Leconte, R.
2017-12-01
Spatial and temporal variability of water resources is associated with large-scale pressure and circulation anomalies known as teleconnections, which influence the pattern of atmospheric circulation. Teleconnection indices have been used successfully to forecast streamflow in the short term. However, in some watersheds, classical methods cannot establish relationships between seasonal streamflow and teleconnection indices because of weak correlation. In this study, machine learning algorithms were applied to seasonal streamflow forecasting using teleconnection indices. Machine learning offers an alternative to classical methods for addressing the non-linear relationship between streamflow and teleconnection indices in the context of a non-stationary climate. Two machine learning algorithms, random forest (RF) and support vector machine (SVM), with teleconnection indices associated with North American climatology, were used to forecast inflows one and two seasons ahead for the Romaine River and Manicouagan River watersheds, located in Quebec, Canada. The indices are the Pacific-North America (PNA), North Atlantic Oscillation (NAO), El Niño-Southern Oscillation (ENSO), Arctic Oscillation (AO) and Pacific Decadal Oscillation (PDO). The results showed that the machine learning algorithms have important predictive power for seasonal streamflow one and two seasons ahead. RF performed better in training, and SVM generally gave better results with high predictive capability in testing. RF, being an ensemble method, also allowed the uncertainty of the forecast to be assessed. The integration of teleconnection indices supports seasonal streamflow forecasting under non-stationary climate conditions, even though the teleconnection indices have a weak correlation with streamflow.
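A hedged sketch of the forecasting setup described above: seasonal inflow regressed on lagged teleconnection indices with a random forest and a support vector machine, scored separately on training and testing splits. The index values and inflows are simulated placeholders, not the Quebec records.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

indices = ["PNA", "NAO", "ENSO", "AO", "PDO"]
rng = np.random.default_rng(6)
X = rng.normal(size=(200, len(indices)))                              # one row per season, lagged indices (placeholders)
y = 50 + 10 * X[:, 2] - 5 * X[:, 1] + rng.normal(scale=5, size=200)   # toy seasonal inflow

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
for name, model in [("RF", RandomForestRegressor(n_estimators=500, random_state=0)),
                    ("SVM", SVR(C=10.0))]:
    model.fit(X_tr, y_tr)
    print(f"{name}: training R2 = {r2_score(y_tr, model.predict(X_tr)):.2f}, "
          f"testing R2 = {r2_score(y_te, model.predict(X_te)):.2f}")
```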
Zhou, Fei; Zhao, Yajing; Peng, Jiyu; Jiang, Yirong; Li, Maiquan; Jiang, Yuan; Lu, Baiyi
2017-07-01
Osmanthus fragrans flowers are used as folk medicine and additives for teas, beverages and foods. The metabolites of O. fragrans flowers from different geographical origins were inconsistent to some extent. Chromatography and mass spectrometry combined with multivariable analysis methods provide an approach for discriminating the origin of O. fragrans flowers. The objective was to discriminate Osmanthus fragrans var. thunbergii flowers from different origins using the identified metabolites. GC-MS and UPLC-PDA were conducted to analyse the metabolites in O. fragrans var. thunbergii flowers (in total 150 samples). Principal component analysis (PCA), soft independent modelling of class analogy analysis (SIMCA) and random forest (RF) analysis were applied to group the GC-MS and UPLC-PDA data. GC-MS identified 32 compounds common to all samples while UPLC-PDA/QTOF-MS identified 16 common compounds. PCA of the UPLC-PDA data generated a better clustering than PCA of the GC-MS data. Ten metabolites (six from GC-MS and four from UPLC-PDA) were selected as effective compounds for discrimination by PCA loadings. SIMCA and RF analysis were used to build classification models, and the RF model, based on the four effective compounds (caffeic acid derivative, acteoside, ligustroside and compound 15), yielded better results with a classification rate of 100% in the calibration set and 97.8% in the prediction set. GC-MS and UPLC-PDA combined with multivariable analysis methods can discriminate the origin of Osmanthus fragrans var. thunbergii flowers. Copyright © 2017 John Wiley & Sons, Ltd.
Predicting assemblages and species richness of endemic fish in the upper Yangtze River.
He, Yongfeng; Wang, Jianwei; Lek-Ang, Sithan; Lek, Sovan
2010-09-01
The present work describes the ability of two modeling methods, Classification and Regression Tree (CART) and Random Forest (RF), to predict endemic fish assemblages and species richness in the upper Yangtze River, and then to identify the determinant environmental factors contributing to the models. The models included 24 predictor variables and 2 response variables (fish assemblage and species richness) for a total of 46 site units. The predictive quality of the modeling approaches was judged with a leave-one-out validation procedure. There was an average success of 60.9% and 71.7% to assign each site unit to the correct assemblage of fish, and 73% and 84% to explain the variance in species richness, by using CART and RF models, respectively. RF proved to be better than CART in terms of accuracy and efficiency in ecological applications. In any case, the mixed models including both land cover and river characteristic variables were more powerful than either individual one in explaining the endemic fish distribution pattern in the upper Yangtze River. For instance, altitude, slope, length, discharge, runoff, farmland and alpine and sub-alpine meadow played important roles in driving the observed endemic fish assemblage structure, while farmland, slope grassland, discharge, runoff, altitude and drainage area in explaining the observed patterns of endemic species richness. Therefore, the various effects of human activity on natural aquatic ecosystems, in particular, the flow modification of the river and the land use changes may have a considerable effect on the endemic fish distribution patterns on a regional scale. Copyright 2010 Elsevier B.V. All rights reserved.
Landscape variability of vegetation change across the forest to tundra transition of central Canada
NASA Astrophysics Data System (ADS)
Bonney, Mitchell Thurston
Widespread vegetation productivity increases in tundra ecosystems and stagnation, or even productivity decreases, in boreal forest ecosystems have been detected from coarse-scale remote sensing observations over the last few decades. However, finer-scale Landsat studies have shown that these changes are heterogeneous and may be related to landscape and regional variability in climate, land cover, topography and moisture. In this study, a Landsat Normalized Difference Vegetation Index (NDVI) time-series (1984-2016) was examined for a study area spanning the entirety of the sub-Arctic boreal forest to Low Arctic tundra transition of central Canada (i.e., Yellowknife to the Arctic Ocean). NDVI trend analysis indicated that 27% of un-masked pixels in the study area exhibited a significant (p < 0.05) trend and virtually all (99.3%) of those pixels were greening. Greening pixels were most common in the northern tundra zone and the southern forest-tundra ecotone zone. NDVI trends were positive throughout the study area, but were smallest in the forest zone and largest in the northern tundra zone. These results were supported by ground validation, which found a strong relationship (R2 = 0.81) between bulk vegetation volume (BVV) and NDVI for non-tree functional groups in the North Slave region of Northwest Territories. Field observations indicate that alder (Alnus spp.) shrublands and open woodland sites with shrubby understories were most likely to exhibit greening in that area. Random Forest (RF) modelling of the relationship between NDVI trends and environmental variables found that the magnitude and direction of trends differed across the forest to tundra transition. Increased summer temperatures, shrubland and forest land cover, closer proximity to major drainage systems, longer distances from major lakes and lower elevations were generally more important and associated with larger positive NDVI trends. These findings indicate that the largest positive NDVI trends were primarily associated with the increased productivity of shrubby environments, especially at, and north of the forest-tundra ecotone in areas with more favorable growing conditions. Smaller and less significant NDVI trends in boreal forest environments south of the forest-tundra ecotone were likely associated with long-term recovery from fire disturbance rather than the variables analyzed here.
Douglas, R K; Nawar, S; Alamar, M C; Mouazen, A M; Coulon, F
2018-03-01
Visible and near-infrared spectrometry (vis-NIRS) coupled with data mining techniques can offer fast and cost-effective quantitative measurement of total petroleum hydrocarbons (TPH) in contaminated soils. However, the literature shows significant differences in vis-NIRS performance between linear and non-linear calibration methods. This study compared the performance of linear partial least squares regression (PLSR) with nonlinear random forest (RF) regression for the calibration of vis-NIRS when analysing TPH in soils. Eighty-eight soil samples (3 uncontaminated and 85 contaminated) collected from three sites located in the Niger Delta were scanned using an analytical spectral device (ASD) spectrophotometer (350-2500 nm) in diffuse reflectance mode. Sequential ultrasonic solvent extraction-gas chromatography (SUSE-GC) was used as the reference quantification method for TPH, which equals the sum of the aliphatic and aromatic fractions ranging between C10 and C35. Prior to model development, spectra were subjected to pre-processing including noise cut, maximum normalization, first derivative and smoothing. Then 65 samples were selected as the calibration set and the remaining 20 samples as the validation set. Both vis-NIR spectrometry and gas chromatography profiles of the 85 soil samples were subjected to RF and PLSR with leave-one-out cross-validation (LOOCV) for the calibration models. Results showed that the RF calibration model, with a coefficient of determination (R²) of 0.85, a root mean square error of prediction (RMSEP) of 68.43 mg kg⁻¹, and a residual prediction deviation (RPD) of 2.61, outperformed PLSR (R² = 0.63, RMSEP = 107.54 mg kg⁻¹ and RPD = 2.55) in cross-validation. These results indicate that the RF modelling approach accounts for the nonlinearity of the soil spectral responses, hence providing significantly higher prediction accuracy compared with the linear PLSR. It is recommended to adopt vis-NIRS coupled with the RF modelling approach as a portable and cost-effective method for the rapid quantification of TPH in soils. Copyright © 2017 Elsevier B.V. All rights reserved.
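A minimal sketch of the calibration comparison described above: partial least squares regression versus a random forest on pre-processed spectra, evaluated with leave-one-out cross-validation and summarized by R2, RMSEP and RPD. The spectra and TPH values are simulated placeholders.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(7)
X = rng.normal(size=(85, 200))                                   # 85 samples x 200 spectral bands (placeholder spectra)
y = 100 + 50 * X[:, 10] + rng.normal(scale=20, size=85)          # toy TPH concentrations in mg/kg

for name, model in [("PLSR", PLSRegression(n_components=10)),
                    ("RF", RandomForestRegressor(n_estimators=200, random_state=0))]:
    pred = cross_val_predict(model, X, y, cv=LeaveOneOut()).ravel()
    rmse = np.sqrt(mean_squared_error(y, pred))
    print(f"{name}: R2 = {r2_score(y, pred):.2f}, RMSEP = {rmse:.1f} mg/kg, RPD = {np.std(y) / rmse:.2f}")
```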
[Hyperspectral Estimation of Apple Tree Canopy LAI Based on SVM and RF Regression].
Han, Zhao-ying; Zhu, Xi-cun; Fang, Xian-yi; Wang, Zhuo-yuan; Wang, Ling; Zhao, Geng-Xing; Jiang, Yuan-mao
2016-03-01
Leaf area index (LAI) is a dynamic index of crop population size. Hyperspectral technology can be used to estimate apple canopy LAI rapidly and nondestructively, and can provide a reference for monitoring tree growth and estimating yield. Red Fuji apple trees at the full fruit-bearing stage were the research objects. The canopy spectral reflectance and LAI values of ninety apple trees were measured with an ASD FieldSpec 3 spectrometer and an LAI-2200 in thirty orchards over two consecutive years in the Qixia research area of Shandong Province. The optimal vegetation indices were selected by correlation analysis of the original spectral reflectance and vegetation indices. Models for predicting LAI were built with the multivariate regression methods of support vector machine (SVM) and random forest (RF). The new vegetation indices GNDVI527, NDVI676, RVI682, FD-NVI656 and GRVI517 and the two previously established main vegetation indices, NDVI670 and NDVI705, were well correlated with LAI. In the RF regression model, the calibration-set coefficient of determination (C-R2) of 0.920 and the validation-set coefficient of determination (V-R2) of 0.889 were higher than those of the SVM regression model by 0.045 and 0.033, respectively. The calibration-set root mean square error (C-RMSE) of 0.249 and the validation-set root mean square error (V-RMSE) of 0.236 were lower than those of the SVM regression model by 0.054 and 0.058, respectively. The calibration-set and validation-set relative prediction deviations (C-RPD and V-RPD) reached 3.363 and 2.520, higher than those of the SVM regression model by 0.598 and 0.262, respectively. The slopes of the measured-versus-predicted trend lines for the calibration and validation sets (C-S and V-S) were close to 1. The estimation result of the RF regression model is better than that of the SVM model, and the RF regression model can be used to estimate the LAI of Red Fuji apple trees in the full fruit period.
Spectroscopic Diagnosis of Arsenic Contamination in Agricultural Soils
Shi, Tiezhu; Liu, Huizeng; Chen, Yiyun; Fei, Teng; Wang, Junjie; Wu, Guofeng
2017-01-01
This study investigated the abilities of pre-processing, feature selection and machine-learning methods for the spectroscopic diagnosis of soil arsenic contamination. The spectral data were pre-processed by using Savitzky-Golay smoothing, first and second derivatives, multiplicative scatter correction, standard normal variate, and mean centering. Principal component analysis (PCA) and the RELIEF algorithm were used to extract spectral features. Machine-learning methods, including random forests (RF), artificial neural network (ANN), and radial basis function- and linear function-based support vector machines (RBF- and LF-SVM), were employed for establishing diagnosis models. The model accuracies were evaluated and compared by using overall accuracies (OAs). The statistical significance of the difference between models was evaluated by using McNemar’s test (Z value). The results showed that the OAs varied with the different combinations of pre-processing, feature selection, and classification methods. Feature selection methods could improve the modeling efficiencies and diagnosis accuracies, and RELIEF often outperformed PCA. The optimal models established by RF (OA = 86%), ANN (OA = 89%), RBF- (OA = 89%) and LF-SVM (OA = 87%) had no statistical difference in diagnosis accuracies (Z < 1.96, p > 0.05). These results indicated that it was feasible to diagnose soil arsenic contamination using reflectance spectroscopy. The appropriate combination of multivariate methods was important to improve diagnosis accuracies. PMID:28471412
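McNemar's test, used above to compare paired classifiers, can be computed directly from the discordant-pair counts; the boolean arrays marking which test samples each model classified correctly are assumed names.

```python
# Continuity-corrected McNemar's test for two classifiers evaluated on the same samples.
import numpy as np
from scipy.stats import norm

def mcnemar_z(correct_a, correct_b):
    b = np.sum(correct_a & ~correct_b)        # samples model A got right and B got wrong
    c = np.sum(~correct_a & correct_b)        # samples model B got right and A got wrong
    z = (abs(b - c) - 1) / np.sqrt(b + c) if (b + c) > 0 else 0.0
    p = 2 * (1 - norm.cdf(z))                 # two-sided p-value
    return z, p

# z, p = mcnemar_z(pred_rf == y_test, pred_ann == y_test)
# Z < 1.96 (p > 0.05) indicates no significant difference in overall accuracy.
```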
NASA Astrophysics Data System (ADS)
Paul, Subir; Nagesh Kumar, D.
2018-04-01
Hyperspectral (HS) data comprise continuous spectral responses of hundreds of narrow spectral bands with very fine spectral resolution or bandwidth, which enables feature identification and classification with high accuracy. In the present study, a Mutual Information (MI) based Segmented Stacked Autoencoder (S-SAE) approach for spectral-spatial classification of HS data is proposed to reduce the complexity and computational time compared to Stacked Autoencoder (SAE) based feature extraction. A non-parametric dependency measure (MI) is proposed for spectral segmentation instead of linear or parametric dependency measures, so that both linear and nonlinear inter-band dependencies are accounted for when segmenting the HS bands. Morphological profiles are then created from the segmented spectral features to assimilate spatial information into the spectral-spatial classification approach. Two non-parametric classifiers, Support Vector Machine (SVM) with a Gaussian kernel and Random Forest (RF), are used for classification of the three most popularly used HS datasets. Results of the numerical experiments carried out in this study show that SVM with a Gaussian kernel provides better results for the Pavia University and Botswana datasets, whereas RF performs better for the Indian Pines dataset. The experiments performed with the proposed methodology provide encouraging results compared to numerous existing approaches.
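A rough sketch of mutual-information-driven band segmentation, under the simplifying assumption that a segment boundary is placed wherever MI between neighbouring bands drops below a threshold; the matrix `X`, the threshold, and the MI estimator are illustrative and not the paper's exact procedure.

```python
# Group contiguous hyperspectral bands into segments using inter-band mutual information.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def segment_bands(X, threshold=0.5):
    # X: (n_pixels, n_bands) reflectance matrix (placeholder data)
    n_bands = X.shape[1]
    boundaries = [0]
    for b in range(n_bands - 1):
        mi = mutual_info_regression(X[:, [b]], X[:, b + 1], random_state=0)[0]
        if mi < threshold:                    # weak dependency -> start a new segment
            boundaries.append(b + 1)
    boundaries.append(n_bands)
    return [list(range(s, e)) for s, e in zip(boundaries[:-1], boundaries[1:])]

# segments = segment_bands(X)   # each segment would then feed its own stacked autoencoder
```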
van der Ploeg, Tjeerd; Austin, Peter C; Steyerberg, Ewout W
2014-12-22
Modern modelling techniques may potentially provide more accurate predictions of binary outcomes than classical techniques. We aimed to study the predictive performance of different modelling techniques in relation to the effective sample size ("data hungriness"). We performed simulation studies based on three clinical cohorts: 1282 patients with head and neck cancer (with 46.9% 5 year survival), 1731 patients with traumatic brain injury (22.3% 6 month mortality) and 3181 patients with minor head injury (7.6% with CT scan abnormalities). We compared three relatively modern modelling techniques: support vector machines (SVM), neural nets (NN), and random forests (RF), and two classical techniques: logistic regression (LR) and classification and regression trees (CART). We created three large artificial databases with 20 fold, 10 fold and 6 fold replication of subjects, where we generated dichotomous outcomes according to different underlying models. We applied each modelling technique to increasingly larger development parts (100 repetitions). The area under the ROC-curve (AUC) indicated the performance of each model in the development part and in an independent validation part. Data hungriness was defined by plateauing of AUC and small optimism (difference between the mean apparent AUC and the mean validated AUC <0.01). We found that a stable AUC was reached by LR at approximately 20 to 50 events per variable, followed by CART, SVM, NN and RF models. Optimism decreased with increasing sample sizes and the same ranking of techniques. The RF, SVM and NN models showed instability and a high optimism even with >200 events per variable. Modern modelling techniques such as SVM, NN and RF may need over 10 times as many events per variable as classical modelling techniques such as LR to achieve a stable AUC and a small optimism. This implies that such modern techniques should only be used in medical prediction problems if very large data sets are available.
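A toy illustration of this "data hungriness" behaviour: train LR and RF on growing development samples of a simulated cohort and track optimism (apparent minus validated AUC). The data-generating settings and sample sizes are assumptions for demonstration only, not the study's simulation design.

```python
# Optimism of LR vs. RF as a function of development sample size (toy simulation).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

Xall, yall = make_classification(n_samples=60000, n_features=10, n_informative=6,
                                 weights=[0.8, 0.2], random_state=0)
Xval, yval = Xall[50000:], yall[50000:]        # large independent validation part

for n in [200, 1000, 5000, 20000]:
    Xd, yd = Xall[:n], yall[:n]
    epv = yd.sum() / Xd.shape[1]               # events per variable in the development part
    for name, m in [("LR", LogisticRegression(max_iter=1000)),
                    ("RF", RandomForestClassifier(n_estimators=200, random_state=0))]:
        m.fit(Xd, yd)
        apparent = roc_auc_score(yd, m.predict_proba(Xd)[:, 1])
        validated = roc_auc_score(yval, m.predict_proba(Xval)[:, 1])
        print(f"n={n:5d}  EPV={epv:5.1f}  {name}: optimism={apparent - validated:.3f}")
```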
Zald, Harold S.J.; Spies, Thomas A.; Seidl, Rupert; Pabst, Robert J.; Olsen, Keith A.; Steel, E. Ashley
2016-01-01
Forest carbon (C) density varies tremendously across space due to the inherent heterogeneity of forest ecosystems. Variation of forest C density is especially pronounced in mountainous terrain, where environmental gradients are compressed and vary at multiple spatial scales. Additionally, the influence of environmental gradients may vary with forest age and developmental stage, an important consideration as forest landscapes often have a diversity of stand ages from past management and other disturbance agents. Quantifying forest C density and its underlying environmental determinants in mountain terrain has remained challenging because many available data sources lack the spatial grain and ecological resolution needed at both stand and landscape scales. The objective of this study was to determine if environmental factors influencing aboveground live carbon (ALC) density differed between young versus old forests. We integrated aerial light detection and ranging (lidar) data with 702 field plots to map forest ALC density at a grain of 25 m across the H.J. Andrews Experimental Forest, a 6369 ha watershed in the Cascade Mountains of Oregon, USA. We used linear regressions, random forest ensemble learning (RF) and sequential autoregressive modeling (SAR) to reveal how mapped forest ALC density was related to climate, topography, soils, and past disturbance history (timber harvesting and wildfires). ALC increased with stand age in young managed forests, with much greater variation of ALC in relation to years since wildfire in old unmanaged forests. Timber harvesting was the most important driver of ALC across the entire watershed, despite occurring on only 23% of the landscape. More variation in forest ALC density was explained in models of young managed forests than in models of old unmanaged forests. Besides stand age, ALC density in young managed forests was driven by factors influencing site productivity, whereas variation in ALC density in old unmanaged forests was also affected by finer scale topographic conditions associated with sheltered sites. Past wildfires only had a small influence on current ALC density, which may be a result of long times since fire and/or prevalence of non-stand replacing fire. Our results indicate that forest ALC density depends on a suite of multi-scale environmental drivers mediated by complex mountain topography, and that these relationships are dependent on stand age. The high and context-dependent spatial variability of forest ALC density has implications for quantifying forest carbon stores, establishing upper bounds of potential carbon sequestration, and scaling field data to landscape and regional scales. PMID:27041818
Ma, Xin; Guo, Jing; Sun, Xiao
2016-01-01
DNA-binding proteins are fundamentally important in cellular processes. Several computational-based methods have been developed to improve the prediction of DNA-binding proteins in previous years. However, insufficient work has been done on the prediction of DNA-binding proteins from protein sequence information. In this paper, a novel predictor, DNABP (DNA-binding proteins), was designed to predict DNA-binding proteins using the random forest (RF) classifier with a hybrid feature. The hybrid feature contains two types of novel sequence features, which reflect information about the conservation of physicochemical properties of the amino acids, and the binding propensity of DNA-binding residues and non-binding propensities of non-binding residues. The comparisons with each feature demonstrated that these two novel features contributed most to the improvement in predictive ability. Furthermore, to improve the prediction performance of the DNABP model, feature selection using the minimum redundancy maximum relevance (mRMR) method combined with incremental feature selection (IFS) was carried out during the model construction. The results showed that the DNABP model could achieve 86.90% accuracy, 83.76% sensitivity, 90.03% specificity and a Matthews correlation coefficient of 0.727. High prediction accuracy and performance comparisons with previous research suggested that DNABP could be a useful approach to identify DNA-binding proteins from sequence information. The DNABP web server system is freely available at http://www.cbi.seu.edu.cn/DNABP/.
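The incremental feature selection (IFS) step described above can be sketched with a univariate mutual-information ranking standing in for mRMR; the feature matrix `X`, labels `y`, and all settings are placeholders for illustration.

```python
# Incremental feature selection with an RF classifier: add features in relevance order
# and keep the subset with the best cross-validated accuracy.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def ifs_with_rf(X, y, cv=5):
    order = np.argsort(mutual_info_classif(X, y, random_state=0))[::-1]   # relevance ranking
    scores = []
    for k in range(1, len(order) + 1):
        rf = RandomForestClassifier(n_estimators=300, random_state=0)
        scores.append(cross_val_score(rf, X[:, order[:k]], y, cv=cv).mean())
    best_k = int(np.argmax(scores)) + 1
    return order[:best_k], scores

# selected_features, accuracy_curve = ifs_with_rf(X, y)
```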
E.R Peña-Mendoza; A. Gómez-Guerrero; Mark Fenn; P. Hernández de la Rosa; D. Alvarado Rosales
2016-01-01
The nutritional content and tree-top condition of Abies religiosa forests are evaluated at San Miguel Tlaixpan (SMT) and Rio Frio (RF), State of Mexico. The work had two parts. In the first, the nutritional content of new foliage (N, P, K, Ca and Mg) of Abies religiosa trees was evaluated in spring, summer and winter, in...
Ledoux, Tracey; Robinson, Jessica; Baranowski, Tom; O'Connor, Daniel P
2018-04-01
The American Academy of Pediatrics and World Health Organization recommend responsive feeding (RF) to promote healthy eating behaviors in early childhood. This project developed and tested a vicarious learning video to teach parents RF practices. An RF vicarious learning video was developed using community-based participatory research methods. Fifty parents of preschoolers were randomly assigned to watch Happier Meals or a control video about education. Knowledge and beliefs about RF practices were measured 1 week before and immediately after intervention. Experimental group participants also completed measures of narrative engagement and video acceptability. Seventy-four percent of the sample was White, 90% had at least a college degree, 96% were married, and 88% made >$50,000/year. RF knowledge increased (p = .03) and positive beliefs about some unresponsive feeding practices decreased (ps < .05) more among experimental than control parents. Knowledge and belief changes were associated with video engagement (ps < .05). Parents perceived Happier Meals as highly relevant, applicable, and informative. Community-based participatory research methods were instrumental in developing this vicarious learning video, with preliminary evidence of effectiveness in teaching parents about RF. Happier Meals is freely available for parents or community health workers to use when working with families to promote healthy eating behaviors in early childhood.
Data-Driven Lead-Acid Battery Prognostics Using Random Survival Forests
2014-10-02
Random survival forests are a survival-analysis extension of Random Forests (Breiman, 2001; Ishwaran, Kogalur, Blackstone, & Lauer, 2008; Ishwaran & Kogalur, 2010).
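For readers who want to experiment with the method, a hedged sketch of a random survival forest fit is given below, assuming the scikit-survival package; `X`, `event` (failure indicator) and `time` (time to failure) are placeholder arrays, not the report's battery dataset, and the API usage should be checked against the installed version.

```python
# Random survival forest sketch (assumes the scikit-survival package is installed).
from sksurv.ensemble import RandomSurvivalForest
from sksurv.util import Surv
from sksurv.metrics import concordance_index_censored

# X: feature matrix; event: boolean failure indicator; time: time to failure/censoring.
y = Surv.from_arrays(event=event.astype(bool), time=time)   # structured survival target
rsf = RandomSurvivalForest(n_estimators=500, min_samples_leaf=10, random_state=0)
rsf.fit(X, y)

risk = rsf.predict(X)                                        # higher score = earlier expected failure
cindex = concordance_index_censored(event.astype(bool), time, risk)[0]
print(f"concordance index: {cindex:.3f}")
```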
NASA Astrophysics Data System (ADS)
Houborg, Rasmus; McCabe, Matthew F.
2018-01-01
With an increasing volume and dimensionality of Earth observation data, enhanced integration of machine-learning methodologies is needed to effectively analyze and utilize these information rich datasets. In machine-learning, a training dataset is required to establish explicit associations between a suite of explanatory 'predictor' variables and the target property. The specifics of this learning process can significantly influence model validity and portability, with a higher generalization level expected with an increasing number of observable conditions being reflected in the training dataset. Here we propose a hybrid training approach for leaf area index (LAI) estimation, which harnesses synergistic attributes of scattered in-situ measurements and systematically distributed physically based model inversion results to enhance the information content and spatial representativeness of the training data. To do this, a complementary training dataset of independent LAI was derived from a regularized model inversion of RapidEye surface reflectances and subsequently used to guide the development of LAI regression models via Cubist and random forests (RF) decision tree methods. The application of the hybrid training approach to a broad set of Landsat 8 vegetation index (VI) predictor variables resulted in significantly improved LAI prediction accuracies and spatial consistencies, relative to results relying on in-situ measurements alone for model training. In comparing the prediction capacity and portability of the two machine-learning algorithms, a pair of relatively simple multi-variate regression models established by Cubist performed best, with an overall relative mean absolute deviation (rMAD) of ∼11%, determined based on a stringent scene-specific cross-validation approach. In comparison, the portability of RF regression models was less effective (i.e., an overall rMAD of ∼15%), which was attributed partly to model saturation at high LAI in association with inherent extrapolation and transferability limitations. Explanatory VIs formed from bands in the near-infrared (NIR) and shortwave infrared domains (e.g., NDWI) were associated with the highest predictive ability, whereas Cubist models relying entirely on VIs based on NIR and red band combinations (e.g., NDVI) were associated with comparatively high uncertainties (i.e., rMAD ∼ 21%). The most transferable and best performing models were based on combinations of several predictor variables, which included both NDWI- and NDVI-like variables. In this process, prior screening of input VIs based on an assessment of variable relevance served as an effective mechanism for optimizing prediction accuracies from both Cubist and RF. While this study demonstrated benefit in combining data mining operations with physically based constraints via a hybrid training approach, the concept of transferability and portability warrants further investigations in order to realize the full potential of emerging machine-learning techniques for regression purposes.
Chen, Chao-Feng; Gao, Xiao-Fei; Duan, Xu; Chen, Bin; Liu, Xiao-Hua; Xu, Yi-Zhou
2017-04-01
The present systematic review and meta-analysis aimed to assess and compare the safety and efficacy of radiofrequency (RF) and cryoballoon (CB) ablation for paroxysmal atrial fibrillation (PAF). RF and CB ablation are two frequently used methods for pulmonary vein isolation in PAF, but which is a better choice for PAF remains uncertain. A systematic review was conducted in Medline, PubMed, Embase, and Cochrane Library. All trials comparing RF and CB ablation were screened and included if the inclusion criteria were met. A total of 38 eligible studies (9 prospective randomized or randomized controlled trials (RCTs) and 29 non-RCTs) were identified, adding up to 15,496 patients. Pooled analyses indicated that CB ablation was more beneficial in terms of procedural time [standardized mean difference = -0.58; 95% confidence interval (CI), -0.85 to -0.30], complications without phrenic nerve injury (PNI) [odds ratio (OR) = 0.79; 95% CI, 0.67-0.93; I² = 16%], and recrudescence (OR = 0.83; 95% CI, 0.70-0.97; I² = 63%) for PAF; however, the total complication rate of CB was higher than that of RF. The subgroup analysis found that, compared with non-contact force radiofrequency (non-CF-RF), both first-generation cryoballoon (CB1) and second-generation cryoballoon (CB2) ablation could reduce complications with PNI, procedural time, and recrudescence. However, the safety and efficacy of CB2 were similar to those of CF-RF. Available overall and subgroup data suggested that both CB1 and CB2 were more beneficial than RF ablation, and the main advantages were reflected in comparing them with non-CF-RF. However, CF-RF and CB2 showed similar clinical benefits.
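For readers unfamiliar with the pooled statistics quoted above, a fixed-effect (inverse-variance) pooling of study-level odds ratios with an I² heterogeneity estimate can be sketched as follows; the input log odds ratios and standard errors are placeholders, not the meta-analysis data.

```python
# Fixed-effect pooled odds ratio with 95% CI and Cochran's Q / I^2 heterogeneity.
import numpy as np

def pool_odds_ratios(log_or, se):
    log_or, se = np.asarray(log_or, float), np.asarray(se, float)
    w = 1.0 / se ** 2                                   # inverse-variance weights
    pooled = np.sum(w * log_or) / np.sum(w)
    se_pooled = np.sqrt(1.0 / np.sum(w))
    ci = np.exp([pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled])
    q = np.sum(w * (log_or - pooled) ** 2)              # Cochran's Q
    i2 = max(0.0, (q - (len(log_or) - 1)) / q) * 100 if q > 0 else 0.0
    return np.exp(pooled), ci, i2

# or_hat, (ci_low, ci_high), i2 = pool_odds_ratios(log_or, se)
```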
A tale of two "forests": random forest machine learning aids tropical forest carbon mapping.
Mascaro, Joseph; Asner, Gregory P; Knapp, David E; Kennedy-Bowdoin, Ty; Martin, Roberta E; Anderson, Christopher; Higgins, Mark; Chadwick, K Dana
2014-01-01
Accurate and spatially-explicit maps of tropical forest carbon stocks are needed to implement carbon offset mechanisms such as REDD+ (Reduced Deforestation and Degradation Plus). The Random Forest machine learning algorithm may aid carbon mapping applications using remotely-sensed data. However, Random Forest has never been compared to traditional and potentially more reliable techniques such as regionally stratified sampling and upscaling, and it has rarely been employed with spatial data. Here, we evaluated the performance of Random Forest in upscaling airborne LiDAR (Light Detection and Ranging)-based carbon estimates compared to the stratification approach over a 16-million hectare focal area of the Western Amazon. We considered two runs of Random Forest, both with and without spatial contextual modeling by including, in the latter case, x and y position directly in the model. In each case, we set aside 8 million hectares (i.e., half of the focal area) for validation; this rigorous test of Random Forest went above and beyond the internal validation normally compiled by the algorithm (i.e., called "out-of-bag"), which proved insufficient for this spatial application. In this heterogeneous region of Northern Peru, the model with spatial context was the best performing run of Random Forest, and explained 59% of LiDAR-based carbon estimates within the validation area, compared to 37% for stratification or 43% by Random Forest without spatial context. With the 60% improvement in explained variation, RMSE against validation LiDAR samples improved from 33 to 26 Mg C ha(-1) when using Random Forest with spatial context. Our results suggest that spatial context should be considered when using Random Forest, and that doing so may result in substantially improved carbon stock modeling for purposes of climate change mitigation.
Ozcift, Akin
2012-08-01
Parkinson disease (PD) is an age-related deterioration of certain nerve systems that affects movement, balance, and muscle control. PD is a common disease, affecting about 1% of people older than 60 years. A new classification scheme, based on support vector machine (SVM) selected features used to train rotation forest (RF) ensemble classifiers, is presented for improving the diagnosis of PD. The dataset contains records of voice measurements from 31 people, 23 of whom have PD, and each record is defined by 22 features. The diagnosis model first makes use of a linear SVM to select the ten most relevant features from the 22. As a second step of the classification model, six different classifiers are trained with the subset of features. At the third step, the accuracies of the classifiers are improved by the utilization of an RF ensemble classification strategy. The results of the experiments are evaluated using three metrics: classification accuracy (ACC), Kappa Error (KE) and Area under the Receiver Operating Characteristic (ROC) Curve (AUC). Performance measures of two base classifiers, i.e. KStar and IBk, demonstrated an apparent increase in PD diagnosis accuracy compared to similar studies in the literature. Overall, application of the RF ensemble classification scheme significantly improved PD diagnosis for 5 of the 6 classifiers. We obtained about 97% accuracy with the RF ensemble of the IBk (a K-Nearest Neighbor variant) algorithm, which is quite high performance for Parkinson disease diagnosis.
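The SVM-based feature-selection step can be illustrated by ranking features with the absolute weights of a linear SVM and keeping the top ten; variable names, scaling, and the regularization setting are assumptions, not the paper's exact configuration.

```python
# Select the ten voice features with the largest absolute linear-SVM weights.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

scaler = StandardScaler().fit(X)                          # X, y are placeholder data
svm = LinearSVC(C=1.0, max_iter=10000).fit(scaler.transform(X), y)
top10 = np.argsort(np.abs(svm.coef_).ravel())[-10:]       # indices of the 10 strongest weights
X_selected = X[:, top10]                                  # reduced feature set for the ensemble step
```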
Modelling the spatial distribution of Fasciola hepatica in dairy cattle in Europe.
Ducheyne, Els; Charlier, Johannes; Vercruysse, Jozef; Rinaldi, Laura; Biggeri, Annibale; Demeler, Janina; Brandt, Christina; De Waal, Theo; Selemetas, Nikolaos; Höglund, Johan; Kaba, Jaroslaw; Kowalczyk, Slawomir J; Hendrickx, Guy
2015-03-26
A harmonized sampling approach in combination with spatial modelling is required to update current knowledge of fasciolosis in dairy cattle in Europe. Within the scope of the EU project GLOWORM, samples from 3,359 randomly selected farms in 849 municipalities in Belgium, Germany, Ireland, Poland and Sweden were collected and their infection status assessed using an indirect bulk tank milk (BTM) enzyme-linked immunosorbent assay (ELISA). Dairy farms were considered exposed when the optical density ratio (ODR) exceeded the 0.3 cut-off. Two ensemble-modelling techniques, Random Forests (RF) and Boosted Regression Trees (BRT), were used to obtain the spatial distribution of the probability of exposure to Fasciola hepatica using remotely sensed environmental variables (1-km spatial resolution) and interpolated values from meteorological stations as predictors. The median ODRs amounted to 0.31, 0.12, 0.54, 0.25 and 0.44 for Belgium, Germany, Ireland, Poland and southern Sweden, respectively. Using the 0.3 threshold, 571 municipalities were categorized as positive and 429 as negative. RF was seen as capable of predicting the spatial distribution of exposure with an area under the receiver operating characteristic (ROC) curve (AUC) of 0.83 (0.96 for BRT). Both models identified rainfall and temperature as the most important factors for probability of exposure. Areas of high and low exposure were identified by both models, with BRT better at discriminating between low-probability and high-probability exposure; this model may therefore be more useful in practice. Given a harmonized sampling strategy, it should be possible to generate robust spatial models for fasciolosis in dairy cattle in Europe to be used as input for temporal models and for the detection of deviations in baseline probability. Further research is required for model output in areas outside the eco-climatic range investigated.
Communication methods, systems, apparatus, and devices involving RF tag registration
Burghard, Brion J [W. Richland, WA; Skorpik, James R [Kennewick, WA
2008-04-22
One technique of the present invention includes a number of Radio Frequency (RF) tags that each have a different identifier. Information is broadcast to the tags from an RF tag interrogator. This information corresponds to a maximum quantity of tag response time slots that are available. This maximum quantity may be less than the total number of tags. The tags each select one of the time slots as a function of the information and a random number provided by each respective tag. The different identifiers are transmitted to the interrogator from at least a subset of the RF tags.
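A toy simulation of the slot-selection idea in the abstract, in which each tag maps its identifier and a locally drawn random number onto one of the advertised response slots; this is purely illustrative and not the patented protocol itself.

```python
# Tags choose response time slots as a function of the broadcast slot count and local randomness.
import random

def choose_slots(tag_ids, max_slots, seed=None):
    rng = random.Random(seed)
    slots = {}
    for tag in tag_ids:
        r = rng.getrandbits(16)                        # tag's own random number
        slots[tag] = hash((tag, r)) % max_slots        # slot index from identifier + randomness
    return slots

# slots = choose_slots(["TAG-01", "TAG-02", "TAG-03"], max_slots=8)
# Tags that land in the same slot collide and would retry in a later interrogation round.
```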
NASA Astrophysics Data System (ADS)
Gavish, Yoni; O'Connell, Jerome; Marsh, Charles J.; Tarantino, Cristina; Blonda, Palma; Tomaselli, Valeria; Kunin, William E.
2018-02-01
The increasing need for high quality Habitat/Land-Cover (H/LC) maps has triggered considerable research into novel machine-learning based classification models. In many cases, H/LC classes follow pre-defined hierarchical classification schemes (e.g., CORINE), in which fine H/LC categories are thematically nested within more general categories. However, none of the existing machine-learning algorithms account for this pre-defined hierarchical structure. Here we introduce a novel Random Forest (RF) based application of hierarchical classification, which fits a separate local classification model in every branching point of the thematic tree, and then integrates all the different local models to a single global prediction. We applied the hierarchical RF approach in a NATURA 2000 site in Italy, using two land-cover (CORINE, FAO-LCCS) and one habitat classification scheme (EUNIS) that differ from one another in the shape of the class hierarchy. For all 3 classification schemes, both the hierarchical model and a flat model alternative provided accurate predictions, with kappa values mostly above 0.9 (despite using only 2.2-3.2% of the study area as training cells). The flat approach slightly outperformed the hierarchical models when the hierarchy was relatively simple, while the hierarchical model worked better under more complex thematic hierarchies. Most misclassifications came from habitat pairs that are thematically distant yet spectrally similar. In 2 out of 3 classification schemes, the additional constraints of the hierarchical model resulted in fewer such serious misclassifications relative to the flat model. The hierarchical model also provided valuable information on variable importance which can shed light on "black-box" based machine learning algorithms like RF. We suggest various ways by which hierarchical classification models can increase the accuracy and interpretability of H/LC classification maps.
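A minimal two-level version of the hierarchical idea: one RF separates the coarse classes and a local RF refines each coarse class into its fine classes. The mapping `fine_to_coarse`, the forest sizes, and the variable names are placeholders for a real scheme such as CORINE or EUNIS; this sketch is not the authors' implementation.

```python
# Two-level hierarchical classification with local random forests.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

class TwoLevelRF:
    def __init__(self, fine_to_coarse):
        self.fine_to_coarse = fine_to_coarse            # dict: fine class -> coarse class
        self.top = RandomForestClassifier(n_estimators=300, random_state=0)
        self.local, self.single = {}, {}

    def fit(self, X, y_fine):
        y_coarse = np.array([self.fine_to_coarse[c] for c in y_fine])
        self.top.fit(X, y_coarse)
        for group in np.unique(y_coarse):
            idx = y_coarse == group
            fine_here = np.unique(y_fine[idx])
            if len(fine_here) > 1:                      # fit a local model only where a split is needed
                self.local[group] = RandomForestClassifier(
                    n_estimators=300, random_state=0).fit(X[idx], y_fine[idx])
            else:
                self.single[group] = fine_here[0]       # nothing to refine in this branch
        return self

    def predict(self, X):
        coarse = self.top.predict(X)
        fine = np.empty(len(X), dtype=object)
        for i, g in enumerate(coarse):
            fine[i] = self.single[g] if g in self.single else self.local[g].predict(X[i:i + 1])[0]
        return fine
```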
Kruppa, Jochen; Liu, Yufeng; Biau, Gérard; Kohler, Michael; König, Inke R; Malley, James D; Ziegler, Andreas
2014-07-01
Probability estimation for binary and multicategory outcome using logistic and multinomial logistic regression has a long-standing tradition in biostatistics. However, biases may occur if the model is misspecified. In contrast, outcome probabilities for individuals can be estimated consistently with machine learning approaches, including k-nearest neighbors (k-NN), bagged nearest neighbors (b-NN), random forests (RF), and support vector machines (SVM). Because machine learning methods are rarely used by applied biostatisticians, the primary goal of this paper is to explain the concept of probability estimation with these methods and to summarize recent theoretical findings. Probability estimation in k-NN, b-NN, and RF can be embedded into the class of nonparametric regression learning machines; therefore, we start with the construction of nonparametric regression estimates and review results on consistency and rates of convergence. In SVMs, outcome probabilities for individuals are estimated consistently by repeatedly solving classification problems. For SVMs, we review the classification problem and then dichotomous probability estimation. Next we extend the algorithms for estimating probabilities using k-NN, b-NN, and RF to multicategory outcomes and discuss approaches for the multicategory probability estimation problem using SVM. In simulation studies for dichotomous and multicategory dependent variables we demonstrate the general validity of the machine learning methods and compare them with logistic regression. However, each method fails in at least one simulation scenario. We conclude with a discussion of the failures and give recommendations for selecting and tuning the methods. Applications to real data and example code are provided in a companion article (doi:10.1002/bimj.201300077). © 2014 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
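A small sketch of probability estimation with RF versus logistic regression, scored with the Brier score on held-out data; the simulated dataset and settings are illustrative only.

```python
# Compare estimated outcome probabilities from LR and RF on a held-out set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss

X, y = make_classification(n_samples=5000, n_features=20, random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=1)

for name, model in [("LR", LogisticRegression(max_iter=1000)),
                    ("RF", RandomForestClassifier(n_estimators=500, random_state=1))]:
    p = model.fit(Xtr, ytr).predict_proba(Xte)[:, 1]    # estimated P(y = 1 | x)
    print(f"{name}: Brier score = {brier_score_loss(yte, p):.3f}")
```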
Li, Yun; Zhang, Jin-Yu; Wang, Yuan-Zhong
2018-01-01
Three data fusion strategies (low-level, mid-level, and high-level) combined with a multivariate classification algorithm (random forest, RF) were applied to authenticate the geographical origins of Panax notoginseng collected from five regions of Yunnan province in China. In low-level fusion, the original data from two spectra (Fourier transform mid-IR spectrum and near-IR spectrum) were directly concatenated into a new matrix, which then was applied for the classification. Mid-level fusion was the strategy that inputted variables extracted from the spectral data into an RF classification model. The extracted variables were processed by iterative variable selection of the RF model and principal component analysis. The use of high-level fusion combined the decision making of each spectroscopic technique and resulted in an ensemble decision. The results showed that the mid-level and high-level data fusion took advantage of the information synergy from the two spectroscopic techniques and had better classification performance than that of independent decision making. High-level data fusion is the most effective strategy since the classification results are better than those of the other fusion strategies: accuracy rates ranged between 93% and 96% for the low-level data fusion, between 95% and 98% for the mid-level data fusion, and between 98% and 100% for the high-level data fusion. In conclusion, the high-level data fusion strategy for Fourier transform mid-IR and near-IR spectra can be used as a reliable tool for correct geographical identification of P. notoginseng. Graphical abstract: The analytical steps of Fourier transform mid-IR and near-IR spectral data fusion for the geographical traceability of Panax notoginseng.
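The three fusion levels can be schematised with RF as below, assuming two pre-processed spectral blocks `X_mir` (FT mid-IR) and `X_nir` (near-IR) with shared origin labels `y` and matching test blocks `X_new_mir` and `X_new_nir`; PCA scores stand in for the paper's RF-based variable selection, and all names and settings are illustrative.

```python
# Low-, mid- and high-level fusion of two spectral blocks with random forests.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

rf = lambda: RandomForestClassifier(n_estimators=500, random_state=0)

# Low-level: concatenate the raw spectra and fit one model.
low = rf().fit(np.hstack([X_mir, X_nir]), y)

# Mid-level: extract compact features per block (PCA scores here), then fuse and fit.
feat = np.hstack([PCA(n_components=10).fit_transform(X_mir),
                  PCA(n_components=10).fit_transform(X_nir)])
mid = rf().fit(feat, y)

# High-level: fit one model per block and fuse their class-probability outputs.
m1, m2 = rf().fit(X_mir, y), rf().fit(X_nir, y)
proba = (m1.predict_proba(X_new_mir) + m2.predict_proba(X_new_nir)) / 2.0
high_pred = m1.classes_[np.argmax(proba, axis=1)]
```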
Machine learning-based dual-energy CT parametric mapping
NASA Astrophysics Data System (ADS)
Su, Kuan-Hao; Kuo, Jung-Wen; Jordan, David W.; Van Hedent, Steven; Klahr, Paul; Wei, Zhouping; Helo, Rose Al; Liang, Fan; Qian, Pengjiang; Pereira, Gisele C.; Rassouli, Negin; Gilkeson, Robert C.; Traughber, Bryan J.; Cheng, Chee-Wai; Muzic, Raymond F., Jr.
2018-06-01
The aim is to develop and evaluate machine learning methods for generating quantitative parametric maps of effective atomic number (Zeff), relative electron density (ρe), mean excitation energy (Ix), and relative stopping power (RSP) from clinical dual-energy CT data. The maps could be used for material identification and radiation dose calculation. Machine learning methods of historical centroid (HC), random forest (RF), and artificial neural networks (ANN) were used to learn the relationship between dual-energy CT input data and ideal output parametric maps calculated for phantoms from the known compositions of 13 tissue substitutes. After training and model selection steps, the machine learning predictors were used to generate parametric maps from independent phantom and patient input data. Precision and accuracy were evaluated using the ideal maps. This process was repeated for a range of exposure doses, and performance was compared to that of the clinically-used dual-energy, physics-based method which served as the reference. The machine learning methods generated more accurate and precise parametric maps than those obtained using the reference method. Their performance advantage was particularly evident when using data from the lowest exposure, one-fifth of a typical clinical abdomen CT acquisition. The RF method achieved the greatest accuracy. In comparison, the ANN method was only 1% less accurate but had much better computational efficiency than RF, being able to produce parametric maps in 15 s. Machine learning methods outperformed the reference method in terms of accuracy and noise tolerance when generating parametric maps, encouraging further exploration of the techniques. Among the methods we evaluated, ANN is the most suitable for clinical use due to its combination of accuracy, excellent low-noise performance, and computational efficiency.
Urinary Volatile Organic Compounds for the Detection of Prostate Cancer
Khalid, Tanzeela; Aggio, Raphael; White, Paul; De Lacy Costello, Ben; Persad, Raj; Al-Kateb, Huda; Jones, Peter; Probert, Chris S.; Ratcliffe, Norman
2015-01-01
The aim of this work was to investigate volatile organic compounds (VOCs) emanating from urine samples to determine whether they can be used to classify samples into those from prostate cancer and non-cancer groups. Participants were men referred for a trans-rectal ultrasound-guided prostate biopsy because of an elevated prostate specific antigen (PSA) level or abnormal findings on digital rectal examination. Urine samples were collected from patients with prostate cancer (n = 59) and cancer-free controls (n = 43), on the day of their biopsy, prior to their procedure. VOCs from the headspace of basified urine samples were extracted using solid-phase micro-extraction and analysed by gas chromatography/mass spectrometry. Classifiers were developed using Random Forest (RF) and Linear Discriminant Analysis (LDA) classification techniques. PSA alone had an accuracy of 62–64% in these samples. A model based on 4 VOCs, 2,6-dimethyl-7-octen-2-ol, pentanal, 3-octanone, and 2-octanone, was marginally more accurate (63–65%). When combined, PSA level and these four VOCs had mean accuracies of 74% and 65%, using RF and LDA, respectively. With repeated double cross-validation, the mean accuracies fell to 71% and 65%, using RF and LDA, respectively. Results from VOC profiling of urine headspace are encouraging and suggest that there are other metabolomic avenues worth exploring which could help improve the stratification of men at risk of prostate cancer. This study also adds to our knowledge on the profile of compounds found in basified urine, from controls and cancer patients, which is useful information for future studies comparing the urine from patients with other disease states. PMID:26599280
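The repeated double cross-validation mentioned above can be sketched as an inner loop that tunes the model and an outer loop that scores it, with the whole procedure repeated over fresh splits; the fold counts, grid values, and variable names are assumptions for illustration.

```python
# Repeated double (nested) cross-validation for an RF classifier.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

def repeated_double_cv(X, y, n_repeats=10):
    accuracies = []
    for rep in range(n_repeats):
        inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=rep)        # tuning folds
        outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=100 + rep)  # scoring folds
        tuned = GridSearchCV(RandomForestClassifier(random_state=rep),
                             {"n_estimators": [200, 500], "max_features": ["sqrt", None]},
                             cv=inner)
        accuracies.append(cross_val_score(tuned, X, y, cv=outer).mean())
    return np.mean(accuracies), np.std(accuracies)

# mean_acc, sd_acc = repeated_double_cv(X_vocs_and_psa, y_cancer)   # placeholder inputs
```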
Meng, Ran; Dennison, Philip E.; Jamison, Levi R.; van Riper, Charles; Nager, Pamela; Hultine, Kevin R.; Bean, Dan W.; Dudley, Tom
2012-01-01
The spread of tamarisk (Tamarix spp., also known as saltcedar) is a significant ecological disturbance in western North America and has long been targeted for control, leading to the importation of the northern tamarisk beetle (Diorhabda carinulata) as a biological control agent. Following its initial release along the Colorado River near Moab, Utah in 2004, the beetle has successfully established and defoliated tamarisk across much of the upper Colorado River Basin. However, the spatial distribution and seasonal timing of defoliation are complex and difficult to quantify over large areas. To address this challenge, we tested and compared two remote sensing approaches to mapping tamarisk defoliation: Disturbance Index (DI) and a decision tree method called Random Forest (RF). Based on multitemporal Landsat 5 TM imagery for 2006-2010, changes in DI and defoliation probability from RF were calculated to detect tamarisk defoliation along the banks of Green, Colorado, Dolores and San Juan rivers within the Colorado Plateau area. Defoliation mapping accuracy was assessed based on field surveys partitioned into 10 km sections of river and on regions of interest created for continuous riparian vegetation. The DI method detected 3711 ha of defoliated area in 2007, 7350 ha in 2008, 10,457 ha in 2009 and 5898 ha in 2010. The RF method detected much smaller areas of defoliation but proved to have higher accuracy, as demonstrated by accuracy assessment and sensitivity analysis, with 784 ha in 2007, 960 ha in 2008, 934 ha in 2009, and 1008 ha in 2010. Results indicate that remote sensing approaches are likely to be useful for studying spatiotemporal patterns of tamarisk defoliation as the tamarisk leaf beetle spreads throughout the western United States.
Icing detection from geostationary satellite data using machine learning approaches
NASA Astrophysics Data System (ADS)
Lee, J.; Ha, S.; Sim, S.; Im, J.
2015-12-01
Icing can cause significant structural damage to aircraft during flight, resulting in various aviation accidents. Icing studies have typically been performed using two approaches: one is a numerical model-based approach and the other is a remote sensing-based approach. The model-based approach diagnoses aircraft icing using numerical atmospheric parameters such as temperature, relative humidity, and vertical thermodynamic structure. This approach tends to over-estimate icing, according to the literature. The remote sensing-based approach typically uses meteorological satellite/ground sensor data such as Geostationary Operational Environmental Satellite (GOES) and Dual-Polarization radar data. This approach detects icing areas by applying thresholds to parameters such as liquid water path and cloud optical thickness derived from remote sensing data. In this study, we propose an aircraft icing detection approach which optimizes thresholds for L1B bands and/or Cloud Optical Thickness (COT) from the Communication, Ocean and Meteorological Satellite-Meteorological Imager (COMS MI) and the newly launched Himawari-8 Advanced Himawari Imager (AHI) over East Asia. The proposed approach uses machine learning algorithms including decision trees (DT) and random forest (RF) for optimizing thresholds of L1B data and/or COT. Pilot Reports (PIREPs) from South Korea and Japan were used as icing reference data. Results show that RF produced a lower false alarm rate (1.5%) and a higher overall accuracy (98.8%) than DT (8.5% and 75.3%), respectively. The RF-based approach was also compared with the existing COMS MI and GOES-R icing mask algorithms. The agreements of the proposed approach with the existing two algorithms were 89.2% and 45.5%, respectively. The lower agreement with the GOES-R algorithm was possibly due to the high uncertainty of the cloud phase product from COMS MI.
Hollings, Tracey; Robinson, Andrew; van Andel, Mary; Jewell, Chris; Burgman, Mark
2017-01-01
In livestock industries, reliable up-to-date spatial distribution and abundance records for animals and farms are critical for governments to manage and respond to risks. Yet few, if any, countries can afford to maintain comprehensive, up-to-date agricultural census data. Statistical modelling can be used as a proxy for such data but comparative modelling studies have rarely been undertaken for livestock populations. Widespread species, including livestock, can be difficult to model effectively due to complex spatial distributions that do not respond predictably to environmental gradients. We assessed three machine learning species distribution models (SDM) for their capacity to estimate national-level farm animal population numbers within property boundaries: boosted regression trees (BRT), random forests (RF) and K-nearest neighbour (K-NN). The models were built from a commercial livestock database and environmental and socio-economic predictor data for New Zealand. We used two spatial data stratifications to test (i) support for decision making in an emergency response situation, and (ii) the ability for the models to predict to new geographic regions. The performance of the three model types varied substantially, but the best performing models showed very high accuracy. BRTs had the best performance overall, but RF performed equally well or better in many simulations; RFs were superior at predicting livestock numbers for all but very large commercial farms. K-NN performed poorly relative to both RF and BRT in all simulations. The predictions of both multi species and single species models for farms and within hypothetical quarantine zones were very close to observed data. These models are generally applicable for livestock estimation with broad applications in disease risk modelling, biosecurity, policy and planning.
Hafiz, Pegah; Nematollahi, Mohtaram; Boostani, Reza; Namavar Jahromi, Bahia
2017-10-01
In vitro fertilization (IVF) and intracytoplasmic sperm injection (ICSI) are two important subsets of the assisted reproductive techniques, used for the treatment of infertility. Predicting the implantation outcome of IVF/ICSI, or the chance of pregnancy, is essential for infertile couples, since these treatments are complex and expensive with a low probability of conception. In this cross-sectional study, the data of 486 patients were collected using the census method. The IVF/ICSI dataset contains 29 variables along with an identifier for each patient that is either negative or positive. Mean accuracy and mean area under the receiver operating characteristic (ROC) curve are calculated for the classifiers. Sensitivity, specificity, positive and negative predictive values, and likelihood ratios of the classifiers are employed as indicators of performance. The state-of-the-art classifiers which are candidates for this study include support vector machines, recursive partitioning (RPART), random forest (RF), adaptive boosting, and one-nearest neighbor. RF and RPART outperform the other comparable methods. The results revealed areas under the ROC curve (AUC) of 84.23% and 82.05%, respectively. The importance of IVF/ICSI features was extracted from the output of RPART. Our findings demonstrate that the probability of pregnancy is low for women aged above 38. Classifiers RF and RPART are better at predicting IVF/ICSI cases compared to the other decision makers tested in our study. Elicited decision rules of RPART determine useful predictive features of IVF/ICSI. Out of 20 factors, the age of the woman, the number of developed embryos, and the serum estradiol level on the day of human chorionic gonadotropin administration are the three best features for such prediction. Copyright © by Royan Institute. All rights reserved.
Prosperi, Mattia C. F.; Rosen-Zvi, Michal; Altmann, André; Zazzi, Maurizio; Di Giambenedetto, Simona; Kaiser, Rolf; Schülter, Eugen; Struck, Daniel; Sloot, Peter; van de Vijver, David A.; Vandamme, Anne-Mieke; Sönnerborg, Anders
2010-01-01
Background: Although genotypic resistance testing (GRT) is recommended to guide combination antiretroviral therapy (cART), funding and/or facilities to perform GRT may not be available in low to middle income countries. Since treatment history (TH) impacts response to subsequent therapy, we investigated a set of statistical learning models to optimise cART in the absence of GRT information. Methods and Findings: The EuResist database was used to extract 8-week and 24-week treatment change episodes (TCE) with GRT and additional clinical, demographic and TH information. Random Forest (RF) classification was used to predict 8- and 24-week success, defined as undetectable HIV-1 RNA, comparing nested models including (i) GRT+TH and (ii) TH without GRT, using multiple cross-validation and area under the receiver operating characteristic curve (AUC). Virological success was achieved in 68.2% and 68.0% of TCE at 8- and 24-weeks (n = 2,831 and 2,579), respectively. RF (i) and (ii) showed comparable performances, with an average (standard deviation) AUC of 0.77 (0.031) vs. 0.757 (0.035) at 8 weeks and 0.834 (0.027) vs. 0.821 (0.025) at 24 weeks. Sensitivity analyses, carried out on a data subset that included antiretroviral regimens commonly used in low to middle income countries, confirmed our findings. Training on subtype B and validation on non-B isolates resulted in a decline of performance for models (i) and (ii). Conclusions: Treatment history-based RF prediction models are comparable to GRT-based models for classification of virological outcome. These results may be relevant for therapy optimisation in areas where availability of GRT is limited. Further investigations are required in order to account for different demographics, subtypes and different therapy switching strategies. PMID:21060792
Park, Eunjeong; Chang, Hyuk-Jae; Nam, Hyo Suk
2017-04-18
The pronator drift test (PDT), a neurological examination, is widely used in clinics to measure motor weakness of stroke patients. The aim of this study was to develop a PDT tool with machine learning classifiers to detect stroke symptoms based on quantification of proximal arm weakness using inertial sensors and signal processing. We extracted features of drift and pronation from accelerometer signals of wearable devices on the inner wrists of 16 stroke patients and 10 healthy controls. Signal processing and feature selection approach were applied to discriminate PDT features used to classify stroke patients. A series of machine learning techniques, namely support vector machine (SVM), radial basis function network (RBFN), and random forest (RF), were implemented to discriminate stroke patients from controls with leave-one-out cross-validation. Signal processing by the PDT tool extracted a total of 12 PDT features from sensors. Feature selection abstracted the major attributes from the 12 PDT features to elucidate the dominant characteristics of proximal weakness of stroke patients using machine learning classification. Our proposed PDT classifiers had an area under the receiver operating characteristic curve (AUC) of .806 (SVM), .769 (RBFN), and .900 (RF) without feature selection, and feature selection improves the AUCs to .913 (SVM), .956 (RBFN), and .975 (RF), representing an average performance enhancement of 15.3%. Sensors and machine learning methods can reliably detect stroke signs and quantify proximal arm weakness. Our proposed solution will facilitate pervasive monitoring of stroke patients. ©Eunjeong Park, Hyuk-Jae Chang, Hyo Suk Nam. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 18.04.2017.
Variability of pCO2 in surface waters and development of prediction model.
Chung, Sewoong; Park, Hyungseok; Yoo, Jisu
2018-05-01
Inland waters are substantial sources of atmospheric carbon, but relevant data are rare in Asian monsoon regions including Korea. Emissions of CO2 to the atmosphere depend largely on the partial pressure of CO2 (pCO2) in water; however, measured pCO2 data are scarce and calculated pCO2 can show large uncertainty. This study had three objectives: 1) to examine the spatial variability of pCO2 in diverse surface water systems in Korea; 2) to compare pCO2 calculated using pH-total alkalinity (Alk) and pH-dissolved inorganic carbon (DIC) with pCO2 measured by an in situ submersible nondispersive infrared detector; and 3) to characterize the major environmental variables determining the variation of pCO2 based on physical, chemical, and biological data collected concomitantly. Of 30 samples, 80% were found supersaturated in CO2 with respect to the overlying atmosphere. Calculated pCO2 using pH-Alk and pH-DIC showed weak prediction capability and large variations with respect to measured pCO2. Error analysis indicated that calculated pCO2 is highly sensitive to the accuracy of pH measurements, particularly at low pH. Stepwise multiple linear regression (MLR) and random forest (RF) techniques were implemented to develop the most parsimonious model based on 10 potential predictor variables (pH, Alk, DIC, Uw, Cond, Turb, COD, DOC, TOC, Chla) by optimizing model performance. The RF model showed better performance than the MLR model, and the most parsimonious RF model (pH, Turb, Uw, Chla) improved pCO2 prediction capability considerably compared with the simple calculation approach, reducing the RMSE from 527-544 μatm to 105 μatm at the study sites. Copyright © 2017 Elsevier B.V. All rights reserved.
Wang, Wei; Liu, Juan; Sun, Lin
2016-07-01
Protein-DNA binding is critical to many biological processes. However, the structural mechanisms underlying these interactions are not fully understood. Here, we analyzed the residue shape (peak, flat, or valley) and the surrounding environment of double-stranded DNA-binding proteins (DSBs) and single-stranded DNA-binding proteins (SSBs) in protein-DNA interfaces. We found that the interface shapes, hydrogen bonds, and the surrounding environment show significant differences between the two kinds of proteins. Building on these results, we constructed a random forest (RF) classifier to distinguish DSBs from SSBs with satisfactory performance. In conclusion, we present a novel methodology to characterize protein interfaces, which will deepen our understanding of the specificity of proteins binding to ssDNA (single-stranded DNA) or dsDNA (double-stranded DNA). Proteins 2016; 84:979-989. © 2016 Wiley Periodicals, Inc.
Multi-fractal texture features for brain tumor and edema segmentation
NASA Astrophysics Data System (ADS)
Reza, S.; Iftekharuddin, K. M.
2014-03-01
In this work, we propose a fully automatic brain tumor and edema segmentation technique in brain magnetic resonance (MR) images. Different brain tissues are characterized using novel texture features such as piece-wise triangular prism surface area (PTPSA), multi-fractional Brownian motion (mBm) and Gabor-like textons, along with regular intensity and intensity difference features. A classical Random Forest (RF) classifier is used to formulate the segmentation task as classification of these features in multi-modal MRIs. The segmentation performance is compared with other state-of-the-art works using a publicly available dataset known as Brain Tumor Segmentation (BRATS) 2012 [1]. Quantitative evaluation is done using the online evaluation tool from the Kitware/MIDAS website [2]. The results show that our segmentation performance is more consistent and, on average, outperforms other state-of-the-art works in both training and challenge cases in the BRATS competition.
A Comparison of Machine Learning Approaches for Corn Yield Estimation
NASA Astrophysics Data System (ADS)
Kim, N.; Lee, Y. W.
2017-12-01
Machine learning is an efficient empirical method for classification and prediction, and it offers another approach to crop yield estimation. The objective of this study is to estimate corn yield in the Midwestern United States by employing machine learning approaches such as the support vector machine (SVM), random forest (RF), and deep neural networks (DNN), and to perform a comprehensive comparison of their results. We constructed the database using satellite images from MODIS, climate data from the PRISM climate group, and GLDAS soil moisture data. In addition, to examine the seasonal sensitivities of corn yields, two period groups were set up: May to September (MJJAS) and July and August (JA). Overall, the DNN showed the highest accuracy in terms of the correlation coefficient for the two period groups. The differences between our predictions and USDA yield statistics were about 10-11%.
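As a rough illustration of this kind of comparison, here is a hedged sketch that scores SVM, RF, and a small neural network (a stand-in for the study's DNN) by the Pearson correlation between predicted and observed yields; the synthetic features and targets are assumptions, not MODIS/PRISM/GLDAS data.

```python
# Illustrative sketch only: compare three regressors for yield estimation
# using cross-validated predictions and the correlation coefficient.
import numpy as np
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))            # stand-ins for vegetation, climate, soil predictors
y = 9.0 + 1.5 * X[:, 0] - 0.8 * X[:, 2] + rng.normal(scale=0.5, size=200)  # pseudo yield (t/ha)

models = {
    "SVM": SVR(kernel="rbf", C=10.0),
    "RF": RandomForestRegressor(n_estimators=300, random_state=0),
    "DNN": MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0),
}
for name, model in models.items():
    pred = cross_val_predict(model, X, y, cv=5)
    r = np.corrcoef(y, pred)[0, 1]
    print(f"{name}: r = {r:.3f}")
```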
Assessments of SENTINEL-2 Vegetation Red-Edge Spectral Bands for Improving Land Cover Classification
NASA Astrophysics Data System (ADS)
Qiu, S.; He, B.; Yin, C.; Liao, Z.
2017-09-01
The Multi Spectral Instrument (MSI) onboard Sentinel-2 can record the information in Vegetation Red-Edge (VRE) spectral domains. In this study, the performance of the VRE bands on improving land cover classification was evaluated based on a Sentinel-2A MSI image in East Texas, USA. Two classification scenarios were designed by excluding and including the VRE bands. A Random Forest (RF) classifier was used to generate land cover maps and evaluate the contributions of different spectral bands. The combination of VRE bands increased the overall classification accuracy by 1.40 %, which was statistically significant. Both confusion matrices and land cover maps indicated that the most beneficial increase was from vegetation-related land cover types, especially agriculture. Comparison of the relative importance of each band showed that the most beneficial VRE bands were Band 5 and Band 6. These results demonstrated the value of VRE bands for land cover classification.
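A minimal sketch of the two-scenario design and the band-importance ranking follows; the band labels, the synthetic reflectances, and the class labels are assumptions, not Sentinel-2 measurements, and the significance testing used in the study is omitted.

```python
# Hedged sketch: classify with and without the red-edge (VRE) bands, then rank
# band importance with the fitted random forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
bands = ["B2", "B3", "B4", "B5", "B6", "B7", "B8", "B11", "B12"]
X = rng.normal(size=(500, len(bands)))
y = (X[:, 4] + X[:, 5] + rng.normal(scale=0.5, size=500) > 0).astype(int)  # VRE-driven classes

no_vre = [bands.index(b) for b in bands if b not in ("B5", "B6", "B7")]
rf = RandomForestClassifier(n_estimators=500, random_state=0)
acc_without = cross_val_score(rf, X[:, no_vre], y, cv=5).mean()
acc_with = cross_val_score(rf, X, y, cv=5).mean()
print(f"OA without VRE: {acc_without:.3f}  |  OA with VRE: {acc_with:.3f}")

rf.fit(X, y)
for b, imp in sorted(zip(bands, rf.feature_importances_), key=lambda t: -t[1]):
    print(f"{b}: importance {imp:.3f}")
```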
Random forests on Hadoop for genome-wide association studies of multivariate neuroimaging phenotypes
Wang, Yue; Goh, Wilson; Wong, Limsoon; Montana, Giovanni
2013-01-01
Multivariate quantitative traits arise naturally in recent neuroimaging genetics studies, in which both structural and functional variability of the human brain is measured non-invasively through techniques such as magnetic resonance imaging (MRI). There is growing interest in detecting genetic variants associated with such multivariate traits, especially in genome-wide studies. Random forests (RFs) classifiers, which are ensembles of decision trees, are amongst the best performing machine learning algorithms and have been successfully employed for the prioritisation of genetic variants in case-control studies. RFs can also be applied to produce gene rankings in association studies with multivariate quantitative traits, and to estimate genetic similarities measures that are predictive of the trait. However, in studies involving hundreds of thousands of SNPs and high-dimensional traits, a very large ensemble of trees must be inferred from the data in order to obtain reliable rankings, which makes the application of these algorithms computationally prohibitive. We have developed a parallel version of the RF algorithm for regression and genetic similarity learning tasks in large-scale population genetic association studies involving multivariate traits, called PaRFR (Parallel Random Forest Regression). Our implementation takes advantage of the MapReduce programming model and is deployed on Hadoop, an open-source software framework that supports data-intensive distributed applications. Notable speed-ups are obtained by introducing a distance-based criterion for node splitting in the tree estimation process. PaRFR has been applied to a genome-wide association study on Alzheimer's disease (AD) in which the quantitative trait consists of a high-dimensional neuroimaging phenotype describing longitudinal changes in the human brain structure. PaRFR provides a ranking of SNPs associated to this trait, and produces pair-wise measures of genetic proximity that can be directly compared to pair-wise measures of phenotypic proximity. Several known AD-related variants have been identified, including APOE4 and TOMM40. We also present experimental evidence supporting the hypothesis of a linear relationship between the number of top-ranked mutated states, or frequent mutation patterns, and an indicator of disease severity. The Java codes are freely available at http://www2.imperial.ac.uk/~gmontana.
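PaRFR itself is a Java/Hadoop implementation; the sketch below is only a single-machine analogue of the idea it exploits, namely that trees are grown independently (so the work can be distributed) and their variable rankings merged afterwards. The SNP matrix, trait, and forest sizes are invented for illustration.

```python
# Hedged sketch: build two half-forests independently (in principle on
# different compute nodes), then merge their trees to rank SNP importance.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.integers(0, 3, size=(500, 1000)).astype(float)        # pseudo SNP genotypes (0/1/2)
y = 0.4 * X[:, 10] - 0.3 * X[:, 200] + rng.normal(size=500)   # pseudo imaging trait

rf_a = RandomForestRegressor(n_estimators=250, random_state=1, n_jobs=-1).fit(X, y)
rf_b = RandomForestRegressor(n_estimators=250, random_state=2, n_jobs=-1).fit(X, y)

# Average per-tree importances across both sub-forests and rank SNPs.
importances = np.mean([t.feature_importances_
                       for t in rf_a.estimators_ + rf_b.estimators_], axis=0)
top = np.argsort(importances)[::-1][:5]
print("top-ranked SNP indices:", top)
```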
Method and apparatus for determining position using global positioning satellites
NASA Technical Reports Server (NTRS)
Ward, John (Inventor); Ward, William S. (Inventor)
1998-01-01
A global positioning satellite receiver having an antenna for receiving an L1 signal from a satellite. The L1 signal is processed by a preamplifier stage including a band pass filter and a low noise amplifier and output as a radio frequency (RF) signal. A mixer receives and de-spreads the RF signal in response to a pseudo-random noise code, i.e., Gold code, generated by an internal pseudo-random noise code generator. A microprocessor enters a code tracking loop, such that during the code tracking loop, it addresses the pseudo-random code generator to cause the pseudo-random code generator to sequentially output pseudo-random codes corresponding to satellite codes used to spread the L1 signal, until correlation occurs. When an output of the mixer is indicative of the occurrence of correlation between the RF signal and the generated pseudo-random codes, the microprocessor enters an operational state which slows the receiver code sequence to stay locked with the satellite code sequence. The output of the mixer is provided to a detector which, in turn, controls certain routines of the microprocessor. The microprocessor will output pseudo-range information according to an interrupt routine in response to detection of correlation. The pseudo-range information is to be telemetered to a ground station which determines the position of the global positioning satellite receiver.
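The numpy sketch below illustrates, in a highly simplified form, the code-search idea in the receiver: de-spread the incoming signal with candidate pseudo-random codes until a correlation peak indicates lock. Real Gold-code generation, carrier mixing, Doppler search, and the tracking loop are omitted; the code length, noise level, and threshold are assumptions.

```python
# Hedged sketch: correlation-based detection of which PRN code spread a signal.
import numpy as np

rng = np.random.default_rng(4)
codes = [rng.choice([-1, 1], size=1023) for _ in range(8)]   # stand-ins for satellite PRN codes
true_sv = 5
received = codes[true_sv] + rng.normal(scale=1.0, size=1023) # spread signal plus noise

for sv, code in enumerate(codes):
    corr = abs(np.dot(received, code)) / len(code)           # normalized correlation
    print(f"SV {sv}: correlation {corr:.2f}" + ("  <-- lock" if corr > 0.5 else ""))
```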
Antonsen, Bjørnar T; Johansen, Merete S; Rø, Frida G; Kvarstein, Elfrida H; Wilberg, Theresa
2016-01-01
Mentalization is the capacity to understand behavior as the expression of various mental states and is assumed to be important in a range of psychopathologies, especially personality disorders (PDs). The first aim of the present study was to investigate the relationship between mentalization capacity, operationalized as reflective functioning (RF), and clinical manifestations before entering study treatment. The second aim was to investigate the relationship between baseline RF and long-term clinical outcome both independent of treatment (predictor analyses) and dependent on treatment (moderator analyses). Seventy-nine patients from a randomized clinical trial (Ullevål Personality Project) who had borderline and/or avoidant PD were randomly assigned to either a step-down treatment program, comprising short-term day-hospital treatment followed by outpatient combined group and individual psychotherapy, or to outpatient individual psychotherapy. Patients were evaluated on variables including symptomatic distress, psychosocial functioning, personality functioning, and self-esteem at baseline, 8 and 18 months, and 3 and 6 years. RF was significantly associated with a wide range of variables at baseline. In longitudinal analyses RF was not found to be a predictor of long-term clinical outcome. However, when considering treatment type, there were significant moderator effects of RF. Patients with low RF had better outcomes in outpatient individual therapy compared to the step-down program. In contrast, patients in the medium RF group achieved better results in the step-down program. These findings indicate that RF is associated with core aspects of personality pathology and captures clinically relevant phenomena in adult patients with PDs. Moreover, patients with different capacities for mentalization may need different kinds of therapeutic approaches. Copyright © 2015 The Authors. Published by Elsevier Inc. All rights reserved.
An Empirical Modelation of Runoff in Small Watersheds Using LiDAR Data
NASA Astrophysics Data System (ADS)
Lopatin, J.; Hernández, J.; Galleguillos, M.; Mancilla, G.
2013-12-01
Hydrological models allow the simulation of natural water processes as well as the quantification and prediction of the effects of human impacts on runoff behavior. However, obtaining the information needed to apply these models can be costly in both time and resources, especially in large and difficult-to-access areas. The objective of this research was to integrate LiDAR data into the hydrological modeling of runoff in small watersheds, using derived hydrologic, vegetation and topographic variables. The study area includes 10 small, forest-covered headwater catchments, between 2 and 16 ha, located in the south-central coastal range of Chile. In each of them, instantaneous rainfall and runoff flow were measured for a total of 15 rainfall events between August 2012 and July 2013, yielding 79 observations. In March 2011 a Harrier 54/G4 Dual System was used to obtain a discrete-pulse LiDAR point cloud with an average of 4.64 points per square meter. A Digital Terrain Model (DTM) of 1 m resolution was obtained from the point cloud, and 55 topographic variables were subsequently derived, such as physical watershed parameters and morphometric features. In addition, 30 vegetation descriptive variables were obtained directly from the point cloud and from a Digital Canopy Model (DCM). The classification and regression "Random Forest" (RF) algorithm was used to select the most important variables for predicting water height (liters), and the "Partial Least Squares Path Modeling" (PLS-PM) algorithm was used to fit a model using the selected set of variables. Four latent variables were selected (outer model), related to climate, topography, vegetation and runoff, and to each of them a group of the predictor variables selected by RF was assigned (inner model). The coefficient of determination (R2) and Goodness-of-Fit (GoF) of the final model were obtained. The best results were found when modeling using only the upper 50th percentile of rainfall events. The best variables selected by the RF algorithm were three topographic variables and three vegetation-related ones. We obtained an R2 of 0.82 and a GoF of 0.87 with a 95% confidence interval. This study shows that it is possible to predict the water harvested during a rainstorm event in a forest environment using only LiDAR data. However, this type of methodology does not yield good results for flows produced by low-magnitude rainfall events, as these are more influenced by the initial conditions of soil, vegetation and climate, which make their behavior slower and more erratic.
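A hedged sketch of the variable-screening step only is given below: rank many LiDAR-derived candidate predictors with a random forest and keep the top few for further modelling (the study then fed the selected variables into a PLS path model, which is not shown). The data, the number of candidates, and the cut-off are illustrative assumptions.

```python
# Hedged sketch: RF-based importance ranking of candidate predictors for runoff.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
n_obs, n_vars = 79, 85                       # 79 rainfall-runoff observations, 85 candidates
X = rng.normal(size=(n_obs, n_vars))
y = 2.0 * X[:, 3] + 1.5 * X[:, 40] + rng.normal(scale=0.5, size=n_obs)  # pseudo runoff (litres)

rf = RandomForestRegressor(n_estimators=1000, random_state=0, oob_score=True).fit(X, y)
ranked = np.argsort(rf.feature_importances_)[::-1]
print("OOB R2:", round(rf.oob_score_, 3))
print("top 6 predictors (indices):", ranked[:6])
```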
NASA Astrophysics Data System (ADS)
Singh, Leeth; Mutanga, Onisimo; Mafongoya, Paramu; Peerbhay, Kabir
2017-07-01
The concentration of forage fiber content is critical in explaining the palatability of forage quality for livestock grazers in tropical grasslands. Traditional methods of determining forage fiber content are usually time consuming, costly, and require specialized laboratory analysis. With the potential of remote sensing technologies, determination of key fiber attributes can be made more accurately. This study aims to determine the effectiveness of known absorption wavelengths for detecting forage fiber biochemicals, neutral detergent fiber, acid detergent fiber, and lignin using hyperspectral data. Hyperspectral reflectance spectral measurements (350 to 2500 nm) of grass were collected and implemented within the random forest (RF) ensemble. Results show successful correlations between the known absorption features and the biochemicals with coefficients of determination (R2) ranging from 0.57 to 0.81 and root mean square errors ranging from 6.97 to 3.03 g/kg. In comparison, using the entire dataset, the study identified additional wavelengths for detecting fiber biochemicals, which contributes to the accurate determination of forage quality in a grassland environment. Overall, the results showed that hyperspectral remote sensing in conjunction with the competent RF ensemble could discriminate each key biochemical evaluated. This study shows the potential to upscale the methodology to a space-borne multispectral platform with similar spectral configurations for an accurate and cost effective mapping analysis of forage quality.
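For illustration, the sketch below regresses a fibre trait on reflectance bands with a random forest and flags the most informative wavelengths via permutation importance; the wavelength grid, spectra, and response are synthetic stand-ins, not the field measurements used in the study.

```python
# Hedged sketch: RF regression on spectra plus permutation importance to
# highlight informative wavelengths.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
wavelengths = np.arange(350, 2500, 10)                 # nm
X = rng.normal(size=(120, wavelengths.size))
y = 300 + 20 * X[:, 120] - 15 * X[:, 180] + rng.normal(scale=5, size=120)  # pseudo NDF (g/kg)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_tr, y_tr)
imp = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
top = np.argsort(imp.importances_mean)[::-1][:5]
print("most informative wavelengths (nm):", wavelengths[top])
```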
Comparison modeling for alpine vegetation distribution in an arid area.
Zhou, Jihua; Lai, Liming; Guan, Tianyu; Cai, Wetao; Gao, Nannan; Zhang, Xiaolong; Yang, Dawen; Cong, Zhentao; Zheng, Yuanrun
2016-07-01
Mapping and modeling vegetation distribution are fundamental topics in vegetation ecology. With the rise of powerful new statistical techniques and GIS tools, the development of predictive vegetation distribution models has increased rapidly. However, modeling alpine vegetation with high accuracy in arid areas is still a challenge because of the complexity and heterogeneity of the environment. Here, we used a set of 70 variables from ASTER GDEM, WorldClim, and Landsat-8 OLI (land surface albedo and spectral vegetation indices) data with decision tree (DT), maximum likelihood classification (MLC), and random forest (RF) models to discriminate the eight vegetation groups and 19 vegetation formations in the upper reaches of the Heihe River Basin in the Qilian Mountains, northwest China. The combination of variables clearly discriminated vegetation groups but failed to discriminate vegetation formations. Different variable combinations performed differently in each type of model, but the most consistently important parameter in alpine vegetation modeling was elevation. The best RF model was more accurate for vegetation modeling compared with the DT and MLC models for this alpine region, with an overall accuracy of 75 % and a kappa coefficient of 0.64 verified against field point data and an overall accuracy of 65 % and a kappa of 0.52 verified against vegetation map data. The accuracy of regional vegetation modeling differed depending on the variable combinations and models, resulting in different classifications for specific vegetation groups.
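A small sketch of the accuracy assessment used to compare such models follows: overall accuracy and Cohen's kappa computed from predicted versus reference vegetation classes. The labels here are synthetic, not the field or map verification data from the study.

```python
# Hedged sketch: overall accuracy, kappa, and a confusion matrix for a map
# with eight vegetation groups.
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score, confusion_matrix

rng = np.random.default_rng(7)
reference = rng.integers(0, 8, size=300)                   # 8 vegetation groups
predicted = np.where(rng.random(300) < 0.75, reference,    # roughly 75% agreement
                     rng.integers(0, 8, size=300))

print("overall accuracy:", round(accuracy_score(reference, predicted), 2))
print("kappa:", round(cohen_kappa_score(reference, predicted), 2))
print(confusion_matrix(reference, predicted))
```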
Mapping abnormal subcortical brain morphometry in an elderly HIV+ cohort.
Wade, Benjamin S C; Valcour, Victor G; Wendelken-Riegelhaupt, Lauren; Esmaeili-Firidouni, Pardis; Joshi, Shantanu H; Gutman, Boris A; Thompson, Paul M
2015-01-01
Over 50% of HIV+ individuals exhibit neurocognitive impairment and subcortical atrophy, but the profile of brain abnormalities associated with HIV is still poorly understood. Using surface-based shape analyses, we mapped the 3D profile of subcortical morphometry in 63 elderly HIV+ participants and 31 uninfected controls. The thalamus, caudate, putamen, pallidum, hippocampus, amygdala, brainstem, accumbens, callosum and ventricles were segmented from high-resolution MRIs. To investigate shape-based morphometry, we analyzed the Jacobian determinant (JD) and radial distances (RD) defined on each region's surfaces. We also investigated effects of nadir CD4+ T-cell counts, viral load, time since diagnosis (TSD) and cognition on subcortical morphology. Lastly, we explored whether HIV+ participants were distinguishable from unaffected controls in a machine learning context. All shape and volume features were included in a random forest (RF) model. The model was validated with 2-fold cross-validation. Volumes of HIV+ participants' bilateral thalamus, left pallidum, left putamen and callosum were significantly reduced while ventricular spaces were enlarged. Significant shape variation was associated with HIV status, TSD and the Wechsler adult intelligence scale. HIV+ people had diffuse atrophy, particularly in the caudate, putamen, hippocampus and thalamus. Unexpectedly, extended TSD was associated with increased thickness of the anterior right pallidum. In the classification of HIV+ participants vs. controls, our RF model attained an area under the curve of 72%.
NASA Astrophysics Data System (ADS)
Herkül, Kristjan; Peterson, Anneliis; Paekivi, Sander
2017-06-01
Both basic science and marine spatial planning are in a need of high resolution spatially continuous data on seabed habitats and biota. As conventional point-wise sampling is unable to cover large spatial extents in high detail, it must be supplemented with remote sensing and modeling in order to fulfill the scientific and management needs. The combined use of in situ sampling, sonar scanning, and mathematical modeling is becoming the main method for mapping both abiotic and biotic seabed features. Further development and testing of the methods in varying locations and environmental settings is essential for moving towards unified and generally accepted methodology. To fill the relevant research gap in the Baltic Sea, we used multibeam sonar and mathematical modeling methods - generalized additive models (GAM) and random forest (RF) - together with underwater video to map seabed substrate and epibenthos of offshore shallows. In addition to testing the general applicability of the proposed complex of techniques, the predictive power of different sonar-based variables and modeling algorithms were tested. Mean depth, followed by mean backscatter, were the most influential variables in most of the models. Generally, mean values of sonar-based variables had higher predictive power than their standard deviations. The predictive accuracy of RF was higher than that of GAM. To conclude, we found the method to be feasible and with predictive accuracy similar to previous studies of sonar-based mapping.
Invasive Shrub Mapping in an Urban Environment from Hyperspectral and LiDAR-Derived Attributes.
Chance, Curtis M; Coops, Nicholas C; Plowright, Andrew A; Tooke, Thoreau R; Christen, Andreas; Aven, Neal
2016-01-01
Proactive management of invasive species in urban areas is critical to restricting their overall distribution. The objective of this work is to determine whether advanced remote sensing technologies can help to detect invasions effectively and efficiently in complex urban ecosystems such as parks. In Surrey, BC, Canada, Himalayan blackberry (Rubus armeniacus) and English ivy (Hedera helix) are two invasive shrub species that can negatively affect native ecosystems in cities and managed urban parks. Random forest (RF) models were created to detect these two species using a combination of hyperspectral imagery, and light detection and ranging (LiDAR) data. LiDAR-derived predictor variables included irradiance models, canopy structural characteristics, and orographic variables. RF detection accuracy ranged from 77.8 to 87.8% for Himalayan blackberry and 81.9 to 82.1% for English ivy, with open areas classified more accurately than areas under canopy cover. English ivy was predicted to occur across a greater area than Himalayan blackberry both within parks and across the entire city. Both Himalayan blackberry and English ivy were mostly located in clusters according to a Local Moran's I analysis. The occurrence of both species decreased as the distance from roads increased. This study shows the feasibility of producing highly accurate detection maps of plant invasions in urban environments using a fusion of remotely sensed data, as well as the ability to use these products to guide management decisions.
Models for H₃ receptor antagonist activity of sulfonylurea derivatives.
Khatri, Naveen; Madan, A K
2014-03-01
The histamine H₃ receptor has been perceived as an auspicious target for the treatment of various central and peripheral nervous system diseases. In the present study, a wide variety of 60 2D and 3D molecular descriptors (MDs) were successfully utilized for the development of models for the prediction of the antagonist activity of sulfonylurea derivatives for histamine H₃ receptors. Models were developed through decision tree (DT), random forest (RF) and moving average analysis (MAA). Dragon software version 6.0.28 was employed for calculation of the values of diverse MDs of each analogue in the data set. The DT classified and correctly predicted the input data with an impressive non-error rate of 94% in the training set and 82.5% during cross validation. RF correctly classified the analogues into active and inactive with a non-error rate of 79.3%. The MAA-based models predicted the antagonist histamine H₃ receptor activity with a non-error rate of up to 90%. Active ranges of the proposed MAA-based models not only exhibited high potency but also showed improved safety, as indicated by relatively high values of the selectivity index. The statistical significance of the models was assessed through sensitivity, specificity, non-error rate, Matthew's correlation coefficient and intercorrelation analysis. The proposed models offer vast potential for providing lead structures for the development of potent but safe H₃ receptor antagonist sulfonylurea derivatives. Copyright © 2013 Elsevier Inc. All rights reserved.
Zhang, Bin; He, Xin; Ouyang, Fusheng; Gu, Dongsheng; Dong, Yuhao; Zhang, Lu; Mo, Xiaokai; Huang, Wenhui; Tian, Jie; Zhang, Shuixing
2017-09-10
We aimed to identify optimal machine-learning methods for radiomics-based prediction of local failure and distant failure in advanced nasopharyngeal carcinoma (NPC). We enrolled 110 patients with advanced NPC. A total of 970 radiomic features were extracted from MRI images for each patient. Six feature selection methods and nine classification methods were evaluated in terms of their performance. We applied 10-fold cross-validation as the criterion for feature selection and classification. We repeated each combination 50 times to obtain the mean area under the curve (AUC) and test error. We observed that the combination of Random Forest (RF) feature selection with the RF classifier (RF + RF; AUC, 0.8464 ± 0.0069; test error, 0.3135 ± 0.0088) had the highest prognostic performance, followed by RF + Adaptive Boosting (AdaBoost) (AUC, 0.8204 ± 0.0095; test error, 0.3384 ± 0.0097), and Sure Independence Screening (SIS) + Linear Support Vector Machines (LSVM) (AUC, 0.7883 ± 0.0096; test error, 0.3985 ± 0.0100). Our study identified optimal machine-learning methods for the radiomics-based prediction of local failure and distant failure in advanced NPC, which could enhance the applications of radiomics in precision oncology and clinical practice. Copyright © 2017 Elsevier B.V. All rights reserved.
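The sketch below illustrates the combination search on a small scale: pair feature-selection methods with classifiers, score each pair by repeated cross-validated AUC, and average over repeats. The selector and classifier choices are a small illustrative subset of the study's six-by-nine grid, and the simulated data only mimic the dimensions (110 patients, 970 features).

```python
# Hedged sketch: evaluate selector + classifier pairs by repeated 10-fold AUC.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif, SelectFromModel
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

X, y = make_classification(n_samples=110, n_features=970, n_informative=15, random_state=0)

selectors = {
    "univariate (SIS-like)": SelectKBest(f_classif, k=30),
    "RF importance": SelectFromModel(RandomForestClassifier(n_estimators=200, random_state=0),
                                     max_features=30),
}
classifiers = {
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}
for s_name, sel in selectors.items():
    for c_name, clf in classifiers.items():
        aucs = []
        for rep in range(5):                      # the paper used 50 repeats
            cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=rep)
            aucs.append(cross_val_score(Pipeline([("sel", sel), ("clf", clf)]),
                                        X, y, cv=cv, scoring="roc_auc").mean())
        print(f"{s_name} + {c_name}: AUC {np.mean(aucs):.3f} +/- {np.std(aucs):.3f}")
```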
NASA Astrophysics Data System (ADS)
Hu, Q.; Friedl, M. A.; Wu, W.
2017-12-01
Accurate and timely information regarding the spatial distribution of crop types and their changes is essential for acreage surveys, yield estimation, water management, and agricultural production decision-making. In recent years, increasing population, dietary shifts and climate change have driven drastic changes in China's agricultural land use. However, no maps are currently available that document the spatial and temporal patterns of these agricultural land use changes. Because of its short revisit period, rich spectral bands and global coverage, MODIS time series data has been shown to have great potential for detecting the seasonal dynamics of different crop types. However, its inherently coarse spatial resolution limits the accuracy with which crops can be identified from MODIS in regions with small fields or complex agricultural landscapes. To evaluate this more carefully and specifically understand the strengths and weaknesses of MODIS data for crop-type mapping, we used MODIS time-series imagery to map the sub-pixel fractional crop area for four major crop types (rice, corn, soybean and wheat) at 500-m spatial resolution for Heilongjiang province, one of the most important grain-production regions in China where recent agricultural land use change has been rapid and pronounced. To do this, a random forest regression (RF-g) model was constructed to estimate the percentage of each sub-pixel crop type in 2006, 2011 and 2016. Crop type maps generated through expert visual interpretation of high spatial resolution images (i.e., Landsat and SPOT data) were used to calibrate the regression model. Five different time series of vegetation indices (155 features) derived from different spectral channels of MODIS land surface reflectance (MOD09A1) data were used as candidate features for the RF-g model. An out-of-bag strategy and backward elimination approach was applied to select the optimal spectra-temporal feature subset for each crop type. The resulting crop maps were assessed in two ways: (1) wall-to-wall pixel comparison with corresponding high spatial resolution reference maps; and (2) county-level comparison with census data. Based on these derived maps, changes in crop type, total area, and spatial patterns of change in Heilongjiang province during 2006-2016 were analyzed.
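As a rough illustration of the feature-pruning step described above, here is a hedged sketch of backward elimination guided by the random forest's out-of-bag (OOB) score; the feature count, data, and stopping rule are assumptions, not the 155 MODIS-derived features or the authors' exact procedure.

```python
# Hedged sketch: drop the least important feature at each step and track OOB R2.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(8)
X = rng.normal(size=(400, 20))                         # stand-ins for VI time-series features
y = np.clip(0.3 * X[:, 2] - 0.2 * X[:, 7] + 0.5 + rng.normal(scale=0.1, size=400), 0, 1)  # crop fraction

features = list(range(X.shape[1]))
while len(features) > 5:
    rf = RandomForestRegressor(n_estimators=300, oob_score=True, random_state=0)
    rf.fit(X[:, features], y)
    print(f"{len(features)} features, OOB R2 = {rf.oob_score_:.3f}")
    worst = int(np.argmin(rf.feature_importances_))    # position within the current subset
    features.pop(worst)
print("retained features:", features)
```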
Awad, Aya; Bader-El-Den, Mohamed; McNicholas, James; Briggs, Jim
2017-12-01
Mortality prediction of hospitalized patients is an important problem. Over the past few decades, several severity scoring systems and machine learning mortality prediction models have been developed for predicting hospital mortality. By contrast, early mortality prediction for intensive care unit patients remains an open challenge. Most research has focused on severity of illness scoring systems or data mining (DM) models designed for risk estimation at least 24 or 48 h after ICU admission. This study highlights the main data challenges in early mortality prediction in ICU patients and introduces a new machine learning based framework for Early Mortality Prediction for Intensive Care Unit patients (EMPICU). The proposed method is evaluated on the Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II) database. Mortality prediction models are developed for patients aged 16 years or above in the Medical ICU (MICU), Surgical ICU (SICU) or Cardiac Surgery Recovery Unit (CSRU). We employ the ensemble learning Random Forest (RF), the predictive Decision Trees (DT), the probabilistic Naive Bayes (NB) and the rule-based Projective Adaptive Resonance Theory (PART) models. The primary outcome was hospital mortality. The explanatory variables included demographic, physiological, vital signs and laboratory test variables. Performance measures were calculated using cross-validated area under the receiver operating characteristic curve (AUROC) to minimize bias. 11,722 patients with single ICU stays are considered. The proposed EMPICU framework outperformed standard scoring systems (SOFA, SAPS-I, APACHE-II, NEWS and qSOFA) in terms of AUROC and time (i.e. at 6 h compared to 48 h or more after admission). The results show that although many values are missing in the first few hours of ICU admission, there is enough signal to effectively predict mortality during the first 6 h of admission. The proposed framework, in particular the variant that uses the ensemble learning approach - EMPICU Random Forest (EMPICU-RF) - offers a base for constructing an effective and novel mortality prediction model in the early hours of an ICU patient's admission, with an improved performance profile. Copyright © 2017 Elsevier B.V. All rights reserved.
Chen, Yang; Luo, Yan; Huang, Wei; Hu, Die; Zheng, Rong-Qin; Cong, Shu-Zhen; Meng, Fan-Kun; Yang, Hong; Lin, Hong-Jun; Sun, Yan; Wang, Xiu-Yan; Wu, Tao; Ren, Jie; Pei, Shu-Fang; Zheng, Ying; He, Yun; Hu, Yu; Yang, Na; Yan, Hongmei
2017-10-01
Hepatic fibrosis is a common middle stage of the pathological processes of chronic liver diseases. Clinical intervention during the early stages of hepatic fibrosis can slow the development of liver cirrhosis and reduce the risk of developing liver cancer. Performing a liver biopsy, the gold standard for viral liver disease management, has drawbacks such as invasiveness and a relatively high sampling error rate. Real-time tissue elastography (RTE), one of the most recently developed technologies, might be a promising imaging technology because it is both noninvasive and provides accurate assessments of hepatic fibrosis. However, determining the stage of liver fibrosis from RTE images in a clinic is a challenging task. In this study, in contrast to the previous liver fibrosis index (LFI) method, which predicts the stage of diagnosis using RTE images and multiple regression analysis, we employed four classical classifiers (i.e., Support Vector Machine, Naïve Bayes, Random Forest and K-Nearest Neighbor) to build a decision-support system to improve the hepatitis B stage diagnosis performance. Eleven RTE image features were obtained from 513 subjects who underwent liver biopsies in this multicenter collaborative research. The experimental results showed that the adopted classifiers significantly outperformed the LFI method and that the Random Forest (RF) classifier provided the highest average accuracy among the four machine learning algorithms. This result suggests that sophisticated machine-learning methods can be powerful tools for evaluating the stage of hepatic fibrosis and show promise for clinical applications. Copyright © 2017 Elsevier Ltd. All rights reserved.
Identification of phreatophytic groundwater dependent ecosystems using geospatial technologies
NASA Astrophysics Data System (ADS)
Perez Hoyos, Isabel Cristina
The protection of groundwater dependent ecosystems (GDEs) is increasingly being recognized as an essential aspect for the sustainable management and allocation of water resources. Ecosystem services are crucial for human well-being and for a variety of flora and fauna. However, the conservation of GDEs is only possible if knowledge about their location and extent is available. Several studies have focused on the identification of GDEs at specific locations using ground-based measurements. However, recent progress in technologies such as remote sensing and their integration with geographic information systems (GIS) has provided alternative ways to map GDEs at much larger spatial extents. This study is concerned with the discovery of patterns in geospatial data sets using data mining techniques for mapping phreatophytic GDEs in the United States at 1 km spatial resolution. A methodology to identify the probability of an ecosystem to be groundwater dependent is developed. Probabilities are obtained by modeling the relationship between the known locations of GDEs and main factors influencing groundwater dependency, namely water table depth (WTD) and aridity index (AI). A methodology is proposed to predict WTD at 1 km spatial resolution using relevant geospatial data sets calibrated with WTD observations. An ensemble learning algorithm called random forest (RF) is used in order to model the distribution of groundwater in three study areas: Nevada, California, and Washington, as well as in the entire United States. RF regression performance is compared with a single regression tree (RT). The comparison is based on contrasting training error, true prediction error, and variable importance estimates of both methods. Additionally, remote sensing variables are omitted from the process of fitting the RF model to the data to evaluate the deterioration in the model performance when these variables are not used as an input. Research results suggest that although the prediction accuracy of a single RT is reduced in comparison with RFs, single trees can still be used to understand the interactions that might be taking place between predictor variables and the response variable. Regarding RF, there is a great potential in using the power of an ensemble of trees for prediction of WTD. The superior capability of RF to accurately map water table position in Nevada, California, and Washington demonstrate that this technique can be applied at scales larger than regional levels. It is also shown that the removal of remote sensing variables from the RF training process degrades the performance of the model. Using the predicted WTD, the probability of an ecosystem to be groundwater dependent (GDE probability) is estimated at 1 km spatial resolution. The modeling technique is evaluated in the state of Nevada, USA to develop a systematic approach for the identification of GDEs and it is then applied in the United States. The modeling approach selected for the development of the GDE probability map results from a comparison of the performance of classification trees (CT) and classification forests (CF). Predictive performance evaluation for the selection of the most accurate model is achieved using a threshold independent technique, and the prediction accuracy of both models is assessed in greater detail using threshold-dependent measures. 
The resulting GDE probability map can potentially be used for the definition of conservation areas since it can be translated into a binary classification map with two classes: GDE and NON-GDE. These maps are created by selecting a probability threshold. It is demonstrated that the choice of this threshold has dramatic effects on deterministic model performance measures.
Chen, Yasheng; Dhar, Rajat; Heitsch, Laura; Ford, Andria; Fernandez-Cadenas, Israel; Carrera, Caty; Montaner, Joan; Lin, Weili; Shen, Dinggang; An, Hongyu; Lee, Jin-Moo
2016-01-01
Although cerebral edema is a major cause of death and deterioration following hemispheric stroke, there remains no validated biomarker that captures the full spectrum of this critical complication. We recently demonstrated that reduction in intracranial cerebrospinal fluid (CSF) volume (ΔCSF) on serial computed tomography (CT) scans provides an accurate measure of cerebral edema severity, which may aid in early triaging of stroke patients for craniectomy. However, application of such a volumetric approach would be too cumbersome to perform manually on serial scans in a real-world setting. We developed and validated an automated technique for CSF segmentation via integration of random forest (RF) based machine learning with geodesic active contour (GAC) segmentation. The proposed RF + GAC approach was compared to conventional Hounsfield Unit (HU) thresholding and RF segmentation methods using the Dice similarity coefficient (DSC) and the correlation of volumetric measurements, with manual delineation serving as the ground truth. CSF spaces were outlined on scans performed at baseline (<6 h after stroke onset) and early follow-up (FU) (closest to 24 h) in 38 acute ischemic stroke patients. RF performed significantly better than optimized HU thresholding (p < 10⁻⁴ at baseline and p < 10⁻⁵ at FU) and RF + GAC performed significantly better than RF (p < 10⁻³ at baseline and p < 10⁻⁵ at FU). Pearson correlation coefficients between the automatically detected ΔCSF and the ground truth were r = 0.178 (p = 0.285), r = 0.876 (p < 10⁻⁶) and r = 0.879 (p < 10⁻⁶) for thresholding, RF and RF + GAC, respectively, with a slope closer to the line of identity for RF + GAC. When we applied the algorithm trained on images from one stroke center to segment CTs from another center, similar findings held. In conclusion, we have developed and validated an accurate automated approach to segment CSF and calculate its shifts on serial CT scans. This algorithm will allow us to efficiently and accurately measure the evolution of cerebral edema in future studies including large multi-site patient populations.
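For reference, a tiny sketch of the Dice similarity coefficient used to score segmentations against manual delineation is given below; the binary masks here are synthetic, not CT-derived CSF masks.

```python
# Hedged sketch: Dice similarity coefficient between two binary masks.
import numpy as np

def dice(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """DSC = 2 * |A intersect B| / (|A| + |B|)."""
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

rng = np.random.default_rng(9)
manual = rng.random((128, 128)) > 0.8                     # pseudo manual CSF mask
auto = manual.copy()
flip = rng.random((128, 128)) > 0.98                      # perturb a few voxels
auto[flip] = ~auto[flip]
print("DSC:", round(dice(manual, auto), 3))
```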
Boreal Forest Fire Cools Climate
NASA Astrophysics Data System (ADS)
Randerson, J. T.; Liu, H.; Flanner, M.; Chambers, S. D.; Harden, J. W.; Hess, P. G.; Jin, Y.; Mack, M. C.; Pfister, G.; Schuur, E. A.; Treseder, K. K.; Welp, L. R.; Zender, C. S.
2005-12-01
We report measurements, modeling, and analysis of carbon and energy fluxes from a boreal forest fire that occurred in interior Alaska during 1999. In the first year after the fire, ozone production, atmospheric aerosol loading, greenhouse gas emissions, soot deposition, and decreases in summer albedo contributed to a positive annual radiative forcing (RF). These effects were partly offset by an increase in fall, winter, and spring albedo from reduced canopy cover and increased exposure of snow-covered surfaces. The atmospheric lifetimes of aerosols and ozone are relatively short (days to months). The radiative effects of soot on snow are also attenuated rapidly by the deposition of fresh snow. As a result, a year after the fire, only two classes of RF mechanisms remained: greenhouse gas emissions and post-fire changes in surface albedo. Summer albedo increased rapidly in subsequent years and was substantially higher than in unburned control areas (by more than 0.03) after 4 years as a result of grass and shrub establishment. Satellite measurements from MODIS of other interior Alaska burn scars provided evidence that elevated levels of spring and summer albedo (relative to unburned control areas) persisted for at least 4 decades after fire. In parallel, our chamber, eddy covariance, and biomass measurements indicated that the post-fire ecosystems switch from a carbon source to a sink within the first decade. Taken together, the extended period of increased spring and summer albedo and the carbon uptake of intermediate-aged stands appear to more than offset the initial warming pulse caused by fire emissions, when compared using the RF concept. This result suggests that management of forests in northern countries to suppress fire and preserve carbon sinks may have the opposite effect on climate to that intended.
Zhang, Zhen; Fei, Ye; Chen, Xiangdong; Lu, Wenli; Chen, Jinan
2013-04-01
No studies have compared fractional microplasma radio frequency (RF) technology with the carbon dioxide fractional laser system (CO2 FS) in the treatment of atrophic acne scars in the same patient. To compare the efficacy and safety of fractional microplasma RF with CO2 FS in the treatment of atrophic acne scars. Thirty-three Asian patients received three sessions of a randomized split-face treatment of fractional microplasma RF or CO2 FS. Both modalities had a roughly equivalent effect. Échelle d'Évaluation Clinique Des Cicatrices d'Acné scores were significantly lower after fractional microplasma RF (from 51.1 ± 14.2 to 22.3 ± 8.6, 56.4% improvement) and CO2 FS (from 48.8 ± 15.1 to 19.9 ± 7.9, 59.2% improvement) treatments. There was no statistically significant difference between the two therapies. Twelve subjects (36.4%) experienced postinflammatory hyperpigmentation (PIH) after 30 of 99 treatment sessions (30.3%) on the CO2 FS side and no PIH was observed on the fractional microplasma RF sides. Both modalities have good effects on treating atrophic scars. PIH was not seen with the fractional microplasma RF, which might make it a better choice for patients with darker skin. © 2013 by the American Society for Dermatologic Surgery, Inc. Published by Wiley Periodicals, Inc.
Meta-analysis of trials of streptococcal throat treatment programs to prevent rheumatic fever.
Lennon, Diana; Kerdemelidis, Melissa; Arroll, Bruce
2009-07-01
Rheumatic fever (RF) is the commonest cause of pediatric heart disease globally. Penicillin for streptococcal pharyngitis prevents RF. Inequitable access to health care persists. To investigate RF prevention by treating streptococcal pharyngitis in school- and/or community-based programs. Medline, Old Medline, the Cochrane Library, DARE, Central, NHS, EED, NICE, NRMC, Clinical Evidence, CDC website, PubMed, and reference lists of retrieved articles. Known researchers in the field were contacted where possible. Randomized, controlled trials or trials of before/after design examining treatment of sore throats in schools or communities with RF as an outcome, where data were able to be pooled for analysis. Two authors examined titles and abstracts, selected articles, and extracted data. Disagreements were resolved by consensus. Quantitative analysis tool: Review Manager version 4.2 was used to assess pooled relative risks and 95% confidence intervals. Six studies (of 677 screened) which met the criteria and could be pooled were included. Meta-analysis of these trials for RF control produced a relative risk of 0.41 (95% CI: 0.23-0.70). There was statistical heterogeneity (I² = 70.5%), hence a random effects analysis was conducted. Many studies were of poor quality. Titles and available abstracts of non-English studies were checked. There may be publication bias. This is the best available evidence in an area with imperfect information. It is expected that acute RF cases would diminish by about 60% using a school or community clinic to treat streptococcal pharyngitis. This should be considered in high-risk populations.
Goldberg, David J; Yatskayer, Margarita; Raab, Susana; Chen, Nannan; Krol, Yevgeniy; Oresajo, Christian
2014-10-01
Background: Skin laxity and cellulite on the buttocks and thighs are two common cosmetic concerns. Skin tightening with radiofrequency (RF) devices has become increasingly popular. The purpose of this study is to evaluate the efficacy and safety of a topical skin tightening agent when used in combination with an RF device. A double-blinded, randomized clinical trial enrolled twenty females with mild-to-moderate skin laxity on the posterior thighs/buttocks. Each subject underwent two monthly treatments with an RF source (Alma Accent) to both legs. Subjects were then randomized to apply a topical agent (Skinceuticals Body Tightening Concentrate) twice daily to only one designated thigh/buttock throughout the eight-week duration of the study. All subjects were evaluated for improvement in lifting, skin tone, radiance, firmness/tightness, skin texture, and overall appearance based on photographic evaluation by blinded investigators at 12 weeks following the final RF treatment. A statistically significant improvement was found in the overall appearance on both sides treated with the RF device when compared to baseline. However, the area treated with the topical agent showed a statistically significantly greater degree of improvement than the side where no topical agent was applied. No adverse effects were reported. The use of a novel skin tightening agent after RF procedures is both safe and effective for treatment of skin laxity on the buttocks and thighs. Combined therapy leads to a better result.
The Stream-Catchment (StreamCat) and Lake-Catchment ...
Background/Question/Methods: Lake and stream conditions respond to both natural and human-related landscape features. Characterizing these features within the contributing areas (i.e., delineated watersheds) of streams and lakes could improve our understanding of how biological conditions vary spatially and improve the use, management, and restoration of these aquatic resources. However, the specialized geospatial techniques required to define and characterize stream and lake watersheds have limited their widespread use in both scientific and management efforts at large spatial scales. We developed the StreamCat and LakeCat Datasets to model, predict, and map the probable biological conditions of streams and lakes across the conterminous US (CONUS). StreamCat and LakeCat contain watershed-level characterizations of several hundred natural (e.g., soils, geology, climate, and land cover) and anthropogenic (e.g., urbanization, agriculture, mining, and forest management) landscape features for ca. 2.6 million stream segments and 376,000 lakes across the CONUS, respectively. These datasets can be paired with field samples to provide independent variables for modeling and other analyses. We paired 1,380 stream and 1,073 lake samples from the USEPA's National Aquatic Resource Surveys with StreamCat and LakeCat and used random forest (RF) to model and then map an invertebrate condition index and chlorophyll a concentration, respectively. Results/Conclusions: The invertebrate
NASA Astrophysics Data System (ADS)
Herguido, Estela; Pulido, Manuel; Francisco Lavado Contador, Joaquín; Schnabel, Susanne
2017-04-01
In Iberian dehesas and montados, the lack of tree recruitment compromises their long-term sustainability. However, in marginal areas of dehesas shrub encroachment facilitates tree recruitment while altering the distinctive physiognomic and cultural characteristics of the system. These are ongoing processes that should be considered when designing afforestation measures and policies. Based on spatial variables, we modeled the proneness of a piece of land to undergo tree recruitment, and the results were related to the afforestation measures carried out under the EU First Afforestation of Agricultural Land Program between 1992 and 2008. We analyzed the temporal tree population dynamics in 800 randomly selected plots of 100 m radius (2,510 ha in total) in dehesas and treeless pasturelands of Extremadura (hereafter rangelands). Tree changes were revealed by comparing aerial images taken in 1956 with orthophotographs and infrared images from 2012. Spatial models that predict the areas prone either to lack tree recruitment or to show recruitment were developed based on three data mining algorithms: MARS (Multivariate Adaptive Regression Splines), Random Forest (RF) and Stochastic Gradient Boosting (Tree-Net, TN). Recruited-tree locations (1) vs. locations with no recruitment (0) (randomly selected from the study areas) were used as the binary dependent variable. A 5% of the data were used as a test data set. As candidate explanatory variables we used 51 different topographic, climatic, bioclimatic, land cover-related and edaphic variables. The statistical models developed were extrapolated to the spatial context of the afforested areas in the region and also to the whole Extremenian rangelands, and the percentage of area modelled as prone to tree recruitment was calculated for each case. A total of 46,674.63 ha were afforested with holm oak (Quercus ilex) or cork oak (Quercus suber) in the studied rangelands under the EU First Afforestation of Agricultural Land Program. In the sampled plots, 16,747 trees were detected as recruited, while 47,058 were present on both dates and 12,803 were lost during the studied period. Based on the Area Under the ROC Curve (AUC), all the data mining models considered showed a high fitness (MARS AUC = 0.86; TN AUC = 0.92; RF AUC = 0.95) and low misclassification rates. Correctly predicted test samples for absence and presence of tree recruitment amounted to 78.3% and 76.8%, respectively, when using MARS, 90.8% and 90.8% using TN, and 88.9% and 89.1% using RF. The spatial patterns of the different models were similar. However, considering only the percentage of area prone to tree recruitment, outstanding differences were observed among models over the total surface of rangelands (36.03% for MARS, 22.88% for TN and 6.72% for RF). Despite these differences, when comparing the results with those for the afforested surfaces (31.73% for MARS, 20.70% for TN and 5.63% for RF), the three algorithms pointed to similar conclusions, i.e. the afforestations performed in rangelands of Extremadura under the EU First Afforestation of Agricultural Land Program barely discriminated between areas with and without natural regeneration. In conclusion, data mining techniques are suitable for developing high-performance spatial models of vegetation dynamics. These models could be useful for policy and decision makers aiming to assess the implementation of afforestation measures and the selection of more adequate locations.
Calibrating random forests for probability estimation.
Dankowski, Theresa; Ziegler, Andreas
2016-09-30
Probabilities can be consistently estimated using random forests. It is, however, unclear how random forests should be updated to make predictions for other centers or at different time points. In this work, we present two approaches for updating random forests for probability estimation. The first method has been proposed by Elkan and may be used for updating any machine learning approach yielding consistent probabilities, so-called probability machines. The second approach is a new strategy specifically developed for random forests. Using the terminal nodes, which represent conditional probabilities, the random forest is first translated to logistic regression models. These are, in turn, used for re-calibration. The two updating strategies were compared in a simulation study and are illustrated with data from the German Stroke Study Collaboration. In most simulation scenarios, both methods led to similar improvements. In the simulation scenario in which the stricter assumptions of Elkan's method were not met, the logistic regression-based re-calibration approach for random forests outperformed Elkan's method. It also performed better on the stroke data than Elkan's method. The strength of Elkan's method is its general applicability to any probability machine. However, if the strict assumptions underlying this approach are not met, the logistic regression-based approach is preferable for updating random forests for probability estimation. © 2016 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd.
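The sketch below is not the authors' terminal-node translation; it is a simpler, related illustration of re-calibration, fitting a logistic regression on the forest's predicted probabilities (Platt-style) to adapt them to a new center. The simulated "centers", the prevalence shift, and the split into calibration and evaluation halves are all assumptions.

```python
# Hedged sketch: re-calibrate RF probabilities for a shifted population using
# a logistic regression on the predicted probabilities, scored by Brier loss.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

X_old, y_old = make_classification(n_samples=1000, n_features=10, random_state=0)
X_new, y_new = make_classification(n_samples=1000, n_features=10, random_state=1,
                                   weights=[0.7])         # new center, shifted prevalence

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_old, y_old)
p_new = rf.predict_proba(X_new)[:, 1]

# Fit the re-calibration on part of the new-center data, evaluate on the rest.
half = len(y_new) // 2
recal = LogisticRegression().fit(p_new[:half, None], y_new[:half])
p_recal = recal.predict_proba(p_new[half:, None])[:, 1]

print("Brier before:", round(brier_score_loss(y_new[half:], p_new[half:]), 3))
print("Brier after: ", round(brier_score_loss(y_new[half:], p_recal), 3))
```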
NASA Astrophysics Data System (ADS)
Masselink, Rens; Temme, Arnaud; Giménez, Rafael; Casalí, Javier; Keesstra, Saskia
2017-04-01
Soil erosion from agricultural areas is a large problem because of off-site effects such as the rapid filling of reservoirs. To mitigate the problem of sediments from agricultural areas reaching the channel, reservoirs and other surface waters, it is important to understand hillslope-channel connectivity and catchment connectivity. To determine the functioning of hillslope-channel connectivity and the continuation of sediment transport in the channel, it is necessary to obtain data on sediment transport from the hillslopes to the channels. Simultaneously, the factors that influence sediment export out of the catchment need to be studied. For measuring hillslope-channel sediment connectivity, Rare-Earth Oxide (REO) tracers were applied to a hillslope in an agricultural catchment in Navarre, Spain, preceding the winter of 2014-2015. The results showed that during the winter there was no sediment transport from the hillslope to the channel. Analysis of precipitation data showed that total precipitation quantities did not differ much from the mean. However, precipitation intensities were low, causing little sediment mobilisation. To test the implication of the REO results at the catchment scale, two conceptual models for sediment connectivity were assessed using a Random Forest (RF) machine learning method. One model proposes that small events provide sediment for large events, while the other proposes that only large events cause sediment detachment and small events subsequently remove these sediments from near and in the channel. The RF method was applied to a daily dataset of sediment yield from the catchment (N=2451 days) and two subsets of the whole dataset: small events (N=2319) and large events (N=132). For sediment yield prediction of small events, variables related to large preceding events were the most important. The model for large events underperformed and, therefore, we could not draw any immediate conclusions about whether small events influence the amount of sediment exported during large events. Both the REO tracers and the RF method showed that low-intensity events do not contribute any sediments to the channel in the Latxaga catchment (cf. Masselink et al., 2016). Sediment dynamics are dominated by sediment mobilisation during large (high-intensity) events. Sediments are largely exported during those events, but large amounts of sediment are deposited in and near the channel after these events. These sediments are gradually removed by small events. To better understand the delivery of sediments to the channel and how large and small events influence each other, more field data on hillslope-channel connectivity and within-channel sediment dynamics are needed. Reference: Masselink, R.J.H., Keesstra, S.D., Temme, A.J.A.M., Seeger, M., Giménez, R., Casalí, J., 2016. Modelling Discharge and Sediment Yield at Catchment Scale Using Connectivity Components. Land Degrad. Dev. 27, 933-945. doi:10.1002/ldr.2512
Halski, Tomasz; Dymarek, Robert; Ptaszkowski, Kuba; Słupska, Lucyna; Rajfur, Katarzyna; Rajfur, Joanna; Pasternok, Małgorzata; Smykla, Agnieszka; Taradaj, Jakub
2015-01-01
Background Kinesiology taping (KT) is a popular method of supporting professional athletes during sports activities, traumatic injury prevention, and physiotherapeutic procedures after a wide range of musculoskeletal injuries. The effectiveness of KT in muscle strength and motor units recruitment is still uncertain. The objective of this study was to assess the effect of KT on surface electromyographic (sEMG) activity and muscle flexibility of the rectus femoris (RF), vastus lateralis (VL), and vastus medialis (VM) muscles in healthy volleyball players. Material/Methods Twenty-two healthy volleyball players (8 men and 14 women) were included in the study and randomly assigned to 2 comparative groups: “kinesiology taping” (KT; n=12; age: 22.30±1.88 years; BMI: 22.19±4.00 kg/m2) in which KT application over the RF muscle was used, and “placebo taping” (PT; n=10; age: 21.50±2.07 years; BMI: 22.74±2.67 kg/m2) in which adhesive nonelastic tape over the same muscle was used. All subjects were analyzed for resting sEMG activity of the VL and VM muscles, resting and functional sEMG activity of RF muscle, and muscle flexibility of RF muscle. Results No significant differences in muscle flexibility of the RF muscle and sEMG activity of the RF, VL, and VM muscles were registered before and after interventions in both groups, and between the KT and PT groups (p>0.05). Conclusions The results show that application of the KT to the RF muscle is not useful to improve sEMG activity. PMID:26232122
Lordêlo, Patrícia; Leal, Mariana Robatto Dantas; Brasil, Cristina Aires; Santos, Juliana Menezes; Lima, Maria Clara Neves Pavie Cardoso; Sartori, Marair Gracio Ferreira
2016-11-01
Female sexual behavior constantly goes through cultural changes, and recently some women have expressed a desire for ideal genitalia. In this study, we aimed to evaluate clinical responses to nonablative radiofrequency (RF) in terms of its cosmetic outcome in the female external genitalia and its effect on sexual function. A single-masked randomized controlled trial was conducted in 43 women (29 sexually active) who were unsatisfied with the appearance of their external genitalia. The women were divided into an RF group (n = 21, 14 sexually active) and a control group (n = 22, 15 sexually active). Eight sessions of RF were performed once a week. Photographs (taken before the first session and 8 days after the last session) were evaluated by the women and three blinded health professionals by using two 3-point Likert scales (unsatisfied, unchanged, and satisfied; and worst, unchanged, and improved). Sexual function was evaluated using the Female Sexual Function Index (FSFI) and analyzed using the Student t test. Women's satisfaction and health professional evaluation were analyzed using the chi-square test and inter- and intragroup binomial comparisons. Satisfaction response rates were 76 and 27 % for the RF and control groups, respectively (p = 0.001). All professionals found an association with clinical improvement in the group treated with RF in comparison with the control group (p < 0.01). The overall FSFI sexual function score increased by 3.51 points in the RF group vs 0.1 points in the control group (p = 0.003). RF is an alternative for attaining a cosmetic outcome for the female external genitalia, with positive changes in patients' satisfaction and FSFI scores.
Khondoker, Mizanur; Dobson, Richard; Skirrow, Caroline; Simmons, Andrew; Stahl, Daniel
2016-10-01
Recent literature on the comparison of machine learning methods has raised questions about the neutrality, unbiasedness and utility of many comparative studies. Reporting of results on favourable datasets and sampling error in the estimated performance measures based on single samples are thought to be the major sources of bias in such comparisons. Better performance in one or a few instances does not necessarily imply better performance on average or at a population level, and simulation studies may be a better alternative for objectively comparing the performances of machine learning algorithms. We compare the classification performance of a number of important and widely used machine learning algorithms, namely Random Forests (RF), Support Vector Machines (SVM), Linear Discriminant Analysis (LDA) and k-Nearest Neighbour (kNN). Using massively parallel processing on high-performance supercomputers, we compare the generalisation errors at various combinations of levels of several factors: number of features, training sample size, biological variation, experimental variation, effect size, replication and correlation between features. For a smaller number of correlated features, with the number of features not exceeding approximately half the sample size, LDA was found to be the method of choice in terms of average generalisation errors as well as stability (precision) of error estimates. SVM (with RBF kernel) outperforms LDA as well as RF and kNN by a clear margin as the feature set gets larger, provided the sample size is not too small (at least 20). The performance of kNN also improves as the number of features grows and surpasses that of LDA and RF unless the data variability is too high and/or effect sizes are too small. RF was found to outperform only kNN in some instances where the data are more variable and have smaller effect sizes, in which case it also provides more stable error estimates than kNN and LDA. Applications to a number of real datasets supported the findings from the simulation study. © The Author(s) 2013.
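The simulation-based comparison described above can be sketched in miniature with scikit-learn; the data generator, factor levels and classifier settings below are arbitrary illustrative choices, not the authors' experimental design.

```python
# Illustrative sketch only: cross-validated error for LDA, SVM (RBF), RF and
# kNN at a few feature-set sizes on synthetic classification data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "SVM-RBF": SVC(kernel="rbf", gamma="scale"),
    "RF": RandomForestClassifier(n_estimators=300, random_state=0),
    "kNN": KNeighborsClassifier(n_neighbors=5),
}

for n_features in (10, 100, 500):                       # placeholder levels
    X, y = make_classification(n_samples=60, n_features=n_features,
                               n_informative=5, n_redundant=0,
                               random_state=0)
    err = {name: 1 - cross_val_score(m, X, y, cv=5).mean()
           for name, m in models.items()}
    print(n_features, {k: round(v, 3) for k, v in err.items()})
```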
A Dirichlet process model for classifying and forecasting epidemic curves.
Nsoesie, Elaine O; Leman, Scotland C; Marathe, Madhav V
2014-01-09
A forecast can be defined as an endeavor to quantitatively estimate a future event or the probabilities assigned to a future occurrence. Forecasting stochastic processes such as epidemics is challenging since there are several biological, behavioral, and environmental factors that influence the number of cases observed at each point during an epidemic. However, accurate forecasts of epidemics would support timely and effective implementation of public health interventions. In this study, we introduce a Dirichlet process (DP) model for classifying and forecasting influenza epidemic curves. The DP model is a nonparametric Bayesian approach that enables the matching of current influenza activity to simulated and historical patterns, identifies epidemic curves different from those observed in the past, and enables prediction of the expected epidemic peak time. The method was validated using simulated influenza epidemics from an individual-based model, and the accuracy was compared to that of the tree-based classification technique, Random Forest (RF), which has been shown to achieve high accuracy in the early prediction of epidemic curves using a classification approach. We also applied the method to forecasting influenza outbreaks in the United States from 1997 to 2013 using influenza-like illness (ILI) data from the Centers for Disease Control and Prevention (CDC). We made the following observations. First, the DP model performed as well as RF in identifying several of the simulated epidemics. Second, the DP model correctly forecasted the peak time several days in advance for most of the simulated epidemics. Third, the accuracy of identifying epidemics different from those already observed improved with additional data, as expected. Fourth, both methods correctly classified epidemics with higher reproduction numbers (R) with a higher accuracy compared to epidemics with lower R values. Lastly, in the classification of seasonal influenza epidemics based on ILI data from the CDC, the methods' performance was comparable. Although RF requires less computational time compared to the DP model, the algorithm is fully supervised, implying that epidemic curves different from those previously observed will always be misclassified. In contrast, the DP model can be unsupervised, semi-supervised or fully supervised. Since both methods have their relative merits, an approach that uses both RF and the DP model could be beneficial.
Current Status and Perspectives of Hyperthermia in Cancer Therapy
NASA Astrophysics Data System (ADS)
Hiraoka, Masahiro; Nagata, Yasushi; Mitsumori, Michihide; Sakamoto, Masashi; Masunaga, Shin-ichiro
2004-08-01
Clinical trials of hyperthermia in combination with radiation therapy or chemotherapy undertaken over the past decades in Japan have been reviewed. Originally developed heating devices were mostly used for these trials; these include RF (radiofrequency) capacitive heating devices, a microwave heating device with a lens applicator, an RF intracavitary heating device, an RF current interstitial heating device, and a ferromagnetic implant heating device. Non-randomized trials for various cancers demonstrated a higher response rate with thermoradiotherapy than with radiotherapy alone. Randomized trials undertaken for esophageal cancers also demonstrated improved local response with the combined use of hyperthermia. Furthermore, the complications associated with treatment were not generally serious. These clinical results indicate the benefit of the combined treatment of hyperthermia and radiotherapy for various malignancies. On the other hand, the presently available heating devices are not satisfactory from a clinical viewpoint. With the advancement of heating and thermometry technologies, hyperthermia will be more widely and safely used in the treatment of cancers.
Applying a weighted random forests method to extract karst sinkholes from LiDAR data
NASA Astrophysics Data System (ADS)
Zhu, Junfeng; Pierskalla, William P.
2016-02-01
Detailed mapping of sinkholes provides critical information for mitigating sinkhole hazards and understanding groundwater and surface water interactions in karst terrains. LiDAR (Light Detection and Ranging) measures the earth's surface at high resolution and high density and has shown great potential to drastically improve locating and delineating sinkholes. However, processing LiDAR data to extract sinkholes requires separating sinkholes from other depressions, which can be laborious because of the sheer number of depressions commonly generated from LiDAR data. In this study, we applied random forests, a machine learning method, to automatically separate sinkholes from other depressions in a karst region in central Kentucky. The sinkhole-extraction random forest was grown on a training dataset built from an area where LiDAR-derived depressions were manually classified through a visual inspection and field verification process. Based on the geometry of depressions, as well as natural and human factors related to sinkholes, 11 parameters were selected as predictive variables to form the dataset. Because the training dataset was imbalanced, with the majority of depressions being non-sinkholes, a weighted random forests method was used to improve the accuracy of predicting sinkholes. The weighted random forest achieved an average accuracy of 89.95% for the training dataset, demonstrating that the random forest can be an effective sinkhole classifier. Testing of the random forest in another area, however, resulted in moderate success, with an average accuracy rate of 73.96%. This study suggests that an automatic sinkhole extraction procedure like the random forest classifier can significantly reduce time and labor costs and makes it more tractable to map sinkholes using LiDAR data for large areas. However, the random forests method cannot totally replace manual procedures, such as visual inspection and field verification.
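A class-weighted random forest is one simple way to approximate the weighted random forests idea on an imbalanced depression dataset; the sketch below assumes a hypothetical feature table with the geometry and context predictors and a binary sinkhole label, and is not the authors' implementation.

```python
# Minimal sketch: 'balanced' class weights penalise misclassifying the rare
# sinkhole class more heavily than the abundant non-sinkhole depressions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

depressions = pd.read_csv("depression_metrics.csv")   # hypothetical file
X = depressions.drop(columns="sinkhole")              # 11 predictor variables
y = depressions["sinkhole"]                           # 1 = sinkhole, 0 = other

wrf = RandomForestClassifier(n_estimators=500,
                             class_weight="balanced",
                             random_state=0)
print("Mean CV accuracy:", cross_val_score(wrf, X, y, cv=5).mean())
```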
Including the biogeochemical impacts of deforestation increases projected warming of climate
NASA Astrophysics Data System (ADS)
Scott, Catherine; Monks, Sarah; Spracklen, Dominick; Arnold, Stephen; Forster, Piers; Rap, Alexandru; Carslaw, Kenneth; Chipperfield, Martyn; Reddington, Carly; Wilson, Christopher
2016-04-01
Forests cover almost one third of the Earth's land area and their distribution is changing as a result of human activities. The presence, and removal, of forests affects the climate in many ways, with the net climate impact of deforestation dependent upon the relative strength of these effects (Betts, 2000; Bala et al., 2007; Davin and de Noblet-Ducoudré, 2010). In addition to controlling the surface albedo and exchanging carbon dioxide (CO2) and moisture with the atmosphere, vegetation emits biogenic volatile organic compounds (BVOCs), which lead to the formation of biogenic secondary organic aerosol (SOA) and alter the oxidative capacity of the atmosphere, affecting ozone (O3) and methane (CH4) concentrations. In this work, we combine a land-surface model with a chemical transport model, a global aerosol model, and a radiative transfer model to compare several radiative impacts of idealised deforestation scenarios in the present day. We find that the simulated reduction in biogenic SOA production, due to complete global deforestation, exerts a positive combined aerosol radiative forcing (RF) of between +308.0 and +362.7 mW m-2, comprising a direct radiative effect of between +116.5 and +165.0 mW m-2 and a first aerosol indirect effect of between +191.5 and +197.7 mW m-2. We find that the reduction in O3 exerts a negative RF of -150.7 mW m-2 and the reduction in CH4 results in a negative RF of -76.2 mW m-2. When the impacts on biogenic SOA, O3 and CH4 are combined, global deforestation exerts an overall positive RF of between +81.1 and +135.9 mW m-2 through changes to short-lived climate forcers (SLCF). Taking these additional biogeochemical impacts into account increases the net positive RF of complete global deforestation, due to changes in CO2 and surface albedo, by 7-11%. Overall, our work suggests that deforestation has a stronger warming impact on climate than previously thought. References: Bala, G. et al., 2007. Combined climate and carbon-cycle effects of large-scale deforestation. PNAS, 104, 6550-6555. Betts, R. A. 2000. Offset of the potential carbon sink from boreal forestation by decreases in surface albedo. Nature, 408, 187-190. Davin, E. L. & De Noblet-Ducoudré, N. 2010. Climatic Impact of Global-Scale Deforestation: Radiative versus Non-radiative Processes. Journal of Climate, 23, 97-112.
Noise-Induced Synchronization among Sub-RF CMOS Analog Oscillators for Skew-Free Clock Distribution
NASA Astrophysics Data System (ADS)
Utagawa, Akira; Asai, Tetsuya; Hirose, Tetsuya; Amemiya, Yoshihito
We present on-chip oscillator arrays synchronized by random noise, aiming at skew-free clock distribution on synchronous digital systems. Nakao et al. recently reported that independent neural oscillators can be synchronized by applying temporal random impulses to the oscillators [1], [2]. We regard neural oscillators as independent clock sources on LSIs; i.e., clock sources are distributed on LSIs, and they are forced to synchronize through the use of random noise. We designed neuron-based clock generators operating in the sub-RF region (<1 GHz) by modifying the original neuron model into a new model that is suitable for CMOS implementation with 0.25-μm CMOS parameters. Through circuit simulations, we demonstrate that i) the clock generators are indeed synchronized by pseudo-random noise and ii) the clock generators exhibit phase-locked oscillations even when they have small device mismatches.
Temperature measurement of a dust particle in a RF plasma GEC reference cell
NASA Astrophysics Data System (ADS)
Kong, Jie; Qiao, Ke; Matthews, Lorin S.; Hyde, Truell W.
2016-10-01
The thermal motion of a dust particle levitated in a plasma chamber is similar in many ways to that described by Brownian motion. The primary difference between a dust particle in a plasma system and a free Brownian particle is that, in addition to the random collisions between the dust particle and the neutral gas atoms, there are electric field fluctuations, dust charge fluctuations, and correlated motions from the unwanted continuous signals originating within the plasma system itself. This last contribution does not include random motion and is therefore separable from the random motion in a `normal' temperature measurement. In this paper, we discuss how to separate the random and coherent motions of a dust particle confined in a glass box in a Gaseous Electronics Conference (GEC) radio-frequency (RF) reference cell, employing experimentally determined dust particle fluctuation data analysed using the mean square displacement technique.
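The mean square displacement analysis mentioned above can be sketched as follows for a one-dimensional tracked position series; the file name, frame rate and short-lag ballistic fit are illustrative assumptions rather than the authors' procedure.

```python
# Back-of-the-envelope MSD sketch for a levitated dust grain tracked in 1-D.
import numpy as np

def msd(x, max_lag):
    """MSD(tau) = <(x(t+tau) - x(t))^2>, averaged over all start times t."""
    return np.array([np.mean((x[lag:] - x[:-lag]) ** 2)
                     for lag in range(1, max_lag + 1)])

fs = 500.0                                   # assumed camera frame rate (Hz)
x = np.loadtxt("dust_track_x.txt")           # hypothetical tracked x positions (m)

lags = np.arange(1, 101) / fs                # lag times tau (s)
m = msd(x, 100)

# At short lags the motion is ballistic, MSD ~ <v^2> * tau^2, so the slope of a
# linear fit of MSD against tau^2 estimates <v^2>; with the grain mass m_d this
# gives a 1-D kinetic temperature via k_B * T = m_d * <v^2>, provided the
# coherent (non-random) part of the motion has first been removed.
v2 = np.polyfit(lags[:10] ** 2, m[:10], 1)[0]
print("estimated <v^2> (m^2 s^-2):", v2)
```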
Guan, Li; Hao, Bibo; Cheng, Qijin; Yip, Paul SF
2015-01-01
Background Traditional offline assessment of suicide probability is time consuming, and it is difficult to convince at-risk individuals to participate. Identifying individuals with high suicide probability through online social media has an advantage in its efficiency and potential to reach out to hidden individuals, yet little research has focused on this specific field. Objective The objective of this study was to apply two classification models, Simple Logistic Regression (SLR) and Random Forest (RF), to examine the feasibility and effectiveness of identifying microblog users with high suicide possibility in China through profile and linguistic features extracted from Internet-based data. Methods Nine hundred and nine Chinese microblog users completed an Internet survey, and those scoring one SD above the mean of the total Suicide Probability Scale (SPS) score, as well as one SD above the mean in each of the four subscale scores in the participant sample, were labeled as high-risk individuals, respectively. Profile and linguistic features were fed into two machine learning algorithms (SLR and RF) to train models that aim to identify high-risk individuals in general suicide probability and in its four dimensions. Models were trained and then tested by 5-fold cross-validation, in which both the training set and the test set were generated from the whole sample under a stratified random sampling rule. Three classic performance metrics (Precision, Recall, F1 measure) and a specifically defined metric, “Screening Efficiency”, were adopted to evaluate model effectiveness. Results Classification performance was generally matched between SLR and RF. Given the best performance of the classification models, we were able to retrieve over 70% of the labeled high-risk individuals in overall suicide probability as well as in the four dimensions. Screening Efficiency of most models varied from 1/4 to 1/2. Precision of the models was generally below 30%. Conclusions Individuals in China with high suicide probability are recognizable by profile and text-based information from microblogs. Although there is still much space to improve the performance of the classification models in the future, this study may shed light on preliminary screening of at-risk individuals via machine learning algorithms, which can work side-by-side with expert scrutiny to increase efficiency in large-scale surveillance of suicide probability from online social media. PMID:26543921
Guan, Li; Hao, Bibo; Cheng, Qijin; Yip, Paul Sf; Zhu, Tingshao
2015-01-01
Traditional offline assessment of suicide probability is time consuming, and it is difficult to convince at-risk individuals to participate. Identifying individuals with high suicide probability through online social media has an advantage in its efficiency and potential to reach out to hidden individuals, yet little research has focused on this specific field. The objective of this study was to apply two classification models, Simple Logistic Regression (SLR) and Random Forest (RF), to examine the feasibility and effectiveness of identifying microblog users with high suicide possibility in China through profile and linguistic features extracted from Internet-based data. Nine hundred and nine Chinese microblog users completed an Internet survey, and those scoring one SD above the mean of the total Suicide Probability Scale (SPS) score, as well as one SD above the mean in each of the four subscale scores in the participant sample, were labeled as high-risk individuals, respectively. Profile and linguistic features were fed into two machine learning algorithms (SLR and RF) to train models that aim to identify high-risk individuals in general suicide probability and in its four dimensions. Models were trained and then tested by 5-fold cross-validation, in which both the training set and the test set were generated from the whole sample under a stratified random sampling rule. Three classic performance metrics (Precision, Recall, F1 measure) and a specifically defined metric, "Screening Efficiency", were adopted to evaluate model effectiveness. Classification performance was generally matched between SLR and RF. Given the best performance of the classification models, we were able to retrieve over 70% of the labeled high-risk individuals in overall suicide probability as well as in the four dimensions. Screening Efficiency of most models varied from 1/4 to 1/2. Precision of the models was generally below 30%. Individuals in China with high suicide probability are recognizable by profile and text-based information from microblogs. Although there is still much space to improve the performance of the classification models in the future, this study may shed light on preliminary screening of at-risk individuals via machine learning algorithms, which can work side-by-side with expert scrutiny to increase efficiency in large-scale surveillance of suicide probability from online social media.
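The evaluation scheme described in this abstract (stratified 5-fold cross-validation scored with precision, recall and F1) can be sketched with scikit-learn as below, with logistic regression and a random forest standing in for SLR and RF; the feature table and label column are hypothetical.

```python
# Hedged sketch: stratified 5-fold CV of two classifiers on a feature table.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

users = pd.read_csv("microblog_features.csv")      # hypothetical feature table
X, y = users.drop(columns="high_risk"), users["high_risk"]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ("precision", "recall", "f1")
for name, model in [("SLR", LogisticRegression(max_iter=1000)),
                    ("RF", RandomForestClassifier(n_estimators=300,
                                                  random_state=0))]:
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    print(name, {s: round(scores["test_" + s].mean(), 3) for s in scoring})
```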
NASA Astrophysics Data System (ADS)
Ferraretto, Daniele; Heal, Kate
2017-04-01
Temperate forest ecosystems are significant sinks for nitrogen deposition (Ndep), yielding benefits such as protection of waterbodies from eutrophication and enhanced sequestration of atmospheric CO2. Previous studies have shown evidence of biological nitrification and of Ndep processing and retention in forest canopies. However, this was reported only at sites with high environmental or experimentally enhanced rates of Ndep (˜18 kg N ha-1 y-1) and has not yet been demonstrated in low Ndep environments. We have used bulk field hydrochemical measurements and labelled isotopic experiments to assess canopy processing in a lower Ndep environment (˜7 kg N ha-1 year-1) at a Sitka spruce plantation in Perthshire, Scotland, representing the dominant tree species (24%) in woodlands in Great Britain. Analysis of 4.5 years of measured N fluxes in rainfall (RF) and fogwater onto the canopy, and in throughfall (TF) and stemflow (SF) below the canopy, suggests strong transformation and uptake of Ndep in the forest canopy. Annual canopy Ndep uptake was ˜4.7 kg N ha-1 year-1, representing 60-76% of annual Ndep. To validate these plot-scale results and track N uptake within the forest canopy in different seasons, double 15N-labelled NH4NO3 (98%) solution was sprayed in summer and winter onto the canopy of three trees at the measurement site. RF, TF and SF samples have been collected and analysed for 15NH4 and 15NO3. Comparing the amount of labelled N recovered under the sample trees with the measured δ15N signal is expected to provide further evidence of the role of forest canopies in actively processing and retaining atmospheric N deposition.
Arevalillo, Jorge M; Sztein, Marcelo B; Kotloff, Karen L; Levine, Myron M; Simon, Jakub K
2017-10-01
Immunologic correlates of protection are important in vaccine development because they give insight into mechanisms of protection, assist in the identification of promising vaccine candidates, and serve as endpoints in bridging clinical vaccine studies. Our goal is the development of a methodology to identify immunologic correlates of protection using the Shigella challenge as a model. The proposed methodology utilizes the Random Forests (RF) machine learning algorithm as well as Classification and Regression Trees (CART) to detect immune markers that predict protection, identify interactions between variables, and define optimal cutoffs. Logistic regression modeling is applied to estimate the probability of protection, and the confidence interval (CI) for such a probability is computed by bootstrapping the logistic regression models. The results demonstrate that the combination of Classification and Regression Trees and Random Forests complements standard logistic regression and uncovers subtle immune interactions. Specific levels of immunoglobulin IgG antibody in blood on the day of challenge predicted protection in 75% (95% CI 67-86%). Of those subjects that did not have blood IgG at or above a defined threshold, 100% were protected if they had IgA antibody-secreting cells above a defined threshold. Comparison with the results obtained by applying only logistic regression modeling with the standard Akaike Information Criterion for model selection shows the usefulness of the proposed method. Given the complexity of the immune system, the use of machine learning methods may enhance traditional statistical approaches. When applied together, they offer a novel way to quantify important immune correlates of protection that may help the development of vaccines. Copyright © 2017 Elsevier Inc. All rights reserved.
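A rough sketch of the combined CART/RF/logistic-regression workflow might look like the following; the marker names, data file and bootstrap settings are hypothetical placeholders, not the published analysis.

```python
# Illustrative sketch: rank markers with an RF, find a candidate cutoff with a
# shallow CART, then bootstrap a logistic regression for a probability CI.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

d = pd.read_csv("challenge_immune_markers.csv")        # hypothetical data
X, y = d[["IgG_blood", "IgA_ASC"]], d["protected"]

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
print("RF importances:", dict(zip(X.columns, rf.feature_importances_.round(3))))

cart = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print("CART split thresholds:", cart.tree_.threshold[cart.tree_.feature >= 0])

# Bootstrap the logistic model for a CI on P(protection) at a chosen marker level.
probe = pd.DataFrame({"IgG_blood": [d["IgG_blood"].median()],
                      "IgA_ASC": [d["IgA_ASC"].median()]})
rng = np.random.default_rng(0)
boot = []
for _ in range(1000):
    idx = rng.integers(0, len(d), len(d))
    lr = LogisticRegression(max_iter=1000).fit(X.iloc[idx], y.iloc[idx])
    boot.append(lr.predict_proba(probe)[0, 1])
print("95% CI:", np.percentile(boot, [2.5, 97.5]).round(2))
```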
Lei, Tailong; Sun, Huiyong; Kang, Yu; Zhu, Feng; Liu, Hui; Zhou, Wenfang; Wang, Zhe; Li, Dan; Li, Youyong; Hou, Tingjun
2017-11-06
Xenobiotic chemicals and their metabolites are mainly excreted out of our bodies by the urinary tract through the urine. Chemical-induced urinary tract toxicity is one of the main causes of failure during drug development, and it is a common adverse event for medications, natural supplements, and environmental chemicals. Despite its importance, there are only a few in silico models for assessing urinary tract toxicity for a large number of compounds with diverse chemical structures. Here, we developed a series of qualitative and quantitative structure-activity relationship (QSAR) models for predicting urinary tract toxicity. In our study, the recursive feature elimination method incorporated with random forests (RFE-RF) was used for dimension reduction, and then eight machine learning approaches were used for QSAR modeling, i.e., relevance vector machine (RVM), support vector machine (SVM), regularized random forest (RRF), C5.0 trees, eXtreme gradient boosting (XGBoost), AdaBoost.M1, SVM boosting (SVMBoost), and RVM boosting (RVMBoost). For building classification models, the synthetic minority oversampling technique was used to handle the imbalanced data set problem. Among all the machine learning approaches, SVMBoost based on the RBF kernel achieves both the best quantitative (q^2_ext = 0.845) and qualitative predictions for the test set (MCC of 0.787, AUC of 0.893, sensitivity of 89.6%, specificity of 94.1%, and global accuracy of 90.8%). The application domains were then analyzed, and all of the tested chemicals fall within the application domain coverage. We also examined the structure features of the chemicals with large prediction errors. In brief, both the regression and classification models developed by the SVMBoost approach have reliable prediction capability for assessing chemical-induced urinary tract toxicity.
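The RFE-RF dimension reduction and minority oversampling steps can be sketched with scikit-learn and imbalanced-learn as below; the descriptor table, number of retained features, and the use of a random forest as the final classifier (standing in for the boosted SVM reported as best) are assumptions for illustration only.

```python
# Conceptual sketch: RF-driven recursive feature elimination, then SMOTE
# oversampling of the minority class before fitting a final classifier.
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

data = pd.read_csv("urinary_tox_descriptors.csv")      # hypothetical table
X, y = data.drop(columns="toxic"), data["toxic"]

# Recursive feature elimination driven by random-forest importances.
selector = RFE(RandomForestClassifier(n_estimators=300, random_state=0),
               n_features_to_select=50, step=0.1)
X_sel = selector.fit_transform(X, y)

# Balance the minority class, then fit the final classifier on the balanced set.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_sel, y)
clf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_bal, y_bal)
```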
Estimating future burned areas under changing climate in the EU-Mediterranean countries.
Amatulli, Giuseppe; Camia, Andrea; San-Miguel-Ayanz, Jesús
2013-04-15
The impacts of climate change on forest fires have received increased attention in recent years at both continental and local scales. It is widely recognized that weather plays a key role in extreme fire situations. It is therefore of great interest to analyze projected changes in fire danger under climate change scenarios and to assess the consequent impacts of forest fires. In this study we estimated burned areas in the European Mediterranean (EU-Med) countries under past and future climate conditions. Historical (1985-2004) monthly burned areas in EU-Med countries were modeled by using the Canadian Fire Weather Index (CFWI). Monthly averages of the CFWI sub-indices were used as explanatory variables to estimate the monthly burned areas in each of the five most affected countries in Europe using three different modeling approaches (Multiple Linear Regression - MLR, Random Forest - RF, Multivariate Adaptive Regression Splines - MARS). MARS outperformed the other methods. Regression equations and significant coefficients of determination were obtained, although there were noticeable differences from country to country. Climatic conditions at the end of the 21st Century were simulated using results from the runs of the regional climate model HIRHAM in the European project PRUDENCE, considering two IPCC SRES scenarios (A2-B2). The MARS models were applied to both scenarios resulting in projected burned areas in each country and in the EU-Med region. Results showed that significant increases, 66% and 140% of the total burned area, can be expected in the EU-Med region under the A2 and B2 scenarios, respectively. Copyright © 2013 Elsevier B.V. All rights reserved.
NASA Astrophysics Data System (ADS)
Rodriguez-Galiano, Victor; Aragones, David; Caparros-Santiago, Jose A.; Navarro-Cerrillo, Rafael M.
2017-10-01
Land surface phenology (LSP) can improve the characterisation of forest areas and their change processes. The aims of this work were: i) to characterise the temporal dynamics in Mediterranean Pinus forests, and ii) to evaluate the potential of LSP for species discrimination. The different experiments were based on 679 mono-specific plots for the 5 native species on the Iberian Peninsula: P. sylvestris, P. pinea, P. halepensis, P. nigra and P. pinaster. The entire MODIS NDVI time series (2000-2016) of the MOD13Q1 product was used to characterise phenology. The following phenological parameters were extracted: the start, end and median days of the season, and the length of the season in days, as well as the base value, maximum value, amplitude and integrated value. Multi-temporal metrics were calculated to synthesise the inter-annual variability of the phenological parameters. The species were discriminated by the application of Random Forest (RF) classifiers from different subsets of variables: model 1) NDVI-smoothed time series, model 2) multi-temporal metrics of the phenological parameters, and model 3) multi-temporal metrics and the auxiliary physical variables (altitude, slope, aspect and distance to the coastline). Model 3 was the best, with an overall accuracy of 82% and a kappa coefficient of 0.77; its most important variables were elevation, distance to the coast, and the end and start days of the growing season. The species that presented the largest errors was P. nigra (kappa = 0.45), with some locations behaving similarly to P. sylvestris or P. pinaster.
Physical Human Activity Recognition Using Wearable Sensors.
Attal, Ferhat; Mohammed, Samer; Dedabrishvili, Mariam; Chamroukhi, Faicel; Oukhellou, Latifa; Amirat, Yacine
2015-12-11
This paper presents a review of different classification techniques used to recognize human activities from wearable inertial sensor data. Three inertial sensor units were used in this study and were worn by healthy subjects at key points of upper/lower body limbs (chest, right thigh and left ankle). Three main steps describe the activity recognition process: sensors' placement, data pre-processing and data classification. Four supervised classification techniques namely, k-Nearest Neighbor (k-NN), Support Vector Machines (SVM), Gaussian Mixture Models (GMM), and Random Forest (RF) as well as three unsupervised classification techniques namely, k-Means, Gaussian mixture models (GMM) and Hidden Markov Model (HMM), are compared in terms of correct classification rate, F-measure, recall, precision, and specificity. Raw data and extracted features are used separately as inputs of each classifier. The feature selection is performed using a wrapper approach based on the RF algorithm. Based on our experiments, the results obtained show that the k-NN classifier provides the best performance compared to other supervised classification algorithms, whereas the HMM classifier is the one that gives the best results among unsupervised classification algorithms. This comparison highlights which approach gives better performance in both supervised and unsupervised contexts. It should be noted that the obtained results are limited to the context of this study, which concerns the classification of the main daily living human activities using three wearable accelerometers placed at the chest, right shank and left ankle of the subject.
Estimates of grassland biomass and turnover time on the Tibetan Plateau
NASA Astrophysics Data System (ADS)
Xia, Jiangzhou; Ma, Minna; Liang, Tiangang; Wu, Chaoyang; Yang, Yuanhe; Zhang, Li; Zhang, Yangjian; Yuan, Wenping
2018-01-01
The grassland of the Tibetan Plateau forms a globally significant biome, which represents 6% of the world’s grasslands and 44% of China’s grasslands. However, large uncertainties remain concerning the vegetation carbon storage and turnover time in this biome. In this study, we quantified the pool size of both the aboveground and belowground biomass and the turnover time of belowground biomass across the Tibetan Plateau by combining systematic measurements taken from a substantial number of surveys (i.e. 1689 sites for aboveground biomass, 174 sites for belowground biomass) with a machine learning technique (i.e. random forest, RF). Our study demonstrated that the RF model is an effective tool for upscaling local biomass observations to the regional scale, and for producing continuous biomass estimates of the Tibetan Plateau. On average, the models estimated 46.57 Tg (1 Tg = 10^12 g) C of aboveground biomass and 363.71 Tg C of belowground biomass in the Tibetan grasslands covering an area of 1.32 × 10^6 km2. The turnover time of belowground biomass demonstrated large spatial heterogeneity, with a median turnover time of 4.25 years. Our results also demonstrated large differences in the biomass simulations among the major ecosystem models used for the Tibetan Plateau, largely because of inadequate model parameterization and validation. This study provides a spatially continuous measure of vegetation carbon storage and turnover time, and provides useful information for advancing ecosystem models and improving their performance.
Invasive Shrub Mapping in an Urban Environment from Hyperspectral and LiDAR-Derived Attributes
Chance, Curtis M.; Coops, Nicholas C.; Plowright, Andrew A.; Tooke, Thoreau R.; Christen, Andreas; Aven, Neal
2016-01-01
Proactive management of invasive species in urban areas is critical to restricting their overall distribution. The objective of this work is to determine whether advanced remote sensing technologies can help to detect invasions effectively and efficiently in complex urban ecosystems such as parks. In Surrey, BC, Canada, Himalayan blackberry (Rubus armeniacus) and English ivy (Hedera helix) are two invasive shrub species that can negatively affect native ecosystems in cities and managed urban parks. Random forest (RF) models were created to detect these two species using a combination of hyperspectral imagery, and light detection and ranging (LiDAR) data. LiDAR-derived predictor variables included irradiance models, canopy structural characteristics, and orographic variables. RF detection accuracy ranged from 77.8 to 87.8% for Himalayan blackberry and 81.9 to 82.1% for English ivy, with open areas classified more accurately than areas under canopy cover. English ivy was predicted to occur across a greater area than Himalayan blackberry both within parks and across the entire city. Both Himalayan blackberry and English ivy were mostly located in clusters according to a Local Moran’s I analysis. The occurrence of both species decreased as the distance from roads increased. This study shows the feasibility of producing highly accurate detection maps of plant invasions in urban environments using a fusion of remotely sensed data, as well as the ability to use these products to guide management decisions. PMID:27818664
DOE Office of Scientific and Technical Information (OSTI.GOV)
Masci, Frank J.; Grillmair, Carl J.; Cutri, Roc M.
2014-07-01
We describe a methodology to classify periodic variable stars identified using photometric time-series measurements constructed from the Wide-field Infrared Survey Explorer (WISE) full-mission single-exposure Source Databases. This will assist in the future construction of a WISE Variable Source Database that assigns variables to specific science classes as constrained by the WISE observing cadence with statistically meaningful classification probabilities. We have analyzed the WISE light curves of 8273 variable stars identified in previous optical variability surveys (MACHO, GCVS, and ASAS) and show that Fourier decomposition techniques can be extended into the mid-IR to assist with their classification. Combined with other periodic light-curve features, this sample is then used to train a machine-learned classifier based on the random forest (RF) method. Consistent with previous classification studies of variable stars in general, the RF machine-learned classifier is superior to other methods in terms of accuracy, robustness against outliers, and relative immunity to features that carry little or redundant class information. For the three most common classes identified by WISE: Algols, RR Lyrae, and W Ursae Majoris type variables, we obtain classification efficiencies of 80.7%, 82.7%, and 84.5% respectively using cross-validation analyses, with 95% confidence intervals of approximately ±2%. These accuracies are achieved at purity (or reliability) levels of 88.5%, 96.2%, and 87.8% respectively, similar to that achieved in previous automated classification studies of periodic variable stars.
Machine learning search for variable stars
NASA Astrophysics Data System (ADS)
Pashchenko, Ilya N.; Sokolovsky, Kirill V.; Gavras, Panagiotis
2018-04-01
Photometric variability detection is often considered as a hypothesis testing problem: an object is variable if the null hypothesis that its brightness is constant can be ruled out given the measurements and their uncertainties. The practical applicability of this approach is limited by uncorrected systematic errors. We propose a new variability detection technique sensitive to a wide range of variability types while being robust to outliers and underestimated measurement uncertainties. We consider variability detection as a classification problem that can be approached with machine learning. Logistic Regression (LR), Support Vector Machines (SVM), k Nearest Neighbours (kNN), Neural Nets (NN), Random Forests (RF), and Stochastic Gradient Boosting classifier (SGB) are applied to 18 features (variability indices) quantifying scatter and/or correlation between points in a light curve. We use a subset of Optical Gravitational Lensing Experiment phase two (OGLE-II) Large Magellanic Cloud (LMC) photometry (30 265 light curves) that was searched for variability using traditional methods (168 known variable objects) as the training set and then apply the NN to a new test set of 31 798 OGLE-II LMC light curves. Among 205 candidates selected in the test set, 178 are real variables, while 13 low-amplitude variables are new discoveries. The machine learning classifiers considered are found to be more efficient (select more variables and fewer false candidates) compared to traditional techniques using individual variability indices or their linear combination. The NN, SGB, SVM, and RF show a higher efficiency compared to LR and kNN.
Yi, Hai-Cheng; You, Zhu-Hong; Huang, De-Shuang; Li, Xiao; Jiang, Tong-Hai; Li, Li-Ping
2018-06-01
The interactions between non-coding RNAs (ncRNAs) and proteins play an important role in many biological processes, and their biological functions are primarily achieved by binding with a variety of proteins. High-throughput biological techniques are used to identify protein molecules bound with specific ncRNAs, but they are usually expensive and time consuming. Deep learning provides a powerful solution to computationally predict RNA-protein interactions. In this work, we propose the RPI-SAN model, which uses a deep-learning stacked auto-encoder network to mine the hidden high-level features from RNA and protein sequences and feed them into a random forest (RF) model to predict ncRNA binding proteins. Stacked ensembling is further used to improve the accuracy of the proposed method. Four benchmark datasets, including RPI2241, RPI488, RPI1807, and NPInter v2.0, were employed for the unbiased evaluation of five established prediction tools: RPI-Pred, IPMiner, RPISeq-RF, lncPro, and RPI-SAN. The experimental results show that our RPI-SAN model achieves much better performance than the other methods, with accuracies of 90.77%, 89.7%, 96.1%, and 99.33%, respectively. It is anticipated that RPI-SAN can be used as an effective computational tool for future biomedical research and can accurately predict potential ncRNA-protein interacting pairs, which provides reliable guidance for biological research. Copyright © 2018 The Author(s). Published by Elsevier Inc. All rights reserved.
Local-search based prediction of medical image registration error
NASA Astrophysics Data System (ADS)
Saygili, Görkem
2018-03-01
Medical image registration is a crucial task in many different medical imaging applications. Hence, a considerable amount of work has been published recently that aims to predict the error of a registration without any human effort. If provided, these error predictions can be used as feedback to the registration algorithm to further improve its performance. Recent methods generally start by extracting image-based and deformation-based features, then apply feature pooling, and finally train a Random Forest (RF) regressor to predict the real registration error. Image-based features can be calculated after applying a single registration but provide limited accuracy, whereas deformation-based features, such as the variation of the deformation vector field, may require up to 20 registrations, which is a considerably time-consuming task. This paper proposes to use features extracted from a local search algorithm as image-based features to estimate the error of a registration. The proposed method comprises a local search algorithm to find corresponding voxels between registered image pairs and, based on the amount of shift and stereo confidence measures, it densely predicts the amount of registration error in millimetres using an RF regressor. Compared to other algorithms in the literature, the proposed algorithm does not require multiple registrations, can be efficiently implemented on a Graphical Processing Unit (GPU), and can still provide highly accurate error predictions in the presence of large registration errors. Experimental results with real registrations on a public dataset indicate a substantially high accuracy achieved by using features from the local search algorithm.
Liu, Bin; Long, Ren; Chou, Kuo-Chen
2016-08-15
Regulatory DNA elements are associated with DNase I hypersensitive sites (DHSs). Accordingly, identification of DHSs will provide useful insights for in-depth investigation into the function of noncoding genomic regions. In this study, using the strategy of an ensemble learning framework, we proposed a new predictor called iDHS-EL for identifying the location of DHSs in the human genome. It was formed by fusing three individual Random Forest (RF) classifiers into an ensemble predictor. The three RF operators were respectively based on three special modes of the general pseudo nucleotide composition (PseKNC): (i) kmer, (ii) reverse complement kmer and (iii) pseudo dinucleotide composition. It has been demonstrated that the new predictor remarkably outperforms the relevant state-of-the-art methods in both accuracy and stability. For the convenience of most experimental scientists, a web server for iDHS-EL is established at http://bioinformatics.hitsz.edu.cn/iDHS-EL, which is the first web-server predictor ever established for identifying DHSs, and by which users can easily get their desired results without the need to go through the mathematical details. We anticipate that iDHS-EL will become a very useful high-throughput tool for genome analysis. bliu@gordonlifescience.org or bliu@insun.hit.edu.cn Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
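The ensemble idea behind iDHS-EL, three RF classifiers trained on different sequence encodings and fused into one predictor, can be sketched as below; plain k-mer counts and probability averaging are simplified stand-ins for the PseKNC modes and the fusion scheme actually used.

```python
# Minimal sketch: one random forest per sequence encoding, predictions fused
# by averaging the class-1 probabilities (toy sequences and labels).
from itertools import product
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def kmer_counts(seqs, k):
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return np.array([[s.count(m) for m in kmers] for s in seqs], dtype=float)

train_seqs = ["ACGTACGTAAGT", "TTTTCCCCGGGG", "ACACACACACAC", "GGGGTTTTAAAA"]
train_y = np.array([1, 0, 1, 0])                 # toy labels
test_seqs = ["ACGTACGTACGT"]

probas = []
for k in (2, 3, 4):                              # one feature mode per member
    rf = RandomForestClassifier(n_estimators=300, random_state=0)
    rf.fit(kmer_counts(train_seqs, k), train_y)
    probas.append(rf.predict_proba(kmer_counts(test_seqs, k))[:, 1])

print("ensemble P(DHS):", np.mean(probas, axis=0))
```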
Harris, Ted D.; Graham, Jennifer L.
2017-01-01
Cyanobacterial blooms degrade water quality in drinking water supply reservoirs by producing toxic and taste-and-odor causing secondary metabolites, which ultimately cause public health concerns and lead to increased treatment costs for water utilities. There have been numerous attempts to create models that predict cyanobacteria and their secondary metabolites, most using linear models; however, linear models are limited by assumptions about the data and have had limited success as predictive tools. Thus, lake and reservoir managers need improved modeling techniques that can accurately predict large bloom events that have the highest impact on recreational activities and drinking-water treatment processes. In this study, we compared 12 unique linear and nonlinear regression modeling techniques to predict cyanobacterial abundance and the cyanobacterial secondary metabolites microcystin and geosmin using 14 years of physiochemical water quality data collected from Cheney Reservoir, Kansas. Support vector machine (SVM), random forest (RF), boosted tree (BT), and Cubist modeling techniques were the most predictive of the compared modeling approaches. SVM, RF, and BT modeling techniques were able to successfully predict cyanobacterial abundance, microcystin, and geosmin concentrations <60,000 cells/mL, 2.5 µg/L, and 20 ng/L, respectively. Only Cubist modeling predicted maximum concentrations of cyanobacteria and geosmin; no modeling technique was able to predict maximum microcystin concentrations. Because maximum concentrations are a primary concern for lake and reservoir managers, Cubist modeling may help predict the largest and most noxious concentrations of cyanobacteria and their secondary metabolites.
Physical Human Activity Recognition Using Wearable Sensors
Attal, Ferhat; Mohammed, Samer; Dedabrishvili, Mariam; Chamroukhi, Faicel; Oukhellou, Latifa; Amirat, Yacine
2015-01-01
This paper presents a review of different classification techniques used to recognize human activities from wearable inertial sensor data. Three inertial sensor units were used in this study and were worn by healthy subjects at key points of upper/lower body limbs (chest, right thigh and left ankle). Three main steps describe the activity recognition process: sensors’ placement, data pre-processing and data classification. Four supervised classification techniques namely, k-Nearest Neighbor (k-NN), Support Vector Machines (SVM), Gaussian Mixture Models (GMM), and Random Forest (RF) as well as three unsupervised classification techniques namely, k-Means, Gaussian mixture models (GMM) and Hidden Markov Model (HMM), are compared in terms of correct classification rate, F-measure, recall, precision, and specificity. Raw data and extracted features are used separately as inputs of each classifier. The feature selection is performed using a wrapper approach based on the RF algorithm. Based on our experiments, the results obtained show that the k-NN classifier provides the best performance compared to other supervised classification algorithms, whereas the HMM classifier is the one that gives the best results among unsupervised classification algorithms. This comparison highlights which approach gives better performance in both supervised and unsupervised contexts. It should be noted that the obtained results are limited to the context of this study, which concerns the classification of the main daily living human activities using three wearable accelerometers placed at the chest, right shank and left ankle of the subject. PMID:26690450
Using Predictive Analytics to Predict Power Outages from Severe Weather
NASA Astrophysics Data System (ADS)
Wanik, D. W.; Anagnostou, E. N.; Hartman, B.; Frediani, M. E.; Astitha, M.
2015-12-01
The reliable distribution of power is essential to businesses, public services, and our daily lives. With the growing abundance of data being collected and created by industry (i.e. outage data), government agencies (i.e. land cover), and academia (i.e. weather forecasts), we can begin to tackle problems that previously seemed too complex to solve. In this session, we will present newly developed tools to aid decision-support challenges at electric distribution utilities that must mitigate, prepare for, respond to and recover from severe weather. We will show a performance evaluation of outage predictive models built for Eversource Energy (formerly Connecticut Light & Power) for storms of all types (i.e. blizzards, thunderstorms and hurricanes) and magnitudes (from 20 to >15,000 outages). High resolution weather simulations (simulated with the Weather Research and Forecasting Model) were joined with utility outage data to calibrate four types of models: a decision tree (DT), random forest (RF), boosted gradient tree (BT) and an ensemble (ENS) decision tree regression that combined predictions from DT, RF and BT. The study shows that the ENS model forced with weather, infrastructure and land cover data was superior to the other models we evaluated, especially in terms of predicting the spatial distribution of outages. This research has the potential to be used for other critical infrastructure systems (such as telecommunications, drinking water and gas distribution networks), and can be readily expanded to the entire New England region to facilitate better planning and coordination among decision-makers when severe weather strikes.
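The ensemble regression idea, combining predictions from a decision tree, a random forest and a boosted tree in a second-stage model, can be sketched with scikit-learn's stacking regressor; the storm predictor names and model settings below are placeholders, not the operational system.

```python
# Hedged sketch: DT, RF and BT base regressors combined by a second-stage tree.
import pandas as pd
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.tree import DecisionTreeRegressor

storms = pd.read_csv("storm_outages.csv")        # hypothetical training table
X = storms[["max_gust", "precip", "tree_cover", "asset_density"]]
y = storms["outages"]

ens = StackingRegressor(
    estimators=[("dt", DecisionTreeRegressor(max_depth=6, random_state=0)),
                ("rf", RandomForestRegressor(n_estimators=300, random_state=0)),
                ("bt", GradientBoostingRegressor(random_state=0))],
    final_estimator=DecisionTreeRegressor(max_depth=3, random_state=0))
ens.fit(X, y)
print(ens.predict(X.head()))
```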
Park, Bo-Yong; Lee, Mi Ji; Lee, Seung-Hak; Cha, Jihoon; Chung, Chin-Sang; Kim, Sung Tae; Park, Hyunjin
2018-01-01
Migraineurs show an increased load of white matter hyperintensities (WMHs) and more rapid deep WMH progression. Previous methods for WMH segmentation have limited efficacy to detect small deep WMHs. We developed a new fully automated detection pipeline, DEWS (DEep White matter hyperintensity Segmentation framework), for small and superficially-located deep WMHs. A total of 148 non-elderly subjects with migraine were included in this study. The pipeline consists of three components: 1) white matter (WM) extraction, 2) WMH detection, and 3) false positive reduction. In WM extraction, we adjusted the WM mask to re-assign misclassified WMHs back to WM using many sequential low-level image processing steps. In WMH detection, the potential WMH clusters were detected using an intensity based threshold and region growing approach. For false positive reduction, the detected WMH clusters were classified into final WMHs and non-WMHs using the random forest (RF) classifier. Size, texture, and multi-scale deep features were used to train the RF classifier. DEWS successfully detected small deep WMHs with a high positive predictive value (PPV) of 0.98 and true positive rate (TPR) of 0.70 in the training and test sets. Similar performance of PPV (0.96) and TPR (0.68) was attained in the validation set. DEWS showed a superior performance in comparison with other methods. Our proposed pipeline is freely available online to help the research community in quantifying deep WMHs in non-elderly adults.
Maize Cropping Systems Mapping Using RapidEye Observations in Agro-Ecological Landscapes in Kenya.
Richard, Kyalo; Abdel-Rahman, Elfatih M; Subramanian, Sevgan; Nyasani, Johnson O; Thiel, Michael; Jozani, Hosein; Borgemeister, Christian; Landmann, Tobias
2017-11-03
Cropping systems information at explicit scales is an important but rarely available variable in many crop modeling routines and is of utmost importance for understanding pest and disease propagation mechanisms in agro-ecological landscapes. In this study, high spatial and temporal resolution RapidEye bio-temporal data were utilized within a novel 2-step hierarchical random forest (RF) classification approach to map areas of mono- and mixed maize cropping systems. A small-scale maize farming site in Machakos County, Kenya was used as a study site. Within the study site, field data were collected during the satellite acquisition period on general land use/land cover (LULC) and the two cropping systems. Firstly, non-cropland areas were masked out using the LULC mapping result (1st classification step). Subsequently, an optimized RF model was applied to the cropland layer to map the two cropping systems (2nd classification step). An overall accuracy of 93% was attained for the LULC classification, while the class accuracies (PA: producer's accuracy and UA: user's accuracy) for the two cropping systems were consistently above 85%. We concluded that explicit mapping of different cropping systems is feasible in complex and highly fragmented agro-ecological landscapes if high resolution and multi-temporal satellite data, such as 5 m RapidEye data, are employed. Further research is needed on the feasibility of using freely available 10-20 m Sentinel-2 data for wide-area assessment of cropping systems as an important variable in numerous crop productivity models.
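The 2-step hierarchical classification can be sketched as two chained random forests, the first separating cropland from other LULC classes and the second, applied only to cropland pixels, separating mono- from mixed-maize systems; the pixel table, band names and class codes below are hypothetical.

```python
# Conceptual sketch: step 1 maps LULC, step 2 maps cropping systems on the
# pixels classified as cropland in step 1.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

px = pd.read_csv("rapideye_pixels.csv")          # hypothetical pixel table
bands = [c for c in px.columns if c.startswith("b")]

# Step 1: land use / land cover model (cropland vs. everything else).
lulc_rf = RandomForestClassifier(n_estimators=500, random_state=0)
lulc_rf.fit(px[bands], px["lulc"])
is_crop = lulc_rf.predict(px[bands]) == "cropland"

# Step 2: cropping-system model, trained and applied only on cropland pixels.
crop = px[is_crop & px["system"].notna()]
sys_rf = RandomForestClassifier(n_estimators=500, random_state=0)
sys_rf.fit(crop[bands], crop["system"])          # "mono" vs. "mixed"

system_map = np.full(len(px), "non-crop", dtype=object)
system_map[is_crop] = sys_rf.predict(px.loc[is_crop, bands])
```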
Tseng, Chih-Jen; Lu, Chi-Jie; Chang, Chi-Chang; Chen, Gin-Den; Cheewakriangkrai, Chalong
2017-05-01
Ovarian cancer is the second leading cause of deaths among gynecologic cancers in the world. Approximately 90% of women with ovarian cancer reported having symptoms long before a diagnosis was made. Literature shows that recurrence should be predicted with regard to their personal risk factors and the clinical symptoms of this devastating cancer. In this study, ensemble learning and five data mining approaches, including support vector machine (SVM), C5.0, extreme learning machine (ELM), multivariate adaptive regression splines (MARS), and random forest (RF), were integrated to rank the importance of risk factors and diagnose the recurrence of ovarian cancer. The medical records and pathologic status were extracted from the Chung Shan Medical University Hospital Tumor Registry. Experimental results illustrated that the integrated C5.0 model is a superior approach in predicting the recurrence of ovarian cancer. Moreover, the classification accuracies of C5.0, ELM, MARS, RF, and SVM indeed increased after using the selected important risk factors as predictors. Our findings suggest that The International Federation of Gynecology and Obstetrics (FIGO), Pathologic M, Age, and Pathologic T were the four most critical risk factors for ovarian cancer recurrence. In summary, the above information can support the important influence of personality and clinical symptom representations on all phases of guide interventions, with the complexities of multiple symptoms associated with ovarian cancer in all phases of the recurrent trajectory. Copyright © 2017 Elsevier B.V. All rights reserved.
Assessing deep and shallow learning methods for quantitative prediction of acute chemical toxicity.
Liu, Ruifeng; Madore, Michael; Glover, Kyle P; Feasel, Michael G; Wallqvist, Anders
2018-05-02
Animal-based methods for assessing chemical toxicity are struggling to meet testing demands. In silico approaches, including machine-learning methods, are promising alternatives. Recently, deep neural networks (DNNs) were evaluated and reported to outperform other machine-learning methods for quantitative structure-activity relationship modeling of molecular properties. However, most of the reported performance evaluations relied on global performance metrics, such as the root mean squared error (RMSE) between the predicted and experimental values of all samples, without considering the impact of sample distribution across the activity spectrum. Here, we carried out an in-depth analysis of DNN performance for quantitative prediction of acute chemical toxicity using several datasets. We found that the overall performance of DNN models on datasets of up to 30,000 compounds was similar to that of random forest (RF) models, as measured by the RMSE and correlation coefficients between the predicted and experimental results. However, our detailed analyses demonstrated that global performance metrics are inappropriate for datasets with a highly uneven sample distribution, because they show a strong bias for the most populous compounds along the toxicity spectrum. For highly toxic compounds, DNN and RF models trained on all samples performed much worse than the global performance metrics indicated. Surprisingly, our variable nearest neighbor method, which utilizes only structurally similar compounds to make predictions, performed reasonably well, suggesting that information of close near neighbors in the training sets is a key determinant of acute toxicity predictions.
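The point about global metrics can be made concrete by computing RMSE both over the whole test set and within bins of the toxicity range, so that errors on the sparse, highly toxic end are reported separately; the file, column names and bin edges below are placeholders.

```python
# Small sketch: global RMSE vs. per-bin RMSE along the toxicity spectrum.
import numpy as np
import pandas as pd

pred = pd.read_csv("rf_test_predictions.csv")    # hypothetical: y_true, y_pred

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

print("global RMSE:", rmse(pred.y_true, pred.y_pred))

bins = [-np.inf, 1.0, 2.0, 3.0, np.inf]          # placeholder toxicity bins
for lo, hi in zip(bins[:-1], bins[1:]):
    m = (pred.y_true >= lo) & (pred.y_true < hi)
    if m.any():
        print(f"RMSE in [{lo}, {hi}): {rmse(pred.y_true[m], pred.y_pred[m]):.2f}")
```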
Lim, Dong Kyu; Long, Nguyen Phuoc; Mo, Changyeun; Dong, Ziyuan; Cui, Lingmei; Kim, Giyoung; Kwon, Sung Won
2017-10-01
The mixing of extraneous ingredients with original products is a common adulteration practice in food and herbal medicines. In particular, the authenticity of white rice and its corresponding blended products has become a key issue in the food industry. Accordingly, our current study aimed to develop and evaluate a novel discrimination method by combining targeted lipidomics with powerful supervised learning methods, and eventually to introduce a platform to verify the authenticity of white rice. A total of 30 cultivars were collected, and 330 representative samples of white rice from Korea and China as well as seven mixing ratios were examined. Random forests (RF), support vector machines (SVM) with a radial basis function kernel, C5.0, model averaged neural network, and k-nearest neighbor classifiers were used for the classification. We achieved the desired results, and the classifiers effectively differentiated white rice from Korea from blended samples with high prediction accuracy at contamination ratios as low as five percent. In addition, RF and SVM classifiers were generally superior to and more robust than the other techniques. Our approach demonstrated that the relative differences in lysoGPLs can be successfully utilized to detect the adulterated mixing of white rice originating from different countries. In conclusion, the present study introduces a novel and high-throughput platform that can be applied to authenticate adulterated admixtures from original white rice samples. Copyright © 2017 Elsevier Ltd. All rights reserved.
Determinants of gait stability while walking on a treadmill: A machine learning approach.
Reynard, Fabienne; Terrier, Philippe
2017-12-08
Dynamic balance in human locomotion can be assessed through the local dynamic stability (LDS) method. Whereas gait LDS has been used successfully in many settings and applications, little is known about its sensitivity to individual characteristics of healthy adults. Therefore, we reanalyzed a large dataset of accelerometric data measured for 100 healthy adults from 20 to 70 years of age performing 10 min of treadmill walking. We sought to assess the extent to which the variations of age, body mass and height, sex, and preferred walking speed (PWS) could influence gait LDS. The random forest (RF) and multivariate adaptive regression splines (MARS) algorithms were selected for their good bias-variance tradeoff and their capability to handle nonlinear associations. First, through the variable importance measure (VIM), we used RF to evaluate which individual characteristics had the highest influence on gait LDS. Second, we used MARS to detect potential interactions among individual characteristics that may influence LDS. The VIM and MARS results indicated that PWS and age correlated with LDS, whereas no associations were found for sex, body height, and body mass. Further, the MARS model detected an age-by-PWS interaction: at high PWS, gait stability is constant across age, whereas at low PWS, gait instability increases substantially with age. We conclude that it is advisable to consider the participants' age as well as their PWS to avoid potential biases in evaluating dynamic balance through LDS. Copyright © 2017 Elsevier Ltd. All rights reserved.
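As a rough illustration of the variable-importance step, the sketch below applies permutation importance to synthetic data with the same five predictors; the variable names, the toy response, and the use of permutation importance (rather than the RF's built-in VIM) are assumptions made for the example, not the study's data or method.

```python
# Hedged sketch: permutation-based variable importance for predictors of gait
# local dynamic stability (synthetic data; names and response are assumptions).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
n = 100
age = rng.uniform(20, 70, n)
pws = rng.uniform(0.8, 1.6, n)          # preferred walking speed, m/s
sex = rng.integers(0, 2, n)
height = rng.normal(1.72, 0.09, n)
mass = rng.normal(72, 12, n)
# Toy response: LDS depends on PWS and an age-by-PWS interaction only.
lds = 1.5 - 0.5 * pws + 0.01 * age * (1.6 - pws) + rng.normal(0, 0.05, n)

X = np.column_stack([age, pws, sex, height, mass])
rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, lds)
imp = permutation_importance(rf, X, lds, n_repeats=30, random_state=0)
for name, score in zip(["age", "PWS", "sex", "height", "mass"],
                       imp.importances_mean):
    print(f"{name:7s} importance: {score:.3f}")
```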
Ion mobility spectrometry fingerprints: A rapid detection technology for adulteration of sesame oil.
Zhang, Liangxiao; Shuai, Qian; Li, Peiwu; Zhang, Qi; Ma, Fei; Zhang, Wen; Ding, Xiaoxia
2016-02-01
A simple and rapid detection technology was proposed based on ion mobility spectrometry (IMS) fingerprints to determine potential adulteration of sesame oil. Oil samples were diluted with n-hexane and analyzed by IMS for 20 s. Then, chemometric methods were employed to establish discriminant models for sesame oils and four other edible oils, pure and adulterated sesame oils, and pure and counterfeit sesame oils, respectively. Finally, the Random Forests (RF) classification model correctly classified all five types of edible oils. The detection results indicated that the discriminant models built with the recursive support vector machine (R-SVM) method could identify adulterated sesame oil samples (⩾10%) with an accuracy of 94.2%. Therefore, IMS was shown to be an effective method for detecting adulterated sesame oils. Meanwhile, IMS fingerprints also work well for detecting counterfeit sesame oils produced by adding sesame oil essence to cheaper edible oils. Copyright © 2015 Elsevier Ltd. All rights reserved.
NASA Astrophysics Data System (ADS)
Eckert, Sandra
2016-08-01
The SPOT-5 Take 5 campaign provided SPOT time series data of an unprecedented spatial and temporal resolution. We analysed 29 scenes acquired between May and September 2015 of a semi-arid region in the foothills of Mount Kenya, with two aims: first, to distinguish rainfed from irrigated cropland and cropland from natural vegetation covers, which show similar reflectance patterns; and second, to identify individual crop types. We tested several input data sets in different combinations: the spectral bands and the normalized difference vegetation index (NDVI) time series, principal components of NDVI time series, and selected NDVI time series statistics. For the classification we used random forests (RF). In the test differentiating rainfed cropland, irrigated cropland, and natural vegetation covers, the best classification accuracies were achieved using spectral bands. For the differentiation of crop types, we analysed the phenology of selected crop types based on NDVI time series. First results are promising.
A Real-time Breakdown Prediction Method for Urban Expressway On-ramp Bottlenecks
NASA Astrophysics Data System (ADS)
Ye, Yingjun; Qin, Guoyang; Sun, Jian; Liu, Qiyuan
2018-01-01
Breakdown occurrence on expressways is considered to be related to various factors. Therefore, to investigate the association between breakdowns and these factors, a Bayesian network (BN) model is adopted in this paper. Based on breakdown events identified at 10 urban expressway on-ramps in Shanghai, China, 23 parameters preceding breakdowns are extracted, including dynamic environmental conditions aggregated over 5-minute intervals and static geometry features. Data from different time periods are used to predict breakdowns. Results indicate that models using data from 5-10 min prior to breakdown give the best predictions, with prediction accuracies higher than 73%. Moreover, a unified model for all bottlenecks is also built and shows reasonably good prediction performance, with a breakdown classification accuracy of about 75% at best. Additionally, to simplify the model inputs, a random forests (RF) model is adopted to identify the key variables. With the 7 selected parameters, the refined BN model can predict breakdowns with adequate accuracy.
Prediction of survival with multi-scale radiomic analysis in glioblastoma patients.
Chaddad, Ahmad; Sabri, Siham; Niazi, Tamim; Abdulkarim, Bassam
2018-06-19
We propose multiscale texture features based on a Laplacian-of-Gaussian (LoG) filter to predict progression-free survival (PFS) and overall survival (OS) in patients newly diagnosed with glioblastoma (GBM). Experiments use features extracted from 40 GBM patients with T1-weighted imaging (T1-WI) and fluid-attenuated inversion recovery (FLAIR) images that were segmented manually into areas of active tumor, necrosis, and edema. Multiscale texture features were extracted locally from each of these areas of interest using a LoG filter, and the relation between features and OS and PFS was investigated using univariate (i.e., Spearman's rank correlation coefficient, log-rank test, and Kaplan-Meier estimator) and multivariate (i.e., random forest classifier) analyses. Three and seven features were statistically correlated with PFS and OS, respectively, with absolute correlation values between 0.32 and 0.36 and p < 0.05. Three features derived from active tumor regions only were associated with OS (p < 0.05), with hazard ratios (HR) of 2.9, 3, and 3.24, respectively. Combined features showed AUC values of 85.37% and 85.54% for predicting the PFS and OS of GBM patients, respectively, using the random forest (RF) classifier. We presented multiscale texture features to characterize the GBM regions and predict the PFS and OS. The achievable performance suggests that this technique can be developed into a GBM MR analysis system suitable for clinical use after a thorough validation involving more patients. Graphical abstract: Scheme of the proposed model for characterizing the heterogeneity of GBM regions and predicting the overall survival and progression-free survival of GBM patients. (1) Acquisition of pretreatment MRI images; (2) affine registration of the T1-WI image with its corresponding FLAIR images, and GBM subtype (phenotype) labelling; (3) extraction of nine texture features from the three texture scales (fine, medium, and coarse) derived from each of the GBM regions; (4) comparison of heterogeneity between GBM regions by ANOVA test; survival analysis using univariate methods (Spearman rank correlation between features and survival (i.e., PFS and OS) for each of the GBM regions; Kaplan-Meier estimator and log-rank test to predict the PFS and OS of patient groups defined by the median of each feature) and a multivariate method (random forest model) for predicting the PFS and OS of patient groups defined by the median PFS and OS.
NASA Astrophysics Data System (ADS)
Li, Ningzhi; Li, Shizhe; Shen, Jun
2017-06-01
In vivo 13C magnetic resonance spectroscopy (MRS) is a unique and effective tool for studying dynamic human brain metabolism and the cycling of neurotransmitters. One of the major technical challenges for in vivo 13C-MRS is the high radio frequency (RF) power necessary for heteronuclear decoupling. In the common practice of in vivo 13C-MRS, alkanyl carbons are detected in the spectral range of 10-65 ppm. The amplitude of decoupling pulses has to be significantly greater than the large one-bond 1H-13C scalar coupling (1JCH = 125-145 Hz). Two main proton decoupling methods have been developed: broadband stochastic decoupling and coherent composite or adiabatic pulse decoupling (e.g., WALTZ); the latter is widely used because of its efficiency and superb performance under an inhomogeneous B1 field. Because the RF power required for proton decoupling increases quadratically with field strength, in vivo 13C-MRS using coherent decoupling is often limited to low magnetic fields (≤4 Tesla (T)) to keep the local and averaged specific absorption rate (SAR) under the safety guidelines established by the International Electrotechnical Commission (IEC) and the US Food and Drug Administration (FDA). Alternatively, carboxylic/amide carbons are coupled to protons via weak long-range 1H-13C scalar couplings, which can be decoupled using low RF power broadband stochastic decoupling. Recently, the carboxylic/amide 13C-MRS technique using low-power random RF heteronuclear decoupling was safely applied to human brain studies at 7T. Here, we review the two major decoupling methods and the carboxylic/amide 13C-MRS strategy with low-power decoupling. Further decreases in RF power deposition by frequency-domain windowing and time-domain random under-sampling are also discussed. Low RF power decoupling opens the possibility of performing in vivo 13C experiments on the human brain at very high magnetic fields (such as 11.7T), where signal-to-noise ratio as well as spatial and temporal spectral resolution are more favorable than at lower fields.
Machine Learning to Assess Grassland Productivity in Southeastern Arizona
NASA Astrophysics Data System (ADS)
Ponce-Campos, G. E.; Heilman, P.; Armendariz, G.; Moser, E.; Archer, V.; Vaughan, R.
2015-12-01
We present preliminary results of machine learning (ML) techniques modeling the combined effects of climate, management, and inherent potential on productivity of grazed semi-arid grasslands in southeastern Arizona. Our goal is to help public land managers determine whether agency management policies are meeting objectives and where to focus attention. Monitoring in the field is becoming more and more limited in space and time. Remotely sensed data cover the entire allotments and go back in time, but do not consider the key issue of species composition. By estimating expected vegetative production as a function of site potential and climatic inputs, management skill can be assessed through time, across individual allotments, and between allotments. Here we present the use of Random Forest (RF) as the main ML technique, in this case for the purpose of regression. Our response variable is the maximum annual NDVI, a surrogate for grassland productivity, as generated by the Google Earth Engine cloud computing platform based on Landsat 5, 7, and 8 datasets. PRISM 33-year normal precipitation (1980-2013) was resampled to the Landsat scale. In addition, the GRIDMET climate dataset was the source for the calculation of the annual SPEI (Standardized Precipitation Evapotranspiration Index), a drought index. We also included information about landscape position, aspect, streams, ponds, roads and fire disturbances as part of the modeling process. Our results show that in terms of variable importance, the 33-year normal precipitation, along with SPEI, are the most important features affecting grassland productivity within the study area. The RF approach was compared to a linear regression model with the same variables. The linear model resulted in an r2 = 0.41, whereas RF showed a significant improvement with an r2 = 0.79. We continue to refine the model by comparing it with aerial photography and by including grazing intensity and infrastructure from units/allotments to assess the effect of management practices on vegetation production.
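For readers unfamiliar with the comparison described above, the sketch below contrasts cross-validated r2 values for a linear model and a random forest regressor on synthetic predictors; the variable names and the toy nonlinear response are assumptions for illustration, not the study's data or results.

```python
# Hedged sketch: linear regression vs. random forest regression for an
# NDVI-like response (synthetic predictors, not the Arizona study data).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 2000
normal_precip = rng.uniform(250, 600, n)   # 33-year normal precipitation (mm)
spei = rng.normal(0, 1, n)                 # drought index
aspect = rng.uniform(0, 360, n)            # degrees
# Toy nonlinear response standing in for maximum annual NDVI.
ndvi = 0.2 + 0.0008 * normal_precip + 0.05 * np.tanh(spei) \
       + 0.02 * np.sin(np.radians(aspect)) + rng.normal(0, 0.03, n)

X = np.column_stack([normal_precip, spei, aspect])
for name, model in [("linear", LinearRegression()),
                    ("random forest", RandomForestRegressor(n_estimators=300,
                                                            random_state=0))]:
    r2 = cross_val_score(model, X, ndvi, cv=5, scoring="r2").mean()
    print(f"{name}: cross-validated r2 = {r2:.2f}")
```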
Che Hasan, Rozaimi; Ierodiaconou, Daniel; Laurenson, Laurie; Schimel, Alexandre
2014-01-01
Multibeam echosounders (MBES) are increasingly becoming the tool of choice for marine habitat mapping applications. In turn, the rapid expansion of habitat mapping studies has resulted in a need for automated classification techniques to efficiently map benthic habitats, assess confidence in model outputs, and evaluate the importance of variables driving the patterns observed. The benthic habitat characterisation process often involves the analysis of MBES bathymetry, backscatter mosaic or angular response with observation data providing ground truth. However, studies that make use of the full range of MBES outputs within a single classification process are limited. We present an approach that integrates backscatter angular response with MBES bathymetry, backscatter mosaic and their derivatives in a classification process using a Random Forests (RF) machine-learning algorithm to predict the distribution of benthic biological habitats. This approach includes a method of deriving statistical features from backscatter angular response curves created from MBES data collated within homogeneous regions of a backscatter mosaic. Using the RF algorithm we assess the relative importance of each variable in order to optimise the classification process and simplify models applied. The results showed that the inclusion of the angular response features in the classification process improved the accuracy of the final habitat maps from 88.5% to 93.6%. The RF algorithm identified bathymetry and the angular response mean as the two most important predictors. However, the highest classification rates were only obtained after incorporating additional features derived from bathymetry and the backscatter mosaic. The angular response features were found to be more important to the classification process compared to the backscatter mosaic features. This analysis indicates that integrating angular response information with bathymetry and the backscatter mosaic, along with their derivatives, constitutes an important improvement for studying the distribution of benthic habitats, which is necessary for effective marine spatial planning and resource management. PMID:24824155
Mapping the Transmission Risk of Zika Virus using Machine Learning Models.
Jiang, Dong; Hao, Mengmeng; Ding, Fangyu; Fu, Jingying; Li, Meng
2018-06-19
Zika virus, which has been linked to severe congenital abnormalities, is exacerbating global public health problems with its rapid transnational expansion fueled by increased global travel and trade. Suitability mapping of the transmission risk of Zika virus is essential for drafting public health plans and disease control strategies, which are especially important in areas where medical resources are relatively scarce. Predicting the risk of Zika virus outbreaks has been studied in recent years, but the published literature rarely includes multiple model comparisons or predictive uncertainty analysis. Here, three relatively popular machine learning models, a backward propagation neural network (BPNN), a gradient boosting machine (GBM), and a random forest (RF), were adopted to map the probability of Zika epidemic outbreak at the global level, pairing high-dimensional multidisciplinary covariate layers with comprehensive location data on recorded Zika virus infection in humans. The results show that the predicted high-risk areas for Zika transmission are concentrated in four regions: Southeastern North America, Eastern South America, Central Africa and Eastern Asia. To evaluate the performance of the machine learning models, 50 modeling runs were conducted on a training dataset. The BPNN model obtained the highest predictive accuracy with a 10-fold cross-validation area under the curve (AUC) of 0.966 [95% confidence interval (CI) 0.965-0.967], followed by the GBM model (10-fold cross-validation AUC = 0.964 [0.963-0.965]) and the RF model (10-fold cross-validation AUC = 0.963 [0.962-0.964]). On the training samples, the prediction accuracies achieved by the GBM and RF models differed significantly from that of the BPNN-based model (p = 0.0258* and p = 0.0001***, respectively). Importantly, the prediction uncertainty introduced by the selection of absence data was quantified and could provide more accurate fundamental and scientific information for further study on disease transmission prediction and risk assessment. Copyright © 2018. Published by Elsevier B.V.
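The model-comparison step, 10-fold cross-validated AUC for several learners, can be sketched as follows with scikit-learn; the synthetic presence/absence data and the specific estimators (an MLP standing in for the BPNN) are assumptions for illustration, not the study's covariates or models.

```python
# Hedged sketch: comparing candidate models by 10-fold cross-validated AUC
# (toy presence/absence data, not the Zika covariate layers).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=2000, n_features=25, n_informative=8,
                           weights=[0.8, 0.2], random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
models = {
    "BPNN (MLP stand-in)": MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000,
                                         random_state=0),
    "GBM": GradientBoostingClassifier(random_state=0),
    "RF": RandomForestClassifier(n_estimators=500, random_state=0),
}
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: 10-fold AUC = {auc.mean():.3f} +/- {auc.std():.3f}")
```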
Developing robust arsenic awareness prediction models using machine learning algorithms.
Singh, Sushant K; Taylor, Robert W; Rahman, Mohammad Mahmudur; Pradhan, Biswajeet
2018-04-01
Arsenic awareness plays a vital role in ensuring the sustainability of arsenic mitigation technologies. Thus far, however, few studies have dealt with the sustainability of such technologies and their associated socioeconomic dimensions. As a result, arsenic awareness prediction has not yet been fully conceptualized. Accordingly, this study evaluated arsenic awareness among arsenic-affected communities in rural India, using a structured questionnaire to record socioeconomic, demographic, and other sociobehavioral factors with an eye to assessing their association with and influence on arsenic awareness. First, a logistic regression model was applied and its results compared with those produced by six state-of-the-art machine-learning algorithms (Support Vector Machine [SVM], Kernel-SVM, Decision Tree [DT], k-Nearest Neighbor [k-NN], Naïve Bayes [NB], and Random Forests [RF]) as measured by their accuracy at predicting arsenic awareness. Most (63%) of the surveyed population was found to be arsenic-aware. Significant arsenic awareness predictors were divided into three types: (1) socioeconomic factors: caste, education level, and occupation; (2) water and sanitation behavior factors: number of family members involved in water collection, distance traveled and time spent for water collection, places for defecation, and materials used for handwashing after defecation; and (3) social capital and trust factors: presence of anganwadi and people's trust in other community members, NGOs, and private agencies. Moreover, individuals with larger social networks contributed positively to arsenic awareness in their communities. Results indicated that the SVM and RF algorithms performed best at the overall prediction of arsenic awareness, a nonlinear classification problem. Lower-caste, less educated, and unemployed members of the population were found to be the most vulnerable, requiring immediate arsenic mitigation. To this end, local social institutions and NGOs could play a crucial role in arsenic awareness and outreach programs. Use of SVM or RF or a combination of the two, together with use of a larger sample size, could enhance the accuracy of arsenic awareness prediction. Copyright © 2018 Elsevier Ltd. All rights reserved.
Extensions and applications of ensemble-of-trees methods in machine learning
NASA Astrophysics Data System (ADS)
Bleich, Justin
Ensemble-of-trees algorithms have risen to the forefront of machine learning due to their ability to generate high forecasting accuracy for a wide array of regression and classification problems. Classic ensemble methodologies such as random forests (RF) and stochastic gradient boosting (SGB) rely on algorithmic procedures to generate fits to data. In contrast, more recent ensemble techniques such as Bayesian Additive Regression Trees (BART) and Dynamic Trees (DT) focus on an underlying Bayesian probability model to generate the fits. These new probability model-based approaches show much promise versus their algorithmic counterparts, but also offer substantial room for improvement. The first part of this thesis focuses on methodological advances for ensemble-of-trees techniques with an emphasis on the more recent Bayesian approaches. In particular, we focus on extensions of BART in four distinct ways. First, we develop a more robust implementation of BART for both research and application. We then develop a principled approach to variable selection for BART as well as the ability to naturally incorporate prior information on important covariates into the algorithm. Next, we propose a method for handling missing data that relies on the recursive structure of decision trees and does not require imputation. Last, we relax the assumption of homoskedasticity in the BART model to allow for parametric modeling of heteroskedasticity. The second part of this thesis returns to the classic algorithmic approaches in the context of classification problems with asymmetric costs of forecasting errors. First, we consider the performance of RF and SGB more broadly and demonstrate their superiority to logistic regression for applications in criminology with asymmetric costs. Next, we use RF to forecast unplanned hospital readmissions upon patient discharge with asymmetric costs taken into account. Finally, we explore the construction of stable decision trees for forecasts of violence during probation hearings in court systems.
Ali, Safdar; Majid, Abdul; Khan, Asifullah
2014-04-01
Development of an accurate and reliable intelligent decision-making method for the construction of cancer diagnosis system is one of the fast growing research areas of health sciences. Such decision-making system can provide adequate information for cancer diagnosis and drug discovery. Descriptors derived from physicochemical properties of protein sequences are very useful for classifying cancerous proteins. Recently, several interesting research studies have been reported on breast cancer classification. To this end, we propose the exploitation of the physicochemical properties of amino acids in protein primary sequences such as hydrophobicity (Hd) and hydrophilicity (Hb) for breast cancer classification. Hd and Hb properties of amino acids, in recent literature, are reported to be quite effective in characterizing the constituent amino acids and are used to study protein foldings, interactions, structures, and sequence-order effects. Especially, using these physicochemical properties, we observed that proline, serine, tyrosine, cysteine, arginine, and asparagine amino acids offer high discrimination between cancerous and healthy proteins. In addition, unlike traditional ensemble classification approaches, the proposed 'IDM-PhyChm-Ens' method was developed by combining the decision spaces of a specific classifier trained on different feature spaces. The different feature spaces used were amino acid composition, split amino acid composition, and pseudo amino acid composition. Consequently, we have exploited different feature spaces using Hd and Hb properties of amino acids to develop an accurate method for classification of cancerous protein sequences. We developed ensemble classifiers using diverse learning algorithms such as random forest (RF), support vector machines (SVM), and K-nearest neighbor (KNN) trained on different feature spaces. We observed that ensemble-RF, in case of cancer classification, performed better than ensemble-SVM and ensemble-KNN. Our analysis demonstrates that ensemble-RF, ensemble-SVM and ensemble-KNN are more effective than their individual counterparts. The proposed 'IDM-PhyChm-Ens' method has shown improved performance compared to existing techniques.
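The core idea above, training the same base learner on several distinct feature spaces and combining their decisions, can be sketched with scikit-learn as below. This is not the published 'IDM-PhyChm-Ens' implementation: the data are synthetic, the column ranges are placeholders for amino acid composition (AAC), split amino acid composition (SAAC), and pseudo amino acid composition (PseAAC), and soft voting is used as one simple way of combining decision spaces.

```python
# Hedged sketch: combine decisions of the same base learner trained on
# different feature spaces via soft voting (synthetic data, assumed columns).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: columns 0-19 mimic AAC, 20-59 SAAC, 60-79 PseAAC features.
X, y = make_classification(n_samples=600, n_features=80, n_informative=12,
                           random_state=0)

def feature_space(cols):
    # Transformer that keeps only the columns of one assumed feature space.
    return FunctionTransformer(lambda X, c=cols: X[:, c])

spaces = {"AAC": slice(0, 20), "SAAC": slice(20, 60), "PseAAC": slice(60, 80)}
members = [(name,
            make_pipeline(feature_space(cols),
                          RandomForestClassifier(n_estimators=300, random_state=0)))
           for name, cols in spaces.items()]
ens = VotingClassifier(members, voting="soft")

print("ensemble-RF accuracy:", cross_val_score(ens, X, y, cv=5).mean().round(3))
for name, member in members:
    print(f"single-space RF ({name}):",
          cross_val_score(member, X, y, cv=5).mean().round(3))
```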
Non-random species loss in a forest herbaceous layer following nitrogen addition
Christopher A. Walter; Mary Beth Adams; Frank S. Gilliam; William T. Peterjohn
2017-01-01
Nitrogen (N) additions have decreased species richness (S) in hardwood forest herbaceous layers, yet the functional mechanisms for these decreases have not been explicitly evaluated. We tested two hypothesized mechanisms, random species loss (RSL) and non-random species loss (NRSL), in the hardwood forest herbaceous layer of a long-term, plot-scale...
Public Involvement Processes and Methodologies: An Analysis
Ernst Valfer; Stephen Laner; Daina Dravnieks
1977-01-01
This report explores some sensitive or critical areas in public involvement. A 1972 RF&D workshop on public involvement identified a series of issues requiring research and analysis. A subsequent PNW study "Public Involvement and the Forest Service" (Hendee 1973) addressed many of these issues. This study assignment by the Chief's Office was...
Fault Diagnosis Strategies for SOFC-Based Power Generation Plants
Costamagna, Paola; De Giorgi, Andrea; Gotelli, Alberto; Magistri, Loredana; Moser, Gabriele; Sciaccaluga, Emanuele; Trucco, Andrea
2016-01-01
The success of distributed power generation by plants based on solid oxide fuel cells (SOFCs) is hindered by reliability problems that can be mitigated through an effective fault detection and isolation (FDI) system. However, the numerous operating conditions under which such plants can operate and the random size of the possible faults make identifying damaged plant components starting from the physical variables measured in the plant very difficult. In this context, we assess two classical FDI strategies (model-based with fault signature matrix and data-driven with statistical classification) and the combination of them. For this assessment, a quantitative model of the SOFC-based plant, which is able to simulate regular and faulty conditions, is used. Moreover, a hybrid approach based on the random forest (RF) classification method is introduced to address the discrimination of regular and faulty situations due to its practical advantages. Working with a common dataset, the FDI performances obtained using the aforementioned strategies, with different sets of monitored variables, are observed and compared. We conclude that the hybrid FDI strategy, realized by combining a model-based scheme with a statistical classifier, outperforms the other strategies. In addition, the inclusion of two physical variables that should be measured inside the SOFCs can significantly improve the FDI performance, despite the actual difficulty in performing such measurements. PMID:27556472
Alcidi, L; Beneforti, E; Maresca, M; Santosuosso, U; Zoppi, M
2007-01-01
To investigate the analgesic effect of low power radiofrequency electromagnetic radiation (RF) in osteoarthritis (OA) of the knee. In a randomized study on 40 patients the analgesic effect of RF was compared with the effect of transcutaneous electrical nerve stimulation (TENS). RF and TENS applications were repeated every day for a period of 5 days. The therapeutic effect was evaluated by a visual analogue scale (VAS) and by Lequesne's index: tests were performed before, immediately after and 30 days after therapy. RF therapy induced a statistically significant and long lasting decrease of VAS and of Lequesne's index; TENS induced a decrease of VAS and of Lequesne's index which was not statistically significant. A therapeutic effect of RF was therefore demonstrated on pain and disability due to knee OA. This effect was better than the effect of TENS, which is a largely used analgesic technique. Such a difference of the therapeutic effect may be due to the fact that TENS acts only on superficial tissues and nerve terminals, while RF acts increasing superficial and deep tissue temperature.
Shah, Anoop D.; Bartlett, Jonathan W.; Carpenter, James; Nicholas, Owen; Hemingway, Harry
2014-01-01
Multivariate imputation by chained equations (MICE) is commonly used for imputing missing data in epidemiologic research. The “true” imputation model may contain nonlinearities which are not included in default imputation models. Random forest imputation is a machine learning technique which can accommodate nonlinearities and interactions and does not require a particular regression model to be specified. We compared parametric MICE with a random forest-based MICE algorithm in 2 simulation studies. The first study used 1,000 random samples of 2,000 persons drawn from the 10,128 stable angina patients in the CALIBER database (Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; 2001–2010) with complete data on all covariates. Variables were artificially made “missing at random,” and the bias and efficiency of parameter estimates obtained using different imputation methods were compared. Both MICE methods produced unbiased estimates of (log) hazard ratios, but random forest was more efficient and produced narrower confidence intervals. The second study used simulated data in which the partially observed variable depended on the fully observed variables in a nonlinear way. Parameter estimates were less biased using random forest MICE, and confidence interval coverage was better. This suggests that random forest imputation may be useful for imputing complex epidemiologic data sets in which some patients have missing data. PMID:24589914
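As a rough illustration of random-forest-based chained-equations imputation for numeric variables, the sketch below uses scikit-learn's IterativeImputer with a random forest as the per-variable estimator; this is a single imputation on synthetic data, not the MICE implementation compared in the study, and a full MICE analysis would draw and pool multiple imputations.

```python
# Hedged sketch: chained-equations-style imputation with a random forest as the
# per-variable model (single imputation, synthetic data with a nonlinear link).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 ** 2 + rng.normal(scale=0.3, size=n)   # depends nonlinearly on x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Make ~20% of x2 missing at random.
X_missing = X.copy()
X_missing[rng.random(n) < 0.2, 1] = np.nan

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X_missing)

mask = np.isnan(X_missing[:, 1])
rmse = np.sqrt(np.mean((X_imputed[mask, 1] - X[mask, 1]) ** 2))
print(f"RMSE of imputed x2 values: {rmse:.3f}")
```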
[Effects of land use changes on soil water conservation in Hainan Island, China].
Wen, Zhi; Zhao, He; Liu, Lei; OuYang, Zhi Yun; Zheng, Hua; Mi, Hong Xu; Li, Yan Min
2017-12-01
In tropical areas, a large number of natural forests have been transformed into plantations, which has affected the water conservation function of terrestrial ecosystems. In order to clarify the effects of land use changes on soil water conservation function, we selected four typical land use types in the central mountainous region of Hainan Island, i.e., natural forests with stand age greater than 100 years (VF), secondary forests with stand age of 10 years (SF), areca plantations with stand age of 12 years (AF) and rubber plantations with stand age of 35 years (RF). The effects of land use change on soil water holding capacity and water conservation (represented by the soil water index, SWI) were assessed. The results showed that, compared with VF, the soil water holding capacity index of the other land types decreased in the top soil layer (0-10 cm). AF had the lowest soil water holding capacity in all soil layers. Soil water content and maximum water holding capacity were significantly related to canopy density, soil organic matter and soil bulk density, which indicated that canopy density, soil organic matter and compactness were important factors influencing soil water holding capacity. Compared to VF, soil water conservation of SF, AF and RF was reduced by 27.7%, 54.3% and 11.5%, respectively. The change of soil water conservation was inconsistent in different soil layers. Vegetation canopy density, soil organic matter and soil bulk density explained 83.3% of the variance of soil water conservation. It was suggested that land use conversion had significantly altered soil water holding capacity and water conservation function. RF retained soil water better than AF in the research area. Increasing soil organic matter and reducing soil compaction would help to improve soil water holding capacity and water conservation function in land management.
Approximating prediction uncertainty for random forest regression models
John W. Coulston; Christine E. Blinn; Valerie A. Thomas; Randolph H. Wynne
2016-01-01
Machine learning approaches such as random forest have increased for the spatial modeling and mapping of continuous variables. Random forest is a non-parametric ensemble approach, and unlike traditional regression approaches there is no direct quantification of prediction error. Understanding prediction uncertainty is important when using model-based continuous maps as...
Radiofrequency Cauterization with Biopsy Introducer Needle
Pritchard, William F.; Wray-Cahen, Diane; Karanian, John W.; Hilbert, Stephen; Wood, Bradford J.
2014-01-01
PURPOSE The principal risks of needle biopsy are hemorrhage and implantation of tumor cells in the needle tract. This study compared hemorrhage after liver and kidney biopsy with and without radiofrequency (RF) ablation of the needle tract. MATERIALS AND METHODS Biopsies of liver and kidney were performed in swine through introducer needles modified to allow RF ablation with the distal 2 cm of the needle. After each biopsy, randomization determined whether the site was to undergo RF ablation during withdrawal of the introducer needle. Temperature was measured with a thermistor stylet near the needle tip, with a target temperature of 70°C–100°C with RF ablation. Blood loss was measured as grams of blood absorbed in gauze at the puncture site for 2 minutes after needle withdrawal. Selected specimens were cut for gross examination. RESULTS RF ablation reduced bleeding compared with absence of RF ablation in liver and kidney (P < .01), with mean blood loss reduced 63% and 97%, respectively. Mean amounts of blood loss (±SD) in the liver in the RF and no-RF groups were 2.03 g ± 4.03 (CI, 0.53–3.54 g) and 5.50 g ± 5.58 (CI, 3.33–7.66 g), respectively. Mean amounts of blood loss in the kidney in the RF and no-RF groups were 0.26 g ± 0.32 (CI, −0.01 to 0.53 g) and 8.79 g ± 7.72 (CI, 2.34–15.24 g), respectively. With RF ablation, thermal coagulation of the tissue surrounding the needle tract was observed. CONCLUSION RF ablation of needle biopsy tracts reduced hemorrhage after biopsy in the liver and kidney and may reduce complications of hemorrhage as well as implantation of tumor cells in the tract. PMID:14963187
NASA Astrophysics Data System (ADS)
Fedrigo, Melissa; Newnham, Glenn J.; Coops, Nicholas C.; Culvenor, Darius S.; Bolton, Douglas K.; Nitschke, Craig R.
2018-02-01
Light detection and ranging (lidar) data have been increasingly used for forest classification due to its ability to penetrate the forest canopy and provide detail about the structure of the lower strata. In this study we demonstrate forest classification approaches using airborne lidar data as inputs to random forest and linear unmixing classification algorithms. Our results demonstrated that both random forest and linear unmixing models identified a distribution of rainforest and eucalypt stands that was comparable to existing ecological vegetation class (EVC) maps based primarily on manual interpretation of high resolution aerial imagery. Rainforest stands were also identified in the region that have not previously been identified in the EVC maps. The transition between stand types was better characterised by the random forest modelling approach. In contrast, the linear unmixing model placed greater emphasis on field plots selected as endmembers which may not have captured the variability in stand structure within a single stand type. The random forest model had the highest overall accuracy (84%) and Cohen's kappa coefficient (0.62). However, the classification accuracy was only marginally better than linear unmixing. The random forest model was applied to a region in the Central Highlands of south-eastern Australia to produce maps of stand type probability, including areas of transition (the 'ecotone') between rainforest and eucalypt forest. The resulting map provided a detailed delineation of forest classes, which specifically recognised the coalescing of stand types at the landscape scale. This represents a key step towards mapping the structural and spatial complexity of these ecosystems, which is important for both their management and conservation.
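A minimal sketch of how a random forest's class probabilities can be turned into stand-type probability maps, including an intermediate-probability 'ecotone' band, is given below; the synthetic lidar-style predictors, the probability thresholds, and the two-class setup are assumptions for illustration, not details from the study.

```python
# Hedged sketch: class probabilities from a random forest used to flag
# transition (ecotone) cells between two stand types (toy data and thresholds).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy stand-in: class 0 = eucalypt forest, class 1 = rainforest.
X, y = make_classification(n_samples=1500, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_map, y_train, _ = train_test_split(X, y, test_size=0.5, random_state=0)

rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X_train, y_train)
print("OOB accuracy:", round(rf.oob_score_, 3))

p_rainforest = rf.predict_proba(X_map)[:, 1]
# Cells with intermediate probability can be flagged as the ecotone.
ecotone = (p_rainforest > 0.35) & (p_rainforest < 0.65)
print("fraction of map cells flagged as ecotone:", round(float(ecotone.mean()), 3))
```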
Hsieh, Chung-Ho; Lu, Ruey-Hwa; Lee, Nai-Hsin; Chiu, Wen-Ta; Hsu, Min-Huei; Li, Yu-Chuan Jack
2011-01-01
Diagnosing acute appendicitis clinically is still difficult. We developed random forests, support vector machines, and artificial neural network models to diagnose acute appendicitis. Between January 2006 and December 2008, patients who had a consultation session with surgeons for suspected acute appendicitis were enrolled. Seventy-five percent of the data set was used to construct models including random forest, support vector machines, artificial neural networks, and logistic regression. Twenty-five percent of the data set was withheld to evaluate model performance. The area under the receiver operating characteristic curve (AUC) was used to evaluate performance, which was compared with that of the Alvarado score. Data from a total of 180 patients were collected, 135 used for training and 45 for testing. The mean age of patients was 39.4 years (range, 16-85). Final diagnosis revealed 115 patients with and 65 without appendicitis. The AUC of random forest, support vector machines, artificial neural networks, logistic regression, and Alvarado was 0.98, 0.96, 0.91, 0.87, and 0.77, respectively. The sensitivity, specificity, positive, and negative predictive values of random forest were 94%, 100%, 100%, and 87%, respectively. Random forest performed better than artificial neural networks, logistic regression, and Alvarado. We demonstrated that random forest can predict acute appendicitis with good accuracy and, deployed appropriately, can be an effective tool in clinical decision making. Copyright © 2011 Mosby, Inc. All rights reserved.
The experimental design of the Missouri Ozark Forest Ecosystem Project
Steven L. Sheriff; Shuoqiong He
1997-01-01
The Missouri Ozark Forest Ecosystem Project (MOFEP) is an experiment that examines the effects of three forest management practices on the forest community. MOFEP is designed as a randomized complete block design using nine sites divided into three blocks. Treatments of uneven-aged, even-aged, and no-harvest management were randomly assigned to sites within each block...
Attademo, Andrés M; Cabagna-Zenklusen, Mariana; Lajmanovich, Rafael C; Peltzer, Paola M; Junges, Celina; Bassó, Agustín
2011-01-01
Activity of B-esterases (BChE: butyrylcholinesterase and CbE: carboxylesterase using two model substrates: α-naphthyl acetate and 4-nitrophenyl valerate) in a native frog, Leptodactylus chaquensis from rice fields (RF1: methamidophos and RF2: cypermethrin and endosulfan sprayed by aircraft) and non-contaminated area (pristine forest) was measured. The ability of pyridine-2-aldoxime methochloride (2-PAM) to reactivate BChE levels was also explored. In addition, changes in blood cell morphology and parasite infection were determined. Mean values of plasma BChE activities were lower in samples from the two rice fields than in those from the reference site. CbE (4-nitrophenyl valerate) levels varied in the three sites studied, being highest in RF1. Frog plasma from RF1 showed positive reactivation of BChE activity after incubation with 2-PAM. Blood parameters of frogs from RF2 revealed morphological alterations (anisochromasia and immature erythrocytes frequency). Moreover, a major infection of protozoan Trypanosoma sp. in individuals from the two rice fields was detected. We suggest that integrated use of several biomarkers (BChE and CBEs, chemical reactivation of plasma with 2-PAM, and blood cell parameters) may be a promising procedure for use in biomonitoring programmes to diagnose pesticide exposure of wild populations of this frog and other native anuran species in Argentina.
ERIC Educational Resources Information Center
Golino, Hudson F.; Gomes, Cristiano M. A.
2016-01-01
This paper presents a non-parametric imputation technique, named random forest, from the machine learning field. The random forest procedure has two main tuning parameters: the number of trees grown in the prediction and the number of predictors used. Fifty experimental conditions were created in the imputation procedure, with different…
NASA Astrophysics Data System (ADS)
Schepaschenko, D.; McCallum, I.; Shvidenko, A.; Kraxner, F.; Fritz, S.
2009-04-01
There is a critical need for accurate land cover information for resource assessment, biophysical modeling, greenhouse gas studies, and for estimating possible terrestrial responses and feedbacks to climate change. However, practically all existing land cover datasets have quite a high level of uncertainty and suffer from a lack of important details that does not allow for relevant parameterization, e.g., data derived from different forest inventories. The objective of this study is to develop a methodology in order to create a hybrid land cover dataset at a level that would satisfy the requirements of the verified terrestrial biota full greenhouse gas account (Shvidenko et al., 2008) for large regions, i.e., Russia. Such requirements necessitate a detailed quantification of land classes (e.g., for forests - dominant species, age, growing stock, net primary production, etc.) with additional information on uncertainties of the major biometric and ecological parameters in the range of 10-20% and a confidence interval of around 0.9. The approach taken here allows the integration of different datasets to explore synergies and in particular the merging and harmonization of land and forest inventories, ecological monitoring, remote sensing data and in-situ information. The following datasets have been integrated: Remote sensing: Global Land Cover 2000 (Fritz et al., 2003), Vegetation Continuous Fields (Hansen et al., 2002), Vegetation Fire (Sukhinin, 2007), Regional land cover (Schmullius et al., 2005); GIS: Soil 1:2.5 Mio (Dokuchaev Soil Science Institute, 1996), Administrative Regions 1:2.5 Mio, Vegetation 1:4 Mio, Bioclimatic Zones 1:4 Mio (Stolbovoi & McCallum, 2002), Forest Enterprises 1:2.5 Mio, Rivers/Lakes and Roads/Railways 1:1 Mio (IIASA's database); Inventories and statistics: State Land Account (FARSC RF, 2006), State Forest Account - SFA (FFS RF, 2003), Disturbances in forests (FFS RF, 2006). The resulting hybrid land cover dataset at 1-km resolution comprises the following classes: Forest (each grid links to the SFA database, which contains 86,613 records); Agriculture (5 classes, parameterized by 89 administrative units); Wetlands (8 classes, parameterized by 83 zone/region units); Open Woodland, Burnt area; Shrub/grassland (50 classes, parameterized by 300 zone/region units); Water; Unproductive area. This study has demonstrated the ability to produce a highly detailed (both spatially and thematically) land cover dataset over Russia. Future efforts include further validation of the hybrid land cover dataset for Russia, and its use for assessment of the terrestrial biota full greenhouse gas budget across Russia. The methodology proposed in this study could be applied at the global level. Results of such an undertaking would however be highly dependent upon the quality of the available ground data. The implementation of the hybrid land cover dataset was undertaken in such a way that it can be regularly updated based on new ground data and remote sensing products (i.e., MODIS).
NASA Astrophysics Data System (ADS)
Singh, Minerva; Malhi, Yadvinder; Bhagwat, Shonil
2014-01-01
The focus of this study is to assess the efficacy of using optical remote sensing (RS) in evaluating disparities in forest composition and aboveground biomass (AGB). The research was carried out in the East Sabah region, Malaysia, which constitutes a disturbance gradient ranging from pristine old growth forests to forests that have experienced varying levels of disturbances. Additionally, a significant proportion of the area consists of oil palm plantations. In accordance with local laws, riparian forest (RF) zones have been retained within oil palm plantations and other forest types. The RS imagery was used to assess forest stand structure and AGB. Band reflectance, vegetation indicators, and gray-level co-occurrence matrix (GLCM) consistency features were used as predictor variables in regression analysis. Results indicate that the spectral variables were limited in their effectiveness in differentiating between forest types and in calculating biomass. However, GLCM based variables illustrated strong correlations with the forest stand structures as well as with the biomass of the various forest types in the study area. The present study provides new insights into the efficacy of texture examination methods in differentiating between various land-use types (including small, isolated forest zones such as RFs) as well as their AGB stocks.
Random Bits Forest: a Strong Classifier/Regressor for Big Data
NASA Astrophysics Data System (ADS)
Wang, Yi; Li, Yi; Pu, Weilin; Wen, Kathryn; Shugart, Yin Yao; Xiong, Momiao; Jin, Li
2016-07-01
Efficiency, memory consumption, and robustness are common problems with many popular methods for data analysis. As a solution, we present Random Bits Forest (RBF), a classification and regression algorithm that integrates neural networks (for depth), boosting (for width), and random forests (for prediction accuracy). Through a gradient boosting scheme, it first generates and selects ~10,000 small, 3-layer random neural networks. These networks are then fed into a modified random forest algorithm to obtain predictions. Testing with datasets from the UCI (University of California, Irvine) Machine Learning Repository shows that RBF outperforms other popular methods in both accuracy and robustness, especially with large datasets (N > 1000). The algorithm also performed well in testing on an independent dataset, a real psoriasis genome-wide association study (GWAS).
Using Random Forest Models to Predict Organizational Violence
NASA Technical Reports Server (NTRS)
Levine, Burton; Bobashev, Georgly
2012-01-01
We present a methodology to assess the proclivity of an organization to commit violence against nongovernment personnel. We fitted a Random Forest model using the Minority at Risk Organizational Behavior (MAROS) dataset. The MAROS data are longitudinal, so individual observations are not independent. We propose a modification to the standard Random Forest methodology to account for the violation of the independence assumption. We present the results of the model fit and an example of predicting violence for an organization; finally, we present a summary of the forest in a "meta-tree."
Responsive feeding and child undernutrition in low- and middle-income countries.
Bentley, Margaret E; Wasser, Heather M; Creed-Kanashiro, Hilary M
2011-03-01
Growth faltering and nutritional deficiencies continue to be highly prevalent in infants and young children (IYC) living in low- and middle-income (LAMI) countries. There is increasing recognition that feeding behaviors and styles, particularly responsive feeding (RF), could influence acceptance of food and dietary intake and thus the growth of IYC. This paper presents the evolution of RF research and the strength of the evidence for RF on child undernutrition in LAMI countries. Multiple approaches were used to identify studies, including keyword searches in many databases, hand searches of retrieved articles, and consultation with experts in the field. Articles were included if they contained a RF exposure and child undernutrition outcome. In total, we identified 21 studies: 15 on child growth, 4 on dietary intake, 3 on disease, and 8 on eating behaviors. Most studies were conducted among children <36 mo of age and were published in the last 10 y. Cross-study comparisons were difficult due to multiple definitions of RF. One-half of the studies were observational with cross-sectional designs and few interventions were designed to isolate the effect of RF on child undernutrition. Overall, few studies have demonstrated a positive association between RF and child undernutrition, although there is promising evidence that positive caregiver verbalizations during feeding increase child acceptance of food. Recommendations for future research include consensus on the definition and measurement of RF, longitudinal studies that begin early in infancy, and randomized controlled trials that isolate the effect of RF on child undernutrition.
A Dirichlet process model for classifying and forecasting epidemic curves
2014-01-01
Background A forecast can be defined as an endeavor to quantitatively estimate a future event or probabilities assigned to a future occurrence. Forecasting stochastic processes such as epidemics is challenging since there are several biological, behavioral, and environmental factors that influence the number of cases observed at each point during an epidemic. However, accurate forecasts of epidemics would impact timely and effective implementation of public health interventions. In this study, we introduce a Dirichlet process (DP) model for classifying and forecasting influenza epidemic curves. Methods The DP model is a nonparametric Bayesian approach that enables the matching of current influenza activity to simulated and historical patterns, identifies epidemic curves different from those observed in the past and enables prediction of the expected epidemic peak time. The method was validated using simulated influenza epidemics from an individual-based model and the accuracy was compared to that of the tree-based classification technique, Random Forest (RF), which has been shown to achieve high accuracy in the early prediction of epidemic curves using a classification approach. We also applied the method to forecasting influenza outbreaks in the United States from 1997–2013 using influenza-like illness (ILI) data from the Centers for Disease Control and Prevention (CDC). Results We made the following observations. First, the DP model performed as well as RF in identifying several of the simulated epidemics. Second, the DP model correctly forecasted the peak time several days in advance for most of the simulated epidemics. Third, the accuracy of identifying epidemics different from those already observed improved with additional data, as expected. Fourth, both methods correctly classified epidemics with higher reproduction numbers (R) with a higher accuracy compared to epidemics with lower R values. Lastly, in the classification of seasonal influenza epidemics based on ILI data from the CDC, the methods’ performance was comparable. Conclusions Although RF requires less computational time compared to the DP model, the algorithm is fully supervised implying that epidemic curves different from those previously observed will always be misclassified. In contrast, the DP model can be unsupervised, semi-supervised or fully supervised. Since both methods have their relative merits, an approach that uses both RF and the DP model could be beneficial. PMID:24405642
Li, Aihua; Dhakal, Shital; Glenn, Nancy F.; Spaete, Luke P.; Shinneman, Douglas; Pilliod, David S.; Arkle, Robert; McIlroy, Susan
2017-01-01
Our study objectives were to model the aboveground biomass in a xeric shrub-steppe landscape with airborne light detection and ranging (Lidar) and explore the uncertainty associated with the models we created. We incorporated vegetation vertical structure information obtained from Lidar with ground-measured biomass data, allowing us to scale shrub biomass from small field sites (1 m subplots and 1 ha plots) to a larger landscape. A series of airborne Lidar-derived vegetation metrics were trained and linked with the field-measured biomass in Random Forests (RF) regression models. A Stepwise Multiple Regression (SMR) model was also explored as a comparison. Our results demonstrated that the important predictors from Lidar-derived metrics had a strong correlation with field-measured biomass in the RF regression models with a pseudo R2 of 0.76 and RMSE of 125 g/m2 for shrub biomass and a pseudo R2 of 0.74 and RMSE of 141 g/m2 for total biomass, and a weak correlation with field-measured herbaceous biomass. The SMR results were similar but slightly better than RF, explaining 77–79% of the variance, with RMSE ranging from 120 to 129 g/m2 for shrub and total biomass, respectively. We further explored the computational efficiency and relative accuracies of using point cloud and raster Lidar metrics at different resolutions (1 m to 1 ha). Metrics derived from the Lidar point cloud processing led to improved biomass estimates at nearly all resolutions in comparison to raster-derived Lidar metrics. Only at 1 m were the results from the point cloud and raster products nearly equivalent. The best Lidar prediction models of biomass at the plot-level (1 ha) were achieved when Lidar metrics were derived from an average of fine resolution (1 m) metrics to minimize boundary effects and to smooth variability. Overall, both RF and SMR methods explained more than 74% of the variance in biomass, with the most important Lidar variables being associated with vegetation structure and statistical measures of this structure (e.g., standard deviation of height was a strong predictor of biomass). Using our model results, we developed spatially-explicit Lidar estimates of total and shrub biomass across our study site in the Great Basin, U.S.A., for monitoring and planning in this imperiled ecosystem.
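The sketch below illustrates the kind of out-of-bag 'pseudo R2' and RMSE reported for the RF regression models above, using synthetic stand-ins for lidar-derived vegetation metrics; the variable names, coefficients, and units are assumptions for the example, not the study's metrics or results.

```python
# Hedged sketch: out-of-bag "pseudo R^2" and RMSE for a random forest regression
# of biomass on lidar-style height metrics (all variables are synthetic).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 800
mean_height = rng.uniform(0.2, 1.5, n)   # lidar-derived mean height (m)
sd_height = rng.uniform(0.05, 0.6, n)    # standard deviation of height (m)
cover = rng.uniform(0.05, 0.8, n)        # fractional vegetation cover
# Toy response in g/m^2, loosely mimicking shrub biomass.
biomass = 300 * mean_height + 400 * sd_height + 150 * cover \
          + rng.normal(0, 60, n)

X = np.column_stack([mean_height, sd_height, cover])
rf = RandomForestRegressor(n_estimators=1000, oob_score=True, random_state=0)
rf.fit(X, biomass)

oob_pred = rf.oob_prediction_
rmse = np.sqrt(np.mean((biomass - oob_pred) ** 2))
print(f"OOB pseudo R^2: {rf.oob_score_:.2f}, OOB RMSE: {rmse:.1f} g/m^2")
```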
Large-scale Estimates of Leaf Area Index from Active Remote Sensing Laser Altimetry
NASA Astrophysics Data System (ADS)
Hopkinson, C.; Mahoney, C.
2016-12-01
Leaf area index (LAI) is a key parameter that describes the spatial distribution of foliage within forest canopies, which in turn controls numerous relationships between the ground, canopy, and atmosphere. LAI retrieval has been demonstrated successfully with in-situ (digital) hemispherical photography (DHP) and airborne laser scanning (ALS) data; however, field and ALS acquisitions are often spatially limited (100's km2) and costly. Large-scale (>1000's km2) retrievals have been demonstrated by optical sensors; however, accuracies remain uncertain due to the sensor's inability to penetrate the canopy. The spaceborne Geoscience Laser Altimeter System (GLAS) provides a possible solution for large-scale retrievals whilst simultaneously penetrating the canopy. LAI values retrieved by multiple DHP at 6 Australian sites, representing a cross-section of Australian ecosystems, were employed to model ALS LAI, which in turn was used to infer LAI from GLAS data at 5 other sites. An optimally filtered GLAS dataset was then employed in conjunction with a host of supplementary data to build a Random Forest (RF) model to infer predictions (and uncertainties) of LAI at a 250 m resolution across the forested regions of Australia. Predictions were validated against ALS-based LAI from 20 sites (R2=0.64, RMSE=1.1 m2m-2); MODIS-based LAI were also assessed against these sites (R2=0.30, RMSE=1.78 m2m-2) to demonstrate the strength of GLAS-based predictions. The large-scale nature of the current predictions was also leveraged to demonstrate large-scale relationships of LAI with other environmental characteristics, such as canopy height, elevation, and slope. Such wide-scale quantification of LAI is key to assessing and modifying forest management strategies across Australia. This work also assists Australia's Terrestrial Ecosystem Research Network in fulfilling its government-issued mandates.
Ultrasonic RF time series for early assessment of the tumor response to chemotherapy.
Lin, Qingguang; Wang, Jianwei; Li, Qing; Lin, Chunyi; Guo, Zhixing; Zheng, Wei; Yan, Cuiju; Li, Anhua; Zhou, Jianhua
2018-01-05
Ultrasound radio-frequency (RF) time series have been shown to carry tissue typing information. To evaluate the potential of RF time series for early prediction of tumor response to chemotherapy, 50 MCF-7 breast cancer-bearing nude mice were randomized to receive cisplatin and paclitaxel (treatment group; n = 26) or sterile saline (control group; n = 24). Sequential ultrasound imaging was performed on days 0, 3, 6, and 8 of treatment to simultaneously collect B-mode images and RF data. Six RF time series features (slope, intercept, S1, S2, S3, and S4) were extracted during RF data analysis and contrasted with microstructural tumor changes on histopathology. Chemotherapy administration reduced tumor growth relative to control on days 6 and 8. Compared with day 0, intercept, S1, and S2 were increased while slope was decreased on days 3, 6, and 8 in the treatment group. Compared with the control group, intercept, S1, S2, S3, and S4 were increased, and slope was decreased, on days 3, 6, and 8 in the treatment group. Tumor cell density decreased significantly in the latter on day 3. We conclude that ultrasonic RF time series analysis provides a simple way to noninvasively assess the early tumor response to chemotherapy.
Radiofrequency Procedures to Relieve Chronic Knee Pain: An Evidence-Based Narrative Review.
Bhatia, Anuj; Peng, Philip; Cohen, Steven P
2016-01-01
Chronic knee pain from osteoarthritis or following arthroplasty is a common problem. A number of publications have reported analgesic success of radiofrequency (RF) procedures on nerves innervating the knee, but interpretation is hampered by a lack of clarity regarding indications, clinical protocols, targets, and longevity of benefit from RF procedures. We reviewed the following medical literature databases for publications on RF procedures on the knee joint for chronic pain: MEDLINE, EMBASE, Cochrane Central Register of Controlled Trials, Cochrane Database of Systematic Reviews, and Google Scholar up to August 9, 2015. Data on scores for pain, validated scores for measuring physical disability, and adverse effects measured at any timepoint after 1 month following the interventions were collected, analyzed, and reported in this narrative review. Thirteen publications on ablative or pulsed RF treatments of innervation of the knee joint were identified. A high success rate of these procedures in relieving chronic pain of the knee joint was reported at 1 to 12 months after the procedures, but only 2 of the publications were randomized controlled trials. There was evidence for improvement in function and a lack of serious adverse events of RF treatments. Radiofrequency treatments on the knee joint (major or periarticular nerve supply or intra-articular branches) have the potential to reduce pain from osteoarthritis or persistent postarthroplasty pain. Ongoing concerns regarding the quality, procedural aspects, and monitoring of outcomes in publications on this topic remain. Randomized controlled trials of high methodological quality are required to further elaborate the role of these interventions in this population.
NASA Astrophysics Data System (ADS)
Elmore, K. L.
2016-12-01
The Meteorological Phenomena Identification Near the Ground (mPING) project is an example of a crowd-sourced, citizen science effort to gather data of sufficient quality and quantity needed by new post-processing methods that use machine learning. Transportation and infrastructure are particularly sensitive to precipitation type in winter weather. We extract attributes from operational numerical forecast models and use them in a random forest to generate forecast winter precipitation types. We find that random forests applied to forecast soundings are effective at generating skillful forecasts of surface precipitation type (ptype), with considerably more skill than the current algorithms, especially for ice pellets and freezing rain. We also find that three very different forecast models yield similar overall results, showing that random forests are able to extract essentially equivalent information from different forecast models. We also show that the random forest for each model, and for each profile type, is unique to the particular forecast model, and that random forests developed using a particular model suffer significant degradation when given attributes derived from a different model. This implies that no single algorithm can perform well across all forecast models. Clearly, random forests extract information unavailable to "physically based" methods because the physical information in the models does not appear as we expect. One interesting result is that results from the classic "warm nose" sounding profile are, by far, the most sensitive to the particular forecast model, but this profile is also the one for which random forests are most skillful. Finally, a method for calibrating probabilities for each different ptype using multinomial logistic regression is shown.
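A minimal sketch of the two-stage idea described here, assuming synthetic sounding-derived attributes and generic class labels rather than the operational mPING feature set: a random forest produces ptype probabilities, and a multinomial logistic regression recalibrates them on held-out data.

```python
# Minimal sketch: RF precipitation-type probabilities recalibrated by multinomial
# logistic regression. All data and parameters here are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
raw_probs = rf.predict_proba(X_cal)             # uncalibrated class probabilities

calibrator = LogisticRegression(max_iter=1000)  # multinomial for multi-class targets
calibrator.fit(raw_probs, y_cal)                # maps RF probabilities to calibrated ones
calibrated = calibrator.predict_proba(raw_probs)
print(calibrated[:3])
```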
Concussion classification via deep learning using whole-brain white matter fiber strains
Cai, Yunliang; Wu, Shaoju; Zhao, Wei; Li, Zhigang; Wu, Zheyang; Ji, Songbai
2018-01-01
Developing an accurate and reliable injury predictor is central to the biomechanical studies of traumatic brain injury. State-of-the-art efforts continue to rely on empirical, scalar metrics based on kinematics or model-estimated tissue responses explicitly pre-defined in a specific brain region of interest. They could suffer from loss of information. A single training dataset has also been used to evaluate performance but without cross-validation. In this study, we developed a deep learning approach for concussion classification using implicit features of the entire voxel-wise white matter fiber strains. Using reconstructed American National Football League (NFL) injury cases, leave-one-out cross-validation was employed to objectively compare injury prediction performances against two baseline machine learning classifiers (support vector machine (SVM) and random forest (RF)) and four scalar metrics via univariate logistic regression (Brain Injury Criterion (BrIC), cumulative strain damage measure of the whole brain (CSDM-WB) and the corpus callosum (CSDM-CC), and peak fiber strain in the CC). Feature-based machine learning classifiers including deep learning, SVM, and RF consistently outperformed all scalar injury metrics across all performance categories (e.g., leave-one-out accuracy of 0.828–0.862 vs. 0.690–0.776, and .632+ error of 0.148–0.176 vs. 0.207–0.292). Further, deep learning achieved the best cross-validation accuracy, sensitivity, AUC, and .632+ error. These findings demonstrate the superior performances of deep learning in concussion prediction and suggest its promise for future applications in biomechanical investigations of traumatic brain injury. PMID:29795640
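A minimal sketch of the leave-one-out comparison protocol for the SVM and RF baselines, assuming synthetic stand-in features rather than the reconstructed fiber-strain data; classifier settings are defaults, not the study's tuned parameters.

```python
# Minimal sketch: leave-one-out cross-validation of SVM and RF baselines.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=58, n_features=50, n_informative=10, random_state=0)
loo = LeaveOneOut()
for name, clf in [("SVM", SVC(kernel="rbf")),
                  ("RF", RandomForestClassifier(n_estimators=200, random_state=0))]:
    acc = cross_val_score(clf, X, y, cv=loo).mean()  # mean over n held-out cases
    print(f"{name}: leave-one-out accuracy = {acc:.3f}")
```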
Calle, M. Luz; Rothman, Nathaniel; Urrea, Víctor; Kogevinas, Manolis; Petrus, Sandra; Chanock, Stephen J.; Tardón, Adonina; García-Closas, Montserrat; González-Neira, Anna; Vellalta, Gemma; Carrato, Alfredo; Navarro, Arcadi; Lorente-Galdós, Belén; Silverman, Debra T.; Real, Francisco X.; Wu, Xifeng; Malats, Núria
2013-01-01
The relationship between inflammation and cancer is well established in several tumor types, including bladder cancer. We performed an association study between 886 inflammatory-gene variants and bladder cancer risk in 1,047 cases and 988 controls from the Spanish Bladder Cancer (SBC)/EPICURO Study. A preliminary exploration with the widely used univariate logistic regression approach did not identify any significant SNP after correcting for multiple testing. We further applied two more comprehensive methods to capture the complexity of bladder cancer genetic susceptibility: Bayesian Threshold LASSO (BTL), a regularized regression method, and AUC-Random Forest, a machine-learning algorithm. Both approaches explore the joint effect of markers. BTL analysis identified a signature of 37 SNPs in 34 genes showing an association with bladder cancer. AUC-RF detected an optimal predictive subset of 56 SNPs. 13 SNPs were identified by both methods in the total population. Using resources from the Texas Bladder Cancer study, we were able to replicate 30% of the SNPs assessed. The associations between inflammatory SNPs and bladder cancer were reexamined among non-smokers to eliminate the effect of tobacco, one of the strongest and most prevalent environmental risk factors for this tumor. A 9-SNP signature was detected by BTL. Here we report, for the first time, a set of SNPs in inflammatory genes jointly associated with bladder cancer risk. These results highlight the importance of the complex structure of genetic susceptibility associated with cancer risk. PMID:24391818
NASA Astrophysics Data System (ADS)
Gibril, Mohamed Barakat A.; Idrees, Mohammed Oludare; Yao, Kouame; Shafri, Helmi Zulhaidi Mohd
2018-01-01
The growing use of optimization for geographic object-based image analysis and the possibility to derive a wide range of information about the image in textual form makes machine learning (data mining) a versatile tool for information extraction from multiple data sources. This paper presents an application of data mining for land-cover classification by fusing SPOT-6, RADARSAT-2, and derived datasets. First, the images and other derived indices (normalized difference vegetation index, normalized difference water index, and soil adjusted vegetation index) were combined and subjected to a segmentation process with optimal segmentation parameters obtained using a combination of spatial and Taguchi statistical optimization. The image objects, which carry all the attributes of the input datasets, were extracted and related to the target land-cover classes through data mining algorithms (decision tree) for classification. To evaluate the performance, the result was compared with two nonparametric classifiers: support vector machine (SVM) and random forest (RF). Furthermore, the decision tree classification result was evaluated against six unoptimized trials segmented using arbitrary parameter combinations. The results show that the optimized process produces better land-use land-cover classification, with overall classification accuracies of 91.79% for the optimized decision-tree classification, 87.25% for SVM, and 88.69% for RF, while the results of the six unoptimized classifications yield overall accuracies between 84.44% and 88.08%. The higher accuracy of the optimized data mining classification approach compared to the unoptimized results indicates that the optimization process has a significant impact on the classification quality.
Rachmadi, Muhammad Febrian; Valdés-Hernández, Maria Del C; Agan, Maria Leonora Fatimah; Di Perri, Carol; Komura, Taku
2018-06-01
We propose an adaptation of a convolutional neural network (CNN) scheme proposed for segmenting brain lesions with considerable mass-effect, to segment white matter hyperintensities (WMH) characteristic of brains with no or mild vascular pathology in routine clinical brain magnetic resonance images (MRI). This is a rather difficult segmentation problem because of the small area (i.e., volume) of the WMH and their similarity to non-pathological brain tissue. We investigate the effectiveness of the 2D CNN scheme by comparing its performance against those obtained from another deep learning approach: Deep Boltzmann Machine (DBM), two conventional machine learning approaches: Support Vector Machine (SVM) and Random Forest (RF), and a public toolbox: Lesion Segmentation Tool (LST), all reported to be useful for segmenting WMH in MRI. We also introduce a way to incorporate spatial information at the convolution level of the CNN for WMH segmentation, named global spatial information (GSI). Analysis of covariance corroborated known associations between WMH progression, as assessed by all methods evaluated, and demographic and clinical data. Deep learning algorithms outperform conventional machine learning algorithms by excluding MRI artefacts and pathologies that appear similar to WMH. Our proposed approach of incorporating GSI also successfully helped the CNN to achieve better automatic WMH segmentation regardless of the network settings tested. The mean Dice Similarity Coefficient (DSC) values for LST-LGA, SVM, RF, DBM, CNN and CNN-GSI were 0.2963, 0.1194, 0.1633, 0.3264, 0.5359 and 0.5389, respectively. Crown Copyright © 2018. Published by Elsevier Ltd. All rights reserved.
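The Dice Similarity Coefficient quoted above can be computed as in the following generic sketch; the random placeholder masks are assumptions, not the study's data or code.

```python
# Minimal sketch: Dice Similarity Coefficient between a predicted mask and a reference mask.
import numpy as np

def dice(pred, ref):
    pred, ref = pred.astype(bool), ref.astype(bool)
    denom = pred.sum() + ref.sum()
    return 2.0 * np.logical_and(pred, ref).sum() / denom if denom else 1.0

rng = np.random.default_rng(0)
pred = rng.random((64, 64, 32)) > 0.95   # placeholder predicted WMH voxels
ref = rng.random((64, 64, 32)) > 0.95    # placeholder reference WMH voxels
print(f"DSC = {dice(pred, ref):.4f}")
```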
Fusion of pixel and object-based features for weed mapping using unmanned aerial vehicle imagery
NASA Astrophysics Data System (ADS)
Gao, Junfeng; Liao, Wenzhi; Nuyttens, David; Lootens, Peter; Vangeyte, Jürgen; Pižurica, Aleksandra; He, Yong; Pieters, Jan G.
2018-05-01
The developments in the use of unmanned aerial vehicles (UAVs) and advanced imaging sensors provide new opportunities for ultra-high resolution (e.g., less than a 10 cm ground sampling distance (GSD)) crop field monitoring and mapping in precision agriculture applications. In this study, we developed a strategy for inter- and intra-row weed detection in early season maize fields from aerial visual imagery. More specifically, the Hough transform algorithm (HT) was applied to the orthomosaicked images for inter-row weed detection. A semi-automatic Object-Based Image Analysis (OBIA) procedure was developed with Random Forests (RF) combined with feature selection techniques to classify soil, weeds and maize. Furthermore, the two binary weed masks generated from HT and OBIA were fused to produce an accurate binary weed image. The developed RF classifier was evaluated by 5-fold cross-validation, obtaining an overall accuracy of 0.945 and a Kappa value of 0.912. Finally, the relationship between detected weeds and their ground truth densities was quantified by a fitted linear model with a coefficient of determination of 0.895 and a root mean square error of 0.026. In addition, the importance of the input features was evaluated, and the ratio of vegetation length to width was found to be the most significant feature for the classification model. Overall, our approach can yield a satisfactory weed map, and we expect that the obtained accurate and timely weed map from UAV imagery will be applicable to site-specific weed management (SSWM) in early season crop fields, reducing the spraying of non-selective herbicides and costs.
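A minimal sketch of the 5-fold cross-validation metrics quoted above (overall accuracy and Cohen's kappa), assuming synthetic object features in place of the OBIA attributes:

```python
# Minimal sketch: 5-fold cross-validated accuracy and kappa for an RF classifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=600, n_features=12, n_informative=6,
                           n_classes=3, random_state=0)  # e.g. soil / weeds / maize
rf = RandomForestClassifier(n_estimators=300, random_state=0)
y_pred = cross_val_predict(rf, X, y, cv=5)
print(f"overall accuracy = {accuracy_score(y, y_pred):.3f}")
print(f"kappa = {cohen_kappa_score(y, y_pred):.3f}")
```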
[Predictive model based multimetric index of macroinvertebrates for river health assessment].
Chen, Kai; Yu, Hai Yan; Zhang, Ji Wei; Wang, Bei Xin; Chen, Qiu Wen
2017-06-18
Improving the stability of the index of biotic integrity (IBI, i.e., multi-metric index, MMI) across temporal and spatial scales is one of the most important issues in aquatic ecosystem integrity bioassessment and water environment management. Using datasets of field-based macroinvertebrate and physicochemical variables and GIS-based natural predictors (e.g., geomorphology and climate) and land use variables collected at 227 river sites from 2004 to 2011 across the Zhejiang Province, China, we used random forests (RF) to adjust the effects of natural variations at temporal and spatial scales on macroinvertebrate metrics. We then developed natural-variation-adjusted (predictive) and unadjusted (null) MMIs and compared performance between them. The core metrics selected for the predictive and null MMIs were different from each other, and natural variation within the core metrics of the predictive MMI explained by RF models ranged between 11.4% and 61.2%. The predictive MMI was more precise and accurate, but less responsive and sensitive, than the null MMI. The multivariate nearest-neighbor test determined that 9 test sites and 1 most degraded site were flagged as outside the environmental space of the reference site network. We found that the combination of the predictive MMI, developed using the predictive model, and the nearest-neighbor test performed best and decreased the risks of inferring type I (designating a water body as being in poor biological condition when it was actually in good condition) and type II (designating a water body as being in good biological condition when it was actually in poor condition) errors. Our results provide an effective method to improve the stability and performance of the index of biotic integrity.
Majid, Abdul; Ali, Safdar; Iqbal, Mubashar; Kausar, Nabeela
2014-03-01
This study proposes a novel prediction approach for human breast and colon cancers using different feature spaces. The proposed scheme consists of two stages: the preprocessor and the predictor. In the preprocessor stage, the mega-trend diffusion (MTD) technique is employed to increase the samples of the minority class, thereby balancing the dataset. In the predictor stage, machine-learning approaches of K-nearest neighbor (KNN) and support vector machines (SVM) are used to develop hybrid MTD-SVM and MTD-KNN prediction models. MTD-SVM model has provided the best values of accuracy, G-mean and Matthew's correlation coefficient of 96.71%, 96.70% and 71.98% for cancer/non-cancer dataset, breast/non-breast cancer dataset and colon/non-colon cancer dataset, respectively. We found that hybrid MTD-SVM is the best with respect to prediction performance and computational cost. MTD-KNN model has achieved moderately better prediction as compared to hybrid MTD-NB (Naïve Bayes) but at the expense of higher computing cost. MTD-KNN model is faster than MTD-RF (random forest) but its prediction is not better than MTD-RF. To the best of our knowledge, the reported results are the best results, so far, for these datasets. The proposed scheme indicates that the developed models can be used as a tool for the prediction of cancer. This scheme may be useful for study of any sequential information such as protein sequence or any nucleic acid sequence. Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.
Mortazavi, S M J; Owji, S M; Shojaei-Fard, M B; Ghader-Panah, M; Mortazavi, S A R; Tavakoli-Golpayegani, A; Haghani, M; Taeb, S; Shokrpour, N; Koohi, O
2016-12-01
The rapidly increasing use of mobile phones has led to public concerns about possible health effects of these popular communication devices. This study is an attempt to investigate the effects of radiofrequency (RF) radiation produced by GSM mobile phones on insulin release in rats. Forty-two female adult Sprague Dawley rats were randomly divided into 4 groups. Group 1 was exposed to RF radiation for 6 hours per day for 7 days. Group 2 received sham exposure (6 hours per day for 7 days). Groups 3 and 4 received RF radiation for 3 hours per day for 7 days and sham exposure (3 hours per day), respectively. The specific absorption rate (SAR) of RF was 2.0 W/kg. Our results showed that RF radiation emitted from mobile phones did not alter insulin release in rats. However, mild to severe inflammatory changes in the portal spaces of the liver of the rats, as well as damage to the cells of the islets of Langerhans, were observed. These changes were linked with the duration of the exposures. RF exposure can induce inflammatory changes in the liver as well as causing damage to the cells of the islets of Langerhans.
The effect of MRET polymer compound on SAR values of RF phones.
Smirnov, Igor
2008-01-01
This article is related to the proposed hypothesis and experimental data regarding the ability of defined polar polymer compound (MRET polymer) applied to RF phones to increase the dielectric permittivity of water based solutions and to reduce the SAR (Specific Absorption Rate) values inside the "phantom head" filled with the jelly simulating muscle and brain tissues. Due to the high organizational state of fractal structures of MRET polymer compounds and the phenomenon of piezoelectricity, this polymer generates specific subtle, low frequency, non-coherent electromagnetic oscillations (optimal random field) that can affect the hydrogen lattice of the molecular structure of water and subsequently modify the electrodynamic properties of water. The increase of dielectric permittivity of water finally leads to the reduction of the absorption rate of the electromagnetic field by living tissue. The reduction of SAR values is confirmed by the research conducted in June - July of 2006 at RF Exposure Laboratory in Escondido, California. This test also confirmed that the application of MRET polymer to RF phones does not significantly affect the air measurements of RF phone signals, and subsequently does not lead to any significant distortion of transmitted RF signals.
Wang, Chao; Gao, Qiong; Wang, Xian; Yu, Mei
2015-01-01
Land use land cover (LULC) changes frequently in ecotones due to the large climate and soil gradients, and complex landscape composition and configuration. Accurate mapping of LULC changes in ecotones is of great importance for assessment of ecosystem functions/services and policy-decision support. Decadal or sub-decadal mapping of LULC provides scenarios for modeling biogeochemical processes and their feedbacks to climate, and for evaluating the effectiveness of land-use policies, e.g., forest conversion. However, it remains a great challenge to produce reliable LULC maps at moderate resolution and to evaluate their uncertainties over large areas with complex landscapes. In this study we developed a robust LULC classification system using multiple classifiers based on MODIS (Moderate Resolution Imaging Spectroradiometer) data and posterior data fusion. Not only does the system create LULC maps with high statistical accuracy, but it also provides pixel-level uncertainties that are essential for subsequent analyses and applications. We applied the classification system to the Agro-pasture transition band in northern China (APTBNC) to detect the decadal changes in LULC during 2003-2013 and evaluated the effectiveness of the implementation of major Key Forestry Programs (KFPs). In our study, the random forest (RF), support vector machine (SVM), and weighted k-nearest neighbors (WKNN) classifiers outperformed the artificial neural networks (ANN) and naive Bayes (NB) in terms of high classification accuracy and low sensitivity to training sample size. The Bayesian-average data fusion based on the results of RF, SVM, and WKNN achieved a Kappa statistic of 87.5%, higher than any individual classifier and the majority-vote integration. The pixel-level uncertainty map agreed with the traditional accuracy assessment. However, it conveys the spatial variation of uncertainty. Specifically, it pinpoints that the southwestern area of APTBNC has higher uncertainty than other parts of the region, and that the open shrubland is likely to be misclassified as bare ground in some locations. Forests, closed shrublands, and grasslands in APTBNC expanded by 23%, 50%, and 9%, respectively, during 2003-2013. The expansion of these land cover types is compensated by shrinkage in croplands (20%), bare ground (15%), and open shrublands (30%). The significant decline in agricultural lands is primarily attributed to the KFPs implemented at the end of the last century and the nationwide urbanization of the recent decade. The increased coverage of grass and woody plants would largely reduce soil erosion, improve mitigation of climate change, and enhance carbon sequestration in this region.
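A minimal sketch of posterior-probability fusion with pixel-level uncertainty, assuming a simple unweighted average of the three classifiers' class posteriors as a stand-in for the Bayesian-average rule, and synthetic features in place of MODIS observations:

```python
# Minimal sketch: fusing RF, SVM, and weighted k-NN class posteriors per pixel/sample.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1500, n_features=10, n_informative=6,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = [RandomForestClassifier(n_estimators=300, random_state=0),
          SVC(probability=True, random_state=0),
          KNeighborsClassifier(n_neighbors=7, weights="distance")]
posteriors = np.mean([m.fit(X_tr, y_tr).predict_proba(X_te) for m in models], axis=0)

labels = posteriors.argmax(axis=1)           # fused LULC class per sample
uncertainty = 1.0 - posteriors.max(axis=1)   # higher value = less confident sample
print("fused accuracy:", (labels == y_te).mean())
```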
Wright, Marvin N; Dankowski, Theresa; Ziegler, Andreas
2017-04-15
The most popular approach for analyzing survival data is the Cox regression model. The Cox model may, however, be misspecified, and its proportionality assumption may not always be fulfilled. An alternative approach for survival prediction is random forests for survival outcomes. The standard split criterion for random survival forests is the log-rank test statistic, which favors splitting variables with many possible split points. Conditional inference forests avoid this split variable selection bias. However, linear rank statistics are utilized by default in conditional inference forests to select the optimal splitting variable, which cannot detect non-linear effects in the independent variables. An alternative is to use maximally selected rank statistics for the split point selection. As in conditional inference forests, splitting variables are compared on the p-value scale. However, instead of the conditional Monte-Carlo approach used in conditional inference forests, p-value approximations are employed. We describe several p-value approximations and the implementation of the proposed random forest approach. A simulation study demonstrates that unbiased split variable selection is possible. However, there is a trade-off between unbiased split variable selection and runtime. In benchmark studies of prediction performance on simulated and real datasets, the new method performs better than random survival forests if informative dichotomous variables are combined with uninformative variables with more categories and better than conditional inference forests if non-linear covariate effects are included. In a runtime comparison, the method proves to be computationally faster than both alternatives, if a simple p-value approximation is used. Copyright © 2017 John Wiley & Sons, Ltd.
Nasejje, Justine B; Mwambi, Henry; Dheda, Keertan; Lesosky, Maia
2017-07-28
Random survival forest (RSF) models have been identified as alternative methods to the Cox proportional hazards model in analysing time-to-event data. These methods, however, have been criticised for the bias that results from favouring covariates with many split-points, and hence conditional inference forests for time-to-event data have been suggested. Conditional inference forests (CIF) are known to correct the bias in RSF models by separating the procedure for selecting the best covariate to split on from that of searching for the best split point for the selected covariate. In this study, we compare the random survival forest model to the conditional inference forest model (CIF) using twenty-two simulated time-to-event datasets. We also analysed two real time-to-event datasets. The first dataset is based on the survival of children under five years of age in Uganda and consists of categorical covariates, most of them having more than two levels (many split-points). The second dataset is based on the survival of patients with extremely drug resistant tuberculosis (XDR TB) and consists of mainly categorical covariates with two levels (few split-points). The study findings indicate that the conditional inference forest model is superior to random survival forest models in analysing time-to-event data consisting of covariates with many split-points, based on the values of the bootstrap cross-validated estimates of integrated Brier scores. However, conditional inference forests perform comparably to random survival forest models in analysing time-to-event data consisting of covariates with fewer split-points. Although survival forests are promising methods for analysing time-to-event data, it is important to identify the best forest model for analysis based on the nature of the covariates of the dataset in question.
SNP selection and classification of genome-wide SNP data using stratified sampling random forests.
Wu, Qingyao; Ye, Yunming; Liu, Yang; Ng, Michael K
2012-09-01
For high dimensional genome-wide association (GWA) case-control data of complex disease, there is usually a large portion of single-nucleotide polymorphisms (SNPs) that are irrelevant to the disease. A simple random sampling method in random forests, using the default mtry parameter to choose the feature subspace, will select too many subspaces without informative SNPs. An exhaustive search for an optimal mtry is often required in order to include useful and relevant SNPs and discard the vast number of non-informative SNPs. However, it is too time-consuming and not favorable in GWA for high-dimensional data. The main aim of this paper is to propose a stratified sampling method for feature subspace selection to generate decision trees in a random forest for GWA high-dimensional data. Our idea is to design an equal-width discretization scheme for informativeness to divide SNPs into multiple groups. In feature subspace selection, we randomly select the same number of SNPs from each group and combine them to form a subspace to generate a decision tree. This stratified sampling procedure ensures that each subspace contains enough useful SNPs, avoids the very high computational cost of an exhaustive search for an optimal mtry, and maintains the randomness of a random forest. We employ two genome-wide SNP data sets (Parkinson case-control data comprising 408,803 SNPs and Alzheimer case-control data comprising 380,157 SNPs) to demonstrate that the proposed stratified sampling method is effective, and that it can generate better random forests with higher accuracy and a lower error bound than those produced by Breiman's random forest generation method. For the Parkinson data, we also show some interesting genes identified by the method that may be associated with neurological disorders and warrant further biological investigation.
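A minimal sketch of the stratified subspace sampling idea, assuming a chi-square statistic as the per-SNP informativeness score and five equal-width strata; the group count, stratum sizes, and toy genotype matrix are illustrative assumptions, not the paper's exact scheme.

```python
# Minimal sketch: equal-width discretization of SNP informativeness and stratified
# sampling of one feature subspace for a single decision tree of the forest.
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(500, 10000))   # toy genotypes coded 0/1/2
y = rng.integers(0, 2, size=500)            # case/control labels

scores, _ = chi2(X, y)                                    # per-SNP informativeness
edges = np.linspace(scores.min(), scores.max(), num=6)    # 5 equal-width groups
groups = np.digitize(scores, edges[1:-1])                 # group index 0..4 per SNP

per_group = 20                              # SNPs drawn from each stratum
subspace = np.concatenate([
    rng.choice(np.where(groups == g)[0],
               size=min(per_group, int((groups == g).sum())), replace=False)
    for g in range(5) if (groups == g).any()])
# `subspace` indexes the SNPs used to grow one tree; repeating this per tree keeps
# every subspace populated with informative SNPs while preserving randomness.
print(subspace.shape)
```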
Teh, Seng Khoon; Zheng, Wei; Lau, David P; Huang, Zhiwei
2009-06-01
In this work, we evaluated the diagnostic ability of near-infrared (NIR) Raman spectroscopy associated with the ensemble recursive partitioning algorithm based on random forests for identifying cancer from normal tissue in the larynx. A rapid-acquisition NIR Raman system was utilized for tissue Raman measurements at 785 nm excitation, and 50 human laryngeal tissue specimens (20 normal; 30 malignant tumors) were used for NIR Raman studies. The random forests method was introduced to develop effective diagnostic algorithms for classification of Raman spectra of different laryngeal tissues. High-quality Raman spectra in the range of 800-1800 cm(-1) can be acquired from laryngeal tissue within 5 seconds. Raman spectra differed significantly between normal and malignant laryngeal tissues. Classification results obtained from the random forests algorithm on tissue Raman spectra yielded a diagnostic sensitivity of 88.0% and specificity of 91.4% for laryngeal malignancy identification. The random forests technique also provided variable importance measures that facilitate correlation of significant Raman spectral features with cancer transformation. This study shows that NIR Raman spectroscopy in conjunction with the random forests algorithm has great potential for the rapid diagnosis and detection of malignant tumors in the larynx.
Li, Shufeng; Li, Hongli; Mingyan, E; Yu, Bo
2009-02-01
The development of pulmonary vein stenosis has recently been described after radiofrequency ablation (RF) to treat atrial fibrillation (AF). The purpose of this study was to examine expression of TGFbeta1 in pulmonary vein stenosis after radiofrequency ablation in chronic atrial fibrillation of dogs. Twenty-eight mongrel dogs were randomly assigned to the sham-operated group (n = 7), the AF group (n = 7), the AF + RF group (n = 7), and the RF group (n = 7). In the AF and AF + RF groups, dogs underwent chronic pulmonary vein (PV) pacing to induce sustained AF. RF ablation was applied around the PVs until electrical activity was eliminated. Histological assessment of pulmonary veins was performed using hematoxylin and eosin staining; TGFbeta1 gene expression in pulmonary veins was examined by RT-PCR analysis; expression of TGFbeta1 protein in pulmonary veins was assessed by Western blot analysis. Rapid pacing from the left superior pulmonary vein (LSPV) induced sustained AF in the AF group and the AF + RF group. Pulmonary vein ablation terminated the chronic atrial fibrillation in dogs. Histological examination revealed necrotic tissues in various stages of collagen replacement, intimal thickening, and cartilaginous metaplasia with chondroblasts and chondroclasts. Compared with the sham-operated and AF groups, TGFbeta1 gene and protein expression was increased in the AF + RF and RF groups. It was concluded that TGFbeta1 might be associated with pulmonary vein stenosis after radiofrequency ablation in chronic atrial fibrillation of dogs.
Dziuba, Bartłomiej; Dziuba, Marta
2014-08-20
New peptides with potential antimicrobial activity, encrypted in milk protein sequences, were searched for with the use of bioinformatic tools. The major milk proteins were hydrolyzed in silico by 28 enzymes. The obtained peptides were characterized by the following parameters: molecular weight, isoelectric point, composition and number of amino acid residues, net charge at pH 7.0, aliphatic index, instability index, Boman index, and GRAVY index, and compared with those calculated for 416 known antimicrobial peptides, including 59 antimicrobial peptides (AMPs) from milk proteins listed in the BIOPEP database. A simple analysis of physico-chemical properties and the values of biological activity indicators were insufficient to select potentially antimicrobial peptides released in silico from milk proteins by proteolytic enzymes. The final selection was made based on the results of multidimensional statistical analyses such as support vector machines (SVM), random forest (RF), artificial neural networks (ANN) and discriminant analysis (DA) available in the Collection of Anti-Microbial Peptides (CAMP) database. Eleven new peptides with potential antimicrobial activity were selected from all peptides released during in silico proteolysis of milk proteins.
Gao, Yu-Fei; Li, Bi-Qing; Cai, Yu-Dong; Feng, Kai-Yan; Li, Zhan-Dong; Jiang, Yang
2013-01-27
Identification of catalytic residues plays a key role in understanding how enzymes work. Although numerous computational methods have been developed to predict catalytic residues and active sites, the prediction accuracy remains relatively low with high false positives. In this work, we developed a novel predictor based on the Random Forest algorithm (RF) aided by the maximum relevance minimum redundancy (mRMR) method and incremental feature selection (IFS). We incorporated features of physicochemical/biochemical properties, sequence conservation, residual disorder, secondary structure and solvent accessibility to predict active sites of enzymes and achieved an overall accuracy of 0.885687 and MCC of 0.689226 on an independent test dataset. Feature analysis showed that every category of the features except disorder contributed to the identification of active sites. It was also shown via the site-specific feature analysis that the features derived from the active site itself contributed most to the active site determination. Our prediction method may become a useful tool for identifying the active sites and the key features identified by the paper may provide valuable insights into the mechanism of catalysis.
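A minimal sketch of the incremental feature selection (IFS) loop described here: features are ranked (below by mutual information as a simple relevance proxy; true mRMR additionally penalizes redundancy between features) and an RF classifier is scored on progressively larger feature prefixes. The dataset, prefix step, and forest size are illustrative assumptions.

```python
# Minimal sketch: ranked features + incremental feature selection with an RF classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=60, n_informative=12, random_state=0)
order = np.argsort(mutual_info_classif(X, y, random_state=0))[::-1]  # ranked feature list

scores = []
for k in range(5, 61, 5):                       # grow the feature set incrementally
    subset = order[:k]
    acc = cross_val_score(RandomForestClassifier(n_estimators=200, random_state=0),
                          X[:, subset], y, cv=5).mean()
    scores.append((k, acc))
best_k, best_acc = max(scores, key=lambda t: t[1])
print(f"best prefix size = {best_k}, cross-validated accuracy = {best_acc:.3f}")
```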
Dabbah, M A; Graham, J; Petropoulos, I N; Tavakoli, M; Malik, R A
2011-10-01
Diabetic peripheral neuropathy (DPN) is one of the most common long term complications of diabetes. Corneal confocal microscopy (CCM) image analysis is a novel non-invasive technique which quantifies corneal nerve fibre damage and enables diagnosis of DPN. This paper presents an automatic analysis and classification system for detecting nerve fibres in CCM images based on a multi-scale adaptive dual-model detection algorithm. The algorithm exploits the curvilinear structure of the nerve fibres and adapts itself to the local image information. Detected nerve fibres are then quantified and used as feature vectors for classification using random forest (RF) and neural networks (NNT) classifiers. We show, in a comparative study with other well known curvilinear detectors, that the best performance is achieved by the multi-scale dual model in conjunction with the NNT classifier. An evaluation of clinical effectiveness shows that the performance of the automated system matches that of ground-truth defined by expert manual annotation. Copyright © 2011 Elsevier B.V. All rights reserved.
Multiple-Primitives Hierarchical Classification of Airborne Laser Scanning Data in Urban Areas
NASA Astrophysics Data System (ADS)
Ni, H.; Lin, X. G.; Zhang, J. X.
2017-09-01
A hierarchical classification method for Airborne Laser Scanning (ALS) data of urban areas is proposed in this paper. This method is composed of three stages in which three types of primitives are utilized, i.e., smooth surfaces, rough surfaces, and individual points. In the first stage, the input ALS data is divided into smooth surfaces and rough surfaces by employing a step-wise point cloud segmentation method. In the second stage, classification based on smooth surfaces and rough surfaces is performed. Points in the smooth surfaces are first classified into ground and buildings based on semantic rules. Next, features of rough surfaces are extracted. Then, points in rough surfaces are classified into vegetation and vehicles based on the derived features and Random Forests (RF). In the third stage, point-based features are extracted for the ground points, and then an individual point classification procedure is performed to classify the ground points into bare land, artificial ground and greenbelt. Moreover, the shortcomings of the existing studies are analyzed, and experiments show that the proposed method overcomes these shortcomings and handles more types of objects.
Ma, Tao; Wang, Fen; Cheng, Jianjun; Yu, Yang; Chen, Xiaoyun
2016-01-01
The development of intrusion detection systems (IDS) that are adapted to allow routers and network defence systems to detect malicious network traffic disguised as network protocols or normal access is a critical challenge. This paper proposes a novel approach called SCDNN, which combines spectral clustering (SC) and deep neural network (DNN) algorithms. First, the dataset is divided into k subsets based on sample similarity using cluster centres, as in SC. Next, the distance between data points in a testing set and the training set is measured based on similarity features and is fed into the deep neural network algorithm for intrusion detection. Six KDD-Cup99 and NSL-KDD datasets and a sensor network dataset were employed to test the performance of the model. These experimental results indicate that the SCDNN classifier not only performs better than backpropagation neural network (BPNN), support vector machine (SVM), random forest (RF) and Bayes tree models in detection accuracy and in the types of abnormal attacks found, but also provides an effective tool for the study and analysis of intrusion detection in large networks. PMID:27754380
IoT/M2M wearable-based activity-calorie monitoring and analysis for elders.
Soraya, Sabrina I; Ting-Hui Chiang; Guo-Jing Chan; Yi-Juan Su; Chih-Wei Yi; Yu-Chee Tseng; Yu-Tai Ching
2017-07-01
With the growth of the aging population, elder care has become an important part of the Internet of Things service industry. Activity monitoring is one of the most important services in the field of elderly care. In this paper, we propose a wearable solution that provides caregivers with an activity monitoring service for elders. The system uses wireless signals to estimate calories burned by walking and to perform localization. In addition, it uses wireless motion sensors to recognize physical activities, such as drinking and restroom activity. Overall, the system can be divided into four parts: wearable device, gateway, cloud server, and the caregiver's Android application. The algorithms we propose for drinking activity are Decision Tree (J48) and Random Forest (RF), while for restroom activity we propose a supervised Reduced Error Pruning (REP) Tree and a Variable Order Hidden Markov Model (VOHMM). We developed a prototype Android app that provides a life log recording the activity sequence, which is useful for caregivers to monitor elders' activity and the associated calorie consumption.
NASA Astrophysics Data System (ADS)
Gao, Rong; Cheng, Jianhua; Fan, Chunlei; Shi, Xiaofeng; Cao, Yuan; Sun, Bo; Ding, Huiguo; Hu, Chengjin; Dong, Fangting; Yan, Xianzhong
2015-12-01
Hepatocellular carcinoma (HCC) is a common malignancy that has region-specific etiologies. Unfortunately, 85% of cases of HCC are diagnosed at an advanced stage. Reliable biomarkers for the early diagnosis of HCC are urgently required to reduce mortality and therapeutic expenditure. We established a non-targeted gas chromatography-time of flight-mass spectrometry (GC-TOFMS) metabolomics method in conjunction with Random Forests (RF) analysis based on 201 serum samples from healthy controls (NC), hepatitis B virus (HBV), liver cirrhosis (LC) and HCC patients to explore the metabolic characteristics in the progression of hepatocellular carcinogenesis. Ultimately, 15 metabolites intimately associated with the process were identified. Phenylalanine, malic acid and 5-methoxytryptamine for HBV vs. NC, palmitic acid for LC vs. HBV, and asparagine and β-glutamate for HCC vs. LC were screened as liver disease-specific potential biomarkers with excellent discriminant performance. All the metabolic perturbations in these liver diseases are associated with pathways for energy metabolism, macromolecular synthesis, and maintaining the redox balance to protect tumor cells from oxidative stress.
Developing a radiomics framework for classifying non-small cell lung carcinoma subtypes
NASA Astrophysics Data System (ADS)
Yu, Dongdong; Zang, Yali; Dong, Di; Zhou, Mu; Gevaert, Olivier; Fang, Mengjie; Shi, Jingyun; Tian, Jie
2017-03-01
Patient-targeted treatment of non-small cell lung carcinoma (NSCLC) has been well documented according to the histologic subtypes over the past decade. In parallel, the recent development of quantitative image biomarkers has been highlighted as an important diagnostic tool to facilitate histological subtype classification. In this study, we present a radiomics analysis that classifies adenocarcinoma (ADC) and squamous cell carcinoma (SqCC). We extract 52-dimensional, CT-based features (7 statistical features and 45 image texture features) to represent each nodule. We evaluate our approach on a clinical dataset including 324 ADC and 110 SqCC patients with CT image scans. Classification of these features is performed with four different machine-learning classifiers including Support Vector Machines with Radial Basis Function kernel (RBF-SVM), Random Forest (RF), K-nearest neighbor (KNN), and RUSBoost algorithms. To improve the classifiers' performance, an optimal feature subset is selected from the original feature set using an iterative forward-inclusion and backward-elimination algorithm. Extensive experimental results demonstrate that radiomics features achieve encouraging classification results on both the complete feature set (AUC=0.89) and the optimal feature subset (AUC=0.91).
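A minimal sketch of the forward-inclusion half of such a feature selection step, scored by AUC with an RF classifier via scikit-learn's SequentialFeatureSelector; the backward-elimination pass, the synthetic dataset, and all parameter choices below are illustrative assumptions, not the paper's algorithm or results.

```python
# Minimal sketch: forward feature selection scored by AUC for an RF radiomics classifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=434, n_features=52, n_informative=10,
                           weights=[0.75, 0.25], random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0)
sfs = SequentialFeatureSelector(rf, n_features_to_select=8, direction="forward",
                                scoring="roc_auc", cv=3).fit(X, y)
auc = cross_val_score(rf, sfs.transform(X), y, scoring="roc_auc", cv=3).mean()
print(f"AUC on selected subset = {auc:.2f}")
```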
Voice based gender classification using machine learning
NASA Astrophysics Data System (ADS)
Raahul, A.; Sapthagiri, R.; Pankaj, K.; Vijayarajan, V.
2017-11-01
Gender identification is one of the major problems in speech analysis today; it involves tracing the gender from acoustic data such as pitch, median frequency, etc. Machine learning gives promising results for classification problems across research domains. There are several performance metrics available to evaluate algorithms in this area. Our comparative model evaluates five different machine learning algorithms on eight different metrics for gender classification from acoustic data. The goal is to identify gender with five different algorithms: Linear Discriminant Analysis (LDA), K-Nearest Neighbour (KNN), Classification and Regression Trees (CART), Random Forest (RF), and Support Vector Machine (SVM), on the basis of eight different metrics. The main criterion in evaluating any algorithm is its performance. In classification problems the misclassification rate must be low, which means that the accuracy rate must be high. Location and gender of a person have become very crucial in economic markets in the form of AdSense. With this comparative model, we assess the different ML algorithms and find the best fit for gender classification of acoustic data.
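A minimal sketch of such a comparison, assuming synthetic acoustic features and a subset of generic scikit-learn metrics in place of the paper's eight; model hyperparameters are defaults, not tuned values.

```python
# Minimal sketch: comparing LDA, KNN, CART, RF, and SVM on several metrics via 5-fold CV.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=8, random_state=0)
models = {"LDA": LinearDiscriminantAnalysis(), "KNN": KNeighborsClassifier(),
          "CART": DecisionTreeClassifier(random_state=0),
          "RF": RandomForestClassifier(n_estimators=200, random_state=0),
          "SVM": SVC(random_state=0)}
metrics = ["accuracy", "precision", "recall", "f1", "roc_auc"]
for name, model in models.items():
    cv = cross_validate(model, X, y, cv=5, scoring=metrics)
    print(name, {m: round(cv[f"test_{m}"].mean(), 3) for m in metrics})
```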
Adapting GNU random forest program for Unix and Windows
NASA Astrophysics Data System (ADS)
Jirina, Marcel; Krayem, M. Said; Jirina, Marcel, Jr.
2013-10-01
The Random Forest is a well-known method and also a program for data clustering and classification. Unfortunately, the original Random Forest program is rather difficult to use. Here we describe a new version of this program, originally written in Fortran 77. The modified program in Fortran 95 needs to be compiled only once, and information for different tasks is passed with the help of arguments. The program was tested with 24 data sets from the UCI MLR, and the results are available on the net.
Developing Data-driven models for quantifying Cochlodinium polykrikoides in Coastal Waters
NASA Astrophysics Data System (ADS)
Kwon, Yongsung; Jang, Eunna; Im, Jungho; Baek, Seungho; Park, Yongeun; Cho, Kyunghwa
2017-04-01
Harmful algal blooms are a worldwide problem because they pose serious dangers to human health and aquatic ecosystems. In particular, fish-killing red tide blooms of the dinoflagellate Cochlodinium polykrikoides (C. polykrikoides) have caused critical damage to mariculture in Korean coastal waters. In this work, multiple linear regression (MLR), regression tree (RT), and random forest (RF) models were constructed and applied to estimate C. polykrikoides blooms in coastal waters. Five different types of input datasets were used to test the performance of the three models. To train and validate the three models, observed numbers of C. polykrikoides cells from the National Institute of Fisheries Science (NIFS) and remote sensing reflectance data from Geostationary Ocean Color Imager (GOCI) images for the 3 years from 2013 to 2015 were used. The RT model showed the best prediction performance when 4 bands and 3 band ratios were used as input data simultaneously. Results obtained from iterative model development with randomly chosen input data indicated that the recognition of patterns in the training data caused variation in prediction performance. This work provides useful tools for reliably estimating the number of C. polykrikoides cells from reasonable input reflectance datasets in coastal waters. It is expected that the RT model can be easily accessed and manipulated by administrators and decision-makers working with coastal waters.
NASA Astrophysics Data System (ADS)
Lian, Jianyu
In this work, a modification of the cosine current distribution rf coil, PCOS, has been introduced and tested. The coil produces a very homogeneous rf magnetic field, and it is inexpensive to build and easy to tune for multiple resonance frequencies. The geometrical parameters of the coil are optimized to produce the most homogeneous rf field over a large volume. To avoid rf field distortion when the coil length is comparable to a quarter wavelength, a parallel PCOS coil is proposed and discussed. For testing rf coils and correcting B_1 in NMR experiments, a simple, rugged and accurate NMR rf field mapping technique has been developed. The method has been tested and used in 1D, 2D, 3D and in vivo rf mapping experiments. The method has been proven to be very useful in the design of rf coils. To preserve the linear relation between the rf output applied to an rf coil and the modulating input for an rf modulating-amplifying system of an NMR imaging spectrometer, a quadrature feedback loop is employed in an rf modulator with two orthogonal rf channels to correct the amplitude and phase non-linearities caused by the rf components in the rf system. The modulator is very linear over a large range and it can generate an arbitrary rf shape. A diffusion imaging sequence has been developed for measuring and imaging diffusion in the presence of background gradients. Cross terms between the diffusion sensitizing gradients and background gradients or imaging gradients can complicate diffusion measurement and make the interpretation of NMR diffusion data ambiguous, but these have been eliminated in this method. Further, the background gradients have been measured and imaged. A dipole random distribution model has been established to study background magnetic fields Delta B and background magnetic gradients G_0 produced by small particles in a sample when it is in a B_0 field. From this model, the minimum distance that a spin can approach a particle can be determined by measuring
Mapping Deforestation area in North Korea Using Phenology-based Multi-Index and Random Forest
NASA Astrophysics Data System (ADS)
Jin, Y.; Sung, S.; Lee, D. K.; Jeong, S.
2016-12-01
Forest ecosystems provide ecological benefits to both humans and wildlife. Growing global demand for food and fiber is accelerating the pressure on forest ecosystems worldwide from agriculture and logging. Recently, North Korea lost almost 40% of its forests to crop fields for food production and to the cutting of forest for fuel wood between 1990 and 2015. This led to increased damage caused by natural disasters, and the country is known to be one of the most forest-degraded areas in the world. The forest landscape in North Korea is complex and heterogeneous; the major landscape types in the forest are hillside farmland, unstocked forest, natural forest and plateau vegetation. Remote sensing can be used to map forest degradation in a dynamic landscape at a broad scale of detail and spatial distribution. Confusion mostly occurred between hillside farmland and unstocked forest, but also between unstocked forest and forest. Most previous forest degradation studies focused on the classification of broad types such as deforested areas and sand from the perspective of land cover classification. The objective of this study is to use random forest to map degraded forest in North Korea with phenology-based indices derived from MODIS products, which cover various environmental factors such as vegetation, soil and water, at a regional scale to improve accuracy. The random forest model achieved an overall accuracy of 91.44%. User's accuracies for hillside farmland and unstocked forest, the classes that indicate degraded forest, were 97.2% and 84%, respectively. Unstocked forest had a relatively low user's accuracy due to misclassified hillside farmland and forest samples. Producer's accuracies for hillside farmland and unstocked forest were 85.2% and 93.3%, respectively. In this case hillside farmland had a lower producer's accuracy, mainly due to confusion with field, unstocked forest and forest. Such a classification of degraded forest could supply essential information for deciding the priority of forest management and restoration in degraded forest areas.
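A minimal sketch of how the user's and producer's accuracies quoted above are obtained from a confusion matrix; the class list and randomly perturbed labels are placeholders, not the study's reference data.

```python
# Minimal sketch: per-class user's and producer's accuracy from a confusion matrix.
import numpy as np
from sklearn.metrics import confusion_matrix

classes = ["hillside farmland", "unstocked forest", "forest", "plateau vegetation"]
rng = np.random.default_rng(0)
y_true = rng.integers(0, 4, 500)
y_pred = np.where(rng.random(500) < 0.9, y_true, rng.integers(0, 4, 500))  # toy labels

cm = confusion_matrix(y_true, y_pred)      # rows = reference classes, columns = mapped classes
producers = np.diag(cm) / cm.sum(axis=1)   # correct / reference totals
users = np.diag(cm) / cm.sum(axis=0)       # correct / mapped totals
for c, u, p in zip(classes, users, producers):
    print(f"{c}: user's accuracy {u:.1%}, producer's accuracy {p:.1%}")
```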
Jeffrey T. Walton
2008-01-01
Three machine learning subpixel estimation methods (Cubist, Random Forests, and support vector regression) were applied to estimate urban cover. Urban forest canopy cover and impervious surface cover were estimated from Landsat-7 ETM+ imagery using a higher resolution cover map resampled to 30 m as training and reference data. Three different band combinations (...
Monte-Carlo Orbit/Full Wave Simulation of Fast Alfvén Wave (FW) Damping on Resonant Ions in Tokamaks
NASA Astrophysics Data System (ADS)
Choi, M.; Chan, V. S.; Tang, V.; Bonoli, P.; Pinsker, R. I.; Wright, J.
2005-09-01
To simulate the resonant interaction of fast Alfvén wave (FW) heating and Coulomb collisions on energetic ions, including finite orbit effects, a Monte-Carlo code ORBIT-RF has been coupled with a 2D full wave code TORIC4. ORBIT-RF solves Hamiltonian guiding center drift equations to follow trajectories of test ions in 2D axisymmetric numerical magnetic equilibrium under Coulomb collisions and ion cyclotron radio frequency quasi-linear heating. Monte-Carlo operators for pitch-angle scattering and drag calculate the changes of test ions in velocity and pitch angle due to Coulomb collisions. A rf-induced random walk model describing fast ion stochastic interaction with FW reproduces quasi-linear diffusion in velocity space. FW fields and its wave numbers from TORIC are passed on to ORBIT-RF to calculate perpendicular rf kicks of resonant ions valid for arbitrary cyclotron harmonics. ORBIT-RF coupled with TORIC using a single dominant toroidal and poloidal wave number has demonstrated consistency of simulations with recent DIII-D FW experimental results for interaction between injected neutral-beam ions and FW, including measured neutron enhancement and enhanced high energy tail. Comparison with C-Mod fundamental heating discharges also yielded reasonable agreement.
ERIC Educational Resources Information Center
Levy, Kenneth N.; Meehan, Kevin B.; Kelly, Kristen M.; Reynoso, Joseph S.; Weber, Michal; Clarkin, John F.; Kernberg, Otto F.
2006-01-01
Changes in attachment organization and reflective function (RF) were assessed as putative mechanisms of change in 1 of 3 year-long psychotherapy treatments for patients with borderline personality disorder (BPD). Ninety patients reliably diagnosed with BPD were randomized to transference-focused psychotherapy (TFP), dialectical behavior…
Polsky, Sarit; Beck, Jimikaye; Stark, Rebecca A.; Pan, Zhaoxing; Hill, James O.; Peters, John C.
2014-01-01
Adults often consume more fat than is recommended. We examined factors that may improve liking of reduced fat and reduced saturated fat foods, including the addition of herbs and spices and habitual consumption of different high-fat and low-fat food items. We randomized adults to taste three different conditions: full fat (FF), reduced fat with no added spice (RF), and reduced fat plus spice (RFS). Subjects rated their liking of French toast, sausage and the overall meal, or chicken, vegetables, pasta and the overall meal on a nine-point hedonic Likert scale. Overall liking of the RF breakfast and lunch meals were lower than the FF and RFS versions (Breakfast: 6.50 RF vs. 6.84 FF, p=0.0061; 6.50 RF vs. 6.82 RFS, p=0.0030; Lunch: 6.35 RF vs. 6.94 FF, p<0.0001; 6.35 RF vs. 6.71 RFS, p=0.0061). RFS and FF breakfast and lunch meals, French toast, chicken and vegetable likings were similar. FF and RFS conditions were liked more than RF for the breakfast and lunch meals, French toast, chicken entrée and vegetables. Liking of all three sausage conditions were similar. FF Pasta was liked more than RFS and RF (7.47 FF vs. 6.42 RFS, p<0.0001; 7.47 FF vs. 6.47 RF, p<0.0001). Habitual consumption of roasted chicken was associated with reduced liking of FF chicken (r = −0.23, p=0.004) and FF pasta (r = −0.23, p=0.005). Herbs and spices may be useful for improving the liking of lower-fat foods and helping Americans maintain a diet consistent with the US Dietary Guidelines. PMID:25219391
Recent advances in environmental data mining
NASA Astrophysics Data System (ADS)
Leuenberger, Michael; Kanevski, Mikhail
2016-04-01
Due to the large amount and complexity of data available nowadays in geo- and environmental sciences, we face the need to develop and incorporate more robust and efficient methods for their analysis, modelling and visualization. An important part of these developments deals with an elaboration and application of a contemporary and coherent methodology following the process from data collection to the justification and communication of the results. Recent fundamental progress in machine learning (ML) can considerably contribute to the development of the emerging field - environmental data science. The present research highlights and investigates the different issues that can occur when dealing with environmental data mining using cutting-edge machine learning algorithms. In particular, the main attention is paid to the description of the self-consistent methodology and two efficient algorithms - Random Forest (RF, Breiman, 2001) and Extreme Learning Machines (ELM, Huang et al., 2006), which have recently gained great popularity. Despite the fact that they are based on two different concepts, i.e. decision trees vs artificial neural networks, they both deliver promising results for complex, high-dimensional and non-linear data modelling. In addition, the study discusses several important issues of data-driven modelling, including feature selection and uncertainties. The approach considered is accompanied by simulated and real data case studies from renewable resources assessment and natural hazards tasks. In conclusion, the current challenges and future developments in statistical environmental data learning are discussed. References - Breiman, L., 2001. Random Forests. Machine Learning 45 (1), 5-32. - Huang, G.-B., Zhu, Q.-Y., Siew, C.-K., 2006. Extreme learning machine: theory and applications. Neurocomputing 70 (1-3), 489-501. - Kanevski, M., Pozdnoukhov, A., Timonin, V., 2009. Machine Learning for Spatial Environmental Data. EPFL Press; Lausanne, Switzerland, p.392. - Leuenberger, M., Kanevski, M., 2015. Extreme Learning Machines for spatial environmental data. Computers and Geosciences 85, 64-73.
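To make the comparison of the two algorithms concrete, here is a minimal, hedged sketch on synthetic data: a scikit-learn Random Forest regressor next to a bare-bones Extreme Learning Machine (random hidden layer, output weights from regularised least squares). All data, hidden-layer size and other hyperparameters are illustrative assumptions, not values taken from the studies cited above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Toy environmental-style regression problem (synthetic, for illustration only).
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(2000, 5))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=len(X))
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Random Forest (Breiman, 2001)
rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)

# Extreme Learning Machine (Huang et al., 2006): random hidden layer,
# output weights solved by regularised least squares.
n_hidden = 500
W = rng.normal(size=(X.shape[1], n_hidden))
b = rng.normal(size=n_hidden)

def hidden(X):
    return np.tanh(X @ W + b)

H = hidden(X_tr)
beta = np.linalg.solve(H.T @ H + 1e-3 * np.eye(n_hidden), H.T @ y_tr)

print("RF  R2:", r2_score(y_te, rf.predict(X_te)))
print("ELM R2:", r2_score(y_te, hidden(X_te) @ beta))
```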
Comparison of Calibration Techniques for Low-Cost Air Quality Monitoring
NASA Astrophysics Data System (ADS)
Malings, C.; Ramachandran, S.; Tanzer, R.; Kumar, S. P. N.; Hauryliuk, A.; Zimmerman, N.; Presto, A. A.
2017-12-01
Assessing the intra-city spatial distribution and temporal variability of air quality can be facilitated by a dense network of monitoring stations. However, the cost of implementing such a network can be prohibitive if high-quality but high-cost monitoring systems are used. To this end, the Real-time Affordable Multi-Pollutant (RAMP) sensor package has been developed at the Center for Atmospheric Particle Studies of Carnegie Mellon University, in collaboration with SenSevere LLC. This self-contained unit can measure up to five gases out of CO, SO2, NO, NO2, O3, VOCs, and CO2, along with temperature and relative humidity. Responses of individual gas sensors can vary greatly even when exposed to the same ambient conditions. Those of VOC sensors in particular were observed to vary by a factor of 8, which suggests that each sensor requires its own calibration model. To address this, we apply and compare two different calibration methods to data collected by RAMP sensors collocated with a reference monitor station. The first method, random forest (RF) modeling, is a rule-based method which maps sensor responses to pollutant concentrations by implementing a trained sequence of decision rules. RF modeling has previously been used for other RAMP gas sensors by the group, and has produced precise calibrated measurements. However, RF models can only predict pollutant concentrations within the range observed in the training data collected during the collocation period. The second method, Gaussian process (GP) modeling, is a probabilistic Bayesian technique whereby broad prior estimates of pollutant concentrations are updated using sensor responses to generate more refined posterior predictions, as well as allowing predictions beyond the range of the training data. The accuracy and precision of these techniques are assessed and compared on VOC data collected during the summer of 2017 in Pittsburgh, PA. By combining pollutant data gathered by each RAMP sensor and applying appropriate calibration techniques, the potentially noisy or biased responses of individual sensors can be mapped to pollutant concentration values which are comparable to those of reference instruments.
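A hedged sketch of the two calibration strategies on synthetic collocation data is shown below; the sensor response model, variable names and kernel settings are assumptions made for illustration, not the RAMP calibration models themselves. It also illustrates the point made above that an RF calibration cannot predict concentrations above the maximum seen during collocation, whereas a GP is not hard-bounded by the training targets.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Synthetic collocation data: raw sensor response + temperature + RH -> reference concentration.
rng = np.random.default_rng(1)
n = 1500
temp = rng.uniform(5, 35, n)                  # deg C
rh = rng.uniform(20, 90, n)                   # %
conc = rng.uniform(0, 80, n)                  # "true" reference concentration (ppb)
signal = 0.8 * conc + 0.3 * temp - 0.05 * rh + rng.normal(0, 2, n)  # toy sensor response
X, y = np.column_stack([signal, temp, rh]), conc

train = conc < 50                              # training range is limited;
X_tr, y_tr = X[train], y[train]                # the test set extrapolates beyond it
X_te, y_te = X[~train], y[~train]

rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)
gp = GaussianProcessRegressor(kernel=RBF(length_scale=[10.0, 10.0, 10.0]) + WhiteKernel(),
                              normalize_y=True).fit(X_tr[::5], y_tr[::5])

# RF predictions cannot exceed the maximum concentration seen in training;
# the GP is not hard-bounded by the training targets.
print("RF  max prediction outside training range:", rf.predict(X_te).max())
print("GP  max prediction outside training range:", gp.predict(X_te).max())
```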
NASA Astrophysics Data System (ADS)
Messenzehl, Karoline; Meyer, Hanna; Otto, Jan-Christoph; Hoffmann, Thomas; Dikau, Richard
2017-06-01
In mountain geosystems, rockfalls are among the most effective sediment transfer processes, reflected in the regional-scale distribution of talus slopes. However, the understanding of the key controlling factors seems to decrease with increasing spatial scale, due to emergent and complex system behavior and not least to recent methodological shortcomings in rockfall modeling research. In this study, we aim (i) to develop a new approach to identify major regional-scale rockfall controls and (ii) to quantify the relative importance of these controls. Using a talus slope inventory in the Turtmann Valley (Swiss Alps), we applied for the first time the decision-tree based random forest algorithm (RF) in combination with a principal component logistic regression (PCLR) to evaluate the spatial distribution of rockfall activity. This study presents new insights into the discussion on whether periglacial rockfall events are controlled more by topo-climatic, cryospheric, paraglacial or/and rock mechanical properties. Both models explain the spatial rockfall pattern very well, given the high areas under the Receiver Operating Characteristic (ROC) curves of > 0.83. Highest accuracy was obtained by the RF, correctly predicting 88% of the rockfall source areas. The RF appears to have great potential in geomorphic research involving multicollinear data. The regional permafrost distribution, coupled to the bedrock curvature and valley topography, was detected to be the primary rockfall control. Rockfall source areas cluster within a low-radiation elevation belt (2900-3300 m a.s.l.) consistent with a permafrost probability of > 90%. The second most important factor is the time since deglaciation, reflected by the high abundance of rockfalls along recently deglaciated (< 100 years), north-facing slopes. However, our findings also indicate a strong rock mechanical control on the paraglacial rockfall activity, declining either exponentially or linearly since deglaciation. The study demonstrates the benefit of combined statistical approaches for predicting rockfall activity in deglaciated, permafrost-affected mountain valleys and highlights the complex interplay between rock mechanical, paraglacial and topo-climatic controls at the regional scale.
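The core of the RF part of such an analysis can be sketched as follows with scikit-learn; the predictor names, synthetic labels and train/test split are illustrative assumptions and not the Turtmann Valley inventory.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in predictors for the kinds of controls discussed in the study
# (names are illustrative, not the authors' dataset).
features = ["permafrost_prob", "curvature", "time_since_deglaciation",
            "radiation", "slope", "aspect_north", "joint_spacing", "elevation"]
X, y = make_classification(n_samples=3000, n_features=len(features),
                           n_informative=4, random_state=0)   # 1 = rockfall source area

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)

auc = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])
print(f"area under the ROC curve: {auc:.2f}")
for name, imp in sorted(zip(features, rf.feature_importances_), key=lambda t: -t[1]):
    print(f"{name:>24s}  importance = {imp:.3f}")
```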
NASA Astrophysics Data System (ADS)
Nakatsugawa, M.; Kobayashi, Y.; Okazaki, R.; Taniguchi, Y.
2017-12-01
This research aims to improve the accuracy of water level prediction calculations for more effective river management. In August 2016, Hokkaido was visited by four typhoons, whose heavy rainfall caused severe flooding. In the Tokoro river basin of Eastern Hokkaido, the water level (WL) at the Kamikawazoe gauging station, which is at the lower reaches, exceeded the design high-water level and the water rose to the highest level on record. To predict such flood conditions and mitigate disaster damage, it is necessary to improve the accuracy of prediction as well as to prolong the lead time (LT) required for disaster mitigation measures such as flood-fighting activities and evacuation actions by residents. There is a need to predict the river water level around the peak stage earlier and more accurately. Previous research dealing with WL prediction proposed a method in which the WL at the lower reaches is estimated by the correlation with the WL at the upper reaches (hereinafter: "the water level correlation method"). Additionally, a runoff model-based method has been generally used in which the discharge is estimated by giving rainfall prediction data to a runoff model such as a storage function model and then the WL is estimated from that discharge by using a WL-discharge rating curve (H-Q curve). In this research, an attempt was made to predict WL by applying the Random Forest (RF) method, which is a machine learning method that can estimate the contribution of explanatory variables. Furthermore, from the practical point of view, we investigated the prediction of WL based on a multiple correlation (MC) method involving factors using explanatory variables with high contribution in the RF method, and we examined the proper selection of explanatory variables and the extension of LT. The following results were found: 1) Based on the RF method tuned by learning from previous floods, the WL for the abnormal flood case of August 2016 was properly predicted with a lead time of 6 h. 2) Based on the contribution of explanatory variables, factors were selected for the MC method. In this way, plausible prediction results were obtained.
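A minimal sketch of this two-step workflow (RF to rank explanatory variables by contribution, then a multiple-correlation model built from the top-ranked ones) is given below; the lagged rainfall and water-level predictors and their relationships are synthetic assumptions, not the Tokoro river data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic basin: predict downstream WL several hours ahead from lagged upstream
# WLs and basin rainfall, then feed the highest-contribution predictors into a
# multiple-correlation (linear) model.
rng = np.random.default_rng(0)
n = 5000
rain = rng.gamma(2.0, 2.0, n)
upstream = pd.Series(rain).rolling(6, min_periods=1).mean().to_numpy()
df = pd.DataFrame({
    "rain_lag6": np.roll(rain, 6),
    "rain_lag12": np.roll(rain, 12),
    "upstream_wl_lag6": np.roll(upstream, 6),
    "upstream_wl_lag12": np.roll(upstream, 12),
})
target = 0.7 * df["upstream_wl_lag6"] + 0.2 * df["rain_lag12"] + rng.normal(0, 0.1, n)

rf = RandomForestRegressor(n_estimators=300, oob_score=True, random_state=0)
rf.fit(df, target)
print("OOB R2:", rf.oob_score_)

contrib = pd.Series(rf.feature_importances_, index=df.columns).sort_values(ascending=False)
print(contrib)

top = contrib.index[:2]                         # explanatory variables for the MC method
mc = LinearRegression().fit(df[top], target)
print("MC-method R2 with selected predictors:", mc.score(df[top], target))
```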
NASA Astrophysics Data System (ADS)
Yang, T.; Akbari Asanjan, A.; Gao, X.; Sorooshian, S.
2016-12-01
Reservoirs are fundamental human-built infrastructures that collect, store, and deliver fresh surface water in a timely manner for many purposes, including residential and industrial water supply, flood control, hydropower, and irrigation. Efficient reservoir operation requires that policy makers and operators understand how reservoir inflows, available storage, and discharges are changing under different climatic conditions. Over the last decade, the uses of Artificial Intelligence and Data Mining (AI & DM) techniques in assisting reservoir management and seasonal forecasts have been increasing. Therefore, in this study, two distinct AI & DM methods, Artificial Neural Network (ANN) and Random Forest (RF), are employed and compared with respect to their capabilities of predicting monthly reservoir inflow, managing storage, and scheduling reservoir releases. A case study on Trinity Lake in northern California is conducted using long-term (over 50 years) reservoir operation records and 17 known climate phenomenon indices, e.g., PDO and ENSO, as predictors. Results show that (1) both ANN and RF are capable of providing reasonable monthly reservoir storage, inflow, and outflow prediction with satisfactory statistics, and (2) climate phenomenon indices are useful in assisting monthly or seasonal forecasts of reservoir inflow and outflow. It is also found that reservoir storage has consistently high autocorrelation, while inflow and outflow are more likely to be influenced by climate conditions. Using the Gini diversity index, the RF method identifies that the reservoir discharges are associated with the Southern Oscillation Index (SOI) and reservoir inflows are influenced by multiple climate phenomenon indices during different seasons. Furthermore, results also show that, during the winter season, reservoir discharges are controlled by the storage level for flood-control purposes, while, during the summer season, the flood-control operation is not as significant as that in the winter. With regard to the suitability of the AI & DM methods in support of reservoir operation, the Decision Tree method is suggested for future reservoir studies because of its transparency and non-parametric features over the "black-box" style ANN regression model.
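As a sketch of how climate-phenomenon indices can be screened with the impurity-based (Gini-type) importance of an RF regressor, consider the following toy example; the indices, coefficients and sample size are assumptions, not the Trinity Lake records.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic monthly reservoir inflow regressed on climate-phenomenon indices;
# the impurity-based importances indicate which index carries predictive signal.
rng = np.random.default_rng(7)
n_months = 600
idx = pd.DataFrame({
    "SOI": rng.normal(size=n_months),
    "PDO": rng.normal(size=n_months),
    "NINO34": rng.normal(size=n_months),
    "prev_storage": rng.normal(size=n_months),
})
inflow = 1.2 * idx["SOI"] + 0.4 * idx["NINO34"] + rng.normal(0, 0.5, n_months)

rf = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=0)
rf.fit(idx, inflow)
print("OOB R2:", round(rf.oob_score_, 3))
print(pd.Series(rf.feature_importances_, index=idx.columns).sort_values(ascending=False))
```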
Nef, Tobias; Urwyler, Prabitha; Büchler, Marcel; Tarnanas, Ioannis; Stucki, Reto; Cazzoli, Dario; Müri, René; Mosimann, Urs
2015-05-21
Smart homes for the aging population have recently started attracting the attention of the research community. The "health state" of smart homes is comprised of many different levels; starting with the physical health of citizens, it also includes longer-term health norms and outcomes, as well as the arena of positive behavior changes. One of the problems of interest is to monitor the activities of daily living (ADL) of the elderly, aiming at their protection and well-being. For this purpose, we installed passive infrared (PIR) sensors to detect motion in a specific area inside a smart apartment and used them to collect a set of ADL. In a novel approach, we describe a technology that allows the ground truth collected in one smart home to train activity recognition systems for other smart homes. We asked the users to label all instances of all ADL only once and subsequently applied data mining techniques to cluster in-home sensor firings. Each cluster would therefore represent the instances of the same activity. Once the clusters were associated to their corresponding activities, our system was able to recognize future activities. To improve the activity recognition accuracy, our system preprocessed raw sensor data by identifying overlapping activities. To evaluate the recognition performance from a 200-day dataset, we implemented three different active learning classification algorithms and compared their performance: naive Bayesian (NB), support vector machine (SVM) and random forest (RF). Based on our results, the RF classifier recognized activities with an average specificity of 96.53%, a sensitivity of 68.49%, a precision of 74.41% and an F-measure of 71.33%, outperforming both the NB and SVM classifiers. Further clustering markedly improved the results of the RF classifier. An activity recognition system based on PIR sensors in conjunction with a clustering classification approach was able to detect ADL from datasets collected from different homes. Thus, our PIR-based smart home technology could improve care and provide valuable information to better understand the functioning of our societies, as well as to inform both individual and collective action in a smart city scenario.
Qin, Fenju; Yuan, Hongxia; Nie, Jihua; Cao, Yi; Tong, Jian
2014-01-01
The aim was to study the effects of nano-selenium (NSe) on the cognitive performance of mice exposed to 1800 MHz radiofrequency fields (RF). Male mice were randomly divided into four groups: control and nano-Se low, middle and high dose groups (L, M, H). Each group was sub-divided into three subgroups: RF 0 min, RF 30 min and RF 120 min. Nano-Se solutions (2, 4 and 8 μg/ml) were administered to mice of the L, M, and H groups by intra-gastric injection, 0.5 ml/d for 50 days; the control group was administered distilled water. From the 21st day, the mice in the RF subgroups were exposed to 1800 MHz radiofrequency fields at 208 μW/cm2 (0, 30 and 120 min/d, respectively) for 30 days. The cognitive ability of the mice was tested with a Y-maze. Further, the levels of MDA, GABA, Glu, and Ach and the activities of CAT and GSH-Px in the cerebrum were measured. Significant impairments in learning and memory (P < 0.05) were observed in the RF 120 min group, along with reduction of the Ach level and the activities of CAT and GSH-Px and increases in GABA, Glu and MDA content in the cerebrum. NSe enhanced the cognitive performance of RF-exposed mice, decreased GABA, Glu and MDA levels, and increased Ach levels and GSH-Px and CAT activities. NSe could improve the cognitive impairments of mice exposed to RF; the mechanism might involve increased antioxidation, decreased free radical content and changes in cerebral neurotransmitters.
Su, Ruiliang; Chen, Xiang; Cao, Shuai; Zhang, Xu
2016-01-14
Sign language recognition (SLR) has been widely used for communication amongst the hearing-impaired and non-verbal community. This paper proposes an accurate and robust SLR framework using an improved decision tree as the base classifier of random forests. This framework was used to recognize Chinese sign language subwords using recordings from a pair of portable devices worn on both arms consisting of accelerometers (ACC) and surface electromyography (sEMG) sensors. The experimental results demonstrated the validity of the proposed random forest-based method for recognition of Chinese sign language (CSL) subwords. With the proposed method, 98.25% average accuracy was obtained for the classification of a list of 121 frequently used CSL subwords. Moreover, the random forests method demonstrated a superior performance in resisting the impact of bad training samples. When the proportion of bad samples in the training set reached 50%, the recognition error rate of the random forest-based method was only 10.67%, while that of a single decision tree adopted in our previous work was almost 27.5%. Our study offers a practical way of realizing a robust and wearable EMG-ACC-based SLR system.
Pseudo CT estimation from MRI using patch-based random forest
NASA Astrophysics Data System (ADS)
Yang, Xiaofeng; Lei, Yang; Shu, Hui-Kuo; Rossi, Peter; Mao, Hui; Shim, Hyunsuk; Curran, Walter J.; Liu, Tian
2017-02-01
Recently, MR simulators have gained popularity because they avoid the unnecessary radiation exposure associated with the CT simulators used in radiation therapy planning. We propose a method for pseudo CT estimation from MR images based on a patch-based random forest. Patient-specific anatomical features are extracted from the aligned training images and adopted as signatures for each voxel. The most robust and informative features are identified using feature selection to train the random forest. The well-trained random forest is used to predict the pseudo CT of a new patient. This prediction technique was tested with human brain images and the prediction accuracy was assessed using the original CT images. Peak signal-to-noise ratio (PSNR) and feature similarity (FSIM) indexes were used to quantify the differences between the pseudo and original CT images. The experimental results showed the proposed method could accurately generate pseudo CT images from MR images. In summary, we have developed a new pseudo CT prediction method based on patch-based random forest, demonstrated its clinical feasibility, and validated its prediction accuracy. This pseudo CT prediction technique could be a useful tool for MRI-based radiation treatment planning and attenuation correction in a PET/MRI scanner.
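A hedged, two-dimensional toy version of patch-based random-forest pseudo-CT prediction is sketched below: raw intensity patches stand in for the selected anatomical features, and PSNR is computed against the known toy "CT". The patch size, image size and MR-to-CT mapping are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def patches(img, size=5):
    """Flattened size x size patches centred on every interior pixel."""
    r = size // 2
    out = []
    for i in range(r, img.shape[0] - r):
        for j in range(r, img.shape[1] - r):
            out.append(img[i - r:i + r + 1, j - r:j + r + 1].ravel())
    return np.asarray(out)

mr_train = rng.random((64, 64))
ct_train = 1000 * mr_train ** 2 + rng.normal(0, 10, mr_train.shape)  # toy MR -> CT mapping
mr_test = rng.random((64, 64))
ct_test = 1000 * mr_test ** 2

rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=0)
rf.fit(patches(mr_train), patches(ct_train)[:, 12])    # predict the centre-pixel CT value

pred = rf.predict(patches(mr_test))
truth = patches(ct_test)[:, 12]
mse = np.mean((pred - truth) ** 2)
psnr = 10 * np.log10(truth.max() ** 2 / mse)
print(f"pseudo-CT PSNR on the toy test image: {psnr:.1f} dB")
```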
Pereira, Thalita Rodrigues Christovam; Vassão, Patrícia Gabrielli; Venancio, Michele Garcia; Renno, Ana Cláudia Muniz; Aveiro, Mariana Chaves
2017-06-01
The objective of this study was to evaluate the effects of non-ablative radiofrequency (RF), with or without low-level laser therapy (LLLT), on the appearance of facial wrinkles among adult women. Forty-six participants were randomized into three groups: Control Group (CG, n = 15), RF Group (RG, n = 16), and RF and LLLT Group (RLG, n = 15). Every participant was evaluated at baseline (T0), after eight weeks (T8) and eight weeks after the completion of treatment (follow-up). They were photographed in order to classify nasolabial folds and periorbital wrinkles (Modified Fitzpatrick Wrinkle Scale and Fitzpatrick Wrinkle Classification System, respectively) and improvement in appearance (Global Aesthetic Improvement Scale). Photograph analyses were performed by 3 blinded evaluators. Classification of nasolabial and periorbital wrinkles did not show any significant difference between groups. Aesthetic appearance indicated a significant improvement for nasolabial folds on the right side of the face immediately after treatment (p = 0.018) and at follow-up (p = 0.029). RG presented better results than CG on T8 (p = 0.041, ES = -0.49) and on follow-up (p = 0.041, ES = -0.49) and better than RLG on T8 (p = 0.041, ES = -0.49). RLG presented better results than CG on follow-up (p = 0.007, ES = -0.37). Nasolabial folds and periorbital wrinkles did not change throughout the study; however, some aesthetic improvement was observed. LLLT did not potentiate RF treatment.
Ökmen, Burcu Metin; Ökmen, Korgün
2017-11-01
Shoulder pain can be difficult to treat due to its complex anatomic structure, and different treatment methods can be used. We aimed to examine the efficacy of photobiomodulation therapy (PBMT) and suprascapular nerve (SSN)-pulsed radiofrequency (RF) therapy. In this prospective, randomized, controlled, single-blind study, 59 patients with chronic shoulder pain due to impingement syndrome received PBMT (group H) or SSN-pulsed RF therapy (group P) in addition to exercise therapy for 14 sessions over 2 weeks. Records were taken using visual analog scale (VAS), Shoulder Pain and Disability Index (SPADI), and Nottingham Health Profile (NHP) scoring systems for pretreatment (PRT), posttreatment (PST), and PST follow-up at months 1, 3, and 6. There was no statistically significant difference in initial VAS score, SPADI, and NHP values between group H and group P (p > 0.05). Compared to the values of PRT, PST, and PST at months 1, 3, and 6, VAS, SPADI, and NHP values were statistically significantly lower in both groups (p < 0.001). There was no statistically significant difference at all measurement times in VAS, SPADI, and NHP between the two groups. We established that PBMT and SSN-pulsed RF therapy are effective methods, in addition to exercise therapy, in patients with chronic shoulder pain. PBMT seems to be advantageous compared to SSN-pulsed RF therapy, as it is a noninvasive method.
Detecting targets hidden in random forests
NASA Astrophysics Data System (ADS)
Kouritzin, Michael A.; Luo, Dandan; Newton, Fraser; Wu, Biao
2009-05-01
Military tanks, cargo or troop carriers, missile carriers or rocket launchers often hide from detection in forests. This complicates the problem of locating these hidden targets. An electro-optic camera mounted on a surveillance aircraft or unmanned aerial vehicle is used to capture the images of the forests with possible hidden targets, e.g., rocket launchers. We consider random forests with longitudinal and latitudinal correlations. Specifically, foliage coverage is encoded with a binary representation (i.e., foliage or no foliage), and is correlated in adjacent regions. We address the detection problem of camouflaged targets hidden in random forests by building memory into the observations. In particular, we propose an efficient algorithm to generate random forests, ground, and camouflage of hidden targets with two dimensional correlations. The observations are a sequence of snapshots consisting of foliage-obscured ground or target. Theoretically, detection is possible because there are subtle differences in the correlations of the ground and camouflage of the rocket launcher. However, these differences are well beyond human perception. To detect the presence of hidden targets automatically, we develop a Markov representation for these sequences and modify the classical filtering equations to allow the Markov chain observation. Particle filters are used to estimate the position of the targets in combination with a novel random weighting technique. Furthermore, we give positive proof-of-concept simulations.
Nishimori, Keisuke; Kitahata, Shigeru; Nishino, Takashi; Maruyama, Tatsuo
2018-05-10
Controlling the surface properties of solid polymers is important for practical applications. We here succeeded in controlling the surface segregation of polymers to display carboxy groups on an outermost surface, which allowed the covalent immobilization of functional molecules via the carboxy groups on a substrate surface. Random methacrylate-based copolymers containing carboxy groups, in which carboxy groups were protected with perfluoroacyl (Rf) groups, were dip-coated on acrylic substrate surfaces. X-ray photoelectron spectroscopy and contact-angle measurements revealed that the Rf groups were segregated to the outermost surface of the dip-coated substrates. The Rf groups were removed by hydrolysis of the Rf esters in the copolymers, resulting in the display of carboxy groups on the surface. The quantification of carboxy groups on a surface revealed that the carboxy groups were reactive to a water-soluble solute in aqueous solution. The surface segregation was affected by the molecular structure of the copolymer used for dip-coating.
Screening large-scale association study data: exploiting interactions using random forests.
Lunetta, Kathryn L; Hayward, L Brooke; Segal, Jonathan; Van Eerdewegh, Paul
2004-12-10
Genome-wide association studies for complex diseases will produce genotypes on hundreds of thousands of single nucleotide polymorphisms (SNPs). A logical first approach to dealing with massive numbers of SNPs is to use some test to screen the SNPs, retaining only those that meet some criterion for further study. For example, SNPs can be ranked by p-value, and those with the lowest p-values retained. When SNPs have large interaction effects but small marginal effects in a population, they are unlikely to be retained when univariate tests are used for screening. However, model-based screens that pre-specify interactions are impractical for data sets with thousands of SNPs. Random forest analysis is an alternative method that produces a single measure of importance for each predictor variable that takes into account interactions among variables without requiring model specification. Interactions increase the importance for the individual interacting variables, making them more likely to be given high importance relative to other variables. We test the performance of random forests as a screening procedure to identify small numbers of risk-associated SNPs from among large numbers of unassociated SNPs using complex disease models with up to 32 loci, incorporating both genetic heterogeneity and multi-locus interaction. Keeping other factors constant, if risk SNPs interact, the random forest importance measure significantly outperforms the Fisher Exact test as a screening tool. As the number of interacting SNPs increases, the improvement in performance of random forest analysis relative to Fisher Exact test for screening also increases. Random forests perform similarly to the univariate Fisher Exact test as a screening tool when SNPs in the analysis do not interact. In the context of large-scale genetic association studies where unknown interactions exist among true risk-associated SNPs or SNPs and environmental covariates, screening SNPs using random forest analyses can significantly reduce the number of SNPs that need to be retained for further study compared to standard univariate screening methods.
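The screening comparison described above can be sketched as follows: SNPs are ranked both by RF importance and by univariate Fisher exact p-values on synthetic genotypes in which two SNPs act only through their interaction. The genotype coding, sample size and effect sizes are assumptions; with a pure interaction, the RF importances of the interacting pair are typically (though not always) ranked far higher than the univariate test ranks them.

```python
import numpy as np
from scipy.stats import fisher_exact
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n, n_snps = 1000, 200
G = rng.integers(0, 2, size=(n, n_snps))             # binarised genotypes (toy coding)
risk = (G[:, 0] ^ G[:, 1]).astype(float)             # SNP0 and SNP1 interact with no marginal effect
y = (rng.random(n) < 0.2 + 0.5 * risk).astype(int)   # case/control status

rf = RandomForestClassifier(n_estimators=1000, random_state=0).fit(G, y)
rf_rank = np.argsort(rf.feature_importances_)[::-1]   # most important first

def contingency(g, y):
    return [[int(np.sum((g == a) & (y == b))) for b in (0, 1)] for a in (0, 1)]

pvals = np.array([fisher_exact(contingency(G[:, j], y))[1] for j in range(n_snps)])
fisher_rank = np.argsort(pvals)                        # smallest p-value first

for name, rank in [("RF importance", rf_rank), ("Fisher exact", fisher_rank)]:
    positions = sorted(int(np.where(rank == s)[0][0]) for s in (0, 1))
    print(f"{name:>14s}: interacting SNPs ranked at positions {positions} of {n_snps}")
```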
NASA Astrophysics Data System (ADS)
Falco, N.; Wainwright, H. M.; Dafflon, B.; Leger, E.; Peterson, J.; Steltzer, H.; Wilmer, C.; Williams, K. H.; Hubbard, S. S.
2017-12-01
Mountainous watershed systems are characterized by extreme heterogeneity in hydrological and pedological properties that influence biotic activities, plant communities and their dynamics. To gain predictive understanding of how ecosystems and watershed systems evolve under climate change, it is critical to capture such heterogeneity and to quantify the effect of key environmental variables such as topography and soil properties. In this study, we exploit advanced geophysical and remote sensing techniques - coupled with machine learning - to better characterize and quantify the interactions between plant communities' distribution and subsurface properties. First, we have developed a remote sensing data fusion framework based on the random forest (RF) classification algorithm to estimate the spatial distribution of plant communities. The framework allows the integration of both plant spectral and structural information, which are derived from multispectral satellite images and airborne LiDAR data. We then use the RF method to evaluate the estimated plant community map, exploiting the subsurface properties (such as bedrock depth, soil moisture and other properties) and geomorphological parameters (such as slope, curvature) as predictors. Datasets include high-resolution geophysical data (electrical resistivity tomography) and LiDAR digital elevation maps. We demonstrate our approach on a mountain hillslope and meadow within the East River watershed in Colorado, which is considered to be a representative headwater catchment in the Upper Colorado Basin. The obtained results show the existence of co-evolution between above- and below-ground processes; in particular, dominant shrub communities occur in wet and flat areas. We show that successful integration of remote sensing data with geophysical measurements allows identifying and quantifying the key environmental controls on plant communities' distribution, and provides insights into their potential changes under future climate conditions.
Wide-area mapping of small-scale features in agricultural landscapes using airborne remote sensing
NASA Astrophysics Data System (ADS)
O'Connell, Jerome; Bradter, Ute; Benton, Tim G.
2015-11-01
Natural and semi-natural habitats in agricultural landscapes are likely to come under increasing pressure with the global population set to exceed 9 billion by 2050. These non-cropped habitats are primarily made up of trees, hedgerows and grassy margins and their amount, quality and spatial configuration can have strong implications for the delivery and sustainability of various ecosystem services. In this study high spatial resolution (0.5 m) colour infrared aerial photography (CIR) was used in object based image analysis for the classification of non-cropped habitat in a 10,029 ha area of southeast England. Three classification scenarios were devised using 4-class and 9-class schemes. The machine learning algorithm Random Forest (RF) was used to reduce the number of variables used for each classification scenario by 25.5% ± 2.7%. Proportion of votes from the 4-class hierarchy was made available to the 9-class scenarios and these were the highest-ranked variables in all cases. This approach allowed for misclassified parent objects to be correctly classified at a lower level. A single object hierarchy with 4-class proportion of votes produced the best result (kappa 0.909). Validation of the optimum training sample size in RF showed no significant difference between mean internal out-of-bag error and external validation. As an example of the utility of this data, we assessed habitat suitability for a declining farmland bird, the yellowhammer (Emberiza citronella), which requires hedgerows associated with grassy margins. We found that ˜22% of hedgerows were within 200 m of margins with an area >183.31 m2. The results from this analysis can form a key information source at the environmental and policy level in landscape optimisation for food production and ecosystem service sustainability.
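The "proportion of votes" idea, where class probabilities from the coarse 4-class RF are appended as predictors for the 9-class RF, can be sketched as follows; the object features, the toy grouping of 9 classes into 4, and whether the votes actually improve kappa all depend on the data and are assumptions here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split

# Synthetic "image objects" with a 9-class label and an arbitrary 4-class grouping.
X, y9 = make_classification(n_samples=4000, n_features=30, n_informative=12,
                            n_classes=9, n_clusters_per_class=1, random_state=0)
y4 = np.minimum(y9 // 2, 3)                     # toy 4-class hierarchy over the 9 classes

X_tr, X_te, y9_tr, y9_te, y4_tr, y4_te = train_test_split(X, y9, y4, random_state=0)

coarse = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y4_tr)
votes_tr = coarse.predict_proba(X_tr)           # proportion of votes per coarse class
votes_te = coarse.predict_proba(X_te)

plain = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y9_tr)
with_votes = RandomForestClassifier(n_estimators=300, random_state=0).fit(
    np.hstack([X_tr, votes_tr]), y9_tr)

print("kappa, 9-class RF without votes:",
      round(cohen_kappa_score(y9_te, plain.predict(X_te)), 3))
print("kappa, 9-class RF with 4-class vote proportions:",
      round(cohen_kappa_score(y9_te, with_votes.predict(np.hstack([X_te, votes_te]))), 3))
```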
Classification of team sport activities using a single wearable tracking device.
Wundersitz, Daniel W T; Josman, Casey; Gupta, Ritu; Netto, Kevin J; Gastin, Paul B; Robertson, Sam
2015-11-26
Wearable tracking devices incorporating accelerometers and gyroscopes are increasingly being used for activity analysis in sports. However, minimal research exists relating to their ability to classify common activities. The purpose of this study was to determine whether data obtained from a single wearable tracking device can be used to classify team sport-related activities. Seventy-six non-elite sporting participants were tested during a simulated team sport circuit (involving stationary, walking, jogging, running, changing direction, counter-movement jumping, jumping for distance and tackling activities) in a laboratory setting. A MinimaxX S4 wearable tracking device was worn below the neck, in-line and dorsal to the first to fifth thoracic vertebrae of the spine, with tri-axial accelerometer and gyroscope data collected at 100 Hz. Multiple time domain, frequency domain and custom features were extracted from each sensor using 0.5, 1.0, and 1.5 s movement capture durations. Features were further screened using a combination of ANOVA and Lasso methods. Relevant features were used to classify the eight activities performed using the Random Forest (RF), Support Vector Machine (SVM) and Logistic Model Tree (LMT) algorithms. The LMT (79-92% classification accuracy) outperformed the RF (32-43%) and SVM algorithms (27-40%), obtaining the strongest performance using the full model (accelerometer and gyroscope inputs). Processing time can be reduced through feature selection methods (range 1.5-30.2%); however, a trade-off exists between classification accuracy and processing time. Movement capture duration also had little impact on classification accuracy or processing time. In sporting scenarios where wearable tracking devices are employed, it is both possible and feasible to accurately classify team sport-related activities. Copyright © 2015 Elsevier Ltd. All rights reserved.
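A minimal sketch of the windowing and feature-extraction step on synthetic 100 Hz tri-axial accelerometer data is given below; since Logistic Model Trees have no scikit-learn implementation, only RF and SVM are compared, and the signal model, features and window length are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
fs, win = 100, 100                              # 100 Hz sampling, 1.0 s capture duration
activities = {"stationary": 0.05, "walking": 0.4, "running": 1.2, "jumping": 2.5}

segments, labels = [], []
for label, intensity in activities.items():
    for _ in range(200):
        seg = intensity * rng.normal(size=(win, 3))       # toy tri-axial signal
        segments.append(seg)
        labels.append(label)

def features(seg):
    """Simple time- and frequency-domain features per axis."""
    mag = np.abs(np.fft.rfft(seg, axis=0))
    return np.concatenate([seg.mean(0), seg.std(0), np.abs(seg).max(0),
                           mag.mean(0), mag.argmax(0)])

X = np.array([features(s) for s in segments])
y = np.array(labels)

for name, clf in [("RF", RandomForestClassifier(n_estimators=300, random_state=0)),
                  ("SVM", make_pipeline(StandardScaler(), SVC()))]:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: {acc:.2%} cross-validated accuracy")
```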
Kautzky, Alexander; Dold, Markus; Bartova, Lucie; Spies, Marie; Vanicek, Thomas; Souery, Daniel; Montgomery, Stuart; Mendlewicz, Julien; Zohar, Joseph; Fabbri, Chiara; Serretti, Alessandro; Lanzenberger, Rupert; Kasper, Siegfried
The study objective was to generate a prediction model for treatment-resistant depression (TRD) using machine learning featuring a large set of 47 clinical and sociodemographic predictors of treatment outcome. A total of 552 patients diagnosed with major depressive disorder (MDD) according to DSM-IV criteria were enrolled between 2011 and 2016. TRD was defined as failure to reach response to antidepressant treatment, characterized by a Montgomery-Asberg Depression Rating Scale (MADRS) score below 22 after at least 2 antidepressant trials of adequate length and dosage were administered. RandomForest (RF) was used for predicting treatment outcome phenotypes in a 10-fold cross-validation. The full model with 47 predictors yielded an accuracy of 75.0%. When the number of predictors was reduced to 15, accuracies between 67.6% and 71.0% were attained for different test sets. The most informative predictors of treatment outcome were baseline MADRS score for the current episode; impairment of family, social, and work life; the timespan between first and last depressive episode; severity; suicidal risk; age; body mass index; and the number of lifetime depressive episodes as well as lifetime duration of hospitalization. With the application of the machine learning algorithm RF, an efficient prediction model with an accuracy of 75.0% for forecasting treatment outcome could be generated, thus surpassing the predictive capabilities of clinical evaluation. We also supply a simplified algorithm of 15 easily collected clinical and sociodemographic predictors that can be obtained within approximately 10 minutes, which reached an accuracy of 70.6%. Thus, we are confident that our model will be validated within other samples to advance an accurate prediction model fit for clinical usage in TRD. © Copyright 2017 Physicians Postgraduate Press, Inc.
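The evaluation protocol (10-fold cross-validated RF accuracy for the full predictor set versus a reduced set of the 15 most informative predictors) can be sketched as below on synthetic data; note that ranking the predictors on the full dataset before cross-validating, as done here for brevity, is slightly optimistic compared with a fully nested selection.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the 47 clinical/sociodemographic predictors (1 = TRD).
X, y = make_classification(n_samples=552, n_features=47, n_informative=10,
                           weights=[0.6, 0.4], random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
rf = RandomForestClassifier(n_estimators=500, random_state=0)

acc_full = cross_val_score(rf, X, y, cv=cv).mean()

# Rank predictors once on all data (for brevity; nested selection would be cleaner),
# then re-evaluate with the 15 most informative ones.
rf.fit(X, y)
top15 = np.argsort(rf.feature_importances_)[::-1][:15]
acc_reduced = cross_val_score(rf, X[:, top15], y, cv=cv).mean()

print(f"10-fold accuracy, 47 predictors: {acc_full:.1%}")
print(f"10-fold accuracy, 15 predictors: {acc_reduced:.1%}")
```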
Triviño, Maria; Thuiller, Wilfried; Cabeza, Mar; Hickler, Thomas; Araújo, Miguel B.
2011-01-01
Although climate is known to be one of the key factors determining animal species distributions, projections of global change impacts on these distributions often rely on bioclimatic envelope models. Vegetation structure and landscape configuration are also key determinants of distributions, but they are rarely considered in such assessments. We explore the consequences of using simulated vegetation structure and composition as well as its associated landscape configuration in models projecting global change effects on Iberian bird species distributions. Both present-day and future distributions were modelled for 168 bird species using two ensemble forecasting methods: Random Forests (RF) and Boosted Regression Trees (BRT). For each species, several models were created, differing in the predictor variables used (climate, vegetation, and landscape configuration). Discrimination ability of each model in the present-day was then tested with four commonly used evaluation methods (AUC, TSS, specificity and sensitivity). The different sets of predictor variables yielded similar spatial patterns for well-modelled species, but the future projections diverged for poorly-modelled species. Models using all predictor variables were not significantly better than models fitted with climate variables alone for ca. 50% of the cases. Moreover, models fitted with climate data were always better than models fitted with landscape configuration variables, and vegetation variables were found to correlate with bird species distributions in 26–40% of the cases with BRT, and in 1–18% of the cases with RF. We conclude that improvements from including vegetation and its landscape configuration variables in comparison with climate-only variables might not always be as great as expected for future projections of Iberian bird species. PMID:22216263
NASA Astrophysics Data System (ADS)
Weng, Shizhuang; Dong, Ronglu; Zhu, Zede; Zhang, Dongyan; Zhao, Jinling; Huang, Linsheng; Liang, Dong
2018-01-01
Conventional surface-enhanced Raman spectroscopy (SERS) for fast detection of drugs in urine on a portable Raman spectrometer remains challenging because of low sensitivity, unreliable Raman signals, and spectral processing that requires manual intervention. Here, we develop a novel detection method of drugs in urine using chemometric methods and dynamic SERS (D-SERS) with mPEG-SH coated gold nanorods (GNRs). D-SERS combined with the uniform GNRs achieves strong enhancement, and the signal is highly reproducible. On the basis of these advantages, we obtained the spectra of urine, urine with methamphetamine (MAMP), and urine with 3,4-methylenedioxymethamphetamine (MDMA) using D-SERS. Simultaneously, chemometric methods were introduced for the intelligent and automatic analysis of spectra. First, the spectra at the critical state were selected using K-means. Then, the spectra were processed by random forest (RF) with feature selection and principal component analysis (PCA) to develop the recognition model. The identification accuracies of the models were 100%, 98.7% and 96.7%, respectively. To further validate the practical effect, drug abusers' urine samples with 0.4, 3, and 30 ppm MAMP were detected using D-SERS and identified by the classification model. The high recognition accuracy of > 92.0% can meet the demand of practical application. Additionally, the parameter optimization of the RF classification model was simple. Compared with the general laboratory method, the detection of urine spectra using D-SERS needs only 2 min and a 2 μL sample volume, and the identification of spectra based on chemometric methods can be finished in seconds. It is verified that the proposed approach can provide accurate, convenient and rapid detection of drugs in urine.
Kracalik, Ian T; Kenu, Ernest; Ayamdooh, Evans Nsoh; Allegye-Cudjoe, Emmanuel; Polkuu, Paul Nokuma; Frimpong, Joseph Asamoah; Nyarko, Kofi Mensah; Bower, William A; Traxler, Rita; Blackburn, Jason K
2017-10-01
Anthrax is hyper-endemic in West Africa. Despite the effectiveness of livestock vaccines in controlling anthrax, underreporting, logistics, and limited resources make implementing vaccination campaigns difficult. To better understand the geographic limits of anthrax, elucidate environmental factors related to its occurrence, and identify human and livestock populations at risk, we developed predictive models of the environmental suitability of anthrax in Ghana. We obtained data on the location and date of livestock anthrax from veterinary and outbreak response records in Ghana during 2005-2016, as well as livestock vaccination registers and population estimates of characteristically high-risk groups. To predict the environmental suitability of anthrax, we used an ensemble of random forest (RF) models built using a combination of climatic and environmental factors. From 2005 through the first six months of 2016, there were 67 anthrax outbreaks (851 cases) in livestock; outbreaks showed a seasonal peak during February through April and primarily involved cattle. There was a median of 19,709 vaccine doses [range: 0-175 thousand] administered annually. Results from the RF model suggest a marked ecological divide separating the broad areas of environmental suitability in northern Ghana from the southern part of the country. Increasing alkaline soil pH was associated with a higher probability of anthrax occurrence. We estimated 2.2 (95% CI: 2.0, 2.5) million livestock and 805 (95% CI: 519, 890) thousand low income rural livestock keepers were located in anthrax risk areas. Based on our estimates, the current anthrax vaccination efforts in Ghana cover a fraction of the livestock potentially at risk, thus control efforts should be focused on improving vaccine coverage among high risk groups.
NASA Astrophysics Data System (ADS)
Aslan, N.; Koc-San, D.
2016-06-01
The main objectives of this study are (i) to calculate Land Surface Temperature (LST) from Landsat imagery, (ii) to determine the urban heat island (UHI) effects from Landsat 7 ETM+ (June 5, 2001) and Landsat 8 OLI (June 17, 2014) imagery, and (iii) to examine the relationship between LST and different Land Use/Land Cover (LU/LC) types for the years 2001 and 2014. The study is implemented in the central districts of Antalya. Initially, the brightness temperatures are retrieved and the LST values are calculated from Landsat thermal images. Then, the LU/LC maps are created from Landsat pan-sharpened images using the Random Forest (RF) classifier. A Normalized Difference Vegetation Index (NDVI) image, the ASTER Global Digital Elevation Model (GDEM) and DMSP-OLS nighttime lights data are used as auxiliary data during the classification procedure. Finally, the UHI effect is determined and the LST values are compared with LU/LC classes. The overall accuracies of the RF classification results were computed to be higher than 88% for both Landsat images. During the 13-year time interval, it was observed that the urban and industrial areas increased significantly. Maximum LST values were detected for the dry agriculture, urban, and bareland classes, while minimum LST values were detected for the vegetation and irrigated agriculture classes. The UHI effect was computed as 5.6 °C for 2001 and 6.8 °C for 2014. The validity of the study results was assessed using MODIS/Terra LST and Emissivity data and it was found that there is a high correlation between Landsat LST and MODIS LST data (r2 = 0.7 and r2 = 0.9 for 2001 and 2014, respectively).
Zhang, He-Hua; Yang, Liuyang; Liu, Yuchuan; Wang, Pin; Yin, Jun; Li, Yongming; Qiu, Mingguo; Zhu, Xueru; Yan, Fang
2016-11-16
The use of speech-based data in the classification of Parkinson disease (PD) has been shown to provide an effective, non-invasive mode of classification in recent years. Thus, there has been an increased interest in speech pattern analysis methods applicable to Parkinsonism for building predictive tele-diagnosis and tele-monitoring models. One of the obstacles in optimizing classifications is to reduce noise within the collected speech samples, thus ensuring better classification accuracy and stability. While the currently used methods are effective, the ability to invoke instance selection has seldom been examined. In this study, a PD classification algorithm was proposed and examined that combines a multi-edit-nearest-neighbor (MENN) algorithm and an ensemble learning algorithm. First, the MENN algorithm is applied for selecting optimal training speech samples iteratively, thereby obtaining samples with high separability. Next, an ensemble learning algorithm, random forest (RF) or decorrelated neural network ensembles (DNNE), is trained on the selected training samples. Lastly, the trained ensemble learning algorithms are applied to the test samples for PD classification. The proposed method was examined using recently deposited public datasets and compared against other currently used algorithms for validation. Experimental results showed that the proposed algorithm obtained the highest degree of improved classification accuracy (29.44%) compared with the other algorithms examined. Furthermore, the MENN algorithm alone was found to improve classification accuracy by as much as 45.72%. Moreover, the proposed algorithm was found to exhibit higher stability, particularly when combining the MENN and RF algorithms. This study showed that the proposed method could improve PD classification when using speech data and can be applied to future studies seeking to improve PD classification methods.
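A simplified sketch of the MENN-plus-RF idea is shown below: a basic multi-edit loop removes training samples misclassified by a 1-NN classifier trained on a different block, and an RF is then trained on the edited set. The editing variant, block count and synthetic noisy-label data are assumptions, not the authors' exact procedure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1200, n_features=20, n_informative=8,
                           flip_y=0.15, random_state=0)         # noisy labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

def menn_edit(X, y, n_blocks=3, max_iter=10, seed=0):
    """Basic multi-edit loop: drop samples misclassified by 1-NN fitted on another block."""
    local_rng = np.random.default_rng(seed)
    keep = np.arange(len(y))
    for _ in range(max_iter):
        order = local_rng.permutation(len(keep))
        blocks = np.array_split(keep[order], n_blocks)
        bad = []
        for i, block in enumerate(blocks):
            ref = blocks[(i + 1) % n_blocks]                    # classify block i with block i+1
            knn = KNeighborsClassifier(n_neighbors=1).fit(X[ref], y[ref])
            bad.extend(block[knn.predict(X[block]) != y[block]])
        if not bad:
            break
        keep = np.setdiff1d(keep, np.array(bad))
    return keep

kept = menn_edit(X_tr, y_tr)
rf_raw = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
rf_menn = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr[kept], y_tr[kept])

print("kept", len(kept), "of", len(y_tr), "training samples after editing")
print("RF accuracy, raw training set:  ", rf_raw.score(X_te, y_te))
print("RF accuracy, MENN-edited set:   ", rf_menn.score(X_te, y_te))
```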
Hu, Yongli; Hase, Takeshi; Li, Hui Peng; Prabhakar, Shyam; Kitano, Hiroaki; Ng, See Kiong; Ghosh, Samik; Wee, Lawrence Jin Kiat
2016-12-22
The ability to sequence the transcriptomes of single cells using single-cell RNA-seq technologies represents a paradigm shift: scientists are now able to investigate the complex biology of a heterogeneous population of cells, one cell at a time. However, to date, there has not been a suitable computational methodology for the analysis of such an intricate deluge of data, in particular techniques that aid the identification of the transcriptomic differences between cellular subtypes. In this paper, we describe a novel methodology for the analysis of single-cell RNA-seq data, obtained from neocortical cells and neural progenitor cells, using machine learning algorithms (Support Vector Machine (SVM) and Random Forest (RF)). Thirty-eight key transcripts were identified, using the SVM-based recursive feature elimination (SVM-RFE) method of feature selection, to best differentiate developing neocortical cells from neural progenitor cells in the SVM and RF classifiers built. Also, these genes possessed a higher discriminative power (enhanced prediction accuracy) as compared with commonly used statistical techniques or geneset-based approaches. Further downstream network reconstruction analysis was carried out to unravel hidden general regulatory networks, where novel interactions could be further validated in wet-lab experiments and serve as useful candidate targets for the treatment of neuronal developmental diseases. The novel approach reported here is able to identify transcripts, with reported neuronal involvement, that optimally differentiate neocortical cells and neural progenitor cells. It is believed to be extensible and applicable to other single-cell RNA-seq expression profiles, such as the study of cancer progression and treatment within a highly heterogeneous tumour.
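The SVM-RFE step can be sketched with scikit-learn as follows; the synthetic expression matrix, the choice of 38 retained features (mirroring the abstract) and the RFE step size are assumptions, and selecting features on the full dataset before cross-validating, done here for brevity, is optimistic compared with nested selection.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic "cells x transcripts" matrix with a binary cell-type label.
X, y = make_classification(n_samples=200, n_features=2000, n_informative=40,
                           random_state=0)

# SVM-RFE: repeatedly fit a linear SVM and drop the lowest-weight transcripts.
X_std = StandardScaler().fit_transform(X)
rfe = RFE(estimator=SVC(kernel="linear"), n_features_to_select=38, step=0.1)
rfe.fit(X_std, y)
selected = np.where(rfe.support_)[0]
print("selected transcripts (column indices):", selected[:10], "...")

for name, clf in [("SVM", make_pipeline(StandardScaler(), SVC(kernel="linear"))),
                  ("RF", RandomForestClassifier(n_estimators=500, random_state=0))]:
    acc = cross_val_score(clf, X[:, selected], y, cv=5).mean()
    print(f"{name} accuracy on the 38 selected features: {acc:.2%}")
```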
van der Ploeg, Tjeerd; Nieboer, Daan; Steyerberg, Ewout W
2016-10-01
Prediction of medical outcomes may potentially benefit from using modern statistical modeling techniques. We aimed to externally validate modeling strategies for prediction of 6-month mortality of patients suffering from traumatic brain injury (TBI) with predictor sets of increasing complexity. We analyzed individual patient data from 15 different studies including 11,026 TBI patients. We consecutively considered a core set of predictors (age, motor score, and pupillary reactivity), an extended set with computed tomography scan characteristics, and a further extension with two laboratory measurements (glucose and hemoglobin). With each of these sets, we predicted 6-month mortality using default settings with five statistical modeling techniques: logistic regression (LR), classification and regression trees, random forests (RFs), support vector machines (SVM) and neural nets. For external validation, a model developed on one of the 15 data sets was applied to each of the 14 remaining sets. This process was repeated 15 times for a total of 630 validations. The area under the receiver operating characteristic curve (AUC) was used to assess the discriminative ability of the models. For the most complex predictor set, the LR models performed best (median validated AUC value, 0.757), followed by RF and support vector machine models (median validated AUC value, 0.735 and 0.732, respectively). With each predictor set, the classification and regression trees models showed poor performance (median validated AUC value, <0.7). The variability in performance across the studies was smallest for the RF- and LR-based models (inter quartile range for validated AUC values from 0.07 to 0.10). In the area of predicting mortality from TBI, nonlinear and nonadditive effects are not pronounced enough to make modern prediction methods beneficial. Copyright © 2016 Elsevier Inc. All rights reserved.
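The cross-study validation scheme (develop a model on one study, validate it on each of the others, and summarise by the median AUC) can be sketched as follows; the 15 synthetic "studies", the predictors and the outcome model are assumptions, and only LR and RF are included for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
studies = []
for s in range(15):
    n = rng.integers(300, 1200)
    X = rng.normal(size=(n, 5))                  # e.g. age, motor score, pupils, glucose, Hb (toy)
    logit = 0.8 * X[:, 0] - 0.6 * X[:, 1] + rng.normal(0, 0.3 + 0.1 * s, n)  # study-specific noise
    y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
    studies.append((X, y))

models = {"LR": LogisticRegression(max_iter=1000),
          "RF": RandomForestClassifier(n_estimators=300, random_state=0)}

for name, model in models.items():
    aucs = []
    for i, (X_dev, y_dev) in enumerate(studies):          # develop on study i ...
        model.fit(X_dev, y_dev)
        for j, (X_val, y_val) in enumerate(studies):      # ... validate on every other study
            if i != j:
                aucs.append(roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))
    print(f"{name}: median validated AUC = {np.median(aucs):.3f} over {len(aucs)} validations")
```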
NASA Astrophysics Data System (ADS)
Naghibi, Seyed Amir; Pourghasemi, Hamid Reza; Abbaspour, Karim
2018-02-01
Considering the unstable condition of water resources in Iran and many other countries in arid and semi-arid regions, groundwater studies are very important. Therefore, the aim of this study is to model groundwater potential using qanat locations as indicators and ten advanced and soft computing models applied to the Beheshtabad Watershed, Iran. A qanat is a man-made underground construction that gathers groundwater from higher altitudes and transmits it to lowland areas where it can be used for different purposes. For this purpose, the locations of the qanats were first detected using extensive field surveys. These qanats were divided into two datasets: training (70%) and validation (30%). Then, 14 influence factors depicting the region's physical, morphological, lithological, and hydrological features were identified to model groundwater potential. Linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), flexible discriminant analysis (FDA), penalized discriminant analysis (PDA), boosted regression tree (BRT), random forest (RF), artificial neural network (ANN), K-nearest neighbor (KNN), multivariate adaptive regression splines (MARS), and support vector machine (SVM) models were applied in R scripts to produce groundwater potential maps. For evaluation of the performance of the developed models, the ROC curve and kappa index were implemented. According to the results, RF had the best performance, followed by the SVM and BRT models. Our results showed that qanat locations could be used as a good indicator for groundwater potential. Furthermore, altitude, slope, plan curvature, and profile curvature were found to be the most important influence factors. On the other hand, lithology, land use, and slope aspect were the least significant factors. The methodology in the current study could be used by land use and terrestrial planners and water resource managers to reduce the costs of groundwater resource discovery.
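A hedged sketch of the model comparison on a 70/30 split, scored by ROC-AUC and the kappa index, is given below; only the models with direct scikit-learn analogues are included, and the synthetic predictors merely stand in for the 14 conditioning factors (1 = qanat presence).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import cohen_kappa_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1500, n_features=14, n_informative=7, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "BRT": GradientBoostingClassifier(random_state=0),   # gradient boosting as a BRT analogue
    "RF": RandomForestClassifier(n_estimators=500, random_state=0),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(probability=True, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    kappa = cohen_kappa_score(y_te, model.predict(X_te))
    print(f"{name:>4s}:  AUC = {auc:.3f}   kappa = {kappa:.3f}")
```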
NASA Astrophysics Data System (ADS)
Tamimi, E.; Ebadi, H.; Kiani, A.
2017-09-01
Automatic building detection from High Spatial Resolution (HSR) images is one of the most important issues in Remote Sensing (RS). Due to the limited number of spectral bands in HSR images, using other features will improve accuracy. However, adding features increases the probability of including dependent features, which reduces accuracy. In addition, some parameters should be determined in Support Vector Machine (SVM) classification. Therefore, it is necessary to simultaneously determine classification parameters and select independent features according to image type. An optimization algorithm is an efficient method to solve this problem. On the other hand, pixel-based classification faces several challenges, such as producing salt-and-pepper results and high computational time for high-dimensional data. Hence, in this paper, a novel method is proposed to optimize object-based SVM classification by applying the continuous Ant Colony Optimization (ACO) algorithm. The advantages of the proposed method are a relatively high automation level, independence of image scene and type, reduced post-processing for building edge reconstruction, and accuracy improvement. The proposed method was evaluated against pixel-based SVM and Random Forest (RF) classification in terms of accuracy. In comparison with optimized pixel-based SVM classification, the results showed that the proposed method improved the quality factor and overall accuracy by 17% and 10%, respectively. Also, the Kappa coefficient of the proposed method was improved by 6% relative to RF classification. The processing time of the proposed method was relatively low because the unit of image analysis was the image object. These results show the superiority of the proposed method in terms of time and accuracy.
Allegye-Cudjoe, Emmanuel; Polkuu, Paul Nokuma; Frimpong, Joseph Asamoah; Nyarko, Kofi Mensah; Bower, William A.; Traxler, Rita
2017-01-01
Anthrax is hyper-endemic in West Africa. Despite the effectiveness of livestock vaccines in controlling anthrax, underreporting, logistics, and limited resources make implementing vaccination campaigns difficult. To better understand the geographic limits of anthrax, elucidate environmental factors related to its occurrence, and identify human and livestock populations at risk, we developed predictive models of the environmental suitability of anthrax in Ghana. We obtained data on the location and date of livestock anthrax from veterinary and outbreak response records in Ghana during 2005–2016, as well as livestock vaccination registers and population estimates of characteristically high-risk groups. To predict the environmental suitability of anthrax, we used an ensemble of random forest (RF) models built using a combination of climatic and environmental factors. From 2005 through the first six months of 2016, there were 67 anthrax outbreaks (851 cases) in livestock; outbreaks showed a seasonal peak during February through April and primarily involved cattle. There was a median of 19,709 vaccine doses [range: 0–175 thousand] administered annually. Results from the RF model suggest a marked ecological divide separating the broad areas of environmental suitability in northern Ghana from the southern part of the country. Increasingly alkaline soil pH was associated with a higher probability of anthrax occurrence. We estimated 2.2 (95% CI: 2.0, 2.5) million livestock and 805 (95% CI: 519, 890) thousand low-income rural livestock keepers were located in anthrax risk areas. Based on our estimates, the current anthrax vaccination efforts in Ghana cover only a fraction of the livestock potentially at risk; thus, control efforts should focus on improving vaccine coverage among high-risk groups. PMID:29028799
Analysis of emotionality and locomotion in radio-frequency electromagnetic radiation exposed rats.
Narayanan, Sareesh Naduvil; Kumar, Raju Suresh; Paval, Jaijesh; Kedage, Vivekananda; Bhat, M Shankaranarayana; Nayak, Satheesha; Bhat, P Gopalakrishna
2013-07-01
In the current study, the modulatory role of mobile phone radio-frequency electromagnetic radiation (RF-EMR) on emotionality and locomotion was evaluated in adolescent rats. Male albino Wistar rats (6-8 weeks old) were randomly assigned to the following groups, with 12 animals in each group. Group I (Control): they remained in the home cage throughout the experimental period. Group II (Sham exposed): they were exposed to a mobile phone in switch-off mode for 28 days. Group III (RF-EMR exposed): they were exposed to RF-EMR (900 MHz) from an active GSM (Global System for Mobile Communications) mobile phone with a peak power density of 146.60 μW/cm² for 28 days. On the 29th day, the animals were tested for emotionality and locomotion. The elevated plus maze (EPM) test revealed that the percentage of entries into the open arm, the percentage of time spent on the open arm and the distance travelled on the open arm were significantly reduced in the RF-EMR exposed rats. Rearing frequency and grooming frequency were also decreased in the RF-EMR exposed rats. The defecation boli count during the EPM test was higher in the RF-EMR group. No statistically significant difference was found in total distance travelled, total arm entries, percentage of closed arm entries or parallelism index in the RF-EMR exposed rats compared to controls. Results indicate that mobile phone radiation could affect the emotionality of rats without affecting general locomotion.
Röösli, Martin; Frei, Patrizia; Bolte, John; Neubauer, Georg; Cardis, Elisabeth; Feychting, Maria; Gajsek, Peter; Heinrich, Sabine; Joseph, Wout; Mann, Simon; Martens, Luc; Mohler, Evelyn; Parslow, Roger C; Poulsen, Aslak Harbo; Radon, Katja; Schüz, Joachim; Thuroczy, György; Viel, Jean-François; Vrijheid, Martine
2010-05-20
The development of new wireless communication technologies that emit radio frequency electromagnetic fields (RF-EMF) is ongoing, but little is known about the RF-EMF exposure distribution in the general population. Previous attempts to measure personal exposure to RF-EMF have used different measurement protocols and analysis methods making comparisons between exposure situations across different study populations very difficult. As a result, observed differences in exposure levels between study populations may not reflect real exposure differences but may be in part, or wholly due to methodological differences. The aim of this paper is to develop a study protocol for future personal RF-EMF exposure studies based on experience drawn from previous research. Using the current knowledge base, we propose procedures for the measurement of personal exposure to RF-EMF, data collection, data management and analysis, and methods for the selection and instruction of study participants. We have identified two basic types of personal RF-EMF measurement studies: population surveys and microenvironmental measurements. In the case of a population survey, the unit of observation is the individual and a randomly selected representative sample of the population is needed to obtain reliable results. For microenvironmental measurements, study participants are selected in order to represent typical behaviours in different microenvironments. These two study types require different methods and procedures. Applying our proposed common core procedures in future personal measurement studies will allow direct comparisons of personal RF-EMF exposures in different populations and study areas.
Application of lifting wavelet and random forest in compound fault diagnosis of gearbox
NASA Astrophysics Data System (ADS)
Chen, Tang; Cui, Yulian; Feng, Fuzhou; Wu, Chunzhi
2018-03-01
Because the compound fault signatures of an armored vehicle gearbox are weak and the fault types are difficult to identify, a fault diagnosis method based on the lifting wavelet and random forest is proposed. First, the method uses the lifting wavelet transform to decompose the original vibration signal over multiple levels and reconstructs the low-frequency and high-frequency components obtained at each level to produce multiple component signals. Time-domain feature parameters are then computed for each component signal to form feature vectors, which are input into a random forest classifier to determine the compound fault type. Finally, the method is verified on a variety of compound fault data from a gearbox fault simulation test platform; the results show that the recognition accuracy of the combined lifting wavelet and random forest approach reaches 99.99%.
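A rough sketch of this pipeline is given below, assuming the PyWavelets and scikit-learn packages. A standard discrete wavelet decomposition stands in for the lifting scheme, and time-domain descriptors are computed directly on the coefficient bands rather than on reconstructed components; signals, labels and parameter choices are synthetic placeholders.

```python
import numpy as np
import pywt
from sklearn.ensemble import RandomForestClassifier

def time_domain_features(sig):
    # simple time-domain descriptors for one component
    return [np.mean(sig), np.std(sig), np.sqrt(np.mean(sig ** 2)),
            np.max(np.abs(sig)), np.mean(np.abs(sig))]

def feature_vector(signal, wavelet="db4", level=3):
    # multi-level decomposition; one set of descriptors per band
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    feats = []
    for c in coeffs:
        feats.extend(time_domain_features(c))
    return feats

rng = np.random.default_rng(0)
signals = rng.normal(size=(200, 1024))   # stand-in vibration signals
labels = rng.integers(0, 3, size=200)    # stand-in compound-fault classes

X = np.array([feature_vector(s) for s in signals])
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, labels)
```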
NASA Astrophysics Data System (ADS)
Wu, J.; Yao, W.; Zhang, J.; Li, Y.
2018-04-01
Labeling 3D point cloud data with traditional supervised learning methods requires a considerable number of labelled samples, the collection of which is costly and time-consuming. This work adopts the domain adaptation concept to transfer existing trained random forest classifiers (based on a source domain) to new data scenes (target domain), with the aim of reducing the dependence of accurate 3D semantic labeling of point clouds on training samples from the new data scene. First, two random forest classifiers were trained with existing samples previously collected for other data; they differed in the decision tree construction algorithm used: C4.5 with the information gain ratio and CART with the Gini index. Second, four random forest classifiers adapted to the target domain were derived by transferring each tree in the source random forest models with two types of operations: structure expansion and reduction (SER) and structure transfer (STRUT). Finally, points in the target domain were labelled by fusing the four newly derived random forest classifiers using a weights-of-evidence fusion model. To validate our method, experimental analysis was conducted using three datasets: one was used as the source domain data (Vaihingen data for 3D Semantic Labelling); the other two were used as target domain data from two cities in China (Jinmen city and Dunhuang city). Overall accuracies of 85.5 % and 83.3 % for 3D labelling were achieved for the Jinmen city and Dunhuang city data respectively, with only one third of the newly labelled samples required in the cases without domain adaptation.
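Only the final fusion step lends itself to a compact sketch. Below, assuming scikit-learn, four already-fitted random forests are combined by weighted averaging of their class probabilities on target-domain points; the forests, the data and the weights are placeholders for the domain-adapted models and weights-of-evidence values described in the text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=10, n_classes=3,
                           n_informative=6, random_state=2)
X_src, y_src, X_tgt, y_tgt = X[:400], y[:400], X[400:], y[400:]

# stand-ins for the four domain-adapted forests
forests = [RandomForestClassifier(n_estimators=100, random_state=i).fit(X_src, y_src)
           for i in range(4)]
weights = np.array([0.3, 0.3, 0.2, 0.2])   # placeholder evidence weights

# weighted soft-voting fusion over class probabilities
proba = sum(w * f.predict_proba(X_tgt) for w, f in zip(weights, forests))
pred = proba.argmax(axis=1)
print("target-domain accuracy:", (pred == y_tgt).mean())
```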
Carlos Alberto Silva; Carine Klauberg; Andrew Thomas Hudak; Lee Alexander Vierling; Wan Shafrina Wan Mohd Jaafar; Midhun Mohan; Mariano Garcia; Antonio Ferraz; Adrian Cardil; Sassan Saatchi
2017-01-01
Improvements in the management of pine plantations result in multiple industrial and environmental benefits. Remote sensing techniques can dramatically increase the efficiency of plantation management by reducing or replacing time-consuming field sampling. We tested the utility and accuracy of combining field and airborne lidar data with Random Forest, a supervised...
Uncertainty in Random Forests: What does it mean in a spatial context?
NASA Astrophysics Data System (ADS)
Klump, Jens; Fouedjio, Francky
2017-04-01
Geochemical surveys are an important part of exploration for mineral resources and in environmental studies. The samples and chemical analyses are often laborious and difficult to obtain and therefore come at a high cost. As a consequence, these surveys are characterised by datasets with large numbers of variables but relatively few data points when compared to conventional big data problems. With more remote sensing platforms and sensor networks being deployed, large volumes of auxiliary data of the surveyed areas are becoming available. The use of these auxiliary data has the potential to improve the prediction of chemical element concentrations over the whole study area. Kriging is a well established geostatistical method for the prediction of spatial data but requires significant pre-processing and makes some basic assumptions about the underlying distribution of the data. Some machine learning algorithms, on the other hand, may require less data pre-processing and are non-parametric. In this study we used a dataset provided by Kirkwood et al. [1] to explore the potential use of Random Forest in geochemical mapping. We chose Random Forest because it is a well understood machine learning method and has the advantage that it provides us with a measure of uncertainty. By comparing Random Forest to Kriging we found that both methods produced comparable maps of estimated values for our variables of interest. Kriging outperformed Random Forest for variables of interest with relatively strong spatial correlation. The measure of uncertainty provided by Random Forest seems to be quite different to the measure of uncertainty provided by Kriging. In particular, the lack of spatial context can give misleading results in areas without ground truth data. In conclusion, our preliminary results show that the model driven approach in geostatistics gives us more reliable estimates for our target variables than Random Forest for variables with relatively strong spatial correlation. However, in cases of weak spatial correlation Random Forest, as a nonparametric method, may give the better results once we have a better understanding of the meaning of its uncertainty measures in a spatial context. References [1] Kirkwood, C., M. Cave, D. Beamish, S. Grebby, and A. Ferreira (2016), A machine learning approach to geochemical mapping, Journal of Geochemical Exploration, 163, 28-40, doi:10.1016/j.gexplo.2016.05.003.
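One simple way to obtain the Random Forest uncertainty measure discussed above, for comparison against a kriging variance surface, is the spread of per-tree predictions. The sketch below assumes scikit-learn and uses synthetic coordinates and a synthetic target; the original study used the Kirkwood et al. geochemical dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.uniform(size=(300, 2))                          # e.g. coordinates / auxiliary covariates
y = np.sin(6 * X[:, 0]) + 0.3 * rng.normal(size=300)    # element concentration stand-in

rf = RandomForestRegressor(n_estimators=500, random_state=3).fit(X, y)

X_new = rng.uniform(size=(50, 2))                       # prediction locations
tree_preds = np.stack([t.predict(X_new) for t in rf.estimators_])
mean_pred = tree_preds.mean(axis=0)                     # point prediction
spread = tree_preds.std(axis=0)                         # crude, non-spatial uncertainty proxy
```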
Hiruy, Hiwot; Fuchs, Edward J.; Marzinke, Mark A.; Bakshi, Rahul P.; Breakey, Jennifer C.; Aung, Wutyi S.; Manohar, Madhuri; Yue, Chen; Caffo, Brian S.; Du, Yong; Abebe, Kaleab Z.; Spiegel, Hans M.L.; Rohan, Lisa C.; McGowan, Ian
2015-01-01
Abstract CHARM-02 is a crossover, double-blind, randomized trial to compare the safety and pharmacokinetics of three rectally applied tenofovir 1% gel candidate rectal microbicides of varying osmolalities: vaginal formulation (VF) (3111 mOsmol/kg), the reduced glycerin vaginal formulation (RGVF) (836 mOsmol/kg), and an isoosmolal rectal-specific formulation (RF) (479 mOsmol/kg). Participants (n = 9) received a single, 4 ml, radiolabeled dose of each gel twice, once with and once without simulated unprotected receptive anal intercourse (RAI). The safety, plasma tenofovir pharmacokinetics, colonic small molecule permeability, and SPECT/CT imaging of lower gastrointestinal distribution of drug and virus surrogate were assessed. There were no Grade 3 or 4 adverse events reported for any of the products. Overall, there were more Grade 2 adverse events in the VF group compared to RF (p = 0.006) and RGVF (p = 0.048). In the absence of simulated unprotected RAI, VF had up to 3.8-fold greater systemic tenofovir exposure, 26- to 234-fold higher colonic permeability of the drug surrogate, and 1.5- to 2-fold greater proximal migration in the colonic lumen, when compared to RF and RGVF. Similar trends were observed with simulated unprotected RAI, but most did not reach statistical significance. SPECT analysis showed 86% (standard deviation 19%) of the drug surrogate colocalized with the virus surrogate in the colonic lumen. There were no significant differences between the RGVF and RF formulation, with the exception of a higher plasma tenofovir concentration of RGVF in the absence of simulated unprotected RAI. VF had the most adverse events, highest plasma tenofovir concentrations, greater mucosal permeability of the drug surrogate, and most proximal colonic luminal migration compared to RF and RGVF formulations. There were no major differences between RF and RGVF formulations. Simultaneous assessment of toxicity, systemic and luminal pharmacokinetics, and colocalization of drug and viral surrogates substantially informs rectal microbicide product development. PMID:26227279
Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies
Theis, Fabian J.
2017-01-01
Epidemiological studies often utilize stratified data in which rare outcomes or exposures are artificially enriched. This design can increase precision in association tests but distorts predictions when applying classifiers on nonstratified data. Several methods correct for this so-called sample selection bias, but their performance remains unclear, especially for machine learning classifiers. With an emphasis on two-phase case-control studies, we aim to assess which corrections to perform in which setting and to obtain methods suitable for machine learning techniques, especially the random forest. We propose two new resampling-based methods to resemble the original data and covariance structure: stochastic inverse-probability oversampling and parametric inverse-probability bagging. We compare all techniques for the random forest and other classifiers, both theoretically and on simulated and real data. Empirical results show that the random forest profits only from the parametric inverse-probability bagging proposed by us. For other classifiers, correction is mostly advantageous, and the methods perform uniformly. We discuss the consequences of inappropriate distribution assumptions and the reasons for the different behaviors of the random forest and other classifiers. In conclusion, we provide guidance for choosing correction methods when training classifiers on biased samples. For random forests, our method outperforms state-of-the-art procedures if distribution assumptions are roughly fulfilled. We provide our implementation in the R package sambia. PMID:29312464
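The inverse-probability bagging idea can be sketched as follows, assuming scikit-learn and a known stratified sampling design: each bootstrap replicate draws observations with probability proportional to the inverse of their inclusion probability, and one forest is fit per replicate. This is a simplified nonparametric sketch; the paper's parametric variant additionally models the covariate distribution, and the sambia package implements the actual method.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
n = 1000
X = rng.normal(size=(n, 5))
y = (X[:, 0] + rng.normal(size=n) > 1.2).astype(int)   # rare outcome
incl_prob = np.where(y == 1, 0.9, 0.2)                 # cases oversampled by design
w = 1.0 / incl_prob
w /= w.sum()

members = []
for b in range(25):
    idx = rng.choice(n, size=n, replace=True, p=w)     # inverse-probability bootstrap
    members.append(RandomForestClassifier(n_estimators=50,
                                          random_state=b).fit(X[idx], y[idx]))

def predict_proba_pos(X_new):
    # average the positive-class probability over the bagged members
    return np.mean([m.predict_proba(X_new)[:, 1] for m in members], axis=0)
```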
Esmaily, Habibollah; Tayefi, Maryam; Doosti, Hassan; Ghayour-Mobarhan, Majid; Nezami, Hossein; Amirabadizadeh, Alireza
2018-04-24
We aimed to identify risk factors associated with type 2 diabetes mellitus (T2DM) using a data mining approach, decision tree and random forest techniques, with data from the Mashhad Stroke and Heart Atherosclerotic Disorders (MASHAD) Study program. A cross-sectional study. The MASHAD study started in 2010 and will continue until 2020. Two data mining tools, namely decision trees and random forests, are used to predict T2DM from other observed characteristics of 9528 subjects recruited from the MASHAD database. This paper compares the two models in terms of accuracy, sensitivity, specificity and the area under the ROC curve. The prevalence rate of T2DM was 14% among these subjects. The decision tree model has 64.9% accuracy, 64.5% sensitivity, 66.8% specificity, and an area under the ROC curve of 68.6%, while the random forest model has 71.1% accuracy, 71.3% sensitivity, 69.9% specificity, and an area under the ROC curve of 77.3%. The random forest model, when used with demographic, clinical, anthropometric and biochemical measurements, can provide a simple tool to identify risk factors associated with type 2 diabetes. Such identification can be of substantial use in shaping health policy to reduce the number of subjects with T2DM.
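The model comparison reported above amounts to fitting both classifiers and scoring them on held-out data. The snippet below is an illustrative sketch assuming scikit-learn, with synthetic data at roughly the reported prevalence; thresholds, hyperparameters and variables are not those of the MASHAD analysis.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, n_features=12, weights=[0.86, 0.14],
                           random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=5)

for name, model in [("decision tree", DecisionTreeClassifier(max_depth=5, random_state=5)),
                    ("random forest", RandomForestClassifier(n_estimators=300, random_state=5))]:
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(name, "sens=%.2f spec=%.2f auc=%.2f" % (tp / (tp + fn), tn / (tn + fp), auc))
```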
Applications of random forest feature selection for fine-scale genetic population assignment.
Sylvester, Emma V A; Bentzen, Paul; Bradbury, Ian R; Clément, Marie; Pearce, Jon; Horne, John; Beiko, Robert G
2018-02-01
Genetic population assignment used to inform wildlife management and conservation efforts requires panels of highly informative genetic markers and sensitive assignment tests. We explored the utility of machine-learning algorithms (random forest, regularized random forest and guided regularized random forest) compared with FST ranking for selection of single nucleotide polymorphisms (SNP) for fine-scale population assignment. We applied these methods to an unpublished SNP data set for Atlantic salmon (Salmo salar) and a published SNP data set for Alaskan Chinook salmon (Oncorhynchus tshawytscha). In each species, we identified the minimum panel size required to obtain a self-assignment accuracy of at least 90%, using each method to create panels of 50-700 markers. Panels of SNPs identified using random forest-based methods performed up to 7.8 and 11.2 percentage points better than FST-selected panels of similar size for the Atlantic salmon and Chinook salmon data, respectively. Self-assignment accuracy ≥90% was obtained with panels of 670 and 384 SNPs for each data set, respectively, a level of accuracy never reached for these species using FST-selected panels. Our results demonstrate a role for machine-learning approaches in marker selection across large genomic data sets to improve assignment for management and conservation of exploited populations.
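The marker-ranking step can be illustrated as below, assuming scikit-learn: features are ranked by random forest importance and the self-assignment accuracy of the top-k panel is estimated by cross-validation. Genotypes and population labels are simulated placeholders, and for an unbiased estimate the ranking should be repeated inside each cross-validation fold rather than computed once on all data as done here for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
n_ind, n_snp = 300, 2000
X = rng.integers(0, 3, size=(n_ind, n_snp)).astype(float)   # 0/1/2 genotype codes
pops = rng.integers(0, 4, size=n_ind)                       # population labels

rf = RandomForestClassifier(n_estimators=300, random_state=6).fit(X, pops)
ranked = np.argsort(rf.feature_importances_)[::-1]          # most informative first

for k in (50, 200, 500):
    panel = ranked[:k]
    acc = cross_val_score(RandomForestClassifier(n_estimators=300, random_state=6),
                          X[:, panel], pops, cv=5).mean()
    print("%d-marker panel: self-assignment accuracy %.2f" % (k, acc))
```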
Do little interactions get lost in dark random forests?
Wright, Marvin N; Ziegler, Andreas; König, Inke R
2016-03-31
Random forests have often been claimed to uncover interaction effects. However, if and how interaction effects can be differentiated from marginal effects remains unclear. In extensive simulation studies, we investigate whether random forest variable importance measures capture or detect gene-gene interactions. By capturing interactions, we mean the ability to identify a variable that acts through an interaction with another one, while detection is the ability to identify an interaction effect as such. Of the single importance measures, the Gini importance captured interaction effects in most of the simulated scenarios; however, these effects were masked by marginal effects in other variables. With the permutation importance, the proportion of captured interactions was lower in all cases. Pairwise importance measures performed about equally, with a slight advantage for the joint variable importance method. However, the overall fraction of detected interactions was low. In almost all scenarios the detection fraction in a model with only marginal effects was larger than in a model with an interaction effect only. Random forests are generally capable of capturing gene-gene interactions, but current variable importance measures are unable to detect them as interactions. In most cases, interactions are masked by marginal effects and cannot be differentiated from marginal effects. Consequently, caution is warranted when claiming that random forests uncover interactions.
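The two single-variable measures discussed above can be computed side by side with scikit-learn. The sketch below simulates a purely epistatic outcome (the XOR of two binary variables, with no marginal effects) and prints both the impurity-based (Gini) importances and the permutation importances; whether either measure flags the interacting variables depends on the scenario, as the study emphasizes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(7)
X = rng.integers(0, 2, size=(2000, 10)).astype(float)
y = np.logical_xor(X[:, 0], X[:, 1]).astype(int)   # pure interaction, no marginal effect

rf = RandomForestClassifier(n_estimators=500, random_state=7).fit(X, y)
gini = rf.feature_importances_                     # impurity (Gini) importance
perm = permutation_importance(rf, X, y, n_repeats=10, random_state=7).importances_mean
print("Gini importances:       ", np.round(gini, 3))
print("Permutation importances:", np.round(perm, 3))
```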
Nasejje, Justine B; Mwambi, Henry
2017-09-07
Uganda, like many other sub-Saharan African countries, has a high under-five child mortality rate. To inform policy on intervention strategies, sound statistical methods are required to critically identify factors strongly associated with under-five child mortality rates. The Cox proportional hazards model has been a common choice for analysing such data to understand factors strongly associated with high child mortality rates, taking age as the time-to-event variable. However, because of its restrictive proportional hazards (PH) assumption, covariates of interest that do not satisfy the assumption are often excluded from the analysis to avoid mis-specifying the model, since using covariates that clearly violate the assumption would yield invalid results. Survival trees and random survival forests are increasingly popular for analysing survival data, particularly large survey data, and are attractive alternatives to models with the restrictive PH assumption. In this article, we adopt random survival forests, which have not previously been used to study factors affecting under-five child mortality rates in Uganda, using Demographic and Health Survey data. The first part of the analysis is based on the classical Cox PH model and the second part on random survival forests in the presence of covariates that do not necessarily satisfy the PH assumption. Random survival forests and the Cox proportional hazards model agree that the sex of the household head, the sex of the child and the number of births in the past year are strongly associated with under-five child mortality in Uganda; all three covariates satisfy the PH assumption. Random survival forests further demonstrated that covariates originally excluded from the earlier analysis because they violated the PH assumption were important in explaining under-five child mortality rates. These covariates include the number of children under the age of five in a household, the number of births in the past 5 years, wealth index, total number of children ever born and the child's birth order. The results further indicated that the predictive performance of random survival forests built using covariates including those that violate the PH assumption was higher than that of random survival forests built using only covariates that satisfy the PH assumption. Random survival forests are appealing methods for analysing public health data to understand factors strongly associated with under-five child mortality rates, especially in the presence of covariates that violate the proportional hazards assumption.
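A minimal sketch of fitting a random survival forest alongside a Cox model, assuming the scikit-survival package, is shown below. Covariates, event times and censoring indicators are simulated rather than taken from the DHS data, and the concordance index is used only as a quick comparison of predictive performance.

```python
import numpy as np
from sksurv.ensemble import RandomSurvivalForest
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.util import Surv

rng = np.random.default_rng(8)
n = 500
X = rng.normal(size=(n, 6))                                # household/child covariates
time = rng.exponential(scale=np.exp(-0.5 * X[:, 0]), size=n)
event = rng.random(n) < 0.3                                # observed death vs censored
y = Surv.from_arrays(event=event, time=time)

cox = CoxPHSurvivalAnalysis().fit(X, y)
rsf = RandomSurvivalForest(n_estimators=200, min_samples_leaf=15, random_state=8).fit(X, y)
print("Cox c-index:", cox.score(X, y))
print("RSF c-index:", rsf.score(X, y))
```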
ERIC Educational Resources Information Center
Handel, Richard W.; Ben-Porath, Yossef S.; Tellegen, Auke; Archer, Robert P.
2010-01-01
In the present study, the authors evaluated the effects of increasing degrees of simulated non-content-based (random or fixed) responding on scores on the newly developed Variable Response Inconsistency-Revised (VRIN-r) and True Response Inconsistency-Revised (TRIN-r) scales of the Minnesota Multiphasic Personality Inventory-2 Restructured Form…
Discrimination of crop types with TerraSAR-X-derived information
NASA Astrophysics Data System (ADS)
Sonobe, Rei; Tani, Hiroshi; Wang, Xiufeng; Kobayashi, Nobuyuki; Shimamura, Hideki
Although classification maps are required for management and for the estimation of agricultural disaster compensation, techniques for producing them have yet to be established. This paper describes the comparison of three different classification algorithms for mapping crops in Hokkaido, Japan, using TerraSAR-X (including TanDEM-X) dual-polarimetric data. In the study area, beans, beets, grasslands, maize, potatoes and winter wheat were cultivated. In this study, classification using TerraSAR-X-derived information was performed. Coherence values, polarimetric parameters and gamma nought values were also obtained and evaluated regarding their usefulness in crop classification. Accurate classification may be possible with currently existing supervised learning models. A comparison between the classification and regression tree (CART), support vector machine (SVM) and random forests (RF) algorithms was performed. Even though J-M distances were lower than 1.0 on all TerraSAR-X acquisition days, good results were achieved (e.g., separability between winter wheat and grass) due to the characteristics of the machine learning algorithm. It was found that SVM performed best, achieving an overall accuracy of 95.0% based on the polarimetric parameters and gamma nought values for HH and VV polarizations. The misclassified fields were less than 100 a (ares) in area, and 79.5-96.3% were less than 200 a, with the exception of grassland. When a feature such as a road or windbreak forest is present in the TerraSAR-X data, the ratio of its extent to that of the field is relatively higher for smaller fields, which leads to misclassifications.
Comparing spatial regression to random forests for large environmental data sets
Environmental data may be “large” due to number of records, number of covariates, or both. Random forests has a reputation for good predictive performance when using many covariates, whereas spatial regression, when using reduced rank methods, has a reputatio...
No increased sensitivity in brain activity of adolescents exposed to mobile phone-like emissions.
Loughran, S P; Benz, D C; Schmid, M R; Murbach, M; Kuster, N; Achermann, P
2013-07-01
The aim was to examine the potential sensitivity of adolescents to radiofrequency electromagnetic field (RF EMF) exposures, such as those emitted by mobile phones. In a double-blind, randomized, crossover design, 22 adolescents aged 11-13 years (12 males) underwent three experimental sessions in which they were exposed to mobile phone-like RF EMF signals at two different intensities, and a sham session. During exposure, cognitive tasks were performed, and waking EEG was recorded at three time points subsequent to exposure (0, 30 and 60 min). No clear significant effects of RF EMF exposure were found on the waking EEG or cognitive performance. Overall, the current study was unable to demonstrate exposure-related effects previously observed on the waking EEG in adults, and it also provides further support for a lack of influence of mobile phone-like exposure on cognitive performance. Adolescents do not appear to be more sensitive than adults to mobile phone RF EMF emissions. Copyright © 2013 International Federation of Clinical Neurophysiology. Published by Elsevier Ireland Ltd. All rights reserved.
Comparison of RF spectrum prediction methods for dynamic spectrum access
NASA Astrophysics Data System (ADS)
Kovarskiy, Jacob A.; Martone, Anthony F.; Gallagher, Kyle A.; Sherbondy, Kelly D.; Narayanan, Ram M.
2017-05-01
Dynamic spectrum access (DSA) refers to the adaptive utilization of today's busy electromagnetic spectrum. Cognitive radio/radar technologies require DSA to intelligently transmit and receive information in changing environments. Predicting radio frequency (RF) activity reduces sensing time and energy consumption for identifying usable spectrum. Typical spectrum prediction methods involve modeling spectral statistics with Hidden Markov Models (HMM) or various neural network structures. HMMs describe the time-varying state probabilities of Markov processes as a dynamic Bayesian network. Neural Networks model biological brain neuron connections to perform a wide range of complex and often non-linear computations. This work compares HMM, Multilayer Perceptron (MLP), and Recurrent Neural Network (RNN) algorithms and their ability to perform RF channel state prediction. Monte Carlo simulations on both measured and simulated spectrum data evaluate the performance of these algorithms. Generalizing spectrum occupancy as an alternating renewal process allows Poisson random variables to generate simulated data while energy detection determines the occupancy state of measured RF spectrum data for testing. The results suggest that neural networks achieve better prediction accuracy and prove more adaptable to changing spectral statistics than HMMs given sufficient training data.
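The simulated data described above can be generated with a small alternating-renewal sketch: busy and idle dwell times drawn from exponential distributions (a stand-in for the Poisson-based generation in the paper), with a naive persistence predictor as a baseline against which HMM or neural network predictors would be compared. Parameters and the baseline are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(9)

def alternating_renewal(n_slots, mean_busy=5.0, mean_idle=8.0):
    # occupancy sequence (1 = busy, 0 = idle) with exponential holding times per state
    states, state = [], 0
    while len(states) < n_slots:
        mean = mean_busy if state == 1 else mean_idle
        dwell = max(1, int(rng.exponential(mean)))
        states.extend([state] * dwell)
        state = 1 - state
    return np.array(states[:n_slots])

occ = alternating_renewal(10_000)
persistence_pred = occ[:-1]            # predict "same state as the last slot"
print("persistence baseline accuracy:", (persistence_pred == occ[1:]).mean())
```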
Sankari, E Siva; Manimegalai, D
2017-12-21
Predicting membrane protein types is an important and challenging research area in bioinformatics and proteomics. Traditional biophysical methods are used to classify membrane protein types, but given the large number of uncharacterized protein sequences in databases, these methods are very time consuming, expensive and susceptible to errors. Hence, it is highly desirable to develop a robust, reliable, and efficient method to predict membrane protein types. Imbalanced and large datasets are often handled well by decision tree classifiers. Because the datasets used here are imbalanced, the performance of various decision tree classifiers such as Decision Tree (DT), Classification And Regression Tree (CART), C4.5, Random tree and REP (Reduced Error Pruning) tree, and of ensemble methods such as Adaboost, RUS (Random Under Sampling) boost, Rotation forest and Random forest, is analysed. Among the various decision tree classifiers, Random forest performs best, achieving a good accuracy of 96.35% in less time. Another finding is that the RUSBoost decision tree classifier is able to classify one or two samples in classes with very few samples, whereas the other classifiers such as DT, Adaboost, Rotation forest and Random forest are not sensitive to classes with fewer samples. The performance of the decision tree classifiers is also compared with SVM (Support Vector Machine) and Naive Bayes classifiers. Copyright © 2017 Elsevier Ltd. All rights reserved.
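The class-imbalance behaviour described above can be probed with a small experiment, assuming scikit-learn and the imbalanced-learn package: per-class recall for a plain random forest versus RUSBoost on data with one very small class. The data are synthetic, not membrane protein features.

```python
import numpy as np
from imblearn.ensemble import RUSBoostClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           n_classes=3, weights=[0.70, 0.28, 0.02], random_state=12)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=12)

for name, clf in [("random forest", RandomForestClassifier(n_estimators=300, random_state=12)),
                  ("RUSBoost", RUSBoostClassifier(random_state=12))]:
    clf.fit(X_tr, y_tr)
    print(name, "per-class recall:", recall_score(y_te, clf.predict(X_te), average=None))
```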
Xiao, Li-Hong; Chen, Pei-Ran; Gou, Zhong-Ping; Li, Yong-Zhong; Li, Mei; Xiang, Liang-Cheng; Feng, Ping
2017-01-01
The aim of this study is to evaluate the ability of the random forest algorithm that combines data on transrectal ultrasound findings, age, and serum levels of prostate-specific antigen to predict prostate carcinoma. Clinico-demographic data were analyzed for 941 patients with prostate diseases treated at our hospital, including age, serum prostate-specific antigen levels, transrectal ultrasound findings, and pathology diagnosis based on ultrasound-guided needle biopsy of the prostate. These data were compared between patients with and without prostate cancer using the Chi-square test, and then entered into the random forest model to predict diagnosis. Patients with and without prostate cancer differed significantly in age and serum prostate-specific antigen levels (P < 0.001), as well as in all transrectal ultrasound characteristics (P < 0.05) except uneven echo (P = 0.609). The random forest model based on age, prostate-specific antigen and ultrasound predicted prostate cancer with an accuracy of 83.10%, sensitivity of 65.64%, and specificity of 93.83%. Positive predictive value was 86.72%, and negative predictive value was 81.64%. By integrating age, prostate-specific antigen levels and transrectal ultrasound findings, the random forest algorithm shows better diagnostic performance for prostate cancer than either diagnostic indicator on its own. This algorithm may help improve diagnosis of the disease by identifying patients at high risk for biopsy.
NASA Astrophysics Data System (ADS)
Bayram, B.; Erdem, F.; Akpinar, B.; Ince, A. K.; Bozkurt, S.; Catal Reis, H.; Seker, D. Z.
2017-11-01
Coastal monitoring plays a vital role in environmental planning and hazard management. Since shorelines are fundamental data for environmental management, disaster management, coastal erosion studies, modelling of sediment transport and coastal morphodynamics, various techniques have been developed to extract them. Random Forest, a machine learning method based on decision trees, is one of these techniques and is used in this study for shoreline extraction. Decision trees analyse classes of training data and create rules for classification. The Terkos region was chosen as the study area within the scope of the TUBITAK project (No. 115Y718) "Integration of Unmanned Aerial Vehicles for Sustainable Coastal Zone Monitoring Model - Three-Dimensional Automatic Coastline Extraction and Analysis: Istanbul-Terkos Example". The Random Forest algorithm was implemented to extract the Black Sea shoreline near Lake Terkos from LANDSAT-8 and GOKTURK-2 satellite imagery acquired in 2015. The MATLAB environment was used for classification. To obtain land and water-body classes, the Random Forest method was applied to the NIR bands of the LANDSAT-8 (band 5) and GOKTURK-2 (band 4) imagery. Each image was digitized manually to obtain reference shorelines for accuracy assessment. According to the accuracy assessment, the Random Forest method is effective for shoreline extraction from both medium- and high-resolution images.
Marino, S R; Lin, S; Maiers, M; Haagenson, M; Spellman, S; Klein, J P; Binkowski, T A; Lee, S J; van Besien, K
2012-02-01
The identification of important amino acid substitutions associated with low survival in hematopoietic cell transplantation (HCT) is hampered by the large number of observed substitutions compared with the small number of patients available for analysis. Random forest analysis is designed to address these limitations. We studied 2107 HCT recipients with good or intermediate risk hematological malignancies to identify HLA class I amino acid substitutions associated with reduced survival at day 100 post transplant. Random forest analysis and traditional univariate and multivariate analyses were used. Random forest analysis identified amino acid substitutions in 33 positions that were associated with reduced 100 day survival, including HLA-A 9, 43, 62, 63, 76, 77, 95, 97, 114, 116, 152, 156, 166 and 167; HLA-B 97, 109, 116 and 156; and HLA-C 6, 9, 11, 14, 21, 66, 77, 80, 95, 97, 99, 116, 156, 163 and 173. In all, 13 had been previously reported by other investigators using classical biostatistical approaches. Using the same data set, traditional multivariate logistic regression identified only five amino acid substitutions associated with lower day 100 survival. Random forest analysis is a novel statistical methodology for analysis of HLA mismatching and outcome studies, capable of identifying important amino acid substitutions missed by other methods.
Combined data mining/NIR spectroscopy for purity assessment of lime juice
NASA Astrophysics Data System (ADS)
Shafiee, Sahameh; Minaei, Saeid
2018-06-01
This paper reports a data mining study of the NIR spectra of lime juice samples to determine their purity (natural or synthetic). NIR spectra of 72 pure and synthetic lime juice samples were recorded in reflectance mode. Sample outliers were removed using PCA. Different data mining techniques for feature selection (a Genetic Algorithm (GA)) and classification (including the radial basis function (RBF) network, Support Vector Machine (SVM), and Random Forest (RF)) were employed. Based on the results, SVM proved to be the most accurate classifier, achieving the highest accuracy (97%) using the raw spectrum information. The accuracy dropped to 93% when the feature vector selected by the GA search was used as classifier input. It can be concluded that some relevant features which produce good performance with the SVM classifier are removed by feature selection. Also, PCA-reduced spectra did not show acceptable performance (total accuracy of 66% with the RBF network), which indicates that dimensionality reduction methods such as PCA do not always lead to more accurate results. These findings demonstrate the potential of combining data mining with near-infrared spectroscopy for monitoring lime juice quality in terms of its natural or synthetic nature.
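The raw-spectrum versus reduced-spectrum comparison can be sketched in a few lines, assuming scikit-learn. Spectra and purity labels below are synthetic stand-ins, and an SVM with an RBF kernel is scored by cross-validation on the full standardized spectra and on a PCA-reduced version.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(11)
X = rng.normal(size=(72, 300))        # 72 samples x 300 NIR wavelengths (stand-in)
y = rng.integers(0, 2, size=72)       # pure vs synthetic juice (stand-in)

full = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
reduced = make_pipeline(StandardScaler(), PCA(n_components=10), SVC(kernel="rbf"))
print("full spectrum CV accuracy:", cross_val_score(full, X, y, cv=5).mean())
print("PCA-reduced CV accuracy:  ", cross_val_score(reduced, X, y, cv=5).mean())
```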
Jahandideh, Samad; Srinivasasainagendra, Vinodh; Zhi, Degui
2012-11-07
RNA-protein interaction plays an important role in various cellular processes, such as protein synthesis, gene regulation, post-transcriptional gene regulation, alternative splicing, and infections by RNA viruses. In this study, using the Gene Ontology Annotation (GOA) and Structural Classification of Proteins (SCOP) databases, an automatic procedure was designed to capture structurally solved RNA-binding protein domains in different subclasses. Subsequently, we applied a tuned multi-class SVM (TMCSVM), Random Forest (RF), and multi-class ℓ1/ℓq-regularized logistic regression (MCRLR) to analyse and classify RNA-binding protein domains based on a comprehensive set of sequence and structural features. We compared the prediction accuracy of these three state-of-the-art methods. From our results, TMCSVM outperforms the other methods, suggesting its potential as a useful tool for facilitating the multi-class prediction of RNA-binding protein domains. On the other hand, MCRLR, by elucidating the contribution of individual features to the predictive accuracy for RNA-binding protein domain subclasses, provides some biological insight into the roles of sequence and structure in protein-RNA interactions.
Deep-HiTS: Rotation Invariant Convolutional Neural Network for Transient Detection
NASA Astrophysics Data System (ADS)
Cabrera-Vives, Guillermo; Reyes, Ignacio; Förster, Francisco; Estévez, Pablo A.; Maureira, Juan-Carlos
2017-02-01
We introduce Deep-HiTS, a rotation-invariant convolutional neural network (CNN) model for classifying images of transient candidates into artifacts or real sources for the High cadence Transient Survey (HiTS). CNNs have the advantage of learning the features automatically from the data while achieving high performance. We compare our CNN model against a feature engineering approach using random forests (RFs). We show that our CNN significantly outperforms the RF model, reducing the error by almost half. Furthermore, for a fixed number of approximately 2000 allowed false transient candidates per night, we are able to reduce the misclassified real transients by approximately one-fifth. To the best of our knowledge, this is the first time CNNs have been used to detect astronomical transient events. Our approach will be very useful when processing images from next generation instruments such as the Large Synoptic Survey Telescope. We have made all our code and data available to the community for the sake of allowing further developments and comparisons at https://github.com/guille-c/Deep-HiTS. Deep-HiTS is licensed under the terms of the GNU General Public License v3.0.
Barik, Amita; Das, Santasabuj
2018-01-02
Small RNAs (sRNAs) in bacteria have emerged as key players in transcriptional and post-transcriptional regulation of gene expression. Here, we present a statistical analysis of different sequence- and structure-related features of bacterial sRNAs to identify the descriptors that could discriminate sRNAs from other bacterial RNAs. We investigated a comprehensive and heterogeneous collection of 816 sRNAs, identified by northern blotting across 33 bacterial species and compared their various features with other classes of bacterial RNAs, such as tRNAs, rRNAs and mRNAs. We observed that sRNAs differed significantly from the rest with respect to G+C composition, normalized minimum free energy of folding, motif frequency and several RNA-folding parameters like base-pairing propensity, Shannon entropy and base-pair distance. Based on the selected features, we developed a predictive model using Random Forests (RF) method to classify the above four classes of RNAs. Our model displayed an overall predictive accuracy of 89.5%. These findings would help to differentiate bacterial sRNAs from other RNAs and further promote prediction of novel sRNAs in different bacterial species.
Liu, Zhicheng; Nahon, Pierre; Li, Zaifang; Yin, Peiyuan; Li, Yanli; Amathieu, Roland; Ganne-Carrié, Nathalie; Ziol, Marianne; Sellier, Nicolas; Seror, Olivier; Le Moyec, Laurence; Savarin, Philippe; Xu, Guowang
2018-01-01
Hepatitis C virus (HCV) infection is associated with a high risk of developing hepatocellular carcinoma (HCC) and HCC recurrence remains the primary threat to outcomes after curative therapy. In this study, we compared recurrent and non-recurrent HCC patients treated with radiofrequency ablation (RFA) in order to identify characteristic metabolic profile variations associated with HCC recurrence. Gas chromatography-mass spectrometry (GC-MS)-based metabolomic analyses were conducted on serum samples obtained before and after RFA therapy. Significant variations were observed in metabolites in the glycerolipid, tricarboxylic acid (TCA) cycle, fatty acid, and amino acid pathways between recurrent and non-recurrent patients. Observed differences in metabolites associated with recurrence did not coincide before and after treatment except for fatty acids. Based on the comparison of serum metabolomes between recurrent and non-recurrent patients, key discriminatory metabolites were defined by a random forest (RF) test. Two combinations of these metabolites, before and after RFA treatment, showed outstanding performance in predicting HCV-related HCC recurrence, and they were further confirmed by an external validation set. Our study showed that the determined combinations of metabolites may be potential biomarkers for the prediction of HCC recurrence before and after RFA treatment. PMID:29464069
Chakraborty, Somsubhra; Weindorf, David C; Li, Bin; Ali, Md Nasim; Majumdar, K; Ray, D P
2014-07-01
This pilot study compared penalized spline regression (PSR) and random forest (RF) regression using visible and near-infrared diffuse reflectance spectroscopy (VisNIR DRS) derived spectra of 164 petroleum contaminated soils after two different spectral pretreatments [first derivative (FD) and standard normal variate (SNV) followed by detrending] for rapid quantification of soil petroleum contamination. Additionally, a new analytical approach was proposed for the recovery of the pure spectral and concentration profiles of n-hexane present in the unresolved mixture of petroleum contaminated soils using multivariate curve resolution alternating least squares (MCR-ALS). The PSR model using FD spectra (r² = 0.87, RMSE = 0.580 log10 mg kg⁻¹, and residual prediction deviation = 2.78) outperformed all other models tested. Quantitative results obtained by MCR-ALS for n-hexane in presence of interferences (r² = 0.65 and RMSE = 0.261 log10 mg kg⁻¹) were comparable to those obtained using the FD (PSR) model. Furthermore, MCR-ALS was able to recover pure spectra of n-hexane. Copyright © 2014 Elsevier Ltd. All rights reserved.
Hodyna, Diana; Kovalishyn, Vasyl; Rogalsky, Sergiy; Blagodatnyi, Volodymyr; Petko, Kirill; Metelytsia, Larisa
2016-09-01
Predictive QSAR models for inhibitors of B. subtilis and Ps. aeruginosa among imidazolium-based ionic liquids were developed using literature data. The regression QSAR models were created using Artificial Neural Network and k-nearest neighbor procedures. The classification QSAR models were constructed using the WEKA-RF (random forest) method. The predictive ability of the models was tested by fivefold cross-validation, giving q² = 0.77-0.92 for the regression models and 83-88% accuracy for the classification models. Twenty synthesized 1,3-dialkylimidazolium ionic liquids with predicted levels of antimicrobial activity were evaluated. Among the asymmetric 1,3-dialkylimidazolium ionic liquids, only compounds containing at least one radical with an alkyl chain length of 12 carbon atoms showed high antibacterial activity. However, the activity of the symmetric 1,3-dialkylimidazolium salts showed the opposite relationship with the length of the aliphatic radical, being maximal for compounds based on the 1,3-dioctylimidazolium cation. The experimental results suggest that classification QSAR models are more accurate for predicting the activity of new imidazolium-based ILs as potential antibacterials. © 2016 John Wiley & Sons A/S.
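The fivefold cross-validated q² reported above can be computed as one minus the ratio of the prediction error sum of squares to the total sum of squares. Below is a minimal sketch assuming scikit-learn, with random placeholder descriptors and activities and a random forest regressor standing in for the ANN/kNN models used in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(10)
X = rng.normal(size=(120, 30))                                   # molecular descriptors
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=120)    # activity values

pred = cross_val_predict(RandomForestRegressor(n_estimators=300, random_state=10),
                         X, y, cv=5)
q2 = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
print("cross-validated q2 = %.2f" % q2)
```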